Causal-JEPA: Learning World Models through Object-Level Latent Interventions
Key Summary
- This paper introduces Causal-JEPA (C-JEPA), a world model that learns by hiding entire objects in its memory and forcing itself to predict them from other objects.
- By masking objects (not just image patches), the model must use interactions, which acts like a safe, built-in experiment in its imagination (a latent intervention).
- C-JEPA improves counterfactual visual question answering by about 20% over the same architecture without object masking, showing stronger "what if?" reasoning.
- In robot control (Push-T), it matches strong patch-based models while using only about 1.02% of the input tokens, enabling more than 8× faster planning.
- The method is reconstruction-free: it predicts in a compact latent space instead of trying to rebuild pixels, which makes it efficient and focused on what matters.
- A tiny "identity anchor" is kept for each masked object so the model knows which object it is guessing about, while all its recent history is hidden.
- A simple mathematical view shows that this masking makes the model rely on the smallest set of other variables that truly influence the target, a helpful causal bias.
- Treating actions and body-sense (proprioception) as separate, explicit inputs works better than gluing them into vision tokens.
- Too much masking can remove useful clues, so there is a sweet spot; object-level masking is also more stable than random token or tube masking.
- Overall, C-JEPA turns object masking into a training signal that makes interaction reasoning necessary, leading to better reasoning and faster control.
Why This Research Matters
C-JEPA helps AI understand interactions, not just appearances, so it can answer deeper "what-if" questions and plan smarter actions. It uses far fewer tokens than patch-based models, which means faster and cheaper decision-making in robots and assistants. By training with safe, internal interventions (masking objects), it learns dependencies that are more stable and reliable. This reduces overfitting to spurious pixel patterns and improves robustness when conditions change. In practice, that means better household robots, safer industrial automation, clearer video reasoning tools, and more efficient research on world models. It also opens a path to more transparent AI by highlighting which other objects or actions most influence a target.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine watching a game of pool. You don't track every pixel of the table; you track the balls, where they go, and how they bounce off each other. That's how our brains keep up with a busy world. The Concept (Object-Centric World Models): World models try to help AI understand and predict how the world changes over time. Object-centric models summarize each scene as a small set of objects (like balls on a table) with properties in a compact code called a latent state. How it works: 1) See a video frame; 2) Group pixels into a few "object slots"; 3) Track how each object's state changes over time; 4) Predict the next states. Why it matters: Tracking objects instead of pixels is simpler, faster, and more meaningful for planning and reasoning. Anchor: In a robot kitchen, instead of looking at every pixel, the robot tracks the cup, the kettle, and the spoon and plans how to move them.
The World Before: Early world models were great at compressing images and predicting short-term motion, but they often relied on pixel patterns instead of true object interactions. Even with object-centric encoders (which turn frames into a small set of object slots), many models still leaned on shortcuts: they predicted each object from its own past (self-dynamics) or used accidental correlations instead of real interactions. That meant they could miss the moment two objects bumped into each other or when an action from the robot hand caused another object to move.
The Problem: How do we make the model's training objective itself require learning interactions? If the loss can be minimized by just interpolating an object's own recent history, the model won't bother to learn "who pushes whom." We need the model to be forced to ask: "If this object is hidden, can I still figure it out from the others?"
Failed Attempts: Researchers tried to fix this by: 1) Splitting self-motion and interactions inside the network; 2) Forcing attention to be sparse; 3) Using hand-made graphs connecting objects; 4) Using patch-level masking (hiding image patches). These helped, but they either depended on architecture tricks, task-specific tuning, or they still let the model solve problems with local pixel clues instead of true object interactions.
The Gap: Missing was a simple, reconstruction-free training signal that makes interaction reasoning functionally necessary. We want an objective that says, "You cannot solve this without using other objects," while keeping the architecture flexible and efficient.
Real Stakes: This matters in daily life because: 1) Robots: A home robot has to plan how pushing a cup will move the spoon; understanding interactions saves spills! 2) Assistants: Video understanding assistants must answer "what-if" questions like "If the red ball didn't hit the blue ball, what happens?" 3) Efficiency: Using a tiny number of tokens (object slots) instead of hundreds of image patches makes planning far faster, which is important for real-time control.
Hook: You know how sometimes you cover part of a picture and ask a friend to guess what's under the cover from the rest of the picture? That makes them use context, not just memory. The Concept (Latent Interventions via Object Masking): The key idea here is to hide whole objects in the model's memory and ask it to guess them from the others. How it works: 1) Turn frames into object slots with a frozen encoder; 2) Choose some objects and hide their recent histories, keeping only a tiny "identity anchor" so we know which object is which; 3) Force the predictor to reconstruct the hidden objects and also predict the future; 4) Repeat across time and scenes. Why it matters: By hiding an object's past, the model cannot cheat with self-dynamics; it must use interactions with visible objects and actions. Anchor: If the red ball's path is hidden, the model must look at the blue ball's motion and the agent's push to infer where the red ball likely is.
Hook: Think of "what if" questions: "What if that ball hadn't been there?" The Concept (Counterfactual Reasoning): Counterfactuals imagine alternate worlds by changing certain pieces and asking how outcomes would differ. How it works: 1) Hold most of the scene the same; 2) Change one object or event; 3) Predict the new results. Why it matters: It's how we understand cause and effect, beyond just seeing patterns. Anchor: "If the green block weren't in the way, would the blue ball hit the red ball?"
02 Core Idea
Hook: You know how in group projects, if one student's notes are missing, the team has to piece together what they did from everyone else's notes? That forces the team to pay attention to interactions, not just one person's memory. The Concept (Causal-JEPA in One Sentence): C-JEPA hides entire objects' histories in latent space so the model can only solve the task by using other objects and actions; this turns training into many tiny, safe experiments (latent interventions) that inject a causal bias. How it works: 1) Convert each frame into object slots; 2) Randomly pick some objects and replace their recent history with a masked token, leaving just a small identity anchor; 3) Use a JEPA-style predictor to jointly infer the masked histories and predict the future; 4) Train with a loss that combines "fill the masked history" + "predict the future." Why it matters: Without this, the model can ignore interactions and rely on self-dynamics; with masking, interactions become necessary to reduce loss. Anchor: If the orange ball's history is hidden, the model must read the motion of the cue ball and the table action to infer where the orange ball went.
The "Aha!" Moment: The trick is not just hiding pixels but hiding whole objects across time; this blocks the easiest shortcut (self-dynamics), forcing the model to use interaction clues.
Three Analogies: 1) Mystery novel: If pages about one suspect are missing, you solve the case using others' alibis; interactions matter. 2) Chess replay: If a few moves of one piece are hidden, you reconstruct them from the positions of the other pieces. 3) Classroom science: You remove one ingredient and observe changes to infer what it used to do; that is an intervention.
Before vs After: Before, object-centric models could still skate by on each object's own momentum. After, C-JEPA makes the model rely on who bumped whom and when, because some objects' histories are hidden. Before, patch masking optimized local textures; after, object masking optimizes interactions among entities.
Hook: You know how naming your folders helps you find your files later? The Concept (Identity Anchor): The identity anchor is a tiny keep-alive snippet for each masked object so the model knows which object slot it's guessing about. How it works: 1) Keep a minimal identity code from an earlier frame; 2) Replace the rest of the history with a learned mask embedding + time code; 3) The predictor sees "who" but not "what happened," and must infer that from others. Why it matters: Without the anchor, slots are unordered, so the model wouldn't know which object it should reconstruct. Anchor: Think of sticky labels on boxes: "This is Box A," even if the contents (recent history) are hidden.
Hook: Imagine each object has a small circle around it that marks who can influence it most. The Concept (Influence Neighborhood): It's the smallest set of other variables (objects/actions) you must consult to best predict the target object under masking. How it works: 1) Hide a target object's history; 2) Learn which other objects/actions reduce uncertainty about it; 3) Concentrate attention there; 4) Repeat across scenes and times. Why it matters: It gives a practical, stable notion of "who influences whom" without needing a perfect causal graph. Anchor: To guess a rolling ball's state, you mostly need the nearest balls and the last push, not the far-away lamp.
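To make the influence-neighborhood idea concrete, here is a toy, dependency-free probe: score each candidate variable by how much adding it to the context reduces prediction error on a hidden target, and greedily keep only the helpful ones. Everything here (the toy dynamics, the function names, the use of the true coefficients as a stand-in for a learned predictor) is illustrative, not the paper's procedure.

```python
import random

def make_scene(n=2000, seed=0):
    """Toy data: the hidden target is driven by variables 0 and 1 only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(5)]
        target = x[0] + 0.5 * x[1]  # variables 2-4 are irrelevant
        rows.append((x, target))
    return rows

def error_with_context(rows, context):
    """Mean squared error when predicting the target from `context` only.
    For this toy we reuse the true coefficients, so the error measures
    how much signal the chosen context can possibly carry."""
    coef = {0: 1.0, 1: 0.5}
    err = 0.0
    for x, target in rows:
        pred = sum(coef.get(i, 0.0) * x[i] for i in context)
        err += (pred - target) ** 2
    return err / len(rows)

def influence_neighborhood(rows, n_vars=5, tol=1e-6):
    """Greedily add whichever variable reduces error; stop when none helps.
    The result approximates the minimal set needed to predict the target."""
    context, best = set(), error_with_context(rows, set())
    improved = True
    while improved:
        improved = False
        for i in range(n_vars):
            if i in context:
                continue
            e = error_with_context(rows, context | {i})
            if e < best - tol:
                best, context, improved = e, context | {i}, True
    return context
```

Running `influence_neighborhood(make_scene())` recovers `{0, 1}`: the irrelevant variables never reduce error, so they never enter the neighborhood, which is the intuition behind "the smallest set of other variables you must consult."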
Why It Works (Intuition, no equations): Masking the target object's history turns self-dynamics into a dead end. The only way to reduce error is to use others, which is exactly what interactions are. Training repeats these mini-interventions many times, nudging attention toward the variables that truly help prediction (their influence neighborhood). JEPA's joint embedding prediction makes this efficient: no pixel reconstruction, just aligning the predicted latents with the true latents.
Building Blocks: 1) Object-Centric Encoder (slots). 2) Object-Level Masking across the history window. 3) Identity Anchor to keep track of who is who. 4) JEPA-style Predictor with bidirectional attention over masked history and future. 5) A loss that combines masked-history completion + forward prediction. 6) Optional auxiliary variables (actions, proprioception) added as separate nodes, not glued into vision tokens, so their effects are clear.
03 Methodology
At a high level: Video frames → Object slots (encoder) → Object-level masking across history (with identity anchors) + future masked for rollout → JEPA-style predictor reads visible objects + auxiliaries → Predicts masked histories and future slots → Loss = history reconstruction + future prediction.
Step 1: Object-Centric Encoding (Slots)
- What happens: A frozen object-centric encoder (e.g., VideoSAUR on top of DINOv2) turns each frame into N object slots (small vectors), one per entity. It preserves the idea of "things" rather than pixels.
- Why this step exists: Predicting with 196+ patch tokens is heavy and can over-focus on texture. With ~4–7 slots, the model is lighter and more semantic.
- Example: A CLEVRER frame with 5 shapes becomes 7 slots (6 objects + 1 background), each 128 numbers long.
Hook: You know how a class roster lists students without saying what they did on each day? The Concept (Slot Attention): Slot Attention groups image features into a few object slots through competitive attention. How it works: 1) Start with a small set of learnable slots; 2) Let each slot attend to parts of the image; 3) Update slots; 4) Repeat so each slot specializes. Why it matters: It gives the world model a compact list of entities to think about. Anchor: Instead of storing every pixel of a soccer photo, keep a slot for ball, goalie, and defender.
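The competition between slots can be sketched in a few lines. This is a minimal NumPy sketch of the Slot Attention update rule only; the real module also has learned query/key/value projections, a GRU update, and layer norms, all omitted here.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, num_slots=3, iters=3, seed=0):
    """Minimal Slot Attention sketch: slots compete for image features.

    features: (N, D) array of patch features.
    Returns (num_slots, D) slot vectors, each summarizing the features
    it "won" in the competition.
    """
    rng = np.random.default_rng(seed)
    N, D = features.shape
    slots = rng.normal(size=(num_slots, D))
    for _ in range(iters):
        # Similarity between every slot and every feature.
        logits = slots @ features.T / np.sqrt(D)        # (K, N)
        # Softmax over SLOTS: each feature is shared out competitively,
        # so slots are pushed to specialize on different regions.
        attn = softmax(logits, axis=0)                  # (K, N)
        # Each slot becomes the weighted mean of its features.
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ features                      # (K, D)
    return slots
```

The key design choice is the softmax axis: normalizing over slots (not features) is what makes the attention competitive, so two slots cannot both fully claim the same pixel region.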
Step 2: Object-Level Masking Across History
- What happens: For a random subset of objects, we hide their recent history across the time window. We keep only a small identity anchor from an earlier time to tag who they are. We also mask all future tokens to be predicted.
- Why this step exists: It blocks the shortcut where the model predicts an object from its own past. Now it must use other objects and auxiliary variables.
- Example with data: Suppose we have 6 history frames with 7 slots each. We choose |M|=3 objects to mask. For those 3 objects, time steps t−5…t−1 are replaced with a learned mask embedding + time code, but we keep a tiny identity from t−5.
Hook: Like putting a sticky note on a closed box: you know it's Box B even if you don't know what's inside. The Concept (Identity Anchor): A small linear projection of an early slot is kept as a tag for each masked object. How it works: 1) Store who the object is from an earlier frame; 2) Combine with learned time embeddings; 3) Mark all other recent states as masked. Why it matters: Without the tag, the predictor wouldn't know which object it is completing. Anchor: A suitcase has a name tag even if it's locked.
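The masking step can be written down directly. A hedged sketch, assuming a slot history laid out as a (frames, objects, dim) array; the function and argument names are illustrative, not from the paper's code.

```python
import numpy as np

def mask_object_histories(history, masked_ids, mask_emb, time_emb, id_proj):
    """Sketch of object-level history masking with identity anchors.

    history:    (T, K, D) slot states over T frames and K objects.
    masked_ids: indices of objects whose history is hidden.
    mask_emb:   (D,) learned mask embedding (a placeholder here).
    time_emb:   (T, D) learned time codes.
    id_proj:    (D, D_id) projection producing the tiny identity anchor.
    Returns the masked history and a dict of per-object anchors.
    """
    T, K, D = history.shape
    masked = history.copy()
    anchors = {}
    for k in masked_ids:
        # Identity anchor: small projection of the object's earliest state,
        # so the predictor knows WHO it is completing.
        anchors[k] = history[0, k] @ id_proj
        # Replace the whole recent history with mask embedding + time code,
        # so the predictor cannot see WHAT happened to this object.
        masked[:, k, :] = mask_emb[None, :] + time_emb
    return masked, anchors
```

Note that visible objects pass through untouched; only the chosen objects lose their trajectories, which is exactly what blocks the self-dynamics shortcut.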
Step 3: JEPA-Style Predictor with Bidirectional Attention
- What happens: A ViT-style transformer reads all visible history slots, masked tokens (which carry only identity + time), and auxiliary variables (actions, proprioception) as separate nodes. It infers the missing histories and also predicts future slots.
- Why this step exists: Bidirectional masked prediction lets the model jointly reason over a whole window, instead of just rolling one step at a time.
- Example: In CLEVRER, it reads 6 frames of slots, fills masked histories, and predicts up to 10 future frames; in Push-T, it reads 3 frames and predicts the next state for planning.
Hook: Think of guessing a puzzle by looking at the whole board, not just one piece at a time. The Concept (JEPA - Joint Embedding Predictive Architecture): JEPA learns to align predicted latent codes with target latent codes, without rebuilding pixels. How it works: 1) Encode target latents; 2) Predict masked/future latents; 3) Pull predictions toward targets in embedding space. Why it matters: It focuses learning on semantics needed for prediction and control, skipping heavy image reconstruction. Anchor: Instead of repainting a hidden puzzle piece, you predict its code and check if it matches the real piece's code.
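"Bidirectional attention over the whole window" reduces to a familiar operation. Below is a single self-attention step in NumPy, standing in for the ViT-style predictor; the real model stacks many such layers with MLPs and normalization, and the weight matrices here are plain random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_predict(tokens, Wq, Wk, Wv):
    """One self-attention step over the full token window.

    tokens: (S, D) sequence mixing visible slot states, masked
    placeholders (identity + time only), and auxiliary tokens.
    Every token attends to every other, so a masked object's history
    is inferred jointly from the whole context rather than rolled
    out one step at a time.
    """
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v  # (S, D) refined tokens
```

Because there is no causal mask, a masked-history token at time t−3 can read both earlier and later visible states, which is what makes history completion different from plain autoregressive rollout.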
Step 4: Loss = Masked History + Future Prediction
- What happens: We minimize error on: (a) masked history slots and (b) masked future slots. This couples "use others to fill what was hidden" with "roll forward in time."
- Why this step exists: History masking enforces interaction learning; future prediction keeps the model pointed toward world modeling.
- Example: If the model ignores other objects when the target is masked, history loss stays high. If it ignores dynamics, future loss stays high.
Hook: Picture an experiment where you temporarily stop seeing one object but still see everything else. The Concept (Latent Intervention): Masking is like a safe, imaginary intervention in the model's memory: you remove access to one object's state without changing the actual world. How it works: 1) Replace the target's history with a neutral mask; 2) Keep others and actions; 3) Force the model to infer the missing piece from what remains. Why it matters: Over many such mini-interventions, the model learns stable dependency patterns, the essence of causal bias. Anchor: You cover the red ball in a replay video and still figure out where it is from the blue ball's motion and the last push.
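The two-part objective of this step can be sketched as a single function. The equal weighting and shapes here are illustrative assumptions; the point is that the history term is scored only at masked positions, so it can only be reduced by using other objects.

```python
import numpy as np

def cjepa_loss(pred_history, true_history, pred_future, true_future, object_mask):
    """Sketch of the combined objective (weighting is illustrative).

    pred/true_history: (T, K, D) predicted vs. target history latents.
    pred/true_future:  (F, K, D) predicted vs. target future latents.
    object_mask:       (K,) bool, True where an object's history was hidden.

    The history term is evaluated ONLY on masked objects: filling them in
    requires interaction clues. The future term keeps the model a
    forward predictor. Both compare latents, never pixels.
    """
    hist_err = ((pred_history - true_history) ** 2)[:, object_mask].mean()
    fut_err = ((pred_future - true_future) ** 2).mean()
    return float(hist_err + fut_err)
```

A quick consequence: errors on visible objects' histories do not move this loss at all, so the training signal concentrates exactly where the mini-intervention happened.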
Step 5: Auxiliary Variables as Separate Nodes
- What happens: Actions and proprioception are fed as their own tokens, not glued into vision slots.
- Why this step exists: Keeping them separate makes it clearer "who influenced whom," improving planning and reasoning.
- Example: In Push-T, separate action/proprio tokens yield better success than concatenating them into object slots.
Hook: Think of map legends kept outside the map: they explain the map but aren't mixed into the roads. The Concept (Auxiliary Variables): These are extra, observable signals like actions and body-sense that influence dynamics. How it works: 1) Encode them as their own tokens over time; 2) Let the transformer attend to them when needed; 3) Keep vision tokens clean. Why it matters: Clarity improves learning and planning. Anchor: The robot's "move-right" command is its own token, not hidden inside the cup's visual code.
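Concretely, "separate nodes" just means the action and proprioception streams are appended to the token sequence instead of concatenated into slot vectors. A minimal sketch, assuming per-stream type embeddings (which would be learned in practice); names are illustrative.

```python
import numpy as np

def build_token_sequence(slots, actions, proprio, type_emb):
    """Assemble predictor input with auxiliaries as separate nodes.

    slots:   (T, K, D) visual object tokens.
    actions: (T, D) action tokens; proprio: (T, D) body-sense tokens.
    type_emb: dict of (D,) stream embeddings ("vision"/"action"/"proprio").
    Returns (T*K + 2*T, D): vision stays clean, and attention can later
    reveal which actions a given object's prediction depended on.
    """
    T, K, D = slots.shape
    vis = slots.reshape(T * K, D) + type_emb["vision"]
    act = actions + type_emb["action"]
    pro = proprio + type_emb["proprio"]
    return np.concatenate([vis, act, pro], axis=0)
```

The alternative (concatenating action features into each slot's vector) hides "who influenced whom" inside every visual token, which is exactly what the ablations in this paper argue against.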
The Secret Sauce:
- Masking entire objects (not patches) prevents shortcut self-dynamics and forces interaction use.
- JEPA's latent prediction avoids pixel decoding and focuses compute on the small, meaningful token set (4–7 slots), cutting cost dramatically versus 196+ patches.
- The identity anchor keeps entity identity stable without adding slot-order encodings.
- Result: Better counterfactual reasoning and much faster model-predictive control with similar success rates.
04 Experiments & Results
The Tests and Why:
- CLEVRER Visual Question Answering (VQA): Measures if the model can answer descriptive, predictive, explanatory, and counterfactual questions about multi-object videos. This stresses interaction and "what-if" thinking.
- Push-T Robot Planning (MPC): Measures if a learned world model can plan actions to push a T-shaped object to a goal. This tests if predictions are accurate and efficient enough for real control.
The Competition:
- VQA (object-centric only): SlotFormer, OCVP-Seq, and OC-JEPA (same architecture as C-JEPA but without history masking) are compared.
- Push-T (shared DINOv2 features): DINO-WM (patch-based), DINO-WM with registers, OC-DINO-WM (object slots but same AR predictor), OC-JEPA (JEPA predictor without history masking), and C-JEPA (with history masking).
The Scoreboard (with context):
- CLEVRER: C-JEPA improves counterfactual accuracy by about 20% absolute over the no-history-masking version (OC-JEPA). That's like jumping from a B- to an A+ specifically on the hardest "what-if" questions. Overall VQA also rises consistently. Crucially, this gain comes from the objective (masking) rather than just using object slots.
- Reconstruction-free comparison: Removing reconstruction from older baselines often hurts them a lot (e.g., SlotFormer drops strongly), but C-JEPA stays strong because it was designed to learn in latent space from the start.
- Push-T MPC: C-JEPA uses about 1.02% of the tokens compared to patch models yet matches their planning success while being over 8× faster in planning under identical compute settings. That's like finishing an accurate plan in about 1 minute instead of 8+ minutes.
Surprising Findings:
- Object masking helps counterfactuals most: Gains are largest on "what-if" questions, aligning with the idea that masking acts like many tiny counterfactuals during training.
- Separate auxiliary tokens beat concatenation: Modeling actions and proprioception as explicit tokens consistently outperforms gluing them into vision tokens, likely because it keeps causal influences clearer.
- Too much masking hurts: There's a sweet spot. Excessive masking hides too many clues and lowers performance.
- Object-level masking is more stable than random token/tube masking: While token/tube masking can sometimes match performance, they are more sensitive to the masking budget and less consistent, because they can accidentally hide the wrong combinations (e.g., every token at a single time step). Object-level masking gives a clearer, more controllable signal to learn interactions.
05 Discussion & Limitations
Limitations:
- Quality of object slots limits the ceiling: If the encoder merges objects or loses identity, masking won't induce the right interactions. Stronger slot encoders help.
- Causal structure is implicit: The paper formalizes "influence neighborhoods," not full causal graphs. It doesn't evaluate against datasets with known temporal causal edges.
- Masking ratio must be tuned: Too little masking leaves shortcuts; too much removes vital clues.
- Complexity of environments: Very crowded or highly deformable scenes may challenge fixed-slot models.
Required Resources:
- A frozen, pretrained visual backbone (e.g., DINOv2) and an object-centric encoder (e.g., VideoSAUR or SAVi).
- A transformer predictor (JEPA-style) trained on masked latents; a single modern GPU suffices for reported settings.
- For VQA, a simple reasoning head (ALOE) over slot trajectories; for MPC, a standard optimizer like CEM.
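The CEM optimizer mentioned above is standard and easy to sketch. In the paper's setting the cost would come from rolling candidate actions through the world model in latent space; here a toy quadratic cost stands in for that rollout, and all hyperparameter values are illustrative.

```python
import random

def cem_plan(cost_fn, horizon=1, iters=20, pop=64, elite=8, init_std=2.0, seed=0):
    """Minimal cross-entropy method (CEM) planner sketch.

    Repeats: sample action sequences from a Gaussian, score each with
    cost_fn (for a world model: predicted distance to goal after the
    rollout), refit the Gaussian to the lowest-cost elites.
    Returns the final mean action sequence.
    """
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [init_std] * horizon
    for _ in range(iters):
        samples = [[rng.gauss(mean[t], std[t]) for t in range(horizon)]
                   for _ in range(pop)]
        samples.sort(key=cost_fn)       # lowest cost first
        elites = samples[:elite]
        for t in range(horizon):
            vals = [s[t] for s in elites]
            mean[t] = sum(vals) / elite
            var = sum((v - mean[t]) ** 2 for v in vals) / elite
            std[t] = max(var ** 0.5, 1e-3)  # floor keeps exploration alive
    return mean
```

For example, `cem_plan(lambda a: (a[0] - 3.0) ** 2)` homes in on the action value near 3.0. With C-JEPA, each cost evaluation touches only a handful of slot tokens, which is where the reported planning speedup comes from.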
When NOT to Use:
- Single-object or non-interactive scenes where interactions add little value; simpler models may suffice.
- Extremely noisy videos where object discovery fails frequently.
- Ultra-low-latency edge settings with no GPU and no time to compute slot features (unless precomputed).
Open Questions:
- Jointly training stronger slot encoders with JEPA without collapse: can we boost both perception and dynamics together?
- Scaling to richer physics and contacts, partial occlusions, and long-term memory beyond a short window.
- Making influence neighborhoods explicit and using them for explanation or safety constraints.
- Adaptive masking policies: can the model learn where and when to mask to maximize learning?
06 Conclusion & Future Work
Three-Sentence Summary: C-JEPA trains a world model by hiding whole objects across time and forcing the model to infer them from others, turning training into many safe, tiny interventions in latent space. This injects a practical causal bias that improves interaction reasoning (especially counterfactuals) and enables fast, efficient planning using very few tokens. It works without pixel reconstruction, aligning directly with prediction and control.
Main Achievement: Showing that object-level masking, paired with a JEPA predictor, makes interaction reasoning functionally necessary and yields big gains in counterfactual VQA and 8× faster MPC with comparable success.
Future Directions: Strengthen and co-train object encoders with JEPA; validate influence neighborhoods against known causal structures; test in busier, real-world scenes with richer contacts; and design adaptive masking strategies that target the most informative objects.
Why Remember This: C-JEPA turns "masking" from a vision trick into a principled, object-level training signal that teaches models to use interactions. That shift, from patches to objects, from reconstruction to latent prediction, and from correlations to intervention-stable dependencies, points to world models that are both smarter at reasoning and lighter to run.
Practical Applications
- Household robots that predict how moving one item will nudge others and plan tidy, spill-free actions.
- Video assistants that answer counterfactual questions about clips (e.g., "What if the door were closed?").
- Warehouse automation that plans object interactions (stacking, sliding) quickly using few tokens.
- Educational physics simulators that teach cause-and-effect by visualizing masked "what-if" scenarios.
- Game AI that reasons about object interactions (collisions, blocking) to plan smarter moves.
- Industrial arms that adapt pushes and grasps by predicting downstream effects on nearby parts.
- AR/VR systems that forecast how virtual and real objects will interact for smoother overlays.
- Autonomous navigation that anticipates interactions (e.g., carts, people) with compact world models.
- Content moderation or forensics tools that reason about sequences of interactions in videos.
- Scientific discovery helpers that explore "mask-and-predict" counterfactuals in simulations.