Next Embedding Prediction Makes World Models Stronger
Key Summary
- NE-Dreamer is a model-based reinforcement learning agent that skips rebuilding pixels and instead learns by predicting the next step's hidden features.
- It uses a causal temporal transformer to guess the next encoder embedding and lines it up with the real one, teaching the model to think ahead in time.
- A Barlow Twins alignment loss keeps the learned features useful and prevents them from collapsing into boring sameness.
- Because it aims at the future instead of the present, NE-Dreamer forms memories that last across many steps, which is vital in partially observable worlds.
- On DMLab Rooms memory and navigation tasks, NE-Dreamer beats strong decoder-based and decoder-free baselines of the same size and compute.
- On the DeepMind Control Suite, it stays competitive without needing pixel reconstruction or heavy data augmentation.
- Ablations show the gains come specifically from the combo of the causal temporal transformer and the next-step target shift.
- The method is simple to add to Dreamer-style pipelines and scales without extra complexity.
- This work suggests next-embedding prediction is a powerful, general recipe for world models in complex, partially observable settings.
Why This Research Matters
Many real-world situations are partially observable: robots, vehicles, and assistants often see only a slice of the world and must remember and predict to act well. NE-Dreamer's next-embedding prediction teaches models to think one step ahead, creating sturdy memories that last across time. This improves navigation, planning, and long-horizon tasks without the complexity of pixel reconstruction or heavy data augmentation. Because it plugs into standard Dreamer-style pipelines, it is practical to adopt. Its success suggests a general direction for building stronger, leaner world models that scale to complex environments. In short: it's a simpler way to get better foresight.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you watch a mystery show, you don't just look at the last scene; you remember clues from earlier to figure out what happens next? Smart problem-solvers stitch moments together over time.
The Concept: Reinforcement Learning (RL)
- What it is: RL is how an AI learns by trying actions and getting rewards, like a video game player learning the best moves by scoring points.
- How it works:
- See the current situation
- Try an action
- Get a reward or penalty
- Repeat to learn which actions lead to more reward
- Why it matters: Without RL, the agent wouldn't know which actions are good or bad, so it couldn't improve. Anchor: Think of a robot dog learning to fetch: it tries different runs, gets a treat when successful, and learns the best path over time.
Hook: Imagine planning a road trip: you don't just react to each turn; you picture the route ahead to decide what to do now.
The Concept: Model-Based RL (MBRL)
- What it is: MBRL teaches an AI to build a small "world model" in its head so it can imagine the future before acting.
- How it works:
- Compress raw images into a compact hidden state (a "latent")
- Learn how that hidden state changes when you take actions
- Use the model to imagine future steps and pick actions
- Why it matters: Without a world model, the agent reacts short-sightedly and struggles in tasks that require planning. Anchor: Like a chess player imagining a few moves ahead before moving a piece.
Hook: You know how sometimes a picture doesn't show everything, like a maze photo where you can't see what's around the corner?
The Concept: Partial Observability
- What it is: The agent can't see the whole world at once, only a slice of it (like a single camera frame).
- How it works:
- Collect clues over time (frames, actions, rewards)
- Store them in memory-like hidden states
- Use them to guess what's hidden and what's coming next
- Why it matters: Without remembering history, the agent gets confused and makes poor choices. Anchor: In a 3D maze, you must remember where the blue key was, even after turning three corners.
Hook: Think of tracing a drawing by eye: rebuilding every pixel exactly is careful work, but sometimes too slow and too detailed.
The Concept: Reconstruction Loss (Decoder-Based Learning)
- What it is: A way to train a model by making it recreate the input image from its hidden state.
- How it works:
- Encode the image into a latent
- Decode the latent back into pixels
- Punish differences between the reconstruction and the real image
- Why it matters: Without a strong training signal, features can be weak; but reconstruction can also waste effort on irrelevant textures. Anchor: If the goal is to find the exit, perfectly redrawing the wallpaper pattern doesn't help as much as remembering where the door was.
Hook: Imagine skipping the tracing and just learning the important landmarks.
The Concept: Decoder-Free Learning
- What it is: Training the hidden state directly, without rebuilding pixels.
- How it works:
- Encode images into embeddings
- Optimize those embeddings with simpler objectives (no pixel decoder)
- Use them for prediction and control
- Why it matters: Without the decoder, training is simpler and faster, but you must still keep features informative and stable. Anchor: Instead of redrawing every tree, a hiker's map marks only trails and checkpoints, good enough to navigate.
Before this work: Many agents either reconstructed pixels (powerful but heavy) or trained decoder-free features that mostly matched same-timestep views. Under partial observability, same-timestep matching often failed to build long-term memory; features drifted and forgot what mattered across time. The gap: a simple way to make decoder-free features explicitly predictive of the future, not just aligned with the present. The real stakes: In everyday life, like navigating buildings, driving, or assisting in warehouses, remembering and predicting what comes next is crucial. An AI that can't connect moments over time gets lost.
Hook: Imagine teaching your future self a hint for the next scene in a movie.
The Concept: Temporal Predictiveness
- What it is: Training features so that today's state helps accurately predict tomorrow's state.
- How it works:
- Use history to forecast the next hidden embedding
- Compare the forecast to the real next embedding
- Adjust the model to make forecasts closer
- Why it matters: Without temporal predictiveness, the model can't plan or remember effectively. Anchor: If your clue for the next scene is correct, you'll follow the plot; if not, you'll get lost quickly.
02 Core Idea
Hook: You know how a good coach doesn't judge you on how you look now, but on whether your form today leads to a better shot on the next play?
The Concept: Next-Embedding Prediction (NEP)
- What it is: Instead of matching the current hidden features to themselves, the model learns to predict the next stepās features and align to them.
- How it works:
- Encode the current image into an embedding
- Use a causal temporal transformer to guess the next embedding from history
- Compare the guess to the real next embedding (with stop-gradient)
- Align them using a stability loss (Barlow Twins)
- Why it matters: Without predicting the next embedding, features become short-sighted and drift; with NEP, they become forward-looking and stable. Anchor: Like forecasting tomorrow's weather from today's patterns and then checking if you were right.
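The one-step target shift at the heart of NEP can be sketched in a few lines of NumPy: whatever is predicted at steps 0..T-2 is scored against the real embeddings at steps 1..T-1. This is an illustrative sketch, not the paper's code; the function name `next_embedding_pairs` is made up for this example.

```python
import numpy as np

def next_embedding_pairs(embeddings):
    """Pair each step's prediction input with its next-step target.

    embeddings: array of shape (T, D), one encoder embedding per step.
    Returns (inputs, targets): inputs[t] is the embedding available at
    step t, and targets[t] is the embedding one step into the future.
    """
    inputs = embeddings[:-1]   # e_0 ... e_{T-2}
    targets = embeddings[1:]   # e_1 ... e_{T-1}: the "next" embeddings
    return inputs, targets

# Toy example: 5 steps, 3-dimensional embeddings.
E = np.arange(15, dtype=np.float64).reshape(5, 3)
inp, tgt = next_embedding_pairs(E)
```

The shift is trivial to implement, which is part of the paper's point: the gains come from where the target sits in time, not from extra machinery.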
Hook: Imagine a storyteller who only reads past chapters, never peeking ahead.
The Concept: Causal Temporal Transformer
- What it is: A sequence model that looks only backward in time to make its next-step prediction.
- How it works:
- Take in embeddings, actions, and states up to now
- Apply attention with a causal mask (no future leakage)
- Output a prediction for the next embedding
- Why it matters: Without causal masking, the model could "cheat" by using future info; with it, predictions are honest and useful for control. Anchor: Like solving a puzzle using only pieces you've already seen.
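The "no future leakage" property comes from an upper-triangular mask on the attention scores. Below is a minimal single-head sketch in NumPy, assuming queries, keys, and values are the raw embeddings (a real transformer uses learned projections); the test at the end checks the causal guarantee directly.

```python
import numpy as np

def causal_self_attention(X):
    """Single-head self-attention with a causal mask.

    X: (T, D) sequence of embeddings. Position t may only attend to
    positions <= t, so changing a future input never changes an
    earlier output.
    """
    T, D = X.shape
    scores = X @ X.T / np.sqrt(D)                 # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                        # block attention to the future
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y1 = causal_self_attention(X)

# Perturb only the last timestep: outputs for steps 0..4 must not move.
X2 = X.copy()
X2[5] += 10.0
Y2 = causal_self_attention(X2)
```

This check, comparing outputs before and after perturbing a future frame, is also a handy unit test when wiring a causal transformer into a Dreamer-style pipeline.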
Hook: Think of writing a goal on a sticky note, then taping it to the wall so you don't accidentally erase it while practicing.
The Concept: Stop-Gradient Target
- What it is: The real next embedding is used as a fixed target that doesn't get to change during this alignment step.
- How it works:
- Compute the true next embedding via the encoder
- Freeze it (no gradient)
- Update only the predictor to match it
- Why it matters: Without freezing the target, both sides could chase each other and collapse to trivial solutions. Anchor: The scoreboard doesn't move when you shoot; only your aim changes.
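A tiny worked example of the asymmetry: only the predictor is updated, while the target stays a constant. Here a linear map `W` stands in for the transformer, and `stop_gradient` is just "copy and never update"; in an autodiff framework this is `detach()`/`stop_gradient`. All names are illustrative.

```python
import numpy as np

def stop_gradient(x):
    """Treat x as a constant: return a copy the update loop never touches."""
    return x.copy()

x = np.array([1.0, 0.5, -0.5])       # current embedding (predictor input)
e_next = np.array([0.3, -1.2])       # true next embedding from the encoder
W = np.zeros((2, 3))                 # linear predictor (stand-in for the transformer)

target = stop_gradient(e_next)       # frozen: gradients flow only into W

for _ in range(200):
    pred = W @ x
    err = pred - target              # gradient of squared error w.r.t. pred
    W -= 0.1 * np.outer(err, x)      # gradient step on the predictor ONLY

final_loss = float(np.mean((W @ x - target) ** 2))
```

The predictor converges onto the fixed target; because the target side never moves, there is no degenerate solution where both sides drift to a constant vector together.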
Hook: Picture twins who learn to be similar where it counts but avoid copying each other's every quirk.
The Concept: Barlow Twins (Redundancy Reduction)
- What it is: A loss that makes matched features line up along the diagonal (important parts agree) and reduces unnecessary overlap across dimensions.
- How it works:
- Normalize predicted and target embeddings
- Compute a cross-correlation matrix
- Push the diagonal towards 1 (agreement) and off-diagonals towards 0 (less redundancy)
- Why it matters: Without redundancy reduction, features can become tangled or collapse, hurting prediction and control. Anchor: Organizing your backpack so each pocket holds different useful items, not ten pencils in the same pocket.
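The three steps above map directly onto the published Barlow Twins loss. A NumPy sketch (the trade-off weight `lam` is a typical small value, not necessarily the one this paper uses):

```python
import numpy as np

def barlow_twins_loss(z_pred, z_tgt, lam=5e-3):
    """Barlow Twins loss between predicted and target embeddings.

    z_pred, z_tgt: (N, D) batches. Each dimension is standardized over
    the batch, then the D x D cross-correlation matrix C is computed.
    The loss pushes diag(C) toward 1 (agreement) and off-diagonal
    entries toward 0 (redundancy reduction).
    """
    def standardize(z):
        return (z - z.mean(0)) / (z.std(0) + 1e-8)

    N = z_pred.shape[0]
    C = standardize(z_pred).T @ standardize(z_tgt) / N   # (D, D)
    on_diag = ((1.0 - np.diag(C)) ** 2).sum()
    off_diag = (C ** 2).sum() - (np.diag(C) ** 2).sum()
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 8))
aligned = barlow_twins_loss(Z, Z)                    # perfectly matched pair
shuffled = barlow_twins_loss(Z, rng.permutation(Z))  # targets scrambled in time
```

With matched inputs the diagonal term vanishes and the loss is near zero; scrambling the pairing destroys the diagonal agreement, so the loss jumps, which is exactly the signal that drives the predictor toward correct next-step forecasts.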
Three analogies for the core idea:
- Crystal ball analogy: Instead of redrawing today's scene, train a crystal ball that guesses tomorrow's summary of the scene, and reward it for being accurate.
- Bowling coach analogy: Don't match today's pose to a photo; adjust today's pose so the next roll hits more pins.
- GPS analogy: Rather than re-photographing every street, keep a compact map that predicts the next turn correctly.
Before vs After:
- Before: Decoder-free agents often matched same-step features, which didn't force memory or lookahead.
- After: NE-Dreamer predicts next-step embeddings, making the latent state naturally encode what will matter soon.
Why it works (intuition): If you consistently predict what comes next, you must keep and combine just the right facts from history (like object identity and position), discarding noisy details (like wallpaper textures). The causal transformer is the tool that compresses history into those predictive bits, and Barlow Twins keeps the bits diverse and stable.
Hook: Think of a short recipe for smarter memory.
The Concept: Temporal Predictive Alignment
- What it is: Training the whole world model so that its sequence of hidden states lines up with what truly happens next in the environment.
- How it works:
- Encode images into embeddings
- Roll a causal transformer to predict the next embedding
- Align prediction to the frozen true next embedding with Barlow Twins
- Use this predictive latent space for planning via actor-critic
- Why it matters: Without alignment to the future, the latent space drifts and forgets, which is bad for long-horizon control. Anchor: A good diary not only records today; it sets you up to remember what to do tomorrow.
03 Methodology
At a high level: Pixels and actions → Encoder and RSSM (latent state) → Causal temporal transformer predicts next embedding → Align with true next embedding (stop-gradient) via Barlow Twins → Imagination-based actor-critic uses the learned world model for control.
Hook: Imagine shrinking each camera frame into a small, useful card you can carry in your pocket.
The Concept: Encoder and Embeddings
- What it is: The encoder turns each image into a compact vector called an embedding.
- How it works:
- Feed the 64×64 RGB image into a CNN
- Output an embedding e_t that summarizes the scene
- Why it matters: Without embeddings, everything stays huge and slow; the agent can't learn efficiently. Anchor: Like turning a big photo album into a pocket-sized postcard with key details.
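The shape contract is the important part: a 64×64×3 frame goes in, a compact vector e_t comes out. As a stand-in for the learned CNN, a fixed random projection already illustrates the interface (everything here is illustrative, not the paper's architecture):

```python
import numpy as np

def make_encoder(embed_dim=32, seed=0):
    """Stand-in encoder: flatten a 64x64 RGB frame and apply a fixed
    random projection. The real agent uses a learned CNN; this sketch
    only demonstrates the contract image -> compact embedding e_t."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(64 * 64 * 3),
                   size=(embed_dim, 64 * 64 * 3))

    def enc(image):
        assert image.shape == (64, 64, 3)
        return W @ image.reshape(-1)   # (embed_dim,) embedding

    return enc

enc = make_encoder()
frame = np.zeros((64, 64, 3))
e_t = enc(frame)
```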
Hook: Consider a memory notebook that keeps a running summary of what's happened.
The Concept: RSSM (Recurrent State-Space Model)
- What it is: A world-model backbone with a deterministic hidden state h_t and a stochastic latent z_t that evolve over time.
- How it works:
- Update h_t from the prior state, action, and previous latent
- Infer z_t using both h_t and the current embedding e_t during training
- During imagination, sample z_t from the learned prior
- Why it matters: Without RSSM dynamics, the model can't connect steps into a coherent memory or simulate futures. Anchor: Like updating your trip log with both your planned route (prior) and what you actually saw (posterior).
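One RSSM update can be sketched with toy linear maps: the deterministic state h advances from (h, z, action), and the latent z is drawn from a posterior when an embedding is available (training) or from the prior alone (imagination). This is a simplified caricature of Dreamer's equations, with hypothetical parameter names, not the exact model.

```python
import numpy as np

def rssm_step(h, z, action, e=None, params=None, rng=None):
    """One toy RSSM update (a sketch, not Dreamer's exact equations).

    h: deterministic state, z: stochastic latent, action: action vector.
    If an embedding e is given (training), the posterior uses it;
    otherwise (imagination) z comes from the prior, i.e. from h alone.
    """
    Wh, Wprior, Wpost = params
    h = np.tanh(Wh @ np.concatenate([h, z, action]))   # deterministic update
    if e is None:
        mean = Wprior @ h                              # prior: from h alone
    else:
        mean = Wpost @ np.concatenate([h, e])          # posterior: h + embedding
    z = mean + 0.1 * rng.normal(size=mean.shape)       # sample the latent
    return h, z

H, Z, A, E = 8, 4, 2, 6
rng = np.random.default_rng(0)
params = (rng.normal(size=(H, H + Z + A)),
          rng.normal(size=(Z, H)),
          rng.normal(size=(Z, H + E)))
h, z = np.zeros(H), np.zeros(Z)
h, z = rssm_step(h, z, np.ones(A), e=np.ones(E), params=params, rng=rng)  # training
h, z = rssm_step(h, z, np.ones(A), params=params, rng=rng)                # imagination
```

The same step function serving both modes, with and without the embedding, is what lets the agent later "imagine" rollouts from the prior alone.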
Step-by-step recipe with an example (maze room):
- Observe x_t: a corridor with a red key on the left.
- Encode: e_t = enc(x_t) captures "red key left, corridor shape."
- Update dynamics: h_t, z_t summarize history and current info.
- Predict next embedding: Using the causal transformer on the sequence up to t, output \hat{e}_{t+1}.
- Get the true next embedding: e_{t+1} = stop_gradient(enc(x_{t+1})).
- Align: Apply Barlow Twins so \hat{e}_{t+1} matches e_{t+1} on the diagonal and avoids redundant overlap.
- Train rewards and continuation: Heads predict r_t and whether the episode continues c_t.
- Imagine futures: Roll the prior forward to create 15-step imagined trajectories for actor-critic learning.
- Improve the policy: The actor chooses actions that lead to higher imagined returns; the critic estimates those returns.
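The representation-learning part of the recipe above (encode, predict causally, score against the frozen next embedding) can be condensed into one loop. To keep the sketch self-contained it uses a plain squared error where the paper uses Barlow Twins, and the "encoder" and "predictor" are hypothetical stand-ins, not the real networks.

```python
import numpy as np

def nep_training_losses(frames, enc, predict_next):
    """One simplified pass of the recipe: encode each frame, predict the
    next embedding from the history, and score the prediction against
    the frozen true next embedding. The paper aligns with Barlow Twins;
    squared error is used here only to keep the sketch short."""
    E = np.stack([enc(x) for x in frames])        # (T, D) embeddings
    losses = []
    for t in range(len(frames) - 1):
        e_hat = predict_next(E[: t + 1])          # causal: history up to t only
        target = E[t + 1].copy()                  # stop-gradient: frozen target
        losses.append(float(np.mean((e_hat - target) ** 2)))
    return losses

# Stand-ins: an "encoder" that averages pixels per channel, and a
# "predictor" that just repeats the latest embedding.
enc = lambda x: x.mean(axis=(0, 1))
predict_next = lambda hist: hist[-1]

frames = [np.full((64, 64, 3), float(t)) for t in range(4)]
losses = nep_training_losses(frames, enc, predict_next)
```

With this deliberately lazy "repeat the present" predictor, every step incurs a loss of exactly 1.0 on these synthetic frames, which is the kind of error the training signal pushes the real transformer to drive down.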
Hook: Think of practicing free throws in your head before the game.
The Concept: Imagination-Based Actor-Critic
- What it is: The agent plans in the learned latent space by simulating futures and learning a policy from them.
- How it works:
- From the current latent state, roll forward H steps using the world model
- The critic estimates returns from imagined rewards
- The actor updates to choose actions that increase these returns
- Why it matters: Without imagination, you must learn only from real-world steps, which is slow and noisy. Anchor: Like a chess player running mental simulations before committing to a move.
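The return estimate the critic and actor train on can be sketched as a backward pass over the imagined rollout. This uses a simple n-step form with a bootstrap value at the horizon; Dreamer-style agents typically use lambda-returns, so treat this as an illustrative simplification.

```python
import numpy as np

def imagined_returns(rewards, continues, values, gamma=0.997):
    """Discounted returns over an imagined rollout (simple n-step form).

    rewards[t], continues[t]: model-predicted reward and continue flag
    at imagined step t; values[-1] bootstraps beyond the horizon.
    A continue flag of 0 cuts off everything after the episode ends.
    """
    H = len(rewards)
    R = np.zeros(H)
    nxt = values[-1]                   # bootstrap from the critic
    for t in reversed(range(H)):
        R[t] = rewards[t] + gamma * continues[t] * nxt
        nxt = R[t]
    return R

rewards = np.array([0.0, 0.0, 1.0])
continues = np.array([1.0, 1.0, 0.0])   # episode ends right after the reward
values = np.array([0.0, 0.0, 0.0, 5.0]) # bootstrap is cut off by continue=0
R = imagined_returns(rewards, continues, values, gamma=0.5)
```

Note how the continue flag of 0 at the final step zeroes out the critic's 5.0 bootstrap, so the returns reflect only the reward actually reachable within the imagined episode.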
Hook: Imagine rating both accuracy and organization when you compare notes.
The Concept: Barlow Twins Alignment in Prediction
- What it is: An alignment loss tailored to next-step prediction to keep features informative and non-collapsed.
- How it works:
- Normalize predicted and target embeddings across a batch of valid transitions
- Compute cross-correlation C between them
- Penalize (1 - C_ii) to encourage agreement, and penalize off-diagonals C_ij to reduce redundancy
- Why it matters: Without this, the predictor could output trivial vectors or overly entangled features. Anchor: Organizing a toolbox so each tool has a unique slot and the most-used tools are easy to grab.
Hook: Think of a rulebook that says: use only clues you've already seen.
The Concept: Causality (No Peeking Ahead)
- What it is: The transformer is masked so it only uses past info to predict the future.
- How it works:
- Add a causal mask to attention
- Prevent information leakage from future steps
- Train the model on fair predictions
- Why it matters: Without causality, predictions become unrealistic and useless for real control. Anchor: It's like taking a test without looking at the answer key.
Why each step exists and what breaks without it:
- Encoder: Without it, inputs are too large and noisy.
- RSSM dynamics: Without it, there's no memory to stitch frames together.
- Next-embedding predictor: Without it, features won't be forward-looking under partial observability.
- Stop-gradient target: Without it, prediction and target can collapse together.
- Barlow Twins: Without it, features can be redundant or degenerate.
- Actor-critic imagination: Without it, the agent can't practice efficiently and learn long-horizon strategies.
The secret sauce:
- The next-step target shift forces the model to care about tomorrow, not just today.
- The causal transformer turns history into just the right predictive bits.
- Redundancy reduction keeps those bits diverse, stable, and useful for control.
04 Experiments & Results
Hook: Think of a memory maze game where success comes from remembering clues across rooms, not just reacting to the last picture you saw.
The test: Researchers evaluated whether next-embedding prediction improves long-horizon control under partial observability. They used two standard arenas: DeepMind Lab (DMLab) for 3D memory/navigation and DeepMind Control Suite (DMC) for continuous control from pixels. They measured return (how well the agent performs) across training steps and compared methods under the same compute and model size.
The competition: NE-Dreamer was compared to:
- DreamerV3 (decoder-based reconstruction)
- R2-Dreamer (decoder-free same-step Barlow Twins)
- DreamerPro (decoder-free with strong augmentations)
- Dreamer without reconstruction (minimal signals)
- DrQv2 (strong model-free baseline)
Scoreboard with context:
- DMLab Rooms: NE-Dreamer consistently outperformed both decoder-based and decoder-free baselines of the same size on four challenging memory/navigation tasks. That's like getting an A when others hover around B or B- on the hardest parts of the exam. The biggest gains appeared when success demanded keeping stable state over many steps instead of reacting to short-lived visuals.
- Ablations: Removing the causal transformer or removing the next-step target shift wiped out most of the gains, clear evidence that predictive sequence modeling is the key. Removing light projectors mainly affected training smoothness, not final scores.
- DMC: On standard control tasks where many methods are already near the ceiling, NE-Dreamer matched or slightly exceeded baselines, showing that dropping reconstruction didn't cause a performance dip.
Hook: Imagine checking your notebook to see whether you consistently wrote down the right details to help with tomorrow's tasks.
Surprising and insightful findings:
- Post-hoc reconstructions from frozen latents showed that NE-Dreamer's representations preserved object identity and spatial layout consistently over time, while same-timestep methods sometimes "forgot" or let task-relevant details fade.
- The core improvement didn't come from extra tricks, data augmentation, or bigger models; it came from the simple shift to next-embedding prediction plus a causal transformer.
Anchor: In the Rooms Watermaze-like task, keeping a stable memory of landmarks across corridors matters more than knowing exact wall textures; NE-Dreamer's predictive latents kept the landmarks front and center, leading to better navigation.
05 Discussion & Limitations
Hook: Think of a tool that's excellent for long hikes but may not be the perfect choice for painting miniatures.
Limitations:
- The method shines when long-term structure and memory matter most. In tasks where tiny visual details are crucial (high-fidelity reconstruction), pure prediction-based, decoder-free training might need extra help.
- Results focus on two popular benchmarks; broader validation in visually busier worlds remains to be seen.
Required resources:
- A Dreamer-style pipeline with an RSSM, plus a small causal transformer for next-embedding prediction.
- Compute comparable to prior Dreamer agents; no special augmentations or giant decoders needed.
When not to use:
- If the task absolutely requires photorealistic pixel reconstructions (e.g., supervised vision tasks needing fine textures), a decoder may still be useful.
- If the environment is fully observable and extremely simple, the extra sequence modeling may offer less benefit.
Open questions:
- Which alignment objectives (e.g., VICReg, BYOL-style) work best for future prediction in control?
- How far can next-embedding prediction scale in visually dense, multi-object 3D worlds?
- Can multi-step or masked future prediction provide additional gains without more compute?
- What are the best ways to combine small amounts of reconstruction with prediction to handle high-fidelity needs?
Anchor: It's like having a great compass for long treks; now we want to test it in jungles, deserts, and cities to learn where to add maps or binoculars.
06 Conclusion & Future Work
Three-sentence summary: NE-Dreamer trains a world model to predict the next encoder embedding using a causal temporal transformer and aligns it with a stable (stop-gradient) target via Barlow Twins. This future-facing objective learns temporally coherent representations that excel in partially observable, long-horizon tasks without relying on pixel reconstruction. Experiments show strong gains on DMLab Rooms and competitive performance on DMC, with ablations pinpointing the causal transformer plus next-step target shift as the key drivers.
Main achievement: Turning representation learning into next-step prediction (temporal predictive alignment) makes decoder-free world models both simpler and stronger in challenging memory/navigation settings.
Future directions: Explore alternative alignment losses, multi-step and masked prediction schemes, and hybrid approaches that mix a little reconstruction for high-fidelity domains. Scale to more complex 3D worlds and test robustness under heavy visual distractions.
Why remember this: The big idea is that teaching a model to guess tomorrow's features, honestly and causally, builds the kind of memory and foresight that tough, partially observable environments demand, all while keeping the system lean and practical.
Practical Applications
- Indoor robot navigation: Maintain stable memories of landmarks and room layouts to reach goals efficiently.
- Warehouse automation: Remember object locations and predict next states to plan multi-step pick-and-place sequences.
- Autonomous drones: Fly through partially seen spaces by predicting future views from past clues.
- AR/VR assistants: Guide users through buildings by keeping consistent, predictive scene summaries.
- Household robots: Execute long chores (cleaning, organizing) by anticipating what will be needed next.
- Industrial inspection: Track evolving machine states over time without reconstructing raw pixels.
- Education games and tutoring: Create agents that plan multi-step strategies, remembering prior hints or actions.
- Healthcare simulators: Train policies that forecast patient-state embeddings across time for safer planning (in simulation).
- Self-driving research simulators: Learn predictive latents that help with route planning under occlusions.
- Scientific robotics: Explore unknown terrains by building forward-looking world models from sparse observations.