Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Key Summary
- Agents often act like tourists without a map: they react to what they see now and miss long-term consequences.
- This paper teaches agents to first imagine several future steps using a learned world model, then plan their next move.
- A key idea is adaptive lookahead: the agent decides how far to imagine based on goal difficulty and current progress.
- They formalize this as a POIMDP, where decisions use both what's observed now and what's imagined next.
- Two versions exist: ITP-I (training-free at inference time) and ITP-R (reinforcement-trained with learned horizons).
- Across ALFWorld and ScienceWorld, ITP beats strong baselines; the trained ITP-R reaches up to ~94% success on some tasks.
- Adaptive lookahead outperforms fixed or random horizons, giving higher success with less computation.
- An ablation shows online reinforcement training is crucial: removing it causes large drops in success.
- World model quality matters more for the training-free variant; after training, even a weaker world model can work well.
- This approach helps agents avoid mistakes before they happen, saving time, compute, and failures in complex tasks.
Why This Research Matters
This work helps agents avoid making mistakes in the real world by checking likely outcomes first, like previewing a move in chess. It saves time and resources because the agent learns to imagine deeply only when needed, not at every step. For smart homes, it can prevent risky sequences (e.g., turning on the wrong appliance at the wrong time). In education and science simulations, it plans better experiments with fewer failed tries. In digital assistance (web tasks, emails), it reduces dead ends and rework by seeing a few steps ahead. Over time, this approach builds more trustworthy agents that people can rely on for complex, multi-step goals.
Detailed Explanation
01 Background & Problem Definition
Here are the core ideas, introduced step by step using the Sandwich pattern so they're easy to digest.
- Reinforcement Learning (RL) 🍞 Top Bread (Hook): Imagine training a puppy. Each time it does the right trick, it gets a treat, so it repeats the good behavior. 🥬 Filling (The Concept): RL is a way for an AI to learn by trying actions and getting rewards or penalties.
- How it works:
- The agent sees a situation.
- It picks an action.
- The world reacts and gives a reward.
- The agent updates its strategy to earn more reward next time.
- Why it matters: Without RL, agents can't improve from experience; they'd keep repeating the same mistakes. 🍞 Bottom Bread (Anchor): A robot vacuum learns that bumping into walls wastes time (low reward), while following the edge cleans faster (higher reward). (A toy code sketch follows.)
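To make the see-act-reward-update loop concrete, here is a minimal, self-contained Python sketch using tabular Q-learning on a toy corridor. The environment, learning rate, and exploration rate are illustrative assumptions, not anything from the paper.

```python
import random
from collections import defaultdict

def toy_env_step(state, action):
    """Toy 1-D corridor: reach position 3 for a reward of 1."""
    next_state = max(0, min(3, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3

q = defaultdict(float)                      # Q[(state, action)] -> estimated value
actions = ["left", "right"]
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration

for episode in range(200):
    state = 0
    for _ in range(50):                     # cap episode length
        # 1. see the situation, 2. pick an action (explore sometimes)
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        # 3. the world reacts and gives a reward
        next_state, reward, done = toy_env_step(state, action)
        # 4. update the strategy to earn more reward next time
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
        if done:
            break

print(max(actions, key=lambda a: q[(0, a)]))  # learned best first move: "right"
```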
- World Models 🍞 Top Bread (Hook): You know how you can "play a scene in your head" before you do something risky, like stacking books high? 🥬 Filling (The Concept): A world model is an AI's internal simulator that predicts what will likely happen next if it takes a certain action.
- How it works:
- Read the current situation.
- Predict the next situation for each possible action.
- Chain several predictions to simulate a short future.
- Why it matters: Without a world model, agents must try actions in the real world to learn consequences, which can be slow or unsafe. 🍞 Bottom Bread (Anchor): Before telling a robot to put a hot pan in the fridge, it imagines the fridge reaction (bad idea), so it tries cooling on the counter instead. (See the chained-prediction sketch below.)
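Here is a minimal sketch of how "imagination" chains one-step predictions. The `predict_next` lookup table is a hypothetical stand-in for a learned text world model.

```python
# Hypothetical stand-in for a learned text world model:
# (observation, action) -> predicted next observation.
TRANSITIONS = {
    ("kitchen, holding hot pan", "put pan in fridge"): "the fridge warms up and food may spoil",
    ("kitchen, holding hot pan", "place pan on counter"): "the pan cools safely on the counter",
}

def predict_next(observation: str, action: str) -> str:
    return TRANSITIONS.get((observation, action), "nothing notable happens")

def imagine(observation: str, planned_actions: list[str]) -> list[str]:
    """Chain several one-step predictions into a short imagined future."""
    trajectory = []
    for action in planned_actions:
        observation = predict_next(observation, action)
        trajectory.append(observation)
    return trajectory

print(imagine("kitchen, holding hot pan", ["put pan in fridge"]))
print(imagine("kitchen, holding hot pan", ["place pan on counter", "wait"]))
```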
- Partially Observable Markov Decision Process (POMDP) 🍞 Top Bread (Hook): Imagine walking through fog: you can't see everything, only what's near you. 🥬 Filling (The Concept): A POMDP describes decision-making when the agent can't see the whole world state, only partial clues.
- How it works:
- The agent gets a limited observation.
- It keeps a memory of past observations.
- It chooses an action that should help, despite uncertainty.
- Why it matters: Without a POMDP view, agents pretend they see everything and make overconfident, brittle plans. 🍞 Bottom Bread (Anchor): In a text game, the agent only reads the current room description, not the whole house layout, but still must choose where to go. (A tiny sketch follows.)
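A tiny sketch of acting under partial observability: the policy only sees a history of partial observations, never the full house. The rule-based policy below is purely illustrative.

```python
def policy(history: list[str]) -> str:
    """Choose an action from the sequence of partial observations seen so far."""
    if "it is dark" in history[-1]:
        return "turn on light"                 # act sensibly despite limited visibility
    if any("key" in obs for obs in history):   # memory of earlier clues still helps
        return "unlock door"
    return "look around"

history = ["hallway. a key lies on the floor", "bedroom. it is dark in here"]
print(policy(history))   # -> "turn on light"
```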
The World Before This Paper:
- LLM agents were good at reacting to what they just saw and at recalling history, but they lacked deep foresight: the ability to think through consequences several steps ahead. This is called shallow grounding.
- People tried single-step checks (verify the next action) or fixed-horizon rollouts (always imagine, say, 3 steps). These helped a little, but they either missed long-term dependencies (too short) or wasted compute and amplified errors (too long for easy moves).
- Some methods (like MCTS-style planning with LLMs) pushed harder but often used lots of expensive rollouts indiscriminately.
The Problem:
- Complex tasks (like multi-step household chores or science experiments) need different amounts of "thinking ahead" at different moments. A trivial move doesn't need deep imagination; a risky pivot does. Fixed horizons can't adapt.
What Failed Before and Why:
- Single-step checks: catch obvious mistakes, but miss chain reactions.
- Fixed-horizon lookahead: either too shallow for tough spots or too deep (slow, error-prone) for easy spots.
- Implicit world modeling inside a policy: helps, but the agent can't explicitly consult futures when it matters most.
The Gap:
- Agents lacked an adaptive lookahead that flexes based on goal difficulty and current progress, and a formal way to blend what is seen now with what is imagined next.
The Stakes (Why You Should Care):
- In homes, a helper agent should avoid risky actions (like turning on appliances at the wrong time) before causing trouble.
- In education and science games, an agent that plans experiments can save time and materials by previewing likely outcomes.
- In digital tools (email, web tasks), foresight avoids dead ends and reduces back-and-forth retries.
This paper proposes an Imagine-then-Plan (ITP) framework that explicitly imagines futures using a world model and then plans the next action, with an adaptive mechanism for deciding how far to look ahead each time. It also introduces a new formalism, POIMDP, that treats imagined futures as part of the agent's decision input, reducing shallow grounding and cutting down costly real-world trial-and-error.
02 Core Idea
Let's introduce the core innovations with the Sandwich pattern and then connect them.
- Lookahead Imagination 🍞 Top Bread (Hook): Think of a chess player who quietly plays out future moves in their head before touching a piece. 🥬 Filling (The Concept): Lookahead imagination means the agent simulates several possible next steps to see likely consequences before acting.
- How it works:
- Use the world model to predict the next state for a candidate action.
- Repeat this for multiple steps to form an imagined mini-trajectory.
- Inspect these futures for progress and conflicts.
- Why it matters: Without imagining, agents discover failures only after they happen, wasting time and risking cascades of mistakes. 🍞 Bottom Bread (Anchor): Before picking up a cup in a dark room, the agent imagines turning on the light first to avoid dropping it. (Illustrated in the sketch below.)
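A toy sketch of lookahead imagination: imagine each candidate action with a stand-in world model and skip futures that predict trouble. The helper names (`world_model`, `looks_bad`) are hypothetical, not from the paper.

```python
def world_model(obs: str, action: str) -> str:
    """Stand-in predictor of the next observation for one action."""
    futures = {
        "pick up cup": "you fumble in the dark and drop the cup",
        "turn on light": "the room is lit; the cup is clearly visible",
    }
    return futures.get(action, obs)

def looks_bad(predicted_obs: str) -> bool:
    """Crude conflict check on an imagined future."""
    return any(word in predicted_obs for word in ("drop", "spill", "fail"))

def lookahead_choose(obs: str, candidates: list[str]) -> str:
    for action in candidates:
        if not looks_bad(world_model(obs, action)):
            return action          # first candidate whose imagined future is safe
    return candidates[0]           # fall back if every future looks bad

print(lookahead_choose("dark room, cup on table", ["pick up cup", "turn on light"]))
# -> "turn on light"
```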
- Adaptive Lookahead Mechanism 🍞 Top Bread (Hook): When biking on a straight, safe road, you glance a little ahead; when entering busy traffic, you look much farther. 🥬 Filling (The Concept): Adaptive lookahead lets the agent choose how many steps to imagine now, balancing goal progress and compute cost.
- How it works:
- Estimate task difficulty and current progress.
- Pick a horizon K (maybe 0, maybe 5+) based on need.
- Imagine K steps, then decide the next action.
- Why it matters: Without adaptivity, the agent either overthinks easy moves or underthinks hard ones. 🍞 Bottom Bread (Anchor): In ALFWorld, moving from kitchen to hallway may need K=0–1, but planning a wash-then-place sequence may need K=4–5. (A heuristic sketch follows.)
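A heuristic sketch of adaptive horizon selection. The scoring rule and `K_MAX` value are illustrative assumptions; ITP-R learns this choice rather than hard-coding it.

```python
K_MAX = 5   # assumed maximum imagination depth

def choose_horizon(remaining_subgoals: int, recent_failures: int) -> int:
    """Deeper imagination when the goal looks harder or progress has stalled."""
    difficulty = remaining_subgoals + 2 * recent_failures
    if difficulty <= 1:
        return 0                       # trivial step: act directly, no imagination
    return min(K_MAX, difficulty)      # harder spots get deeper imagination

print(choose_horizon(remaining_subgoals=1, recent_failures=0))  # -> 0 (easy hallway move)
print(choose_horizon(remaining_subgoals=3, recent_failures=1))  # -> 5 (tricky wash-then-place)
```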
- Imagine-then-Plan (ITP) 🍞 Top Bread (Hook): You know how you might daydream a plan ("If I do A, then B will happen") and only then choose what to do now? 🥬 Filling (The Concept): ITP is a framework where the agent first imagines multiple future steps with a world model, then plans the immediate action using both what it sees and what it imagined.
- How it works:
- Observe the current state.
- Choose an adaptive horizon K.
- Roll out K imagined steps (actions and resulting states) using the world model.
- Reflect on the imagined futures to detect progress or conflicts.
- Pick the next action.
- Why it matters: Without this two-stream view (present + imagined), decisions can't easily account for long-term consequences. 🍞 Bottom Bread (Anchor): The agent imagines that heating food before washing will fail the goal sequence, so it plans to wash first, then heat. (See the end-to-end sketch below.)
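An end-to-end toy sketch of the loop just described: observe, pick K, imagine up to K steps of a draft plan, reflect on the imagined trajectory, then act. Every helper here is a hypothetical stand-in for the learned components.

```python
def world_model(obs: str, action: str) -> str:
    """Stand-in predictor: flags a conflict if heating happens before washing."""
    if action == "heat cup" and "clean" not in obs:
        return obs + " | conflict: the cup is still dirty"
    if action == "wash cup":
        return obs + " | the cup is now clean"
    return obs + f" | after '{action}'"

def choose_horizon(obs: str) -> int:
    return 3 if "dirty" in obs else 0          # crude adaptive rule for illustration

def imagine_then_plan(obs: str, draft_plan: list[str]) -> str:
    k = choose_horizon(obs)                    # 1. adaptive horizon
    trajectory, state = [], obs
    for action in draft_plan[: max(k, 1)]:     # 2. imagine up to k steps of the draft plan
        state = world_model(state, action)
        trajectory.append((action, state))
    # 3. reflect: if the imagined future flags a conflict, fix the ordering first
    if any("conflict" in s for _, s in trajectory):
        return "wash cup"
    return draft_plan[0]                       # 4. otherwise commit to the planned first move

print(imagine_then_plan("the cup is dirty", ["heat cup", "place cup on shelf"]))  # -> "wash cup"
```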
- Partially Observable and Imaginable MDP (POIMDP) 🍞 Top Bread (Hook): It's like navigating with both your flashlight (what you see now) and a sketch you just drew of what's probably ahead (what you imagine). 🥬 Filling (The Concept): POIMDP is a decision model where the agent conditions its policy on two streams: the observed present and the imagined future trajectory.
- How it works:
- Keep the usual partial observation and history.
- Add imagined future steps from the world model.
- Choose the next action using both inputs.
- Why it matters: Without explicitly including imagined futures, the agent can't properly weigh likely consequences. 🍞 Bottom Bread (Anchor): The policy reads the current room text plus an imagined "foresight" snippet and selects an action that avoids predicted bottlenecks. (One way to assemble that input is sketched below.)
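One way to picture the two input streams at the text level: assemble the policy's input from the current observation plus the imagined foresight snippet. The template and field names are assumptions for illustration, not the paper's exact prompt.

```python
def build_policy_input(goal: str, observation: str, imagined_trajectory: list[str]) -> str:
    """Combine the observed present with the imagined future into one policy input."""
    foresight = "\n".join(f"  ({i + 1}) {step}" for i, step in enumerate(imagined_trajectory))
    return (
        f"Goal: {goal}\n"
        f"Current observation: {observation}\n"
        f"Imagined future ({len(imagined_trajectory)} steps):\n{foresight}\n"
        f"Next action:"
    )

print(build_policy_input(
    goal="put a clean cup on the shelf",
    observation="You are in the kitchen. The cup is dirty.",
    imagined_trajectory=["go to sink", "turn on faucet", "wash cup"],
))
```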
The "Aha!" Moment (One sentence): If agents plan with what they can see now and what they just imagined ahead, adjusting how far they imagine each time, they make smarter, safer, and more efficient choices.
Multiple Analogies (3 ways):
- Chess: Don't just look at the board; simulate a few lines ahead, and simulate farther only in critical positions.
- Driving: Glance near for routine cruising, look far ahead at intersections or in storms.
- Cooking: For a sandwich, no need to pre-visualize every step; for a fancy cake, mentally run through mixing, baking, cooling, and frosting.
Before vs After:
- Before: Agents reacted to observations, or imagined a fixed number of steps regardless of context.
- After: Agents imagine just enough steps when needed, blend foresight with the present, and avoid many preventable mistakes.
Why It Works (Intuition):
- Imagined futures reveal hidden dependencies (e.g., "wash before place," "turn on light before examine").
- Adaptive K controls error and cost: deep lookahead only when it helps, shallow otherwise.
- Treating imagined futures as inputs (POIMDP) gives the policy a direct way to consider consequences when acting.
Building Blocks:
- World model to simulate textual next states.
- Adaptive horizon chooser (simple reflection in ITP-I; learned predictor in ITP-R).
- Policy that conditions on both observation and imagined trajectory.
- Optional reinforcement learning to jointly optimize actions and horizon choices.
03 Methodology
High-level recipe: Input (task + current state) → [Decide horizon K] → [Imagine K-step future with world model] → [Reflect and Plan] → Output (next action).
We'll introduce two extra helper concepts used in the trained version (ITP-R):
- K-head Predictor 🍞 Top Bread (Hook): Think of a small dial that sets how far to look ahead based on how tricky things look right now. 🥬 Filling (The Concept): The K-head predictor is a light module on top of the agent that outputs a distribution over horizons (K = 0..Kmax).
- How it works:
- Read the current state/history.
- Score each possible K.
- Sample or pick the K that balances predicted success and cost.
- Why it matters: Without a horizon chooser, the agent can't automate when to imagine more or less. 🍞 Bottom Bread (Anchor): In a simple hallway step, it outputs K=0; at a multi-appliance step (sink + microwave), it sets K=5. (A toy module sketch follows.)
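A minimal PyTorch sketch of a K-head: a small linear layer over a state embedding that yields a distribution over horizons 0..Kmax. The sizes and sampling rule are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

K_MAX = 5

class KHead(nn.Module):
    def __init__(self, hidden_size: int = 768, k_max: int = K_MAX):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, k_max + 1)   # one logit per horizon 0..k_max

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.scorer(state_embedding), dim=-1)

k_head = KHead()
state_embedding = torch.randn(1, 768)                 # stand-in for the agent's hidden state
probs = k_head(state_embedding)                       # distribution over K = 0..5
k = torch.multinomial(probs, num_samples=1).item()    # sample a horizon for this step
print(probs, k)
```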
- Advantage Actor-Critic (A2C) 🍞 Top Bread (Hook): Imagine a coach (critic) telling a player (actor) how much better a move was than average, so the player learns faster. 🥬 Filling (The Concept): A2C is a reinforcement learning method where the actor proposes actions and the critic estimates how good the situation is to guide learning.
- How it works:
- Actor picks horizon K and action a.
- Critic estimates value; compute advantage (how much better than expected).
- Update actor to favor better-than-expected choices.
- Why it matters: Without advantage guidance, learning which horizons/actions truly help would be slower and noisier. 🍞 Bottom Bread (Anchor): If imagining 4 steps leads to a big reward bump vs. imagining 1 step, A2C increases the chance of choosing 4 next time. (A one-step update sketch follows.)
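A compact PyTorch sketch of one A2C update step with an entropy bonus, using toy linear networks. The paper applies this kind of update to both the action choice and the K-head; the networks and coefficients below are stand-ins.

```python
import torch
import torch.nn as nn

actor = nn.Linear(8, 4)         # toy policy head over 4 discrete choices
critic = nn.Linear(8, 1)        # toy value function
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

state = torch.randn(1, 8)                                    # stand-in state features
dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()

reward, next_value = torch.tensor(1.0), torch.tensor(0.0)    # pretend the episode ends here
value = critic(state).squeeze()
advantage = reward + 0.99 * next_value - value               # how much better than expected?

policy_loss = -(dist.log_prob(action) * advantage.detach())  # reinforce better-than-expected choices
value_loss = advantage.pow(2)                                # teach the critic to predict returns
entropy_bonus = dist.entropy()                               # keep exploring choices
loss = (policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```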
Now the full methodology, like a recipe:
Step 0: Warm up the base policy
- What: Fine-tune a language model on expert demonstrations so it can produce valid, sensible actions.
- Why: Without a decent starter policy, imagined futures won't be grounded in reasonable behavior.
- Example: In ALFWorld, after warm-up the agent reliably outputs actions like "go to sink," "turn on faucet," "wash apple."
Step 1: Train the world model
- What: Train a text world model to predict the next observation given the current observation and action.
- Why: Without a good predictor, imagination drifts and becomes misleading.
- Example: Input: (state: "You're in the kitchen; holding apple.", action: "open fridge"). Output: "The fridge door opens; shelves are visible." (A training-pair sketch follows.)
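A sketch of how such training pairs might be laid out as text for standard next-observation prediction with a language model. The field names and phrasing are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical training pairs: obs + action -> next observation.
training_pairs = [
    {
        "input": "Observation: You're in the kitchen; holding apple. Action: open fridge",
        "target": "The fridge door opens; shelves are visible.",
    },
    {
        "input": "Observation: The fridge is open. Action: put apple in fridge",
        "target": "You place the apple on the middle shelf.",
    },
]

# Standard causal-LM fine-tuning would concatenate input and target and minimize
# cross-entropy only on the target tokens, so the model learns to predict next observations.
for pair in training_pairs:
    print(pair["input"], "=>", pair["target"])
```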
Step 2: Define POIMDP decision-making
- What: For each time t, the agent can imagine K steps using the world model, producing an imagined trajectory (actions and future states). The policy then picks the next action conditioned on both current state and this imagined trajectory.
- Why: Without combining present and imagined futures, the policy can't properly anticipate conflicts.
- Example: If foresight shows that heating before washing blocks placement success, the policy chooses to wash first.
Two Variants of ITP:
A) ITP-I (training-free, inference-time learning)
- What happens:
- Adaptive horizon selection by reflection: read the goal and current state; decide K (0..Kmax) based on difficulty.
- World-model imagination: roll out K steps to get a short foresight text.
- Reflective policy generation: use the foresight as feedback; revise plan; choose next action.
- Why this step exists: It upgrades any LLM agent without retraining, letting it "peek ahead" before acting.
- Example: In ScienceWorld, the agent imagines heating water before measuring temperature and realizes the order matters for the goal rubric.
B) ITP-R (reinforcement-trained, adaptive)
- What happens (three stages):
Stage 1: Pseudo-label horizons
- Use expert actions and the frozen world model to simulate several candidate horizons (0..Kmax).
- Score which K makes the expert action most likely (with a small penalty for larger K) and assign that K as a pseudo-label.
- Why: Experts don't tell us how far to imagine; this builds training targets for K.
- Example: If K=3 best explains an expert's "turn on light then examine" step, label K=3. (A labeling sketch follows.)
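A small sketch of this pseudo-labeling rule: score each candidate horizon by how likely it makes the expert action, subtract a per-step penalty, and keep the best K. The likelihood function below is a hypothetical stand-in for the policy's score of the expert action given K steps of foresight.

```python
K_MAX, PENALTY = 5, 0.1   # assumed maximum horizon and per-step penalty

def expert_action_loglik(obs: str, foresight_steps: int, expert_action: str) -> float:
    """Stand-in scorer: pretend deeper foresight helps up to 3 steps, then plateaus."""
    return -1.0 + 0.3 * min(foresight_steps, 3)

def pseudo_label_k(obs: str, expert_action: str) -> int:
    """Pick the horizon that best explains the expert action, minus a cost for depth."""
    scores = {
        k: expert_action_loglik(obs, k, expert_action) - PENALTY * k
        for k in range(K_MAX + 1)
    }
    return max(scores, key=scores.get)

print(pseudo_label_k("dark room, need to examine the mug", "turn on light"))  # -> 3
```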
Stage 2: Warm-up with lookahead
- Train the policy to act given the imagined trajectory for the labeled K, and train the K-head predictor to predict that K.
- Why: Without this, the K-head starts from scratch and picks poor horizons; the policy won't learn to read foresight well.
- Example: The policy learns to parse a foresight snippet like "(1) go to sink; (2) turn on faucet; (3) wash cup" and then choose "wash cup." (A toy loss sketch follows.)
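A toy PyTorch sketch of the warm-up objective: a supervised loss on the expert action given the K-labeled foresight, plus a cross-entropy loss teaching the K-head its pseudo-labeled horizon. Networks, sizes, and ids are stand-ins, not the paper's actual heads.

```python
import torch
import torch.nn as nn

policy_head = nn.Linear(16, 10)    # toy classifier over 10 action ids
k_head = nn.Linear(16, 6)          # logits over K = 0..5
ce = nn.CrossEntropyLoss()

features = torch.randn(1, 16)                   # stand-in encoding of obs + foresight
expert_action_id = torch.tensor([4])            # supervised action target
pseudo_k = torch.tensor([3])                    # pseudo-labeled horizon from Stage 1

# Joint warm-up loss: imitate the expert action and predict the labeled horizon.
loss = ce(policy_head(features), expert_action_id) + ce(k_head(features), pseudo_k)
loss.backward()
print(loss.item())
```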
Stage 3: Online A2C optimization
- At run-time, sample K from the K-head, generate foresight, then choose an action.
- Reward = environment reward − (penalty for larger K) − (small per-step cost).
- Update both the K-head and the policy using A2C with an entropy term to keep exploring horizons.
- Why: This teaches the agent not only what to do, but also when deeper imagination is worth it.
- Example: If K=5 keeps paying off at tricky junctions but is wasteful on easy steps, the learner assigns high K only at tricky spots.
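A minimal sketch of the shaped reward from the recipe above: environment reward minus a penalty that grows with the chosen horizon K and a small per-step cost. The coefficients are illustrative assumptions, not the paper's values.

```python
def shaped_reward(env_reward: float, k: int,
                  horizon_penalty: float = 0.05, step_cost: float = 0.01) -> float:
    """Environment reward minus a horizon penalty and a small per-step cost."""
    return env_reward - horizon_penalty * k - step_cost

print(shaped_reward(env_reward=1.0, k=5))  # deep imagination on a successful step: 0.74
print(shaped_reward(env_reward=0.0, k=5))  # deep imagination that didn't pay off: -0.26
print(shaped_reward(env_reward=0.0, k=0))  # a cheap, shallow step: -0.01
```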
Secret Sauce (What's clever?):
- Treat imagined futures as first-class inputs (POIMDP), not just side notes.
- Learn to choose K adaptively with real rewards and small penalties so imagination is deep where it counts and shallow where it doesn't.
- Provide a plug-and-play option (ITP-I) and a stronger trained option (ITP-R) to fit different deployment needs.
04 Experiments & Results
The Test: What's measured and why
- Metric: Success Rate (SR), the percent of tasks fully solved. This captures whether the agent's whole plan, not just a step or two, works end-to-end.
- Compute: They also track token budgets to measure how costly the planning is. This matters because longer lookahead costs more tokens.
Benchmarks and Setup
- ALFWorld: Text-based household tasks (Pick, Clean, Heat, Cool, Look, Pick2). Average episode ~8 steps.
- ScienceWorld: Text-based science experiments (seen and unseen splits). Average episode ~15 steps.
- Backbones: Qwen2.5-7B, Qwen3-8B, Llama-3.1-8B. Same backbones used across methods for fairness.
- Baselines: Prompting methods (CoT, ReAct, RAP) and training-based methods (SFT, WKM, IWM).
The Competition: Who they beat
- Prompting baselines are strong at reasoning but don't adapt lookahead.
- Training baselines either clone expert behavior (SFT) or internalize dynamics/knowledge (IWM, WKM) without explicit adaptive foresight.
The Scoreboard (with context)
- Training-free ITP-I: With no extra training, simply adding imagination lifts performance well above prompting baselines. For example, on Qwen2.5-7B in ALFWorld, ITP-I roughly doubles ReAct's overall SR (about 35.7% vs. 17.1%), like jumping from a D to a solid C+/B- just by peeking ahead.
- Trained ITP-R: Consistently the top performer among trained methods across backbones. On ALFWorld with Qwen3-8B, ITP-R reaches around 88.6% SR overall, which is like scoring an A when others hover at B levels. On ScienceWorld with Llama-3.1-8B, ITP-R hits about 63.9% in seen splits, ahead of other training-based competitors.
- Per-task improvements: ITP-R often leads or ties for best across Pick, Clean, Heat, Cool, Look, Pick2, showing robustness to different long-horizon recipes.
Surprising/Insightful Findings
- Adaptive lookahead beats fixed horizons on success and cost
- Fixed k: Performance peaks at a medium k and then drops, while compute rises steeply as k grows.
- Adaptive: Keeps high SR without tuning a global k and uses far fewer tokens on average, like getting higher grades while studying fewer wasted hours.
- Adaptive lookahead beats random horizons
- Randomly varying k is not enough. The state-aware choice of K in ITP dominates random choice, achieving higher SR with lower and more stable budgets. This shows that "when" you imagine matters as much as "how much."
- Reinforcement fine-tuning for K is critical
- Removing the online A2C stage (w/o RT) causes large drops: e.g., from ~88.6% to ~71.4% SR in ALFWorld and from ~59.7% to ~46.0% in ScienceWorld. This is like going from an A to a C just by skipping that training step, proving that learning the timing of imagination is a big deal.
- World model choice matters (especially for ITP-I)
- Training-free ITP-I depends heavily on how good the world model is at predicting futures: stronger, well-aligned world models yield better gains.
- After reinforcement training (ITP-R), even a less specialized world model can become competitive, because the policy learns when and how to rely on foresight.
Bottom line: Across two different domains and multiple backbones, Imagine-then-Plan with adaptive lookahead wins on success rate while being smarter about compute, particularly when reinforcement learning teaches the agent how deep to imagine at each moment.
05 Discussion & Limitations
Limitations
- Modality gap: Evaluations are in text environments. Real robots see images and feel forces; visual/sensor noise could challenge imagination quality and horizon choices.
- Overhead: Imagining futures costs extra tokens and time. While adaptive penalties help, fully real-time settings may still need faster or distilled world models.
- World model bias: If the world model hallucinates, the agent can be misled. Bad foresight is often worse than no foresight.
- Generalization: Adaptive K tuned for ALFWorld/ScienceWorld might need retuning for other domains with very different step structures.
Required Resources
- A capable language model for the policy and another (or the same fine-tuned) for the world model.
- Expert demonstrations to warm-start the policy and world model training data from both expert and agent rollouts.
- Compute for training (including A2C) and for inference-time imagination.
When NOT to Use
- Ultra-low-latency scenarios where any added planning cost is unacceptable.
- Environments where the world model cannot be trained well (e.g., extremely chaotic or adversarial dynamics), making imagination unreliable.
- Tasks with strictly one-step dependencies where imagination brings little value.
Open Questions
- Multimodal extension: How to build vision-and-touch world models that are accurate enough for adaptive lookahead in the physical world?
- Safety under uncertainty: How can the agent detect and mitigate when the world model is uncertain or wrong (e.g., confidence-aware horizons)?
- Learning to imagine efficiently: Can we compress or cache foresight, or use speculative decoding, to keep benefits while slashing cost?
- Curriculum for K: How to train horizon selection that scales to very long tasks without exploding compute?
- Tool integration: How to combine learned world models with external simulators or structured planners for even safer, stronger foresight?
06 Conclusion & Future Work
3-Sentence Summary: This paper introduces Imagine-then-Plan (ITP), where agents first imagine multiple future steps using a learned world model and then plan the next move, with an adaptive mechanism deciding how far to look ahead each time. By formalizing decisions as a POIMDP that blends what's observed now with what's imagined next, the agent avoids shallow grounding and reduces costly mistakes. Two versions, training-free ITP-I and reinforcement-trained ITP-R, show strong gains across ALFWorld and ScienceWorld.
Main Achievement: They show that explicitly using imagined futures as inputs, and learning when to imagine deeply versus shallowly, significantly boosts success rates while keeping computation in check, with ITP-R delivering the largest improvements.
Future Directions:
- Extend to multimodal (vision, sensors) and real robots with robust, efficient world models.
- Add uncertainty-aware imagination and confidence-guided horizons for safety.
- Speed up inference via distillation, caching, and speculative decoding to meet real-time demands.
Why Remember This: ITP turns "thinking ahead" from a nice idea into a practical, adaptive skill: agents don't just react; they preview consequences and pick smarter actions. This shift from shallow grounding to foresightful planning is a principled step toward trustworthy, efficient agents that handle long, complex tasks with fewer surprises.
Practical Applications
- Home robots that preview multi-step chores (wash, heat, place) and avoid unsafe orders.
- Web automation agents that try short imagined navigation paths before clicking, reducing errors.
- Customer support bots that foresee conversation branches and choose questions that unlock progress fastest.
- Educational science tutors that simulate experiment outcomes before suggesting the next step to students.
- Software coding assistants that sketch possible build/test outcomes before choosing a refactor path.
- Warehouse pick-and-place planners that simulate aisle traffic and equipment availability to avoid bottlenecks.
- Healthcare scheduling assistants that imagine downstream conflicts before booking procedures or rooms.
- Smart HVAC/energy managers that preview comfort and cost effects before committing to control actions.
- Game-playing agents that adapt lookahead depth at critical turns, saving compute while improving win rates.
- Data labeling/annotation agents that imagine downstream model impacts to choose the most valuable next label.