
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Intermediate
Shoubin Yu, Yue Zhang, Zun Wang et al. · 2/9/2026
arXiv

Key Summary

  • Visual spatial reasoning often fails when a model only looks at one picture and must imagine new viewpoints.
  • Always turning on a world model to ‘imagine’ more views wastes computation and can even hurt accuracy.
  • AVIC is a method that first decides if imagination is needed and then chooses how much to imagine.
  • A small, targeted amount of imagination (about 1–2 views) usually helps most; more can add noise.
  • AVIC uses a policy to gate world-model calls and to plan short action sequences for new views.
  • A verifier picks the best imagined trajectory so the QA model reasons over consistent evidence.
  • On SAT and MMSI, AVIC matches or beats fixed strategies while using far fewer tokens and world-model calls.
  • On R2R navigation, AVIC improves success and efficiency by imagining only when it reduces ambiguity.
  • Imagination is most useful for action-conditioned questions (e.g., 'after turning 90°, what do I face?').
  • Treating imagination as a selective, test-time resource leads to more reliable and efficient spatial reasoning.

Why This Research Matters

Robots, AR assistants, and navigation apps often need to reason about parts of a scene they can’t currently see. If they imagine too much, they waste battery and time and may even be fooled by low-quality guesses; if they don’t imagine enough, they stay blind to crucial details. This work shows how to treat imagination as a selective tool, switching it on briefly and precisely when it clarifies the scene. That makes AI helpers faster, cheaper to run, and more trustworthy. It also points toward safer embodied AI, because the model avoids drowning in noisy, generated views. In the long run, this adaptive strategy can guide smarter exploration in homes, warehouses, hospitals, and smart glasses that assist people in real time.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine taking a picture while standing in a room and then someone asks, “If you turn right, will you face the door?” You can’t see the door now, so you picture what the room would look like after turning.

🥬 The Concept (Multimodal Large Language Models, MLLMs):

  • What it is: MLLMs are AI systems that can read text, look at images, and answer questions using both.
  • How it works: 1) They look at an image and read the question. 2) They match patterns they’ve learned from lots of examples. 3) They generate an answer in words.
  • Why it matters: Without MLLMs, computers struggle to connect what they see (pictures) with what we ask (text questions). 🍞 Anchor: When you show an AI a photo of a kitchen and ask, “Where is the fridge relative to the sink?”, an MLLM tries to answer by looking at both the picture and the words.

🍞 Hook: You know how you close your eyes and imagine how your living room looks from the other side of the couch?

🥬 The Concept (Visual Imagination):

  • What it is: Visual imagination lets AI “picture” new views it hasn’t actually seen yet.
  • How it works: 1) Start with what’s visible. 2) Use a learned model of the world to predict how things look from another angle. 3) Produce new, imagined images that follow the scene’s rules.
  • Why it matters: Without imagination, the AI is stuck with only what the camera shows, missing hidden or off-camera details. 🍞 Anchor: If the picture shows a couch but not the TV, imagination can help “peek” from a turned viewpoint to see if the TV would be on the right.

🍞 Hook: Think of a video game that simulates physics so you can predict where a rolling ball will go.

🥬 The Concept (World Models):

  • What it is: World models are generators that simulate how the world could look after actions like turning or moving.
  • How it works: 1) Take the current view. 2) Apply an action (e.g., turn 90°). 3) Render a new view that’s consistent with the scene.
  • Why it matters: Without world models, the AI can’t create useful imagined views to reason about unseen angles. 🍞 Anchor: A world model can create a new image showing what you’d see if you took two steps forward in a hallway.

🍞 Hook: When you take a test, you don’t spend ten minutes on every question—you adjust time based on difficulty.

🥬 The Concept (Test-Time Scaling, TTS):

  • What it is: TTS is choosing how much extra thinking or computation to spend on each question during inference.
  • How it works: 1) Estimate difficulty. 2) If needed, add steps like self-consistency or imagination. 3) Stop when confident.
  • Why it matters: Without TTS, models waste time on easy questions and still fail on hard ones. 🍞 Anchor: For a tough spatial question, the model may use more imagination; for an obvious one, it answers quickly.
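As a toy illustration of the TTS idea, here is a minimal sketch of spending extra inference only on hard questions. The function name `answer_once`, the confidence threshold of 0.8, and the retry budget are illustrative assumptions, not values from the paper.

```python
def answer_with_tts(question, answer_once, max_extra=4, threshold=0.8):
    """Spend extra inference steps only while the model looks unsure.

    `answer_once` is a stand-in for one model call returning
    (answer, confidence); the threshold and budget are illustrative.
    """
    answer, confidence = answer_once(question)
    extra_used = 0
    # Easy question: accept the quick answer. Hard question: keep
    # spending, up to a fixed budget, until confidence is high enough.
    while confidence < threshold and extra_used < max_extra:
        answer, confidence = answer_once(question)  # e.g., resample or add reasoning
        extra_used += 1
    return answer, extra_used
```

A confident first answer costs nothing extra; an unsure one triggers a bounded amount of additional computation, which is the dial the paper turns for imagination.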

The world before this paper: MLLMs got better at describing images and answering simple questions but struggled with spatial reasoning that depends on alternate viewpoints, camera rotations, or future states. Researchers attached world models to generate new views, but most systems always turned this on and generated many views regardless of the question. That led to three recurring problems: (1) Unnecessary imagination when the answer was already visible; (2) Misleading imagination when the generator dropped or distorted important objects; (3) Expensive computation from calling the world model many times.

The problem: How can we know when imagination is truly needed and how much of it is helpful before it becomes noise?

Failed attempts: Fixed, always-on strategies assumed more imagined views = better reasoning. But analysis showed: most cases (54%) didn’t need imagination; only 14% benefited; 9% got worse; and performance didn’t steadily improve with more views. Plus, it burned 3–9× more tokens and ~30× more time in some settings.

The gap: A way to decide, per question, whether to imagine at all, and if so, to plan only a small, targeted set of actions to obtain just-enough views.

Real stakes: In daily life, a home robot must decide when to “peek” around a corner, a phone AR assistant should avoid wasting battery rendering pointless views, and a navigation agent should only simulate turns that clarify the next move. Getting this wrong means slower, costlier, and sometimes less accurate assistance.

02Core Idea

🍞 Hook: You don’t use a flashlight at noon, but you do in a dark closet. Tools work best when used only when needed.

🥬 The Concept (AVIC – Adaptive Visual Imagination Control):

  • What it is: AVIC is a strategy that first decides if imagination is needed and then decides how much to imagine.
  • How it works: 1) A policy checks if the current view is sufficient. 2) If not, it plans a few specific actions (turn/step) for the world model. 3) A verifier picks the best imagined trajectory. 4) The QA model answers using the original + selected imagined views.
  • Why it matters: Without AVIC, models either imagine too much (wasteful, risky) or too little (miss hidden evidence). 🍞 Anchor: If the question is “After turning right, will I face the store entrance?”, AVIC imagines a small right-turn sequence and then answers with confidence.

Three analogies for the key insight (imagination as a dial):

  1. Chef’s spice rack: Add a pinch when the dish needs it; don’t dump the whole jar. AVIC adds just-enough views.
  2. Detective’s notepad: Only take extra notes when clues are missing. AVIC imagines new angles only when the scene is ambiguous.
  3. Flashlight at dusk: Turn it on briefly to check a shadowy corner. AVIC makes short, targeted simulations.

Before vs after:

  • Before: Always-on imagination generated many views per question, hoping more would help.
  • After: AVIC asks “Is the current evidence enough?” If yes, skip imagination. If no, plan 1–6 precise actions, verify the best imagined sequence, and reason over that single, coherent trajectory.

Why it works (intuition, not equations):

  • Spatial questions differ: some are solvable from the current image; others require seeing from a new angle or after a rotation. Treating all questions the same wastes compute and can pollute reasoning with noisy frames.
  • Short, purposeful action plans maximize signal (revealing the missing angle) and minimize noise (fewer chances for generation errors).
  • Verifying an entire imagined trajectory preserves temporal and geometric consistency—important for left/right, facing direction, and object permanence.

Building blocks introduced with the sandwich pattern:

🍞 Hook: Like a traffic light that decides when cars should stop or go to keep things smooth and safe. 🥬 The Concept (Gating Mechanism):

  • What it is: A decision step that either skips imagination or calls the world model.
  • How it works: 1) Inspect question + image(s). 2) Predict SKIP or CALL. 3) Use majority vote over multiple policy samples for stable decisions.
  • Why it matters: Without gating, imagination may be wasted on easy questions or missed on hard ones. 🍞 Anchor: If the bathtub faucet is already visible, gating says SKIP; no extra views needed.

🍞 Hook: Like a coach drawing a short play for the next few seconds, not the whole game. 🥬 The Concept (Policy Model):

  • What it is: The brain that decides whether to imagine and, if yes, plans a short sequence of actions (turn/move).
  • How it works: 1) Consider the question’s demands (e.g., rotations, perspective). 2) Propose 1–6 actions. 3) Avoid cancels (no left-then-right zigzags).
  • Why it matters: Without a plan, imagination is random and may miss the critical view. 🍞 Anchor: For “turn 90° right,” the policy might plan ten 9° right-turns to approximate 90°.

🍞 Hook: Think of choosing the best clip from a short playlist instead of picking scattered frames from many songs. 🥬 The Concept (Trajectory Verification):

  • What it is: A checker that scores whole imagined action sequences and keeps only the best one.
  • How it works: 1) Evaluate helpfulness, sharpness, and consistency. 2) Rank candidate trajectories. 3) Keep the top pick for QA.
  • Why it matters: Without trajectory-level checks, mixing unrelated frames can break spatial logic. 🍞 Anchor: If one imagined turn sequence clearly reveals the windows on the left, the verifier keeps that whole mini-clip.

Overall, the “aha!” is simple: Imagination is a limited, question-dependent resource; use it only when needed, and only as much as needed.

03Methodology

High-level recipe: Input (egocentric image(s), question) → Policy gating + short action planning → World model renders imagined views → Trajectory verifier chooses the best sequence → QA model answers using original + chosen imagined views.

Step 1: Policy gating — decide SKIP or CALL

  • What happens: The policy model reads the question and looks at the current view(s). It samples M times (e.g., 5) to produce SKIP/CALL votes, then uses majority vote to reduce flukes.
  • Why it exists: Some questions are already answerable; calling the world model would add cost and potential confusion. Others truly need another angle.
  • Example: Question: “From where I stand, is the window to my left or right?” If the window edge and frame are clearly visible on the left, policy predicts SKIP.
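The gating step above can be sketched as a majority vote over repeated policy samples. Here `sample_policy` stands in for a prompted MLLM call; its signature is an assumption for illustration.

```python
from collections import Counter

def gate(question, image, sample_policy, m=5):
    """Majority-vote SKIP/CALL over m policy samples.

    `sample_policy` is a placeholder for a prompted MLLM returning
    the string "SKIP" or "CALL"; repeated sampling reduces flukes.
    """
    votes = [sample_policy(question, image) for _ in range(m)]
    decision, _ = Counter(votes).most_common(1)[0]
    return decision
```

Sampling the same policy several times and voting is the self-consistency trick mentioned later: it stabilizes the SKIP/CALL decision without any retraining.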

Step 2: If CALL, plan a short sequence of actions

  • What happens: The policy proposes 1–6 discrete actions (e.g., turn-right 9°, move-forward 0.25 m), following monotonic turns (no left-then-right). Multiple policy samples can yield different candidate plans.
  • Why it exists: We want precise, just-enough imagination to reveal missing evidence—especially for action-conditioned or perspective questions.
  • Example: Question: “If I turn right by 90°, will I see the store entrance?” The plan might be ten ‘turn-right 9°’ actions.
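The planning step can be sketched as decomposing a hypothetical turn into small same-direction actions and rejecting self-cancelling plans. The action strings and 9° step size follow the paper's example; the helper names are illustrative.

```python
def plan_turn(degrees, step=9):
    """Decompose a turn into small, monotonic discrete actions.

    A positive angle maps to right-turns, negative to left-turns,
    e.g. 90° -> ten 'turn-right 9°' actions, as in the paper's example.
    """
    direction = "turn-right" if degrees > 0 else "turn-left"
    n = round(abs(degrees) / step)
    return [f"{direction} {step}°"] * n

def is_monotonic(actions):
    """Reject plans that cancel themselves (no left-then-right zigzags)."""
    turns = {a.split()[0] for a in actions if a.startswith("turn")}
    return not ({"turn-left", "turn-right"} <= turns)
```

Monotonic plans keep the imagined camera sweeping in one direction, which both reveals the missing angle and limits how far the generator can drift.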

Step 3: World model renders imagined trajectories

  • What happens: For each candidate plan, the world model synthesizes a short trajectory of new views along that action sequence.
  • Why it exists: The agent needs visual evidence that matches the question’s hypothetical actions (e.g., after turning or stepping forward).
  • Example: After 3 turns, the entrance sign appears; after 5 turns, it’s centered—these frames form one trajectory.

Step 4: Trajectory-level verification

  • What happens: A verifier scores each entire trajectory for usefulness, visual quality, and consistency with the question. It picks one best trajectory and discards the rest.
  • Why it exists: Selecting a few isolated frames can break spatial continuity; entire sequences keep left/right and facing direction consistent across steps.
  • Example: Two candidate trajectories both show a door, but one keeps the counter visible and stable across frames; the verifier picks that one.
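Trajectory selection can be sketched as ranking whole candidate sequences by a combined score. The three criteria come from the step above, but the equal weighting and `score_fn` signature are assumptions; the paper uses a prompted verifier rather than hand-set weights.

```python
def select_trajectory(candidates, score_fn):
    """Keep the single best whole trajectory.

    `score_fn` is a stand-in for the verifier and returns a tuple
    (helpfulness, sharpness, consistency), each in [0, 1]; equal
    weighting here is an illustrative simplification.
    """
    return max(candidates, key=lambda traj: sum(score_fn(traj)))
```

Scoring the sequence as a unit, instead of cherry-picking frames, is what preserves left/right relations and facing direction across steps.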

Step 5: Final QA with augmented evidence

  • What happens: The QA model answers using the original observation plus the single verified imagined trajectory (or just the original if SKIP).
  • Why it exists: The QA model integrates the most reliable visual evidence to make a final decision, avoiding noise from extra, low-quality views.
  • Example: The model answers, “Right,” because the verified trajectory shows the windows moving to the right field of view after the turn.

What breaks without each step:

  • No gating: You burn time/tokens on easy questions and risk misleading views.
  • No planning: You might imagine the wrong angles and miss key evidence.
  • No verifier: You might feed inconsistent sequences and confuse spatial relations.
  • No QA fusion: You can’t combine current and imagined evidence coherently.

Concrete walkthrough with data:

  • Input: Single living-room photo; Q: “If I stand facing the couch then turn left 90°, will I face the TV?” Choices: A) Yes B) No.
  • Gating: CALL (the TV isn’t visible; it’s possibly off-camera).
  • Plan: ten ‘turn-left 9°’ actions.
  • World model: Renders 10 frames; around frame 8, the TV appears front-left and centers by frame 10.
  • Verifier: Scores this trajectory high for revealing the missing evidence.
  • QA: Uses original + verified sequence; answers A) Yes.

The secret sauce:

  • Imagination as a selective, per-question budget: Spend compute only when the question truly needs hypothetical views.
  • Short, monotonic action plans: Minimal yet targeted exploration reduces generator errors and keeps geometry stable.
  • Trajectory-level verification: Prioritizes temporal consistency, crucial for left/right, facing, and object permanence.
  • Self-consistency voting at the policy level: Stabilizes SKIP/CALL under uncertainty without retraining.

Safety rails and efficiency tricks:

  • Cap action length (1–6) to limit drift and runtime.
  • Penalize redundant or blurry trajectories in the verifier.
  • Default to SKIP when confidence is high that the answer is visible.
  • Use the same base MLLM for policy, verifier, and QA with different prompts to avoid extra training.

Putting it together: Input → (Policy votes SKIP/CALL) → If CALL, plan 1–6 actions → World model renders trajectories → Verifier picks the best whole trajectory → QA fuses original + chosen imagined views → Final answer.
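The full loop above can be sketched end to end. Every callable here (gate, planner, world model, verifier, QA) is a stand-in for a prompted model call; the names and signatures are assumptions for illustration, not the paper's API.

```python
from collections import Counter

def avic_answer(question, views, gate_fn, plan_fn, world_model, verifier, qa, m=5):
    """Minimal sketch of the AVIC pipeline (all callables are stand-ins)."""
    # 1) Gate: majority-vote SKIP/CALL over m policy samples.
    votes = [gate_fn(question, views) for _ in range(m)]
    decision, _ = Counter(votes).most_common(1)[0]
    if decision == "SKIP":
        return qa(question, views)                 # current evidence suffices
    # 2) Plan a few short candidate action sequences (1-6 actions each).
    plans = plan_fn(question, views)
    # 3) Render one imagined trajectory per candidate plan.
    trajectories = [world_model(views, p) for p in plans]
    # 4) Trajectory-level verification: keep the single best sequence.
    best = max(trajectories, key=verifier)
    # 5) Answer over the original views plus the verified trajectory.
    return qa(question, views + best)
```

With stubs in place of real models, a CALL decision routes through plan → render → verify → QA, while a SKIP answers directly from the original view.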

04Experiments & Results

The tests: The authors evaluated on two spatial reasoning benchmarks (SAT, MMSI) and one navigation benchmark (R2R). They measured accuracy (for QA), tokens and runtime (efficiency), world-model calls (how often imagination was used), and standard nav metrics (NE, OSR, SR, SPL).

The competition: Baselines included (a) no imagination, answer directly; and (b) always-on/dense imagination (e.g., MindJourney), which calls the world model for many views every time. AVIC competed by deciding when to imagine and limiting how much to imagine with planned actions and verification.

Scoreboard with context:

  • SAT (with GPT-4.1): Baseline 74.0% → AVIC 79.3%. That’s like moving from a solid B to a strong A−, while using far fewer tokens and <1 world-model call on average (~0.73) compared to ~12.34 in always-on.
  • SAT (with o1): Baseline 74.6% → AVIC 85.3% (best overall). That’s a big jump, like improving a test grade by more than a full letter.
  • MMSI: AVIC consistently lifted average scores across positional, attribute, and motion sub-tasks, showing generality beyond a single dataset.
  • Efficiency: Always-on imagination added 3–9× more tokens and ~30× more runtime for only modest accuracy gains. AVIC achieved similar or better accuracy using roughly 10% of the tokens, calling the world model selectively on fewer than half of questions, with ~30 s average runtime.

Surprising findings:

  • More is not always better: Accuracy did not rise steadily with more imagined views; after 1–2 helpful views, performance often plateaued or dropped due to redundant or noisy frames.
  • Most cases don’t need imagination: In a distribution study, 54% were already answerable from the original view, 14% genuinely benefited, and 9% were hurt by imagination, underscoring the need for gating.
  • Biggest gains are action-conditioned: World-model imagination helped most when the question depended on future or counterfactual actions (e.g., “after turning right 90°…”). Gains were much smaller when the task only required reframing the current view.

Upper bound insight: A “selective imagination oracle” that only calls the world model when it would help reached 75.3% vs 66.6% for always-on and 62.0% for baseline in one setting. This upper bound shows how much room there is for smarter policies.

Navigation (R2R): Plugging AVIC into a step-wise navigation system improved success (SR, OSR) and path efficiency (SPL), while reducing Navigation Error, compared to a strong MapGPT baseline using GPT-4o. The model chose to imagine only when views were ambiguous, creating clearer decisions with shorter, less wandering trajectories.

Takeaway: Treating imagination like a flashlight—switching it on only where it’s dark and pointing it precisely—beats keeping it on all the time.

05Discussion & Limitations

Limitations:

  • Policy alignment: The gating policy sometimes calls the world model on categories (e.g., egocentric movement) that aren’t the main error source, reducing recall/precision on cases that truly need imagination.
  • World model quality: If the generator drops small objects or blurs edges, imagined views can mislead reasoning. Performance depends on the fidelity and controllability of the visual world model.
  • Benchmark scope: Results are strong on SAT, MMSI, and R2R, but generalization to outdoor, long-horizon, or heavily dynamic scenes remains to be tested.
  • Verification ceiling: Trajectory-level scoring helps, but it’s still a heuristic at test time without extra training; better learned verifiers could further improve consistency.

Required resources:

  • A capable MLLM (policy/verifier/QA prompts), a controllable world model (e.g., Stable Virtual Camera), and GPUs for rendering imagined views. Expect ~30 seconds per example under the reported setup when imagination is used.

When not to use:

  • Clearly visible answers (e.g., “Which side is the bathtub faucet on?” when it’s in plain view). Gating should SKIP. Also avoid in low-trust environments where generated frames might be dangerously misleading (e.g., safety-critical robotics without fallbacks).

Open questions:

  • Error-aware gating: Can the policy explicitly detect error types (limited observability, viewpoint dependence, action-conditioned, dynamics) and tailor imagination accordingly?
  • Confidence calibration: How should the model decide it has imagined “enough” versus risking more noise?
  • Learning to verify: Can verifiers be jointly trained to predict downstream QA success rather than using hand-designed prompts?
  • Better world models: How do advances in action-conditioned video generation and 3D-consistent novel view synthesis boost reliability?
  • Budgeting compute: Can we formalize a per-question imagination budget that maximizes expected accuracy gain per token/second spent?

06Conclusion & Future Work

Three-sentence summary: This paper shows that always-on visual imagination is often unnecessary, sometimes harmful, and expensive for spatial reasoning. It introduces AVIC, which first decides whether to imagine and then plans just-enough actions to generate the most informative imagined views, keeping only the best trajectory for final reasoning. Across benchmarks, AVIC matches or outperforms fixed strategies while using far fewer tokens and world-model calls, especially shining on action-conditioned questions.

Main achievement: Framing imagination as an adaptive, uncertainty-aware test-time resource—operationalized through policy gating, short action planning, and trajectory-level verification—so models get targeted new evidence only when it truly helps.

Future directions: Build error-aware gating that recognizes failure types, train verifiers end-to-end to predict QA success, improve 3D-consistent and action-accurate world models, and add principled compute budgeting. Explore broader embodied tasks and longer-horizon planning where selective imagination can guide multi-step decisions.

Why remember this: The lasting idea is simple and powerful—imagination is a dial, not a switch. By deciding when and how much to imagine, AI becomes more accurate, faster, and more reliable at spatial reasoning, just like a careful person who only peeks around corners when it actually clarifies the scene.

Practical Applications

  • Home robotics: Only imagine alternate views when deciding how to grasp an object around a corner or behind clutter.
  • AR navigation: Briefly simulate turns in indoor maps to confirm which hallway leads to the elevator before giving directions.
  • Warehouse automation: Plan a tiny set of camera rotations to verify shelf positions without scanning entire aisles.
  • Drones and inspection: Selectively imagine short moves to check line-of-sight around obstacles before committing to risky flight paths.
  • Assistive technology: For low-vision users, imagine minimal additional views to describe room layouts more accurately.
  • Education tools: Teach spatial concepts by showing just-enough imagined viewpoints to explain perspective changes.
  • Game AI and NPCs: Use targeted imagination to navigate mazes or rooms efficiently without exhaustive map reveals.
  • Search-and-rescue: Simulate a few steps or turns to assess the most promising route through a damaged building.
  • Smart security cameras: Imagine small angle shifts to resolve whether a person is approaching an exit or a restricted area.
  • In-app floor planning: Generate a couple of novel views to check furniture arrangement feasibility without full 3D reconstruction.
#Adaptive Test-Time Scaling#World Models#Visual Spatial Reasoning#AVIC#Gating Mechanism#Policy Planning#Trajectory Verification#Egocentric View#Action-Conditioned Reasoning#Novel View Synthesis#Multimodal Large Language Models#Stable Virtual Camera#MindJourney#Selective Imagination#Perspective Transformation