Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
Key Summary
- This paper shows that making short videos can help AI plan and reason in pictures better than writing out steps in text.
- The authors treat each video frame like a thinking step, so the model can "think in frames" from the start image to the goal image.
- They test two worlds: MazeNavigation (small changes, step-by-step paths) and TangramPuzzle (big changes, careful shape control).
- Across both worlds, the video model generalizes well to new situations it never saw during training (zero-shot).
- Giving the model visual context (like the exact agent icon or the real tangram pieces) acts like strong instructions that improve control.
- They discover a visual test-time scaling law: letting the model generate more frames at test time often improves maze planning, like giving it more thinking time.
- In mazes, the video model reaches A-level accuracy in-distribution and stays strong on bigger unseen mazes; longer videos help on long paths.
- In tangrams, text models struggle and image editing is strongest for final accuracy, but videos provide clear, step-by-step reasoning traces.
- A key limitation is keeping perfect shape geometry over many frames; too many frames can also exceed the model's temporal capacity.
- The work suggests video generation is not just for pretty media but a general, interpretable way to do visual reasoning and planning.
Why This Research Matters
This work turns visual reasoning into something you can see, not just read about, which makes AI plans easier to trust and fix. Robots that must move safely in homes, factories, or hospitals benefit from clear, step-by-step visual planning. Design and education tools can show students or users exactly how to solve puzzles or assemble parts, frame by frame. Assistive technologies can guide people through spatial tasks, like packing or furniture assembly, with visual instructions that adapt to new layouts. Autonomous systems such as drones, warehouse bots, and delivery robots gain stronger generalization to new spaces by learning visual rules rather than memorizing pixels. Finally, the test-time scaling knob gives practitioners a practical way to boost performance on hard cases without retraining, making deployment more flexible.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how drawing a comic strip helps you plan a story better than just writing a single sentence? Each panel shows what happens next, so your plan stays clear.
The Concept: Vision-Language Models (VLMs/MLLMs)
- What it is: These are AIs that look at pictures and read words to answer questions or solve tasks.
- How it works:
- See a picture and read text.
- Turn both into internal clues.
- Use past learning to decide what to say or do.
- Output words (and sometimes images) as an answer.
- Why it matters: If the AI only talks in text, it can miss fine visual details like exact positions, angles, or tiny collisions. Anchor: Imagine telling a robot, "Put the triangle at 37 degrees, 12 pixels right." Text is clumsy for that. Seeing and moving shapes directly is easier.
Hook: Imagine navigating a maze while blindfolded and only hearing left/right instructions. You'd bump into walls.
The Concept: Spatial Planning
- What it is: Planning where to move in space without crashing into things.
- How it works:
- Understand the map and where you are.
- Pick the next safe step toward the goal.
- Repeat until you arrive.
- Why it matters: Without good spatial planning, the agent can't reach the goal safely. Anchor: A top-down maze picture helps you plan better than a long text like "right, right, up, left..."
Hook: Think about a stop-motion movie where each picture slightly changes; together they show action.
The Concept: Dynamic Visual Changes
- What it is: The scene changes over time: objects move, rotate, or transform.
- How it works:
- Start with an initial state.
- Apply a small change (move/rotate).
- Repeat changes to reach the goal.
- Why it matters: Some problems (like tangrams) need many small, precise visual changes to succeed. Anchor: Watching a flower bloom in a time-lapse makes the growth clear; one photo can't show the process.
Hook: When you solve a math problem, you write steps, not just the final answer.
The Concept: Intermediate Reasoning Steps
- What it is: Small steps the AI takes to go from the start to the solution.
- How it works:
- Break the goal into sub-steps.
- Complete each sub-step in order.
- Check progress and adjust.
- Why it matters: Skipping steps leads to mistakes you can't see or fix. Anchor: In a maze, you trace the path line-by-line, not teleport to the end.
Hook: Have you ever recognized a new kind of animal you've never seen just by using clues you already know?
The Concept: Zero-Shot Generalization
- What it is: The AI solves new cases it never saw in training.
- How it works:
- Learn general rules (avoid walls, keep shapes intact).
- Spot these rules in new scenes.
- Apply them without extra training.
- Why it matters: Real life throws new mazes and new shapes at you all the time. Anchor: If the agent can handle a brand-new 7x7 maze after seeing only up to 6x6 in training, that's zero-shot power.
Hook: A great athlete plays well no matter the stadium or weather.
The Concept: Robust Generalization
- What it is: Staying strong across many different settings, not just the ones you practiced on.
- How it works:
- Learn core principles, not specific pictures.
- Separate what matters (rules) from what doesn't (colors/icons).
- Keep performance high in new conditions.
- Why it matters: We want AIs that don't fall apart when anything looks a bit different. Anchor: The paper shows the video model works well even with unseen agent icons and bigger mazes.
The World Before: Multimodal language models were great at describing images and answering questions. But they struggled with fine-grained spatial control, like turning a tangram piece just right or finding tight, collision-free turns in a maze. Text is a thin straw for expressing continuous motion and tiny geometry constraints.
The Problem: How can we make the AI actually simulate the visual process of solving, moving step by step, instead of only writing about it?
Failed Attempts: Prior systems often:
- Spoke only in text (hard to capture exact angles/positions).
- Used discrete images with tiny changes (worked for simple mazes but not for continuous shape manipulation).
- Focused mainly on in-distribution tests, without trying bigger mazes, new icons, or unseen silhouette shapes.
The Gap: A representation that naturally shows change over time and is easy to verify. Videos do this by default: each frame can be a thinking step.
Real Stakes: This matters for robots placing parts, assistive tools guiding people, autonomous agents navigating cluttered spaces, and educational apps that show students step-by-step reasoning they can watch and trust.
02 Core Idea
Hook: Imagine solving a jigsaw by sliding the pieces around on a table, not by writing a paragraph about where they should go.
The Concept: Thinking in Frames
- What it is: Use a video model so each generated frame is an explicit reasoning step from start to goal.
- How it works:
- Start with an initial image and a goal (e.g., maze end or tangram silhouette).
- Generate a sequence of frames that move from start toward goal.
- Enforce constraints along the way (no wall hits; keep shapes intact).
- Stop when the last frame satisfies the goal.
- Why it matters: Videos show and simulate the process, making planning clearer, more controllable, and easier to check. Anchor: In a maze, you literally watch the agent move cell by cell to the red circle; if it bumps a wall, you see it.
The "Aha!" in one sentence: Treat video frames as the AI's visual chain-of-thought, and scale the number of frames at test time to give the AI more time to think.
Multiple Analogies (three ways):
- Flipbook Planner: Each page adds a tiny step; flip faster to see the whole solution.
- Stop-Motion Chef: Instead of listing a recipe in words, you film each move (crack egg, whisk, pour) so mistakes are visible and fixable.
- GPS Replay: Rather than just showing the destination, you replay the full drive, turn by turn; longer replays help with trickier routes.
Before vs After:
- Before: Text traces or sparse image steps; hard to express exact rotations/placements; weak OOD tests.
- After: Dense visual traces (videos) that capture continuous motion; better zero-shot generalization; a test-time knob (more frames) that boosts planning.
Why It Works (intuition, not equations):
- Dense temporal clues: Each small change is easier to learn and verify than one giant leap.
- Visual grounding: The model "sees" its own actions; no brittle visual-to-text translation.
- Compute as time: More frames = more reasoning budget, like longer chains of thought.
- Context as control: Showing the exact icon or piece shapes anchors the plan and reduces hallucinations.
Hook: When building LEGO, it helps if you can see the exact bricks you'll use.
The Concept: Visual Context as Control
- What it is: Feeding the exact agent icon or the real tangram pieces as part of the input so the model has concrete references.
- How it works:
- Show the actual visuals (icon, piece shapes, orientations) up front.
- The model preserves these visuals across frames.
- It plans movements and placements relative to these anchors.
- Why it matters: Without these anchors, the model may invent shapes/colors or lose consistency. Anchor: Tangram Translation works much better when pieces start in the correct orientation on the canvas.
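To make this concrete, here is a minimal Python sketch of how such visual context might be bundled for the planner. The PlanningContext fields and the build_condition helper are hypothetical names for illustration, not the paper's code.

```python
# Hypothetical sketch: bundling the visual anchors the planner should preserve.
# PlanningContext and build_condition are illustrative names, not the paper's API.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PlanningContext:
    initial_frame: Any              # maze with agent, or tangram canvas (an image)
    goal_image: Any                 # red-circle cell or target silhouette
    reference_assets: list = field(default_factory=list)  # agent icon, tangram piece shapes
    instruction: str = ""           # e.g., "move the agent to the red circle"


def build_condition(ctx: PlanningContext) -> dict:
    """Collect everything the video model is conditioned on for one episode."""
    return {
        "image": ctx.initial_frame,
        "goal": ctx.goal_image,
        "assets": ctx.reference_assets,
        "prompt": ctx.instruction,
    }
```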
Hook: Studying a little longer often leads to better test performance.
The Concept: Visual Test-Time Scaling
- What it is: At test time, allow the model to generate more frames, giving it more "thinking time."
- How it works:
- Choose a longer video length than training (e.g., 81 → 101 → 121 frames).
- Optionally allocate more frames per action step (scaling factor κ).
- Let the model refine tricky paths with finer-grained motion.
- Why it matters: In mazes, longer videos improved zero-shot performance on longer paths and bigger grids (up to a limit). Anchor: With 121 frames, the agent can backtrack and correct a wrong turn; with fewer frames, it gets stuck.
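As a rough illustration, the frame budget can be thought of as a simple function of the planned path length and κ. The formula and cap below are assumptions for illustration; the paper's exact allocation may differ.

```python
# Minimal sketch of a test-time frame budget, assuming kappa frames are
# allocated per planned step plus the initial frame; the cap reflects the
# temporal limit observed at very long sequences.
def frame_budget(num_steps: int, kappa: int = 7, max_frames: int = 201) -> int:
    """More frames per step means more visual 'thinking time', up to a limit."""
    return min(1 + kappa * num_steps, max_frames)


# Example: a 13-step out-of-distribution maze path at different kappa values.
for kappa in (5, 7, 9, 11):
    print(f"kappa={kappa}: {frame_budget(13, kappa)} frames")
```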
Building Blocks:
- Video Generator: A strong text-to-video diffusion model (Wan 2.2 TI2V 5B) fine-tuned with LoRA.
- Two Regimes: MazeNavigation (discrete, low visual change) vs TangramPuzzle (continuous, high visual change).
- Evaluators: Maze Exact Match/Progress; Tangram Strict Completion, Progress per piece, and Boundary IoU.
- Controls: Visual context inputs; frame-budget scaling at test time.
- Generalization checks: Bigger mazes, longer paths, unseen icons, unseen silhouettes.
Hook: Keeping shapes honest is like making sure a paper triangle doesn't stretch like rubber.
The Concept: Geometric Consistency
- What it is: Shapes must keep their size, angles, and color; no stretching or color drift.
- How it works:
- Compare final shapes to originals.
- Ensure each piece stays within the silhouette and doesnāt overlap others.
- Penalize any distortion.
- Why it matters: Even a perfect plan fails if pieces warp. Anchor: A square can't secretly become a rectangle while "fitting" the silhouette.
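A small Python sketch of what such a geometry check could look like, assuming pieces have already been extracted as polygons from the first and last frames (segmentation not shown) and using the shapely library. The thresholds mirror the 60%-140% area tolerance described in the methodology; the helper name is illustrative.

```python
# Sketch of a geometry-consistency check on extracted piece polygons.
from shapely.geometry import Polygon


def piece_is_consistent(original: Polygon, final: Polygon, silhouette: Polygon,
                        other_pieces: list, area_tol=(0.6, 1.4)) -> bool:
    """Reject stretched pieces, out-of-silhouette placements, and overlaps."""
    lo, hi = area_tol
    area_ok = lo * original.area <= final.area <= hi * original.area
    inside = final.within(silhouette.buffer(1.0))     # small buffer tolerates pixel noise
    no_overlap = all(final.intersection(p).area < 1.0 for p in other_pieces)
    return area_ok and inside and no_overlap
```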
Hook: A robot that can move through a room without knocking over a vase is doing something smart.
The Concept: Embodied Spatial Intelligence
- What it is: Understanding and acting in physical space under constraints like collisions.
- How it works:
- Perceive the space (maze or puzzle).
- Plan safe, goal-reaching steps.
- Execute while respecting rules (no wall hits, no shape warps).
- Why it matters: This is what makes AI useful in the real world, not just on paper. Anchor: The model's maze agent preserves its identity and avoids walls, even with unseen icons.
03 Methodology
High-Level Recipe: Input → Visual Context + Goal → Video Generation (frame-by-frame planning) → Output (final frame satisfies the goal)
Inputs:
- Initial image(s): Maze with start and goal, or tangram canvas with pieces and target silhouette.
- Goal specification: Reach red circle in maze; fill silhouette exactly in tangrams.
- Constraints: No wall collisions; no shape/size/color distortions; no overlaps.
Step A: Frame the problem as video planning
- What happens: Treat the video model as the planner that outputs a sequence of frames from start to finish.
- Why it exists: Continuous visual change is easier to model frame-by-frame than to describe precisely in text.
- Example: In a 5x5 maze, the model draws an 81-frame video of the icon gliding along white paths to the goal.
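In code terms, the planner can be viewed as a single call that returns the whole visual trace. The VideoPlanner interface below is a hypothetical stand-in for the fine-tuned video model, not the paper's actual API.

```python
# Hypothetical planner interface for "video as planner"; names are illustrative.
class VideoPlanner:
    def generate(self, condition: dict, num_frames: int = 81) -> list:
        """Return num_frames images moving from the start state toward the goal."""
        raise NotImplementedError  # backed by the fine-tuned video generator


def solve(planner: VideoPlanner, condition: dict, num_frames: int = 81):
    frames = planner.generate(condition, num_frames=num_frames)
    return frames, frames[-1]   # full visual reasoning trace + final answer frame
```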
Step B: Choose regimes and design controls
- What happens: Two regimes test different skills:
- MazeNavigation (discrete, small visual change): move the icon through a fixed map.
- TangramPuzzle (continuous, big visual change): rotate/translate 7 colored pieces to fill a silhouette.
- Visual context is varied in each regime:
- Maze: Different agent icons, sometimes unseen.
- Tangram: Three setups: Fade-In (no prior pieces shown), Rotation (random piece angles on the left), Translation (correctly oriented pieces on the left).
- Why it exists: To separate logical planning (mazes) from delicate geometry control (tangrams) and to test generalization to OOD assets/layouts.
- Example: In Translation, only sliding is required, so geometry is easier to keep intact.
Step C: Train a video backbone with minimal changes
- What happens: Start from Wan 2.2 TI2V 5B (pretrained for 81-frame videos), then fine-tune a subset of weights (LoRA) for 20 epochs per task.
- Why it exists: Leverages strong temporal priors without heavy retraining; keeps the method practical.
- Example: Maze videos stay at 81 frames; Tangram Rotation uses up to 201 frames to reduce blur for rotations.
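For illustration, attaching LoRA adapters typically looks like the sketch below, shown with the peft library. The target module names and rank are assumptions for illustration, not the paper's exact recipe.

```python
# Rough sketch of adding LoRA adapters to a video transformer backbone.
from peft import LoraConfig, get_peft_model


def add_lora_adapters(video_transformer, rank: int = 32):
    config = LoraConfig(
        r=rank,
        lora_alpha=rank,
        lora_dropout=0.0,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed)
    )
    return get_peft_model(video_transformer, config)

# Only the adapter weights are trained (about 20 epochs per task in the paper),
# so the backbone's pretrained temporal priors stay intact.
```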
Step D: Define precise evaluators
- What happens: Use deterministic visual metrics to avoid ambiguity.
- Maze: Exact Match (path matches) and Progress Rate (partial goodness), with motion-based tracking and speed-invariant alignment.
- Tangram: Strict Goal Completion (all 7 correct with shape/area/color constraints, no overlaps), Progress (piece-wise correctness), and Boundary Adherence IoU (fit to silhouette area).
- Why it exists: We need verifiable, geometry-aware scores; text similarity or general vision scores arenāt precise enough.
- Example: A tangram piece must keep area within a tolerance (e.g., 60%-140%) and stay fully inside the silhouette.
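The scoring logic can be sketched roughly as follows, assuming the agent path and the filled-area masks have already been extracted from the frames (segmentation and speed-invariant alignment are not shown). The prefix-based Progress Rate here is a simplified approximation of the paper's metric.

```python
# Simplified sketches of the evaluation scores on already-extracted traces/masks.
import numpy as np


def maze_scores(pred_cells, gt_cells):
    """Exact Match: whole path identical. Progress Rate (approximate): fraction
    of the ground-truth path matched before the first deviation."""
    exact_match = float(pred_cells == gt_cells)
    prefix = 0
    for p, g in zip(pred_cells, gt_cells):
        if p != g:
            break
        prefix += 1
    progress = prefix / max(len(gt_cells), 1)
    return exact_match, progress


def boundary_iou(pred_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Overlap between the area filled by pieces and the target silhouette."""
    inter = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    return float(inter) / max(float(union), 1.0)
```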
Step E: Test-time scaling knob (frames and Īŗ)
- What happens: At inference, increase video length (e.g., 81 → 101 → 121 frames) or allocate more frames per planned step via a scaling factor κ (e.g., 5, 7, 9, 11).
- Why it exists: Like thinking longer in words, more frames can refine visual paths, especially in long mazes.
- Example: On a long OOD path, 121 frames and κ=9 improve success, while κ=11 (≈200 frames) can hit positional embedding limits.
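Put together, test-time scaling becomes a small retry loop over frame budgets. In the sketch below, planner follows the hypothetical interface from Step A and is_valid is a stand-in for the deterministic checkers (goal reached, no collisions, geometry intact).

```python
# Sketch of the test-time scaling loop: on hard cases, retry with longer videos
# and keep the first plan the verifier accepts.
def solve_with_scaling(planner, condition, is_valid, budgets=(81, 101, 121)):
    last_attempt = None
    for num_frames in budgets:                  # more frames = more thinking time
        frames = planner.generate(condition, num_frames=num_frames)
        if is_valid(frames):
            return frames                       # accepted visual plan
        last_attempt = frames
    return last_attempt                         # longest attempt as a fallback
```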
Step F: Compare against strong baselines
- What happens: Evaluate proprietary MLLMs (GPT-5.1/5.2, zero-shot), fine-tuned open MLLMs (Qwen3-VL-8B), image-based planners (VPRL), and image editing models (Qwen-Image-Edit-20B, Nano Banana).
- Why it exists: To prove the benefit of thinking in frames over purely textual or single-image edits.
- Example: In tangrams, text models misplace/overlap pieces; image editing excels at final fit; videos provide process transparency.
Secret Sauce:
- Dense temporal representation: Small, natural visual steps serve as a visual chain-of-thought.
- Visual context as a control signal: Concrete anchors reduce hallucinations and preserve geometry.
- Test-time compute: Frames are a dial you can turn up to boost planning on the fly, within architectural limits.
Concrete Data Examples:
- Maze: Training on 3x3-6x6 grids, testing OOD on 7x7 and 8x8, and on longer paths (13-18 steps). Unseen icons verify decoupling of plan from appearance.
- Tangram: 692 silhouettes for training variants; held-out set for unseen silhouettes. Three variants isolate the role of context and rotation.
What breaks without each step?
- No video framing: Hard to express tiny rotations/translations; plans become brittle text.
- No visual context: Model hallucinates shapes/colors; tangram performance collapses (Fade-In ~0.8% strict success).
- No test-time scaling: Long, complex mazes remain unsolved despite the modelās latent ability.
- No precise evaluators: You canāt trust whether success was real or accidental.
04 Experiments & Results
The Test: Measure if video-as-reasoner improves planning and generalization.
- MazeNavigation: EM (exact path match), PR (partial progress). ID data and OOD challenges: bigger grids (7x7, 8x8), longer paths (13-18), and both combined. Also swap in unseen agent icons.
- TangramPuzzle: Strict Goal Completion (all 7 correct w/ constraints), Progress (piece-wise), IoU (boundary fit). Test three setups (Fade-In, Rotation, Translation) and unseen silhouettes.
The Competition:
- Proprietary MLLMs (GPT-5.1/5.2, zero-shot): strong language, weak spatial control.
- Open MLLMs (Qwen3-VL-8B, w/ and w/o coordinates): better with fine-tuning but still struggle on continuous control.
- Image sequence planners (VPRL): solid for simple mazes.
- Image editing (Qwen-Image-Edit-20B; Nano Banana): strong final tangram fidelity but no temporal trace.
Scoreboard with Context:
- Maze, In-Distribution (3x3-6x6): Wan video model ≈ 96% EM and ≈ 99% PR, like an A+ when many text models hover at C or below.
- Maze, OOD Sizes: On 7x7, ≈ 90% EM; on 8x8, ≈ 78% EM, a graceful drop, far from collapse.
- Maze, OOD Long Paths: EM ≈ 36-44% at baseline length; increasing frames to ~121 significantly lifts performance, like adding extra time on a tough exam.
- Maze, Unseen Icons: Accuracy remains high (e.g., ≈94% EM on 5x5), proving the model plans independently of the icon's look.
- Tangram, Fade-In: Video nearly fails (Strict ~0.8%); no context means no anchor. Image editing at ~31% Strict shows even strong editors struggle without anchors.
- Tangram, Rotation: Video ~22.4% Strict (seen & unseen), indicating geometry stress under rotation; image editing reaches ~45.2% Strict.
- Tangram, Translation: Video ~68.0% Strict seen, ~60.8% unseen, showing strong zero-shot transfer when orientation is given. Image editing at ~85.7% Strict is best for final fit.
- Boundary IoU is high for both video and image editing (often ~97-100%), but Strict Completion exposes shape/overlap errors.
Surprising Findings:
- Visual Test-Time Scaling Law (Maze): More frames (e.g., 121 vs 81) improve OOD maze success; more frames per action (κ up to ~9) also helps, until positional limits appear at very long sequences.
- Emergent Self-Correction: With higher frame budgets, the agent can backtrack and fix early wrong turns, a behavior not seen at low frame counts.
- Irregular Mazes: Despite training on grid mazes, the model sometimes navigates diagonal moves in irregular layouts without wall collisions, evidence of an abstracted collision-avoidance policy rather than pixel memorization.
- Tangram Trade-off: Unlike mazes, more frames don't boost tangram success; longer videos can accumulate tiny deformations, making strict geometry harder.
Takeaway: Videos beat text for showing and executing spatial reasoning; visual context acts like a strong control knob; and turning up test-time frames often acts like giving the model extra thinking time, especially for long-horizon paths.
05 Discussion & Limitations
Limitations:
- Geometry Drift in Long Videos: In tangrams, maintaining exact shape integrity over many frames is hard; tiny warps or color shifts can cause the strict checks to fail.
- Frame-Length Limits: Going far beyond the pretrained sequence length (e.g., 81) can hurt performance due to positional embedding/extrapolation limits.
- Context Dependence: Performance strongly depends on having good visual anchors (e.g., Translation > Rotation > Fade-In). Without anchors, success collapses.
- Final-Fidelity Gap vs Image Editing: For perfect last-frame quality, a strong image editor (larger model) can outperform the video model on tangrams.
Required Resources:
- A capable video generator (e.g., Wan 2.2 TI2V 5B) and GPU compute for fine-tuning and inference with longer frame budgets.
- Clean evaluation pipelines that segment colors/shapes and track motion robustly.
When NOT to Use:
- If you only need the final perfect picture (no interest in the process), a top image editor may be simpler and more accurate for tangrams.
- Tasks with extremely long sequences well beyond the model's temporal capacity.
- Scenarios where visual context cannot be provided and strict geometry must still be met.
Open Questions:
- How to enforce geometry better across time (e.g., differentiable constraints, shape-aware losses, or hybrid symbolic checks)?
- Can improved positional encodings extend safe frame lengths without degradation?
- How to combine video reasoning with text or code tools (e.g., planners) while keeping interpretability?
- Can we design lightweight self-correction loops that detect and fix geometry drift mid-video?
- What's the best way to share compute between long horizons (logic) and high-fidelity detail (appearance)?
06 Conclusion & Future Work
Three-Sentence Summary: This paper treats video frames as the AI's visual chain-of-thought, showing that a video generator can plan by literally simulating the steps from start to goal. It generalizes robustly in mazes and, with the right visual context, handles tangrams far better than text-based models; test-time scaling (more frames) acts like extra thinking time. The result is a more interpretable, controllable, and often stronger approach to visual reasoning than text alone.
Main Achievement: Revealing that video generation is a powerful and scalable reasoning paradigm, with visual context as control and a test-time scaling law that boosts long-horizon planning.
Future Directions:
- Build geometry-preserving mechanisms and better temporal encodings to push beyond current frame-length and fidelity limits.
- Hybridize with symbolic or programmatic checks for guaranteed constraint satisfaction.
- Expand to real-robot tasks, irregular maps, and complex multi-object manipulations.
Why Remember This: It reframes reasoning as something you can watch, frame by frame. That makes plans easier to trust, debug, and adapt, turning video generation from a media toy into a serious tool for spatial intelligence.
Practical Applications
- Robot navigation in unfamiliar floor plans by generating frame-by-frame visual plans that avoid obstacles.
- Assistive assembly guides that show each placement and rotation of parts as a short video tutorial.
- Education apps that teach geometry and spatial reasoning via animated, step-by-step tangram solutions.
- Warehouse picking routes planned as visual clips to minimize collisions and dead-ends with dynamic obstacles.
- AR instructions for furniture assembly where each frame shows the next exact motion and alignment.
- Game AI that plans complex paths or puzzle solutions visibly, improving both play quality and explainability.
- Quality control in factories by comparing generated ideal assembly videos with live camera feeds to spot errors.
- Autonomous drone path planning through cluttered spaces using more frames at test time for tricky routes.
- UI/UX design tools that animate optimal element arrangements (snap/align) while preserving layout constraints.
- Training data augmentation: use visual context variants (new icons/shapes) to stress-test generalization before deployment.