
CoV: Chain-of-View Prompting for Spatial Reasoning

Intermediate
Haoyu Zhao, Akide Liu, Zeyu Zhang et al. · 1/8/2026
arXiv · PDF

Key Summary

  • This paper teaches AI to look around a 3D place step by step, instead of staring at a fixed set of pictures, so it can answer tricky spatial questions better.
  • The method is called Chain-of-View (CoV) prompting and it works without extra training, only smarter instructions at test time.
  • First, a View Selection Agent picks the most helpful starting views (coarse step), then a CoV Agent rotates, moves, and switches views (fine step) to gather missing clues.
  • Across major models on the OpenEQA benchmark, CoV raises scores by an average of about 11.56%, and up to 13.62% on Qwen3-VL-Flash.
  • If you let the agent take more steps (test-time scaling), performance keeps getting better, with an extra average gain of about 2.51% and up to 3.73% on Gemini-2.5-Flash.
  • On ScanQA, CoV reaches 116 CIDEr and 31.9% EM@1, and on SQA3D it gets 51.1% EM@1, showing strong accuracy.
  • An ablation shows that skipping the coarse view selection hurts results by about 4.59%, proving it is essential.
  • CoV builds clear, human-friendly reasoning chains that explain how the camera moved and why the final answer makes sense.
  • The approach is model-agnostic and training-free, so it can upgrade many vision-language models for embodied question answering.
  • Limitations include trouble in very dynamic scenes, possible hallucinations from too many actions, and reliance on good initial view selection.

Why This Research Matters

Many daily tasks depend on understanding where things are and how they relate in space: finding your keys, placing furniture, or navigating a crowded room. CoV makes AI better at these jobs by letting it choose better views and verify its guesses instead of guessing from fixed images. This helps robots and home assistants act more safely and reliably, especially when objects are small, occluded, or far away. Because the method is training-free and model-agnostic, organizations can upgrade existing systems quickly. Clear reasoning chains also build trust, since users can see how the AI looked around before answering. As we bring AI into homes, hospitals, warehouses, and schools, this kind of careful, viewpoint-aware reasoning becomes essential.

Detailed Explanation


01 Background & Problem Definition

You know how when you walk into a messy room, you don’t just stand in one spot and guess where your backpack is? You look around, step to the side, peek behind a chair, and maybe kneel to see under the table. That movement helps your brain collect the right clues to find the answer. AI agents in 3D spaces have the same need: to answer questions like “Where is the mirror relative to the staircase?” they often need to look from more than one angle.

Before this research, most vision–language models (VLMs) used a fixed set of views—like a short video clip or a few chosen photos. That means if the important detail was hidden, off to the side, or too far away, the model couldn’t move to fix it. Imagine trying to find the fridge by staring only at five snapshots of a kitchen from bad angles; you might miss it entirely. This fixed-view habit limited AI’s ability to solve embodied question answering (EQA) tasks, where questions are tied to the agent’s point of view in a real or simulated 3D environment.

The problem is simple but serious: EQA often needs pieces of context scattered across different places, sometimes partially hidden. Fixed views force the model to guess from incomplete evidence. People tried larger models, better prompts, and mixing 2D and 3D inputs, but the core issue remained: there was no way to actively explore during inference. It’s like reading a mystery book with missing pages—you can think harder, but without the missing pages, you still lack key clues.

Failed attempts mostly treated the task as one-shot answering: feed all frames, ask for an answer, hope for the best. Some methods fused multi-view features or aligned 2D images to 3D coordinates, which helped a bit, but they still couldn’t fix a bad viewpoint at test time. If the item was just out of frame or behind an object, the system stayed stuck. The missing ingredient was agency: letting the model decide where to look next. Humans naturally solve spatial puzzles by moving their viewpoint. Without that, models confuse small objects with background, misjudge left vs. right, and fail at relationships like “above,” “behind,” or “next to.”

Enter Chain-of-View (CoV) prompting. Instead of passively accepting whatever frames it was given, the AI now acts like a careful explorer. First, it picks promising “anchor views” from a larger set, removing duplicates and blurry angles. Then, it performs small, purposeful camera actions—rotate a little, step forward, switch to another anchor—to reveal hidden details. Importantly, this is training-free: the same pre-trained model gets better just by using a smarter, two-stage prompting process.

Why should you care? Because many real-world tasks depend on spatial reasoning: robots finding tools, home assistants locating remotes, wheelchairs navigating around chairs, and AR devices understanding where to place virtual instructions. If AI can actively gather the views it needs, it stops guessing and starts seeing. That means fewer mistakes, clearer explanations, and safer, more useful helpers.

This paper’s big promise is shifting from “what did I get to see?” to “how can I see what I need?” It shows that even without retraining, simply giving the model the right exploration strategy can boost performance a lot. And just like spending a bit more time looking around a room helps a person, giving the AI a few more action steps at test time keeps paying off.

02 Core Idea

The “aha!” in one sentence: Turn a fixed-view watcher into an active look-around reasoner by selecting smart starting views, then adjusting the camera step by step until the scene makes sense.

Multiple analogies:

  • Detective analogy: Instead of guessing from a single photo, the detective walks around the room, checks behind curtains, and compares angles to find the missing clue.
  • Treasure map analogy: First open the world map (coarse), then zoom into the island and search specific spots (fine) to locate the treasure.
  • Photographer analogy: Start with a wide shot to set the scene, then change angle, distance, and focus to capture the exact detail you need.

Before vs After:
  • Before: The model passively answered from a small, fixed set of frames, often missing occluded or faraway details.
  • After: The model picks anchors, explores with camera moves/rotations/switches, gathers extra evidence, and then answers with clearer reasoning.

Why it works (intuition, no equations):
  • Spatial questions depend on geometry: relative positions like left/right/above/behind are viewpoint-sensitive. If you can’t change the viewpoint, you can’t reduce ambiguity. CoV reduces ambiguity by actively choosing better vantage points.
  • Noise reduction: Coarse view selection removes redundant or unhelpful frames so the model doesn’t waste attention. Fine adjustments then home in on missing details.
  • Evidence chaining: Each new view is added back into the context so the model’s explanation accumulates concrete, visual proof.

Building blocks, each with the Sandwich pattern (a rough sketch of the two prompts appears right after this list):
  1. 🍞 Hook: You know how a video game gives you a 3D world you can walk around in? 🥬 The Concept: 3D Scene Representation is a computer’s way of storing a place so it can render views from many angles.

    • How it works:
      1. Build a digital model (like a mesh or point cloud) of rooms and objects.
      2. Keep camera positions so you can render what the camera would see.
      3. Let the agent request new viewpoints and get fresh images.
    • Why it matters: Without a 3D scene, the AI is stuck with flat snapshots and can’t “peek around” obstacles. 🍞 Anchor: Imagine a virtual house tour where you can look left, right, or step closer to a painting—because the whole house is modeled in 3D.
  2. 🍞 Hook: Imagine skimming a whole book first, then rereading the most important chapters slowly. 🥬 The Concept: Coarse-to-Fine Exploration means first getting the big picture, then zooming in for detail.

    • How it works:
      1. Coarse: pick promising anchor views and discard duplicates.
      2. Fine: from an anchor, make small camera moves or rotations to uncover missing clues.
      3. Stop when the evidence is sufficient or the step budget runs out.
    • Why it matters: Jumping straight to tiny details can miss the context; staying too broad can miss the answer. This balances both. 🍞 Anchor: A nature photographer first scans the forest (coarse), then carefully approaches the bird’s nest (fine) for a clear shot.
  3. 🍞 Hook: You know how you solve a puzzle by trying one step, checking, then trying the next? 🥬 The Concept: Multi-Step Reasoning means thinking and acting in small steps, each building on the last.

    • How it works:
      1. Observe current view.
      2. Decide the best next action (move/rotate/switch).
      3. Get a new view and update your notes.
      4. Repeat until confident.
    • Why it matters: Big leaps invite mistakes; small, checked steps reduce confusion and fix errors early. 🍞 Anchor: Counting chairs around a table by walking around it and tallying as you go.
  4. 🍞 Hook: Think of a coach who doesn’t play but tells the team how to move smarter. 🥬 The Concept: Chain-of-View Prompting is a prompting strategy that turns a VLM into an active explorer without retraining.

    • How it works:
      1. Prompt the model to select the best starting views.
      2. Prompt it to propose one action at a time (move/rotate/switch).
      3. Feed back the new observation and repeat.
      4. Require a verification phase before answering.
    • Why it matters: It adds agency and structure at test time, improving reasoning with the same model weights. 🍞 Anchor: Using smarter instructions to guide a remote-controlled camera until it captures the evidence you need.
  5. 🍞 Hook: Picture a tour guide who knows which lookout point gives the best view of a waterfall. 🥬 The Concept: The View Selection Agent picks question-relevant anchor views and removes redundant frames.

    • How it works:
      1. Read the question to understand target objects/relations.
      2. Score candidate views for clarity and coverage.
      3. Keep diverse, informative views; discard near-duplicates.
    • Why it matters: Starting from weak or repetitive views wastes steps and confuses reasoning. 🍞 Anchor: Choosing one front view and one side view of a kitchen so you can likely see the fridge and the counter.
  6. 🍞 Hook: Imagine moving a flashlight around to spot dust hiding in the corner. 🥬 The Concept: Active Viewpoint Reasoning is the loop of deciding and executing camera actions to gather new evidence.

    • How it works:
      1. From an anchor, choose an action (e.g., right-rotation+10).
      2. Render the new view from the 3D scene.
      3. Add the view back into context and reassess.
    • Why it matters: Many spatial facts are invisible from one angle; actions reveal them. 🍞 Anchor: Rotating left 15 degrees to finally see the mirror above the cabinet.
  7. 🍞 Hook: You know how spending a bit more time double-checking homework can lift your grade? 🥬 The Concept: Test-Time Scaling means giving the agent more action steps at inference to improve accuracy.

    • How it works:
      1. Set a minimum step budget.
      2. Force the agent to explore enough, not just guess early.
      3. Use the extra observations to verify and refine.
    • Why it matters: Some questions need deeper exploration; more steps often mean better, safer answers. 🍞 Anchor: Letting the agent take 4–6 actions instead of 1–2 improved scores further on OpenEQA.
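
To make the two-stage prompting concrete, here is a minimal Python sketch of how the two prompts could be structured. Everything in it is an illustrative assumption: the function names (build_view_selection_prompt, build_cov_step_prompt), the template wording, and the exact action vocabulary are one plausible reading of the method, not the authors' released prompts.

```python
# Illustrative sketch of the two CoV prompt templates (assumed wording, not the
# paper's exact prompts).

def build_view_selection_prompt(question: str, view_ids: list, k: int = 3) -> str:
    """Coarse stage: ask the VLM to pick a small, diverse set of anchor views."""
    return (
        f"Question: {question}\n"
        f"Candidate view IDs: {view_ids}\n"
        f"Select up to {k} views most likely to show the objects and spatial "
        "relations the question asks about. Avoid near-duplicate or heavily "
        "occluded views. Reply with a list of view IDs only."
    )

def build_cov_step_prompt(question: str, history: list) -> str:
    """Fine stage: ask for exactly one camera action (or a final answer) per step."""
    return (
        f"Question: {question}\n"
        "Observations and actions so far:\n"
        + "\n".join(history) + "\n"
        "Propose exactly ONE next action: a movement (forward/back/left/right/"
        "up/down plus a distance), a rotation (left/right plus degrees), or a "
        "switch to another selected view or the BEV snapshot.\n"
        "Before giving a final answer, verify it, starting with "
        "\"I'm now verifying my answer by...\".\n"
        "When the evidence is sufficient, reply with done+[final answer]."
    )
```

The driver loop that wires these prompts together is sketched at the end of the Methodology recipe below.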

03 Methodology

At a high level: Question + 3D scene frames → [Coarse View Selection] → [Fine View Adjustment: action–reasoning loop] → Verified Answer.

Step-by-step recipe with what, why, and examples (a minimal sketch of the full loop follows the recipe):

  1. Inputs and setup
  • What happens: The system receives a natural language question Q and a video-like set of frames from a 3D scene (plus camera poses, optionally a BEV/bird’s-eye snapshot). The underlying VLM is left unchanged.
  • Why this step exists: The model needs both language (what to look for) and visuals (where to look). Camera poses or 3D representation make new views possible.
  • Example: Q = “Where is the mirror relative to the staircase?” Input frames show parts of a hallway, a cabinet, and stairs from multiple angles.
  2. Coarse-Grained View Selection (the smart starting line)
  • What happens: A View Selection Agent reads Q and the candidate view IDs, then outputs a small, diverse subset of anchor views that likely contain the answer. It avoids redundant angles and prefers clear, low-occlusion shots aligned with the question type.
  • Why this step exists: Many frames are repetitive or irrelevant. Trimming early reduces distraction and saves fine-step actions for real discovery.
  • Example: Out of 30 frames, it keeps views {6, 9, 3} because they show the staircase region, a cabinet area, and a reflective surface, matching the mirror-related query.
  3. Fine-Grained View Adjustment (the action–reasoning loop)
  • What happens: Starting from an anchor, the CoV Agent proposes exactly one action per step—movement (forward/back/left/right/up/down), rotation (left/right by N degrees), or switch to another selected view (or BEV). The system renders the new view from the 3D scene and appends it to the context. The VLM then reasons again with the updated context.
  • Why this step exists: Spatial ambiguity (e.g., “to the right of which reference?”) often disappears when you rotate or step closer. This loop builds a chain of evidence.
  • Example actions:
    • Step 1: right-rotation+10 to reduce glare and confirm the mirror’s edge.
    • Step 2: forward-movement+2 for a closer look at the frame.
    • Step 3: left-rotation+15 to compare mirror position versus the staircase.
    • Observation: “The mirror is above the dark cabinet and to the right of the staircase.”
  4. Verification phase (required before answering)
  • What happens: The prompt forces a short verification: “I’m now verifying my answer by…” and uses remaining steps to double-check uncertain parts (e.g., a small upward-movement+2 to confirm the mount height).
  • Why this step exists: It reduces premature answers and encourages evidence gathering, especially helpful for tricky occlusions or similar-looking objects.
  • Example: For “What should I do to cool down?”, the agent verifies that the AC unit is indeed above the TV by moving forward and slightly upward before committing to “turn on the air conditioner.”
  5. Stopping rule and answer
  • What happens: The loop stops when the budget is reached or when evidence is deemed sufficient. The model then outputs done+[final answer].
  • Why this step exists: Prevents endless exploration and focuses on useful, budgeted discovery.
  • Example: After 4 actions, the mirror’s relative position is clear and consistent across views, so the agent answers confidently.
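
Putting the recipe together, the driver loop might look like the sketch below. The vlm and scene objects and their methods (select_views, propose_action, answer, camera_at, apply, render_view) are hypothetical stand-ins for whatever model client and renderer you have; the control flow is the point, not the API.

```python
# Coarse-to-fine CoV driver loop: a sketch under assumed helper APIs
# (vlm.select_views / propose_action / answer, scene.camera_at / render_view,
# camera.apply), not the authors' released code.

def chain_of_view(question, frames, scene, vlm, max_steps=6, min_steps=2):
    # 1. Coarse-grained view selection: keep a few informative anchor views.
    anchor_ids = vlm.select_views(question, frames)       # e.g. [6, 9, 3]
    context = [frames[i] for i in anchor_ids]             # evidence gathered so far
    camera = scene.camera_at(anchor_ids[0])               # start from the first anchor

    # 2. Fine-grained view adjustment: exactly one action per step.
    for step in range(max_steps):
        action = vlm.propose_action(question, context)    # e.g. "right-rotation+10"
        if action.startswith("done+"):
            if step >= min_steps:                         # test-time scaling knob:
                return action[len("done+"):]              # only accept after enough looking
            continue                                      # answered too early; keep exploring
        camera = camera.apply(action)                     # move / rotate / switch views
        context.append(scene.render_view(camera))         # feed the new view back in

    # 3. Budget exhausted: answer (with verification) from all gathered evidence.
    return vlm.answer(question, context)
```

In this sketch, max_steps and min_steps together act as the test-time scaling knob described under "Secret sauce" below: a larger budget forces more looking before the agent is allowed to commit.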

Secret sauce (what makes it clever):

  • Agency without training: The same VLM becomes a viewpoint reasoner via prompting—no fine-tuning required.
  • Coarse-to-fine efficiency: First remove noise (coarse), then focus on specifics (fine), which mirrors how humans explore.
  • Evidence accumulation: Each new view is appended to context, turning scattered frames into a coherent, grounded story.
  • Test-time scaling: A simple knob—the action budget—unlocks more careful exploration and better scores.

Concrete mini-walkthrough (a small sketch of parsing these action strings follows the list):

  • Input: 20 frames of a living room; Q: “Where is the mirror?”
  • Coarse selection: Keep {v4, v7, v12} and BEV.
  • Fine loop:
    1. Start at v7 → right-rotation+10 (spot reflective surface).
    2. forward-movement+2 (confirm it’s a mirror, not a picture).
    3. switch to v4 (see staircase relation).
    4. left-rotation+15 (align mirror and staircase in one shot).
  • Verification: upward-movement+1 (check it’s above cabinet).
  • Answer: done+“On the wall above the dark cabinet, to the right of the staircase.”
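
The walkthrough above strings together compact action commands like right-rotation+10 and forward-movement+2. A small parser such as the one below could turn them into structured camera updates for a renderer; the grammar is an assumption inferred from these examples, not a format specified by the paper.

```python
import re

# Parse compact CoV-style action strings, e.g. "right-rotation+10" or
# "forward-movement+2", into (kind, direction, amount) tuples. The grammar is
# inferred from the examples above, not taken from the paper.
ACTION_RE = re.compile(r"^(?P<dir>\w+)-(?P<kind>rotation|movement)\+(?P<amt>\d+(?:\.\d+)?)$")

def parse_action(action: str):
    if action.startswith("switch"):
        # e.g. "switch to v4": hand the target view ID back to the loop.
        return ("switch", action.split()[-1], None)
    match = ACTION_RE.match(action)
    if match is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    return (match["kind"], match["dir"], float(match["amt"]))

print(parse_action("right-rotation+10"))   # ('rotation', 'right', 10.0)
print(parse_action("forward-movement+2"))  # ('movement', 'forward', 2.0)
print(parse_action("switch to v4"))        # ('switch', 'v4', None)
```

A real system would then map these tuples onto its renderer's camera controls (for example, an SE(3) pose update), which is outside the scope of this sketch.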

What breaks without each step:

  • Without coarse selection: The agent chases too many similar frames; ablation shows about −4.59% average performance.
  • Without fine adjustments: Ambiguities stay unresolved; answers may flip left/right or miss occluded items.
  • Without verification: Higher chance of premature, hallucinated answers.

04 Experiments & Results

The tests: The authors evaluated how well CoV answers spatial questions in real indoor 3D scenes. They measured correctness with several metrics, especially LLM-Match on OpenEQA (an LLM judge rates answer similarity from 1–5, and the ratings are normalized to a percentage; a small sketch of that rescaling follows the scoreboard), plus standard text metrics (CIDEr, BLEU-4, METEOR, ROUGE-L) and exact match on ScanQA and SQA3D.

The competition: CoV was plugged into multiple strong VLMs—Qwen3-VL-Flash, GLM-4.6V, GPT-4o-Mini, and Gemini-2.5-Flash—without retraining. It also compared with 3D-specific models on ScanQA and SQA3D.

The scoreboard with context:

  • OpenEQA (LLM-Match): CoV gives an average lift of about +11.56% over baselines, and up to +13.62% on Qwen3-VL-Flash. Think of it as going from a solid B to an A, just by letting the model look around.
  • Test-time scaling: Forcing more action steps adds another average +2.51% (up to +3.73% on Gemini-2.5-Flash). That’s like checking your work once more and picking up extra points.
  • ScanQA: CoV reaches 116 CIDEr and 31.9% EM@1, outperforming prior systems (e.g., beating LEO’s 101.4 CIDEr). That’s like writing an answer key that agrees strongly with many humans and still getting the exact answer right a third of the time in a tough open-ended setting.
  • SQA3D: 51.1% EM@1, which shows solid situated reasoning where the agent’s own perspective matters.

Surprising findings:
  • More steps help across all tested models, confirming that careful exploration, not just bigger parameters, can boost performance.
  • The reasoning chains look clearer to humans: the agent explains its rotations and moves, and ties them to the final answer (e.g., “I rotated right to confirm the mirror’s frame, then switched to view 6 to see its height relative to the TV”).
  • Coarse view selection is critical; removing it hurts by about −4.59% on average, showing that a clean starting point makes fine exploration far more productive.

What the numbers mean in practice:
  • An +11.56% average gain on OpenEQA means the agent is noticeably more reliable on diverse, realistic questions.
  • The +2.51% average from test-time scaling is inexpensive: you don’t retrain; you simply give the model a few more moves per question.
  • High CIDEr and solid EM on ScanQA/SQA3D indicate both fluent and precise answers, not just lucky guesses.

Takeaway: Across datasets and models, CoV turns fixed-view guessing into active, verifiable looking—delivering better scores and better explanations.
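
For reference, the LLM-Match numbers quoted above come from judge ratings on a 1–5 scale rescaled to a 0–100% range. A plausible version of that rescaling (my reading of the OpenEQA metric, shown only to make the percentages concrete) is:

```python
def llm_match(judge_scores):
    """Average of (sigma - 1)/4 over all questions, in percent; sigma is the
    1-5 rating from the LLM judge. A sketch of the rescaling, not official code."""
    return 100 * sum((s - 1) / 4 for s in judge_scores) / len(judge_scores)

print(llm_match([5, 3, 1]))  # 50.0 -> a perfect 5 maps to 100%, a 1 maps to 0%
```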

05 Discussion & Limitations

Limitations:

  • Highly dynamic or cluttered scenes: If objects move or many look similar, the agent might chase misleading evidence or mis-time its observations.
  • Long action chains: Too many steps can add noise or encourage hallucinated details; careful budgets and verification help but don’t eliminate this risk.
  • Dependence on initial anchors: If the coarse selection picks weak views, the fine stage starts from a bad place and may never quite recover.
  • Rendering/pose quality: Poor 3D reconstructions or inaccurate camera poses can mislead rotations and moves.

Required resources:
  • Access to a 3D scene or multi-view frames with camera poses; ideally a BEV snapshot.
  • A compatible VLM (open or proprietary).
  • A renderer or mechanism to synthesize new viewpoints on the fly.
  • Modest extra inference time to run a few action steps.

When NOT to use it:
  • Time-critical tasks where even a few extra steps are too costly.
  • Fully 2D settings with no way to generate new views.
  • Scenes with unreliable geometry or rapidly changing layouts where past frames quickly go stale.

Open questions:
  • How to auto-tune the step budget per question, balancing speed and accuracy?
  • Can the view selection agent learn from past successes to choose anchors even better over time?
  • How to detect and correct hallucinations mid-chain?
  • How to extend to outdoor, large-scale, or multi-floor environments with efficient long-range moves?
  • Can multi-agent collaboration (two viewpoints at once) speed up discovery without extra training?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces Chain-of-View (CoV) prompting, which lets a vision–language model actively look around a 3D scene by first choosing smart anchor views and then adjusting the camera step by step. Without any extra training, CoV boosts embodied question answering across multiple models and datasets, producing clearer, better-grounded answers. Giving the agent a few more steps at test time further improves results, showing the power of exploration over pure parameter scaling.

Main achievement: Turning passive, fixed-view answering into active viewpoint reasoning via a simple, two-stage prompting strategy that’s model-agnostic and training-free.

Future directions: Smarter, learned view selection; adaptive step budgets; mid-chain hallucination checks; and scaling to large, dynamic, or outdoor scenes. Integrations with planners or navigation modules could speed up exploration and reduce wasted moves.

Why remember this: CoV shifts the mindset from “what did the model see?” to “how can the model see what it needs?”—a small change in test-time behavior with a big impact on accuracy, interpretability, and real-world usefulness for robots and assistants that must understand space.

Practical Applications

  • Home assistants that can locate items (like a remote) by actively looking from better angles.
  • Service robots in hospitals that verify equipment locations before fetching or guiding.
  • Warehouse inventory bots that rotate and move cameras to confirm shelf counts and labels.
  • AR/VR guides that pick the right viewpoint to overlay accurate instructions on real objects.
  • Smart security patrols that adjust viewpoints to reduce blind spots and verify alarms.
  • Disaster response drones that move and rotate to check behind obstacles for survivors or hazards.
  • Retail analytics systems that confirm product placement and signage visibility by switching views.
  • Elderly care robots that safely navigate around furniture by checking occluded paths.
  • Educational tools that teach spatial reasoning by letting students control viewpoint steps.
  • Autonomous wheelchair systems that refine routes by looking around corners with small rotations.
#Chain-of-View Prompting#Embodied Question Answering#Active Viewpoint Reasoning#Coarse-to-Fine Exploration#View Selection Agent#Test-Time Scaling#3D Scene Understanding#Vision-Language Models#LLM-Match#OpenEQA#ScanQA#SQA3D#Bird’s-Eye View#SE(3) Camera Actions#Spatial Reasoning