Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
Key Summary
- This paper teaches robots to move their camera to a better spot before answering a question about what they see.
- The new task is called Visually-Grounded Active View Selection (VG-AVS), which means choosing the next best view using only the current picture and the question.
- Instead of using choppy, step-by-step moves, the robot predicts smooth, continuous actions: turn, move forward, and then fine-turn the view.
- The authors built a small but clever synthetic dataset (AVS) that pairs a starting view with a perfect target view and a matching question.
- They train a vision-language model in two stages: first with supervised fine-tuning (learn from examples), then with reinforcement learning (learn from rewards).
- A separate frozen model checks whether the new view lets the system answer the question correctly, and that becomes the reward.
- This approach beats strong baselines, including very large proprietary models, and transfers well from simulated homes to real-world scenes.
- Plugging this module into existing embodied question answering systems boosts their final accuracy after exploration.
- Surprisingly, one carefully chosen, precise move is usually enough to get the needed view.
- Limitations include synthetic training data, fixed camera height, and no long-term memory or multi-step navigation.
Why This Research Matters
Robots and AR assistants must often answer questions when the key detail is just out of view, and this work teaches them to take the one best step to reveal it. By using only the current image and the question, the method avoids heavy memory systems and complex maps, making it simpler and faster. Smooth, precise actions (turn, move, fine-turn) are more reliable than clunky, discrete moves when details are small or partially hidden. Training first by imitation and then by reward lets the agent start stable and then self-improve, which boosts real-world transfer. Because it plugs into existing exploration systems and works in real homes, it can help home robots, warehouses, and smart glasses make fewer guesses and more confident, correct answers.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how, when you can't see the TV because someone's in the way, you take a step to the side to get a better view? You don't just stare harder; you move. That's how our eyes and body work together in real life.
The Concept (Vision-Language Models): A Vision-Language Model (VLM) is an AI that reads pictures and words together to answer questions about what it sees. How it works: (1) It looks at the pixels in an image, (2) reads the question, (3) connects visual clues to the words, and (4) produces an answer. Why it matters: Without VLMs, computers can't explain what's in a photo in plain language. Anchor: Ask, "How many chairs are at the table?" and a VLM tries to count chairs it sees in the photo and says the number.
Hook: Imagine trying to guess what's on a table when you can only see the table's corner. You might move closer or turn your head.
The Concept (Visual Question Answering, VQA): VQA is when an AI answers a question about an image. How it works: (1) Get a single picture, (2) read the question, (3) find relevant parts of the image, (4) answer. Why it matters: If the important thing is hidden or too tiny, static VQA gets it wrong because it can't move. Anchor: If the question is "Is the stove on?" but the image cuts off the stove, the AI can't be sure.
Hook: Think of a photo as a single freeze-frame in a movie; it's helpful, but it can miss a lot.
The Concept (Snapshot Vision vs. Ambulatory Vision): Snapshot vision means understanding from one static image; ambulatory vision means moving to gather better views. How it works: (1) Notice missing info, (2) decide where to move the eyes/body, (3) go there, (4) look again. Why it matters: Without moving, you miss things behind objects, off-screen, or too small to see clearly. Anchor: If the book is on the sofa but hidden by a pillow, stepping to the side can reveal it.
Hook: Imagine wearing a head camera in a video game house. You can walk and turn to inspect objects.
The Concept (Embodied Agent): An embodied agent is an AI that lives inside a 3D world and can move like you do: turn, walk, and look around. How it works: (1) See from a current viewpoint, (2) choose an action to move or turn, (3) get a new view. Why it matters: Without a body (or body-like controls), the AI can't fix bad views. Anchor: The agent turns right 60°, walks 120 cm, then looks left 30° to see what's on the sofa.
The world before this research: VLMs got very good at answering from single images. Many systems tried to handle harder tasks like Embodied Question Answering (EQA), where a robot moves around a house to answer questions, by focusing on exploring large areas, building heavy 3D memories, and using lots of common-sense knowledge. These are useful, but they often skipped a crucial step: once near the target, how do you make the last precise move to actually see the thing?
Hook: You know how you can explore a museum, but if you don't step a little closer to read a tiny label, you won't know what the artwork is called.
The Concept (EQA, Embodied Question Answering): EQA asks a moving agent to explore, remember, reason, and perceive to answer questions. How it works: (1) Explore the scene, (2) build some memory, (3) reason with language, (4) finally look closely to see the crucial detail. Why it matters: If the final look is sloppy, the answer will still be wrong, no matter how much you explored. Anchor: After walking to the kitchen, the agent still needs to lean or turn to see the stove knobs clearly.
The problem researchers faced: Most methods used either (a) 2D tricks like cropping or zooming inside the same image (which can't reveal what's off-screen), or (b) coarse, discrete moves like turn-left/turn-right that are too clumsy for the final precise view. Others skipped learning entirely and just prompted a big model to guess moves, which didn't generalize well. Also, navigation setups with long, multi-step paths made it hard to define and supervise the "best next view."
What was missing: A simple, learnable skill that, from just the current image and question, picks one really good next viewpoint, smoothly and precisely, without scene memory or outside knowledge.
The gap this paper fills: It creates a new task, a dataset, and a training recipe so the agent can learn "the last precise step" of seeing: Visually-Grounded Active View Selection (VG-AVS). It predicts a single, continuous action (how much to turn, how far to move forward, and how much to fine-turn the view) to bring the target into clear sight.
Real stakes in everyday life: Home robots that can check if the stove is on, count mugs on a shelf, or find your keys need this final, careful view. AR glasses that answer questions about your surroundings must nudge your view just right. Warehouse bots must confirm labels or switches without guessing. In short, knowing when and how to move your viewpoint transforms guessing into knowing.
02 Core Idea
Hook: Imagine you're playing I-Spy in a room and can't see the object. Instead of guessing, you take one step and turn your head just enough to reveal it.
The Concept (VG-AVS, Visually-Grounded Active View Selection): VG-AVS is a method that learns to choose the single most informative next view, using only the current picture and the question. How it works: (1) Read the question and current image, (2) decide a continuous action: heading rotation, forward distance, and final view rotation, (3) move once to the new spot, (4) let a verifier model answer from the improved view. Why it matters: Without this, AIs stick with bad angles, missing tiny or hidden details. Anchor: Question: "Is there a book on the sofa?" The agent turns right 54°, walks 136 cm, then turns left 79°, revealing the book.
The "Aha!" in one sentence: One precise, learned, smooth move, picked from the current view alone, often suffices to uncover the missing evidence to answer a question.
Three analogies:
- Photographer's hop: A pro photographer takes a half-step and tilts the camera to remove glare; suddenly the text is readable.
- Detective's peek: A detective leans around a doorway to see the object that was just out of view.
- Librarian's nudge: A librarian slides one shelf over and then turns slightly to read a spine title that was hidden.
Before vs. After:
- Before: AI stared at a single picture and tried to guess; or it used clunky, discrete steps or 2D crops that couldn't reveal off-screen details.
- After: AI smoothly chooses one best, continuous move to expose the needed evidence and then answers more accurately.
Why it works (intuition, not equations):
- When you can't see enough, the right motion reduces uncertainty the fastest. A single, purposeful move is more efficient than many small, blind guesses.
- Making actions continuous (any angle, any distance within range) lets the model learn precise adjustments instead of settling for coarse approximations.
- Training in two stages (first imitate known-good moves with SFT, then refine by reward with RL) grounds the model and then sharpens it.
- A separate "verifier" model judges whether the new view truly supports the correct answer, so the learning signal is tied to seeing, not just guessing.
Building blocks (each with a sandwich explanation):
Hook: You know how a video game level lets you practice moves safely. The Concept (ProcTHOR and the AVS Dataset): ProcTHOR is a simulated 3D home world; the AVS dataset pairs a starting view (missing info) with a target view (answerable) and a question. How it works: (1) Pick a target object and its supporting surface, (2) render a target view where the object is clearly visible, (3) render a query view where clues are visible but the object is hidden or tiny, (4) generate a matching question. Why it matters: Without such pairs, it's hard to teach the model which move reveals the missing detail. Anchor: Query shows a dining table but not the laptop; target clearly shows the laptop on the table; the question asks if a laptop is on the dining table.
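To make the pairing concrete, here is a minimal sketch of what one AVS-style record might contain once the views are rendered. It assumes, purely for illustration, that each example stores the two view images, the one-step ground-truth action between them, and a templated question; the field names and file paths are hypothetical, not the released dataset format.

```python
from dataclasses import dataclass

@dataclass
class AVSExample:
    query_image: str   # path to the partial view (clues visible, target hidden or tiny)
    target_image: str  # path to the answerable view (target clearly visible)
    head_deg: float    # ground-truth heading rotation
    fwd_cm: float      # ground-truth forward distance
    view_deg: float    # ground-truth final view rotation after moving
    question: str
    answer: str

def existence_question(target: str, support: str) -> str:
    # Illustrative template; the dataset also covers counting and state questions.
    return f"Is a {target} present on the {support}?"

example = AVSExample(
    query_image="scenes/proc_0412/query.png",    # hypothetical file layout
    target_image="scenes/proc_0412/target.png",
    head_deg=62.0, fwd_cm=125.0, view_deg=-25.0,
    question=existence_question("laptop", "dining table"),
    answer="yes",
)
print(example.question)
```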
Hook: Think of a volume slider instead of an on/off switch. The Concept (Continuous Action Space): It lets the agent choose any turn angle and any forward distance, plus a final fine-turn of the head. How it works: (1) Pick heading rotation (left/right), (2) move forward by a chosen distance, (3) adjust final view rotation to point exactly at the target. Why it matters: Without smooth control, you stop just short of the perfect view or overshoot it. Anchor: Turn +62°, walk 125 cm, then look -25° to center the dresser.
Hook: Imagine a coach showing you the exact steps to solve a math problem. The Concept (Supervised Fine-Tuning, SFT): SFT teaches the model using ground-truth moves from query to target view. How it works: (1) Show the image and question, (2) provide the correct action numbers, (3) train the model to output those numbers. Why it matters: Without SFT, the model may not learn reasonable magnitudes or directions. Anchor: On many examples, the model learns that "to see the sink faucet, turn left ~60°, walk ~130 cm, then look slightly right."
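As a rough illustration of how such ground-truth moves become supervision, the sketch below serializes one (image, question, action) triple into a prompt and a tagged target string. The prompt wording and data layout are assumptions for this article, not the paper's released format.

```python
def sft_target_text(head_deg: float, fwd_cm: float, view_deg: float) -> str:
    # Tagged action string, mirroring the <head>/<fwd>/<view> anchors used in this article.
    return (f"<head> {head_deg:.0f} </head> "
            f"<fwd> {fwd_cm:.0f} </fwd> "
            f"<view> {view_deg:.0f} </view>")

def sft_prompt(question: str) -> str:
    # Hypothetical instruction wording; the key idea is "predict only the action numbers".
    return (f"Question: {question}\n"
            "Predict one action that moves the camera to a view from which the "
            "question can be answered. Output only <head>, <fwd>, and <view> tags.")

training_pair = {
    "image": "scenes/proc_0412/query.png",   # illustrative path to the query view
    "prompt": sft_prompt("Is a laptop present on the dining table?"),
    "target": sft_target_text(60, 130, -20),
}
print(training_pair["target"])
```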
Hook: Training a puppy: do the trick, get a treat. The Concept (Reinforcement Learning (RL) with a Verifier): RL improves the model by rewarding moves that make the answer correct. How it works: (1) The model reasons and proposes an action, (2) the environment gives a new view, (3) a frozen VLM verifier tries to answer, (4) if correct, give reward; if not, no reward; (5) update the policy to favor better moves. Why it matters: Without RL, the model can't self-improve beyond fixed examples. Anchor: If the new view lets the verifier correctly say "The stove is off," that action gets a reward.
Hook: A fair judge marks your answer right or wrong. The Concept (VLM Verifier as Reward): A frozen VLM checks if the question is answerable and correct from the new view. How it works: (1) Feed the new image and question, (2) get yes/no or a number/state, (3) convert correctness into reward. Why it matters: Without a trustworthy judge, RL doesn't know which moves help. Anchor: The verifier says "Book: D" (the correct choice), giving a reward of 1; otherwise 0.
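A minimal sketch of such a reward is below, assuming the frozen verifier is exposed as a callable that returns an answer string (a hypothetical interface) and that a small formatting bonus keeps outputs parseable, as described in the RL stage later on. The weighting and the exact-match check are illustrative, not the paper's exact recipe.

```python
def format_reward(model_output: str) -> float:
    # 1 if all three action tags are present, else 0 (a crude well-formedness check).
    tags = ("<head>", "</head>", "<fwd>", "</fwd>", "<view>", "</view>")
    return 1.0 if all(t in model_output for t in tags) else 0.0

def correctness_reward(verifier, new_image, question: str, gt_answer: str) -> float:
    # The frozen verifier answers from the post-move view; exact match works for
    # closed-form answers (multiple choice, counts, on/off states).
    predicted = verifier(new_image, question)          # hypothetical callable
    return 1.0 if predicted.strip().lower() == gt_answer.strip().lower() else 0.0

def total_reward(model_output, verifier, new_image, question, gt_answer,
                 format_weight: float = 0.1) -> float:
    # The 0.1 weighting is illustrative; the paper's exact mix may differ.
    return (correctness_reward(verifier, new_image, question, gt_answer)
            + format_weight * format_reward(model_output))
```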
Put together, VG-AVS turns passive looking into active seeing: from a single, partial view and a question, it computes one precise, continuous move that reveals exactly what's needed to answer.
03 Methodology
At a high level: Input (current image + question) → Predict one continuous action (heading rotation, forward distance, final view rotation) → Move and capture the new image → Verifier answers the question from the improved view.
Step-by-step (with sandwich explanations for key parts):
- Inputs: the current egocentric image and the question
- What happens: The agent reads the question (e.g., "Is a laptop present on the dining table?") and looks at the current view.
- Why this step exists: Without the question, the agent can't know what to look for; without the view, it can't know what's missing.
- Example: The table is visible but the laptop isn't; the question suggests turning and moving to reveal the table's top.
Hook: Think of steering a remote-control car and then turning its camera. The Concept (Action Space with Three Numbers): The action has three parts: heading rotation (how much to turn your body), forward distance (how far to move), and final view rotation (how much to fine-turn your head after moving). How it works: (1) Choose heading rotation (-180° to 180°), (2) move forward some centimeters, (3) choose final view rotation (-180° to 180°) to point at what matters. Why it matters: Without this sequence, you can't both get to the right place and aim your view precisely. Anchor: "<head> 62 </head> <fwd> 125 </fwd> <view> -25 </view>" moves you near a dresser and centers it.
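Because the policy emits its action as tagged text, a thin parsing-and-clamping layer is needed before execution. The sketch below is one way to do it; the regex, the 300 cm forward cap, and the failure handling are assumptions rather than the paper's code.

```python
import re

def parse_action(text: str, max_fwd_cm: float = 300.0):
    # Extract the three tagged numbers; the 300 cm forward cap is an illustrative bound.
    values = []
    for tag in ("head", "fwd", "view"):
        match = re.search(rf"<{tag}>\s*(-?\d+(?:\.\d+)?)\s*</{tag}>", text)
        if match is None:
            return None                      # malformed output: no executable action
        values.append(float(match.group(1)))
    head, fwd, view = values
    # Clamp to the continuous ranges described above.
    head = max(-180.0, min(180.0, head))
    fwd = max(0.0, min(max_fwd_cm, fwd))
    view = max(-180.0, min(180.0, view))
    return head, fwd, view

print(parse_action("<head> 62 </head> <fwd> 125 </fwd> <view> -25 </view>"))
# (62.0, 125.0, -25.0)
```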
- Supervised Fine-Tuning (SFT): learn from ground-truth moves. Hook: Like copying a perfect dance move from a coach before freestyling. The Concept (SFT): SFT teaches the model to output the exact action that transforms the query view into the target view. How it works: (1) The dataset provides pairs: a query view (partial) and a target view (answerable), (2) the ground-truth action between them is computed analytically (a geometric sketch appears at the end of this section), (3) the model is trained to output those numbers in tagged text. Why it matters: Without SFT, the model's first guesses may be too wild or inconsistent. Anchor: If the target spot is 1.3 m away and 60° to the right, SFT makes the model predict about (head = +60°, fwd ≈ 130 cm, view ≈ -20°).
- What happens practically: The model is prompted to predict only the action numbers (inside tags like <head>, <fwd>, <view>) given the current image and an action-style instruction.
- Example with data: The query shows the dining table's edge; the target shows the laptop on top; SFT teaches the exact one-step move that reveals the laptop.
- Reinforcement Learning (RL): refine by rewards using a verifier. Hook: After learning the basics, you practice and get points when you do it right. The Concept (RL with a Verifier Reward): The model now produces a short "think-then-act" output: it reasons, proposes an action, the environment renders the new view, and a frozen VLM verifier tries to answer. How it works: (1) Sample an action from the model, (2) execute it, (3) if the verifier answers correctly, reward = 1, else 0; also add a formatting reward so outputs stay valid, (4) update the policy with GRPO to prefer higher-reward actions (a group-relative advantage sketch follows this step). Why it matters: Without RL, the model can't surpass imitation; with it, the model tailors actions to what actually leads to correct answers. Anchor: For "Is the lamp on?", if the new view shows the lamp clearly and the verifier correctly says it is off, that move gets reinforced.
- What breaks without this step: The model may memorize typical moves but fail on new layouts; RL encourages generalization by rewarding real success.
- Example with actual numbers: The model thinks: "The lamp is to the left; rotate -54°, move 120 cm, then look +30°." If the verifier now answers correctly, that action gets a positive update.
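The GRPO update mentioned in this step scores each sampled action relative to the other samples drawn for the same query view and question. A minimal sketch of that group-relative advantage, with the policy-gradient bookkeeping omitted, is shown here.

```python
def grpo_advantages(rewards, eps: float = 1e-6):
    # rewards: scalar rewards for a group of actions sampled for the SAME query
    # image and question; each action is scored relative to its own group.
    mean_r = sum(rewards) / len(rewards)
    var_r = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std_r = var_r ** 0.5
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: four sampled moves for one query; the two that led the verifier to a
# correct answer get positive advantages, the others negative.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```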
- One-step continuous move instead of many discrete steps
- What happens: The policy picks a single, precise move to jump directly to an answerable view.
- Why this step exists: Multi-step navigation is harder to supervise and to optimize; a single, continuous step is simpler and often enough.
- Example: Instead of turn-left, move, turn-right, move, the model does one smooth combination: +70° turn, 140 cm forward, -35° final view.
- Dataset design that makes learning possible. Hook: Like a workbook that shows an unclear photo next to a crystal-clear one. The Concept (Query View, Target View, Supporting Object): The query view includes hints (like a table) but hides the target (like a laptop). The target view clearly shows the target. How it works: (1) Pick the target object (laptop) and its supporting object (table), (2) sample a target view where the laptop is big and centered, (3) sample a query view where the table is visible but the laptop is hidden or tiny, (4) attach a matching question. Why it matters: Without this contrast, the model can't learn what move reveals what's missing. Anchor: Query: only the table edge; Target: laptop on table; Q: "Is a laptop on the dining table?"
- Putting it all together (the secret sauce)
- The clever bits: (a) Precise, continuous action space for smooth final positioning; (b) Two-stage training (SFT then RL) so the model starts stable and then self-improves; (c) A frozen VLM verifier creates a reliable, automatic reward signal, with no human grading needed.
- Mini recipe summary: Input (image + question) → SFT-trained policy proposes an action → RL refines the policy using the verifier reward → Execute the action in the environment → Verifier answers from the improved view.
- Concrete run-through: Question: "Choose an object present on the sofa. A: Bottle B: Desktop C: Cloth D: Book." From the current view, the agent rotates right 54°, moves forward 136 cm, and rotates left 79°. The new view shows a book on the sofa; the verifier picks D.
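Before moving to the experiments, here is the analytic pose-to-action computation referenced in the SFT step above: given the query pose and the target pose, the heading rotation, forward distance, and final view rotation follow from basic planar geometry. The coordinate and yaw conventions in this sketch are assumptions made for illustration, not necessarily the paper's exact frame.

```python
import math

def wrap_deg(angle: float) -> float:
    # Wrap an angle to [-180, 180).
    return (angle + 180.0) % 360.0 - 180.0

def ground_truth_action(query_pose, target_pose):
    # Poses are (x, z, yaw): positions in meters, yaw in degrees measured
    # clockwise from the +z axis (assumed convention).
    (xq, zq, yaw_q), (xt, zt, yaw_t) = query_pose, target_pose
    dx, dz = xt - xq, zt - zq
    bearing = math.degrees(math.atan2(dx, dz))   # direction toward the target spot
    head = wrap_deg(bearing - yaw_q)             # 1) turn the body toward it
    fwd_cm = math.hypot(dx, dz) * 100.0          # 2) walk straight to it
    view = wrap_deg(yaw_t - bearing)             # 3) fine-turn to the target heading
    return head, fwd_cm, view

# Example matching the SFT anchor: the target spot is 1.3 m away, 60° to the
# right, and the final view turns 20° back to the left of the walking direction.
print(ground_truth_action((0.0, 0.0, 0.0), (1.126, 0.65, 40.0)))
# roughly (60.0, 130.0, -20.0)
```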
04 Experiments & Results
Hook: Imagine a school contest: who can take one step and then see the answer clearly?
The Concept (The Test): The researchers measured how often the final answer was correct after taking one learned, continuous action. How it works: (1) Start from a partial view and a question, (2) predict one action, (3) render the new view, (4) let a verifier answer, (5) score correctness. Why it matters: Without measuring the final answer, we can't tell if the view adjustment really helped. Anchor: If "Is the stove on?" becomes answerable after one move, that's a win.
Hook: Think of two playing fields: a practice field (simulated homes) and a real stadium (real houses). The Concept (Datasets: AVS-ProcTHOR and AVS-HM3D): AVS-ProcTHOR is a synthetic benchmark with multiple question types (existence, counting, state). AVS-HM3D uses real indoor scenes and open-ended questions. How it works: (1) In ProcTHOR, views and objects are controllable and labeled; (2) in HM3D, scenes are real, so query views are crafted to be partial but hintful; (3) the model's job is to pick a move that reveals the answer. Why it matters: Success on both shows the method generalizes beyond simulation. Anchor: In ProcTHOR, you might count mugs on a table; in HM3D, you might check if a lamp is on in a real apartment.
Hook: When answers aren't just yes/no, you need a fair grader. The Concept (LLM-Match Metric): LLM-Match uses a language model to grade open-ended answers on a 1-5 scale by comparing them to the ground truth. How it works: (1) The system's answer and the correct answer are sent to a grader LLM, (2) the grader assigns a score (1-5), (3) average scores summarize performance. Why it matters: Without a nuanced grader, close-but-not-exact answers in real scenes would be unfairly judged. Anchor: If the model says "two chairs" and the truth is "2," LLM-Match should give full credit.
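A rough sketch of how LLM-Match-style grading can be implemented is below, assuming a hypothetical grader_llm callable that takes a prompt string and returns text; the rubric wording and the fallback score are illustrative rather than the benchmark's exact protocol.

```python
import re

RUBRIC = (
    "Rate how well the candidate answer matches the reference answer on a 1-5 "
    "scale (5 = equivalent, 1 = unrelated). Reply with a single integer.\n"
    "Question: {q}\nReference answer: {ref}\nCandidate answer: {cand}\nScore:"
)

def llm_match_score(grader_llm, question: str, reference: str, candidate: str) -> int:
    reply = grader_llm(RUBRIC.format(q=question, ref=reference, cand=candidate))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1    # fall back to the lowest score

def average_llm_match(grader_llm, examples) -> float:
    # examples: iterable of (question, reference_answer, candidate_answer) triples.
    scores = [llm_match_score(grader_llm, q, ref, cand) for q, ref, cand in examples]
    return sum(scores) / len(scores)
```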
The competition (baselines and rivals):
- No Action baselines: feed either the query view (lower bound) or the ideal target view (upper bound) to the verifier.
- Open models: Qwen2.5-VL-7B and spatially-trained VLMs.
- EQA system: Fine-EQA (focuses on exploration/navigation rather than final precise view).
- Proprietary giants: GPT-5 and Gemini-2.5-Pro.
The scoreboard with context:
- On AVS-ProcTHOR, the SFT+RL version achieves around 91.5% accuracy on existence questions and strong averages across counting and state questions, beating open baselines and even large proprietary models. Think of it as getting an A+ while others hover around a B or C.
- On AVS-HM3D (real scenes), its average LLM-Match score (~70.7) surpasses GPT-5 (~64.9). That's like winning the away game in the big-league stadium.
- RL-only and SFT-only each help, but the combined SFT→RL training works best overall, giving a consistent edge.
- Plug-and-play with Fine-EQA: Adding the module after exploration boosts its average score (e.g., from about 52.9 to 57.7), showing that even a strong navigator benefits from a precise final view.
Surprising findings:
- One careful, continuous move is often enough. Multi-turn sequences did not clearly improve results, suggesting a well-aimed single step does the job.
- A small, synthetic training set, if curated well, transfers to real homes and varied question types.
- Bigger isn't always better: a well-trained 7B model with active view selection can outperform much larger, general-purpose models on this task.
- RL alone was unstable; SFT alone plateaued; together, they delivered the reliable gains.
Takeaway: Training a model to make one smart, smooth move, from just the current view and question, substantially boosts answer accuracy across synthetic and real environments.
05 Discussion & Limitations
Limitations:
- Synthetic-first training: The core training set is simulated; although transfer to real scenes is strong, unusual real-world clutter, lighting, or occlusions may still trip it up.
- Single-step assumption: The method thrives when one move suffices; some tasks may need two or more steps (e.g., long hallways or multiple corners).
- Fixed camera height/elevation: The current setup simplifies vertical control; scenes requiring tilting up/down or stepping onto platforms aren't addressed.
- No long-term memory: The system uses only the current view and question; it doesn't remember previous views or build a scene map.
- Reliance on a verifier: Reward quality depends on a frozen judge. If the verifier misjudges, learning signals can be noisy.
Required resources:
- A 3D simulator (e.g., ProcTHOR or an equivalent environment) to render views after actions.
- A capable VLM backbone that can output structured number tokens.
- Compute for SFT and RL (GRPO) training, plus the frozen verifier calls during RL.
- A curated dataset with queryâtarget view pairs and question templates.
When NOT to use:
- Long-horizon navigation tasks where the agent must plan across many rooms; this method targets the last precise step rather than global exploration.
- Tasks requiring domain knowledge beyond vision (e.g., "Which cup belongs to Alex?") or where the answer isn't visually verifiable.
- Scenarios needing significant vertical camera changes or non-wheeled motion (stairs, ladders) unless the action space is extended.
Open questions:
- Multi-step refinement: Can we chain two or three learned continuous steps while keeping stability and simplicity?
- Richer rewards: How to blend verifier success with visibility metrics (e.g., object size, occlusion) without hurting generalization?
- Broader actions: Can we add vertical motion or zoom while keeping training stable?
- Memory-light context: Can a short-term episodic memory boost performance without heavy scene graphs?
- Generalist backbones: How does this plug-in perform atop even stronger VLMs, and can joint training make both stronger?
06 Conclusion & Future Work
Three-sentence summary: This paper reframes visual question answering as an active seeing problem and proposes VG-AVS, which picks one precise, continuous move to reveal missing visual evidence. It trains a vision-language model in two stages, first by imitation (SFT) and then by reward (RL), using a curated dataset of paired views and a frozen verifier for feedback. The result is a compact, transferable skill that improves answers in both synthetic and real homes and even boosts existing embodied QA pipelines.
Main achievement: Showing that a single, well-learned, continuous action, grounded only in the current view and the question, can significantly improve accuracy and outperform larger passive models.
Future directions: Extend the action space (vertical moves, zoom), explore short multi-step policies, design richer verifiable rewards, and combine with lightweight memory or mapping. Test on broader real-world domains (warehouses, hospitals) and integrate with stronger backbones.
Why remember this: It captures a simple human truth (when you can't see enough, move) and turns it into a precise, learnable skill for AI. By upgrading passive looking into active seeing, VG-AVS points the way toward practical, ambulatory vision in everyday robots and assistants.
Practical Applications
- Home robots confirming whether the stove is off or the lamp is on by taking one precise step to see clearly.
- Inventory drones moving just enough to read labels or count items on shelves without full re-navigation.
- AR glasses nudging the wearer's view to reveal the sign or object needed to answer a question.
- Assistive robots positioning themselves to read medicine labels or appliance settings for accessibility.
- Security patrol bots adjusting angle and distance to verify door states, panel lights, or safety markers.
- Inspection robots in factories taking a single, well-aimed move to check gauges or indicator LEDs.
- Service robots in hotels or hospitals peeking around obstacles to confirm room states or item placements.
- Educational robots demonstrating how a small change in viewpoint can make hidden details visible.
- Household inventory systems counting mugs or books by refining camera position from partial views.