VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs
Key Summary
- Real-life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.
- They call this task Interactive Instance Goal Navigation (IIGN): the robot both moves and chats to clear up confusion.
- They build VL-LN Bench, a large benchmark with 41,891 dialog-augmented paths and an oracle that answers the robot’s questions automatically.
- The oracle answers three kinds of questions: Attribute (what the object is like), Route (how to get closer), and Disambiguation (is this the right one?).
- A special data pipeline merges room info into full-house maps, pairs start points with target objects, and collects long explorations using frontier-based strategies.
- They train dialog-enabled navigation models and show that asking questions improves success compared to baselines that don’t ask.
- Their best model (VLLN-D) raises success to 20.2% on IIGN and 25.0% on IGN, while also reducing getting-lost errors.
- They introduce a new metric, Mean Success Progress (MSP), which rewards getting better faster with fewer questions.
- Most remaining failures (about 73%) come from poor image–attribute alignment: the robot can’t reliably match words like color or material to what it sees.
- This benchmark shows how smart questioning plus strong visual grounding can make home and warehouse robots more practical.
Why This Research Matters
In real homes, schools, and warehouses, people give short, fuzzy directions, so robots must learn to ask quick, helpful questions. VL-LN Bench shows that a little dialog can cut down wandering and wrong picks, making assistants faster and more reliable. This matters for safety (picking the right medicine bottle), productivity (grabbing the correct package version), and accessibility (helping someone find their exact belongings). The benchmark’s automatic oracle and clear metrics let researchers test new ideas without costly human-in-the-loop trials every time. By highlighting the main bottleneck—visual grounding of attributes—it also points the field to where progress will pay off most. Over time, smarter questioning plus stronger vision will make everyday robots genuinely useful.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a friend might say, “Go find the red book,” but there are five red books on different shelves? You’d probably ask a follow-up like, “Which shelf?” or “Next to what?”
🥬 The Concept (Navigation): Navigation is how a robot figures out where to go and how to get there safely. How it works:
- See the world (camera, depth).
- Build a plan (which way to move).
- Move step by step and re-check. Why it matters: Without navigation, a robot just spins in place or hits walls. 🍞 Anchor: A vacuum robot deciding how to reach the living room couch without bumping into a table.
🍞 Hook: Imagine a treasure hunt where any coin counts as a win. 🥬 The Concept (Object Navigation, ObjectNav): ObjectNav asks a robot to find any object of a certain category (like any chair). How it works:
- Read: “Find a chair.”
- Explore rooms.
- Stop at the first chair it sees. Why it matters: It’s easier because any chair works; no need to find one special chair. 🍞 Anchor: If your parent says “Bring me a spoon,” any spoon from the drawer will do.
🍞 Hook: Now imagine your friend says, “Bring me my favorite red swivel chair near the computer,” not just any chair. 🥬 The Concept (Instance Goal Navigation, IGN): IGN asks the robot to find one specific object instance (that exact chair), not just any in the category. How it works:
- Read the full, detailed description.
- Explore more widely to spot the right attributes and location.
- Stop only when the exact instance is confirmed. Why it matters: Real people often want a particular item, not just a type. 🍞 Anchor: Grabbing your exact backpack (with a star sticker) from a pile, not just any backpack.
🍞 Hook: When you’re unsure in class, you raise your hand to ask a question to get unstuck faster. 🥬 The Concept (Dialog-enabled navigation): A robot that can talk (ask/answer) while moving to clarify instructions and get hints. How it works:
- Notice uncertainty (too many lookalikes or too many hallways).
- Ask a targeted question.
- Use the answer to refine where to go next. Why it matters: Without dialog, robots waste time wandering or choose the wrong object. 🍞 Anchor: “Is your laptop the silver one on the wooden desk?” If yes, the robot stops; if no, it keeps looking.
The world before: Many navigation systems assumed clear, perfect instructions. Research mainly focused on ObjectNav (find any chair) or gave IGN agents full, exact descriptions. But real people often give short, fuzzy directions like “Find my computer,” which doesn’t say which one or where.
The problem: Robots need to handle ambiguity, with multiple similar items to tell apart across large, house-scale spaces. They must explore efficiently and confirm the right object without wasting steps.
Failed attempts:
- Passive following: Agents only followed long, human-written dialogs (good for understanding, not for asking the right questions).
- Small-room tasks: Some systems worked in tiny rooms but didn’t scale to full houses.
- Narrow guidance: Oracles that only described the object, not how to move through the house.
- Too little data: Few large datasets to train agents that both explore and ask well.
The gap: We needed a benchmark where an agent can actively ask different kinds of questions (about attributes, routes, and “is this the one?”) across long house-scale scenes, with enough training data and an automatic way to evaluate without humans every time.
Real stakes: Home helpers finding a person’s exact glasses; warehouse bots picking the correct box version; AR assistants guiding you to your specific classroom desk; elderly care robots confirming the right medicine bottle. In all these, a quick, smart question can save time and avoid mistakes.
02 Core Idea
🍞 Hook: Imagine you’re in a giant library with only the clue “Find my book.” You’d ask: “What’s the cover color?” “Which floor?” “Is this it?”
🥬 The Concept (Interactive Instance Goal Navigation, IIGN): IIGN lets a robot both navigate and actively talk to an oracle to find a specific object in a big, confusing place. How it works:
- Start with a vague instruction (like “Find the chair”).
- Explore and ask Attribute, Route, or Disambiguation questions.
- Use answers to narrow choices and move smarter. Why it matters: Without questions, the robot wastes time or picks the wrong lookalike. 🍞 Anchor: “What color is the chair?” “Dark gray.” “Which way first?” “Go past the brown table, then right.” “Is this the one?” “Yes.”
Aha! moment in one sentence: Let the robot shrink uncertainty by asking the right question at the right time while it moves.
Three analogies:
- Detective story: The robot is a detective asking witnesses (the oracle) about clues (color/material) and the streets to take.
- GPS with chat: Not just maps; it also answers, “Turn where?” and “Is this the destination?”
- Hot-and-cold game: Each answer makes the search zone smaller and the moves more direct.
🍞 Hook: Think of two sports strategies—running blindly vs. asking the coach where to head. 🥬 The Concept (VL-LN Benchmark): VL-LN is a large dataset and evaluation setup that trains and tests chatty navigators across full houses. How it works:
- Build rich house-level labels and relations.
- Auto-generate many episodes and starting points.
- Collect long explorations with dialog (41k+).
- Evaluate with an oracle that answers consistently. Why it matters: Without a big, realistic benchmark, agents don’t learn scalable asking + exploring. 🍞 Anchor: It’s like a league with lots of games, fair referees (oracle), and clear scoreboards (metrics) so teams (agents) can actually improve.
Before vs. After:
- Before: Agents either had perfect descriptions or stayed silent in big spaces, often getting lost.
- After: Agents can ask focused questions, follow short route tips, and confirm targets in house-scale scenes.
Why it works (intuition): Each question cuts down uncertainty. Attribute answers filter out wrong lookalikes. Route tips skip dead-ends. Disambiguation locks in the final choice. Together, fewer guesses + better moves = higher success.
Building blocks:
- House-level annotations from many rooms.
- A frontier-based explorer to collect realistic long paths.
- An oracle that answers three question types consistently.
- A new progress metric (MSP) that rewards getting better faster with fewer questions.
- Models trained on lots of dialog-augmented runs to learn when and what to ask.
03 Methodology
At a high level: Ambiguous instruction + camera/depth → Explore and decide if/what to ask → Get oracle answers (attributes/route/confirm) → Navigate efficiently to the exact instance and stop.
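Before diving into the pieces, a minimal code sketch of that loop may help. Everything here (the agent, oracle, and env objects and their methods) is an illustrative stand-in rather than the benchmark's actual API; it only mirrors the flow described above: observe, maybe ask, update, move, stop.

```python
# A minimal sketch of the IIGN loop, assuming hypothetical `agent`, `oracle`, and `env` objects.
def run_episode(agent, oracle, env, max_steps=500):
    """Explore, optionally ask one of the three question types, and stop at the exact instance."""
    obs = env.reset()                                 # RGB-D frame + odometry
    for _ in range(max_steps):
        question = agent.maybe_ask(obs)               # None, "attribute", "route", or "disambiguation"
        if question is not None:
            answer = oracle.answer(question, agent.pose())
            agent.update_belief(answer)               # filter lookalikes or adopt the route hint
        action = agent.next_action(obs)               # e.g., move forward, turn, or "STOP"
        if action == "STOP":
            break
        obs = env.step(action)
    return env.success(agent.pose())                  # stopped within the target's success viewpoints
```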
🍞 Hook: Picture organizing a scavenger hunt: first map the house, then pick start spots and targets, then play while asking hints. 🥬 The Concept (House-level annotations): A complete house map with detailed objects, rooms, and their relationships. How it works:
- Merge room labels into a single house dictionary (objects + regions).
- Record attributes (color, material, placement) and room functions.
- Build a relation graph linking nearby objects (within 1 m). Why it matters: Without a full house view, attribute clues can’t uniquely point to the right instance. 🍞 Anchor: “Deep gray chair near the computer in the bedroom” only makes sense if you know where bedrooms, computers, and chairs are in the whole house.
Step 1: Scenes metadata processing
- Start from MMScan’s rich, hierarchical labels.
- Merge room-level info into house-level dictionaries for each scene.
- Build a spatial-relation graph (like a friendship map of objects) so relative clues (“near the TV”) help disambiguate lookalikes. What breaks without it: Attribute answers become too vague to separate twins (two similar chairs in different rooms).
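As a concrete illustration of that relation graph, here is a small sketch that links object instances whose centers sit within 1 m of each other. The object list format (dicts with id and center fields) is an assumption made for illustration; only the 1 m radius comes from the text.

```python
# Hedged sketch: build the spatial-relation graph from house-level metadata.
# `objects` is assumed to be a list of dicts like {"id": "chair_03", "center": [x, y, z]}.
import numpy as np

def build_relation_graph(objects, radius=1.0):
    """Link every pair of object instances whose centers lie within `radius` meters."""
    centers = np.array([o["center"] for o in objects])          # shape (N, 3)
    graph = {o["id"]: [] for o in objects}
    for i, obj in enumerate(objects):
        dists = np.linalg.norm(centers - centers[i], axis=1)    # distance to every other object
        for j, other in enumerate(objects):
            if i != j and dists[j] <= radius:
                graph[obj["id"]].append(other["id"])            # e.g., chair_03 -> computer_01
    return graph
```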
🍞 Hook: Imagine each game round: choose where you start, what you’re looking for, and when you’ve actually arrived. 🥬 The Concept (Episode generation): Define start pose, instruction, and valid viewpoints to mark success. How it works:
- Sample navigable start spots (from VLN-CE scenes + 18 hand-checked scenes).
- Create two instructions per target: partial (category only) and full (unique description via GPT-4o using house dictionaries + relations).
- Expand the target’s 3D box by 0.6 m and mark navigable points inside as success viewpoints (stop within 0.25 m). Why it matters: Without clear success zones, the robot might be “close” but never count as correct. 🍞 Anchor: “Stop within arm’s reach of the exact chair” is a precise rule the robot can follow.
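Here is a small sketch of how those success viewpoints and the stop rule might be computed; the helper names are hypothetical, while the 0.6 m box expansion and 0.25 m stop tolerance come from the description above.

```python
# Hedged sketch of success viewpoints and the stop check; only the 0.6 m / 0.25 m values are from the text.
import numpy as np

def success_viewpoints(box_min, box_max, navigable_points, margin=0.6):
    """Expand the target's axis-aligned 3D box by `margin` and keep navigable points inside it."""
    lo = np.asarray(box_min) - margin
    hi = np.asarray(box_max) + margin
    pts = np.asarray(navigable_points)                 # (N, 3) candidate standing spots
    inside = np.all((pts >= lo) & (pts <= hi), axis=1)
    return pts[inside]

def is_success(agent_pos, viewpoints, tol=0.25):
    """The episode counts as a success if the agent stops within `tol` meters of any viewpoint."""
    if len(viewpoints) == 0:
        return False
    dists = np.linalg.norm(viewpoints - np.asarray(agent_pos), axis=1)
    return bool(dists.min() <= tol)
```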
🍞 Hook: When exploring a new park, you often walk along the edges of what you’ve seen to push into the unknown. 🥬 The Concept (Frontier-based exploration): A way to pick next steps at the boundary between seen and unseen space. How it works:
- Find frontiers (edges of explored map).
- Usually go to the nearest frontier (90%).
- Occasionally bias toward the frontier closest to the target (10%). Why it matters: Without a frontier strategy, the robot wanders randomly or rechecks the same spots. 🍞 Anchor: It’s like always walking to the edge of your drawn map to extend it.
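The 90/10 rule above is simple enough to sketch directly. This is not the authors' code, just a plausible reading of it, treating the agent, target, and frontier positions as 2D map coordinates.

```python
# Minimal sketch of the 90/10 frontier-selection rule; frontier centroids are 2D map coordinates.
import random
import numpy as np

def choose_frontier(frontiers, agent_pos, target_pos, target_bias_prob=0.10):
    """Usually go to the frontier nearest the agent; occasionally pick the one nearest the target."""
    frontiers = np.asarray(frontiers)                  # shape (K, 2)
    if random.random() < target_bias_prob:
        ref = np.asarray(target_pos)                   # 10%: bias toward the target
    else:
        ref = np.asarray(agent_pos)                    # 90%: plain nearest-frontier exploration
    dists = np.linalg.norm(frontiers - ref, axis=1)
    return frontiers[int(np.argmin(dists))]
```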
Step 2: Collect dialog-rich trajectories
- Sensors: RGB-D camera + odometry.
- A ground-truth detector watches for the target; once found, the agent beelines and stops.
- Question triggers (sketched in code after this list):
  • Attribute: asked at the start (e.g., color, material, shape, placement).
  • Route: asked when the “best frontier” (toward target) is chosen.
  • Disambiguation: asked when a same-category candidate is centered within 3 m.
- Multiple phrasing templates per question type to keep dialogs varied. What breaks without triggers: The robot might over-ask or never ask, missing teachable moments.
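To make the trigger rules concrete, here is a minimal sketch of how the three conditions could be checked during data collection. The function and field names (pick_question, chose_target_biased_frontier, candidate) are illustrative assumptions, not the benchmark's actual code; only the conditions themselves (Attribute at the start, Route at the best-frontier step, Disambiguation when centered within 3 m) come from the text.

```python
# Hedged sketch of the three question triggers; names are illustrative, conditions follow the text.
def pick_question(step, chose_target_biased_frontier, candidate):
    """Decide which question type (if any) to log at this step of a collected trajectory."""
    if step == 0:
        return "attribute"          # asked once, at the very start of the episode
    if chose_target_biased_frontier:
        return "route"              # asked whenever the target-biased "best frontier" is chosen
    if candidate is not None and candidate["centered"] and candidate["distance"] <= 3.0:
        return "disambiguation"     # a same-category candidate is centered in view within 3 m
    return None                     # otherwise, keep exploring silently
```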
🍞 Hook: Think of a super-knowledgeable guide who can tell you what the object is like, how to get a bit closer, or confirm you’ve arrived. 🥬 The Concept (Oracle): A reliable answerer that knows the entire house and the exact target. How it works:
- Classify the agent’s question as Attribute, Route, or Disambiguation.
- Attribute: use instance metadata (through GPT-4o) to answer naturally.
- Route: compute shortest path, keep first 4 m, simplify into waypoints (at sharp turns/room changes), anchor to nearby objects (“at the brown table, turn right”), render into language.
- Disambiguation: answer “yes” if the target is centered and within 3 m; otherwise “no.” Why it matters: Without consistent, precise answers, training and evaluation would be noisy and unfair. 🍞 Anchor: “Go straight to the brown table, then turn right into the bedroom” gets translated into a few clear steps.
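A rough sketch of the oracle's Route and Disambiguation answers follows. The helpers shortest_path, nearest_landmark, and render_to_language are assumptions standing in for the path planner, the relation-graph lookup, and the language-rendering step; only the 4 m horizon and the centered-within-3 m rule come from the description above.

```python
# Hedged sketch of the oracle's Route and Disambiguation answers; helper functions are assumptions.
import numpy as np

def route_answer(agent_pos, target_pos, shortest_path, nearest_landmark, render_to_language,
                 horizon=4.0):
    """Keep roughly the first 4 m of the shortest path, then phrase it as a short instruction."""
    path = shortest_path(agent_pos, target_pos)        # list of (x, y) waypoints
    kept, travelled = [path[0]], 0.0
    for prev, cur in zip(path, path[1:]):
        travelled += float(np.linalg.norm(np.subtract(cur, prev)))
        kept.append(cur)
        if travelled >= horizon:
            break
    landmarks = [nearest_landmark(p) for p in kept]    # e.g., "the brown table"
    return render_to_language(kept, landmarks)         # e.g., "go to the brown table, then turn right"

def disambiguation_answer(target_centered, target_distance, max_dist=3.0):
    """Answer 'yes' only when the true target is centered in view and within 3 m."""
    return "yes" if target_centered and target_distance <= max_dist else "no"
```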
🍞 Hook: If you can’t match the clue “blue, shiny, metal” to what your eyes see, you’ll pick the wrong thing. 🥬 The Concept (Image–attribute alignment): Matching words like color/material/placement to the correct pixels in view. How it works:
- Read attributes.
- Detect and segment objects.
- Check if attributes fit the visual evidence. Why it matters: Without good alignment, the robot confuses lookalikes and stops at the wrong item. 🍞 Anchor: Two gray chairs, but only one is “near the computer”—the alignment step makes that distinction visible.
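To show what "checking if attributes fit the visual evidence" could look like operationally, here is a toy sketch. The scoring function vlm_match_score and the 0.5 threshold are placeholders for whatever model scores a crop against a phrase; the paper does not specify this mechanism.

```python
# Toy sketch of attribute filtering; `vlm_match_score` and the threshold are placeholders.
def filter_candidates(candidates, attributes, vlm_match_score, threshold=0.5):
    """Keep detected same-category candidates whose image crops match every stated attribute."""
    kept = []
    for cand in candidates:                            # each candidate carries a cropped detection
        scores = [vlm_match_score(cand["crop"], attr) for attr in attributes]
        if min(scores) >= threshold:                   # e.g., "deep gray", "fabric", "near the computer"
            kept.append(cand)
    return kept
```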
Training and models
- Baselines: FBE (zero-shot), VLFM (zero-shot), and three trained models (VLLN-O, VLLN-I, VLLN-D) built on Qwen2.5-VL-7B and InternVLA-N1 procedures, with different data mixes (ObjectNav, IGN, IIGN dialogs).
- Compute: 64×NVIDIA A800; 50–59 hours per run.
Secret sauce
- Three-question toolkit (Attribute/Route/Disambiguation) that slices uncertainty from different angles.
- House-level annotations + relation graph to make attribute answers truly discriminative.
- Short-route language instructions (first ~4 m) that agents can reliably follow.
- Large-scale, auto-collected dialogs that teach when and what to ask.
- A new progress metric (MSP) that values getting better with fewer questions.
04 Experiments & Results
🍞 Hook: When you take a test, it’s not just your final score—how quickly you improve with hints also matters. 🥬 The Concept (VL-LN evaluation): Test if agents can both move and chat to find a specific object in big houses (IIGN) and also handle full descriptions without dialog (IGN). How it works:
- Compare zero-shot vs. trained agents.
- Measure success, path efficiency, closeness to target, and gain from dialog. Why it matters: Without clear metrics, we can’t tell if asking questions truly helps. 🍞 Anchor: It’s like grading not only your answer but also how efficiently you solved the maze with a couple of clues.
Metrics explained simply:
- Success Rate (SR): Did you reach the exact object? Like getting the answer right.
- SPL (efficiency): Did you take a near-shortest route? Like finishing a test with minimal extra steps.
- Oracle Success (OS): Did you at least reach the right area? Like being in the correct classroom even if not at the right desk.
- Navigation Error (NE): How far off were you? Lower is better.
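For readers who prefer formulas, here is how these metrics are usually computed in embodied navigation (SPL follows the standard success-weighted-by-path-length definition); whether the benchmark reports them as fractions or percentages is not pinned down here.

```python
# Standard navigation metrics as commonly defined; scaling to percentages is left to the reader.
import numpy as np

def success_rate(successes):
    return float(np.mean(successes))                   # fraction of episodes that reach the target

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length: full credit only for succeeding via a near-shortest route."""
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_lengths, dtype=float)      # geodesic distance from start to target
    p = np.asarray(path_lengths, dtype=float)          # distance the agent actually travelled
    return float(np.mean(s * l / np.maximum(p, l)))

def navigation_error(final_distances):
    return float(np.mean(final_distances))             # meters from the target at stop; lower is better
```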
🍞 Hook: Imagine a points system that rewards you more if you improve quickly with only a few hints. 🥬 The Concept (Mean Success Progress, MSP): A score that averages how much success improves as you allow 1, 2, …, up to n dialog turns. How it works:
- Measure success with 0 turns (baseline).
- Allow more turns, record success each time.
- Average the gains over the budgets. Why it matters: Without MSP, a model that asks many tiny-help questions could look as good as one that asks one perfect question. 🍞 Anchor: Earning more credit for a single great hint than for five tiny hints that barely help.
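Based only on the description above, one plausible reading of MSP is the average success-rate gain over increasing dialog-turn budgets; the paper's exact formula may differ, so treat this as an illustration.

```python
# One plausible reading of MSP (average gain over turn budgets); the exact formula may differ.
def mean_success_progress(sr_by_budget):
    """`sr_by_budget[k]` = success rate when at most k dialog turns are allowed, for k = 0..n."""
    baseline = sr_by_budget[0]                         # success with no dialog at all
    gains = [sr_k - baseline for sr_k in sr_by_budget[1:]]
    return sum(gains) / len(gains) if gains else 0.0
```

For instance, under this reading, success rates of 10%, 14%, and 18% at budgets of 0, 1, and 2 turns would give an MSP of 6 points (the average of the 4-point and 8-point gains).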
The competition:
- Zero-shot: FBE (frontier explorer + detector), VLFM.
- Trained: VLLN-O (with ObjectNav data), VLLN-I (with IGN, no dialog), VLLN-D (with dialogs).
Scoreboard highlights (Test set):
- IIGN (must ask to succeed at scale): VLLN-D reaches 20.2% SR, SPL 13.07, OS 56.8%, NE 8.84, MSP 2.76. That’s like jumping from a D to a solid C by asking smart questions, while others hover lower (e.g., VLLN-I 14.2% SR).
- IGN (full descriptions, dialog still allowed): VLLN-D gets 25.0% SR, SPL 15.59, OS 58.8%, NE 7.99, MSP 2.16. Interpretation: Dialog gives consistent gains, especially in the more ambiguous IIGN setting.
Surprising/important findings:
- Biggest bottleneck: Image–attribute alignment. About 73% of failures are due to missing/misidentifying the correct instance under detailed attributes. In plain terms: seeing-but-not-understanding the right visual clues.
- Dialog reduces exploration failures. When dialog is enabled, getting-lost cases drop notably (e.g., IIGN: exploration fails 89→71; IGN: 84→46).
- Two-question sweet spot. Allowing two turns often gives the biggest gain; one turn is usually spent on the first attribute, leaving no room for a crucial follow-up.
- Humans are very efficient. Human–Human reaches 93% SR with only about two questions on average. Human–GPT is close in success (91%) but uses more turns (~9.7), showing the oracle is solid but people ask more tightly focused questions.
- Route tips work in practice. Agents trained with VLN data can translate short natural-language route hints into action, improving SPL.
Takeaway: Asking targeted questions makes a noticeable difference, but true reliability demands better vision-language grounding so the robot can tell near-twins apart.
05 Discussion & Limitations
Limitations:
- Visual grounding: The agent often fails to match attributes (color/material/placement) to the right pixels, causing most misses.
- Question quality: Agents don’t yet ask as strategically as humans; they underuse highly informative follow-ups.
- Oracle scope: The scripted oracle covers three question types and 4 m route segments; real humans can be broader and messier.
- Domain scale: Scenes are simulated MP3D houses; transfer to real homes requires robustness to sensor noise and clutter.
Required resources:
- Habitat-Sim environment + MP3D scenes with MMScan annotations.
- Oracle stack (rules + GPT-4o) and logging.
- Training compute (e.g., 64Ă—A800 GPUs) for large VLM finetuning.
When NOT to use:
- Hard real-time, low-power robots that can’t run VLMs or call an oracle.
- Safety-critical tasks where a wrong step is dangerous (no partial confirmations allowed).
- Settings with no language-capable user or no network (if oracle requires connectivity).
Open questions:
- How to train stronger image–attribute alignment? (e.g., hard negatives with many similar distractors.)
- How to choose the single most informative question from history and current view?
- How to extend beyond 4 m route tips into long-horizon language plans while staying reliable?
- How to generalize from simulated, clean scenes to messy, changing real homes?
- How to self-improve online: learn from mistakes and new dialogs without human relabeling?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces IIGN, where robots both move and chat to find a specific object in large houses, and VL-LN Bench, a big dataset plus oracle-based evaluation that makes training and testing this ability practical. By asking Attribute, Route, and Disambiguation questions, dialog-enabled agents reduce getting lost and choose the right instance more often, improving over strong baselines. A new metric, MSP, shows not just if dialog helps, but how efficiently it helps as turns increase.
Main achievement: Building the first large-scale, house-level benchmark that jointly trains and evaluates long-horizon navigation with active dialog, and demonstrating consistent gains from smart questioning.
Future directions: Sharpen image–attribute alignment with hard negatives and better grounding losses; learn question selection that pinpoints the biggest uncertainty; generalize route language beyond 4 m; bridge sim-to-real with noisy sensors and clutter; and co-train on richer, messier human dialogs.
Why remember this: It reframes navigation as an interactive game of shrinking uncertainty—ask a bit, move a bit, confirm—bringing practical home and workplace robots closer to being truly helpful.
Practical Applications
- Home assistant robots that ask one or two precise questions to fetch the exact item a person wants.
- Warehouse pickers that confirm the correct SKU (color/material/size) before packing, reducing returns.
- AR wayfinding apps that provide short, step-by-step route tips and ask clarifying questions in malls or campuses.
- Service robots in hotels or hospitals that navigate large buildings and verify the right room or equipment.
- Training curricula for robotics classes to teach exploration strategies and active dialog in simulation.
- Quality-control tests for vision-language models to improve attribute grounding using hard negative lookalikes.
- Game AI companions that guide players through complex maps with brief, natural language hints.
- Elderly care aids that confirm the correct medication or device and safely navigate cluttered homes.
- Developer evaluation harnesses that use the oracle to benchmark new dialog policies without expensive user studies.
- Field-robot prototypes that trial short-route language instructions for safer, more predictable moves.