
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Intermediate
Jing Zuo, Lingzhou Mu, Fan Jiang et al. · 1/20/2026
arXiv · PDF

Key Summary

  • FantasyVLN teaches a robot to follow language instructions while looking around, using a smart, step-by-step thinking style during training but not at test time.
  • It blends three kinds of reasoning—text-only, picture-only, and both together—inside one model without getting slow.
  • A Visual AutoRegressor (VAR) compresses “imagined” future pictures into just a few tokens, avoiding the thousands of tokens that usually bog models down.
  • A gating mechanism switches which kind of reasoning the model uses during training, so the same model can learn multiple styles.
  • A cross-mode alignment rule makes sure all reasoning styles agree on what action to take, so the model learns a single, stable decision policy.
  • At inference, the robot skips writing out its thoughts and directly maps instructions and images to actions, but still benefits from the reasoning it learned.
  • On the LH-VLN benchmark, FantasyVLN improves success and efficiency, while running about an order of magnitude faster than explicit multimodal CoT approaches.
  • Implicit reasoning (think during training, act directly at test) beats explicit reasoning (write out steps) for long tasks because it avoids error build-up.
  • Compressed visual Chain-of-Thought (CompV-CoT) in the VAR latent space trains faster and more stably than pixel-by-pixel image prediction.
  • This approach shows how to keep human-like reasoning benefits without the heavy token and time costs, making real-time navigation more practical.

Why This Research Matters

Robots that understand both language and vision can help in homes, hospitals, and warehouses, but they must think well and act fast. FantasyVLN shows how to keep the benefits of step-by-step reasoning without the slowdown of writing everything out during action time. By compressing imagined visuals and unifying different reasoning styles into one model, it delivers more accurate navigation and near-real-time speed. This balance reduces delays that could cause confusion, errors, or safety issues in real environments. The approach can transfer to other embodied tasks where both seeing and understanding instructions are crucial. It also points to a future where AI uses compact internal thoughts to stay smart yet quick. In short, it’s a practical recipe for real-world, reasoning-aware robots.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) You know how you follow directions like, “Go down the hallway, turn right at the blue door, then grab the red book”? You look, you think, and you act—over and over until you reach the goal.

🥬 Filling (The Actual Concept) What it is: Vision-and-Language Navigation (VLN) is when a robot follows natural-language instructions in a 3D world by using what it sees. How it works (step by step):

  1. Read the instruction (like “Bring toilet paper from the bedroom to the bathroom”).
  2. Look around with its cameras (left, front, right views plus history).
  3. Decide the next small action (move forward, turn left/right, or stop).
  4. Repeat until the robot reaches the goal or finishes all subtasks. Why it matters: Without VLN, robots can’t help in real homes, offices, or hospitals where directions come in words and environments are visually complex.

🍞 Bottom Bread (Anchor) Example: “Go to the kitchen and get a spoon.” The robot reads the sentence, scans the room, and takes step-by-step actions until it finds the kitchen drawer with spoons.
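
To make the loop concrete, here is a tiny Python sketch of that read-look-act cycle. The `env` and `policy` objects and their method names are hypothetical stand-ins for illustration, not an interface from the paper.

```python
# Minimal look-think-act loop for VLN; `env` and `policy` are hypothetical stand-ins.

ACTIONS = ["forward", "left", "right", "stop"]

def run_episode(env, policy, instruction, max_steps=200):
    history = []                       # past views give the policy context
    obs = env.reset()                  # e.g. {"left": ..., "front": ..., "right": ...}
    for _ in range(max_steps):
        action = policy.next_action(instruction, obs, history)
        if action == "stop":
            break
        history.append(obs)
        obs = env.step(action)         # move, then look again
    return env.task_completed()
```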

🍞 Top Bread (Hook) Imagine doing long math problems: if you only look at the final answer, you’ll likely mess up. But if you write your steps, you stay on track.

🥬 Filling (The Actual Concept) What it is: Chain-of-Thought (CoT) reasoning is the model writing out its thinking in small steps before deciding what to do. How it works:

  1. Break the task into subgoals.
  2. Describe what you currently see.
  3. Imagine what you’d see if you took certain actions.
  4. Choose the next action supported by those steps. Why it matters: Without CoT, models may skip important clues, like a doorway on the right that leads to the bathroom.

🍞 Bottom Bread (Anchor) Example: “Find the bedroom, pick up toilet paper, bring it to the bathroom.” CoT lists subgoals, notes the hallway and doors, imagines turning right to see a bathroom sign, then picks the next move.

🍞 Top Bread (Hook) Picture writing a report about every tiny thing you see during a long walk. Your notebook explodes with pages, and you can’t finish on time.

🥬 Filling (The Actual Concept) What it is: Token inflation is when generating detailed multimodal reasoning (text + images) creates thousands of tokens per step, slowing everything down. How it works:

  1. The model writes text thoughts.
  2. It also generates imagined images at each step.
  3. This quickly turns into 3,000–5,000 tokens per step. Why it matters: With so many tokens, training and inference become too slow for real-time navigation.

🍞 Bottom Bread (Anchor) Example: For 5–7 actions, producing full image descriptions and images turns a short plan into a phonebook-sized sequence, making the robot respond late.
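
A quick back-of-the-envelope count makes the problem concrete, comparing the rough per-step figure quoted above against the ~30 latent tokens per imagined view that the compressed approach (introduced later) relies on; the exact numbers below are illustrative.

```python
# Illustrative token budget for one short plan: explicit multimodal CoT vs. compressed latents.
# Per-step figures are rough ranges from the text, not measured values.
steps = 7                      # upper end of a 5-7 action plan
explicit_per_step = 4000       # ~3,000-5,000 tokens of written thoughts plus imagined images
compressed_per_step = 30       # ~30 VAR latent tokens per imagined view

print("explicit CoT:  ", steps * explicit_per_step, "tokens")    # 28000
print("compressed CoT:", steps * compressed_per_step, "tokens")  # 210
```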

🍞 Top Bread (Hook) Think of packing your suitcase: you squeeze bulky clothes into packing cubes so you can carry everything easily.

🥬 Filling (The Actual Concept) What it is: Latent space is a compressed, hidden way to store the most important parts of an image. How it works:

  1. An image is converted into a small set of numbers (tokens) that still capture the key details.
  2. These tokens can be used to predict, reason, or later reconstruct the image.
  3. You work with the small representation most of the time to be fast. Why it matters: Without latent space compression, imagined images take too many tokens and slow the robot down.

🍞 Bottom Bread (Anchor) Example: Instead of sending a whole 4K photo to your friend, you send a tiny, high-quality preview that still tells them what’s in the picture.

🍞 Top Bread (Hook) Imagine your brain learning from different teachers: one who talks (text), one who shows pictures (vision), and one who uses both. You want to learn from all without getting confused.

🥬 Filling (The Actual Concept) What it is: Unified Multimodal CoT (UM-CoT) is a single model that learns text-only CoT, visual-only CoT, and both together. How it works:

  1. Use a “gating” switch to pick text, vision, or both.
  2. Train the same model under all modes, sharing parameters.
  3. Align all modes so they agree on actions. Why it matters: Without unifying modes, you juggle separate models or get conflicting decisions.

🍞 Bottom Bread (Anchor) Example: One model can plan steps in text, imagine future visuals, or do both—yet still choose the same next action.

The World Before: Robots could follow short, simple instructions in easy scenes, but long, multi-stage missions were hard. Text-only CoT improved explanations but often missed exact spatial cues. Multimodal CoT added visuals but exploded token counts and slowed everything down.

The Problem: We needed robots to plan over long sequences, use both words and images, and still act in real time.

Failed Attempts: Text-only CoT overfit to specific reasoning scripts. Pixel-level visual imagination was too heavy. Separate models per mode didn’t agree.

The Gap: A way to keep CoT’s brainy benefits without writing out massive reasoning during inference.

Real Stakes: In homes, hospitals, and warehouses, delays or confusion can mean missed help, wasted time, or safety risks. We need thinking robots that are also fast.

02 Core Idea

🍞 Top Bread (Hook) You know how you show your work on homework to learn, but on the test you solve straight from memory—faster and still right?

🥬 Filling (The Actual Concept) What it is: The key insight is to train with rich, multimodal Chain-of-Thought (including compressed imagined visuals) but act without writing those thoughts at test time. How it works:

  1. During training, the model practices text-only CoT, visual-only CoT (in a compressed latent space), and both together.
  2. A gating switch tells the model which reasoning mode is active.
  3. A cross-mode alignment rule makes all modes land on the same action choices as the direct, no-CoT path.
  4. At inference, the model maps instruction + observations directly to actions—no extra tokens—yet it behaves as if it had reasoned. Why it matters: Without this, multimodal CoT is too slow and text-only CoT can miss spatial grounding.

🍞 Bottom Bread (Anchor) Example: The robot “thinks out loud” with text and compact visuals during practice, but in the real run it moves quickly and accurately without writing anything down.

Three Analogies:

  1. Training Wheels: Use training wheels (explicit CoT) to learn balance, then ride smoothly without them (implicit reasoning) on test day.
  2. Recipe to Habit: You read and visualize a recipe many times while learning; later you cook from intuition without re-reading every step.
  3. Maps to Mental Map: You study a detailed map (text + visuals) to learn a route; later you walk straight there using your mental map.

Before vs. After:

  • Before: Text-only CoT overfits and misses exact spatial clues; multimodal CoT is accurate but too slow.
  • After: One model learns all CoT flavors, compresses visual imagination, aligns decisions, and runs fast with no explicit CoT at test time.

Why It Works (intuition, no equations):

  • Practicing multiple reasoning styles teaches richer associations between words, places, and actions.
  • Compressing visuals into latent space keeps the key spatial bits but drops the heavy pixel baggage.
  • Aligning CoT modes to the direct path distills the “reasoning spirit” into the model’s internal representation, so it can act decisively without re-generating thoughts.

Building Blocks (each introduced with a sandwich):

🍞 Top Bread (Hook) Suppose you imagine what you’ll see after turning right—like previewing a mental snapshot.

🥬 Filling (The Actual Concept) What it is: Compact Visual Chain-of-Thought (CompV-CoT) is visual reasoning done in a tiny latent space rather than full images. How it works:

  1. Convert images into a handful of latent tokens.
  2. Predict future latent tokens to “imagine” what comes next.
  3. Use those imagined tokens to choose actions. Why it matters: Without compaction, visual imagining becomes too slow to be useful in real time.

🍞 Bottom Bread (Anchor) Example: Instead of drawing a whole picture, the robot predicts 30 smart numbers that capture the next view well enough to plan.

🍞 Top Bread (Hook) Think of a super-zip file that keeps pictures tiny but meaningful.

🥬 Filling (The Actual Concept) What it is: A Visual AutoRegressor (VAR) is the compressor that turns images into compact tokens and can reconstruct them later. How it works:

  1. Break the image into scales.
  2. Predict the next scale step-by-step to build a compact code.
  3. Reconstruct the image from the code when needed. Why it matters: Without VAR, you’d have to handle huge pixel sequences, slowing everything.

🍞 Bottom Bread (Anchor) Example: A 256×256 image becomes about 30 tokens—tiny but still useful for planning.
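
One way to picture how roughly 30 tokens can stand in for a whole view: VAR-style tokenizers encode an image as a coarse-to-fine pyramid of small token grids, predicting each finer grid from the coarser ones. The scale schedule below is an assumption chosen so the counts sum to 30; the paper's exact configuration may differ.

```python
# Illustrative multi-scale token budget for a VAR-style tokenizer.
# The scale schedule (1x1, 2x2, 3x3, 4x4 grids) is an assumption, not the paper's setting.
scales = [1, 2, 3, 4]
tokens_per_scale = [s * s for s in scales]     # [1, 4, 9, 16]
total = sum(tokens_per_scale)                  # 30 tokens for one 256x256 view

print(tokens_per_scale, "->", total, "tokens")

# Next-scale prediction: each coarse grid conditions the prediction of the next,
# finer grid, so the compact code is built autoregressively, scale by scale.
```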

🍞 Top Bread (Hook) Imagine a light switch board: Text ON/OFF, Visual ON/OFF.

🥬 Filling (The Actual Concept) What it is: The gating mechanism picks whether the model does text CoT, visual CoT, both, or neither. How it works:

  1. Use special tokens to flip text or visual reasoning on/off.
  2. The same network handles all modes.
  3. Training cycles through modes so the model becomes fluent in each. Why it matters: Without gates, you’d need separate models or tangled logic.

🍞 Bottom Bread (Anchor) Example: gT=1,gV=0 means text thoughts only; gT=0,gV=1 means visual imagination only; gT=1,gV=1 means both; gT=0,gV=0 means fast direct actions.
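
A small Python sketch of how the two gates could decide which training targets the shared model produces; the function and field names are placeholders, not the paper's implementation.

```python
# Which outputs each gate combination asks the shared model to produce.
# Names are placeholders for illustration.

def training_targets(g_text, g_visual, text_cot, visual_latents, actions):
    targets = []
    if g_text:                                # gT=1: write a text chain-of-thought
        targets.append(("text_cot", text_cot))
    if g_visual:                              # gV=1: predict compressed visual latents
        targets.append(("visual_cot", visual_latents))
    targets.append(("actions", actions))      # every mode ends with action tokens
    return targets

# (gT,gV)=(1,0): text + actions     (0,1): latents + actions
# (1,1): both + actions             (0,0): actions only (the fast path used at test time)
```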

🍞 Top Bread (Hook) Think of a choir singing in harmony—everyone follows the same tune even if they sing different parts.

🥬 Filling (The Actual Concept) What it is: Cross-Mode Alignment Constraint makes all reasoning modes agree with the direct action path. How it works:

  1. First, train the direct (no-CoT) path on ground-truth actions.
  2. Use its predictions as soft targets.
  3. Nudge text CoT, visual CoT, and multimodal CoT to match those targets. Why it matters: Without alignment, each mode might pull the policy in different directions.

🍞 Bottom Bread (Anchor) Example: Whether the model writes text thoughts or imagines visuals, it still chooses the same turn-right action as the direct path.

03 Methodology

High-Level Recipe: Input → [Choose reasoning mode with gates] → [Generate optional reasoning (text/latent-visual/both)] → [Predict next actions] → Output

Inputs:

  • Instruction I (e.g., “Bring the toilet paper from the bedroom to the bathroom”).
  • Visual observations {o≤t}: history + current left/front/right views.
  • Training labels: next 5 actions At; optional CoT traces Tt (text), Vt (visual latents), and Mt=[Tt,Vt] (multimodal).
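
These inputs can be pictured as one training record per time step; the field names below are an illustrative sketch mirroring the five-tuple [I, {o≤t}, Tt, Vt, At] used later in training.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VLNSample:
    instruction: str                 # I, e.g. "Bring the toilet paper from the bedroom..."
    observations: List[dict]         # {o<=t}: history plus current left/front/right views
    text_cot: Optional[str]          # Tt: textual chain-of-thought trace (may be absent)
    visual_cot: Optional[List[int]]  # Vt: compressed VAR latent tokens (may be absent)
    actions: List[str]               # At: the next 5 action labels
```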

Step 1: Compact Visual Chain-of-Thought (CompV-CoT)

🍞 Top Bread (Hook) Imagine shrinking a movie into super-smart thumbnails that still tell the story.

🥬 Filling (The Actual Concept) What happens: The model predicts future visual latents (not full images) to “imagine” what it will see after certain actions. Why this step exists: Pixel images are slow and heavy; latents are fast and focused. Example with data: From a 256×256 frame, VAR provides ~30 tokens that represent the view. The model predicts a few sets of these tokens to imagine the next moments, then chooses the best action sequence (e.g., <|forward|>, <|right|>, <|forward|>, ...).

🍞 Bottom Bread (Anchor) Example: Before turning right, the model predicts a doorway latent will appear; it then chooses to turn right.
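
A minimal sketch of this step, assuming a hypothetical `model.generate` helper: the model first emits a handful of latent tokens for the imagined view, then emits action tokens conditioned on them.

```python
# Sketch: "imagine" the next view as ~30 latent tokens, then emit actions.
# `model.generate` is a hypothetical helper, not the paper's actual interface.

def compv_cot_step(model, instruction, obs):
    # 1) Predict compressed visual latents for the imagined next view (no pixels).
    imagined = model.generate(instruction, obs, mode="visual_cot", max_new_tokens=30)
    # 2) Condition on the imagined latents and emit the next action tokens.
    actions = model.generate(instruction, obs, prefix=imagined,
                             mode="actions", max_new_tokens=5)
    return actions    # e.g. ["<|forward|>", "<|right|>", "<|forward|>", ...]
```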

Step 2: Unified Multimodal CoT (UM-CoT) with Gating

🍞 Top Bread (Hook) Think of mode switches: Text ON? Visual ON?

🥬 Filling (The Actual Concept) What happens: Two binary gates gT and gV control whether to produce text CoT, visual CoT, both, or none. Why this step exists: It lets one model learn all reasoning flavors, keeping parameters shared and behavior consistent. Example with data: If (gT,gV)=(1,0), the model outputs a think block with Semantic Plan, Visual Description, Action Planning, Visual Imagination (text only), plus actions. If (0,1), it outputs visual latents plus actions. If (1,1), it outputs both text and latents plus actions. If (0,0), it outputs actions only.

🍞 Bottom Bread (Anchor) Example: For a tricky corner, (1,1) provides text reasoning about house layouts and visual latents that predict a hallway view.

Step 3: Cross-Mode Alignment Constraint

🍞 Top Bread (Hook) Different instruments, one melody.

🥬 Filling (The Actual Concept) What happens:

  1. Train the no-CoT (direct) path on ground-truth actions.
  2. Freeze its predictions for this batch (stop-gradient) as soft targets.
  3. Train text, visual, and multimodal CoT branches to match those soft targets while also matching their own CoT labels. Why this step exists: To prevent mode conflicts and distill all benefits into a single, stable policy. Example with data: If the direct path predicts <|right|> with high probability, the CoT branches are nudged to also favor <|right|>.

🍞 Bottom Bread (Anchor) Example: Whether the model writes a plan or imagines latents, it still turns right at the same doorway.
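
In PyTorch-like code, the alignment term might look like the sketch below: the direct path's action distribution is detached (stop-gradient) and used as a soft target for a CoT branch. The exact divergence and weighting are assumptions, not published values.

```python
import torch.nn.functional as F

def alignment_loss(cot_action_logits, direct_action_logits):
    """Nudge a CoT branch's action distribution toward the direct path's."""
    # Stop-gradient: the direct path's predictions act as fixed soft targets.
    soft_targets = F.softmax(direct_action_logits.detach(), dim=-1)
    log_probs = F.log_softmax(cot_action_logits, dim=-1)
    # Cross-entropy against soft targets (a KL divergence up to a constant).
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```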

Step 4: Training Data and Objective

  • Construct five-tuples [I, {o≤t}, Tt, Vt, At].
  • Uniformly sample (gT,gV) each batch to mix modes.
  • Losses:
    • Non-CoT: cross-entropy on actions.
    • CoT branches: cross-entropy on CoT outputs (Tt, Vt, Mt) and actions.
    • Alignment: cross-entropy aligning CoT-branch actions to the direct-path soft targets.
  • Alternate updates: first optimize non-CoT, then joint CoT with alignment, repeat until convergence.
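
Putting the recipe together, here is a condensed sketch of the two alternating updates; the gate sampling, loss structure, and the `model`/`batch` interface are assumptions for illustration, not the paper's code.

```python
import random
import torch
import torch.nn.functional as F

def non_cot_update(model, batch):
    """Phase A: train the direct (no-CoT) path on ground-truth actions."""
    logits = model(batch, g_text=0, g_visual=0)
    return F.cross_entropy(logits, batch.action_labels)

def cot_update(model, batch):
    """Phase B: train a sampled CoT mode, aligned to the frozen direct path."""
    with torch.no_grad():                                        # stop-gradient soft targets
        soft = F.softmax(model(batch, g_text=0, g_visual=0), dim=-1)
    g_text, g_visual = random.choice([(1, 0), (0, 1), (1, 1)])   # sample a CoT mode
    out = model(batch, g_text=g_text, g_visual=g_visual)
    loss = F.cross_entropy(out.cot_logits, batch.cot_labels)               # CoT trace
    loss = loss + F.cross_entropy(out.action_logits, batch.action_labels)  # actions
    loss = loss - (soft * F.log_softmax(out.action_logits, dim=-1)).sum(-1).mean()  # alignment
    return loss

# Training alternates Phase A and Phase B updates until convergence.
```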

Step 5: Inference (Fast Acting)

🍞 Top Bread (Hook) Test day: no scratch work, just answer—correctly and quickly.

🥬 Filling (The Actual Concept) What happens: Use the non-CoT path only. The model directly maps instruction+observations → next actions, one token per action. Why this step exists: Real-time navigation demands speed; explicit CoT token generation is too slow. Example with data: For each time step, output one action token: <|left|>, <|right|>, <|forward|>, or <|stop|>.

🍞 Bottom Bread (Anchor) Example: The robot completes multi-stage tasks—find item, go to destination, stop—at about one action per second.
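
The direct path can be sketched as a loop that decodes exactly one action token per step with both gates off; the helper names below are hypothetical.

```python
# Sketch of the non-CoT inference loop: gates off, one action token per step.
# `model.next_token` and `env` are hypothetical stand-ins.

ACTION_TOKENS = ["<|left|>", "<|right|>", "<|forward|>", "<|stop|>"]

def navigate(model, env, instruction, max_steps=500):
    obs, history = env.reset(), []
    for _ in range(max_steps):
        token = model.next_token(instruction, obs, history,
                                 g_text=0, g_visual=0,       # no written thoughts
                                 allowed=ACTION_TOKENS)      # constrain to action tokens
        if token == "<|stop|>":
            break
        history.append(obs)
        obs = env.step(token)
    return env.task_completed()
```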

The Secret Sauce

  • Compression with VAR: Imagined vision stays rich but tiny.
  • Unified training with gates: One brain, many reasoning styles.
  • Cross-mode alignment: All styles sing the same tune, so the direct path is reasoning-aware even without explicit thoughts.
  • Train-with-CoT, infer-without: Keep the brains, lose the token burden.

04 Experiments & Results

The Test

  • Benchmark: LH-VLN—multi-stage, long-horizon navigation in unseen scenes.
  • What they measured: Success Rate (SR), Independent Success Rate (ISR), Conditional Success Rate (CSR), CSR weighted by Ground Truth (CGT), and Actions Per Second (APS).
  • Why: These scores capture whether the agent completes whole missions, individual subtasks, and how efficiently it acts in real time.

The Competition

  • Textual CoT: Aux-Think.
  • Visual CoT: CoT-VLA, WorldVLA.
  • Memory/Other: MGDM, GLM-4v prompt, NaviLLM, GPT-4 + NaviLLM.
  • All trained/evaluated fairly on LH-VLN’s splits.

The Scoreboard (with context)

  • FantasyVLN: SR 2.44, ISR 11.01, CSR 9.64, CGT 8.99, APS ~1.03.
    • Context: Think of SR 2.44 as earning a solid lead where others get much smaller scores on full multi-stage completion. ISR 11.01 is like consistently passing many sub-checkpoints. CSR and CGT show strong performance even when weighting by earlier success and trajectory length.
  • Implicit vs. explicit speed: Implicit methods (FantasyVLN, Aux-Think, WorldVLA) achieve ~1 action per second, while explicit CoT-VLA manages only ~0.19 actions per second, roughly 1 step in the time the implicit methods take 5.

Ablations (what matters most)

  • Mode combinations: Adding any CoT mode to non-CoT improves results; using all (non-CoT + T-CoT + V-CoT + MM-CoT) performs best. This shows the benefit of learning across modes.
  • Cross-mode alignment: Without it, scores are weak; with it, SR jumps from 0 to 2.44 and ISR from 2.39 to 11.01—proving alignment is essential glue.
  • VAR scale: Best ISR at scale 4—small scales miss details; large scales add redundancy. This sweet spot balances information and compactness.
  • Explicit vs implicit inference: Implicit wins, especially for MM-CoT (SR 2.44 implicit vs 0.98 explicit). Writing out long reasoning chains accumulates small errors; direct action is steadier after proper training.

Surprising Findings

  • Writing thoughts explicitly during inference can hurt long-horizon performance because small mistakes in long chains pile up. Implicit inference, after alignment, avoids this error snowball.
  • Visual imagination in latent space not only speeds things up but also trains more stably than pixel prediction—learning curves are smoother and faster.

Takeaway Numbers in Plain Words

  • Accuracy: FantasyVLN hits more goals and subtasks than baselines—like getting an A when others are around a C on tough, multi-part exams.
  • Speed: It keeps near-real-time pace (~1 action/sec), unlike explicit CoT which is too slow for real robots.

05 Discussion & Limitations

Limitations

  • Detail loss in compression: Latent tokens are compact; some fine-grained visual details may be missed, which could matter in very cluttered scenes or for tiny objects.
  • Data dependence: If training trajectories are limited or unrepresentative, the internalized reasoning may not generalize to unusual homes or novel layouts.
  • Long-tail errors: Very long missions can still accumulate mistakes, especially if the environment is drastically different from training.
  • CoT supervision quality: Textual CoT labels (from an annotator model) can be noisy; poor supervision may inject biases.

Required Resources

  • A capable VLM backbone (e.g., Qwen2.5-VL) and a pretrained VAR.
  • Multi-GPU training (the paper used many H20 GPUs) for efficient joint training across modes.
  • Datasets with language instructions, multi-view observations, and action labels; optional CoT traces.

When NOT to Use

  • Ultra-fine manipulation tasks demanding pixel-level precision at every reasoning step (latent compression may drop crucial details).
  • Scenarios with no benefit from language (pure SLAM/path-planning problems might prefer specialized methods).
  • Extremely constrained compute at training time (though inference is light, training needs resources).

Open Questions

  • How to auto-generate higher-quality multimodal CoT traces with fewer biases?
  • Can the VAR be co-trained end-to-end for even better task-specific compression?
  • How does this framework scale to outdoor navigation or cross-embodiment transfer (wheeled to legged robots)?
  • Can uncertainty in imagined latents be modeled explicitly to handle ambiguous layouts?
  • How to incorporate memory modules or maps without breaking real-time speed?

06 Conclusion & Future Work

Three-Sentence Summary

  • FantasyVLN trains a single model to do text-only, visual-only, and multimodal Chain-of-Thought, compressing visual imagination into a tiny latent space with a VAR.
  • A gating switch picks modes during training, and a cross-mode alignment rule makes all modes agree with a fast, direct action pathway.
  • At test time, the model skips writing out thoughts and directly maps instructions and images to actions, keeping the reasoning benefits without the token cost.

Main Achievement

  • It delivers reasoning-aware, real-time navigation by unifying multimodal CoT training and implicit inference, beating baselines in success and efficiency while avoiding token inflation.

Future Directions

  • Co-train or adapt the VAR for task-specific compression; integrate uncertainty estimates; expand to outdoor or cross-embodiment settings; refine CoT supervision quality; and add lightweight memory for very long missions.

Why Remember This

  • FantasyVLN shows you can keep the brains of Chain-of-Thought without paying the heavy token bill, opening a practical path to human-like, multimodal reasoning for robots that must act now, not later.

Practical Applications

  • Home service robots that fetch items across multiple rooms without getting lost or delayed.
  • Hospital robots that follow nurse instructions to deliver supplies quickly and safely.
  • Warehouse automation that handles multi-stop pick-and-place routes efficiently.
  • Campus delivery bots that parse spoken or written directions and navigate long corridors.
  • Elder care assistance where the robot understands vague instructions and plans multiple subgoals.
  • Office maintenance robots that complete multi-room cleaning or inspection tasks.
  • Search-and-rescue reconnaissance in unfamiliar indoor layouts with language-guided goals.
  • Retail inventory robots that follow staff prompts to restock or locate items over long routes.
  • Museum guide robots that navigate exhibits while responding to visitors’ questions.
  • Testbeds for teaching robust multimodal reasoning in other embodied AI tasks (e.g., manipulation).
#Vision-and-Language Navigation#Chain-of-Thought#Multimodal CoT#Implicit Reasoning#Visual AutoRegressor#Latent Space#Compact Visual CoT#Cross-Mode Alignment#Gating Mechanism#Instruction-to-Action#Long-Horizon Planning#Real-Time Navigation#Token Efficiency#Qwen2.5-VL#VAR Compression