Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
Key Summary
- •Robots that follow spoken instructions used to be slow and jerky because one big model tried to think and move at the same time.
- •DualVLN splits the job into two parts: a slow, smart planner (System 2) and a fast, smooth driver (System 1).
- •The planner points to a mid-term target in the camera image (a pixel goal) and shares hidden hints (latent goals).
- •The driver uses a diffusion transformer policy to turn those goals plus live camera frames into smooth trajectories dozens of times per second.
- •This slow-think/fast-move setup runs asynchronously, so the robot keeps gliding even while the planner is still thinking.
- •On tough benchmarks (VLN-CE and VLN-PE), DualVLN gets higher success and lower error than previous top methods.
- •A new Social-VLN test adds moving people; DualVLN avoids humans better and still completes more tasks than strong baselines.
- •Real-world runs on wheeled, quadruped, and humanoid robots show smooth motion, fewer stalls, and strong instruction following.
- •Using both visible pixel goals and hidden latent goals preserves the big model’s generalization while giving the driver rich guidance.
- •Decoupled training keeps the planner smart and the driver agile, making navigation more reliable in dynamic, messy places.
Why This Research Matters
Robots that understand our words and our world need both big-picture thinking and quick reflexes to be useful in homes, hospitals, and stores. DualVLN shows how to keep the brains of a large model while gaining the smooth, real-time motion needed for safety and comfort around people. This approach reduces jerky starts and stops, avoids unnecessary bumps, and stays reliable when scenes get crowded or change quickly. Better navigation means fewer delivery delays, less aisle blocking, and safer hallway passing. It can speed up deployment of assistance robots without heavy map setup or rule coding. As more public spaces welcome robots, this kind of socially aware, generalizable navigation becomes a practical necessity.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re walking through a school hallway with a friend telling you, “Go past the lockers, turn right at the trophy case, and stop by the art room door.” You look, you think about the plan, and you move smoothly, even if a group of kids suddenly crosses your path.
🥬 The Concept (Vision-and-Language Navigation, VLN): VLN is when a robot looks at the world through a camera and follows a human instruction like “Go past the couch and stop by the blue chair.” How it works:
- The robot reads the instruction.
- It processes camera images.
- It plans and moves to reach the described goal. Why it matters: Without VLN, robots can’t follow helpful, natural instructions, so they get stuck needing maps, beacons, or hand-coded routes. 🍞 Anchor: A robot in a home hears, “Go to the kitchen sink past the fridge,” and uses its camera plus the sentence to find the right path and stop.
🍞 Hook: You know how a wise teacher (slow thinker) plans a lesson, and a classroom routine (fast doer) keeps things moving? Both matter.
🥬 The Concept (Vision-Language Models, VLMs): VLMs are AI models that understand pictures and words together. How it works:
- They read the instruction.
- They look at images.
- They connect words to visual clues to reason about what to do next. Why it matters: Without VLMs, robots miss context like “the brown door on the left after the stairs,” which needs both language and visuals. 🍞 Anchor: When you say, “Walk past the sofa and stop at the lamp,” a VLM links ‘sofa’ and ‘lamp’ to spots in the picture so the robot knows where to head.
🍞 Hook: Imagine trying to think of every tiny footstep while also figuring out your whole route—you’d wobble, stop a lot, and bump into people.
🥬 The Concept (Old end-to-end pipelines): Many past systems used one big model to jump directly from camera+instruction to tiny step commands (like “walk 25 cm”). How it works:
- Combine reasoning, planning, and control into one neural network.
- Call the big model every step. Why it matters: This can make motion jerky, slow (high latency), and risky with moving obstacles because the model is always busy thinking instead of reacting. 🍞 Anchor: It’s like asking a librarian to read the whole book to you before you take each step—too slow and clumsy.
🍞 Hook: You know how your brain has a “planner” that decides the next landmark and a “driver” that keeps your feet moving smoothly? Splitting the jobs helps.
🥬 The Concept (Dual-System Architecture): DualVLN separates slow, deep reasoning (System 2) from fast, smooth control (System 1). How it works:
- System 2 (a VLM) slowly finds a good mid-term waypoint in the image.
- System 1 (a small diffusion transformer policy) quickly turns that waypoint into a smooth path using live camera frames.
- They run asynchronously, so the driver stays active while the planner updates. Why it matters: Without this split, robots either think too slowly or move too roughly, especially around people. 🍞 Anchor: The planner points to “that doorway,” and the driver glides toward it while watching for moving legs and carts.
The World Before: Early VLN benchmarks used simple, discrete actions in quiet simulations. Success improved, but robots’ paths looked choppy and too slow for the real world. VLM-powered methods brought general knowledge (“stairwells,” “hallways,” “kitchens”), but still used one big loop that decided every tiny move.
The Problem: Real places are messy: doors swing, people pass, and viewpoints can hide the floor. A single slow brain making every twitchy decision causes lag, fragmented motion, and poor obstacle avoidance.
Failed Attempts: 1) End-to-end VLA models: good at reasoning, but high latency and jerky steps. 2) Traditional modular stacks (mapping, localization, local planners): sensitive to errors and heavy tuning, and they struggle with natural-language goals. 3) Simple pixel-point methods: need extra modules and still can’t move smoothly at high frequency.
The Gap: No existing system keeps a big model’s generalization while delivering real-time, smooth control under dynamic obstacles.
What the Paper Adds: DualVLN builds a “slow-think, fast-move” team: System 2 grounds mid-term pixel goals and System 1 converts them into continuous, human-like motion at high speed.
Real Stakes: Safer delivery robots in hallways, store helpers that don’t block aisles, home assistants that won’t clip chair legs, and robots that can politely wait and then recover their task path.
02 Core Idea
🍞 Hook: Picture a coach who points to the next cone to run to (slow and strategic) while you keep your feet moving fast and smooth (quick and reactive).
🥬 The Concept (Aha!): Let a big VLM “ground slow” by choosing a clear mid-term pixel goal, and let a small diffusion policy “move fast” by turning that goal plus fresh camera images into a smooth trajectory—both running at their best speeds. How it works:
- System 2 (VLM) reads the instruction and sees images, then outputs a visible pixel goal and hidden latent hints.
- System 1 (diffusion transformer) fuses those goals with live RGB frames to generate smooth waypoints many times per second.
- They’re asynchronous: the driver doesn’t wait; it updates continuously while the planner thinks and refreshes. Why it matters: Without this, navigation is either too slow to react or too shallow to understand complex instructions. 🍞 Anchor: “Exit the bedroom, turn right at the stairwell, then enter the bathroom on the left.” The planner points to a mid-term pixel spot in the camera view, and the driver flows there safely, even if a person steps in.
Multiple Analogies:
- GPS + Cruise Control: The GPS (System 2) picks the next turn; cruise control with lane keeping (System 1) keeps you smooth and centered.
- Coach + Runner: The coach points the cone; the runner keeps perfect form at speed.
- Navigator + Helmsman (on a boat): The navigator sets the next heading; the helmsman makes micro-adjustments against waves and wind.
Before vs After:
- Before: One slow thinker handled every tiny motion; motion was choppy; latency was high; obstacle avoidance was limited.
- After: Slow planner gives interpretable goals; fast controller produces continuous, smooth, and adaptive paths; obstacle avoidance and timing improve.
🍞 Hook: You know how pointing with your finger is clearer than saying, “Go vaguely over there”?
🥬 The Concept (Pixel Goal Grounding): The planner predicts a specific 2D pixel in the camera image that marks the next mid-term waypoint. How it works:
- Project 3D waypoints into the image.
- Choose the farthest visible next waypoint.
- Output its pixel coordinates. Why it matters: Without a concrete pixel, the driver lacks a crisp target and motions can drift or hesitate. 🍞 Anchor: The VLM says, “(234, 447),” which on the screen is a point near the hallway opening—now the driver knows exactly where to head.
🍞 Hook: Imagine you send a text, and while you wait for a reply, you keep walking toward your bus—you don’t freeze.
🥬 The Concept (Asynchronous Inference): The planner (slow) and driver (fast) run at different speeds without waiting for each other. How it works:
- System 2 updates ~2 times per second.
- System 1 updates ~30 times per second.
- The driver keeps moving using the last goal until a new one arrives. Why it matters: Without async, every step would pause for the planner, causing stutters and delays. 🍞 Anchor: The driver keeps a fresh, rolling path so the robot glides smoothly through a doorway while the planner thinks about the next room.
🍞 Hook: Think of a secret note passed from the planner to the driver that says not just “where,” but also “why and how.”
🥬 The Concept (Latent Goal Representation): Hidden vector hints pulled from the VLM’s inner features summarize task-relevant context for the driver. How it works:
- Add learnable queries to the VLM after pixel-goal prediction.
- Let them attend to the instruction, history, and current view.
- Use the resulting latent vectors to condition the driver. Why it matters: Without latent goals, you only have a simple point; with them, you also get semantic clues like “doorway, then left.” 🍞 Anchor: The driver learns from both the pixel dot and a compact, rich hint: “this is the corridor split; prepare a gentle right.”
🍞 Hook: Imagine sketching a path in pencil first (noisy), then refining the strokes smoothly until the line looks clean and natural.
🥬 The Concept (Diffusion Transformer Policy): A small transformer learns to denoise a rough path into a smooth trajectory, conditioned on goals and images. How it works:
- Start with a noisy version of the true path.
- Predict the velocity that removes noise step by step.
- Output 32 smooth waypoints for the robot to follow. Why it matters: Without diffusion-style smoothing, paths can zig-zag or jerk around, wasting time and risking bumps. 🍞 Anchor: The driver produces a silky S-curve around a chair, not a boxy stop-start pattern.
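Below is a minimal sketch of the denoise-then-output idea: a tiny stand-in network (TinyVelocityNet is a made-up name) predicts a velocity, and a few Euler refinement steps turn pure noise into 32 waypoints. The real policy is a multimodal diffusion transformer conditioned on images and goals; this only illustrates the sampling loop.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for the diffusion transformer: predicts a velocity that
    pushes a noisy 32x2 trajectory toward a clean one, given a goal."""
    def __init__(self, horizon=32, dim=2, goal_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * dim + goal_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, horizon * dim))
        self.horizon, self.dim = horizon, dim

    def forward(self, traj, goal, t):
        x = torch.cat([traj.flatten(1), goal, t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.dim)

@torch.no_grad()
def sample_trajectory(policy, goal, steps=10):
    """Start from pure noise and integrate the predicted velocity field."""
    traj = torch.randn(1, policy.horizon, policy.dim)    # rough "pencil sketch"
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        traj = traj + policy(traj, goal, t) / steps      # Euler refinement step
    return traj                                          # 32 smoothed waypoints

policy = TinyVelocityNet()
waypoints = sample_trajectory(policy, goal=torch.zeros(1, 8))
print(waypoints.shape)   # torch.Size([1, 32, 2])
```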
Why It Works (Intuition): Splitting thinking and moving lets each part shine—VLMs excel at semantic grounding from mixed vision+language, while compact diffusion policies excel at fast, reactive control from images. The explicit pixel goal acts like a bright landmark (interpretability), and the latent goal adds rich, adaptive hints (generalization). Async timing keeps motion fluid in real time. Together, they preserve the big model’s brains while gaining the small model’s reflexes.
Building Blocks:
- System 2: VLM planner with pixel-goal grounding and self-directed view adjustment.
- System 1: Multimodal diffusion transformer policy conditioned on latent goals and live RGB.
- Async loop: Slow (planner) + fast (driver) for continuous, smooth control.
- Decoupled training: Keep VLM generalization high; keep driver efficient and robust.
03 Methodology
High-Level Recipe: Input (instruction + RGB history + current RGB) → System 2 (pixel goal + latent goal) → System 1 (diffusion policy with high-frequency RGB) → Output (32 smooth waypoints) → Low-level controller tracks the path.
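The same recipe, written as runnable pseudocode. The function names (system2_plan, system1_generate, controller_track) and all returned values are hypothetical placeholders standing in for the real modules, not the released API.

```python
import numpy as np

def system2_plan(instruction, rgb_history, current_rgb):
    """Stand-in for the VLM planner: returns a pixel goal and latent hints."""
    pixel_goal = (234, 447)              # "aim here" in the current image
    latent_goal = np.zeros(128)          # hidden hints for the driver
    return pixel_goal, latent_goal

def system1_generate(latent_goal, pixel_goal, current_rgb, n_waypoints=32):
    """Stand-in for the diffusion policy: returns 32 smooth waypoints."""
    return np.linspace([0.0, 0.0], [1.0, 0.5], n_waypoints)

def controller_track(waypoints):
    """Stand-in for the low-level controller that follows the path."""
    print(f"tracking {len(waypoints)} waypoints, first={waypoints[0]}, last={waypoints[-1]}")

rgb = np.zeros((480, 640, 3), dtype=np.uint8)
pixel_goal, latent_goal = system2_plan("enter the bathroom on the left", [rgb], rgb)
controller_track(system1_generate(latent_goal, pixel_goal, rgb))
```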
🍞 Hook: Before choosing a waypoint, people often glance around and tilt their head to see the floor and the next turn.
🥬 The Concept (System 2: VLM planner with self-directed view adjustment): System 2 slowly but smartly picks a next waypoint in the camera image and knows when to adjust the view first. How it works:
- See instruction, past frames, and current frame.
- If the next waypoint is occluded or off-screen, issue small look/turn actions (e.g., Turn Right 15°, Look Down 15°) to get a better view.
- Predict the farthest visible waypoint’s pixel coordinates (the pixel goal) or STOP. Why it matters: Without view adjustment, the planner might guess from a bad angle and pick poor goals. 🍞 Anchor: The model tilts down to see the floor path around a sofa before pointing to the next pixel.
Self-Directed View Adjustment, step by step:
- Problem: Floor points can be hidden; facing the wrong way makes the goal off-screen.
- Action set: Turn Left/Right 15°, Look Up/Down 15°, limited to short bursts.
- Training trick: Project 3D future path into the 2D image with depth to know which waypoints are visible; train the model to pick the farthest visible one.
- Result: The planner outputs either a set of look/turn steps first, or a pixel coordinate, or STOP.
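A toy sketch of the decision logic in the steps above, assuming waypoints have already been projected with depth. The occlusion check and the turn-direction rule are hand-written heuristics (covering only horizontal turns) that stand in for the VLM's learned behavior.

```python
def visible(u, v, depth_at_pixel, waypoint_depth, width=640, height=480, tol=0.3):
    """A waypoint counts as visible if it projects inside the image and is not
    occluded, i.e. the rendered depth at that pixel roughly matches its distance."""
    in_frame = 0 <= u < width and 0 <= v < height
    return in_frame and abs(depth_at_pixel - waypoint_depth) < tol

def decide(projected, width=640):
    """projected: list of (u, v, depth_at_pixel, waypoint_depth), ordered near -> far.
    Returns a pixel goal, a view-adjust action, or STOP (toy decision rule)."""
    goals = [(u, v) for u, v, d_img, d_wp in projected if visible(u, v, d_img, d_wp)]
    if goals:
        return ("PIXEL_GOAL", goals[-1])      # farthest visible waypoint wins
    if not projected:
        return ("STOP", None)                 # no future path left to ground
    u = projected[-1][0]
    return ("VIEW_ADJUST", "TURN_RIGHT_15" if u >= width else "TURN_LEFT_15")

# Toy case: both waypoints project off-screen to the right, so turn right first.
print(decide([(700, 400, 2.0, 2.0), (820, 380, 3.5, 3.5)]))
```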
🍞 Hook: Think of the planner passing two things: a clear dot on the screen to aim for, and a tiny sealed envelope of extra tips.
🥬 The Concept (Latent Goal Representation via learnable queries): Tiny trainable tokens ask the frozen VLM, “What context matters right now?” and pull out a compact latent goal. How it works:
- After the VLM predicts the pixel goal, append four special query tokens.
- Let them attend across the instruction, history, current frame, and the pixel goal.
- The resulting vectors become the latent goals fed to System 1. Why it matters: Without these queries, System 1 only gets a dot, not the surrounding semantic plan. 🍞 Anchor: The latent goal might implicitly encode “this is the stairwell; prepare to turn left after entering.”
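A minimal PyTorch sketch of the learnable-query idea. In the paper the queries are appended to the VLM and attend through its own layers; here that is approximated by a separate cross-attention module, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LatentGoalQueries(nn.Module):
    """Four learnable queries cross-attend over the (frozen) VLM's token features
    -- instruction, history, current view, pixel goal -- and the attended outputs
    become the latent goal passed to System 1."""
    def __init__(self, vlm_dim=1024, n_queries=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, vlm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vlm_dim, n_heads, batch_first=True)

    def forward(self, vlm_tokens):                # (batch, seq_len, vlm_dim)
        q = self.queries.expand(vlm_tokens.size(0), -1, -1)
        latent_goal, _ = self.cross_attn(q, vlm_tokens, vlm_tokens)
        return latent_goal                        # (batch, 4, vlm_dim)

vlm_tokens = torch.randn(2, 300, 1024)            # pretend frozen VLM features
print(LatentGoalQueries()(vlm_tokens).shape)      # torch.Size([2, 4, 1024])
```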
🍞 Hook: Imagine you’re drawing a path while constantly peeking up to see if someone is stepping into your lane.
🥬 The Concept (System 1: Multimodal diffusion transformer policy): System 1 converts goals plus fresh RGB frames into smooth trajectories at high speed. How it works:
- Get the last latent goal from System 2 (low-frequency) and two RGB frames (the last one System 2 saw at time t, and the current one at time t+k).
- Encode both RGBs with a ViT, fuse them with self-attention, compress with a Q-Former to 32 tokens.
- Feed the fused RGB tokens and latent goal into a compact Diffusion Transformer (DiT) to generate 32 smooth waypoints. Why it matters: Without fusing fresh RGB, the driver can’t react to a new moving obstacle; without diffusion, paths are less smooth. 🍞 Anchor: A person steps out; the driver nudges the curve wider and then rejoins the planned corridor path.
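The data flow described above, as a toy sketch. Module sizes, token counts, and the class name System1Sketch are illustrative stand-ins rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class System1Sketch(nn.Module):
    """Toy stand-in for System 1's data flow: encode two RGB frames, fuse them with
    self-attention, compress to 32 tokens, then run a small transformer "denoiser"
    conditioned on the fused tokens and the latent goal. Sizes are illustrative."""
    def __init__(self, dim=256, n_fused=32, horizon=32):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)           # stand-in for a ViT encoder
        self.fuse = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.q_tokens = nn.Parameter(torch.randn(1, n_fused, dim) * 0.02)
        self.q_former = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.denoiser = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.to_waypoints = nn.Linear(dim, 2)
        self.horizon = horizon

    def forward(self, patches_t, patches_tk, latent_goal, noisy_traj_tokens):
        tokens = torch.cat([self.patch_embed(patches_t), self.patch_embed(patches_tk)], dim=1)
        fused = self.fuse(tokens)                                 # joint attention over both frames
        q = self.q_tokens.expand(fused.size(0), -1, -1)
        rgb_tokens, _ = self.q_former(q, fused, fused)            # 32 compressed RGB tokens
        cond = torch.cat([rgb_tokens, latent_goal, noisy_traj_tokens], dim=1)
        traj = self.denoiser(cond)[:, -self.horizon:]             # keep the trajectory tokens
        return self.to_waypoints(traj)                            # (batch, 32, 2) waypoints

model = System1Sketch()
patches = torch.randn(1, 196, 3 * 16 * 16)                        # 196 fake 16x16 RGB patches
out = model(patches, patches, torch.randn(1, 4, 256), torch.randn(1, 32, 256))
print(out.shape)                                                  # torch.Size([1, 32, 2])
```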
🍞 Hook: Think of cleaning a smudgy sketch, stroke by stroke, until it looks crisp.
🥬 The Concept (Flow Matching training in diffusion): The model learns to predict the velocity that denoises a noisy trajectory toward the ground truth path. How it works:
- Mix the true trajectory with noise (control how noisy by a time variable).
- Train the DiT to predict the velocity pointing back to the clean path.
- Repeat across many noise levels so it learns to refine paths from rough to smooth. Why it matters: Without this training, the model won’t master turning shaky drafts into clean, safe routes. 🍞 Anchor: From a jittery zig-zag, the trained policy outputs a gentle arc around a table leg.
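A minimal sketch of one flow-matching training step under the common linear-interpolation formulation; the tiny MLP policy and trajectory shapes are placeholders, not the paper's model.

```python
import torch
import torch.nn as nn

# Placeholder policy: maps (noisy trajectory, goal, noise level) to a velocity.
policy = nn.Sequential(nn.Linear(32 * 2 + 8 + 1, 128), nn.ReLU(), nn.Linear(128, 32 * 2))

def flow_matching_loss(clean_traj, goal):
    """One flow-matching training step (sketch): blend the clean 32x2 trajectory
    with noise at a random level t, then regress the velocity that points from
    the noisy sample back toward the clean path."""
    noise = torch.randn_like(clean_traj)
    t = torch.rand(clean_traj.size(0), 1, 1)
    noisy = (1 - t) * noise + t * clean_traj         # interpolate noise -> clean
    target_velocity = clean_traj - noise             # velocity of that straight path
    inp = torch.cat([noisy.flatten(1), goal, t.flatten(1)], dim=-1)
    pred_velocity = policy(inp).view_as(clean_traj)
    return nn.functional.mse_loss(pred_velocity, target_velocity)

loss = flow_matching_loss(torch.randn(16, 32, 2), torch.randn(16, 8))
loss.backward()        # gradients flow into the tiny stand-in policy
print(float(loss))
```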
Asynchronous Execution (the secret sauce):
- System 2 runs slowly (~2 Hz), producing pixel and latent goals with strong reasoning.
- System 1 runs quickly (~30 Hz), always updating trajectories from the latest camera frames.
- A low-level controller (e.g., MPC) tracks the waypoints at a very high rate (~200 Hz) for smooth motion.
- Because System 1 keeps moving with the last known goal, the robot never freezes waiting for the planner.
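A bare-bones sketch of that asynchronous loop using Python threads; the rates mirror the numbers above, while the planner and policy are replaced by trivial placeholders.

```python
import threading
import time

latest_goal = {"pixel": (320, 240)}      # shared state between planner and driver

def plan_toward(goal):
    return [goal] * 32                   # placeholder for the diffusion policy's 32 waypoints

def system2_loop():                      # slow planner, ~2 Hz
    while True:
        time.sleep(0.5)                  # pretend the VLM needs ~0.5 s to think
        latest_goal["pixel"] = (320, 240)  # refresh the mid-term pixel goal

def system1_loop(steps=90):              # fast driver, ~30 Hz
    for _ in range(steps):
        waypoints = plan_toward(latest_goal["pixel"])  # use the latest goal, never block
        # a low-level controller (e.g., MPC at ~200 Hz) would track these waypoints
        time.sleep(1 / 30)

threading.Thread(target=system2_loop, daemon=True).start()
system1_loop()                           # the robot keeps gliding while System 2 thinks
```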
Concrete Walkthrough with Data:
- Instruction: “Exit the bedroom, walk straight, turn right at the stairwell, and enter the bathroom on the left.”
- System 2: Reviews recent frames; decides to Look Down 15° to expose the floor path; then outputs pixel goal (234, 447).
- Latent goal: Queries pull context like “hallway → stairwell → bathroom-left sequence.”
- System 1: Encodes previous and current RGB, fuses them, and with the latent + pixel goal, generates 32 waypoints curving to the right.
- Controller: Tracks the path smoothly; next, System 2 updates the goal for entering the bathroom.
Why this method is clever:
- It preserves the big model’s generalization (by freezing it after training) and adds only light learnable queries.
- It upgrades a single pixel dot into a semantically rich hint, improving robustness.
- It decouples thinking and moving, avoiding latency bottlenecks and enabling dynamic obstacle avoidance.
04 Experiments & Results
🍞 Hook: When you race in gym class, the clock and cones matter. For robots, we also need clear scores to see who truly moves better.
🥬 The Concept (VLN-CE benchmark): A standard test where robots follow language instructions in realistic indoor scenes with continuous control. How it works:
- Use photo-realistic homes (Matterport3D) in simulation.
- Give natural instructions.
- Measure how close and how often the robot finishes correctly. Why it matters: Without a common test, we can’t fairly compare methods. 🍞 Anchor: It’s like using the same obstacle course for every runner.
🍞 Hook: Imagine a gym floor that’s a bit slippery—you need balance, not just directions.
🥬 The Concept (VLN-PE benchmark): A physics-rich test that includes robot dynamics like slipping, falling, or getting stuck. How it works:
- Same instruction-following, but with realistic body control.
- Record falls and stalls. Why it matters: Without physics, a method might seem good but fail on real robots. 🍞 Anchor: It’s like testing shoes on a real track, not just on paper.
🍞 Hook: Walking through a crowd takes social smarts.
🥬 The Concept (Social-VLN benchmark): A new test with moving humanoids placed along the route to stress-test social obstacle avoidance and task recovery. How it works:
- Insert multiple dynamic agents at likely interaction points.
- Ensure paths aren’t completely blocked.
- Measure both task success and human collision rate. Why it matters: Without social testing, robots can be unsafe in real hallways and stores. 🍞 Anchor: Think of learning to say “excuse me” with your path, not just your words.
🍞 Hook: You know how report cards have different grades—accuracy, effort, speed?
🥬 The Concept (Key metrics: NE, SR, OS, SPL, HCR):
- Navigation Error (NE): final distance to the goal—lower is better.
- Success Rate (SR): percentage of runs that stop within 3 m of the goal—higher is better.
- Oracle Success (OS): counts a run as successful if any point along the route comes within the success radius; it shows whether the robot got close but failed to stop there.
- SPL: success plus path efficiency—rewards smart, short routes.
- HCR (Social-VLN): human collision rate—lower is safer. Why it matters: Without these, we can’t see if the robot is accurate, reliable, efficient, and safe. 🍞 Anchor: SR is like how many kids finished the course, NE is how close the rest got, SPL is who finished with the smartest path, and HCR is how many bumped someone.
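For reference, the standard per-episode versions of NE, SR, and SPL can be computed as below; the paper's evaluation code may differ in details, and HCR is omitted since it depends on the simulator's collision events.

```python
import numpy as np

def navigation_metrics(final_dist, path_len, shortest_len, success_radius=3.0):
    """Aggregate NE, SR, and SPL over episodes (standard definitions).
    final_dist: distance to goal at stop; path_len: length actually traveled;
    shortest_len: shortest-path length from start to goal. All in meters."""
    ne = np.mean(final_dist)                                    # Navigation Error
    success = final_dist <= success_radius                      # stopped within 3 m
    sr = np.mean(success)                                       # Success Rate
    spl = np.mean(success * shortest_len / np.maximum(path_len, shortest_len))
    return {"NE": ne, "SR": sr, "SPL": spl}

print(navigation_metrics(
    final_dist=np.array([1.2, 4.5, 2.8]),
    path_len=np.array([9.0, 12.0, 15.0]),
    shortest_len=np.array([8.0, 10.0, 9.0])))
```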
The Competition: DualVLN was compared against strong baselines like NaVid, StreamVLN, NaVILA, UniNaVid, CMA, and Seq2Seq, covering both VLM-based and traditional methods.
Scoreboard with Context:
- VLN-CE (R2R Val-Unseen): DualVLN hits SR ≈ 64.3% and NE ≈ 4.05 m, beating prior top RGB-only methods (e.g., StreamVLN SR ≈ 56.9%, NE ≈ 4.98 m). That’s like going from a solid B to an A.
- RxR-CE: DualVLN also leads (SR ≈ 51.8%, NE ≈ 4.58 m), showing cross-dataset strength.
- VLN-PE (with physical locomotion): DualVLN achieves much better SR (≈ 58.97% seen, ≈ 51.60% unseen) and lower NE (≈ 4.13–4.66 m) than methods trained specifically for VLN-PE. That’s like winning the meet even in rain and wind.
- Social-VLN: Everyone drops (crowds are hard), but DualVLN keeps higher SR and lower HCR than StreamVLN, meaning better safety and task completion under motion.
Surprising Findings:
- Decoupled training helps: Freezing the VLM after pixel-goal training keeps its generalization; the driver learns fast without dragging the planner down.
- Both goals matter: Removing explicit pixel goals or learned latent goals hurts results—together they’re stronger.
- Data scaling: System 1 doesn’t need massive data; even 1–10% of trajectories gives near-saturation performance, suggesting the planner’s quality is the main ceiling.
- Robustness to small pixel errors: System 1 can still produce safe, correct-direction paths if the pixel goal is a little off—unlike classic point-goal planners that are sensitive to projection mistakes.
Takeaway: The slow-think/fast-move pairing raises success, smooths motion, and improves safety, from clean simulations to physics-heavy tests to real robots in crowded spaces.
05 Discussion & Limitations
Limitations:
- Highly dynamic scenes are still tough: when multiple people cross paths at once or sudden occlusions hide key landmarks, success drops compared to static settings.
- Wrong semantic guesses upstream: if System 2 misinterprets the scene (e.g., confuses two similar doors), the driver may follow a smooth but wrong path.
- Latency vs. compute: Although async helps, System 2 is still a 7B VLM—on-device inference is heavy; remote compute adds network delays.
- Recovery boundaries: System 1 tolerates small pixel-goal errors, but very large or semantically wrong goals can mislead it near obstacles.
Required Resources:
- A capable VLM backbone (e.g., Qwen2.5-VL 7B) and roughly 20 GB of GPU memory for practical throughput.
- RGB camera input (depth is optional: training used depth to construct pixel-goal labels, but runtime needs only RGB).
- A controller (e.g., MPC) to track generated waypoints.
When NOT to Use:
- Extremely cluttered micro-spaces where any pixel-goal error leads to unavoidable collisions.
- Settings with very poor lighting or severe motion blur where RGB features are unreliable.
- Scenarios requiring centimeter-precise docking without additional sensing.
Open Questions:
- Can System 2 learn to self-correct semantic confusions on the fly (e.g., ask for more views before committing)?
- How to further compress or distill the VLM for on-robot, low-latency reasoning without losing generalization?
- Can social norms (yielding, passing rules, personal space) be learned directly from human demonstrations at scale?
- How to fuse more modalities (audio cues, sparse maps) without breaking the clean slow/fast split?
06 Conclusion & Future Work
Three-Sentence Summary: DualVLN splits navigation into a slow, smart planner that grounds mid-term pixel goals and a fast, smooth driver that turns those goals plus fresh images into continuous trajectories. This asynchronous, dual-system design keeps the VLM’s generalization while enabling real-time, obstacle-aware control. Across standard, physics-rich, and social benchmarks—and on real robots—DualVLN outperforms prior methods in success, smoothness, and robustness.
Main Achievement: Showing that combining explicit pixel goals (for interpretability) with learned latent goals (for rich context) inside an asynchronous slow-think/fast-move framework delivers state-of-the-art, generalizable navigation.
Future Directions: Distill or quantize the planner for on-device speed; scale social navigation data and learning for better human-aware behavior; add self-critique to reduce semantic mistakes; and integrate lightweight depth or event cameras for hard lighting conditions.
Why Remember This: DualVLN is a clear blueprint for embodied AI—let a big model think slowly and point the way, while a small model moves quickly and safely. It shows how to keep the brains of foundation models without sacrificing the reflexes needed for the real world.
Practical Applications
- •Indoor delivery in offices and hospitals while safely avoiding people and carts.
- •Store assistants that follow natural instructions to fetch or guide without blocking aisles.
- •Home robots that can navigate cluttered rooms to reach appliances or specific furniture.
- •Campus tour robots that follow spoken directions and adapt to crowds between classes.
- •Warehouse runners that move smoothly around dynamic pallets and workers.
- •Hotel service robots that handle elevators, hallways, and guest interactions politely.
- •Museum guides that follow complex routes while yielding at intersections.
- •Pop-up event or trade-show robots that work in unseen layouts without reprogramming.
- •Elder-care companions that safely navigate around moving caregivers and visitors.
- •Multi-robot fleets where each unit keeps smooth control even as plans update asynchronously.