Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Key Summary
- Traditional self-driving used separate boxes for seeing, thinking, and acting, but tiny mistakes in early boxes could snowball into big problems later.
- Vision-Action (VA) models learned to drive directly from camera views to steering and braking, yet they were black boxes and struggled with rare situations.
- Vision-Language-Action (VLA) models add language understanding and reasoning, so the car can both plan and explain its choices.
- There are two main VLA styles: End-to-End (one big brain that sees, reasons, and acts) and Dual-System (a slow, careful thinker plus a fast, safe driver).
- Action outputs come in two flavors: Textual (words/tokens like 'slow down, turn left') and Numerical (precise waypoints, speeds, and steering).
- Standard datasets and new benchmarks (nuScenes, NAVSIM, Bench2Drive, WOD-E2E) test both the quality of trajectories and how human-friendly the behavior is.
- Results show language-guided systems can cut errors and improve closed-loop driving scores, especially when reasoning and action are well aligned.
- Key challenges remain: real-time speed, high-quality data costs, handling rare edge cases, avoiding language hallucinations, and keeping decisions consistent over time.
- Future directions include unifying world models with VLA, better multimodal fusion (cameras, LiDAR, maps), continual learning, and human-centered evaluation.
- This survey organizes a fast-moving field into clear categories and a roadmap for building safer, more understandable autonomous driving systems.
Why This Research Matters
Cars that can both see and explain their choices are easier to trust, especially in confusing, busy streets. VLA models let people give natural instructions like "avoid unprotected left turns" and then check that the car actually follows them. Better reasoning helps with rare events, like a child darting out or a sudden construction detour, where simple copy-the-expert policies often fail. Human-aligned metrics mean we're optimizing for what riders actually feel is safe and comfortable, not just for numbers on a chart. As the field standardizes data and tests, we can compare methods fairly and deploy improvements faster. In the long run, VLA could make autonomous driving safer, more transparent, and more adaptable to local rules and personal preferences.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how playing telephone with friends can go wrong if the first whisper is a little off? By the end, the message can be totally mixed up. Driving computers used to work a bit like that.
🥬 The Concept (Past → Problem → Attempts → Gap → Stakes):
- What it is: Traditional autonomous driving systems used separate steps: first see (perception), then decide (prediction/planning), then act (control).
- How it worked (step by step):
- Cameras and sensors spot cars, lanes, and lights.
- The system predicts how things will move next.
- A planner draws a path and a controller turns it into steering and braking.
- Why it mattered: Each step depended on the last. A small perception mistake (like misreading a far-away pedestrian) could become a big planning error, which could lead to unsafe control.
🍞 Anchor: Imagine reading a map (step 1), guessing traffic (step 2), then deciding where to turn (step 3). If your map is wrong, your whole trip plan falls apart.
🍞 Hook: Imagine you're in a go-kart and you just look ahead and turn the wheel based on what you see. No big plan, just react. That's simpler, right?
🥬 The Concept (Vision-Action, VA):
- What it is: VA models learn a direct mapping from camera views (and sometimes LiDAR) straight to actions (steer, throttle, brake or waypoints).
- How it works:
- Feed raw images into a neural network.
- The network learns patterns from expert driving.
- It outputs control commands or a short path to follow.
- Why it matters: It avoids the telephone-chain problem but can be a black box and struggle with weird, rare events it hasn't seen.
🍞 Anchor: Like a kid learning to ride a bike by watching and copying: great in the park they practiced in, but confused on a rocky trail they've never seen.
🍞 Hook: You know how adding good instructions to a picture makes it easier to solve a puzzle? Words help you reason.
🥬 The Concept (Large Language Models, LLMs):
- What it is: LLMs are powerful text tools that can explain, reason, and follow instructions.
- How it works:
- They read huge amounts of text to learn language patterns.
- They answer questions, follow instructions, and show their steps.
- When paired with vision, they can connect what they see to what they say (and do).
- Why it matters: Language adds reasoning and instructions (like "turn left after the red house"), which plain vision-to-action can't easily follow.
🍞 Anchor: Like having a tour guide who not only sees the sights but can also describe the best route in words you understand.
🍞 Hook: Imagine a driver that can see the road, read your instructions, think out loud, and then drive accordingly.
🥬 The Concept (Vision-Language-Action, VLA):
- What it is: VLA models combine seeing (vision), understanding/explaining (language), and doing (action) in one policy.
- How it works:
- Inputs: multi-camera video, ego-vehicle state, and optional instructions.
- A vision-language model fuses images and words to form a shared understanding.
- An action head outputs either text-like commands or precise trajectories/controls.
- Why it matters: This brings interpretability (the model can explain), better generalization (world knowledge), and instruction-following for human-aligned driving.
🍞 Anchor: If you say, "Please go straight and stop for kids," a VLA can spot children, explain the plan, and gently brake, then tell you why it did that.
🍞 Hook: Think of two styles of solving math: doing it all in your head quickly vs. writing out careful steps before answering.
🥬 The Concept (Why a new survey now):
- What it is: The field is exploding with different VLA designs, data, and tests.
- How it works:
- The survey traces the journey from VA to VLA.
- It organizes models into End-to-End VLA (one big brain) and Dual-System VLA (slow thinker + fast doer).
- It compares output styles (textual vs numerical), datasets, metrics, and results.
- Why it matters: Clear maps help everyone build safer, smarter systems faster.
🍞 Anchor: It's like a field guide that sorts animals by type, tells you where to find them, and explains how to tell them apart, so you won't get lost.
02 Core Idea
🍞 Hook: You know how a great coach first understands the game (vision), then explains the plan (language), and finally calls the play (action)?
🥬 The Concept (Key Insight):
- What it is: The "aha!" is to unify seeing, reasoning with language, and acting, so the car can both perform well and explain itself.
- How it works:
- Merge visual features with language tokens into a shared space.
- Let the model reason with chain-of-thought and world knowledge.
- Output either human-readable steps (text tokens) or precise controls/waypoints (numbers), sometimes both.
- Why it matters: Without this, we either get accurate but opaque black boxes (VA) or brittle modular stacks; with VLA, we gain interpretability, instruction-following, and often safer long-horizon choices.
🍞 Anchor: It's like having a chess engine that shows you the board, tells you why a move is smart, then makes that move cleanly.
Multiple Analogies (3 ways):
- Teacher analogy: Pictures (vision) + explanation (language) + quiz answer (action). A good teacher shows, explains, then answers.
- GPS analogy: Map view (vision), voice guidance (language), wheel control (action). All three together make trips smoother and safer.
- Kitchen analogy: Look at ingredients (vision), read the recipe (language), cook the dish (action). Skipping the recipe leads to mistakes.
🍞 Hook: Imagine solving everything in one head, no passing notes.
🥬 The Concept (End-to-End VLA):
- What it is: A single model that goes from inputs to outputs directly, reasoning and acting in one pass.
- How it works:
- Encode images and text.
- Jointly reason inside the model.
- Emit actions as tokens (textual) or numbers (numerical) with a small head.
- Why it matters: It's simple and powerful; fewer moving parts mean fewer handoffs that can fail.
🍞 Anchor: Like a student who reads the problem, thinks silently, and writes the correct answer in one go.
🍞 Hook: Now imagine a careful thinker who drafts a plan, then a quick driver who executes it safely.
🥬 The Concept (Dual-System VLA):
- What it is: Two coordinated parts: slow, language-rich reasoning plus fast, safety-critical planning/control.
- How it works:
- VLM outputs guidance: meta-actions or coarse waypoints.
- A specialized planner refines them into smooth, feasible trajectories under tight timing.
- Safety constraints check physics and comfort.
- Why it matters: Keeps interpretability while meeting real-time and safety demands.
🍞 Anchor: Like a coach (slow thinker) calling a play and an athlete (fast executor) carrying it out perfectly on the field.
🍞 Hook: Words are friendly; numbers are precise.
🥬 The Concept (Textual vs Numerical Actions):
- What it is: Two output styles: textual actions are human-readable; numerical actions are controller-ready.
- How it works:
- Textual: meta-commands or tokenized waypoints the planner can parse.
- Numerical: continuous waypoints, speeds, or controls via MLPs or diffusion heads.
- Some models generate both for clarity and precision.
- Why it matters: Text aids understanding and instruction-following; numbers ensure exact, smooth motion.
🍞 Anchor: Saying "slow down and turn left" (text) vs. giving exact steering angles and speeds (numbers). Together, they're even better.
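Bridging the two styles is largely a parsing problem: a textual head's output must be turned back into numbers before a controller can execute it. A minimal sketch, assuming a hypothetical output format (the exact token scheme is model-specific and not standardized across VLA papers):

```python
import re

def parse_textual_action(text: str):
    """Split a hypothetical textual action into a command string and numeric
    waypoints. The 'waypoints: (x, y) ...' layout is illustrative only."""
    command = text.split("waypoints:")[0].strip().rstrip(";")
    waypoints = [
        (float(x), float(y))
        for x, y in re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", text)
    ]
    return command, waypoints

cmd, wps = parse_textual_action("slow down; turn left; waypoints: (1.0, 0.2) (2.1, 0.6)")
```

Models that emit both styles can hand the parsed waypoints to the controller while showing the readable command to the rider.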
Before vs After:
- Before (VA): Great at copying expert behavior, but weak at explaining and handling rare, ambiguous cases.
- After (VLA): Adds reasoning, instructions, and better generalization; can explain choices and align with human goals.
Why It Works (intuition):
- Language injects world knowledge and step-by-step reasoning.
- Multimodal fusion ties what's seen to what's said and then to what's done.
- Planners or diffusion heads turn high-level goals into safe, feasible motion.
Building Blocks:
- Inputs: multi-view images, ego state, optional LiDAR/maps, and instructions.
- Backbone: a Vision-Language Model to fuse and reason.
- Action head: language head (tokens), regressors (MLPs), selection modules, or generators (e.g., diffusion).
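The building blocks above can be wired together in a toy pipeline. This is a sketch under loud assumptions: every class and function name is illustrative, the "backbone" is a keyword check standing in for real vision-language reasoning, and the head emits straight-line waypoints:

```python
from dataclasses import dataclass

@dataclass
class DrivingInputs:
    camera_features: list     # stand-in for encoded multi-view image features
    ego_speed: float          # m/s
    instruction: str          # optional natural-language goal

def vlm_backbone(inputs: DrivingInputs) -> dict:
    # Toy "fusion": a real VLM aligns vision tokens with language tokens;
    # here a keyword check on the instruction stands in for reasoning.
    cautious = any(w in inputs.instruction for w in ("yield", "slow", "stop"))
    target = min(inputs.ego_speed, 5.0) if cautious else inputs.ego_speed
    return {"target_speed": target}

def action_head(reasoning: dict, horizon: int = 4) -> list:
    # Numerical head: straight-line waypoints at the reasoned speed, 1 s apart.
    v = reasoning["target_speed"]
    return [(v * t, 0.0) for t in range(1, horizon + 1)]

inputs = DrivingInputs(camera_features=[0.1, 0.3], ego_speed=10.0,
                       instruction="go straight, yield to pedestrians")
trajectory = action_head(vlm_backbone(inputs))
```

The point is the interface shape: inputs flow into one fused representation, and the head alone decides whether the output is tokens or numbers.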
03 Methodology
At a high level: Inputs (cameras, state, instructions) → Vision-Language Model (fuse + reason) → Action Head (textual or numerical) → Vehicle control.
🍞 Hook: Imagine packing for a trip: you bring pictures (maps), notes (instructions), and your current location.
🥬 The Concept (Input Modalities):
- What it is: The model's "senses": images, optional LiDAR/occupancy, text instructions, and ego-vehicle state.
- How it works:
- Sensor inputs: multi-camera RGB; sometimes LiDAR and BEV/occupancy features.
- Language: goals like "turn left at the next light."
- Vehicle state: speed, yaw rate, indicator status.
- Why it matters: Missing any piece can confuse decisions, like not knowing your speed before braking.
🍞 Anchor: Example: "Go straight" + 6 camera views + 12 m/s speed → the model plans a gentle, centered path.
🍞 Hook: You know how a bilingual friend translates pictures into words and back again?
🥬 The Concept (VLM Backbone):
- What it is: The fusion-and-reasoning engine that aligns vision features with language tokens.
- How it works:
- Vision encoder (e.g., ViT) turns images into tokens.
- A language model conditions on these tokens.
- A bridge aligns the two so the model can reason across modalities.
- Why it matters: Without tight alignment, the model may "talk" about things it doesn't see or miss visual details.
🍞 Anchor: Example: The model reads "caution: construction cones" and highlights orange cones in the image to keep a safe middle path.
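The core mechanism behind this alignment is cross-modal attention: language tokens query the visual tokens, so each word can weigh the image regions it is about. A one-dimensional toy version (real backbones use learned high-dimensional queries, keys, and values; only the softmax-weighted mixing is shown):

```python
import math

def cross_modal_attention(vision_tokens, text_tokens):
    """Toy single-head attention: each scalar text token attends over scalar
    vision tokens and returns a softmax-weighted mix of them."""
    fused = []
    for q in text_tokens:
        scores = [q * k for k in vision_tokens]          # dot-product scores
        m = max(scores)                                  # subtract for stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        fused.append(sum(w / z * v for w, v in zip(weights, vision_tokens)))
    return fused

fused = cross_modal_attention(vision_tokens=[0.0, 1.0, 2.0], text_tokens=[1.0])
```

Here the single text query ends up dominated by the largest vision token, which is exactly the "highlight the relevant region" behavior described above.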
🍞 Hook: Telling vs doing: sometimes you say what to do, sometimes you just do it.
🥬 The Concept (Action Prediction Heads):
- What it is: The module that turns reasoning into drive-ready outputs.
- How it works (four common types):
- Language Head (LH): outputs tokens like "slow down; turn left; waypoints: (x1,y1)…"
- Regression (REG): MLP outputs continuous waypoints or controls.
- Selection (SEL): scores many candidate trajectories and picks the safest.
- Generation (GEN): diffusion/VAEs sample multi-modal future paths under uncertainty.
- Why it matters: The right head balances interpretability, precision, and safety.
🍞 Anchor: Example: In a narrow lane, the SEL head picks the candidate that clears cones and keeps comfort high.
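The SEL idea is the easiest head to sketch concretely. A minimal version, with entirely illustrative scoring (minimum obstacle clearance, a hard safety cutoff, and a comfort penalty on lateral wobble):

```python
def selection_head(candidates, obstacles, safe_dist=2.0):
    """Toy SEL head: score candidate (x, y) trajectories and pick the best.
    Thresholds and weights are illustrative, not from any published model."""
    def clearance(traj):
        return min(((x - ox) ** 2 + (y - oy) ** 2) ** 0.5
                   for x, y in traj for ox, oy in obstacles)

    def jerkiness(traj):
        # Sum of lateral jumps between consecutive waypoints (comfort proxy).
        return sum(abs(y2 - y1) for (_, y1), (_, y2) in zip(traj, traj[1:]))

    def score(traj):
        c = clearance(traj)
        return (c if c >= safe_dist else -100.0) - 0.5 * jerkiness(traj)

    return max(candidates, key=score)

straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
swerve = [(1.0, 0.5), (2.0, 1.5), (3.0, 0.5)]
best = selection_head([straight, swerve], obstacles=[(2.0, 3.0)])
```

The swerving candidate is rejected because it dips inside the safety margin; the straight path wins on both clearance and comfort.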
🍞 Hook: Describing a move vs dialing in the exact angles.
🥬 The Concept (Action Spaces):
- What it is: How actions are represented: discrete waypoints, continuous profiles, direct controls, or language tokens.
- How it works:
- Discrete waypoints: a set of (x, y) steps.
- Continuous functions: speed v(t), curvature Īŗ(t).
- Direct controls: steering, throttle, brake at each step.
- Language tokens: natural-language commands or tokenized numbers.
- Why it matters: Precision and executability depend on the chosen space; language is human-friendly, numbers are actuator-friendly.
🍞 Anchor: Example: "Turn left now" (language) vs "steer 6° left, slow to 7 m/s" (numbers).
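These action spaces are interconvertible, which is why models can mix them. A finite-difference sketch of mapping discrete waypoints onto per-step speed and heading (a real stack would fit smooth v(t) and curvature profiles instead; the 0.5 s spacing is an assumption):

```python
import math

def waypoints_to_controls(waypoints, dt=0.5):
    """Convert discrete (x, y) waypoints, dt seconds apart, into
    per-segment (speed m/s, heading degrees from +x) pairs."""
    controls = []
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) / dt                 # segment speed
        heading = math.degrees(math.atan2(dy, dx))      # segment direction
        controls.append((round(speed, 2), round(heading, 2)))
    return controls

ctrl = waypoints_to_controls([(0.0, 0.0), (3.5, 0.0), (7.0, 0.5)])
```

The second segment's small heading change is what a downstream controller would turn into a gentle steering command.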
Detailed Step-Through with Example Data:
- Inputs: Front/side/rear images + state (speed 10 m/s) + instruction "go straight, yield to pedestrians." If BEV/occupancy features exist, include them.
- VLM Backbone: Encodes images to tokens; aligns with instruction tokens; reasons step-by-step (e.g., "see crosswalk, no pedestrians now, keep center").
- Action Head:
- Textual head: "Proceed straight, maintain 9-10 m/s."
- REG head: outputs 6-8 waypoints centered in lane, slight decel.
- SEL head: ranks candidate paths; picks one with lowest collision risk.
- GEN head: samples diverse futures; chooses safe, smooth sample.
- Control: Low-level controller tracks the chosen trajectory.
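The final tracking step can also be sketched. This is a deliberately naive proportional controller (real stacks use pure pursuit or MPC with vehicle dynamics); it only illustrates the hand-off from a planned trajectory to a steering command, with the ego assumed to face the +x axis:

```python
import math

def track_waypoint(waypoints, ego_xy=(0.0, 0.0), gain=0.5):
    """Toy low-level tracker: steer proportionally toward the first waypoint.
    Returns a normalized steering value clamped to [-1, 1]."""
    tx, ty = waypoints[0]
    ex, ey = ego_xy
    heading_error = math.atan2(ty - ey, tx - ex)        # radians off +x axis
    return max(-1.0, min(1.0, gain * heading_error))    # proportional steer

steer = track_waypoint([(5.0, 1.0)])
```

A waypoint slightly to the left yields a small positive steering value, which is the smooth, gentle behavior the step-through describes.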
What breaks without each step:
- No state input: braking/accel timing can be wrong.
- Weak fusion: language guidance may not match visual scene.
- Poor action head: nice explanations but shaky, jerky paths.
The Secret Sauce:
- Language-grounded reasoning narrows the action search (e.g., cones → slower, centered path).
- Dual-systems keep latency low while preserving interpretable guidance.
- Tokenized action spaces unify reasoning and planning inside one sequence for tighter alignment.
04 Experiments & Results
🍞 Hook: You know how report cards don't just show grades, they explain what those grades mean? Driving models need that too.
🥬 The Concept (What was tested and why):
- What it is: The survey compares models on open-loop accuracy (how close to expert paths) and closed-loop behavior (how well they actually drive in simulation), plus human-aligned scores.
- How it works:
- Open-loop (e.g., nuScenes): measure L2 error (meters) and Collision Rate.
- Closed-loop (e.g., Bench2Drive, NAVSIM): measure driving score, success rate, safety, comfort, progress.
- Human preference (e.g., WOD-E2E RFS): do humans prefer these trajectories?
- Why it matters: A model can look good on paper (open-loop) yet drive poorly live; we need both views.
🍞 Anchor: Like practicing basketball shots alone vs playing a real game. Both matter.
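The open-loop metrics are simple enough to compute by hand. A simplified sketch of nuScenes-style L2 and collision-rate scoring (real evaluation averages over several time horizons and checks bounding-box overlap rather than a boolean flag):

```python
import math

def open_loop_l2(pred, expert):
    """Mean L2 distance (metres) between predicted and expert (x, y) waypoints."""
    dists = [math.hypot(px - ex, py - ey)
             for (px, py), (ex, ey) in zip(pred, expert)]
    return sum(dists) / len(dists)

def collision_rate(collided_flags):
    """Fraction of evaluated scenes whose planned path hits another agent."""
    return sum(collided_flags) / len(collided_flags)

err = open_loop_l2([(1.0, 0.0), (2.0, 0.1)], [(1.0, 0.0), (2.0, 0.5)])
rate = collision_rate([False, False, True, False])
```

Scores like "0.31 m L2" in the scoreboard below are averages of exactly this kind of per-waypoint distance over a test set.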
The Competition and Scoreboard (with context):
- nuScenes (open-loop):
- Strong VA baselines like UniAD report about 0.69 m L2 with 0.12 collision rate.
- Language-guided Drive-R1 improves to ~0.31 m L2 and 0.09 collision rate, like moving from a solid B to a strong A- by using reasoning and alignment.
- WOD-E2E (human-aligned RFS):
- Some VLA models (e.g., Poutine) balance good ADE with high RFS, meaning their paths better match what humans consider safe and sensible.
- Takeaway: Aligning language reasoning with trajectory generation can increase human trust.
- NAVSIM (closed-loop, PDMS):
- VA world-model + selection methods (WoTE) reach strong PDMS ~88.3.
- VLA methods push safety and progress further: AutoVLA achieves high NC and EP; ReflectDrive and AdaThinkDrive report PDMS in the 90s range, indicating safer, smoother driving that keeps moving forward.
- Bench2Drive (closed-loop):
- SimLingo shows a top Driving Score ≈ 85.94 and high success rate by aligning instructions with actions, like following a teacher's rubric exactly and acing the project.
Surprising Findings:
- Text-only improvements aren't magic: language must be tightly coupled to planning heads; otherwise, gains are uneven.
- Human-aligned metrics (RFS) don't always track pure trajectory error; people prefer paths that "feel" safer and smoother, not just closest to the log.
- Efficiency tricks (e.g., token pruning, compact fusion) help maintain real-time performance without huge accuracy drops.
🍞 Hook: Imagine judging a driver not just by how close they mimic a map, but by how safe and comfortable the ride feels.
🥬 The Concept (Why these results matter):
- What it is: Evidence that language-grounded reasoning can reduce errors and improve closed-loop safety and human preference.
- How it works:
- Reasoning (CoT) clarifies ambiguous scenes.
- Dual-systems keep actions feasible and fast.
- Better datasets/benchmarks expose long-tail and preference gaps.
- Why it matters: Real roads are messy; models that think and explain tend to behave more reliably.
🍞 Anchor: A car that says "construction ahead, I'll slow and center" and then actually does it smoothly earns more trust than one that just silently swerves.
05 Discussion & Limitations
🍞 Hook: Even superheroes have weaknesses; knowing them helps the team.
Limitations:
- Latency and compute: High-res, multi-view vision plus language tokens can be slow; sub-50 ms decisions are still hard.
- Data cost: Getting high-quality triplets (vision-language-action) is expensive; synthetic data has sim-to-real gaps.
- Generalization: Rare edge cases (odd layouts, unusual behaviors) still trip up models.
- Hallucination risk: Language can "explain" confidently but wrongly; explanations aren't always faithful to causal reasoning.
- Temporal coherence: Limited context windows can cause flip-flop decisions over longer horizons.
Required Resources:
- Multi-camera rigs, high-throughput GPUs, curated datasets (e.g., nuScenes, NAVSIM, Bench2Drive), and sometimes simulators for closed-loop training/testing.
When NOT to Use:
- Ultra-low-latency, resource-constrained platforms without room for VLM reasoning.
- Domains with no instruction-following or explanation needs where compact VA may suffice.
- Deployments lacking robust safety layers to validate VLM guidance.
Open Questions:
- How to guarantee faithfulness of explanations and avoid hallucinations?
- Best ways to align reasoning tokens with continuous control at scale (especially with diffusion/planner heads)?
- How to build domain-specific foundation models for driving that fuse LiDAR, maps, and video efficiently?
- Practical continual learning without catastrophic forgetting, yet with safety guarantees?
- Unified evaluation that jointly scores instruction-following, reasoning quality, safety, comfort, and human preference?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper surveys how Vision-Language-Action models bring together seeing, reasoning with language, and acting to make autonomous driving more interpretable and robust.
- It organizes the field into End-to-End and Dual-System paradigms, clarifies textual vs numerical action outputs, and reviews datasets, metrics, and results.
- It also lays out challenges (latency, data, generalization, hallucination, and temporal coherence) and points to future directions like world-model integration, richer fusion, and human-centered evaluation.
Main Achievement:
- A clear roadmap and taxonomy for VLA in driving that connects architectural choices to practical benefits and trade-offs, helping researchers and practitioners design safer, more human-aligned systems.
Future Directions:
- Unify VLA with world models for proactive planning; build domain-specific driving foundation models; improve multimodal fusion; enable safe continual learning; and standardize human-centered benchmarks.
Why Remember This:
- Adding language to vision and action isn't just a gadget; it's a bridge to cars that can follow your instructions, explain their choices, and behave more reliably in messy, real-world traffic.
Practical Applications
- Voice-guided driving: "Please take the next right and avoid school zones."
- Explainable autonomy: The car narrates why it slowed or re-routed at a construction site.
- Safety coaching: The system flags risky human-driving habits with plain-language feedback.
- Fleet monitoring: Operators review language rationales to diagnose edge-case failures quickly.
- Accessible navigation: Natural-language interfaces for riders with visual or cognitive impairments.
- Policy constraints: Enforce company or regional rules via text (e.g., "no unprotected left turns").
- On-the-fly updates: Temporary instructions like "detour along Pine Street" without map changes.
- Training with synthetic stories: Use language to describe rare scenarios and practice responses.
- Human-in-the-loop debugging: Engineers edit prompts/rules to correct behaviors without full retraining.
- Personalized driving styles: "Drive smoother," "prefer right lanes," or "maximize comfort."