UniUGP: Unifying Understanding, Generation, and Planning For End-to-end Autonomous Driving
Key Summary
- •UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.
- •It mixes two worlds: language-smart models (for reasoning and instructions) and video-smart world models (for learning physics and motion from unlabeled videos).
- •A hybrid expert design ties three specialists together: an Understanding Expert (for chain-of-thought), a Planning Expert (for precise trajectories), and a Generation Expert (for future videos).
- •The Planning and Generation Experts use a technique called flow matching to turn noisy guesses into smooth, realistic actions and frames.
- •A four-stage training plan slowly teaches the system: basics first, then motion, then reasoning, then everything together.
- •New long‑tail datasets focus on rare, tricky events (like small obstacles or unusual accidents) and include instruction-following so the car can change plans on command.
- •On benchmarks, UniUGP beats strong baselines in understanding, reasoning, planning accuracy, collision rate, and video quality.
- •CoT (chain-of-thought) makes decisions explainable, while future video generation makes them checkable against visual physics.
- •The unified design lets unlabeled videos teach visual causality while language models add world knowledge and step‑by‑step logic.
- •This approach aims at safer, more general driving in surprising real-world situations.
Why This Research Matters
Autonomous cars must be safe not just on ordinary days but especially when something unusual happens. UniUGP blends language reasoning with visual physics so the car can explain its choices and also check them against how the world actually moves. This makes decision-making more trustworthy, because plans are both understandable and physically grounded. Using unlabeled videos to learn causality means the system can improve from huge amounts of real driving footage without hand labels. Instruction following lets people guide the car’s behavior in natural ways, increasing human control and comfort. Altogether, this pushes self-driving closer to practical, safer, and more adaptable real-world deployment.
Detailed Explanation
01Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re biking to school. Most days are easy, but once in a while something unusual happens—like a frisbee rolling into the street or a delivery truck blocking your lane. Those rare surprises are the moments when you need your best thinking.
🥬 Filling (The Actual Concept): Before this research, self-driving systems were getting good at everyday driving, but they struggled with rare, surprising events called long-tail scenarios—things that almost never happen, but matter a lot for safety. Many end-to-end models could map camera images to steering and speed, but they lacked two big powers: 1) deep world knowledge and reasoning (like knowing what a flashing construction sign means in context), and 2) strong visual dynamics learning from unlabeled videos (like predicting how a cyclist might move next).
- What it is: The paper introduces UniUGP, a unified framework that teaches a car to understand scenes, explain its reasoning, plan safe paths, and imagine future videos, all in one system.
- How it works: It combines a language-smart vision-language model (VLM) for reasoning and instructions with a video-smart world model for visual causality. A hybrid expert design ties these abilities together and uses a carefully staged training plan across many datasets, including tough long-tail cases.
- Why it matters: Without both reasoning and visual dynamics, the car either overthinks without grounding in physics or moves well without understanding tricky context—both are unsafe in rare, high-stakes moments.
🍞 Bottom Bread (Anchor): Think of approaching a foggy intersection where a pedestrian might step out. UniUGP can describe what it sees (fog, crosswalk, a person), explain its plan (slow and prepare to stop), and show a short imagined video of what could happen next—helping it choose a safer path.
🍞 Top Bread (Hook): You know how a friend who’s great at trivia can tell you facts, while another friend who skateboards can predict how a board will move? What if you needed both skills at once?
🥬 Filling (The Actual Concept): Two major families of past methods each missed something:
- Vision-Language-Action (VLA) systems used pre-trained language models to understand instructions and reason about scenes, then produced driving actions. They were great at logic and world knowledge, but they couldn’t fully leverage massive unlabeled driving videos to learn cause-and-effect motion.
- World models predicted the next video frames, learning physics-like patterns from tons of footage. They were great at visual causality, but weak at language reasoning and following human instructions. Attempts to glue them together often stayed modular (separate boxes that didn’t learn together) or trained on single datasets, so they didn’t generalize well to odd, rare scenarios.
- What was missing: A single model that truly unifies understanding (reasoning + language), generation (future videos), and planning (continuous trajectories), trained across many datasets including rare events.
- Why it matters: In long-tail events—like a ladder falling from a truck—you need both to explain why slowing is wise and to predict how the ladder might bounce or slide.
🍞 Bottom Bread (Anchor): If you tell the car “turn right after the crosswalk,” a pure VLA might follow the instruction but misjudge a fast bike; a pure world model might predict the bike but ignore the instruction. UniUGP aims to do both, safely.
🍞 Top Bread (Hook): Picture a school project team: one member explains the plan, one draws the storyboard, one handles the measurements. If they never talk, the project fails.
🥬 Filling (The Actual Concept): The gap was coordination. Models needed to align language reasoning (explainable steps), visual future imagination (physics), and precise control (smooth, safe paths). They also needed data that teaches rare cases and instruction-following, not just ordinary driving. This paper builds specialized datasets for small obstacles, accident relations, accident prediction, and instruction following—plus CoT (chain-of-thought) annotations—so the model practices the exact skills needed for long-tail safety.
🍞 Bottom Bread (Anchor): For example, the dataset might ask: “Is a traffic accident likely here? True/False.” or “Which object causes most risk?” and provide the correct reasoning and future path, teaching the model to connect what it sees, what might happen, and what it should do.
🍞 Top Bread (Hook): Imagine studying for a big exam. You don’t just read facts; you also do practice problems and explain your answers out loud. That mix builds real skill.
🥬 Filling (The Actual Concept): UniUGP’s four-stage training is like a smart study plan. First it learns to understand, then to model motion and plan, then to reason step-by-step in language, and finally to do it all together. This progression helps each part learn without confusing the others, then fuses them cleanly. The stakes are real: safer autonomous driving in everyday cities where the weird, rare event is exactly when safety matters most.
🍞 Bottom Bread (Anchor): Think of a construction site with cones and workers. UniUGP can name the hazards, explain why slowing is needed, plan a smooth path, and even generate a brief future video to check that the plan fits the visual physics.
02Core Idea
🍞 Top Bread (Hook): You know how you sometimes talk through a math problem, draw a picture to understand it, and then write the final answer? Using all three makes you more confident and correct.
🥬 Filling (The Actual Concept): The key insight: Put understanding, generation, and planning into one model so they teach each other.
- What it is: UniUGP is a unified framework with three “experts”: Understanding (for chain-of-thought and language instructions), Generation (for future video), and Planning (for continuous trajectories).
- How it works: The Understanding Expert (a pre-trained VLM backbone) reads images and text and produces tokens that represent scene meaning and reasoning. The Planning Expert uses flow matching to turn noisy action guesses into smooth, physically consistent trajectories. The Generation Expert, built from a pre-trained DiT-based video model, imagines future frames conditioned on both the reasoning and the planned motion. A hybrid expert design (Mixture-of-Transformers) lets Understanding and Planning exchange information directly, and a multi-term objective aligns logic, motion smoothness, and visual coherence.
- Why it matters: Without unification, parts disagree—language says “slow,” planning goes “fast,” or the imagined future doesn’t match the path. UniUGP keeps all three in sync.
🍞 Bottom Bread (Anchor): If the instruction is “turn left after the bus,” UniUGP can state why (bus blocks view, crosswalk ahead), plan the left turn with safe timing, and generate a short future clip that matches that careful turn.
Multiple analogies:
- Orchestra analogy: The Understanding Expert is the conductor (what and why), the Planning Expert is percussion (timing and rhythm of motion), and the Generation Expert is strings (rich, flowing visuals). The hybrid design keeps them in harmony.
- Sports playbook: Understanding calls the play, Planning runs the route, Generation is the replay video that confirms what would happen if you execute that route.
- Recipe: Understanding writes the steps, Planning measures and stirs to the right smoothness (flow matching), Generation is the taste test—does the dish look and feel right?
Before vs After:
- Before: Models either reasoned well but didn’t learn physics from unlabeled videos, or they learned physics but couldn’t follow instructions or explain decisions.
- After: UniUGP can explain its thinking (CoT), plan smooth, safe trajectories, and visualize future frames—bringing logic, motion, and visuals together.
Why it works (intuition):
- Cross-teaching: Reasoning guides where to focus visually; motion planning forces consistent, physically plausible decisions; video generation checks whether the plan “looks right” when unrolled into future frames.
- Shared tokens and conditioning: Understanding tokens condition Planning and Generation so language and vision stay aligned.
- Flow matching: Instead of jumping straight to the perfect action, the model learns a clean path from noisy guesses to good actions—leading to stable, smooth trajectories.
Building blocks (Sandwich explanations):
- 🍞 Hook: You know how teachers ask you to “show your work”? 🥬 CoT Reasoning: A step-by-step explanation the model writes to justify its plan. How: It predicts the next token in a reasoning sequence grounded in the scene. Why: Without it, you can’t trust or debug decisions. 🍞 Anchor: “Red light + crosswalk ahead → stop; pedestrian near curb → be ready to wait.”
- 🍞 Hook: Imagine polishing a rough sketch into a clean drawing. 🥬 Flow Matching: A method to turn noisy action guesses into smooth trajectories. How: Start from a noisy version of the actions and learn the vector field that denoises them over time. Why: Without it, plans can jitter or be physically inconsistent. 🍞 Anchor: The car’s 5-second path becomes a smooth curve instead of zigzags.
- 🍞 Hook: Think of a storyboard that shows what will happen next. 🥬 Video Generation (world model): The system imagines future frames given the plan and reasoning. How: A DiT-based model denoises latent video tokens, conditioned on understanding and planned actions. Why: Without this, the system can’t visually validate that the plan fits the scene dynamics. 🍞 Anchor: If the plan says “slow for worker,” the generated video shows cones and a worker passing safely before moving on.
- 🍞 Hook: A group project works best when teammates share notes. 🥬 Mixture-of-Transformers (Hybrid Experts): Understanding and Planning pass messages in shared attention layers. How: Tokens from both experts attend to each other with modality-specific feed-forward heads. Why: Without it, language and action drift apart. 🍞 Anchor: The planning route lengthens automatically when reasoning notes “rain + slippery surface → increase stopping distance.”
03Methodology
At a high level: Input (multi-frame images + optional language instruction) → Understanding Expert (tokens + chain-of-thought) ↔ Planning Expert (flow matching for trajectories) → Generation Expert (future video) → Outputs (CoT, trajectory, future frames).
Step-by-step (with Sandwich explanations and examples):
- Preparing inputs and tokens
- 🍞 Hook: Imagine turning a picture and a note into LEGO bricks you can build with.
- 🥬 What: Turn camera frames and text into aligned tokens.
How:
- Text is tokenized into words/subwords.
- Images pass through a vision encoder (ViT) and a VAE to form visual tokens.
- These tokens are aligned so that language and vision can attend to each other.
Why: Without shared token spaces, the model can’t connect “turn right” to the correct lane lines.
- 🍞 Anchor: Instruction: “Turn left after crosswalk.” Visual tokens include lane markings, crosswalk stripes, and a bus ahead.
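To make the token step concrete, here is a minimal, self-contained PyTorch sketch of how frames and text could land in one shared token space. The module names, sizes, and the toy ViT-style patch embedding are illustrative assumptions, not the paper's actual tokenizer or VAE.

```python
# Minimal sketch (not the paper's implementation): project image patches and
# text token embeddings into one shared width so both can attend to each other.
# All module names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ToyMultimodalTokenizer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, patch=16, img_ch=3):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)          # subword ids -> vectors
        self.patch_embed = nn.Conv2d(img_ch, d_model, patch, patch)  # ViT-style patchify

    def forward(self, text_ids, frames):
        # text_ids: (B, T_text)   frames: (B, T_frames, 3, H, W)
        txt = self.text_embed(text_ids)                               # (B, T_text, d)
        b, t, c, h, w = frames.shape
        vis = self.patch_embed(frames.flatten(0, 1))                  # (B*t, d, h', w')
        vis = vis.flatten(2).transpose(1, 2).reshape(b, -1, txt.size(-1))
        return torch.cat([txt, vis], dim=1)                           # one shared token sequence

tok = ToyMultimodalTokenizer()
tokens = tok(torch.randint(0, 32000, (1, 12)), torch.randn(1, 2, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 404, 512]) = 12 text tokens + 2*196 patch tokens
```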
- Understanding Expert (next-token prediction, CoT)
- 🍞 Hook: You know how you take notes before making a plan?
- 🥬 What: A pre-trained VLM backbone (Qwen2.5-VL-3B) reads the images and instruction, producing understanding tokens and chain-of-thought.
How:
- Multimodal self-attention fuses text and image context.
- The model predicts the next token in the reasoning sequence (next-token prediction loss).
- The result is an interpretable CoT that can be inspected.
Why: Without explicit reasoning, the system can’t explain why it slows, stops, or turns—making safety hard to verify.
- 🍞 Anchor: “There’s a red light and a pedestrian near the crosswalk. Action: slow to stop.”
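The training signal for this step is ordinary next-token prediction over the reasoning text. The sketch below shows only that loss; the toy GRU and small vocabulary are placeholders, since the real Understanding Expert is a Qwen2.5-VL-3B multimodal transformer.

```python
# Minimal sketch of the next-token prediction objective used to train the
# chain-of-thought. The GRU is a stand-in for the multimodal transformer;
# only the loss computation is meant to be faithful to the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 1000, 128
embed = nn.Embedding(vocab, d)
backbone = nn.GRU(d, d, batch_first=True)   # stand-in for the VLM backbone
lm_head = nn.Linear(d, vocab)

cot_ids = torch.randint(0, vocab, (2, 16))  # tokenized reasoning, e.g. "red light ... slow to stop"
hidden, _ = backbone(embed(cot_ids[:, :-1]))
logits = lm_head(hidden)                    # predict token t+1 from tokens <= t
loss_und = F.cross_entropy(logits.reshape(-1, vocab), cot_ids[:, 1:].reshape(-1))
print(float(loss_und))
```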
- Planning Expert (flow matching over actions)
- 🍞 Hook: Think of erasing smudges from a sketch until the line is clean.
- 🥬 What: Plan a future trajectory by learning to denoise actions from noise to signal.
How:
- Sample a noise level τ and mix true actions with Gaussian noise to create a_τ.
- Project history state s and noisy actions a_τ into planning tokens.
- Use Mixture-of-Transformers (MoT) layers where understanding and planning tokens attend to each other.
- Predict the denoising vector field that moves a_τ toward true actions.
- Optimize with a flow matching loss.
Why: Without flow matching, trajectories can be jittery; without MoT, plans may ignore reasoning cues like “hidden pedestrian.”
- 🍞 Anchor: When a cyclist is approaching, the plan naturally curves and slows so the minimum distance stays safe.
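Here is a minimal flow-matching sketch for the action branch, assuming a straight-line (rectified-flow) interpolation between Gaussian noise and the ground-truth waypoints. The toy network, the pooled conditioning vector, and the time convention are assumptions rather than the authors' implementation.

```python
# Flow-matching sketch for the Planning Expert (illustrative only): learn a
# velocity field that moves noisy actions a_tau toward the true trajectory.
import torch
import torch.nn as nn

class ToyPlanner(nn.Module):
    def __init__(self, horizon=10, act_dim=2, cond_dim=64, d=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + cond_dim + 1, d), nn.GELU(),
            nn.Linear(d, horizon * act_dim),
        )

    def forward(self, a_tau, cond, tau):
        x = torch.cat([a_tau.flatten(1), cond, tau], dim=1)
        return self.net(x).view_as(a_tau)    # predicted velocity field

planner = ToyPlanner()
actions = torch.randn(4, 10, 2)              # ground-truth future (x, y) waypoints
cond = torch.randn(4, 64)                    # pooled understanding/state tokens (assumed)
noise = torch.randn_like(actions)
tau = torch.rand(4, 1)                       # noise level in [0, 1]

a_tau = (1 - tau.view(4, 1, 1)) * noise + tau.view(4, 1, 1) * actions
target_velocity = actions - noise            # straight-line (rectified) flow target
pred_velocity = planner(a_tau, cond, tau)
loss_plan = ((pred_velocity - target_velocity) ** 2).mean()
print(float(loss_plan))
```

The point of the denoising formulation is that the planner never has to guess the perfect trajectory in one shot; it only has to learn a small, consistent correction at every noise level.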
- Generation Expert (future video via DiT + VAE)
- 🍞 Hook: Like playing a short movie of the future to see if your plan makes sense.
- 🥬 What: Generate coherent future video frames conditioned on both reasoning and planned actions.
How:
- Encode history frames and future frames (noised) into VAE tokens.
- Concatenate history tokens with noised future tokens.
- Condition the DiT blocks on (a) understanding hidden states and (b) action embeddings (from ground-truth or single-step denoised actions during training; from predicted actions at inference).
- Predict the denoising vector to recover clean future frames.
- Train with a flow-matching-style loss on video latents.
Why: Without conditioning on planned actions, videos can look nice but ignore the intended motion; without understanding signals, videos may miss semantic logic (e.g., workers, cones).
- 🍞 Anchor: The generated clip shows the car coasting to a stop at a red light, with a pedestrian safely crossing.
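The same flow-matching idea can be written down for video latents. The sketch below only shows how history latents, noised future latents, and the two conditioning signals (understanding hidden state and action embedding) might be combined; a toy MLP stands in for the pre-trained DiT, and the additive conditioning is a deliberate simplification.

```python
# Illustrative sketch of the Generation Expert's training signal (assumptions
# throughout): combine history latents with noised future latents, condition on
# understanding and action embeddings, and supervise only the future part.
import torch
import torch.nn as nn

B, T_hist, T_fut, D = 2, 4, 4, 32
hist_lat = torch.randn(B, T_hist, D)         # VAE latents of observed frames
fut_lat = torch.randn(B, T_fut, D)           # VAE latents of ground-truth future frames
und_cond = torch.randn(B, D)                 # pooled understanding hidden state (assumed pooled)
act_cond = torch.randn(B, D)                 # embedding of planned / ground-truth actions

tau = torch.rand(B, 1, 1)
noise = torch.randn_like(fut_lat)
fut_noised = (1 - tau) * noise + tau * fut_lat

denoiser = nn.Sequential(nn.Linear(D, 64), nn.GELU(), nn.Linear(64, D))
seq = torch.cat([hist_lat, fut_noised], dim=1)      # history + noised future tokens
seq = seq + (und_cond + act_cond).unsqueeze(1)      # crude stand-in for DiT conditioning
pred_velocity = denoiser(seq)[:, T_hist:]           # only the future tokens are supervised
loss_gen = ((pred_velocity - (fut_lat - noise)) ** 2).mean()
print(float(loss_gen))
```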
- Hybrid Expert coupling (MoT layers)
- 🍞 Hook: Team huddle—everyone shares what they see before acting.
- 🥬 What: Understanding and Planning form a Mixture-of-Transformers, attending to each other with modality-specific FFNs.
How:
- Q/K/V projections are computed for both sets of tokens.
- Multi-head self-attention mixes them.
- Separate feed-forward heads refine each modality.
Why: Without shared attention, plans can drift from logic; without modality-specific heads, each expert loses specialization.
- 🍞 Anchor: If reasoning flags “construction worker ahead,” the planner increases following distance automatically.
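A minimal Mixture-of-Transformers-style layer, under the assumption of one shared self-attention followed by per-expert feed-forward heads (layer sizes, normalization, and residual details are simplified):

```python
# Sketch of a hybrid-expert (MoT-style) layer: understanding and planning
# tokens share one self-attention, then pass through separate,
# modality-specific feed-forward heads. Not the paper's exact layer.
import torch
import torch.nn as nn

class ToyMoTLayer(nn.Module):
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn_und = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ffn_plan = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, und_tokens, plan_tokens):
        n_und = und_tokens.size(1)
        x = torch.cat([und_tokens, plan_tokens], dim=1)   # joint attention: the experts "huddle"
        mixed, _ = self.attn(x, x, x)
        x = x + mixed
        und, plan = x[:, :n_und], x[:, n_und:]
        return und + self.ffn_und(und), plan + self.ffn_plan(plan)  # expert-specific refinement

layer = ToyMoTLayer()
und_out, plan_out = layer(torch.randn(1, 20, 128), torch.randn(1, 10, 128))
print(und_out.shape, plan_out.shape)  # (1, 20, 128) (1, 10, 128)
```

The design choice mirrors the text above: attention is shared so plans can "see" the reasoning, while the feed-forward heads stay separate so each expert keeps its specialization.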
- Training strategy (four stages)
- 🍞 Hook: Study in the right order: basics, practice, explain, then combine.
- 🥬 What: A staged curriculum that builds skills progressively.
How:
- Stage 1 (Understanding): Train only the Understanding Expert on long-tail QA and ImpromptuVLA to build strong multimodal comprehension.
- Stage 2 (Visual Dynamics + Planning): Train Generation and Planning Experts on motion-rich datasets (nuScenes, NuPlan, Waymo, Lyft, Cosmos) to learn physics and trajectories.
- Stage 3 (Text Reasoning for Causal Validation): Add CoT training with custom annotated reasoning to make decisions explicit and causal.
- Stage 4 (Mixed Fusion): Jointly train all three experts on a data mixture (0.1 : 0.4 : 0.5 from Stages 1–3) with the weighted loss L_total = 0.3·L_und + 0.5·L_plan + 0.2·L_gen.
Why: Without this order, models either fail to reason well or fail to move well; the final fusion aligns them.
- 🍞 Anchor: After Stage 4, a left-turn instruction leads to a reasoned explanation, a smooth turn trajectory, and a future video matching the turn.
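As a concrete illustration of Stage 4, the sketch below implements the stated dataset mixture (0.1 : 0.4 : 0.5) and loss weights (0.3 / 0.5 / 0.2). The sampling helper is a hypothetical stand-in for the authors' data loader, shown only to make the ratios tangible.

```python
# Sketch of the Stage 4 objective described above: mix data sources per the
# stated ratio and combine the three losses with the stated weights.
import random
import torch

def sample_stage4_source(stage1_data, stage2_data, stage3_data):
    """Pick which dataset the next batch comes from (0.1 : 0.4 : 0.5 mixture)."""
    return random.choices(
        [stage1_data, stage2_data, stage3_data], weights=[0.1, 0.4, 0.5], k=1
    )[0]

def total_loss(l_und, l_plan, l_gen):
    # L_total = 0.3 * L_und + 0.5 * L_plan + 0.2 * L_gen
    return 0.3 * l_und + 0.5 * l_plan + 0.2 * l_gen

source = sample_stage4_source("stage1_qa", "stage2_motion", "stage3_cot")
print(source, float(total_loss(torch.tensor(2.0), torch.tensor(1.0), torch.tensor(0.5))))  # e.g. stage3_cot 1.2
```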
- Objectives and consistency
- 🍞 Hook: Tie your shoes on both feet so you don’t trip.
- 🥬 What: Multi-term losses ensure logic, motion, and visuals agree. How: Next-token loss for CoT, flow-matching loss for actions, and flow-matching loss for video latents. Why: Without joint objectives, one output can be right while others disagree.
- 🍞 Anchor: If CoT says “slow,” the trajectory’s speed profile slows, and the generated frames show deceleration.
Secret sauce:
- Cross-conditioning: Understanding tokens steer both Planning and Generation, while planned actions condition video. This three-way alignment turns unlabeled videos into causal teachers and lets language knowledge guide what matters in the scene.
- Practical toggle: On devices without big GPUs, the Generation Expert can be turned off at runtime; Understanding + Planning still work.
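The runtime toggle can be as simple as skipping one call. The wrapper class and method names below are hypothetical, used only to show that Understanding + Planning remain usable when the Generation Expert is switched off:

```python
# Tiny sketch of the runtime toggle described above. The model interface
# (understand / plan / generate) is hypothetical, not the paper's API.
class StubUniUGP:
    def understand(self, frames, instruction):
        return "slow for crosswalk", {"tokens": "..."}    # CoT text + understanding tokens
    def plan(self, und_tokens):
        return [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2)]        # toy (x, y) waypoints
    def generate(self, und_tokens, trajectory):
        return "future-video-latents"

def drive_step(model, frames, instruction, use_generation=True):
    cot, und_tokens = model.understand(frames, instruction)
    trajectory = model.plan(und_tokens)
    future = model.generate(und_tokens, trajectory) if use_generation else None
    return cot, trajectory, future

print(drive_step(StubUniUGP(), frames=None, instruction="turn left", use_generation=False))
```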
04Experiments & Results
🍞 Top Bread (Hook): When you test a science project, you don’t just check one thing—you test how well it sees, thinks, acts, and whether its predictions look right.
🥬 Filling (The Actual Concept): The authors evaluate UniUGP on four fronts: understanding (including rare events), chain-of-thought quality, planning accuracy and safety, and future video realism. They compare against strong baselines (GPT‑4o, Qwen2.5‑VL‑72B, Doe‑1, Epona, GenAD, UniAD) and run ablations without CoT or without Generation to show what each part contributes.
- What they measured:
- Understanding: accuracy on small/rare objects, accident relationships, and accident prediction using long-tail datasets (DADA2000, Lost and Found, StreetHazards, SOM, AADV) and DriveLM GVQA for language tasks.
- CoT: rated by GPT‑4o (subjective coherence) and BLEU (text similarity).
- Planning: L2 displacement error over 1–3s and collision rate (nuScenes style), plus instruction-following accuracy via L2 at 3s.
- Generation: FID/FVD to check visual quality and temporal consistency (nuScenes protocol shared with Epona/FSDrive).
- Why it matters: If a model only plans well but can’t explain, it’s hard to trust. If it explains but can’t avoid collisions, it’s unsafe. If its imagined videos don’t match physics, you can’t validate plans visually.
🍞 Bottom Bread (Anchor): It’s like grading a student in reading (understanding), showing work (CoT), math accuracy (planning), and lab demo (video generation).
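For readers unfamiliar with the planning metrics, here is a schematic sketch of an L2 displacement error at fixed horizons and a simple collision-rate counter. Exact sampling rates and collision-checking conventions vary across nuScenes-style evaluations, and the numbers here are made up.

```python
# Schematic metric sketch: L2 error at 1/2/3 s horizons and a collision rate.
# Toy data only; real benchmarks define the sampling and averaging precisely.
import numpy as np

def l2_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """pred, gt: (T, 2) arrays of future (x, y) waypoints sampled at `hz` Hz."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return {f"L2@{t}s": float(errors[t * hz - 1]) for t in horizons_s}

def collision_rate(collision_flags):
    """Fraction of evaluated frames whose planned trajectory hits an obstacle."""
    return float(np.mean(collision_flags))

pred = np.cumsum(np.full((6, 2), 0.5), axis=0)   # toy 3 s plan at 2 Hz
gt = pred + 0.1                                  # pretend ground truth is 0.1 m off per axis
print(l2_at_horizons(pred, gt), collision_rate([0, 0, 1, 0]))
```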
The competition and key results:
- Long-tail benchmark (Table 3): Compared to GPT‑4o and Qwen‑2.5‑VL‑72B, UniUGP achieves higher understanding accuracy (Small 89.3%, Relationship 88.6%, Abnormal Prediction 95.8%). CoT gets strong GPT and BLEU scores (0.88 and 0.240). Planning and instruction-following L2(3s) are 1.45m and 1.40m respectively, beating baselines and its own ablations. Notably, removing CoT or Generation hurts results—showing both reasoning and world-modeling improve planning.
- Planning on nuScenes (Table 4): With only the front camera and QA supervision, UniUGP attains Avg L2=1.23m and Avg collision 0.33%, outperforming Doe‑1 (1.26m, 0.53%) and Epona (1.25m, 0.36%) under similar constraints, and showing competitive safety versus methods that use richer inputs. Lower collision rate suggests better rule-following learned even from next-frame prediction.
- Generation quality (Table 5): Under Epona/FSDrive protocol, UniUGP improves FID and FVD thanks to leveraging a strong pre-trained DiT video model (Wan2.1) with action and reasoning conditioning. Trajectory-controllable videos show that changing the planned path changes the generated future frames accordingly—demonstrating controllability.
- DriveLM GVQA (Table 6): Final score 0.59, higher than FSDrive (0.57) and OmniDrive (0.56), with leading Accuracy (0.74), BLEU (0.78), ROUGE (0.76), and Match (0.41). This indicates better scene-language grounding and general driving QA.
Context for the numbers:
- Understanding 89%+ is like scoring an A on tricky perception questions where others get a B.
- L2 errors of roughly 1.23–1.45 m over a 3-second horizon are small compared to lane widths and typical braking distances, and the lower collision rate shows a practical safety gain.
- Better FID/FVD means the future looks more like reality frame-to-frame, not just one pretty picture.
Surprising findings:
- Adding a world model (Generation Expert) significantly improves not only video but also reasoning focus and planning—because learning to predict the future forces attention to distant causes (e.g., a small object far ahead).
- CoT improves planning metrics, not just explainability, suggesting that language reasoning can guide safer action selection when scenes are ambiguous.
Takeaway: The unified approach yields across-the-board gains: it sees rare hazards, explains choices, plans safer paths, and visually validates them.
05Discussion & Limitations
🍞 Top Bread (Hook): Even the best Swiss Army knife has limits—you can’t chop down a tree with it.
🥬 Filling (The Actual Concept): Honest assessment of UniUGP:
- Limitations:
- Extreme long-tail coverage: If a scenario is truly unprecedented (odd weather + novel object + unusual road layout), the model may not generalize perfectly because the training data only goes so far.
- Compute costs: The Generation Expert (video DiT) is heavy; on mobile or edge hardware it’s often disabled, which removes visual-causality validation at runtime.
- Alignment gaps: In very complex interactions (ambiguous pedestrian intent), CoT text and physical plans can still drift slightly.
- Fixed mixing ratios in Stage 4: A static dataset blend may not be optimal for every scene type or capability focus.
- Required resources:
- Multi-node GPU training (e.g., 8 nodes × 8 GPUs × 80 GB) and millions of training steps.
- Large-scale datasets for long-tail understanding and planning.
- Pre-trained VLM and video-generation backbones.
- When NOT to use:
- Ultra-tight latency devices where even Understanding + Planning is too heavy.
- Domains with completely different physics or visuals (e.g., underwater) unless retrained.
- Scenarios requiring certified safety guarantees beyond learned behavior (e.g., aviation) without additional formal methods.
- Open questions:
- Can we distill the Generation Expert to a much lighter module without losing causal benefits?
- How to dynamically reweight datasets and losses per scene difficulty to adapt on the fly?
- Can we couple CoT with physical constraints more tightly (e.g., a physics-aware reasoning checker)?
- What’s the best way to use unlabeled video at massive scale for self-supervised causal learning while keeping reasoning grounded?
🍞 Bottom Bread (Anchor): Think of UniUGP today as a very capable student driver with a great coach and a weather simulator. It’s already safer and clearer than many classmates, but there’s still room to train smarter, lighter, and for even stranger storms.
06Conclusion & Future Work
Three-sentence summary: UniUGP unifies understanding (chain-of-thought), planning (flow-matched trajectories), and generation (future video) in a single hybrid-expert model. By fusing a pre-trained VLM with a DiT-based video generator and training in four stages across diverse, long-tail datasets, it aligns language reasoning, physical motion, and visual causality. Experiments show state-of-the-art results in perception, reasoning, planning safety, and video realism, especially in rare scenarios.
Main achievement: Demonstrating that a truly unified framework—where reasoning guides action, action conditions video, and video teaches causality—can outperform specialized systems while remaining interpretable and instruction-following.
Future directions: Make the generation module lighter via distillation or sparse activation; dynamically adapt training mixes and expert weights by scene complexity; deepen physics-aware reasoning checks; expand to richer interaction (voice commands, multi-agent negotiation); and scale self-supervised causal learning from unlabeled videos.
Why remember this: UniUGP shows that the safest driving brains may be the ones that can explain their choices, imagine the near future, and act smoothly—all in sync. That combination is a practical step toward trustworthy autonomy in the messy real world.
Practical Applications
- •Dashboard explanations that tell riders why the car slowed or turned, building trust.
- •Instruction-conditioned navigation where a passenger can say ‘turn right after this light’ and the car adapts safely.
- •Scenario rehearsal: generate likely near-future videos to check if a planned maneuver is safe before executing.
- •Long-tail safety training by mining unlabeled dashcam videos to learn causal patterns (e.g., pedestrian emergence).
- •Driver-assist systems that provide CoT feedback and safer trajectory suggestions to human drivers.
- •Simulation content creation: produce controllable future videos and trajectories for training other driving models.
- •Edge-friendly mode: disable video generation at runtime but keep explainable planning in resource-limited cars.
- •Risk assessment tools that flag potential hazards (small obstacles, accident relations) earlier using unified reasoning.
- •Post-incident analysis: reconstruct reasoning, planned trajectories, and hypothetical futures for audits.
- •Interactive fleet learning: use user feedback on CoT and generated futures to refine planning policies.