DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
Key Summary
- DrivingGen is a new, all-in-one test that fairly checks how well AI can imagine future driving videos and motions.
- It uses a diverse 400-clip dataset covering tough weather, night scenes, and many regions so models aren't judged only on sunny daytime roads.
- The benchmark measures four things at once: how real the videos look, how realistic the motions are, how stable things stay over time, and how well the car follows a given path.
- A new metric called Fréchet Trajectory Distance (FTD) checks if generated driving paths come from a realistic distribution, not just if frames look pretty.
- Motion-aware consistency, agent-level checks, and a test for "magically vanishing" cars make temporal realism harder to fake.
- A robust SLAM pipeline extracts 3D-like camera motion from every generated video, even when quality is poor, so no model gets credit for failures.
- Testing 14 state-of-the-art models shows a big trade-off: visually stunning models often break physics, while driving-focused ones move realistically but look less photorealistic.
- No single model wins on both photorealism and physically consistent motion, highlighting a key challenge for the field.
- DrivingGen aligns well with human judgments and exposes hidden failure modes that single scores like FVD miss.
- This benchmark guides safer, more controllable driving world models for simulation, planning, and data-driven decisions.
Why This Research Matters
- Safer training and testing: Car companies can test many rare, risky events (like icy nights) without putting anyone in danger.
- Better planning: Models that both look real and move realistically lead to smarter, calmer driving decisions.
- Fair comparisons: With a single, shared benchmark, everyone can see true strengths and weaknesses rather than chasing one easy metric.
- Fewer blind spots: Agent-level checks and motion-aware consistency stop "pretty but unsafe" generations from slipping through.
- Faster progress: Clear feedback on where models fail (visuals vs. physics vs. control) helps teams prioritize fixes that matter.
- Real-world readiness: More robust world models mean more trustworthy simulators, better data generation, and ultimately safer roads.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing a driving video game that can make up brand-new roads and traffic in real time. You'd want it to look real, feel real, and behave like the real world, right?
The Concept: Generative video models are AIs that can imagine and draw the next seconds of a scene as a video. They are being used as "world models" for driving, which means they predict how the world will change so a car can plan safely. How it works:
- Watch a short clip or get a description and optional path to follow.
- Predict the next frames of the world and how the car and others move.
- Use these imagined futures to test plans or train driving policies. Why it matters: If the imagined world looks great but moves unrealistically, or if paths are unsafe, the car can make bad decisions.
Anchor: Think of a GPS predicting traffic as you go. A driving world model is like a "video GPS" that imagines not just roads, but how every car and person might move next.
The World Before: For years, AI video models got better at making pretty, realistic-looking clips. In driving, that inspired "driving world models": AIs that imagine the future so cars can practice safely, try rare cases (like icy bridges), and learn without crashing. But how did we check if these models were any good? Mostly with general video scores like FVD, or small human ratings. That worked okay for movies and pets, but driving has special safety needs: motion must follow physics, agents (cars, people) shouldn't magically morph or disappear, and the car must obey a planned path when told.
The Problem: Existing tests missed key pieces.
- Visual-only checks ignored motion physics. A shiny video could still hide stop-and-go jitter or impossible turns.
- Temporal checks often looked only at whole scenes, not individual agents, so a parked car might flicker or vanish and still get a good score.
- Controllability, the ability to follow a given ego-trajectory, was barely tested.
- Datasets were sunny-day-heavy and city-limited, missing night, snow, fog, and global differences.
Hook: You know how a school test that only asks spelling can't tell if you understand the story? Driving tests that only check visuals can't tell if the motion makes sense.
The Concept: A benchmark is a fair, repeatable test set with rules everyone agrees on, so progress is measurable and meaningful. How it works:
- Build a diverse test set (weather, time, geography, interactions).
- Define clear metrics for visuals, motion realism, time stability, and controllability.
- Evaluate many models the same way to see trends and trade-offs. Why it matters: Without a good benchmark, teams can't compare fairly, and unsafe shortcuts might look "good enough."
Anchor: It's like the same math quiz for the whole class: you learn what you know and what to improve.
Failed Attempts: Past evaluations relied on a handful of numbers or human ratings. General video metrics (like FVD) rewarded looking good, but not driving safely. Some added a path-following score (like ADE), but missed flicker, agent identity changes, and physics. And the data wasn't diverse, so models trained on sunny roads looked fine when tested on... sunny roads.
The Gap: We needed a driving-specific benchmark that:
- Judges both pictures and physics.
- Checks consistency over time and per agent, not just whole scenes.
- Measures how well models follow a commanded path.
- Uses truly diverse data: weather, night, regions, and tricky interactions.
Real Stakes: In real life, "good-looking but wrong" can be dangerous. An AI that draws a beautiful scene where a car blinks out or the ego car slides in a way real cars cannot could cause planning mistakes. Safer testing, better training, and trustworthy simulation depend on a benchmark that catches these issues before they cause harm.
02 Core Idea
Hook: Imagine grading a science fair project not just on how pretty the poster is, but also on whether the experiment works, the data is steady over time, and the student followed the plan.
The Concept: DrivingGen is a benchmark that grades driving world models across four pillars at once: distribution realism, quality, temporal consistency, and trajectory alignment. How it works:
- Curate a 400-clip, diverse dataset (weather, night, regions, complex maneuvers) with two tracks: open-domain and ego-conditioned.
- Extract both visual signals and trajectories (with a robust SLAM pipeline) from generated videos.
- Score four dimensions:
- Distribution: FVD for videos + new FTD for trajectories.
- Quality: human-aligned image quality + driving-specific flicker + trajectory plausibility.
- Temporal Consistency: scene stability, agent identity stability, and detecting abnormal disappearances; plus motion smoothness.
- Trajectory Alignment: ADE and DTW.
- Benchmark 14 models to reveal strengths, weaknesses, and trade-offs. Why it matters: If any pillar fails (pretty but unphysical, stable but off-route, or realistic paths with ugly frames), driving decisions can go wrong.
Anchor: It's like a report card with four subjects. An A in art (visuals) but a D in physics (motion) isn't enough to drive safely.
The "Aha!" Moment (one sentence): You must judge both the look and the laws (videos and trajectories) at once, or you'll miss what really matters for safe driving.
Three analogies:
- Cooking: Don't score a cake by icing alone; check if it's baked through (motion physics), keeps shape over time (temporal consistency), and matches the recipe (alignment).
- Orchestra: The music can sound rich (visuals), but if rhythm jitters (temporal), instruments change mid-song (agent identity), or the conductor ignores the score (alignment), the performance fails.
- Sports: A player can look stylish (visuals), but footwork must be legal (physics), movement steady (consistency), and the play followed (alignment).
Before vs. After:
- Before: Single-number visuals dominated; models that "looked right" passed.
- After: Models must also "move right," "stay right," and "follow the plan." Hidden failures get exposed.
Why it works (intuition, no equations):
- Two-view lens: judging both frames and motion detects beauty-vs-physics gaps.
- Agent focus: tracking individuals catches identity flicker and vanishing acts.
- Motion-aware timing: sampling frames based on how fast things move stops static clips from cheating consistency metrics.
- Alignment duo (ADE + DTW): local step-by-step accuracy plus overall path shape both matter for control.
Building Blocks (with concept sandwiches):
Hook: You know how you can compare two playlists to see if they have the same vibe? The Concept: Fréchet Video Distance (FVD) measures how close generated videos are to real ones at a distribution level. How it works: 1) Embed videos into features. 2) Compare the clouds of features between generated and real sets. 3) Lower means closer. Why it matters: It checks overall "visual vibe," not just one clip. Anchor: If a model's city drives feel like real city drives, FVD is low.
Hook: Imagine comparing paths from two hikers to see if they traverse terrain like real hikers do. The Concept: Fréchet Trajectory Distance (FTD) measures how the distribution of generated driving paths matches real driving paths. How it works: 1) Encode trajectories with a motion encoder. 2) Compare generated vs. real path distributions. 3) Lower means more realistic motion habits. Why it matters: Pretty frames can't hide unrealistic motion anymore. Anchor: If the model often makes gentle curves where real cars would, FTD will be low.
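Under the hood, FVD and FTD share the same recipe: fit a Gaussian (mean and covariance) to the real features and to the generated features, then compute the Fréchet (2-Wasserstein) distance between the two Gaussians. Below is a minimal sketch of that shared computation, assuming the features have already been extracted by some video or trajectory encoder (the specific encoders DrivingGen uses are not reproduced here):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet (2-Wasserstein) distance between Gaussian fits of two feature sets.

    feats_real, feats_gen: arrays of shape (num_samples, feature_dim), e.g.
    video embeddings for FVD or trajectory embeddings for FTD.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; tiny imaginary parts
    # caused by numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower is better for both metrics; the only difference between FVD and FTD is which encoder produces the features.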
Hook: Think of checking a video for both beauty and usability, like watching a film on a flicker-free screen. The Concept: Quality includes subjective image quality (human-like scoring), driving-specific flicker (MMP), and trajectory plausibility (comfort, motion, curvature). How it works: 1) Rate visual appeal (CLIP-IQA+). 2) Check PWM flicker (MMP). 3) Score motion comfort and smoothness. Why it matters: A good-looking but nausea-inducing ride still fails. Anchor: Like a bus ride: comfy seating (comfort), steady speed (motion), smooth turns (curvature).
Hook: Have you ever watched a stop-motion video where objects jump around? It feels wrong. The Concept: Temporal consistency tests whether scenes and agents stay stable and logical over time. How it works: 1) Motion-aware frame sampling to judge scene consistency fairly. 2) Track agents and check identity stability. 3) Flag "magic vanishes" as abnormal unless the agent is occluded or exits the view. Why it matters: Unstable identities and pop-outs mislead planners. Anchor: If a pedestrian blinks out without walking off-screen, that's a red flag.
Hook: When your GPS says "follow this path," you check if your wheels actually follow it. The Concept: Trajectory alignment measures how closely the generated motion follows a commanded path. How it works: 1) ADE for step-by-step closeness. 2) DTW for overall shape even if speeds differ. Why it matters: Control needs both local accuracy and global intent. Anchor: If the plan says "gentle S-curve," DTW punishes a straight line or a zigzag, even if end points match.
03 Methodology
At a high level: Input (image + description + optional ego path) → Model generates video → SLAM extracts motion → DrivingGen computes four groups of metrics → Report scores and ranks.
Step 1: Diverse Dataset with Two Tracks
- What happens: DrivingGen evaluates on 400 clips across varied weather (rain/snow/fog/sandstorm), time (day/night/sunrise/sunset), regions (NA, Europe, Asia, Africa, etc.), and complex interactions (cut-ins, crossings, dense traffic). Two tracks: • Open-domain: tests generalization to wild, internet-scale scenes. • Ego-conditioned: provides a target ego-trajectory to test controllability.
- Why this exists: Without diversity, models overfit to sunny-day habits and pass tests that don't reflect the real world. The ego-conditioned track specifically tests following instructions.
- Example: A night-in-fog clip with a pedestrian stepping onto a crosswalk; in ego-conditioned mode, the car is told to follow a curved path around a bend.
Hook: Like grading both freestyle art and paint-by-numbers. The Concept: Open-domain vs. Ego-conditioned tracks. How it works: 1) Open-domain checks generalization without a given path. 2) Ego-conditioned supplies a path so you can grade obedience. Why it matters: A good driver can both improvise and follow directions. Anchor: Painting from imagination vs. tracing a template; both reveal different skills.
Step 2: Video Generation by Candidate Models
- What happens: Each of the 14 models generates 100-frame videos per prompt (plus path if provided). Models span four types: closed-source general, open-source general, physical-world, and driving-specific.
- Why this exists: Comparing many approaches fairly reveals trade-offs and trends.
- Example: A model reads "The camera moves forward along a coastal road..." and produces a seaside highway clip.
Step 3: Robust Trajectory Extraction (SLAM + Depth)
- What happens: DrivingGen feeds the generated video into a SLAM pipeline (features, PnP, RANSAC, monocular depth aids) to reconstruct the ego-camera motion. If tracking fails mid-video, the system extrapolates using a constant-velocity pose with small jitter so every video yields a usable trajectory.
- Why this exists: Dropping failures hides poor videos; forcing a trajectory for every clip makes motion metrics honest and comparable.
- Example: A blurry clip still yields a motion path; if SLAM loses lock at frame 60, the remainder is extrapolated rather than thrown away.
Hook: Like using a pedometer to track steps even when GPS is spotty. The Concept: SLAM (Simultaneous Localization and Mapping) to recover camera motion from video frames. How it works: 1) Detect and match features across frames. 2) Estimate camera pose step by step. 3) Use depth and RANSAC to stabilize; extrapolate on failure. Why it matters: Motion scores require motion data; no data, no fairness. Anchor: Even if the map app glitches, you still count steps to finish the day's total.
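The exact SLAM stack isn't spelled out here, but the failure-handling idea is easy to illustrate: when pose tracking drops out mid-clip, keep extending the trajectory with the last observed motion plus a little jitter, so every video still yields a full-length (if deliberately imperfect) path. A minimal sketch with 2D positions; the jitter scale and pose format are illustrative assumptions:

```python
import numpy as np

def complete_trajectory(tracked_xy: np.ndarray, total_frames: int,
                        jitter_std: float = 0.05, seed: int = 0) -> np.ndarray:
    """Extend a partially tracked ego path to the full clip length.

    tracked_xy: (num_tracked, 2) positions recovered before SLAM lost lock.
    Returns a (total_frames, 2) trajectory; missing frames are filled with a
    constant-velocity model plus small Gaussian jitter.
    """
    rng = np.random.default_rng(seed)
    n = len(tracked_xy)
    if n >= total_frames:
        return tracked_xy[:total_frames]

    # Last observed per-frame velocity (zero if only one pose was recovered).
    velocity = tracked_xy[-1] - tracked_xy[-2] if n >= 2 else np.zeros(2)

    filled = list(tracked_xy)
    for _ in range(total_frames - n):
        next_xy = filled[-1] + velocity + rng.normal(0.0, jitter_std, size=2)
        filled.append(next_xy)
    return np.asarray(filled)
```

Because every clip gets a trajectory, a model cannot improve its motion scores simply by producing videos that break the tracker.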
Step 4: Metric Group A – Distribution
- What happens: Compute FVD for video appearance distribution closeness and FTD for trajectory distribution realism using a motion-trajectory encoder.
- Why this exists: Looking like real driving (FVD) isn't enough; moving like real driving (FTD) prevents "pretty but wrong" from passing.
- Example: A model that often creates gentle accelerations and smooth lane-keeping earns lower (better) FTD.
Step 5: Metric Group B – Quality
- What happens: • Subjective image quality (CLIP-IQA+) approximates human visual preference. • Objective flicker quality (MMP) checks PWM-induced luminance flicker important for automotive cameras. • Trajectory quality aggregates ride comfort (jerk/lat-acc/yaw), motion sufficiency (not frozen), and curvature smoothness.
- Why this exists: Driving cares about both what you see and how it feels.
- Example: A crisp, flicker-free video with smooth, comfortable arcs scores high.
Hook: Like checking a car ride: looks clean, engine hums smoothly, and no sudden jerks. The Concept: Trajectory quality (comfort, motion, curvature). How it works: 1) Penalize sharp jerks and harsh turns. 2) Ensure the car actually moves. 3) Prefer smooth, realistic curvature over zigzags. Why it matters: A ride that looks good but makes you queasy is still bad driving. Anchor: A bus that glides through a roundabout beats one that lurches and swerves.
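The benchmark aggregates comfort, motion sufficiency, and curvature smoothness into a trajectory-quality score; the exact weights and thresholds aren't given here, so the sketch below only computes plausible raw ingredients (mean jerk, distance travelled, curvature variance) from a 2D ego path sampled at a fixed frame rate:

```python
import numpy as np

def trajectory_quality_terms(xy: np.ndarray, fps: float = 10.0) -> dict:
    """Raw ingredients for a trajectory-quality score from an (N, 2) ego path.

    Returns unweighted terms; how DrivingGen combines them is not shown here.
    """
    dt = 1.0 / fps
    vel = np.gradient(xy, dt, axis=0)    # m/s per axis
    acc = np.gradient(vel, dt, axis=0)   # m/s^2
    jerk = np.gradient(acc, dt, axis=0)  # m/s^3

    speed = np.linalg.norm(vel, axis=1)
    heading = np.unwrap(np.arctan2(vel[:, 1], vel[:, 0]))
    # Approximate curvature: heading change per metre travelled.
    curvature = np.gradient(heading) / np.maximum(speed * dt, 1e-6)

    return {
        "mean_jerk": float(np.linalg.norm(jerk, axis=1).mean()),  # comfort
        "distance": float(speed.sum() * dt),                      # motion (not frozen)
        "curvature_var": float(np.var(curvature)),                # smoothness
    }
```

High jerk hurts comfort, near-zero distance flags a frozen clip, and a noisy curvature profile signals zigzagging rather than smooth arcs.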
Step 6: Metric Group C – Temporal Consistency
- What happens: • Video consistency with motion-aware sampling: use optical flow to adjust the stride so slow clips aren't unfairly rewarded (see the sketch after this step's list). • Agent appearance consistency: detect and track each agent; compare features to catch identity/color/shape flicker. • Agent abnormal disappearance: a VLM decides if a disappearance is natural (left view or occluded) or unnatural (teleport vanish). • Trajectory consistency: check speed/acceleration stability to avoid oscillations.
- Why this exists: Planners need steady, believable worlds; pop-in/pop-out or jitter corrupts trust.
- Example: A white SUV stays white, keeps its shape, and drives smoothly without teleporting.
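The "freeze-frame cheating" defence from the first bullet above is easiest to see in code. The sketch below derives a comparison stride from the average optical-flow magnitude (here using OpenCV's Farneback flow); the exact flow-to-stride mapping is an illustrative assumption, not DrivingGen's published rule:

```python
import cv2
import numpy as np

def motion_aware_stride(gray_frames: list, min_stride: int = 2,
                        max_stride: int = 16) -> int:
    """Pick how far apart frames should be when scoring consistency.

    gray_frames: list of grayscale uint8 frames. Near-static clips get a
    LARGE stride, so they are compared across long gaps instead of being
    rewarded for trivially similar neighbouring frames.
    """
    magnitudes = []
    for prev, cur in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())
    mean_motion = float(np.mean(magnitudes)) if magnitudes else 0.0

    # Little motion -> big stride, lots of motion -> small stride.
    stride = int(round(max_stride / (1.0 + mean_motion)))
    return int(np.clip(stride, min_stride, max_stride))
```

With this stride in hand, frame-to-frame similarity is measured between frames that are genuinely far apart in motion, not just between adjacent near-duplicates.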
Hook: Like watching a magic show, but you want no magic in traffic. The Concept: Agent abnormal disappearance detection. How it works: 1) Track an agent until it disappears. 2) Present before/after frames to a vision-language model. 3) Classify natural vs. unnatural. Why it matters: Teleporting cars break realism and can mislead decision-making. Anchor: If a cyclist ducks behind a truck, that's natural; if they blink out mid-lane, that's not.
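DrivingGen delegates the natural-vs-unnatural judgement to a vision-language model; as a self-contained stand-in, the sketch below uses a purely geometric heuristic (touching the image border suggests an exit, heavy box overlap suggests occlusion). The heuristic and its thresholds are assumptions for illustration, not the benchmark's actual VLM check:

```python
def is_natural_disappearance(last_box, other_boxes, img_w, img_h,
                             border_frac=0.05, occlusion_iou=0.5):
    """Heuristic stand-in for the VLM check: was the vanish plausibly an
    exit from view or an occlusion, rather than a mid-lane teleport?

    Boxes are (x1, y1, x2, y2) in pixels; last_box is where the agent was
    last seen, other_boxes are the remaining agents in that frame.
    """
    x1, y1, x2, y2 = last_box

    # Case 1: the agent touched the image border, so it likely left the view.
    mx, my = border_frac * img_w, border_frac * img_h
    if x1 < mx or y1 < my or x2 > img_w - mx or y2 > img_h - my:
        return True

    # Case 2: another agent's box heavily overlaps it, so occlusion is plausible.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / max(area_a + area_b - inter, 1e-9)

    return any(iou(last_box, other) >= occlusion_iou for other in other_boxes)
```

Anything that vanishes away from the border and without a plausible occluder gets flagged as an abnormal disappearance.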
Step 7: Metric Group D – Trajectory Alignment
- What happens: For ego-conditioned videos, compute ADE (pointwise error) and DTW (shape-level error under time warping).
- Why this exists: Control requires local precision and correct global intent.
- Example: If the planned path is a long left curve, a straight-line shortcut raises DTW even if start/end points are close.
Hook: Like tracing a maze with a pencil; you're graded on both staying near the line and following the maze's shape. The Concept: ADE and DTW pairing. How it works: 1) ADE punishes per-step strays. 2) DTW checks if the overall curve matches, even at different speeds. Why it matters: A control policy needs both steady feet and the right destination. Anchor: Running the right route matters, not just crossing the same finish line.
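Both alignment scores are short to write down. ADE is a mean pointwise distance (assuming the generated and commanded paths are resampled to the same length), while DTW is the classic dynamic-programming alignment that tolerates speed differences; DrivingGen's exact resampling and normalisation are not reproduced here:

```python
import numpy as np

def ade(pred: np.ndarray, target: np.ndarray) -> float:
    """Average Displacement Error between same-length (N, 2) paths."""
    return float(np.linalg.norm(pred - target, axis=1).mean())

def dtw(pred: np.ndarray, target: np.ndarray) -> float:
    """Dynamic Time Warping cost: compares the paths' shapes even when
    one trajectory moves faster or slower than the other."""
    n, m = len(pred), len(target)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(pred[i - 1] - target[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the predicted path
                                 cost[i, j - 1],      # stretch the target path
                                 cost[i - 1, j - 1])  # match the two points
    return float(cost[n, m])
```

A straight-line shortcut through a planned S-curve can keep ADE moderate near the endpoints, but no amount of time-warping makes a line match the curve's shape, so DTW stays high.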
Secret Sauce Highlights:
- Motion-aware consistency prevents "freeze-frame cheating."
- FTD brings distribution-level motion realism into the spotlight.
- Agent-level checks and disappearance detection surface subtle identity and physics breaks.
- Robust SLAM with failure handling ensures every video is counted, making the benchmark hard to game.
04 Experiments & Results
The Test: DrivingGen evaluates 14 modern world models across two tracks (open-domain and ego-conditioned) over 100-frame generations. It measures distribution realism (FVD/FTD), quality (subjective, flicker, trajectory quality), temporal consistency (scene, agent, disappearance, kinematics), and alignment (ADE/DTW on ego-conditioned). This creates a full scoreboard rather than a single "beauty" grade.
The Competition: Models span four groups: closed-source general video models (e.g., Gen-3, Kling), open-source general models (e.g., CogVideoX, Wan, HunyuanVideo, LTX-Video, SkyReels), physical-world models (Cosmos-Predict1/2), and driving-specific models (Vista, GEM, VaViM, UniFuture, DrivingDojo). This diversity lets us compare photorealism-focused approaches against motion-accurate ones.
Scoreboard with Context:
- Closed-source leaders shine visually and overall: They get high subjective image quality, strong scene/agent stability, and few abnormal disappearances. Think "A in art, B in physics": great visual presence with decent motion, earning top average ranks.
- Top open-source general models compete on select axes: Models like Wan and CogVideoX can achieve strong FVD (distributional visual realism), showing that with focused training, open-source can match parts of the closed-source edge.
- Driving-specialized models nail motion and alignment but lag in visuals: They often produce more physically plausible trajectories (better ADE/DTW) and sensible kinematics but with lower image fidelity or more artifacts ("B in art, A in physics").
- No single champion: None excels across all four pillars simultaneously. The big gap is combining high photorealism with ironclad physical realism and tight controllability.
Surprising/Notable Findings:
- Alignment is tougher than expected: Even in ego-conditioned mode, ADE/DTW errors remain sizable for many models. Causes include both video artifacts (hurting SLAM recovery) and imperfect motion following.
- Static or near-static videos can trick naive consistency metrics, but not DrivingGen: Motion-aware sampling blocks this shortcut, fairly penalizing "frozen" clips.
- Agent-level checks matter: Models that look good globally can still flicker identities or make agents vanish unnaturally. These issues were often hidden by prior scene-only metrics.
- Human alignment: Aggregated scores correlate well with human judgments, especially on visual quality and overall consistency, validating metric design.
Concrete Example Walkthrough:
- A clip: "Rainy night, dense traffic; camera moves forward; a car cuts in." A pretty-but-wrong model might show glossy reflections (good FVD, subjective quality) but cause the cut-in car to flicker color or disappear behind nothing (bad agent consistency, disappearance score), and the ego path to jerk (low trajectory consistency/quality). A motion-serious model may keep the cut-in realistic and the ego smooth (good ADE/DTW, trajectory quality) but render smeared headlights and banding (lower subjective quality, flicker risks).
Bottom Line Analogy: Imagine grades across four subjects: Art (visuals), Physics (motion plausibility), History (temporal consistency over time), and PE (following the course). Current models earn mixed report cards: some ace Art, others ace Physics, but nobody is valedictorian yet.
05 Discussion & Limitations
Limitations:
- Dataset size (400 clips): balanced for practicality but not exhaustive; long-tail rare events still under-covered.
- Monocular SLAM noise: Reconstructed trajectories from generated videos can be noisy, especially when visuals have artifacts; this affects alignment and trajectory metrics.
- Open-loop focus: DrivingGen evaluates predictive generation, not yet full closed-loop control in a simulator.
- Single-camera videos: No multi-view or LiDAR/HD map, limiting tests of 3D structure and multi-sensor consistency.
- Scene controllability not yet scored: We don't grade whether models can obey edits like "add a pedestrian" or "close a lane."
Required Resources:
- Video generation dominates cost; some models need powerful GPUs and minutes per 100-frame clip.
- Evaluation suite needs a modern GPU and 1–2 days for 400 clips; agent-level checks (detection/tracking) are heaviest.
- Basic storage for videos and trajectories; reproducible environment.
When NOT to Use:
- If you must test closed-loop policy safety (full autonomy stack) today: DrivingGen is open-loop.
- If your model requires multi-view sensors or LiDAR-specific cues: current data is single-view RGB.
- If you only care about cinematic beauty without motion realism: simpler video benchmarks may suffice.
Open Questions:
- Can we marry photorealism and perfect physics in one model? What architectures or training signals help?
- How to bring multi-sensor, multi-view evaluation and view-consistency into a scalable benchmark?
- How to fairly evaluate scene controllability (agent insertion/removal, map edits) across diverse models?
- Can closed-loop simulation scoring be standardized (e.g., via CARLA/Navsim integrations) while staying comparable across models?
- How to further reduce reliance on SLAM by leveraging direct model outputs (if available) without breaking comparability?
Overall, DrivingGen is a strong foundation that exposes the true challenges. The next leaps will blend richer sensors, interactive testing, and even tighter alignment between visual fidelity and lawful, controllable motion.
06 Conclusion & Future Work
Three-Sentence Summary: DrivingGen is a comprehensive benchmark that tests generative driving world models not only on how real videos look, but also on whether motions are physically plausible, stable over time, and able to follow a commanded path. It introduces a diverse 400-clip dataset and a four-pillar metric suite: distribution (FVD/FTD), quality (visual, flicker, trajectory), temporal consistency (scene/agent/disappearance/kinematics), and alignment (ADE/DTW). Benchmarking 14 models reveals a core trade-off: photorealism vs. motion realism and controllability, with no current model mastering all.
Main Achievement: Unifying visuals and physics in one rigorous, reproducible evaluation, especially through FTD, motion-aware consistency, agent-level checks, and robust SLAM, so hidden failure modes can no longer pass unnoticed.
Future Directions: Scale data to thousands of clips for better long-tail coverage; add multi-view and multi-sensor inputs; standardize closed-loop evaluation; measure scene controllability and counterfactuals; explore composite overall scores once normalization is robust.
Why Remember This: DrivingGen shifts the field from "pretty videos" to "safe, controllable worlds," giving researchers and industry a truthful compass for progress toward reliable autonomous driving.
Practical Applications
- Score new driving world models before on-road tests to catch unsafe motion behaviors early.
- Select the best model for a simulator by balancing visual realism with trajectory fidelity using DrivingGen's four-pillar scores.
- Tune training (losses, data, augmentations) to improve FTD and temporal consistency without losing visual quality.
- Gate synthetic data: accept only generated clips that pass trajectory quality and agent-consistency thresholds.
- Benchmark controllability improvements by tracking ADE and DTW after adding ego-trajectory conditioning.
- Audit models for safety: flag abnormal agent disappearances and jittery kinematics that could mislead planners.
- Guide dataset curation: add more night/fog/snow scenes if scores dip in those conditions.
- Establish internal "model report cards" for stakeholders, mapping strengths (e.g., FVD) and weaknesses (e.g., agent stability).
- Prototype closed-loop evaluation by pairing high-alignment generators with planners in a simulator.
- Use motion-aware consistency metrics to detect and prevent "freeze-frame" failure modes during development.