DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Key Summary
- DrivePI is a single, small (0.5B) multimodal language model that sees with cameras and LiDAR, talks in natural language, and plans driving actions all at once.
- It unifies four key abilities: understanding, 3D perception, motion prediction, and planning, so the system is both smart and explainable.
- Unlike past VLA models that only used images, DrivePI adds LiDAR to capture precise 3D geometry and time, boosting safety and accuracy.
- DrivePI produces fine-grained 3D occupancy maps and occupancy flow (who moves where next) alongside language answers and planned trajectories.
- A custom data engine creates millions of question-answer pairs that teach the model to connect text with 3D space and motion over time (4D).
- On nuScenes-QA, DrivePI beats OpenDriveVLA-7B by 2.5% despite being much smaller.
- It slashes collision rate by 70% compared to ORION and outperforms specialized VA models on 3D occupancy and flow.
- The secret sauce is a spatial projector that turns high-res BEV features into compact tokens the language model can reason over without losing detail.
- All tasks are trained end-to-end together, making the model consistent across text, perception, prediction, and planning.
- Limitations include simple loss balancing and no reinforcement learning yet, but the unified design already sets new results with strong interpretability.
Why This Research Matters
DrivePI shows that one compact model can talk, see precisely in 3D, predict motion, and plan safe routes, making autonomous driving more understandable and trustworthy. For passengers, that means clearer explanations like, "I'm slowing because a cyclist will cross from the right," not just silent motions. For engineers, it means fewer moving parts and better alignment between mapping, prediction, and planning, which speeds debugging and improves safety. Cities benefit from vehicles that can interact naturally with operators and pedestrians while respecting precise spatial constraints. The model's strong results while being small suggest more affordable, energy-efficient deployments. Ultimately, DrivePI points to a future where safety, accuracy, and transparency come together instead of forcing trade-offs.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're teaching a friend to ride a bike through a park. They need to see where people and trees are, guess which way the jogger will move next, and then choose a safe path, all while understanding your spoken advice. That's a lot to juggle!
The Concept (Vision-Action models): What it is: Vision-Action (VA) models are systems that take in what the car sees (cameras/LiDAR) and output how to drive (actions) using a pipeline of skills like perception → prediction → planning. How it works:
- See the scene (images/LiDAR).
- Build a spatial map (what's where).
- Predict motion (who will move where).
- Plan a trajectory (how to drive safely). Why it matters: Without a clear pipeline, the car may miss critical spatial details and make unsafe choices. Anchor: Like a driver who first checks mirrors, predicts a cyclist's path, and then smoothly turns.
Hook: You know how asking a friend for directions helps ("Turn left at the big red sign") because words are quick and clear?
The Concept (Vision-Language-Action models): What it is: VLA models add language to the mix so the system can answer questions, explain itself, and follow spoken instructions. How it works:
- Read an instruction or question.
- Look at images.
- Reason with both.
- Output answers and actions. Why it matters: Without language, systems can't explain decisions or take helpful verbal guidance. Anchor: Asking, "Is it safe to turn left?" and getting a reasoned answer.
Hook: Imagine a game grid where some squares are free and others are blocked; it's vital to know what spots are taken before you move.
The Concept (3D occupancy perception): What it is: A 3D occupancy map tells the car which 3D cells around it are filled (by cars, people, walls) or empty. How it works:
- Divide space into many tiny 3D boxes (voxels).
- Use sensors to decide if each box is occupied and by what.
- Keep this map updated. Why it matters: Without occupancy, a car could plan a path through a truck it didn't realize was there. Anchor: Like a Lego grid where colored bricks show exactly where objects stand.
Hook: Think about watching a crowd and guessing which way people will step next so you don't bump into them.
The Concept (Occupancy flow prediction): What it is: Occupancy flow predicts how those 3D occupied cells will move over time. How it works:
- Look at recent motion.
- Predict velocity for each occupied spot.
- Roll the scene forward a few steps. Why it matters: Without flow, the car might plan into where a fast bike will soon be. Anchor: It's like seeing a ball rolling and knowing where it will be in two seconds.
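To make the "roll the scene forward" idea concrete, here is a minimal sketch (not the paper's implementation) that advances a 2D occupancy grid by per-cell velocities for a few time steps. The grid size, cell resolution, time step, and names are illustrative assumptions.

```python
import numpy as np

def roll_occupancy_forward(occ, vel, dt=0.5, steps=3, cell_size=0.5):
    """Advect an occupancy grid by per-cell velocities.

    occ: (H, W) bool array, True where a cell is occupied.
    vel: (H, W, 2) array of (vx, vy) in meters/second for each cell.
    Returns a list of predicted occupancy grids, one per future step.
    """
    H, W = occ.shape
    futures = []
    ys, xs = np.nonzero(occ)                         # occupied cell indices
    pos = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) positions in cell units
    v_cells = vel[ys, xs] / cell_size                # m/s -> cells per second
    for _ in range(steps):
        pos = pos + v_cells * dt                     # move every occupied cell forward
        grid = np.zeros((H, W), dtype=bool)
        xi = np.clip(np.round(pos[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pos[:, 1]).astype(int), 0, H - 1)
        grid[yi, xi] = True
        futures.append(grid)
    return futures
```

A planner can then check candidate paths against these future grids instead of only the current one.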
Hook: When you cross a busy hallway, you first check space and motion, then choose a safe path.
The Concept (Trajectory planning): What it is: Planning is choosing a sequence of positions (a path over time) that is safe, smooth, and goal-directed. How it works:
- Use occupancy to avoid obstacles.
- Use flow to avoid future collisions.
- Select and refine the best path. Why it matters: Without solid planning, even perfect perception won't keep you safe. Anchor: Drawing a curved line on a map to go around a crowd and reach the door.
Hook: If a storyteller could also draw a super-accurate 3D map and predict motion, you'd trust the story more.
The Concept (The gap): What it is: Past VLA systems could chat and plan but lacked fine-grained 3D outputs; VA systems had strong 3D perception but couldn't explain or follow language. How it works:
- VLA: +language, −precise 3D outputs.
- VA: +precise 3D, −language.
- The missing piece: unify both in one model. Why it matters: Without unification, we choose between safety/detail and human-friendly interaction. Anchor: A GPS that can talk but can't see depth vs. a map that sees depth but can't answer questions.
Hook: You know how two eyes (stereo) help you judge depth better? Cars can have that superpower too.
The Concept (LiDAR): What it is: LiDAR is a sensor that measures exact distances using laser pulses, giving precise 3D points. How it works:
- Send laser pulses.
- Measure return time.
- Compute distance and build a point cloud. Why it matters: Without LiDAR, fine 3D geometry is harder, especially in tricky lighting. Anchor: Like shining a flashlight and timing the bounce-back to know how far the wall is.
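For intuition on the "measure return time" step: range follows directly from time of flight, distance = c * t / 2, since the pulse travels out and back. A tiny illustrative calculation with made-up numbers:

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def lidar_range(round_trip_time_s: float) -> float:
    """Distance to the target from a laser pulse's round-trip time."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# A pulse that returns after ~200 nanoseconds hit something roughly 30 m away.
print(f"{lidar_range(200e-9):.1f} m")  # -> 30.0 m
```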
Hook: Imagine taking a bird's-eye photo of a tiny city you built on the floor; it's easier to plan paths from above.
The Concept (BEV, Bird's-Eye View): What it is: BEV is a top-down map of the scene that aligns everything in a driving-friendly coordinate system. How it works:
- Convert images/LiDAR to a flat top-down grid.
- Keep consistent real-world scales.
- Use it as the shared stage for tasks. Why it matters: Without BEV, it's hard to combine views and plan globally. Anchor: Like a tabletop city map that shows all streets at once.
Real stakes: In daily life, this means fewer confusing car behaviors, clearer explanations ("I slowed because a bike was approaching from the right"), better safety from precise 3D understanding, improved trust for passengers, and smoother integration with human instructions. Before DrivePI, you often had to pick: talkative but fuzzy 3D or precise 3D but silent. This paper aims to give you both, together, in real time.
02 Core Idea
Hook: Picture a conductor who can both read a music score (precise notes) and chat with the orchestra (language), guiding everyone to play safely and beautifully in sync.
The Concept (Aha!): What it is: The key insight is to make one small, efficient model that understands language and also directly produces fine-grained 3D maps, motion, and plans, trained end-to-end so all parts agree. How it works:
- Fuse cameras and LiDAR into a BEV feature.
- Use a spatial projector to compress high-res BEV into language-friendly tokens without losing detail.
- Feed text and vision tokens into one MLLM.
- Decode four outputs in parallel: text answers, 3D occupancy, occupancy flow, and planned trajectory. Why it matters: Without joint, fine-grained outputs, language-only planning can be unreliable; without language, perception-heavy models can't explain or take instructions. Anchor: It's like a smartphone that can see, talk, and navigate with the same app, and all features share the same up-to-date map.
Multiple analogies:
- Toolbox analogy: Before, you carried separate tools (a chatty assistant and a precise mapper). Now you get a Swiss Army knife that does both.
- School analogy: A student who can both solve detailed math steps (fine-grained) and explain them in plain words (language) gets fewer mistakes and more trust.
- Sports analogy: A quarterback reads the field (3D occupancy), predicts defenders (flow), calls the play (language), and runs the route (planning) as one fluid action.
Before vs After:
- Before: VLA could explain but struggled to output precise 3D maps; VA had precise 3D but couldn't converse.
- After: One model talks and produces 3D occupancy + flow + plans, boosting safety, interpretability, and user control.
- Before: Pipelines sometimes disagreed across modules; After: End-to-end training aligns all tasks.
Hook: You know how squishing a big poster into your backpack can wrinkle important details if you just fold it randomly?
The Concept (Spatial projector): What it is: A module that turns a big BEV feature map into a small set of tokens using attention, preserving crucial spatial detail for the language model. How it works:
- Split BEV into patches.
- Form pooled summaries.
- Use cross-attention so pooled tokens attend to patch details.
- Linearly project to language-model dimensions. Why it matters: Without careful compression, you'd lose the fine geometry needed for safe planning. Anchor: Like summarizing a giant map into key landmarks for your navigator without dropping vital turns.
Hook: If a chef can taste (sensing), describe flavors (language), and plate the dish (action) all together, dinner is both delicious and explainable.
The Concept (Fine-grained heads): What it is: Three specialized decoders generate 3D occupancy, occupancy flow, and planned trajectories directly from multimodal features. How it works:
- Re-shape tokens back to spatial maps.
- Predict voxel categories (occupancy) and velocities (flow).
- Use an action diffusion head to output future positions (trajectory). Why it matters: Without explicit heads, you rely on text-only output for pixel/voxel-level tasks, which is too coarse. Anchor: Like having dedicated kitchen stations (grill, sauce, plating) that work from one shared recipe.
Hook: Think of a quizmaster who asks about the scene, objects, and future motion so you practice thinking in space and time.
The Concept (Data engine for 4D QA): What it is: A generator that creates text-occupancy and text-flow question-answer pairs, plus planning QAs, so the model learns to link words with 3D + time. How it works:
- Make front/back captions.
- Merge and polish scene descriptions.
- Create QA about occupancy, class, and velocity from ground truth.
- Add planning QAs for actions and future trajectories. Why it matters: Without rich 4D language data, the model can't verbally reason about space and motion. Anchor: Like practice worksheets that ask, "Is (x,y,z) occupied? By what? How fast is it moving? What should the car do next?"
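A minimal sketch of what such a data engine might do with ground-truth occupancy and flow grids: sample a voxel, look up its class and velocity, and emit one question-answer pair. The class names, coordinate convention, and answer templates here are illustrative assumptions, not the paper's exact prompts.

```python
import random
import numpy as np

CLASS_NAMES = {0: "free space", 1: "car", 2: "pedestrian", 3: "building"}  # assumed labels

def make_occupancy_qa(occ_labels: np.ndarray, flow: np.ndarray):
    """Turn ground-truth voxel labels (X, Y, Z) and per-voxel flow (X, Y, Z, 2)
    into one text question-answer pair about a randomly chosen voxel."""
    X, Y, Z = occ_labels.shape
    x, y, z = random.randrange(X), random.randrange(Y), random.randrange(Z)
    label = int(occ_labels[x, y, z])
    question = f"Is voxel ({x},{y},{z}) occupied? If so, by what and how fast is it moving?"
    if label == 0:
        answer = f"Voxel ({x},{y},{z}) is free space."
    else:
        vx, vy = flow[x, y, z]
        answer = (f"Voxel ({x},{y},{z}) is occupied by a {CLASS_NAMES.get(label, 'object')}. "
                  f"vx: {vx:.1f}, vy: {vy:.1f}.")
    return {"question": question, "answer": answer}

# Example with random ground truth, just to show the output format.
occ = np.random.randint(0, 4, size=(20, 20, 8))
flow = np.random.randn(20, 20, 8, 2)
print(make_occupancy_qa(occ, flow))
```

Running templates like this over every annotated scene is how a handful of ground-truth grids can become millions of 4D QA pairs.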
Why it works (intuition):
- Shared representation: All tasks see and shape the same core features, so they reinforce each other (text anchors semantics; occupancy/flow anchor geometry; planning anchors decisions).
- Precision + explanation: LiDAR provides crisp 3D; language provides interpretability; BEV aligns the world.
- Better compression: The spatial projector preserves fine detail the LLM can reason over, avoiding the usual "too blurry to plan" problem.
Building blocks:
- Inputs: Multi-view cameras + LiDAR + text.
- Backbone: A compact Qwen2.5-0.5B MLLM.
- Spatial projector: Patch + cross-attention → vision tokens.
- Heads: Text head, 3D occupancy head, occupancy flow head, action diffusion head.
- Training: Joint losses across tasks, end-to-end.
03 Methodology
At a high level: Multi-view images and LiDAR → Vision encoder (to BEV features) → Spatial projector (to tokens) → MLLM (reasoning) → Four heads: Text, 3D occupancy, Occupancy flow, Action diffusion → Outputs.
Step 1: Vision encoding to BEV
- What happens: Multi-view camera images and LiDAR point clouds are fused by a multi-modal vision encoder to produce a compact Bird's-Eye-View feature map (H×W×C).
- Why this exists: A top-down BEV lets all sensors speak the same spatial language, simplifying perception and planning.
- Example: Six cameras and one LiDAR frame become a 100×100 BEV grid where each cell stores rich features about what's above it.
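The full encoder fuses camera and LiDAR features, but the core "points to top-down grid" idea can be shown with a LiDAR-only sketch: bin each point into a BEV cell and accumulate simple statistics. The grid extent, resolution, and chosen channels are assumptions for illustration, not the paper's encoder.

```python
import numpy as np

def lidar_to_bev(points: np.ndarray, grid=100, extent=50.0):
    """Rasterize a LiDAR point cloud into a simple BEV feature map.

    points: (N, 3) array of (x, y, z) in meters, ego vehicle at the origin.
    Returns a (grid, grid, 2) map with [point count, max height] per cell.
    """
    cell = 2 * extent / grid                       # meters per BEV cell
    bev = np.zeros((grid, grid, 2), dtype=np.float32)
    xi = ((points[:, 0] + extent) / cell).astype(int)
    yi = ((points[:, 1] + extent) / cell).astype(int)
    keep = (xi >= 0) & (xi < grid) & (yi >= 0) & (yi < grid)
    for x, y, z in zip(xi[keep], yi[keep], points[keep, 2]):
        bev[y, x, 0] += 1.0                        # density channel
        bev[y, x, 1] = max(bev[y, x, 1], z)        # max-height channel
    return bev

# 10k random points within +/-50 m become a 100x100x2 BEV tensor.
bev = lidar_to_bev(np.random.uniform(-50, 50, size=(10_000, 3)))
print(bev.shape)  # (100, 100, 2)
```

A learned encoder replaces the hand-picked density/height channels with rich feature vectors per cell, but the spatial layout is the same.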
Hook: Shrinking a big poster to fit your notebook without losing the important streets. The Concept (Spatial projector): What it is: A compressor that turns large BEV features into a small set of tokens the language model can process while keeping details. How it works:
- Patchify BEV into K×K tiles (N patches).
- Create pooled summaries of each patch.
- Use cross-attention (pooled queries attend to patch keys/values) to retain fine structure.
- Linearly project to the LLM hidden size, yielding vision tokens. Why it matters: Without it, the LLM would be overwhelmed, or you'd lose necessary geometry with naive pooling. Anchor: Summarizing a city map into a list of key intersections and turns.
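A minimal PyTorch sketch of this patchify-pool-attend-project recipe, under the assumption of one pooled query per patch and a single cross-attention layer; the real module's channel widths, query counts, and attention details may differ.

```python
import torch
import torch.nn as nn

class SpatialProjector(nn.Module):
    """Compress a BEV feature map into a short sequence of LLM-sized tokens."""

    def __init__(self, bev_dim=256, llm_dim=896, patch=10, num_heads=8):
        super().__init__()
        self.patch = patch
        # Pooled summary of each patch acts as the query; patch cells are keys/values.
        self.attn = nn.MultiheadAttention(bev_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(bev_dim, llm_dim)

    def forward(self, bev):                                    # bev: (B, C, H, W)
        B, C, H, W = bev.shape
        p = self.patch
        # (B, C, H, W) -> (B*N, p*p, C): each patch becomes a short token sequence.
        patches = bev.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 4, 5, 1).reshape(B * (H // p) * (W // p), p * p, C)
        pooled = patches.mean(dim=1, keepdim=True)             # one query per patch
        refined, _ = self.attn(pooled, patches, patches)       # query attends to patch detail
        tokens = self.proj(refined.squeeze(1)).view(B, -1, self.proj.out_features)
        return tokens                                          # (B, N, llm_dim)

# A 100x100 BEV map with 10x10 patches -> 100 vision tokens per sample.
print(SpatialProjector()(torch.randn(2, 256, 100, 100)).shape)  # torch.Size([2, 100, 896])
```

The point of the cross-attention step is that each compact token is not a blind average: it can re-weight the cells inside its patch, keeping thin structures (poles, curbs, pedestrians) visible to the language model.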
Step 2: Multimodal reasoning with the MLLM
- What happens: Concatenate text tokens (from prompts/instructions) with vision tokens (from BEV) and feed them to the MLLM.
- Why this exists: The LLM can jointly reason over language and space, aligning words with places and motions.
- Example: Prompt: "Is (65,136,7) occupied? What class? What's a safe action?" The model cross-references tokens to answer.
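Conceptually, Step 2 is "embed the prompt, append the vision tokens, run one shared model." The real backbone is a Qwen2.5-0.5B causal decoder; the stand-in below uses a generic transformer encoder purely to show how text and vision tokens share a single sequence, so the class, sizes, and ordering are assumptions.

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """Stand-in for the MLLM: embeds text tokens, prepends vision tokens,
    and runs everything through one shared transformer."""

    def __init__(self, vocab=32000, dim=896, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, text_ids, vision_tokens):
        # text_ids: (B, T) token ids; vision_tokens: (B, N, dim) from the projector.
        seq = torch.cat([vision_tokens, self.embed(text_ids)], dim=1)  # (B, N+T, dim)
        hidden = self.backbone(seq)
        return hidden, self.lm_head(hidden[:, -text_ids.shape[1]:])    # logits over text span

model = TinyMultimodalLM()
hidden, logits = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 100, 896))
print(hidden.shape, logits.shape)  # torch.Size([2, 116, 896]) torch.Size([2, 16, 32000])
```

The shared hidden states are what the four heads in Step 3 read from, which is why language answers and spatial outputs stay consistent.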
Step 3: Specialized decoding heads (four outputs in parallel)
Hook: Different kitchen stations make different parts of the meal from the same prep table.
The Concept (Text head): What it is: An auto-regressive generator for scene descriptions and QA answers. How it works:
- Attend to multimodal tokens.
- Predict next tokens word by word.
- Output captions, QAs, and reasoning. Why it matters: Without it, the model can't explain or answer queries. Anchor: Answering "Go straight. Car at (70,120,15). vx:1.5, vy:2.5."
The Concept (3D occupancy head): What it is: A decoder that outputs voxel-wise occupancy categories (what fills each 3D cell). How it works:
- Select relevant vision tokens and re-shape back to H×W×C.
- Expand along depth Z to form H×W×Z.
- Predict occupancy class per voxel. Why it matters: Without explicit voxels, the system can't precisely localize obstacles. Anchor: Marking a 3D grid with cars, buildings, and free space.
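A hedged sketch of that reshape-expand-classify pattern: vision tokens are mapped back onto a BEV grid, lifted along the height axis, and classified per voxel. Token counts, height bins, class count, and the upsampling choice are illustrative assumptions, not the paper's head.

```python
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    """Decode vision tokens into per-voxel occupancy class logits."""

    def __init__(self, llm_dim=896, bev_dim=128, hw=100, z_bins=8, num_classes=17):
        super().__init__()
        self.hw, self.z_bins, self.num_classes = hw, z_bins, num_classes
        self.to_bev = nn.Linear(llm_dim, bev_dim)
        # Each BEV cell is expanded into z_bins voxels, each with class logits.
        self.classify = nn.Linear(bev_dim, z_bins * num_classes)

    def forward(self, tokens):                                   # tokens: (B, N, llm_dim)
        B, N, _ = tokens.shape
        side = int(N ** 0.5)                                     # tokens laid out on a square grid
        bev = self.to_bev(tokens).transpose(1, 2).view(B, -1, side, side)
        bev = nn.functional.interpolate(bev, size=(self.hw, self.hw),
                                        mode="bilinear", align_corners=False)
        logits = self.classify(bev.permute(0, 2, 3, 1))          # (B, H, W, Z*classes)
        return logits.view(B, self.hw, self.hw, self.z_bins, self.num_classes)

occ_logits = OccupancyHead()(torch.randn(2, 100, 896))
print(occ_logits.shape)  # torch.Size([2, 100, 100, 8, 17])
```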
The Concept (Occupancy flow head): What it is: A decoder that predicts velocities for occupied voxels. How it works:
- Build on occupancy features (know what is where).
- Regress per-voxel velocity (vx, vy).
- Use a higher loss weight for moving cells so motion isn't drowned out by the static background. Why it matters: Without flow, future collisions are easy to miss. Anchor: Attaching tiny arrows to each car cell showing its likely movement.
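A sketch of the up-weighting idea for flow supervision (not the paper's exact loss): regress (vx, vy) per voxel and scale the error wherever the ground-truth speed is above a small threshold. The weight value and threshold are assumptions.

```python
import torch

def flow_loss(pred_flow, gt_flow, moving_weight=5.0, speed_thresh=0.2):
    """L1 velocity loss that up-weights moving voxels.

    pred_flow, gt_flow: (B, H, W, Z, 2) tensors of per-voxel (vx, vy).
    """
    err = (pred_flow - gt_flow).abs().sum(dim=-1)        # per-voxel L1 error
    speed = gt_flow.norm(dim=-1)                          # ground-truth speed
    weight = torch.where(speed > speed_thresh,
                         torch.full_like(speed, moving_weight),
                         torch.ones_like(speed))          # static cells keep weight 1
    return (weight * err).sum() / weight.sum()

pred = torch.randn(2, 100, 100, 8, 2)
gt = torch.randn(2, 100, 100, 8, 2)
print(flow_loss(pred, gt).item())
```

Because most voxels are static background, an unweighted loss would let the model predict near-zero velocity everywhere; the weighting keeps the few moving cells from being averaged away.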
The Concept (Action diffusion head): What it is: A trajectory planner that uses a diffusion-style denoising process to generate smooth future paths. How it works:
- Start from a noisy trajectory guess.
- Iteratively denoise using scene features.
- Output a 6-step (or more) future path. Why it matters: Without a robust path generator, plans may be jerky or unsafe. Anchor: Sketching a rough route in pencil, then cleaning it up step-by-step.
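To show the shape of such a denoising loop, here is a toy sketch: a small network conditioned on a scene feature predicts the noise to remove from a candidate trajectory over a few iterations. The schedule, network, and step count are simplified assumptions, not the paper's planner.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Predicts the noise present in a candidate trajectory, given scene context."""

    def __init__(self, horizon=6, scene_dim=896, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(horizon * 2 + scene_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * 2))

    def forward(self, traj, scene, t):
        x = torch.cat([traj.flatten(1), scene, t], dim=-1)   # waypoints + context + timestep
        return self.net(x).view(-1, self.horizon, 2)

@torch.no_grad()
def denoise_trajectory(model, scene, steps=10, horizon=6):
    """Start from pure noise and iteratively subtract predicted noise."""
    traj = torch.randn(scene.shape[0], horizon, 2)            # noisy (x, y) waypoints
    for i in reversed(range(steps)):
        t = torch.full((scene.shape[0], 1), i / steps)        # normalized timestep
        traj = traj - model(traj, scene, t) / steps           # simple denoising update
    return traj                                               # (B, horizon, 2) planned path

plan = denoise_trajectory(NoisePredictor(), torch.randn(2, 896))
print(plan.shape)  # torch.Size([2, 6, 2])
```

The "pencil sketch then clean-up" anchor maps directly to the loop: the first iterations fix the coarse direction, the last ones smooth the fine waypoint positions.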
Step 4: Joint training with balanced losses
- What happens: Optimize text understanding (L_llm), occupancy (L_occ), flow (L_flow), and planning (L_action) together with weights.
- Why this exists: End-to-end learning makes the outputs agree: language matches maps, maps match plans.
- Example: Train on 1M+ QA pairs plus occupancy/flow/plan supervision; tune weights so none of the tasks dominate.
Hook: Choosing a fair grading system so math, writing, and PE all count.
The Concept (End-to-end joint optimization): What it is: Training all tasks at once so features are shared and consistent. How it works:
- Compute all task losses.
- Weight and sum them.
- Backpropagate through the entire model (except the frozen vision encoder at first). Why it matters: Without joint training, parts learn different stories and disagree. Anchor: A team practicing offense and defense in the same scrimmage so they coordinate.
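In code form, the joint objective is simply a weighted sum of the four task losses named in Step 4. The weight values below are placeholders, not the paper's settings:

```python
# Hypothetical weights; the paper tunes these so no single task dominates.
WEIGHTS = {"llm": 1.0, "occ": 1.0, "flow": 2.0, "action": 1.0}

def total_loss(l_llm, l_occ, l_flow, l_action):
    """L_total = w_llm*L_llm + w_occ*L_occ + w_flow*L_flow + w_action*L_action."""
    return (WEIGHTS["llm"] * l_llm + WEIGHTS["occ"] * l_occ
            + WEIGHTS["flow"] * l_flow + WEIGHTS["action"] * l_action)
```

One backward pass through this single scalar is what forces the text head, occupancy head, flow head, and planner to shape the same shared features.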
The secret sauce:
- Rich inputs (images + LiDAR) give both appearance and geometry.
- Spatial projector preserves detail for the small LLM.
- Fine-grained heads guarantee pixel/voxel-level outputs, not just words.
- A 4D QA data engine teaches the model to talk about space and time, making it interpretable and controllable.
04 Experiments & Results
Hook: When you try a new bike, you don't just look at it; you ride it, test the brakes, and see how it handles turns.
The Concept (What the tests measure): What it is: The model is tested on four fronts (text understanding, 3D occupancy, occupancy flow, and trajectory planning) so we know it can explain, see precisely, predict motion, and drive safely. How it works:
- Text: nuScenes-QA checks language understanding.
- Occupancy: OpenOcc/Occ3D check voxel accuracy (RayIoU and OccScore).
- Flow: OpenOcc checks velocity error (mAVE).
- Planning: nuScenes checks L2 path error and collision rate. Why it matters: Without thorough tests, we might trust a model that's only good at talking or only good at mapping. Anchor: Like grading a student on reading, math, science, and PE, not just one subject.
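For a feel of the planning metrics, here is a simplified sketch of average L2 error and a crude collision check against an occupancy grid. Real benchmark implementations handle timestamps, vehicle footprints, and class filtering far more carefully; the grid convention here is an assumption.

```python
import numpy as np

def average_l2(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and ground-truth waypoints.
    Both arrays have shape (T, 2)."""
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

def collides(pred_traj, occ_grid, extent=50.0):
    """Crude collision check: does any planned waypoint land in an occupied BEV cell?
    occ_grid: (H, W) bool array covering [-extent, extent] meters in x and y."""
    H, W = occ_grid.shape
    xi = ((pred_traj[:, 0] + extent) / (2 * extent) * W).astype(int).clip(0, W - 1)
    yi = ((pred_traj[:, 1] + extent) / (2 * extent) * H).astype(int).clip(0, H - 1)
    return bool(occ_grid[yi, xi].any())

pred = np.array([[0.0, 1.0], [0.0, 2.2], [0.1, 3.5]])
gt = np.array([[0.0, 1.1], [0.0, 2.0], [0.0, 3.4]])
print(average_l2(pred, gt))  # small values mean the plan tracks the expert path
```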
Hook: A scoreboard means more when you know who you're playing against.
The Concept (Baselines and competitors): What it is: DrivePI is compared to strong VA and VLA systems, including OpenDriveVLA-7B, ORION, VAD, UniAD, and leading occupancy/flow models like FB-Occ and ALOcc-Flow-3D. How it works:
- Use the same datasets and metrics.
- Report numbers fairly.
- Highlight both accuracy and safety (collisions). Why it matters: Without fair comparisons, results don't mean much. Anchor: Beating a varsity team says more than winning a practice match.
Hook: What do the numbers really mean? Think of grades with class averages.
The Concept (Scoreboard with context): What it is: Results show DrivePI excels across all tasks despite being small (0.5B). How it works:
- Text: 60.7% on nuScenes-QA vs OpenDriveVLA-7B at 58.2% (like scoring higher than a bigger classmate).
- Occupancy: 49.3% RayIoU on OpenOcc; +10.3 over FB-Occ; OccScore up to 49.3; on Occ3D, 46.0% beats OPUS by 4.8.
- Flow: mAVE 0.509, improving over FB-Occ's 0.591 and better than ALOcc-Flow-3D.
- Planning: With ego status, collision rate 0.11%, a 70% drop vs ORION's 0.37%; L2 errors competitive, 0.40 avg. Why it matters: Higher occupancy/flow accuracy means safer maps; lower collisions mean safer rides; better QA means clearer explanations. Anchor: It's like getting A-levels in science and math while also being the best at debate club: balanced excellence.
Hook: Did anything surprise us?
The Concept (Surprising findings): What it is: Even with just 0.5B parameters, DrivePI matches or beats much larger models when it has LiDAR, the spatial projector, and fine-grained heads. How it works:
- LiDAR boosts geometry.
- The projector preserves detail for the LLM.
- Joint training aligns tasks. Why it matters: Smart design can beat brute size. Anchor: A small, well-coached team outplays a bigger but uncoordinated one.
05 Discussion & Limitations
Hook: Even the best backpack has a weight limit and works better for some trips than others.
The Concept (Limitations): What it is: Areas where DrivePI can improve. How it works:
- Loss balancing is simple; fine-tuning weights could further improve trade-offs.
- No reinforcement learning (RL) yet; RL could help with complex, long-horizon planning.
- Open-loop planning evaluation; closed-loop in-simulator tests are the next step. Why it matters: Knowing limits guides safe deployment and future research. Anchor: A great map that still needs live traffic updates for rush hour.
The Concept (Required resources): What it is: What you need to use or train DrivePI. How it works:
- Multi-view cameras and LiDAR.
- A GPU setup (e.g., 8×L40S) for training.
- Access to datasets (nuScenes, OpenOcc/Occ3D) and the QA data engine outputs. Why it matters: Without the right sensors and data, results won't transfer. Anchor: You can't bake bread without flour, an oven, and a recipe.
The Concept (When not to use): What it is: Situations where DrivePI might struggle. How it works:
- Environments without LiDAR or with very different sensor setups.
- Edge cases requiring long-term trial-and-error reasoning (no RL yet).
- Ultra-tight real-time constraints where even the small LLM may be too slow without optimization. Why it matters: Picking the right tool avoids failures. Anchor: Don't bring an umbrella to a windstorm if you need a raincoat and goggles.
The Concept (Open questions): What it is: What we still don't know. How it works:
- Best strategies for dynamic loss balancing.
- How RL and closed-loop training affect safety and efficiency.
- Scaling laws: How do gains evolve with bigger backbones or more LiDAR frames?
- Generalization to novel cities, weather, and rare events. Why it matters: Answering these will shape next-generation autonomous systems. Anchor: It's like planning the next season's training once you've won the league.
06 Conclusion & Future Work
In three sentences: DrivePI is a unified, spatial-aware 4D multimodal language model that combines language understanding with fine-grained 3D occupancy, motion flow, and trajectory planning in one end-to-end system. By fusing cameras and LiDAR, compressing BEV features into language-friendly tokens, and decoding with specialized heads, it achieves strong accuracy and interpretability despite a compact 0.5B backbone. It outperforms larger VLA models in QA, beats specialized VA models in occupancy/flow, and sharply reduces collisions in planning.
Main achievement: Showing that a single, small, end-to-end VLA model can talk, see precisely in 3D, predict motion, and plan safely, all at once, matching or surpassing both larger VLA systems and specialized VA pipelines.
Future directions: Smarter loss balancing; reinforcement learning and closed-loop training for richer decision-making; broader sensor setups and cities; scaling backbones and data; and faster runtime optimizations for deployment.
Why remember this: DrivePI closes a long-standing gap between talkative but coarse VLA systems and precise but silent VA systems, proving that accuracy, safety, and explainability can live in the same model, and that great design can beat sheer size.
Practical Applications
- In-car assistant that answers passenger questions about current driving decisions and upcoming maneuvers.
- Driver training simulators that provide both precise 3D scene labels and language explanations for feedback.
- Fleet monitoring tools that review incidents with aligned 3D occupancy, motion predictions, and natural-language justifications.
- Robotic delivery vehicles that navigate tight spaces using LiDAR, while following spoken depot instructions.
- Traffic management systems that analyze occupancy flow to forecast congestion and propose safer routes.
- AR navigation apps that overlay top-down occupancy and planned paths with voice explanations for cyclists or scooters.
- Safety auditing dashboards that visualize predicted flows and planned trajectories next to collision-rate statistics.
- Data generation pipelines that create 4D QA pairs to improve spatial reasoning in other domains (e.g., warehouses).
- Simulation environments that benchmark new scenarios with unified text, occupancy, flow, and planning metrics.
- On-vehicle debugging tools that pinpoint disagreements between language reasoning and 3D predictions.