Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
Key Summary
- Robots like cars and drones see the world with many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper shows a clear roadmap for teaching them to understand space by learning from all of these together.
- Instead of paying humans to label everything, the paper focuses on self-supervised pre-training so models can learn from raw, unlabeled sensor data at scale.
- It organizes the field into a simple taxonomy: single-modality learning, cross-modal learning (camera-centric and LiDAR-centric), unified multi-modal pre-training, and extensions with radar/event cameras.
- A key idea is transferring the rich word-like knowledge from 2D vision foundation models into 3D LiDAR models (and vice versa) so each sensor benefits from the other's strengths.
- Unified pre-training, which learns a shared space for all sensors and reconstructs what's missing, consistently boosts 3D detection and segmentation, like going from a class average to an honor-roll score.
- Text and language models are now used to auto-label scenes and enable open-world understanding so systems can handle new or rare objects they've never seen before.
- Generative world models help robots imagine safe futures and plan ahead, improving end-to-end driving safety metrics such as collision rate.
- The paper also maps major datasets for cars, drones, rails, and boats, showing how sensor layouts and conditions shape what models can learn.
- Finally, it lists big challenges (e.g., real-time efficiency and bridging semantics with precise geometry) and lays out a practical roadmap to future, general-purpose spatially intelligent systems.
Why This Research Matters
This roadmap shows how to teach robots to understand the real world using the sensors they already carry, without depending on expensive human labels. By unifying cameras, LiDAR, radar, and event cameras, systems become safer in bad weather, at night, and during rare surprises. Transferring knowledge from powerful 2D models into 3D makes machines smarter about what they see, not just where it is. Occupancy and world models help robots imagine likely futures, so they plan and drive more cautiously. Text grounding lets them handle new objects and explain what they're doing. Together, these advances reduce costs, improve safety, and speed up reliable deployment in cars, drones, rails, and boats.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing a video game where you must drive a car through a busy city. You can't only look through the front window; you also need mirrors, maybe a mini-map, and sound cues. Each view tells you something different, and together they keep you safe.
The Concept: Spatial Intelligence is a robot's ability to truly understand 3D space: what's around it, where it is, what might move next, and how to act safely.
- What it is: A deep understanding of scenes that mixes seeing, measuring, predicting, and deciding in the real world.
- How it works: It blends different sensor strengths, like camera semantics, LiDAR geometry, radar motion, and event cameras' fast timing, into one coherent view.
- Why it matters: Without it, autonomous systems might miss hazards, get confused by bad weather, or fail to plan safe paths. Anchor: A self-driving car that can correctly spot a stroller hidden behind a parked van, predict it might roll out, and slow down smoothly is showing Spatial Intelligence.
The World Before: Robots used to learn one sensor at a time, usually with many human-made labels. Cameras gave rich colors and textures but struggled at night and in rain. LiDAR measured precise 3D shapes but knew little about object meaning. Radar sensed motion in bad weather but had low spatial detail. Event cameras caught super-fast changes but looked unfamiliar compared to normal images. Because each sensor had gaps, robots often failed in the messy, unpredictable real world.
Hook: You know how you can learn a lot just by noticing patterns, like guessing the next line in a song without anyone teaching you? The Concept: Self-Supervised Learning (SSL) lets AI learn from raw sensor data by setting up clever puzzles using the data itself.
- What it is: Learning from unlabeled data by tasks like masking and predicting missing parts or matching different views of the same scene.
- How it works: The model hides parts of inputs (images, point clouds), tries to reconstruct them, contrasts correct pairs with mismatched ones, or predicts future frames.
- Why it matters: Without SSL, we'd need endless human labels, which are slow, costly, and incomplete for rare events. Anchor: The robot sees a video of a rainy street, hides some frames, and learns to fill in what happens next, without any labeled boxes.
The Problem: Even though SSL works well for single sensors, the big challenge is combining all sensors into a unified understanding so the robot can reason and act anywhere, not just on the exact data it was trained on. Foundation models in 2D vision (like strong image Transformers) know a lot about the open world, but fitting that knowledge into 3D sensors and real-time driving is hard.
Hook: Think of a school library card that works at every library in your city. That one card unlocks many places. The Concept: Foundation Models are big, general models trained on lots of data that can be adapted to many tasks.
- What it is: A reusable base that already understands a lot (like objects and scenes) before any fine-tuning.
- How it works: They learn broad patterns from massive datasets and then share that knowledge with new tasks or sensors via transfer or distillation.
- Why it matters: Without them, every task and sensor would start from zero, wasting time and data. Anchor: A vision foundation model knows what a "stroller" is from photos; we can teach a LiDAR model to recognize strollers too, even without 3D labels, by copying ("distilling") the 2D model's hints.
Failed Attempts: People tried late fusion (keep each sensor separate, then combine at the end). It helped a little but missed deeper relationships like which point belongs to which pixel, or how motion in radar relates to 3D shapes. Others tried training only with labels, but there aren't enough labels for every weather, city, and strange event.
Hook: Imagine two friends: one great at reading maps (camera), the other great with measuring tape (LiDAR). When they team up, they find places faster. The Concept: Cross-Modal Interaction is when different sensors learn from each other so the system understands more than any one sensor alone.
- What it is: Methods that align, guide, or distill knowledge across sensors.
- How it works: Pair pixels with points, transfer 2D semantics to 3D geometry, or use video timing to teach motion in LiDAR.
- Why it matters: Without cross-modal learning, we lose complementary strengths like 2D meaning or 3D precision. Anchor: A camera sees "bicycle," LiDAR measures its 3D shape; together the car both recognizes and precisely locates the bicycle.
The Gap This Paper Fills: The field grew fast but scattered, with many methods, sensors, and datasets and no single map. This paper builds that map. It organizes pre-training into a clear taxonomy: single-modality (camera-only, LiDAR-only), cross-modal (camera-centric and LiDAR-centric), unified frameworks (jointly train all sensors), and add-ons like radar/event cameras. It connects these to platform datasets (cars, drones, rails, boats) and shows how text and occupancy help open-world planning.
Hook: Picture a 3D coloring book where each tiny cube in space gets a label like "car," "road," or "unknown." The Concept: Occupancy is a way to represent the 3D world as filled (occupied) or empty, sometimes with semantics.
- What it is: A grid or continuous field saying what space is taken and by what.
- How it works: Models predict which 3D cells are filled now and in the future (4D), using all sensors.
- Why it matters: Without occupancy, it's hard to plan safe paths, avoid collisions, or simulate futures. Anchor: The car's planner uses a 3D occupancy map to know which spots are drivable and which are blocked by a truck or pedestrian.
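To make the idea concrete, here is a minimal numpy sketch (not from the paper) of turning a LiDAR point cloud into a binary 3D occupancy grid; the grid bounds and voxel size are illustrative assumptions, not any dataset's real settings.

```python
import numpy as np

def points_to_occupancy(points, voxel_size=0.5,
                        bounds=((-50, 50), (-50, 50), (-3, 5))):
    """Convert an (N, 3) LiDAR point cloud into a binary 3D occupancy grid.

    points:      x, y, z coordinates in metres (ego frame).
    voxel_size:  edge length of each voxel in metres (illustrative choice).
    bounds:      (min, max) extent per axis; points outside are dropped.
    """
    mins = np.array([b[0] for b in bounds])
    maxs = np.array([b[1] for b in bounds])
    shape = np.ceil((maxs - mins) / voxel_size).astype(int)

    # Keep only points inside the region of interest.
    keep = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[keep] - mins) / voxel_size).astype(int)

    grid = np.zeros(shape, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True  # mark these voxels as occupied
    return grid

# Example: 100k synthetic points -> a 200 x 200 x 16 grid the planner can query.
cloud = np.random.uniform(-50, 50, size=(100_000, 3)) * [1, 1, 0.06]
occ = points_to_occupancy(cloud)
print(occ.shape, occ.mean())  # grid size and fraction of occupied cells
```

Real systems add semantics per voxel and extend the grid over time (4D), but the core data structure is this simple "which cells are filled" map.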
Real Stakes: This roadmap isn't just academic. It affects how safely ride-hailing fleets navigate storms, how drones inspect bridges in wind, how trains spot obstacles on tracks, and how boats avoid floating debris with radar. The paper explains why learning at scale without labels, transferring knowledge from images to 3D, and unifying sensors leads to more reliable, fair, and robust autonomy you can trust.
02 Core Idea
Hook: You know how a great team doesn't just add everyone's opinions at the end? The best teams practice together so they think alike during the game.
The Concept: The paper's "Aha!" is that multi-modal pre-training should be organized and unified: train sensors together (not separately), use self-supervision at scale, transfer knowledge from powerful vision models, and represent the world in occupancy form to enable open-world reasoning and action.
- What it is: A roadmap and taxonomy showing how to go from single sensors to unified, foundation-style learning across cameras, LiDAR, radar, and events.
- How it works: Pre-train with masking, contrast, forecasting, and rendering; align camera and LiDAR; distill semantics from 2D into 3D; build a shared BEV/volumetric space; add text and occupancy for open-world tasks.
- Why it matters: Without a unified approach, systems overfit to narrow cases, break under domain shifts, and can't plan robustly. Anchor: Instead of separate camera and LiDAR models glued at the end, a unified model learns one shared 3D story of the scene and fills in missing sensor pieces.
Three Analogies:
- Orchestra: Cameras are violins (rich melody/semantics), LiDAR is percussion (precise rhythm/geometry), radar is brass (motion power). Rehearsing together (unified pre-training) creates harmony.
- Puzzle: Each sensor is a puzzle piece. Pre-training learns the picture on the box, so pieces snap together fast, especially the tricky sky or shadow parts.
- Study Group: The vision foundation model is the top student in language and semantics. LiDAR copies its study notes (distillation) to ace 3D quizzes without needing extra tutoring (labels).
Before vs After:
- Before: Single-sensor learning, label-heavy, late fusion, brittle to rare events and weather.
- After: Unified pre-training, label-light self-supervision, cross-modal distillation, occupancy/world models for open-world perception and end-to-end planning.
Why It Works (Intuition):
- Shared Latent Space: Training all modalities to reconstruct and agree on one representation (like BEV or volumetric occupancy) forces deep alignment: semantics + geometry + motion.
- Teacher-Student Transfer: 2D models carry broad word-like knowledge (open vocabulary). Passing it to 3D sensors gives them meaning without lots of 3D labels.
- Generative Objectives: Predicting missing parts and futures trains causal, physics-aware understanding instead of memorizing labels.
- Text Grounding: Language connects closed-set perception to open-world reasoning: naming new things and explaining scenes.
Building Blocks (with mini sandwich cards):
- Hook: Think of hiding some jigsaw pieces and guessing what's missing. The Concept: Masked Modeling pre-training hides parts of inputs and makes the model reconstruct them.
- What it is: A self-supervised puzzle for images, point clouds, radar, and events.
- How it works: Randomly mask tokens, encode, then decode to predict what was masked.
- Why it matters: Without it, the model won't learn strong local structure. Anchor: Hide 60% of LiDAR points; the model learns to rebuild the car roof from the remaining points.
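A minimal PyTorch sketch of the masked-modeling idea, with toy shapes and a tiny encoder (a hypothetical stand-in, not one of the surveyed models): hide a fraction of tokens, reconstruct everything, and penalize errors only on the hidden positions.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder over generic tokens (image patches or LiDAR voxels)."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(dim, dim)        # predicts the original token
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, tokens, mask_ratio=0.6):
        B, N, D = tokens.shape
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio   # True = hidden
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        recon = self.decoder(self.encoder(corrupted))
        # Reconstruction loss is computed only on the masked positions.
        return ((recon - tokens) ** 2)[mask].mean()

tokens = torch.randn(2, 256, 64)   # e.g. 256 voxel/patch embeddings per sample
print(TinyMAE()(tokens).item())
```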
- Hook: Imagine color-tagging matching frames in a photo and a 3D scan. The Concept: Contrastive Alignment pulls matching camera pixels and LiDAR points together in feature space.
- What it is: A way to teach sensors to talk the same language.
- How it works: Project points to pixels, match them, push apart mismatches.
- Why it matters: Without it, sensors disagree about the same object. Anchor: The pixel on the cyclist's jacket matches the 3D points on the cyclist; their features become close.
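A hedged sketch of pixel-to-point contrastive alignment (an InfoNCE-style loss), assuming the point-to-pixel correspondences have already been computed from calibration; the feature size and temperature are illustrative choices.

```python
import torch
import torch.nn.functional as F

def pixel_point_infonce(pixel_feats, point_feats, temperature=0.07):
    """Pull matched pixel/point features together, push mismatches apart.

    pixel_feats: (N, D) image features at the pixels each point projects to.
    point_feats: (N, D) LiDAR features for the same N points (row i matches row i).
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    point_feats = F.normalize(point_feats, dim=-1)
    logits = point_feats @ pixel_feats.t() / temperature    # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric InfoNCE: each point should pick its own pixel, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = pixel_point_infonce(torch.randn(1024, 64), torch.randn(1024, 64))
print(loss.item())
```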
- Hook: Picture an older student tutoring a younger one. The Concept: Knowledge Distillation transfers rich semantics from 2D foundation models to 3D encoders.
- What it is: A teacher-student training trick.
- How it works: The 3D student mimics soft features or pseudo-labels from the 2D teacher.
- Why it matters: Without it, 3D stays geometry-rich but meaning-poor. Anchor: A CLIP-style teacher highlights "stroller" regions; the LiDAR student learns to spot strollers from shape alone.
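A minimal sketch of the teacher-student idea, assuming a frozen 2D teacher (e.g., a CLIP-style image encoder) whose per-pixel features have already been gathered at the projected point locations; the cosine-similarity loss and dimensions are illustrative, not a specific method from the survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointStudentHead(nn.Module):
    """Maps 3D (LiDAR) point features into the 2D teacher's embedding space."""
    def __init__(self, in_dim=64, teacher_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, teacher_dim)

    def forward(self, point_feats):
        return F.normalize(self.proj(point_feats), dim=-1)

def distill_loss(student_emb, teacher_emb):
    # Teacher features come from a frozen 2D foundation model; no gradients flow back.
    teacher_emb = F.normalize(teacher_emb.detach(), dim=-1)
    return (1.0 - (student_emb * teacher_emb).sum(-1)).mean()   # 1 - cosine similarity

student = PointStudentHead()
point_feats = torch.randn(2048, 64)      # per-point features from a 3D backbone
teacher_feats = torch.randn(2048, 512)   # per-pixel teacher features at those points
print(distill_loss(student(point_feats), teacher_feats).item())
```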
- Hook: Think of a 3D Minecraft world that can play forward in time. The Concept: Generative World Models predict how the 3D scene changes over time.
- What it is: A simulator inside the model.
- How it works: Learn to forecast 3D occupancy flow or render future views.
- Why it matters: Without it, planning is reactive and risky. Anchor: The model predicts a merging car's path and chooses a safe slowdown.
Put together, the roadmap explains how to pre-train each sensor, how to cross-train them, how to train all together, and how to use text and occupancy to reach open-world perception and action.
03 Methodology
At a high level: Multi-sensor Input (cameras, LiDAR, radar/events) → [Self-Supervised Puzzles: Masking, Contrast, Forecasting, Rendering] → [Cross-Modal Alignment & Distillation] → [Unified Shared Space: BEV/Volume] → [Generative Reconstruction & Future Prediction] → Output (robust perception, occupancy, planning).
Step-by-step (like a recipe):
- Collect synchronized multi-modal data
- What happens: Gather camera images (multi-view), LiDAR point clouds, sometimes radar/event streams, all time-synced and calibrated.
- Why it exists: If timing is off, features from different sensors won't match.
- Example: A nuScenes sequence with 6 cameras + LiDAR every 0.5s around a city block.
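To see why calibration matters for everything that follows, here is a minimal sketch (with made-up calibration matrices, not any dataset's real values) of projecting LiDAR points into a camera image; if the extrinsics or timing are wrong, points land on the wrong pixels and every cross-modal step inherits the error.

```python
import numpy as np

def project_lidar_to_image(points, T_cam_from_lidar, K, img_hw):
    """Project (N, 3) LiDAR points into pixel coordinates.

    T_cam_from_lidar: 4x4 extrinsic transform (LiDAR frame -> camera frame).
    K:                3x3 camera intrinsic matrix.
    img_hw:           (height, width) used to drop points outside the image.
    """
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # homogeneous
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 0.1                       # keep points in front of the camera
    uvw = (K @ cam[in_front].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                    # perspective divide -> pixel coords
    h, w = img_hw
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[valid], np.flatnonzero(in_front)[valid]   # pixels + indices of the points

# Illustrative calibration: identity rotation, camera mounted 1.5 m above the LiDAR.
T = np.eye(4); T[1, 3] = -1.5
K = np.array([[1000., 0., 800.], [0., 1000., 450.], [0., 0., 1.]])
pts = np.random.randn(5000, 3) * [10, 2, 20] + [0, 0, 15]
uv, idx = project_lidar_to_image(pts, T, K, (900, 1600))
print(len(idx), "points land inside the image")
```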
- Single-modality pre-training
- What happens: Train each sensor alone to become strong on its own via SSL. Cameras use temporal ordering (TempO), 2D→BEV lifting (LetsMap), or 3D-aware rendering (NeRF-MAE, VisionPAD); LiDAR uses mask-and-reconstruct (MAELi, BEV-MAE), contrastive learning at point/patch/BEV levels (PointContrast, BEVContrast), or future forecasting (ALSO, 4D-Occ, Copilot4D).
- Why it exists: Each sensor must stand strong; weak solo features make fusion brittle.
- Example: Mask 60% of LiDAR voxels and train the model to reconstruct the full car shape.
- Cross-modal interaction (two directions)
A) LiDAR-centric (inject semantics into 3D)
- What happens: Pair LiDAR points with camera pixels; align features or distill from 2D foundation models (SLidR, Seal, CSC, OLIVINE).
- Why it exists: LiDAR has precise geometry but weak semantics; cameras can teach labels for free.
- Example: Use SAM/CLIP-derived pseudo labels on images and transfer them to nearby LiDAR points.
B) Camera-centric (inject geometry into 2D)
- What happens: Use LiDAR depth/occupancy as supervision so cameras learn 3D structure (DD3D, GeoMIM, OccNet, ViDAR, DriveWorld).
- Why it exists: Cameras are cheap and everywhere; teaching them depth/occupancy makes camera-only inference powerful.
- Example: Predict future LiDAR points (ViDAR) from past video, learning motion physics.
- Unified pre-training (joint training)
- What happens: Mask tokens in both images and point clouds, encode each with its backbone, transform into a shared space (BEV/volume), fuse, and reconstruct both modalities (UniPAD, UniM2AE, BEVWorld, GS3).
- Why it exists: Joint masking and reconstruction force the model to learn one shared 3D story that all sensors agree on.
- Example: Randomly mask 50% of image patches and 50% of LiDAR tokens, fuse into BEV, then reconstruct missing image pixels and LiDAR geometry together.
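For the shared-space step, here is a hedged sketch (toy tensors, illustrative grid size, a simplified mean-pooling scheme rather than any paper's exact design) of scattering per-point LiDAR features and camera features already lifted to 3D into one BEV grid that a single decoder could reconstruct both modalities from.

```python
import torch

def scatter_to_bev(xyz, feats, grid=(200, 200), cell=0.5, extent=50.0):
    """Average the features of all points/pixels falling into each BEV cell.

    xyz:   (N, 3) positions in the ego frame (camera features already lifted to 3D).
    feats: (N, D) features from either backbone.
    """
    ix = ((xyz[:, 0] + extent) / cell).long().clamp(0, grid[0] - 1)
    iy = ((xyz[:, 1] + extent) / cell).long().clamp(0, grid[1] - 1)
    flat = ix * grid[1] + iy
    D = feats.shape[1]
    bev = torch.zeros(grid[0] * grid[1], D).index_add_(0, flat, feats)
    count = torch.zeros(grid[0] * grid[1]).index_add_(0, flat, torch.ones(len(flat)))
    bev = bev / count.clamp(min=1).unsqueeze(1)           # mean-pool per cell
    return bev.view(grid[0], grid[1], D)

lidar_bev = scatter_to_bev(torch.randn(20000, 3) * 20, torch.randn(20000, 32))
cam_bev = scatter_to_bev(torch.randn(50000, 3) * 20, torch.randn(50000, 32))
fused = lidar_bev + cam_bev       # one shared BEV canvas both decoders work from
print(fused.shape)                # torch.Size([200, 200, 32])
```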
- Generative objectives for future prediction
- What happens: Train the model to roll the world forward in time: 4D occupancy flow, next-scale occupancy, or differentiable rendering (OccWorld, OccVAR, MIM4D, GaussianPretrain).
- Why it exists: Planning needs foresight; future prediction trains causal understanding.
- Example: Given the current scene, predict where space will be occupied 1–3 seconds ahead and plan a safe path.
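To illustrate the forecasting objective, here is a minimal sketch with a toy convolutional predictor (a hypothetical stand-in, not OccWorld or any surveyed model): roll a BEV occupancy grid one step forward, then check whether a candidate path stays in predicted-free space.

```python
import torch
import torch.nn as nn

class OccForecaster(nn.Module):
    """Toy model: predicts the next BEV occupancy grid from the last two frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, occ_prev, occ_now):
        return self.net(torch.stack([occ_prev, occ_now], dim=1)).squeeze(1)

def path_is_safe(pred_occ, waypoints, threshold=0.5):
    """Reject a trajectory if any waypoint lands in a predicted-occupied cell."""
    vals = pred_occ[waypoints[:, 0], waypoints[:, 1]]
    return bool((vals < threshold).all())

model = OccForecaster()
occ_t0, occ_t1 = torch.rand(1, 200, 200), torch.rand(1, 200, 200)
pred = model(occ_t0, occ_t1)[0]                    # occupancy about one step ahead
waypoints = torch.randint(0, 200, (10, 2))         # candidate path in grid cells
print("safe:", path_is_safe(pred, waypoints))
```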
- Text grounding and auto-labeling
- What happens: Distill language-aware 2D/3D signals (CLIP2Scene, OpenScene, Affinity3D, LangOcc) into the 3D space so the model can handle open vocabulary.
- Why it exists: Real roads include rare objects and new terms; language bridges the gap.
- Example: Ask "Where is the wheelchair user?" and retrieve the matching 3D region from occupancy.
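A hedged sketch of the open-vocabulary query step, assuming per-voxel features have already been distilled into a CLIP-like text-embedding space; the text embedding below is a random placeholder standing in for the output of whatever vision-language teacher was used.

```python
import torch
import torch.nn.functional as F

def query_occupancy(voxel_feats, text_emb, top_k=100):
    """Return indices of the voxels whose features best match a text query.

    voxel_feats: (V, D) language-aligned features, one per occupied voxel.
    text_emb:    (D,)  embedding of the query from a CLIP-style text encoder
                 (here just a placeholder vector).
    """
    sims = F.normalize(voxel_feats, dim=-1) @ F.normalize(text_emb, dim=0)
    return sims.topk(top_k).indices            # voxels most similar to the query

voxel_feats = torch.randn(50_000, 512)         # distilled per-voxel features
text_emb = torch.randn(512)                    # stand-in for encoding "wheelchair user"
hits = query_occupancy(voxel_feats, text_emb)
print(hits.shape)                              # indices of the best-matching 3D regions
```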
- Downstream fine-tuning and deployment
- What happens: Use the pre-trained backbone(s) for tasks like 3D detection, segmentation, occupancy, planning.
- Why it exists: Pre-training provides general skills; fine-tuning specializes for a job.
- Example: Fine-tune on nuScenes 3D detection and achieve higher mAP with fewer labels.
The Secret Sauce:
- Multi-Modal Masking: Hiding bits in both modalities and reconstructing them forces true alignment.
- BEV/Volumetric Unification: Putting sensors in the same coordinate system bakes in geometry.
- Teacher-Student Semantics: 2D foundation models make 3D models meaning-aware without 3D labels.
- Generative Futures: Forecasting trains physical and causal understanding essential for safe planning.
Concrete mini-examples:
- Masked MAE for LiDAR: Input a point cloud with 60% masked; the decoder rebuilds roads and cars → the encoder learns shape priors.
- Contrastive pixel-to-point: Project LiDAR to the image, match features at "bicycle" pixels/points, push apart mismatches → consistent semantics.
- Rendering-based camera pre-train: Use neural fields/3D Gaussians to render views from images; if the render matches the photo, the model captured correct geometry.
- Unified reconstruction: From the fused BEV, decode both masked image patches and LiDAR tokens → one shared representation that serves both sensors.
- Future occupancy: Predict which 3D cells will be occupied in 1s; the planner chooses a trajectory through free space.
What breaks without each step:
- No synchronization/calibration → misaligned features = bad fusion.
- No single-modality strength → fusion is a house on sand.
- No cross-modal transfer → LiDAR lacks meaning; cameras lack depth.
- No unified space → late fusion misses deep correspondences.
- No generative futures → planning is reactive and less safe.
- No text grounding → closed-set blindness to new objects.
04 Experiments & Results
The Test: The paper aggregates results across major tasks and benchmarks to show how pre-training strategies really help when it counts.
- 3D Object Detection (nuScenes): Measures how well the model finds and localizes objects in 3D using mean Average Precision (mAP) and the nuScenes Detection Score (NDS).
- LiDAR Semantic Segmentation: Measures point-wise labeling quality (mIoU), especially under low-label regimes.
- Self-Supervised Occupancy: Checks how well models learn dense 3D volumes without manual labels.
- Planning: Evaluates end-to-end accuracy and safety (L2 trajectory error, collision rate) when models must act.
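For readers unfamiliar with the planning metrics, here is a simplified numpy sketch of how they are commonly computed (an illustrative reading, not the exact benchmark code): the mean L2 distance between predicted and ground-truth waypoints, and the fraction of scenes whose predicted trajectory enters an occupied BEV cell.

```python
import numpy as np

def l2_error(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and ground-truth waypoints (metres)."""
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

def collision_rate(pred_trajs, occupancy_grids, cell=0.5, extent=50.0):
    """Fraction of scenes whose predicted trajectory enters an occupied BEV cell."""
    collisions = 0
    for traj, occ in zip(pred_trajs, occupancy_grids):
        idx = np.clip(((traj + extent) / cell).astype(int), 0, np.array(occ.shape) - 1)
        if occ[idx[:, 0], idx[:, 1]].any():
            collisions += 1
    return collisions / len(pred_trajs)

pred = np.cumsum(np.random.randn(6, 2) * 0.2 + [1.0, 0.0], axis=0)  # 6 future waypoints
gt = np.cumsum(np.ones((6, 2)) * [1.0, 0.0], axis=0)
print("L2:", l2_error(pred, gt))
print("collision rate:", collision_rate([pred], [np.zeros((200, 200), bool)]))
```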
The Competition: Baselines include camera-only or LiDAR-only models trained from scratch or with standard pre-training, versus cross-modal and unified methods (e.g., UniPAD, UniM2AE, BEVWorld). Some methods also compare to strong supervised pipelines.
Scoreboard Highlights (with context):
- 3D Detection (nuScenes): Unified pre-training shines. UniM2AE reaches about 71.1 mAP and 73.8 NDS, like scoring an A+ while earlier camera-only baselines hovered closer to a B range. UniPAD also delivers sizable jumps (e.g., +4–5 mAP over strong baselines), showing that learning one shared space for images and points beats stitching features together at the end.
- LiDAR Segmentation: Distilling semantics from 2D to 3D sharply boosts performance in low-label settings. With only 1% of labels, naive training might get around 30 mIoU, while advanced distillation (e.g., OLIVINE, LiMoE) reaches around 50 mIoU, like going from barely passing to a solid B+ with the same tiny study time.
- Self-Supervised Occupancy: Methods that use language-aligned features and/or 3D Gaussian splatting steadily raise IoU/mIoU without 3D labels. Trend: adding strong 2D teachers and temporal consistency improves 3D volumes, which then help open-vocabulary queries and planning.
- Planning (nuScenes): Generative world models and latent world approaches (OccWorld, LAW, SSR) lower L2 error and collision rate compared to traditional pipelines, like a driver who looks farther ahead and brakes earlier for safety. Some occupancy-driven planners achieve very low collision rates and competitive speed, showing that learning to predict futures is a powerful training signal for safe control.
Surprising Findings:
- Unified pre-training changes the game for detection: multi-modal masking and reconstruction outclass piecemeal fusion by baking in geometric consistency early.
- Scaling Laws Transfer: Bigger/better 2D teachers improve 3D students (LiDAR) without 3D labels. That means we can piggyback on existing 2D web-scale training to bootstrap 3D.
- Forecasting helps everything: Teaching models to predict the next moments (points or occupancy) improves not just planning, but also static perception, because it encodes motion and causality into the features.
- Occupancy as a universal canvas: Whether you come from images or points, expressing the scene as 3D occupancy aligns tasks (detection, mapping, planning) and even language grounding, making one representation useful across the board.
Takeaway in Plain Terms: The more we train sensors together on puzzles (reconstruct, align, forecast), and the more we borrow language-savvy from 2D giants, the better robots get at seeing, understanding, and acting safely, especially when labels are scarce and the world gets weird.
05 Discussion & Limitations
Limitations (be specific):
- Real-Time Constraints: Foundation-style unified models can be heavy; they must run faster and leaner on a car's limited onboard computer without losing too much accuracy.
- Semantic-Geometric Gap: Language-rich 2D teachers don't always transfer centimeter-precise 3D grounding. We still need better ways to tie words to exact 3D shapes and positions.
- Data Curation: SSL treats most frames equally, but driving has many boring moments and few rare hazards. We need data engines that mine the long tail and weight it more.
- Sensor Failures & Domain Shifts: Even with fusion, extreme weather, sensor dropouts, or new cities can degrade performance; robust fallback and uncertainty estimates are essential.
- Evaluation Coverage: Benchmarks are improving but still miss many edge cases (e.g., odd construction equipment, unusual road users), so measured gains may overestimate field robustness.
Required Resources:
- Multi-sensor rigs (cameras + LiDAR ± radar/events), accurate calibration, and synchronized logging.
- Substantial compute for pre-training (GPUs/TPUs), plus efficient distillation and quantization for deployment.
- Access to large-scale datasets (e.g., nuScenes, Waymo, Argoverse, UAVScenes) and simulation engines for rare-event synthesis.
When NOT to Use:
- Extremely constrained hardware with strict millisecond budgets and no room for lightweight distillation.
- Settings with no sensor calibration or wildly drifting time sync, where cross-modal learning will misalign.
- Tiny datasets without access to broader pre-training: better to first gather more unlabeled data or use teacher signals.
Open Questions:
- Can we build physically consistent world models that obey real dynamics and contact constraints, not just look plausible?
- How do we quantify and propagate uncertainty through perception → world model → planner so the car knows when to be cautious?
- What is the best shared space (BEV grids, explicit occupancy, or continuous Gaussians), and how do we scale it over time (4D) with semantics?
- How do we tokenize multi-modal scenes for fast Vision-Language-Action without losing crucial 3D detail?
- Can we design data engines that automatically find, simulate, and emphasize rare but safety-critical scenarios during pre-training?
06 Conclusion & Future Work
3-Sentence Summary: This paper maps the fast-growing world of multi-modal pre-training for autonomous systems and shows how to forge Spatial Intelligence by training cameras, LiDAR, radar, and event cameras together. It explains a clear taxonomy, from single-modality to cross-modal and unified frameworks, plus how language, occupancy, and generative world models enable open-world understanding and end-to-end planning. The result is a practical roadmap that reduces labeling needs, boosts robustness, and moves autonomy from seeing to simulating and safely acting.
Main Achievement: A unifying framework that organizes methods, datasets, and goals into one coherent path: self-supervised, cross-modal, and unified pre-training feeding into occupancy/world models and text grounding for open-world action.
Future Directions: Build physically consistent world simulators; compress foundation models for real-time use; fuse continuous 3D geometry (e.g., Gaussians) with dense semantics over time; and integrate System 2-style reasoning so cars can explain, plan, and handle rare surprises.
Why Remember This: Because it shows how to turn piles of unlabeled, messy sensor data into a single, reliable brain for robotsâone that understands space, predicts the future, and makes safe choices in the open world.
Practical Applications
- Pre-train a unified camera–LiDAR backbone on unlabeled fleet data to boost detection with less annotation.
- Use 2D foundation models (e.g., SAM/CLIP) to auto-label images, then distill semantics into LiDAR for 3D segmentation.
- Adopt BEV or occupancy as a shared canvas to fuse sensors and support both perception and planning.
- Train forecasting objectives (e.g., 4D occupancy flow) to improve end-to-end trajectory planning and reduce collisions.
- Leverage event cameras for high-speed navigation in drones where motion blur hurts RGB cameras.
- Exploit radar-LiDAR alignment to maintain perception in fog/rain for maritime or highway driving.
- Deploy data engines that mine rare corner cases from logs, up-weight them in SSL, and simulate variants.
- Compress and distill unified models into lightweight students (quantization/pruning) for real-time onboard use.
- Enable open-vocabulary queries (e.g., "find wheelchair users") by aligning 3D occupancy with language features.
- Use simulation-to-real transfer to cover long-tail hazards that are hard to capture in the real world.