
RadarGen: Automotive Radar Point Cloud Generation from Cameras

Intermediate
Tomer Borreda, Fangqiang Ding, Sanja Fidler et al. · 12/19/2025
arXiv | PDF

Key Summary

  • RadarGen is a tool that learns to generate realistic car radar point clouds just from multiple camera views.
  • It turns tricky radar data into bird's-eye images, uses a fast diffusion model to paint likely radar patterns, then mathematically picks out the exact radar points.
  • The model is guided by three helpers extracted from the cameras—depth, semantics, and motion—so the radar matches the scene's 3D layout, object types, and movement.
  • RadarGen captures core radar attributes like location, Radar Cross Section (how reflective something is), and Doppler (how fast it moves toward/away).
  • On a real driving dataset, RadarGen beats a strong baseline on most geometry and attribute metrics and narrows the gap to real radar for downstream detectors.
  • Edits to the input images (like swapping a car for a truck) automatically update the generated radar, including occlusions.
  • A lightweight deconvolution step recovers sparse radar points from smooth maps, preserving spatial detail.
  • The method trains in a few days on commodity GPUs and runs in about 10.5 seconds per frame on a single L40 GPU.
  • Limitations include dependence on upstream vision models (worse at night) and occasional hallucinations in unseen regions.
  • This is a step toward unified, camera-driven simulation across sensing types for safer, cheaper autonomous driving development.

Why This Research Matters

Safer autonomous driving needs realistic, scalable sensor simulation, especially for radar, which shines in bad weather and low visibility. RadarGen turns camera-only recordings into plausible radar, letting teams train and test without expensive radar data collection. Its controllable, editable pipeline allows rapid what-if experiments, like adding a truck or changing traffic, with radar updating automatically. This strengthens downstream models such as detectors by exposing them to more diverse scenarios. Over time, such unified, camera-driven simulation can reduce costs, speed up iteration cycles, and help catch corner cases before they appear on real roads.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how weather radars can see rain clouds even when our eyes can’t? Cars use radar too—because radar can sense things through fog, rain, and glare that cameras struggle with.

🄬 The Concept: Automotive radar creates point clouds—tiny dots showing where things are and how they reflect and move. How it works: (1) A car sends radio waves. (2) Waves bounce off objects and return. (3) A processor turns this into sparse dots with position, Radar Cross Section (RCS), and Doppler (motion toward/away). Why it matters: Without radar, self-driving systems can be blind in bad weather or make unsafe guesses.

šŸž Anchor: Imagine driving in thick fog. The camera sees gray mush, but radar still returns a handful of dots where cars are, plus how fast they’re moving.

The world before this paper:

  • Generating synthetic data for autonomous driving had made big strides for images and even LiDAR, but radar lagged behind. Cameras and LiDAR are easier: their measurements are dense or grid-like, so common image/video models fit well.
  • Real radar is quirky: it’s sparse (few dots), messy (multipath, material effects), and wrapped in a secret sauce of proprietary signal processing. Most big datasets don’t store raw radar signals—they only keep processed point clouds, which are lighter but lose fine details.

The problem researchers faced:

  • How do you generate realistic radar point clouds from cameras alone? You need to match the scene’s 3D structure, object types, and motion, but radar also reflects from materials in odd ways and is inherently random. A single, fixed prediction from cameras won’t feel like real radar; you need many plausible outcomes.

What people tried before (and why it fell short):

  • Physics-based simulators (ray tracing, solvers of wave equations) are accurate but heavy and hard to scale; you also need tons of handcrafted assets to cover rare cases.
  • Scene-specific reconstructions (e.g., NeRF-like for radar) need multi-view radar captures and don’t generalize well to new places.
  • Generative models for radar often targeted raw radar tensors (range–Doppler/cubes) or relied on LiDAR; but raw radar isn’t widely available, and deterministic mappings underplay radar’s randomness.
  • Some camera-to-radar attempts missed crucial ingredients: they didn’t model uncertainty, ignored motion cues, or couldn’t leverage large pretrained vision models.

The missing piece (the gap):

  • A scalable, probabilistic method that learns the distribution of radar point clouds conditioned on multi-view cameras, aligns with the scene’s geometry/semantics/motion, and works with point clouds (the practical data format). Crucially, it should be efficient enough for large-scale simulation.

What this paper brings:

  • RadarGen: a diffusion model that works in the latent space of an efficient image backbone (SANA), but treats radar as bird’s-eye-view (BEV) images so it can reuse powerful image tools.
  • It adds BEV-aligned conditioning from foundation models: depth (for where things are), semantics (what they are), and motion (how they move). These guides nudge the diffusion process toward physically plausible radar patterns.
  • After generation, a simple, principled deconvolution step recovers exact sparse radar points (locations) and picks RCS/Doppler values.

Why this matters in daily life:

  • Safer autonomy: You can stress-test perception in rain, fog, and traffic without risky and costly data collection.
  • Faster development: Camera-only datasets can be "upgraded" with radar, widening training material.
  • Flexible editing: If you add, remove, or move a car in the input images, radar updates accordingly—useful for simulation and scene planning.

šŸž Hook: Imagine a school theater where you practice a fire drill. It’s safer to rehearse than start a real fire. Good driving simulators are like safe rehearsals.

🄬 The Concept: Generative simulation creates realistic sensor data (radar included) from controllable inputs, so we can safely train and test. How it works: (1) Start with camera views. (2) Extract 3D structure, labels, and motion. (3) Generate radar maps stochastically. (4) Recover radar point clouds. Why it matters: Without it, testing for rare but dangerous cases (like sudden cut-ins in heavy rain) is slow, costly, or unsafe.

šŸž Anchor: In simulation, you can spawn a truck emerging from fog and see whether the radar-driven detector still catches it in time.

02 Core Idea

šŸž Hook: Imagine you’re baking cookies shaped like cars and trucks. If you have a good map of where the cookie cutters should go and what shapes they should be, your cookies come out right—even if some dough is a bit lumpy.

🄬 The Concept: The key idea is to turn radar generation into image-like BEV maps, guide them with BEV scene cues (depth, semantics, motion), generate them probabilistically with a fast latent diffusion model, then mathematically pick out the exact radar points. How it works: (1) Represent radar as three BEV maps: point density, RCS, and Doppler. (2) Build BEV conditioning from cameras using foundation models: depth (geometry), semantics (object types), and radial velocity (motion). (3) Use a SANA-style latent diffusion transformer to denoise noise into plausible radar maps. (4) Deconvolve the density map to recover point locations and read RCS/Doppler at those spots. Why it matters: Without this pipeline, camera-to-radar either ignores randomness, misses motion/material effects, or becomes too slow/complex to scale.

šŸž Anchor: If you edit the images to swap a far car for a closer truck, RadarGen updates the radar: more returns from the big truck, fewer where it now occludes things, and different Doppler.

Three analogies for the same idea:

  1. Stencil + Spray Paint: The BEV condition maps are stencils (where and what things are). Diffusion is like spraying paint that fills in realistic radar patterns. Deconvolution carefully picks exact dot locations from the sprayed texture.
  2. Orchestra + Conductor: Cameras (depth/semantics/motion) are sections of an orchestra; diffusion is the conductor shaping a coherent performance; deconvolution records the final, distinct notes (points).
  3. Treasure Map + Metal Detector: BEV cues are the treasure map. Diffusion sweeps an area and lights up likely spots. Deconvolution pinpoints the exact treasure locations.

Before vs After:

  • Before: Camera-to-radar methods were often deterministic, missing radar’s natural variability and struggling with scalability or motion/material realism.
  • After: A stochastic, BEV-aligned, image-latent diffusion approach that captures geometry, semantics, and motion together and recovers faithful sparse points.

Why it works (intuition, no equations):

  • Alignment: Putting both conditions and targets in the same BEV grid makes correspondence simple and spatially consistent.
  • Leverage big priors: Using an image-latent diffusion model (SANA) taps into robust, efficient representations for high-dimensional generation.
  • Physical hints: Depth tightens geometry, semantics hint at RCS statistics, and motion guides Doppler—reducing guesswork.
  • Stochasticity: Diffusion produces multiple plausible outcomes, matching radar’s variability (multipath, clutter, occlusions).
  • Invert the blur: The density map is a blurred proxy of points; knowing the blur lets you reverse it and recover clean, sparse locations.

Building blocks (each as a mini ā€œSandwichā€):

  • šŸž Hook: Imagine looking at a city map from above. 🄬 The Concept (BEV Representation): BEV is a top-down map where each pixel is a ground spot. How it works: Project radar points to the ground plane, make three maps—density (smoothed), RCS (nearest-point value), Doppler (nearest-point value). Why it matters: Without BEV, you can’t easily reuse strong image backbones. šŸž Anchor: Parking lot from above: you see clusters (density), shininess (RCS), and toward/away motion (Doppler).
  • šŸž Hook: Picture wearing 3D glasses that also label things. 🄬 The Concept (BEV Conditioning): Use depth (3D), semantics (labels), and motion (radial velocity) from cameras, projected into BEV. How it works: Predict per-pixel depth; segment objects; compute optical flow between frames, backproject to 3D, keep radial velocity; rasterize into BEV maps. Why it matters: Without these guides, the generator has to guess geometry, class, and motion from scratch. šŸž Anchor: A road BEV: colored roads/vegetation/cars plus arrows for movement.
  • šŸž Hook: Think of fog that slowly clears to reveal the scene. 🄬 The Concept (Latent Diffusion with SANA/DiT): Start from noise in a compressed latent space and denoise step-by-step into radar maps, conditioned on BEV cues. How it works: Concatenate radar-map latents (density/RCS/Doppler) and BEV condition latents; a DiT with modality tokens models their joint distribution and correlations. Why it matters: Without diffusion, you lose the natural randomness of radar; without latent compression, it’s too slow. šŸž Anchor: After a few steps the fog lifts to show a realistic radar texture.
  • šŸž Hook: Imagine unblurring a photo when you know the camera’s blur. 🄬 The Concept (Deconvolution Recovery): Convert the smooth density map into sharp point locations using a sparsity-friendly solver (IRL1/FISTA), then sample RCS/Doppler at those spots. How it works: Solve an optimization that prefers few points that, when blurred, match the density. Why it matters: Without this, you either get too many clumps or miss objects. šŸž Anchor: From a misty heatmap to crisp dots where cars actually are.

03 Methodology

At a high level: Multi-view images at times t and t+Δt → foundation models build BEV conditioning (appearance, semantics, radial velocity) → latent diffusion denoises three radar BEV maps (density, RCS, Doppler) → deconvolution recovers sparse radar points and reads out attributes.
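This dataflow can be summarized in code. The sketch below is a hypothetical outline: the placeholder stubs (build_bev_conditioning, sample_radar_maps, recover_points) stand in for the real components and simply return arrays of the right shape; they are not the authors' API. Rough analogues of each stub appear in the step-by-step sketches that follow.

```python
import numpy as np

def build_bev_conditioning(images_t, images_t_dt, grid=512):
    """Placeholder: in RadarGen this comes from depth, segmentation and optical-flow
    foundation models projected into BEV; here we just return zero-filled maps."""
    appearance = np.zeros((grid, grid, 3), np.float32)
    semantics = np.zeros((grid, grid, 3), np.float32)
    radial_velocity = np.zeros((grid, grid), np.float32)
    return appearance, semantics, radial_velocity

def sample_radar_maps(conditioning, num_steps=20, seed=0, grid=512):
    """Placeholder for the conditional latent diffusion sampler (20 steps)."""
    rng = np.random.default_rng(seed)
    density = rng.random((grid, grid)).astype(np.float32) * 0.01
    rcs = rng.normal(0, 10, (grid, grid)).astype(np.float32)
    doppler = rng.normal(0, 5, (grid, grid)).astype(np.float32)
    return density, rcs, doppler

def recover_points(density, top_k=200):
    """Placeholder for the sparse deconvolution step: keep the strongest pixels."""
    flat = np.argsort(density, axis=None)[::-1][:top_k]
    return np.stack(np.unravel_index(flat, density.shape), axis=1)

def generate_radar(images_t, images_t_dt, seed=0):
    cond = build_bev_conditioning(images_t, images_t_dt)
    density, rcs, doppler = sample_radar_maps(cond, seed=seed)
    pts = recover_points(density)
    return pts, rcs[pts[:, 0], pts[:, 1]], doppler[pts[:, 0], pts[:, 1]]

points, rcs_vals, doppler_vals = generate_radar(None, None, seed=1)
print(points.shape, rcs_vals.shape)
```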

Step 1: Build BEV radar targets from real radar (for training)

  • What happens: Convert each radar point cloud into three 2D BEV images: (a) Point Density Map (smooth Gaussian of where points are), (b) RCS Map (per-pixel value from nearest radar point), (c) Doppler Map (per-pixel value from nearest radar point). Each is normalized and duplicated across 3 channels so an image autoencoder can process it.
  • Why this step exists: The diffusion backbone is great at images, not unordered point clouds. Turning radar into images lets us reuse strong, efficient image-latent tools.
  • Example: On a 512Ɨ512 grid covering ±50 m, a car cluster near (x=15 m, y=3 m) appears as a bright blob in the density map; RCS is higher near metallic truck faces; Doppler is positive (toward) on approaching vehicles and negative (away) on receding ones. A minimal rasterization sketch follows this list.
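The sketch below illustrates this rasterization under assumed conventions: a 512Ɨ512 grid over ±50 m, a Gaussian blur of σ = 2 pixels for the density map, and nearest-point filling for the RCS and Doppler maps. The exact conventions in the paper may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, distance_transform_edt

def radar_to_bev_maps(points, rcs, doppler, extent=50.0, grid=512, sigma_px=2.0):
    """Sketch of Step 1: points is (N, 2) with x, y in meters; returns density,
    RCS and Doppler BEV maps of shape (grid, grid)."""
    # Map metric coordinates in ±extent to pixel indices.
    ij = ((points + extent) / (2 * extent) * grid).astype(int)
    ij = np.clip(ij, 0, grid - 1)

    # (a) Density map: accumulate point hits, then smooth with a Gaussian.
    density = np.zeros((grid, grid), np.float32)
    np.add.at(density, (ij[:, 0], ij[:, 1]), 1.0)
    density = gaussian_filter(density, sigma=sigma_px)

    # (b)/(c) RCS and Doppler maps: each pixel takes the value of the nearest point.
    occupied = np.zeros((grid, grid), bool)
    occupied[ij[:, 0], ij[:, 1]] = True
    rcs_img = np.zeros((grid, grid), np.float32)
    dop_img = np.zeros((grid, grid), np.float32)
    rcs_img[ij[:, 0], ij[:, 1]] = rcs
    dop_img[ij[:, 0], ij[:, 1]] = doppler
    # For every pixel, find the indices of the nearest occupied pixel.
    _, (ni, nj) = distance_transform_edt(~occupied, return_indices=True)
    return density, rcs_img[ni, nj], dop_img[ni, nj]

pts = np.array([[15.0, 3.0], [10.0, -8.0]])   # a car cluster and another return
d, r, v = radar_to_bev_maps(pts, rcs=np.array([12.0, 3.0]), doppler=np.array([4.0, -2.0]))
print(d.shape, d.max(), r[d.argmax() // 512, d.argmax() % 512])
```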

Step 2: Build BEV conditioning from cameras

  • What happens: For each camera, predict metric depth, semantic labels (e.g., road, building, car), and optical flow between t and t+Δt. Backproject to 3D with known camera poses, then rasterize into three BEV maps: Appearance (image colors), Semantic (color-coded classes), and Radial Velocity (component of 3D motion toward/away from ego); a sketch of this backprojection follows the list below.
  • Why this step exists: Geometry narrows where radar returns can be; semantics tie to typical reflectivity; motion aligns with Doppler. Without these, the model must discover too much by itself and can drift from physics.
  • Example: A car turning left near the intersection shows a semantic-car region and a positive radial velocity on the front quarter; buildings show near-zero Doppler and stable RCS-like regions.
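Here is a sketch of the motion part of this conditioning, under assumed conventions: a pinhole camera with known intrinsics K, metric depth at both timestamps, and dense optical flow. Pixels are backprojected to 3D, the per-pixel displacement is divided by the time gap, only the radial component is kept, and the result is splatted into a BEV grid. The ground-plane axes and sign convention here are assumptions for illustration.

```python
import numpy as np

def radial_velocity_bev(depth_t, depth_t1, flow, K, dt=0.1, extent=50.0, grid=512):
    """Hypothetical sketch of the motion conditioning: backproject pixels at t and
    their flow-matched pixels at t+dt, keep the radial velocity, splat into BEV."""
    h, w = depth_t.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    def backproject(uu, vv, depth):
        # Pinhole backprojection: pixel + depth -> 3D point in the camera frame.
        x = (uu - K[0, 2]) / K[0, 0] * depth
        y = (vv - K[1, 2]) / K[1, 1] * depth
        return np.stack([x, y, depth], axis=-1)

    p_t = backproject(u, v, depth_t)
    u1 = np.clip(u + flow[..., 0], 0, w - 1).astype(int)
    v1 = np.clip(v + flow[..., 1], 0, h - 1).astype(int)
    p_t1 = backproject(u1, v1, depth_t1[v1, u1])

    velocity = (p_t1 - p_t) / dt                                 # 3D motion per pixel
    radial_dir = p_t / (np.linalg.norm(p_t, axis=-1, keepdims=True) + 1e-6)
    radial = np.sum(velocity * radial_dir, axis=-1)              # sign convention assumed

    # Rasterize into BEV using (x, z) as the ground plane (assumption).
    bev = np.zeros((grid, grid), np.float32)
    gx = np.clip(((p_t[..., 0] + extent) / (2 * extent) * grid).astype(int), 0, grid - 1)
    gz = np.clip((p_t[..., 2] / extent * grid).astype(int), 0, grid - 1)
    bev[gz, gx] = radial
    return bev

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 20.0, np.float32)
flow = np.zeros((480, 640, 2), np.float32)
print(radial_velocity_bev(depth, depth, flow, K).shape)
```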

Step 3: Encode to latent space

  • What happens: Feed each BEV radar map through a frozen, pretrained autoencoder (from SANA) to get compact latents z_p (density), z_r (RCS), z_d (Doppler). Similarly, encode the conditioning maps so they live in a compatible latent space.
  • Why this step exists: Latent diffusion is faster and cheaper than pixel-space diffusion, enabling large-scale training and inference.
  • Example: A 512Ɨ512 BEV image becomes a smaller latent grid (32Ɨ compression), slashing compute while preserving structure (see the toy encoder sketch after this list).
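The toy stand-in below only makes the 32Ɨ spatial compression and the 3-channel duplication concrete; it is not SANA's actual autoencoder, and the channel widths are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the frozen pretrained autoencoder's encoder (not SANA itself)."""
    def __init__(self, latent_channels: int = 32):
        super().__init__()
        # Five stride-2 convolutions give a 2^5 = 32x spatial downsampling factor.
        chans = [3, 32, 64, 128, 256, latent_channels]
        self.net = nn.Sequential(*[
            nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1)
            for i in range(5)
        ])

    @torch.no_grad()  # frozen: no gradients flow into the autoencoder
    def forward(self, bev_map: torch.Tensor) -> torch.Tensor:
        return self.net(bev_map)

encoder = ToyEncoder().eval()
density_bev = torch.rand(1, 1, 512, 512).repeat(1, 3, 1, 1)  # duplicate the map to 3 channels
z_p = encoder(density_bev)
print(z_p.shape)  # torch.Size([1, 32, 16, 16]); 512 / 32 = 16
```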

Step 4: Conditional denoising with a Diffusion Transformer (DiT)

  • What happens: During training, add noise to the target radar latents and train a DiT to predict and remove it, conditioned on the BEV maps. The three radar latents are processed jointly with shared attention so the model learns cross-channel relations (e.g., fast-moving cars tend to have certain Doppler patterns and RCS silhouettes). Modality indicators (learnable embeddings) help the transformer respect each map’s statistics while allowing cross-talk.
  • Why this step exists: Diffusion models learn the whole distribution of plausible radar patterns, not just one "best guess," matching radar's randomness and clutter (a toy training step is sketched after this list).
  • Example: In one denoising run, a distant car may yield a few strong points; in another, slightly different multipath-like speckle appears—both realistic.
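A simplified training-step sketch follows. The toy model is a convolutional stand-in for the DiT, and the noising schedule, modality embedding, and conditioning dropout are crude approximations of what the paper describes; the channel layout (three radar maps Ɨ 32 latent channels) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiT(nn.Module):
    """Toy stand-in for the diffusion transformer: it sees the noisy radar latents
    together with the BEV condition latents and predicts the added noise."""
    def __init__(self, radar_ch=96, cond_ch=96, dim=256):
        super().__init__()
        self.in_proj = nn.Conv2d(radar_ch + cond_ch, dim, 1)
        self.modality_embed = nn.Parameter(torch.zeros(1, dim, 1, 1))  # crude modality token
        self.blocks = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                    nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
        self.out_proj = nn.Conv2d(dim, radar_ch, 1)

    def forward(self, noisy_radar, cond, t):
        h = self.in_proj(torch.cat([noisy_radar, cond], dim=1)) + self.modality_embed
        h = h + t.view(-1, 1, 1, 1)          # crude timestep conditioning
        return self.out_proj(self.blocks(h))

def training_step(model, radar_latents, cond_latents, cond_dropout=0.1):
    """One noise-prediction step, conditioned on BEV latents."""
    b = radar_latents.shape[0]
    t = torch.rand(b)                                   # continuous noise level in [0, 1)
    noise = torch.randn_like(radar_latents)
    alpha = (1 - t).view(-1, 1, 1, 1)
    noisy = alpha.sqrt() * radar_latents + (1 - alpha).sqrt() * noise
    if torch.rand(()) < cond_dropout:                   # conditioning dropout (10% in the paper)
        cond_latents = torch.zeros_like(cond_latents)
    pred = model(noisy, cond_latents, t)
    return F.mse_loss(pred, noise)

model = TinyDiT()
radar_z = torch.randn(2, 96, 16, 16)   # z_p, z_r, z_d concatenated channel-wise (assumed layout)
cond_z = torch.randn(2, 96, 16, 16)
loss = training_step(model, radar_z, cond_z)
loss.backward()
print(float(loss))
```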

Step 5: Decode and recover sparse points

  • What happens: Decode the denoised latents back to the three BEV images. Then, recover point locations from the density map using non-negative, L1-regularized deconvolution (IRL1 with FISTA): find a sparse set of pixels that, when blurred with the known Gaussian kernel, reproduce the predicted density. Finally, at each recovered pixel, read RCS and Doppler from their maps.
  • Why this step exists: The diffusion outputs are smooth textures; radar is sparse points. Deconvolution bridges the two, producing crisp detections while preserving distribution quality.
  • Example: With σ ā‰ˆ 2 pixels for the Gaussian blur, the solver finds compact peaks at car positions; increasing λ yields sparser solutions; a σ that is too large blurs structure, harming recovery (see the deconvolution sketch after this list).
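The sketch below implements the recovery idea with a plain FISTA solver for the non-negative, L1-regularized deconvolution problem; the paper additionally applies iterative reweighting (IRL1). The blur width, λ, and iteration count here are assumed values for a toy example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fista_deconvolve(density, sigma=2.0, lam=0.01, n_iter=200):
    """Find a non-negative, sparse point image x whose Gaussian blur matches the
    predicted density map: minimize 0.5 * ||K x - d||^2 + lam * ||x||_1, x >= 0."""
    blur = lambda img: gaussian_filter(img, sigma)   # K and K^T coincide (symmetric kernel)
    x = np.zeros_like(density)
    y, t = x.copy(), 1.0
    step = 1.0                                       # normalized Gaussian has spectral norm <= 1
    for _ in range(n_iter):
        grad = blur(blur(y) - density)               # gradient of the data-fit term
        x_new = np.maximum(y - step * grad - step * lam, 0.0)   # prox: soft-threshold, then clamp >= 0
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)    # Nesterov momentum
        x, t = x_new, t_new
    return x                                         # nonzero pixels are the recovered radar points

# Toy check: blur two spikes, then recover their locations.
truth = np.zeros((128, 128), np.float32)
truth[40, 40] = truth[90, 60] = 1.0
recovered = fista_deconvolve(gaussian_filter(truth, 2.0), sigma=2.0, lam=0.005)
print(np.argwhere(recovered > 0.1 * recovered.max()))
```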

Step 6: Training and inference details

  • What happens: Train on MAN TruckScenes clips (daytime) for ~2 days on 8ƗL40 GPUs. Use conditioning dropout (10%) to make the model robust. During inference, 20 diffusion steps with a fixed seed produce one sample; change the seed for diverse outcomes (a seeded-sampling sketch follows this list). Whole pipeline takes ~10.5 s/frame: ~9 s BEV conditioning extraction, ~1 s diffusion, ~0.5 s deconvolution.
  • Why this step exists: Practicality—this must be fast enough for scale and flexible enough for diverse sampling.
  • Example: Swapping an edited image (e.g., remove a car) instantly changes the BEV semantics/geometry; the regenerated radar shows fewer returns and updated occlusions.
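A small sketch of seeded sampling: fixing the seed reproduces a sample, changing it yields a different but equally plausible one. The denoiser interface and the Euler-style update are placeholders, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def sample_radar_latents(denoiser, cond_latents, num_steps=20, seed=0):
    """Hypothetical inference sketch: 20 denoising steps starting from seeded noise."""
    g = torch.Generator().manual_seed(seed)
    z = torch.randn(cond_latents.shape, generator=g)          # the seed fixes the starting noise
    for i in range(num_steps, 0, -1):
        t = torch.full((z.shape[0],), i / num_steps)
        eps = denoiser(z, cond_latents, t)                    # predicted noise
        z = z - eps / num_steps                               # crude Euler-style update
    return z

# With a dummy denoiser, identical seeds match and different seeds differ.
dummy = lambda z, c, t: 0.1 * z
cond = torch.zeros(1, 96, 16, 16)
a = sample_radar_latents(dummy, cond, seed=7)
b = sample_radar_latents(dummy, cond, seed=7)
c = sample_radar_latents(dummy, cond, seed=8)
print(torch.allclose(a, b), torch.allclose(a, c))  # True False
```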

The secret sauce:

  • Unified BEV alignment: Conditions and targets share a grid. The model doesn’t struggle to guess correspondences.
  • Multi-map joint modeling: Density, RCS, and Doppler are generated together, capturing their correlations.
  • Principled recovery: Knowing the blur kernel turns point recovery into a well-posed inverse problem, avoiding ad-hoc peak-picking.
  • Efficient backbone: SANA’s latent diffusion and linear attention keep the method scalable.

04 Experiments & Results

šŸž Hook: Imagine grading a test where answers must be correct, clear, and written in the right spot. Radar generation is similar—you must put the right number of dots in the right places with the right attributes.

🄬 The Concept: Evaluation covers three areas: (1) geometric fidelity (are points in the right places, and in the right numbers?), (2) radar attribute fidelity (do RCS and Doppler match?), and (3) distribution similarity (do the stats across scenes look like the real world?). How it works: Use metrics like Chamfer Distance, IoU@1m, density similarity, hit rate, DA (distance-attribute) scores, and MMD. Why it matters: Without broad, careful metrics, you can't tell if a model is just drawing "pretty patterns" or really matching radar behavior.

šŸž Anchor: Getting 87% on a hard test is great, but getting that with neat handwriting and correct labels is even better—that’s like scoring well across all metric families.

The dataset and baseline:

  • Dataset: MAN TruckScenes (multi-view cameras + radar). Train on day clips; evaluate on annotated frames with boxes for cars/trucks/trailers.
  • Baseline: A strong multi-view RGB2Point model adapted to output location+RCS+Doppler (432M params), comparable in size to RadarGen (~592M params).

Scoreboard (highlights, lower is better unless noted):

  • Entire area:
    • CD Loc.: 1.68 vs 1.84 (better spatial alignment).
    • IoU@1m: 0.31 vs 0.23 (more overlaps within 1 m—think going from a B- to a solid B+/A- in placement).
    • DA F1: 0.24 vs 0.14 (much better joint location+attribute matching—like doubling the number of answers both correct and neatly placed).
    • MMD (Location/RCS/Doppler): RadarGen shows markedly lower values, indicating distributions closer to real data.
  • Foreground objects (inside boxes):
    • CD Loc.: 0.95 vs 1.32 (tighter shapes).
    • Density Similarity: 0.51 vs 0.35 (closer point counts per object).
    • Hit Rate: 0.66 vs 0.37 (more objects contain at least one generated point—detectors have something to latch onto).
    • Per-class MMDs (Cars/Trucks/Trailers): consistently improved distributions for location, RCS, and Doppler.

Compatibility with downstream detectors:

  • A VoxelNeXt detector trained on real radar gets NDS 0.48 on real data. On RadarGen outputs, it gets NDS 0.30 (usable), while on the baseline’s outputs it’s near zero (not usable). That’s like the detector still recognizing a lot of objects on our synthetic radar, but struggling on the baseline.

Ablations and insights:

  • BEV conditioning matters: Removing the semantic map hurts the most (RCS distribution degrades—the model loses class/material cues). Removing appearance or velocity hurts Doppler, showing motion/texture cues guide speed patterns.
  • Direct multi-view image conditioning (without BEV) needed 9 days of training (3Ɨ slower) and gave worse geometric fidelity than BEV conditioning, though attribute MMD over the entire area improved slightly. BEV conditioning is thus the better compute-accuracy trade-off.
  • Deconvolution vs alternatives: Deconvolution consistently yields the best point recovery across blur sizes. Random sampling leads to spotty coverage; peak-picking is too sparse; deconvolution balances coverage and sharpness.
  • Blur size σ trade-off: Larger σ helps the autoencoder reconstruct maps but can over-smooth, hurting point recovery. σ ā‰ˆ 2 balances both.

Surprising findings:

  • Semantics punch above their weight: Even with appearance present, losing semantics still hurts distributions—class priors really guide RCS and even Doppler profiles.
  • The detector likes our points but still prefers real radar: Despite strong hit rate and distributions, NDS trails real data, hinting at subtle, learned radar quirks that remain to be captured (e.g., fine multipath patterns or timing).

Bottom line: Across geometry, attributes, and distributions, RadarGen beats a strong baseline and yields synthetic radar that downstream detectors can use, marking a meaningful step toward camera-driven, multimodal simulation.

05 Discussion & Limitations

šŸž Hook: Imagine a super-talented art student who can sketch great scenes but struggles in the dark or when given a blurry reference photo.

🄬 The Concept: RadarGen is powerful but not perfect. How it works: It leverages vision foundation models for geometry/semantics/motion; if those struggle (e.g., at night), radar generation suffers. It infers points even in unseen areas (occlusions), which is good for filling gaps but can hallucinate. Why it matters: Knowing when and why it fails helps you use it safely and improve it.

šŸž Anchor: In very low light, mislabeling a truck as a building can ripple into wrong RCS/Doppler patterns.

Limitations:

  • Dependence on upstream cues: Depth, segmentation, and flow errors (e.g., low light, severe glare) propagate into radar maps.
  • Occluded regions: The model may extrapolate behind obstacles; sometimes helpful, sometimes hallucinatory.
  • Subtle radar quirks: Even with good stats, a detector still prefers real radar (NDS gap), suggesting fine details (e.g., sensor-specific noise) aren’t fully matched.
  • Data coverage: Trained on a single dataset (daytime focus), may generalize less to different cities, sensors, or weather extremes.
  • Latency: ~10.5 s/frame on an L40 might be heavy for real-time needs, though fine for offline simulation.

Required resources:

  • Multi-view, calibrated cameras with known intrinsics/extrinsics; two consecutive frames (t, t+Δt).
  • Foundation models: depth (e.g., UniDepthV2), segmentation (e.g., Mask2Former), flow (e.g., UniFlow).
  • GPU for training/inference; storage to cache BEV conditioning images.

When not to use:

  • Real-time embedded systems with tight latency budgets.
  • Night-time or reflective/glare-heavy scenarios unless foundation models are robustly fine-tuned there.
  • Settings lacking good camera calibration (BEV alignment breaks).

Open questions:

  • Temporal extension: How to model radar across video sequences with temporal diffusion for even better Doppler consistency?
  • Broader conditioning: Can text or map priors guide large-scale scenario authoring ("add a silver truck merging from the right")?
  • Sensor diversity: How to adapt across different radar hardware and processing chains without retraining from scratch?
  • Learning recovery: Replace hand-designed deconvolution with a learned, uncertainty-aware point sampler?
  • Detector co-training: Jointly train generators and detectors for tighter alignment with downstream tasks.

06 Conclusion & Future Work

Three-sentence summary: RadarGen generates realistic automotive radar point clouds from multi-view cameras by turning radar into BEV images, guiding them with BEV-aligned depth/semantics/motion, and using an efficient latent diffusion model. A principled deconvolution step recovers sparse point locations and extracts RCS/Doppler, producing outputs that match real radar statistics and work with detectors. Experiments show strong gains over a capable baseline and enable camera-driven scene editing for multimodal simulation.

Main achievement: The first scalable, probabilistic camera-to-radar point cloud generator that jointly models location, RCS, and Doppler via BEV-conditioned latent diffusion and recovers precise sparse points through deconvolution.

Future directions:

  • Temporal/video diffusion for smoother Doppler over time.
  • Text/map-conditioned generation for fast scenario authoring.
  • Training across datasets and radar configurations; domain adaptation.
  • Learned, uncertainty-aware recovery and tighter detector integration.

Why remember this: It shows how to bridge camera-rich worlds and radar’s unique physics by aligning everything in BEV, leveraging powerful image-latent diffusion, and finishing with a clean inverse step. This combination yields practical, controllable, and scalable radar simulation—key for safer, faster autonomous driving development.

Practical Applications

  • Augment camera-only datasets with realistic radar to improve multimodal training.
  • Stress-test detectors in fog/rain or rare traffic patterns by editing images and regenerating radar.
  • Generate multiple plausible radar outcomes (seeds) to evaluate robustness under stochastic sensing.
  • Pre-validate autonomous-driving stacks in simulation before costly road tests.
  • Teach students/engineers radar concepts by visualizing BEV density, RCS, and Doppler maps.
  • Perform data balancing by inserting underrepresented vehicle classes (e.g., trailers) and regenerating radar.
  • Benchmark perception under occlusions by adding/removing objects and observing radar changes.
  • Prototype new sensor configurations by adapting BEV ranges/resolution and analyzing resulting radar.
  • Support domain adaptation research by training on mixed real + synthetic radar point clouds.
#automotive radar · #radar point cloud generation · #latent diffusion · #bird's-eye view (BEV) · #radar cross section (RCS) · #Doppler velocity · #foundation models · #semantic segmentation · #optical flow · #depth estimation · #deconvolution · #IRL1 FISTA · #SANA DiT · #multimodal simulation · #autonomous driving