MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Intermediate
Baorui Ma, Jiahui Yang, Donglin Di et al. Ā· 1/29/2026
arXiv Ā· PDF

Key Summary

  • Metric Anything is a new way to teach AI real, ruler-like distances (metric depth) from very mixed and noisy 3D data.
  • Its key trick is a Sparse Metric Prompt: give the model just a sprinkle of true depth pixels and let it fill in the rest.
  • This prompt acts like a universal adapter that separates ā€œwhat space looks likeā€ from camera and sensor quirks.
  • They pretrain on about 20 million image–depth pairs from rebuilt scenes, real sensors like LiDAR, and clean renderings.
  • A distilled, prompt-free student model learns from the teacher’s high-quality predictions and works with just RGB images.
  • Together, they set or match state-of-the-art on many tasks: monocular depth, depth completion, super-resolution, radar–camera fusion, camera intrinsics recovery, multi-view 3D, VLA planning, and MLLM spatial reasoning.
  • For radar–camera fusion, fine-tuning the pretrained model nearly halves error compared to training from scratch.
  • They reveal a clear scaling law: more diverse data steadily improves metric depth—something past works struggled to show.
  • The pretrained ViT encoder also boosts video-language models on 3D reasoning (like object size, distances, and route planning).
  • This points to a scalable, general-purpose path for real-world 3D perception without hand-crafted, task-specific tricks.

Why This Research Matters

Real-world machines need real measurements. Metric Anything shows a simple path to train models that understand true meters from messy, mixed data at massive scale. That makes self-driving safer in rain or night, robots better at grasping and placing, and AR apps more accurate at measuring rooms and objects. It also gives video-language models a strong spatial backbone, improving answers about sizes, distances, and routes. By proving that metric depth can obey scaling laws, this work turns a stubborn problem into a solvable, general foundation for everyday 3D tasks.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how you can tell if a table is close enough to reach by stretching out your arm? Your eyes and brain team up to judge distances so you can act safely.

🄬 The Concept (Depth Estimation): It’s how computers guess how far things are in a picture.

  • How it works: (1) Look at the image, (2) learn patterns that mean near or far (like big/small and sharp/blurry), (3) assign a distance to each pixel.
  • Why it matters: Without it, robots, cars, and AR apps wouldn’t know how far objects are and could bump, miss, or misplace things. šŸž Anchor: A robot needs to know if a cup is 0.3 meters away so it can grasp it without knocking it over.

šŸž Hook: Imagine seeing a 2D photo but wanting to build a LEGO model of the whole room from it.

🄬 The Concept (3D Perception): It’s teaching computers to understand the 3D world from 2D pictures.

  • How it works: (1) Find shapes and edges, (2) estimate depths, (3) form a 3D map.
  • Why it matters: Without 3D, machines can’t plan safe paths or fit objects together. šŸž Anchor: A cleaning robot figures out the 3D space of a living room so it doesn’t get stuck under the couch.

šŸž Hook: Think of a color-by-number picture—each spot has a number telling you what color to fill in.

🄬 The Concept (Depth Map): It’s an image where each pixel stores a distance number instead of a color.

  • How it works: (1) For every pixel, write its distance, (2) assemble all pixels into a depth picture, (3) use masks to mark missing data.
  • Why it matters: Without depth maps, you can’t tell how far each tiny part of a scene is. šŸž Anchor: A depth map of a desk scene might say the keyboard is 0.5 m away and the monitor is 0.8 m away.
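To make this concrete, here is a minimal NumPy sketch (our own illustration, not from the paper) of a tiny depth map in meters with a validity mask marking pixels the sensor missed:

```python
import numpy as np

# A tiny 3x4 depth map in meters; NaN marks pixels the sensor could not measure.
depth = np.array([
    [0.50, 0.52, 0.80, 0.82],
    [0.51, np.nan, 0.81, 0.83],
    [1.20, 1.25, np.nan, 1.30],
])

valid_mask = ~np.isnan(depth)          # mask M: True where a real distance was recorded

print("mean valid depth (m):", depth[valid_mask].mean())
print("fraction of valid pixels:", valid_mask.mean())
```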

šŸž Hook: Guessing a classmate’s height from just one selfie is tricky—but sometimes you can do it if you’ve seen enough selfies.

🄬 The Concept (Monocular Depth Estimation): Predicting depth from a single RGB image.

  • How it works: (1) Learn visual cues from many images, (2) predict depth per pixel, (3) adjust using learned patterns.
  • Why it matters: Most devices only have one camera; no extra sensors. šŸž Anchor: Your phone estimates how far a face is to blur the background nicely in portrait mode.

šŸž Hook: There’s a difference between saying ā€œAlice is taller than Bobā€ and saying ā€œAlice is 1.55 meters tall.ā€

🄬 The Concept (Metric Depth Estimation): Predicting real-world distances in units like meters.

  • How it works: (1) Learn absolute scale, (2) correct for camera settings, (3) map pixels to real meters.
  • Why it matters: Without true meters, robots can’t safely grasp, drive, or plan in the real world. šŸž Anchor: A self-driving car needs the truck ahead to be 32.4 m away, not just ā€œfarther than the tree.ā€

šŸž Hook: Imagine baking cookies with thermometers that all read a little different—one too hot, one too cold.

🄬 The Concept (Heterogeneous Sensor Noise): Different sensors (LiDAR, RGB-D, stereo) have different errors and gaps.

  • How it works: (1) Each device adds its own noise, (2) different cameras have unique settings, (3) fusing them is messy.
  • Why it matters: If models learn the noise instead of the truth, predictions fail when the sensor changes. šŸž Anchor: A cheap depth camera might miss shiny metal, while LiDAR might skip thin chair legs.

šŸž Hook: Wearing someone else's glasses makes everything look warped.

🄬 The Concept (Camera Intrinsics): The camera’s ā€œglassesā€ā€”like focal length—that change how big or small things look.

  • How it works: (1) The lens focuses light, (2) focal length affects scale, (3) intrinsics decide how pixels map to rays.
  • Why it matters: If you ignore intrinsics, meters become muddled, and distances look wrong. šŸž Anchor: A wide-angle camera makes rooms look huge; if you don’t fix that, your distances are off.
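To see why intrinsics matter for meters, here is a small pinhole-camera sketch (our own example with made-up focal length and principal point, not values from the paper) that turns a pixel plus a metric depth into a 3D point:

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) at metric depth -> 3D point (X, Y, Z) in the camera frame."""
    x = (u - cx) / fx                   # ray direction, x component (depends on focal length)
    y = (v - cy) / fy                   # ray direction, y component
    return np.array([x * depth_m, y * depth_m, depth_m])

# Hypothetical intrinsics for a 640x480 camera.
fx, fy, cx, cy = 525.0, 525.0, 320.0, 240.0
print(backproject(u=400, v=300, depth_m=2.0, fx=fx, fy=fy, cx=cx, cy=cy))
# With a wider lens (smaller fx, fy), the same pixel maps to a ray farther from the
# optical axis, so ignoring intrinsics scrambles the metric geometry.
```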

The World Before: AI got very good at relative depth (who is nearer/farther) using huge, mixed datasets. But metric depth lagged because mixing many 3D sources introduced sensor noise, camera biases, and mismatches, stopping the ā€œmore data helpsā€ trend.

The Problem: How do we make a single model learn true meters from many noisy, biased, and mismatched sources and cameras?

Failed Attempts: Hand-crafted prompts, camera-specific models, and complex pipelines improved narrow tasks (like depth completion) but didn’t scale or generalize.

The Gap: A universal, simple interface that lets the model learn geometry while ignoring where the hint came from (which sensor/camera), so scaling with big, messy data finally works.

Real Stakes: Safer self-driving in rain or night, better robot grasping, smarter AR measuring, and stronger video-language models that understand space, sizes, and routes.

02 Core Idea

šŸž Hook: You know how a treasure map with just a few X marks can still help you find the whole path?

🄬 The Concept (Sparse Metric Prompt): Give the model a sprinkling of trusted depth points, then let it complete the full depth map in meters.

  • How it works: (1) Randomly pick a small set of valid depth pixels, (2) feed them plus the RGB image, (3) train the model to fill in the rest accurately, even when prompts are noisy.
  • Why it matters: This tiny hint acts like a universal adapter that hides sensor quirks and camera biases from the main learning. šŸž Anchor: From 2,000–40,000 sparse points (about 1%) and one photo, the model paints a full, meter-true depth map.
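Here is a minimal sketch of that sampling idea, assuming NumPy and made-up array sizes (the real pipeline samples from ground-truth depth across many sources):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 480, 640
gt_depth = rng.uniform(0.3, 10.0, size=(H, W))     # stand-in for a ground-truth metric depth map
valid = rng.random((H, W)) > 0.2                   # stand-in for the sensor's valid mask

# Sample a sprinkle of trusted points (about 1% of the image).
ys, xs = np.nonzero(valid)
pick = rng.choice(len(ys), size=3000, replace=False)

prompt_depth = np.zeros((H, W), dtype=np.float32)  # the Sparse Metric Prompt
prompt_mask = np.zeros((H, W), dtype=bool)
prompt_depth[ys[pick], xs[pick]] = gt_depth[ys[pick], xs[pick]]
prompt_mask[ys[pick], xs[pick]] = True

print("prompt coverage:", round(prompt_mask.mean() * 100, 2), "% of pixels carry a true distance")
```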

šŸž Hook: If you can solve puzzles from many different puzzle brands, you learn the picture, not the cardboard.

🄬 The Concept (Decoupling Spatial Reasoning from Sensor Bias): The prompt is the standard doorway for all sensors, so the backbone learns geometry, not device weirdness.

  • How it works: (1) Convert many sources to sparse prompts, (2) a single conditioning head reads them, (3) the shared backbone focuses on structure and scale.
  • Why it matters: Mixing 20M samples across 10k+ cameras becomes helpful instead of harmful. šŸž Anchor: Whether the hint came from LiDAR, SLAM, or rendering, the backbone treats it the same and learns meters.

šŸž Hook: Practicing with lots of books makes you smarter—if you’re reading the words, not the smudges.

🄬 The Concept (Scaling Laws for Metric Depth): With the right interface (prompts), bigger, more diverse data steadily improves meter-accurate depth.

  • How it works: (1) Aggregate reconstructed, captured, and rendered data, (2) pretrain with sparse prompts, (3) watch accuracy climb as data grows.
  • Why it matters: Past works didn’t see clear scaling; this paper shows it is possible for metric depth. šŸž Anchor: On Middlebury zero-shot, Ī“1 accuracy rises as training data grows from 2M to 20M pairs.

šŸž Hook: After a coach trains the team with drills, a player can perform alone in a game.

🄬 The Concept (Teacher–Student Distillation): The prompt-trained teacher makes great full-depth labels; a student learns to predict meters from just RGB.

  • How it works: (1) Teacher densifies noisy prompts into strong pseudo-labels, (2) train a student without prompts, (3) adjust loss and architecture for long distances and sharp details.
  • Why it matters: You get a fast, practical model that works anywhere without extra inputs. šŸž Anchor: The student hits state-of-the-art on monocular depth across indoor, outdoor, and even synthetic scenes.

Before vs After:

  • Before: Task-specific hacks, small curated data, poor transfer; mixing sources often hurt.
  • After: One simple prompt interface + huge mixed data + distillation = robust meters, many tasks, clear scaling.

Why It Works (intuition): The sparse prompt is a small, trustworthy anchor in a noisy sea. The backbone learns geometry that’s consistent across cameras and sensors, so adding more diverse data teaches more cases instead of confusing the model.

Building Blocks:

  • Data: ~20M image–depth pairs from reconstruction, sensors, and renders.
  • Prompt: Randomly sampled valid depth points, regularized into a neat 3-channel input.
  • Model: ViT encoder with a light conditioned head (no exotic, task-specific designs).
  • Losses: Robust depth losses for noise and long-range supervision.
  • Distillation: Convert teacher outputs to broad, high-quality pseudo-labels; train a prompt-free student with an inverse-depth, distance-balanced loss and deeper semantic skip connections.

03 Methodology

At a high level: Image + Sparse Metric Prompt → Prompt Preparation + Injection → ViT Backbone + DPT Head → Dense Metric Depth. Then: Teacher’s dense predictions → Pseudo-labels → Train Student (RGB-only) → Prompt-free metric depth.

Step 1: Multi-source data collection

  • What happens: Gather reconstructed (SfM/MVS/SLAM), captured (LiDAR/ToF/RGB-D), and rendered depth. Standardize to per-pixel metric depth G with a valid mask M.
  • Why this exists: We need lots of varied meters to learn meters; each source covers different cases (fine detail, long range, noise-free structure).
  • Example: A LiDAR frame projected onto the camera gives sparse but true distances; a renderer gives perfect, sharp depth everywhere.

Step 2: Make Sparse Metric Prompts

  • What happens: Randomly sample N valid pixels (2k–40k) from G to form P = {(x, y, d)} and a prompt mask. Then create a neat 3-channel prompt using: (a) a depth prior map Pd from an off-the-shelf model, (b) Pixel-wise Depth Scale Alignment and Global Metric Depth Recovery to align and fill, and (c) the mask.
  • Why this exists: Real prompts are messy and irregular; turning them into a uniform HƗWƗ3 input avoids custom networks for every prompt shape.
  • Example: On an indoor scene, 3,000 sampled points plus Pd produce a filled, aligned prompt map that the network can read easily.
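The snippet below is a simplified sketch of this preparation step: it aligns a relative depth prior to the sparse metric hints with a single global scale and shift (the paper's Pixel-wise Depth Scale Alignment and Global Metric Depth Recovery are more elaborate) and stacks the result into the neat HƗWƗ3 input. All arrays and values here are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 480, 640

prior_depth = rng.uniform(0.0, 1.0, size=(H, W)).astype(np.float32)   # relative prior Pd from an off-the-shelf model
prompt_depth = np.zeros((H, W), dtype=np.float32)
prompt_mask = np.zeros((H, W), dtype=bool)
ys, xs = rng.integers(0, H, 3000), rng.integers(0, W, 3000)
prompt_depth[ys, xs] = 2.0 * prior_depth[ys, xs] + 0.5                 # pretend metric hints at sampled pixels
prompt_mask[ys, xs] = True

# Fit one global scale/shift so the prior agrees with the metric hints (the paper does this per pixel).
p, d = prior_depth[prompt_mask], prompt_depth[prompt_mask]
A = np.stack([p, np.ones_like(p)], axis=1)
(scale, shift), *_ = np.linalg.lstsq(A, d, rcond=None)
filled_prior = scale * prior_depth + shift                             # dense, roughly metric fill

# Uniform H x W x 3 prompt input: aligned prior, raw sparse hints, and the prompt mask.
prompt_input = np.stack([filled_prior, prompt_depth, prompt_mask.astype(np.float32)], axis=-1)
print(prompt_input.shape, "recovered scale/shift:", round(float(scale), 2), round(float(shift), 2))
```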

šŸž Hook: Like whispering a hint to a singer but keeping the main melody strong.

🄬 The Concept (Prompt Injection): A light conditioning path feeds the prompt into the decoder (DPT head) while the ViT backbone stays general.

  • How it works: (1) Feed RGB into ViT, (2) feed prompt into a conditioned head, (3) fuse features near the output.
  • Why it matters: The backbone learns universal geometry; the prompt helps steer and denoise without bloating the model. šŸž Anchor: Adding the prompt branch increases parameters by about 5% yet improves zero-shot completion and super-resolution.
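A toy PyTorch sketch of the injection idea follows; the convolutions stand in for the real ViT backbone and DPT head, and all sizes are made up. The point is only the structure: a light prompt branch whose features are fused with the image features near the output.

```python
import torch
import torch.nn as nn

class ToyPromptConditionedHead(nn.Module):
    """Illustration only (not the paper's architecture): an image branch standing in for
    ViT + DPT features, a light conditioning branch for the 3-channel prompt, and a late
    fusion right before the dense depth prediction."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.prompt_branch = nn.Sequential(            # small: the backbone stays general
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(feat_dim, 1, 1)          # dense metric depth in meters

    def forward(self, rgb, prompt):
        fused = self.image_branch(rgb) + self.prompt_branch(prompt)   # late fusion near the output
        return self.head(fused).squeeze(1)

model = ToyPromptConditionedHead()
rgb, prompt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)      # prompt = [aligned prior, hints, mask]
print(model(rgb, prompt).shape)                                        # torch.Size([2, 64, 64])
```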

Step 3: Train the teacher (pretraining)

  • What happens: Optimize with robust depth losses. For synthetic data, use MAE + scale-and-shift-invariant gradient losses; for real data, drop the top 20% noisiest pixels per image to avoid overfitting to bad labels.
  • Why this exists: Real labels are noisy; robust losses help the model learn the true geometry, not the glitches.
  • Example: If LiDAR misses shiny poles, those pixels often fall into the dropped region, protecting learning.
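As a hedged sketch of the ā€œignore the worst labelsā€ idea, the function below keeps only the 80% of valid pixels with the smallest error; the paper's full loss mix (MAE plus scale-and-shift-invariant gradient terms on synthetic data) is not reproduced here.

```python
import torch

def trimmed_l1_loss(pred, target, valid_mask, keep_ratio=0.8):
    """Keep the keep_ratio fraction of valid pixels with the smallest absolute error,
    so a handful of badly labeled pixels cannot dominate training."""
    err = (pred - target).abs()[valid_mask]
    k = max(1, int(keep_ratio * err.numel()))
    kept, _ = torch.topk(err, k, largest=False)        # drop the noisiest 20%
    return kept.mean()

pred = torch.rand(1, 64, 64) * 10.0
target = pred + 0.05 * torch.randn_like(pred)
target[0, 10:20, 10:20] += 5.0                         # simulate a patch of corrupted labels (e.g., a shiny pole)
mask = torch.ones_like(pred, dtype=torch.bool)
print(float(trimmed_l1_loss(pred, target, mask)))      # small: the corrupted patch falls in the trimmed 20%
```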

Step 4: Distill to a prompt-free student

  • What happens: Run the teacher to create dense, high-quality pseudo depth for all images; train a student that takes only RGB.
  • Why this exists: You want a simple, deployable model that doesn’t need prompts or extra sensors at test time.
  • Example: The teacher turns a sparse prompt into a dense map covering near and far distances; the student learns to reproduce this from just RGB.
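Below is a minimal PyTorch sketch of the distillation loop under heavy simplification: tiny stand-in networks replace the real teacher (prompt-conditioned) and student (RGB-only), and a plain L1 loss replaces the paper's distance-balanced objective.

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 1))  # RGB + 3-channel prompt
student = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 1))  # RGB only
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

rgb = torch.rand(4, 3, 64, 64)
prompt = torch.rand(4, 3, 64, 64)

# 1) The teacher densifies the noisy sparse prompt into a full pseudo depth map (no gradients).
with torch.no_grad():
    pseudo_depth = teacher(torch.cat([rgb, prompt], dim=1))

# 2) The student learns to reproduce it from RGB alone; at test time no prompt or sensor is needed.
loss = (student(rgb) - pseudo_depth).abs().mean()
loss.backward()
opt.step()
print("distillation loss:", float(loss))
```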

šŸž Hook: Measuring near pebbles and far mountains needs a ruler that works at all scales.

🄬 The Concept (Distance-balanced inverse-depth loss): A reweighting in log-space keeps supervision strong both nearby and far away.

  • How it works: D_log = 1 āˆ’ ln(x)/ln(C). Tuning C (e.g., 400) balances attention between close and distant depths.
  • Why it matters: Plain inverse-depth can ignore far ranges; plain depth can blur near details. This keeps both sharp. šŸž Anchor: On DIODE, this loss stays competitive at 0–10 m and clearly wins beyond 40 m versus standard losses.
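The remapping above can be written down directly; the sketch below applies an L1 penalty in the remapped space, which is our guess at a minimal form of the loss (the paper's exact weighting and combination with other terms may differ).

```python
import torch

def d_log(depth_m, C=400.0):
    """D_log = 1 - ln(x) / ln(C): depth x = 1 m maps to 1, the far cap x = C maps to 0,
    so near and far ranges occupy comparable numeric ranges."""
    return 1.0 - torch.log(depth_m) / torch.log(torch.tensor(C))

def distance_balanced_l1(pred_depth, gt_depth, C=400.0):
    pred = pred_depth.clamp(min=1e-3)                  # avoid log(0)
    gt = gt_depth.clamp(min=1e-3)
    return (d_log(pred, C) - d_log(gt, C)).abs().mean()

near = distance_balanced_l1(torch.tensor([2.0]), torch.tensor([2.2]))
far = distance_balanced_l1(torch.tensor([80.0]), torch.tensor([88.0]))
print(float(near), float(far))   # the same 10% relative error is penalized equally near and far
```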

šŸž Hook: Give your best players the ball near the goal, not only at midfield.

🄬 The Concept (Inverse Skip-Connections): Inject deeper, semantic ViT features into later decoder layers, while shallow features stay earlier.

  • How it works: (1) Keep multi-scale fusion, (2) swap skip emphasis to highlight high-level cues at the final prediction.
  • Why it matters: The teacher’s labels are clean and wide-range; this design sharpens structure without overfitting to textures. šŸž Anchor: Compared to U-Net-style skips, the inverted scheme better exploits pseudo-labels and improves accuracy on ETH3D and nuScenes.
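The toy decoder below illustrates the inverted-skip idea only in spirit: instead of wiring shallow encoder features into the last decoder stage (U-Net style), it upsamples the deepest, most semantic features and re-injects them right before the final prediction. The layer choices and sizes are ours, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyInverseSkipDecoder(nn.Module):
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv2d(3, dims[0], 3, stride=2, padding=1),
            nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1),
            nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1),
        ])
        self.project_deep = nn.Conv2d(dims[2], dims[0], 1)   # prepare deep features for late injection
        self.refine = nn.Conv2d(dims[0], dims[0], 3, padding=1)
        self.head = nn.Conv2d(dims[0], 1, 1)

    def forward(self, x):
        feats = []
        for layer in self.enc:
            x = F.relu(layer(x))
            feats.append(x)
        shallow, deep = feats[0], feats[-1]
        # Late injection: the most semantic features are upsampled to the shallow resolution
        # and fused just before the depth head, rather than only at the coarsest decoder stage.
        deep_up = F.interpolate(self.project_deep(deep), size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.head(F.relu(self.refine(shallow + deep_up)))

print(ToyInverseSkipDecoder()(torch.rand(1, 3, 64, 64)).shape)   # torch.Size([1, 1, 32, 32])
```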

Secret Sauce

  • Universal prompt interface: One tiny, uniform prompt channel for all sensors.
  • Robust training: Losses and dropping outliers tame noise.
  • Distillation: Turns messy supervision into a clean, RGB-only student.
  • Architecture tweaks: Minimal changes, maximum generality (shared ViT, light prompt head, smarter skips, better loss).

End-to-end flow: Input (RGB + sparse prompt) → prompt prep (align/fill/mask) → ViT + conditioned DPT → dense metric depth; then teacher outputs → student trains RGB-only with distance-balanced loss → fast, prompt-free metric depth.

04 Experiments & Results

The Test: Can one framework learn true meters from many sources and do well on many tasks? They measured accuracy with AbsRel, RMSE, Ī“ thresholds, F1 edge scores, and task-specific metrics on standard datasets (NYUv2, ETH3D, KITTI, nuScenes, Booster, Middlebury, Sintel) and specialized benchmarks.

The Competition: Strong baselines like DepthAnything (v1/v2), Metric3D (v1/v2), UniDepth, ZoeDepth, DepthPro, and task-specific prompt methods (PriorDA, Omni-DC, Marigold-DC, PromptDA), plus radar–camera fusion systems and a multi-view 3D baseline (MapAnything).

Scoreboard with context

  • Zero-shot depth super-resolution and completion: Without any fine-tuning, the pretrained model consistently beats prompt baselines on NYUv2/ETH3D/KITTI across 8Ɨ/16Ɨ downsampling, LiDAR-like sparsity, and extreme sparsity (100 points). Think of scoring 1.53 AbsRel on NYUv2 8Ɨ vs strong baselines >1.7–2.6: it’s like getting an A when others get B’s.
  • Radar–camera fusion: Pretraining transfers to an unseen sensor (radar). Fine-tuning the pretrained model on nuScenes achieves 651 mm MAE at 0–50 m vs 1047–1424 mm for top plug-in/radar fusion baselines—like cutting error nearly in half. Training from scratch with radar prompts is much worse, showing the value of the pretraining scheme.
  • Monocular metric depth (student): The prompt-free student ranks 1st or 2nd on six diverse datasets. Examples: SUN RGB-D AbsRel 0.085 (best), ETH3D Ī“ near 99.9% on strict thresholds (top-tier), Booster AbsRel 0.282 (best). That’s like top-of-class consistency across indoor, outdoor, and synthetic domains.
  • Edge sharpness (F1): Student achieves the best average boundary accuracy across Sintel, Spring, and iBims-1, preserving thin structures others blur.
  • Camera intrinsics recovery: Using the student’s point maps, a simple reprojection optimization estimates focal length better on average than dedicated methods—like beating camera-calibration specialists with just your model’s outputs.
  • Multi-view 3D reconstruction: Plugging student monocular depth into MapAnything improves AbsRel and Ī“ dramatically (ETH3D AbsRel 20.43% → 18.98%; ScanNet Ī“1 jumps to 99.41%). Free gains without multi-view optimization.
  • VLA planning: Distilling depth knowledge into a policy lifts LIBERO average success to 87.7% (above a strong 86.6% baseline), meaning better 3D-aware decisions without feeding depth at test time.
  • MLLM spatial reasoning: Using the pretrained ViT as a frozen spatial encoder boosts video QA on VSI-Bench across object size, absolute/relative distance, route planning, and appearance order—ranking best among open-source models tested. Like giving the language model 3D glasses.

Surprising findings

  • Clear scaling trend: For the first time, metric depth accuracy steadily improves as training data grows (e.g., Middlebury Ī“1 rises from 2M → 20M data). Past work struggled here.
  • Unseen sensor transfer: Training with random sparse prompts helps the model adapt to radar despite never seeing radar in pretraining.
  • Test-time resolution scaling: Inference at higher resolutions yields finer depth details without retraining.

Takeaway: The simple sparse-prompt pretraining + distillation consistently wins or ties across very different tasks. It’s like a single toolbox fixing bikes, cars, and drones—because it learned the physics of distance, not the quirks of each tool.

05 Discussion & Limitations

Limitations

  • Compute heavy: Pretraining and distillation used many GPUs (e.g., 144 H200s), which is out of reach for small labs.
  • Data curation burden: Though robust to noise, extremely poor labels or miscalibrations can still slip through.
  • Extreme conditions: While strong at night/rain/fog, some edge cases (e.g., glass, heavy glare, rapid motion blur) can still degrade accuracy.
  • Prompt prep dependencies: The alignment-and-fill step uses an external prior; future work may learn this end-to-end.
  • Latency vs. resolution: The prompt-free student is fast enough for many uses, but ultra-high-res, real-time robotics may need further optimization.

Required resources

  • Large, diverse 3D datasets (reconstructed, sensor, rendered) with calibration.
  • Substantial compute (multi-node GPUs), efficient training (FlashAttention, ZeRO), and storage.

When not to use

  • Tiny embedded devices with strict power/latency budgets that can’t handle ViT-sized models.
  • Scenarios needing exact certified safety without post-hoc validation; additional redundancy is advised.
  • Domains with no overlap with the pretraining data and no way to fine-tune (e.g., exotic microscopes); these would need some form of adaptation before the model can be trusted.

Open questions

  • Can prompt preparation be fully learned to remove external priors?
  • How to further improve meters on transparent/reflective objects and extreme long-range scenes?
  • Can active data selection make scaling cheaper while preserving gains?
  • What’s the tightest, certifiable safety bound for metric depth in autonomous systems?
  • Can this approach unify depth with surface normals, occupancy, and flow into one scalable 3D foundation?

06 Conclusion & Future Work

Three-sentence summary: Metric Anything shows that a tiny, universal Sparse Metric Prompt lets one model learn true meters from huge, noisy, mixed 3D sources. With simple pretraining and a robust distillation step, it delivers a prompt-free student that excels across monocular depth, calibration, multi-view 3D, radar fusion, VLA planning, and MLLM spatial reasoning. Crucially, it establishes a clear scaling law for metric depth—more diverse data steadily boosts accuracy.

Main achievement: Decoupling spatial understanding from sensor/camera bias via a minimal prompt interface, enabling scalable, general-purpose metric depth pretraining that actually benefits from heterogeneous data at massive scale.

Future directions: Learn prompt preparation end-to-end, enhance extreme-condition robustness (glass, glare, ultra-long range), compress models for edge devices, and broaden to richer 3D outputs (normals, occupancy, dynamics). Explore active data selection to get the most scaling for the least cost, and develop formal safety checks for deployment.

Why remember this: It turns metric depth from a patchwork of task-specific tricks into a scalable foundation—much like how large language models unified NLP. By proving that ā€œmore, messier dataā€ can help if you use the right interface, it lights a path toward dependable, meter-accurate perception for robots, vehicles, AR, and multimodal AI.

Practical Applications

  • Safer autonomous driving through robust, meter-accurate depth in adverse weather and low light.
  • Robotic grasping and manipulation that precisely judges object distances and sizes without extra sensors.
  • AR/VR room scanning with better measurements for furniture placement, gaming, and design.
  • Drone navigation with improved obstacle avoidance using prompt-free, monocular depth.
  • Warehouse and factory automation for shelf inspection, bin picking, and route planning.
  • Smartphone measuring tools that estimate room size and object dimensions more accurately.
  • Security and surveillance systems that maintain depth awareness across varied cameras and lighting.
  • Assistive technologies (e.g., fall detection, safe navigation aids) via reliable depth in homes.
  • 3D content creation and editing with cleaner depth layers for compositing and effects.
  • Boosting multimodal AI (VLM/VLA) to reason about space, plan routes, and act more reliably.
#metric depth estimation Ā· #sparse metric prompt Ā· #monocular depth Ā· #teacher–student distillation Ā· #vision transformer Ā· #DPT decoder Ā· #heterogeneous 3D data Ā· #radar–camera fusion Ā· #multi-view 3D reconstruction Ā· #camera intrinsics recovery Ā· #scaling laws Ā· #robust depth loss Ā· #pseudo-labeling Ā· #spatial reasoning Ā· #multimodal AI