Masked Depth Modeling for Spatial Perception

Intermediate
Bin Tan, Changjiang Sun, Xiage Qin et al. Ā· 1/25/2026
arXiv Ā· PDF

Key Summary

  • The paper turns the 'holes' (missing spots) in depth camera images into helpful training hints instead of treating them as garbage.
  • It introduces Masked Depth Modeling (MDM), which learns to fill in missing depth by looking at the matching color image.
  • A Vision Transformer (ViT) reads both RGB and depth patches, and a special ConvStack decoder paints a complete, metric-scale depth map.
  • The team built a huge training set: 2 million real captures and 1 million simulated ones, plus many public datasets.
  • On depth completion tests, the model beats strong methods and even outperforms top RGB-D cameras in precision and coverage.
  • It also improves monocular depth estimation when used as a backbone, showing better spatial understanding than DINOv2 on many benchmarks.
  • As a better depth prior, it speeds up and stabilizes training for FoundationStereo and reaches top results by epoch 15.
  • Without any video training, the model produces smooth, consistent depth across frames in tough scenes like glass lobbies and aquariums.
  • With refined depth, 3D point tracking becomes steadier, and a dexterous robot grasps shiny and transparent objects more reliably.
  • They release code, checkpoints, and 3 million RGB–depth pairs to help the spatial perception community.

Why This Research Matters

Reliable depth is the backbone of safe robots, realistic AR, and stable 3D mapping. By learning from real sensor failures, this approach makes depth cameras act smarter exactly where they used to break—on glass, mirrors, and plain surfaces. That unlocks steadier navigation, better obstacle avoidance, and cleaner scene understanding in homes, hospitals, warehouses, and city streets. It also reduces the need for expensive hardware or time-consuming multi-view setups, lowering costs and complexity. Stronger depth priors help other systems too, speeding up stereo training and improving monocular depth. In short, this work brings us closer to trustworthy, real-time 3D perception in the messy real world.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine wearing a pair of magic glasses that not only show colors but also tell you how far everything is. Super handy for a robot or self-driving car, right? But sometimes those glasses get confused by shiny windows or plain white walls, leaving blank spots.

🄬 The Concept — RGB-D cameras:

  • What it is: An RGB-D camera captures both a regular color image (RGB) and a depth image that measures how far each pixel is.
  • How it works:
    1. The camera takes a color photo (like your phone does).
    2. Another sensor inside measures distance for each pixel (using stereo vision or active light).
    3. The camera outputs two aligned pictures: color and depth.
  • Why it matters: Without good, dense, and accurate depth, robots can’t plan safe moves or understand 3D scenes well. šŸž Anchor: Think of a robot vacuum that needs to know where the coffee table legs are in 3D to avoid bumping them.

šŸž Hook: You know how a puzzle can be missing some pieces, but you still try to guess the picture from the pieces you do have?

🄬 The Concept — Depth completion:

  • What it is: Filling in the missing or corrupted parts of a depth image so every pixel has a reliable distance.
  • How it works:
    1. Look at where the depth image has holes or weird values.
    2. Use the matching color image as a hint for edges, textures, and objects.
    3. Predict the missing distances to make a complete depth map.
  • Why it matters: Without it, robots and AR apps see broken geometry and make clumsy or unsafe choices. šŸž Anchor: A robot arm trying to pick up a clear glass can’t do it if the glass is invisible to the depth sensor; completion makes it ā€œvisible.ā€

šŸž Hook: Try closing one eye and guessing how far the door is—harder, but still possible with experience.

🄬 The Concept — Monocular depth estimation:

  • What it is: Estimating distance from just one color image (no depth input at all).
  • How it works:
    1. Learn patterns like perspective, shading, and object sizes from many images.
    2. Use those clues to infer which parts are near or far.
    3. Output a depth map aligned to the image.
  • Why it matters: When no depth sensor is available or reliable, this gives you geometry from plain RGB. šŸž Anchor: Smartphones that create portrait blur or room scans from a single camera tap into monocular depth.

The world before: There were three main ways to get 3D: multi-view geometry (needs multiple frames and time), monocular learning (good but struggles with exact scale), and active sensors (fast but fail on shiny/texture-less stuff). For real-time robots, RGB-D cameras are the only tool that promises accurate, metric, pixel-aligned depth on the spot—but they break in tough lighting, glass, mirrors, and plain white walls.

The problem: Missing or wrong depth pixels create holes and errors. Traditional pipelines either toss bad pixels or try to repair them with hand-tuned tricks. These fixes aren’t robust across scenes and can be slow or fragile.

Failed attempts: Random masking pretraining (like MAE) helps a model learn to fill in missing color pixels, but the masks don’t reflect real-world sensor failures. Also, many datasets avoid difficult scenes or render perfect depth, which doesn’t teach models to handle real messiness.

The gap: We needed a way to treat real missing depth not as junk but as a teacher—something that points exactly to the hard places where a model should learn to reason using the RGB context and the few valid depth hints.

šŸž Hook: Think of a teacher who gives you the exact tricky questions you usually miss—that’s better than random practice.

🄬 The Concept — Masked Depth Modeling (MDM):

  • What it is: A training method that uses the actual missing spots from depth sensors as masks and teaches the model to predict those pixels using the RGB image and the remaining valid depth.
  • How it works:
    1. Take the color image and the raw depth map with holes.
    2. Mask out the depth tokens that are missing (and some partly-bad ones too).
    3. Feed all RGB + only the unmasked depth into a Vision Transformer to learn a joint representation.
    4. Use a decoder to output a full, metric, dense depth.
  • Why it matters: The model learns exactly where sensors struggle (glass, mirrors, low texture) and gets good at fixing them. šŸž Anchor: Like training to read smudged text by using the clear words around it to guess the missing parts. (A minimal code sketch of one MDM training step follows this list.)
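
To make the idea concrete, here is a minimal sketch of one MDM training step. The `encoder` and `decoder` arguments are hypothetical stand-ins for the paper's ViT joint encoder and ConvStack decoder (the authors release their real code separately), and the 0.5 validity threshold is an illustrative choice, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def mdm_training_step(encoder, decoder, rgb, raw_depth, gt_depth, patch=14):
    """One Masked Depth Modeling step (sketch, not the released implementation).

    rgb:       (B, 3, H, W) color image
    raw_depth: (B, 1, H, W) sensor depth, 0 where the sensor failed
    gt_depth:  (B, 1, H, W) dense supervision (synthetic GT or stereo pseudo-labels)
    """
    B, _, H, W = raw_depth.shape

    # 1. Natural mask: hide the depth patches the sensor itself left (mostly) empty.
    valid = (raw_depth > 0).float()
    frac_valid = F.avg_pool2d(valid, patch)          # fraction of valid pixels per patch
    depth_patch_mask = frac_valid < 0.5              # True = this depth patch is masked (illustrative threshold)

    # 2. Encode all RGB patches plus only the unmasked depth patches.
    tokens = encoder(rgb, raw_depth, depth_patch_mask)

    # 3. Decode a full, metric, pixel-aligned depth map from the fused context.
    pred_depth = decoder(tokens, out_size=(H, W))    # (B, 1, H, W)

    # 4. L1 loss wherever the ground-truth depth is valid.
    supervised = gt_depth > 0
    loss = (pred_depth - gt_depth).abs()[supervised].mean()
    return loss
```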

Real stakes: Better depth means safer self-driving, steadier AR, smoother robots in homes and factories, and 3D tracking that doesn’t drift. This paper shows a way to make depth cameras ā€œsmarterā€ by learning from their own mistakes, not in spite of them.

02 Core Idea

šŸž Hook: You know how crossword puzzles give you blanks where the hardest letters are? Solving those blanks teaches you the most.

🄬 The Concept — The ā€œAha!ā€ moment in one sentence:

  • Use the real holes from depth cameras as training masks so a model learns to fill them using the full color image plus the remaining valid depth, building a strong, metric, pixel-aligned depth predictor.

Multiple analogies:

  1. Jigsaw puzzle: Instead of removing random pieces, we remove exactly the pieces that are truly missing in real life—so the model practices on the hard spots, guided by the picture on the box (the RGB image).
  2. Detective work: The RGB image is all the clues; the few valid depth points are eyewitnesses; the model cross-examines them to reconstruct what happened in the missing regions.
  3. Tutor targeting weak spots: The sensor’s failures mark where the student (the model) needs help most; practicing there builds the right skill.

šŸž Hook: Imagine a team project where each teammate shares different info: one knows colors, another knows distances, and they talk to agree.

🄬 The Concept — Joint embedding architecture:

  • What it is: A model design that mixes RGB tokens and unmasked depth tokens into the same ā€œlanguageā€ space so they can inform each other.
  • How it works:
    1. Turn RGB patches and depth patches into tokens with separate patch embedders.
    2. Add position and modality hints so tokens know where they are and what type they are.
    3. Feed them together into a Vision Transformer that lets tokens attend to each other.
    4. Decode a full, dense depth from the learned context.
  • Why it matters: Without joint embedding, the model can’t align colors with shapes and distances at the same pixels. šŸž Anchor: Like linking a street map (RGB) with elevation data (depth) so every address knows its height above sea level.

šŸž Hook: Picture a librarian sorting picture cards and number cards into the same index so you can find matching pairs quickly.

🄬 The Concept — Vision Transformer (ViT):

  • What it is: A model that looks at images as sequences of patches and uses attention to decide which patches matter to each prediction.
  • How it works:
    1. Split images into patches and embed each into a token.
    2. Use self-attention to share information among tokens.
    3. Output rich features that remember what is where.
  • Why it matters: Without attention over both RGB and depth tokens, the model can’t align edges, textures, and distances reliably. šŸž Anchor: A ViT lets a pixel near a window ā€œaskā€ the RGB context if that bright stripe is glass glare or a wall. (A toy code example of these ViT mechanics follows this list.)
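
For readers who have not met ViTs in code, the toy snippet below shows the generic mechanics (patchify, then self-attention). It is not the paper's ViT-L/14; the sizes are arbitrary and `nn.TransformerEncoder` is used only as a convenient stand-in for transformer blocks.

```python
import torch
import torch.nn as nn

# Generic ViT mechanics (not the paper's ViT-L/14; sizes are arbitrary).
patch, dim = 14, 256
img = torch.randn(1, 3, 224, 224)

# Patchify: a strided convolution turns each 14x14 patch into one token.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 16*16 = 256 tokens, dim)

# Self-attention lets every token look at every other token.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
features = encoder(tokens)                            # (1, 256, dim) context-aware features
```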

šŸž Hook: Think of a crafts workshop that turns ideas into finished objects.

🄬 The Concept — Convolutional decoder (ConvStack):

  • What it is: A decoder made of convolutional layers that turns the ViT’s tokens into a high-resolution depth image.
  • How it works:
    1. Take the final ViT features (keep RGB/context tokens; drop masked depth tokens).
    2. Inject global context by adding the [cls] summary to each location.
    3. Upsample step-by-step through residual and transpose-conv layers.
    4. Output a dense, metric depth map aligned to the RGB.
  • Why it matters: Without a geometry-friendly decoder, predictions look blurry and lose fine details and boundaries. šŸž Anchor: The decoder paints a crisp, full ā€œdepth pictureā€ from the model’s understanding.

Before vs. After:

  • Before: Random masks teach generic filling-in; datasets often avoid the messy, real failures of sensors; models struggle on glass and blank walls.
  • After: Natural masks focus learning on the exact hard cases; the model aligns RGB and depth better; depth completion becomes sharper, more metric, and more complete.

šŸž Hook: Learning by doing is powerful—especially when you practice on your real mistakes.

🄬 The Concept — Self-supervised learning:

  • What it is: Training where the data itself supplies the targets (e.g., the missing depth is predicted using the observed parts).
  • How it works:
    1. Hide some data (here: masked depth) on purpose.
    2. Ask the model to predict it from what’s left (RGB + valid depth).
    3. Compare prediction to ground truth or pseudo labels and improve.
  • Why it matters: Labels are scarce for depth; using natural masks scales to millions of samples. šŸž Anchor: Like covering parts of a picture and asking yourself to sketch what’s underneath, then checking the original.

Why it works (intuition without equations):

  • The RGB image carries rich appearance cues—edges, textures, shading—while valid depth points anchor the scale. Attention lets the model find which RGB regions explain which missing depths. Practicing exactly where sensors fail teaches the model the physics-like patterns of reflections, textureless areas, and lighting, so it generalizes beyond the training scenes.

Building blocks: RGB-D cameras, depth completion, monocular depth estimation, self-supervised masked depth modeling, ViT joint embedding, and a ConvStack decoder all stack together to turn broken sensor maps into complete, metric, and pixel-aligned depth—fast enough to help robots and AR in the real world.

03 Methodology

High-level recipe: Input (RGB + raw depth) → Separate patch embedding (RGB tokens, depth tokens) → Mask depth tokens where the sensor failed (plus some partial/extra masks) → Concatenate all RGB + unmasked depth tokens → Vision Transformer encoder learns joint context → Drop latent depth tokens → ConvStack decoder upsamples to a full, dense, metric depth map → Output.

šŸž Hook: Imagine two translators—one for colors and one for distances—turning pages into word cards so a smart committee can discuss and fix the missing parts.

🄬 The Concept — Separated patch embedding for RGB-D:

  • What it is: Two small networks convert RGB patches and depth patches into aligned tokens, each with position and a tag for modality.
  • How it works:
    1. Split the RGB and depth images into 14Ɨ14-pixel patches.
    2. Embed RGB patches into RGB tokens; embed depth patches into depth tokens.
    3. Add 2D position encodings so tokens know where they are.
    4. Add a modality tag (RGB vs. depth) so the model knows the source.
  • Why it matters: Without separate, aligned embeddings, the model can’t properly fuse appearance and geometry. šŸž Anchor: Like labeling two decks of flashcards—one ā€˜color’ and one ā€˜depth’—with seat numbers so you can match pairs. (A minimal sketch of this embedding step follows this list.)
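
A minimal sketch of separated patch embedding, assuming a 224Ɨ224 input and simple learnable position and modality tables; the paper describes 2D position encodings plus a modality tag, and its exact implementation may differ.

```python
import torch
import torch.nn as nn

class RGBDPatchEmbed(nn.Module):
    """Separated patch embedding (sketch): two embedders, one shared token space.
    Assumes the input is resized so H and W are multiples of the 14-pixel patch."""

    def __init__(self, dim=1024, patch=14, grid=(16, 16)):   # 224x224 input -> 16x16 token grid
        super().__init__()
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_tokens = grid[0] * grid[1]
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))   # "where am I?"
        self.modality = nn.Parameter(torch.zeros(2, dim))        # row 0 = RGB, row 1 = depth

    def forward(self, rgb, depth):
        rgb_tok = self.rgb_embed(rgb).flatten(2).transpose(1, 2)       # (B, N, dim)
        depth_tok = self.depth_embed(depth).flatten(2).transpose(1, 2)
        rgb_tok = rgb_tok + self.pos + self.modality[0]                # position + "I am RGB"
        depth_tok = depth_tok + self.pos + self.modality[1]            # position + "I am depth"
        return rgb_tok, depth_tok
```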

Masking strategy: using the sensor’s own holes.

  • Treat patches that are fully missing in the raw depth as always masked.
  • For mixed patches (some valid, some invalid), mask with high probability (e.g., 0.75) so the model practices hard cases.
  • If needed, add a few random masks over valid patches to reach a 60–90% depth masking ratio.
  • Keep all RGB tokens visible (they are the main context).
Why this step exists: If masks were random, the model wouldn’t focus on real sensor failure patterns like glass glare or textureless walls.
Example: A 960Ɨ1280 image is split into 14Ɨ14-pixel patches that become tokens; if the glass-door region has 80% missing depth, many patches there get masked, forcing the model to infer them from RGB edges and reflections. (A code sketch of this mask-selection logic follows below.)
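
The mask-selection logic above, sketched in code. The 0.75 probability for mixed patches and the 60–90% target ratio come from the description; exactly how the extra random masks are drawn is an assumption on my part.

```python
import torch
import torch.nn.functional as F

def natural_depth_mask(raw_depth, patch=14, p_mixed=0.75, min_ratio=0.6):
    """Per-patch depth mask built from the sensor's own holes (sketch).
    raw_depth: (B, 1, H, W), zeros where the sensor returned nothing."""
    valid = (raw_depth > 0).float()
    frac_valid = F.avg_pool2d(valid, patch)                  # per-patch fraction of valid pixels

    fully_missing = frac_valid == 0                          # always masked
    mixed = (frac_valid > 0) & (frac_valid < 1)              # masked with probability p_mixed
    mask = fully_missing | (mixed & (torch.rand_like(frac_valid) < p_mixed))

    # If the natural holes are too few, add random masks over valid patches
    # until the overall depth-masking ratio reaches the lower target (60%).
    ratio = mask.float().mean()
    if ratio < min_ratio:
        p_extra = (min_ratio - ratio) / (1 - ratio)          # share of unmasked patches to add
        mask = mask | (~mask & (torch.rand_like(frac_valid) < p_extra))
    return mask                                              # True = hide this depth patch
```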

šŸž Hook: Think of a roundtable where every token asks, ā€œWho can help me fill in my blanks?ā€

🄬 The Concept — Joint embedding with a ViT encoder:

  • What it is: A ViT-L/14 with self-attention over RGB + unmasked depth tokens (plus a [cls] token for global summary).
  • How it works:
    1. Concatenate all RGB tokens and remaining depth tokens with position+modality encodings.
    2. Run through 24 layers of attention, where depth tokens attend to RGB tokens that share edges, textures, and locations.
    3. The [cls] token collects scene-wide context.
  • Why it matters: Without attention across both modalities, the model cannot align RGB cues to depth predictions at each pixel. šŸž Anchor: A depth token near a mirror learns to trust RGB patterns that signal reflection and to downweight misleading depth readings. (A simplified code stand-in for this encoder follows this list.)
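
A simplified stand-in for the joint encoder: a [cls] token, all RGB tokens, and only the unmasked depth tokens run through a standard transformer encoder. The real model is a ViT-L/14 with 24 layers initialized from DINOv2; `nn.TransformerEncoder` is just a convenient approximation here, and sharing one mask across the batch is a simplification.

```python
import torch
import torch.nn as nn

class JointRGBDEncoderSketch(nn.Module):
    """[cls] + all RGB tokens + unmasked depth tokens attend to each other (sketch)."""

    def __init__(self, dim=1024, layers=24, heads=16):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, rgb_tokens, depth_tokens, depth_mask):
        # depth_mask: (N,) bool, True where a depth patch is hidden.
        # Shared across the batch here for simplicity; real code handles per-sample masks.
        B, n_rgb, _ = rgb_tokens.shape
        visible_depth = depth_tokens[:, ~depth_mask]                    # drop masked depth tokens
        x = torch.cat([self.cls.expand(B, -1, -1), rgb_tokens, visible_depth], dim=1)
        x = self.blocks(x)                                              # joint self-attention
        cls_out = x[:, 0]                                               # global scene summary
        rgb_out = x[:, 1:1 + n_rgb]                                     # fused per-patch features
        return cls_out, rgb_out
```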

Decoder design: ConvStack for crisp geometry.

  • Drop latent depth tokens (we want the decoder to rely on the fused context).
  • Add the [cls] global context to each spatial location (broadcast and add).
  • Use a pyramid of residual and transpose-conv layers to upsample features back to high resolution.
  • Inject UV positional hints at each scale for layout fidelity and aspect ratio.
  • Produce a multi-scale feature pyramid and decode the final dense depth; upsample to match the input size. (A sketch of this decoder follows this list.)
Why this step exists: Convolutions are strong at local detail and edges; they make the output depth sharper and more stable than a tiny transformer decoder for this task.
Example: Thin rods of gym equipment stay visible in depth because the decoder preserves edges while using the ViT’s global context to decide distances.
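
A sketch of a ConvStack-flavoured decoder: add the [cls] summary at every location, then upsample step by step with transposed convolutions. The stage count and channel widths are made up, and the UV hints and multi-scale pyramid from the paper are simplified away, so treat this as an illustration rather than the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvStackSketch(nn.Module):
    """Geometry-friendly decoder sketch: broadcast [cls] context, then upsample."""

    def __init__(self, dim=1024, chans=(512, 256, 128, 64)):
        super().__init__()
        self.inject_cls = nn.Linear(dim, dim)
        stages, c_in = [], dim
        for c_out in chans:
            stages.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2),  # 2x upsampling
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
                nn.GELU(),
            ))
            c_in = c_out
        self.stages = nn.ModuleList(stages)
        self.head = nn.Conv2d(c_in, 1, kernel_size=3, padding=1)           # dense metric depth

    def forward(self, cls_out, rgb_out, grid_hw, out_size):
        B, N, D = rgb_out.shape
        h, w = grid_hw                                          # token grid, e.g. H//14 x W//14
        x = rgb_out + self.inject_cls(cls_out).unsqueeze(1)     # add global context everywhere
        x = x.transpose(1, 2).reshape(B, D, h, w)               # tokens -> feature map
        for stage in self.stages:
            x = stage(x)                                        # progressively finer features
        depth = self.head(x)
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)
```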

šŸž Hook: Practicing with both pretend and real mistakes makes you better at real tests.

🄬 The Concept — Data curation:

  • What it is: Building massive training sets that include the real imperfections of depth sensors.
  • How it works:
    1. Synthetic branch: Render RGB, perfect depth, and stereo IR pairs with speckle patterns; run SGM to simulate realistic sensor depth with holes; keep perfect depth for supervision.
    2. Real branch: Build portable rigs with popular RGB-D cameras; capture RGB, raw depth, and stereo IR; use stereo matching and quality checks to get pseudo-depth labels.
    3. Mix in public RGB-D datasets; for those with near-perfect depth, add random patch masks during training.
  • Why it matters: Without realistic missing patterns and diverse scenes, the model won’t learn to fix real sensor failures. šŸž Anchor: In the aquarium tunnel, the synthetic SGM and real captures both teach the model how refractive glass ruins depth, so it learns to recover it from RGB cues.

Training details (like a recipe):

  • Backbone: ViT-L/14 initialized from DINOv2; decoder initialized randomly.
  • Optimizer: AdamW with differential learning rates (1e-5 for the encoder, 1e-4 for the decoder), weight decay 0.05.
  • Schedule: Warm up encoder for 2k iters, then step decay by 0.5 every 25k; train 250k iterations total.
  • Batch/compute: Global batch 1024 (128 GPUs Ɨ 8), BF16 mixed precision, gradient clipping at 1.0.
  • Augmentations: Crops, flips, color jitter, JPEG artifacts, motion blur, shot noise.
  • Loss: L1 on valid ground-truth depth pixels. (A sketch of this optimizer and loss setup follows this list.)
Secret sauce: Natural masks from real sensor failures + joint RGB-depth attention + a geometry-friendly ConvStack decoder + big, realistic data curation. This combo teaches the model to fix exactly the kinds of problems depth cameras have in the wild.
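
The optimizer and loss from the recipe above, sketched in code. The parameter-group split, the single scheduler applied to both groups (the paper warms up only the encoder), and the overall wiring are assumptions about one reasonable setup, not the released training script.

```python
import torch

def build_optimizer(encoder, decoder):
    """AdamW with differential learning rates and weight decay 0.05 (sketch)."""
    return torch.optim.AdamW(
        [
            {"params": encoder.parameters(), "lr": 1e-5},   # pretrained ViT-L: gentle updates
            {"params": decoder.parameters(), "lr": 1e-4},   # randomly initialized ConvStack
        ],
        weight_decay=0.05,
    )

def lr_lambda(step, warmup=2_000, decay_every=25_000):
    """Linear warmup, then multiply the LR by 0.5 every 25k iterations (sketch)."""
    if step < warmup:
        return step / warmup
    return 0.5 ** ((step - warmup) // decay_every)

def masked_l1_loss(pred_depth, gt_depth):
    """L1 loss only on pixels with valid ground-truth depth."""
    valid = gt_depth > 0
    return (pred_depth - gt_depth).abs()[valid].mean()

# Typical wiring per iteration (sketch):
# optimizer = build_optimizer(encoder, decoder)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(
#     list(encoder.parameters()) + list(decoder.parameters()), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```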

04 Experiments & Results

The test: Does the model truly fill holes and stay accurate at metric scale? The authors measure how close the predicted depth is to ground truth using RMSE, REL, and MAE. For stereo they also report bad-pixel rates such as BP-1.0 (the share of pixels whose disparity error exceeds 1 pixel), and they evaluate across indoor, outdoor, synthetic, and challenging real scenes.
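
For reference, these depth error metrics can be computed as below. This is the standard generic form over valid ground-truth pixels, not the paper's exact evaluation script.

```python
import torch

def depth_metrics(pred, gt):
    """RMSE, REL, and MAE over valid ground-truth pixels (generic sketch)."""
    valid = gt > 0
    err = pred[valid] - gt[valid]
    rmse = torch.sqrt((err ** 2).mean())          # heavily penalizes large mistakes
    mae = err.abs().mean()                        # average absolute error, in depth units
    rel = (err.abs() / gt[valid]).mean()          # error relative to the true distance
    return {"RMSE": rmse.item(), "REL": rel.item(), "MAE": mae.item()}
```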

The competition: Strong baselines for depth completion (OMNI-DC, PromptDA, PriorDA) and leading backbones (DINOv2) used in MoGe and FoundationStereo. The question: Can MDM pretraining make depth completion stronger, monocular depth smarter, and stereo training faster and better?

Scoreboard with context:

  • Depth completion (Protocol 1: block-wise masking + noise on iBims, NYUv2, DIODE): The MDM model wins across easy to extreme settings. For NYUv2 extreme, it achieves errors small enough to be like getting an A+ while others get B-level scores. On DIODE-Indoor, it consistently has the lowest RMSE and REL through all difficulty levels, and on DIODE-Outdoor (with large depth ranges) it still leads.
  • Depth completion (Protocol 2: Sparse SfM on ETH3D): With only a sprinkle of 3D points, the model still reconstructs full depth better than others—RMSE drops by 47% (indoor) and 38% (outdoor) versus the best baseline. That’s like finishing a maze in half the time with fewer wrong turns.
  • Monocular depth estimation (as a MoGe backbone): Replacing DINOv2 with the MDM-pretrained encoder improves affine-, scale-, and disparity-invariant metrics across many datasets (NYUv2, HAMMER, KITTI, GSO, Sintel, DIODE, Spring). Translation: the encoder learned real 3D logic that carries over even when there’s no depth input at test time.
  • FoundationStereo with MDM: Using the MDM encoder as a depth prior speeds up learning and stabilizes early epochs. By epoch 5, it’s already ahead on multiple benchmarks; by epoch 15, it hits the best or tied-best results on Middlebury, HAMMER, and FSD. That’s like showing up to class already warmed up and acing the quiz.

Surprising findings:

  • Zero-shot video consistency: Even trained only on single images, the model outputs smooth, stable depth across frames in hard scenes (glass lobby, rowing machine near windows, mirrors in a gym, aquarium tunnel). Stereo cameras (like ZED) struggle or even fail in these conditions, but MDM predictions remain plausible and continuous.
  • Strong cross-modal attention: Visualizations show depth tokens attending to the exact matching RGB regions, proving the model didn’t just memorize averages—it learned pixel-precise RGB–depth correspondences.
  • Downstream power: With refined depth, 3D point tracking in SpatialTrackerV2 drifts less and runs more efficiently. In robotics, a dexterous gripper grasps shiny steel cups and transparent boxes far more often, even when raw depth makes them nearly invisible.

Takeaway: When you practice on real sensor mistakes, you learn the right fixes. MDM turns ugly holes into a guiding teacher, and the numbers back it up across completion, monocular depth, stereo, tracking, and grasping.

05 Discussion & Limitations

Limitations:

  • Extreme materials and lighting: Fully transparent or mirror-like objects under tricky lighting can still fool the model; predictions may be plausible but not perfect.
  • Domain shifts: Outdoor scenes with unusual weather or rare camera intrinsics might need fine-tuning to stay precise at metric scale.
  • Heavy training: Pretraining ViT-L with a 3M+ dataset and 128 GPUs is resource-intensive; not every lab can reproduce the full recipe.
  • Pseudo-label noise: Real-data supervision relies on stereo-derived pseudo-depth; while filtered, it still carries noise that can cap ultimate accuracy.

Required resources:

  • A capable GPU cluster (or lots of patience) to pretrain the ViT-L with ConvStack.
  • Large, diverse RGB-D data with realistic missing-depth patterns (the paper releases 3M pairs to help).
  • For best performance, synchronized RGB and depth streams and good calibration in applications.

When not to use:

  • If you need millimeter-grade depth on ultra-reflective factory lines without any RGB context, specialized active sensors or multi-view setups may still be required.
  • If compute is tiny (e.g., microcontrollers), this large ViT-L model might be too heavy; consider distilled or smaller variants.

Open questions:

  • Can we add light temporal training to further improve video consistency and handle motion blur or rolling shutter effects?
  • How well do smaller, efficient backbones distill MDM’s benefits for edge devices?
  • Can we fuse other modalities (events, polarization, thermal) to further reduce glass/mirror failures?
  • How robust is MDM under severe weather (rain on lenses, fog) or underwater imaging?
  • Can active learning target scenes where the model is least confident, to improve data efficiency even more?

06 Conclusion & Future Work

Three-sentence summary: The paper introduces Masked Depth Modeling (MDM), which uses real missing pixels from depth cameras as training masks so a model learns to fill them using RGB context and remaining valid depth. A ViT-based joint embedding and a ConvStack decoder, trained on 3M curated RGB-D pairs plus open datasets, produce dense, metric, pixel-aligned depth that beats strong baselines and even helps stereo and monocular tasks. The model generalizes to videos and fuels better 3D tracking and robot grasps, especially for shiny and transparent objects.

Main achievement: Turning sensor failures into a learning signal—natural masks—so the model practices exactly where depth cameras struggle, leading to state-of-the-art depth completion and stronger spatial priors.

Future directions:

  • Lightweight distillation for mobile robots and AR headsets.
  • Gentle temporal training for even smoother video depth.
  • Multimodal fusion (e.g., polarization) to tackle mirrors and glass.
  • Active data collection that focuses on hardest scenes.

Why remember this: MDM flips the script—holes aren’t junk; they are the best teacher. By learning from what depth cameras miss, we can make them behave smarter, enabling safer robots, steadier AR, and more reliable 3D perception in the messy real world.

Practical Applications

  • Make home robots safely navigate and manipulate in kitchens and bathrooms with lots of glass and tile.
  • Improve AR room scanning on consumer devices for accurate virtual furniture placement.
  • Enhance warehouse robots’ obstacle detection on shiny floors and plastic wraps.
  • Stabilize 3D point tracking for sports analysis or fitness equipment monitoring.
  • Provide robust depth for telepresence robots in offices with glass walls and mirrors.
  • Boost stereo systems (like FoundationStereo) with better, faster depth priors.
  • Upgrade mobile mapping and SLAM in malls, airports, and museums with reflective surfaces.
  • Enable grasping of transparent and reflective objects in factories and labs with dexterous hands.
  • Support safer autonomous driving perception under rare textures (e.g., blank walls, foggy glass).
  • Aid construction and inspection by delivering sharper depth on metal and glass structures.
#Masked Depth Modeling Ā· #RGB-D cameras Ā· #Depth completion Ā· #Monocular depth estimation Ā· #Vision Transformer Ā· #Joint embedding Ā· #Self-supervised learning Ā· #Convolutional decoder Ā· #Stereo matching Ā· #Pseudo-depth labels Ā· #Metric depth Ā· #Spatial perception Ā· #3D point tracking Ā· #Robotic grasping Ā· #FoundationStereo