
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Intermediate
Xin Lin, Meixi Song, Dizhe Zhang et al. Ā· 12/18/2025
arXiv Ā· PDF

Key Summary

  • This paper builds a foundation model called DAP that estimates real-world (metric) depth from any 360° panorama, indoors or outdoors.
  • The team created a massive 2-million-image data engine mixing synthetic and real panoramas, plus a smart three-stage pseudo-label pipeline to turn unlabeled images into reliable training signals.
  • DAP uses a powerful DINOv3-Large backbone and adds a simple ā€˜range mask head’ so the model knows which distances to trust (10/20/50/100 meters).
  • Training uses geometry- and sharpness-focused losses that respect 360° image distortions, helping edges look crisp and 3D shapes stay consistent.
  • In zero-shot tests (no fine-tuning), DAP beats prior methods on Stanford2D3D, Matterport3D, and Deep360, showing strong generalization.
  • On their new outdoor benchmark (DAP-Test), DAP dramatically lowers error versus strong baselines, showing the power of data scaling plus curated pseudo-labels.
  • Ablations show each ingredient (distortion map, geometry losses, sharpness losses, range masks) contributes meaningful gains.
  • The approach is practical for robotics, AR/VR, and mapping where panoramic cameras are common and meters matter.
  • A key insight is ā€˜data-in-the-loop’: better pseudo-labels create better models, which then create even better pseudo-labels, in a virtuous cycle.
  • DAP is robust to distant regions and sky areas, where many older models collapse or get scale wrong.

Why This Research Matters

Accurate panoramic depth in meters lets robots, drones, and self-driving systems see the whole scene and plan safe paths instantly. AR/VR becomes more believable because virtual objects sit at the correct real-world distances and don’t drift. Cities and buildings can be mapped with cheap 360 cameras, speeding up inspection, renovation, and asset tracking. Film and game creators can relight or composite scenes more naturally using consistent 3D structure. Emergency response teams can quickly assess room sizes and obstacles from one panoramic snapshot. Altogether, DAP moves 360° vision from toy demos to reliable tools that work indoors and outdoors without per-scene tuning.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: You know how a hamster ball lets you see all around you at once? A 360° photo is like that—it shows everything around the camera, floor to ceiling, wall to wall.

🄬 The Concept: Panoramic depth estimation is figuring out how far away every pixel is in a full 360° picture, in real-world meters.

  • How it works (big picture):
    1. Take a 360° image (a panorama).
    2. For each pixel, guess how many meters away that point is.
    3. Build a full distance map so robots, AR glasses, and maps know the scene’s shape.
  • Why it matters: Without depth, machines can’t tell what’s near or far. They might bump into things, place virtual objects wrong, or make bumpy 3D maps.

šŸž Anchor: Think of a rolling robot with a panoramic camera. It needs to know if a chair is 1 meter away (stop!) or 10 meters away (safe to go). That’s panoramic depth.

The World Before:

  • AI was good at depth for normal, narrow camera views, but 360° images are different: they stretch the top and bottom (called equirectangular projection), and they contain everything at once.
  • Many models only predicted ā€œrelativeā€ depth (who is closer than whom), but not ā€œmetricā€ depth (exact meters). For AR/VR, robots, and mapping, meters matter.
  • Datasets for panoramas were small and often indoors. Outdoor panoramas with ground-truth meters were rare because measuring depth everywhere is hard.

šŸž Hook: Imagine you learned to measure distances only inside your house. Now you go outside to a park—suddenly, your tricks don’t work as well.

🄬 The Concept: Domain gap is when a model trained in one world (say, indoor or synthetic) struggles in another (outdoor or real).

  • How it works:
    1. Train mostly indoors or on computer-generated images.
    2. Test outdoors or on real photos.
    3. Predictions fall apart because textures, lighting, and distances differ.
  • Why it matters: If a robot goes from a hallway to a sunny street, we still need accurate meters. Gaps cause bad depth, wobbly edges, and wrong scales.

šŸž Anchor: A model that thinks a blue sky is a nearby wall has a big domain gap problem.

The Problem:

  • Panoramic metric depth models didn’t generalize well, especially outdoors and for very far distances (think 50–100 meters and the sky).
  • Collecting labeled panoramic depth is expensive; unlabeled panoramas are easy to find but lack ground truth.

Failed Attempts:

  • Train in-domain only: works on that dataset, fails elsewhere (overfitting).
  • Convert perspective images to fake panoramas: helps, but still misses true 360° geometry and long-range depth behavior.
  • Use perspective depth models directly: better than nothing, but not robust across full 360° geometry and distortions.

The Gap:

  • We needed both: a giant, diverse panoramic dataset and a way to turn millions of unlabeled panoramas into trustworthy training signals.
  • We also needed a model that respects 360° distortions and stays sharp and geometrically correct at many distances.

šŸž Hook: Imagine a teacher who gets better as they see more student homework, even if some answers are guessed first but then checked and improved.

🄬 The Concept: Data-in-the-loop learning continuously improves the training labels using the model itself.

  • How it works:
    1. Train a good starter model on trusted data.
    2. Use it to guess labels (pseudo-labels) for huge unlabeled sets.
    3. Keep only the best guesses and train a better model.
  • Why it matters: You break free from small labeled datasets. Each round produces cleaner labels and a stronger model.

šŸž Anchor: It’s like practicing piano with a tuner app: your early notes are a bit off, the app shows corrections, and over time you hear better and play better.

Real Stakes (Why care?):

  • Safer navigation for delivery robots and drones.
  • More believable AR—virtual furniture sits at the right distance and doesn’t float.
  • Faster 3D mapping of buildings and streets from cheap 360 cameras.
  • Better video editing—consistent scene depth helps relighting and effects.
  • Education and tourism—walkthroughs with accurate room sizes and distances.

02Core Idea

šŸž Hook: Imagine building a Lego city using both an instruction booklet and thousands of photos from other Lego cities so you can copy what works and avoid mistakes.

🄬 The Concept: The ā€œaha!ā€ is to make a panoramic metric-depth foundation model by pairing a massive, diverse data engine with a three-stage pseudo-label pipeline and a geometry/sharpness-aware network that respects 360° distortions.

  • How it works (one sentence): Combine 2M panoramas, a careful pseudo-label curation process, and a model with a range mask and distortion-aware losses to predict true meters anywhere in a 360° image.
  • Why it matters: Without all three (data scale, curated pseudo-labels, geometry-aware design), models crumble outdoors, blur edges, and misjudge long distances.

šŸž Anchor: It’s like training a lifeguard who studies many beaches (data), learns from reliable reports (pseudo-labels), and wears polarized glasses (geometry/sharpness losses) to see underwater clearly.

Multiple Analogies (three ways):

  1. Chef analogy: Gather tons of ingredients (diverse data), taste-test and refine recipes (pseudo-label curation), and use the right tools (range mask + losses) to cook a consistent dish (metric depth) in any kitchen (scene).
  2. Hiking analogy: Pack a detailed map (data engine), check trail markers (pseudo-label filtering), and use a rangefinder (range mask) plus a compass and altimeter (geometry/sharpness losses) to know exactly where you are in meters.
  3. Classroom analogy: Start with a solid textbook (synthetic labels), grade practice sheets (pseudo-labels) with a careful grader (discriminator), then teach with lab gear (losses) so students measure real distances accurately.

šŸž Hook: You know how a 360° photo looks stretched at the top and bottom, like a world map that makes Greenland huge?

🄬 The Concept: Distortion-aware training treats panoramic pixels fairly so edges and shapes stay true.

  • How it works:
    1. Use a distortion map to rebalance learning where pixels are stretched.
    2. Break the panorama into 12 normal views to compare fine details safely.
    3. Add geometry-focused losses so 3D surfaces and points match reality.
  • Why it matters: If you ignore distortions, far-away regions and poles go wrong—edges blur and scales drift.

šŸž Anchor: It’s like grading a test where some questions are printed bigger than others—you must score them evenly so the student isn’t punished or rewarded by font size.

Before vs After:

  • Before: Models either nailed a single dataset or got confused by outdoors, skies, and long ranges; meter-true predictions were shaky.
  • After: DAP gives stable, metric-consistent depth across varied scenes, with crisp edges and better long-range behavior—all without test-time scale fixes.

Why It Works (intuition):

  • Big, varied data reduces surprise at test time.
  • Pseudo-label curation keeps only reliable guesses, compounding quality.
  • Range masks and tailored losses guide the model to focus on the right distances and true geometry, especially where panoramas distort the most.

Building Blocks (with sandwiches):

  • Panoramic Depth Estimation šŸž Hook: Think of a globe turned into a flat map—everything is there, just warped. 🄬 What: Predict real distances for every pixel in a 360° image. How: Ingest panorama → estimate per-pixel meters → make a depth map. Why: Robots/AR need true meters to act correctly. šŸž Anchor: A home robot avoids a staircase because it knows the drop is 2 meters, not 20 cm.

  • Pseudo-Labeling šŸž Hook: Imagine filling in a worksheet answer with a best guess before the teacher checks it. 🄬 What: Use a model’s predictions as temporary labels for unlabeled data. How: Train on trusted labels → predict on unlabeled images → filter good guesses → retrain. Why: Unlocks huge unlabeled datasets cheaply. šŸž Anchor: Guessing vocab words from context, then keeping only the ones you’re pretty sure about.

  • Three-Stage Training Pipeline šŸž Hook: Cooking: prep, cook, plate. 🄬 What: A 3-step plan to turn millions of raw panoramas into strong training. How: (1) Scene-Invariant Labeler on synthetic indoor/outdoor; (2) pick top-quality pseudo-labels using a discriminator, train a Realism-Invariant Labeler; (3) train final DAP on all refined data. Why: Each stage removes noise and closes domain gaps. šŸž Anchor: Like practicing a song slowly, then with a metronome, then performing smoothly.

  • DINOv3-Large Backbone šŸž Hook: A super-reader who already knows many patterns in images. 🄬 What: A strong vision transformer that extracts rich features. How: Pretrained features → fine-tuned for panoramic depth. Why: Better starting point = better generalization. šŸž Anchor: A coach who’s seen thousands of games can quickly train a new team.

  • Range Mask Head šŸž Hook: Using reading glasses with different strengths to see near or far. 🄬 What: A head that selects valid distance ranges (10/20/50/100 m) for safer predictions. How: Predict a mask for a chosen range → multiply with depth map → trust distances within range, ignore the rest. Why: Prevents unstable far-depth guesses and stabilizes training. šŸž Anchor: For a hallway, pick 10 m; for a park, pick 100 m.

  • Sharpness-Centric Optimization šŸž Hook: Sharpening a slightly blurry photo so edges pop. 🄬 What: Losses (like Gram-based DF and gradient loss) that keep edges crisp. How: Split panorama into 12 normal views → compare texture/structure; also focus on strong edges in the panorama. Why: Edges define shapes; without them, geometry looks mushy. šŸž Anchor: Chair legs and door frames look clean instead of melted.

  • Geometry-Centric Optimization šŸž Hook: Using a ruler and protractor to keep shapes correct. 🄬 What: Losses on surface normals and 3D points to preserve true 3D shape. How: Convert depth to normals and point clouds → match predictions to ground truth. Why: Keeps planes flat, curves smooth, and scales steady. šŸž Anchor: Walls stay flat, floors stay level, and far buildings don’t collapse.

03Methodology

At a high level: Panorama → Data Engine + Pipeline → DAP Network (Backbone + Range Mask + Depth Decoder) → Metric Depth Map

  1. Data Engine (collect and balance data)
  • What happens: Build a 2M-panorama collection mixing synthetic and real, indoor and outdoor. Includes Structured3D (indoor synthetic), 90k UE5/AirSim360 outdoor labeled renders, 1.7M real panoramas scraped from the web, and 200k extra indoor panoramas from a generator (DiT-360).
  • Why this step exists: Diversity defeats domain gaps. Outdoor meters and sky regions are underrepresented without this scaling.
  • Example: A drone fly-through of Rome (synthetic) plus a real street market panorama both teach long-range depth.
  2. Stage 1 – Train Scene-Invariant Labeler (prep)
  • What happens: Train a first depth model only on trusted synthetic indoor (20k) + outdoor (90k) with accurate metric depth.
  • Why: Synthetic labels are clean and cover many geometries; this gives a strong, unbiased starter that understands both room corners and city blocks.
  • Example: The model learns that doors are thin planes and skyscrapers have vertical walls reaching far upward.
  3. Stage 2 – Filter Pseudo-Labels and Train Realism-Invariant Labeler (cook)
  • What happens:
    • Use the Scene-Invariant Labeler to predict depth for all 1.9M unlabeled real panoramas.
    • A discriminator (PatchGAN-style) scores the quality; pick top 300k indoor + 300k outdoor pseudo-labeled images.
    • Retrain a stronger Realism-Invariant Labeler on synthetic + these filtered real pseudo-labels.
  • Why: Real photos have textures and lighting that differ from synthetic. Filtering keeps only reliable guesses, closing the synthetic–real gap.
  • Example: Night street scenes with neon signs get included only if their predicted depths look physically consistent.
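
A small sketch of the selection step follows, under the assumption that the discriminator has already produced a quality score per pseudo-label and a scene classifier has tagged each panorama as indoor or outdoor; the function names and counts below are illustrative, not the paper's code.

```python
# Sketch of Stage-2 top-k pseudo-label selection per category (illustrative names/counts).
import numpy as np

def select_pseudo_labels(scores: np.ndarray, categories: np.ndarray, k_per_category: int) -> np.ndarray:
    """Keep the k highest-scoring pseudo-labels per category (e.g. indoor / outdoor)."""
    selected = []
    for cat in np.unique(categories):
        idx = np.where(categories == cat)[0]
        top = idx[np.argsort(scores[idx])[::-1][:k_per_category]]
        selected.append(top)
    return np.concatenate(selected)

rng = np.random.default_rng(0)
scores = rng.random(1_000)                                   # discriminator quality scores
categories = rng.choice(["indoor", "outdoor"], size=1_000)   # from a scene classifier
keep = select_pseudo_labels(scores, categories, k_per_category=300)
print(len(keep))  # the kept panoramas move on to train the Realism-Invariant Labeler
```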
  4. Stage 3 – Train Final DAP on All Curated Data (plate)
  • What happens: Train the final DAP model on every labeled sample plus all refined pseudo-labeled samples (1.9M).
  • Why: Biggest, cleanest set → best generalization. The final model benefits from both dense supervision and diverse looks.
  • Example: The model now handles tiny indoor objects and far-away outdoor buildings in one network.
  5. DAP Network Architecture
  • Input: A panorama at 512Ɨ1024.
  • Backbone: DINOv3-Large extracts strong, general-purpose visual features.
  • Two heads:
    a) Range Mask Head (10/20/50/100 m): predicts a binary mask of valid depths for the chosen range (trained with weighted BCE + Dice).
    b) Metric Depth Head: predicts dense metric depth D (in meters).
    Final output = Mask āŠ™ Depth, which keeps predictions consistent within the chosen range.
  • Why this design: The mask stabilizes training and lets you ā€œdialā€ distance ranges for different scenes (room vs park).
  • Example: For a living room, choose 10 m; for a boulevard, 100 m.
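
Here is a minimal PyTorch-style sketch of how the two heads could be combined and how a weighted BCE + Dice mask loss looks; shapes, weights, and thresholds are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of the mask-gated output and a weighted BCE + Dice mask loss (illustrative only).
import torch
import torch.nn.functional as F

def gated_depth(depth: torch.Tensor, mask_logits: torch.Tensor) -> torch.Tensor:
    """Final prediction = range mask āŠ™ metric depth, so out-of-range pixels are suppressed."""
    return torch.sigmoid(mask_logits) * depth

def mask_loss(mask_logits: torch.Tensor, valid: torch.Tensor, pos_weight: float = 2.0) -> torch.Tensor:
    """Weighted BCE + Dice on the valid-range mask (valid = 1 where GT depth <= chosen range)."""
    bce = F.binary_cross_entropy_with_logits(
        mask_logits, valid, pos_weight=torch.tensor(pos_weight))
    prob = torch.sigmoid(mask_logits)
    inter = (prob * valid).sum()
    dice = 1.0 - (2.0 * inter + 1.0) / (prob.sum() + valid.sum() + 1.0)
    return bce + dice

depth = torch.rand(1, 1, 512, 1024) * 100.0          # predicted meters
mask_logits = torch.randn(1, 1, 512, 1024)           # range-mask head output (e.g. 50 m range)
valid = (torch.rand(1, 1, 512, 1024) > 0.2).float()  # GT pixels within the chosen range
print(gated_depth(depth, mask_logits).shape, mask_loss(mask_logits, valid).item())
```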
  6. Distortion-Aware and Geometry/Sharpness Losses (the secret sauce)
  • Distortion Map (fair grading across the panorama)

    • What: Weighting that compensates for equirectangular stretching (especially near poles).
    • Why: Without it, the model may overfit or underfit distorted areas.
    • Example: Ceiling corners won’t unfairly dominate training.
  • SILog Loss (baseline depth accuracy)

    • What: A scale-invariant log loss commonly used for depth regression.
    • Why: Stabilizes learning across varying depth scales while keeping metric targets.
    • Example: Prevents near vs far imbalance from exploding errors.
  • Sharpness-Centric Optimization

    a) DF-Gram Loss (detail preservation)
    • What: Split the panorama into 12 perspective patches (icosahedron-like views), normalize them, then match Gram (structure) statistics between predicted and ground-truth depths.
    • Why: Panoramas stretch; patches keep fine details intact for better supervision.
    • Example: Window grids remain crisp lines, not smudges.
    b) Gradient Loss (edge focus in ERP)
    • What: Use Sobel gradients to find strong edges; apply SILog only on those edge regions.
    • Why: Edges define object boundaries; sharpening them improves shape clarity.
    • Example: Table edges and stair steps pop clearly.
  • Geometry-Centric Optimization

    a) Normal Loss
    • What: Convert depth to surface normals; penalize normal differences.
    • Why: Keeps planes flat and orientations correct.
    • Example: Floors stay horizontal; walls stay vertical.
    b) Point-Cloud Loss
    • What: Project depth to 3D points on the sphere; match predicted and true 3D locations.
    • Why: Aligns actual 3D geometry, not just image pixels.
    • Example: A lamppost’s pole stands straight and aligned in 3D.
  • Overall Objective

    • Total loss = Distortion-weighted sum of SILog + DF-Gram + Gradient + Normal + Point-Cloud + Mask losses.
    • Why: Each term addresses a failure mode (distortion bias, blur, shape drift, range instability).
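
To ground two of the terms above, here is a hedged sketch of a per-pixel-weighted SILog loss and an edge-focused variant that only scores pixels with strong Sobel gradients in the ground truth. The Ī» value, edge threshold, and term weights are illustrative, not the paper's.

```python
# Sketch of a distortion-weighted SILog loss and a Sobel-edge SILog term (illustrative values).
import torch
import torch.nn.functional as F

def silog(pred: torch.Tensor, gt: torch.Tensor, w: torch.Tensor, lam: float = 0.85) -> torch.Tensor:
    """Scale-invariant log loss, weighted per pixel (w = distortion map x valid mask)."""
    d = torch.log(pred.clamp(min=1e-3)) - torch.log(gt.clamp(min=1e-3))
    n = w.sum().clamp(min=1.0)
    mean = (w * d).sum() / n
    var = (w * (d - mean) ** 2).sum() / n
    return torch.sqrt(var + (1.0 - lam) * mean ** 2)

def edge_silog(pred: torch.Tensor, gt: torch.Tensor, w: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Apply SILog only where the ground-truth depth has strong Sobel edges."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    gx = F.conv2d(gt, kx, padding=1)
    gy = F.conv2d(gt, kx.transpose(2, 3), padding=1)
    edges = ((gx ** 2 + gy ** 2).sqrt() > thresh).float()
    return silog(pred, gt, w * edges)

pred = torch.rand(1, 1, 64, 128) * 10 + 0.5
gt = torch.rand(1, 1, 64, 128) * 10 + 0.5
w = torch.ones_like(gt)                       # stand-in for the cos-latitude distortion map
total = silog(pred, gt, w) + 0.5 * edge_silog(pred, gt, w)
print(total.item())
```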
  7. Inference (using the model)
  • Input a single panorama.
  • Choose or auto-select a reasonable range mask (e.g., 50 m for streets).
  • Run DAP to get a meter-true depth map.
  • Example: On a plaza scene, you recover near benches (~2 m), a fountain (~15 m), and a cathedral facade (~60 m).
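
The point-cloud loss (and any downstream 3D use of the output) relies on back-projecting an equirectangular depth map onto the sphere. A minimal sketch follows; the axis convention is an assumption, since different codebases order x/y/z differently.

```python
# Sketch of spherical back-projection: every ERP pixel becomes a ray, depth scales it into meters.
import numpy as np

def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Convert an HxW equirectangular depth map (meters) into an HxWx3 point cloud."""
    h, w = depth.shape
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi   # -pi .. pi, left to right
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi         # +pi/2 (top) .. -pi/2 (bottom)
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),             # unit ray direction per pixel
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    return depth[..., None] * dirs

points = erp_depth_to_points(np.full((512, 1024), 5.0))      # a 5 m sphere of points
print(points.shape, np.linalg.norm(points[256, 512]))        # (512, 1024, 3), ~5.0
```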

04Experiments & Results

The Test (what they measured and why):

  • Datasets: Stanford2D3D (indoor), Matterport3D (indoor), Deep360 (outdoor), plus their new outdoor DAP-Test.
  • Metrics:
    • AbsRel (lower is better): average relative error in meters.
    • RMSE (lower is better): typical size of errors.
    • Ī“ (higher is better): percent of pixels close to the right answer.
  • Why these: Together they capture precision (Ī“), typical error size in meters (RMSE), and fairness across near and far distances (AbsRel).
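
For reference, the three metrics are computed roughly as in the sketch below (standard definitions with the common Ī“ < 1.25 threshold; the paper's evaluation code may differ in masking details).

```python
# Sketch of the standard depth metrics: AbsRel, RMSE, and delta-1.
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25):
    valid = gt > 0                                   # ignore pixels with no ground truth
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)             # lower is better
    rmse = np.sqrt(np.mean((p - g) ** 2))            # lower is better, in meters
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < thresh)                 # higher is better (fraction of pixels)
    return abs_rel, rmse, delta1

gt = np.random.default_rng(0).uniform(0.5, 50.0, size=(512, 1024))
pred = gt * 1.05                                     # a prediction that is 5% too far everywhere
print(depth_metrics(pred, gt))                       # AbsRel = 0.05, delta1 = 1.0
```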

The Competition (baselines):

  • Metric depth: DAC, Unik3D.
  • Scale-invariant references: MoGe, DepthAnything, PanDA, DA (for context; these typically need scale alignment at test time).

The Scoreboard (zero-shot, no fine-tuning):

  • Stanford2D3D (indoor): DAP gets AbsRel ā‰ˆ 0.0921, RMSE ā‰ˆ 0.3820, Ī“ ā‰ˆ 0.9135.
    • Context: That’s like scoring an A when many older models scored a B. Edges and long halls look correct.
  • Matterport3D (indoor): DAP hits AbsRel ā‰ˆ 0.1186, RMSE ā‰ˆ 0.7510, Ī“ ā‰ˆ 0.8518.
    • Context: Still strong zero-shot, maintaining scale indoors across varied rooms.
  • Deep360 (outdoor): DAP reaches AbsRel ā‰ˆ 0.0659, RMSE ā‰ˆ 5.2240, Ī“ ā‰ˆ 0.9525.
    • Context: Outdoor is hardest—sky and far buildings—but DAP scores top marks, like an A+ where others wobble.

Their New Benchmark (DAP-Test, outdoor):

  • DAP: AbsRel ā‰ˆ 0.0781, RMSE ā‰ˆ 6.804, Ī“ ā‰ˆ 0.9370.
  • Versus DAC and Unik3D: Huge error drop (AbsRel 0.25–0.32 down to ~0.08) and big Ī“ jump (from ~0.52–0.61 up to ~0.94).
  • Meaning: The data engine + curated pseudo-labels + range-aware design pays off outdoors.

Surprising/Notable Findings:

  • Long-range stability: The range mask helps far distances not destabilize training. 100 m threshold gave the best overall trade-off in ablations.
  • Distortion-aware ingredients matter: Adding distortion map, then geometry losses, then sharpness losses improved steadily; the full recipe worked best.
  • Visuals match numbers: Qualitative results show crisp edges (e.g., furniture edges, building outlines) and stable sky/distant regions, where older models often fail.

Ablations (what breaks without each part):

  • Remove distortion map: Slightly worse metrics—training less stable near poles.
  • Remove geometry losses (normals/points): Shapes drift; planes less consistent.
  • Remove sharpness losses (DF-Gram/gradient): Boundaries blur; fine details get lost.
  • Remove range mask: Performance drops, especially for far distances (outdoor scenes).

05Discussion & Limitations

Limitations:

  • Sky ambiguity: The sky lacks texture and depth; while DAP is robust, absolute meters for sky regions are inherently tricky.
  • Out-of-distribution scenes: Extremely unusual cameras, heavy motion blur, or exotic image artifacts can still confuse the model.
  • Metric anchoring from a single frame: Monocular metric depth is hard; certain lighting or reflective surfaces may cause local scale errors.
  • Pseudo-label ceiling: Even with a discriminator, some pseudo-label noise remains and might bias training in rare cases.

Required Resources:

  • Data: Access to large panoramic datasets (2M scale) or a similar curation pipeline.
  • Compute: Multi-GPU training (the paper used H20 GPUs), and storage for millions of images.
  • Engineering: Tools for web curation, filtering horizons, category splitting (e.g., with a vision-language model), and panorama handling.

When NOT to Use:

  • If you need perfect precision on specialized sensors (e.g., thermal panoramas) the model hasn’t seen.
  • If your environment is extremely dynamic (crowds running, flashing lights) and you need millisecond-latency updates without any smoothing.
  • If you cannot tolerate any residual errors in reflective or glass-heavy scenes.

Open Questions:

  • Automatic range selection: Can the model self-tune the best range mask per scene at test time?
  • Less supervision: How far can we push self-training without synthetic labels at the start?
  • Multimodal fusion: Would pairing panoramas with IMU or sparse depth further stabilize meters outdoors?
  • Beyond ERP: Could alternative spherical representations reduce distortion even more during training and inference?
  • Continual learning: Can the data-in-the-loop cycle run safely post-deployment to keep improving on new cities/seasons?

06Conclusion & Future Work

Three-Sentence Summary:

  • DAP is a panoramic metric-depth foundation model trained with a massive data engine and a three-stage pseudo-label pipeline that bridges indoor/outdoor and synthetic/real gaps.
  • A simple range mask head plus distortion-aware sharpness and geometry losses keep long-range predictions stable, edges crisp, and shapes correct.
  • In zero-shot tests across multiple benchmarks, DAP achieves state-of-the-art performance and robust, scale-consistent results in real scenes.

Main Achievement:

  • Showing that large-scale, curated panoramic data combined with a geometry/sharpness-aware architecture can produce a single model that predicts true meters for any 360° scene.

Future Directions:

  • Auto-selecting range masks, better sky modeling, and fusing extra sensors for even steadier metric scale.
  • Exploring new spherical representations to reduce distortion further.
  • Extending the data-in-the-loop pipeline to continual learning in the wild.

Why Remember This:

  • It turns millions of unlabeled panoramas into reliable training fuel, proving that a careful loop of data and model design can unlock robust, meter-true 360° depth for robots, AR, and mapping—no fine-tuning required.

Practical Applications

  • Indoor robot navigation using affordable 360° cameras to avoid obstacles and plan routes in meters.
  • AR furniture placement that snaps to true distances in panoramic room scans.
  • Drone surveys of parks and streets with robust long-range depth for safer flight and mapping.
  • Rapid 3D documentation of real estate, construction sites, and museums from single panoramas.
  • Video post-production tools that separate foreground/background and relight scenes using consistent depth.
  • Tourism and education apps that measure room sizes and exhibit distances in virtual walkthroughs.
  • Smart security cameras that understand scene layout and detect blocked paths or hazards.
  • Assistive technology that warns users about steps, drop-offs, or nearby obstacles using panoramic wearables.
  • Game engines that import real spaces via 360° captures with meter-accurate geometry for level design.
  • Urban planning tools that monitor sidewalk widths, curb heights, and street furniture spacing from panoramic sweeps.
#panoramic depth estimation#metric depth#360-degree vision#foundation model#pseudo-labeling#domain gap#equirectangular projection#range mask#DINOv3-Large#geometry-aware losses#sharpness-centric optimization#point cloud loss#surface normals#zero-shot generalization#AirSim360 UE5