
AnyDepth: Depth Estimation Made Easy

Intermediate
Zeyu Ren, Zeyu Zhang, Wukai Li et al. · 1/6/2026
arXiv · PDF

Key Summary

  • AnyDepth is a new, simple way for a computer to tell how far things are in a picture using just one image (monocular depth).
  • It keeps the image-understanding part (DINOv3) frozen and adds a tiny, smart decoder called SDT that is much smaller than popular decoders like DPT.
  • Instead of juggling many branches and sizes like DPT, SDT fuses information once and then upsamples in a single path, saving lots of compute.
  • A special detail helper (SDE) sharpens edges and textures so thin objects and borders look clear.
  • A learnable upsampler (DySample) replaces blurry bilinear resizing, bringing back crisp details as the picture gets bigger.
  • They also clean the training data with two quick tests (Depth Distribution and Gradient Continuity), throwing out noisy, harmful samples.
  • With cleaner data (about 369K images kept from 584K) and a lighter decoder, AnyDepth matches or beats DPT’s accuracy with about 85–89% fewer decoder parameters.
  • AnyDepth cuts FLOPs and latency, especially at higher resolutions, and runs better on small devices like a Jetson Orin Nano.
  • Ablation studies show each piece (filtering → SDE → DySample) adds a meaningful boost.
  • The big message: simple design + clean data can deliver fast, reliable, zero-shot depth without giant models or huge datasets.

Why This Research Matters

Depth from a single camera makes phones, robots, and AR apps smarter without extra sensors. AnyDepth shows you can get sharp, reliable depth quickly—even on small devices—by using a tiny decoder and cleaner data. This means smoother AR placement, safer robot navigation, and faster 3D tools for creators. Because the encoder is frozen and the decoder is simple, researchers and developers can reproduce results without giant compute budgets. Edge devices benefit from lower memory and latency, which translates to longer battery life and more responsive behavior. Overall, it lowers the barrier to bringing 3D understanding into everyday products.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how you can tell your friend is closer than a tree in the background just by looking at a photo? Your brain guesses depth from a single picture.

🥬 Filling (The Actual Concept): Monocular depth estimation is when a computer looks at one image and estimates how far away every pixel is.

  • How it works: (1) The computer learns patterns that hint at distance (like size, blur, and texture). (2) It turns the image into features. (3) It predicts a depth map, a picture where each pixel says “how far.”
  • Why it matters: Without it, robots, phones, and AR apps would struggle to understand 3D from a single camera.

🍞 Bottom Bread (Anchor): A phone can place a virtual chair in your room at the right size and spot because it knows the room’s depth from just one photo.

The World Before:

  • For years, strong depth systems leaned on heavy decoders like DPT and on giant data collections. Models fused information from many feature scales with complex, multi-branch wiring. It worked, but it was slow and bulky.
  • Big data approaches (like Depth Anything) trained on tens of millions of images. Powerful, yes—but expensive to build, hard to reproduce, and full of noisy labels that can teach the wrong lessons.

🍞 Top Bread (Hook): Imagine packing for a trip with five backpacks instead of one. You’d be slow and tired.

🥬 Filling: DPT’s multi-branch, multi-scale decoder is like carrying many backpacks. It reassembles features from each layer into different sizes, aligns them again and again, and only then upsamples.

  • How it works (simplified): (1) Turn tokens into maps at several scales per layer. (2) Align them across branches. (3) Fuse them repeatedly. (4) Upsample with bilinear interpolation.
  • Why it matters: This gives detail, but costs lots of parameters, compute, and time. Bilinear upsampling can blur edges.

🍞 Bottom Bread: On a small robot, that heavy decoder can be too slow to react to a chair suddenly in the way.

The Problem:

  • Architecturally: Reassembling and fusing across multiple branches is expensive and slows inference, especially at high resolution. Fixed bilinear upsampling softens fine details like thin rails and object borders.
  • Data-wise: Huge, mixed datasets bring label noise and uneven depth distributions (for example, almost all pixels near or far), which can confuse training.

Failed Attempts:

  • “Just make it bigger”: Larger backbones and more complex decoders often improved benchmarks but exploded compute and memory.
  • “Just add more data”: Scaling from millions to tens of millions helps some cases, but it also scales noise, cost, and makes results hard to reproduce.

The Gap:

  • A need for a decoder that is simple, single-path, and learnable in key places (like upsampling), paired with a data strategy that values quality over quantity.

🍞 Top Bread (Hook): Think about sharpening a photo on your phone. If the photo itself is blurry, fancy filters won’t fix everything.

🥬 Filling: Data quality filtering removes samples that are likely to teach bad lessons.

  • How it works: (1) Check if depths cover the full range (not just near or far). (2) Check if smooth surfaces have smooth depth changes. (3) Toss out the worst 20% by each test.
  • Why it matters: Cleaner data makes training more stable and less biased.

🍞 Bottom Bread (Anchor): If your homework set has pages with smudged ink, you toss those pages; you learn better from the clean ones.

Real Stakes:

  • Edge devices (drones, home robots, phones) need fast, light models with low memory. Waiting even half a second longer can be unsafe for a robot.
  • Developers and researchers want something reproducible without needing massive compute or 60+ million images.
  • AR, mapping, and 3D creation become smoother when fine edges and tiny structures are preserved, not blurred.

02 Core Idea

The “Aha!” Moment in one sentence: Fuse features once while they’re still tokens, reassemble and upsample them just once along a single path with a learnable upsampler, and train only on clean data.

Three Analogies:

  1. One-pot cooking: Instead of cooking each ingredient in a separate pan (multi-branch) and mixing later, toss all prepped ingredients into one pot (token fusion), then finish with a careful simmer (single-path upsampling).
  2. Jigsaw puzzle: Sort and weigh pieces once (weighted token fusion), assemble the full picture in a single build (reassemble once), and use a magnifying glass to tidy edges (SDE + DySample).
  3. Travel light: Carry one smart backpack (SDT) instead of five heavy ones (multi-branch DPT) and toss junk from your bag (data filtering); you’ll walk faster and farther.

Before vs After:

  • Before: Multi-branch reassembly, repeated cross-scale alignment, fixed bilinear upsampling, and massive noisy datasets.
  • After: Single-path fusion→reassemble→learnable upsampling (DySample) with an SDE for crisp details, trained on a filtered, smaller dataset that’s cleaner and cheaper.

Why It Works (intuition):

  • Fusing first avoids paying the cost of building and aligning multiple feature maps per layer; it also lets the model balance low-level textures and high-level meaning via learnable weights that add to 1 (like slicing a pie fairly).
  • The Spatial Detail Enhancer (SDE) adds local continuity, rescuing textures that can get lost when tokens are reshaped into a map.
  • DySample learns where to sample when scaling up, so thin structures and object borders don’t get blurred like with bilinear.
  • Filtering out skewed or noisy samples teaches the model consistent geometry, improving zero-shot generalization without more data.

Building Blocks (each with a Sandwich):

🍞 Top Bread: Imagine a bilingual dictionary that turns sentences into the same language before you combine them.

🥬 Filling: Weighted Token Fusion gives each encoder layer a learnable weight (normalized to sum to 1) and linearly projects tokens to a shared size.

  • How it works: (1) Project tokens from several layers to 256-d. (2) Give each a learnable scalar weight. (3) Softmax the weights. (4) Sum them into one fused token set.
  • Why it matters: Without it, the model can’t easily balance sharp edges from low layers and scene meaning from high layers.

🍞 Bottom Bread: When asking “What’s important here?”, the model can lean more on edges near fences and more on semantics for big walls.
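To make the fusion concrete, here is a minimal PyTorch sketch of the idea. The class name `WeightedTokenFusion`, the four 384-d input layers, and the toy usage are illustrative assumptions, not the authors’ code; only the pattern (project each layer, softmax the weights, sum) follows the description above.

```python
# Minimal sketch of weighted token fusion (illustrative; not the authors' code).
import torch
import torch.nn as nn

class WeightedTokenFusion(nn.Module):
    """Project tokens from several encoder layers to a shared width, then
    combine them with softmax-normalized learnable weights (they sum to 1)."""
    def __init__(self, in_dims, fused_dim=256):
        super().__init__()
        # One linear projection (+ GELU) per selected encoder layer.
        self.projs = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, fused_dim), nn.GELU()) for d in in_dims]
        )
        # One learnable scalar per layer; softmax turns them into slices of a pie.
        self.layer_logits = nn.Parameter(torch.zeros(len(in_dims)))

    def forward(self, token_list):
        # token_list: one [B, N, C_i] tensor per selected encoder layer.
        weights = torch.softmax(self.layer_logits, dim=0)
        projected = [proj(t) for proj, t in zip(self.projs, token_list)]
        return sum(w * t for w, t in zip(weights, projected))  # [B, N, fused_dim]

# Toy usage: four fake layers of 384-d tokens from a 768x768 image (48x48 patches).
tokens = [torch.randn(2, 48 * 48, 384) for _ in range(4)]
fused = WeightedTokenFusion(in_dims=[384] * 4)(tokens)
print(fused.shape)  # torch.Size([2, 2304, 256])
```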

🍞 Top Bread: You know how you first mix batter, then bake once—you don’t half-bake five times.

🥬 Filling: Fusion→Reassemble (SDT) means fuse as tokens first, then reshape and upsample once along a single path.

  • How it works: (1) Fuse tokens. (2) Reshape to a feature map. (3) Run through SDE. (4) Upsample progressively with DySample. (5) Predict depth.
  • Why it matters: Avoids the heavy costs of multi-branch reassembly used by DPT.

🍞 Bottom Bread: Like baking one cake well instead of juggling five ovens.

🍞 Top Bread: Think of a photo-sharpening tool that restores tiny hairs and edges.

🥬 Filling: Spatial Detail Enhancer (SDE) uses a light, depthwise 3×3 convolution with a residual connection to boost local structure.

  • How it works: (1) Depthwise conv models small neighborhoods. (2) Batch norm stabilizes. (3) Residual add + ReLU sharpen details.
  • Why it matters: Without SDE, reshaped tokens can look blocky and lose texture.

🍞 Bottom Bread: Thin tree branches pop out more clearly.
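A minimal sketch of an SDE-style block, assuming exactly the order described above (depthwise 3×3 conv → batch norm → residual add → ReLU); the class name and channel width are illustrative assumptions.

```python
# Minimal sketch of a Spatial Detail Enhancer-style block (illustrative).
import torch
import torch.nn as nn

class SpatialDetailEnhancer(nn.Module):
    """Depthwise 3x3 conv + batch norm with a residual connection and ReLU,
    used to restore local texture after tokens are reshaped into a map."""
    def __init__(self, channels=256):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels, bias=False)  # depthwise
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: [B, C, H, W] feature map reshaped from fused tokens.
        return self.act(x + self.bn(self.dwconv(x)))  # residual add, then ReLU

feat = torch.randn(1, 256, 48, 48)
print(SpatialDetailEnhancer(256)(feat).shape)  # torch.Size([1, 256, 48, 48])
```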

🍞 Top Bread: Zooming a photo by “smart pixels” looks better than just stretching it.

🥬 Filling: DySample is a learnable upsampler that figures out exactly where to sample when enlarging the map.

  • How it works: (1) Learn tiny offsets for sampling positions. (2) Sample with a differentiable grid. (3) Do this in two ×4 steps (as four ×2) with refinements in between.
  • Why it matters: Bilinear interpolation blurs high-frequency details; learned sampling brings them back.

🍞 Bottom Bread: Window blinds look like crisp slats, not mushy stripes.
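DySample comes from prior work; the sketch below is a simplified, illustrative ×2 version (learned offsets added to a regular grid, then `F.grid_sample`), not the official implementation. The class name and offset range are assumptions.

```python
# Simplified sketch of a DySample-style x2 upsampler (illustrative; not the
# official DySample code): a 1x1 conv predicts per-pixel sampling offsets,
# which shift a regular grid before F.grid_sample resamples the feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleUp(nn.Module):
    def __init__(self, channels, scale=2, max_offset_px=0.5):
        super().__init__()
        self.scale = scale
        self.max_offset_px = max_offset_px
        # Predict 2 * scale^2 offset channels, pixel-shuffled to output resolution.
        self.to_offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        nn.init.zeros_(self.to_offset.weight)   # start as plain bilinear upsampling
        nn.init.zeros_(self.to_offset.bias)

    def forward(self, x):
        b, _, h, w = x.shape
        H, W = h * self.scale, w * self.scale
        # Learned offsets at the output resolution, kept small (in pixels).
        off = F.pixel_shuffle(self.to_offset(x), self.scale)        # [B, 2, H, W]
        off = torch.tanh(off) * self.max_offset_px
        # Regular sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1.0, 1.0, H, device=x.device)
        xs = torch.linspace(-1.0, 1.0, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, H, W, 2)
        # Convert pixel offsets to normalized coordinates and shift the grid.
        norm = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1)], device=x.device)
        shift = off.permute(0, 2, 3, 1) * norm
        return F.grid_sample(x, grid + shift, mode="bilinear", align_corners=True)

# Toy usage: 48x48 -> 96x96 feature map.
print(DySampleUp(256)(torch.randn(1, 256, 48, 48)).shape)  # torch.Size([1, 256, 96, 96])
```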

🍞 Top Bread: When studying, you skip the pages with smudged ink.

🥬 Filling: Data-Centric Filtering keeps only good training samples.

  • How it works: (1) Toss images with <20% valid depth. (2) Rank by Depth Distribution Score (even coverage) and Gradient Continuity Score (smooth surfaces, sharp edges). (3) Drop the worst 20% in each.
  • Why it matters: Cleaner data trains faster, generalizes better.

🍞 Bottom Bread: From 584K samples, they kept about 369K high-quality ones and did better with less.
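The exact scoring formulas aren’t spelled out here, so the sketch below uses plausible stand-ins: normalized histogram entropy for the Depth Distribution Score and a median-gradient statistic for the Gradient Continuity Score. Treat it as an illustration of the filtering recipe, not the paper’s code.

```python
# Illustrative sketch of the two quality checks; the paper's exact formulas may
# differ, so treat these scores as plausible stand-ins, not the real recipe.
import numpy as np

def depth_distribution_score(depth, valid, bins=20):
    """Higher when valid depths cover the near-to-far range evenly
    (normalized histogram entropy over [0, 1]-scaled depth)."""
    d = depth[valid]
    if d.size == 0:
        return 0.0
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(bins))   # in [0, 1]

def gradient_continuity_score(depth, valid):
    """Higher when depth varies smoothly; noisy, jittery labels score low."""
    gy, gx = np.gradient(depth)
    g = np.sqrt(gx ** 2 + gy ** 2)[valid]
    if g.size == 0:
        return 0.0
    return float(1.0 / (1.0 + np.median(g)))

def keep_sample(depth, valid, dd_thresh, gc_thresh):
    """Mimic the recipe: require >=20% valid pixels, then require both scores
    to clear thresholds (e.g., each dataset's 20th percentile of scores)."""
    if valid.mean() < 0.20:
        return False
    return (depth_distribution_score(depth, valid) >= dd_thresh and
            gradient_continuity_score(depth, valid) >= gc_thresh)
```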

03 Methodology

At a high level: Image → Frozen DINOv3 encoder → Linear projection + weighted token fusion → Reshape to feature map → Spatial Detail Enhancer (SDE) → Progressive DySample upsampling → Depth head → Disparity output (d' = 1/d).
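Here is a hedged sketch of how such a single-path decoder could be wired together, reusing the `WeightedTokenFusion`, `SpatialDetailEnhancer`, and `DySampleUp` classes sketched in the previous section. The refinement convs, the sigmoid head, and the layer widths are illustrative assumptions, not the paper’s exact architecture.

```python
# Hedged end-to-end sketch of the single-path decoder, reusing the
# WeightedTokenFusion, SpatialDetailEnhancer, and DySampleUp classes sketched
# earlier. Layer choices, refinement convs, and the sigmoid head are assumptions.
import torch
import torch.nn as nn

class SimpleDepthDecoder(nn.Module):
    def __init__(self, in_dims, dim=256, patch=16):
        super().__init__()
        self.patch = patch
        self.fuse = WeightedTokenFusion(in_dims, dim)
        self.sde = SpatialDetailEnhancer(dim)
        # Two x4 stages realized as four x2 DySample steps, each followed by a
        # light 3x3 refinement conv.
        self.ups = nn.ModuleList([DySampleUp(dim) for _ in range(4)])
        self.refines = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(4)
        ])
        self.head = nn.Sequential(
            nn.Conv2d(dim, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),   # assumed head: disparity in [0, 1]
        )

    def forward(self, token_list, image_hw):
        h, w = image_hw[0] // self.patch, image_hw[1] // self.patch
        fused = self.fuse(token_list)                                 # [B, N, dim]
        x = fused.transpose(1, 2).reshape(fused.shape[0], -1, h, w)   # tokens -> map
        x = self.sde(x)                                               # restore local detail
        for up, refine in zip(self.ups, self.refines):
            x = refine(up(x))                                         # x2 per step, x16 total
        return self.head(x)                                           # normalized disparity

# Toy usage with the frozen encoder replaced by random 384-d tokens.
tokens = [torch.randn(1, 48 * 48, 384) for _ in range(4)]
disparity = SimpleDepthDecoder(in_dims=[384] * 4)(tokens, image_hw=(768, 768))
print(disparity.shape)  # torch.Size([1, 1, 768, 768])
```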

Step-by-step (with why and a mini example):

  1. Input and Encoder (DINOv3, frozen)
  • What happens: Feed a 768×768 RGB image into a pre-trained DINOv3. Grab tokens from four intermediate layers (e.g., [2,5,8,11] for S/B; [4,11,17,23] for L).
  • Why this step exists: Pretrained visual features carry rich texture and semantics. Freezing them saves compute and stabilizes training.
  • Example: For a street photo, low layers catch edges of road lines; high layers understand “car vs. building.”
  2. Linear Projection to a Shared Space
  • What happens: Project tokens from each chosen layer to 256 dimensions with a linear layer + GELU.
  • Why: Aligns all layers to the same size so they can be combined cheaply; 256-d keeps compute low.
  • Example: Translating four languages into the same language before combining ideas.
  3. Weighted Token Fusion
  • What happens: Give each layer a learnable scalar weight; softmax the weights (sum to 1); sum weighted tokens to one fused representation.
  • Why: The model can adaptively prefer edges (lower layers) or semantics (higher layers) per image.
  • Example: In a forest scene, boost lower layers to capture leaves; in a room, boost higher layers to understand big flat walls.
  4. Reshape and Spatial Detail Enhancer (SDE)
  • What happens: Turn fused tokens into a spatial map; run a depthwise 3×3 convolution + batch norm + residual add + ReLU.
  • Why: Reshaping can lose local continuity; SDE repairs and sharpens textures.
  • Example: After reshaping, thin railings might look faint; SDE brings their edges back.
  5. Progressive Upsampling with DySample
  • What happens: Upsample in two ×4 stages (implemented as four ×2 DySamples), inserting a light 3×3 refinement after each.
  • Why: Jumping straight ×16 forces big, error-prone offsets. Small, learnable steps keep details stable and crisp.
  • Example: Like zooming a photo in small, smart steps so text stays readable.
  6. Depth Head and Output
  • What happens: A final conv head predicts disparity d' = 1/d. Inputs and ground-truth are normalized to [0,1].
  • Why: Disparity often makes optimization smoother for a wide range of depths.
  • Example: Very far objects have tiny disparities, making the numbers easier to manage.
  7. Losses and Optimization
  • What happens: Use a scale- and shift-invariant loss (L_ssi) and a gradient-matching loss (L_gm) with weights 1:2. Train with AdamW (lr 1e-3), PolyLR (power 0.9), 2-epoch warmup, total 5 epochs. The encoder stays frozen.
  • Why: L_ssi removes global scale/shift ambiguity; L_gm teaches smooth surfaces and sharp edges. Short, stable training emphasizes the lightweight goal. (A hedged sketch of these two losses follows this step list.)
  • Example: If two datasets label depth with different scales, L_ssi keeps learning consistent.
  8. Data-Centric Filtering (before training)
  • What happens: From 584K synthetic samples across Hypersim, VKITTI2, BlendedMVS, IRS, TartanAir: (a) drop samples with <20% valid depth; (b) compute Depth Distribution Score (balanced use of near→far) and Gradient Continuity Score (smooth surfaces, sharp edges); (c) drop the lowest 20% by each score. Keep ~369K.
  • Why: Removing skewed/noisy labels stabilizes learning and reduces cost.
  • Example: Outdoor sets often cluster depths near the horizon; filtering rebalances what the model sees.
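As promised in step 7, here is a hedged sketch of the two losses in the spirit of MiDaS-style affine-invariant training, with the 1:2 weighting mentioned above. The paper’s exact formulations may differ, and the function names are mine.

```python
# Hedged sketch of affine-invariant (L_ssi) and gradient-matching (L_gm) losses
# with the 1:2 weighting from step 7. Shapes are per-image [H, W] disparity maps.
import torch

def align_scale_shift(pred, target, mask):
    """Least-squares scale s and shift t so that s*pred + t best fits target
    on valid pixels."""
    p, t = pred[mask], target[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)          # [M, 2]
    sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution     # [2, 1]
    return sol[0, 0] * pred + sol[1, 0]

def gradient_matching_loss(pred, target, mask, scales=4):
    """Penalize mismatched disparity gradients at several resolutions."""
    loss = 0.0
    for s in range(scales):
        step = 2 ** s
        diff = (pred[::step, ::step] - target[::step, ::step]) * mask[::step, ::step].float()
        gx = (diff[:, 1:] - diff[:, :-1]).abs()
        gy = (diff[1:, :] - diff[:-1, :]).abs()
        loss = loss + gx.mean() + gy.mean()
    return loss / scales

def depth_loss(pred, target, mask):
    aligned = align_scale_shift(pred, target, mask)
    l_ssi = (aligned - target)[mask].abs().mean()
    l_gm = gradient_matching_loss(aligned, target, mask)
    return l_ssi + 2.0 * l_gm                                 # 1:2 weighting

# Toy usage on a single 64x64 map.
pred, target = torch.rand(64, 64), torch.rand(64, 64)
mask = torch.ones(64, 64, dtype=torch.bool)
print(depth_loss(pred, target, mask).item())
```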

Secret Sauce:

  • Fuse-then-reassemble avoids repeated, costly multi-branch alignments. SDE fixes local detail lost in token reshaping. DySample restores edges during upsampling. Filtering raises the “signal-to-noise” of supervision. Together, they preserve high-frequency detail while slashing parameters and FLOPs.

What breaks without each step:

  • No weighted fusion: the model can’t balance layers; edges or semantics may dominate wrongly.
  • No SDE: details smear after reshaping; textures look flat.
  • No DySample: bilinear blurs thin structures and borders.
  • No filtering: skewed depths and noisy gradients teach bad habits, hurting zero-shot generalization.

04 Experiments & Results

The Test (what and why):

  • Zero-shot evaluation on NYUv2 (indoor), KITTI (driving), ETH3D (varied), ScanNet (indoor), and DIODE (indoor/outdoor). No fine-tuning on these sets.
  • Metrics: Absolute Relative Error (AbsRel, lower is better) and δ thresholds (percent of pixels within a 1.25× factor, higher is better). These capture both overall accuracy and how often predictions are close.
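Both metrics are easy to compute once predictions are aligned to ground-truth scale; the small NumPy sketch below uses my own function names and assumes that alignment has already been done.

```python
# Small sketch of the two headline metrics (function names are mine); it
# assumes predictions are already aligned to ground-truth scale, as is
# standard when evaluating affine-invariant depth models.
import numpy as np

def abs_rel(pred, gt, valid):
    """Mean of |pred - gt| / gt over valid pixels (lower is better)."""
    p, g = pred[valid], gt[valid]
    return float(np.mean(np.abs(p - g) / g))

def delta_accuracy(pred, gt, valid, threshold=1.25):
    """Share of valid pixels with max(pred/gt, gt/pred) < threshold (higher is better)."""
    p, g = pred[valid], gt[valid]
    return float(np.mean(np.maximum(p / g, g / p) < threshold))
```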

The Competition:

  • Baseline decoders: DPT (standard multi-branch) under the same frozen encoders. Other references include large-scale data approaches (Depth Anything v2) and huge encoders (e.g., DINOv3-7B), but AnyDepth’s goal is efficiency with smaller data.

The Scoreboard (with context):

  • Across ViT-S/B/L backbones, SDT (AnyDepth) consistently matches or outperforms DPT on AbsRel and δ, despite using about 85–89% fewer decoder parameters.
  • Think of it like this: AnyDepth gets an A or A− when DPT gets an A− or B+, but with a backpack that’s a fraction of the weight.
  • Efficiency: AnyDepth reduces FLOPs by roughly 37% at higher resolutions and trims training iteration time by about 10%. On a Jetson Orin Nano (4 GB) edge device, SDT cuts latency notably and needs about 33% less peak memory than DPT at 256×256, improving FPS.
  • Visual quality: Qualitative results show crisper edges and clearer thin structures for SDT than DPT, aligning with SDE and DySample’s design goals.

Surprising Findings:

  • Outdoor datasets often had depth values clustered near certain ranges, hurting the Depth Distribution Score. This imbalance can mislead training if left unfiltered.
  • Filtering out roughly 215K of 584K samples still improved results, showing quality beats quantity here.
  • Ablations confirm the stack: Filtering → +SDE → +DySample each adds an incremental gain; together they deliver the strongest performance.

Takeaway:

  • You don’t need a giant model or 60M+ images to get robust zero-shot depth. A simple decoder plus clean data can keep accuracy high and devices happy.

05 Discussion & Limitations

Limitations:

  • Not optimized for fully supervised or fine-tuned metric-depth settings; designed for zero-shot relative depth. If you need absolute centimeters with high precision, extra steps are required.
  • The filtering metrics (distribution and gradient continuity) are heuristic and simple; some tricky cases may slip through or be over-filtered.
  • While SDT is efficient, the final accuracy still lags massive, fully data-driven systems trained on tens of millions of images when unlimited compute is acceptable.

Required Resources:

  • A pre-trained DINOv3 encoder (frozen) and a modest GPU setup are sufficient. Memory and compute needs are much lower than multi-branch decoders, enabling edge deployment.

When NOT to Use:

  • If you must have state-of-the-art scores from huge data and don’t mind heavy compute, larger data-driven models may edge out SDT.
  • If your task demands metric-accurate depth for precise measurements (e.g., industrial metrology) without any calibration or fine-tuning, you’ll need additional adaptation.
  • Extremely textureless or reflective scenes may still challenge monocular methods; pairing with other signals (stereo, LiDAR) could help.

Open Questions:

  • Can we learn the filtering metrics end-to-end, letting the model discover which samples help most?
  • How well does SDT extend to related dense tasks: surface normals, segmentation, and metric depth with light supervision?
  • Could we co-train simple geometry priors (e.g., planar regions) to further stabilize high-resolution details?
  • What are the best fusion-weight patterns across diverse scenes, and can we predict them from context?
  • How far can we push mobile deployment (phones, micro-drones) with further pruning or quantization?

06 Conclusion & Future Work

3-Sentence Summary:

  • AnyDepth shows that a simple, single-path decoder (SDT) plus a quick, effective data filtering strategy can deliver strong zero-shot monocular depth with far fewer parameters and lower compute.
  • By fusing tokens once, enhancing local details, and using a learnable upsampler, the model preserves sharp edges and high-frequency structure without heavy multi-branch machinery.
  • Filtering out skewed and noisy samples lets a smaller dataset teach better lessons, improving generalization and efficiency.

Main Achievement:

  • An efficiency-first depth framework that matches or beats DPT’s accuracy while cutting decoder parameters by about 85–89%, reducing FLOPs, and improving latency—especially at higher resolutions and on edge devices.

Future Directions:

  • Extend to metric depth, normals, and other dense tasks; explore learnable, end-to-end data-quality estimators; push mobile deployment with compression and quantization; study fusion weight patterns for adaptive scene-aware decoding.

Why Remember This:

  • AnyDepth is a clear reminder that smart design and clean data can rival raw scale. It makes zero-shot depth more accessible, reproducible, and deployable—from research labs to tiny robots—without sacrificing the details that make 3D understanding truly useful.

Practical Applications

  • Real-time robot navigation on small computers, avoiding obstacles with a lightweight depth model.
  • AR furniture placement that looks stable and the right size using only a phone camera.
  • Drone flight in GPS-denied spaces by estimating scene geometry without heavy sensors.
  • 3D scanning and quick room modeling for interior design or construction planning.
  • Video game and VR scene understanding to ground characters and objects on correct surfaces.
  • Assistive technology (e.g., wearable aids) that warns about steps or obstacles using a single camera.
  • Smart home devices (vacuums, indoor patrol robots) that map rooms efficiently with low power.
  • Content creation pipelines that use depth as a control signal for stylization or relighting.
  • On-device photo enhancement that uses depth to separate subject and background cleanly.
  • Education and research demos that replicate results quickly on modest hardware.
Tags: monocular depth estimation · zero-shot depth · Simple Depth Transformer · SDT · DINOv3 · DPT · learnable upsampling · DySample · Spatial Detail Enhancer · SDE · data-centric learning · affine-invariant loss · weighted token fusion · edge deployment · FLOPs reduction