InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Key Summary
- InfiniDepth is a new way to predict depth that treats every image location as a smooth, continuous place you can ask for depth, not just the fixed pixels of a grid.
- Instead of upsampling a low-res depth map (which blurs details), it directly queries depth at any coordinate, so it naturally produces crisp 4K, 8K, or even 16K depth maps.
- It uses a Vision Transformer to understand the picture and a simple local implicit decoder to answer "what's the depth here?" for any tiny spot you choose.
- A special depth query strategy spreads 3D points evenly on surfaces, reducing holes and artifacts when creating new views from big camera moves.
- The authors built Synth4K, a 4K benchmark from five popular games, and new "high-frequency masks" to test fine details like edges and tiny structures.
- InfiniDepth reaches state-of-the-art results on both synthetic and real datasets, especially in fine-detail regions and in strict, metric-depth tests with sparse depth hints.
- Ablations show the neural implicit field (continuous querying) and multi-scale feature fusion are the key ingredients behind the accuracy and detail gains.
- For novel view synthesis, InfiniDepth's evenly distributed 3D points lead to cleaner renderings with fewer gaps than pixel-aligned baselines.
- Training uses sparse, sub-pixel supervision on synthetic data, which fits the continuous nature of the model and helps it learn fine geometry.
- Limitations include no explicit temporal consistency for videos and training mostly on synthetic data, but the approach opens doors to stable 4K+ depth and better single-image 3D.
Why This Research Matters
Sharp, scalable depth is the foundation for next-generation AR, VR, and mobile photography where effects must look crisp at any zoom. Robots and autonomous vehicles can better perceive thin obstacles and far structures when depth edges are clean, improving safety. Creators and gamers can get more reliable single-image 3D with fewer holes when the camera moves a lot. Industrial inspection, construction, and mapping benefit from accurate metric depth, especially when only a few precise measurements are available. Because InfiniDepth queries any coordinate directly, it future-proofs depth estimation as displays and sensors push to 4K, 8K, and beyond. Uniform 3D sampling also strengthens fast renderers like Gaussian splatting, linking perception and graphics. Overall, it moves depth from being a fixed, pixel-tied product to a flexible, high-fidelity service on demand.
Detailed Explanation
01 Background & Problem Definition
You know how when you zoom into a photo on your phone, it can look fuzzy because there just aren't enough pixels? That's what has happened to most depth estimators for years: they predict depth on a fixed set of pixels (a grid), and when you try to go beyond that grid, details blur.
The World Before:
- Many classic depth methods treated the depth map as values on a fixed image grid (or a graph). This made them easy to pair with CNNs and Transformers, but the output was stuck at the training resolution.
- To get higher resolution, people upsampled (stretched) the depth map. Upsampling acts like spreading butter: it smooths things out and washes away tiny edges and thin structures (like railings, wires, or leaf edges).
- Diffusion or heavy decoders can sharpen some details but are slow or still tied to grids, making arbitrary resolution tricky.
The Problem:
- Depth on a fixed grid can't directly answer, "What's the depth exactly at this tiny spot in between pixels?" That means it struggles with arbitrarily high output resolutions and often loses high-frequency details.
- In 3D uses like novel view synthesis (making new camera views), if you unproject per-pixel depth, you get clumpy 3D points: dense nearby, sparse far away, and uneven on slanted surfaces. This creates holes and artifacts when the camera moves a lot.
Failed Attempts:
- CRFs and graph models: elegant but hard to optimize at scale, and still tied to grids.
- CNN/Transformer decoders: strong generalization, but they either upsample (blurring) or project latents to patches (missing local variations), which harms fine detail.
- Diffusion-based depth: better boundaries but still generating discrete grids and often slower; VAE bottlenecks can hurt precision in metric depth.
The Gap:
- What's missing is a depth representation that is continuous. We need to ask for depth at any coordinate, not just at pixel centers, and do this efficiently while still using strong learned image features.
Real Stakes (Why you should care):
- Phones and AR glasses want sharp, stable depth for effects like portrait blur, object insertion, or measurement tools at any zoom level.
- Robots and cars benefit from crisper depth edges to avoid obstacles and perceive thin or far structures.
- Creators need clean geometry for single-image 3D, so big camera moves don't reveal holes in the scene.
- As displays and cameras move beyond 4K, depth should scale too, without retraining for every new resolution.
New Concepts (Sandwich explanations):
Hook: Imagine having a super-precise ruler that can measure anywhere on a page, not just on the printed grid lines. The Concept: Neural Implicit Fields represent signals (like depth) as a smooth function you can query at any coordinate.
- How it works:
- A small neural network learns âdepth = f(x, y)â for any point.
- You give it a coordinate (x, y), it returns the depth there.
- Because itâs continuous, you can ask at pixel centers or anywhere in between.
- Why it matters: If you only have grid points, you must stretch or guess between them, losing detail. A continuous field answers exactly where you ask. Anchor: It's like asking "what's the height of this mountain exactly here?" on a map, even if there's no printed dot there. (A minimal code sketch of such a field follows.)
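To make the idea concrete, here is a minimal sketch of a depth-as-a-function module in PyTorch. It is illustrative only, not the paper's architecture: InfiniDepth conditions its decoder on image features rather than raw coordinates, and the class name, layer sizes, and coordinate range here are made up.

```python
import torch
import torch.nn as nn

class TinyDepthField(nn.Module):
    """Toy 'depth = f(x, y)' field: any continuous coordinate in, one depth out."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),          # a single depth value per query
        )

    def forward(self, coords):             # coords: (N, 2), normalized to [0, 1]
        return self.net(coords).squeeze(-1)

field = TinyDepthField()
# Query at a pixel center and at a sub-pixel offset right next to it.
queries = torch.tensor([[0.5000, 0.2500], [0.5013, 0.2507]])
print(field(queries))                      # two depths; no fixed grid involved
```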
Hook: You know how you solve a big jigsaw by focusing on small areas first? The Concept: A Vision Transformer (ViT) looks at an image in patches and learns relationships across the whole picture.
- How it works:
- Split the image into patches (like puzzle pieces).
- Let the model compare all pieces to understand global context.
- Extract multi-level features that capture both details and meaning.
- Why it matters: Good features are the clues that help the depth function answer accurately at any spot. Anchor: Like finding the sky pieces vs. roof edges; ViT helps separate them clearly.
Hook: Think of a nearby street sign: to read it, you glance locally, not over the whole city. The Concept: A Local Implicit Decoder uses features from nearby locations (at multiple scales) to predict depth at a chosen coordinate.
- How it works:
- Build a feature pyramid (fine-to-coarse) from the ViT.
- For your coordinate, pull the local features at each scale via bilinear interpolation.
- Fuse them from high-res to low-res with a gated feed-forward block.
- A tiny MLP turns the fused feature into a depth value.
- Why it matters: Local focus preserves sharp edges; multi-scale context understands the scene shape, so both tiny and big structures are handled. Anchor: It's like asking a local guide for street details while also checking a city map for orientation.
Hook: If you sprinkle seeds by hand, some spots get too many, some too few. A spreader keeps it even. The Concept: The Depth Query Strategy spreads 3D points evenly on surfaces by asking for extra sub-pixel depth where surfaces cover more area.
- How it works:
- Estimate how much real 3D surface each pixel represents (farther and more slanted surfaces cover more area per pixel).
- Give those pixels a bigger "query budget."
- Sample extra sub-pixel coordinates inside those pixels and query their depths.
- Back-project to 3D to get a near-uniform point cloud.
- Why it matters: Even point coverage reduces holes and visual glitches when you move the camera a lot. Anchor: It's like painting a wall evenly by putting more paint where one brush stroke covers a bigger patch. (A short code sketch of the budgeting idea follows.)
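Below is a hedged numpy sketch of the budgeting idea. The `area_per_pixel` array is a stand-in for the paper's per-pixel surface-area estimate (which grows with distance and slant), and the allocation rule and numbers are illustrative, not the authors' exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, total_budget = 4, 4, 64
# Placeholder for the estimated real-world surface area each pixel covers.
area_per_pixel = rng.uniform(0.5, 3.0, size=(H, W))

# Give every pixel at least one query; hand out extras in proportion to area.
weights = area_per_pixel / area_per_pixel.sum()
budget = np.maximum(1, np.round(weights * total_budget)).astype(int)

queries = []
for y in range(H):
    for x in range(W):
        n = budget[y, x]
        # Jitter sub-pixel sample positions uniformly inside the pixel.
        queries.append(np.stack([x + rng.random(n), y + rng.random(n)], axis=1))
queries = np.concatenate(queries)      # continuous (x, y) coords to query for depth
print(budget.sum(), queries.shape)     # back-projecting these yields a more even cloud
```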
Hook: When you sharpen a photo, you boost edges; you're hunting for where things change quickly. The Concept: High-Frequency Depth Masks highlight depth pixels with lots of tiny, sharp changes (edges, fine structures) to test detail recovery.
- How it works:
- Analyze depth with multi-scale filters to find sharp variations.
- Create a mask of those "high-frequency" spots.
- Evaluate models specifically on these tricky regions.
- Why it matters: A model can look good overall but fail on details. This mask checks the hard parts. Anchor: Like grading handwriting by looking closely at the places with tight curves and corners. (A small sketch of one such mask follows.)
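One plausible way to build such a mask is sketched below: flag pixels whose depth gradient is unusually large at any of several scales. The scales and percentile threshold are assumptions for illustration; the paper's exact multi-scale filters may differ.

```python
import numpy as np

def high_frequency_mask(depth, scales=(1, 2, 4), pct=90):
    """Mark depth pixels with sharp local variation at any scale."""
    H, W = depth.shape
    mask = np.zeros((H, W), dtype=bool)
    for s in scales:
        d = depth[::s, ::s]                      # coarser view of the depth map
        gy, gx = np.gradient(d)
        mag = np.hypot(gx, gy)
        hf = mag > np.percentile(mag, pct)       # sharpest changes at this scale
        mask |= np.kron(hf, np.ones((s, s), dtype=bool))[:H, :W]
    return mask

# Toy depth: a near plane (1 m) surrounded by a far wall (5 m) gives a step edge.
depth = np.pad(np.ones((32, 32)), 16, constant_values=5.0)
print(high_frequency_mask(depth).mean())         # fraction of "fine detail" pixels
```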
02 Core Idea
Aha! Moment in one sentence: Represent depth as a continuous neural implicit field that you can query at any image coordinate, then feed it local, multi-scale features so it returns sharp, high-resolution depth anywhere you ask.
Three Analogies:
- Infinite zoom coloring book: Instead of coloring only within printed dots (pixels), you can color at any tiny spot and still stay inside the lines.
- On-demand thermometer: You can take a temperature reading anywhere in the room rather than only at a few fixed sensors.
- Smart metal detector: It scans broadly to know the scene but gives you a precise reading exactly where you point.
Before vs After:
- Before: Depth was tied to a pixel grid. To get higher resolution, you stretched the map, blurring edges and missing super-fine details. Unprojected 3D points were clumped and uneven, causing holes in new views.
- After: Depth is a function you can ask anywhere. You naturally get crisp 4K–16K depth without retraining for that size, and you can smartly sample sub-pixel points to build even 3D surfaces for stronger novel views.
Why It Works (intuition, no equations):
- Continuity: Treating depth as a smooth function means nearby coordinates have related depths, so you can predict in-between points without blurry upsampling guesses.
- Locality + Globality: The model queries local features (for edges) and fuses multi-scale context (for big shapes), balancing detail and stability.
- Adaptive sampling: Surfaces that "occupy more real area per pixel" get more queries, so the 3D sampling matches the scene's true shape.
Building Blocks (each piece as a mini sandwich reminder):
- Neural Implicit Field: A tiny network maps any (x, y) to depthâso you can ask anywhere.
- Feature Pyramid from ViT: Multi-level features capture details (edges) and meaning (objects, layout).
- Local Feature Query: At your coordinate, gather nearby features at each scale using bilinear interpolation, which is fast and accurate.
- Hierarchical Fusion with Gates: Start with high-res details, inject deeper semantic context step-by-step with a learnable gate that decides how much to mix.
- MLP Head: A small predictor turns the fused feature into a single depth value.
- Depth Query Strategy: Allocate more sub-pixel samples to pixels that represent larger chunks of surface, producing uniform 3D point clouds.
- HF Masks + Synth4K: A 4K benchmark and detail-focused masks to truly test fine edges and tiny structures.
Concrete Anchor Example:
- Suppose you want a 12K depth map of a city skyline. Traditional methods would upsample from lower-res and blur rooftop edges. InfiniDepth simply asks the depth field at every 12K coordinate, with no stretching, and keeps the antennae and cables crisp. For 3D, it places points more densely on far, slanted rooftops so new views don't show gaps.
03 Methodology
At a high level: Input RGB → ViT encoder → Reassemble into feature pyramid → Local feature query at any (x, y) → Hierarchical fusion (detail to semantics) → Tiny MLP → Depth value at (x, y). Repeat for as many coordinates as you like to form a depth map of any resolution.
Step-by-step (with "why" and mini examples):
- Image encoding with ViT
- What happens: The image is split into patches and processed by a pretrained DINOv3 ViT-Large. Features are taken from layers 4, 11, and 23 and projected to 256, 512, and 1024 channels, respectively. Shallow features are upsampled to higher spatial resolutions.
- Why this step exists: We need both fine local details (edges) and global context (room layout). Shallow layers carry texture/edges; deep layers carry semantics/structure.
- Example: In a living room photo, shallow features catch the crisp table edge; deep features grasp that the floor is planar and the wall is vertical.
- Reassemble into a feature pyramid
- What happens: The extracted multi-layer features are aligned into a pyramid of resolutions (high-res shallow, lower-res deep). This keeps detail at fine scales and meaning at coarse scales.
- Why: A single resolution canât do both sharp details and big-picture scene understanding well.
- Example: A fence with thin rails (needs high-res) in front of a building (needs semantic context).
- Local feature query at a continuous coordinate (x, y)
- What happens: For any chosen coordinate, even between pixels, the method maps it to each pyramid scale, looks at its 2×2 neighborhood, and uses bilinear interpolation to produce one local feature vector per scale.
- Why: Bilinear interpolation is fast, stable, and parameter-free; it cleanly blends nearby information. Without the local query, you'd miss sharp edges (by mixing unrelated regions) or overfit. (A short sketch of this lookup follows this step.)
- Example data: Query at x=1000.3, y=512.7 in a 4K image; at each scale, it gathers the four neighboring feature cells and blends them.
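This bilinear lookup can be written with `torch.nn.functional.grid_sample`, which blends the 2×2 neighborhood exactly as described. The sketch below is a simplified stand-in: the feature-map sizes and the coordinate normalization convention are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def query_pyramid(pyramid, xy, image_size):
    """pyramid: list of (1, C, H, W) feature maps; xy: (N, 2) pixel coords in the image."""
    W_img, H_img = image_size
    # Map pixel coordinates to grid_sample's [-1, 1] range.
    grid = torch.stack([xy[:, 0] / (W_img - 1) * 2 - 1,
                        xy[:, 1] / (H_img - 1) * 2 - 1], dim=-1).view(1, -1, 1, 2)
    # Each scale returns one bilinearly blended feature vector per query point.
    return [F.grid_sample(f, grid, mode='bilinear', align_corners=True)
                .squeeze(-1).squeeze(0).t()          # -> (N, C)
            for f in pyramid]

pyramid = [torch.randn(1, 256, 270, 480), torch.randn(1, 512, 135, 240)]  # toy pyramid
xy = torch.tensor([[1000.3, 512.7]])                 # sub-pixel query in a 4K frame
print([f.shape for f in query_pyramid(pyramid, xy, image_size=(3840, 2160))])
```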
- Hierarchical fusion from detail to semantics
- What happens: Start with the feature from the highest resolution. Then, for each next-lower resolution scale, project and add it with a learnable gate that decides how much of the previous detail to keep. A small FFN refines at each step.
- Why: Edges alone can be noisy; semantics alone can be blurry. The gated fusion balances them. Without it, edges get lost or predictions become unstable near boundaries.
- Example: The fusion keeps the crisp silhouette of a statue while using deeper features to avoid confusing the statue with the background.
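A minimal sketch of what such gated fusion could look like, assuming sigmoid gates and a shared fused width; the paper's exact projections, gate form, and FFN design may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse per-scale query features, fine to coarse, with learned gates."""
    def __init__(self, dims=(256, 512, 1024), width=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, width) for d in dims])
        self.gate = nn.ModuleList([nn.Linear(2 * width, width) for _ in dims[1:]])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(width, width), nn.GELU(), nn.Linear(width, width))
            for _ in dims[1:]])

    def forward(self, feats):                  # feats: fine-to-coarse, each (N, C_l)
        h = self.proj[0](feats[0])             # start from the highest-resolution detail
        for i, f in enumerate(feats[1:]):
            c = self.proj[i + 1](f)            # coarser, more semantic feature
            g = torch.sigmoid(self.gate[i](torch.cat([h, c], dim=-1)))
            h = g * h + (1 - g) * c            # gate decides how much detail to keep
            h = h + self.ffn[i](h)             # small feed-forward refinement
        return h                               # fused feature, ready for the depth MLP

fused = GatedFusion()([torch.randn(4, 256), torch.randn(4, 512), torch.randn(4, 1024)])
print(fused.shape)                             # torch.Size([4, 256])
```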
- Tiny MLP head predicts depth
- What happens: A lightweight MLP (few layers) turns the final fused feature into a scalar depth value for that coordinate.
- Why: We want a small, efficient predictor, not a heavy decoder, since we query many coordinates.
- Example: At (x, y) = (1500.8, 900.4), the MLP outputs depth 4.2 meters.
- Arbitrary-resolution output by repeated querying
- What happens: To build a depth map at, say, 8K or 16K, we just query those grids of coordinates. There's no upsampling; every value is directly predicted at its exact coordinate.
- Why: This preserves details and avoids the blur of stretching.
- Example: Generating a 16K depth of a city skyline keeps flagpoles and cables intact.
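In code, "arbitrary resolution" is just the size of the coordinate grid you choose to query. The sketch below (assumed, simplified) evaluates the grid in chunks to bound memory; `depth_field` stands in for the full encoder-plus-decoder model.

```python
import torch

def render_depth(depth_field, out_w, out_h, chunk=262144):
    """Query a dense coordinate grid in chunks; no upsampling anywhere."""
    ys, xs = torch.meshgrid(torch.arange(out_h, dtype=torch.float32),
                            torch.arange(out_w, dtype=torch.float32), indexing='ij')
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)     # (out_h*out_w, 2)
    depths = torch.cat([depth_field(c) for c in coords.split(chunk)])
    return depths.view(out_h, out_w)

# Toy stand-in field; a real model would condition on image features at each coord.
toy_field = lambda c: (c[:, 0] + c[:, 1]) / 1000.0
print(render_depth(toy_field, out_w=3840, out_h=2160).shape)       # a 4K depth map
```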
- Training with sparse, sub-pixel supervision
- What happens: Instead of supervising the whole grid, the method samples many coordinate-depth pairs (e.g., 100k per image) from high-quality synthetic datasets (Hypersim, VKITTI, TartanAir, IRS, UnrealStereo4K, UrbanSyn). It uses an L1 loss between predicted and ground-truth depths. Depths are normalized in log space for stability.
- Why: Supervising at continuous coordinates matches the modelâs continuous nature and teaches sub-pixel precision. Using synthetic data avoids noise and holes from real sensors.
- Example: During training, one batch might supervise 100k random points across the 4K ground-truth depth, including lots near edges.
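A hedged sketch of the supervision signal: L1 between predicted and ground-truth depth after log-space normalization, applied only at the sampled coordinates. The exact normalization constants and the sampling scheme are simplified here.

```python
import torch
import torch.nn.functional as F

def sparse_log_l1(pred_depth, gt_depth, eps=1e-6):
    # L1 in log space keeps near and far supervision on a comparable scale.
    return F.l1_loss(torch.log(pred_depth + eps), torch.log(gt_depth + eps))

# e.g. ~100k sampled coordinate/depth pairs drawn from one synthetic ground-truth map
pred = torch.rand(100_000) * 50 + 0.1
gt = torch.rand(100_000) * 50 + 0.1
print(sparse_log_l1(pred, gt))
```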
- Relative vs metric depth
- What happens: For relative depth (no absolute scale), predictions are aligned to ground truth by scale and shift before evaluation. For metric depth, sparse depth samples are fed via a depth prompt module (as in PromptDA), and stricter accuracy thresholds are used.
- Why: Relative tasks care about shape; metric tasks care about exact distances. Sparse metric hints reduce ambiguity and let the continuous field shine.
- Example: With 1500 sparse depth points, the model locks onto true meters and nails fine structures.
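For relative depth, the scale-and-shift alignment is the usual least-squares fit before scoring. A common (not paper-specific) closed-form version is sketched below.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares (s, t) so that s*pred + t best matches the ground truth."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + t

pred = np.array([0.2, 0.4, 0.9])          # unitless relative depth
gt = np.array([2.1, 4.0, 9.2])            # metric ground truth (meters)
print(align_scale_shift(pred, gt))        # aligned values, now comparable to gt
```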
- Infinite Depth Query for even 3D point clouds (secret sauce for NVS)
- What happens: The model estimates how much real-world surface each pixel covers (farther and more slanted surfaces cover more area). It then gives those pixels more sub-pixel queries and jitters sample locations inside each pixel. Query depths at those sub-pixel points and back-project to 3D.
- Why: A regular per-pixel point cloud is uneven and causes holes from new angles. Near-uniform surface sampling fills gaps.
- Example: On a slanted roof far away, the method samples many sub-pixel points, preventing holes when viewed from above.
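The back-projection step follows the standard pinhole camera model; the intrinsics below (fx, fy, cx, cy) are illustrative values rather than numbers from the paper.

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Pinhole back-projection of (sub-pixel) image coords + depths to 3D points."""
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)          # (N, 3) in camera coordinates

uv = np.array([[1000.3, 512.7], [1000.8, 512.2]])   # two jittered samples in one pixel area
depth = np.array([4.20, 4.25])                      # their queried depths (meters)
print(backproject(uv, depth, fx=2000.0, fy=2000.0, cx=1920.0, cy=1080.0))
```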
- Optional Gaussian Splatting head for rendering
- What happens: With the uniform 3D points as centers, a small head predicts attributes for Gaussian splats, allowing fast, high-quality novel view rendering.
- Why: Tying clean geometry to a practical renderer demonstrates real benefits.
- Example: Render a bird's-eye view from a single street photo with far fewer artifacts.
Secret Sauce Summary:
- Continuous querying (neural implicit field) + local multi-scale features + gated fusion = crisp, fine-grained depth at any resolution.
- Adaptive sub-pixel sampling = uniform 3D coverage and stronger novel view synthesis.
04 Experiments & Results
The Tests (what and why):
- High-resolution and fine detail: Introduce Synth4K, a 4K dataset from five realistic games. Use high-frequency (HF) masks to focus on tiny structures and sharp edges, the toughest parts.
- Generalization to the real world: Evaluate zero-shot on KITTI, ETH3D, NYUv2, ScanNet, and DIODE to see how well a synthetic-trained model holds up.
- Metric accuracy with sparse depth: Add 1500 sparse depth points and test strict thresholds (1%, 2%, 4%) to measure true meter-level precision.
- Novel view synthesis (NVS): Compare point clouds and renderings, especially under large viewpoint shifts, where holes typically appear.
The Competition (baselines):
- Relative depth: DepthAnything/V2, DepthPro, MoGe/MoGe-2, Marigold, PPD.
- Metric depth (with sparse depth): Marigold-DC, Omni-DC, PriorDA, PromptDA.
The Scoreboard with Context:
- On Synth4K full images (relative depth), InfiniDepth tops the charts across subsets. For example, on Synth4K-1 it reaches about δ0.5 ≈ 74.3%, δ ≈ 89.0%, δ(highest) ≈ 96.1%. Think of δ as "how often the prediction is close enough"; these scores are like moving from a B to an A/A+ compared to earlier methods (a small sketch after this list shows how δ is computed).
- On HF-masked fine-detail regions, InfiniDepth consistently leads or is in the top group across all five games, showing it keeps thin edges and tiny geometry where others smooth them out.
- On real-world datasets (relative depth), InfiniDepth is on par with the best methods (e.g., δ-high variants around 97–99%), showing strong zero-shot transfer despite synthetic-only training.
- For metric depth with sparse hints (Ours-Metric), InfiniDepth achieves clear gains over Marigold-DC, Omni-DC, PriorDA, and PromptDA at tight thresholds (δ0.01, δ0.02, δ0.04). Hitting high scores at 1% error is like measuring distances with a very fine ruler and still reading them right most of the time.
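For readers who want the δ scores grounded, here is the common definition of threshold accuracy; the paper's variants (δ0.5 vs. the strict δ0.01/δ0.02/δ0.04) differ only in the threshold value, and this sketch is illustrative.

```python
import numpy as np

def delta_accuracy(pred, gt, thresh):
    """Fraction of points where max(pred/gt, gt/pred) stays under `thresh`."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

pred = np.array([4.20, 2.00, 9.80])
gt = np.array([4.23, 2.10, 9.81])
print(delta_accuracy(pred, gt, 1.25 ** 0.5))  # a relative-depth style threshold (δ0.5)
print(delta_accuracy(pred, gt, 1.01))         # the strict 1% threshold (δ0.01)
```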
Surprising/Notable Findings:
- The biggest improvements show up where it's hardest: high-resolution fine details and strict metric thresholds, exactly where continuous querying and local fusion should help.
- For NVS, evenly distributed 3D points give visibly fewer holes than pixel-aligned depth baselines like ADGaussian under big camera shifts.
- Bilinear feature interpolation for local queries beats fancier alternatives (offset MLPs, cross-attention) while being simpler and faster.
Ablations (what matters most):
- Neural implicit fields vs grid decoders: The implicit field notably boosts metric accuracy and visual sharpness; relative tasks also benefit in detail regions.
- Multi-scale query and gated fusion: Removing them drops accuracy across datasets, confirming that detail+context fusion is key.
- Encoder choice: DINOv3 ViT-Large works very well; swapping encoders changes results modestly.
- Sub-pixel supervision: Supervising at continuous coordinates (not just pixel centers) improves metric accuracy, matching the model's continuous nature.
Efficiency:
- The decoder is lightweight compared to many baselines. While not the fastest overall (some grid decoders are quicker), InfiniDepth offers better fine-detail quality at high resolutions and competes well with detail-focused methods in speed.
Takeaway:
- Numbers and visuals agree: treating depth as a continuous field unlocks crisp, scalable depth and cleaner single-image 3D reconstructions.
05 Discussion & Limitations
Limitations (specific):
- Video consistency: Trained on single views, so predicting frame-by-frame can cause tiny flickers. There's no explicit temporal smoothing yet.
- Synthetic-to-real gap: Training on synthetic data avoids sensor noise but can miss real-world quirks (e.g., glossy reflections, imperfect calibration). Zero-shot results are strong, but mixed real-synthetic training could help more.
- Throughput at ultra-high output: Querying millions of coordinates for 8K–16K outputs is compute-heavy, even with a small MLP. Efficient sampling or tiling strategies are helpful in practice.
- Normal/area estimation cost: The even-point sampling for NVS uses extra queries and gradients (for normals), adding overhead.
Required Resources:
- A strong GPU for training (the paper uses 8 GPUs, long training runs), and a modern ViT backbone (e.g., DINOv3-Large).
- For metric tasks, access to sparse depth inputs (e.g., from LiDAR or stereo) when absolute scale is needed.
When NOT to Use:
- Real-time mobile apps that need instant 4K depth at high FPS may prefer lighter, lower-res grid methods.
- Long video sequences requiring rock-solid temporal consistency without any post-processing.
- Scenarios with extremely limited compute or without any tolerance for adaptive sampling overhead.
Open Questions:
- How to build temporal and multi-view consistency directly into the implicit field for stable video depth?
- Can we jointly predict uncertainty and adapt queries to focus on uncertain regions on-the-fly?
- How well does mixed real+synth training push real-world fine-detail accuracy even further?
- Could the same implicit idea extend to other dense tasks (surface normals, optical flow) at arbitrary resolution?
- What are the best strategies to accelerate massive-coordinate querying without losing detail?
06 Conclusion & Future Work
Three-sentence summary: InfiniDepth turns depth maps into a continuous function you can query at any coordinate, enabling crisp, arbitrary-resolution depth without upsampling blur. By combining local multi-scale features with a lightweight implicit decoder and an adaptive query strategy, it captures fine details and produces evenly sampled 3D points for cleaner novel views. It reaches state-of-the-art results on 4K synthetic and real benchmarks, with especially strong gains on fine-detail and strict metric tests.
Main Achievement: A practical, accurate, and simple formulation of depth as a neural implicit field, with multi-scale local querying, that finally makes "infinite-resolution" depth estimation and fine-detail recovery feasible.
Future Directions:
- Add temporal and multi-view constraints for flicker-free video and consistent 3D over time.
- Explore mixed real+synth training and uncertainty-aware adaptive querying.
- Extend the implicit, arbitrary-resolution recipe to related tasks (normals, flow, semantics) and integrate tightly with fast renderers.
Why Remember This: It changes depth from a fixed grid of guesses to a smooth function you can interrogate anywhere, like switching from a pixelated map to a living, zoomable globe. That shift unlocks sharper edges, scalable outputs, and sturdier single-image 3D for the next generation of AR, robotics, and creative tools.
Practical Applications
- 4K–16K photo relighting and background blur that keeps hair strands and fence edges crisp.
- AR object placement that snaps cleanly to surfaces without floating or sinking, even when zoomed in.
- Single-image 3D for real estate or e-commerce with fewer holes when orbiting around objects.
- Robot navigation that better recognizes thin obstacles like poles, wires, and railings.
- Accurate tape-measure apps that leverage sparse laser or LiDAR pings for true metric distances.
- VFX and game modding: generate clean point clouds from a single screenshot for quick mockups.
- Digital twins and mapping where fine edges (window frames, beams) must be preserved at high resolution.
- Pre-visualization for drones or autonomous driving: simulate large viewpoint shifts from one frame.
- Quality control in factories: precise depth for checking tiny defects or assembly tolerances.
- Education and demo tools: interactive "ask depth anywhere" viewers to learn 3D geometry.