How Much 3D Do Video Foundation Models Encode?
Key Summary
- This paper asks a simple question: do video AI models trained only on 2D videos secretly learn about 3D worlds?
- The authors build a tiny "probe" that reads features from frozen video models and predicts 3D points, depth, and camera motion.
- Surprisingly, top video generators (like WAN2.1-14B and Open-Sora2.0) encode very strong 3D understanding, sometimes beating expert 3D models.
- Temporal reasoning (using information across time) is crucial; per-frame image features can estimate depth but struggle with full 3D geometry.
- 3D fine-tuning can help on in-domain data but may hurt generalization to new kinds of scenes.
- Bigger does not always mean better: model scaling helped WAN but not CogVideoX unless high-quality data scaled too.
- The most 3D-aware features come from mid network layers at early-but-not-first diffusion timesteps.
- Replacing DINO features with video model features in a 3D reconstructor (VGGT) yields big gains, especially with little 3D training data.
- Feature consistency across views is not a perfect proxy for true 3D awareness when comparing different model families.
- The work provides a model-agnostic 3D evaluation protocol and benchmark to guide building scalable 3D world models.
Why This Research Matters
This work shows that powerful video models, trained only on 2D data, quietly learn a lot about the 3D world. That means we can build accurate 3D tools—like for AR headsets, robots, and mapping—without needing gigantic 3D-labeled datasets. A simple, universal probe lets us compare models fairly and pick the best ones for 3D tasks. Developers can also harvest these features to supercharge 3D reconstruction systems, especially when labeled data is scarce. Ultimately, this points to a scalable path to world models that are cheaper, more general, and more reliable in real-life scenes.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how watching a movie lets you guess where things are in the room even if the screen is flat? Your brain uses changes across frames to feel the shape of the 3D world.
🥬 Filling (The Actual Concept): Video Foundation Models (VidFMs)
- What it is: VidFMs are big AI models trained on lots of videos to understand or generate video clips.
- How it works: 1) They watch many short clips; 2) They learn patterns of how pixels change over time; 3) They turn each frame into feature maps (summaries); 4) They use these features to do tasks like labeling, retrieving, or generating videos.
- Why it matters: Without VidFMs, each frame is just a flat picture, and the model won’t easily learn rules about motion or the 3D world.
🍞 Bottom Bread (Anchor): Imagine a model that has seen millions of soccer videos; it learns how a ball arcs in the air and how a camera pans across a stadium.
🍞 Top Bread (Hook): Imagine standing in front of a painting versus walking around a statue. The statue gives you real depth—even when your eyes see only 2D images.
🥬 Filling (The Actual Concept): 3D Awareness
- What it is: 3D awareness means knowing the shapes, distances, and camera motion behind the 2D pixels of a video.
- How it works: 1) Notice how objects move across frames; 2) Use parallax (near things move more than far things); 3) Combine views to infer 3D points and camera poses; 4) Keep everything consistent across time.
- Why it matters: Without 3D awareness, models confuse size with distance, mix up object shapes, and fail to keep scenes stable.
🍞 Bottom Bread (Anchor): If you ask, “Where is the chair relative to the table across the whole clip?”, a 3D-aware model can answer consistently instead of changing its mind each frame.
🍞 Top Bread (Hook): Picture telling a story in order—"first this happened, then that." If you jumble the order, the plot stops making sense.
🥬 Filling (The Actual Concept): Temporal Reasoning
- What it is: Temporal reasoning is making sense of changes over time across frames.
- How it works: 1) Align frames; 2) Track features that persist; 3) Compare how they shift; 4) Use the differences to infer 3D structure and camera motion.
- Why it matters: Without temporal reasoning, you only see flat snapshots. You miss the parallax clue that turns 2D into 3D.
🍞 Bottom Bread (Anchor): A single photo of a mountain looks flat; a video while walking shows the near trees moving fast and the mountain slow—temporal reasoning turns that into depth.
The world before this paper looked like this: classic 3D-from-images methods match points across photos and solve for geometry, but they break in hard cases (little texture, glare, or big camera jumps). Newer data-driven 3D models improve on this but require lots of clean 3D data, which is hard to collect at scale. Meanwhile, video data is easy to gather and huge. So researchers started to wonder: maybe VidFMs trained on plain 2D videos quietly learn a lot about 3D. Some works fine-tuned video models for 3D control or produced 3D-like outputs, but it was unclear whether a general, strong 3D sense naturally emerges from video alone.
The problem: We lacked a clean, model-agnostic way to measure a video model’s true 3D awareness. Prior checks used indirect signals like depth or cross-view matching or required scene-by-scene optimization.
The gap: what was missing was a simple, shared “thermometer” (a shallow probe) that reads a model’s frozen features and directly predicts 3D points, depth, and camera poses. If the features carry 3D knowledge, a tiny probe should recover it without special tricks.
The stakes: If VidFMs already encode strong 3D, we could build better AR/VR, robotics, and mapping systems with far less 3D data, saving time and cost and making such tech more robust in the real world.
02 Core Idea
🍞 Top Bread (Hook): Imagine testing whether a sponge already holds water by giving it a gentle squeeze—no need to tear it apart.
🥬 Filling (The Actual Concept): The Aha! Moment
- What it is: If a video model truly understands 3D, then a tiny, shallow read-out should be able to pull out 3D points, depth, and camera motion from its frozen features—no fine-tuning of the big model.
- How it works: 1) Freeze the VidFM; 2) Extract per-frame features; 3) Feed a few frames to a small transformer probe; 4) The probe predicts dense 3D point maps, depth maps, and camera poses; 5) Measure errors to score 3D awareness.
- Why it matters: Without this simple probe, we can’t fairly compare different video models or know whether their 3D sense is real and general.
🍞 Bottom Bread (Anchor): Like gently pressing a sponge to see if it’s wet, the probe “presses” the model’s features to see if they drip out 3D.
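For readers who think in code, here is a minimal PyTorch-style sketch of the freeze-and-probe recipe. The `extract_features` and `loss_fn` helpers are hypothetical placeholders; this illustrates the protocol, not the paper's exact implementation.

```python
import torch

def train_probe(vidfm, probe, loader, extract_features, loss_fn, lr=1e-4):
    """Minimal sketch of the freeze-and-probe protocol (assumptions noted above).
    Only the small `probe` is trained; the video foundation model stays frozen."""
    for p in vidfm.parameters():
        p.requires_grad_(False)                        # 1) freeze the VidFM

    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for clip, targets in loader:                       # a few frames + 3D ground truth
        with torch.no_grad():
            feats = extract_features(vidfm, clip)      # 2) read frozen per-frame features
        points, depth, poses = probe(feats)            # 3-4) shallow probe predicts 3D
        loss = loss_fn(points, depth, poses, targets)  # 5) the errors score 3D awareness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```

The key point is that gradients only ever reach the tiny probe, so any 3D accuracy it achieves must come from knowledge already stored in the frozen features.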
Three analogies for the same idea:
- Detective kit: You dust for fingerprints (frozen features) and use a small UV light (probe) to reveal the hidden pattern (3D).
- Radio tuner: The music (3D info) is already in the signal (features); a tiny tuner (probe) picks the right station.
- Shadow puppets: The hand shapes (features) already suggest animals; a flashlight (probe) reveals the full creature (3D shape and pose).
Before vs. After:
- Before: People weren’t sure if VidFMs learned solid 3D or just made pretty pixels. Tests were indirect or model-specific.
- After: Using the same tiny probe, top video generators clearly show strong, even expert-level 3D awareness, and we learn what boosts or hurts it.
Why it works (intuition):
- Videos are 2D slices of a 3D world; watching many clips teaches stable object shapes and how cameras move. Generative video models must keep scenes coherent as they imagine next frames, pressuring them to encode geometry. The information lives mid-way through their networks and is clearest when we peek at features during early-but-not-first diffusion steps, before fine RGB details overwrite global structure.
🍞 Top Bread (Hook): You know how a tea strainer lets the tea flavor through without changing the teabag?
🥬 Filling (The Actual Concept): Shallow Read-Out Modules (the Probe)
- What it is: A small network placed on top of frozen features to extract specific signals (here: 3D properties).
- How it works: 1) Take a few frames’ feature maps; 2) Mix information within and across frames; 3) Predict 3D points, depth, and camera poses; 4) Train only this small probe.
- Why it matters: If a tiny read-out works well, it proves the base model already encodes the knowledge.
🍞 Bottom Bread (Anchor): Like clipping a thermometer onto a pipe to read water temperature without touching the boiler itself.
🍞 Top Bread (Hook): Think of kids lining up from shortest to tallest—size helps, but practice and good instructions help too.
🥬 Filling (The Actual Concept): Model Scaling
- What it is: Increasing model size (parameters) and, ideally, training data quality/quantity.
- How it works: 1) Make the network bigger; 2) Train with more/better data; 3) Capabilities can grow if data and training scales match the size.
- Why it matters: Without the right data, a bigger model won’t necessarily learn better 3D.
🍞 Bottom Bread (Anchor): WAN got much better 3D when scaled with higher-quality data; CogVideoX grew in size but 3D didn’t improve unless data improved too.
🍞 Top Bread (Hook): Like rehearsing a song to nail tricky notes.
🥬 Filling (The Actual Concept): 3D Fine-Tuning
- What it is: Extra training that encourages 3D-consistent outputs.
- How it works: 1) Start from a video model; 2) Add 3D-aware goals or controls; 3) Train on curated 3D-style data; 4) Improve 3D on those domains.
- Why it matters: It can help in-domain, but may reduce generalization to different scenes if the fine-tuning data is narrow.
🍞 Bottom Bread (Anchor): Aether improved over its base on large scenes like DL3DV, but slipped a bit on object-centric CO3Dv2.
🍞 Top Bread (Hook): Think of cleaning a foggy window: wipe a little to see better, but not so much you smear things.
🥬 Filling (The Actual Concept): Denoising Timesteps (in Diffusion Models)
- What it is: Moments during the diffusion process when the model removes noise to recover structure.
- How it works: 1) Add controlled noise; 2) Take a small denoising step; 3) Read features at chosen layers/timesteps; 4) Early-but-not-first steps and mid layers best expose 3D cues.
- Why it matters: Too-early or too-late features hide 3D; mid + early steps balance global structure and clarity.
🍞 Bottom Bread (Anchor): Probing mid-layer features at an early step gave the lowest 3D errors across multiple video generators.
03 Methodology
At a high level: Input video → Extract frozen VidFM features → Sample 4 frames → Tiny transformer probe with alternating attention → Three heads predict 3D points, depth, and camera poses → Compute losses → Output 3D results and errors as the 3D-awareness score.
Step-by-step recipe:
- Input and Feature Extraction
- What happens: Feed a video clip to a frozen video foundation model (self-supervised encoder or diffusion-based generator). For diffusion models, lightly add noise, take one denoising step, and grab hidden activations (features) at a specific layer and timestep. For encoders, read last-layer spatial features.
- Why this step exists: We want to test what’s already encoded, not to change the VidFM. Freezing avoids mixing in new training.
- Example: From a CO3Dv2 clip of a spinning chair, we collect one feature map per chosen frame (e.g., 256×160 tokens) from WAN2.1-14B at a mid-layer and early-but-not-first timestep.
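A rough sketch of what "lightly add noise, take one denoising step, and grab hidden activations" could look like in PyTorch. The `blocks` attribute, the model call signature, and the `scheduler.add_noise` call are assumptions about a diffusers-style interface, not the actual WAN2.1 or Open-Sora API.

```python
import torch

@torch.no_grad()
def extract_diffusion_features(model, scheduler, latents, timestep, layer_idx):
    """Grab hidden activations from one denoising step of a frozen video
    diffusion model. `model`, `scheduler`, and `model.blocks` are hypothetical
    stand-ins for whichever generator is being probed."""
    feats = {}

    def hook(_module, _inputs, output):
        # Keep the chosen mid-layer activation; detach so the frozen
        # backbone never receives gradients.
        feats["layer"] = output.detach()

    handle = model.blocks[layer_idx].register_forward_hook(hook)

    # Lightly noise the clean latents to the chosen (early-but-not-first)
    # timestep, then run a single denoising forward pass.
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, timestep)  # assumed diffusers-style API
    _ = model(noisy, timestep)                             # assumed call signature

    handle.remove()
    return feats["layer"]
```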
- Frame Sampling and Tokens
- What happens: Pick S=4 frames: the first frame is the reference; choose three more separated by at least 5 frames. Convert each feature map into tokens.
- Why this step exists: Four frames offer parallax and motion cues without overwhelming the small probe.
- Example: Using frames 1, 8, 15, 22 of a clip ensures noticeable viewpoint change.
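A small illustrative helper for the sampling rule (S=4 frames, at least 5 frames apart); the exact selection strategy used in the paper may differ.

```python
import random

def sample_frames(num_frames, n_views=4, min_gap=5, seed=None):
    """Pick a reference frame plus (n_views - 1) extra frames, each at least
    `min_gap` frames from every other pick. Sketch only: if the clip is too
    short to satisfy the gap, fewer frames are returned."""
    rng = random.Random(seed)
    picks = [0]  # the first frame acts as the reference view
    candidates = list(range(num_frames))
    while len(picks) < n_views and candidates:
        c = rng.choice(candidates)
        if all(abs(c - p) >= min_gap for p in picks):
            picks.append(c)
        candidates.remove(c)
    return sorted(picks)

# Example: a 24-frame clip might yield something like [0, 7, 14, 21].
```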
- Alternating-Attention Blocks (Shallow Transformer)
- What happens: Apply four blocks; each block first mixes tokens within each frame (frame attention), then mixes across frames (global attention). This follows the alternating-attention design of VGGT but stays shallow and lightweight.
- Why this step exists: Within-frame attention sharpens local details; across-frame attention fuses temporal cues needed for global 3D consistency.
- Example: The arm of a chair aligns across frames so the model can triangulate its 3D curve.
🍞 Top Bread (Hook): Imagine asking a group first to agree within each team, then sending team captains to meet so everyone agrees.
🥬 Filling (The Actual Concept): Alternating Attention
- What it is: A pattern that alternates between attention inside a frame and attention across frames.
- How it works: 1) Intra-frame attention tidies each frame’s features; 2) Inter-frame attention shares parallax and motion; 3) Repeat a few times.
- Why it matters: Without this, the probe can’t combine local clarity with global 3D coherence.
🍞 Bottom Bread (Anchor): It’s like two classroom discussions: table-talk first, then a whole-class share-out.
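A minimal PyTorch sketch of one alternating-attention block, with illustrative dimensions; the probe stacks four such blocks, but details like the normalization scheme and head counts here are assumptions.

```python
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    """One probe block: frame-wise attention, then global attention over all
    frames, as described above. Widths and layer choices are illustrative."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, tokens, dim)
        b, s, n, d = x.shape

        # 1) Frame attention: tokens attend only within their own frame.
        xf = self.norm1(x).reshape(b * s, n, d)
        xf, _ = self.frame_attn(xf, xf, xf)
        x = x + xf.reshape(b, s, n, d)

        # 2) Global attention: tokens from all frames attend to each other,
        #    which is where cross-frame parallax cues get fused.
        xg = self.norm2(x).reshape(b, s * n, d)
        xg, _ = self.global_attn(xg, xg, xg)
        x = x + xg.reshape(b, s, n, d)
        return x

# The full probe stacks four such blocks before the three readout heads.
```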
- Three Readout Heads
- Point Map Head
- What happens: Predicts a 3D coordinate (x, y, z) per pixel, expressed in the first frame’s coordinate system.
- Why this step exists: Point maps directly test if features carry global, consistent 3D.
- Example: The chair’s seat points form a flat 3D surface; legs become skinny columns in space.
- Depth Map Head
- What happens: Predicts per-pixel depth values at a common scale across frames.
- Why this step exists: Depth shows how far surfaces are from the camera but doesn’t fix global pose by itself.
- Example: Near table edges appear with small depth; the wall is large depth.
- Camera Pose Head
- What happens: Predicts each frame’s rotation and translation relative to the first frame.
- Why this step exists: Correct poses are vital to place all points in a shared 3D world.
- Example: As the camera circles a toy truck, the pose head tracks the turn angle and direction.
🍞 Top Bread (Hook): Think of measuring how far things are, where they are in space, and where you stood when you looked at them.
🥬 Filling (The Actual Concept): Point Map, Depth Map, Camera Pose
- What it is: Point map gives 3D coordinates for visible pixels; depth map gives distance; camera pose gives where and how the camera moved.
- How it works: 1) Use fused features; 2) Decode dense maps for points and depth; 3) Decode pose vectors; 4) Keep all in a shared reference frame.
- Why it matters: Without any of the three, 3D breaks—no points means no shape; no depth means no distances; no pose means no common world.
🍞 Bottom Bread (Anchor): Reconstructing a room needs all three: where each pixel lands in 3D, how far it is, and how the camera walked around.
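A compact sketch of how the three heads could sit on top of the fused tokens; the simple linear heads and the quaternion-plus-translation pose parameterization are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ReadoutHeads(nn.Module):
    """Three lightweight heads on top of the fused probe tokens (sketch)."""

    def __init__(self, dim=768):
        super().__init__()
        # Dense heads: per-token predictions, later reshaped/upsampled to pixels.
        self.point_head = nn.Linear(dim, 3)  # (x, y, z) in frame-1 coordinates
        self.depth_head = nn.Linear(dim, 1)  # per-pixel depth
        # Pose head: one vector per frame, read from pooled tokens.
        self.pose_head = nn.Linear(dim, 7)   # quaternion (4) + translation (3)

    def forward(self, tokens):
        # tokens: (batch, frames, tokens, dim) from the alternating-attention blocks
        points = self.point_head(tokens)     # (b, s, n, 3)
        depth = self.depth_head(tokens)      # (b, s, n, 1)
        pooled = tokens.mean(dim=2)          # (b, s, dim) per-frame summary
        poses = self.pose_head(pooled)       # (b, s, 7), relative to frame 1
        return points, depth, poses
```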
- Losses and Normalization
- What happens: Compute confidence-weighted L1 losses for depth and point maps; use a robust loss (Huber) for camera poses. Normalize each scene to remove global scale ambiguity and align predictions to ground truth via Umeyama alignment before scoring.
- Why this step exists: Fair, stable training and evaluation; removes trivial scale mismatches and focuses on structure and motion.
- Example: Two reconstructions differing only by overall zoom should count the same after scale normalization.
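A hedged sketch of the loss terms described above: confidence-weighted L1 for points and depth, and a robust Huber (smooth L1) loss for poses. The confidence regularizer and its weight `alpha` are assumptions; scene scale normalization and Umeyama alignment happen before scoring and are omitted here.

```python
import torch
import torch.nn.functional as F

def probe_losses(pred_pts, gt_pts, pred_depth, gt_depth, conf,
                 pred_pose, gt_pose, alpha=0.2):
    """Sketch of the training losses (weights and exact forms are assumptions)."""
    # Confidence-weighted L1: confident pixels contribute more, and a log
    # penalty keeps the network from driving all confidences to zero.
    l1_pts = (conf * (pred_pts - gt_pts).abs().sum(-1)).mean()
    l1_depth = (conf * (pred_depth - gt_depth).abs().squeeze(-1)).mean()
    conf_reg = -alpha * torch.log(conf.clamp_min(1e-6)).mean()

    # Robust Huber (smooth L1) loss for camera pose vectors.
    pose_loss = F.smooth_l1_loss(pred_pose, gt_pose)

    return l1_pts + l1_depth + conf_reg + pose_loss
```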
- Scoring 3D Awareness
- What happens: Report errors for points and depth; for pose, compute accuracy as the fraction of frame pairs whose rotation and translation angular errors are both below a threshold, summarized by AUC@5 and AUC@30.
- Why this step exists: These scores are direct tests of 3D understanding, not indirect proxies.
- Example: A high AUC@30, like a good exam grade, means the model’s pose guesses for frame pairs are usually within 30 degrees, a practical threshold.
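A small sketch of the pose accuracy metric: sweep thresholds up to N degrees, check that both the rotation and translation angular errors clear each threshold, and average the resulting accuracies. The integer threshold spacing is an assumption.

```python
import numpy as np

def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """AUC@N sketch: for every threshold t up to `max_threshold`, count the
    fraction of frame pairs whose rotation AND translation angular errors are
    both below t, then average those fractions."""
    rot = np.asarray(rot_err_deg)
    trans = np.asarray(trans_err_deg)
    accs = [np.mean((rot < t) & (trans < t)) for t in range(1, max_threshold + 1)]
    return float(np.mean(accs))

# Example: pose_auc(rot_errs, trans_errs, max_threshold=5) gives AUC@5.
```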
🍞 Top Bread (Hook): Remember wiping a foggy window just enough to see the garden?
🥬 Filling (The Actual Concept): Denoising Timesteps (where we read features)
- What it is: In diffusion models, we pick when to read features during the one-step denoise for the clearest 3D signal.
- How it works: 1) Too-early: not enough structure; 2) Too-late: features tilt toward pixel-perfect color; 3) Early-but-not-first + mid-layer: best 3D cues.
- Why it matters: The probe works best when the features still carry global geometry.
🍞 Bottom Bread (Anchor): The paper finds a sweet spot across models: mid-layer, early-but-not-first step regularly yields the lowest 3D errors.
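Finding that sweet spot can be framed as a simple grid search over layers and timesteps; `probe_eval` below is a hypothetical callable that trains and evaluates the probe on features read at one (layer, timestep) setting and returns its 3D error.

```python
def sweep_layers_and_timesteps(probe_eval, layers, timesteps):
    """Sketch of locating the best (layer, timestep) pair for 3D probing."""
    best = None
    for layer in layers:        # e.g., indices spanning early, mid, and late blocks
        for t in timesteps:     # e.g., a few noise levels, skipping the very first
            err = probe_eval(layer, t)
            if best is None or err < best[0]:
                best = (err, layer, t)
    return best  # (lowest error, best layer index, best timestep)
```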
The secret sauce:
- Freeze-and-probe: keep base models fixed to test what they truly know.
- Alternating attention: tiny yet powerful fusion of within-frame detail and across-frame geometry.
- Triple-head outputs: points, depth, and pose ensure a complete check of global 3D, not just a piece.
04 Experiments & Results
The Test: Measure how well a tiny probe can pull out three 3D properties—point maps, depth, and camera poses—from frozen features. Lower point/depth errors and higher pose AUC scores mean stronger 3D awareness.
The Competition: Compare multiple families—video generators (WAN2.1-14B, Open-Sora2.0, CogVideoX; plus Aether, a 3D-fine-tuned variant), a self-supervised video encoder (V-JEPA), an image model baseline (DINOv2 per frame), and a native 3D expert (Fast3R). Datasets: CO3Dv2 (object-centric) and DL3DV (large, cluttered scenes).
The Scoreboard with Context:
- Big picture: Top video generators show strong, general 3D understanding, in some cases rivaling or beating Fast3R, which was trained directly for 3D.
- On CO3Dv2: WAN2.1-14B is just behind Fast3R (e.g., point, depth, and pose AUC@30 are close), while Open-Sora2.0 is also strong. This is like scoring A-level grades right next to the class topper.
- On DL3DV (out-of-distribution for Fast3R): WAN2.1-14B surpasses Fast3R across all metrics. That’s like transferring to a new school and still acing the test when the previous top student struggles.
- Temporal reasoning matters: Per-frame DINOv2 can estimate depth on simple objects but collapses on global 3D in complex scenes. Even V-JEPA (a video encoder) beats DINOv2 on global 3D.
- Fine-tuning trade-offs: Aether (3D fine-tuned from CogVideoX) improves on large, cluttered scenes but slightly underperforms on object-centric scenes—helpful in-domain yet risky for generalization.
- Where in the network to read features: Mid layers and early-but-not-first diffusion timesteps consistently give the best 3D results across generators.
Surprising findings:
- Video generators, trained only on 2D videos, can encode 3D as well as or better than some 3D-specific models—especially when evaluating generalization.
- More parameters aren’t enough: WAN’s 3D awareness improved a lot with size and better data; CogVideoX grew in size but 3D didn’t necessarily improve without the right data scale/quality.
- Multi-view feature consistency alone is a biased proxy: DINOv2 and V-JEPA can look great by nearest-neighbor matching across views, yet lag behind on the full 3D probe tasks that require global geometry and poses.
- Probe size doesn’t change the story: Using a smaller probe preserves the same rankings, suggesting the 3D signal is genuinely present in the features.
Practical implication test:
- Swapping DINO with frozen WAN2.1-14B features in VGGT yields big gains on both datasets, especially with limited 3D training data. With less than 10% of the training data, the VidFM-based VGGT often beats the original full-data baseline. This shows VidFM features are especially valuable when 3D labels are scarce.
05 Discussion & Limitations
Limitations:
- Public checkpoints only: The study uses available models, so factors like exact data composition, training schedule, and architecture are not controlled. This makes it hard to isolate which ingredient (data, size, objective) drives 3D awareness.
- No pure data-scaling ablation: There are no open model families that vary data scale alone, so we can’t cleanly separate “more data” versus “bigger network” effects.
- Resource limits: Training truly large 3D reconstructors from scratch with VidFM features on massive 3D datasets wasn’t feasible here.
Required resources to use this approach:
- Access to pre-trained VidFMs and the ability to extract internal features (e.g., mid-layer activations during diffusion denoising).
- A modest GPU budget to train a small probe on 3D-supervised datasets and to run evaluation (point errors, depth errors, pose AUC).
When not to use:
- If you need end-to-end generative video outputs with guaranteed 3D consistency across long sequences, a probe alone won’t fix generation artifacts.
- If you only have single still images without any motion or parallax, a video-based probe won’t add temporal cues that aren’t there.
Open questions:
- How to fine-tune for 3D without hurting generalization? Can we design objectives or data curricula that broaden, not narrow, a model’s 3D skills?
- What’s the exact role of data quality versus sheer quantity in scaling 3D awareness?
- Can we extend probing beyond four frames to long-horizon consistency without blowing up compute?
- Can we design better, family-agnostic proxies than multi-view feature matching to estimate true 3D awareness quickly?
06 Conclusion & Future Work
Three-sentence summary: The paper introduces a simple, model-agnostic probe that reads frozen video-model features to directly predict 3D points, depth, and camera poses. Using this shared yardstick, top video generators are shown to encode strong, generalizable 3D understanding—even rivaling or surpassing 3D experts in some settings. The study also maps what helps (temporal reasoning, right layer/timestep, data-aware scaling) and what can hurt (narrow 3D fine-tuning, naive scaling), and shows VidFM features boost 3D reconstruction especially when labeled 3D data is scarce.
Main achievement: Proving, with a clean and fair test, that 3D emerges strongly in modern video foundation models trained only on 2D videos, and that this 3D signal is practically useful.
Future directions:
- Develop fine-tuning methods and data strategies that raise 3D awareness while preserving broad generalization.
- Build long-range, memory-augmented probes for full-scene, long-horizon 3D coherence.
- Scale up VidFM-feature-based reconstructors with more data and tasks to approach universal 3D world models.
Why remember this: It shows we can tap into the hidden 3D knowledge already present in powerful video models with a tiny, non-invasive read-out—opening a scalable path to better AR/VR, robotics, and mapping, even when 3D labels are limited.
Practical Applications
- Boost 3D reconstruction systems by swapping image features (e.g., DINO) for VidFM features in models like VGGT.
- Select the best video model for 3D tasks using the probe as a quick evaluation tool.
- Improve AR/VR scene understanding (room layout, object placement) using 3D-aware video features.
- Enhance robot navigation and manipulation by decoding camera poses and scene geometry from onboard video.
- Speed up mapping for drones by extracting 3D point clouds from short video bursts, even with limited 3D labels.
- Pre-screen fine-tuning strategies: test whether 3D-aware fine-tuning helps or hurts generalization before large-scale training.
- Choose optimal diffusion layers and timesteps (mid, early-but-not-first) when mining features for geometry-sensitive tasks.
- Design low-data pipelines for 3D tasks by leveraging VidFM features that perform well with minimal supervision.
- Create fair 3D benchmarks for new video models using the same frozen-feature probing protocol.
- Diagnose failure cases (e.g., at object boundaries) and guide model/data improvements.