InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Intermediate
Shuai Yuan, Yantai Yang, Xiaotian Yang et al. Ā· 1/5/2026
arXiv Ā· PDF

Key Summary

  • InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.
  • It uses a rolling memory: it keeps only the most useful past information and smartly drops the rest.
  • Instead of slowing down to read giant attention maps, it uses a quick cosine-similarity trick to find and remove redundant tokens.
  • The very first frame is saved as an ā€œanchorā€ so all 3D predictions stay aligned to the same world coordinates.
  • Each transformer layer gets a different memory budget, focusing space where the model needs the most variety.
  • This approach is training-free and fully compatible with FlashAttention, so it stays fast and efficient.
  • On long sequences where other systems crash or drift, InfiniteVGGT stays stable and accurate.
  • The authors also built Long3D, a new benchmark with sequences up to about 10,000 frames to test true long-term performance.
  • Across 7-Scenes, NRGBD, Bonn, and Long3D, InfiniteVGGT reduces errors and maintains better normal consistency than strong baselines.
  • One current weakness is completeness (filling in every surface), which the authors flag for future improvement.

Why This Research Matters

Endless, stable 3D understanding unlocks everyday tools like AR navigation that doesn’t drift, warehouse robots that don’t get lost, and drones that can scan entire facilities without restarting. By keeping memory strictly bounded, InfiniteVGGT stays fast and affordable to run on real hardware. Its training-free design means teams can upgrade existing streaming models today, not after months of retraining. The method’s compatibility with FlashAttention preserves speed while handling very long videos. And with the Long3D benchmark, the field finally has a fair, tough test for long-term performance, nudging everyone toward more reliable, real-world 3D systems.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine filming a school tour with your phone, walking through classrooms, hallways, and the playground. Wouldn’t it be cool if a computer could turn that whole video into a neat 3D map in real time, without stopping?

🄬 The World Before: For years, computers built 3D scenes from many photos using classic tools like Structure-from-Motion and Multi-View Stereo. These are like careful puzzle-solvers: very accurate, but slow, with many steps that can trip over each other. New deep learning models like DUSt3R and VGGT changed the game by doing most of this in one big forward pass: they take a batch of images and predict depth, camera poses, and 3D points together—fast and strong. But there’s a catch: they work best when you can load all images at once (offline). Real life is a stream: robots, AR glasses, and drones see one frame at a time and keep going.

šŸž Hook: You know how a backpack can only carry so much? Live systems face the same problem with memory.

🄬 The Problem: Streaming architectures try to handle video one frame after another. To remember the past, some methods keep a growing history (a KV cache). But that memory grows and grows until it explodes—Out-Of-Memory. Others squash the past into a tiny hidden state, like squeezing a whole book into a sticky note; that avoids memory blowups but loses important details, causing long-term drift (the 3D map slowly bends or shifts). So we had a tough choice: either crash from too much memory or drift by forgetting too much.

šŸž Hook: What if we only kept the important stuff from before, just like saving your best notes for a test?

🄬 Failed Attempts: A natural idea is to use attention weights to see which tokens matter and delete the rest. The snag: fast attention (like FlashAttention) never builds the full attention matrix to stay speedy. But attention-weight pruning needs that exact matrix to decide what to keep—so it either slows down a lot or becomes impossible to use with the fast kernels we rely on.

šŸž Hook: Think of flipping through very similar photos taken one step apart; many parts barely change.

🄬 The Gap: Streaming camera motion usually changes slowly, so many tokens become near-duplicates. We need a way to find and remove those duplicates without touching the heavy attention matrices. Also, not all layers of the model need the same amount of memory—some layers see lots of variety, others don’t. The missing piece was a simple, fast, and attention-agnostic way to (1) score token redundancy, (2) keep memory strictly bounded, and (3) adapt per layer to keep the right kind of information.

šŸž Hook: Picture a robot helper in your house that can map rooms forever without freezing or getting lost.

🄬 Real Stakes: If we solve this, AR headsets can place holograms stably all day, warehouse robots can navigate long shifts, and drones can scan huge buildings without restarting. Doctors can plan procedures with steady 3D guidance, and game studios can capture massive scenes smoothly. Infinite, stable 3D understanding means systems that don’t choke on time or distance.

Now, let’s carefully introduce the key ideas with simple sandwiches.

šŸž Hook (VGGT): You know how an architect can look at different photos of a house and imagine its 3D shape? 🄬 The Concept (VGGT): VGGT is a big transformer that looks at several images together and predicts camera poses, depth, and 3D points in one pass.

  • How it works:
    1. Break images into tokens.
    2. Mix information within each frame and across frames.
    3. Output depth, camera poses, 3D points, and tracking features.
  • Why it matters: It replaces slow, multi-step pipelines with a single, efficient process. šŸž Anchor: Give VGGT a batch of room photos; it outputs a clean 3D room model in one go.

šŸž Hook (Causal Attention): Imagine telling a story one page at a time—you can only use pages you’ve already read. 🄬 The Concept (Causal Attention): Causal attention lets the model focus only on past information when processing a new frame.

  • How it works:
    1. Store keys/values from old frames in a cache.
    2. The new frame queries only this past cache.
    3. Update the cache with the new frame.
  • Why it matters: It makes streaming possible by respecting time order. šŸž Anchor: As a robot moves, it uses what it has already seen to understand the next step, never peeking into the future.
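
To make this concrete, here is a minimal sketch (in PyTorch) of frame-by-frame causal attention over a growing KV cache. The class, the single-head setup, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class StreamingCausalAttention:
    """Minimal single-head causal attention over a growing KV cache (illustrative)."""

    def __init__(self, dim: int):
        self.k_cache = torch.empty(0, dim)  # keys from past frames, shape (T_cache, dim)
        self.v_cache = torch.empty(0, dim)  # values from past frames, shape (T_cache, dim)

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """Process one frame; q, k, v each have shape (tokens_per_frame, dim)."""
        # 1. Add this frame's keys/values to the cache (here the cache only grows).
        self.k_cache = torch.cat([self.k_cache, k], dim=0)
        self.v_cache = torch.cat([self.v_cache, v], dim=0)
        # 2. The new frame attends only to the cache (past frames plus itself):
        #    causal by construction, no peeking into the future.
        out = F.scaled_dot_product_attention(
            q.unsqueeze(0), self.k_cache.unsqueeze(0), self.v_cache.unsqueeze(0)
        )
        return out.squeeze(0)

# Each frame adds ~1,000 tokens, so the cache grows without bound.
layer = StreamingCausalAttention(dim=64)
for t in range(3):
    tokens = torch.randn(1000, 64)
    _ = layer.step(tokens, tokens, tokens)
    print(t, layer.k_cache.shape)  # [1000, 64] -> [2000, 64] -> [3000, 64]
```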

šŸž Hook (KV Cache): Think of a bookshelf of helpful notes you’ve written so you don’t have to re-learn everything. 🄬 The Concept (KV Cache): A KV cache stores key and value vectors from past frames so the current frame can quickly look them up.

  • How it works:
    1. Each new frame adds about 1,000 tokens (keys/values).
    2. The cache grows over time.
    3. The current frame attends to this cache.
  • Why it matters: It gives the model memory—but if it never shrinks, it runs out of space. šŸž Anchor: After 100 frames, that bookshelf can overflow unless you tidy it.
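
A rough back-of-the-envelope estimate shows why an unbounded cache eventually overflows. The layer/head counts and fp16 storage below are assumptions chosen to match the mini-example later in this article, not the exact VGGT configuration.

```python
# Rough memory estimate for an unbounded KV cache (illustrative numbers).
tokens_per_frame = 1000                 # ~1,000 tokens added per frame (from the text above)
layers, heads, head_dim = 12, 16, 64    # assumed transformer shape
bytes_per_value = 2                     # fp16/bf16 storage

def kv_cache_gib(num_frames: int) -> float:
    # Factor of 2 for keys AND values, summed over all layers and heads.
    total_bytes = (2 * num_frames * tokens_per_frame
                   * layers * heads * head_dim * bytes_per_value)
    return total_bytes / (1024 ** 3)

for frames in (100, 1000, 5000, 10000):
    print(f"{frames:>6} frames -> ~{kv_cache_gib(frames):.1f} GiB of KV cache")
```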

šŸž Hook (Attention Weight Pruning): Like cleaning your binder by removing pages you never read. 🄬 The Concept (Attention Weight Pruning): Remove tokens with low attention weights to save memory.

  • How it works:
    1. Compute attention weights.
    2. Rank tokens by importance.
    3. Drop low-score tokens.
  • Why it matters: In theory, it keeps only useful info, but in practice it needs attention matrices that fast kernels avoid building—so it becomes too slow or impossible. šŸž Anchor: It’s like needing the full giant gradebook to decide what to toss—but your speed tool hides that gradebook to go fast.
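
The sketch below illustrates why this clashes with fast kernels: ranking tokens by attention weight requires materializing the full query-key matrix, which FlashAttention deliberately never builds. This is an illustrative reconstruction, not the authors' code.

```python
import torch

def attention_weight_importance(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Token importance via attention weights (requires the full L x S matrix)."""
    # Materialize the full attention matrix: O(L * S) memory, exactly what
    # FlashAttention avoids by computing attention in tiles and discarding scores.
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)   # (L, S) score matrix
    weights = scores.softmax(dim=-1)            # (L, S) attention weights
    return weights.sum(dim=0)                   # importance of each cached token

q = torch.randn(1000, 64)        # current frame's queries
k = torch.randn(50_000, 64)      # a long stream's cached keys
importance = attention_weight_importance(q, k)   # forces a 1,000 x 50,000 matrix
keep = importance.topk(512).indices              # prune by attention weight
```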

02 Core Idea

šŸž Hook: Picture packing a suitcase for a long trip: you keep the essentials and one special item that never leaves (like your passport), and you repack a little each day so it never overflows.

🄬 The ā€œAha!ā€ Moment (One sentence): Keep memory bounded and stable forever by rolling it forward—always retain the first frame as an anchor, and, for the rest, keep only the most diverse tokens per layer using a fast cosine-similarity score that doesn’t touch attention weights.

Multiple Analogies:

  1. Bookshelf analogy: You have a fixed-size shelf. The first book (anchor) stays forever. Each day, you add a new book but remove any book that says the same thing as others, keeping the shelf full of different ideas.
  2. Class notes analogy: You can only carry 10 note cards. You never throw away the ā€œcheat sheetā€ (anchor). For new notes, you keep the ones that cover new topics (diverse) and toss repeats.
  3. Backpack analogy: You hike all day. Your backpack space is fixed. You always keep the map (anchor). Everything else you keep only if it adds new info about the trail; duplicates get tossed.

Before vs After:

  • Before: KV caches balloon (OOM) or RNN states forget details (drift). Attention-based pruning isn’t compatible with fast kernels.
  • After: Memory stays strictly bounded; tokens are pruned by diversity in key space; the system remains fast (FlashAttention-friendly) and stable for very long streams.

Why It Works (intuition, no equations):

  • Streaming cameras move smoothly, so many tokens from nearby frames are near-duplicates.
  • Keys from past frames spread in a feature space; by measuring how different each key is from the average (using cosine similarity), we keep the ones that represent unique views and drop look-alikes.
  • Different layers see different variety: mid layers often carry richer geometric detail than very early or very deep layers. Giving bigger budgets to more diverse layers preserves what matters most.
  • Keeping the first frame’s tokens intact guarantees all 3D predictions remain aligned to one stable world coordinate system.

Building Blocks (the ingredients):

  • Immutable anchor: keep all first-frame tokens forever to lock the global coordinate frame.
  • Diversity score: compute a quick score for each key token as negative cosine similarity to the mean key (higher score = more unique).
  • Top-K per head/layer: pick the most diverse tokens until each layer/head’s budget is met.
  • Layer-wise adaptive budgets: compute each layer’s average diversity and assign it more or less memory accordingly (softmax across layers with a temperature).
  • Rolling update: at every new frame, add its tokens, score them, prune redundancies, and move on—so the cache size never grows.

Now, let’s sandwich the remaining key concepts.

šŸž Hook (Rolling Memory): Imagine a movie reel that keeps rolling; you only keep the scenes that add something new so the reel never gets too big. 🄬 The Concept (Rolling Memory): A fixed-size memory that is continuously refreshed by adding new tokens and pruning redundant old ones.

  • How it works:
    1. Add new frame tokens to a candidate set.
    2. Score all candidate keys by diversity.
    3. Keep the top ones per layer/head, plus the anchor.
    4. Discard the rest so memory stays bounded.
  • Why it matters: It enables infinite-horizon streaming without memory overflow or long-term forgetting. šŸž Anchor: A robot mapping a warehouse all day keeps a crisp global map because its memory never bloats or erases key context.

šŸž Hook (Cosine Similarity): Think of comparing two arrows by the angle between them rather than their length. 🄬 The Concept (Cosine Similarity): A quick way to measure how alike two vectors are by their direction only.

  • How it works:
    1. Normalize vectors (ignore size).
    2. Compute the cosine of the angle between them.
    3. Closer to 1 = more similar; closer to -1 = very different.
  • Why it matters: It’s fast and doesn’t need attention weights, so we can prune before expensive attention. šŸž Anchor: Two nearly identical room views have very similar keys (high cosine); we drop one to save space.
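
A tiny illustration (not from the paper) of how cosine similarity ignores length and compares direction only:

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])       # same direction, twice the length
c = torch.tensor([-1.0, -2.0, -3.0])    # opposite direction

print(F.cosine_similarity(a, b, dim=0))   # -> 1.0  (identical direction)
print(F.cosine_similarity(a, c, dim=0))   # -> -1.0 (opposite direction)
```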

šŸž Hook (Diversity Score): Like checking if an ice cream flavor is new to the menu or just another kind of vanilla. 🄬 The Concept (Diversity Score): A number that says how different a token’s key is from the average key in that layer/head.

  • How it works:
    1. Compute the mean key from candidate keys (after L2-normalizing).
    2. Score each key as negative cosine similarity to that mean.
    3. Higher score = more unique; keep those.
  • Why it matters: It preserves rare, informative views and drops near-duplicates, keeping memory useful. šŸž Anchor: If most frames see the same wall, but one glimpses the hallway, the hallway token gets a high diversity score and is kept.
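
A minimal sketch of this diversity score, assuming candidate keys of shape (num_tokens, head_dim); the function name and the Top-K usage are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def diversity_scores(keys: torch.Tensor) -> torch.Tensor:
    """Score = negative cosine similarity to the mean key (higher = more unique)."""
    keys_n = F.normalize(keys, dim=-1)                    # 1. L2-normalize: directions only
    mean_key = F.normalize(keys_n.mean(dim=0), dim=-1)    # 2. average key direction
    cos_to_mean = keys_n @ mean_key                       # 3. cosine of each key to the mean
    return -cos_to_mean                                   # duplicates score low, unique views high

keys = torch.randn(1200, 64)            # candidate keys for one layer/head
scores = diversity_scores(keys)
keep_idx = scores.topk(512).indices     # keep the 512 most diverse tokens
```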

03 Methodology

High-level overview: Input video frames → Frame encoder (per-frame tokens) → Causal temporal attention with KV cache → Diversity scoring and pruning (per layer/head) → Outputs (depth, camera pose, 3D points, tracking) while cache stays bounded.

Step-by-step (like a recipe; a code sketch of the core update follows the steps):

  1. Prepare the anchor (first frame):
  • What happens: Pass frame 1 through the encoder and temporal stack. Save all its KV tokens per layer/head. Mark them immutable.
  • Why this step exists: VGGT aligns everything to frame 1’s coordinates. If we prune these, the whole world can wobble.
  • Example: In a classroom scan, the very first frame defines the origin and axes; we keep all its tokens forever.
  2. For each new frame t (t ≄ 2), collect candidate tokens:
  • What happens: Encode frame t to produce new keys/values for all layers/heads. Concatenate these with the existing mutable cache (not the anchor).
  • Why: We need to consider both new info and older, possibly redundant info.
  • Example: Frame 23 sees the whiteboard again; its tokens join the candidate pile.
  3. Normalize keys and compute the reference (mean key) per head:
  • What happens: L2-normalize every key vector so we compare directions only. Compute the mean (average) key of the candidate set per layer/head.
  • Why: This gives a stable center to measure how different each key is.
  • Example: If most candidate keys describe the same front-row desks, their mean drifts toward that ā€œdeskā€ direction.
  4. Score diversity for each token (attention-agnostic):
  • What happens: For each candidate key, compute the diversity score as negative cosine similarity to the mean key. Higher means more different.
  • Why: We want to keep tokens that add new geometric viewpoints and drop near-duplicates.
  • Example: A token from a new camera angle on the door earns a high score; a token showing the same desk angle gets a low score.
  5. Assign layer-wise budgets adaptively:
  • What happens: Compute each layer’s average diversity. Use a softmax (with temperature) to turn these averages into proportions of a total budget, then into a per-layer and per-head Top-K.
  • Why: Some layers capture more useful variation (often mid layers). Give them more room. Early/late layers typically need less.
  • Example: Layer 8 may get 2Ɨ the budget of layer 2 if it’s consistently more diverse.
  6. Prune and roll the memory:
  • What happens: For each layer/head, keep the anchor set plus the Top-K highest-diversity candidate tokens. Discard the rest. The cache size remains fixed.
  • Why: This locks memory use, prevents OOM, and keeps the cache fresh and informative.
  • Example: If the per-head budget is 512 tokens and there are 1,200 candidates, keep the 512 most diverse and drop 688.
  7. Run causal attention with FlashAttention:
  • What happens: With the pruned KV cache, compute attention for the current frame efficiently. No attention matrices need to be materialized for pruning.
  • Why: Maintaining compatibility with FlashAttention keeps inference fast and memory-light.
  • Example: Processing a 6,000-frame video stays smooth, never exceeding the VRAM cap.
  8. Output predictions for frame t:
  • What happens: The model outputs camera pose, depth map, 3D point map, and tracking features.
  • Why: These are the core 3D geometry outputs for mapping and navigation.
  • Example: For a dormitory scene, you get the bed’s depth, the camera’s precise position, and a growing, globally aligned 3D cloud.
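
Putting steps 2–6 together, here is a minimal per-frame rolling-memory update for a single layer and head. It simplifies away multi-head bookkeeping and the adaptive budget computation, and the names, shapes, and fixed budget are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def roll_memory(anchor_k, anchor_v, cache_k, cache_v, new_k, new_v, budget: int):
    """One rolling-memory update for a single layer/head (illustrative sketch).

    anchor_*: immutable first-frame tokens, never pruned.
    cache_*:  mutable tokens retained from previous frames.
    new_*:    tokens produced by the current frame.
    budget:   maximum number of mutable tokens kept after pruning.
    """
    # Step 2: candidate set = old mutable cache + current frame's tokens.
    cand_k = torch.cat([cache_k, new_k], dim=0)
    cand_v = torch.cat([cache_v, new_v], dim=0)

    # Steps 3-4: attention-agnostic diversity score in key space.
    keys_n = F.normalize(cand_k, dim=-1)
    mean_key = F.normalize(keys_n.mean(dim=0), dim=-1)
    scores = -(keys_n @ mean_key)          # higher = more different from the average view

    # Step 6: keep the Top-K most diverse candidates; the anchor is always kept.
    keep = scores.topk(min(budget, cand_k.shape[0])).indices
    cache_k, cache_v = cand_k[keep], cand_v[keep]

    # Step 7 would run FlashAttention against [anchor | pruned cache]; here we just
    # return the bounded memory that the next frame will attend to.
    full_k = torch.cat([anchor_k, cache_k], dim=0)
    full_v = torch.cat([anchor_v, cache_v], dim=0)
    return cache_k, cache_v, full_k, full_v

# Usage sketch: memory is bounded at |anchor| + budget tokens for any stream length.
dim, budget = 64, 512
anchor_k = anchor_v = torch.randn(1000, dim)       # frame 1: kept forever
cache_k = cache_v = torch.empty(0, dim)
for t in range(2, 6):                              # frames 2..5 of an endless stream
    new_k = new_v = torch.randn(1000, dim)
    cache_k, cache_v, full_k, full_v = roll_memory(
        anchor_k, anchor_v, cache_k, cache_v, new_k, new_v, budget)
    print(t, full_k.shape[0])                      # always 1000 (anchor) + 512 (budget)
```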

Concrete mini-example with numbers:

  • Suppose each frame contributes about 1,000 key/value tokens per layer/head. Without pruning, 5,000 frames ā‰ˆ 5 million cached tokens per layer/head—OOM.
  • With rolling memory: per head budget = 512 tokens; 16 heads → 8,192 tokens per layer; 12 layers → ~98k tokens total (plus anchor). This stays constant even at 10,000 frames.
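
The layer-wise budget allocation can be sketched as a temperature-scaled softmax over each layer's mean diversity. The constants mirror the mini-example above; the temperature and rounding scheme are assumptions, not the paper's exact settings.

```python
import torch

def allocate_layer_budgets(mean_diversity: torch.Tensor, total_budget: int,
                           temperature: float = 0.1) -> torch.Tensor:
    """Turn per-layer mean diversity into per-layer token budgets (illustrative)."""
    proportions = torch.softmax(mean_diversity / temperature, dim=0)
    return (proportions * total_budget).round().long()

num_layers, heads, per_head = 12, 16, 512
total_budget = num_layers * heads * per_head     # ~98k mutable tokens overall
mean_diversity = torch.rand(num_layers)          # stand-in for measured layer diversity
budgets = allocate_layer_budgets(mean_diversity, total_budget)
print(budgets, budgets.sum())                    # more diverse layers get bigger shares
```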

The secret sauce:

  • Attention-agnostic pruning: Decide what to keep using cosine similarity in key space before attention runs, so FlashAttention remains fully usable.
  • Immutable anchor: Guarantees a stable world frame and prevents subtle global drift.
  • Layer-wise adaptivity: Spends memory where variety and useful geometry live, reducing drift and boosting robustness.
  • Training-free: You don’t retrain; you bolt this onto a StreamVGGT-style model and immediately gain infinite-horizon steadiness.

Cautionary note (what breaks without each step):

  • Without the anchor: world alignment can slip, causing cumulative pose and depth drift.
  • Without diversity scoring: the cache fills with duplicates; memory bloats or useful tokens get squeezed out.
  • Without adaptive budgets: you waste space on layers with little variety, starving layers that capture vital geometry.
  • Without causal attention: you’d need future frames, which isn’t streaming.

04 Experiments & Results

The test (what they measured and why):

  • Tasks: Dense 3D reconstruction (point clouds), video depth estimation, and camera pose estimation on long sequences.
  • Datasets: 7-Scenes and NRGBD (classic indoor datasets), Bonn (video depth sequences), and the new Long3D (about 2,000–10,000 frames per sequence, continuous and hard).
  • Metrics: Accuracy (how close predicted points are), Completeness (how much of the scene you cover), Chamfer Distance (average of accuracy and completeness), Normal Consistency (surface smoothness/alignment), and also speed and peak GPU memory.

The competition (baselines):

  • Offline VGGT: great in batches, but OOM on long streams.
  • StreamVGGT: causal streaming, but KV cache grows without bound → OOM.
  • CUT3R and TTT3R: RNN-style persistent state that fits in memory but can forget details over time (drift).
  • Point3R: explicit pointer memory, strong but memory keeps growing on long streams.

The scoreboard (with context):

  • On 7-Scenes and NRGBD (300–500 frames, stride 2):
    • VGGT (offline) and StreamVGGT both OOM on long inputs.
    • InfiniteVGGT achieves top-tier reconstruction: e.g., on 7-Scenes, Accuracy mean ā‰ˆ 0.040 with Normal Consistency competitive or better (ā‰ˆ 0.570), and low Chamfer Distance (e.g., 0.025 median), edging out strong baselines like TTT3R in many cases.
    • Translation: That’s like getting an A when others get Aāˆ’ or B+, while some classmates can’t finish the test at all (OOM).
  • On Bonn (video depth):
    • InfiniteVGGT improves AbsRel error and Ī“<1.25 accuracy across long clips (e.g., AbsRel down to around 0.063–0.072, Ī“<1.25 up to ā‰ˆ 0.964–0.958), outperforming CUT3R/TTT3R.
    • Translation: Crisper per-frame depth and better consistency over time.
  • On Long3D (2k–10k frames, continuous):
    • InfiniteVGGT consistently limits drift and maintains better CD/NC than CUT3R/TTT3R over long horizons.
    • Example scenes (Classroom, Dormitory, Library, Badminton Court, Academic Building): InfiniteVGGT lowers Chamfer Distance and improves Normal Consistency in most cases, with notable gains in large/complex spaces.

Speed and memory (practical wins):

  • Cosine-similarity pruning vs attention-weight pruning:
    • About 120 ms faster per frame (e.g., 0.168 s vs 0.288 s) with lower peak VRAM (e.g., 14.49 GB vs 17.30 GB) in ablations.
    • Translation: It stays zippy and under the memory cap even as frames pile up.

Surprising findings:

  • Even on shorter sequences where the baseline fits, InfiniteVGGT often matches or slightly improves normal consistency. The curated memory isn’t just smaller; it’s also cleaner.
  • The immutable anchor notably reduces drift—removing it worsens accuracy and normal consistency (ablation confirms).
  • The one caveat: mean completeness can lag on some long scenes. The model is picky about quality-over-coverage; future work aims to capture more surfaces without sacrificing stability.

05 Discussion & Limitations

Limitations (honest and specific):

  • Completeness: On some long scenarios, mean completeness can be lower than RNN-style baselines, meaning certain surfaces aren’t fully captured.
  • Rapid motion/jumps: Extremely fast camera moves that reduce viewpoint overlap may stress the diversity scoring and cause temporary coverage gaps.
  • Hyperparameters: Budgets per head/layer and the temperature in the softmax need sensible defaults; extreme settings can under- or over-prune.
  • Anchor dependency: Heavily relying on the first frame assumes it’s a good reference. If the first view is poor, quality can suffer unless you re-anchor.
  • Encoder assumptions: The redundancy property is strongest with DINO-like encoders trained for invariance; very different encoders may change the diversity profile.

Required resources:

  • A modern GPU with FlashAttention support (or equivalent fast attention) to process long streams efficiently.
  • Enough VRAM for the fixed cache size chosen (though strictly bounded, the budget must be sized for your resolution and layers/heads).

When NOT to use it:

  • Very short sequences with fixed batches: classic VGGT offline may be simpler and slightly higher fidelity.
  • Ultra-dynamic scenes with many sudden changes and little redundancy: rolling memory may need larger budgets or adaptive re-anchoring.
  • If you must use attention-weight-based pruning specifically: this method is intentionally attention-agnostic.

Open questions:

  • Learning the pruning policy: Could a small learned module predict diversity or even plan re-anchoring automatically?
  • Value compression: We prune keys/values by diversity of keys; can we further compress values without hurting accuracy?
  • Multi-anchor strategy: Can we promote new anchors safely when entering very different sub-scenes?
  • Multimodal fusion: How would IMU/LiDAR cues guide diversity and budget allocation to boost completeness?
  • Theory: Can we bound long-term drift given cache size, motion patterns, and diversity statistics?

06 Conclusion & Future Work

Three-sentence summary: InfiniteVGGT keeps a streaming 3D system both stable and memory-bounded by rolling its memory: it always preserves the first-frame anchor and keeps only the most diverse tokens per layer using an attention-agnostic cosine-similarity score. This design stays compatible with FlashAttention, runs training-free, and outperforms strong streaming baselines on long sequences across multiple datasets. The authors also introduce Long3D, enabling rigorous tests on sequences up to about 10,000 frames.

Main achievement: Turning the long-standing memory-vs-stability trade-off into a win-win by inventing a diversity-aware, layer-adaptive rolling KV cache that truly enables infinite-horizon streaming.

Future directions: Improve completeness while keeping stability, explore learned or multimodal pruning policies, support safe multi-anchor updates for scene shifts, and extend to broader tasks (e.g., dynamic objects, open-vocabulary 3D).

Why remember this: It’s a simple, elegant shift—from reading heavy attention maps to using a quick diversity score—that unlocks endless 3D understanding without retraining or breaking speed. InfiniteVGGT shows that careful memory design can be as powerful as bigger models, and Long3D gives the community a fair way to prove it over truly long horizons.

šŸž Hook (Long3D Benchmark): Like running a marathon instead of a sprint—you find out who can really go the distance. 🄬 The Concept (Long3D Benchmark): A new test with continuous sequences up to about 10,000 frames to measure true long-term 3D stability.

  • How it works:
    1. Provide uninterrupted RGB streams in diverse scenes.
    2. Align predictions to ground truth with ICP.
    3. Score with Accuracy, Completeness, Chamfer Distance, and Normal Consistency.
  • Why it matters: It finally lets us measure infinite-horizon performance instead of guessing from short clips. šŸž Anchor: A library walk-through of thousands of frames reveals which method drifts, which crashes, and which keeps a clean 3D map to the end.
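
For reference, here is one common way to compute the reconstruction metrics named above between an (already aligned) predicted and ground-truth point cloud; the exact Long3D protocol (ICP settings, point sampling, any distance thresholds) may differ.

```python
import torch

def reconstruction_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """pred, gt: (N, 3) and (M, 3) point clouds, assumed already ICP-aligned."""
    d = torch.cdist(pred, gt)                  # (N, M) pairwise distances
    accuracy = d.min(dim=1).values.mean()      # predicted point -> nearest GT point
    completeness = d.min(dim=0).values.mean()  # GT point -> nearest predicted point
    chamfer = 0.5 * (accuracy + completeness)  # average of the two
    return {"accuracy": accuracy.item(),
            "completeness": completeness.item(),
            "chamfer": chamfer.item()}

pred = torch.randn(2000, 3)
gt = pred + 0.01 * torch.randn(2000, 3)        # a nearly perfect reconstruction
print(reconstruction_metrics(pred, gt))        # all three metrics stay small
```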

Practical Applications

  • AR indoor navigation that maintains stable anchors over hours-long sessions.
  • Warehouse and factory robots that build accurate, drift-resistant maps during full shifts.
  • Drones scanning large buildings or bridges continuously without memory overflow.
  • Real-time 3D capture for film and game studios across long takes.
  • Construction site progress monitoring with reliable, long-duration mapping.
  • Emergency response robots maintaining situational awareness throughout extended missions.
  • Room-scale to building-scale digital twins captured in a single continuous pass.
  • Persistent SLAM in consumer devices (phones, headsets) that avoids crashing or drifting.
  • Autonomous retail inventory scanning with stable long-horizon 3D perception.
  • Museum or campus tours streamed into consistent, navigable 3D guides.
#InfiniteVGGT Ā· #rolling memory Ā· #causal attention Ā· #KV cache pruning Ā· #cosine similarity Ā· #diversity score Ā· #FlashAttention Ā· #streaming 3D reconstruction Ā· #VGGT Ā· #Long3D benchmark Ā· #online visual geometry Ā· #depth estimation Ā· #camera pose estimation Ā· #token redundancy Ā· #layer-wise budget allocation