Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
Key Summary
- Fast-FoundationStereo is a stereo vision system that sees depth from two cameras in real time while still working well on brand-new scenes it was never trained on.
- It speeds up a slow but very accurate "foundation" teacher model by teaching a smaller student to mimic it (knowledge distillation).
- It automatically redesigns the heavy middle part of the network using blockwise neural architecture search to meet a chosen time budget without losing much accuracy.
- It trims extra parts from the final polishing module (structured pruning) and then retrains so quality bounces back.
- The team also created 1.4 million real-world stereo pairs with automatic pseudo-labels, filtering bad labels by checking shape consistency, to make the student robust.
- On common datasets (Middlebury, ETH3D, KITTI, Booster), it beats other real-time methods by a large margin and gets close to much slower foundation models.
- It runs over 10× faster than FoundationStereo (about 49 ms per image pair on an RTX 3090, ~21 ms with TensorRT), yet keeps strong zero-shot accuracy.
- The divide-and-conquer design (distill the backbone, search the cost-filter blocks, prune the refiner) turns a research model into something practical for robots and AR.
- Results are especially strong on hard surfaces like glass and shiny objects, where many fast methods fail.
- This approach shows we don't have to choose between speed and smarts: with the right tricks, we can have both.
Why This Research Matters
Depth that is both fast and reliable unlocks safer robots, smoother AR, and more responsive drones. Many real-world places (warehouses, streets, homes) change constantly, so zero-shot generalization saves costly per-site fine-tuning. A 10× speedup means systems can react in time to avoid obstacles and users don't feel lag. The pseudo-label pipeline taps into the internet's diversity to prepare models for odd lighting, reflections, and new layouts. The divide-and-conquer strategy is a reusable recipe to turn other heavy vision foundation models into practical tools. Overall, this brings cutting-edge 3D perception from the lab into everyday devices and applications.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how your two eyes help you tell how far away things are? With two pictures, your brain guesses depth so you don't bump into stuff.
The Concept (Stereo Matching): Stereo matching is how a computer figures out depth from a left and a right photo of the same scene.
- What it is: A method that finds how much each point shifts between the two images (called disparity) to compute depth.
- How it works:
- Look at a small patch in the left image.
- Slide along the same row in the right image to find the best match.
- The slide distance is the disparity; bigger shift usually means closer.
- Why it matters: Without stereo matching, robots, cars, and AR glasses can't understand 3D space from cameras alone. Anchor: Like matching two almost-identical "spot the difference" pictures: where the object appears shifted more, it's closer to you.
Hook: Imagine measuring how "off" two puzzle pieces are when you try to place them together.
The Concept (Disparity): Disparity is the left-right horizontal shift of a point between the two images.
- What it is: A number per pixel showing how far a scene point moved from left to right view.
- How it works:
- For each pixel in the left image, search possible positions in the right image.
- Pick the best match; the offset is the disparity.
- Convert disparity to depth using the camera setup.
- Why it matters: Disparity is the bridge from 2D images to 3D distances. Anchor: Hold your finger in front of your face and blink each eye: your finger jumps more when it's close (big disparity) and barely moves when it's far (small disparity).
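To make the disparity-to-depth conversion concrete, here is a minimal sketch using the standard rectified-stereo relation depth = focal length × baseline / disparity. The focal length and baseline numbers are made-up examples, not values from the paper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (meters) for a
    rectified stereo pair: depth = focal * baseline / disparity."""
    disparity_px = np.asarray(disparity_px, dtype=np.float32)
    return focal_px * baseline_m / np.maximum(disparity_px, eps)

# Example: with a 700 px focal length and a 12 cm baseline,
# disparity 50 px -> ~1.68 m (near), disparity 5 px -> ~16.8 m (far).
print(disparity_to_depth(np.array([50.0, 5.0]), focal_px=700.0, baseline_m=0.12))
```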
Hook: Imagine tasting ingredients in many combinations to see which blend tastes best.
The Concept (Cost Volume): A cost volume is a big 3D grid that stores how well every pixel matches at many possible shifts.
- What it is: A stack of âmatch scoresâ for each pixel over many disparities.
- How it works:
- Extract features from both images.
- Compare left and right features across candidate disparities.
- Store the match quality at each disparity in a volume.
- Why it matters: Without the cost volume, the network can't reason globally about which shifts make sense together. Anchor: It's like a spreadsheet where each row is a pixel, each column is a possible shift, and the numbers show how good the match is.
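Here is a toy NumPy sketch of that "spreadsheet": it scores every pixel at every candidate shift using absolute grayscale difference, then picks the best shift per pixel. Real systems, including the one in this paper, compare learned features rather than raw pixels, so treat this purely as an illustration.

```python
import numpy as np

def sad_cost_volume(left_gray, right_gray, max_disp=64):
    """Toy cost volume: cost[d, y, x] = |left(y, x) - right(y, x - d)|.
    Lower cost means a better match at shift (disparity) d."""
    left = left_gray.astype(np.float32)
    right = right_gray.astype(np.float32)
    H, W = left.shape
    cost = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # Compare left pixels with right pixels shifted d columns.
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : W - d])
    return cost

def winner_take_all(cost):
    """For each pixel, pick the disparity with the lowest matching cost."""
    return np.argmin(cost, axis=0).astype(np.float32)
```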
Hook: Picture taking a test in a subject you never studied, but you still ace it because you understand the basics deeply.
The Concept (Zero-Shot Generalization): Zero-shot generalization means solving new tasks or scenes without extra training on them.
- What it is: Robust performance on unfamiliar data distributions.
- How it works:
- Learn broad rules and patterns from diverse sources.
- Avoid overfitting to a single domain.
- Use priors that transfer to new places.
- Why it matters: Collecting perfect labels everywhere is impossible; models must work off-the-shelf in the wild. Anchor: A robot moving from your classroom to a factory and still navigating safely on day one.
Hook: Think of a giant Swiss-army knife of vision tricks learned from the whole internet.
The Concept (Vision Foundation Models): These are large models trained on huge datasets that capture general visual knowledge.
- What it is: Pretrained backbones that "know" edges, shapes, textures, and depth cues.
- How it works:
- Train at massive scale on varied images.
- Learn features useful for many tasks.
- Adapt or guide smaller models.
- Why it matters: They bring strong priors but are usually heavy and slow. Anchor: Like borrowing the wisdom of a world-traveling tour guide when you visit a new city.
Hook: Imagine a self-driving car that must react faster than a blink.
The Concept (Real-Time Inference): Real-time means the model runs fast enough to keep up with live video.
- What it is: Processing each frame within tight time limits (e.g., 30–60 fps).
- How it works:
- Use efficient backbones and ops.
- Keep memory and compute small.
- Optimize for the specific hardware.
- Why it matters: Slow models miss events and can be unsafe or unusable. Anchor: A drone dodging a branch needs depth now, not a second later.
The world before: Researchers had two roads. Road A used foundation models to generalize amazingly to new scenes but was too slow for robots or AR. Road B built very fast stereo networks using lightweight parts and local refiners, but these often failed outside their training domains unless you fine-tuned them for each new place. Getting dense, high-quality real-world depth labels at scale is hard, so fast methods stayed fragile.
The problem: Can we keep the superpower of zero-shot generalization and still meet strict real-time speed?
Failed attempts: Simply pruning big models hurt accuracy a lot; directly redesigning the heavy middle blocks by hand was guesswork; training from scratch ignored the teacher's wisdom.
The gap: We needed a way to compress the teacher's knowledge into a fast student, automatically reshape the heaviest parts under a time budget, and slim the final refiner, all without throwing away what makes the teacher robust. Plus, we needed a flood of realistic, diverse training pairs without paying for ground-truth depth.
Real stakes: This matters for delivery robots, AR headsets, drones, and assistance systems that must understand 3D now, anywhere, safely. Fast-FoundationStereo fills that gap by combining smart teaching, smart searching, smart trimming, and smart data curation.
02 Core Idea
Hook: Imagine learning a dance from a pro, then rearranging moves to fit a 1-minute stage limit, and finally cutting extra flourishes so you can perform fast and clean.
The Concept (Fast-FoundationStereo): It's a stereo system that keeps a foundation model's brains but runs at real-time speed by teaching, searching, and trimming.
- What it is: A student network distilled from a strong teacher, with an auto-designed cost filter and a pruned refiner.
- How it works:
- Knowledge distillation compresses a hybrid backbone into one efficient student.
- Blockwise neural architecture search finds fast-but-good cost-filter blocks under a time budget.
- Structured pruning removes redundancy in the iterative refiner and then retraining restores quality.
- A huge set of in-the-wild pseudo-labels toughens the student for zero-shot use.
- Why it matters: Without this combo, you either stay slow (great but impractical) or fast (but brittle). This gives both. Anchor: It's like turning a gourmet recipe into a weeknight dinner that still tastes amazing and cooks in 15 minutes.
Aha! moment in one sentence: Don't hand-tune everything: distill what's essential, search where it's heavy, and prune where it's redundant, all while feeding on massive, carefully filtered real-world data.
Three analogies:
- Chef analogy: Copy the master's flavor (distillation), pre-plan the fastest cooking steps (search), trim garnishes (pruning).
- Sports analogy: Learn from a star coach (distill), pick plays that fit the shot clock (search), cut drills that don't add performance (prune).
- Travel analogy: Get a condensed guidebook (distill), map the quickest route with stops (search), drop detours (prune).
Before vs. After:
- Before: FoundationStereo was super accurate but slow; fast models were quick but fragile.
- After: Fast-FoundationStereo keeps most of the accuracy while running over 10× faster, and it holds up on new scenes.
Why it works (intuition):
- The backbone's knowledge is where generalization lives. Distilling hybrid monocular+stereo priors into a single student keeps that wisdom without the bulk.
- The cost volume filter is the heaviest piece. Searching for the best small blocks under a runtime budget finds non-obvious, efficient designs humans might miss.
- The final refiner repeats similar computations; many channels contribute little. Removing weak parts (with structure awareness) and retraining keeps quality.
- Massive pseudo-labels from real videos (carefully filtered by normal consistency) give variety and realism that synthetic data lacks.
Building blocks (each with a mini "sandwich"):
Hook: Think of a wise teacher tutoring a student to solve problems faster. The Concept (Knowledge Distillation): A small network learns to mimic a big one.
- What it is: Transfer of representations and outputs from teacher to student.
- How it works: (1) Freeze teacher; (2) Train student features to match teacher features; (3) Use loss (e.g., MSE) to align them.
- Why it matters: Keeps smarts, loses weight. Anchor: Like studying a solved test key to learn how to think, not just memorize answers.
Hook: Imagine building a Lego castle room by room and trying alternatives for each room to save time. The Concept (Blockwise Neural Architecture Search): Auto-finds the best block design under a time budget.
- What it is: Divide the heavy module into blocks, train many candidates per block, then pick the best combo.
- How it works: (1) Propose fast block variants; (2) Distill each to match the teacherâs local output; (3) Combine blocks to fit a latency budget with the least accuracy drop.
- Why it matters: Humans can't explore huge design spaces efficiently; search can. Anchor: Like choosing the fastest kitchen layout by testing counters, shelves, and appliances separately, then assembling the best set.
Hook: Picture pruning a tree's branches to help it grow stronger and lighter. The Concept (Structured Pruning): Remove whole channels/layers that add little.
- What it is: A way to slim networks in hardware-friendly chunks.
- How it works: (1) Build a dependency graph; (2) Rank importance via gradients; (3) Cut the least important parts; (4) Retrain.
- Why it matters: Fewer, stronger parts run faster with minimal quality loss. Anchor: Like cleaning your backpack by tossing items you never use so you can move faster.
Hook: Suppose you practice piano with songs chosen by a smart app that tosses out confusing, mislabeled sheets. The Concept (Pseudo-Labeling with Normal Consistency): Auto-create training labels and filter bad ones by checking 3D shape.
- What it is: Generate disparity labels from a teacher and keep only those consistent with monocular depth after surface-normal checks.
- How it works: (1) Teacher predicts disparity; (2) A mono model predicts depth; (3) Convert both to normals; (4) Keep pixels where normals agree; (5) Mask sky via segmentation.
- Why it matters: Gives millions of realistic examples without paying for ground truth. Anchor: Like keeping only the flashcards that agree across two textbooks, and skipping the sky because it's "infinitely far."
03 Methodology
High-level recipe: Input (left/right images) → Feature extraction (distilled backbone) → Cost volume build → Cost filtering (searched blocks) → Initial disparity → Iterative refinement (pruned ConvGRU) → Output disparity.
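Before the step-by-step walkthrough, here is a structural sketch of that recipe in Python. The `model` attribute names (backbone, cost_filter, refiner, and so on) are hypothetical placeholders to show how the stages connect, not the authors' actual API.

```python
def fast_foundation_stereo_forward(left, right, model, num_refine_iters=8):
    """Structural sketch of the pipeline above (attribute names are illustrative)."""
    # 1) Distilled student backbone extracts features from both views.
    feat_left, feat_right, context = model.backbone(left, right)
    # 2) Build the cost volume over candidate disparities.
    volume = model.build_cost_volume(feat_left, feat_right)
    # 3) Filter the volume with the searched lightweight blocks.
    filtered = model.cost_filter(volume)
    # 4) Turn the filtered volume into an initial disparity map.
    disparity = model.initial_disparity(filtered)
    # 5) Iteratively polish the estimate with the pruned ConvGRU refiner.
    hidden = model.init_hidden(context)
    for _ in range(num_refine_iters):
        hidden, disparity = model.refiner(hidden, context, filtered, disparity)
    return disparity
```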
Step 1: Feature extraction (distilled hybrid priors)
- What happens: Replace the teacher's dual backbone (monocular foundation + side-tuned CNN) with one efficient student.
- Why it exists: The teacher backbone is a big speed bottleneck; we need the priors without the bulk.
- How it works (like a recipe):
- Freeze the teacher's two-part backbone.
- Train the student to match the teacher's multi-scale features (use MSE; add a linear projection if channels differ), as sketched in the code after this step.
- Feed both left and right images during training so the student "sees" stereo statistics.
- Example: At 1/8 image scale, if the teacher's feature for a shiny door edge has strong contrast, the student learns to produce a similar strong edge response.
- What breaks without it: You either stay slow (use teacher) or lose generalization (naive lightweight backbone).
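A minimal PyTorch sketch of that distillation loss, assuming the student and teacher each expose a list of per-scale feature maps; the 1×1 projection layers handle channel mismatches. Channel counts and scale choices are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """Match student features to frozen teacher features at each scale."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # One 1x1 projection per scale, only where channel counts differ.
        self.proj = nn.ModuleList(
            nn.Conv2d(s, t, kernel_size=1) if s != t else nn.Identity()
            for s, t in zip(student_channels, teacher_channels)
        )

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for proj, fs, ft in zip(self.proj, student_feats, teacher_feats):
            loss = loss + F.mse_loss(proj(fs), ft.detach())  # teacher stays frozen
        return loss / len(student_feats)
```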
Step 2: Cost volume construction
- What happens: Build a match-score stack across disparities using groupwise correlation plus concatenation features.
- Why it exists: The volume lets the network compare many possible shifts for each pixel in one go.
- How it works:
- For each disparity d (e.g., up to 192), compare left and right features.
- Store similarity and combined features as a "slice."
- Stack slices into a 4D volume (channels × disparities × height × width).
- Example: A pixel on a near table might have its best score around disparity 50; a far wall peaks near disparity 5.
- What breaks without it: The model would guess shifts locally and get confused on repetitive textures or lowâtexture areas.
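Here is a hedged PyTorch sketch of the groupwise-correlation part of that volume (the concatenation features are omitted for brevity). The group count and maximum disparity are illustrative defaults, not the paper's settings.

```python
import torch

def groupwise_correlation_volume(feat_left, feat_right, max_disp=48, num_groups=8):
    """Group-wise correlation cost volume with shape B x G x D x H x W.

    Channels are split into G groups; each group's left/right features are
    correlated at every candidate disparity d (right features shifted by d).
    """
    B, C, H, W = feat_left.shape
    assert C % num_groups == 0, "channels must divide evenly into groups"
    fl = feat_left.view(B, num_groups, C // num_groups, H, W)
    fr = feat_right.view(B, num_groups, C // num_groups, H, W)

    volume = feat_left.new_zeros(B, num_groups, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = (fl * fr).mean(dim=2)
        else:
            volume[:, :, d, :, d:] = (fl[..., d:] * fr[..., :-d]).mean(dim=2)
    return volume
```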
Step 3: Cost filtering via blockwise search
- What happens: Replace the teacher's heavy hourglass + transformer cost filter with a searched set of faster blocks.
- Why it exists: This is the heaviest module; trimming here brings big speedups.
- How it works:
- Split the filter into N blocks (e.g., downsample convs, APC layers, upsample convs, and a disparity-transformer block).
- For each block, create many faster candidates (vary channels, layers, heads, etc.).
- Distill each candidate to match the teacher block's output given the teacher's previous block output.
- Measure each candidate's accuracy change (Δerror) and time change (Δtime) when swapped into the full teacher pipeline.
- Pick one candidate per block to meet a total time budget with minimal error rise (a simple budgeted selection).
- Example with data: If Block 3 candidate A is +0.2% error but −6 ms, and candidate B is −0.6% error but +5 ms, the search balances these across all blocks to fit, say, a roughly 40 ms total budget.
- What breaks without it: Hand-tuning misses good designs; naive pruning inside already-small volumes hurts a lot.
- Secret sauce #1: Distill per block to reduce search from exponential to linear in the number of candidates per block, making the search practical and parallel.
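The budgeted selection at the end of Step 3 can be illustrated with a tiny brute-force search over per-block candidates. The candidate names, Δerror, and Δtime numbers below are hypothetical, and the exhaustive loop is only reasonable because per-block distillation keeps each candidate list short.

```python
from itertools import product

def select_blocks(candidates_per_block, time_delta_budget_ms):
    """Pick one candidate per block, minimizing the total error increase while
    keeping the summed runtime change within the budget.

    candidates_per_block: per block, a list of (name, delta_error, delta_time_ms).
    """
    best_combo, best_error = None, float("inf")
    for combo in product(*candidates_per_block):
        total_time = sum(c[2] for c in combo)
        total_error = sum(c[1] for c in combo)
        if total_time <= time_delta_budget_ms and total_error < best_error:
            best_combo, best_error = combo, total_error
    return best_combo

# Hypothetical numbers in the spirit of the example above.
blocks = [
    [("b1_small", 0.3, -4.0), ("b1_keep", 0.0, 0.0)],
    [("b2_small", 0.5, -7.0), ("b2_keep", 0.0, 0.0)],
    [("b3_a", 0.2, -6.0), ("b3_b", -0.6, 5.0)],
]
print(select_blocks(blocks, time_delta_budget_ms=-10.0))  # must save >= 10 ms overall
```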
Step 4: Initial disparity prediction
- What happens: The final cost-filter block turns the filtered volume into an initial disparity map.
- Why it exists: Provides a strong starting point for refinement.
- How it works:
- Convert volume scores into a probability over disparities per pixel.
- Take an expected value or argmax to get initial disparity.
- Example: A pixel's distribution peaks at disparity 22 with a tight spread, yielding a confident initial estimate.
- What breaks without it: The refiner has nothing reliable to polish, slowing convergence and hurting quality.
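A common way to implement the "expected value" option is a soft argmax over the disparity dimension; this sketch assumes the filtered volume has already been reduced to one score per disparity per pixel.

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(scores):
    """Initial disparity from a score volume of shape B x D x H x W.

    Scores become a per-pixel probability over disparities (softmax), and the
    expected disparity gives a sub-pixel estimate.
    """
    B, D, H, W = scores.shape
    prob = F.softmax(scores, dim=1)
    disp_values = torch.arange(D, device=scores.device, dtype=prob.dtype)
    return (prob * disp_values.view(1, D, 1, 1)).sum(dim=1)  # B x H x W
```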
Step 5: Iterative refinement (pruned ConvGRU)
- What happens: A recurrent ConvGRU improves the disparity step by step using context features and indexed volume cues.
- Why it exists: Local details, edges, and hard regions (like glass) benefit from iterative polishing.
- How it works:
- Start with a hidden feature h_0 (from a context network) and an initial disparity d_0.
- At each iteration k: warp/index features, encode motion cues, update h_k and d_k with ConvGRU gates.
- Repeat K times (e.g., 8 iterations).
- Pruning details:
- Build a dependency graph that knows which channels must stay matched (e.g., inputs consuming h_{k-1} and outputs producing h_k).
- Use gradientâbased importance to rank parameters globally.
- Remove the least important α fraction in structured chunks (whole channels/filters).
- Retrain only the refiner (others frozen) with a loss that supervises later iterations more and distills intermediate features.
- Example with data: Aggressive pruning first drops accuracy; after retraining, most loss is recovered, revealing redundancy.
- What breaks without it: You keep unnecessary compute at each iteration and miss big speed gains.
- Secret sauce #2: Prune with recurrence-aware constraints so channel sizes remain consistent across time steps.
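A simplified sketch of the gradient-based importance ranking from the pruning details above, assuming a backward pass has already filled in gradients. It scores output channels with a first-order (weight × gradient) criterion and returns the least important fraction per layer; the dependency-graph bookkeeping that keeps the ConvGRU's hidden-state width consistent across iterations is noted but omitted.

```python
import torch
import torch.nn as nn

def channel_importance(conv: nn.Conv2d):
    """First-order Taylor importance: |weight * grad| summed per output channel."""
    w, g = conv.weight, conv.weight.grad
    return (w * g).abs().sum(dim=(1, 2, 3))

def channels_to_prune(importance_per_layer, prune_fraction=0.3):
    """Rank channels globally and return, per layer, indices below the cutoff.

    A real implementation must also respect a dependency graph (e.g., layers
    consuming h_{k-1} and producing h_k must keep matching widths).
    """
    scores = torch.cat(list(importance_per_layer.values()))
    k = max(int(prune_fraction * scores.numel()), 1)
    threshold = scores.kthvalue(k).values
    return {
        name: (imp <= threshold).nonzero(as_tuple=True)[0]
        for name, imp in importance_per_layer.items()
    }
```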
Step 6: Pseudo-labeling with normal consistency (data curation)
- What happens: Build a 1.4M in-the-wild stereo training set with filtered pseudo-labels.
- Why it exists: Real scenes are diverse; synthetic-only training doesn't cover glass, rain, odd lighting, etc.
- How it works:
- Teacher predicts stereo disparity.
- A monocular model predicts depth from the left image.
- Convert both to surface normals; compute per-pixel normal agreement (see the sketch after this step).
- Drop frames/pixels with poor agreement; zero out sky via segmentation (infinite depth).
- Use the remaining disparities as supervision (optionally mask by consistency).
- Example: A city scene with reflective windows passes the normal check on walls and roads but rejects messy reflections.
- What breaks without it: You either overfit synthetic patterns or learn from noisy labels that teach the wrong lessons.
- Secret sauce #3: Checking agreement in normal space is robust to scale differences and weird depth ranges.
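The filtering idea can be sketched with NumPy: derive normals from each depth map with a rough finite-difference approximation, compare them with cosine similarity, and drop pixels that disagree or belong to sky. The cosine threshold and the normal approximation are illustrative simplifications, not the paper's exact recipe.

```python
import numpy as np

def normals_from_depth(depth, fx, fy):
    """Rough surface normals from a depth map via finite differences
    (an approximation that is most accurate near the image center)."""
    dz_dx = np.gradient(depth, axis=1) * fx / np.maximum(depth, 1e-6)
    dz_dy = np.gradient(depth, axis=0) * fy / np.maximum(depth, 1e-6)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_mask(stereo_depth, mono_depth, fx, fy,
                            sky_mask=None, cos_thresh=0.85):
    """Keep pixels whose stereo- and mono-derived normals roughly agree."""
    agreement = np.sum(
        normals_from_depth(stereo_depth, fx, fy) *
        normals_from_depth(mono_depth, fx, fy), axis=-1)  # cosine similarity
    mask = agreement > cos_thresh
    if sky_mask is not None:
        mask &= ~sky_mask  # sky is "infinitely far"; exclude it from labels
    return mask
```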
Secret Sauce summary:
- Per-block distillation makes huge architecture spaces searchable.
- Recurrence-aware structured pruning unlocks safe slimming of the refiner.
- Normal-based filtering turns noisy web data into high-value training fuel.
Output: A real-time disparity map that preserves much of the teacher's zero-shot strength while fitting strict latency budgets.
04 Experiments & Results
Hook: Imagine a race where sprinters must also solve puzzles mid-run. The winner is fast and smart.
The Concept (The Test Setup): We measure both accuracy on new scenes and how fast the model runs.
- What it is: Zero-shot evaluation on standard datasets plus runtime on the same GPU.
- How it works:
- Use no training data from the target test splits.
- Compare errors (like BP-X, D1) across datasets.
- Profile runtime on an RTX 3090 (and with TensorRT acceleration).
- Why it matters: Real systems need both brains (accuracy) and legs (speed). Anchor: It's like grading both how many questions you get right and how quickly you finish the test.
Datasets and metrics:
- Middlebury (indoor, high-res, high-quality ground truth), ETH3D (indoor/outdoor grayscale), KITTI 2012/2015 (driving with LiDAR ground truth), Booster (shiny/transparent surfaces).
- Metrics: BP-X (percent of pixels with error > X px), D1 (KITTI: error > 3 px and > 5% of GT disparity). We evaluate on non-occluded regions.
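For readers who want the metrics pinned down, here is a small NumPy sketch of BP-X and D1 as defined above; `valid_mask` stands in for the non-occluded pixels that have ground truth.

```python
import numpy as np

def bp_x(pred_disp, gt_disp, x, valid_mask):
    """BP-X: percentage of valid pixels with absolute disparity error > X px."""
    err = np.abs(pred_disp - gt_disp)[valid_mask]
    return 100.0 * np.mean(err > x)

def d1(pred_disp, gt_disp, valid_mask):
    """KITTI D1: outlier if error > 3 px AND > 5% of the ground-truth disparity."""
    err = np.abs(pred_disp - gt_disp)[valid_mask]
    gt = gt_disp[valid_mask]
    return 100.0 * np.mean((err > 3.0) & (err > 0.05 * gt))
```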
Competition:
- Slow but strong: FoundationStereo, MonSter, Zero-RAFT-Stereo, StereoAnywhere, DEFOM-Stereo.
- Real-time baselines: RT-IGEV, LightStereo-L, BANet, IINet (some retrained on the same mixed data, including our pseudo-labels, for fairness).
Scoreboard (context, not just numbers):
- Speed: FoundationStereo takes about 496 ms per frame on an RTX 3090. Fast-FoundationStereo runs in about 49 ms (about 21 ms with TensorRT), over 10× faster. That's like turning a half-second wait into a blink.
- Middlebury-Q (typical real-time resolution): Our BP-2 ≈ 2.12%. This is close to the slow leader's numbers but at real-time speed, and it's far better than prior real-time methods (many land in the 7–20% range). Think of it as getting an A− when everyone else in the fast class got C's.
- ETH3D: Our zero-shot BP-1 falls in roughly the 0.62–1.22% range depending on settings, handily beating other real-time methods trained only on synthetic data and competitive with those retrained on larger sets.
- KITTI 2012/2015: Our D1 is around 2.35–3.25%, which is much closer to slow, heavy models than typical fast ones, and clearly better than prior real-time baselines under fair training.
- Booster (hard, non-Lambertian surfaces): EPE around 1.54 px; we land near strong slow methods and well ahead of fast baselines. This is like reading shiny, crinkly foil where others see glare.
Surprising findings:
- Distilled backbones capture delicate, high-frequency edges and relative depth cues similar to the teacher, even on translucent glass, at a fraction of the cost.
- Naively pruning the cost filter barely helps and hurts accuracy, but blockwise search discovers fast designs that stay sharp.
- The refinement module had lots of redundancy: structured pruning slashed channels yet, after retraining, quality bounced back, proving not all capacity was needed.
- Pseudo-labels from the wild helped every model we tried, not just ours; the normal-consistency filter was key to avoiding trash labels.
Runtime breakdown:
- All three stages sped up: distilled backbone, searched cost filter, and pruned refiner. Add them up, and you exceed a 10× speed boost over the teacher without throwing away zero-shot robustness.
Big picture: Among real-time methods, Fast-FoundationStereo sets a new bar on accuracy while staying genuinely fast. Against big slow models, it closes much of the gap, enough to be practical in latency-critical systems.
05 Discussion & Limitations
Hook: Even superhero suits have weak spots and power needs.
The Concept (Honest Assessment): Know where it shines and where to be careful.
- Limitations:
- Ultra-textureless regions (blank walls under low light) can still challenge the initial match; refinement helps but can't invent texture.
- Extremes outside the training disparity range (very wide baselines or tiny baselines) may need reconfiguration (e.g., max disparity).
- The method relies on a strong teacher and a good monocular model for pseudoâlabels; if those are biased, some bias can transfer.
- Blockwise search optimizes blocks locally; rare long-range interactions between blocks might not be perfectly captured by the proxy.
- TensorRT or similar deployment gains depend on hardware support for chosen ops.
- Required resources:
- A decent GPU for training (distillation, blockwise candidates, pruning+retrain).
- Access to a teacher (e.g., FoundationStereo) and a monocular depth model.
- Storage and bandwidth for 1.4M pseudo-labeled pairs if you reproduce the data scale.
- When not to use:
- If you must run on ultra-tiny MCUs without GPUs/NPUs (consider quantization plus further compression first).
- If you have only monocular cameras (use mono depth or structure-from-motion instead).
- If precision at ultra-high resolutions with very large disparities is the only goal and latency is irrelevant, then a heavier, slower model may eke out a bit more accuracy.
- Open questions:
- Can quantization (e.g., INT8) stack with pruning and searched blocks to reach edge devices while keeping zeroâshot gains?
- How to extend normal-consistency filtering to dynamic scenes with moving objects and rolling-shutter effects?
- Could joint, global search (across blocks) with smarter proxies capture cross-block effects better while staying tractable?
- How to incorporate temporal cues from stereo videos to further stabilize depth without adding latency?
- What are the best fairness checks to prevent bias from teacher/mono models seeping into the student? Anchor: Think of it like a fast, well-trained team that still practices, upgrades equipment, and reviews game footage to improve next season.
06 Conclusion & Future Work
Three-sentence summary: Fast-FoundationStereo keeps the zero-shot smarts of a big foundation stereo model but runs in real time by distilling the backbone, searching the heavy cost-filter blocks, and pruning the iterative refiner. A massive, carefully filtered set of in-the-wild pseudo-labels further strengthens generalization. The result is over 10× speedup with only a small accuracy cost, setting a new real-time state of the art.
Main achievement: Proving that you don't have to pick between speed and robustness: divide-and-conquer acceleration plus smart data curation delivers both for stereo.
Future directions:
- Add quantization to the distilled+searched+pruned pipeline to reach phones, drones, and tiny robots.
- Explore video-aware refinement that uses motion cues without slowing inference.
- Improve search proxies to catch cross-block effects and co-design with hardware compilers.
- Expand pseudo-label checks beyond normals (e.g., photometric or multi-view consistency) and to dynamic scenes.
Why remember this: It's a blueprint for turning heavy foundation vision models into practical, real-time tools: teach the essence, auto-design the heavy parts, trim the rest, and feed on smartly filtered real-world data.
Practical Applications
- On-device AR depth for occlusion and placement that runs smoothly on headsets or phones.
- Warehouse robots navigating cluttered aisles safely without site-specific retraining.
- Drones mapping indoors in real time for inspection or search-and-rescue.
- Driver-assistance systems estimating distance to cars and pedestrians with low latency.
- Home robots recognizing stairs, tables, and obstacles to move safely around pets and people.
- Industrial arms gauging object pose and thickness for precise grasping and assembly.
- 3D scanning apps creating fast meshes of rooms for remodeling or insurance.
- Telepresence robots avoiding collisions while streaming video over limited compute.
- Security cameras estimating depth to filter false alarms (e.g., shadows vs. intruders).
- VR treadmills and simulators rendering reactive, depth-aware scenes with minimal lag.