
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Intermediate
Tjark Behrens, Anton Obukhov, Bingxin Ke et al. · 12/11/2025
arXiv · PDF

Key Summary

  • StereoSpace turns a single photo into a full 3D-style stereo pair without ever estimating a depth map.
  • It teaches a diffusion model to understand camera viewpoint directly, using a shared, simple 'playground' called a canonical stereo space.
  • A special per-pixel camera encoding (Plücker rays) tells the model exactly how each pixel sees the world.
  • A dual U-Net design reuses Stable Diffusion’s visual smarts while learning stereo geometry end-to-end.
  • No warping at inference: the model learns correspondences and fills disocclusions by itself.
  • A fair, leakage-free test protocol evaluates true stereo quality using iSQoE (comfort) and MEt3R (geometry), not just pixel math.
  • StereoSpace consistently beats popular monocular-to-stereo baselines, especially on hard scenes with glass, reflections, and multiple depth layers.
  • Users can set the stereo baseline in real-world units, getting predictable, controllable parallax.
  • Multi-baseline training and viewpoint conditioning make it generalize across cameras and scenes.
  • This work shifts stereo generation from 'depth-then-warp' to 'viewpoint-conditioned generation,' making it more robust and scalable.

Why This Research Matters

StereoSpace makes it much easier to turn ordinary 2D pictures into comfortable, realistic 3D without expensive camera rigs or brittle depth estimation. That unlocks higher-quality 3D movies, VR/AR experiences, and games where parallax feels right and doesn’t strain your eyes. Because the baseline is set in real-world units, creators can dial the 3D strength to fit different screens and audience comfort. The method is more robust to glass, mirrors, and layered scenes that are common in real life, so results hold up outside controlled studios. By avoiding test-time depth, pipelines become simpler and potentially more reliable at scale. This can cut costs for content producers and open 3D storytelling to more people and platforms.

Detailed Explanation


01Background & Problem Definition

🍞 You know how 3D movies give you that ‘pop-out’ feeling because each eye sees a slightly different picture? That tiny difference is what makes your brain feel depth.

🥬 The Concept: Stereo imaging is showing two slightly shifted images to create a 3D illusion.

  • What it is: A stereo pair is two photos taken from viewpoints a small distance apart (the baseline), usually left and right.
  • How it works:
    1. Place two cameras side-by-side (or slide one camera sideways).
    2. Take two pictures; objects shift horizontally depending on their depth (this shift is disparity).
    3. Show the left picture to the left eye and the right picture to the right eye; your brain fuses them into 3D.
  • Why it matters: Without the tiny viewpoint difference, you lose the 3D effect and stereo comfort.

🍞 Example: Hold your thumb up and close one eye at a time; it jumps against the background. That jump is disparity creating depth.

🍞 You know how moving seats in a theater changes what you can see behind someone’s head? A small shift reveals hidden parts.

🥬 The Concept: Disparity (or depth cues) tells how much a point shifts between left and right views.

  • What it is: Disparity is the horizontal pixel shift of an object between the two images; bigger shift usually means closer.
  • How it works:
    1. For each pixel in the left image, find where that same point appears in the right image.
    2. Measure the horizontal shift.
    3. Use the camera’s focal length and baseline to relate shift to depth.
  • Why it matters: Matching points across views is how we feel (and compute) depth; mess up matches, and the 3D feels wrong.

🍞 Example: A nearby cup slides a lot between views; a distant wall barely moves.
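
To make the disparity–depth relation concrete, here is a minimal Python sketch. It assumes a rectified pinhole setup; the function names and numbers are illustrative, not taken from the paper.

```python
# Rectified-stereo relation: disparity = focal_px * baseline_m / depth_m.
# Illustrative values only; assumes an idealized rectified pinhole rig.

def disparity_from_depth(depth_m: float, focal_px: float, baseline_m: float) -> float:
    """Horizontal pixel shift of a point at depth_m meters."""
    return focal_px * baseline_m / depth_m

def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Invert the same relation: closer points produce larger disparities."""
    return focal_px * baseline_m / disparity_px

if __name__ == "__main__":
    f_px, B_m = 700.0, 0.08            # focal length in pixels, 8 cm baseline
    for Z in (0.5, 2.0, 10.0):         # a nearby cup, a table, a far wall
        print(f"depth {Z:5.1f} m -> disparity {disparity_from_depth(Z, f_px, B_m):6.2f} px")
```

The nearby cup at 0.5 m shifts by 112 px, while the wall at 10 m shifts by only 5.6 px, matching the intuition above.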

🍞 Imagine copying a picture onto tracing paper and sliding it sideways to make a second view, then painting in missing parts.

🥬 The Concept: Warp-and-inpaint pipelines estimate depth, warp the image to a new view, then fill holes.

  • What it is: A common way to make the right image is: predict depth, warp the left image forward, then inpaint missing pixels.
  • How it works:
    1. Predict a depth map from the left image.
    2. Compute disparities and forward-warp pixels to make the right image.
    3. Where pixels are missing (disocclusions), inpaint them with a generative model.
  • Why it matters: If the depth map is wrong—especially with glass, reflections, or many layers—the warp breaks and inpainting can’t fix geometry.

🍞 Example: A glass door shows reflections and the room behind it; a single depth per pixel can’t represent both, so warps tear or blur.
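
For contrast with the depth-free approach introduced later, here is a minimal NumPy sketch of the forward-warp step in such pipelines. The inpainting model is left out, the z-buffering a production warper would need is skipped, and all names are illustrative.

```python
import numpy as np

def forward_warp_left_to_right(left: np.ndarray, disparity: np.ndarray):
    """Naively forward-warp a left image to the right view using per-pixel disparity.

    left:      (H, W, 3) image
    disparity: (H, W) horizontal shift in pixels (larger = closer)
    Returns the warped right view and a hole mask marking disocclusions
    that the generative inpainter would have to fill.
    """
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        x_dst = np.round(xs - disparity[y]).astype(int)   # content shifts left in the right view
        valid = (x_dst >= 0) & (x_dst < W)
        right[y, x_dst[valid]] = left[y, xs[valid]]        # no z-buffer: later pixels simply overwrite
        filled[y, x_dst[valid]] = True
    return right, ~filled                                  # holes = regions the left camera never saw
```

If the disparity is wrong on glass or reflections, the warp itself misplaces content before inpainting even starts, which is the failure mode discussed next.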

🍞 Picture two cameras perfectly level, sliding only left-right on a rail.

🥬 The Concept: Rectified stereo and baseline make geometry simple and predictable.

  • What it is: Rectified stereo lines up the two images so matching points lie on the same row; the baseline is the exact left-right distance.
  • How it works:
    1. Align cameras so epipolar lines are horizontal.
    2. Fix intrinsics (like focal length) and control baseline.
    3. Now matching is 1D along rows, and parallax scales with baseline.
  • Why it matters: Without rectification, matching gets harder; without a known baseline, parallax control is unpredictable.

🍞 Example: In correctly rectified stereo, a door handle appears on the same scanline in both images—just shifted left or right.
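
A quick numeric check of the "same scanline" property, assuming two identical pinhole cameras whose centers differ only by a translation of B along x (values are illustrative):

```python
import numpy as np

f, cx, cy = 700.0, 320.0, 240.0                    # shared intrinsics
B = 0.10                                           # 10 cm baseline
K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])

def project(point_xyz, cam_x):
    """Project a 3D point (rig coordinates) into a camera centered at (cam_x, 0, 0)."""
    p = np.asarray(point_xyz, dtype=float) - np.array([cam_x, 0.0, 0.0])
    u, v, w = K @ p
    return u / w, v / w

P = [0.3, -0.1, 2.0]                               # a point 2 m in front of the rig
uL, vL = project(P, -B / 2)                        # left camera
uR, vR = project(P, +B / 2)                        # right camera
print("same scanline:", np.isclose(vL, vR))        # True: rows match after rectification
print("disparity uL - uR =", round(uL - uR, 2))    # 35.0 px, i.e. f * B / Z = 700 * 0.1 / 2
```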

🍞 Think of the world before: If you wanted 3D from one photo, you often had to guess the depth of every pixel first.

🥬 The Problem: Depth-first methods inherit depth estimator failures.

  • What it is: Depth-estimation errors (e.g., on transparent or reflective surfaces) cause warped images to crack, bend, or ghost.
  • How it works:
    1. Depth model predicts a single surface per pixel.
    2. Real scenes can have multiple layers or view-dependent effects.
    3. Warping with wrong depths misplaces or duplicates content; inpainting smooths but can’t fix geometry.
  • Why it matters: Artifacts reduce stereo comfort and realism.

🍞 Example: A shiny car mirror may warp the environment reflection into the wrong place, causing eye strain.

🍞 Imagine if, instead of measuring every distance, you just told an artist, “Draw the same scene from 10 cm to the right,” and they nailed it.

🥬 The Gap: We needed a way to generate the right view directly, conditioned on viewpoint, without explicit depth.

  • What it is: A model that learns stereo geometry end-to-end from data if we encode the camera move well.
  • How it works:
    1. Use a powerful image generator.
    2. Tell it exactly where the target camera is (baseline, rays).
    3. Let it learn correspondences and disocclusions directly.
  • Why it matters: Avoids depth’s single-surface trap and handles tricky optics better.

🍞 Example: Ask for a right-eye view at baseline B; the model paints crisp parallax and fills what the left eye couldn’t see.

Real stakes: This matters for 3D movies, VR/AR, games, and converting family photos or classic films into comfortable, realistic 3D—without expensive rigs, painful artifacts, or per-scene calibration. It can also help telepresence, education, and accessibility by making depth-rich content easier and cheaper to produce.

02Core Idea

🍞 Imagine switching from “measure everything, then copy” to “just draw the right view directly but with perfect camera instructions.”

🥬 The Concept: The key insight is to learn stereo geometry directly through viewpoint conditioning in a canonical stereo space, using a diffusion model—no explicit depth or warping at inference.

  • What it is: A depth-free, viewpoint-conditioned diffusion framework called StereoSpace.
  • How it works:
    1. Canonicalize the stereo setup (cameras on x-axis, rectified).
    2. Encode each pixel’s camera ray (Plücker coordinates) so the model knows how the target camera “looks.”
    3. Use a dual U-Net diffusion model: one reads the source image, the other generates the target view while attending to source features and the viewpoint.
    4. Train end-to-end with losses that promote realism and geometric consistency.
  • Why it matters: Removes dependency on brittle depth maps, enabling sharp parallax and better handling of multi-layer, non-Lambertian scenes.

🍞 Example: Give the model a left photo and say “make the right photo 8 cm away”; it paints a clean, comfortable 3D partner image.

Three analogies for the same idea:

  • Camera GPS for pixels: Instead of measuring terrain heights (depth), give the painter a per-pixel compass and map (ray directions) to move sideways correctly.
  • Language class: Put everyone in the same classroom (canonical space) so directions like “step 10 cm right” mean the same thing for all scenes.
  • Magic window: Slide the window frame (baseline) and ask the artist to redraw the view seen through it, not to move the objects themselves.

Before vs After:

  • Before: Predict depth, warp, and inpaint—works until glass, reflections, or thin structures break the single-depth assumption.
  • After: Condition on the exact viewpoint; the generator learns correspondences and disocclusions naturally, producing crisp parallax and fewer artifacts.

Why it works (intuition):

  • Canonicalization reduces the problem’s chaos: the model sees consistent camera moves, not random world poses.
  • Plücker rays give dense, per-pixel camera geometry, making pixel-to-pixel matching easy to learn in the network.
  • Diffusion priors from Stable Diffusion provide strong semantics and texture realism, helping fill disocclusions plausibly.
  • End-to-end training with geometry-aware supervision (warp-consistency loss) quietly teaches the model to respect stereo alignment without needing depth at test time.

Building blocks, each as a sandwich:

  • 🍞 You know how a sculptor reveals a statue by removing marble bit by bit? 🥬 Diffusion models turn noise into an image step-by-step.

    • What: A generator that denoises a latent image over many steps.
    • How: Start with noisy latent; iteratively predict and remove noise (velocity), guided by the source image and viewpoint; decode to pixels.
    • Why: This gradual process can learn complex structure and fill missing parts well. 🍞 Example: Start with static on a TV and refine until you get the right-eye view.
  • 🍞 Imagine every project uses the same workbench and rulers. 🥬 Canonical StereoSpace is a shared, rectified coordinate setup for all scenes.

    • What: Fix the stereo rig center at the origin and slide cameras along x by the baseline.
    • How: Express intrinsics and baseline in this space; train only on these moves.
    • Why: The model focuses on stereo changes, not random camera poses. 🍞 Example: Any 10 cm request means the same parallax across scenes.
  • 🍞 Think of giving each pixel its own little arrow showing where it looks. 🥬 Viewpoint conditioning with Plücker rays feeds per-pixel camera rays to the network.

    • What: A 6D vector per pixel (direction and moment) encoding the ray.
    • How: Compute ray for each pixel for source and target; inject via adaptive layer norms and input concatenation.
    • Why: Per-pixel geometry guides correspondences and parallax magnitude (see the numeric sketch after this list). 🍞 Example: The pixel that looked at the mug gets told how it shifts in the right view.
  • 🍞 Imagine two teammates: one understands the source scene, the other paints the new view while listening. 🥬 Dual U-Nets (reference + denoiser) with cross-attention.

    • What: One U-Net encodes source features; another denoises the target latent conditioned on those features and rays.
    • How: Initialize from Stable Diffusion to inherit strong visual priors; inject viewpoint signals throughout.
    • Why: Balances semantic preservation (what’s in the scene) with geometric adaptation (how it shifts). 🍞 Example: The painter keeps the mug’s texture but moves it the right amount.
  • 🍞 You know how a coach checks both style and form? 🥬 Geometry-aware training losses keep images pretty and aligned.

    • What: Velocity loss (diffusion), photometric loss (looks), and warp-consistency loss (geometry) with masks.
    • How: Supervise the predicted clean sample; backward-warp it with ground-truth disparity only during training; mask occlusions.
    • Why: Encourages correct stereo geometry without needing depth at test time. 🍞 Example: The model is gently corrected if the mug’s shift doesn’t match the baseline.
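
To ground the two geometry-facing blocks above (canonical StereoSpace and Plücker-ray conditioning), here is a minimal NumPy sketch of how per-pixel rays could be computed for cameras placed at ±B/2 on the x-axis. The helper name and exact conventions (pixel-center offset, normalization) are assumptions for illustration, not the paper's released code.

```python
import numpy as np

def pluecker_rays(H, W, K, cam_center):
    """Per-pixel 6D Plücker encoding: ray direction d and moment m = c x d.

    H, W:        image size in pixels
    K:           3x3 pinhole intrinsics
    cam_center:  camera center (3-vector) in the canonical stereo frame
    Returns an (H, W, 6) array that can be concatenated to the network input.
    """
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones((H, W))], axis=-1)   # homogeneous pixel centers
    d = pix @ np.linalg.inv(K).T                                      # back-project to ray directions
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(cam_center, d.shape), d)             # moment encodes the ray's offset
    return np.concatenate([d, m], axis=-1).astype(np.float32)

# Canonical StereoSpace: source and target cameras slide only along x, by a metric baseline B.
H, W = 480, 640
K = np.array([[700.0, 0, W / 2], [0, 700.0, H / 2], [0, 0, 1.0]])
B = 0.08                                                              # user-chosen 8 cm baseline
rays_src = pluecker_rays(H, W, K, np.array([-B / 2, 0.0, 0.0]))
rays_tgt = pluecker_rays(H, W, K, np.array([+B / 2, 0.0, 0.0]))
print(rays_src.shape, rays_tgt.shape)                                 # (480, 640, 6) each
```

Only the x-component of the camera center changes between source and target, so the ray moments differ in a way that directly encodes the requested baseline.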

03Methodology

High-level recipe: Input (left image + camera intrinsics + chosen baseline) → encode source view and viewpoint rays → denoise target latent with guidance from source features and rays → decode to get right image.

Step-by-step, each as a sandwich:

  • 🍞 Picture starting from a fuzzy sketch and sharpening it little by little. 🥬 Step 1: Latent diffusion with velocity parameterization.

    • What: Represent the target image in a compressed latent; iteratively predict its denoising velocity.
    • How:
      1. Encode the (unknown) target into a latent and add noise (during training).
      2. Train the denoiser to predict the velocity v (how to move towards the clean latent).
      3. At inference, start from noise and follow predicted velocities to reach the target latent.
    • Why: Operating in latent space is efficient and leverages strong image priors. 🍞 Example: Start with noisy latent; after 50 steps you get a crisp right-eye latent to decode.
  • 🍞 Think of putting everyone on the same straight track before a race. 🥬 Step 2: Canonicalization into StereoSpace.

    • What: Express cameras in a rectified frame: both on x-axis, separated by baseline B, known focal length f.
    • How:
      1. Center the stereo rig at the origin.
      2. Place source and target along x with distance B.
      3. Compute per-pixel rays in this canonical frame.
    • Why: Reduces variability; the model learns a single, consistent kind of motion. 🍞 Example: Asking for 20 cm right always means the same slide in training and testing.
  • 🍞 Imagine giving the model glasses that show exactly where each pixel is looking. 🥬 Step 3: Viewpoint conditioning via Plücker rays.

    • What: 6D per-pixel ray encoding (direction d and moment m = c × d) for source and target.
    • How:
      1. Compute rays from intrinsics and camera centers.
      2. Normalize directions; concatenate (d, m) per pixel.
      3. Inject via Adaptive Layer Norm and by concatenating to U-Net inputs, so layers can “see” geometry.
    • Why: This makes pixel-to-pixel matches and parallax magnitudes explicit to the network. 🍞 Example: The pixel that sees the lamp in the left image gets told how that sightline shifts to the right camera.
  • 🍞 Think of a reporter (source encoder) and a painter (target denoiser) working together. 🥬 Step 4: Dual U-Net with cross-attention.

    • What: A reference U-Net encodes source features; a denoising U-Net generates the target latent, attending to source features and rays.
    • How:
      1. Initialize both from Stable Diffusion 2.0 for strong priors.
      2. Freeze the highest-res up-block of the reference U-Net to stabilize appearance features.
      3. Use cross-attention so the denoiser can copy semantic details while adjusting geometry.
    • Why: Keeps textures and identities consistent while shifting them correctly. 🍞 Example: The pattern on a shirt stays the same, but moves appropriately between views.
  • 🍞 Imagine a teacher grading both neatness and correctness, ignoring parts you couldn’t possibly see. 🥬 Step 5: Training losses (the coach trio).

    • What:
      • Velocity loss L_vel: learns denoising trajectory.
      • Photometric loss L_pix: SSIM + L1 between predicted clean image and ground truth.
      • Warp-consistency loss L_warp: backward-warp predicted right to left using ground-truth disparity; mask occlusions.
    • How:
      1. Use DDIM to get a clean-sample estimate at t=0 and compute L_pix.
      2. Warp that estimate to the left with known disparity (training only), apply a validity mask, and compute L_warp.
      3. Sum with weights: L_total = L_vel + λ_pix L_pix + λ_warp L_warp.
    • Why: Balances realism and geometry; crucially, no disparity is needed at test time (a compact loss sketch follows this step list). 🍞 Example: If the fridge edge is off by a pixel given the baseline, L_warp nudges it back during training.
  • 🍞 Think of a library that covers many neighborhoods and streets so you learn how far a ‘block’ really is. 🥬 Step 6: Training data with multi-baseline supervision.

    • What: Mix ~750K single-baseline pairs plus multi-baseline tuples rendered from NeRF/3DGS scenes.
    • How:
      1. Include indoor/outdoor, synthetic/photorealistic data.
      2. Render short rectified stacks along stereo directions to see how parallax scales with baseline.
      3. Heavily weight multi-baseline tuples so the model internalizes metric control.
    • Why: Teaches the model how different baselines change parallax, enabling generalization. 🍞 Example: Seeing the same room at 5 cm, 10 cm, 20 cm baselines teaches “how much should things shift.”
  • 🍞 Picture a fair race where no runner gets a head start. 🥬 Step 7: Leakage-free evaluation with stereo-aware metrics.

    • What: At test time, no access to ground-truth depth or disparity for any method; calibrate monocular scale per scene fairly.
    • How:
      1. Pick the best baseline/scale to align generated and real disparities using SGBM (coarse-to-fine search), then fix it.
      2. Score with iSQoE (perceptual stereo comfort) and MEt3R (multi-view geometric consistency via feature lifting and reprojection).
    • Why: Ensures fair comparisons and rewards true stereo quality, not just pixel overlap. 🍞 Example: Even if a method blurs edges to match pixels, iSQoE/MEt3R will reveal poor comfort or geometry.
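
A compact PyTorch sketch of how the three losses from Step 5 could be combined, assuming the denoiser's DDIM clean-sample estimate has already been decoded to an image. The SSIM term of L_pix is dropped and the occlusion masking is simplified to keep the sketch short; weights, shapes, and helper names are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp_right_to_left(right: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    """Sample the predicted right image at (x - d, y) to reconstruct the left view.

    right:     (B, 3, H, W) predicted right image (decoded clean-sample estimate)
    disparity: (B, 1, H, W) ground-truth left-view disparity (available during training only)
    """
    B, _, H, W = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=right.device, dtype=right.dtype),
        torch.arange(W, device=right.device, dtype=right.dtype),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disparity[:, 0]            # where each left pixel lands in the right view
    grid = torch.stack(
        [2.0 * x_src / (W - 1) - 1.0,
         2.0 * ys.unsqueeze(0).expand_as(x_src) / (H - 1) - 1.0],
        dim=-1,
    )                                                     # (B, H, W, 2), normalized to [-1, 1]
    return F.grid_sample(right, grid, mode="bilinear", padding_mode="border", align_corners=True)

def total_loss(v_pred, v_target, img_pred_right, img_gt_right, img_gt_left,
               disparity, valid_mask, lam_pix=0.1, lam_warp=0.1):
    """L_total = L_vel + lam_pix * L_pix + lam_warp * L_warp (illustrative weights)."""
    l_vel = F.mse_loss(v_pred, v_target)                              # diffusion velocity loss
    l_pix = F.l1_loss(img_pred_right, img_gt_right)                   # photometric term (SSIM omitted here)
    warped_left = backward_warp_right_to_left(img_pred_right, disparity)
    l_warp = (valid_mask * (warped_left - img_gt_left).abs()).mean()  # geometry term, occlusions masked out
    return l_vel + lam_pix * l_pix + lam_warp * l_warp
```

Because the disparity and the mask are consumed only inside the loss, nothing here is needed at inference, which is how the model stays depth-free at test time.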

Secret sauce:

  • Dense per-pixel viewpoint signals (Plücker rays) + canonical StereoSpace + dual U-Nets initialized from Stable Diffusion + geometry-aware but depth-free-at-test training = sharp, comfortable stereo with controllable, metric parallax—and strong robustness on glass, reflections, and layered scenes.

04Experiments & Results

🍞 Think of checking both how comfy a 3D movie feels and whether the props really line up on the stage.

🥬 The Test: Measure stereo comfort and geometric consistency.

  • What it is: Two main scores—iSQoE (how good and comfortable the stereo looks) and MEt3R (how geometrically self-consistent the pair is).
  • How it works:
    1. iSQoE: A learned predictor outputs a single quality/comfort score for a stereo pair.
    2. MEt3R: Lifts features from both views to a shared 3D using a pretrained geometry model; reprojects and compares for consistency.
    3. No test-time depth is allowed for anyone; per-scene scale is calibrated fairly.
  • Why it matters: Pixel metrics (PSNR/SSIM) can favor blurry, over-smoothed images; iSQoE/MEt3R better reflect real 3D quality.

🍞 Example: A method might score high PSNR but still feel uncomfortable to watch in 3D; iSQoE will catch that.
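
The per-scene calibration can be illustrated with OpenCV's SGBM matcher: estimate disparity on the generated pair, estimate it on the real pair, and pick the scale that aligns them before any metric is computed. The coarse grid search below is a simplified stand-in for the paper's coarse-to-fine procedure; iSQoE and MEt3R themselves are learned models and are not reimplemented here.

```python
import cv2
import numpy as np

def sgbm_disparity(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    """Classic semi-global block matching; returns disparity in pixels."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    return matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0   # SGBM is fixed-point (1/16 px)

def best_scale(disp_generated: np.ndarray, disp_real: np.ndarray) -> float:
    """Coarse grid search for the scale that best aligns generated and real disparities."""
    valid = (disp_generated > 0) & (disp_real > 0)
    candidates = np.linspace(0.25, 4.0, 76)
    errors = [np.median(np.abs(s * disp_generated[valid] - disp_real[valid])) for s in candidates]
    return float(candidates[int(np.argmin(errors))])
```

Once the scale is fixed per scene, the stereo pairs are scored directly with iSQoE and MEt3R, with no access to ground-truth depth for any method.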

Competitors:

  • Warp-and-inpaint: ZeroStereo’s Stereo-Gen (with Depth Anything v2), StereoDiffusion variants.
  • Warped conditioning: GenStereo (conditions diffusion on warped embeddings using monocular depth).
  • 3DGS generative: Lyra (distilled from video diffusion; evaluated by approximating a small sideways camera slide).

Datasets:

  • Middlebury 2014 (indoor, varied geometry), DrivingStereo (outdoor driving), Booster (challenging real-world stereo), LayeredFlow (real scenes with multi-layer transparency/reflectance).

Scoreboard with context:

  • Middlebury 2014:

    • StereoSpace: iSQoE ≈ 0.6829 (best), MEt3R ≈ 0.0893 (best).
    • GenStereo: MEt3R ≈ 0.1339 (about 50% worse than StereoSpace’s 0.0893), higher iSQoE.
    • Lyra: MEt3R ≈ 0.1163 (still notably worse than StereoSpace).
    • Others trail further, often due to oversmoothing or warping artifacts.
    • Takeaway: Like getting an A when others score B–C; big geometry gains and better viewing comfort.
  • DrivingStereo:

    • StereoSpace: iSQoE ≈ 0.7829 (best), MEt3R ≈ 0.0717 (best, slightly better than GenStereo’s ≈ 0.0728).
    • Margins are smaller (easier geometry—farther objects, smaller disparities), but StereoSpace still wins comfortably.
    • Takeaway: Still top of class even when the test is easy and everyone does okay.
  • Booster (hard, reflective):

    • StereoSpace: iSQoE ≈ 0.6764 (best), MEt3R ≈ 0.1013 (best).
    • GenStereo ≈ 0.1457 MEt3R, Lyra ≈ 0.1293—both clearly behind.
    • Takeaway: On tough mirror/glass scenes, depth-reliant pipelines falter; StereoSpace stays stable.
  • LayeredFlow (multi-layer transparency):

    • StereoSpace: iSQoE ≈ 0.7489 (best), MEt3R ≈ 0.1619 (best).
    • GenStereo ≈ 0.2275 MEt3R, Lyra ≈ 0.1877—farther behind.
    • Takeaway: When one pixel has multiple depths, warping breaks; viewpoint-conditioned generation shines.

Surprising findings and ablations:

  • Plücker rays vs other conditioning:
    • Plain text prompts or PRoPE-style projective attention help, but dense Plücker rays give the best scores; stacking PRoPE on top didn’t add gains.
  • Multi-baseline training matters:
    • Removing multi-baseline tuples hurts both iSQoE and MEt3R—seeing how parallax scales with baseline teaches true metric control.
  • Adding a disparity loss:
    • Slightly improves MEt3R (tighter geometry) but can mildly worsen iSQoE (comfort trade-off), suggesting a balance between strict alignment and perceived quality.
  • PSNR/SSIM mismatch:
    • Some depth-warping baselines get better PSNR/SSIM yet look less realistic and less comfortable; iSQoE/MEt3R correlate better with what your eyes feel.

Bottom line:

  • Across easy and hard datasets, StereoSpace is consistently top or tied-top, with the biggest wins on glassy, layered, or reflective scenes—exactly where depth-first methods stumble. It’s like switching from tracing-paper copying (warp) to smart redrawing (generation) that respects the camera move.

05Discussion & Limitations

🍞 Think of a great bicycle that still needs the right-sized rider, a smooth road, and regular tune-ups.

🥬 Limitations, resources, and when not to use:

  • Limitations:
    1. Needs correct camera intrinsics and a chosen baseline; wrong inputs lead to wrong parallax.
    2. Assumes rectified, horizontal baseline shifts; large rotations or vertical parallax are out of scope.
    3. Very extreme baselines (far beyond training range) may extrapolate poorly.
    4. Although better than depth-first methods, super tricky optics (strong specular glare, complex transparent layers) can still be challenging.
    5. Diffusion inference is heavier than simple warping; not yet ideal for low-power, real-time mobile.
  • Required resources:
    • A GPU for training/inference, mixed synthetic/photorealistic multi-baseline datasets, Stable Diffusion checkpoints, and code for Plücker ray computation and canonicalization.
  • When not to use:
    1. If you need exact pixel-accurate correspondences for classical stereo matching benchmarks (use a dedicated matcher).
    2. Real-time on embedded devices with strict latency or power limits.
    3. Non-rectified or arbitrary camera motions unless you extend the conditioning to full 6-DoF.
    4. Scenes with moving objects between left/right captures (current setup assumes static scenes for clean stereo).
  • Open questions:
    1. Stereo video: temporal consistency and flicker-free parallax over time.
    2. Beyond rectified stereo: handling rotations and vertical shifts (full camera control).
    3. Speed-ups: distillation, fewer sampling steps, real-time feasibility.
    4. Uncertainty: confidence maps for safe parallax ranges and viewer comfort tuning.
    5. Training without any disparity supervision: fully self-supervised geometry from multi-view priors.
    6. Alternative or complementary camera encodings to Plücker, and learned conditioning mixers.

🍞 Example anchor: If you try to use this on a hand-held phone clip with big tilt and rolling shutter, parallax control may be unreliable until the method is extended to handle richer camera motions.

06Conclusion & Future Work

Three-sentence summary:

  • StereoSpace is a depth-free, viewpoint-conditioned diffusion method that generates the right-eye view from a single left image inside a canonical stereo space.
  • By feeding dense per-pixel camera rays (Plücker coordinates) and training end-to-end with geometry-aware losses, it learns correspondences and fills disocclusions without explicit depth at inference.
  • It outperforms depth-reliant baselines on both comfort (iSQoE) and geometry (MEt3R), especially on glassy, reflective, and multi-layered scenes, while offering metric control over baseline.

Main achievement:

  • Reframing monocular-to-stereo as direct, viewpoint-conditioned generation—eliminating test-time depth and warping—delivers sharper parallax, better comfort, and stronger robustness.

Future directions:

  • Extend to stereo video with temporal stability; support full 6-DoF camera motions; accelerate inference via distillation; add uncertainty-aware comfort controls; broaden training to fully self-supervised geometry.

Why remember this:

  • It marks a shift from "estimate depth then warp" to "condition on viewpoint and generate," showing that diffusion models can internalize stereo geometry when given the right camera signals and a clean canonical space. This makes high-quality, controllable 3D content creation more accessible for movies, VR/AR, games, and everyday media.

Practical Applications

  • 2D-to-3D film conversion: batch-convert movie frames into stereo pairs with controllable parallax.
  • VR scene enrichment: add a realistic stereo partner view to single images used in immersive experiences.
  • Photo apps: one-tap 3D for portraits and landscapes with a slider to choose the baseline for comfort.
  • Game cutscenes: generate stereo versions of pre-rendered cinematics without rebuilding depth assets.
  • AR previews: simulate slight viewpoint changes to check parallax comfort before deploying content.
  • Telepresence: create a stereo partner view from one camera to enhance depth in video calls.
  • Education: produce 3D illustrations from textbooks or museum images to teach depth perception.
  • Cinematography planning: test different baselines on a concept frame to pick the best 3D feel on set.
  • Accessibility: tune stereo strength for viewers sensitive to eye strain by adjusting the metric baseline.
  • Robotics visualization: render a plausible second view for human operators to better judge depth.
#stereo generation · #monocular-to-stereo · #diffusion models · #latent diffusion · #viewpoint conditioning · #canonical space · #Plücker coordinates · #dual U-Net · #iSQoE · #MEt3R · #rectified stereo · #baseline control · #disocclusions · #non-Lambertian