
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Intermediate
Shaocong Xu, Songlin Wei, Qizhe Wei et al. Ā· 12/29/2025
arXiv Ā· PDF

Key Summary

  • Transparent and shiny objects confuse ordinary depth cameras, but video diffusion models have already learned how light bends and reflects through them.
  • This paper repurposes a large video diffusion model into a video-to-video translator that outputs depth (and normals) from ordinary RGB videos.
  • The team built TransPhy3D, a synthetic video dataset with 11,000 scenes (1.32M frames) of transparent and reflective objects rendered with physically accurate ray tracing.
  • They train with lightweight LoRA adapters, so the original generative 'physics sense' is kept while the model learns to predict depth and normals.
  • A simple trick co-trains on both single images and videos, yielding smooth, temporally consistent depth across long clips.
  • On real and synthetic benchmarks (ClearPose, DREDS, TransPhy3D-Test), the model reaches state-of-the-art results in zero-shot testing.
  • A compact 1.3B version runs fast (~0.17 s per frame at 832Ɨ480), making it practical for robots.
  • In real robot grasping tests, the model’s depth improves success rates on translucent, reflective, and diffuse surfaces.
  • Key idea: 'Diffusion knows transparency'—so we can reuse that knowledge for accurate, stable 3D perception without real-world labels.
  • The same recipe also produces the best video surface normals on ClearPose, showing the approach generalizes beyond depth.

Why This Research Matters

Transparent and reflective objects are everywhere—glasses, screens, bottles, shiny tools—and standard depth cameras often fail on them. A robot that can reliably see these objects can grasp dishes, sort recyclables, and work safely around glass and chrome. Phones and AR headsets can anchor graphics better on shiny tables and windows, making experiences more stable and realistic. Drones and home robots can navigate spaces with mirrors and glass doors more safely. Because the method runs quickly on modest GPUs and needs no real-world labels, it lowers the barrier to deploying robust 3D perception in practical systems. In short, this turns a long-standing blind spot in machine vision into a strength.

Detailed Explanation


01Background & Problem Definition

You know how trying to see water in a clear glass is tough because it looks almost invisible? Cameras feel the same way. Before this work, robots and 3D apps struggled with transparent and mirror-like things. Time-of-Flight (ToF) sensors and stereo cameras assume light bounces straight back or that the same point can be matched in two images. But glass bends light (refraction), mirrors bounce it elsewhere (reflection), and some light passes straight through (transmission). The result: holes in the depth map, flickering over time, and bad 3D shapes that break robot actions.

šŸž Hook: Imagine walking behind a shiny window. Your reflection moves, the background warps, and everything looks confusing. 🄬 The Concept (Video Diffusion Models): A video diffusion model is a smart video painter that learns how the world changes frame by frame by adding noise and then learning to clean it up.

  • How it works (like a recipe):
    1. Start with a video and add noise.
    2. Train a model to predict how to remove that noise step by step.
    3. Repeat until a clean, realistic video pops out.
  • Why it matters: Without it, we miss a powerful prior that already ā€œknowsā€ how light acts in transparent scenes. šŸž Anchor: These models can already make believable videos of shiny glasses on tables—so they must have learned the rules of refraction and reflection.
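To make the recipe concrete, here is a minimal, hypothetical sketch of the add-noise-then-denoise training loop, with a tiny MLP on toy 1D signals standing in for a real video diffusion model (all names, sizes, and the noise schedule are illustrative, not from the paper):

```python
# Toy diffusion-style training: add noise to a clean signal, learn to predict the noise.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def alpha_bar(t):
    # illustrative cosine-like schedule mapping t in [0, 1] to a noise level
    return torch.cos(t * torch.pi / 2) ** 2

for step in range(1000):
    x0 = torch.sin(torch.linspace(0, 6.28, 64)).repeat(32, 1)  # clean toy "frames", batch of 32
    t = torch.rand(32, 1)                                      # random noise level per sample
    eps = torch.randn_like(x0)                                 # step 1: the noise we add
    xt = alpha_bar(t).sqrt() * x0 + (1 - alpha_bar(t)).sqrt() * eps
    pred_eps = denoiser(torch.cat([xt, t], dim=1))             # step 2: predict that noise
    loss = ((pred_eps - eps) ** 2).mean()                      # step 3: learn to remove it
    opt.zero_grad()
    loss.backward()
    opt.step()
```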

šŸž Hook: Think of judging how far a soccer ball is by how big it looks and how it moves. 🄬 The Concept (Depth Estimation): Depth estimation is figuring out how far each pixel in a picture is from the camera.

  • How it works:
    1. Look at visual cues (size, blur, texture, shading).
    2. Predict distance per pixel (a depth map).
    3. Keep it consistent over time in a video.
  • Why it matters: Without depth, robots can’t grab objects and 3D apps can’t build accurate models. šŸž Anchor: Your phone’s portrait mode guesses depth to blur the background; robots need even better depth to avoid dropping a clear cup.

šŸž Hook: You know how adding a tiny sticky note to a page can change how you read it without rewriting the whole book? 🄬 The Concept (LoRA): LoRA is a tiny add-on that tweaks a big model using small, low-rank weight adjustments.

  • How it works:
    1. Freeze the big model.
    2. Train small adapter matrices that gently steer the model.
    3. Keep performance but avoid forgetting what it already knows.
  • Why it matters: Without LoRA, fine-tuning could erase the model’s hard-won physics knowledge (catastrophic forgetting) or be too expensive. šŸž Anchor: It’s like adding training wheels that guide the bike without rebuilding the bike.
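Here is a minimal sketch of the LoRA idea in PyTorch, using a single frozen linear layer instead of the paper's full DiT blocks; the rank, scaling, and layer sizes are illustrative assumptions:

```python
# Minimal LoRA sketch: freeze a big layer, train only a small low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # 1. freeze the big model
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # 2. small adapter matrices gently steer the frozen layer's output
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")    # only the tiny A and B matrices
```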

šŸž Hook: If you want to practice catching a glass ball safely, you might use a virtual reality room first. 🄬 The Concept (TransPhy3D Dataset): TransPhy3D is a big, synthetic video dataset of transparent and reflective scenes with ground-truth depth and normals.

  • How it works:
    1. Collect lots of 3D assets (category-rich and shape-rich).
    2. Simulate real physics and camera moves.
    3. Render RGB, depth, and normal videos with a ray tracer and denoiser.
  • Why it matters: Without diverse, labeled videos, the model can’t learn to generalize to real scenes. šŸž Anchor: It’s a training playground where the rules of light are accurate, and nothing breaks.

šŸž Hook: Animated movies are made from scratch, but still teach you a lot about how motion and light look. 🄬 The Concept (Synthetic Video Dataset): A synthetic video dataset is a computer-made collection of videos that look real but are fully labeled.

  • How it works:
    1. Build 3D scenes.
    2. Move virtual cameras around.
    3. Render realistic images plus perfect depth/normal labels.
  • Why it matters: Without it, collecting real labels for glass and mirrors would be extremely hard or impossible. šŸž Anchor: Like a driving simulator that safely teaches you road rules before you touch a real car.

šŸž Hook: In a flipbook, the drawing should change smoothly, or it looks jittery. 🄬 The Concept (Temporal Consistency): Temporal consistency means predictions stay stable and sensible across video frames.

  • How it works:
    1. Respect motion and scene structure over time.
    2. Avoid frame-to-frame flicker.
    3. Keep depth smooth unless the scene truly changes.
  • Why it matters: Without it, robots act on wobbly 3D maps and fumble grasps. šŸž Anchor: A steady hand draws steadier flipbooks; a steady depth map makes steadier robot moves.

šŸž Hook: To know how a smooth marble is tilted, you look at how light glints off it. 🄬 The Concept (Normal Estimation): Normal estimation finds the facing direction of each tiny patch of surface.

  • How it works:
    1. Analyze shape cues and shading.
    2. Predict a direction (like an arrow) per pixel.
    3. Keep the directions consistent across frames.
  • Why it matters: Without normals, fine details and contact geometry are lost, hurting manipulation and rendering. šŸž Anchor: A robot hand needs to know a cup’s surface tilt to place fingers correctly.
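For intuition only, here is a generic way to turn a depth map into per-pixel normals with finite differences. The paper instead predicts normals directly with its DKT-Normal model; this sketch just illustrates what "an arrow per pixel" means numerically:

```python
# Illustrative only: derive per-pixel normals from a depth map via local slopes.
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """depth: (H, W) array of distances; returns (H, W, 3) unit normals."""
    dz_dy, dz_dx = np.gradient(depth)                 # how depth changes row- and column-wise
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals

# A tilted plane: every pixel's normal should point the same way.
plane = np.fromfunction(lambda y, x: 0.1 * x + 2.0, (64, 64))
print(normals_from_depth(plane)[32, 32])              # roughly [-0.0995, 0.0, 0.995]
```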

The problem: older, discriminative methods mapped pixels directly to depth but overfit to small datasets and often flickered on videos—especially for glass and mirrors. People tried better encoders (like DINO) and image diffusion models, which helped per-frame accuracy but still wobbled over time. The missing piece was a model that already ā€œunderstandsā€ transparent physics and handles time naturally. This paper fills that gap by reusing a pre-trained video diffusion model and adapting it with tiny LoRA add-ons, trained mostly on synthetic videos. Why this matters to daily life: more reliable robot dishwashers, safer factory pick-and-place, better AR try-ons on shiny or glassy things, and clearer 3D reconstructions for phones and drones.

02Core Idea

Aha! Treat depth (and normals) for transparent objects as video-to-video translation using a pre-trained video diffusion model that already internalized how light behaves, and adapt it with small LoRA modules using mostly synthetic video supervision.

Three analogies:

  1. Translator analogy: The model is a skilled interpreter who already speaks the language of light and time (video). We just teach it a new dialectā€”ā€œdepth and normalsā€ā€”so it can translate ordinary videos into 3D maps.
  2. Restoration analogy: Think of an art restorer who removes noise layer by layer. The model learns how to peel away uncertainty to reveal a clean, consistent depth movie.
  3. Coach analogy: The pre-trained model is a talented athlete. LoRA is a coach adding a few targeted tips so the athlete excels at a specific event without forgetting the basics.

Before vs After:

  • Before: Methods did well on single frames but stumbled over time, especially on glass or mirrors. Labeling real transparent scenes was rare, so models overfit and broke in the wild.
  • After: A diffusion backbone supplies a powerful, physics-aware prior. With LoRA and synthetic co-training, the model outputs accurate, temporally smooth depth and normals for long, in-the-wild videos—zero-shot on real data.

Why it works (intuition):

  • Video diffusion models learn how scenes evolve, including how light travels, reflects, and refracts. That makes them naturally good at capturing transparency effects.
  • Instead of asking the model to spit out pixels directly, we ask it to predict the ā€œvelocityā€ that denoises a noisy depth latent toward a clean one. This stabilizes learning and preserves temporal coherence.
  • Concatenating RGB and (noisy) depth latents lets the model condition on visual appearance while it shapes the depth signal—linking what you see to how far it is.
  • LoRA prevents catastrophic forgetting and makes training efficient, so the model keeps its general video knowledge while gaining depth/normals skills.
  • Co-training with both single images and multi-frame videos balances data scale and temporal learning, improving generalization.

Building blocks (the idea in pieces):

  • Pre-trained video diffusion backbone (WAN): already strong at temporal dynamics and visual realism.
  • VAE encoders/decoders: compress videos into a latent space where diffusion operates efficiently, then reconstruct outputs.
  • Channel-wise concatenation: combine RGB latent and noisy depth latent so appearance guides the 3D prediction.
  • Flow matching objective: predict the denoising velocity to move from noisy to clean depth latent; this is robust and stable.
  • LoRA adapters on DiT blocks: small, trainable low-rank updates that steer the big model for depth/normal tasks.
  • TransPhy3D dataset: richly varied transparent/reflective videos with ground-truth depth and normals.
  • Co-training schedule: a simple rule (F = 4N + 1) that sometimes samples single frames (images) and other times full clips (videos), blending spatial detail with temporal consistency.
  • Long-video inference: split input into overlapping chunks, process, then smoothly stitch—so any length is possible without drift.

Big picture: Instead of building a depth model from scratch, reuse a powerful generative video prior that already ā€œknowsā€ transparency. Give it small, targeted updates and the right synthetic practice, and it becomes a reliable, smooth, and fast depth/normal estimator for the real world.

03Methodology

At a high level: Input RGB video → VAE encoders (get RGB latent; build and noise a depth latent) → Concatenate latents along channels → DiT with LoRA predicts denoising velocity for the depth latent (conditioning on RGB) → VAE decoder reconstructs the clean depth video (and normals in the normal variant) → Output.

Step-by-step details:

  1. Data preparation (what): Build TransPhy3D: 11k videos (1.32M frames) of transparent/reflective scenes rendered with Blender/Cycles using real physics (refraction, reflection, transmission), plus NVIDIA OptiX denoising. Include both category-rich static assets and shape-rich parametric assets; simulate objects settling using physics; sample circular camera paths with sinusoidal wiggles; export RGB, depth, and normals.
  • Why: Real labels for glass/mirrors are scarce and imprecise; synthetic renders give perfect ground truth and vast diversity.
  • Example: Imagine a scene with six glass bottles and a chrome spoon in a bowl. The camera circles around for 120 frames while light bends through the bottles and gleams off the spoon. We render RGB, precise depth, and crisp normals for every frame.
  2. Co-training schedule (what): Mix videos from TransPhy3D with image-only datasets (HISS, DREDS, ClearGrasp) using a simple rule: sample a frame count F = 4N + 1 with N ∼ Uniform(0, 5). If F == 1, load image pairs (treated as a 1-frame video); otherwise, load multi-frame video pairs from TransPhy3D.
  • Why: This saves rendering/compute while teaching both single-frame detail and multi-frame smoothness.
  • Example: On one batch, the model sees a single frame of a glass cup (image dataset). Next batch, it sees a 9-frame clip of a shiny kettle (video dataset). It learns both crisp geometry and stable motion.
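A tiny sketch of this sampling rule (assuming N is drawn as an integer, so clip lengths land on 1, 5, 9, ..., 21 frames); the print statements stand in for real dataset loaders:

```python
# Sketch of the image/video co-training rule: F = 4N + 1, N uniform over {0, ..., 5}.
import random

def sample_clip_length() -> int:
    return 4 * random.randint(0, 5) + 1   # F in {1, 5, 9, 13, 17, 21}

for _ in range(5):
    f = sample_clip_length()
    if f == 1:
        print("load an RGB/depth IMAGE pair (treated as a 1-frame video)")
    else:
        print(f"load a {f}-frame RGB/depth VIDEO clip from TransPhy3D")
```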
  3. Latent construction (what): Normalize RGB and depth/disparity to [-1,1]. Encode RGB video with the VAE to get an RGB latent (xc). Encode the target depth video with the same VAE to get a depth latent (xd). Add noise using flow matching at a random timestep t to create a noisy intermediate (xdt).
  • Why: Operating in latent space makes training faster and more stable, and the noisy/clean pair enables diffusion-style learning.
  • Example: A 832Ɨ480 video clip becomes compact latent tensors. At t=0.6, the depth latent is partly noisy, giving the model a denoising challenge.
  4. Conditioning via concatenation (what): Concatenate xdt (noisy depth latent) with xc (RGB latent) along channels; feed into the DiT (diffusion transformer) blocks. The model predicts the velocity that moves xdt toward the clean xd (the target).
  • Why: RGB appearance informs where glass edges, highlights, and distortions are—exactly the cues needed to infer correct depth under transparency.
  • Example: The RGB latent shows a bright highlight on a glass bowl; the model learns that this corresponds to a particular shape and depth change.
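Steps 3 and 4 can be sketched together. The interpolation convention (x_t = (1āˆ’t)Ā·x_clean + tĀ·noise, with velocity target noise āˆ’ x_clean) and the latent shapes are assumptions chosen to illustrate the idea, not the paper's exact implementation:

```python
# Sketch of steps 3-4: noise the depth latent (flow-matching style), then
# concatenate it with the RGB latent along the channel dimension.
import torch

B, C, T, H, W = 1, 16, 9, 60, 104           # made-up latent dimensions
x_rgb   = torch.randn(B, C, T, H, W)         # xc: RGB latent from the VAE
x_depth = torch.randn(B, C, T, H, W)         # xd: clean depth latent from the VAE

t = torch.rand(B, 1, 1, 1, 1)                # random timestep in [0, 1]
noise = torch.randn_like(x_depth)
x_depth_t = (1 - t) * x_depth + t * noise    # xdt: partly-noised depth latent
velocity_target = noise - x_depth            # what the model should learn to predict

model_input = torch.cat([x_depth_t, x_rgb], dim=1)   # channel-wise conditioning
print(model_input.shape)                     # torch.Size([1, 32, 9, 60, 104])
```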
  5. LoRA adaptation (what): Freeze the big diffusion model weights and train only small LoRA adapters within DiT blocks.
  • Why: Prevent catastrophic forgetting of the model’s rich video/physics prior while keeping training fast and memory-light.
  • Example: It’s like adding tiny steering knobs at key layers; just a few parameters change, but the model’s behavior adjusts significantly.
  6. Loss and optimization (what): Use mean-squared error between predicted and ground-truth velocities (the flow matching objective). Train for ~70k iterations at 832Ɨ480 with AdamW (lr 1e-5), batch size 8, across 8ƗH100 GPUs for two days.
  • Why: Velocity prediction stabilizes training and aligns with diffusion’s denoising view.
  • Example: If the model slightly overshoots on a shiny spoon edge, the MSE nudges it to correct the denoising direction.
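A minimal, self-contained stand-in for one optimization step: a small 3D convolution plays the role of the frozen DiT, a second zero-initialized convolution plays the role of the LoRA branch, and only the adapter is handed to AdamW (the learning rate follows the text; everything else is illustrative):

```python
# One stand-in training step: predict the denoising velocity from the
# concatenated latents and update only the small adapter weights.
import torch
import torch.nn as nn

backbone = nn.Conv3d(32, 16, kernel_size=3, padding=1)    # stand-in "frozen" DiT
adapter  = nn.Conv3d(32, 16, kernel_size=1, bias=False)   # stand-in "LoRA" branch
nn.init.zeros_(adapter.weight)                            # adapter starts as a no-op
for p in backbone.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-5)    # only adapters are trained

model_input = torch.randn(1, 32, 9, 60, 104)              # [noisy depth ; RGB] latents
velocity_target = torch.randn(1, 16, 9, 60, 104)

pred_velocity = backbone(model_input) + adapter(model_input)
loss = nn.functional.mse_loss(pred_velocity, velocity_target)  # flow-matching MSE
opt.zero_grad()
loss.backward()
opt.step()
```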
  7. Inference strategy (what): Use a small number of denoising steps (default 5) for speed. For long videos, split into overlapping windows, process, and blend the overlaps with complementary weights (as in DepthCrafter) to maintain continuity.
  • Why: This scales to arbitrary-length videos without seams or drift.
  • Example: A 2-minute kitchen video becomes many manageable chunks; after processing, the stitched depth looks like one smooth take.
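A sketch of the overlapping-window idea with complementary linear blending; the window size, overlap, and the stand-in per-frame "model" are made up for illustration:

```python
# Long-video inference sketch: process overlapping windows, blend each overlap
# with complementary linear weights so the stitched result has no seams.
import numpy as np

def predict_depth_window(chunk: np.ndarray) -> np.ndarray:
    # stand-in for the real diffusion model: one fake "depth" value per frame
    return chunk.mean(axis=(1, 2))

def stitch_long_video(frames: np.ndarray, window: int = 21, overlap: int = 4) -> np.ndarray:
    """frames: (T, H, W). Returns one blended per-frame value of length T."""
    T = frames.shape[0]
    out, weight = np.zeros(T), np.zeros(T)
    start = 0
    while start < T:
        end = min(start + window, T)
        chunk = predict_depth_window(frames[start:end])
        w = np.ones(end - start)
        if start > 0:                     # ramp up where we overlap the previous window
            w[:overlap] = np.linspace(0.0, 1.0, overlap)
        if end < T:                       # ramp down where the next window will overlap
            w[-overlap:] = np.linspace(1.0, 0.0, overlap)
        out[start:end] += chunk * w
        weight[start:end] += w
        if end == T:
            break
        start = end - overlap             # next window starts inside this one
    return out / weight

video = np.random.rand(100, 48, 64)       # 100 fake frames
print(stitch_long_video(video).shape)     # (100,)
```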
  8. Normal estimation variant (what): Train DKT-Normal with the same recipe but targeting normals instead of depth. The conditioning and training loop are analogous.
  • Why: Normals capture fine surface tilt essential for grasping and high-fidelity 3D.
  • Example: On ClearPose, normals around a glass rim become sharp and temporally stable, improving grasp planning.

Secret sauce:

  • ā€œDiffusion knows transparencyā€: the backbone’s learned physics prior.
  • Lightweight adaptation via LoRA: steer without forgetting or huge cost.
  • RGB+depth latent concatenation: marry appearance cues with 3D structure.
  • Mixed image+video co-training: spatial sharpness plus temporal smoothness.
  • Efficient inference: few denoising steps, overlapping windows for any length.

What breaks without each step:

  • No synthetic videos: poor supervision for transparency; model won’t generalize.
  • No LoRA: full fine-tuning risks forgetting and heavy compute; naive results underperform.
  • No concatenation: weaker link between appearance and 3D; depth errors around highlights/refractions.
  • No co-training: either flicker (if only images) or less detail (if only videos).
  • No window stitching: long clips show seams or drift.

04Experiments & Results

The tests: Evaluate zero-shot on both real and synthetic benchmarks focusing on transparency and specularity. Report accuracy (REL, RMSE) and threshold hits (Ī“1.05, Ī“1.10, Ī“1.25), and examine temporal consistency (profile visualizations). Also test runtime and memory for practicality, and do a real robot grasping study.
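For reference, here is how these metrics are commonly computed for depth maps (REL, RMSE, and the Ī“ threshold accuracies), assuming the standard monocular-depth definitions; the arrays below are random stand-ins, not the paper's data:

```python
# Common depth-evaluation metrics: relative error, RMSE, and delta thresholds.
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    valid = gt > 0                                    # ignore invalid ground-truth pixels
    p, g = pred[valid], gt[valid]
    rel = np.mean(np.abs(p - g) / g)                  # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))             # root mean squared error
    ratio = np.maximum(p / g, g / p)
    return {
        "REL": rel,
        "RMSE": rmse,
        "delta_1.05": np.mean(ratio < 1.05),          # fraction of "close enough" pixels
        "delta_1.10": np.mean(ratio < 1.10),
        "delta_1.25": np.mean(ratio < 1.25),
    }

gt = np.random.uniform(0.3, 2.0, size=(480, 832))     # fake depth in meters
pred = gt * np.random.uniform(0.95, 1.05, size=gt.shape)
print(depth_metrics(pred, gt))
```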

Competitors: Strong image methods (Depth-Anything v2, MoGe, VGGT, Marigold-E2E-FT, Depth4ToM) and a leading video method (DepthCrafter). For normals, compare with NormalCrafter and Marigold-E2E-FT.

Scoreboard with context:

  • ClearPose (real transparent/translucent): DKT hits REL 9.72, RMSE 14.58 cm, Ī“1.05 38.17%, Ī“1.10 65.50%, Ī“1.25 93.04%. That’s like getting an A while others hover around B/Bāˆ’ on the hardest parts (particularly Ī“1.05 and Ī“1.10). Temporal profiles show much less flicker than baselines.
  • DREDS-STD CatKnown (real): DKT achieves REL 5.30, RMSE 4.96 cm, Ī“1.05 53.86%, Ī“1.10 84.93%, Ī“1.25 99.89%—top results.
  • DREDS-STD CatNovel (real): DKT achieves REL 5.71, RMSE 4.66 cm, Ī“1.05 52.12%, Ī“1.10 79.51%, Ī“1.25 99.84%—again best overall.
  • TransPhy3D-Test (synthetic): DKT reaches REL 2.96, RMSE 19.50 cm, Ī“1.05 87.17%, Ī“1.10 97.09%, Ī“1.25 98.56%. This is like acing your own home-field exam but with high standards: camera circles demand perfect temporal alignment, and DKT handles it.
  • Video normals (ClearPose): DKT-Normal beats NormalCrafter and Marigold-E2E-FT across metrics (e.g., higher within-11.25° accuracy), delivering the sharpest, most stable normals.

Efficiency:

  • DKT-1.3B runs ~0.17 s/frame at 832Ɨ480 and uses ~11.19 GB memory on an Nvidia L20—faster than prominent baselines and easy to deploy on many robots.
  • More denoising steps didn’t bring big gains; 5 steps balanced speed and detail best.

Surprising findings:

  • Purely synthetic training (plus image co-training) was enough to get zero-shot SOTA on real transparent objects—evidence that the diffusion prior truly encodes transparency physics.
  • LoRA fine-tuning outperformed naĆÆve full fine-tuning and avoided forgetting, even when scaling up to a 14B backbone.
  • The circular camera paths in TransPhy3D make temporal errors amplify under global alignment—yet DKT stayed robust, highlighting its temporal consistency strength.

Takeaway: Across real-world and synthetic datasets, DKT consistently outscored strong image and video baselines in both accuracy and stability, and it did so efficiently. The robotic grasping study confirms these numbers translate to real gains in manipulation success.

05Discussion & Limitations

Limitations:

  • Synthetic-to-real gap: While zero-shot results are strong, extreme real scenes (odd lighting, wet surfaces with complex caustics, or heavy motion blur) may still trip the model.
  • Dynamics and occlusions: Rapidly moving transparent liquids, splashes, or partial occlusions can challenge temporal consistency.
  • Metric scale: The model outputs relative depth; extra steps (like AprilTag scaling) are needed for true metric depth in robotics.
  • Compute trade-offs: The 14B version is slower/heavier; the 1.3B version is fast but may miss some fine details compared to the largest model.
  • Sensor fusion not exploited: No direct use of multi-sensor cues (e.g., sparse LiDAR or events) that might help in edge cases.

Required resources:

  • Training used 8ƗH100 GPUs for ~2 days at 832Ɨ480 resolution with 70k steps; synthetic rendering of TransPhy3D also requires time/GPU budget.
  • For deployment, an Nvidia L20-class GPU runs the 1.3B model at ~0.17 s/frame; lower-end hardware may require further optimization or resolution reduction.

When NOT to use it:

  • If you need exact metric depth without any calibration/scale cues.
  • Highly dynamic fluids, steam, or dense translucency where appearance changes are chaotic and labels are absent.
  • On severely degraded videos (extreme noise, very low light) where RGB appearance cues vanish.

Open questions:

  • Can we learn metric scale directly, perhaps via monocular cues plus tiny bits of self-calibration?
  • How far can synthetic-only training go—could generative domain randomization cover even more real corner cases (e.g., harsh caustics)?
  • What’s the best way to fuse this with sparse sensors for robustness in extreme conditions?
  • Could the same idea extend to other tricky properties (e.g., thin films, ice, smoke) or to joint pose/shape recovery for manipulation?
  • How to further reduce steps/latency while preserving detail, enabling always-on embedded deployment?

06Conclusion & Future Work

In three sentences: This paper shows that large video diffusion models already ā€œknowā€ how light behaves in transparent and reflective scenes, and we can adapt that knowledge to estimate depth and normals from ordinary videos. By training tiny LoRA adapters on a new synthetic video dataset (TransPhy3D) and co-training with image datasets, the model outputs accurate, temporally stable 3D maps—zero-shot on real benchmarks. It runs fast in a compact version and boosts real robot grasping.

Main achievement: Reframing transparent-object depth/normal estimation as video-to-video translation with a pre-trained video diffusion prior, then proving—through strong zero-shot SOTA and robot results—that ā€œdiffusion knows transparency.ā€

Future directions: Add direct metric scale learning; expand synthetic coverage to more optical phenomena; fuse sparse sensors; push latency lower; and generalize the recipe to other challenging material effects. Also, scale to broader manipulation pipelines where consistent 3D feeds planning and control.

Why remember this: It’s a clear example of reusing a powerful generative prior to solve a hard perception problem—efficiently and without real labels—turning a long-standing weakness (glass and mirrors) into a practical strength for robotics and 3D vision.

Practical Applications

  • Robotic grasping of glassware and shiny utensils on cluttered tables.
  • Quality control in factories with reflective parts (e.g., metal casings) using stable 3D inspection.
  • AR furniture or decor placement on glossy floors and glass surfaces without jitter.
  • Household robots sorting recyclables (glass vs. plastic) using reliable depth of transparent items.
  • Warehouse picking robots handling shrink-wrapped or glossy packages.
  • Surgical or lab robotics interacting with transparent tubes and containers safely.
  • Autonomous checkout systems estimating 3D shapes of packaged goods in clear containers.
  • 3D scanning/reconstruction apps capturing glass showcases and mirrors more accurately.
  • Collision avoidance for drones/robots in indoor spaces with glass doors or mirrors.
  • Film/VFX previsualization where accurate normals on shiny props improve relighting and compositing.
#video diffusion model Ā· #transparent object depth Ā· #normal estimation Ā· #LoRA fine-tuning Ā· #TransPhy3D dataset Ā· #temporal consistency Ā· #flow matching Ā· #video-to-video translation Ā· #refraction and reflection Ā· #synthetic data Ā· #WAN diffusion backbone Ā· #VAE latent space Ā· #robot grasping Ā· #zero-shot generalization Ā· #depth from RGB video