V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Key Summary
- V-RGBX is a new video editing system that lets you change the true building blocks of a scene (like base color, surface bumps, material, and lighting) rather than just painting over pixels.
- It first takes a video apart into intrinsic properties (RGB→X), then puts it back together (X→RGB), which makes edits physically believable and steady over time.
- You can edit just a few keyframes, and the system smartly spreads those changes across the whole video while keeping other things unchanged.
- A special interleaved conditioning trick feeds one intrinsic channel per frame (with identity tags), so the model never confuses lighting with material or geometry.
- A keyframe reference helps keep the scene's style and details, covering things not directly in the intrinsic channels.
- On benchmarks, V-RGBX beats prior systems in accuracy, video quality, and smooth, flicker-free results.
- It works for tasks like retexturing objects, changing materials, and relighting rooms without breaking realism.
- The method is trained on synthetic indoor data, so outdoor scenes can still be challenging and long videos remain tough.
- Even if one intrinsic channel is missing, the model stays robust and can still produce consistent results.
- This research pushes video generation beyond simple color/style changes toward physically consistent, controllable editing.
Why This Research Matters
This research gives creators fine control over what truly changes in a video (color, light, or material) without breaking realism. Filmmakers can relight scenes after shooting, saving costly reshoots and keeping continuity. Product teams can recolor or retexture items while preserving believable highlights and shadows for ads. Educators and AR/VR developers can reliably alter environments without flicker or drift. Even hobbyists can make cleaner, physics-aware edits that look professional. In short, V-RGBX turns video editing from surface paint into ingredient-level cooking, making results more trustworthy and creative.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're coloring a cartoon. Sometimes you want to change the character's shirt color without changing the shadows, or you want to make the sun brighter without repainting the whole picture. Wouldn't it be nice if videos worked that way too?
The Concept (World Before): For years, AI video tools were great at making videos look cool, but they mostly worked in pixel space, like painting on top of the final picture. They could change style, add textures, or follow a prompt, but they didn't understand the physics of how scenes look: what's the true base color of an object (albedo), how surfaces point (normals), what they're made of (material), or how light falls on them (irradiance). Without these, edits often tangled things together: change the color and you might accidentally flatten a shadow; brighten the light and the texture might blur.
Anchor: Think of trying to make a banana look ripe in a video. You want the yellow to change (albedo), but the shiny reflection and the room's light should stay the same. Old tools struggled to keep those pieces separate.
Hook: You know how a recipe card breaks a cake into flour, sugar, eggs, and butter? If you can change just the sugar, you don't have to rebuild the cake from scratch.
The Concept (The Problem): Video editing needed a way to break a video into its ingredients (intrinsic properties) so we could change one without ruining the others, and then re-bake the video. Two big challenges stood in the way: (1) extracting clean intrinsic layers from normal videos (inverse rendering), and (2) spreading your edits from a few keyframes to all frames without flicker or drift.
Anchor: If you recolor a sofa in frame 1, you need that color to follow the sofa as it moves in frames 2, 3, and 4, without the shadows turning weird or the fabric turning into metal.
Hook: Imagine playing telephone with four different messages at once: one for color, one for shape, one for shininess, and one for light. If you shout them all together, they get mixed up.
The Concept (Failed Attempts): Past methods either worked only for single images, or guided videos with signals like depth, optical flow, or text prompts directly in RGB space. They didn't build a clean intrinsic space. Some tried placing empty placeholders when signals were missing, which was memory-hungry and still confused the model. Others could decompose and recompose but couldn't propagate pixel-level edits over time, especially when only a few frames were edited.
Anchor: It's like trying to retell a story when some pages are missing and the chapter titles are mixed up: you end up guessing and making mistakes.
Hook: Suppose you can label each ingredient spoonful as it goes into the bowl: "this is color," "this is light," "this is material."
The Concept (The Gap): What was missing was a single, end-to-end system that (1) decomposes videos into intrinsic channels (RGB→X), (2) synthesizes photorealistic video back from them (X→RGB), and (3) accepts sparse keyframe edits, then sensibly spreads those changes over time while keeping other properties intact. This requires clean disentanglement, smart timing, and a memory of which ingredient is which.
Anchor: You tweak the shirt's color on one frame, and the system knows to keep the shirt's fabric (material) and the room's light (irradiance) steady while carrying your new color through the whole scene.
Hook: Think of a librarian who files every page by topic and time, so you can quickly find "light on frame 7" or "material on frame 12."
The Concept (Real Stakes): Precise, physically grounded control matters in real life. Filmmakers can relight a scene after a shoot without reshooting. Advertisers can recolor products while keeping realistic reflections. Interior designers can try wall paints and lamp setups. Social creators can edit styles without flicker. And game/AR tools can blend reality and edits smoothly. With V-RGBX, changes look believable because they come from the scene's actual ingredients, not just surface paint.
Anchor: Change daylight to sunset across a living-room tour video while keeping the couch fabric and wood floor reflections correct; that's the kind of edit V-RGBX targets.
02 Core Idea
Hook: You know how choreographers place a few key moves, and the dancers smoothly fill in the rest in time with the music?
The Concept (Aha! in one sentence): V-RGBX teaches a video diffusion transformer to read and mix intrinsic layers (one at a time, tagged by type) together with edited keyframes, so it can propagate physically correct edits across the whole video without mixing up color, material, or light.
How it works (intuitive):
- Pull apart a video into intrinsic channels (albedo, normals, material, irradiance).
- Let users edit a few keyframes in intrinsic space (e.g., recolor albedo, change light).
- Feed the model an interleaved stream: one intrinsic channel per frame, clearly labeled by type, plus the edited keyframes as references.
- The model learns to keep identities straight, spread edits through time, and render photorealistic RGB.
Why it matters: Without this, edits bleed together (lighting touches ruin texture, color changes break shading) and results drift over time.
Anchor: Edit the lamp's light color on one frame; V-RGBX updates shadows and glow everywhere while keeping the sofa texture and wall paint unchanged.
Three analogies:
- Cooking: Add ingredients one at a time, labeled (color, light, material), so the chef (model) never confuses salt with sugar.
- School binder: Each subject has its tab (albedo, normal, irradiance, material). During review, you flip tabs in sequence so notes don't mix.
- Orchestra: Only strings play in bar 1, brass in bar 2, winds in bar 3; by interleaving sections with labels, the conductor keeps perfect harmony over time.
Before vs After:
- Before: Edits in RGB space tangled properties; consistency suffered; keyframe guidance was coarse.
- After: Edits in intrinsic space are clean; temporal propagation is stable; a single framework handles decompose, edit, and recompose.
Why it works (intuition):
- Disentanglement: Treating albedo, normals, materials, and light as separate, labeled streams stops cross-talk (like separate lanes on a highway).
- Interleaving: Alternating modalities per frame reduces memory, prevents conflicts with edited frames, and still gives the model all ingredients over time.
- Type tags (Temporal-aware Intrinsic Embedding): Explicit labels per frame protect identity ("this is light," "this is material") even when frames are packed.
- Keyframe reference: A quick peek at edited RGB keyframes adds missing context (fine style details) not inside intrinsic maps.
Building blocks (with mini-sandwiches):
Hook: Think of sorting LEGO bricks by color and shape before building. The Concept: Intrinsic Decomposition Module (RGB→X) splits the video into albedo, normal, material, and irradiance maps using a diffusion transformer backbone. How it works: Encode frames, predict a target intrinsic layer, decode to an image, repeat per modality. Why it matters: Clean layers let you change one property without breaking others. Anchor: Separate the true wall color from the shadows so recoloring doesn't remove shading.
Hook: Imagine feeding the chef one labeled ingredient per step. The Concept: Interleaved Conditioning Mechanism constructs a single timeline by alternating intrinsic channels and avoiding conflicts with edited keyframes. How it works: For keyframes, draw from the edited channels; for other frames, sample from safe, untouched channels; pack into one sequence. Why it matters: Prevents memory blow-up and confusion when multiple channels exist. Anchor: One frame shows albedo, the next shows light; over a few frames, the model sees the whole picture.
Hook: Like name tags on students during a field trip. The Concept: Temporal-aware Intrinsic Embedding (TIE) adds a per-frame modality label and preserves order when frames are compressed into chunks. How it works: Learn an embedding per modality; pack four frames' labels together; broadcast to modulate the latent features. Why it matters: The model never mistakes "light" for "material," even inside compressed time windows. Anchor: The system knows frame 5 is "irradiance" and frame 6 is "normal," so edits don't swap roles.
Hook: A photo reference helps an artist match style. The Concept: Keyframe Reference supplies edited RGB keyframes as extra guidance, concatenated with intrinsic embeddings. How it works: Encode edited keyframes to latents; drop them sometimes during training to make the model robust; use guidance at inference. Why it matters: Recovers fine details not in intrinsic maps and aligns overall scene style. Anchor: The wood grain or fabric weave stays realistic after recoloring.
03 Methodology
At a high level: Input video → Inverse rendering (RGB→X) → Keyframe edits (intrinsics) → Interleaved intrinsic conditioning sequence → Forward rendering (X→RGB) with keyframe reference + modality tags → Output edited video.
Step 1: Inverse rendering (RGB→X)
- What happens: The system takes the input video and predicts four intrinsic channels for each frame: albedo (base color), normal (surface direction), material (roughness/metallic/AO), and irradiance (incoming light/shading). A diffusion-transformer backbone encodes the video frames and decodes each requested intrinsic layer.
- Why this step exists: If we don't separate the ingredients, any edit becomes a messy paint-over that breaks realism: color changes can erase shadows, and lighting tweaks can warp textures.
- Example: For a living room video, it outputs (a) the true wall paint color without shadows, (b) the couch's surface directions, (c) how shiny the table is, and (d) the room's light/shadow patterns.
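Under the hood, this amounts to querying the same backbone once per target layer. A minimal sketch of that loop, where `decomposer` and its `target` argument are hypothetical stand-ins for the paper's diffusion-transformer backbone rather than a released API:

```python
# Minimal, illustrative RGB->X loop; `decomposer` is a placeholder callable that
# maps (video frames, target modality) to per-frame intrinsic maps.
MODALITIES = ["albedo", "normal", "material", "irradiance"]

def inverse_render(decomposer, video_frames):
    """Collect one intrinsic layer per modality for every frame of the clip."""
    intrinsics = {}
    for modality in MODALITIES:
        # the backbone predicts (and decodes) one requested intrinsic layer per pass
        intrinsics[modality] = decomposer(video_frames, target=modality)
    return intrinsics
```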
Step 2: Edit keyframes in intrinsic space
- What happens: You choose one or a few frames and change only the channels you care about: recolor albedo, adjust light color in irradiance, tweak material roughness, or nudge normals for bumps. The system re-decomposes those edited keyframes to get their updated intrinsic maps.
- Why this step exists: Edits must be precise and physically meaningful; by touching just one channel, you keep other properties safe.
- Example: Paint the sofa blue in albedo while leaving irradiance (shadows) and material (fabric vs. leather) untouched.
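To make the "touch one channel only" idea concrete, here is a tiny NumPy sketch of an albedo-only recolor on a single keyframe; the mask and color are illustrative assumptions, and real edits could equally come from any image editor or text-to-image tool.

```python
import numpy as np

def recolor_albedo(albedo, mask, new_rgb=(0.15, 0.25, 0.8)):
    """Recolor only the base-color (albedo) map of one keyframe.
    albedo: (H, W, 3) float array in [0, 1]; mask: (H, W) boolean object mask.
    Normals, material, and irradiance are simply left alone, so shadows and
    fabric appearance are preserved by construction."""
    edited = albedo.copy()
    edited[mask] = np.asarray(new_rgb, dtype=albedo.dtype)  # e.g., turn the sofa blue
    return edited
```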
Step 3: Build the interleaved intrinsic conditioning sequence
- What happens: Instead of feeding every channel at every frame (which is heavy and can conflict with edits), the system constructs one sequence that alternates which intrinsic channel is shown per frame. At keyframes, it samples from the edited channel(s); elsewhere, it samples non-conflicting channels from that frame.
- Why this step exists: It reduces memory, avoids feeding signals that conflict with your edits, and still gives the model all necessary ingredients over a short time window.
- Example: Frames 1-4 might be [albedo, normal, material, irradiance]; frames 5-8 repeat the pattern. At a keyframe where irradiance was edited, the sequence uses the edited irradiance instead of the original.
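A minimal sketch of how such a conflict-aware interleaved sequence could be assembled; the data structures and the round-robin choice of "safe" channels are assumptions for illustration, not the paper's exact sampling rule.

```python
MODALITIES = ["albedo", "normal", "material", "irradiance"]

def build_interleaved_sequence(intrinsics, edits):
    """intrinsics: list over frames of {modality: map}.
    edits: {keyframe_index: {modality: edited_map}} for the user-edited channels.
    Returns one (modality, map) pair per frame: edited channels at keyframes,
    non-conflicting original channels elsewhere."""
    edited_modalities = {m for maps in edits.values() for m in maps}
    # channels that never conflict with the user's edit (fall back to all if needed)
    safe = [m for m in MODALITIES if m not in edited_modalities] or MODALITIES
    sequence = []
    for t, frame_maps in enumerate(intrinsics):
        if t in edits:
            modality, cond_map = next(iter(edits[t].items()))  # use the edited channel
        else:
            modality = safe[t % len(safe)]   # rotate so all ingredients appear over time
            cond_map = frame_maps[modality]
        sequence.append((modality, cond_map))
    return sequence
```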
Step 4: Add modality identity with Temporal-aware Intrinsic Embedding (TIE)
- What happens: The video backbone compresses time in chunks (e.g., 4 frames per chunk). TIE tags each frame in the chunk with a learned modality embedding ("this is albedo," "this is normal," etc.) and packs them together, then broadcasts this tag into the model's features.
- Why this step exists: Without name tags, the model can confuse which ingredient it's seeing inside a compressed time block, causing cross-talk and flicker.
- Example: Within one chunk, the system knows slot 1 = albedo, slot 2 = normal, etc., so it won't treat a light map like a material map.
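A minimal PyTorch sketch of this kind of tag: one learned embedding per modality, packed over a 4-frame chunk and broadcast onto the temporally compressed latent. The module name, dimensions, and the additive modulation are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalIntrinsicEmbedding(nn.Module):
    """Illustrative TIE-style tag: label each conditioning frame with its modality,
    pack the labels chunk-wise, and modulate the compressed latent with them."""
    def __init__(self, num_modalities=4, tag_dim=64, latent_dim=256, chunk=4):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, tag_dim)   # "this is albedo", ...
        self.proj = nn.Linear(chunk * tag_dim, latent_dim)   # packed tags -> latent channels
        self.chunk = chunk

    def forward(self, modality_ids, latent):
        # modality_ids: (T,) int64, one modality index per conditioning frame (T % chunk == 0)
        # latent: (B, T // chunk, latent_dim, H, W) temporally compressed features
        tags = self.embed(modality_ids)                         # (T, tag_dim)
        packed = tags.reshape(-1, self.chunk * tags.shape[-1])  # (T // chunk, chunk * tag_dim)
        mod = self.proj(packed)                                 # (T // chunk, latent_dim)
        # broadcast each chunk's identity tag over batch and spatial dimensions
        return latent + mod[None, :, :, None, None]
```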
Step 5: Provide the edited keyframe reference
- What happens: The edited RGB keyframes are encoded into a reference latent. During training, the model sometimes drops this reference so it learns to rely on intrinsics too. At inference, guidance adjusts how strongly the reference steers the look.
- Why this step exists: Intrinsic maps don't carry every tiny style cue (like micro-texture or color tone). The keyframe helps match the scene's appearance faithfully.
- Example: The exact wood grain and fabric weave remain consistent after recoloring because the keyframe supplies those details.
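A minimal sketch of the reference-dropout idea and a guidance-style blend at inference; the zeroing-out convention, dropout rate, and function names are assumptions, not the paper's exact training recipe.

```python
import torch

def with_keyframe_reference(cond_tokens, keyframe_latent, p_drop=0.1, training=True):
    """Concatenate the encoded edited-keyframe latent with the intrinsic conditioning.
    During training the reference is occasionally dropped (here: zeroed) so the
    model also learns to work from the intrinsic channels alone."""
    if training and torch.rand(()) < p_drop:
        keyframe_latent = torch.zeros_like(keyframe_latent)
    return torch.cat([cond_tokens, keyframe_latent], dim=1)

def guided_prediction(pred_with_ref, pred_without_ref, scale=2.0):
    """At inference, blend predictions made with and without the reference;
    `scale` controls how strongly the edited keyframe steers the look
    (scale = 1.0 reduces to the with-reference prediction)."""
    return pred_without_ref + scale * (pred_with_ref - pred_without_ref)
```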
Step 6: Forward rendering (X→RGB)
- What happens: The model takes (a) the noisy video latent to denoise, (b) the interleaved intrinsic sequence embeddings with TIE tags, and (c) the keyframe reference latent, and then denoises step-by-step to produce the edited RGB video. Untouched properties are preserved; edited ones propagate over time.
- Why this step exists: This is where the ingredients rebake into a photorealistic, temporally coherent video that respects your keyframe edits.
- Example: If the sofa was turned blue in one frame, all future frames keep the blue sofa with correct shadows and reflections.
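A minimal sketch of the denoising loop that ties these inputs together; `model` and `scheduler` are placeholders for the diffusion-transformer backbone and its noise schedule, and their call signatures are assumptions rather than a real library API.

```python
import torch

@torch.no_grad()
def render_edited_video(model, scheduler, latent_shape, intrinsic_cond, keyframe_ref, steps=30):
    """X->RGB sketch: start from a noisy video latent and denoise it step by step,
    conditioned on the interleaved intrinsic sequence (with TIE tags) and the
    edited-keyframe reference latent."""
    latent = torch.randn(latent_shape)                   # noisy video latent
    for t in scheduler.timesteps(steps):                 # descending noise levels
        noise_pred = model(latent, t, cond=intrinsic_cond, ref=keyframe_ref)
        latent = scheduler.step(noise_pred, t, latent)   # one denoising update
    return latent  # decode with the video VAE to obtain the edited RGB frames
```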
The secret sauce (what makes it clever):
- Interleaving + conflict-aware sampling: One modality per frame keeps memory low and avoids mixing conflicting signals near edits, while still delivering full context over time.
- Modality name tags (TIE): Explicit per-frame labels protect identities when time is compressed, stopping "light" from being mistaken for "material."
- Keyframe reference fusion: Adding the edited keyframe latent next to intrinsic inputs helps the model recover fine appearance details.
- Unified end-to-end loop: A single framework that does RGB→X and X→RGB plus keyframe propagation, so training and inference stay consistent.
Concrete mini-walkthrough:
- Input: A 6-second living room pan at 832×480.
- Decompose: Produce albedo (flat paints and fabrics), normals (surface directions), material (roughness/metallic/AO), irradiance (light/shadows).
- Edit: In frame 1, change the lamp's light color (irradiance) to warm orange.
- Interleave: Build [albedo, normal, material, irradiance(edited), albedo, normal, material, irradiance, ...].
- Tag: TIE labels each frame's modality inside 4-frame chunks.
- Reference: Include the edited frame-1 RGB as style guidance.
- Render: The output video shows warm lighting across the scene, with consistent shadows and unchanged sofa texture and wall paint.
What breaks without each step:
- No inverse rendering: You'd paint over pixels, mixing up shadow and color.
- No interleaving: Memory balloons, and conflicting channels confuse the model near edits.
- No TIE: The model swaps identities, causing flicker and cross-talk.
- No keyframe reference: Fine style and texture fidelity suffer, even if physics is right.
04 Experiments & Results
The test: The authors check three abilities.
- RGB→X (inverse rendering): How accurately can the model extract albedo, normal, material, and irradiance from videos?
- X→RGB (forward rendering): Given intrinsic layers (one interleaved per frame) and a keyframe reference, how well can it recreate photorealistic videos?
- End-to-end cycle (RGB→X→RGB): If you decompose a video and then rebuild it, how close is the result to the original, and how smooth is it over time?
Why these matter: If decomposition is weak, edits won't be precise. If recomposition is weak, results won't look real. If the cycle is weak, information gets lost and edits won't propagate cleanly.
The competition: They compare with image/video baselines like RGB↔X (image-level intrinsic editing) and DiffusionRenderer (video inverse/forward rendering), and show qualitative comparisons to general video editors like VACE and AnyV2V.
Scoreboard with context:
- X→RGB (forward rendering): V-RGBX hits PSNR ≈ 22.42 and SSIM ≈ 0.795, with LPIPS ≈ 0.193 and FVD ≈ 368. Think of PSNR like a clarity score: V-RGBX gets an A when others get a C. A lower FVD means videos look more natural and consistent over time; here V-RGBX's score in the 300s is far better than the 1000+ of RGB↔X. Without the keyframe reference, V-RGBX still does well (PSNR ≈ 21.48), but the reference bumps quality further, like adding a sharpen filter that respects physics.
- RGB→X (inverse rendering): On synthetic interiors, V-RGBX yields better albedo, normal, and irradiance than RGB↔X and DiffusionRenderer, both in pixel accuracy and visual consistency. It's like separating the cake ingredients more cleanly so later baking goes smoothly.
- Cycle consistency (RGB→X→RGB): On synthetic data, V-RGBX reaches PSNR ≈ 22.57, SSIM ≈ 0.799, and FVD ≈ 368; it reconstructs the original better and stays steadier over time, like reassembling a LEGO set that looks just like the box picture. On RealEstate10K (real-world), V-RGBX still leads, with PSNR ≈ 17.88 and SSIM ≈ 0.753, showing good generalization beyond training data.
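For context on what these fidelity numbers mean, PSNR compares a reconstructed frame with the ground truth on a logarithmic scale (higher is better). A tiny NumPy sketch of the standard formula, independent of the paper's evaluation code:

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val].
    Values around 22 dB, as in the cycle-consistency test, mean the rebuilt
    frames stay close to the originals; higher means less distortion."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```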
Surprising findings and insights:
- DiffusionRenderer sometimes scores very high on smoothness, but qualitatively looks faded; this shows that smoothness alone can be misleading if detail is lost. V-RGBX balances smoothness and realism.
- Keyframe reference helps more than expected: small details (micro-textures and style tones) return, which intrinsic maps alone don't capture.
- Robust to missing channels: Even when an intrinsic channel (like irradiance) is dropped from conditioning, V-RGBX stays strong, like a team that can still win even if one player sits out. Adding that channel just for the first frame already improves results, showing edits propagate well.
Qualitative comparisons (what your eyes see):
- Intrinsic-aware edits: Prior methods drift (colors shift, shiny bits become dull, or new unwanted objects appear). V-RGBX keeps edits clean and consistent.
- Relighting: V-RGBX changes light color or shadows while maintaining materials and geometry. Others often tangle light with texture or introduce artifacts.
- X→RGB rendering: V-RGBX models shadows and reflections reliably, keeping temporal coherence; others miss realistic reflections or produce frame-to-frame wobble.
Takeaway: The numbers and visuals agree. Interleaving plus modality tags and keyframe reference creates a stable, physically grounded pipeline that beats baselines in fidelity (PSNR/SSIM/LPIPS), realism over time (FVD), and practical edit propagation.
05 Discussion & Limitations
Limitations (specific):
- Training data skew: The model is trained on synthetic indoor scenes. Outdoors (strong sun, sky lighting, vegetation) and very complex real-world cases can challenge the decomposition and relighting.
- One-modality-per-frame interleaving: The current setup feeds exactly one intrinsic type each frame. This keeps memory low and identities clear but can limit scenarios where multiple properties change together in the same instant.
- Backbone constraints: Relying on a large video diffusion transformer means higher compute and memory costs, making very long videos and real-time editing difficult.
- Edit detection reliance: The pipeline depends on clean identification of which intrinsic channel was changed in keyframes. If the edit tool or decomposition mislabels an edit, propagation can carry the wrong signal.
Required resources:
- Strong GPUs (the paper uses many A100s for training), good VRAM, and time for both inverse and forward training. For inference, a capable GPU is recommended, especially for higher resolutions.
- Edited keyframes from an image editor or text-to-image tool, plus the original video.
When not to use:
- Rapidly changing, non-physical effects (strobe lights, fireworks) where irradiance changes wildly from frame to frame.
- Scenes demanding exact measured materials and lighting (e.g., scientific photometry) beyond what a learned model can infer.
- Ultra-long videos or strict real-time pipelines without sufficient compute.
Open questions:
- Multi-modality per frame: Can we feed multiple intrinsic channels at once while keeping memory low and identities clear?
- Broader generalization: How to train or adapt for outdoors and diverse real scenes without losing the clean disentanglement?
- Efficiency: Can we compress or distill the backbone for faster inference while preserving edit fidelity?
- Self-supervised decomposition: Can we reduce reliance on synthetic paired data and learn intrinsics directly from raw videos at scale?
06 Conclusion & Future Work
Three-sentence summary: V-RGBX is an end-to-end video editor that splits a video into physical ingredients (albedo, normals, materials, light), lets you edit a few keyframes in that intrinsic space, and then rebuilds the whole video with your changes cleanly propagated. Its key trick is an interleaved, labeled conditioning stream plus edited keyframe references, which prevents mixing up properties and keeps results stable over time. The system outperforms prior methods in accuracy, realism, and temporal smoothness, enabling reliable retexturing and relighting in real videos.
Main achievement: Unifying RGB→X (inverse rendering), interleaved intrinsic conditioning with identity tags, and X→RGB (forward rendering) into a single, practical pipeline for physically grounded, keyframe-driven video editing.
Future directions: Support multiple intrinsic channels per frame without losing clarity; extend training to outdoor and challenging real scenes; scale to longer videos with better efficiency; and explore self-supervised or weakly supervised intrinsic learning from large video corpora.
Why remember this: V-RGBX moves video editing from surface paint to true scene ingredients, giving creators surgical control over what changes (and what doesnât). It shows that video diffusion models can understand and respect physics-like layers, opening a path to more trustworthy, consistent, and creative video tools.
Practical Applications
- Scene relighting in post-production (changing time of day or lamp color while preserving materials).
- Product recoloring/retexturing for e-commerce without reshooting (keep realistic reflections and shadows).
- Interior design previews (try new wall paints or fabrics on existing room videos).
- Film and TV continuity fixes (match lighting and colors between takes).
- AR/VR content editing where physical consistency reduces motion sickness and improves immersion.
- Marketing and social media stylization that stays stable across frames (no flicker or drifting textures).
- Education and scientific visualization to demonstrate lighting/material effects independently.
- Game trailer polish: adjust materials or lighting late in production while keeping motion consistent.
- Virtual staging for real estate: change furniture textures or wall colors in walkthrough videos.
- Cinematic color grading that preserves true materials while adjusting light mood.