Over++: Generative Video Compositing for Layer Interaction Effects
Key Summary
- •Over++ is a video AI that adds realistic effects like shadows, splashes, dust, and smoke between a foreground and a background without changing the original footage.
- •It introduces a new task called augmented compositing, which focuses on generating semi-transparent, physics-like interactions where layers meet.
- •Users can steer the effects with a simple scribble-like mask, a text prompt (e.g., “soft shadow” or “red smoke”), or both at the same time.
- •The team built a special training set using video layer decomposition to get paired examples with and without effects, plus extra unpaired examples to keep strong text understanding.
- •Over++ fine-tunes a video diffusion transformer, keeps the original video details by not blanking out masked areas, and preserves motion and identity very well.
- •A tri-mask design lets the same model work with full masks, no masks, or occasional keyframe masks, reducing tedious annotation.
- •Across benchmarks and a user study (including VFX artists), Over++ beats or matches strong baselines while keeping the input video intact.
- •A new direction-aware metric (CLIP_dir) better measures whether the change you made matches the effect you wanted.
- •Even with limited data and no depth or camera assumptions, Over++ generalizes to many scenes and effects.
- •Limitations include occasional color shifts at very high guidance, rare hallucinated effects, and not being pixel-perfect due to VAE reconstruction.
Why This Research Matters
When you paste a person into a new scene, the subtle glue—shadows on the floor, reflections in a puddle, tiny dust puffs—makes it feel real. Over++ generates that glue while leaving the original subject and background untouched, so edits look natural rather than fake. This cuts down on the most tedious, frame-by-frame work that slows creators and studios. With simple controls (a rough mask and a plain-English prompt), non-experts can achieve effects that used to take pros hours. From movies and ads to social media and education, believable video edits become faster, cheaper, and more accessible. As a result, more people can tell richer visual stories without huge budgets or technical teams.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you filmed a friend riding a bike through a puddle. In real life, the water would splash, make ripples, and reflect light on the wet road. But if you just cut your friend out and paste them on top of a clean background, the magic is missing—no splash, no shadow, no ripple.
🥬 The Concept (World Before): Video compositing has long used the classic “over” trick to stack a foreground (like your friend) onto a background (like the street). It’s great for building a scene from pieces, but it doesn’t create the in-between stuff—semi-transparent, physics-like effects that glue the pieces together (shadows, reflections, dust, smoke, splashes). Artists in tools like Nuke or After Effects often hand-craft those effects, which is slow and hard.
🍞 Anchor: A movie scene with a dragon landing looks fake until artists add dust clouds, wing shadows sweeping across the ground, and tiny debris. Those “layer interaction” details sell the shot.
🍞 Hook: You know how erasing and redrawing every frame of a flipbook to fix one tiny thing feels exhausting? That’s what per-frame video inpainting can be like when making water wakes or smoke.
🥬 The Concept (The Problem): Existing generative video models can create pretty scenes, but they tend to drift—changing colors, shapes, or even the subject—so they’re hard to use inside careful, professional workflows. Inpainting methods can change certain areas using masks, but they often need a new mask every frame and still struggle with complex, semi-transparent effects (like a boat’s wake).
🍞 Anchor: If you ask a general video AI to add a shadow under a jumping kid, it might change the kid’s shoes or the background too. If you use mask-only inpainting, you may spend hours drawing perfect masks and still get a weird, rubbery “shadow.”
🍞 Hook: Think of a sandwich: bread (foreground and background) and the sauce that ties flavors together (the environmental effects). Without the sauce, the layers feel separate and dry.
🥬 The Concept (Failed Attempts): People tried: (1) fully simulated effects (accurate but time-consuming and expensive), (2) general editing models (powerful but unstable for keeping the original video), and (3) inpainting models (need dense masks, still miss subtle, transparent phenomena). None directly targeted “add realistic in-between effects while preserving everything else.”
🍞 Anchor: It’s like trying to add a gentle puddle splash with a tool that either remodels the whole road or makes you trace every droplet for each frame.
🍞 Hook: Imagine a sticker that says “only add magic here,” and a caption that says “soft, bluish smoke.” Now imagine an AI that obeys both, and otherwise leaves the video alone.
🥬 The Concept (The Gap): What was missing was a specialized, controllable generator for semi-transparent environmental effects that: (a) keeps the original subject and background intact, (b) can follow a mask and/or a text prompt, and (c) works even without perfect camera or depth info.
🍞 Anchor: You drop in a skateboarder over a new background and say “add a soft shadow under the board,” scribble roughly under the board, and the system fills in a correct shadow over time while leaving everything else untouched.
🍞 Hook: When a fix is both faster and more faithful, whole teams breathe easier.
🥬 The Concept (Real Stakes): For pros, this means fewer hours simulating dust, rotoscoping masks, or fixing drift. For everyday creators, it means believable edits: reflections on wet floors, wakes behind kayaks, or color-tuned smoke—all while keeping the original look-and-motion of the footage.
🍞 Anchor: A YouTuber can take a parkour clip shot on a dry day and convincingly add soft puddle ripples and reflections after a rainstorm—without reshooting or repainting every frame.
02 Core Idea
🍞 Hook: You know how when you decorate cookies, you can choose where to sprinkle sugar (mask) and what flavor to use (text), but you still want the cookie itself to stay the same?
🥬 The Concept (Aha! in one sentence): Over++ is a video diffusion model fine-tuned to add only the “in-between” environmental effects—guided by a mask and/or a prompt—while preserving the original foreground and background.
How it works (big picture):
- Start with a simple composite (foreground over background) that lacks effects.
- Optionally mark rough regions where effects should appear (mask) and describe the effect (text prompt).
- Diffuse from noise back to video, but condition on the original composite, the mask, and the text, so the model adds just the right semi-transparent effect.
- Keep the rest of the video intact by passing the full context into the network instead of blanking masked areas.
Why it matters: Without this, models either hallucinate changes everywhere or require exhausting per-frame masks and still miss realistic, semi-transparent physics.
🍞 Anchor: You paste a runner over a sandy beach and ask for “light dust trails.” Over++ softly paints dust that trails behind the feet, matches the wind, and leaves the runner and beach textures alone.
Three analogies:
- Movie makeup artist: The actor (foreground) and set (background) are unchanged; the artist adds just the right smudge (effects) where needed.
- Window doodles: You draw on the glass (semi-transparent layer) without altering the room behind it.
- Spice shaker: You shake paprika (effect) only over the masked part of the dish, and you choose “smoky” or “mild” by prompt.
Before vs After:
- Before: General editors often changed subjects unintentionally; inpainting demanded dense masks and still struggled with wakes/smoke.
- After: Over++ follows text and/or sparse masks to synthesize physically sensible, see-through effects that bind layers together, keeping the scene’s identity and motion.
Why it works (intuition):
- Conditioning cocktail: The model is conditioned on three things at once—(a) the input composite for context, (b) a mask that says “where,” and (c) a prompt that says “what/how.”
- Don’t erase the canvas: By not zeroing out masked regions in the latent space, the model always “sees” the full scene, reducing hallucinations and preserving fidelity.
- Data that teaches subtlety: Paired training (with/without effects) shows exactly the difference to add; unpaired text-to-video augmentation preserves prompt-following skills.
- One model, many controls: A tri-mask trick lets the same model work with full masks, no masks, or just keyframe masks.
Building blocks (explained with sandwiches):
🍞 Hook: Imagine sticking a character sticker onto a photo of a park. 🥬 Augmented Compositing: It’s a special kind of compositing where the goal is to add only the environmental interactions between foreground and background (like shadows and splashes) without changing either layer. How it works: (1) Combine FG over BG. (2) Ask for effects with text and optional mask. (3) Generate just the in-between effects. (4) Keep FG/BG identity and motion. Why it matters: Without it, scenes look pasted-together and fake. 🍞 Anchor: A bike over a puddle gets believable ripples and reflections that match the tires and water.
🍞 Hook: Think of using correction fluid to fix only certain words on a page. 🥬 Video Inpainting: It’s editing selected regions of a video while keeping the rest untouched. How it works: (1) Mark areas. (2) Predict missing/new content that matches neighbors and time. Why it matters: Without inpainting, you’d have to recreate whole frames. 🍞 Anchor: Erase unwanted wires from a stunt shot while leaving the actor intact.
🍞 Hook: A chef learns many recipes and can invent new dishes by remixing patterns. 🥬 Generative Models: They learn patterns from lots of examples so they can create new, fitting content. How it works: (1) Study data. (2) Learn a compact recipe. (3) Generate new samples that fit the recipe. Why it matters: Without this, the system can’t invent realistic effects on the fly. 🍞 Anchor: Making a brand-new splash that still looks like real water.
🍞 Hook: Imagine a foggy window where an image slowly appears as you wipe away mist. 🥬 Diffusion Model: It starts from noisy video and step-by-step removes noise, guided by context, to form the final video. How it works: (1) Add noise training; (2) Learn to denoise; (3) At test time, denoise from pure noise with guidance. Why it matters: Without diffusion, you don’t get stable, high-quality, controllable generation. 🍞 Anchor: A wake behind a boat emerges gradually as the model “clears” noise along the boat’s path.
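For readers who want to see the mechanics behind "add noise, learn to denoise," here is a toy PyTorch sketch of one diffusion training step. The noise schedule, function names, and the `denoiser` signature are illustrative assumptions, not Over++'s actual implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, clean_latent, cond, num_steps: int = 1000):
    """Toy sketch of one denoising-diffusion training step (illustrative only).

    (1) Pick a random noise level, (2) corrupt the clean video latent,
    (3) ask the network to predict the injected noise given the conditioning.
    """
    t = torch.randint(0, num_steps, (clean_latent.shape[0],))      # random timestep per clip
    noise = torch.randn_like(clean_latent)                          # Gaussian noise to inject
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2  # toy cosine schedule
    a = alpha_bar.view(-1, *([1] * (clean_latent.dim() - 1)))       # broadcast to latent shape
    noisy = a.sqrt() * clean_latent + (1 - a).sqrt() * noise        # "foggy window" version
    pred = denoiser(noisy, t, cond)                                 # model guesses the noise
    return F.mse_loss(pred, noise)
```

At test time the process runs in reverse: start from pure noise and repeatedly denoise, guided by the conditioning, until a clean video emerges.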
🍞 Hook: Use painter’s tape so paint only lands where you want. 🥬 Effect Masking: A mask marks roughly where the new effect is allowed. How it works: (1) Provide a binary/gray mask. (2) Model prioritizes effect inside the mask. (3) Outside remains unchanged. Why it matters: Without masks, effects can spill everywhere. 🍞 Anchor: A soft shadow appears under a dancer but not on the walls.
🍞 Hook: Like sliding the tape a bit wider or narrower to tweak the edge. 🥬 Mask Control: Adjusting or keyframing masks to refine where and when effects appear. How it works: (1) Use no mask, a full mask, or keyframe masks. (2) The tri-mask design supports all in one model. Why it matters: Without flexible control, you either over-constrain or under-control the effect. 🍞 Anchor: Stronger splashes exactly at the jump’s peak frame; lighter ripples before and after.
🍞 Hook: Ordering at a café: “Make it a mild, blue-gray smoke, please.” 🥬 Text-Prompt Guidance: Words steer the style, intensity, and color of effects. How it works: (1) Encode text. (2) Condition denoising on text. (3) Use guidance to emphasize prompt. Why it matters: Without prompts, you’d be stuck with one look. 🍞 Anchor: Switching “harsh shadow” to “soft shadow” changes the shadow’s edges and darkness in the output.
03 Methodology
At a high level: Composite without effects (FG over BG) + optional effect mask + text prompt → Over++ diffusion model (conditioned on video, mask, and text) → Output video with realistic environmental effects while preserving the original content.
Step 1. Build paired training data so the model learns exactly what to add. 🍞 Hook: Imagine you have two photos of the same scene: one with splashes, one without. The difference between them teaches you what a splash looks like. 🥬 Omnimatte Layer Decomposition: A method to split a video with effects into a foreground-with-effects layer and a clean background layer. How it works: (1) Decompose I_gt (with effects) ≈ alpha*I_fg + (1−alpha)*I_bg. (2) Extract the pure subject I_fg with a segmentation mask. (3) Re-compose I_over = subject-over-background to make a version with no effects. Why it matters: Without pairs (with/without effects), the model can’t learn the exact “delta” that glues layers together. 🍞 Anchor: A biker splashing through a puddle is split into biker+splashes and a dry puddle; recombining biker over dry puddle gives the clean input.
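To make the layer math concrete, here is a minimal NumPy sketch of the classic "over" composition used to rebuild the effect-free input. The function and array names are illustrative, not the paper's code:

```python
import numpy as np

def over(fg_rgb: np.ndarray, fg_alpha: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """Classic 'over' compositing: I = alpha*FG + (1 - alpha)*BG.

    fg_rgb, bg_rgb: float arrays in [0, 1], shape (T, H, W, 3)
    fg_alpha:       float array in [0, 1], shape (T, H, W, 1)
    """
    return fg_alpha * fg_rgb + (1.0 - fg_alpha) * bg_rgb

# Paired training example (hypothetical variable names):
#   I_gt   = original footage that already contains splashes/shadows
#   I_over = pure subject (segmentation alpha) composited over the clean
#            background recovered by layer decomposition -> no effects
# The model then learns to generate the "difference" between I_over and I_gt.
```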
Step 2. Derive an effect mask automatically from differences. 🍞 Hook: Like tracing only the parts of two drawings that changed. 🥬 Mask Generation with Pruning: Compute δ(I_gt, I_over), binarize with an automatic threshold, and clean noise with simple morphology. How it works: (1) Difference → grayscale. (2) Otsu threshold → binary mask. (3) Erode/dilate/median filter to remove speckles and wobble. Why it matters: Noisy masks confuse the model; cleaned masks teach where effects truly live. 🍞 Anchor: The mask lights up only where the puddle splashes and ripples actually appear.
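A rough OpenCV sketch of this differencing-plus-pruning recipe; the kernel size and the exact ordering of the cleanup operations are assumptions, not the paper's settings:

```python
import cv2
import numpy as np

def effect_mask(i_gt: np.ndarray, i_over: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Binary mask of where the environmental effect changed the frame.

    i_gt, i_over: uint8 BGR frames of identical size.
    kernel_size:  odd morphology/median kernel size (illustrative value).
    Returns a uint8 mask (0 or 255).
    """
    diff = cv2.absdiff(i_gt, i_over)                       # per-pixel difference
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)          # collapse to grayscale
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # automatic threshold
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erode + dilate: drop speckles
    mask = cv2.medianBlur(mask, kernel_size)               # smooth jittery edges
    return mask
```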
Step 3. Train for both masked and mask-free use. 🍞 Hook: Sometimes you know exactly where to paint; sometimes you don’t. 🥬 Tri-Mask Design: A single model that handles (a) precise masks, (b) unknown frames (gray fill), and (c) no mask. How it works: (1) Randomly replace masks with a uniform gray “unknown” during training. (2) The model learns to generate effects even when location is uncertain. Why it matters: Without this, you’d need separate models or tedious frame-by-frame masks. 🍞 Anchor: You annotate only a key frame for a big splash; the model fills in consistent splashes around it.
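A hedged PyTorch sketch of how such a tri-mask condition might be sampled during training; the sampling probabilities and the gray/zero encodings are guesses for illustration only:

```python
import torch

def sample_tri_mask(effect_mask: torch.Tensor,
                    p_precise: float = 0.4,
                    p_unknown: float = 0.4) -> torch.Tensor:
    """Pick one of three mask conditions for a training clip.

    effect_mask: float tensor in {0, 1}, shape (T, 1, H, W).
    Encodings are illustrative:
      precise -> the derived effect mask
      unknown -> uniform gray 0.5 ("effects somewhere, location unspecified")
      none    -> all zeros (placeholder for "no mask given")
    """
    u = torch.rand(())
    if u < p_precise:
        return effect_mask
    if u < p_precise + p_unknown:
        return torch.full_like(effect_mask, 0.5)
    return torch.zeros_like(effect_mask)
```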
Step 4. Keep strong language skills via unpaired data. 🍞 Hook: If you only practice one song, you forget how to improvise. 🥬 Unpaired Text-to-Video Augmentation: Generate extra training clips from varied captions to preserve prompt-following. How it works: (1) Create multiple caption variants (same scene, different effect styles). (2) Generate I_gt with a T2V backbone. (3) Train Over++ with I_gt and text while zeroing the missing mask and input video latents so it practices text-only conditioning too. Why it matters: Without unpaired data, the model can “forget” how to follow text edits (language drift). 🍞 Anchor: Prompting “thin blue smoke” vs “dense red smoke” actually changes the result.
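A small sketch of the idea, assuming the model is conditioned on concatenated video and mask latents: for unpaired text-to-video clips those latents are simply zeroed, so only the text steers the denoiser. Names and shapes are illustrative:

```python
import torch

def build_conditioning(video_latent: torch.Tensor,
                       mask_latent: torch.Tensor,
                       text_emb: torch.Tensor,
                       paired: bool):
    """Assemble conditioning for one training clip (shapes illustrative).

    For unpaired text-to-video clips there is no 'no effects' input video or
    effect mask, so those latent channels are zeroed and the model practices
    following the text prompt alone, which keeps prompt-following from drifting.
    """
    if not paired:
        video_latent = torch.zeros_like(video_latent)
        mask_latent = torch.zeros_like(mask_latent)
    # e.g. video_latent (B, C, T, H, W) and mask_latent (B, 1, T, H, W)
    return torch.cat([video_latent, mask_latent], dim=1), text_emb
```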
Step 5. Fine-tune a video inpainting diffusion transformer with preserved context. 🍞 Hook: Don’t blindfold the painter while asking for tiny details. 🥬 Pass-Through Latents (Don’t Blank the Mask): Feed fully encoded video latents instead of zeroing masked regions. How it works: (1) Freeze the VAE encoder/decoder. (2) Fine-tune attention blocks of the DiT. (3) Condition on concatenated latents: I_over, M_effect, and text. Why it matters: Blanking breaks context and invites hallucinations; pass-through preserves scene structure and identity. 🍞 Anchor: A runner’s outfit, lighting, and motion stay consistent while only dust is added.
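A schematic PyTorch module showing the pass-through idea: the full input-video latents and the mask are concatenated with the noisy latents rather than blanked out inside the mask. Channel counts, layer names, and token layout are assumptions, not the actual DiT:

```python
import torch
import torch.nn as nn

class OverConditioning(nn.Module):
    """Illustrative sketch: concatenate noisy latents with the *full* (un-blanked)
    input-video latents and a downsampled effect mask along the channel axis,
    then project to the transformer's hidden width."""

    def __init__(self, latent_ch: int = 16, hidden: int = 1024):
        super().__init__()
        # noisy latent + input-video latent + 1 mask channel
        self.proj = nn.Linear(2 * latent_ch + 1, hidden)

    def forward(self, noisy: torch.Tensor, video: torch.Tensor, mask: torch.Tensor):
        # noisy, video: (B, T, H, W, C) latents; mask: (B, T, H, W, 1)
        x = torch.cat([noisy, video, mask], dim=-1)  # keep full scene context everywhere
        return self.proj(x)                          # tokens fed to the DiT blocks
```

Because the masked regions are never zeroed, the network always sees the original scene it must preserve, which is what keeps identity and motion intact.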
Step 6. Inference controls for finer edits. 🍞 Hook: Volume knob for style. 🥬 Classifier-Free Guidance (CFG): A knob that controls how strongly the model follows the prompt. How it works: (1) Mix conditioned and unconditioned predictions. (2) Higher CFG → stronger effect emphasis. Why it matters: Without a knob, you can’t dial “subtle vs strong.” 🍞 Anchor: “Mild wake” vs “turbulent wake” from the same input by changing CFG.
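The standard classifier-free guidance blend looks roughly like this; the `model` signature and the default scale are placeholders:

```python
import torch

def cfg_denoise(model, x_t: torch.Tensor, t: torch.Tensor, cond, guidance_scale: float = 5.0):
    """Classifier-free guidance: blend unconditional and conditional predictions.

    `model` is any denoiser callable as model(x_t, t, cond); cond=None means
    the text/mask conditioning is dropped. Values are illustrative.
    """
    eps_uncond = model(x_t, t, cond=None)
    eps_cond = model(x_t, t, cond=cond)
    # Higher guidance_scale pushes the sample harder toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```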
🍞 Hook: Long movies need smooth continuity. 🥬 Temporal Multidiffusion: A way to handle longer videos consistently by stitching overlapping windows with coherence. How it works: (1) Process chunks with overlap. (2) Blend to keep motion and effect continuity. Why it matters: Without this, long sequences can drift or show seams. 🍞 Anchor: A wake behind a boat stays stable across a full minute clip.
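A simplified sketch of stitching overlapping windows: real temporal multidiffusion blends at every denoising step, while this toy version just averages finished window latents. Window scheduling and uniform weights are assumptions:

```python
import torch

def blend_windows(window_outputs, window_starts, total_frames: int, window_len: int):
    """Average overlapping per-window latents into one long, seam-free sequence.

    window_outputs: list of tensors shaped (window_len, C, H, W)
    window_starts:  starting frame index of each window
    """
    per_frame_shape = window_outputs[0].shape[1:]
    out = torch.zeros((total_frames, *per_frame_shape))
    weight = torch.zeros(total_frames)
    for start, win in zip(window_starts, window_outputs):
        out[start:start + window_len] += win        # accumulate overlapping frames
        weight[start:start + window_len] += 1.0     # count how many windows cover each frame
    weight = weight.clamp(min=1.0).view(-1, *([1] * len(per_frame_shape)))
    return out / weight                             # average where windows overlap
```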
Secret Sauce (why Over++ feels smart):
- Paired-without/with-effects training shows the model the exact “difference layer” to generate.
- Tri-mask training makes the same model useful with or without annotations.
- Unpaired T2V augmentation keeps the language muscle strong for prompt edits.
- Pass-through latents preserve original identity and reduce drift.
- Simple, robust mask pruning teaches the model to trust coarse, human-drawn masks.
04 Experiments & Results
The Test: The team measured two big things—(1) How realistic and prompt-aligned the added effects look, and (2) How well the original foreground and background are preserved. They used standard frame and video metrics (SSIM, PSNR, LPIPS, FVD, VMAF, VBench) and prompt/image similarity scores. Because effects can be subtle, they also introduced a direction-aware score to better capture the intended change.
🍞 Hook: Think of grading not the whole essay again, but whether the edits you made improved it in the right direction. 🥬 CLIP_dir Metric: It checks if the change from “no effect” to “with effect” points in the same direction as the ground-truth change. How it works: (1) Embed frames with CLIP. (2) Compute the vector from I_over to I (your result) and from I_over to I_gt (ground truth). (3) Measure how aligned those vectors are. Why it matters: Plain similarity can say “these two look close” and miss the specific effect you added, while direction checks if you added the right kind of change. 🍞 Anchor: If ground truth adds a shadow, your result should move toward “shadowy” in embedding space, not just “slightly similar overall.”
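The core of the direction-aware idea fits in a few lines. This is a sketch, assuming you already have CLIP image embeddings for the effect-free composite, your result, and the ground truth:

```python
import torch
import torch.nn.functional as F

def clip_dir(emb_over: torch.Tensor, emb_result: torch.Tensor, emb_gt: torch.Tensor) -> torch.Tensor:
    """Direction-aware score (a sketch of the idea behind CLIP_dir).

    Each argument is a CLIP image embedding of the corresponding frame:
    emb_over   -> effect-free composite I_over
    emb_result -> your edited frame I
    emb_gt     -> ground-truth frame I_gt with the real effect
    Returns cosine similarity between the two 'edit directions'.
    """
    d_result = emb_result - emb_over   # the change your model actually made
    d_gt = emb_gt - emb_over           # the change the ground truth made
    return F.cosine_similarity(d_result, d_gt, dim=-1)
```

A high score means your edit moved the frame in the same direction as the real effect (e.g., toward "shadowy"), even if the overall frames were already quite similar.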
The Competition: Over++ was compared to AnyV2V (tuning-free editing), VACE (all-in-one with masked inpainting), LoRA-Edit (first-frame guided tuning), and a commercial editor (Runway Aleph). Some baselines can’t use masks; some require edited first frames.
The Scoreboard (with context): Across 24 test videos, Over++ delivered state-of-the-art or second-best results on most metrics while best preserving input identity and motion. In user studies (with both VFX pros and non-experts), participants strongly preferred Over++ for matching the text prompts, obeying masks, and preserving the original video. In plain terms, Over++ often scored like getting an A on both “follow the instructions” and “don’t change what I didn’t ask you to change,” while others got more like B’s or C’s on one of those.
Surprising Findings:
- Even coarse, hand-drawn masks worked well—imperfections in training masks acted like natural augmentation, making the model robust to scribbles.
- Adding unpaired T2V augmentation noticeably improved text editability; remove it, and the model’s prompt-following weakens.
- Standard CLIP similarity sometimes underrated correct edits; the direction-aware score helped reveal real improvements.
Examples in action:
- No-mask mode: Over++ added wakes, shadows, or reflections while keeping subjects and backgrounds unchanged, unlike some baselines that altered identities or layouts.
- Mask-guided mode: Over++ respected the masked area better and added physically plausible effects; baselines sometimes leaked or ignored subtle transparencies.
- Prompt tweaks: Changing “soft shadow” to “harsh shadow,” or “mild wake” to “turbulent wake,” produced clear, controllable differences.
- Keyframe masks: A single annotated frame could boost splash intensity at the critical moment while leaving other frames light and natural.
05 Discussion & Limitations
Limitations:
- Not pixel-perfect: Because encoding/decoding uses a VAE, tiny details can shift; this is usually minor but visible on close inspection.
- Rare hallucinations: In tricky backgrounds, the model may invent faint dust or tone shifts, especially at very high guidance.
- Scope: Over++ focuses on effect generation, not full harmonization or relighting; those remain separate steps.
Required Resources:
- A modern GPU setup for fine-tuning/inference (the paper used multiple high-memory GPUs). For typical usage, a single strong GPU suffices, but long videos and high resolution benefit from more memory.
- Optional masks and prompts; the model also works mask-free or with only keyframe masks.
- Access to the trained Over++ weights and preprocessing scripts (e.g., for Omnimatte-based pairing, if you plan to extend training).
When NOT to Use:
- If you need exact pixel identity (e.g., forensic-level preservation), pure compositing without generative steps might be safer.
- If the scene requires heavy relighting or color harmonization beyond effect generation, consider dedicated harmonization tools first.
- If the effect is fully opaque, rigid, or requires strict physics simulation (e.g., complex fluid-structure interaction for engineering), a physics simulator may be preferable.
Open Questions:
- Can we integrate harmonization/relighting so shadows and reflections always match tricky lighting?
- Can stronger base video priors further reduce hallucinations and improve tough cases (e.g., glass, rain-on-glass, fog volumes)?
- Can we learn a richer “physics prior” for non-rigid media (smoke, dust, splashes) from fewer examples?
- Can we speed up long-sequence generation while keeping temporal coherence and fine control?
06 Conclusion & Future Work
Three-sentence summary: Over++ introduces augmented compositing, a focused way to add semi-transparent, physically grounded effects between video layers while keeping the rest of the scene intact. It fine-tunes a video diffusion model with paired data (with/without effects), robust masks, and unpaired text-to-video augmentation to preserve prompt-following skills. The result is a controllable, professional-friendly system that respects masks and prompts, generalizes to many scenes, and outperforms strong baselines in preserving input identity.
Main achievement: Turning the fuzzy, hand-crafted “in-between” effects of compositing into a learned, controllable generation task—so shadows, splashes, dust, and smoke can be added realistically without sacrificing the original footage.
Future directions: Fold in harmonization and relighting, leverage stronger base video priors, deepen the physics intuition for volumetric effects, and further streamline long-video stability and speed. Also, broader datasets with more effect diversity could improve rare corner cases like thin transparent glass reflections or mist on complex textures.
Why remember this: It’s a practical shift in how we think about video editing—rather than replacing content, we now reliably generate the invisible glue between layers, with simple controls (masks and text) that match real production workflows.
Practical Applications
- •Add realistic shadows under inserted objects for product shots or set extensions.
- •Create water splashes, wakes, and ripples when pasting a subject into lakes, puddles, or oceans.
- •Generate smoke, dust, or mist that interacts with moving subjects for action or sports edits.
- •Add reflections on wet floors, windows, or shiny tables when combining multiple takes.
- •Use keyframe masks to boost effect intensity only at critical story moments.
- •Quickly prototype director notes like “softer shadow” or “denser red smoke” via text prompts.
- •Enhance continuity across shots by keeping effects consistent without re-simulating physics.
- •Teach filmmaking by showing how environmental interactions sell a composite without heavy tools.
- •Speed up previz and animatics by adding plausible environmental cues early.
- •Repair scenes by subtly adding missing interaction cues (e.g., faint ground contact shadows).