InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Intermediate
Hoiyeong Jin, Hyojin Jang, Jeongho Kim et al. · 12/19/2025
arXiv · PDF

Key Summary

  • InsertAnywhere is a two-stage system that lets you add a new object into any video so it looks like it was always there.
  • First, it builds a 4D understanding of the scene (3D space plus time) to figure out where the object should be and how it should move and be hidden by other things.
  • Then, a diffusion-based video generator inserts the object and also adjusts nearby lighting and shadows so everything looks natural.
  • Users only place and size the object in the first frame; the system carries that decision through the whole video consistently.
  • A new dataset called ROSE++ teaches the model to handle illumination and shadow changes caused by the inserted object.
  • The method uses scene flow to make the object ride along with moving surfaces (like an apple on a moving plate).
  • On a new benchmark (VOIBench), InsertAnywhere beats strong commercial tools (Pika-Pro and Kling) on object fidelity, video quality, and multi-view consistency.
  • A first-frame, high-quality inpainting step serves as a visual anchor that keeps the object’s look consistent across the video.
  • The approach handles tricky occlusions (things passing in front of the new object) far more reliably than prior methods.
  • This makes video object insertion practical for real uses like virtual product placement and film post-production.

Why This Research Matters

InsertAnywhere makes it practical to add products, props, or signs into finished videos without reshoots, saving time and money for creators and studios. By respecting 4D geometry and lighting, the inserted items look believable even when people move in front of them or when sunlight changes. This enables high-quality virtual product placement, safer content revisions, and rapid creative experimentation. It also helps small teams achieve film-level edits using consumer hardware with efficient fine-tuning. Better training data (ROSE++) teaches the system to handle shadows and brightness realistically, which viewers subconsciously demand. Overall, it turns a delicate visual trick into a reliable, controllable tool for real-world video production.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re making a school play video and want to add a friendly robot into the scenes after filming. You don’t want the robot to float weirdly, pass through people, or glow like it’s from a different planet—you want it to fit right in.

🥬 The Concept (Video Object Insertion, VOI): VOI is putting a new object into an existing video so it looks real—right place, right size, right lighting, and right timing. How it works (big picture):

  1. Understand the 3D world of the video over time (that’s 4D: 3D + time).
  2. Decide where and how big the new object should be.
  3. Render the object so it matches camera movement, shadows, and what blocks what.

Why it matters: Without VOI, creators must reshoot or do messy edits; the new object would slide, flicker, or clash with lighting, breaking the illusion.

🍞 Anchor: In a cooking video, you add a pepper shaker to the counter. If done right, it stays on the counter, hides behind a chef’s hand when needed, and casts the right shadow.

🍞 Hook: You know how in a flipbook, each page shows the scene from slightly different views over time? If you draw a ball in the wrong place on even a few pages, the animation looks broken.

🥬 The Concept (Temporal Coherence): Temporal coherence means the inserted object looks stable and consistent across frames. How it works:

  1. Track where the camera and scene move.
  2. Keep the object’s position and appearance aligned in every frame.
  3. Keep the object looking right after events like occlusions (when something passes in front).

Why it matters: Without it, the object jitters or teleports, and viewers instantly spot the fake.

🍞 Anchor: A mug placed on a moving cart should ride smoothly with the cart in every frame—not lag behind or slide off.

🍞 Hook: Think of wearing sunglasses indoors—everything looks wrong because the light isn’t right. Videos feel the same when lighting doesn’t match.

🥬 The Concept (Illumination and Shadows): Illumination is how light brightens or darkens surfaces; shadows are the darker shapes where light is blocked. How it works:

  1. Estimate scene lighting direction and strength.
  2. Adjust the inserted object’s brightness and shading.
  3. Add soft shadows and reflections around it.

Why it matters: Without this, the object looks pasted on—too bright, too flat, or casting no shadow.

🍞 Anchor: Add a paper bag near a window: it should look brighter when the sun shines in and dimmer when the door closes.

🍞 Hook: Imagine you place a sticker on the first page of a book, and someone perfectly flips and repositions it on all later pages without you doing anything.

🥬 The Concept (User-Specified Placement): The user chooses the object’s spot, size, and pose in the first frame; the system carries that forward. How it works:

  1. User marks the initial placement.
  2. The scene’s 4D geometry guides where the object should appear next.
  3. The system auto-generates per-frame masks and positions.

Why it matters: Without this, you’d have to hand-edit every frame—too slow and error-prone.

🍞 Anchor: You drag a toy car onto a shelf in frame 1; it stays attached to the shelf as the camera moves.

🍞 Hook: Picture building a Lego diorama and then filming it while walking around it—your camera moves, but the blocks stay put in 3D.

🥬 The Concept (4D Scene Geometry): 4D scene geometry means understanding how the 3D scene looks and moves across time. How it works:

  1. Estimate depth for each frame (how far things are).
  2. Recover camera motion (where the camera is looking from).
  3. Track scene motion so objects stay consistent over time.

Why it matters: Without 4D geometry, the object’s size, angle, and occlusions won’t match the scene.

🍞 Anchor: A vase inserted on a table should grow larger correctly as the camera walks closer, and vanish behind a chair when the chair blocks it.
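
The geometric core of this idea is the pinhole camera model: given a pixel, its depth, and the camera intrinsics, you can recover the 3D point that pixel came from. Here is a minimal, self-contained sketch of that back-projection; the variable names and the example intrinsics are illustrative, not values from the paper.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth into 3D camera coordinates
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Toy example: a pixel near the image center, 2.5 m away,
# with made-up intrinsics for a 640x480 camera.
point_3d = backproject(u=350, v=250, depth=2.5,
                       fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(point_3d)  # -> [x, y, 2.5] in camera coordinates (meters)
```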

🍞 Hook: Using a cookie cutter helps you shape dough precisely. In videos, masks are like cookie cutters for pixels.

🥬 The Concept (Object Mask): An object mask is a per-frame stencil that marks where the new object belongs. How it works:

  1. Create a silhouette of the object in each frame.
  2. Update it as the camera and scene move.
  3. Use it to guide where the video model should edit and where to leave things untouched.

Why it matters: Without masks, edits bleed into the background or miss occlusions.

🍞 Anchor: The pepper shaker’s mask changes shape slightly as the view changes—thin from the side, wider from the front—and gets trimmed when a hand passes in front.

The world before: Early image-level insertion looked convincing in single photos, but extending this to videos broke down because the object’s placement, size, and lighting changed across frames. Some video tools edited only inside a user mask, so they couldn’t fix nearby lighting or shadows. Others tried to regenerate big regions, which reduced user control and sometimes harmed the original background.

The problem: We need both precision and realism. Precision means the user can choose exactly where and how big the object is. Realism means proper 3D alignment, occlusions, motion, and lighting—frame after frame.

Failed attempts: Pure inpainting within a small mask often missed shadows and reflections outside the mask. Fully generative methods without masks often misplaced objects and altered unrelated parts of the scene. Methods that didn’t model visibility over time broke when the inserted object got partially hidden and then reappeared.

The gap: A bridge between strong 4D scene understanding (for geometry and occlusion) and modern video diffusion (for photorealistic synthesis), plus training data that explicitly teaches lighting changes caused by the new object.

Real stakes: This matters for commercials (placing products into existing footage), film post-production (adding props without reshoots), social media edits, and AR previews—saving time, cost, and keeping creative control in the user’s hands.

02Core Idea

The “aha!” in one sentence: Combine a 4D-aware mask that knows the scene’s geometry and occlusions with a diffusion-based video generator that not only inserts the object but also edits nearby lighting and shading—anchored by a perfect first frame and trained with an illumination-aware dataset.

Three analogies:

  1. Theater Stage Manager: The 4D module is the stage map (who stands where, who blocks whom). The diffusion model is the lighting crew and costume designer, making sure colors and shadows match the scene.
  2. Sticker-on-Flipbook: The mask tells you exactly where the sticker goes on every page; the diffusion model paints in the sticker and the lighting around it so it doesn’t look pasted.
  3. GPS + Camera Filter: The 4D system is the GPS keeping you on the right route through 3D space-time; the diffusion model is the smart filter adjusting brightness and shadows to match the environment.

Before vs. After:

  • Before: Users had to choose between control (but flat lighting and bad occlusions) or realism (but risky changes to the whole video). Objects often jittered, scaled wrong, or ignored nearby shadows and reflections.
  • After: You place the object once; the system produces a geometrically correct, temporally stable mask that respects occlusions and camera motion, and the diffusion model inserts the object plus its soft shadows and lighting variations.

Why it works (intuition):

  • Geometry pins things down. When the system knows where surfaces are and how the camera moves, it can keep the object’s size and angle correct and decide when it should be hidden.
  • A first-frame anchor locks in the object’s look, so later frames copy the right texture, color, and materials.
  • Training with ROSE++ shows the model many examples where an inserted object should brighten, darken, or cast shadows beyond the mask, so it learns to make those adjustments automatically.

Building blocks (with mini sandwich intros):

  • 🍞 Hook: You know how you mark a spot on a map and then follow roads to stay aligned? 🥬 The Concept (4D-Aware Mask Generation): It builds a per-frame stencil for the object that respects 3D geometry and occlusions. How: reconstruct 4D scene → place the object once → propagate with scene flow → reproject to each frame. Why: Without it, the object drifts or ignores what blocks it. 🍞 Anchor: A mug rides a moving cart and disappears behind a person at the right times.
  • 🍞 Hook: Imagine taking a perfect passport photo before making a slideshow. 🥬 The Concept (First-Frame Anchor): Generate a high-fidelity first frame via strong image inpainting to set the object’s exact appearance. How: edit frame 1 with an image model → feed it to the video model to guide later frames. Why: Without it, textures and colors wander. 🍞 Anchor: The drawer’s wood grain stays consistent from start to end.
  • 🍞 Hook: Think of practicing piano with a teacher who points out subtle rhythm issues you’d miss alone. 🥬 The Concept (ROSE++ Illumination-Aware Training): A dataset where each sample includes object-removed video, object-present video, object masks, and a reference object image generated by a VLM. How: invert an object-removal dataset, add VLM-generated references, and train supervision to match lighting/shadows. Why: Without it, the model can’t learn to add realistic soft shadows. 🍞 Anchor: A bottle on a table darkens the tablecloth beneath it and slightly brightens one side from a lamp.
  • 🍞 Hook: Imagine a lightweight add-on that lets your bike carry cargo without replacing the whole frame. 🥬 The Concept (LoRA Fine-Tuning): A small adapter module that fine-tunes the big diffusion model for insertion tasks efficiently. How: attach low-rank adapters and train briefly on ROSE++. Why: Without it, you’d need huge, slow retraining. 🍞 Anchor: After LoRA, the inserted bag in a room brightens when the door opens to sunlight.

Together, these pieces turn a delicate editing trick into a repeatable, production-ready process.

03Methodology

High-level pipeline: Input video + user placement (first frame) + reference object image → 4D scene reconstruction → scene-flow-based propagation and reprojection → geometry-aware mask sequence → first-frame inpainting anchor → diffusion-based video generation (LoRA-tuned) that inserts the object and nearby lighting/shadow effects → output video.

Step 1: 4D Scene Reconstruction

🍞 Hook: Imagine rebuilding a tiny stage set so you can walk around it and know where every prop stands.

🥬 The Concept: Turn the 2D video into a 4D scene (3D + time) using off-the-shelf predictors. How it works:

  1. Depth estimation predicts how far each pixel is from the camera.
  2. Camera pose recovery figures out how the camera moved each frame.
  3. Optical flow tracks how pixels shift between frames.
  4. Segmentation helps separate objects and regions. These cues are combined (as in Uni4D-style orchestration) into a coherent 4D scene.

Why it matters: Without this, the system can’t tell whether the object should get bigger as the camera approaches or whether it should hide behind furniture.

🍞 Anchor: As the camera walks past a table, the table edges line up correctly across frames.
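
To turn those per-frame estimates into a single 4D scene, each frame’s depth map is lifted to 3D and then moved into a shared world coordinate system with that frame’s camera pose. The sketch below shows the idea with NumPy only; the depth maps, intrinsics, and camera-to-world matrices are placeholders standing in for whatever depth and pose predictors are used.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world):
    """Back-project a full depth map (H x W) into 3D and transform the
    points into world coordinates using a 4x4 camera-to-world matrix."""
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]   # apply this frame's pose
    return pts_world                                   # (H*W, 3)

# One world-space point cloud per frame = a simple 4D (3D + time) scene.
K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])
frames_depth = [np.full((480, 640), 2.5), np.full((480, 640), 2.4)]  # dummy depths
frames_pose = [np.eye(4), np.eye(4)]                                  # dummy poses
scene_4d = [depth_to_world_points(d, K, T)
            for d, T in zip(frames_depth, frames_pose)]
```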

Step 2: User-Controlled Placement

🍞 Hook: You drop a pin on a map and set the zoom—now the system knows where and how big.

🥬 The Concept: Convert the reference object image into a simple 3D form (point cloud), then rotate, scale, and translate it into the reconstructed scene with an interactive GUI. How it works:

  1. Single-view 3D reconstruction makes a point cloud from the object image.
  2. User adjusts rotation (pose), translation (position), and scale.
  3. The object is now aligned to the scene’s coordinates.

Why it matters: Without precise placement, even a beautiful render will look wrong because it’s in the wrong spot or size.

🍞 Anchor: You align a toy car so its wheels sit exactly on a shelf board.
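
Under the hood, the GUI placement boils down to a similarity transform: scale, rotate, and translate the object’s point cloud into the reconstructed scene’s coordinates. A minimal sketch with an illustrative rotation about the vertical axis; the actual interactive tool and single-view 3D reconstructor are not shown.

```python
import numpy as np

def place_object(points, scale, rotation, translation):
    """Apply scale, then a 3x3 rotation, then a translation to an (N, 3)
    object point cloud so it lands in scene/world coordinates."""
    return (rotation @ (scale * points).T).T + translation

def rot_y(angle_rad):
    """Rotation about the vertical (y) axis, as a user might set in a GUI."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

object_pts = np.random.rand(1000, 3) - 0.5            # stand-in for the object's point cloud
placed_pts = place_object(object_pts,
                          scale=0.2,                   # shrink to a plausible real-world size
                          rotation=rot_y(np.pi / 6),   # turn 30 degrees
                          translation=np.array([0.4, 0.0, 2.0]))  # set it down in the scene
```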

Step 3: Scene-Flow-Based Propagation

🍞 Hook: If your toy sits on a rolling cart, it should roll with the cart without you pushing it.

🥬 The Concept (Scene Flow): Estimate local 3D motion so the inserted object moves naturally with nearby surfaces. How it works:

  1. Compute dense optical flow between frames (e.g., SEA-RAFT).
  2. Find scene points near the inserted object and map their 2D flows back to 3D motion vectors.
  3. Average these vectors to update the object’s position frame by frame.

Why it matters: Without scene flow, objects look glued to the air and don’t ride along with moving supports.

🍞 Anchor: An apple on a lifted plate moves up with the plate in sync.
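
The propagation step can be pictured as: look at how the 3D scene points near the object move between two frames, and move the object by their average displacement. The sketch below assumes you already have world-space positions of the same tracked scene points in consecutive frames (e.g., from optical-flow correspondences plus depth); the names and the neighborhood radius are illustrative.

```python
import numpy as np

def propagate_object(object_pts, scene_pts_t, scene_pts_t1, radius=0.15):
    """Move the object's points by the average 3D displacement of nearby
    scene points between frame t and frame t+1.

    scene_pts_t, scene_pts_t1: (N, 3) world-space positions of the SAME
    tracked scene points in consecutive frames (flow correspondences).
    """
    center = object_pts.mean(axis=0)
    dist = np.linalg.norm(scene_pts_t - center, axis=1)
    nearby = dist < radius                  # support surface around the object
    if not np.any(nearby):
        return object_pts                   # nothing moves nearby: keep the object still
    motion = (scene_pts_t1[nearby] - scene_pts_t[nearby]).mean(axis=0)
    return object_pts + motion              # ride along with the local scene motion
```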

Step 4: Camera-Aligned Reprojection

🍞 Hook: Think of shining a projector through a 3D model to paint its silhouette on each frame.

🥬 The Concept (Reprojection): Project the object’s 3D points into each frame using the camera intrinsics and extrinsics, then rasterize a clean silhouette. How it works:

  1. Use camera matrices to map 3D points to 2D pixels per frame.
  2. Keep only visible points (respect occlusion ordering).
  3. Rasterize to get the object’s shape per frame.

Why it matters: Without correct reprojection, the mask won’t match perspective or occlusions.

🍞 Anchor: The pepper shaker looks thinner from a side view and wider from the front, just like the real one.
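
Reprojection is the reverse of back-projection: world points go through the camera extrinsics and intrinsics to pixel coordinates, and a depth comparison against the scene decides which object points are actually visible. A minimal NumPy sketch (the depth-tolerance threshold is an illustrative choice) that produces a per-frame silhouette:

```python
import numpy as np

def project_with_occlusion(object_pts, world_to_cam, K, scene_depth, eps=0.03):
    """Project (N, 3) world-space object points into one frame and keep only
    the points not hidden behind the scene (object depth <= scene depth)."""
    H, W = scene_depth.shape
    homo = np.hstack([object_pts, np.ones((len(object_pts), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]                  # into camera coordinates
    z = cam[:, 2]
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    mask = np.zeros((H, W), dtype=bool)
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Visible only if the object point is not behind the scene surface at that pixel.
    visible = ok.copy()
    visible[ok] &= z[ok] <= scene_depth[v[ok], u[ok]] + eps
    mask[v[visible], u[visible]] = True
    return mask                                             # sparse silhouette of visible points
```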

Step 5: Geometry-Aware Mask Extraction

🍞 Hook: A cookie cutter per frame ensures precision, even when hands pass in front.

🥬 The Concept: Use a video-capable segmenter (e.g., SAM2) on the synthesized object-inserted sequence to get temporally consistent binary masks. How it works:

  1. Feed the synthetic sequence to SAM2.
  2. Extract per-frame masks of the object region.
  3. Achieve temporally aligned, occlusion-aware masks.

Why it matters: Without consistent masks, later editing will leak or miss occlusions.

🍞 Anchor: The mask trims correctly when a sleeve crosses in front of the object.
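
The paper gets its clean, temporally consistent masks by running SAM2 on the synthesized object-inserted frames. As a simple stand-in that conveys the goal (not the authors’ segmenter), the sketch below just turns a sparse projected silhouette into a solid binary mask with basic morphology from SciPy.

```python
import numpy as np
from scipy import ndimage

def clean_mask(sparse_silhouette, close_iters=3):
    """Turn a sparse point-splat silhouette (bool H x W) into a solid,
    hole-free binary mask. A simple stand-in for a learned segmenter."""
    closed = ndimage.binary_closing(sparse_silhouette, iterations=close_iters)
    filled = ndimage.binary_fill_holes(closed)
    return filled.astype(np.uint8)

# Applied per frame, this yields a temporally aligned mask sequence, e.g.:
# mask_seq = [clean_mask(silhouette) for silhouette in projected_silhouettes]
```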

Step 6: First-Frame Inpainting Anchor

🍞 Hook: Snap the perfect cover photo before making the video slideshow so every page knows what to copy.

🥬 The Concept: Use a powerful image inpainting model on frame 1 to render the inserted object with high fidelity. How it works:

  1. Apply image inpainting within the mask on the first frame.
  2. Lock in textures, colors, and material cues.
  3. Pass this frame as a strong visual guide to the video model.

Why it matters: Without an anchor, the object’s appearance can drift.

🍞 Anchor: The drawer’s wood color and handle shape stay consistent through the clip.
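
One way to picture this step: run an off-the-shelf image inpainting pipeline inside the first-frame mask to render the object once, then hand that frame to the video model as the appearance anchor. The sketch below uses the Hugging Face diffusers Stable Diffusion 2 inpainting pipeline purely as an illustrative stand-in; the paper’s actual inpainting model, prompt, and file names are assumptions here.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Illustrative stand-in for the paper's high-quality image inpainting model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

frame0 = Image.open("frame_0000.png").convert("RGB").resize((512, 512))
mask0 = Image.open("object_mask_0000.png").convert("L").resize((512, 512))  # white = edit region

anchor_frame = pipe(
    prompt="a wooden pepper shaker standing on the kitchen counter",  # hypothetical prompt
    image=frame0,
    mask_image=mask0,
).images[0]
anchor_frame.save("anchor_frame.png")  # fed to the video model as the visual anchor
```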

Step 7: Diffusion-Based Video Generation with LoRA Fine-Tuning

🍞 Hook: Think of a skilled painter who first studies a reference photo, then paints a whole scene with consistent lighting.

🥬 The Concept: A video diffusion model (e.g., Wan2.1-VACE-14B) is fine-tuned with small LoRA adapters on ROSE++ to learn insertion plus local lighting fixes. How it works:

  1. Input: source video, geometry-aware mask sequence, reference object image, and the first-frame anchor.
  2. The diffusion model synthesizes frames that keep background intact while inserting the object and adjusting nearby lighting/shadows—even outside the mask when needed.
  3. LoRA enables efficient adaptation (few hours, single GPU) without retraining the whole model.

Why it matters: Standard inpainting edits only inside masks and misses soft shadows; mask-free editing loses user control. This hybrid keeps control and realism.

🍞 Anchor: The bag brightens when the door opens, and a soft shadow falls on the floor beyond the mask.
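
LoRA’s core trick is easy to show in code: freeze a big pretrained weight matrix and learn only a small low-rank update alongside it. The PyTorch sketch below is a generic LoRA linear layer, not the authors’ training code; the rank, scaling, and which layers get adapters are choices the paper makes that are not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x), with A of shape (r, in) and B of shape (out, r)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep the big model frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Example: adapt one projection of a (hypothetical) diffusion transformer block.
proj = nn.Linear(1024, 1024)          # stands in for a pretrained weight
adapted = LoRALinear(proj, r=16)
out = adapted(torch.randn(2, 77, 1024))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(trainable)                      # only the small A and B matrices train
```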

Training Data: ROSE++

🍞 Hook: Practice with feedback that includes both the right notes and the right rhythm.

🥬 The Concept: ROSE++ turns an object-removal dataset into insertion training samples: object-removed video (input), object-present video (target), object masks, and a clean reference image generated by a VLM. How it works:

  1. Start from ROSE (has pairs and masks with side effects like shadows removed).
  2. Generate white-background object images via a VLM using multi-view crops and select with DINO similarity.
  3. Train the model to re-insert the object and surrounding illumination effects.

Why it matters: Without this, the model can’t learn lighting and shadow behavior tied to insertion.

🍞 Anchor: The system learns to darken the table under a bottle and slightly tint it based on light color.
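
One concrete piece of the ROSE++ construction is picking, among several VLM-generated white-background candidates, the reference image that best matches the object as it appears in the video. A hedged sketch of that selection using the public DINO ViT from torch.hub and cosine similarity; the exact backbone, crops, thresholds, and file names used by the authors may differ.

```python
import torch
from PIL import Image
from torchvision import transforms

# Public DINO ViT-S/16 backbone as an illustrative feature extractor.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feat = dino(img)                              # (1, 384) CLS embedding
    return torch.nn.functional.normalize(feat, dim=-1)

# Multi-view crops of the object from the video vs. VLM-generated candidates.
crop_feats = torch.cat([embed(p) for p in ["crop_0.png", "crop_1.png"]])
candidates = ["ref_candidate_0.png", "ref_candidate_1.png", "ref_candidate_2.png"]
scores = [float((embed(c) @ crop_feats.T).mean()) for c in candidates]
best_reference = candidates[int(torch.tensor(scores).argmax())]
```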

Secret sauce:

  • Pin it with 4D geometry (placement + occlusion), guide it with a pristine first frame, and teach it illumination with ROSE++ so the diffusion model can do both object and nearby light edits reliably.

04Experiments & Results

🍞 Hook: When you try out for a team, you don’t just run fast—you also show ball control, teamwork, and consistency. A good video insertion model needs to ace several skills, not just one.

🥬 The Test: The authors evaluate how well the inserted object matches the reference (subject consistency), how good the whole video looks (video quality), and how stable things stay across changing views and occlusions (multi-view consistency). How it works:

  1. Subject Consistency (CLIP-I, DINO-I): Check how similar the inserted object looks to the reference image over multiple frames, focusing on the masked object region.
  2. Video Quality (VBench metrics): Assess image quality, background/subject consistency, and motion smoothness for the whole clip.
  3. Multi-View Consistency: See if the object stays correct as the viewpoint shifts and occlusions happen.

Why it matters: Without these checks, a model might get one thing right (like sharpness) but fail on realism (like shadows or occlusions).

🍞 Anchor: It’s like scoring not just for speed, but also passing accuracy and endurance.
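
CLIP-I (and DINO-I, with a different encoder) boils down to: embed the inserted-object region and the reference image with the same vision model and average the cosine similarity over frames. A minimal sketch with the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the exact cropping, checkpoint, and file names behind the benchmark numbers may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_embed(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

reference = Image.open("reference_object.png").convert("RGB")
ref_feat = clip_embed(reference)

# Masked crops of the inserted object, one per evaluated frame (illustrative paths).
frame_crops = [Image.open(f"inserted_crop_{i:04d}.png").convert("RGB") for i in range(8)]
sims = [float(ref_feat @ clip_embed(c).T) for c in frame_crops]
clip_i = sum(sims) / len(sims)
print(f"CLIP-I: {clip_i:.4f}")   # higher = inserted object matches the reference better
```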

🍞 Hook: You know how teachers sometimes compare your project not just to the average, but also to the best in class?

🥬 The Competition: The method is compared against two strong commercial tools—Pika-Pro and Kling—on a new benchmark called VOIBench (50 videos, 100 insertions across varied scenes). How it works:

  1. Evaluate all methods on the same videos and reference objects.
  2. Compute metrics fairly for masked regions (subject) and full clips (quality, consistency).
  3. Include challenging cases like occlusions and moving cameras.

Why it matters: Without tough comparisons, claims of realism aren’t convincing.

🍞 Anchor: Inserting a pepper shaker that moves behind and in front of a hand—hard for models that don’t understand occlusions.

Results with context:

  • Subject Consistency: InsertAnywhere achieves CLIP-I 0.8122 and DINO-I 0.5678, comfortably ahead of Pika-Pro (0.4940, 0.3856) and Kling (0.6349, 0.5028). Think of this as getting an A+ while others get B– to B+ on matching the object’s identity.
  • Video Quality (VBench): Background Consistency 0.9429, Subject Consistency 0.9520, Motion Smoothness 0.9916, Imaging Quality 0.7101—either the best or tied at the top. This means the edited video keeps the original scene intact, moves smoothly, and looks sharp.
  • Multi-View Consistency: 0.5857 vs. 0.5439 (Kling) and 0.5123 (Pika-Pro). That’s like the object staying on-beat even when the music (camera/view) changes.

Surprising/Notable findings:

  • Occlusions handled right: Prior methods often had the new object float in front when it should be hidden, or disappear incorrectly. The 4D-aware mask fixed this, making the object appear and vanish at the right times.
  • First-frame anchor matters: Generating a high-fidelity first frame significantly boosted subject consistency (textures, colors, shapes remained stable afterward).
  • Lighting learned after LoRA on ROSE++: Before fine-tuning, the object’s brightness didn’t react to doors opening/closing; after fine-tuning, it brightened/dimmed realistically, and shadows extended beyond the mask where appropriate.
  • VLM reference helps: Using VLM-generated references (instead of random crops) avoided copy-paste artifacts and improved multi-view consistency (0.5857 vs. 0.5295 in an ablation), because the model learned from clean, consistent object images like those used at inference.

Ablations (what each part contributes):

  • Geometry-aware mask vs. naive mask: Preserves real occlusions and improves VBench scores; without it, arms/scarves misalign during overlaps.
  • Add first-frame inpainting: Big jump in CLIP-I/DINO-I (object identity) thanks to a crisp anchor.
  • Add ROSE++ LoRA: Better illumination/shadow realism and overall video quality.

Bottom line: InsertAnywhere doesn’t just paste an object in—it understands where it belongs in 4D and how light should treat it, which is why it beats commercial baselines across multiple, meaningful tests.

05Discussion & Limitations

Limitations:

  • Reliance on initial user placement: If the first-frame position/scale/pose is poorly set, the system will faithfully propagate a bad choice. A quick GUI helps, but careful first placement still matters.
  • Quality of 4D reconstruction: Errors in depth, camera pose, or flow can misplace masks and occlusions. Very fast motion, motion blur, or low light can reduce robustness.
  • Static-object assumption: The method targets inserting static objects (not characters that deform or articulate). Moving deformable subjects require extra modeling.
  • Training domain: ROSE++ is synthetic and illumination-aware, but extreme real-world lighting or reflective/transparent materials can still be challenging.

Required resources:

  • A capable video diffusion backbone (e.g., Wan2.1-VACE-14B) and GPU for LoRA fine-tuning (reported single H200, ~40 hours). At inference, 50 denoising steps balance smoothness and detail.
  • Off-the-shelf vision models (depth, optical flow, segmentation) for 4D reconstruction; SAM2 for masks; VLM for reference generation (during dataset building or as needed).

When not to use:

  • If you need to insert a highly deformable/moving subject (e.g., a dancing person) without modeling its motion.
  • If the video is extremely noisy, very dark, or heavily compressed, making 4D reconstruction unreliable.
  • If you cannot provide a reasonable first-frame placement.

Open questions:

  • Dynamic, articulated insertions: How to extend from static objects to moving, deformable ones while preserving control and realism?
  • Real-light estimation: Can the model directly estimate and edit global illumination to handle high-gloss reflections and complex interreflections?
  • Fewer external modules: Can end-to-end training reduce reliance on multiple off-the-shelf components while keeping geometry fidelity?
  • Speed-ups: How to make 4D reconstruction and diffusion synthesis faster for real-time or interactive editing?
  • Broader domain generalization: How to better handle extreme lighting, transparent objects, water, or highly reflective metals without artifacts?

06Conclusion & Future Work

Three-sentence summary: InsertAnywhere combines a 4D-aware mask generator with a diffusion-based video model to insert objects into videos in a way that respects geometry, occlusions, and lighting. A high-quality first-frame anchor locks in the object’s look, and training on the illumination-aware ROSE++ dataset teaches the model to render shadows and brightness changes naturally. Across diverse tests, it surpasses strong commercial tools on identity fidelity, video quality, and multi-view consistency.

Main achievement: Bridging precise 4D scene understanding with learned photorealistic synthesis so the system not only knows where the object should be, but also how the light and shadows around it should change.

Future directions:

  • Extend from static to dynamic, deformable insertions with the same level of control and realism.
  • Improve light estimation and material modeling for tough cases like glass, chrome, and water.
  • Streamline components for faster, more end-to-end training and inference.

Why remember this: It turns video object insertion from a fragile trick into a reliable tool—place it once, and the system does the rest, keeping geometry, occlusions, and lighting in sync so the result looks like it was always there.

Practical Applications

  • Virtual product placement in commercials without reshooting scenes.
  • Film and TV post-production to add props or set dressing consistently across shots.
  • Social media content edits to insert branded items or fun objects that feel real.
  • E-commerce demos that place products into lifestyle videos under varied lighting.
  • AR/VR previews by inserting furniture or appliances into home-tour videos.
  • Education and training videos that add safety signs or labels in the right places.
  • News and documentary corrections (e.g., adding blurred overlays) while preserving scene integrity.
  • Game trailers or machinima that integrate high-fidelity 3D props into captured footage.
  • Previsualization for directors to test different set layouts without rebuilding sets.
  • Cultural heritage restoration videos adding missing artifacts in context for interpretation.
#video object insertion#4D scene geometry#diffusion video generation#scene flow#occlusion handling#illumination-aware dataset#ROSE++#LoRA fine-tuning#first-frame inpainting#SAM2 segmentation#optical flow#reprojection#multi-view consistency#VBench metrics#virtual product placement
Version: 1