
PISCO: Precise Video Instance Insertion with Sparse Control

Beginner
Xiangbo Gao, Renjie Li, Xinghao Chen et al. · 2/9/2026
arXiv

Key Summary

  • PISCO is a video AI that lets you place a specific object into a real video exactly where and when you want, using just a few keyframes instead of editing every frame.
  • It keeps the original scene steady (no drifting backgrounds) while the new object moves smoothly with correct shadows, reflections, and occlusions.
  • Two new ideas make sparse control work: Variable-Information Guidance (VIG) teaches the model to listen to anything from one hint to many hints, and Distribution-Preserving Temporal Masking (DPTM) keeps the video’s time-patterns stable.
  • PISCO also uses depth cues and clever training tricks (amodal completion and relighting) so the inserted object fits the scene’s geometry and lighting.
  • A new benchmark, PISCO-Bench, fairly tests instance insertion with paired “clean background” and “with-object” videos and both reference-based and reference-free metrics.
  • Across many tests, PISCO beats strong inpainting, video-to-video editing, and agentic image-edit+I2V pipelines, especially when only sparse control is given.
  • Adding more keyframes steadily improves quality, showing PISCO scales naturally with more user guidance.
  • PISCO preserves the original video’s motion while generating realistic object motion, making it practical for professional post-production with minimal manual work.
  • It supports broader edits like repositioning, rescaling, speed changes, background tweaks, and even controlled simulations.
  • PISCO’s best model (14B) achieves top scores on FVD, LPIPS, SSIM, and VBench’s subject consistency, especially with first-and-last or five-frame control.

Why This Research Matters

PISCO makes professional-quality video edits possible with just a few user hints, saving hours or days of manual masking and keyframing. It preserves the original video’s motion and look, so creators can trust that only the intended object is changed. Correct shadows, reflections, and occlusions mean insertions feel real, not pasted on. This helps filmmakers, advertisers, and educators produce believable content on tighter budgets and timelines. It also supports safe simulation for training robots or self-driving systems by inserting controlled objects without altering the rest of the scene. As AI video moves from novelty to production tool, this kind of precise, low-effort control is essential.

Detailed Explanation


01Background & Problem Definition

You know how stickers can be added to a photo so they look like they were always there? Doing that in a video is way harder because everything moves, lights change, and objects block each other at different times. Filmmakers and creators want to add a new thing (like a car, mascot, or product) into an existing video, exactly where and when it should appear, without breaking the rest of the scene.

🍞 Hook: Imagine you have a school play video, and you want to add a cartoon dragon that walks across the stage at 10 seconds, waves at 12 seconds, and exits at 15 seconds—while the actors and lights keep behaving normally. 🥬 The Concept: Video instance insertion is adding a specific object into a real video at the right place and time, while keeping the rest of the video unchanged.

  • How it works (big picture): (1) You tell the system what the object looks like and where/when it appears. (2) The system makes the object move frame by frame, matching the scene’s motion and lighting. (3) It also keeps the original video’s motion and details intact.
  • Why it matters: Without careful insertion, the new object looks floaty, casts wrong shadows, or the original scene gets warped. 🍞 Anchor: Placing a puppy into a family picnic video so it trots behind the table, hides briefly behind a chair (occlusion), and casts a soft shadow that matches the afternoon sun.

The world before: Recent video AIs could make pretty clips from text prompts, but they were not great at doing exact, picky edits. Professionals had to do heavy prompt engineering and then cherry-pick the best result. Even then, getting precise control—like “this exact object, starting here, moving like this, ending there”—was tricky.

The problem: Precise video instance insertion demands five things at once: (1) instance-level control (exact when/where), (2) physically plausible motion over time, (3) correct scene effects (shadows, reflections, ripples), (4) perfect preservation of the original video’s motion and identity, and (5) low effort from the user (no drawing masks on every frame!).

Failed attempts:

  • Video inpainting can fill holes, but it usually needs a mask for every frame. That’s too much work if your object doesn’t exist yet.
  • Reference-guided video-to-video editing spreads style over time but struggles with exact placement and timing.
  • Image-edit + image-to-video pipelines often hallucinate or drift the background, losing the original scene’s precise motion.
  • 4D geometry-heavy methods are theoretically neat but slow, fragile, and struggle with unseen motions.

🍞 Hook: You know how you might give a friend just a few waypoints to find your house? They should still arrive, even if you don’t guide them turn-by-turn. 🥬 The Concept: Sparse keyframe control is telling the AI only a few important frames—like just the first frame, or first and last frames—and asking it to handle the rest.

  • How it works: (1) Provide 1–5 keyframes with the cutout object and a mask. (2) The AI figures out in-between poses and motion. (3) It blends the object into the scene across time.
  • Why it matters: Drawing per-frame masks is exhausting; sparse control is the practical way professionals want to work. 🍞 Anchor: Give the AI the first frame of a mascot waving from the left, and the last frame waving from the right—then it animates the walk in between.

The gap: We needed a system that (a) honors sparse control, (b) stays stable in time (no flickers), (c) adapts to scene geometry and lighting, and (d) preserves the original background perfectly.

Real stakes: For filmmakers, ad agencies, and creators, this means fewer reshoots, faster iterations, and believable effects on a budget. For educators, journalists, and simulation builders, it means controlled, ethical insertions that don’t rewrite reality outside the target object.

02Core Idea

🍞 Hook: Imagine drawing just a few comic panels, and your smart assistant fills in the whole animated scene while keeping the background exactly the same. 🥬 The Concept: The key insight is to teach a video diffusion model to follow a few strong hints (sparse keyframes) while protecting the original video’s motion and look.

  • How it works: (1) Feed the model the clean background video plus a few object cutouts with masks and depth. (2) Use training tricks so the model learns to handle missing frames without freaking out. (3) Add geometry and lighting cues so the object sits in the world correctly. (4) The model then generates the full edited video.
  • Why it matters: Without this, sparse hints cause glitches—like flickers, wrong colors, lost objects, or wobbly backgrounds. 🍞 Anchor: Provide a single frame of a skateboarder in a street video; the system animates them cruising down the road with matching shadows, while cars and people in the original video keep moving as they did.

Multiple analogies (three ways):

  1. GPS with waypoints: You mark start and end; the GPS fills in the route smoothly (sparse keyframes → full motion).
  2. Choreographer with beats: You clap on a few beats, and dancers fill the rest in rhythm (temporal stability, no drift).
  3. Paper doll with lighting: You place a cutout under a lamp; to look real, you adjust shadows and reflections (geometry- and light-aware insertion).

Before vs After:

  • Before: You needed dense masks or got wobbly edits; backgrounds often drifted; shadows and occlusions were off.
  • After: With PISCO, a handful of keyframes suffice; backgrounds remain faithful; motion, shadows, and occlusions are physically consistent.

Why it works (intuition):

  • The model learns to hear “how much” you’re telling it (Variable-Information Guidance), so it neither ignores nor overfits sparse hints.
  • It keeps the time-patterns of video data normal (Distribution-Preserving Temporal Masking), so the encoder doesn’t get confused by missing frames.
  • It reasons about who is in front or behind (depth) and how light should look (relighting), so the object blends naturally.

Building blocks (each with the sandwich pattern):

🍞 Hook: You know how sculptors start with a rough block and slowly reveal the statue? 🥬 The Concept: A video diffusion model starts from noisy video and denoises it step by step to create a realistic sequence.

  • How it works: (1) Add noise to the target video representation. (2) A neural net predicts and removes noise in steps. (3) Conditions (background, object, masks, depth) guide what to keep or change.
  • Why it matters: This stepwise process gives strong control to keep the background steady while inserting the new object. 🍞 Anchor: Starting from TV static and ending with a clean scene where a toy robot walks through a living room without moving the furniture.

🍞 Hook: Sometimes you only get a few puzzle pieces and must guess the picture. 🥬 The Concept: An availability mask marks which frames have real object input and which don’t.

  • How it works: (1) Create a timeline of 1s (frames with object hints) and 0s (frames without). (2) Feed it in so the model knows when to trust vs. infer. (3) Train with lots of such patterns.
  • Why it matters: The model won’t hallucinate or collapse when hints are sparse. 🍞 Anchor: First and last frames are 1s; everything in the middle is 0s—the model fills the middle.
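The 1s-and-0s timeline above can be sketched in a few lines. This is an illustrative sketch, not PISCO's actual implementation; the function name and signature are assumptions.

```python
import numpy as np

def availability_mask(num_frames, keyframe_indices):
    """Build a 0/1 timeline: 1 where a real object hint exists, 0 elsewhere.

    Illustrative sketch of the availability mask concept; not PISCO's code.
    """
    mask = np.zeros(num_frames, dtype=np.int64)
    mask[list(keyframe_indices)] = 1
    return mask

# First-and-last control over a 49-frame clip: only frames 0 and 48 are real.
A = availability_mask(49, [0, 48])
```

With first-and-last control, the model sees exactly two 1s and must infer the 47 frames in between.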

🍞 Hook: If a friend sometimes whispers and sometimes speaks loudly, you learn to listen differently. 🥬 The Concept: Variable-Information Guidance (VIG) teaches the model to handle many guidance levels—from a single hint to dense hints.

  • How it works: (1) Randomly hide/show object frames during training. (2) Let the model practice both sparse and dense cases. (3) It learns how strongly to rely on hints.
  • Why it matters: With VIG, one model works well for any number of keyframes. 🍞 Anchor: With only the first frame, it infers a plausible path; with first and last, it nails the route.

🍞 Hook: Imagine smoothing a bumpy road so the ride doesn’t shake. 🥬 The Concept: Distribution-Preserving Temporal Masking (DPTM) keeps the video’s timing patterns normal even when many frames are missing object input.

  • How it works: (1) In pixel space, copy the nearest known object frame into missing spots to keep natural video statistics. (2) After encoding, mask out the fake tokens so the model knows which were real. (3) Pass an availability channel that aligns with the compressed temporal grid.
  • Why it matters: Without DPTM, the encoder sees weird patterns and produces flicker/miscoloring. 🍞 Anchor: Supplying only odd frames of an object—DPTM stops the model from stuttering on the evens.

🍞 Hook: To place a sticker on a 3D object, you need to know what’s closer or farther. 🥬 The Concept: Geometry-aware conditioning uses depth of the background and the object to respect front/back order and occlusions.

  • How it works: (1) Estimate depth for background and object. (2) Feed both into the model with RGB and masks. (3) The model composes layers so the object can pass behind or in front correctly.
  • Why it matters: Without depth cues, objects blend or float unrealistically. 🍞 Anchor: A cyclist inserted into traffic gets correctly hidden when a bus passes in front.

Together, these pieces let PISCO follow a few clear hints, keep time stable, and fit the object into the scene’s 3D and lighting logic.

03Methodology

At a high level: Input → [Prepare sparse keyframes, masks, and depth] → [Stabilize time with DPTM] → [Condition a video diffusion model with VIG + geometry + availability] → Output the edited video.

Step 0: Inputs and signals

  • Background video V: the clean scene without the object.
  • Instance cutouts I with masks M at a few timestamps (keyframes).
  • Depth DV for background and DI for the object cutouts.
  • Availability mask A marking where object hints exist (1) or not (0).

🍞 Hook: Like leaving breadcrumbs on a path so someone can follow the route. 🥬 The Concept: Sparse keyframe control uses only a few object frames and masks to drive the whole insertion.

  • How it works: (1) Provide first, first&last, or a few frames. (2) A tells the model when real hints are present. (3) The model infers motion in between.
  • Why it matters: Saves huge manual effort—no per-frame masks. 🍞 Anchor: Give frames 0 and 48: the model fills frames 1–47 with smooth motion.

Step 1: Distribution-Preserving Temporal Masking (DPTM)

  • What happens: We first fill missing object frames by copying the nearest available frame in pixel space. Then we encode the video into tokens. Finally, we mask out tokens that came from copied frames and pass an availability channel aligned to the encoder’s temporal compression.
  • Why this step exists: Pretrained temporal VAEs expect natural video statistics; hard zeros from missing frames create a distribution shift that causes flicker. DPTM keeps the input “looking like video,” yet still tells the model which frames were real vs. filled.
  • Example: If only frames 0 and 48 have the object, frame 20 gets a copy of frame 0 in pixel space for stability; later, token masking tells the model, “this was not a real observation.”
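The pixel-space fill step can be sketched as follows. This is a minimal sketch under assumptions: frames are a `(T, H, W, C)` array and "nearest" means nearest in time; PISCO additionally masks the corresponding latent tokens after encoding, which is not shown here.

```python
import numpy as np

def dptm_fill(object_frames, available):
    """Copy the nearest available object frame into missing slots (pixel
    space), and flag which slots were copies so they can be token-masked
    after encoding. Sketch only; not the paper's implementation.
    """
    T = len(available)
    real_idx = np.flatnonzero(available)   # indices of real keyframes
    filled = object_frames.copy()
    is_copy = np.zeros(T, dtype=bool)
    for t in range(T):
        if not available[t]:
            # Nearest real frame in time keeps the input "looking like video".
            nearest = real_idx[np.argmin(np.abs(real_idx - t))]
            filled[t] = object_frames[nearest]
            is_copy[t] = True
    return filled, is_copy
```

With frames 0 and 48 available, frame 20 receives a copy of frame 0 (closer in time), while `is_copy` later tells the model those slots were never real observations.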

🍞 Hook: Think of a cartoon flipbook that compresses time by grouping pages. 🥬 The Concept: A temporal VAE compresses sequences over time to encode videos efficiently.

  • How it works: (1) It groups nearby frames and turns them into compact tokens. (2) The diffusion model denoises these tokens. (3) Decoding recovers the full video.
  • Why it matters: Stable inputs to this encoder are crucial; otherwise, the output flickers. 🍞 Anchor: Grouping every 4 frames into one token; DPTM ensures each token still looks like normal video.
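Because the temporal VAE compresses several frames into one token, the per-frame availability flag must be downsampled to the same grid. The grouping rule below (a token counts as available if any frame it summarizes carried a real hint) is an illustrative choice; the paper's exact rule may differ.

```python
import numpy as np

def align_availability(avail, stride=4):
    """Downsample a per-frame 0/1 availability flag to the encoder's
    compressed temporal grid (e.g., 4 frames per token). Sketch only.
    """
    T = len(avail)
    pad = (-T) % stride                       # pad so T divides evenly
    padded = np.concatenate([avail, np.zeros(pad, dtype=avail.dtype)])
    groups = padded.reshape(-1, stride)
    return groups.max(axis=1)                 # token is "real" if any frame was
```

For an 8-frame clip with hints at frames 0 and 7 and a stride of 4, both tokens are marked available.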

Step 2: Conditioning with geometry and availability

  • What happens: We feed the diffusion model: background video, background depth DV, object RGB I with mask M and depth DI at available frames, plus A. A context adapter (VACE-style) merges these multi-channel signals into the backbone.
  • Why this step exists: The model needs to know when object hints are real, where they are, and how they relate in depth to the background.
  • Example: If depth shows a table is closer than the inserted mug, the mug gets partially hidden when moving behind the table.

🍞 Hook: Like an interpreter who takes many languages and turns them into one story. 🥬 The Concept: A context adapter fuses multiple condition channels (RGB, masks, depths, availability) into the diffusion model.

  • How it works: (1) Project all conditions into a shared space. (2) Inject them into the U-Net/Transformer at key layers. (3) Let the backbone denoise using these cues.
  • Why it matters: Without clean fusion, conditions can fight each other or get ignored. 🍞 Anchor: Feeding both the object’s mask and its depth lets the model know the object’s exact silhouette and its front/back order.

Step 3: Variable-Information Guidance (VIG)

  • What happens: During training, we randomly sample availability patterns—from single-frame to dense—to teach the model to handle any guidance density.
  • Why this step exists: Real users vary: sometimes they give just a starting frame; other times, start and end; occasionally, a few more. One model should work for all.
  • Example: Training today with only the first frame, tomorrow with first&last, next with five frames spread out.
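The random sampling of guidance densities can be sketched like this. The specific pattern set (first / first&last / a few random frames / dense) is a plausible reading of the training recipe, not the paper's exact sampling scheme.

```python
import random

def sample_availability_pattern(num_frames):
    """Pick a random guidance density for one training example, so a single
    model learns sparse and dense control alike. Sketch; modes and their
    probabilities are assumptions.
    """
    mode = random.choice(["first", "first_last", "sparse", "dense"])
    if mode == "first":
        keyframes = {0}
    elif mode == "first_last":
        keyframes = {0, num_frames - 1}
    elif mode == "sparse":
        k = random.randint(3, 5)
        keyframes = set(random.sample(range(num_frames), k))
    else:  # dense: every frame carries a hint
        keyframes = set(range(num_frames))
    return [1 if t in keyframes else 0 for t in range(num_frames)]
```

Over many training steps the model sees every density, so at inference it neither ignores a lone first-frame hint nor overfits to dense guidance.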

Step 4: Geometry- and appearance-robust training

  • Depth-aware insertion: Feeding DV and DI so the model respects occlusions and correct scale.
  • Amodal completion augmentation: Many training instances are partially hidden; we train a helper to “complete” cutouts so the main model learns to correctly re-occlude them when needed.
  • Relighting augmentation: We slightly change object lighting during training, so at inference the model can adapt the object’s light to the scene.

🍞 Hook: If a sticker’s edge is under a lamp, its shadow must follow. 🥬 The Concept: Depth-aware insertion uses depth maps of the scene and object so the model learns what should be in front or behind.

  • How it works: (1) Estimate per-pixel depth. (2) Condition the network on these depths. (3) Compose the object respecting occlusion.
  • Why it matters: Prevents “ghosting” and floating objects. 🍞 Anchor: A person walks behind a parked car and disappears partially at the right times.
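The front/back reasoning can be illustrated with a hard per-pixel z-test. Note this is a toy compositing sketch: PISCO feeds depth maps as conditioning signals to a diffusion model rather than compositing explicitly, and the smaller-depth-is-closer convention is an assumption.

```python
import numpy as np

def composite_with_depth(bg_rgb, bg_depth, obj_rgb, obj_depth, obj_mask):
    """Show the object only where its mask is on AND it is closer to the
    camera than the background (smaller depth = closer, by assumption).
    Toy z-test sketch, not PISCO's learned composition.
    """
    visible = obj_mask.astype(bool) & (obj_depth < bg_depth)
    out = bg_rgb.copy()
    out[visible] = obj_rgb[visible]
    return out
```

Where the background (say, a parked car) has smaller depth than the inserted person, the person's pixels stay hidden, which is exactly the occlusion behavior the conditioning teaches.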

🍞 Hook: Completing a jigsaw puzzle even if some pieces were hidden under the table. 🥬 The Concept: Amodal completion gives the model a fully completed object shape so it can learn when to hide parts again in the real scene.

  • How it works: (1) Train a completion helper that reconstructs full object appearance. (2) Feed completed cutouts as conditions. (3) Supervise against the real, possibly occluded video.
  • Why it matters: Teaches correct occlusion without needing perfect training masks. 🍞 Anchor: Completing a bike’s back wheel that was hidden by a fence; the final video still shows it blocked when the fence passes.

🍞 Hook: Wearing a shirt in daylight vs. candlelight looks different. 🥬 The Concept: Relighting augmentation slightly changes the object’s lighting during training so the model learns to harmonize illumination.

  • How it works: (1) Synthesize relit versions of the object. (2) Train the model to adapt while preserving identity. (3) At inference, it better matches scene light.
  • Why it matters: Avoids pasted-on look when the object’s lighting doesn’t match the scene. 🍞 Anchor: A toy car inserted at sunset gets warmer highlights matching the sky.
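A relighting augmentation can be as simple as jittering brightness and color temperature, as in this toy stand-in for the paper's relighting pipeline (the gain and tint ranges are arbitrary assumptions).

```python
import numpy as np

def relight_augment(obj_rgb, rng):
    """Jitter the object's illumination (per-channel gain plus a warm/cool
    shift) so the model learns to harmonize lighting at inference.
    Toy sketch; not the paper's relighting method.
    """
    gain = rng.uniform(0.8, 1.2)        # overall brightness change
    tint = rng.uniform(-0.05, 0.05)     # warm (+R / -B) vs. cool shift
    out = obj_rgb.astype(float) * gain
    out[..., 0] *= 1.0 + tint           # red channel
    out[..., 2] *= 1.0 - tint           # blue channel
    return np.clip(out, 0.0, 1.0)       # keep values in [0, 1]
```

Training on such relit variants while supervising against the original appearance pushes the model to adapt illumination without changing identity.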

Step 5: Denoise to final video

  • The conditioned diffusion backbone iteratively removes noise, using all signals to produce the final edited video where the background is preserved and the inserted object moves smoothly with correct interactions.

Secret sauce

  • DPTM + VIG together: DPTM keeps the time-encoder calm; VIG teaches the model to handle any level of hints. Add depth, amodal, and relighting, and you get realistic, stable insertions from just a few keyframes.

04Experiments & Results

🍞 Hook: If two chefs cook the same recipe, you taste both to see which one nailed flavor, texture, and timing. 🥬 The Concept: The team built PISCO-Bench to fairly test instance insertion, then compared PISCO against strong baselines using both reference-based and reference-free scores.

  • How it works: (1) Pick videos with an object and also make a matching clean background version (object removed). (2) Insert the object back with different methods. (3) Measure how close each method gets to the original target video, and how good it looks overall.
  • Why it matters: This isolates the quality of insertion while checking that the background remains intact. 🍞 Anchor: Testing how well each method re-inserts a pedestrian into a street scene, checking motion, occlusions, and background stability.

🍞 Hook: A science fair needs a clear scoreboard so judges can compare apples to apples. 🥬 The Concept: PISCO-Bench is a curated benchmark with verified instance masks and paired clean background videos.

  • How it works: (1) Start from BURST videos. (2) Clean up instance masks. (3) Use a high-quality removal model to get the clean background. (4) Provide both for fair testing.
  • Why it matters: Paired data lets you check both the whole frame and just the inserted object. 🍞 Anchor: Measuring how close your reinsertion is to the original scene, not just if it “looks cool.”

Baselines compared

  • Agentic pipeline: edit the first (and maybe last) frame, then use an image-to-video generator. Strong flexibility, but often drifts backgrounds.
  • Video inpainting: fills masked regions; needs dense masks and can struggle with long-term motion.
  • Reference-guided video-to-video editing (e.g., VACE, UniVideo): globally edits video with references but weaker at pin-point space-time control.

What they measured and why

  • Reference-based: FVD (temporal realism), LPIPS (perceptual closeness), PSNR/SSIM (pixel/structure closeness). Measured for the whole video and just the foreground region.
  • Reference-free: VBench, including masked Background Consistency and Subject Consistency to separately judge background preservation and object fidelity.
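Of the reference-based metrics above, PSNR is simple enough to compute directly; it measures pixel-level closeness between the edited result and the reference (higher is better). A minimal implementation, assuming images/videos are float arrays in [0, 1]:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a reference and a result.
    Higher means closer to the reference pixel-wise.
    """
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")   # identical inputs
    return 10.0 * np.log10(max_val ** 2 / mse)
```

FVD and LPIPS, by contrast, rely on pretrained networks (video features and perceptual embeddings) and need dedicated libraries.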

Scoreboard with context

  • Whole video: PISCO-14B with first&last control reduces FVD from a strong baseline’s 371 to 204 and improves LPIPS from 0.103 to 0.097—like turning a solid B into a clear A.
  • Foreground: PISCO-14B first&last gets Foreground FVD 138 and LPIPS 0.022, much better than others—meaning the inserted object stayed sharp, consistent, and well-timed.
  • Reference-free (VBench): PISCO-14B first&last achieves top Subject Consistency (~91.6), edging out VACE and inpainting models, while staying competitive on Background Consistency (~94.2).

🍞 Hook: Adding more clues should help your friend find your house. 🥬 The Concept: Scalability with sparse control means more keyframes lead to better results, smoothly and predictably.

  • How it works: (1) Start with first-only control. (2) Add last frame. (3) Add three intermediate frames. (4) Watch quality rise monotonically.
  • Why it matters: Professionals can dial in as much control as needed, getting predictable gains. 🍞 Anchor: With five frames, PISCO-14B reaches whole-video FVD ~136 and Foreground LPIPS ~0.015—its best.

Surprising findings

  • Agentic pipelines that seem intuitive (edit then animate) often rewrite the background, hurting PSNR/SSIM and temporal scores.
  • Inpainting excels at background preservation but, without reference instances and with dense-mask needs, it under-delivers on object identity and motion coherence.
  • Reference V2V is good overall but struggles with precise space-time placement over long sequences—PISCO’s sparse keyframe approach shines here.

Takeaway: PISCO wins because it keeps the original video steady, inserts the object reliably with correct physics, and gracefully benefits from each extra keyframe you provide.

05Discussion & Limitations

Limitations

  • Very sparse or low-quality instance masks can mislead placement or silhouette; while PISCO is robust, garbage-in still risks garbage-out.
  • Extreme motion or rapidly changing occlusions may need more than one keyframe to lock down the object’s path.
  • Depth estimation is not perfect; rare failure cases in complex lighting or reflective scenes can cause minor occlusion or shadow errors.
  • Rigid lighting mismatches can persist if the object’s material is unusual (e.g., strong subsurface scattering) even with relighting augmentation.

Required resources

  • A capable GPU is needed for inference (video diffusion over tens to hundreds of frames). Training requires multi-GPU setups and careful staged finetuning.
  • Instance masks and at least one reference cutout frame per inserted object.
  • Optional: background depth maps and relighting tools for best realism (the pipeline estimates depth automatically in practice).

When NOT to use

  • If you can easily provide dense, perfect masks and only need simple hole-filling, traditional inpainting may be faster.
  • If you want to restyle the whole video globally rather than insert a precise new object, general video-to-video editing may be simpler.
  • If you need physical simulation-grade interactions (e.g., fluid splashes interacting bi-directionally with the object), pure 2D diffusion may be insufficient.

Open questions

  • Active control: Can the model take coarse motion sketches or arrows to guide trajectories interactively?
  • Material-aware physics: How to better model complex light transport (metallic, glassy, translucent objects) without heavy 3D?
  • Longer videos: How to maintain precise control over minutes instead of seconds while keeping memory and compute reasonable?
  • Multi-object coordination: How well can multiple inserted instances interact with each other and with the scene?
  • Reliability under domain shift: How to make insertion equally robust in cartoons, medical, or drone footage without task-specific tuning?

06Conclusion & Future Work

In three sentences: PISCO is a video diffusion framework that inserts a specific object into an existing video using only a few keyframes, while preserving the scene’s original motion. It introduces Variable-Information Guidance (VIG) and Distribution-Preserving Temporal Masking (DPTM) to make sparse control stable and reliable, plus depth-aware and relighting strategies for physical realism. On the PISCO-Bench evaluation, it outperforms strong baselines and improves predictably as you add more control frames.

Main achievement: Showing that precise, professional-grade instance insertion is possible with minimal user effort by combining sparse keyframe control with time-stable encoding and geometry/lighting-aware conditioning.

Future directions: Add interactive trajectory tools (scribbles/arrows), stronger material-aware lighting, longer-duration consistency, and multi-object coordination. Explore real-time previews and tighter integration with production toolchains.

Why remember this: PISCO marks a shift from “prompt and pray” video generation to dependable, controllable editing that respects the original footage. With just a few well-placed hints, it makes complex insertions look natural—unlocking faster, cheaper, and more believable video storytelling for everyone.

Practical Applications

  • Add a product or character into a finished commercial while keeping actors and set unchanged.
  • Localize an ad by inserting region-specific signage without re-shooting the entire scene.
  • Create stunt previews by inserting a digital double into real footage to plan camera and lighting.
  • Generate training data for autonomous driving by inserting controlled pedestrians or vehicles into clean street videos.
  • Adjust blocking in post: reposition or rescale a prop or actor to improve composition without reshoots.
  • Continuity fixes: reinsert a missed prop in a few shots using first&last control.
  • Education and news explainers: place illustrative objects into real scenes while preserving scene integrity.
  • Previsualization: try alternate object paths with minimal keyframes to compare creative options quickly.
  • Game and AR trailers: insert characters into live-action plates with correct occlusions and shadows.
  • A/B testing creative variants by changing only the inserted object while keeping the background identical.
#video instance insertion#sparse keyframe control#video diffusion#Variable-Information Guidance#Distribution-Preserving Temporal Masking#geometry-aware conditioning#depth-aware insertion#amodal completion#relighting augmentation#temporal VAE#context adapter#video inpainting#video-to-video editing#VBench#FVD