RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Key Summary
- RoboVIP is a plug-and-play tool that turns ordinary robot videos into many new, realistic, multi-view training videos without changing the original robot actions.
- It adds a new idea called visual identity prompting, which shows the generator example pictures (like cups or bottles) so it can fill scenes with the right small objects and textures, not just guess from text.
- The system segments (cuts out) the robot arm and the touched object, using the gripper open/close timeline to find the exact interaction moments, then inpaints everything else.
- RoboVIP fine-tunes a large video diffusion model with LoRA and stitches multiple camera views together, so the generated videos stay consistent over time and across views.
- A million-scale pool of clean, diverse identity images (cups, plates, bowls, etc.) is automatically curated from big robot datasets using panoptic segmentation and quality filters.
- On video quality, RoboVIP beats Cosmos-Transfer2.5 and RoboEngine with better FID/FVD/LPIPS scores and stronger cross-view matches, meaning sharper frames, smoother motion, and better view alignment.
- For robot skills in simulation, training Octo and π on RoboVIP-augmented data raises success rates compared to zero-shot, standard fine-tuning, and previous augmentation methods.
- In real-robot tests (cube stacking), a Diffusion Policy trained with RoboVIP data jumps to 10/10 success in clean scenes and 9/10 in cluttered scenes, far above the baselines.
- The method scales with only raw videos plus action logs and needs no hand-tuned scene setups, making it practical for large robot learning projects.
- Limitations remain in off-the-shelf segmentation and reasoning tools and in simulation benchmarks that currently lack multi-view inputs.
Why This Research Matters
RoboVIP helps robots learn from the kind of videos they truly need: multi-view and time-consistent scenes that still match the original actions. This reduces the cost and time of collecting huge real-world datasets by safely generating more variety from what you already have. With visual identity prompting, tiny but important details, like the exact look of a cup or bottle, are captured, which improves generalization to new kitchens, labs, and offices. Stronger generalization means robots are less likely to fail when backgrounds change or clutter appears. In practice, this translates to more reliable household help, warehouse handling, and lab assistance. As video models and policies co-evolve, tools like RoboVIP can keep training data aligned with what policy models expect, accelerating progress in real-world robotics.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're teaching a friend to make a sandwich, but you only ever practice in your own kitchen with the same plate, same counter, and same lighting. When your friend visits a different kitchen, they might freeze: where does the bread go, what's this new countertop pattern, where's the plate?
🥬 The Concept (Robotic Manipulation): Robots learning to handle objects (like picking up a cup) need lots of varied, high-quality videos to truly understand how to act across different scenes. How it works:
- Collect videos of robots acting, plus the actions they took (like how they moved their hand and when they closed the gripper).
- Train models to map what they see (vision) and what you tell them (language) to what they should do (actions).
- Test in new scenes and with new distractors. Why it matters: Without diverse, realistic data, robots get confused in new places, like your friend in a new kitchen. 🍞 Anchor: A robot that can stack blocks at one table might fail when the tablecloth changes color unless it trained on varied scenes.
🍞 Hook: You know how a sports game is filmed by many cameras (front, side, and a moving close-up) so everyone sees the same play clearly?
🥬 The Concept (Multi-View Video Generation): Making videos that show the same moment from several cameras at once, with all views matching. How it works:
- Take multi-camera robot videos.
- Keep the robot and the touched object intact.
- Generate the rest of the scene across all views so they agree with each other. Why it matters: Without cross-view agreement, one camera might show a bowl that doesn't exist in another, which confuses training. 🍞 Anchor: If the side view shows a red cup, the top view must show the same red cup in the right place.
🍞 Hook: Think of a cartoonist who starts with a noisy pencil blur and slowly erases the fuzz to draw a crisp scene.
🥬 The Concept (Video Diffusion Model): A generator that starts from noise and gradually paints a realistic video based on conditions like text, images, and masks. How it works:
- Start with noisy frames.
- Use clues (text prompts, example images, and masked videos) to guide each clean-up step.
- End with a sharp, coherent video (a toy code sketch of this loop follows below). Why it matters: Without this step-by-step "denoising," the result would be random static, not a useful training video. 🍞 Anchor: The model learns to turn noise into a scene with a wooden table, a blue bowl, and the same robot arm doing the same action as in the original clip.
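To make the noise-to-video loop concrete, here is a tiny, self-contained sketch of iterative denoising with a stand-in "clean estimate"; the schedule, shapes, and the all-zero estimate are illustrative assumptions, not the paper's model. In the real system, that estimate would come from the fine-tuned video backbone conditioned on text, identity images, and the masked multi-view video.

```python
import torch

def toy_denoise_video(shape=(8, 3, 32, 32), steps=50, seed=0):
    """Toy illustration of diffusion-style denoising for a short video clip.

    `shape` is (frames, channels, height, width). The "denoiser" is a stand-in
    that always predicts an all-zero clean video; a real conditioned model
    would predict the actual scene from text, identities, and masked frames.
    """
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(shape, generator=g)                # start from pure noise
    for t in reversed(range(steps)):                   # t = steps-1, ..., 0
        alpha = t / steps                              # crude noise-level schedule
        predicted_clean = torch.zeros_like(x)          # stand-in for the model's estimate
        x = alpha * x + (1 - alpha) * predicted_clean  # step toward the estimate
    return x

video = toy_denoise_video()
print(video.shape, float(video.abs().mean()))          # the noise has been denoised away
```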
🍞 Hook: If you ask a friend to decorate a table with "nice things," you might get anything. But if you show pictures of the exact vase and cup you want, they'll match your style.
🥬 The Concept (Visual Identity Prompting): Giving the generator small example images (like a specific cup or bottle) so it fills the scene with the right objects and textures. How it works:
- Build a big gallery of object cutouts (cups, bowls, bottles, etc.).
- Pick a few and pass them along with the masked video into the generator.
- The generator uses them as visual hints to inpaint realistic, matching tabletop and background content (a small packing sketch follows below). Why it matters: Text alone misses fine details (shape, gloss, patterns). Without visual identities, scenes can look vague or wrong. 🍞 Anchor: Showing the model a "ridged glass bottle" picture yields a ridged bottle in the scene, not a random smooth one.
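One simple way to hand several identity examples to a generator is to pack the crops into a single reference image, as sketched below; the grid layout and tile size are assumptions for illustration, not the paper's exact packing scheme.

```python
import numpy as np
from PIL import Image

def pack_identity_images(crops, tile=128, cols=2):
    """Resize identity crops and tile them into one reference 'prompt' image.

    `crops` is a list of PIL Images (e.g., a cup, a ridged bottle). The packed
    image is what gets encoded and handed to the generator as a visual hint.
    """
    rows = (len(crops) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile, rows * tile), "white")
    for i, crop in enumerate(crops):
        resized = crop.convert("RGB").resize((tile, tile))
        canvas.paste(resized, ((i % cols) * tile, (i // cols) * tile))
    return canvas

# Example with synthetic "crops"; real crops come from the curated identity pool.
fake_crops = [Image.fromarray(np.full((64, 64, 3), c, dtype=np.uint8)) for c in (50, 150, 220)]
packed = pack_identity_images(fake_crops)
print(packed.size)  # (256, 256) for 3 crops in a 2-column grid
```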
🍞 Hook: When editing a movie, you mask the actor to change the background without touching them.
🥬 The Concept (Inpainting): Filling the missing or masked parts of an image or video with new, realistic content while keeping the unmasked parts (like the robot and target object) untouched. How it works:
- Segment out the robot and the interacted object.
- Mask everything else.
- Generate new content to fill the masked areas while keeping time and view consistency (the compositing rule is sketched below). Why it matters: Without inpainting, you might change the robot shape or object geometry, breaking the original action labels. 🍞 Anchor: The robot hand and the grasped spoon remain exact; only the table texture and nearby items change.
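Because the defining rule of inpainting here is "generated pixels may only land where the mask allows," the compositing step is easy to sketch. The snippet below is a minimal, framework-free illustration with NumPy arrays and made-up shapes, not the paper's implementation, which works inside the diffusion pipeline.

```python
import numpy as np

def composite_inpaint(original, generated, keep_mask):
    """Blend generated content into a video while preserving protected pixels.

    original, generated: float arrays of shape (T, H, W, 3) in [0, 1].
    keep_mask: (T, H, W) booleans, True where the robot/object must be kept.
    Only the masked-out background is replaced; robot and object pixels are
    copied through unchanged, so the logged actions still match the visuals.
    """
    keep = keep_mask[..., None].astype(original.dtype)
    return keep * original + (1.0 - keep) * generated

T, H, W = 4, 32, 32
original = np.random.rand(T, H, W, 3)
generated = np.random.rand(T, H, W, 3)
keep_mask = np.zeros((T, H, W), dtype=bool)
keep_mask[:, 8:24, 8:24] = True                      # pretend this box is the robot arm
out = composite_inpaint(original, generated, keep_mask)
assert np.allclose(out[:, 8:24, 8:24], original[:, 8:24, 8:24])  # protected region untouched
```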
🍞 Hook: If you want to find the exact moment someone presses a doorbell in a long video, watch for the finger closing in.
🥬 The Concept (Gripper State): A 1-D signal that tells whether the robot's gripper is open or closed, which pinpoints when it actually interacts with an object. How it works:
- Scan the gripper signal to find close/open moments.
- Focus segmentation and naming around those frames.
- Track objects across time and views from these anchor points (a small detection sketch follows below). Why it matters: Without this, you'd search the whole video and easily miss or mislabel the target object. 🍞 Anchor: A spike when the gripper closes helps locate "that's when the carrot was grasped," guiding object detection.
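Here is a minimal sketch of how close/open events could be pulled from a logged 1-D gripper signal; the threshold and the open/closed convention are assumptions that vary by robot and dataset.

```python
import numpy as np

def gripper_keyframes(gripper, closed_below=0.5):
    """Find frames where a 1-D gripper signal crosses the 'closed' threshold.

    `gripper` holds one value per frame (here assumed 1.0 = open, 0.0 = closed).
    Returns (close_frames, open_frames), the likely interaction boundaries.
    """
    gripper = np.asarray(gripper, dtype=float)
    closed = gripper < closed_below
    change = np.diff(closed.astype(int))
    close_frames = np.where(change == 1)[0] + 1   # open -> closed transitions
    open_frames = np.where(change == -1)[0] + 1   # closed -> open transitions
    return close_frames, open_frames

signal = [1.0] * 120 + [0.1] * 60 + [1.0] * 40    # toy grasp from frame 120 to 180
closes, opens = gripper_keyframes(signal)
print(closes, opens)  # [120] [180] -> segment and track the object around these frames
```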
🍞 Hook: A school project might say, "Look at this picture and write what to do." Robots need the same: see, read, then act.
🥬 The Concept (Vision-Language-Action Models, VLA): Models that read images and instructions, then output actions to do the task. How it works:
- Encode the video frames (vision) and the instruction (language).
- Fuse them to understand goals and context.
- Predict the next robot action. Why it matters: Without good VLA training data (clear, varied visuals), action choices become brittle. 🍞 Anchor: "Put the spoon on the towel." The VLA looks at the scene, finds the spoon and towel, and plans a grasp and place.
🍞 Hook: When you learn to ride a bike, your eyes and muscles work together.
🥬 The Concept (Visuomotor Policy Learning): Training a controller that turns what it sees into how it moves. How it works:
- Show video states and the correct actions.
- Learn a mapping from pixels to motor commands.
- Execute in the real world. Why it matters: If visuals change (clutter, textures), the mapping can fail unless training covered those cases. 🍞 Anchor: A policy that practiced on varied tabletops can still pick and place when the tablecloth pattern changes.
The world before: Robot learning suffered from scarce, samey videos with few views, shaky time consistency, and minimal background variety. People tried simple augmentation (crops, color jitter), scripted green-screen setups, or single-image edits. But policies today often expect multi-view, temporally coherent inputs. The problem: text-only generation can hallucinate or miss details, and single-view, single-frame edits break the data robots really need.
The gap: A scalable, plug-and-play way to create rich, multi-view, time-consistent scenes while preserving the original robot and action labels. The stakes: Better generalization means robots that don't panic when the countertop changes or new objects appear.
02 Core Idea
🍞 Hook: Picture a stage crew keeping the actor and prop exactly the same, while swapping the scenery around them in perfect sync for every camera angle.
🥬 The Concept (RoboVIP's Aha!): Keep the robot and touched object fixed, and use a multi-view video diffusion inpainting model, guided by example object images (visual identities), to repaint everything else consistently across time and cameras. How it works (big idea):
- Segment and save the robot and target object (so actions still match).
- Stitch all camera views together per time-step so the model learns cross-view consistency.
- Feed example identity images (cups, bowls) so the model knows exactly what to add.
- Generate the masked regions over all frames and views so the scene looks real and stays aligned. Why it matters: This creates heaps of high-quality, varied videos that match how modern policies learn (multi-view and multi-frame) without expensive real-world filming. 🍞 Anchor: The same spoon grasp is preserved, but the background becomes a wooden counter with new bottles that appear correctly in both the wrist and side cameras.
Multiple analogies:
- Movie set: the actors (robot + object) are locked in place; the set decorators swap in new props and backdrops that appear correct from all cameras.
- Dressing a diorama: you keep the figurines but place different furniture and wallpaper, matching the same 3D layout from every side.
- Recipe + references: a chef follows a recipe (the text prompt) but also a photo board (the identity images) to nail the exact look.
Before vs. After:
- Before: single-frame edits, one view at a time; vague text descriptions; mismatched objects across cameras.
- After: time-smooth, multi-view-coherent videos; visual identities ensure realistic, detailed props; stronger training data that boosts policy success.
Why it works (intuition, not math):
- The conditions act like rails: masks say "change only here," actions say "don't touch the robot/object timing," identities say "add these exact-looking items," and multi-view stitching says "keep all views in agreement."
- Diffusion is great at filling in missing pixels from context, and LoRA fine-tuning teaches it the robot-video specifics without forgetting its general video skills.
Building blocks (mini-sandwiches): 🍞 Hook: If you tape two camera views side-by-side, you can see if they match. 🥬 Multi-View Stitching: Concatenate views per frame so the model learns cross-view correspondence during training. Without it, each view drifts. 🍞 Anchor: The red cup's position lines up in both the wrist and third-person views.
🍞 Hook: Showing a paint-by-number guide makes the painter color inside the lines. 🥬 Masks for Inpainting: Keep robot/object pixels; only repaint masked areas. Without masks, you'd corrupt the action labels. 🍞 Anchor: The robot wrist pose stays identical while the table texture changes.
🍞 Hook: A mood board helps a set designer pick just the right props. 🥬 Visual Identity Prompting: Pack example object crops into the model so it learns low-level details. Without identities, small items look generic. 🍞 Anchor: A ribbed glass bottle appears as ribbed, not smooth.
🍞 Hook: Small steering tweaks keep a train on track. 🥬 LoRA Fine-Tuning: Add low-rank adapters to a big video model so it learns new tricks without forgetting old ones. Without LoRA, you risk overfitting and collapse. 🍞 Anchor: The model keeps its clean motion and texture skills while adapting to robot scenes.
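As a concrete picture of what a LoRA adapter is, here is a minimal PyTorch sketch that wraps one frozen linear layer with a trainable low-rank update; the rank and scale are illustrative, and in RoboVIP such adapters would sit inside the large video diffusion backbone rather than a toy layer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # start as an exact copy of the base model
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                        # only the small A/B matrices are trainable
```

Only the small A/B matrices receive gradients, which is why fine-tuning stays cheap and the base model's general video skills are preserved.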
In short: RoboVIP is the stage crew, the blueprint, and the prop photos, all working together so robots get the kind of videos they really need to learn well.
03 Methodology
High-level recipe: Input (multi-view robot videos + actions) → [Action-guided segmentation] → [Visual identity curation] → [Multi-view inpainting video diffusion with identity prompting] → Output (augmented multi-view videos with the same actions).
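Expressed as code, that recipe might look like the sketch below. Every helper is a hypothetical stub named only for illustration (the real components are described in Steps A-C); the point is the data flow, especially that the action labels pass through untouched.

```python
from typing import Any, Dict, List

# Hypothetical stubs standing in for the real components described in Steps A-C.
def segment_robot_and_object(videos, actions):
    return [{"keep_mask": None} for _ in videos]             # placeholder masks

def sample_identities(pool, k=3):
    return pool[:k]                                          # pick a few reference crops

def inpaint_multiview(videos, masks, text, identities):
    return videos                                            # placeholder: return input unchanged

def robovip_augment(videos: List[Any], actions: List[Any], text: str,
                    identity_pool: List[Any]) -> Dict[str, Any]:
    """High-level recipe: segment -> pick identities -> inpaint; actions pass through unchanged."""
    masks = segment_robot_and_object(videos, actions)        # Step A
    identities = sample_identities(identity_pool)            # Step B (pool is built offline)
    new_videos = inpaint_multiview(videos, masks, text, identities)  # Step C
    return {"videos": new_videos, "actions": actions}        # original labels are reused as-is

augmented = robovip_augment(videos=["wrist.mp4", "side.mp4"], actions=["episode_actions.npz"],
                            text="Scene: wooden counter; Action: move cubes and stack",
                            identity_pool=["cup.png", "bottle.png", "bowl.png"])
print(sorted(augmented))                                     # ['actions', 'videos']
```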
Step A: Action-guided segmentation (find and protect what must not change)
- What happens: We detect when the gripper closes/opens to find the interaction window, name the object from the wrist view, and segment both the robot and that object across all frames and views. We use open-vocabulary and video segmentation models, and we refine the masks to reduce flicker and outliers.
- Why this step exists: If we don't perfectly preserve the robot and target object, the original action labels (like 6-DoF deltas and gripper state) no longer match the visuals.
- Example: In a "pick the carrot" clip, we use the gripper-close moment to focus on the frames around the grasp, ask a video-language model to confirm it's a carrot, and then track that carrot and the arm over time from the wrist and side cameras.
Mini-sandwiches inside Step A: 🍞 Hook: Watching for a door click helps you find the exact moment it closed. 🥬 Gripper-Triggered Keyframes: Use the 1-D gripper state to anchor when interaction happens, narrowing where to look. Without it, the object might be missed. 🍞 Anchor: A spike at frame 120 marks the grasp; that's where we start object tracking.
🍞 Hook: Labels and outlines make coloring pages easier. 🥬 Automated Segmentation: Use open-vocab and video segmentation to get masks for the robot and object; post-process to smooth noise. Without this, inpainting bleeds into the robot. 🍞 Anchor: The arm mask stays solid even when the wrist camera moves quickly.
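One simple way to "post-process to smooth noise" is a temporal majority vote over neighboring frames, sketched below; the window size is an assumption, and the paper's actual mask refinement may differ.

```python
import numpy as np

def smooth_masks_over_time(masks, window=5):
    """Reduce flicker in per-frame binary masks with a temporal majority vote.

    masks: boolean array of shape (T, H, W). Each pixel is set to the majority
    value over a short window of neighboring frames, which suppresses
    single-frame dropouts and spurious blobs before inpainting.
    """
    masks = np.asarray(masks, dtype=bool)
    T = masks.shape[0]
    half = window // 2
    out = np.empty_like(masks)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = masks[lo:hi].mean(axis=0) > 0.5
    return out

masks = np.ones((10, 4, 4), dtype=bool)
masks[5] = False                      # a one-frame dropout (flicker)
smoothed = smooth_masks_over_time(masks)
print(bool(smoothed[5].all()))        # True: the dropped frame is filled back in
```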
Step B: Visual identity curation (build a prop library)
- What happens: From large robot datasets, we run panoptic segmentation to crop tabletop-sized objects (cups, bowls, bottles, utensils, etc.), then filter by quality (sharpness, resolution) and text-image alignment (CLIP score). We discard huge background slabs (walls, tables) and poor crops.
- Why this step exists: Text prompts can't guarantee fine visual details. Identity images act like high-fidelity references.
- Example: We collect thousands of clean cup and bottle crops, remove blurry or partial ones, and keep a balanced, diverse set.
Mini-sandwich inside Step B: 🍞 Hook: A museum curator only displays the clearest, most complete artifacts. 🥬 Panoptic Segmentation for Identities: Auto-crop objects and filter for quality and semantic match. Without curation, identities are messy and mislead generation. 🍞 Anchor: A well-cropped "green bowl" identity yields a neat, green bowl in the final video.
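The sketch below shows the kind of quality gates such a curation pass could apply to candidate crops: a minimum resolution, a sharpness check (variance of a Laplacian response), and a CLIP image-text score via Hugging Face transformers. The thresholds and checkpoint name are assumptions, not the paper's settings.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sharpness(img: Image.Image) -> float:
    """Variance of a 5-point Laplacian response; low values indicate blurry crops."""
    g = np.asarray(img.convert("L"), dtype=np.float32)
    lap = -4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
    return float(lap.var())

def keep_crop(img, label, min_side=64, min_sharp=20.0, min_clip=20.0):
    """Keep a crop only if it is big enough, sharp enough, and matches its label."""
    if min(img.size) < min_side or sharpness(img) < min_sharp:
        return False
    inputs = processor(text=[f"a photo of a {label}"], images=img,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        clip_score = model(**inputs).logits_per_image.item()  # scaled image-text similarity
    return clip_score >= min_clip

# Example: a synthetic crop stands in for a panoptic-segmentation cutout.
crop = Image.fromarray((np.random.rand(96, 96, 3) * 255).astype(np.uint8))
print(keep_crop(crop, "cup"))
```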
Step C: Multi-view inpainting video diffusion with identity prompting (paint the scene consistently)
- What happens: We fine-tune a large video diffusion model (Wan2.1 I2V) with LoRA. We vertically stitch the multi-view frames at each time-step so the model learns cross-view consistency. We feed in masked videos, structured text (scene + action), and a packed image of several identity examples. The model inpaints only the masked regions over time.
- Why this step exists: Policies need time-smooth, cross-view-stable visuals. Inpainting keeps the robot/object intact; identities add detail; stitching locks the views together.
- Example: A wrist view and a side view are concatenated. The model fills in a wooden tabletop and adds a selected ridged bottle and a blue cup that appear correctly in both views and across frames.
Mini-sandwiches inside Step C: 🍞 Hook: Add-on gadgets let a bike gain gears without rebuilding the frame. 🥬 LoRA Fine-Tuning: Small adapter layers teach new tasks without erasing old skills. Without LoRA, training is unstable and expensive. 🍞 Anchor: After fine-tuning, motion remains smooth, but robot scenes look more accurate.
🍞 Hook: Lining up a choir ensures the voices harmonize. 🥬 Multi-View Stitching: Concatenate views per frame so spatial cues align. Without stitching, props drift between cameras. 🍞 Anchor: A red cup appears at the same table spot in both the wrist and third-person frames.
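Stitching itself is just a per-time-step concatenation along the image height, as in this minimal NumPy sketch; the frame sizes are made up, and a real pipeline would stitch (and later split) the views around the diffusion model's encode/decode steps.

```python
import numpy as np

def stitch_views(views):
    """Vertically stack synchronized camera views at every time-step.

    views: list of arrays, each (T, H, W, 3) from one camera. Returns a single
    (T, num_views * H, W, 3) video so the generator sees all views of the same
    instant in one frame and can keep them consistent.
    """
    assert len({v.shape for v in views}) == 1, "views must share shape and be time-aligned"
    return np.concatenate(views, axis=1)

def split_views(stitched, num_views):
    """Undo the stitching to recover per-camera clips after generation."""
    return np.split(stitched, num_views, axis=1)

wrist = np.random.rand(49, 120, 160, 3)
side = np.random.rand(49, 120, 160, 3)
stitched = stitch_views([wrist, side])
print(stitched.shape)                         # (49, 240, 160, 3)
print(split_views(stitched, 2)[0].shape)      # (49, 120, 160, 3)
```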
🍞 Hook: A collage board guides a painter's style. 🥬 Frame-wise Identity Concatenation: Encode identity images with a VAE, append them as extra frames, then drop them from the loss so they guide but aren't targets. Without identities, small objects look generic. 🍞 Anchor: The model paints a specific embossed bottle it saw in the identity frame.
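A minimal sketch of that idea, with random tensors standing in for VAE latents: the identity latents ride along as extra "frames" so the model can attend to them, while a loss mask keeps them out of the training objective. The shapes and the placeholder loss are illustrative assumptions.

```python
import torch

def append_identity_frames(video_latents, identity_latents):
    """Concatenate identity latents to the video latents along the time axis.

    video_latents:    (B, T, C, H, W) latents of the (stitched, masked) video.
    identity_latents: (B, K, C, H, W) latents of K identity images; here random
    stand-ins, while the real system would use the diffusion model's VAE.
    Returns the combined latents and a loss mask that is 0 on identity frames.
    """
    combined = torch.cat([video_latents, identity_latents], dim=1)
    loss_mask = torch.ones(combined.shape[:2])
    loss_mask[:, video_latents.shape[1]:] = 0.0    # identities guide, but are not targets
    return combined, loss_mask

B, T, K, C, H, W = 1, 49, 2, 4, 16, 16
video_latents = torch.randn(B, T, C, H, W)
identity_latents = torch.randn(B, K, C, H, W)      # stand-in for VAE-encoded identity crops
latents, loss_mask = append_identity_frames(video_latents, identity_latents)
per_frame_loss = (latents ** 2).mean(dim=(2, 3, 4))           # placeholder per-frame loss
masked_loss = (per_frame_loss * loss_mask).sum() / loss_mask.sum()
print(latents.shape, float(masked_loss))
```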
The Secret Sauce:
- Using the gripper timeline to anchor object discovery (precise and robust).
- Multi-view stitching so the model learns cross-view coherence, not just per-view prettiness.
- Identity prompting to inject semantically and visually precise tabletop content.
- Inpainting to preserve robot kinematics and action alignment.
Concrete data flow example:
- Input: a two-view video (wrist + side), actions (Δx, Δy, Δz, Rx, Ry, Rz, gripper), text ("Scene: wooden counter; Action: move cubes and stack"), and an identity pack (blue cup + ridged bottle).
- Process: segment robot/object → stitch views → encode identities → the diffusion model inpaints the masked areas over 49 frames.
- Output: time-smooth, multi-view-consistent clips where the robot's grasp is unchanged, but the scene now has a wooden counter, a blue cup, and a ridged bottle consistently visible from both cameras.
04 Experiments & Results
The Test: We measured whether RoboVIP produces better generative videos (sharp frames, smooth motion, consistent views) and whether policies trained on its augmented data succeed more often. Metrics included FID (single-frame realism), FVD (video coherence), LPIPS (perceptual similarity), and cross-view matches (are views aligned?). We also ran policy tests in a simulation with real-like textures and on a real robot.
The Competition: We compared against RoboEngine (single-image inpainting) and Cosmos-Transfer2.5 (video-to-video with pixel-aligned controls), plus standard training baselines without augmentation.
The Scoreboard (with context):
- Generative video (DROID, 300 cases): RoboVIP achieved FID 39.97, FVD 138.4, LPIPS 0.409, and the highest cross-view correspondence. Think of this as getting the best report card for picture quality (FID), smoothness over time (FVD), and how well two cameras agree (cross-view). RoboEngine's frames looked notably worse (e.g., FID 62.77), and Cosmos struggled with the multi-view requirement.
- Simulation (SimplerEnv tasks like "put spoon on towel," "carrot on plate," "stack cube," "eggplant in basket"):
  - Octo: zero-shot → 12% success; standard BridgeV2 fine-tune → 13%; with RoboEngine → 8%; with RoboVIP (text only) → 13%; with RoboVIP (text + identities) → 18.5%. That's like raising a grade from a low D to a solid C in tough classes.
  - π: standard fine-tune → 17%; RoboEngine → 18.5%; RoboVIP (text only) → 29%; RoboVIP (text + identities) → 27.75%. That's a big jump, like going from a C to a strong B.
- History length stress test (Octo): as we increased how many past frames the model used, RoboVIP stayed steady, while RoboEngine's performance dropped toward zero at long histories. Translation: video-level augmentation matters when memory gets longer.
- Real robot (Franka, cube stacking):
  - Plain Diffusion Policy: 7/10 successes in clean scenes, 0/10 in cluttered scenes.
  - With RoboVIP data: 10/10 in clean, 9/10 in cluttered, like acing the easy test and nearly acing the hard one with distractors.
Surprising findings:
- Visual identity prompting helped not only identity faithfulness but also made tabletops richer without breaking action alignment. In a user study, identity-conditioned videos were preferred ~97% for identity preservation and ~80% for richer tabletops.
- Multi-view stitching brought noticeable gains: view alignment improved, and policies trained on these videos handled new visuals better.
- Even with powerful text captions from strong VLMs, text alone wasn't enough; identity images made small objects and textures click into place.
05 Discussion & Limitations
Limitations:
- Segmentation brittleness: off-the-shelf open-vocab and video segmentation can flicker or miss the gripper/objects, especially in fast wrist views, which can degrade masks for inpainting.
- Caption and reasoning noise: VLMs may hallucinate scene details; mislabels can misguide segmentation or text conditioning.
- Multi-view sim gap: our real-world setup is multi-view, but some popular simulators evaluate only single-view, underestimating the benefit of cross-view consistency.
- Resource needs: fine-tuning a large video model with LoRA still requires multi-GPU memory and careful batching; curating identity pools at scale takes compute and storage.
When not to use:
- If your policy and data are strictly single-image, single-view, and very short, simpler augmentation might suffice.
- If your environment demands exact geometry edits (not appearance), pixel-aligned controls (edges/depth) may be a better fit.
- If you cannot preserve accurate robot/object masks (e.g., severe occlusions with no action logs), inpainting risks corrupting labels.
Open questions:
- Can we make segmentation and object naming fully robust in fast-moving wrist cams without action logs?
- How best to train end-to-end with multi-view consistency losses across both generator and policy?
- Can identity prompting be made 3D-aware (e.g., multi-identity, multi-view triangulation) to further improve spatial realism?
- How far can we scale long-horizon generation (hundreds of frames) while maintaining coherence and low compute?
06 Conclusion & Future Work
Three-sentence summary: RoboVIP keeps the robot and touched object fixed and inpaints everything else across multiple views and frames, guided by example identity images, to create rich, realistic training videos. This yields better video quality and stronger policy performance in both simulation and real robots than prior augmentation methods. The approach is plug-and-play over raw videos plus action logs and scales via an automatically curated identity pool.
Main achievement: Showing that multi-view, identity-prompted video inpainting can consistently boost vision-language-action and visuomotor policies by delivering the kind of varied, temporally coherent, cross-view-aligned data those models actually need.
Future directions:
- Stronger, faster, more robust segmentation and object naming in wrist views.
- 3D-aware, multi-identity prompting to improve spatial realism and placement.
- Joint training loops where the policy and generator co-teach each other.
- Extending coherent horizons and resolutions while keeping compute manageable.
Why remember this: It reframes data augmentation for robots from "prettier pictures" to "policy-ready videos" (multi-view, time-smooth, and detail-correct), so robots trained today keep their cool tomorrow when the kitchen, clutter, and camera angles all change.
Practical Applications
- Augment a small robot dataset to handle different countertops, walls, and clutter without re-recording everything.
- Stress-test a policy by injecting controlled distractors (extra cups, bottles) while keeping actions unchanged.
- Match multi-camera setups by generating view-consistent scenes for wrist and third-person cameras.
- Create domain-randomized tabletop textures that better simulate deployment environments.
- Pre-train or fine-tune VLA models with richer, longer, and more coherent video histories.
- Improve robustness of visuomotor policies (e.g., Diffusion Policy) to real-world clutter and lighting shifts.
- Build a reusable identity pool for a lab's common objects to standardize augmentation across projects.
- Rapidly prototype tasks in simulation with visuals matched to real data distributions.
- Reduce annotation labor by using gripper-triggered keyframes to auto-locate interacted objects.
- Perform safer data scaling for new tasks where real collection is expensive or limited.