
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Beginner
Yujie Zhao, Hongwei Fan, Di Chen et al. · 12/22/2025
arXiv · PDF

Key Summary

  • Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.
  • This paper shows how to turn just 1–5 real robot demos into hundreds of new, realistic multi-view videos without using a simulator.
  • The key trick is using depth (how far things are) as a 3D control interface to guide video generation from edited 3D scenes.
  • They first rebuild the scene’s 3D shape at the right real-world scale, then safely rearrange objects and fix robot poses to keep motions physically valid.
  • A video model uses depth, edges, action hints, and camera rays to generate smooth, multi-view-consistent demonstrations.
  • On four real tasks, policies trained on generated data from 1–5 demos matched or beat policies trained on 50 real demos (10–50× data efficiency).
  • The method handles new object placements, heights, and textures while keeping visuals realistic and interactions correct.
  • Ablations show that metric-accurate 3D reconstruction, robot pose correction, and smooth object relocation are crucial for quality.
  • It works with RGB-only multi-view inputs and is compatible with VLA policies used in modern robot learning.
  • Limitations include handling articulated objects and needing reasonable compute and calibration data.

Why This Research Matters

Robots that can learn reliably from just a few examples are far cheaper and faster to deploy in homes, hospitals, and factories. This approach multiplies a handful of real demonstrations into a large, realistic, multi-view training set, so robots handle new placements, heights, and appearances with confidence. It avoids simulators and extra depth sensors, making it compatible with common RGB camera setups used today. By keeping both visuals and interactions correct, it trains policies that actually work on real hardware. This can shorten development cycles, reduce data collection costs, and make robust robot skills accessible to more teams. Ultimately, it’s a practical step toward everyday robots that adapt to the messy, changing real world.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how practicing a sport is easier when you can watch lots of good game replays from many angles? Robots are similar—they learn skills by watching demonstrations. The more varied the examples, the better they perform when things change.

🥬 The Concept: Before this work, getting robots to be good at tasks like picking up a cup or scanning a barcode required collecting many real-world demonstrations from many positions and angles. That’s costly and slow. People tried two main shortcuts: (1) use simulators to auto-collect demos, and (2) edit a few demos to make many. Simulators are fast but often don’t look or feel like reality, so models trained in sim can struggle in the real world (this is the Sim2Real gap). Editing demos sounds great, but past methods had trade-offs—some needed special depth sensors, some needed dense 3D scans, and many couldn’t produce multi-view RGB videos with correct robot-object contact.

🍞 Anchor: Imagine you have a single video of a robot putting a mug in a basket. Wouldn’t it be amazing if you could move the mug somewhere else, update the robot’s motion correctly, and instantly get a new, realistic video from all cameras? That’s the dream.

🍞 Hook: Imagine rearranging your room in a house design app—you drag the couch, and the app shows how your room will look from different corners. Easy and fun, right?

🥬 The Concept: In robotics, doing this kind of rearrangement with videos is hard. You need to keep geometry right (how far things are), keep physics believable (the robot’s arm must bend correctly), and keep visuals realistic across all camera views. Past attempts fell short:

  • Simulation-only approaches: fast but visuals and physics don’t perfectly match reality.
  • 3D Gaussian Splatting methods: realistic but require dense captures, limiting scalability.
  • Point-cloud editing methods: can move things in 3D but often rely on depth sensors and don’t output RGB videos for common 2D policies.
  • Video-only augmentation: can change appearance (textures) but not spatial layouts and correct trajectories.

🍞 Anchor: It’s like wanting to move the mug to a new spot and have the robot adjust its path naturally while the video still looks real from every camera—most tools could only do one or two pieces, not the whole puzzle.

🍞 Hook: Think of depth—the “how far” of every pixel—like the secret blueprint behind a photo.

🥬 The Concept: This paper’s key idea is to use depth as a 3D control interface that connects 3D edits (like moving objects or adjusting robot poses) to 2D multi-view video generation. The system reconstructs the scene’s 3D geometry at true real-world scale (so 1 cm is really 1 cm), safely edits object placements and robot motion in 3D while fixing any kinematic mistakes, then guides a video generator with depth, edges, action hints, and camera info to make new, realistic, multi-view robot demos.

🍞 Anchor: It’s like first building a clean 3D Lego model of your room, then moving the mug Lego piece, bending the robot arm Lego joints correctly, and finally snapping high-quality photos from each corner that all match.

🍞 Hook: Why should you care? Because collecting 50 real demos is like filming 50 games—it costs time and energy.

🥬 The Concept: With this method, using only 1–5 real demos, you can generate 200+ new demos that work just as well or better than 50 real ones. That’s 10–50× data efficiency. It also handles new heights (like placing the mug on a platform) and new textures (like changing the table color) while keeping the robot’s interactions correct.

🍞 Anchor: Think of studying with 5 excellent sample problems that you can remix into 200 variations—suddenly, your practice set covers almost everything you’ll be tested on.

02Core Idea

🍞 Hook: Imagine directing a movie. You place the actors, plan their moves, and the camera crew captures it from different angles. If you could control the invisible 3D blueprint underneath the scene, you could reshoot the same scene anywhere, perfectly.

🥬 The Concept (Aha! in one sentence): Use depth as a universal 3D control interface that lets us edit scenes and motions in 3D and then reliably generate realistic, multi-view 2D videos that match those edits.

How it works (big picture):

  1. Rebuild the scene’s 3D geometry at real scale from a few RGB cameras.
  2. Edit the 3D scene: move objects, plan new robot paths, and correct the robot’s joint poses to stay physically valid.
  3. Feed time-aligned depth (plus edges, actions, camera rays) to a video generator that produces multi-view consistent, realistic demos.

Why it matters: Without a faithful 3D-to-2D bridge, you either get pretty videos that don’t match physics or correct 3D edits that don’t look real on camera. Depth glues 3D intent to 2D appearance.

🍞 Anchor: You move a mug 20 cm left in 3D, fix the robot’s elbow bend, and the system outputs camera videos where the robot naturally reaches the new spot—like a reshoot that never needed a real set.

Three analogies:

  1. Puppet strings: Depth is the set of invisible strings controlling where every pixel sits in 3D, so the puppet (video) moves correctly when you pull in 3D.
  2. Coloring book: Depth is the outline that keeps coloring (textures) inside the right shapes from every viewpoint.
  3. GPS for a movie: Depth provides turn-by-turn 3D directions so the video generator doesn’t get lost when objects and robots move.

Before vs After:

  • Before: Either lots of real demos, or simulated/edited data that doesn’t quite look/behave right; limited spatial generalization in RGB policies.
  • After: Start with 1–5 real demos, create hundreds of realistic, kinematically consistent, multi-view RGB videos with new placements, heights, and textures; train policies that match or beat 50 real demos.

Why it works (intuition):

  • Depth encodes geometry and interactions—where the gripper is, how close it is to the mug, and whether paths are feasible.
  • Correct robot kinematics (pose correction) keeps motions physically believable.
  • Multi-view attention weaves information across cameras for consistent visuals.
  • Auxiliary signals (edges, actions, ray maps) sharpen shapes, tie motion to intent, and keep views in sync.

Building blocks (with sandwich explanations):

  1. 🍞 Hook: You know how measuring with a real ruler beats eyeballing from a photo? 🥬 Metric-scale geometry reconstruction: It predicts true-to-scale depth and camera poses from a few RGB views by fine-tuning a feed-forward model (VGGT) with a hybrid of clean simulated depth/poses and real but noisy depth data. Without metric scale, later edits won’t line up and projections will drift across views. 🍞 Anchor: If the real mug is 9 cm away, the model also estimates ~9 cm, not a vague guess.

  2. 🍞 Hook: Imagine moving a lamp on your desk—you don’t drag your whole arm as a single block; your wrist and elbow bend. 🥬 Depth-reliable spatial editing: It splits a demo into motion segments (free moves) and skill segments (contact), moves objects in 3D, updates cameras attached to wrists, inpaints backgrounds, filters depth, plans new motions, and crucially corrects robot joint poses so only the end-effector is repositioned while other links are re-aligned. Without this, depth becomes inconsistent and the robot looks broken. 🍞 Anchor: Shift the mug and the robot’s elbow re-bends to reach it, producing clean, hole-free, consistent depth frames.

  3. 🍞 Hook: Think of making a flipbook where each page must match the 3D plan. 🥬 3D-controlled video generation: A transformer uses depth as the main control, plus edges, action maps, and ray maps, with a dual-attention design (intra-view for details, cross-view for consistency) to generate multi-view videos. Without 3D control, videos might be pretty but miss the correct contact and alignment. 🍞 Anchor: The flipbook shows the robot grasping precisely where the 3D plan says, from head and both wrist cameras.

  4. 🍞 Hook: Ever push furniture a little at a time to avoid scratching the floor? 🥬 Smooth object relocation: Instead of teleporting objects in the first frame, it interpolates poses over a short prelude, turning static editing into a brief, believable move. Without it, first-frame edits can confuse the generator and misplace objects. 🍞 Anchor: The mug glides to a new spot for a few frames, then the task begins—clean and stable.

03Methodology

High-level recipe: Input (a few multi-view RGB demos + robot joint states/actions + robot/camera info) → 1) Metric-scale 3D reconstruction → 2) Depth-reliable spatial editing (new placements + motion planning + pose correction) → 3) 3D-controlled multi-view video generation → Output: many new, realistic, multi-view demos.

Step 1: Metric-scale geometry reconstruction 🍞 Hook: You know how wearing correct glasses makes everything crisp at the right distance? 🥬 The Concept: The system fine-tunes a feed-forward geometry model (VGGT) so it outputs accurate camera poses and metric-true depth from just three views (head, left wrist, right wrist). It mixes two worlds during training: simulation (perfect poses and clean depth) and real data (noisy but real-scale depth). Camera loss is trained on sim (since sim poses are exact), and depth loss is trained on both real (with masks to ignore bad sensor noise) and sim (to learn cleanliness). A point-map loss on sim further stabilizes geometry. Without this, the 3D point cloud is messy, and later projections become unreliable. 🍞 Anchor: After this step, a single frame becomes a clean point cloud where the mug, basket, arms, and table are all in the right size and place.
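
To make the hybrid supervision concrete, here is a minimal loss sketch in PyTorch. The tensor names (`cam_pose`, `depth_mask`, `is_sim`) and weights are assumptions, not the paper's actual training code; the sketch only illustrates how the camera loss can be restricted to simulated samples while depth is supervised on both sources through a validity mask.

```python
import torch
import torch.nn.functional as F

def hybrid_geometry_loss(pred, batch, w_cam=1.0, w_depth=1.0, w_pts=0.5):
    """Hedged sketch of the sim+real supervision described above.

    Assumed (hypothetical) tensor layout:
      pred / batch: 'depth' (B,V,H,W), 'cam_pose' (B,V,7), 'points' (B,V,H,W,3)
      batch extras: 'is_sim' (B,) bool, 'depth_mask' (B,V,H,W) valid-pixel mask
    """
    is_sim = batch["is_sim"].float()                               # (B,)

    # Camera-pose loss: only simulated samples have exact ground-truth poses.
    cam_err = F.smooth_l1_loss(pred["cam_pose"], batch["cam_pose"],
                               reduction="none").mean(dim=(1, 2))  # (B,)
    cam_loss = (cam_err * is_sim).sum() / is_sim.sum().clamp(min=1)

    # Depth loss: trained on both real (masked noisy sensor) and sim (clean) depth.
    mask = batch["depth_mask"].float()
    depth_loss = ((pred["depth"] - batch["depth"]).abs() * mask).sum() / mask.sum().clamp(min=1)

    # Point-map loss: sim-only, stabilises the overall 3D geometry.
    pts_err = (pred["points"] - batch["points"]).abs().mean(dim=(1, 2, 3, 4))  # (B,)
    pts_loss = (pts_err * is_sim).sum() / is_sim.sum().clamp(min=1)

    return w_cam * cam_loss + w_depth * depth_loss + w_pts * pts_loss
```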

Key sub-concepts:

  • 🍞 Hook: Think of a depth map as a heat map of distance. 🥬 Depth map: An image where each pixel holds the distance to the camera. It’s the backbone for projecting to 3D. Without good depth, you can’t reliably edit or render from new poses. 🍞 Anchor: The mug’s rim is closer (brighter/warmer), the far table edge farther (darker/cooler).

  • 🍞 Hook: A point cloud is like sprinkling tiny dots to fill a 3D shape. 🥬 Point cloud: 3D points derived from depth and camera intrinsics. It’s the editable 3D canvas. Without a solid point cloud, edits and re-rendered depth will break. 🍞 Anchor: The mug becomes a cluster of dots forming a hollow cylinder.
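
The jump from a depth map to a point cloud is standard pinhole back-projection; the sketch below (plain NumPy, illustrative variable names) shows the operation the article keeps referring to.

```python
import numpy as np

def depth_to_point_cloud(depth, K, cam_to_world=None):
    """Back-project a metric depth map (H, W), in metres, into 3D points.

    K is the 3x3 camera intrinsics matrix; cam_to_world is an optional
    4x4 pose that moves points from the camera frame to the world frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))      # pixel grid
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)   # (H*W, 3), camera frame

    if cam_to_world is not None:
        pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
        pts = (cam_to_world @ pts_h.T).T[:, :3]         # move into the world frame
    return pts
```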

Step 2: Depth-reliable spatial editing 🍞 Hook: Rearranging furniture is easy if you know what’s fixed and what moves. 🥬 The Concept: The original demonstration is split into motion segments (moving in free space) and skill segments (in contact). You apply a 3D transform to objects (e.g., shift/rotate the mug) and appropriately to the robot’s end-effector in skill segments so contact stays realistic. Cameras fixed to wrists are updated with the same transforms. Background holes (from moved objects) are inpainted and re-reconstructed for consistency, and filtered to remove depth noise. Motion planning fills in free-space moves between keyframes. Why: Without segmentation and background completion, you get floating artifacts or missing pixels; without motion planning, the path might collide or be jerky. 🍞 Anchor: Move the mug 25 cm left and rotate 20°. The robot re-plans a smooth path to grasp it; the background where the mug used to be is neatly filled.
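
A rough sketch of the spatial edit itself, under simplifying assumptions (a single rigid object, a precomputed contact mask); the function name and array layout are illustrative, not the authors' code.

```python
import numpy as np

def edit_demo(object_pts, ee_poses, skill_mask, T_edit):
    """Illustrative sketch of the edit: one rigid transform T_edit (4x4)
    relocates the object; end-effector poses are transformed only during
    contact ("skill") frames, so the grasp relative to the object is kept.

      object_pts : (N, 3)    object point cloud in the world frame
      ee_poses   : (T, 4, 4) end-effector pose per frame
      skill_mask : (T,) bool, True where the robot is in contact
    """
    # Move the object point cloud to its new placement.
    pts_h = np.concatenate([object_pts, np.ones((len(object_pts), 1))], axis=1)
    new_pts = (T_edit @ pts_h.T).T[:, :3]

    # During skill segments, carry the end-effector along with the object.
    new_ee = ee_poses.copy()
    new_ee[skill_mask] = T_edit @ ee_poses[skill_mask]

    # Free-space ("motion") segments are left to a motion planner that
    # reconnects the unchanged start pose to the transformed skill segments.
    return new_pts, new_ee
```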

Critical ingredient—robot pose correction 🍞 Hook: Your wrist can move, but your elbow and shoulder must adjust too. 🥬 Robot pose correction: Only the end-effector should be directly repositioned. The rest of the arm must be recalculated using kinematics (IK to find joint angles that reach the new end-effector pose; FK to render where each link ends up). This re-renders the arm depth so the robot’s shape and self-occlusions are physically correct. Without it, the arm looks like a stiff stick and depth becomes wrong, confusing the video generator. 🍞 Anchor: When the mug shifts, the elbow bends differently; the resulting depth frames show a believable arm shape in all views.
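
A hedged sketch of the IK/FK correction loop follows; `ik_solve` and `fk_link_poses` are assumed helpers (for example, wrappers around a URDF-based kinematics library), not functions from the paper.

```python
def correct_robot_poses(new_ee_poses, q_init, ik_solve, fk_link_poses):
    """Sketch of pose correction: only the end-effector target is edited
    directly; the rest of the arm is recomputed with IK, then FK reports
    where every link actually ends up so arm depth can be re-rendered.

    Assumed helper signatures:
      ik_solve(target_pose_4x4, q_seed) -> joint_angles
      fk_link_poses(joint_angles)       -> {link_name: 4x4 pose}
    """
    corrected_joints, corrected_links = [], []
    q_seed = q_init
    for ee_pose in new_ee_poses:
        q = ik_solve(ee_pose, q_seed)              # joints that reach the new target
        corrected_joints.append(q)
        corrected_links.append(fk_link_poses(q))   # full arm shape for depth rendering
        q_seed = q                                 # warm-start next frame for smoothness
    return corrected_joints, corrected_links
```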

Step 3: 3D-controlled video generation 🍞 Hook: Imagine painting a scene while tracing a precise stencil—hard to go wrong. 🥬 The Concept: A transformer-based video generator receives, per frame and per view, depth (main control), Canny edges (sharp boundaries), an action map (what the robot intends to do), and ray maps (view geometry cues). A dual-attention design learns details within each view (intra-view) and the correspondence across views (cross-view) for multi-view consistency. During training, sometimes depth/edges are randomly dropped so the model doesn’t overfit to one cue and remains robust to small depth errors. Without these controls and attention structure, the model might drift, blur objects, or desynchronize views. 🍞 Anchor: Given the edited depth, the generator outputs realistic head, left-wrist, and right-wrist videos where the robot grasps the mug at the new spot.
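
The dual-attention pattern can be sketched with standard PyTorch attention layers. The toy block below alternates attention within each view and across views; it omits the depth, edge, action, and ray-map conditioning and the MLPs that the real model uses, so treat it as an illustration of the attention layout only.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Toy sketch: self-attention over tokens inside each view (details),
    then self-attention over views at each token position (consistency)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, V, N, D) = batch, views, tokens per view, channels
        B, V, N, D = x.shape

        # Intra-view attention: refine details inside each camera view.
        h = x.reshape(B * V, N, D)
        h = h + self.intra(self.norm1(h), self.norm1(h), self.norm1(h))[0]

        # Cross-view attention: align the same token index across views.
        h = h.reshape(B, V, N, D).permute(0, 2, 1, 3).reshape(B * N, V, D)
        h = h + self.cross(self.norm2(h), self.norm2(h), self.norm2(h))[0]

        return h.reshape(B, N, V, D).permute(0, 2, 1, 3)   # back to (B, V, N, D)
```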

Key sub-concepts:

  • 🍞 Hook: Two conversations are better than one—a quiet chat within a group and a cross-table exchange between groups. 🥬 Dual-attention (intra-view + cross-view): Self-attention runs inside each view to refine details and across views to align perspectives. Without cross-view, wrist and head views can disagree; without intra-view, details get mushy. 🍞 Anchor: The head camera sees the mug exactly where the wrist camera expects it, with sharp edges in both.

  • 🍞 Hook: Teleporting furniture can look weird; sliding it looks natural. 🥬 Smooth object relocation: Interpolate object transforms over a short intro sequence. This turns a sudden edit into a brief, believable move and sets up the generator with an easy, consistent start. Without it, the first frame can produce misplaced or unstable generations. 🍞 Anchor: The mug visibly glides to its new spot for a second, then the robot begins.
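
The relocation prelude boils down to interpolating the object's pose over a few frames. A minimal sketch using SciPy rotations (the frame count and interface are assumptions, not details from the paper):

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def relocation_prelude(T_start, T_end, n_frames=24):
    """Instead of teleporting the object to its edited pose in frame 0,
    interpolate its 4x4 pose over a short prelude so the generator sees
    a brief, believable move."""
    key_rots = Rotation.from_matrix(np.stack([T_start[:3, :3], T_end[:3, :3]]))
    slerp = Slerp([0.0, 1.0], key_rots)                  # rotation interpolation
    ts = np.linspace(0.0, 1.0, n_frames)
    rot_mats = slerp(ts).as_matrix()                     # (n_frames, 3, 3)

    poses = []
    for t, R in zip(ts, rot_mats):
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = (1 - t) * T_start[:3, 3] + t * T_end[:3, 3]  # linear translation
        poses.append(T)
    return poses                                         # one object pose per prelude frame
```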

Putting it all together (example: Mug to Basket):

  1. Reconstruct depth/poses from three cameras; build a metric-accurate point cloud.
  2. Choose a new mug/basket placement; inpaint background; segment the demo into move and contact parts; motion-plan to the new locations.
  3. Correct robot poses via IK/FK; re-render reliable arm depth; project edited point clouds to depth for each camera, each frame.
  4. Feed depth+edges+actions+rays to the video generator; it outputs multi-view videos where the robot grasps and places correctly.
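
The four steps above can be condensed into a hedged pipeline sketch. Every callable passed in stands for a whole subsystem described earlier (reconstruction, editing, pose correction, planning, depth rendering, video generation); none of the names come from the paper's code, and many details (inpainting, depth filtering, segmentation) are folded into those helpers.

```python
def real2edit2real_pipeline(demo, edits, recon, edit_scene, correct_poses,
                            plan_motion, render_depth, generate_video):
    """High-level sketch; `demo` is assumed to be a dict with keys
    'rgb_views', 'intrinsics', and 'joints', and `edits` is a list of
    4x4 object transforms (e.g. new mug placements)."""
    # 1) Metric-scale reconstruction from a few RGB views.
    depth, cam_poses, point_cloud = recon(demo["rgb_views"], demo["intrinsics"])

    generated_demos = []
    for T_edit in edits:
        # 2) Depth-reliable spatial editing + kinematic pose correction.
        edited_pc, ee_targets = edit_scene(point_cloud, demo, T_edit)
        joint_traj = correct_poses(ee_targets, demo["joints"])    # IK/FK per frame
        full_traj = plan_motion(demo["joints"][0], joint_traj)    # free-space moves
        control_depth = render_depth(edited_pc, full_traj, cam_poses)

        # 3) 3D-controlled multi-view video generation.
        videos = generate_video(control_depth, actions=full_traj, cams=cam_poses)
        generated_demos.append({"videos": videos, "actions": full_traj})
    return generated_demos
```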

Secret sauce:

  • Depth as the universal 3D control signal that ties edits to pixels.
  • Kinematic pose correction to preserve physical plausibility.
  • Dual-attention for crisp, multi-view-consistent frames.
  • Smooth relocation to stabilize early frames and object placement.

04Experiments & Results

🍞 Hook: If a class studies from just a few great examples that magically expand into hundreds of variations, they’ll ace the test—especially when the test mixes up where things are placed.

🥬 The Test: The authors evaluated whether policies (robot brains) trained only on generated demos can succeed on real robots with new object placements. Four tasks were used: Mug to Basket, Pour Water, Lift Box (two arms), and Scan Barcode (two objects and tools). Inputs were three RGB views (head, left wrist, right wrist). Success meant the robot completed the task with objects placed at random new positions.

The Competition: They compared policies trained on:

  • 10, 20, or 50 real demos (collected by teleoperation with varied placements), vs.
  • 200 generated demos made from just 1, 2, or 5 real source demos.

They also tested two popular VLA policies (Go-1 and π0.5), and ran a Diffusion Policy study on Mug to Basket. Extra tests checked height and texture changes. Ablations tested removing pieces like metric reconstruction, robot pose correction, or smooth relocation.

Scoreboard (with context):

  • Data efficiency: Training with only 1–5 real demos plus 200 generated ones matched or beat training with 50 real demos. That’s like getting an A when others using 50 real examples got a B.
  • As sources increased from 1 → 5, success rose further, since more original interaction patterns yield richer generated variations.
  • Average success with 5-source-generated data reached about 79–81% (Go-1 and π0.5), exceeding the ~61% of 50 real demos—roughly a 17–20 percentage point jump.
  • Scaling generated demos: From a single source demo, increasing generated demos from 200 to 300–400 further improved success; beyond ~300, average success surpassed the 50-real baseline.
  • Height editing: A policy trained only on tabletop data failed when the mug was placed on a taller platform (0%). Mixing generated tabletop+platform demos achieved 80%—like going from a goose egg to a strong B.
  • Texture robustness: On five table colors, training on generated demos with mixed textures improved average success from 50% to 68%.
  • Diffusion Policy check (Mug to Basket): Policies trained on generated data from just 1–5 source demos beat the policy trained on 50 real demos, confirming general usefulness beyond VLA models.

Surprising (good) findings:

  • Just 1–5 real demos can be enough to reach or exceed 50-real-demo performance once expanded by this pipeline.
  • The clean, metric reconstruction plus pose-corrected editing produces depth that the video model can reliably follow, making the generated videos realistic and useful.
  • Smooth object relocation notably stabilizes early frames and reduces misplacement errors.

Ablations (what breaks without each part):

  • Without metric-accurate reconstruction: Point clouds were noisy, camera poses drifted, and depth projections degraded—downstream videos became inconsistent.
  • Without robot pose correction: Arms looked wrong in depth; generated videos became blurry/inconsistent during motion and contact.
  • Without smooth relocation: First-frame object teleports caused placement errors and unstable generations.

Bottom line: Across four real tasks and multiple policies, the method delivered 10–50× data efficiency, strong spatial generalization, and flexibility in height/texture—while staying RGB-only and simulator-free.

05Discussion & Limitations

🍞 Hook: Imagine a magical photocopier for demos—it’s powerful, but it still needs good lighting, paper, and a few care rules.

🥬 Limitations:

  • Articulated objects (like doors with hinges) are not handled well yet, mainly because the video model saw few such examples during training.
  • The pipeline relies on decent 3D reconstruction from sparse views; extreme lighting, heavy reflections, or very textureless scenes can still cause depth errors.
  • It doesn’t explicitly model forces or complex physics (e.g., deformable objects or fluid slosh beyond visual cues), so tasks needing precise force control may require additional sensing.
  • Good robot kinematic models (URDF) and calibration data are assumed; poor calibration reduces quality.

Required resources:

  • Multi-view RGB demos (e.g., head and wrist cameras), robot URDF/camera parameters, and access to GPUs for fine-tuning the reconstruction and video models (the paper used 8×H100 during training; generation is under a minute per 20-second clip with parallelization).
  • Some curated real and sim data to fine-tune metric-scale reconstruction and a multi-task dataset to fine-tune the video model.

When NOT to use:

  • If you have only a single monocular view with no way to recover reasonable geometry.
  • Tasks dominated by haptics/forces (e.g., snapping tight lids) where visuals alone are insufficient.
  • Highly deformable or heavily articulated objects where simple pose correction isn’t enough.
  • Environments with severe sensor glare, mirrors, or deep occlusions that break reconstruction.

Open questions:

  • How to incorporate articulated-object physics and more complex contacts reliably?
  • Can haptic/force feedback be integrated as a control signal alongside depth and actions?
  • How well does it transfer across robot embodiments and camera layouts (e.g., mobile bases, eye-in-hand only)?
  • Could online, on-the-fly generation be coupled with policy learning in a loop for continual improvement?
  • What’s the best way to blend language goals (VLA prompts) with 3D control signals for richer task scaling?

🍞 Anchor: Think of this as a strong first version of a demo “factory”—future upgrades (better 3D perception, richer physics, more diverse training) could make it handle doors, drawers, soft objects, and even trickier lighting with ease.

06Conclusion & Future Work

🍞 Hook: If you could multiply a few good lessons into a whole workbook of practice problems that still feel real, you’d master the test faster.

🥬 Three-sentence summary: Real2Edit2Real turns a handful of multi-view RGB demos into hundreds of new, realistic, multi-view robot demonstrations by using depth as a 3D control interface. It reconstructs true-scale geometry, safely edits object placements and robot motions with pose correction, and then generates consistent videos guided by depth, edges, actions, and ray maps. Trained policies achieve the success of 50 real demos using data made from only 1–5 real demos, delivering 10–50× data efficiency.

Main achievement: Bridging editable 3D (point clouds, kinematics) with 2D video generation through a depth-centered control pathway that preserves both visual realism and physically plausible interactions.

Future directions: Add robust handling of articulated and deformable objects, integrate force/haptic cues, adapt across embodiments and camera setups, and jointly learn with language and planning for broader, instruction-following tasks. Speed and accessibility improvements (lighter models, fewer GPUs) would widen practical use.

Why remember this: It shows that the bottleneck of collecting tons of real robot demos can be broken by a clever 3D-to-2D bridge—depth as the universal translator—unlocking fast, scalable, and realistic training data for more capable, robust robots.

Practical Applications

  • Bootstrap a new manipulation task by collecting 3 real demos and generating 200+ high-quality, multi-view demos for training.
  • Rapidly expand a dataset to cover new object placements and orientations without re-teleoperating.
  • Adapt existing demonstrations to new table heights or platforms by editing height and regenerating videos.
  • Increase robustness to visual changes (e.g., table textures, lighting) by first-frame editing plus 3D-controlled generation.
  • Speed up policy iteration: edit a few key demos, generate variants, retrain, and test on the real robot within a day.
  • Create multi-view training data for VLA models when only RGB cameras are available (no depth sensors).
  • Produce bimanual demonstrations (e.g., Lift Box, Scan Barcode) with correct arm kinematics via pose correction.
  • Perform domain adaptation: mix simulated clean geometry with real videos to improve metric accuracy and realism.
  • Build curriculum datasets that gradually increase spatial difficulty (longer reaches, trickier placements) via 3D edits.
  • Stress-test policies by generating edge cases (near-collisions, tight grasps) using motion planning and controlled depth.
#robotic demonstration generation · #depth-controlled video generation · #metric-scale 3D reconstruction · #visuomotor policy learning · #spatial generalization · #pose correction · #multi-view consistency · #point cloud editing · #VLA compatibility · #Sim2Real gap · #dual-attention transformer · #background inpainting · #motion and skill segmentation · #smooth object relocation · #RGB-only training data