RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning
Key Summary
- RoboCurate is a way to make better robot training videos by checking if the actions in a generated video actually match what a robot would do in a simulator.
- It generates lots of diverse scenes using image-to-image edits and video-to-video style transfers, while keeping the robot’s motion the same.
- It verifies actions by replaying the predicted actions in a simulator and comparing the simulator’s motion with the generated video using a small attention-based checker.
- Compared to using only real data, RoboCurate boosts success rates by +70.1% on GR-1 Tabletop (with 300 demos) and +16.1% on DexMimicGen in pre-training.
- On a real humanoid robot (ALLEX), it improves success by +179.9% and even unlocks new behaviors that weren’t in the real demos.
- Vision-language models alone aren’t enough to judge physics and motion; RoboCurate focuses on action-level motion consistency instead of surface-level plausibility.
- Best-of-N sampling uses the action checker as a critic to pick the most consistent video-action pair during generation, improving quality without throwing away data.
- The method is general: it works across different robot bodies (like humanoids and dual-arm setups) and camera views.
- Its two secret ingredients are action-verified filtering and controllable visual diversity, which together make synthetic data much more useful.
- This approach helps robots learn faster, safer, and cheaper by turning noisy synthetic videos into trustworthy training examples.
Why This Research Matters
High-quality training data is the fuel for capable robots, but collecting it in the real world is slow, costly, and risky. RoboCurate turns cheap synthetic videos into trustworthy lessons by verifying the actions, not just the looks. This helps robots succeed more often on real tasks, even when scenes, lighting, or objects change. It also reduces dependence on huge real datasets, speeding up development for homes, labs, and factories. Because it works across different robot bodies, teams can share improvements more easily. In short, it makes robot learning faster, safer, and more affordable.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re learning to play piano. If your teacher shows you crystal-clear videos with the right finger moves, you learn fast. But if some videos are blurry or show impossible hand motions, you get confused.
🥬 Filling (The Actual Concept): Before this paper, robots learned from two main sources: real demonstrations (slow and expensive to collect) and simulation (fast but often looks different from the real world). Lately, people also used synthetic videos—made by powerful video generators—to show robots lots of tasks cheaply. The plan was: generate a robot task video, then label the actions by using an inverse dynamics model (IDM) that guesses “what actions produced this motion?” Finally, train a policy on the paired video-and-actions.
- How it works (step by step, like a recipe):
- Gather real demos and train tools like IDMs and video generators.
- Make new task videos from a still image plus an instruction using a video generator.
- Convert the video into actions using an IDM.
- Train a robot policy on many such pairs to handle new tasks.
- Why it matters: If those synthetic videos have weird physics (like a hand grabbing without moving, or objects clipping), the IDM will give wrong actions, and the robot will learn the wrong thing. Bad training data → bad robot.
🍞 Bottom Bread (Anchor): Think of practicing basketball by watching training clips. If the ball teleports into the hoop, you can’t learn the real shot. Robots face the same problem with low-quality synthetic videos.
🍞 Top Bread (Hook): You know how practicing in a video game can help with real sports—up to a point? That’s because the game and real life aren’t identical.
🥬 Filling (The Actual Concept – Sim-to-Real Transfer): Sim-to-real transfer means taking skills learned in a fake world (simulator) and using them in the real world.
- How it works:
- Train a policy in a simulator.
- Try it in the real world.
- Adjust training or visuals to close the gap.
- Why it matters: If the simulator looks or behaves too differently from reality, the robot stumbles outside the sim.
🍞 Bottom Bread (Anchor): Like learning to ride a bike in a game: you understand balance, but the real bike feels different—you still need some practice in the real world.
🍞 Top Bread (Hook): Suppose you see the first and last frame of someone throwing a ball and try to guess the throw. You’re inferring the actions from the before-and-after.
🥬 Filling (The Actual Concept – Inverse Dynamics Model, IDM): An IDM takes a start frame and a future frame, and predicts the sequence of actions that likely happened in between.
- How it works:
- Look at two frames: now and a little later.
- Predict the actions that move the robot between them.
- Repeat along the video to get the full action list.
- Why it matters: IDMs let us add action labels to action-free videos, turning them into training data.
🍞 Bottom Bread (Anchor): It’s like seeing a goal celebration and a replay kickoff, then figuring out the kicks that must have happened in between.
🍞 Top Bread (Hook): When you read a picture book with captions, the words help you understand the images better.
🥬 Filling (The Actual Concept – Vision-Language Models, VLMs): VLMs understand pictures and text together, so they can describe images or generate instructions that fit a scene.
- How it works:
- Look at an image.
- Read or produce a text instruction that matches what’s in view.
- Use that to guide video generation.
- Why it matters: Good instructions lead to videos with meaningful robot-object interactions.
🍞 Bottom Bread (Anchor): If the image shows a red mug on the left, a good instruction might be “Use the left hand to pick up the red mug,” not “Fold the blanket.”
🍞 Top Bread (Hook): Think of a “movie” that exists only in the computer’s imagination.
🥬 Filling (The Actual Concept – Synthetic Data Generation): This is creating pretend—but realistic—data, like robot videos, to practice on.
- How it works:
- Start with a real frame and a task instruction.
- Use a video generator to imagine the task happening.
- Label actions with an IDM.
- Why it matters: It’s a cheap way to get lots of training examples.
🍞 Bottom Bread (Anchor): Like practicing math with well-made practice problems the teacher invents, not just from the textbook.
What was missing? Earlier, people tried to ask VLMs, “Does this video follow physics?” But those checks were too shallow (they might miss whether the hand moved far enough, or whether the motion precisely matched the action labels). The big gap: we needed a way to directly verify the actions themselves, not just whether the video “looks okay.”
The real stakes: Poorly labeled actions teach robots bad habits. That can waste training time, make robots fail on new tasks, or even cause unsafe behaviors. If we fix data quality and add diverse, realistic practice, we get robots that succeed more often—in kitchens, factories, and labs—while needing fewer costly real-world demos.
02 Core Idea
🍞 Top Bread (Hook): Imagine you wrote a dance routine on paper, then asked a friend to perform it while you watched a cartoon character do the same dance. If both dances match beat-by-beat, your plan was correct; if not, something’s off.
🥬 Filling (The Actual Concept – Aha! Moment): RoboCurate’s key insight is to trust actions only if they work in a simulator and visually match the generated video’s motion.
- How it works:
- Generate a synthetic video of a robot doing a task.
- Use an IDM to predict the actions behind that video.
- Replay those actions in a simulator to get a rollout video that’s guaranteed to follow those actions.
- Compare motion in the generated video vs. the simulator video using a small attention-based checker.
- Keep the pair only if the motions align (action-verified), and discard the rest.
- Why it matters: This checks the actions directly, not just surface “physics-looking” signals. It turns noisy data into trustworthy training fuel.
🍞 Bottom Bread (Anchor): It’s like checking a cooking recipe by following it in a kitchen simulator and watching a food video; if the steps and the result look the same, the recipe (actions) is probably right.
Multiple Analogies:
- Music duet: One musician (the generator video) plays a melody; another musician (the simulator rollout) plays from sheet music (the actions). The duet works only if both match rhythm and notes.
- Map vs. drone flight: The actions are the flight plan; the simulator is a safe test flight; the generated video is the “real” drone cam. If the paths overlap, your plan is solid.
- Lip-sync check: The actions are the script, the simulator is the actor saying the lines, and the generated video is the mouth movement. If lip movements and audio align, it’s consistent.
🍞 Top Bread (Hook): You know how remixing a song keeps the beat but changes instruments?
🥬 Filling (The Actual Concept – Action-Preserving Visual Diversity): RoboCurate boosts diversity by editing the scene (image-to-image, I2I) and changing appearance over time (video-to-video, V2V) while keeping the motion (actions) intact.
- How it works:
- I2I: Edit the first frame to vary tables, lighting, and background, guided by edge maps so geometry stays plausible.
- V2V: Restyle full videos (e.g., colors, textures) while preserving motion using edge constraints.
- Reuse the same actions if motion is preserved.
- Why it matters: The policy sees many looks of the same skill, learning to ignore cosmetic changes and focus on doing the task.
🍞 Bottom Bread (Anchor): Like practicing the same piano piece on different pianos in different rooms; your fingers move the same way even as the look and sound change.
🍞 Top Bread (Hook): Picking the best photo from a burst on your phone is smarter than keeping the first snap.
🥬 Filling (The Actual Concept – Best-of-N Sampling): Generate several candidate videos, score each by motion consistency with the simulator, and keep the best.
- How it works:
- Sample N videos with different random seeds.
- Predict actions and simulate each.
- Use the checker’s score to pick the top match.
- Why it matters: You improve quality without wasting your generation budget on samples you immediately discard—great when you only need a few excellent examples.
🍞 Bottom Bread (Anchor): Like recording multiple takes of a line in a play, then choosing the one that fits the script and emotion best.
Before vs. After:
- Before: Synthetic videos were filtered by general “look plausible?” tests, which missed subtle but crucial motion issues.
- After: RoboCurate verifies actions via simulator-replay consistency, so labels match what the robot actually does.
Why it works (intuition): The simulator acts like a truth mirror for actions: replayed motion is guaranteed to reflect those actions. If the generated video’s motion doesn’t match, either the video is off or the labels are wrong. The attention-based checker focuses on motion and robot geometry, not just appearance, so it can spot misalignments like too-short reaches or off-by-a-bit grasps that matter for success.
Building Blocks:
- Synthetic Data Generation: make candidate videos from images and instructions.
- Inverse Dynamics Model (IDM): convert video frames into action sequences.
- Simulator-Replay Consistency: replay actions and compare motions between two videos.
- Attentive Probe: a tiny cross-attention module over a frozen video encoder that classifies “match” vs “mismatch.”
- Visual Diversity: I2I for varied scenes; V2V for appearance changes while keeping motion.
- Best-of-N Sampling: use the checker as a critic during generation to pick the most reliable sample.
03 Methodology
High-level overview: Input (initial frame + task instruction) → Diversify visuals and tasks (I2I edits, V2V transfer, VLM-generated instructions) → Generate video (I2V diffusion) → Label actions (IDM) → Verify actions (simulate and motion-compare) → Keep good pairs (curated dataset) → Train robot policy (VLA).
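The whole loop above can be sketched in a few lines. Everything here (`generate_video`, `idm`, `simulate`, `checker`, and the threshold) is a hypothetical stand-in stub for illustration, not the paper's released code:

```python
def curate(initial_frames, instructions, generate_video, idm, simulate,
           checker, threshold=0.5):
    """Keep only (video, actions) pairs whose motion the checker verifies.

    All callables are stand-in stubs for the paper's components:
    generate_video = I2V diffusion, idm = inverse dynamics model,
    simulate = simulator replay, checker = motion-consistency probe.
    """
    curated = []
    for frame, instr in zip(initial_frames, instructions):
        video = generate_video(frame, instr)   # imagine the task happening
        actions = idm(video)                   # label the video with actions
        rollout = simulate(actions)            # replay actions in a simulator
        score = checker(video, rollout)        # do the two motions agree?
        if score >= threshold:                 # action-verified: keep the pair
            curated.append((video, actions))
    return curated
```

The key design choice is that filtering happens on the (video, actions) pair, so a sample survives only when the labels and the depicted motion agree.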
Step A: Diversify the starting point (I2I)
- What happens: We take the initial frame and produce edited versions that change the table, lighting, background, and sometimes the target object’s appearance, using edge maps to keep layout realistic.
- Why this step exists: If all videos look similar, the policy overfits to one look and fails when lighting or background changes.
- Example: Original shows a light-wood table and a red mug; edits might create a marble table under warm light or a lab bench under cool, even light. The mug remains a plausible, hand-sized object.
🍞 Top Bread (Hook): Like adding costumes and new sets to the same stage play.
🥬 Filling (The Actual Concept – Image-to-Image Editing, I2I): Change one image into another while keeping structure.
- How it works: Use edge maps to preserve geometry; apply controlled textual prompts to vary surfaces, lighting, and distractors.
- Why it matters: Keeps physics plausible while expanding visual variety.
🍞 Bottom Bread (Anchor): Same scene, new wallpaper and table texture—still the same reachable mug in the same spot.
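A minimal stand-in for the edge maps that anchor these edits. The paper's actual edge detector isn't specified here, so this sketch uses simple finite differences as a placeholder:

```python
import numpy as np

def edge_map(img):
    """Gradient-magnitude edge map used as a structure cue for I2I edits.

    A finite-difference stand-in for whatever edge detector the real
    pipeline uses; `img` is a 2-D grayscale array. The edited image is
    generated conditioned on this map, so geometry stays fixed while
    textures, lighting, and backgrounds change.
    """
    gy, gx = np.gradient(img.astype(float))    # vertical / horizontal gradients
    mag = np.hypot(gx, gy)                     # gradient magnitude per pixel
    return mag / mag.max() if mag.max() > 0 else mag  # normalise to [0, 1]
```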
Step B: Diversify full videos while keeping motion (V2V)
- What happens: We restyle entire videos (textures, colors) but preserve robot motion using edge-conditioned video-to-video transfer.
- Why this step exists: It multiplies diversity without breaking the motion we already labeled.
- Example: The hand reaches, grasps, and lifts along the same path, but the mug’s color changes and the table becomes glass.
🍞 Top Bread (Hook): It’s like colorizing a black-and-white movie without changing the acting.
🥬 Filling (The Actual Concept – Video-to-Video Transfer, V2V): Use one video to create another with changed appearance, preserving motion.
- How it works: Condition on structure cues (like edge maps); vary textures, lighting, and backgrounds.
- Why it matters: The same actions remain valid across styles, so labels can be reused.
🍞 Bottom Bread (Anchor): The robot’s reach-to-grasp timing is identical; only the scene’s look is new.
Step C: Generate task videos (I2V diffusion)
- What happens: From an initial image and a VLM-created instruction, a video generator imagines a plausible robot execution.
- Why this step exists: This is how we get lots of candidate demonstrations quickly.
- Example: Instruction says, “Use the left hand to pick up the red mug.” The video shows a left-hand approach, grasp, and lift.
Step D: Label actions (IDM)
- What happens: The IDM takes frame pairs and predicts the action chunk in between, building a full action sequence.
- Why this step exists: Synthetic videos come without ground-truth actions; we need labels to train a policy.
- Example: Between t and t+H, the IDM predicts joint angles or velocities that move the hand from near the mug to around its handle.
🍞 Top Bread (Hook): Like inferring a recipe from a cooking time-lapse.
🥬 Filling (The Actual Concept – Inverse Dynamics Model, IDM): Predicts the sequence of actions from pairs of observations.
- How it works: Looks at “before” and “after,” and outputs the actions that connect them.
- Why it matters: Converts pretty videos into actionable lessons.
🍞 Bottom Bread (Anchor): From a start frame (hand far) and an end frame (hand on mug), the IDM fills in the precise approach motions.
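The sliding-window labeling described in Step D can be sketched as follows; `idm_step` and the chunk size are illustrative assumptions, not the paper's exact interface:

```python
def label_actions(video_frames, idm_step, horizon=8):
    """Label an action-free video with a sliding-window IDM.

    `idm_step(frame_t, frame_t_plus_H)` is a stand-in for the trained
    IDM: it returns the chunk of `horizon` actions that connect the two
    frames. Chunks are stitched together into the full trajectory.
    """
    actions = []
    for t in range(0, len(video_frames) - horizon, horizon):
        chunk = idm_step(video_frames[t], video_frames[t + horizon])
        actions.extend(chunk)  # append this window's actions to the trajectory
    return actions
```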
Step E: Verify actions via simulator-replay consistency
- What happens: We replay the IDM’s predicted actions in a simulator to render a rollout video that is guaranteed to reflect those actions.
- Why this step exists: To check whether the video’s motion and the action labels truly match.
- Example: If the generated video shows a long reach but the simulator reach is short, that pair is inconsistent and gets filtered out.
🍞 Top Bread (Hook): Like checking homework by plugging the answer back into the equation.
🥬 Filling (The Actual Concept – Simulator-Replay Consistency): Compare motion between the generated video and the simulator rollout of the same actions.
- How it works: Render the simulator with those actions; then judge alignment using a motion-focused checker.
- Why it matters: Validates the actions themselves, not just surface plausibility.
🍞 Bottom Bread (Anchor): If both videos show the hand moving 20 cm to the left and then 10 cm forward, they agree; if not, discard.
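A toy version of the replay check. It simplifies in one important way: it compares extracted end-effector paths directly, whereas the paper's checker compares rendered videos. The names `step` and `tol` are invented for illustration:

```python
import numpy as np

def replay_consistent(video_traj, actions, step, tol=0.05):
    """Toy simulator-replay consistency check.

    `video_traj` is the end-effector path extracted from the generated
    video (an assumption made for this sketch); `step(state, action)` is
    a stand-in integrator for the real simulator. We replay the IDM's
    actions from the video's starting state and compare the two paths.
    """
    state = np.asarray(video_traj[0], dtype=float)
    rollout = [state]
    for a in actions:                          # replay the predicted actions
        state = step(state, np.asarray(a, dtype=float))
        rollout.append(state)
    rollout = np.stack(rollout)
    # mean pointwise distance between the video path and the replayed path
    err = np.linalg.norm(rollout - np.asarray(video_traj), axis=-1).mean()
    return bool(err <= tol)                    # consistent → keep the pair
```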
Step F: Motion-consistency checker (attentive probe)
- What happens: A frozen video encoder extracts features; a small cross-attention layer with a learnable query token focuses on whether the pair’s motions and robot geometry agree, outputting a probability.
- Why this step exists: Simple similarity can be fooled by appearance; attention helps pick out motion cues and timing.
- Example: It detects that, in one video, the grasp happens 4 frames too early relative to the rollout, flagging it as inconsistent.
🍞 Top Bread (Hook): Like a coach who watches two athletes do the same drill and points out subtle timing differences.
🥬 Filling (The Actual Concept – Attentive Probe): A lightweight attention module on top of a frozen video encoder that classifies “match” vs “mismatch.”
- How it works: Cross-attends between embeddings of the two clips and outputs an alignment score.
- Why it matters: Sensitive to small but important motion mismatches (reach length, grasp timing, path shape).
🍞 Bottom Bread (Anchor): It notices the hand didn’t move far enough before closing fingers in one video, even though both look plausible.
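A minimal numpy sketch of such a probe. The shapes, the single learnable query token, and the linear head are illustrative guesses rather than the paper's exact architecture; the clip features would come from a frozen video encoder:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_probe(gen_feats, sim_feats, query, w_out):
    """Score how well two clips' motions agree (1.0 = perfect match).

    `gen_feats` (T1, d) and `sim_feats` (T2, d) are frozen-encoder
    features of the generated video and the simulator rollout. One
    learnable `query` (1, d) cross-attends over both clips, and a
    linear head `w_out` (d,) maps the pooled vector to a probability.
    """
    feats = np.concatenate([gen_feats, sim_feats], axis=0)  # (T1+T2, d)
    d = feats.shape[-1]
    attn = softmax(query @ feats.T / np.sqrt(d))            # (1, T1+T2)
    pooled = (attn @ feats)[0]                              # (d,) pooled feature
    logit = float(pooled @ w_out)                           # scalar logit
    return 1.0 / (1.0 + np.exp(-logit))                     # match probability
```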
Step G: Best-of-N sampling (optional during generation)
- What happens: Sample multiple candidates, score each with the checker, keep the best-matching one.
- Why this step exists: Improve quality without needing to massively over-generate and throw away data.
- Example: From 5 takes, pick the one whose timing and path best match the simulator rollout.
🍞 Top Bread (Hook): Like shooting several practice free throws and counting the swish.
🥬 Filling (The Actual Concept – Best-of-N Sampling): Choose the candidate with the highest action-consistency score.
- How it works: Score each candidate via the attentive probe; pick the max.
- Why it matters: Raises average quality, especially in data-scarce, task-specific settings.
🍞 Bottom Bread (Anchor): For “pour can,” keep the one whose tilt and lift match the simulated pouring arc.
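The selection loop itself is simple; all callables below are hypothetical stubs for the paper's components, and the seed handling is an assumption about how candidates are varied:

```python
def best_of_n(frame, instruction, generate_video, idm, simulate, checker, n=5):
    """Best-of-N sampling with the consistency checker as critic.

    Generate `n` candidate videos with different random seeds, label and
    replay each, and keep the (video, actions) pair whose motion best
    matches its simulator rollout.
    """
    best_score, best_pair = float("-inf"), None
    for seed in range(n):                                # n candidate takes
        video = generate_video(frame, instruction, seed=seed)
        actions = idm(video)
        score = checker(video, simulate(actions))        # critic score
        if score > best_score:                           # keep the top pair
            best_score, best_pair = score, (video, actions)
    return best_pair, best_score
```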
Secret sauce:
- Dual engine: (1) Action-verified filtering ensures labels match motion; (2) Visual diversity teaches robustness to appearance changes.
- Together they turn synthetic data from “looks okay” into “moves right,” which directly boosts task success when training the final vision-language-action (VLA) policy.
🍞 Top Bread (Hook): Like a student who practices the same skill in many classrooms but is graded only when the steps are truly correct.
🥬 Filling (The Actual Concept – Enhancing Observation Diversity): Making sure robots see many different examples of the same skill.
- How it works: I2I changes scenes; V2V changes appearances; instructions cover many task variations.
- Why it matters: The robot focuses on core motion, not distractions.
🍞 Bottom Bread (Anchor): Whether the table is wood or marble, the robot still reaches, grasps, and places correctly.
04 Experiments & Results
The test: Can action-verified, visually diverse synthetic data improve robot policies in both simulation and the real world? The authors measure task success rates across benchmarks and compare against baselines that use real data only or synthetic data without action-level filtering or diversity.
The competition (baselines):
- Real only: Train on real demos (strong but data-limited).
- DreamGen (prior pipeline): Generate videos and label with IDM, but no I2I/V2V diversity or action verification.
- RoboCurate (ours): Add I2I and V2V diversity; verify actions with simulator-replay consistency; optionally use Best-of-N.
Scoreboard with context:
- GR-1 Tabletop (24 tasks; 300 demos):
  - Real: 15.4% avg success.
  - DreamGen: 19.5% (small bump).
  - RoboCurate (diversity only): 22.7%.
  - RoboCurate (diversity + filtering): 26.2%.
  - Context: Jumping from 15.4% to 26.2% is like going from a D+ to a solid B—without collecting extra real demos.
- GR-1 Tabletop (1,000 demos):
  - Real: 30.3%.
  - DreamGen: 32.2%.
  - RoboCurate (diversity only): 34.8%.
  - RoboCurate (diversity + filtering): 37.9%.
  - Context: Even with more real data, RoboCurate keeps adding value, like consistent extra credit that raises the final grade.
- DexMimicGen (6 tasks; 100 demos each):
  - Real: 44.6%.
  - DreamGen: 46.4%.
  - RoboCurate (diversity only): 49.3%.
  - RoboCurate (diversity + filtering): 51.8%.
  - Context: The final 51.8% is about +16.1% relative to real-only—strong considering these are dexterous, bimanual tasks.
- Real-world ALLEX humanoid (3 tasks; 48 real demos only for the seen task):
  - Real: 13.9% avg.
  - DreamGen: 27.8%.
  - RoboCurate with Best-of-N (action checker as critic): 38.9%.
  - Context: That’s +179.9% relative to real-only—like turning occasional wins into regular successes. It even enabled emergent success on brand-new actions (0.0% to 25.0%).
Surprising findings:
- Action-level filtering beats video-only physics checks: When compared to methods that ask a VLM if a video “looks physically plausible,” RoboCurate’s action-verified filtering wins—because task-critical motion (reach length, grasp timing) matters more than superficial plausibility.
- Human judgment can be noisy on subtle motion: Training the checker with carefully constructed positive/negative pairs from real data and simulator rollouts outperforms small sets of human binary labels.
- Diversity compounds gains: Task diversity (varying behavior, target object, placement, hand) steadily improves results; adding visual diversity (I2I/V2V) adds another bump at the same dataset size.
What changes because of RoboCurate:
- Policies trained with RoboCurate generalize better across embodiments (e.g., pre-train on GR-1 humanoid, fine-tune on bimanual Panda arms) and views.
- In low-data regimes (300 demos) the relative gain is largest, showing the approach is especially helpful when real data is scarce.
- In the real world, Best-of-N guided by the action checker acts like an on-the-fly quality gate, making each synthetic sample count more.
Takeaway numbers you can remember:
- +70.1% relative gain on GR-1 Tabletop with 300 demos.
- +16.1% on DexMimicGen pre-training.
- +179.9% on real ALLEX humanoid co-finetuning.
- Strong OOD results: +162.3% on novel object pick-and-place; brand-new action success emerges (0%→25%).
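These "relative gain" numbers follow from simple arithmetic on the absolute success rates reported above:

```python
def relative_gain(baseline, new):
    """Relative improvement in percent, e.g. 15.4% → 26.2% success
    is a (26.2 - 15.4) / 15.4 * 100 ≈ +70.1% relative gain."""
    return (new - baseline) / baseline * 100.0
```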
05 Discussion & Limitations
Limitations:
- Simulator dependence: The method relies on having a simulator close enough to the real robot and task to serve as a reliable motion mirror. If the sim is too different or missing objects/contacts, the consistency score may be misleading.
- Video generator quality: If the generator repeatedly produces severe artifacts (tele-grabs, clipping), you’ll either filter out most data (lower yield) or risk keeping some noise.
- IDM accuracy: Poor IDM predictions will also be rejected by the checker, but if IDM fails systematically, you may filter out too much.
- Compute and tooling: I2I/V2V, simulation rollouts, and pairwise checking add overhead. It’s still cheaper than massive real data collection, but requires solid infrastructure.
Required resources:
- A reasonably accurate simulator for the target embodiment and tasks.
- A trained or fine-tuned IDM for the embodiment.
- Access to a competent I2V generator, plus optional I2I and V2V tools.
- Modest training for the attentive probe on top of a frozen video encoder.
When NOT to use:
- Tasks with physics the simulator cannot represent (deformable foods, complex fluids beyond simple pouring) where rollouts won’t reflect reality.
- Settings without reliable initial frames/instructions or where the video generator can’t meaningfully follow the prompt (you’ll filter away too much).
- Ultra time-critical deployments where added generation and filtering latency isn’t acceptable.
Open questions:
- Richer alignment signals: Beyond binary consistency, can we score partial matches (good reach, weak grasp) and fix them automatically?
- Multi-view and 3D cues: Would depth or multi-camera inputs help the checker detect subtle spatial errors better?
- Action repair: Instead of discarding, can we adjust the actions or edit the video to restore consistency?
- Beyond rigid objects: How to extend verification to soft materials, cloth, or high-contact tasks where sim gaps are larger?
- Unified critic: Can a single learned critic replace both VLM plausibility checks and motion consistency with grounded physics reasoning?
06 Conclusion & Future Work
Three-sentence summary: RoboCurate generates many synthetic robot task videos, labels them with predicted actions, and then replays those actions in a simulator to check that motions match. It keeps only the action-verified pairs and also expands visual diversity with image and video edits that preserve motion, producing high-quality, varied training data. As a result, robot policies learn faster and generalize better in both simulation and the real world.
Main achievement: Turning action verification into a visual motion-matching problem—by comparing generated videos with simulator rollouts—and combining that with controllable visual diversity (I2I and V2V) to create reliably useful synthetic datasets.
Future directions: Build richer, graded consistency scores to repair data instead of discarding it; incorporate multi-view or depth to sharpen motion checks; extend to deformable and highly contact-rich tasks; and unify critics that understand both language instructions and physical dynamics. Exploring closed-loop generation—where the checker steers the video generator in real time—could further raise data quality.
Why remember this: Because it shifts the focus from “does the video look okay?” to “do the actions really match?”—a simple but powerful idea that turns synthetic data into dependable lessons. With cleaner labels and broader visual variety, robots trained with RoboCurate succeed more often, even on new objects or behaviors, while saving time and cost on real-world data collection.
Practical Applications
- Augment small real robot datasets with action-verified synthetic demos to boost task success.
- Generate diverse tabletop scenes (tables, lights, backgrounds) to make policies robust to visual changes.
- Use Best-of-N sampling during video generation to select the most action-consistent synthetic trajectory for each task.
- Pre-train a generalist VLA on curated neural trajectories, then fine-tune quickly on new robots or camera views.
- Improve out-of-distribution performance (new objects or behaviors) without collecting extra real demos.
- Curate third-party synthetic datasets by filtering with simulator-replay consistency before policy training.
- Accelerate domain adaptation by restyling videos with V2V while preserving motion labels.
- Diagnose failure cases by inspecting low-scoring pairs to see if issues come from the generator, IDM, or sim mismatch.
- Use I2I to build counterfactual initial frames that cover rare backgrounds or lighting not seen in real data.
- Prioritize data generation budgets toward tasks that benefit most from action-verified synthetic augmentation.