MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Key Summary
- Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.
- MIND-V is a three-part "brain-to-muscle" system that plans a task, turns the plan into a clear blueprint, and then makes a video that follows the plan.
- It keeps long stories consistent by breaking big tasks into small subtasks and checking each step before moving on.
- A special "physics referee" reward (PFC) teaches the video generator to follow real-world rules like not teleporting objects or making them pass through solid things.
- The system uses reinforcement learning (GRPO) to steadily prefer videos that look better and obey physics more closely.
- Compared to strong baselines, MIND-V makes longer, clearer, and more physically believable robot videos.
- It scales to long tasks because it plans and renders one subtask at a time while reusing memory.
- When key parts are removed (like physics alignment or staged rollouts), performance drops a lot, showing those parts matter.
- This can create safe, high-quality training videos for teaching real robots complex chores.
- The approach shows how to connect high-level reasoning with pixel-by-pixel video generation for reliable long-horizon control.
Why This Research Matters
Robots that learn from realistic, long, and safe videos can be trained faster and more reliably than if humans had to demonstrate every task. MIND-V makes such videos by combining smart planning, a clear action blueprint, and a physics-aware checker that discourages impossible motion. This helps build better home assistants, warehouse pickers, and lab robots without risking real hardware during early learning. It also reduces the need for expensive manual annotations like hand-drawn trajectories. As a general recipe—think, blueprint, render, check—it can inspire other areas of AI that need long, reliable, instruction-following behavior. In short, it narrows the gap between language goals and real-looking actions that obey everyday physics.
Detailed Explanation
01 Background & Problem Definition
You know how learning to juggle takes lots of practice videos, not just a few snapshots? Robots are the same: to imitate humans well, they need many examples of how tasks unfold over time—especially long chores like “clear the table, wash the cup, and put it on the rack.” But the world before this paper looked like this: collecting long, diverse robot videos was slow, expensive, and limited. Video generators could make short, pretty clips, but they often forgot the story mid-way, broke physics (objects vanished, hands clipped through bowls), or failed to follow complex instructions. That was a big roadblock for teaching robots reliably.
🍞 Hook: Imagine you’re playing a long board game with many turns. If you mess up the rules on turn 2, the whole game falls apart later.
🥬 The Concept (Reinforcement learning): Reinforcement learning is a way for computers to learn by trying things and getting rewards when they do well.
- How it works: 1) Try an action, 2) Get a score (reward), 3) Adjust to do better next time, 4) Repeat.
- Why it matters: Without rewards, the system can’t tell good behavior (like realistic motion) from bad (like teleporting cups).
🍞 Anchor: A robot learner tries to pick up a spoon. If it lifts the spoon smoothly, it gets a higher score; if the spoon jumps around, it gets a low score, so it learns smoother moves next time.
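To make this try-score-adjust loop concrete, here is a minimal Python sketch. Everything in it (the two candidate "actions", the reward function, the epsilon-greedy rule) is a toy assumption for illustration, not MIND-V's actual training code:

```python
import random

# Toy reward loop: two hypothetical ways to lift a spoon; smooth lifting earns more reward.
actions = ["smooth_lift", "jerky_lift"]
value = {a: 0.0 for a in actions}   # learned estimate of how good each action is
counts = {a: 0 for a in actions}

def reward(action):
    # Stand-in reward: smooth lifts score higher, with a bit of noise.
    return (1.0 if action == "smooth_lift" else 0.2) + random.uniform(-0.1, 0.1)

for step in range(500):
    # 1) Try an action (mostly the current best, sometimes explore).
    a = random.choice(actions) if random.random() < 0.1 else max(value, key=value.get)
    r = reward(a)                                # 2) get a score
    counts[a] += 1
    value[a] += (r - value[a]) / counts[a]       # 3) adjust toward the reward, 4) repeat

print(value)  # the smooth lift ends up with the higher value
```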
🍞 Hook: Think of a GPS that draws the exact route you should drive.
🥬 The Concept (Trajectory-control models): These models guide where and how something should move to complete a task carefully and precisely.
- How it works: 1) Define the path, 2) Follow it step by step, 3) Check that each step matches the path, 4) Reach the goal.
- Why it matters: Without paths, motion can wobble or go the wrong way, especially in tight spaces.
🍞 Anchor: The GPS says “turn right in 50 meters,” helping you take the correct lane instead of drifting randomly.
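Here is a toy path-following sketch in Python; the 2D waypoints, the execution noise, and the drift tolerance are all hypothetical, just to show the define-follow-check loop:

```python
import math
import random

# Follow a planned path step by step and check for drift at every step.
waypoints = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.2), (1.0, 0.6)]   # 1) define the path

position = waypoints[0]
for start, end in zip(waypoints, waypoints[1:]):
    for step in range(1, 11):                                   # 2) follow it in small steps
        t = step / 10
        target = (start[0] + (end[0] - start[0]) * t,
                  start[1] + (end[1] - start[1]) * t)
        # Pretend the "gripper" reaches the target with a little execution noise.
        position = (target[0] + random.uniform(-0.005, 0.005),
                    target[1] + random.uniform(-0.005, 0.005))
        drift = math.dist(position, target)                     # 3) check each step
        if drift > 0.02:
            print("off path, correcting")                       # a real controller would correct
print("reached goal near", tuple(round(c, 2) for c in position))  # 4) reach the goal
```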
🍞 Hook: You know how comic books have both pictures and speech bubbles? You understand the scene better when you read both.
🥬 The Concept (Vision-Language Model, VLM): A VLM understands images and text together, so it can read a picture and a sentence like “put the towel in the pot” and link them.
- How it works: 1) Look at the scene, 2) Read the instruction, 3) Find the right objects, 4) Plan steps to complete the task.
- Why it matters: Without this, the system might mix up objects or misunderstand the goal.
🍞 Anchor: Show a VLM a kitchen photo and say “grab the spoon.” It figures out which item is the spoon and where it is.
The problem researchers faced had three parts: (1) Long-horizon coherence: keep a multi-step story logically correct from start to finish; (2) Semantic-to-pixel gap: turn abstract words like “clean the floor” into exact, frame-by-frame visuals; (3) Physical plausibility: make sure videos obey the rules of the real world (no teleporting, no clipping through solid surfaces, objects persist when occluded). People tried two directions. Big video models guided only by text often looked nice but lost track of goals over time. Models that required manual paths or masks had great control, but needed lots of human effort, so they didn’t scale. The gap was a missing bridge between high-level reasoning and low-level pixel control that also checks physics.
Why should you care? Because teaching robots from safe, on-demand, realistic videos could help with home assistance, warehouse sorting, and lab automation—without risking real accidents and with far less time spent recording demonstrations. This paper’s promise is to make long, physically believable robot videos that follow instructions step by step, so future robots can learn complex chores more reliably.
02 Core Idea
The “Aha!” in one sentence: Split the job like a brain—plan with language and vision at the top, translate that plan into a universal action blueprint in the middle, and render believable motion at the pixel level at the bottom—then use a physics referee to nudge the whole system toward real-world behavior.
Three analogies:
- Orchestra: The conductor (high-level planner) decides the music, the sheet music (action blueprint) tells each section what to play, and the musicians (video generator) perform; a sound engineer (physics referee) keeps the performance from distorting.
- Cooking: The recipe (plan), the shopping list and prep steps (blueprint), and the actual cooking (rendering) combine to make a meal; the food safety inspector (physics referee) ensures safe, real procedures.
- Road trip: The navigator picks the stops (plan), the turn-by-turn directions (blueprint) guide each segment, the car drives (rendering), and a safety system (physics referee) prevents dangerous maneuvers.
Before vs After:
- Before: Long videos drift off-task, break physics, or need manual trajectories.
- After: The system (MIND-V) plans tasks, builds a structured guide, renders a video that follows that guide, and double-checks physics at each stage—resulting in longer, clearer, rule-following videos.
🍞 Hook: Think of a coach who breaks a big game strategy into precise plays.
🥬 The Concept (Semantic Reasoning Hub, SRH): SRH is the “thinking center” that reads the instruction and the scene, then breaks a big job into small, ordered subtasks and drafts smooth paths.
- How it works: 1) Read the scene and command with a VLM, 2) Split the job into atomic steps (pick, place, etc.), 3) Find objects and safe contact points, 4) Plan smooth, non-colliding trajectories.
- Why it matters: Without SRH, the system has to guess the whole multi-step task at once and often drifts, missing objects or mixing up steps.
🍞 Anchor: For “put the spoon and towel in the pot,” SRH plans “pick spoon → place in pot → pick towel → cover spoon.”
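A minimal sketch of what SRH's output might look like as data. The {action, object, destination} fields follow the paper's description, but the exact schema and values here are assumptions for illustration:

```python
instruction = "put the spoon and towel in the pot"

# Ordered subtasks, each an {action, object, destination} triple.
plan = [
    {"action": "pick",  "object": "spoon", "destination": None},
    {"action": "place", "object": "spoon", "destination": "pot"},
    {"action": "pick",  "object": "towel", "destination": None},
    {"action": "place", "object": "towel", "destination": "pot"},  # covers the spoon
]

for i, subtask in enumerate(plan, 1):
    target = f" -> {subtask['destination']}" if subtask["destination"] else ""
    print(f"subtask {i}: {subtask['action']} {subtask['object']}{target}")
```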
🍞 Hook: Imagine a blueprint that turns an idea into buildable steps.
🥬 The Concept (Behavioral Semantic Bridge, BSB): BSB is a structured blueprint that turns the SRH’s plan into masks, paths, and timing the video generator can follow.
- How it works: 1) Encode object masks (who’s who), 2) Provide a trajectory split into pre-contact, contact, and post-contact, 3) Mark phase transition frames.
- Why it matters: Without BSB, the video generator doesn’t know which pixels to move when, causing identity swaps or shaky motion.
🍞 Anchor: The BSB says “this mask is the spoon; move it from here to there from frame 10 to 25; the gripper approaches before, grasps during, and retracts after.”
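A sketch of the kind of structured package a BSB could be. Field names, mask placeholders, and frame numbers are assumptions for illustration, not the paper's exact format:

```python
from dataclasses import dataclass, field

@dataclass
class BSB:
    object_masks: dict                 # object name -> mask (who's who in the scene)
    trajectory: list                   # ordered (x, y) waypoints for the gripper
    phase_frames: dict = field(default_factory=dict)  # phase -> (start_frame, end_frame)

bsb = BSB(
    object_masks={"spoon": "mask_spoon.png", "robot_arm": "mask_arm.png"},  # placeholders
    trajectory=[(120, 340), (180, 300), (240, 260), (260, 220)],
    phase_frames={"pre_contact": (1, 12), "contact": (13, 25), "post_contact": (26, 37)},
)

print("grasp happens during frames", bsb.phase_frames["contact"])
```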
🍞 Hook: Think of an animator who draws each frame so the hero moves exactly along a path.
🥬 The Concept (Motor Video Generator, MVG): MVG is a conditional video generator that turns the BSB blueprint into a realistic video, frame by frame.
- How it works: 1) Take masks and paths from BSB, 2) Encode them into a motion guidance signal, 3) Inject that signal into a diffusion transformer at every step, 4) Denoise into a crisp, path-following video.
- Why it matters: Without correct conditioning, the video looks nice but doesn’t actually do the instructed task.
🍞 Anchor: Given the BSB, MVG renders the spoon being grasped and lowered into the metal pot with steady, smooth hand motion.
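A toy sketch of the "inject guidance into denoising" idea. The tiny latent video, the guidance region, and the one-line denoise step are stand-ins; the real MVG is a diffusion transformer conditioned on the full BSB:

```python
import numpy as np

T, H, W = 8, 16, 16                       # a tiny "video" of 8 frames of 16x16 latents
guidance = np.zeros((T, H, W))            # spatiotemporal guidance built from the BSB
guidance[:, 6:10, 4:12] = 1.0             # e.g., "the object should occupy this region"

def denoise_step(x, guidance, strength=0.1):
    # Stand-in for one diffusion step: pull the latent toward the guided target.
    return x - strength * (x - guidance)

x = np.random.randn(T, H, W)              # start from pure noise
for _ in range(50):                       # iterative denoising, guided at every step
    x = denoise_step(x, guidance)

print("mean value inside guided region:", round(float(x[:, 6:10, 4:12].mean()), 2))
```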
🍞 Hook: Before a big trip, you preview a few routes and pick the safest one.
🥬 The Concept (Staged Visual Future Rollouts): At each subtask boundary, the system proposes multiple short futures, checks them, and picks or refines the best one.
- How it works: 1) Propose K candidate mini-videos, 2) Judge success, physics, and quality, 3) Keep the best or replan, 4) Move to next subtask.
- Why it matters: Without rollouts, early mistakes snowball and ruin the rest of the task.
🍞 Anchor: If a grasp looks off-center in all candidates, the system asks SRH to adjust the approach and try again.
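A minimal sketch of the propose-verify-pick loop. The candidate generator, the scoring, and the pass threshold are made up; in MIND-V the candidates are short rendered clips judged on success, physics, and quality:

```python
import random

def generate_candidate(plan):
    # Stand-in for rendering one short candidate clip from the current plan.
    return {"plan": plan, "quality": random.random(), "physics": random.random()}

def passes(c, threshold=0.5):
    return c["quality"] > threshold and c["physics"] > threshold

def rollout_subtask(plan, K=3, max_replans=2):
    for _ in range(max_replans + 1):
        candidates = [generate_candidate(plan) for _ in range(K)]   # 1) propose K futures
        good = [c for c in candidates if passes(c)]                 # 2) judge them
        if good:                                                    # 3) keep the best one
            return max(good, key=lambda c: c["quality"] + c["physics"])
        plan = plan + " (adjusted approach)"                        # ...otherwise replan and retry
    raise RuntimeError("no acceptable candidate after replanning")

best = rollout_subtask("pick spoon")
print("accepted candidate:", best)        # 4) accept it and move to the next subtask
```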
🍞 Hook: You know how a physics teacher can tell if a drawing of a bouncing ball makes sense?
🥬 The Concept (World model): A world model is a learned simulator inside the system's "head" that predicts what should happen next from what it sees now.
- How it works: 1) Encode recent frames, 2) Predict future features, 3) Compare to actual video features.
- Why it matters: Without this, it’s hard to score whether motion follows real physics.
🍞 Anchor: If the ball “teleports,” the world model’s prediction won’t match the actual future, signaling a physics problem.
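Here is a tiny sketch of that prediction-versus-reality check. The "world model" is just linear extrapolation of two feature vectors, standing in for a learned predictor:

```python
import numpy as np

def world_model(past_features):
    # Trivial stand-in: assume smooth continuation of the last two feature vectors.
    return past_features[-1] + (past_features[-1] - past_features[-2])

past = [np.array([0.0, 1.0]), np.array([0.1, 1.0])]   # features of the last two frames
predicted = world_model(past)                         # what "should" come next

futures = {
    "smooth motion": np.array([0.2, 1.0]),   # the object keeps moving as expected
    "teleport":      np.array([3.0, 1.0]),   # the object suddenly jumps
}
for name, actual in futures.items():
    error = float(np.linalg.norm(predicted - actual))
    print(f"{name}: prediction error {error:.2f}")    # a big mismatch signals a physics problem
```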
🍞 Hook: Think of a fair referee who rewards safe, realistic play.
🥬 The Concept (Physical Foresight Coherence, PFC): PFC is a reward that uses a world model to measure how physically believable a generated video is.
- How it works: 1) Slide over the video in windows, 2) Predict future features, 3) Compare with the features of what actually happens, 4) Score each window by how well it matches, weighting the worst-matching windows most heavily.
- Why it matters: Without PFC, the system might “cheat” visually but break real-world rules.
🍞 Anchor: A spoon that slowly settles into a pot scores well; a spoon that blurs through the pot rim scores poorly.
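A sketch of a PFC-style score in Python. The cosine similarity per window and the soft-min weighting that emphasizes the worst windows are assumptions chosen to match the description above, not the paper's exact formula:

```python
import numpy as np

def window_score(predicted, actual):
    # Similarity between predicted and actual future features, mapped to [0, 1].
    cos = np.dot(predicted, actual) / (np.linalg.norm(predicted) * np.linalg.norm(actual))
    return (cos + 1) / 2

def pfc(predicted_windows, actual_windows, temperature=0.1):
    scores = np.array([window_score(p, a) for p, a in zip(predicted_windows, actual_windows)])
    weights = np.exp(-scores / temperature)     # soft-min: the worst windows get the most weight
    weights /= weights.sum()
    return float((weights * scores).sum())

pred = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]      # world-model predictions per window
good = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]      # close to prediction -> high PFC
bad  = [np.array([0.9, 0.1]), np.array([0.1, -0.9])]     # one physics-breaking window
print("plausible video:", round(pfc(pred, good), 3))
print("implausible video:", round(pfc(pred, bad), 3))
```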
🍞 Hook: Training a puppy works better with treats for good behavior.
🥬 The Concept (GRPO reinforcement learning): GRPO is a way to tweak the video generator by sampling groups of videos, scoring them (with PFC and aesthetics), and nudging the model toward the higher-scoring ones while staying close to the safe starting policy.
- How it works: 1) Generate a group of videos, 2) Normalize rewards within the group, 3) Push up the better ones, 4) Keep changes stable using a KL “stay-close” rule.
- Why it matters: Without GRPO, you can’t steadily align the model to physics and quality.
🍞 Anchor: Over training, the generator prefers grasp videos that look clear and obey contact rules, not flashy but impossible moves.
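To show the group-normalize-and-stay-close idea, here is a minimal numeric sketch of a GRPO-style update for one group of sampled videos. The rewards, log-probabilities, and the KL weight are made-up numbers, and the real method operates on diffusion policies rather than these toy scalars:

```python
import numpy as np

rewards = np.array([0.62, 0.41, 0.55, 0.48])               # PFC + aesthetics per sampled video
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # 2) normalize within the group

logp_new = np.array([-10.2, -11.0, -10.5, -10.8])  # log-prob of each sample, current policy
logp_old = np.array([-10.3, -10.9, -10.6, -10.7])  # ...under the policy that generated them
logp_ref = np.array([-10.4, -10.8, -10.6, -10.9])  # ...under the frozen SFT reference

ratio = np.exp(logp_new - logp_old)                 # importance ratio per sample
policy_term = float((ratio * adv).mean())           # 3) push up the higher-scoring samples

kl_ratio = np.exp(logp_ref - logp_new)
kl_term = float((kl_ratio - np.log(kl_ratio) - 1).mean())  # 4) estimated KL to the reference

beta = 0.05                                         # "stay-close" strength
objective = policy_term - beta * kl_term            # maximized during post-training
print(f"GRPO objective for this group: {objective:.4f}")
```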
Why it works (intuition): Separate thinking from doing (SRH vs MVG), use a universal blueprint (BSB) to prevent confusion, check each step before committing (rollouts), and train with a physics-conscious reward (PFC+GRPO). Together these parts keep long tasks coherent, faithful to instructions, and physically sensible.
Building blocks: SRH → BSB → MVG, plus Rollouts at test time, and GRPO+PFC post-training to align with physics.
03 Methodology
At a high level: Instruction + Scene → SRH (plan subtasks) → BSB (masks, paths, timing) → MVG (render video) → Rollouts (check/correct per subtask) → Output long, coherent video. After supervised fine-tuning, GRPO post-training uses PFC to better align with physics.
Step A: Semantic Reasoning Hub (SRH)
- What happens: The VLM reads the instruction (e.g., “first put the spoon in the pot, then cover it with the towel”) and the initial image. It decomposes the request into ordered subtasks: {action, object, destination}. It grounds objects (e.g., the spoon’s handle) and proposes a smooth, collision-free trajectory per subtask.
- Why it exists: Without SRH, the generator gets only abstract text and can easily mis-handle ordering or targets.
- Example: For “spoon → pot; towel → cover,” SRH plans approach–grasp–place for the spoon, then approach–place for the towel, each with its own path and timing.
Step B: Behavioral Semantic Bridge (BSB)
- What happens: SRH’s plan is turned into a structured, domain-invariant package. It includes (1) object masks (e.g., spoon and robot arm), (2) a trajectory split into three phases—pre-interaction (approach), interaction (manipulate), post-interaction (retract), and (3) phase transition frames.
- Why it exists: The MVG needs precise, time-aligned signals about who moves, where, and when. This prevents identity swaps and keeps motion smooth across frames.
- Example: The spoon mask is constant, the gripper path is marked from frames 1–12 (approach), 13–25 (grasp and move), and 26–37 (retract).
Step C: Motor Video Generator (MVG)
- What happens: The MVG is a diffusion transformer conditioned on the BSB. It encodes the BSB into a spatiotemporal guidance tensor, computes a motion embedding, and injects it into the generator’s layers during denoising. This steers each frame to follow the planned path.
- Why it exists: Unconditioned or text-only generation may make pretty motion that ignores the exact task.
- Example: With a 37-frame subtask at 480×640 resolution, MVG produces a clip where the gripper follows the BSB path, the spoon is consistently visible, and the background remains stable.
Step D: Supervised Fine-Tuning (SFT)
- What happens: Start from a strong open-source video model and fine-tune it on short robot clips paired with ground-truth BSBs (e.g., from Bridge v2). The loss teaches the model to map BSB → correct video.
- Why it exists: It gives a safe, competent starting point (reference policy) before RL tuning.
- Example: Train on many short “pick-and-place” segments so the model learns clean grasp and move patterns.
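A toy sketch of the SFT idea: given an encoded BSB, learn to reproduce the paired ground-truth clip. The linear "generator" and squared-error loss are stand-ins for the real diffusion transformer and its training objective:

```python
import numpy as np

rng = np.random.default_rng(0)
bsb_features = rng.normal(size=16)       # encoded BSB (masks, paths, timing) for one clip
target_video = rng.normal(size=32)       # features of the paired ground-truth clip

weights = np.zeros((32, 16))             # toy "generator" parameters
lr = 0.1
for step in range(200):
    prediction = weights @ bsb_features                  # generate from the BSB
    error = prediction - target_video
    loss = float((error ** 2).mean())                    # supervised loss: BSB -> correct video
    weights -= lr * np.outer(error, bsb_features) / error.size   # gradient step
print("final SFT loss:", round(loss, 5))
```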
Step E: GRPO Post-Training with the PFC Reward
- What happens: Treat video generation as a policy. Sample a group of candidate videos, score each using a combined reward: physics via PFC (from a frozen world model, V-JEPA2) and aesthetics (clarity, artifacts). Normalize scores within the group, then push the model toward higher-scoring samples while keeping it close to the SFT policy.
- Why it exists: Simple pixel losses don’t capture “does this obey physics?” PFC measures dynamic realism in feature space, catching violations like teleporting or ghosting.
- Example: A grasp that clips through the pot rim gets a low physics score; a clean contact and gentle settle gets a high one.
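A sketch of scoring and ranking one group of candidates with a combined reward. The 0.7/0.3 weighting between physics and aesthetics is an assumption for illustration; the paper combines both signals but does not necessarily use these weights:

```python
candidates = [
    {"name": "clean grasp, gentle settle",  "pfc": 0.46, "aesthetic": 0.72},
    {"name": "clips through the pot rim",   "pfc": 0.31, "aesthetic": 0.74},  # pretty but impossible
    {"name": "correct but slightly blurry", "pfc": 0.44, "aesthetic": 0.55},
]

w_physics, w_aesthetic = 0.7, 0.3          # assumed weights, for illustration only
for c in candidates:
    c["reward"] = w_physics * c["pfc"] + w_aesthetic * c["aesthetic"]

for c in sorted(candidates, key=lambda c: c["reward"], reverse=True):
    print(f"{c['name']}: combined reward {c['reward']:.3f}")
```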
Step F: Staged Visual Future Rollouts (Test-Time Optimization)
- What happens: At each subtask boundary, propose K short candidate futures, render them, and judge success, physics, and visual quality. If none pass, feed structured feedback to SRH to replan and retry. If one passes, accept it and proceed.
- Why it exists: Long tasks suffer from error accumulation. Rollouts catch small mistakes early and prevent cascade failures.
- Example: If the spoon looks off-center during grasp in all candidates, SRH adjusts the approach point and tries again until a clean grasp is found.
Secret Sauce:
- The middle “bridge” (BSB) decouples meaning from pixels, so the generator stays on-script.
- Physics-aware rewards (PFC) guide the model using learned dynamics, not hand-coded rules.
- Test-time rollouts add a self-check loop, which is rare for video generators but crucial for long-horizon success.
Data and settings (illustrative):
- Dataset: Bridge v2. Training on 37-frame subtasks at 480×640. Inference can chain 3 subtasks (about 111 frames) in ~180s using ~50 GB VRAM; peak VRAM can stay nearly constant across longer tasks by reusing memory per subtask. Rollouts often use K=3 as a good balance of quality and cost.
Putting it all together: From input text and a scene image, SRH writes the plan, BSB codifies it, MVG draws it, GRPO+PFC keeps it realistic, and rollouts keep it on track for the whole journey.
04 Experiments & Results
The test: Can MIND-V make long, instruction-following, physically plausible robot videos better than strong baselines? Researchers evaluated two regimes. For short subtasks, they checked visual quality (aesthetic look, smooth motion, lack of flicker). For long-horizon tasks (2–4 subtasks), they also measured task success (did each subtask actually complete?) and a physics score, the Physical Foresight Coherence (PFC), which compares a world model's predicted future features with the features of what the generated video actually shows.
The competition: They compared against trajectory-free world-model-style generators (RoboDreamer, WoW, Wan2.2, HunyuanVideo) and, for short segments, also looked at controllable baselines (e.g., MotionCtrl, Tora) that get extra help like manual trajectories or masks at inference time. MIND-V, by design, does not rely on privileged manual guidance at inference.
The scoreboard (with context):
- Long-horizon physics (PFC): MIND-V scored about 0.446. Think of that as getting an A- while others hovered around a B-/C+ (0.402–0.423). That means its videos matched learned physical dynamics more often.
- Task success rate: MIND-V reached about 61.3%, whereas others often landed far lower (for example, 11–35%). That’s like finishing 6 out of 10 multi-step chores correctly when peers finish only 1–3.
- Visual quality: On metrics like aesthetic quality, imaging quality, motion smoothness, and subject consistency, MIND-V was at or near the top among long-horizon models, and competitive even against short-horizon, privileged-control methods.
- User preference: In head-to-head choices, viewers preferred MIND-V’s results roughly 46.7% of the time—more than any single baseline—thanks to clearer task completion and fewer physics glitches.
Surprising findings:
- Rollouts matter a lot. Ablation showed that removing Staged Visual Future Rollouts caused the biggest drop in long-horizon stability. Catching small mistakes early drastically improved final success.
- Physics-aware RL isn’t just a bonus. Without GRPO post-training, the PFC score and success rate both fell—showing that pixel-level training alone doesn’t guarantee realistic dynamics.
- Affordance grounding counts. Replacing the affordance localizer with a weaker pipeline significantly reduced success. Good grasp/contact reasoning is key.
Efficiency notes:
- Time scales roughly linearly with the number of subtasks because each subtask is planned and generated in sequence.
- Peak memory can remain flat across longer tasks by reusing memory per subtask, making very long videos feasible.
- Rollouts with K=3 gave the best trade-off between quality and cost; going to K=5 added little accuracy but a lot of memory/time.
Takeaway: MIND-V delivers state-of-the-art long-horizon coherence, higher physics plausibility, and better task completion than strong text-to-video baselines. The combination of explicit planning (SRH), a structured blueprint (BSB), physics-aware RL (PFC+GRPO), and self-checking rollouts produced consistently stronger results.
05 Discussion & Limitations
Limitations:
- Extra compute at inference: Rollouts (propose-verify-refine) generate multiple candidates per subtask, which increases latency compared to a single pass.
- Upstream dependency: If the planner or affordance grounding is wrong, errors can propagate. There are fallback strategies (e.g., direct coordinate inference then segmentation), but upstream quality still matters.
- Video-only realism: It enforces physics in feature space and appearance, but it doesn’t directly simulate forces. Some subtle dynamics may still slip by.
Required resources:
- A capable VLM API (for SRH) and an affordance localizer.
- A large video generator backbone (e.g., CogVideoX-5B) and 4 high-memory GPUs (e.g., H200s) for training.
- For inference with rollouts, enough VRAM to render a few candidate clips per subtask; K=3 is a practical default.
When not to use:
- Ultra-low-latency settings where even small rollout overheads are unacceptable.
- Tasks requiring exact numerical physics (e.g., precise torque/force laws) rather than visually consistent dynamics.
- Settings with no reasonable initial image or where objects are too ambiguous to ground reliably.
Open questions:
- Can we fuse 3D scene understanding (e.g., NeRFs or 3D diffusion) to improve occlusions and depth-sensitive interactions?
- Can we learn better affordances end-to-end to reduce reliance on external tools?
- How far can physics-aware rewards go—could they incorporate contact detection or differentiable simulators?
- Can the same framework close the loop from video to real robot actions more directly, improving sim-to-real transfer?
- How to make rollouts smarter (e.g., learned proposal policies) so we need fewer candidates while keeping quality?
06 Conclusion & Future Work
In three sentences: MIND-V is a hierarchical system that plans multi-step robot tasks (SRH), turns plans into a structured, universal blueprint (BSB), and renders realistic videos that follow that blueprint (MVG). It stays reliable over long tasks by checking short candidate futures at each step (rollouts) and becomes more physics-faithful through reinforcement learning with a world-model-based reward (PFC+GRPO). The result is state-of-the-art long-horizon robotic manipulation videos that look good, follow instructions, and obey everyday physics.
Main achievement: Bridging high-level reasoning and low-level pixels with a domain-invariant blueprint, then aligning generation to physics using learned dynamics, so long-horizon videos remain coherent and believable.
Future directions: Add 3D awareness for better depth and occlusion handling; improve end-to-end affordance learning; explore more direct sim-to-real links; and make rollouts more efficient through learned proposal and judging.
Why remember this: It shows a practical recipe for long, reliable robot videos—think, blueprint, render, check—and proves that a physics-aware reward can meaningfully shape video generators toward realistic, instruction-following behavior.
Practical Applications
- Create large libraries of realistic robot training videos for imitation learning without filming everything by hand.
- Prototype multi-step household tasks (e.g., set the table, load the dishwasher) safely in video before trying on a real robot.
- Benchmark planning policies by visualizing and scoring long-horizon rollouts for task success and physics quality.
- Design warehouse manipulation sequences (grasp, sort, place) and validate them via physics-aware video generation.
- Generate failure cases (e.g., near-collisions) to harden robot policies against edge conditions.
- Teach robots to follow complex natural language chores by pairing instructions with high-fidelity synthetic videos.
- Support HRI (human–robot interaction) studies by producing consistent, realistic demonstrations on demand.
- Accelerate sim-to-real by filtering candidate motions that look physically implausible before hardware trials.
- Pre-visualize factory assembly steps to identify likely bottlenecks or unsafe motions.
- Create educational content that shows correct and incorrect manipulations, highlighting physics differences.