
RISE: Self-Improving Robot Policy with Compositional World Model

Intermediate
Jiazhi Yang, Kunyang Lin, Jinwei Li et al. · 2/11/2026
arXiv

Key Summary

  • RISE lets a robot learn safely and cheaply by practicing in its imagination instead of always in the real world.
  • The key is a Compositional World Model that splits the job into two parts: one predicts what will happen (dynamics), and one scores how good it is (value).
  • RISE turns those scores into an 'advantage' label that tells the robot which actions are better, then trains the robot to prefer those actions.
  • A task-centric training trick makes the video predictor follow robot actions more faithfully, so imagined futures match what the robot actually does.
  • The value model combines a smooth progress signal with failure-aware Temporal-Difference learning, so it is both stable and sensitive to mistakes.
  • RISE first warms up on real data to avoid silly moves, then runs a self-improving loop entirely in imagination.
  • On three tough real-world tasks—dynamic brick sorting, backpack packing, and box closing—RISE boosts success by about +35% to +45% over strong baselines.
  • RISE avoids the costs and risks of constant real-world trial-and-error by shifting the learning bottleneck from hardware time to compute time.
  • Ablations show each piece (pretraining, task-centric batching, progress+TD, online imagined states and actions) is necessary for the best results.
  • Even though RISE relies on imagination, it adds zero inference cost when the trained robot runs in the real world.

Why This Research Matters

Robots that can safely practice in imagination cut down on costly and risky real-world trials, making them more practical for homes, hospitals, and warehouses. By learning recovery behaviors for contact-rich tasks, they become far more reliable at chores like packing, sorting, and assembling. This approach turns compute time into skill, so even a single robot can improve without constant human supervision or resets. Better action-following imagination also means fewer broken items and safer interactions with people. Over time, this can speed up deployment of helpful robots in everyday life. It also democratizes robotics research and development by reducing the need for fleets of expensive robots. In short, smarter practice leads to safer, more useful robots.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how athletes often visualize their routines before a performance so they can make fewer mistakes on stage? Robots want to do that too—practice safely and get better without breaking anything.

🄬 Filling (The Actual Concept)

  • What it is: This paper tackles how to make robots reliably handle tricky, touch-heavy tasks (like zipping a bag or sorting moving bricks) by letting them learn from imagined practice, not just real-world trial-and-error.
  • How it works (story of WHY):
    1. The world before: Vision–Language–Action (VLA) models learned mainly by watching and copying (imitation learning). They can follow instructions and understand scenes, but they stumble when real physics gets messy—like grasping a moving object or coordinating two hands.
    2. The problem: Small slips (a slightly off grasp) snowball into failures, and pure on-robot Reinforcement Learning (RL) is slow, risky, and expensive because every try runs serially, needs monitoring, and often requires a human reset.
    3. Failed attempts:
      • Imitation Learning (IL) alone hits a ceiling. It copies experts but suffers exposure bias—once the robot drifts off the taught path, it doesn’t know how to recover.
      • Sim-to-real RL shines in simulators but doesn’t scale in the physical world where resets and safety matter.
      • World models that look pretty (good visuals) often don’t obey actions precisely (poor controllability), so imagined futures don’t reflect what a robot would actually cause.
    4. The gap: We need a fast, action-following world model plus a way to score in-between steps (not just final success/failure), so the robot can learn from partial progress within its imagination.
    5. The real stakes: If robots can practice and improve robustly without wearing themselves (and us) out, they can help at home (packing, tidying), in warehouses (sorting, boxing), and in labs and factories—safely and affordably.
  • Why it matters: Without safe, scalable practice, robots stay brittle in the messy real world. With imagination that truly follows actions and gives useful step-by-step feedback, robots can become reliable helpers.

šŸž Bottom Bread (Anchor): Imagine a robot that needs to grab red bricks from a moving belt and drop them into the red bin. Instead of making 1,000 real attempts (and dropping bricks on the floor), it imagines many action sequences, sees which ones actually pick-and-place correctly in its mental movie, and learns to choose those in real life.

šŸž Top Bread (Hook): Imagine copying someone’s homework without understanding it; the minute the questions change a little, you’re stuck.

🄬 Filling (The Concept: Imitation Learning)

  • What it is: Imitation Learning teaches robots by copying expert demonstrations.
  • How it works:
    1. Watch expert videos.
    2. Learn a mapping from what you see to actions.
    3. Try to mimic at test time.
  • Why it matters: Without recovery skills, tiny deviations cause cascading errors (exposure bias), so IL alone isn’t enough for dynamic, contact-rich tasks.

šŸž Bottom Bread (Anchor): A robot learned to zip a backpack by copying. If the zipper head starts just a bit off, the robot doesn’t know how to fix it and keeps failing.

šŸž Top Bread (Hook): You know how you get better at biking by trying, wobbling, correcting, and trying again?

🄬 Filling (The Concept: Reinforcement Learning)

  • What it is: RL helps agents improve by rewarding good outcomes and discouraging bad ones through repeated interactions.
  • How it works:
    1. Observe the state.
    2. Take an action.
    3. Get feedback (reward).
    4. Update the policy to do more of what works.
  • Why it matters: RL can teach recovery and robustness. But in the real world it’s slow, risky, and needs many resets.

šŸž Bottom Bread (Anchor): A robot that keeps slightly missing a moving brick can, with RL, learn to lead the target—if it could practice cheaply and safely.

šŸž Top Bread (Hook): Picture practicing a skateboard trick inside a safe simulator that reacts exactly as the real board would.

🄬 Filling (The Concept: World Models)

  • What it is: A learned simulator that predicts what will happen next if the robot takes certain actions.
  • How it works:
    1. Learn from recorded robot experiences (videos + actions).
    2. Given a current scene and a proposed action, predict the next frames.
    3. Repeat to roll out a future.
  • Why it matters: If the model doesn’t follow actions well, imagined practice teaches the wrong lessons.

šŸž Bottom Bread (Anchor): If the robot imagines pulling a zipper but the video predictor ignores the pull, the robot can’t learn true cause-and-effect.

02Core Idea

šŸž Top Bread (Hook): Imagine a coach who can instantly replay many futures for each move you consider, then tells you which move helps you win fastest.

🄬 Filling (The Actual Concept)

  • What it is: RISE splits imagination into two parts—one predicts what happens (dynamics) and one scores how good it is (value)—then uses those scores (advantages) to train the robot policy entirely inside imagination.
  • How it works:
    1. Compositional World Model: Predicts multi-view future frames following the robot’s exact action chunk (dynamics), and evaluates how close those futures are to success (value).
    2. Advantage: Turns those evaluations into a ā€œhow much better than nowā€ signal for the proposed actions.
    3. Policy Update: Train the robot to generate actions conditioned on higher advantage bins, so it prefers better moves.
    4. Self-Improving Loop: Repeat—imagine, score, update—without touching the real robot.
  • Why it matters: Decoupling ā€œwhat happensā€ from ā€œhow good it isā€ lets each part be excellent at its job, making imagined practice fast, faithful to actions, and rich in feedback.

šŸž Bottom Bread (Anchor): For backpack packing, the dynamics predicts how the bag and clothes move when you lift and zip; the value says, ā€œYou’re 70% to done.ā€ The advantage teaches the policy to choose the lift-and-zip action sequence that pushes progress up, not down.

Aha! in one sentence: If you precisely imagine what your actions will cause and score those futures well, you can safely self-improve your policy without endless real-world trials.

Three analogies:

  • GPS planning: The dynamics is the map+traffic (what happens if you take a route). The value is ETA (how close to destination). Advantage is the time saved by switching routes. The policy picks the route with higher advantage.
  • Chess analysis: The dynamics simulates moves ahead; the value scores positions; advantage tells which move improves your position fastest; the player chooses those moves.
  • Cooking rehearsal: The dynamics visualizes each step’s results; the value rates doneness; advantage shows which tweak (more heat, more stirring) improves the dish most.

Before vs After:

  • Before: Robots improved either by brittle imitation or costly real-world RL; world models looked pretty but didn’t follow actions tightly; rewards were often too sparse.
  • After: Robots learn in imagination that follows their actions well, with dense, failure-sensitive scores; they update policies quickly and safely, then deploy with no extra runtime cost.

Why it works (intuition, not equations):

  • Splitting prediction (dynamics) from scoring (value) avoids a one-size-fits-all network trying to do two different jobs. Each model can be trained with the best objective for its goal. Accurate, action-controlled prediction plus dense, stable scoring produces reliable advantages, and advantage-conditioned training is a simple, stable way to nudge the policy toward better choices.

Building blocks (each as a Sandwich):

šŸž Hook: You know how a movie storyboard shows what might happen next shot-by-shot? 🄬 Concept: Compositional World Model

  • What it is: A two-part imagination machine—one part predicts future frames given actions; the other part scores how good those frames are for the task.
  • How it works: (1) Feed in recent multi-view images and a chunk of future actions. (2) Predict the next frames. (3) Score each predicted frame. (4) Compare to now to get advantage. (5) Use it to train the policy.
  • Why it matters: If you mix prediction and scoring in one box, you compromise both; separating them makes each sharper. šŸž Anchor: For box closing, it predicts flap folding under your exact hand motions, then scores if the tab is lining up, so the policy can choose motions that improve alignment.
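The two-part split can be sketched as a tiny interface. Everything here (the class names, scalar "frames," the goal value) is an illustrative assumption, not the paper's actual architecture; it only shows how a dynamics part and a value part compose into an advantage signal.

```python
import numpy as np

class DynamicsModel:
    """Predicts future 'frames' conditioned on an action chunk (toy stand-in)."""
    def predict(self, frames, actions):
        # Each frame is a scalar scene summary that drifts by the commanded
        # action, mimicking action-following prediction.
        future = [frames[-1]]
        for a in actions:
            future.append(future[-1] + a)
        return np.array(future[1:])

class ValueModel:
    """Scores how close each frame is to success (0 = start, 1 = done)."""
    def score(self, frames, goal=10.0):
        return np.clip(frames / goal, 0.0, 1.0)

dyn, val = DynamicsModel(), ValueModel()
frames = np.array([0.0, 1.0, 2.0])        # recent observations
actions = np.array([1.0, 1.0, 1.0, 1.0])  # proposed action chunk
future = dyn.predict(frames, actions)     # imagined rollout
scores = val.score(future)                # per-frame task progress
advantage = scores.mean() - val.score(frames[-1:]).mean()
print(advantage)  # positive: this chunk makes progress over staying put
```

Because prediction and scoring are separate objects, each can be trained and swapped independently, which is the compositional point being made above.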

šŸž Hook: Imagine a game where the controller exactly moves the character; any delay ruins it. 🄬 Concept: Dynamics Model

  • What it is: A fast video predictor that follows robot actions faithfully.
  • How it works: Pretrain on large action-labeled robot datasets; add a light action encoder; use task-centric batching so each training batch focuses on many action variations of the same task; fine-tune for each real task.
  • Why it matters: If the future doesn’t obey your joystick (actions), you learn the wrong cause-and-effect. šŸž Anchor: Pulling the zipper harder should move the slider farther in the imagined frames—RISE’s dynamics does that.

šŸž Hook: Think of a progress bar while downloading—smooth but not fooled by errors. 🄬 Concept: Value Model

  • What it is: A scorer that tells how close a state is to success.
  • How it works: Start from a pre-trained VLA backbone, train with (a) progress estimation for smooth, dense feedback, and (b) Temporal-Difference learning to be sensitive to subtle failures using both success and failure rollouts.
  • Why it matters: Sparse ā€˜win/lose’ at the end is too late; you need mid-course feedback that notices small mistakes. šŸž Anchor: If the robot starts zipping the wrong side, the value drops immediately, warning the policy.

šŸž Hook: Picture a coach saying, ā€œThat move is not just good, it’s better than average right now.ā€ 🄬 Concept: Advantage Conditioning

  • What it is: Label actions by how much they improve progress over now, then train the policy to generate actions conditioned on higher labels.
  • How it works: Discretize advantage into bins; sample actions; imagine; compute advantage; train the policy to favor higher bins.
  • Why it matters: It’s a simple, stable way to push the policy toward better behaviors across many scenarios. šŸž Anchor: In brick sorting, actions that lead the gripper to the right color bin get higher bins; the robot learns to make those moves first.

šŸž Hook: Imagine trying a few next steps in your head before committing. 🄬 Concept: Imagined Rollouts

  • What it is: Short, action-conditioned future predictions used to evaluate choices.
  • How it works: From a real starting state, propose an action chunk; predict H future frames; score each; compute advantage; repeat.
  • Why it matters: You don’t need to simulate the whole task perfectly—just enough to tell better from worse next moves. šŸž Anchor: Two imagined futures—placing blue brick in blue vs yellow bin—produce positive vs negative advantages that steer learning.

03Methodology

At a high level: Multi-view images + language → (A) Policy proposes an action chunk → (B) Dynamics imagines futures → (C) Value scores them → (D) Compute advantage → (E) Train policy to prefer higher-advantage actions → Repeat in imagination.

Stage 1: Build the Compositional World Model

šŸž Hook: Think of a driving simulator that instantly renders what happens when you press the gas or brake. 🄬 Concept: Action-Following Dynamics (fast video predictor)

  • What it is: A video model that predicts multi-view future frames that tightly follow the robot’s action chunk.
  • How it works:
    1. Initialize from a fast, efficient video generator (Genie Envisioner).
    2. Add a lightweight action encoder so actions explicitly guide motion.
    3. Pretrain on large robot datasets (Agibot World, Galaxea) with strong noise augmentation to handle blur and artifacts.
    4. Use Task-Centric Batching: sample many variants of the same task per batch to emphasize action controllability over scene diversity.
    5. Fine-tune on each target task.
  • Why it matters: RL needs lots of rollouts; if generation is slow or ignores actions, training stalls or misleads the policy. šŸž Anchor: When the robot commands ā€œlift then zip,ā€ the imagined video lifts the bag and moves the zipper head accordingly, not randomly.

šŸž Hook: You know a teacher who gives you a score after each step in a math problem, not just at the end? 🄬 Concept: Progress Value Model (stable and failure-aware)

  • What it is: A scorer that outputs how far along you are toward finishing the task.
  • How it works:
    1. Start from a pre-trained VLA policy backbone (multi-view ready).
    2. Train with progress estimation for dense, monotonic guidance over an episode.
    3. Add Temporal-Difference learning with both successes and failures so it reacts to subtle mistakes.
    4. Keep it frozen during self-improvement for stability.
  • Why it matters: Dense, reliable scoring turns short imagined rollouts into useful learning signals. šŸž Anchor: During box closing, when flaps align, the value rises; when the tab misaligns, it drops, even before final success/failure.
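The two training signals can be illustrated with toy targets. The exact losses, weighting, and discount in the paper may differ; `gamma`, the episode length, and the sparse end reward here are assumptions used only to show the mechanics.

```python
import numpy as np

def progress_targets(T):
    """Dense, monotonic progress label: step t maps to t / (T - 1)."""
    return np.arange(T) / (T - 1)

def td_target(rewards, values_next, gamma=0.99):
    """One-step Temporal-Difference target: r_t + gamma * V(s_{t+1})."""
    return rewards + gamma * values_next

T = 5
prog = progress_targets(T)                 # smooth guidance: 0.0 ... 1.0
rewards = np.array([0., 0., 0., 0., 1.])   # sparse: success only at the end
v_next = np.append(prog[1:], 0.0)          # bootstrap from next-state values
td = td_target(rewards, v_next)            # reacts step-by-step
print(prog, td)
```

In a failure rollout, the bootstrapped next-state values drop, so the TD target falls immediately at the mistaken step; that early, localized drop is the failure sensitivity the smooth progress signal alone lacks.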

Stage 2: Policy Warm-Up on Real Data

šŸž Hook: Before free-soloing a rock wall, you practice on a top-rope so you don’t learn bad habits. 🄬 Concept: Policy Warm-Up

  • What it is: A safe start using offline demonstrations, prior rollouts, and human corrections to anchor behavior.
  • How it works:
    1. Fine-tune a pre-trained VLA (Ļ€0.5) on task data.
    2. Condition on advantage labels: expert and human corrections get top-bin labels; rollout data get learned advantages.
    3. Teach the policy to generate actions given an advantage bin condition.
  • Why it matters: Prevents wild, unrealistic exploration later in imagination. šŸž Anchor: The robot learns that expert zipping actions correspond to the highest advantage bin, so it knows what ā€˜good’ looks like.

Stage 3: Self-Improving Loop in Imagination

šŸž Hook: Picture practicing a piano piece by trying a bar, previewing how it might sound, adjusting, and repeating—all silently in your head. 🄬 Concept: Imagined Rollout + Advantage Update

  • What it is: An iterative loop that proposes actions, imagines futures, scores them, and updates the policy.
  • How it works:
    1. Start from a real initial state from the warm-up dataset.
    2. Prompt the rollout policy with an optimistic advantage (e.g., top bin) to propose an action chunk.
    3. Use dynamics to predict H future frames.
    4. Score each frame with the value model; compute advantage (average improvement over the chunk).
    5. Discretize advantage into bins.
    6. Store (state, proposed action, evaluated advantage) in a buffer.
    7. To broaden coverage, sometimes roll out one more step from the newly generated frames (limited in depth, to control model drift).
    8. Update the behavior policy to match proposed actions conditioned on the evaluated advantage; softly update the rollout policy via an exponential moving average (EMA) of the behavior policy's weights.
  • Why it matters: The policy learns from its own successes and mistakes, discovered safely in imagination. šŸž Anchor: If a proposed grasp leads to steady progress in imagined frames, its bin is high; the policy learns to produce similar grasps next time.
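Step 8's soft sync can be sketched as a standard exponential moving average over weights. The decay value of 0.99 is an illustrative assumption; the idea is that the rollout policy trails the behavior policy smoothly rather than jumping to it.

```python
import numpy as np

def ema_update(rollout_params, behavior_params, decay=0.99):
    """Blend behavior weights into rollout weights: theta <- d*theta + (1-d)*phi."""
    return {k: decay * rollout_params[k] + (1 - decay) * behavior_params[k]
            for k in rollout_params}

rollout = {"w": np.array([0.0])}    # policy used to propose action chunks
behavior = {"w": np.array([1.0])}   # policy being trained on advantages
for _ in range(100):
    rollout = ema_update(rollout, behavior)
print(float(rollout["w"][0]))  # drifts toward the behavior weights, slowly
```

The slow drift keeps the proposal distribution stable from one imagined iteration to the next, which in turn keeps the advantage labels comparable across the loop.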

Secret sauce highlights (each as a Sandwich):

šŸž Hook: Like practicing many moves of the same sport to get timing perfect. 🄬 Concept: Task-Centric Batching

  • What it is: A training trick that samples many action variants of the same task per batch.
  • How it works: Keep scenario variety overall, but within a batch focus on one/few tasks with diverse actions.
  • Why it matters: Sharply improves action controllability so imagined futures obey the joystick. šŸž Anchor: In brick sorting, small action changes alter the gripper’s timing on the moving belt; task-centric batching teaches the model to reflect that.

šŸž Hook: Think of planning a short set of chess moves, not the whole game. 🄬 Concept: Action Chunks

  • What it is: Small sequences of actions predicted together (length H).
  • How it works: The policy outputs a chunk; the dynamics predicts H frames; the value scores them; advantage summarizes the chunk.
  • Why it matters: You don’t need perfect long-horizon prediction; faithful short chunks are enough to learn better next steps. šŸž Anchor: A 50-step chunk lets the robot preview a pick-and-place and adjust before errors snowball.

šŸž Hook: Like a report card with letter grades. 🄬 Concept: Advantage Bins

  • What it is: Discrete labels (e.g., 1–10) for how helpful an action chunk is.
  • How it works: Compute advantage, bucket it, and condition the policy on these bins during training.
  • Why it matters: Simple, stable target distribution that reliably nudges the policy. šŸž Anchor: Bin 10 for good zip progress, Bin 1 for getting stuck—clear guidance.

Example with actual data flow:

  • Input: Last 4 multi-view frames + task text; policy proposes a 50-step action chunk.
  • Dynamics: Predicts next 25 multi-view frames in ~2 seconds.
  • Value: Scores each predicted frame; compute average gain vs now.
  • Advantage: Discretize into bins; store with (state, action).
  • Training: Minimize distance between policy’s output (conditioned on the bin) and the proposed action; mix in some offline data to avoid forgetting.
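The training step in this flow can be mimicked with a toy, table-based policy: one action template per advantage bin, regressed toward the proposed chunks stored in the buffer. The 0-indexed bins, learning rate, and linear form are all assumptions for illustration, not the paper's diffusion-policy update.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, act_dim = 10, 4
policy = np.zeros((n_bins, act_dim))            # one action template per bin

# Buffer of (advantage bin, proposed action chunk) pairs from imagined rollouts.
buffer = [(9, rng.normal(1.0, 0.1, act_dim)),   # high-advantage chunk
          (2, rng.normal(-1.0, 0.1, act_dim))]  # low-advantage chunk

lr = 0.5
for _ in range(50):
    for b, action in buffer:
        policy[b] += lr * (action - policy[b])  # regress toward the stored chunk

# Conditioning on the top bin at deployment reproduces the 'good' actions.
print(policy[9])
```

Mixing some offline data into the same regression (as the text notes) simply adds expert chunks to the buffer under top-bin labels, anchoring the policy against forgetting.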

What breaks without each step:

  • No task-centric batching: action-following weakens; advantages become noisy.
  • No progress+TD: scores become either too smooth (miss failures) or too jittery (unstable learning).
  • No warm-up: policy explores unrealistic actions; imagination drifts.
  • No advantage conditioning: policy lacks a simple, stable learning target.

Secret sauce in one line: Fast, action-faithful prediction + dense, failure-aware scoring + simple advantage labels → stable, scalable learning in imagination.

04Experiments & Results

The Test (what and why):

  • Three real-world, contact-rich tasks:
    1. Dynamic Brick Sorting: Pick color-coded bricks from a moving conveyor and place into matching bins (timing + precision).
    2. Backpack Packing: Open, insert clothes, lift to settle, then zip (deformable objects + force control).
    3. Box Closing: Load cup, fold flaps, tuck locking tab (bi-manual precision).
  • Metrics: Success Rate (pass/fail) and a 0–10 stage-wise Score (partial credit for progress) to capture long-horizon performance.

The Competition (baselines):

  • Ļ€0.5 (strong VLA fine-tuned on demos),
  • Ļ€0.5 + DAgger (human corrections),
  • Ļ€0.5 + PPO (on-robot RL),
  • Ļ€0.5 + DSRL (steering diffusion policy with RL),
  • RECAP (advantage-conditioned offline RL).

The Scoreboard (with context):

  • Dynamic Brick Sorting: RISE hits 85% success and 9.78/10 score. Context: Baselines range around 10–50% success; this is jumping from a C to a solid A.
  • Backpack Packing: RISE hits 85% success and 9.50/10. Context: Others manage 30–50%; RISE is like scoring in the top quartile when many struggle with the zipper stage.
  • Box Closing: RISE hits 95% success and 9.88/10. Context: The best baseline reaches ~60%; RISE is near-flawless on a fiddly, bi-manual task.

Surprising findings:

  • Online imagined states matter a lot: Adding just online actions helped (35%→40% success in a subtask), but adding imagined states jumped it further (to ~70%), showing imagination widens useful training distribution beyond fixed offline data.
  • Task-Centric Batching substantially reduces optical-flow error (EPE) and improves perceptual video quality (PSNR↑, LPIPS↓, SSIM↑, FVD↓), confirming that batch design—not just model size—drives action controllability.
  • Progress+TD beats either alone: Progress gives smooth learning signals; TD catches subtle failures. Together they boost both stability and accuracy.

Diagnostic ablations (simple language):

  • Offline data mix: Too little offline (e.g., 10%) causes forgetting and collapse; too much (e.g., 90%) constrains exploration. A balanced middle (around 60%) works best.
  • Remove dynamics pretraining: Sorting accuracy drops big; you lose visual-motion priors.
  • Remove task-centric batching: Completion plunges; the predictor stops obeying actions tightly.
  • Remove progress: Success falls; scores become less stable.
  • Remove TD: Value misses errors; success declines sharply.

Meaning of the numbers:

  • When we say +35% absolute, think of going from 50% to 85%—like turning frequent failures into mostly reliable finishes.
  • High 9.x/10 scores mean not only final successes but smoother progress through intermediate steps (e.g., clean grasps, aligned flaps, zipper not stuck).

Takeaway: RISE consistently outperforms both imitation and real-world RL baselines while doing the heavy learning inside imagination, then deploying with no extra runtime cost.

05Discussion & Limitations

Limitations (honest and specific):

  • Imagination gaps: Rare or underrepresented scenarios can still lead the dynamics to produce physically implausible transitions, which can mislabel advantages.
  • Data balance tuning: Mixing offline (real) and online (imagined) data needs tuning; too little or too much offline both hurt.
  • Compute cost: Training fast, high-fidelity world models is heavy on GPUs; RISE shifts cost from hardware resets to compute cycles.
  • Short-horizon rollouts: Chunked imagination avoids long-term drift but can miss very long-horizon dependencies unless iteratively chained with care.

Required resources:

  • Multi-GPU training for the dynamics model (e.g., H100s), a moderate GPU budget for value and policy fine-tuning, multi-view cameras, and synchronized action logs for pretraining.

When NOT to use:

  • Ultra-novel domains with no pretraining data and chaotic physics the video model can’t generalize to.
  • Tasks dominated by very long delays before any observable progress (advantage becomes noisy if frames don’t reflect future payoffs at chunk scale).
  • Hard real-time systems with zero tolerance for any training compute budget.

Open questions:

  • Uncertainty-aware imagination: How to detect and downweight low-confidence predictions during scoring and policy updates?
  • Physics priors: How to bake in geometry and contact constraints to further improve action controllability and stability?
  • Automatic data mixing: Can we adaptively schedule the ratio of offline vs online samples per training phase?
  • Beyond vision: How to incorporate tactile and force sensing into both dynamics and value for richer contact understanding?
  • Longer horizons: Can hierarchical chunks (short for contact, long for intent) improve planning without brittleness?

06Conclusion & Future Work

Three-sentence summary:

  • RISE makes robots self-improve by practicing in imagination with a Compositional World Model that separately predicts what happens and how good it is.
  • Turning those scores into advantage labels and conditioning the policy on higher-advantage actions yields stable, scalable improvement without expensive real-world trial-and-error.
  • On tough, contact-rich tasks, RISE significantly outperforms strong baselines, then runs with zero extra inference cost in deployment.

Main achievement:

  • Demonstrating that a fast, action-faithful dynamics model plus a dense, failure-aware value model can power on-policy RL entirely in imagination for real-world manipulation.

Future directions:

  • Add uncertainty estimation and physics-informed constraints to the world model; integrate touch/force sensing; develop adaptive schedules for mixing offline and online data; and extend to longer-horizon hierarchical planning.

Why remember this:

  • RISE shows that the right kind of imagination—one that obeys actions and scores progress—can replace most risky, expensive on-robot exploration, opening a practical path to robust, precise robots that keep getting better while we keep them safe.

Practical Applications

  • Warehouse sorting: Train a policy in imagination to pick and place items on moving belts into the correct bins.
  • E-commerce packing: Practice folding flaps and tucking tabs to close boxes reliably without crushing contents.
  • Home assistance: Learn to pack bags, zip backpacks, or tidy up by rehearsing many variations safely in imagination.
  • Manufacturing assembly: Refine precise insertions and alignments (e.g., clips, tabs) via imagined rollouts before factory trials.
  • Healthcare logistics: Practice gently handling deformable items (linens, supplies) with reduced risk to equipment and staff.
  • Agriculture handling: Rehearse delicate grasps of produce to cut bruising and improve throughput.
  • Education and prototyping: Students and engineers iterate robot skills with limited hardware by relying on imagined training.
  • Failure recovery training: Use value drops in imagined futures to teach the robot to backtrack and retry when misaligned.
  • Policy updates in the field: Periodically self-improve from fresh camera logs using imagination without halting operations.
  • Multi-view setup optimization: Evaluate which camera placements most improve value estimation and action controllability.
#Reinforcement Learning Ā· #World Models Ā· #Compositional World Model Ā· #Dynamics Prediction Ā· #Value Estimation Ā· #Advantage Conditioning Ā· #Imagination Rollouts Ā· #Vision-Language-Action Ā· #Task-Centric Batching Ā· #Temporal-Difference Learning Ā· #Robot Manipulation Ā· #Contact-Rich Tasks Ā· #Video Diffusion Ā· #On-Policy RL Ā· #Self-Improving Robots