
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Intermediate
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin et al. · 1/22/2026
arXiv · PDF

Key Summary

  • Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.
  • The key trick is to treat robot actions, future images, and a 'goodness score' (value) as extra 'video frames' inside the model’s latent diffusion process.
  • This lets the same model act as a policy, a world model (imagining the future), and a value function (rating futures) all at once.
  • At test time, the robot can plan by sampling several action ideas, imagining what each would lead to, scoring those futures, and picking the best.
  • On the LIBERO benchmark, Cosmos Policy achieves a new state of the art with 98.5% success, and on RoboCasa it hits 67.1% using only 50 demos per task.
  • In real bimanual robot tasks, it gets the top average score (93.6%) and improves even more (by 12.5 points) when using planning.
  • Auxiliary learning targets (predicting future states and values while learning actions) and starting from a pretrained video model both clearly boost performance.
  • Planning is slower (about 5 seconds per action chunk) and needs rollout data to refine predictions, so speedups and data efficiency are key future steps.
  • Model-based planning (using a world model plus state-value) outperforms a model-free Q-value variant with limited rollout data.
  • The approach naturally handles multiple cameras and modalities because all inputs and outputs live in one unified latent video timeline.

Why This Research Matters

Robots that can both act and plan reliably are crucial for homes, factories, and hospitals. Cosmos Policy shows that a single, simple fine-tuning step can turn a powerful video model into a capable controller that imagines futures and scores them. This reduces engineering complexity and better uses what video models already know about motion and physics. The method works well even with modest human demonstration data, and it gets stronger by learning from its own rollouts. As a result, robots become more precise at difficult tasks like high-accuracy grasps or handling deformable objects. Over time, this unified approach could lower costs, speed deployment, and broaden the range of tasks robots can do safely and effectively.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine teaching a friend to bake cookies by only showing them pictures. They might know what a cookie looks like, but not how the dough changes over time or how long to bake it. Videos, on the other hand, show what happens step by step.

🥬 The Concept (The World Before):

  • What it is: Before this work, many robot policies were trained from image-text models that learned meanings from single pictures and captions, not from videos of things changing over time.
  • How it works: Vision-language-action (VLA) systems like RT-2 or OpenVLA use big image-language models and add a small action head on top. They’re great at understanding objects and instructions, but they don’t learn real motion patterns from videos.
  • Why it matters: Without the sense of time and physics you get from videos, robots can misjudge how to move smoothly, how objects react to pushes, or how to keep a stable grasp.

🍞 Anchor: Think of trying to untie a knot from just one snapshot. You really need a short video to understand which string to pull first and how the knot tightens or loosens as you move.

🍞 Hook: You know how watching many sports clips helps you predict where the ball will go next? Video models learned something similar: how scenes evolve over time.

🥬 The Concept (The Problem):

  • What it is: Researchers wanted to use powerful video models for robots, but adapting them usually needed extra parts (separate action generators, inverse dynamics modules) and multi-stage training.
  • How it works: Prior methods fine-tuned a video model, then bolted on a new network to convert predicted frames into actions, or trained custom video-action models from scratch.
  • Why it matters: Extra stages and custom modules add complexity, slow development, and lose some benefits of the pretrained model.

🍞 Anchor: It’s like buying a great all-in-one camera but then adding extra pieces just to take a photo, making it bulky and harder to use.

🍞 Hook: Imagine writing a comic strip where you can slip in sticky notes between panels to add more info, like sound effects or character thoughts, without redrawing anything.

🥬 The Concept (The Gap):

  • What it is: We lacked a simple way to make a pretrained video model produce robot actions (and planning signals) directly, in one stage, without changing its architecture.
  • How it works: If actions and other signals could live right alongside video frames inside the same timeline the model already knows, we could train the model to generate them together.
  • Why it matters: A single, unified model means simpler training, stronger use of video priors, and easier deployment.

🍞 Anchor: If the comic already flows panel by panel, adding sticky notes between panels lets you store speech bubbles or directions without changing the drawings.

🍞 Hook: You know how you try several ideas in your head before choosing the best one? Robots benefit from this too: imagine, score, and pick.

🥬 The Concept (Real Stakes):

  • What it is: Homes, factories, and hospitals need robots that can skillfully manipulate objects, follow instructions, and plan several steps ahead.
  • How it works: A robot that can imagine the near future from video-like understanding, rate how promising each future is, and choose actions that lead there is more reliable.
  • Why it matters: It reduces failures (like dropping a zipper or missing a grasp), saves time and money, and makes robots more helpful in everyday life.

🍞 Anchor: When a robot opens a zip-top bag and places candy inside without spilling, that’s careful foresight plus precise motion—exactly what this paper targets.

— Key Concepts Introduced in Background —

🍞 Hook: You know how a good coach studies game videos to understand movement, not just names of players?

🥬 The Concept: Cosmos-Predict2-2B

  • What it is: A pretrained video diffusion model that predicts future frames from an image and text description.
  • How it works: It encodes video into compact latent tokens and learns to denoise them step by step, turning noisy guesses into crisp future frames.
  • Why it matters: It already knows a lot about motion, timing, and physics from massive video datasets.

🍞 Anchor: Like a weather forecaster watching past clouds to predict the next few frames of a storm rolling in.

🍞 Hook: Picture a camera that stores its photos as tiny puzzle pieces you can clean up to reveal the picture.

🥬 The Concept: Latent Diffusion

  • What it is: A way to create images or videos by starting from noisy latents and denoising them step by step.
  • How it works: Encode frames into a compact latent space, add noise, train a network to remove the noise; at test time, reverse the process to generate clean samples.
  • Why it matters: It’s stable, scalable, and can represent very complex, multimodal possibilities over time.

🍞 Anchor: It’s like restoring a blurry puzzle piece by piece until the full scene appears clearly.
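To make the denoising loop concrete, here is a minimal sketch of latent-diffusion sampling in Python. It is a toy illustration rather than Cosmos code: the `denoise_net` callable, the latent shape, and the sigma schedule are placeholder assumptions.

```python
import numpy as np

def sample_latent_video(denoise_net, shape, sigmas, seed=0):
    """Toy latent-diffusion sampler: start from pure noise and walk down a
    decreasing noise schedule, letting the network clean up the latents."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape) * sigmas[0]           # fully noisy latent clip
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_pred = denoise_net(x, sigma)              # network's guess at the clean latents
        d = (x - x0_pred) / sigma                    # direction pointing back toward pure noise
        x = x + d * (sigma_next - sigma)             # simple Euler step to the next noise level
    return x                                         # decode with the video VAE to get frames

# Usage with placeholder numbers: a 4-frame latent clip, 16 channels, 32x32 spatial grid.
# sigmas = np.geomspace(80.0, 0.1, num=20)
# clean_latents = sample_latent_video(my_denoiser, (4, 16, 32, 32), sigmas)
```

Real samplers use fancier integrators and schedules, but this encode, noise, denoise, decode loop is the machinery Cosmos Policy inherits.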

02 Core Idea

🍞 Hook: Imagine you have a magic flipbook that can show what happens next. Now imagine you can also slip in notes like “move hand here” and “this path scores 0.9” between the pages—without changing the flipbook itself.

🥬 The Concept: Cosmos Policy

  • What it is: A way to fine-tune a pretrained video model so it directly outputs robot action chunks, imagined future states (images and robot readings), and a value score—all inside the model’s own video timeline.
  • How it works: Treat actions, robot state, and values as extra ‘latent frames’ interleaved between image latents; then fine-tune with one training objective so the model learns to generate all of them together.
  • Why it matters: No new modules or multi-stage pipelines; the powerful video prior drives smooth, precise actions and enables planning.

🍞 Anchor: The robot looks at the scene, the model proposes the next motion chunk, imagines what the cameras will see after, rates how good that future is, and then executes the best plan.

Multiple Analogies for the Same Idea:

  1. Comic Strip Analogy: Insert action and score sticky notes between frames so the same storyteller creates pictures, moves, and ratings in one go.
  2. Orchestra Analogy: One conductor (the video model) unifies strings (images), percussion (actions), and winds (values) so the whole performance stays in rhythm.
  3. GPS Analogy: The map (future images), the turn-by-turn steps (action chunk), and the ETA quality (value) all come from the same navigator.

Before vs After:

  • Before: Multiple training stages, extra networks for action prediction, weaker use of video priors, and more complexity.
  • After: Single-stage fine-tuning, no architectural changes, actions/futures/values unified in one model that plans by imagining and scoring.

Why It Works (Intuition, not equations):

  • Video diffusion models are great at capturing how scenes evolve. By slipping actions and values into the same timeline, the model treats them as just more ‘frames’ to denoise. Because it already knows temporal causality, it learns the link: “this action now → that future look → that score.” That coupling makes action predictions smooth and realistic, and value predictions more grounded.

Building Blocks (each with a mini Sandwich):

🍞 Hook: You know how a seasoned dancer anticipates the next beat? 🥬 The Concept: Spatiotemporal Priors

  • What it is: Built-in knowledge about how things move and change over time.
  • How it works: The pretrained video model learned motion patterns from millions of clips; fine-tuning reuses this timing sense for robot moves.
  • Why it matters: Movements become smoother and more physically plausible. 🍞 Anchor: Predicting a cup will tip if pushed near the edge is learned timing and physics.

🍞 Hook: Imagine adding flashcards between book pages. 🥬 The Concept: Latent Frame Injection

  • What it is: Turning actions, robot proprioception, and values into extra latent frames and inserting them between image latents.
  • How it works: Normalize each non-image signal, duplicate to fill a latent frame, interleave with image frames; the diffusion model then learns to denoise the whole sequence.
  • Why it matters: No new layers needed; the old model now ‘speaks robot.’ 🍞 Anchor: Action frames sit between current and future images like instructions between story panels.
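A minimal sketch of the injection step, under assumed shapes and a simplified frame ordering (the Methodology section lists the full layout): each non-image signal is normalized, repeated until it fills an image-latent-sized slot, and interleaved with the camera latents.

```python
import numpy as np

LATENT_SHAPE = (16, 32, 32)   # channels x height x width of one image latent (assumed)

def to_latent_frame(signal, lo, hi):
    """Normalize a flat signal (proprioception, action chunk, or a scalar value)
    and tile it until it fills one image-latent-sized 'frame'."""
    v = (np.asarray(signal, dtype=np.float32).ravel() - lo) / (hi - lo + 1e-8)
    v = 2.0 * v - 1.0                       # roughly match the scale of image latents
    return np.resize(v, LATENT_SHAPE)       # repeat the values to fill the slot

def interleave(current_views, proprio, action_chunk, future_views, value, bounds):
    """Build one latent 'video': non-image frames slotted between image latents,
    roughly (proprio, current views, actions, future views, value)."""
    frames = [to_latent_frame(proprio, *bounds["proprio"]), *current_views,
              to_latent_frame(action_chunk, *bounds["action"]), *future_views,
              to_latent_frame([value], 0.0, 1.0)]
    return np.stack(frames)
```

The point of the sketch is that nothing architectural changes: the diffusion model simply sees a slightly longer sequence of frames, some of which happen to encode actions, robot state, or a value.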

🍞 Hook: Planning a dance routine in chunks, not one step at a time. 🥬 The Concept: Action Chunk Prediction

  • What it is: Predict a short sequence of actions (like 2 seconds) in one shot.
  • How it works: The model outputs a block of commands, which the robot executes before requerying.
  • Why it matters: Chunks reduce jitter and make motions smoother. 🍞 Anchor: Instead of “left-right-left” one by one, it plans “left-right-left-pause” as a single smooth unit.
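A minimal control-loop sketch of chunked execution, assuming a hypothetical `env` interface and a `policy.predict_chunk` call that returns an array of shape (horizon, action_dim):

```python
def run_episode(env, policy, chunk_horizon=16, max_steps=400):
    """Receding-horizon control with action chunks: predict a block of actions,
    execute the whole block, then re-query the policy on fresh observations."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = policy.predict_chunk(obs)    # one model call per chunk, not per step
        for action in chunk:
            obs, done = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                return obs
    return obs
```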

🍞 Hook: Trying a few moves in your head before choosing the best. 🥬 The Concept: Best-of-N Sampling

  • What it is: Sample several action chunks, imagine futures, score them, and pick the top one.
  • How it works: Use the same model to propose, predict, and value; optionally ensemble predictions for robustness.
  • Why it matters: Increases reliability in tricky tasks like precise grasps. 🍞 Anchor: Like testing a few bowling angles in your imagination, then picking the one most likely to strike.

🍞 Hook: Looking at a scene from several cameras is like seeing a stage from the balcony and the wings. 🥬 The Concept: Multimodal Input Handling

  • What it is: Jointly model images from multiple cameras plus robot state as interleaved latent frames.
  • How it works: Insert wrist and third-person views alongside proprioception and actions.
  • Why it matters: Different views reduce blind spots and improve control. 🍞 Anchor: A wrist camera sees the gripper tips while a room camera sees the bigger picture—together, they prevent misses.

03 Methodology

At a high level: Inputs (multi-view images + robot proprioception + task text) → Tokenize to latents and interleave with action/state/value latent frames (latent frame injection) → Diffusion denoising learns to predict actions, future states, and values → Outputs: action chunk, imagined future images/states, and a value score.

Step-by-step recipe (with mini Sandwiches for key steps):

  1. Prepare the data timeline 🍞 Hook: Think of making a storyboard where every panel is in the right order. 🥬 The Concept: Latent Timeline Construction
  • What it is: Build a single sequence of latent frames that includes current observations, actions, future observations, and a value.
  • How it works: Tokenize each camera image with a video VAE, add placeholder images for non-image slots, then overwrite those slots with duplicated, normalized vectors (proprioception, action chunk, value).
  • Why it matters: Keeps everything the video model needs in the order it already understands. 🍞 Anchor: The sequence (s, a, s', V(s')) becomes a left-to-right mini-story the model can read and write.
  2. Teach the model to denoise the whole story (sketched in code after this list) 🍞 Hook: Like cleaning smudges off a comic until every panel is clear. 🥬 The Concept: Joint Denoising Objective
  • What it is: Train the one model to recover clean actions, futures, and values from noised latents, conditioned on the clean prefix.
  • How it works: In each batch, decide which part is given (clean) and which part to generate (noised): policy training (p(a,s',V|s)), world model training (p(s',V|s,a)), or value training (p(V|s,a,s')).
  • Why it matters: A single learner becomes policy+world model+value function without extra heads. 🍞 Anchor: If you see the current scene (clean) and the model fills in the next actions and outcomes (denoised), it’s learning to act and imagine.
  3. Use auxiliary targets to steady the hand 🍞 Hook: When you learn to shoot hoops, you also practice balance and footwork. 🥬 The Concept: Auxiliary Supervision
  • What it is: While training the policy, also predict future states and values; while training the world model, also predict values.
  • How it works: Don’t drop these extra losses; they cross-teach timing, consequences, and goal direction.
  • Why it matters: Removing them drops success; predicting futures especially helps action quality. 🍞 Anchor: Practicing dribbling (future prediction) improves your shot (actions) even if the goal is scoring.
  4. Predict in parallel or step-by-step 🍞 Hook: Sometimes you write a paragraph all at once; sometimes you outline first, then fill in details. 🥬 The Concept: Parallel vs Autoregressive Decoding
  • What it is: Generate actions, futures, and values together (fast) or action first then future then value (more accurate for planning).
  • How it works: For direct control, use parallel for speed. For planning, use autoregressive to get crisper future/value estimates.
  • Why it matters: Lets you trade speed for accuracy depending on the situation. 🍞 Anchor: Quick moves during routine steps; careful, stepwise imagining before a delicate grasp.
  5. Adjust the noise schedule for precision 🍞 Hook: Cleaning glasses: too little cleaner does nothing; too much can smear. 🥬 The Concept: Noise Distribution Tuning
  • What it is: Shift training to include more high-noise levels and end inference a bit before σ≈0.
  • How it works: Hybrid sampling puts some probability mass on larger σ; at test time, stop at σ_min=4 to avoid over-smoothing.
  • Why it matters: Empirically yields crisper actions and futures. 🍞 Anchor: Starting from the right level of blur makes the final deblur sharper.
  6. Plan with best-of-N (a planning sketch follows this list) 🍞 Hook: Before crossing stepping stones, you picture several paths and pick the safest. 🥬 The Concept: Model-based Planning with Ensembling
  • What it is: Sample N action chunks, imagine 3 possible futures each, evaluate 5 value samples per future, then aggregate by a ‘majority mean’ rule.
  • How it works: Use two checkpoints: the ‘policy model’ for proposals, and a ‘planning model’ fine-tuned on rollouts for better futures/values.
  • Why it matters: Robust to uncertainty and multimodal futures; improves success especially on hard tasks. 🍞 Anchor: If most predictions say success, averaging those gives a stable green light.
  7. Learn from on-policy experiences 🍞 Hook: After a rehearsal, you adjust based on what went wrong. 🥬 The Concept: Rollout Fine-tuning
  • What it is: Collect successes and failures, then fine-tune with heavier weight on the world model and value function.
  • How it works: 90% of the batch focuses on futures/values, 10% on actions, so the imagination and scoring get sharper where the base demos were sparse.
  • Why it matters: Demos cover mostly successes; rollouts add coverage of tricky, failure-prone states. 🍞 Anchor: If you kept slipping on the bag zipper, adding those trials teaches the model to foresee and avoid that slip.
  8. Choose the value flavor 🍞 Hook: Do you judge a move after seeing the future picture (state value) or directly from the move itself (Q-value)? 🥬 The Concept: V(s') vs Q(s,a)
  • What it is: Train value to depend on predicted future state (model-based) or directly on current state+action (model-free).
  • How it works: Mask inputs so value sees only s' for V(s') or only (s,a) for Q(s,a); planning either predicts s' then V(s') or directly scores (s,a).
  • Why it matters: With limited data, V(s') + world model tends to be more sample-efficient and less overfit. 🍞 Anchor: Seeing the future snapshot before judging feels easier when data is scarce.
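To tie steps 1, 2, and 8 together, here is a minimal sketch of how a training batch element might pick its mode and decide which slots stay clean (conditioning) versus get noised and denoised. The slot names and helper functions are illustrative assumptions; the 50/25/25 sampling weights echo the RoboCasa training split in the concrete example below.

```python
import random

# Illustrative slot layout for one interleaved sequence (s, a, s', V); real sequences
# hold several camera views plus a proprioception frame per state.
SLOTS = ["s", "a", "s_next", "value"]

def choose_training_mode():
    """Sample which conditional this batch element trains. Rollout fine-tuning
    (step 7) instead puts roughly 90% of the batch on world-model and value modes."""
    return random.choices(["policy", "world_model", "value"],
                          weights=[0.5, 0.25, 0.25])[0]

def conditioning_slots(mode):
    """Slots given clean for each mode; everything else is noised and denoised.
    policy:      p(a, s', V | s)   (actions plus the auxiliary targets of step 3)
    world_model: p(s', V | s, a)
    value:       p(V | s, a, s')"""
    return {"policy": {"s"}, "world_model": {"s", "a"}, "value": {"s", "a", "s_next"}}[mode]

def value_conditioning(variant):
    """Step 8's two value flavors as input masks: model-based V(s') conditions only on
    the (predicted) future state; model-free Q(s, a) conditions only on state and action."""
    return {"V": {"s_next"}, "Q": {"s", "a"}}[variant]

def clean_mask(mode):
    given = conditioning_slots(mode)
    return {slot: slot in given for slot in SLOTS}

# Example: a policy-mode element keeps s clean and denoises a, s', and V together.
# mask = clean_mask(choose_training_mode())
```

And here is a sketch of the best-of-N planner from step 6. The method names (`sample_action_chunk`, `imagine_future`, `sample_value`) are hypothetical stand-ins for the policy and planning checkpoints, the default of 8 candidate chunks is a placeholder, and `majority_mean` is just one plausible reading of the paper's aggregation rule.

```python
import numpy as np

def majority_mean(values, threshold=0.5):
    """One plausible reading of 'majority mean' (an assumption, not the paper's exact
    rule): split value samples into high and low groups, keep whichever group is
    larger, and average it, which damps outlier value samples."""
    values = np.asarray(values, dtype=np.float32)
    high, low = values[values >= threshold], values[values < threshold]
    group = high if len(high) >= len(low) else low
    return float(group.mean())

def plan_best_of_n(policy_model, planning_model, obs,
                   n_chunks=8, n_futures=3, n_value_samples=5):
    """Propose N action chunks with the policy checkpoint, imagine a few futures per
    chunk with the rollout-fine-tuned planning checkpoint, score each future several
    times, and execute the chunk whose aggregated score is highest."""
    best_chunk, best_score = None, -np.inf
    for _ in range(n_chunks):
        chunk = policy_model.sample_action_chunk(obs)
        future_scores = []
        for _ in range(n_futures):
            future = planning_model.imagine_future(obs, chunk)   # predicted s'
            vals = [planning_model.sample_value(future) for _ in range(n_value_samples)]
            future_scores.append(majority_mean(vals))
        score = float(np.mean(future_scores))
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk, best_score
```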

Concrete data example (RoboCasa):

  • Inputs: 3 camera views (wrist + two third-person), robot joint angles, text task, and previous chunk context.
  • Latent injection: sequence like [current proprio, current wrist, cam1, cam2, action chunk, future proprio, future wrist, future cam1, future cam2, value].
  • Training split: 50% policy tuples from demonstrations, 25% world model + 25% value from rollouts/demos.
  • Output: A 32-step action chunk to approach, grasp, and manipulate a kitchen object; predicted future images show the hand closer to the handle; value near 1.0 if success is likely.

Secret sauce:

  • Treating everything as frames in one diffusion timeline makes actions and consequences co-evolve. This tight coupling, plus video priors and auxiliary targets, is what lifts performance without architectural changes.

04 Experiments & Results

🍞 Hook: When you try a new study method, you test it on easy quizzes and tough exams and compare your scores with your classmates.

🥬 The Concept: The Test Setup

  • What it is: Three arenas—LIBERO (many table tasks), RoboCasa (kitchen tasks), and ALOHA (real bimanual robot).
  • How it works: Train with provided demos, then measure success rates or average task scores across many trials and seeds.
  • Why it matters: Shows generalization to new layouts, objects, and styles, and real-world reliability. 🍞 Anchor: Like checking how your strategy works on math, science, and history, not just one subject.

Competitors (baselines): Diffusion Policy, Dita, UVA, Video Policy, UWM, π0 and π0.5, OpenVLA-OFT, CogVLA, UniVLA, DP-VLA, GR00T-N family, FLARE, etc.

Scoreboard with context:

  • LIBERO: 98.5% average success (A+), beating strong VLAs like π0.5 and OpenVLA-OFT.
  • RoboCasa: 67.1% average success with only 50 human demos per task—top score, despite others often using 300–1000+ demos.
  • Real ALOHA tasks: Highest average score (93.6%) across four challenging bimanual tasks; especially strong on high-precision and multimodal grasps (e.g., zip-top bag, scattered candies).

Meaning of the numbers:

  • 98.5% vs 97% is not just a small bump; at scale, that’s many more successful episodes and fewer frustrating failures.
  • 67.1% with 50 demos shows data efficiency; it’s like getting the highest grade while studying fewer hours.
  • 93.6% in the real world signals smooth, robust control under noisy sensors and occlusions.

Ablations (what happens if we remove parts?):

  • Remove auxiliary targets: average success drops (e.g., −1.5 points on LIBERO), proving predicting futures and values helps learning actions.
  • Train from scratch (no video prior): bigger drop (e.g., −3.9 points on LIBERO) and jerkier motions—pretraining matters.
  • In RoboCasa, removing future-state prediction during policy training causes the largest fall, confirming futures are key to strong actions.
  • Even with 1 denoising step per chunk (very fast), RoboCasa only drops ~0.5%—suggesting practical speed/accuracy trade-offs.

Planning results and surprises:

  • After fine-tuning on rollout data, model-based planning (V(s')) boosts tough ALOHA tasks by +12.5 points on average.
  • The planning model more accurately predicts future mistakes (like losing the zipper grasp), helping the chooser avoid low-value actions.
  • Model-based (V(s')) outperforms model-free (Q) when rollout data is limited—likely due to easier learning via explicit future images.

🍞 Anchor: It’s like practicing free throws on tough courts, then using slow-motion replays (world model) and a coach’s rating (value) to pick the best shooting form—scores go up.

05 Discussion & Limitations

🍞 Hook: Even the best backpack has weight limits—you need to know when it’s too heavy for a long hike.

🥬 The Concept: Honest Assessment

  • What it is: Limitations, resources, when not to use, and open questions.
  • How it works: We list concrete constraints and where the method shines or struggles.
  • Why it matters: Helps practitioners decide fit and guides future research. 🍞 Anchor: Like reading a game’s difficulty and system requirements before you install it.

Limitations:

  • Planning speed: Best-of-N with ensembles takes ~5 seconds per action chunk on 8 GPUs—too slow for fast, dynamic tasks.
  • Data needs for planning: To accurately predict beyond demos, the model benefits from rollout data (including failures), which may be costly.
  • Short horizon: The approach plans one action chunk at a time (one ‘layer’ of search); deeper planning could help but costs more compute.
  • Partial observability: Though multi-view helps, self-occlusions and tiny tolerances (e.g., zipper) remain hard.

Required resources:

  • Large GPU budget (e.g., H100s), full fine-tuning of a ~2B parameter model, and 1–2 days of training per benchmark.
  • Multi-camera setup and synchronized proprioception; careful normalization/injection pipelines.

When not to use:

  • Highly dynamic tasks (e.g., juggling) where 5 seconds of planning latency is unacceptable.
  • Settings with no bandwidth for rollout data collection and no tolerance for slower, compute-heavy planning.
  • Extremely long-horizon tasks where deeper tree search is mandatory (unless you redesign planning depth).

Open questions:

  • How to speed up planning (faster samplers, fewer denoising steps, distillation, or smarter candidate pruning)?
  • Can we reduce rollout needs via synthetic hard-negative generation or uncertainty-aware sampling?
  • What’s the best mask scheme and training mix for V vs Q under varying data budgets?
  • How far can deeper horizons go before compute becomes a wall, and can partial rollouts (learned subgoals) help?
  • Can this unify with language-guided long-horizon planners while keeping the low-level precision?

06 Conclusion & Future Work

Three-sentence summary:

  • Cosmos Policy fine-tunes a pretrained video diffusion model to directly generate robot action chunks, future states, and value scores by treating them as extra latent frames—no architecture changes needed.
  • This unifies policy, world model, and value function in one timeline, enabling best-of-N planning that imagines futures and picks the highest-value path.
  • The method sets new state-of-the-art results on LIBERO and RoboCasa with strong real-world performance, and improves further with rollout-based planning.

Main achievement:

  • Showing that simply injecting actions/states/values as latent frames into a pretrained video model yields a powerful, unified visuomotor controller and planner.

Future directions:

  • Speed up planning via efficient samplers, distillation to fewer steps, or learned proposal pruning; extend horizon with lightweight tree search; reduce rollout dependence with synthetic or uncertainty-targeted data.

Why remember this:

  • It’s a clean, one-stage recipe that turns a video model into a capable robot brain—seeing, imagining, scoring, and acting in one place—proving that video priors are a strong foundation for precise robot control and practical planning.

Practical Applications

  • Home assistance: loading dishwashers, folding laundry, and organizing items using multi-view planning.
  • Factory kitting and assembly: precise multi-step pick-and-place with fewer custom modules.
  • Warehouse operations: handling varied packaging and zippers, improving grasp reliability with planning.
  • Hospital logistics: safely placing items into bags or drawers under tight space constraints.
  • Kitchen robotics: opening containers, manipulating utensils, and following language-specified goals.
  • Education and research: a clean reference approach to unify policy, world model, and value without architecture changes.
  • Robust teleoperation assist: suggest and score multiple action chunks for the human to approve.
  • Quality control automation: imagine outcomes of alternative motions and pick the safest, highest-value one.
  • Small-data deployment: leverage video priors to reach strong performance with fewer demonstrations.
  • Simulation-to-real transfer: refine via rollout data to close gaps where demos don’t cover tricky failures.
#video diffusion #robot policy learning #visuomotor control #latent frame injection #world model #value function #action chunking #best-of-N planning #multimodal inputs #imitation learning #model-based planning #cosmos-predict2 #video priors #latent diffusion #multi-view robotics