
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Intermediate
I. Apanasevich, M. Artemyev, R. Babakyan et al. · 1/31/2026
arXiv · PDF

Key Summary

  • Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.
  • It cleans up and lines up messy robot data so the model learns from smooth, sharp, and diverse demonstrations instead of noisy ones.
  • A single 'unified action space' acts like a universal remote so the same policy can control humanoids, mobile cobots, and single arms without confusion.
  • A guidance helper (JPM) points the robot to the exact 3D spot to touch, which is especially helpful for look-alike products on store shelves.
  • The system predicts how far along a task is and detects out-of-distribution states, making long missions safer and more reliable.
  • Speed conditioning lets the same policy move fast for easy parts and slow down for delicate parts—like zooming in and out in time.
  • After regular learning, reinforcement learning (RL) aligns the policy with real goals and rewards, boosting long-horizon success and recovery from mistakes.
  • On real tests, Green-VLA beats or matches strong baselines using less data, and runs zero-shot across different robot bodies.
  • For humanoids, it controls head, torso, two arms, and hands together, succeeding on pick–place, sorting, handovers, and table cleaning, even in new scenes.
  • The big idea: quality-aligned data + unified actions + staged training + RL alignment = a practical path to generalist, real-world robots.

Why This Research Matters

Generalist robots that understand language and act safely can help in homes, hospitals, warehouses, and factories without needing a whole new brain for each body. Green-VLA shows how to make one policy that travels well across different robots by cleaning data, unifying actions, and finishing with reward-based alignment. This reduces engineering overhead and speeds up deployment since you don’t have to start from scratch for each embodiment. Guidance and safety checks improve reliability for delicate work, like grabbing small items or operating in clutter. Faster time-to-completion means real workflows get done sooner, saving labor and energy. Because it works in zero-shot settings, integrating new robots or tasks becomes easier, making automation more flexible and scalable.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re teaching a team of helpers—some have two arms, some roll on wheels, and one is a full humanoid. You want to give them the same simple voice instructions like 'Put the red cup in the box,' and have any of them do it well. That sounds easy for people, but it’s really hard for robots.

🄬 The Concept (Vision–Language–Action robots): A Vision–Language–Action (VLA) model is a brain that looks (vision), listens/reads (language), and then moves (action) to get things done.

  • How it worked before: Most robots were trained by copying human demos (behavior cloning), which is like tracing over a drawing—great for short, clean lines, but messy when the page is wrinkly or the picture is long.
  • What was wrong: Real robot data is noisy (blurry cameras, shaky hands), comes in different speeds and action types, and copying alone doesn’t teach robots how to fix mistakes or finish long chores.
  • Why it matters: Without a smarter plan, robots freeze up in new homes, pick the wrong item on crowded shelves, or get lost mid-task.

šŸž Anchor: Think of a kid learning to bake: if all the recipe cards are smudged, the oven dials are different each time, and the kid only ever copies, you won’t get a reliable baker. Robots had the same problem.

šŸž Hook: You know how streaming music from many sources needs the same format, or your speaker won’t play it smoothly?

🄬 The Concept (Data Quality and Alignment): DataQA is like a robot DJ that filters out bad tracks (blurry frames, shaky motions) and sets all the music to the same beat (temporal alignment), so learning sounds smooth.

  • How it works:
    1. Check video sharpness, motion smoothness, and scene variety.
    2. Toss out broken or odd-length episodes.
    3. Smooth jittery trajectories and line up speeds using optical flow.
    4. Balance sampling so no single dataset drowns out the others.
  • Why it matters: Clean, well-timed data makes learning faster and more robust, so the same move means the same thing across robots.

šŸž Anchor: It’s like cleaning your glasses and setting a metronome before practicing piano—now the notes are clear and on time.

šŸž Hook: Ever tried using three different TV remotes for one movie night? Confusing, right?

🄬 The Concept (Unified Action Space): A unified action space is a universal remote for robots where each button always means the same body part or motion, no matter the robot.

  • How it works:
    1. Define shared action slots (left arm, right arm, gripper/hand, base...).
    2. Map each robot’s native controls into those slots.
    3. Use a mask so unused slots don’t add noise.
    4. Add a 'control prompt' that tells the policy which slots are active and how.
  • Why it matters: Without it, training mixes up meanings (like volume-up on one remote being channel-down on another), breaking transfer across robots.

šŸž Anchor: Now, whether it’s a humanoid or a small arm, 'left-hand close' always means 'close left fingers'—no surprises.

šŸž Hook: You know how varsity athletes cross-train in multiple sports to get stronger overall?

🄬 The Concept (Staged, Multi-Embodiment Training): The model learns in five steps—L0 base vision–language, L1 web grounding, R0 robot pretraining, R1 robot-specific tuning, R2 RL alignment—so it grows from general knowledge to real-world skill.

  • How it works:
    1. L0/L1: Learn common sense about objects, physics, and language.
    2. R0: Learn broad robot skills across many bodies and cameras.
    3. R1: Fine-tune on the target robot’s best data.
    4. R2: Use rewards to align with long tasks and safe recovery.
  • Why it matters: Skipping stages is like skipping grades: you miss key basics, and the robot struggles later.

šŸž Anchor: It’s like reading about swimming (L0/L1), practicing in pools of different sizes (R0), training with your own coach (R1), and finally timing laps to beat your record (R2).

šŸž Hook: When you play a long board game, you sometimes check how close you are to the end.

🄬 The Concept (Progress and Safety Signals): The model predicts 'how far along' a subtask is and detects out-of-distribution states; it also gets a guided target point for tricky objects.

  • How it works:
    1. Episode-progress head estimates completion.
    2. A GMM-based OOD detector nudges actions back to safe zones.
    3. A Joint Prediction Module (JPM) picks a precise 3D target point from language + vision.
  • Why it matters: Without these, the robot keeps going after finishing, drifts into unsafe poses, or misses look-alike items.

šŸž Anchor: Like a GPS with 'distance to destination,' lane-keeping, and a pin drop at your exact parking spot.

02 Core Idea

šŸž Hook: Picture teaching one orchestra to play many genres—classical, jazz, and pop—using the same sheet music, with a smart conductor that speeds up the easy parts and slows for solos.

🄬 The Concept (Aha!): The key insight is to stage learning and unify actions so one policy can control many robot bodies, then use rewards to align it with long, real tasks.

Multiple Analogies:

  1. Universal remote + training wheels: First, clean and label all the buttons (unified action space), then practice on lots of TVs (multi-embodiment pretraining), then tune for your own set (R1), and finally play real shows with audience feedback (R2 RL).
  2. Cooking school: Learn food words and kitchen tools (L0/L1), cook in many kitchens (R0), specialize in your restaurant (R1), and perfect timing and plating with customer reviews (R2).
  3. Map + compass + guardrails: The web teaches the map, robot data tunes the compass, the guardrails (OOD + guidance) keep you on the road, and RL is the odometer-style reward that pushes you toward farther, cleaner routes.

Before vs After:

  • Before: Mixed-up controls, messy data, and pure copying led to brittle skills, slow or unsafe behavior, and poor transfer to new robots.
  • After: Cleaned, aligned data and a unified action language make cross-robot learning click; progress/OOD/guidance improve safety; RL alignment boosts long-horizon success and recovery.

Why It Works (intuition not math):

  • Consistent meanings (unified slots) let the model build shared skills instead of memorizing per-robot quirks.
  • Cleaning and time-aligning data shrink the noise, so the model learns cause→effect reliably.
  • Guidance and OOD checks act like bumpers, keeping plans on track.
  • RL adds the missing 'what actually counts' so the model stops just imitating and starts achieving.

Building Blocks (each in Sandwich style):

šŸž Hook: Like labeling every drawer in a workshop so tools are always in the same place. 🄬 Unified Action Space: A single, labeled set of action slots with masks and a control prompt.

  • How: Map each robot’s native actions into shared slots, ignore unused slots, and announce setup via tokens.
  • Why: Prevents cross-robot meaning clashes that wreck transfer. 🍞 Anchor: 'Left-hand close' is always the same drawer—open, grab, done.

šŸž Hook: Practicing songs to a metronome. 🄬 Temporal Alignment + Speed Conditioning: Make motion speeds comparable across datasets, then teach the model to run at multiple temporal zooms.

  • How: Optical-flow-based resampling; then condition actions on a speed scalar v.
  • Why: Without it, the same move could mean fast in one dataset and slow in another. 🍞 Anchor: Whether it’s carefully threading a needle (high v) or moving quickly across the room (low v), one model adapts.
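Here is a minimal sketch of the 'learned gamma/beta of v' modulation in FiLM style, assuming a simple linear mapping from v to per-channel scale and shift. The dimensions, initialization, and the meaning of high vs. low v follow the anchor above and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64

# Projections from the speed scalar v to per-channel scale and shift
# (randomly initialized here; in training these would be learned parameters).
W_gamma, b_gamma = 0.01 * rng.normal(size=HIDDEN), np.ones(HIDDEN)
W_beta, b_beta = 0.01 * rng.normal(size=HIDDEN), np.zeros(HIDDEN)

def speed_modulate(h: np.ndarray, v: float) -> np.ndarray:
    """FiLM-style conditioning: h' = gamma(v) * h + beta(v),
    broadcast over the last (feature) axis of h."""
    gamma = W_gamma * v + b_gamma
    beta = W_beta * v + b_beta
    return gamma * h + beta

h = rng.normal(size=(10, HIDDEN))       # e.g. ten action-chunk tokens
delicate = speed_modulate(h, v=2.0)     # "zoomed-in" timing for fine contact
gross = speed_modulate(h, v=0.5)        # "zoomed-out" timing for big sweeps
```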

šŸž Hook: Dropping a pin on a map for an exact meetup location. 🄬 JPM Guidance: A pointing VLM picks a 2D pixel, lift to 3D with depth and camera pose, solve IK to get a joint goal, then gently steer actions there.

  • How: Predict affordance point → backproject to 3D → compute feasible joint target → guide flow.
  • Why: Without it, look-alike items on shelves cause mispicks. 🍞 Anchor: 'Pick the 500ml blue bottle' now lands you at the right spot.
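A minimal sketch of the 'backproject to 3D' step, assuming a standard pinhole camera model. The intrinsics, pose, and pixel values are placeholders, and the IK and flow-guidance steps are only noted in comments.

```python
import numpy as np

def backproject(u: float, v: float, depth_m: float,
                K: np.ndarray, T_cam_to_base: np.ndarray) -> np.ndarray:
    """Lift a predicted pixel (u, v) with metric depth to a 3D point in the
    robot base frame. K is the 3x3 intrinsics; T_cam_to_base the 4x4 pose."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # pixel -> camera-frame ray
    p_cam = ray * depth_m                            # scale ray by depth
    p_base = T_cam_to_base @ np.append(p_cam, 1.0)   # camera -> base frame
    return p_base[:3]

# Placeholder calibration values, not from a real camera.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_to_base = np.eye(4)

target_xyz = backproject(u=350, v=210, depth_m=0.62,
                         K=K, T_cam_to_base=T_cam_to_base)
# A full pipeline would now solve IK for a feasible joint target and bias the
# flow-matching action samples toward it (the "guidance" step described above).
```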

šŸž Hook: A teacher who says 'You’re 80% done' and 'Stay on the path!' 🄬 Progress + OOD Safety: Estimate task completion; use a learned state-density to nudge away from risky zones.

  • How: Progress head; GMM density gradient correction.
  • Why: Prevents over-shooting and reduces unsafe drifts. 🍞 Anchor: Finish placing, then stop, instead of wiggling until the object falls.
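A minimal sketch of both signals, assuming a scikit-learn Gaussian mixture as the state-density model and a finite-difference gradient for the safety nudge. The state dimension, threshold, and step sizes are illustrative, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a density model over states seen in training (illustrative 8-D states).
rng = np.random.default_rng(0)
train_states = rng.normal(size=(5000, 8))
gmm = GaussianMixture(n_components=16, random_state=0).fit(train_states)
LOG_DENSITY_FLOOR = np.quantile(gmm.score_samples(train_states), 0.01)

def progress_target(t: int, T: int) -> float:
    """Supervision for the episode-progress head: fraction of episode done."""
    return t / T

def ood_correction(state, step: float = 0.05, eps: float = 1e-3) -> np.ndarray:
    """If the state's log-density under the training GMM is low (OOD), take a
    small step up a finite-difference gradient of log p_train(s)."""
    state = np.asarray(state, dtype=float)
    log_p = gmm.score_samples(state[None])[0]
    if log_p >= LOG_DENSITY_FLOOR:
        return state                       # in-distribution: leave it alone
    grad = np.zeros_like(state)
    for i in range(state.size):            # finite-difference gradient
        bumped = state.copy()
        bumped[i] += eps
        grad[i] = (gmm.score_samples(bumped[None])[0] - log_p) / eps
    return state + step * grad / (np.linalg.norm(grad) + 1e-8)
```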

šŸž Hook: Practicing for a marathon, not just a sprint. 🄬 RL Alignment (R2): Add rewards so the model learns to finish long tasks, recover from slips, and prioritize success over imitation.

  • How: Trajectory optimization with a Q-critic and optimizing the source noise of flow-matching.
  • Why: Copying alone stalls; rewards push real-world reliability. šŸž Anchor: Time-to-clear drops, success rises, retries shrink.
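A minimal, framework-free sketch of the 'refine actions with the critic's gradient' idea. The toy critic and the finite-difference gradient are stand-ins for illustration, not the paper's IQL training setup.

```python
import numpy as np

def refine_action(q_fn, state: np.ndarray, action: np.ndarray,
                  steps: int = 5, lr: float = 0.1, eps: float = 1e-3) -> np.ndarray:
    """A few ascent steps of a <- a + lr * grad_a Q(s, a), with the gradient
    estimated by finite differences so no autograd framework is needed."""
    a = action.astype(np.float64)
    for _ in range(steps):
        q0 = q_fn(state, a)
        grad = np.zeros_like(a)
        for i in range(a.size):
            bumped = a.copy()
            bumped[i] += eps
            grad[i] = (q_fn(state, bumped) - q0) / eps
        a = a + lr * grad
    return a

# Toy critic: prefers actions near a state-dependent target (illustration only).
def toy_q(state: np.ndarray, action: np.ndarray) -> float:
    return float(-np.sum((action - 0.1 * state[:action.size]) ** 2))

s, a0 = np.ones(8), np.zeros(8)
a_refined = refine_action(toy_q, s, a0)
# In the pipeline described here, refined rollouts that actually succeed would
# be validated and added back into the fine-tuning set.
```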

03 Methodology

High-level flow: Input (images + language + robot state) → Multimodal encoder → Action expert (flow matching) with unified action space and control prompt → Safety/Progress/Guidance heads → Robot actions.
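That flow can be read as one control cycle. The sketch below is purely structural: the objects and method names (policy, jpm, planner, and their calls) are placeholders standing in for the components named in this section, not an actual API.

```python
def control_step(images, instruction, robot_state, policy, jpm, planner):
    """One illustrative control cycle following the high-level flow above.
    Every call here is a placeholder for a described component."""
    subtask = planner.current_subtask(instruction)          # high-level task planner
    tokens = policy.encode(images, subtask, robot_state)    # multimodal encoder
    pin = jpm.target_point(images, subtask)                 # optional 3D guidance point
    actions, progress, is_ood = policy.act(tokens, pin)     # flow-matching action expert
    if is_ood:
        actions = policy.nudge_toward_safe_states(actions)  # OOD safety correction
    if progress > 0.95:
        planner.advance()                                   # subtask judged complete
    return actions
```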

Step-by-step (with Sandwich explanations where new concepts appear):

  1. Inputs and Encoding 🍞 Hook: Like reading a recipe while looking at your kitchen counter and feeling where your hands are. 🄬 What: The model reads instructions, sees camera images, and feels joint states (proprioception), combining them into tokens.
  • How: A vision–language encoder (PaliGemma-based) fuses RGB, text, and state.
  • Why: Without shared context, actions won’t match goals or the scene. 🍞 Anchor: 'Pick the red mug' + seeing the table + arm angles = a clean plan.
  2. Unified Action Space + Control Prompt 🍞 Hook: Using a universal remote that announces which devices are on. 🄬 What: Predicts action chunks in a shared, masked action layout, guided by a control-type prompt.
  • How: Map native actions into slots; mask unused parts; the prompt announces the number of arms, hand type, joint/cartesian control, and mobile/static base.
  • Why: Prevents cross-embodiment conflicts and lets one policy drive many robots. 🍞 Anchor: Same policy controls a humanoid’s hands today and a mobile cobot’s gripper tomorrow.
  3. Data Curation and Temporal Alignment 🍞 Hook: Cleaning up your notes and setting a tempo before band practice. 🄬 What: DataQA filters noisy episodes; optical flow aligns speeds; curriculum sampling balances datasets over time.
  • How: Sharpness checks, tremble score, diversity metrics; resample with splines based on optical flow; gradually bias sampling from uniform to target mix.
  • Why: Messy, uneven data causes brittle behavior and poor transfer. 🍞 Anchor: Smooth, sharp, and diverse demos make learning stick.
  4. Speed-Conditioned Modulation 🍞 Hook: Switching between 'precision mode' and 'express mode.' 🄬 What: A scalar v tells the policy to move slowly for delicate contact or faster for gross motion.
  • How: Warp trajectories per-sample; modulate hidden states via learned gamma/beta of v.
  • Why: One-size speed fails; tasks need both zoomed-in and zoomed-out timing. 🍞 Anchor: Slow to grasp a grape, fast to move across the table.
  5. Guidance with JPM 🍞 Hook: Dropping a pin where the robot must touch. 🄬 What: A small head predicts a 2D affordance point; lifted to 3D using depth and camera pose; IK yields a feasible joint target; flow guidance steers actions toward it.
  • Why: Critical for disambiguating near-identical products or small targets. 🍞 Anchor: 'Pick J7 orange 0.5L' lands on the exact bottle, not the similar one.
  6. Progress and OOD Safety 🍞 Hook: A dashboard telling you '90% done' and 'Careful, off-route.' 🄬 What: The action expert predicts episode progress; a GMM over states detects OOD and nudges back toward safe regions.
  • How: Train progress as t/T; compute density p_train(s), and if low, add a small gradient step toward higher density.
  • Why: Reduces overrun and unsafe drifts during long horizons. 🍞 Anchor: Stop when finished; recover if the wrist veers into awkward angles.
  7. Task Planner on Top 🍞 Hook: A conductor cueing the orchestra section by section. 🄬 What: A high-level VLM parses the user goal, breaks it into subtasks, and queries the VLA loop; uses progress/OOD/guidance to advance or replan.
  • Why: Keeps execution faithful to instructions across multi-step sequences. 🍞 Anchor: 'Set the table' becomes 'pick plate,' 'place plate,' 'pick fork'… until done.
  8. Staged Training (L0→L1→R0→R1→R2) 🍞 Hook: School, then practice, then coaching, then competitions with scores. 🄬 What:
  • L0: Base VLM; L1: Web multimodal grounding; R0: Multi-embodiment robot pretrain; R1: Target robot fine-tune; R2: RL alignment.
  • Why: Each stage fixes a bottleneck—semantics, affordances, embodiment fit, and reward alignment. 🍞 Anchor: The same recipe turns a general learner into a reliable, real-world performer.
  9. RL Alignment (Two ways) 🍞 Hook: Tightening up a routine after watching replay footage and getting a coach’s score. 🄬 What:
  • Trajectory optimization with a Q-critic (IQL-style): use ∇_a Q to refine actions; validate; add improved rollouts; re-fine-tune.
  • Optimizing the source noise distribution: learn an actor to sample better noise seeds for flow-matching, improving returns without changing the base weights directly (a simplified sketch of this idea follows this list).
  • Why: Turns 'good imitator' into 'goal finisher' with better recovery and efficiency. 🍞 Anchor: Fewer drops, faster clears, more consistent success.
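As a simplified stand-in for the second route: the paper describes learning an actor over the flow-matching source noise, while the sketch below only does critic-guided selection over sampled noise seeds, with placeholder flow_policy and q_fn callables.

```python
import numpy as np

def best_noise_rollout(flow_policy, q_fn, state, n_candidates: int = 16,
                       action_dim: int = 8, seed: int = 0):
    """Critic-guided selection over source-noise samples: decode several noise
    seeds with the frozen flow-matching policy and keep the critic's favorite.
    A learned noise actor (as described above) would replace random sampling."""
    rng = np.random.default_rng(seed)
    best_action, best_q = None, -np.inf
    for _ in range(n_candidates):
        z = rng.normal(size=action_dim)     # source noise for the flow model
        action = flow_policy(state, z)      # frozen policy: noise -> action chunk
        q_value = q_fn(state, action)
        if q_value > best_q:
            best_action, best_q = action, q_value
    return best_action, best_q
```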

Secret Sauce:

  • Unified, masked action slots + explicit control prompting prevent cross-robot confusion.
  • Temporal alignment + speed conditioning let one policy scale from tiny finger motions to big sweeps.
  • JPM guidance + OOD guardrails give precision and safety in OOD scenes.
  • RL alignment upgrades imitation into robust, long-horizon execution.

04 Experiments & Results

The Tests and Why They Matter:

  • Table cleaning and specific object picking on a dual-arm mobile cobot (ALOHA-based): measures task-following accuracy and speed under clutter.
  • Simpler (Google Robot, WidowX): standardized long-horizon benchmarks; success requires stopping at the right time and avoiding undoing success.
  • CALVIN ABC→D: compositional, multi-step manipulation; rewards long chains and recovery.
  • Humanoid tasks (Green robot): full upper-body control with two arms and dexterous hands; pick/place, basket sorting, handovers, and table cleaning in OOD layouts.
  • E-commerce shelf picking: category vs exact-SKU vs OOD variants; tests JPM guidance on look-alike products.

Competitors:

  • π, GR00T N1, WALL-OSS, AgiBot GO-1, OpenVLA, RT-1X, Flower, and a MemoryVLA variant. These are strong, modern VLA baselines.

Scoreboard Highlights (with context):

  • ALOHA table-cleaning (single-target): Green-VLA (R0) reached 83.15% success on 'pick tape/pliers/screwdrivers' type tasks, versus π around 46.3%. That’s like jumping from a barely passing grade to an A.
  • Time-to-clear: Green-VLA finished in 1m35s versus some baselines taking over 5 minutes—like finishing your chores before your favorite show starts instead of missing it.
  • Simpler (Google Robot): R0 already matched or beat many pretrain-only baselines; with R1 and especially R2, average climbed to about 71.8%, pushing into top-tier range under the same steps—an A when others hover around C+/B-.
  • Simpler (WidowX): Moving from R0→R1→R2 lifted average success from mid-40s to ~79–92% per-task peaks and ~79–91.7% averages across categories—solid A-range consistency.
  • CALVIN: R2 raised average chain length and multi-step success beyond R1 and beyond a fine-tuned π baseline—fewer stumbles, longer clean runs.
  • E-commerce shelves: JPM guidance significantly raised exact-SKU and OOD success. It’s like recognizing the right book edition on a crowded shelf.
  • Humanoid: Zero-shot transfer plus R1 tuning produced robust pick/place, sorting, and handovers across OOD layouts, controlling 32 DoF with both arms and hands.

Surprising Findings:

  • Multi-embodiment pretraining (R0) alone was strong across many robots without per-robot fine-tuning—suggesting unified actions and aligned data unlock wide transfer.
  • Episode-end prediction mattered a lot: stopping at the right moment prevents 'fidget fails' that flip success into failure.
  • Speed conditioning enabled a practical trade-off at inference: faster clears on easy phases, delicate control where it counts, all in one policy.

Takeaway: Consistent actions + clean, time-aligned data + safety/guidance + RL alignment = state-of-the-art long-horizon performance with less data and real-time viability.

05 Discussion & Limitations

Limitations:

  • Retargeting fidelity: Mapping from many source robots to a humanoid is approximate; edge cases can feel 'off' for dexterous hands or unusual kinematics.
  • Dataset bias: Even after DataQA, some patterns dominate; rare skills or camera views may under-train.
  • Dexterous coverage: Complex in-hand manipulation still needs more breadth and depth.
  • Latency budget: Adding planners and guidance while keeping low-latency control is a careful engineering balance.

Required Resources:

  • Mid-scale compute (dozens of modern GPUs) for R0 pretraining and R2 alignment.
  • Multi-camera, time-synced logs with proprioception; depth for best JPM lifting.
  • Safety infrastructure for real-robot RL data collection.

When NOT to Use:

  • Ultra-tight safety-critical settings with no room for exploration or occasional correction (e.g., hazardous, human-close operations without safety cages).
  • Tasks demanding fine in-hand re-grasping beyond the training distribution without additional specialized data.
  • Embodiments with radically different affordances that cannot be meaningfully slotted into the unified action layout.

Open Questions:

  • Automated, on-the-fly selection of speed parameter v by a high-level policy.
  • Tighter fusion of fast reasoning (chain-of-thought) with low-latency control.
  • Better multilingual instruction grounding for global deployments.
  • Continual learning with safe, online preference/RL shaping without catastrophic forgetting.
  • More principled OOD handling that blends model uncertainty, state density, and human-in-the-loop prompts.

06 Conclusion & Future Work

3-Sentence Summary: Green-VLA is a staged Vision–Language–Action framework that cleans and aligns data, unifies actions across robots, and then uses rewards to align behavior with real tasks. It adds guidance, progress, and OOD safety so the same policy can control many embodiments reliably, from cobots to humanoids. The result is state-of-the-art long-horizon performance with less data and real-time practicality.

Main Achievement: A practical, unified recipe—quality-aligned data + unified action space + staged training + RL alignment—that turns heterogeneous demos into one generalist, deployment-ready robot policy.

Future Directions:

  • Multilingual instruction following to improve inclusivity and data efficiency.
  • Lightweight reasoning for task decomposition without latency spikes.
  • Safety-aware, continual RL with embodied memory and replay for longer, harder chores.

Why Remember This: It shows that structure beats scale alone: when you clean the data, agree on a shared action language, add safety/guidance, and finish with rewards, generalist robots become both capable and dependable in the real world.

Practical Applications

  • Warehouse picking and packing with exact-SKU selection in cluttered shelves.
  • Hospital supply delivery and safe handover to staff using clear language commands.
  • Home assistance for sorting laundry, setting tables, and tidying surfaces.
  • Retail restocking and returns processing where packaging changes over time.
  • Light assembly tasks that mix fast repositioning with precise insertions.
  • E-commerce micro-fulfillment with quick tote filling and careful item handling.
  • Kitchen prep: ingredient sorting, utensil fetching, and safe tool placement.
  • Education and research platforms that share one policy across many robot kits.
  • Event support: moving props, handing items to presenters, and clearing stages.
  • Hotel service robots for amenity delivery and guest-facing handovers.
#Vision-Language-Action #Unified Action Space #Multi-embodiment Pretraining #Data Curation #Temporal Alignment #Flow Matching #Guidance Module #Out-of-Distribution Detection #Reinforcement Learning Alignment #Implicit Q-Learning #Task Planner #Dexterous Manipulation #Humanoid Control #Optical Flow Resampling #Speed Conditioning