
Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge

Intermediate
Junjie Bai, Yu-Wei Chao, Qizhi Chen et al. · 12/10/2025
arXiv · PDF

Key Summary

  • This paper shows how to make home-helper robots better at long, multi-step chores by training on diverse tasks and by polishing the model after training using its own best attempts.
  • The team builds on a strong Vision-Language-Action backbone (pi 0.5) and feeds it human demos, motion-planner paths, and offline RL rollouts to cover many ways to act.
  • They use Rejection Sampling Fine-Tuning (RFT): perturb the start, roll out the policy, keep only the successful tries, and retrain, creating a data flywheel for robustness.
  • Scaling pre-training from a few tasks to all 50 challenge tasks boosts generalization, raising the validation Q-score from 0.192 to 0.224 after initial post-training and then to 0.345 with improved balancing.
  • In the official 2025 BEHAVIOR Challenge, their system placed a very close second with a test Q-score of 0.2514, outperforming most teams by a wide margin.
  • Inference choices matter a lot: a receding horizon of 32 steps, high-resolution images (head 720p, wrist 480p), and absolute-joint actions at 30 Hz with proprioception all help significantly.
  • Some expected wins were smaller: point clouds helped only a little over RGB, and overly long action horizons actually hurt performance.
  • A theoretical best score of 0.611 from aggregating historical checkpoints shows big headroom, suggesting that further scaling and smarter post-training could yield large gains.
  • The approach avoids slow, complex online RL in this simulator and offers practical lessons for adapting foundation models to long-horizon, human-centered tasks.
  • Real-world impact: more reliable robots for tidying, cooking assistance, and household organization by chaining many small skills without falling apart.

Why This Research Matters

Robots that can finish whole chores, not just single moves, are far more useful in homes, hospitals, and warehouses. This work offers a practical way to get there by avoiding slow online RL and instead using a data flywheel of the robot’s own best attempts. The lessons—broaden pre-training, keep only good rollouts, and use tight feedback with clear vision—translate to many embodied AI problems. Better long-horizon reliability means fewer mid-task failures, safer behavior around people, and more trust. As foundation models spread into physical spaces, these strategies help bridge the gap from neat demos to real, everyday assistance.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you have to clean your room, make a snack, and pack your school bag—all before the bus arrives. That’s not one action; it’s a bunch of small steps that need to go in the right order without messing up.

🄬 The Concept (Vision-Language-Action, VLA): What it is: A VLA model is a robot brain that looks (vision), listens/reads (language), and moves (action) all in one system so it can follow instructions and act in the world. How it works: 1) See camera images, 2) Read the task instruction, 3) Mix what it sees and reads into a shared understanding, 4) Output continuous control actions for the robot’s body. Why it matters: Without VLA, robots need many brittle, hand-made parts that often break when the world changes a little.

šŸž Anchor: Tell a robot, ā€œPut the red cup in the sink.ā€ VLA helps it find the red cup in the picture, understand the instruction, and move its arm to carry the cup to the sink smoothly.

šŸž Hook: You know how building a LEGO castle takes lots of steps? If you mess up step 3, step 8 won’t fit anymore.

🄬 The Concept (Long-horizon Manipulation): What it is: Long-horizon manipulation is when a robot must complete many interlocking steps over a long time—like a whole chore, not just a single grab. How it works: 1) Plan action sequences that mix navigation and hand use, 2) Keep re-checking what happened after each step, 3) Adjust the next steps so small mistakes don’t snowball. Why it matters: Without strong long-horizon skills, robots start okay but drift off-track as errors pile up across many steps.

šŸž Anchor: ā€œMake microwave popcornā€ means: go to kitchen, find bag, open microwave, put bag in, press buttons, wait, and take it out. If you miss the step ā€œopen microwave,ā€ everything after fails.

šŸž Hook: Think of a teacher who gives you a score not only for finishing, but also for getting key parts right along the way.

🄬 The Concept (Q-score): What it is: Q-score is the challenge’s report card that rewards completing tasks and important subgoals, not just one final success. How it works: 1) Check if the robot meets goal rules (like ā€œobject is in container,ā€ ā€œdoor is openā€), 2) Tally partial and full completions, 3) Average across tasks and instances. Why it matters: Pure success/fail throws away information; Q-score shows steady progress and robustness across long tasks.

šŸž Anchor: If the robot turns on the radio but doesn’t put it back on the shelf, Q-score still recognizes partial success instead of a harsh 0.

šŸž Hook: When you practice piano, you don’t just learn one song; learning many songs helps you play new ones.

🄬 The Concept (Pre-training vs Post-training): What it is: Pre-training teaches the robot broad skills from many tasks; post-training polishes those skills for the target setting. How it works: Pre-training: 1) Mix demos from many tasks, 2) Train a single end-to-end model; Post-training: 1) Roll out the model, 2) Use successes to refine, 3) Repeat. Why it matters: Without broad pre-training, the robot overfits; without polishing, it stays brittle.

šŸž Anchor: Learning to cook many recipes (pre-training) makes you adaptable; practicing tonight’s dinner twice (post-training) makes it reliable.

The world before: Robots could follow short instructions well (like ā€œpick this upā€), thanks to early VLAs such as RT-1/RT-2 and open models like Octo/OpenVLA. But they stumbled on chores needing 5–10 different skills chained across hundreds of decision steps in cluttered, realistic homes.

The problem: Errors multiply over time. If the gripper is 1 cm off now, the next move might be 2 cm off, and after 50 steps the task fails. Previous fixes split tasks into subtasks with separate little policies, but those struggled at the hand-offs—called the skill-chaining problem—where transitions break and mistakes grow. Other attempts used online reinforcement learning, but the BEHAVIOR simulator is slow: one task can take hours to a day, and training needs different GPU types for sim and learning, making it impractical.

The gap: A simple, scalable, end-to-end recipe to push a strong general VLA to reliably handle long-horizon home tasks—without complicated online RL or fragile modular pipelines.

Real stakes: Home robots that can truly help—tidy rooms, manage kitchens, organize storage—require many steps that stay stable over time. That’s what this work targets by pairing diverse pre-training with a practical, robust post-training flywheel.

02 Core Idea

šŸž Hook: Imagine learning to skateboard: you watch lots of people (diverse examples), then you practice your own runs and keep only the clean landings to study and improve.

🄬 The Concept (The Aha!): Key insight in one sentence: Scale pre-training coverage across many tasks and then use rejection-sampling fine-tuning to build a self-improving pool of only successful rollouts, plus the right inference tricks, to make long-horizon control reliable—without online RL.

Multiple analogies:

  1. Orchestra: Pre-training is rehearsing many songs so the orchestra knows many patterns; RFT is recording the best takes and studying only those to polish timing and transitions.
  2. Cooking: Learn many recipes (pre-train), then for a special dinner, you repeat the best trial runs and tweak details like oven temp (RFT and inference settings) until everything comes out perfect.
  3. Hiking: Pre-train by walking many trails to build stamina and map sense; after that, you replay only the successful routes (RFT) and choose the right stride length and pace (action horizon, control mode) to finish the long trek.

Why it works (intuition):

  • Broader pre-training reduces surprise. Seeing all 50 tasks—easy and hard—teaches the model the full spectrum of scenes, objects, and interactions.
  • RFT forms a data flywheel: roll out → keep only good episodes → retrain → roll out better. This filters noise and compounds robustness without the heavy machinery of online RL.
  • Good inference choices (like receding horizon re-planning and higher-resolution vision) keep feedback tight and perception crisp, making each step trustworthy, so small errors don’t snowball.

Before vs After:

  • Before: Single-task fine-tuning did okay on a couple of tasks but failed to generalize. Online RL was too slow and cumbersome in this simulator.
  • After: Multi-task pre-training (pt7 → pt10 → pt50) unlocked more tasks; RFT lifted Q-scores significantly; careful inference doubled success on some tasks.

Building blocks (with mini Sandwich blocks):

šŸž Hook: You know how practicing only the parts you played well can make those parts amazing really fast? 🄬 The Concept (RFT): What it is: Rejection Sampling Fine-Tuning keeps only the model’s successful rollouts and fine-tunes on them. How it works: 1) Slightly perturb starting poses, 2) Roll out the current policy, 3) Keep successful episodes, 4) Retrain, 5) Repeat for N rounds. Why it matters: Without filtering, the model learns from messy, failing attempts and can get confused. šŸž Anchor: Like keeping only the correctly solved math problems in your study guide so your final review is super clean.

šŸž Hook: Imagine choosing how far ahead to plan your bike ride before checking your map again. 🄬 The Concept (Receding Horizon Control): What it is: Plan a short action segment, execute it fully, then re-plan based on the new state. How it works: 1) Predict actions for a near future (e.g., 32 steps), 2) Execute them, 3) Observe new images and states, 4) Predict the next segment. Why it matters: Without frequent feedback, errors compound and the robot drifts off-target. šŸž Anchor: You steer a bit, look again, steer a bit more—rather than guessing the entire ride at once.

šŸž Hook: Think of camera megapixels helping you spot your friend in a crowd. 🄬 The Concept (High-resolution Perception): What it is: Using sharper head and wrist images to see small, critical details. How it works: 1) Capture higher-res frames (e.g., 720p head, 480p wrist), 2) Feed them to the visual encoder, 3) Let the policy target precise grasps and placements. Why it matters: Low-res vision misses fine details that break manipulation. šŸž Anchor: It’s easier to pick up the tiny LEGO piece when your camera view is crisp.

With these pieces together, the team lifts validation Q-score to 0.345 and places a close second on the challenge test set at 0.2514, showing a practical path to reliable long-horizon behavior.

03 Methodology

High-level recipe: Instruction + Multi-view images + Robot state → Pre-train VLA (pi 0.5) on diverse tasks → Post-train with RFT flywheel → Inference with receding horizon and tuned settings → Actions for navigation and manipulation.

Inputs:

  • Vision: Head and wrist RGB (optionally depth/point clouds); best performance used higher resolutions (head ~720p, wrist ~480p).
  • Language: Task instruction describing the household activity.
  • State: Proprioception (e.g., joint positions), which proved important.

Backbone: pi 0.5 (a VLA transformer) that fuses vision and language features and outputs continuous robot actions. Data included roughly 1.1K hours of human demos plus additional planner and offline RL trajectories; later, ~3.6K extra planner/RL trajectories expanded coverage.

Step A — Multi-task pre-training:

  • What happens: Train a single end-to-end policy on subsets of the 50 BEHAVIOR Challenge tasks: pt1 (single), pt7, pt10, and pt50 (all). Keep the training recipe fixed to isolate the effect of task coverage.
  • Why it exists: Broaden the model’s experience so it generalizes to many object layouts, rooms, and skill combinations.
  • Example: With only pt1 (one task), the model overfits and only reliably solves 2 tasks. With pt10 and pt50, many more tasks start working because the model has seen a fuller variety of kitchens, living rooms, and tools.
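
In data terms, widening coverage simply means pooling demonstrations from every task before training. A sketch under the assumption that demos are grouped per task; the helper name and the optional per-task cap are illustrative, not the authors' exact recipe:

```python
import random

def build_multitask_mixture(demos_by_task, cap_per_task=None, seed=0):
    """Pool demonstrations from every task (pt50 = all 50 challenge tasks)
    instead of a single task (pt1), optionally capping each task so frequent
    tasks do not drown out rare ones, then shuffle so batches stay diverse."""
    rng = random.Random(seed)
    mixture = []
    for task, demos in demos_by_task.items():
        selected = list(demos)
        if cap_per_task is not None and len(selected) > cap_per_task:
            selected = rng.sample(selected, cap_per_task)
        mixture.extend((task, demo) for demo in selected)
    rng.shuffle(mixture)
    return mixture
```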

Step B — RFT post-training (the flywheel):

  • What happens: Start from the pre-trained policy; for each round (N=3), randomly perturb initial robot poses, roll out across scenes using both train and validation human demos as seeds, keep only successful episodes, then fine-tune. Each round collects ~8,500 rollouts on average; after de-dup and balancing, 1,469 top-quality episodes were used for training.
  • Why it exists: Efficiently improve robustness without slow, complex online RL. Keeping only successful rollouts raises signal-to-noise.
  • Example: For ā€œturning_on_radio,ā€ the robot practices many starts (slightly moved base or wrist). Only runs that actually turn it on are kept; retraining on those helps it handle varied start poses next time.
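
The flywheel in Step B can be sketched as a simple collect-and-filter loop. The environment helpers, episode fields, and noise scale below are hypothetical stand-ins that show the logic, not the authors' implementation:

```python
def rft_collect_round(policy, make_env, seed_states, rollouts_per_seed=10, pose_noise=0.02):
    """One data-collection round of rejection-sampling fine-tuning:
    perturb the start, roll out the current policy, keep only successes."""
    kept = []
    for seed in seed_states:
        for _ in range(rollouts_per_seed):
            env = make_env(seed)
            env.perturb_initial_pose(scale=pose_noise)  # slightly shift base/arm start
            episode = env.rollout(policy)               # run until success or timeout
            if episode.success:                         # rejection step: discard failures
                kept.append(episode)
    return kept

# Outer flywheel (the paper runs N = 3 rounds): collect -> filter -> fine-tune -> repeat.
# for _ in range(3):
#     successes = rft_collect_round(policy, make_env, seed_states)
#     policy = fine_tune(policy, deduplicate_and_balance(successes))
```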

Step C — Inference strategy (control and sensing choices):

  • Control mode: Receding Horizon outperformed Temporal Ensemble/Receding Temporal (which had near-zero success). Execute an action segment fully, then re-plan with fresh observations.
  • Action horizon: Non-monotonic effect—too short misses multi-step intent; too long couples errors. 32-step horizon worked best.
  • Perception: Higher image resolution more than doubled success on a test task. Point clouds helped only modestly over RGB, and at notable compute/latency cost.
  • Action representation and rate: Absolute joint commands at 30 Hz with proprioception beat delta joints or 15 Hz subsampling.
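
These choices can be summarized as a small configuration sketch; the key names are invented for this article, while the values follow the ablations above (720p/480p are assumed to mean 1280Ɨ720 and 640Ɨ480):

```python
INFERENCE_CONFIG = {
    "control_mode": "receding_horizon",      # temporal-ensemble variants had near-zero success
    "action_horizon": 32,                    # shorter is myopic, longer compounds errors
    "head_camera_resolution": (1280, 720),   # ~720p head view
    "wrist_camera_resolution": (640, 480),   # ~480p wrist view
    "action_space": "absolute_joint",        # outperformed delta-joint commands
    "control_rate_hz": 30,                   # 15 Hz subsampling hurt performance
    "use_proprioception": True,              # dropping state input tanked performance
}
```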

Step D — Evaluation:

  • Metric: Q-score (plus full task success). Evaluate across 50 tasks with realistic, interactive 3D scenes and goal predicates (BDDL) emphasizing complete activities.
  • Infrastructure: Training in JAX on NVIDIA H200 GPUs; multi-GPU parallel rollouts across 10 instances to overcome slow simulation.
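
Because one rollout can take hours, evaluation only becomes practical when many instances run at once. A minimal sketch of that idea using Python's standard library; the worker below is a placeholder, not the authors' JAX/H200 pipeline:

```python
from concurrent.futures import ProcessPoolExecutor

def rollout_instance(task_instance):
    """Placeholder worker: load the policy and simulator for one task
    instance, run it to completion, and return its Q-score contribution."""
    ...  # a real worker would run the simulator here
    return 0.0

def evaluate_in_parallel(task_instances, num_workers=10):
    """Spread slow rollouts across worker processes and average the scores."""
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        scores = list(pool.map(rollout_instance, task_instances))
    return sum(scores) / len(scores) if scores else 0.0
```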

What would break without each step?

  • No multi-task pre-training: brittle behavior, poor generalization (pt1 result: very few successes).
  • No RFT: model keeps repeating mistakes; lacks a high-quality, self-curated dataset to lock in robust behaviors.
  • No receding horizon: open-loop drift; small errors snowball.
  • Low-res vision/no proprioception: misses fine details and body context; grasp and placement fail more often.
  • Wrong action horizon or sampling: either too myopic or too wobbly, hurting stability and dexterity.

Concrete mini-examples:

  • ā€œTurning_on_radioā€ ablation: Going from 224Ɨ224 images to 720p head + 480p wrist raised success from ~0.30 to ~0.60 under the best settings; receding horizon was essential.
  • ā€œBringing_waterā€: With pt50 pre-training plus RFT, the robot better navigates to the sink, opens/uses containers, and adapts if starting a bit misaligned.

Secret sauce: The combination of (1) broad pre-training coverage, (2) an RFT flywheel that self-selects only good episodes, and (3) inference-time choices that keep tight feedback and clear vision. Each adds robustness; together, they convert a strong but generic VLA into a long-horizon finisher.

04 Experiments & Results

The test: Measure Q-score and full task success over 50 long-horizon household tasks from BEHAVIOR-1K under a standardized protocol that rewards complete activities and partial goal progress. This matters because long tasks have many chances to go wrong; a good metric should recognize steady, reliable execution.

The competition: Compared with other top teams in the 2025 BEHAVIOR Challenge. The Comet entry placed a very close second on the official test set with Q=0.2514, while the winner achieved Q=0.2599; other teams trailed notably. Post-challenge refinements (validated offline) raised Comet’s validation Q to 0.345, well above previous public results.

Scoreboard with context:

  • Pre-training only: validation Q=0.192. That’s like passing, but shaky on hard chores.
  • During-challenge post-training: validation Qā‰ˆ0.224. Like improving from a C to a solid C+.
  • Post-challenge refined post-training: validation Q=0.345. That’s like boosting to a strong B+, clearly above most peers.
  • Official test ranking: Q=0.2514 (2nd place), very close to the top.
  • Theoretical best (aggregate across historical checkpoints): validation Q=0.611 and success rate 0.35—showing large headroom if selection and scaling improve.

Surprising findings:

  • Control mode matters a lot: Temporal averaging strategies failed; receding horizon planning was key for closed-loop stability.
  • Action horizon is a Goldilocks choice: medium (32) was best; too long tangled dependencies and hurt control.
  • High-res images gave big gains; point clouds helped little versus their compute/latency cost.
  • Absolute joint actions at 30 Hz with proprioception beat delta actions or lower sampling; removing state input tanked performance.

Skill and data insights:

  • Dataset imbalance: moves like ā€œmove toā€ (~33%) and ā€œpick up fromā€ (~24%) dominate frames, with a long tail of rarer skills. This imbalance means the model must learn common moves well while not forgetting rare sequences that unlock whole tasks.
  • Task complexity: Some tasks require 5–10 distinct skills over hundreds of frames, with tough cases mixing navigation, opening/closing containers, and multi-object rearrangement—prime stress tests for long-horizon robustness.

Where it worked best: Easier tasks like ā€œturning_on_radioā€ became robust; many medium tasks improved, and overall the system could complete 22/50 tasks with around 15% validation success rate across all tasks in the post-challenge analysis.

Interpretation: The gains aren’t from one trick but from stacking the right ingredients: broader pre-training, the RFT flywheel, and inference tuning. The large theoretical headroom suggests model scaling and smarter data selection could yield the next jump.

05 Discussion & Limitations

Limitations:

  • Sampling efficiency: RFT keeps only successes; collecting enough of them in a slow simulator is time-consuming.
  • Simulator slowness and compute: BEHAVIOR rollouts can take hours; even with parallelization, iteration speed is limited.
  • Coverage gaps: Due to time limits, not every task received equal evaluation; rare, complex skills still pose challenges.
  • Modest 3D gains: Point clouds helped little relative to cost, leaving room for better geometry handling.

Required resources:

  • Strong GPUs (e.g., NVIDIA H200), multi-GPU parallel rollout setup, and storage for large multi-modal datasets.
  • Expertise to manage data balance, checkpoint selection, and inference-time tuning (horizon length, resolutions, action rates).

When not to use:

  • Ultra-low-latency, low-power robots that cannot afford high-res perception or 30 Hz control.
  • Settings needing rich tactile feedback or precise contact dynamics not represented in the dataset.
  • Rapidly changing real homes without domain adaptation; sim-to-real gaps may require extra transfer steps.

Open questions:

  • Can we improve sampling efficiency with DAgger-like on-policy distillation or leveraging privileged experts to label corrections?
  • How to pick or weight episodes smarter than simple success/fail—e.g., curriculum by difficulty, diversity-aware selection, or subgoal coverage metrics?
  • Will scaling model size and richer post-training objectives (including negatives, or reward-shaping via RL) unlock the theoretical best (~0.611) in practice?
  • How to represent 3D structure efficiently so gains over RGB are consistent without heavy overhead?
  • Can we formalize horizon tuning (32 steps here) with adaptive controllers that adjust horizon by uncertainty?

06 Conclusion & Future Work

Three-sentence summary: This work adapts a strong VLA backbone (pi 0.5) to long-horizon household tasks by scaling pre-training across many tasks and using a rejection-sampling fine-tuning flywheel to lock in robust behaviors. Together with smart inference choices—receding horizon planning, higher-resolution vision, and well-chosen action formats—the system achieves a close 2nd-place test result (Q=0.2514) and a much higher post-challenge validation Q=0.345. Theoretical headroom of 0.611 suggests substantial gains are still available via better selection, scaling, and objectives.

Main achievement: Showing a practical, scalable path to reliable long-horizon control without online RL in a slow simulator—by combining broad pre-training, self-curated successful rollouts, and carefully tuned inference.

Future directions: Improve data efficiency with on-policy distillation/DAgger, integrate richer post-training signals (including negatives/rewards), explore adaptive horizons and uncertainty-aware control, and scale model capacity while making 3D perception more efficient. Also, develop curricula that systematically cover rare skills and difficult transitions.

Why remember this: It’s a clear blueprint for turning a general-purpose foundation policy into a dependable long-horizon robot—learn broadly, keep only your best tries, and plan in short, sharp steps with clear vision. That recipe pushes us closer to robots that can complete real multi-step chores in human homes.

Practical Applications

  • Household assistants that can tidy rooms end-to-end (navigate, sort items, place them correctly).
  • Kitchen helpers that can prepare simple foods or beverages with reliable multi-step sequences.
  • Hospital logistics robots that fetch supplies across departments and deliver them accurately.
  • Retail stock bots that restock shelves by navigating aisles, opening containers, and placing goods.
  • Facility maintenance tasks like turning devices on/off, operating doors, and organizing tools.
  • Warehouse kitting workflows that chain pick, place, and packing reliably over long runs.
  • Elder-care support for fetching objects or setting up daily-use stations (e.g., coffee/tea).
  • Educational robotics kits that demonstrate long-horizon planning and feedback control.
  • Simulation-to-real transfer pipelines that use RFT to harden policies before deployment.
#Vision-Language-Action #long-horizon manipulation #rejection sampling fine-tuning #receding horizon control #foundation models for robotics #embodied AI #BEHAVIOR-1K #offline RL #multi-task pre-training #ablation study #robotic policy learning #high-resolution perception #skill chaining #pi 0.5