Act2Goal: From World Model To General Goal-conditioned Policy
Key Summary
- Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
- Act2Goal teaches a robot to first imagine a picture-by-picture path from now to the goal (a visual plan) and then act using two time scales at once: quick nearby steps and slower faraway anchors.
- This two-speed idea is called Multi-Scale Temporal Hashing (MSTH), which keeps the robot steady on long missions while letting it react to surprises.
- The robot's action brain listens to the imagined visual plan through cross-attention, so pictures guide motions end to end.
- After offline training from demonstrations, the robot can self-improve online without rewards by relabeling what it actually reached as the goal (HER) and quickly fine-tuning small adapters (LoRA).
- In simulation, Act2Goal beat strong baselines across easy and hard tasks; in the real world it wrote words, plated desserts, and did plug-in operations with big gains.
- On tough, out-of-distribution tasks, short online self-practice raised success from 30% to 90% within minutes.
- An ablation showed MSTH is the key for long words and long chores: without it, performance drops sharply as tasks get longer.
- The method works zero-shot on new objects and layouts, showing strong generalization beyond the training data.
- Act2Goal suggests that "imagine first, then act at two speeds" is a simple, powerful recipe for long-horizon robot control.
Why This Research Matters
Robots that can imagine the path to a goal and act at two speeds are far better at long, real-world chores like arranging objects, writing, or plugging parts together. Because Act2Goal adapts in minutes without rewards or labels, it reduces costly engineering and supervision in homes, hospitals, and factories. The approach is robust to new objects and layouts, so one model can serve many tasks rather than needing a new policy for each. Its visual planning makes goals easy to specify (just show a picture of success) and avoids fragile language ambiguities. Overall, this brings practical, generalist robot helpers closer to everyday use, where reliability over long tasks truly matters.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how building a LEGO castle takes many steps? You don't just stare at the finished picture on the box and place one brick hoping it works. You plan small sub-steps (find the door, build one wall) and check progress as you go.
🥬 The Concept (The World Before): Robots that follow goals from pictures, called goal-conditioned policies, got good at short chores (like nudging a block) but stumbled on long ones (like setting a table). They usually looked at the current camera view and the final goal image and then tried to predict the next action directly. That's like guessing your next LEGO piece by only looking at the box art, without imagining the steps in between. Without a sense of progress, these robots often drifted off course, especially when the world changed a bit (a cup moved, lighting shifted) or when tasks took many steps.
Why It Matters: In homes, hospitals, and factories, real chores are long and messy: arranging food, plugging cables, assembling parts. A robot that can't keep a steady direction over many steps needs lots of human fixes, making it slow and costly.
🍞 Anchor: Imagine asking a robot, "Make this dessert plate look like the photo." If it can't picture the middle steps (pick up the strawberry, place it left of the cake, rotate the fork), it'll likely misplace items and never quite match the example.
🍞 Hook: Picture your school project timeline. If you only plan today and ignore next week's due date, you might do neat work now but miss the big picture later.
🥬 The Concept (The Problem): Standard goal-conditioned policies predict single-step actions without an explicit model of how scenes evolve toward the goal. With no map of intermediate states, they can't tell if an action truly helps progress. They also face a tug-of-war: long-term planning gives direction but is brittle to mistakes; short-term control is reactive but can lose the goal over time. Prior attempts tried keyframes, stronger supervision, or language instructions, but either needed lots of handholding, couldn't align visuals precisely, or broke on long horizons.
Why It Matters: Long tasks pile up small errors. Without a structure that balances big-picture guidance and quick corrections, performance craters on unfamiliar setups.
🍞 Anchor: A robot writing a word on a whiteboard may start neat but drift letter by letter without a long-horizon anchor, ending with a squiggly mess.
🍞 Hook: Think of a GPS that shows both your next turn and the whole route. You need both the zoomed-in street and the zoomed-out map.
🥬 The Concept (The Gap): What was missing was an explicit, visual plan that bridges now to the goal, plus a way to use it at two time scales: dense, nearby steps for fine control; sparse, far anchors for staying on course. Also missing: a lightweight way to self-improve online without reward engineering or human labels.
Why It Matters: If a robot can imagine plausible middle pictures and act at two speeds, it can stay aligned with the goal yet recover from bumps, making it useful in the real world.
🍞 Anchor: With a route preview and turn-by-turn nav, you're less likely to take a wrong exit and can still handle a sudden roadblock.
🍞 Hook: After a sports match, teams watch replays to learn, even from losses.
🥬 The Concept (The Stakes): Robots that can learn from their own attempts (without scores or coaches) adapt quickly on-site. If they can relabel what they actually reached as the goal and fine-tune small parts of their brain, they'll get better in minutes, not weeks.
Why It Matters: Fast, reward-free adaptation means less downtime and fewer costly annotations.
🍞 Anchor: In tests, this paper's robot boosted success on hard, unfamiliar tasks from 30% to 90% within minutes of practice, showing real-world impact beyond lab demos.
02 Core Idea
🍞 Hook: Imagine you're drawing a tricky picture. First, you lightly sketch the main shapes (big plan), then you add careful details (fine moves), correcting as you go.
🥬 The Concept (Aha in One Sentence): Act2Goal makes the robot first imagine a visual path from now to the goal and then act at two speeds (dense nearby steps and sparse far anchors) while quickly learning from its own attempts.
How It Works (High Level):
- A goal-conditioned visual world model imagines intermediate pictures linking the current view to the goal image.
- Multi-Scale Temporal Hashing (MSTH) splits this imagined movie into proximal frames (close in time, many) and distal frames (far in time, few).
- An action expert, guided by cross-attention to the imagined frames, outputs dense, executable actions for the near future while keeping the long goal in mind.
- Online, the robot uses Hindsight Experience Replay to relabel what it actually achieved as the goal and rapidly fine-tunes small LoRA adapters; no rewards needed.
Why It Matters: Without the imagined plan and two-speed control, the robot either overreacts locally and forgets the goal, or follows a brittle plan and breaks on surprises.
🍞 Anchor: It's like tracing a maze: glance at the end to aim your path (distal anchors), but move your pencil carefully corner by corner (proximal control), adjusting if your hand shakes.
Multiple Analogies:
- Map Analogy: The world model is your route preview; proximal frames are street-level turns; distal frames are the highway exits that keep you headed to the city.
- Comic Analogy: The world model draws a comic strip from start to finish. Proximal panels guide your next stroke; distal milestone panels ensure the story ends right.
- Music Analogy: The distal frames are the songās chorus you must hit; proximal frames are the beats you tap right now to stay in rhythm.
Before vs After:
- Before: Directly mapping (now, goal) → next action often wobbled on long, new tasks and needed heavy supervision.
- After: Imagining the path and acting at two time scales yields coherent long-horizon behavior that still adapts mid-flight, generalizing across new objects and scenes.
Why It Works (Intuition, No Math):
- Visual Imagination grounds actions in concrete, goal-aligned future snapshots, so the robot "knows" what progress looks like.
- Two-Scale Time keeps balance: proximal steps enable tight closed-loop corrections; distal anchors prevent drift and keep the big picture intact.
- Cross-Attention lets the action expert "look at" relevant imagined frames, fusing seeing and doing.
- Hindsight + LoRA lets the system learn quickly from any rollout (even failures) with tiny updates on-device.
Building Blocks (Recipe Pieces):
- 🍞 Hook: Remember sorting tasks by "do now" and "do later"?
  🥬 MSTH splits time into dense-near and sparse-far bins to structure control.
  🍞 Anchor: Like a planner with a today list and a semester plan.
- 🍞 Hook: Think of daydreaming the steps before you try a skateboard trick.
  🥬 A goal-conditioned world model predicts middle pictures between now and the goal.
  🍞 Anchor: Like ghost images of where your feet and board should be next.
- 🍞 Hook: A chef watches the recipe photos while cooking.
  🥬 Cross-attention in the action expert focuses on the right imagined frames at the right time.
  🍞 Anchor: You glance at the photo to ensure the garnish matches the final plate.
- 🍞 Hook: Practice makes perfect; even failed tries teach you where to adjust.
  🥬 HER relabels what you actually reached as the goal; LoRA fine-tunes small adapters fast.
  🍞 Anchor: Like adjusting your basketball shot after watching your own replay, without a coach scoring you.
03 Methodology
At a high level: Input (current images + goal image + robot state) → Goal-Conditioned World Model imagines MSTH visual frames → Action Expert cross-attends to these frames → Output dense proximal actions (execute) + distal actions (guidance only).
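To make that data flow concrete, here is a toy, runnable sketch of the pipeline. The `encode`, `world_model`, and `action_expert` functions, the latent sizes, and the 7-DoF action shape are all hypothetical stand-ins for the paper's components, not their implementation.

```python
import numpy as np

def encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the VAE encoder: image -> compact latent vector."""
    return image.mean(axis=-1).flatten()[:64]       # toy 64-dim latent

def world_model(z_now: np.ndarray, z_goal: np.ndarray, n_frames: int = 16):
    """Stand-in for the goal-conditioned world model: here the imagined
    trajectory is just a linear interpolation from now to goal."""
    ts = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1 - ts) * z_now + ts * z_goal           # (n_frames, 64) latents

def action_expert(state: np.ndarray, frames: np.ndarray) -> np.ndarray:
    """Stand-in for the action DiT: map imagined frames to a dense chunk."""
    return frames[:8, :7]                           # toy (8, 7) action chunk

obs, goal = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
z_now, z_goal = encode(obs), encode(goal)           # Step 1: sense + encode
plan = world_model(z_now, z_goal)                   # Step 2: imagine the path
actions = action_expert(np.zeros(7), plan)          # Step 4: plan -> motions
print(actions.shape)                                # (8, 7): 8 steps, 7-DoF
```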
Step 1. Sense and Prepare the Inputs
- What happens: The robot captures multi-view camera frames of the current scene and also receives a goal image that shows what success looks like. The robot state (like arm pose) is included. A video encoder compresses images into compact latent features using a VAE.
- Why it exists: Raw images are big and slow to process; latents are like meaningful short notes that are fast to plan with. Without this, generation and control would be too heavy.
- Example: Current view shows a cup on the right; goal image shows the cup centered on a plate. The encoder turns both images into small latent tensors that keep shapes and positions.
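As a rough illustration of this compression step, the sketch below uses a tiny convolutional encoder in the spirit of a VAE encoder. The architecture (channel counts, strides, a 224x224 input) is an illustrative assumption, not the paper's actual encoder.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy convolutional encoder: one RGB frame -> a small latent grid."""
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.net = nn.Sequential(            # 3x224x224 -> latent_ch x 28x28
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 4, stride=2, padding=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.net(img)

enc = TinyEncoder()
frame = torch.rand(1, 3, 224, 224)           # one camera view
latent = enc(frame)
print(frame.numel() / latent.numel())        # ~48x fewer numbers to plan with
```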
Step 2. Imagine the Visual Future with a Goal-Conditioned World Model
- What happens: Using flow matching (a method to turn noise into structured samples), the model generates a sequence of latent frames that plausibly connect the current view to the goal. It is purely vision-conditioned: no language, just images. During inference, it refines noisy latents over several steps guided by a learned vector field.
- Why it exists: This creates an explicit picture-by-picture plan. Without imagination, the robot can't "see" progress and is prone to drift.
- Example: The imagined frames show the arm reaching, grasping the cup, lifting it, moving left, then placing it on the plate.
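The sketch below shows the general shape of flow-matching inference: start from noise and integrate a learned velocity field with a few Euler steps. The `velocity` function here is a toy stand-in for the learned vector field, not the trained model.

```python
import torch

def velocity(z_t: torch.Tensor, t: float, cond: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the learned vector field v(z_t, t | now, goal):
    simply point from the current noisy sample toward the conditioning."""
    return cond - z_t

def flow_match_sample(cond: torch.Tensor, steps: int = 10) -> torch.Tensor:
    z = torch.randn_like(cond)               # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):                   # Euler integration of dz/dt = v
        z = z + dt * velocity(z, i * dt, cond)
    return z                                 # refined latent frames

goal_latents = torch.rand(16, 64)            # target: 16 imagined latent frames
frames = flow_match_sample(goal_latents)
print(frames.shape)                          # torch.Size([16, 64])
```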
Step 3. Multi-Scale Temporal Hashing (MSTH) to Structure Time
- What happens: The imagined trajectory is split into two parts:
- Proximal frames: dense, near-future frames for fine-grained control.
- Distal frames: a few, far-future frames sampled with growing spacing (logarithmic) to anchor the long-term plan.
Actions follow the same idea: proximal actions are predicted at every tiny step (and are executed), while distal actions are predicted only at anchor times (not executed, just guidance).
- Why it exists: Long tasks need both precision now and direction later. Without MSTH, either you overfit to tiny steps and forget the goal, or you follow a long plan that breaks on small bumps.
- Example: Proximal: 50 small moves to draw the next letter stroke. Distal: 9 anchor points that keep the word aligned on the line and spaced evenly.
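A minimal sketch of the MSTH split described above: dense proximal indices plus a few logarithmically spaced distal anchors. The counts (50 proximal, 9 distal) echo the writing example; the exact spacing rule is an assumption.

```python
import numpy as np

def msth_indices(horizon: int, n_proximal: int = 50, n_distal: int = 9):
    # Proximal: every step in the near-future window (these get executed).
    proximal = np.arange(1, n_proximal + 1)
    # Distal: log-spaced anchors from just past the proximal window out to
    # the horizon (these guide but are never executed).
    distal = np.unique(np.round(
        np.geomspace(n_proximal + 1, horizon, n_distal)).astype(int))
    return proximal, distal

prox, dist = msth_indices(horizon=1000)
print(prox[:5])   # [1 2 3 4 5] -> dense, fine-grained control
print(dist)       # 9 anchors whose spacing grows toward the horizon
```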
Step 4. Turn Imagined Pictures into Motions via an Action Expert
- What happens: An action DiT (Transformer) predicts the next sequence of actions using flow matching, conditioned on robot state and multi-layer features from the world model. Cross-attention lets it focus on the most relevant proximal/distal frames.
- Why it exists: Pictures guide actions, but you still need a motor brain to output joint targets, velocities, or end-effector commands. Without it, the plan stays a plan.
- Example: Given a proximal frame showing the cup slightly off-center, the action expert outputs a tiny leftward move; distal anchors keep it from drifting off the plate over time.
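The sketch below illustrates the cross-attention pattern: action tokens act as queries over imagined proximal and distal frame features. The dimensions, head count, and single attention block are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 128                                      # toy feature width
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
action_head = nn.Linear(d, 7)                # toy 7-DoF action projection

action_tokens = torch.rand(1, 50, d)         # 50 proximal action queries
frame_feats = torch.rand(1, 50 + 9, d)       # imagined proximal + distal frames

# Queries = action tokens; keys/values = imagined frames, so each action
# "looks at" whichever proximal/distal frame is most relevant to it.
fused, weights = attn(action_tokens, frame_feats, frame_feats)
actions = action_head(fused)
print(actions.shape)                         # torch.Size([1, 50, 7])
```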
Step 5. Two-Stage Offline Training
- What happens:
- Stage 1: Jointly fine-tune the world model for transition prediction and the action expert for action flow using demonstrations; balance visual and action losses so planned visuals are also actionable.
- Stage 2: End-to-end behavioral cloning with the action loss, letting gradients flow through both modules so visuals are optimized for control.
- Why it exists: If visuals and actions aren't co-trained, the imagined frames might look pretty but be hard to act on. The second stage aligns the entire pipeline with expert moves.
- Example: For writing words, Stage 1 teaches plausible letter-by-letter frames; Stage 2 ensures the strokes match the expert's smooth pen paths.
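A minimal sketch of the two-stage idea, assuming plain mean-squared errors and a balancing weight `lam` as simplified stand-ins for the paper's flow-matching objectives:

```python
import torch
import torch.nn.functional as F

def stage1_loss(pred_frames, true_frames, pred_actions, true_actions, lam=0.5):
    """Stage 1: balance a visual transition-prediction loss against an
    action loss so planned visuals stay actionable."""
    visual = F.mse_loss(pred_frames, true_frames)
    action = F.mse_loss(pred_actions, true_actions)
    return lam * visual + (1.0 - lam) * action

def stage2_loss(pred_actions, true_actions):
    """Stage 2: action loss only, but backpropagated end to end so the
    world model's frames are optimized for control."""
    return F.mse_loss(pred_actions, true_actions)

# Toy tensors standing in for model outputs and demonstration targets.
pf, tf = torch.rand(2, 16, 64), torch.rand(2, 16, 64)
pa, ta = torch.rand(2, 50, 7), torch.rand(2, 50, 7)
print(stage1_loss(pf, tf, pa, ta).item(), stage2_loss(pa, ta).item())
```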
Step 6. Reward-Free Online Autonomous Improvement (HER + LoRA)
- What happens: During deployment, the robot stores (observation, state, action, next observation). It relabels the goal as what it actually achieved (hindsight) and fine-tunes only small LoRA adapters on-device for a few minutes. Then it repeats: rollout → relabel → quick fine-tune.
- Why it exists: Real scenes differ from training. Quick, cheap updates fix those gaps without human labels or reward design.
- Example: The robot fails to plug a bottle into a holder but gets close; it relabels the close pose as the goal and learns the micro-adjustments needed next time.
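A minimal sketch of that rollout-relabel-finetune loop, assuming rollouts are stored as lists of step dictionaries; the toy adapter and loss are hypothetical stand-ins, not the paper's code.

```python
import torch

def hindsight_relabel(rollout):
    """HER: treat the state actually reached as the goal for every step."""
    achieved = rollout[-1]["next_obs"]
    return [dict(step, goal=achieved) for step in rollout]

def finetune_lora(loss_fn, lora_params, steps=20, lr=1e-4):
    """Update only the small adapter parameters; the base net stays frozen."""
    opt = torch.optim.AdamW(lora_params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn().backward()
        opt.step()

# Toy rollout: five transitions; the final next_obs becomes the new goal.
rollout = [{"obs": i, "action": 0.0, "next_obs": i + 1} for i in range(5)]
relabeled = hindsight_relabel(rollout)
print(relabeled[0]["goal"])                  # 5: the reached state

adapter = torch.nn.Linear(8, 8)              # toy LoRA-style adapter
batch = torch.rand(4, 8)
finetune_lora(lambda: (adapter(batch) - batch).pow(2).mean(),
              adapter.parameters())
```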
Step 7. Deployment Details
- What happens: Only actions are generated at inference time. Proximal actions (e.g., 50 steps) are executed at ~200 ms latency; distal actions remain internal guidance. Safeguards reset the scene if attempts exceed a limit.
- Why it exists: Keeping inference lean ensures smooth, closed-loop control. Distal anchors stabilize behavior without slowing real-time execution.
- Example: While drawing, the robot streams a burst of small strokes every cycle, staying aligned thanks to unseen distal anchors.
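A minimal sketch of that closed loop, with a hypothetical `imagine_and_act` standing in for the full model and a toy environment:

```python
import numpy as np

def imagine_and_act(obs, goal):
    """Stand-in for world model + action expert: return a 50-step chunk."""
    return np.zeros((50, 7))

class ToyEnv:
    def reset(self): return np.zeros(64)
    def step(self, chunk): return np.zeros(64), False  # next obs, done flag

env, goal = ToyEnv(), np.ones(64)
obs, done, attempts, MAX_ATTEMPTS = env.reset(), False, 0, 3
while not done and attempts < MAX_ATTEMPTS:  # closed-loop control
    chunk = imagine_and_act(obs, goal)       # one ~200 ms planning cycle
    obs, done = env.step(chunk)              # execute proximal actions only
    attempts += 1
print("goal reached" if done else "attempt limit hit; resetting scene")
```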
The Secret Sauce
- Two-Speed Time (MSTH): Dense nearby control prevents wobble; sparse far anchors prevent drift.
- Visual Imagination: The plan is grounded in pictures of what progress looks like, not abstract wishes.
- End-to-End Coupling: Cross-attention makes the action expert read the plan intelligently.
- Self-Improvement Loop: HER + LoRA delivers minutes-to-mastery updates without rewards or labels.
Mini Sandwiches for Key Concepts
- 🍞 Hook: You know how you check both your next step and the finish line during a race?
  🥬 Goal-Conditioned Policy: A rule that picks actions using the current view and a goal image. How: read now + goal → choose next move. Why: without conditioning on the goal, actions may be fast but aimless.
  🍞 Anchor: Moving a toy car to match a photo of the parking spot.
- 🍞 Hook: Imagine flipping a comic strip to see how the story goes from start to end.
  🥬 Goal-Conditioned World Model: A model that imagines plausible in-between pictures from now to the goal. How: encode images → flow-match noise into ordered future frames. Why: without pictures of progress, the robot can't tell if it's getting closer.
  🍞 Anchor: Ghost frames showing a cup being lifted, centered, and placed.
- 🍞 Hook: Your calendar shows today's tasks and big monthly deadlines.
  🥬 MSTH: A method to split time into many near frames (proximal) and a few far anchors (distal). How: dense sampling nearby, logarithmic sparse sampling far away; predict dense executable actions and sparse guiding actions. Why: without it, either you drift or you break on bumps.
  🍞 Anchor: Writing letters neatly now while keeping the word level on the line.
- 🍞 Hook: Watching replays after a game helps you improve.
  🥬 HER: Learn from whatever happened by treating the reached state as the goal. How: store transitions → relabel goal as what you actually got → train. Why: without rewards or labels, you still extract signal.
  🍞 Anchor: Missed the cup holder? Learn adjustments from the near-miss.
- 🍞 Hook: Instead of rebuilding a bike, you swap a small part to tune it.
  🥬 LoRA Fine-Tuning: Update tiny low-rank adapters to change the model quickly. How: freeze the big net, train small inserts. Why: faster, cheaper, safer on-device learning.
  🍞 Anchor: Five-minute tune-ups that lift success from 30% to 90%.
04 Experiments & Results
The Test: What and Why
- We measure success rate: does the robot reach the visual goal? This captures long-horizon correctness and robustness to small errors. We test in simulation (Robotwin 2.0) and on a real robot (AgiBot Genie-01), with in-domain (ID) and out-of-domain (OOD) setups to probe generalization.
The Competition: Baselines
- DP-GC: Sends features of current and goal images into a DiT action head.
- π0.5-GC: A strong vision-language-action baseline given both raw observation and goal image, but with a fixed language condition.
- HyperGoalNet: A recent high-performing goal policy with hypernetworks.
These represent different ways to condition on visual goals without explicit imagined trajectories or MSTH.
Scoreboard with Context
- Simulation (Robotwin 2.0): Across four tasks (Move Can, Pick Dual Bottles, Place Cup, Place Shoe), Act2Goal wins all easy-mode tasks and three of four hard-mode tasks. For example, on easy mode it posts 0.62–0.80 success where DP-GC sits near 0.03–0.18 and π0.5-GC around 0.13–0.54. Think of it as scoring an A while others hover around C to B-. On hard OOD modes, Act2Goal still lands 0.13–0.43 where others often get zeros, which is like making solid free throws while others miss the rim entirely.
- Real World: Three demanding tasks: Whiteboard Word Writing, Dessert Plating, and Plug-In Operation. Without any online improvement, Act2Goal reaches 0.93/0.75/0.45 in ID and 0.90/0.48/0.30 in OOD, while baselines often fall to near zero. This is like consistently winning matches away from home while competitors struggle to score.
Surprising Findings
- Online Self-Improvement Works Fast: On tough OOD tasks, a short HER+LoRA loop boosts success from about 0.30 to 0.90 within minutes. That's like going from missing most shots to nailing almost all after a brief warm-up.
- Learn Even from Failures: Using only failed rollouts still improves performance; HER squeezes signal from near-misses. Best is to use all rollouts (successes + failures).
- MSTH is the Key for Long Horizons: In a whiteboard writing ablation, without MSTH, success collapses as words get longer (e.g., OOD long words near 0.00), but with MSTH it stays high (about 0.88–0.93). This shows two-speed time control prevents drift over long sequences.
- Imagined Videos are Coherent: The world model produces both crisp proximal frames and meaningful distal anchors, giving the action expert reliable guidance.
Takeaways in Plain Language
- Imagining the path plus acting at two speeds beats guessing the next move from the end picture.
- Small, quick on-device tune-ups let the robot adapt to new scenes without rewards or labels.
- MSTH is not just a detail; it's the steering wheel that keeps long tasks straight.
Concrete Examples
- Writing: ID words hit 0.93 success; OOD words still 0.90. Without MSTH, performance on long OOD words nearly vanishes; with MSTH, it stays strong.
- Plating Desserts: Despite background and prop changes, Act2Goal maintains nearly half successes OOD, where others often fail, showing robust visual goal following.
- Plug-In: Even when the target shape changes (bottle into holder vs metal piece), zero-shot performance is nonzero, and online learning rapidly closes the gap to 0.90.
05 Discussion & Limitations
Limitations
- Visual Goal Quality: If the goal image is ambiguous, occluded, or mismatched to the scene (wrong angle, missing objects), the imagined plan can be misleading.
- World Model Errors: Generative predictions can drift or blur fine details, especially with shiny surfaces, fast dynamics, or heavy occlusion; this can misguide the action expert.
- Compute and Data: Training the world model and action expert at scale needs substantial GPUs and demonstrations; small labs may find the offline stage heavy.
- Safety and Contacts: Tasks needing precise force control (tight insertions, deformables) can still challenge a vision-centric plan without explicit force feedback models.
- Partial Observability: Single-view or narrow FOV can hide crucial details; multi-view helps, but blind spots remain.
Required Resources
- Hardware: A robot arm with cameras (ideally multi-view), on-device GPU (e.g., RTX 4090) for inference and quick LoRA updates.
- Training: Access to a sizeable demonstration dataset and GPUs (e.g., A800 cluster) for offline stages.
- Software: VAE/DiT/flow-matching stacks, cross-attention integration, online buffer and HER relabeling pipeline.
- Operations: Basic reset tools for stuck states; occasional human assistance to restore the scene.
When Not to Use
- If goals aren't expressible visually (e.g., hidden internal states) or require precise force/torque specs you can't see.
- Highly dynamic, adversarial settings (flying objects, crowds) where visual predictions become unreliable.
- Strict safety-critical operations where any generative misprediction is unacceptable without redundancy (e.g., surgery without extensive safeguards).
Open Questions
- Uncertainty-Aware Planning: How to estimate and act on the world model's confidence, switching behaviors when predictions are shaky?
- Multimodal Goals: Best ways to blend language, geometry (3D), and vision goals without losing precision?
- Scaling Laws: How do performance and stability scale with model size, data diversity, and MSTH settings?
- Guarantees: Can we derive bounds on drift or convergence for two-scale control?
- Rich Feedback: How to fuse tactile/force cues into the same two-scale structure for contact-rich manipulation?
- Continual Learning: How to maintain gains over months without forgetting earlier skills?
06 Conclusion & Future Work
Three-Sentence Summary
Act2Goal teaches robots to imagine a visual path from now to a goal image and to act at two speeds (dense nearby moves and sparse far anchors) so they stay steady on long tasks while reacting to surprises. By coupling the imagined plan to the action expert via cross-attention and adding a quick, reward-free self-improvement loop (HER + LoRA), the system generalizes well and adapts in minutes. Experiments in simulation and on real robots show large gains over strong baselines, including dramatic boosts on tough, unseen tasks.
Main Achievement
The paper's #1 contribution is Multi-Scale Temporal Hashing (MSTH) fused with a goal-conditioned visual world model, delivering coherent long-horizon control that remains reactive and enabling fast, label-free online adaptation.
Future Directions
Next steps include uncertainty-aware planning for safer execution, integrating force/tactile sensing into the two-scale structure, extending to 3D/point-cloud goal representations, and unifying visual and language goals cleanly. Exploring scaling laws and theoretical guarantees for two-scale control could further stabilize long missions.
Why Remember This
The simple recipe (imagine first, then act at two speeds, learn from what happened) turns out to be a powerful, practical way to make robots handle real, messy, long chores. It's a blueprint for robust, generalist visuomotor control that can keep improving itself on the job.
Practical Applications
- Home assistance: Set a table to match a reference photo, tidy toys into bins that match a goal snapshot.
- Industrial assembly: Insert pegs, bearings, or bottles into holders, guided by goal images of correct insertion.
- Warehousing: Re-arrange items on shelves to match a planogram image with different object types and layouts.
- Food service: Plate dishes to match a visual template despite dish and background changes.
- Whiteboard and marking tasks: Write labels, draw guidelines, or mark cut lines following a sample image.
- Laboratory support: Place tubes and plates in racks to match a goal layout image without reward engineering.
- Retail display setup: Arrange products to match marketing visuals under varying store conditions.
- Device operation: Plug cables or components into ports that match a goal photo of a correct connection.
- Cleaning and maintenance: Wipe or polish along visual guides, keeping coverage aligned over long strokes.
- Teleoperation assist: Provide robust auto-complete motions that follow a visual goal, while a human supervises.