Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Key Summary
- Fast-ThinkAct teaches a robot to plan with a few tiny hidden "thought tokens" instead of long paragraphs, making it much faster while staying smart.
- It learns which thoughts are good by copying a slower teacher model but only keeping the best reasoning, using preference signals from rewards.
- The hidden thoughts are verbalizable, meaning a small language model can read them back as text during training to keep them understandable.
- The model also learns where to go by predicting a handful of visual waypoints in parallel, which connect plans directly to robot motions.
- On tough robot tasks like LIBERO and SimplerEnv, it beats prior systems and is up to 9.3× faster than strong reasoning baselines.
- It stays reliable on long, multi-step jobs, adapts with only a few new demos, and recovers from mistakes by planning corrections.
- A compact student model (3B) matches or exceeds larger 7B models while cutting latency by about 89%.
- The key trick is preference-guided distillation plus visual trajectory alignment, so the compact thoughts still point to the right actions.
- At inference, only the fast planner and the action model run; the verbalizer is optional and used mainly for training and debugging.
Why This Research Matters
Robots working in homes, hospitals, and factories must think and act fast, not wait around writing long explanations. Fast-ThinkAct shows that a handful of compact, checkable thought tokens can capture smart planning without slowing the robot down. That means safer, smoother actions in real time, even for long, multi-step jobs. It also lowers the cost of adapting robots to new tasks because the compact planner learns quickly from only a few examples. Finally, by aligning thoughts with visual waypoints, the robot's plan stays grounded in what it actually sees, making behavior more reliable in messy, changing environments.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're giving directions to a friend. You could write a whole page about every turn, or you could say, "Go straight, turn left at the bakery, then right at the park." Short directions are faster to say and still get you there.
🥬 The World Before: Vision-Language-Action (VLA) models are robots' brains that look (vision), understand instructions (language), and move (action). For years, most VLAs learned by copying lots of example moves from humans (imitation). That worked well for simple, short tasks like "pick up the cup" but broke down when tasks got long, messy, or surprising, like cooking steps, cleaning in clutter, or fixing a mistake after dropping something. Robots needed to plan across time and space, keep track of goals and subgoals, and adapt on the fly, but the training data didn't cover every situation.
🍞 Anchor: Think of a kid who only practiced one piano song; they can't suddenly play a new song at a recital without understanding music.
🍞 Hook: You know how math students show their work step-by-step? That's helpful for teachers to see the thinking, not just the answer.
🥬 The Problem: Researchers added chain-of-thought (CoT) reasoning to VLAs: textual step-by-step explanations like "first find the mug, then grasp, then move to the shelf." These steps improved generalization: robots solved new tasks better. But writing long thoughts (often ~250 tokens) took too long at test time. Robots need to act at 1–15 times per second; long thoughts could take seconds, which is far too slow, and can even be unsafe in time-critical settings.
🍞 Anchor: It's like needing to think out loud for a whole minute before catching a falling glass. Too late!
🍞 Hook: Imagine trying to speed-run a video game: you still plan, but you can't pause every five seconds to narrate your plan.
🥬 Failed Attempts: People tried to shorten the text by cutting steps or adding penalties for long answers. But slicing out text often cut out the important bits. Others tried to skip reasoning at test time ("reasoning dropout"), which sometimes made plans inconsistent because the model hadn't truly learned a compact way to think.
🍞 Anchor: If you remove half the directions, your friend might miss the bakery and never find the park.
🍞 Hook: Think of a secret notebook where a chef keeps tiny symbols for recipes: short, fast, and still meaningful.
🥬 The Gap: Robots needed a way to keep the benefits of reasoning without the long text: a compact, fast, still-thoughtful plan that captures both language ideas and visual, spatial details. And it should still be checkable (verbalizable) so we can ensure it learned the right kind of thinking.
🍞 Anchor: We want the short directions ("left at the bakery") that we can also explain if asked ("Because the bakery marks the right street.").
🍞 Hook: Imagine the best student learning from a star teacher, but only absorbing the teacher's best solutions and compressing them into handy flashcards.
🥬 Why This Paper: Fast-ThinkAct proposes verbalizable latent planning: hidden, continuous "thought tokens" that are compact, can be turned back into text when needed, and are aligned with visual waypoints so the plan directly guides robot actions. It learns which thoughts are good using preferences from a teacher trained with rewards, and it aligns the student's internal visual plan with the teacher's. This keeps the plans both smart and fast.
🍞 Anchor: Like memorizing a few map pins instead of reading a novel about the city, but still being able to tell someone why those pins matter.
Concept Sandwiches introduced here:
- Vision-Language-Action (VLA)
- Hook: You know how you look, listen, and then act, like hearing "Put the book on the shelf," seeing the shelf, and doing it.
- The Concept: VLA is a model that sees the scene, understands the instruction, and outputs actions.
- How it works: (1) Read instruction, (2) Look at images/video, (3) Plan steps, (4) Predict robot motions.
- Why it matters: Without it, robots can't connect what they see and what they're told to what they should do.
- Anchor: "Put the red mug in the drawer" → find red mug in image → reach → grasp → move → place.
- Chain-of-Thought (CoT)
- Hook: Like showing your math steps.
- The Concept: CoT is a step-by-step explanation the model writes before acting.
- How it works: (1) Break task into subgoals, (2) Describe each, (3) Use them to choose actions.
- Why it matters: It helps generalize to new tasks.
- Anchor: "Find mug → grasp → lift → open drawer → place → close."
- Latent Reasoning
- Hook: Imagine thinking quietly in your head with quick shorthand.
- The Concept: Latent reasoning is planning in compact hidden vectors instead of long text.
- How it works: (1) Produce a few continuous tokens, (2) Each encodes part of the plan, (3) Use them to guide actions.
- Why it matters: It's much faster but still keeps structure.
- Anchor: Six hidden "thought tokens" instead of 250 words.
- Verbalizable Reasoning
- Hook: Sometimes you need to explain your shorthand.
- The Concept: Verbalizable means those hidden tokens can be decoded back into understandable text when needed.
- How it works: A small language model (verbalizer) reads the tokens and writes a short explanation.
- Why it matters: Keeps learning grounded and debuggable.
- Anchor: The model can say, "First align with mug, then grasp," when asked.
02 Core Idea
🍞 Hook: Picture a coach who watches many plays, marks which replays show great strategies, and then teaches the team to remember just a few key cues that trigger those winning moves in real time.
🥬 The Aha! Moment (one sentence): Teach a fast student model to think in a few compact, verbalizable latent tokens that capture both language reasoning and visual plans, distilled by preferences from a slower teacher and aligned to waypoints that directly drive robot actions.
Multiple Analogies:
- GPS Pins, Not Paragraphs: Instead of narrating directions street-by-street, drop five pins on the map; the route is obvious and quick to follow.
- Recipe Flashcards: Replace a full cookbook chapter with a small card: "Preheat → whisk → pour → bake," and you can still explain why each step matters.
- Choreography Beats: A dancer remembers a handful of beats ("step, turn, lift, land"), not a long essay, yet the performance stays precise.
Before vs After:
- Before: Long textual CoT (slow), brittle shortcuts that trimmed text (losing key info), or skipping reasoning at test time (inconsistent plans).
- After: A few continuous "thought tokens" that are (a) fast to produce, (b) verbalizable for supervision and debugging, and (c) aligned with visual waypoints so the plan is grounded in the scene and directly useful for control.
Why It Works (intuition, no math):
- Good thoughts beat long thoughts. The teacher's reinforcement training marks which reasoning traces actually lead to success via rewards. By comparing the best and worst teacher thoughts, the student learns to store only the essence of the winning patterns in a tiny latent code.
- Show, don't just tell. Aligning the student's internal plan representation with the teacher's visual trajectory state, plus predicting waypoints in parallel, keeps planning spatially grounded. That means the "thought tokens" point to where the gripper should go, not just what the text says.
- Speak when needed. Because the tokens are verbalizable, a small language model can read them back into short reasoning during training, ensuring the compact code stays meaningful and not just random.
Building Blocks (each with a Sandwich):
- Preference-Guided Distillation
- Hook: Imagine picking the best sports plays to study.
- Concept: The student learns to prefer high-quality teacher thoughts and avoid low-quality ones using reward-based preferences.
- How it works: (1) Teacher generates multiple thoughts, (2) Score them with rewards, (3) Pick best vs worst, (4) Train student so its tokens are decoded as the better thought more often.
- Why it matters: Keeps the compact plan smart, not just short.
- Anchor: Keep the replay where the team scores; skip the fumble.
- Verbalizer LLM
- Hook: A translator that turns symbols into sentences.
- Concept: A small language model decodes latent tokens into text during training.
- How it works: (1) Read tokens, (2) Output a short reasoning, (3) Compare to best/worst teacher thoughts, (4) Adjust student tokens so they decode into the better one.
- Why it matters: Ensures tokens stay interpretable and faithful to good reasoning.
- Anchor: The coach can ask, "Why that move?" and the player explains clearly.
- Visual Trajectory Alignment
- Hook: A map that must match the road.
- Concept: The student's internal plan is aligned to the teacher's action-grounded visual state at the answer step.
- How it works: (1) Grab teacher's plan state, (2) Pull student's state toward it, (3) Predict K waypoints in parallel with special spatial tokens.
- Why it matters: Plans stay tied to where the robot actually needs to move.
- Anchor: The pins on your map must land on real streets, not on rivers.
- Spatial Tokens for Waypoints
- Hook: Marking the next few stepping stones across a stream.
- Concept: The model appends K spatial tokens; each outputs a waypoint simultaneously via a small head.
- How it works: (1) Attach K tokens, (2) Each becomes a predicted (x,y) or gripper pose list, (3) Done in parallel, not long text (see the code sketch after this list).
- Why it matters: Itâs fast and creates an actionable plan.
- Anchor: "Step here, here, and here," all at once.
- Reasoning-Enhanced Policy Learning
- Hook: Learning to drive by seeing both the road and a ghost path ahead.
- Concept: A diffusion policy consumes the visual plan latent (from the spatial tokens' cache) and the observations to output smooth low-level actions.
- How it works: (1) Extract early-layer planning context from the VLM, (2) Feed it into the action model, (3) Train with imitation targets.
- Why it matters: Bridges high-level plan to motor commands.
- Anchor: Follow the ghost line on the road while steering smoothly.
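To make the Spatial Tokens building block above concrete, here is a minimal PyTorch sketch of predicting all K waypoints in parallel from the hidden states of the appended spatial tokens. The module name, sizes, and the 2D waypoint format are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of parallel waypoint prediction from K "spatial tokens".
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class WaypointHead(nn.Module):
    def __init__(self, hidden_dim: int = 1024, waypoint_dim: int = 2):
        super().__init__()
        # One shared MLP maps each spatial-token embedding to an (x, y) waypoint.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.GELU(),
            nn.Linear(256, waypoint_dim),
        )

    def forward(self, spatial_states: torch.Tensor) -> torch.Tensor:
        # spatial_states: [batch, K, hidden_dim] hidden states of the K appended
        # spatial tokens. The head runs on all K at once, so the waypoints come
        # out in parallel rather than as token-by-token text.
        return self.mlp(spatial_states)  # [batch, K, waypoint_dim]

# Toy usage: K = 5 waypoints from 5 spatial-token states.
head = WaypointHead()
states = torch.randn(1, 5, 1024)   # stand-in for the VLM's spatial-token states
waypoints = head(states)
print(waypoints.shape)             # torch.Size([1, 5, 2])
```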
Bottom Bread (Anchor Example in Action):
- Instruction: "Put the red mug in the drawer."
- Student makes 6 compact tokens + K=5 spatial tokens. The spatial tokens produce 5 visual waypoints: approach mug, align, grasp, move to drawer, place. The diffusion policy then turns that plan into precise arm motions. If we ask the verbalizer, it will say a short explanation like, "Align with mug, grasp firmly, move to drawer, place inside."
03 Methodology
High-Level Recipe: Input (image/video + instruction) → Teacher rollouts and scoring → Student produces M latent thought tokens → Verbalizer prefers tokens that decode to better thoughts → Align visual plan states and predict K waypoints in parallel → Train an action policy that uses this plan to output robot motions.
Step-by-step with Sandwiches and Examples:
- Train the Textual Teacher with Rewards (GRPO)
- Hook: Picture a coach who tries many plays and keeps the ones that score higher.
- Concept: The teacher model generates multiple chain-of-thoughts and gets rewards based on how well the resulting actions align with success (e.g., goal completion, correct trajectories). A group advantage score ranks each thought.
- How it works:
- For a task like "Put the mug in the drawer," the teacher writes several reasoning traces.
- Each trace is scored by rewards tied to action success.
- Compute a relative advantage per trace inside the group.
- Update the teacher to prefer traces with higher advantage.
- Why it matters: Establishes reliable examples of good vs bad reasoning.
- Anchor: The teacher knows which playbooks actually win games.
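To show what "a relative advantage per trace inside the group" can look like, here is a small illustrative computation in the spirit of GRPO; the paper's actual reward terms and normalization details may differ.

```python
# Illustrative group-relative advantage, in the spirit of GRPO.
# The rewards here are made-up numbers; the paper's reward terms
# (goal completion, trajectory alignment, etc.) may differ.
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Score each reasoning trace relative to its own group of rollouts."""
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps)

# One task, four teacher reasoning traces, each scored by an action-aligned reward.
rewards = torch.tensor([0.9, 0.2, 0.6, 0.1])
adv = group_relative_advantage(rewards)
print(adv)  # positive for above-average traces, negative for below-average ones
# The teacher is updated to raise the probability of high-advantage traces, and the
# best/worst trace per group later becomes the preference pair for the student.
```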
- Student Generates M Compact Latent Thought Tokens
- Hook: Swap a long speech for a tiny set of cue cards.
- Concept: Instead of long text, the student autoregressively produces M continuous latent tokens (e.g., M=6). These are the compressed "thoughts."
- How it works:
- Read instruction + observation.
- Emit a small sequence of vectors (tokens) that summarize the plan.
- Keep them short and information-dense.
- Why it matters: Big speed-up at inference.
- Anchor: Six cue cards replace 250 words.
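A minimal sketch of the autoregressive latent-thought loop, in the spirit of generic latent-reasoning recipes: each produced continuous vector is fed back as the next input embedding instead of being decoded to text. The tiny GRU backbone and all sizes are stand-ins, not the actual student VLM.

```python
# Sketch of emitting M continuous "thought tokens" autoregressively.
# The backbone and dimensions are toy stand-ins for the student VLM.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the student trunk (a GRU over embeddings)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, embeds, state=None):
        out, state = self.rnn(embeds, state)
        return out[:, -1], state  # last hidden state, recurrent state

def generate_latent_thoughts(backbone, context_embeds, num_thoughts: int = 6):
    # context_embeds: [batch, T, dim] embeddings of instruction + observation.
    _, state = backbone.rnn(context_embeds)
    thoughts = []
    token = torch.zeros(context_embeds.size(0), 1, context_embeds.size(-1))
    for _ in range(num_thoughts):
        hidden, state = backbone(token, state)
        thoughts.append(hidden)          # keep the continuous vector as a "thought"
        token = hidden.unsqueeze(1)      # feed it back as the next input embedding
    return torch.stack(thoughts, dim=1)  # [batch, M, dim], no text generated

backbone = TinyBackbone()
ctx = torch.randn(1, 32, 512)
latent_thoughts = generate_latent_thoughts(backbone, ctx)
print(latent_thoughts.shape)  # torch.Size([1, 6, 512])
```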
- Verbalizer Loss (Preference-Guided)
- Hook: A translator who prefers to translate into the best explanation.
- Concept: A small LLM (the verbalizer) decodes the student's tokens into text and is trained so that decoding better teacher thoughts is more likely than worse ones.
- How it works:
- From each teacher rollout group, pick best (τ⁺) and worst (τ⁻) thoughts.
- Condition the verbalizer on student tokens and compare the likelihood of decoding τ⁺ vs τ⁻.
- Push the student tokens so the verbalizer favors τ⁺.
- Why it matters: Keeps compressed thoughts faithful to high-quality reasoning, not random codes.
- Anchor: The translator consistently chooses the clearer instruction manual.
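Here is a hedged, DPO-style sketch of the verbalizer preference idea: a toy verbalizer scores the better teacher thought τ⁺ and the worse one τ⁻ conditioned on the student's latent tokens, and the loss pushes the latent tokens so that τ⁺ becomes more likely to be decoded. The toy model, shapes, and the exact objective are assumptions for illustration, not the paper's implementation.

```python
# DPO-style preference sketch over a verbalizer LM, conditioned on the
# student's latent thought tokens. Models and loss form are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVerbalizer(nn.Module):
    """Tiny LM: latent thoughts as a soft prefix + text tokens -> next-token logits."""
    def __init__(self, vocab: int = 1000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def log_likelihood(self, latent_prefix, text_ids):
        # latent_prefix: [B, M, dim] student thought tokens used as a soft prompt.
        # text_ids:      [B, T] token ids of a teacher reasoning trace.
        inputs = torch.cat([latent_prefix, self.embed(text_ids[:, :-1])], dim=1)
        hidden, _ = self.rnn(inputs)
        logits = self.out(hidden[:, latent_prefix.size(1):])   # text positions only
        logp = F.log_softmax(logits, dim=-1)
        tgt = text_ids[:, 1:]
        return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)  # [B]

def preference_loss(verbalizer, latent, better_ids, worse_ids, beta: float = 0.1):
    # Prefer decoding the better teacher thought over the worse one.
    lp_pos = verbalizer.log_likelihood(latent, better_ids)
    lp_neg = verbalizer.log_likelihood(latent, worse_ids)
    return -F.logsigmoid(beta * (lp_pos - lp_neg)).mean()

verbalizer = ToyVerbalizer()
latent = torch.randn(2, 6, 512, requires_grad=True)   # student thought tokens
better = torch.randint(0, 1000, (2, 12))               # tau-plus token ids
worse = torch.randint(0, 1000, (2, 12))                # tau-minus token ids
loss = preference_loss(verbalizer, latent, better, worse)
loss.backward()   # gradients flow back into the latent thought tokens
```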
- Action-Aligned Visual Plan Distillation
- Hook: The plan must match the road you'll actually drive.
- Concept: Align the student's internal plan state with the teacher's action-grounded state at the answer step, then directly predict K waypoints using spatial tokens.
- How it works:
- Grab the teacher's hidden state tied to the visual plan.
- Nudge the student's corresponding state to match.
- Append K learnable spatial tokens; each outputs one waypoint via a small MLP in parallel.
- Why it matters: Ties reasoning to concrete, spatial goals so the robot knows where to move next.
- Anchor: Place five stepping stones you can actually step on.
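A small sketch of the alignment term: pull the student's plan-related hidden state toward the teacher's action-grounded state, treating the teacher state as a fixed target. The cosine-distance choice here is an assumption; the paper may use a different distance.

```python
# Sketch of the visual-plan alignment idea (distance choice is an assumption).
import torch
import torch.nn.functional as F

def plan_alignment_loss(student_state: torch.Tensor,
                        teacher_state: torch.Tensor) -> torch.Tensor:
    # student_state, teacher_state: [batch, hidden_dim] hidden states tied to
    # the visual plan. The teacher state is treated as a fixed target.
    teacher_state = teacher_state.detach()
    return (1.0 - F.cosine_similarity(student_state, teacher_state, dim=-1)).mean()

student = torch.randn(4, 1024, requires_grad=True)
teacher = torch.randn(4, 1024)
loss = plan_alignment_loss(student, teacher)
loss.backward()  # nudges the student's internal plan toward the teacher's
# In the full method this term is combined with the parallel waypoint prediction
# from the K spatial tokens (see the waypoint-head sketch earlier).
```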
Example with Data:
- Task: "Move the 7Up can near the apple."
- The student predicts 5 waypoints: approach can, align gripper, grasp, move near apple, release. These are 2D (or 3D) targets aligned to the scene.
- Reasoning-Enhanced Policy Learning (Action Model)
- Hook: Follow the ghost path while driving.
- Concept: A diffusion Transformer policy (e.g., RDT/DiT-Policy) consumes both the observed state and the visual planning latent from the student to output low-level actions.
- How it works:
- Extract early-layer key-value (KV) cache from the VLM's spatial tokens (contains rich planning cues).
- Concatenate with the action modelâs state encoder KV.
- Train the action model with imitation learning to match ground-truth robot controls.
- Why it matters: Smoothly turns high-level plan into motor commands.
- Anchor: The car steers by attending to the ghost path plus camera view.
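A simplified sketch of how the plan can condition the action model: the denoiser's cross-attention context concatenates planning features from the spatial tokens with the policy's own observation features. Real implementations splice the early-layer KV cache per attention layer; this single flattened attention block, and all names and sizes, are illustrative stand-ins.

```python
# Simplified plan-conditioned action denoising: cross-attend over
# [plan features ; observation features]. All modules are toy stand-ins.
import torch
import torch.nn as nn

class PlanConditionedDenoiser(nn.Module):
    def __init__(self, act_dim: int = 7, dim: int = 256):
        super().__init__()
        self.act_in = nn.Linear(act_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.act_out = nn.Linear(dim, act_dim)

    def forward(self, noisy_actions, plan_feats, obs_feats):
        # noisy_actions: [B, H, act_dim] action chunk being denoised
        # plan_feats:    [B, K, dim] features from the planner's spatial tokens
        # obs_feats:     [B, N, dim] features from the policy's state encoder
        q = self.act_in(noisy_actions)
        ctx = torch.cat([plan_feats, obs_feats], dim=1)   # plan + observation context
        h, _ = self.attn(q, ctx, ctx)
        return self.act_out(h)                            # denoised action chunk

denoiser = PlanConditionedDenoiser()
pred = denoiser(torch.randn(1, 16, 7), torch.randn(1, 5, 256), torch.randn(1, 32, 256))
# Imitation-style target: regress toward the ground-truth action chunk.
loss = nn.functional.mse_loss(pred, torch.randn(1, 16, 7))
```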
- Training Strategy
- Hook: Warm up, then sprint.
- Concept: Start from a VLM pre-trained and SFT'd (and CoT-SFT'd), then split paths: teacher gets RL (GRPO), student gets latent distillation with verbalizer and visual alignment; finally, train the policy with the frozen student planner.
- How it works:
- SFT: learn general visual-language + embodied knowledge.
- CoT-SFT: learn to produce structured reasoning.
- Teacher GRPO: learn high-reward CoTs.
- Student Latent: learn tokens favored by verbalizer preferences + trajectory alignment; predict K waypoints.
- Policy IL: train diffusion policy to execute using the studentâs plan.
- Why it matters: Each phase builds a piece: knowledge → reasoning → compact planning → actionable control.
- Anchor: School → coaching → shorthand notes → game-time plays.
- Inference
- Hook: No need to read the whole book during a quiz.
- Concept: At test time, only the student planner (latent tokens + spatial tokens) and the action model run; the verbalizer is optional.
- How it works:
- Produce M latent tokens quickly.
- Generate K waypoints in parallel.
- Feed plan latent to the action policy.
- Output actions at real-time speeds.
- Why it matters: Achieves up to ~89% latency reduction vs prior reasoning VLAs.
- Anchor: Snap to the cue cards and go.
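A self-contained toy sketch of the test-time loop described above: only the compact planner and the action policy run, no text is decoded, and the verbalizer never appears. Every component here is a stand-in module, not the real architecture.

```python
# Toy test-time loop: compact planner + action policy only, no verbalizer.
import time
import torch
import torch.nn as nn

planner = nn.Linear(512, 6 * 512)       # stand-in: context -> M=6 latent thoughts
waypoint_head = nn.Linear(512, 2)       # stand-in: spatial-token state -> (x, y)
policy = nn.Linear(6 * 512 + 5 * 2, 7)  # stand-in: plan + waypoints -> 7-DoF action

def act(context: torch.Tensor) -> torch.Tensor:
    thoughts = planner(context).view(-1, 6, 512)                 # M compact thought tokens
    spatial_states = thoughts.mean(dim=1, keepdim=True).repeat(1, 5, 1)
    waypoints = waypoint_head(spatial_states)                     # K=5 waypoints in parallel
    plan = torch.cat([thoughts.flatten(1), waypoints.flatten(1)], dim=1)
    return policy(plan)                                           # low-level action, no text

start = time.perf_counter()
action = act(torch.randn(1, 512))
print(action.shape, f"{(time.perf_counter() - start) * 1e3:.1f} ms")
```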
The Secret Sauce (what's clever):
- Make thoughts compact but still readable (verbalizable), so learning stays on track.
- Use preferences from a rewarded teacher to keep only the winning reasoning patterns.
- Ground thoughts in space with parallel waypoint tokens to directly bridge to actions.
- Feed early planning KV into the action model for strong plan-to-control coupling.
04 Experiments & Results
🍞 Hook: Imagine a track meet where runners must be fast, smart about pacing, and able to adjust if they trip mid-race.
🥬 The Test: The authors checked three big things.
- Can the robot complete many kinds of manipulation tasks (LIBERO, SimplerEnv, RoboTwin2.0)?
- Can it reason about plans and videos effectively (EgoPlan-Bench2, RoboVQA, OpenEQA)?
- Is it fast enough for real-time use (latency in milliseconds)?
🍞 Anchor: It's like asking, "Do you finish the race? Do you pick good strategies? And are you fast?"
The Competition (Baselines):
- Foundation VLAs: OpenVLA
- Supervised reasoning VLAs: CoT-VLA, MolmoAct
- Reinforced reasoning VLA: ThinkAct
- Larger proprietary or general VLMs for reasoning benchmarks: GPT-4V, Gemini 2.5 Flash (for context)
Scoreboard with Context:
- LIBERO (diverse tasks: Spatial, Object, Goal, Long): Fast-ThinkAct tops the charts across all suites. Think of getting an A when others get B+ to A-.
- SimplerEnv-Google: Fast-ThinkAct achieves 68.7% success versus 64.7% (ThinkAct-3B) and 64.9% (MolmoAct-7B). That's like beating taller opponents despite being smaller.
- RoboTwin2.0 (bimanual, long-horizon): Fast-ThinkAct averages higher than RDT, π0, ACT, and ThinkAct, with +3.3% over ThinkAct under easy settings and +1.7% under hard ones. On the hardest, longest tasks (270–470 steps), it is notably better, like staying steady in a marathon.
- Embodied Reasoning:
- EgoPlan-Bench2: +2.4% over runner-up; chooses better next steps in egocentric tasks.
- RoboVQA: +5.5 BLEU over runner-up; clearer, more accurate video-based reasoning in robotics.
- OpenEQA: +1.1 points; better spatial/functional understanding of real environments.
- Latency: Up to 89.3% reduction vs ThinkAct-7B and 88.0% vs MolmoAct-7B; the 3B student gives ~7× faster inference than ThinkAct-3B (805 ms vs 5674 ms per decision; see the quick arithmetic below). That's like going from waiting for an elevator to taking the stairs and arriving first.
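For the 3B-vs-3B comparison quoted above, the speedup and reduction follow directly from the per-decision latencies (note the 89.3% and 88.0% headline reductions are measured against the 7B baselines):

```python
# Quick arithmetic on the per-decision latencies quoted above.
fast_ms, thinkact3b_ms = 805, 5674
print(f"speedup:   {thinkact3b_ms / fast_ms:.1f}x")      # ~7.0x faster
print(f"reduction: {1 - fast_ms / thinkact3b_ms:.1%}")    # ~85.8% lower latency vs ThinkAct-3B
print(f"decision rate: {1000 / fast_ms:.2f} Hz vs {1000 / thinkact3b_ms:.2f} Hz")
```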
Surprising/Notable Findings:
- Small but Mighty: The 3B Fast-ThinkAct matches/exceeds 7B baselines while being much faster. This shows that good planning compression beats sheer size when latency matters.
- Concise but Correct: When verbalized, student thoughts are shorter and more focused than teacher text, filtering out distracting fluff while keeping the core plan (seen in RoboVQA and OpenEQA examples).
- Few-Shot Adaptation: With just 10 demos per task on RoboTwin2.0, Fast-ThinkAct outperforms strong baselines, showing the compact planner transfers well and learns quickly.
- Failure Recovery: On RoboFAC, Fast-ThinkAct gives accurate correction plans after errors (simulation and real), beating the next best by 10.9 and 16.4 points respectively, like a runner who stumbles but quickly regains form.
Why These Results Matter:
- Real robots need decisions at 1–15 Hz. Cutting latency ~89% while keeping or improving accuracy makes deployment safer and more practical.
- Long-horizon success shows the compact thoughts still encode multi-step structure, not just shortcuts.
- Few-shot strength means lower data collection costs for new tasks/environments.
🍞 Anchor: In practice, this means a kitchen robot can quickly plan "align → grasp → place," smoothly execute, and, if it drops the spoon, quickly recover by re-aligning and trying again, without pausing to write an essay each time.
05 Discussion & Limitations
Limitations (be specific):
- Verbalizer Hallucination: The small verbalizer LLM can occasionally produce plausible but inaccurate text when asked to explain. This doesn't affect execution at test time (the verbalizer is optional), but it can mislead users if they rely on the explanation alone.
- Teacher Quality Ceiling: The student learns preferences from the teacher's rewarded thoughts. If teacher rewards or strategies are biased or suboptimal, the distilled compact thoughts may inherit those limitations.
- Visual Waypoint Assumptions: K fixed waypoints work well for many tasks, but extremely dexterous or highly dynamic scenes may require adaptive waypoint counts or richer 3D trajectories.
- Compute and Data: Training uses substantial datasets (OXE, RoboVQA, RoboFAC, etc.) and multi-GPU resources; not every lab has this setup.
Required Resources:
- A pre-trained VLM backbone, datasets with manipulation videos and QA, and access to GPUs for SFT, CoT-SFT, RL for teacher (GRPO), and policy training. Robotics action datasets (e.g., OXE, ALOHA) are needed for the final policy stage.
When NOT to Use:
- Ultra-high-frequency control loops (e.g., 100–1000 Hz torque control) where even waypoint-level planning may be too slow without dedicated low-level controllers.
- Tasks needing fine-grained tactile feedback or micro-manipulation where visual waypoints alone are insufficient.
- Domains with no reliable reward signals for teacher training (preference learning may be noisy).
Open Questions:
- Adaptive Reasoning Length: Can the model choose the number of latent tokens M on the fly per task difficulty?
- Richer Grounding: How to incorporate 3D scene graphs or force/tactile signals into the compact plan to handle contact-rich tasks?
- Safer Explanations: Can we further reduce hallucinations with grounding-aware verbalization or evidence-linked explanations?
- Continual Learning: How to update compact thoughts online without catastrophic forgetting in real deployments?
- Beyond Waypoints: Can we unify compact thoughts with closed-loop subtask policies for even longer horizons and more robust recovery?
06 Conclusion & Future Work
Three-Sentence Summary:
- Fast-ThinkAct compresses long, slow chain-of-thought planning into a handful of verbalizable latent tokens that still capture both language reasoning and visual trajectory plans.
- It learns which thoughts to keep via preference-guided distillation from a rewarded teacher and aligns the student's internal plan with concrete waypoints, then uses a diffusion policy to convert plans into actions.
- The result is strong performance on manipulation and reasoning benchmarks with up to ~89% lower latency, plus reliable long-horizon planning, few-shot adaptation, and failure recovery.
Main Achievement:
- Showing that compact, verbalizable latent planning, backed by reward-based preferences and visual plan alignment, can outperform or match larger, slower reasoning VLAs while being dramatically faster.
Future Directions:
- Make reasoning length adaptive; ground verbalization more tightly to evidence; add tactile/force cues; and extend from fixed K waypoints to richer, hierarchical subplans.
Why Remember This:
- It flips the script: robots don't need long essays to think well. With just a few smart, checkable thought tokens tied to where to move next, they can plan, act, adapt, and recover, fast enough for the real world.
Practical Applications
- Home assistance: Quickly fetch, tidy, or load dishwashers while adapting to clutter or dropped items.
- Warehouse picking: Plan concise grasp-and-place sequences that adapt to new box layouts with low latency.
- Assembly lines: Execute multi-step, bimanual assembly with reliable waypoint grounding and fast correction after slips.
- Hospital logistics: Deliver supplies and handle carts safely around people thanks to rapid decision-making.
- Kitchen robots: Perform long-horizon tasks like cooking steps with fast, grounded plans and recovery if something spills.
- Education and labs: Teach new tasks with only a few demos, speeding up research and prototyping.
- Retail restocking: Handle varied shelves and packaging under changing lighting and crowds.
- Agricultural handling: Pick-and-place delicate items (e.g., fruit sorting) with concise visual plans.
- Assistive devices: Provide quick, explainable actions (via optional verbalization) for users who want transparency.
- Inspection and maintenance: Navigate waypoints for checking equipment and adapt plans if obstacles appear.