Beyond Imitation: Reinforcement Learning for Active Latent Planning
Key Summary
- The paper shows how to make AI think faster and smarter by planning in a hidden space instead of writing long step-by-step sentences.
- It replaces wordy chain-of-thought (CoT) steps with dense, continuous "latent tokens," which carry more information per token.
- Past methods simply copied one example solution, but there are many correct paths, so copying just one can make models rigid and overfit.
- The authors make the hidden space smooth with a conditional VAE and add a stop head so each latent token carries a similar amount of information.
- They then use reinforcement learning (RL) to actively search for better thinking paths, rewarding both correct answers and step-to-step coherence.
- The coherence reward checks whether decoded steps logically connect, like making sure each math result is used by the next step.
- On four math benchmarks using LLaMA-1B, their method (ATP-Latent) is on average 4.1% more accurate while using 3.3% fewer tokens than strong baselines.
- Even with only the coherence reward (no answer labels), RL still helps, showing coherence is a useful unsupervised signal.
- The method avoids "overthinking," gives more stable exploration for RL, and scales to more steps without breaking.
- This approach could speed up on-device reasoning, tutoring, and real-time assistants by thinking more with fewer tokens.
Why This Research Matters
Faster, cheaper reasoning helps real-world apps like tutoring apps on phones, where generating fewer tokens saves time and battery. A smooth, explainable latent space means we can guide and trust models more, since we can decode hidden steps and check their logic. The coherence reward shows we can train useful behavior even with limited labels, which lowers costs and broadens access. Active planning can reduce latency in voice assistants, car dashboards, or customer-service bots that need quick and correct answers. This approach also avoids "overthinking," giving compact solutions that are easier to deploy at scale. Finally, the method offers a blueprint for safer exploration in hidden spaces, which can transfer to many domains beyond math.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine solving a math problem out loud versus using a neat scratchpad. Saying every step with words takes time, but a good scratchpad lets you think quickly and clearly.
The Concept (Chain-of-Thought in words): AI models can solve problems by writing step-by-step explanations in natural language (called chain-of-thought, or CoT). This works well but often creates long, slow answers. How it works:
- The model reads the question.
- It writes out many small reasoning steps as words.
- It produces the final answer. Why it matters: If every idea must be spelled out with words, answers can be slow and expensive to generate, especially for real-time tasks.
Anchor: If a student explains every tiny step out loud in math class, they'll be correct but slow. That's how wordy CoT can feel for AI.
Hook: You know how you can think faster in your head than speaking out loud? Your brain uses compact mental notes.
The Concept (Latent tokens): Latent tokens are hidden, continuous vectors the model uses instead of many words to represent thinking steps. How it works:
- Replace multiple word tokens with a single dense latent token.
- Pass these latent tokens step-by-step, like a compact scratchpad.
- At the end, produce the final answer in normal words. Why it matters: This packs more meaning into fewer tokens, speeding up reasoning while keeping quality.
Anchor: It's like using a quick math shorthand on a scratchpad instead of writing a long paragraph.
Hook: Picture a map where roads are smooth and well-spaced, so cars can explore new routes safely.
The Concept (Smooth latent space via a conditional VAE): A conditional Variational Autoencoder (VAE) shapes the hidden space so nearby points make similar, sensible thoughts, and it can also decode them back into readable steps. How it works:
- Encoder: turns questions and hidden steps into a distribution (mean and spread) for each latent token.
- Sampling: picks a specific latent token from that distribution so exploration is possible.
- Decoder: turns latent tokens back into language steps to check what they mean. Why it matters: Without a smooth space, tiny changes in the hidden token can break reasoning and make RL exploration unstable.
Anchor: A VAE is like a recipe book (decoder) plus a pantry organizer (encoder); similar ingredients lead to similar dishes, and you can read back the dish description. A minimal code sketch of this encode-sample-decode loop follows.
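To make the encode-sample-decode loop concrete, here is a minimal PyTorch-style sketch. It is not the paper's implementation: the module name, hidden size, and single-linear-layer heads are assumptions. It shows how a hidden state could be mapped to a mean and spread, sampled with the reparameterization trick, and scored against a standard normal prior.

```python
import torch
import torch.nn as nn

class LatentTokenSampler(nn.Module):
    """Illustrative conditional-VAE-style sampler for one latent token.
    Names and sizes are assumptions, not the paper's exact architecture."""

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.mu_head = nn.Linear(hidden_size, hidden_size)      # predicts the mean
        self.logvar_head = nn.Linear(hidden_size, hidden_size)  # predicts the log-variance (the "spread")

    def forward(self, h_t: torch.Tensor):
        mu = self.mu_head(h_t)
        logvar = self.logvar_head(h_t)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: sample a latent token while keeping gradients.
        latent_token = mu + std * torch.randn_like(std)
        # KL term against a standard normal prior; used later in the SFT total loss.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return latent_token, kl
```

A decoder LLM would then read the sampled latent token back into a language step; that part is omitted here.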
Hook: Think of a coach who doesn't just copy last year's plays but tries new tactics and keeps the ones that score.
The Concept (Reinforcement Learning, RL): RL lets the model try different hidden plans, then rewards good ones so it keeps improving. How it works:
- Sample multiple latent thinking paths.
- Score them with rewards (correctness and coherence).
- Update the policy to favor better-scoring paths. Why it matters: Imitation alone can overfit to one path; RL searches for generally good strategies that work on new problems.
Anchor: Like training a puppy with treats, the model repeats the behaviors that earn higher rewards.
Hook: You know how there are many ways to solve 16 eggs − 3 eaten − 4 baked = 9 left, then 9 × $2 = $18? Different correct step orders still get $18.
The Concept (The problem with imitation-only): If the model only copies one correct explanation, it may believe that's the only right way and fail when a different but valid path is needed. How it works:
- Dataset gives just one of many correct CoTs.
- Model learns that specific path.
- At test time, varied problems or step orders confuse it. Why it matters: This creates a training-testing gap and weaker generalization.
Anchor: Memorizing a single route to school doesn't help when the street is closed; you need flexible planning.
Hook: Imagine a traffic light that knows when it's time to stop adding more steps.
The Concept (Stop head): A small head predicts when to stop generating more latent tokens so each token carries a similar amount of information. How it works:
- At each latent step, a classifier decides "continue" or "stop."
- Training encourages stopping after enough info is encoded.
- This balances information across tokens and unifies token lengths. Why it matters: Without it, some tokens carry too much or too little information, making planning uneven and exploration tricky.
Anchor: It's like knowing when your sandwich has enough layers; adding more won't make it better, just messier. A tiny code sketch of such a stop decision follows.
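As a rough illustration only, the stop head can be pictured as a tiny two-way classifier over the current hidden state; the linear layer, threshold, and commented loop below are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class StopHead(nn.Module):
    """Illustrative continue/stop classifier over the current hidden state."""

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # logits for [continue, stop]

    def should_stop(self, h_t: torch.Tensor, threshold: float = 0.5) -> bool:
        stop_prob = torch.softmax(self.classifier(h_t), dim=-1)[..., 1]
        return bool(stop_prob.mean() > threshold)

# Hypothetical generation loop: emit latent tokens until the stop head fires.
# while step < max_steps and not stop_head.should_stop(h_t):
#     latent_token, _ = sampler(h_t)                  # sampler from the earlier sketch
#     h_t = reasoning_model_step(h_t, latent_token)   # hypothetical one-step update
#     step += 1
```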
Hook: When you explain your math, each line should connect to the next, not jump around.
The Concept (Coherence reward): A bonus in RL that checks whether decoded latent steps connect logically, e.g., whether one step's result is used in the next. How it works:
- Decode each latent token into a language step.
- For math, see if the right-hand side (result) of a step appears in a later step's left-hand side or in the final answer.
- Give higher reward when steps flow coherently. Why it matters: Without coherence, the model might use shortcuts or produce unrelated steps that don't actually help reasoning.
Anchor: It's like grading a math solution not just by the final answer but also by whether each line follows from the last.
The world before: Strong reasoning models often wrote long CoTs: accurate but slow. First attempts at latent reasoning compressed words into hidden vectors but trained by imitating one labeled path, which can make latent spaces sharp and brittle. Adding random noise for RL exploration helped sometimes but often failed when the space wasn't smooth. The gap: We needed a smooth, explainable latent space plus guided exploration to find generally good plans. Real stakes: Faster, cheaper thinking helps tutors on phones, assistants in cars, and tools for students, especially where time, cost, or energy use matters.
02 Core Idea
Hook: You know how the best athletes don't just copy one drill; they practice in a smart gym that gives feedback and lets them try variations safely.
The Concept (ATP-Latent, the big idea): Don't just imitate one explanation; shape a smooth, explainable hidden space with a VAE, then actively plan in it with RL using accuracy and coherence as rewards. How it works (recipe):
- Train a conditional VAE so latent tokens form a smooth, sampleable space and can be decoded back into readable steps.
- Add a stop head so each token carries similar info and we know when to end.
- Use RL to explore many latent plans, rewarding both the correct final answer and the coherence of decoded steps. Why it matters: Without smoothing and guided rewards, exploration is noisy or risky; with them, the model learns flexible, general reasoning strategies using fewer tokens.
Anchor: It's like paving even roads (VAE), adding smart stoplights (stop head), and using a GPS that scores good routes (RL with coherence), so you arrive faster and more reliably.
Multiple analogies:
- Maps: The VAE makes the terrain smooth; the stop head sets safe block sizes; RL tries routes and keeps the ones that both arrive (accuracy) and follow roads logically (coherence).
- Cooking: VAE organizes your pantry; the stop head says "enough spices"; RL tastes the dish and rewards both deliciousness (correctness) and recipe flow (coherence).
- Lego: VAE ensures compatible pieces; the stop head decides when the model is complete; RL prefers builds that both match the blueprint (answer) and make structural sense (coherence).
Before vs After:
- Before: Latent methods mostly imitated one labeled CoT, which could overfit and restrict planning.
- After: ATP-Latent actively explores a smooth space and keeps plans that are both correct and internally consistent.
Why it works (intuition):
- Smoothing with a VAE prevents tiny changes from causing big, random jumps in meaning, so exploration is safer.
- Decoding gives visibility into what latent tokens "say," enabling a meaningful coherence score.
- RL then steers the policy toward good basins in the latent space that produce correct, connected steps across many problems.
Building blocks:
- Conditional VAE (encoder = reasoning model; decoder = explainer model) to mold and interpret the space.
- Stop head to normalize information per token and unify planning length.
- Coherence reward to encourage real participation of steps, not shortcuts.
- Accuracy and format rewards to keep final outputs correct and well-formed.
- GRPO-style RL updates to learn from multiple sampled plans per question.
Anchor: Imagine finding several valid ways to reach $18, each linking its steps cleanly, instead of locking onto only one memorized route.
03 Methodology
High-level overview: Input (a math question) → SFT stage (learn a smooth, decodable latent space with a conditional VAE + stop head) → RL stage (actively explore latent plans and reward accuracy + coherence) → Output (final concise answer with optional decoded steps).
Stage 1: SFT with a conditional VAE and stop head
- What happens: The reasoning LLM acts as the VAE encoder, producing a mean and spread for each latent token so we can sample it. A separate decoder LLM turns latent tokens back into language steps. A stop head learns when to finish generating latent tokens. Training combines: encoder loss (predicting remaining gold text), decoder loss (reconstructing steps), stop loss (learn to stop), and a small KL loss (to keep the space smooth without overpowering reasoning).
- Why it exists: If we only imitate a single fixed path with deterministic latent tokens, the space becomes sharp and brittle; sampling breaks things; RL exploration struggles. The VAE makes the space smooth and sampleable; the decoder provides interpretability; the stop head evens out token information.
- Example with data: For the duck-egg problem, the encoder learns to map early steps (like 16 − 3 and 16 − 3 − 4) into latent tokens; the decoder learns to read them back as short equations; the stop head learns to stop after enough steps to reach a clean final answer.
- What breaks without it: Noisy exploration would produce nonsense; tokens would carry uneven info; we couldn't check what a latent token "means."
Details (SFT):
- Encoder (reasoning LLM): For each latent step t, predict μ_t and σ_t from the prior context; sample l_t ~ N(μ_t, σ_t^2). This injects controlled randomness for later RL.
- Decoder (explainer LLM): Given c latent tokens per stage, reconstruct the corresponding language steps; if there's nothing to reconstruct (past the labeled part), it outputs empty.
- Stop head: At each latent token, predict continue/stop; training encourages stopping after K steps so tokens have comparable information density.
- Total loss: L_Enc + L_Dec + L_Stop + β·L_KL, with β set small so decoding and reasoning don't get overwhelmed by the KL term (a hedged code sketch of this combined loss follows below).
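Only the overall form L_Enc + L_Dec + L_Stop + β·L_KL comes from the paper; the sketch below is a hedged illustration of how those four terms might be combined, with the tensor shapes, cross-entropy choices, and the value of beta all being assumptions.

```python
import torch
import torch.nn.functional as F

def sft_total_loss(enc_logits: torch.Tensor, gold_ids: torch.Tensor,      # encoder: predict remaining gold text
                   dec_logits: torch.Tensor, step_ids: torch.Tensor,      # decoder: reconstruct language steps
                   stop_logits: torch.Tensor, stop_labels: torch.Tensor,  # stop head: 0 = continue, 1 = stop
                   kl: torch.Tensor,                                      # KL from the sampled latent tokens
                   beta: float = 1e-3) -> torch.Tensor:
    """Hedged sketch of L_Enc + L_Dec + L_Stop + beta * L_KL.
    beta is kept small so the KL term does not overwhelm reasoning and decoding."""
    l_enc = F.cross_entropy(enc_logits.reshape(-1, enc_logits.size(-1)), gold_ids.reshape(-1))
    l_dec = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)), step_ids.reshape(-1))
    l_stop = F.cross_entropy(stop_logits.reshape(-1, 2), stop_labels.reshape(-1))
    return l_enc + l_dec + l_stop + beta * kl.mean()
```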
Secret sauce #1: Sampling from a trained distribution (not a single point) makes the space friendly to exploration and still keeps meanings stable, like walking on soft carpet instead of ice.
Stage 2: RL with accuracy + coherence rewards
- What happens: For each question, sample several latent plans by drawing latent tokens from the learned distributions (and using the stop head to decide length). Decode latent tokens to language steps. Compute rewards: (a) correctness of the final answer, (b) coherence of decoded steps (does each step's result feed the next or the final answer?), plus a small format reward.
- Why it exists: Imitation can favor one arbitrary solution path. RL searches for better, more general latent policies across multiple correct routes.
- Example with data: If the decoded steps go 16 − 3 = 13, 13 − 4 = 9, and later 9 × 2 = 18, the coherence is high because each result appears as needed in the following step. If instead steps wander (e.g., 16 − 9 = 7, then jump to 9 × 2 without using 7), coherence drops.
- What breaks without it: The model might rely on shortcuts or produce steps that don't truly contribute, making reasoning fragile.
Coherence reward (how it's computed):
- Decode each chunk of c latent tokens into a language step.
- For math, check if the right-hand side (result) of a step is used as the left-hand side input in later steps or appears in the final answer.
- Score = fraction of steps that chain correctly. This nudges the model to build logically connected solutions (a rough code sketch of this check follows this list).
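For math steps written as simple equations, a rough version of this check could look like the sketch below; the regex parsing, the substring matching, and the function name are assumptions, and the paper's actual scorer may differ.

```python
import re

def coherence_reward(decoded_steps: list[str], final_answer: str) -> float:
    """Fraction of decoded steps whose result (right-hand side) reappears
    in a later step or in the final answer. A simplified, illustrative check."""
    if not decoded_steps:
        return 0.0
    chained = 0
    for i, step in enumerate(decoded_steps):
        match = re.search(r"=\s*\$?(-?\d+(?:\.\d+)?)\s*$", step)
        if not match:
            continue  # step has no parsable result; treat it as unchained
        result = match.group(1)
        later_text = " ".join(decoded_steps[i + 1:]) + " " + final_answer
        if result in later_text:
            chained += 1
    return chained / len(decoded_steps)

# Each result feeds a later step or the answer, so the score is 1.0.
print(coherence_reward(["16 - 3 = 13", "13 - 4 = 9", "9 * 2 = 18"], "18"))
```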
Secret sauce #2: Coherence works even without labels. The paper shows that training with "coherence-only" (unsupervised) still improves over SFT alone, meaning the model can self-improve by preferring internally consistent thinking.
Putting it together (Flow):
- Input → Encoder samples latent tokens → Stop head decides when to stop → Decoder can optionally show what the tokens mean → Final answer generated.
- RL repeats: sample several latent plans → score (accuracy + coherence + format) → GRPO-style update to favor better plans (a minimal sketch of the group-relative advantage follows).
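To make the GRPO-style step concrete, here is a minimal sketch of the group-relative advantage: several latent plans are sampled per question, their total rewards (accuracy + coherence + format, with the weights assumed here) are standardized within the group, and the policy is then pushed toward plans with positive advantage. The clipping and KL-regularization terms of a full GRPO objective are omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each sampled plan's total reward against the other plans
    drawn for the same question (GRPO-style group baseline)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled latent plans for one question.
rewards = torch.tensor([1.8, 0.4, 1.0, 0.2])   # accuracy + coherence + format (weights assumed)
print(group_relative_advantages(rewards))      # positive for above-average plans
```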
What makes it clever:
- It fixes the root cause of failed exploration (unsmooth hidden spaces) by training a VAE.
- It gives the model eyes on its own hidden thoughts via decoding, enabling a measurable coherence reward.
- It prevents endless token growth with a learned stop signal, keeping reasoning compact and efficient.
04 Experiments & Results
The test: The authors measured how often the final answers were correct (accuracy) and how many tokens were generated on average (#Token), since efficiency matters. They used four math datasets: GSM8K (in-domain), and GSM-Hard, SVAMP, MultiArith (out-of-domain). Baselines included CoT-SFT, Answer-SFT, Coconut, SIM-CoT, and CoLaR.
The competition: Coconut and SIM-CoT are strong latent-reasoning baselines; SIM-CoT adds decoding supervision but struggled to benefit from RL unless warmed up carefully. CoLaR compresses chains in one shot; CoT-SFT writes full text CoTs; Answer-SFT skips steps and learns only answers.
The scoreboard (with context):
- ATP-Latent achieved 47.7% average accuracy with 8.4 average tokens. That's like getting an A- while also finishing the test faster.
- Compared to their SIM-CoT reimplementation, ATP-Latent was +4.1% in accuracy and used 3.3% fewer tokens. That's like winning by several points while also running fewer laps.
- On MultiArith, ATP-Latent reached 94.4% accuracy (very strong) while staying token-efficient.
Surprising findings:
- Coherence-only RL (no correctness labels) still improved over SFT. This means coherence is a powerful unsupervised signal, letting the model self-improve by preferring logically connected steps.
- SIM-CoT benefited from RL only when fine-tuned after Coconut, suggesting that making the space smoother first (or less sharp) is key for exploration, supporting the paper's emphasis on VAE smoothing.
- Pass@K greatly improved after RL in ATP-Latent, meaning when you try multiple sampled plans, the chance that at least one is correct rises sharply: evidence for better planning diversity.
- Self-extending ability: With the learned stop head, ATP-Latent handled more latent steps without collapsing, unlike earlier methods that became unstable when scaling up token stages.
What moves the needle:
- VAE smoothing + stop head made RL exploration stable and useful.
- Coherence reward helped the model avoid shortcuts and keep steps genuinely connected to the final answer.
- The combination led to better accuracy at lower token counts, which is the main goal of latent reasoning.
05 Discussion & Limitations
Limitations:
- VAE quality matters. If the decoder cannot reliably read back latent tokens or the encoder distributions are poorly learned, coherence signals weaken, and exploration guidance degrades.
- Task dependence. The current coherence metric focuses on math-like steps where checking whether one result feeds the next is straightforward; for open-ended language tasks, coherence may be harder to define.
- Reward tuning. The weights for correctness, coherence, and format require care; poor choices can under- or over-emphasize certain behaviors.
- Compute and data. Training encoder/decoder VAEs and running RL sampling require nontrivial resources.
Required resources:
- Two LLMs (or heads): one for reasoning (encoder/policy), one for decoding explanations.
- GPU budget for SFT (VAE training) and RL sampling (GRPO-like loops).
- Datasets with structured or at least decodable intermediate signals (equations help compute coherence).
When not to use:
- Purely creative writing or tasks without clear step linkage may not benefit from the current coherence metric.
- Extremely low-resource environments where RL sampling is impractical.
- Domains where interpretability via decoding is unreliable; if decoded steps are noisy, the coherence reward can mislead.
Open questions:
- How to generalize coherence beyond math, e.g., logical entailment in text, evidence chains in QA, or multi-hop reasoning in knowledge graphs?
- Can we learn the coherence signal itself (a learned verifier) while keeping efficiency gains?
- How does this scale to larger base models and multimodal settings (images, code, tables) without losing token savings?
- What are the best curricula and stop-head strategies for diverse tasks?
- Can we design exploration strategies that are even safer and more sample-efficient using the VAE geometry directly?
06 Conclusion & Future Work
3-sentence summary: This paper introduces ATP-Latent, which first shapes a smooth, decodable latent space with a conditional VAE and stop head, then uses RL to actively plan in that space. By rewarding both answer correctness and coherence of decoded steps, the method finds flexible, general reasoning policies that use fewer tokens. Experiments on four math benchmarks show higher accuracy with lower token counts compared to strong baselines.
Main achievement: Demonstrating that active planning in a well-shaped latent space, guided by a coherence reward, beats imitation-only training in both accuracy and efficiency.
Future directions: Extend coherence beyond math to general text and multimodal reasoning, explore learned verifiers for broader coherence checks, and scale to larger models and domains while preserving token efficiency. Investigate richer geometry-aware exploration that uses VAE structure even more directly.
Why remember this: It reframes latent reasoning from "copy a path" to "plan in a smooth space with feedback," offering a practical path to faster, smarter AI that thinks more with fewer tokens.
Practical Applications
- On-device math tutors that solve problems quickly with fewer tokens and show optional decoded steps.
- Voice assistants that answer multi-step questions in real time without long explanations.
- Customer-support bots that reason through troubleshooting flows compactly and coherently.
- Educational tools that provide short, accurate hints instead of lengthy walkthroughs.
- Embedded systems (cars, appliances) that need fast, low-latency reasoning for user queries.
- Financial calculators that check multi-step logic (tax, interest) with coherent intermediate results.
- Medical triage chatbots that keep internal reasoning stable and decodable for auditing.
- Coding helpers that plan hidden steps but output concise, correct final code snippets.
- Robotics planners that use a smooth latent space to explore action sequences safely.
- Research assistants that try multiple hidden plans (high Pass@K) and present the best concise answer.