
No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs

Intermediate
Liyan Xu, Mo Yu, Fandong Meng et al. · 2/2/2026
arXiv · PDF

Key Summary

  • Large language models don’t map out a full step-by-step plan before they start thinking; they mostly plan just a little bit ahead.
  • The authors build a probe called Tele-Lens that looks at a model’s hidden states during Chain-of-Thought to see what the model seems to be planning.
  • Across 12 very different tasks, models were good at predicting only the next few steps, not the whole solution path (a short-sighted or “myopic” horizon).
  • For math-like, multi-step problems, signs of the correct final answer usually appear only right before the reasoning finishes.
  • For knowledge questions, models sometimes show a vague early “gist” of the right answer, but it’s weaker than actually reasoning or even answering directly.
  • Using a Wooden Barrel idea—focus on a few critical spots instead of averaging everything—the authors greatly improved uncertainty estimates (up to 6–9% better AUROC).
  • They also show automatic CoT bypass: skip long thinking when early signs say a question is easy, saving compute with almost no accuracy loss (e.g., only a 0.03 drop with 16.2% fewer thoughts on CSQA).
  • This unifies past, seemingly conflicting claims: models carry some early hints, but precise global plans don’t appear until late in reasoning.
  • The work suggests practical tools: better confidence scores, faster inference, and smarter use (or skipping) of CoT.
  • Code, data, and models are released for the community to build on.

Why This Research Matters

If models don’t truly plan far ahead, we shouldn’t spend time and money forcing them to think out loud for every question. By focusing on just a few critical steps, we can measure confidence more accurately and avoid being fooled by long, fluent but fragile explanations. Systems can automatically skip CoT on easy items and reserve detailed reasoning for genuinely hard problems, making AI faster and greener. Product teams can use better uncertainty scores to decide when to verify, ask a human, or deploy a tool. Researchers can design training that strengthens the few key steps that matter most. For users, this means answers that are not just polished, but also more trustworthy and efficient to obtain.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re solving a maze. Sometimes you can see the whole map from above (easy!), but other times you only see the next turn and must feel your way step by step.

🥬 The Concept: Chain-of-Thought (CoT)

  • What it is: CoT is when an AI writes out its reasoning steps in plain language, like showing its work on math homework.
  • How it works: 1) Read the problem, 2) Write small reasoning steps, 3) Use those steps to reach an answer, 4) Output the final answer.
  • Why it matters: Without CoT, the model often guesses or misses multi-step logic, like trying to solve a long division problem in one jump. 🍞 Anchor: When asked “How many apples are left if I eat 3 from 10?”, a CoT model might write, “Start with 10; eat 3; 10−3=7; answer: 7.”

The World Before: For years, people saw that CoT boosted reasoning, especially for math and logic. But another camp noticed something curious: even before writing a single reasoning token, the model’s internal “hidden states” sometimes already held hints about the final answer or how long the reasoning would be. That sounds like having a plan before speaking.

🍞 Hook: You know how you might know the punchline of a joke before you finish telling the story? Some researchers thought LLMs work like that, too.

🥬 The Concept: Hidden States

  • What it is: Hidden states are the model’s private scribbles—tiny vectors that store what the model is thinking at each step.
  • How it works: 1) Read words, 2) Update hidden states layer by layer, 3) Use them to pick the next word, 4) Repeat.
  • Why it matters: If hidden states already “know” the solution, CoT might just be fancy window dressing. 🍞 Anchor: Like notes in your head while explaining a puzzle—you may have the answer but still explain it slowly.

The Problem: Two stories didn’t match. One story said “LLMs plan early, so CoT is just telling a story already decided.” The other said “Transformers can’t compose many steps in one shot; they need intermediate steps, so CoT is essential.” Which is it?

Failed Attempts: Prior studies often looked at one task at a time. In some tasks, signals looked early and strong; in others, they didn’t. Without a common tool and a broad testbed, it was easy to draw opposite conclusions.

🍞 Hook: Think of testing a running shoe only on a treadmill and deciding it’s great, then being surprised it slips on a muddy trail.

🥬 The Concept: Probing

  • What it is: Probing is testing what information is inside hidden states by training a small, simple reader to predict something (like the next token or the final answer) from them.
  • How it works: 1) Freeze the model, 2) Train a tiny adapter to map hidden states to predictions, 3) Check accuracy.
  • Why it matters: Without probing across many tasks, we can mistake a local pattern for a general rule. 🍞 Anchor: Like asking short quiz questions about what your brain remembers at each point in solving a riddle.
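To make this concrete, here is a minimal probing sketch in Python (PyTorch). The dimensions, data, and single training step are illustrative assumptions, not the paper's actual setup; the point is only that the big model stays frozen while a tiny reader is trained on its hidden states.

```python
import torch
import torch.nn as nn

# Minimal probe: the base model is frozen elsewhere; only this small
# reader is trained to predict something (here, a 4-way final answer)
# from a captured hidden state. All sizes and data are stand-ins.
probe = nn.Linear(4096, 4)                    # hidden_dim -> answer classes
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

hidden_states = torch.randn(8, 4096)          # stand-in for real hidden states
labels = torch.randint(0, 4, (8,))            # stand-in for true final answers

loss = loss_fn(probe(hidden_states), labels)  # how recoverable is the answer?
loss.backward()
optimizer.step()
```

If the probe's accuracy on held-out states is high, the information was in the hidden states all along; if it hovers at chance, it wasn't.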

The Gap: We needed a shared lens to peek into hidden states, across many domains (from puzzles like Parity to knowledge tests like MMLU), and measure how far ahead models really plan.

Real Stakes: This isn’t just academic. If models don’t plan far ahead, then:

  • We should focus CoT on tricky parts and stop over-explaining easy ones (saves time and money).
  • We must rethink confidence: averaging certainty over thousands of filler words can hide the few wobbly steps that actually decide correctness.
  • We can design smarter systems that know when to think hard—and when not to.

🍞 Hook: Imagine packing for a trip. If you overpack for every outing, you waste effort; if you underpack for a hike, you get into trouble. You need the right amount of planning.

🥬 The Concept: Uncertainty Estimation

  • What it is: It’s a score of how sure the model is about its answer.
  • How it works: 1) Look at token probabilities, 2) Turn them into scores like entropy or perplexity, 3) Higher uncertainty → be cautious.
  • Why it matters: Without calibrated uncertainty, models can sound confident but be wrong. 🍞 Anchor: Like a weather app saying “70% chance of sun” so you know whether to bring a jacket.
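As a small illustration, the two standard scores mentioned above can be computed from raw logits like this; the sequence length and vocabulary size are illustrative assumptions.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy (in nats) from raw logits of shape [seq, vocab]."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

def perplexity(logits: torch.Tensor, chosen: torch.Tensor) -> torch.Tensor:
    """exp(mean negative log-likelihood) of the tokens actually generated."""
    log_p = torch.log_softmax(logits, dim=-1)
    nll = -log_p.gather(-1, chosen.unsqueeze(-1)).squeeze(-1)
    return nll.mean().exp()

logits = torch.randn(10, 32000)            # stand-in model outputs
chosen = torch.randint(0, 32000, (10,))    # stand-in generated tokens
print(token_entropy(logits).mean().item(), perplexity(logits, chosen).item())
```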

This paper builds a single, careful story: create one probe, test many tasks, and explain when early planning signals appear, when they don’t, and how to use this fact to make models safer, faster, and more reliable.

02 Core Idea

🍞 Hook: You know how some people drive by only watching the next turn, not the whole route? That works okay—until you hit a complicated interchange.

🥬 The Concept: Latent Planning Horizon

  • What it is: The latent planning horizon is how far ahead a model can “think” in its hidden states at any point during its Chain-of-Thought, before the upcoming steps are actually written out.
  • How it works: 1) Pause during CoT, 2) Read hidden states, 3) Predict future steps and answers, 4) See how far those predictions stay accurate.
  • Why it matters: If the horizon is short, models aren’t pre-deciding everything; they’re inching forward. 🍞 Anchor: A GPS that shows the next turn clearly but gets fuzzy if you ask about turns three or four steps ahead.
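One hedged sketch of how such a horizon could be measured: probe further and further ahead and record where accuracy collapses. The `probe_top5` callable, the data layout, and the 0.5 accuracy floor are hypothetical stand-ins, not the paper's exact procedure.

```python
# records: list of (hidden_state, future_tokens) pairs collected during CoT;
# probe_top5(hidden_state, m): hypothetical callable returning the probe's
# Top-5 token guesses for m steps ahead.
def planning_horizon(records, probe_top5, max_ahead=8, floor=0.5):
    for m in range(1, max_ahead + 1):
        hits, total = 0, 0
        for hidden_state, future_tokens in records:
            if len(future_tokens) >= m:
                hits += future_tokens[m - 1] in probe_top5(hidden_state, m)
                total += 1
        if total == 0 or hits / total < floor:
            return m - 1  # last lookahead distance that stayed reliable
    return max_ahead
```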

Aha! Moment (one sentence): LLMs are mostly short-sighted planners—hidden states support immediate, local moves, not precise global plans, though easy tasks can show a vague early “answer gist.”

Multiple Analogies:

  1. Chess: A beginner can see a move or two ahead; a grandmaster sees ten. LLMs are more beginner-like in most tasks—good locally, shaky globally.
  2. Cooking: You read just the next step (“boil water”), then the next (“add pasta”). If the recipe is tricky (soufflĂ©), last-minute success only appears near the end.
  3. Hiking: Your flashlight lights the next few meters. You move safely step by step, but you can’t confirm the mountain peak route until you’re closer.

Before vs After:

  • Before: Some thought hidden states pre-planned the whole reasoning, making CoT redundant. Others argued Transformers need step-by-step CoT for real composition.
  • After: Both sides were partly right. For simple, semantic questions, early states hint at the answer. For real multi-step composition (Parity, Cycle), precise answer signals pop up only at the very end. Overall: myopic horizon.

Why It Works (intuitive): Hidden states carry strong local patterns—what’s the next token? what’s the next micro-step?—because autoregressive training rewards predicting one token at a time. Long-range composition (like counting across a long list or following a graph) isn’t naturally compressed early; the model has to accumulate evidence, so precise final answers emerge late.

Building Blocks:

  • Tele-Lens probe: a small adapter that maps hidden states to predictions (final answer, next tokens, reasoning length).
  • Three probes, one story: 1) Final-answer probing shows late spikes on hard tasks; 2) Subsequent-token probing shows steep accuracy drop after 1–2 steps; 3) Reasoning-length probing is unreliable early, except when shortcuts (like input length) leak information.
  • Wooden Barrel principle: The reliability of a CoT is set by a few critical steps (the shortest staves), not the average confidence across thousands of filler tokens. Focus on pivot tokens → better uncertainty.
  • CoT bypass: If early signals show high confidence on an easy item, skip the long CoT and answer directly—save time with tiny or no accuracy loss.

🍞 Hook: Think of a leaky barrel: adding more tall planks doesn’t help if a few short planks still leak.

🥬 The Concept: Wooden Barrel Principle (for CoT)

  • What it is: A CoT’s reliability is controlled by a few critical, uncertain steps—improve those and you lift the whole chain.
  • How it works: 1) Identify pivot tokens (most uncertain or most informative), 2) Score only them, 3) Use that score as your confidence.
  • Why it matters: Averaging across the whole chain hides the weak links. 🍞 Anchor: In a long essay, a few key sentences decide the grade; polishing filler won’t fix broken logic in those key spots.

03 Methodology

High-Level Recipe: Input question → Generate a Chain-of-Thought (CoT) → At each step, capture hidden states → Tele-Lens adapter reads these states → Predict three things (next tokens, final answer, total CoT length) → Analyze how far ahead predictions stay good → Use insights to design better uncertainty metrics and a CoT bypass switch.

Step A: Collect Diverse CoTs

  • What happens: The authors run two kinds of models: an off-the-shelf Qwen3-32B (with thinking mode) and an In-Domain 7B model trained via reinforcement learning to produce shorter, steadier CoTs. They gather hidden states for 12 tasks spanning explicit composition (Parity, Cycle, Subsum), implicit composition (GSM8K, MATH, AIME, MuSR, Zebra), and knowledge/semantics (CSQA, MMLU, QuALITY, GPQA).
  • Why this step exists: If you only test one kind of task, you might think the model always plans globally, or never does—context matters.
  • Example data: Parity: “Count how many 8’s appear; even or odd?” Cycle: “Is there a path from v_344 to v_460 in the directed edges?” CSQA: commonsense multiple-choice.

🍞 Hook: Like checking a backpack on city streets, forest trails, and mountains—not just one place.

🥬 The Concept: Task Diversity

  • What it is: Using many task types to see if a pattern holds widely.
  • How it works: 1) Choose varied datasets, 2) Standardize answers (often multi-choice), 3) Compare behaviors.
  • Why it matters: Without diversity, you risk overfitting your conclusion to one niche. 🍞 Anchor: A shoe that’s great on a treadmill but slips on mud isn’t truly “good.”

Step B: Build Tele-Lens (the probe)

  • What happens: Tele-Lens is a small, low-rank adapter (inspired by Logit Lens ideas) trained per layer to map hidden states to predictions: (1) final-answer class (like A/B/C/D), (2) next tokens up to m steps ahead, (3) total CoT length.
  • Why this step exists: The main model stays frozen; Tele-Lens is the magnifying glass. This avoids changing what we are measuring.
  • Example: At the 10th reasoning token, read its hidden state and predict the final answer or the next few tokens.

🍞 Hook: Think of sticking a label reader onto a sealed machine: you don’t open it, you just read its signals.

🥬 The Concept: Low-Rank Adapter

  • What it is: A tiny plug-in network that gently transforms hidden states to a prediction space.
  • How it works: 1) Compress (low rank), 2) Add nonlinearity, 3) Map to vocabulary or answer labels, 4) Train quickly.
  • Why it matters: It’s efficient and less likely to overfit; we probe many layers cheaply. 🍞 Anchor: Like putting a small camera lens on a phone to see details without rebuilding the phone.
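A minimal sketch of such an adapter, assuming a 4096-dimensional hidden state, the rank-256 setting noted in the resources section below, and a 4-way answer space; everything else is illustrative.

```python
import torch
import torch.nn as nn

class LowRankProbe(nn.Module):
    """Tele-Lens-style reader (sketch): compress, add nonlinearity, map out."""
    def __init__(self, hidden_dim=4096, rank=256, num_classes=4):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank)   # 1) compress (low rank)
        self.act = nn.GELU()                      # 2) nonlinearity
        self.up = nn.Linear(rank, num_classes)    # 3) map to answer labels

    def forward(self, h):
        return self.up(self.act(self.down(h)))

print(LowRankProbe()(torch.randn(2, 4096)).shape)  # torch.Size([2, 4])
```

One such probe per layer stays cheap to train precisely because the low rank keeps the parameter count small.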

Step C: Probe Three “Teleological” Dimensions

  1. Final-Answer Probing
  • What: From a hidden state, guess the ultimate answer.
  • Why: If precise plans exist early, accuracy should be high at the start.
  • Example: In Parity, can layer-21’s hidden state at token #2 already say “even”? Results: Usually no—spikes happen near the very end for hard tasks.
  2. Subsequent-Token Probing
  • What: From a hidden state, guess the next few CoT tokens (Top-5).
  • Why: Measures local planning depth: how many steps ahead are predictable?
  • Example: Accuracy is strong for 1–2 steps, then falls off—especially in natural language tasks.
  3. Reasoning-Length Probing
  • What: From early states, guess how long the CoT will be.
  • Why: Tests for a “global clock.”
  • Example: Predictions are unreliable early, except when a task’s input length gives away CoT length (a shortcut) in Parity/Subsum.
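For the second probe, the Top-5 measure works as in this small sketch (shapes and the synthetic data are illustrative assumptions):

```python
import torch

def top5_accuracy(probe_logits: torch.Tensor, true_tokens: torch.Tensor) -> float:
    """Share of positions where the token actually written m steps later
    lands among the probe's five highest-scoring candidates."""
    top5 = probe_logits.topk(5, dim=-1).indices                  # [n, 5]
    return (top5 == true_tokens.unsqueeze(-1)).any(-1).float().mean().item()

probe_logits = torch.randn(100, 32000)        # probe outputs for lookahead m
true_tokens = torch.randint(0, 32000, (100,)) # what the model actually wrote
print(top5_accuracy(probe_logits, true_tokens))
```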

🍞 Hook: Like asking a runner: “What’s your finishing time?” after 10 meters. Tough to know, unless the course is a fixed length.

🥬 The Concept: Confounders

  • What it is: Shortcut clues that look like planning but aren’t.
  • How it works: If input length correlates with CoT length, the probe predicts well for the wrong reason.
  • Why it matters: We must avoid mistaking shortcuts for real foresight. 🍞 Anchor: Guessing a book’s difficulty by thickness alone—you might be right sometimes for the wrong reason.

Step D: Turn Insights into Better Uncertainty

  • What happens: Most tokens are high-confidence “filler.” Averaging uncertainty over thousands of tokens dilutes the few tricky leaps. The authors propose the Wooden Barrel principle: find the few pivot positions (e.g., top-k highest entropy or lowest Tele-Lens final-answer entropy) and average only them to score uncertainty.
  • Why this step exists: We want confidence scores that separate correct from incorrect chains.
  • Example: With Tele-Lens pivots (k=5), AUROC improves up to 9% over best baselines. With general LM signals on Qwen3-32B, picking k=100 pivots improves AUROC by up to 6% across metrics.

🍞 Hook: If one weak link can break a chain, check the weak links—not every link equally.

🥬 The Concept: Pivot Tokens

  • What it is: The few spots that steer the solution or carry the real uncertainty.
  • How it works: 1) Rank tokens by uncertainty/informativeness, 2) Keep top-k, 3) Average their scores.
  • Why it matters: Captures the true risk points of reasoning. 🍞 Anchor: In a debate, a couple of questions decide the winner—judge those moments to gauge overall strength.
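A minimal sketch of pivot-based scoring, assuming per-token uncertainties are already computed; the k values follow the settings quoted above, and the toy chain simply shows why averaging everything dilutes the signal.

```python
import torch

def pivot_uncertainty(token_entropies: torch.Tensor, k: int = 5) -> float:
    """Wooden Barrel-style score: average only the k most-uncertain
    positions. k=5 mirrors the Tele-Lens setting above; with bare LM
    entropy the article reports a larger k (e.g., 100)."""
    k = min(k, token_entropies.numel())
    return token_entropies.topk(k).values.mean().item()

chain = 0.05 * torch.rand(3000)      # thousands of low-entropy filler tokens
chain[[42, 517, 1890]] = 4.0         # a few genuinely risky pivot steps
print(pivot_uncertainty(chain))      # high: dominated by the pivots
print(chain.mean().item())           # near zero: filler drowns the risk
```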

Step E: CoT Bypass (Skip When Safe)

  • What happens: Look at the first few CoT tokens. If Tele-Lens says the final-answer prediction is very confident (low normalized entropy), stop unfolding CoT and answer directly. On CSQA/MMLU with Qwen3-32B, this trims thinking by 16.2%/12.4% with just ~0.03 accuracy loss.
  • Why this step exists: Save time and compute on easy items; think hard only when needed.
  • Example: Threshold 0.1 on normalized entropy triggers bypass on easy cases; for hard composition tasks (like Parity), the system naturally keeps thinking.
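Here is a hedged sketch of that switch, assuming a probe that turns early hidden states into a distribution over answer choices; only the 0.1 normalized-entropy threshold comes from the text.

```python
import torch

def should_bypass(answer_probs: torch.Tensor, threshold: float = 0.1) -> bool:
    """Skip the long CoT when normalized answer entropy is very low."""
    entropy = -(answer_probs * answer_probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(answer_probs.numel())))
    return (entropy / max_entropy).item() < threshold

confident = torch.tensor([0.99, 0.004, 0.003, 0.003])  # easy knowledge item
unsure = torch.tensor([0.40, 0.30, 0.20, 0.10])        # genuinely hard item
print(should_bypass(confident))  # True  -> answer directly, skip the CoT
print(should_bypass(unsure))     # False -> keep thinking step by step
```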

The Secret Sauce:

  • Use one unified, lightweight probe (Tele-Lens) to read hidden states across tasks.
  • Don’t assume a global plan—measure how far ahead signals really reach.
  • Focus uncertainty on pivots (shortest staves) rather than average over the whole chain.
  • Turn insights into knobs: a bypass switch that safely skips long thoughts when early confidence is high.

04 Experiments & Results

The Test: What was measured and why?

  • Final-answer probing: Can hidden states at various CoT positions name the correct final answer? This reveals if precise plans appear early or only near the end.
  • Subsequent-token probing: How well can we predict the next m tokens (Top-5)? This quantifies local planning depth.
  • Reasoning-length probing: Do early states foresee how long the CoT will be? This checks for global plans.
  • Uncertainty calibration: Can we better tell confident-correct from uncertain-wrong? We compare standard metrics (perplexity, entropy, self-certainty) against our pivot-focused variants.
  • CoT bypass: Can we detect when it’s safe to skip CoT to save compute without hurting accuracy?
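As a reminder of what the calibration metric means, this toy snippet shows AUROC rewarding a score that ranks wrong chains above correct ones; the data is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
is_wrong = rng.integers(0, 2, size=200)      # 1 = chain ended incorrect
# A useful uncertainty score runs systematically higher on wrong chains.
score = rng.normal(0.0, 1.0, 200) + 2.0 * is_wrong
print(roc_auc_score(is_wrong, score))        # well above the 0.5 chance level
```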

The Competition: Baselines and settings

  • Models: Off-the-shelf Qwen3-32B (with/without CoT), and an In-Domain 7B trained via RL for concise, stable CoTs.
  • Tasks: 12 datasets—explicit composition (Parity, Cycle, Subsum), implicit composition (GSM8K, MATH, AIME, MuSR, Zebra), knowledge/semantics (CSQA, MMLU, QuALITY, GPQA).
  • Metrics: Accuracy for final-answer probing; Top-5 accuracy for subsequent tokens; AUROC for uncertainty calibration; overall task accuracy for bypass experiments.

Scoreboard with Context:

  1. Final-answer probing (myopic horizon)
  • Explicit composition: Precise answer signals are near-random early, then spike only one step before finishing (e.g., in Parity, probabilities stay flat until after all digits are counted, then jump to >90%).
  • Knowledge/semantic tasks: An early “gist” appears (above random), but it’s weaker than real reasoning or even direct answering. Across tasks, using only the early gist generally underperforms both full CoT and no-CoT direct answers.
  • Layer effects: Mid-to-late layers often probe best; the very last layer isn’t always most informative.
  2. Subsequent-token probing (limited foresight)
  • Next 1–2 steps: Often >50% Top-5 accuracy—solid local planning.
  • Beyond that: Accuracy declines quickly, especially in natural language tasks (e.g., MMLU/GPQA). Structural tasks (Parity/Cycle) show better patterns but still decay.
  3. Reasoning-length probing (no early global clock)
  • Generally poor early predictions across tasks—no reliable “internal stopwatch.”
  • Confounders: Parity/Subsum look better because CoT length correlates with input length (a shortcut). In Cycle, where length depends on actual path/cycle structure, predictions falter—supporting the lack of true global planning.
  4. Uncertainty calibration (Wooden Barrel wins)
  • With Tele-Lens pivots (k=5), AUROC improves up to 9% over best global baselines across GSM8K, Zebra, MMLU, GPQA.
  • With general LM metrics (no special probe), selecting top-k uncertain positions (e.g., k=100 on Qwen3-32B) boosts AUROC consistently by 3–6% across perplexity, entropy, and self-certainty.
  • Insight: Most tokens are high-confidence “filler.” Scoring a few pivots captures the true risk of the chain.
  5. CoT bypass (save time safely)
  • Off-the-shelf Qwen3-32B: Using an early-entropy threshold of 0.1, the system bypasses 16.2% (CSQA) and 12.4% (MMLU) of thoughts with only ~0.03 accuracy drop overall. Hard, compositional tasks naturally do not trigger bypass.
  • In-Domain 7B: Higher bypass ratios are possible with small accuracy trade-offs; thresholds control the balance.

Surprising Findings:

  • Early “answer gist” doesn’t beat just answering directly (no CoT) on most tasks, and it’s far worse than full CoT on hard problems. So gist ≠ plan.
  • The best probing layer is often not the final layer—intermediate layers can hold richer semantic cues.
  • Tele-Lens and LM-entropy pivots choose very different spots in the chain; combining them might do even better.

Big Picture: The results strongly support a myopic planning horizon. LLMs make confident local steps, accumulate evidence, and only lock in the precise answer very near the end—especially for problems that truly require multiple steps.

05 Discussion & Limitations

Limitations:

  • Fixed answer spaces: Final-answer probing used multi-choice or fixed token sets. Open-ended formats are trickier and would need alternative probing or scoring designs.
  • Prober dependency: Tele-Lens requires training small adapters per layer. While lightweight, it’s still extra work and may overfit if not controlled.
  • Confounders: Some tasks leak shortcuts (e.g., input length correlates with CoT length). Future studies must guard against such artifacts.
  • Model scope: Results center on Qwen-family models and one RL-trained 7B; other architectures, sizes, or training regimens might differ somewhat in horizon length.

Required Resources:

  • Access to hidden states during CoT generation.
  • Compute to train Tele-Lens adapters (rank-256 per layer) and run large-batch probing.
  • Datasets covering multiple domains so conclusions generalize.

When NOT to Use:

  • If you cannot access hidden states (black-box APIs only), Tele-Lens-based signals won’t be directly available. In that case, use the general pivot strategy with bare LM entropy/self-certainty.
  • For very short, trivial prompts, CoT bypass is overkill; answering directly is fine without extra checks.
  • For creative or open-ended generation, the Wooden Barrel principle may need adaptation (e.g., segment-level pivots instead of token-level).

Open Questions:

  • Can we train models to expand their planning horizon, or is myopia inherent to autoregressive training? Would new objectives or architectures help?
  • What’s the best way to find pivots—entropy, Tele-Lens, attention maps, gradient-based saliency, verifier disagreement, or a hybrid?
  • Can we turn pivot detection into better training (e.g., focus feedback/RL on those steps) for stronger, shorter, more reliable CoTs?
  • How do these findings interact with verifiers/judges and with external tools (like calculators or search) that might extend effective horizon?

06 Conclusion & Future Work

Three-Sentence Summary: This paper shows that large language models rarely plan the whole solution in advance; instead, their hidden states support short, local moves, and precise final answers usually appear only near the end of reasoning—especially for multi-step tasks. The authors introduce Tele-Lens to systematically measure this “latent planning horizon,” then use a Wooden Barrel principle to focus on a few critical steps for better uncertainty scores and a safe CoT bypass switch. The result is a unified explanation of past mixed observations and practical gains in confidence and efficiency.

Main Achievement: Uncovering and validating the myopic planning horizon of LLMs—plus turning that insight into working tools that improve uncertainty calibration (up to 6–9% AUROC) and reduce unnecessary thinking with negligible accuracy loss.

Future Directions:

  • Expand pivot-finding methods (combine Tele-Lens with LM entropy, verifiers, or disagreement signals) and test on open-ended generation.
  • Explore training that stretches the planning horizon or explicitly scaffolds key pivot steps.
  • Use pivot-aware signals to compress CoTs, route hard/easy cases, and guide reinforcement learning toward decisive steps.

Why Remember This: It reframes how we think about LLM reasoning: not a pre-written script, but a careful step-by-step walk with a short flashlight beam. Once you know that, you can place the light where it matters most—on the few steps that make or break the answer.

Practical Applications

  • Deploy a pivot-based confidence score to flag risky answers for human review while letting confident ones pass automatically.
  • Add a CoT bypass switch: after a few early tokens, skip full reasoning on easy items to reduce latency and cost.
  • Compress CoTs by keeping only pivot steps, shrinking logs and speeding up audits without losing essential logic.
  • Route questions: send likely-hard, multi-step tasks to full CoT + tools; keep easy, semantic ones in fast mode.
  • Train with pivot-aware feedback: focus reinforcement learning or verifiers on the few decisive steps to improve reliability.
  • Early-stop decoding when pivot uncertainty stays low for several steps, saving tokens without hurting accuracy.
  • Blend signals (Tele-Lens + LM entropy + verifier disagreement) to identify pivots more robustly.
  • Curriculum design: build datasets that stress the few critical steps rather than flooding with filler reasoning.
  • Monitoring dashboards: visualize pivot locations and confidence spikes to audit model behavior over time.
  • Tool selection: trigger calculators/search only when pivot uncertainty crosses a threshold, avoiding unnecessary tool calls.
#chain-of-thought #latent-planning-horizon #Tele-Lens #probing-hidden-states #uncertainty-estimation #pivot-tokens #Wooden-Barrel-principle #low-rank-adapter #logit-lens #entropy #perplexity #self-certainty #CoT-bypass #Qwen3-32B #compositional-reasoning