Does Your Reasoning Model Implicitly Know When to Stop Thinking?
Key Summary
- Big AI reasoning models often keep thinking long after they have already found the right answer, wasting time and tokens.
- The paper shows that these models actually know when to stop, but common sampling (pass@1 and greedy next-token decoding) hides this skill.
- SAGE is a new way to sample reasoning that follows the model's own confidence and stops right when the model feels done.
- SAGE finds short, precise chains of thought that are more accurate and much shorter than usual outputs.
- SAGE-RL mixes a little SAGE sampling into group-based reinforcement learning so the model learns efficient reasoning patterns.
- Across tough math benchmarks, SAGE-RL improves accuracy while cutting a lot of tokens, boosting token efficiency dramatically.
- A key insight is to score whole paths by cumulative confidence, not just the next token, so the model can confidently choose to end.
- As more candidate paths are explored, models converge to short high-confidence solutions, showing they implicitly know when to stop.
- This approach reduces latency and cost without changing the model architecture or needing special reward hacks.
- The result is faster, smarter, and more concise solutions that still get the answer right.
Why This Research Matters
Fast, accurate AI helps in classrooms, coding, and daily problem solving without making people wait or pay extra for wasted text. By teaching models to stop at the right time, apps feel snappier and cheaper to run, especially on phones and edge devices. Shorter, steadier reasoning is also easier to read and audit, which supports trust and safety. Cloud providers can serve more users with the same hardware by cutting redundant tokens. And as models take on planning and tool use, learning to "think just enough" will be key for practical assistants that feel both smart and efficient.
Detailed Explanation
01 Background & Problem Definition
You know how when you do homework, sometimes you keep re-checking the answer even after you're already sure it's right? That extra checking doesn't always help, and it wastes time you could spend on other problems.
The World Before:
- Large Reasoning Models (LRMs) got really good at math and logic by writing long step-by-step Chains of Thought (CoTs). Longer CoTs gave them room to explore, fix mistakes, and show their work. This helped them shine on hard tests like AIME, OlympiadBench, and IMO-style problems.
Hook: You know how a student sometimes writes a whole page to show they understand, even when a few clear lines would be enough? The Concept (Chains of Thought): A chain of thought is the model's step-by-step explanation as it solves a problem.
- How it works: (1) The model writes a thought step, (2) checks what to write next, (3) keeps going until it believes it's done, (4) gives the final answer.
- Why it matters: Without CoTs, the model might jump to conclusions and make logic leaps; with CoTs, it can reason carefully. Anchor: When solving 12 × 13, the model might write: 12 × 10 = 120, 12 × 3 = 36, 120 + 36 = 156.
The Problem:
- People noticed something odd: making CoTs longer didn't always make answers better. In fact, longer responses were often less accurate. And many outputs were bloated, full of extra tokens that didn't help reach the final answer.
Hook: Imagine carrying a huge backpack full of snacks for a 10-minute walk; you'll be slower for no good reason. The Concept (Overthinking/Redundancy): Overthinking is when the model adds steps that don't help solve the problem.
- How it works: (1) The model keeps writing thoughts, (2) but those thoughts repeat or wander, (3) the correct answer may already be present, (4) the model continues anyway.
- Why it matters: Extra steps burn time and money; worse, they can distract and cause new mistakes. Anchor: In a tie-prices problem, the model found $800 quickly but kept writing 400+ more tokens repeating checks.
Failed Attempts:
- Greedy or random sampling (pass@1) often locks in one path and rides it too long.
- Even when sampling many answers (pass@k), the very best short solution is in there somewhere, but we don't reliably pick it.
Hook: Picture picking just the first cookie you see from a mixed box, even if the perfect cookie is two spots away. The Concept (pass@1 vs pass@k): pass@1 means take the single top sample; pass@k means try k different samples and see if any are correct.
- How it works: pass@1: pick one completion. pass@k: generate multiple; a correct one anywhere counts.
- Why it matters: pass@1 can miss a great short solution; pass@k finds it but doesn't teach the model to choose it by default. Anchor: If you sample 8 solutions, a sharp, short solution often appears, but the one shown to the user (pass@1) might be a longer, worse one.
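The distinction can be sketched in a few lines of Python; the `is_correct` grader here is a hypothetical stand-in for whatever answer checker a benchmark uses:

```python
def pass_at_1(samples, is_correct):
    """pass@1: grade only the single solution actually shown to the user."""
    return is_correct(samples[0])

def pass_at_k(samples, is_correct, k):
    """pass@k: success if any of the first k sampled solutions is correct."""
    return any(is_correct(s) for s in samples[:k])

# Toy illustration: 8 sampled solutions; the only correct one is buried
# at position 5, so pass@1 misses it while pass@8 finds it.
samples = ["wrong"] * 8
samples[5] = "right"
check = lambda s: s == "right"

print(pass_at_1(samples, check))     # False
print(pass_at_k(samples, check, 8))  # True
```

This is exactly the gap the paper targets: the correct short solution exists among the k samples, but nothing steers the single shown sample toward it.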
The Gap:
- The model seems to "know" when to stop: its internal confidence spikes when it's on a good path and ready to end. But next-token-focused sampling hides that sense and doesn't reliably pick the short, right paths.
Hook: Imagine you can feel you're done with a puzzle, but your pencil has a rule: never stop until the page is full. The Concept (RLVR): RL with verifiable rewards gives a point for a correct final answer (and zero for wrong), guiding learning without a separate reward model.
- How it works: (1) Generate several solutions, (2) auto-check correctness, (3) reward correct ones, (4) update the model.
- Why it matters: It scales reasoning skill, but can unintentionally reward length because longer tries sometimes raise the chance a correct step appears. Anchor: If a system grades only the final answer, the model might keep writing until it hits the correct one, even if it was already there earlier.
Real Stakes:
- Latency: Users wait longer for answers.
- Cost and Energy: More tokens cost more compute and power.
- Reliability: Longer chains can drift off-track.
- Product UX: Tutors, coding helpers, and math solvers feel sluggish.
- Accessibility: On-device models need to be fast and frugal.
In short, the field needed a way to help models stop at "just enough" thinking: keeping accuracy while cutting waste.
02 Core Idea
The "Aha!" in one sentence: The model already knows when to stop; SAGE simply listens to the model's own confidence to end at the right moment and pick the short, correct chain.
Three Analogies:
- Cooking Soup: Taste as you add salt. Stop when it tastes right. Don't keep salting for another 10 minutes; you'll ruin it. SAGE is the taste test that says, "Done now."
- Maze Flags: As you walk a maze, you plant flags where the route feels solid. When your confidence peaks and you see the exit, you stop. SAGE follows the most confident flagged path and exits right away.
- Schoolwork: You check your answer once. If it's clearly correct, you stop and turn it in. No need to rewrite the whole page. SAGE turns in the great short answer instead of the long-winded one.
Before vs After:
- Before: Greedy next-token sampling (pass@1) stretched thoughts, often missing the short-best solution hiding in pass@k.
- After: SAGE uses whole-path confidence to favor compact, high-quality reasoning that naturally ends right when the model feels done.
- Training impact: Mixing SAGE into RL (SAGE-RL) teaches the model to prefer these concise, effective patterns even in standard pass@1.
Hook: You know how a coach trusts not just a player's last move, but their whole play so far to judge if the game plan is working? The Concept (Cumulative Confidence Score): Instead of judging only the next token, SAGE scores an entire partial path by averaging its log-probabilities (how confident the model was across the whole path).
- How it works: (1) Build several candidate paths, (2) compute the average confidence over the tokens so far, (3) keep the highest-confidence paths, (4) if a path ends with the special stop-thinking token, accept it.
- Why it matters: Judging whole paths avoids being tricked by one flashy next move; it tracks steady, trustworthy reasoning. Anchor: A solution that's consistently confident and then says "I'm done" beats a wobbly path that flips between guesses.
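As a minimal sketch, the whole-path score is just a mean of token log-probabilities; the numbers below are made-up toys contrasting a steadily confident path with one that only spikes on its final token:

```python
import math

def path_confidence(token_logprobs):
    """Cumulative (whole-path) confidence: the mean log-probability over
    every token generated so far. Dividing by length keeps candidate
    paths of different lengths comparable."""
    return sum(token_logprobs) / len(token_logprobs)

# A consistently confident path vs. a wobbly one with a flashy last step.
steady = [math.log(0.9)] * 6
wobbly = [math.log(0.3)] * 5 + [math.log(0.99)]

print(path_confidence(steady) > path_confidence(wobbly))  # True
```

Scoring only the next token would favor the wobbly path at its final step (0.99 vs 0.9); averaging over the whole path is what lets the steady one win.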
Why It Works (intuition, not equations):
- Local vs Global: Next-token probability (local) can wobble; cumulative confidence (global) stabilizes judgment across steps.
- Confident Ends: When the model is on a strong path, the "stop-thinking" decision aligns with high whole-path confidence. That's why the end token ranks high under SAGE's scoring.
- Exploration Helps: If you widen the set of candidates just a bit, the best short path reliably shows up, and performance converges as you explore more.
Hook: Think of a hallway with many doors; opening a couple more doors can reveal the quick shortcut. The Concept (Exploration Width): Keep the top m candidate reasoning paths and expand them step by step.
- How it works: (1) Start from the prompt, (2) sample a few next steps, (3) score whole paths, (4) keep the best m, repeat, (5) stop when a path confidently ends.
- Why it matters: Tiny exploration is enough to catch short, sharp solutions that greedy decoding often misses. Anchor: Trying two or four alternate steps is like checking a second method in math class; often the cleaner one pops out.
Building Blocks:
- Chains of Thought: the step-by-step reasoning.
- A stop token (</think>): the model's way to say "I'm done thinking."
- Whole-path confidence: a stable, average log-prob score over tokens.
- Step-wise sampling (SAGE): generate whole steps at once, then choose.
- SAGE-RL: in each training group, a couple of samples come from SAGE, so the policy learns to prefer concise, correct chains.
Put together, the big idea is simple: let the model show you the strong, short path it already believes in, then teach it to choose that path by default.
03 Methodology
At a high level: Input question → Explore a few reasoning paths step by step → Score each path by whole-path confidence → Stop as soon as a path confidently ends → Greedily produce the final answer from that path → Output.
Step 1: Prepare the prompt and "thinking mode"
- What happens: We present the problem and let the model enter its chain-of-thought mode, where it can write reasoning steps.
- Why this step exists: Without a space to reason, the model might jump to an answer without exploring options.
- Example: "John buys ties…" The model starts listing steps to compute red and blue tie costs.
Hook: Imagine laying out clean paper before doing math; it's the space to think. The Concept (Reasoning Steps): A reasoning step is a chunk of thoughts before a line break.
- How it works: (1) Write a step, (2) pause, (3) decide to continue or stop.
- Why it matters: Grouping by steps lets us sample and compare full thoughts instead of single tokens. Anchor: Like writing one paragraph of your explanation, then checking if you need another.
Step 2: Step-wise exploration (SAGE)
- What happens: For each partial path, sample a few complete next steps in parallel (exploration width m), creating multiple candidate paths.
- Why this step exists: One path can be misleading; a few alternatives reveal cleaner solutions.
- Example (ties): One candidate computes counts first; another computes prices first. Both reach $800 early, but one is shorter.
Hook: When solving a puzzle, trying two ideas beats betting everything on one. The Concept (Exploration Width m): Keep the top m paths and try about 2m new steps for each.
- How it works: (1) Sample alternatives, (2) attach each to the current path, (3) now you have many candidates, (4) pick the best m by score.
- Why it matters: A small m gives big gains; too small and you miss the gem, too big and it's slow. Anchor: Testing 2 to 4 ways to start a proof often surfaces a much tidier solution.
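The exploration loop can be sketched as below. `sample_step` and `score` are hypothetical helpers standing in for the model's step sampler and the whole-path confidence score, and the branching factor and step budget are illustrative assumptions, not the paper's exact settings:

```python
def sage_decode(prompt, sample_step, score, m=2, max_steps=64,
                stop_token="</think>"):
    """Sketch of SAGE-style step-wise exploration.
    sample_step(path) -> one candidate next reasoning step (a string)
    score(path)       -> whole-path confidence (e.g., mean log-prob)
    Keeps the top-m paths, expands each with two sampled steps per
    round, and accepts the first surviving path that ends thinking."""
    paths = [prompt]
    for _ in range(max_steps):
        candidates = []
        for path in paths:
            for _ in range(2):                # ~2m candidates per round
                candidates.append(path + sample_step(path))
        candidates.sort(key=score, reverse=True)
        paths = candidates[:m]                # keep the m most confident
        for path in paths:
            if path.endswith(stop_token):     # model signals "I'm done"
                return path
    return paths[0]                           # budget exhausted: best path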
Step 3: Score whole paths by cumulative confidence
- What happens: For each candidate path, compute the average log-probability across its tokens (a steady confidence measure). Keep the best m.
- Why this step exists: Looking only at the next token can wobble; a whole-path score captures consistent reasoning quality.
- Example: Two paths reach "$800." The one with smoother, higher confidence all along wins and is usually shorter.
Hook: A steady hand paints better than one that jerks back and forth. The Concept (Whole-path Confidence): Average of the model's token log-probabilities so far.
- How it works: (1) Sum log-probs across the path, (2) divide by length, (3) compare across candidates.
- Why it matters: Rewards paths that were confident the whole way, not just at the end. Anchor: A student who explains each step clearly is more trustworthy than one who guesses and gets lucky at the end.
Step 4: Stop when a high-confidence path ends
- What happens: If any top path ends with the model's stop-thinking token, accept it immediately.
- Why this step exists: The model is signaling, "I'm done, and I'm sure." Waiting longer risks redundancy or drift.
- Example: In the tie problem, the model hits $800 and the stop token; SAGE stops right there instead of allowing extra checks.
Hook: When the traffic light turns green and you're sure it's safe, you go; you don't keep waiting. The Concept (Stop Token): A special token like </think> means the model is finished reasoning.
- How it works: (1) Track if a candidate's latest step ends with stop, (2) if yes and path is high-confidence, accept it, (3) move to answer.
- Why it matters: Prevents "just one more paragraph" spirals. Anchor: Like placing your pencil down when you're confident you solved the problem.
Step 5: Greedy answer decoding
- What happens: Once we've accepted a reasoning chain, we ask the model to write the final answer succinctly, greedily.
- Why this step exists: Avoids reopening exploration; we just want the clean finish.
- Example: The model writes "800."
Step 6: Output
- Return the concise reasoning and the answer.
The Secret Sauce:
- Use step-wise exploration (not token-by-token) to compare whole thoughts.
- Rank by whole-path confidence so steady, strong reasoning bubbles up.
- Trust the model's own "I'm done" signal to stop early.
- In training (SAGE-RL), mix a couple of SAGE samples into each group so the model learns these efficient patterns automatically.
Hook: Think of a class where two students demonstrate quick, clean solutions; everyone learns to be concise from those examples. The Concept (SAGE-RL: Mixing into RLVR): During RL, for each group of sampled answers, a few come from SAGE; the rest are normal. The verifiable reward then favors the concise, correct ones.
- How it works: (1) Sample G answers, (2) r come from SAGE, G-r from standard sampling, (3) grade them by correctness, (4) update the model.
- Why it matters: No fancy reward tricks; the rollout alone nudges the model to internalize concise reasoning. Anchor: In an 8-sample group, if 2 SAGE samples often win, the model shifts toward their style over time.
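A minimal sketch of the rollout mixing, assuming hypothetical `sage_sample`, `standard_sample`, and `grade` helpers (the grader is the verifiable reward: 1 for a correct final answer, 0 otherwise). The graded group then feeds an otherwise unchanged group-based update such as GRPO:

```python
def build_rollout_group(prompt, sage_sample, standard_sample, grade,
                        G=8, r=2):
    """Build one SAGE-RL training group: r rollouts come from SAGE
    decoding, the remaining G - r from standard sampling, and each is
    graded by the verifiable reward."""
    rollouts = [sage_sample(prompt) for _ in range(r)]
    rollouts += [standard_sample(prompt) for _ in range(G - r)]
    rewards = [grade(rollout) for rollout in rollouts]
    return list(zip(rollouts, rewards))
```

Because the SAGE rollouts tend to be both correct and short, they earn reward and pull the group-relative advantage toward the concise style without any change to the reward function itself.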
What breaks without each step:
- No step-wise exploration: you miss better short paths.
- No whole-path scoring: you may latch onto wobbly next tokens.
- No early stop on confidence: you bloat outputs.
- No SAGE-RL: the model won't learn to prefer concise reasoning by default.
04 Experiments & Results
The Test: What did they measure and why?
- Accuracy (pass@1): Does the single sampled solution get it right?
- Response length: How many tokens does the model spend thinking?
- Token efficiency: Accuracy per token (higher is better). This ties correctness to cost/latency.
Hook: You know how a race judges both speed and finishing place? Here we judge getting the right answer and how quickly you got there. The Concept (Token Efficiency): A combined measure showing how much accuracy you get for each token spent.
- How it works: token efficiency = pass@1 / length.
- Why it matters: Encourages being right and brief, not just verbose. Anchor: Two students both score 90%; the one who finishes with half the writing is more efficient.
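Under that definition the arithmetic is simple; the numbers below are made-up illustrations, not results from the paper:

```python
def token_efficiency(pass_at_1, avg_tokens):
    """Token efficiency = accuracy per token spent (higher is better)."""
    return pass_at_1 / avg_tokens

# Same accuracy at half the length doubles the efficiency.
baseline = token_efficiency(0.90, 4000)
concise = token_efficiency(0.90, 2000)
print(concise / baseline)  # 2.0
```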
Benchmarks and Competitors:
- Datasets: MATH-500, AIME 2024, AIME 2025, AMC23, OlympiadBench, Minerva, covering a spread from easier to very hard math problems.
- Baselines: RLVR variants (GRPO, GSPO) and length-control/trimming methods (AdaptThink, Efficient-Reasoning, ThinkPrune), plus standard sampling.
Scoreboard With Context:
- SAGE alone (inference-time) finds shorter paths that end confidently. As you widen exploration a bit, accuracy improves while length shrinks, which is rare and valuable.
- SAGE-RL (training-time) injects just 2 SAGE rollouts per 8-sample group, yet consistently boosts pass@1 while cutting tokens across models and datasets. Think of it as getting an A on tests with fewer pages of work.
- On multiple benchmarks, SAGE-RL often gains 1 to 6 percentage points in pass@1 while reducing tokens by roughly 30-60%, sometimes nearly doubling token efficiency. Compared to methods that compress length but sacrifice accuracy, SAGE-RL improves both at once.
- For stronger models and harder datasets (e.g., DeepScaleR on AIME/Minerva), SAGE-RL leans into performance gains. For smaller models on easier sets (e.g., DS-1.5B on MATH-500), it slashes redundancy and latency.
Surprising Findings:
- Shorter Can Be Better: In many problems, the shortest correct chain beats longer ones sampled randomly; a clean mind wins.
- Confident Ends: Under whole-path confidence, the model's stop token ranks very high when it appears, showing the model knows when it's done.
- Convergence With Exploration: As you explore a little more, accuracy rises and length falls toward a stable sweet spot, showing the model's built-in sense of efficient reasoning.
- Training Dynamics: With SAGE-RL, entropy drops (the model becomes more decisive), KL divergence rises (it moves away from its old habits), and lengths shrink: all signs it is learning the concise, correct style.
Takeaway: SAGE helps at inference time; SAGE-RL goes further by teaching the model to be concise-and-correct by default.
05 Discussion & Limitations
Limitations:
- Exploration Cost: SAGE tries a few step alternatives; with large exploration width m, inference can slow down on limited hardware.
- Hyperparameter Sensitivity: Picking m (how many paths) and r (how many SAGE rollouts per group) matters; too small and you miss gains, too large and you pay in time.
- Assumes Think-and-Stop Signals: Best results come when the model uses clear reasoning steps and a stop-thinking token. Models without such formatting may need light prompting.
- Domain Transfer: Evidence is strongest on math/code-like tasks; other domains (long essays, dialog nuance) need more testing.
- Overconfidence Risk: If a model is confidently wrong, stopping early wonāt fix correctness; verifiable rewards help, but robust checking still matters.
Required Resources:
- For training with SAGE-RL, a standard RLVR setup with group sampling (e.g., G=8) and modest GPUs is sufficient.
- For inference with SAGE (no training), use a serving stack that can parallelize a few candidate steps.
When NOT to Use:
- Very short, trivial queries where CoT isnāt needed; overhead may outweigh benefits.
- Ultra low-latency edge cases where even small exploration is too costly.
- Models that cannot or should not reveal internal thinking steps (policy or privacy constraints).
Open Questions:
- Theory: What precise link ties whole-path confidence to correctness and optimal stopping?
- Budget Adaptation: Can SAGE auto-tune exploration to a fixed token/time budget per query?
- Safety: How to detect and avoid confidently wrong early stops without adding heavy verification cost?
- Generalization: How well does this extend to multi-hop QA, planning, tool-use, and non-math reasoning?
- Teaching Without RL: Can we distill SAGE-found short chains via SFT/contrastive methods while preserving performance?
06 Conclusion & Future Work
Three-sentence summary:
- The paper shows that large reasoning models already know when to stop thinking, but common sampling hides this ability and causes overthinking.
- SAGE listens to the model's own whole-path confidence and stops right when the model is sure, uncovering short, precise chains that improve accuracy and cut tokens.
- SAGE-RL mixes a little SAGE into RLVR training so models learn these efficient patterns and deliver concise, correct answers by default.
Main Achievement:
- A simple, training-free decoding (SAGE) plus a minimal training tweak (SAGE-RL) that together increase pass@1 while substantially reducing response length across tough math benchmarks.
Future Directions:
- Auto-adjust exploration to fit strict latency or token budgets per task.
- Extend to planning/tool-use tasks and code synthesis, with verifiable intermediate checks.
- Combine with lightweight verification to safely detect rare confidently-wrong early stops.
Why Remember This:
- It reframes "think longer" as "think just enough." By trusting the model's own confidence over entire paths, and teaching it to prefer concise thinking, we get answers that are both right and fast, saving time, money, and energy while improving user experience.
Practical Applications
- Enable SAGE decoding in your math or coding assistant to cut latency while maintaining accuracy.
- Add SAGE-RL to your RLVR training loop with a small group mix (e.g., 2 SAGE rollouts in G=8) to learn concise reasoning.
- Start with exploration width m=2 for strong efficiency gains without large overhead.
- Track token efficiency (pass@1 per token) during evaluation to balance accuracy and cost.
- Measure RFCS (first-correct-step ratio) to quantify and reduce overthinking in your model outputs.
- Set a reasonable max step budget and trust early stop signals to prevent runaway chains.
- Use SAGE preferentially on harder problems where it boosts accuracy; use smaller m on easy tasks for speed.
- Combine SAGE with lightweight verification for safety-critical deployments to catch rare confident mistakes.
- Adopt concise reasoning prompts (clear steps, explicit stop cues) so SAGE can detect ends reliably.
- Deploy on-device or mobile assistants with SAGE to deliver faster, more battery-friendly results.