Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Key Summary
- This paper turns messy chains of thought from language models into clear, named steps so we can see how they really think through math problems.
- It uses a classic idea from human math education called Schoenfeld's Episode Theory and adds an extra 'Answer' step to fit AI.
- The authors build a scalable tool, ThinkARM, that labels each sentence of a model's reasoning as Read, Analyze, Plan, Implement, Explore, Verify, Monitor, or Answer.
- Across many models, a repeatable 'heartbeat' appears: plan first, do in the middle, check at the end.
- Reasoning models aren't just longer; they spend time exploring and checking, unlike non-reasoning models that mostly just execute steps.
- Exploration is a crucial fork: correct solutions tend to follow Explore with Monitor or Analyze, while wrong ones rush into more doing or stop too early.
- Speed-up methods don't just shorten text; they often cut out checking and feedback loops, which explains accuracy trade-offs.
- A human-verified gold set and an automatic annotator (GPT-5 in this study) make the labeling reliable at large scale.
- These episode labels let us diagnose patterns linked to correctness and efficiency, going beyond simple accuracy or token counts.
Why This Research Matters
This work gives us a practical way to look inside an AI's thinking, not just at its final answers. By naming and measuring thinking steps, we can spot when a model explores wisely or when it skips crucial checks. That means better tutoring systems that teach good habits, safer assistants that verify claims, and faster models that don't throw away the parts that keep them accurate. Teams can debug and compare models fairly, focusing on reasoning quality rather than length. And as we deploy AI into important fields like coding, science, and healthcare, structured thinking patterns become a safety feature, not just an academic curiosity.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine watching a friend solve a tricky puzzle. They first read the rules, think about strategies, try something, and then check if it worked. You can follow their moves because they happen in stages.
🥬 The Concept (Mathematical Problem Solving): It's the journey from seeing a question to getting an answer, with steps like understanding, planning, doing, and checking.
- How it works: 1) Understand the problem, 2) Think about ideas, 3) Pick a plan, 4) Do the steps, 5) Check and fix, 6) Share the answer.
- Why it matters: If we can't see which step broke, we can't fix thinking mistakes. 🍞 Anchor: In a word problem, you first spot what's asked, choose a formula, compute, and then see if the number makes sense.
🍞 Hook: You know how coaches review game film to see not just the final score, but how plays unfolded?
🥬 The Concept (Schoenfeld's Episode Theory): It's a way researchers described the stages people go through while solving math problems, like scenes in a movie of thinking.
- How it works: They listened to many "think-alouds" and labeled moments as Read, Analyze, Plan, Implement, Explore, Verify, and later Monitor.
- Why it matters: Without shared names for these steps, everyone talks past each other about what "good reasoning" looks like. 🍞 Anchor: A student reads the question (Read), recalls relevant rules (Analyze), decides to try substitution (Plan), does the algebra (Implement), toys with a different idea (Explore), checks work (Verify), says "wait, let me think" (Monitor), then states the answer.
🍞 Hook: Think of a cooking show where every motion (chop, stir, taste) is labeled so you can learn the craft, not just copy the recipe.
🥬 The Concept (Cognitive Episodes): These are the named thinking steps (Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer) that break down a solution into understandable parts.
- How it works: Label each sentence of a solution by what it's doing functionally.
- Why it matters: If a model always skips Verify, it may be fast but fragile. 🍞 Anchor: "Let's compute the derivative" is Implement; "Is this correct?" is Verify.
The world before: Language models got much better at math when they wrote out chains of thought. But judging them mostly by the final answer, length, or token counts is like scoring a basketball team by points only: no assists, defense, or turnovers. People observed "overthinking" too: models sometimes wrote a lot without getting more correct, but we couldn't say which parts were helpful or harmful.
The problem: We didn't have a clear, shared way to mark which parts of a model's text were understanding, exploring, executing, or checking. So we couldn't compare models' inner styles or explain where they differ beyond "one is longer."
Failed attempts: Token-level stats (like how many words) and end-to-end accuracy miss structure. Paragraph labels are too coarse. Hand-labeling everything doesn't scale.
🍞 Hook: Imagine sorting a big toy box. If you only count how many toys you have, you still won't know if you're missing wheels or a battery.
🥬 The Concept (Episode-level Annotation): It's tagging each sentence with the thinking step it belongs to, so you see the structure, not just the size.
- How it works: Use clear definitions and examples; label sentence-by-sentence.
- Why it matters: Without these tags, we can't tell if a model is exploring wisely or just wandering. 🍞 Anchor: "Maybe try factoring" is Explore; "Next, I'll factor" is Plan; "(x-3)(x+2)=0" is Implement.
The gap: We needed a middle view (bigger than tokens, smaller than whole solutions) to see the "shape" of reasoning. Also, we needed it to work at scale, across many models, not just one.
Real stakes: In daily life, we want AIs that aren't just good at answers but good at thinking: catching mistakes, trying alternatives, and knowing when to stop. That's important for tutoring, coding, science, and safety. If we can see the steps, we can train better habits, not just longer texts.
🍞 Hook: Think of a map app that not only shows your destination but also your path and reroutes.
🥬 The Concept (ThinkARM): It's a tool that labels every sentence in a model's reasoning with its cognitive episode so we can analyze patterns across many models.
- How it works: Build a gold set labeled by humans, pick the best automatic labeler, then annotate hundreds of thousands of sentences.
- Why it matters: Without a scalable tool, we stay stuck with anecdotes instead of evidence. 🍞 Anchor: Across 15 models and 410,991 sentences, ThinkARM shows who plans carefully, who explores too long, and who actually checks.
Finally, to fit how AIs present final results, the authors add one more episode, Answer, so we can see exactly when a model commits to the solution, separate from when it checks.
02 Core Idea
🍞 Hook: You know how a heart monitor shows a repeating rhythm (start, beat, recover) so doctors can tell if everything is healthy?
🥬 The Concept (Aha!): If we label each sentence of an AI's reasoning with human-style thinking steps, a clear, repeatable "heartbeat" pattern appears, and that lets us compare, diagnose, and improve how models think, not just how long they talk.
- How it works: 1) Define eight episodes (Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer). 2) Label each sentence with one episode. 3) Track how episodes ebb and flow over time and how they transition. 4) Compare patterns across models and methods.
- Why it matters: Without structure, we can't tell good exploration from waste, or careful checking from careless rushing. 🍞 Anchor: In many models, early Analyze/Plan, middle Implement, late Verify/Monitor, then Answer: a reliable thinking rhythm.
Three analogies:
- Orchestra: Strings (Analyze/Plan) set the theme, brass (Implement) carries the melody, percussion (Verify/Monitor) tightens the ending, and the final chord is Answer.
- Cooking: Read the recipe (Read), understand techniques (Analyze), pick a method (Plan), cook (Implement), taste and adjust (Verify/Monitor), plate and serve (Answer).
- Sports: Study the play (Read/Analyze), call the play (Plan), run the play (Implement), review and adjust (Verify/Monitor), score (Answer).
Before vs After:
- Before: We judged models mainly by accuracy and length. Exploration looked like "extra words."
- After: We see that where a model spends time (e.g., Explore vs Verify) and how it loops (Explore→Monitor) predict success better than length alone. Non-reasoning models mostly "do"; reasoning models also explore and check.
🍞 Hook: Imagine a subway map of thinking. Some lines branch and rejoin; some go straight through.
🥬 The Concept (Reasoning Dynamics / Cognitive Heartbeat): It's the time pattern of episodes across a solution: scaffold early, execute mid, converge late.
- How it works: Normalize each solution onto a 0–100% progress scale; measure how often each episode appears at each slice.
- Why it matters: If a model's "check" episodes don't rise near the end, it's more likely to present unchecked answers. 🍞 Anchor: Verify and Monitor ramp up near the finish line in strong reasoning models.
🍞 Hook: Think of traffic cameras that also track direction changes, not just car counts.
🥬 The Concept (Transition Patterns): These are the step-to-step jumps (like Explore→Monitor) that reveal behavior, beyond how much of each step occurs.
- How it works: Turn each labeled solution into a sequence, count n-grams (short patterns), and see which patterns distinguish groups (e.g., reasoning vs non-reasoning).
- Why it matters: Two models can have the same amount of Explore, but one might follow it with Monitor (safer) while the other rushes to Implement (riskier). 🍞 Anchor: Explore→Monitor and Monitor→Analyze show up more in correct traces; Explore→Verify too early is a risk sign.
Building blocks (the eight episodes), each with a mini-sandwich; a small code sketch of the label set follows the list:
- 🍞 Hook: Like reading game rules. 🥬 Read: State the given facts and goal; first, repeat what's asked; without it, you may solve the wrong problem. 🍞 Anchor: "We are asked to find x."
- 🍞 Hook: Like figuring out which rules matter. 🥬 Analyze: Connect concepts and infer relationships; recall definitions, note properties; without it, steps won't fit together. 🍞 Anchor: "Because the triangle is right, a^2+b^2=c^2 applies."
- 🍞 Hook: Like choosing a route. 🥬 Plan: Announce the next strategy; decide "what to try first"; without it, you may wander. 🍞 Anchor: "Next, I'll factor the equation."
- 🍞 Hook: Like doing the moves. 🥬 Implement: Carry out calculations; substitute, expand, simplify; without it, you never get numbers. 🍞 Anchor: "(x-3)(x+2)=0 ⇒ x=3 or x=-2."
- 🍞 Hook: Like brainstorming plays. 🥬 Explore: Try possibilities without commitment; ask "maybe," "what if"; without it, you can miss a better path. 🍞 Anchor: "Maybe try symmetry instead."
- 🍞 Hook: Like checking the scoreboard. 🥬 Verify: Test correctness; plug back in, cross-check; without it, errors slip through. 🍞 Anchor: "Plugging x=3 back in satisfies the equation."
- 🍞 Hook: Like pausing to think. 🥬 Monitor: Short self-checks ("Wait…"); track progress and confusion; without it, you plow ahead blindly. 🍞 Anchor: "Hmm, that seems off."
- 🍞 Hook: Like announcing the final score. 🥬 Answer: Commit to the result; state it clearly; without it, no one knows you're done. 🍞 Anchor: "Therefore, the answer is 3."
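To make the vocabulary concrete, here is a minimal sketch (not the authors' code) of the eight labels as a small Python enum, together with a hypothetical labeled trace; the example sentences are illustrative only.

```python
from enum import Enum

class Episode(str, Enum):
    """The eight Schoenfeld-style episode labels used in this summary."""
    READ = "Read"            # restate the givens and the goal
    ANALYZE = "Analyze"      # connect concepts, infer relationships
    PLAN = "Plan"            # announce the next strategy
    IMPLEMENT = "Implement"  # carry out the calculations
    EXPLORE = "Explore"      # try possibilities without commitment
    VERIFY = "Verify"        # test the correctness of prior work
    MONITOR = "Monitor"      # short self-checks ("Wait...", "Hmm")
    ANSWER = "Answer"        # commit to the final result

# Illustrative (hypothetical) labeled trace: one episode per sentence.
example_trace = [
    ("We are asked to find x.", Episode.READ),
    ("Maybe factoring will work here.", Episode.EXPLORE),
    ("Next, I'll factor the equation.", Episode.PLAN),
    ("(x-3)(x+2)=0, so x=3 or x=-2.", Episode.IMPLEMENT),
    ("Plugging x=3 back in satisfies the equation.", Episode.VERIFY),
    ("Therefore, the answer is 3.", Episode.ANSWER),
]

for sentence, label in example_trace:
    print(f"{label.value:>9}: {sentence}")
```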
Why it works (intuition, no math): Human problem solving has structure, and LLM chains of thought echo that structure. When we label sentences with these human-friendly steps, stable patterns pop out across models. Those patterns explain differences in accuracy and efficiency more meaningfully than raw length. The core insight is that thinking quality lives in which steps appear, when they appear, and how they connect, not just in how much text is produced.
03 Methodology
High-level recipe: Input (math problems and model responses) → Episode labeling (sentence by sentence) → Pattern analysis (time trends, allocations, transitions) → Diagnostics (correctness and efficiency case studies) → Output (insights on reasoning structure).
Step A: Curate problems and collect reasoning traces
- What happens: The authors pick 100 diverse problems from Omni-MATH and gather responses from 15 models, totaling 410,991 sentences (some models provide hidden "thinking," others only final answers).
- Why this step exists: To compare behaviors across families, you need the same tasks and lots of examples.
- Example: DeepSeek-R1 produces full chains of thought; standard instruction followers output just the final solution text.
🍞 Hook: Like having a teacher's answer key to check graders. 🥬 The Concept (Gold Set): A small, carefully human-labeled set used to pick and check the automatic annotator.
- How it works: Humans label 7,067 sentences across 9 problems with the eight episodes.
- Why it matters: Without a trustworthy yardstick, automation can drift. 🍞 Anchor: If the auto-labeler agrees strongly with the gold set, we trust it on the big dataset.
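As a rough illustration of this agreement check, the sketch below scores a candidate annotator against gold labels with accuracy and Cohen's kappa using scikit-learn; the label lists are toy placeholders, not the paper's data.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy placeholder labels: human gold labels vs. an automatic annotator's labels
# for the same sentences (the real gold set has thousands of rows).
gold = ["Read", "Analyze", "Plan", "Implement", "Implement", "Verify", "Answer"]
auto = ["Read", "Analyze", "Plan", "Implement", "Explore", "Verify", "Answer"]

acc = accuracy_score(gold, auto)
kappa = cohen_kappa_score(gold, auto)  # agreement corrected for chance

print(f"accuracy = {acc:.2f}, Cohen's kappa = {kappa:.2f}")
# Strong agreement on the gold set is the evidence for trusting the annotator at scale.
```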
Step B: Choose and configure the automatic annotator
- What happens: Several strong LLMs are tested as annotators on the gold set; GPT-5 shows the best agreement (high accuracy and kappa) and is selected.
- Why this step exists: Hand-labeling hundreds of thousands of sentences isn't feasible.
- Example: The annotator also writes a short justification for each label to improve reliability and traceability.
🍞 Hook: Like using sticky notes to tag parts of a book. 🥬 The Concept (Episode-level Annotation): Label each sentence as Read, Analyze, Plan, Implement, Explore, Verify, Monitor, or Answer using a detailed guidebook and context.
- How it works: Segment responses into sentences; process in batches; provide the problem, prior labeled context, definitions, and format; collect labels and rationales.
- Why it matters: Sentence-level granularity reveals fine structure and supports robust statistics (allocations and transitions). 🍞 Anchor: "Let's try substitution" gets Explore; the next line "Substitute x=2" gets Implement.
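Here is a minimal sketch of such a sentence-by-sentence labeling loop. The `call_annotator` function is a hypothetical stand-in for whatever LLM API is used, and the prompt wording and batch size are illustrative, not the paper's exact guidebook.

```python
import json

EPISODES = ["Read", "Analyze", "Plan", "Implement",
            "Explore", "Verify", "Monitor", "Answer"]

def call_annotator(prompt: str, n: int) -> str:
    """Hypothetical stand-in for an LLM annotator call; stubbed so the sketch runs."""
    return json.dumps([{"label": "Implement", "why": "stub"} for _ in range(n)])

def annotate(problem: str, sentences: list[str], batch_size: int = 8) -> list[dict]:
    labeled: list[dict] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        prompt = (
            "Label each sentence of this math solution with exactly one episode: "
            + ", ".join(EPISODES) + ".\n"
            + f"Problem: {problem}\n"
            # Previously labeled context is passed back in so labels stay consistent.
            + f"Context so far: {json.dumps(labeled[-10:])}\n"
            + f"Sentences: {json.dumps(batch)}\n"
            + 'Reply as JSON: [{"label": ..., "why": ...}, ...]'
        )
        for sentence, item in zip(batch, json.loads(call_annotator(prompt, len(batch)))):
            labeled.append({"sentence": sentence, **item})
    return labeled

print(annotate("Solve x^2 - x - 6 = 0.",
               ["Factor the quadratic.", "(x-3)(x+2)=0, so x=3 or x=-2."]))
```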
Step C: Quality and scalability mechanics
- What happens: The guidebook defines each label with examples and pitfalls; the annotator sees previous context to avoid inconsistent tags.
- Why this step exists: Consistency over long, varied texts requires clear rules and memory.
- Example: Distinguishing Analyze ("because it's a right triangle…") from Implement ("compute a^2+b^2").
Step D: Temporal dynamics ("cognitive heartbeat")
- What happens: Normalize each response to a 0–100% progress scale; measure how often each episode appears at each slice.
- Why this step exists: To compare short and long solutions fairly and reveal early/middle/late phases.
- Example: Implement peaks mid-solution; Verify grows toward the end; Monitor forms a U-shape (early and late).
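A minimal sketch of this normalization, assuming each trace is already a list of per-sentence episode labels; the bin count and toy traces are illustrative.

```python
from collections import Counter

EPISODES = ["Read", "Analyze", "Plan", "Implement",
            "Explore", "Verify", "Monitor", "Answer"]

def heartbeat(traces: list[list[str]], n_bins: int = 10) -> list[dict[str, float]]:
    """For each progress bin (0-100% split into n_bins), return the share of
    sentences in that bin belonging to each episode."""
    counts = [Counter() for _ in range(n_bins)]
    for trace in traces:
        for i, label in enumerate(trace):
            b = min(int(i / len(trace) * n_bins), n_bins - 1)  # position -> bin
            counts[b][label] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({ep: c[ep] / total for ep in EPISODES} if total else {})
    return result

# Toy traces (hypothetical): scaffold early, implement in the middle, check late.
toy = [["Read", "Plan", "Implement", "Implement", "Verify", "Answer"],
       ["Read", "Analyze", "Implement", "Monitor", "Verify", "Answer"]]
for b, dist in enumerate(heartbeat(toy, n_bins=3)):
    dominant = max(dist, key=dist.get) if dist else "-"
    print(f"bin {b}: dominant episode = {dominant}")
```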
🍞 Hook: Like pie charts for how time is spent. 🥬 The Concept (Episode Allocation): The fraction of tokens devoted to each episode across a response or model.
- How it works: Count tokens per episode and compare distributions across models.
- Why it matters: Shows where a model invests effort: analysis vs execution vs checking. 🍞 Anchor: Non-reasoning models are heavily weighted toward Implement; reasoning models allocate more to Analyze, Explore, and Verify.
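A small sketch of computing such an allocation, using a whitespace token count as a stand-in for real tokenization; the labeled response is hypothetical.

```python
from collections import Counter

def episode_allocation(labeled: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of (whitespace) tokens per episode for one labeled response.
    `labeled` is a list of (sentence, episode) pairs."""
    tokens = Counter()
    for sentence, episode in labeled:
        tokens[episode] += len(sentence.split())
    total = sum(tokens.values())
    return {ep: n / total for ep, n in tokens.items()}

# Hypothetical labeled response.
resp = [("We are asked to find x.", "Read"),
        ("Factor the quadratic and solve each factor.", "Implement"),
        ("Check: x=3 satisfies the original equation.", "Verify"),
        ("Therefore, the answer is 3.", "Answer")]
print(episode_allocation(resp))
# Comparing these distributions across models is what separates Implement-heavy
# profiles from more balanced Analyze/Explore/Verify profiles.
```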
🍞 Hook: Like watching dance steps, not just how long the song is. 🥬 The Concept (Transition Patterns): Short episode sequences (like Explore→Monitor) that reveal how thinking moves.
- How it works: Convert labels into strings and compute which n-grams best distinguish groups using an information measure; inspect which transitions are common in correct vs incorrect solutions.
- Why it matters: Flow matters: Explore followed by Monitor/Analyze is safer than leaping to Implement. 🍞 Anchor: Explore→Monitor and Monitor→Analyze are positive signals for correctness; Explore→Verify too soon is risky.
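Counting the transitions themselves is simple; a minimal sketch is below (the information-based scoring that ranks which n-grams best separate groups is omitted).

```python
from collections import Counter

def transition_counts(trace: list[str]) -> Counter:
    """Count adjacent episode pairs (bigrams), e.g. ('Explore', 'Monitor')."""
    return Counter(zip(trace, trace[1:]))

correct = ["Read", "Analyze", "Explore", "Monitor", "Analyze", "Implement", "Verify", "Answer"]
wrong = ["Read", "Explore", "Implement", "Implement", "Answer"]

print("correct:", transition_counts(correct).most_common(3))
print("wrong:  ", transition_counts(wrong).most_common(3))
# Aggregating such counts over many traces and scoring which patterns best
# separate groups (the paper uses an information-based measure) surfaces
# signals like Explore->Monitor in correct solutions.
```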
Step E: Correctness diagnostics
- What happens: Build a simple "scorecard" model using global stats (lengths), allocations (ratios of each episode), and transitions (8×8 counts) to predict correctness.
- Why this step exists: To see which structural features link to getting the right answer.
- Example: Positive weights for Explore→Monitor and Monitor→Analyze; negative for a high Explore ratio overall or Implement→Read (lost and restarting).
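One way to picture such a scorecard is a linear model over structural features. The sketch below fits a logistic regression on made-up feature rows and labels, so it is not necessarily the paper's exact model and the weights are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature rows: [n_sentences, explore_ratio, verify_ratio,
#                    explore->monitor count, implement->read count]
X = np.array([
    [40, 0.10, 0.15, 3, 0],   # balanced trace with stabilized exploration
    [55, 0.12, 0.12, 4, 0],
    [70, 0.35, 0.02, 0, 2],   # lots of unstabilized exploration, restarts
    [65, 0.30, 0.03, 1, 3],
])
y = np.array([1, 1, 0, 0])    # 1 = correct, 0 = incorrect (toy labels)

clf = LogisticRegression(max_iter=1000).fit(X, y)
names = ["length", "explore_ratio", "verify_ratio", "explore->monitor", "implement->read"]
for name, w in zip(names, clf.coef_[0]):
    print(f"{name:>17}: weight {w:+.3f}")
# The sign and size of each weight show which structural features the model
# associates with correctness on this (toy) data.
```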
Step F: Efficiency case studies
- What happens: Compare a baseline reasoning model to three efficiency methods. Examine what gets cut: Are checks and feedback loops reduced, or is structure preserved?
- Why this step exists: Length-reduction methods differ; some prune evaluation, others keep it.
- Example: Some methods shrink Verify and Analyze and suppress loops like Analyze→Verify→Analyze; others better preserve the topology while still saving tokens.
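To see what an efficiency method actually cut, one can diff episode allocations (and, analogously, transition counts) between a baseline and its efficient variant; a minimal sketch with hypothetical numbers:

```python
# Hypothetical episode allocations (token fractions) for a baseline reasoning
# model and an efficiency-tuned variant of it.
baseline = {"Read": 0.05, "Analyze": 0.20, "Plan": 0.05, "Implement": 0.27,
            "Explore": 0.15, "Verify": 0.16, "Monitor": 0.07, "Answer": 0.05}
efficient = {"Read": 0.05, "Analyze": 0.12, "Plan": 0.05, "Implement": 0.45,
             "Explore": 0.12, "Verify": 0.08, "Monitor": 0.08, "Answer": 0.05}

# Per-episode change: negative numbers show what the method pruned.
for ep in baseline:
    print(f"{ep:>9}: {efficient[ep] - baseline[ep]:+.2f}")
# A large drop in Verify/Analyze (and, in a transition-count diff, fewer loops
# like Analyze->Verify->Analyze) signals behavioral drift, not just shorter text.
```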
The secret sauce:
- A human-grounded vocabulary (Schoenfeld's episodes) that maps naturally onto sentence-level LLM text.
- A scalable pipeline (gold set → best annotator with justifications → large-batch labeling) that makes episode analysis feasible at scale.
- Multi-view analysis (time trends, allocations, transitions) that turns long text into a compact, interpretable "map of thinking."
- Diagnostics that connect structure to outcomes (correctness, efficiency), giving actionable levers for training and evaluation.
What breaks without each piece:
- No gold set: labeling may drift; comparisons become unreliable.
- No sentence-level tags: you lose fine-grained transitions that predict success.
- No temporal normalization: early/mid/late patterns blur across different lengths.
- No transition analysis: two models with the same allocations but different flows look falsely similar.
- No diagnostics: you see patterns but can't say which help or hurt correctness.
04 Experiments & Results
The test: The authors wanted to know if episode labels reveal stable structures, separate reasoning from non-reasoning behavior, and explain correctness and efficiency. They used 100 Omni-MATH problems and 15 models, amassing 410,991 sentences. A 7,067-sentence gold set was used to pick the best annotator to trust at scale.
The competition: They compared (a) open reasoning models with full chains of thought, (b) instruction-following models with only final responses, and (c) proprietary reasoning models where only final answers were visible. They also compared efficient-reasoning variants to their baseline parent.
The scoreboard with context:
- Annotator choice: GPT-5 best matched humans (high accuracy and substantial kappa across both reasoning and non-reasoning traces). That's like earning the most trusted "referee" badge before judging the big game.
- Cognitive heartbeat (temporal dynamics): Across reasoning models, a three-phase rhythm emerged. Early: Read/Analyze/Plan (scaffolding) fades gradually (not just a tiny prefix). Middle: Implement peaks (most of the concrete work). Late: Verify rises steadily; Monitor is U-shaped (early "hmm," late "am I done?"); Answer spikes at the end.
- Allocation differences: Reasoning models devote significant budget to Analyze, Explore, and Verify (balanced profile). Non-reasoning models are strongly Implement-heavy, with little exploration or checking. Proprietary models that expose only final text look closer to non-reasoning profiles in the visible outputs.
- Distillation preserves structure: Smaller distilled models keep episode allocations similar to their teacher, suggesting that training transfers not only answers but also the "shape" of thinking.
- Transition patterns: Reasoning traces contain frequent Explore↔Monitor/Verify loops (e.g., Explore→Monitor, Monitor→Explore), while non-reasoning traces are more feed-forward (fewer loops). Even within a reasoning model, the final Answer segment is less loop-heavy than the thinking portion.
Surprising findings:
- Exploration is a double-edged sword. A high overall Explore ratio predicts risk, but Explore followed by Monitor or Analyze predicts success. In other words, exploring is good when it's stabilized by meta-checks or deeper analysis.
- Verification starts early, not only at the end. Correct traces include signs like Read→Verify (checking consistency from the start) and Answer→Verify (last-minute confirmation) more often than incorrect ones.
- Efficiency methods differ in what they cut. Some mostly prune evaluation loops (like Analyze→Verify→Analyze) and reduce Verify/Analyze budgets: shorter but sometimes shakier. Others preserve more of the loop structure and maintain better balance, achieving efficiency with less behavioral drift.
Meaningful numbers, as analogies:
- Annotator performance: Choosing GPT-5 over others is like picking the A student with the most consistent grading to mark everyone's essays.
- Allocation contrast: A heavy Implement share in non-reasoning models is like a team that only practices shooting, skipping scouting (Analyze) and film review (Verify). Reasoning models practice scouting, drills, and review; more balanced practice leads to more consistent play.
- Transition signals: Seeing Explore→Monitor and Monitor→Analyze more often in correct solutions is like a pilot who checks instruments after trying a new course and then rethinks the plan before proceeding.
Overall, episode-level analysis doesn't just re-describe long answers; it reveals stable, interpretable structures (the heartbeat), shows real group differences (balanced vs execute-only), and uncovers transition signatures that track correctness and the side effects of efficiency tricks.
05 Discussion & Limitations
Limitations:
- Automatic annotation, even with strong agreement, can include noise. Small labeling errors may ripple into transition counts.
- Domain focus: Results center on mathematical problem solving; other domains (law, medicine, open-ended writing) may show different episode mixes.
- Sentence granularity: Some sentences mix functions (e.g., a quick plan plus a tiny calculation). Sentence-level tags may blur such hybrids.
- Visibility bias: For models that don't reveal their full thinking, observed profiles look more like non-reasoning traces; this reflects limited visibility, not necessarily limited thinking.
- Potential gaming: If future models are trained to "look good" in episodes, they might mimic patterns without truly improving correctness.
Required resources:
- A reliable annotator model (here, GPT-5) or similar, plus a human-labeled gold set for calibration.
- Compute and storage for large-scale annotation and analysis (hundreds of thousands of sentences).
- A clear guidebook and quality controls to keep labels consistent over time.
When NOT to use:
- Tasks with ultra-short answers where episodes collapse into a couple of lines (little signal for transitions).
- Creative writing or brainstorming without a notion of "correctness," where Verify/Answer are undefined.
- Low-resource settings lacking any validated annotator or gold set.
- Microscopic token-level mechanistic studies; this is a mid-scale lens, not a neuron-level microscope.
Open questions:
- Generalization: How do episodes manifest in code generation, science reasoning, or multi-turn dialogue?
- Causality: If we nudge a model to do more Verify-after-Explore, will correctness reliably rise, or will models just write more checking words?
- Training-time use: Can we build rewards or curricula from episode signals to teach better habits (e.g., stabilize exploration)?
- Personalization: Do different problems or user types benefit from different episode rhythms (e.g., more early Analyze for geometry)?
- Faithfulness: How do episode labels relate to the model's internal computations versus just the words it prints?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces ThinkARM, a scalable way to label each sentence of an AI's math solution with human-style thinking steps (Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer). With these labels, a consistent "cognitive heartbeat" appears (plan first, do in the middle, check near the end), and meaningful differences surface between reasoning and non-reasoning models. The labels also explain correctness (exploration stabilized by monitoring/analysis is good) and reveal what efficiency methods really cut (often verification loops), moving evaluation beyond answer accuracy and length.
Main achievement: Turning long chains of thought into an explicit, comparable structure that links episode flows to outcomes, making the invisible shape of reasoning visible and measurable at scale.
Future directions:
- Extend to other domains (coding, science, multi-step planning) and multi-turn dialogues.
- Use episode signals to guide training (rewards for healthy transitions) and to build real-time monitors that flag risky flows.
- Combine with mechanistic probes to connect episode text patterns with internal model activations.
- Design benchmarks that score structure (healthy heartbeats and stabilizing loops), not just final answers.
Why remember this: It reframes "reasoning quality" as a structured process, not a word count. By naming and measuring thinking steps, we gain levers to compare models fairly, teach better habits, and choose efficiency methods that save time without cutting the safety nets.
Practical Applications
- Build dashboards that show a model's episode allocations and transitions to audit reasoning quality over time.
- Train with rewards that encourage healthy flows (e.g., Explore→Monitor→Analyze) instead of raw length bonuses.
- Design prompts that ask for Verify near the end to strengthen the convergence phase.
- Use episode diagnostics to flag risky answers (e.g., no late Verify, high unexplained Explore).
- Choose efficiency methods that preserve verification loops when accuracy is critical.
- Distill smaller models while checking that episode structure (not just answers) is preserved.
- Develop curriculum datasets that explicitly include balanced episodes (Analyze, Explore, Verify).
- Create classroom tools that compare a student's reasoning steps to model episodes for targeted feedback.
- Prioritize regression tests that catch drops in Verify/Monitor patterns after model updates.
- Filter or re-rank solutions by healthy heartbeat patterns when using multi-sample decoding.
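For the last application above, here is a minimal sketch of re-ranking sampled, already-labeled solutions by a crude "healthy structure" score; the scoring rule is illustrative, not taken from the paper.

```python
def structure_score(trace: list[str]) -> float:
    """Crude, illustrative score: reward late checking and stabilized exploration."""
    n = len(trace)
    late_check = any(lab in ("Verify", "Monitor") for lab in trace[int(0.7 * n):])
    stabilized = sum(1 for a, b in zip(trace, trace[1:])
                     if a == "Explore" and b in ("Monitor", "Analyze"))
    explore_ratio = trace.count("Explore") / n
    return 2.0 * late_check + stabilized - 1.5 * explore_ratio

# Hypothetical sampled solutions that have already been episode-labeled.
samples = {
    "sample_a": ["Read", "Explore", "Explore", "Implement", "Answer"],
    "sample_b": ["Read", "Analyze", "Explore", "Monitor", "Implement", "Verify", "Answer"],
}
scores = {name: round(structure_score(trace), 2) for name, trace in samples.items()}
best = max(scores, key=scores.get)
print(scores, "-> keep", best)
```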