Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Key Summary
- This paper shows a new way (called RISE) to find and control how AI models think without needing any human-made labels.
- They train a Sparse Auto-Encoder (SAE) on step-by-step thinking traces to discover "reasoning vectors": directions in the model's "brain" that match behaviors like reflection and backtracking.
- These discovered vectors naturally group into clear clusters, especially in the model's later layers, which means the model organizes thinking styles in understandable ways.
- By gently nudging the model along (or away from) these vectors during inference, the authors can increase or decrease specific behaviors without retraining the model.
- They also find structure for response length and identify special vectors related to confidence by optimizing for lower entropy (more certainty).
- Interventions reduce reflection steps from about 90.5 to 33.8 and backtracking steps from about 35.5 to 5.9 on AIME25 while keeping answers mostly correct.
- The discovered vectors trained on math data still work on other tasks like GPQA-Diamond and KnowLogic, showing strong generalization.
- A simple test-time steering using top confidence vectors improves accuracy by up to 4.66 points and cuts token cost by 13.69% compared to baselines like TIP and SEAL.
- The work supports the idea that abstract thinking patterns live as linear directions inside models, which we can map and steer.
- This opens a path to safer, cheaper, and more reliable AI reasoning by discovering new thinking habits we didn't even know to look for.
Why This Research Matters
This work gives us a practical control panel for AI thinking, found automatically without hand-labeling behaviors. By discovering and steering reasoning vectors, we can make models more careful or more concise on demand, saving time and compute. It helps keep reasoning transparent so humans can diagnose and coach models like students. It generalizes beyond math to logic and knowledge tasks, suggesting the behaviors are truly internal habits. Confidence vectors can reduce needless dithering, speeding up workflows without retraining. Finally, the approach reveals deeper structure (like length) in the model's mind, guiding future designs for safer and more reliable reasoning.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine watching a friend solve a tough math puzzle out loud. They try one idea, pause, check their work, maybe switch strategies, and finally get the answer. You can hear their thinking.
🥬 Filling (The Actual Concept):
- What it is: Chain-of-thought (CoT) reasoning is when an AI writes out its thinking step by step instead of jumping straight to the answer.
- How it works (recipe):
- The model gets a question.
- It generates one sentence (a āstepā) of reasoning.
- It uses that step to decide the next step, and so on.
- It ends with a final answer.
- Why it matters: Without CoT, models can skip steps, make hidden mistakes, and be hard to debug.
🍞 Bottom Bread (Anchor): If you ask, "What is 27 × 14?", a CoT model might write: "27 × 10 = 270; 27 × 4 = 108; 270 + 108 = 378; answer: 378." You see the path, not just the destination.
🍞 Top Bread (Hook): You know how a class photo shows everyone at once, but each face is a little different? If you could measure every smile, eye direction, and hat tilt, you'd have a giant picture of the group's style.
🥬 Filling (The Actual Concept):
- What it is: Activation space is the big map of numbers inside a model that changes as it thinks; each point describes what the model is focusing on right now.
- How it works:
- The model reads text and creates hidden numbers (activations) at each layer.
- Each layer transforms these numbers into new ones.
- The whole path of numbers is the model's internal thought trail.
- Why it matters: Without looking at this space, we can't understand or guide the model's thinking.
🍞 Bottom Bread (Anchor): When the AI says, "Wait, let me check," its activation space shifts to a "double-check" region; when it says, "Alternatively," it moves to a "try a new plan" region.
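To make "reading the thought trail" concrete, here is a minimal sketch, assuming a Hugging Face transformers setup, of how one might capture the hidden state at each reasoning-step boundary. The model name matches the one used later in this explainer; the layer index and the "\n\n" delimiter handling are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: extract one activation vector per reasoning step by reading the
# hidden state at each "\n\n" delimiter token from a chosen layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # reasoning model referenced in the paper
LAYER = 20                                           # hypothetical mid-to-late layer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def step_activations(cot_text: str) -> torch.Tensor:
    """Return a (num_steps, d_model) tensor: one row per reasoning step."""
    enc = tokenizer(cot_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][0]             # (seq_len, d_model) at the chosen layer
    # Assume "\n\n" maps to a single token; this lookup is tokenizer-dependent.
    delim_id = tokenizer("\n\n", add_special_tokens=False).input_ids[-1]
    mask = enc.input_ids[0] == delim_id              # positions of step boundaries
    return hidden[mask]
```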
🍞 Top Bread (Hook): Think of a toy box stuffed with many toys. If you want to explain what's inside quickly, you might say, "There are blocks, cars, and a teddy," instead of listing every piece.
🥬 Filling (The Actual Concept):
- What it is: A Sparse Auto-Encoder (SAE) is a small helper model that learns to describe complicated activations using a short list of simple, non-overlapping features.
- How it works:
- Encoder compresses the big activation into a few active features (most are off = sparse).
- Decoder rebuilds the original activation from those few features.
- Training teaches the encoder-decoder to reconstruct well while keeping features sparse.
- Why it matters: Without sparsity, many features turn on at once and get tangled, making them hard to interpret.
🍞 Bottom Bread (Anchor): Like summarizing your backpack as "1 notebook + 2 pencils + 1 snack," the SAE says, "This thought = a bit of 'check work' + a bit of 'try new plan'."
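Below is a minimal PyTorch sketch of such a sparse auto-encoder, assuming the recipe just described (ReLU codes, L2 reconstruction plus an L1 sparsity penalty, and a D=2048 dictionary as reported later). The loss weight is illustrative, not the authors' exact hyperparameter.

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    """Encoder compresses a step activation into a sparse code; decoder rebuilds it."""
    def __init__(self, d_model: int, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))      # sparse code: most entries end up zero
        h_hat = self.decoder(z)              # reconstruction from the few active features
        return h_hat, z

def sae_loss(h, h_hat, z, l1_weight: float = 1e-3):
    recon = (h - h_hat).pow(2).mean()        # L2 reconstruction term
    sparsity = z.abs().mean()                # L1 penalty keeps few features active
    return recon + l1_weight * sparsity

# After training, each decoder column is a candidate behavior direction:
# column j is sae.decoder.weight[:, j], with shape (d_model,).
```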
The World Before: Big models could solve tougher problems by writing longer thoughts, but we didn't know how their internal thinking worked. Researchers looked at parts of the trail like "reflection" (self-checking) or "backtracking" (switching strategies), but mostly by using human-chosen words. That's like searching only for the toys you already know by name: you miss the surprises.
The Problem: Reasoning behaviors are fuzzy, overlap, and are hard to label at scale. Word-level labels (like spotting the word "Wait") don't always match true behavior. We needed a way to find the thinking patterns directly inside the model's activations, without guessing labels first.
Failed Attempts: Difference-of-Means steering and similar methods work when you have neat opposites (happy vs. sad), but reasoning behaviors aren't clean opposites. Other work tried focusing only on a couple of behaviors (like reflection) or on response length, which leaves most thinking habits unexplored.
The Gap: A label-free (unsupervised) method that discovers many behaviors at once, directly from activations, and that also lets us gently steer those behaviors at test time.
Real Stakes: If we can see and guide an AI's thinking:
- We can shorten wasteful reasoning to save time and money.
- We can add extra self-checks for safety-critical tasks.
- We can teach or debug models like we coach students.
- We can discover helpful new habits (like healthy confidence) we didn't think to label.
🍞 Top Bread (Hook): Imagine sorting a huge jar of mixed candies into neat piles by taste, without reading the labels, just by trying them.
🥬 Filling (The Actual Concept):
- What it is: Disentangled features are separate, clean knobs inside the model, where each knob controls one clear behavior.
- How it works:
- Train an SAE so only a few features turn on per step.
- Each decoder column becomes a candidate "behavior direction."
- Because they're sparse, these directions separate instead of blending.
- Why it matters: Without disentanglement, one knob moves many behaviors at once.
🍞 Bottom Bread (Anchor): One knob boosts reflection (more "Let me check"), another knob boosts backtracking (more "Alternatively").
🍞 Top Bread (Hook): Think of a compass needle that points north. If the model's activation is a traveler, a "direction" tells you which way it's headed in thinking-land.
🥬 Filling (The Actual Concept):
- What it is: Reasoning vectors are directions in activation space that correspond to specific thinking behaviors (like reflection or confidence).
- How it works:
- SAE learns a set of decoder columns (candidate directions).
- We find which columns light up during certain behaviors.
- We can then push the activation a little toward or away from those directions.
- Why it matters: Without these directions, we can't target a behavior precisely.
🍞 Bottom Bread (Anchor): Add a bit of the "reflection" vector and the model does more self-checking; remove it and the model is more direct.
02 Core Idea
🍞 Top Bread (Hook): You know how a DJ can slide one knob to add bass and another to add treble, changing the song's feel without re-recording it?
🥬 Filling (The Actual Concept):
- What it is (one sentence): The key insight is that reasoning behaviors live as linear directions in the model's activation space, which we can discover with a sparse auto-encoder and then steer during inference.
Multiple Analogies:
- Recipe knobs: Each direction is a flavor knob ("reflection," "backtracking," or "confidence") that you can turn up or down without cooking a new dish (no retraining).
- Map routes: The activation space is a city map; reasoning vectors are streets that lead to neighborhoods like "self-check" or "try-new-idea." Steering chooses which street to take next.
- Color sliders: Like mixing red, green, and blue to get any color, combining a few reasoning vectors creates many reasoning styles.
Before vs After:
- Before: We relied on human labels, which covered only a few behaviors and didn't scale. Steering worked best for simple opposites (like sentiment), not for messy reasoning styles.
- After: We discover many behaviors unsupervised from step-level activations. We see clear clusters in later layers and can modulate behaviors on the fly without further training. We even find new ones (like confidence vectors) that aren't easy to label with words.
Why It Works (intuition):
- The linear representation hypothesis suggests that complex concepts often align with straight directions in activation space. SAEs encourage sparsity so only a few features explain each step, separating behaviors into cleaner "knobs." Decoder columns act like a dictionary of behavior directions. Mid-to-late layers carry more behavior-specific structure, so clusters show up clearly there.
Building Blocks (each with a sandwich intro where first used):
- Step-level thought representations: Use a special delimiter token between sentences to capture the activation for each reasoning step.
- SAE training on one layer: Learn sparse codes that reconstruct those step activations; decoder columns become candidate behavior vectors.
- Behavior discovery: See which columns fire for reflection/backtracking and visualize them; they cluster into distinct regions.
- 🍞 Top Bread (Hook): Imagine a remote control that makes a robot thinker more cautious or more decisive with a tap.
🥬 Filling (The Actual Concept):
- What it is: Controllability in LLMs means we can gently adjust specific thinking behaviors at test time.
- How it works:
- Pick a decoder column (or average of columns) for a behavior.
- Project the current activation away from or toward that direction.
- Generate the next token with the adjusted activation.
- Why it matters: Without controllability, we can't adapt the model to different tasks' needs on the fly.
- 🍞 Bottom Bread (Anchor): For quick tasks, nudge away from reflection to be faster; for safety-critical ones, nudge toward reflection to double-check.
- New behavior search via entropy:
- 🍞 Hook: A thermometer shows how hot it is; a similar gauge can show how uncertain a model is.
- 🥬 What it is: Entropy mechanisms measure uncertainty; lower entropy means higher confidence.
How it works:
- Score decoder columns by how much they reduce entropy when added to activations.
- Pick top-scoring columns as confidence-related vectors.
- Steer with these to shift the model toward more confident, calculation-focused steps.
- Why it matters: Without an uncertainty gauge, we can't reliably find or control confidence (a small scoring sketch follows this list).
- 🍞 Anchor: With confidence vectors on, tokens like "Wait" and "Alternatively" appear less, and numbers or direct calculations appear more.
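A hedged sketch of that entropy search is below: add each decoder column to the hidden state at the SAE layer via a forward hook and score it by how much next-token entropy drops. The hook point, the fixed scale, and the brute-force loop over columns are assumptions for illustration; the paper's exact scoring procedure may differ.

```python
import torch
import torch.nn.functional as F

def next_token_entropy(model, input_ids) -> float:
    """Entropy of the model's next-token distribution for a prompt."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    p = F.softmax(logits.float(), dim=-1)
    return float(-(p * torch.log(p + 1e-12)).sum())

def score_columns(model, tokenizer, prompts, decoder_weight, layer_module, scale=4.0):
    """Entropy reduction per decoder column; higher scores suggest 'confidence' directions."""
    d_model, d_hidden = decoder_weight.shape
    scores = torch.zeros(d_hidden)
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        base = next_token_entropy(model, ids)
        for j in range(d_hidden):
            w = F.normalize(decoder_weight[:, j].float(), dim=0)
            def add_direction(module, inputs, output):   # nudge the layer output along column j
                hs = output[0] if isinstance(output, tuple) else output
                hs = hs + scale * w.to(hs.dtype)
                return (hs,) + output[1:] if isinstance(output, tuple) else hs
            handle = layer_module.register_forward_hook(add_direction)
            scores[j] += base - next_token_entropy(model, ids)  # positive = entropy went down
            handle.remove()
        # In practice one would batch this; the double loop is for clarity only.
    return scores / len(prompts)
```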
03 Methodology
High-Level Overview: Input (unlabeled CoT traces) → Build step-level activations → Train SAE on one layer → Discover clusters/vectors → Intervene during inference → Output (steered reasoning).
Step-by-Step (like a recipe):
- Collect model responses
- What happens: Use a reasoning model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) to answer problems (e.g., MATH500) with chain-of-thought.
- Why this step: We need real reasoning trajectories to study thinking habits.
- Example: The model solves a geometry problem, writing 8 sentences of reasoning.
- Split into sentence-level steps
- What happens: Insert a delimiter (\n\n) and treat each sentence as a step; re-run to capture the hidden activation at the delimiter token after a chosen layer.
- Why this step: Token-level features are too fine; sentence steps better capture behaviors like reflecting or backtracking.
- Example: Step 5 begins with "Wait, let me verify..."; we capture the activation right at that step's boundary.
- Train a Sparse Auto-Encoder (SAE)
- What happens: Feed these step activations into an SAE with sparse codes (e.g., hidden D=2048; ReLU; L2 reconstruction + L1 sparsity).
- Why this step: Sparsity helps separate features, so each decoder column can align with a behavior direction.
- Example: The SAE learns columns that often activate on steps starting with "Let me check," hinting at reflection.
- Visualize the decoder columns
- What happens: Project decoder columns to 2D with UMAP to see clusters; compute Silhouette scores across layers for separation quality.
- Why this step: Clusters show that columns correspond to distinct behaviors; layer-wise scores reveal where structure is sharpest (mid-to-late layers).
- Example: Reflection-related columns gather in one region; backtracking columns in another.
- Build behavior vectors
- What happens: Identify columns most active for reflection or backtracking (using LLM-as-judge or keyword support) and average them to get a clean behavior vector.
- Why this step: Averaging stabilizes the direction and removes mixed columns.
- Example: The "reflection vector" becomes a unit vector used for steering.
- Intervene during inference
- What happens: At each reasoning step, shift the current activation toward (α > 0) or away from (α < 0) the behavior vector: h' = h + α·w(wᵀh), with α controlling strength (a minimal sketch follows this recipe).
- Why this step: This gentle nudge directly reduces or enhances the target behavior without retraining.
- Example: α < 0 reduces reflection; α > 0 increases it. Changing α from -1.5 to 1.5 gradually changes reflection steps from about 58.6 to 166.9 on AIME25.
- Discover new behaviors via entropy (confidence vectors)
- What happens: Score each decoder column by how much it lowers entropy when added to activations; select top columns and combine for steering.
- Why this step: Confidence isnāt easy to define with words, but entropy gives a measurable target.
- Example: After applying the confidence vector, tokens like "Wait" and "Alternatively" drop; more numbers and direct calculations appear.
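Here is a minimal sketch of the projection intervention from step 6, written as a forward hook so no retraining is needed. The sign convention follows the recipe above (α > 0 pushes toward the behavior, α < 0 pushes away); the layer path in the usage comment assumes a Qwen2-style Hugging Face model and is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def make_steering_hook(behavior_vec: torch.Tensor, alpha: float):
    """behavior_vec: averaged decoder columns for one behavior, shape (d_model,)."""
    w = F.normalize(behavior_vec.float(), dim=0)        # unit behavior direction
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        proj = (hs.float() @ w).unsqueeze(-1) * w       # w (w^T h) at every position
        hs = hs + alpha * proj.to(hs.dtype)             # h' = h + alpha * w (w^T h)
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook

# Usage sketch (hypothetical layer path for a Qwen2-style model):
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_steering_hook(reflection_vec, alpha=-1.0))   # alpha < 0 suppresses reflection
# output_ids = model.generate(**tokenizer(question, return_tensors="pt"), max_new_tokens=512)
# handle.remove()
```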
The Secret Sauce:
- Train on step-level delimiter activations so each feature describes a whole thought move, not just a word.
- Use sparsity to keep features disentangled, so columns become clean behavior directions.
- Intervene with simple projectionsāno retrainingāso control is lightweight and fast.
- Generalize across domains: vectors learned on math transfer to GPQA-Diamond and KnowLogic, showing they capture true behaviors, not just dataset quirks.
Additional Sandwich Intros for tools first used here:
- 🍞 Hook: Imagine shrinking a world map to fit on a postcard while keeping neighborhoods close. 🥬 What it is: UMAP is a tool that squeezes high-dimensional data into 2D so clusters stay meaningful. How it works: (1) Measure who is near whom; (2) Build a neighborhood graph; (3) Place points in 2D to keep neighbors close. Why it matters: Without a good map, we can't see behavior clusters. 🍞 Anchor: Reflection-related columns appear as a tight island on the 2D map.
- 🍞 Hook: If a team forms natural groups, you can grade how clearly they cluster. 🥬 What it is: Silhouette score measures how well points fit into their clusters versus others. How it works: (1) Compute average distance to own cluster; (2) Compute distance to nearest other cluster; (3) Compare them per point and average. Why it matters: Without a score, we only "eyeball" clusters. 🍞 Anchor: Later layers have higher silhouette scores, showing behaviors are most separated there (both tools are sketched below).
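The sketch below shows how one might run both analyses on the SAE's decoder matrix with the umap-learn and scikit-learn packages. The k-means clustering here is a stand-in for the paper's behavior labels (LLM-as-judge or keywords), so the exact silhouette values will differ.

```python
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def analyze_decoder(decoder_weight: np.ndarray, n_clusters: int = 8, seed: int = 0):
    """decoder_weight: (d_model, d_hidden). Returns a 2D embedding and a silhouette score."""
    cols = decoder_weight.T                                        # one row per candidate direction
    cols = cols / (np.linalg.norm(cols, axis=1, keepdims=True) + 1e-8)
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=seed).fit_transform(cols)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(cols)
    return embedding, silhouette_score(cols, labels)

# Repeating this per layer and comparing the scores mirrors the layer-wise
# "structure is sharpest in mid-to-late layers" analysis.
```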
04 Experiments & Results
The Test: The authors ask three main questions:
- Do SAE decoder columns form meaningful behavior clusters (reflection, backtracking, others)?
- Do interventions along these columns causally change reasoning behaviors without hurting correctness?
- Can we discover new behavior directions like confidence (via entropy) and do they generalize across tasks?
The Competition (comparisons and context): Prior activation-steering methods (like DiffMean) need human-contrasted labels; the paper instead uses unsupervised SAE discovery. For test-time enhancement, they also compare to TIP and SEAL style steering methods.
The Scoreboard (with context):
- Clear clustering: UMAP visualizations show reflection and backtracking columns forming separable islands, especially in mid-to-late layers. Normalized Silhouette scores rise with depth, peak before the very last layer, then slightly dip (consistent with oversmoothing).
- Causal steering works: On AIME25 with DeepSeek-R1-1.5B, negative reflection steering reduces reflection steps, positive steering increases them, while final answers often remain the same. With strength α varied across {-1.5, -1, 0, 1, 1.5}, reflection steps smoothly change to {58.6, 73.6, 90.5, 131.0, 166.9}, like a volume knob that actually works.
- Cross-domain generalization: Vectors learned on MATH500 still modulate reflection/backtracking on GPQA-Diamond and KnowLogic, which is like a basketball move that still works when you change courts.
- Confidence vectors emerge: Scoring columns by entropy identifies a concentrated cluster. Intervening reduces reflection from about 90.53 to 33.77 and backtracking from about 35.50 to 5.93 on AIME25. Performance drops slightly from 23.33% to 20.00% (1 problem out of 30), showing a strong style shift with minimal accuracy impact.
- Reasoning enhancement: Combining top-3 confidence vectors and learning small coefficients at test time improves accuracy by up to 4.66 points and reduces token cost by 13.69% versus TIP/SEAL baselines, which is like winning more often while spending fewer minutes thinking aloud.
- Response length structure: Decoder columns also separate long vs. short reasoning traces, with the strongest separation in mid-to-late layers, revealing a structural axis inside the model's thought space.
Surprising Findings:
- Reflection and backtracking partially overlap in space, hinting these metacognitive moves are related.
- Confidence-related columns cluster in a specific region and correlate with fewer hesitation tokens (like "Wait"), replacing them with calculation tokens (like numbers). This suggests confidence is not just a final-answer feeling but a shift in the entire reasoning style.
- Clarity peaks before the final layer, aligning with known oversmoothing: the very end may compress distinctions.
🍞 Top Bread (Hook): Think of a dimmer switch that changes not only brightness but also room mood.
🥬 Filling (The Actual Concept):
- What it is: Intervention/steering means adding or removing a component of the activation along a chosen vector to change behavior.
- How it works:
- Pick a behavior vector from the decoder columns.
- Project current activation toward/away from that vector.
- Generate the next token with the adjusted activation.
- Why it matters: Without direct steering, behavior control would require retraining or clumsy prompts.
🍞 Bottom Bread (Anchor): Removing the reflection component made the model skip "Let me check…" sentences and answer more directly, while keeping the same final solution.
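As a small usage example, assuming the make_steering_hook helper sketched in the Methodology section, one could sweep the steering strength and count hesitation markers in the output. Counting "Wait"/"Alternatively" is only a rough proxy for the paper's LLM-as-judge behavior counts.

```python
# Assumes make_steering_hook from the Methodology sketch, plus a loaded model/tokenizer.
def count_hesitation_markers(text: str) -> int:
    """Rough proxy for reflection/backtracking: count common hesitation cues."""
    return text.count("Wait") + text.count("Alternatively")

def sweep_alpha(model, tokenizer, question, reflection_vec, layer_module,
                alphas=(-1.5, 0.0, 1.5)):
    results = {}
    for alpha in alphas:
        handle = layer_module.register_forward_hook(make_steering_hook(reflection_vec, alpha))
        inputs = tokenizer(question, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
        handle.remove()
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        results[alpha] = count_hesitation_markers(text)
    return results   # expect fewer markers at alpha < 0, more at alpha > 0
```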
05 Discussion & Limitations
Limitations:
- Behavior coverage: SAEs capture many behaviors but may still miss rare or subtle ones, especially if training data under-represents them.
- Sensitivity: Which layer you train on, how you choose sparsity, and which columns you average can change results. Mixed columns must be filtered.
- Tradeoffs: Confidence steering can reduce helpful reflection in hard problems; small accuracy dips can appear if overused.
- Measurement noise: LLM-as-judge and keyword labels are proxies; they're consistent but not perfect.
- Access needs: You must hook into model activations; closed models may not allow this.
Required Resources:
- Compute to run the base model and extract step-level activations on a few hundred to a few thousand CoT traces.
- GPU memory for training an SAE (e.g., D=2048) and for UMAP/Silhouette analysis.
- Light optimization to find entropy/confidence vectors.
When NOT to Use:
- Ultra-short tasks where CoT isn't needed; steering overhead won't help.
- High-stakes settings that forbid any test-time internal edits (policy/regulatory constraints).
- Models without activation access or that are extremely compressed where layer signals are weak.
Open Questions:
- How stable are reasoning vectors across model updates and scales?
- Can we automatically de-duplicate mixed columns and map a richer behavior atlas (planning, hypothesis testing, error correction)?
- How do behavior vectors interact with truthfulness and safety: can confidence vectors sometimes over-assert wrong answers?
- Can similar unsupervised maps help align RL-finetuned "thinking" models and base models more reliably than supervised transfer?
- Is there an optimal multi-behavior controller (a small policy) that picks steering strengths per step to maximize accuracy and minimize tokens?
06 Conclusion & Future Work
Three-Sentence Summary: RISE trains a sparse auto-encoder on sentence-level thought activations to discover reasoning vectors: directions in activation space that align with clear thinking behaviors. These vectors cluster in mid-to-late layers and can be used at test time to dial behaviors like reflection, backtracking, and confidence up or down without retraining. The approach generalizes across domains, reveals structure like response length, and enables on-the-fly improvements in accuracy and efficiency.
Main Achievement: Showing that diverse reasoning behaviors live as disentangled, steerable directions learned unsupervised in the SAE decoder space, and that small, targeted activation edits can causally reshape the model's reasoning trajectory.
Future Directions:
- Build a richer behavior atlas (planning, decomposition, verification, hypothesis testing) and automate column selection.
- Design adaptive controllers that choose steering strengths per step and per task.
- Study how behavior steering interacts with truthfulness, safety, and calibration.
- Extend beyond language to multimodal reasoning (code, math, diagrams).
Why Remember This: It turns "AI thinking" from a black box into a control panel. By discovering and steering the hidden knobs of reasoning, we gain a practical, label-free way to make AI think better, faster, and safer today, not just in theory.
Practical Applications
- Speed up customer support bots by reducing over-reflection on easy questions to answer faster.
- Increase self-checking in medical or legal drafting to reduce risky mistakes before finalization.
- Cut cloud costs by steering away from wasteful long chains when confidence is already high.
- Improve tutoring systems by turning up reflection so the AI shows careful checks students can learn from.
- Debug model failures by identifying which behavior vector (e.g., backtracking) was underused or overused.
- Build adaptive agents that adjust behavior per step: explore more at first, verify more near the end.
- Transfer learned behavior vectors from one domain (math) to another (logic) to bootstrap better reasoning.
- Create safer defaults by gently boosting verification behaviors for sensitive tasks.
- Personalize reasoning style (concise vs. thorough) for different users or time budgets.
- Develop curriculum learning tools that introduce one behavior vector at a time to train specific habits.