Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Key Summary
- This paper shows a new way (called RISE) to find and control how AI models think without needing any human-made labels.
- They train a Sparse Auto-Encoder (SAE) on step-by-step thinking traces to discover "reasoning vectors": directions in the model's "brain" that match behaviors like reflection and backtracking.
- These discovered vectors naturally group into clear clusters, especially in the model's later layers, which means the model organizes thinking styles in understandable ways.
- By gently nudging the model along (or away from) these vectors during inference, the authors can increase or decrease specific behaviors without retraining the model.
- They also find structure for response length and identify special vectors related to confidence by optimizing for lower entropy (more certainty).
- Interventions reduce reflection steps from about 90.5 to 33.8 and backtracking steps from about 35.5 to 5.9 on AIME25 while keeping answers mostly correct.
- The discovered vectors trained on math data still work on other tasks like GPQA-Diamond and KnowLogic, showing strong generalization.
- A simple test-time steering using top confidence vectors improves accuracy by up to 4.66 points and cuts token cost by 13.69% compared to baselines like TIP and SEAL.
- The work supports the idea that abstract thinking patterns live as linear directions inside models, which we can map and steer.
- This opens a path to safer, cheaper, and more reliable AI reasoning by discovering new thinking habits we didn't even know to look for.
Why This Research Matters
This work gives us a practical control panel for AI thinking, found automatically without hand-labeling behaviors. By discovering and steering reasoning vectors, we can make models more careful or more concise on demand, saving time and compute. It helps keep reasoning transparent so humans can diagnose and coach models like students. It generalizes beyond math to logic and knowledge tasks, suggesting the behaviors are truly internal habits. Confidence vectors can reduce needless dithering, speeding up workflows without retraining. Finally, the approach reveals deeper structure (like length) in the model's mind, guiding future designs for safer and more reliable reasoning.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine watching a friend solve a tough math puzzle out loud. They try one idea, pause, check their work, maybe switch strategies, and finally get the answer. You can hear their thinking.
🥬 Filling (The Actual Concept):
- What it is: Chain-of-thought (CoT) reasoning is when an AI writes out its thinking step by step instead of jumping straight to the answer.
- How it works (recipe):
- The model gets a question.
- It generates one sentence (a āstepā) of reasoning.
- It uses that step to decide the next step, and so on.
- It ends with a final answer.
- Why it matters: Without CoT, models can skip steps, make hidden mistakes, and be hard to debug.
🍞 Bottom Bread (Anchor): If you ask, "What is 27 × 14?", a CoT model might write: "27 × 10 = 270; 27 × 4 = 108; 270 + 108 = 378; answer: 378." You see the path, not just the destination.
🍞 Top Bread (Hook): You know how a class photo shows everyone at once, but each face is a little different? If you could measure every smile, eye direction, and hat tilt, you'd have a giant picture of the group's style.
🥬 Filling (The Actual Concept):
- What it is: Activation space is the big map of numbers inside a model that changes as it thinks; each point describes what the model is focusing on right now.
- How it works:
- The model reads text and creates hidden numbers (activations) at each layer.
- Each layer transforms these numbers into new ones.
- The whole path of numbers is the model's internal thought trail.
- Why it matters: Without looking at this space, we can't understand or guide the model's thinking.
🍞 Bottom Bread (Anchor): When the AI says, "Wait, let me check," its activation space shifts to a "double-check" region; when it says, "Alternatively," it moves to a "try a new plan" region.
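To make "reading the thought trail" concrete, here is a minimal sketch, assuming a Hugging Face transformers setup, of how one might capture the hidden state at each reasoning-step boundary. The model name matches the one used later in this explainer; the layer index and the "\n\n" delimiter handling are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: extract one activation vector per reasoning step by reading the
# hidden state at each "\n\n" delimiter token from a chosen layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # reasoning model referenced in the paper
LAYER = 20                                           # hypothetical mid-to-late layer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def step_activations(cot_text: str) -> torch.Tensor:
    """Return a (num_steps, d_model) tensor: one row per reasoning step."""
    enc = tokenizer(cot_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][0]             # (seq_len, d_model) at the chosen layer
    # Assume "\n\n" maps to a single token; this lookup is tokenizer-dependent.
    delim_id = tokenizer("\n\n", add_special_tokens=False).input_ids[-1]
    mask = enc.input_ids[0] == delim_id              # positions of step boundaries
    return hidden[mask]
```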
🍞 Top Bread (Hook): Think of a toy box stuffed with many toys. If you want to explain what's inside quickly, you might say, "There are blocks, cars, and a teddy," instead of listing every piece.
🥬 Filling (The Actual Concept):
- What it is: A Sparse Auto-Encoder (SAE) is a small helper model that learns to describe complicated activations using a short list of simple, non-overlapping features.
- How it works:
- Encoder compresses the big activation into a few active features (most are off = sparse).
- Decoder rebuilds the original activation from those few features.
- Training teaches the encoder-decoder to reconstruct well while keeping features sparse.
- Why it matters: Without sparsity, many features turn on at once and get tangled, making them hard to interpret.
🍞 Bottom Bread (Anchor): Like summarizing your backpack as "1 notebook + 2 pencils + 1 snack," the SAE says, "This thought = a bit of 'check work' + a bit of 'try new plan'."
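Below is a minimal PyTorch sketch of such a sparse auto-encoder, assuming the recipe just described (ReLU codes, L2 reconstruction plus an L1 sparsity penalty, and a D=2048 dictionary as reported later). The loss weight is illustrative, not the authors' exact hyperparameter.

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    """Encoder compresses a step activation into a sparse code; decoder rebuilds it."""
    def __init__(self, d_model: int, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))      # sparse code: most entries end up zero
        h_hat = self.decoder(z)              # reconstruction from the few active features
        return h_hat, z

def sae_loss(h, h_hat, z, l1_weight: float = 1e-3):
    recon = (h - h_hat).pow(2).mean()        # L2 reconstruction term
    sparsity = z.abs().mean()                # L1 penalty keeps few features active
    return recon + l1_weight * sparsity

# After training, each decoder column is a candidate behavior direction:
# column j is sae.decoder.weight[:, j], with shape (d_model,).
```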
The World Before: Big models could solve tougher problems by writing longer thoughts, but we didn't know how their internal thinking worked. Researchers looked at parts of the trail like "reflection" (self-checking) or "backtracking" (switching strategies), but mostly by using human-chosen words. That's like searching only for the toys you already know by name: you miss the surprises.
The Problem: Reasoning behaviors are fuzzy, overlap, and are hard to label at scale. Word-level labels (like spotting the word "Wait") don't always match true behavior. We needed a way to find the thinking patterns directly inside the model's activations, without guessing labels first.
Failed Attempts: Difference-of-Means steering and similar methods work when you have neat opposites (happy vs. sad), but reasoning behaviors aren't clean opposites. Other work tried focusing only on a couple of behaviors (like reflection) or on response length, which leaves most thinking habits unexplored.
The Gap: A label-free (unsupervised) method that discovers many behaviors at once, directly from activations, and that also lets us gently steer those behaviors at test time.
Real Stakes: If we can see and guide an AI's thinking:
- We can shorten wasteful reasoning to save time and money.
- We can add extra self-checks for safety-critical tasks.
- We can teach or debug models like we coach students.
- We can discover helpful new habits (like healthy confidence) we didn't think to label.
🍞 Top Bread (Hook): Imagine sorting a huge jar of mixed candies into neat piles by taste, without reading the labels, just by trying them.
🥬 Filling (The Actual Concept):
- What it is: Disentangled features are separate, clean knobs inside the model, where each knob controls one clear behavior.
- How it works:
- Train an SAE so only a few features turn on per step.
- Each decoder column becomes a candidate "behavior direction."
- Because they're sparse, these directions separate instead of blending.
- Why it matters: Without disentanglement, one knob moves many behaviors at once.
🍞 Bottom Bread (Anchor): One knob boosts reflection (more "Let me check"), another knob boosts backtracking (more "Alternatively").
🍞 Top Bread (Hook): Think of a compass needle that points north. If the model's activation is a traveler, a "direction" tells you which way it's headed in thinking-land.
🥬 Filling (The Actual Concept):
- What it is: Reasoning vectors are directions in activation space that correspond to specific thinking behaviors (like reflection or confidence).
- How it works:
- SAE learns a set of decoder columns (candidate directions).
- We find which columns light up during certain behaviors.
- We can then push the activation a little toward or away from those directions.
- Why it matters: Without these directions, we can't target a behavior precisely.
🍞 Bottom Bread (Anchor): Add a bit of the "reflection" vector and the model does more self-checking; remove it and the model is more direct.
02 Core Idea
🍞 Top Bread (Hook): You know how a DJ can slide one knob to add bass and another to add treble, changing the song's feel without re-recording it?
🥬 Filling (The Actual Concept):
- What it is (one sentence): The key insight is that reasoning behaviors live as linear directions in the model's activation space, which we can discover with a sparse auto-encoder and then steer during inference.
Multiple Analogies:
- Recipe knobs: Each direction is a flavor knob ("reflection," "backtracking," or "confidence") that you can turn up or down without cooking a new dish (no retraining).
- Map routes: The activation space is a city map; reasoning vectors are streets that lead to neighborhoods like "self-check" or "try-new-idea." Steering chooses which street to take next.
- Color sliders: Like mixing red, green, and blue to get any color, combining a few reasoning vectors creates many reasoning styles.
Before vs After:
- Before: We relied on human labels, which covered only a few behaviors and didn't scale. Steering worked best for simple opposites (like sentiment), not for messy reasoning styles.
- After: We discover many behaviors unsupervised from step-level activations. We see clear clusters in later layers and can modulate behaviors on the fly without further training. We even find new ones (like confidence vectors) that aren't easy to label with words.
Why It Works (intuition):
- The linear representation hypothesis suggests that complex concepts often align with straight directions in activation space. SAEs encourage sparsity so only a few features explain each step, separating behaviors into cleaner "knobs." Decoder columns act like a dictionary of behavior directions. Mid-to-late layers carry more behavior-specific structure, so clusters show up clearly there.
Building Blocks (each with a sandwich intro where first used):
- Step-level thought representations: Use a special delimiter token between sentences to capture the activation for each reasoning step.
- SAE training on one layer: Learn sparse codes that reconstruct those step activations; decoder columns become candidate behavior vectors.
- Behavior discovery: See which columns fire for reflection/backtracking and visualize them; they cluster into distinct regions.
- 🍞 Top Bread (Hook): Imagine a remote control that makes a robot thinker more cautious or more decisive with a tap.
🥬 Filling (The Actual Concept):
- What it is: Controllability in LLMs means we can gently adjust specific thinking behaviors at test time.
- How it works:
- Pick a decoder column (or average of columns) for a behavior.
- Project the current activation away from or toward that direction.
- Generate the next token with the adjusted activation.
- Why it matters: Without controllability, we can't adapt the model to different tasks' needs on the fly.
- 🍞 Bottom Bread (Anchor): For quick tasks, nudge away from reflection to be faster; for safety-critical ones, nudge toward reflection to double-check.
- New behavior search via entropy:
- 🍞 Hook: A thermometer shows how hot it is; a similar gauge can show how uncertain a model is.
- 🥬 What it is: Entropy mechanisms measure uncertainty; lower entropy means higher confidence.
How it works:
- Score decoder columns by how much they reduce entropy when added to activations.
- Pick top-scoring columns as confidence-related vectors.
- Steer with these to shift the model toward more confident, calculation-focused steps.
- Why it matters: Without an uncertainty gauge, we can't reliably find or control confidence (a small scoring sketch follows this list).
- 🍞 Anchor: With confidence vectors on, tokens like "Wait" and "Alternatively" appear less, and numbers or direct calculations appear more.
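A hedged sketch of that entropy search is below: add each decoder column to the hidden state at the SAE layer via a forward hook and score it by how much next-token entropy drops. The hook point, the fixed scale, and the brute-force loop over columns are assumptions for illustration; the paper's exact scoring procedure may differ.

```python
import torch
import torch.nn.functional as F

def next_token_entropy(model, input_ids) -> float:
    """Entropy of the model's next-token distribution for a prompt."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    p = F.softmax(logits.float(), dim=-1)
    return float(-(p * torch.log(p + 1e-12)).sum())

def score_columns(model, tokenizer, prompts, decoder_weight, layer_module, scale=4.0):
    """Entropy reduction per decoder column; higher scores suggest 'confidence' directions."""
    d_model, d_hidden = decoder_weight.shape
    scores = torch.zeros(d_hidden)
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        base = next_token_entropy(model, ids)
        for j in range(d_hidden):
            w = F.normalize(decoder_weight[:, j].float(), dim=0)
            def add_direction(module, inputs, output):   # nudge the layer output along column j
                hs = output[0] if isinstance(output, tuple) else output
                hs = hs + scale * w.to(hs.dtype)
                return (hs,) + output[1:] if isinstance(output, tuple) else hs
            handle = layer_module.register_forward_hook(add_direction)
            scores[j] += base - next_token_entropy(model, ids)  # positive = entropy went down
            handle.remove()
        # In practice one would batch this; the double loop is for clarity only.
    return scores / len(prompts)
```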
03 Methodology
High-Level Overview: Input (unlabeled CoT traces) → Build step-level activations → Train SAE on one layer → Discover clusters/vectors → Intervene during inference → Output (steered reasoning).
Step-by-Step (like a recipe):
- Collect model responses
- What happens: Use a reasoning model (e.g., DeepSeek-R1-Distill-Qwen-1.5B) to answer problems (e.g., MATH500) with chain-of-thought.
- Why this step: We need real reasoning trajectories to study thinking habits.
- Example: The model solves a geometry problem, writing 8 sentences of reasoning.
- Split into sentence-level steps
- What happens: Insert a delimiter (\n\n) and treat each sentence as a step; re-run to capture the hidden activation at the delimiter token after a chosen layer.
- Why this step: Token-level features are too fine; sentence steps better capture behaviors like reflecting or backtracking.
- Example: Step 5 begins with "Wait, let me verify..."; we capture the activation right at that step's boundary.
- Train a Sparse Auto-Encoder (SAE)
- What happens: Feed these step activations into an SAE with sparse codes (e.g., hidden D=2048; ReLU; L2 reconstruction + L1 sparsity).
- Why this step: Sparsity helps separate features, so each decoder column can align with a behavior direction.
- Example: The SAE learns columns that often activate on steps starting with "Let me check," hinting at reflection.
- Visualize the decoder columns
- What happens: Project decoder columns to 2D with UMAP to see clusters; compute Silhouette scores across layers for separation quality.
- Why this step: Clusters show that columns correspond to distinct behaviors; layer-wise scores reveal where structure is sharpest (mid-to-late layers).
- Example: Reflection-related columns gather in one region; backtracking columns in another.
- Build behavior vectors
- What happens: Identify columns most active for reflection or backtracking (using LLM-as-judge or keyword support) and average them to get a clean behavior vector.
- Why this step: Averaging stabilizes the direction and removes mixed columns.
- Example: The "reflection vector" becomes a unit vector used for steering.
- Intervene during inference
- What happens: At each reasoning step, shift the current activation toward (α > 0) or away from (α < 0) the behavior vector: h' = h + α·w(wᵀh), with α controlling strength (a minimal sketch follows this recipe).
- Why this step: This gentle nudge directly reduces or enhances the target behavior without retraining.
- Example: α < 0 reduces reflection; α > 0 increases it. Changing α from -1.5 to 1.5 gradually changes reflection steps from about 58.6 to 166.9 on AIME25.
- Discover new behaviors via entropy (confidence vectors)
- What happens: Score each decoder column by how much it lowers entropy when added to activations; select top columns and combine for steering.
- Why this step: Confidence isnāt easy to define with words, but entropy gives a measurable target.
- Example: After applying the confidence vector, tokens like "Wait" and "Alternatively" drop; more numbers and direct calculations appear.
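Here is a minimal sketch of the projection intervention from step 6, written as a forward hook so no retraining is needed. The sign convention follows the recipe above (α > 0 pushes toward the behavior, α < 0 pushes away); the layer path in the usage comment assumes a Qwen2-style Hugging Face model and is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def make_steering_hook(behavior_vec: torch.Tensor, alpha: float):
    """behavior_vec: averaged decoder columns for one behavior, shape (d_model,)."""
    w = F.normalize(behavior_vec.float(), dim=0)        # unit behavior direction
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        proj = (hs.float() @ w).unsqueeze(-1) * w       # w (w^T h) at every position
        hs = hs + alpha * proj.to(hs.dtype)             # h' = h + alpha * w (w^T h)
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook

# Usage sketch (hypothetical layer path for a Qwen2-style model):
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_steering_hook(reflection_vec, alpha=-1.0))   # alpha < 0 suppresses reflection
# output_ids = model.generate(**tokenizer(question, return_tensors="pt"), max_new_tokens=512)
# handle.remove()
```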
The Secret Sauce:
- Train on step-level delimiter activations so each feature describes a whole thought move, not just a word.
- Use sparsity to keep features disentangled, so columns become clean behavior directions.
- Intervene with simple projectionsāno retrainingāso control is lightweight and fast.
- Generalize across domains: vectors learned on math transfer to GPQA-Diamond and KnowLogic, showing they capture true behaviors, not just dataset quirks.
Additional Sandwich Intros for tools first used here:
- 🍞 Hook: Imagine shrinking a world map to fit on a postcard while keeping neighborhoods close. 🥬 What it is: UMAP is a tool that squeezes high-dimensional data into 2D so clusters stay meaningful. How it works: (1) Measure who is near whom; (2) Build a neighborhood graph; (3) Place points in 2D to keep neighbors close. Why it matters: Without a good map, we can't see behavior clusters. 🍞 Anchor: Reflection-related columns appear as a tight island on the 2D map.
- 🍞 Hook: If a team forms natural groups, you can grade how clearly they cluster. 🥬 What it is: Silhouette score measures how well points fit into their clusters versus others. How it works: (1) Compute average distance to own cluster; (2) Compute distance to nearest other cluster; (3) Compare them per point and average. Why it matters: Without a score, we only "eyeball" clusters. 🍞 Anchor: Later layers have higher silhouette scores, showing behaviors are most separated there (both tools are sketched below).
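The sketch below shows how one might run both analyses on the SAE's decoder matrix with the umap-learn and scikit-learn packages. The k-means clustering here is a stand-in for the paper's behavior labels (LLM-as-judge or keywords), so the exact silhouette values will differ.

```python
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def analyze_decoder(decoder_weight: np.ndarray, n_clusters: int = 8, seed: int = 0):
    """decoder_weight: (d_model, d_hidden). Returns a 2D embedding and a silhouette score."""
    cols = decoder_weight.T                                        # one row per candidate direction
    cols = cols / (np.linalg.norm(cols, axis=1, keepdims=True) + 1e-8)
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=seed).fit_transform(cols)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(cols)
    return embedding, silhouette_score(cols, labels)

# Repeating this per layer and comparing the scores mirrors the layer-wise
# "structure is sharpest in mid-to-late layers" analysis.
```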
04 Experiments & Results
The Test: The authors ask three main questions:
- Do SAE decoder columns form meaningful behavior clusters (reflection, backtracking, others)?
- Do interventions along these columns causally change reasoning behaviors without hurting correctness?
- Can we discover new behavior directions like confidence (via entropy) and do they generalize across tasks?
The Competition (comparisons and context): Prior activation-steering methods (like DiffMean) need human-contrasted labels; the paper instead uses unsupervised SAE discovery. For test-time enhancement, they also compare to TIP and SEAL style steering methods.
The Scoreboard (with context):
- Clear clustering: UMAP visualizations show reflection and backtracking columns forming separable islands, especially in mid-to-late layers. Normalized Silhouette scores rise with depth, peak before the very last layer, then slightly dip (consistent with oversmoothing).
- Causal steering works: On AIME25 with DeepSeek-R1-1.5B, negative reflection steering reduces reflection steps, positive steering increases them, while final answers often remain the same. With strength α varied across {-1.5, -1, 0, 1, 1.5}, reflection steps smoothly change to {58.6, 73.6, 90.5, 131.0, 166.9}, like a volume knob that actually works.
- Cross-domain generalization: Vectors learned on MATH500 still modulate reflection/backtracking on GPQA-Diamond and KnowLogic, which is like a basketball move that still works when you change courts.
- Confidence vectors emerge: Scoring columns by entropy identifies a concentrated cluster. Intervening reduces reflection from about 90.53 to 33.77 and backtracking from about 35.50 to 5.93 on AIME25. Performance drops slightly from 23.33% to 20.00% (1 problem out of 30), showing a strong style shift with minimal accuracy impact.
- Reasoning enhancement: Combining top-3 confidence vectors and learning small coefficients at test time improves accuracy by up to 4.66 points and reduces token cost by 13.69% versus TIP/SEAL baselines, which is like winning more often while spending fewer minutes thinking aloud.
- Response length structure: Decoder columns also separate long vs. short reasoning traces, with the strongest separation in mid-to-late layers, revealing a structural axis inside the model's thought space.
Surprising Findings:
- Reflection and backtracking partially overlap in space, hinting these metacognitive moves are related.
- Confidence-related columns cluster in a specific region and correlate with fewer hesitation tokens (like "Wait"), replacing them with calculation tokens (like numbers). This suggests confidence is not just a final-answer feeling but a shift in the entire reasoning style.
- Clarity peaks before the final layer, aligning with known oversmoothing: the very end may compress distinctions.
🍞 Top Bread (Hook): Think of a dimmer switch that changes not only brightness but also room mood.
🥬 Filling (The Actual Concept):
- What it is: Intervention/steering means adding or removing a component of the activation along a chosen vector to change behavior.
- How it works:
- Pick a behavior vector from the decoder columns.
- Project current activation toward/away from that vector.
- Generate the next token with the adjusted activation.
- Why it matters: Without direct steering, behavior control would require retraining or clumsy prompts.
🍞 Bottom Bread (Anchor): Removing the reflection component made the model skip "Let me check…" sentences and answer more directly, while keeping the same final solution.
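As a small usage example, assuming the make_steering_hook helper sketched in the Methodology section, one could sweep the steering strength and count hesitation markers in the output. Counting "Wait"/"Alternatively" is only a rough proxy for the paper's LLM-as-judge behavior counts.

```python
# Assumes make_steering_hook from the Methodology sketch, plus a loaded model/tokenizer.
def count_hesitation_markers(text: str) -> int:
    """Rough proxy for reflection/backtracking: count common hesitation cues."""
    return text.count("Wait") + text.count("Alternatively")

def sweep_alpha(model, tokenizer, question, reflection_vec, layer_module,
                alphas=(-1.5, 0.0, 1.5)):
    results = {}
    for alpha in alphas:
        handle = layer_module.register_forward_hook(make_steering_hook(reflection_vec, alpha))
        inputs = tokenizer(question, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
        handle.remove()
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        results[alpha] = count_hesitation_markers(text)
    return results   # expect fewer markers at alpha < 0, more at alpha > 0
```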
05 Discussion & Limitations
Limitations:
- Behavior coverage: SAEs capture many behaviors but may still miss rare or subtle ones, especially if training data under-represents them.
- Sensitivity: Which layer you train on, how you choose sparsity, and which columns you average can change results. Mixed columns must be filtered.
- Tradeoffs: Confidence steering can reduce helpful reflection in hard problems; small accuracy dips can appear if overused.
- Measurement noise: LLM-as-judge and keyword labels are proxies; they're consistent but not perfect.
- Access needs: You must hook into model activations; closed models may not allow this.
Required Resources:
- Compute to run the base model and extract step-level activations on a few hundred to a few thousand CoT traces.
- GPU memory for training an SAE (e.g., D=2048) and for UMAP/Silhouette analysis.
- Light optimization to find entropy/confidence vectors.
When NOT to Use:
- Ultra-short tasks where CoT isn't needed; steering overhead won't help.
- High-stakes settings that forbid any test-time internal edits (policy/regulatory constraints).
- Models without activation access or that are extremely compressed where layer signals are weak.
Open Questions:
- How stable are reasoning vectors across model updates and scales?
- Can we automatically de-duplicate mixed columns and map a richer behavior atlas (planning, hypothesis testing, error correction)?
- How do behavior vectors interact with truthfulness and safety: can confidence vectors sometimes over-assert wrong answers?
- Can similar unsupervised maps help align RL-finetuned "thinking" models and base models more reliably than supervised transfer?
- Is there an optimal multi-behavior controller (a small policy) that picks steering strengths per step to maximize accuracy and minimize tokens?
06 Conclusion & Future Work
Three-Sentence Summary: RISE trains a sparse auto-encoder on sentence-level thought activations to discover reasoning vectors: directions in activation space that align with clear thinking behaviors. These vectors cluster in mid-to-late layers and can be used at test time to dial behaviors like reflection, backtracking, and confidence up or down without retraining. The approach generalizes across domains, reveals structure like response length, and enables on-the-fly improvements in accuracy and efficiency.
Main Achievement: Showing that diverse reasoning behaviors live as disentangled, steerable directions learned unsupervised in the SAE decoder space, and that small, targeted activation edits can causally reshape the model's reasoning trajectory.
Future Directions:
- Build a richer behavior atlas (planning, decomposition, verification, hypothesis testing) and automate column selection.
- Design adaptive controllers that choose steering strengths per step and per task.
- Study how behavior steering interacts with truthfulness, safety, and calibration.
- Extend beyond language to multimodal reasoning (code, math, diagrams).
Why Remember This: It turns "AI thinking" from a black box into a control panel. By discovering and steering the hidden knobs of reasoning, we gain a practical, label-free way to make AI think better, faster, and safer today, not just in theory.
Practical Applications
- Speed up customer support bots by reducing over-reflection on easy questions to answer faster.
- Increase self-checking in medical or legal drafting to reduce risky mistakes before finalization.
- Cut cloud costs by steering away from wasteful long chains when confidence is already high.
- Improve tutoring systems by turning up reflection so the AI shows careful checks students can learn from.
- Debug model failures by identifying which behavior vector (e.g., backtracking) was underused or overused.
- Build adaptive agents that adjust behavior per step: explore more at first, verify more near the end.
- Transfer learned behavior vectors from one domain (math) to another (logic) to bootstrap better reasoning.
- Create safer defaults by gently boosting verification behaviors for sensitive tasks.
- Personalize reasoning style (concise vs. thorough) for different users or time budgets.
- Develop curriculum learning tools that introduce one behavior vector at a time to train specific habits.