
Reasoning Models Generate Societies of Thought

Intermediate
Junsol Kim, Shiyang Lai, Nino Scherrer et al. · 1/15/2026
arXiv · PDF

Key Summary

  • The paper shows that top reasoning AIs don’t just think longer—they act like a tiny team inside their heads, with different voices that ask, disagree, and then agree.
  • This inner 'society of thought' appears more in reasoning-trained models (like DeepSeek-R1 and QwQ-32B) than in regular instruction-tuned models.
  • The models’ conversations include question-and-answer turns, perspective shifts, conflicts, and reconciliations, and these behaviors rise on harder problems.
  • Using a microscope-like tool (sparse autoencoders), the authors find a 'surprise' feature (like saying 'Oh!') that, when gently boosted, more than doubles puzzle accuracy (27.1% to 54.8%).
  • Boosting conversational features also turns on more useful thinking habits: verification, backtracking, subgoals, and working backwards.
  • The inner voices aren’t all the same: they differ in personality (like more or less agreeable) and expertise (like physics or finance), and this diversity links to better results.
  • In reinforcement learning, even when only accuracy is rewarded, models gradually start using conversational behaviors on their own.
  • If you first fine-tune a model to solve problems as a dialogue among personas, it learns to reason faster than if you fine-tune it to think as a single monologue.
  • These findings suggest engineering AIs as coordinated teams of diverse inner agents can improve reasoning beyond just making chains of thought longer.

Why This Research Matters

When AIs reason about math, science, health, finance, or policy, we want them to catch mistakes and consider alternatives—not just sound confident. This work shows that simulating a small, well-organized team inside one model can boost accuracy, especially on hard problems. It gives builders practical levers—like steerable conversational features and dialogue-style fine-tuning—to improve results without simply making outputs longer. It suggests new, safer designs for agent systems that coordinate diverse skills and viewpoints. And it points to faster training: teach models to converse internally, and they learn good reasoning habits sooner. Ultimately, this can make AI assistants more trustworthy for classrooms, coding, research, and decision support.

Detailed Explanation

01 Background & Problem Definition

You know how a group of classmates can solve a tough puzzle better than one student working alone—someone asks a question, another spots a mistake, a third brings a new idea? That’s been true for people, and now we’re discovering something similar in AI.

🍞 Top Bread (Hook): Imagine you’re doing a tricky math puzzle. If you only follow your first idea, you may get stuck. But if you play both 'the checker' and 'the challenger' in your head, you often get it right.

🥬 Filling (The Actual Concept — Reinforcement Learning): Reinforcement learning (RL) is a way for AIs to learn by getting rewarded when they do well and not rewarded when they don’t. How it works:

  1. The AI tries an answer.
  2. If it’s correct (and neatly formatted), it gets points.
  3. It tries different ways next time to get more points.

Why it matters: Without RL, the AI doesn’t know which behaviors actually lead to better answers.

🍞 Bottom Bread (Anchor): Like a video game: you try moves, see what scores, and keep the moves that win.
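
To make the loop concrete, here is a minimal Python sketch of the reward idea (a toy illustration, not the paper's actual training code): a pretend solver picks among a few strategies, earns a point only when the answer comes out correct, and slowly shifts its preference toward whatever scores.

```python
import random

# Toy reward loop: try, score, and keep what wins (illustrative only).
strategies = {"guess": 0.0, "check_work": 0.0, "work_backwards": 0.0}  # learned preference per strategy

def solve(problem, strategy):
    # Hypothetical solver: strategies that self-check succeed more often.
    success_rate = {"guess": 0.3, "check_work": 0.7, "work_backwards": 0.6}[strategy]
    return random.random() < success_rate

def pick_strategy():
    # Mostly exploit the best-scoring strategy, occasionally explore others.
    if random.random() < 0.1:
        return random.choice(list(strategies))
    return max(strategies, key=strategies.get)

for step in range(1000):
    s = pick_strategy()
    reward = 1.0 if solve("toy problem", s) else 0.0   # points only for a correct answer
    strategies[s] += 0.1 * (reward - strategies[s])    # nudge the preference toward observed reward

print(strategies)  # preferences drift toward the strategies that earn more reward
```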

The world before: Large language models (LLMs) were already great at writing and recalling facts. But reliable step-by-step reasoning—like multi-step math, science logic, or verifying tricky claims—was hit-or-miss. A common trick was to ask models to write longer 'chains of thought.' That helped some, but longer wasn’t always smarter. Sometimes models rambled without checking themselves.

The problem: What exactly makes the newer 'reasoning' models (like DeepSeek-R1 and QwQ-32B) more accurate? People guessed it was just more thinking time. But was there a deeper behavior hiding inside those long thoughts?

🍞 Top Bread (Hook): You know how detectives in stories talk through theories—'What if it’s this?' 'Wait, that can’t be right.'—until they agree? That back-and-forth is powerful.

🥬 Filling (The Actual Concept — Reasoning Reinforcement Learning): Reasoning RL is training that specifically rewards correct reasoning steps and answers, encouraging the model to 'think before speaking.' How it works:

  1. The AI writes out its reasoning.
  2. It’s rewarded for correct results (and good structure), not for sounding confident.
  3. Over time, it prefers habits that lead to right answers.

Why it matters: Without it, models may sound smooth but miss key checks.

🍞 Bottom Bread (Anchor): It’s like practicing math with feedback on both your steps and final answer, not just getting a gold star for neat handwriting.
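
In code, the spirit of such a reward might look like this sketch (the think-then-answer formatting convention is a common pattern assumed here, not the authors' implementation): points come from a correct final answer and a well-formed structure, never from confident wording.

```python
import re

def reasoning_reward(trace: str, expected_answer: str) -> float:
    """Toy reasoning-RL reward: correctness plus format, nothing for confidence."""
    reward = 0.0
    # Structure bonus: the model actually wrote out its reasoning first.
    if re.search(r"<think>.*?</think>", trace, flags=re.DOTALL):
        reward += 0.1
    # Accuracy bonus: the final answer matches the reference.
    match = re.search(r"Answer:\s*(.+)$", trace.strip(), flags=re.MULTILINE)
    if match and match.group(1).strip() == expected_answer.strip():
        reward += 1.0
    return reward

# Example: a trace that reasons, checks itself, and lands on the right answer.
trace = "<think>12*3=36, 36-7=29. Wait, let me verify: yes, 29.</think>\nAnswer: 29"
print(reasoning_reward(trace, "29"))  # 1.1
```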

Failed attempts: Just stretching the chain of thought sometimes made models verbose, not accurate. Other approaches forced 'debate' by running multiple separate AIs, but that’s slow and complicated.

The gap: We lacked a clean explanation for why reinforcement-trained models succeeded. Was it quantity (longer) or quality (a smarter structure of thought)?

🍞 Top Bread (Hook): Think of a school project team—each kid brings a different strength: the planner, the idea-generator, the skeptic.

🥬 Filling (The Actual Concept — Multi-Agent Interaction): Multi-agent interaction is when multiple 'voices' or agents contribute ideas to solve a task. How it works:

  1. Different roles propose and test ideas.
  2. They question each other.
  3. They settle on a stronger plan together.

Why it matters: Without multiple views, the team can miss mistakes or better paths.

🍞 Bottom Bread (Anchor): Like a robotics team where one designs, one builds, one tests—together they win.
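
Here is a rough sketch of the explicit, multi-model version of this pattern, where several separately prompted roles share one transcript. The role prompts are invented for illustration, and `call_llm` is a placeholder for whatever model client you would actually use.

```python
# Explicit multi-agent interaction: named roles take turns over a shared transcript.
ROLES = {
    "Proposer": "Suggest a concrete solution approach.",
    "Skeptic": "Question the latest proposal and point out possible mistakes.",
    "Reconciler": "Combine the strongest points so far into one plan.",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real model client here")

def society_round(problem: str, rounds: int = 2) -> str:
    transcript = f"Problem: {problem}\n"
    for _ in range(rounds):
        for role, instruction in ROLES.items():
            reply = call_llm(f"{transcript}\nYou are the {role}. {instruction}")
            transcript += f"\n{role}: {reply}"
    # A final reconciliation turn produces the answer the group agrees on.
    return call_llm(f"{transcript}\nReconciler: state the final agreed answer only.")
```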

This paper’s big suggestion: inside strong reasoning models, there’s an implicit 'society of thought'—not literally many models, but one model simulating a conversation among different perspectives.

🍞 Top Bread (Hook): Picture your inner voices: 'The Checker' says, 'Wait—did we add wrong?' 'The Explorer' says, 'Try a new route!'

🥬 Filling (The Actual Concept — Society of Thought): A society of thought is an internal, simulated conversation among diverse perspectives to explore, challenge, and refine ideas. How it works:

  1. The model poses questions to itself.
  2. It shifts perspectives to try alternatives.
  3. It surfaces conflicts and then reconciles them.

Why it matters: Without this social-like structure, the model tends to monologue and miss errors.

🍞 Bottom Bread (Anchor): When asked a hard science question, the model might first propose a path, then interrupt itself with 'But wait…' and finally combine the best parts.

Real stakes: In everyday life, we need AIs to be careful—double-check math, spot misinformation, debug code, and reason about medical or financial options. A model that only 'sounds good' can lead to wrong answers. A model that simulates a healthy inner debate is more likely to catch mistakes and choose wisely.

02 Core Idea

The 'Aha!' in one sentence: Better reasoning doesn’t just come from thinking longer—it comes from thinking socially inside the model, where diverse inner voices question, disagree, and then agree.

Three analogies:

  1. Classroom panel: A math whiz, a skeptic, and a creative thinker talk through a problem. The AI simulates that panel inside one brain.
  2. Sports team: Offense suggests plays, defense points out risks, the coach reconciles to pick the best move.
  3. Cooking show: One chef experiments, one taste-tests and says 'Hmm, too salty,' and they adjust the recipe together.

🍞 Top Bread (Hook): You know how a friend’s 'Wait!' can save you from a mistake?

🥬 Filling (The Actual Concept — Diversity of Perspectives): Diversity of perspectives means the model uses different 'voices' with varied personalities and expertise. How it works:

  1. The model surfaces multiple angles (e.g., analyst, verifier, creative ideator).
  2. It compares their proposals.
  3. It keeps the best parts and discards the rest.

Why it matters: Without diversity, the AI can echo its first idea and miss better solutions.

🍞 Bottom Bread (Anchor): On a geometry problem, one voice recalls a theorem, another checks units, a third tries a new construction—together they reach the right proof.

Before vs after:

  • Before: Focus on 'longer chains of thought'—more words, not always more wisdom.
  • After: Focus on 'richer inner conversation'—questions, perspective shifts, conflicts, and reconciliations that correlate with higher accuracy, especially on difficult problems.

Why it works (intuition, not equations):

  • Exploration: Multiple voices cover more of the solution space (fewer blind spots).
  • Error-catching: Disagreement and verification catch slips.
  • Coordination: Reconciliation blends strong ideas into one better plan.
  • Pressure test: A conversational 'surprise' moment (like 'Oh!') often marks a perspective shift that opens new routes.

🍞 Top Bread (Hook): Imagine wearing goggles that reveal which gears spin inside a robot’s head.

🥬 Filling (The Actual Concept — Mechanistic Interpretability): Mechanistic interpretability tries to see how the model’s internal parts represent behaviors and ideas. How it works:

  1. Use a tool (sparse autoencoder) to find 'features'—like little switches for styles or behaviors.
  2. Identify a 'conversational surprise' feature (like 'Oh!').
  3. Gently nudge that feature during thinking.

Why it matters: Without this view, we can’t test if conversation-like features truly improve reasoning.

🍞 Bottom Bread (Anchor): When the 'Oh!' feature is boosted, accuracy on a step-by-step math game nearly doubles.
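
For intuition, here is a minimal sparse-autoencoder sketch in PyTorch; the layer width, feature count, and L1 penalty weight are illustrative placeholders, not the paper's settings. It learns an overcomplete dictionary of features from hidden activations, and individual feature directions can later be inspected or nudged.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps model activations to many sparse features and back."""
    def __init__(self, d_model: int = 768, d_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_weight: float = 1e-3):
    # Reconstruct the activations faithfully while keeping few features active.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = torch.mean(torch.abs(features))
    return mse + l1_weight * sparsity

# Usage: collect hidden states from a middle layer, then train the SAE on them.
sae = SparseAutoencoder()
batch = torch.randn(32, 768)                  # stand-in for real hidden activations
features, recon = sae(batch)
print(sae_loss(batch, features, recon).item())
```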

Building blocks of the idea:

  • Detect conversation-like behaviors: question→answer, perspective shifts, conflicts, reconciliation.
  • Measure socio-emotional roles: asking vs giving info, and positive vs negative tones (like agreement or disagreement).
  • Map inner diversity: identify different 'voices' with personalities (Big Five) and expertise areas.
  • Causal test with steering: boost the 'surprise' feature; watch accuracy and good habits rise.
  • Learning dynamics: even when only accuracy is rewarded, conversation-like behaviors emerge over time; pre-training with dialogues speeds everything up.

Net effect: The model’s inner conversation acts like a small, well-run team. That social structure—more than just length—drives better reasoning.

03 Methodology

At a high level: Problem → Generate model’s reasoning trace → Detect conversational and cognitive behaviors → Probe internal features with sparse autoencoders → Causally steer a conversational feature → Measure accuracy and behaviors → Analyze diversity of inner voices → Reinforcement-learn with only accuracy rewards → Compare with conversation- vs monologue-fine-tuning.

Step 1: Generate reasoning traces on tough benchmarks

  • What happens: The team collects 8,262 problems from BigBench Hard, GPQA, MATH (Hard), MMLU-Pro, MUSR, and IFEval. They ask reasoning-trained models (DeepSeek-R1, QwQ-32B) and regular instruction-tuned models to solve each and show their working.
  • Why this step: You need many examples to fairly compare behaviors and accuracy across models and tasks.
  • Example: On a graduate-level physics question, a reasoning model writes a back-and-forth trace with 'Wait… but if…', while a non-reasoning model writes a straight monologue.

🍞 Top Bread (Hook): Like a referee who watches a game and marks every pass and tackle.

🥬 Filling (The Actual Concept — LLM-as-Judge): An LLM-as-judge is a model used to label behaviors in another model’s text. How it works:

  1. Feed it the reasoning trace.
  2. Ask it to count behaviors (Q&A, perspective shifts, conflicts, reconciliation) and roles (ask/give, positive/negative).
  3. Use fixed rules and prompts to keep it consistent.

Why it matters: Without a careful rater, we can’t measure conversation-like behavior reliably.

🍞 Bottom Bread (Anchor): The judge reads 'Why couldn’t we…?' followed by 'Because…' and logs that as a Q&A turn.
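
A hedged sketch of how such a judge can be wired up (the paper's rubric and label set are more detailed than this): one fixed prompt, any model client behind a placeholder `call_judge`, and JSON counts parsed back out.

```python
import json

JUDGE_RUBRIC = """You are rating a reasoning trace. Count how many times each behavior occurs:
- qa: a question immediately followed by an answer
- perspective_shift: the text switches to a different viewpoint or approach
- conflict: an idea is challenged or contradicted
- reconciliation: conflicting ideas are merged or resolved
Return JSON like {"qa": 0, "perspective_shift": 0, "conflict": 0, "reconciliation": 0}."""

def call_judge(prompt: str) -> str:
    raise NotImplementedError("plug in a judge model client here")

def rate_trace(trace: str) -> dict:
    raw = call_judge(f"{JUDGE_RUBRIC}\n\nTrace:\n{trace}")
    counts = json.loads(raw)   # keep the rubric and the parsing fixed for consistency
    return {k: int(counts.get(k, 0)) for k in ("qa", "perspective_shift", "conflict", "reconciliation")}
```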

Step 2: Measure conversational and socio-emotional behaviors

  • What happens: They track how often the trace includes Q&A, perspective shifts, conflicts, and reconciliations, plus Bales’ social roles (asking vs giving; positive vs negative tone).
  • Why this step: To test if reasoning models really simulate dialogue patterns, not just longer text.
  • Example: On hard math items, reasoning models show more asking, disagreeing, and later reconciling.

Step 3: Connect behaviors to accuracy

  • What happens: They use statistical models controlling for problem difficulty and trace length to see if these behaviors predict higher accuracy.
  • Why this step: To show it’s not just 'longer chains' or 'easier questions' causing the effect.
  • Example: Perspective shifts and verification link to more correct answers.
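
In spirit, the check resembles the regression below. The column names and simulated data are made up for illustration; the point is that the behavior counts enter the model alongside difficulty and trace length as controls.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data frame standing in for per-trace measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "perspective_shifts": rng.poisson(2, 500),
    "conflicts": rng.poisson(1, 500),
    "difficulty": rng.uniform(0, 1, 500),
    "trace_length": rng.integers(200, 4000, 500),
})
# Simulated outcome: behaviors help, difficulty hurts (illustration only).
logit = 0.4 * df.perspective_shifts + 0.3 * df.conflicts - 2.0 * df.difficulty
df["correct"] = rng.random(500) < 1 / (1 + np.exp(-logit))

# Logistic regression: do behaviors still predict accuracy once
# difficulty and trace length are controlled for?
X = sm.add_constant(df[["perspective_shifts", "conflicts", "difficulty", "trace_length"]].astype(float))
model = sm.Logit(df["correct"].astype(int), X).fit(disp=0)
print(model.params)   # positive behavior coefficients = behaviors linked to correctness
```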

🍞 Top Bread (Hook): Imagine opening a radio to find the specific dial that turns up 'surprise'.

🥬 Filling (The Actual Concept — Sparse Autoencoder Feature & Steering): A sparse autoencoder (SAE) finds interpretable 'features' in a model’s hidden activations; steering tweaks a chosen feature during generation. How it works:

  1. Train an SAE on a middle layer to discover many features.
  2. Identify a 'conversational surprise' feature (tokens like 'Oh!') that often occurs in dialogues.
  3. Add a small amount of that feature’s vector while the model thinks.

Why it matters: Without steering, we can’t test if conversation markers cause better reasoning.

🍞 Bottom Bread (Anchor): Nudging the 'Oh!' feature boosted puzzle accuracy from 27.1% to 54.8%.
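
Mechanically, steering typically means adding a scaled copy of a feature's decoder direction to the hidden state at one layer while the model generates. The sketch below shows that pattern with a generic PyTorch forward hook; the layer path, shapes, and strength value are assumptions rather than the paper's exact setup.

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 10.0):
    """Add `strength * direction` to a layer's hidden states (toy feature steering)."""
    direction = direction / direction.norm()   # unit-length feature direction from the SAE decoder

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage sketch (names are illustrative): pick a middle transformer block,
# register the hook, generate, then remove the hook.
# layer = model.model.layers[16]                                   # hypothetical layer path
# handle = layer.register_forward_hook(make_steering_hook(surprise_direction, strength=10.0))
# output = model.generate(**inputs)
# handle.remove()
```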

Step 4: Causal test via steering on a controlled puzzle

  • What happens: They use the Countdown arithmetic game. During the model’s reasoning, they add the 'surprise' feature at a set strength.
  • Why this step: A clean, repeatable testbed to measure cause and effect.
  • Example: With positive steering, the AI asks more internal questions, shifts perspectives more, and reaches the target number more often.

🍞 Top Bread (Hook): Like noticing when your brain says 'Hold on!' and then rechecks the math.

🥬 Filling (The Actual Concept — Cognitive Strategies): Cognitive strategies are smart habits like verification, backtracking, subgoals, and working backwards. How it works:

  1. The model tests if a step really hits the target (verification).
  2. If not, it backs up and tries another path (backtracking).
  3. It sets smaller goals (subgoals) and can reason from the goal to the start (backward chaining).

Why it matters: Without these habits, the model can wander or lock into wrong paths.

🍞 Bottom Bread (Anchor): 'This gives 31, not 29—back up. New plan: aim for 20 first, then add 9.'
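
These habits are easy to recognize in ordinary code. The toy Countdown-style solver below (unrelated to the model's internals, just an illustration) verifies each partial value against the target, sets the subgoal of merging two numbers at a time, and backtracks whenever a branch dead-ends.

```python
from itertools import permutations

def countdown(numbers, target):
    """Tiny Countdown-style solver: verify each partial value, backtrack on dead ends."""
    def search(nums, exprs):
        # Verification: has any value already reached the target?
        for value, text in zip(nums, exprs):
            if abs(value - target) < 1e-9:
                return text
        if len(nums) == 1:
            return None                                # nothing left to combine: give up on this branch
        # Subgoal: combine two numbers into one intermediate value, then keep searching.
        for (i, a), (j, b) in permutations(list(enumerate(nums)), 2):
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            rest_exprs = [e for k, e in enumerate(exprs) if k not in (i, j)]
            options = [(a + b, f"({exprs[i]}+{exprs[j]})"),
                       (a - b, f"({exprs[i]}-{exprs[j]})"),
                       (a * b, f"({exprs[i]}*{exprs[j]})")]
            if abs(b) > 1e-9:
                options.append((a / b, f"({exprs[i]}/{exprs[j]})"))
            for value, text in options:
                found = search(rest + [value], rest_exprs + [text])
                if found:
                    return found
        return None                                    # backtracking: this whole branch failed

    return search(list(numbers), [str(n) for n in numbers])

print(countdown([3, 7, 9, 2], 29))   # prints one expression that evaluates to 29
```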

Step 5: Map inner voices and their diversity

  • What happens: The judge model estimates how many distinct voices appear, assigns each a brief personality profile (Big Five) and expertise tag (like biology, logic), and segments the trace by who 'spoke.'
  • Why this step: To test if reasoning models actually recruit diverse perspectives.
  • Example: One voice is a meticulous verifier (high conscientiousness), another a creative ideator (high openness), and a third a cautious risk-checker (neuroticism cues).
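
A rough data shape for this step might look like the following; the field names and the simple spread-based diversity summary are assumptions for illustration, not the paper's measures.

```python
from dataclasses import dataclass
from statistics import pstdev

@dataclass
class Voice:
    name: str
    expertise: str
    # Big Five trait scores on a 0-1 scale (judge-estimated).
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

def diversity(voices: list[Voice]) -> dict:
    traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
    return {
        "trait_spread": {t: pstdev(getattr(v, t) for v in voices) for t in traits},
        "expertise_areas": len({v.expertise for v in voices}),
    }

panel = [
    Voice("verifier", "logic", 0.4, 0.9, 0.3, 0.6, 0.2),
    Voice("ideator", "physics", 0.9, 0.8, 0.7, 0.5, 0.3),
    Voice("risk-checker", "finance", 0.5, 0.85, 0.4, 0.4, 0.7),
]
print(diversity(panel))   # spread per trait, plus how many expertise areas appear
```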

Step 6: Reinforcement learning (RL) experiments

  • What happens: They train small base models with PPO, rewarding only accuracy and format—no reward for 'being conversational.'
  • Why this step: To see if conversation-like behaviors naturally emerge when accuracy is the only goal.
  • Example: Over training steps, Q&A and conflicts rise; later, the model sometimes uses 'we,' showing persona-like coordination.

Step 7: Conversation vs monologue fine-tuning before RL

  • What happens: They first fine-tune a model on multi-persona dialogues (conversation) or on single-voice traces (monologue), then run the same RL.
  • Why this step: To see if conversational scaffolding helps reasoning emerge faster.
  • Example: Conversation-primed models learn quicker and reach higher accuracy earlier than monologue-primed ones on the same problems.
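
To make the two conditions concrete, here is a sketch of how one solved problem could be packaged both ways before supervised fine-tuning; the persona names and record format are illustrative, not the paper's exact data.

```python
def as_monologue(problem: str, steps: list[str], answer: str) -> dict:
    """One uninterrupted voice thinks through the steps."""
    thinking = " ".join(steps)
    return {"prompt": problem, "completion": f"{thinking} So the answer is {answer}."}

def as_dialogue(problem: str, steps: list[str], answer: str) -> dict:
    """The same steps, rewritten as turns among a small cast of personas."""
    cast = ["Planner", "Skeptic", "Verifier"]
    turns = [f"{cast[i % len(cast)]}: {step}" for i, step in enumerate(steps)]
    turns.append(f"Verifier: Agreed. The answer is {answer}.")
    return {"prompt": problem, "completion": "\n".join(turns)}

steps = ["Let's try 12 * 3 = 36 first.",
         "Wait, 36 overshoots; shouldn't we subtract 7?",
         "36 - 7 = 29, which matches the target."]
print(as_dialogue("Reach 29 using 12, 3, and 7.", steps, "29"))
```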

The secret sauce: Treating reasoning as organized social exploration—then proving it causally by steering a conversational feature—shows that 'teamwork inside the model' is a mechanism, not just a style.

04 Experiments & Results

The test: Measure whether reasoning models show more conversational behaviors, more diverse inner voices, and better accuracy—especially on hard problems—and whether boosting a conversational feature causally helps.

The competition: Reasoning models (DeepSeek-R1-0528, QwQ-32B) versus instruction-tuned peers (DeepSeek-V3-0324, Qwen-2.5-32B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct) across 8,262 problems from BBH, GPQA, MATH (Hard), MMLU-Pro, MUSR, and IFEval.

Scoreboard with context:

  • Conversational behaviors: DeepSeek-R1 and QwQ-32B show much higher rates of question→answer, perspective shifts, conflicts, and reconciliation than instruction-tuned models, even when controlling for trace length and task difficulty. Instruction-tuned models behave like one-sided monologues across sizes (8B to 671B).
  • Socio-emotional roles: Reasoning models both ask and give information; they show both positive and negative roles (agreement and disagreement), with stronger balance (higher Jaccard co-occurrence) than instruction-tuned models. This indicates true dialogue-like give-and-take (a minimal sketch of the Jaccard measure follows this list).
  • Difficulty sensitivity: Conversational behaviors appear more on harder questions (both by LLM-rated difficulty and by higher error rates from non-reasoning models), suggesting the model recruits inner dialogue when it needs it most.
  • Causal steering on Countdown: Adding a 'conversational surprise' feature (akin to an 'Oh!' pivot) raised accuracy from 27.1% (no steering) to 54.8% (+10 strength); negative steering reduced it to 23.8%. It also increased all four conversational behaviors and the four cognitive strategies (verification, backtracking, subgoals, backward chaining). Steering random conversational features helped somewhat; the surprise feature helped the most.
  • Diversity of voices: Reasoning models show greater personality diversity (notably in agreeableness, neuroticism, openness, and extraversion) and greater expertise diversity than instruction-tuned models, even when controlling for how many voices are present. Diversity in conscientiousness was lower (voices stayed diligent), which can actually help team coordination.
  • RL emergence: In accuracy-only RL, models spontaneously increased conversation-like patterns over time—Q&A and conflicts rose; perspective shifts increased then later declined as solutions became more direct. Two distinct personas often emerged mid-training (e.g., methodical solver plus exploratory tester). Cognitive strategies rose alongside.
  • Conversation priming helps: Models first fine-tuned on multi-agent dialogues learned faster than monologue-fine-tuned or baseline models, especially early in training and across architectures, and even transferred some benefits to a different task (misinformation detection).
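
For reference, the Jaccard co-occurrence mentioned in the socio-emotional bullet is just set overlap. A minimal version, assuming the sets hold the traces in which each role appears (the paper's exact unit of analysis may differ):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard co-occurrence: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Traces (by id) in which the model asks for information vs gives it.
asking_traces = {"t1", "t2", "t5", "t8"}
giving_traces = {"t1", "t2", "t3", "t5", "t8"}
print(jaccard(asking_traces, giving_traces))   # closer to 1 = both roles show up together
```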

Surprising findings:

  • A tiny, targeted nudge to a conversational 'surprise' feature dramatically improved arithmetic puzzle accuracy, showing conversation markers can be causal drivers, not just decorations.
  • Diversity rose mostly in socially oriented traits (extraversion, neuroticism) rather than task diligence (conscientiousness), echoing findings from human team science that balancing social diversity with consistent effort can improve group performance.
  • Even when no reward directly mentions 'conversation,' conversation-like behaviors still emerge if they help reach correct answers—suggesting social structure is a natural solution strategy.

05 Discussion & Limitations

Limitations:

  • Measurement depends on LLM-as-judge labels (though validated against humans and debate corpora). Some subtle behaviors or sarcasm might be misread.
  • Steering used one distilled model layer and one standout feature; other layers/features may behave differently, and too-strong steering can hurt.
  • Tasks include math/logic/science plus misinformation, but not all real-world domains (e.g., law or medicine) where stakes and formats differ.
  • RL experiments use smaller 3B models for practicality; results may vary in larger or differently trained systems.
  • Simulated voices are behaviors inside one model; they’re not separate minds. We infer 'personas' from text and activations, which is a useful but indirect lens.

Required resources:

  • Access to strong reasoning models (or their distilled versions), compute for generating long traces, sparse autoencoder tooling, and PPO/Verl for RL.
  • An evaluation pipeline: LLM-as-judge prompts, baselines for difficulty, and datasets covering various reasoning types.

When not to use:

  • Highly time-critical settings where long inner debates are too slow.
  • Tasks where a single known algorithm is best (no need for exploration or disagreement).
  • Low-stakes tasks where concision matters more than marginal accuracy gains.

Open questions:

  • What is the best 'team structure' inside a model—how many voices, which roles, and how to coordinate them dynamically by difficulty?
  • Can we design safer 'polite conflicts' that maximize error-catching without getting stuck in endless debate?
  • How do different layers and features encode roles, emotions, and expertise—and can we build libraries of controllable 'social features'?
  • What are the trade-offs between diversity (exploration) and efficiency (fast convergence), and how should rewards balance them?
  • How well does conversation-primed reasoning transfer to high-stakes, high-knowledge domains?

06 Conclusion & Future Work

Three-sentence summary: This paper shows that strong reasoning models don’t just think longer; they simulate inner conversations—a 'society of thought'—with different voices that ask, disagree, and reconcile. Using interpretability tools, the authors identify a conversational 'surprise' feature whose gentle boost more than doubles puzzle accuracy, and they show that conversation-like behaviors and diversity of perspectives rise with problem difficulty and link to better results. Reinforcement learning experiments reveal that even when only accuracy is rewarded, conversational structure naturally emerges, and pre-training with dialogues speeds up learning.

Main achievement: Proving that social organization inside a single model—diverse inner voices coordinated through conversation-like behaviors—is a mechanism for better reasoning, supported by both behavioral analyses and causal feature steering.

Future directions:

  • Engineer inner teams: design role toolkits (verifier, explorer, planner) and dynamic schedulers that activate them by difficulty.
  • Build safer, faster debates: adaptive turn-taking and early reconciliation policies.
  • Expand interpretable controls: catalogs of steerable features for perspectives, emotions, and expertise.
  • Broaden tasks and scales: apply to larger models and high-stakes domains, measuring cost–benefit of inner social reasoning.

Why remember this: It reframes progress in AI reasoning from 'more tokens' to 'better teamwork inside'—showing that coordinated diversity, not just length, is key. Just like human teams, the smartest AIs may think best when many well-organized voices are heard.

Practical Applications

  • Pre-train or fine-tune assistants to reason as a dialogue among defined personas (verifier, explorer, planner) before applying RL for accuracy.
  • Use steerable conversational features (like 'surprise' markers) at test time to trigger perspective shifts on difficult questions.
  • Detect problem difficulty and adaptively activate more inner debate (e.g., more Q&A and conflict) when tasks are hard.
  • Instrument models with LLM-as-judge monitors to ensure healthy balances of asking/giving and positive/negative roles during reasoning.
  • Encourage cognitive strategies with prompts that nudge verification, backtracking, subgoals, and backward chaining.
  • Design multi-agent workflows where a single model simulates multiple roles rather than spinning up many separate models, saving compute.
  • Build dashboards that visualize inner-voice diversity (personality/expertise) to diagnose echo chambers or missing skills.
  • Create domain-specific personas (e.g., 'unit-checker' for physics, 'edge-case tester' for code) and schedule them when certain cues appear.
  • Apply conversational scaffolding in low-resource settings (small models) to accelerate reasoning improvement during RL.
  • Develop safety guardrails where a 'risk reviewer' persona must sign off before finalizing high-stakes answers.
Tags: society of thought, reasoning reinforcement learning, conversational behaviors, multi-agent interaction, mechanistic interpretability, sparse autoencoders, feature steering, cognitive strategies, persona diversity, LLM-as-judge, DeepSeek-R1, QwQ-32B, Countdown reasoning, collective intelligence, dialogue scaffolding