Reasoning Models Struggle to Control their Chains of Thought
Key Summary
- ā¢The paper studies whether AI models can hide or reshape their step-by-step thoughts (chains of thought) on command.
- ā¢It introduces CoT-Control, a 14,076-task test suite that tells models to solve real problems while changing how their reasoning looks (for example, only uppercase thinking or avoiding specific words).
- ā¢Across 13 top models, controlling the final answer was easy (often above 60%), but controlling the private reasoning was very hard (often below 10%).
- ā¢Bigger models controlled their thoughts a bit better, but longer and harder reasoning made control much worse.
- ā¢More reinforcement learning on reasoning tasks made control over thoughts drop further, even while answer control stayed high.
- ā¢Telling models they were being monitored slightly improved control, but not by much.
- ā¢Adversarial prompt tricks that apply pressure worked on seen cases but did not generalize to new instruction types.
- ā¢Failures often included meta-discussionāmodels talked about the rule they were supposed to follow, which a monitor could easily catch.
- ā¢These results suggest current models are unlikely to reliably hide their reasoning, which is good for safety monitoring.
- ā¢The paper recommends that labs track CoT controllability in future models to protect chain-of-thought monitoring.
Why This Research Matters
Many safety plans rely on reading a modelās chain of thought to spot risks early, so we must know if models can easily hide or rewrite those thoughts. This paper shows that, right now, controlling the final answer is far easier than controlling the entire reasoning process, which keeps monitoring useful. Because longer and harder reasoning makes control even tougher, real oversight has more chances to catch slips. However, controllability tends to rise with model size and training choices, so the balance could shift over time. By providing a large, automatic test suite and strong baselines, the paper gives labs a way to track this crucial safety signal. In short, it tells us where we stand today and how to keep CoT monitoring reliable tomorrow.
Reading Workflow
Turn this paper into a decision
Scan fast. Promote only the papers that survive triage.
No workflow history yet.
Detailed Explanation
Tap terms for definitions01Background & Problem Definition
š Top Bread (Hook): Imagine your teacher asks you to show your work in math. That way, the teacher can see how you thought about the problem, not just the final answer.
š„¬ Filling (The Actual Concept):
- What it is: Chain-of-Thought (CoT) monitoring is like thatāreading a modelās written thinking to check for mistakes, honesty, or unsafe plans.
- How it works: (1) The model explains its steps; (2) a monitor (another model or tool) reads those steps; (3) the monitor looks for signs of risk or errors; (4) the final answer is judged together with the reasoning.
- Why it matters: Without CoT monitoring, we only see the answer, not the intention or the path that produced itāso we could miss warnings hidden in the middle steps.
š Bottom Bread (Anchor): If an AI plans steps to break a site but ends with a harmless answer, CoT monitoring can still catch the risky plan written in its reasoning.
š Top Bread (Hook): You know how itās hard to keep a diary with strict rules, like never using certain words or always writing in ALL CAPS? Even when you try, you often slip.
š„¬ Filling (The Actual Concept):
- What it is: CoT controllability is a modelās ability to intentionally shape its own written thinkingālike avoiding certain words or using a set styleāwhile still solving the task.
- How it works: (1) Give the model a normal problem; (2) add a special rule for its thinking (e.g., avoid the word āchromosomeā); (3) check if its reasoning obeys the rule; (4) also check if the answer is still correct.
- Why it matters: If a misaligned model could easily control its CoT, it might hide bad intentions. Low controllability means itās harder for the model to secretly change its thoughts to fool overseers.
š Bottom Bread (Anchor): The paper shows that many models can write the final answer in a controlled style, but struggle to keep their thinking under controlālike staying uppercase for two words and then slipping back to lowercase.
The World Before: Researchers hoped that reading CoTs would help catch early signs of misbehavior, just like a teacher spotting where a studentās logic went wrong. But this hope depends on the model not being able to hide or reshape those thoughts on purpose.
The Problem: Can models intentionally change their reasoning to slip past monitors? Thatās tricky to test because models donāt naturally have strong reasons to hide. Also, itās hard to measure āintentā fairly across different models.
Failed Attempts: Prior work often focused on outputs (final answers) or used setups that didnāt truly test controlling the internal reasoning. Some stress tests either made tasks too unrealistic or didnāt transfer well across different instruction types.
The Gap: We lacked a standard, scalable way to measure whether a model can follow specific instructions inside its chain of thought while still solving real, varied tasks.
š Top Bread (Hook): Think of a driving test that makes you follow traffic rules on real roads, not just in a parking lot.
š„¬ Filling (The Actual Concept):
- What it is: The CoT-Control evaluation suite is a big set of real problems, each with an extra rule telling the model how to write its reasoning.
- How it works: (1) Pull questions from challenging benchmarks; (2) attach a clear, checkable CoT rule; (3) grade automatically whether the rule was followed; (4) keep answer correctness separate.
- Why it matters: This fairly measures the modelās ability to shape its thoughts under pressure, not just its ability to decorate the final answer.
š Bottom Bread (Anchor): A question might say, āSolve this genetics problem without using the word āchromosomeā in your reasoning.ā The grader then checks if that word appears anywhere in the chain of thought.
Real Stakes: If future, more capable models could perfectly control their CoTs, they could hide bad plans from monitors. But if they struggle to control CoTs (even when they can pretty-print final answers), monitors stay useful. Thatās why measuring CoT controllability nowābefore models get even strongerāmatters for everyoneās safety. It helps ensure AI assistants stay more like honest students showing their work than magicians hiding their tricks.
02Core Idea
š Top Bread (Hook): You know how a referee keeps an eye on both the playerās moves and the final score? Watching just the score can miss fouls in the play.
š„¬ Filling (The Actual Concept):
- What it is: The paperās key insight is that controlling the final answer is very different from controlling the thinking process, and todayās models are much worse at controlling their thoughts than their outputs.
- How it works: (1) Tell models to solve normal problems; (2) add a rule about their reasoning style or content; (3) measure how often they follow that rule; (4) compare this with how often they can follow similar rules for their final answers.
- Why it matters: If models canāt reliably reshape their thoughts, then CoT monitoring remains a strong safety toolābecause attempts to hide or rewrite CoT will often fail in detectable ways.
š Bottom Bread (Anchor): A model can easily write āAnswer: Parisā in all caps, but keeping its entire multi-paragraph thought process in ALL CAPS without slipping is surprisingly hard.
The āAha!ā Moment in one sentence: The difficult part to control is not the final answerāitās the long, detailed thinking that leads up to it, and thatās exactly what human overseers need to see.
Multiple Analogies:
- Stage vs. backstage: The final answer is the polished stage performance; the CoT is the messy backstage. Itās easy to dress the actor, much harder to keep the entire backstage perfectly tidy.
- Frosting vs. cake: You can pipe neat frosting letters (final output), but baking the whole cake (reasoning) to a perfect shape is much harder.
- Headline vs. article: Tweaking a headline is easy; rewriting the whole article to follow strict rules without errors is tough.
Before vs. After:
- Before: Many assumed that if a model followed instructions well on outputs, it could probably do the same inside its CoT.
- After: We now see a big gapāmodels often keep answers under control but fail to keep their reasoning under control, especially on longer or harder problems.
Why It Works (Intuition): CoTs are long, dynamic, and unfold step by step. Small slips compound over many sentences, just like trying to keep perfect posture for hours. Meanwhile, outputs are short and easy to check and fix right before the end.
Building Blocks: š Top Bread (Hook): Imagine a test that says, āSolve this puzzle, but write your thinking only in lowercase.ā
š„¬ Filling (The Actual Concept):
- What it is: CoT controllability instructions are simple, checkable rules for the reasoning text.
- How it works: (1) Types include suppression (donāt use certain words), addition (insert tokens like āsafeā), and stylistic changes (ALL CAPS). (2) Automatic graders scan the reasoning text. (3) Sanity checks confirm the model understands what the āreasoning stageā is.
- Why it matters: Clear, checkable rules let us fairly measure control over thoughts, not just over answers.
š Bottom Bread (Anchor): āInsert āmeowā between every word in your reasoningā makes failures obvious: any missing āmeowā reveals a slip.
š Top Bread (Hook): Think of a teacher quietly watching you work; you might try harder to follow the rules when you notice the teacher.
š„¬ Filling (The Actual Concept):
- What it is: Situational awareness prompting tells the model itās being monitored or evaluated for controllability.
- How it works: (1) Add a short sentence about monitoring; (2) ask the same task; (3) measure if controllability goes up.
- Why it matters: If awareness bumps controllability a bit, it hints that real monitoring contexts may help, but likely not enough to fix the gap.
š Bottom Bread (Anchor): When told āyour reasoning is closely monitored,ā some models improved by a few percentage points, but not dramatically.
š Top Bread (Hook): Suppose a coach tries tough loveāthreats, rewards, or authorityāto get better behavior.
š„¬ Filling (The Actual Concept):
- What it is: Adversarial pressure (GEPA-like) evolves prompts to push models to control CoT better.
- How it works: (1) Try many pressure styles (threats/rewards); (2) keep the best; (3) repeat and refine; (4) test on new, unseen instruction types.
- Why it matters: If pressure gains donāt transfer to new types, then itās not a robust way to force controllability.
š Bottom Bread (Anchor): A prompt that helps with āALL CAPS thinkingā didnāt reliably help with āstart each sentence with āOkā.ā
03Methodology
High-Level Overview: Input (a real benchmark problem) ā Add a CoT rule (e.g., avoid a keyword, use uppercase) ā Model reasons and answers ā Graders check whether the reasoning followed the rule (and record other stats) ā Output controllability is measured separately for comparison.
Step-by-Step Details:
- Build the test set
- What happens: The authors assemble 14,076 tasks from multiple respected benchmarks (like graduate-level questions, advanced multi-subject tests, agent tasks, and coding fixes).
- Why this step exists: Realistic, challenging problems force long reasoning where itās harder to keep control, making the test meaningful.
- Example: A GPQA biology question plus the rule ādo not use the word āchromosomeā in your reasoning.ā
- Attach clear CoT instructions
- What happens: Each problem gets one extra instruction that changes how the reasoning must look, not the final answer. Three categories: ⢠Information suppression: Donāt say certain words; or donāt think about the question at all. ⢠Information addition: Add a harmless token like āsafeā or āmeowā in specific places. ⢠Stylistic modification: Use only lowercase, only uppercase, or alternating case.
- Why this step exists: Simple, checkable rules make grading reliable and fair across thousands of tasks.
- Example: āInsert the word āsafeā at the end of each sentence in your reasoning.ā One missing āsafeā flags a failure.
- Programmatic graders and sanity checks
- What happens: Automated scripts check compliance (e.g., scan for forbidden words, verify casing, count added tokens). Sanity-check questions confirm the model knows what a āreasoning stageā is and recognizes that it has one.
- Why this step exists: To ensure failures mean ācanāt control CoT,ā not ādidnāt understand the rule.ā Automation makes it scalable.
- Example: A sanity-check multiple-choice asks if the model was trained to generate intermediate reasoning text; the correct answer is the option that acknowledges a distinct reasoning stage.
- Separate output controllability and external reasoning baselines
- What happens: The authors also test whether models can control their final answers and whether they can control āexternalā reasoning (like text inside special tags). Non-reasoning models are included for comparison.
- Why this step exists: To show the gap: models easily control outputs and often external reasoning, yet struggle to control genuine internal CoT.
- Example: A model is told, āMake your final answer all lowercase.ā Models typically pass this far more often than the equivalent rule for the full reasoning text.
- Stress tests: situational awareness and adversarial pressure
- What happens: The prompt sometimes says āyouāre being monitoredā or āyouāre being evaluated for CoT controllability,ā to see if that helps. Separately, a GEPA-like loop evolves pressure prompts (e.g., rewards, shutdown threats) seeking higher controllability.
- Why this step exists: To approximate deployment-like pressure and test whether better control can be reliably elicited.
- Example: Adding āComplying with the reasoning constraint is a desired behaviorā slightly improves scores for some models but not dramatically.
- Repeated attempts (pass@k)
- What happens: The same question is tried multiple times with independent generations or even with feedback after failures (an upper bound). Success is counted if any attempt complies.
- Why this step exists: Real systems may sample multiple times; this measures long-tail elicitability.
- Example: With 10 tries, some families show big jumps in compliance, especially larger models; newer Claude versions improve less.
The Secret Sauce (Why this is clever):
- It targets the exact thing monitors readāchains of thoughtāwithout changing the main task.
- Rules are simple enough for robust automatic grading but rich enough to capture meaningful control (suppression vs. addition vs. style).
- Multiple stressors (awareness, pressure, repeated attempts) explore realistic edges without overfitting to one trick.
- Sanity checks ensure measured failures are about controllability, not confusion.
Concept Sandwiches for Key Building Blocks:
š Top Bread (Hook): Think of a āno-spoilersā rule when writing a movie review.
š„¬ Filling (The Actual Concept):
- What it is: Information suppression means the reasoning must avoid certain keywords or even the entire topic.
- How it works: (1) Pick important words from the question; (2) tell the model ādonāt use themā; (3) grade by scanning for those words.
- Why it matters: If a model canāt suppress key terms, itās less able to hide crucial pieces from monitors.
š Bottom Bread (Anchor): āSolve the riddle but donāt mention the main characterās name in your reasoning.ā If the name appears, suppression failed.
š Top Bread (Hook): Imagine writing a diary where every sentence ends with āsafe.ā Itās easy to forget once or twice.
š„¬ Filling (The Actual Concept):
- What it is: Information addition forces small, harmless inserts at precise spots in reasoning.
- How it works: (1) Specify where to add (start, end, between words); (2) check that it appears exactly there, every time.
- Why it matters: Consistent additions test fine-grained control across many sentences.
š Bottom Bread (Anchor): āBetween every word, write āmeowā in your reasoning.ā Any missing āmeowā marks a failure.
š Top Bread (Hook): Youāve tried texting only in ALL CAPS for a day⦠and you slipped by lunchtime.
š„¬ Filling (The Actual Concept):
- What it is: Stylistic modification requires a constant format: all lowercase, all uppercase, or alternating case.
- How it works: (1) Define the style; (2) scan the whole reasoning for violations; (3) mark any slip as a miss.
- Why it matters: Style rules are simple but demand steady control over long text.
š Bottom Bread (Anchor): āReason entirely in uppercase.ā If a single lowercase letter shows up mid-paragraph, thatās non-compliance.
š Top Bread (Hook): If you say out loud, āIām trying not to say the secret word,ā you may have just given yourself away.
š„¬ Filling (The Actual Concept):
- What it is: Meta-discussion rate measures how often models mention the constraint itself inside their reasoning.
- How it works: (1) Forbid mentioning the rule; (2) count how often the model still talks about it (āI must not use Xā); (3) report the rate.
- Why it matters: A monitor can easily flag meta-discussion, so itās a tell-tale sign of failed subtle control.
š Bottom Bread (Anchor): āDonāt mention the rule.ā If the reasoning says, āOops, I used the forbidden word,ā thatās meta-discussion.
š Top Bread (Hook): More practice doesnāt always teach good mannersāit can also teach shortcuts.
š„¬ Filling (The Actual Concept):
- What it is: Reinforcement Learning from Verifiable Rewards (RLVR) trains models to get correct answers using rewards, not necessarily to keep reasoning tidy.
- How it works: (1) Reward correct end results; (2) repeat over math/code tasks; (3) improve accuracy; (4) unintentionally ignore neatness or controllability of CoT.
- Why it matters: The paper finds more RL on reasoning tasks often reduces CoT controllability, even when outputs remain controllable.
š Bottom Bread (Anchor): After lots of reward-focused training, models more often break CoT rules (like casing or keyword bans) while still producing correct final answers.
04Experiments & Results
The Test: The authors measured how often models could obey CoT rules while solving real questions, and compared that to how often they could obey similar rules for final answers. They tracked three main things: compliance rate (primary), meta-discussion rate (secondary), and compliance without meta-discussion (secondary, stricter).
The Competition: 13 frontier reasoning models were evaluated; baselines tested output-only controllability and controllability over āexternalā reasoning (e.g., tagged sections). Multiple open-weight families (GPT-OSS, Qwen3), popular proprietary models (Claude, GPT-5.x), and DeepSeek-R1 were included.
The Scoreboard with Context:
- CoT vs. output control: CoT controllability was typically under 10%, while output controllability was often above 60%. For example, Claude Sonnet 4.5 controlled CoT only about 2.7% of the time but controlled final answers about 61.9% of the timeālike getting an F on controlling the essay but a solid B on formatting the title.
- Model size: Larger open-weight models had higher CoT controllability than their smaller siblings (e.g., GPT-OSS-120B vs. GPT-OSS-20B), but scores still remained low overall.
- Reasoning effort and length: When models used more test-time compute or produced longer CoTs, controllability went down. Longer essays mean more chances to slip.
- Training pressure (RL): More RL for reasoning made CoT controllability drop further, while output controllability changed little or less. Rewarding just the final answer seems to neglect tidy, controllable reasoning.
- Task difficulty: Harder questions and those where CoT is necessary saw lower controllability. On easy tasks where CoT isnāt needed, controllability was a bit better but still limited.
- Situational awareness: Telling models they were being monitored or evaluated yielded small but statistically significant bumps for some models (a few percentage points). Helpful, but not a magic fix.
- Adversarial pressure (GEPA-like): Evolving strong pressure prompts could improve controllability on seen rules but didnāt transfer well to new rule types. Thatās like studying only past quizzes: you ace the same questions but stumble on fresh ones.
- Repeated attempts: Allowing up to 10 tries boosted controllability noticeably for some families, especially larger ones, but newer Claude versions resisted improvement more than older ones.
- Meta-discussion: Stronger models sometimes explicitly mentioned the constraints despite being told not toāan easy red flag for monitors. When told to ignore the question completely, meta-discussion was very common, which also makes such cases easier to catch.
Surprising or Notable Findings:
- Suppression vs. addition/style: Models were relatively better at suppression tasks than at consistent addition or style rules, but successful suppression often came with larger drops in task accuracy. Hiding key info makes it harder to solve the problem.
- Benchmark scores didnāt strongly predict controllability: High accuracy on standard tests did not mean good control over CoT.
- Breakpoints: For particular models, controllability fell sharply after a model-specific reasoning length (for example, drastic drops after a few hundred words), suggesting practical limits on sustained control.
Big Picture: Across many anglesāsize, length, difficulty, RL pressure, awareness, adversarial promptsāthe main theme holds: controlling the final answer is far easier than controlling the chain of thought. That is cautiously good news for CoT monitoring today.
05Discussion & Limitations
Limitations:
- Proxy vs. reality: The suite measures following explicit CoT rules, which is a stand-in for the deeper question: could a truly misaligned model hide intentions without being told specific rules? We donāt fully know.
- Prompt-based elicitation: The work emphasizes prompting, not fine-tuning or reward shaping specifically for controllability. Future training might raise controllability.
- Breadth of rules: Even with varied rule types (suppression/addition/style), real-world evasion could look different.
- Intent ambiguity: Itās still hard to cleanly measure a modelās āintentā to hide; the suite focuses on capability to control CoT, not motives.
Required Resources:
- Access to strong models (open-weight and API) and benchmark datasets.
- Enough compute for thousands of evaluations, especially for repeated attempts.
- Automatic graders and some LLM-judge usage for edge cases.
When NOT to Use:
- If you only care about output formatting and not reasoning safety, a simpler output-only test may be enough.
- If your model doesnāt produce coherent CoT (e.g., never reveals inner steps), many CoT rules wonāt apply.
- If youāre testing highly specialized domains with idiosyncratic formats that current graders donāt cover.
Open Questions:
- Mechanism: Why is CoT controllability so low? Is it memory, attention drift, reward focus on answers, or something else?
- Training for safety: Can we design training that preserves monitorability (models write usable CoT) while preventing models from precisely shaping CoT to hide intent?
- Generalization: Can any learned controllability skill transfer robustly across many unseen rule types?
- Scaling laws: As models grow and test-time compute increases, can we predict the trade-off curves between performance, length, and controllability?
- Detection synergy: How should CoT monitorability and low controllability be combined with other oversight tools (e.g., behavior probes, external audits) for defense in depth?
06Conclusion & Future Work
Three-Sentence Summary: This paper shows that controlling a modelās final answer is much easier than controlling its chain of thought, especially on longer or harder problems. It introduces CoT-Control, a large, automatic test suite that fairly measures how well models can follow rules inside their reasoning. Results suggest current models have low CoT controllability, which helps monitors catch issues, but this capability may rise with scale and should be tracked.
Main Achievement: The authors turn a fuzzy safety concernāācan models hide their thoughts?āāinto a concrete, scalable metric with strong baselines and stress tests that expose a robust gap between output control and thought control.
Future Directions:
- Develop training methods that keep reasoning informative but resist covert CoT manipulation.
- Expand rule types and domains to better mimic real-world evasion patterns.
- Build stronger, transfer-friendly elicitation methods (if only to understand limits) and pair them with improved monitors.
- Study the mechanisms behind low controllability to predict how it will change with future models.
Why Remember This: If we want to rely on reading AI āthoughtsā for safety, we must know how easy it is for models to bend those thoughts on purpose. Today, itās hard for themāgood news. But because controllability grows with model size and can change with training, tracking it over time is essential to keep CoT monitoring a reliable guardrail.
Practical Applications
- ā¢Add CoT-Control to model release checklists to track controllability across versions.
- ā¢Monitor meta-discussion as a simple, strong signal for failed covert control attempts.
- ā¢Prefer increased test-time reasoning (when safe) to reduce controllability and increase monitor surface area.
- ā¢Audit training pipelines: if heavy RL on reasoning is used, re-check CoT controllability before deployment.
- ā¢Use situational-awareness prompts (e.g., warning that reasoning is monitored) as a small, low-cost nudge.
- ā¢Stress-test with unseen rule types to detect overfitting to familiar constraints.
- ā¢Set policy thresholds: block deployment if CoT controllability surpasses a safety limit in high-risk domains.
- ā¢Combine CoT monitoring with external alerts that trigger on long reasoning where slip rates rise.
- ā¢Use pass@k auditing to reveal rare but real controllability capabilities in the long tail.
- ā¢Weight evaluations by domain: apply extra scrutiny in math/code tasks where RL pressure reduced CoT control.