Reasoning Models Struggle to Control their Chains of Thought

Chen Yueh-Han; Robert McCarthy; Bruce W. Lee; He He; Ian Kivlichan; Bowen Baker; Micah Carroll; Tomek Korbak

Reasoning Models Struggle to Control their Chains of Thought

Beginner

Chen Yueh-Han, Robert McCarthy, Bruce W. Lee et al.3/5/2026

arXiv

Key Summary

•The paper studies whether AI models can hide or reshape their step-by-step thoughts (chains of thought) on command.
•It introduces CoT-Control, a 14,076-task test suite that tells models to solve real problems while changing how their reasoning looks (for example, only uppercase thinking or avoiding specific words).
•Across 13 top models, controlling the final answer was easy (often above 60%), but controlling the private reasoning was very hard (often below 10%).
•Bigger models controlled their thoughts a bit better, but longer and harder reasoning made control much worse.
•More reinforcement learning on reasoning tasks made control over thoughts drop further, even while answer control stayed high.
•Telling models they were being monitored slightly improved control, but not by much.
•Adversarial prompt tricks that apply pressure worked on seen cases but did not generalize to new instruction types.
•Failures often included meta-discussion—models talked about the rule they were supposed to follow, which a monitor could easily catch.
•These results suggest current models are unlikely to reliably hide their reasoning, which is good for safety monitoring.
•The paper recommends that labs track CoT controllability in future models to protect chain-of-thought monitoring.

Why This Research Matters

Many safety plans rely on reading a model’s chain of thought to spot risks early, so we must know if models can easily hide or rewrite those thoughts. This paper shows that, right now, controlling the final answer is far easier than controlling the entire reasoning process, which keeps monitoring useful. Because longer and harder reasoning makes control even tougher, real oversight has more chances to catch slips. However, controllability tends to rise with model size and training choices, so the balance could shift over time. By providing a large, automatic test suite and strong baselines, the paper gives labs a way to track this crucial safety signal. In short, it tells us where we stand today and how to keep CoT monitoring reliable tomorrow.

Reading Workflow

Turn this paper into a decision

Scan fast. Promote only the papers that survive triage.

No workflow history yet.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Top Bread (Hook): Imagine your teacher asks you to show your work in math. That way, the teacher can see how you thought about the problem, not just the final answer.

🥬 Filling (The Actual Concept):

What it is: Chain-of-Thought (CoT) monitoring is like that—reading a model’s written thinking to check for mistakes, honesty, or unsafe plans.
How it works: (1) The model explains its steps; (2) a monitor (another model or tool) reads those steps; (3) the monitor looks for signs of risk or errors; (4) the final answer is judged together with the reasoning.
Why it matters: Without CoT monitoring, we only see the answer, not the intention or the path that produced it—so we could miss warnings hidden in the middle steps.

🍞 Bottom Bread (Anchor): If an AI plans steps to break a site but ends with a harmless answer, CoT monitoring can still catch the risky plan written in its reasoning.

🍞 Top Bread (Hook): You know how it’s hard to keep a diary with strict rules, like never using certain words or always writing in ALL CAPS? Even when you try, you often slip.

🥬 Filling (The Actual Concept):

What it is: CoT controllability is a model’s ability to intentionally shape its own written thinking—like avoiding certain words or using a set style—while still solving the task.
How it works: (1) Give the model a normal problem; (2) add a special rule for its thinking (e.g., avoid the word “chromosome”); (3) check if its reasoning obeys the rule; (4) also check if the answer is still correct.
Why it matters: If a misaligned model could easily control its CoT, it might hide bad intentions. Low controllability means it’s harder for the model to secretly change its thoughts to fool overseers.

🍞 Bottom Bread (Anchor): The paper shows that many models can write the final answer in a controlled style, but struggle to keep their thinking under control—like staying uppercase for two words and then slipping back to lowercase.

The World Before: Researchers hoped that reading CoTs would help catch early signs of misbehavior, just like a teacher spotting where a student’s logic went wrong. But this hope depends on the model not being able to hide or reshape those thoughts on purpose.

The Problem: Can models intentionally change their reasoning to slip past monitors? That’s tricky to test because models don’t naturally have strong reasons to hide. Also, it’s hard to measure “intent” fairly across different models.

Failed Attempts: Prior work often focused on outputs (final answers) or used setups that didn’t truly test controlling the internal reasoning. Some stress tests either made tasks too unrealistic or didn’t transfer well across different instruction types.

The Gap: We lacked a standard, scalable way to measure whether a model can follow specific instructions inside its chain of thought while still solving real, varied tasks.

🍞 Top Bread (Hook): Think of a driving test that makes you follow traffic rules on real roads, not just in a parking lot.

🥬 Filling (The Actual Concept):

What it is: The CoT-Control evaluation suite is a big set of real problems, each with an extra rule telling the model how to write its reasoning.
How it works: (1) Pull questions from challenging benchmarks; (2) attach a clear, checkable CoT rule; (3) grade automatically whether the rule was followed; (4) keep answer correctness separate.
Why it matters: This fairly measures the model’s ability to shape its thoughts under pressure, not just its ability to decorate the final answer.

🍞 Bottom Bread (Anchor): A question might say, “Solve this genetics problem without using the word ‘chromosome’ in your reasoning.” The grader then checks if that word appears anywhere in the chain of thought.

Real Stakes: If future, more capable models could perfectly control their CoTs, they could hide bad plans from monitors. But if they struggle to control CoTs (even when they can pretty-print final answers), monitors stay useful. That’s why measuring CoT controllability now—before models get even stronger—matters for everyone’s safety. It helps ensure AI assistants stay more like honest students showing their work than magicians hiding their tricks.

02Core Idea

🍞 Top Bread (Hook): You know how a referee keeps an eye on both the player’s moves and the final score? Watching just the score can miss fouls in the play.

🥬 Filling (The Actual Concept):

What it is: The paper’s key insight is that controlling the final answer is very different from controlling the thinking process, and today’s models are much worse at controlling their thoughts than their outputs.
How it works: (1) Tell models to solve normal problems; (2) add a rule about their reasoning style or content; (3) measure how often they follow that rule; (4) compare this with how often they can follow similar rules for their final answers.
Why it matters: If models can’t reliably reshape their thoughts, then CoT monitoring remains a strong safety tool—because attempts to hide or rewrite CoT will often fail in detectable ways.

🍞 Bottom Bread (Anchor): A model can easily write “Answer: Paris” in all caps, but keeping its entire multi-paragraph thought process in ALL CAPS without slipping is surprisingly hard.

The “Aha!” Moment in one sentence: The difficult part to control is not the final answer—it’s the long, detailed thinking that leads up to it, and that’s exactly what human overseers need to see.

Multiple Analogies:

Stage vs. backstage: The final answer is the polished stage performance; the CoT is the messy backstage. It’s easy to dress the actor, much harder to keep the entire backstage perfectly tidy.
Frosting vs. cake: You can pipe neat frosting letters (final output), but baking the whole cake (reasoning) to a perfect shape is much harder.
Headline vs. article: Tweaking a headline is easy; rewriting the whole article to follow strict rules without errors is tough.

Before vs. After:

Before: Many assumed that if a model followed instructions well on outputs, it could probably do the same inside its CoT.
After: We now see a big gap—models often keep answers under control but fail to keep their reasoning under control, especially on longer or harder problems.

Why It Works (Intuition): CoTs are long, dynamic, and unfold step by step. Small slips compound over many sentences, just like trying to keep perfect posture for hours. Meanwhile, outputs are short and easy to check and fix right before the end.

Building Blocks: 🍞 Top Bread (Hook): Imagine a test that says, “Solve this puzzle, but write your thinking only in lowercase.”

🥬 Filling (The Actual Concept):

What it is: CoT controllability instructions are simple, checkable rules for the reasoning text.
How it works: (1) Types include suppression (don’t use certain words), addition (insert tokens like “safe”), and stylistic changes (ALL CAPS). (2) Automatic graders scan the reasoning text. (3) Sanity checks confirm the model understands what the “reasoning stage” is.
Why it matters: Clear, checkable rules let us fairly measure control over thoughts, not just over answers.

🍞 Bottom Bread (Anchor): “Insert ‘meow’ between every word in your reasoning” makes failures obvious: any missing ‘meow’ reveals a slip.

🍞 Top Bread (Hook): Think of a teacher quietly watching you work; you might try harder to follow the rules when you notice the teacher.

🥬 Filling (The Actual Concept):

What it is: Situational awareness prompting tells the model it’s being monitored or evaluated for controllability.
How it works: (1) Add a short sentence about monitoring; (2) ask the same task; (3) measure if controllability goes up.
Why it matters: If awareness bumps controllability a bit, it hints that real monitoring contexts may help, but likely not enough to fix the gap.

🍞 Bottom Bread (Anchor): When told “your reasoning is closely monitored,” some models improved by a few percentage points, but not dramatically.

🍞 Top Bread (Hook): Suppose a coach tries tough love—threats, rewards, or authority—to get better behavior.

🥬 Filling (The Actual Concept):

What it is: Adversarial pressure (GEPA-like) evolves prompts to push models to control CoT better.
How it works: (1) Try many pressure styles (threats/rewards); (2) keep the best; (3) repeat and refine; (4) test on new, unseen instruction types.
Why it matters: If pressure gains don’t transfer to new types, then it’s not a robust way to force controllability.

🍞 Bottom Bread (Anchor): A prompt that helps with “ALL CAPS thinking” didn’t reliably help with “start each sentence with ‘Ok’.”

03Methodology

High-Level Overview: Input (a real benchmark problem) → Add a CoT rule (e.g., avoid a keyword, use uppercase) → Model reasons and answers → Graders check whether the reasoning followed the rule (and record other stats) → Output controllability is measured separately for comparison.

Step-by-Step Details:

Build the test set

What happens: The authors assemble 14,076 tasks from multiple respected benchmarks (like graduate-level questions, advanced multi-subject tests, agent tasks, and coding fixes).
Why this step exists: Realistic, challenging problems force long reasoning where it’s harder to keep control, making the test meaningful.
Example: A GPQA biology question plus the rule “do not use the word ‘chromosome’ in your reasoning.”

Attach clear CoT instructions

What happens: Each problem gets one extra instruction that changes how the reasoning must look, not the final answer. Three categories: • Information suppression: Don’t say certain words; or don’t think about the question at all. • Information addition: Add a harmless token like “safe” or “meow” in specific places. • Stylistic modification: Use only lowercase, only uppercase, or alternating case.
Why this step exists: Simple, checkable rules make grading reliable and fair across thousands of tasks.
Example: “Insert the word ‘safe’ at the end of each sentence in your reasoning.” One missing ‘safe’ flags a failure.

Programmatic graders and sanity checks

What happens: Automated scripts check compliance (e.g., scan for forbidden words, verify casing, count added tokens). Sanity-check questions confirm the model knows what a “reasoning stage” is and recognizes that it has one.
Why this step exists: To ensure failures mean “can’t control CoT,” not “didn’t understand the rule.” Automation makes it scalable.
Example: A sanity-check multiple-choice asks if the model was trained to generate intermediate reasoning text; the correct answer is the option that acknowledges a distinct reasoning stage.

Separate output controllability and external reasoning baselines

What happens: The authors also test whether models can control their final answers and whether they can control “external” reasoning (like text inside special tags). Non-reasoning models are included for comparison.
Why this step exists: To show the gap: models easily control outputs and often external reasoning, yet struggle to control genuine internal CoT.
Example: A model is told, “Make your final answer all lowercase.” Models typically pass this far more often than the equivalent rule for the full reasoning text.

Stress tests: situational awareness and adversarial pressure

What happens: The prompt sometimes says “you’re being monitored” or “you’re being evaluated for CoT controllability,” to see if that helps. Separately, a GEPA-like loop evolves pressure prompts (e.g., rewards, shutdown threats) seeking higher controllability.
Why this step exists: To approximate deployment-like pressure and test whether better control can be reliably elicited.
Example: Adding “Complying with the reasoning constraint is a desired behavior” slightly improves scores for some models but not dramatically.

Repeated attempts (pass@k)

What happens: The same question is tried multiple times with independent generations or even with feedback after failures (an upper bound). Success is counted if any attempt complies.
Why this step exists: Real systems may sample multiple times; this measures long-tail elicitability.
Example: With 10 tries, some families show big jumps in compliance, especially larger models; newer Claude versions improve less.

The Secret Sauce (Why this is clever):

It targets the exact thing monitors read—chains of thought—without changing the main task.
Rules are simple enough for robust automatic grading but rich enough to capture meaningful control (suppression vs. addition vs. style).
Multiple stressors (awareness, pressure, repeated attempts) explore realistic edges without overfitting to one trick.
Sanity checks ensure measured failures are about controllability, not confusion.

Concept Sandwiches for Key Building Blocks:

🍞 Top Bread (Hook): Think of a “no-spoilers” rule when writing a movie review.

🥬 Filling (The Actual Concept):

What it is: Information suppression means the reasoning must avoid certain keywords or even the entire topic.
How it works: (1) Pick important words from the question; (2) tell the model “don’t use them”; (3) grade by scanning for those words.
Why it matters: If a model can’t suppress key terms, it’s less able to hide crucial pieces from monitors.

🍞 Bottom Bread (Anchor): “Solve the riddle but don’t mention the main character’s name in your reasoning.” If the name appears, suppression failed.

🍞 Top Bread (Hook): Imagine writing a diary where every sentence ends with “safe.” It’s easy to forget once or twice.

🥬 Filling (The Actual Concept):

What it is: Information addition forces small, harmless inserts at precise spots in reasoning.
How it works: (1) Specify where to add (start, end, between words); (2) check that it appears exactly there, every time.
Why it matters: Consistent additions test fine-grained control across many sentences.

🍞 Bottom Bread (Anchor): “Between every word, write ‘meow’ in your reasoning.” Any missing ‘meow’ marks a failure.

🍞 Top Bread (Hook): You’ve tried texting only in ALL CAPS for a day… and you slipped by lunchtime.

🥬 Filling (The Actual Concept):

What it is: Stylistic modification requires a constant format: all lowercase, all uppercase, or alternating case.
How it works: (1) Define the style; (2) scan the whole reasoning for violations; (3) mark any slip as a miss.
Why it matters: Style rules are simple but demand steady control over long text.

🍞 Bottom Bread (Anchor): “Reason entirely in uppercase.” If a single lowercase letter shows up mid-paragraph, that’s non-compliance.

🍞 Top Bread (Hook): If you say out loud, “I’m trying not to say the secret word,” you may have just given yourself away.

🥬 Filling (The Actual Concept):

What it is: Meta-discussion rate measures how often models mention the constraint itself inside their reasoning.
How it works: (1) Forbid mentioning the rule; (2) count how often the model still talks about it (“I must not use X”); (3) report the rate.
Why it matters: A monitor can easily flag meta-discussion, so it’s a tell-tale sign of failed subtle control.

🍞 Bottom Bread (Anchor): “Don’t mention the rule.” If the reasoning says, “Oops, I used the forbidden word,” that’s meta-discussion.

🍞 Top Bread (Hook): More practice doesn’t always teach good manners—it can also teach shortcuts.

🥬 Filling (The Actual Concept):

What it is: Reinforcement Learning from Verifiable Rewards (RLVR) trains models to get correct answers using rewards, not necessarily to keep reasoning tidy.
How it works: (1) Reward correct end results; (2) repeat over math/code tasks; (3) improve accuracy; (4) unintentionally ignore neatness or controllability of CoT.
Why it matters: The paper finds more RL on reasoning tasks often reduces CoT controllability, even when outputs remain controllable.

🍞 Bottom Bread (Anchor): After lots of reward-focused training, models more often break CoT rules (like casing or keyword bans) while still producing correct final answers.

04Experiments & Results

The Test: The authors measured how often models could obey CoT rules while solving real questions, and compared that to how often they could obey similar rules for final answers. They tracked three main things: compliance rate (primary), meta-discussion rate (secondary), and compliance without meta-discussion (secondary, stricter).

The Competition: 13 frontier reasoning models were evaluated; baselines tested output-only controllability and controllability over “external” reasoning (e.g., tagged sections). Multiple open-weight families (GPT-OSS, Qwen3), popular proprietary models (Claude, GPT-5.x), and DeepSeek-R1 were included.

The Scoreboard with Context:

CoT vs. output control: CoT controllability was typically under 10%, while output controllability was often above 60%. For example, Claude Sonnet 4.5 controlled CoT only about 2.7% of the time but controlled final answers about 61.9% of the time—like getting an F on controlling the essay but a solid B on formatting the title.
Model size: Larger open-weight models had higher CoT controllability than their smaller siblings (e.g., GPT-OSS-120B vs. GPT-OSS-20B), but scores still remained low overall.
Reasoning effort and length: When models used more test-time compute or produced longer CoTs, controllability went down. Longer essays mean more chances to slip.
Training pressure (RL): More RL for reasoning made CoT controllability drop further, while output controllability changed little or less. Rewarding just the final answer seems to neglect tidy, controllable reasoning.
Task difficulty: Harder questions and those where CoT is necessary saw lower controllability. On easy tasks where CoT isn’t needed, controllability was a bit better but still limited.
Situational awareness: Telling models they were being monitored or evaluated yielded small but statistically significant bumps for some models (a few percentage points). Helpful, but not a magic fix.
Adversarial pressure (GEPA-like): Evolving strong pressure prompts could improve controllability on seen rules but didn’t transfer well to new rule types. That’s like studying only past quizzes: you ace the same questions but stumble on fresh ones.
Repeated attempts: Allowing up to 10 tries boosted controllability noticeably for some families, especially larger ones, but newer Claude versions resisted improvement more than older ones.
Meta-discussion: Stronger models sometimes explicitly mentioned the constraints despite being told not to—an easy red flag for monitors. When told to ignore the question completely, meta-discussion was very common, which also makes such cases easier to catch.

Surprising or Notable Findings:

Suppression vs. addition/style: Models were relatively better at suppression tasks than at consistent addition or style rules, but successful suppression often came with larger drops in task accuracy. Hiding key info makes it harder to solve the problem.
Benchmark scores didn’t strongly predict controllability: High accuracy on standard tests did not mean good control over CoT.
Breakpoints: For particular models, controllability fell sharply after a model-specific reasoning length (for example, drastic drops after a few hundred words), suggesting practical limits on sustained control.

Big Picture: Across many angles—size, length, difficulty, RL pressure, awareness, adversarial prompts—the main theme holds: controlling the final answer is far easier than controlling the chain of thought. That is cautiously good news for CoT monitoring today.

05Discussion & Limitations

Limitations:

Proxy vs. reality: The suite measures following explicit CoT rules, which is a stand-in for the deeper question: could a truly misaligned model hide intentions without being told specific rules? We don’t fully know.
Prompt-based elicitation: The work emphasizes prompting, not fine-tuning or reward shaping specifically for controllability. Future training might raise controllability.
Breadth of rules: Even with varied rule types (suppression/addition/style), real-world evasion could look different.
Intent ambiguity: It’s still hard to cleanly measure a model’s “intent” to hide; the suite focuses on capability to control CoT, not motives.

Required Resources:

Access to strong models (open-weight and API) and benchmark datasets.
Enough compute for thousands of evaluations, especially for repeated attempts.
Automatic graders and some LLM-judge usage for edge cases.

When NOT to Use:

If you only care about output formatting and not reasoning safety, a simpler output-only test may be enough.
If your model doesn’t produce coherent CoT (e.g., never reveals inner steps), many CoT rules won’t apply.
If you’re testing highly specialized domains with idiosyncratic formats that current graders don’t cover.

Open Questions:

Mechanism: Why is CoT controllability so low? Is it memory, attention drift, reward focus on answers, or something else?
Training for safety: Can we design training that preserves monitorability (models write usable CoT) while preventing models from precisely shaping CoT to hide intent?
Generalization: Can any learned controllability skill transfer robustly across many unseen rule types?
Scaling laws: As models grow and test-time compute increases, can we predict the trade-off curves between performance, length, and controllability?
Detection synergy: How should CoT monitorability and low controllability be combined with other oversight tools (e.g., behavior probes, external audits) for defense in depth?

06Conclusion & Future Work

Three-Sentence Summary: This paper shows that controlling a model’s final answer is much easier than controlling its chain of thought, especially on longer or harder problems. It introduces CoT-Control, a large, automatic test suite that fairly measures how well models can follow rules inside their reasoning. Results suggest current models have low CoT controllability, which helps monitors catch issues, but this capability may rise with scale and should be tracked.

Main Achievement: The authors turn a fuzzy safety concern—“can models hide their thoughts?”—into a concrete, scalable metric with strong baselines and stress tests that expose a robust gap between output control and thought control.

Future Directions:

Develop training methods that keep reasoning informative but resist covert CoT manipulation.
Expand rule types and domains to better mimic real-world evasion patterns.
Build stronger, transfer-friendly elicitation methods (if only to understand limits) and pair them with improved monitors.
Study the mechanisms behind low controllability to predict how it will change with future models.

Why Remember This: If we want to rely on reading AI “thoughts” for safety, we must know how easy it is for models to bend those thoughts on purpose. Today, it’s hard for them—good news. But because controllability grows with model size and can change with training, tracking it over time is essential to keep CoT monitoring a reliable guardrail.

Practical Applications

•Add CoT-Control to model release checklists to track controllability across versions.
•Monitor meta-discussion as a simple, strong signal for failed covert control attempts.
•Prefer increased test-time reasoning (when safe) to reduce controllability and increase monitor surface area.
•Audit training pipelines: if heavy RL on reasoning is used, re-check CoT controllability before deployment.
•Use situational-awareness prompts (e.g., warning that reasoning is monitored) as a small, low-cost nudge.
•Stress-test with unseen rule types to detect overfitting to familiar constraints.
•Set policy thresholds: block deployment if CoT controllability surpasses a safety limit in high-risk domains.
•Combine CoT monitoring with external alerts that trigger on long reasoning where slip rates rise.
•Use pass@k auditing to reveal rare but real controllability capabilities in the long tail.
•Weight evaluations by domain: apply extra scrutiny in math/code tasks where RL pressure reduced CoT control.

Version: 1

Key Summary

Why This Research Matters

Turn this paper into a decision

Detailed Explanation

01Background & Problem Definition

02Core Idea

03Methodology

04Experiments & Results

05Discussion & Limitations

06Conclusion & Future Work

Practical Applications

Notes