
When Reasoning Meets Its Laws

Intermediate
Junyu Zhang, Yifan Sun, Tianang Leng et al. · 12/19/2025
arXiv · PDF

Key Summary

  • The paper proposes the Laws of Reasoning (LORE), simple rules that say how much a model should think and how accurate it can be as problems get harder.
  • Compute Law: the thinking budget (reasoning tokens) should grow roughly in a straight line with problem difficulty.
  • Accuracy Law: accuracy should drop in a smooth, predictable way (like an exponential curve) as problems get harder.
  • Because true difficulty is hard to measure, the authors test two easy-to-check properties: monotonicity (harder → more thinking, lower accuracy) and compositionality (two independent parts → add thinking, multiply accuracies).
  • They build LORE-BENCH with two parts: LORE-MONO (tests monotonicity) and LORE-COMPO (tests compositionality).
  • Most models pass monotonicity but fail compositionality, meaning they don’t spend enough (or the right amount of) thinking when two independent tasks are combined.
  • They introduce SFT-Compo, a simple fine-tuning method that teaches models to make the thinking for combined questions equal the sum of the parts.
  • Enforcing compositionality reduces deviation by up to 40.5% (nMAD on compute) and boosts accuracy across six tough math/science benchmarks (e.g., +5.0 Pass@1 for an 8B model).
  • A nice bonus: training for compute compositionality also improves monotonicity and even accuracy compositionality, showing synergistic effects.
  • This work turns fuzzy ideas about 'thinking well' into testable laws and shows that following these laws makes models better reasoners.

Why This Research Matters

When models follow clear ‘laws’ for how much to think and how accuracy should change with difficulty, they become more reliable helpers in schoolwork, coding, and planning. Proper thinking budgets cut wasted computation on easy tasks and prevent underthinking on hard ones, saving time and money. Stronger compositionality makes models better at multi-step tasks, like ‘solve A, then use that to solve B,’ which appears everywhere from math problems to data pipelines. Predictable accuracy scaling helps set the right expectations, especially for long chains where errors can compound. The approach is practical: simple tests (LORE-BENCH) and a lightweight training fix (SFT-Compo) produce real gains on respected benchmarks. In short, this makes AI reason more like careful humans—sensible, consistent, and efficient.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re doing homework. When the problems get harder, you naturally spend more time thinking, and your chance of a small mistake grows with each extra step. That’s how humans reason.

🥬 The situation before: Large Reasoning Models (LRMs) like advanced chatbots got surprisingly good at step-by-step thinking, but they often behaved in odd ways. Sometimes they wrote way too much for an easy question (overthinking) or too little for a hard one (underthinking). Other times, they solved two separate sub-questions fine, but when those same two were glued together, they suddenly used fewer thinking steps and did worse. That’s not how people usually solve problems. What was missing was a clear set of rules—simple, testable laws—about how much thinking to do for a problem of a given difficulty and how accuracy should change as problems get harder. Without those rules, training data (like chain-of-thought examples) was messy and inconsistent: no one told the model how to budget its ‘thinking tokens’ across easy and hard tasks.

🍞 Anchor: Picture a model that spends more tokens on squaring 55 (harder) than on adding 1 through 10 (easier), but then, when asked to “add then square,” it strangely thinks less than for squaring alone—accuracy drops. That mismatch is exactly what the authors observed and set out to fix.

🍞 Hook: You know how sports teams have playbooks—structured rules for deciding which play to run? Models also need a playbook for thinking.

🥬 The problem: There was no theoretical playbook that connected three things: (1) how hard a question is, (2) how much the model should think, and (3) how accuracy should behave as difficulty rises. Previous attempts tried ad-hoc tricks, like shortening or lengthening reasoning at random times or adding small penalties during training. These helped a bit but didn’t fix the deeper issue: there was no ‘law’ guiding the right amount of thinking for a given complexity. Failed attempts: Heuristics at training time (variable-length chain-of-thought) and at test time (turning thinking on/off or scheduling tokens) sometimes worked, sometimes didn’t, and often broke in weird cases like combined tasks.

🍞 Anchor: It’s like telling a runner to ‘sometimes speed up, sometimes slow down’ without tying that advice to the length or steepness of the course. You might get lucky—but you won’t be reliably fast.

🍞 Hook: Imagine we had rulers for both time (how much to think) and outcome (how accurate to expect) as tasks get harder.

🥬 The gap: We needed simple, physics-style laws for reasoning—rules that say, as tasks get more complex, thinking should scale sensibly, and accuracy should change predictably. But we can’t easily measure true problem complexity exactly in the real world. So we also need practical “check-ups” that stand in for the full laws and can be measured on real data.

🍞 Anchor: Think of a speedometer (how fast) and a fuel gauge (how far you can go). Even if you don’t know the exact road shape, these two simple dials help you drive wisely. The paper builds those dials for reasoning.

🍞 Hook: Why should anyone care?

🥬 Real stakes: In everyday life, reasoning models help with math tutoring, code fixes, planning trips, and even lab science workflows. If a model spends too little ‘thinking’ on a hard sub-part of a plan, the whole plan can fail. If it overthinks an easy part, it wastes time and money. Clear laws that guide models to think the right amount make them more reliable, faster, and cheaper—and help them handle multi-step tasks where mistakes easily snowball.

🍞 Anchor: Think of a homework helper that knows to spend just a few steps adding 1 through 10, more steps squaring the result, and the sum of both when you ask it to ‘add then square.’ That’s the kind of sensible behavior these laws aim to enforce.

02 Core Idea

🍞 Hook: You know how a recipe tells you how many steps to take and what results to expect? Good reasoning needs a recipe too.

🥬 The aha! in one sentence: The paper proposes two simple laws—how much to think should grow linearly with problem complexity, and how accuracy should decay exponentially as complexity grows—and shows that teaching models to follow these laws improves reasoning.

Multiple analogies:

  1. Hiking trail: The longer the trail (complexity), the longer you’ll hike (thinking). And the chance of stumbling (error) grows with each extra step.
  2. Lego build: Double the number of independent Lego steps? You’ll spend about double the time. Finishing two independent builds correctly requires both to be right (accuracies multiply).
  3. Chores: Doing laundry and washing dishes separately takes some minutes each; doing both takes the sum of those minutes. But making a slip during either chore means the whole ‘both-done-perfectly’ claim fails (probabilities multiply).

🍞 Anchor: For two independent math puzzles—say, basic algebra and simple geometry—thinking time should add when combined; the chance of getting both right should be the product of each individual success chance.

🍞 Hook: Imagine two dials: ‘Thinking Tokens’ and ‘Accuracy.’ As the task gets harder, one dial should turn up steadily; the other should go down in a smooth curve.

🥬 Why it works (intuition, no equations):

  • Compute Law: If a problem truly has more necessary steps, an efficient solver should spend proportionally more tokens to reason through those steps—plus a tiny, nearly constant ‘overhead’ for setup/transitions.
  • Accuracy Law: If each step has some chance of being correct, and the step successes are roughly independent, then the chance the whole chain is correct shrinks in a predictable, exponential-like way as steps increase.
  • Practical proxies: We can’t measure true complexity perfectly, so we check two stand-ins: (1) Monotonicity—harder variants demand more thinking and yield lower accuracy; (2) Compositionality—two independent parts together should take the sum of compute and have the product of accuracies.

🍞 Anchor: If variant #30 of a synthetic task requires 30 updates, we expect more thinking and, typically, lower accuracy than variant #5. If we combine two independent tasks, we expect combined thinking ≈ thinking(task A) + thinking(task B), and combined accuracy ≈ accuracy(A) × accuracy(B).
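
To make the two dials concrete, here is a minimal numeric sketch of the shapes the laws describe. The constants (tokens per step, fixed overhead, per-step success rate) are made up for illustration; the paper fits such coefficients to data rather than assuming them.

```python
import math

def expected_compute(complexity, tokens_per_step=40.0, overhead=150.0):
    """Compute Law (sketch): reasoning tokens grow roughly linearly with the
    number of necessary steps. Constants are illustrative, not from the paper."""
    return tokens_per_step * complexity + overhead

def expected_accuracy(complexity, step_success=0.95):
    """Accuracy Law (sketch): if each necessary step succeeds independently with
    probability step_success, whole-chain accuracy decays exponentially."""
    return math.exp(complexity * math.log(step_success))

for n in (5, 10, 30):
    print(n, expected_compute(n), round(expected_accuracy(n), 3))
# 5  -> 350.0 tokens, ~0.774
# 10 -> 550.0 tokens, ~0.599
# 30 -> 1350.0 tokens, ~0.215
```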

🍞 Hook: What changes before vs after these laws?

🥬 Before vs After:

  • Before: Models often over- or underthink, especially on combined tasks; they don’t follow a consistent cost plan.
  • After: Models learn to budget thinking proportional to difficulty and to respect additivity on independent parts; their accuracy behaviors become more predictable; overall performance improves.

🍞 Anchor: In tests, enforcing compute compositionality cut deviation by up to 40.5% and increased average one-shot accuracy across six tough benchmarks.

🍞 Hook: What are the building blocks?

🥬 Building blocks (with simple ‘sandwich’ intros for each concept):

  • 🍞 You know how you need a rulebook for a game? 🥬 LORE (Laws of Reasoning) is a rulebook connecting task difficulty, thinking budget, and accuracy. It proposes a Compute Law (thinking grows linearly with complexity) and an Accuracy Law (accuracy decays exponentially), and tests them via monotonicity and compositionality. 🍞 Example: A 10-step puzzle should get about twice the thinking tokens of a 5-step puzzle.
  • 🍞 Imagine counting steps on a staircase. 🥬 Complexity is the minimal number of tiny ‘correct’ steps needed to solve a question. We define it using an ideal checker to verify a solution’s steps, even though computing it exactly is hard. 🍞 Example: A puzzle that needs 12 necessary moves has complexity 12.
  • 🍞 Think of words you write while solving. 🥬 Reasoning Compute is the expected number of reasoning tokens a model produces. We average across outputs to get a stable measure. Without it, we can’t tell how much the model is ‘thinking.’ 🍞 Example: If the model writes ~500 tokens to solve a problem, that’s its compute for the question.
  • 🍞 Like a quiz score. 🥬 Reasoning Accuracy is the probability the final answer is correct. We check many outputs and count how often the answer matches the truth. 🍞 Example: If 6 out of 10 attempts are correct, accuracy is 60%.
  • 🍞 Harder homework takes more time. 🥬 Compute Law: compute ≈ a constant times complexity, plus a tiny overhead. If this fails, models waste compute on easy tasks or starve hard ones. 🍞 Example: Doubling necessary steps should roughly double the thinking tokens.
  • 🍞 Longer chains have more chances to slip. 🥬 Accuracy Law: as steps grow, accuracy decays predictably (exponentially). This sets realistic expectations. 🍞 Example: If each step has a 95% success chance, long chains become unlikely to be perfectly correct.
  • 🍞 Climbing higher hills needs more energy. 🥬 Monotonicity: as complexity rises, compute shouldn’t go down; accuracy shouldn’t go up. 🍞 Example: Variant #25 should demand at least as many tokens as variant #10.
  • 🍞 Two chores’ times add up. 🥬 Compositionality (independent sub-questions): total compute ≈ sum of parts; total accuracy ≈ product of parts. 🍞 Example: If task A needs 300 tokens and task B needs 200, ‘A then B’ should be ~500 tokens.

🍞 Anchor: With these pieces, the model behaves sensibly: bigger tasks → more thinking; combined independent tasks → add thinking; and accuracy shifts predictably.
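
The compositionality expectation can be written as a tiny check. The per-part numbers below are hypothetical; for independent sub-questions, compute should add and accuracy should multiply.

```python
def expected_combined(compute_a, compute_b, acc_a, acc_b):
    """Compositionality (sketch): for two independent sub-questions, thinking
    tokens should add and accuracies should multiply (log accuracies add)."""
    return compute_a + compute_b, acc_a * acc_b

# Hypothetical per-part measurements, not values from the paper.
combined_compute, combined_acc = expected_combined(300, 200, 0.9, 0.8)
print(combined_compute, combined_acc)  # expect ~500 tokens and 0.72 accuracy

# A model that answers the combined question with, say, 350 tokens and
# accuracy 0.60 is underthinking the bundle and violating both expectations.
```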

03 Methodology

At a high level: Input question → Measure/define quantities (complexity, compute, accuracy) → State laws (compute and accuracy) → Replace unmeasurable parts with testable properties (monotonicity, compositionality) → Build benchmarks (LORE-MONO and LORE-COMPO) → Train a simple method (SFT-Compo) to enforce compute-law compositionality → Evaluate.

Step-by-step (with friendly ‘sandwich’ intros for the new pieces):

  1. Defining the key quantities
  • 🍞 Imagine counting steps to finish a maze. 🥬 Complexity is the minimal number of tiny, valid steps needed to solve a question. Formally it’s defined using a Turing-machine-like ‘step’ and a verifier that says whether a full step sequence solves the problem. Why: we need a true yardstick for difficulty, even if it’s hard to compute exactly. 🍞 Example: A matrix update applied N times has complexity roughly N.
  • 🍞 Think of jotting notes while solving. 🥬 Reasoning Compute is the expected number of reasoning tokens the model writes before the final answer. Why: this is the model’s ‘thinking budget’—if we don’t measure it, we can’t test the Compute Law. 🍞 Example: Averaging eight sampled solutions might yield ~1,200 tokens for a geometry question.
  • 🍞 Like a batting average. 🥬 Reasoning Accuracy is the probability the model’s final answer matches the truth. Why: accuracy tells us whether the thinking paid off. 🍞 Example: If 7 of 10 outputs are correct, A = 0.7.
  2. The laws themselves
  • 🍞 Harder puzzles take more time. 🥬 Compute Law says compute should grow roughly in a straight line with complexity (compute ≈ constant × complexity + small overhead). Why: If models don’t do this, they’ll overthink easy tasks or underthink hard ones. 🍞 Example: Doubling necessary steps should roughly double the tokens used.
  • 🍞 More links, more chances to break. 🥬 Accuracy Law says accuracy should decay predictably (exponentially) as necessary steps increase. Why: If each step has some chance of error, long chains are harder to get perfect. 🍞 Example: With independent 95%-correct steps, longer chains are less likely to be 100% correct.
  3. Making the laws testable with proxies
  • 🍞 Ranking by height without a ruler. 🥬 Monotonicity checks whether compute rises (and log accuracy falls) as the task variant index (a proxy for complexity) increases. Why: We can’t measure perfect complexity, but we can order tasks. 🍞 Example: Variant #1 < #2 < … < #30 in required steps.
  • 🍞 Two independent chores add up. 🥬 Compositionality checks whether, for independent sub-questions A and B, compute(A+B) ≈ compute(A) + compute(B), and log accuracy(A+B) ≈ log accuracy(A) + log accuracy(B) (equivalently, accuracy multiplies). Why: Independence means no shared work, so thinking adds; success on both means both must be correct. 🍞 Example: Algebra-only plus Geometry-only.
  • 🍞 Not sharing ingredients. 🥬 Independence is approximated by choosing sub-questions from disjoint concept sets (e.g., different math subjects). Why: Perfect independence is hard to prove; this is a practical, operational proxy. 🍞 Example: Counting & Probability vs. Geometry.
  4. LORE-BENCH: building practical tests
  • 🍞 A science fair with two booths. 🥬 LORE-BENCH has two parts:
    • LORE-MONO (monotonicity): synthetic question families with 30 variants each, where variant N requires N steps, across math, science, language, and code. We compute Spearman correlation between variant index and (i) compute, (ii) log accuracy. Why: It directly tests the monotonicity property. 🍞 Example: If compute tracks variant index closely (correlation near 1), monotonicity holds.
    • LORE-COMPO (compositionality): pairs of independent problems sampled from different math subjects (e.g., Algebra + Geometry). We form a combined question and check how close compute(combined) is to compute(A)+compute(B), and similarly for log accuracy. We use normalized Mean Absolute Deviation (nMAD) to quantify deviation. Why: It directly tests additivity/multiplicativity. 🍞 Example: Small nMAD means the model adds thinking properly.
  • 🍞 Ranking without exact scores. 🥬 Spearman correlation measures whether two things rise/fall together in rank (it doesn’t require exact values). Why: Perfect for monotonicity. 🍞 Example: As variants go from 1→30, compute should also trend upward.
  • 🍞 How far off the sum? 🥬 nMAD (normalized mean absolute deviation) measures the average gap between combined and sum-of-parts, scaled so it’s fair across sizes. Why: Perfect for compositionality. 🍞 Example: nMAD near 0 means near-perfect additivity.
  5. The secret sauce: SFT-Compo training (a code sketch of the selection rule appears after this list)
  • 🍞 Practicing combo drills. 🥬 SFT-Compo is a simple supervised fine-tuning method to teach compute additivity. How it works:
    1. Build triplets (A, B, A⊕B) from distinct categories (to encourage independence).
    2. Sample K outputs per question (A, B, A⊕B) from a strong teacher model.
    3. Keep only triples where all three answers are correct.
    4. Among these, choose the triple that best matches compute additivity by minimizing |len(rA) + len(rB) − len(rA⊕B)|.
    5. Fine-tune the student model on these curated triples, supervising both reasoning chains and final answers. Why this step exists: It directly teaches the model that the combined task should consume the sum of per-part compute. Without it, models often underthink combined tasks and lose accuracy. 🍞 Example: If A takes 1200 tokens and B takes 800, we reward combined solutions near 2000 tokens that are correct, and train the model to reproduce this pattern.
  6. Putting it all together: Input → Define/measure compute & accuracy → Test laws via monotonicity and compositionality (LORE-BENCH) → If broken, apply SFT-Compo → Re-test and evaluate on general benchmarks (GSM8K, MATH500, AIME, AMC, OlympiadBench). The clever bit is choosing supervision signals that embody the law (additive compute) rather than hand-wavy heuristics.
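
As a concrete illustration, here is a minimal sketch of the selection rule from step 5, assuming each sampled trace exposes a reasoning-token count; the field name and the candidate numbers are hypothetical, not the authors' implementation.

```python
def select_compositional_triple(candidates):
    """SFT-Compo curation rule (sketch): among candidate triples whose three
    answers are all correct, keep the one whose combined reasoning length is
    closest to the sum of the per-part lengths. Each trace is assumed to be a
    dict with a 'reasoning_tokens' count (hypothetical field name)."""
    def additivity_gap(triple):
        a, b, ab = triple
        return abs(a["reasoning_tokens"] + b["reasoning_tokens"]
                   - ab["reasoning_tokens"])
    return min(candidates, key=additivity_gap)

# Hypothetical candidates, each a (trace_A, trace_B, trace_A-plus-B) triple.
candidates = [
    ({"reasoning_tokens": 1200}, {"reasoning_tokens": 800}, {"reasoning_tokens": 1400}),
    ({"reasoning_tokens": 1150}, {"reasoning_tokens": 820}, {"reasoning_tokens": 1950}),
]
best = select_compositional_triple(candidates)  # picks the second: |1150 + 820 - 1950| = 20
```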

04 Experiments & Results

The tests: The authors evaluate two big questions. (1) Do today’s reasoning models already follow the laws (monotonicity and compositionality)? (2) If we train them to respect compute compositionality (SFT-Compo), do they improve—both on law-compliance and on real benchmarks?

Setups and models: They test 10 LRMs, including DeepSeek-R1-Distill (1.5B, 7B, 8B, 14B), Phi-4-mini-reasoning, Nemotron-14B, Sky-T1-32B, Qwen3-Next-80B, and length-controlled models like Thinkless-1.5B and AdaptThink-7B. For monotonicity (LORE-MONO), they use curated synthetic tasks across math, science, language, and code, with 30 variants per seed. For compositionality (LORE-COMPO), they form 250 triplets (A, B, A⊕B) from disjoint math subjects.

What they measure and why:

  • Spearman correlation between variant index and (i) reasoning compute, (ii) log accuracy. Why: If compute rises with complexity (near +1) and log accuracy drops (near −1), monotonicity holds.
  • nMAD for compute and log accuracy on composed tasks. Why: If combined compute matches the sum (small nMAD), compositionality holds; similarly for log accuracy (multiplicative accuracy → additive in log-space).
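
Both measurements can be sketched in a few lines. The snippet below uses scipy.stats.spearmanr for the rank correlation and a simple normalized mean absolute deviation; the toy numbers and the nMAD normalization (dividing by the sum-of-parts target) are assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.stats import spearmanr

# Monotonicity (sketch): rank correlation between variant index and a measured
# quantity such as compute or log accuracy. Toy numbers, not the paper's data.
variant_index = np.arange(1, 31)
compute = 40 * variant_index + np.random.default_rng(0).normal(0, 20, size=30)
rho, _ = spearmanr(variant_index, compute)  # near +1 when compute rises with difficulty

# Compositionality (sketch): normalized mean absolute deviation between the
# combined measurement and the sum of its parts.
def nmad(combined, part_a, part_b):
    combined = np.asarray(combined, dtype=float)
    target = np.asarray(part_a, dtype=float) + np.asarray(part_b, dtype=float)
    return float(np.mean(np.abs(combined - target) / np.abs(target)))

print(rho, nmad([1400, 1950], [1200, 1150], [800, 820]))
```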

Competition/baselines: They compare base models to SFT-Compo versions, and also include a control SFT (uniformly sampling one correct trace per question without enforcing additivity) to show that benefits come from compositionality, not just distilling a stronger teacher’s answers.

Scoreboard with context:

  • Monotonicity: Most models show very high positive Spearman correlations for compute (often ~0.99 overall) and strong negative correlations for log accuracy (near −0.95), meaning as variants get harder, models generally think more and, as expected, get less accurate. One weak model (DeepSeek-R1-1.5B) struggles in some domains (e.g., language compute correlation −0.346), showing that not all models reliably scale their thinking with difficulty.
  • Compositionality: nMAD values are large across models for both compute and log accuracy, meaning most LRMs fail additivity/multiplicativity on independent sub-questions. Even models with length-control mechanisms still deviate a lot, so simple length-control isn’t enough to make reasoning compositional.

Now the key intervention—SFT-Compo:

  • Compute compositionality improves strongly: On DeepSeek-R1-1.5B, nMAD for compute drops from 0.528 to 0.314 (a 40.5% reduction). On 8B, from 0.423 to 0.328 (22.5% reduction). Scatter plots move closer to the ideal y = x line.
  • General performance improves: Across GSM8K, MATH500, AIME 2024/2025, AMC 2023, and OlympiadBench, SFT-Compo beats the base model across all sizes (1.5B, 7B, 8B, Phi-4-mini). For instance, the 8B model gains +5.0 average Pass@1 across the six benchmarks. Importantly, SFT-Compo consistently beats the control SFT, showing that teaching additivity (not just copying a teacher) delivers extra gains.

Surprising findings (synergy):

  • Training for compute compositionality also improves compute monotonicity. The weak 1.5B model’s overall Spearman compute correlation jumps from 0.875 to 0.977; in code, from 0.151 to 0.914.
  • It also improves accuracy compositionality: nMAD for log accuracy drops by 71.1% on 1.5B (2.368 → 0.685) and 35.4% on 7B (1.170 → 0.756). This suggests the properties and laws are interlinked: teach one well and others can fall into place.

Takeaway: Most current models behave sensibly as problems get harder (monotonicity), but they stumble when combining independent tasks (compositionality). A simple, targeted fine-tuning (SFT-Compo) teaches them to budget thinking additively—and that cascades into better law-following and higher real-world benchmark scores.

05 Discussion & Limitations

Limitations:

  • Benchmark coverage: LORE-MONO currently includes 40 seed questions. While carefully built and checked to avoid shortcuts like periodic answers, its topic breadth can grow further.
  • Independence proxy: The paper operationalizes independence by choosing sub-questions from disjoint subject labels (e.g., Algebra vs. Geometry). True cognitive independence is subtler; future work could refine or learn better independence tests.
  • Model scope: The study focuses on strong open-source LRMs due to resource limits; closed-source models weren’t broadly evaluated.

Required resources:

  • Data construction for triplets (A, B, A⊕B) and multi-sampling K outputs per question benefits from a stronger teacher (e.g., a 14B model) and compute to sample/score many traces.
  • Fine-tuning requires standard SFT infrastructure and the ability to curate correct, near-additive reasoning traces.

When not to use:

  • Highly entangled tasks: If A and B share internal steps or heavily interact, compute additivity may not be the right target; enforcing additivity could mislead the model.
  • Non-reasoning tasks: For tasks where long chain-of-thought isn’t needed or desirable (e.g., direct retrieval), pushing additivity doesn’t add value.

Open questions:

  • Better independence: Can we learn or verify independence automatically from model internals (e.g., attention/activation patterns) rather than subject labels?
  • Beyond linearity and exponentials: Do some domains demand different scaling shapes (e.g., sublinear or superlinear compute, different accuracy curves) and how would proxies adapt?
  • Multi-part structures: For chains with more than two parts, does enforcing pairwise additivity suffice, or do we need structured curricula (trees, DAGs) of additivity constraints?
  • Other laws: Are there ‘memory laws’ or ‘search laws’ that, together with compute/accuracy, better explain and improve reasoning?
  • Robustness and transfer: How well does compute compositionality training transfer across domains (e.g., from math to code) and to new reasoning styles (e.g., tree-of-thought)?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces two simple Laws of Reasoning—compute should scale linearly with problem complexity, and accuracy should decay exponentially—and provides practical tests (monotonicity and compositionality) to check them without needing exact complexity. A new benchmark (LORE-BENCH) shows most models pass monotonicity but fail compositionality, especially on combined independent tasks. A simple fine-tuning method (SFT-Compo) that enforces compute additivity fixes this gap and boosts performance across diverse reasoning benchmarks.

Main achievement: Turning fuzzy intuitions about ‘how much to think’ into crisp, testable laws—and showing that making models follow these laws leads to real, measurable gains.

Future directions: Expand LORE-BENCH across more domains and richer independence notions; explore new laws beyond compute/accuracy (e.g., memory or search); and design curricula that teach compositionality over larger chains and graphs of sub-problems. Investigate how these laws interact with inference-time strategies and whether similar laws hold for multimodal reasoning.

Why remember this: It’s a clean, physics-like approach to reasoning—simple laws, practical proxies, targeted training—that not only explains why models misbehave on composites but also provides a lightweight fix that makes them think more like careful humans.

Practical Applications

  • Adaptive tutoring: Ensure the model spends appropriate reasoning on harder student problems and reliably handles multi-part questions.
  • Code assistants: Improve step-by-step debugging and refactoring by allocating the right compute to each independent module or function.
  • Math solvers: Boost performance on competitions (AIME/AMC) by enforcing additive thinking across independent sub-problems.
  • Workflow planning: Make multi-stage plans (research, data prep, analysis) more robust by teaching models to ‘add’ compute across stages.
  • Scientific pipelines: Improve reliability when combining independent lab calculations or simulations by aligning compute with complexity.
  • Customer support: Handle multi-issue tickets by allocating the sum of compute needed for each issue rather than underthinking the bundle.
  • Data analysis: For dashboards with separate metrics, ensure the model spends additive reasoning across independent metrics before summarizing.
  • Content generation: For multi-section documents, allocate reasoning per section and maintain overall consistency via additive planning.
  • Educational content creation: Build progressive problem sets (monotonic variants) and verify models scale their reasoning appropriately.
  • Model evaluation: Use LORE-BENCH to diagnose over/underthinking and guide targeted fine-tuning like SFT-Compo.
#Large Reasoning Models #Laws of Reasoning #Compute Law #Accuracy Law #Monotonicity #Compositionality #Reasoning Compute #LORE-BENCH #LORE-MONO #LORE-COMPO #Spearman correlation #nMAD #SFT-Compo #Chain-of-Thought #Reasoning benchmarks