Effective Reasoning Chains Reduce Intrinsic Dimensionality
Key Summary
- The paper asks a simple question: which kind of step-by-step reasoning helps small language models learn best, and why?
- The authors propose a clear, countable score called intrinsic dimensionality: the minimum number of trainable parameters a model needs to reach a fixed accuracy on a task.
- Surprisingly, better reasoning chains make the task simpler for the model, so the model needs fewer adjustable parts to learn it well.
- Using Gemma-3 1B and 4B models on GSM8K math problems, intrinsic dimensionality strongly predicts generalization to new, tougher test sets.
- For the 4B model, intrinsic dimensionality correlates with overall accuracy at 0.93 (very strong), beating other metrics like length (0.31), KL divergence (-0.17), and token perplexity (0.82).
- Executed Program-of-Thought (code that actually runs) has the lowest intrinsic dimensionality and the best out-of-distribution performance (43.40% OOD, 46.15% overall).
- The finding holds for the smaller 1B model too (0.75 correlation), showing the idea works across sizes.
- Longer explanations are not automatically better; quality and structure that compress the problem matter more than raw length.
- The metric is robust to how you pick the accuracy threshold and can be estimated early in training, saving time.
- Bottom line: effective reasoning chains help models "explain less with more sense," letting them learn with fewer knobs to turn.
Why This Research Matters
This work gives builders a reliable way to pick the best kind of explanations to train models, leading to better performance on new, tricky problems. It can cut training costs by identifying reasoning styles that let models learn with fewer adjustable parameters. It helps avoid wasting resources on overly long or noisy chains that look thoughtful but don't teach the model the rule. It guides data collection and alignment: choose prompts and rationales that compress the task. It also supports safer, more robust applications (like tutoring or planning) because simpler, clearer internal rules tend to be more reliable. Finally, it connects modern practice to classic ideas about simplicity and compression, giving a solid theory for everyday choices in reasoning data.
Detailed Explanation
01 Background & Problem Definition
You know how when you study for a test, some explanations make things click right away, while others are long but still confusing? Models feel the same way about different kinds of step-by-step reasoning. Researchers have tried many styles: short explanations, long ones, plans before solving, even writing little programs. But it has been hard to say, in a single number, which style truly helps a model learn and generalize to new problems.
Hook: Imagine two teachers. One uses a tidy checklist; the other rambles for ten minutes. Which one helps you remember and use the idea later? Probably the tidy one.
The Concept (Chain-of-Thought Reasoning):
- What it is: Chain-of-Thought (CoT) is when a model shows its steps before giving the final answer.
- How it works:
- The model reads the question.
- It writes a short reasoning chain (like notes on a scratchpad).
- It gives the final answer.
- Why it matters: Without CoT, the model might jump to answers without building the bridge from question to solution, which can hurt learning and generalization.
Anchor: When solving "48 in April and half as many in May," CoT writes 48/2 = 24, then 48 + 24 = 72, and finally answers 72.
Before this paper, people suggested reasons why some CoT styles seem to work: maybe they are longer (more thinking), more structured (clear order), or closer to the model's pretraining style (easier to read). But these ideas either weren't countable in a reliable way or didn't predict results consistently. For example, some studies found longer chains help; others found making them longer can actually hurt.
Hook: You know how packing a suitcase is easier if your clothes fold well? Then the same suitcase (space) can hold more outfits (ideas).
The Concept (Minimum Description Length Principle):
- What it is: MDL says the best explanation is the shortest one that still fits the facts.
- How it works:
- Consider all explanations that fit the data.
- Prefer the one that needs the fewest bits to describe.
- Shorter, cleaner rules tend to generalize better.
- Why it matters: If a reasoning style lets a model describe a task with fewer adjustable parts, that style should generalize better.
Anchor: A math rule like "total = April + May, and May = April/2" is a short story that neatly explains many similar problems.
This paper brings in a powerful, countable idea from earlier deep learning research: intrinsic dimensionality. Instead of guessing that "long equals good," it measures how many parameters the model actually needs to tune to reach a given accuracy with each reasoning style. If a style needs fewer adjustable parts to learn the same skill, that style is likely better at teaching the model the underlying rule.
Hook: Think of a piano with thousands of keys (silly, I know). If you can play a song by moving only a few keys, the song is simpler to learn.
The Concept (Intrinsic Dimensionality):
- What it is: Intrinsic dimensionality is the minimum number of adjustable parameters a model needs to reach a set accuracy on a task.
- How it works:
- Freeze the big model.
- Allow it to learn in a small, low-rank space (a tiny slice of knobs).
- Gradually make that space bigger until training hits the accuracy threshold.
- Why it matters: If a reasoning style reaches the target accuracy with fewer knobs, it's teaching a simpler, more compressible mapping from question to answer, and should generalize better.
Anchor: If "Executed Program-of-Thought" needs only a small number of tunable parameters to reach good accuracy, it's likely a better teacher than a long, messy explanation that needs many.
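The capacity search described above reduces to a simple rule: the smallest capacity whose training accuracy clears the threshold. A minimal sketch, with toy sweep results standing in for real LoRA runs and a hypothetical helper name:

```python
def intrinsic_dimensionality(accuracy_by_params, tau):
    """Return the smallest trainable-parameter count whose training
    accuracy meets or exceeds the threshold tau (None if none does)."""
    for n_params in sorted(accuracy_by_params):
        if accuracy_by_params[n_params] >= tau:
            return n_params
    return None

# Toy sweep results (param count -> training accuracy), illustrative only.
executed_pot = {100_000: 0.41, 1_490_000: 0.66, 10_000_000: 0.71}
very_short_cot = {100_000: 0.12, 1_490_000: 0.35, 532_810_000: 0.64}

tau = 0.63  # common threshold, e.g. 90% of the best epoch-1 accuracy
print(intrinsic_dimensionality(executed_pot, tau))    # 1490000
print(intrinsic_dimensionality(very_short_cot, tau))  # 532810000
```

The style that crosses the threshold with the smaller capacity is the more compressible teacher.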
To do this safely and efficiently on language models, the paper uses LoRA, a standard method to fine-tune only a small, low-rank slice of the model.
Hook: Instead of rebuilding a bicycle, you just adjust the seat and handlebars to fit you.
The Concept (Low-Rank Adaptation, LoRA):
- What it is: LoRA fine-tunes a small, low-rank set of updates inside a big model, so you change less while learning more.
- How it works:
- Pick certain weight matrices (like attention or MLP parts).
- Add tiny low-rank adapters (two small matrices whose product is the update).
- Train only those adapters; keep the original model frozen.
- Why it matters: LoRA lets us measure how much small-capacity learning is enough, which is exactly what intrinsic dimensionality needs.
Anchor: It's like putting custom insoles in your shoes instead of buying new shoes; small adjustments can do the trick.
Finally, to check if simpler (lower intrinsic dimensionality) really means "learns better," the paper measures generalization: how well the model solves new, different problems after training.
Hook: A good math class doesn't just prepare you for the quiz; it prepares you for new kinds of word problems later.
The Concept (Generalization Performance):
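A back-of-the-envelope sketch of why LoRA gives such fine control over capacity (the layer sizes and helper name here are illustrative): the adapter's update to a frozen weight W is the product of two small matrices, A (d_out x r) and B (r x d_in), so the trainable count grows with the rank r rather than with the full matrix:

```python
# Toy layer sizes (illustrative; real model layers are much larger).
d_out, d_in = 1024, 1024
full_params = d_out * d_in  # updating W directly: ~1.05M values per layer

def lora_params(d_out, d_in, rank):
    """Trainable values in a rank-r adapter: A is d_out x r, B is r x d_in.
    The effective weight during training is W + A @ B, with W frozen."""
    return rank * (d_out + d_in)

for r in (1, 4, 16):
    print(r, lora_params(d_out, d_in, r), "vs", full_params)
```

Sweeping the rank (and how many matrices get adapters) is what lets the paper dial trainable capacity from thousands to hundreds of millions of parameters.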
- What it is: Generalization is how well a model performs on new, unseen questions after training.
- How it works:
- Train on one set of problems.
- Test on the usual test set (in-distribution) and on trickier variations (out-of-distribution).
- Compare accuracy.
- Why it matters: If a reasoning style truly teaches the rule, the model should handle fresh questions, not just repeat training patterns.
Anchor: A model trained with clean step-by-step code may nail tough new GSM variants (like GSM-Symbolic or GSM-Hard) better than one trained on rambling text.
The real stakes are practical: picking the right reasoning style can save money (fewer parameters to train), improve reliability on tricky inputs (like with distractors), and guide data collection (choose explanations that actually help models learn the rule, not just look smart).
02 Core Idea
Aha! The key insight is: Effective reasoning chains reduce intrinsic dimensionality, which makes the task easier for the model to learn and leads to better generalization.
Three analogies:
- Hiking trails: A well-marked trail (clear reasoning) guides you straight to the view with minimal wandering (fewer parameters). A confusing trail (messy reasoning) forces lots of detours (more parameters).
- Recipes: A tidy recipe with exact steps lets even a new cook succeed (low intrinsic dimensionality). A vague, long story about cooking takes more practice and guessing (high intrinsic dimensionality).
- Jigsaw puzzles: If someone sorts the pieces by color and edge first (structured reasoning), you need fewer guesses to complete the picture. That's learning with fewer adjustable knobs.
Before vs. After:
- Before: We often judged reasoning by length or by gut-feel ideas like "more structure is better," but these weren't consistent or quantifiable across styles.
- After: We can now ask, "How many trainable parameters does each reasoning style need to reach the same accuracy?" Lower is better, and it strongly predicts generalization.
Why it works (intuition, no equations):
- A reasoning chain that logically bridges input to answer compresses the mapping the model must learn. Compression means the rule can be captured in a smaller subspace (fewer directions to adjust), so the model needs fewer trainable parameters to reach the same skill level.
- Minimum Description Length backs this up: shorter, cleaner internal rules tend to generalize. Intrinsic dimensionality is a way to count how short that internal rule effectively is for the model.
- LoRA lets us dial how many parameters can move. If a style reaches the target accuracy with a smaller dial, it's more compressible.
Building blocks of the idea:
- Fix the model (same architecture and pretraining).
- Change only the outputs the model is trained to produce (different reasoning styles over the same questions).
- For each style, train with increasing LoRA capacity and record when it first passes a common accuracy threshold.
- That capacity is the intrinsic dimensionality for that style.
- Compare intrinsic dimensionality to test performance, both in-distribution and out-of-distribution.
What changes because of this idea:
- We can rank reasoning styles using a principled, early-in-training measurement.
- We can choose or design data collection prompts that produce more compressible chains.
- We can save compute by not over-investing in long or fancy reasoning that doesn't actually compress the task.
03 Methodology
At a high level: Problems → Generate versions with different reasoning styles → Fine-tune the same model using small adjustable adapters (LoRA) of various sizes → Find the smallest size that reaches the accuracy threshold → That size is intrinsic dimensionality → Compare this number to generalization.
Step-by-step, like a recipe:
- Start with the same set of math word problems (GSM8K training split).
- What happens: We keep inputs fixed so any performance difference comes from how outputs (the reasoning chains) are written.
- Why it exists: Controls the experimentâonly the style of reasoning changes.
- Example: The question "48 in April, half in May, total?" stays the same across all styles.
- Create multiple training sets, one per reasoning style.
- What happens: For each question, generate an answer formatted as one of many strategies: No CoT (just the answer), Short CoT, Very Short CoT, Critical CoT, Plan-and-Solve, Executed Program-of-Thought (PoT with real code execution), Simulated PoT, and more. Filter to keep only correct final answers.
- Why it exists: Lets us compare styles fairly, using the same questions but different solution formats.
- Example: Executed PoT returns a tiny Python function that computes 48, then 24, then 72.
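A training target in the Executed PoT style might look like the following sketch (the function name and exact format are illustrative, not the paper's template); the key property is that the final answer comes from actually running the code:

```python
def solution():
    # "Natalia sold 48 clips in April and half as many in May."
    april = 48
    may = april // 2       # half as many in May
    return april + may     # total across both months

print(solution())  # 72
```

Because the answer is computed rather than written out, a correct program is guaranteed to produce a correct final answer when executed.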
- Fix the base model and fine-tune using LoRA adapters only.
- What happens: The core Gemma-3 (1B or 4B) weights are frozen. We add small low-rank adapters (LoRA) to attention and/or MLP layers and only train those.
- Why it exists: This controls the number of trainable parameters, which we vary to measure intrinsic dimensionality.
- Example: Start with rank-1 adapters on a few attention matrices (very few knobs), then increase rank and/or target more matrices.
- Sweep over adapter sizes to cover tiny to large capacity.
- What happens: Try around 20-30 configurations (depending on model size), spreading parameter counts evenly on a log scale, from very small (rank 1 on a small subset) to very large (high rank on all attention and MLP layers).
- Why it exists: We need a smooth curve showing training accuracy vs. trainable parameter count to find the smallest size that works.
- Example: For the 4B model, Very Short CoT needed hundreds of millions of parameters to pass the threshold, while Executed PoT needed around 1.49M.
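Spreading candidate capacities evenly on a log scale can be sketched as follows (the endpoints, count, and helper name are made up for illustration; the real sweep is realized by choosing LoRA ranks and target matrices, not arbitrary counts):

```python
import math

def log_spaced_budgets(lo, hi, n):
    """Return n parameter budgets spaced evenly on a log scale in [lo, hi]."""
    step = (math.log10(hi) - math.log10(lo)) / (n - 1)
    return [round(10 ** (math.log10(lo) + i * step)) for i in range(n)]

budgets = log_spaced_budgets(10_000, 1_000_000_000, 26)
print(budgets[0], budgets[-1], len(budgets))  # 10000 1000000000 26
```

Log spacing matters because the interesting crossover points can sit anywhere from roughly a million parameters (Executed PoT) to hundreds of millions (Very Short CoT).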
- Choose a common accuracy threshold τ and find when each style first crosses it.
- What happens: We set τ to a common standard (e.g., 90% of the best training accuracy achieved by any style after epoch 1) and read off the minimum parameter count where each style reaches τ.
- Why it exists: Using a common τ makes intrinsic dimensionality comparable across styles, even if some styles have different maximum accuracies.
- Example: On Gemma-3 4B, the threshold used in plots was 63.0%. Executed PoT crossed it at about 1.49M parameters; Short CoT crossed later at about 3.92M; Very Short CoT much later at about 532.81M.
- Evaluate generalization separately.
- What happens: For each style, we also fine-tune with full capacity and then test on:
- In-distribution: GSM8K test set.
- Out-of-distribution: Five stress test splits (e.g., GSM-Symbolic, GSM-IC, GSM-Hard) that add wording changes, irrelevant sentences, or harder arithmetic.
- Why it exists: We want to see if lower intrinsic dimensionality actually predicts better performance on new, different problems.
- Example: Executed PoT had the best OOD performance (43.40%) among styles tested on the 4B model.
- Compare intrinsic dimensionality to other easy-to-compute metrics.
- What happens: We also measure average reasoning length, token perplexity (how surprising the text is to the base model), and sequence-level KL divergence.
- Why it exists: To check if intrinsic dimensionality really adds value over common, convenient proxies.
- Example: Length had weak correlation (0.31 on 4B); perplexity was better (0.82) but still below intrinsic dimensionality (0.93); KL divergence was not helpful (-0.17).
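The comparisons above are standard correlation coefficients between a per-style predictor and per-style accuracy. A minimal Pearson sketch with made-up numbers (not the paper's measurements); here lower log-scale intrinsic dimensionality tracks higher accuracy, so the raw coefficient comes out strongly negative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-style values: log10(intrinsic dim) vs. overall accuracy (%).
log_id = [6.2, 6.6, 8.7, 7.5]
acc    = [46.1, 41.0, 18.3, 30.2]
print(pearson(log_id, acc))  # strongly negative for these toy values
```

The magnitude of this coefficient is what gets compared across predictors (intrinsic dimensionality, length, perplexity, KL divergence).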
The secret sauce:
- Keep the model the same, and change only the way solutions are written. This isolates the effect of reasoning style.
- Use LoRA to precisely control how many parameters are allowed to move.
- Use a common accuracy threshold chosen early in training (epoch 1) to avoid overfitting and to enable quick, inexpensive estimates.
- Read off a single number, the smallest capacity needed, that acts like a "simplicity score" for each reasoning style.
Putting it together with a tiny concrete walk-through:
- Input: "Natalia sold 48 clips in April and half as many in May. What is the total?"
- Styles:
- No CoT: "Answer: 72."
- Short CoT: "May is 48/2 = 24; total 48+24 = 72. Answer: 72."
- Executed PoT: A small Python function that computes 72.
- Training with small adapters:
- Try rank 1 on a few attention matrices: see training accuracy.
- Increase rank or cover more matrices: see training accuracy rise.
- Record the smallest parameter count where accuracy passes τ.
- Result: Executed PoT crosses τ earliest (lowest intrinsic dimensionality); later, when training fully, it also performs best on new, harder tests. That's the compressibility-generalization connection in action.
04 Experiments & Results
The test: Can intrinsic dimensionality predict how well a model generalizes after being trained on one particular reasoning style?
- Models: Gemma-3 1B and 4B (same architecture per size; only LoRA adapters change during training for the intrinsic dimensionality measurement).
- Data: GSM8K for training; test on GSM8K (ID) and five OOD splits (GSM-Symbolic main/P1/P2, GSM-IC, GSM-Hard). We report overall performance as the geometric mean across all six splits.
- Reasoning styles compared: No CoT, No CoT+extra tokens, Very Short CoT, Short CoT, Short CoT with distractors (2, 4, 8), Gemma 27B CoT, Gemini CoT, Executed PoT, Simulated PoT, Plan-and-Solve, Critical CoT, High Review Ratio CoT.
- Baseline metrics: Length, token perplexity, sequence KL divergence.
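The "overall" score above is the geometric mean across the six splits. A sketch with toy accuracies (only the 62.77% ID figure is from the paper) shows the computation; the geometric mean is always at most the arithmetic mean, so weak splits are penalized:

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values, computed stably in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# One in-distribution split plus five OOD splits (toy accuracies, percent).
splits = [62.77, 48.0, 45.0, 44.0, 40.0, 38.0]
overall = geometric_mean(splits)
arith = sum(splits) / len(splits)
print(round(overall, 2), round(arith, 2))
```

Using a geometric mean means a style cannot score well overall by acing one split while collapsing on another.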
The competition: How do different metrics rank the styles?
- On Gemma-3 4B:
- Intrinsic dimensionality vs. accuracy correlation: 0.93 (very strong; statistically significant).
- Token perplexity vs. accuracy: 0.82 (strong, but worse than intrinsic dimensionality).
- Length vs. accuracy: 0.31 (weak; longer isn't reliably better).
- KL divergence vs. accuracy: -0.17 (not predictive here).
- On Gemma-3 1B:
- Intrinsic dimensionality vs. accuracy: 0.75 (strong; statistically significant).
- Token perplexity vs. accuracy: 0.63 (moderate-strong).
- Length vs. accuracy: 0.24 (weak).
- KL divergence vs. accuracy: -0.18 (not predictive).
The scoreboard with context (4B highlights):
- Executed PoT: ID = 62.77% (GSM8K), OOD = 43.40%, Overall = 46.15%, and the lowest intrinsic dimensionality among tested styles (~1.49M params to cross the 63.0% threshold). That's like getting an A on the regular test and also an A- on the tough surprise quiz, using the fewest study notes.
- Gemma 27B CoT and High Review Ratio CoT also score highly overall, but need more capacity than Executed PoT to hit the same training threshold, suggesting they're less compressible.
- Short CoT is strong but not the best; Very Short CoT with distractors and No CoT trail behind, with much higher intrinsic dimensionality.
The scoreboard with context (1B highlights):
- The absolute accuracies are lower (smaller model ceiling), but the pattern remains: intrinsic dimensionality still predicts which styles generalize better. Executed PoT again shines for OOD and overall performance (ID 20.24%, OOD 11.00%, Overall 11.76%) and crosses a lower, 24.3% threshold using only ~1.03M parameters.
Surprising findings:
- Longer chains are not automatically better. Length had weak correlation. In fact, some very long styles didn't compress the task well for the model.
- Token perplexity helps, but not as much as intrinsic dimensionality. This suggests that being "familiar" to the base model is useful, but the deeper story is about how well the reasoning chain turns the task into a simpler, learnable rule.
- Threshold robustness: Whether the threshold was set using 70%, 80%, or 90% of best epoch-1 training accuracy, or 90% of validation accuracy, correlations stayed high (around 0.72 to 0.94). This means the measurement is stable, not a cherry-picked setting.
- Bigger isn't lazier. Larger models (4B) often compress effective reasoning strategies more efficiently than smaller ones (1B), achieving higher accuracy with comparable or even fewer effective "degrees of freedom" relative to task complexity. But for messy, noisy strategies (like distractor-heavy), the larger model's intrinsic dimensionality grew a lot, suggesting that big models won't just memorize noise cheaply.
Bottom line: The single number, the minimum parameters needed to hit a target accuracy, predicts which reasoning styles truly teach the model the rule, and which just look busy.
05 Discussion & Limitations
Limitations:
- Compute cost: Measuring intrinsic dimensionality requires training many LoRA variants per reasoning style. That's more expensive than computing quick text metrics like length or perplexity.
- Task/domain scope: The study focuses on grade-school math problems. While results are strong, we still need to test this on other domains (science QA, planning, coding with complex environments) to be sure it generalizes widely.
- Data generation quality: Most reasoning chains are made by teacher models and filtered for correct final answers. If teachers or filters are biased, that could shape which styles appear effective.
- Threshold choice: Although the results are robust to several thresholds, any metric with a threshold can be sensitive in edge cases. Very noisy or very easy tasks might compress oddly.
Required resources:
- Access to base models (Gemma-3 1B/4B or similar), GPUs/TPUs, and enough budget to run 20-30 adapter configurations per style during the sweep.
- Teacher models (for some styles) to generate reasoning chains and a validation setup for picking checkpoints in full-capacity runs.
When not to use:
- If you cannot afford multiple fine-tuning runs, intrinsic dimensionality measurement may be too costly.
- If your task is not about reasoning (e.g., pure style imitation), simpler text metrics or small pilot studies may be enough.
- If your pipeline can't freeze base weights (e.g., you must full-fine-tune), you'll lose the clean control needed for precise intrinsic dimensionality measurement.
Open questions:
- Can we predict intrinsic dimensionality without many training runs, perhaps from early training curves, small pilot subsets, or features of the reasoning text?
- How does intrinsic dimensionality behave for multimodal reasoning (text + images) or interactive agents with tools?
- Can we design reward models or data selection rules that directly encourage low intrinsic dimensionality (more compression), not just correctness?
- What properties of reasoning chains (structure, variables, verification steps) most strongly drive the compression effect? Can we auto-generate such structures reliably?
- Can we develop adapter-sweep shortcuts (smart search, Bayesian optimization) to reduce compute while preserving accuracy of the estimate?
06 Conclusion & Future Work
Three-sentence summary:
- The paper shows that the best step-by-step reasoning chains are the ones that let a model learn the task using the fewest adjustable parameters, a property called intrinsic dimensionality.
- Measuring this number by sweeping LoRA adapter sizes strongly predicts which reasoning styles will generalize well to new, tricky problems.
- This finding holds across model sizes and is more reliable than common proxies like chain length.
Main achievement:
- Introducing intrinsic dimensionality as a practical, quantitative, and highly predictive metric for evaluating the effectiveness of reasoning chains, with strong correlations to generalization performance.
Future directions:
- Create faster, cheaper ways to approximate intrinsic dimensionality from early training signals or text features.
- Extend to new domains (science reasoning, real-world planning) and modalities (vision + language).
- Use the metric to auto-select or auto-generate better reasoning chains during data collection and training.
Why remember this:
- It reframes "good reasoning" as "reasoning that compresses the task," connecting directly to classic ideas about simplicity and generalization. With one clear number, we can choose reasoning styles that truly teach models the underlying rule, not just talk longer.
Practical Applications
- Filter and select reasoning chains that yield lower intrinsic dimensionality to build higher-quality training datasets.
- Design prompts that produce concise, structured chains (e.g., Executed PoT) rather than just longer text.
- Estimate intrinsic dimensionality early (after epoch 1) to pick the best reasoning style without full training.
- Create curriculum learning pipelines that start with more compressible chains and gradually introduce complexity.
- Use intrinsic dimensionality as a target when training reward models or verifiers, encouraging simpler internal rules.
- Allocate compute more wisely: favor styles that reach thresholds with fewer parameters for cost-effective fine-tuning.
- Adapt reasoning style to model size (1B vs. 4B), choosing chains that compress well for that model.
- Diagnose noisy or distractor-heavy data by observing unusually high intrinsic dimensionality.
- Benchmark new reasoning templates with a quick LoRA sweep before large-scale deployment.
- Guide teacher model prompting to produce code-executable or clearly structured solutions that compress the task.