Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy
Key Summary
- Large language models (LLMs) are good at many math problems but often mess up simple counting when the list gets long.
- This happens because transformers add up counts slowly across layers, and their “internal counter” runs out of room for big numbers.
- The paper’s fix is a System-2 strategy: split a long list into smaller groups using a separator like |, count each group, then add the group totals.
- This only works reliably when you combine structure (the | separators) with showing your work (intermediate steps like “part 1: 7”).
- Mechanistic tests show where the counts live: mainly in the last item and the final comma of each group; special attention heads move these counts to the interim steps and then to the final sum.
- Attention concentrates in middle-to-late layers (for Qwen2.5 7B, around layers 19–23), with specific heads handling sub-count transfer and final aggregation.
- Across open and closed models, “structured + with steps” keeps accuracy high even for 100 items, while other settings collapse.
- This is a test-time trick—no training changes—so you can boost counting right now by prompting differently.
- The method gives a clear, stage-by-stage picture of how numbers flow inside LLMs, improving both performance and interpretability.
- The same blueprint likely helps other reasoning tasks that overload a model’s “fast” System-1 but succeed when decomposed with System-2.
Why This Research Matters
Many real tasks involve counting or controlled lengths, like making summaries a fixed size, listing steps in order, or tallying items in receipts and inventories. This strategy offers an immediate, low-cost way to improve reliability: just change the prompt, no retraining needed. It also gives a window into the model’s internals—showing where numbers live and how they move—which helps build trust. The same pattern can help with other overloaded tasks that fail when done in one shot but succeed when broken into steps. Better reliability in simple counting can ripple out to better planning, checking, and auditing in everyday AI uses. Ultimately, it nudges models toward clearer, more faithful reasoning we can inspect and verify.
Detailed Explanation
01 Background & Problem Definition
You know how when you try to count a giant pile of LEGO bricks, it’s easy to lose track unless you group them into small stacks? Before this research, large language models (LLMs) were a bit like kids trying to count an entire heap at once—they’d start strong, but once the pile got big, they’d stumble.
🍞 Top Bread (Hook): Imagine your brain has two modes: quick guesses and careful steps. LLMs often use the quick mode for counting. 🥬 The Concept (Transformers): What it is: A transformer is the brain design most modern LLMs use to read and write language. How it works: 1) Split text into tokens; 2) Layers of “attention” let each token look at important neighbors; 3) Layers stack to build understanding; 4) The model predicts the next token. Why it matters: Without transformers, the model couldn’t track patterns across long text. 🍞 Anchor: Reading “apple, apple, apple” works because transformers pass along patterns like “we’ve seen another apple.”
🍞 Hook: You know how you look at the teacher when they say the important part? That’s attention. 🥬 The Concept (Attention Mechanism): What it is: Attention tells the model which tokens to focus on right now. How it works: 1) Each token asks, “Who matters to me?”; 2) It gives higher scores to helpful tokens; 3) It blends information mostly from high-scoring tokens. Why it matters: Without attention, the model treats all words equally and gets confused. 🍞 Anchor: When answering “What’s the capital of France?”, attention locks onto “capital” and “France,” not “What’s the.”
🍞 Hook: Think of counting beads by moving one bead at a time on a string. 🥬 The Concept (Counting Mechanism in LLMs): What it is: LLMs build a hidden count little by little across layers. How it works: 1) See an item token; 2) Update a tiny “counter” signal; 3) Repeat for each item; 4) After enough layers, the model guesses the number. Why it matters: Because the counter grows across layers, if the list is longer than what the depth can handle, precision collapses. 🍞 Anchor: Models count small lists (like 5–10) well, but for bigger ones (30+), answers start drifting.
🍞 Hook: You know how you can quickly say “there are some” (fast), or carefully tally “1, 2, 3…” (slow)? 🥬 The Concept (System-1 vs System-2): What it is: System-1 is fast and automatic; System-2 is careful and step-by-step. How it works: 1) System-1 gives speedy guesses; 2) System-2 breaks the problem into clear steps; 3) It checks and sums; 4) It writes down intermediate results. Why it matters: Counting long lists needs System-2, because fast guesses run out of accuracy. 🍞 Anchor: If you split 90 marbles into piles of 10, count each pile, then add them, you make fewer mistakes than trying to count 90 in one breath.
The problem: Researchers kept seeing LLMs fail on long counts—even when the items were simple and repeated. The more items there were, the worse the answers got. Some models also “squish” big numbers onto a compressed, fuzzy number line, so close-together big numbers are hard to tell apart.
Failed attempts: 1) Just ask for the final answer (fast System-1) — works for small lists but fails for big ones. 2) Add Chain-of-Thought (show your steps) without structuring the input — still not enough. 3) Structure the input into parts but don’t write steps — also not enough.
The gap: LLMs needed both a neat structure and explicit step-by-step counting. Only then could they escape the “depth limit” that caps their counting.
Real stakes: This matters for everyday tasks like length-controlled summaries, step-by-step instructions, counting items in receipts or inventories, and any job where you must total up many pieces without slipping. It also shows us how to talk to models in ways that make their thinking clearer and more trustworthy.
02 Core Idea
The “Aha!” in one sentence: If a long count overwhelms the model’s fast System-1, make it think in System-2 by slicing the list into small chunks, counting each chunk, and then summing.
🍞 Hook: Think of slicing a huge pizza into slices so everyone gets a fair count. 🥬 The Concept (Partitioning): What it is: Partitioning means breaking a big list into smaller groups using a visible marker like |. How it works: 1) Insert | to divide the list into bite-size parts; 2) Each part stays within the model’s comfy counting range; 3) The model counts each part; 4) It adds the part counts. Why it matters: Without partitioning, the model’s internal counter saturates and loses precision on big lists. 🍞 Anchor: “apple, apple, apple | apple, apple | apple, apple, apple” becomes part 1: 3, part 2: 2, part 3: 3 → final 8.
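To make partitioning concrete, here is a minimal Python sketch; the function name, the default chunk size of 8, and the separator string are illustrative choices, not taken from the paper.

```python
def partition_list(items, part_size=8, separator=" | "):
    """Split a flat item list into small groups joined by a visible separator.

    Keeping every group within the model's reliable counting range (an assumed
    8 items here) is the whole point of the partitioning step.
    """
    parts = [items[i:i + part_size] for i in range(0, len(items), part_size)]
    return separator.join(", ".join(part) for part in parts)


# 20 repeated items become three small, easy-to-count groups (8 + 8 + 4).
print(partition_list(["apple"] * 20))
```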
🍞 Hook: You know how showing your work in math helps you not skip steps? 🥬 The Concept (Intermediate Reasoning / Chain-of-Thought): What it is: The model writes down mini-results like “part 2: 5” before the final sum. How it works: 1) For each partition, output “part k: x”; 2) Keep those mini-numbers visible; 3) At the end, sum them; 4) Output the final total. Why it matters: Without showing steps, the model tries to both gather and add at once, which causes errors on long lists. 🍞 Anchor: You write “10 + 10 + 9 = 29” before “Final answer: 29,” reducing slip-ups.
Three analogies for the same idea:
- Grocery bags: Instead of carrying 50 oranges at once, pack them into small bags of 8–10, count each bag, then add the bag counts. Fewer drops, same total.
- Classroom groups: Split 60 students into groups of 10, count each group, then tally groups. No one gets missed.
- Marathon splits: Runners track 5K splits; summing splits gives the marathon time more reliably than eyeballing the whole race.
Before vs After:
- Before: Long, flat lists cause accuracy to drop sharply beyond ~30 items. The model’s counter maxes out.
- After: Structured input (with |) plus written steps keeps accuracy high up to 100 items and likely beyond, because each sub-count stays easy.
Why it works (intuition, not equations):
- The model’s hidden “counter” is precise for small numbers but gets fuzzy for big ones. Partitioning keeps each count small and crisp.
- The model stores each part’s number at boundary tokens (like the last item and the trailing comma). Attention heads then copy these numbers into the intermediate text (“part 3: 7”). Finally, other heads read those mini-numbers and add them.
Building blocks (small pieces that make it go):
- Visible separators (|) to create clean part boundaries.
- A prompt that forces “part k: x” lines before the final sum.
- Attention heads in middle-to-late layers that move numbers from part boundaries → steps → final answer.
- A safe per-part size so the model never leaves its reliable counting zone.
🍞 Hook: Think of sticky notes you place on each pile saying how many are inside. 🥬 The Concept (Aggregation): What it is: Aggregation is the final step of adding all part counts. How it works: 1) Read the mini-numbers (“part 1: 7,” “part 2: 6”); 2) Focus attention on them; 3) Add them internally; 4) Output the total. Why it matters: Without careful aggregation, even perfect sub-counts can still lead to a wrong final answer. 🍞 Anchor: “7 + 6 + 8 = 21,” then “Final answer: 21.”
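As a concrete illustration of aggregation, the sketch below parses the “part k: x” mini-numbers out of a model’s reply and sums them; the regular expression and function name are assumptions for this example, not the paper’s code.

```python
import re


def aggregate_part_counts(model_output: str) -> int:
    """Collect every 'part k: x' mini-number in the reply and add them up."""
    counts = re.findall(r"part\s*\d+\s*:\s*(\d+)", model_output, flags=re.IGNORECASE)
    return sum(int(c) for c in counts)


reply = "part 1: 7, part 2: 6, part 3: 8. Final answer: 21"
assert aggregate_part_counts(reply) == 21  # matches the sandwich example above
```

Running the same parser yourself also gives an external check: if the model’s own “Final answer” disagrees with the parsed sum, you know the aggregation stage slipped.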
Put simply: The paper’s idea is a prompt-time switch that gently forces the model out of fast guessing into careful, step-by-step thinking. No retraining needed—just a smarter plan for how we feed and structure the problem.
03 Methodology
At a high level: Input list → Partition with | → Count each part (write ‘part k: x’) → Aggregate those x’s → Final answer.
Step 1: Structure the input
- What happens: You take a long, comma-separated list and insert | to make smaller groups. Example: “apple, apple, apple, apple, apple | apple, apple, apple | apple, apple, apple, apple, apple, apple.”
- Why this exists: Big, flat lists overload the model’s depth-limited counter. Partitioning keeps each sub-count small and accurate.
- Example with data: If the model is reliable up to 9 items, make each part 6–9 items. Three parts like 5 | 3 | 6 stay easy to count.
Step 2: Force intermediate steps
- What happens: The prompt asks the model to write “part1: x1, part2: x2, …” before giving the total.
- Why this exists: Without steps, the model tries to gather and add at once (error-prone). Steps separate “find the numbers” from “add the numbers.”
- Example: The model outputs: “part1: 5, part2: 3, part3: 6” then “Final answer: 14.”
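A hedged sketch of how Steps 1 and 2 could be wrapped into one prompt builder; the instruction wording is an assumption rather than the paper’s verbatim template, and the part size of 8 is just a placeholder for the model’s reliable range.

```python
def build_counting_prompt(items, part_size=8):
    """Return a 'structured + with steps' counting prompt.

    The essential ingredients are the visible '|' separators (Step 1) and the
    explicit demand for 'part k: x' lines before the total (Step 2).
    """
    parts = [items[i:i + part_size] for i in range(0, len(items), part_size)]
    structured = " | ".join(", ".join(p) for p in parts)
    instruction = (
        "Count the items in the list below. The list is split into parts by '|'. "
        "First write each part's count as 'part k: x', then add the part counts "
        "and give the final total."
    )
    return f"{instruction}\n\nList: {structured}"


print(build_counting_prompt(["apple"] * 14, part_size=5))
# -> a prompt whose list reads as three groups of 5, 5, and 4 apples
```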
Step 3: Local counting inside each partition
- What happens: As the model reads items in part k, it accumulates a tiny hidden counter. The final item and the trailing comma of that part end up storing the part’s count.
- Why this exists: Boundary tokens are natural “bookends,” making it easy for attention heads to fetch the number.
- Example: In “apple, apple, apple | …,” the count “3” is most clearly encoded on the last “apple” and the comma right after it.
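To see which token positions Step 3 is talking about, the sketch below uses a HuggingFace fast tokenizer’s offset mapping to locate the last item of each partition; the tokenizer choice is an assumption, and the paper itself reads hidden states at these positions rather than raw character offsets.

```python
from transformers import AutoTokenizer

# Tokenizer choice is illustrative; any fast tokenizer exposes offset mappings.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = "apple, apple, apple | apple, apple | apple, apple, apple, apple"
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
offsets = enc["offset_mapping"]


def token_index_at(char_pos):
    """Index of the token whose character span covers char_pos."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_pos < end:
            return i
    return None


# The last character of each partition's final item sits just before every
# " | " separator, plus at the very end of the list.
boundary_tokens = []
cursor = 0
for part in text.split(" | "):
    boundary_tokens.append(token_index_at(cursor + len(part) - 1))
    cursor += len(part) + len(" | ")

print(boundary_tokens)  # token positions where each partition's count should be readable
```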
🍞 Hook: Picture a conveyor belt passing boxes; special scanners read the sticker on each last box. 🥬 The Concept (Attention Heads and Layers): What it is: Within attention, each “head” is a tiny specialist that focuses on a specific pattern; layers stack these specialists. How it works: 1) Early layers collect local info; 2) Middle-to-late layers route important signals; 3) Specific heads copy numbers from boundaries into the text being generated; 4) Other heads combine mini-numbers for the final total. Why it matters: If the right heads are blocked, the number can’t move to where it’s needed. 🍞 Anchor: In Qwen2.5 7B, layers ~19–23 light up; head 13 in layer 22 helps write part counts; head 1 in layer 22 helps with the final sum.
🍞 Hook: Imagine every token carries a backpack to pass notes forward. 🥬 The Concept (Residual Stream): What it is: The residual stream is a shared highway where tokens carry and mix information across layers. How it works: 1) Each layer adds its computed message; 2) The stream keeps past info available; 3) Later heads read and route it; 4) Critical signals (like counts) ride along to the right places. Why it matters: Without this highway, partial counts wouldn’t reliably reach the reasoning tokens. 🍞 Anchor: The “3” from the first partition rides the residual stream so “part 1: 3” can be produced.
Step 4: Write the intermediate numbers into text
- What happens: The model generates “part k: x” lines. When it prints x, attention spikes on the last item and the final comma of that partition—exactly where the number is stored.
- Why this exists: Writing intermediate text makes the model’s knowledge explicit and easy to add later.
- Example: Generating “part 2: 4” shows strong attention to the end of partition 2; earlier items in that partition get much less attention.
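A hedged sketch of the attention check in Step 4 using HuggingFace Transformers. The model name, the prompt, and the layer/head indices (taken from the Qwen2.5 7B numbers mentioned above, whose exact indexing convention may differ) are assumptions, and running it needs enough memory for a 7B model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, attn_implementation="eager"  # eager mode returns attention weights
).eval()

# The next token after this prompt should be the count for part 2 ("4").
prompt = "List: apple, apple, apple | apple, apple, apple, apple\npart 1: 3, part 2:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

LAYER, HEAD = 22, 13  # reported above as an intermediate-step head for Qwen2.5 7B
attn = out.attentions[LAYER][0, HEAD, -1]   # attention from the last position to every earlier token
top = torch.topk(attn, k=5).indices
top_ids = inputs["input_ids"][0, top].tolist()
print([tokenizer.decode([tid]) for tid in top_ids])
# Expectation from the paper's account: the end of partition 2 (its last "apple"
# and the neighboring delimiter) should dominate, not the earlier items.
```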
Step 5: Aggregate the mini-numbers
- What happens: The model focuses on the digits in the “part k: x” lines and adds them to produce the final total.
- Why this exists: Separating counting (local) from summing (global) keeps each action simple and accurate.
- Example: After “part1: 6, part2: 7, part3: 5,” the model’s next tokens attend mainly to 6, 7, 5, not back to the raw items.
The secret sauce
- Keep each partition inside the model’s accurate range (e.g., 6–9 for many open models; 15–25 for strong closed models at longer contexts); see the calibration sketch after this list.
- Use clear separators and a strict “with steps” format so the model doesn’t blur stages.
- Leverage the model’s own internal bookkeeping (boundary tokens encoding counts) and the specialized attention heads that shuttle numbers.
Mechanistic tools (how the authors verified the pipeline)
🍞 Hook: Think of swapping batteries between two toys to see which one now lights up. 🥬 The Concept (Activation Patching): What it is: A way to copy hidden activations from one run into another to test causality. How it works: 1) Run two examples; 2) Swap specific layer activations for key tokens; 3) Observe if the output follows the swapped info; 4) Conclude which states carry the cause. Why it matters: Without patching, we can’t prove which hidden pieces actually make the answer change. 🍞 Anchor: Swapping the hidden state of “part 2: 6” makes the final sum jump by 6 in the other context.
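A minimal, hedged sketch of activation patching with PyTorch forward hooks on a HuggingFace model. The model name, the layer index, the token position, and the prompts are all assumptions, and the module path `model.model.layers[...]` is the one used by Qwen/Llama-style architectures.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).eval()

LAYER, TOKEN_POS = 22, -1                      # which layer output and which token position to patch
layer_module = model.model.layers[LAYER]       # decoder block (Qwen/Llama-style module path)

source_prompt = "List: apple, apple, apple, apple, apple, apple\npart 1:"   # true count: 6
dest_prompt   = "List: apple, apple, apple\npart 1:"                        # true count: 3
cached = {}


def _hidden(output):
    """Decoder layers return either a tensor or a tuple whose first element is the hidden states."""
    return output[0] if isinstance(output, tuple) else output


def save_hook(module, inputs, output):
    cached["h"] = _hidden(output)[:, TOKEN_POS, :].clone()   # stash the chosen token's hidden state


def patch_hook(module, inputs, output):
    _hidden(output)[:, TOKEN_POS, :] = cached["h"]           # overwrite it with the source run's state


def next_token(prompt):
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    return tokenizer.decode([logits.argmax().item()])


handle = layer_module.register_forward_hook(save_hook)
next_token(source_prompt)                      # fills cached["h"]
handle.remove()

handle = layer_module.register_forward_hook(patch_hook)
print(next_token(dest_prompt))                 # if this layer carries the count, expect "6" rather than "3"
handle.remove()
```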
🍞 Hook: Like putting earmuffs on one student to see if they stop passing notes. 🥬 The Concept (Attention Knockout / Masking): What it is: Temporarily blocking certain heads or tokens to see if the behavior breaks. How it works: 1) Zero out a token’s activation or an attention head; 2) Measure drop in correct counting; 3) Identify which parts are critical; 4) Map the circuit. Why it matters: Without knockout, we might confuse “looks important” with “is necessary.” 🍞 Anchor: Knocking out layer-22 head-13 drops accuracy for writing part counts; knocking out layer-22 head-1 hurts the final sum.
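A hedged sketch of a head knockout, implemented by zeroing one head’s slice of the input to the attention output projection. The model name and the layer/head indices are assumptions lifted from the summary above, and the module path again assumes a Qwen/Llama-style architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).eval()

LAYER, HEAD = 22, 13
head_dim = model.config.hidden_size // model.config.num_attention_heads
o_proj = model.model.layers[LAYER].self_attn.o_proj    # projection that mixes all heads back together


def knockout_hook(module, inputs):
    hidden = inputs[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0   # erase this head's contribution
    return (hidden,) + inputs[1:]


prompt = "List: apple, apple, apple | apple, apple, apple, apple\npart 1: 3, part 2:"


def next_token():
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    return tokenizer.decode([logits.argmax().item()])


print("intact: ", next_token())                # expected: the correct sub-count ("4")
handle = o_proj.register_forward_pre_hook(knockout_hook)
print("knocked:", next_token())                # if the head matters, this answer often degrades
handle.remove()
```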
🍞 Hook: Imagine a special decoder ring that turns a secret signal into a number. 🥬 The Concept (CountScope): What it is: A causal probe that decodes the count a token is secretly carrying. How it works: 1) Patch a token’s hidden state into a blank counting prompt; 2) Read the number the model now outputs; 3) Repeat across tokens; 4) See where the true count lives. Why it matters: Without CountScope, ordinary lenses struggle to read numbers from hidden states. 🍞 Anchor: It shows the last item and trailing comma of each partition most confidently store that partition’s count.
Putting it all together: The recipe is simple (partition + steps), but the internals are elegant: counts form locally, get written into text by dedicated heads, and then get added by another set of heads. Because every sub-task stays small, the model never trips over its depth limits.
04 Experiments & Results
The test: Measure how well models count long lists under four settings: (1) unstructured input without steps, (2) unstructured with steps, (3) structured without steps, (4) structured with steps. Track exact-match accuracy and mean absolute error (MAE).
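For reference, the two reported metrics are straightforward to compute; the toy numbers below are made up to show the mechanics, not results from the paper.

```python
def exact_match(preds, targets):
    """Fraction of answers that hit the true count exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def mean_absolute_error(preds, targets):
    """Average size of the miss, in items."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


targets = [33, 41, 47]          # true list lengths (toy example)
preds   = [33, 40, 47]          # model answers
print(round(exact_match(preds, targets), 2))          # 0.67
print(round(mean_absolute_error(preds, targets), 2))  # 0.33
```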
The competition: Open models (Qwen2.5 7B, Llama 3 8B, Gemma 3 27B) on 11–50 items; strong closed models (GPT-4o, Gemini-2.5-Pro) on 51–100 items.
Scoreboard with context (think of letter grades):
- Open models, 11–50 items:
  - Qwen2.5 7B: Structured + with steps jumps to about A-level for smaller ranges (accuracy ≈ 0.95 in 11–20) and stays notably better than all other settings as length grows; unstructured without steps collapses fast (e.g., accuracy ≈ 0.06 in 31–40; MAE ≈ 5.29) while structured with steps keeps MAE low (≈ 1–2 for mid ranges).
  - Llama 3 8B: Unstructured without steps dives from decent to near-zero for longer ranges, but structured with steps rises dramatically (e.g., accuracy ≈ 0.84 for 11–20; ≈ 0.26 even at 41–50, far ahead of others; MAE around 1–2 for mid ranges).
  - Gemma 3 27B: Structured with steps shines (accuracy up to 1.00 for 11–20 and stays strongest as lists grow), while structured without steps can be worse than unstructured early on—proof that steps are essential.
- Closed models, 51–100 items:
  - GPT-4o: Structured with steps scores high from 51–100 (≈ 0.96 down to ≈ 0.83) with very low MAE (≈ 0.04 to ≈ 0.22). Other settings trail notably, especially unstructured ones where MAE balloons.
  - Gemini-2.5-Pro: Structured with steps is best-in-class (≈ 0.97 down to ≈ 0.91 across 51–100), with tiny MAE (≈ 0.05–0.10). Unstructured with steps helps some but never matches structured with steps.
Make the numbers meaningful:
- Think of 0.95 accuracy as “an A” where you almost always nail the exact count, versus 0.20 as “a failing grade” where you’re usually off.
- MAE near 0.1 means you’re off by about a tenth on average—super tight—while MAE near 5–12 means you’re often off by a whole handful or more.
Surprising findings:
- Chain-of-Thought alone (unstructured with steps) doesn’t consistently save you. The model needs the partitions in the input to anchor where sub-counts live.
- Structured input without steps can be harmful or underwhelming. The model sees the parts but still tries to gather and add in one breath.
- Only the combo—structured input plus written steps—delivers stable wins across models and long ranges.
Mechanistic confirmations:
- Attention during “part k: x” peaks on the last item and final comma of that partition; final-answer tokens attend strongly to the mini-numbers just written.
- In Qwen2.5 7B, middle-to-late layers (≈ 19–23) dominate both transfer-to-steps and final aggregation; specific heads shoulder distinct stages (e.g., head 13 layer 22 for intermediate steps, head 1 layer 22 for the final sum).
- CountScope decodes the correct number from boundary tokens; zeroing those tokens’ activations drops the chance of producing the right sub-count.
- Swapping intermediate-step activations between contexts causally swaps how much each context contributes to the final total—proof that those tokens carry the numbers.
Bottom line: Across models and lengths, the structured-with-steps setting is the only one that reliably beats the architectural counting ceiling, and the internal traces line up with a clean, stage-by-stage mechanism.
05 Discussion & Limitations
Limitations:
- Narrow domain: The tests use repeated simple nouns (like fruits/animals). Real text is messier—mixed words, punctuation quirks, and tokenization oddities could impact results.
- Prompt dependence: The trick leans on a specific, structured prompt and knowing a safe per-part size. If you pick partitions too big, accuracy erodes again.
- Tokenization effects: Different models split words differently; boundaries and commas might behave non-uniformly, shifting where counts are stored.
- Faithfulness beyond counting: While the steps look meaningful here, in other tasks, written steps aren’t always faithful to internal reasoning.
Required resources:
- No training changes—just smart prompting. You need:
  - A template that inserts | separators.
  - A “with steps” instruction that forces “part k: x” lines.
  - A guess (or calibration) of the model’s reliable per-part range.
  - Optional analysis tools (advanced): activation patching, attention knockout, and a CountScope-like probe to audit internals.
When not to use:
- Very short lists: Plain counting is already fine; overhead isn’t needed.
- Highly varied tokens with unclear item boundaries: If it’s hard to mark where one item ends, the method might misfire.
- Tasks where decomposition is unnatural or where summation isn’t the right aggregation (e.g., you need median or majority instead).
Open questions:
- Auto-calibration: Can a model learn to choose its own safe partition sizes on the fly?
- Beyond counting: Which other reasoning tasks (e.g., multi-constraint checking, long arithmetic, multi-hop logic) benefit most from this structured-with-steps pattern?
- Robustness: How do different tokenizations, languages, and punctuation schemes move the count-storing spots?
- Training-time alignment: If we fine-tune with structured-with-steps examples, can we make the mechanism more reliable and reduce the need for careful prompts?
- General circuits: Are there universal “aggregation heads” shared across tasks that add numbers, tally votes, or combine evidence?
06 Conclusion & Future Work
Three-sentence summary: LLMs fail on long counting because their internal, layer-by-layer counter is capacity-limited—great for small lists, wobbly for big ones. A simple System-2 test-time strategy—partition the list with |, write part-by-part counts, then add—escapes that ceiling and restores accuracy without retraining. Careful analyses show where numbers live (at part boundaries), how they move (via specific attention heads), and how they’re added (in middle-to-late layers).
Main achievement: Turning a fragile, fast System-1 behavior into a sturdy, step-by-step System-2 procedure using only prompt structure and intermediate steps—and mapping the internal “count highways” that make it work.
Future directions: Automate partition sizing, extend to other reasoning tasks that saturate (like multi-hop proofs or long arithmetic), and explore light training to harden the discovered mechanisms. Improve cross-tokenization robustness and test in messy, real-world text.
Why remember this: It’s a rare result that is both useful now (better counting by prompting) and illuminating (a clear, causal picture of how numbers flow inside LLMs). It shows that smart problem framing can unlock hidden capacity, and that interpretability tools can guide practical prompting strategies—not just explain them after the fact.
Practical Applications
- Prompt templates for long-list counting: insert | separators and require “part k: x” lines before the final total.
- Inventory and receipt parsing: segment items into batches and sum batch totals to reduce miscounts.
- Length-controlled summarization: split content into sections, set per-section token budgets, then aggregate for a precise length.
- Checklist and step enumerations: partition long instructions into groups, number each group, then count groups accurately.
- QA over tables or logs: chunk rows into partitions, count within chunks, and aggregate to avoid long-context drift.
- Data dedup or frequency counts: group by key in prompt, count each group, then sum groups.
- Curriculum-style tutoring: break multi-part math problems into structured sub-steps and aggregate results.
- Agent planning: divide a long plan into stages, estimate costs/steps per stage, then sum to check feasibility.
- Code analysis: count occurrences (e.g., TODOs, function calls) by file blocks, then aggregate per-project.
- Quality control: require models to expose sub-counts for auditing; if a final total is wrong, you can trace which part failed.