Sparse Reward Subsystem in Large Language Models
Key Summary
- The paper discovers a tiny, special group of neurons inside large language models (LLMs) that act like a reward system in the human brain.
- These sparse "value neurons" estimate how likely the model is to succeed even before it writes an answer.
- Turning off just 1% of these value neurons causes huge drops in reasoning accuracy, while turning off random neurons barely matters.
- The same value neurons keep working across many tasks (math, science, multiple-choice), model sizes (1.5B–14B), and different architectures (Qwen, Llama, Phi, Gemma).
- The neuron positions overlap strongly across datasets and fine-tuned models from the same base, showing stability and transferability.
- When the model's expectation and the actual outcome don't match, the paper finds "dopamine neurons" whose activity spikes for pleasant surprises and dips for disappointments.
- Zeroing value neurons also scrambles the dopamine neurons' surprise signals, showing the two sets are closely connected.
- Training a lightweight probe with Temporal Difference (TD) learning exposes these neurons better than training only on final rewards.
- Using value neurons can predict model confidence efficiently, rivaling or beating other popular confidence baselines.
- This gives a clearer, brain-inspired map of how LLMs evaluate and correct their own reasoning, opening doors to safer and smarter AI.
Why This Research Matters
If we can read a model's internal sense of "how well am I doing?", we can predict errors early, save compute by stopping low-confidence attempts, and route tough questions to stronger solvers. Knowing exactly which neurons carry this signal lets us gently guide models to think longer or double-check when needed. It also helps prevent hallucinations by flagging when the model's value estimate is shaky. The brain-inspired connection between value and dopamine-like neurons gives us a natural language to explain and teach models better reasoning habits. Finally, the subsystem's stability across datasets and architectures means practical tools built on it can generalize widely.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how your brain gets a little burst of excitement when you solve a puzzle and a letdown when you realize a mistake? That feeling isn't random: your brain has a tiny system that tracks how well things are going and how surprising they are.
The Concept (Large Language Models, LLMs): LLMs are computer programs that predict the next word to answer questions and solve problems.
- How it works: (1) Read the question, (2) turn words into numbers (hidden states), (3) think through many layers, (4) pick the next word, repeat.
- Why it matters: Without LLMs, we wouldn't have helpful chatbots or tools that explain math and science. Anchor: When you ask, "What's 12×13?", an LLM thinks step by step before answering "156."
Hook: Imagine a giant orchestra where most instruments play softly, but a few lead instruments carry the melody.
The Concept (Neurons and Hidden States): In LLMs, "neurons" are numbers inside the model that light up to represent ideas, and "hidden states" are the model's internal thoughts at each step.
- How it works: Each layer transforms the hidden state, highlighting parts that seem useful.
- Why it matters: If we understand which neurons do what, we can guide or fix the model. Anchor: A few neurons might light up for "numbers," others for "cause and effect," and some for "confidence."
Hook: You know how a game score tells you if you're winning and helps you choose your next move?
The Concept (Rewards in AI): A reward is a score that tells the model if it did well (like got the right answer).
- How it works: After finishing an answer, the model gets reward 1 for correct or 0 for wrong.
- Why it matters: Rewards help models learn what strategies work. Anchor: Solve a math problem right? Score 1. Wrong? Score 0.
Hook: Imagine a tiny coach inside the model whispering, "This looks promising!" or "Uh-oh."
The Concept (Sparse Reward Subsystem): It's a small, special set of neurons that tracks how good the current situation looks and how surprising new results are.
- How it works: A tiny subset estimates expected success (value) and surprise (prediction error).
- Why it matters: Without it, the model can't judge progress or learn from surprises. Anchor: Before writing an answer, the subsystem already leans "likely right" or "risky."
Hook: Think of a friend who can tell if a plan will work just by hearing the setup.
The Concept (Value Neurons): These neurons estimate how likely the model is to succeed from its current state.
- How it works: They read hidden states and output a score (the "value") for expected success.
- Why it matters: Without value neurons, the model can't prioritize good paths. Anchor: Reading a math word problem, value neurons say, "I've got this!" before solving.
Hook: Remember feeling extra happy when you did better than expected, and bummed when you didn't?
The Concept (Dopamine Neurons): These neurons signal surprise: high when things go better than expected, low when worse.
- How it works: They encode Reward Prediction Error (RPE): actual outcome minus expected outcome.
- Why it matters: Without them, the model can't notice or use surprises to improve. Anchor: If the model thinks it'll fail but suddenly sees the trick, dopamine neurons spike.
Hook: Imagine checking your homework answer at each step instead of waiting until the end.
The Concept (Temporal Difference Learning, TD): A way to train value estimates step by step by comparing today's guess with tomorrow's improved guess.
- How it works: (1) Predict value now, (2) look at the next state's value or final result, (3) adjust by the difference.
- Why it matters: Without TD, learning waits until the end and misses important mid-course clues. Anchor: While solving, the model updates its confidence at each token instead of only at the final answer.
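To make the TD idea concrete, here is a minimal TD(0) sketch in Python; the step values, learning rate, and discount factor are made-up illustrations, not numbers from the paper.

```python
# Minimal TD(0) sketch: nudge each step's value estimate toward the
# next step's estimate (or the final reward at the last step).
# All numbers here are illustrative, not from the paper.

def td_update(values, final_reward, alpha=0.1, gamma=1.0):
    """One pass of TD(0) over a sequence of per-step value estimates."""
    updated = list(values)
    for t in range(len(values)):
        # Bootstrap target: next step's value, or the final reward at the end.
        target = final_reward if t == len(values) - 1 else gamma * updated[t + 1]
        td_error = target - updated[t]      # the "surprise" at step t
        updated[t] += alpha * td_error      # move the estimate toward the target
    return updated

# Example: the model grows more confident mid-solution, and the answer turns out correct.
print(td_update([0.4, 0.5, 0.7], final_reward=1.0))
```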
Hook: Think of a test grader that rates how well you can separate right from wrong answers.
The Concept (AUC): AUC measures how well predictions (like values) separate correct from incorrect outcomes.
- How it works: Higher AUC means better ranking of correct over incorrect.
- Why it matters: Without a clear metric, we can't tell if the subsystem works. Anchor: If value neurons give higher scores to correct answers most of the time, AUC is high.
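As a quick illustration (with made-up scores, not data from the paper), AUC can be computed directly from value predictions and correctness labels using scikit-learn:

```python
# Illustrative AUC computation with made-up value scores and outcomes.
from sklearn.metrics import roc_auc_score

predicted_values = [0.9, 0.7, 0.4, 0.8, 0.2]   # value-probe scores per problem
was_correct      = [1,   1,   0,   1,   0]     # 1 = model answered correctly

# AUC = probability a randomly chosen correct case outranks an incorrect one.
print(roc_auc_score(was_correct, predicted_values))  # 1.0 here: perfect ranking
```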
The world before: LLMs could solve lots of math and science problems, but we didn't know which parts of their "brains" judged progress or noticed surprises. People used the hidden states to predict confidence or detect hallucinations, but we didn't know why it worked or whether a tiny, reusable core did the job.
The problem: Can we find a compact, stable set of neurons that estimate expected success and surprises, like a brain-style reward system?
Failed attempts: Using full hidden states or complex decoders works but is bulky and hard to interpret; training on final rewards only ignores valuable mid-trajectory signals.
The gap: We lacked proof of a small, robust, transferable subsystem with measurable importance for reasoning.
Real stakes: If we can read a model's internal "is this going well?" meter, we can trust it more, save compute by stopping early on low-confidence problems, detect mistakes faster, and safely tune behavior.
02 Core Idea
Hook: Imagine a treasure map hidden inside a giant book: only a few red X's truly matter, and if you cover them, the map becomes useless.
The Concept (Key Insight): A tiny, sparse set of neurons inside LLMs forms a reward subsystem: value neurons estimate expected success, and dopamine neurons capture surprise (reward prediction error). Disrupting just these few neurons severely harms reasoning.
- How it works: (1) Train a small probe to read value signals from hidden states using TD learning; (2) prune inputs to see if only a few neurons suffice; (3) ablate those neurons and watch reasoning collapse; (4) find dopamine neurons by looking at cases where expectation and reality diverge.
- Why it matters: It reveals a compact, brain-like control panel for reasoning that is robust across datasets, layers, model sizes, and architectures. Anchor: Knocking out 1% of the right neurons drops math accuracy by more than 50 percentage points, while a random 1% barely changes anything.
- The "Aha!" Moment (one sentence): Most of the information about a model's own chances of success lives in a tiny, stable set of neurons that also ties directly to surprise signals, just like the value and dopamine systems in brains.
- Multiple Analogies:
- Orchestra: Many instruments play, but a few lead instruments (value neurons) keep everyone on tempo; the cymbals (dopamine neurons) crash when something surprising happens.
- GPS: Value neurons estimate how close you are to your destination; dopamine neurons ping when you find a shortcut (positive surprise) or hit a detour (negative surprise).
- Teacher's red pen: Value neurons predict if the solution path is on track; dopamine neurons mark "Great leap!" or "Oops!" at key steps.
- Before vs After:
- Before: Confidence and correctness could be predicted from hidden states, but it seemed spread everywhere and hard to localize.
- After: The signals are sparse, sit in specific neurons, are transferable, and are necessary for reasoning, giving us handles to guide, debug, and improve models.
- Why It Works (intuition, not equations):
- As the model processes a question, certain neurons naturally specialize in summarizing "how promising this looks" (value). TD learning encourages these signals to be consistent over time.
- When outcomes surprise the model, the mismatch pops out in other neurons (dopamine), which track big good/bad jumps to update expectations.
- Because reasoning patterns repeat across tasks and models, the same small neuron sets keep reappearing and stay useful.
- Building Blocks:
- Value probe: a tiny MLP that reads hidden states and outputs expected success.
- Pruning: keep only the most important input dimensions; high AUC after strong pruning proves sparsity.
- Ablation: set those key neurons to zero; big accuracy drops prove necessity.
- Transfer tests: overlap (IoU) across datasets and sibling models shows stability.
- Dopamine discovery: track activation during positive/negative surprises; spikes/dips reveal reward prediction error.
- Coupling test: ablating value neurons disrupts dopamine signals, showing the two are functionally linked.
03 Methodology
At a high level: Question → Hidden states (per layer, per token) → Value probe (per layer) → Identify sparse value neurons → Pruning tests → Ablation tests → Dopamine neuron discovery → Coupling analysis.
Step A: Train a Value Probe (per layer)
- What happens: For each layer's hidden state at each step, a small 2-layer MLP (the value probe) predicts the value (expected success). It's trained with Temporal Difference (TD) learning, so each step's prediction is nudged toward the next step's prediction or the final reward.
- Why this step exists: Final-only training ignores useful mid-trajectory clues; TD lets the probe learn smooth, time-aware estimates.
- Example: On a GSM8K math problem, before writing any answer tokens, the probe already predicts a moderate chance of success; as the model progresses, the estimate grows or shrinks.
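Below is a minimal sketch of what such a probe and its TD training step might look like in PyTorch; the hidden size, discount factor, sigmoid output, and learning rate are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a per-layer value probe trained with TD targets.
# Hyperparameters (hidden size, gamma, learning rate) are illustrative assumptions.
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Tiny 2-layer MLP that maps a hidden state to an expected-success score."""
    def __init__(self, d_model, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
            nn.Sigmoid(),                     # value in [0, 1]
        )

    def forward(self, h):
        return self.net(h).squeeze(-1)

def td_loss(probe, hidden_states, final_reward, gamma=1.0):
    """TD(0) loss over one trajectory of hidden states, shape (T, d_model)."""
    values = probe(hidden_states)             # (T,)
    # Target: next step's value (no gradient), or the final reward at the last step.
    with torch.no_grad():
        targets = torch.cat([gamma * values[1:], values.new_tensor([final_reward])])
    return ((values - targets) ** 2).mean()

# Toy usage with random tensors standing in for a real trajectory of hidden states.
probe = ValueProbe(d_model=4096)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
h = torch.randn(32, 4096)                     # 32 steps of fake hidden states
loss = td_loss(probe, h, final_reward=1.0)
loss.backward()
opt.step()
```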
Step B: Early Prediction Test (AUC)
- What happens: On a validation set, compute the value from the initial input positions (before generating the answer) and measure how well it separates correct vs. incorrect outcomes using AUC.
- Why this step exists: If value is real, the model should "sense" difficulty before solving; high AUC proves predictive power.
- Example: For MATH500, many layers show strong AUC even without seeing any generated tokens.
Step C: Pruning to Find Sparse Value Neurons
- What happens: Rank input dimensions to the probe by weight norms; prune the smallest and keep only a fraction (e.g., 1%). Re-evaluate AUC as pruning grows.
- Why this step exists: If a tiny subset holds most value information, AUC should stay high even after heavy pruning.
- Example: In Qwen-2.5-14B, AUC curves remain steady, and sometimes rise slightly, even after keeping <1% of inputs.
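A sketch of the ranking-and-masking idea, assuming the probe's first linear layer provides the importance scores (column-wise weight norms); the paper's exact ranking criterion may differ.

```python
# Sketch: rank a probe's input dimensions by weight norm and keep only the top fraction.
# `first_layer_weight` stands for the (d_hidden, d_model) weight of the probe's
# first Linear layer (e.g. probe.net[0].weight in the sketch above).
import torch

def top_value_neurons(first_layer_weight, keep_fraction=0.01):
    """Indices of input dimensions with the largest column-wise weight norms."""
    importance = first_layer_weight.norm(dim=0)          # (d_model,)
    k = max(1, int(keep_fraction * importance.numel()))
    return torch.topk(importance, k).indices

def keep_only(h, kept_indices):
    """Zero every hidden-state dimension except the kept 'value neuron' positions."""
    mask = torch.zeros_like(h)
    mask[..., kept_indices] = 1.0
    return h * mask

# Toy usage with a random weight matrix standing in for a trained probe.
W = torch.randn(128, 4096)
kept = top_value_neurons(W, keep_fraction=0.01)          # ~1% of 4096 dimensions
h = torch.randn(32, 4096)
h_pruned = keep_only(h, kept)                            # feed this to the probe to re-check AUC
```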
Step D: Intervention (Ablation) to Test Causality
- What happens: During inference, set the top 1% value neuron activations to zero in a given layer, and measure accuracy drops; compare with randomly ablating 1% of neurons.
- Why this step exists: It shows these neurons aren't just correlated; they're necessary for reasoning.
- Example: On MATH500 with Qwen-2.5-7B, average accuracy plunges from 75.2% to 20.3% (a drop of 54.9 points) when ablating value neurons; a random 1% ablation costs only about 0.6 points.
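A minimal sketch of how a targeted ablation could be wired up with a PyTorch forward hook; the layer path `model.model.layers[layer_idx]` follows the Hugging Face layout of Qwen/Llama-style models and is an assumption, not necessarily the paper's implementation.

```python
# Sketch: zero out selected neuron activations at one layer during inference.
# The layer path (model.model.layers[idx]) matches Hugging Face Qwen/Llama-style
# models but is an assumption; adapt it to the architecture you are probing.

def make_ablation_hook(neuron_indices):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., neuron_indices] = 0.0     # zero the targeted "value neurons" in place
        return output
    return hook

def ablate_neurons(model, layer_idx, neuron_indices):
    """Attach the hook; call .remove() on the returned handle to restore the model."""
    layer = model.model.layers[layer_idx]
    return layer.register_forward_hook(make_ablation_hook(neuron_indices))

# Usage (assuming `model` is a loaded causal LM and `kept` holds value-neuron indices):
# handle = ablate_neurons(model, layer_idx=5, neuron_indices=kept)
# ... run generation and measure accuracy ...
# handle.remove()
```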
Step E: Robustness and Transferability Checks
- What happens: Repeat the AUC and pruning tests across datasets (GSM8K, MATH500, Minerva Math, ARC, MMLU-STEM), sizes (1.5B/7B/14B), layers (early to deep), and architectures (Qwen, Llama, Phi, Gemma). Compute the IoU overlap of selected value neurons across datasets and across sibling fine-tuned models.
- Why this step exists: If this is a true subsystem, it should be stable across settings.
- Example: The IoU of value neurons between GSM8K and ARC exceeds 0.6 at 99% pruning, far above random baselines.
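The overlap metric itself is a one-liner; the index sets below are made up for illustration, not taken from the paper.

```python
# Intersection-over-union of two sets of selected neuron indices.
def neuron_iou(indices_a, indices_b):
    a, b = set(indices_a), set(indices_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative: value neurons selected on two datasets for the same layer.
gsm8k_neurons = {12, 87, 310, 455, 901}
arc_neurons   = {12, 87, 310, 777, 901}
print(neuron_iou(gsm8k_neurons, arc_neurons))   # 4 shared / 6 total = 0.67, a high overlap
```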
Step F: Identify Dopamine Neurons (Surprise Trackers)
- What happens: Find cases where initial predicted value is high but outcome is wrong (negative surprise) or low but outcome is right (positive surprise). Track neuron activations over tokens; dopamine neurons show spikes for positive surprise and dips for negative.
- Why this step exists: To locate RPE-like signals that mirror biological dopamine behavior.
- Example: In layer 5, neuron #1517 spikes when the model discovers a key step unexpectedly and shows a trough where a logical error occurs.
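A sketch of how positive and negative surprise cases could be selected by comparing the early value prediction against the final outcome; the 0.5 cutoff and the record format are illustrative assumptions.

```python
# Sketch: split trajectories into positive / negative surprise cases by comparing
# the early value prediction with the final outcome. The 0.5 cutoff is illustrative.
def split_surprise_cases(records):
    """records: list of dicts with 'initial_value' (float) and 'correct' (bool)."""
    positive = [r for r in records if r["initial_value"] < 0.5 and r["correct"]]
    negative = [r for r in records if r["initial_value"] >= 0.5 and not r["correct"]]
    return positive, negative

# Candidate dopamine neurons are then those whose per-token activation traces
# spike on the positive-surprise cases and dip on the negative-surprise ones.
```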
Step G: Coupling Test Between Value and Dopamine Neurons
- What happens: Zero out the top 20% of value neurons, then re-measure a dopamine neuron's activation curve; compare to randomly ablating 20% of neurons.
- Why this step exists: If value neurons guide dopamine signals, disrupting them should scramble surprise patterns.
- Example: Random ablation barely changes the dopamine curve; value-neuron ablation shifts peaks/troughs dramatically, erasing classic RPE behavior.
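One simple way to quantify what this comparison shows, assuming you can record a single neuron's per-token activation trace with and without the value-neuron ablation; using correlation as the comparison statistic is an assumption on top of the paper's visual comparison.

```python
# Sketch: compare a dopamine neuron's activation trace before vs. after ablating
# value neurons. Pearson correlation as the similarity score is an assumption;
# the paper presents the curves visually.
import numpy as np

def trace_similarity(baseline_trace, ablated_trace):
    """Correlation between two per-token activation traces of the same neuron."""
    return float(np.corrcoef(baseline_trace, ablated_trace)[0, 1])

# Expected pattern: random 20% ablation leaves the trace nearly unchanged (high
# correlation), while ablating the top value neurons scrambles it (low correlation).
```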
Secret Sauce: Three clever ingredients
- TD-trained tiny probes: Using time-aware training with a very small network ensures we're reading signals that already exist, not inventing new ones.
- Extreme pruning: Showing AUC barely drops after keeping <1% inputs proves true sparsity.
- Causal ablations: Massive accuracy loss from targeted ablation (but not random) proves these neurons are functionally essential.
Concrete Mini-Walkthrough:
- Input: "John bought 3 apples at $2 and 2 oranges at $1.50. Total cost?"
- Layer k hidden state (before answer): Probe predicts medium-high value.
- Pruning: Keep 0.5% of inputs; value still separates correct/incorrect well.
- Ablation: Zero those 0.5% of neurons; the model now often forgets to multiply or add correctly.
- Surprise case: The model starts unconfident but suddenly notices 3×2 + 2×1.5; the dopamine neuron spikes right then.
04 Experiments & Results
The Test: What did they measure and why?
- AUC of early value predictions: to see if the model can sense success before answering.
- Pruning curves: to check if only a few neurons carry value information.
- Ablation impact: to test if those neurons are necessary for reasoning.
- IoU overlap across datasets/models: to test stability and transferability.
- Dopamine activation traces: to verify reward prediction error signals.
- Confidence correlation: to see if value neurons can cheaply predict accuracy.
The Competition: Baselines and Comparisons
- Random ablation baseline: Is targeted ablation really special?
- Confidence baselines: Verbalized confidence and next-token probability, plus a strong linear-probe method (LCD).
The Scoreboard (with context):
- Pruning barely hurts AUC: Even after pruning to <1% of inputs, AUC stays high and sometimes increases, like keeping just a handful of instruments and still playing the melody.
- 1% targeted ablation devastates accuracy: On MATH500 (Qwen-2.5-7B), accuracy drops from 75.2% to an average of 20.3% when ablating value neurons (that's like an A turning into a failing grade). Random 1% ablation leaves accuracy at ~74.6% (almost unchanged).
- Robustness across datasets and sizes: The same pattern holds for GSM8K, Minerva Math, ARC, and MMLU-STEM and for 1.5B/7B/14B models; sparsity and predictive power persist.
- Cross-architecture validity: Llama-3.1-8B, Phi-3.5-mini, and Gemma-3-4B also show stable AUC under pruning, so it's not just a Qwen thing.
- High IoU across datasets: At extreme pruning (99%), overlaps like GSM8K vs. ARC exceed 0.6, far above random, showing that the most critical value neurons are shared.
- High IoU across sibling models: Fine-tunes from the same base model share many of the same value neuron positions.
- Dopamine neurons found: Activation spikes at positive surprises and dips at negative surprises; ablating value neurons shifts or erases these patterns.
- Confidence prediction: Value-neuron-based scores achieve Spearman ~0.47, beating Verbalized (0.08) and Next-token (0.09), and matching strong linear-probe baselines that use far more neurons.
Surprising Findings:
- Pruning sometimes helps AUC: Removing noisy inputs can sharpen the signal, making predictions even cleaner.
- Stability is strongest at extreme sparsity: As pruning approaches 99%, overlaps across datasets climb, pointing to a very small, very stable core.
- Value-dopamine coupling is tight: Disrupting value neurons reliably breaks dopamine surprise signatures, suggesting a functional pipeline, not just coincidence.
05 Discussion & Limitations
Limitations:
- Scope: Results focus on autoregressive, decoder-only LLMs; other architectures or modalities need testing.
- Dopamine quantification: Evidence is primarily case-based visualizations with a selection procedure; richer, global metrics would strengthen claims.
- Probe dependence: Although tiny, the probe still defines which inputs look important; alternative probes could yield slightly different subsets.
- Temporal settings: TD hyperparameters (like the discount factor) and smoothing choices can influence which neurons surface as "value" or "dopamine."
- Causality depth: We show strong necessity via ablation, but more granular, mechanistic circuits (who talks to whom) remain to be mapped.
Required Resources:
- Access to hidden states during inference for several layers.
- Modest GPU for training tiny probes and running pruning/ablation sweeps.
- Datasets spanning reasoning types (math, multiple-choice) to test transfer.
When NOT to Use:
- If you canât access or modify intermediate activations (e.g., closed-box APIs), locating specific neurons is impractical.
- For tasks where rewards aren't meaningful or are delayed/ambiguous, the discovered signals may be weaker.
- If the model is extremely small or undertrained, value signals may not be well-formed.
Open Questions:
- Can we design training that explicitly strengthens this subsystem to improve reasoning robustness and transparency?
- How universal are dopamine-like neurons across domains beyond math and STEM (e.g., coding, dialogue safety)?
- Can these neurons guide dynamic compute allocation (think more when value is uncertain) reliably in the wild?
- What do connectivity maps look likeâcan we trace circuits from attention heads to value/dopamine neurons?
- Could a standardized, quantitative RPE metric be built to benchmark dopamine neurons across models?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper uncovers a tiny, stable reward subsystem inside LLMs: value neurons that predict expected success and dopamine neurons that encode surprise when expectations meet reality.
- Extreme pruning shows value information is highly sparse; targeted ablation proves these neurons are necessary for reasoning; overlaps across datasets, sizes, and architectures reveal broad stability.
- Disrupting value neurons scrambles dopamine surprise signals, and value-based scores predict confidence well, showing practical uses and a brain-inspired structure.
Main Achievement:
- Pinpointing and validating a sparse, transferable set of neurons that govern internal value and surprise signals, and showing they're crucial for reasoning.
Future Directions:
- Build training and inference methods that directly leverage these neurons for better problem-solving, uncertainty estimation, and compute allocation.
- Develop quantitative RPE benchmarks and richer connectivity maps to turn observations into robust, model-wide metrics and interpretable circuits.
- Explore applications to safety (early-error detection), alignment (shaping internal rewards), and efficiency (think-time scheduling).
Why Remember This:
- It turns a blurry picture of "confidence in hidden states" into a crisp, brain-like subsystem with testable, transferable parts.
- It offers practical knobs (value and dopamine neurons) to guide, debug, and harden reasoning.
- It bridges neuroscience intuitions with modern AI, making models not just smarter, but more understandable and controllable.
Practical Applications
- Early confidence check: Use value neurons before generation to decide whether to think longer or hand off to a bigger model.
- Compute scheduling: Allocate more tokens or steps when the initial value is low or uncertain; save compute when it is high.
- Safety and reliability: Trigger self-checks or verifiers when value is middling and dopamine spikes suggest risky surprises.
- Curriculum design: Train with targeted examples that maximize useful RPE signals to strengthen reasoning.
- Hallucination reduction: Downweight generations when value neurons predict low success on knowledge-heavy questions.
- Best-of-N sampling: Use intrinsic value signals to pick the most promising candidate solutions without external labels.
- Interactive tutoring: Show learners where the model had big positive/negative surprises to highlight key steps or mistakes.
- Model diagnostics: Track subsystem neurons during fine-tuning to ensure reasoning quality doesn't regress.
- Cross-task portability: Reuse discovered value neurons across tasks to bootstrap confidence tools quickly.
- Adaptive prompting: If value is low, prompt for step-by-step reasoning; if high, allow concise answers.