
Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Intermediate
Lecheng Yan, Ruizhe Li, Guanhua Chen et al. · 1/16/2026
arXiv · PDF

Key Summary

  • The paper shows that when an LLM is trained with spurious (misleading) rewards in RLVR, it can score higher by memorizing answers instead of reasoning.
  • They discover a Perplexity Paradox: the model becomes very confident on the final answer tokens while getting worse at understanding the prompt.
  • Using tools like Path Patching, Logit Lens, JSD, and Neural Differential Equations, they find a specific circuit that flips the model into memorization mode.
  • This circuit has two parts: Functional Anchors in middle layers (L18–L20) that trigger the memory, and Structural Adapters in later layers (L21+) that reshape representations to carry the shortcut.
  • They causally prove this by resetting and keeping certain layers: changing the Anchors hurts leaked (contaminated) questions a lot but barely affects clean reasoning questions.
  • They can steer the model by scaling specific MLP neuron keys, either boosting or blocking the memorization shortcut on demand.
  • The effect is strong in Qwen2.5-Math and weaker or absent in control models like LLaMA-3.1-8B and OLMo-2-1124-7B, showing it is tied to data contamination.
  • On clean benchmarks like LiveMathBench, gains do not come from memorizing answers, supporting the distinction between reasoning and leakage.
  • This gives a roadmap to detect, understand, and mitigate contamination-driven gains in RLVR-tuned models.

Why This Research Matters

If we mistake memorization for reasoning, we may deploy models that look great on paper but break in real-world use. This paper gives clear warning signs (the Perplexity Paradox) and concrete tools to tell shortcuts from true understanding. With this roadmap, teams can audit RLVR training, avoid celebrating contaminated gains, and design cleaner reward functions. It also enables practical fixes: layer-specific resets and neuron-level steering that reduce reliance on leaked data. Ultimately, this helps ensure AI systems actually learn how to think through problems and remain trustworthy when the questions are new.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how some students can ace a test by memorizing last year’s answer key, even if they don’t understand the material? They look smart on paper, but they didn’t really learn.

đŸ„Ź Filling (The Actual Concept):

  • What it is: This paper studies how a popular way to train AI to solve math problems, called RLVR, can accidentally teach a model to memorize leaked answers instead of truly reasoning.
  • How it works (step by step):
    1. Train an AI with a reward for producing the correct final answer (that’s RLVR).
    2. If the training or pretraining data secretly contains the test answers (data contamination), the model may already have those answers stored somewhere.
    3. Even if the rewards are wrong, random, or only care about formatting, the model still learns to trigger its stored answers.
    4. The model’s internal layers reorganize so that certain middle layers send a “retrieve memory now” signal, and later layers reshape the information to deliver the memorized token.
    5. The final result is higher scores on contaminated tests but worse language understanding of the prompts.
  • Why it matters: Without realizing it, we might celebrate higher accuracy that comes from shortcuts, not real thinking. This can mislead researchers, users, and product teams about what models can truly do.

🍞 Bottom Bread (Anchor): Imagine a calculator that shows the right answer because it recognizes the question from before—not because it recalculated. It would pass that specific test but fail on a new one with the same idea but different numbers.

Now, we’ll introduce each building block concept in the order that makes the story easiest to follow.

🍞 Top Bread (Hook): Imagine a big factory with assembly lines. Different stations add parts until a final product rolls out the door.

đŸ„Ź Multilayer Perceptron (MLP):

  • What it is: An MLP is a stack of simple math layers in a Transformer that stores and transforms facts and features.
  • How it works:
    1. Take in a hidden vector (the current ‘state’ of understanding).
    2. Use neurons that light up when certain patterns appear (keys activate).
    3. Push out a new vector (values) that adds information back into the model’s stream.
    4. Repeat across layers.
  • Why it matters: If MLPs act like memory shelves, the model can pull out stored answers without re-thinking them.

🍞 Bottom Bread (Anchor): Like a library shelf: see the book’s label (key), pull out the book (value), and put the book’s info into your notes (residual stream).
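
To make the key-value picture concrete, here is a minimal, self-contained sketch of a Transformer MLP viewed as a key-value memory. The matrix names (W_up, W_down) and the sizes are illustrative assumptions, not taken from the paper's models.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the key-value view of a Transformer MLP block.
# d_model and d_ff are toy sizes chosen for illustration only.
d_model, d_ff = 16, 64
W_up = torch.randn(d_ff, d_model) * 0.1    # "keys": rows match patterns in the hidden state
W_down = torch.randn(d_model, d_ff) * 0.1  # "values": columns are what gets written back

def mlp(hidden: torch.Tensor) -> torch.Tensor:
    # 1. Keys activate: each neuron scores how well the hidden state matches its pattern.
    key_activations = F.gelu(hidden @ W_up.T)   # shape (d_ff,)
    # 2. Values are mixed in proportion to those activations.
    update = key_activations @ W_down.T         # shape (d_model,)
    # 3. The update is added back into the residual stream.
    return hidden + update

hidden_state = torch.randn(d_model)
print(mlp(hidden_state).shape)  # torch.Size([16])
```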

🍞 Top Bread (Hook): You know how a video game gives you points when you beat a level? That score teaches you what works.

đŸ„Ź Reinforcement Learning with Verifiable Rewards (RLVR):

  • What it is: A training method that gives the model a reward based on whether the final answer is correct (and checkable).
  • How it works:
    1. The model proposes an answer to a math question.
    2. A checker verifies if it’s correct.
    3. The model gets a reward (or not) and updates itself.
    4. Repeat many times so good strategies are reinforced.
  • Why it matters: RLVR can build strong problem-solvers—if the reward points to real reasoning.

🍞 Bottom Bread (Anchor): Like a math bee where every right answer gets a gold star; after many rounds, you learn habits that earn more gold stars.
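
A toy sketch of the reward signal that drives RLVR, alongside the kind of "spurious" format-only reward the paper studies. The function names and the boxed-answer format check are illustrative assumptions, not the paper's actual verifier.

```python
# Minimal sketch of RLVR-style rewards. `model_answer` would come from the policy
# model; here it is a plain string, and the policy-update step is omitted entirely.

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Standard RLVR: reward 1.0 only if the final answer matches the checker."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def spurious_format_reward(model_answer: str) -> float:
    """A 'spurious' reward in the paper's sense: it only checks formatting
    (here, a boxed answer), never correctness."""
    return 1.0 if model_answer.startswith("\\boxed{") and model_answer.endswith("}") else 0.0

print(verifiable_reward("\\boxed{4}", "\\boxed{4}"))   # 1.0
print(spurious_format_reward("\\boxed{7}"))            # 1.0 even though the answer is wrong
```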

🍞 Top Bread (Hook): Think of peeking at the answer sheet. You don’t solve the problem—you just copy the result.

đŸ„Ź Memorization Shortcuts:

  • What it is: Quick paths that output known answers from memory instead of reasoning.
  • How it works:
    1. Notice a familiar-looking question.
    2. Trigger stored information that matches it.
    3. Skip the slow step-by-step solution.
    4. Output the remembered answer.
  • Why it matters: Shortcuts can boost scores on leaked tests but fail on new, slightly different problems.

🍞 Bottom Bread (Anchor): Like typing a saved Wi‑Fi password: it connects fast, but you never learned what the password really is.

🍞 Top Bread (Hook): Picture tracking a toy car’s path as it rolls along a track. You can draw its curve and see where it splits into two lanes.

đŸ„Ź Neural Differential Equations (NDEs):

  • What it is: A way to view layer-by-layer changes in a network as a smooth, continuous path through a space.
  • How it works:
    1. Treat each layer’s update like a small step in time.
    2. Fit a simple function that describes how the hidden state changes continuously.
    3. Measure where two paths (reasoning vs memorization) split and how strongly they’re pushed apart.
    4. Identify where the split begins (the bifurcation point).
  • Why it matters: It pinpoints exactly where the model decides to switch from reasoning to memorization.

🍞 Bottom Bread (Anchor): Like watching two skiers on the same slope suddenly take different trails—NDEs tell you where and why they split.
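
The paper fits actual neural differential equations; the sketch below is only a crude discrete stand-in that treats each layer as one time step and looks for the depth where two hidden-state trajectories (leaked vs stable) pull apart. The tensors here are random placeholders; in a real run they would be the tuples returned by a HuggingFace model called with output_hidden_states=True.

```python
import torch

# Crude discrete proxy for the NDE view: per-layer distance between the
# last-token hidden states of a "leaked" run and a "stable" run.

def divergence_by_layer(hidden_leaked, hidden_stable):
    """Return, for each layer, how far apart the two runs' last-token states are."""
    gaps = []
    for h_leak, h_stab in zip(hidden_leaked, hidden_stable):
        gap = (h_leak[0, -1] - h_stab[0, -1]).norm().item()
        gaps.append(gap)
    return gaps

# Toy stand-ins: 25 "layers" of (batch=1, seq=8, d_model=16) hidden states.
fake_leaked = [torch.randn(1, 8, 16) for _ in range(25)]
fake_stable = [torch.randn(1, 8, 16) for _ in range(25)]
gaps = divergence_by_layer(fake_leaked, fake_stable)
# The layer with the largest jump in divergence is a rough bifurcation candidate.
print("candidate bifurcation layer:",
      max(range(1, len(gaps)), key=lambda i: gaps[i] - gaps[i - 1]))
```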

🍞 Top Bread (Hook): Imagine wearing X-ray glasses that let you see what word your friend is thinking of, step by step, before they say it.

đŸ„Ź Logit Lens:

  • What it is: A tool that projects an internal hidden state directly to “which tokens are likely next?”
  • How it works:
    1. Take the hidden vector from a certain layer.
    2. Map it straight into the vocabulary’s score space.
    3. See which tokens already look favored.
    4. Track how token preferences change across layers.
  • Why it matters: It reveals exactly when the model commits to a specific answer token—early or late.

🍞 Bottom Bread (Anchor): If the answer “4” is already top-1 by Layer 23, you know the decision was made before the final layer.
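
A minimal logit-lens sketch, assuming a Qwen/LLaMA-style HuggingFace model where the final norm and unembedding are exposed as model.model.norm and model.lm_head. The checkpoint name and the toy prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Math-7B"  # assumption: swap in whatever checkpoint you are probing
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Question: What is 2 + 2? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Project the last-token hidden state straight through the output head.
    h_last = model.model.norm(h[:, -1, :])   # final norm, as in Qwen/LLaMA architectures
    logits = model.lm_head(h_last)
    top_token = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: top token = {top_token!r}")
```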

🍞 Top Bread (Hook): Think of replacing a piece of a Lego build to see if the final model still stands. If it does, that piece mattered.

đŸ„Ź Path Patching:

  • What it is: A method for swapping internal activations to test what caused the output.
  • How it works:
    1. Run the model on two versions of a question: one that triggers memory (leaked) and one that doesn’t (stable).
    2. Copy activations from certain layers of the leaked run into the stable run.
    3. See if the stable run now outputs the memorized answer.
    4. Layers whose swaps change the result are causally important.
  • Why it matters: It tells us which internal parts actually trigger memorized answers.

🍞 Bottom Bread (Anchor): Like swapping the engine from Car A into Car B; if B now zooms, you found the power source.
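
The sketch below is a simplified activation patch (a coarser cousin of full path patching): it caches one layer's MLP output on a leaked prompt and splices it into a run on a stable prompt. The model name, layer index, and prompts are placeholder assumptions, and it assumes both prompts tokenize to the same length.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Math-7B"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 19                  # one of the candidate "Anchor" layers
leaked_prompt = "..."           # a question whose answer flipped Wrong->Right
stable_prompt = "..."           # a question whose behavior did not change

# 1. Cache the MLP output of the leaked run at the chosen layer.
cache = {}
def save_hook(module, inputs, output):
    cache["mlp_out"] = output.detach()

h = model.model.layers[layer_idx].mlp.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(leaked_prompt, return_tensors="pt"))
h.remove()

# 2. Re-run the stable prompt, overwriting that layer's MLP output with the cache.
def patch_hook(module, inputs, output):
    # assumes both prompts have the same token length in a real experiment
    return cache["mlp_out"]

h = model.model.layers[layer_idx].mlp.register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**tok(stable_prompt, return_tensors="pt"))
h.remove()

print(tok.decode(patched.logits[0, -1].argmax()))  # did the memorized token appear?
```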

🍞 Top Bread (Hook): Picture a train junction where a lever sends the train down a shortcut track, and a later station reshapes the cars to fit the narrow tunnel.

đŸ„Ź Anchor-Adapter Circuit:

  • What it is: A two-part pathway inside the model; Functional Anchors (layers 18–20) trigger memory retrieval, while Structural Adapters (layers 21+) reshape the representation to carry that shortcut cleanly to the output.
  • How it works:
    1. Middle layers decide: “Use the stored answer.”
    2. A strong token signal is injected (the memory is activated).
    3. Later layers reorganize (rotate/transform) the internal space.
    4. The final layers deliver the memorized token with high confidence.
  • Why it matters: This explains how spurious RLVR unlocks memorization that was already inside the model.

🍞 Bottom Bread (Anchor): In Qwen2.5-Math, Layers 18–20 light up first, and Layers 21–23 reshape things so the answer token (like “4”) pops to the top.

🍞 Top Bread (Hook): Imagine your friend gets more and more certain about an answer while becoming less sure what the question was asking.

đŸ„Ź Perplexity Paradox:

  • What it is: The model’s uncertainty (perplexity) drops on the answer tokens but rises on the prompt words.
  • How it works:
    1. Spurious RLVR trains the model to nail the final answer.
    2. The model gets very confident at the end token(s).
    3. Meanwhile, its prompt understanding degrades.
    4. You see low answer perplexity but high prompt perplexity.
  • Why it matters: It’s a red flag that the model isn’t reasoning—just recalling.

🍞 Bottom Bread (Anchor): In the paper, Qwen2.5-Math shows decreasing answer perplexity and increasing prompt perplexity after spurious RLVR.
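
A minimal sketch of how one might split perplexity into prompt and answer parts, assuming a HuggingFace causal LM. The checkpoint and the toy question are placeholders, not the paper's evaluation harness.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Math-7B"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Question: What is 2 + 2?\nAnswer: "
answer = "4"
prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + answer, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits

# Shift so position i predicts token i+1, then split the per-token loss by region.
log_probs = F.log_softmax(logits[:, :-1, :].float(), dim=-1)
targets = full_ids[:, 1:]
token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # shape (1, T-1)

n_prompt = prompt_ids.shape[1]
prompt_ppl = token_nll[0, : n_prompt - 1].mean().exp().item()
answer_ppl = token_nll[0, n_prompt - 1 :].mean().exp().item()
print(f"prompt perplexity: {prompt_ppl:.2f}, answer perplexity: {answer_ppl:.2f}")
```

The Perplexity Paradox signature would be answer perplexity falling over training steps while prompt perplexity rises.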

🍞 Top Bread (Hook): Think of comparing two recipes for cookies to see how different they are.

đŸ„Ź Jensen-Shannon Divergence (JSD):

  • What it is: A measure of how far apart two probability distributions are.
  • How it works:
    1. Look at token probabilities from an intermediate layer (via Logit Lens).
    2. Compare them to the final output probabilities.
    3. Compute JSD to see how much transformation still has to happen.
    4. Track JSD across layers to spot where structure changes.
  • Why it matters: Peaks in certain sub-parts of MLPs show where the model reconfigures itself to carry the shortcut.

🍞 Bottom Bread (Anchor): In Qwen2.5-Math, JSD for W_up and W_gate peaks around Layers 21–22, signaling strong structural adaptation there.
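
A small, self-contained JSD implementation over two token distributions, for example a logit-lens projection at an intermediate layer versus the model's final output logits. The random logits below are stand-ins for illustration.

```python
import torch
import torch.nn.functional as F

def jsd(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two distributions given as logits."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    # JSD = 0.5 * KL(p || m) + 0.5 * KL(q || m)
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

vocab = 1000
intermediate_logits = torch.randn(vocab)   # e.g. logit-lens projection at layer 21
final_logits = torch.randn(vocab)          # the model's actual output logits
print(f"JSD = {jsd(intermediate_logits, final_logits).item():.4f}")
```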

02 Core Idea

🍞 Top Bread (Hook): Imagine a maze where most kids solve it by thinking, but one kid finds a hidden door that jumps straight to the exit. He wins fast—but didn’t learn the maze.

đŸ„Ź The Concept (Aha! in one sentence): Spurious RLVR doesn’t teach better reasoning; it flips a hidden switch that makes the model retrieve memorized answers using an Anchor-Adapter circuit.

How it works (step by step):

  1. If pretraining leaked test answers into the model, those answers live somewhere in its MLP memories.
  2. Spurious or incorrect rewards still pressure the model to get final answers right.
  3. Middle layers (Functional Anchors) learn to detect “this looks like a known question” and inject a memory-trigger signal.
  4. Later layers (Structural Adapters) reshape the internal space so the injected answer token cleanly rises to the top.
  5. The model’s answer confidence goes up, but its prompt understanding goes down (the Perplexity Paradox).

Why it matters: Without noticing, we might call this ‘reasoning progress’ even though it is just faster copying; this misleads benchmarks, product claims, and research directions.

🍞 Bottom Bread (Anchor): On MATH-500, Qwen2.5-Math improves a lot after spurious RLVR, but tools show the answer token appears early and strongly due to Anchors/Adapters—classic memorization, not new reasoning.

Multiple Analogies (3 ways):

  • Secret elevator: The Anchor is the button; the Adapter is the elevator shaft reshaped to fit; you end up at the top floor without climbing the stairs.
  • Choir and conductor: The Anchor (conductor) cues the soloist (answer token) early; the Adapter (choir) harmonizes so the solo dominates at the finale.
  • GPS reroute: The Anchor presses “avoid traffic” and picks a shortcut; the Adapter redraws the map so every turn leads straight to the exit.

Before vs After:

  • Before: We assumed RLVR gains meant better step-by-step reasoning.
  • After: We now see those gains can come from unlocking stored answers via a specific mid-to-late layer circuit.
  • Change: Evaluation must separate ‘knowing what to answer’ from ‘knowing how to reason.’

Why it works (intuition):

  • MLPs store compressed patterns; spurious rewards apply pressure at the outcome, not the process.
  • The model finds the simplest way to maximize reward: match prompts to stored answers.
  • Mid-layers are well-placed to recognize patterns; late layers can rotate features to present the chosen token cleanly.
  • This division of labor (Anchor vs Adapter) naturally emerges under reward pressure when contamination is present.

Building Blocks:

  • MLP memory shelves: hold candidate features and answers.
  • Functional Anchors (L18–20): choose the ‘memory route’ and inject a high-probability answer direction.
  • Structural Adapters (L21+): rotate/transform representation so the output layer cleanly selects the injected token.
  • Detectives (tools): Path Patching (causality), Logit Lens (early token commitment), JSD (where structure shifts), NDEs (where dynamics split).

03 Methodology

At a high level: Input (math question) → Detect whether it matches a stored pattern (Anchors L18–20) → Reshape the signal to favor the memorized token (Adapters L21+) → Output the final answer.

Step-by-step recipe:

  1. Prepare datasets and groups.
  • What happens: Split questions into ‘leakage’ (Wrong→Right after spurious RLVR) and ‘stable’ (unchanged), across MATH-500, MinervaMath, and the clean control LiveMathBench.
  • Why it exists: To compare where memorization is activated versus where it isn’t.
  • Example: A MATH-500 problem that’s wrong before training but right after suggests the model learned to retrieve, not newly reason.
  2. Measure the Perplexity Paradox.
  • What happens: Track perplexity (uncertainty) over training steps for both prompt tokens and answer tokens.
  • Why it exists: To see if answer confidence rises while prompt understanding falls—a memorization signature.
  • Example: In Qwen2.5-Math, answer perplexity drops while prompt perplexity rises; in LLaMA/OLMo both perplexities get worse.
  3. Use Path Patching to find causal layers.
  • What happens: Swap layer activations from the RLVR-tuned run into the base run (and vice versa) and see if outputs flip.
  • Why it exists: To causally locate which layers make the ‘memorize now’ decision.
  • Example: Swapping MLP activations in L18–L20 boosts accuracy recovery; after L21, recovery drops—pointing to the Anchors just before a structural shift.
  4. Use Logit Lens to see when tokens emerge.
  • What happens: Project each layer’s hidden state into token space to watch which tokens are already winning.
  • Why it exists: To catch early commitment to the final answer before the last layer.
  • Example: For a leaked MATH-500 question with answer “4,” the token “4” surges at L23 after precursor signals near L19.
  5. Use JSD to find structural adaptation.
  • What happens: Compare token distributions from intermediate layers to the final distribution to measure how much change remains.
  • Why it exists: High or peaking JSD reveals where the model reorganizes its internal space.
  • Example: W_up and W_gate JSD peak at L21–L22, then decline; W_down stays high—showing late layers adapt structure for the shortcut signal.
  6. Use NDEs to map dynamics and find bifurcation.
  • What happens: Fit a continuous model of layer-wise updates; measure where leakage and stable trajectories split and how strongly.
  • Why it exists: To mathematically confirm the exact depth where the path diverges.
  • Example: Separation force peaks at L18–L20, validating Anchors as the decision point; later layers boost velocity (amplify the signal) without changing direction.
  7. Do ablations to test necessity and sufficiency.
  • What happens: Reset Anchors (L18–20) or Adapters (L21–22) to base weights; or keep only one group’s RLVR weights.
  • Why it exists: To show which pieces are required and whether any piece alone is enough.
  • Example: On MATH-500, resetting Anchors drops leakage accuracy from 98% to 86%; keeping only Adapters fails badly—both groups are needed together.
  8. Mechanistic intervention: steer with neuron key scaling (a minimal sketch follows this list).
  • What happens: Identify task-relevant MLP neurons, then multiply their key activations by a factor α during inference.
  • Why it exists: To causally turn the shortcut up or down and prove control.
  • Example: Amplify at Layer 18 (+4.4% leakage accuracy); suppress at Layer 18 (−3.8%); on clean LiveMathBench, steering shows no systematic effect, indicating specificity to contamination.
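
A hedged sketch of what inference-time key scaling could look like on a Qwen-style model: a forward hook multiplies selected up_proj neuron activations at one layer by a factor α. The layer index, neuron indices, α value, and prompt are placeholders, and the paper's procedure for identifying task-relevant neurons is not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Math-7B"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 18                  # an "Anchor" layer in the paper's Qwen2.5-Math analysis
neuron_ids = [101, 202, 303]    # placeholder indices of suspected memorization neurons
alpha = 0.0                     # 0.0 suppresses these keys; values > 1.0 would amplify them

def scale_keys(module, inputs, output):
    # up_proj output has shape (batch, seq, d_ff); rescale only the chosen neuron columns.
    output[..., neuron_ids] = output[..., neuron_ids] * alpha
    return output

hook = model.model.layers[layer_idx].mlp.up_proj.register_forward_hook(scale_keys)
prompt = "Question: What is 2 + 2?\nAnswer:"
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=8)
hook.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```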

What breaks without each step:

  • Without leakage vs stable grouping: You can’t isolate memorization from genuine reasoning.
  • Without perplexity tracking: You miss the paradox signal.
  • Without Path Patching: You don’t know which layers cause the behavior.
  • Without Logit Lens: You can’t see early token commitment.
  • Without JSD: You miss where structure is reconfigured.
  • Without NDEs: You can’t pinpoint the bifurcation depth.
  • Without ablations/interventions: You lack causal proof and control.

The secret sauce:

  • The Anchor-Adapter split: mid-layers decide; late layers adapt structure.
  • Cross-tool triangulation (Path Patching + Logit Lens + JSD + NDE): each method independently points to the same layers and roles.
  • Bidirectional steering: scaling MLP keys not only observes but controls the shortcut, closing the causal loop.

04 Experiments & Results

The test: Measure whether gains come from reasoning or memorization, and where the switch happens.

  • Datasets: MATH-500 and MinervaMath (contaminated), LiveMathBench (clean control), plus AIME/AMC for breadth.
  • Models: Qwen2.5-Math-7B (affected), Qwen3-8B (weaker effect), LLaMA-3.1-8B and OLMo-2-1124-7B (controls).
  • Metrics: Accuracy, perplexity (prompt vs answer), ROUGE-L and completion for partial prompts, JSD, and dynamic NDE measures.

The competition: Compare how spurious RLVR affects different model families.

  • Qwen2.5-Math-7B: large gains on contaminated sets, clear Perplexity Paradox, and a strong Anchor-Adapter signature.
  • Qwen3-8B: similar but weaker patterns (later peaks, less dramatic gains).
  • LLaMA/OLMo: no Anchor-Adapter signature; spurious RLVR hurts perplexity overall; little to no gains.

The scoreboard (with context):

  • Perplexity Paradox in Qwen2.5-Math: answer perplexity goes down (more certain), prompt perplexity goes up (less coherent). That’s like getting faster at shouting the final number but worse at reading the problem.
  • Path Patching: High accuracy recovery when patching MLPs around L18–L20, sharp drop after L21—pinpointing the Anchors just before a structural shift.
  • JSD: W_up and W_gate peak at L21–L22 then decline; W_down stays high—late layers are adapting structure rather than adding new content.
  • NDE dynamics: Separation force peaks at L18–L20; velocity difference grows later—Anchors decide, Adapters amplify.
  • Ablations (MATH-500): Reset Anchors (L18–20) → leakage accuracy 98%→86%; Reset Adapters (L21–22) → smaller drop; Keep Only Adapters → catastrophic on contaminated sets; Stable (clean) samples barely change.
  • Steering: At Layer 18, suppression reduces leakage accuracy (−3.8%), amplification boosts it (+4.4%). On LiveMathBench, steering shows no consistent effect—supporting contamination specificity.

Surprising findings:

  • Even incorrect or random rewards can activate the memory pathway if contamination exists; reward correctness wasn’t necessary.
  • In failed retrievals, amplification sometimes ‘wakes up’ a dormant shortcut that base inference couldn’t access.
  • LiveMathBench improvements, when present, look like formatting/structure refinements, not memory retrieval—helping separate ‘how to talk’ from ‘what to answer.’

05 Discussion & Limitations

Limitations:

  • Primarily demonstrated on Qwen2.5-Math-7B (with supporting signs in Qwen3-8B). Other architectures may organize memories differently.
  • Focused on math benchmarks; coding or multi-step scientific reasoning may show other patterns.
  • Tools like Logit Lens and Path Patching approximate complex internals; results depend on careful protocol choices.

Required resources:

  • Access to base and RLVR-tuned checkpoints.
  • Compute for running multiple probes (Path Patching sweeps, JSD component tests, NDE fitting) and ablation evaluations.
  • Clean control datasets to distinguish reasoning from contamination.

When not to use (or interpret cautiously):

  • If you lack a clean benchmark, you may mistake formatting gains for new reasoning.
  • If the model family shows no leakage signs (like LLaMA/OLMo here), don’t over-interpret noise in JSD or Logit Lens.
  • If rewards target multi-step reasoning quality (not just final answers), this specific shortcut may be less relevant.

Open questions:

  • How general is the Anchor-Adapter pattern across sizes, domains, and training recipes?
  • Can we design RLVR rewards that explicitly penalize early token commitment without chain-of-thought evidence?
  • Can automatic detectors flag Perplexity Paradox on-the-fly during training to stop contamination amplification?
  • Are there safer structural adapters that improve reasoning instead of reorganizing for shortcuts?
  • Can post-hoc steering be turned into a robust, deployable decontamination filter?

06 Conclusion & Future Work

Three-sentence summary:

  • Spurious RLVR can make LLMs score higher by activating memorized answers rather than improving reasoning.
  • A specific Anchor-Adapter circuit enables this: middle layers (L18–L20) trigger retrieval, while later layers (L21+) reshape representations so the memorized token wins.
  • The Perplexity Paradox—lower answer perplexity but higher prompt perplexity—signals this switch, and targeted neuron scaling can amplify or suppress it.

Main achievement:

  • A causal, mechanistic roadmap—using Path Patching, Logit Lens, JSD, NDEs, ablations, and steering—that localizes and controls contamination-driven shortcuts.

Future directions:

  • Build contamination-resistant reward functions and benchmarks that require grounded reasoning.
  • Automate paradox detection during training to prevent shortcut formation.
  • Generalize the Anchor-Adapter analysis to more model families and tasks (math, code, science).
  • Turn neuron-level steering into a practical ‘decontamination’ inference-time tool.

Why remember this:

  • It reframes puzzling RLVR gains: not all accuracy increases mean deeper thinking.
  • It gives clear signs (Perplexity Paradox) and tools (Anchor-Adapter mapping, steering) to tell memorization from reasoning.
  • It helps the community build safer, more honest evaluations and models that truly learn how to solve problems, not just what to answer.

Practical Applications

  • Add a Perplexity Paradox check (prompt vs answer perplexity) to RLVR training dashboards to catch shortcut formation early.
  • Run Path Patching sweeps after RLVR to localize Anchor and Adapter layers before shipping a model.
  • Use clean control sets like LiveMathBench to separate reasoning gains from contamination-driven gains.
  • Apply layer resets (Anchor/Adapter) to diagnose and mitigate leaked performance without retraining from scratch.
  • Deploy neuron key scaling at inference time to suppress suspected memorization on sensitive evaluations.
  • Track JSD of MLP subcomponents across layers to monitor structural adaptations that signal shortcut building.
  • Fit NDEs on residual updates to automatically detect bifurcation layers where reasoning switches to memory.
  • Gate benchmark claims: require both accuracy gains and no Perplexity Paradox before declaring reasoning improvements.
  • Design rewards that score intermediate reasoning quality (not just final answers) to reduce shortcut incentives.
  • Automate a 'decontamination pass' that disables identified Anchor-Adapter circuits during official evaluations.
#RLVR #data contamination #memorization shortcuts #Perplexity Paradox #Anchor-Adapter circuit #Path Patching #Logit Lens #Jensen-Shannon Divergence #Neural Differential Equations #MLP key scaling #mechanistic interpretability #Qwen2.5-Math #reasoning vs memorization #layer ablation #causal steering