Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
Key Summary
- The paper shows that when an LLM is trained with spurious (misleading) rewards in RLVR, it can score higher by memorizing answers instead of reasoning.
- They discover a Perplexity Paradox: the model becomes very confident on the final answer tokens while getting worse at understanding the prompt.
- Using tools like Path Patching, Logit Lens, JSD, and Neural Differential Equations, they find a specific circuit that flips the model into memorization mode.
- This circuit has two parts: Functional Anchors in middle layers (L18-L20) that trigger the memory, and Structural Adapters in later layers (L21+) that reshape representations to carry the shortcut.
- They causally prove this by resetting and keeping certain layers: changing the Anchors hurts leaked (contaminated) questions a lot but barely affects clean reasoning questions.
- They can steer the model by scaling specific MLP neuron keys, either boosting or blocking the memorization shortcut on demand.
- The effect is strong in Qwen2.5-Math and weaker or absent in control models like LLaMA-3.1-8B and OLMo-2-1124-7B, showing it's tied to data contamination.
- On clean benchmarks like LiveMathBench, gains do not come from memorizing answers, supporting the distinction between reasoning and leakage.
- This gives a roadmap to detect, understand, and mitigate contamination-driven gains in RLVR-tuned models.
Why This Research Matters
If we mistake memorization for reasoning, we may deploy models that look great on paper but break in real-world use. This paper gives clear warning signs (the Perplexity Paradox) and concrete tools to tell shortcuts from true understanding. With this roadmap, teams can audit RLVR training, avoid celebrating contaminated gains, and design cleaner reward functions. It also enables practical fixes: layer-specific resets and neuron-level steering that reduce reliance on leaked data. Ultimately, this helps ensure AI systems actually learn how to think through problems and remain trustworthy when the questions are new.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how some students can ace a test by memorizing last year's answer key, even if they don't understand the material? They look smart on paper, but they didn't really learn.
🥬 Filling (The Actual Concept):
- What it is: This paper studies how a popular way to train AI to solve math problems, called RLVR, can accidentally teach a model to memorize leaked answers instead of truly reasoning.
- How it works (step by step):
- Train an AI with a reward for producing the correct final answer (that's RLVR).
- If the training or pretraining data secretly contains the test answers (data contamination), the model may already have those answers stored somewhere.
- Even if the rewards are wrong, random, or only care about formatting, the model still learns to trigger its stored answers.
- The model's internal layers reorganize so that certain middle layers send a "retrieve memory now" signal, and later layers reshape the information to deliver the memorized token.
- The final result is higher scores on contaminated tests but worse language understanding of the prompts.
- Why it matters: Without realizing it, we might celebrate higher accuracy that comes from shortcuts, not real thinking. This can mislead researchers, users, and product teams about what models can truly do.
🍞 Bottom Bread (Anchor): Imagine a calculator that shows the right answer because it recognizes the question from before, not because it recalculated. It would pass that specific test but fail on a new one with the same idea but different numbers.
Now, we'll introduce each building block concept in the order that makes the story easiest to follow.
🍞 Top Bread (Hook): Imagine a big factory with assembly lines. Different stations add parts until a final product rolls out the door.
🥬 Multilayer Perceptron (MLP):
- What it is: An MLP is a stack of simple math layers in a Transformer that stores and transforms facts and features.
- How it works:
- Take in a hidden vector (the current "state" of understanding).
- Use neurons that light up when certain patterns appear (keys activate).
- Push out a new vector (values) that adds information back into the modelâs stream.
- Repeat across layers.
- Why it matters: If MLPs act like memory shelves, the model can pull out stored answers without re-thinking them.
🍞 Bottom Bread (Anchor): Like a library shelf: see the book's label (key), pull out the book (value), and put the book's info into your notes (residual stream).
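To make the key/value picture concrete, here is a minimal PyTorch sketch of a gated MLP block in the LLaMA/Qwen style. The module names (gate_proj, up_proj, down_proj) follow a common convention, and the key/value comments reflect the interpretation above; this is an illustrative sketch, not the paper's code.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Minimal gated MLP (LLaMA/Qwen style), annotated with the key-value memory view."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # W_up
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # W_down
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Keys": each hidden unit fires when its input pattern appears.
        keys = self.act(self.gate_proj(x)) * self.up_proj(x)
        # "Values": rows of W_down are mixed according to how strongly each key fired;
        # the caller adds the result back into the residual stream.
        return self.down_proj(keys)

mlp = GatedMLP(d_model=64, d_hidden=256)
hidden = torch.randn(1, 8, 64)      # (batch, seq, d_model)
residual = hidden + mlp(hidden)     # the update written into the residual stream
```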
🍞 Top Bread (Hook): You know how a video game gives you points when you beat a level? That score teaches you what works.
🥬 Reinforcement Learning with Verifiable Rewards (RLVR):
- What it is: A training method that gives the model a reward based on whether the final answer is correct (and checkable).
- How it works:
- The model proposes an answer to a math question.
- A checker verifies if it's correct.
- The model gets a reward (or not) and updates itself.
- Repeat many times so good strategies are reinforced.
- Why it matters: RLVR can build strong problem-solvers, as long as the reward points to real reasoning.
🍞 Bottom Bread (Anchor): Like a math bee where every right answer gets a gold star; after many rounds, you learn habits that earn more gold stars.
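A minimal sketch of the "verifiable reward" part of this loop, assuming answers are reported in a \boxed{...} span (a common math-benchmark convention). The parsing and the 0/1 reward are simplifications; a real RLVR pipeline wraps this signal in a policy-gradient update, and a spurious reward simply replaces the correctness check.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 if the model's final boxed answer matches the gold answer, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match:
        predicted = match.group(1).strip()
    else:
        parts = completion.strip().split()
        predicted = parts[-1] if parts else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0

def format_only_reward(completion: str, gold_answer: str) -> float:
    """A spurious reward: pays for formatting alone and ignores correctness."""
    return 1.0 if "\\boxed{" in completion else 0.0
```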
🍞 Top Bread (Hook): Think of peeking at the answer sheet. You don't solve the problem; you just copy the result.
🥬 Memorization Shortcuts:
- What it is: Quick paths that output known answers from memory instead of reasoning.
- How it works:
- Notice a familiar-looking question.
- Trigger stored information that matches it.
- Skip the slow step-by-step solution.
- Output the remembered answer.
- Why it matters: Shortcuts can boost scores on leaked tests but fail on new, slightly different problems.
🍞 Bottom Bread (Anchor): Like typing a saved Wi-Fi password: it connects fast, but you never learned what the password really is.
🍞 Top Bread (Hook): Picture tracking a toy car's path as it rolls along a track. You can draw its curve and see where it splits into two lanes.
🥬 Neural Differential Equations (NDEs):
- What it is: A way to view layer-by-layer changes in a network as a smooth, continuous path through a space.
- How it works:
- Treat each layerâs update like a small step in time.
- Fit a simple function that describes how the hidden state changes continuously.
- Measure where two paths (reasoning vs memorization) split and how strongly they're pushed apart.
- Identify where the split begins (the bifurcation point).
- Why it matters: It pinpoints exactly where the model decides to switch from reasoning to memorization.
🍞 Bottom Bread (Anchor): Like watching two skiers on the same slope suddenly take different trails; NDEs tell you where and why they split.
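A minimal sketch of the trajectory view, assuming you have already collected per-layer hidden states for a representative leaked example and a stable example. It approximates the continuous dynamics with finite differences across layers; the "separation force" defined here is an illustrative proxy, not the paper's exact NDE fit.

```python
import numpy as np

def trajectory_separation(h_leak: np.ndarray, h_stable: np.ndarray):
    """
    h_leak, h_stable: (num_layers, d_model) hidden states of the final prompt token,
    one trajectory per group (e.g., averaged over leaked vs. stable questions).
    Returns per-layer separation, its growth rate, and a crude bifurcation layer.
    """
    separation = np.linalg.norm(h_leak - h_stable, axis=-1)  # distance at each layer
    force = np.diff(separation)                              # how hard layers push paths apart
    bifurcation_layer = int(np.argmax(force)) + 1            # layer with the strongest divergence
    return separation, force, bifurcation_layer
```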
🍞 Top Bread (Hook): Imagine wearing X-ray glasses that let you see what word your friend is thinking of, step by step, before they say it.
🥬 Logit Lens:
- What it is: A tool that projects an internal hidden state directly to "which tokens are likely next?"
- How it works:
- Take the hidden vector from a certain layer.
- Map it straight into the vocabulary's score space.
- See which tokens already look favored.
- Track how token preferences change across layers.
- Why it matters: It reveals exactly when the model commits to a specific answer token, early or late.
🍞 Bottom Bread (Anchor): If the answer "4" is already top-1 by Layer 23, you know the decision was made before the final layer.
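A minimal Logit Lens sketch using Hugging Face transformers: project each layer's hidden state through the model's final norm and unembedding, then read off the leading token. The attribute paths (model.model.norm, model.lm_head) match LLaMA/Qwen-style checkpoints and are an assumption for other architectures.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # any causal LM that exposes hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "What is 2 + 2? Answer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq, d_model)
for layer_idx, h in enumerate(out.hidden_states):
    h_last = h[0, -1]                                  # hidden state at the last position
    logits = model.lm_head(model.model.norm(h_last))   # project into vocabulary space
    top_token = tok.decode(logits.argmax())
    print(f"layer {layer_idx:2d} -> top token: {top_token!r}")
```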
🍞 Top Bread (Hook): Think of replacing a piece of a Lego build to see if the final model still stands. If the build changes, that piece mattered.
🥬 Path Patching:
- What it is: A method for swapping internal activations to test what caused the output.
- How it works:
- Run the model on two versions of a question: one that triggers memory (leaked) and one that doesnât (stable).
- Copy activations from certain layers of the leaked run into the stable run.
- See if the stable run now outputs the memorized answer.
- Layers whose swaps change the result are causally important.
- Why it matters: It tells us which internal parts actually trigger memorized answers.
🍞 Bottom Bread (Anchor): Like swapping the engine from Car A into Car B; if B now zooms, you found the power source.
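A hedged sketch of the core patching move: cache one layer's MLP output on the leaked prompt, then splice it into a run on the stable prompt with a forward hook. The module path assumes a LLaMA/Qwen-style layout, and patching only the last position at a single layer is a simplification of full path patching.

```python
import torch

def get_mlp(model, layer_idx):
    # LLaMA/Qwen-style module path; may differ for other architectures (assumption).
    return model.model.layers[layer_idx].mlp

def patch_mlp_output(model, tok, leaked_prompt, stable_prompt, layer_idx):
    cached = {}

    # 1) Capture the MLP output on the leaked prompt.
    def save_hook(module, inputs, output):
        cached["mlp_out"] = output.detach()

    handle = get_mlp(model, layer_idx).register_forward_hook(save_hook)
    with torch.no_grad():
        model(**tok(leaked_prompt, return_tensors="pt"))
    handle.remove()

    # 2) Replay it into the stable prompt at the same layer (last position only).
    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, -1] = cached["mlp_out"][:, -1]
        return patched

    handle = get_mlp(model, layer_idx).register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = model(**tok(stable_prompt, return_tensors="pt")).logits
    handle.remove()
    return tok.decode(logits[0, -1].argmax())  # does the memorized token now win?
```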
🍞 Top Bread (Hook): Picture a train junction where a lever sends the train down a shortcut track, and a later station reshapes the cars to fit the narrow tunnel.
🥬 Anchor-Adapter Circuit:
- What it is: A two-part pathway inside the model; Functional Anchors (layers 18-20) trigger memory retrieval, while Structural Adapters (layers 21+) reshape the representation to carry that shortcut cleanly to the output.
- How it works:
- Middle layers decide: "Use the stored answer."
- A strong token signal is injected (the memory is activated).
- Later layers reorganize (rotate/transform) the internal space.
- The final layers deliver the memorized token with high confidence.
- Why it matters: This explains how spurious RLVR unlocks memorization that was already inside the model.
🍞 Bottom Bread (Anchor): In Qwen2.5-Math, Layers 18-20 light up first, and Layers 21-23 reshape things so the answer token (like "4") pops to the top.
🍞 Top Bread (Hook): Imagine your friend gets more and more certain about an answer while becoming less sure what the question was asking.
🥬 Perplexity Paradox:
- What it is: The modelâs uncertainty (perplexity) drops on the answer tokens but rises on the prompt words.
- How it works:
- Spurious RLVR trains the model to nail the final answer.
- The model gets very confident at the end token(s).
- Meanwhile, its prompt understanding degrades.
- You see low answer perplexity but high prompt perplexity.
- Why it matters: It's a red flag that the model isn't reasoning, just recalling.
🍞 Bottom Bread (Anchor): In the paper, Qwen2.5-Math shows decreasing answer perplexity and increasing prompt perplexity after spurious RLVR.
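A minimal sketch of the measurement behind the paradox: score the concatenated prompt+answer once, then average token-level losses separately over the prompt span and the answer span. The naive span boundary handling here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def split_perplexity(model, tok, prompt: str, answer: str):
    """Return (prompt_ppl, answer_ppl) for one (prompt, answer) pair."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits

    # Next-token losses: position i predicts token i+1.
    losses = F.cross_entropy(logits[0, :-1], full_ids[0, 1:], reduction="none")
    split = prompt_ids.shape[1] - 1            # boundary in the shifted loss vector
    prompt_ppl = losses[:split].mean().exp().item()
    answer_ppl = losses[split:].mean().exp().item()
    return prompt_ppl, answer_ppl

# The memorization signature: answer_ppl falls across RLVR steps while prompt_ppl rises.
```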
🍞 Top Bread (Hook): Think of comparing two recipes for cookies to see how different they are.
🥬 Jensen-Shannon Divergence (JSD):
- What it is: A measure of how far apart two probability distributions are.
- How it works:
- Look at token probabilities from an intermediate layer (via Logit Lens).
- Compare them to the final output probabilities.
- Compute JSD to see how much transformation still has to happen.
- Track JSD across layers to spot where structure changes.
- Why it matters: Peaks in certain sub-parts of MLPs show where the model reconfigures itself to carry the shortcut.
🍞 Bottom Bread (Anchor): In Qwen2.5-Math, JSD for W_up and W_gate peaks around Layers 21-22, signaling strong structural adaptation there.
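For completeness, a small function computing the JSD between two next-token distributions, e.g. an intermediate-layer Logit Lens distribution versus the final output distribution. The per-component (W_up / W_gate / W_down) bookkeeping from the paper is omitted; this only shows the divergence itself.

```python
import torch
import torch.nn.functional as F

def jsd(logits_p: torch.Tensor, logits_q: torch.Tensor) -> float:
    """Jensen-Shannon divergence between two next-token distributions."""
    p = F.softmax(logits_p, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum()
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum()
    return (0.5 * kl_pm + 0.5 * kl_qm).item()
```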
02 Core Idea
🍞 Top Bread (Hook): Imagine a maze where most kids solve it by thinking, but one kid finds a hidden door that jumps straight to the exit. He wins fast, but didn't learn the maze.
🥬 The Concept (Aha! in one sentence): Spurious RLVR doesn't teach better reasoning; it flips a hidden switch that makes the model retrieve memorized answers using an Anchor-Adapter circuit.
How it works (step by step):
- If pretraining leaked test answers into the model, those answers live somewhere in its MLP memories.
- Spurious or incorrect rewards still pressure the model to get final answers right.
- Middle layers (Functional Anchors) learn to detect "this looks like a known question" and inject a memory-trigger signal.
- Later layers (Structural Adapters) reshape the internal space so the injected answer token cleanly rises to the top.
- The modelâs answer confidence goes up, but its prompt understanding goes down (the Perplexity Paradox).
Why it matters: Without noticing, we might call this "reasoning progress" even though it is just faster copying; this misleads benchmarks, product claims, and research directions.
🍞 Bottom Bread (Anchor): On MATH-500, Qwen2.5-Math improves a lot after spurious RLVR, but tools show the answer token appears early and strongly due to Anchors/Adapters: classic memorization, not new reasoning.
Multiple Analogies (3 ways):
- Secret elevator: The Anchor is the button; the Adapter is the elevator shaft reshaped to fit; you end up at the top floor without climbing the stairs.
- Choir and conductor: The Anchor (conductor) cues the soloist (answer token) early; the Adapter (choir) harmonizes so the solo dominates at the finale.
- GPS reroute: The Anchor presses "avoid traffic" and picks a shortcut; the Adapter redraws the map so every turn leads straight to the exit.
Before vs After:
- Before: We assumed RLVR gains meant better step-by-step reasoning.
- After: We now see those gains can come from unlocking stored answers via a specific mid-to-late layer circuit.
- Change: Evaluation must separate "knowing what to answer" from "knowing how to reason."
Why it works (intuition):
- MLPs store compressed patterns; spurious rewards apply pressure at the outcome, not the process.
- The model finds the simplest way to maximize reward: match prompts to stored answers.
- Mid-layers are well-placed to recognize patterns; late layers can rotate features to present the chosen token cleanly.
- This division of labor (Anchor vs Adapter) naturally emerges under reward pressure when contamination is present.
Building Blocks:
- MLP memory shelves: hold candidate features and answers.
- Functional Anchors (L18-L20): choose the "memory route" and inject a high-probability answer direction.
- Structural Adapters (L21+): rotate/transform representation so the output layer cleanly selects the injected token.
- Detectives (tools): Path Patching (causality), Logit Lens (early token commitment), JSD (where structure shifts), NDEs (where dynamics split).
03 Methodology
At a high level: Input (math question) → Detect whether it matches a stored pattern (Anchors L18-L20) → Reshape the signal to favor the memorized token (Adapters L21+) → Output the final answer.
Step-by-step recipe:
- Prepare datasets and groups.
- What happens: Split questions into "leakage" (Wrong → Right after spurious RLVR) and "stable" (unchanged), across MATH-500, MinervaMath, and the clean control LiveMathBench.
- Why it exists: To compare where memorization is activated versus where it isnât.
- Example: A MATH-500 problem that's wrong before training but right after suggests the model learned to retrieve the answer, not to reason it out anew.
- Measure the Perplexity Paradox.
- What happens: Track perplexity (uncertainty) over training steps for both prompt tokens and answer tokens.
- Why it exists: To see if answer confidence rises while prompt understanding falls, a memorization signature.
- Example: In Qwen2.5-Math, answer perplexity drops while prompt perplexity rises; in LLaMA/OLMo both perplexities get worse.
- Use Path Patching to find causal layers.
- What happens: Swap layer activations from the RLVR-tuned run into the base run (and vice versa) and see if outputs flip.
- Why it exists: To causally locate which layers make the âmemorize nowâ decision.
- Example: Swapping MLP activations in L18-L20 boosts accuracy recovery; after L21, recovery drops, pointing to the Anchors just before a structural shift.
- Use Logit Lens to see when tokens emerge.
- What happens: Project each layerâs hidden state into token space to watch which tokens are already winning.
- Why it exists: To catch early commitment to the final answer before the last layer.
- Example: For a leaked MATH-500 question with answer "4," the token "4" surges at L23 after precursor signals near L19.
- Use JSD to find structural adaptation.
- What happens: Compare token distributions from intermediate layers to the final distribution to measure how much change remains.
- Why it exists: High or peaking JSD reveals where the model reorganizes its internal space.
- Example: W_up and W_gate JSD peak at L21-L22, then decline; W_down stays high, showing late layers adapt structure for the shortcut signal.
- Use NDEs to map dynamics and find bifurcation.
- What happens: Fit a continuous model of layer-wise updates; measure where leakage and stable trajectories split and how strongly.
- Why it exists: To mathematically confirm the exact depth where the path diverges.
- Example: Separation force peaks at L18-L20, validating Anchors as the decision point; later layers boost velocity (amplify the signal) without changing direction.
- Do ablations to test necessity and sufficiency.
- What happens: Reset Anchors (L18-L20) or Adapters (L21-L22) to base weights; or keep only one group's RLVR weights.
- Why it exists: To show which pieces are required and whether any piece alone is enough.
- Example: On MATH-500, resetting Anchors drops leakage accuracy from 98% to 86%; keeping only Adapters fails badly; both groups are needed together.
- Mechanistic intervention: steer with neuron key scaling.
- What happens: Identify task-relevant MLP neurons, then multiply their key activations by a factor α during inference.
- Why it exists: To causally turn the shortcut up or down and prove control.
- Example: Amplify at Layer 18 (+4.4% leakage accuracy); suppress at Layer 18 (-3.8%); on clean LiveMathBench, steering shows no systematic effect, indicating specificity to contamination.
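A hedged sketch of what such a key-scaling intervention can look like in practice: a forward hook multiplies the activations of chosen MLP neurons at one layer by a factor alpha during inference (alpha > 1 amplifies the suspected shortcut, alpha < 1 suppresses it). The module path, the hook point (up_proj), and the example neuron indices are assumptions about a LLaMA/Qwen-style block, not the paper's exact procedure.

```python
import torch

def steer_mlp_keys(model, layer_idx: int, neuron_ids, alpha: float):
    """
    Scale selected MLP key activations at one layer by alpha during inference.
    In a LLaMA/Qwen-style gated MLP, scaling a neuron's up_proj output scales its
    key activation (act(gate) * up) by the same factor.
    """
    up_proj = model.model.layers[layer_idx].mlp.up_proj  # module path is an assumption

    def scale_hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_ids] = alpha * output[..., neuron_ids]
        return output

    return up_proj.register_forward_hook(scale_hook)  # call .remove() to undo

# Usage sketch (hypothetical neuron indices): suppress suspected shortcut neurons at layer 18.
# handle = steer_mlp_keys(model, layer_idx=18, neuron_ids=[113, 2048, 4097], alpha=0.0)
# ... run evaluation ...
# handle.remove()
```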
What breaks without each step:
- Without leakage vs stable grouping: You can't isolate memorization from genuine reasoning.
- Without perplexity tracking: You miss the paradox signal.
- Without Path Patching: You don't know which layers cause the behavior.
- Without Logit Lens: You can't see early token commitment.
- Without JSD: You miss where structure is reconfigured.
- Without NDEs: You can't pinpoint the bifurcation depth.
- Without ablations/interventions: You lack causal proof and control.
The secret sauce:
- The Anchor-Adapter split: mid-layers decide; late layers adapt structure.
- Cross-tool triangulation (Path Patching + Logit Lens + JSD + NDE): each method independently points to the same layers and roles.
- Bidirectional steering: scaling MLP keys not only observes but controls the shortcut, closing the causal loop.
04 Experiments & Results
The test: Measure whether gains come from reasoning or memorization, and where the switch happens.
- Datasets: MATH-500 and MinervaMath (contaminated), LiveMathBench (clean control), plus AIME/AMC for breadth.
- Models: Qwen2.5-Math-7B (affected), Qwen3-8B (weaker effect), LLaMA-3.1-8B and OLMo-2-1124-7B (controls).
- Metrics: Accuracy, perplexity (prompt vs answer), ROUGE-L and completion for partial prompts, JSD, and dynamic NDE measures.
The competition: Compare how spurious RLVR affects different model families.
- Qwen2.5-Math-7B: large gains on contaminated sets, clear Perplexity Paradox, and a strong Anchor-Adapter signature.
- Qwen3-8B: similar but weaker patterns (later peaks, less dramatic gains).
- LLaMA/OLMo: no Anchor-Adapter signature; spurious RLVR hurts perplexity overall; little to no gains.
The scoreboard (with context):
- Perplexity Paradox in Qwen2.5-Math: answer perplexity goes down (more certain), prompt perplexity goes up (less coherent). That's like getting faster at shouting the final number but worse at reading the problem.
- Path Patching: High accuracy recovery when patching MLPs around L18-L20, sharp drop after L21, pinpointing the Anchors just before a structural shift.
- JSD: W_up and W_gate peak at L21-L22 then decline; W_down stays high; late layers are adapting structure rather than adding new content.
- NDE dynamics: Separation force peaks at L18-L20; velocity difference grows later: Anchors decide, Adapters amplify.
- Ablations (MATH-500): Reset Anchors (L18-L20) → leakage accuracy drops from 98% to 86%; Reset Adapters (L21-L22) → smaller drop; Keep Only Adapters → catastrophic on contaminated sets; Stable (clean) samples barely change.
- Steering: At Layer 18, suppression reduces leakage accuracy (-3.8%), amplification boosts it (+4.4%). On LiveMathBench, steering shows no consistent effect, supporting contamination specificity.
Surprising findings:
- Even incorrect or random rewards can activate the memory pathway if contamination exists; reward correctness wasn't necessary.
- In failed retrievals, amplification sometimes "wakes up" a dormant shortcut that base inference couldn't access.
- LiveMathBench improvements, when present, look like formatting/structure refinements, not memory retrieval, helping separate "how to talk" from "what to answer."
05 Discussion & Limitations
Limitations:
- Primarily demonstrated on Qwen2.5-Math-7B (with supporting signs in Qwen3-8B). Other architectures may organize memories differently.
- Focused on math benchmarks; coding or multi-step scientific reasoning may show other patterns.
- Tools like Logit Lens and Path Patching approximate complex internals; results depend on careful protocol choices.
Required resources:
- Access to base and RLVR-tuned checkpoints.
- Compute for running multiple probes (Path Patching sweeps, JSD component tests, NDE fitting) and ablation evaluations.
- Clean control datasets to distinguish reasoning from contamination.
When not to use (or interpret cautiously):
- If you lack a clean benchmark, you may mistake formatting gains for new reasoning.
- If the model family shows no leakage signs (like LLaMA/OLMo here), don't over-interpret noise in JSD or Logit Lens.
- If rewards target multi-step reasoning quality (not just final answers), this specific shortcut may be less relevant.
Open questions:
- How general is the Anchor-Adapter pattern across sizes, domains, and training recipes?
- Can we design RLVR rewards that explicitly penalize early token commitment without chain-of-thought evidence?
- Can automatic detectors flag Perplexity Paradox on-the-fly during training to stop contamination amplification?
- Are there safer structural adapters that improve reasoning instead of reorganizing for shortcuts?
- Can post-hoc steering be turned into a robust, deployable decontamination filter?
06 Conclusion & Future Work
Three-sentence summary:
- Spurious RLVR can make LLMs score higher by activating memorized answers rather than improving reasoning.
- A specific Anchor-Adapter circuit enables this: middle layers (L18-L20) trigger retrieval, while later layers (L21+) reshape representations so the memorized token wins.
- The Perplexity Paradox (lower answer perplexity but higher prompt perplexity) signals this switch, and targeted neuron scaling can amplify or suppress it.
Main achievement:
- A causal, mechanistic roadmap (Path Patching, Logit Lens, JSD, NDEs, ablations, and steering) that localizes and controls contamination-driven shortcuts.
Future directions:
- Build contamination-resistant reward functions and benchmarks that require grounded reasoning.
- Automate paradox detection during training to prevent shortcut formation.
- Generalize the Anchor-Adapter analysis to more model families and tasks (math, code, science).
- Turn neuron-level steering into a practical "decontamination" tool at inference time.
Why remember this:
- It reframes puzzling RLVR gains: not all accuracy increases mean deeper thinking.
- It gives clear signs (Perplexity Paradox) and tools (Anchor-Adapter mapping, steering) to tell memorization from reasoning.
- It helps the community build safer, more honest evaluations and models that truly learn how to solve problems, not just what to answer.
Practical Applications
- Add a Perplexity Paradox check (prompt vs answer perplexity) to RLVR training dashboards to catch shortcut formation early.
- Run Path Patching sweeps after RLVR to localize Anchor and Adapter layers before shipping a model.
- Use clean control sets like LiveMathBench to separate reasoning gains from contamination-driven gains.
- Apply layer resets (Anchor/Adapter) to diagnose and mitigate leaked performance without retraining from scratch.
- Deploy neuron key scaling at inference time to suppress suspected memorization on sensitive evaluations.
- Track JSD of MLP subcomponents across layers to monitor structural adaptations that signal shortcut building.
- Fit NDEs on residual updates to automatically detect bifurcation layers where reasoning switches to memory.
- Gate benchmark claims: require both accuracy gains and no Perplexity Paradox before declaring reasoning improvements.
- Design rewards that score intermediate reasoning quality (not just final answers) to reduce shortcut incentives.
- Automate a "decontamination pass" that disables identified Anchor-Adapter circuits during official evaluations.