Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
Key Summary
- This paper shows that many reasoning failures in AI come from just a few distracting words in the prompt, not from the problems being too hard.
- The authors introduce LENS, a two-stage method that first finds and removes those distracting words (interference tokens) and then teaches the model to ignore them even when they are present.
- The key trick is to compare the current model with a steady reference model to score which prompt tokens are most likely to cause trouble.
- After making a cleaner prompt, the model gathers more successful examples and uses them to calibrate learning on the original noisy prompt.
- Across seven math benchmarks, LENS boosts accuracy by an average of 3.88% and reaches strong performance about 1.6× faster than GRPO.
- LENS reduces the number of zero-reward prompts (where no sampled answers are correct), preventing training from stalling.
- It beats both ‘just do more rollouts’ and ‘throw away hard prompts’ strategies while using fewer resources.
- Only a tiny slice of tokens (often under 5%) causes most of the interference, so careful pruning helps a lot.
- Weaker models need a slightly higher pruning ratio than stronger models, suggesting they are more sensitive to interference.
- Limitations include evaluation only up to 8B-parameter models, mostly binary rewards, and no experiments yet combining LENS with other RLVR variants.
Why This Research Matters
Real-world prompts are messy: extra words, mixed styles, and occasional bad hints. LENS shows that teaching models to ignore a tiny fraction of distracting tokens can dramatically improve reasoning without huge compute. This makes AI helpers more reliable in classrooms, coding assistants, and study tools where instructions aren’t perfectly clean. It also reduces training waste by turning previously unhelpful, zero-reward prompts into useful teachers. Faster convergence means lower energy costs and quicker iteration for research and products. By focusing on signal quality over brute force, LENS points to a smarter path for advancing AI reasoning.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: You know how doing homework in a noisy room is tough? Even if you’re good at math, a few loud noises can make you mess up.
🥬 The Concept: Reinforcement Learning with Verifiable Rewards (RLVR) is how we train AI to reason by trying answers and getting a clear right-or-wrong reward at the end.
- What it is: A learning setup where the AI tries different solution steps (called rollouts) and only gets a reward if the final answer is correct.
- How it works: 1) Read the problem. 2) Generate several solution attempts. 3) Check which ones are right (verifiable). 4) Use the wins to learn better next time.
- Why it matters: Without clear rewards, the AI doesn’t know what to copy or avoid. With RLVR, correct answers guide learning—even if feedback comes late.
🍞 Anchor: Imagine a quiz where you only get a point for the exact correct answer. The AI keeps trying solutions, and points (rewards) help it learn which attempts were good.
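To make the "a point only for the exact correct answer" idea concrete, here is a minimal sketch of a binary verifiable reward. The `extract_final_answer` helper is a made-up stand-in for whatever answer parser a real pipeline would use; it is not the paper's implementation.

```python
# Minimal sketch of a binary, verifiable reward (illustrative only).
# `extract_final_answer` is a hypothetical helper, not the paper's implementation.

def extract_final_answer(solution_text: str) -> str:
    """Toy parser: treat the last non-empty line of the solution as the answer."""
    lines = [ln.strip() for ln in solution_text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def verifiable_reward(solution_text: str, gold_answer: str) -> float:
    """Reward 1.0 only if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if extract_final_answer(solution_text) == gold_answer.strip() else 0.0

# Two rollouts for the same problem: only the exact match earns a point.
rollouts = ["... therefore\nx = 2", "... so the answer is\nx = 3"]
print([verifiable_reward(r, "x = 2") for r in rollouts])   # [1.0, 0.0]
```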
🍞 Hook: Imagine you try 8 times on a tough problem, but none are correct. That’s like practicing but never knowing what worked.
🥬 The Concept: Reward sparsity means the AI rarely sees successful attempts, so it can’t learn well.
- What it is: A training situation where correct rollouts are very rare, so useful learning signals are scarce.
- How it works: 1) The AI makes many long answers. 2) Most end up wrong (binary: 0). 3) With few correct examples, training slows or collapses. 4) The model struggles to explore effectively.
- Why it matters: If success is too rare, the AI can’t tell what to improve. It’s like studying from blank papers instead of marked tests.
🍞 Anchor: If you do 100 math problems and get feedback on only 1, you won’t improve fast. That’s the AI’s problem too.
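Here is why sparsity stalls learning in GRPO-style training, as a tiny sketch: each rollout's learning signal is its reward compared to the group average, so a group where every reward is zero produces no usable signal at all. (This assumes the standard group-normalized advantage; the numbers are toy values.)

```python
# Toy sketch: GRPO-style training scores each rollout relative to its group,
# so an all-zero reward group yields zero advantage everywhere -> nothing to learn.

def group_advantages(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([0, 0, 0, 0, 0, 0, 0, 0]))  # all zeros: no learning signal
print(group_advantages([1, 0, 0, 0, 0, 0, 0, 0]))  # one success: a usable signal appears
```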
🍞 Hook: You know how one confusing sentence can mess up your whole understanding of a story?
🥬 The Concept: Interference tokens are a few words in the prompt that distract the AI and increase mistakes.
- What it is: Specific prompt tokens that push the model off-track, often fewer than 5% of the words.
- How it works: 1) The AI reads the prompt. 2) Certain words make it guess strangely (over-optimizing or chasing noise). 3) Those tokens cause wrong solution paths. 4) Remove them, and success often jumps.
- Why it matters: If we blame the whole problem, we miss the real culprit—a tiny slice of noisy tokens.
🍞 Anchor: In the paper’s tests, simply removing a few bad tokens improved accuracy by over 20% on some hard samples.
🍞 Hook: Think of two classmates: one steady and careful, one a bit excitable. If they disagree a lot on some word, that word might be confusing.
🥬 The Concept: A reference model is the steady classmate we compare against to spot weird behavior.
- What it is: A stable model used as a baseline for normal token behavior.
- How it works: 1) Compute how likely each token seems to both models. 2) Big differences flag suspicious tokens. 3) Rank tokens by this difference. 4) Mark the highest as interference.
- Why it matters: Without a calm baseline, it’s hard to know which words are actually disruptive.
🍞 Anchor: If your friend and you both read the same sentence but only you get confused at the word “only,” that word needs attention.
🍞 Hook: If noise is the problem, what if we quiet the room first and then learn to focus even with noise later?
🥬 The Concept: LENS (Less Noise Sampling) is a two-stage method that first cleans noisy prompt tokens, then teaches the model to ignore them in the original prompt.
- What it is: A rollout framework that boosts exploration by pruning interference tokens and transferring the wins back to the noisy setting.
- How it works: 1) Score and remove a tiny fraction of high-interference tokens. 2) Generate answers on this cleaner prompt to get more correct samples. 3) Use those successful answers to guide learning on the original noisy prompt. 4) The model learns to ignore the noise.
- Why it matters: It turns low-success prompts into useful teachers without throwing them away or spending tons more compute.
🍞 Anchor: It’s like practicing a piano piece slowly without background chatter, then using what you learned to play well even when the room is noisy.
The world before: RLVR methods like GRPO made real progress in math reasoning, but exploration on complex tasks was inefficient. Many training steps produced no correct rollouts (zero-reward prompts). Teams tried two main fixes: do more rollouts (expensive and not necessarily smarter) or filter out hard/zero-variance prompts (safer but loses chances to improve on tough problems).
The problem: Training became unstable when the model couldn’t find enough correct examples, especially on long problems with delayed, binary rewards. This caused low sampling success and sometimes collapsed learning.
Failed attempts: Scaling exploration multiplied cost; filtering made training stable but cut off learning on the most useful, challenging samples.
The gap: Everyone treated hard prompts as the culprit. The paper discovered a sharper truth: a tiny set of prompt tokens often causes the failures. Fixing those tokens can unlock success without massive compute or discarding data.
Real stakes: In real life, prompts are messy—extra instructions, typos, style quirks. Teaching models to ignore minor distractions makes them more reliable in classrooms, coding, tutoring, and everyday assistants. Less noise means more voice: the model’s true reasoning shines through.
02Core Idea
🍞 Hook: Imagine reading a math problem with a few sneaky words that make you misread it. If you cross out just those words, suddenly everything clicks.
🥬 The Concept: The key insight is that a small number of “interference tokens” cause many exploration failures; prune them to find more correct answers, then teach the model to ignore them even when they’re present.
- What it is: A two-step strategy—clean first, then transfer learning to the original noisy prompt.
- How it works: 1) Detect suspicious tokens by comparing the current model with a stable reference. 2) Remove a tiny top-k set of high-interference tokens. 3) Generate answers on the cleaned prompt to get more correct rollouts. 4) Replace some failures on the original prompt with these successes and reweight them to guide learning. 5) Update the policy so it learns to resist interference.
- Why it matters: Without this, training wastes time on zero-reward prompts or throws them away, missing rich learning signals.
🍞 Anchor: Like putting on noise-canceling headphones to learn a song, then practicing without them so you can perform anywhere.
Multiple analogies:
- Classroom analogy: Cross out confusing side-notes in a test question to get the right answer, then learn to ignore such side-notes in future tests.
- Sports analogy: A coach first runs a drill in a quiet gym (clean prompt) to perfect form, then uses that skill to win in a loud stadium (noisy prompt).
- Cooking analogy: You remove an odd spice that ruins the dish (prune tokens), learn the intended flavor, then recognize and ignore that spice next time you taste it.
Before vs After:
- Before: We thought failures mostly came from problem difficulty, so we either sampled more (costly) or filtered hard prompts (lost learning opportunities).
- After: We know a few tokens cause outsized harm. By pruning them and transferring successes back, we keep hard prompts and make them teach us more efficiently.
Why it works (intuition):
- Token surgery: Deleting just 1–5% of tokens reduces misleading gradients that pull the model off-track.
- Better exploration: Clean prompts yield more correct attempts, increasing reward variance and stabilizing learning.
- Transfer learning: Using those correct attempts to supervise learning on the original prompt teaches the model to ignore interference rather than depend on cleaner conditions.
- Safety rail: A reference model flags unusual token influences, preventing over-optimization toward noisy directions.
Building blocks (each with a mini sandwich):
- 🍞 Hook: You know how comparing answers with a careful friend helps spot your slips? 🥬 The Concept: Reference model. It’s a steady baseline to compare token choices. Why it matters: Without it, you can’t easily tell which tokens are the troublemakers. 🍞 Anchor: If the calm friend acts normal at a word but you act weird, the word is suspicious.
- 🍞 Hook: Imagine a red highlighter marking the most distracting words. 🥬 The Concept: Interference score. It measures how much a token’s behavior deviates from the reference. Why it matters: High scores point to tokens likely causing failures. 🍞 Anchor: Top-scored words get pruned first.
- 🍞 Hook: Think of erasing just a few scribbles on a worksheet. 🥬 The Concept: Token-wise pruning. Remove the top-k most interfering tokens for a cleaner prompt. Why it matters: Tiny deletions, big clarity. 🍞 Anchor: Delete 1–5% of tokens and accuracy can jump.
- 🍞 Hook: Practice the song cleanly, then play it on a noisy stage. 🥬 The Concept: CRPO (Calibrated Rollout Policy Optimization). Use successes from clean prompts to guide learning on the original prompts. Why it matters: The model learns to be robust to noise. 🍞 Anchor: Replace some failed attempts with clean successes and reweight them during training.
- 🍞 Hook: When grading, harder questions can count more. 🥬 The Concept: Sample reweighting. Adjust how much each example influences learning based on success rates. Why it matters: Stabilizes updates and emphasizes the most informative signals. 🍞 Anchor: Successful clean rollouts get enough weight to steer the model away from interference.
In short, the aha is: prune a pinch of noise, harvest more wins, and then teach the model to win even with the noise back in place.
03Methodology
At a high level: Noisy prompt → Stage I: Identify and prune interference tokens → Generate rollouts on the clean prompt → Stage II: Calibrate policy on the original prompt using clean successes → More robust model outputs.
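Read as code, that flow is a short control loop. The sketch below is only schematic: every helper passed in (`score_fn`, `prune_fn`, `rollout_fn`, `calibrate_fn`, `update_fn`) is a placeholder for the steps detailed in the rest of this section, not the authors' actual API.

```python
# Schematic of the LENS loop described above. Every helper is passed in as a
# callable because these are placeholders for this explainer, not the authors' API.

def lens_step(prompt, gold, policy, reference,
              score_fn, prune_fn, rollout_fn, calibrate_fn, update_fn,
              n_rollouts=8, prune_ratio=0.03):
    # Stage I: score prompt tokens against the reference and drop the top few percent.
    scores = score_fn(policy, reference, prompt)
    clean_prompt = prune_fn(prompt, scores, prune_ratio)

    # Roll out on both the original and the purified prompt.
    orig_group = rollout_fn(policy, prompt, gold, n_rollouts)
    clean_group = rollout_fn(policy, clean_prompt, gold, n_rollouts)

    # Stage II: mix clean successes into the original-prompt group, reweight,
    # and update the policy against the ORIGINAL (noisy) prompt.
    calibrated = calibrate_fn(orig_group, clean_group)
    update_fn(policy, reference, prompt, calibrated)
```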
Stage I: Interference Token Identification and Purification
🍞 Hook: You know how you and a careful friend might react differently to a tricky word? That difference is a clue.
🥬 The Concept: Interference scoring with a reference model.
- What happens: 1) For each token in the prompt, compare the likelihood that the current model and a stable reference each assign to it in context. 2) A big gap means the token may be causing trouble. 3) Rank tokens by this gap (the interference score). 4) Delete the top-k tokens, where k is a small fraction (like 1–5%), as sketched below.
- Why this step exists: It pinpoints which words are distracting the model’s reasoning. Without it, pruning would be random and could harm meaning.
- Example: Prompt token list: [“Solve”, “carefully”, “don’t”, “guess”, “2x+3=7”, “answer”, “now”]. If “don’t” and “guess” score highest, we remove them to reduce confusion.
🍞 Anchor: Like erasing two noisy margin notes so the main equation stands out and you solve it correctly.
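A minimal, framework-free sketch of this scoring-and-pruning step, assuming we already have each model's per-token log-probabilities for the prompt. The exact interference score used in the paper may be defined differently, and the numbers below are toy values chosen to match the margin-note example.

```python
# Minimal sketch of Stage I, assuming per-token log-probabilities are already
# available for the prompt; the paper's exact interference score may differ.

def interference_scores(policy_logps, reference_logps):
    """Score each prompt token by how much the two models disagree on it."""
    return [abs(p - r) for p, r in zip(policy_logps, reference_logps)]

def prune_top_k(tokens, scores, ratio):
    """Delete the top `ratio` fraction of tokens with the highest scores."""
    k = max(1, int(len(tokens) * ratio))
    worst = set(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tok for i, tok in enumerate(tokens) if i not in worst]

tokens = ["Solve", "carefully", "don't", "guess", "2x+3=7", "answer", "now"]
policy_logps    = [-1.2, -2.0, -5.5, -6.1, -1.0, -1.4, -2.2]   # toy numbers
reference_logps = [-1.1, -2.1, -2.0, -2.3, -1.1, -1.5, -2.0]
scores = interference_scores(policy_logps, reference_logps)
# A large ratio is used here only because the toy prompt is 7 tokens long;
# in practice the pruned fraction is tiny (roughly 1-5%).
print(prune_top_k(tokens, scores, ratio=0.3))   # drops "don't" and "guess"
```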
Generating Rollouts on the Denoised Prompt
🍞 Hook: Practice the piece slowly in a quiet room to get more correct notes.
🥬 The Concept: Cleaner prompts give more successful rollouts.
- What happens: 1) Sample several answers from the cleaned prompt. 2) Check which are correct (verifiable reward). 3) Compute a success rate. 4) If the cleaned prompt did better than the original, we’ll transfer those wins.
- Why this step exists: It creates a richer pool of correct examples, so training doesn’t stall. Without it, we’re stuck with too many failures.
- Example: Original prompt success: 0/8. Clean prompt success: 3/8. Now we have 3 good answers to learn from.
🍞 Anchor: If you finally play three bars correctly in practice, you can study what you did right.
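A little bookkeeping makes the 0/8-versus-3/8 example concrete: score each rollout with the verifiable reward, compare success rates, and only transfer if the purified prompt actually produced wins. The reward values below are toy numbers.

```python
# Toy bookkeeping for this step (reward values mirror the example above).

def success_rate(rewards):
    return sum(rewards) / len(rewards)

orig_rewards  = [0, 0, 0, 0, 0, 0, 0, 0]   # original prompt: 0/8 correct
clean_rewards = [1, 0, 1, 0, 0, 1, 0, 0]   # purified prompt: 3/8 correct

# Only transfer when the purified prompt found wins the original lacked.
should_transfer = success_rate(clean_rewards) > success_rate(orig_rewards)   # True
```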
Stage II: Calibrated Rollout Policy Optimization (CRPO)
🍞 Hook: After nailing the melody in the quiet room, you bring that skill to the main stage.
🥬 The Concept: Transfer clean successes back to the original prompt and calibrate training.
- What happens: 1) Build a training group by taking all original samples, then replacing a matching number of failures with successes from the clean prompt. 2) Reweight samples so the signal is balanced (successes vs. failures). 3) Apply importance correction so learning remains fair when mixing clean and noisy data. 4) Update the model using a stable, clipped objective with a gentle pull toward the reference (to avoid drifting into weird behaviors). A toy sketch follows this block.
- Why this step exists: It teaches the model to ignore interference instead of relying on perfectly clean prompts. Without it, the model would do well only in quiet conditions.
- Example with data: Original failures: 8/8. Clean successes: 3. Replace 3 original failures with these 3 successes. Reweight them so they guide the update strongly but safely. Train so next time, even with the noisy words back, the model solves it.
🍞 Anchor: It’s like learning to focus on the melody even when the crowd is loud.
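Here is a toy sketch of how the calibrated group might be assembled, assuming rollouts are simple (text, reward) pairs; the paper's exact replacement rule may differ, but the steps mirror the description above.

```python
# Toy sketch of the group construction; the paper's exact replacement rule may differ.
# Rollouts are represented as (text, reward) pairs.

def calibrate_group(orig, clean):
    """Swap original failures for clean successes and tag each sample's source."""
    clean_wins = [(text, reward) for text, reward in clean if reward > 0]
    fail_idx = [i for i, (_, reward) in enumerate(orig) if reward == 0][:len(clean_wins)]
    group = [(text, reward, "orig") for text, reward in orig]
    for i, (text, reward) in zip(fail_idx, clean_wins):
        group[i] = (text, reward, "clean")
    return group

orig  = [(f"orig_attempt_{i}", 0) for i in range(8)]                       # 0/8 correct
clean = [("clean_win_a", 1), ("clean_miss", 0), ("clean_win_b", 1), ("clean_win_c", 1)]
group = calibrate_group(orig, clean)
# Three of the eight original failures are now clean successes, ready for reweighting.
```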
Sample Reweighting (Secret Stabilizer)
🍞 Hook: In class, harder or more informative questions sometimes count more.
🥬 The Concept: Sample reweighting.
- What happens: 1) Compute how successful the original prompt was. 2) Give weights so successful samples (from clean or original) influence learning more. 3) Keep training stable by normalizing within the group.
- Why this step exists: Prevents any single source (all failures or all successes) from dominating updates. Without it, learning can become wobbly or collapse.
- Example: If original success was 0/8, the clean successes get weight to pull the model toward the better behavior.
🍞 Anchor: Grading with fair weights helps the class average reflect true learning.
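One plausible way to implement this weighting, kept deliberately simple: boost successful samples more when the original prompt's success rate was low, then normalize within the group so the average weight stays at 1.0. The actual formula in the paper may be different.

```python
# Illustrative reweighting (the paper's actual formula may differ): boost successes
# more when the original prompt's success rate was low, then normalize within the
# group so the average weight stays at 1.0.

def reweight(rewards, orig_success_rate):
    boost = 1.0 + (1.0 - orig_success_rate)            # rarer success -> bigger boost
    raw = [boost if r > 0 else 1.0 for r in rewards]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]

# Calibrated group from the previous sketch: 3 clean successes, 5 failures,
# and the original prompt scored 0/8.
weights = reweight([1, 1, 1, 0, 0, 0, 0, 0], orig_success_rate=0.0)
# Successes get twice the raw weight of failures; after normalization they carry
# about 1.45 each versus 0.73 for the failures.
```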
Importance Correction and Gentle Regularization
🍞 Hook: When combining practice from different rooms (quiet vs. noisy), you adjust for the differences so your score stays fair.
🥬 The Concept: Importance correction and KL regularization to a reference.
- What happens: 1) Correct for distribution mismatch when mixing clean and original rollouts. 2) Use PPO-style clipping so updates aren’t too big. 3) Nudge the policy toward the reference to avoid drifting into odd, over-optimized behavior.
- Why this step exists: Keeps optimization safe and steady. Without it, mixing data sources could push the model in unstable directions.
- Example: Even if clean rollouts are different from original ones, importance correction makes sure the learning step is ‘apples to apples.’
🍞 Anchor: Like adjusting scores from different judges so everyone’s on the same scale.
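For readers who want to see the shape of such an objective, here is a per-token, PPO/GRPO-style sketch with an importance ratio, clipping, and an approximate KL nudge toward the reference. It is illustrative only; the exact CRPO loss may differ.

```python
import math

# Per-token sketch of a clipped, importance-corrected objective with a gentle KL
# nudge toward the reference (illustrative; the exact CRPO loss may differ).

def clipped_token_loss(logp_new, logp_old, logp_ref, advantage, weight,
                       clip_eps=0.2, kl_coef=0.04):
    ratio = math.exp(logp_new - logp_old)                     # importance correction
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    policy_term = -min(ratio * advantage, clipped_ratio * advantage)   # PPO-style clip
    kl_term = kl_coef * (logp_new - logp_ref)                 # rough pull toward reference
    return weight * (policy_term + kl_term)

# Example: one token of a clean success mixed into the original-prompt group.
loss = clipped_token_loss(logp_new=-1.0, logp_old=-1.3, logp_ref=-1.1,
                          advantage=1.2, weight=1.45)
```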
The Secret Sauce
- Tiny, targeted pruning: Removing just 1–5% of tokens can flip many failures into successes.
- Transfer, not filter: Instead of throwing away tough prompts, LENS pulls wins from cleaned prompts back into the original, noisy world.
- Stability through reference: The reference model and gentle regularization keep learning on track while exploring better solutions.
End-to-end example
- Input: A long math prompt with discouraging phrases and extra fluff.
- Step A (Identify): Score each token vs. reference; flag the top few.
- Step B (Purify): Remove the flagged tokens; sample answers; collect successes.
- Step C (Calibrate): Replace some original failures with these successes; reweight; correct for distribution; update safely.
- Output: A policy that now solves more problems, even when small noisy phrases are present.
04Experiments & Results
🍞 Hook: You know how a good study method should help you get better grades faster, not just study longer?
🥬 The Concept: The authors tested whether LENS finds more correct answers and learns faster than popular methods.
- What they measured: 1) Pass@1 (did the first try get it right?), 2) Speed to reach strong accuracy, 3) How many prompts gave zero correct answers (a training danger zone).
- Why it matters: Higher accuracy with fewer steps means smarter learning, not just more effort.
🍞 Anchor: Think of it like getting an A sooner, with less cramming.
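For reference, Pass@1 is just the fraction of benchmark problems whose first sampled answer is correct; here is a toy sketch.

```python
# Toy sketch of Pass@1: the share of problems whose first sampled answer is correct.

def pass_at_1(first_try_correct):
    """first_try_correct: one boolean per benchmark problem."""
    return sum(first_try_correct) / len(first_try_correct)

print(pass_at_1([True, False, True, True]))   # 0.75 -> 75% Pass@1
```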
The test setup
- Models: Multiple families and sizes, including Llama-3.2-3B, Qwen2.5-3B/7B, Qwen3-4B/8B.
- Training data: Openr1-Math-46k for RL training.
- Benchmarks: Seven math challenges—MATH-500, AMC23, AIME24, AIME25, GaokaoEN-2023, Minerva, OlympiadBench—ranging from medium to very hard.
- Competitors: GRPO (standard), GRPO extended (more rollouts), DAPO and GRESO (filtering strategies), plus extended versions.
Scoreboard with context
- Overall: LENS beats GRPO with an average accuracy gain of 3.88% and gets there about 1.6× faster. That’s like jumping from a solid B to an A- while also finishing your homework well ahead of schedule.
- Stability: LENS consistently reduces zero-reward prompts. Fewer ‘all wrong’ batches mean steady learning instead of stalling.
- Hard tests: On tougher sets like AMC23 and AIME24, LENS shines even more. It’s like getting A’s in the hardest classes, not just the easy ones.
- Efficiency: Even when baselines used 2× rollouts or more training epochs, LENS often still won—with less compute.
Learning curves
- On both medium (MATH-500) and high difficulty (OlympiadBench), LENS climbs faster and ends higher or equal to GRPO.
- Translation: It’s not only better at the end; it’s better along the way, saving training time and resources.
Surprising findings
- Fewer than 5% of prompt tokens often account for most of the failures. Pruning those can flip many failed samples into successes.
- Weak vs. strong models: Smaller models benefit from slightly higher pruning ratios. Bigger models need less pruning to reach peak results.
- Not just scale: Simply doing more rollouts (GRPO extended) doesn’t solve the ‘noisy token’ problem. Smart pruning and transfer learning do.
Concrete examples (simplified)
- Zero-reward prompts: With GRPO, many prompts got 0/8 correct samples in a rollout group. LENS cut this ratio, moving more prompts into ‘some wins’ territory.
- Speedup: LENS reached GRPO’s best accuracy using only around 60% of the training steps on some benchmarks (about 1.6× faster).
Takeaway: LENS delivers a Pareto improvement—better accuracy and better efficiency. It doesn’t just work harder; it works smarter by removing tiny-but-mighty distractions and then teaching the model to ignore those distractions when they’re present.
05Discussion & Limitations
🍞 Hook: You know how a great study trick might still not work for every subject or classroom?
🥬 The Concept: Honest limits and when to use (or not use) LENS.
- Limitations: 1) Scale: Experiments go up to 8B parameters; behavior at 32B–70B is untested. 2) Reward style: Mostly binary rewards; multi-dimensional or partial-credit settings need more study. 3) Algorithm combo: LENS was built on GRPO; combining with other RLVR variants (like adaptive rollout schedulers or new advantage shaping) is future work. 4) Overhead: LENS adds some per-step cost (about 1.27×–1.62×) to get higher-quality signals, though still cheaper than simply doubling rollouts.
- Required resources: A reference model, token-level scoring, multi-rollout sampling, and an RLVR training stack (PPO-style updates, verifiable checkers). You need enough memory and compute to handle token scoring and re-rolling cleaned prompts.
- When not to use: 1) Very short or already clean prompts (little room to prune). 2) Tasks with dense, step-by-step rewards (noise less harmful). 3) Domains where removing any token breaks meaning (e.g., precise legal text). 4) Languages or tokenizations where tiny deletions distort crucial semantics.
- Open questions: 1) Can we learn interference detection end-to-end without a fixed reference model? 2) Can pruning be adaptive per-sample and per-model-size automatically? 3) How does LENS behave with graded rewards (partial credit) or multiple objectives (correctness + style)? 4) Can we extend beyond math to coding, planning, and multimodal tasks reliably? 5) Can we fuse LENS with advanced exploration schedulers for even faster gains?
🍞 Anchor: Like any powerful tool, LENS works best when the homework is noisy and long, and you have a clear way to check answers. For tiny, tidy questions, a simple pencil might be enough.
06Conclusion & Future Work
Three-sentence summary: This paper discovers that many reasoning failures come from a tiny number of distracting prompt tokens and proposes LENS to fix it. LENS first prunes those tokens to get more correct rollouts, then transfers those wins back to the original prompt so the model learns to ignore the noise. The result is higher accuracy and about 1.6× faster learning across multiple math benchmarks compared to strong baselines like GRPO.
Main achievement: A simple, plug-and-play framework that boosts exploration efficiency by turning low-success, noisy prompts into valuable teachers—without adding big compute budgets or discarding hard problems.
Future directions: Scale to larger models, handle richer rewards (partial credit, multiple objectives), and combine with advanced RLVR variants for even better stability and speed. Explore automated, adaptive pruning and reference-free detectors, and test across domains like coding and multimodal reasoning.
Why remember this: LENS flips the script—hard prompts aren’t always hard; a few noisy words can hide the answer. By removing less than 5% of tokens and smartly transferring successes, we can get “less noise, more voice,” making AI reasoning sturdier in the messy real world.
Practical Applications
- Build more robust math tutors that ignore distracting phrasing and focus on the core problem.
- Improve coding assistants by filtering out misleading comments or boilerplate that confuses reasoning.
- Enhance educational chatbots to handle noisy student questions without losing accuracy.
- Stabilize RL training for planning and tool-use agents when instructions contain fluff or inconsistent styles.
- Speed up model fine-tuning by reducing zero-reward batches, saving time and compute.
- Boost reliability in customer support bots where prompts vary in tone and clarity.
- Assist data curation by identifying high-interference phrases that should be rewritten or removed.
- Upgrade evaluation systems by detecting prompts likely to cause unstable model behavior.
- Enable better on-device reasoning models where compute is limited and efficient learning matters.
- Support multilingual reasoning by spotting language-specific interference tokens for targeted cleanup.