Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
Key Summary
- The paper shows that when we give AI lots of extra text, even harmless extra text, it can get badly confused, sometimes losing up to 80% of its accuracy.
- The authors build NoisyBench, a new test that adds different kinds of "distractors" like random web pages, old chat messages, and tricky look-alike passages.
- They find that agent-style AI that uses tools (like search or calculators) becomes even more fragile because it over-trusts noisy tool outputs.
- Common fixes (better prompts, clever context engineering, and simple supervised fine-tuning) do not reliably help and can even make things worse.
- Reinforcement learning helps a bit, but the big win comes from a new reward called Rationale-Aware Reward (RARE) that rewards models for pointing to the right evidence.
- RARE encourages the model to find and quote the helpful parts inside the noise, which significantly boosts robustness across tasks and models.
- As noise increases, models tend to think longer, grow less confident (higher entropy), and pay too much attention to distractor tokens.
- More test-time computation (longer chains of thought) in noisy contexts can actually hurt performance: an inverse scaling effect.
- The authors also release NoisyInstruct, a training set with distractors and hints to teach models what to keep and what to ignore.
- Overall, the work offers a realistic benchmark, a practical training signal (RARE), and clear analyses that guide building sturdier reasoning agents.
Why This Research Matters
In real life, information is messy: web pages can be irrelevant, chat histories get long, and tools return wrong or partial outputs. This paper shows that such noise can quietly wreck the performance of even top reasoning models, which matters in places like healthcare, finance, and education. With NoisyBench, we finally have a clear way to see how models behave under realistic clutter, not just in quiet, clean tests. RARE then gives us a practical training knob: reward models for using the right evidence, not just for lucky final answers. That makes AI helpers more trustworthy when they face distractions, leading to better advice, safer decisions, and fewer confident mistakes. Over time, this approach can guide better system design, tool routing, and context curation for robust, real-world AI agents.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine trying to solve a math problem in a cafeteria while friends are talking, music is playing, and your phone keeps buzzing. Even if you know the math, the noise makes it hard to focus.
The Concept (Agentic AI):
- What it is: Agentic AI are models that don't just answer; they plan, use tools, and take steps to reach goals.
- How it works:
- Read the task
- Decide which tool to use (search, code, calculator)
- Execute tools and gather info
- Combine results to act or answer
- Why it matters: Without this, models can only guess in one shot and fail at multi-step, real-world tasks. Anchor: A travel agent AI books your flight: it checks dates, searches prices, applies rules, and confirms the ticket, step by step.
Hook: You know how you sometimes search your notes during homework because you can't remember every detail?
The Concept (RAG, Retrieval-Augmented Generation):
- What it is: RAG is when the model looks things up (retrieves documents) and then writes an answer using that info.
- How it works:
- Read the question
- Retrieve likely helpful documents
- Focus on the right parts
- Write a grounded answer
- Why it matters: Without RAG, the model must rely on memory and can easily hallucinate. Anchor: Asking, "Who discovered penicillin?" RAG fetches a short bio of Alexander Fleming and then answers correctly.
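To make the retrieve-then-answer loop concrete, here is a minimal Python sketch. It is not the paper's pipeline: the toy corpus, the bag-of-words `score` function, and the prompt format are illustrative assumptions standing in for a real retriever and generator.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def score(query: str, doc: str) -> float:
    """Cosine similarity over bag-of-words counts (a crude stand-in for a real retriever)."""
    q, d = tokenize(query), tokenize(doc)
    overlap = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return overlap / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents that look most relevant to the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

corpus = [
    "Alexander Fleming discovered penicillin in 1928 at St Mary's Hospital.",
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Penicillin is an antibiotic derived from Penicillium moulds.",
]
question = "Who discovered penicillin?"
context = retrieve(question, corpus)

# A real system would now hand `context` plus the question to a language model;
# here we only print the grounded prompt that such a system would build.
print("Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:")
```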
Hook: You know how a shiny, wrong clue in a mystery story can trick you?
The Concept (Contextual Distractors):
- What it is: Extra text in the input that looks useful but isn't, like random web pages, unrelated chat history, or tricky, similar-sounding passages.
- How it works:
- Appear near the real question
- Share keywords or style
- Steal the modelās attention
- Lead reasoning down the wrong path
- Why it matters: Without filtering, the model chases the wrong trail and answers incorrectly. Anchor: A math word problem plus a long paragraph about a similar but different situation tricks the model into using the wrong numbers.
Hook: Think about how your eyes jump to bold or important words when reading.
The Concept (Attention Mechanism):
- What it is: A model's way of deciding which tokens matter most right now.
- How it works:
- Compare each word to your current goal
- Give "importance scores"
- Read high-score words more closely
- Use them to decide the next token
- Why it matters: Without attention, the model treats "the" and "answer" as equally important. Anchor: Asked "What's the capital of France?", attention zeroes in on "capital" and "France," enabling "Paris."
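The "importance scores" above are, in a standard transformer, softmax-normalized query-key dot products. The sketch below shows generic scaled dot-product attention in NumPy; the shapes and random inputs are illustrative, not tied to any model in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight each value by how well its key matches the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # one importance score per (query, key) pair
    weights = softmax(scores, axis=-1)     # normalize scores into a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # the current token's query
K = rng.normal(size=(5, 8))   # keys for 5 context tokens
V = rng.normal(size=(5, 8))   # values for the same tokens

output, weights = attention(Q, K, V)
print(weights.round(3))  # high weights mark the tokens the model "reads more closely"
```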
Hook: Before a big play, teams practice what to include on the field and what to leave on the bench.
The Concept (Context Engineering):
- What it is: The craft of choosing and organizing what context to put in the model's prompt.
- How it works:
- Select possibly relevant info
- Trim or rephrase it
- Order it logically
- Present it in a consistent template
- Why it matters: Without it, the model sees too much clutter and misses key facts. Anchor: Building a "cheat sheet" that only contains the formulas you really need for the test.
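As a rough illustration of the select-trim-order-template steps, here is a naive context-curation sketch; the `relevance` scorer, the character budget, and the template are placeholder choices, not the specific methods evaluated in the paper.

```python
def relevance(question: str, snippet: str) -> float:
    """Crude keyword-overlap score; a real system would use a retriever or reranker."""
    q = set(question.lower().split())
    s = set(snippet.lower().split())
    return len(q & s) / max(len(q), 1)

def engineer_context(question: str, snippets: list[str], budget: int = 2, max_chars: int = 200) -> str:
    # 1) Select: keep only the snippets that look most relevant to the question.
    kept = sorted(snippets, key=lambda s: relevance(question, s), reverse=True)[:budget]
    # 2) Trim: cut each surviving snippet to a character budget.
    kept = [s[:max_chars] for s in kept]
    # 3) Order + 4) Template: present the survivors in a consistent format.
    lines = [f"[{i + 1}] {s}" for i, s in enumerate(kept)]
    return "Relevant notes:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

print(engineer_context("What is the capital of France?", [
    "Paris is the capital and largest city of France.",
    "A long chat about weekend plans and lunch orders.",
    "France borders Spain, Italy, and Germany.",
]))
```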
Hook: If a goalie only gets praise for the final score, not for good saves during the game, they might stop making smart moves in the middle.
The Concept (Reinforcement Learning, RL):
- What it is: A way to train models by giving rewards for better behavior.
- How it works:
- Model tries a behavior (answer or plan)
- Gets a reward score
- Learns to increase good moves
- Repeats to improve policy
- Why it matters: Without RL, models can't learn from feedback about multi-step processes. Anchor: A robot learns to navigate a maze faster by getting points for moving closer to the exit.
The world before: Many benchmarks were "clean": they only tested models with the exact needed info and no extra noise. In these settings, top models looked great. But real life is messy: tools return wrong pages, chat histories pile up, and unrelated snippets sneak into prompts. The problem: When researchers added distractors similar to real-life clutter, model accuracy crashed, sometimes by as much as 80%. Worse, the errors looked like "emergent misalignment": the model drifted from instructions even without any attack.
Hook: A teacher needs a fair test, not just the easiest one.
The Concept (Alignment/Misalignment):
- What it is: Alignment means the model follows intended goals and values; misalignment means it drifts or behaves undesirably.
- How it works:
- Get an instruction and context
- Keep the goal in mind
- Avoid unsafe or off-goal paths
- Produce helpful, faithful outputs
- Why it matters: Without alignment, the model can be confidently wrong or unsafe. Anchor: You ask for safe kitchen tips, but the model recommends dangerous shortcuts; that's misalignment.
Failed attempts: People tried prompting tricks and context engineering to steer attention. They also tried supervised fine-tuning (SFT) on noisy examples, but it often caused catastrophic forgetting: models forgot useful habits. Plain outcome-only RL helped a bit but didn't teach the model how to spot the right evidence. The gap: We needed (1) a realistic test that injects noise and (2) a training signal that rewards not just correct answers but correct use of evidence.
Real stakes: In healthcare, mixing a correct guideline with a misleading blog post can change the treatment suggestion. In finance, an irrelevant memo can steer a risk analysis wrong. For students, long chats and random links can cause "smart" helpers to answer with confidence but be wrong. Getting robust to noise isn't a bonus; it's table stakes for safety and trust.
02 Core Idea
Aha! Key insight in one sentence: To make reasoning agents robust in the real world, test them with realistic noise (NoisyBench) and train them to point to the right evidence inside that noise (RARE).
Multiple analogies:
- Lifeguard whistle: In a crowded, noisy pool, a lifeguard learns to scan for real distress signals; RARE teaches the model to lock onto the truly helpful lines in the sea of text.
- Treasure map with decoys: The beach is full of fake X's; RARE rewards the model for finding and citing the one real X.
- Science fair judging: Don't just grade the final answer; give points for showing the experiment and labeling the right data. RARE rewards grounded reasoning steps.
Before vs. After:
- Before: Clean benchmarks made models look strong; adding distractors revealed big weaknesses (accuracy drops of up to about 80%). Agentic workflows that use tools helped in clean settings but got more fragile with noise.
- After: With NoisyBench, we can see the true robustness picture. With RARE, models learn to identify and quote the useful parts, resisting distractors and lifting accuracy across tasks and model sizes.
Why it works (intuition): Outcome-only rewards tell you if the cake tastes good, but not whether you followed the recipe. In noisy kitchens (prompts), you must reward "using the correct ingredients" (citing helpful spans) during cooking (reasoning). RARE gives that mid-process feedback, nudging attention and thought toward relevant evidence and away from distractors.
Building blocks (with concept intros):
Hook: Testing a runner only on flat tracks hides how they handle hills.
The Concept (NoisyBench):
- What it is: A benchmark that adds realistic distractors (random documents, random chat history, and hard negatives) across 11 datasets spanning RAG, reasoning, alignment, and tool use.
- How it works:
- Start with a standard task
- Mix in one of several distractor types
- Ensure distractors donāt secretly help
- Measure performance drops meaningfully
- Why it matters: Without it, we overestimate model strength and miss real-world failure modes. Anchor: It's like giving a math test in a noisy cafeteria to see who can still focus.
Hook: Giving a gold star only for a right final answer can reward lucky guesses.
The Concept (Outcome-based Rewards, OR):
- What it is: RL signals that pay only for the final answer being correct.
- How it works:
- Model answers
- Judge correctness
- Reward if correct, else not
- Update policy toward lucky final outcomes
- Why it matters: Without process checks, models may learn shortcuts and ignore evidence. Anchor: A student gets an A for the right number, even if they copied it from a friend.
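A minimal sketch of an outcome-only reward, assuming (for illustration) that completions end with a "Final answer:" line; the paper's actual verifiers are task-specific.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Assume, for this sketch, that the model ends with a line like 'Final answer: ...'."""
    match = re.search(r"final answer\s*:\s*(.+)", completion, flags=re.IGNORECASE)
    if match:
        return match.group(1).strip()
    lines = completion.strip().splitlines()
    return lines[-1].strip() if lines else ""

def outcome_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the gold label, else 0.0 (no credit for the reasoning)."""
    return float(extract_final_answer(completion).lower() == gold.lower())

print(outcome_reward("Let me think...\nFinal answer: Paris", "Paris"))  # 1.0
print(outcome_reward("Let me think...\nFinal answer: Lyon", "Paris"))   # 0.0
```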
Hook: You know how teachers give points for showing work, not just answers?
The Concept (RARE, Rationale-Aware Reward):
- What it is: A reward that pays the model for identifying and using the right evidence inside noisy context.
- How it works:
- Include a hidden reference span that contains helpful info
- Model highlights/paraphrases that info during reasoning
- A judge checks match between modelās cited content and the reference
- Reward is given when the right evidence is used
- Why it matters: Without RARE, models don't learn to filter noise; with it, they ground their thinking. Anchor: Like awarding points for correctly citing the textbook section that supports your answer.
Hook: Trick questions in quizzes often look almost right to make you pick them.
The Concept (Hard Negative Distractors):
- What it is: Misleading, look-alike passages that feel relevant but don't help answer the question.
- How it works:
- Share surface features with the question
- Encourage plausible but wrong steps
- Increase cognitive load
- Cause confident mistakes
- Why it matters: These expose whether models truly understand or just match patterns. Anchor: A math problem with numbers that tempt you to use the wrong formula.
Hook: Solving a mystery often takes several clues, not just one.
The Concept (Multi-hop Reasoning):
- What it is: Connecting several pieces of info across steps to reach the answer.
- How it works:
- Find clue A
- Use A to find B
- Combine A and B
- Conclude logically
- Why it matters: Without it, models fail tasks needing combinations of facts. Anchor: To answer, "Which scientist's discovery led to X?", you link who did what, when, and why.
Put together, the core idea is simple but powerful: Stress-test with realistic noise (NoisyBench), and teach the model to tag and use the right bits inside that noise (RARE).
03 Methodology
High-level pipeline: Input (Question + Noisy Context) → Model proposes reasoning and cites helpful spans → Judge evaluates correctness and grounding → RL updates model → Output is a more robust, noise-aware reasoner.
Step-by-step details (with the crucial concepts introduced once using sandwiches):
- Build NoisyBench
- What happens: Each task instance gets one of four settings: ND (no distractor), RD (random document), RC (random chat history), HN (hard negative). Distractors are filtered so they don't leak the answer or change the correct label.
- Why it exists: Real life has noise; we need to know how models behave under it.
- Example: A GPQA science question is paired with a long, on-topic but irrelevant article (HN) about similar molecules that could mislead a model into the wrong pathway explanation.
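As a rough sketch of how one benchmark instance might be assembled, the snippet below builds a prompt for each of the four settings. The distractor pools, the leakage filter, and the choice to prepend the distractor are illustrative assumptions, not the paper's exact construction.

```python
import random

def leaks_answer(distractor: str, answer: str) -> bool:
    """Placeholder filter: reject distractors that accidentally contain the gold answer."""
    return answer.lower() in distractor.lower()

def make_instance(question: str, answer: str, setting: str,
                  random_docs: list[str], chat_histories: list[str], hard_negatives: list[str]) -> dict:
    pools = {"ND": [""], "RD": random_docs, "RC": chat_histories, "HN": hard_negatives}
    candidates = [d for d in pools[setting] if not leaks_answer(d, answer)]
    distractor = random.choice(candidates) if candidates else ""
    prompt = (distractor + "\n\n" + question).strip()  # noise is placed alongside the real task
    return {"setting": setting, "prompt": prompt, "answer": answer}

inst = make_instance(
    "What is 12 * 7?", "84", "HN",
    random_docs=["An article about medieval castles."],
    chat_histories=["user: see you at 6\nassistant: sounds good!"],
    hard_negatives=["A worked example computing 12 * 9 = 108 step by step."],
)
print(inst["prompt"])
```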
- Construct NoisyInstruct (training data)
Hook: Practicing with realistic scrimmages beats only practicing drills.
The Concept (NoisyInstruct):
- What it is: A training set that mixes questions with hints (helpful references) and distractors (random and hard negatives) in various combinations: (A|Q), (A|Q,H), (A|Q,D), (A|Q,D,H).
- How it works:
- Gather diverse tasks from a broad corpus
- Add distractors sampled or generated elsewhere to avoid contamination
- Create hints that are helpful but not answers
- Filter low-quality samples with checks
- Why it matters: Without such practice, models won't learn to ignore noise. Anchor: It's like scrimmaging with crowd noise so the team learns to communicate effectively.
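Here is a small sketch of how the four training mixtures could be assembled. The `<reference>` tag mirrors the paper's RARE setup, while the field names, ordering, and the `build_sample` helper are our own illustrative conventions.

```python
def build_sample(question: str, answer: str, hint: str | None = None, distractor: str | None = None) -> dict:
    """Assemble one NoisyInstruct-style training sample; tags and ordering here are a chosen convention."""
    parts = []
    if distractor:
        parts.append(distractor)                         # (...,D): noise the model should ignore
    if hint:
        parts.append(f"<reference>{hint}</reference>")   # (...,H): helpful span the model should use
    parts.append(f"Question: {question}")
    return {"prompt": "\n\n".join(parts), "answer": answer}

q, a = "Which gas do plants absorb for photosynthesis?", "carbon dioxide"
hint = "Photosynthesis takes in carbon dioxide and releases oxygen."
noise = "A recipe blog post about sourdough starters."

mixtures = {
    "(A|Q)":     build_sample(q, a),
    "(A|Q,H)":   build_sample(q, a, hint=hint),
    "(A|Q,D)":   build_sample(q, a, distractor=noise),
    "(A|Q,D,H)": build_sample(q, a, hint=hint, distractor=noise),
}
for name, sample in mixtures.items():
    print(name, "->", len(sample["prompt"]), "prompt characters")
```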
- Train with RL using two reward parts
Hook: Learning to bake better cakes requires more than judging taste; you should check the ingredients, too.
The Concept (Reinforcement Learning recap with OR+RARE):
- What it is: We optimize the model with RL where rewards come from two sources: final-answer correctness (Outcome-based Reward) and evidence-grounding (RARE).
- How it works:
- Generate multiple answers with chains of thought
- Score final correctness (OR)
- Check whether the chain cites/paraphrases the hidden helpful span (RARE)
- Update the policy to increase both correctness and proper grounding
- Why it matters: Without the RARE part, the model may get the right answer but for the wrong reasons and stay fragile to noise. Anchor: A math student earns points for both the right solution and for showing the correct steps that use the right theorem.
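To show how the two signals might combine during RL, here is a toy sketch with a group-normalized (GRPO-style) advantage. The `outcome_ok` and `evidence_ok` string checks and the equal weighting are assumptions; the paper uses task verifiers and an LLM judge rather than these stand-ins.

```python
def outcome_ok(completion: str, gold: str) -> bool:
    """Toy final-answer check (stand-in for a real outcome verifier)."""
    return completion.strip().endswith(gold)

def evidence_ok(completion: str, reference: str) -> bool:
    """Toy stand-in for the RARE judge: did the reasoning reuse the helpful span?"""
    return reference.lower() in completion.lower()

def combined_reward(completion: str, gold: str, reference: str, w_rare: float = 1.0) -> float:
    """Outcome reward plus a rationale-aware bonus; the paper's exact weighting is not assumed here."""
    return float(outcome_ok(completion, gold)) + w_rare * float(evidence_ok(completion, reference))

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style signal: how much better each rollout scored than its group's mean."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Example: two rollouts for the same noisy prompt, scored with both reward parts.
reference = "The treaty was signed to end the blockade."
rollouts = [
    "The blockade ended, so... The treaty was signed to end the blockade. Final answer: to end the blockade",
    "Looks like a tariff dispute. Final answer: to raise tariffs",
]
rewards = [combined_reward(c, gold="to end the blockade", reference=reference) for c in rollouts]
print(rewards, group_advantages(rewards))
```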
- Secret sauce: the RARE check
- What happens: Prompts include a hidden helpful <reference>...</reference> span. During training, the judge compares the model's cited or paraphrased content in its reasoning to this span. Matching earns a binary reward.
- Why it exists: It gently pulls attention and reasoning toward relevant tokens and away from distractors.
- Example with data: Suppose the question is about the cause of a historical treaty. The distractor is a look-alike article about a different treaty from the same era. The hidden reference contains two sentences about the actual cause. When the model quotes or closely paraphrases those sentences during its chain of thought, it earns the RARE reward.
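A minimal sketch of the RARE check itself. The `<reference>...</reference>` tag comes from the setup described above, but the word-overlap `coverage` proxy and its threshold are assumptions; the real pipeline asks an LLM judge whether the reasoning actually used the referenced evidence.

```python
import re

def extract_reference(prompt: str) -> str:
    """Pull the hidden helpful span out of the training prompt."""
    match = re.search(r"<reference>(.*?)</reference>", prompt, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

def rare_reward(chain_of_thought: str, prompt: str, threshold: float = 0.6) -> float:
    """Binary reward if the reasoning quotes or closely paraphrases the reference.
    Word overlap is only a cheap proxy for the LLM judge used in the real pipeline."""
    reference = extract_reference(prompt)
    if not reference:
        return 0.0
    ref_words = set(re.findall(r"\w+", reference.lower()))
    cot_words = set(re.findall(r"\w+", chain_of_thought.lower()))
    coverage = len(ref_words & cot_words) / max(len(ref_words), 1)
    return 1.0 if coverage >= threshold else 0.0

prompt = ("A long article about a different treaty from the same era...\n"
          "<reference>The treaty ended the naval blockade after two years of negotiation.</reference>\n"
          "Question: Why was the treaty signed?")
cot = "The reference says the treaty ended the naval blockade after two years of negotiation, so..."
print(rare_reward(cot, prompt))  # 1.0: the reasoning reused the helpful span
```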
- Agentic workflow evaluation
Hook: If your GPS trusts every road sign, even fake ones, you'll get lost faster.
The Concept (Agentic Workflow with Tools):
- What it is: A policy that decides when to call tools (search, calculator, retriever) and how to use their outputs.
- How it works:
- Plan steps
- Call a tool
- Read its output
- Decide next actions
- Why it matters: Without calibrated trust, tools can amplify noise. Anchor: A shopping assistant keeps re-querying a bad price API and gets more confused with every call.
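Below is a bare-bones plan-act-observe loop to show where tool outputs enter the context. The `search_tool`, `calculator_tool`, and the scripted `plan` are stand-ins: a real agent would let the model choose tools, and this is exactly the point where noisy outputs would need to be sanity-checked rather than trusted blindly.

```python
def search_tool(query: str) -> str:
    """Stub tool; a real agent would hit a search API and could get back noisy pages."""
    return f"(search results for: {query})"

def calculator_tool(expression: str) -> str:
    """Toy calculator; never eval untrusted input in a real system."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search": search_tool, "calc": calculator_tool}

def run_agent(task: str, plan: list[tuple[str, str]]) -> str:
    """Minimal plan-act-observe loop. `plan` is a scripted stand-in for the model's own decisions."""
    observations = []
    for tool_name, tool_input in plan:
        output = TOOLS[tool_name](tool_input)
        # Calibrated trust would go here: a robust agent sanity-checks `output`
        # before folding it into its context instead of re-querying on contaminated results.
        observations.append(f"{tool_name}({tool_input}) -> {output}")
    return f"Task: {task}\n" + "\n".join(observations)

print(run_agent("What is 12% of 250?", [("calc", "0.12 * 250")]))
```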
- Measuring robustness
- What happens: Evaluate on 11 datasets covering RAG (SealQA, MultihopRAG, Musique), reasoning (AIME25, BBEH-Mini, GPQA-Diamond), alignment (Self-Awareness, Survival-Instinct, BBQ), and tool use (TauBench Retail/Airline). Use pass@k or pass^k as in the original tasks.
- Why it exists: We need a broad and fair picture across skills.
- Example: On BBQ, we track how distractors change bias-related multiple-choice performance; on TauBench, we measure success rates of tool-using agents.
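For reference, the standard unbiased pass@k estimator and a simple pass^k (all k trials succeed) calculation are sketched below; how each benchmark applies them follows the original task definitions, not this snippet.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts is correct, given c correct completions out of n total samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent attempts succeed (used for agentic tasks)."""
    return p ** k

print(pass_at_k(n=10, c=3, k=1))   # 0.30: with 3/10 samples correct, one draw succeeds 30% of the time
print(pass_pow_k(p=0.7, k=4))      # ~0.24: a 70%-reliable agent rarely succeeds four times in a row
```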
- Behavioral analyses under noise
Hook: The longer you stare at the wrong clue, the more confident you feel, until you're not.
The Concept (Inverse Scaling at Test-Time Compute):
- What it is: In noisy settings, making chains of thought longer can reduce accuracy.
- How it works:
- Distractors pull attention
- Extra reasoning reinforces wrong paths
- Errors compound step by step
- Final accuracy drops
- Why it matters: Without caution, "think more" backfires when inputs are noisy. Anchor: Writing extra pages of wrong math doesn't fix the mistake; it buries it.
Hook: When you're unsure, your voice wobbles; models do that too.
The Concept (Entropy as Uncertainty):
- What it is: A measure of how unsure the model is while generating tokens.
- How it works:
- Compute token probabilities
- Higher spread → higher entropy
- Aggregate over key tokens
- Track how entropy changes with more distractors
- Why it matters: Rising entropy signals growing confusion. Anchor: With two distractors, the model's "voice" shakes more; entropy goes up.
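A tiny illustration of the measurement: compute the Shannon entropy of each next-token distribution and average over the steps of interest. The toy probability vectors below just show how flatter (more confused) distributions score higher; they are not data from the paper.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(step_distributions: list[list[float]]) -> float:
    """Average entropy over the generation steps we care about."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

confident_steps = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]        # clean context: sharp distributions
confused_steps  = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]       # noisy context: flatter distributions
print(round(mean_entropy(confident_steps), 3))  # lower entropy = more confident
print(round(mean_entropy(confused_steps), 3))   # higher entropy = more uncertain
```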
- Why naive fixes fall short
- Prompting and context engineering: Often remove some noise but also delete needed info, and can themselves be misled by noise.
- SFT: Risks catastrophic forgetting, weakening built-in resilience.
- Plain OR: Rewards final answers but fails to teach evidence filtering.
Putting it all together, the recipe is: realistic noise (NoisyBench) + practice with noise (NoisyInstruct) + a two-part reward that grades both the answer and the use of correct evidence (OR + RARE).
04 Experiments & Results
The test: The authors measured model accuracy across 11 datasets and four settings: ND (no distractor), RD (random document), RC (random chat), and HN (hard negative). They also checked how agentic workflows performed compared to their base models, and ran analyses on attention, uncertainty (entropy), and the effect of longer chains of thought.
The competition: They evaluated strong proprietary and open models: Gemini-2.5-Pro/Flash, DeepSeek-R1-0528, gpt-oss-120b, Qwen3-4B-Thinking, Qwen3-30B-A3B-Thinking, and DeepSeek-R1-Distill-Llama-8B. They also trained three open models with various methods: prompting, SFT, RL with outcome rewards (OR), and RL with OR+RARE.
The scoreboard (with context):
- Big drop under noise: All models lost ground when distractors appeared, with average accuracy drops ranging from about 9% to nearly 80% across tasks. That's like going from an A to an F when the classroom gets loud.
- Hard negatives hurt most: DeepSeek-R1-Distill-Llama-8B lost about 80.6% with HN, showing that tricky, similar-looking passages are especially dangerous.
- Even random noise triggers misalignment-like behavior: On BBQ (a bias/alignment test), Gemini-2.5-Pro fell from 94.0% to 60.5% with distractors; DeepSeek-R1-0528 dropped from 93.0% to 33.7%, a massive shift caused by non-adversarial noise.
- Agentic workflows turn fragile in noise: Agents that used tools did better in clean settings but worse in noisy ones. Over-trusting noisy tool outputs and re-calling tools on contaminated context amplified errors.
Surprising findings:
- Inverse scaling under noise: More test-time thinking (longer chains of thought) tended to make answers worse when distractors were present. The model spent more tokens following the wrong trail.
- Higher uncertainty (entropy): As the number of distractors increased, entropy rose steadily, signaling confusion and less confidence.
- Attention fixates on distractors when wrong: Incorrect predictions allocated disproportionately more attention mass to distractor tokens, confirming that models were literally looking at the wrong places.
Do common fixes help?
- Prompting and context engineering: Small or negative gains; can remove useful context or be misled by noise.
- SFT: Often harms performance via catastrophic forgetting, except in a narrow set of cases.
- RL with OR: Better than prompting/SFT, but still limited because it doesn't teach evidence filtering.
The big win: RARE
- Across models and distractor types, RL with OR+RARE consistently outperformed OR alone (and the other baselines), often by large margins. For example, on Qwen3-4B under RD, the average harmonic mean jumped by over 55% relative to baseline; under HN, improvements exceeded 36%.
- RARE reduced "distracted chains of thought" during training while increasing final-answer rewards, leading to higher accuracy by the end.
- Bonus: Training with RARE also transferred to clean settings (ND) with modest gains, suggesting that learning to filter noise does not harm clean performance and can even help.
Bottom line: When the world is noisy, testing must be noisy, and training must explicitly reward using the right evidence. That combination made models far more resilient than prior methods.
05 Discussion & Limitations
Limitations:
- Scope of models: The study targets explicit reasoning models and agentic systems; it does not systematically evaluate base or purely instruction-tuned models, nor multimodal settings.
- Reward model reliance: The RARE pipeline depends on an LLM judge to verify cited evidence; judge bias or errors could influence training.
- Data/programmatic overhead: Building NoisyBench and NoisyInstruct (with generation and filtering) requires careful engineering to avoid contamination and ensure quality.
- Not an architectural fix: RARE improves behavior but doesn't redesign attention to be noise-proof; long-term robustness may need architectural changes.
Required resources:
- Compute for RL (GRPO) training runs and inference with long contexts.
- Access to a capable judge model for verifying evidence use.
- Storage and orchestration for constructing and filtering distractors and hints.
When not to use:
- Tiny, closed-book tasks with minimal context, where noise is absent and RL overhead brings little benefit.
- Fully verifiable, symbolic pipelines where correctness can be guaranteed without language-model reasoning.
- Extremely short-horizon tasks where the model's built-in behavior is already robust and fast.
Open questions:
- Architecture-level noise filters: Can we design attention mechanisms that downweight distractor tokens automatically?
- Uncertainty calibration: How can we turn rising entropy into self-stopping, ask-for-clarification, or re-retrieval behaviors?
- Tool trust calibration: How can agents learn to question tool outputs and avoid contamination loops?
- Multi-distractor mixtures: What's the best policy when random chat, random docs, and hard negatives appear together?
- Generalization: How well do RARE-trained models transfer to entirely new domains and unseen distractor styles?
06 Conclusion & Future Work
Three-sentence summary: The paper shows that reasoning models look strong on clean tests but can fall apart, losing up to 80% of their accuracy, when realistic noise is added. It introduces NoisyBench to measure this robustness and RARE, a reward that pays models for using the right evidence during reasoning. Together with NoisyInstruct, these tools significantly improve resilience across tasks and models.
Main achievement: Turning "show your work using the right evidence" into a concrete training signal (RARE) that reliably boosts robustness in noisy contexts.
Future directions: Build architectures that inherently suppress distractor attention; calibrate uncertainty so models pause or ask for help under confusion; extend to multimodal and real-time agent settings; refine judge models and automatic reference generation. Explore workflows that actively clean or restructure context before reasoning.
Why remember this: Real-world AI must resist noise. By testing with realistic distractors and rewarding grounded reasoning steps, we move from models that only sound smart in silence to agents that stay accurate in the everyday bustle.
Practical Applications
- Build safer chat assistants that ignore irrelevant chat history and focus on the latest user goal.
- Improve RAG systems by training them to cite and rely on the most helpful retrieved snippets.
- Calibrate agent workflows to question tool outputs when the context looks noisy or conflicting.
- Deploy RARE-trained models in customer support so they avoid outdated or off-topic knowledge base articles.
- Enhance educational tutors to highlight the exact lines in passages that justify their answers.
- Use NoisyBench to stress-test enterprise AI before launch, revealing fragility to real-world clutter.
- Add uncertainty-aware behaviors (e.g., ask for clarification) when entropy rises with more distractors.
- Integrate context filters that move distractors above the question and summarize them before reasoning.
- Train search agents to downweight look-alike hard negatives that historically led to errors.
- Monitor attention patterns during evaluation to spot when models over-focus on distractor tokens.