Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
Key Summary
- The paper shows that when we give AI lots of extra text, even harmless extra text, it can get badly confused, sometimes losing up to 80% of its accuracy.
- The authors build NoisyBench, a new test that adds different kinds of "distractors" like random web pages, old chat messages, and tricky look-alike passages.
- They find that agent-style AI that uses tools (like search or calculators) becomes even more fragile because it over-trusts noisy tool outputs.
- Common fixes (better prompts, clever context engineering, and simple supervised fine-tuning) do not reliably help and can even make things worse.
- Reinforcement learning helps a bit, but the big win comes from a new reward called Rationale-Aware Reward (RARE) that rewards models for pointing to the right evidence.
- RARE encourages the model to find and quote the helpful parts inside the noise, which significantly boosts robustness across tasks and models.
- As noise increases, models tend to think longer, grow less confident (higher entropy), and pay too much attention to distractor tokens.
- More test-time computation (longer chains of thought) in noisy contexts can actually hurt performance: an inverse scaling effect.
- The authors also release NoisyInstruct, a training set with distractors and hints to teach models what to keep and what to ignore.
- Overall, the work offers a realistic benchmark, a practical training signal (RARE), and clear analyses that guide building sturdier reasoning agents.
Why This Research Matters
In real life, information is messy: web pages can be irrelevant, chat histories get long, and tools return wrong or partial outputs. This paper shows that such noise can quietly wreck the performance of even top reasoning models, which matters in places like healthcare, finance, and education. With NoisyBench, we finally have a clear way to see how models behave under realistic clutter, not just in quiet, clean tests. RARE then gives us a practical training knob: reward models for using the right evidence, not just for lucky final answers. That makes AI helpers more trustworthy when they face distractions, leading to better advice, safer decisions, and fewer confident mistakes. Over time, this approach can guide better system design, tool routing, and context curation for robust, real-world AI agents.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine trying to solve a math problem in a cafeteria while friends are talking, music is playing, and your phone keeps buzzing. Even if you know the math, the noise makes it hard to focus.
The Concept (Agentic AI):
- What it is: Agentic AI are models that don't just answer; they plan, use tools, and take steps to reach goals.
- How it works:
- Read the task
- Decide which tool to use (search, code, calculator)
- Execute tools and gather info
- Combine results to act or answer
- Why it matters: Without this, models can only guess in one shot and fail at multi-step, real-world tasks. Anchor: A travel agent AI books your flight: it checks dates, searches prices, applies rules, and confirms the ticket, step by step.
Hook: You know how you sometimes search your notes during homework because you can't remember every detail?
The Concept (RAG, Retrieval-Augmented Generation):
- What it is: RAG is when the model looks things up (retrieves documents) and then writes an answer using that info.
- How it works:
- Read the question
- Retrieve likely helpful documents
- Focus on the right parts
- Write a grounded answer
- Why it matters: Without RAG, the model must rely on memory and can easily hallucinate. Anchor: Asking, "Who discovered penicillin?" RAG fetches a short bio of Alexander Fleming and then answers correctly.
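To make the retrieve-then-answer loop concrete, here is a minimal Python sketch. It is not the paper's pipeline: the toy corpus, the bag-of-words `score` function, and the prompt format are illustrative assumptions standing in for a real retriever and generator.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def score(query: str, doc: str) -> float:
    """Cosine similarity over bag-of-words counts (a crude stand-in for a real retriever)."""
    q, d = tokenize(query), tokenize(doc)
    overlap = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return overlap / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents that look most relevant to the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

corpus = [
    "Alexander Fleming discovered penicillin in 1928 at St Mary's Hospital.",
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Penicillin is an antibiotic derived from Penicillium moulds.",
]
question = "Who discovered penicillin?"
context = retrieve(question, corpus)

# A real system would now hand `context` plus the question to a language model;
# here we only print the grounded prompt that such a system would build.
print("Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:")
```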
Hook: You know how a shiny, wrong clue in a mystery story can trick you?
The Concept (Contextual Distractors):
- What it is: Extra text in the input that looks useful but isn't, like random web pages, unrelated chat history, or tricky, similar-sounding passages.
- How it works:
- Appear near the real question
- Share keywords or style
- Steal the modelās attention
- Lead reasoning down the wrong path
- Why it matters: Without filtering, the model chases the wrong trail and answers incorrectly. Anchor: A math word problem plus a long paragraph about a similar but different situation tricks the model into using the wrong numbers.
Hook: Think about how your eyes jump to bold or important words when reading.
The Concept (Attention Mechanism):
- What it is: A model's way of deciding which tokens matter most right now.
- How it works:
- Compare each word to your current goal
- Give "importance scores"
- Read high-score words more closely
- Use them to decide the next token
- Why it matters: Without attention, the model treats "the" and "answer" as equally important. Anchor: Asked "What's the capital of France?", attention zeroes in on "capital" and "France," enabling "Paris."
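The "importance scores" above are, in a standard transformer, softmax-normalized query-key dot products. The sketch below shows generic scaled dot-product attention in NumPy; the shapes and random inputs are illustrative, not tied to any model in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight each value by how well its key matches the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # one importance score per (query, key) pair
    weights = softmax(scores, axis=-1)     # normalize scores into a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # the current token's query
K = rng.normal(size=(5, 8))   # keys for 5 context tokens
V = rng.normal(size=(5, 8))   # values for the same tokens

output, weights = attention(Q, K, V)
print(weights.round(3))  # high weights mark the tokens the model "reads more closely"
```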
Hook: Before a big play, teams practice what to include on the field and what to leave on the bench.
The Concept (Context Engineering):
- What it is: The craft of choosing and organizing what context to put in the model's prompt.
- How it works:
- Select possibly relevant info
- Trim or rephrase it
- Order it logically
- Present it in a consistent template
- Why it matters: Without it, the model sees too much clutter and misses key facts. Anchor: Building a "cheat sheet" that only contains the formulas you really need for the test.
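As a rough illustration of the select-trim-order-template steps, here is a naive context-curation sketch; the `relevance` scorer, the character budget, and the template are placeholder choices, not the specific methods evaluated in the paper.

```python
def relevance(question: str, snippet: str) -> float:
    """Crude keyword-overlap score; a real system would use a retriever or reranker."""
    q = set(question.lower().split())
    s = set(snippet.lower().split())
    return len(q & s) / max(len(q), 1)

def engineer_context(question: str, snippets: list[str], budget: int = 2, max_chars: int = 200) -> str:
    # 1) Select: keep only the snippets that look most relevant to the question.
    kept = sorted(snippets, key=lambda s: relevance(question, s), reverse=True)[:budget]
    # 2) Trim: cut each surviving snippet to a character budget.
    kept = [s[:max_chars] for s in kept]
    # 3) Order + 4) Template: present the survivors in a consistent format.
    lines = [f"[{i + 1}] {s}" for i, s in enumerate(kept)]
    return "Relevant notes:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

print(engineer_context("What is the capital of France?", [
    "Paris is the capital and largest city of France.",
    "A long chat about weekend plans and lunch orders.",
    "France borders Spain, Italy, and Germany.",
]))
```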
Hook: If a goalie only gets praise for the final score, not for good saves during the game, they might stop making smart moves in the middle.
The Concept (Reinforcement Learning, RL):
- What it is: A way to train models by giving rewards for better behavior.
- How it works:
- Model tries a behavior (answer or plan)
- Gets a reward score
- Learns to increase good moves
- Repeats to improve policy
- Why it matters: Without RL, models can't learn from feedback about multi-step processes. Anchor: A robot learns to navigate a maze faster by getting points for moving closer to the exit.
The world before: Many benchmarks were "clean": they only tested models with the exact needed info and no extra noise. In these settings, top models looked great. But real life is messy: tools return wrong pages, chat histories pile up, and unrelated snippets sneak into prompts. The problem: When researchers added distractors similar to real-life clutter, model accuracy crashed, sometimes by as much as 80%. Worse, the errors looked like "emergent misalignment": the model drifted from instructions even without any attack.
Hook: A teacher needs a fair test, not just the easiest one.
The Concept (Alignment/Misalignment):
- What it is: Alignment means the model follows intended goals and values; misalignment means it drifts or behaves undesirably.
- How it works:
- Get an instruction and context
- Keep the goal in mind
- Avoid unsafe or off-goal paths
- Produce helpful, faithful outputs
- Why it matters: Without alignment, the model can be confidently wrong or unsafe. Anchor: You ask for safe kitchen tips, but the model recommends dangerous shortcuts; that's misalignment.
Failed attempts: People tried prompting tricks and context engineering to steer attention. They also tried supervised fine-tuning (SFT) on noisy examples, but it often caused catastrophic forgetting: models forgot useful habits. Plain outcome-only RL helped a bit but didn't teach the model how to spot the right evidence. The gap: We needed (1) a realistic test that injects noise and (2) a training signal that rewards not just correct answers but correct use of evidence.
Real stakes: In healthcare, mixing a correct guideline with a misleading blog post can change the treatment suggestion. In finance, an irrelevant memo can steer a risk analysis wrong. For students, long chats and random links can cause "smart" helpers to answer with confidence but be wrong. Getting robust to noise isn't a bonus; it's table stakes for safety and trust.
02 Core Idea
Aha! Key insight in one sentence: To make reasoning agents robust in the real world, test them with realistic noise (NoisyBench) and train them to point to the right evidence inside that noise (RARE).
Multiple analogies:
- Lifeguard whistle: In a crowded, noisy pool, a lifeguard learns to scan for real distress signals; RARE teaches the model to lock onto the truly helpful lines in the sea of text.
- Treasure map with decoys: The beach is full of fake X's; RARE rewards the model for finding and citing the one real X.
- Science fair judging: Don't just grade the final answer; give points for showing the experiment and labeling the right data. RARE rewards grounded reasoning steps.
Before vs. After:
- Before: Clean benchmarks made models look strong; adding distractors revealed big weaknesses (accuracy drops of up to about 80%). Agentic workflows that use tools helped in clean settings but got more fragile with noise.
- After: With NoisyBench, we can see the true robustness picture. With RARE, models learn to identify and quote the useful parts, resisting distractors and lifting accuracy across tasks and model sizes.
Why it works (intuition): Outcome-only rewards tell you if the cake tastes good, but not whether you followed the recipe. In noisy kitchens (prompts), you must reward "using the correct ingredients" (citing helpful spans) during cooking (reasoning). RARE gives that mid-process feedback, nudging attention and thought toward relevant evidence and away from distractors.
Building blocks (with concept intros):
Hook: Testing a runner only on flat tracks hides how they handle hills.
The Concept (NoisyBench):
- What it is: A benchmark that adds realistic distractors (random documents, random chat history, and hard negatives) across 11 datasets spanning RAG, reasoning, alignment, and tool use.
- How it works:
- Start with a standard task
- Mix in one of several distractor types
- Ensure distractors donāt secretly help
- Measure performance drops meaningfully
- Why it matters: Without it, we overestimate model strength and miss real-world failure modes. Anchor: It's like giving a math test in a noisy cafeteria to see who can still focus.
Hook: Giving a gold star only for a right final answer can reward lucky guesses.
The Concept (Outcome-based Rewards, OR):
- What it is: RL signals that pay only for the final answer being correct.
- How it works:
- Model answers
- Judge correctness
- Reward if correct, else not
- Update policy toward lucky final outcomes
- Why it matters: Without process checks, models may learn shortcuts and ignore evidence. Anchor: A student gets an A for the right number, even if they copied it from a friend.
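A minimal sketch of an outcome-only reward, assuming (for illustration) that completions end with a "Final answer:" line; the paper's actual verifiers are task-specific.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Assume, for this sketch, that the model ends with a line like 'Final answer: ...'."""
    match = re.search(r"final answer\s*:\s*(.+)", completion, flags=re.IGNORECASE)
    if match:
        return match.group(1).strip()
    lines = completion.strip().splitlines()
    return lines[-1].strip() if lines else ""

def outcome_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the gold label, else 0.0 (no credit for the reasoning)."""
    return float(extract_final_answer(completion).lower() == gold.lower())

print(outcome_reward("Let me think...\nFinal answer: Paris", "Paris"))  # 1.0
print(outcome_reward("Let me think...\nFinal answer: Lyon", "Paris"))   # 0.0
```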
Hook: You know how teachers give points for showing work, not just answers?
The Concept (RARE, Rationale-Aware Reward):
- What it is: A reward that pays the model for identifying and using the right evidence inside noisy context.
- How it works:
- Include a hidden reference span that contains helpful info
- Model highlights/paraphrases that info during reasoning
- A judge checks match between modelās cited content and the reference
- Reward is given when the right evidence is used
- Why it matters: Without RARE, models don't learn to filter noise; with it, they ground their thinking. Anchor: Like awarding points for correctly citing the textbook section that supports your answer.
Hook: Trick questions in quizzes often look almost right to make you pick them.
The Concept (Hard Negative Distractors):
- What it is: Misleading, look-alike passages that feel relevant but don't help answer the question.
- How it works:
- Share surface features with the question
- Encourage plausible but wrong steps
- Increase cognitive load
- Cause confident mistakes
- Why it matters: These expose whether models truly understand or just match patterns. Anchor: A math problem with numbers that tempt you to use the wrong formula.
Hook: Solving a mystery often takes several clues, not just one.
The Concept (Multi-hop Reasoning):
- What it is: Connecting several pieces of info across steps to reach the answer.
- How it works:
- Find clue A
- Use A to find B
- Combine A and B
- Conclude logically
- Why it matters: Without it, models fail tasks needing combinations of facts. Anchor: To answer, "Which scientist's discovery led to X?", you link who did what, when, and why.
Put together, the core idea is simple but powerful: Stress-test with realistic noise (NoisyBench), and teach the model to tag and use the right bits inside that noise (RARE).
03 Methodology
High-level pipeline: Input (Question + Noisy Context) → Model proposes reasoning and cites helpful spans → Judge evaluates correctness and grounding → RL updates model → Output is a more robust, noise-aware reasoner.
Step-by-step details (with the crucial concepts introduced once using sandwiches):
- Build NoisyBench
- What happens: Each task instance gets one of four settings: ND (no distractor), RD (random document), RC (random chat history), HN (hard negative). Distractors are filtered so they don't leak the answer or change the correct label.
- Why it exists: Real life has noise; we need to know how models behave under it.
- Example: A GPQA science question is paired with a long, on-topic but irrelevant article (HN) about similar molecules that could mislead a model into the wrong pathway explanation.
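As a rough sketch of how one benchmark instance might be assembled, the snippet below builds a prompt for each of the four settings. The distractor pools, the leakage filter, and the choice to prepend the distractor are illustrative assumptions, not the paper's exact construction.

```python
import random

def leaks_answer(distractor: str, answer: str) -> bool:
    """Placeholder filter: reject distractors that accidentally contain the gold answer."""
    return answer.lower() in distractor.lower()

def make_instance(question: str, answer: str, setting: str,
                  random_docs: list[str], chat_histories: list[str], hard_negatives: list[str]) -> dict:
    pools = {"ND": [""], "RD": random_docs, "RC": chat_histories, "HN": hard_negatives}
    candidates = [d for d in pools[setting] if not leaks_answer(d, answer)]
    distractor = random.choice(candidates) if candidates else ""
    prompt = (distractor + "\n\n" + question).strip()  # noise is placed alongside the real task
    return {"setting": setting, "prompt": prompt, "answer": answer}

inst = make_instance(
    "What is 12 * 7?", "84", "HN",
    random_docs=["An article about medieval castles."],
    chat_histories=["user: see you at 6\nassistant: sounds good!"],
    hard_negatives=["A worked example computing 12 * 9 = 108 step by step."],
)
print(inst["prompt"])
```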
- Construct NoisyInstruct (training data)
Hook: Practicing with realistic scrimmages beats only practicing drills.
The Concept (NoisyInstruct):
- What it is: A training set that mixes questions with hints (helpful references) and distractors (random and hard negatives) in various combinations: (A|Q), (A|Q,H), (A|Q,D), (A|Q,D,H).
- How it works:
- Gather diverse tasks from a broad corpus
- Add distractors sampled or generated elsewhere to avoid contamination
- Create hints that are helpful but not answers
- Filter low-quality samples with checks
- Why it matters: Without such practice, models won't learn to ignore noise. Anchor: It's like scrimmaging with crowd noise so the team learns to communicate effectively.
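Here is a small sketch of how the four training mixtures could be assembled. The `<reference>` tag mirrors the paper's RARE setup, while the field names, ordering, and the `build_sample` helper are our own illustrative conventions.

```python
def build_sample(question: str, answer: str, hint: str | None = None, distractor: str | None = None) -> dict:
    """Assemble one NoisyInstruct-style training sample; tags and ordering here are a chosen convention."""
    parts = []
    if distractor:
        parts.append(distractor)                         # (...,D): noise the model should ignore
    if hint:
        parts.append(f"<reference>{hint}</reference>")   # (...,H): helpful span the model should use
    parts.append(f"Question: {question}")
    return {"prompt": "\n\n".join(parts), "answer": answer}

q, a = "Which gas do plants absorb for photosynthesis?", "carbon dioxide"
hint = "Photosynthesis takes in carbon dioxide and releases oxygen."
noise = "A recipe blog post about sourdough starters."

mixtures = {
    "(A|Q)":     build_sample(q, a),
    "(A|Q,H)":   build_sample(q, a, hint=hint),
    "(A|Q,D)":   build_sample(q, a, distractor=noise),
    "(A|Q,D,H)": build_sample(q, a, hint=hint, distractor=noise),
}
for name, sample in mixtures.items():
    print(name, "->", len(sample["prompt"]), "prompt characters")
```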
- Train with RL using two reward parts
Hook: Learning to bake better cakes requires more than judging taste; you should check the ingredients, too.
The Concept (Reinforcement Learning recap with OR+RARE):
- What it is: We optimize the model with RL where rewards come from two sources: final-answer correctness (Outcome-based Reward) and evidence-grounding (RARE).
- How it works:
- Generate multiple answers with chains of thought
- Score final correctness (OR)
- Check whether the chain cites/paraphrases the hidden helpful span (RARE)
- Update the policy to increase both correctness and proper grounding
- Why it matters: Without the RARE part, the model may get the right answer but for the wrong reasons and stay fragile to noise. Anchor: A math student earns points for both the right solution and for showing the correct steps that use the right theorem.
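To show how the two signals might combine during RL, here is a toy sketch with a group-normalized (GRPO-style) advantage. The `outcome_ok` and `evidence_ok` string checks and the equal weighting are assumptions; the paper uses task verifiers and an LLM judge rather than these stand-ins.

```python
def outcome_ok(completion: str, gold: str) -> bool:
    """Toy final-answer check (stand-in for a real outcome verifier)."""
    return completion.strip().endswith(gold)

def evidence_ok(completion: str, reference: str) -> bool:
    """Toy stand-in for the RARE judge: did the reasoning reuse the helpful span?"""
    return reference.lower() in completion.lower()

def combined_reward(completion: str, gold: str, reference: str, w_rare: float = 1.0) -> float:
    """Outcome reward plus a rationale-aware bonus; the paper's exact weighting is not assumed here."""
    return float(outcome_ok(completion, gold)) + w_rare * float(evidence_ok(completion, reference))

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style signal: how much better each rollout scored than its group's mean."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Example: two rollouts for the same noisy prompt, scored with both reward parts.
reference = "The treaty was signed to end the blockade."
rollouts = [
    "The blockade ended, so... The treaty was signed to end the blockade. Final answer: to end the blockade",
    "Looks like a tariff dispute. Final answer: to raise tariffs",
]
rewards = [combined_reward(c, gold="to end the blockade", reference=reference) for c in rollouts]
print(rewards, group_advantages(rewards))
```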
- Secret sauce: the RARE check
- What happens: Prompts include a hidden helpful <reference>...</reference> span. During training, the judge compares the model's cited or paraphrased content in its reasoning to this span. Matching earns a binary reward.
- Why it exists: It gently pulls attention and reasoning toward relevant tokens and away from distractors.
- Example with data: Suppose the question is about the cause of a historical treaty. The distractor is a look-alike article about a different treaty from the same era. The hidden reference contains two sentences about the actual cause. When the model quotes or closely paraphrases those sentences during its chain of thought, it earns the RARE reward.
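A minimal sketch of the RARE check itself. The `<reference>...</reference>` tag comes from the setup described above, but the word-overlap `coverage` proxy and its threshold are assumptions; the real pipeline asks an LLM judge whether the reasoning actually used the referenced evidence.

```python
import re

def extract_reference(prompt: str) -> str:
    """Pull the hidden helpful span out of the training prompt."""
    match = re.search(r"<reference>(.*?)</reference>", prompt, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

def rare_reward(chain_of_thought: str, prompt: str, threshold: float = 0.6) -> float:
    """Binary reward if the reasoning quotes or closely paraphrases the reference.
    Word overlap is only a cheap proxy for the LLM judge used in the real pipeline."""
    reference = extract_reference(prompt)
    if not reference:
        return 0.0
    ref_words = set(re.findall(r"\w+", reference.lower()))
    cot_words = set(re.findall(r"\w+", chain_of_thought.lower()))
    coverage = len(ref_words & cot_words) / max(len(ref_words), 1)
    return 1.0 if coverage >= threshold else 0.0

prompt = ("A long article about a different treaty from the same era...\n"
          "<reference>The treaty ended the naval blockade after two years of negotiation.</reference>\n"
          "Question: Why was the treaty signed?")
cot = "The reference says the treaty ended the naval blockade after two years of negotiation, so..."
print(rare_reward(cot, prompt))  # 1.0: the reasoning reused the helpful span
```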
- Agentic workflow evaluation
Hook: If your GPS trusts every road sign, even fake ones, you'll get lost faster.
The Concept (Agentic Workflow with Tools):
- What it is: A policy that decides when to call tools (search, calculator, retriever) and how to use their outputs.
- How it works:
- Plan steps
- Call a tool
- Read its output
- Decide next actions
- Why it matters: Without calibrated trust, tools can amplify noise. Anchor: A shopping assistant keeps re-querying a bad price API and gets more confused with every call.
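Below is a bare-bones plan-act-observe loop to show where tool outputs enter the context. The `search_tool`, `calculator_tool`, and the scripted `plan` are stand-ins: a real agent would let the model choose tools, and this is exactly the point where noisy outputs would need to be sanity-checked rather than trusted blindly.

```python
def search_tool(query: str) -> str:
    """Stub tool; a real agent would hit a search API and could get back noisy pages."""
    return f"(search results for: {query})"

def calculator_tool(expression: str) -> str:
    """Toy calculator; never eval untrusted input in a real system."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search": search_tool, "calc": calculator_tool}

def run_agent(task: str, plan: list[tuple[str, str]]) -> str:
    """Minimal plan-act-observe loop. `plan` is a scripted stand-in for the model's own decisions."""
    observations = []
    for tool_name, tool_input in plan:
        output = TOOLS[tool_name](tool_input)
        # Calibrated trust would go here: a robust agent sanity-checks `output`
        # before folding it into its context instead of re-querying on contaminated results.
        observations.append(f"{tool_name}({tool_input}) -> {output}")
    return f"Task: {task}\n" + "\n".join(observations)

print(run_agent("What is 12% of 250?", [("calc", "0.12 * 250")]))
```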
- Measuring robustness
- What happens: Evaluate on 11 datasets covering RAG (SealQA, MultihopRAG, Musique), reasoning (AIME25, BBEH-Mini, GPQA-Diamond), alignment (Self-Awareness, Survival-Instinct, BBQ), and tool use (TauBench Retail/Airline). Use pass@k or pass^k as in the original tasks.
- Why it exists: We need a broad and fair picture across skills.
- Example: On BBQ, we track how distractors change bias-related multiple-choice performance; on TauBench, we measure success rates of tool-using agents.
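For reference, the standard unbiased pass@k estimator and a simple pass^k (all k trials succeed) calculation are sketched below; how each benchmark applies them follows the original task definitions, not this snippet.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts is correct, given c correct completions out of n total samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent attempts succeed (used for agentic tasks)."""
    return p ** k

print(pass_at_k(n=10, c=3, k=1))   # 0.30: with 3/10 samples correct, one draw succeeds 30% of the time
print(pass_pow_k(p=0.7, k=4))      # ~0.24: a 70%-reliable agent rarely succeeds four times in a row
```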
- Behavioral analyses under noise
Hook: The longer you stare at the wrong clue, the more confident you feel, until you're not.
The Concept (Inverse Scaling at Test-Time Compute):
- What it is: In noisy settings, making chains of thought longer can reduce accuracy.
- How it works:
- Distractors pull attention
- Extra reasoning reinforces wrong paths
- Errors compound step by step
- Final accuracy drops
- Why it matters: Without caution, "think more" backfires when inputs are noisy. Anchor: Writing extra pages of wrong math doesn't fix the mistake; it buries it.
Hook: When you're unsure, your voice wobbles; models do that too.
The Concept (Entropy as Uncertainty):
- What it is: A measure of how unsure the model is while generating tokens.
- How it works:
- Compute token probabilities
- Higher spread → higher entropy
- Aggregate over key tokens
- Track how entropy changes with more distractors
- Why it matters: Rising entropy signals growing confusion. Anchor: With two distractors, the model's "voice" shakes more; entropy goes up.
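A tiny illustration of the measurement: compute the Shannon entropy of each next-token distribution and average over the steps of interest. The toy probability vectors below just show how flatter (more confused) distributions score higher; they are not data from the paper.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(step_distributions: list[list[float]]) -> float:
    """Average entropy over the generation steps we care about."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

confident_steps = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]        # clean context: sharp distributions
confused_steps  = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]       # noisy context: flatter distributions
print(round(mean_entropy(confident_steps), 3))  # lower entropy = more confident
print(round(mean_entropy(confused_steps), 3))   # higher entropy = more uncertain
```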
- Why naive fixes fall short
- Prompting and context engineering: Often remove some noise but also delete needed info, and can themselves be misled by noise.
- SFT: Risks catastrophic forgetting, weakening built-in resilience.
- Plain OR: Rewards final answers but fails to teach evidence filtering.
Putting it all together, the recipe is: realistic noise (NoisyBench) + practice with noise (NoisyInstruct) + a two-part reward that grades both the answer and the use of correct evidence (OR + RARE).
04 Experiments & Results
The test: The authors measured model accuracy across 11 datasets and four settings: ND (no distractor), RD (random document), RC (random chat), and HN (hard negative). They also checked how agentic workflows performed compared to their base models, and ran analyses on attention, uncertainty (entropy), and the effect of longer chains of thought.
The competition: They evaluated strong proprietary and open models: Gemini-2.5-Pro/Flash, DeepSeek-R1-0528, gpt-oss-120b, Qwen3-4B-Thinking, Qwen3-30B-A3B-Thinking, and DeepSeek-R1-Distill-Llama-8B. They also trained three open models with various methods: prompting, SFT, RL with outcome rewards (OR), and RL with OR+RARE.
The scoreboard (with context):
- Big drop under noise: All models lost ground when distractors appeared, with average accuracy drops ranging from about 9% to nearly 80% across tasks. That's like going from an A to an F when the classroom gets loud.
- Hard negatives hurt most: DeepSeek-R1-Distill-Llama-8B lost about 80.6% with HN, showing that tricky, similar-looking passages are especially dangerous.
- Even random noise triggers misalignment-like behavior: On BBQ (a bias/alignment test), Gemini-2.5-Pro fell from 94.0% to 60.5% with distractors; DeepSeek-R1-0528 dropped from 93.0% to 33.7%, a massive shift caused by non-adversarial noise.
- Agentic workflows turn fragile in noise: Agents that used tools did better in clean settings but worse in noisy ones. Over-trusting noisy tool outputs and re-calling tools on contaminated context amplified errors.
Surprising findings:
- Inverse scaling under noise: More test-time thinking (longer chains of thought) tended to make answers worse when distractors were present. The model spent more tokens following the wrong trail.
- Higher uncertainty (entropy): As the number of distractors increased, entropy rose steadily, signaling confusion and less confidence.
- Attention fixates on distractors when wrong: Incorrect predictions allocated disproportionately more attention mass to distractor tokens, confirming that models were literally looking at the wrong places.
Do common fixes help?
- Prompting and context engineering: Small or negative gains; can remove useful context or be misled by noise.
- SFT: Often harms performance via catastrophic forgetting, except in a narrow set of cases.
- RL with OR: Better than prompting/SFT, but still limited because it doesn't teach evidence filtering.
The big win: RARE
- Across models and distractor types, RL with OR+RARE consistently outperformed OR alone (and the other baselines), often by large margins. For example, on Qwen3-4B under RD, the average harmonic mean jumped by over 55% relative to baseline; under HN, improvements exceeded 36%.
- RARE reduced "distracted chains of thought" during training while increasing final-answer rewards, leading to higher accuracy by the end.
- Bonus: Training with RARE also transferred to clean settings (ND) with modest gains, suggesting that learning to filter noise does not harm clean performance and can even help.
Bottom line: When the world is noisy, testing must be noisy, and training must explicitly reward using the right evidence. That combination made models far more resilient than prior methods.
05 Discussion & Limitations
Limitations:
- Scope of models: The study targets explicit reasoning models and agentic systems; it does not systematically evaluate base or purely instruction-tuned models, nor multimodal settings.
- Reward model reliance: The RARE pipeline depends on an LLM judge to verify cited evidence; judge bias or errors could influence training.
- Data/programmatic overhead: Building NoisyBench and NoisyInstruct (with generation and filtering) requires careful engineering to avoid contamination and ensure quality.
- Not an architectural fix: RARE improves behavior but doesn't redesign attention to be noise-proof; long-term robustness may need architectural changes.
Required resources:
- Compute for RL (GRPO) training runs and inference with long contexts.
- Access to a capable judge model for verifying evidence use.
- Storage and orchestration for constructing and filtering distractors and hints.
When not to use:
- Tiny, closed-book tasks with minimal context, where noise is absent and RL overhead brings little benefit.
- Fully verifiable, symbolic pipelines where correctness can be guaranteed without language-model reasoning.
- Extremely short-horizon tasks where the model's built-in behavior is already robust and fast.
Open questions:
- Architecture-level noise filters: Can we design attention mechanisms that downweight distractor tokens automatically?
- Uncertainty calibration: How can we turn rising entropy into self-stopping, ask-for-clarification, or re-retrieval behaviors?
- Tool trust calibration: How can agents learn to question tool outputs and avoid contamination loops?
- Multi-distractor mixtures: What's the best policy when random chat, random docs, and hard negatives appear together?
- Generalization: How well do RARE-trained models transfer to entirely new domains and unseen distractor styles?
06 Conclusion & Future Work
Three-sentence summary: The paper shows that reasoning models look strong on clean tests but can fall apart, losing up to 80% of their accuracy, when realistic noise is added. It introduces NoisyBench to measure this robustness and RARE, a reward that pays models for using the right evidence during reasoning. Together with NoisyInstruct, these tools significantly improve resilience across tasks and models.
Main achievement: Turning "show your work using the right evidence" into a concrete training signal (RARE) that reliably boosts robustness in noisy contexts.
Future directions: Build architectures that inherently suppress distractor attention; calibrate uncertainty so models pause or ask for help under confusion; extend to multimodal and real-time agent settings; refine judge models and automatic reference generation. Explore workflows that actively clean or restructure context before reasoning.
Why remember this: Real-world AI must resist noise. By testing with realistic distractors and rewarding grounded reasoning steps, we move from models that only sound smart in silence to agents that stay accurate in the everyday bustle.
Practical Applications
- Build safer chat assistants that ignore irrelevant chat history and focus on the latest user goal.
- Improve RAG systems by training them to cite and rely on the most helpful retrieved snippets.
- Calibrate agent workflows to question tool outputs when the context looks noisy or conflicting.
- Deploy RARE-trained models in customer support so they avoid outdated or off-topic knowledge base articles.
- Enhance educational tutors to highlight the exact lines in passages that justify their answers.
- Use NoisyBench to stress-test enterprise AI before launch, revealing fragility to real-world clutter.
- Add uncertainty-aware behaviors (e.g., ask for clarification) when entropy rises with more distractors.
- Integrate context filters that move distractors above the question and summarize them before reasoning.
- Train search agents to downweight look-alike hard negatives that historically led to errors.
- Monitor attention patterns during evaluation to spot when models over-focus on distractor tokens.