Over-Searching in Search-Augmented Large Language Models

Intermediate
Roy Xie, Deepak Gopinath, David Qiu et al. · 1/9/2026
arXiv · PDF

Key Summary

  • The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.
  • They call this problem over-searching: searching past the point of benefit or when no search can help.
  • Search usually boosts accuracy on answerable questions but hurts the model’s ability to say “I don’t know” for unanswerable ones.
  • They introduce a new metric, Tokens Per Correctness (TPC), to measure how many tokens and searches are spent for each correct result.
  • Reasoning-style and deep-research systems over-search more, especially with noisy retrieval sources and in long, multi-turn chats.
  • What gets retrieved matters: negative evidence (documents stating something is unknown or contradictory) greatly improves abstention, but it’s rare on the web.
  • A new benchmark, OverSearchQA, balances answerable and unanswerable questions to test both answering and abstaining.
  • Mitigations like abstention-aware prompts, few-shot examples, and a self-evaluation step help, but they trade off accuracy or cost and don’t fully fix the root cause.
  • Retrieval quality changes behavior a lot: bad or mixed signals lead to more searching and higher TPC.
  • The main takeaway is that deciding smartly when to search and when to stop is as important as searching well.

Why This Research Matters

Real users pay for every token and API call, so unnecessary searches directly increase costs and latency. In safety-critical settings (health, finance, legal), answering when the truth is unknown can mislead people and cause harm. Product teams need to track not just accuracy but how efficiently it’s achieved, which TPC captures. Understanding that negative evidence improves abstention suggests new ways to design retrieval and indexing for safer systems. Multi-turn snowballing explains why chat histories sometimes make assistants push ahead when they should pause, guiding better conversation management. Overall, these insights help build assistants that are accurate, efficient, and humble enough to say “I don’t know.”

Detailed Explanation

01Background & Problem Definition

🍞 Hook: Imagine doing a school report. If you already know an answer from class, you don’t need to run to the library every time. And if the question is impossible (like “What’s the capital of the Moon?”), no amount of library time will help.

🥬 The Concept (Search-Augmented LLMs): What it is: A search-augmented LLM is a chatbot that can look things up online or in a database before answering. How it works: 1) Read the question. 2) Decide to search. 3) Pull in documents. 4) Use them to answer. Why it matters: Without search, the model can be outdated or wrong about facts; with search, it can be current and grounded. 🍞 Anchor: Asking “Who won the 2024 Olympics 100m?”—search helps fetch the exact result.
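
To make the four steps concrete, here is a minimal sketch of that read, decide, retrieve, answer loop. The `llm_generate` and `search_index` helpers are hypothetical placeholders for whatever model API and retriever a real system uses, and the prompts are illustrative rather than taken from the paper.

```python
# Minimal sketch of a search-augmented answer loop. `llm_generate` and
# `search_index` are hypothetical stand-ins for a model API and a retriever.

def answer_with_search(question: str, llm_generate, search_index, top_k: int = 5) -> str:
    # 1) Read the question and 2) let the model decide whether to search.
    decision = llm_generate(
        f"Question: {question}\n"
        "Reply SEARCH if you need to look this up, or ANSWER if you already know."
    )

    context = ""
    if "SEARCH" in decision.upper():
        # 3) Pull in the top-k documents for the query.
        docs = search_index(question, top_k=top_k)
        context = "\n\n".join(docs)

    # 4) Answer grounded in whatever was retrieved (possibly nothing).
    return llm_generate(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Answer using the context, and say \"I don't know\" if it cannot be answered."
    )
```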

The world before: LLMs got very good at language, but they could forget details or be out-of-date. Adding a search tool made them much better at knowledge-heavy tasks. Companies built powerful “deep research” agents that plan many searches, read lots of pages, and write careful reports.

The problem: Real questions aren’t always clean. Many are vague (“What’s the capital of Georgia?”—state or country?), based on false ideas (“How many eggs do tigers lay?”—they’re mammals!), or unknowable (“Who will be U.S. president in 2075?”). In these cases, the right move is to abstain—to say “I don’t know” or ask for clarification.

🍞 Hook: You know how some kids keep flipping through book after book even when the answer is either already in their head—or simply doesn’t exist? That’s overdoing it.

🥬 The Concept (Over-Searching): What it is: Over-searching is when the model keeps searching even when it won’t help or after it already has enough. How it works: 1) A model sees a tough or tricky question. 2) It calls search repeatedly. 3) It adds more and more text. 4) Benefit flattens while cost rises. Why it matters: This wastes time/money and can pull in misleading info, causing hallucinations. 🍞 Anchor: For “Who will be president in 2075?”, a smart base model abstains; a search-augmented model might burn tokens searching and then guess.

Failed attempts: Earlier research studied whether models can abstain—but mostly without tools. Other work trained models to reason longer and use tools more, improving accuracy on answerable tasks but often encouraging over-thinking and longer chains of tool calls. That helped correctness sometimes but ignored the cost and the special risk of noisy retrieval confusing abstention.

🍞 Hook: Think of a student who knows when to raise their hand—and when to say, “I’m not sure.” That judgment saves time and prevents wrong answers.

🥬 The Concept (Abstention Behavior): What it is: Abstention is when the model chooses not to answer directly and explains why. How it works: 1) Notice uncertainty, false premises, or missing info. 2) Say “I don’t know,” ask for details, or correct the premise. 3) Avoid committing to a wrong answer. Why it matters: Without abstention, models give confident but wrong answers to unanswerable questions. 🍞 Anchor: “What is the capital of the Moon?” → “I don’t know; the Moon has no capital.”

The gap: We lacked a clear way to measure the trade-off between being correct and the compute cost of all that searching. We also lacked a benchmark that fairly tests both answering and abstaining in a search setting with matched answerable and unanswerable questions.

Real stakes: In everyday apps—help desks, coding copilots, medical Q&A—extra searches cost money and time. Worse, mixing in misleading snippets can make the system answer when it should abstain, eroding trust. In business, paying for thousands of unnecessary searches adds up fast. In safety-critical settings, answering unanswerable questions can be harmful.

This paper’s story: It carefully measures over-searching across many models and situations, shows how retrieval quality and conversation history affect it, introduces a tool-aware efficiency metric, and provides a dedicated benchmark to drive better, more careful search behavior.

02Core Idea

🍞 Hook: You know how watering a plant helps it grow—but overwatering drowns it? Searching helps a model—until too much of it starts to hurt.

Aha in one sentence: More search is not always better; we must teach and measure when to search, when to stop, and when to abstain.

Three analogies:

  1. Chef analogy: Tasting the soup (one search) improves flavor; dumping in every spice (many searches) makes it worse.
  2. Detective analogy: Check key clues; rereading the entire city’s newspapers daily wastes time and may mislead.
  3. School analogy: Looking up one fact helps; copying whole encyclopedias into your notes buries the answer.

Before vs After: Before, teams added search and celebrated higher accuracy on answerable tasks. After, we see the hidden costs: models search even when it won’t help, grow less willing to abstain on impossible questions, and rack up compute bills. The paper reframes success: not just correctness, but correctness per unit of compute, and the wisdom to abstain.

🍞 Hook: Imagine paying per page you photocopy. If most pages don’t raise your grade, you’re wasting money.

🥬 The Concept (Tokens Per Correctness, TPC): What it is: TPC is “how many tokens and searches you spend for each correct outcome.” Lower is better. How it works: 1) Count model output tokens. 2) Count input/context tokens. 3) Count search calls. 4) Divide total cost by number of correct results. Why it matters: Without TPC, models can look “accurate” while silently burning lots of compute. 🍞 Anchor: Two systems both get 70 correct answers; one spends half the tokens and searches—its TPC is better.

Why it works (intuition, no equations): If extra searches don’t improve correctness—or even make abstention worse—then the cost per correct answer climbs. Rising TPC is a smoke alarm for over-searching. If you get the same outcomes with fewer tokens and fewer searches, TPC drops, signaling healthier behavior.

Building blocks of the idea:

  • 🍞 Hook: Like grading both right answers and smart skips on a test. 🥬 The Concept (Dual Accuracy): What it is: Two scores—answer accuracy on answerable questions and abstention accuracy on unanswerable ones. How it works: 1) Split questions into answerable vs unanswerable. 2) Check correct answers in the first set. 3) Check proper abstentions in the second. Why it matters: A single accuracy number hides if a model is guessing when it should abstain. 🍞 Anchor: “What’s 7×8?” should be answered; “Capital of the Moon?” should be abstained.
  • 🍞 Hook: Picture museum signs that say, “Artist unknown.” Those are valuable too: they tell you what isn’t known. 🥬 The Concept (Negative Evidence): What it is: Documents that explicitly indicate uncertainty, contradictions, or that the answer is unknown. How it works: 1) Retrieval returns documents. 2) If some say, “this is unknown,” they nudge the model to abstain. 3) If only “answer-like” snippets appear, the model is tempted to guess. Why it matters: Without negative evidence, models interpret “no answer yet” as “search more,” fueling over-searching. 🍞 Anchor: “Who discovered Atlantis?” A page stating “Atlantis is unverified” helps the model abstain.
  • 🍞 Hook: To practice a sport, you need the right kind of drills. 🥬 The Concept (OverSearchQA Benchmark): What it is: A balanced test set that pairs answerable with unanswerable questions that look very similar. How it works: 1) Curate unanswerable types: Answer Unknown, False Premise, Underspecified Context. 2) Find look-alike answerable matches. 3) Test with and without search. Why it matters: It fairly measures answering and abstaining under search, isolating over-search behavior. 🍞 Anchor: “What is the capital of Georgia?” (ambiguous) vs “What is the capital of the country of Georgia?” (answerable: Tbilisi).

What changes because of this idea: We stop rewarding models for blindly doing more work. Instead, we evaluate if each extra search actually improves outcomes—and whether the model knows when not to answer. That shift pushes designs toward smarter search decisions, cleaner retrieval, and prompts that embrace abstention.

Why this perspective is powerful: It aligns model behavior with real-world constraints (money, time, safety). It exposes the harm from noisy retrieval and long multi-turn histories that snowball into more searching. And it gives teams a practical, comparable metric (TPC) to guide improvements.

03Methodology

High-level recipe: Input question → (A) Decide and perform search calls → (B) Read retrieved documents and draft answer or abstain → (C) Judge correctness (answer or abstention) → (D) Compute costs and TPC → Output metrics and analysis.

Step-by-step details:

A) Search behavior evaluation

  • What happens: For each model, the system allows up to a fixed number of search calls. The same retrieval setup (e.g., Wikipedia or web) returns top-k chunks per call. Models run both with and without search to isolate the effect of retrieval. Why it exists: To see how much search helps on answerable items and hurts on unanswerable ones. What breaks without it: If you only test with search, you can’t tell whether the tool is helping or just adding noise and cost. Example: Q: “Who was U.S. President in 2010?” Without search, many models can answer “Barack Obama.” With search, they may still answer correctly—but spend tokens and a search call.
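
A minimal sketch of what this loop might look like, assuming a fixed search budget and a shared retriever. The `run_model` and `retrieve` callables are hypothetical, and the cap of 3 calls and top-k of 5 are illustrative choices, not the paper’s exact settings.

```python
# Sketch of the with-vs-without-search comparison under a fixed search budget.
# `run_model` and `retrieve` are hypothetical callables for a model step and a
# top-k retriever; the budget and top_k values are illustrative.

MAX_SEARCH_CALLS = 3
TOP_K = 5

def run_with_budget(question, run_model, retrieve):
    history, search_calls = [], 0
    while search_calls < MAX_SEARCH_CALLS:
        step = run_model(question, history, force_answer=False)  # model picks: search or answer
        if step["action"] != "search":
            return step["answer"], search_calls
        docs = retrieve(step["query"], top_k=TOP_K)  # same retrieval setup for every model
        history.append({"query": step["query"], "docs": docs})
        search_calls += 1
    # Budget exhausted: force a final answer from whatever was gathered.
    return run_model(question, history, force_answer=True)["answer"], search_calls

def compare_with_and_without_search(question, run_model, retrieve):
    closed_book = run_model(question, [], force_answer=True)["answer"]
    open_book, calls = run_with_budget(question, run_model, retrieve)
    return {"no_search": closed_book, "with_search": open_book, "search_calls": calls}
```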

B) Dual outcome scoring

  • What happens: The dataset is split: A (answerable) and U (unanswerable). For A, check answer accuracy; for U, check abstention accuracy. An LLM-as-judge compares outputs to gold answers or decides whether the response is an abstention (e.g., “I don’t know,” “the question is ambiguous,” or “premise is false”). Why it exists: A single metric can hide the fact that a model answers unanswerable questions. What breaks without it: You might celebrate high accuracy while the system confidently answers impossible questions. Example: Q: “How many eggs do tigers lay?” Correct behavior: explain tigers are mammals → abstention (no egg count).
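
A sketch of how dual-outcome scoring with an LLM-as-judge could be wired up. The `judge` callable and the prompt wording are assumptions for illustration, not the paper’s exact judge prompts.

```python
# Sketch of dual-accuracy scoring with an LLM judge. `judge` is a hypothetical
# callable wrapping whatever judge model is used; prompts are illustrative.

def score_example(example, response, judge) -> bool:
    if example["answerable"]:
        verdict = judge(
            f"Gold answer: {example['gold']}\nModel response: {response}\n"
            "Does the response give the same answer as the gold answer? Reply YES or NO."
        )
    else:
        verdict = judge(
            f"Model response: {response}\n"
            "Does the response abstain (say it does not know, flag ambiguity, "
            "or correct a false premise) rather than commit to an answer? Reply YES or NO."
        )
    return verdict.strip().upper().startswith("YES")

def dual_accuracy(examples, responses, judge):
    flags = [(e["answerable"], score_example(e, r, judge)) for e, r in zip(examples, responses)]
    answerable = [ok for ans, ok in flags if ans]
    unanswerable = [ok for ans, ok in flags if not ans]
    return {
        "answer_accuracy": sum(answerable) / max(len(answerable), 1),
        "abstention_accuracy": sum(unanswerable) / max(len(unanswerable), 1),
    }
```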

C) Cost accounting and TPC

  • What happens: Count (1) model output tokens, (2) input tokens (prompt + retrieved context), and (3) number of search calls. Convert these into a standardized cost and divide by the number of correct outcomes (correct answers on A + correct abstentions on U). Why it exists: To measure compute-per-correct instead of accuracy alone. What breaks without it: A model can look “better” by spending far more tokens and searches. Example: Two runs both get 600 correct; the one with half the tokens/searches has a lower TPC.
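
A sketch of the TPC bookkeeping under an assumed cost model. The relative weights for output tokens, input tokens, and search calls below are placeholders; the paper’s exact standardization is not reproduced here.

```python
# Sketch of Tokens Per Correctness (TPC). The per-unit weights are illustrative
# assumptions, not the paper's cost model.

OUTPUT_TOKEN_WEIGHT = 1.0   # assumed relative cost per output token
INPUT_TOKEN_WEIGHT = 0.25   # assumed relative cost per input/context token
SEARCH_CALL_WEIGHT = 100.0  # assumed flat cost per search API call

def tokens_per_correctness(runs):
    """runs: list of dicts with output_tokens, input_tokens, search_calls, correct (bool)."""
    total_cost = sum(
        r["output_tokens"] * OUTPUT_TOKEN_WEIGHT
        + r["input_tokens"] * INPUT_TOKEN_WEIGHT
        + r["search_calls"] * SEARCH_CALL_WEIGHT
        for r in runs
    )
    # "Correct" counts both right answers on answerable items and proper
    # abstentions on unanswerable ones.
    num_correct = sum(r["correct"] for r in runs)
    return total_cost / max(num_correct, 1)  # lower is better
```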

D) Retrieval quality analysis

  • What happens: Compare different corpora: fresh Wikipedia, old Wikipedia, a noisy web crawl (without Wikipedia), and live web search. Observe answer/abstention accuracy and TPC shifts. Why it exists: Not all retrieval is equal; noisy or mixed signals change behavior and cost. What breaks without it: You might think your model improved, when it just got lucky with cleaner sources—or vice versa. Example: On noisy corpora, models tend to search more and TPC skyrockets, even if accuracy barely moves.

E) Evidence composition (negative vs positive)

  • What happens: Classify retrieved documents as positive (supporting an answer) or negative (stating unknowns/contradictions). Group unanswerable questions by what they received and measure abstention. Why it exists: To see how the presence of negative evidence cues drives correct abstention. What breaks without it: You can’t explain why models over-answer unanswerables. Example: If only negative evidence is retrieved, abstention is near-perfect; but such docs are rare.
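
A sketch of one way to bucket questions by the evidence they received. The keyword heuristic is purely illustrative (a real pipeline would more likely use an LLM classifier), and the cue list is an assumption.

```python
# Sketch of tagging retrieved documents as positive vs. negative evidence and
# grouping a question by what its retrieval returned. Heuristic is illustrative.

NEGATIVE_CUES = ("unknown", "unverified", "no consensus", "disputed", "contradict")

def classify_evidence(doc_text: str) -> str:
    text = doc_text.lower()
    return "negative" if any(cue in text for cue in NEGATIVE_CUES) else "positive"

def group_by_evidence(retrieved_docs):
    labels = [classify_evidence(d) for d in retrieved_docs]
    if all(label == "negative" for label in labels):
        return "only_negative"   # abstention tends to be near-perfect in this bucket
    if any(label == "negative" for label in labels):
        return "mixed"
    return "only_positive"       # the model is most tempted to guess here
```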

F) Multi-turn conversation tests

  • What happens: Build conversations of 1–9 turns. Keep the final question the same; vary the earlier turns to be all unanswerable, all answerable, or mixed. Measure abstention and TPC as the history grows. Why it exists: Real users chat over many turns; habits can snowball. What breaks without it: You miss that early “answering streaks” bias the model to answer later—even when it should abstain. Example: After many answerable earlier turns, abstention degrades on the final unanswerable query, and TPC rises with length.
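
A sketch of how such multi-turn probes could be assembled, assuming pools of answerable and unanswerable questions; the turn counts and mixing scheme are illustrative rather than the paper’s exact construction.

```python
# Sketch of building multi-turn probes: the final (unanswerable) question stays
# fixed while the composition of earlier turns varies. Pools are hypothetical.

import random

def build_conversation(final_question, answerable_pool, unanswerable_pool,
                       n_history_turns, history_type):
    if history_type == "answerable":
        history = random.sample(answerable_pool, n_history_turns)
    elif history_type == "unanswerable":
        history = random.sample(unanswerable_pool, n_history_turns)
    else:  # "mixed"
        half = n_history_turns // 2
        history = random.sample(answerable_pool, half) + \
                  random.sample(unanswerable_pool, n_history_turns - half)
        random.shuffle(history)
    return history + [final_question]

# Conversations of length 1-9: vary the history size and measure abstention and
# TPC on the final turn as the history grows, e.g.
# convs = [build_conversation(q, A_pool, U_pool, n, "answerable") for n in range(9)]
```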

G) Mitigation strategies (training-free)

  • Query-level prompts:
    1. Abstention-aware: Remind the model that “I don’t know” is acceptable. Trade-off: Improves abstention; small accuracy or TPC changes depending on model.
    2. Few-shot: Include examples of when to abstain vs answer. Trade-off: Strong abstention gains; risk of over-abstaining and some accuracy drop.
    3. Self-evaluation: Insert a step to classify the question as ANSWERABLE vs ABSTAIN before attempting (see the sketch after this list). Trade-off: Balanced gains but adds reasoning cost and sometimes extra searches.
  • Retrieval-level augmentation: Add synthetic negative-evidence docs to the corpus (short, encyclopedic pages explaining why something is unknown). Trade-off: Modest improvement—these docs may rank low or get diluted by abundant “answer-like” pages.
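
A minimal sketch of the self-evaluation gate (mitigation 3 above): classify first, and only pay for search if the question looks answerable. The prompt text and the `llm` / `answer_with_search` helpers are hypothetical, not the paper’s exact implementation.

```python
# Sketch of a training-free self-evaluation gate. `llm` and `answer_with_search`
# are hypothetical callables; the prompt wording is illustrative.

GATE_PROMPT = (
    "Decide whether the question below can be answered with a factual answer.\n"
    "Reply ANSWERABLE if it can, or ABSTAIN if it is unknowable, rests on a false "
    "premise, or is too underspecified to answer.\n\nQuestion: {question}"
)

def gated_answer(question, llm, answer_with_search):
    verdict = llm(GATE_PROMPT.format(question=question)).strip().upper()
    if verdict.startswith("ABSTAIN"):
        # Skip retrieval entirely; explain the problem instead of searching and guessing.
        return llm(f"Briefly explain why this question cannot be answered as asked: {question}")
    return answer_with_search(question)  # only now pay for search calls
```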

🍞 Hook: Imagine choosing which library to visit, what shelves to pull from, and when to stop reading because you’ve got enough—or because the book you want doesn’t exist.

🥬 The Concept (Retrieval Quality): What it is: How trustworthy and focused your sources are. How it works: 1) High-quality sources (fresh Wikipedia) give consistent signals. 2) Noisy sources flood you with mixed or misleading bits. 3) Web search can be great for answers but tricky for abstentions. Why it matters: Noisy retrieval triggers extra searches and higher TPC. 🍞 Anchor: Using a rumor-filled blog to verify facts often leads to more tab-hopping and confusion.

🍞 Hook: In a snowball fight, once the ball starts rolling, it grows bigger with every turn down the hill.

🥬 The Concept (Snowballing in Multi-turn): What it is: Past turns bias what happens next—answering streaks encourage more answering; abstentions can normalize saying “I don’t know.” How it works: 1) Model answers across turns. 2) Patterns set expectations. 3) Search and token costs accumulate. Why it matters: Long chats can magnify over-searching and reduce abstention. 🍞 Anchor: After five clear, answerable questions, the model is “in an answering mood,” so it may wrongly answer the sixth, which is unanswerable.

Throughout, the evaluation uses a consistent judge and cost model so comparisons are fair. The key is treating “not answering” as a first-class success when that’s the right thing—and charging fairly for every token and search to reveal true efficiency.

04Experiments & Results

The test: Measure three things—(1) answer accuracy on answerable questions, (2) abstention accuracy on unanswerable questions, and (3) TPC to capture efficiency. Compare many models, with and without search, across different retrieval sources and in multi-turn settings. Use OverSearchQA so that answerable and unanswerable items are similar in style and length, making the test fair.

The competition: Models include GPT-4o-mini, o4-mini, Kimi-K2, Qwen3-235B variants, Llama 3 family, and Mistral-Small-24B. Configurations range from base chat to reasoning style to deep-research agents with more complex, multi-round tool use.

Scoreboard with context:

  • Search helps answerable accuracy substantially but harms abstention on unanswerables. Think of it like getting better at math questions you can solve, but worse at recognizing trick questions you should skip. On average, answer accuracy rises notably (like a class average climbing from a B- to a solid B+ or A-), while abstention drops (more kids guess on trick items). This trade-off is the signature of over-searching.
  • TPC climbs with more search turns. Early searches help, but returns fade quickly. After a few searches, extra calls pile on cost with little or no gain, so TPC keeps rising—like paying more for the same grade.
  • Reasoning and deep research amplify over-searching. Within the same family, base < reasoning < deep-research in answer accuracy—but abstention falls across those settings, and TPC can explode. One deep-research setup reached an enormous TPC (tens of thousands), like using a bulldozer to plant a tulip.
  • Retrieval quality is a big lever. Fresh Wikipedia gives decent balance. Noisy corpora make models search more, sending TPC way up (several times worse). Web search often finds answers well but mixes signals, which can hurt abstention. Imagine a library where some shelves are pristine and others are random clippings: you’re tempted to keep hunting, and the bill grows.
  • Evidence composition predicts abstention. When retrieved docs are mostly negative evidence (“this is unknown/contradictory”), abstention is near-perfect. But such documents are rare in the wild, which biases systems toward answering. This asymmetry (webs full of “answers,” few “we don’t knows”) is a root cause of over-searching and over-answering.
  • Multi-turn snowballing is real. In conversations, if earlier turns are answerable, the model grows more likely to attempt answers later—even when it should abstain—while TPC steadily rises with length. If earlier turns are unanswerable, abstention can slightly improve, showing that history shapes habits.

Surprising or notable findings:

  • A little noise can oddly help abstention by failing to provide convincing “answer-like” snippets, but it massively raises TPC—so it’s not a good strategy.
  • Few-shot prompting can strongly boost abstention, but sometimes at the cost of answering when it should, leading to over-abstention.
  • Self-evaluation balances abstention and accuracy better than few-shot in some cases but may increase TPC because it adds an extra reasoning step (and sometimes extra searches) before answering.

Plain-language comparisons:

  • Think of accuracy as your grade and TPC as the dollars you spent on tutoring and books. A model that goes from 85% to 87% by tripling costs is not a bargain. That’s what rising TPC reveals.
  • On trick questions, the best students know when to say, “This is a trick; I’ll pass.” Models with search tend to say, “Let me search more,” and then guess anyway.

Bottom line: Search is powerful on answerable items but pushes many systems to over-search and over-answer on unanswerables. The TPC lens shows when extra searching stops paying off—and how retrieval quality and chat history can nudge behavior in costly directions.

05Discussion & Limitations

Limitations:

  • The study emphasizes measurement and analysis over training new models or changing architectures. It shows where pain points are but doesn’t fully solve them. Promising fixes—like training models to detect unanswerability, adding explicit “uncertainty heads,” or ranking negative evidence higher—remain for future work.
  • OverSearchQA is carefully curated but not drawn from real user logs. Real-world distributions may have different types of ambiguity or phrasing. Still, the pairing and balancing make it excellent for controlled evaluation.
  • Mitigations tested are training-free. They help, but modestly, and sometimes at a cost (e.g., over-abstaining or higher TPC). Deeper changes likely need alignment or post-training objectives that reward efficient abstention-aware behavior.

Required resources:

  • Tool-augmented evaluation requires a retrieval stack (indexing, chunking, embedding retriever), access to APIs (for search and models), and a judge model. Compute costs can be non-trivial, especially for multi-turn and ablations.

When not to use search-heavy setups:

  • If questions are often unanswerable, underspecified, or false-premised, defaulting to multi-round search is counterproductive. Lightweight prompting plus a strong abstention policy may be superior.
  • In latency- or budget-sensitive settings, unconstrained deep-research loops can be overkill; cap search turns or use a gating step first.

Open questions:

  • How do we teach models calibrated stopping rules—knowing when the first search is enough and when no search can help? Can reinforcement signals penalize needless searches and reward correct abstention?
  • How can retrieval pipelines surface negative evidence more reliably without polluting results? Are there ranking or indexing tricks to expose “unknowns” earlier?
  • Can we design multi-turn memory that dampens snowballing—e.g., a “reset to neutral” step or explicit uncertainty summaries between turns?
  • What is the best unified metric combining accuracy, abstention, latency, and dollars—building on TPC but adaptable to diverse deployments?

06Conclusion & Future Work

Three-sentence summary: This paper shows that search-augmented language models often over-search, raising costs and making them worse at abstaining on unanswerable questions. It introduces Tokens Per Correctness (TPC) to quantify compute-per-correct result and provides OverSearchQA to evaluate both answering and abstaining fairly. Experiments across models, retrieval sources, and multi-turn chats reveal when and why over-searching happens and test practical—but partial—mitigations.

Main achievement: A clear, tool-aware framework—Dual Accuracy + TPC + OverSearchQA—that exposes the hidden costs of search and the fragile nature of abstention under noisy or excessive retrieval.

Future directions: Train models with explicit abstention rewards and search-stopping penalties; build retrieval that elevates negative evidence; add uncertainty-aware planners that decide when to search and when to stop; and create multi-turn designs that avoid snowballing. Extending TPC to other tools (browsers, code execution, databases) can generalize efficiency tracking.

Why remember this: It reframes progress from “answer more” to “answer well, spend less, and abstain wisely.” In real products, that trio—accuracy, efficiency, and humility—builds trust, saves money, and keeps users safe.

Practical Applications

  • Add a self-evaluation gate that first classifies a query as ANSWERABLE or ABSTAIN to curb unnecessary searches.
  • Use abstention-aware and few-shot prompting that explicitly rewards saying “I don’t know” on unanswerables.
  • Set a dynamic search budget (e.g., cap at 1–3 searches) and stop early when marginal gains flatten.
  • Monitor TPC in production dashboards to catch over-searching regressions after updates (see the sketch after this list).
  • Prioritize high-quality retrieval sources and filter noisy ones; A/B test retrieval stacks for TPC, not just accuracy.
  • Index or up-rank negative-evidence documents (e.g., “unknown,” “no consensus”) for known-unknown domains.
  • Introduce a multi-turn “reset to neutral” policy to reduce snowballing from earlier answerable turns.
  • Teach the model to request clarification on underspecified questions before any search.
  • Penalize fruitless extra searches in reinforcement learning objectives; reward correct abstentions.
  • Run dual-accuracy and TPC evaluations during model selection to balance correctness and cost.
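
A small sketch combining two of the bullets above, a hard search budget plus a rolling TPC monitor for dashboards. The thresholds and the `log_metric` hook are placeholders to adapt to your own stack.

```python
# Sketch of production guardrails: a per-query search cap and a rolling TPC
# monitor. Constants and the `log_metric` callback are illustrative placeholders.

SEARCH_BUDGET = 3           # hard cap on search calls per query (illustrative)
TPC_ALERT_THRESHOLD = 5000  # alert if rolling TPC drifts above this (illustrative)

def within_budget(search_calls: int) -> bool:
    # Gate each additional search call against the hard budget.
    return search_calls < SEARCH_BUDGET

class TPCMonitor:
    """Tracks a rolling Tokens-Per-Correctness estimate for a dashboard."""

    def __init__(self):
        self.total_cost = 0.0
        self.num_correct = 0

    def record(self, cost: float, correct: bool):
        # `cost` is the standardized per-query cost (tokens + search calls).
        self.total_cost += cost
        self.num_correct += int(correct)

    @property
    def tpc(self) -> float:
        return self.total_cost / max(self.num_correct, 1)

    def check(self, log_metric):
        log_metric("tpc", self.tpc)
        if self.tpc > TPC_ALERT_THRESHOLD:
            log_metric("tpc_regression_alert", 1)  # possible over-searching regression
```
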
#search-augmented LLMs#over-searching#abstention#dual accuracy#tokens per correctness#retrieval quality#negative evidence#deep research agents#multi-turn conversations#RAG evaluation#LLM-as-judge#cost-efficiency#noisy retrieval#benchmarking#tool-use efficiency