
DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Beginner
Nikita Gupta, Riju Chatterjee, Lukas Haas et al. Ā· 1/28/2026
arXiv Ā· PDF

Key Summary

  • DeepSearchQA is a new test with 900 real-world-style questions that checks if AI agents can find complete lists of answers, not just one fact.
  • It focuses on three hard skills: gathering pieces from many sources (systematic collation), cleaning up duplicates (entity resolution), and knowing when to stop searching (stopping criteria).
  • Each task works like a chain of steps where the next clue depends on the previous one, so the agent must plan and remember well.
  • Answers are judged only by the final set the agent submits, using precision, recall, and F1-score to balance being complete and being correct.
  • Top agents like Gemini Deep Research Agent and GPT-5 Pro High Reasoning do best, but they still miss items or add extras, showing a ā€˜Last Mile’ gap.
  • Smaller or cheaper models fail much more often, proving that deep research needs strong reasoning and multi-step planning, not just quick search.
  • Sampling multiple runs boosts success a lot (from about 67% to nearly 86% fully correct when sampling eight times), showing test-time compute helps.
  • A careful LLM judge checks whether items match semantically, so different wordings of the same thing still count as correct.
  • The benchmark uses verified, time-anchored questions to keep grading stable even though the web changes.
  • DeepSearchQA aims to push agents from ā€˜answering a question’ toward ā€˜mastering a topic’ by rewarding exhaustive, clean lists.

Why This Research Matters

In real life, people need complete, trustworthy lists—like all qualifying medications, every eligible city for a move, or all safety records that meet a standard. DeepSearchQA makes agents practice and prove those skills, not just show off a single fact. This helps move AI from being a quick answer machine to a careful research partner you can rely on for big decisions. It also exposes where agents stumble—missing rare items, adding extras, or stopping too soon—so designers know what to fix. Over time, this will make AI better at serving journalists, analysts, scientists, and everyday users who need full coverage, not half the story.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re doing a school project about planets. If you only answer, ā€œEarth is a planet,ā€ your teacher will say, ā€œGreat, but where’s the full list?ā€ Real projects often need complete sets, not just one fun fact.

🄬 Filling (The Actual Concept)

  • What it is: Before this paper, most AI tests checked if an AI could find one correct answer to a question, like a capital city or a date, not a full set.
  • How it works:
    1. Benchmarks asked questions with a single, definite answer.
    2. Grading was easy: either the answer matched or it didn’t.
    3. This made tests cheap and quick but didn’t reflect real research.
  • Why it matters: Many real tasks need exhaustive answers (all items that match rules). Without testing for completeness, AIs can look smart while missing most of what people actually need.

šŸž Bottom Bread (Anchor): It’s like grading a grocery helper who brings only ā€œmilkā€ when the list says milk, eggs, bread, apples. They passed the ā€˜find milk’ test but failed your real shopping.

šŸž Top Bread (Hook): You know how some homework problems need several steps—like measure, then multiply, then compare—before you can answer? Many online questions are like that too.

🄬 Filling (Multi-step Information-seeking Tasks)

  • What it is: Questions that require several linked searches and filters to reach the final result.
  • How it works:
    1. Start with a broad group (e.g., all cities).
    2. Apply constraint A (e.g., house price under £200k).
    3. From what remains, apply constraint B (e.g., top 5 by green space).
    4. Keep chaining filters until you get the final list.
  • Why it matters: If an agent skips or confuses a step, the final list is wrong (a small code sketch of this kind of filter chain follows below).

šŸž Bottom Bread (Anchor): Like finding a book in a big library: first the floor, then the shelf, then the author, then the title.

šŸž Top Bread (Hook): When you and your friends search different websites for a group project, someone must gather and organize everyone’s notes into one master list.

🄬 Filling (Systematic Collation)

  • What it is: Carefully gathering pieces from many sources and organizing them into one complete, non-overlapping list.
  • How it works:
    1. Visit multiple trustworthy sources.
    2. Extract candidates that might match.
    3. Combine them into a single organized set.
  • Why it matters: No single page usually has everything; without collation, the list stays incomplete (see the small sketch below).

šŸž Bottom Bread (Anchor): Like collecting all puzzle pieces from different boxes and laying them out to see the whole picture.

šŸž Top Bread (Hook): Have you ever met someone called ā€œLizā€ and later realized her full name is ā€œElizabethā€? Same person, different names.

🄬 Filling (Entity Resolution)

  • What it is: Figuring out when two names or forms actually refer to the same thing.
  • How it works:
    1. Compare names, spellings, and context.
    2. Check details (like IDs, locations, or official info).
    3. Merge duplicates into a single clean entry.
  • Why it matters: Without this, the list gets bloated and precision drops because the same thing appears multiple times (a small deduplication sketch follows below).

šŸž Bottom Bread (Anchor): Recognizing that ā€œPS2ā€ and ā€œPlayStation 2ā€ are the same console avoids double-counting.

šŸž Top Bread (Hook): When you’re on a scavenger hunt, how do you know when you’ve found everything on the checklist?

🄬 Filling (Stopping Criteria)

  • What it is: Rules an agent uses to decide it has searched enough and can stop.
  • How it works:
    1. Estimate how many items should exist based on sources.
    2. Check if any new searches are still finding fresh, valid items.
    3. Stop when likely complete, not just tired.
  • Why it matters: Stopping too soon misses items (low recall); searching forever adds noise (low precision). A simple stopping-rule sketch follows below.

šŸž Bottom Bread (Anchor): You stop looking for Easter eggs when baskets from all rooms match the expected total.

šŸž Top Bread (Hook): Teachers grade your test by what you wrote at the end, not the doodles you made while thinking.

🄬 Filling (Outcome-Based Evaluation)

  • What it is: Scoring the final answer set only—did you include all the right items and avoid wrong ones?
  • How it works:
    1. Compare the agent’s submitted set to the ground truth set.
    2. Count what matches (correct) and what doesn’t (errors).
    3. Calculate precision, recall, and F1.
  • Why it matters: It rewards complete and clean answers, no matter how the agent searched.

šŸž Bottom Bread (Anchor): Like checking your finished grocery bag against the shopping list: everything needed is there, and nothing extra.

Finally, the problem this paper targets is the Comprehensiveness Gap: AI agents can often find a few items but struggle to produce the full, correct list while knowing when to stop. DeepSearchQA was built to measure—and close—this gap on realistic, high-value web tasks across many domains.

02Core Idea

šŸž Top Bread (Hook): Imagine a treasure hunt where winning requires finding every single coin, not just one shiny one—and you lose points if you also bring back bottle caps.

🄬 Filling (DeepSearchQA, the Core Idea)

  • What it is: A benchmark that tests whether AI agents can produce complete and correct lists by doing multi-step web research.
  • How it works:
    1. Give the agent a realistic, chained prompt (each step depends on the last).
    2. The agent browses, gathers candidates, removes duplicates, checks constraints, and decides when to stop.
    3. We grade only the final set using precision, recall, and F1.
  • Why it matters: It forces agents to balance being thorough (recall) with being accurate (precision), like real research.

šŸž Bottom Bread (Anchor): It’s like finishing a scavenger hunt with exactly the full checklist—no missing items and no extras.

Three analogies for the same idea:

  • Puzzle analogy: You don’t pass by finding one corner piece; you must complete the whole picture without forcing in wrong pieces.
  • Grocery trip analogy: Visit multiple stores, gather all items on the list, don’t buy duplicates, and don’t add candy that wasn’t requested.
  • Library quest analogy: Follow a chain of clues across catalogs and archives, collect every required document, and stop when the bibliography is complete.

šŸž Top Bread (Hook): You know how directions sometimes say, ā€œFirst turn left, then cross the bridge, then look for the red houseā€? Skip a step and you get lost.

🄬 Filling (Causal Chain Tasks)

  • What it is: Tasks where each step’s result is needed for the next.
  • How it works:
    1. Solve Step A (e.g., find qualifying cities by price).
    2. Use A’s output to do Step B (e.g., green space ranks for those cities).
    3. Continue until the final filtered set is reached.
  • Why it matters: Tests long-horizon planning and memory; if you lose earlier facts, the later answer breaks.

šŸž Bottom Bread (Anchor): Like dominoes—if the first one doesn’t fall correctly, the chain won’t complete.

šŸž Top Bread (Hook): Casting a net catches more fish, but you might also haul in seaweed. Being picky gets a clean catch but may miss some fish.

🄬 Filling (Recall and Precision Trade-off + F1)

  • What it is: Recall is how much of the true set you found; precision is how clean (error-free) your set is; F1 balances both.
  • How it works:
    1. Gather candidates (boost recall).
    2. Verify and filter (boost precision).
    3. Use F1 to balance both so you don’t over- or under-shoot.
  • Why it matters: Real research needs both completeness and correctness; a small metric-computation sketch follows below.

šŸž Bottom Bread (Anchor): A perfect sticker collection means you have every sticker (high recall) and no duplicates or wrong ones (high precision); F1 is the overall score.

šŸž Top Bread (Hook): Referees judge the finished play, not how the team drew the plan on the whiteboard.

🄬 Filling (Outcome-Based Evaluation)

  • What it is: Grading agents only on the final set they submit.
  • How it works:
    1. Compare the agent’s set to the ground truth set.
    2. Use an LLM judge to match items that are worded differently but mean the same thing.
    3. Score precision, recall, F1, and categories like Fully Correct or with Extraneous Answers.
  • Why it matters: Encourages any effective strategy that ends with complete, clean answers.

šŸž Bottom Bread (Anchor): If your final report has the right list, you get the points—even if your scratch paper looks messy.

Before vs. After:

  • Before: Benchmarks mostly rewarded finding a single correct answer quickly.
  • After: DeepSearchQA rewards producing the entire correct list and penalizes missing items or padding with guesses.

Why it works (intuition):

  • Agents can’t ā€˜game’ the test by blurting lots of guesses (precision drops) or by being overly cautious (recall drops). The only winning path is careful planning, multi-source synthesis, deduplication, and confident stopping.

Building blocks:

  • Diverse, time-anchored prompts spanning 17 fields.
  • Two answer styles (single-answer and set-answer with enumerations/composites).
  • A strict, three-phase human verification protocol to fix ambiguous or wrong ground truths.
  • Metrics: precision, recall, F1, and categorical outcomes to diagnose failure modes.
  • An LLM judge to recognize semantic matches (e.g., synonyms, alternate names).

03Methodology

High-level pipeline: Prompt → Plan multi-step search → Browse and extract candidates → Systematic collation → Entity resolution (de-dup) → Constraint verification → Decide stopping criteria → Submit final set → Automated judging → Scores.

Step-by-step:

  1. Understand the prompt and plan
  • What happens: The agent reads a chained prompt (e.g., filter by price, then rank by green space, then filter by employment, then remove clean-air-zone cities) and drafts a search plan.
  • Why it exists: Without a plan, the agent may wander, miss steps, or mix constraints.
  • Example: ā€œRelocation Plannerā€ splits into four filters in the right order.
  2. Browse and extract candidates
  • What happens: The agent visits multiple sources, issues queries, and scrapes potential items with supporting facts (names, numbers, dates, qualifiers).
  • Why it exists: No single source has all answers; multi-source checking avoids blind spots.
  • Example: For a vaccination-study task, the agent pulls World Bank tables for population, life expectancy, and immunization rates.
  3. Systematic collation
  • What happens: The agent merges candidates into a master set, tracking where each came from and what evidence supports it.
  • Why it exists: Prevents missing rare items hidden on niche pages and organizes disparate facts.
  • Example: Combining a list of consoles from company pages with sales figures from press releases.
  4. Entity resolution (de-duplication)
  • What happens: The agent detects when different names refer to the same entity and merges them.
  • Why it exists: Avoids inflated lists and wrong counts that lower precision.
  • Example: ā€œPS2ā€ and ā€œPlayStation 2ā€ collapse to one entry.
  5. Constraint verification
  • What happens: The agent applies all numeric and logical rules to each candidate (e.g., ā€˜life expectancy ≄ 75’ and ā€˜vaccination ≄ 85%’ in 2023 only).
  • Why it exists: Prevents including items that look close but don’t meet the rules.
  • Example: A country with missing 2023 immunization data is excluded.
  6. Stopping criteria
  • What happens: The agent estimates completeness (e.g., by coverage of trusted sources or diminishing returns) and decides to stop.
  • Why it exists: Stopping too early misses the long tail; searching forever invites errors.
  • Example: After cross-checking multiple official tables and seeing no new valid items appear, the agent stops.
  7. Submit final set and justification
  • What happens: The agent outputs just the items (and, if allowed, brief citations)—order usually doesn’t matter.
  • Why it exists: The benchmark scores the final set only, not the journey.
  • Example: ā€œComma-separated countries meeting all three criteria.ā€
  8. Automated evaluation (LLM-as-judge)
  • What happens: For each submitted item, the judge (Gemini 2.5 Flash, zero-shot) checks semantic equivalence against ground-truth items.
  • Why it exists: Handles synonyms and formatting variations fairly.
  • Example: ā€˜United States’ ā‰ˆ ā€˜U.S.’ if the ground truth says so.
  9. Scoring and categorization
  • What happens:
    • Precision = correct-included / all-included.
    • Recall = correct-included / all-true.
    • F1 balances both.
    • Categorical outcomes:
      • Fully Correct (exact set match)
      • Fully Incorrect (no overlap)
      • Partially Correct (some overlap; set tasks only)
      • Correct with Extraneous Answers (all correct items found, plus extras)
  • Why it exists: Gives both a fine-grained score (F1) and clear buckets to diagnose failure modes (e.g., hedging with extras or missing the tail). A minimal scoring sketch appears after this step.
  • Example: Listing the eight planets plus Pluto scores high recall but reduced precision, landing in ā€˜Correct with Extraneous Answers’ if all correct items are present.
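For intuition, here is a minimal sketch of how those four buckets could be assigned from the submitted and ground-truth sets. DeepSearchQA's actual judging (including semantic matching by the LLM judge) is richer than this literal set comparison.

```python
# Minimal, illustrative assignment of the four categorical outcomes.

def categorize(submitted: set[str], truth: set[str]) -> str:
    overlap = submitted & truth
    if submitted == truth:
        return "Fully Correct"
    if not overlap:
        return "Fully Incorrect"
    if truth <= submitted:            # every true item found, but extras included
        return "Correct with Extraneous Answers"
    return "Partially Correct"        # some overlap, some true items missing

planets = {"mercury", "venus", "earth", "mars", "jupiter", "saturn", "uranus", "neptune"}
print(categorize(planets | {"pluto"}, planets))  # Correct with Extraneous Answers
print(categorize({"earth", "mars"}, planets))    # Partially Correct
```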

Secret sauce (what’s clever):

  • Set-based, outcome-only grading forces agents to master exploration, verification, deduplication, and stopping—no shortcuts.
  • Time-anchored prompts reduce web-drift so results stay stable and verifiable.
  • Three-phase human verification (independent research, comparison, conflict resolution) yields trustworthy ground truth.
  • LLM judge enables fair matching across messy real-world wording.
  • Diverse domains and dependency chains stress long-horizon planning and memory.

04Experiments & Results

The test

  • What they measured: How completely and cleanly agents return the true set of answers across 900 prompts in 17 fields.
  • Why: Real research needs full coverage (recall) without junk (precision). F1 tells whether agents balance both.

The competition

  • Compared deep research agents and strong reasoning models, including Gemini Deep Research Agent and GPT-5 Pro High Reasoning, as well as smaller models.

The scoreboard (with context)

  • Gemini Deep Research Agent: 66.09% Fully Correct, only 9.95% Fully Incorrect, F1 = 81.90. That’s like getting an A- overall and rarely handing in a blank or totally wrong list.
  • GPT-5 Pro High Reasoning: 65.18% Fully Correct, 14.13% Fully Incorrect, F1 = 78.98. Also strong, but more often fails completely on some tasks.
  • Mid-tier agents (o3 Deep Research, o4 Mini) drop to around 40–44% Fully Correct and 20–24% Fully Incorrect—more frequent derailments on multi-step chains.
  • Smaller models (e.g., Gemini 2.5 Flash) have a steep fall: F1 around 43% and nearly half Fully Incorrect, showing that deep, chained research is not solvable by quick search alone.

Surprising findings

  • The ā€˜Last Mile’ gap: Even top agents show a gap of roughly 15 points between their high F1 scores and the strict ā€˜Fully Correct’ rate. Translation: they often get very close but either miss the long tail (under-retrieval) or add extras (over-retrieval) when they don’t know when to stop.
  • Test-time compute helps a lot: Sampling more runs lifts Fully Correct from ~67% (n=1) to ~86% (n=8), meaning a few extra tries can rescue near-misses (see the sampling sketch after this list).
  • Different failure flavors:
    • Quantitative estimation errors: aggregated the right items but misranked them or estimated numbers loosely.
    • Tool limitations: stopped when hitting an unreadable file instead of recovering via alternate sources.
    • Stopping errors: found the right source list but didn’t correctly filter it per the prompt’s constraints.
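The sampling result reads like a pass@n-style effect: a task counts as solved if any of the n sampled runs is fully correct. The sketch below assumes that interpretation (the paper's exact aggregation may differ) and uses toy verdicts.

```python
# Minimal sketch of estimating a pass@n-style "Fully Correct" rate from
# repeated runs, assuming each run gets an independent fully-correct verdict.
# The numbers and the aggregation rule are illustrative, not the paper's.

def fully_correct_at_n(per_task_verdicts: list[list[bool]], n: int) -> float:
    """A task counts if any of its first n sampled runs was fully correct."""
    hits = sum(any(runs[:n]) for runs in per_task_verdicts)
    return hits / len(per_task_verdicts)

# Three toy tasks, eight sampled runs each (True = that run was fully correct).
verdicts = [
    [True, False, True, True, False, True, True, False],
    [False, False, False, True, False, False, False, False],
    [False, False, False, False, False, False, False, False],
]
print(fully_correct_at_n(verdicts, 1))  # ~0.33
print(fully_correct_at_n(verdicts, 8))  # ~0.67 (more samples rescue near-misses)
```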

Takeaway

  • Deep research agents outperform standalone reasoning models, confirming that iterative browsing, memory, and tool use are critical for completeness.
  • But even leaders struggle to perfectly balance recall and precision, especially in long chains with nuanced constraints.

05Discussion & Limitations

Limitations

  • Black-box scoring: Outcome-only grading can’t tell if the agent’s path was smart or lucky. Two agents with the same final set might have very different processes.
  • Static web assumption: Time-anchored prompts reduce drift but don’t test breaking-news or rapidly changing pages; ground truths can still age.
  • Judge dependence: LLM judges help with semantic matching but can introduce evaluation noise if they misunderstand an item.

Required resources

  • To use this benchmark well, you need: a capable agent with browsing and tool use, enough test-time compute to sample multiple runs, and robust retrieval plus dedup systems.

When not to use

  • Urgent, live-updating tasks (e.g., ā€œcurrent stock tickers changing by the minuteā€), ambiguous or subjective queries, and tasks where process transparency (not just outcome) is essential.

Open questions

  • Can we log and categorize trajectories (queries, pages, tool calls) to better diagnose where chains break?
  • How to test ā€˜live’ dynamic lists while keeping grading fair and reproducible?
  • Could we add weighted relevance (core vs. peripheral items) to reward hitting the most important parts first?
  • What architectures best learn dynamic stopping under uncertainty without over-hedging or quitting early?

06Conclusion & Future Work

Three-sentence summary

  • DeepSearchQA is a benchmark that tests whether AI agents can produce complete and correct answer sets for complex, multi-step web tasks.
  • It uses outcome-only, set-based scoring with precision, recall, F1, and strict categories to expose under-retrieval, over-retrieval, and stopping errors.
  • Results show strong agents do best but still face a ā€˜Last Mile’ gap, pushing future research toward better collation, deduplication, and stopping strategies.

Main achievement

  • Shifting evaluation from ā€œfind one factā€ to ā€œmaster the full listā€ with a rigorous, verifiable, and domain-diverse benchmark that directly measures comprehensiveness.

Future directions

  • Add trajectory diagnostics, introduce dynamic/time-sensitive lists, and explore weighted relevance to reflect real-world priorities.

Why remember this

  • Because real research isn’t about one shiny answer; it’s about getting the whole, clean set. DeepSearchQA makes that the goalpost and gives the community a fair way to measure progress toward trustworthy deep-research agents.

Practical Applications

  • Build research agents that return complete regulatory checklists without duplicates or near-miss items.
  • Create due-diligence tools that aggregate company facts across filings and press releases and stop when coverage is complete.
  • Design medical literature scouts that list all clinical trials matching strict criteria with clean deduplication.
  • Develop market scanners that enumerate all securities meeting numeric thresholds across regions and dates.
  • Support public policy analysts by assembling exhaustive city or state lists that satisfy census and safety constraints.
  • Equip journalists with deep fact-gathering assistants that verify multi-source claims and clearly avoid over- or under-retrieval.
  • Enhance academic review bots that collect all papers meeting topic and year filters and correctly merge author/name variants.
  • Power compliance systems that enumerate all entities impacted by new rules with strong stopping logic.
  • Improve enterprise knowledge mining by producing complete vendor or product catalogs with robust entity resolution.
  • Enable smarter shopping/recommendation engines that compile full, up-to-date option sets under complex filters.
#DeepSearchQA Ā· #agentic information retrieval Ā· #systematic collation Ā· #entity resolution Ā· #stopping criteria Ā· #precision and recall Ā· #F1-score Ā· #set-based evaluation Ā· #multi-step web research Ā· #comprehensiveness gap Ā· #LLM-as-a-judge Ā· #long-horizon planning Ā· #benchmarking Ā· #outcome-based grading Ā· #open-web evaluation