SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
Key Summary
- SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.
- Across 1,200 questions and a 200,000-paper corpus, traditional BM25 search beat modern LLM-based retrievers by about 30% on tough, reasoning-heavy tasks.
- Agents often break big questions into short, keyword-like sub-queries, which suits BM25's exact word matching but clashes with LLM retrievers trained for natural sentences.
- Adding helpful keywords and metadata to each paper at test time (corpus-level scaling) made retrieval easier and improved scores, especially for BM25 (+8% EM on short-form).
- Open-ended questions showed smaller gains (+2% weighted recall) because agent sub-queries lacked diversity, limiting what any retriever could surface.
- Proprietary deep research agents (like GPT-5) led on web-search tasks, but even they struggled when retrieval required weaving together metadata and inter-paper relationships.
- The benchmark covers four domains (Computer Science, Natural Science, Healthcare, Humanities) and includes both verifiable short-form and practical open-ended questions.
- Ablation studies showed that which clues matter (metadata, details, relationships) depends strongly on the search backend and agent behavior.
- The main insight: instead of only making the query smarter, make the corpus easier to find. Small, smart edits to documents can help off-the-shelf retrievers.
- SAGE highlights that better collaboration between agents and retrievers will require retriever-aware query strategies and better handling of long, complex documents.
Why This Research Matters
Scientific progress depends on finding the right prior work quickly and confidently. SAGE shows where today's deep research agents succeed and where they stumble, especially when questions require weaving together metadata, figures, and citation networks. The study reveals a practical path that organizations can use today: lightly enrich document collections so off-the-shelf retrievers and existing agents perform better without retraining. This helps students, researchers, clinicians, and analysts stay current and make decisions grounded in reliable evidence. Over time, aligning how agents ask (query style) with how retrievers search (lexical vs semantic) can unlock even bigger gains. The work also motivates hybrid systems that blend BM25's precision with dense models' semantic reach for long, technical documents.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're doing a big school project. You don't just ask one question and get an answer; you look up many books, take notes, connect ideas, and finally write a report. That's what deep research agents try to do online.
The Concept (BM25):
- What it is: BM25 is a classic search method that scores documents by how well their words match your query words.
- How it works:
- Count how often each query word appears in a document.
- Reward words that are rarer across the whole library (they're more special).
- Combine these counts into a score and rank documents by that score.
- Why it matters: If you don't match words precisely, you might miss the one paper that exactly mentions what you typed. Anchor: If your query is "NAACL 2024 tool creation paper," BM25 looks for those exact words and quickly finds papers with "NAACL," "2024," and "tool."
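To make the scoring idea concrete, here is a minimal BM25 sketch in Python. The toy corpus, whitespace tokenization, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions, not the benchmark's actual index.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rarer terms across the whole library get a higher IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy example (hypothetical two-paper corpus):
docs = ["naacl 2024 tool creation for llms".split(),
        "survey of dense retrieval models".split()]
print(bm25_scores("naacl 2024 tool".split(), docs))
```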
The Concept (LLM-based Retriever):
- What it is: An LLM-based retriever uses a large language model to understand meaning, not just exact words, and then finds relevant documents.
- How it works:
- Turn the query and each document into numerical âmeaningâ vectors (embeddings).
- Compare these vectors to see which documents are closest in meaning.
- Return the top matches.
- Why it matters: Without meaning awareness, you might miss documents that use different words but mean the same thing. Anchor: Searching "paper that splits tool design into abstract and concrete steps" can find a paper that says "high-level and low-level tool phases," even if it never uses your exact words.
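A minimal sketch of the embed-and-compare loop, assuming a placeholder embed() function; a real system would call an actual embedding model, and the 384-dimensional pseudo-random vectors here are stand-ins only.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real retriever would call an LLM-based encoder.
    This stand-in derives a deterministic pseudo-random unit vector (an assumption)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(384)
    return v / np.linalg.norm(v)

def top_matches(query: str, docs: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity between query and document vectors."""
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in docs])
    sims = doc_vecs @ q            # cosine similarity (all vectors are unit length)
    order = np.argsort(-sims)[:k]  # indices of the k closest documents
    return [(docs[i], float(sims[i])) for i in order]
```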
The Concept (Deep Research Agents):
- What it is: Deep research agents are AI helpers that plan, search, read, and stitch together evidence across many sources to answer complex questions.
- How it works:
- Read the question and break it into smaller sub-questions.
- Search for papers, collect snippets, and take notes.
- Repeat until thereâs enough evidence.
- Write a well-cited answer.
- Why it matters: Without planning and iteration, the agent might stop too early or miss critical steps. Anchor: To find a NAACL 2024 paper tied to specific citations, the agent searches for venue, year, author count, figures, and shared references, then combines them to pinpoint the paper.
The Concept (Short-form Questions):
- What it is: These are precise questions with one correct, checkable answer.
- How it works:
- Provide exact constraints (e.g., venue, year, number of authors).
- Require reading figures/tables and linking citations.
- Output the single correct paper.
- Why it matters: If the agent misses any constraint, the answer is simply wrong. Anchor: "Find the NAACL 2024 paper with four authors whose figure shows tasks hitting a ceiling without code training and that shares 14 citations with 'LLM as Tool Makers.'"
The Concept (Open-ended Questions):
- What it is: These are broader, realistic research prompts where multiple good answers exist.
- How it works:
- Provide topic background (e.g., tool creation in LLMs).
- Ask for key references and methods.
- Score how well the returned papers cover the core area.
- Why it matters: Real research rarely has just one "right paper." Breadth and relevance matter. Anchor: "I'm exploring how LLMs learn to create and use tools. Give me methodological background references."
The Concept (SAGE Benchmark):
- What it is: SAGE is a large, up-to-date test for scientific literature retrieval by deep research agents.
- How it works:
- Build a clean, fixed corpus of 200,000 papers across four domains.
- Ask 1,200 questions: 600 short-form (verifiable) and 600 open-ended (realistic).
- Measure exactness (Exact Match) and coverage (Weighted Recall).
- Why it matters: Without a controlled, modern testbed, we can't fairly compare retrieval methods or improve them. Anchor: If the agent says it found the one correct paper, SAGE checks if that exact paper is in the citations and text.
The World Before: AI agents grew strong at browsing the web and answering questions. But web data is messy, search backends are proprietary, and it's hard to tell if the agent truly found the right paper or guessed. LLM-based retrievers promised "smarter" search by understanding meaning. Researchers wondered: will they boost deep research agents?
The Problem: In scientific search, clues come from metadata (venue, year, authors), figures/tables, and how papers cite each other. Agents that split a big question into keyword-like sub-queries may accidentally mismatch LLM-based retrievers, which expect natural sentences and semantic phrasing. Long PDFs make dense embeddings harder too.
Failed Attempts: Prior work mainly scaled compute on the query side, expanding or rewriting queries. But when the agent's sub-queries stay keyword-y, semantic retrievers can underperform. Dense models may also struggle when only the first chunk of a long paper fits their window.
The Gap: We needed a clean, modern benchmark whose papers and questions force real reasoning, plus a method that helps off-the-shelf retrievers without retraining the agent.
Real Stakes: Faster, more accurate paper finding helps students write reports, researchers do literature reviews, and doctors and scientists keep up with advances, giving everyone better, well-evidenced answers. SAGE puts these needs to the test and shows a practical path forward: sometimes the best fix is to make the library easier to search, not just the question smarter.
02 Core Idea
Hook: You know how you can either shout your question louder or organize the bookshelf better? If people can't find books, tidying the shelves might help more than yelling.
The Concept (Aha! Moment):
- What it is: The key insight is that deep research agents often issue keyword-like sub-queries, so classic BM25 wins big; to improve, don't only tweak queries. Enrich the corpus so existing retrievers can find things more easily.
- How it works:
- Benchmark agents on realistic, reasoning-heavy scientific questions.
- Observe that keyword-y sub-queries favor BM25 over LLM retrievers by ~30%.
- Add helpful metadata and succinct, LLM-extracted keywords to each paper (no model retraining) so retrieval becomes easier.
- Why it matters: If the agent and retriever don't "speak the same language," improving the library's labels can bridge the gap quickly. Anchor: Prepending "NAACL 2024; 4 authors; tool creation; abstract vs concrete tool phases" to a paper helps BM25 catch it when the agent types short, keyword-y queries.
Three Analogies:
- Librarian vs shelves: A librarian (retriever) can be super smart, but if books lack labels, they'll still struggle. Add good labels, and even a regular librarian shines.
- Keys and locks: If your keys (queries) are simple, a fancy lock (semantic retriever) won't help. Better to add a clear keyhole sign (metadata/keywords on documents).
- Treasure map: You can buy a better compass (query tricks) or draw brighter landmarks on the map (corpus augmentation). The latter helps everyone navigate.
Before vs After:
- Before: People expected LLM-based retrievers to boost deep research agents automatically. Agents broke problems into keyword-like sub-queries; dense retrievers stumbled; BM25 quietly dominated.
- After: We know matching the agentâs behavior matters. By enriching the corpus with metadata and LLM-chosen keywords, we help off-the-shelf retrievers (especially BM25) retrieve the right papers more often (+8% EM on short-form; +2% on open-ended).
The Concept (Query Decomposition):
- What it is: Agents split a big question into smaller searches.
- How it works:
- Extract constraints (venue, year, author count).
- Search for figures/tables clues.
- Check shared citations with a target paper.
- Iterate until one paper fits all clues.
- Why it matters: If sub-queries are keyword-y, semantic retrievers can mismatch, but BM25 thrives. Anchor: "NAACL 2024," "4 authors," "figure ceiling without code training," "shares 14 citations with Tool Makers" becomes 3–5 short searches.
Why It Works (intuition, no equations):
- The agent's short, constraint-heavy sub-queries create a lexical game: exact words matter. BM25 is built for that.
- Dense retrievers compress long documents; signals far inside may be downweighted.
- Adding crisp keywords and metadata up front raises a paperâs chances of matching those lexical sub-queries.
- You don't modify the agent or retriever; you just provide better "signposts" on each document.
Building Blocks:
- SAGE Benchmark: hard, modern, controlled questions across four domains.
- Two task types: short-form (one right answer) and open-ended (many good answers).
- Measurements: Exact Match (EM) for correctness, Weighted Recall for coverage.
- Corpus-level Test-time Scaling: prepend compact metadata and eight LLM-picked keywords to every paper to boost findability.
The Concept (Corpus-level Test-time Scaling):
- What it is: At inference time, enrich documents (not the model) with metadata and keywords so they're easier to retrieve.
- How it works:
- Parse each paper into text.
- Extract venue, year, authors, and citation signals.
- Ask an LLM for eight on-point keywords.
- Prepend these to the paper's text before indexing/search (see the sketch below).
- Why it matters: If you can't change the agent or retriever, change the playground and make papers more discoverable. Anchor: A paper starting with "NAACL 2024; 4 authors; Toolink; abstract/concrete tool phases; shared citations with CREATOR" pops to the top when queried with those clues.
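A minimal sketch of this augmentation step, assuming a simple record format and a generic extract_keywords callable standing in for the LLM keyword picker; the field names and header layout are illustrative, not the paper's exact pipeline.

```python
def build_header(meta: dict, keywords: list[str]) -> str:
    """Compact, front-loaded label for one paper (illustrative format)."""
    return (f"{meta['venue']} {meta['year']}; {meta['n_authors']} authors; "
            + "; ".join(keywords))

def augment_corpus(papers: list[dict], extract_keywords) -> list[str]:
    """Prepend metadata and LLM-picked keywords to each paper before indexing.

    extract_keywords(text) -> list[str] stands in for an LLM call that returns
    topic-relevant keywords; here we keep at most eight of them.
    """
    augmented = []
    for paper in papers:
        keywords = extract_keywords(paper["text"])[:8]
        header = build_header(paper["meta"], keywords)
        augmented.append(header + "\n\n" + paper["text"])  # header goes in front
    return augmented

# The augmented texts are then indexed as usual (BM25 or dense embeddings);
# neither the agent nor the retriever is modified.
```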
Bottom line: SAGE shows that "how the agent asks" and "how the library is labeled" must match. Until agents produce more natural, semantic sub-queries, smart corpus labeling is a powerful, low-friction fix.
03 Methodology
High-level recipe: Question → Agent thinks and splits into sub-queries → Retriever searches the paper corpus → Agent reads, iterates, and answers.
Step A: Build the SAGE Corpus
- What happens: Collect ~200k recent, open-access scientific papers across four domains (≈50k each; ≈40k for humanities), convert PDFs to markdown, keep up to 32k tokens per paper.
- Why this step exists: A fixed, modern, controlled library lets us compare retrievers fairly and avoid hidden web knowledge.
- Example: A 2024 NAACL paper is included with its references and related works so it can be reliably found by exact clues or by shared citations.
The Concept (Reference Overlap):
- What it is: Two papers are "related" if they cite many of the same works.
- How it works:
- List references for each paper.
- Count shared citations between pairs.
- If overlap ≥ 4, mark them as related.
- Why it matters: Connections through shared references help design questions that require reasoning across papers. Anchor: If Paper A and Paper B both cite "CREATOR," they're likely about similar tool-creation ideas.
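The overlap rule can be sketched as below, assuming each paper is represented by the set of works it cites; the threshold of 4 comes from the description above.

```python
def related_pairs(references: dict[str, set[str]], min_overlap: int = 4):
    """Mark paper pairs as related when they share at least min_overlap citations.

    references maps a paper id to the set of works it cites (a hypothetical
    representation of the corpus metadata).
    """
    ids = sorted(references)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            overlap = len(references[a] & references[b])  # shared citations
            if overlap >= min_overlap:
                pairs.append((a, b, overlap))
    return pairs
```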
Step B: Create Short-form Questions (verifiable)
- What happens: Use metadata (venue, year, authors), figures/tables (parsed from PDFs), and inter-paper relationships to generate precise, one-answer questions.
- Why this step exists: Forces the agent to combine multiple clues, not just match a title.
- Example data: "Find the NAACL 2024 paper with 4 authors whose main figure shows ceiling effects without code training and that shares 14 citations with 'LLM as Tool Makers.'"
- What breaks without it: If we only use titles or abstracts, the task becomes trivial and doesnât test real reasoning.
The Concept (Exact Match, EM):
- What it is: A score that checks if the exact gold paper appears in the agentâs answer or citations.
- How it works:
- Compare predicted papers to the single ground-truth paper.
- EM = 1 if found; else 0. Average over questions.
- Why it matters: Binary correctness keeps evaluation strict for precision-heavy tasks. Anchor: If the only right paper is "Toolink: Linking Toolkit Creation…," EM gives credit only if that exact paper is cited.
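The metric itself is tiny; here is a sketch under the assumption that predictions are sets of paper identifiers.

```python
def exact_match(predicted_ids: set[str], gold_id: str) -> int:
    """EM for one question: 1 if the single gold paper appears among the papers
    the agent cited or answered with, else 0."""
    return int(gold_id in predicted_ids)

def mean_em(predictions: list[set[str]], golds: list[str]) -> float:
    """Average EM over all short-form questions."""
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
```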
Step C: Create Open-ended Questions (realistic)
- What happens: Start from pairs of related papers with shared references. Use an LLM to write practical prompts asking for background and method references.
- Why this step exists: Real researchers need breadth; multiple relevant answers should count.
- Example data: "I'm exploring self-supervised tool creation in LLMs; give background and methodological basis references."
The Concept (Weighted Recall):
- What it is: A coverage score giving more credit for retrieving the "most relevant" seed papers and some credit for other shared references.
- How it works:
- Label top-2 seed papers as 2 points each; shared-citation papers as 1 point.
- Weighted Recall = (sum of retrieved gains) / (sum of all possible gains).
- Why it matters: It fairly rewards returning the core set and related works, not just one perfect hit. Anchor: Returning both seed papers scores higher than returning only a few loosely related ones.
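A minimal sketch of the weighting scheme described above (2 points per seed paper, 1 point per other shared-citation paper); the set-based representation is an assumption.

```python
def weighted_recall(retrieved: set[str], seeds: set[str], shared: set[str]) -> float:
    """Coverage score: seed papers are worth 2 points, other shared-citation
    papers 1 point; divide the points earned by the maximum possible."""
    gains = {**{p: 2 for p in seeds}, **{p: 1 for p in shared - seeds}}
    max_gain = sum(gains.values())
    earned = sum(w for p, w in gains.items() if p in retrieved)
    return earned / max_gain if max_gain else 0.0

# Example: both seeds plus one of three shared-citation papers retrieved
# -> (2 + 2 + 1) / (2 + 2 + 1 + 1 + 1) ≈ 0.71
```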
Step D: Evaluate Agents with Web Search
- What happens: Test GPT-5, Gemini-2.5, and DR Tulu using their normal web tools and settings.
- Why this step exists: Shows how strong end-to-end agents perform in the wild.
- Example: GPT-5 sets "medium" reasoning effort; Gemini enables dynamic thinking; DR Tulu runs on an H100 GPU via vLLM.
- What breaks without it: We would not know the baseline of current best systems before switching to a controlled corpus.
Step E: Controlled Corpus Search via Swappable Retrievers
- What happens: Keep the same agent (DR Tulu) but replace web search with specific retrievers over the fixed SAGE corpus: BM25, gte-Qwen2-7B-instruct, ReasonIR.
- Why this step exists: Isolate the retriever's effect while holding the agent constant.
- Example: For each sub-query, return top-5 or top-10 titles and abstracts from the corpus.
The Concept (Top-k):
- What it is: The number of results returned per search.
- How it works:
- Ask the retriever for the best k matches.
- The agent reads them and decides next steps.
- Why it matters: Too few results can hide the right paper; too many can waste reading time. Anchor: Setting k=10 instead of k=5 helped weaker retrievers by surfacing more candidates.
Step F: Indexing Long Documents
- What happens: Convert PDFs to markdown, extract tables, and embed up to 32,000 tokens per paper for dense retrievers; BM25 indexes all tokens lexically.
- Why this step exists: Long scientific PDFs need careful handling; otherwise key evidence is missed.
- Example: A figure caption deep in the paper might be critical for a short-form question.
- What breaks without it: Dense retrievers may ignore late sections; BM25 may be noisy but complete.
The Concept (Semantic Drift):
- What it is: When the agent's search focus slowly slides toward the wrong meaning due to misleading early hits.
- How it works:
- A keyword (e.g., "physics-informed") returns generic but popular papers.
- The agent reads and reinforces that path.
- Subsequent searches drift further from the target.
- Why it matters: Drift can trap the agent in a feedback loop of off-target results. Anchor: Instead of a symbolic regression paper, the agent keeps retrieving general physics-informed neural network papers.
Step G: Corpus-level Test-time Scaling
- What happens: Prepend each paper with bibliographic metadata (venue, year, authors) and eight LLM-generated, topic-relevant keywords.
- Why this step exists: Agents often issue keyword-like sub-queries; adding crisp labels makes those queries hit the right papers more often.
- Example: "NAACL 2024; 4 authors; tool creation; abstract vs. concrete tool phases; shared citations with CREATOR."
- What breaks without it: Dense retrievers may still miss long-document signals; BM25 gains less without strong front-loaded clues.
The Concept (Metadata Augmentation):
- What it is: Adding compact, informative labels to each document.
- How it works:
- Extract venue, year, authors, citation stats.
- LLM picks eight keywords summarizing contributions.
- Prepend these to the paper text for indexing.
- Why it matters: It turns buried evidence into front-door signs that match keyword-style queries. Anchor: A paper's new header includes "ICML 2023; symbolic regression; AI Feynman; ODEFormer," helping the agent's short queries hit it.
Secret Sauce:
- Instead of retraining the agent or building a brand-new retriever, lightly enrich the corpus so existing tools work better: simple, scalable, and effective, especially for BM25.
04 Experiments & Results
The Test: Researchers measured whether agents could find the single correct paper for short-form questions (using Exact Match) and how well they covered key references for open-ended questions (using Weighted Recall). They compared strong proprietary agents (GPT-5 family, Gemini-2.5) and an open-source agent (DR Tulu) using both web search and a controlled corpus with swappable retrievers (BM25, gte-Qwen2-7B-instruct, ReasonIR).
The Competition:
- Web-search agents: GPT-5, GPT-5-mini, GPT-5-nano, Gemini-2.5-Pro, Gemini-2.5-Flash, DR Tulu.
- Corpus-search (with DR Tulu backbone): BM25 (lexical), gte-Qwen2-7B-instruct (dense), ReasonIR (dense, trained for reasoning).
Scoreboard (with context):
- On short-form (EM), GPT-5 led web search (≈72% EM), showing strong end-to-end capability. But when DR Tulu used a fixed corpus and swappable retrievers, BM25 jumped ahead by about 30% over LLM-based retrievers. That's like BM25 getting an A while dense retrievers got a C+ on precision-heavy tasks.
- On open-ended (Weighted Recall), gaps narrowed. The dense gte-Qwen2-7B-instruct model slightly beat BM25 in some settings, showing that when broader coverage is acceptable, dense retrievers can compete. Think of this as both getting solid Bs with small differences.
- Top-k matters: Raising k from 5 to 10 consistently helped, especially for weaker first-page rankings (ReasonIR improved most), similar to giving more contestants a chance to audition.
Surprising Findings:
- Search quantity isn't everything: Gemini-2.5-Flash issued tons of searches but still trailed GPT-5 on short-form; DR Tulu returned many references yet lagged too. Fewer, better-aligned queries beat brute force.
- Query style mismatch: Agents often produce keyword-y sub-queries; dense retrievers expect natural, semantic sentences. This mismatch helped BM25 shine.
- Long-doc penalty for dense embeddings: With 32k-token documents, signals deep in the paper can be underrepresented by dense models, lowering recall and diversity per search.
- Semantic drift traps: Dense models sometimes overemphasized popular phrases (e.g., "physics-informed"), pulling the agent off-target. BM25's exact matching kept it anchored.
Case Study (symbolic regression): The target ICML paper used physics-informed symbolic heuristics. ReasonIR latched onto "physics-informed" in titles and kept returning generic PINN papers, drifting away from symbolic regression. BM25's lexical anchoring surfaced the correct paper in its top hits.
Diversity Metric (URS): Unique References per Search was higher for BM25 (e.g., 2.97) than for ReasonIR (1.98) under top-5 in short-form tasks; BM25 exposed more fresh candidates per search call.
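One way such a diversity number could be computed from search logs is sketched below; the log format (one list of returned paper ids per search call) is an assumption, not the paper's implementation.

```python
def unique_references_per_search(search_logs: list[list[str]]) -> float:
    """URS: average number of previously unseen papers returned per search call."""
    seen: set[str] = set()
    new_counts = []
    for results in search_logs:
        fresh = [pid for pid in results if pid not in seen]  # first-time hits
        new_counts.append(len(fresh))
        seen.update(results)
    return sum(new_counts) / len(search_logs) if search_logs else 0.0
```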
Ablations (what clues matter?):
- Removing inter-paper relationships hurt corpus-search DR Tulu with BM25 the most, showing how much shared-citation logic helps in the controlled setting.
- With web search backends, details like figures/tables mattered more, hinting that the search engine itself changes which clues are most useful.
Corpus-level Test-time Scaling Results:
- Short-form: BM25 gained about +8.18% EM after augmentation; dense retrievers gained modestly.
- Open-ended: All retrievers improved slightly (≈ +1–2% weighted recall). Limited sub-query diversity likely capped the benefit: if you don't search wide, better labels can only help so much.
Takeaway: For precision-heavy, constraint-rich questions, BM25 is a powerhouse under current agent behavior. Small, smart document labels further boost it. For broader, exploratory tasks, dense retrievers stay competitiveâespecially if agents learn to write more semantic sub-queries.
05 Discussion & Limitations
Limitations:
- Agent training: The study didn't fine-tune agents to be retriever-aware. If agents learned to craft semantic sub-queries for dense retrievers (or switch styles per backend), gaps might shrink.
- Generality: Many behavioral analyses center on DR Tulu; other agents with different training might decompose queries differently and change outcomes.
- Long documents: Dense retrievers face window and compression limits on 32k-token papers; mid-to-late evidence can be underweighted.
- Query diversity: Low-diversity decomposition in open-ended tasks limits the upside for any retriever or corpus augmentation.
Required Resources:
- GPUs (e.g., H100) for agent inference and dense embeddings; CPU/RAM for BM25 indexing; storage for 200k papers; PDF parsing tools (PyMuPDF, PDFPlumber).
When NOT to Use:
- If you can only rely on dense retrievers and your agent insists on keyword-like sub-queries, expect underperformance on precise, constraint-rich tasks.
- If documents are extremely long and you cannot index beyond short prefixes, dense models may miss key signals; rely more on lexical or hybrid approaches.
- If your task demands fresh, web-scale coverage beyond the controlled corpus, SAGE's fixed library won't reflect live web dynamics.
Open Questions:
- Retriever-aware agents: Can we train agents to adapt query style to the retriever (lexical vs semantic) automatically?
- Hybrid retrieval: What is the best way to combine BM25 and dense retrievers for long scientific PDFs: per-query switching, score fusion (see the sketch after this list), or late-stage reranking with test-time compute?
- Better corpus scaling: Beyond keywords, can we safely add LLM-written mini-abstracts, relation graphs, or figure captions without hallucinations?
- Handling drift: How can agents detect and correct semantic drift mid-search (e.g., diversity-promoting prompts, self-critique, or exploration bonuses)?
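For the score-fusion option above, one simple and widely used recipe is reciprocal rank fusion (RRF); the sketch below is illustrative only and not something the paper evaluates.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g., from BM25 and a dense retriever) by summing
    1 / (k + rank) for each document across lists; higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a lexical and a dense ranking for the same sub-query.
fused = reciprocal_rank_fusion([["p1", "p2", "p3"], ["p3", "p1", "p4"]])
```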
06 Conclusion & Future Work
Three-sentence summary: SAGE is a new, realistic benchmark for scientific literature retrieval by deep research agents, revealing that classic BM25 outperforms modern LLM-based retrievers by about 30% on reasoning-heavy, constraint-rich questions. The core reason is a query-retriever style mismatch: agents often emit keyword-like sub-queries, which suit BM25 but clash with semantic dense retrievers. A practical fix is corpus-level test-time scaling: prepend papers with metadata and LLM-chosen keywords, yielding notable gains, especially for short-form tasks.
Main achievement: Proving, with a controlled and up-to-date benchmark, that improving the library's labels (documents) can matter more than only improving the questions (queries), and delivering a simple, effective recipe to do so.
Future directions: Train retriever-aware agents, develop robust hybrid BM25+dense pipelines for long documents, add richer but trustworthy corpus annotations (e.g., figure summaries, relationship notes), and design anti-drift strategies. Explore reranking with test-time compute to capitalize on larger candidate sets.
Why remember this: When agents ask in keywords, it pays to speak back in keywords, right from the documents themselves. SAGE shows a clear path to better retrieval today while pointing to smarter, better-matched agent-retriever teamwork tomorrow.
Practical Applications
- University libraries can prepend metadata and curated keywords to institutional repositories to improve search for theses and articles.
- Research groups can build internal literature scouts that excel at precise, constraint-rich queries (venue, year, shared citations) using BM25 plus corpus augmentation.
- Clinics can maintain curated guideline libraries with front-loaded metadata to speed up evidence lookup for time-critical decisions.
- Enterprises can index long technical reports with augmented headers so staff find exact documents via short, keyword-like queries.
- Newsrooms can rapidly locate source papers for trending claims by leveraging shared-citation questions and BM25-first retrieval.
- Patent analysts can use relationship-driven queries (reference overlap) to surface prior art clusters more reliably.
- Conference organizers can audit submissions by quickly finding closely related works via overlap-based retrieval.
- Educators can assemble reading lists by asking open-ended, background-style queries over an augmented course corpus.
- Policy analysts can discover methodological precedents by querying for specific figure evidence and shared references.
- AI teams can add corpus-level scaling to existing RAG systems as a low-risk upgrade before retraining models.