Agentic-R: Learning to Retrieve for Agentic Search
Key Summary
- Agentic-R is a new way to teach a search retriever to find not just similar text, but the text that truly helps an AI get the final answer right.
- It measures a passage’s usefulness in two ways at once: how well it answers the current sub-question (local relevance) and whether it helps lead to the correct final answer (global correctness).
- It builds training data by letting an agent think, search, and try different passages, then scoring those passages with a strong LLM and by checking final-answer accuracy.
- Agentic-R and the search agent improve each other in turns: first train the agent with the current retriever, then use the agent’s better queries to train a better retriever, and repeat.
- Across seven QA benchmarks and three different agents, Agentic-R beats strong baselines, especially on tough multi-hop questions.
- Ablations show both signals (local relevance and global correctness) are important; removing either hurts performance.
- Agentic-R also helps agents solve problems with fewer searches, making the whole system faster and more efficient.
- Using only the original question plus the current sub-question works best; adding earlier queries actually adds noise.
- Two rounds of the agent–retriever loop are enough; a third round doesn’t help further.
- The method generalizes across different retriever backbones (e.g., E5, BGE) and different agents (in-domain and out-of-domain).
Why This Research Matters
Agentic-R helps AI systems stop chasing lookalike text and instead focus on what truly leads to the right final answer. That means better help on complex homework, research, coding, and decision-making tasks where multiple steps of thinking are needed. It also makes systems faster by reducing the number of searches, which saves time and compute. In fields like medicine, law, and finance, avoiding “relevant but misleading” information can prevent costly mistakes. Because the agent and retriever learn together, the system adapts to how it is actually used, improving robustness over time.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're doing a big school project. You don’t just grab the first book that looks similar to your topic—you pick the books that actually help you finish the project correctly.
🥬 The World Before: In Retrieval-Augmented Generation (RAG), AIs look up outside information and then write answers. Most systems used a single-turn search: ask once, retrieve once, answer once. Retrievers were usually trained to find passages that looked semantically similar to the question. This worked okay for simple fact lookups, but struggled when problems needed several steps of thinking.
🥬 The Problem: A new style called agentic search lets an AI think in steps, ask sub-questions, and retrieve between thoughts. But the retrievers behind these agents were still “similarity finders,” not “answer helpers.” That’s a mismatch. Worse, in multi-step reasoning, a passage can be locally relevant yet still nudge the agent toward a wrong final answer. Also, we don’t have gold answers for the agent’s intermediate sub-queries, so it’s hard to judge which passages truly help at each step.
🥬 Failed Attempts: Utility-trained retrievers for single-turn RAG tried to measure how much a passage helps the final answer by using the gold answer or the model’s likelihood. But when extended to multi-turn agentic search, they hit roadblocks: no gold labels for every sub-query, and purely local scoring ignores how a passage might subtly derail the later reasoning.
🥬 The Gap: We needed a retriever that understands two things at once: (1) a passage’s local match to the current sub-question and (2) its global contribution to getting the final answer right. We also needed a way for the retriever and the agent to grow together, because in agentic search the queries come from the agent, and better retrievers produce better agent behavior, which can then produce better training queries.
🥬 Real Stakes: In daily life, this matters because:
- Homework helpers: You want sources that actually help you solve the full problem, not just sound related.
- Medical or legal assistants: A seemingly relevant note that nudges you to a wrong final conclusion can be costly.
- Research and coding: Saving search turns and avoiding misleading snippets means faster, more accurate results.
- Robustness: Systems must resist “distractor” texts that look right but lead you astray.
🍞 Anchor: Think of a treasure hunt with clues. A clue that mentions the right town (similarity) but points you to the wrong street won’t help you find the treasure. You need clues that both match the current question (“Which street?”) and guide you all the way to the treasure (the correct final answer). That’s the world Agentic-R is built for.
— New Concepts (explained with the Sandwich pattern):
- 🍞 Hook: You know how a librarian helps you find books and you write your report using those books? 🥬 The Concept (Retrieval-Augmented Generation, RAG): RAG is when an AI first retrieves outside info and then generates an answer using it.
- How it works: (1) Take the question. (2) Retrieve passages. (3) Read them. (4) Write the answer using what was found.
- Why it matters: Without retrieval, the AI may “make things up” because it lacks facts. 🍞 Anchor: Ask “When was the Eiffel Tower built?”; the AI fetches a passage with the date and then answers correctly.
- 🍞 Hook: Imagine solving a mystery by asking smaller questions step by step. 🥬 The Concept (Agentic Search): The AI alternates thinking and searching in multiple turns to solve complex problems.
- How it works: (1) Think about what’s missing. (2) Ask a sub-question. (3) Retrieve. (4) Update understanding. (5) Repeat until ready, then answer.
- Why it matters: Hard questions need multiple hops; a single look-up won’t cut it. 🍞 Anchor: To compare two people’s ages, the agent first finds each birthday, then compares.
- 🍞 Hook: When shopping, you adjust your list based on what’s in stock. 🥬 The Concept (Dynamic Query Generation): The agent writes new sub-queries on the fly as it learns more.
- How it works: (1) Read what you found. (2) Spot the gap. (3) Ask a sharper next question.
- Why it matters: Better follow-up questions mean faster, better answers. 🍞 Anchor: After finding John Henry II’s birthday, the next query becomes “Jed Hoyer birth year.”
- 🍞 Hook: Picking books for a report isn’t just about similar titles; it’s about helpful content. 🥬 The Concept (Passage Utility): Utility measures how helpful a passage is to getting the final answer right.
- How it works: Score passages not only by matching the sub-question, but also by whether they lead to a correct final answer.
- Why it matters: Similarity alone can mislead; utility keeps you on track. 🍞 Anchor: A passage naming Gilley’s Club is useful because it leads to the correct founder (Mickey Gilley), not just because it mentions “honky tonk.”
- 🍞 Hook: When you grade homework, you check if it answers the exact question asked. 🥬 The Concept (Local Relevance): This checks if a passage directly answers the current sub-question.
- How it works: An LLM gives each candidate passage a score from 0–100 for how well it fits the current query.
- Why it matters: Without this, you may retrieve off-topic passages. 🍞 Anchor: For “Jed Hoyer birth year,” a passage that states his birth year scores high; a passage about his career scores lower.
- 🍞 Hook: Finishing the puzzle matters more than any one nice-looking piece. 🥬 The Concept (Global Answer Correctness): This checks whether using a passage helps the agent reach the correct final answer.
- How it works: Continue the agent’s reasoning using that passage and see if the final answer matches the gold answer.
- Why it matters: A seemingly relevant passage that leads to a wrong final conclusion is not truly useful. 🍞 Anchor: If plugging in a passage makes the agent answer the question correctly, that passage gets top credit.
- 🍞 Hook: Judging songs is easier when you compare them all at once. 🥬 The Concept (Listwise Scoring): Instead of rating one passage at a time, the LLM compares many together to assign consistent scores.
- How it works: Show the query and multiple passages; the LLM scores each with 0–100 based on utility rules.
- Why it matters: Side-by-side comparison improves judgment. 🍞 Anchor: With five candidate passages, the LLM rates them all together so the most directly answering one rises to the top.
- 🍞 Hook: Sometimes you write a mini-answer first to guide what to look for next. 🥬 The Concept (Sub-answer): A short, inferred answer for the current sub-question, used to help the LLM judge relevance.
- How it works: Another LLM infers the likely sub-answer from the full trajectory and the gold final answer; if unsure, it says “Not Sure.”
- Why it matters: A good hint sharpens the relevance test. 🍞 Anchor: If the sub-question is “When was John Henry II born?”, the sub-answer might be “September 13, 1949.”
- 🍞 Hook: In practice, teams and players improve together by taking turns training. 🥬 The Concept (Iterative Training Framework): The agent and retriever take turns improving each other.
- How it works: (1) Fix retriever, train agent via RL. (2) Use better agent to generate better queries. (3) Train retriever on those. (4) Repeat.
- Why it matters: As the agent asks smarter questions, the retriever learns what truly helps. 🍞 Anchor: After round one, the agent writes crisper sub-queries, which produce better training data for round two’s retriever.
- 🍞 Hook: You remember things better by telling similar and different items apart. 🥬 The Concept (Contrastive Learning): Train the retriever to pull positives close and push negatives away in embedding space.
- How it works: Use one positive passage and many negatives (including in-batch and cross-device) and optimize similarity scores.
- Why it matters: It builds a sense of “this is truly useful” vs. “this is distractor.” 🍞 Anchor: For a given sub-question, the retriever learns to rank the passage that led to the correct final answer above those that didn’t.
- 🍞 Hook: A coach rewards a player when the team wins. 🥬 The Concept (Reinforcement Learning with PPO): The agent is trained to get higher rewards for correct final answers.
- How it works: Generate multi-turn trajectories, compare final answer to gold, give reward, update the policy.
- Why it matters: It teaches the agent when to search, what to ask, and when to stop. 🍞 Anchor: The agent learns to ask fewer, sharper searches to maximize correct answers.
- 🍞 Hook: Checking your final answer against the answer key. 🥬 The Concept (Exact Match, EM): A metric that says “1” if the final answer exactly equals the gold answer, else “0.”
- How it works: Compare strings after normalization.
- Why it matters: Gives a clear, task-level success signal. 🍞 Anchor: If the gold is “Mickey Gilley,” the system gets EM=1 only when it outputs that exact answer.
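To make the EM check concrete, here is a minimal Python sketch. The normalization details (lowercasing, dropping punctuation and articles) follow the usual open-domain QA convention and are assumptions here, not the paper’s exact script.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (common open-domain QA normalization; assumed here)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize(prediction) == normalize(gold))


print(exact_match("the Mickey Gilley", "Mickey Gilley"))  # -> 1
print(exact_match("Johnny Lee", "Mickey Gilley"))         # -> 0
```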
02 Core Idea
🍞 Hook: You know how a great teammate isn’t just the one who looks like they’re trying—they’re the one who actually helps you win the game?
🥬 The Aha! Moment (one sentence): Train the retriever to prefer passages that both match the agent’s current sub-question and actually help the agent reach the correct final answer—and let the agent and retriever improve each other in turns.
Multiple Analogies:
- Treasure Map: Don’t just pick clues that mention the right city (similarity); pick the clues that genuinely lead you to the treasure (final correctness), and practice the hunt in loops so your clues and strategy co-improve.
- Cooking: Don’t only grab ingredients that sound related; choose ingredients that make the dish come out right, and refine both your shopping list (queries) and your taste (retriever) across repeated cook-and-shop cycles.
- School Project: Instead of skimming for matching keywords, gather sources that help you finish the whole report accurately; as you revise, your questions and your source-picking both get smarter.
Before vs After:
- Before: Retrievers ranked by similarity; single-shot training on fixed user questions; utility mostly local; agents had to make do with whatever the retriever brought.
- After: Retrievers are trained on multi-turn agent queries; they optimize for both local relevance and global final correctness; agents and retrievers co-train iteratively, producing better queries and better passage choices over time.
Why It Works (intuition, no equations):
- Local relevance alone can be a decoy, like a clue that looks right but misleads later steps. Global correctness checks whether a passage actually helps the full reasoning chain succeed.
- Comparing passages listwise taps into an LLM’s comparative judgment, often more reliable than scoring one at a time.
- Contrastive learning sharpens the retriever’s boundary between “truly helpful” and “distracting.”
- Iterative co-training aligns the retriever to the agent’s evolving query style, reducing the mismatch between training and real use.
Building Blocks (each as a Sandwich explanation):
- 🍞 Hook: Picking the best helper notes for a multi-step math problem. 🥬 What it is (Local Relevance Scoring): An LLM scores how directly a passage answers the current sub-query.
- How it works: Gather 20 candidates; an LLM rates each 0–100, listwise, sometimes with a sub-answer hint.
- Why it matters: Filters out off-topic passages. 🍞 Anchor: For “Who is older?” sub-queries, a passage stating birthdays scores high.
- 🍞 Hook: Checking whether a hint actually leads you to solve the riddle. 🥬 What it is (Global Answer Correctness): A score for whether using a passage leads the agent to the right final answer.
- How it works: Plug the passage into the agent mid-trajectory and finish the run; if the final equals gold, mark success.
- Why it matters: Prevents being led astray by plausible but harmful passages. 🍞 Anchor: Only passages that make the agent finally answer “Mickey Gilley” get top priority.
- 🍞 Hook: Choosing your MVP note. 🥬 What it is (Positive/Negative Selection): Pick one top passage (must both be locally good and produce the correct final answer) and treat the rest as negatives.
- How it works: First sort by global correctness, then by local score; require correctness=1 and local≥60.
- Why it matters: Ensures the positive is truly useful for the whole task. 🍞 Anchor: The note that both states the key fact and yields the right final answer becomes the positive.
- 🍞 Hook: Carry the main question in your mind while asking follow-ups. 🥬 What it is (Retriever Input Design): Feed the retriever the original question plus the current sub-query, not all previous queries.
- How it works: Input = Q [SEP] q_i; exclude earlier turns to avoid noise.
- Why it matters: Agent sub-queries are usually self-contained; extra history can distract. 🍞 Anchor: “Compare ages” + “Jed Hoyer birth year” is clearer than listing every earlier query.
- 🍞 Hook: Practice by telling good apples from bad ones. 🥬 What it is (Contrastive Learning): Teach the retriever to pull the positive close and push away many negatives.
- How it works: Lots of in-batch and cross-device negatives give rich practice.
- Why it matters: The retriever learns a sharp sense of what truly helps. 🍞 Anchor: Over time, truly helpful passages consistently score higher than lookalikes. (A minimal loss sketch follows after this list.)
- 🍞 Hook: A coach and a star player improve each other. 🥬 What it is (Iterative Agent–Retriever Loop): Train the agent with PPO using the current retriever; then retrain the retriever using the agent’s improved queries; repeat.
- How it works: Fix retriever, train agent; fix agent, train retriever with new trajectories.
- Why it matters: Aligns the retriever with how the agent actually searches. 🍞 Anchor: After two rounds, queries get sharper and retrieval gets more on-target.
- 🍞 Hook: Shorter paths beat wandering detours. 🥬 What it is (Fewer Search Turns): With better retrieval, the agent needs fewer steps.
- How it works: More useful passages per turn reduce follow-up searches.
- Why it matters: Faster, cheaper, and often more accurate. 🍞 Anchor: On HotpotQA and TriviaQA, search turns drop by about 10–15% compared to a baseline.
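To ground the Retriever Input Design and Contrastive Learning blocks above, here is a minimal PyTorch sketch. It uses only in-batch negatives and an assumed temperature of 0.05; the full setup described in the paper also mixes in sampled hard negatives and cross-device negatives.

```python
import torch
import torch.nn.functional as F


def build_retriever_input(original_question: str, sub_query: str) -> str:
    # Input design from the paper: original question + current sub-query only.
    return f"{original_question} [SEP] {sub_query}"


def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss. query_emb and positive_emb are [B, d]; for query i,
    row i of positive_emb is its positive and every other row in the batch
    acts as a negative (sampled and cross-device negatives omitted here)."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = q @ p.T / temperature                      # [B, B] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```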
03 Methodology
High-level Recipe: Input → Agent thinks and asks (q_i) → Retrieve 20 candidates → Score each passage two ways → Pick positives/negatives → Train retriever (contrastive) → Train agent (PPO) with fixed retriever → Iterate → Output final answers with fewer, better searches.
Step-by-step (Sandwich style where new ideas appear):
- 🍞 Hook: Start like a detective: think, guess what’s missing, then look it up. 🥬 Step (Generate Trajectories): Let the agent alternate between thinking (<think>), searching (<search>), reading (<information>), and finally answering (<answer>).
- Why this step exists: We need realistic sub-queries and contexts to evaluate passage utility.
- Example: Question: “Who is older, Jed Hoyer or John William Henry II?” The agent first searches for John Henry II’s birthday, then Jed Hoyer’s. 🍞 Anchor: The trajectory T captures every turn’s thoughts, queries, and retrieved snippets.
- 🍞 Hook: Compare all possible hints side-by-side, not one at a time. 🥬 Step (Local Relevance via Listwise LLM Scoring): For each sub-query q_i, retrieve 20 candidate passages using the current retriever; ask a strong LLM to score each passage 0–100 for how directly it answers q_i.
- Why this step exists: Side-by-side (listwise) grading gives more stable, relative judgments.
- Example: For “Jed Hoyer birth year,” a passage that states the year clearly gets 80–100; a career bio without the year might get ~40. 🍞 Anchor: The LLM can optionally use a sub-answer hint to sharpen scoring; if unsure, it proceeds without it.
- 🍞 Hook: Make sure a clue helps you win the whole game, not just look nice. 🥬 Step (Global Answer Correctness): For each candidate passage p_{i,j}, continue the agent’s reasoning from that point using p_{i,j} and see if the final answer exactly matches the gold answer (EM=1 or 0).
- Why this step exists: Some relevant-looking passages derail later steps; this check guards against that.
- Example: If a passage about “honky tonk” doesn’t lead to “Mickey Gilley” at the end, it won’t count as globally helpful. 🍞 Anchor: Only passages that help produce the correct final answer score highest on this dimension. (A minimal sketch of both scoring signals appears after this step list.)
- 🍞 Hook: Crown one MVP hint and treat the rest as sparring partners. 🥬 Step (Select Positives/Negatives): Rank candidates first by global correctness (desc), then by local score (desc). Choose the top one as the positive if it has EM=1 and local≥60; sample the rest as negatives to make N=16 per query.
- Why this step exists: Ensures the positive passage is both on-topic and actually helpful to the end goal.
- Example: From 20 candidates, you keep one top positive and 15 negatives. 🍞 Anchor: If none meet the bar, discard that training instance. (See the selection sketch after this step list.)
- 🍞 Hook: Keep the main goal in mind when asking a follow-up. 🥬 Step (Retriever Input = Q [SEP] q_i): Concatenate the original question and the current sub-query as the retriever input; exclude earlier queries.
- Why this step exists: Agentic sub-queries are usually self-contained; extra history adds noise.
- Example: Input: “Compare ages [SEP] Jed Hoyer birth year.” 🍞 Anchor: An ablation shows adding history hurts performance, especially out-of-domain.
- 🍞 Hook: Practice sorting real gems from shiny fakes. 🥬 Step (Contrastive Learning for Retriever): Train to pull the positive passage embedding close to the query embedding and push away a large set of negatives (sampled + in-batch + cross-device).
- Why this step exists: Many tough negatives build a robust sense of utility.
- Example: With batch size B, GPUs G, and N candidates per query, you get roughly (B×G×N−1) negatives. 🍞 Anchor: Over time, the retriever ranks truly helpful passages higher.
- 🍞 Hook: A coach first uses the current playbook, then updates it after the game. 🥬 Step (Train the Agent with PPO): Fix the current retriever and train the agent via RL to maximize exact-match reward, masking out retrieved tokens from gradients.
- Why this step exists: Teaches the agent when to search, what to ask, and when to stop.
- Example: Learning rates are small; the agent runs for hundreds of PPO steps on training questions. 🍞 Anchor: The agent becomes better at writing crisp sub-queries.
- 🍞 Hook: Now swap: the better player helps the coach refine drills. 🥬 Step (Iterative Optimization): After agent training, use the improved agent to generate new trajectories and better sub-queries, rebuild the positive/negative sets, and retrain the retriever. Repeat for K rounds (K=2 works best).
- Why this step exists: Aligns the retriever to the agent’s evolving, higher-quality queries.
- Example: First iteration initializes retriever from E5; second iteration brings a notable accuracy bump; third brings no further gain. 🍞 Anchor: Final system answers more accurately with fewer searches.
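As referenced in the scoring steps above, here is a minimal sketch of how the two utility signals could be attached to one sub-query’s candidates. The `llm_judge`, `agent_rollout`, and prompt wording are illustrative stand-ins rather than the paper’s exact implementation; `exact_match` is the EM function sketched earlier.

```python
from typing import Callable, Dict, List


def score_candidates(question: str,
                     sub_query: str,
                     trajectory_prefix: str,
                     candidates: List[str],
                     gold_answer: str,
                     llm_judge: Callable[[str], List[int]],
                     agent_rollout: Callable[[str, str], str],
                     exact_match: Callable[[str, str], int]) -> List[Dict]:
    """Attach a local relevance score (0-100, judged listwise) and a global
    correctness flag (final-answer EM) to each candidate passage."""
    # 1) Local relevance: one listwise prompt over all candidates (prompt text is illustrative).
    listed = "\n".join(f"[{i}] {p}" for i, p in enumerate(candidates))
    prompt = (f"Question: {question}\nCurrent sub-query: {sub_query}\n"
              f"Score each passage 0-100 for how directly it answers the sub-query.\n{listed}")
    local_scores = llm_judge(prompt)  # assumed to return one integer per passage

    # 2) Global correctness: resume the agent with each passage and check the final answer.
    scored = []
    for passage, local in zip(candidates, local_scores):
        final_answer = agent_rollout(trajectory_prefix, passage)
        scored.append({
            "passage": passage,
            "local": local,
            "em": exact_match(final_answer, gold_answer),
        })
    return scored
```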
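And here is the selection rule from the positives/negatives step, operating on the scored candidates above. The thresholds (EM=1, local ≥ 60, N=16) come from the text; treating “sample the rest” as uniform random sampling is an assumption.

```python
import random
from typing import Dict, List, Optional, Tuple


def select_training_pair(scored: List[Dict],
                         n_total: int = 16,
                         local_threshold: int = 60) -> Optional[Tuple[Dict, List[Dict]]]:
    """Rank by global correctness, then local score; keep the top candidate as the
    positive only if it has EM=1 and local >= 60, and sample negatives from the rest."""
    ranked = sorted(scored, key=lambda c: (c["em"], c["local"]), reverse=True)
    top = ranked[0]
    if top["em"] != 1 or top["local"] < local_threshold:
        return None  # no qualifying positive: discard this training instance
    rest = ranked[1:]
    negatives = random.sample(rest, min(n_total - 1, len(rest)))
    return top, negatives
```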
Secret Sauce (what’s clever):
- Dual utility: combining local (LLM listwise scores) and global (final EM) signals.
- Listwise LLM grading with optional sub-answers for sharper relevance judgments.
- Positive selection gated by global correctness to avoid “relevant but misleading” passages.
- Tight input design (Q + q_i only) to avoid history noise.
- Big-batch contrastive training with many hard negatives.
- Agent–retriever co-training so queries and retrieval co-evolve.
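To tie these pieces together, here is the iterative loop in schematic form. Only the control flow (alternating which component is frozen, K=2 rounds) is taken from the text; the three training callables are hypothetical stand-ins for the pipeline stages described above.

```python
from typing import Callable, List, Tuple


def co_train(agent,
             retriever,
             train_questions: List[str],
             train_agent_with_ppo: Callable,
             build_training_pairs: Callable,
             train_retriever_contrastive: Callable,
             num_rounds: int = 2) -> Tuple[object, object]:
    """Alternate agent and retriever training for num_rounds rounds (K=2 in the paper)."""
    for _ in range(num_rounds):
        # Phase 1: freeze the retriever, train the agent with PPO on exact-match reward.
        agent = train_agent_with_ppo(agent, retriever, train_questions)

        # Phase 2: freeze the agent, regenerate trajectories with its sharper sub-queries,
        # rebuild dual-utility positives/negatives, and retrain the retriever contrastively.
        pairs = build_training_pairs(agent, retriever, train_questions)
        retriever = train_retriever_contrastive(retriever, pairs)
    return agent, retriever
```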
04 Experiments & Results
The Test (what they measured and why):
- Task: Open-domain QA across single-hop and multi-hop datasets.
- Metric: Exact Match (EM), which cleanly checks if the final answer matches the gold answer.
- Goal: Show that a retriever trained for agentic search improves final answer accuracy and search efficiency.
The Competition (baselines):
- General-purpose retrievers: E5, BGE.
- RAG-specific retrievers (single-turn utility): LLM-Embedder, SCARLet, REPLUG.
- Agents used: (1) The paper’s trained agent (in-domain), (2) R1-Searcher (out-of-domain), and (3) SimpleDeepSearcher (out-of-domain).
Scoreboard with Context:
- Across seven datasets (HotpotQA, 2Wiki, Musique, Bamboogle, NQ, TriviaQA, PopQA):
- In-domain agent: Agentic-R averages 45.00 EM. That’s like moving from a solid B (around 41–42 EM for top baselines like REPLUG/E5) to an A- (45 EM). The gap to the second-best baseline is about +3.2 points.
- R1-Searcher: Agentic-R averages 43.64 EM, beating the best baseline (E5 at ~41.67) by about +2 points.
- SimpleDeepSearcher: Agentic-R averages 39.43 EM vs. top baselines around ~37.7, again about +1.7 points up.
- Bigger gains on multi-hop QA: On multi-hop datasets, improvements over strong baselines are roughly +3 EM points (vs. ~+2 on single-hop). This makes sense: multi-hop problems benefit most from passages that not only match locally but also lead to correct end-to-end reasoning.
Search Efficiency:
- Agentic-R reduces search turns: On HotpotQA and TriviaQA, agents using Agentic-R need about 10–15% fewer search turns than with REPLUG or E5. That’s like finishing your homework with fewer trips to the library.
Ablations (what mattered):
- Iterative training matters: Two rounds of agent–retriever co-training significantly outperform one round. A third round offers no further gains, indicating convergence.
- Dual signals are both needed: Removing Global Answer Correctness or Local Relevance scoring hurts accuracy (≈1–2 EM point drops), with local relevance removal hurting slightly more.
- Input design matters: Dropping the original question from the retriever input hurts. Adding full history (earlier queries) also hurts, especially out-of-domain—confirming that sub-queries are typically self-contained and extra text adds noise.
- Generalizes across backbones: Training with different retriever backbones (E5-base, E5-large, BGE-base) shows consistent gains and a clear scaling trend (larger backbones help).
Surprising Findings:
- RAG-specific single-turn retrievers aren’t always better in agentic search than general-purpose ones. Why? Single-turn utility doesn’t fully capture multi-turn pitfalls, and training on user questions (not agent queries) causes a mismatch. Agentic-R closes this gap by training on agent-generated queries with dual utility signals.
05 Discussion & Limitations
Limitations:
- Domain coverage: Experiments are on popular single-hop and multi-hop QA. More abstract or expert-level reasoning (e.g., GPQA) remains to be tested.
- Scale: The study uses moderately sized LLM agents and retriever backbones due to compute limits. Larger models may further boost results but need validation.
- Cost and complexity: The method requires strong LLMs for listwise scoring and iterative PPO training, which can be resource-intensive.
- Dependence on gold finals: Global correctness needs gold final answers for training, so fully unsupervised domains would need alternative signals.
Required Resources:
- A capable LLM for listwise scoring (e.g., Qwen2.5-72B-Instruct) and one for the agent (e.g., Qwen2.5-7B-Base).
- PPO-based RL setup for agent training.
- A large text corpus (e.g., Wikipedia dump) and GPUs for contrastive training with many negatives.
When NOT to Use:
- Very short, single-fact tasks where single-turn RAG already nails it and extra complexity adds overhead.
- Settings without access to gold final answers (for training) and no feasible proxy for global correctness.
- Extremely tight latency budgets where iterative scoring and training are impractical.
Open Questions:
- Can we replace final-answer supervision with self-consistency or weak signals to train in low-label settings?
- How does the method perform with much larger agents and retrievers—and does the dual-utility idea scale linearly?
- Can we detect and explicitly handle adversarial or misleading passages without relying on gold answers?
- How far can we push turn reduction—can the retriever help the agent decide when to stop searching even more aggressively?
06 Conclusion & Future Work
Three-sentence Summary:
- Agentic-R trains a retriever specifically for multi-turn agentic search by combining local relevance (does a passage answer the sub-question?) with global answer correctness (does it lead to the right final answer?).
- The agent and retriever are improved iteratively: first train the agent using the current retriever, then train the retriever using higher-quality queries from the improved agent.
- Across seven QA datasets and three agents, Agentic-R boosts exact-match accuracy and reduces the number of search turns.
Main Achievement:
- Showing that a dual-utility signal (local + global) and an agent–retriever co-training loop produce a retriever that consistently outperforms strong single-turn and general-purpose baselines in multi-turn search.
Future Directions:
- Extend to expert-level reasoning tasks (e.g., scientific QA), explore larger backbones, and reduce reliance on gold answers via self-supervision.
- Integrate active stopping signals and better distractor detection to further cut search turns.
Why Remember This:
- It flips the retriever’s job from “find similar text” to “find what actually helps me get the final answer right” and shows that teaching the agent and retriever to grow together makes multi-step reasoning both smarter and faster.
Practical Applications
- Build smarter research assistants that gather sources which actually resolve the final question, not just look similar.
- Create coding helpers that retrieve the snippets most likely to fix the bug or complete the feature correctly.
- Design customer-support bots that ask sharper follow-ups and pull truly helpful articles, reducing back-and-forth.
- Develop educational tutors that break problems into steps and fetch the exact facts needed to finish correctly.
- Speed up enterprise search by reducing the number of retrieval rounds per task while improving answer accuracy.
- Deploy compliance tools that resist distractor passages and surface evidence leading to correct regulatory conclusions.
- Improve clinical decision support by emphasizing passages that contribute to correct end-to-end diagnoses or treatments (with human oversight).
- Enhance legal research by prioritizing cases and statutes that demonstrably support the final legal conclusion.
- Power news and fact-checking tools that avoid misleading yet on-topic snippets, improving reliability.
- Optimize internal knowledge-base search with iterative agent–retriever co-training tailored to your company’s queries.