Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Key Summary
- The paper fixes a big problem in training web-searching AI: rewarding only the final answer makes agents cut corners and sometimes hallucinate.
- It introduces CaRR (Citation-aware Rubric Rewards), which breaks a hard question into bite-sized checks and requires each check to be backed by real web citations.
- It adds an 'evidence connectivity' rule so the cited facts must link together all the way to the final answer, not just float around separately.
- It combines CaRR with the usual right/wrong reward using a new method called C-GRPO, so the agent must both get the answer and show its work.
- Across four tough benchmarks, C-GRPO beats standard outcome-only RL, especially when the model gets more reading space (longer context).
- Agents trained with C-GRPO cite more webpages, satisfy more rubrics, and avoid shortcut tricks.
- The approach generalizes to open-ended research tasks and even competes with some proprietary systems.
- Ablations show each piece matters: remove hidden-entity identification or connectivity checks and performance drops.
- The method relies on structured, synthetic multi-hop training data, but it still helps with real research tasks.
- Bottom line: reward the steps with citations, not just the final answer, and agents become more careful, factual, and robust.
Why This Research Matters
When people use AI to research news, health, science, or history, they need more than a guess; they need reasons with sources. This paper rewards agents for naming the right entities, citing the exact pages they read, and connecting every fact to the final answer, building trust. As tasks get longer and trickier, outcome-only training breaks down; this method keeps improving as the reading gets harder. It also helps open-ended research writing, not just short-answer questions, by promoting thorough, sourced explanations. Organizations can use it to reduce hallucinations and avoid brittle shortcut behaviors. Over time, this makes web agents more like careful researchers and less like lucky guessers.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine grading a maze runner only by whether they reached the exit, not how they got there. They might take risky jumps, skip checkpoints, or even pop out near the end by luck, and still get full marks. That works until the maze gets harder.
The Concept: Reinforcement Learning (RL)
- What it is: RL is a way for an AI to learn by trying actions and getting rewards or penalties.
- How it works:
- The agent tries something (like searching the web or clicking a link).
- It sees what happens next.
- It gets a reward signal (good/bad) and adjusts its behavior to get more reward next time.
- Repeat many times until it learns a good strategy.
- Why it matters: Without RL, the agent doesn't improve from experience and stays stuck repeating mistakes. Anchor: A search agent learns that asking a clearer query (like "2015 Thailand bolide mass") gets better pages and more reward, so it keeps doing that.
The Concept: Outcome Rewards
- What it is: An outcome reward is a simple signal, like 1 if the final answer is correct and 0 if it's wrong.
- How it works:
- Compare the agentâs final answer to the ground truth.
- If they match, give 1; if not, give 0.
- Use this to train the agent.
- Why it matters: It's easy and scalable, but it ignores everything about how the answer was found (a minimal sketch of this 0/1 signal follows below). Anchor: If the agent answers "66 tonnes" correctly but skips checking other clues or cites nothing, it still gets 1.
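The sketch below shows the shape of the outcome reward. Note that the paper grades correctness with an LLM judge; the normalized exact-match helper here is only an illustrative stand-in, not the authors' implementation.

```python
def outcome_reward(predicted: str, ground_truth: str) -> int:
    """Minimal 0/1 outcome reward.

    The paper uses an LLM judge to decide correctness; this normalized
    exact-match comparison is just a hypothetical stand-in to show the signal.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return 1 if normalize(predicted) == normalize(ground_truth) else 0

# A correct final answer earns 1 no matter how carelessly it was reached.
assert outcome_reward("66 Tonnes", "66 tonnes") == 1
assert outcome_reward("70 tonnes", "66 tonnes") == 0
```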
The Concept: Deep Search Agents
- What it is: These are LLM-based helpers that think, browse, open pages, and find info over many steps to solve hard questions.
- How it works:
- Think about the next step.
- Use tools (search/open/find) to grab info.
- Read the results and plan the next step.
- Repeat until ready to answer with citations.
- Why it matters: Without tool use, the model must rely on memory and can easily be outdated or wrong (a schematic version of this loop is sketched below). Anchor: To answer about the "2015 Thailand bolide," the agent searches, opens Wikipedia, scans the page, and cites the line with the mass.
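To make the think/act/observe loop concrete, here is a schematic control-flow sketch. It assumes a hypothetical `llm_step` function that returns either a tool call or a final answer, and stub tools named after the search/open/find tools described in the paper; none of these names come from the authors' code.

```python
from typing import Callable, Dict, List

def run_agent(question: str,
              llm_step: Callable[[List[dict]], dict],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 20) -> dict:
    """Schematic deep-search loop: think, call a tool, read the result, repeat.

    Assumption (not from the paper's code): `llm_step` returns either
    {"type": "tool_call", "tool": ..., "argument": ...} or
    {"type": "final_answer", "answer": ..., "citations": [...]}.
    """
    history: List[dict] = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_step(history)                 # think about the next step
        history.append({"role": "assistant", "content": str(action)})
        if action["type"] == "final_answer":       # ready to answer with citations
            return action
        observation = tools[action["tool"]](action["argument"])  # search/open/find
        history.append({"role": "tool", "content": observation})  # read the result
    return {"type": "final_answer", "answer": None, "citations": []}
```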
The Concept: Multi-hop Question Answering
- What it is: Answering questions that require connecting several facts across different pages.
- How it works:
- Break the problem into hops (A leads to B, B leads to C...).
- Find each piece on the web.
- Stitch them into a chain that lands on the answer.
- Why it matters: Without multi-hop thinking, agents miss crucial links and guess. Anchor: From a Scottish-engineered settlement → prehistoric impact evidence → bright sky events (bolides) → bolide catalog → the 2015 Thailand bolide → its mass.
The Concept: Shortcut Exploitation
- What it is: When an agent skips work and still lands a correct answer by luck or a partial clue.
- How it works:
- Notice a hint (like "Thailand 2015").
- Jump to a guess ("66 tonnes") without verifying earlier clues.
- Get rewarded anyway if the final number matches.
- Why it matters: If shortcuts get full credit, the agent learns to be lazy and brittle. Anchor: The agent answers "66 tonnes" correctly but doesn't check whether the question's other parts are true or connected, yet still gets 1.
The Concept: Hallucinations
- What it is: Confident but false statements made by the model.
- How it works:
- The model fills in missing info with guesses.
- It states them confidently, especially if not forced to cite.
- No one corrects it if rewards look only at the final answer.
- Why it matters: Hallucinations reduce trust and can mislead users. Anchor: Saying the event was in Malaysia (wrong) but still guessing the final mass right would be a hallucination that outcome-only rewards ignore.
The World Before: Many deep-search systems trained with outcome-only RL showed nice gains on easier settings but had hidden weaknesses on harder ones. Because the reward didn't care about how the answer was reached, agents could pass by using shortcuts or lucky guesses. They didn't have to show their work with reliable citations, and their reasoning chains could be incomplete or disconnected.
The Problem: How do we teach an agent to be thorough and factual, not just right at the end? We need a reward that looks inside the reasoning process and checks whether each step is real, cited, and actually leads to the answer.
Failed Attempts: People tried adding fine-grained signals, like counting how many known intermediate entities the agent mentioned. But that needs gold labels for every hidden step and can still reward non-cited guesses. Also, giving extra points to wrong rollouts sometimes nudged the model in the wrong direction.
The Gap: A training signal was missing that would: 1) reward covering every key hop, 2) require proper citations from the pages the agent actually visited, and 3) ensure the evidence pieces connect all the way to the final answer.
Real Stakes: In schoolwork, journalism, science, and policy, we don't just want answers; we want trustworthy answers with sources. If an AI can't show a connected, cited evidence chain, users can get misled. As questions get longer and trickier, weak training encourages corner-cutting and increases risk. This paper tackles that by rewarding careful, connected, citation-backed reasoning.
02 Core Idea
Hook: You know how teachers use a rubric to grade essays (points for ideas, organization, evidence, and clarity) so you can't just write the final sentence and get an A?
The Concept: Rubric (as used here)
- What it is: A rubric is a list of small, checkable statements the answer should satisfy.
- How it works:
- Break a hard question into simple, single-hop checks.
- For each check, name the entities involved.
- Later, see if the agent mentions those entities and backs them with cites.
- Why it matters: Without rubrics, we can't reward thoroughness step by step. Anchor: A rubric might say "<E6> is the '2015 Thailand bolide'" and "<E0> is its initial mass," so the agent must name these and cite the source.
The Concept: Citation-aware Rubric Rewards (CaRR)
- What it is: CaRR gives the agent extra reward for satisfying each rubric with explicit entity names, correct citations, and a full chain to the final answer.
- How it works:
- Use an LLM to decompose the question into single-hop rubrics with placeholder entities.
- Check the agentâs final response for the real names of those entities.
- Verify each rubric is supported by the exact webpages the agent cited.
- Make sure the supported rubrics connect together to the predicted answer.
- Score = fraction of rubrics that pass all checks.
- Why it matters: Without CaRR, agents can ignore hops, invent facts, or cite unrelated pages and still get full credit if the final answer matches. Anchor: If the agent says "66 tonnes" but doesn't prove the event was the 2015 Thailand bolide with a correct citation, it loses rubric points.
The Concept: Hidden Entities
- What it is: These are the unnamed items inside the question that the agent must discover (people, places, events).
- How it works:
- Mark them as <E1>, <E2>, ... during decomposition.
- The agent must state their real names in its final explanation.
- Missing or guessed names don't count.
- Why it matters: Without explicit identities, you can't verify facts or build a clean evidence chain. Anchor: Turning <E6> into "2015 Thailand bolide" and <E8> into "meteoroid" is required and must be cited.
The Concept: Evidence Connectivity
- What it is: The supported rubrics must link together all the way to the final answer entity, like a connected path.
- How it works:
- Treat entities and rubrics as nodes in a graph.
- Connect a rubric to the entities it mentions.
- Walk the graph from the final answer; only rubrics connected to it count.
- Why it matters: Without connectivity, the agent could cite true but irrelevant facts and still get points. Anchor: Citing that "bolides are bright" isn't enough unless that fact is tied, through other supported rubrics, to the exact Thailand event and its mass.
The Concept: Group Relative Policy Optimization (GRPO)
- What it is: A way to train agents by comparing a small group of rollouts and pushing the model toward the better ones.
- How it works:
- Generate several answer attempts for the same question.
- Score each one.
- Nudge the model to prefer higher-scoring attempts.
- Why it matters: Group comparisons give a stable training signal in multi-turn settings (a minimal sketch of the group-relative advantage follows below). Anchor: If two attempts earn different rewards, say because one has better citations and coverage under a process-aware reward, the group comparison pushes the model toward the stronger one.
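A minimal sketch of the group-relative signal, assuming the common mean/standard-deviation normalization over the rollouts sampled for one question (the paper's exact variant may differ).

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Compare each rollout's reward against its group (sketch).

    Rollouts scoring above the group mean get positive advantage and are
    reinforced; below-mean rollouts get negative advantage. Mean/std
    normalization is the standard GRPO formulation, used here as an assumption.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one question: two correct (reward 1.0), two wrong (0.0).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # correct ones come out positive
```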
The Concept: Citation-aware GRPO (C-GRPO)
- What it is: C-GRPO mixes the standard outcome reward with CaRR, but only adds the rubric bonus when the final answer is correct.
- How it works:
- Score each rollout: outcome (0/1) and CaRR (0–1).
- Normalize CaRR within the group so the best chain gets 1.0.
- Final reward = (1 - α)*outcome + α*outcome*normalized_CaRR.
- Optimize the policy to prefer higher-reward rollouts.
- Why it matters: Without this mix, models either ignore process quality (outcome-only) or get distracted from being correct (process-only). C-GRPO strikes the balance (a minimal sketch of the mixing rule follows below). Anchor: Two rollouts both answer "66 tonnes." The one that fully names entities, cites correctly, and connects evidence gets more reward and becomes the model's favorite.
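Here is a sketch of the mixing rule, under the assumption that "normalize within the group so the best chain gets 1.0" means dividing by the highest CaRR score among correct rollouts; α = 0.5 is an illustrative value, not the paper's setting.

```python
def c_grpo_rewards(outcomes, carr_scores, alpha=0.5):
    """Mixed reward for one group of rollouts (sketch).

    Assumptions: the rubric bonus applies only to correct rollouts, and group
    normalization divides each CaRR score by the best score among correct rollouts.
    """
    best = max((c for o, c in zip(outcomes, carr_scores) if o == 1), default=0.0)
    rewards = []
    for o, c in zip(outcomes, carr_scores):
        norm_c = c / best if (o == 1 and best > 0) else 0.0
        rewards.append((1 - alpha) * o + alpha * o * norm_c)
    return rewards

# Two correct rollouts with different evidence quality, plus one wrong rollout.
print(c_grpo_rewards([1, 1, 0], [0.75, 0.375, 0.5]))  # [1.0, 0.75, 0.0]
```

These mixed rewards would then feed the group-relative comparison sketched above, so the best-evidenced correct rollout is reinforced hardest.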
The "Aha!" Moment: Don't just reward the destination; reward the verified, connected path with citations.
Three Analogies:
- School Essay: You don't get an A for just the last sentence. You need sources, organization, and all points covered.
- Math Proof: The final number isn't enough; you must show each step and justify it.
- Treasure Hunt: Clues must form a trail to the treasure. Random facts about gold don't count if they don't lead to the chest you found.
Before vs After:
- Before: Agents chased correct answers and often cut corners; longer or trickier problems exposed their brittleness.
- After: Agents gather, cite, and connect evidence, staying robust as context gets longer and tasks get harder.
Why It Works (intuition): The reward now reflects what we truly value (complete, factual, connected reasoning), so the agent learns to invest effort where it counts. The connectivity rule prevents "fact farming" that isn't aimed at the answer. Restricting rubric rewards to correct rollouts keeps the model focused on solving the task while improving how it solves it.
Building Blocks:
- Question decomposition into single-hop rubrics.
- Hidden-entity naming in the final response.
- Citation checking against visited pages only.
- Evidence connectivity graph to the predicted answer.
- Mixed reward (outcome + normalized rubric) in group-based RL.
03 Methodology
High-level Recipe: Input question → Rubric initialization → Step 1: Hidden entity identification → Step 2: Citation-based rubric judgment → Step 3: Evidence connectivity check → Rubric reward → Mix with outcome reward (C-GRPO) → Update the agent.
The Concept: LLM Judge
- What it is: A helper model that checks whether entities are named and whether citations actually support the statements.
- How it works:
- Read the agentâs final explanation and citations.
- Decide which hidden entities were explicitly identified.
- Verify each rubric using only the content of cited pages the agent actually opened/found.
- Why it matters: Without a judge, we can't automatically grade thoroughness and factual grounding at scale (a hypothetical judge call is sketched below). Anchor: The judge sees the cited Wikipedia line "mass of 66 tonnes" on the Thailand bolide page and marks that rubric supported.
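One plausible shape for a single judge call: the prompt wording, the `call_llm` helper, and the verdict format are all assumptions for illustration, not the paper's actual judge.

```python
JUDGE_PROMPT = """You are verifying one factual statement (a rubric).
Rubric: {rubric}
Excerpts from pages the agent actually opened and cited:
{evidence}
Reply with exactly one word: SUPPORTED or UNSUPPORTED."""

def judge_rubric(rubric_text: str, cited_texts: list, call_llm) -> bool:
    """Return True if the judge says the cited texts support the rubric.

    `call_llm` is a stand-in for whatever judge model/API is used; the prompt
    above is hypothetical wording, not taken from the paper.
    """
    prompt = JUDGE_PROMPT.format(rubric=rubric_text,
                                 evidence="\n---\n".join(cited_texts))
    return call_llm(prompt).strip().upper().startswith("SUPPORTED")
```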
Overview with an example:
- Input: A multi-hop question about a prehistoric collision near a Scottish-planned settlement, linking to bolides, a bolide catalog, the 2015 Thailand event, and asking the object's initial mass.
- Output: A final answer (e.g., "66 tonnes") plus an explanation listing each hop with citations. The training system scores how complete and connected the evidence is.
Step-by-Step
- Rubric Initialization (decomposition)
- What happens: An LLM turns the question into single-hop factual statements (rubrics), each with placeholders <E0> (final answer) and <E1>, <E2>, ... (hidden entities).
- Why it exists: This creates clear checkpoints for grading thoroughness.
- Example: "<E6> is a mid-2010s event over a Southeast Asian nation <E7>." "<E8> is the object that caused <E6>." "<E0> is the initial mass of <E8>." (A plausible code representation of such rubrics is sketched below.)
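One plausible way to hold the decomposed rubrics in code; the field names and schema are illustrative assumptions, not the paper's format.

```python
# <E0> is the final-answer placeholder; <E1>, <E2>, ... are hidden entities
# the agent must explicitly name. The schema below is hypothetical.
rubrics = [
    {"id": "R1",
     "text": "<E6> is a mid-2010s event over a Southeast Asian nation <E7>.",
     "entities": ["E6", "E7"]},
    {"id": "R2",
     "text": "<E8> is the object that caused <E6>.",
     "entities": ["E8", "E6"]},
    {"id": "R3",
     "text": "<E0> is the initial mass of <E8>.",
     "entities": ["E0", "E8"]},
]
```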
- Hidden Entity Identification
- What happens: The judge scans the agent's final explanation to see if it explicitly names each <E1>, <E2>, ... (e.g., "2015 Thailand bolide", "Thailand", "meteoroid").
- Why it exists: We can't verify facts or connect evidence if the entities remain unnamed.
- Example: If the explanation never says "Bolide Catalogue" by name, rubrics involving it aren't considered fully identified and won't proceed to evidence checking (a minimal gating sketch follows below).
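A minimal gating sketch, assuming the judge returns a mapping from each placeholder to the name it found in the final explanation (or None); treating <E0> as the predicted answer itself is also an assumption.

```python
from typing import Dict, Optional

def fully_identified(rubric: dict, identified: Dict[str, Optional[str]]) -> bool:
    """A rubric proceeds to citation checking only if every hidden entity it
    mentions was explicitly named in the agent's final explanation.

    Assumption: <E0> stands for the predicted final answer, so it is not
    required to appear in the identification map.
    """
    return all(e == "E0" or identified.get(e) is not None
               for e in rubric["entities"])

identified = {"E6": "2015 Thailand bolide", "E7": "Thailand", "E8": None}
rubric = {"id": "R2", "text": "<E8> is the object that caused <E6>.",
          "entities": ["E8", "E6"]}
print(fully_identified(rubric, identified))  # False: <E8> was never named
```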
- Citation-based Rubric Judgment
- What happens: a) Extract the URLs cited in the final explanation (cap at 20 to prevent spam-citing). b) Gather only the content the agent actually retrieved (search snippets, opened pages, find matches). c) For each fully identified rubric, the judge checks if the cited texts support it.
- Why it exists: This blocks hallucinated facts and requires that claims be grounded in what the agent truly read.
- Example: The judge confirms that the Thailand bolide page explicitly states the mass and that this URL appears in the agent's citations (a sketch of this check follows below).
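A sketch of how this check could be wired together: cap the cited URLs at 20, keep only content the agent actually retrieved during the rollout, then ask a judge (passed in as `supports`) about each fully identified rubric. All names here are illustrative.

```python
def citation_supported_rubrics(cited_urls, visited_pages, rubrics, supports):
    """Return ids of rubrics grounded in the agent's own citations (sketch).

    cited_urls: URLs cited in the final explanation (capped at 20 below).
    visited_pages: {url: text} for content the agent actually retrieved.
    supports(rubric_text, cited_texts): stand-in for the LLM judge.
    """
    cited = cited_urls[:20]  # safeguard against spam-citing
    cited_texts = [visited_pages[u] for u in cited if u in visited_pages]
    if not cited_texts:
        return set()
    return {r["id"] for r in rubrics if supports(r["text"], cited_texts)}
```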
- Evidence Connectivity Check
- What happens: a) Build a graph with entity nodes and supported rubric nodes. b) Connect an entity to a rubric if the rubric mentions that entity. c) Starting from <E0> (the predicted final answer entity/number), walk the graph and keep only rubrics connected along the way.
- Why it exists: True-but-irrelevant facts shouldnât earn points; only facts that chain to the answer count.
- Example: A supported rubric "bolides are bright" helps only if linked via the event/entity chain to the mass answer (a graph-walk sketch follows below).
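The connectivity check reads as a plain graph walk. The sketch below assumes the rubric schema used earlier (ids plus entity lists) and treats <E0> as the final-answer node; it illustrates the described rule, not the authors' implementation.

```python
from collections import deque

def connected_rubrics(supported_rubrics, answer_entity="E0"):
    """Keep only supported rubrics that chain to the final answer (sketch).

    Entities and supported rubrics are nodes; a rubric is linked to every
    entity it mentions. Starting from the answer entity, walk the graph and
    collect reachable rubrics; isolated true facts never get counted.
    """
    entity_to_rubrics, rubric_to_entities = {}, {}
    for r in supported_rubrics:
        rubric_to_entities[r["id"]] = r["entities"]
        for e in r["entities"]:
            entity_to_rubrics.setdefault(e, []).append(r["id"])

    reachable, seen_entities = set(), {answer_entity}
    queue = deque([answer_entity])
    while queue:
        entity = queue.popleft()
        for rid in entity_to_rubrics.get(entity, []):
            if rid in reachable:
                continue
            reachable.add(rid)
            for e in rubric_to_entities[rid]:
                if e not in seen_entities:
                    seen_entities.add(e)
                    queue.append(e)
    return reachable
```

The rubric reward in the next step is then simply the number of rubrics returned here divided by the total number of rubrics.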
- Rubric Reward Calculation
- What happens: Rubric reward = number of connected, supported rubrics divided by total rubrics.
- Why it exists: This single score summarizes coverage, grounding, and connectivity.
- Example: If 3 of 8 rubrics are identified, supported, and connected to the final answer, the rubric reward is 3/8 = 0.375.
- Outcome Reward
- What happens: If the final answer matches the ground truth string/number, outcome = 1; else 0.
- Why it exists: We must still prioritize correctness of the end result.
- Example: Answering â66 tonnesâ correctly gives an outcome reward of 1.
- Mixing Rewards with C-GRPO
- What happens: a) For each question, sample a group of rollouts. b) Compute outcome and rubric rewards for each. c) Normalize rubric rewards within the group: the best chain gets 1.0. d) Final reward R = (1 - α)*outcome + α*outcome*normalized_rubric. e) Update the policy to favor higher R, at the token level, only where the model is generating (not the pasted web text); a token-masking sketch follows after this step.
- Why it exists: The mix balances "be right" with "show cited, connected work," and group normalization makes training stable.
- Example: Two correct rollouts: one with stronger citations/coverage gets more reward and is reinforced.
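The "only where the model is generating" detail can be pictured as a loss mask over the conversation. The sketch below is a common pattern in multi-turn agent RL and uses assumed role labels; it is not the paper's code.

```python
def generation_mask(turns):
    """1 for tokens the policy generated, 0 for pasted tool/web content (sketch).

    Only masked-in (value 1) tokens receive the policy-gradient update, so the
    agent is not trained on webpage text it merely read back as a tool result.
    """
    mask = []
    for turn in turns:
        keep = 1 if turn["role"] == "assistant" else 0
        mask.extend([keep] * len(turn["tokens"]))
    return mask

turns = [
    {"role": "assistant", "tokens": ["search(", "'2015 Thailand bolide'", ")"]},
    {"role": "tool",      "tokens": ["...page text:", "mass of 66 tonnes", "..."]},
    {"role": "assistant", "tokens": ["The", "mass", "is", "66", "tonnes."]},
]
print(generation_mask(turns))  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
```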
- Practical Training Choices
- Tools: search (find pages), open (view page head), find (locate keywords in page).
- Judges: A capable LLM evaluates both correctness and rubric satisfaction.
- Safeguards: Cap citations at 20; give 0 reward to malformed or overlong rollouts.
The Secret Sauce:
- Triple filter: identify entities → verify with citations → ensure connectivity. This trio prevents the classic hacks: guessing names, cherry-picking random facts, or citing unrelated pages. Training then lifts policies that are simultaneously correct and well-supported.
What breaks without each step:
- Without entity identification: The agent can speak vaguely, and we can't align facts to the right names.
- Without citation judgment: Hallucinations sneak in, and evidence quality drops.
- Without connectivity: The agent can "farm" isolated true facts and still look good.
- Without mixing with outcome: The agent might be very thorough but miss the actual answer.
04 Experiments & Results
Hook: When you test runners on a short track, even sloppy form can win. On a marathon, only solid technique holds up. The same is true for deep search agents.
The Concept: Baselines
- What it is: Other training methods we compare against, like GRPO (outcome-only) and E-GRPO (entity-match signals for wrong rollouts).
- How it works:
- Train agents with the baseline reward.
- Evaluate on shared benchmarks.
- Compare scores fairly.
- Why it matters: Without baselines, we can't tell if the new method truly helps. Anchor: GRPO boosts 64k-context accuracy but often plateaus or slips at 128k, while C-GRPO keeps climbing.
The Concept: Context Budget
- What it is: The maximum amount the model can read/remember in one go (like 64k or 128k tokens).
- How it works:
- Give the agent a limit on how long its conversation and web snippets can be.
- Harder tasks need more context.
- Check if performance scales when the budget grows.
- Why it matters: Real research needs long contexts. Fragile strategies fail when the reading gets heavy. Anchor: C-GRPO gains more than GRPO when moving from 64k to 128k on multiple benchmarks.
What they measured and why:
- Accuracy on four deep-search benchmarks (BrowseComp, BrowseComp-ZH, xbench-DeepSearch, GAIA) using an LLM judge for correctness: this shows task success.
- Effects of longer context budgets and more tool-call steps: this shows test-time scaling and whether the agent can handle longer, harder problems.
- Rubric satisfaction and number of cited webpages: this shows thoroughness and factual grounding.
The Competition:
- GRPO (outcome-only RL) and E-GRPO (adds entity-match rate for incorrect rollouts) are the main baselines.
- They also list scores from several proprietary or differently trained systems as references.
The Scoreboard (with context):
- Across 4 benchmarks and two model sizes (4B and 30B), C-GRPO consistently beats GRPO and E-GRPO.
- Example summary: With 64k context, C-GRPO improves over GRPO by about 5.1 points (4B) and 2.6 points (30B). With 128k, the margin grows to about 8.0 (4B) and 6.0 (30B). That's like moving from a low B to a solid A while others hover at B-.
- Test-time scaling: GRPO often helps at 64k but underperforms at 128k, suggesting shortcut habits don't scale. C-GRPO keeps improving as context grows, signaling robust evidence habits.
- Training dynamics: GRPO's average tool calls drop and stay low, hinting at shortcuts. C-GRPO's tool calls first drop (more efficient) then rise (gathering enough evidence), and its reward curves outpace GRPO over time.
Comprehensiveness and factuality:
- On a subset of problems solved within 64k, the 30B C-GRPO agent cites more pages and satisfies more rubrics (identified, supported, and connected) than both SFT and GRPO. GRPO even cites fewer pages than SFT, evidence of shortcutting.
Generalization to open-ended research:
- On DeepResearch Bench (PhD-level report tasks, graded by a rubric), C-GRPO improves comprehensiveness, insight, instruction following, and readability compared to baselines. The 30B C-GRPO model competes with some proprietary agents, showing that skills learned on synthetic QA transfer to real research writing.
Surprising Findings:
- Pure outcome rewards can degrade test-time scaling at longer contexts, likely because the model settles on short, brittle strategies that don't generalize.
- Adding rubric rewards only to correct rollouts matters: giving them to all rollouts can accidentally push the model toward carefully argued but wrong answers, hurting performance.
- Each CaRR component (entity identification and connectivity) has a measurable impact; removing either leads to sizable drops.
05 Discussion & Limitations
Hook: If you only pay a painter when the wall looks white, they might paint fast and miss the corners. Start paying for edges and even coats, and the quality jumps, at a cost.
Limitations:
- Structured questions required: CaRR depends on questions that can be decomposed into clear single-hop rubrics. Open-ended prompts without explicit requirements are harder to break down reliably.
- Judge quality matters: If the LLM judge misreads a citation or misses an entity, scoring can be noisy, though spot checks showed high accuracy in this work.
- Extra compute: Decomposition, judging, and graph checks add overhead compared to simple outcome rewards.
- Web variability: Pages change, snippets differ, and some sites block access; robust caching and retrieval hygiene are needed.
Required Resources:
- A capable base model (4B–30B in this paper) with tool-use abilities.
- Search/open/find infrastructure with access to the public web.
- An LLM judge for entity identification and citation checks.
- Enough context length (64k–128k+) and training budget for multi-turn RL.
When NOT to Use:
- Purely creative tasks (poems, brainstorming) where there's no single factual chain.
- Tiny compute budgets where the overhead of rubrics and judging outweighs benefits.
- Domains with scarce or paywalled sources where citation verification is impractical.
Open Questions:
- How to extend rubric generation beyond synthetic multi-hop QA to fully open-ended research queries?
- Can we auto-learn evolving rubrics during training that better match each modelâs current mistakes?
- What's the best balance (α) across domains and model sizes, and can it be adapted per-question?
- How to make connectivity checks more semantic (not just entity-string matching) while staying robust and cheap?
- Can we reduce dependence on a single judge by cross-checking multiple verifiers?
06 Conclusion & Future Work
Three-Sentence Summary: The paper introduces CaRR, a reward that grades each reasoning hop with explicit entity naming, correct citations, and a connectivity check to the final answer. It combines this with the standard right/wrong signal using C-GRPO, rewarding not only the destination but the cited, connected path. The result is more robust deep search agents that scale better with longer contexts and generalize to open-ended research.
Main Achievement: Turning "show your work with sources" into a concrete, scalable reward signal, so agents learn to be both correct and convincingly evidence-backed.
Future Directions: Automate rubric creation for open-ended questions, make connectivity more semantic, adapt the reward mix to each task, and explore multi-judge ensembles to further stabilize training. Integrating richer web interactions (e.g., PDFs, tables, code) with citation checks could broaden applicability.
Why Remember This: Rewarding only final answers teaches bad habits; rewarding cited, connected reasoning teaches good ones. CaRR + C-GRPO makes that principle practical, pushing web agents toward trustworthy research behavior rather than lucky guessing.
Practical Applications
- Train enterprise research assistants that must provide linked, citation-backed evidence for every claim.
- Build newsroom fact-checking tools that reward connected chains of sources rather than single quotes.
- Upgrade academic literature-review agents to name authors, venues, and findings with precise citations.
- Create legal or policy assistants that assemble verifiable chains from statutes, cases, and official documents.
- Improve customer-support bots by requiring cited steps from manuals and knowledge bases before answering.
- Develop internal compliance auditors that verify each rule is supported by cited company policies.
- Enable scientific assistants that trace results back to datasets, methods, and papers with proper links.
- Train education tutors to show work with sources, discouraging unsupported claims.
- Power procurement or market-analysis agents that justify conclusions with connected, sourced evidence.
- Refine medical information tools to cite guidelines and studies explicitly, reducing risky hallucinations.