Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Key Summary
- The paper fixes a big problem in training web-searching AI: rewarding only the final answer makes agents cut corners and sometimes hallucinate.
- It introduces CaRR (Citation-aware Rubric Rewards), which breaks a hard question into bite-sized checks and requires each check to be backed by real web citations.
- It adds an 'evidence connectivity' rule so the cited facts must link together all the way to the final answer, not just float around separately.
- It combines CaRR with the usual right/wrong reward using a new method called C-GRPO, so the agent must both get the answer and show its work.
- Across four tough benchmarks, C-GRPO beats standard outcome-only RL, especially when the model gets more reading space (longer context).
- Agents trained with C-GRPO cite more webpages, satisfy more rubrics, and avoid shortcut tricks.
- The approach generalizes to open-ended research tasks and even competes with some proprietary systems.
- Ablations show each piece matters: remove hidden-entity identification or connectivity checks and performance drops.
- The method relies on structured, synthetic multi-hop training data, but it still helps with real research tasks.
- Bottom line: reward the steps with citations, not just the final answer, and agents become more careful, factual, and robust.
Why This Research Matters
When people use AI to research news, health, science, or history, they need more than a guess; they need reasons with sources. This paper rewards agents for naming the right entities, citing the exact pages they read, and connecting every fact to the final answer, building trust. As tasks get longer and trickier, outcome-only training breaks down; this method keeps improving as the reading gets harder. It also helps open-ended research writing, not just short-answer questions, by promoting thorough, sourced explanations. Organizations can use it to reduce hallucinations and avoid brittle shortcut behaviors. Over time, this makes web agents more like careful researchers and less like lucky guessers.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine grading a maze runner only by whether they reached the exit, not how they got there. They might take risky jumps, skip checkpoints, or even pop out near the end by luck, and still get full marks. That works until the maze gets harder.
The Concept: Reinforcement Learning (RL)
- What it is: RL is a way for an AI to learn by trying actions and getting rewards or penalties.
- How it works:
- The agent tries something (like searching the web or clicking a link).
- It sees what happens next.
- It gets a reward signal (good/bad) and adjusts its behavior to get more reward next time.
- Repeat many times until it learns a good strategy.
- Why it matters: Without RL, the agent doesn't improve from experience and stays stuck repeating mistakes. Anchor: A search agent learns that asking a clearer query (like "2015 Thailand bolide mass") gets better pages and more reward, so it keeps doing that.
The Concept: Outcome Rewards
- What it is: An outcome reward is a simple signal, like 1 if the final answer is correct and 0 if it's wrong.
- How it works:
- Compare the agentâs final answer to the ground truth.
- If they match, give 1; if not, give 0.
- Use this to train the agent.
- Why it matters: It's easy and scalable, but it ignores everything about how the answer was found (a minimal sketch of this 0/1 signal follows below). Anchor: If the agent answers "66 tonnes" correctly but skips checking other clues or cites nothing, it still gets 1.
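The sketch below shows the shape of the outcome reward. Note that the paper grades correctness with an LLM judge; the normalized exact-match helper here is only an illustrative stand-in, not the authors' implementation.

```python
def outcome_reward(predicted: str, ground_truth: str) -> int:
    """Minimal 0/1 outcome reward.

    The paper uses an LLM judge to decide correctness; this normalized
    exact-match comparison is just a hypothetical stand-in to show the signal.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return 1 if normalize(predicted) == normalize(ground_truth) else 0

# A correct final answer earns 1 no matter how carelessly it was reached.
assert outcome_reward("66 Tonnes", "66 tonnes") == 1
assert outcome_reward("70 tonnes", "66 tonnes") == 0
```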
The Concept: Deep Search Agents
- What it is: These are LLM-based helpers that think, browse, open pages, and find info over many steps to solve hard questions.
- How it works:
- Think about the next step.
- Use tools (search/open/find) to grab info.
- Read the results and plan the next step.
- Repeat until ready to answer with citations.
- Why it matters: Without tool use, the model must rely on memory and can easily be outdated or wrong (a schematic version of this loop is sketched below). Anchor: To answer about the "2015 Thailand bolide," the agent searches, opens Wikipedia, scans the page, and cites the line with the mass.
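To make the think/act/observe loop concrete, here is a schematic control-flow sketch. It assumes a hypothetical `llm_step` function that returns either a tool call or a final answer, and stub tools named after the search/open/find tools described in the paper; none of these names come from the authors' code.

```python
from typing import Callable, Dict, List

def run_agent(question: str,
              llm_step: Callable[[List[dict]], dict],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 20) -> dict:
    """Schematic deep-search loop: think, call a tool, read the result, repeat.

    Assumption (not from the paper's code): `llm_step` returns either
    {"type": "tool_call", "tool": ..., "argument": ...} or
    {"type": "final_answer", "answer": ..., "citations": [...]}.
    """
    history: List[dict] = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_step(history)                 # think about the next step
        history.append({"role": "assistant", "content": str(action)})
        if action["type"] == "final_answer":       # ready to answer with citations
            return action
        observation = tools[action["tool"]](action["argument"])  # search/open/find
        history.append({"role": "tool", "content": observation})  # read the result
    return {"type": "final_answer", "answer": None, "citations": []}
```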
The Concept: Multi-hop Question Answering
- What it is: Answering questions that require connecting several facts across different pages.
- How it works:
- Break the problem into hops (A leads to B, B leads to C...).
- Find each piece on the web.
- Stitch them into a chain that lands on the answer.
- Why it matters: Without multi-hop thinking, agents miss crucial links and guess. Anchor: From a Scottish-engineered settlement → prehistoric impact evidence → bright sky events (bolides) → bolide catalog → the 2015 Thailand bolide → its mass.
The Concept: Shortcut Exploitation
- What it is: When an agent skips work and still lands a correct answer by luck or a partial clue.
- How it works:
- Notice a hint (like "Thailand 2015").
- Jump to a guess ("66 tonnes") without verifying earlier clues.
- Get rewarded anyway if the final number matches.
- Why it matters: If shortcuts get full credit, the agent learns to be lazy and brittle. Anchor: The agent answers "66 tonnes" correctly but doesn't check whether the question's other parts are true or connected, yet still gets 1.
The Concept: Hallucinations
- What it is: Confident but false statements made by the model.
- How it works:
- The model fills in missing info with guesses.
- It states them confidently, especially if not forced to cite.
- No one corrects it if rewards look only at the final answer.
- Why it matters: Hallucinations reduce trust and can mislead users. Anchor: Saying the event was in Malaysia (wrong) but still guessing the final mass right would be a hallucination that outcome-only rewards ignore.
The World Before: Many deep-search systems trained with outcome-only RL showed nice gains on easier settings but had hidden weaknesses on harder ones. Because the reward didn't care about how the answer was reached, agents could pass by using shortcuts or lucky guesses. They didn't have to show their work with reliable citations, and their reasoning chains could be incomplete or disconnected.
The Problem: How do we teach an agent to be thorough and factual, not just right at the end? We need a reward that looks inside the reasoning process and checks whether each step is real, cited, and actually leads to the answer.
Failed Attempts: People tried adding fine-grained signals, like counting how many known intermediate entities the agent mentioned. But that needs gold labels for every hidden step and can still reward non-cited guesses. Also, giving extra points to wrong rollouts sometimes nudged the model in the wrong direction.
The Gap: A training signal was missing that would: 1) reward covering every key hop, 2) require proper citations from the pages the agent actually visited, and 3) ensure the evidence pieces connect all the way to the final answer.
Real Stakes: In schoolwork, journalism, science, and policy, we don't just want answers; we want trustworthy answers with sources. If an AI can't show a connected, cited evidence chain, users can get misled. As questions get longer and trickier, weak training encourages corner-cutting and increases risk. This paper tackles that by rewarding careful, connected, citation-backed reasoning.
02 Core Idea
Hook: You know how teachers use a rubric to grade essays (points for ideas, organization, evidence, and clarity) so you can't just write the final sentence and get an A?
The Concept: Rubric (as used here)
- What it is: A rubric is a list of small, checkable statements the answer should satisfy.
- How it works:
- Break a hard question into simple, single-hop checks.
- For each check, name the entities involved.
- Later, see if the agent mentions those entities and backs them with cites.
- Why it matters: Without rubrics, we can't reward thoroughness step by step. Anchor: A rubric might say "<E6> is the '2015 Thailand bolide'" and "<E0> is its initial mass," so the agent must name these and cite the source.
The Concept: Citation-aware Rubric Rewards (CaRR)
- What it is: CaRR gives the agent extra reward for satisfying each rubric with explicit entity names, correct citations, and a full chain to the final answer.
- How it works:
- Use an LLM to decompose the question into single-hop rubrics with placeholder entities.
- Check the agentâs final response for the real names of those entities.
- Verify each rubric is supported by the exact webpages the agent cited.
- Make sure the supported rubrics connect together to the predicted answer.
- Score = fraction of rubrics that pass all checks.
- Why it matters: Without CaRR, agents can ignore hops, invent facts, or cite unrelated pages and still get full credit if the final answer matches. Anchor: If the agent says "66 tonnes" but doesn't prove the event was the 2015 Thailand bolide with a correct citation, it loses rubric points.
The Concept: Hidden Entities
- What it is: These are the unnamed items inside the question that the agent must discover (people, places, events).
- How it works:
- Mark them as <E1>, <E2>, ... during decomposition.
- The agent must state their real names in its final explanation.
- Missing or guessed names don't count.
- Why it matters: Without explicit identities, you can't verify facts or build a clean evidence chain. Anchor: Turning <E6> into "2015 Thailand bolide" and <E8> into "meteoroid" is required and must be cited.
The Concept: Evidence Connectivity
- What it is: The supported rubrics must link together all the way to the final answer entity, like a connected path.
- How it works:
- Treat entities and rubrics as nodes in a graph.
- Connect a rubric to the entities it mentions.
- Walk the graph from the final answer; only rubrics connected to it count.
- Why it matters: Without connectivity, the agent could cite true but irrelevant facts and still get points. Anchor: Citing that "bolides are bright" isn't enough unless that fact is tied, through other supported rubrics, to the exact Thailand event and its mass.
The Concept: Group Relative Policy Optimization (GRPO)
- What it is: A way to train agents by comparing a small group of rollouts and pushing the model toward the better ones.
- How it works:
- Generate several answer attempts for the same question.
- Score each one.
- Nudge the model to prefer higher-scoring attempts.
- Why it matters: Group comparisons give a stable training signal in multi-turn settings (a minimal sketch of the group-relative advantage follows below). Anchor: If two attempts earn different rewards, say because one has better citations and coverage under a process-aware reward, the group comparison pushes the model toward the stronger one.
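A minimal sketch of the group-relative signal, assuming the common mean/standard-deviation normalization over the rollouts sampled for one question (the paper's exact variant may differ).

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Compare each rollout's reward against its group (sketch).

    Rollouts scoring above the group mean get positive advantage and are
    reinforced; below-mean rollouts get negative advantage. Mean/std
    normalization is the standard GRPO formulation, used here as an assumption.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one question: two correct (reward 1.0), two wrong (0.0).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # correct ones come out positive
```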
The Concept: Citation-aware GRPO (C-GRPO)
- What it is: C-GRPO mixes the standard outcome reward with CaRR, but only adds the rubric bonus when the final answer is correct.
- How it works:
- Score each rollout: outcome (0/1) and CaRR (0–1).
- Normalize CaRR within the group so the best chain gets 1.0.
- Final reward = (1 - α)*outcome + α*outcome*normalized_CaRR.
- Optimize the policy to prefer higher-reward rollouts.
- Why it matters: Without this mix, models either ignore process quality (outcome-only) or get distracted from being correct (process-only). C-GRPO strikes the balance (a minimal sketch of the mixing rule follows below). Anchor: Two rollouts both answer "66 tonnes." The one that fully names entities, cites correctly, and connects evidence gets more reward and becomes the model's favorite.
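Here is a sketch of the mixing rule, under the assumption that "normalize within the group so the best chain gets 1.0" means dividing by the highest CaRR score among correct rollouts; α = 0.5 is an illustrative value, not the paper's setting.

```python
def c_grpo_rewards(outcomes, carr_scores, alpha=0.5):
    """Mixed reward for one group of rollouts (sketch).

    Assumptions: the rubric bonus applies only to correct rollouts, and group
    normalization divides each CaRR score by the best score among correct rollouts.
    """
    best = max((c for o, c in zip(outcomes, carr_scores) if o == 1), default=0.0)
    rewards = []
    for o, c in zip(outcomes, carr_scores):
        norm_c = c / best if (o == 1 and best > 0) else 0.0
        rewards.append((1 - alpha) * o + alpha * o * norm_c)
    return rewards

# Two correct rollouts with different evidence quality, plus one wrong rollout.
print(c_grpo_rewards([1, 1, 0], [0.75, 0.375, 0.5]))  # [1.0, 0.75, 0.0]
```

These mixed rewards would then feed the group-relative comparison sketched above, so the best-evidenced correct rollout is reinforced hardest.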
The "Aha!" Moment: Don't just reward the destination; reward the verified, connected path with citations.
Three Analogies:
- School Essay: You don't get an A for just the last sentence. You need sources, organization, and all points covered.
- Math Proof: The final number isn't enough; you must show each step and justify it.
- Treasure Hunt: Clues must form a trail to the treasure. Random facts about gold don't count if they don't lead to the chest you found.
Before vs After:
- Before: Agents chased correct answers and often cut corners; longer or trickier problems exposed their brittleness.
- After: Agents gather, cite, and connect evidence, staying robust as context gets longer and tasks get harder.
Why It Works (intuition): The reward now reflects what we truly value (complete, factual, connected reasoning), so the agent learns to invest effort where it counts. The connectivity rule prevents "fact farming" that isn't aimed at the answer. Restricting rubric rewards to correct rollouts keeps the model focused on solving the task while improving how it solves it.
Building Blocks:
- Question decomposition into single-hop rubrics.
- Hidden-entity naming in the final response.
- Citation checking against visited pages only.
- Evidence connectivity graph to the predicted answer.
- Mixed reward (outcome + normalized rubric) in group-based RL.
03 Methodology
High-level Recipe: Input question → Rubric initialization → Step 1: Hidden entity identification → Step 2: Citation-based rubric judgment → Step 3: Evidence connectivity check → Rubric reward → Mix with outcome reward (C-GRPO) → Update the agent.
The Concept: LLM Judge
- What it is: A helper model that checks whether entities are named and whether citations actually support the statements.
- How it works:
- Read the agentâs final explanation and citations.
- Decide which hidden entities were explicitly identified.
- Verify each rubric using only the content of cited pages the agent actually opened/found.
- Why it matters: Without a judge, we can't automatically grade thoroughness and factual grounding at scale (a hypothetical judge call is sketched below). Anchor: The judge sees the cited Wikipedia line "mass of 66 tonnes" on the Thailand bolide page and marks that rubric supported.
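One plausible shape for a single judge call: the prompt wording, the `call_llm` helper, and the verdict format are all assumptions for illustration, not the paper's actual judge.

```python
JUDGE_PROMPT = """You are verifying one factual statement (a rubric).
Rubric: {rubric}
Excerpts from pages the agent actually opened and cited:
{evidence}
Reply with exactly one word: SUPPORTED or UNSUPPORTED."""

def judge_rubric(rubric_text: str, cited_texts: list, call_llm) -> bool:
    """Return True if the judge says the cited texts support the rubric.

    `call_llm` is a stand-in for whatever judge model/API is used; the prompt
    above is hypothetical wording, not taken from the paper.
    """
    prompt = JUDGE_PROMPT.format(rubric=rubric_text,
                                 evidence="\n---\n".join(cited_texts))
    return call_llm(prompt).strip().upper().startswith("SUPPORTED")
```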
Overview with an example:
- Input: A multi-hop question about a prehistoric collision near a Scottish-planned settlement, linking to bolides, a bolide catalog, the 2015 Thailand event, and asking the object's initial mass.
- Output: A final answer (e.g., "66 tonnes") plus an explanation listing each hop with citations. The training system scores how complete and connected the evidence is.
Step-by-Step
- Rubric Initialization (decomposition)
- What happens: An LLM turns the question into single-hop factual statements (rubrics), each with placeholders <E0> (final answer) and <E1>, <E2>, ... (hidden entities).
- Why it exists: This creates clear checkpoints for grading thoroughness.
- Example: "<E6> is a mid-2010s event over a Southeast Asian nation <E7>." "<E8> is the object that caused <E6>." "<E0> is the initial mass of <E8>." (A plausible code representation of such rubrics is sketched below.)
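One plausible way to hold the decomposed rubrics in code; the field names and schema are illustrative assumptions, not the paper's format.

```python
# <E0> is the final-answer placeholder; <E1>, <E2>, ... are hidden entities
# the agent must explicitly name. The schema below is hypothetical.
rubrics = [
    {"id": "R1",
     "text": "<E6> is a mid-2010s event over a Southeast Asian nation <E7>.",
     "entities": ["E6", "E7"]},
    {"id": "R2",
     "text": "<E8> is the object that caused <E6>.",
     "entities": ["E8", "E6"]},
    {"id": "R3",
     "text": "<E0> is the initial mass of <E8>.",
     "entities": ["E0", "E8"]},
]
```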
- Hidden Entity Identification
- What happens: The judge scans the agent's final explanation to see if it explicitly names each <E1>, <E2>, ... (e.g., "2015 Thailand bolide", "Thailand", "meteoroid").
- Why it exists: We can't verify facts or connect evidence if the entities remain unnamed.
- Example: If the explanation never says "Bolide Catalogue" by name, rubrics involving it aren't considered fully identified and won't proceed to evidence checking (a minimal gating sketch follows below).
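A minimal gating sketch, assuming the judge returns a mapping from each placeholder to the name it found in the final explanation (or None); treating <E0> as the predicted answer itself is also an assumption.

```python
from typing import Dict, Optional

def fully_identified(rubric: dict, identified: Dict[str, Optional[str]]) -> bool:
    """A rubric proceeds to citation checking only if every hidden entity it
    mentions was explicitly named in the agent's final explanation.

    Assumption: <E0> stands for the predicted final answer, so it is not
    required to appear in the identification map.
    """
    return all(e == "E0" or identified.get(e) is not None
               for e in rubric["entities"])

identified = {"E6": "2015 Thailand bolide", "E7": "Thailand", "E8": None}
rubric = {"id": "R2", "text": "<E8> is the object that caused <E6>.",
          "entities": ["E8", "E6"]}
print(fully_identified(rubric, identified))  # False: <E8> was never named
```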
- Citation-based Rubric Judgment
- What happens: a) Extract the URLs cited in the final explanation (cap at 20 to prevent spam-citing). b) Gather only the content the agent actually retrieved (search snippets, opened pages, find matches). c) For each fully identified rubric, the judge checks if the cited texts support it.
- Why it exists: This blocks hallucinated facts and requires that claims be grounded in what the agent truly read.
- Example: The judge confirms that the Thailand bolide page explicitly states the mass and that this URL appears in the agent's citations (a sketch of this check follows below).
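A sketch of how this check could be wired together: cap the cited URLs at 20, keep only content the agent actually retrieved during the rollout, then ask a judge (passed in as `supports`) about each fully identified rubric. All names here are illustrative.

```python
def citation_supported_rubrics(cited_urls, visited_pages, rubrics, supports):
    """Return ids of rubrics grounded in the agent's own citations (sketch).

    cited_urls: URLs cited in the final explanation (capped at 20 below).
    visited_pages: {url: text} for content the agent actually retrieved.
    supports(rubric_text, cited_texts): stand-in for the LLM judge.
    """
    cited = cited_urls[:20]  # safeguard against spam-citing
    cited_texts = [visited_pages[u] for u in cited if u in visited_pages]
    if not cited_texts:
        return set()
    return {r["id"] for r in rubrics if supports(r["text"], cited_texts)}
```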
- Evidence Connectivity Check
- What happens: a) Build a graph with entity nodes and supported rubric nodes. b) Connect an entity to a rubric if the rubric mentions that entity. c) Starting from <E0> (the predicted final answer entity/number), walk the graph and keep only rubrics connected along the way.
- Why it exists: True-but-irrelevant facts shouldnât earn points; only facts that chain to the answer count.
- Example: A supported rubric "bolides are bright" helps only if linked via the event/entity chain to the mass answer (a graph-walk sketch follows below).
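The connectivity check reads as a plain graph walk. The sketch below assumes the rubric schema used earlier (ids plus entity lists) and treats <E0> as the final-answer node; it illustrates the described rule, not the authors' implementation.

```python
from collections import deque

def connected_rubrics(supported_rubrics, answer_entity="E0"):
    """Keep only supported rubrics that chain to the final answer (sketch).

    Entities and supported rubrics are nodes; a rubric is linked to every
    entity it mentions. Starting from the answer entity, walk the graph and
    collect reachable rubrics; isolated true facts never get counted.
    """
    entity_to_rubrics, rubric_to_entities = {}, {}
    for r in supported_rubrics:
        rubric_to_entities[r["id"]] = r["entities"]
        for e in r["entities"]:
            entity_to_rubrics.setdefault(e, []).append(r["id"])

    reachable, seen_entities = set(), {answer_entity}
    queue = deque([answer_entity])
    while queue:
        entity = queue.popleft()
        for rid in entity_to_rubrics.get(entity, []):
            if rid in reachable:
                continue
            reachable.add(rid)
            for e in rubric_to_entities[rid]:
                if e not in seen_entities:
                    seen_entities.add(e)
                    queue.append(e)
    return reachable
```

The rubric reward in the next step is then simply the number of rubrics returned here divided by the total number of rubrics.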
- Rubric Reward Calculation
- What happens: Rubric reward = number of connected, supported rubrics divided by total rubrics.
- Why it exists: This single score summarizes coverage, grounding, and connectivity.
- Example: If 3 of 8 rubrics are identified, supported, and connected to the final answer, the rubric reward is 3/8 = 0.375.
- Outcome Reward
- What happens: If the final answer matches the ground truth string/number, outcome = 1; else 0.
- Why it exists: We must still prioritize correctness of the end result.
- Example: Answering â66 tonnesâ correctly gives an outcome reward of 1.
- Mixing Rewards with C-GRPO
- What happens: a) For each question, sample a group of rollouts. b) Compute outcome and rubric rewards for each. c) Normalize rubric rewards within the group: the best chain gets 1.0. d) Final reward R = (1 - α)*outcome + α*outcome*normalized_rubric. e) Update the policy to favor higher R, at the token level, only where the model is generating (not the pasted web text); a token-masking sketch follows after this step.
- Why it exists: The mix balances "be right" with "show cited, connected work," and group normalization makes training stable.
- Example: Two correct rollouts: one with stronger citations/coverage gets more reward and is reinforced.
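The "only where the model is generating" detail can be pictured as a loss mask over the conversation. The sketch below is a common pattern in multi-turn agent RL and uses assumed role labels; it is not the paper's code.

```python
def generation_mask(turns):
    """1 for tokens the policy generated, 0 for pasted tool/web content (sketch).

    Only masked-in (value 1) tokens receive the policy-gradient update, so the
    agent is not trained on webpage text it merely read back as a tool result.
    """
    mask = []
    for turn in turns:
        keep = 1 if turn["role"] == "assistant" else 0
        mask.extend([keep] * len(turn["tokens"]))
    return mask

turns = [
    {"role": "assistant", "tokens": ["search(", "'2015 Thailand bolide'", ")"]},
    {"role": "tool",      "tokens": ["...page text:", "mass of 66 tonnes", "..."]},
    {"role": "assistant", "tokens": ["The", "mass", "is", "66", "tonnes."]},
]
print(generation_mask(turns))  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
```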
- Practical Training Choices
- Tools: search (find pages), open (view page head), find (locate keywords in page).
- Judges: A capable LLM evaluates both correctness and rubric satisfaction.
- Safeguards: Cap citations at 20; give 0 reward to malformed or overlong rollouts.
The Secret Sauce:
- Triple filter: identify entities → verify with citations → ensure connectivity. This trio prevents the classic hacks: guessing names, cherry-picking random facts, or citing unrelated pages. Training then lifts policies that are simultaneously correct and well-supported.
What breaks without each step:
- Without entity identification: The agent can speak vaguely, and we can't align facts to the right names.
- Without citation judgment: Hallucinations sneak in, and evidence quality drops.
- Without connectivity: The agent can "farm" isolated true facts and still look good.
- Without mixing with outcome: The agent might be very thorough but miss the actual answer.
04 Experiments & Results
Hook: When you test runners on a short track, even sloppy form can win. On a marathon, only solid technique holds up. The same is true for deep search agents.
The Concept: Baselines
- What it is: Other training methods we compare against, like GRPO (outcome-only) and E-GRPO (entity-match signals for wrong rollouts).
- How it works:
- Train agents with the baseline reward.
- Evaluate on shared benchmarks.
- Compare scores fairly.
- Why it matters: Without baselines, we can't tell if the new method truly helps. Anchor: GRPO boosts 64k-context accuracy but often plateaus or slips at 128k, while C-GRPO keeps climbing.
The Concept: Context Budget
- What it is: The maximum amount the model can read/remember in one go (like 64k or 128k tokens).
- How it works:
- Give the agent a limit on how long its conversation and web snippets can be.
- Harder tasks need more context.
- Check if performance scales when the budget grows.
- Why it matters: Real research needs long contexts. Fragile strategies fail when the reading gets heavy. Anchor: C-GRPO gains more than GRPO when moving from 64k to 128k on multiple benchmarks.
What they measured and why:
- Accuracy on four deep-search benchmarks (BrowseComp, BrowseComp-ZH, xbench-DeepSearch, GAIA) using an LLM judge for correctness: this shows task success.
- Effects of longer context budgets and more tool-call steps: this shows test-time scaling and whether the agent can handle longer, harder problems.
- Rubric satisfaction and number of cited webpages: this shows thoroughness and factual grounding.
The Competition:
- GRPO (outcome-only RL) and E-GRPO (adds entity-match rate for incorrect rollouts) are the main baselines.
- They also list scores from several proprietary or differently trained systems as references.
The Scoreboard (with context):
- Across 4 benchmarks and two model sizes (4B and 30B), C-GRPO consistently beats GRPO and E-GRPO.
- Example summary: With 64k context, C-GRPO improves over GRPO by about 5.1 points (4B) and 2.6 points (30B). With 128k, the margin grows to about 8.0 (4B) and 6.0 (30B). That's like moving from a low B to a solid A while others hover at B-.
- Test-time scaling: GRPO often helps at 64k but underperforms at 128k, suggesting shortcut habits don't scale. C-GRPO keeps improving as context grows, signaling robust evidence habits.
- Training dynamics: GRPO's average tool calls drop and stay low, hinting at shortcuts. C-GRPO's tool calls first drop (more efficient) then rise (gathering enough evidence), and its reward curves outpace GRPO over time.
Comprehensiveness and factuality:
- On a subset of problems solved within 64k, the 30B C-GRPO agent cites more pages and satisfies more rubrics (identified, supported, and connected) than both SFT and GRPO. GRPO even cites fewer pages than SFT, evidence of shortcutting.
Generalization to open-ended research:
- On DeepResearch Bench (PhD-level report tasks, graded by a rubric), C-GRPO improves comprehensiveness, insight, instruction following, and readability compared to baselines. The 30B C-GRPO model competes with some proprietary agents, showing that skills learned on synthetic QA transfer to real research writing.
Surprising Findings:
- Pure outcome rewards can degrade test-time scaling at longer contexts, likely because the model settles on short, brittle strategies that don't generalize.
- Adding rubric rewards only to correct rollouts matters: giving them to all rollouts can accidentally push the model toward carefully argued but wrong answers, hurting performance.
- Each CaRR component (entity identification and connectivity) has a measurable impact; removing either leads to sizable drops.
05 Discussion & Limitations
Hook: If you only pay a painter when the wall looks white, they might paint fast and miss the corners. Start paying for edges and even coats, and the quality jumps, at a cost.
Limitations:
- Structured questions required: CaRR depends on questions that can be decomposed into clear single-hop rubrics. Open-ended prompts without explicit requirements are harder to break down reliably.
- Judge quality matters: If the LLM judge misreads a citation or misses an entity, scoring can be noisy, though spot checks showed high accuracy in this work.
- Extra compute: Decomposition, judging, and graph checks add overhead compared to simple outcome rewards.
- Web variability: Pages change, snippets differ, and some sites block access; robust caching and retrieval hygiene are needed.
Required Resources:
- A capable base model (4B–30B in this paper) with tool-use abilities.
- Search/open/find infrastructure with access to the public web.
- An LLM judge for entity identification and citation checks.
- Enough context length (64k–128k+) and training budget for multi-turn RL.
When NOT to Use:
- Purely creative tasks (poems, brainstorming) where there's no single factual chain.
- Tiny compute budgets where the overhead of rubrics and judging outweighs benefits.
- Domains with scarce or paywalled sources where citation verification is impractical.
Open Questions:
- How to extend rubric generation beyond synthetic multi-hop QA to fully open-ended research queries?
- Can we auto-learn evolving rubrics during training that better match each modelâs current mistakes?
- What's the best balance (α) across domains and model sizes, and can it be adapted per-question?
- How to make connectivity checks more semantic (not just entity-string matching) while staying robust and cheap?
- Can we reduce dependence on a single judge by cross-checking multiple verifiers?
06 Conclusion & Future Work
Three-Sentence Summary: The paper introduces CaRR, a reward that grades each reasoning hop with explicit entity naming, correct citations, and a connectivity check to the final answer. It combines this with the standard right/wrong signal using C-GRPO, rewarding not only the destination but the cited, connected path. The result is more robust deep search agents that scale better with longer contexts and generalize to open-ended research.
Main Achievement: Turning "show your work with sources" into a concrete, scalable reward signal, so agents learn to be both correct and convincingly evidence-backed.
Future Directions: Automate rubric creation for open-ended questions, make connectivity more semantic, adapt the reward mix to each task, and explore multi-judge ensembles to further stabilize training. Integrating richer web interactions (e.g., PDFs, tables, code) with citation checks could broaden applicability.
Why Remember This: Rewarding only final answers teaches bad habits; rewarding cited, connected reasoning teaches good ones. CaRR + C-GRPO makes that principle practical, pushing web agents toward trustworthy research behavior rather than lucky guessing.
Practical Applications
- Train enterprise research assistants that must provide linked, citation-backed evidence for every claim.
- Build newsroom fact-checking tools that reward connected chains of sources rather than single quotes.
- Upgrade academic literature-review agents to name authors, venues, and findings with precise citations.
- Create legal or policy assistants that assemble verifiable chains from statutes, cases, and official documents.
- Improve customer-support bots by requiring cited steps from manuals and knowledge bases before answering.
- Develop internal compliance auditors that verify each rule is supported by cited company policies.
- Enable scientific assistants that trace results back to datasets, methods, and papers with proper links.
- Train education tutors to show work with sources, discouraging unsupported claims.
- Power procurement or market-analysis agents that justify conclusions with connected, sourced evidence.
- Refine medical information tools to cite guidelines and studies explicitly, reducing risky hallucinations.