Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
Key Summary
- DeepVerifier is a plug-in checker that helps Deep Research Agents catch and fix their own mistakes while they are working, without retraining.
- It uses a failure taxonomy (a map of common mistakes) and clear rubrics (grading rules) to give precise, step-by-step feedback.
- The key idea is verification asymmetry: checking an answer is often easier than creating it, so the system breaks big checks into tiny yes/no questions.
- A three-part recipe (decompose the check, verify evidence, then judge with feedback) lets the agent retry smarter each round.
- On the GAIA benchmark, this test-time verification loop boosts accuracy by about 8% for strong models and 5-6% for an open 8B model.
- Compared with LLM-as-judge baselines, DeepVerifier improves meta-evaluation F1 by 12-48%, meaning it's better at deciding what's truly right or wrong.
- The method generalizes to other tough datasets (XBench-DeepSearch and BrowseComp), still showing gains.
- They also release DeepVerifier-4K, a 4,646-example training set that teaches open models how to reflect and verify.
- Ablations show that both decomposition and explicit verification are necessary: remove either and recall and accuracy drop.
- This approach makes AI research agents more trustworthy for long, web-heavy tasks where human supervision is impractical.
Why This Research Matters
Many real tasks, like checking medical facts, legal dates, or financial numbers, depend on precise, verifiable details. DeepVerifier shows that smarter checking at test time can make AI agents more trustworthy without expensive retraining. By focusing on the easiest decisive questions, it avoids re-doing long, error-prone journeys and catches subtle mistakes. The method scales across models and datasets, helping both top-tier APIs and smaller open models. Clear rubrics translate into practical, short instructions, which means faster, safer retries. This helps students, journalists, researchers, and businesses rely on AI for complex, web-heavy work with fewer errors.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're doing a big school project online: opening lots of tabs, copying notes, and piecing together an answer. Even if you're smart, it's easy to grab a wrong fact, miss a key source, or jump to a conclusion too fast.
The Concept: Deep Research Agents (DRAs)
- What it is: DRAs are AI helpers that can browse the web, open files, use tools, and reason through multi-step tasks to answer complex questions.
- How it works:
- Read the task and plan steps.
- Search the web or tools for evidence.
- Collect quotes, numbers, and images.
- Stitch them into an answer.
- Why it matters: Without reliable checks, DRAs can hallucinate facts, follow wrong links, or misunderstand instructions, making long tasks risky. Anchor: A DRA asked for a scientist's earliest paper might trust a blog summary instead of the official database and return the wrong year.
Hook: You know how teachers use a grading rubric so students know exactly what counts as a good answer?
The Concept: DRA Failure Taxonomy
- What it is: A categorized map of the ways DRAs commonly mess up, built by studying many agent attempts.
- How it works:
- Collect real agent trajectories from a benchmark (WebAggregatorQA).
- Mark exact error points (e.g., wrong source, misunderstood instruction).
- Cluster them into five big families and 13 sub-categories (e.g., finding sources, reasoning errors, decomposition mistakes, action/UI errors, hitting max steps).
- Why it matters: If you know typical trapdoors, you can check those spots first and faster. Anchor: If an agent often relies on generic searches, the taxonomy flags "risky sourcing," so checks ask: "Does the official database confirm this claim?"
Hook: Think of test prep: when you take practice quizzes and instantly see what you got wrong, you improve fast.
The Concept: Self-evolving mechanisms
- What it is: A way for agents to improve themselves on the fly by checking their own outputs and fixing errors.
- How it works:
- Generate an answer.
- Verify it against targeted checks.
- Get structured feedback.
- Retry with those tips, repeating a few rounds.
- Why it matters: You don't need extra training time, just smarter use of test time. Anchor: The agent answers, gets told "your source is secondary; check the official archive," then tries again and corrects itself.
The world before: Many teams tried test-time tricks like generating more drafts (Best-of-N) or voting between runs. But if every run repeats the same blind spot (like trusting a blog), the vote still fails. Others used LLMs as judges, but judging complex web reasoning holistically is hard: judges miss subtle factual slips.
The problem: We need a scalable, automated way to catch specific, common DRA errors and guide precise fixes during inference, especially for long, web-heavy tasks where humans can't supervise every click.
Failed attempts: Parallel samples and majority votes amplify the same mistakes; holistic judges spot obvious fails (like a broken link) but miss nuanced ones (like a premature conclusion based on a weak source).
The gap: A practical, plug-in verifier that (a) focuses on typical failure points, (b) breaks verification into small, checkable questions, and (c) turns checks into actionable feedback loops.
Real stakes: Better DRAs mean more trustworthy research summaries, safer coding assistance, more accurate data audits, and less time wasted chasing wrong leads: things students, journalists, researchers, and businesses all care about.
02 Core Idea
Hook: Picture checking a giant LEGO castle by tugging on a few critical bricks instead of rebuilding the whole thing: you'll quickly know if it's sturdy.
The Concept: Verification asymmetry
- What it is: It's often easier to check if an answer is right than to create that answer from scratch.
- How it works:
- Turn a big claim into small, bite-sized checks (e.g., yes/no about a specific fact in a trusted source).
- Retrieve targeted evidence for each micro-check.
- Combine the micro-check results to accept or reject the big answer.
- Why it matters: If you try to "re-solve" the whole problem, you repeat the same mistakes; micro-checks are simpler and more reliable. Anchor: Instead of redoing a 20-step web journey, ask: "Does the official archive list 2009 as the first paper year?" One click, clear verdict. (A small sketch of this idea follows below.)
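To make "combine the micro-check results" concrete, here is a minimal Python sketch of the aggregation idea, assuming a simple MicroCheck record and a strict all-checks-must-pass rule; both are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class MicroCheck:
    """One decisive yes/no question plus its evidence-backed verdict."""
    question: str          # e.g. "Does the official archive list 2009 as the first paper year?"
    verdict: bool          # True if the retrieved evidence supports the claim
    evidence_snippet: str  # quote or number copied from the source

def accept_answer(checks: list[MicroCheck]) -> bool:
    """Accept the big answer only if every decisive micro-check passes.

    A deliberately strict aggregation rule, chosen for illustration; a real
    verifier could weight checks or request more evidence instead.
    """
    return bool(checks) and all(c.verdict for c in checks)

# Toy example: two micro-checks behind a claim about a publication year.
checks = [
    MicroCheck("Does the official archive list 2009 as the earliest paper?", True,
               "Earliest publication: 2009"),
    MicroCheck("Is the archive's author page the one for Dr. X?", True,
               "Author: Dr. X"),
]
print(accept_answer(checks))  # True -> the original answer survives verification
```

In practice, the verdicts would come from the evidence-retrieving verification agent described in the Methodology section rather than being filled in by hand.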
Hook: Like a teacher's rubric that says, "Use primary sources, quote exact numbers, and cross-check dates," so students know precisely what to improve.
The Concept: Rubrics-based feedback
- What it is: Structured, labeled guidance that tells the agent exactly which rule it broke and how to fix it.
- How it works:
- Derive rubrics from the failure taxonomy (e.g., verify source authority, confirm numerics).
- Score the answer along these rubric lines.
- Produce targeted, short instructions for the next attempt.
- Why it matters: Vague advice ("be careful") doesn't help; concrete rules do. Anchor: "Your date came from a secondary blog. Check the university's official repository and quote the exact year from its page."
Hook: Imagine practicing a song: play, listen, fix the off-notes, and try again; you improve every round.
The Concept: Inference-time scaling of verification
- What it is: Using extra test-time steps (not extra training) to verify, give feedback, and retry, so the agent's performance scales up during inference.
- How it works:
- 1) Answer → 2) Verify via micro-questions → 3) Judge + feedback → 4) Retry a few rounds.
- Why it matters: More compute at test time buys accuracy gains when training isn't possible. Anchor: After two to four feedback rounds, accuracy jumps, like a student acing a retake after seeing exactly what went wrong.
Hook: Think of DeepVerifier as the pit crew for a race car: diagnose, fix, send it back out, faster each lap.
The Concept: DeepVerifier
- What it is: A plug-and-play verification system that decomposes checks, retrieves evidence, and gives rubric-based feedback to help the agent self-improve.
- How it works:
- Decompose the verification into a few decisive follow-up questions.
- Retrieve and read the right sources.
- Judge correctness (1-4) and give concise fix-it tips.
- Why it matters: It consistently catches subtle factual and sourcing errors that generic judges miss. Anchor: On GAIA, DeepVerifier lifts accuracy by roughly 8-11% for strong models and beats LLM-as-judge baselines by 12-48% F1.
Aha! moment in one sentence: Don't re-solve the hard problem; verify it with tiny, targeted checks guided by a failure-aware rubric, and iterate.
Multiple analogies:
- Detective: Instead of recreating the crime, check alibis and camera timestamps (micro-checks).
- Chef: Don't cook a new soup; taste salt, acidity, and spice levels (rubric points) and adjust.
- Editor: Don't rewrite the book; fact-check quotes, dates, and references (targeted verification).
Before vs After:
- Before: Agents generated more drafts or asked a generic judge, often missing the same subtle errors.
- After: Agents run a focused verify-feedback-retry loop, leading to steady accuracy gains in a few rounds.
Why it works (intuition): Micro-checks reduce cognitive load and variance; rubrics inject structure; iteration shrinks the error space each round until the answer stabilizes.
Building blocks:
- Taxonomy → rubrics.
- Decomposition into ≤3 decisive questions.
- Evidence retrieval by a verification agent.
- A judge that scores (1-4) and emits short, actionable feedback.
- A retry loop capped by a small number of rounds to avoid regressions.
03 Methodology
High-level recipe: Task + Unverified answer + (Long trajectory) → [A) Decomposition] → [B) Verification retrieval] → [C) Judging + feedback] → Retry answer → Repeat for a few rounds → Final answer.
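Read as code, the recipe is a thin wrapper around three pluggable modules. The sketch below shows only that wiring; the module bodies are trivial stand-ins, and all names and signatures are assumptions rather than the authors' API.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    score: int            # 1 = entirely wrong ... 4 = entirely right
    feedback: list[str]   # at most three short corrective instructions

# --- Stand-in modules; a real system would back each with an LLM plus tools. ---
def decompose(task: str, answer: str, trajectory_summary: str) -> list[str]:
    """A) Draft up to three decisive yes/no follow-up questions."""
    return [f"Does an authoritative source confirm '{answer}' as the answer to: {task}?"]

def verify(question: str) -> bool:
    """B) Retrieve targeted evidence for one follow-up and return a yes/no verdict."""
    return True  # stub

def judge(task: str, answer: str, questions: list[str], verdicts: list[bool]) -> Judgment:
    """C) Map the verdicts to a 1-4 score plus short, actionable feedback."""
    if all(verdicts):
        return Judgment(score=4, feedback=[])
    return Judgment(score=2, feedback=["Re-check the claim against an authoritative source."])

def run_verification_round(task: str, answer: str, trajectory_summary: str) -> Judgment:
    """One DeepVerifier-style round: decompose -> verify -> judge."""
    questions = decompose(task, answer, trajectory_summary)
    verdicts = [verify(q) for q in questions]
    return judge(task, answer, questions, verdicts)

print(run_verification_round("What is Dr. X's earliest publication year?", "2011", "used a blog"))
```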
Hook: You know how reviewing a long group project starts with a short summary before deciding what to double-check?
The Concept: Decomposition module
- What it is: A helper that summarizes the trajectory, spots likely failure types, and writes a few high-impact follow-up questions.
- How it works:
- Trajectory summarization: Convert an 8.2M-token browsing trace into a compact, step-indexed list of sources and extracted facts (no opinions).
- Potential error identification: Using the failure taxonomy, label suspicious behaviors (e.g., "relied on a non-official blog for a key date").
- Follow-up question formulation: Draft up to 3 yes/no questions anchored to authoritative sources that can decisively validate or refute the claim.
- Why it matters: Without decomposition, the checker tries to re-solve the whole task and inherits the same mistakes. Anchor: "Does the university's official repository list 2009 as the earliest publication year?" (A sketch of this module's output follows below.)
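As an illustration of what this module might emit, here is a small sketch of a structured decomposition output with the "at most three follow-up questions" rule enforced; the field names are assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class DecompositionOutput:
    trajectory_summary: str               # compact, step-indexed facts and sources
    potential_errors: list[str]           # taxonomy labels such as "risky sourcing"
    follow_up_questions: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the "at most three decisive questions" rule described above.
        if len(self.follow_up_questions) > 3:
            raise ValueError("Decomposition should emit at most three follow-up questions")

out = DecompositionOutput(
    trajectory_summary="Step 3: read a blog post; Step 5: read Wikipedia; answered 2011.",
    potential_errors=["Over-reliance on secondary sources"],
    follow_up_questions=[
        "Does the university's official repository list Dr. X's earliest publication as 2009?"
    ],
)
print(out.follow_up_questions[0])
```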
Hook: Like sending a librarian to fetch exact pages for each question.
The Concept: Verification agent
- What it is: A specialized agent (e.g., CK-Pro) that answers the follow-up questions by searching, clicking, and reading sources.
- How it works:
- For each follow-up, search or open the target site.
- Extract the relevant snippet (quote/number).
- Return a brief explanation plus a concise yes/no.
- Why it matters: Without this retrieval, judges guess from memory and miss subtle factual issues. Anchor: It opens the official archive, finds the author page, and reads the earliest record's date. (A sketch of this step follows below.)
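A minimal sketch of the verification agent's contract, assuming a stubbed fetch_page browsing tool and a hypothetical repository URL; a real agent would search, click, and read with an LLM in the loop.

```python
from dataclasses import dataclass

@dataclass
class VerificationResult:
    question: str
    verdict: bool        # concise yes/no
    explanation: str     # brief justification
    snippet: str         # quote or number copied from the source

def fetch_page(url: str) -> str:
    """Stand-in for the agent's browsing tools (search, click, read); stubbed here."""
    return "Author: Dr. X. Earliest publication: 2009."

def answer_follow_up(question: str, source_url: str) -> VerificationResult:
    """Answer one follow-up question from a targeted source, as described above."""
    page_text = fetch_page(source_url)
    supported = "2009" in page_text  # toy check standing in for LLM-based reading
    return VerificationResult(
        question=question,
        verdict=supported,
        explanation="The official repository's author page lists 2009 as the earliest entry.",
        snippet="Earliest publication: 2009",
    )

result = answer_follow_up(
    "Does the official repository list Dr. X's earliest publication as 2009?",
    "https://example.edu/repository/dr-x",  # hypothetical URL for illustration
)
print(result.verdict)  # True
```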
Hook: Think of a fair referee who explains the call and gives tips to avoid the foul next time.
The Concept: Judge module
- What it is: A scorer that decides if the unverified answer is entirely wrong (1), mostly wrong (2), mostly right (3), or entirely right (4), and provides corrective feedback.
- How it works:
- Reads the summary, flagged errors, and follow-up answers.
- Writes a one-paragraph explanation.
- Outputs a 1-4 score and at most three clear instructions for the agent's retry.
- Why it matters: Without precise, short instructions, retries wander or repeat old mistakes. Anchor: "Score: 2. Reflection: You used a secondary blog. Instruction: Check the university archive, quote the earliest year on the author page, and update the final answer accordingly." (A sketch of this output format follows below.)
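A sketch of how the judge's output could be represented and validated, assuming a JSON layout with score, reflection, and instructions fields; the schema is illustrative, not the paper's.

```python
import json

# Hypothetical judge output, following the 1-4 scale and the "at most three
# instructions" format described above (field names are assumptions).
raw = """
{
  "score": 2,
  "reflection": "The date came from a secondary blog, not the official repository.",
  "instructions": [
    "Open the university's official repository.",
    "Quote the earliest year listed on the author page.",
    "Update the final answer and cite the exact line."
  ]
}
"""

def parse_judgment(text: str) -> dict:
    """Validate the judge's structured output before feeding it to the retry."""
    data = json.loads(text)
    assert data["score"] in (1, 2, 3, 4), "score must be on the 1-4 scale"
    assert len(data["instructions"]) <= 3, "keep feedback short: at most three instructions"
    return data

judgment = parse_judgment(raw)
print(judgment["score"], "->", judgment["instructions"][0])
```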
Detailed step-by-step with example:
- Input: Task: "What is Dr. X's earliest publication year?" Unverified answer: "2011." Trajectory summary: Agent used Wikipedia and a blog.
- A) Decomposition drafts:
  - Potential error: Over-reliance on secondary sources.
  - Follow-up Q1: "Does the university's official repository list Dr. X's earliest publication as 2009?"
- B) Verification agent:
  - Opens the official repository, finds Dr. X's page, sees a 2009 entry.
  - Returns: "Yes; earliest listed is 2009. Snippet: 'Earliest publication: 2009'."
- C) Judge:
  - Explanation cites the official source contradicting 2011.
  - Score: 2 (mostly incorrect).
  - Feedback: "Use the official repository; replace 2011 with 2009; cite the exact line."
- Retry: Agent updates the answer to 2009, cites correctly.
- Next round: Judge re-checks and returns Score: 4.
The secret sauce:
- Verification asymmetry: Small, decisive checks beat re-solving.
- Targeted decomposition: ≤3 micro-questions reduce noise and cost.
- Rubrics grounded in real failure modes: Feedback maps to fixable actions.
- Tight feedback format: Short, actionable instructions prevent drift.
- Plug-and-play: Sits on top of any capable backbone model at test time.
Hook: Like practicing with answer keys to become better at self-checking over time.
The Concept: Reflection/test-time scaling loop
- What it is: Repeating verify-feedback-retry for a few rounds to raise accuracy without retraining.
- How it works:
- Run DeepVerifier after each answer.
- If score ≤2, apply feedback and retry; stop early if score ≥3.
- Cap rounds (gains often peak around 3-4) to avoid regressions.
- Why it matters: Gains accuracy when you can't or won't do more training. Anchor: On GAIA, accuracy climbs across early rounds, peaking near round four. (A sketch of this loop follows below.)
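A minimal sketch of the loop under the thresholds described above (retry while the score is ≤2, stop at ≥3, cap the rounds); the verify and retry_agent stubs are illustrative assumptions rather than real components.

```python
def verify(answer: str) -> tuple[int, list[str]]:
    """Stub verifier: returns (score on the 1-4 scale, corrective instructions)."""
    if answer == "2011":
        return 2, ["Check the official repository instead of the blog."]
    return 4, []

def retry_agent(task: str, answer: str, feedback: list[str]) -> str:
    """Stub agent retry that applies the feedback (here: a hard-coded correction)."""
    return "2009"

def reflection_loop(task: str, answer: str, max_rounds: int = 4) -> str:
    """Verify -> feedback -> retry, stopping early on a passing score or the round cap.

    The thresholds (retry while score <= 2, stop at score >= 3) and the small
    round cap mirror the loop described above.
    """
    for _ in range(max_rounds):
        score, feedback = verify(answer)
        if score >= 3:          # good enough: accept and stop early
            break
        answer = retry_agent(task, answer, feedback)
    return answer

print(reflection_loop("What is Dr. X's earliest publication year?", "2011"))  # -> "2009"
```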
Hook: Think of a workbook that teaches you how to spot mistakes by yourself.
The Concept: DeepVerifier-4K dataset
- What it is: 4,646 curated prompt-response pairs that teach models how to verify, reflect, and give useful feedback.
- How it works:
- Collect 400 verification trajectories.
- Keep only true accept/reject cases (clean labels).
- Convert to instructional pairs for SFT.
- Why it matters: Open models often lack reflection skills; this data trains them to verify effectively. Anchor: A Qwen3-8B model fine-tuned on DeepVerifier-4K (DeepVerifier-8B) gains ~5.5 accuracy points after reflection on GAIA-Full. (A sketch of the data conversion follows below.)
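A small sketch of how a cleanly labeled verification trajectory could be turned into a supervised fine-tuning pair; the prompt/response layout here is an assumption, not the released DeepVerifier-4K format.

```python
def to_sft_pair(trajectory_summary: str, unverified_answer: str,
                verdict: str, feedback: list[str]) -> dict:
    """Turn one clean verification trajectory into a prompt/response pair for SFT."""
    prompt = (
        "You are a verifier for a deep research agent.\n"
        f"Trajectory summary: {trajectory_summary}\n"
        f"Unverified answer: {unverified_answer}\n"
        "Decide whether to accept the answer and give at most three corrective instructions."
    )
    response = f"Verdict: {verdict}\n" + "\n".join(f"- {tip}" for tip in feedback)
    return {"prompt": prompt, "response": response}

pair = to_sft_pair(
    "Step 3: read a blog; Step 5: read Wikipedia; answered 2011.",
    "2011",
    "reject",
    ["Use the university's official repository.", "Quote the earliest listed year."],
)
print(pair["response"])
```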
04 Experiments & Results
The test: Can DeepVerifier correctly judge answers (verification quality), and can its feedback loop raise task accuracy across rounds (scaling)? Metrics include precision, recall, accuracy, and meta-evaluation F1 for judging; task accuracy for scaling.
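For readers who want the judging metrics pinned down, here is a small sketch that computes precision, recall, F1, and accuracy from a verifier's accept/reject decisions; treating "reject a wrong answer" as the positive class is an assumption about the evaluation setup, not a detail taken from the paper.

```python
def meta_eval(predicted_reject: list[bool], truly_wrong: list[bool]) -> dict:
    """Precision/recall/F1/accuracy for a verifier's reject decisions."""
    tp = sum(p and t for p, t in zip(predicted_reject, truly_wrong))        # caught real errors
    fp = sum(p and not t for p, t in zip(predicted_reject, truly_wrong))    # rejected correct answers
    fn = sum(not p and t for p, t in zip(predicted_reject, truly_wrong))    # missed real errors
    tn = sum(not p and not t for p, t in zip(predicted_reject, truly_wrong))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(truly_wrong)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy example: the verifier rejects 3 answers, 2 of which are actually wrong.
print(meta_eval([True, True, True, False, False], [True, True, False, False, True]))
```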
The competition: Baselines include generic LLM-as-judge and an agent-as-judge (CK-Pro). They also test ablations removing verification or decomposition to see which parts matter.
Scoreboard with context:
- Verification quality (ablation on GAIA-Web trajectories with Claude-3.7 backbones):
  - DeepVerifier: Balanced performance with the highest F1 (~73) and accuracy (~76). Translation: It both catches many wrong answers and avoids falsely rejecting correct ones, like a referee who calls fouls accurately without over-calling.
  - Without verification: Very high precision (100%) but terrible recall (~14%); it catches only the most obvious mistakes, like only penalizing players who shout while missing quiet fouls.
  - Without decomposition: Precision high (~87%) but weaker recall (~48%) and F1 (~62); it tries to re-solve tasks and repeats original errors.
- Scaling on GAIA (accuracy across feedback rounds):
  - Claude-3.7-Sonnet: GAIA-Full climbs from ~52% to ~59% final (+6.7), with a best of ~60% (+8.0). On GAIA-Web it peaks around ~63%, roughly 12 points above its baseline, showing the biggest gains on retrieval-heavy tasks.
  - GPT-4.1: Improves modestly from ~29.5% to ~31.9% final (+2.4), best ~32.5% (+3.0). This shows generalization but also that backbone quality and prompts matter.
  - DeepVerifier-8B (Qwen3-8B fine-tuned): From ~26.7% to ~32.2% final (+5.5). Reflection skills learned from the 4K dataset pay off.
- Other datasets:
  - XBench-DeepSearch: Best gain +6.0; final +3.0 after 10 rounds, solid even across languages.
  - BrowseComp: Best gain +5.0; final +4.0, impressive on extremely hard-to-find info.
Surprising findings:
- Early peaks around rounds 3-4: The system fixes many wrong cases early (incorrect → correct), but a small number of regressions (correct → incorrect) can appear in later rounds as the verifier sometimes overrules correct answers, so stopping early is wise.
- Decomposition is not optional: Even with access to the web, trying to re-solve tasks as a judge repeats original reasoning traps. Targeted micro-questions break that loop.
- Open models can learn reflection: A relatively small, clean dataset (4,646 pairs) noticeably improves an 8B model's verification-driven scaling.
Plain-English takeaway: DeepVerifier is both a better referee and a better coach. It judges more fairly (higher F1) and its advice leads to real score improvements (higher accuracy) in just a few rounds.
05 Discussion & Limitations
Limitations:
- Verification isn't perfect: Misclassifications happen, especially on nuanced reasoning or when sources conflict. Later rounds can introduce small regressions, so a smart stopping rule is needed.
- Taxonomy/rubric coverage: The system is only as good as the failure patterns it knows. New task types may need updated rubrics.
- Evidence availability: If the authoritative source is paywalled, down, or ambiguous, verification may stall.
- Cost/latency: Extra retrieval and a few feedback rounds add tokens, API calls, and time.
Required resources:
- A competent backbone LLM or VLM (closed or open) with browsing/search capability.
- Web access, tool-use support (search, click, screenshot, code snippets), and logs for summarization.
- Optional SFT compute to fine-tune open models on DeepVerifier-4K.
When NOT to use:
- Purely creative tasks (poetry style, brainstorming) with no verifiable ground truth.
- Ultra-time-critical settings where extra rounds are unacceptable.
- Domains without accessible authoritative sources.
Open questions:
- Adaptive stopping: How to predict the best round to stop per-instance?
- Confidence calibration: Can the judge report uncertainty and trigger human-in-the-loop only when needed?
- Robustness: How to handle adversarial or noisy sources at web scale?
- Broader taxonomies: Can we automatically expand failure categories as new domains emerge?
- Multi-modal depth: How to verify complex images/tables/videos more reliably across modalities?
06 Conclusion & Future Work
Three-sentence summary: This paper turns verification into a first-class citizen for Deep Research Agents by using a failure-informed rubric and tiny, targeted checks. Plugging in DeepVerifier at test time creates a verify-feedback-retry loop that reliably boosts accuracy within a few rounds. A curated dataset (DeepVerifier-4K) also teaches open models to reflect and verify, extending gains beyond closed APIs.
Main achievement: Showing that inference-time scaling of verification, grounded in a real failure taxonomy, targeted decomposition, and rubric-based feedback, consistently improves both judging quality (F1) and end-task accuracy across strong and open models.
Future directions:
- Smarter, instance-wise early stopping and uncertainty-aware judging.
- Expanding the taxonomy and rubrics to more domains and modalities.
- Hybrid loops that combine retrieval with lightweight tool execution (e.g., code) for deterministic checks.
- Human-in-the-loop escalation for ambiguous or high-stakes cases.
Why remember this: Instead of making bigger models or more drafts, DeepVerifier shows that carefully checking with the right small questions, and acting on clear, structured feedback, can make agents meaningfully more trustworthy right now.
Practical Applications
- Academic fact-checking: Verify earliest publications, citation counts, and official affiliations from authoritative sources.
- Journalistic research: Confirm dates, quotes, and statistics with primary documents before publishing.
- Enterprise analytics: Validate figures in reports (revenues, growth rates) against filings or official databases.
- Legal and compliance audits: Cross-check deadlines, statutes, and clause references with official repositories.
- Healthcare literature reviews: Ensure study dates, sample sizes, and outcomes match the original papers.
- Data labeling QA: Use micro-checks to validate factual labels and flag ambiguous items for human review.
- E-commerce content validation: Confirm product specs and availability from manufacturer pages.
- Coding assistance: Verify API behaviors and version-specific details against official docs before suggesting fixes.
- Education: Provide students with rubric-based feedback and sources to correct research assignments.
- Customer support knowledge bases: Validate answers against official docs to prevent misinformation.