DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Key Summary
- The paper introduces DeepResearchEval, a fully automated way to build realistic deep research tasks and to grade long research reports from AI systems.
- It creates tasks by imagining real people (personas) with specific jobs and needs, then filters out easy or low-value tasks so only complex, multi-source work remains.
- Its evaluator adapts to each task, inventing custom grading categories and weights so the score matches what truly matters for that task.
- It also runs active fact-checking that searches the web to verify both cited and uncited claims, labeling them as Right, Wrong, or Unknown.
- On 900 reports from 9 leading systems, Gemini Deep Research scored highest on overall report quality.
- Manus achieved the best factual accuracy ratio, showing strong resistance to false claims.
- Across all systems, task-specific scores were lower than generic scores, revealing a gap between general writing quality and truly meeting each task’s unique goals.
- The framework shows strong agreement with human judgments and is stable across repeated runs, but it currently focuses on English and can be resource-intensive.
- DeepResearchEval serves as a living benchmark that can keep generating fresh tasks over time.
- This work helps make AI research agents more reliable for real-world, high-stakes investigations.
Why This Research Matters
When people and organizations rely on AI to do serious research, they need tasks that reflect real-world needs and evaluations that reward what truly matters. DeepResearchEval makes both happen automatically, creating complex tasks from realistic personas and grading with rubrics tailored to each task’s goals. It also checks facts across the entire report, not just the parts with citations, helping catch unsupported or risky claims. This leads to fairer, more trustworthy comparisons between different AI research systems. Over time, this living benchmark can keep up with changing topics and standards, guiding systems to improve where it counts. The result is safer, more reliable AI for policy analysis, business strategy, health guidance, and education.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your teacher asks you to write a giant report about “Who invested in AI in 2025 and what will change in 2026?” You can’t just guess—you need to search, compare, cross-check, and explain clearly.
🥬 The World Before: Before this paper, AI could already write paragraphs and answer short questions, but deep research reports were a much tougher mountain to climb. These reports demand a long journey: plan what to look for, search the web many times, read across news, papers, and reports, compare views, and stitch everything together with evidence and structure. People built “deep research systems” to handle that, but grading them was messy. Benchmarks often depended on humans to carefully write tasks and mark answers, which was slow, costly, and hard to keep updated. Also, many benchmarks graded with one static set of rules for every task, which could be unfair—grading a recipe and grading a history essay the same way misses what each task truly needs. Finally, checking facts often relied only on citations in the report. If a claim had no citation, it was ignored, even if it was important.
🍞 Anchor: Think of a science fair where judges grade a volcano model, a coding project, and a chemistry demo using the same simple rubric and only peek at the parts with sources. You’d miss what really matters for each project, and you might not catch mistakes in uncited parts.
🍞 Hook: You know how… different kids ask different questions? A basketball coach asks for drills; a dietitian asks about food; an engineer asks about chips. Their research needs aren’t the same.
🥬 The Problem: The core challenge was twofold. First, building realistic, complex research tasks without tons of human labor. Second, grading the long reports in a way that adapts to each task’s unique goals and checks facts even when no citation is given. Past attempts leaned on expert-made task lists (time-consuming and narrow), fixed grading dimensions (one-size-fits-all), and citation-only fact checks (blind to uncited claims). The result was benchmarks that didn’t scale easily, didn’t reflect real needs, and could overlook incorrect statements.
🍞 Anchor: It’s like trying to test running, swimming, and chess with the same scoring rules and only checking moves that have footnotes—you’ll get odd scores and miss silent blunders.
🍞 Hook: Imagine a librarian who can instantly invent new, realistic assignments for any student profile—and another librarian who can grade each assignment using the exact rules that matter for it.
🥬 The Gap: The field needed an automated way to (1) create lifelike, hard research tasks for many real-world roles (personas), and (2) evaluate long reports using task-custom rubrics plus active, search-based fact-checking across all claims, not just cited ones.
🍞 Anchor: Picture: a dietitian persona requests a 2025 analysis of plant-based meats in the US and EU with tables and health guidance. The grader then invents custom dimensions like “Classification Rigor” and “Cross-Regional Synthesis,” and the fact checker verifies nutritional and labeling claims whether or not they have citations.
🍞 Hook: You know how when you build a LEGO set, instructions must match that exact set? Generic instructions won’t do.
🥬 The Stakes: Without adaptive grading and active fact-checking, we can overrate pretty-but-misfit reports or let sneaky errors slip by. In real life, these systems inform business strategy, public health, education choices, and policy analysis—mistakes could cost money, trust, or safety.
🍞 Anchor: If a city uses a deep research report to plan e-scooter safety rules, the evaluation must reward the right comparisons and accurate metrics, not just nice writing. That’s why a better benchmark matters.
🍞 Hook: You know how a team needs roles—scout, analyst, coach? Deep research systems do, too.
🥬 New Concept — Deep Research Systems: What it is: Deep research systems are AI agents that plan multi-step web investigations, gather from many sources, cross-check, and write citation-grounded reports. How it works: 1) Plan sub-questions, 2) Search, 3) Read/compare across sources, 4) Synthesize with structure and references, 5) Produce a long report. Why it matters: Without this structure, the AI might guess, skip sources, or write shallow overviews that miss crucial evidence.
🍞 Anchor: Like a student who plans, reads library books, takes notes, compares viewpoints, and writes a careful essay—with references.
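To ground that five-step loop in something concrete, here is a minimal sketch of how such an agent could be wired up. The callables passed in (plan, search, extract, write_report) are hypothetical stand-ins for LLM and web-tool calls, not the interface of any particular system in the paper.

```python
# Minimal sketch of a generic deep research loop.
# All callables are hypothetical stand-ins for LLM and web-tool calls.

def deep_research(task, plan, search, extract, write_report, max_rounds=3):
    evidence = []                                   # (sub-question, passage, source URL) items
    for _ in range(max_rounds):
        open_questions = plan(task, evidence)       # 1) Plan / re-plan sub-questions
        if not open_questions:                      # stop once coverage looks sufficient
            break
        for question in open_questions:
            for result in search(question):         # 2) Multi-round web search
                evidence.extend(extract(question, result))  # 3) Read and keep cited passages
    return write_report(task, evidence)             # 4)-5) Synthesize a structured, cited report
```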
02 Core Idea
🍞 Hook: Imagine a teacher who designs new assignments on the fly for each student, and a super-grader that invents the perfect rubric for each assignment and also double-checks every claim by searching the web.
🥬 The Aha! Moment (one sentence): Automate both sides—create realistic, persona-based deep research tasks and evaluate reports with task-adaptive rubrics plus active, retrieval-driven fact-checking for all claims.
🍞 Anchor: It’s like building a fair, living science fair where tasks match each student’s interests and the judges use the right rules and verify facts beyond what’s footnoted.
Multiple Analogies:
- Toolbox analogy: The framework first builds the right kind of project (task builder), then picks the right tools to grade it (adaptive dimensions), and finally uses a magnifying glass to inspect every claim (active fact-checking).
- Sports analogy: It schedules matches tailored to each player (persona tasks), sets sport-specific scoring rules (adaptive evaluation), and runs video review for all plays, not just flagged ones (active fact-checking).
- Cooking analogy: It writes a recipe based on the diner’s tastes (persona), judges taste using the right criteria for that dish (adaptive rubric), and checks ingredient labels whether or not the chef cited them (active fact-checking).
Before vs After:
- Before: Human-made task lists, fixed grading categories, and citation-only fact checks. Results: expensive, less realistic, and sometimes blind to uncited errors.
- After: Automated persona tasks, task-specific grading dimensions with weights, and web-search-based fact checks of cited and uncited claims. Results: scalable, realistic, fairer, and safer.
Why It Works (intuition):
- Personalization beats one-size-fits-all: If a task demands cross-country policy metrics, the grader adds exactly those dimensions and gives them higher weights.
- Evidence over assumption: Active retrieval doesn’t assume citations are complete; it hunts for proof (or uncertainty) across the web.
- Modularity helps scale: A generator for tasks + an adaptive grader + a fact-checking agent can each improve independently while keeping the whole system robust.
Building Blocks (with sandwich explanations):
- 🍞 Hook: You know how different people need different kinds of research? 🥬 The Concept — Persona-Driven Pipeline: What it is: A generator that creates realistic research tasks anchored in diverse user profiles (personas). How it works: 1) Create personas in many domains, 2) Write tasks that fit each persona’s needs, 3) Enforce requirements like multi-round search, multi-source synthesis, and concrete deliverables, 4) Keep only high-complexity tasks. Why it matters: Without personas, tasks drift into generic or trivial prompts that don’t represent real-world needs. 🍞 Anchor: A city’s transportation planner persona might request e-scooter policy comparisons since 2023 with measurable safety metrics and harmonized recommendations.
- 🍞 Hook: Like a club bouncer checking who really belongs inside. 🥬 The Concept — Task Qualification Filter: What it is: A filter that keeps only tasks that truly require up-to-date info, multi-source integration, deep investigation, and persona fit. How it works: An LLM judge scores each candidate on these criteria and keeps only confident “yes” tasks. Why it matters: Without it, easy or outdated tasks slip in and weaken the benchmark. 🍞 Anchor: A task asking “What is the capital of France?” gets bounced; a task asking for 2024–2025 export-control impacts on IIoT hardware gets in.
- 🍞 Hook: Do you sometimes know the answer without looking it up? Then you don’t need to search. 🥬 The Concept — Search Necessity Filter: What it is: A second filter that removes tasks solvable by internal knowledge alone. How it works: The model tries to answer without web tools; a judge grades the answer’s depth, timeliness, and structure. If it’s already strong, the task is removed. Why it matters: Without this, you waste evaluation on tasks that don’t test real research ability. 🍞 Anchor: “Define GDP” is out; “Compare 2025 GDP growth forecasts across IMF, OECD, and World Bank and explain key disagreements” is in. (A code sketch of both filters appears after this list.)
- 🍞 Hook: You know how a math test and a history essay shouldn’t be graded the same way? 🥬 The Concept — Adaptive Point-wise Quality Evaluation: What it is: A grader that mixes general dimensions (Coverage, Insight, Instruction-following, Clarity) with task-specific dimensions and weights, then scores detailed criteria within each. How it works: 1) Generate task-specific dimensions, 2) Assign weights so they sum to 1, 3) Generate criteria with their own weights, 4) Score each criterion 1–10, 5) Aggregate by weights for a final score. Why it matters: Without adaptation, a report can look good generally but fail the task’s true goals. 🍞 Anchor: For a cross-country policy task, custom dimensions like Comparative Synthesis and Metric Utility get higher weight than usual.
- 🍞 Hook: Like a detective who doesn’t just trust footnotes—she checks the facts herself. 🥬 The Concept — Active Fact-Checking: What it is: An agent that extracts checkable statements and searches the web to verify them, labeling each statement Right, Wrong, or Unknown. How it works: 1) Split the report into parts, 2) Extract verifiable claims (numbers, events, dates, entities), 3) Retrieve evidence from the web, 4) Decide Right/Wrong/Unknown with reasoning and references, 5) Compute the ratio of Right over all statements. Why it matters: Without this, uncited statements can slip through and mislead readers. 🍞 Anchor: If a report claims “SMIC mass-produced 5 nm chips by 2025,” the agent searches reliable sources and may label it Unknown if evidence is insufficient, even if the claim has no citation.
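As promised above, here is a rough sketch of how the Task Qualification Filter and the Search Necessity Filter could be chained. The judge_task, answer_without_search, and judge_answer callables, and the score threshold, are illustrative assumptions rather than the paper's actual prompts or settings.

```python
# Sketch of the two-stage task filtering; callables and threshold are illustrative only.

QUALIFICATION_CRITERIA = (
    "requires up-to-date information",
    "requires multi-source integration",
    "requires deep investigation",
    "fits the stated persona",
)

def keep_task(task, persona, judge_task, answer_without_search, judge_answer,
              strong_answer_threshold=7):
    # Stage 1 - Task Qualification Filter: the judge must confidently say "yes"
    # on every criterion for the task to survive.
    verdicts = [judge_task(task, persona, criterion) for criterion in QUALIFICATION_CRITERIA]
    if not all(v == "yes" for v in verdicts):
        return False

    # Stage 2 - Search Necessity Filter: try to answer with internal knowledge only,
    # then judge the no-search answer on depth, timeliness, accuracy, and structure (1-10).
    closed_book = answer_without_search(task)
    score = judge_answer(task, closed_book)
    # If the closed-book answer is already strong, the task does not test real research.
    return score < strong_answer_threshold

def filter_tasks(candidates, persona, judge_task, answer_without_search, judge_answer):
    return [t for t in candidates
            if keep_task(t, persona, judge_task, answer_without_search, judge_answer)]
```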
03 Methodology
High-level recipe: Input (persona set and task prompts) → Task Construction (persona-driven generation + filters) → Report Collection (run 9 systems on the final 100 tasks) → Agentic Evaluation (adaptive quality + active fact-check) → Outputs (quality scores, factual ratios, diagnostics).
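Expressed as code, that recipe could be orchestrated roughly as in the sketch below; every callable passed in is an illustrative placeholder for one stage of the pipeline, not the authors' implementation.

```python
# High-level orchestration sketch of the recipe above.
# Each callable stands for one pipeline stage; none of this is the authors' actual code.

def run_benchmark(domains, systems, make_personas, make_tasks, passes_filters,
                  score_quality, check_facts):
    # Task Construction: persona-driven generation plus the two filters.
    personas = [p for d in domains for p in make_personas(d)]
    tasks = [t for p in personas for t in make_tasks(p) if passes_filters(t, p)]

    # Report Collection: run every system on the same final task set.
    reports = {(name, task): system(task)
               for name, system in systems.items() for task in tasks}

    # Agentic Evaluation: adaptive quality scoring plus active fact-checking.
    # Outputs: quality scores and factual ratios per (system, task) pair.
    return {
        key: {"quality": score_quality(task=key[1], report=report),
              "factual_ratio": check_facts(report)}
        for key, report in reports.items()
    }
```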
Step-by-step details (what, why, example):
- Persona Synthesis
- What happens: The system creates diverse personas across 10 domains (like Finance, Health, Industrial, Policy, Software) with roles, affiliations, and realistic backgrounds.
- Why this step exists: Personas anchor tasks in the real world and prevent generic prompts. Without personas, tasks can be too easy or irrelevant.
- Example: “Ethan Kim, Industrial IoT Engineer” becomes the seed for a supply-chain disruption analysis in semiconductors.
- Persona-Conditioned Task Construction
- What happens: For each persona, an LLM writes multiple candidate tasks that explicitly demand: multi-round web searches, many credible sources, deep analysis (recent trends, comparisons), and concrete deliverables with time windows.
- Why it exists: Ensures tasks are hard, current, and structured. Without this, you’d get vague or timeless prompts.
- Example data: “Analyze US/EU export controls (Jan 2024–Aug 2025) and China’s countermeasures; deliver scenario analyses, vendor-risk ranking, and a mitigation roadmap.”
- Two-Stage Filtering
- 3a) Task Qualification Filter
- What happens: An LLM judge checks if a task really needs up-to-date info, multi-source integration, deep investigation, and persona-fit. Only high-confidence tasks pass.
- Why it exists: Stops shallow or off-target tasks. Without it, the benchmark’s difficulty drops.
- Example: A “compare 2025 AI safety rules across US, EU, and China” task passes; a timeless “explain what AI is” fails.
- 3b) Search Necessity Filter
- What happens: The model tries answering each remaining task without any external tools. A second judge scores the no-search answer on accuracy, depth, timeliness, professionalism, and structure. If strong, the task is removed (it didn’t require search).
- Why it exists: Confirms the task truly needs research. Without it, tasks test writing, not research.
- Example: “List common data structures” would be filtered out; “Compare 2024–2025 LLM safety incidents and policy responses with sources” remains.
- Human Verification and Final Task Set
- What happens: Domain experts review the filtered tasks. From 155 high-quality tasks, the authors curate 100 to balance cost and coverage.
- Why it exists: Adds a final human sanity check. Without it, a few odd tasks might slip in.
- Example: Experts confirm that tasks really require multi-round search and cross-source synthesis.
- Collect Reports from 9 Systems
- What happens: Each of the 9 deep research systems runs on the same 100 tasks, producing 900 reports total.
- Why it exists: Creates a fair comparison across systems on identical challenges. Without it, results wouldn’t be comparable.
- Example: Systems include Gemini Deep Research, OpenAI Deep Research, Claude Sonnet, Grok, Qwen, Doubao, DeepSeek, Perplexity, and Manus.
- Adaptive Point-wise Quality Evaluation (the grading brain)
- What happens (like a recipe):
- Fixed general dimensions: Coverage, Insight, Instruction-following, Clarity.
- Generate task-specific dimensions: The evaluator analyzes the task to invent 1–3 unique dimensions (for example, Comparative Synthesis, Metric Utility, Policy Pragmatism), each with a short definition.
- Assign weights: All dimensions (general + task-specific) get weights that sum to 1, reflecting what matters most for this particular task.
- Generate criteria per dimension: Each dimension expands into several concrete criteria, each with its own weight (weights within a dimension sum to 1).
- Score criteria: The LLM judge assigns a 1–10 score for each criterion with a short justification.
- Aggregate: For each dimension, criteria scores are combined by their weights; then all dimensions are combined by their dimension weights into the final quality score.
- Why it exists: Generic rubrics can’t capture task-specific success. Without dynamic dimensions and weights, a report might look fine but miss what the task really demanded.
- Example with data: For a cross-country safety policy task, Comparative Synthesis might get a big weight; a report that lists countries separately but doesn’t synthesize would score poorly on that dimension even if Coverage looks good.
- Active Fact-Checking (the detective)
- What happens:
- Segment the long report into manageable parts.
- Extract checkable statements: numbers, dates, events, named entities, etc.
- Retrieve evidence: Use web search and page scraping to gather evidence.
- Label each statement: Right (supported), Wrong (contradicted), or Unknown (insufficient/ambiguous evidence). Include reasoning and source excerpts.
- Compute factual ratio: Right statements divided by all checked statements.
- Why it exists: Citation-only checks miss uncited claims and can confuse “has a citation” with “is true.” Without active retrieval, you risk trusting unsupported text.
- Example: The system flags “two-thirds of the world’s consumer electronics are made in China” and finds recent estimates closer to about 40–45%, labeling the claim Wrong with sources.
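A simplified version of this detective loop, ending with the factual-ratio computation, might look like the following sketch. The extract_claims, search_evidence, and judge_claim callables are assumed stand-ins for the LLM and retrieval calls; the real agent's prompting and evidence handling are more involved.

```python
# Simplified active fact-checking sketch; the three callables are hypothetical
# stand-ins for LLM extraction, web retrieval, and LLM judgment.

def active_fact_check(report, extract_claims, search_evidence, judge_claim):
    labeled = []
    # 1) Segment the long report into manageable parts (here: blank-line blocks).
    for segment in report.split("\n\n"):
        # 2) Extract verifiable statements: numbers, dates, events, named entities.
        for claim in extract_claims(segment):
            # 3) Retrieve evidence via web search and page scraping.
            evidence = search_evidence(claim)
            # 4) Decide "Right", "Wrong", or "Unknown", with reasoning and sources.
            labeled.append((claim, judge_claim(claim, evidence)))

    # 5) Factual ratio: Right statements over all checked statements.
    total = len(labeled)
    right = sum(1 for _, verdict in labeled if verdict == "Right")
    return {"labels": labeled, "factual_ratio": right / total if total else 0.0}
```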
The Secret Sauce (what makes it clever):
- Persona-driven, filter-validated task generation ensures realism, recency, and true research difficulty.
- Task-adaptive grading dimensions and weights make the score match the assignment’s real goals.
- Active, retrieval-based fact-checking guards the entire report, not just cited lines.
- The pipeline is modular and automated, making it updatable (“live”) as the world changes.
Mini sandwich for the key evaluation mechanism:
- 🍞 Hook: You know how teachers design different rubrics for lab reports vs debate essays?
- 🥬 The Concept — Task-Specific Dimensions and Weights: What it is: A system that invents the right grading categories and their importance for each task. How it works: Analyze the task, propose unique dimensions, assign weights, then expand into specific criteria and score them. Why it matters: Without it, scores reward general niceness instead of the exact things the task asked for.
- 🍞 Anchor: In a report requiring “five measurable safety metrics,” the grader heavily weights Metric Utility so hand-wavy metrics score poorly even if writing is smooth.
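To show how such weights roll up into a single number, here is a small self-contained example. The rubric, weights, and 1–10 criterion scores are invented for illustration; only the aggregation rule (criterion weights summing to 1 within each dimension, dimension weights summing to 1 overall) mirrors the scheme described above.

```python
# Invented example rubric: four general dimensions plus two task-specific ones.
# Format: dimension -> (dimension_weight, {criterion: (criterion_weight, score_1_to_10)})
rubric = {
    "Coverage":              (0.15, {"breadth of credible sources": (0.5, 8),
                                     "time window respected":       (0.5, 6)}),
    "Insight":               (0.15, {"non-obvious analysis":        (1.0, 7)}),
    "Instruction-following": (0.10, {"all deliverables present":    (1.0, 6)}),
    "Clarity":               (0.10, {"structure and readability":   (1.0, 8)}),
    "Comparative Synthesis": (0.25, {"cross-country comparison":    (1.0, 4)}),
    "Metric Utility":        (0.25, {"metrics are measurable":      (0.6, 5),
                                     "metrics are comparable":      (0.4, 5)}),
}

def final_quality_score(rubric):
    total = 0.0
    for dim_weight, criteria in rubric.values():
        # Weighted average of criteria inside the dimension (criterion weights sum to 1).
        dim_score = sum(w * s for w, s in criteria.values())
        # Dimensions combine by their own weights (dimension weights sum to 1).
        total += dim_weight * dim_score
    return total

print(final_quality_score(rubric))  # about 5.75 on the 1-10 scale for these made-up scores
```

Because the task-specific dimensions carry the largest weights here, a report that reads smoothly but fails Comparative Synthesis and Metric Utility still lands at a mediocre final score, which is exactly the behavior the adaptive rubric is designed to surface.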
04 Experiments & Results
The Test (what they measured and why):
- They ran 9 well-known deep research systems on the same 100 persona-driven tasks, collecting 900 long reports. Then they used the adaptive quality evaluator to score report quality (with both general and task-specific dimensions) and the active fact-checker to label statements and compute factual ratios. This mirrors real use: produce a research report, judge if it fits the brief well, and verify its truthfulness.
The Competition (who/what was compared):
- Systems included Gemini Deep Research, OpenAI Deep Research, Claude Sonnet, Grok, Qwen, Doubao, DeepSeek, Perplexity, and Manus. These represent a broad slice of current agentic research systems.
The Scoreboard (with context):
- Overall Quality: Gemini Deep Research achieved the highest average quality score (about 8.5 out of 10), which is like getting an A+ when many others got Bs. It led on Coverage, Insight, and Instruction-following, showing strong breadth, reasoning, and adherence to instructions. Claude Sonnet and OpenAI were also strong and balanced.
- Factual Accuracy: Manus achieved the top factual ratio (about 82%), and Gemini and DeepSeek were close behind (roughly 76%+). Think of it as a truth-meter: Manus had the highest share of Right statements among all statements it made. Interestingly, some systems produced many more checkable statements (denser reports) and still kept high factuality (e.g., Gemini), while others were more conservative (fewer statements, e.g., DeepSeek) and stayed accurate.
- Task-Specific Gap: Across all systems, scores on task-specific dimensions were consistently lower than scores on general dimensions. Translation: systems are decent at general writing and structure but struggle to nail the exact things each task uniquely requires. That validates the paper’s key idea: adaptive, task-specific grading catches what fixed rubrics miss.
Surprising/Notable Findings:
- Unknown vs Wrong: Across systems, Unknown labels were more common than outright Wrong ones. This suggests many claims are under-supported rather than blatantly false—a caution for readers to seek sources even when text sounds confident.
- Trade-off Patterns: Some systems wrote denser reports (more statements to check), creating more places to go wrong, yet still kept factuality high; others made fewer, safer claims. Different styles can reach similar levels of truthfulness.
- Stability and Alignment: The quality evaluator’s rankings were stable across multiple runs, and a secondary judge (a different strong model) produced very similar rankings. For fact-checking, agreement with human experts was high, and in re-checks the automated agent was often right when disagreements occurred, largely thanks to its exhaustive web searches.
What this means in plain terms:
- If you want a strong, broadly insightful report, the leaders stood out clearly. If you want very few factual slip-ups, Manus and a couple of others did especially well. But no system aced the task-specific part across the board, showing that “shape-shifting” your writing to match each assignment’s exact goals is still hard for today’s agents.
Concrete analogy: In a decathlon, some athletes are great sprinters (Coverage), others excel at strategy (Insight), but most struggle to perfectly follow event-specific rules (task-specific dimensions) every time. The new scoring makes that visible instead of hiding it under one generic grade.
05 Discussion & Limitations
Limitations (what this can’t do yet):
- Language scope: The current pipeline and evidence sources focus on English, so performance in multilingual, cross-lingual evidence settings remains untested.
- Cost and compute: Adaptive grading and active fact-checking use powerful models and many retrieval calls, which increases time and money costs—challenging for real-time or massive-scale evaluation.
- Evidence availability: Active fact-checking returns Unknown when public sources are thin or behind paywalls; that is honest but can frustrate users who want a firm yes/no.
- Domain edge cases: Highly specialized, proprietary, or emerging topics can be tough to verify reliably on the open web.
Required Resources (to use this framework):
- Access to strong LLMs for rubric construction and scoring.
- A retrieval stack (search APIs, scraping) and enough budget for tool calls.
- Engineering to orchestrate the multi-round evaluation and store structured outputs.
When NOT to Use (or use with caution):
- If your tasks are simple, timeless, or definitional, this is overkill; a normal QA benchmark is cheaper and adequate.
- If you need instant feedback at huge scale, the active agentic setup may be too slow or costly.
- If your domain relies on non-public data, the web-based checker may label too many claims Unknown.
Open Questions (what we still don’t know):
- How to extend to multilingual and cross-lingual retrieval while maintaining reliability and fairness?
- Can we reduce cost by distilling the evaluator and fact-checker into lighter models without losing accuracy?
- How to better separate “truly false” from “currently unknowable” in domains where ground truth shifts fast?
- Can we train research agents directly on feedback from adaptive dimensions and fact-check outcomes to close the task-specific gap?
- How to incorporate uncertainty quantification so reports signal confidence levels more transparently?
06 Conclusion & Future Work
Three-sentence summary: DeepResearchEval automates the creation of realistic, persona-based deep research tasks and evaluates long reports with adaptive, task-specific rubrics plus active web-backed fact-checking. On 900 reports from 9 leading systems, it reveals clear strengths, weaknesses, and a consistent gap on task-specific requirements that fixed rubrics miss. The approach is robust and scalable but currently English-focused and resource-intensive.
Main achievement: Unifying automated task generation with agentic, task-adaptive quality scoring and full-coverage fact verification (including uncited claims), turning long-form research evaluation into a fairer and more realistic “live” benchmark.
Future directions: Expand to multilingual evidence, cut evaluation costs via distillation or caching, integrate uncertainty reporting, and use the adaptive signals to train better research agents. Also, broaden persona libraries to represent more roles and geographies as the world changes.
Why remember this: It shows how to judge research agents the way good teachers judge diverse assignments—by tailoring the rubric to the task and checking the facts everywhere, not just where a citation points. That shift makes evaluations more meaningful and pushes AI systems toward truly reliable, purpose-fit research.
Practical Applications
- Benchmark new deep research agents before deployment using persona-aligned tasks that reflect your users.
- Diagnose system weaknesses by examining low-scoring task-specific dimensions and criteria.
- Reduce hallucinations by training with feedback from active fact-check labels (Right/Wrong/Unknown).
- Continuously monitor system quality over time with newly generated tasks as domains evolve.
- Customize evaluation for your organization by adding personas that match your teams and workflows.
- Prioritize model improvements based on the gap between general and task-specific scores.
- Use Unknown labels to identify where evidence is thin and add source requirements to prompts.
- Compare vendor systems fairly on the same task set with adaptive, transparent scoring and evidence-backed fact checks.
- Build internal dashboards that show coverage, insight, and factual ratios per domain to guide procurement decisions.
- Stress-test agents on time-sensitive topics (e.g., new regulations) to ensure up-to-date retrieval and synthesis.