Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
Key Summary
- This paper builds a live challenge that tests how well Deep Research Agents (DRAs) can write expert-level Wikipedia-style articles.
- Instead of using AI-written references, it uses newly approved Wikipedia Good Articles as gold-standard, expert-checked examples.
- The evaluation has two parts: Wiki Writing (39 fine-grained writing rules) and Wiki Fact (fact coverage and citation support).
- Across 100 recent Good Articles in 15 categories, all DRAs still fall far short of human-written Wikipedia quality.
- Best systems did well at writing style, but even top agents only covered about 31% of Wikipedia’s key facts on average.
- Citation checks showed many models either forgot citations or cited pages that didn’t really support what they wrote.
- A special judge model compared AI articles to the Wikipedia versions, and its decisions matched human judges over 80% of the time.
- The benchmark is kept fresh with recent articles to avoid training-set leakage and to mirror the real, changing web.
- Results highlight common weaknesses: missing detailed data, bias toward trending topics, and trouble following strict neutrality and verifiability rules.
- The released benchmark and tools aim to push safer, more reliable research agents for real-world use.
Why This Research Matters
When people rely on AI to learn or decide, they need both clear writing and verified facts. This benchmark ties AI performance to expert-reviewed Wikipedia standards, so we stop rewarding text that only “sounds smart.” It exposes where today’s agents fall short—especially in covering key facts and backing claims with real sources. By keeping the tasks live and recent, it reflects the real web and reduces shortcutting through training data. The framework guides researchers to build agents that plan better, cite honestly, and stay neutral. In short, it helps move AI from confident-sounding essays to trustworthy reference-quality reports.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re doing a big school report and the teacher says, “Don’t just write nicely—prove every fact with good sources, be fair to all sides, and cover the whole topic.” That’s hard! Now imagine asking a robot helper to do that on its own.
🥬 The Concept (Deep Research Agents, or DRAs):
- What it is: DRAs are AI helpers that search the web, gather information, think through it, and write long reports by themselves.
- How it works: 1) Plan what to look for, 2) Search many sites, 3) Read and take notes, 4) Combine notes into a draft, 5) Check and revise, 6) Add citations.
- Why it matters: Without DRAs doing this well, their reports can miss facts, be biased, or sound confident without proof.
🍞 Anchor: Like a student detective who solves a history mystery by visiting libraries, reading many sources, and writing a fair, well-cited paper.
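To make the six-step loop above concrete, here is a minimal Python sketch. It is not any particular agent's implementation: every step (plan, search, summarize, draft, revise, cite) is a caller-supplied function standing in for the real components, and actual DRAs interleave and repeat these steps rather than running them once.

```python
from typing import Callable

def research_loop(topic: str,
                  plan: Callable, search: Callable, summarize: Callable,
                  draft: Callable, revise: Callable, cite: Callable) -> str:
    """A toy one-pass version of the plan -> search -> read -> draft -> revise -> cite loop."""
    notes = []
    for query in plan(topic):              # 1) plan what to look for
        for page in search(query):         # 2) search many sites
            notes.append(summarize(page))  # 3) read and take notes
    article = draft(topic, notes)          # 4) combine notes into a draft
    article = revise(article, notes)       # 5) check and revise
    return cite(article, notes)            # 6) add citations
```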
🍞 Hook: You know how some trophies mean the winner reached a very high standard? On Wikipedia, one such “trophy” exists for excellent articles.
🥬 The Concept (Wikipedia Good Articles, or GAs):
- What it is: Good Articles are Wikipedia pages that passed a tough human review for clarity, neutrality, completeness, and verifiability.
- How it works: 1) Volunteers write, 2) Reviewers check rules (neutral tone, good sourcing, solid coverage), 3) Articles are revised, 4) Only strong ones earn the GA badge.
- Why it matters: Without a trusted standard, it’s hard to judge if AI writing is really expert-level or just sounds smart.
🍞 Anchor: Like a science fair project that won a ribbon because judges verified the data and the explanation is complete and fair.
🍞 Hook: Before this paper, people often graded DRAs using other AIs as the answer key. That’s like letting a classmate grade your test without a teacher.
🥬 The Problem:
- What it is: Past evaluations used AI-written references or AI-chosen rules, which can be wrong or biased.
- How it works: 1) A strong AI writes a reference, 2) Another AI is compared to it, 3) An AI judge scores it; but none of these steps have expert guarantees.
- Why it matters: If the “answer key” is shaky, scores don’t tell us if an agent matches real expert standards.
🍞 Anchor: If a map is drawn by a beginner, following it might get you lost—even if you walk perfectly.
🍞 Hook: People tried to fix this with different rubrics or synthetic tasks, but something kept breaking.
🥬 Failed Attempts:
- What it is: Prior benchmarks used AI-generated references, broad rubrics, or static datasets.
- How it works: 1) Create a fixed set of topics, 2) Ask models to write, 3) Score with coarse rules or LLM judges.
- Why it matters: Without expert-verified references and fine-grained checks, scores miss key flaws like biased tone or weak citations.
🍞 Anchor: It’s like grading a book report only for neat handwriting, ignoring whether the facts are correct.
🍞 Hook: So what’s missing? A living, expert-checked yardstick that tests real research and writing under strict rules.
🥬 The Gap WLC Fills:
- What it is: A live benchmark using brand-new Wikipedia Good Articles as expert references.
- How it works: 1) Continuously collect recent Good Articles, 2) For each, ask DRAs to write a Wikipedia-style article (without reading that page), 3) Score writing with 39 human-grounded criteria, 4) Check facts for coverage and citation support.
- Why it matters: Now we can clearly see how far AI is from expert-level, in both style and substance.
🍞 Anchor: Like comparing a student’s essay to a medal-winning example and checking every claim has a reliable footnote.
🍞 Hook: Why should you care? Because when people use AI to learn or make decisions, bad facts or biased writing can cause real harm.
🥬 Real Stakes:
- What it is: The quality of AI research affects school learning, news understanding, health decisions, and more.
- How it works: 1) People read AI summaries, 2) If facts are missing or slanted, they may be misled, 3) Trust in AI drops.
- Why it matters: With a strong, expert-grounded test, we can build safer, fairer research agents.
🍞 Anchor: If an AI says a medicine works without good sources, someone could get hurt; strong verification helps prevent that.
02 Core Idea
🍞 Hook: Imagine grading a cooking robot by tasting dishes made by master chefs and checking the recipe card for every ingredient. That’s fair, right?
🥬 The Aha! Moment:
- What it is: Use newly approved Wikipedia Good Articles as expert references and grade DRAs on two fronts: writing quality (39 precise rules) and factual verifiability (coverage of Wikipedia facts and support from cited sources).
- How it works: 1) Gather fresh Good Articles, 2) Have DRAs write on those topics (no peeking), 3) Compare AI vs. GA on writing with an LLM judge, 4) Extract statements and check facts against Wikipedia and the AI’s own citations.
- Why it matters: This ties AI performance to real, human-verified standards—not just to other AI outputs.
🍞 Anchor: Like comparing a student’s report to a teacher-approved exemplar and confirming every footnote actually backs the text.
🍞 Hook (Analogy 1): You know how a spelling bee tests real words from dictionaries, not made-up ones? This benchmark tests against real, reviewed Wikipedia pages.
🥬 The Concept (Analogy Set):
- What it is: A benchmark where the answer key is the best, freshly reviewed encyclopedia pages.
- How it works: 1) The “dictionary” is Wikipedia Good Articles, 2) The “word list” is their facts, 3) The “judge” checks writing rules and fact support.
- Why it matters: False confidence from AI-only keys is replaced by expert-level ground truth.
🍞 Anchor: If a contestant spells a word, we check the real dictionary, not a friend’s notes.
🍞 Hook (Analogy 2): Think of a driving test that checks both your smooth driving and your knowledge of road signs.
🥬 Two-Part Evaluation:
- What it is: Wiki Writing (how well it’s written) and Wiki Fact (how well it’s grounded in facts and citations).
- How it works: 1) 39 writing criteria check clarity, neutrality, and coverage, 2) Fact coverage checks how many key Wikipedia facts the AI includes, 3) Reference accuracy checks whether the AI’s citations truly support its claims.
- Why it matters: Good style without true facts is risky; true facts with messy writing are hard to trust. You need both.
🍞 Anchor: A driver who’s smooth but ignores stop signs is dangerous; a driver who knows rules but can’t control the car is also unsafe.
🍞 Hook (Analogy 3): It’s like a treasure hunt where you must find not just the big gems (main facts) but also the tiny, precise jewels (specific data) and show the map (citations) proving how you got there.
🥬 Before vs. After:
- What it is: Before, evaluations trusted AI-made references and broad rubrics; after, they anchor to expert-verified articles and fine-grained checks.
- How it works: 1) Replace AI references with Good Articles, 2) Use a judge model aligned with human ratings, 3) Check every claim’s support.
- Why it works: Expert-grounded references reduce drift and bias; fresh topics reduce training leakage; fine-grained rules catch subtle writing issues.
🍞 Anchor: Like upgrading from guessing answers with a friend’s notes to checking against the teacher’s official answer sheet and showing your work.
🍞 Hook: Let’s break the idea into simple blocks you can stack like LEGO.
🥬 Building Blocks:
- What it is: Five pieces—Live GA collection, No-Wikipedia constraint, 39-rule writing judge, Fact coverage vs. Wikipedia, Reference accuracy vs. sources.
- How it works: 1) Continuously gather new Good Articles, 2) Ask DRAs to write without reading that exact page, 3) LLM judge compares writing to GA on 39 criteria, 4) Extract and match facts to Wikipedia for coverage, 5) Fetch cited pages and verify support.
- Why it matters: Each block closes a loophole—no peeking, no vague judging, no unsupported claims.
🍞 Anchor: A sturdy bridge needs every piece; remove one beam (like citation checks), and trust collapses.
03 Methodology
🍞 Hook: Picture a cooking show where contestants must recreate a master chef’s dish without seeing the original recipe—but judges have the authentic dish and its recipe card.
🥬 High-Level Pipeline:
- What it is: Input → Collect Good Articles → Create tasks → Generate AI articles → Evaluate writing → Evaluate facts → Output scores.
- How it works: 1) Curate fresh GA references, 2) Prompt DRAs to write on those topics (no reading that GA page), 3) Judge writing quality with 39 criteria, 4) Extract statements, 5) Check coverage against Wikipedia facts, 6) Verify claims against cited sources.
- Why it matters: This sequence ensures both style and substance are tested with trusted references.
🍞 Anchor: Like challenging bakers to recreate a dessert, then tasting, inspecting texture, and confirming ingredients match the master recipe.
🍞 Hook: First, we need the expert yardstick.
🥬 Step A: Live Task Collection with Good Articles (GAs)
- What it is: Gather 100 recent Wikipedia Good Articles across 15 categories as expert references.
- How it works: 1) Collect new pages from March–December 2025, 2) Filter those that passed human Good Article review, 3) Prefer deeper, well-cited articles, 4) Continuously refresh to avoid training leakage.
- Why it matters: Fresh, expert-reviewed pages prevent shortcuts and keep the test realistic.
🍞 Anchor: Like using the latest edition of an encyclopedia, not last year’s, to check reports.
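For intuition, here is a rough sketch of how recently promoted Good Articles could be pulled from the public MediaWiki API. This is an assumption about one possible collection step, not the paper's exact pipeline, which also balances categories, prefers deeper and better-cited articles, and refreshes continuously.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_good_articles(limit: int = 50) -> list[str]:
    """Return titles most recently added to Category:Good articles (newest first)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Good articles",
        "cmsort": "timestamp",  # sort by when the page entered the category
        "cmdir": "desc",        # newest additions first
        "cmlimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return [m["title"] for m in resp.json()["query"]["categorymembers"]]
```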
🍞 Hook: Next, give the assignment to the AI—but with a key rule.
🥬 Step B: DRA Task Setup (No-Peeking Rule)
- What it is: Ask each agent to write a Wikipedia-style article on the GA topic without opening that exact Wikipedia page.
- How it works: 1) Provide the topic and a summary of the Good Article criteria, 2) Explicitly forbid reading the target Wikipedia page, 3) Allow open web research elsewhere.
- Why it matters: If the agent reads the answer page, we don’t learn its real research ability.
🍞 Anchor: It’s like telling a student to research from books and journals, but not copy from the example essay.
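A hypothetical task prompt illustrating the no-peeking setup; the benchmark's exact wording is not reproduced here, and the criteria summary is paraphrased.

```python
# Illustrative prompt template (assumed wording, not the benchmark's verbatim prompt).
TASK_PROMPT = """Write a Wikipedia-style article on: {topic}

Follow the Good Article criteria: a clear lead section, neutral tone,
verifiable claims with inline citations, and broad coverage of the topic.

You may research the open web, but you must NOT open or cite the Wikipedia
page for "{topic}" itself."""

def build_task(topic: str) -> str:
    return TASK_PROMPT.format(topic=topic)
```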
🍞 Hook: Now we need a careful teacher to compare the writing to the exemplar.
🥬 Step C: Wiki Writing (39 Fine-Grained Criteria)
- What it is: A writing quality judge that compares the AI article to the GA on clarity, neutrality, and coverage.
- How it works: 1) Use an LLM-as-a-Judge with strong human agreement, 2) Apply 39 criteria grounded in Wikipedia rules (lead quality, tone, words to watch, scope, due weight), 3) For each criterion, pick a winner (GA or AI), 4) Aggregate wins for a total writing score.
- Why it matters: Coarse grades miss subtle problems; fine-grained checks catch puffery, biased wording, missing sections, and off-scope tangents.
🍞 Anchor: Like grading a report line-by-line: Is the intro clear? Are claims neutral? Are key subtopics covered?
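The aggregation in Step C can be sketched as below. The criterion names and the per-criterion judging call are placeholders; only the win-counting logic is shown.

```python
def writing_score(verdicts: dict[str, str]) -> float:
    """verdicts maps each of the 39 criteria to its winner: 'AI' or 'GA'."""
    wins = sum(1 for winner in verdicts.values() if winner == "AI")
    return wins / len(verdicts)

# Toy example: the AI article wins 12 of 39 criteria -> score of about 0.31.
toy_verdicts = {f"criterion_{i}": ("AI" if i < 12 else "GA") for i in range(39)}
print(writing_score(toy_verdicts))
```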
🍞 Hook: Writing nicely isn’t enough—you must include the important facts.
🥬 Step D: Wiki Fact—Coverage vs. Wikipedia
- What it is: Measures how many of the reference article’s key facts the AI actually included.
- How it works: 1) Extract a list of facts from the Wikipedia GA, 2) Extract statements from the AI article, 3) For each Wikipedia fact, find the top AI statements and check if they’re consistent, 4) Average results to get a coverage score.
- Why it matters: If big chunks of truth are missing, readers won’t learn the full story.
🍞 Anchor: Like checking if a volcano report includes location, eruption history, causes, and effects—not just a definition.
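A minimal sketch of the coverage computation, assuming facts have already been extracted from the Good Article and statements from the AI article. The `entails(statement, fact)` checker is a placeholder for the LLM-based consistency check.

```python
from typing import Callable

def fact_coverage(wiki_facts: list[str],
                  ai_statements: list[str],
                  entails: Callable[[str, str], bool]) -> float:
    """Fraction of Good Article facts that some AI statement is consistent with."""
    if not wiki_facts:
        return 0.0
    covered = 0
    for fact in wiki_facts:
        # The real pipeline checks only the top-retrieved AI statements per fact;
        # this sketch scans all of them for simplicity.
        if any(entails(stmt, fact) for stmt in ai_statements):
            covered += 1
    return covered / len(wiki_facts)
```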
🍞 Hook: Even if you mention a fact, can you prove it?
🥬 Step E: Wiki Fact—Reference Accuracy vs. Cited Sources
- What it is: Checks whether AI statements are truly supported by the web pages they cite.
- How it works: 1) For each (statement, URL) pair, fetch the source content, 2) Use a fact-checking model to verify the statement is consistent, 3) Compute the fraction supported.
- Why it matters: Citations that don’t back the text are like saying “my friend said so”—not reliable.
🍞 Anchor: Like flipping to page 42 of a book to see if the quoted sentence is really there.
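A best-effort sketch of the reference-accuracy check. `is_supported` stands in for the fact-checking model; skipping unreachable pages (e.g., paywalls) is this sketch's choice, and the paper treats such inaccessible sources as a limitation of the metric.

```python
import requests
from typing import Callable

def reference_accuracy(claims: list[tuple[str, str]],
                       is_supported: Callable[[str, str], bool]) -> float:
    """claims is a list of (statement, cited_url) pairs."""
    checked, supported = 0, 0
    for statement, url in claims:
        try:
            page_text = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue  # unreachable source: skip rather than guess
        checked += 1
        if is_supported(statement, page_text):
            supported += 1
    return supported / checked if checked else 0.0
```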
🍞 Hook: Judges must be trustworthy too.
🥬 Step F: Choosing Judge Models and Ensuring Reliability
- What it is: Selecting an LLM judge whose decisions match human experts well.
- How it works: 1) Test several candidate judge models, 2) Compare their decisions with human annotations, 3) Pick the one with the highest agreement at a reasonable cost.
- Why it matters: If the judge is unreliable, the whole scoreboard is shaky.
🍞 Anchor: Like picking a referee known for fair calls.
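Judge reliability reduces to simple agreement with human labels, as in the sketch below. The labels are illustrative toy values, not the paper's annotation data.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the LLM judge and the human annotator agree."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Toy example: agreement on 4 of 5 items -> 0.8
judge = ["AI", "GA", "GA", "AI", "GA"]
human = ["AI", "GA", "AI", "AI", "GA"]
print(agreement_rate(judge, human))
```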
🍞 Hook: What if someone tries to peek at the answers?
🥬 Step G: Handling Wikipedia Leakage
- What it is: A defense against agents that sneak in the target Wikipedia page.
- How it works: 1) For coverage scoring, ignore statements that cite the target Wikipedia page, 2) Measure how often each model leaks by citing it anyway, 3) Report leakage rates.
- Why it matters: Prevents cheating from inflating scores.
🍞 Anchor: Like ignoring points earned by copying from the answer sheet and noting who tried.
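One way to implement the leakage control is sketched below: statements citing the target Wikipedia page are set aside before scoring, and the leakage rate is reported separately. The URL-matching heuristic is an assumption, not the paper's exact rule.

```python
def split_leaked(statements: list[tuple[str, str]], target_title: str):
    """statements is a list of (text, cited_url) pairs for one AI article."""
    target_slug = target_title.replace(" ", "_").lower()
    leaked, clean = [], []
    for text, url in statements:
        if "wikipedia.org" in url.lower() and target_slug in url.lower():
            leaked.append((text, url))   # cited the target page: excluded from scoring
        else:
            clean.append((text, url))
    leakage_rate = len(leaked) / len(statements) if statements else 0.0
    return clean, leakage_rate
```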
🍞 Secret Sauce:
- What it is: Combining fresh, expert references; strict writing rules; dual factual checks; trustworthy judges; and leakage controls.
- How it works: Each piece patches a known weakness—together they make a robust test that mirrors real research demands.
- Why it matters: This design turns vague “good writing” into measurable, expert-aligned progress benchmarks.
🍞 Anchor: A tight basketball defense wins not by one superstar but by five players covering every gap.
04 Experiments & Results
🍞 Hook: Think of a league where teams play the same tough schedule, and referees use a precise rulebook. Now we can trust the scoreboard.
🥬 The Test:
- What it is: 100 recent Wikipedia Good Articles across 15 categories (like History, Natural Sciences, Music) as the gold standard.
- How it works: 1) Each DRA writes a Wikipedia-style article on those topics (no peeking), 2) Evaluate with 39 writing criteria, 3) Measure fact coverage vs. Wikipedia, 4) Check citation support.
- Why it matters: Tests both how well they write and how much true, supported information they deliver.
🍞 Anchor: Like judging gymnastics on both artistry and technical difficulty.
🍞 Hook: Who entered the tournament?
🥬 The Competition:
- What it is: A mix of leading proprietary systems and open-source frameworks.
- How it works: Proprietary: Gemini-3-pro Deep Research, Gemini-2.5-pro Deep Research, OpenAI o3 Deep Research, Qwen-3-max, Perplexity, Grok, Doubao. Open-source: Deep Researcher, Tongyi DeepResearch, and LangChain Open Deep Research (with GPT-4.1 or GPT-5 backends).
- Why it matters: Shows the gap between cutting-edge systems and community tools, and what’s still missing for all of them.
🍞 Anchor: Like a track meet with world-class sprinters and strong local clubs on the same track.
🍞 Hook: What did the scoreboard say?
🥬 The Scoreboard (with context):
- Writing Quality (39-rule wins): Top performers were Gemini-3-pro Deep Research and LangChain (GPT-5), clearly ahead of others; many open-source-only agents lagged far behind. This is like scoring an A compared to many Bs and Cs, with some getting Ds.
- Fact Coverage vs. Wikipedia: Even the best agent (Gemini-2.5-pro Deep Research) covered only about 31% of Wikipedia’s key facts on average—like answering just 3 out of 10 essential questions. Most others did worse.
- Reference Accuracy (Do citations truly support claims?): LangChain (GPT-5) and Gemini-3-pro Deep Research were strongest (about two-thirds of claims supported), while some systems cited less or linked to pages that didn’t back the statements.
🍞 Anchor: A student with great grammar who only studied one-third of the material still misses many points.
🍞 Hook: Any surprises?
🥬 Surprising Findings:
- Structured vs. Specialized: Agents do better on general or procedural sections (like “Methods”) but stumble on specialized details (like “Phylogeny,” precise numbers, or niche terms). It’s like knowing the rules of the game but forgetting player stats.
- Conflict Patterns: Some models rarely contradicted their own citations but often conflicted with Wikipedia (suggesting they retrieved weak or wrong sources). Others had more conflicts with their own references (suggesting hallucinations or sloppy citing).
- Judge Reliability: The chosen judge model’s decisions matched human experts over 80% of the time—strong evidence the writing scores are meaningful.
🍞 Anchor: In the “Parasitic Ant” case, agents easily explained what parasitism is but missed fine-grained, expert facts—like mixing up the big picture with missing puzzle pieces.
🍞 Hook: Do categories change difficulty?
🥬 Category Effects:
- What it is: Performance varies by topic area.
- How it works: History and Mathematics were hardest (average writing wins under 20%), while Natural Sciences and Philosophy/Religion were easier (often over 40%). Difficulty did not depend much on article length or number of links; it correlated moderately with page views (popular topics are easier to research online).
- Why it matters: Research difficulty is about how findable, reliable, and synthesizable the web information is, not just how long the article is.
🍞 Anchor: It’s not the thickness of the book—it’s whether the library has clear, credible sources for that subject.
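For readers curious how such a correlation would be checked, here is a small sketch using Spearman rank correlation from SciPy. The numbers are purely hypothetical illustrations, not the paper's data.

```python
from scipy.stats import spearmanr

page_views = [12000, 800, 450000, 3200, 90000]  # hypothetical monthly views per topic
writing_wins = [0.35, 0.10, 0.48, 0.18, 0.41]   # hypothetical per-topic writing win rates

rho, p_value = spearmanr(page_views, writing_wins)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
```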
🍞 Hook: What about cheating and costs?
🥬 Leakage and Practicalities:
- Leakage: Some models still cited the target Wikipedia page (e.g., over 30% of statements for one system), but even then, overall quality remained subpar—so simply reading Wikipedia isn’t enough to meet GA standards.
- Cost and Operations: Collecting articles across systems involved a mix of API costs and small human collection fees; the benchmark is designed to be reproducible and cost-aware.
🍞 Anchor: Copying answers doesn’t guarantee an A if you still miss key rules like neutrality and verifiability.
05 Discussion & Limitations
🍞 Hook: Even a strong magnifying glass has blind spots; let’s honestly look at what this benchmark can and can’t do.
🥬 Limitations:
- What it is: Boundaries of the current setup.
- How it works: 1) The dataset stays in the low hundreds of articles because it must remain fresh and post-date model training, 2) Some proprietary systems’ citation behavior is opaque, and some sources can be inaccessible (e.g., paywalls), 3) Reference Accuracy is thus a best-effort estimate, not a perfect ground-truth of “is everything fully grounded.”
- Why it matters: Scores—especially citation support—should be read as indicators, not absolute truths.
🍞 Anchor: Like weighing yourself on a home scale—it’s useful to track trends, even if it’s not a medical-grade device.
🍞 Hook: What does it take to run this?
🥬 Required Resources:
- What it is: Tools and costs needed.
- How it works: 1) Access to judge and extraction LLMs, 2) Web access to fetch citations, 3) Periodic data refresh to stay live, 4) Some human effort to collect and sanity-check outputs.
- Why it matters: Teams can reproduce or extend the benchmark without massive infrastructure.
🍞 Anchor: Like running a science fair—you need judges, updated rules, and a steady supply of projects.
🍞 Hook: When might this not be the right measuring stick?
🥬 When NOT to Use:
- What it is: Mismatch cases.
- How it works: 1) Creative writing or opinionated essays (neutrality rules don’t fit), 2) Real-time breaking news (citations shift too fast), 3) Non-text outputs like code, data visualizations, or interactive tools where GA criteria don’t apply.
- Why it matters: Pick the right ruler for the job; this one is for neutral, well-sourced, comprehensive encyclopedia-style writing.
🍞 Anchor: You don’t use a thermometer to measure how far you ran.
🍞 Hook: What questions still need answers?
🥬 Open Questions:
- What it is: Next puzzles to solve.
- How it works: 1) Can DRAs learn to plan for deep coverage of hard, niche facts? 2) How to better detect subtle bias while staying concise? 3) Can agents self-audit citations and fix unsupported claims automatically? 4) How to robustly prevent and detect Wikipedia leakage at scale? 5) Can we expand from text-only to multimedia evidence while preserving fairness?
- Why it matters: Solving these raises the ceiling for safe, reliable research agents.
🍞 Anchor: Like moving from passing a driving test to learning advanced defensive driving.
06 Conclusion & Future Work
🍞 Hook: Imagine a report card that finally tests both how nicely you write and how true your facts are—checked against expert encyclopedias.
🥬 Three-Sentence Summary:
- What it is: Wiki Live Challenge (WLC) is a live benchmark using newly reviewed Wikipedia Good Articles to evaluate research agents.
- How it works: It scores writing quality with 39 precise rules and factual strength by checking coverage of Wikipedia facts and whether citations truly support claims.
- Why it matters: Results show a big gap between current AI and expert-level Wikipedia writing, guiding researchers toward safer, more trustworthy systems.
🍞 Anchor: Like grading a science report with both a style rubric and lab data verification.
🥬 Main Achievement:
- What it is: An expert-grounded, fine-grained, and reproducible way to measure if DRAs can meet Wikipedia’s tough standards.
- How it works: Live GA references, strict writing criteria, dual factual checks, reliable judging, and leakage controls.
- Why it matters: Establishes a clear, credible target for progress.
🍞 Anchor: A lighthouse that shows ships exactly where the rocks are—and the path forward.
🥬 Future Directions:
- What it is: Enhancements on deck.
- How it works: 1) Scale categories and article counts while staying fresh, 2) Improve citation-grounding checks (e.g., robust against paywalls), 3) Add self-repair loops for unsupported claims, 4) Extend to multimedia evidence and non-English Wikipedias.
- Why it matters: Makes agents more globally useful and resilient.
🍞 Anchor: From a good ruler to an entire toolbox for building reliable research assistants.
🥬 Why Remember This:
- What it is: The lasting impact.
- How it works: Anchoring AI evaluation to human expert standards changes the game from “sounds smart” to “is truly reliable.”
- Why it matters: It’s a key step toward AI you can safely trust for learning, decisions, and discovery.
🍞 Anchor: Not just a louder microphone—this gives the speaker facts worth amplifying.
Practical Applications
- Train DRAs to plan research specifically for high fact coverage, not just fluent writing.
- Add a self-check loop where agents flag and fix unsupported claims before finalizing reports.
- Use the 39 writing criteria as a style guide for neutral, encyclopedic corporate or educational content.
- Integrate citation verifiers that fetch and confirm support text prior to publication.
- Adopt leakage monitors in pipelines to prevent models from using restricted sources.
- Benchmark in-house research agents on recent topics to track real-world readiness.
- Prioritize retrieval improvements for specialized sections (e.g., phylogeny, precise stats).
- Tune judge models against human raters for reliable automatic evaluation at lower cost.
- Create curriculum learning for agents: pass writing criteria first, then master fact coverage.
- Extend the framework to multilingual Wikipedias for global research tooling.