Dr. Zero: Self-Evolving Search Agents without Training Data
Key Summary
- •Dr. Zero is a pair of AI agents (a Proposer and a Solver) that teach each other to do web-search-based reasoning without any human-written training data.
- •The Proposer makes challenging but solvable questions; the Solver tries to answer them using a search engine, and their successes and failures become training signals.
- •A new training trick, HRPO (Hop-Grouped Relative Policy Optimization), groups questions by how many reasoning steps (“hops”) they require to create fairer comparisons and cut compute costs.
- •A difficulty-guided reward encourages questions that are neither too easy nor impossible, pushing both agents to improve together like a self-made curriculum.
- •The Solver is trained with GRPO, which strengthens answer strategies by comparing groups of attempts on the same question, no extra value model needed.
- •On seven open-domain QA benchmarks, Dr. Zero matches or beats strong supervised baselines in many cases, despite using zero training data.
- •Dr. Zero shows especially strong gains on single-hop tasks and competitive results on multi-hop reasoning, with larger models benefiting more from harder synthetic questions.
- •HRPO achieves similar or better accuracy than the heavier GRPO-style proposer training while using about a quarter of the rollouts.
- •Early training brings fast improvements, then performance plateaus, highlighting stability and curriculum-tuning as next challenges.
- •This work suggests that advanced search-and-reason agents can emerge through self-evolution alone, reducing dependence on scarce, expensive datasets.
Why This Research Matters
This work shows we can build strong, search-savvy AI assistants without costly, human-labeled datasets. That means organizations with limited data or budget can still train capable agents for research, customer support, or education. Because the system grounds itself in real documents through search, its answers are verifiable and less likely to drift into fabrications. The new HRPO method makes training cheaper and more practical, opening the door to frequent updates and specialization. Finally, by carefully rewarding “hard but solvable” questions, Dr. Zero keeps learning at its sweet spot, making continuous self-improvement a realistic path forward.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how two friends can get better at chess by playing each other over and over, even without a coach? They improve just by creating challenges for one another.
🥬 The Concept (Self-Evolution): Self-evolution is when an AI improves by creating its own practice problems and learning from them. How it works: (1) The AI proposes tasks, (2) it tries to solve them, (3) it scores itself, and (4) it updates how it asks and answers. Why it matters: Without self-evolution, AIs rely on lots of human-made examples, which are expensive and limited. 🍞 Anchor: Think of a spelling bee where kids make practice words for each other. They learn new words just from that back-and-forth.
🍞 Hook: Imagine having a super-smart library helper who looks up facts on the web when your memory is fuzzy.
🥬 The Concept (Search-Augmented LLMs): A search-augmented LLM is a language model that can call a search engine to fetch up-to-date facts. How it works: (1) Read the question, (2) decide what to search, (3) retrieve documents, (4) reason over the results, (5) answer. Why it matters: Without search, the model might guess or rely on outdated memory. 🍞 Anchor: When you ask, “Who won the latest World Cup?”, the AI searches and then answers accurately instead of guessing.
🍞 Hook: Picture a treasure hunt where each clue leads to another clue before the treasure.
🥬 The Concept (Multi-Hop Reasoning): Multi-hop reasoning means connecting several facts step by step to reach the final answer. How it works: (1) Understand the first clue, (2) find the next related fact, (3) repeat across multiple “hops,” (4) combine them to answer. Why it matters: Without multi-hop skills, the AI solves only simple, one-step questions. 🍞 Anchor: To answer “Which city is the birthplace of the author of book X?”, you first find the author, then find their birthplace.
🍞 Hook: You know how video games give you more points for beating harder levels?
🥬 The Concept (Reinforcement Learning Rewards): In reinforcement learning, rewards are points that tell an AI how well it did. How it works: (1) The AI tries something, (2) it gets a score, (3) it changes its strategy to get higher scores next time. Why it matters: Without rewards, the AI can’t tell what to improve. 🍞 Anchor: If getting the right answer gives +1 and a wrong answer gives 0, the AI learns to aim for +1 more often.
🍞 Hook: Imagine a classroom with two roles: one student writes practice questions, the other solves them. Both get better together.
🥬 The Concept (Proposer–Solver Framework): The Proposer creates questions; the Solver answers them using search and reasoning. How it works: (1) Proposer writes a question with a clear answer, (2) Solver searches and answers, (3) feedback updates both, (4) repeat with tougher questions. Why it matters: Without a good question-maker, the solver never faces the right mix of challenges. 🍞 Anchor: One kid makes math problems of just the right difficulty; the other gets better at solving them. Next round, problems get a bit harder.
Before this paper, many AI systems that used search and reasoning leaned heavily on human-written training data. Some self-evolving attempts worked in narrow areas, like math or coding, where questions are simpler and easy to check. But open-domain search—answering everything from pop culture to science across the web—was a different beast. The questions needed to vary a lot, be multi-hop, and be verifiable through search. Two big roadblocks kept progress slow: low question diversity (most auto-made questions were too simple) and high compute costs (training needed lots of nested samples: many questions and many answers per question).
🍞 Hook: Think of trying to judge races between runners of very different ages without grouping them—you’ll get unfair results.
🥬 The Concept (Nested Sampling Problem): Nested sampling is when training requires multiple questions and multiple answers per question to compute stable learning signals. How it works: (1) Sample many candidate questions, (2) for each, sample many solver attempts, (3) use the averages as baselines, (4) update. Why it matters: Without this, the learning signal can be too noisy; with it, compute becomes huge and slow. 🍞 Anchor: Running 20 laps just to grade one drill is tiring—that’s what nested sampling feels like in compute.
🍞 Hook: Think of a staircase that automatically adds steps as you grow stronger.
🥬 The Concept (Automated Curriculum): An automated curriculum adjusts difficulty as the learner improves. How it works: (1) Start with easy tasks, (2) monitor success, (3) raise difficulty as performance rises, (4) keep tasks solvable but challenging. Why it matters: Without it, learning stalls—tasks are either too easy (no growth) or too hard (no success). 🍞 Anchor: In a language app, once you master basic verbs, it moves you to verb tenses.
The gap this paper fills: It shows how to build a data-free, open-domain, multi-turn search agent that creates its own questions, checks them using a real search engine, and learns effectively without human-written datasets—while also solving the compute barrier.
Why this matters to daily life: Better self-improving search agents can help students study, professionals research, and everyone answer tricky, multi-step questions without needing massive labeled datasets—making smart tools cheaper, faster to build, and more up-to-date.
02 Core Idea
🍞 Hook: Imagine two rock climbers roped together—when one climbs higher, the other is pulled up too.
🥬 The Concept (Key Insight in One Sentence): Dr. Zero links a Proposer and a Solver in a self-evolution loop, using an external search engine and a difficulty-guided reward so that better solving pushes better question-making, which in turn pushes better solving—no human data needed. How it works: (1) Start both from the same base LLM, (2) Proposer makes questions and answers while using search, (3) Solver tries to answer them with search, (4) Rewards encourage solvable-but-hard questions, (5) Optimize Proposer with HRPO and Solver with GRPO, (6) Iterate. Why it matters: Without this loop, either the questions stay too easy or the solver doesn’t encounter the right skills to practice. 🍞 Anchor: As the Solver gets good at two-step riddles, the Proposer begins crafting three-step riddles, so both keep growing.
Three analogies:
- Study buddies: One writes practice exams, the other takes them. Scores guide future quizzes to be just tough enough—both improve.
- Gym training: The trainer (Proposer) sets weights; the athlete (Solver) lifts. If the lifts are too easy or impossibly heavy, gains stall. The trainer adjusts to the sweet spot.
- Video game difficulty: Beat a level and the game bumps you up; fail too often and it nudges you down. You hover at your learning edge.
🍞 Hook: You know how teachers group students by skill so progress stays fair?
🥬 The Concept (HRPO – Hop-Grouped Relative Policy Optimization): HRPO is a training method that computes learning signals by grouping questions with similar hop counts. How it works: (1) Make one question per prompt, (2) run a few Solver attempts to score it, (3) group scores by hop (1-hop, 2-hop, etc.), (4) compute a relative score within each group, (5) update the Proposer. Why it matters: Without grouping, you need lots of nested samples or your learning signal is too noisy; HRPO keeps signals stable with less compute. 🍞 Anchor: It’s like ranking 100-meter times only against other 100-meter runners, not against marathoners.
🍞 Hook: Think of a game that gives you the most points when you barely, but definitely, win.
🥬 The Concept (Difficulty-Guided Reward): This reward gives the highest points when exactly one of several Solver tries gets the right answer, signaling “hard but solvable.” How it works: (1) The Proposer makes a question with an answer, (2) the Solver tries n times, (3) reward is high if 0<correct<n and peaks near just one correct, (4) add a format bonus for well-structured questions. Why it matters: Without it, questions drift to trivial or impossible, and learning stalls. 🍞 Anchor: If five students try a puzzle and only one solves it, the puzzle is great practice—not a giveaway and not unfair.
🍞 Hook: Imagine a science fair where you win by proving your claim with sources.
🥬 The Concept (Verifiable Questions with Search): Proposer is nudged to use the search tool so every question has an answer that can be checked externally. How it works: (1) Start from a seed clue, (2) perform multi-turn searches, (3) synthesize a single, canonical answer, (4) ensure answer can be rediscovered by Solver. Why it matters: Without verifiability, self-training can drift into made-up facts. 🍞 Anchor: A question like “What city is author X from?” can be traced to a biography page the Solver can also find.
Before vs. After:
- Before: Data-free methods succeeded mostly in narrow domains, often making simple one-hop questions, and were very computationally heavy when extended to search agents.
- After: Dr. Zero generates diverse, multi-hop, verifiable questions and trains efficiently. The Solver becomes competitive with supervised systems, even surpassing them on some benchmarks, without using human-written training data.
Why it works (intuition):
- The loop is balanced: If questions get too easy, rewards drop; if too hard, rewards drop. The best reward sits at the learning edge.
- Grouping by hop count keeps training fair, stabilizing updates even with just one proposed question per prompt.
- Search-verified answers anchor the whole loop in reality, avoiding fantasy facts.
Building blocks:
- Base LLM used twice (once as Proposer, once as Solver).
- External search engine to ground both in real documents.
- Difficulty-guided reward to shape the question landscape.
- HRPO to train the Proposer efficiently by hop-based grouping.
- GRPO to train the Solver from grouped outcomes without a critic model.
- Alternating mini-iterations (about 50 steps each) to let each side pull the other up.
03 Methodology
High-level recipe: Input → Proposer (makes one question with tool-use) → Score with Solver tries → Update Proposer via HRPO → Generate a batch of QA pairs → Train Solver via GRPO on that batch → Iterate loop.
Step 1: Initialize two agents and the search tool
- What happens: Start from the same base LLM to create a Proposer (question-maker) and a Solver (question-answerer). Attach an external search engine (e.g., Wikipedia index + embeddings) so both can retrieve facts.
- Why it exists: Without the search tool, answers might rely on guesswork rather than verifiable sources.
- Example: The system can look up “Who founded Pixar?” instead of guessing.
🍞 Hook: Imagine a riddle-maker who always starts from a real encyclopedia page, then builds a multi-step clue trail.
🥬 The Concept (Proposer with Multi-Turn Tool-Use): The Proposer crafts a question by first reading seed text, then performing searches to ensure the answer is grounded and discoverable. How it works: (1) Pick an entity in the seed, (2) search for related facts, (3) chain 1–4 hops, (4) record a single, canonical answer. Why it matters: Without disciplined tool-use, questions may be vague, non-verifiable, or too easy/hard. 🍞 Anchor: Starting from “Marie Curie,” the Proposer searches to build a 3-hop question that ends at “University of Paris.”
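To make the hop-chaining idea concrete, here is a tiny, runnable Python sketch. The "search engine" is just a hand-made dictionary, and the entities, relations, and helper names are invented for illustration; the paper's actual Proposer works over a real retriever and free-form text.

```python
# Toy, illustrative example: entities, relations, and helper names are invented.
# A real system would call an external search engine instead of this dictionary.
TOY_INDEX = {
    "Marie Curie": {"studied at": "University of Paris"},
    "University of Paris": {"located in": "Paris"},
    "Paris": {"country": "France"},
}

def toy_search(entity):
    """Return known facts about an entity (empty dict if nothing is found)."""
    return TOY_INDEX.get(entity, {})

def propose_chain(seed, hops):
    """Chain up to `hops` retrieved facts from `seed`; return (question, answer)."""
    entity, steps = seed, []
    for _ in range(hops):
        facts = toy_search(entity)
        if not facts:
            break
        relation, entity = next(iter(facts.items()))  # follow one retrieved fact
        steps.append(relation)
    question = f"Starting from {seed}, follow: {' -> '.join(steps)}. What do you reach?"
    return question, entity  # the last entity reached is the canonical answer

print(propose_chain("Marie Curie", hops=3))
# -> a 3-hop question whose verifiable answer is "France"
```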
Step 2: Score a single proposed question efficiently
- What happens: For each proposed question, the Solver attempts n answers (group size like 5). Count how many are correct (k). Compute the difficulty-guided reward: it’s high when 0<k<n and peaks near k=1. Add a format reward to ensure proper, structured questions.
- Why it exists: This shapes the Proposer to target the “learning edge”: not trivial, not impossible.
- Example: If the Solver gets 0 right: too hard (low reward). If 5/5: too easy (low reward). If 1/5: just right (high reward).
🍞 Hook: Think of judging a math problem by how many classmates solved it: one person solving means good challenge.
🥬 The Concept (Difficulty-Guided Reward Details): It’s a scoring rule that favors questions that are challenging but solvable. How it works: (1) Run multiple Solver attempts, (2) compute k, the number correct, (3) reward is zero if k=0 or k=n, and otherwise proportional to how close k is to 1. Why it matters: Keeps the curriculum in the sweet spot. 🍞 Anchor: Out of five tries, exactly one success means the question earns near-maximum points.
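Here is a minimal Python sketch of what such a scoring rule could look like. The exact functional form and the size of the format bonus are assumptions for illustration, not the paper's formula.

```python
def difficulty_reward(k: int, n: int, format_ok: bool = True,
                      format_bonus: float = 0.1) -> float:
    """Hedged sketch of a difficulty-guided reward.

    k: number of correct Solver attempts out of n.
    Zero difficulty reward for unsolved (k == 0) or trivial (k == n) questions;
    otherwise highest at k == 1 and shrinking as k grows. The exact shape and
    the bonus value are assumptions, not the paper's formula.
    """
    if k <= 0 or k >= n:
        difficulty = 0.0
    else:
        difficulty = (n - k) / (n - 1)  # 1.0 at k == 1, smaller toward k == n - 1
    return difficulty + (format_bonus if format_ok else 0.0)

# Example with a group of n = 5 Solver attempts:
for k in range(6):
    print(k, round(difficulty_reward(k, 5), 2))
# 0 -> 0.1, 1 -> 1.1, 2 -> 0.85, 3 -> 0.6, 4 -> 0.35, 5 -> 0.1
```

The key property is the shape: near zero at k = 0 and k = n, highest at exactly one correct attempt.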
Step 3: Train the Proposer with HRPO (one question per prompt)
- What happens: Instead of generating many candidate questions (nested sampling), the Proposer makes one question. Its reward is then normalized within a hop-based group (1-hop with 1-hop, 2-hop with 2-hop, etc.) to compute a stable advantage for policy updates.
- Why it exists: This removes the need for many parallel samples, reducing compute while keeping the signal stable.
- Example: A 3-hop question’s reward is compared only to other 3-hop questions’ rewards collected in the batch.
🍞 Hook: Grading sprints with sprints and marathons with marathons makes rankings fair.
🥬 The Concept (Group Baselines and Advantages): HRPO computes a relative performance score by comparing each question to others of the same hop count. How it works: (1) Partition by hop, (2) compute the mean and variance of rewards per partition, (3) standardize each question’s reward within its partition, (4) update the policy. Why it matters: Without fair baselines, updates get noisy, breaking training or demanding huge compute. 🍞 Anchor: A 2-hop question’s “B+” is meaningful only when compared to other 2-hop questions, not 4-hop puzzles.
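Here is a small, runnable sketch of that grouping-and-standardizing step, assuming each proposed question is tagged with its hop count and its difficulty-guided reward (the field names are illustrative, not the paper's code).

```python
from collections import defaultdict
from statistics import mean, pstdev

def hop_grouped_advantages(samples, eps=1e-6):
    """Hedged sketch of HRPO-style advantages: each question's reward is
    standardized only against other questions with the same hop count."""
    by_hop = defaultdict(list)
    for s in samples:
        by_hop[s["hops"]].append(s["reward"])

    advantages = []
    for s in samples:
        group = by_hop[s["hops"]]
        mu, sigma = mean(group), pstdev(group)
        advantages.append((s["reward"] - mu) / (sigma + eps))
    return advantages

batch = [{"hops": 1, "reward": 1.1}, {"hops": 1, "reward": 0.6},
         {"hops": 3, "reward": 0.35}, {"hops": 3, "reward": 0.85}]
print([round(a, 2) for a in hop_grouped_advantages(batch)])
# [1.0, -1.0, -1.0, 1.0]  (each question is scored against its own hop group's mean)
```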
Step 4: Build a training batch for the Solver
- What happens: After Proposer updates, sample its questions (with answers) to form a dataset. Include a mixture of hops (e.g., ratio 4:3:2:1 for 1/2/3/4-hop) to shape the Solver’s curriculum.
- Why it exists: The Solver needs diversity: easy wins for confidence and harder ones for growth.
- Example: The batch might include “Who wrote X?” (1-hop) and “Which city is the birthplace of the person who directed Y?” (3-hop).
🍞 Hook: You don’t learn piano by only practicing scales or only tackling concertos—you need a blend.
🥬 The Concept (Curriculum by Hop Ratio): The training set deliberately mixes different hop counts. How it works: (1) Decide a ratio (e.g., 4:3:2:1), (2) sample questions accordingly, (3) train the Solver on this blend, (4) adjust in later iterations if needed. Why it matters: All-easy or all-hard leads to poor learning. 🍞 Anchor: A soccer practice has warm-ups (easy), drills (medium), and scrimmage (hard) in the same session.
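A minimal sketch of drawing such a blended batch, assuming the Proposer's questions have already been bucketed by hop count; the helper name and pool format are hypothetical, and the 4:3:2:1 default mirrors the ratio mentioned above.

```python
import random

def sample_hop_curriculum(pool_by_hop, batch_size, ratio=(4, 3, 2, 1), seed=0):
    """Hedged sketch: draw a Solver training batch mixing hop counts by `ratio`
    (index 0 -> 1-hop, index 1 -> 2-hop, ...). `pool_by_hop` maps hop count to
    a list of (question, answer) pairs."""
    rng = random.Random(seed)
    total = sum(ratio)
    batch = []
    for hop, weight in enumerate(ratio, start=1):
        n_items = round(batch_size * weight / total)
        pool = pool_by_hop.get(hop, [])
        if pool:
            batch.extend(rng.choices(pool, k=n_items))  # sample with replacement
    rng.shuffle(batch)
    return batch

pool = {h: [(f"{h}-hop question #{i}", f"answer {i}") for i in range(50)] for h in range(1, 5)}
print(len(sample_hop_curriculum(pool, batch_size=100)))
# 100 questions, roughly 40/30/20/10 across 1/2/3/4 hops
```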
Step 5: Train the Solver with GRPO
- What happens: For each question, the Solver generates several candidate answers. GRPO computes advantages by normalizing the correctness across the group of attempts and updates the policy (with clipping and an optional KL penalty).
- Why it exists: This strengthens good answer strategies without needing a separate value function.
- Example: If two out of five tries are correct, the correct attempts get boosted; the wrong ones get reduced.
🍞 Hook: In a quiz bowl, seeing which buzz-in strategies actually win helps you keep the good ones and drop the rest.
🥬 The Concept (GRPO – Group Relative Policy Optimization): GRPO improves the Solver by comparing multiple responses to the same question. How it works: (1) Generate a small group of answers per question, (2) compute a standardized advantage per response, (3) update policy with clipping and KL regularization, (4) repeat. Why it matters: Without grouped comparisons, the Solver’s learning signal is weaker or needs a critic network. 🍞 Anchor: For a single riddle, you try five phrasings; the one that hits the answer gets reinforced.
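A compact sketch of the group-relative part of GRPO, using 0/1 correctness rewards. The real update also works over token-level probability ratios and may add a KL penalty; the clipped-surrogate helper below only hints at that part.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Sketch of GRPO's group-relative advantages for one question: each of the
    sampled answers gets a reward (1.0 if correct, 0.0 if not) and is
    standardized against the group's mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term applied during the policy update (simplified)."""
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Five attempts at one question: two correct, three wrong.
rewards = [1.0, 0.0, 1.0, 0.0, 0.0]
print([round(a, 2) for a in grpo_advantages(rewards)])
# [1.22, -0.82, 1.22, -0.82, -0.82]  -> the correct attempts get reinforced
```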
Step 6: Alternate and iterate
- What happens: Run about 50 Proposer steps, then 50 Solver steps, then repeat for a few iterations. Learning often peaks by 2–3 rounds.
- Why it exists: Improvements on one side shift the sweet spot for the other, forming a moving training target (a living curriculum).
- Example: As Solver nails 2-hop regularly, Proposer starts favoring 3-hop to keep rewards high.
Secret Sauce (what makes it clever):
- HRPO slashes compute needs by using hop-grouped baselines with one question per prompt.
- Difficulty-guided reward keeps questions in the solvable-but-challenging zone.
- Decoupling Proposer and Solver prevents them from memorizing their own quirks and encourages general strategies.
- Search grounding ensures every question and answer chain can be rediscovered by the Solver.
Practical data flow example:
- Input: Seed text mentions “Author A.”
- Proposer: searches A’s works → finds Book B → searches B’s adaptation → finds Director C; builds a 3-hop question whose answer is “C.”
- Solver: gets the question, performs similar searches, answers “C.”
- Training: If only 1/5 Solver attempts get “C,” Proposer reward is high; Solver GRPO boosts the successful path.
- Iterate: Next round, Proposer tries a slightly harder 3–4 hop chain; Solver adapts.
04 Experiments & Results
🍞 Hook: Think of testing a new study method across different school subjects—math, history, science—to see if it really helps.
🥬 The Concept (The Test): The authors evaluate Dr. Zero on seven open-domain QA benchmarks that include simple one-hop and complex multi-hop questions. How it works: (1) Use the same search setup (English Wikipedia + the same retriever), (2) measure Exact Match (EM), which checks if the final answer text exactly matches ground truth, (3) compare to both few-shot and supervised baselines. Why it matters: Without consistent tests, we can’t tell if self-evolution truly works. 🍞 Anchor: It’s like grading spelling by whether the word is letter-perfect.
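For reference, EM is usually computed on lightly normalized strings (lowercased, with punctuation and articles stripped). Whether the paper applies exactly this normalization is an assumption; the standard recipe looks like this:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace
    (standard SQuAD-style normalization; assumed, not confirmed by the paper)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(normalize(prediction) == normalize(ground_truth))

print(exact_match("The University of Paris", "university of paris"))  # 1
print(exact_match("Paris-Sorbonne", "University of Paris"))           # 0
```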
Datasets:
- Single-hop: Natural Questions (NQ), TriviaQA, PopQA.
- Multi-hop: HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle.
Baselines:
- Few-shot: Prompting, IRCoT, Search-o1, RAG.
- Supervised: SFT, R1-Instruct, Search-R1.
- Data-free enhanced: SQLM* and R-Zero* (strengthened to handle search).
🍞 Hook: Imagine two runners: one trained with a coach and curated drills (supervised), the other trained alone with smart self-play (data-free). Who wins?
🥬 The Concept (Who/What It’s Compared Against): Dr. Zero, which uses zero training data, is compared against systems trained with human-curated datasets and methods. How it works: (1) Keep the search engine and corpora identical, (2) run each model on the same questions, (3) compute EM, (4) report per-dataset and average scores. Why it matters: Beating or matching supervised baselines without data is a big deal for scalability. 🍞 Anchor: If the self-taught runner ties the coached runner, that’s impressive—and cheaper.
Scoreboard with context (selected highlights):
- With a 3B base model on NQ, Dr. Zero reaches 0.397 EM vs. Prompting 0.106, IRCoT 0.111, and Search-o1 0.238—like jumping from a D to a strong B.
- On PopQA (3B), Dr. Zero 0.431 vs. supervised Search-R1 0.364—about an 18.4% relative improvement (arithmetic shown after this list), an A- vs. B- moment.
- On TriviaQA (3B), Dr. Zero 0.572 vs. Search-R1 0.537—solidly ahead.
- On average, Dr. Zero (3B) 0.326 roughly matches Search-R1 0.327, despite zero training data.
- With a 7B base, Dr. Zero averages 0.372 vs. Search-R1 0.384 overall, and it outperforms on specific datasets like 2WikiMultihopQA (0.347 vs. 0.326).
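The relative-improvement figures in the scoreboard above are plain ratio arithmetic; for example, the PopQA number checks out like this:

```python
# Quick check of the PopQA relative-improvement figure quoted above (3B model).
dr_zero_em, search_r1_em = 0.431, 0.364
relative_gain = (dr_zero_em - search_r1_em) / search_r1_em
print(f"{relative_gain:.1%}")  # 18.4%
```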
Data-free comparisons (3B):
- Dr. Zero beats SQLM* and R-Zero* on every dataset, averaging 0.326 vs. 0.233 (SQLM*) and 0.256 (R-Zero*). That’s like moving from a C to a solid B across subjects.
🍞 Hook: Think of a lighter backpack that still carries everything you need.
🥬 The Concept (Efficiency: HRPO vs. GRPO for the Proposer): HRPO uses about a quarter of the rollouts of a GRPO-style proposer (1 proposed question × 5 Solver attempts vs. 4 candidate questions × 4 Solver attempts each). How it works: (1) Generate one candidate per prompt, (2) group by hop, (3) standardize rewards in-group, (4) update. Why it matters: Similar or better accuracy with much less compute means more practical training. 🍞 Anchor: HRPO’s average score is slightly higher (0.326 vs. 0.320) while using far fewer rollouts.
Training dynamics and surprises:
- Rapid gains appear early (within ~50 steps), then improvements continue into the second iteration but plateau later, especially for the 7B model.
- Entropy and response length in the Solver drop and stabilize—signals of growing confidence and consistency—while the Proposer’s entropy remains more varied to preserve question diversity.
- Hop ratios matter: For 3B, a mixed ratio (4:3:2:1 for 1→4 hops) works best; pushing too many multi-hop questions doesn’t help. For 7B, relatively more multi-hop data can help uncover capacity.
- Ablations show the importance of a format reward (helps structure) and an initial document context (crucial for diverse, grounded question synthesis). A parabolic reward underperforms the proposed difficulty-guided design.
- Statistical tests indicate Dr. Zero’s gains are robust on several datasets, especially knowledge-heavy single-hop tasks (e.g., NQ), and 2WikiMultihopQA for 7B.
Bottom line: Dr. Zero competes with, and sometimes beats, supervised search agents across diverse QA tasks, while spending far less on data curation and even on some training compute, thanks to HRPO.
05 Discussion & Limitations
🍞 Hook: Even great study plans hit snags—like getting stuck on a plateau or drifting off-topic.
🥬 The Concept (Limitations): Dr. Zero still faces performance plateaus, occasional instability in multi-turn generations, and relies on the quality of the external search engine. How it works: (1) Gains slow after 2–3 iterations, (2) formatting or tokenization hiccups can derail long trajectories, (3) retrieval errors or gaps in the corpus can harm answers. Why it matters: Knowing these edges helps plan the next improvements and when to switch strategies. 🍞 Anchor: If your internet is spotty, even a great researcher gets shaky results.
Required resources:
- A base instruction-tuned LLM (e.g., Qwen2.5 3B/7B).
- A prepared corpus and retriever (e.g., English Wikipedia + E5 embeddings + ANN search); a minimal retrieval sketch follows this list.
- Modest RL-style training runs: alternating 50-step phases for Proposer (HRPO) and Solver (GRPO), batch sizes around a few hundred, and short rollouts (about five tool turns).
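As a concrete but hedged sketch of the kind of retriever listed above, using the sentence-transformers and faiss libraries with the publicly available intfloat/e5-base-v2 checkpoint (none of these are confirmed to be the paper's exact choices):

```python
# Assumptions: sentence-transformers and faiss-cpu are installed;
# "intfloat/e5-base-v2" is one publicly available E5 checkpoint.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")
docs = [
    "Marie Curie was a physicist who studied at the University of Paris.",
    "The University of Paris is located in Paris, France.",
]

# E5 expects "passage: " / "query: " prefixes; normalized embeddings make
# inner product equal to cosine similarity.
doc_emb = model.encode([f"passage: {d}" for d in docs], normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

def search(query: str, k: int = 2):
    q = model.encode([f"query: {query}"], normalize_embeddings=True)
    scores, ids = index.search(q, k)
    return [(docs[i], float(s)) for s, i in zip(scores[0], ids[0]) if i != -1]

print(search("Which university did Marie Curie attend?"))
```

Scaling this to a full Wikipedia dump follows the same pattern, typically with an approximate index (e.g., IVF or HNSW) in place of the flat one.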
When not to use it:
- No search access or very poor corpora—verifiability collapses.
- Safety-critical domains where self-generated curricula could miss rare but crucial edge cases.
- Tasks that demand exact ground-truth labels for every step (e.g., formal proofs) rather than final-answer checks.
- Extremely long-context reasoning if the model struggles with formatting and truncation.
Open questions:
- How to push past the plateau? Better curricula, adaptive hop ratios, or periodic resets?
- Can we design richer, still-cheap rewards that capture reasoning quality beyond just final correctness and structure?
- How to strengthen long-context stability and tool-use robustness in larger models?
- Can we detect and prevent reward hacking or bias amplification during long self-play runs?
- How to extend to multilingual or domain-specific corpora while keeping the data-free promise?
Overall assessment: Dr. Zero is a strong, practical step toward data-free, search-grounded self-evolution. It trims compute with HRPO, preserves diversity with difficulty shaping, and reaches competitive accuracy. The next frontier is sustaining growth over more iterations and further hardening stability and safety.
06 Conclusion & Future Work
Three-sentence summary: Dr. Zero is a self-evolving pair of AI agents (Proposer and Solver) that use an external search engine, a difficulty-guided reward, and hop-grouped optimization to learn open-domain question answering with zero human-written training data. By alternating Proposer and Solver training, the system builds its own curriculum of verifiable, multi-hop questions and steadily improves. Experiments show Dr. Zero can match or surpass supervised baselines on several benchmarks while using far less curated data and lower compute for the Proposer.
Main achievement: Introducing HRPO and a difficulty-guided, search-anchored self-evolution loop that unlocks competitive search-and-reasoning performance without any training data.
Future directions: Improve stability for longer self-play (avoid plateaus and entropy collapse), develop richer rewards that capture reasoning quality, add defenses against reward hacking, and expand to larger models, multilingual corpora, and more complex tool ecosystems.
Why remember this: It demonstrates that complex, verifiable reasoning skills can emerge through self-evolution alone, turning the data bottleneck into a solvable engineering problem and opening the door to affordable, adaptable research agents.
Practical Applications
- •Enterprise knowledge assistants that learn company-specific search strategies without labeled training data.
- •Student study helpers that generate progressively harder, verifiable practice questions across subjects.
- •Research scouts that self-improve on literature discovery and multi-hop evidence chaining.
- •Customer support bots that refine search-and-answer skills from internal docs without manual annotation.
- •News and fact-checking aides that build verifiable Q&A pipelines over fresh web sources.
- •On-premise or privacy-sensitive deployments that avoid sharing data with external labeling services.
- •Low-resource domain adaptation by indexing niche corpora and letting agents self-train via search.
- •Internal toolsmiths that produce test questions for QA teams, tuned to be neither trivial nor impossible.
- •Continuous learning systems that update themselves as the indexed corpus changes over time.