SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback

Intermediate
Fangyuan Xu, Rujun Han, Yanfei Chen et al. Ā· 1/26/2026
arXiv Ā· PDF

Key Summary

  • SAGE is a two-agent system that automatically writes tough, multi-step search questions and checks them by actually trying to solve them.
  • It controls difficulty by asking for a target number of search steps and then uses real execution traces to fix questions that are too easy or incorrect.
  • The key idea is an iterative feedback loop: a data generator proposes a question-answer pair, a search agent attempts it, and their traces guide improvements.
  • Compared to simple resampling, using execution feedback creates more accurate questions that truly need more steps to solve, especially at higher difficulty.
  • Training search agents on 20K SAGE-generated examples boosts accuracy by up to 27% in-domain and up to 23% out-of-domain versus common baselines.
  • Agents trained on SAGE transfer from fixed Wikipedia retrieval to Google Search at test time, improving real-world deep search tasks like GAIA.
  • SAGE questions require a broader mix of reasoning skills (temporal, calculation, conflict resolution) than standard benchmarks like Musique.
  • An ablation shows two rounds of feedback strike a good balance between difficulty and learnability; harder alone isn’t always better for training.
  • The main limitations are reliance on one fixed search agent for feedback, a pragmatic pass@K correctness check, and a focus on RL rather than SFT trajectories.
  • Overall, SAGE offers a scalable, steerable way to create high-quality deep-search training data without heavy human annotation.

Why This Research Matters

Deep-search agents can be much more helpful in real life when they can gather and combine clues from many sources. SAGE makes the training data for these agents cheaper and more controllable, so we can reliably teach them to handle longer, trickier problems. This means better homework help that cites sources, smarter research assistants for journalists and analysts, and customer support bots that weave together policy pages and logs. Because SAGE’s data is grounded in real documents and shaped by actual execution, agents learn skills that transfer to new tools like Google Search. Over time, this can raise the standard of factual accuracy and explainability in AI systems. It also reduces our dependence on expensive human-written datasets while still improving quality.

Detailed Explanation

01 Background & Problem Definition

šŸž Hook: You know how finding a simple fact is easy—like ā€œWhat’s the capital of France?ā€ā€”but tracking down a detailed answer that mixes clues from many places can feel like a scavenger hunt?

🄬 The Concept (Deep Search Agents): Deep search agents are AI helpers that hunt across many documents, step by step, to answer complicated questions.

  • How it works:
    1. Read the question and plan sub-questions.
    2. Search for one piece of info at a time.
    3. Use what they find to plan the next search.
    4. Combine all clues into the final answer.
  • Why it matters: Without this, AI treats long, tricky questions like short ones and misses key steps. šŸž Anchor: If you ask, ā€œWhich composer taught the person who scored Movie X?ā€ the agent may first find the composer, then the student, then connect them.
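
Here is a minimal sketch of that plan-search-combine loop, assuming a hypothetical `llm` call and `search` tool (neither is from the paper); the agent alternates reasoning and retrieval until it commits to an answer or hits a step budget.

```python
from typing import Callable, List

def deep_search(question: str,
                llm: Callable[[str], str],
                search: Callable[[str], List[str]],
                max_steps: int = 20) -> str:
    """Minimal ReAct-style loop: think, search, read, repeat, then answer.

    `llm` and `search` are hypothetical stand-ins for a language model call
    and a retrieval tool; this is a sketch, not the paper's implementation.
    """
    context = f"Question: {question}\n"
    for step in range(max_steps):
        # Ask the model for its next move: a search query or a final answer.
        move = llm(context + "\nNext action (SEARCH: <query> or ANSWER: <answer>):")
        if move.startswith("ANSWER:"):
            return move[len("ANSWER:"):].strip()
        query = move[len("SEARCH:"):].strip()
        passages = search(query)  # fetch a few passages of evidence
        context += f"\nSearch {step + 1}: {query}\n" + "\n".join(passages)
    # Step budget exhausted: force a best-effort answer from what was gathered.
    return llm(context + "\nGive your best final answer:")
```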

The World Before: AI could answer many single-hop questions where one search was enough. Datasets like NQ and early RAG systems handled ā€œlook once, answer once.ā€ Later, ā€œmulti-hopā€ datasets (like HotpotQA, Musique) asked for 2–4 steps, but rarely more. And building truly deep, long-chain questions by hand is slow, expensive, and exhausting for people.

🄪 Sandwich: Search-Augmented LLMs šŸž Hook: Imagine your brain plus a super library card: you think, and you can also look things up anytime. 🄬 The Concept: Search-augmented LLMs are language models that can call a search tool while they reason.

  • How it works: Plan → search → read → think → search again → answer.
  • Why it matters: The model isn’t stuck guessing; it can fetch facts. šŸž Anchor: When asked ā€œWho directed the film adapted from the 1990 novel by…?ā€, the model searches for the novel, then the film, then the director.

The Problem: We need lots of hard, realistic training questions to teach agents this kind of deep search. But asking humans to write them (and the correct answers) is pricey—each example can require many searches and long reasoning traces. Also, even when we ask an AI to write ā€œhard questions,ā€ they often end up too easy or not fully correct.

Failed Attempts:

  • Human-crafted deep questions: accurate but costly, slow, and limited.
  • Auto-generated multi-hop questions: helpful but often shallow (≤4 steps) or rely on pre-built links between pages, limiting variety.
  • ā€œJust sample moreā€ (resampling): you can keep trying until you get a good question, but it wastes time and doesn’t teach the generator why it failed.

🄪 Sandwich: Difficulty Control šŸž Hook: Think of a video game where you can pick ā€œLevel 3ā€ or ā€œLevel 7.ā€ 🄬 The Concept: Difficulty control means telling the generator how many search steps the answer should require.

  • How it works: Include a target number S in the prompt; plan a chain that needs about S searches.
  • Why it matters: Without it, the generator may produce questions that are too easy or too hard. šŸž Anchor: ā€œMake a 5-step question about this Wikipedia passage,ā€ so training covers a range of levels.
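
As a rough illustration, difficulty steering can be as simple as putting the target step count S into the generator's prompt; the template below is an assumption for illustration, not the paper's exact wording.

```python
def difficulty_prompt(passage: str, target_steps: int) -> str:
    """Hypothetical prompt template that bakes the difficulty knob S into the request."""
    return (
        "You are given a source passage. Write one question-answer pair that is\n"
        f"grounded in the passage and should require about {target_steps} separate\n"
        "search steps to answer. Also output the step-by-step plan you expect a\n"
        "solver to follow.\n\n"
        f"Passage:\n{passage}\n"
    )

# Example: request a 5-step question about a given Wikipedia paragraph.
prompt = difficulty_prompt("<Wikipedia paragraph about a festival>", target_steps=5)
```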

The Gap: Even with ā€œreverse generationā€ (start from a passage and build a question grounded in it), the AI’s plan for S steps often doesn’t match what actually happens when another agent tries to solve it—some steps collapse or the question is solvable in fewer steps.

🄪 Sandwich: Execution Feedback šŸž Hook: When you practice a sport, a coach watches you play and then tells you what to fix. 🄬 The Concept: Execution feedback is using real solve attempts (search traces) to show the generator what worked, what didn’t, and how many steps it truly took.

  • How it works: Generate QA → run a search agent K times → collect traces → judge correctness and steps → feed those traces back to improve the QA.
  • Why it matters: Without it, the generator keeps guessing and repeats the same mistakes. šŸž Anchor: If the solver answers in 2 steps but target is 5, the feedback nudges the generator to add needed hops.

Real Stakes: Better deep-search training data means assistants that can investigate medical topics more carefully, students who get multi-source explanations, journalists who verify facts across articles, and customer support bots that piece together policies and logs. SAGE aims to make this kind of data cheap, scalable, and tunable by difficulty.

02 Core Idea

The ā€œAha!ā€ moment in one sentence: Make a data generator and a search agent talk to each other through real solve attempts, so the generator learns to produce correct questions that truly take S steps.

Three Analogies:

  1. Coach and player: The player (search agent) runs the play; the coach (generator) watches the film and redesigns the play so it’s challenging but fair.
  2. Puzzle maker and tester: The puzzle maker crafts a maze; the tester runs through it; the maker reshapes dead-ends so it takes the right number of turns.
  3. Teacher and exam: The teacher writes a test; students take it; the teacher adjusts questions that were too easy or misleading.

🄪 Sandwich: Agentic Data Generation šŸž Hook: Imagine a robot chef who tastes dishes and changes the recipe until it’s just right. 🄬 The Concept: Agentic data generation is when an AI actively plans, searches, and revises to create training examples.

  • How it works: Plan a QA, search to ground it, check it with an agent, and rewrite using feedback.
  • Why it matters: Static one-shot generation misses real-world quirks; agentic loops adapt. šŸž Anchor: The AI drafts a 5-step question, a solver finishes in 3, the AI rewrites until it really needs 5.

Before vs After:

  • Before: Data generators aimed for ā€œhardā€ but often produced easy or incorrect questions. Filtering by resampling helped a bit but wasted compute and didn’t teach the generator what to fix.
  • After (SAGE): The generator is steered by execution feedback—evidence of how the question actually plays out—so difficulty aligns with reality, not guesswork.

Why It Works (intuition):

  • Ground-truthing by action: Plans can be wrong; doing is proof. If searches collapse into fewer steps (e.g., info co-located), the traces reveal it.
  • Tight loop: The generator sees both its own plan and the solver’s path; mismatches become revision targets.
  • Measurable knobs: Target step S and pass@K make correctness and difficulty computable, so the loop can steer reliably.

🄪 Sandwich: Reverse QA Generation šŸž Hook: When building a treasure hunt, you start with the treasure and place the clues backward. 🄬 The Concept: Reverse QA means starting from a real document and then crafting a question whose answer is grounded in it.

  • How it works: Pick a passage → gather related info → compose a question-answer pair tied to retrieved evidence.
  • Why it matters: Prevents made-up answers; keeps questions anchored to facts. šŸž Anchor: Choose a Wikipedia paragraph about a festival; then ask, ā€œOn what date was the first event recognized as the start of X?ā€

🄪 Sandwich: pass@K šŸž Hook: Think of taking multiple shots at a basketball hoop. 🄬 The Concept: pass@K checks if any of K solver attempts match the reference answer.

  • How it works: Run K different traces; if at least one is correct, mark as ā€œcorrect.ā€
  • Why it matters: Solvers can be imperfect; multiple tries reduce false negatives. šŸž Anchor: With K=4, if one of four runs gets ā€œFebruary 21, 1972,ā€ it passes correctness.
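
A minimal sketch of the pass@K check, assuming the K solver answers are already collected; exact string matching stands in for the more forgiving judge a real pipeline would use.

```python
def pass_at_k(predictions: list[str], reference: str) -> bool:
    """True if any of the K solver answers matches the reference answer.

    Case-insensitive exact match is a simplification; in practice a more
    lenient judge (e.g., an LLM-based one) would decide what counts as a match.
    """
    ref = reference.strip().lower()
    return any(p.strip().lower() == ref for p in predictions)

# Example with K=4: one attempt is correct, so the QA passes the check.
print(pass_at_k(["1971", "February 21, 1972", "unknown", "1970"],
                "February 21, 1972"))  # True
```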

Building Blocks:

  • Input: One document from a corpus (Wikipedia) and a target step S.
  • Generator: Proposes a grounded QA aimed at S steps.
  • Verifier: A search agent tries to solve it K times, yielding traces.
  • Feedback: Traces guide the generator to rewrite the QA.
  • Filter: Keep only questions the solver can answer correctly at least once.
  • Output: QA pairs that are both correct and require about S steps.

🄪 Sandwich: Reinforcement Learning (RL) Training (downstream) šŸž Hook: Like giving a puppy treats when it does the trick right. 🄬 The Concept: RL trains the search agent by rewarding good final answers.

  • How it works: Agent searches and answers → judge correctness → give reward → adjust policy.
  • Why it matters: It teaches better search plans without needing human-written step-by-step traces. šŸž Anchor: If the agent answers correctly after 6 searches, PPO updates it toward similar good behaviors.
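
A minimal sketch of the outcome-reward idea: score only the final answer, then let a PPO-style trainer (omitted here) push the policy toward trajectories that earned the reward. The exact-match judge below is a stand-in for the LLM judge described in the paper.

```python
def outcome_reward(predicted: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer is judged correct, else 0.0.

    A real setup would swap this exact-match check for an LLM-as-a-judge call
    and feed the reward into a PPO update; both are out of scope for this sketch.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

# Example: a correct final answer after a 6-search trajectory earns full reward.
print(outcome_reward("February 21, 1972", "february 21, 1972"))  # 1.0
```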

03 Methodology

High-level recipe: Document → (Step A) Generate QA for target S → (Step B) Verify with K solver runs → (Step C) Feed back traces → (Step D) Regenerate → Output when correct and difficult enough.

Components and Steps

  1. Input and Retriever 🄪 Sandwich: Retriever + Corpus šŸž Hook: Imagine a super index of a giant library so you can find three helpful pages fast. 🄬 The Concept: A retriever searches a fixed corpus (2018 Wikipedia) to fetch top passages for each query.
  • How it works: Turn a query into an embedding; find nearest passages; return top-3 per search.
  • Why it matters: Without consistent retrieval, the agent can’t ground its reasoning. šŸž Anchor: Query ā€œfirst Ekushey Book Fair dateā€ returns three snippets the agent uses to reason. (A minimal retrieval sketch appears after this list.)
  2. Initial Generation with Difficulty Prompt
  • What happens: The data generator (a search-augmented LLM) is told the target steps S (3–7). It plans inside a ā€œthinkā€ area, issues iterative searches, gathers evidence, and then outputs a question, a short answer, and a step plan.
  • Why needed: Without an S prompt, questions skew easy; with S, the generator aims for the right depth.
  • Example: From a page about a publisher tied to a national book fair, the generator composes: ā€œWhat is the specific date of the initial event that evolved into the national book fair…?ā€ expecting ā‰ˆ4 steps.
  3. Verification by a Search Agent (K traces)
  • What happens: A separate search agent tries to solve the question K times (e.g., K=4), producing answers and full search traces (reasoning + queries + retrieved passages).
  • Why needed: The generator’s plan might not match reality; only execution exposes true difficulty or mistakes.
  • Example data: Trace A solves in 3 steps (answer correct), Trace B fails, Trace C solves in 5 steps (answer correct), Trace D solves in 4 steps (answer correct). The minimal correct step count is 3.
  4. Decide Correctness and Difficulty
  • What happens: Use pass@K to check if any attempt matched the generator’s answer (correctness). For difficulty, look at the minimal number of steps among correct traces and compare to S.
  • Why needed: A QA must be both right and sufficiently hard; otherwise, it won’t teach deep search.
  • Example: If S=4 but the minimal correct is 3, it’s correct but too easy.
  5. Execution Feedback and Regeneration 🄪 Sandwich: Execution Feedback (in practice) šŸž Hook: Like watching your own replay to see exactly where the defense broke your plan. 🄬 The Concept: Feed both the generator’s plan and the solver’s actual traces (including retrieved docs) back to the generator to fix the QA.
  • How it works:
    1. Provide traces showing where steps collapsed or answer differed.
    2. Ask the generator to update the question, the answer, or both, to meet target S and remain grounded.
    3. Repeat for R rounds (e.g., up to 3).
  • Why it matters: Without explicit trace evidence, rewrites are guesswork; with it, they are targeted. šŸž Anchor: If two sub-queries were solved with one search (multi-query collapse), the rewrite separates entities so two queries are truly needed. (A compact code sketch of this verify-and-revise cycle appears after this list.)
  6. Filtering
  • What happens: If no solver trace ever matches (pass@K=0), drop the QA.
  • Why needed: Avoids training on hallucinations or unreachable questions.
  • Example: A confusing or ambiguous question is filtered out.
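
A minimal sketch of the retrieval step from item 1, using cosine similarity over precomputed passage embeddings; the encoder is assumed to be an E5-style model and is not shown.

```python
import numpy as np

def top_k_passages(query_emb: np.ndarray,
                   passage_embs: np.ndarray,
                   passages: list[str],
                   k: int = 3) -> list[str]:
    """Return the k passages whose embeddings are most similar to the query.

    `query_emb` has shape (d,) and `passage_embs` has shape (n, d); both are
    assumed to come from the same dense encoder (e.g., an E5-style model).
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    best = np.argsort(-scores)[:k]  # indices of the top-k passages
    return [passages[i] for i in best]
```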
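
Here is a compact sketch of the verify-decide-feedback cycle from items 3 through 6. The callables `generate_qa`, `solve`, and `revise_qa` are hypothetical stand-ins for the generator and solver agents; the decision rule follows the text: keep a QA once some trace is correct and the minimal correct trace needs at least S steps, and drop it only if no trace is ever judged correct.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

QA = Tuple[str, str]  # (question, reference answer)

@dataclass
class Trace:
    answer: str
    steps: int      # number of searches the solver actually issued
    correct: bool   # whether the answer matched the reference (pass check)

def sage_loop(document: str,
              target_steps: int,
              generate_qa: Callable[[str, int], QA],
              solve: Callable[[str], Trace],
              revise_qa: Callable[[QA, List[Trace], int], QA],
              k: int = 4,
              rounds: int = 3) -> Optional[QA]:
    """Sketch of the generate -> verify -> feed back -> regenerate cycle."""
    qa = generate_qa(document, target_steps)
    for r in range(rounds + 1):
        traces = [solve(qa[0]) for _ in range(k)]   # K verification attempts
        correct = [t for t in traces if t.correct]
        hard_enough = correct and min(t.steps for t in correct) >= target_steps
        if hard_enough:
            return qa                                # correct and truly needs ~S steps: keep
        if r == rounds:
            break
        # Feed the real traces back so the next revision fixes correctness/difficulty.
        qa = revise_qa(qa, traces, target_steps)
    # Filtering: drop the QA only if no attempt was ever judged correct (pass@K = 0).
    return qa if correct else None
```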

Concrete Mini-Walkthrough

  • Input: Wikipedia paragraph on a Kolkata publisher tied to the Ekushey Book Fair; target S=4.
  • Generate: The question asks for the precise date of the initiating event.
  • Verify: Solver answers correctly but in 2–3 steps because information is co-located.
  • Feedback: The generator sees traces and revises the question to require an extra hop (e.g., include a condition that forces finding a separate source).
  • Re-verify: Now minimal correct traces need ≄4 steps; keep QA.

The Secret Sauce

  • Difficulty steering (target S) + real execution feedback beats blind resampling.
  • Picking the minimal-step correct trace avoids overcounting extra, unneeded searches.
  • Grounding every update in retrieved documents keeps QAs faithful.

🄪 Sandwich: Resampling (baseline) šŸž Hook: If your drawing isn’t great, you throw it away and draw again without feedback. 🄬 The Concept: Resampling keeps generating from scratch until one sample looks right.

  • How it works: If incorrect or too easy, toss and sample a new QA; repeat a few times.
  • Why it matters: It can work but wastes attempts and doesn’t teach the generator how to fix. šŸž Anchor: After three retries you may get a 5-step question, but you don’t know why the first two failed.
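
For contrast, a sketch of the resampling baseline under the same assumptions: each retry is a fresh draw, and rejected attempts carry no information forward.

```python
def resample_loop(document, target_steps, generate_qa, solve_and_accept, tries=3):
    """Resampling baseline: regenerate from scratch until a sample is accepted.

    `solve_and_accept` would apply the same correctness/step test as the
    feedback loop above; unlike that loop, nothing learned from a rejected
    sample reaches the next attempt.
    """
    for _ in range(tries):
        qa = generate_qa(document, target_steps)   # fresh draw, no feedback
        if solve_and_accept(qa):
            return qa
    return None
```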

Implementation Notes (kept simple)

  • Tools: E5 retriever, 2018 Wikipedia, top-3 passages per query.
  • Limits: Max 20 searches per attempt during generation/verification.
  • Models: Same LLM family for generator and solver during data creation; later train Qwen-3B/7B with RL on the produced data.
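
Collected as a small configuration sketch (values taken from the notes above; the field names are invented for illustration):

```python
# Illustrative configuration mirroring the implementation notes; only the
# values come from the text, the field names are assumptions of this sketch.
SAGE_DATA_CONFIG = {
    "retriever": "E5",                  # dense retriever
    "corpus": "Wikipedia (2018 dump)",
    "passages_per_query": 3,            # top-3 passages returned per search
    "max_searches_per_attempt": 20,     # cap during generation/verification
    "verification_attempts_k": 4,       # solver runs per QA (pass@K)
    "max_feedback_rounds": 3,           # regeneration rounds R
    "target_steps_range": (3, 7),       # difficulty knob S
    "rl_student_models": ["Qwen-3B", "Qwen-7B"],
}
```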

04 Experiments & Results

The Tests and Why They Matter

  • Intrinsic: Are generated QAs correct? Do they actually require the target number of steps? Do they feel ā€œhardā€ to a strong solver (low average accuracy over multiple tries)?
  • Extrinsic: If we train search agents on these QAs, do they perform better on in-domain (SAGE-style) and out-of-domain benchmarks (Musique, FRAMES)? Do they transfer to Google Search at test time (GAIA, BrowseComp, HLE-Search)?

Key Competitors

  • No difficulty prompt: Generator just ā€œmake it hard,ā€ no S.
  • Initial generator only: No feedback rounds.
  • Resampling: Retry fresh generations for up to 3 rounds.
  • SAGE: 1–3 rounds of execution feedback.

How We Score Difficulty 🄪 Sandwich: Avg@4 (difficulty signal) šŸž Hook: If four strong students try a problem and most fail, it’s hard. 🄬 The Concept: Avg@4 is the solver’s average success across 4 runs on the same question.

  • How it works: Lower Avg@4 = harder question.
  • Why it matters: Captures real solver struggle beyond step counts. šŸž Anchor: Avg@4 of 79.5 is harder than 86.3 for the same solver.
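
A minimal sketch of the Avg@K signal, assuming per-attempt correctness scores are already available (e.g., from the same judge used for pass@K); it is scaled to a percentage to match the Avg@4 numbers quoted in this section.

```python
def avg_at_k(attempt_scores: list[float]) -> float:
    """Average solver success over K attempts, as a percentage; lower means harder."""
    return 100.0 * sum(attempt_scores) / len(attempt_scores)

# Example: a question solved in 3 of 4 attempts scores Avg@4 = 75.0.
print(avg_at_k([1, 1, 1, 0]))  # 75.0
```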

Intrinsic Results (quality of generated data)

  • Including a target S nudges difficulty up, but with the initial generator alone only about 18% of samples are both correct and require at least S steps.
  • Resampling improves the success rate, but execution feedback improves it more, especially at higher S. With 3 feedback rounds, the pass rate climbs notably and the correct subset gets harder (e.g., Avg@4 ā‰ˆ79.5 vs ā‰ˆ86 for baselines), with minimal step counts aligning more closely with S.
  • Takeaway: Feedback is not just a filter—it’s a teacher.

Training and Downstream Results (Qwen-3B and 7B, 20K samples)

  • Setup: Train with RL (PPO) using LLM-as-a-judge rewards on equal-sized datasets: NQ+HotpotQA (subsampled from 150K to 20K), Musique (20K), and SAGE (20K). Evaluate on SAGE in-domain (3–7 hops), Musique, and FRAMES.

Highlights

  • In-domain: SAGE data lifts average accuracy substantially. For Qwen-3B: from 15.9% (NQ+HotpotQA) / 22.4% (Musique) to 28.5% (+27% relative over the stronger baseline). For Qwen-7B: from 29.6% (Musique) to 38.1% (+29% relative).
  • Out-of-domain: On FRAMES, SAGE-trained models improve by ~11% (3B) and ~23% (7B) relative. On Musique OOD for 7B, SAGE (22.3%) even edges out training on Musique itself (21.6%).

Transfer to Google Search (test-time only) 🄪 Sandwich: Generalization to New Tools šŸž Hook: If you learn to ride on one bike, you can still ride a different bike. 🄬 The Concept: Train with a fixed Wikipedia retriever, then switch to Google Search at inference.

  • How it works: Replace retriever API; keep the trained agent.
  • Why it matters: Shows the agent learned skills, not tool quirks. šŸž Anchor: On GAIA, SAGE-trained 7B jumps to ~24% vs ~15–16% for baselines—around 50% relative gain over the strongest baseline.
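
Conceptually, the swap only touches the search tool the trained agent calls, as in the sketch below; `WikiRetriever`, `GoogleSearchTool`, and the `agent`, `index`, and `client` objects are all illustrative placeholders, not the paper's or any library's actual API.

```python
from typing import List, Protocol

class SearchTool(Protocol):
    def search(self, query: str) -> List[str]: ...

class WikiRetriever:
    """Training-time tool: dense retrieval over a fixed Wikipedia corpus."""
    def __init__(self, index):
        self.index = index
    def search(self, query: str) -> List[str]:
        return self.index.top_k(query, k=3)    # hypothetical index interface

class GoogleSearchTool:
    """Test-time tool: same interface, backed by a web search client (stubbed)."""
    def __init__(self, client):
        self.client = client
    def search(self, query: str) -> List[str]:
        return self.client.snippets(query)     # hypothetical client interface

def run_agent(agent, tool: SearchTool, question: str) -> str:
    # The trained agent only depends on the `search` interface, so swapping
    # the backing tool at test time requires no retraining.
    return agent.answer(question, search=tool.search)
```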

Ablations and Analyses

  • Feedback rounds: Going from 0→2 rounds improves in-domain, Musique, and FRAMES; 3 rounds produce even harder data but not better training—difficulty must be balanced with learnability.
  • Reasoning diversity: Compared to Musique, SAGE questions more often need temporal reasoning (32%) and calculation (35%), plus conflict resolution and self-correction—broader skills.
  • Error sources (pre-feedback generator): ā€œEasy dataā€ often caused by info co-location or multi-query collapse; ā€œincorrect dataā€ often due to solver retrieval or reasoning failures, plus some ambiguous questions—exactly the issues execution feedback can reveal and correct.

Bottom line: Using execution feedback to steer difficulty produces better training data, which in turn trains better deep-search agents that generalize beyond the training retriever.

05 Discussion & Limitations

Limitations

  • Fixed feedback provider: The generator learns from a single search agent’s behavior. If that agent has blind spots, feedback may reinforce them.
  • pass@K correctness: Practical but imperfect; if the search agent can’t solve a valid question, we may wrongly discard it.
  • Focus on RL: SAGE builds QA pairs suited for outcome-reward RL; it doesn’t yet provide gold step-by-step trajectories for supervised fine-tuning.
  • Domain scope: Data built from Wikipedia only; domain-specific corpora (legal, medical) remain unexplored.
  • Scale and algorithms: Experiments cap at 7B parameters and PPO; other RL approaches or larger models might shift results.

Required Resources

  • A capable LLM for generation and solving during data creation.
  • A dense retriever over a large corpus (here, 2018 Wikipedia + E5 embeddings).
  • Compute for iterative feedback rounds and RL training.

When Not to Use

  • If you only need single-hop fact lookup; simpler datasets are cheaper.
  • If your domain forbids iterative retrieval (e.g., strict privacy constraints) and you can’t provide a safe corpus.
  • If you require guaranteed correctness without solver dependence—then add stronger verification or human checks.

Open Questions

  • Co-evolution: Can the generator and solver be jointly improved in alternating training loops?
  • Better verification: How to rescue true-but-unsolved QAs (pass@K=0) with external validators?
  • Curriculum design: What’s the best schedule of step targets S for training phases?
  • Domain transfer: How to adapt SAGE for specialized corpora (e.g., scientific literature) while keeping costs low?
  • Trajectory generation: Can we synthesize high-quality step-by-step SFT traces at scale, not just final QA?

06 Conclusion & Future Work

Three-sentence summary: SAGE is a dual-agent loop that generates deep-search question–answer pairs at a chosen difficulty and fixes them using real solver traces. This execution feedback aligns planned steps with actual steps, yielding more accurate and truly multi-step questions. Training on SAGE’s 20K examples significantly improves deep-search agents in- and out-of-domain and even transfers to Google Search at test time.

Main Achievement: Turning execution feedback into a steering wheel for synthetic data generation—so difficulty becomes something we can aim for and hit, not just hope for.

Future Directions: Co-train the generator and solver; add robust correctness checks beyond pass@K; extend to expert domains; and synthesize full supervised trajectories. Also, explore alternative RL algorithms and larger models to amplify gains.

Why Remember This: SAGE shows that doing, measuring, and revising beats guessing—by closing the loop between planned difficulty and real execution, we can mass-produce better training data and, in turn, better research agents for the open web and beyond.

Practical Applications

  • Build curriculum-style training sets that progress from 3-step to 7-step questions for teaching deep-search agents.
  • Create domain-tuned deep-search data (e.g., legal, finance, medical) by swapping in a different corpus and retriever.
  • Pre-train an internal support bot that can trace multi-policy answers and cite evidence across documentation.
  • Generate practice questions that force temporal reasoning (date math) or calculation to strengthen specific agent skills.
  • Stress-test search agents by deliberately crafting questions with conflict resolution or co-reference hops.
  • Use execution feedback to refine weak questions in existing datasets, raising difficulty without manual rewriting.
  • Train small (3B–7B) search agents with RL on SAGE data to boost performance before scaling to larger models.
  • Prototype evaluation suites that measure both correctness (pass@K) and real difficulty (Avg@K, step counts).
  • Prepare agents to switch retrievers at test time by training on SAGE and later swapping to Google Search.
  • Bootstrap synthetic supervised traces by asking the generator to output justified sub-steps grounded in retrieved passages.
#deep search #agentic data generation #execution feedback #difficulty control #reverse QA generation #retrieval-augmented generation #multi-hop reasoning #reinforcement learning PPO #pass@K #retriever Wikipedia E5 #generalization to Google Search #reasoning diversity #resampling baseline #ReAct-style agents #synthetic training data