
SmartSearch: Process Reward-Guided Query Refinement for Search Agents

Intermediate
Tongyu Wen, Guanting Dong, Zhicheng Dou · 1/8/2026
arXiv · PDF

Key Summary

  • SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.
  • It scores each intermediate query for novelty and usefulness, then rewrites weak queries and continues the search from the improved point.
  • A three-stage curriculum (imitate → align → generalize) helps the agent internalize good querying habits step by step.
  • Dual-Level Credit Assessment checks whether a query brings in new info (rule-based) and whether it was necessary and on-target (model-based).
  • If a query is low-quality, SmartSearch generates a refined version and regenerates all the following steps for a better trajectory.
  • Compared with strong baselines, SmartSearch achieves large improvements in Exact Match and F1 across multiple hard QA and web tasks.
  • It also makes searches more efficient by reducing wasted or redundant queries.
  • A smaller student model does the scoring and refining to keep costs low while staying accurate enough.
  • Ablations show both process rewards and query refinement are essential—the gains shrink when either is removed.
  • The approach generalizes from Wikipedia-style search to open-web tasks, showing it’s robust to different environments.

Why This Research Matters

In real life, tiny wording changes in a search query can pull completely different information, which makes or breaks answers. SmartSearch teaches AI agents to catch and fix those tiny mistakes right away, so their research path stays on track. This reduces wasted searches, saves time, and improves trust because answers are based on the right evidence. It helps in classrooms (better study help), in newsrooms (fewer mix-ups of people with similar names), and in offices (faster, more reliable research). It also adapts from tidy sources like Wikipedia to the messy open web, which is where most people live online. Overall, it’s a blueprint for building AI helpers that reason carefully and check their work step by step.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how when you look something up online, the way you type your question really matters? If you ask, “birthdate of Kevin McCarthy,” you might get the politician, not the actor you meant.

🥬 Filling (The Actual Concept):

  • What it is: This paper tackles how to make AI search agents ask better in-between questions (queries) while they work toward a final answer.
  • How it works: It teaches agents to spot weak queries, score them, fix them, and keep going from the fix—so small mistakes don’t snowball.
  • Why it matters: Without good intermediate queries, even a smart agent can fetch the wrong info and end up confidently wrong.

🍞 Bottom Bread (Anchor): Imagine asking, “birthdate of Kevin McCarthy” and getting the U.S. politician’s page. One extra word—“actor”—changes everything.

🍞 Top Bread (Hook): Think of a Large Language Model (LLM) like a super word wizard that knows tons of facts but sometimes guesses.

🥬 Filling (LLM):

  • What it is: An LLM is a program that understands and generates human language.
  • How it works: It reads your text, predicts likely next words, and uses its training to respond.
  • Why it matters: LLMs are great at language but can miss fresh facts or mix up details.

🍞 Bottom Bread (Anchor): Asking an LLM about a recent sports game can fail if it hasn’t seen the latest news.

🍞 Top Bread (Hook): Imagine your brain having a helpful librarian who can grab books when you don’t remember something.

🥬 Filling (Retrieval-Augmented Generation, RAG):

  • What it is: RAG adds a “go look it up” step so the model can read relevant documents before answering.
  • How it works: It turns your question into a search, retrieves passages, then writes an answer using them.
  • Why it matters: Without retrieval, the model can hallucinate or be outdated.

🍞 Bottom Bread (Anchor): If you ask, “Who won the 2024 marathon?”, RAG helps the model actually check a source.

🍞 Top Bread (Hook): Now imagine not just one search, but a small conversation with the web: think, search, read, think again, repeat.

🥬 Filling (Search Agents):

  • What it is: A search agent is an LLM that iteratively plans, searches, reads results, and keeps reasoning until it answers.
  • How it works: In rounds, it writes a thought, issues a search query, reads snippets, and updates its plan.
  • Why it matters: Complex, multi-hop questions need several smart queries, not just one.

🍞 Bottom Bread (Anchor): To answer “Which author mentored the scientist who discovered X?”, the agent may need three hops of searching.

🍞 Top Bread (Hook): You know how a tiny wrong turn on a road trip can send you far off course if you don’t fix it quickly?

🥬 Filling (Intermediate Query Quality):

  • What it is: The accuracy and usefulness of each mini-search query during a multi-step reasoning process.
  • How it works: Good queries are specific, non-redundant, and aimed at getting the missing piece.
  • Why it matters: A single vague or wrong query can pull in the wrong facts and mislead all later steps.

🍞 Bottom Bread (Anchor): Typing “birthdate of Kevin McCarthy” (no “actor”) leads to the politician and derails the whole answer.

  • The world before: LLMs were powerful but could hallucinate or be outdated. RAG helped by adding retrieval. Search agents went further by looping through plan → search → read → plan. But most training focused on the final answer or on general reasoning style, not on the quality of the in-between searches.
  • Problem: Low-quality intermediate queries cause wrong or redundant documents to be fetched, making the whole reasoning chain drift.
  • Failed attempts: Rewarding only final answers gave weak guidance—if a 6-step search fails at step 2, you don’t learn what to fix. Some process rewards trained “better thinking,” but didn’t directly score query quality; early attempts to score queries were coarse and ineffective.
  • The gap: Fine-grained, query-specific supervision and a way to fix low-quality queries during training were missing.
  • Real stakes: Better queries mean fewer wasted searches, less confusion, and more trustworthy answers—useful for students, journalists, analysts, doctors, and anyone who relies on accurate, up-to-date facts.

02Core Idea

🍞 Top Bread (Hook): Imagine a teacher who doesn’t just grade your final essay but also gives quick tips on each paragraph while you’re writing—and if a paragraph is weak, helps you rewrite it before you continue.

🥬 Filling (Aha! Moment):

  • What it is: SmartSearch scores every intermediate search query for quality, rewrites weak ones, and regenerates the rest of the search from that improved point—then trains the agent to do this on its own.
  • How it works: Two mechanisms power it: (1) Process rewards with Dual-Level Credit Assessment (novel vs useful) and (2) Query refinement that edits only the weak query and continues. A three-stage curriculum (imitate → align → generalize) makes these habits stick.
  • Why it matters: Without query-level feedback and fixes, agents keep pulling the wrong evidence, wasting steps and missing the right answer.

🍞 Bottom Bread (Anchor): The system catches “birthdate of Kevin McCarthy” as ambiguous, rewrites it to “birthdate of actor Kevin McCarthy,” and all following steps suddenly line up to the correct answer.

Multiple analogies:

  1. Cooking: If the sauce tastes off at step 2, you adjust seasoning now, not after the whole dish is done.
  2. GPS: If you take a wrong turn, it reroutes immediately instead of just judging at the destination.
  3. Lego build: If a middle layer is misaligned, you fix that layer before stacking more bricks.

🍞 Top Bread (Hook): You know how judges in a talent show score both originality and performance?

🥬 Filling (Process Rewards):

  • What it is: Little rewards for each step that guide the agent toward better queries, not just better final answers.
  • How it works: At each step, a rule checks novelty (no redundant docs) and a model checks usefulness (was the query necessary and on-target?).
  • Why it matters: Final-only rewards are too sparse; step feedback teaches precisely what to fix.

🍞 Bottom Bread (Anchor): If two searches return the same docs, novelty = 0; if your query asks for the actor but you fetched the politician, usefulness = 0.

🍞 Top Bread (Hook): Like getting notes in the margin telling you what exactly went wrong, then rewriting that sentence.

🥬 Filling (Query Refinement):

  • What it is: A targeted edit of a low-quality query using the feedback, then restarting from that point.
  • How it works: Identify the weak query, generate a refined version guided by the feedback, re-run the search from there, and rebuild later steps.
  • Why it matters: Fixing the exact weak link prevents error cascades and saves steps.

🍞 Bottom Bread (Anchor): “birthdate of Kevin McCarthy” → “birthdate of actor Kevin McCarthy” makes the next search pull the right page.

🍞 Top Bread (Hook): Think of learning to swim: first copy the coach, then learn what the coach prefers, then practice to swim anywhere.

🥬 Filling (Curriculum Learning):

  • What it is: A three-stage training path so the agent gradually internalizes good querying.
  • How it works: Stage 1 (Imitation): learn only from trajectories with high-quality queries. Stage 2 (Alignment/DPO): prefer refined, better-query paths. Stage 3 (RL): practice with process rewards to generalize.
  • Why it matters: Stepwise learning makes complex habits stick and transfer to new tasks.

🍞 Bottom Bread (Anchor): First, watch perfect examples; next, choose better options; finally, play a full game with scoring that rewards each good move.

Before vs After:

  • Before: Agents often focused on reasoning format and final answers, ignoring the precision of each search.
  • After: Agents actively evaluate, refine, and improve their own queries, boosting accuracy and efficiency.

Why it works (intuition): Search quality is the bottleneck; bad inputs to retrieval yield bad evidence. By rewarding novelty/usefulness and repairing weak queries immediately, the agent keeps its evidence pipeline clean. With curriculum stages, the behavior moves from teacher-guided to self-reliant.

Building blocks:

  • Dual-Level Credit Assessment (novelty rule + usefulness model)
  • Lightweight student model for fast scoring/refining
  • Targeted query rewrite and partial trajectory regeneration
  • Three-stage training: SFT filtering → DPO preferences → RL with composite rewards

03Methodology

High-level recipe: Input question → Agent plans and issues Query → Process rewards score the query (novel/useful) → If low-quality, refine the query → Regenerate subsequent steps → Repeat until answer.
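
To make the recipe concrete, here is a minimal Python sketch of that loop. Every component (plan, search, score, refine, answer) is injected as a callable; these names and signatures are illustrative assumptions, not the paper's actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

def smartsearch_loop(
    question: str,
    plan: Callable[[str, List[Dict]], Tuple[str, str]],                   # -> (thought, query)
    search: Callable[[str], List[str]],                                   # -> retrieved documents
    score: Callable[[str, List[Dict], str, List[str]], Tuple[int, str]],  # -> (0/1 quality, feedback)
    refine: Callable[[str, str, List[Dict]], str],                        # -> rewritten query
    answer: Callable[[str, List[Dict]], str],                             # -> final answer, or "" if not ready
    max_steps: int = 6,
) -> str:
    """Plan -> search -> score -> (refine if weak) -> repeat, as described above."""
    history: List[Dict] = []
    for _ in range(max_steps):
        thought, query = plan(question, history)
        docs = search(query)
        quality, feedback = score(question, history, query, docs)
        if quality == 0:                      # weak query: rewrite it and search again
            query = refine(query, feedback, history)
            docs = search(query)
        history.append({"thought": thought, "query": query, "docs": docs})
        final = answer(question, history)
        if final:
            return final
    return answer(question, history)
```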

🍞 Top Bread (Hook): Imagine a checklist for each mini-step of a project: Is this step new? Is it necessary? Did it do what we wanted?

🥬 Filling (Dual-Level Credit Assessment):

  • What it is: A two-part check for every query: novelty (rule) and usefulness (model).
  • How it works:
    1. Novelty (rule-based): Count how many retrieved docs overlap with earlier rounds. If too many overlap, mark as redundant (0); else novel (1).
    2. Usefulness (model-based): A small, fine-tuned evaluator reads the question, gold answer (for training-time supervision), and history up to this round, then outputs 1 if the intent was necessary and the retrieved results actually address that intent, else 0. It also explains why.
    3. Overall score: 1 only if both novelty and usefulness are 1; otherwise 0. Explanations are concatenated for guidance.
  • Why it matters: This pinpoints exactly when a query is stale or off-target, so fixes are surgical (a minimal scoring sketch follows just below).

🍞 Bottom Bread (Anchor): If your new search pulls mostly the same docs as earlier, novelty = 0. If you intended “actor” but fetched “politician,” usefulness = 0.
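
Here is a minimal sketch of the dual-level check. The 50% overlap threshold and the injected usefulness judge are illustrative assumptions, not values or interfaces from the paper.

```python
from typing import Callable, Dict, List, Set, Tuple

def novelty_reward(new_docs: Set[str], seen_docs: Set[str], max_overlap: float = 0.5) -> int:
    """Rule-based check: 1 if the retrieval mostly adds new documents, else 0."""
    if not new_docs:
        return 0
    overlap = len(new_docs & seen_docs) / len(new_docs)
    return 1 if overlap <= max_overlap else 0

def dual_level_score(
    question: str,
    history: List[Dict],
    query: str,
    new_docs: Set[str],
    usefulness_judge: Callable[[str, List[Dict], str, Set[str]], Tuple[int, str]],
) -> Tuple[int, str]:
    """Overall reward is 1 only when the query is both novel and useful."""
    seen: Set[str] = set()
    for step in history:
        seen |= set(step["docs"])
    novel = novelty_reward(new_docs, seen)
    useful, explanation = usefulness_judge(question, history, query, new_docs)
    return (1 if novel and useful else 0), explanation
```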

🍞 Top Bread (Hook): Like editing the one wrong sentence in a paragraph before you keep writing.

🥬 Filling (Query Refinement Engine):

  • What it is: A focused rewriter that fixes only the weak query and restarts from there.
  • How it works:
    1. Generate a full trajectory (thoughts, queries, results, steps).
    2. Score each query with process rewards; collect the low-quality ones (score 0).
    3. For each low-quality query, feed the history and the textual feedback to a small refine model to produce a better query.
    4. Regenerate downstream steps from the refined point to build a new, improved trajectory.
  • Why it matters: Fixing the root cause prevents wrong evidence from poisoning later reasoning (see the code sketch just below).

🍞 Bottom Bread (Anchor): “birthdate of Kevin McCarthy” → “birthdate of actor Kevin McCarthy” flips the retrieved page from politician to actor and fixes the path.
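
A minimal sketch of that refinement step. The helper callables for scoring, rewriting, and re-running the rollout are illustrative placeholders, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

def refine_trajectory(
    question: str,
    trajectory: List[Dict],                                      # each step holds "query", "docs", ...
    score: Callable[[str, List[Dict], Dict], Tuple[int, str]],   # -> (0/1 quality, textual feedback)
    rewrite: Callable[[str, List[Dict], str, str], str],         # -> refined query
    rollout_from: Callable[[str, List[Dict], str], List[Dict]],  # regenerate steps from the refined query
) -> List[Dict]:
    """Fix the first low-quality query and regenerate every step after it."""
    for t, step in enumerate(trajectory):
        quality, feedback = score(question, trajectory[:t], step)
        if quality == 0:
            better_query = rewrite(question, trajectory[:t], step["query"], feedback)
            # keep the clean prefix, then continue the search from the improved query
            return trajectory[:t] + rollout_from(question, trajectory[:t], better_query)
    return trajectory  # every query was already high-quality
```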

🍞 Top Bread (Hook): Like learning math: first copy clean solutions, then choose better ones, then solve under game rules that reward each good move.

🥬 Filling (Three-Stage Curriculum):

  • What it is: A staged training plan so the agent absorbs query quality as a habit.

  • How it works (step by step):

    Stage 1 — Query Quality Screened Imitation Learning (SFT)

    • What happens: Collect trajectories that both have correct final answers and pass query-quality checks at every step.
    • Why it exists: If you imitate messy searches, you learn messy habits; filtering teaches clean querying from day one.
    • Example: Keep only solutions where every query was novel and useful.

    Stage 2 — Query Generation Alignment (DPO)

    • What happens: For each question, generate an initial trajectory, refine each weak query to produce alternative trajectories, then pick the preferred trajectory based on final correctness and on having fewer low-quality queries.
    • Why it exists: The agent learns to prefer paths where query quality drives success—not just lucky endings.
    • Example: Between two correct answers, prefer the one with fewer low-quality queries.

    Stage 3 — Query Aware Policy Optimization (RL with GRPO)

    • What happens: Use reinforcement learning with a composite reward = outcome reward + scaled process reward + format reward (a minimal sketch follows this stage list). During rollouts, expand trajectories via targeted refinements, but cap how many share the same prefix to keep diversity.
    • Why it exists: Final practice under realistic feedback solidifies the behavior and generalizes it.
    • Example: Even if the final answer is wrong, many high-quality queries still earn partial credit, steering learning.
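
As a rough sketch of that Stage 3 composite reward (the weights below are illustrative assumptions, not the paper's reported values):

```python
from typing import List

def composite_reward(
    outcome_correct: bool,        # final answer matches the gold answer
    process_scores: List[int],    # 0/1 dual-level score per intermediate query
    format_ok: bool,              # trajectory followed the required output format
    process_weight: float = 0.5,  # assumed scaling factor
    format_weight: float = 0.1,   # assumed scaling factor
) -> float:
    """Outcome reward plus scaled process reward plus format reward."""
    outcome = 1.0 if outcome_correct else 0.0
    process = sum(process_scores) / len(process_scores) if process_scores else 0.0
    fmt = 1.0 if format_ok else 0.0
    # even with a wrong final answer, high-quality queries still earn partial credit
    return outcome + process_weight * process + format_weight * fmt
```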

Secret sauce:

  • Precise, per-query feedback (novel/useful) eliminates guesswork about what went wrong.
  • Targeted rewrites avoid retraining from scratch; they fix the exact link that failed.
  • Curriculum ties it together: clean examples → preference shaping → practice with live rewards.

Data and models in practice:

  • Base policy model: Qwen2.5-3B-Instruct.
  • Evaluator/refiner: a lightweight student model fine-tuned with labels from a larger teacher model; fast enough to score and refine every step.
  • Retrieval: Wikipedia dump (local) or web API (open web), top-k snippets to ground answers.

What breaks without each step:

  • Without novelty checks: the agent loops redundantly and wastes calls.
  • Without usefulness checks: the agent fetches the wrong entity (actor vs politician) and drifts.
  • Without refinement: errors at step t persist and cascade into later steps.
  • Without curriculum: the agent struggles to internalize query quality and to generalize beyond training.

Concrete data example:

  • Question: “An Annapolis Story stars which American stage, film, and television actor born on February 15, 1914?”
  • Bad query: “birthdate of Kevin McCarthy” → retrieves politician → usefulness = 0 with an explanation.
  • Refined query: “birthdate of actor Kevin McCarthy” → retrieves the correct actor page → downstream reasoning succeeds.

04Experiments & Results

🍞 Top Bread (Hook): Think of a spelling bee where not only your final word matters, but the judge also nudges you if you start sounding out the wrong letters halfway through.

🥬 Filling (The Test):

  • What it is: The researchers tested SmartSearch on tough, multi-hop question-answering datasets (2WikiMultihopQA, HotpotQA, Bamboogle, Musique) and open-web tasks (GAIA, WebWalker).
  • How it works: They compared SmartSearch with prompt-only methods, RL with only final-answer rewards, and RL with other process rewards. They measured Exact Match (EM), F1, Search Quality (how often all queries are high-quality, or at least some are even when the final answer is wrong), and Search Efficiency (how much correct information each search call yields); simple versions of these metrics are sketched below.
  • Why it matters: This shows if better queries actually lead to better and cheaper answers across different environments.

🍞 Bottom Bread (Anchor): It’s like scoring both your accuracy and how efficiently you used your hints.
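
For reference, here are simple versions of those metrics, using the standard QA definitions of EM and token-level F1; the search-efficiency ratio is an illustrative reading of "correct information per search call," not the paper's exact formula.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the gold answer, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 overlap between the prediction and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def search_efficiency(correct_answers: int, search_calls: int) -> float:
    """Correct answers obtained per search call issued."""
    return correct_answers / max(search_calls, 1)
```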

Competition:

  • Prompt baselines (Direct Inference, CoT, RAG, Search-o1) show the basic abilities of LLMs and simple search prompting.
  • RL with outcome-only rewards (ReSearch, ZeroSearch, R1-Searcher, Search-R1) shows the benefit of training by final results.
  • RL with process rewards (ReasonRAG, PPR, StepSearch) shows the value of step-level supervision—but often not query-focused.

Scoreboard with context:

  • SmartSearch achieved the best average EM and F1 across the four knowledge-intensive datasets, with standout gains over the runners-up (e.g., double-digit relative improvements in many settings). In simple terms: it’s like getting an A+ while the next best gets a solid B.
  • On open-web tasks (GAIA, WebWalker), SmartSearch also topped baselines, showing it generalizes beyond Wikipedia-style corpora.
  • Search Quality: SmartSearch had the highest Perfect Rate (all queries high-quality in correct answers) and Partial Rate (even when the final answer is wrong, many queries are still high-quality), proving it consistently asks better in-between questions.
  • Search Efficiency: SmartSearch extracted more correctness per search call—fewer wasteful or redundant queries.

Surprising findings:

  • Quality filtering during imitation (Stage 1) let the model learn better with even less data, because it copied only clean, high-quality query trajectories.
  • Even when the final answer was incorrect, earning partial credit for high-quality queries helped the agent improve faster in RL—small wins add up.
  • A lightweight student model for scoring/refining reached over 80% agreement with human labels and delivered nearly the same effectiveness as a large model at about one-fifth the time, a strong efficiency trade-off.

Ablation highlights:

  • Remove process rewards? Performance drops—step-level guidance is crucial.
  • Remove query refinement? Performance drops—fixing the exact weak query matters.
  • Use standard GRPO without these mechanisms? Training is less stable and converges to lower F1.

Takeaway: Better mid-course corrections (per-query scoring + targeted rewrites) consistently beat methods that only judge the destination or reward generic reasoning steps.

05Discussion & Limitations

🍞 Top Bread (Hook): Imagine you’re building a model car: great instructions help, but you still need the right tools and time.

🥬 Filling (Honest assessment):

  • Limitations:
    1. Extra compute: Scoring every query and possibly refining it adds overhead, even with a small student model.
    2. Domain shifts: If new tasks have query patterns very different from training, usefulness judgments may wobble.
    3. Label bias: If the teacher or student evaluator mis-scores queries, the agent may learn the wrong habits.
    4. Edge cases: Very long, noisy web pages or tricky names (homonyms) can still confuse retrieval and usefulness checks.
  • Required resources:
    • A base policy LLM, a retriever (for local or web search), and a small evaluator/refiner model.
    • Infrastructure to run multi-round rollouts, store histories, and compute novelty via doc overlap.
  • When NOT to use:
    • Ultra-low-latency single-shot Q&A where you can’t afford any multi-step searching.
    • Topics where retrieval is mostly useless (e.g., purely subjective opinions) or where ground truth isn’t verifiable.
    • Environments with extremely unreliable search APIs where overlap/usefulness signals become noisy.
  • Open questions:
    1. Can we learn the novelty/usefulness judges end-to-end without teacher labels and still be stable?
    2. How to better resolve entity ambiguity (e.g., same names) with minimal extra calls?
    3. Can we adaptively decide when to refine vs. skip to save time, based on expected gain?
    4. How to extend query-quality scoring to also cover snippet selection and summarization quality?

🍞 Bottom Bread (Anchor): It’s like deciding when to stop and check the map—worth it most of the time, but maybe not if you just need to drive one block.

06Conclusion & Future Work

Three-sentence summary: SmartSearch improves search agents by scoring every intermediate query for novelty and usefulness, then refining weak ones and regenerating the rest of the search from that point. A three-stage curriculum—imitate, align, and generalize—helps the agent internalize these skills so they transfer to new tasks. The result is higher accuracy and efficiency across tough benchmarks and open-web settings.

Main achievement: Turning query quality into a first-class training signal—by combining per-query process rewards with targeted query refinement—so agents fix the exact step that causes failure.

Future directions:

  • Learn more robust, domain-adaptive evaluators for usefulness and entity disambiguation.
  • Decide dynamically when to refine to balance speed and gains.
  • Extend the idea beyond queries to snippet selection and synthesis quality.

Why remember this: In multi-step search, the weakest link is often a vague or redundant query. SmartSearch shows that teaching agents to notice and fix those links, right when they happen, unlocks big, reliable gains.

Practical Applications

  • Academic research assistants that refine queries to find the right papers and avoid redundant sources.
  • Customer support bots that ask targeted follow-up searches to resolve issues faster.
  • News and fact-checking tools that disambiguate names (e.g., actor vs politician) to avoid misinformation.
  • Medical literature explorers that reduce irrelevant hits and focus on necessary, on-target queries.
  • Enterprise knowledge search that trims repeated lookups and improves retrieval precision for employees.
  • Coding assistants that refine API or error-message searches to land on the correct documentation quickly.
  • Education tutors that show students how to improve their search phrasing step by step.
  • Legal or policy brief builders that reduce redundant case lookups and ensure each query adds new evidence.
  • Market intelligence tools that avoid overlapping sources and surface fresh, relevant insights.
  • Open-web agents that remain robust on noisy pages by refining ambiguous or off-target queries.
#Search agents #Process rewards #Query refinement #Curriculum learning #Dual-Level Credit Assessment #Retrieval-Augmented Generation #Reinforcement Learning #Direct Preference Optimization #GRPO #Information retrieval #Multi-hop question answering #Search efficiency #Search quality #LLM training #Agentic RL