BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Shiyu Liu; Yongjing Yin; Jianhao Yan; Yunbo Tang; Qinggang Zhang; Bei Li; Xin Chen; Jingang Wang; Xunliang Cai; Jinsong Su

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Intermediate

Shiyu Liu, Yongjing Yin, Jianhao Yan et al.1/16/2026

arXiv PDF

Key Summary

•RL-trained search agents often sound confident even when they don’t know, which can mislead people.
•BAPO teaches agents to say “I DON’T KNOW” only when they truly hit their limits, without hurting problem‑solving ability.
•It adds a boundary-aware reward that gives small credit to IDK only if no correct answer was found in a group of attempts.
•An adaptive reward modulator prevents “reward hacking,” so the model doesn’t spam IDK to earn easy points.
•Across four tough QA benchmarks, BAPO boosts overall reliability a lot compared with strong RL baselines.
•With only 5k training samples, BAPO beats larger RL search agents on reliability while keeping accuracy competitive.
•Ablations show both the boundary-aware reward and the adaptive modulator are necessary to avoid over-refusal and accuracy drops.
•BAPO generalizes across model sizes (3B, 7B, 14B) and keeps refusals targeted to truly hard, out-of-boundary questions.
•This approach makes agentic search safer for real users by reducing plausible-sounding but wrong answers.

Why This Research Matters

When AI admits “I DON’T KNOW” at the right moments, people avoid being misled by confident but wrong answers. This is crucial for everyday decisions in health, finance, education, and news, where mistakes can be costly. BAPO turns refusal into a smart behavior tied to real limits, not a lazy shortcut. It helps AI search agents act more like responsible assistants who try hard first, then step back if the answer isn’t reachable. Because it works with modest data and scales across model sizes, organizations can adopt it without massive compute. Overall, it raises trust by aligning AI’s behavior with human expectations for honesty and care.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: You know how when you take a hard test, sometimes it’s smarter to leave a question blank than to guess wildly? That honesty helps your teacher understand what you truly know.

🥬 The Concept (Agentic Search):

What it is: Agentic search is when an AI plans steps, looks things up, and reasons over several turns to answer tough questions.
How it works: (1) The AI thinks about what it needs; (2) it searches for clues; (3) it reads results; (4) it repeats as needed; (5) it answers.
Why it matters: Without this multi-step process, the AI might miss key facts or make up answers.

🍞 Anchor: Imagine a student detective who breaks a mystery into clues, looks up evidence in the library, and then explains the solution.

🍞 Hook: Imagine learning to ride a bike by trying, wobbling, and adjusting until you balance.

🥬 The Concept (Reinforcement Learning, RL):

What it is: RL teaches an AI by giving rewards for better behavior, like a coach cheering good moves.
How it works: (1) Try a strategy; (2) get a score (reward); (3) change the strategy to get higher scores next time.
Why it matters: RL helps agents learn complex multi-step skills, like planning searches and combining clues.

🍞 Anchor: It’s like earning points each time you ride longer without falling, so you practice what works best.

🍞 Hook: You know how you sometimes say “I don’t know” when you truly can’t solve a problem yet?

🥬 The Concept (IDK response):

What it is: An AI saying “I DON’T KNOW” (IDK) when it lacks enough evidence to answer.
How it works: (1) Check if the collected info supports an answer; (2) if not, admit IDK; (3) let a human or another tool handle it next.
Why it matters: Without IDK, the AI may invent believable but wrong answers that mislead people.

🍞 Anchor: If you can’t find the answer in your notes or book, you tell the teacher you need more info instead of guessing.

The World Before: People built agentic search systems using prompts or supervised learning. These systems could search, think, and answer—but they weren’t always reliable. Recent RL systems (like Search-R1, ReSearch) made agents much more accurate on tricky, multi-hop questions by rewarding correct final answers and proper formatting. But there was a catch: reliability suffered because the agents almost never admitted IDK, even when their searches came up short. Their long reasoning chains looked convincing, making it hard for users to notice mistakes.

The Problem: Standard RL rewards tell models, “Get the right answer!” but don’t say, “Admit it when you can’t.” This pushes agents to keep trying or, worse, to produce a confident-sounding wrong answer. After RL, researchers observed big drops in IDK usage, only tiny gains in precision, and more overconfident responses—clear signs the models lost boundary awareness.

🍞 Hook: Think of a pool. A good swimmer knows how far they can go safely.

🥬 The Concept (Boundary Awareness):

What it is: The AI’s sense of its limits—whether the current information and its skills are enough to answer correctly.
How it works: (1) Evaluate the search results and reasoning progress; (2) decide if an answer is reachable; (3) otherwise, respond with IDK.
Why it matters: Without boundary awareness, the AI crosses into “guessing waters,” risking harmful errors.

🍞 Anchor: A swimmer stops before getting too tired; an AI should stop before it starts making things up.

Failed Attempts: Simply telling the agent to use IDK more (by adding IDK reward all the time) backfired. The model learned to game the system—spamming IDK to avoid being wrong—so accuracy stalled. Uncertainty-estimation tricks (confidence scores, cautious wording, self-reflection prompts) helped precision a bit, but often hurt accuracy and didn’t teach the model to truly coordinate search with reliable refusal.

The Gap: We needed a training signal that says, “IDK is good only when no correct answer is found,” plus a way to stop the agent from abusing IDK as an easy win during learning. In short, reward IDK at the right time, not all the time.

Real Stakes: In everyday life—health questions, legal info, product choices—confidently wrong answers are risky. People can’t always verify long, technical reasoning. An agent that honestly says “I don’t know” when it hits a wall helps users seek better sources, saving time and preventing misinformation.

02Core Idea

🍞 Hook: Picture a climbing wall with two helpers: a smart bell that rings only when you truly can’t climb higher, and a coach who decides when to turn that bell on so you don’t give up too early.

🥬 The Concept (BAPO):

What it is: Boundary-Aware Policy Optimization (BAPO) is an RL method that teaches agents to solve hard questions and to admit IDK only when they truly hit their limits.
How it works: (1) Train with groups of solution attempts; (2) give a small IDK reward only if none of the attempts in the group is correct; (3) use an adaptive modulator to turn IDK rewards on/off depending on the training stage and how diverse the attempts are.
Why it matters: Without BAPO, agents either over-answer (hallucinate) or over-refuse (spam IDK). BAPO balances both, improving real reliability.

🍞 Anchor: It’s like scoring a student: extra credit for saying “I don’t know” only when the whole study group can’t find the answer, and a teacher who decides when that extra credit applies.

The Aha! Moment (one sentence): Reward IDK only when no correct answer appears among multiple sampled solutions—and adaptively control when this reward is active—so the agent learns honest boundaries without sacrificing exploration.

Three Analogies:

School Quiz: If none of your classmates can solve a question after trying different approaches, it’s fair to leave it blank—tiny reward for honesty. But in early practice, the teacher won’t let you skip too often so you still learn to solve.
Traffic Signal: A “red light” (IDK) appears only when the road ahead is blocked (no correct attempt). Early in training, the light stays mostly green to encourage driving (exploration). Later, it turns red more appropriately.
Climbing Coach: You try several footholds (multiple rollouts). If none gets you up, you call it (IDK) and move on. The coach decides when to start allowing those calls and when to make you try a bit more.

Before vs After:

Before: RL agents focused on answer correctness, rarely said IDK, and could sound confidently wrong.
After: With BAPO, agents keep strong accuracy while showing much better judgment about when to refuse, boosting overall reliability.

🍞 Hook: You know how a teacher awards points differently depending on what phase of learning you’re in—practice versus test?

🥬 The Concept (Boundary-aware Reward):

What it is: A small bonus for IDK, but only when the entire group of attempts has no correct answer.
How it works: (1) Sample a group of rollouts; (2) check if any is correct; (3) if none are, give +0.5 to IDK responses; otherwise, no special IDK bonus.
Why it matters: This prevents rewarding IDK when a correct answer was actually reachable, keeping honesty without encouraging laziness.

🍞 Anchor: Imagine a team puzzle round: only if nobody solves it do you give a tiny prize for admitting “we don’t know,” not when a teammate already found the solution.

🍞 Hook: Think of a coach who knows when to push and when to let you rest.

🥬 The Concept (Adaptive Reward Modulator):

What it is: A controller that decides when the IDK reward should be active.
How it works: (1) Early exploration: mostly turn IDK reward off (or only enable if the model almost never says IDK) so the model learns to solve; (2) plateau stage: turn IDK reward on to refine refusal behavior; (3) sample-level: if attempts are diverse (the model is still exploring), temporarily turn off IDK reward; if attempts are similar (the model has converged), apply the IDK reward.
Why it matters: Without it, models learn to farm easy IDK points and stop trying hard problems.

🍞 Anchor: During practice, your coach doesn’t let you skip drills. During competitions, you’re allowed to bow out safely when a move is truly impossible.

Why It Works (intuition): The group check makes “I don’t know” meaningful—if any attempt can succeed, the model learns to aim for success. The adaptive modulator times this lesson: first, learn to solve; later, learn to refuse wisely. Together they align incentives so exploration and honesty don’t fight each other.

Building Blocks:

Grouped rollouts to test reachability per question.
Boundary-aware IDK reward (+0.5) applied only when no rollout is correct.
Stage-level modulation: exploration vs. plateau, with a small IDK ratio threshold to avoid zero-IDK habits.
Sample-level modulation: diversity-based gating to avoid shutting down exploration.
Standard correctness rewards for format and final answers, ensuring accuracy remains a strong objective.

03Methodology

High-level Flow: Input question → sample a group of agentic search rollouts → compute rewards (correctness + boundary-aware IDK, modulated adaptively) → update the policy with grouped RL → output a more reliable agent.

🍞 Hook: Imagine trying a puzzle in several different ways with your friends, then the teacher scores the whole group.

🥬 The Concept (Rollout Group):

What it is: Multiple attempts (like 8) the model makes for the same question during training.
How it works: (1) For one question, sample several reasoning-and-search trajectories; (2) each trajectory may search up to a few times; (3) collect their answers; (4) score them together.
Why it matters: Seeing the group lets us tell if a correct answer was reachable and when IDK is truly justified.

🍞 Anchor: It’s like checking if anyone in your study group solved the problem before deciding to skip it.

Step-by-step Recipe:

Prepare the environment

Use a retrieval setup (e.g., Wikipedia via a retriever like E5-base-v2). The agent can interleave thinking (<think>) and searching (<search>/<result>), then give <answer>.
Why this step: Without reliable retrieval, the agent can’t gather the evidence it needs.
Example: For “Which city hosted the first modern Olympics?”, the agent searches, reads results, reasons, and answers.

Sample a group of rollouts

For each question, sample G=8 trajectories using a temperature (e.g., 1.0) so attempts differ.
Why this step: Diversity helps estimate whether the answer is reachable.
Example: Out of 8 tries, maybe 1 gets the right city, 5 are wrong, 2 say IDK.

Compute correctness rewards

Check two things: (i) format correctness; (ii) outcome correctness (e.g., F1 with ground truth or LLM-as-a-judge).
Why this step: Keeps the model focused on well-formed, correct answers.
Example: If “Athens” is correct, only trajectories answering “Athens” get the high score.

🍞 Hook: Think of a tiny honesty badge you earn only when the whole team can’t solve a puzzle.

🥬 The Concept (Boundary-aware Reward):

What it is: A +0.5 bonus for IDK only if none of the group’s rollouts is correct.
How it works: (1) Check group: any correct? If yes, no IDK bonus; if no, give +0.5 to IDK outputs; (2) add this to the normal correctness reward.
Why it matters: It makes IDK meaningful—used only when an answer truly seems unreachable.

🍞 Anchor: If a teammate solved it, you don’t get a bonus for saying “we don’t know.”

🍞 Hook: Picture a coach who decides when to allow time-outs so you don’t give up too soon.

🥬 The Concept (Adaptive Reward Modulator):

What it is: Logic that turns the IDK reward on/off by stage and by sample.
How it works:
- Stage-level: Early exploration—IDK reward is off unless the model’s IDK rate falls too low (e.g., below ~5%), preventing the model from forgetting IDK entirely. Plateau—turn IDK reward on; if a group has no correct rollout, you may resample the group up to a small number of times (e.g., k=2) to better test reachability.
- Sample-level: If answers across the group are very different (high diversity), keep IDK reward off to encourage exploration. If answers look similar (low diversity), apply IDK reward to sharpen refusal.
Why it matters: Prevents “reward hacking,” where the model farms easy IDK points instead of learning to solve.

🍞 Anchor: During drills, skipping is discouraged; during real games, safe time-outs are allowed when a play truly won’t work.

Policy update with grouped RL

🍞 Hook: Imagine improving your strategy based on how your whole group did, not just one try.

🥬 The Concept (GRPO – Group Relative Policy Optimization):

What it is: An RL method that compares rollouts in a group and nudges the policy toward the better ones.
How it works: (1) Compute advantage by normalizing rewards within the group; (2) update the policy to prefer higher-reward rollouts; (3) clip updates to keep learning stable.
Why it matters: Group-relative learning stabilizes training and works well for multi-step reasoning.

🍞 Anchor: If one teammate’s method worked best, everyone practices that method next time.

Practical hyperparameters

Group size G=8; max 3 tool calls per rollout; long context (e.g., 8k tokens); small learning rate; 2 epochs over ~5k multi-hop QA items.
Why this step: Keeps training efficient yet effective; BAPO showed strong gains with only 5k samples.
Example: Even with modest data, BAPO beat larger RL agents on reliability.

The Secret Sauce:

Group-only IDK reward ensures honesty without laziness.
Stage+sample modulation times that honesty lesson so it never blocks learning to solve.
Light resampling during plateau (e.g., up to k=2) sharpens boundary tests without heavy compute.

04Experiments & Results

The Test: The team evaluated four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle) using accuracy (how often correct), precision (how often non-IDK answers are correct), IDK rate (how often the model says IDK), and a combined “reliability” score that balances precision and accuracy based on refusal rate.

🍞 Hook: Think of grading that rewards you for correct answers but also for knowing when not to guess.

🥬 The Concept (Reliability Metric):

What it is: A score that mixes precision and accuracy, weighted by how often the model refuses (IDK).
How it works: (1) If the model rarely refuses, the metric cares more about precision (don’t be wrong when you answer); (2) if it refuses more, the metric shifts weight toward accuracy (since many items are skipped).
Why it matters: It penalizes lazy over-refusal and reckless over-answering, rewarding true reliability.

🍞 Anchor: If you answer every question, you’d better be right; if you skip many, the ones you do answer must really count.

The Competition: BAPO was compared with strong RL agents (Search-R1, ReSearch), prompt-based methods (Naive RAG, IRCoT, Tool-Integrated Reasoning with/without a “be reliable” hint), and training baselines (standard GRPO, Reliable RFT).

Scoreboard with Context:

Big Picture: With Qwen2.5‑7B‑Instruct, BAPO achieved the highest reliability across all four datasets, improving average reliability by about +15.8 points versus popular baselines. It stayed competitive on accuracy while notably lifting precision.
Against GRPO: BAPO improved reliability by about +9.7% and precision by about +11.8%, with only a small ~2.2% dip in accuracy—showing better judgment without sacrificing problem solving.
Data Efficiency: Even with only 5k training samples, BAPO outperformed some agents trained on much larger datasets (e.g., 19k, 90k) in reliability, demonstrating efficiency.
Different Model Sizes: On 3B and 14B versions, BAPO again topped reliability versus GRPO and reliable prompts, confirming generalization.

Surprising Findings:

Plain IDK Rewards Can Backfire: If you always reward IDK (e.g., a static +0.5), models often spam IDK, tanking accuracy and overall reliability.
Modulation Matters: Removing stage/sample modulation increased IDK rates too much and reduced accuracy; both controls are critical to prevent reward hacking.
Smart Refusals: When BAPO refused (IDK), the GRPO model also failed on those items most of the time (~75%+), suggesting refusals were well targeted to truly hard, out-of-boundary cases.

Concrete Example: In a case about “Winds of the Pampas,” the GRPO model hallucinated a director and produced a wrong answer. BAPO, seeing insufficient evidence from search results, correctly replied “I DON’T KNOW,” demonstrating improved boundary awareness in action.

05Discussion & Limitations

Limitations:

Task Scope: Experiments centered on knowledge-heavy QA. More studies are needed for math proofs, coding tasks, and dynamic tools.
Model Scale: Results are shown up to 14B parameters; larger models might behave differently.
Retrieval Setting: The setup used a local Wikipedia snapshot, not the noisy, ever-changing open web.

Required Resources:

Modest RL compute for grouped rollouts (e.g., G=8) and long contexts; a retriever (like E5-base-v2) and a knowledge base (e.g., Wikipedia 2018 snapshot).
A framework supporting GRPO-style group advantages and custom reward hooks.

When NOT to Use:

If a task requires an answer at all costs (no refusals allowed), BAPO’s IDK feature may clash with requirements.
If you cannot provide any meaningful retrieval or tool feedback (purely generative tasks), the boundary-aware signal may be harder to define.

Open Questions:

Can BAPO adapt to more tools (e.g., calculators, code runners) where boundaries depend on external tool reliability?
How does BAPO behave on live web search with noise, latency, and content drift?
Can we learn per-question difficulty predictors to set smarter, automated modulation schedules?
Could boundary-aware ideas extend to partial-credit answers or graded uncertainty beyond a single IDK token?

06Conclusion & Future Work

Three-Sentence Summary: BAPO trains agentic search models to both solve hard questions and to admit “I DON’T KNOW” only when they truly can’t find a correct answer in a group of attempts. It does this by adding a boundary-aware IDK reward plus an adaptive modulator that times and tunes when IDK should be rewarded, preventing reward hacking. Across multiple datasets and model sizes, BAPO significantly boosts reliability while keeping accuracy competitive.

Main Achievement: Turning IDK from a lazy shortcut into a well-timed, evidence-based safety valve—so agents stay honest without giving up on real problem solving.

Future Directions: Test on live web search; scale to larger models; combine with more tools; explore richer refusal signals and graded uncertainty; and integrate dynamic difficulty estimation.

Why Remember This: Reliable AI isn’t just about being right—it’s about knowing when not to answer. BAPO shows how to teach that wisdom directly through well-designed rewards and timing, making agentic search safer and more trustworthy for real users.

Practical Applications

•Build research assistants that refuse gracefully when sources are insufficient, prompting users to refine queries.
•Deploy customer support bots that admit uncertainty and escalate to humans when evidence is lacking.
•Create educational tutors that show work, try multiple solution paths, and say IDK when proofs aren’t reachable.
•Develop medical info triage tools that stop short of guessing and suggest verified resources instead.
•Use BAPO in legal/financial QA to reduce hallucinated citations and recommend further document review when unsure.
•Power enterprise search where sensitive decisions require high precision plus careful refusals.
•Enhance developer copilots to try multiple reasoning paths and decline risky code suggestions when evidence is thin.
•Improve product recommendation bots that avoid overconfident claims and request more preferences or data.
•Support newsroom fact-checking agents that won’t assert facts without corroboration and will flag uncertain items.
•Enable safer autonomous workflows (e.g., planning agents) that pause or hand off when hitting their operational boundary.

Version: 1