
ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration

Intermediate
Yifei Chen, Guanting Dong, Zhicheng Dou · 1/11/2026
arXiv · PDF

Key Summary

  • ET-Agent is a training framework that teaches AI agents to use tools (like search and code) more wisely, not just to get the right answer.
  • It fixes two big behavior problems: using tools too many times (wasteful) and using tools too few times (misses facts), plus bad reasoning steps and broken tool calls.
  • A self-evolving data flywheel creates better practice examples by trimming extra steps from good solutions and repairing mistakes in bad ones.
  • Action Space Exploration Fine-tuning helps the model try many different safe ways to use tools without crashing.
  • Group-wise Pareto Sampling picks training questions that show both accuracy gaps and behavior differences, keeping useful diversity.
  • Curriculum RL Training rewards answers that are correct, clearly written, and efficient in tool use—so the agent learns to be both right and swift.
  • Across six tough benchmarks (math and knowledge tasks), ET-Agent gets the best average correctness (60.1) and efficiency (46.0).
  • It also reduces redundant calls, shortens reasoning, and raises the rate of successful tool execution.
  • Visualizations show the model first explores widely, then neatly converges to better habits.
  • The framework offers a practical path to build AI that reasons well and acts efficiently with tools.

Why This Research Matters

AI agents increasingly help with research, coding, planning, and learning. If they overuse tools, they become slow and expensive; if they underuse tools, they miss facts and make mistakes. ET-Agent shows how to train agents to be both accurate and efficient—using the right tool at the right time, with clean, concise reasoning. This reduces latency and compute costs, which matters for everyday apps and large-scale deployments. It also improves reliability by cutting broken tool calls and messy logic. In short, ET-Agent helps build trustworthy AI assistants that feel faster, smarter, and more helpful.

Detailed Explanation


01 Background & Problem Definition

Top Bread (Hook): Imagine you’re doing a school project. Sometimes you look things up online, sometimes you do math on a calculator, and sometimes you just think it through. If you keep googling the same thing, you waste time. If you never google when you need to, you get stuck. Good students learn when to search, when to calculate, and when to think.

Filling (The Actual Concept — The World Before → Problem → Failed Attempts → Gap → Stakes):

  • What it is: Tool-Integrated Reasoning (TIR) is when an AI uses external tools—like search engines and code interpreters—while thinking step by step to solve harder problems.
  • How it worked before: Earlier systems mostly chased higher answer accuracy. They let the AI call tools freely and only checked if the final answer was correct. This meant the AI could over-search, under-search, write buggy code, or wander off in its thoughts—as long as it sometimes landed on the right answer.
  • Why that’s a problem: Two things broke. (1) Inefficiency: The agent often made redundant tool calls—like asking the same question over and over—which wastes time and compute, and slows real users. (2) Missed facts and broken steps: Sometimes the agent stopped too early (insufficient tool calls) or wrote code/search queries that failed (aborted execution), leading to wrong answers or confusion.
  • What people tried:
    1. Data-only fixes: Teach by example (SFT) or compare good vs bad paths (DPO). These can narrow the agent’s habits too much, so it copies a small set of moves and stops exploring better ones.
    2. RL-only fixes: Reward fewer tool calls to fight overuse. That helps with redundancy, but not with other errors like insufficient calls, faulty logic, or broken queries.
  • The missing piece: A method that (a) grows the agent’s safe exploration of many different valid tool-using paths and (b) progressively calibrates multiple behavior issues—not just one—while still protecting answer correctness.
  • Real stakes: In daily life, we want AI that can research for us, solve math, plan trips, or check facts. If it makes too many calls, it’s slow and costly. If it makes too few, it’s wrong. If it writes bad code or messy reasoning, it breaks. We need agents that act like smart students—using the right tool at the right time, in the right way.

Bottom Bread (Anchor): Think of an AI helping you answer, “Who lived longer, Person A or Person B?” If it keeps searching the same thing (redundant), it wastes time. If it stops after the first search (insufficient), it might miss Person B’s age. If it types a broken code snippet (aborted), it crashes. A well-trained agent searches each person once, runs a tiny calculation correctly, and gives the right answer fast.
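To see what this loop looks like in code, here is a minimal Python sketch of a tool-integrated reasoning agent. The `<search>`, `<python>`, and `<answer>` tags and the `call_llm`, `web_search`, and `run_python` callables are illustrative assumptions, not the paper's actual interface.

```python
import re

MAX_STEPS = 8  # cap on tool calls so the loop cannot run forever


def tir_agent(question, call_llm, web_search, run_python):
    """Minimal tool-integrated reasoning loop (a sketch, not the paper's code).

    The model alternates thinking with tool calls until it emits an answer.
    call_llm, web_search, and run_python are hypothetical callables supplied
    by whoever runs the sketch.
    """
    transcript = f"Question: {question}\n"
    for _ in range(MAX_STEPS):
        step = call_llm(transcript)  # model writes the next think / tool / answer step
        transcript += step + "\n"

        if match := re.search(r"<search>(.*?)</search>", step, re.S):
            result = web_search(match.group(1).strip())  # one focused lookup
            transcript += f"<result>{result}</result>\n"
        elif match := re.search(r"<python>(.*?)</python>", step, re.S):
            result = run_python(match.group(1))  # one small calculation
            transcript += f"<output>{result}</output>\n"
        elif match := re.search(r"<answer>(.*?)</answer>", step, re.S):
            return match.group(1).strip()  # done: a final answer was produced

    return None  # ran out of budget, the very behavior ET-Agent trains away
```

The step cap also hints at why behavior matters: an agent that loops on redundant searches burns its budget before it ever answers.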

02 Core Idea

Top Bread (Hook): You know how a good coach doesn’t just care if the team wins; they also fix bad habits—like dribbling too much or never passing—so the team wins more often and with less effort.

Filling (The Actual Concept):

  • What it is (Aha! in one sentence): ET-Agent teaches AI not just to be correct, but to develop healthy tool-use habits by growing diverse practice data and then calibrating behaviors with a smart, staged RL process.
  • How it works (step by step):
    1. Build a self-evolving data flywheel: take good solutions and trim the fat; take bad solutions and fix the first mistake or nudge them to try the needed extra tool calls; repeat to create rich, diverse, high-quality training paths.
    2. Fine-tune for exploration: teach the model to safely explore many styles of tool use (and avoid broken formats/executions).
    3. Pareto sample groups: choose training questions that show both accuracy variation and behavior variety so gradients don’t vanish.
    4. Curriculum RL training: reward correctness, clean formatting, and efficiency (fewer, smarter calls and shorter thoughts), tightening behavior round by round.
  • Why it matters: Without behavior calibration, agents either overuse tools, underuse tools, or break tools—hurting real-world usefulness.

Multiple Analogies (three ways):

  1. Chef analogy: A chef doesn’t slice the same tomato five times (redundant) or skip the oven when baking (insufficient). ET-Agent trains the chef to pick and use the right kitchen tool once, cleanly.
  2. Basketball analogy: Players must know when to pass (search), shoot (compute), or dribble (think). ET-Agent is the practice plan that fixes over-dribbling, encourages smart passes, and keeps the play clean.
  3. Map/GPS analogy: If the GPS keeps rerouting unnecessarily (redundant) or refuses to recalculate when you miss a turn (insufficient), you waste time. ET-Agent trains the GPS to recalc only when needed and give concise directions.

Before vs After:

  • Before: Agents often got tied in knots—too many searches, too few searches, broken code, long wandering thoughts—sometimes right, often wasteful.
  • After: Agents keep answers strong while using fewer, smarter tool calls with cleaner reasoning and higher execution success.

Why It Works (intuition):

  • Growing better data first broadens safe exploration, so the agent sees many valid ways to win.
  • Pareto sampling keeps training batches diverse yet high-value, preventing everyone in the group from looking the same.
  • Curriculum RL shapes habits steadily: correct outputs get amplified, but only if they’re also efficient and well-formed, so reward hacking is harder.

Building Blocks (each with quick Sandwich explanations):

  • Tool-Integrated Reasoning (TIR) Top Bread: Imagine a student who can think, search the library, and use a calculator. Filling: TIR is an AI that reasons step by step and can call a search engine or a code tool to help. It looks up facts or runs math, then keeps thinking. Bottom Bread: The agent searches a Wikipedia page for a date, runs a tiny Python calculation, and answers a history-math question.
  • Reinforcement Learning (RL) Top Bread: Like training a puppy with treats. Filling: The AI tries actions, gets rewards for good behavior (correct, efficient), and learns to repeat what works. Bottom Bread: The agent gets more reward when the answer is right and it used fewer tool calls.
  • Behavior Calibration Training Top Bread: A coach fixes habits, not just final scores. Filling: It’s training that targets tool-use patterns (too many, too few, broken calls) so the AI’s process gets better, not just the answer. Bottom Bread: The agent learns to make one precise search instead of three similar ones.
  • Self-evolving Data Flywheel Top Bread: A snowball that grows as it rolls. Filling: Keep improving data by trimming good paths and repairing bad ones, then loop—so training material becomes richer each round. Bottom Bread: Turn a long, correct solution into a shorter one, and fix a wrong one by adding a missing search.
  • Pareto Sampling Top Bread: Picking fruits that are both ripe and diverse so you don’t end up with all apples. Filling: Select examples that are strong on accuracy spread and behavior spread, keeping variety without losing quality. Bottom Bread: Choose questions where some paths are right but long, others are shorter—so the model learns to prefer the shorter correct ones. (A selection sketch follows this list.)
  • Curriculum Learning Top Bread: First learn addition, then multiplication. Filling: Train in stages; gradually tighten rewards for efficiency so the model steps up smoothly. Bottom Bread: Round 1 is forgiving about length; by Round 3, shorter thinking gets extra points.
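To make the Pareto Sampling building block concrete, here is a simplified Python sketch. It assumes each question comes with K sampled trajectories carrying a correctness score and a tool-call count; the standard-deviation dispersions and the NSGA-II-style crowding distance are one plausible reading of the paper's description, not its exact procedure.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class QuestionGroup:
    question: str
    scores: list[float]     # correctness score of each of the K sampled trajectories
    tool_calls: list[int]   # number of tool calls in each trajectory


def dispersions(group: QuestionGroup) -> tuple[float, float]:
    """Correctness dispersion and behavioral dispersion for one question's samples."""
    return pstdev(group.scores), pstdev(group.tool_calls)


def crowding_distance(points: list[tuple[float, float]]) -> list[float]:
    """NSGA-II style crowding distance over the two dispersion axes."""
    n = len(points)
    dist = [0.0] * n
    for axis in range(2):
        order = sorted(range(n), key=lambda i: points[i][axis])
        dist[order[0]] = dist[order[-1]] = float("inf")  # always keep the extremes
        span = (points[order[-1]][axis] - points[order[0]][axis]) or 1.0
        for k in range(1, n - 1):
            dist[order[k]] += (points[order[k + 1]][axis] - points[order[k - 1]][axis]) / span
    return dist


def pareto_select(groups: list[QuestionGroup], budget: int) -> list[QuestionGroup]:
    """Keep questions that are not worse on both dispersion axes; trim by variety."""
    points = [dispersions(g) for g in groups]
    frontier = [
        g for i, g in enumerate(groups)
        if not any(  # dominated: another question is >= on both axes and not identical
            points[j][0] >= points[i][0] and points[j][1] >= points[i][1]
            and points[j] != points[i]
            for j in range(len(groups)) if j != i
        )
    ]
    if len(frontier) <= budget:
        return frontier
    dist = crowding_distance([dispersions(g) for g in frontier])
    ranked = sorted(range(len(frontier)), key=lambda i: dist[i], reverse=True)
    return [frontier[i] for i in ranked[:budget]]
```

Groups chosen this way keep real within-group differences alive, which is what prevents the RL gradients from vanishing later on.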

03 Methodology

Top Bread (Hook): Think of a two-part recipe: first, build great practice drills; then, run smarter practices that reward clean, efficient play.

Filling (The Actual Concept — High-level Pipeline):

  • What it is: ET-Agent = Data Flywheel + Behavior Calibration Training.
  • How it works (high level): Input → Self-evolving Data Flywheel → Action Space Exploration Fine-tuning → Group-wise Pareto Sampling → Curriculum RL Training → Output (a calibrated agent that is accurate and efficient).
  • Why it matters: Without the first phase, the model doesn’t explore enough safe, diverse tool-use paths. Without the second, it never tightens habits to optimal behavior.

Step-by-step, like a recipe:

  1. Self-evolving Data Flywheel Top Bread: Imagine editing your best homework to remove extra words, and fixing your worst homework by correcting the first mistake—then doing that for several rounds (a loop sketch appears at the end of this section). Filling:
  • Initialization: For each question, generate multiple tool-using trajectories; split into Correct Set and Incorrect Set.
  • Correct Reasoning Enhancement: (a) Redundant Modification—locate the first redundant step (e.g., an unhelpful extra search) and regenerate the rest; (b) Global Refinement—clean up the text inside think steps while keeping the same tool call order.
  • Incorrect Reasoning Reflection: (a) Self-Correction—find the first flawed step (bad logic or tool misuse), fix it, and continue; (b) Hint Injection—insert gentle nudges like “Try another search” or “Use Python here,” encouraging needed extra calls.
  • Iterate R rounds, merging improved samples back to the pool to form D_aug. Why it exists: It grows diverse, high-quality, executable paths that cover more of the tool-use action space—critical for exploration. Anchor: A math path that was correct but long gets shortened by removing an unnecessary Python echo; a wrong path gets fixed by adding one missing web search.
  2. Action Space Exploration Fine-tuning (RFT on D_aug) Top Bread: Before a big game, you run drills that are safe and varied, so players see many plays but avoid injuries. Filling:
  • Quality control: Drop trajectories with wrong final answers, broken formats, or failed executions.
  • Rejection Sampling Fine-tuning: Train the model to imitate only the good, diverse, well-formed paths. Why it exists: Prevents bad habits (like aborted tool calls) and expands safe exploration. Anchor: The model practices multiple short, correct ways to solve “compare two lifespans,” not just one script.
  3. Group-wise Pareto Sampling Top Bread: When building a practice set, pick both the challenges where accuracy differs and where playing styles differ. Filling:
  • Sample K trajectories per question; compute two dispersions: Correctness Dispersion (variation in scores) and Behavioral Dispersion (variation in tool-call counts).
  • Pareto frontiers: Keep samples that aren’t worse on both axes; if too many, use crowding distance to retain variety. Why it exists: Avoids homogenized groups that cause tiny gradients. Keeps training impactful and diverse. Anchor: Choose questions where some solutions are right but long, some short but wrong, so the agent learns to be right and short.
  4. Curriculum RL Training (with ARPO) Top Bread: Start with easy drills and light rules; tighten the rules as the team improves (a reward-shaping sketch follows this list). Filling:
  • Rewards:
    • Format reward: penalize malformed outputs.
    • Correctness reward: F1 for knowledge tasks; binary for math.
    • Efficiency rewards: push toward fewer tool calls and shorter, clearer thinking using group-relative scores (logistic shaping), so being better than your group is rewarded.
  • ARPO optimization: Train in rounds; normalize rewards within groups; gradually decrease the efficiency-shaping sigmas each round to prevent reward hacking. Why it exists: It aligns multiple goals—right, clean, and efficient—so no single metric dominates badly. Anchor: A candidate solution that is correct, uses two searches instead of five, and keeps the chain-of-thought brief gets the highest reward.
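Here is the reward-shaping sketch promised in step 4. The weights, the clamped logistic form, and the decay schedule are illustrative choices; only the ingredients (a format penalty, F1 or binary correctness, group-relative efficiency bonuses, and sigmas that shrink each round) come from the description above.

```python
import math


def shaped_bonus(advantage: float, sigma: float) -> float:
    """Logistic shaping: a 0-to-1 bonus that grows when the group-relative advantage is positive."""
    z = max(-60.0, min(60.0, advantage / max(sigma, 1e-6)))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))


def trajectory_reward(correct_score, well_formed, tool_calls, think_tokens,
                      group_mean_calls, group_mean_tokens, round_idx,
                      sigma_calls=4.0, sigma_tokens=200.0, decay=0.7):
    """Correctness plus behavior-aware shaping (illustrative weights and schedule).

    correct_score is F1 for knowledge tasks or 0/1 for math; the efficiency terms
    compare one trajectory to its group's averages, and the shaping sigmas shrink
    each curriculum round so the behavior target keeps tightening.
    """
    if not well_formed:
        return -1.0  # format reward: malformed outputs are penalized outright

    tighten = decay ** round_idx  # curriculum schedule: smaller sigmas in later rounds
    fewer_calls = shaped_bonus(group_mean_calls - tool_calls, sigma_calls * tighten)
    shorter_think = shaped_bonus(group_mean_tokens - think_tokens, sigma_tokens * tighten)

    # Efficiency only amplifies answers that are already correct, so being short
    # but wrong never beats being right.
    return correct_score * (1.0 + 0.25 * fewer_calls + 0.25 * shorter_think)
```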

Secret Sauce:

  • The flywheel seeds safe diversity; Pareto sampling preserves useful differences; curriculum RL steadily shifts the model toward concise, reliable strategies without hurting accuracy.

Bottom Bread (Anchor): Picture the final agent tackling “Who is Sancho Ramírez’s maternal grandfather?” It makes one focused search for Sancho Ramírez, one for his mother’s lineage, skims the right page, and answers—no looped searches, no broken code, no rambling.
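Pulling step 1 together, here is the loop sketch mentioned earlier: a loose Python outline of the flywheel's outer loop. The editors are passed in as callables because in practice they are LLM-driven; their names and the tuple shape of a trajectory are assumptions, not the paper's interface.

```python
def run_data_flywheel(seed, rounds, is_correct, trim_redundant, repair_first_error, inject_hint):
    """Self-evolving data flywheel (sketch): trim good paths, repair bad ones, loop.

    seed is a list of (question, trajectory, final_answer) tuples. The four
    callables are hypothetical LLM-driven editors; each returns an edited
    (question, trajectory, final_answer) tuple, or None if it gives up.
    """
    pool = list(seed)
    for _ in range(rounds):  # iterate R rounds
        grown = []
        for question, trajectory, answer in pool:
            if is_correct(question, answer):
                # Correct Reasoning Enhancement: cut the first redundant step
                # (e.g., a repeated search) and regenerate what follows.
                edited = trim_redundant(question, trajectory)
            else:
                # Incorrect Reasoning Reflection: fix the first flawed step, or
                # nudge the model toward a missing tool call with a hint.
                edited = repair_first_error(question, trajectory) or inject_hint(question, trajectory)
            if edited is not None:
                grown.append(edited)
        pool.extend(grown)  # merge the improved samples back into the pool
    return pool  # the grown pool plays the role of D_aug
```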

04 Experiments & Results

Top Bread (Hook): If a team says they improved, you want scoreboards that show not just wins, but also fewer fouls and faster plays.

Filling (The Actual Concept — Tests, Competition, Scoreboard, Surprises):

  • What they measured and why:
    1. Correctness: Are the answers right? (LLM-as-judge or F1 for knowledge tasks; binary for math)
    2. Efficiency: How much correctness per tool call? Higher is better—like scoring more points per pass. (A tiny helper sketch follows this list.)
    3. Conciseness: Fewer redundant tool calls.
    4. Successful Execution: Fewer broken tool runs.
    5. Reasoning Length: Shorter, cleaner thinking (tokens) excluding tool outputs.
  • Who they competed against: Direct inference without tools and many state-of-the-art single-tool and multi-tool TIR systems (Search-o1, Search-R1, Research, WebThinker, WebSailor, ToRL, DotaMath, START, IKEA, SMART, AutoTIR, Tool-Star, Tool-Light).
  • Datasets: Three math (AIME24, AMC23, MATH500) and three knowledge-intensive (2WikiMultiHopQA, Bamboogle, MuSiQue).
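As a rough illustration of the efficiency measure from point 2 (correctness per tool call), here is the tiny helper mentioned above; the paper's exact normalization may differ.

```python
def efficiency(correct_scores: list[float], tool_call_counts: list[int]) -> float:
    """Average correctness earned per tool call (higher is better; illustrative only).

    A solution that is right after two calls scores higher here than one that is
    right after five, which is the habit the efficiency metric is meant to reward.
    """
    per_example = [
        score / max(calls, 1)  # guard against trajectories that made no tool calls
        for score, calls in zip(correct_scores, tool_call_counts)
    ]
    return sum(per_example) / len(per_example)
```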

Scoreboard with context:

  • ET-Agent averages 60.1 correctness and 46.0 efficiency across the six tasks—like earning an A for accuracy while also finishing the test faster than the class.
  • On AIME24, ET-Agent scores 46.7 vs the next best around 33.3, showing stronger competition-level math reasoning.
  • Efficiency wins are consistent: AutoTIR, Tool-Star, and Tool-Light are strong, but ET-Agent balances correctness with fewer, smarter calls and shorter thoughts.
  • Behavior metrics: best or near-best in Conciseness (fewer redundant calls; avg ≈ 57.0), Successful Execution (fewer broken calls; avg ≈ 56.4), and Reasoning Length (shorter thoughts; avg ≈ 3.07 normalized units).

Surprising findings:

  • Big action space: Even correct solutions differ a lot in tool-call counts, proving there are many ways to be right. ET-Agent leverages this by exploring widely first, then converging.
  • Distribution shift visualized: t-SNE plots show RFT spreads outputs (exploration), and RL compacts them (calibration). This validates the design: explore, then refine.
  • Vanilla RL vs ET-Agent: When RL optimizes only correctness, efficiency stalls. Adding behavior-aware rewards boosts both reward and efficiency together.

Bottom Bread (Anchor): It’s like a debate team that not only wins more matches but also speaks more clearly, uses fewer note lookups, and avoids microphone glitches—because they practiced better habits, not just memorized answers.

05 Discussion & Limitations

Top Bread (Hook): Even the best playbook has limits—you still need the right stadium, players, and time.

Filling (The Actual Concept — Honest Assessment):

  • Limitations (be specific):
    1. Retrieval scope: Experiments used local Wikipedia retrieval for knowledge tasks; live web can be noisier and harder.
    2. Model scale: Training larger base models with this pipeline is resource-intensive.
    3. Tool set: Only search and code tools were tested; broader tool ecosystems (APIs, databases, planners) may require extra tuning.
    4. Reward tuning: Behavior rewards (tool-count and length shaping) need careful schedules (sigma decay) to avoid reward hacking.
  • Required resources: A capable 7–8B base model, GPUs for RFT and RL (the paper used 4×A800), a retrieval stack (e.g., local Wikipedia) or web access, and logging to detect format/execution errors.
  • When NOT to use:
    • Ultra-simple tasks where tools aren’t needed (overhead not worth it).
    • Domains with tools that frequently fail or change schemas; behavior rewards may mislead the agent.
    • Settings where transparency of tool calls isn’t allowed or logging is prohibited.
  • Open questions:
    • How to generalize to live web with changing pages and partial failures?
    • How to extend to multi-agent or multi-tool orchestration under tight latency budgets?
    • Can we learn per-task optimal efficiency targets automatically (instead of fixed sigmas)?
    • How to measure the “usefulness” of a tool call beyond count, e.g., novelty or marginal information gain?

Bottom Bread (Anchor): Think of adding more instruments to a school band—more power, but also more chances to go off-beat. ET-Agent sets a steady rhythm, but bigger bands and new instruments will need extra conducting.

06 Conclusion & Future Work

Top Bread (Hook): The best students don’t just get the right answers—they also learn better study habits so they can keep getting answers right, faster.

Filling (The Actual Concept — Takeaway):

  • Three-sentence summary: ET-Agent is a two-phase training framework that first grows better, more diverse practice data and then steadily calibrates an AI’s tool-use habits with behavior-aware RL. It fixes common tool-use mistakes—too many, too few, or broken calls—while keeping or improving correctness. The result is an agent that answers well and acts efficiently.
  • Main achievement: Showing that calibrating behavior patterns—not just final answers—yields state-of-the-art accuracy and efficiency together across math and knowledge tasks.
  • Future directions: Scale to larger models and live web search, broaden toolsets (APIs, planners), and refine rewards that capture information gain, not just counts and lengths.
  • Why remember this: ET-Agent turns “being right” into “being right the right way”—a shift that makes AI agents faster, cheaper, and more trustworthy in the real world.

Bottom Bread (Anchor): It’s like teaching a calculator-using, library-searching student to think clearly, look up what matters once, and check with code only when needed—so homework is both correct and quick.

Practical Applications

  • Research assistants that search fewer times but find the right sources faster.
  • Math tutors that run only the necessary code calculations and explain steps briefly and clearly.
  • Customer support bots that query internal tools sparingly and avoid repeated lookups.
  • Business dashboards where the AI pulls just the required database facts and summarizes concisely.
  • Code helpers that execute small, correct snippets without crashing due to undefined variables.
  • Educational tools that guide students through minimal but sufficient hints and checks.
  • Web agents that avoid looping searches and focus on evidence that actually changes the answer.
  • Compliance assistants that keep audit trails clean by reducing redundant external calls.
  • Mobile AI features that feel snappy thanks to fewer network/tool round trips.
  • Cost-aware deployments that save API credits and compute by rewarding efficient tool use.
#Tool-Integrated Reasoning #Behavior Calibration #Self-evolving Data Flywheel #Pareto Sampling #Curriculum Reinforcement Learning #ARPO #Redundant Tool Calls #Insufficient Tool Calls #Aborted Tool Execution #Reasoning Conciseness #Efficiency Reward #Exploration Fine-tuning #Multi-objective Optimization #LLM Agents #TIR Efficiency