
Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

Beginner
Zhiwei Zhang, Fei Zhao, Rui Wang et al. · 1/22/2026
arXiv · PDF

Key Summary

  • Small AI models often stumble when a tool call fails and then get stuck repeating bad calls instead of fixing the mistake.
  • FISSION-GRPO turns each real error the model makes during training into a mini lesson with a short, specific hint so the model can try again the right way.
  • It uses an Error Simulator to generate realistic, concise error messages, like what an API would return, without revealing the correct answer.
  • Then it performs 'fission': from one error it samples multiple new recovery attempts, creating many learning signals from a single failure.
  • Compared to standard GRPO training, this approach raises the Qwen3-8B model’s error recovery rate by 5.7 percentage points and overall accuracy from 42.75% to 46.75% on BFCL v4 Multi-Turn.
  • It especially helps with long-context and missing-parameter cases, where timely, targeted correction matters most.
  • Unlike static, pre-built error datasets that grow stale, FISSION-GRPO stays aligned with the model’s current mistakes by learning on-policy.
  • A simple buffer and trigger system controls when corrections happen, trading off compute cost against freshness while keeping training stable.
  • The idea generalizes to other step-by-step tasks like code debugging and math by turning failures into guided retries.
  • Overall, the method makes smaller, faster models more reliable for real-world, multi-turn tool use.

Why This Research Matters

When AIs help with real tasks—booking flights, managing files, controlling smart homes—tools sometimes fail, and plain punishment isn’t enough to teach recovery. FISSION-GRPO makes smaller, cheaper models reliably bounce back by learning from their actual mistakes in the moment. This reduces frustrating loops, lowers latency from fewer wasted retries, and improves user trust. It also cuts the need for massive, stale error datasets by keeping training aligned with the model’s current error patterns. Better recovery means safer, more predictable behavior under changing states or partial failures. The same recipe can strengthen code assistants and math solvers by turning failed attempts into guided, constructive practice.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re learning to use a new app. Sometimes you tap the wrong button and the app pops up a helpful message like “Can’t do that—please log in first.” You read it, fix your step, and move on. Now imagine the app only buzzed angrily with no hint. You’d keep guessing and probably get stuck.

🥬 Filling (The Actual Concept)

  • What it is: This paper is about teaching smaller AI models to recover when their tool calls (like API requests) fail, especially in multi-step conversations.
  • How it works: Instead of just punishing the model when an error happens, the training turns that error into a new practice example that includes a short, realistic hint about what went wrong, and then lets the model try several smart retries.
  • Why it matters: Without recovery skills, a model can spiral into repeated bad calls (like adding a fake force=True parameter that doesn’t exist), wasting time and failing tasks users care about.

🍞 Bottom Bread (Anchor): Think of asking an AI to cancel a flight that’s already checked in. If the AI just keeps shouting the same wrong command, it fails. If it reads the error message and changes course (e.g., change status first), it succeeds.

🍞 Top Bread (Hook): You know how a teacher doesn’t only mark your answer wrong, but also writes, “You used the wrong formula—try the area formula instead”? That hint is everything.

🥬 Reinforcement Learning (RL)

  • What it is: RL is a way to train models by trying actions and getting rewards or penalties, like scoring points in a game.
  • How it works: 1) The model attempts a solution. 2) It gets a score. 3) It nudges its future choices to get better scores more often.
  • Why it matters: If the only signal is “bad,” the model knows it failed but not how to fix it.

🍞 Anchor: Like learning to shoot basketball free throws—you need to know whether to aim higher, not just that you missed.

🍞 Top Bread (Hook): Imagine a checklist for making a sandwich; you must follow the steps in the right order or it tastes wrong.

🥬 Tool-Using AI Agents

  • What it is: These are AIs that can call tools (like APIs) to act in the real world.
  • How it works: 1) Read the user’s request. 2) Choose the right tool (function). 3) Fill in the right parameters. 4) Send the call and read the result. 5) Repeat across turns.
  • Why it matters: If they misuse tools or parameters, the whole task breaks.

🍞 Anchor: If the AI calls cancel_flight with a made-up parameter, the API rejects it.
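
To make the call-and-reject cycle concrete, here is a minimal Python sketch. The flight API, its allowed parameters, and the error wording are illustrative assumptions, not the benchmark's real schema.

```python
# Illustrative sketch only: a made-up flight API, not a real benchmark schema.
# The agent emits a structured call; the environment validates it and returns
# either a result or an error string the agent must read on its next turn.

VALID_PARAMS = {"cancel_flight": {"flight_id"}}  # assumed schema

def execute(call: dict) -> dict:
    """Validate a tool call the way a strict API would."""
    allowed = VALID_PARAMS.get(call["name"])
    if allowed is None:
        return {"ok": False, "error": f"ERROR: unknown function '{call['name']}'."}
    extra = set(call["arguments"]) - allowed
    if extra:
        return {"ok": False, "error": f"ERROR: unexpected argument '{sorted(extra)[0]}'."}
    return {"ok": True, "result": "flight cancelled"}

# A made-up parameter gets the call rejected, which is exactly the failure mode above.
bad_call = {"name": "cancel_flight", "arguments": {"flight_id": "JL-123", "force": True}}
print(execute(bad_call))  # {'ok': False, 'error': "ERROR: unexpected argument 'force'."}
```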

🍞 Top Bread (Hook): When a game says “You can’t open this door without the key,” that’s helpful feedback.

🥬 Execution Errors

  • What it is: These are tool or API failures—wrong function, missing parameter, bad value, or wrong state.
  • How it works: The environment returns an error message (like 409 StateConflict or unexpected argument 'force').
  • Why it matters: The AI must read the message, diagnose the cause, and update its next action.

🍞 Anchor: Trying to cancel a checked-in flight triggers a rule-based error; the AI must change the plan.

🍞 Top Bread (Hook): Think of a GPS that reroutes after you miss a turn.

🥬 Error Recovery

  • What it is: The AI’s ability to fix its plan after an error.
  • How it works: 1) Notice the error. 2) Understand why. 3) Pick a new action that resolves the problem. 4) Continue until success.
  • Why it matters: Without recovery, the AI loops or gives up.

🍞 Anchor: If grep fails to find a file, the AI should first find the file path, then try grep again.

🍞 Top Bread (Hook): Studying from last year’s test might not match this year’s tricky questions.

🥬 Distribution Mismatch (Offline vs On-Policy)

  • What it is: Training only on old, pre-collected error cases (offline) may not match the errors your current model makes now (on-policy).
  • How it works: As the model improves, the kinds of mistakes change, so static datasets go stale.
  • Why it matters: The model won’t learn to fix its real, current mistakes.

🍞 Anchor: Practicing only easy addition won’t help if your homework now has fractions.

🍞 Top Bread (Hook): Imagine a coach who watches your actual miss and tells you exactly what to adjust.

🥬 GRPO (Group Relative Policy Optimization)

  • What it is: A memory-friendly RL method that compares a group of attempts to each other to guide learning.
  • How it works: 1) Sample several answers for one question. 2) Score them. 3) Push probability toward the better ones and away from worse ones, with a safety clip and KL regularization.
  • Why it matters: If every attempt in a group fails the same way, there’s little signal to improve—learning can stall.

🍞 Anchor: If every student in a class gets the same question wrong, the average doesn’t show who to copy.
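
A tiny numerical sketch makes the group-relative signal tangible. The mean/std normalization below is a common way to compute GRPO's advantages; the reward values are made up.

```python
# Minimal sketch of GRPO's group-relative advantage using mean/std
# normalization within one group of sampled attempts (illustrative numbers).
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each attempt = (its reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# If every attempt fails the same way, all advantages are zero and the
# update carries no signal -- the stall described above.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0.0, 0.2, 0.8, 1.0]))  # varied rewards -> useful signal
```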

The gap: Existing systems either punish errors without guidance (RL-only) or train on stale, synthetic error fixes (offline). The paper fills this gap by turning each live error into a guided, on-policy mini-lesson with multiple retries, so the model learns to truly recover rather than merely avoid errors.

02 Core Idea

🍞 Top Bread (Hook): You know how one spark can set off many tiny fireworks? From one event, you get a burst of new signals.

🥬 The “Aha!” Moment

  • What it is: Turn each real error the model makes into a corrective training case with a short, realistic hint, then explode it into multiple on-policy recovery attempts—like fission.
  • How it works: 1) Do normal RL rollouts. 2) Catch failures. 3) Ask an Error Simulator to write a concise, API-style error message. 4) Append that message to the dialogue as a corrective context. 5) Resample several new attempts conditioned on that context and train on them. 6) Repeat.
  • Why it matters: The model learns from exactly the mistakes it is making now, with guidance, so it gets better at recovery instead of looping.

🍞 Anchor: From one bad cancel_flight call, you add: “ERROR: Status is CHECKED_IN; must be OPEN.” The model then tries lawful steps next (e.g., change status) across several retries and learns what works.
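
The six steps above fit into a short training-loop sketch. Every name here (policy, score, simulate_error, grpo_update) is a placeholder for the reader's own components, and the group sizes and success threshold are assumptions, not the paper's settings.

```python
# A compact sketch of one FISSION-GRPO iteration, assuming chat-style message
# lists and user-supplied callables. This is not the paper's released code.

def fission_grpo_step(prompt, policy, score, simulate_error, grpo_update,
                      G=8, G_prime=4, success_threshold=0.5):
    # 1) Normal on-policy exploration: sample G rollouts and update as usual.
    rollouts = [policy(prompt) for _ in range(G)]
    rewards = [score(prompt, r) for r in rollouts]
    grpo_update(prompt, rollouts, rewards)

    # 2)-4) Catch each failure, ask the Error Simulator for a concise message,
    # and append it to the dialogue as a corrective context.
    for rollout, reward in zip(rollouts, rewards):
        if reward >= success_threshold:
            continue
        error_msg = simulate_error(prompt, rollout)            # "ERROR: ..."
        corrective = prompt + [{"role": "assistant", "content": rollout},
                               {"role": "tool", "content": error_msg}]

        # 5) Fission: several guided retries conditioned on the error note,
        # followed by another group-relative update.
        retries = [policy(corrective) for _ in range(G_prime)]
        grpo_update(corrective, retries, [score(corrective, r) for r in retries])
```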

Multiple Analogies

  • Coach Replay: Like watching your actual missed shot in slow motion with the coach’s short note: “Elbow in,” then you try several fixes right away.
  • Nuclear Fission: One failure spawns multiple recovery samples, multiplying learning signals from a single event.
  • GPS Rerouting: After a wrong turn, you see an immediate, context-aware note and several alternate routes to rejoin the path.

Before vs After

  • Before: Errors were mostly just negative points (don’t do that) or trained from old, generic error data (not my mistake today).
  • After: Errors become targeted mini-lessons with realistic hints and multiple on-policy retries, so the model learns exactly how to fix its own current mistakes.

Why It Works (Intuition)

  • It densifies feedback: not only “wrong,” but “wrong because X,” which reduces guesswork.
  • It restores learning signals: conditioning on feedback creates variety in outcomes within a group, so GRPO has meaningful relative advantages to optimize.
  • It stays on-policy: you train on the distribution of your actual current errors, not yesterday’s.
  • It avoids lazy suppression: by guiding recovery, it’s less likely to crush good reasoning steps hidden inside failed trajectories.

Building Blocks

🍞 Top Bread (Hook): Like a referee who explains the foul briefly and clearly.

🥬 Error Simulator

  • What it is: A fine-tuned model that produces short, API-like error messages from the context and the failed call.
  • How it works: 1) Read tools and dialogue. 2) Compare failed call to ground truth. 3) Output a concise, realistic error (no spoilers).
  • Why it matters: Clear, specific hints steer the next attempt without revealing the answer.

🍞 Anchor: “ERROR: parameter status expects value OPEN; got CHECKED_IN.”
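
Here is one way the simulator's inputs could be assembled. The instruction template is an assumption for illustration; the paper fine-tunes a dedicated model rather than publishing a fixed prompt, but the key constraint is the same: realistic, concise, and no spoilers.

```python
# A sketch of the Error Simulator's inputs. The template wording is assumed.

def build_simulator_input(tools: str, dialogue: str, failed_call: str,
                          ground_truth_call: str) -> str:
    return (
        "You simulate a strict API. Given the tools, the dialogue, the agent's "
        "failed call, and the correct call, reply with one or two sentences "
        "starting with 'ERROR:' that explain why the call failed. "
        "Never reveal the correct call.\n\n"
        f"TOOLS:\n{tools}\n\n"
        f"DIALOGUE:\n{dialogue}\n\n"
        f"FAILED CALL:\n{failed_call}\n\n"
        f"GROUND TRUTH (do not reveal):\n{ground_truth_call}\n"
    )

# Expected style of output:
#   "ERROR: parameter status expects value OPEN; got CHECKED_IN."
```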

🍞 Top Bread (Hook): Think of adding a sticky note onto your homework where you got stuck.

🥬 Corrective Context

  • What it is: The original conversation plus the failed call plus the simulator’s error message.
  • How it works: 1) Keep the dialogue. 2) Append the failed tool call. 3) Append the error note. 4) Ask the model to try again.
  • Why it matters: It grounds the retry in the exact place things went wrong.

🍞 Anchor: Dialogue + cancel_flight(...) + “ERROR: unexpected argument 'force'.”
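
Serialized as chat messages, the corrective context is just two appended entries. The role names below are common chat conventions assumed for illustration, not necessarily the paper's exact format.

```python
# Building a corrective context: dialogue + failed call + simulated error.

def build_corrective_context(dialogue: list, failed_call: str,
                             error_message: str) -> list:
    return dialogue + [
        {"role": "assistant", "content": failed_call},
        {"role": "tool", "content": error_message},
    ]

context = build_corrective_context(
    [{"role": "user", "content": "Please cancel my flight JL-123."}],
    'cancel_flight(flight_id="JL-123", force=True)',
    "ERROR: unexpected argument 'force'.",
)
# The model is then asked to continue from `context` and try again.
```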

🍞 Top Bread (Hook): One seed sprouts many shoots.

🥬 Fission Resampling

  • What it is: From one corrective context, sample several (G′) recovery attempts.
  • How it works: 1) Condition on the error note. 2) Generate multiple fixes. 3) Score them and update with GRPO.
  • Why it matters: More tries mean richer training signals around the exact failure point.

🍞 Anchor: From a single mkdir failure, the model tries: check folder, rename, skip, or update path—and learns which resolves the error.

🍞 Top Bread (Hook): A checklist keeps you from mixing up steps.

🥬 Reward Mix (Format, Correctness, Efficiency)

  • What it is: A combined score that rewards valid format, right function/parameters, and concise outputs.
  • How it works: 1) Binary format check. 2) Function + parameter overlap scoring. 3) Length regularization. 4) Weights change over time to favor semantics later.
  • Why it matters: It nudges the model from “well-formed” to “actually correct,” without rambling.

🍞 Anchor: You get points for correct JSON, the right API and args, and not writing a novel.

🍞 Top Bread (Hook): Fresh bread tastes best first.

🥬 LIFO Buffer and Trigger

  • What it is: A last-in-first-out stack of recent errors; when it fills to a threshold, corrective training fires.
  • How it works: 1) Push newest errors. 2) When size ≥ trigger, pop the freshest and run fission training. 3) Repeat periodically.
  • Why it matters: Keeps updates focused on current mistakes (on-policy) and controls compute cost.

🍞 Anchor: Fix the latest potholes on the road before they cause more flat tires.
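
A minimal buffer sketch, assuming corrective contexts are plain Python objects; the trigger size is a placeholder, not a value from the paper.

```python
# A minimal LIFO error buffer with a size-based trigger (illustrative only).

class ErrorBuffer:
    def __init__(self, trigger_size: int = 16):
        self.stack = []               # newest corrective contexts on top
        self.trigger_size = trigger_size

    def push(self, corrective_context) -> None:
        self.stack.append(corrective_context)

    def ready(self) -> bool:
        """True when enough fresh errors have piled up to fire a correction."""
        return len(self.stack) >= self.trigger_size

    def pop_batch(self, batch_size: int) -> list:
        """Pop the freshest contexts first (last in, first out)."""
        batch = self.stack[-batch_size:][::-1]
        del self.stack[-batch_size:]
        return batch
```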

03 Methodology

High-Level Flow: Input → Stage 1 (Standard GRPO exploration) → Stage 2 (Build corrective contexts with Error Simulator) → Stage 3 (Fission resampling + corrective GRPO) → Output (a model that recovers from errors).

Stage 1: Standard Exploration and Update

  • What happens: For each user query and tool set, the policy samples several rollouts (G). Each rollout is scored with a composite reward: format compliance, functional correctness, and efficiency. GRPO updates the policy based on group-relative advantages (better attempts get more probability).
  • Why this step exists: The model needs strong basics—choose the right tool and fill parameters properly—before advanced recovery can help.
  • Example: Given “Check status of flight JL-123,” the model calls get_flight_status with the right flight_id and formats the call correctly.

🍞 Hook: Like practicing piano scales to build finger strength.

🥬 GRPO in this step

  • What it is: An RL update comparing a group of attempts for one prompt.
  • How it works: 1) Sample G attempts. 2) Normalize their rewards. 3) Push toward the better ones with clipping and KL control.
  • Why it matters: It makes training stable and memory-efficient for tool calls.

🍞 Anchor: Picking the best take from multiple tries of a tongue twister.

Reward Design (three parts)

🍞 Hook: A good report card measures more than handwriting.

🥬 Format Compliance

  • What it is: A binary check that the tool call follows the schema (e.g., JSON/XML correctness).
  • How it works: Pass/fail with a weight that decreases over time (from 2 to 1), shifting focus later to deeper correctness.
  • Why it matters: If format is broken, the tool won’t even run.

🍞 Anchor: “Unexpected argument 'force'” is caught here.

🍞 Hook: Getting the right recipe and exact ingredients.

🥬 Functional Correctness

  • What it is: Matching the right function and parameters, with partial credit for partial matches.
  • How it works: Score = correct function indicator + averaged token-overlap F1 on parameters; its weight grows over time (from 2 to 3).
  • Why it matters: Close is not enough—APIs need precise parameters.

🍞 Anchor: cancel_flight vs. change_status; flight_id must match, and status must be OPEN.
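
A small sketch of the parameter-overlap part, assuming whitespace tokenization; the paper's exact tokenizer and matching rules may differ.

```python
# Token-overlap F1 for parameter values plus a correct-function indicator
# (a sketch; details of the paper's scoring may differ).
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    pred, ref = predicted.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def functional_correctness(pred_fn, ref_fn, pred_args: dict, ref_args: dict) -> float:
    """Correct-function indicator plus F1 averaged over reference parameters."""
    fn_score = 1.0 if pred_fn == ref_fn else 0.0
    if not ref_args:
        return fn_score
    param_f1 = sum(token_f1(str(pred_args.get(k, "")), str(v))
                   for k, v in ref_args.items()) / len(ref_args)
    return fn_score + param_f1
```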

🍞 Hook: Say it clearly and briefly.

🥬 Efficiency Regularization

  • What it is: A length-aware bonus that discourages rambling.
  • How it works: A piecewise Gaussian penalty encourages concise answers, with tolerance annealed over time.
  • Why it matters: Long, wandering reasoning can hide mistakes and waste tokens.

🍞 Anchor: Don’t write a paragraph when a one-line function call works.
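
Putting the three parts together under the weight schedules mentioned above (format 2 to 1, correctness 2 to 3): the linear interpolation, the Gaussian target and tolerance, and the fixed (rather than annealed) tolerance are assumptions made for a compact illustration.

```python
# Sketch of the three-part reward with time-varying weights (illustrative).
import math

def length_penalty(num_tokens: int, target: int = 128, tolerance: float = 64.0) -> float:
    """Piecewise Gaussian: full credit up to the target length, then decaying credit."""
    if num_tokens <= target:
        return 1.0
    return math.exp(-((num_tokens - target) ** 2) / (2 * tolerance ** 2))

def composite_reward(format_ok: bool, correctness: float, num_tokens: int,
                     progress: float) -> float:
    """`progress` in [0, 1] is the fraction of training completed."""
    w_format = 2.0 - 1.0 * progress        # anneal 2 -> 1
    w_correct = 2.0 + 1.0 * progress       # anneal 2 -> 3
    return (w_format * float(format_ok)
            + w_correct * correctness
            + length_penalty(num_tokens))

print(composite_reward(True, 0.8, 90, progress=0.0))  # early training: format-heavy
print(composite_reward(True, 0.8, 90, progress=1.0))  # late training: semantics-heavy
```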

Stage 2: Error Identification and Corrective Sample Construction

  • What happens: From Stage 1 rollouts, detect errors by format (binary) and correctness (thresholded). For each error, synthesize a diagnostic message: format errors use deterministic parser-style text; semantic errors query the Error Simulator.
  • Why this step exists: A penalty alone is too vague; the model needs a short, realistic hint to guide recovery.
  • Example with data: If the model sends cancel_flight(..., force=True), the simulator returns “ERROR: unexpected argument 'force'.” The corrective context becomes [dialogue; failed call; that error message].
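
The detection-and-diagnosis routing might look like the sketch below. `call_error_simulator` is a placeholder for the fine-tuned simulator, and the wording of the deterministic format message is assumed.

```python
# Sketch of Stage 2 routing: format failures get a deterministic, parser-style
# message; semantic failures are sent to the Error Simulator.

def diagnose(rollout, format_ok: bool, correctness: float,
             call_error_simulator, threshold: float = 0.5):
    if not format_ok:
        # Assumed wording; in practice the parser's own error text would be used.
        return "ERROR: malformed tool call; expected a single well-formed JSON object."
    if correctness < threshold:                 # thresholded correctness check
        return call_error_simulator(rollout)    # realistic, API-style message
    return None                                 # no error, nothing to correct
```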

🍞 Hook: A friendly referee explains the first foul.

🥬 Error Simulator

  • What it is: A fine-tuned model (e.g., Qwen3-32B) trained on ~2K curated examples to output realistic API-style errors without revealing the ground truth call.
  • How it works: It reads the tools, the dialogue, the model’s failed call, and the ground truth; it outputs a one- or two-sentence error starting with "ERROR:".
  • Why it matters: Precise, safe hints accelerate learning without leakage.

🍞 Anchor: “ERROR: parameter status expects OPEN; got CHECKED_IN.”

🍞 Hook: Fix what just broke, while it’s fresh.

🥬 LIFO Buffer

  • What it is: A last-in-first-out stack storing the newest corrective contexts.
  • How it works: Push each new error with its message; when the stack reaches a trigger size, pop the freshest ones for corrective updates.
  • Why it matters: Targets the model’s current weaknesses (on-policy) and reduces staleness.

🍞 Anchor: You fix today’s bug before tomorrow’s changes make it harder.

Stage 3: Corrective Batch Training (Fission)

  • What happens: For each corrective context popped from the buffer, sample G′ recovery rollouts conditioned on that context, score them, and run a GRPO-style update. Optionally, if new errors appear, repeat the loop.
  • Why this step exists: Multiplying attempts per error densifies learning signals and restores meaningful within-group differences, making GRPO effective for recovery.
  • Example with data: After “ERROR: mkdir: File exists,” the model tries alternate sequences: skip mkdir, check directory, move file directly. The best ones get reinforced.

🍞 Hook: One question; many thoughtful tries.

🥬 Fission Resampling

  • What it is: Turning one error into multiple guided retries (G′ parallel attempts).
  • How it works: Sample conditional on the hint; compute normalized advantages within each fission group; update the policy.
  • Why it matters: It creates diversity in outcomes, enabling strong gradient signals exactly where the model failed.

🍞 Anchor: From one wrong turn, the GPS offers several reroutes; you quickly learn the best path.

Secret Sauce

  • Error-to-Lesson: Every live failure is converted into a clear, actionable training example.
  • On-Policy Alignment: The corrective data always matches the model’s current error modes.
  • Signal Amplification: Fission multiplies training signals near the failure point.
  • Stability Controls: A trigger interval (N) throttles how often corrections run, balancing compute vs. freshness.

🍞 Hook: Don’t over-correct every second; check in regularly.

🥬 Correction Trigger Interval (N)

  • What it is: A schedule that limits how often corrective batches run.
  • How it works: At most one corrective update every N steps; smaller N = more frequent corrections; larger N = fewer.
  • Why it matters: Keeps training stable and compute-aware without letting errors pile up.

🍞 Anchor: Like weekly quizzes—often enough to catch mistakes, not so often it exhausts you.
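
In code, the gate is a simple check, sketched below with placeholder defaults rather than the paper's tuned values.

```python
# Run at most one corrective batch every N steps, and only when the buffer
# holds enough fresh errors.

def should_run_correction(step: int, last_correction_step: int,
                          buffer_size: int, N: int = 10,
                          trigger_size: int = 16) -> bool:
    return (step - last_correction_step) >= N and buffer_size >= trigger_size
```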

04 Experiments & Results

The Test

  • What they measured: Overall task accuracy and, more importantly, Error Recovery Rate (how often the model succeeds after an error) on BFCL v4 Multi-Turn—a benchmark that purposely throws errors at agents and lets them retry up to 20 times.

🍞 Hook: It’s like a driving test on a road with surprise detours; can you reroute and still reach the destination?

🥬 BFCL v4 Multi-Turn

  • What it is: A tough benchmark for tool-using AIs that tests state tracking, long dialogs, and realistic error feedback with retries.
  • How it works: If a call fails, the environment returns an error trace; the agent can try again within limits.
  • Why it matters: It directly measures real-world robustness, not just one-shot correctness.

🍞 Anchor: Moving files across folders with partial successes and failures; can the agent diagnose and proceed?

The Competition

  • Baselines: GRPO, DAPO, Dr.GRPO (strong RL methods); plus specialized 8B tool agents ToolACE-2-8B and BitAgent-8B.
  • Models: Qwen3 at 1.7B, 4B, and 8B scales.

The Scoreboard (with context)

  • Qwen3-8B: FISSION-GRPO lifts overall accuracy from 42.75% to 46.75% and raises Error Recovery Rate by 5.7 percentage points. That’s like moving from a solid B to a clear B+ while others stay around B.
  • Qwen3-4B: 40.87% overall, beating GRPO, DAPO, and Dr.GRPO.
  • Qwen3-1.7B: 20.38% overall, over 160% relative gain from Base (7.80%).
  • Against specialized agents (8B): FISSION-GRPO (46.75%) outperforms ToolACE-2-8B (37.00%) and BitAgent-8B (37.75%) by large margins (≈+9–10 points).
  • Subsets: Strong gains in Base and Missing Parameter categories; also notable improvements in Long Context.

Surprising/Notable Findings

🍞 Hook: Even a simple hint can change your next try.

🥬 Static vs Dynamic Feedback (Ablation)

  • What it is: Compare using a generic error message (Static) vs. the learned, context-aware Error Simulator (Dynamic).
  • How it works: Static uses the same bland hint for all errors; Dynamic uses specific, realistic messages.
  • Why it matters: Shows whether the simulator’s precision is necessary beyond the fission structure.
  • Finding: Fission-Static beats GRPO, proving the structure (resampling around failures) is valuable on its own. Fission-Dynamic adds significant extra gains, especially for Missing Parameter and Long Context—targeted guidance is key.

🍞 Anchor: Results show Static helps some, but Dynamic adds several extra points—specific hints matter.

🍞 Hook: How often should you check your homework?

🥬 Trigger Frequency (N)

  • What it is: How many steps between corrective updates.
  • What they found: Frequent to moderate corrections keep performance high; making them too rare harms scores, especially for parameters and long contexts.

🍞 Anchor: Weekly check-ins beat once-a-month cramming.

Case Study: File Operations with Partial Failures

  • Base model: collapses into loops after losing track of state.
  • GRPO: recognizes some issues but hallucinates parameters (e.g., adds path= to ls which doesn’t exist).
  • FISSION-GRPO: actively diagnoses by calling find before retrying grep; updates directory correctly and succeeds.

Why these results matter

  • The main boost comes from better Error Recovery Rate, not at the expense of one-shot success (which also nudges up). That confirms the method truly teaches recovery rather than just punishing mistakes.

05 Discussion & Limitations

Limitations

🍞 Hook: Practicing on one field may not cover every stadium.

🥬 Evaluation Scope

  • What it is: Experiments focus on BFCL v4 Multi-Turn because it supplies error feedback and retries.
  • Why it matters: Some domains don’t provide structured error traces or allow retries; results may not fully transfer.

🍞 Anchor: Web navigation or certain APIs may give vague errors, reducing the simulator’s usefulness.

🍞 Hook: More drills mean more time.

🥬 Computational Overhead

  • What it is: Fission resampling adds multiple recovery attempts per error.
  • Why it matters: Training cost rises; you must balance quality and speed using the trigger interval N and batch sizes.

🍞 Anchor: Like extra batting practice—great for skill, heavy on time.

Required Resources

  • A reasonably capable base model (e.g., Qwen3 family) and GPUs for RL training.
  • An Error Simulator model fine-tuned on curated error logs (~2K high-quality pairs in the paper).
  • A framework like Verl for stable GRPO training and long context windows.

When Not to Use

  • Single-turn tasks with near-zero execution errors (overhead may not pay off).
  • Environments without meaningful, parseable error feedback (the simulator has little to imitate).
  • Ultra-low-latency on-device training scenarios where extra sampling is infeasible.

Open Questions

  • Can we auto-generate simulator data at scale without teacher models while keeping realism?
  • How well does the approach transfer to code debugging, math proofs, or web navigation with partial failures?
  • Can we adaptively tune G′ and N based on live variance and recovery rates to minimize compute?
  • How to detect and avoid teaching to spurious errors (noisy traces) while preserving generalization?

06 Conclusion & Future Work

Three-Sentence Summary

  • This paper turns each real tool-call failure into a guided, on-policy mini-lesson using an Error Simulator and a fission mechanism that spawns multiple recovery attempts per error.
  • By densifying feedback and aligning training to the model’s current mistakes, it teaches genuine error recovery instead of repetitive retries.
  • On BFCL v4 Multi-Turn, it lifts Qwen3-8B’s error recovery rate by 5.7 percentage points and overall accuracy from 42.75% to 46.75%, surpassing strong baselines and specialized agents.

Main Achievement

  • Recasting errors from “things to avoid” into “sources of targeted supervision” inside the RL loop, making recovery a first-class skill for smaller, faster tool-using models.

Future Directions

  • Apply fission-style corrective training to code debugging, mathematical reasoning, and web navigation; automate and scale simulator training; adaptively schedule corrections with compute-aware controllers; and explore safety-aware simulators that detect risky retries.

Why Remember This

  • FISSION-GRPO’s core idea is simple but powerful: learn from the exact mistakes you make right now, with specific, realistic hints, and practice several fixes immediately. That reframing reliably converts fragile looping agents into resilient problem solvers across long, messy, real-world tasks.

Practical Applications

  • Customer support bots that gracefully recover from API errors (e.g., billing or ticketing) without looping.
  • Travel assistants that handle flight-state conflicts by adjusting steps instead of repeating invalid cancellations.
  • DevOps copilots that debug command failures (permissions, paths) by diagnosing and retrying the right fix.
  • Smart home controllers that interpret device errors (busy, offline) and choose appropriate fallback actions.
  • RPA/enterprise workflow bots that adapt to form or schema changes by reading validation errors and correcting.
  • Code agents that convert compiler/runtime errors into targeted retries (fix imports, adjust types) during execution.
  • Data pipeline assistants that recover from schema drift and missing fields using simulator-like validation hints.
  • Web automation agents that handle 4xx/5xx responses by modifying parameters, timing, or authentication flows.
  • Educational AI tutors that learn to correct solution steps after hint-like feedback rather than restarting.
  • Healthcare scheduling bots that resolve state conflicts (double bookings) with safe, policy-compliant alternatives.
#FISSION-GRPO · #error recovery · #tool use · #GRPO · #reinforcement learning · #on-policy training · #error simulator · #multi-turn interaction · #BFCL benchmark · #robust agents · #distribution mismatch · #corrective supervision · #resampling · #Qwen3 · #RLHF