
Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Intermediate
Rakshith Vasudev, Melisa Russak, Dan Bikel et al. Ā· 2/3/2026
arXiv Ā· PDF

Key Summary

  • The paper shows that even if a model is great at predicting when an AI agent will fail, jumping in to ā€œfixā€ the agent mid-task can still make things worse.
  • Interventions have two effects: they sometimes rescue a failing attempt (recovery), but they also sometimes break an attempt that was going to succeed (disruption).
  • Whether to intervene depends on a simple rule: only help if the agent’s failure rate is higher than a threshold set by its recovery and disruption tendencies.
  • Across tasks where agents usually succeed (like HotPotQA), interventions often hurt, including one model dropping by 26 percentage points.
  • In tasks where agents usually fail (like ALFWorld), interventions can give small but real gains (about +2.8 percentage points).
  • A small pilot test on 50 tasks can predict ahead of time whether interventions will help or harm, avoiding risky deployments.
  • Bigger critic models or perfect failure prediction don’t fix the core issue because the real bottleneck is how the agent handles being corrected mid-trajectory.
  • Simple rules like ā€œdon’t intervene in the first couple of stepsā€ can prevent many harms caused by early, destabilizing interruptions.
  • When disruption outweighs recovery, it’s usually better to run multiple full attempts and pick the best (post-hoc selection) than to interrupt mid-run.
  • The main contribution is a practical framework and decision rule to decide when not to intervene, preventing large regressions before deployment.

Why This Research Matters

This work helps teams avoid deploying ā€œhelpersā€ that accidentally hurt their AI agents. It gives a simple, low-cost way to predict help-or-harm with a small pilot test, so companies don’t need to risk large, expensive rollouts. It shows that better failure prediction alone won’t save the day—what matters is how the agent handles being corrected, especially early on. It suggests safer alternatives like post-hoc selection when disruptions outweigh recoveries. By focusing on practical decisions, it can improve user trust, reduce wasted compute, and prevent sudden performance crashes in production. In short, it turns a fuzzy intuition into a clear rule that engineers can actually use.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how when you’re doing a puzzle and a friend suddenly grabs the piece from your hand to ā€œhelp,ā€ it can throw you off—even if your friend is good at puzzles?

🄬 The Concept (Execution-time Intervention): Many AI systems solve problems in several steps. A ā€œhelperā€ model watches and interrupts when it thinks the main model is about to mess up. It’s like a coach stepping in mid-play.

  • How it works (recipe):
    1. The agent thinks and acts step by step.
    2. A watcher (the critic) scores each step for likely failure.
    3. If the score looks scary, it interrupts—either undoing the last step or adding a warning.
  • Why it matters: If we interrupt at the wrong time, we may break a path that would have worked fine.

šŸž Anchor: Imagine you’re writing a math answer that’s actually correct, but your teacher stops you mid-sentence and tells you to start over. You might lose your train of thought and get it wrong.

The world before: People believed that if you can accurately predict when an AI agent will fail, then stepping in to fix it should make results better. Many projects built ā€œLLM criticsā€ that judge whether the agent’s current step is heading toward a win or a fail. These critics can be measured by scores like AUROC to see how well they separate good steps from bad ones.

šŸž Hook: Imagine a lifeguard who is excellent at spotting swimmers in trouble. You’d think more saves would follow, right?

🄬 The Concept (LLM Critic): An LLM critic is a smaller model that watches the big agent and predicts if the agent’s current course might fail.

  • How it works:
    1. Read the agent’s recent steps and context.
    2. Output a probability that this path will end badly.
    3. If the risk is high, trigger an intervention.
  • Why it matters: If the critic is wrong or fires at the wrong time, it can cause unnecessary interruptions.

šŸž Anchor: A hall monitor who’s great at noticing when kids might run in the hallway—but shouting ā€œStop!ā€ too soon can make someone drop their books.

The problem: Even with a critic that scores very high offline (for example, AUROC ā‰ˆ 0.94), the team found that interventions sometimes made the overall system worse. On one model, performance fell by 26 percentage points after adding the critic-driven interruptions. On another model, the same setup caused almost no change. So critic accuracy alone didn’t tell us if stepping in would help or harm.

šŸž Hook: Think of a soccer game. Calling a timeout can fix a strategy—or ruin a great attack.

🄬 The Concept (Failure Rate): Failure rate is how often the agent doesn’t finish a task correctly if we don’t intervene.

  • How it works:
    1. Run the agent on many tasks without help.
    2. Count how many times it fails.
    3. Divide by total tasks.
  • Why it matters: If failure is rare, most interruptions risk breaking good runs. If failure is common, interruptions may have more chances to help.

šŸž Anchor: If you usually get 9 out of 10 spelling words right, someone interrupting you each time might lower your score.

Failed attempts: Teams tried to improve the critic (bigger models, better calibration, thresholds). But bigger critics didn’t help in their data setting; even perfect prediction had a low ceiling because mid-trajectory corrections themselves can knock the agent off balance.

šŸž Hook: You know how a thermometer tells temperature well, but it can’t make the room warmer?

🄬 The Concept (AUROC): AUROC measures how well the critic separates likely failures from likely successes.

  • How it works:
    1. Score many examples with the critic.
    2. See how well high scores match true failures and low scores match successes.
    3. Summarize the separation in a single number (closer to 1 is better).
  • Why it matters: A great AUROC means the critic can detect risk, but it doesn’t guarantee the agent can handle being interrupted.

šŸž Anchor: Even if your smoke detector is excellent at sensing smoke, spraying a fire extinguisher into the stove every time you smell toast will ruin dinner.

The gap: What was missing was a simple, reliable way to decide when to deploy interventions. The authors found a tradeoff: interruptions can save failing paths (recovery) but can also break successful ones (disruption). The trick is to check which side is bigger for your agent on your tasks—before you deploy.

Real stakes: In practice, bad interventions waste compute, return wrong answers, and frustrate users. The paper’s framework helps teams run a tiny pilot (about 50 tasks) to estimate whether interventions are likely to help or harm, letting them avoid big failures in production.

02 Core Idea

šŸž Hook: Imagine you’re helping a friend ride a bike. If you grab the handlebars at the right moment, you prevent a fall. If you grab them during a smooth turn, you might cause the crash.

🄬 The Concept (Disruption–Recovery Tradeoff): Intervening mid-task has two opposite effects: it can recover failing runs, but it can also disrupt successful runs.

  • How it works:
    1. Compare baseline runs (no help) to intervention runs (with help) on the same tasks.
    2. Count recoveries: tasks that failed before but succeed with help.
    3. Count disruptions: tasks that succeeded before but fail with help.
    4. Net benefit = more recoveries than disruptions, adjusted by how often failures happen.
  • Why it matters: If disruptions happen more than recoveries, intervening makes things worse even with a very accurate critic.

šŸž Anchor: If your friend usually rides fine, grabbing the bike often will cause more wobbles than saves.

The ā€œAha!ā€ in one sentence: Don’t deploy interventions just because your critic predicts failure well; deploy only when your agent’s failure rate is high enough that recoveries will outnumber disruptions.

Three analogies:

  • Traffic cop: Stopping cars at a busy, accident-prone intersection helps. Stopping cars on a quiet, safe street just creates jams.
  • Cooking helper: Tasting and correcting a stew that’s going wrong helps. Tossing extra salt into a stew that already tastes great ruins dinner.
  • Test-taking coach: Whispering hints when a student is stuck helps. Whispering during easy questions distracts and lowers their score.

Before vs. after:

  • Before: ā€œBetter failure prediction = better outcomes.ā€
  • After: ā€œBetter outcomes depend on the agent’s disruption–recovery balance and how often it fails, not just on prediction accuracy.ā€

šŸž Hook: Think of a balance scale. One side is ā€œsaved mistakes,ā€ the other is ā€œbroken successes.ā€ Which is heavier?

🄬 The Concept (Recovery Rate): Recovery rate is the fraction of baseline failures that turn into successes thanks to intervention.

  • How it works:
    1. Look only at tasks the agent failed without help.
    2. After intervening, count how many now succeed.
    3. Divide by the number of baseline failures.
  • Why it matters: High recovery rate means your help actually turns losses into wins.

šŸž Anchor: If you usually miss shots in basketball from far away, a coach’s tip that reliably fixes your aim boosts your recovery rate.

šŸž Hook: Now imagine the other side of the scale.

🄬 The Concept (Disruption Rate): Disruption rate is the fraction of baseline successes that turn into failures because of intervention.

  • How it works:
    1. Look only at tasks the agent already solved without help.
    2. After intervening, count how many now fail.
    3. Divide by the number of baseline successes.
  • Why it matters: A high disruption rate means your ā€œhelpā€ is actually harming good runs.

šŸž Anchor: If you normally ace spelling tests, but someone keeps interrupting you during easy words, your score may drop—that’s disruption.

Why it works (intuition): The agent’s thinking is like a train of thought. Mid-trajectory corrections can either nudge it back on track or knock it off the rails. Some agents absorb corrections well; others become unstable and spiral into worse choices, especially if interrupted too early.

Building blocks of the idea:

  • Measure the baseline failure rate (how often the agent fails with no help).
  • Measure recovery and disruption using paired task runs.
  • Use a simple decision rule: only intervene if the failure rate is above a threshold set by the disruption and recovery rates.

šŸž Hook: Picture a gate that only opens when enough reasons stack up.

🄬 The Concept (Decision Threshold): Intervene only if the agent’s failure rate is bigger than a certain threshold determined by recovery and disruption.

  • How it works:
    1. Estimate recovery rate (how often help turns fails into wins).
    2. Estimate disruption rate (how often help turns wins into fails).
    3. Compute a threshold from these two numbers; help only if the observed failure rate is higher than that threshold.
  • Why it matters: This rule prevents you from ā€œhelpingā€ in situations where help is more likely to hurt.

šŸž Anchor: If you only step in to help your friend ride a bike when they’re truly wobbling a lot, you avoid grabbing the handlebars during smooth turns.

Key twist: The paper shows that even a very accurate critic can’t guarantee good outcomes, because the main limiter is how the agent behaves after being corrected. In some agents, corrections trigger confusion cascades; in others, they gently steer the agent back. That agent-dependent response is what decides success.

03 Methodology

High-level flow: Input task → Baseline run and Intervention run (paired) → Count recoveries and disruptions → Decide whether to deploy.

Step-by-step:

  1. Collect tasks and agents. The authors test across different benchmarks (HotPotQA, GAIA, ALFWorld) and agent backbones (like Qwen-3-8B, GLM-4.7, MiniMax-M2.1).
  2. Train an LLM critic on trajectories labeled by final success or failure, and calibrate its probabilities.
  3. Choose an intervention mechanism (either undo the last step or append a warning message).
  4. For each task, run both a baseline (no help) and an intervention (with help) episode from identical starts.
  5. Count recoveries (fail→success) and disruptions (success→fail) and estimate rates.
  6. Use a small pilot (about 50 tasks) to check if interventions are likely to help before full deployment.

šŸž Hook: Imagine a referee who can blow the whistle in two ways: rewind the last move or just warn the player.

🄬 The Concept (ROLLBACK): When the critic predicts likely failure, undo the agent’s last action and let it try again.

  • How it works:
    1. Watch each step.
    2. If risk seems high, revert the last step and restore the environment.
    3. Allow the agent to take a different action.
  • Why it matters: Rolling back can remove a bad move—but it can also erase good momentum and confuse the agent.

šŸž Anchor: Like hitting ā€œundoā€ in a document. If you undo the wrong sentence, you might forget your point and write something worse.

šŸž Hook: Or maybe the ref just shouts, ā€œCareful!ā€ and lets the play continue.

🄬 The Concept (APPEND): Instead of undoing, add a warning message telling the agent the critic forecasts trouble.

  • How it works:
    1. Let the action go through.
    2. Insert a warning like ā€œThis may lead to failure—reconsider.ā€
    3. The agent continues, hopefully adjusting course.
  • Why it matters: Warnings can guide without erasing progress—but they can also distract or overcorrect.

šŸž Anchor: Think of a sticky note on your homework: ā€œDouble-check your units!ā€ It might help—or it might make you second-guess everything.

šŸž Hook: Thermometers can read temperature well, but sometimes they read a bit too confidently.

🄬 The Concept (Calibration): Calibration adjusts the critic’s confidence so that ā€œ70% riskā€ really means ā€œabout 70% of these turn out badly.ā€

  • How it works:
    1. Measure how often predicted risks match actual outcomes.
    2. Learn a temperature that softens or sharpens probabilities.
    3. Use the adjusted probabilities to trigger or skip interventions.
  • Why it matters: If the critic is overconfident, it may trigger too often, causing needless disruption—especially early in a task.

šŸž Anchor: Like re-marking a ruler so that each inch is truly an inch, not 1.2 inches.

šŸž Hook: Before building a bridge, engineers test a small model to see if the design holds.

🄬 The Concept (Pilot Study): A small test (around 50 tasks) run before deployment to estimate failure, recovery, and disruption rates.

  • How it works:
    1. Pick your agent, critic, and intervention mechanism.
    2. Run a small, paired test: baseline vs. intervention.
    3. Compute failure rate, recovery rate, and disruption rate, and compare failure rate to the decision threshold.
  • Why it matters: It lets you predict help-or-harm without risking full-scale failure in production.

šŸž Anchor: Like practicing a dance routine with a small audience to see if your cues help the performers—or make them stumble.

Detailed examples with data:

  • In high-success HotPotQA, baseline success was roughly 57–70%. Interventions rarely helped and often hurt: one model dropped by 25–30 percentage points. This means disruptions outweighed recoveries.
  • In low-success ALFWorld, a small pilot estimated high failure (~89%), modest recovery (~12%), and sizable disruption (~56%). The threshold test predicted net help—and indeed, the full run showed a small but significant gain (~+2.8 pp).
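Plugging those ALFWorld pilot numbers into the threshold rule sketched earlier (again using f > d/(r+d) as one plausible formalization, not the paper’s verbatim formula):

```python
f, r, d = 0.89, 0.12, 0.56                 # pilot estimates quoted above for ALFWorld

threshold = d / (r + d)                    # ~0.82, and f = 0.89 clears it
expected_gain = r * f - d * (1 - f)        # ~+0.045, i.e. a few percentage points
print(f"threshold={threshold:.2f}, intervene={f > threshold}, expected gain ~ {expected_gain:+.1%}")
```

Under this reading, the pilot forecasts a small positive effect, in the same ballpark as the observed +2.8 percentage points from the full run.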

What breaks without each step:

  • Without paired runs, you can’t tell if an intervention ā€œsavedā€ or ā€œbrokeā€ a task.
  • Without calibration, overconfident critics may trigger early and often, causing big harms from early-step disruptions.
  • Without a pilot, you might deploy into a regime where disruptions dominate, risking large regressions.

The secret sauce:

  • A simple decision rule: only intervene when the agent’s failure rate is high enough relative to the observed recovery and disruption tendencies.
  • A practical deployment recipe: run a small, cheap pilot to forecast the real effect, then choose between mid-task intervention or safer alternatives like post-hoc selection.

Bonus concept (safe alternative): šŸž Hook: If you can’t coach a player mid-play without tripping them, maybe let them finish two plays and then pick the better one.

🄬 The Concept (Post-hoc Selection): Run multiple full attempts and pick the best result afterward instead of interrupting mid-run.

  • How it works:
    1. Produce two or more complete trajectories.
    2. Rank them (using a signal or even an oracle in analysis).
    3. Output the best one.
  • Why it matters: It avoids mid-execution disruption and often has a higher improvement ceiling.

šŸž Anchor: It’s like taking two photos and choosing the sharper one instead of nudging the photographer while they’re pressing the shutter.

04 Experiments & Results

The test: The authors measured whether adding interventions improved success rates on three kinds of tasks: a mostly-successful QA set (HotPotQA), a medium-success assistant set (GAIA), and a mostly-failing simulated household-task set (ALFWorld). They also tracked recovery and disruption to understand why scores went up or down.

The competition: They compared ā€œagent onlyā€ versus ā€œagent + critic-driven intervention,ā€ trying two simple mechanisms (ROLLBACK and APPEND), with and without calibration.

Scoreboard with context:

  • HotPotQA (high-success): Interventions did not help any model. For Qwen-3-8B and GLM-4.7, effects were neutral to mildly negative (a few percentage points). For MiniMax-M2.1, results collapsed by about 25–30 percentage points—like going from an A- to a failing grade just by adding the helper.
  • GAIA (medium-success): The same pattern held—no wins from intervention. Decreases ranged from a few points up to over 30 points, again with MiniMax most sensitive.
  • ALFWorld (low-success): With failure already very common, the pilot predicted small gains, and that’s what happened. The best setting improved by about +2.8 percentage points (p=0.014), a real but modest bump.

Surprising findings:

  • Big critic ≠ better outcomes. Scaling the critic from 0.6B to 14B didn’t improve real-world results in this data regime; sometimes it got worse due to overfitting and limited data diversity.
  • Perfect prediction still has a low upside. An oracle that only intervenes on runs that would fail gave just 3–8 percentage point boosts—showing an intrinsic ā€œdisruption taxā€ from changing course mid-run.
  • Early-step harm dominates. Most damage came from interventions at steps 0–1, which derailed runs that were already on track to succeed; simple rules like ā€œdon’t intervene before step 2ā€ can reclaim a few points.
  • Calibration helps or hurts depending on the regime. In high-success settings, calibration can reduce unnecessary triggers. In low-success ALFWorld, over-suppressing early triggers reduced helpful recoveries, so uncalibrated sometimes did better.

Concrete example interpretations:

  • HotPotQA: Imagine students who usually do fine. Yelling ā€œCareful!ā€ during their first two answers leads to second-guessing, detours, and missed points. That’s why disruption beats recovery.
  • ALFWorld: Imagine a tough obstacle course most students fail. Timely nudges can push a few more passes, even if the help isn’t perfect, because there are many failing attempts to save.

Key numbers in plain words:

  • One model lost about 26 points after intervention—like dropping from a 64% D to a 38% F just because of the helper.
  • In the hardest benchmark, small but statistically reliable gains appeared—like nudging a 6% to around 9%—still small, but real progress.

Take-home pattern: The disruption–recovery balance, not the critic’s offline accuracy, predicts whether intervention will help. Use a small pilot to measure this balance and decide ahead of time.

05 Discussion & Limitations

Limitations:

  • Agent dependence: Some agents absorb corrections calmly; others spin out after a single nudge. Results can vary a lot across models and tasks.
  • Mechanism simplicity: The paper studies two basic interventions (undo or warn). Smarter, plan-aware methods might reduce disruption—but they still face the same tradeoff and must be tested with a pilot.
  • Data regime: The critic was trained on a limited, diverse-but-not-huge dataset. Larger or frontier critics might behave differently if trained with much more data.
  • Positive-effect size: Even in the best (high-failure) setting, gains were modest; this reveals a natural ceiling for mid-execution control.

Required resources:

  • Paired runs (baseline vs. intervention) for a small pilot (ā‰ˆ50–100 tasks) to estimate failure, recovery, and disruption.
  • A critic model (can be small), basic calibration, and simple intervention hooks (rollback or warning).
  • Tooling to match seeds and starting states so comparisons are fair.

When not to use:

  • High-success regimes: If your agent usually succeeds, interventions likely harm more than help, especially if they fire early.
  • Early-step triggers: Avoid intervening in the first couple of steps unless you’ve proven it helps.
  • When d/r > 1: If disruptions clearly outnumber recoveries, mid-execution control is a bad bet; use post-hoc selection instead.

Open questions:

  • Can planning-aware or step-specific fixes push disruption down without lowering recovery?
  • Can agents be trained to be ā€œinterruption-robust,ā€ learning to gracefully absorb mid-course corrections?
  • What calibration or timing policies best avoid early-step harms across many domains?
  • How well does the pilot rule transfer across very different tasks (e.g., coding vs. QA vs. robotics)?

06 Conclusion & Future Work

Three-sentence summary: Accurate failure prediction does not guarantee that stepping in to correct an AI mid-task will help. Interventions bring a tradeoff: they can recover some failures but also disrupt some successes, and whether they help depends on the agent’s failure rate versus a simple threshold set by recovery and disruption. A small pilot (ā‰ˆ50 tasks) reliably forecasts help-or-harm so teams can avoid costly regressions.

Main achievement: A practical, easy-to-apply framework and decision rule—estimate failure, recovery, and disruption; intervene only when the failure rate clears the threshold—plus evidence across models and benchmarks that this rule predicts real outcomes.

Future directions: Design gentler, plan-aware interventions; train agents to be interruption-robust; learn timing policies that avoid early-step harm; and explore hybrid systems that combine light-touch mid-run nudges with strong post-hoc selection. Also, expand pilots to new domains and larger datasets to refine generalization.

Why remember this: Because ā€œgreat at predicting failureā€ doesn’t automatically mean ā€œgreat at preventing it.ā€ The real limiter is how the agent handles being corrected. With a simple pilot and a simple rule, you can know when not to intervene—and save your system from avoidable, sometimes catastrophic drops.

Practical Applications

  • Run a 50–100 task pilot to estimate failure, recovery, and disruption before enabling any intervention.
  • Adopt a minimum-step rule (e.g., do not intervene before step 2) to avoid early-step destabilization.
  • If disruption outweighs recovery (d/r > 1), switch to post-hoc selection (e.g., Best-of-N) instead of mid-run control.
  • Use temperature scaling (calibration) to reduce over-triggering, especially in high-success regimes.
  • Tune the intervention threshold to limit frequency, but only after checking that the disruption–recovery balance is favorable.
  • Prefer lightweight critics with good calibration over larger critics when data is limited; invest in data diversity first.
  • Instrument your system to log paired baseline vs. intervention outcomes so you can count recoveries and disruptions accurately.
  • Set an intervention budget (max number of interruptions per task) to prevent cascades from exhausting steps.
  • In low-success domains, prioritize rollback on clearly bad steps; in high-success domains, disable mid-run intervention by default.
  • Re-evaluate the pilot whenever you change the agent, task domain, or tools, since the tradeoff is model- and domain-dependent.
#LLM critic #execution-time intervention #disruption–recovery tradeoff #recovery rate #disruption rate #failure rate #calibration #rollback #append #pilot study #post-hoc selection #oracle ceiling #early-step intervention #intervention cascade #AUROC