DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
Key Summary
- LLM multi-agent systems often fail quietly (no crash) and leave long, twisty logs that are hard to debug by hand.
- Past methods asked an LLM to guess which agent and step caused the failure from the log, but those guesses weren’t verified and ground-truth labels were often uncertain.
- DoVer (Do-then-Verify) turns debugging into a science experiment: make a precise change (intervention) at a suspected failure step and re-run to see if the task now works.
- DoVer first segments a long run into smaller trials around plan updates, then proposes a failure hypothesis for each trial, crafts a minimal fix, and replays from that point.
- Success is measured by flipping failures to passes and by milestone progress toward the goal, not just by matching a blamed line in the log.
- On GAIA and AssistantBench with the Magentic-One framework, DoVer turns 18–28% of failures into successes and confirms or refutes 30–60% of failure hypotheses.
- On GSMPlus math tasks with a different framework (AutoGen2), DoVer recovers 49% of failures, showing it generalizes.
- Smaller open-source models (Qwen3-8B/32B) also work with DoVer, and a few-shot prompt helps the small model close the gap.
- DoVer reveals when failures come from missing sub-agent tools (like ‘scroll-to-bottom’), guiding targeted upgrades beyond text fixes.
- This outcome-focused approach makes agent systems more reliable and reduces dependence on ambiguous human annotations.
Why This Research Matters
Real AI helpers need to be reliable, not just impressive in demos. DoVer gives builders a practical way to fix failures that don’t crash the system but still frustrate users. By testing small, targeted changes and measuring real outcomes, teams can quickly recover many failed attempts and learn which parts need new tools or better instructions. This saves time and cost, reduces user dissatisfaction, and makes complex assistants—like web researchers or math tutors—more dependable. Because DoVer works across different agent frameworks and even with smaller open models, it is accessible to many teams. Over time, this approach can help create self-improving agents that debug themselves and grow more capable with each fix.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: Imagine a group of robots trying to bake a cake together. One gathers eggs, another mixes batter, another handles the oven. The cake comes out wrong—but none of the robots crashed or screamed. Which part went wrong and when? That’s the puzzle.
🥬 The Concept (Multi-agent systems): Multi-agent systems are teams of AI helpers with different jobs that talk to each other to finish a task. How it works: (1) a leader makes a plan; (2) helpers act (browse, code, read files); (3) the leader updates the plan; (4) repeat. Why it matters: Without clear teamwork, the group can wander or repeat mistakes, and you can’t tell who slipped. 🍞 Anchor: A web ‘Orchestrator’ asks a ‘WebSurfer’ to find a NASA page, then a ‘FileSurfer’ to open a PDF. They keep chatting, but the final answer is wrong.
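For readers who like to see the loop in code, here is a minimal sketch of that plan-act-replan cycle. The Orchestrator and SubAgent classes and the call_llm helper are hypothetical stand-ins for illustration, not the Magentic-One API.

```python
# Minimal plan-act-replan loop (hypothetical classes; not the Magentic-One API).

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned string so the sketch runs."""
    return f"(llm reply to: {prompt[:48]}...)"

class SubAgent:
    """A helper with one job, such as browsing the web or reading files."""
    def __init__(self, name: str):
        self.name = name

    def act(self, instruction: str) -> str:
        # A real sub-agent would click, scroll, run code, or open files here.
        return call_llm(f"[{self.name}] carry out: {instruction}")

class Orchestrator:
    """The leader: plans, delegates, reflects, and re-plans."""
    def __init__(self, agents: list[SubAgent]):
        self.agents = agents
        self.log: list[dict] = []  # this session log is what DoVer later debugs

    def run(self, task: str, rounds: int = 3) -> str:
        plan = call_llm(f"Make a plan for: {task}")               # (1) leader plans
        for _ in range(rounds):
            for agent in self.agents:                             # (2) helpers act
                instruction = call_llm(f"Plan: {plan}. Instruct {agent.name}.")
                self.log.append({"agent": agent.name,
                                 "instruction": instruction,
                                 "result": agent.act(instruction)})
            plan = call_llm(f"Reflect on the log and update the plan: {plan}")  # (3) re-plan
        return call_llm(f"Write the final answer for: {task}")    # may be wrong, with no crash

team = Orchestrator([SubAgent("WebSurfer"), SubAgent("FileSurfer")])
print(team.run("Find the first-week August 2015 APOD entry with city lights"))
```

The log the orchestrator accumulates here is exactly the kind of long, twisty record that DoVer later has to debug.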
🍞 Hook: You know how a book report can be wrong even if your computer didn’t crash? Errors don’t have to be loud.
🥬 The Concept (Failure without crashing): In these AI teams, many failures are “soft”—the system runs but gives the wrong answer. How it works: The agents complete steps, but a misunderstanding or wrong click leads to a bad path. Why it matters: Soft failures hide inside long logs, making them tough to spot. 🍞 Anchor: The agent browses many pages yet never reaches the correct APOD entry, so the final response is off.
🍞 Hook: Think of a detective reading only diary entries to guess who messed up the cake. No experiments, just guessing.
🥬 The Concept (Log-based failure attribution): This older method reads the conversation log and picks the single agent and step to blame. How it works: Feed the whole log to an LLM; it names who and when the decisive mistake happened. Why it matters: If the guess is wrong—or if many different fixes would work—you chase the wrong suspect. 🍞 Anchor: The LLM says “Step 32 by WebSurfer,” but when you try to fix that line in real life, nothing changes.
🍞 Hook: Imagine a soccer team trying three different plays in one match; each play could fail for a different reason.
🥬 The Concept (Trials inside one session): A session often contains multiple trials—each starts with a plan (or re-plan) and then executes. How it works: After reflecting, the leader revises the plan and starts a fresh attempt. Why it matters: Blaming a single step across the whole session ignores that each trial may have its own failure point. 🍞 Anchor: Trial 1 scrolls archives; Trial 2 tries a calendar; Trial 3 uses URL patterns—each can break differently.
🍞 Hook: Have you ever heard two friends miscommunicate? One says “Click the big blue button,” the other clicks a different blue thing.
🥬 The Concept (Inter-agent misalignment): Sometimes the leader’s instruction is vague or impossible, and the helper does something unrelated. How it works: The orchestrator sends a plan or instruction; the sub-agent interprets it with limited tools; mismatches arise. Why it matters: It’s unfair (and unhelpful) to blame only one party. 🍞 Anchor: Orchestrator asks to click a year in a calendar widget that doesn’t support year jumps; WebSurfer clicks a random nearby control.
🍞 Hook: Think of three teachers grading the same essay but disagreeing on the first mistake.
🥬 The Concept (Uncertain ground-truth labels): Even humans often disagree on which step is the first decisive error. How it works: Annotators review the same log and debate; many cases stay ambiguous. Why it matters: If we train or evaluate only on these labels, our “accuracy” can be unfair or misleading. 🍞 Anchor: In GAIA cases, many ground-truth labels were themselves uncertain.
The world before: People asked models to name the bad step from the log. It felt fast, but guesses weren’t tested. When teams tried to reproduce the “blamed step,” the system often still failed. Worse, sessions had multiple trials, and agents could misalign, so pinning one step or one agent was murky.
The problem: Debugging by reading logs alone creates untested theories. You can’t be sure a guessed fix would actually fix the run.
Failed attempts: End-of-trace self-refinement (like writing a better final answer) often didn’t help because the mistake happened much earlier. Global advice felt too vague to correct a specific misstep mid-trajectory.
The gap: We need outcome-checked debugging that (a) respects multiple trials within a session, (b) edits the specific place where things went wrong, and (c) re-runs to see whether the fix really works.
Real stakes: This matters for real assistants that browse the web, read files, or do math. If they waste time, mis-click, or misread, users get wrong answers. Fast, scalable debugging that actually improves outcomes means more reliable research tools, coding helpers, and classroom tutors.
02Core Idea
🍞 Hook: You know how in science class you don’t just guess—you run an experiment to check your idea.
🥬 The Concept (DoVer = Do-then-Verify): DoVer turns each “who messed up and where” guess into a small, targeted fix (intervention) and re-runs from that point to see if the task now succeeds. How it works: (1) split the big run into trials around plan updates; (2) for each trial, guess the failure step; (3) generate a minimal fix (edit a plan or instruction); (4) replay from that step and measure success or milestone progress. Why it matters: Without verification, you collect unproven hunches; with DoVer, you only keep fixes that work in action. 🍞 Anchor: Change “scroll randomly” to “open the APOD calendar and select Aug 1–7, 2015,” resume, and watch the agent reach the right entry.
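A minimal sketch of this do-then-verify loop, assuming stubbed helpers (segment_trials, propose_hypothesis, draft_intervention, replay_from) that stand in for DoVer's LLM calls and replay engine; the names and return shapes are illustrative, not the paper's code.

```python
# Do-then-verify sketch: each helper below stands in for an LLM call or a replay engine.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    agent: str   # who likely made the decisive error
    step: int    # where in the trial it happened
    reason: str  # why we suspect it

def segment_trials(log):
    """Split a session log into trials, starting a new trial at each plan/re-plan message."""
    trials, current = [], []
    for msg in log:
        if msg["type"] == "plan" and current:
            trials.append(current)
            current = []
        current.append(msg)
    return trials + ([current] if current else [])

def propose_hypothesis(trial) -> Hypothesis:
    # In DoVer an LLM scans the trial; here we simply blame the last step as a placeholder.
    return Hypothesis(agent=trial[-1]["agent"], step=len(trial) - 1, reason="stub")

def draft_intervention(trial, hyp: Hypothesis) -> str:
    # In DoVer an LLM writes a minimal, tool-aware edit; a fixed example keeps the sketch runnable.
    return "Open the APOD calendar, pick 2015 -> August -> dates 1-7."

def replay_from(trial, step: int, new_instruction: str) -> dict:
    # A real replay restores the checkpoint before `step`, splices in the edit, and re-runs the tail.
    return {"solved": False, "followed_edit": True, "milestones_gained": 1}

def dover_debug(session_log, num_replays: int = 3):
    report = []
    for trial in segment_trials(session_log):
        hyp = propose_hypothesis(trial)
        fix = draft_intervention(trial, hyp)
        runs = [replay_from(trial, hyp.step, fix) for _ in range(num_replays)]
        report.append({"hypothesis": hyp, "fix": fix,
                       "successes": sum(r["solved"] for r in runs)})
    return report

demo_log = [{"type": "plan", "agent": "Orchestrator"},
            {"type": "act", "agent": "WebSurfer"}]
print(dover_debug(demo_log))
```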
The “Aha!” moment in one sentence: Treat failure attribution as a hypothesis you must test by intervening in the original trajectory and verifying the outcome.
Multiple analogies:
- Doctor: You suspect a cause of a cough, prescribe a targeted treatment, and check if the cough goes away.
- Coach: You tweak a single play during the game and see if the team scores this time.
- Mechanic: You replace one part you think is faulty, then turn the engine on to confirm the fix.
Before vs. After:
- Before: Read logs, name one bad step/agent, hope the guess is right. No re-run to confirm; confusion when multiple trials exist.
- After: For each trial, try a tiny change right where you think the mistake happened, re-run, and score whether things got better or solved.
Why it works (intuition):
- Causality check: A real fix should change the outcome. If the same failure persists after a faithful intervention, your hypothesis was likely wrong.
- Modular focus: Editing at the precise step avoids washing the trajectory with vague, end-of-run advice.
- Trial granularity: Handling each plan-and-execute attempt separately respects branching strategies.
Building blocks (each with a mini sandwich; a compact data-model sketch follows this list):
🍞 Hook: Imagine chapters in a choose-your-own-adventure book. 🥬 The Concept (Trial segmentation): A trial is one plan plus its execution steps until the next re-plan. How it works: Detect plan/re-plan messages and split the log accordingly. Why it matters: Debug one causal chain at a time. 🍞 Anchor: Trial 2 begins at “Update plan: use calendar,” then includes all steps trying that approach.
🍞 Hook: Guessing which ingredient spoiled the cake. 🥬 The Concept (Failure attribution hypotheses): A hypothesis picks a likely agent and step and explains why. How it works: Use an LLM to scan the trial and propose a mistake location and reason. Why it matters: You need a concrete suspect to craft a focused fix. 🍞 Anchor: “At step 53, the orchestrator told WebSurfer to click a control that doesn’t exist.”
🍞 Hook: Writing a sticky note with clearer instructions. 🥬 The Concept (Interventions): Minimal edits to the plan or instruction at the suspected step. How it works: Replace vague or impossible directions with precise, tool-aware ones; or update the high-level plan order. Why it matters: Small, targeted changes are safer and easier to test. 🍞 Anchor: Change “scroll to find August 2015” to “open APOD calendar, pick 2015 → August → dates 1–7.”
🍞 Hook: Hitting “resume from checkpoint” in a video game after changing your move. 🥬 The Concept (Replay/counterfactual run): Restore the state just before the suspected step, apply the edit, and continue. How it works: Use checkpoints of messages, tools, and configs to re-run only the tail. Why it matters: You isolate the effect of the intervention. 🍞 Anchor: Resume right before the bad instruction and see if the new one leads to the answer.
🍞 Hook: Grading progress with a rubric instead of just pass/fail. 🥬 The Concept (Milestone progress + success): Score not only final success but also partial wins (milestones achieved). How it works: Extract up to five tool-agnostic milestones from human solutions, then check which ones the new run achieves. Why it matters: Even if not fully solved, real progress proves the fix helps. 🍞 Anchor: The run now reaches “Open the right month” (milestone 2), even if it still misses the final extraction.
🍞 Hook: Sorting experiments into “worked,” “helped,” “didn’t help,” or “couldn’t try.” 🥬 The Concept (Validation outcomes): Label each hypothesis as Validated, Partially Validated, Refuted, or Inconclusive. How it works: Combine success rate, progress, and whether the agent followed the new instruction. Why it matters: You learn if your idea was right, half-right, wrong, or blocked by tools. 🍞 Anchor: The agent followed the new instruction but still failed—your hypothesis was wrong (Refuted).
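Pulling the building blocks above together, here is one way they could be represented as data. The field names and the enum are assumptions for illustration, not DoVer's actual schema.

```python
# Illustrative data model for DoVer's building blocks (names are assumptions, not the paper's schema).
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class OutcomeLabel(Enum):
    VALIDATED = "validated"            # the fix flipped the trial to success
    PARTIALLY_VALIDATED = "partially"  # the edit was followed and gained milestones
    REFUTED = "refuted"                # the edit was followed but made no real progress
    INCONCLUSIVE = "inconclusive"      # the edit could not be followed (e.g., missing tool)

@dataclass
class Trial:
    plan: str               # the plan or re-plan message that opens this trial
    steps: list             # execution messages until the next re-plan

@dataclass
class Hypothesis:
    agent: str              # suspected culprit, e.g. "Orchestrator"
    step: int               # index of the suspected decisive error
    reason: str             # why this step is blamed

@dataclass
class Intervention:
    target_step: int        # where the edited message is spliced in
    edited_message: str     # the minimal, tool-aware replacement text

@dataclass
class ReplayResult:
    solved: bool            # did the resumed run finish the task?
    followed_edit: bool     # did the agent actually obey the new instruction?
    milestones_gained: int  # extra milestones achieved versus the original run

@dataclass
class Experiment:
    hypothesis: Hypothesis
    intervention: Intervention
    replays: list = field(default_factory=list)  # typically three ReplayResult entries
    label: Optional[OutcomeLabel] = None          # filled in after evaluation

example = Experiment(
    hypothesis=Hypothesis("Orchestrator", 53, "instructed clicking a control that does not exist"),
    intervention=Intervention(53, "Open the APOD calendar, pick 2015 -> August -> dates 1-7."),
)
print(example)
```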
03Methodology
At a high level: Input (failed session log) → Trial segmentation → Per-trial failure hypothesis → Minimal intervention → Replay from that point → Evaluate success and milestone progress.
Step-by-step (with mini sandwiches, examples, and why each step exists):
- Trial Segmentation
- What happens: The long session is split into trials: each begins at a plan or re-plan message and includes the following execution steps until the next re-plan or the end.
- Why it exists: Without splitting, different strategies get mixed, and a single “earliest error” becomes ambiguous.
- Example: In a GAIA APOD task, Trial 1 = “scroll archive,” Trial 2 = “use calendar,” Trial 3 = “try URL patterns.”
- 🍞 Hook: Like dividing a school day into classes so you know which lesson caused confusion.
- 🥬 The Concept (again, concisely): Detect planning steps and slice the log. If system-specific markers are missing, ask an LLM to identify plan updates. Why it matters: Clear boundaries make causal reasoning simpler. 🍞 Anchor: Steps 39–65 belong to Trial 2 because the orchestrator updated the plan there.
- Failure Attribution per Trial
- What happens: For each trial, an LLM proposes a hypothesis h = (agent, step, reason).
- Why it exists: You need a concrete target for the fix.
- Example: “At step 53, the Orchestrator instructed clicking an unclickable year; WebSurfer then clicked a random control.”
- 🍞 Hook: Picking the first wrong line in your math proof.
- 🥬 The Concept: A decisive error is a step where, if made correct, the rest would likely succeed. Why it matters: It narrows repair to the earliest helpful change. 🍞 Anchor: If the instruction had been “open the APOD calendar and choose August 2015” the rest likely would have worked.
- Intervention Generation
- What happens: Translate the hypothesis into a minimal, actionable edit. Two main categories: a) Modified instructions to sub-agents (make an ambiguous or impossible command precise and tool-aware) b) Plan updates (reorder or switch tactics to route around a dead end)
- Why it exists: A precise, small change reduces side effects and makes verification clean.
- Example text edit: “Open the APOD calendar. Click ‘2015.’ Navigate to August. Select dates Aug 1–7. Look for a photo with city lights on the horizon.”
- 🍞 Hook: Adding one missing direction to a treasure map.
- 🥬 The Concept (Intervention types): Keep changes local—don’t reset everything. Why it matters: Big edits hide the real cause; small edits spotlight it. 🍞 Anchor: Replace the single bad instruction rather than rewriting the entire plan ledger.
- Intervention Execution (Replay)
- What happens: Load the saved state (messages, tools, configs) at the target step, replace the message, and continue execution.
- Why it exists: To test if the edit truly changes the outcome, holding the past constant.
- Example: Resume just before the bad instruction in Trial 2; after the fix, the WebSurfer uses the correct calendar path.
- 🍞 Hook: Reloading a game checkpoint to try a different move.
- 🥬 The Concept (Counterfactual replay): Same past, new edit → check new future. Why it matters: It’s a clean causal test. 🍞 Anchor: Only the instruction changed; if success follows, your fix was the key.
- Evaluation (Success + Milestone Progress + Hypothesis Validation)
- What happens: Score (a) Trial Success Rate: did the task finish correctly? (b) Progress Made: how many additional milestones are now achieved? (c) Was the intervention followed?
- Why it exists: Outcome focus prevents overfitting to labels and captures partial improvements.
- Example (Milestones): For GAIA tasks, extract up to five tool-agnostic milestones (e.g., “Locate correct month,” “Open target entry,” “Extract city name”). Count new milestones achieved after the edit.
- 🍞 Hook: Using both a final grade and a checklist of learning goals.
- 🥬 The Concept (Milestones): LLMs generate concise milestones from human solutions; a judge LLM labels each as achieved/partial/missed. Why it matters: Fine-grained progress shows if an edit helped, even without full success. 🍞 Anchor: After the fix, the run completes Milestone 2 it previously missed.
- Outcome Labels
- What happens: After three replays per intervention, label the hypothesis:
- Validated: ≥2/3 runs succeed
- Partially Validated: fewer than 2/3 successes, but ≥2/3 of runs both followed the edit and gained at least one milestone (about 20% of the up-to-five milestones)
- Refuted: edit was followed but no meaningful progress
- Inconclusive: the agent didn’t follow the edit or tools blocked it
- Why it exists: It sorts ideas into “right,” “helpful but blocked,” “wrong,” or “not testable yet.”
- Example: If a missing ‘scroll-to-bottom’ action prevents following the instruction, label Inconclusive.
- 🍞 Hook: Sorting science-fair trials into “worked,” “kind of worked,” “didn’t work,” or “couldn’t run.”
- 🥬 The Concept (Validation categories): Combine correctness and compliance to judge the hypothesis. Why it matters: You learn what to keep, refine, or discard. 🍞 Anchor: The agent did what you asked but still failed → Refuted hypothesis.
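The labeling rules above can be sketched as a small decision function. The 2/3 thresholds follow the description in this step; the replay fields and the exact Refuted/Inconclusive boundary are simplifications for illustration.

```python
# Sketch of the outcome-labeling rules (thresholds from the text; replay fields are illustrative).

def label_hypothesis(replays: list) -> str:
    """Label one hypothesis from its replay runs.

    Each replay is a dict with: solved (bool), followed_edit (bool), milestones_gained (int).
    """
    n = len(replays)  # the paper uses three replays per intervention
    successes = sum(r["solved"] for r in replays)
    helpful = sum(r["followed_edit"] and r["milestones_gained"] >= 1 for r in replays)
    followed = sum(r["followed_edit"] for r in replays)

    if successes >= 2 * n / 3:
        return "Validated"            # the fix flips the trial to success
    if helpful >= 2 * n / 3:
        return "Partially Validated"  # the edit was followed and made real progress
    if followed >= 2 * n / 3:         # boundary simplified for this sketch
        return "Refuted"              # the edit was followed but did not help
    return "Inconclusive"             # the agent or its tools could not follow the edit

runs = [
    {"solved": False, "followed_edit": True, "milestones_gained": 1},
    {"solved": False, "followed_edit": True, "milestones_gained": 1},
    {"solved": False, "followed_edit": True, "milestones_gained": 0},
]
print(label_hypothesis(runs))  # -> Partially Validated
```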
Secret Sauce (what’s clever):
- Trial-aware debugging respects branching strategies instead of forcing a single blame step across the whole session.
- Minimal, in-situ interventions create clean causal tests with fewer side effects.
- Outcome-based evaluation (success + milestones) values real progress, not just label agreement.
- Orchestrator-level edits are system-agnostic, so DoVer ports to different frameworks (Magentic-One, AutoGen2) with light engineering for checkpoints.
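To picture the "light engineering for checkpoints" this relies on, here is a sketch of a framework-agnostic checkpoint/replay interface. The class and method names are assumptions, not Magentic-One's or AutoGen2's real APIs.

```python
# Sketch of a framework-agnostic checkpoint/replay interface (hypothetical API, not Magentic-One/AG2).
import copy

class Checkpoint:
    """Frozen state just before a given step: messages, tool state, and configs."""
    def __init__(self, messages: list, tool_state: dict, config: dict):
        self.messages = messages
        self.tool_state = tool_state
        self.config = config

class ReplayableRun:
    def __init__(self, checkpoints: dict, continue_fn):
        self.checkpoints = checkpoints   # step index -> saved state
        self.continue_fn = continue_fn   # framework hook that resumes execution

    def replay_with_edit(self, step: int, edited_message: dict) -> dict:
        """Restore the state before `step`, splice in the edited message, and re-run the tail."""
        state = copy.deepcopy(self.checkpoints[step])  # hold the past constant
        state.messages.append(edited_message)          # the edit replaces the original step's message
        return self.continue_fn(state)                 # only the future changes

# Toy usage: the "framework" just echoes which instruction it resumed with.
def fake_continue(state: Checkpoint) -> dict:
    return {"final_answer": f"resumed after: {state.messages[-1]['content']}", "solved": True}

run = ReplayableRun(
    checkpoints={53: Checkpoint(messages=[{"role": "orchestrator", "content": "old plan"}],
                                tool_state={}, config={})},
    continue_fn=fake_continue,
)
print(run.replay_with_edit(53, {"role": "orchestrator",
                                "content": "Open the APOD calendar, pick 2015 -> August."}))
```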
Concrete walk-through (APOD case):
- Input: Failed session with four trials; Trial 2 used a calendar but clicked an unclickable year.
- Hypothesis: Orchestrator sent an impossible instruction at Step 53.
- Intervention: Replace with “Open APOD calendar → 2015 → August → Dates 1–7; scan titles for city lights on the horizon.”
- Replay: Resume from Step 53.
- Evaluate: The run now finds the correct entry and traces Marquette → Holabird & Roche, flipping failure to success.
04Experiments & Results
The Test: Can targeted interventions flip failed trials to successes, or at least move them closer by achieving more milestones? And do they validate or refute the original failure hypotheses?
Datasets and Systems:
- Magentic-One (M1) with GAIA and AssistantBench (the same ecosystem used by prior Who&When work)
- AutoGen2 (AG2) MathChat with GSMPlus math problems
Metrics (with sandwiches):
🍞 Hook: Like checking if a study strategy helps you get an A, or at least raises your quiz scores. 🥬 The Concept (Trial Success Rate): The fraction of intervened trials that now solve the task. How it works: Run each intervention three times and count successes. Why it matters: It’s the clearest proof the edit worked. 🍞 Anchor: 17.6% success on WW cases; 27.5% on GAIA Level-1; 49% on GSMPlus.
🍞 Hook: Even if you didn’t ace the test, did your practice help you nail more sections? 🥬 The Concept (Progress Made via milestones): The share of new milestones achieved after intervention. How it works: Extract up to five milestones from human solutions, then count additional ones reached. Why it matters: Shows partial wins, not just pass/fail. 🍞 Anchor: +15.7% milestone gain on GAIA Level-1 (nearly one extra key step).
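Progress Made can be computed as the share of milestones newly achieved after the intervention. A minimal sketch follows, with an illustrative milestone list and judge labels (the paper uses an LLM judge for these labels).

```python
# Sketch of milestone-based progress scoring (milestone names and judge labels are illustrative).

def progress_made(before: dict, after: dict) -> float:
    """Share of milestones newly achieved after the intervention.

    `before` / `after` map milestone -> "achieved" | "partial" | "missed",
    as labeled by a judge (an LLM in the paper).
    """
    newly_achieved = sum(
        1 for m in before
        if before[m] != "achieved" and after.get(m) == "achieved"
    )
    return newly_achieved / len(before)

before = {"Locate correct month": "achieved",
          "Open target entry": "missed",
          "Extract city name": "missed",
          "Trace architecture firm": "missed",
          "Report final answer": "missed"}
after = dict(before, **{"Open target entry": "achieved"})

print(f"{progress_made(before, after):.0%}")  # one extra milestone out of five -> 20%
```

One newly achieved milestone out of five is roughly the "nearly one extra key step" behind the +15.7% GAIA Level-1 gain quoted above.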
🍞 Hook: Sorting your guesses into right, half-right, wrong, or untestable. 🥬 The Concept (Validation outcomes): Validated, Partially Validated, Refuted, Inconclusive—based on success, progress, and whether the agent followed the intervention. Why it matters: Turns vague hunches into tested knowledge. 🍞 Anchor: WW sets had ~15–16% Validated and a similar share Refuted; GAIA Level-1 had more decisive outcomes and fewer Inconclusive cases.
Competition (baselines):
- Log-only attribution (guess a bad step) without running an intervention
- End-of-run self-improvement styles (Self-Refine-like or CRITIC-like): critique the final output and try one more round
Scoreboard with context:
- WW-AB (AssistantBench) and WW-GAIA: 17.6% of intervened trials flipped to success. That’s like turning roughly 1 out of 6 failing attempts into wins, while others stayed the same or improved partially.
- GAIA Level-1: 27.5% success and +15.7% milestones—closer to turning 1 out of 4 failures into passes, and often adding almost one milestone.
- GSMPlus (AG2 MathChat): 49% success—about a coin flip in your favor—showing DoVer travels well beyond one framework.
- Against self-improvement baselines: 0% flips on the same WW-GAIA cases, versus DoVer’s 17.6%. End-of-run critiquing was too little, too late; the needed fix was mid-trajectory.
Surprising findings:
- Ground-truth ambiguity hurts log-only attribution. When cases had clearer labels, step attribution accuracy rose significantly; but ambiguity was common.
- Small open-source models can run DoVer too: Qwen3-8B achieved 11.3% (14.3% with 3-shot) and Qwen3-32B 16.9% vs GPT-4o’s 17.6%—a near tie for the larger open model.
- Many Inconclusive outcomes flag missing tools, not bad hypotheses. For example, lacking a “scroll-to-bottom” capability blocked faithful instruction following; once that tool was added, cases became solvable with the same orchestrator-level interventions.
Case vignettes:
- Validated: Replacing vague scrolling with precise calendar filtering led directly to the correct APOD entry and answer chain (Marquette → Holabird & Roche).
- Partially Validated: Directing the agent to the right data source (e.g., Alpha Vantage) showed real progress, but API keys and script errors blocked final success.
- Refuted: The agent followed the new instruction precisely, but the failure persisted → the original hypothesis was wrong.
- Inconclusive: The agent couldn’t perform the instructed action due to tool limits (e.g., no true scroll-to-bottom), so we couldn’t judge the hypothesis.
Takeaway: DoVer doesn’t just produce nicer logs; it changes outcomes. It also teaches where the system’s tools are weak, guiding concrete upgrades.
05Discussion & Limitations
Limitations (be specific):
- Requires checkpoint/replay: You need logs rich enough to restore state (messages, tool outputs, configs) and an interface to splice in edited messages. Black-box or asynchronous systems need extra engineering.
- Orchestrator-level only (in this version): DoVer edits text instructions and plans, not sub-agent code or tools. Failures rooted in missing capabilities (PDF parsing, precise scrolling) may remain Inconclusive until tools improve.
- LLM-as-a-judge bias: Milestone extraction and “did the intervention get followed?” checks rely on LLM judgments that can be imperfect despite careful prompts.
- Cost/latency: Multiple replays per intervention can be expensive on long tasks or large models.
Required resources:
- An agent framework supporting checkpoints and replay (added to Magentic-One and AutoGen2 in the study).
- At least a moderately capable LLM to propose hypotheses, write minimal edits, and judge progress.
- Optional: human oversight to act on surfaced tooling gaps (e.g., add a scroll-to-bottom action).
When NOT to use:
- Purely synchronous chatbots with no tool use or long reasoning chains—simpler debuggers may suffice.
- Safety-critical domains where automated edits must pass strict audits before any replay.
- Systems with no feasible way to checkpoint/restore or where side effects of replay cannot be contained.
Open questions:
- Can we expand interventions beyond text—e.g., automatically patch sub-agent code or add tools safely (capability-aware edits)?
- How to reduce Inconclusive cases by adapting interventions to known tool limits (capability-aware planning)?
- Can we train specialized models on intervention data to improve hypothesis quality and fix generation?
- How robust are milestone judgments across domains, and can we mix LLM and human audits for trustworthiness?
- How to extend DoVer to asynchronous, graph-style, or long-running production agents with minimal instrumentation?
06Conclusion & Future Work
Three-sentence summary: DoVer reframes debugging in LLM multi-agent systems as a do-then-verify process: make a minimal, targeted edit at a suspected failure step and re-run to check if outcomes improve. By segmenting runs into trials, proposing hypotheses, intervening, and grading success and milestones, DoVer turns untested guesses into verified knowledge and real fixes. Across datasets and frameworks, DoVer flips many failures to successes and exposes capability gaps, boosting reliability.
Main achievement: Establishing intervention-driven, outcome-based debugging that validates or refutes failure hypotheses through controlled replay, with strong, generalizable gains (18–28% flips in WW/GAIA; 49% in GSMPlus).
Future directions: Broaden the intervention space to include safe tool/code augmentation, build capability-aware fixers that respect known limits, reduce reliance on LLM judges with hybrid audits, and adapt DoVer to asynchronous and production-scale systems.
Why remember this: It shifts debugging from log-guessing to causal experiments. Instead of arguing over which line “looks wrong,” DoVer edits, replays, and measures—making multi-agent AI more trustworthy and easier to improve.
Practical Applications
- Add DoVer to a web-browsing agent to fix specific navigation mistakes mid-run and recover answers without restarting from scratch.
- Use milestone scoring to track partial progress in research tasks, catching improvements even when final answers are not yet correct.
- Apply DoVer in math-chat agents to correct the first wrong algebra step and re-derive the solution accurately.
- Instrument your agent framework with checkpoints so engineers can replay from a suspected step and test edits safely.
- Adopt orchestrator-level interventions first to keep fixes simple and system-agnostic across different tools and agents.
- Triage failures into Validated/Partially Validated/Refuted/Inconclusive to decide whether to revise instructions or invest in new sub-agent tools.
- Leverage small open-source models with few-shot examples to generate effective interventions at lower cost.
- Use DoVer’s inconclusive cases to prioritize adding key tools (e.g., proper PDF parsing or scroll-to-bottom) that unlock many blocked tasks.
- Compare DoVer against end-of-run critiques on your workload to see where precise mid-trajectory edits outperform global advice.
- Create a human-in-the-loop dashboard where developers can view trials, propose interventions, and immediately see replay outcomes.