
Agentic Uncertainty Quantification

Intermediate
Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang et al. Ā· 1/22/2026
arXiv Ā· PDF

Key Summary

  • Long AI tasks can go wrong early and keep getting worse, like a snowball of mistakes called the Spiral of Hallucination.
  • This paper turns an agent’s spoken confidence (how sure it says it is) into smart control signals that guide its behavior.
  • System 1 (Uncertainty-Aware Memory) quietly carries forward doubts and explanations so the agent doesn’t charge ahead blindly.
  • System 2 (Uncertainty-Aware Reflection) kicks in only when needed to fix low-confidence steps with targeted re-thinking.
  • A confidence threshold decides when to switch from quick actions to careful reflection, saving time and compute.
  • A special voting method picks reflected answers that are both confident and consistent across multiple tries.
  • On ALFWorld and WebShop, the dual system boosts success rates and makes confidence better match reality.
  • On deep research tasks, it writes more insightful, thorough reports by detecting gaps and digging deeper.
  • New trajectory-level calibration scores judge not just single answers but entire multi-step journeys.
  • The method is training-free, works with black-box models, and balances speed with reliability.

Why This Research Matters

Real-world tasks rarely end in one turn; they are long journeys where early slips can spoil everything later. This approach lets agents sense their own uncertainty and act on it, slowing down only when needed and fixing exactly what’s missing. That means more trustworthy shopping assistants, fewer costly tool calls, and safer robots that check preconditions before acting. In research settings, it pushes agents to go beyond surface-level summaries by noticing gaps and probing deeper. Because it’s training-free and works with black-box models, teams can adopt it quickly. The new trajectory-level metrics also help us judge not just answers, but the whole decision-making path. Overall, it moves AI from guesswork to guided, self-aware reasoning.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine following a recipe for a long, fancy cake. If you misread ā€œteaspoonā€ as ā€œtablespoonā€ in step 2, the rest of the cake can go wrong no matter how carefully you do steps 3 to 10.

🄬 The Concept: Uncertainty Quantification (UQ) is about measuring how unsure we are so we can avoid mistakes spreading through long tasks. How it works:

  1. Ask the system to say how sure it is.
  2. Track that sure-ness step by step.
  3. Use it to guide when to go fast or slow. Why it matters: Without UQ, the agent can be confidently wrong early, and that error infects everything after it. šŸž Anchor: When an AI plans a 20-step web shopping trip, UQ helps it say ā€œI’m 0.6 sure this is the right itemā€ before it spends 10 more steps comparing the wrong product.

šŸž Hook: You know how in hide-and-seek, you can’t see the whole room at once, so you guess where to look next based on what you’ve seen?

🄬 The Concept: A Partially Observable Markov Decision Process (POMDP) is a math way to describe making good choices when you can’t see everything. How it works:

  1. Keep a belief (a best guess) about the hidden state.
  2. Use history of observations and actions.
  3. Choose the next action to improve the belief or make progress. Why it matters: Agents act with partial information; errors in belief can snowball if not checked. šŸž Anchor: A web agent can’t see every page at once, so it builds a guess about which link leads to the right store page next.
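For readers who want the math behind that belief-keeping idea, here is the standard textbook POMDP belief update (a general formula, not something introduced by this paper): after taking action a and observing o, the belief over hidden states s' becomes

$$
b'(s') \;=\; \frac{Z(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}
$$

where T is the transition model, Z is the observation model, and the denominator just renormalizes so the belief sums to 1.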

šŸž Hook: Think of a rumor that keeps getting retold and twisted until everyone believes the wrong thing.

🄬 The Concept: The Spiral of Hallucination is when an early thinking mistake gets written into the agent’s ā€œhistoryā€ and misguides all later steps. How it works:

  1. Make a small wrong guess (early step).
  2. Store it in context as if it’s true.
  3. Plan the next steps using that wrong ā€œfact,ā€ amplifying error. Why it matters: One incorrect step can doom the whole trajectory, even if later steps look careful. šŸž Anchor: An agent misreads a tool result and then plans 15 steps that all assume the wrong ID number.

šŸž Hook: When you do homework, sometimes you do easy mental math quickly, other times you slow down and show your work.

🄬 The Concept: A Dual-Process Architecture uses a fast system for routine actions and a slow system for careful checks when things feel uncertain. How it works:

  1. Default to fast, cheap actions.
  2. Monitor confidence.
  3. If confidence drops, switch to slow, thoughtful reflection. Why it matters: It saves time but still prevents big mistakes. šŸž Anchor: A shopping bot browses quickly, but when it’s only 60% sure it found the right product, it pauses to verify features and reviews.

Before this paper, most UQ was passive: it measured risk but didn’t fix it. Self-reflection methods tried to fix errors but often reflected too much or at the wrong times—leading to wasted compute or even talking themselves into wrong answers. The missing piece was turning uncertainty into an active control signal that both restrains risky moves and guides targeted problem-solving.

šŸž Hook: Like leaving sticky notes in your notebook saying, ā€œI’m not sure about chapter 3—review before the test.ā€

🄬 The Concept: Uncertainty-Aware Memory (UAM) stores both confidence scores and short explanations of doubts along with actions and observations. How it works:

  1. After each step, the agent writes down its action, a 0–1 confidence, and why it feels that way.
  2. These notes stay in the context.
  3. The model’s attention naturally leans away from overconfident leaps when past doubts are visible. Why it matters: It prevents blind commitment and carries forward warnings so they aren’t forgotten. šŸž Anchor: If the agent says, ā€œI might have misread the price filter,ā€ that note nudges future steps to re-check filters before buying.
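To make this concrete, here is a minimal sketch of what one uncertainty-aware memory record could look like in Python. This is an illustrative structure of my own (the field names and rendering format are assumptions), not the paper's implementation:

```python
# Minimal sketch of an uncertainty-aware memory (illustrative, not the paper's code).
from dataclasses import dataclass
from typing import List

@dataclass
class UAMRecord:
    observation: str
    action: str
    confidence: float  # verbalized confidence in [0, 1]
    explanation: str   # the agent's own stated doubt or reason

def render_memory(records: List[UAMRecord]) -> str:
    """Serialize memory so past doubts stay visible in the prompt context."""
    return "\n".join(
        f"Step {i}: obs={r.observation} | action={r.action} "
        f"| confidence={r.confidence:.2f} | note={r.explanation}"
        for i, r in enumerate(records, 1)
    )

# A doubt recorded at step 4 remains visible to every later step.
memory = [UAMRecord("results page", "apply price filter", 0.62,
                    "unsure if the price cap was actually applied")]
print(render_memory(memory))
```

Keeping the explanation as free text is the important design choice: it is exactly what the later reflection step can reuse as a precise cue about what to fix.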

šŸž Hook: When you feel unsure on a test question, you circle it and come back with extra care.

🄬 The Concept: Uncertainty-Aware Reflection (UAR) is a focused re-think triggered only when confidence is low, using the earlier explanation as the exact clue for what to fix. How it works:

  1. Spot low confidence.
  2. Feed the agent its own explanation (ā€œI’m unsure about the dateā€).
  3. Generate several fixes and pick the most consistent, confident one. Why it matters: It solves the right problem without spinning in endless self-talk. šŸž Anchor: If the plan doubts a missing citation, reflection expands the search just for that source, not everything.
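A tiny sketch of how that trigger could work in code; the threshold value and the prompt wording are my own illustrative assumptions, not the paper's exact prompts:

```python
# Minimal sketch: build a targeted reflection prompt only when confidence is low.
from typing import Optional

TAU = 0.9  # illustrative threshold; the paper discusses values around 0.85-0.95

def maybe_build_reflection_prompt(action: str, confidence: float,
                                  explanation: str) -> Optional[str]:
    """Return a focused reflection prompt if confidence is below TAU, else None."""
    if confidence >= TAU:
        return None  # fast path: keep going, no reflection needed
    return ("You were uncertain about your last step.\n"
            f"Proposed action: {action}\n"
            f"Your stated concern: {explanation}\n"
            "Address this specific concern and propose a revised action.")

print(maybe_build_reflection_prompt(
    "cite release date", 0.55, "I'm unsure about the publication date"))
```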

The stakes are real: better copilots, safer household robots, more trustworthy research reports, and fewer wasted API calls. This paper fills the gap by uniting passive sensing (UQ) with active control (reflection), so agents know when to speed up, when to slow down, and what exactly to fix.

02 Core Idea

šŸž Hook: You know how a car dashboard doesn’t just show a warning light—it also helps you decide whether to keep driving or pull over?

🄬 The Concept: The key idea is to turn an agent’s own spoken uncertainty into action switches that either slow it down (caution) or send it to the repair lane (targeted reflection). How it works:

  1. At each step, the agent outputs an action, a confidence score, and a short explanation.
  2. These notes live in memory, softly steering future steps away from risky leaps.
  3. If confidence is too low, the explanation triggers a focused reflection to fix the exact gap. Why it matters: The agent balances speed and care dynamically, avoiding error snowballs without overthinking everything. šŸž Anchor: A web agent that’s 0.62 confident about a product adds a quick verify step; if that’s not enough, it runs a short, targeted re-check of specs.

Three analogies:

  1. Traffic lights: Green (go fast), yellow (slow and check), red (stop and reflect).
  2. Chef tasting: Cook normally, but if the sauce tastes off (low confidence), follow the tasting note to add salt or acid—don’t remake the whole dish.
  3. Teacher’s margin notes: Fast reading continues, but a ā€œclarify this termā€ note triggers a focused fix on that term later.

šŸž Hook: When you play a level you know well, you move quickly; on a tricky puzzle, you slow down and double-check.

🄬 The Concept: Before vs. After. Before: Agents either barrel ahead (fast but brittle) or reflect too much (careful but costly). After: The agent uses uncertainty to decide when to be fast or slow, and how to fix exactly what’s broken. Why it matters: You get higher success and better-calibrated confidence without paying the full price of always reflecting. šŸž Anchor: In ALFWorld, the agent avoids getting stuck moving the wrong object by noticing ā€œI’m missing the lampā€ and switching to find it first.

šŸž Hook: If your friend says, ā€œI’m only half-sure,ā€ you don’t ask them to recite the dictionary—you ask about the exact part they’re unsure of.

🄬 The Concept: Why it works (intuition without math)

  1. Forward damping: Keeping doubts in memory nudges the model (via attention) to prefer safe information-gathering over bold, risky actions.
  2. Inverse fixing: When confidence is low, the explanation becomes a compass that points the reflection exactly where to look.
  3. Switching: A single threshold separates routine steps from repair steps, making compute spending smart, not constant. Why it matters: It couples sensing (uncertainty) with doing (control) in both directions. šŸž Anchor: Seeing ā€œunsure about the date field,ā€ the agent opens one more source just for the publication date, not ten irrelevant tabs.

Building blocks (with mini sandwiches):

  • šŸž Hook: Like highlighting a tricky sentence in your notes. 🄬 The Concept: Verbalized Confidence and Explanation are the agent’s own 0–1 score and a short reason for its certainty. How it works: The prompt asks for action + confidence + explanation each step. Why it matters: Numbers switch modes; explanations guide repairs. šŸž Anchor: ā€œConfidence 0.68: I might have misread the color filter.ā€

  • šŸž Hook: Like a backpack of reminders you carry from class to class. 🄬 The Concept: Uncertainty-Aware Memory (System 1) stores observations, actions, confidence, and explanations. How it works: Keep these records in context so attention doesn’t forget doubts. Why it matters: Prevents blind commitment and keeps risks visible. šŸž Anchor: ā€œEarlier I wasn’t sure about the coupon rule—recheck before paying.ā€

  • šŸž Hook: When a Lego piece doesn’t fit, you check that exact piece and the step number. 🄬 The Concept: Uncertainty-Aware Reflection (System 2) is a targeted re-think. How it works: Use the explanation as a cue, try several fixes, and pick the confident, consistent one. Why it matters: Fixes the right problem without endless loops. šŸž Anchor: ā€œUnsure about brand compatibilityā€ triggers a brief spec check before buying.

  • šŸž Hook: If three friends independently pick the same answer and all feel confident, you trust it. 🄬 The Concept: Consistency-Weighted Reflection picks answers that are both frequent and confident across samples. How it works: Sample N candidates; score by confidence times how many agree. Why it matters: Avoids one-off hallucinations. šŸž Anchor: Three of three samples say ā€œUse academic search for this claim,ā€ with high confidence.

  • šŸž Hook: If you’re stuck, you might open your full notebook, not just the last page. 🄬 The Concept: Adaptive Memory Expansion reloads full history only when needed. How it works: If reflection is still weak, bring back long context once, then retry. Why it matters: Saves tokens most of the time; uses long memory only when it helps. šŸž Anchor: The agent reopens earlier notes to find where it first saw the missing lamp clue.

03 Methodology

At a high level: Input → System 1 (Generate action + confidence + explanation, update memory) → Switch (if confidence < threshold) → System 2 (Targeted reflection with Best-of-N and consistency-weighted selection, optional memory expansion) → Output action → Repeat.
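Here is a compact sketch of that loop with the LLM call and the environment stubbed out. The function names and toy data are illustrative assumptions, not the paper's code:

```python
# Minimal sketch of the dual-process control loop (illustrative stubs, not the paper's code).
import random

TAU = 0.9  # confidence threshold separating fast steps from reflection

def system1_stub(obs, memory):
    """Stand-in for the fast LLM call returning (action, confidence, explanation)."""
    return f"act_on({obs})", random.uniform(0.5, 1.0), "placeholder doubt"

def reflect_stub(obs, memory, concern):
    """Stand-in for targeted reflection conditioned on the stated concern."""
    return f"verify_then_act_on({obs})", 0.95, f"re-checked: {concern}"

def run_episode(observations):
    memory = []  # uncertainty-aware memory: (obs, action, confidence, explanation)
    for obs in observations:
        action, conf, why = system1_stub(obs, memory)   # System 1: fast path
        if conf < TAU:                                   # switch on low confidence
            action, conf, why = reflect_stub(obs, memory, concern=why)  # System 2
        memory.append((obs, action, conf, why))          # doubts stay visible later
    return memory

for step in run_episode(["page_1", "page_2", "page_3"]):
    print(step)
```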

Step-by-step recipe:

  1. Input and Memory Setup
  • What happens: The agent receives the task, current observation, and a compact memory that includes past observations, actions, confidence scores, and explanations.
  • Why this step exists: Without carrying forward doubts, the agent forgets where risks came from and may repeat mistakes.
  • Example: ā€œAt step 4, we tried a filter but noted ā€˜confidence 0.62: unsure if price cap applied.’ This sits in memory for step 5.ā€
  2. System 1: Generate Action + Confidence + Explanation (Elicitation)
  • What happens: The model is prompted to output three fields each step: <action>, <confidence> (0–1), <explanation> (why that confidence).
  • Why this step exists: Numbers alone can flip a switch, but explanations tell us where to look if we need to fix something.
  • Example: ā€œ<action>search ā€˜wireless mouse under $25’</action> <confidence>0.71</confidence> <explanation>The store sometimes includes shipping; I might need to add ā€˜free shipping’ or sort by price.</explanation>ā€
  3. Soft Constraint via Attention (Semantic Uncertainty Propagation)
  • What happens: Because the memory keeps explanations visible, the model’s attention naturally emphasizes words of caution, which reduces overconfident leaps and favors safe exploration (like verifying or expanding search).
  • Why this step exists: We want the agent to gently prefer information-gathering over irreversible commitments when doubts accumulate, without hard-coded rules.
  • Example: Seeing ā€œunsure about coupon stacking,ā€ the next action becomes ā€œcheck coupon policy,ā€ not ā€œbuy now.ā€

šŸž Hook: Like seeing a yellow light before an intersection. 🄬 The Concept: Confidence Threshold (Ļ„) is the switch that decides whether to keep going fast or to reflect. How it works:

  1. Compare the step’s confidence to Ļ„ (e.g., 0.85–0.95 range).
  2. If higher, proceed; if lower, trigger reflection. Why it matters: It saves compute by reflecting only where risk is real. šŸž Anchor: A 0.9 threshold means 0.88 triggers reflection, 0.93 passes through.
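Combining the elicitation format from the example in step 2 with this Ļ„ switch, a minimal parse-and-decide sketch could look like the following. The XML-style tags mirror the example above, but the exact schema and the regex parsing are illustrative assumptions:

```python
# Minimal sketch: parse the three elicited fields and apply the tau switch.
import re

TAU = 0.9  # illustrative threshold in the 0.85-0.95 range discussed above

def parse_step(llm_text: str):
    """Extract <action>, <confidence>, <explanation> from a model response."""
    def grab(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", llm_text, re.DOTALL)
        return m.group(1).strip() if m else ""
    action = grab("action")
    explanation = grab("explanation")
    try:
        confidence = float(grab("confidence"))
    except ValueError:
        confidence = 0.0  # an unparseable confidence is treated as maximal doubt
    return action, confidence, explanation

response = ("<action>search 'wireless mouse under $25'</action>"
            "<confidence>0.71</confidence>"
            "<explanation>shipping may not be included in the price</explanation>")
action, conf, why = parse_step(response)
print("reflect" if conf < TAU else "proceed", "|", action, "|", why)
```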
  4. System 2: Targeted Reflection (Inverse Fixing)
  • What happens: When confidence < Ļ„, we build a reflection prompt that includes the agent’s own explanation as a rational cue (e.g., ā€œAddress this concern: unsure about release dateā€). The model generates several candidate fixes (Best-of-N) with new confidence and explanation.
  • Why this step exists: General reflection can wander; using the precise doubt focuses the repair.
  • Example: Candidates: (A) Re-run search with site:gov, (B) Open PDF from journal site, (C) Ask tool for structured metadata.

šŸž Hook: If three students reach the same answer independently, we trust it more. 🄬 The Concept: Consistency-Weighted Selection chooses the candidate that many samples agree on and that has high confidence. How it works:

  1. Group semantically equivalent answers.
  2. Score = (how many agree) Ɨ (their average confidence).
  3. Pick the best-scoring candidate. Why it matters: It favors robust, non-hallucinated fixes. šŸž Anchor: If 2 of 3 candidates say ā€œUse academic search,ā€ each with confidence ~0.9, that plan wins.
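A small sketch of that scoring rule (agreement count times average confidence). Real semantic grouping would need an embedding model or an LLM judge; simple text normalization stands in for it here as an assumption:

```python
# Minimal sketch of consistency-weighted selection over Best-of-N candidates.
from collections import defaultdict

def normalize(answer: str) -> str:
    return " ".join(answer.lower().split())  # crude stand-in for semantic grouping

def select(candidates):
    """candidates: list of (answer_text, confidence). Return (winner, score)."""
    groups = defaultdict(list)
    for text, conf in candidates:
        groups[normalize(text)].append((text, conf))
    def score(members):
        confs = [c for _, c in members]
        return len(members) * (sum(confs) / len(confs))  # agreement x avg confidence
    best = max(groups.values(), key=score)
    return best[0][0], score(best)

candidates = [("Use academic search for this claim", 0.92),
              ("use academic search for this claim", 0.88),
              ("Open the vendor blog post", 0.95)]
print(select(candidates))  # the consistent, confident plan wins (score = 2 x 0.90)
```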
  5. Adaptive Memory Expansion (Only If Needed)
  • What happens: If reflection still looks weak (score below a reliability bar), load the full history and reflect once more.
  • Why this step exists: Sometimes the missing clue is far back in the trajectory; use long memory only when local fixes fail.
  • Example: ALFWorld: The lamp clue appeared 15 steps ago, so expanding memory reveals it and unblocks the plan.
  6. Execute and Update Memory
  • What happens: Execute the chosen action; store the updated confidence and explanation.
  • Why this step exists: Future steps benefit from resolved doubts and any remaining cautions.
  • Example: ā€œConfidence 0.92: Verified price cap is applied now.ā€
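And a tiny sketch of the ā€œexpand memory only if neededā€ fallback from step 5; the reliability bar and the reflect_fn signature are illustrative assumptions:

```python
# Minimal sketch: retry reflection once with full history if the local fix looks weak.
RELIABILITY_BAR = 1.5  # e.g., require agreement x confidence above this (assumed value)

def reflect_with_fallback(reflect_fn, recent_memory, full_memory, concern):
    """Try reflection on recent memory; reload full history once if it looks weak."""
    answer, score = reflect_fn(recent_memory, concern)
    if score < RELIABILITY_BAR:
        answer, score = reflect_fn(full_memory, concern)  # one-time memory expansion
    return answer, score

# Toy reflect_fn: pretends the long context contains the missing lamp clue.
toy = lambda mem, concern: ("go to shelf", 2.0 if "lamp" in " ".join(mem) else 1.0)
print(reflect_with_fallback(toy, ["recent steps only"],
                            ["saw desklamp on shelf 15 steps ago"], "lamp not found"))
```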

Putting it all together with data:

  • ALFWorld mini-case: Goal: Examine bowl with desklamp. System 1 notices ā€œlow confidence: lamp not found.ā€ Threshold triggers System 2. Reflection tries (A) re-examine desk, (B) go to sidetable, (C) go to shelf; consistency-weighted pick = (C) with 0.85 confidence. The agent finds the lamp, then completes the task in fewer steps than a baseline that loops.
  • WebShop mini-case: The agent doubts whether the price filter applied. Reflection suggests adding ā€˜under $25’ to the query AND sorting by price. Candidates (A), (B), (C) mostly agree; the chosen fix reduces wasted clicks and increases success.

The secret sauce:

  • Dual direction control: Forward damping (System 1) prevents bad commitments; inverse fixing (System 2) repairs low-confidence steps.
  • Actionable uncertainty: Explanations transform ā€œI’m unsureā€ into ā€œI’m unsure about X,ā€ which is a to-do list, not just a warning.
  • Smart budgeting: A single threshold and one-time memory expansion spend compute only where it pays off.

04 Experiments & Results

šŸž Hook: If you study for a test, it’s not just about your final answer—you want your confidence during the whole test to match how well you’re actually doing.

🄬 The Concept: Trajectory-Level Calibration measures how a whole multi-step journey’s confidence matches reality. How it works:

  1. Aggregate step confidences in different ways (last step, average, weakest link).
  2. Compare predicted confidence to actual success (ECE, Brier) and separation (AUROC). Why it matters: One bad step can sink the task; we need metrics that understand sequences, not just single turns. šŸž Anchor: If your lowest-confidence step is truly where you fail, that’s good calibration of the process.
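For concreteness, here is a small sketch of the three aggregation lenses and a simple binned ECE over trajectories. The bin count and toy numbers are illustrative assumptions; the paper's exact metric definitions may differ in detail:

```python
# Minimal sketch: trajectory-level confidence aggregation and a binned ECE.

def phi_last(confs): return confs[-1]                # end-state lens
def phi_avg(confs):  return sum(confs) / len(confs)  # overall lens
def phi_min(confs):  return min(confs)               # weakest-link lens

def ece(pred_confs, outcomes, n_bins=10):
    """Expected Calibration Error: gap between confidence and actual success rate."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(pred_confs, outcomes):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total, err = len(pred_confs), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - acc)
    return err

# Toy data: step confidences and outcome (1 = success) for two trajectories.
trajs = [([0.9, 0.95, 0.6], 0), ([0.92, 0.9, 0.88], 1)]
print([(phi_last(c), phi_avg(c), phi_min(c)) for c, _ in trajs])
print(ece([phi_min(c) for c, _ in trajs], [y for _, y in trajs]))
```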

The tests and why:

  • ALFWorld (embodied tasks): Demands careful sequencing; a missing precondition causes failure.
  • WebShop (noisy navigation): Requires robust search and verification under high observation noise.
  • DeepResearch (open-ended research): Rewards depth, insight, and coherence rather than binary completion.

Competitors:

  • ReAct (fast, no uncertainty), Reflexion (self-corrects across episodes), Self-Reflection (checks every step blindly), CoT-SC (ensemble self-consistency). We also test our parts: Forward only (UAM), Inverse only (UAR), and the full Dual-Process (AUQ).

Scoreboard with context:

  • Calibration: UAM (forward-only) often achieves the lowest trajectory ECE (like moving from guessing to well-matched confidence), showing it’s excellent at damping overconfidence. UAR (inverse-only) gets the lowest Brier (sharper, more decisive beliefs) by fixing gaps. AUQ blends both: low ECE with improved sharpness.
  • Success: On ALFWorld, AUQ reaches ~74.3% success (about like scoring an A when others get B’s). On WebShop, AUQ hits ~42.9%, a strong jump over baselines.
  • Discrimination: AUQ gets higher AUROC, meaning its internal confidence separates successes from failures better; it ā€œknows when it knows.ā€
  • DeepResearch: AUQ improves Comprehensiveness and Insight, transforming shallow summaries into stronger, evidenced reports. It consistently improves across multiple model backends.

Surprising findings:

  • Forward vs. Inverse roles: UAM is the best at being honest (calibrated). UAR is the best at being decisive (after fixing). Together, they deliver both.
  • Memory length: Even with short history, AUQ keeps performance up because the stored confidence and explanations compress the ā€œrisk state.ā€ When needed, adaptive memory expansion recovers long-range context.
  • Threshold sweet spot: Around Ļ„ ā‰ˆ 0.9 often gives the best trade-off: plenty of smart interventions, without over-verifying trivial steps.
  • Rare risk: Sometimes reflection boosts confidence in a wrong plan (ā€œdelusional confirmationā€). But overall, net gains are strongly positive and misfires are limited.

šŸž Hook: Imagine grading a group project by how well the team understood risks at every stage, not just the final slide.

🄬 The Concept: End-State (last), Average (overall), and Weakest Link (min) aggregations provide different lenses on the journey. How it works:

  1. Φ_last checks the end decision.
  2. Φ_avg rewards steady confidence.
  3. Φ_min catches the most fragile moment. Why it matters: Agents fail in different ways; one lens doesn’t fit all. šŸž Anchor: A single low-confidence step that truly precedes failure is exactly what Φ_min is designed to catch.

05 Discussion & Limitations

Limitations:

  • Small or weak models may not verbalize confidence well, reducing the quality of the control signal.
  • Reflection adds latency when it fires; tight real-time systems may dislike occasional pauses.
  • If explanations are poor or vague, System 2 may reflect aimlessly.
  • Rarely, reflection can talk itself into a wrong plan with high confidence; guardrails help.

Required resources:

  • An LLM that can follow prompts to output action + confidence + explanation.
  • Ability to store and pass uncertainty-aware memory through context.
  • Optional: Tool access and retrieval to resolve doubts.

When not to use:

  • Ultra low-latency chat where any reflection is unacceptable.
  • Tasks with perfect, immediate ground truth feedback (simpler bandit-style control may suffice).
  • Settings where the base model’s verbal confidence is known to be uninformative.

Open questions:

  • Adaptive thresholds: Can Ļ„ vary by action type, stakes, or remaining budget automatically?
  • Better cues: How to make explanations more diagnostic and less sycophantic?
  • Trust calibration: How to expose confidence to users without causing automation bias?
  • Multi-agent: How do uncertainty signals coordinate across teams of agents without echoing errors?

06 Conclusion & Future Work

In three sentences: This paper turns an agent’s own spoken uncertainty into a steering wheel. A fast memory track carries doubts forward to prevent blind commitment, while a slow reflection track fixes low-confidence steps using the doubt as a precise guide. Together, they break the Spiral of Hallucination and make long-horizon agents both more accurate and more honest about what they know.

Main achievement: A training-free, dual-process framework (UAM + UAR) that converts uncertainty from a passive warning into active, bi-directional control—improving success, calibration, and efficiency across very different agent tasks.

Future directions: Adaptive thresholds per step type, stronger explanation elicitation, team-level uncertainty sharing, and integrating tool reliability into the same control loop.

Why remember this: Because reliable agents don’t just think harder—they know when to think harder, and about what exactly to think.

Practical Applications

  • E-commerce agents that verify filters (price, brand) only when confidence dips, reducing returns and misbuys.
  • Customer support bots that escalate or ask clarifying questions when their confidence falls below a threshold.
  • Research copilots that trigger targeted literature or metadata checks when they detect missing evidence.
  • Robotic task planners that pause to confirm preconditions (e.g., tool present) before performing irreversible actions.
  • Business analytics agents that expand retrieval only when summaries show uncertainty about key metrics.
  • Coding assistants that run focused tests or lint checks when unsure about an API or edge case.
  • Healthcare triage assistants that flag low-confidence assessments for human review instead of giving definitive advice.
  • Education tutors that add a hint or a worked example only when a student’s answer shows low certainty.
  • RPA workflows that switch tools or add verification steps when extraction confidence declines.
  • Security monitoring agents that trigger deeper scans when anomaly explanations point to uncertain indicators.
#Agentic Uncertainty Quantification #Spiral of Hallucination #Dual-Process Architecture #Uncertainty-Aware Memory #Uncertainty-Aware Reflection #Trajectory-Level Calibration #ECE #Brier Score #AUROC #Best-of-N Reflection #Adaptive Memory Expansion #POMDP #Transformer Attention #Verbalized Confidence #Test-Time Calibration