The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents
Key Summary
- This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, others help them be honest.
- Evidence tools like web search feed the agent noisy information and make it overconfident even when it is wrong.
- Verification tools like code interpreters give clear, deterministic feedback that helps the agent match its confidence to reality.
- The authors introduce CAR (Calibration Agentic RL), a training method that teaches agents to be accurate and well-calibrated at the same time.
- They design a new reward, MSCR (Margin-Separated Calibration Reward), that rewards correct answers more than any honest mistake and punishes false bravado.
- Across benchmarks, CAR with MSCR cuts calibration error a lot (up to a 68% ECE reduction) without sacrificing accuracy.
- The learned calibration transfers from clean, local search to noisy, real web APIs and even to math problems solved with code tools.
- Results show that different tools need different calibration strategies; one-size-fits-all prompting is not enough.
- This work moves us toward self-aware agents that can say, "I'm not sure," when it truly matters.
- Better-calibrated agents are safer and more trustworthy for real-world tasks like research, support, and planning.
Why This Research Matters
In the real world, we need AI helpers that are not just capable but also honest about what they don't know. This paper shows that the kind of tool an agent uses can quietly push its confidence up or down, which affects our trust. By training agents with CAR and MSCR, we get systems that keep their accuracy while expressing uncertainty more truthfully. That means safer web browsing assistants, more reliable research helpers, and math solvers that know when to double-check. It also helps decision-makers judge when to act and when to seek human review. Over time, this builds a culture of careful AI that supports people instead of misleading them.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you take a test, it's not just about the answers; you also have a feeling of how well you did? If you think you got 100% but actually got 60%, that's a problem.
The Concept: Calibration is an AI's ability to say how sure it is in a way that matches how often it's actually right.
- What it is: Calibration means confidence should line up with correctness (e.g., 70% confidence should be right about 70% of the time).
- How it works: 1) The AI gives an answer, 2) it also gives a confidence number, 3) we check if that number matches reality over many questions.
- Why it matters: Without calibration, an AI can sound very sure while being wrong, which makes people trust it too much.
Anchor: If an AI says, "The capital of Australia is Sydney, 95% sure," but the right answer is Canberra, that overconfidence could mislead users.
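To see what "confidence should match correctness" means in practice, here is a tiny simulated check; the 70% and 45% accuracy rates are made-up numbers purely for illustration, not results from the paper.

```python
import random

random.seed(0)

# Over many questions where an agent says "70% sure", a well-calibrated agent
# should be right about 70% of the time; an overconfident one will fall short.
outcomes_calibrated = [random.random() < 0.70 for _ in range(1000)]    # right ~70% of the time
outcomes_overconfident = [random.random() < 0.45 for _ in range(1000)]  # right only ~45% of the time

print("calibrated agent accuracy at 70% confidence:  ", sum(outcomes_calibrated) / 1000)
print("overconfident agent accuracy at 70% confidence:", sum(outcomes_overconfident) / 1000)
# The first gap is small (good calibration); the second is large (overconfidence).
```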
Hook: Imagine a smart helper that can use tools, like a calculator or the internet, to finish a task for you. Cool, right?
The Concept: Tool-use agents are AIs that plan steps and call external tools (like web search or a code runner) to solve multi-step problems.
- What it is: A tool-use agent is an AI that reasons in multiple turns and uses outside tools to get information or to check work.
- How it works: 1) Think about the goal, 2) pick a tool, 3) use the tool, 4) read the result, 5) repeat until ready, 6) give an answer (and a confidence).
- Why it matters: Tools make agents more capable, but they also change how confident the agent feels, sometimes in risky ways.
Anchor: To answer "What year did a scientist win a prize?", the agent might search the web and then use a calculator to check a timeline.
Hook: Think of two kinds of helpers: one brings you lots of clippings from newspapers (messy!), the other runs a precise calculation and shows the result.
The Concept: Evidence tools (like web search) fetch information from the world, while verification tools (like code interpreters) check or compute things with clear rules.
- What it is: Evidence tools retrieve possibly noisy info; verification tools execute deterministic steps and give crisp feedback (errors or numbers).
- How it works: Evidence tools: query → retrieve passages → sift noise; Verification tools: write code/equation → run → get exact output or error.
- Why it matters: Noisy evidence can make agents feel falsely confident; deterministic checks can anchor confidence to facts.
Anchor: Looking up "largest lake" online might bring mixed answers, but running code to compute an area from a formula gives a definite number.
Before this paper, AI research already knew how to measure calibration for single-turn models, and people had started to notice that agents using search could be overly sure of themselves. But we didn't know if ALL tool use caused miscalibration, or if the kind of tool mattered. Some tried prompt tricks ("Be careful!") or standard reinforcement learning to improve tool use. These helped task success a bit but didn't fix the confidence problem consistently.
The missing piece this paper fills is a clear diagnosis and a tailored fix. First, it shows a confidence dichotomy: evidence tools tend to inflate confidence; verification tools tend to calm it down. Second, it introduces a training method (CAR) that directly teaches agents to be both correct and honest about uncertainty, with a special reward (MSCR) that keeps the incentives clean and stable.
Why care? In daily life, we rely on agents to browse the web, plan trips, summarize research, help with homework, or check math. If they sound 100% certain while wrong, people can make bad choices. Calibrated agents say, "I'm 60% sure," when that's the truth, helping us decide when to trust, when to double-check, and when to ask for help.
02 Core Idea
Hook: Imagine a friend who sometimes guesses loudly after glancing at headlines but speaks more carefully after doing the math. Same person, different tools, different confidence.
The Concept: The key insight is that tool type shapes confidence: evidence tools push agents toward overconfidence, while verification tools anchor confidence, and we can train agents (CAR) with a special reward (MSCR) to keep confidence honest.
- What it is: A discovery (confidence dichotomy) plus a fix (CAR training with MSCR) that aligns spoken confidence with actual correctness.
- How it works: 1) Show that evidence vs. verification tools shift confidence in opposite ways, 2) add a training reward that separates correct from incorrect outcomes by a safe margin, 3) optimize the agent with RL so it learns to pair answers with truthful confidence.
- Why it matters: Without this, agents using search sound too certain; with CAR+MSCR, agents learn to express uncertainty reliably while staying accurate.
Anchor: After CAR training, when the web is noisy, the agent says, "I think this is right, 55%," but when code confirms a calculation, it says, "I'm 90% sure," matching reality better.
Three analogies to clarify the same idea:
- Weather forecaster: If your thermometer is precise (verification), you trust it; if you're reading clouds from a blurry window (evidence), you should be cautious. CAR teaches the forecaster to report confidence that fits the instrument.
- Classroom quiz: If you checked your math with a calculator (verification), you can be confident; if you skimmed a textbook paragraph (evidence), be careful. MSCR rewards being both correct and honest about how you know.
- Detective work: A suspect list from gossip (evidence) can mislead; fingerprints from a lab (verification) are stronger. CAR trains the detective to weigh the kind of evidence when stating certainty.
Before vs. After:
- Before: Agents using search often became overconfident; simple prompts didn't fix it. Rewards that only score right/wrong could accidentally teach "confident guessing."
- After: With CAR+MSCR, agents keep accuracy while lowering calibration error a lot, and they transfer this honest behavior from clean testbeds to noisy real web and even to math tasks with code tools.
Why it works (intuition, no equations):
- Rewards shape habits. If correct and incorrect answers can get similar rewards, the agent can "cheat" by always sounding confident. MSCR fixes this by guaranteeing that any correct answer earns more than any honest mistake, and false confidence on wrong answers is penalized.
- Verification tools already give solid signals (errors, numbers), so confidence can latch onto them. Evidence tools lack clear negatives (search always returns something), so MSCR supplies the missing "don't bluff" signal.
Building Blocks (each with a quick sandwich):
- Hook: You know how you sometimes say how sure you are on a quiz? Calibration: What: match confidence to correctness; How: predict, say confidence, check over many cases; Why: prevents sounding sure-but-wrong. Anchor: Saying "70% sure" and being right 7 out of 10 times.
- Hook: Imagine a helper that can search or calculate. Tool-use Agents: What: AIs that call tools; How: plan, act, observe, repeat; Why: more power but trickier confidence. Anchor: Search for facts, then compute a date difference.
- Hook: Like getting a treat for doing the trick correctly. Reinforcement Learning (RL): What: learn by rewards; How: try actions, get reward, improve; Why: teaches good habits. Anchor: The agent learns that honest confidence is rewarded.
- Hook: Sometimes you loudly guess; sometimes you check your work. Evidence vs. Verification Tools: What: noisy retrieval vs. deterministic checks; How: fetch text vs. run code; Why: noise inflates confidence, checks ground it. Anchor: Reading blogs vs. running a Python formula.
- Hook: Scoreboards help teams improve the right things. CAR + MSCR: What: RL training with a margin-separated calibration reward; How: separate rewards for right/wrong, punish false bravado; Why: stops reward overlap that causes bluffing. Anchor: Even a low-confidence correct answer scores above any honest wrong one, guiding learning safely.
03 Methodology
At a high level: Question → Agent plans/tool-uses over multiple turns → Produces final answer + confidence → Reward (accuracy + calibration + format) → RL update → Better-calibrated agent.
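To make that pipeline concrete, here is a minimal sketch of a plan-act-observe loop that ends with an answer and a verbalized confidence. The helper names (llm_generate, web_search, run_code) and the SEARCH:/CODE:/FINAL: step markers are illustrative assumptions, not the paper's actual interface; the stubs return canned strings so the sketch runs on its own.

```python
import re

# Hypothetical stand-ins (not from the paper): an LLM call, a search tool, a code runner.
def llm_generate(prompt: str) -> str:
    # Placeholder: a real agent would call a language model here.
    return 'FINAL: Marie Curie (1903) <confidence>78</confidence>'

def web_search(query: str) -> str:
    return "Snippet: Marie Curie won the Nobel Prize in Physics in 1903."

def run_code(code: str) -> str:
    return "42"

def agent_episode(question: str, max_turns: int = 4):
    """Plan -> act (tool) -> observe, then answer with a verbalized confidence."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm_generate(transcript)
        if step.startswith("SEARCH:"):          # evidence tool: noisy text comes back
            transcript += step + "\nObservation: " + web_search(step[7:]) + "\n"
        elif step.startswith("CODE:"):          # verification tool: deterministic output or error
            transcript += step + "\nObservation: " + run_code(step[5:]) + "\n"
        elif step.startswith("FINAL:"):         # final answer plus <confidence> tag
            match = re.search(r"<confidence>(\d+)</confidence>", step)
            confidence = int(match.group(1)) / 100 if match else None
            answer = re.sub(r"<confidence>\d+</confidence>", "", step[6:]).strip()
            return answer, confidence
    return None, None  # ran out of turns without a final answer

print(agent_episode("Who was the first woman to win a Nobel Prize, and in what year?"))
```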
Step-by-step (like a recipe):
- Input and Planning
- What happens: The agent reads the question, thinks step-by-step, and decides whether to use a tool (search or code).
- Why this step exists: Without planning, the agent might rush to an answer and misstate its confidence.
- Example: Q: "Who was the first woman to win a Nobel Prize and in what year?" The agent plans to search, read a passage, and cross-check dates.
- Tool Invocation (Evidence or Verification)
- What happens: Evidence tool: send a search query and read several passages; Verification tool: write a small program or calculation and run it.
- Why this step exists: Tools extend the agent's abilities beyond memory, but they change how reliable the signals are.
- Example (Evidence): Search returns two snippets; one clearly states "Marie Curie, 1903," another is noisy. The agent must not become 100% confident just because it found text.
- Example (Verification): For "Compute the radius of a circle from its given area," the agent writes Python code to solve for r and gets a concrete number or an error.
- Produce Final Answer and Verbalized Confidence
- What happens: The agent gives an answer and a numerical confidence inside <confidence> tags.
- Why this step exists: If the agent doesn't state confidence, we can't measure calibration or reward honesty.
- Example: "Answer: Marie Curie (1903). <confidence>78</confidence>".
- Check Output Format
- What happens: A simple format checker ensures the reasoning → action → observation chain is followed and the <confidence> tag is present.
- Why this step exists: Without enforcing structure, the agent might skip confidence or muddle steps, making rewards noisy.
- Example: If the tag is missing, apply a penalty.
- Reward the Outcome: Accuracy + Calibration (Secret Sauce)
- What happens: The agent gets a reward that combines: (a) whether the answer is correct, (b) how well the stated confidence fits the truth, and (c) whether the output followed the format.
- Why this step exists: If we only reward correctness, the agent may learn to bluff with high confidence; if we only reward honesty, it might stay timid. We need both.
- Example: Correct + reasonable confidence earns more; wrong + high confidence gets penalized extra.
- Reinforcement Learning Update (GRPO)
- What happens: The RL algorithm (Group Relative Policy Optimization) adjusts the agent's parameters to increase expected reward.
- Why this step exists: Without RL, the agent won't learn from trial and error across many tasks and tools.
- Example: After many questions, the agent raises confidence only when evidence is strong or verification succeeded.
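GRPO's defining move is to score each sampled rollout of a question relative to the other rollouts in its group, instead of against a learned value function. The snippet below sketches only that group-relative advantage computation (clipping, KL penalties, and the actual gradient step are omitted); the example rewards are made-up numbers.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of the same question, each scored on accuracy + calibration + format.
rollout_rewards = [1.6, 0.9, -0.4, 1.2]
print(group_relative_advantages(rollout_rewards))
# Rollouts above the group mean get positive advantages and are reinforced;
# those below the mean are discouraged.
```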
The Secret Sauce: MSCR (Margin-Separated Calibration Reward)
- Hook: Imagine a spelling bee where a correct but shy answer should still beat a confident wrong shout.
- The Concept: MSCR is a reward that guarantees any correct answer scores above any honest mistake and punishes false confidence on wrong answers.
- What it is: A reward design that separates the score ranges for correct vs. incorrect outcomes by a safe margin.
- How it works: 1) Give a base reward for being correct; 2) add a bonus when confidence is appropriate; 3) for wrong answers, subtract more when confidence is high; 4) ensure no overlap between the best wrong and worst right cases.
- Why it matters: Without this, the agent can "reward-hack," e.g., guessing with high confidence to chase points.
- Anchor: A correct 20%-confident answer still beats a wrong 0%-confident answer, preventing perverse incentives.
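Here is a minimal sketch of what a margin-separated reward could look like in code. The constants (base reward, margin, calibration weight) and the exact shaping are assumptions for illustration; the paper's MSCR may use a different functional form, and the format penalty is included only to mirror the structure check described earlier.

```python
def mscr_like_reward(correct: bool, confidence: float, format_ok: bool = True,
                     base: float = 1.0, margin: float = 0.1, weight: float = 0.5) -> float:
    """Margin-separated, calibration-aware reward (illustrative, not the paper's exact formula).

    Correct answers score in [base + margin, base + margin + weight]; incorrect ones in
    [-weight, 0]. The margin keeps the two ranges separated even at the extremes, and the
    calibration term rewards high confidence when right and low confidence when wrong.
    """
    if not format_ok:                                       # missing <confidence> tag or broken structure
        return -1.0
    if correct:
        return base + margin + weight * confidence          # never drops below base + margin
    return -weight * confidence                             # confident mistakes hurt most

# Sanity check of the margin property: the *worst* correct case still beats
# the *best* incorrect case (an honest, 0%-confident mistake).
assert mscr_like_reward(True, 0.0) > mscr_like_reward(False, 0.0)
print(mscr_like_reward(True, 0.2), mscr_like_reward(False, 0.9))
```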
Baselines and Comparisons (brief sandwiches for new ideas):
- Hook: Sometimes people just turn the temperature down on a too-bold speaker. Temperature Scaling: What: post-hoc rescaling of confidence; How: adjust logits with a temperature; Why: can smooth confidence but doesn't teach real judgment (a minimal sketch follows this list). Anchor: The agent sounds calmer but may still be confidently wrong on the same items.
- Hook: What if we pay only for useful searches? MASH: What: penalizes excessive search to encourage abstention; How: reward less tool spam; Why: can reduce risky behavior but doesn't align confidence with truth directly. Anchor: The agent searches fewer times but might still overtrust a single noisy page.
- Hook: Scorecards that mix correctness and honesty need clear numbers. Brier Score: What: measures the squared gap between confidence and outcome; How: bigger gap, bigger penalty; Why: captures calibration and sharpness. Anchor: Saying 90% but being wrong hurts more than saying 55% and being wrong.
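For reference, here is a minimal sketch of post-hoc temperature scaling applied to verbalized confidences, with a single temperature fitted on a small held-out set by grid search. Operating on stated probabilities instead of raw classifier logits is a simplification for this setting, and the toy data is made up.

```python
import math

def temperature_scale(conf: float, T: float) -> float:
    """Rescale a probability on the logit scale; T > 1 softens overconfident values."""
    conf = min(max(conf, 1e-6), 1 - 1e-6)
    logit = math.log(conf / (1 - conf))
    return 1 / (1 + math.exp(-logit / T))

def fit_temperature(confs, corrects):
    """Pick the T that minimizes negative log-likelihood on a held-out set (grid search)."""
    def nll(T):
        total = 0.0
        for c, y in zip(confs, corrects):
            p = temperature_scale(c, T)
            total -= math.log(p if y else 1 - p)
        return total
    grid = [0.5 + 0.1 * i for i in range(46)]   # candidate temperatures in [0.5, 5.0]
    return min(grid, key=nll)

# Toy held-out data: stated confidences and whether each answer was actually correct.
confs = [0.9, 0.95, 0.8, 0.85, 0.9, 0.7]
corrects = [1, 0, 1, 0, 1, 1]
T = fit_temperature(confs, corrects)
print(T, [round(temperature_scale(c, T), 2) for c in confs])
# Note: this only rescales numbers; an item the agent gets wrong at 95% confidence
# can still end up the most confident item after scaling.
```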
Evaluation Metrics (used later in results):
- Hook: Report cards have different subjects. ECE (Expected Calibration Error): What: averages how far confidence is from accuracy across bins; How: group answers by confidence range and compare; Why: summarizes overall calibration. Anchor: If 80%-confidence bins are only right 60% of the time, ECE goes up.
- Hook: Can you sort answers from strongest to weakest? AUROC: What: how well confidence ranks right answers above wrong ones; How: compute the area under the ROC curve; Why: reveals if confidence can separate hits from misses. Anchor: Higher AUROC means the top-confidence answers tend to be the correct ones.
- Hook: How sure are you when you're actually wrong? MCIP: What: Mean Confidence on Incorrect Predictions; How: average confidence over the wrong answers; Why: exposes overconfidence on mistakes. Anchor: If wrong answers average 90% confidence, that's dangerous.
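A minimal sketch of computing these four metrics from per-answer (confidence, correctness) pairs, using the standard textbook definitions (equal-width ECE bins, pairwise AUROC); the paper's evaluation code may differ in details such as binning.

```python
def calibration_metrics(confs, corrects, n_bins=10):
    """ECE, Brier, AUROC, and MCIP from per-answer confidences and 0/1 correctness."""
    n = len(confs)
    # Brier: mean squared gap between stated confidence and the 0/1 outcome.
    brier = sum((c - y) ** 2 for c, y in zip(confs, corrects)) / n
    # ECE: bin by confidence, compare each bin's average confidence to its accuracy.
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        binned = [(c, y) for c, y in zip(confs, corrects)
                  if lo < c <= hi or (b == 0 and c == 0)]
        if binned:
            avg_conf = sum(c for c, _ in binned) / len(binned)
            acc = sum(y for _, y in binned) / len(binned)
            ece += (len(binned) / n) * abs(avg_conf - acc)
    # AUROC: probability that a random correct answer outranks a random incorrect one.
    pos = [c for c, y in zip(confs, corrects) if y]
    neg = [c for c, y in zip(confs, corrects) if not y]
    pairs = [(p, q) for p in pos for q in neg]
    auroc = (sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)
             if pairs else float("nan"))
    # MCIP: average confidence on the wrong answers (high values signal overconfidence).
    mcip = sum(neg) / len(neg) if neg else float("nan")
    return {"ECE": ece, "Brier": brier, "AUROC": auroc, "MCIP": mcip}

print(calibration_metrics([0.9, 0.8, 0.95, 0.3, 0.7], [1, 1, 0, 0, 1]))
```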
Putting it all together: CAR = structure checks + MSCR calibration-aware outcome reward + RL updates. Inputs are questions from datasets (NQ, HotpotQA, SimpleQA-verified; for math: AIME, MATH-500). Outputs are answers plus confidence. The training loop teaches the agent to speak carefully when evidence is noisy and more boldly when verification truly supports it.
04 Experiments & Results
The Test: The authors measured both accuracy and how well confidence matched reality. They used:
- Search tasks (NQ, HotpotQA) to stress evidence tools with a local Wikipedia retriever and with a real web API (Serper), including SimpleQA-verified for out-of-distribution checks.
- Math reasoning tasks (AIME 2024/2025, MATH-500) to test verification tools via code execution sandboxes.
- Metrics: Accuracy (get it right?), ECE (overall calibration error), Brier Score (penalizes confidence mistakes), AUROC (can confidence sort right vs. wrong?), and MCIP (how confident on wrong answers in the pilot study).
The Competition: CAR (with different reward designs) went up against strong baselines:
- Vanilla Search-R1 (rewarding task success and format),
- Temperature Scaling (post-hoc smoothing),
- MASH (discourages over-searching to promote abstention).
The Scoreboard (with context):
- On search agents across model sizes, CAR with MSCR cut calibration error dramatically (ECE reductions up to about 68%) while keeping or slightly improving accuracy versus baselines. That's like moving from a shaky B-minus grade on honesty to a solid A, without lowering your test score.
- AUROC often rose by up to ~17%, showing the agent became better at ranking correct answers as more confident than incorrect ones. That means confidence became more meaningful, not just rescaled.
- Weighted Brier training with λ=1 gave very low ECE but hurt accuracy a lot (a sign of reward hacking: the agent learned to game the score rather than be correct). MSCR avoided this trap, striking a better accuracy-calibration balance.
Surprising and Notable Findings:
- Confidence Dichotomy (Pilot Study): Evidence tools (web search) increased MCIP, meaning agents were very confident on their wrong answers, especially when trained with tool-use RL. Verification tools (code) showed the opposite trend: RL made agents less overconfident (lower MCIP). This difference was statistically significant.
- Generalization to Noisy Web: Agents trained locally with CAR (MSCR) stayed better calibrated when switched to the real, noisy Serper API. Accuracy remained competitive, while ECE and Brier improved, showing the learned behavior wasn't brittle.
- Transfer to Tool-Integrated Reasoning (Math): With code interpreters, CAR (MSCR) reduced ECE and Brier and ticked up AUROC on AIME and MATH-500. However, absolute ECE on the hardest math (AIME) remained higher than on easier sets (MATH-500), suggesting that calibration in verification settings still depends on the agent's underlying reasoning strength.
Make the numbers meaningful:
- Think of ECE like the gap between how sure you say you are and how often you're right. CAR with MSCR shrank this gap a lot (up to two-thirds smaller), so the agent's "I'm 80% sure" statements lined up much better with reality.
- AUROC rising by ~17% is like getting much better at sorting your "good bets" above your "bad bets," which is critical if a user only wants to act on the top-confidence answers.
- The reward design mattered: MSCR's strict margin prevented the "safe failure" loophole, where a wrong but low-confidence answer could tie a hesitant right one. This kept learning focused on being right and honest.
Big picture: CAR didn't just change the volume knob on confidence (as temperature scaling does). It taught the agent when to speak up and when to hedge, and that lesson carried over from lab conditions to the messy real world.
05 Discussion & Limitations
Limitations:
- Scale: Experiments used 3B-7B parameter models. Larger models may behave differently, and we don't yet know how the dichotomy or CAR's gains scale up.
- Task types: Focus was on short-answer QA and math, where correctness is crisp. Open-ended writing or long-horizon planning may need different calibration signals and timelines.
- Verification isn't perfect: Code that runs can still be logically wrong. Calibration improved, but on the hardest math (AIME) it's still not ideal; reasoning ability caps calibration.
Required Resources:
- RL training pipeline (e.g., GRPO), retrieval infrastructure (local wiki dump or web API), and, for verification, a secure code sandbox (e.g., E2B). Compute needs are moderate to high, depending on model size and data volume.
When NOT to Use:
- If you can't collect reliable correctness signals (no clear answers, very delayed rewards), CAR's outcome-based calibration may struggle.
- If your agent cannot output structured responses (no confidence tags), the format reward won't apply.
- If your use case already abstains aggressively, with human review for all low-confidence cases, post-hoc scaling might be sufficient.
Open Questions:
- How does the confidence dichotomy evolve with much larger or multimodal agents?
- Can we design evidence tools that return explicit negative signals (e.g., "insufficient evidence detected") to reduce overconfidence at the source?
- How to extend CAR to open-ended generation with fuzzy ground truth (e.g., reports or plans)?
- Can we jointly learn when to abstain, when to search, and how to state confidence, all under one reward framework?
Takeaway: Tool choice shapes confidence. CAR with MSCR provides a principled way to realign an agent's voice with its actual skill, which is especially crucial in noisy evidence-tool settings, while remaining broadly helpful in verification-tool workflows.
06 Conclusion & Future Work
Three-sentence summary: This paper discovers a confidence dichotomy in tool-use agents: evidence tools tend to inflate confidence, while verification tools ground it. To fix miscalibration, the authors introduce CAR, a reinforcement learning framework with a new reward (MSCR) that cleanly separates incentives for right vs. wrong answers and punishes false bravado. CAR significantly improves calibration across datasets, transfers from clean retrievers to real web APIs, and also helps in math with code tools, all while keeping accuracy competitive.
Main achievement: A practical, general training recipe, CAR with MSCR, that reliably reduces overconfidence in tool-use agents by aligning expressed confidence with true performance, avoiding reward-hacking pitfalls.
Future directions: Scale to larger and multimodal agents; add tool-level negative signals (evidence insufficiency); integrate abstention and planning into calibration learning; and adapt the framework to open-ended, long-horizon tasks with delayed or fuzzy correctness.
Why remember this: In agentic AI, the tool you choose changes how certain you feel. With CAR and MSCR, we finally have a way to teach agents not just to be smart, but to be honestly smart: speaking carefully when the world is noisy and confidently when the math (or code) checks out.
Practical Applications
- Deploy web-browsing assistants that flag low-confidence answers and request human review when search results are noisy.
- Build research copilots that rank sources by confidence and clearly separate speculation from verified facts.
- Create math and coding tutors that state confidence based on code execution results and suggest checks when unsure.
- Add confidence-aware summaries in customer support chatbots to guide escalation policies and reduce incorrect assurances.
- Use calibrated agents in business intelligence dashboards to prioritize high-confidence insights and label weak ones.
- Integrate CAR-trained agents into medical triage tools to signal uncertainty and trigger safe fallback procedures.
- Improve educational tools that teach students about uncertainty by modeling honest confidence alongside answers.
- Enhance legal or compliance document review with confidence-tagged findings to focus expert attention efficiently.
- Optimize internal search tools (RAG systems) to abstain or lower confidence when retrieved evidence is thin or conflicting.
- Tune multi-agent systems so planners and executors share calibrated confidence, improving coordination and safety.