The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents
Key Summary
- This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, others help them be honest.
- Evidence tools like web search feed the agent noisy information and make it overconfident even when it is wrong.
- Verification tools like code interpreters give clear, deterministic feedback that helps the agent match its confidence to reality.
- The authors introduce CAR (Calibration Agentic RL), a training method that teaches agents to be accurate and well-calibrated at the same time.
- They design a new reward, MSCR (Margin-Separated Calibration Reward), that rewards correct answers more than any honest mistake and punishes false bravado.
- Across benchmarks, CAR with MSCR cuts calibration error a lot (up to a 68% ECE reduction) without sacrificing accuracy.
- The learned calibration transfers from clean, local search to noisy, real web APIs and even to math problems solved with code tools.
- Results show that different tools need different calibration strategies; one-size-fits-all prompting is not enough.
- This work moves us toward self-aware agents that can say, "I'm not sure," when it truly matters.
- Better-calibrated agents are safer and more trustworthy for real-world tasks like research, support, and planning.
Why This Research Matters
In the real world, we need AI helpers that are not just capable but also honest about what they don't know. This paper shows that the kind of tool an agent uses can quietly push its confidence up or down, which affects our trust. By training agents with CAR and MSCR, we get systems that keep their accuracy while expressing uncertainty more truthfully. That means safer web browsing assistants, more reliable research helpers, and math solvers that know when to double-check. It also helps decision-makers judge when to act and when to seek human review. Over time, this builds a culture of careful AI that supports people instead of misleading them.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you take a test, it's not just about the answers; you also have a feeling of how well you did? If you think you got 100% but actually got 60%, that's a problem.
The Concept: Calibration is an AI's ability to say how sure it is in a way that matches how often it's actually right.
- What it is: Calibration means confidence should line up with correctness (e.g., 70% confidence should be right about 70% of the time).
- How it works: 1) The AI gives an answer, 2) it also gives a confidence number, 3) we check if that number matches reality over many questions.
- Why it matters: Without calibration, an AI can sound very sure while being wrong, which makes people trust it too much.
Anchor: If an AI says, "The capital of Australia is Sydney, 95% sure," but the right answer is Canberra, that overconfidence could mislead users.
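To see what "confidence should match correctness" means in practice, here is a tiny simulated check; the 70% and 45% accuracy rates are made-up numbers purely for illustration, not results from the paper.

```python
import random

random.seed(0)

# Over many questions where an agent says "70% sure", a well-calibrated agent
# should be right about 70% of the time; an overconfident one will fall short.
outcomes_calibrated = [random.random() < 0.70 for _ in range(1000)]    # right ~70% of the time
outcomes_overconfident = [random.random() < 0.45 for _ in range(1000)]  # right only ~45% of the time

print("calibrated agent accuracy at 70% confidence:  ", sum(outcomes_calibrated) / 1000)
print("overconfident agent accuracy at 70% confidence:", sum(outcomes_overconfident) / 1000)
# The first gap is small (good calibration); the second is large (overconfidence).
```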
Hook: Imagine a smart helper that can use tools, like a calculator or the internet, to finish a task for you. Cool, right?
The Concept: Tool-use agents are AIs that plan steps and call external tools (like web search or a code runner) to solve multi-step problems.
- What it is: A tool-use agent is an AI that reasons in multiple turns and uses outside tools to get information or to check work.
- How it works: 1) Think about the goal, 2) pick a tool, 3) use the tool, 4) read the result, 5) repeat until ready, 6) give an answer (and a confidence).
- Why it matters: Tools make agents more capable, but they also change how confident the agent feels, sometimes in risky ways.
Anchor: To answer "What year did a scientist win a prize?", the agent might search the web and then use a calculator to check a timeline.
Hook: Think of two kinds of helpers: one brings you lots of clippings from newspapers (messy!), the other runs a precise calculation and shows the result.
The Concept: Evidence tools (like web search) fetch information from the world, while verification tools (like code interpreters) check or compute things with clear rules.
- What it is: Evidence tools retrieve possibly noisy info; verification tools execute deterministic steps and give crisp feedback (errors or numbers).
- How it works: Evidence tools: query → retrieve passages → sift noise; Verification tools: write code/equation → run → get exact output or error.
- Why it matters: Noisy evidence can make agents feel falsely confident; deterministic checks can anchor confidence to facts.
Anchor: Looking up "largest lake" online might bring mixed answers, but running code to compute an area from a formula gives a definite number.
Before this paper, AI research already knew how to measure calibration for single-turn models, and people had started to notice that agents using search could be overly sure of themselves. But we didn't know if ALL tool use caused miscalibration, or if the kind of tool mattered. Some tried prompt tricks ("Be careful!") or standard reinforcement learning to improve tool use. These helped task success a bit but didn't fix the confidence problem consistently.
The missing piece this paper fills is a clear diagnosis and a tailored fix. First, it shows a confidence dichotomy: evidence tools tend to inflate confidence; verification tools tend to calm it down. Second, it introduces a training method (CAR) that directly teaches agents to be both correct and honest about uncertainty, with a special reward (MSCR) that keeps the incentives clean and stable.
Why care? In daily life, we rely on agents to browse the web, plan trips, summarize research, help with homework, or check math. If they sound 100% certain while wrong, people can make bad choices. Calibrated agents say, "I'm 60% sure," when that's the truth, helping us decide when to trust, when to double-check, and when to ask for help.
02 Core Idea
Hook: Imagine a friend who sometimes guesses loudly after glancing at headlines but speaks more carefully after doing the math. Same person, different tools, different confidence.
The Concept: The key insight is that tool type shapes confidence: evidence tools push agents toward overconfidence, while verification tools anchor confidence, and we can train agents (CAR) with a special reward (MSCR) to keep confidence honest.
- What it is: A discovery (confidence dichotomy) plus a fix (CAR training with MSCR) that aligns spoken confidence with actual correctness.
- How it works: 1) Show that evidence vs. verification tools shift confidence in opposite ways, 2) add a training reward that separates correct from incorrect outcomes by a safe margin, 3) optimize the agent with RL so it learns to pair answers with truthful confidence.
- Why it matters: Without this, agents using search sound too certain; with CAR+MSCR, agents learn to express uncertainty reliably while staying accurate.
Anchor: After CAR training, when the web is noisy, the agent says, "I think this is right, 55%," but when code confirms a calculation, it says, "I'm 90% sure," matching reality better.
Three analogies to clarify the same idea:
- Weather forecaster: If your thermometer is precise (verification), you trust it; if you're reading clouds from a blurry window (evidence), you should be cautious. CAR teaches the forecaster to report confidence that fits the instrument.
- Classroom quiz: If you checked your math with a calculator (verification), you can be confident; if you skimmed a textbook paragraph (evidence), be careful. MSCR rewards being both correct and honest about how you know.
- Detective work: A suspect list from gossip (evidence) can mislead; fingerprints from a lab (verification) are stronger. CAR trains the detective to weigh the kind of evidence when stating certainty.
Before vs. After:
- Before: Agents using search often became overconfident; simple prompts didn't fix it. Rewards that only score right/wrong could accidentally teach "confident guessing."
- After: With CAR+MSCR, agents keep accuracy while lowering calibration error a lot, and they transfer this honest behavior from clean testbeds to noisy real web and even to math tasks with code tools.
Why it works (intuition, no equations):
- Rewards shape habits. If correct and incorrect answers can get similar rewards, the agent can "cheat" by always sounding confident. MSCR fixes this by guaranteeing that any correct answer earns more than any honest mistake, and false confidence on wrong answers is penalized.
- Verification tools already give solid signals (errors, numbers), so confidence can latch onto them. Evidence tools lack clear negatives (search always returns something), so MSCR supplies the missing "don't bluff" signal.
Building Blocks (each with a quick sandwich):
- Hook: You know how you sometimes say how sure you are on a quiz? Calibration: What: match confidence to correctness; How: predict, say confidence, check over many cases; Why: prevents sounding sure-but-wrong. Anchor: Saying "70% sure" and being right 7 out of 10 times.
- Hook: Imagine a helper that can search or calculate. Tool-use Agents: What: AIs that call tools; How: plan, act, observe, repeat; Why: more power but trickier confidence. Anchor: Search for facts, then compute a date difference.
- Hook: Like getting a treat for doing the trick correctly. Reinforcement Learning (RL): What: learn by rewards; How: try actions, get reward, improve; Why: teaches good habits. Anchor: The agent learns that honest confidence is rewarded.
- Hook: Sometimes you loudly guess; sometimes you check your work. Evidence vs. Verification Tools: What: noisy retrieval vs. deterministic checks; How: fetch text vs. run code; Why: noise inflates confidence, checks ground it. Anchor: Reading blogs vs. running a Python formula.
- Hook: Scoreboards help teams improve the right things. CAR + MSCR: What: RL training with a margin-separated calibration reward; How: separate rewards for right/wrong, punish false bravado; Why: stops reward overlap that causes bluffing. Anchor: Even a low-confidence correct answer scores above any honest wrong one, guiding learning safely.
03 Methodology
At a high level: Question → Agent plans/tool-uses over multiple turns → Produces final answer + confidence → Reward (accuracy + calibration + format) → RL update → Better-calibrated agent.
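To make that pipeline concrete, here is a minimal sketch of a plan-act-observe loop that ends with an answer and a verbalized confidence. The helper names (llm_generate, web_search, run_code) and the SEARCH:/CODE:/FINAL: step markers are illustrative assumptions, not the paper's actual interface; the stubs return canned strings so the sketch runs on its own.

```python
import re

# Hypothetical stand-ins (not from the paper): an LLM call, a search tool, a code runner.
def llm_generate(prompt: str) -> str:
    # Placeholder: a real agent would call a language model here.
    return 'FINAL: Marie Curie (1903) <confidence>78</confidence>'

def web_search(query: str) -> str:
    return "Snippet: Marie Curie won the Nobel Prize in Physics in 1903."

def run_code(code: str) -> str:
    return "42"

def agent_episode(question: str, max_turns: int = 4):
    """Plan -> act (tool) -> observe, then answer with a verbalized confidence."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm_generate(transcript)
        if step.startswith("SEARCH:"):          # evidence tool: noisy text comes back
            transcript += step + "\nObservation: " + web_search(step[7:]) + "\n"
        elif step.startswith("CODE:"):          # verification tool: deterministic output or error
            transcript += step + "\nObservation: " + run_code(step[5:]) + "\n"
        elif step.startswith("FINAL:"):         # final answer plus <confidence> tag
            match = re.search(r"<confidence>(\d+)</confidence>", step)
            confidence = int(match.group(1)) / 100 if match else None
            answer = re.sub(r"<confidence>\d+</confidence>", "", step[6:]).strip()
            return answer, confidence
    return None, None  # ran out of turns without a final answer

print(agent_episode("Who was the first woman to win a Nobel Prize, and in what year?"))
```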
Step-by-step (like a recipe):
- Input and Planning
- What happens: The agent reads the question, thinks step-by-step, and decides whether to use a tool (search or code).
- Why this step exists: Without planning, the agent might rush to an answer and misstate its confidence.
- Example: Q: "Who was the first woman to win a Nobel Prize and in what year?" The agent plans to search, read a passage, and cross-check dates.
- Tool Invocation (Evidence or Verification)
- What happens: Evidence tool: send a search query and read several passages; Verification tool: write a small program or calculation and run it.
- Why this step exists: Tools extend the agent's abilities beyond memory, but they change how reliable the signals are.
- Example (Evidence): Search returns two snippets; one clearly states "Marie Curie, 1903," another is noisy. The agent must not become 100% confident just because it found text.
- Example (Verification): For "Compute the radius of a circle from its given area," the agent writes Python code to solve for r and gets a concrete number or an error.
- Produce Final Answer and Verbalized Confidence
- What happens: The agent gives an answer and a numerical confidence inside <confidence> tags.
- Why this step exists: If the agent doesn't state confidence, we can't measure calibration or reward honesty.
- Example: "Answer: Marie Curie (1903). <confidence>78</confidence>".
- Check Output Format
- What happens: A simple format checker ensures the reasoning → action → observation chain is followed and the <confidence> tag is present.
- Why this step exists: Without enforcing structure, the agent might skip confidence or muddle steps, making rewards noisy.
- Example: If the tag is missing, apply a penalty.
- Reward the Outcome: Accuracy + Calibration (Secret Sauce)
- What happens: The agent gets a reward that combines: (a) whether the answer is correct, (b) how well the stated confidence fits the truth, and (c) whether the output followed the format.
- Why this step exists: If we only reward correctness, the agent may learn to bluff with high confidence; if we only reward honesty, it might stay timid. We need both.
- Example: Correct + reasonable confidence earns more; wrong + high confidence gets penalized extra.
- Reinforcement Learning Update (GRPO)
- What happens: The RL algorithm (Group Relative Policy Optimization) adjusts the agent's parameters to increase expected reward.
- Why this step exists: Without RL, the agent won't learn from trial and error across many tasks and tools.
- Example: After many questions, the agent raises confidence only when evidence is strong or verification succeeded.
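GRPO's defining move is to score each sampled rollout of a question relative to the other rollouts in its group, instead of against a learned value function. The snippet below sketches only that group-relative advantage computation (clipping, KL penalties, and the actual gradient step are omitted); the example rewards are made-up numbers.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of the same question, each scored on accuracy + calibration + format.
rollout_rewards = [1.6, 0.9, -0.4, 1.2]
print(group_relative_advantages(rollout_rewards))
# Rollouts above the group mean get positive advantages and are reinforced;
# those below the mean are discouraged.
```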
The Secret Sauce: MSCR (Margin-Separated Calibration Reward)
- Hook: Imagine a spelling bee where a correct but shy answer should still beat a confident wrong shout.
- The Concept: MSCR is a reward that guarantees any correct answer scores above any honest mistake and punishes false confidence on wrong answers.
- What it is: A reward design that separates the score ranges for correct vs. incorrect outcomes by a safe margin.
- How it works: 1) Give a base reward for being correct; 2) add a bonus when confidence is appropriate; 3) for wrong answers, subtract more when confidence is high; 4) ensure no overlap between the best wrong and worst right cases.
- Why it matters: Without this, the agent can "reward-hack," e.g., guessing with high confidence to chase points.
- Anchor: A correct 20%-confident answer still beats a wrong 0%-confident answer, preventing perverse incentives.
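Here is a minimal sketch of what a margin-separated reward could look like in code. The constants (base reward, margin, calibration weight) and the exact shaping are assumptions for illustration; the paper's MSCR may use a different functional form, and the format penalty is included only to mirror the structure check described earlier.

```python
def mscr_like_reward(correct: bool, confidence: float, format_ok: bool = True,
                     base: float = 1.0, margin: float = 0.1, weight: float = 0.5) -> float:
    """Margin-separated, calibration-aware reward (illustrative, not the paper's exact formula).

    Correct answers score in [base + margin, base + margin + weight]; incorrect ones in
    [-weight, 0]. The margin keeps the two ranges separated even at the extremes, and the
    calibration term rewards high confidence when right and low confidence when wrong.
    """
    if not format_ok:                                       # missing <confidence> tag or broken structure
        return -1.0
    if correct:
        return base + margin + weight * confidence          # never drops below base + margin
    return -weight * confidence                             # confident mistakes hurt most

# Sanity check of the margin property: the *worst* correct case still beats
# the *best* incorrect case (an honest, 0%-confident mistake).
assert mscr_like_reward(True, 0.0) > mscr_like_reward(False, 0.0)
print(mscr_like_reward(True, 0.2), mscr_like_reward(False, 0.9))
```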
Baselines and Comparisons (brief sandwiches for new ideas):
- Hook: Sometimes people just turn the temperature down on a too-bold speaker. Temperature Scaling: What: post-hoc rescaling of confidence; How: adjust logits with a temperature; Why: can smooth confidence but doesn't teach real judgment (a minimal sketch follows this list). Anchor: The agent sounds calmer but may still be confidently wrong on the same items.
- Hook: What if we pay only for useful searches? MASH: What: penalizes excessive search to encourage abstention; How: reward less tool spam; Why: can reduce risky behavior but doesn't align confidence with truth directly. Anchor: The agent searches fewer times but might still overtrust a single noisy page.
- Hook: Scorecards that mix correctness and honesty need clear numbers. Brier Score: What: measures the squared gap between confidence and outcome; How: bigger gap, bigger penalty; Why: captures calibration and sharpness. Anchor: Saying 90% but being wrong hurts more than saying 55% and being wrong.
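For reference, here is a minimal sketch of post-hoc temperature scaling applied to verbalized confidences, with a single temperature fitted on a small held-out set by grid search. Operating on stated probabilities instead of raw classifier logits is a simplification for this setting, and the toy data is made up.

```python
import math

def temperature_scale(conf: float, T: float) -> float:
    """Rescale a probability on the logit scale; T > 1 softens overconfident values."""
    conf = min(max(conf, 1e-6), 1 - 1e-6)
    logit = math.log(conf / (1 - conf))
    return 1 / (1 + math.exp(-logit / T))

def fit_temperature(confs, corrects):
    """Pick the T that minimizes negative log-likelihood on a held-out set (grid search)."""
    def nll(T):
        total = 0.0
        for c, y in zip(confs, corrects):
            p = temperature_scale(c, T)
            total -= math.log(p if y else 1 - p)
        return total
    grid = [0.5 + 0.1 * i for i in range(46)]   # candidate temperatures in [0.5, 5.0]
    return min(grid, key=nll)

# Toy held-out data: stated confidences and whether each answer was actually correct.
confs = [0.9, 0.95, 0.8, 0.85, 0.9, 0.7]
corrects = [1, 0, 1, 0, 1, 1]
T = fit_temperature(confs, corrects)
print(T, [round(temperature_scale(c, T), 2) for c in confs])
# Note: this only rescales numbers; an item the agent gets wrong at 95% confidence
# can still end up the most confident item after scaling.
```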
Evaluation Metrics (used later in results):
- Hook: Report cards have different subjects. ECE (Expected Calibration Error): What: averages how far confidence is from accuracy across bins; How: group answers by confidence range and compare; Why: summarizes overall calibration. Anchor: If 80%-confidence bins are only right 60% of the time, ECE goes up.
- Hook: Can you sort answers from strongest to weakest? AUROC: What: how well confidence ranks right answers above wrong ones; How: compute the area under the ROC curve; Why: reveals if confidence can separate hits from misses. Anchor: Higher AUROC means the top-confidence answers tend to be the correct ones.
- Hook: How sure are you when you're actually wrong? MCIP: What: Mean Confidence on Incorrect Predictions; How: average confidence over the wrong answers; Why: exposes overconfidence on mistakes. Anchor: If wrong answers average 90% confidence, that's dangerous.
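A minimal sketch of computing these four metrics from per-answer (confidence, correctness) pairs, using the standard textbook definitions (equal-width ECE bins, pairwise AUROC); the paper's evaluation code may differ in details such as binning.

```python
def calibration_metrics(confs, corrects, n_bins=10):
    """ECE, Brier, AUROC, and MCIP from per-answer confidences and 0/1 correctness."""
    n = len(confs)
    # Brier: mean squared gap between stated confidence and the 0/1 outcome.
    brier = sum((c - y) ** 2 for c, y in zip(confs, corrects)) / n
    # ECE: bin by confidence, compare each bin's average confidence to its accuracy.
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        binned = [(c, y) for c, y in zip(confs, corrects)
                  if lo < c <= hi or (b == 0 and c == 0)]
        if binned:
            avg_conf = sum(c for c, _ in binned) / len(binned)
            acc = sum(y for _, y in binned) / len(binned)
            ece += (len(binned) / n) * abs(avg_conf - acc)
    # AUROC: probability that a random correct answer outranks a random incorrect one.
    pos = [c for c, y in zip(confs, corrects) if y]
    neg = [c for c, y in zip(confs, corrects) if not y]
    pairs = [(p, q) for p in pos for q in neg]
    auroc = (sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)
             if pairs else float("nan"))
    # MCIP: average confidence on the wrong answers (high values signal overconfidence).
    mcip = sum(neg) / len(neg) if neg else float("nan")
    return {"ECE": ece, "Brier": brier, "AUROC": auroc, "MCIP": mcip}

print(calibration_metrics([0.9, 0.8, 0.95, 0.3, 0.7], [1, 1, 0, 0, 1]))
```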
Putting it all together: CAR = structure checks + MSCR calibration-aware outcome reward + RL updates. Inputs are questions from datasets (NQ, HotpotQA, SimpleQA-verified; for math: AIME, MATH-500). Outputs are answers plus confidence. The training loop teaches the agent to speak carefully when evidence is noisy and more boldly when verification truly supports it.
04 Experiments & Results
The Test: The authors measured both accuracy and how well confidence matched reality. They used:
- Search tasks (NQ, HotpotQA) to stress evidence tools with a local Wikipedia retriever and with a real web API (Serper), including SimpleQA-verified for out-of-distribution checks.
- Math reasoning tasks (AIME 2024/2025, MATH-500) to test verification tools via code execution sandboxes.
- Metrics: Accuracy (get it right?), ECE (overall calibration error), Brier Score (penalizes confidence mistakes), AUROC (can confidence sort right vs. wrong?), and MCIP (how confident on wrong answers in the pilot study).
The Competition: CAR (with different reward designs) went up against strong baselines:
- Vanilla Search-R1 (rewarding task success and format),
- Temperature Scaling (post-hoc smoothing),
- MASH (discourages over-searching to promote abstention).
The Scoreboard (with context):
- On search agents across model sizes, CAR with MSCR cut calibration error dramatically (ECE reductions up to about 68%) while keeping or slightly improving accuracy versus baselines. That's like moving from a shaky B-minus grade on honesty to a solid A, without lowering your test score.
- AUROC often rose by up to ~17%, showing the agent became better at ranking correct answers as more confident than incorrect ones. That means confidence became more meaningful, not just rescaled.
- Weighted Brier training with λ=1 gave very low ECE but hurt accuracy a lot (a sign of reward hacking: the agent learned to game the score rather than be correct). MSCR avoided this trap, striking a better accuracy-calibration balance.
Surprising and Notable Findings:
- Confidence Dichotomy (Pilot Study): Evidence tools (web search) increased MCIP, meaning agents were very confident on their wrong answers, especially when trained with tool-use RL. Verification tools (code) showed the opposite trend: RL made agents less overconfident (lower MCIP). This difference was statistically significant.
- Generalization to Noisy Web: Agents trained locally with CAR (MSCR) stayed better calibrated when switched to the real, noisy Serper API. Accuracy remained competitive, while ECE and Brier improved, showing the learned behavior wasn't brittle.
- Transfer to Tool-Integrated Reasoning (Math): With code interpreters, CAR (MSCR) reduced ECE and Brier and ticked up AUROC on AIME and MATH-500. However, absolute ECE on the hardest math (AIME) remained higher than on easier sets (MATH-500), suggesting that calibration in verification settings still depends on the agent's underlying reasoning strength.
Make the numbers meaningful:
- Think of ECE like the gap between how sure you say you are and how often you're right. CAR with MSCR shrank this gap a lot (up to two-thirds smaller), so the agent's "I'm 80% sure" statements lined up much better with reality.
- AUROC rising by ~17% is like getting much better at sorting your "good bets" above your "bad bets," which is critical if a user only wants to act on the top-confidence answers.
- The reward design mattered: MSCR's strict margin prevented the "safe failure" loophole, where a wrong but low-confidence answer could tie a hesitant right one. This kept learning focused on being right and honest.
Big picture: CAR didn't just change the volume knob on confidence (as temperature scaling does). It taught the agent when to speak up and when to hedge, and that lesson carried over from lab conditions to the messy real world.
05 Discussion & Limitations
Limitations:
- Scale: Experiments used 3B-7B parameter models. Larger models may behave differently, and we don't yet know how the dichotomy or CAR's gains scale up.
- Task types: Focus was on short-answer QA and math, where correctness is crisp. Open-ended writing or long-horizon planning may need different calibration signals and timelines.
- Verification isn't perfect: Code that runs can still be logically wrong. Calibration improved, but on the hardest math (AIME) it's still not ideal; reasoning ability caps calibration.
Required Resources:
- RL training pipeline (e.g., GRPO), retrieval infrastructure (local wiki dump or web API), and, for verification, a secure code sandbox (e.g., E2B). Compute needs are moderate to high, depending on model size and data volume.
When NOT to Use:
- If you can't collect reliable correctness signals (no clear answers, very delayed rewards), CAR's outcome-based calibration may struggle.
- If your agent cannot output structured responses (no confidence tags), the format reward won't apply.
- If your use case already abstains aggressively, with human review for all low-confidence cases, post-hoc scaling might be sufficient.
Open Questions:
- How does the confidence dichotomy evolve with much larger or multimodal agents?
- Can we design evidence tools that return explicit negative signals (e.g., "insufficient evidence detected") to reduce overconfidence at the source?
- How to extend CAR to open-ended generation with fuzzy ground truth (e.g., reports or plans)?
- Can we jointly learn when to abstain, when to search, and how to state confidence, all under one reward framework?
Takeaway: Tool choice shapes confidence. CAR with MSCR provides a principled way to realign an agent's voice with its actual skill, which is especially crucial in noisy evidence-tool settings, while remaining broadly helpful in verification-tool workflows.
06 Conclusion & Future Work
Three-sentence summary: This paper discovers a confidence dichotomy in tool-use agents: evidence tools tend to inflate confidence, while verification tools ground it. To fix miscalibration, the authors introduce CAR, a reinforcement learning framework with a new reward (MSCR) that cleanly separates incentives for right vs. wrong answers and punishes false bravado. CAR significantly improves calibration across datasets, transfers from clean retrievers to real web APIs, and also helps in math with code tools, all while keeping accuracy competitive.
Main achievement: A practical, general training recipe, CAR with MSCR, that reliably reduces overconfidence in tool-use agents by aligning expressed confidence with true performance, avoiding reward-hacking pitfalls.
Future directions: Scale to larger and multimodal agents; add tool-level negative signals (evidence insufficiency); integrate abstention and planning into calibration learning; and adapt the framework to open-ended, long-horizon tasks with delayed or fuzzy correctness.
Why remember this: In agentic AI, the tool you choose changes how certain you feel. With CAR and MSCR, we finally have a way to teach agents not just to be smart, but to be honestly smart: speaking carefully when the world is noisy and confidently when the math (or code) checks out.
Practical Applications
- Deploy web-browsing assistants that flag low-confidence answers and request human review when search results are noisy.
- Build research copilots that rank sources by confidence and clearly separate speculation from verified facts.
- Create math and coding tutors that state confidence based on code execution results and suggest checks when unsure.
- Add confidence-aware summaries in customer support chatbots to guide escalation policies and reduce incorrect assurances.
- Use calibrated agents in business intelligence dashboards to prioritize high-confidence insights and label weak ones.
- Integrate CAR-trained agents into medical triage tools to signal uncertainty and trigger safe fallback procedures.
- Improve educational tools that teach students about uncertainty by modeling honest confidence alongside answers.
- Enhance legal or compliance document review with confidence-tagged findings to focus expert attention efficiently.
- Optimize internal search tools (RAG systems) to abstain or lower confidence when retrieved evidence is thin or conflicting.
- Tune multi-agent systems so planners and executors share calibrated confidence, improving coordination and safety.