
Confidence Estimation for LLMs in Multi-turn Interactions

Intermediate
Caiqi Zhang, Ruihan Yang, Xiaochen Zhu et al. · 1/5/2026
arXiv · PDF

Key Summary

  • This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.
  • It says a good confidence signal should be calibrated at every step and usually go up when real new information is added.
  • The authors introduce InfoECE (a per-information-level calibration score) and use Kendall’s tau (a rank-based trend score) to measure monotonicity.
  • They build controlled datasets using a Hinter–Guesser setup and adapt quizbowl-style data so clues get steadily more helpful.
  • Common confidence methods like asking the model to self-report or checking P(TRUE) are often poorly calibrated and can be fooled by turn count.
  • Their new probe, P(SUFFICIENT), asks whether the information so far is enough to uniquely pin down the answer, and it works better in many cases.
  • Placebo hints (extra turns with no real info) reveal which methods truly track information versus just the number of turns.
  • Across models and tasks, accuracy is similar for multi-turn versus single-turn summaries, but confidence behaves very differently.
  • Even the best method still has room to improve, so the paper provides a foundation and clear targets for future research.
  • This matters for safer AI assistants and agents that must know when to ask for more details or when it’s okay to act.

Why This Research Matters

In real life, AI assistants and agents must know when they truly have enough information to act, and when they should ask for more details. This work gives us a clearer way to measure and improve that judgment during multi-turn conversations, which is how people actually interact with AI. Better confidence signals help prevent harmful overconfidence and reduce frustration from needless hesitation. They also enable smarter tool-use—like searching or asking clarifying questions only when needed. With methods like P(SUFFICIENT), products can be safer, more cost-efficient, and more trustworthy. Over time, this turns chatty models into reliable teammates that act carefully and confidently for the right reasons.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine playing 20 Questions. At first, you barely know anything, so your guesses are wobbly. As you hear more clues, you feel more and more sure. That growing "I’m confident now!" feeling is what we want AI to have during conversations.

🄬 The concept: We use large language models (LLMs) to chat, plan, and solve problems. In real talks, information arrives over several turns, not all at once. The big challenge is: can an AI keep track of how sure it should be as the conversation unfolds—and do that reliably?

šŸž Anchor: Think of an AI detective. After clue 1, it should be cautious. After clue 5, it should feel bolder—but only if the clues truly help.

—

šŸž Hook: You know how a friend might claim they’re ā€œ100% sure,ā€ but then they’re wrong? That’s not very trustworthy.

🄬 The concept (Hallucinations): Hallucinations are when an AI says something false but sounds very confident. This is dangerous in high-stakes settings like medicine, finance, or safety.

How it works (problem story):

  1. The AI answers a user’s question.
  2. It often sounds certain even if it’s guessing.
  3. People may believe it because it sounds confident.

Why it matters: Without reliable confidence, we can’t tell when to trust the AI, when to ask for more info, or when to stop and check. This blocks safe use in real life.

šŸž Anchor: If an AI tells a doctor, ā€œI’m sure this is Disease X,ā€ but it’s just guessing, that could lead to a harmful decision.

—

šŸž Hook: Imagine a thermometer that says it’s 70°F but it’s actually 50°F outside. You’d call that thermometer badly calibrated.

🄬 The concept (Confidence estimation and calibration): Confidence estimation is the AI’s self-check of how likely its answer is right. Calibration means that when the AI says ā€œI’m 70% sure,ā€ it’s correct about 70% of the time.

How it works:

  1. The AI gives an answer and a confidence score (0 to 1).
  2. We compare that score to how often it’s actually right.
  3. The closer the match, the better the calibration.

Why it matters: Poor calibration means the AI either over-trusts itself or is too timid—both lead to bad decisions.

šŸž Anchor: If a model says it’s 90% sure and is right 9 out of 10 times, that’s well calibrated.

—

šŸž Hook: When you build a LEGO set step by step, your confidence you’re doing it right grows as pieces click into place—unless a piece doesn’t fit.

🄬 The concept (Multi-turn conversations): These are back-and-forth chats where clues arrive across turns, and ambiguity shrinks.

How it works:

  1. Turn 1: little info, many possible answers.
  2. Turn 2–n: more clues arrive.
  3. The answer space narrows; the AI should adjust confidence.

Why it matters: Most real interactions are multi-turn, so a confidence signal that adapts turn by turn is essential.

šŸž Anchor: A travel assistant asks follow-up questions before suggesting flights; as details arrive, it should become more certain.

—

šŸž Hook: If two quizzes have different numbers of questions, you shouldn’t compare raw scores without adjusting.

🄬 The concept (What was missing): Past work focused on single-turn cases. We lacked a way to judge confidence that changes across turns and across conversations of different lengths.

How it works:

  1. Define what ā€œgoodā€ confidence should look like in multi-turn.
  2. Create fair metrics that adjust for dialogue length.
  3. Build datasets where information grows in a controlled way.

Why it matters: Without this, we can’t tell which confidence method is truly reliable in real conversations.

šŸž Anchor: You can’t judge a marathoner and a sprinter by the same timer; you need the right metric for each race.

—

šŸž Hook: Imagine playing 20 Questions with a helpful hint-giver who always adds a clue that actually helps.

🄬 The concept (The paper’s setup): The authors create a controlled ā€œHinter–Guesserā€ pipeline and adapt incremental-quiz datasets so that each turn adds useful info. They then test multiple confidence methods and introduce new metrics to judge them.

How it works:

  1. Ensure each new turn really adds information.
  2. Make the model answer and report confidence each turn.
  3. Check calibration at each information level and whether confidence usually rises with real info.

Why it matters: This creates a fair playground to see which confidence methods truly track evidence, not just time or turn count.

šŸž Anchor: If a clue says ā€œIt’s a city in Southeast Asia,ā€ and later ā€œIt’s inland with a tropical climate,ā€ a good system should become more confident only because these facts truly narrow the choices.

02 Core Idea

šŸž Hook: You know how you won’t shout ā€œBingo!ā€ until you have enough matching numbers—not just a lucky guess? You wait until the evidence is sufficient.

🄬 The concept (Aha! in one sentence): Instead of asking ā€œAm I right?ā€, ask ā€œDo I have enough information for my answer to be the only reasonable one?ā€ā€”that’s the P(SUFFICIENT) idea.

How it works:

  1. At each turn, the model gives its current best answer.
  2. We probe the model: ā€œGiven the clues so far, is the information sufficient to uniquely support this answer?ā€
  3. The probability the model assigns to ā€œyesā€ (the information is sufficient) becomes the confidence score.

Why it matters: A model can be accidentally correct early on; sufficiency asks if the clues truly justify confidence, aligning confidence with evidence rather than luck.

šŸž Anchor: Guessing ā€œtelevisionā€ early in 20Q might be right by chance. P(SUFFICIENT) stays low until enough distinct clues prove it really must be ā€œtelevision.ā€

—

Multiple analogies for the same idea:

  1. Jigsaw puzzle: Don’t claim the picture is a tiger after placing only two sky-blue pieces. Wait until enough edge and stripe pieces lock in. P(SUFFICIENT) asks, ā€œDo we have enough pieces to be sure?ā€
  2. Courtroom: A verdict shouldn’t be based on a hunch. P(SUFFICIENT) is like the judge asking, ā€œIs there enough evidence beyond reasonable doubt?ā€
  3. Safe-cracking: You might hear clicks, but you shouldn’t yank the handle until all tumblers align. P(SUFFICIENT) checks alignment, not just noise.

—

šŸž Hook: When you read a mystery, you don’t treat chapter 1 like chapter 10. Clues build.

🄬 The concept (Before vs. After): Before, confidence methods often treated each turn separately or asked, ā€œIs my answer true?ā€ which can be misleading early. After, the system tracks when new info truly narrows the answer space, so confidence usually rises only with real, useful clues.

How it works:

  1. Control information growth turn by turn.
  2. Normalize turns into information levels for fair comparison.
  3. Use sufficiency probing to align confidence with identifiability.

Why it matters: The AI becomes better at knowing when to act, when to ask questions, and when to wait.

šŸž Anchor: A planning assistant won’t book a hotel until it’s sure about dates, city, and budget—its sufficiency rises only when those details are confirmed.

—

šŸž Hook: If you say you’re 80% sure, you should be right about 80% of the time.

🄬 The concept (InfoECE, the calibration metric): InfoECE checks calibration at equal information levels across conversations of different lengths so comparisons are fair.

How it works:

  1. Convert each turn position into a normalized ā€œinformation levelā€ (like 20%, 40%, …).
  2. Group turns by these levels and compare average confidence to actual accuracy.
  3. Smaller gaps mean better calibration.

Why it matters: Without normalizing, a 3-turn chat and an 8-turn chat can’t be fairly compared.

šŸž Anchor: Comparing mid-game confidence at 50% progress across all games tells you who is realistically assessing how well they’re doing.

—

šŸž Hook: If you add real clues, confidence should usually go up—like turning on more lights in a dark room.

🄬 The concept (Monotonicity with Kendall’s tau): Kendall’s tau measures whether confidence tends to trend upward as information increases.

How it works:

  1. Look at all turn pairs (earlier vs. later) in a dialogue.
  2. Count how often later confidence is higher than earlier.
  3. Turn this into a score from -1 (downward) to +1 (upward).

Why it matters: A good method should reward real clues, not just later turns.

šŸž Anchor: If clue 4 is better than clue 3, your confidence should usually be higher at clue 4.

—

šŸž Hook: Sometimes people talk more but say nothing. Does the AI notice?

🄬 The concept (Placebo hints): A placebo hint is an extra turn that adds no useful information. It tests whether confidence rises just because time passes.

How it works:

  1. Compare confidence after a real hint versus after a placebo hint.
  2. A good method rises after real hints and stays flat or drops after placebo.
  3. This isolates true information tracking from turn-count bias.

Why it matters: We want confidence that follows evidence, not the clock.

šŸž Anchor: ā€œIs this a valid hint? Yes.ā€ adds no facts. P(SUFFICIENT) often lowers or keeps confidence flat, which is what we want.

—

šŸž Hook: Asking ā€œAm I right?ā€ is different from asking ā€œIs there enough proof I’m right?ā€

🄬 The concept (Comparing probes): P(TRUE) asks if the current answer is correct; P(SUFFICIENT) asks if there’s enough information to be sure. Self-consistency (SC) checks how often many sampled answers agree. Verbalized confidence asks the model to state a number.

How it works:

  1. P(TRUE) can be fooled by lucky guesses.
  2. P(SUFFICIENT) aligns confidence with evidence that rules out alternatives.
  3. SC is often well calibrated in fully specified tasks but can be weak on turn-by-turn monotonicity.
  4. Verbalized scores are easy to collect but often miscalibrated and unstable.

Why it matters: Picking the right confidence tool changes how safe and useful the AI becomes.

šŸž Anchor: In 20Q-style games, P(SUFFICIENT) usually climbs only when each clue truly narrows down the hidden answer.

03 Methodology

At a high level: Input (dialogue with growing clues) → Normalize turns into information levels → Ask models to answer and estimate confidence each turn → Evaluate calibration (InfoECE) and monotonicity (Kendall’s tau) → Stress-test with placebo hints and compare multi-turn vs. single-turn summary.

Step-by-step recipe with Sandwich explanations for key pieces:

  1. Dialogue and turns šŸž Hook: Think of a treasure hunt where each turn gives a new hint about the treasure’s location. 🄬 The concept: A multi-turn dialogue is a sequence of question–answer steps where each new turn adds information. How it works:
  • Keep a history: (q1, a1, q2, a2, …, qi-1, ai-1).
  • At turn i, provide the next hint and ask the model to guess and give confidence.
  • Record whether the guess is correct. Why it matters: This structure lets us track how confidence should change with each new hint. šŸž Anchor: In ā€œGuess My City,ā€ early you hear ā€œAsia,ā€ later ā€œSoutheast Asia,ā€ then ā€œinland,ā€ then ā€œtropical.ā€
  2. Normalized information level šŸž Hook: Comparing halftime scores across games is fairer than comparing minute 17 in one game to minute 42 in another. 🄬 The concept: We normalize each turn to a fractional information level s = i / L, where L is that dialogue’s length. How it works:
  • Compute s for each turn.
  • Bin turns by s (e.g., 0–20%, 20–40%, …).
  • Compare confidence and accuracy within the same bins across dialogues of different lengths. Why it matters: This keeps comparisons fair when some chats are shorter or longer. šŸž Anchor: You compare everyone’s confidence at 50% progress, not by raw turn number.
  3. Calibration via InfoECE šŸž Hook: If your weather app says 70% chance of rain, it should rain 7 out of 10 times. 🄬 The concept: InfoECE measures the average gap between confidence and actual correctness within each information-level bin. How it works:
  • For each bin, compute average confidence and average accuracy.
  • Take their absolute difference and average across bins.
  • Lower numbers mean better calibration. Why it matters: It tells whether confidence matches reality at comparable stages of information. šŸž Anchor: If at 40–60% info, models say 60% sure and are right 60% of the time, calibration is good there.
  4. Monotonicity via Kendall’s tau šŸž Hook: Climbing stairs should take you higher step by step. 🄬 The concept: Kendall’s tau checks whether confidence tends to rise as turns progress. How it works:
  • Look at every pair of turns (earlier, later) in a dialogue.
  • Count how often later confidence is higher than earlier (concordant) vs. lower (discordant).
  • Score from -1 (downward) to +1 (upward). Why it matters: A good method rewards real clues with higher confidence. šŸž Anchor: If clue 5 is stronger than clue 3, confidence should tend to be higher at 5 than at 3.
  5. Datasets and the Hinter–Guesser paradigm (sketched in code after this list) šŸž Hook: A helpful game master reveals fair, useful clues each round. 🄬 The concept: Hinter–Guesser ensures each turn adds a real, non-trivial hint and continues until the answer is both correct and unique. How it works:
  • Hinter (LLM) gives helpful, uncertainty-reducing hints.
  • Guesser (LLM) makes a best guess and marks whether multiple answers still fit.
  • Keep only runs that converge to a unique, correct answer. Why it matters: Guarantees progressive information, enabling fair monotonicity testing. šŸž Anchor: For 20Q and GUESS, clues narrow the entity/city; for GRACE/TRICKME, quizbowl clues get more specific.
  6. Confidence methods under test šŸž Hook: There are different thermometers; we need to know which reads temperature best. 🄬 The concept: Compare several confidence estimators. How it works:
  • Verbalized confidence: ask the model to say a number (with or without chain-of-thought).
  • Self-consistency (SC): sample many answers; the fraction agreeing with the chosen one is the confidence.
  • Logit-based probes: force a tiny classification and use the internal probabilities. P(TRUE) asks ā€œIs my answer true?ā€ P(SUFFICIENT) asks ā€œDo we have enough info to be sure?ā€ Why it matters: Different tools have different strengths; some can be fooled by lucky guesses or turn count. šŸž Anchor: In under-specified games, P(SUFFICIENT) often tracks real evidence best.
  7. Placebo hints test šŸž Hook: Talking more isn’t the same as saying more. 🄬 The concept: Add a turn that adds no real info and see if confidence still rises. How it works:
  • Compare three states: before the turn, with a real hint, with a placebo hint.
  • Good methods go up with real hints but stay flat or go down with placebos. Why it matters: Filters out methods that just grow with the turn index. šŸž Anchor: ā€œIs this a valid hint? Yes.ā€ shouldn’t raise confidence if it adds no facts.
  8. Multi-turn versus single-turn summary šŸž Hook: Reading a story chapter by chapter feels different than reading a summary, even if facts match. 🄬 The concept: Compare performance when clues are fed turn-by-turn versus compressed into one concise prompt. How it works:
  • Build a summary of all clues up to a turn.
  • Compare accuracy and confidence between formats. Why it matters: Reveals whether confidence depends on conversational structure vs. content. šŸž Anchor: The same facts, told as a tidy summary, can change how models assess sufficiency.
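
Here is a rough sketch of the Hinter–Guesser loop described in step 5, with `give_hint` and `guess_with_status` as placeholder LLM calls; details such as the stopping rule follow the description above rather than the paper’s exact code.

```python
def hinter_guesser_episode(target, give_hint, guess_with_status, max_turns=10):
    """Sketch of the Hinter-Guesser data pipeline described above.

    give_hint(target, hints_so_far) and guess_with_status(hints_so_far) stand in
    for two LLM calls: the first returns a new uncertainty-reducing hint, the
    second returns (guess, is_unique), where is_unique flags whether the guesser
    thinks only one answer still fits the hints.
    """
    hints, turns = [], []
    for _ in range(max_turns):
        hints.append(give_hint(target, hints))
        guess, is_unique = guess_with_status(hints)
        turns.append({"hints": list(hints), "guess": guess, "unique": is_unique})
        if is_unique and guess == target:
            return turns  # keep only runs that converge to a unique, correct answer
    return None  # discard episodes that never converge
```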

Secret sauce (what’s clever):

  • Tie confidence to identifiability (P(SUFFICIENT)) instead of mere correctness.
  • Normalize by information level (InfoECE) for fair, per-stage calibration.
  • Stress-test with placebo hints to separate signal from turn-count noise.
  • Evaluate monotonicity with Kendall’s tau to see if confidence truly climbs with evidence.

04 Experiments & Results

The test: The authors measured two main things: (1) calibration at each information level using InfoECE (lower is better), and (2) monotonicity using Kendall’s tau (higher is better). They ran four open-source models (Llama 3.1 8B/70B; Qwen 2.5 7B/72B) across four datasets (20Q, GUESS for under-specified; GRACE, TRICKME for fully specified but hard).

The competition: Methods included verbalized confidence (with and without chain-of-thought), self-consistency (SC), P(TRUE), and the new P(SUFFICIENT).

Scoreboard with context:

  • Calibration (InfoECE): Verbalized confidence and P(TRUE) were generally poorly calibrated (often 40–80 points InfoECE). SC tended to be the most calibrated on fully specified incremental QA (GRACE/TRICKME). In under-specified games (20Q/GUESS), P(SUFFICIENT) achieved strikingly better calibration in some cases—for example with Llama3.1-70B, InfoECE dropped to about 13.05 on 20Q and 5.27 on GUESS, which is like moving from a shaky ā€œCā€ to a confident ā€œAā€ relative to others still stuck around the ā€œDā€ range. Overall, SC is a strong default for calibration on fully specified tasks, while P(SUFFICIENT) is especially effective when answers become identifiable step by step.
  • Monotonicity (Kendall’s tau) on the current answer: Ideally, confidence should rise with real clues. P(SUFFICIENT) most consistently showed strong upward trends, such as tau ā‰ˆ 83.76% on GUESS with Qwen2.5-72B and tau ā‰ˆ 71.38% on TRICKME with Llama3.1-70B. SC often had weak monotonicity on under-specified tasks.
  • Monotonicity against the ground-truth answer: If you score confidence relative to the true answer (not just the model’s guess), all methods’ tau scores jump, and P(SUFFICIENT) usually leads (e.g., ā‰ˆ 93.91% on GUESS with Qwen2.5-72B; ā‰ˆ 91.62% on 20Q with Llama3.1-70B). This suggests models can recognize when clues align with the truth, even if their current guess hasn’t caught up.

Surprising findings and stress tests:

  • Information vs. turn count (placebo hints): Across 40 comparisons, informative hints produced more significant confidence changes than placebos (27 vs. 18 with p < 0.05). P(SUFFICIENT) most cleanly separated real information from mere turn accumulation, often decreasing confidence after a placebo—good behavior that shows it’s tracking evidence, not time.
  • P(TRUE) sometimes confounded by turn count: On GUESS, P(TRUE) often rose even for placebo hints, indicating a length artifact—confidence creeping up just because another turn happened.
  • Verbalized confidence was unstable: The easy-to-ask, self-reported numbers were often poorly calibrated and sometimes moved the wrong way under information or placebo.
  • SC moderately robust: It usually didn’t get fooled by placebos, and it gained with real info, but it wasn’t perfect and sometimes still picked up turn-index effects.

Multi-turn versus single-turn summary:

  • Accuracy was similar across formats (mean absolute gap < 1%), so models didn’t clearly ā€œget lostā€ in this progressive setup.
  • Confidence behaved very differently. P(SUFFICIENT) often dropped sharply for single-turn summaries (e.g., on 20Q for Qwen2.5-7B: ā‰ˆ 63 → 13), implying that it benefits from the turn-by-turn structure to judge sufficiency. Verbalized scores sometimes went up in summaries for larger models without matching accuracy gains—an example of miscalibration concerns. SC was relatively stable and sometimes improved in summaries on 20Q.

Takeaway of results:

  • No method is perfect, but P(SUFFICIENT) best matches the two desiderata—calibration (especially in under-specified settings) and monotonicity (tracking true information gains). SC is strong for calibration in fully-specified incremental QA but often lacks monotonicity. Verbalized and P(TRUE) methods are convenient but unreliable in multi-turn dynamics, with P(TRUE) especially prone to turn-count bias in open-ended tasks.

05 Discussion & Limitations

Limitations:

  • The controlled set-ups simplify real chats: fewer topic shifts, misunderstandings, or mixed intents. So, real-world transfer may require more robustness.
  • The tasks focus on information-seeking (guessing entities/cities, quizbowl), not creative or collaborative dialogues. Confidence behavior might differ in brainstorming or negotiation.
  • The evaluation centers on calibration and monotonicity. Human trust and utility (e.g., when to hand off to a person) need user studies.
  • The paper studies confidence, not full uncertainty decomposition (like aleatoric vs. epistemic). Bridging confidence with broader uncertainty remains open.

Required resources:

  • Access to LLMs capable of returning logits/probabilities for probes and generating multiple samples for SC.
  • Datasets with progressive hints (or the Hinter–Guesser pipeline) to ensure information actually grows each turn.
  • Enough compute to run per-turn probing and sampling across many dialogues.

When not to use:

  • If the conversation does not add information progressively (e.g., small talk), sufficiency-style probes may be less meaningful.
  • In highly creative tasks where there may be many equally acceptable outputs, asking for strong sufficiency may be ill-posed.
  • If you cannot access logits or sampling due to system constraints, some methods (P(SUFFICIENT), SC) may be impractical.

Open questions:

  • Can we design a unified method that achieves both excellent calibration and strong monotonicity across all regimes?
  • How to robustly detect and down-weight filler turns in wild, messy conversations?
  • Can we fuse sufficiency signals with retrieval, tools, or planners so agents know exactly when to ask, search, or act?
  • What training-time interventions (e.g., fine-tuning with sufficiency-aware objectives) most improve multi-turn confidence?
  • How do user interfaces display evolving confidence so people make better decisions without over-trusting the AI?

06 Conclusion & Future Work

3-sentence summary: This paper introduces a framework to judge and improve how LLMs estimate confidence during multi-turn conversations, where clues arrive over time. It proposes two core targets—per-level calibration and monotonicity—and new tools like InfoECE and sufficiency probing (P(SUFFICIENT)) to test them fairly. Experiments show common methods often fail in dynamic dialogues, while P(SUFFICIENT) performs comparatively better, though the problem is far from solved.

Main achievement: Reframing confidence around evidence sufficiency and providing a controlled, length-normalized evaluation (InfoECE + Kendall’s tau + placebo stress tests) that reveals what truly tracks information versus turn count.

Future directions:

  • Develop training and inference methods that directly optimize sufficiency-aware confidence and monotonicity.
  • Build richer, messier multi-turn datasets (topic shifts, repairs, competing intents) to stress-test methods.
  • Integrate sufficiency with tool-use and retrieval so agents can decide when to ask, search, or act.
  • Conduct user studies to connect better calibration and monotonicity with real improvements in trust and outcomes.

Why remember this: In real conversations, confidence should grow with real evidence, not with the clock. This paper delivers the first systematic way to test that idea and a concrete probe—P(SUFFICIENT)—that moves us closer to reliable, decision-ready conversational AI.

Practical Applications

  • Customer support bots that ask clarifying questions when sufficiency is low instead of guessing, and that escalate when needed.
  • Autonomous agents that only execute high-impact actions (book, buy, send, deploy) when P(SUFFICIENT) passes a threshold.
  • Medical triage assistants that surface disclaimers and suggest follow-up questions whenever sufficiency is low.
  • Coding copilots that request minimal repro steps or run tests before refactoring when they lack sufficient information.
  • Research assistants that trigger retrieval or cite sources when P(SUFFICIENT) is low, and summarize when it’s high.
  • Educational tutors that prompt students with targeted hints until sufficiency for the final answer is achieved.
  • Legal or compliance helpers that flag low-sufficiency answers for human review instead of making risky claims.
  • Financial advisory chatbots that hold off on recommendations until required data fields raise P(SUFFICIENT).
  • Voice assistants that reduce follow-up questions when sufficiency is already high, speeding up routine tasks.
  • Agent frameworks that use monotonicity checks to detect confusing conversations and reset or re-ask strategically.
#multi-turn confidence estimation#LLM calibration#InfoECE#Kendall’s tau#sufficiency probing#P(SUFFICIENT)#P(TRUE)#self-consistency#verbalized confidence#Hinter–Guesser#incremental QA#hallucination mitigation#agentic workflows#dialogue monotonicity#placebo hints
Version: 1