NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

Intermediate
Jiayu Liu, Rui Wang, Qing Zong et al. Ā· 1/16/2026
arXiv Ā· PDF

Key Summary

  • The paper studies why large language models (LLMs) sound too sure of themselves when using retrieval-augmented generation (RAG) and how to fix it.
  • It finds that noisy passages (irrelevant or even contradictory) make LLMs overconfident, breaking the link between confidence and correctness.
  • The authors propose NAACL Rules—simple, common-sense guidelines that tell a model how to act when the retrieved evidence is messy or conflicting.
  • They build NAACL, a training framework that teaches models these rules using about 2,000 carefully filtered HotpotQA examples.
  • Instead of relying on bigger 'teacher' models or expensive sampling, NAACL uses supervised fine-tuning with rule-guided, self-generated data.
  • Across four datasets, NAACL reduces Expected Calibration Error (ECE) by about 10.9% in-domain and 8.0% out-of-domain, a strong improvement.
  • NAACL also helps models judge which passages are useful and explain why their confidence is high or low, improving interpretability.
  • Even just prompting a model with the rules (no training) often beats strong baselines, showing the rules themselves are powerful.
  • The method generalizes to more passages at test time, suggesting it learns the idea of noise-awareness rather than overfitting a format.

Why This Research Matters

Many real systems—search assistants, medical triage bots, legal research tools—use RAG, where bad or conflicting passages are common. If a model sounds sure while being wrong, people can be misled into trusting a faulty answer. NAACL directly addresses this by teaching the model to judge the evidence first, then speak its confidence. This leads to safer decisions: users know when to trust, when to double-check, and when to escalate. Because it avoids big teacher models and heavy sampling, NAACL is practical for deployment. Clear passage judgments also make the model’s confidence understandable to humans.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you're taking an open-book test. You can bring three pages of notes. If one page lies, one is off-topic, and one is correct, you might still feel super confident—even if you picked the wrong page to trust.

🄬 The Concept (Retrieval-Augmented Generation, RAG):

  • What it is: RAG is when an AI looks up extra passages before answering, like using notes during a test.
  • How it works: (1) The AI gets a question. (2) It retrieves a few short passages from a big library. (3) It reads them with the question. (4) It writes an answer.
  • Why it matters: Without retrieval, the AI can forget or invent facts. With retrieval, it can ground answers in real text. šŸž Anchor: If you ask, ā€œWhat channel is the Premier League on in France?ā€, the AI searches and reads passages to answer.
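
To make these four steps concrete, here is a minimal Python sketch of a RAG loop. The tiny corpus, the word-overlap retriever, and the `ask_llm` stub are all illustrative placeholders, not the paper's setup.

```python
# Toy in-memory "library" standing in for a real document index.
CORPUS = [
    "SFR Sport broadcasts the Premier League in France.",
    "Canal+ broadcasts the Premier League in France.",
    "The English Channel is about 560 km long.",
]

def retrieve(question: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda p: -len(q_words & set(p.lower().split())))
    return ranked[:k]

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned answer for the sketch."""
    return "SFR Sport"

def rag_answer(question: str) -> str:
    passages = retrieve(question)                           # (2) look up passages
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"  # (3) read them together
    return ask_llm(prompt)                                  # (4) write an answer

print(rag_answer("What channel is the Premier League on in France?"))
```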

šŸž Hook: You know how a friend might say, ā€œI’m 100% sure,ā€ and then be wrong? Confidence doesn’t always match correctness.

🄬 The Concept (Verbal Confidence Calibration):

  • What it is: It’s how well an AI’s stated confidence (like 30% vs 90%) matches how often it’s actually right.
  • How it works: (1) AI answers and says a number (0–100%). (2) Over many questions, we check if its 80% answers are really right ~80% of the time. (3) We adjust training to align words with reality.
  • Why it matters: Without calibration, the AI can sound sure when it’s wrong, which misleads people. šŸž Anchor: If the model says ā€œCanal+ (100%)ā€ but the truth in context is ā€œSFR Sport,ā€ that’s dangerous overconfidence.

šŸž Hook: Think of a report card that compares how confident you felt on tests vs. how many you got right.

🄬 The Concept (Expected Calibration Error, ECE):

  • What it is: ECE measures the average gap between what the model’s confidence says and how often it’s correct.
  • How it works: (1) Bucket answers by stated confidence (like 0–10%, 10–20%, …). (2) For each bucket, compare average confidence vs. actual accuracy. (3) Average the gaps.
  • Why it matters: A big ECE means the model’s ā€œI’m sureā€ doesn’t match reality. šŸž Anchor: If 90% confidence answers are only right 60% of the time, ECE will be large.
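
Here is a minimal sketch of that bucketing in Python. The ten equal-width bins and the toy numbers are illustrative; the paper's exact binning protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between stated confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_right in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # e.g. 0.87 falls in the 80-90% bin
        bins[idx].append((conf, is_right))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(r for _, r in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident toy case: the model says 90% but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```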

šŸž Hook: Imagine listening to three witnesses. One tells the truth, one tells a convincing lie, and one talks about something unrelated.

🄬 The Concept (Retrieval Noise: counterfactual, relevant, irrelevant):

  • What it is: Noisy passages are bad or unhelpful evidence mixed into what the model reads.
  • How it works: (1) Counterfactual: sounds right but supports a wrong answer. (2) Relevant: on-topic but missing the key fact. (3) Irrelevant: off-topic filler.
  • Why it matters: Noise can trick the model into feeling sure even when the answer is wrong. šŸž Anchor: For the France TV question, mixing ā€œSFR Sportā€ (true) with ā€œCanal+ā€ (false) and ā€œThe English Channel is 560 km longā€ (irrelevant) can inflate wrong confidence.

The world before: LLMs were great at reasoning and writing, but on fact-heavy questions they sometimes hallucinated. RAG improved grounding by pulling in real text. However, the models' confidence still misbehaved under messy real-world retrieval: across four datasets, average ECE rose above 0.4 (bad), meaning their spoken certainty was unreliable.

The problem: Noisy retrieval (especially counterfactual or irrelevant evidence) made models overconfident. Contradictions didn’t lower confidence enough, and extra off-topic passages inflated certainty.

Failed attempts: Prior work (a) optimized confidence in closed-book settings, ignoring retrieval uncertainty; (b) used white-box signals (like logits) not available for many models; or (c) used heavy sampling that’s too slow.

The gap: We needed a simple, training-time way to make a model's verbal confidence aware of retrieval noise, and to do it without bigger teacher models or expensive test-time tricks.

Real stakes: In medicine, law, finance, or customer support, a wrong but very confident answer can mislead people into bad decisions. Better calibration helps humans decide when to trust, double-check, or escalate.

02 Core Idea

šŸž Hook: Think of a referee who first checks if the camera angles disagree, then decides how sure to be about the call.

🄬 The Concept (NAACL Rules):

  • What it is: NAACL Rules are three common-sense guidelines that tell the model how to set confidence when retrieval is noisy.
  • How it works: (1) Conflict Independence: If passages contradict each other, don’t trust them; fall back to your own knowledge and lower confidence. (2) Noise Invariance: If a passage is irrelevant, ignore it; its presence shouldn’t boost your confidence. (3) Parametric Fallback: If nothing helpful is retrieved, answer from your internal knowledge, but be honest about uncertainty.
  • Why it matters: Without rules, models keep high confidence even when the evidence is messy or useless. šŸž Anchor: With ā€œSFR Sportā€ (true), ā€œCanal+ā€ (false), and an off-topic passage, the rules tell the model to spot the conflict/noise and drop confidence.
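
Because even prompting with the rules helps (see the results section), it is worth seeing what a rule-bearing instruction could look like. The wording and output format below are my own paraphrase of the three rules, not the authors' actual prompt.

```python
NAACL_RULES_PROMPT = """\
You will see a question and several retrieved passages.
Before answering, judge the evidence, then follow these rules:

1. Conflict Independence: if passages contradict each other, do not trust them;
   fall back on your own knowledge and lower your confidence.
2. Noise Invariance: ignore irrelevant passages; their presence must not raise
   your confidence.
3. Parametric Fallback: if no passage answers the question, answer from your
   internal knowledge and state your uncertainty honestly.

Output format:
Passage judgments: <helpful / contradictory / irrelevant, one per passage>
Group judgment: <consistent / conflicting>
Answer: <short answer>
Confidence: <0-100%>
"""
```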

Aha! moment in one sentence: Confidence should depend not just on the answer, but on whether the retrieved evidence is consistent, relevant, and trustworthy.

Three analogies:

  • Courtroom: Witnesses (passages) might disagree; the judge (model) checks contradictions before deciding how sure to be.
  • Treasure maps: If maps point to different spots, don’t dig with 100% certainty; step back and reassess.
  • Group project: If teammates give on-topic fluff or off-topic chatter, don’t let the extra words make you feel more certain.

Before vs. after:

  • Before: Models saw more text and often felt more certain—even when it was wrong or irrelevant.
  • After: Models first judge the text: Is it helpful? Is it consistent? Only then do they answer and set confidence.

Why it works (intuition): Confidence should track evidence quality. If the evidence is contradictory or off-topic, the true probability of being right drops. The rules teach the model to treat noisy evidence as weak evidence.

Building blocks:

  • šŸž Hook: You know how you rate each source: ā€œthis is helpful,ā€ ā€œthis is meh,ā€ ā€œthis contradicts.ā€

  • 🄬 The Concept (Passage and Group Judgments):

    • What it is: The model labels each passage and the whole set as helpful/contradictory/irrelevant before answering.
    • How it works: (1) For each passage, decide if it can answer the question. (2) For the group, check if answers agree. (3) Use these judgments to guide confidence.
    • Why it matters: Without judging, the model can’t know when to trust retrieval. šŸž Anchor: If two passages back different TV channels, the group is ā€œconflicting,ā€ so confidence should go down.
  • šŸž Hook: Imagine practicing many times and keeping only the tries where your confidence matched reality best.

  • 🄬 The Concept (Rule-guided Data Synthesis and Filtering):

    • What it is: Create training examples with known noise, then keep the responses that follow the rules and have well-aligned confidence.
    • How it works: (1) Build counterfactual/relevant/irrelevant mixes. (2) Sample many model answers. (3) Keep only those that judge passages correctly, follow rules, and have low Brier Score (confidence matches right/wrong).
    • Why it matters: It teaches the model not just what to answer, but how sure to be, given messy evidence. šŸž Anchor: From 16 tries per question, keep the one that said ā€œSFR Sport, 30%ā€ under conflict instead of ā€œCanal+, 100%.ā€
  • šŸž Hook: Like a coach fine-tuning your play based on the best practice tapes.

  • 🄬 The Concept (Supervised Fine-Tuning, SFT):

    • What it is: Train the model with the high-quality, rule-following examples so it learns noise-aware confidence.
    • How it works: Show Input → judgments → answer → confidence; optimize so the model copies this behavior.
    • Why it matters: Without SFT, the model keeps its old overconfident habits. šŸž Anchor: After SFT, the model says ā€œSFR Sport, 30%ā€ in conflict, not ā€œCanal+, 100%.ā€
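
Putting these building blocks together, a single judge-first training record might look roughly like the sketch below. The field names are illustrative, not the paper's exact schema.

```python
# One illustrative judge-first training record (field names are my own).
trace = {
    "question": "What channel is the Premier League on in France?",
    "passages": [
        "SFR Sport broadcasts the Premier League in France.",
        "Canal+ broadcasts the Premier League in France.",
        "The English Channel is about 560 km long.",
    ],
    "passage_judgments": ["helpful", "contradictory", "irrelevant"],
    "group_judgment": "conflicting",   # the first two passages disagree
    "answer": "SFR Sport",
    "confidence": 0.30,                # conflict detected, so confidence is low
}
```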

03 Methodology

High-level recipe: Question + Retrieved Passages → (A) Judge passages and group → (B) Apply NAACL Rules → (C) Answer + Confidence.

Step 0: Build realistic noisy training inputs

  • What happens: For each HotpotQA question, construct three retrieval mixes: (1) Counterfactual: gold + conflicting false passages; (2) Consistent: gold + relevant/irrelevant noise; (3) Irrelevant: only relevant/irrelevant, no gold.
  • Why this step exists: The model must practice seeing different kinds of noise it will meet in the real world.
  • Example: The Simpsons planet question gets (a) one true passage about Rigel VII + two false ones naming other planets, (b) the true passage + on-topic-but-empty passages, (c) only on-topic/irrelevant passages without the key fact.
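
A minimal sketch of Step 0, assuming we already have gold, counterfactual, relevant-but-incomplete, and irrelevant passages for each question; the three-passage mixes below are illustrative rather than the paper's exact recipe.

```python
import random

def build_noisy_mixes(gold, counterfactual, relevant, irrelevant, seed=0):
    """Build the three training mixes for one question (sizes are illustrative)."""
    rng = random.Random(seed)
    mixes = {
        # (1) Counterfactual mix: the true passage plus conflicting false ones.
        "counterfactual": [gold] + rng.sample(counterfactual, 2),
        # (2) Consistent mix: the true passage plus on-topic / off-topic noise.
        "consistent": [gold] + rng.sample(relevant + irrelevant, 2),
        # (3) Irrelevant mix: no gold passage at all, only noise.
        "irrelevant": rng.sample(relevant + irrelevant, 3),
    }
    for passages in mixes.values():
        rng.shuffle(passages)   # don't let position reveal which passage is gold
    return mixes
```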

Step 1: Generate many candidate traces per input (Best-of-N sampling)

  • What happens: Prompt the model to (i) label each passage (passage-level judgment), (ii) label the whole set (group-level judgment: consistent/conflicting), and (iii) produce answer + verbal confidence. Do this 16 times per question to get variety.
  • Why this step exists: Some generations will be better at following rules and aligning confidence; we need options to choose the best.
  • Example: Across 16 answers, one says ā€œRigel VII, 90%ā€ ignoring conflict; another says ā€œXenon Prime, 10%ā€ acknowledging conflict; another says ā€œRigel VII, 40%ā€ with proper conflict detection.
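
A sketch of the Best-of-N loop, where `generate_trace` stands in for a sampled model call that returns a parsed trace (judgments, answer, confidence). The toy stand-in below only shows the shape of the loop, not a real model.

```python
import random

def best_of_n_traces(generate_trace, question, passages, n=16):
    """Sample n candidate traces (judgments + answer + confidence) for one input."""
    return [generate_trace(question, passages) for _ in range(n)]

# Toy stand-in for a sampled model call, just to exercise the loop.
_rng = random.Random(0)
def fake_generate(question, passages):
    return {"answer": "SFR Sport",
            "group_judgment": "conflicting",
            "confidence": _rng.choice([1.0, 0.4, 0.3])}  # varies across samples

candidates = best_of_n_traces(fake_generate, "Which channel?", ["p1", "p2", "p3"])
print(len(candidates), {c["confidence"] for c in candidates})
```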

šŸž Hook: You know how coaches keep the replays where you used perfect form. 🄬 The Concept (Multi-stage Data Filtering):

  • What it is: A strict filter to keep only high-quality, rule-following training examples.
  • How it works:
    1. Format Consistency: Can we parse judgments, answer, and confidence? If not, drop.
    2. Passage Judgment Accuracy: Do the labels match the known setup (gold/relevant/irrelevant/counterfactual)? If not, drop.
    3. Rule Adherence: Does the reasoning explicitly apply the NAACL Rules (e.g., detect contradictions before answering)? If not, drop.
    4. Confidence Alignment: Among remaining candidates for this question, keep the one with the lowest Brier Score (confidence best matches right/wrong).
    5. Class Balancing: Keep similar amounts of counterfactual, consistent, and irrelevant cases.
  • Why it matters: Without clean, rule-following examples, the model could learn shortcuts or keep overconfidence. šŸž Anchor: For the TV channel example, keep the run that said ā€œconflict detected → fallback → 30%,ā€ drop the one that said ā€œ100% Canal+.ā€
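
A hedged sketch of the filtering cascade, assuming each candidate is a dict with the trace fields sketched earlier and that `is_correct_fn` compares a candidate's answer to the gold answer. The rule-adherence test here is a crude stand-in for the paper's actual check on the reasoning text.

```python
def brier_score(confidence: float, is_correct: bool) -> float:
    """Squared gap between the stated confidence (0-1) and the 0/1 outcome."""
    return (confidence - float(is_correct)) ** 2

def filter_candidates(candidates, true_passage_labels, is_correct_fn):
    """Keep at most one rule-following, well-calibrated candidate per question."""
    survivors = []
    for trace in candidates:
        conf = trace.get("confidence")
        # 1. Format consistency: answer and a 0-1 confidence must be parseable.
        if trace.get("answer") is None or conf is None or not 0.0 <= conf <= 1.0:
            continue
        # 2. Passage judgment accuracy: labels must match the known noise setup.
        if trace.get("passage_judgments") != true_passage_labels:
            continue
        # 3. Rule adherence (crude stand-in): a detected conflict should not
        #    come with near-certain confidence.
        if trace.get("group_judgment") == "conflicting" and conf > 0.7:
            continue
        survivors.append(trace)
    if not survivors:
        return None
    # 4. Confidence alignment: keep the candidate with the lowest Brier score.
    #    (5. Class balancing across the three noise mixes happens across questions.)
    return min(survivors, key=lambda t: brier_score(t["confidence"], is_correct_fn(t)))
```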

Step 2: Supervised Fine-Tuning (SFT) on the filtered set

  • What happens: Train the model (with efficient adapters) on about 2,000 high-quality examples where the input, judgments, answer, and confidence form a full, readable trace.
  • Why this step exists: To build the habit: judge first, answer next, then state calibrated confidence.
  • Example: The model repeatedly practices: ā€œLabel passages → see conflict/noise → lower confidence when appropriate.ā€
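
Before fine-tuning, each surviving trace has to be serialized as plain training text. A minimal sketch of that formatting is below; the exact template, and whichever adapter-based trainer consumes the resulting pairs, are implementation choices the summary above does not pin down.

```python
def trace_to_sft_example(trace) -> dict:
    """Turn one filtered trace (dict fields as sketched earlier) into a
    prompt/completion pair of plain text for supervised fine-tuning."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(trace["passages"]))
    prompt = f"{context}\n\nQuestion: {trace['question']}\n"
    completion = (
        "Passage judgments: " + ", ".join(trace["passage_judgments"]) + "\n"
        f"Group judgment: {trace['group_judgment']}\n"
        f"Answer: {trace['answer']}\n"
        f"Confidence: {round(trace['confidence'] * 100)}%"
    )
    return {"prompt": prompt, "completion": completion}
```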

Step 3: Inference-time behavior after NAACL

  • What happens: Given a new question with retrieved passages, the model:
    1. Judges passages (helpful? unhelpful?)
    2. Judges the group (consistent? conflicting?)
    3. Applies NAACL Rules (ignore irrelevant, fallback on conflict/no gold)
    4. Answers with a confidence score that matches the situation.
  • Why this step exists: This is how calibration gets better in the wild.
  • Example: With 5 retrieved passages where only one is truly useful and another contradicts it, the model still lowers its confidence and explains why.

The secret sauce:

  • Rule-first thinking: Teach the model to inspect evidence quality before committing to confidence.
  • Process supervision: Passage- and group-level judgments make the confidence trace interpretable.
  • Data without a big teacher: Synthesize and filter strong examples from the model itself, making the method lightweight and practical.

04 Experiments & Results

šŸž Hook: Imagine grading how often a student’s ā€œI’m 90% sureā€ matches actually getting it right 90% of the time.

🄬 The Concept (AUROC):

  • What it is: AUROC tells how well confidence separates right answers from wrong ones.
  • How it works: (1) Look at all pairs (one correct, one incorrect). (2) Count how often the correct one has higher confidence. (3) Higher AUROC = better discrimination.
  • Why it matters: Even if average confidence is okay, we need it to rank right answers higher than wrong ones. šŸž Anchor: If correct answers usually have higher confidence than mistakes, AUROC will be high.
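
A minimal pairwise sketch of AUROC as described above: count how often a correct answer outranks an incorrect one on confidence, with ties worth half. The quadratic loop is for clarity, not efficiency.

```python
def auroc(confidences, correct):
    """Probability that a random correct answer gets higher confidence than a wrong one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")   # undefined without both right and wrong answers
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 0.75: correct answers usually, but not always, outrank the mistakes.
print(auroc([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0]))
```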

The test: The authors used four QA datasets (Natural Questions, HotpotQA, Bamboogle, StrategyQA) and two retrievers, and measured ECE (calibration) and AUROC (discrimination). They also stressed the model with controlled noise: gold-only, gold+irrelevant, gold+relevant, and gold+counterfactual.

The competition: Baselines included Vanilla prompting, Chain-of-Thought prompting, noise-aware prompting (the rules given as instructions only, without training), Ensemble (averaging multiple runs), and Label-only SFT (training on just answers and confidence, with no reasoning traces).

Scoreboard with context:

  • Before NAACL, average ECE across models/datasets was above 0.4—like a report card saying your self-rated confidence is way off.
  • With NAACL, ECE dropped by about 10.9% in-domain and 8.0% out-of-domain, often reaching below ~0.30 (a big improvement). AUROC also rose, meaning confidence better separated right vs. wrong.
  • Noise-aware prompting alone (no training) was often the second-best, beating heavy baselines like Ensemble and Label-only SFT. This shows the rules themselves are strong guidance.
  • Out-of-distribution test (more passages at inference) still showed gains, meaning NAACL learned the principle of noise-awareness, not a fixed format.

Surprising findings:

  • Counterfactual passages were especially harmful: they didn’t always lower confidence; instead, models often picked a side and stayed very sure, decoupling confidence from truth.
  • Even irrelevant passages raised average confidence—just having more text made models feel more certain, which is a red flag.
  • Fancy prompting from closed-book settings didn’t fix RAG calibration. The issue is about evidence quality, not just reasoning steps.

Concrete sense-making:

  • Think of ECE drops like moving from being off by a mile to off by a block when guessing how sure you are.
  • Think of AUROC gains as becoming better at ranking your best answers above your mistakes.
  • The reliability diagrams flattened toward the ideal diagonal: when the model says 70%, it’s right about 70%—that’s trustworthy.

05 Discussion & Limitations

Limitations:

  • Scale and access: Tested on 7B–8B open models. Larger or proprietary models weren’t fine-tuned due to compute or access limits.
  • Synthetic vs. organic noise: Training constructed clean categories of noise; real-world noise can be messier, especially in domains like medicine or law.
  • Task scope: Focused on short-form QA with fixed retrieval depth; long-form generation and ultra-long contexts pose new calibration challenges.

Required resources:

  • A modest fine-tuning setup (e.g., a few GPUs), about 2K rule-aligned examples, and a RAG pipeline to retrieve passages.
  • No need for big teacher models or costly test-time sampling.

When NOT to use it:

  • If your system never shows confidence to users or never relies on retrieval, simpler calibration might suffice.
  • If answers demand multi-page synthesis and continuous updates in real time, the step-by-step judgments may need extra engineering.
  • If your streaming latency budget leaves no room for extra reasoning tokens, you may prefer lighter heuristic filters.

Open questions:

  • Can we extend NAACL to long-form summarization where confidence is paragraph-level instead of a single number?
  • How does it behave in specialized domains with subtle contradictions (e.g., biomedical abstracts)?
  • Can the rules be learned as soft constraints that adapt to different retrieval qualities automatically?
  • Could retrieval models also be co-trained to reduce counterfactual noise at the source while the generator calibrates confidence downstream?

06 Conclusion & Future Work

Three-sentence summary:

  • RAG models often sound too sure when retrieval is noisy, breaking trust between confidence and correctness.
  • NAACL teaches models three simple rules—handle conflicts, ignore irrelevant noise, and fall back to internal knowledge—to align spoken confidence with evidence quality.
  • With ~2K rule-aligned examples and SFT, NAACL cuts ECE notably and improves interpretability without big teachers or heavy sampling.

Main achievement:

  • Turning evidence quality into confidence behavior: the model first judges passages and consistency, then answers with calibrated confidence.

Future directions:

  • Move from short answers to long-form, segment-level confidence; co-train retrievers to reduce counterfactuals; adapt rules dynamically across domains.

Why remember this:

  • It shows a practical, lightweight way to make AI not just accurate but honest about what it doesn’t know when the evidence is messy—exactly what real-world systems need to be safely helpful.

Practical Applications

  • Customer support copilots that lower confidence when retrieved FAQs conflict, prompting escalation to a human.
  • Healthcare chatbots that flag low confidence when medical articles disagree, nudging a re-check or clinician review.
  • Enterprise search assistants that ignore irrelevant documents instead of letting them inflate confidence.
  • Legal research tools that detect conflicting case law summaries and present cautious, well-calibrated answers.
  • News verification bots that show passage-level judgments before rating claim confidence.
  • Educational tutors that tell students when their retrieved sources disagree and suggest follow-up reading.
  • Developer assistants that downshift confidence when conflicting docs or outdated APIs are retrieved.
  • Financial analysis agents that explicitly lower certainty if market reports diverge.
  • Scientific assistants that reveal confidence tied to evidence quality, aiding literature reviews.
  • Safety-critical dashboards (e.g., operations centers) that surface low-confidence alerts when data sources conflict.
#Retrieval-Augmented Generation #Confidence Calibration #Expected Calibration Error #AUROC #Overconfidence #Counterfactual Noise #Supervised Fine-Tuning #Best-of-N Sampling #Passage Judgment #Interpretability #Epistemic Uncertainty #RAG Robustness #Noise Awareness #Verbal Confidence #Rule-Guided Training