
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Beginner
Haoming Xu, Ningyuan Zhao, Yunzhi Yao et al. Ā· 1/9/2026
arXiv Ā· PDF

Key Summary

  • LLMs can look confident but still change their answers when the surrounding text nudges them, showing that confidence alone isn’t real truthfulness.
  • The paper introduces Neighbor-Consistency Belief (NCB), a score that checks whether a model gives consistent answers not just to one question but to related neighbor questions too.
  • In tests, answers with perfect self-consistency still collapsed from 100% accuracy to 33.8% when gentle misleading context was added.
  • High-NCB facts stayed much steadier than low-NCB facts under peer pressure and under authoritative but misleading sources.
  • A new stress-testing protocol, inspired by classic psychology, simulates social pressure (many wrong peers) and authority bias (misleading but ā€˜official’ sources).
  • Reasoning out loud (Chain-of-Thought) didn’t always help; multi-turn Reflection was more consistently effective at resisting interference.
  • Bigger models weren’t automatically more truthful; the robustness gap between high- and low-NCB facts persisted even as model size grew.
  • A training method called Structure-Aware Training (SAT) made newly learned knowledge about 30% less brittle by teaching the model to agree with itself across different contexts.
  • NCB acts like checking the roots of a tree, not just the leaves: it measures deep, structured belief instead of surface-level confidence.
  • This framework helps make LLMs safer for real-world uses like healthcare, law, and education, where context can easily mislead a model.

Why This Research Matters

In real life, LLMs read long prompts with mixed signals, so measuring truth by a single answer is not enough. NCB checks whether answers fit a whole web of related facts, which better predicts whether the model stays true under pressure. This means safer AI for classrooms, clinics, and courts, where misleading context could otherwise flip a correct answer to a harmful one. The stress tests mirror real-world challenges like peer pressure and authority bias, making the evaluation realistic. SAT shows we can actually train models to keep calm and correct in noisy situations. Together, these tools help us build systems we can trust when it really counts.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine studying for a quiz by memorizing just the final answer to each question, without learning any of the related facts. If a friend confidently says a different answer during the test, you might second-guess yourself and switch to the wrong one.

🄬 The Situation Before: Large Language Models (LLMs) have become very good at answering questions, but people have often judged them by a single thing: do they give the same answer over and over (self-consistency)? If a model repeatedly says ā€œParisā€ when asked about France’s capital, we might think it truly knows the fact.

The Problem: Real life is messy. LLMs work inside long prompts, retrieved documents, and multi-agent chats. That means there’s lots of extra context—some helpful, some misleading. The paper shows that even when a model is perfectly self-consistent in a quiet setting, it can suddenly flip to the wrong answer when surrounded by gentle but misleading hints, peer opinions, or authoritative-sounding text.

Failed Attempts: People tried measuring token probabilities, asking models for their confidence, or sampling many times to see if the answers agree (self-consistency). These can be helpful but miss something big: they check only the one question in front of them, not whether the model’s answers match a web of related facts. So a model might memorize the right answer, yet still fail when the surroundings change.

The Gap: What was missing was a way to tell if a model’s belief is truly anchored in a structure—like a network of related truths that support each other—or if it’s just a loose, brittle memory. Without such a measure, we misread ā€œconfidenceā€ as ā€œunderstanding.ā€

šŸž Anchor Example: Think of a tree in the wind. If it’s only taped to the ground (memorized fact), a breeze can knock it over. If it has roots (supporting facts that fit together), it stays standing even as the wind (misleading context) blows. This paper looks at the roots, not just the trunk.

— New Concepts Explained —

šŸž You know how weather forecasts update when new clouds appear? 🄬 Bayesian Statistics: It’s a simple idea of updating how sure you are about something using new evidence. You start with a guess (prior), see new clues (likelihood), and update your belief (posterior). Why it matters: Without a way to update beliefs sensibly, a model might overreact to noisy context. šŸž Anchor: If you think there’s a 70% chance of rain, but you then see a dark storm front, your belief should go up—Bayesian thinking captures this.

šŸž You know how a friend who always repeats the same answer might still be wrong if they never explain why? 🄬 Self-Consistency (SC): SC checks if a model repeats the same answer across multiple tries. How it works: Ask the same question many times; score higher if the answers agree. Why it matters: Without SC, we can’t tell if the model is stable—but SC alone can hide brittleness. šŸž Anchor: A student always says ā€œ42ā€ to a math problem. They’re consistent—but are they correct under different wordings or hints?

šŸž Imagine trying to read while people around you whisper different stories. 🄬 Contextual Interference: It’s when extra context (peers, documents, or prompts) pushes the model toward a wrong answer. How it works: Add misleading but plausible text; see if the model flips. Why it matters: Real-world prompts often contain such interference; a trustworthy model should resist it. šŸž Anchor: If classmates all say ā€œBā€ on a test, you might switch—this is the model’s version of peer pressure.

02 Core Idea

šŸž Hook: Picture a bicycle wheel. If the spokes (related facts) are tight and balanced, the wheel stays true even when it hits bumps. If the spokes are loose, it wobbles and crashes.

🄬 Aha in One Sentence: The model’s truthfulness isn’t shown by one answer alone; it’s shown by how consistently that answer lines up with its whole neighborhood of related facts.

Multiple Analogies:

  1. Tree roots: Deep roots (related facts) keep the tree steady in wind (misleading context). Shallow roots mean a small gust can topple it.
  2. Spider web: If many strands connect, the web holds; if strands are broken, it tears easily.
  3. Puzzle picture: If each piece fits a bigger scene, you can’t be fooled by a piece that looks similar but belongs to a different puzzle.

Before vs After:

  • Before: We checked if the model repeated the same answer and called that reliable.
  • After: We check whether the answer is supported by consistent answers to neighbor questions (attributes, implications, and associations). This better predicts if the model will stay correct when the context tries to mislead it.

Why It Works (intuition, no equations): If a model truly ā€œknowsā€ a fact, then many small, related checks should line up too—like verifying a person’s job, time period, and achievements. The odds that all those pieces are right by accident are tiny. If the model only memorized the target answer without the surrounding structure, those neighbor checks fall apart, and the belief collapses under pressure.
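As a rough illustration of why ā€œall right by accidentā€ is unlikely, here is a back-of-the-envelope calculation with invented numbers:

```python
# Back-of-the-envelope illustration (numbers invented, not from the paper).
# Suppose a lucky guess on any single neighbor check succeeds 30% of the time,
# and a fact has 8 neighbor checks. The chance of passing all of them by luck:
p_lucky_single = 0.3
n_neighbors = 8
p_all_by_luck = p_lucky_single ** n_neighbors
print(f"{p_all_by_luck:.6f}")  # ~0.000066, i.e. about 0.0066%
```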

Building Blocks (with Sandwich explanations):

šŸž You know how scientists update their guesses when they gather more data? 🄬 Bayesian Belief Estimation: It’s a way to estimate how likely the model’s belief is truly structured by using the pattern of correct answers across the neighborhood. How it works: Compare how probable the observed consistency is under ā€œstructured understandingā€ versus ā€œisolated memorization.ā€ Why it matters: It turns scattered observations into a single, sensible estimate of robust belief. šŸž Anchor: If a student gets the main question and many spin-off questions right, you believe they really understand it.

šŸž Think of checking a friend’s claim by asking follow-up questions. 🄬 Neighbor-Consistency Belief (NCB): It’s a score that looks at the target answer plus multiple neighbor checks (attributes, implications, associations) and summarizes how consistently the model stays correct. How it works: Sample answers to the main question and its neighbors; aggregate correctness into one score. Why it matters: Without NCB, brittle memorization can masquerade as understanding. šŸž Anchor: A historian who names a king (target) and also gets the king’s era, allies, and laws right (neighbors) likely truly knows the topic.

šŸž Imagine practice drills where a coach adds crowd noise or tricky referees to see if you still play well. 🄬 Cognitive Stress Testing: It tests whether the model holds onto truth when peers disagree (peer pressure) or when sources look official but are wrong (authority bias). How it works: Add controlled interference and measure drops in accuracy and answer rates. Why it matters: Real-world deployment constantly adds noise; robustness must be proven, not assumed. šŸž Anchor: A student who keeps solving problems correctly even when the class chatter is distracting shows real understanding.

03 Methodology

At a high level: Input (a target question and its correct answer) → Build a small neighborhood of related questions → Measure how often the model answers each one correctly → Stress-test with misleading context → Use the results to score belief robustness (NCB) and to train models to be steadier (SAT).

Step-by-step (like a recipe):

  1. Build the Belief Neighborhood
  • What happens: For each target fact (e.g., ā€œWho was the IMU Vice-President in 2012?ā€), the authors create Neighbor Facts: (a) Entity Prerequisites (attributes about the entity), (b) Logical Implications (facts that must be true if the answer is right), and (c) Thematic Associations (choose the right entity among similar ones).
  • Why it exists: These neighbors act like follow-up questions to check whether the model’s answer fits a larger picture. Without neighbors, shallow memorization looks the same as deep understanding.
  • Example: If the answer is ā€œMarcelo Viana,ā€ neighbors might ask whether Marcelo Viana is Brazilian, his role in IMU leadership, or to pick him from a list of similar mathematicians.
  2. Compute Neighbor-Consistency Belief (NCB)
  • What happens: The model is asked the main question and each neighbor multiple times. We count how often it gets each right and combine these into one NCB score (a higher score means sturdier belief).
  • Why it exists: One correct answer could be luck or rote memory; many consistent wins across related checks signal a structured belief.
  • Example: If the model answers the main question right 30/30 times and most neighbors right in repeated tries, NCB will be high.
  3. Prepare Contextual Interference (Stress Tests)
  • What happens: Two families of interference are used: a) Peer Quantity (Social Pressure): Show several peer agents giving the wrong answer or chatting about misleading neighbor facts. b) Source Credibility (Authority Bias): Present misleading context labeled as low/medium/high credibility, sometimes directly wrong (conflict) or subtly suggestive (misleading but true about a distractor).
  • Why it exists: Real prompts often contain many voices or ā€˜official’ texts. We need to know if the model sticks to truth or caves in.
  • Example: Peers all say ā€œJacob Palisā€ (wrong) before the model answers; or a ā€œjournal articleā€ praises a distractor and nudges the model off-course.
  4. Evaluate with Coverage and Accuracy
  • What happens: Coverage checks how often the model gives a valid answer (not blank/refusal). Accuracy checks how often that answer matches the truth (loosely allowing small wording differences).
  • Why it exists: A model that refuses too much might dodge mistakes; we want to see both willingness to answer and correctness.
  • Example: If the model answers 9 out of 10 times (Coverage=90%) and gets 8 correct (Accuracyā‰ˆ89% over valid responses), we learn both steadiness and reliability.
  5. Compare Standard, Chain-of-Thought, and Reflection
  • What happens: They test three answering modes: direct answer (Standard), thinking out loud (CoT), and answering again after reconsidering (Reflection).
  • Why it exists: To see which inference-time strategy best resists interference. Without this, we might wrongly assume more reasoning always helps.
  • Example: CoT sometimes wobbles under moderate interference; Reflection more consistently reduces errors across conditions.
  6. Train with Structure-Aware Training (SAT)
  • What happens: SAT teaches the model to keep its answer distribution stable across different contexts (neighbor-style passages and general/noisy text). A frozen ā€˜teacher’ (the model’s own earlier checkpoint) provides the target distribution, and a ā€˜student’ learns to match it in different contexts.
  • Why it exists: If a belief is truly structured, it should survive context changes. SAT bakes that invariance into training, making newly learned facts less brittle.
  • Example: For a new fact the model didn’t know, SAT reduces the performance drop under stress by about 30% versus standard augmentation (a minimal sketch of the training objective follows this list).
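As referenced in step 6, here is a minimal sketch of a context-invariance objective in the spirit of SAT, written with PyTorch; the paper’s actual loss and training details may differ:

```python
import torch
import torch.nn.functional as F

def sat_consistency_loss(student_logits_ctx, teacher_logits_clean, temperature=1.0):
    """Structure-Aware Training, sketched as teacher-student distillation:
    a frozen teacher (an earlier checkpoint) answers in a clean setting, and the
    student is trained to keep the same answer distribution when the fact is
    embedded in different contexts (neighbor-style or noisy passages).

    student_logits_ctx:   [batch, vocab] student logits with extra context
    teacher_logits_clean: [batch, vocab] frozen-teacher logits without context
    """
    student_logprobs = F.log_softmax(student_logits_ctx / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits_clean.detach() / temperature, dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```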

— Extra Sandwiches for new terms used —

šŸž You know how you test if someone really knows a topic by asking about nearby details? 🄬 Neighbor Facts: These are closely related questions (attributes, implications, associations) around the main fact. How it works: They probe different angles of the same concept. Why it matters: Without them, we can’t tell deep understanding from copying. šŸž Anchor: If someone knows a city’s name, they should also know its country, language, and landmarks.

šŸž Imagine a game show where the host won’t accept silence. 🄬 Coverage: It’s the share of tries where the model actually gives a valid answer. How it works: Count non-empty, non-refusal answers. Why it matters: Otherwise, a model could look accurate just by skipping hard questions. šŸž Anchor: If a student leaves many blanks, their accuracy on the few attempts could look high but be misleading.

šŸž You know grading a quiz by checking if the answer matches the key? 🄬 Accuracy: It’s how often valid answers match the truth (allowing small wording differences). How it works: Compare the predicted entity with the correct one using a forgiving match. Why it matters: It tells us real correctness, not just willingness to answer. šŸž Anchor: Writing ā€œNYCā€ vs. ā€œNew York Cityā€ still counts as correct.

04 Experiments & Results

The Test: The team built a 2,000-question dataset with each fact anchored to a small neighborhood (around 8 related questions on average) across STEM, Arts & Culture, Social Sciences, and Sports. They then ran stress tests:

  • Peer Quantity (like Asch conformity): vary how many peers agree on the wrong answer or chatter about a distractor.
  • Source Credibility: present misleading text from low/medium/high-credibility sources, in conflict (explicitly wrong) or misleading (true but about the wrong entity) modes.

The Competition: Four strong instruction-tuned models (Qwen-2.5-32B, Qwen3-30B-Instruct, Qwen3-30B-Thinking, OLMo-2-32B) were evaluated. For each model, the team focused on the ā€œHigh Self-Consistencyā€ set (cases where the model was perfectly consistent at baseline) to expose the illusion of confidence and test whether NCB predicts robustness.

The Scoreboard (with context):

  • Shock test: Even facts with perfect baseline consistency fell from 100% accuracy to about 33.8% under mild peer interference—like going from an A+ to a failing grade as soon as whispering starts in the hallway.
  • High vs. Low NCB: High-NCB facts dropped far less than low-NCB facts across all models and protocols. When facts were split into the top and bottom 35% by NCB, the high-NCB group typically saw accuracy drops of around 11–19%, while the low-NCB group dropped around 23–31%, like scoring a steady B+/A- versus sinking to a C/D when the room gets noisy.
  • Scaling: Bigger didn’t mean braver. Across a family of models, the robustness gap stayed; size alone didn’t fix susceptibility to interference.
  • Strategies: Chain-of-Thought was unstable: at moderate interference, it sometimes made things worse. Reflection (a second-turn reconsideration) reliably helped, reducing drop rates across almost all settings.
  • Coverage insights: The ā€œThinkingā€ variant (Qwen3-Thinking) sometimes chose to abstain more on Low-NCB items, hinting that when its belief is unstructured, it gets cautious; when structured, it can answer confidently and withstand interference.

Surprising Findings:

  • A single truth-teller matters: In peer setups, just one dissenting correct peer improved accuracy noticeably over unanimous wrong peers, mirroring classic social psychology.
  • Too much interference backfires: When the misleading context became over-the-top, models sometimes snapped back to their parametric knowledge, like a student ignoring obviously ridiculous rumors.
  • Reflection beats CoT for noise: While reasoning seems helpful in theory, the measured data show Reflection’s steady advantage at filtering misleading context.

In short: NCB acts like a truth anchor. High-NCB knowledge stays standing when social pressure or ā€˜official’ but misleading sources blow hard, across multiple models and setups. SAT then teaches models to keep answers stable across contexts, shrinking brittleness of newly learned facts by roughly 30%.

05 Discussion & Limitations

Limitations:

  • Narrow relation types: The neighborhoods use three kinds of relations (attributes, implications, associations). Richer structures like causal chains or taxonomies weren’t included, which might reveal even more about belief.
  • Static facts only: The study focused on time-invariant facts. For live, changing knowledge (e.g., current events), extra care is needed to separate real updates from mere interference.
  • Proxy for understanding: NCB measures robustness, not human-like comprehension. It correlates with stability, but human validation is still needed to link NCB to genuine understanding.
  • Computation cost: Building and sampling neighborhoods, plus stress testing, adds overhead. This needs optimization for broad deployment.
  • Bias and dual-use risks: Automatically generated neighborhoods may inherit biases; stress-test designs could be misused to craft stronger adversarial prompts.

Required Resources:

  • Multiple samples per question to estimate consistency reliably.
  • Neighbor generation pipelines and verification (LLM + human-in-the-loop) to ensure high-quality neighborhoods.
  • GPU resources for sampling, stress-testing, and training (SAT with teacher–student distillation across contexts).

When NOT to Use:

  • Rapid, real-time fact updates where the ground truth is changing; NCB’s static anchor may misjudge healthy updates as inconsistency.
  • Very low-resource domains where generating verified neighbor facts is hard; the metric may unfairly penalize obscure truths.
  • Tasks dominated by creative generation rather than factual precision.

Open Questions:

  • Can NCB be extended to dynamic knowledge and multi-hop reasoning without becoming too expensive?
  • How does NCB relate to human judgments of ā€œunderstanding,ā€ and can we align it with human studies?
  • Can we automate richer neighborhoods (causal, hierarchical) safely and scalably?
  • How can we best combine Reflection, verification steps, and NCB to build robust, efficient inference-time defenses?

06 Conclusion & Future Work

Three-Sentence Summary: The paper shows that surface confidence (self-consistency) can be an illusion: models often crumble when context nudges them. It proposes Neighbor-Consistency Belief (NCB), which checks whether answers agree across a web of related questions, and finds that high-NCB facts resist peer pressure and misleading authority much better. It also introduces Structure-Aware Training (SAT), which makes newly learned knowledge about 30% less brittle by teaching context-invariant belief structure.

Main Achievement: Turning truthfulness from a single-answer check into a structural property you can measure (NCB) and improve (SAT), which better predicts and boosts robustness in real-world, noisy contexts.

Future Directions: Extend NCB to dynamic facts and richer relations (like causal chains); validate against human judgments of understanding; explore more efficient neighborhood construction and inference-time defenses that pair well with NCB.

Why Remember This: Because in the real world, context is windy. This work gives us a way to see the roots, not just the leaves—so we can tell when a model truly stands firm, and how to train it to keep standing even when the breeze turns into a storm.

Practical Applications

  • Pre-deployment screening: Use NCB to filter and prioritize high-robustness knowledge before releasing a model to users.
  • RAG safety: When retrieval returns conflicting documents, weight answers by NCB to avoid being swayed by noisy sources.
  • Curriculum learning: Teach new facts with SAT so the model’s knowledge stays stable across varied contexts.
  • Agent swarms: In multi-agent setups, detect and resist peer-induced errors by preferring high-NCB facts.
  • Medical QA triage: Route low-NCB questions to human experts or verified databases; auto-answer high-NCB ones.
  • Legal research assistants: Flag low-NCB outputs when citing sources, prompting extra verification.
  • Education tools: Show students neighbor checks so they learn structured understanding, not just final answers.
  • Content moderation: Identify topics where low NCB indicates higher risk of manipulation or misinformation.
  • Model editing audits: After injecting new facts, measure NCB to ensure edits are integrated structurally, not just memorized.
  • Inference-time defense: Default to Reflection (second-turn reconsideration) on low-NCB items to reduce flips.
Tags: Neighbor-Consistency Belief, belief robustness, self-consistency, contextual interference, peer pressure, authority bias, cognitive stress testing, structure-aware training, reflection, chain-of-thought, Bayesian belief estimation, retrieval-augmented generation, truthfulness evaluation, model calibration, knowledge brittleness