FaithLens: Detecting and Explaining Faithfulness Hallucination
Key Summary
- Large language models can say things that sound right but aren't supported by the given document; this is called a faithfulness hallucination.
- FaithLens is a compact 8B-parameter model that checks if a claim matches its document and also explains why, boosting user trust.
- The team first synthesizes training data with explanations using a strong reasoning model, then filters it for correct labels, helpful explanations, and diverse examples.
- They fine-tune the detector on this curated data and then improve it further with rule-based reinforcement learning that rewards right answers and clear explanations.
- The "explanation quality reward" is clever: if a simple "novice" model can reach the right label using the detector's explanation, the explanation earns a point.
- On 12 varied datasets (like summarization, RAG, and multi-hop reasoning), FaithLens beats specialized detectors and even advanced LLMs like GPT-4.1 and o3.
- FaithLens' explanations scored highly for readability, helpfulness, and informativeness when judged by another model and by humans.
- It's also inexpensive to run, giving state-of-the-art accuracy at a fraction of the cost of big API models.
- Ablation studies show each piece (data filters and both rewards) matters for the final performance and explanation quality.
- Limitations include text-only scope, extra inference time for step-by-step reasoning, and binary (yes/no) labels rather than fine-grained categories.
Why This Research Matters
- People increasingly rely on AI to summarize, answer questions, and find facts, so catching unsupported claims is essential for trust and safety.
- FaithLens not only flags problems but also explains them, helping users learn, correct, and improve prompts or sources.
- It works across many task types, so teams don't need separate detectors for summarization, RAG, and multi-hop reasoning.
- Its explanations are tested for usefulness, not just style, which makes them truly instructive for humans and other models.
- Because it's compact and affordable to run, organizations can deploy it widely without huge costs.
- Clear, reliable detection helps reduce misinformation spread in education, customer support, media, and research.
- This approach sets a pattern for building other explainable, efficient AI verifiers beyond text-only tasks.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're grading a book report. The student says, "The dragon became king," but the chapter never said that. You'd want to catch that mistake and ask, "Show me where it says that."
The Concept (Faithfulness Hallucination): Faithfulness hallucination is when an AI says something that isn't supported by the document it was supposed to use.
- What it is: A mismatch between what the AI claims and what the provided text actually says.
- How it works: 1) You give the AI a document; 2) The AI produces a claim or answer; 3) A checker verifies if every part of the claim is supported by the document.
- Why it matters: Without this check, AI could spread confident but unsupported statements in summaries, Q&A, or search.
Anchor: If the document says "Paris is the capital of France," but the AI claims "Lyon is the capital," that's a faithfulness hallucination.
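To make the document-claim checking protocol concrete, here is a minimal Python sketch of the interface such a checker exposes. The function name, the Verdict type, and the substring heuristic inside it are purely illustrative stand-ins, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str        # "faithful" or "hallucinated"
    explanation: str  # which parts of the document support the decision

def check_faithfulness(document: str, claim: str) -> Verdict:
    """Illustrative interface only: decide whether `claim` is supported by `document`.

    A real detector such as FaithLens would run a language model here; this stub
    just shows the input/output contract described above, using a naive heuristic.
    """
    supported = claim.lower().rstrip(".") in document.lower()  # placeholder, not the real method
    return Verdict(
        label="faithful" if supported else "hallucinated",
        explanation="(placeholder) substring match used instead of model-based verification",
    )

print(check_faithfulness("Paris is the capital of France.", "Lyon is the capital."))
```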
- The World Before: Before tools like FaithLens, many teams tried to detect these hallucinations by simply asking a huge model to judge other models' answers. This worked fairly well but was costly and slow for real-world use, like needing a superstar teacher to grade every homework page. Some smaller detectors existed but often behaved like black boxes: they'd say "Wrong," but not "Why," making it hard for users to trust or fix problems.
- The Problem: Real applications, like retrieval-augmented generation (RAG), summarization, and dialogue, need fast, accurate, and trustworthy hallucination detection. Three issues stood out:
- Lack of Explainability: Many detectors only output "faithful" or "hallucinated," with no explanation.
- Inconsistent Generalization: A detector good at summaries might stumble on multi-hop reasoning or RAG.
- Low-Quality Training Data: Building labeled datasets is hard, labels can be noisy, and synthetic data is often not carefully filtered, so models learn shallow tricks.
- Failed Attempts
- Prompting advanced LLMs as judges: Accurate but too expensive to run at scale.
- Small classifiers trained on generic NLI data: Fast, but miss the nuanced ways hallucinations appear across tasks.
- Synthetic data without strict quality control: Cheaper to build, but easily polluted by wrong labels, weak explanations, and repetitive examples.
- The Gap: We needed a detector that is:
- Cost-efficient (runs locally at 8B parameters),
- Consistently strong across tasks,
- And explainable, so users see the reasoning.
Hook: You know how a science fair judge doesn't just score you, they also leave comments so you know what to improve? That's what we want from a detector.
The Concept (Explanatory Detection): A model that gives both a decision and an explanation you can follow.
- What it is: A system that says "faithful" vs. "hallucinated" and also explains which parts of the document support that.
- How it works: Train on data that includes correct labels and human-readable explanations; then tune further so explanations are actually useful to others.
- Why it matters: Without explanations, users can't pinpoint the mistake or trust the detector.
Anchor: "Claim: The law mentions the Lanham Act." Explanation: "The document lists several acts (e.g., Truth in Lending) but never the Lanham Act, so the claim isn't supported."
- Real Stakes
- In schools, students may use AI to summarize source texts; hallucinations teach the wrong facts.
- In customer support, a hallucinated answer can mislead customers.
- In news and research tools, a small mistake can ripple into big misunderstandings.
- In regulated fields (medicine, law, finance), unsupported claims can be risky or noncompliant.
FaithLens aims to meet this need by training a compact model to both detect and clearly explain faithfulness, while being affordable and consistent across many tasks.
02 Core Idea
Hook: Imagine a referee who not only blows the whistle but also explains the exact rule that was broken, so everyone learns.
The Concept (FaithLens' Key Insight): Train a small model to say whether a claim is supported by a document and to explain why, then use simple, verifiable rules to reward correct answers and helpful explanations.
- What it is: An 8B-parameter detector that pairs decisions with explanations, trained using curated synthetic data and rule-based reinforcement learning (RL).
- How it works: 1) Generate training samples (doc, claim, chain-of-thought, explanation, label) with a strong reasoning model; 2) Filter them for label correctness, explanation helpfulness, and diversity; 3) Fine-tune; 4) Use RL with rewards for correct predictions and explanations that help a novice model choose the right label.
- Why it matters: This makes the detector both effective (accurate) and trustworthy (clear explanations), without the cost of giant models.
Anchor: A student answers "No, not supported," and their short paragraph points to the exact sentences in the text that prove it; another student can read that and also get the right answer. That's FaithLens' idea.
Multiple Analogies (same idea, 3 ways)
- Detective Analogy: The model is a detective who must present both the verdict and the evidence notes; a junior detective (novice model) should be able to solve the case using those notes, and if they can, the notes were good.
- Teacher Analogy: When grading, the teacher gives points for the right answer and extra points if the studentās explanation is clear enough that another student could learn from it.
- Cooking Analogy: The dish (answer) must taste right (correct), and the recipe (explanation) must be clear enough that a beginner can recreate the dish.
Before vs After
- Before: Detectors often gave only a yes/no and relied on expensive judges; explanations were rare or low-quality.
- After: FaithLens reliably flags hallucinations and explains them, at low cost, with strong results across summarization, RAG, and multi-hop reasoning.
Why It Works (intuition)
- Explanations become truly useful when they help someone else correctly decide. Instead of guessing if an explanation "sounds good," FaithLens tests it: can a simpler model use it to reach the right label? If yes, reward it.
- Filtering data for correct labels and helpful explanations prevents the model from learning from noisy examples.
- Ensuring diversity avoids overfitting to easy or repetitive patterns, making the model robust across tasks.
Building Blocks (mini Sandwich intros):
- Hook: You know how we toss out spoiled fruit before making a fruit salad? Label Correctness (Filter 1):
- What: Keep only examples where the synthesized label matches the trusted ground-truth.
- How: Compare labels; discard mismatches.
- Why: Wrong labels teach wrong lessons. Anchor: If the answer key says "supported" but the synthetic sample says "not supported," we throw it out.
- Hook: Imagine checking if your notes help a friend solve the same worksheet. Explanation Quality (Filter 2):
- What: Keep explanations that actually make a model more confident in the right label.
- How: Measure model perplexity on the correct label with and without the explanation; keep those that reduce perplexity.
- Why: Fancy words aren't enough; the explanation must truly help. Anchor: If adding the explanation makes the model surer about choosing "No," it passes.
- Hook: A sports team needs players with different strengths, not five goalies. Data Diversity (Filter 3):
- What: Keep a varied set of document-claim pairs using clustering.
- How: Embed (turn into vectors), cluster with K-medoids, use medoids as probes to ensure each kept sample helps across types.
- Why: Prevents the detector from becoming great at only one pattern. Anchor: We keep examples that boost performance on a spread of probe samples, not just one niche.
- Hook: Think of practice (fine-tuning) and then scrimmage with scorekeeping (RL with rewards). Rule-Based RL with Two Rewards:
- What: Reward correct predictions and explanations that help a novice model get the right label.
- How: R_pred = 1 for correct label; R_exp = 1 if novice model gets it right after reading the explanation.
- Why: Balances accuracy and clarity. Anchor: If your answer is right and your notes teach your buddy to be right too, you get two gold stars.
03 Methodology
At a high level: Input (Document + Claim) → Think (internal reasoning) → Reason (human-friendly explanation) → Answer (Yes/No). For training: Data Synthesis → Three-Filter Cleaning → Supervised Fine-Tuning → Rule-Based RL with Two Rewards → FaithLens.
Step 1: Data Synthesis. Hook: Imagine asking a top student to write example questions, good answers, and short explanations so the class can practice. What it is: Use a strong reasoning model to create training samples that include the document, the claim, a chain-of-thought (CoT), an explanation, and a label.
- How it works (recipe):
- Start from public datasets with (doc, claim, ground-truth label).
- Prompt a strong model to produce its step-by-step thoughts (CoT), a short human-readable explanation, and its label.
- Save (doc, claim, CoT, explanation, label) as a synthesized sample.
- Why it matters: Most public data lacks explanations; we need them to teach our model to explain. Anchor: For a doc about "Paris is France's capital," the top student writes: Claim: "Paris is the capital." CoT: checks the sentence; Explanation: "The doc says Paris is the capital"; Label: supported.
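A minimal sketch of the synthesis recipe, assuming a generic `strong_model.generate` call that returns the model's thinking, explanation, and label; the prompt wording and field names are illustrative, not the paper's exact prompt.

```python
SYNTH_PROMPT = """Document: {doc}
Claim: {claim}

Think step by step about whether every part of the claim is supported by the document.
Then write a short, human-readable explanation and a final label (Yes or No)."""

def synthesize_sample(strong_model, doc: str, claim: str, gold_label: str) -> dict:
    """Ask a strong reasoning model for CoT, explanation, and label; keep the gold label for filtering."""
    out = strong_model.generate(SYNTH_PROMPT.format(doc=doc, claim=claim))  # placeholder API
    return {
        "document": doc,
        "claim": claim,
        "cot": out["thinking"],             # the model's step-by-step reasoning
        "explanation": out["explanation"],  # short justification meant for humans
        "label": out["label"],              # the strong model's Yes/No decision
        "gold_label": gold_label,           # verified label from the source dataset
    }
```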
Step 2: Three-Filter Data Cleaning. 2a) Label Correctness. Hook: You would toss a practice sheet if the answer key is wrong. What it is: Keep only samples whose synthesized label matches the verified label from the original dataset.
- How: Compare y_synth vs y_ground_truth; discard mismatches.
- Why: Wrong labels teach wrong rules. Anchor: If ground-truth is "supported" but the synth says "not supported," delete it.
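Filter 1 is a plain comparison; a tiny sketch, reusing the illustrative field names from the synthesis sketch above:

```python
def filter_label_correctness(samples: list[dict]) -> list[dict]:
    """Keep only samples whose synthesized label agrees with the dataset's ground truth."""
    return [s for s in samples if s["label"] == s["gold_label"]]
```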
2b) Explanation Quality (Perplexity-based). Hook: If your study guide doesn't actually help your friend get the right answer, it's not a good guide. What it is: Keep explanations that make a training model more confident in the correct label.
- How: Compute model perplexity on the correct label with only (doc, claim, CoT). Then add the explanation and recompute. Keep samples where perplexity drops.
- Why: Ensures explanations are truly helpful, not fluff. Anchor: If adding "The doc explicitly states…" lowers uncertainty about "No," it passes.
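A minimal sketch of the perplexity check using Hugging Face Transformers; the backbone name is an example choice, and the prompt layout is an assumption rather than the paper's exact template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # example backbone, not necessarily the one used
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)

def label_perplexity(context: str, label: str) -> float:
    """Perplexity of the gold label tokens given the context (lower = more confident)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    lbl_ids = tok(" " + label, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, lbl_ids], dim=1)
    targets = ids.clone()
    targets[:, : ctx_ids.shape[1]] = -100  # score only the label tokens
    with torch.no_grad():
        loss = lm(ids, labels=targets).loss  # mean negative log-likelihood over label tokens
    return float(torch.exp(loss))

def passes_explanation_filter(s: dict) -> bool:
    """Keep the sample if adding its explanation lowers perplexity on the correct label."""
    base = f"Document: {s['document']}\nClaim: {s['claim']}\n{s['cot']}\nAnswer:"
    with_exp = f"Document: {s['document']}\nClaim: {s['claim']}\n{s['cot']}\n{s['explanation']}\nAnswer:"
    return label_perplexity(with_exp, s["gold_label"]) < label_perplexity(base, s["gold_label"])
```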
2c) Data Diversity (Embedding + K-Medoids + Probe Set). Hook: A well-balanced meal needs different food groups. What it is: Ensure the set covers many kinds of (doc, claim) relationships.
- How: Embed each (doc, claim), cluster via K-medoids, select medoids as probes. For any candidate sample, check if using it as an in-context example lowers perplexity on enough probe samples; keep it if it helps sufficiently many.
- Why: Avoids tunnel vision on one pattern. Anchor: We keep a sample if it helps across multiple "probe" examples, not only its closest neighbor.
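A sketch of the diversity filter under simplifying assumptions: a sentence-transformers model stands in for the paper's embedding model, each cluster's medoid is approximated by the member nearest its K-means centroid, and the min_helped threshold is made up. It reuses label_perplexity from the perplexity-filter sketch above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the paper's embedding model

def pick_probes(samples: list[dict], k: int = 8) -> list[dict]:
    """Cluster (doc, claim) embeddings and return one medoid-like probe per cluster."""
    X = embedder.encode([s["document"] + "\n" + s["claim"] for s in samples])
    km = KMeans(n_clusters=k, n_init="auto").fit(X)
    probes = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Approximate the medoid with the member closest to its cluster centroid.
        best = members[np.argmin(np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1))]
        probes.append(samples[best])
    return probes

def keep_for_diversity(candidate: dict, probes: list[dict], min_helped: int = 3) -> bool:
    """Keep a candidate if using it as an in-context example helps enough probe samples."""
    helped = 0
    for p in probes:
        base = f"Document: {p['document']}\nClaim: {p['claim']}\nAnswer:"
        demo = (f"Document: {candidate['document']}\nClaim: {candidate['claim']}\n"
                f"{candidate['explanation']}\nAnswer: {candidate['gold_label']}\n\n") + base
        if label_perplexity(demo, p["gold_label"]) < label_perplexity(base, p["gold_label"]):
            helped += 1
    return helped >= min_helped
```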
Step 3: Supervised Fine-Tuning (SFT). Hook: First, the team practices with a clean, well-designed workbook. What it is: Train the base 8B model to produce (CoT → Explanation → Answer) on the filtered data.
- How: Standard fine-tuning on sequences that include think, reason, and answer sections.
- Why: Gives the model a solid, clean start, so it learns to both decide and explain. Anchor: After SFT, the model can already catch that "Lanham Act" is missing from the document and say so clearly.
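A sketch of how a filtered sample might be serialized into the think/reason/answer format for standard supervised fine-tuning; the prompt wording is illustrative, the tag names follow the inference format described in Step 5, and gold_label is assumed to hold "Yes" or "No".

```python
def to_sft_text(sample: dict) -> str:
    """Serialize one filtered sample into the think -> reason -> answer training format."""
    prompt = (
        "Determine whether the claim is fully supported by the document.\n"
        f"Document: {sample['document']}\n"
        f"Claim: {sample['claim']}\n"
    )
    target = (
        f"<think>{sample['cot']}</think>\n"
        f"<reason>{sample['explanation']}</reason>\n"
        f"<answer>{sample['gold_label']}</answer>"
    )
    return prompt + target  # then fed to an ordinary causal-LM fine-tuning loop
```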
Step 4: Rule-Based Reinforcement Learning (GRPO-style). Hook: Next, scrimmage with a scoreboard that gives points for winning and coaching. What it is: Generate multiple candidate outputs and score them with simple rules; update the model to prefer higher-scoring outputs.
- How:
- For each (doc, claim), produce several candidate (explanation, answer) pairs.
- Compute rewards:
- Prediction Correctness Reward (R_pred): 1 if label is right, else 0.
- Explanation Quality Reward (R_exp): 1 if a novice model reads the explanation and then picks the right label.
- Format Reward (R_format): 1 if the output uses the required tags.
- Sum them: R_final = R_pred + R_exp + R_format.
- Use a group-based policy optimization (like GRPO) to push the model toward better-scoring candidates while staying close to a reference policy.
- Why: SFT teaches imitation; RL teaches optimization for what we truly care about: accuracy and teachable explanations. Anchor: If your output is correct, clearly explains, and follows the template, you earn three points.
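A minimal sketch of the three rule-based rewards; novice_model.predict_label is a placeholder for querying the frozen novice model with the document, claim, and the candidate's explanation.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output contains the required think/reason/answer tags."""
    ok = all(re.search(fr"<{t}>.*?</{t}>", output, re.DOTALL) for t in ("think", "reason", "answer"))
    return 1.0 if ok else 0.0

def prediction_reward(output: str, gold_label: str) -> float:
    """1.0 if the extracted answer matches the ground-truth label."""
    m = re.search(r"<answer>(Yes|No)</answer>", output)
    return 1.0 if m and m.group(1) == gold_label else 0.0

def explanation_reward(output: str, doc: str, claim: str, gold_label: str, novice_model) -> float:
    """1.0 if a frozen novice model, shown the explanation, picks the right label."""
    m = re.search(r"<reason>(.*?)</reason>", output, re.DOTALL)
    if not m:
        return 0.0
    guess = novice_model.predict_label(doc, claim, explanation=m.group(1))  # placeholder API
    return 1.0 if guess == gold_label else 0.0

def total_reward(output: str, doc: str, claim: str, gold_label: str, novice_model) -> float:
    """R_final = R_pred + R_exp + R_format, used to score each sampled candidate."""
    return (prediction_reward(output, gold_label)
            + explanation_reward(output, doc, claim, gold_label, novice_model)
            + format_reward(output))
```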
Step 5: Inference Flow (Doc + Claim → Think → Reason → Answer). Hook: When giving your final answer in class, you show your short notes, then give the conclusion. What it is: At run-time, FaithLens reads the document and claim, "thinks" privately, writes a short human-friendly explanation, and outputs Yes/No.
- How: The output has three parts: <think>…</think>, <reason>…</reason>, <answer>Yes/No</answer>.
- Why: The explanation builds trust, and the format keeps tools easy to parse. Anchor: For "Is Lyon the capital of France?" it writes: Reason: "The doc states Paris is the capital." Answer: "No."
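Parsing that three-part output is simple string work; a sketch built around the tag names above:

```python
import re

def parse_output(text: str) -> dict:
    """Extract the human-facing explanation and the Yes/No answer from a FaithLens-style output."""
    def grab(tag: str) -> str:
        m = re.search(fr"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else ""
    return {"reason": grab("reason"), "answer": grab("answer")}

example = ("<think>Check which city the document names as the capital.</think>\n"
           "<reason>The doc states Paris is the capital, so the claim about Lyon is unsupported.</reason>\n"
           "<answer>No</answer>")
print(parse_output(example))  # {'reason': 'The doc states ...', 'answer': 'No'}
```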
Secret Sauce
- Two-way quality control: Before training (filters) and after training (RL rewards).
- The "novice test" for explanations: If a simpler model can use your explanation to choose correctly, you truly explained it.
- Diversity assurance: Clustering and probe checks reduce brittleness across tasks and domains.
04 Experiments & Results
Hook: Picture a decathlon. You don't just want to win one event; you want to be solid across all of them.
The Concept (Evaluation Across Many Tasks): Test FaithLens on lots of different challenge types and compare it to both big LLMs and specialized detectors.
- What it is: A broad evaluation on 12 datasets (11 from LLM-AggreFact, plus HoVer for many-hop reasoning).
- How it works: Use macro-F1 (balanced across classes), same cleaned benchmarks as prior work, and measure explanation quality separately.
- Why it matters: A good detector shouldnāt only ace one dataset; it should be reliable across many.
Anchor: Think: summarization errors, RAG mismatches, and multi-hop Wikipedia claims, all in one test suite.
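Macro-F1 averages the per-class F1 scores with equal weight, so the "faithful" and "hallucinated" classes count equally even if one is much rarer. A tiny scikit-learn sketch with made-up labels:

```python
from sklearn.metrics import f1_score

y_true = ["faithful", "hallucinated", "hallucinated", "faithful", "hallucinated"]
y_pred = ["faithful", "hallucinated", "faithful", "faithful", "hallucinated"]

# Macro-F1 computes F1 separately for each class and averages them with equal weight,
# so the rarer class counts just as much as the common one.
print(f1_score(y_true, y_pred, average="macro"))
```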
- The Test
- Tasks: Summarization (CNN/DM, XSum), RAGTruth, dialogue summarization (Tofu), claim verification (WiCE), expert Q&A, and HoVer (multi-hop).
- Metric: Macro-F1. Also, explanation quality (readability, helpfulness, informativeness) judged by GPT-4.1 and by humans.
- The Competition
- Advanced LLMs: GPT-4o, GPT-4.1, o3, o3-mini, o1, DeepSeek-V3.2, Claude-3.7-Sonnet, Llama-3.1-405B.
- Specialized Detectors: AlignScore, MiniCheck, FactCG, ClearCheck.
- The Scoreboard (with context)
- Overall Effectiveness: FaithLens (8B) reaches state-of-the-art across 12 tasks with an average macro-F1 around 86.4, surpassing strong baselines including GPT-4.1 and o3. Think of it as scoring an A when others hover between B and A-.
- Stability: FaithLens shows the lowest standard deviation across tasks, meaning performance is consistently strong rather than spiky.
- HoVer (multi-hop): Notably strong on complex reasoning detection, where many small models struggle; FaithLens maintains high accuracy.
- Explainability Results
- Using GPT-4.1 as a judge, FaithLens' explanations score highly for readability (clear structure), helpfulness (guides the reader to the right conclusion), and informativeness (specific evidence, not fluff).
- Human Evaluation: In pairwise comparisons on 120 samples, humans preferred FaithLens' explanations over GPT-4o's, or rated them tied, in most cases on readability, helpfulness, and informativeness.
- Why this matters: Explanations that people and models can use boost trust and make debugging easier.
- Efficiency
- Cost: On a 1.2K-sample run across all datasets, FaithLens is dramatically cheaper to run than API-based giants (around $0.1 vs. multiple dollars), yet delivers top-tier accuracy.
- Data Efficiency: FaithLens trains only on public data, produces explanations, and, after filtering, uses fewer but higher-quality samples (about 28K explainable items used) than some baselines that rely on larger or private sets.
- Surprising/Notable Findings
- Each piece matters: Ablations show removing label-correctness filtering hurts prediction accuracy; removing explanation-quality filtering lowers explanation scores; removing diversity filtering reduces cross-task stability. Dropping RL or the explanation reward also degrades both accuracy and clarity.
- Claim Decomposition: Breaking complex claims into atomic facts can further boost performance, but increases inference time by 2–4×; it's helpful but not always worth the latency.
- Generalization Across Backbones: The same training recipe improves other base models (e.g., Qwen variants), not just Llama-3.1-8B.
Bottom line: FaithLens balances accuracy, clarity, and cost, and it keeps that balance across many very different tasks.
05 Discussion & Limitations
Hook: Think of a great pocket-sized camera: it's versatile and clear, but it won't replace a full movie studio for every job.
The Concept (Honest Assessment): Understand where FaithLens shines and where it's not the right tool.
- Limitations:
- Text-only: It doesn't handle images, audio, or video grounding yet.
- Extra latency: It writes a short explanation before answering; that's more time than a bare yes/no.
- Binary labels: Outputs "faithful" vs. "hallucinated," not fine-grained types (e.g., "unsupported number," "wrong entity").
- Required Resources:
- A modest GPU/CPU can run an 8B model for inference.
- For RL training, you also need a "novice" model (like Llama-3.1-8B-Inst) to compute the explanation reward.
- For data filtering, an embedding model (e.g., Llama-Embed-Nemotron-8B) and clustering.
- When NOT to Use:
- If you need millisecond responses with no explanation (e.g., ultra-low latency pipelines), a binary-only classifier might be better.
- If your task is multi-modal (e.g., verifying text against a picture), FaithLens (as-is) won't ground non-text.
- If you require detailed taxonomy labels (exact error types) for downstream analytics, you'll need an extension.
- Open Questions:
- Multi-modal extensions: How to design grounding signals and explanations for images/tables?
- Granular labels: Can we reliably and affordably train models to tag specific hallucination types?
- Faster explanations: Can we compress or retrieve explanations to reduce latency without losing clarity?
- Better novice tests: Are there alternative ways to automatically score explanation usefulness?
Anchor: Use FaithLens when you want trustworthy, explainable checks over text; don't use it as a vision-language fact-checker or a nanosecond yes/no light.
06 Conclusion & Future Work
- Three-Sentence Summary
- FaithLens is a compact detector that says whether a claim is supported by a document and explains why.
- It's trained on synthetic examples that are carefully filtered for correct labels, helpful explanations, and diversity, then refined via rule-based RL with rewards for correct answers and explanations that help a novice model succeed.
- Across 12 varied tasks, FaithLens achieves state-of-the-art accuracy, high-quality explanations, and low cost.
- Main Achievement
- A practical balance of trustworthiness, effectiveness, and efficiency: strong accuracy with human-usable explanations at a fraction of the cost of very large models.
- Future Directions
- Extend to multi-modal detection (text + images/tables),
- Provide fine-grained hallucination categories,
- Reduce explanation latency,
- Explore richer automatic tests for explanation usefulness beyond a single novice model.
- Why Remember This
- FaithLens shows that "explain-and-verify" can be trained efficiently: explanations aren't just nice prose; they're tested by whether they help another model be right. That simple idea makes explanations meaningful, not decorative.
Practical Applications
- Add a "fact check + why" step to RAG systems before showing answers to users.
- Attach short, human-friendly explanations to AI-generated summaries in newsrooms or classrooms.
- Use as a gatekeeper in chatbots: allow only responses marked "faithful," else request clarification or retrieval.
- Audit enterprise knowledge-base updates by verifying each new claim against its cited documents.
- Assist editors and teachers by highlighting unsupported sentences and pointing to missing or mismatched evidence.
- Enforce compliance in legal/medical drafts by flagging claims not grounded in the provided references.
- As a training coach for small LLMs: feed explanations that help them learn to pick correct labels.
- Monitor and filter user-generated content (forums/wikis) where claims must be backed by a linked source.
- Add a review layer to agent pipelines where each step's claim is checked and explained before proceeding.
- Support dataset curation by automatically removing mislabeled or low-diversity synthetic examples.