
Benchmarking Large Language Models for Knowledge Graph Validation

Beginner
Farzad Shami, Stefano Marchesin, Gianmaria Silvello
2/11/2026
arXiv

Key Summary

  • Knowledge graphs are like giant fact maps, and keeping every fact correct is hard and important.
  • This paper introduces FactCheck, a benchmark that fairly tests how well large language models (LLMs) can tell if graph facts are true.
  • It checks three angles: what models already know, how they do with outside evidence (RAG), and whether a team of models agreeing is better than one.
  • FactCheck includes three real datasets (FactBench, YAGO, DBpedia) and a giant retrieval set with over 2 million web documents.
  • Turning triples into simple sentences and asking several smart search questions helps find better evidence.
  • RAG often boosts accuracy but costs more time and doesn’t always help (especially on DBpedia).
  • Open-source models (like Gemma2 and Mistral) sometimes beat a commercial model (GPT-4o mini) on this task using only internal knowledge.
  • Voting across multiple models makes results steadier but not always better than the single best model.
  • Detecting false facts is much harder than confirming true ones, especially in imbalanced datasets like YAGO.
  • FactCheck also measures time and tokens, offers a mock search API, and a web app for visual error analysis.

Why This Research Matters

Our lives rely on connected facts, from maps and shopping to news and research. If those facts are wrong, the tools we use can mislead us or waste our time. FactCheck shows when LLMs can be trusted to verify facts and when they need help from evidence or teammates (consensus). It makes progress measurable and fair by supplying the same data and evidence to everyone. Teams can choose fast-but-rough, slow-but-accurate, or steady-by-voting strategies based on needs. Over time, this helps build AI that is more reliable, transparent, and safe.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine your school has a giant poster with lines connecting every student to their favorite book, sport, and club. If the poster is wrong in even a few places, the whole map becomes confusing for everyone who uses it.

🥬 The Concept (Knowledge Graphs): A knowledge graph is a big, organized map of facts where things (like people or places) are dots, and relationships (like 'lives in' or 'wrote') are lines. How it works:

  1. Each fact is stored as a tiny sentence called a triple: Subject – Predicate – Object (like 'Marie Curie – won – Nobel Prize').
  2. Many triples link together to form a web of facts.
  3. Computers can follow the links to answer questions and power apps like search, recommendations, and chat. Why it matters: If a few facts or links are wrong, apps give bad answers, like mixing up who wrote a book or which city is a capital. 🍞 Anchor: When you search, 'Who painted Starry Night?', the graph can lead you from the painting node to the 'painted by' line to the 'Vincent van Gogh' node.
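The fact-map idea above can be sketched as a tiny toy graph; this is only an illustration, not the paper's data format (the triples and the `answer` helper are made up for this example):

```python
# A toy knowledge graph: each fact is a (subject, predicate, object) triple.
triples = [
    ("Marie Curie", "won", "Nobel Prize in Chemistry"),
    ("The Starry Night", "painted by", "Vincent van Gogh"),
    ("Canberra", "capital of", "Australia"),
]

def answer(subject, predicate, triples):
    """Follow the graph links: find all objects for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(answer("The Starry Night", "painted by", triples))  # ['Vincent van Gogh']
```

Real knowledge graphs store millions of such triples in specialized databases, but the lookup idea is the same: follow the line from one dot to another.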

🍞 Hook: You know how teachers double-check test answers so the class gets accurate grades? Graphs need checking, too.

🥬 The Concept (Fact Validation): Fact validation is the process of making sure each tiny fact in the graph is actually true. How it works:

  1. Pick a triple (like 'X is the capital of Y').
  2. Look for trusted proof in the graph or outside sources.
  3. Decide: true or false. Why it matters: Without checking, wrong facts spread, and all the apps using the graph can make mistakes. 🍞 Anchor: If the graph says 'Sydney is the capital of Australia,' validation should catch it and correct it to 'Canberra.'

🍞 Hook: Think of a super-smart pen pal who can read a lot and write answers back to you in full sentences.

🥬 The Concept (Large Language Models, LLMs): LLMs are computer programs trained on lots of text so they can understand and generate language. How it works:

  1. Read your question.
  2. Use patterns learned from text to reason.
  3. Produce an answer. Why it matters: LLMs can help check facts by explaining them in plain language—but they can also make confident mistakes (hallucinations), so we must test them carefully. 🍞 Anchor: If you ask, 'Is Mount Everest higher than K2?', an LLM might say 'Yes,' but we want to verify it can rely on real knowledge and evidence.

The world before: People built many ways to validate graph facts. Some methods only use the graph itself (following paths); others search the web. These work in limited cases but struggle with rare facts, missing links, or noisy data. Humans are great at checking—but too slow and expensive for millions of facts. LLMs seem promising because they understand language and know a lot inside, yet they sometimes invent details or trust the wrong context.

The problem: No one had a solid, fair benchmark to see how well LLMs can validate knowledge graph facts in different realistic settings—using just their memory, using outside evidence, or teaming up and voting.

Failed attempts:

  • Rule-based graph checks: good for common patterns, poor for rare or tricky cases.
  • Graph-only paths: fast but can’t fix errors already inside the graph.
  • Web-only checks: stronger evidence but slow and noisy; results vary by search.
  • Direct LLM answers: helpful but can be unstable or biased without guardrails.

The gap: We needed a single playground that brings all this together—datasets, fair procedures, outside evidence, and ways to compare different LLM strategies—plus cost and speed measurements.

Real stakes:

  • Search engines: wrong facts change what people read first.
  • Shopping: bad links can suggest the wrong products.
  • Social networks: mistagged interests confuse recommendations.
  • Science and health: errors can mislead research or reports.
  • Everyday life: from maps to movies to sports stats, you want the right info fast.

🍞 Hook: Like building a playground where every kid can try the same obstacle course so we can see who runs it best and safely.

🥬 The Concept (FactCheck Benchmark): FactCheck is that fair playground for testing LLMs on graph fact-checking. How it works:

  1. Takes real graph triples and turns them into readable sentences.
  2. Tests LLMs in three modes: only their memory, with outside evidence (RAG), and with multi-model voting.
  3. Tracks accuracy, agreement, speed, and resource use. Why it matters: Without a fair test, we can’t trust or improve LLMs for real graph validation. 🍞 Anchor: It’s like giving every runner the same track, timer, and rules—then comparing results fairly.

02Core Idea

🍞 Hook: You know how a mystery club solves cases faster when they (1) remember facts, (2) look up clues, and (3) compare notes to agree on what happened?

🥬 The Concept (Aha! Moment): The key idea is to benchmark LLMs for graph fact-checking along three paths—what they know inside, what they can confirm with retrieved evidence (RAG), and what a group of models agrees on—using a shared, fair setup. How it works:

  1. Standardize triples into simple sentences so everyone understands the same claim.
  2. Test internal knowledge (no outside help).
  3. Test with RAG by retrieving web documents.
  4. Test multi-model consensus via majority voting and tie-breakers.
  5. Measure accuracy, agreement, and speed on three real KGs. Why it matters: This shows where LLMs shine or struggle, and what it costs in time and compute to get reliable answers. 🍞 Anchor: It’s like checking if a student can answer from memory, from a textbook, or by discussing with classmates—and recording who’s fastest and most accurate.

Three analogies to understand the idea:

  • Detective: Memory of past cases (internal), clues from the scene (RAG), squad consensus (voting).
  • Sports: Player skill (internal), instant replay (RAG), referee huddle (consensus).
  • Cooking: Chef’s memory (internal), recipe book (RAG), chef team vote on taste (consensus).

Before vs. After:

  • Before: Disconnected methods, no unified testbed, unclear trade-offs, hard to compare results.
  • After: One consistent benchmark (FactCheck) with shared data, a huge evidence set, and agreed metrics—so progress is visible and repeatable.

🍞 Hook: You know how translating a tricky math word problem into a simple sentence helps you solve it?

🥬 The Concept (Triple-to-Sentence Transformation): FactCheck turns cryptic graph triples into clear sentences so retrieval and LLMs work better. How it works:

  1. Convert ⟨Subject, Predicate, Object⟩ into a natural sentence (e.g., 'Ada Lovelace wrote Notes on the Analytical Engine').
  2. Generate multiple search questions from that sentence.
  3. Rank those questions to keep only the best ones. Why it matters: Without this step, weird labels (like CamelCase or IDs) can confuse search and models. 🍞 Anchor: 'dbpedia:Alexander_III_of_Russia isMarriedTo X' becomes 'Alexander III of Russia was married to X,' which search engines and LLMs can understand.

🍞 Hook: When you study, you don’t just read one sentence—you look at several sources to be sure.

🥬 The Concept (Retrieval-Augmented Generation, RAG): RAG adds outside documents to help a model decide if a fact is true. How it works:

  1. Ask top-ranked questions to a search index (here, pre-collected Google SERPs).
  2. Filter out sources that would create circular proof (like the graph’s own Wikipedia base).
  3. Pick the most relevant documents and chunk them for the LLM to read. Why it matters: Without outside evidence, models may rely on memory alone and make confident mistakes. 🍞 Anchor: To check 'The Nile is the world’s longest river,' RAG pulls multiple sources the LLM can compare.

🍞 Hook: Ever settle a debate by asking multiple friends and going with the majority?

🥬 The Concept (Multi-Model Consensus): Multiple LLMs each vote true/false; the majority wins; a judge model breaks ties. How it works:

  1. Run the same claim through several models.
  2. Count votes.
  3. If tied, use a stronger or different model to decide. Why it matters: One model can be wrong; a group reduces outliers and stabilizes results. 🍞 Anchor: Four classmates vote on an answer; if it’s 2–2, the teacher (judge model) decides.

Building blocks inside FactCheck:

  • Datasets: FactBench, YAGO, DBpedia (13,530 facts) cover easy-to-hard, balanced-to-imbalanced cases.
  • RAG Dataset: 2M+ documents plus generated questions, pre-fetched for fairness and repeatability.
  • Mock API: A pretend-but-consistent search API so every experiment sees the same results.
  • Metrics: Class-wise F1 (for True and False separately), Consensus Alignment (agreement with majority), and timing.
  • Web app: Visualize steps, errors, and model behavior.

Why it works (intuition, not equations):

  • Standard language reduces search confusion.
  • Multiple questions widen the net to catch better evidence.
  • Filtering prevents cheating by reading the graph’s own source.
  • Majority vote dampens random mistakes.
  • Separate F1 for True and False exposes hidden biases (many systems are good at confirming truths but weak at spotting fakes).

03Methodology

High-level recipe: Input triple → Turn into plain sentence → Make many smart questions → Retrieve and filter web pages → Pick the best docs → Chunk into bite-size passages → Ask the LLM to verify → (Optional) Gather votes from multiple LLMs → Output true/false + scores.

Step 1. Triple Transformation

  • What: Turn ⟨S, P, O⟩ into a natural sentence the web and models understand.
  • Why needed: Raw graph labels (IDs, underscores, CamelCase) confuse retrieval and LLMs. Without this, you get fewer, worse sources.
  • Example: ⟨Alexander_III_of_Russia, spouse, Dagmar_of_Denmark⟩ → 'Alexander III of Russia was married to Dagmar of Denmark.'
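A minimal sketch of the label clean-up this step describes; the paper uses an LLM for the full transformation, so the regex-based `clean_label` and `triple_to_sentence` helpers below are illustrative stand-ins:

```python
import re

def clean_label(label: str) -> str:
    """Turn raw KG labels (underscores, CamelCase) into readable words."""
    label = label.replace("_", " ")
    # Split CamelCase: 'isMarriedTo' -> 'is Married To'
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return label.strip()

def triple_to_sentence(subject: str, predicate: str, obj: str) -> str:
    """Compose a plain-language claim from a <S, P, O> triple."""
    return f"{clean_label(subject)} {clean_label(predicate).lower()} {clean_label(obj)}."

print(triple_to_sentence("Alexander_III_of_Russia", "isMarriedTo", "Dagmar_of_Denmark"))
# Alexander III of Russia is married to Dagmar of Denmark.
```

An LLM additionally picks natural verb tenses ('was married to'), which a regex cannot do; the point here is just that messy IDs become searchable language.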

Step 2. Question Generation and Ranking

  • What: Ask an LLM to create several different search questions from the sentence, then rank them for closeness to the original claim, keeping only the top ones.
  • Why needed: One phrasing may miss key pages; multiple phrasings cover more clues and reduce paraphrase bias.
  • Example: From 'Alexander III of Russia was married to Dagmar of Denmark' generate 10 variants like 'Who was the wife of Alexander III of Russia?'; keep the top 3 by relevance.
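The ranking half of this step might look roughly like the sketch below; the paper scores LLM-generated questions with a proper relevance model, so the simple word-overlap score here is just a stand-in:

```python
import re

def words(text: str) -> set:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question: str, claim: str) -> float:
    """Crude relevance: fraction of claim words that appear in the question."""
    c = words(claim)
    return len(words(question) & c) / len(c)

def top_questions(questions, claim, k=3):
    """Keep only the k questions closest to the original claim."""
    return sorted(questions, key=lambda q: overlap_score(q, claim), reverse=True)[:k]

claim = "Alexander III of Russia was married to Dagmar of Denmark"
candidates = [
    "Who was the wife of Alexander III of Russia?",
    "When was Alexander III of Russia born?",
    "Was Alexander III of Russia married to Dagmar of Denmark?",
    "What is the capital of Denmark?",
]
print(top_questions(candidates, claim, k=2))
```

Note how the off-topic birth question and the Denmark-capital question score low: ranking keeps the net wide but focused.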

Step 3. Document Retrieval and Filtering

  • What: For each top question, use the pre-collected Google SERP list; fetch up to 100 pages per question; remove sources tied to the KG’s own base (e.g., Wikipedia for DBpedia) to avoid circular proof.
  • Why needed: We want independent evidence; otherwise, the system might just echo itself.
  • Example: Gather articles from museums, history sites, and reputable news—filter out Wikipedia if validating DBpedia facts.
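The circular-source filter can be sketched as a hostname check; the blocked list, helper name, and example URLs below are illustrative assumptions, not the paper's configuration:

```python
from urllib.parse import urlparse

# Hosts tied to the KG's own base: evidence from these would be circular proof.
BLOCKED_SUFFIXES = ("wikipedia.org",)  # e.g., when validating DBpedia facts

def is_independent(url: str) -> bool:
    """True if the page does not come from the KG's own source."""
    host = urlparse(url).netloc.lower()
    return not any(host == s or host.endswith("." + s) for s in BLOCKED_SUFFIXES)

urls = [
    "https://en.wikipedia.org/wiki/Alexander_III_of_Russia",
    "https://www.rct.uk/collection/alexander-iii",
]
print([u for u in urls if is_independent(u)])
```

Matching on the domain suffix (rather than the whole URL) catches subdomains like `en.wikipedia.org` and `de.wikipedia.org` with one rule.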

Step 4. Document Processing and Selection

  • What: Score documents for similarity to the sentence; keep the top k_d (e.g., 10); split each into overlapping chunks for the LLM to read efficiently.
  • Why needed: LLMs read better with smaller, focused passages; ranking avoids wasting attention on irrelevant text.
  • Example: Keep the 10 pages most about 'Alexander III marriage'; chunk each into small paragraphs with a sliding window so key names show up together.
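Sliding-window chunking can be sketched as below; the window size, overlap, and word-level splitting are arbitrary choices for illustration, not the paper's settings:

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10):
    """Split text into overlapping word windows so key names stay together."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + size]))
    return chunks

# A 100-word document becomes three 40-word chunks, each sharing 10 words
# with its neighbor, so a name split across a boundary still appears whole.
doc = " ".join(f"word{i}" for i in range(100))
chunks = chunk_text(doc, size=40, overlap=10)
print(len(chunks), len(chunks[0].split()))  # 3 40
```

The overlap is the point: without it, a sentence like 'Alexander III | married Dagmar' could be cut exactly between subject and object, and no single chunk would contain the full claim.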

Step 5. Internal-Knowledge Paths (DKA, GIV)

  • What:
    • DKA: a plain, direct prompt—'Is this statement true?'
    • GIV-Z: guided template with exact output format; re-ask if messy.
    • GIV-F: same, but add a few examples to teach the pattern.
  • Why needed: Structured instructions reduce confusion; a few examples can greatly boost consistency.
  • Example: Few-shot: Show 3 short 'true/false + reason' examples, then the real question.
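The shape of a few-shot guided template (GIV-F) might look like this sketch; the wording and the example statements are invented for illustration, not the paper's actual prompts:

```python
# Hypothetical few-shot examples teaching the output format.
FEW_SHOT = [
    ("Canberra is the capital of Australia.", "TRUE"),
    ("Sydney is the capital of Australia.", "FALSE"),
    ("Marie Curie won the Nobel Prize in Chemistry.", "TRUE"),
]

def build_giv_f_prompt(claim: str) -> str:
    """Guided few-shot prompt: show the output format, then ask the real question."""
    lines = ["Decide if each statement is TRUE or FALSE. Answer with one word."]
    for example, label in FEW_SHOT:
        lines.append(f"Statement: {example}\nAnswer: {label}")
    lines.append(f"Statement: {claim}\nAnswer:")
    return "\n\n".join(lines)

print(build_giv_f_prompt("The Nile is the longest river in the world."))
```

If the model's reply does not match the expected one-word format, the GIV templates simply re-ask, which is cheap insurance for consistent parsing downstream.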

Step 6. RAG-Based Verification

  • What: Feed the LLM the top chunks as context with the claim; ask for a verdict (true/false) and a short rationale using only the provided evidence.
  • Why needed: Keeps the model grounded in real sources, reducing hallucinations.
  • Example: 'Using the passages above, is the marriage statement correct? Cite which chunk supports your answer.'

Step 7. Multi-Model Consensus

  • What: Run the same verification through several LLMs; take majority vote; if tie, call a judge model (either a bigger version of a model or a different commercial one).
  • Why needed: Smooths over single-model quirks; makes outcomes steadier.
  • Example: Four models say [True, True, False, False] → tie; judge model gives final decision.
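Majority voting with a tie-breaker fits in a few lines; representing the judge as a callable is an assumption made for this toy, not the paper's interface:

```python
def consensus(votes, judge=None):
    """Majority vote over model verdicts; an optional judge breaks ties."""
    trues = sum(votes)
    falses = len(votes) - trues
    if trues > falses:
        return True
    if falses > trues:
        return False
    # Tie: defer to the judge model (here, any callable returning True/False).
    return judge() if judge is not None else None

print(consensus([True, True, False]))                             # True
print(consensus([True, True, False, False], judge=lambda: True))  # True
```

In practice each vote would come from a real LLM call; because the calls are independent, they can run in parallel, which is why consensus latency sits near the slowest single model rather than the sum of all of them.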

Step 8. Metrics and Efficiency

  • What: Compute F1 for True (F1(T)) and False (F1(F)) separately to see strengths and weak spots; compute Consensus Alignment to see how much each model agrees with the crowd; track time per fact with outlier filtering.
  • Why needed: Many systems look good overall but fail at catching false facts; separating classes reveals this; timing shows real-life costs.
  • Example: A model might have F1(T)=0.85 (great at confirming truths) but F1(F)=0.30 (weak at spotting fakes). That matters in safety-critical uses.
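Class-wise F1 can be computed directly from the confusion counts; the always-true model in the sketch below shows why separate scores matter on imbalanced data like YAGO:

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 computed for one class (True or False) separately."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != cls and t == cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Mostly-true data (like YAGO): a model that always answers 'true'
# looks excellent on F1(T) while never catching a single fake.
y_true = [True] * 9 + [False]
y_pred = [True] * 10
print(f1_for_class(y_true, y_pred, True))   # about 0.947
print(f1_for_class(y_true, y_pred, False))  # 0.0
```

An aggregate accuracy of 90% would hide this failure completely; reporting F1(T) and F1(F) side by side is what exposes it.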

Secret sauce (what makes it clever):

  • Language-first transformation so search sees meaning, not messy IDs.
  • Multiple ranked questions to widen recall but keep focus.
  • Source filtering to avoid circular evidence.
  • Chunking passages so LLMs actually read the right parts.
  • Majority voting plus smart tie-breakers for stability.
  • Pre-collected SERPs and a mock API so every run is fair and repeatable.

Concrete mini-walkthrough:

  • Input: ⟨Marie_Curie, award, Nobel_Prize_in_Chemistry⟩.
  • Step 1: 'Marie Curie won the Nobel Prize in Chemistry.'
  • Step 2: Make 10 questions; keep the top 3 (e.g., 'Which Nobel Prize did Marie Curie win?').
  • Step 3–4: Retrieve pages, filter KG-base sources, keep top 10, chunk them.
  • Step 5–6: Ask LLM with or without RAG; get True/False + reason.
  • Step 7: If using consensus, collect votes and decide.
  • Output: Label + short explanation + timing.

04Experiments & Results

The test: FactCheck evaluates LLMs on three real knowledge graph datasets—FactBench (balanced mix), YAGO (almost all true), and DBpedia (large, diverse). It measures class-wise F1 (for True and False separately), consensus alignment (how much each model agrees with the group), and response time per fact. Methods compared: internal-only (DKA, GIV-Z, GIV-F), RAG, and multi-model consensus.

The competitors: Four open-source models (Gemma2 9B, Qwen2.5 7B, Llama3.1 8B, Mistral 7B) and one commercial model (GPT-4o mini). Consensus uses majority vote across open-source models with tie-breakers (a scaled-up model or GPT-4o mini).

Scoreboard with context:

  • FactBench (friendly mix):
    • Best internal-only: Gemma2 with GIV-F around F1(T)=0.79 and F1(F)=0.76 (like getting two solid B+ grades). Mistral GIV-F is similar.
    • RAG lifts most boats: up to about F1(T)=0.91 and F1(F)=0.89 (A/A− level), for both Gemma2 and GPT-4o mini. So external evidence helps a lot here.
  • YAGO (very imbalanced, mostly true):
    • Everyone gets high F1(T) (up to ~0.92) but F1(F) is near 0.01–0.03—like barely catching any false facts. This shows the hard part: spotting rare errors.
    • RAG still helps on True but doesn’t fix the False scarcity problem much.
  • DBpedia (big and varied):
    • Internal-only: good F1(T) (up to ~0.89) but low F1(F) (often <0.40).
    • RAG gives mixed results: sometimes small gains, sometimes tiny dips—likely because schema diversity makes retrieval and evidence selection trickier.

Surprises:

  • Open-source models (Gemma2, Mistral) sometimes beat GPT-4o mini when using only internal knowledge. Bigger or commercial isn’t automatically better.
  • RAG is powerful but not magic. On DBpedia, evidence complexity and retrieval noise can flatten or reverse gains.
  • Multi-model consensus stabilizes performance but doesn’t always top the best single model. The specific tie-breaker choice (bigger open-source vs. GPT-4o mini) rarely changes outcomes much.
  • Agreement grows with RAG: tie rates drop to ~6–9% vs. ~21–26% in zero-shot guided runs. External context nudges models to similar conclusions—good for consistency, but watch for shared bias.

Timing (speed vs. smarts):

  • DKA is fastest: ~0.2–0.3s per fact (like a sprint), but less powerful.
  • GIV-Z/F add structure and time: ~0.4–0.9s (still quick, often better accuracy).
  • RAG is slowest: ~1.6–2.9s per fact (6–10× slower than DKA) because retrieval and reading take time.
  • Consensus can be parallelized, so latency is near the slowest model in the set; tie-breaks add a small overhead.

Takeaways:

  • Finding false facts is the toughest challenge (F1(F) lags), especially in imbalanced sets like YAGO.
  • RAG often gives top scores on balanced data but costs more time and is uneven on complex graphs like DBpedia.
  • Few-shot guidance (GIV-F) is a sweet spot for many internal-only runs: decent accuracy without the RAG time bill.
  • Majority voting is a dependable stabilizer, not a guaranteed top-scorer.

Big picture: If you need fast checks, use guided internal prompts; if you need higher assurance on tricky claims, pay the RAG cost; if you need steadiness, add consensus—then choose based on your time and budget.

05Discussion & Limitations

Limitations (what this can’t do well yet):

  • Stability: LLMs can still be inconsistent across datasets and prompts, especially for rare or tricky claims.
  • False detection is hard: Even with RAG, F1(F) often lags far behind F1(T), so spotting wrong facts remains the main weakness.
  • RAG variability: On schema-diverse graphs (DBpedia), retrieval and evidence selection may add noise or bias, sometimes hurting accuracy.
  • Ensemble ceilings: Majority voting steadies results but doesn’t consistently beat the best single model.
  • Real-world frictions: Retrieval failures (small but real), content blocks, and regional restrictions can affect evidence quality.

Required resources:

  • Compute: Local LLMs benefit from strong CPUs/GPUs; RAG adds I/O and ranking overhead.
  • Engineering: You need a pipeline to transform triples, generate questions, manage retrieval, filter, chunk, and prompt reliably.
  • Monitoring: Token/time tracking and error analysis (ideally via the provided web app) to spot bottlenecks.

When NOT to use certain modes:

  • Skip RAG when latency must be sub-second or bandwidth is limited; prefer guided internal prompts (GIV-F).
  • Avoid consensus if you can’t parallelize and latency is critical.
  • Be cautious with internal-only checks on heavily imbalanced data (like YAGO)—they may miss rare false facts.
  • Be wary of sensitive domains if external sources might be blocked or filtered.

Open questions:

  • Can fine-tuning LLMs specifically for KG validation lift F1(F) without huge compute costs?
  • What hybrid retrieval schemes (graph traversal + web RAG) best handle schema diversity and reduce noise?
  • How do we design tie-breakers that are both cheap and smart—perhaps with lightweight meta-reasoners?
  • Can we calibrate confidence so systems know when to escalate to humans?
  • How do we fairly measure long-term stability as graphs and the web both evolve?

Bottom line: FactCheck is a big step toward trustworthy graph validation with LLMs, but the field still needs better false-spotting, smarter retrieval, and efficient, reliable ensembles.

06Conclusion & Future Work

Three-sentence summary: FactCheck is a fair, realistic benchmark that tests how LLMs validate knowledge graph facts from three angles: internal knowledge, retrieval-augmented evidence, and consensus. It supplies real datasets, a 2M+ document evidence pool, a mock API, and a web app to measure accuracy, agreement, and cost. Results show RAG often wins but is slow and uneven across datasets, open-source models can rival commercial ones, and consensus stabilizes performance without always beating the best single model.

Main achievement: Creating a reproducible, end-to-end playground—with data, evidence, tools, and metrics—that reveals not only which strategies work, but when, why, and at what cost.

Future directions: Fine-tune models for KG validation to better catch false facts; combine structured KG traversal with unstructured RAG to improve evidence quality; develop smarter, faster tie-breakers and confidence calibration; expand benchmarks to include logical rules (like domain/range and transitivity) and time-aware facts.

Why remember this: In a world that runs on connected facts, getting those facts right—fast and fairly—is essential. FactCheck doesn’t just chase higher scores; it shows the trade-offs between accuracy, stability, and speed, so teams can pick the right tool for the job. It points the way to more reliable AI systems that power search, recommendations, research, and everyday decisions.

Practical Applications

  • Evaluate your LLM-based fact-checker on FactBench, YAGO, and DBpedia to find strengths and weaknesses.
  • Adopt triple-to-sentence transformation to improve retrieval and LLM understanding in your pipeline.
  • Use multiple ranked questions per claim to widen evidence coverage before filtering and reranking.
  • Filter out circular sources (e.g., the KG’s own base) to avoid self-confirmation in evidence.
  • Choose a mode by need: GIV-F for speed, RAG for higher assurance, consensus for stability.
  • Measure F1(True) and F1(False) separately to track how well you catch wrong facts, not just confirm true ones.
  • Leverage the mock API to reproduce retrieval results and fairly compare prompting strategies.
  • Parallelize consensus inference to keep latency near your slowest model while improving steadiness.
  • Build few-shot templates that force structured outputs and retry on non-conformant responses.
  • Use the web app’s error analysis views to target fixes (e.g., reduce geography/nationality mix-ups).
#Knowledge Graph Validation#Fact Checking#Large Language Models#Retrieval-Augmented Generation#Benchmarking LLMs#Multi-Model Consensus#F1 Score#Consensus Alignment#DBpedia#YAGO#FactBench#Cross-Encoder Reranking#Chunking#Mock Search API