The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Key Summary
- The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.
- It uses careful automatic judges (and human validation during construction) to score answers for both completeness and mistakes.
- FACTS Multimodal checks if a model can look at an image and explain it correctly without contradicting the facts.
- FACTS Parametric tests if a model can recall tough, user-interest facts from what it learned (no searching allowed), using carefully filtered, Wikipedia-backed questions.
- FACTS Search measures how well a model uses a shared web search tool to answer multi-step and rare-entity questions.
- FACTS Grounding v2 checks if long answers are supported by a provided document and actually answer the user's request, using improved judge models.
- All four scores are averaged into one FACTS Score to compare models fairly and reduce overfitting, with public and private test splits.
- Top models still miss many facts (best overall around 69%), showing big room for improvement.
- Different model families show different styles: some cover more facts but risk contradictions; others avoid contradictions but miss details.
- This benchmark helps people pick reliable models, find weak spots, and track real progress on factuality over time.
Why This Research Matters
Reliable AI isn't just about sounding smart; it's about being right, complete, and careful across many real situations. The FACTS Leaderboard helps builders and buyers compare models fairly so they can choose the most trustworthy ones for work and study. It spotlights strengths and weaknesses, guiding teams to improve image understanding, memory, web search, and grounding. Hospitals, banks, schools, and governments can reduce risk by preferring models that score well on the parts that matter for them. Because the top score is far from perfect, the suite clearly shows where the field must advance next. Over time, this enables safer, clearer, and more useful AI assistance for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you play a trivia game, it's not enough to be fast: you also have to be right, and sometimes you need to check a book or a website to be sure.
🥬 Filling (The Actual Concept): What it is: The FACTS Leaderboard is a big, fair test that checks how well AI tells the truth in different real-life situations: looking at pictures, remembering facts, searching the web, and sticking to a given document. How it works: 1) It splits factual truth-telling into four challenges: Multimodal (image+text), Parametric (memory-only facts), Search (use a web search tool), and Grounding v2 (use only the provided document). 2) Each challenge has its own evaluation recipe and an automated judge model to score answers. 3) The scores get averaged into one number called the FACTS Score so different AIs can be compared fairly. Why it matters: Without this, we might think an AI is great just because it's good at one thing, like images, but it could still struggle with memory, search, or sticking to the given text.
🍞 Bottom Bread (Anchor): Imagine choosing a calculator for school. One is great at fractions but messes up long division. Another is great at long division but fails decimals. The FACTS Leaderboard is like a test that checks all the parts of math and gives you one fair overall score.
The World Before: Large language models (LLMs) got very good at talking and solving tasks, but they still made stuff up sometimes ("hallucinations"). People built many tests, but most focused on just one part, like answering questions from memory, or summarizing a document, or browsing the web. This made it hard to know if a model was truly trustworthy across different situations.
The Problem: Real users mix tasks all the time: they ask about a chart in an image, need a fact the model should know, and sometimes require the model to search the web or stick exactly to a company report. No single benchmark covered all these angles in a consistent way with strong judging.
Failed Attempts: 1) Single-slice tests (only QA-from-memory or only document grounding) gave a partial picture. 2) Benchmarks often saturated: top models hit very high scores, leaving little room to see improvements. 3) Human-only judging didn't scale well; auto-judging was sometimes noisy or biased. 4) Web-search tests used different search tools across models, making results hard to compare. 5) Public-only datasets encouraged overfitting: models could "study the test."
The Gap: We needed a suite that (a) spans key factuality settings, (b) uses consistent and validated automated judges, (c) controls for tool differences (same search API for everyone), (d) resists overfitting (public/private splits), and (e) combines everything into a single, balanced score.
🍞 Top Bread (Hook): Imagine a science fair where every project is judged by different rules: one judge wants drawings, another wants code, another wants a speech. It's confusing and unfair.
🥬 Filling (The Actual Concept): Factuality Measurement: What it is: A clear way to decide if an answer is true, complete, and free of contradictions. How it works: 1) Define what a correct answer must include. 2) Check for key facts (coverage). 3) Check for mistakes or contradictions. 4) Count only answers that are both complete and contradiction-free as accurate. Why it matters: If we skip any step, we might reward answers that are vague, lucky guesses, or confidently wrong.
🍞 Bottom Bread (Anchor): It's like grading a book report: you check that the student covered the main points and didn't claim something that isn't in the book.
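To make the four-step recipe concrete, here is a minimal Python sketch of the accuracy criterion for a single answer. The `Rubric` structure and the `covers_fact` / `contradicts_fact` callables are illustrative stand-ins (in the actual suite, these checks are made by validated judge models), not the leaderboard's implementation.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    essential_facts: list[str]       # facts a complete answer must include
    non_essential_facts: list[str]   # facts that are nice to include


def is_accurate(answer, rubric, covers_fact, contradicts_fact):
    """Return True only if the answer is both complete and contradiction-free.

    `covers_fact(answer, fact) -> bool` and `contradicts_fact(answer, fact) -> bool`
    stand in for judge-model checks; they are illustrative, not the real judges.
    """
    # Step 2: coverage -- every essential fact must appear in the answer.
    complete = all(covers_fact(answer, f) for f in rubric.essential_facts)

    # Step 3: contradiction check -- no claim may conflict with any known fact
    # (simplified here to just the rubric facts).
    all_facts = rubric.essential_facts + rubric.non_essential_facts
    contradiction_free = not any(contradicts_fact(answer, f) for f in all_facts)

    # Step 4: only answers that pass both checks count as accurate.
    return complete and contradiction_free
```

Counting an answer as accurate only when both checks pass is exactly what keeps vague answers and confident errors from scoring.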
Real Stakes: People use AI to read medical reports, understand finances, study history, and even help with company documents. If the AI is wrong or ungrounded, even just sometimes, it can cause confusion, bad decisions, or safety risks. A benchmark that truly reflects everyday needs helps builders improve models in the right ways and helps users pick tools they can trust.
02 Core Idea
🍞 Top Bread (Hook): Imagine testing a car not just on straight roads but also on hills, rain, and traffic. One road can't tell you if it's truly safe everywhere.
🥬 Filling (The Actual Concept): The Aha! Moment: Measure factual truthfulness across four different "roads" (image tasks, memory-only facts, search-based answers, and document-grounded writing), judge them carefully, and average the results into one fair score. Why this changes things: It balances strengths and weaknesses; no more winning by being good at only one thing.
Multiple Analogies:
- Sports Decathlon: One athlete might sprint fast (Search) but struggle in pole vault (Grounding). The decathlon score (FACTS Score) tells who's best overall.
- Healthy Plate: You need veggies (Multimodal), protein (Parametric), grains (Search), and fruit (Grounding). The balanced meal is the overall score.
- Orchestra: Strings, brass, woodwinds, and percussion must all play well. The symphony (FACTS Score) shows how the whole team performs.
Before vs After:
- Before: Benchmarks spotlighted narrow skills; top models sometimes looked similar because tasks were saturated, and results were messy due to different tools or judges.
- After: One suite covers major real-world factuality needs, uses validated automated judges, standardizes the search tool, and resists overfitting with public/private splits. The top score is ~69%, not 99%, so we can still see meaningful progress.
🍞 Top Bread (Hook): You know how when you build with LEGO, you need the right blocks in the right order?
🥬 Filling (The Actual Concept): Building Blocks:
- FACTS Multimodal: Integrate image understanding with world knowledge.
- FACTS Parametric: Recall tough facts purely from what the model learned.
- FACTS Search: Use the same web search API to solve multi-step and rare questions.
- FACTS Grounding v2: Write long answers supported by a given document and actually answer the question.
- Automated Judge Models: Specialized models that check coverage, contradictions, and eligibility.
- FACTS Score: Simple average of the four sub-scores. Why it works: By testing four complementary abilities and scoring with robust judges, the suite exposes blind spots (like contradictions, missing facts, over- or under-confidence) that single tests would miss.
🍞 Bottom Bread (Anchor): Like a school report card that averages math, reading, science, and art: now you know the student's overall strengths and weaknesses, not just one grade.
03 Methodology
At a high level: Input (a model to evaluate) → Four tasks (Multimodal, Parametric, Search, Grounding v2) → Task-specific auto-judging → Average into FACTS Score.
Step 1: FACTS Multimodal (images + questions) 🍞 Top Bread (Hook): Imagine looking at a photo of a train and being asked, "What model is it and when was it introduced?" 🥬 Filling (The Actual Concept): What happens: 1) Each question has a human-made rubric listing Essential facts (must include) and Non-Essential facts (nice to include). 2) The model answers. 3) An autorater checks two things: Coverage (did you include the essential facts?) and No-Contradiction (did you avoid saying anything that conflicts with the rubric, the image, or common knowledge?). 4) Only answers that pass both count as accurate. Why this step exists: Without Coverage, a model could be too vague; without No-Contradiction, it could be confidently wrong. Example: An image of a specific locomotive: if the model names the wrong introduction year, it fails No-Contradiction even if it describes the train well. Secret Sauce: Dual-verdict judging catches both missing pieces and wrong claims. Validation: Human comparisons show solid alignment (Spearman 0.64 and macro F1 ~72 for Coverage; macro F1 ~78 for No-Contradiction). 🍞 Bottom Bread (Anchor): It's like a checklist for a science diagram: include the main labels (coverage) and don't mislabel anything (no-contradiction).
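As a rough illustration of how the dual verdicts become a benchmark number, here is a sketch that turns per-item Coverage and No-Contradiction verdicts into an accuracy. The item fields and the two autorater callables are assumptions for illustration, not the suite's actual interfaces.

```python
def multimodal_accuracy(items, answers, autorate_coverage, autorate_no_contradiction):
    """Fraction of answers that pass BOTH verdicts.

    Each item is assumed to look like {"image": ..., "question": ..., "rubric": ...},
    where the rubric lists Essential and Non-Essential facts. The two autorater
    callables stand in for the judge model's Coverage and No-Contradiction checks.
    """
    passed = 0
    for item, answer in zip(items, answers):
        covered = autorate_coverage(answer, item["rubric"])              # essential facts present?
        consistent = autorate_no_contradiction(answer, item["rubric"],   # nothing conflicts with rubric,
                                               item["image"])            # image, or common knowledge
        if covered and consistent:
            passed += 1
    return passed / len(items)
```

The design point to notice is that a single contradiction zeroes out an otherwise well-covered answer, which is why precision-leaning and recall-leaning models end up with such different profiles on this task.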
Step 2: FACTS Parametric (memory-only facts) 🍞 Top Bread (Hook): Think of a history quiz where you can't look anything up; you must remember. 🥬 Filling (The Actual Concept): What happens: 1) Build a tough set of user-interest questions whose answers are supported by Wikipedia. 2) Adversarially sample: keep questions that multiple strong open models miss in closed-book mode. 3) Human verification confirms correctness, uniqueness, and Wikipedia support; fixes are allowed if they keep the original intent. 4) A grader (Gemini-2.5-Pro, sampled three times) marks responses as correct, incorrect, not-attempted, or unknown. Metrics: Accuracy (main), Attempted accuracy (among tries), Hedging rate (not-attempted), and F1 (balance of accuracy and attempted accuracy). Why this step exists: Memory matters; some tasks can't rely on search. Example: "Who played harmonica on The Rockford Files theme?" (Tommy Morgan). Secret Sauce: Adversarial filtering + Wikipedia backing yields challenging, clean, closed-book questions. 🍞 Bottom Bread (Anchor): It's like asking, "Who was President in 1995?" No Googling: either you know it (Bill Clinton) or you don't.
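A minimal sketch of the grading step follows. The paper states that the grader is sampled three times but does not spell out how the samples are combined, so majority voting here is an assumption, and `call_grader` is a hypothetical wrapper around the grading model.

```python
from collections import Counter

VERDICTS = {"correct", "incorrect", "not_attempted", "unknown"}


def grade_closed_book(question, gold_answer, response, call_grader, n_samples=3):
    """Sample the grader several times and return the most common verdict.

    `call_grader(question, gold_answer, response) -> str` stands in for a call to the
    grading model (Gemini-2.5-Pro in the paper). Majority voting over the samples is
    an assumed aggregation rule, not one stated in the paper.
    """
    votes = [call_grader(question, gold_answer, response) for _ in range(n_samples)]
    assert all(v in VERDICTS for v in votes), f"unexpected verdict in {votes}"
    verdict, _ = Counter(votes).most_common(1)[0]
    return verdict
```

Sampling the grader more than once is a cheap way to smooth out judge noise on borderline responses.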
Step 3: FACTS Search (use a shared web search tool) 🍞 Top Bread (Hook): Picture being a detective: you don't remember every fact, so you look up clues, carefully. 🥬 Filling (The Actual Concept): What happens: 1) All models use the same Brave Search API so the playing field is level. 2) The dataset mixes hard-tail questions, two-hop Wikipedia, multi-document synthesis, and knowledge-graph hop queries, all designed to require web searching. 3) A judge (Gemini 2.0 Flash) marks answers as correct, incorrect, or not-attempted. Metrics: Accuracy, Attempted accuracy, Hedging rate, F1, and average number of searches per question. Why this step exists: Many real tasks require browsing and combining sources; using the same search tool isolates model skill from tool differences. Example: "Among the films written by the creator of The Sopranos, which was released earliest?" Answer: "Grave of the Vampire." Secret Sauce: Standardized tool + carefully crafted multi-hop questions stress real browsing skill. 🍞 Bottom Bread (Anchor): Like a scavenger hunt where everyone has the same map: who navigates best?
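To show how the level playing field works in practice, here is a sketch of a simple search loop in which every evaluated model goes through the same search function. `model_step`, `brave_search`, and the `max_searches` budget are hypothetical stand-ins, not the benchmark's harness.

```python
def answer_with_search(question, model_step, brave_search, max_searches=10):
    """Run a simple search-and-answer loop; return (final_answer, num_searches).

    `model_step(question, evidence) -> (action, payload)` stands in for the model
    under test: action is "search" (payload = query) or "answer" (payload = text).
    `brave_search(query) -> list[str]` stands in for a wrapper around the shared
    Brave Search API, so tool quality is held constant across models.
    """
    evidence = []
    for num_searches in range(max_searches):
        action, payload = model_step(question, evidence)
        if action == "answer":                     # model commits to a final answer
            return payload, num_searches
        evidence.extend(brave_search(payload))     # same search tool for everyone
    # Search budget exhausted: ask the model for its best final answer anyway.
    _, final_answer = model_step(question, evidence)
    return final_answer, max_searches
```

Tracking `num_searches` per question is what yields the average-searches metric reported alongside accuracy, the figure behind the "efficient browsing" finding later on.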
Step 4: FACTS Grounding v2 (long answers must stick to the given document and answer the user's need) 🍞 Top Bread (Hook): Imagine your teacher hands you a chapter and says, "Answer using only this text." 🥬 Filling (The Actual Concept): What happens: 1) A model writes a long answer to a non-trivial request using only the provided context (sometimes up to 32k tokens). 2) Two judge models (Gemini 2.5 Flash and GPT-5) decide if the answer's claims are grounded in the document. 3) Eligibility check: Even if grounded, does it truly answer the user's question (not vague or evasive)? Ineligible answers are counted as inaccurate. Why this step exists: Without eligibility, a model could avoid errors by being too short or generic. Example: If asked to list the document's main causes of an event, "It had many causes" is ineligible. Secret Sauce: Dual judges + eligibility prevent gaming the metric. Judges and prompts were validated against human labels; updated prompts improved macro-F scores. 🍞 Bottom Bread (Anchor): Like grading a book report for both "used only the book" and "actually answered the question."
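A sketch of how the grounding verdicts and the eligibility check could combine into a per-response label follows. Treating a response as grounded only when both judge models agree is an assumption (the source does not specify the exact aggregation), and all callables are hypothetical.

```python
def grounding_v2_accurate(response, document, user_request, grounding_judges, is_eligible):
    """Return True only if the response is judged grounded AND eligible.

    `grounding_judges` is a list of judge callables (standing in for Gemini 2.5 Flash
    and GPT-5); requiring every judge to agree is an assumed aggregation rule.
    `is_eligible(response, user_request) -> bool` checks that the response actually
    answers the user's request rather than being vague or evasive.
    """
    grounded = all(judge(response, document) for judge in grounding_judges)
    if not grounded:
        return False
    # Eligibility filter: grounded-but-evasive answers still count as inaccurate.
    return is_eligible(response, user_request)
```

The key design point carried over from the benchmark is the ordering: eligibility sits on top of grounding, so an answer cannot score by being safely vague.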
Step 5: Aggregation and Overfitting Control 🍞 Top Bread (Hook): Think of combining quiz grades from four subjects into one report card, while also keeping some questions secret so nobody just memorizes the test. 🥬 Filling (The Actual Concept): What happens: 1) Each sub-benchmark reports its accuracy. 2) The FACTS Score is the simple average of the four accuracies. 3) Public/private splits prevent overfitting; Kaggle runs evaluations to maintain integrity. Why this step exists: Averaging balances strengths and weaknesses, and hidden splits ensure honest progress. Example: A model great at Search but weak at Grounding can't top the leaderboard unless it improves all-around. Secret Sauce: Simplicity (average) + safeguards (private sets) = robust comparisons. 🍞 Bottom Bread (Anchor): Like a music competition with surprise songs in the final round to make sure the winner isn't just rehearsed for one tune.
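The aggregation itself is deliberately simple; here is a sketch, assuming each sub-benchmark reports an accuracy in [0, 1]. The example values in the docstring are placeholders, not real leaderboard numbers.

```python
from statistics import mean


def facts_score(sub_scores):
    """Unweighted average of the four sub-benchmark accuracies.

    sub_scores example (placeholder values only):
        {"multimodal": 0.50, "parametric": 0.70, "search": 0.80, "grounding_v2": 0.60}
    """
    required = {"multimodal", "parametric", "search", "grounding_v2"}
    assert set(sub_scores) == required, f"expected exactly these keys: {required}"
    return mean(sub_scores.values())
```

Because every sub-score carries equal weight, a model cannot climb the leaderboard on one specialty alone, which is exactly the all-around behavior this step is designed to reward.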
Concept Minis (new terms used above):
- 🍞 Hook: You know how a grocery list has "must-buy" items. 🥬 Coverage: What it is: Checking if essential items are present. How it works: Compare the answer to the essential rubric facts. Why it matters: Missing essentials = not complete. Anchor: If the recipe says eggs and you forgot eggs, the cake fails.
- 🍞 Hook: Imagine not saying anything wrong while explaining. 🥬 No-Contradiction: What it is: Ensure no statements conflict with the rubric, image, or common knowledge. Why it matters: One big error can mislead. Anchor: Calling a cat a dog ruins the pet report.
- 🍞 Hook: Sometimes it's smarter to say "I'm not sure." 🥬 Hedging: What it is: Choosing not to answer when uncertain. Why it matters: Encourages caution over guessing. Anchor: Leaving a test question blank instead of writing a wild guess.
- 🍞 Hook: If you try fewer questions but do them well, how good are you when you actually try? 🥬 Attempted Accuracy: What it is: Accuracy only over answered questions. Why it matters: Shows quality when the model commits. Anchor: Your free-throw percentage only when you take the shot.
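The concept minis above map directly onto a few formulas. Here is a sketch; treating F1 as the harmonic mean of accuracy and attempted accuracy is an assumption based on the "balance of accuracy and attempted accuracy" description, and how "unknown" verdicts are handled is also assumed.

```python
def recall_metrics(verdicts):
    """Compute the Parametric/Search-style metrics from a list of verdict strings
    in {"correct", "incorrect", "not_attempted", "unknown"}.

    Assumptions: "unknown" counts as neither attempted nor hedged, and F1 is the
    harmonic mean of accuracy and attempted accuracy.
    """
    n = len(verdicts)
    correct = sum(v == "correct" for v in verdicts)
    attempted = sum(v in ("correct", "incorrect") for v in verdicts)

    accuracy = correct / n                                   # main headline metric
    attempted_accuracy = correct / attempted if attempted else 0.0
    hedging_rate = sum(v == "not_attempted" for v in verdicts) / n
    denom = accuracy + attempted_accuracy
    f1 = 2 * accuracy * attempted_accuracy / denom if denom else 0.0
    return {"accuracy": accuracy, "attempted_accuracy": attempted_accuracy,
            "hedging_rate": hedging_rate, "f1": f1}
```

For example, 60 correct, 20 incorrect, and 20 not-attempted answers give an accuracy of 0.60 but an attempted accuracy of 0.75, which is the hedging trade-off discussed in the results below.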
04 Experiments & Results
The Test: Each sub-benchmark measures what matters for its setting. Multimodal checks both coverage and no-contradiction, counting an answer as accurate only if it passes both. Parametric evaluates closed-book factual recall with accuracy, attempted accuracy, F1, and hedging. Search measures browsing skill under one shared API, again using accuracy, attempted accuracy, F1, hedging, and average searches. Grounding v2 measures whether long answers are grounded and eligible, using two judge models and an eligibility filter.
The Competition: The suite reports results for many leading models (e.g., Gemini 3 Pro, GPT-5, Claude 4.5 Opus, Grok 4, etc.), across public and private splits. Kaggle runs evaluations to prevent data leakage and keep the scoreboard trustworthy.
The Scoreboard (contextualized):
- Overall FACTS Score (average of the four parts): Top model Gemini 3 Pro scores about 68.8%, which is like getting a solid B when nobody gets an A, with clear room to grow. Gemini 2.5 Pro sits around 62.1%. GPT-5 ~61.8%. Others trail behind, with many in the 40–55% band, showing how hard the suite is.
- Multimodal: Gemini models lean recall-oriented (high coverage), while GPT models lean precision-oriented (highest no-contradiction). Top accuracies cluster around mid-40%, reflecting how tricky it is to be both complete and error-free on images.
- Parametric: Gemini 3 Pro leads with ~76% accuracy and low hedging (~1–2%). Some models improve attempted accuracy by hedging more (e.g., GPT-5 has lower raw accuracy than o3 but higher attempted accuracy and F1 because it hedges more). This reveals different uncertainty strategies.
- Search: Gemini 3 Pro tops with ~83.8% accuracy and conducts fewer searches on average than other top models, suggesting more efficient search planning. Grok models search a lot; Claude models often hedge more, pushing attempted accuracy high but lowering overall accuracy.
- Grounding v2: Accuracies vary widely. The dual-judge setup and eligibility filter stop "safe but vague" answers from scoring, surfacing true grounding quality.
Surprising Findings:
- Strategy trade-offs: Precision vs recall styles matter. Some models cover many essentials but risk contradictions; others avoid contradictions but miss key details.
- Hedging helps attempted accuracy: Models that admit uncertainty can look better on "attempted" scores, which is useful for safety-conscious deployments.
- Efficient browsing: The top Search model achieves strong results with fewer queries, hinting at better planning rather than brute-force searching.
- Autorater reliability: Validations (e.g., macro F1 ~78 for contradictions) show the judges can catch nuanced errors like small visual misreads or slight date mistakes.
05 Discussion & Limitations
Limitations:
- Not all factuality is covered: there are no video tasks, coverage of fast-changing facts is limited, and tool use beyond web search (like databases) isn't fully explored.
- Judge models aren't perfect: even with validation and dual judges, auto-raters can mislabel edge cases or be sensitive to phrasing.
- Tool dependence: Search results and ranking can shift; using a single API standardizes evaluation but ties performance to that tool's behavior.
- Overfitting risk remains: Public splits exist; private splits help, but repeated submissions may still drift toward leaderboard gaming.
- Long-context complexity: Grounding checks full answers, but subtle omissions can be hard to catch automatically.
Required Resources:
- Access to the submission platform (Kaggle) and the standardized evaluation setup.
- Ability to run your model on image QA, closed-book QA, tool-use with Brave Search API, and long-context grounding tasks.
- Compute to process thousands of prompts and handle multiple autorater calls per item.
When NOT to Use:
- Creative writing, opinions, or style-focused tasks where "truth" is subjective.
- Breaking news or rapidly changing facts (stock tickers, live sports) where ground truths shift daily.
- Math-proof or symbolic reasoning benchmarks; FACTS is about factuality, not formal proof.
- Multilingual evaluations (as defined here): the suite focuses on English.
Open Questions:
- Measuring tailness more precisely: Which rare entities cause the most failures, and can we predict them?
- Temporal drift: How to score changing facts while keeping stability and fairness?
- Cross-language factuality: How to extend to multilingual and multimodal (e.g., video) settings?
- Better calibration: Can we align hedging with calibrated probabilities and user risk preferences?
- Robustness to adversarial prompts: How do prompt injections or tricky instructions affect factuality and grounding?
- Judge advances: Can we create even more reliable, explainable, and unbiased auto-judges?
06 Conclusion & Future Work
Three-Sentence Summary: The FACTS Leaderboard is a balanced, four-part benchmark that tests whether AI answers are factually correct across images, memory-only questions, web search, and document-grounded writing. It uses validated automated judges, standardized tools, and public/private splits to keep scoring fair, robust, and hard to game. Results show meaningful gaps remain (top score ~69%), guiding research toward real improvements in truthfulness.
Main Achievement: Turning fragmented factuality testing into a single, holistic, and credible suite, complete with strong judging and one simple FACTS Score, so builders and users can see true all-around reliability.
Future Directions: Expand into video and rapidly changing facts, strengthen judge reliability and explainability, add multilingual coverage, refine tailness measurement, and model uncertainty calibration. Explore tool-use beyond search (e.g., databases, calculators) under standardized conditions.
Why Remember This: FACTS sets a new bar for what "being factual" means in practice (cover the essentials, avoid contradictions, ground long answers, and search wisely), and it tracks these skills together so progress is honest, visible, and user-relevant.
Practical Applications
- Select an enterprise LLM by comparing FACTS Scores and sub-scores aligned to your use case (e.g., prioritize Grounding for financial reporting).
- Set model improvement goals by targeting the weakest sub-benchmark (e.g., raise Multimodal No-Contradiction by 5%).
- Tune browsing agents using the same search API to improve FACTS Search accuracy with fewer queries.
- Design data curation and fine-tuning strategies guided by Parametric failures on Wikipedia-backed facts.
- Adopt eligibility checks in production to block vague but technically "safe" answers in grounded workflows.
- Calibrate uncertainty policies (when to hedge) by monitoring attempted accuracy vs. raw accuracy trade-offs.
- Evaluate visual RAG pipelines by combining image features with rubric-based coverage and contradiction checks.
- Use public/private split practices to reduce overfitting in internal benchmarks and vendor evaluations.
- Run A/B tests on prompt templates to boost grounding and reduce contradictions in long-form outputs.
- Create red-team tests using adversarial sampling methods to keep internal QA sets challenging over time.