QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Key Summary
- QuCo-RAG is a new way to decide when an AI should look things up while it writes, using facts from its training data instead of its own shaky confidence.
- It does two checks: before writing, it looks for rare names and terms; during writing, it checks if the mentioned entities have ever appeared together in the training corpus.
- If an entity is very rare or two entities never co-occur, QuCo-RAG triggers retrieval to fetch evidence and avoid hallucinations.
- This uses Infini-gram, a super-fast tool that can count appearances of words and phrases across trillions of tokens in milliseconds.
- Across tough multi-hop question-answering tasks, QuCo-RAG boosts Exact Match by 5–12 points on OLMo-2 models and up to 14 points when transferred to Llama, Qwen, and GPT models.
- It stays efficient, needing fewer tokens and calls than many dynamic RAG baselines while achieving the best accuracy.
- The approach works well even in specialized domains like biomedicine without extra tuning.
- Because it leans on objective corpus statistics, QuCo-RAG is less fooled by the AI's overconfidence and catches "confident hallucinations."
- Limitations include alias handling (e.g., "NYC" vs. "New York City") and the need to update statistics as the world changes.
Why This Research Matters
This work makes AI answers more trustworthy by checking facts against what the AI actually learned from. Instead of trusting the model's feelings (which can be overconfident), it asks the training data for proof before and during writing. That means fewer confident mistakes in schoolwork, research, and professional tools. It stays fast thanks to efficient indexing, so you get better answers without big delays. The idea transfers across many models, even those with private training data, making it practical at scale. It also works in specialized areas like medicine, where accuracy really matters.
Detailed Explanation
01 Background & Problem Definition
You know how sometimes you're sure you remember a fact, but when you check your notes, you realize your memory was wrong? Computers that write, called large language models (LLMs), can feel that way too: they can sound confident and still be wrong.
Top Bread (Hook): Imagine taking a test and always sounding 100% sure, even when you're guessing. The Actual Concept: Hallucination is when an AI writes something that sounds plausible but is factually wrong.
- How it works: 1) The model tries to predict the next word. 2) It doesn't always truly "know," but still produces fluent text. 3) Without checking outside facts, it can invent details.
- Why it matters: Without a way to verify, the AI can mislead people confidently. Bottom Bread (Anchor): If you ask, "Who discovered the planet Krypton?" a hallucinating model might confidently tell you a fake answer because Krypton is fictional.
Top Bread (Hook): You know how you sometimes look up a fact while writing a report? The Actual Concept: Retrieval-Augmented Generation (RAG) is when an AI pauses to fetch supporting documents before or during writing.
- How it works: 1) Turn the question into a search query. 2) Retrieve relevant passages. 3) Use them as evidence while generating the answer.
- Why it matters: Without RAG, the AI relies only on its memory, which can be incomplete. Bottom Bread (Anchor): When asked "When did Marie Curie win her first Nobel Prize?", RAG retrieves a page that mentions 1903 and uses it to answer correctly.
Top Bread (Hook): Imagine solving a mystery where new clues appear as you think. The Actual Concept: Dynamic RAG is a smarter version of RAG that decides when to retrieve in the middle of writing, only when needed.
- How it works: 1) Start answering. 2) Watch for moments of uncertainty or missing info. 3) Retrieve exactly then. 4) Continue writing with the new evidence.
- Why it matters: Without dynamic timing, the system either retrieves too little (miss clues) or too much (waste time). Bottom Bread (Anchor): For "Are the directors of Il Seduttore and The Trial of Joan of Arc from the same country?", the model may need to fetch each director's nationality at different steps.
Top Bread (Hook): You know how your "confidence" doesn't always match whether you're right? The Actual Concept: Model-internal uncertainty (like token probabilities or entropy) is the model's own guess about how sure it is.
- How it works: 1) The model assigns scores to words. 2) Higher scores feel "confident." 3) But scores can be miscalibrated.
- Why it matters: If scores don't track truth, retrieval may trigger at the wrong times. Bottom Bread (Anchor): A method might mark the word "Il" as "uncertain" but happily output a wrong director name with low "uncertainty."
Top Bread (Hook): Picture a giant library of everything the model read while learning. The Actual Concept: A pre-training corpus is the huge collection of text used to train an LLM.
- How it works: 1) Gather web pages, books, and articles. 2) Train the model to predict next words from that text. 3) The model's knowledge mirrors what's common in the corpus.
- Why it matters: If something barely appears in the corpus, the model is less likely to have memorized it well. Bottom Bread (Anchor): If "Silas Hardy" appears only a few times in the corpus, the model may struggle to recall his birth date without retrieval.
Top Bread (Hook): Think of rare animals in a rainforest: hard to spot, easy to misidentify. The Actual Concept: Long-tail knowledge means facts about rare entities or topics that appear infrequently in training.
- How it works: 1) Count how often an entity appears. 2) Rare counts = long-tail. 3) These are hard for models to remember.
- Why it matters: Long-tail facts are prime spots for hallucinations. Bottom Bread (Anchor): An obscure 19th-century senator may be "long-tail," so the model benefits from retrieval to avoid mistakes.
Before this paper, dynamic RAG often depended on those internal uncertainty scores, numbers that can be misleading because LLMs are not perfectly calibrated. Researchers tried clever formulas (like entropy trends) and even made models generate special self-check tokens. But a big problem remained: the model could be very confident and very wrong at the same time.
What was missing was an objective referee, something outside the model's feelings. The authors noticed that the training corpus already holds clues about what the model likely knows well (frequent entities) and where it's shaky (rare entities). Even more, if two entities have never been seen together in all that training data, claims that connect them are risky without evidence. This insight is powerful because it moves from "how confident do I feel?" to "what does the data say?"
The stakes are real. Wrong answers in homework confuse students. Wrong citations mislead readers. In medicine or law, mistakes can have serious consequences. A system that knows when it doesn't know, and then looks things up, makes AI more trustworthy in daily life.
02 Core Idea
Top Bread (Hook): Imagine you're writing a report with a huge library behind you. You don't guess: you check the library whenever your topic is rare or your claim is unbacked. The Actual Concept: The key idea is to trigger retrieval based on objective statistics from the pre-training corpus: low-frequency entities (before writing) and zero co-occurrence between entities in generated claims (during writing).
- How it works: 1) Before generating, count how often the question's entities appear in the corpus; if they're rare, retrieve. 2) During generation, extract (head, relation, tail) triplets from each sentence. 3) Check if head and tail ever appear together in the corpus; if never, retrieve and regenerate that sentence. 4) Use Infini-gram to do all counts in milliseconds. (A minimal code sketch of these two triggers follows this block.)
- Why it matters: Without this, the system trusts its own shaky confidence scores and misses many hallucinations. Bottom Bread (Anchor): If the model writes "Mario Camerini directed Il Seduttore," but the entities don't co-occur in the corpus, QuCo-RAG flags it, retrieves, and corrects to "Franco Rossi."
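Here is a minimal sketch of those two triggers in Python. The helpers `entity_count` and `cooccurrence_count` are hypothetical stand-ins for queries against a pre-training-corpus index (such as Infini-gram), and the frequency threshold is an assumed illustrative value rather than the paper's exact setting.

```python
# Minimal sketch of QuCo-RAG's two corpus-grounded triggers (illustrative only).
# `entity_count` and `cooccurrence_count` are hypothetical callables backed by a
# corpus index; FREQ_THRESHOLD is an assumed value for illustration.

FREQ_THRESHOLD = 100

def should_retrieve_before_writing(question_entities, entity_count):
    """Pre-generation trigger: retrieve if the question's entities are rare on average."""
    counts = [entity_count(e) for e in question_entities]
    if not counts:
        return False
    return sum(counts) / len(counts) < FREQ_THRESHOLD

def should_retrieve_for_sentence(triplets, cooccurrence_count):
    """Runtime trigger: retrieve if any (head, tail) pair never co-occurs in the corpus."""
    return any(cooccurrence_count(head, tail) == 0 for head, _, tail in triplets)
```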
Three analogies to see it clearly:
- Cookbook chef: Before cooking a rare dish (low-frequency ingredient), the chef looks up a recipe; while cooking, if a step pairs two ingredients that never go together (zero co-occurrence), the chef rechecks the book.
- Map + street signs: You plan a route (pre-check for rarity), then as you drive, if you see a weird combo of signs you've never seen together (zero co-occurrence), you pause to recheck the map.
- School report: If your topic is obscure, you gather sources first; later, if a claim links two names that your notes never linked, you look up a source before keeping it.
Before vs. After:
- Before: Retrieval decisions depended on the model's internal "feelings" (probabilities, entropy), which often mislabeled the wrong parts as safe.
- After: Retrieval depends on what the corpus says: rare entities trigger help early, and never-seen-together entities trigger verification on the fly. Fewer wasted retrievals, more timely corrections.
Why it works (intuition):
- A model's memory reflects its diet. Frequent entities are better memorized; rare ones are shaky. If two entities never co-occur in the training corpus, then any strong claim linking them should be treated as suspect without evidence. The co-occurrence signal is asymmetric as evidence: seeing two entities together doesn't prove a specific relation, but never seeing them together strongly suggests risk, making it a great trigger for retrieval.
Building blocks (as mini concepts):
- Hook: You know how notes help you keep track of who-did-what?
The Concept: Triplet extraction turns a sentence into small fact units (head, relation, tail).
- How: 1) Read a sentence. 2) Find the main entity (head), the relation (like "directed by"), and the other entity (tail). 3) Output (h, r, t).
- Why: Without triplets, you can't efficiently check claims. Anchor: "Beowulf & Grendel was directed by Sturla Gunnarsson" becomes (Beowulf & Grendel, directed by, Sturla Gunnarsson), as sketched below.
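To make that (head, relation, tail) unit concrete, here is a tiny sketch using a named tuple. The `extract_triplets` function is a hard-coded illustration only; in QuCo-RAG the extraction is done by a small distilled model, described later in the Methodology section.

```python
from typing import List, NamedTuple

class Triplet(NamedTuple):
    head: str      # the main entity the sentence is about
    relation: str  # the link being claimed, e.g. "directed by"
    tail: str      # the other entity in the claim

def extract_triplets(sentence: str) -> List[Triplet]:
    """Hard-coded illustration; a real extractor (the paper's 0.5B model) handles any sentence."""
    if sentence == "Beowulf & Grendel was directed by Sturla Gunnarsson":
        return [Triplet("Beowulf & Grendel", "directed by", "Sturla Gunnarsson")]
    return []
```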
- Hook: Think of counting how many times a name shows up in your textbook.
The Concept: Entity frequency tells how common or rare an entity is in the corpus.
- How: 1) Query the index for the entity's surface form. 2) Get a count. 3) Average across all entities in the question.
- Why: Rare entities signal long-tail risk and trigger early retrieval. Anchor: If "Lee Mantle" appears 180 times and "Silas Hardy" 258 times, the average helps decide if retrieval is needed (worked through in the snippet below).
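Here is the arithmetic of that anchor worked through in code. The counts come from the example above; the threshold is an assumed value chosen inside the 10^1 to 10^3 range that the paper reports as stable.

```python
# Worked example of the pre-check arithmetic (counts from the anchor above;
# THRESHOLD is an assumed value inside the range reported as stable).
counts = {"Lee Mantle": 180, "Silas Hardy": 258}

average = sum(counts.values()) / len(counts)   # (180 + 258) / 2 = 219.0
THRESHOLD = 1000

needs_initial_retrieval = average < THRESHOLD
print(average, needs_initial_retrieval)        # 219.0 True -> retrieve before answering
```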
- Hook: Like checking if two names ever show up on the same page.
The Concept: Entity co-occurrence counts how often two entities appear within the same window (e.g., paragraph).
- How: 1) Take (h, t). 2) Search corpus windows for both present. 3) Count hits.
- Why: If count = 0, a linking claim is risky and should trigger retrieval. Anchor: If "Xawery Żuławski" and his mother "Małgorzata Braunek" co-occur, the claim has some support; if they never co-occur, retrieve.
- Hook: Picture a super-fast librarian who can answer, "How often does this phrase appear?" almost instantly.
The Concept: Infini-gram is a tool that counts n-grams over trillions of tokens with millisecond latency.
- How: 1) Build a special index over the corpus. 2) Answer frequency and co-occurrence queries quickly.
- Why: Without speed, checking every sentence would be too slow. Anchor: While generating, the system pings Infini-gram to check if two names ever show up together before accepting the sentence (see the query sketch below).
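As a rough illustration of what such a lookup looks like, the sketch below queries the public infini-gram web API for a phrase count. The endpoint, index name, and payload fields are assumptions based on the public demo service; the paper's own deployment (including how it runs co-occurrence queries) may differ.

```python
import requests

def corpus_count(phrase: str, index: str = "v4_dolma-v1_7_llama") -> int:
    """Rough sketch of a frequency lookup against the public infini-gram API.
    Endpoint, index name, and payload fields are assumptions based on the public demo."""
    resp = requests.post(
        "https://api.infini-gram.io/",
        json={"index": index, "query_type": "count", "query": phrase},
        timeout=10,
    )
    return resp.json().get("count", 0)

# Hypothetical usage: corpus_count("Sturla Gunnarsson") -> number of corpus occurrences
```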
03 Methodology
At a high level: Question → Pre-Generation Knowledge Assessment (entity frequency) → Optional initial retrieval → Generation loop → Per-sentence Triplet Extraction → Co-occurrence Check via Infini-gram → If zero, retrieve and regenerate; else continue → Final answer.
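That control flow can be summarized in a compact sketch. Every component here (`extract_entities`, `extract_triplets`, the `corpus` index, the `retriever`, and the `llm` wrapper) is an injected placeholder standing in for, respectively, an entity recognizer, the distilled triplet extractor, an Infini-gram-style index, BM25, and the generating model; the names and signatures are illustrative assumptions, not the paper's code.

```python
def quco_rag_answer(question, llm, corpus, retriever,
                    extract_entities, extract_triplets,
                    freq_threshold=100, max_sentences=10):
    """Illustrative sketch of the QuCo-RAG control flow; all components are placeholders."""
    context = []

    # Step 1: pre-generation knowledge assessment (input uncertainty).
    entities = extract_entities(question)
    counts = [corpus.count(e) for e in entities]
    if counts and sum(counts) / len(counts) < freq_threshold:
        context += retriever.search(question, k=3)             # fetch initial evidence

    # Step 2: sentence-by-sentence generation with runtime claim verification.
    answer = []
    for _ in range(max_sentences):
        sentence = llm.next_sentence(question, context, answer)
        if sentence is None:                                    # model signals it is finished
            break
        triplets = extract_triplets(sentence)
        unsupported = [(h, r, t) for h, r, t in triplets
                       if corpus.cooccurrence(h, t, window=1000) == 0]
        if unsupported:
            head, relation, _ = unsupported[0]
            context += retriever.search(f"{head} {relation}", k=3)   # query from head + relation
            sentence = llm.next_sentence(question, context, answer)  # regenerate with evidence
        answer.append(sentence)
    return " ".join(answer)
```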
Step 1: Pre-Generation Knowledge Assessment (Input Uncertainty)
- What happens: The system extracts key entities from the question (like person names, places, organizations) and looks up how often each appears in the pre-training corpus. If the average frequency is below a threshold, it triggers retrieval before writing starts.
- Why this step exists: Rare entities live in the long tail; the model likely hasn't memorized them well. Skipping this step risks starting from a shaky base and hallucinating early.
- Example: "Who was born earlier, Silas Hardy or Lee Mantle?" If both names are rare, the system fetches background documents first so it doesn't guess.
Top Bread (Hook): Think of scanning a class roster to see how common each name is. The Actual Concept: Entity frequency pre-check estimates how likely the model is to know the topic.
- How it works: 1) Extract entities from the question. 2) Query the corpus index for each entity's count. 3) Average the counts; compare to a threshold (e.g., 10^1–10^3). 4) If below, retrieve top-3 docs with BM25 and prepend them to the prompt (sketched in code below).
- Why it matters: Without the pre-check, the model may start wrong and never recover. Bottom Bread (Anchor): Before answering about "Silas Hardy," the system sees he's rare and retrieves a short bio.
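A slightly more concrete sketch of this step, using the rank_bm25 package for the top-3 retrieval. The entity list and the `corpus_count` lookup remain placeholders (they would come from an entity recognizer and an Infini-gram-style index), and the threshold default is an assumed illustrative value.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def pregeneration_check(question, entities, corpus_count, passages, freq_threshold=100):
    """Step 1 sketch: if the question's entities are rare on average, fetch top-3 passages.

    `entities` would come from an entity recognizer, `corpus_count` from a corpus
    index, and `passages` is the retrieval collection (e.g., Wikipedia paragraphs).
    """
    counts = [corpus_count(e) for e in entities]
    average = sum(counts) / len(counts) if counts else 0.0
    if average >= freq_threshold:
        return []                                  # topic looks well known; no pre-retrieval

    bm25 = BM25Okapi([p.split() for p in passages])
    return bm25.get_top_n(question.split(), passages, n=3)  # prepend these to the prompt
```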
Step 2: Generation Loop with Runtime Claim Verification (Output Uncertainty)
- What happens: After each generated sentence, a small model extracts fact triplets (head, relation, tail). Then the system checks if head and tail co-occur anywhere in the corpus within a window (e.g., 1,000 tokens). If the minimum co-occurrence across triplets is zero, it triggers retrieval and regenerates that sentence using retrieved evidence.
- Why this step exists: Even if you start with good context, later claims can go off track. This step catches confident but unsupported statements.
- Example: If the model writes "Silas Hardy was born on 30 April 1867," but "Silas Hardy" never co-occurs with that date or connected entities, the system retrieves and corrects it.
Top Bread (Hook): Imagine writing one sentence at a time and asking, "Does this sentence have backup?" The Actual Concept: Co-occurrence-based runtime check validates each sentence's factual links.
- How it works: 1) Extract triplets from the sentence. 2) For each (h, r, t), query co-occurrence of (h, t). 3) If any pair has count = 0, treat as high risk and retrieve. 4) Build a semantic query using (h + r), retrieve top docs, and regenerate the sentence.
- Why it matters: Without it, errors that slip in mid-answer remain uncorrected. Bottom Bread (Anchor): Writing "The film was directed by Mario Camerini" triggers a check; if the film's title and "Mario Camerini" never appeared together in training, retrieval fixes it. (A code sketch of this check follows.)
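A focused sketch of that per-sentence check. `extract_triplets`, `cooccurrence_count`, `retrieve`, and `regenerate` are placeholders for the distilled extractor, the corpus-index query, the BM25 retriever, and another LLM call; the function names and the exact regeneration interface are assumptions for illustration.

```python
def verify_and_fix_sentence(sentence, context, extract_triplets,
                            cooccurrence_count, retrieve, regenerate, window=1000):
    """Step 2 sketch: keep the sentence if every (head, tail) pair has corpus support;
    otherwise retrieve evidence keyed on head + relation and regenerate the sentence."""
    triplets = extract_triplets(sentence)
    risky = [(h, r, t) for h, r, t in triplets
             if cooccurrence_count(h, t, window=window) == 0]
    if not risky:
        return sentence, context                    # at least weak corpus support; keep it

    head, relation, _ = risky[0]
    evidence = retrieve(f"{head} {relation}", k=3)  # semantic query built from (h + r)
    new_context = context + evidence
    return regenerate(sentence, new_context), new_context
```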
Step 3: Tools and Parameters
- Triplet Extractor: A tiny 0.5B model distilled from GPT-4o-mini, trained on 40K examples, makes extraction fast and reliable.
- Co-occurrence Window: 1,000 tokens (about a passage), balancing precision with coverage.
- Retrieval Engine: BM25 over Wikipedia (or another corpus); top-3 docs per query.
- Thresholds: Co-occurrence threshold = 1 (i.e., trigger on a count of zero); the entity-frequency threshold is stable across a wide range of values. (These settings are collected in the config sketch below.)
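Collected for reference as an illustrative configuration; the dictionary keys are invented names, while the values restate the parameters listed above.

```python
# Illustrative parameter summary (keys are made-up names; values restate the text above).
QUCO_RAG_CONFIG = {
    "triplet_extractor": "0.5B model distilled from GPT-4o-mini, trained on 40K examples",
    "cooccurrence_window_tokens": 1000,       # roughly one passage
    "retrieval_engine": "BM25 over Wikipedia (or another corpus)",
    "docs_per_query": 3,
    "cooccurrence_threshold": 1,              # i.e., trigger retrieval on a count of zero
    "entity_frequency_threshold": "stable over roughly 10^1 to 10^3",
}
```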
Top Bread (Hook): You know how a library catalog helps you find things quickly? The Actual Concept: Infini-gram powers millisecond-latency frequency and co-occurrence queries over trillions of tokens.
- How it works: 1) Prebuild a suffix-array-like index. 2) Answer n-gram counts quickly. 3) Serve queries during generation without slowing the model much.
- Why it matters: Without this speed, checking every sentence would be too slow for practical use. Bottom Bread (Anchor): While answering, the system asks Infini-gram "Do these two names ever show up together?" and gets an answer almost instantly.
Secret Sauce (why this method is clever):
- It replaces fuzzy internal confidence with crisp, corpus-grounded signals. Rare entity? Retrieve early. Never-seen-together entities? Retrieve now. Because these checks are discrete and interpretable, they align naturally with the yes/no decision to retrieve. They are fast (thanks to Infini-gram), robust across models (proxy corpora work), and conservative where it matters (better a safe retrieval than a missed hallucination).
04 Experiments & Results
The Tests: Researchers used three datasets. Two are multi-hop QA sets where answers require combining clues: 2WikiMultihopQA and HotpotQA. The third is PubMedQA for biomedical questions. They measured Exact Match (EM), which checks whether you got the exact answer, and F1, which checks how much your words overlap with the gold answer.
Top Bread (Hook): Think of a spelling bee where you must say the exact word (EM) and also get credit for partly right spellings (F1). The Actual Concept: EM and F1 are strict and lenient scoring methods, respectively.
- How it works: EM checks for exact string equality; F1 checks token overlap.
- Why it matters: Together, they show both precision and partial correctness. Bottom Bread (Anchor): If the correct answer is "Paris," saying "Paris" gets EM=1; saying "the city of Paris" gets EM=0 but decent F1 (see the scoring sketch below).
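For concreteness, here is the standard SQuAD-style way these two metrics are typically computed (a common formulation, not code taken from the paper):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the city of Paris", "Paris"))  # 0.0 -> not an exact match
print(f1_score("the city of Paris", "Paris"))     # 0.5 -> partial credit
```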
Competitors: They compared to no retrieval (Wo-RAG), always-one-shot retrieval (SR-RAG), retrieve-every-sentence (FS-RAG), and several dynamic methods that trust internal signals (FLARE, DRAGIN, ETC, SeaKR).
Scoreboard with context:
- On OLMo-2 models (7B, 13B, 32B) with full access to their 4T-token corpus, QuCo-RAG beat the best baselines by 5–12 EM points across datasets. That's like jumping from a B- to a solid A/A- while others hovered in the B range.
- Efficiency: Despite strong accuracy, QuCo-RAG used fewer tokens and fewer LLM calls than most dynamic baselines. It triggered about 1.7–2.6 retrievals per question, far less than "retrieve every sentence."
- Transferability: Even for models with private training data (Llama-3, Qwen2.5, GPT-4.1/5), using the OLMo-2 corpus as a proxy still worked. Gains reached up to +14 EM on 2WikiMultihopQA and healthy boosts on HotpotQA. That's like studying with a friend's textbook that overlaps a lot with yours and still acing the test.
- Domain generalization: On PubMedQA, QuCo-RAG got the best accuracy (about 66%), with modest retrieval and token use. Methods relying on internal signals either over-retrieved (too much, too expensive) or under-retrieved (no better than no retrieval), while QuCo-RAG stayed balanced.
Surprising findings:
- GPT models with agentic web search underperformed no-retrieval in some settings; live web results can be noisy for complex multi-hop reasoning. In contrast, corpus-grounded checks are precise about when to retrieve and what to query.
- Threshold stability: The entity-frequency threshold worked well across wide ranges, and the co-occurrence trigger at zero made intuitive, reliable decisions with low overhead.
- Frequency bins: QuCo-RAG dominated on low-frequency entities (the long tail) where others collapsed toward no-retrieval performance, and it kept improving with high-frequency entities by making better use of abundant corpus evidence.
Put simply: Using the training corpus as a fact-checker gave consistent, strong wins in both accuracy and efficiency across models, datasets, and domains.
05 Discussion & Limitations
Limitations:
- Aliases and lexical matching: The co-occurrence check looks for exact surface forms. If entities appear under different names (e.g., "NYC" vs. "New York City"), co-occurrence might be missed, triggering extra retrieval. This is safer than missing a hallucination but could be optimized with entity linking.
- Time-sensitive facts: A static pre-training corpus can't verify very new events. Periodic updates or time-stamped corpora are needed.
- Co-occurrence is asymmetric as evidence: seeing two names together doesn't prove the specific relation; it only says the claim might be plausible. But zero co-occurrence is a strong red flag.
- Dependency on corpus access: For best results, you need an index over the pre-training corpus (or a proxy corpus). If you can't build or query it, you lose the main advantage.
- Triplet extraction errors: The small extractor could miss or mis-parse relations. In practice, its speed and training help, but there's room to improve.
Required resources:
- An Infini-gram (or similar) index of a large corpus (CPU and disk heavy, but GPU-light).
- A standard retriever (e.g., BM25) and a small triplet extraction model.
- Basic integration to run queries during generation.
When NOT to use:
- Creative fiction or brainstorming where factual grounding isn't required.
- Ultra-fresh news or domains where the corpus lags reality by months and facts change daily.
- Settings dominated by heavy aliasing or multilingual variants without entity linking.
- Privacy-locked environments where the relevant corpus cannot be indexed.
Open questions:
- Can we formally link entity frequency and co-occurrence to bounds on hallucination probability?
- How best to handle aliases, multilingual names, events, and numbers?
- Can we make time-aware indices that track changing facts?
- What's the best way to combine internal signals with corpus statistics, and can they complement each other?
- How can these signals guide data collection, continued pretraining, or model editing to permanently fix knowledge gaps?
06 Conclusion & Future Work
In three sentences: QuCo-RAG uses hard facts from the pre-training corpus (entity frequency before writing and entity co-occurrence during writing) to decide exactly when to retrieve evidence. This corpus-grounded trigger replaces unreliable internal confidence, catching confident hallucinations while staying efficient. It works across models and domains, delivering big gains on multi-hop QA and strong transfer to models with closed training data.
Main achievement: Turning retrieval decisions into objective, data-checked yes/no calls (rare entity? retrieve; never-seen-together entities? retrieve), powered by millisecond-scale corpus queries.
Future directions: Add alias-aware entity linking, build time-stamped indices for changing facts, extend checks to events and numbers, and blend internal and external signals for even smarter triggers. Also use these statistics to guide training data curation and targeted model editing.
Why remember this: It's a clean, general recipe for safer, more accurate AI writing: ask the library when memory is weak, and verify claims as you go. That simple shift from "I feel confident" to "the data says" makes AI more trustworthy in the real world.
Practical Applications
- Smarter homework helpers that look up rare facts and verify claims mid-explanation.
- Customer support bots that fetch policies only when needed and prevent confident wrong answers.
- Medical Q&A systems that verify entity links in claims before suggesting conclusions.
- Enterprise search assistants that cross-check names and relationships in internal documents.
- Legal research tools that trigger retrieval when parties or cases never co-occur in prior texts.
- Scientific writing aids that verify cited relationships between authors, papers, and findings.
- News summarizers that flag and check never-seen-together name combinations to avoid misinformation.
- Developer copilots that retrieve API docs when encountering rare functions or unlinked concepts.
- Education platforms that highlight long-tail topics and provide readings before students answer.
- Agent systems that self-verify with corpus stats before acting on generated plans.