HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences
Key Summary
- The paper finds almost 300 accepted NLP papers (mostly in 2025) that include at least one fake or non-existent reference, which the authors call a HalluCitation.
- EMNLP 2025 alone accounts for 154 of these papers, showing the problem is rising fast and affecting main conference tracks, not just workshops.
- The team built a practical pipeline: OCR the reference list, normalize it, match titles in trusted databases (ACL Anthology, arXiv, DBLP, OpenAlex), and then manually verify suspicious cases.
- If a paper has four or more suspicious references flagged by matching, it is very likely to contain a real HalluCitation, making this a useful rule of thumb for automated checks.
- Most HalluCited papers contain only one or two errors, which are hard for busy reviewers to catch among many correct citations.
- Some hallucinated references come from contaminated secondary sources (e.g., Google Scholar, Semantic Scholar), so the issue is not always caused by AI tools or dishonest authors.
- The authors argue for toolkits and automated checks before submission and at ingestion, rather than punishing authors after acceptance.
- They recommend clear definitions of what counts as a HalluCitation, better traceability of corrections, and lighter reviewer loads to protect conference credibility.
- Results are a conservative lower bound: the true number of HalluCitations could be higher due to cautious verification rules and method limits.
- Overall, this work spotlights a growing risk to scientific trust and offers a practical, scalable way to catch and prevent fake references.
Why This Research Matters
If we can't trust references, we can't trust the facts built on top of them; this affects news stories, health apps, education tools, and everyday decisions that rely on science. As AI writing assistants become common, we need automatic checks to make sure their suggested citations are real. This paper provides a scalable way to flag likely fakes so humans only inspect a small, suspicious subset. It also shows that errors often come from messy databases, so the fix is better tools and data hygiene, not punishing authors. With these improvements, conferences and journals can maintain credibility even as submissions surge. Ultimately, reliable citations keep the chain of knowledge strong from classrooms to cutting-edge labs.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how your teacher asks you to list the books you used for a report, so anyone can go check them? Now imagine a student makes up a book title that doesn't exist; if no one notices, their whole report looks trustworthy but has a crack in its foundation.
The Concept (Academic Papers and Citations):
- What it is: An academic paper is like a carefully written report about a new discovery, and citations are its "receipts" pointing to earlier work it builds on.
- How it works: Authors collect prior research, write their new ideas, and use citations to give credit and show evidence; reviewers check this for accuracy before acceptance.
- Why it matters: Without accurate citations, readers can't trace facts, and the chain of knowledge breaks.
Anchor: If a paper says, "As shown by Clark et al. (2022)," readers should be able to click or search and find that exact paper, every time.
Hook: Imagine a school where students use a super-smart writing helper that can draft essays very fast. That helper is great, but sometimes it can make up a book or mix up details.
The Concept (Large Language Models, LLMs):
- What it is: LLMs are AI tools that help people write, summarize, and search, trained on lots of text so they're very good at sounding fluent.
- How it works: You ask a question; the model predicts likely words to answer, sometimes also searching the web or databases.
- Why it matters: When LLMs or connected tools suggest citations, they can sometimes produce realistic-looking but fake references.
Anchor: A writing assistant that drafts a bibliography in seconds is helpful, unless two entries point to papers that don't exist.
Hook: Think about referees in sports. They keep the game fair. In science, we have reviewers who check papers before they're published.
The Concept (Peer Review):
- What it is: Peer review is when expert researchers read and judge whether a paper is correct and useful.
- How it works: Reviewers read, test reasoning, check references, and recommend accept or reject; meta-reviewers oversee fairness and quality.
- Why it matters: If reviewers are rushed or overloaded, small but important errors, like fake references, can slip through.
Anchor: If a referee has to judge 20 games in one day, they might miss a foul; if reviewers have too many papers, they might miss a fake citation.
Hook: Picture a library card catalog with a few pretend cards hidden inside. If you pick one of those, you think a book exists when it doesn't.
The Concept (HalluCitation):
- What it is: A HalluCitation is a fake or non-existent reference in a paper: something that looks like a citation but points to nothing real.
- How it works: It might come from an AI assistant inventing a title, a mistaken database entry, or a wrong link/ID that doesn't match any real paper.
- Why it matters: HalluCitations damage trust. If references can't be verified, the paper's foundation weakens, and conference credibility is at risk.
Anchor: A citation like "Anticipated for NeurIPS 2024" or a wrong arXiv ID that leads nowhere is a classic HalluCitation.
The world before: Most people assumed accepted papers' references were reliable. Tools like LLMs boosted writing speed and helped non-native English speakers, but the number of submissions exploded, and review workloads rose. The problem: catching a few fake references among hundreds of thousands is like finding needles in a haystack, and manual checking of every citation is impossible.
Failed attempts: Purely manual checks don't scale; simple parsing tools sometimes miss references that overflow pages; relying on any single database can fail if titles change slightly or entries are incomplete.
The gap: We needed a practical, high-precision way to fish out suspicious citations at scale and then verify them, without blaming authors automatically.
Real stakes: If fake references become common, teachers, doctors, engineers, and policymakers might quote research that can't be traced. That hurts everyday trust, from news stories to medical apps, and burdens reviewers who already work under tight deadlines.
02 Core Idea
Hook: Imagine airport security. Most passengers are fine, so you don't search everyone in depth; you first use scanners to spot a few bags that look odd, then a human inspects only those.
The Concept (Key Insight):
- What it is: First, use machines to rapidly flag suspicious citations by title-matching against trusted databases; then, humans carefully verify those few, revealing both the true scale and causes of fake references.
- How it works: OCR the reference list → normalize entries → title-match in multiple databases with a strict similarity threshold → flag hard-to-match ones as candidates → manually verify existence and details; count a paper as problematic if at least one fake is confirmed.
- Why it matters: This hybrid approach scales to tens of thousands of papers and keeps precision high, so we don't punish authors unfairly.
Anchor: Like a smoke detector that only wakes you when there's real smoke: the automated matcher beeps for suspicious cases, and a person checks the kitchen to confirm.
Multiple analogies for the same idea:
- Librarian analogy: The system flips quickly through thousands of cards (citations), pings the ones that don't match the official catalog, and the librarian examines only those.
- Jigsaw analogy: If a puzzle piece (title) doesn't fit any picture in the box (databases) above a 0.9 match, it's flagged as a likely counterfeit piece.
- Airport analogy: A scanner (matcher) flags a few bags; a guard (human) opens those bags. Fast, fair, and scalable.
Before vs After:
- Before: We assumed accepted papers' references were mostly fine; reviewers were the only safety net under time pressure.
- After: We now have evidence that HalluCitations surged in 2025 (e.g., 154 at EMNLP 2025 alone) and that a simple candidate-count rule (four or more flags) is a strong warning sign. This shifts the community toward proactive toolkits and automated checks.
Why it works (intuition):
- Titles are strong identifiers: even small typos still match closely, while truly fake items fail to reach a high similarity score across multiple databases (see the short sketch after this list).
- Most citations are correct: so focusing on the few unmatched ones gives great efficiency.
- Human-in-the-loop: final verification prevents overzealous auto-rejections and keeps precision high.
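To make that intuition concrete, here is a tiny sketch using RapidFuzz's normalized Levenshtein similarity, the same metric the methodology names. The titles are chosen for illustration (one well-known real title plus invented variants) and are not drawn from the paper's data.

```python
# Why title matching works: a one-character typo barely moves the score,
# while a fabricated title never gets close to the 0.9 threshold.
# Requires: pip install rapidfuzz
from rapidfuzz.distance import Levenshtein

real_title = "attention is all you need"
typo_title = "atention is all you need"                     # small typo, still the real paper
fabricated_title = "adaptive graph routing for llm agents"  # plausible-sounding, but invented

for candidate in (typo_title, fabricated_title):
    score = Levenshtein.normalized_similarity(real_title, candidate)
    print(f"{candidate!r}: similarity={score:.2f}, flagged={score < 0.9}")
```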
Building blocks (introduced with sandwich explanations):
- Hook: You know how you snap a photo of a page to search the words? The Concept (OCR, Optical Character Recognition):
- What it is: A tool that turns the PDF's reference text into editable words.
- How it works: Scans the page blocks, finds characters, and outputs strings.
- Why it matters: Without OCR, we can't read titles reliably from PDFs. Anchor: The system used MinerU to grab reference blocks accurately.
- Hook: Imagine tidying your desk so you can find things quickly. The Concept (Normalization):
- What it is: Cleaning and structuring each reference (title, authors, venue).
- How it works: A parser (GROBID) splits fields and standardizes formats.
- Why it matters: Messy text breaks matching; normalized fields match better. Anchor: GROBID helps ensure "Clark et al., 2022" is parsed into a real title.
- Hook: Picture checking a store's official product list before buying. The Concept (Database Matching):
- What it is: Comparing citation titles to known databases (ACL Anthology, arXiv, DBLP, OpenAlex).
- How it works: Use character-level fuzzy matching (normalized Levenshtein) and flag anything below 0.9 similarity as suspicious.
- Why it matters: If no trusted source matches well, the reference may be fake or severely wrong. Anchor: A supposed "TACL 2022" paper that no database can find is a red flag.
- Hook: Like a librarian double-checking a rare book exists. The Concept (Manual Verification):
- What it is: Humans use links, DOIs, IDs, pages, and web searches to confirm a paper exists and details match.
- How it works: If no reliable match exists or key fields conflict, it's a HalluCitation; one confirmed fake is enough to mark the paper.
- Why it matters: Prevents false alarms and keeps decisions fair. Anchor: If an arXiv ID leads to nothing or a link goes to a different paper, it's ruled a HalluCitation.
Secret bonus insight: Not all HalluCitations come from AI; contaminated secondary databases can inject errors. So the fix is better tools and checks, not automatic blame.
03 Methodology
At a high level: PDF of a paper → Reference extraction (OCR) → Reference parsing/normalization → Heuristic filtering for target sources → Title matching in multiple databases → Candidate list of suspicious citations → Manual verification → Output: list of papers with at least one confirmed HalluCitation.
Step-by-step, like a recipe:
- Input collection
- What happens: Gather every PDF and its metadata from ACL, NAACL, and EMNLP 2024-2025 (main, Findings, workshops): 17,842 papers.
- Why it exists: We need complete coverage to know the scale of the issue.
- Example: Imagine loading every paper from those conferences into one big library cart.
- Citation extraction via OCR
- What happens: Use MinerU to read just the references section at the text-block level and pull raw strings.
- Why it exists: PDFs are not plain text; OCR makes the text machine-readable. Without it, many references, especially those split across pages, are missed.
- Example: A 40-reference list becomes 40 clean lines of text instead of an image.
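The paper uses MinerU for this step. As a rough stand-in sketch (not the authors' tooling), the same idea can be approximated with pypdf: pull each page's text and keep whatever follows the "References" heading. The file name and heading detection are assumptions for illustration only.

```python
# Minimal stand-in for the extraction step (the paper itself uses MinerU).
# Requires: pip install pypdf. "paper.pdf" is a placeholder file name.
import re
from pypdf import PdfReader

def extract_reference_block(pdf_path: str) -> str:
    """Return the raw text that follows a 'References' heading, if found."""
    reader = PdfReader(pdf_path)
    full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Naive heading detection; real layouts (two columns, page breaks,
    # appendices) need layout-aware tools like MinerU or GROBID.
    match = re.search(r"\nReferences\n", full_text)
    return full_text[match.end():] if match else ""

if __name__ == "__main__":
    refs = extract_reference_block("paper.pdf")
    print(refs[:500])
```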
- Parsing and normalization
- What happens: Feed those reference strings to GROBID to parse fields (title, authors, venue, year) and standardize them.
- Why it exists: Matching works best when titles are clean and fields are consistent. Without normalization, small formatting quirks stop matches.
- Example: "Clark et al. (2022). Canine..." becomes a neat record with a precise title field.
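As a rough sketch of this parsing step: GROBID runs as a local web service, and raw reference strings can be posted to its citation-parsing endpoint, which returns TEI XML with separated title, author, and venue fields. The endpoint path, form field name, and port below are assumptions based on GROBID's default REST setup and should be checked against your GROBID version; the citation string is just an example.

```python
# Sketch: parse one raw reference string with a locally running GROBID server.
# The endpoint, field name, and port are assumed defaults; verify them
# against your GROBID version's documentation before relying on this.
import requests
import xml.etree.ElementTree as ET

GROBID_URL = "http://localhost:8070/api/processCitation"  # assumed default
raw_reference = (
    "Jonathan H. Clark et al. 2022. Canine: Pre-training an Efficient "
    "Tokenization-Free Encoder for Language Representation. TACL."
)

response = requests.post(GROBID_URL, data={"citations": raw_reference}, timeout=30)
response.raise_for_status()

# GROBID returns TEI XML; grab the first title element regardless of namespace.
root = ET.fromstring(response.text)
titles = [el.text for el in root.iter() if el.tag.endswith("title") and el.text]
print(titles[0] if titles else "no title parsed")
```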
- Heuristic filtering to narrow the search
- What happens: Keep references that mention ACL/EMNLP/NAACL or arXiv-related keywords. Add cross-checks with DBLP and OpenAlex.
- Why it exists: There are 740k+ citations total; we focus on sources with strong databases to make matching reliable and efficient.
- Example: If a reference mentions arXiv:2405.xxxx, it goes into the "check carefully" bucket.
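A rough sketch of this filtering heuristic: keep only reference strings that mention the target venues or an arXiv identifier. The exact keyword list the authors use is not spelled out here, so the patterns below are illustrative.

```python
# Sketch of the heuristic filter: keep references whose raw text points at
# sources with strong databases (ACL-family venues or arXiv). The keyword
# patterns are illustrative, not the authors' exact configuration.
import re

VENUE_KEYWORDS = re.compile(r"\b(ACL|EMNLP|NAACL|TACL)\b")
ARXIV_PATTERN = re.compile(r"arXiv[:\s]*\d{4}\.\d{4,5}", re.IGNORECASE)

def should_check(reference_text: str) -> bool:
    """True if the reference claims an ACL-family venue or carries an arXiv ID."""
    return bool(VENUE_KEYWORDS.search(reference_text) or ARXIV_PATTERN.search(reference_text))

refs = [
    "Doe et al. 2024. A made-up title. In Proceedings of EMNLP 2024.",
    "Smith 2023. Another paper. Journal of Something Else.",
    "Lee et al. 2024. Yet another. arXiv:2405.00001.",
]
print([should_check(r) for r in refs])  # [True, False, True]
```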
- Database title matching (the scanner)
- What happens: Compare each normalized title to titles in ACL Anthology, arXiv, DBLP, and OpenAlex using normalized Levenshtein similarity (RapidFuzz). If no match >=0.9 is found, flag as a candidate.
- Why it exists: Titles are strong fingerprints. Without a high-threshold match across multiple trusted sources, the reference is suspicious.
- Example: A citation with a slightly misspelled title might still match at 0.95. A made-up title likely scores far below 0.9 everywhere.
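A minimal sketch of the matching core, using RapidFuzz's normalized Levenshtein similarity with the 0.9 cutoff described above. The tiny in-memory "database" stands in for indexed dumps of ACL Anthology, arXiv, DBLP, and OpenAlex; the paper's actual indexing and lookup strategy is not shown here.

```python
# Sketch of the title-matching scanner: a citation is a candidate if no
# trusted database holds a title with normalized Levenshtein similarity
# of at least 0.9. Requires: pip install rapidfuzz. The lists below stand
# in for real dumps of ACL Anthology, arXiv, DBLP, and OpenAlex.
from rapidfuzz.distance import Levenshtein

TRUSTED_TITLES = {
    "acl_anthology": ["canine: pre-training an efficient tokenization-free encoder for language representation"],
    "arxiv": ["attention is all you need"],
}
THRESHOLD = 0.9

def normalize(title: str) -> str:
    return " ".join(title.lower().split())

def best_match(cited_title: str) -> float:
    """Highest similarity of the cited title against all trusted sources."""
    query = normalize(cited_title)
    return max(
        (Levenshtein.normalized_similarity(query, normalize(t))
         for titles in TRUSTED_TITLES.values() for t in titles),
        default=0.0,
    )

def is_candidate(cited_title: str) -> bool:
    """Flag the citation if no source matches at or above the threshold."""
    return best_match(cited_title) < THRESHOLD

print(is_candidate("Attention is all you need"))              # False: close match exists
print(is_candidate("Self-healing prompt graphs for agents"))  # True: nothing comes close
```

In practice the threshold and the normalization would be tuned against the real database dumps rather than a toy list.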
- Building the candidate list (the short list)
- What happens: For each paper, count how many citations were flagged. This yields 2,950 candidate papers and 4,104 candidate citations.
- Why it exists: We can't manually inspect 740k+ citations. A short list per paper is manageable. Without this, the task is impossible at scale.
- Example: A paper with five flagged items jumps to the top of the manual-check queue.
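As a quick sketch of how the short list and the four-or-more heuristic (discussed later) might be assembled from per-citation flags; the paper IDs and flags here are made up for illustration.

```python
# Sketch: turn per-citation flags into per-paper candidate counts and a
# prioritized manual-check queue. Paper IDs and flags are made up.
from collections import Counter

# (paper_id, citation_was_flagged) pairs produced by the matching step.
flags = [
    ("paper_A", True), ("paper_A", True), ("paper_A", True), ("paper_A", True),
    ("paper_B", True),
    ("paper_C", False), ("paper_C", True), ("paper_C", True),
]

candidates_per_paper = Counter(pid for pid, flagged in flags if flagged)

# Check the most-flagged papers first; the paper reports that a count of
# four or more flags is a strong signal of a real HalluCitation.
queue = sorted(candidates_per_paper.items(), key=lambda kv: kv[1], reverse=True)
for paper_id, n_flags in queue:
    priority = "high" if n_flags >= 4 else "normal"
    print(paper_id, n_flags, priority)
```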
- Manual verification (the human judge)
- What happens: For each candidate, humans try to locate the referenced work, using links, DOIs, IDs, pages, venue info, or searching by title. If no real paper is found or key attributes conflict, it's a HalluCitation.
- Why it exists: Machines are great at filtering; humans excel at judgment. Without this step, we'd risk false positives.
- Example: "Anticipated for NeurIPS 2024" or a link that opens a different paper is ruled fake.
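Parts of this human check can be sped up with small lookups, for example resolving a DOI or querying the public arXiv API and comparing titles. This is a hedged sketch of such helpers, not the authors' verification protocol; the arXiv ID and DOI below are placeholders.

```python
# Sketch: small lookups that assist (not replace) manual verification.
# Requires: pip install requests feedparser. The DOI is a placeholder.
import requests
import feedparser

def arxiv_title(arxiv_id: str) -> str | None:
    """Return the title arXiv reports for an ID, or None if nothing resolves."""
    feed = feedparser.parse(f"http://export.arxiv.org/api/query?id_list={arxiv_id}")
    for entry in feed.entries:
        title = entry.get("title", "").strip()
        if title and title.lower() != "error":   # arXiv signals bad IDs with an error entry
            return " ".join(title.split())
    return None

def doi_resolves(doi: str) -> bool:
    """True if https://doi.org/<doi> redirects to a live landing page.
    Some publishers block HEAD requests, so a GET fallback may be needed."""
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return resp.status_code == 200

print(arxiv_title("1706.03762"))       # a known-real ID, for illustration
print(doi_resolves("10.1000/xyz123"))  # placeholder DOI; likely False
```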
Policy for calling a paper "HalluCited"
- What happens: If at least one reference is confirmed fake, the whole paper is marked HalluCited; remaining candidates in that paper are not checked further (to focus effort widely).
- Why it exists: The question is presence vs absence. Without this rule, manual work balloons.
- Example: After finding one confirmed fake in a paper with nine flags, we stop and move on.
Key examples with real flavor:
- Wrong link: A citation's clickable title goes to a valid site but points to a completely different paper (e.g., the "Canine" link mismatch).
- Vague claims: "Anticipated for ..." suggests a paper isn't actually published anywhere.
- Non-existent IDs: An arXiv number that returns nothing, or a TACL volume/page that never existed.
The secret sauce:
- Multiple trusted databases plus a strict 0.9 similarity threshold make automated flags precise.
- The "four-or-more" candidate rule of thumb: if a paper has four or more flags, there's a high chance (about 60% in that bin; over 75% cumulatively) that it truly contains a HalluCitation.
- Conservative verification: Prioritize precision; results are a lower bound, but very credible.
What breaks without each step:
- No OCR: Many references never get read; you miss the majority.
- No normalization: Slight formatting or page splits block matches.
- No multi-database match: A single source miss labels too many as suspicious.
- No human check: You risk punishing authors for machine quirks.
Anchoring the big picture:
- Input → OCR → Normalize → Filter → Match → Flag → Human-check → Output.
- Scales to 17,842 papers and 741,656 citations, yet keeps the human workload feasible and fair.
04 Experiments & Results
The tests (what and why):
- Measure how many accepted papers include at least one confirmed HalluCitation.
- Track growth over time (2024 vs 2025) and distribution across venues, tracks, and topics.
- Evaluate how candidate frequency per paper predicts real HalluCitations, which is useful for auto-flagging.
The competition (what itās compared against):
- There aren't classic baselines like competing models; the study's value comes from coverage, precision, and practical signals (e.g., the four-or-more rule) compared across years and venues.
The scoreboard (with context):
- Scale: 17,842 papers and 741,656 extracted citations.
- HalluCited papers rose from 20 in 2024 to 275 in 2025. That's like going from a rare typo to a noticeable pattern.
- EMNLP 2025 had 154 HalluCited papers by itself, over half of the 2025 total, similar to one school having most of the missing-homework cases in a district.
- Proportion jump: From about 0.28% in 2024 to 2.59% in 2025 overall; EMNLP 2025 reached about 3.7%. That's like moving from a few raindrops to a steady drizzle you can't ignore.
- Candidate growth: Both the average and maximum number of flagged citations per paper increased from 2024 to 2025, suggesting a real trend beyond minor method noise.
Predictive hit rates by candidate count (why the four-or-more rule matters):
- Papers with 9 or more candidates: ~100% contain a real HalluCitation (cumulative).
- 8 candidates: ~84% in-bin; ~94% cumulative.
- 7 candidates: ~93% in-bin; ~93% cumulative.
- 4 candidates: ~61% in-bin; ~77% cumulative.
- Takeaway: When a paper has around four or more flags, it's very likely to have a real fake reference. This is like getting multiple warning lights on a dashboard; it's probably not a sensor glitch.
A tricky reality: Most HalluCited papers had only one or two fake references. These are hard to spot manually because they hide among 30-50 good citations. Reviewers under time pressure might miss them, especially outside their specialty.
Topic trends (EMNLP 2025 Main + Findings):
- Areas with relatively higher proportions include LLM Efficiency, AI/LLM Agents, and Low-Resource NLP; many of these are newly introduced tracks that are tough to staff with highly specialized reviewers.
- Title keyword differences: HalluCited papers more often used concise terms like "LLM" and leaned toward topics like "Multimodal," "Decoding," and "Quantization," whereas general papers had more "Large Language Model," "Human," "Reasoning," and "Preference."
Peer-review pipeline observations (ARR preprints and workloads):
- Among opted-in ARR preprints (about a 20% disclosure rate), the share with candidates was relatively high, but many were filtered out before acceptance; still, a non-trivial number made it through, especially by EMNLP 2025.
- Reviewer and meta-reviewer loads in some cycles were heavy enough that catching subtle citation errors would be challenging.
Surprising findings:
- Contaminated secondary databases (e.g., Google Scholar, Semantic Scholar) can list non-existent or incorrect entries, which then propagate to other papers. This means HalluCitations can occur even without AI tools, and sometimes despite good intentions.
- Some bad entries (e.g., wrong arXiv IDs, missing authors, expired links) spread widely, showing that database hygiene matters for the whole ecosystem.
Contextual meaning:
- An increase from 20 to 275 HalluCited papers isn't just a number; it signals a shift in how we must safeguard trust: by combining automated flags, clearer definitions, and fair human checks.
05 Discussion & Limitations
Limitations (be specific):
- Scope: Focused on six ACL-family conferences (2024-2025). Other fields may differ, and earlier years had negligible counts.
- Accepted-only emphasis: Many rejected submissions aren't public, so the true upstream rate could be higher. ARR opt-in preprints (about 20% disclosure) only show part of the picture.
- Conservative verification: Prioritized precision; results should be read as a lower bound. OCR errors, parsing misses, and strict thresholds mean some real HalluCitations likely went undetected.
- Source coverage: Centered on ACL Anthology and arXiv (plus DBLP and OpenAlex). Domains without strong bibliographic infrastructure are harder to analyze reliably.
Required resources:
- Compute: OCR passes on thousands of PDFs (e.g., an NVIDIA A6000 GPU was used), plus typical workstation analysis.
- Data: Access to ACL Anthology, arXiv dumps, DBLP, and OpenAlex.
- People: Manual verification time for candidate citations.
When NOT to use this approach:
- Domains with poor or fragmented databases, where even real items rarely match at high similarity.
- Non-standard or non-English reference styles where OCR/parsers struggle significantly.
- Situations where only BibTeX snippets exist without PDFs, and fields are too incomplete for robust matching.
Open questions:
- Causation vs correlation: How much do AI tools contribute versus database contamination or human copy-paste habits?
- Better automation: Can we integrate DOI/URL resolution, author disambiguation, and venue-year cross-checks to reduce manual work further?
- Policy and incentives: What's the best community-wide definition of a HalluCitation, and how do we track and verify corrections between reviews and camera-ready versions?
- Ecosystem hygiene: How can we clean and protect secondary databases so bad entries don't multiply? Can reference managers add built-in verification?
- Long-term trends: Will the surge continue, and do certain emerging areas consistently face higher risk due to reviewer scarcity?
Bottom line: The study is honest about being a careful, lower-bound estimate within a specific community, and it points to practical fixes (tooling, definitions, traceability, and lighter reviewer loads) to restore and protect trust.
06 Conclusion & Future Work
Three-sentence summary:
- This paper shows that fake or non-existent references (HalluCitations) rose sharply in accepted NLP papers in 2025, with EMNLP 2025 contributing more than half the cases.
- A scalable, high-precision pipeline (OCR → normalization → strict multi-database title matching → human verification) surfaces true issues and yields an actionable rule of thumb: papers with about four or more flagged citations are very likely to contain a real HalluCitation.
- Many errors stem from contaminated secondary databases, so the solution is proactive toolkits and automated checks, not punishment after the fact.
Main achievement:
- Turning a vague worry into a measurable, community-facing signal: clear statistics on prevalence, a practical detection recipe, and guidance (the four-or-more rule) that organizers, reviewers, and authors can use immediately.
Future directions:
- Integrate detection into author toolkits and submission checks (e.g., pubcheck), enrich matching with DOIs/URLs/author disambiguation, and expand to other domains.
- Establish a shared definition of HalluCitation severity and add traceability so camera-ready corrections are verifiable.
- Improve database hygiene via collaborative cleaning and reference-manager plug-ins that fetch from primary sources first.
Why remember this:
- Scientific trust depends on verifiable references. This work shows the problem is growing, proves a scalable way to catch it, and charts a constructive pathātools and transparencyāto keep our knowledge chain strong.
Practical Applications
- Add a pre-submission citation checker to author toolkits that flags suspicious references for quick fixes.
- Integrate automated HalluCitation scanning into conference and journal submission systems (e.g., pubcheck).
- Build a reference-manager plug-in (Zotero, Paperpile) that verifies titles/DOIs against primary sources before import.
- Create a reviewer dashboard badge that highlights papers with four or more flagged citations for targeted inspection.
- Deploy a batch scanner at publishers to catch and correct bad references during production.
- Offer a browser extension that verifies a highlighted reference (title/DOI/ID) in one click using multiple databases.
- Run periodic database-cleaning sweeps (e.g., arXiv/DBLP/OpenAlex cross-checks) to find and fix contaminated entries.
- Teach students and researchers a simple "trust but verify" workflow: always confirm titles/IDs from primary sources.
- Add DOI/URL resolution and author-name disambiguation to further reduce false flags in automated tools.
- Publish a public leaderboard of citation hygiene for venues to encourage good practices and transparency.