
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Intermediate
Mingyang Wei, Dehai Min, Zewen Liu et al. · 1/6/2026
arXiv · PDF

Key Summary

  • EpiQAL is a new benchmark that tests how well AI models answer population-level disease questions using real research papers.
  • It has three parts: A checks facts stated in the text, B checks multi-step reasoning that links evidence and epidemiology rules, and C checks if models can rebuild study conclusions when the Discussion section is hidden.
  • The benchmark was built with expert guidance, multi-model checking, and question “hardness” control to reduce shortcuts and errors.
  • Ten open AI models were tested, and none solved everything; multi-step reasoning (Part B) was the hardest.
  • Model size alone didn’t predict success; some smaller models beat larger ones in reasoning tasks.
  • Chain-of-Thought prompting helped multi-step inference but didn’t reliably help simple fact recall or masked conclusion tasks.
  • Exact Match and F1 were used to score multi-answer questions, rewarding precision and penalizing guesswork.
  • EpiQAL offers fine-grained signals that show where models struggle—evidence grounding, inference, or synthesis—so researchers can improve them.
  • The dataset focuses on PLOS Neglected Tropical Diseases articles, is carefully verified, and includes some human review for edge cases.
  • This matters for real-world public health decisions where careful, evidence-based reasoning saves time and lives.

Why This Research Matters

Public health decisions affect whole communities, so we need AI that can reason from real evidence, not just repeat facts. EpiQAL checks whether models can tie their answers to actual study text and connect multiple clues like a careful scientist. This helps teams pick the right AI for tasks like judging whether an intervention worked or where to focus resources. With clearer signals about what models do well (and where they fail), researchers can train safer, more reliable systems. In emergencies like outbreaks, better reasoning means faster, more confident choices. Over time, this can guide smarter policies, reduce waste, and improve health outcomes.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how reading a news story about a big flu season is different from reading a doctor’s note for one patient? One is about whole communities, the other is about a single person.

🥬 Filling (The Actual Concept):

  • What it is: Epidemiology is the science of how diseases spread and affect groups of people, not just one person.
  • How it works: It looks at data from studies, counts cases, tracks who gets sick and when, and tests what happens if we try new actions (like vaccines or masks).
  • Why it matters: Without good epidemiology, we might guess wrong about which actions help the most, wasting time or causing harm at the population level.

🍞 Bottom Bread (Anchor): For example, deciding how many vaccine doses to buy for a city requires knowing likely spread, who’s most at risk, and which actions cut infections the most.

Before this paper, AI benchmarks mostly checked medical facts or doctor-style questions (patient-level). These are helpful, but they don’t test the special skills needed to read research papers and make smart, population-level judgments—like estimating disease burden, tracking transmission, or judging intervention effects. During COVID-19, the science grew fast. Public health teams needed tools to quickly pull reliable, study-based insights from mountains of papers. Regular question-answering tests weren’t built for this job—they often allowed shortcut tricks, like matching words, or only used abstracts or yes/no answers. That’s like testing if someone can skim for bolded words, not if they can truly understand and reason across the whole article.

Researchers saw a key problem: we need a controlled and trustworthy way to test epidemiological reasoning. Controlled means questions shouldn’t be easy to game with keyword matching. Trustworthy means answers must be anchored in real study evidence, not opinion.

🍞 Top Bread (Hook): Imagine a science fair judge asking you to explain your results, but also to point to your data table to prove it. No data, no points.

🥬 Filling (The Actual Concept):

  • What it is: Evidence grounding means every answer should point back to real, verifiable study text.
  • How it works: The benchmark requires that correct answers match specific document evidence or are valid conclusions built from it.
  • Why it matters: Without grounding, models might “sound right” but be wrong or made-up.

🍞 Bottom Bread (Anchor): Example: If a model claims a vaccine is 70% effective, it must be traceable to the study’s reported estimate or a proper inference from its results.

Past attempts didn’t fully solve this. Exam-style sets (like MedQA/MedMCQA) test broad medical knowledge but not study-level, population reasoning. Literature datasets (like PubMedQA) often used abstracts and tiny label sets (yes/no/maybe), which can’t capture the messy, multi-answer nature of real epidemiology. COVID-era datasets were helpful but often disease-specific and extractive (find-the-span), so models could succeed by surface matching rather than reasoning.

The missing piece was a benchmark that: covers many epidemiology topics; ties answers to real documents; allows multiple correct answers; prevents shortcuts; and can scale without tons of expensive expert time.

🍞 Top Bread (Hook): Picture a new kind of test where you must show your work, use the whole textbook, and sometimes pick more than one right answer.

🥬 Filling (The Actual Concept):

  • What it is: EpiQAL is a three-part benchmark that checks fact recall, multi-step inference, and conclusion reconstruction with hidden Discussions.
  • How it works: It uses an expert-built topic map (taxonomy), multi-model verification to check quality, and difficulty control to avoid easy, guessable items.
  • Why it matters: It diagnoses exactly where AI models struggle in population-level reasoning.

🍞 Bottom Bread (Anchor): In practice, EpiQAL can show if a model is great at pulling facts but weak at combining results to decide if an intervention likely reduced cases.

02 Core Idea

🍞 Top Bread (Hook): Imagine building a quiz that doesn’t just ask what’s in a book, but asks you to prove your answer with the right pages and combine clues to solve a mystery.

🥬 Filling (The Actual Concept):

  • The “Aha!” in one sentence: EpiQAL is a carefully designed benchmark that tests whether AI can ground its answers in research papers and perform true epidemiological reasoning—especially multi-step inference—rather than relying on superficial text matching.

Multiple Analogies:

  1. Detective analogy: The paper is the crime scene, facts are clues, and the model must connect them to solve the case (inference), not just name a suspect mentioned in passing (surface match).
  2. Cooking analogy: The paper is your pantry, steps are the recipe, and the final dish is a well-supported conclusion; you can’t cook by just staring at the ingredients list—you have to combine them correctly.
  3. Math analogy: It’s not enough to write the final number; you have to show your steps and use the right theorems (epidemiology principles) to get credit.

Before vs After:

  • Before: Benchmarks measured clinical facts or extractive matches; models could score well without true reasoning.
  • After: EpiQAL splits skills into three challenges—text-grounded recall (A), multi-step inference (B), and masked conclusion reconstruction (C)—so we see precisely where models succeed or fail.

Why It Works (intuition):

  • It forces grounding by checking that correct options are backed by document evidence.
  • It raises the bar on reasoning with Part B, which needs multiple pieces of evidence plus epidemiology principles.
  • It removes easy shortcuts by masking the Discussion in Part C, so models must reconstruct the conclusion from earlier sections.
  • It uses multi-model verification and difficulty control, so poor or overly-easy items don’t sneak in.

Building Blocks (with Sandwich explanations for each new concept):

🍞 Hook: You know how we organize a library by sections so we can find books fast? 🥬 The Concept: Expert-curated taxonomy is an expert-made map of epidemiology topics used to guide question creation.

  • How it works: Experts define six classes and 25 topics (like surveillance, outbreaks, immunity, modeling), steering what each question should test.
  • Why it matters: Without it, coverage is lopsided, and we can’t tell which skills are being measured. 🍞 Anchor: If we want to test “transmission modes,” the taxonomy ensures the question actually targets that knowledge.
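
To make this concrete, here is a minimal Python sketch of a taxonomy-guided question spec; the class and topic names below are illustrative placeholders, not the paper’s actual six classes and 25 topics.

```python
# Minimal sketch of taxonomy-guided question creation.
# Class and topic names are illustrative placeholders, not the paper's real taxonomy.
TAXONOMY = {
    "Transmission dynamics": ["transmission modes", "vector ecology"],
    "Surveillance": ["case reporting", "outbreak detection"],
    "Intervention and control": ["vaccination", "vector control"],
}

def make_question_spec(tax_class: str, topic: str, subset: str) -> dict:
    """Check that a planned item targets a known class/topic before generation."""
    if topic not in TAXONOMY.get(tax_class, []):
        raise ValueError(f"'{tax_class} / {topic}' is not in the taxonomy")
    return {"class": tax_class, "topic": topic, "subset": subset}

spec = make_question_spec("Transmission dynamics", "transmission modes", "A")
print(spec)  # the generator is then constrained to write a question on this topic
```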

🍞 Hook: Imagine asking three smart friends to check your homework. 🥬 The Concept: Multi-LLM verification uses several AI models to cross-check whether each option is correct and well-grounded.

  • How it works: Multiple models vote; low-confidence items go to human review.
  • Why it matters: Without it, errors and hidden biases could slip through. 🍞 Anchor: If 3 out of 4 models agree an option isn’t supported by the text, it gets flagged before reaching the final set.
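
A minimal sketch of this voting-and-escalation step follows; the agreement threshold and the verifier interface are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of multi-LLM verification with escalation to human review.
# The agreement threshold and verifier interface are illustrative assumptions.
from typing import Callable, List

Verifier = Callable[[str, str], bool]  # (option_text, document_text) -> supported?

def verify_option(option: str, document: str, verifiers: List[Verifier],
                  agree: float = 0.75) -> str:
    votes = [v(option, document) for v in verifiers]
    support = sum(votes) / len(votes)
    if support >= agree:
        return "accept"
    if support <= 1 - agree:
        return "reject"
    return "human_review"  # mixed votes: route to a human annotator

# Example with four toy verifiers: 3 of 4 say "unsupported", so the option is rejected.
toy = [lambda o, d: False, lambda o, d: False, lambda o, d: False, lambda o, d: True]
print(verify_option("Seroprevalence rose to 40%", "…full paper text…", toy))  # reject
```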

🍞 Hook: Think of game levels—easy, medium, hard—so players can’t just mash buttons and win. 🥬 The Concept: Difficulty control adjusts questions to prevent easy shortcuts.

  • How it works: A pool of models estimates item difficulty; too-easy items get rewritten (stem refinement) to hide giveaway keywords.
  • Why it matters: Without it, models could win by keyword matching instead of reasoning. 🍞 Anchor: Replacing “cutaneous leishmaniasis” with a descriptive phrase forces understanding, not string matching.
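
Here is a hedged sketch of that idea; the scoring rule (one minus the solver pool’s mean accuracy) and the cutoff value are assumptions, since the paper’s exact DiffScore recipe is described only at a high level in the Methodology section.

```python
# Hedged sketch of difficulty control: a pool of solver models rates each item,
# and items the pool finds too easy are flagged for stem refinement.
# The scoring rule (1 - mean solver accuracy) and the 0.3 cutoff are assumptions.
from statistics import mean
from typing import List

def diff_score(solver_accuracies: List[float]) -> float:
    """Higher = harder: the solver pool gets the item wrong more often."""
    return 1.0 - mean(solver_accuracies)

def flag_for_stem_refinement(solver_accuracies: List[float],
                             min_difficulty: float = 0.3) -> bool:
    return diff_score(solver_accuracies) < min_difficulty

# All four solvers answer correctly -> likely a keyword giveaway -> rewrite the stem.
print(flag_for_stem_refinement([1.0, 1.0, 1.0, 1.0]))  # True
```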

🍞 Hook: When you take notes from a book, sometimes you copy facts directly; other times you combine ideas to reach a new point. 🥬 The Concept: Text-grounded factual recall asks for facts clearly stated in the document.

  • How it works: Models must pick options that appear explicitly in the paper.
  • Why it matters: Without reliable recall, higher-level reasoning falls apart. 🍞 Anchor: “The study enrolled 250 patients” is a recall fact.

🍞 Hook: Solving a mystery requires linking multiple clues. 🥬 The Concept: Multi-step inference links several pieces of evidence with epidemiological principles to reach a new conclusion.

  • How it works: Combine document facts plus domain rules to infer impacts (e.g., effectiveness).
  • Why it matters: Without inference, models can’t answer real policy questions. 🍞 Anchor: “Lower cases after intervention + matched trends + proper design ⇒ intervention likely reduced transmission.”

🍞 Hook: Imagine the teacher hides the ‘Conclusions’ page and asks you to figure it out from the rest. 🥬 The Concept: Conclusion reconstruction (with masked Discussion) asks models to rebuild the main takeaway from earlier sections.

  • How it works: The correct option comes from the hidden Discussion but must be derivable from Intro/Methods/Results.
  • Why it matters: Without synthesis, models depend on authors’ summaries instead of evidence. 🍞 Anchor: If Results show consistent drops across sites and time, the model should infer effectiveness even without reading the Discussion.

🍞 Hook: A coach might tell you to talk through your thinking in steps. 🥬 The Concept: Chain-of-Thought prompting encourages step-by-step reasoning.

  • How it works: Prompts nudge models to write intermediate steps before answering.
  • Why it matters: Helps in multi-step inference; less useful for simple retrieval. 🍞 Anchor: Writing out “First, incidence fell; second, confounders controlled; therefore, effect likely real.” boosts Part B performance.
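
For intuition, here is a minimal sketch of how a direct prompt and a Chain-of-Thought prompt might differ; the wording is illustrative and not the paper’s actual prompt templates.

```python
# Illustrative prompt templates; not the paper's exact prompts.
DIRECT = (
    "Read the article, then list the letters of ALL correct options (or 'none').\n\n"
    "Article:\n{article}\n\nQuestion: {question}\nOptions:\n{options}\nAnswer:"
)

CHAIN_OF_THOUGHT = (
    "Read the article. First reason step by step: identify the relevant evidence, "
    "link it using epidemiological principles, and check each option against that "
    "evidence. Then list the letters of ALL correct options (or 'none').\n\n"
    "Article:\n{article}\n\nQuestion: {question}\nOptions:\n{options}\nReasoning:"
)

prompt = CHAIN_OF_THOUGHT.format(
    article="…full paper text…",
    question="What follows about intervention impact?",
    options="A) Likely transmission reduction\nB) Increased transmission",
)
```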

03 Methodology

At a high level: Input document → Subset-specific processing (taxonomy or masking) → Constrained QA generation (topic, logic, option rules) → Multi-LLM verification (and human review if needed) → Difficulty control (judge hardness, stem refinement) → Output question with multiple candidate answers.
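
As one concrete slice of this pipeline, here is a minimal sketch of how the test-time input could be assembled, with the Discussion hidden for Part C; the section names and the joining format are assumptions.

```python
# Minimal sketch of test-time input assembly; the Discussion is hidden for Part C.
# The section names and the joining format are illustrative assumptions.
SECTIONS = ["Introduction", "Methods", "Results", "Discussion"]

def build_model_input(paper: dict, subset: str) -> str:
    visible = [s for s in SECTIONS
               if s in paper and not (subset == "C" and s == "Discussion")]
    return "\n\n".join(f"{s}\n{paper[s]}" for s in visible)

paper = {"Introduction": "…", "Methods": "…", "Results": "…", "Discussion": "…"}
print(build_model_input(paper, "C"))  # Discussion omitted; the model must reconstruct it
```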

Step-by-step (with reasons and mini-examples):

  1. Input processing
  • What happens: Collect full-text research articles from PLOS Neglected Tropical Diseases; parse sections; optionally link disease entities to knowledge graphs (for construction of Part B only); for Part C, hide the Discussion section at test time.
  • Why it exists: Ensures a consistent, rich source of evidence and enforces the masked-input challenge (Part C).
  • Example: A paper on Chagas disease is parsed into Introduction, Methods, Results, Discussion (Discussion later masked in Part C evaluation).
  2. Topic and logic constraints (subset-specific)
  • What happens: Apply a constraint schema with three parts—topic constraint (taxonomy or paper structure), logic constraint (what reasoning is allowed), and option constraints (what counts as a valid correct/distractor option).
  • Why it exists: Prevents off-target questions and ensures each subset truly tests its intended skill.
  • Example: Part A’s logic constraint limits to “facts stated in the text”; Part B demands multi-evidence inference; Part C requires reconstructing Discussion conclusions from earlier sections.
  3. Constrained question generation
  • What happens: A generation model writes the question and options under the schema:
    • Part A (text-grounded recall): Correct options must appear explicitly in the paper; distractors are look-alike facts from the same paper but wrong for the specific question.
    • Part B (multi-step inference): During construction only, external knowledge triples are summarized to inspire inference-style questions; correct options are derived implications requiring multiple paper cues; distractors include causal reversals or entity swaps.
    • Part C (masked conclusion reconstruction): Correct options are key conclusions from the (hidden) Discussion that can be derived from earlier sections; distractors are plausible but unsupported, contradictory, or reversed claims.
  • Why it exists: Forces grounding and separates skills across A/B/C.
  • Example data:
    • A: Q: “Which population had the highest incidence reported?” Options include the exact group named (correct) and other groups from nearby sentences (distractors).
    • B: Q: “Given reduced vector counts and stable reporting completeness, what follows about intervention impact?” Correct: “Likely transmission reduction.” Distractor: “Increased transmission” (causal reversal).
    • C: Q: “Which conclusion best follows from the study’s results (without the Discussion)?” Correct: A main conclusion derivable from Results; distractor: a limitation or external-literature claim.
  4. Multi-LLM verification and human review
  • What happens: Several different AI models check each option for label correctness and evidence consistency with the test-time input (full paper for A/B; Discussion-masked for C). Options with mixed votes are routed to human reviewers.
  • Why it exists: Reduces errors from single-model biases; keeps quality high at scale while minimizing human effort.
  • What breaks without it: Incorrect options slip in; reasoning quality drifts; trust drops.
  • Example: If an option claims “seroprevalence increased to 40%,” verifiers look for matching spans; if it’s unsupported, it’s rejected.
  5. Difficulty control (Parts B and C)
  • What happens: A pool of models solves each item; their scores form a DiffScore (harder = higher). Overly easy items trigger stem refinement—a rewriting step that replaces salient named entities with descriptive phrases drawn from web snippets.
  • Why it exists: Blocks models from exploiting lexical overlap; keeps the benchmark challenging and diagnostic.
  • What breaks without it: Models might pass by matching names, not reasoning.
  • Example: Replace “cutaneous leishmaniasis” with “a vector-borne skin disorder caused by protozoan parasites transmitted by sandflies,” preserving meaning but removing easy anchors.
  6. Evaluation protocol
  • What happens: Models get the document input (with Discussion masked for C), the question, and options; they select any number of correct choices or abstain.

  • Why it exists: Epidemiology often has multiple valid answers—or sometimes none; the format penalizes over-selection.

  • Metrics (with Sandwich explanations): 🍞 Hook: Grading a multi-answer quiz means you must match all the right boxes, not just some. 🥬 Exact Match (EM):

    • What it is: Score 1 only if the chosen set equals the gold set exactly; else 0.
    • How it works: Compare predicted set to the correct set.
    • Why it matters: Rewards precise, calibrated answers; penalizes guessing extra options. 🍞 Anchor: Picking A and C when the true answers are A and C earns 1; picking A, C, and D earns 0.

    🍞 Hook: Sometimes we want partial credit if you got some but not all parts right. 🥬 F1 Score:

    • What it is: A balance of precision and recall on sets.
    • How it works: Measures overlap size relative to both predicted and true sets.
    • Why it matters: Shows coverage even when EM is strict. 🍞 Anchor: Selecting A and B when the truth is A and C gives partial credit for A.
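
A minimal sketch of both set-based metrics as described above; how abstention (an empty predicted set) is scored is an assumed convention, not the paper’s stated rule.

```python
# Minimal sketch of set-based Exact Match and F1 for multi-answer questions.
# Scoring of abstention (empty sets) is an assumed convention.
def exact_match(predicted: set, gold: set) -> float:
    """1.0 only if the chosen option set equals the gold set exactly, else 0.0."""
    return 1.0 if predicted == gold else 0.0

def f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over the two option sets."""
    if not predicted and not gold:
        return 1.0  # both abstain (assumed convention)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Matching the anchors above: {A, C} vs {A, C} -> EM 1; {A, C, D} vs {A, C} -> EM 0;
# {A, B} vs {A, C} -> F1 0.5 (partial credit for A).
print(exact_match({"A", "C"}, {"A", "C"}), f1({"A", "B"}, {"A", "C"}))
```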

Secret Sauce highlights:

  • The three-part constraint schema (topic, logic, options) keeps each subset true to its purpose.
  • Multi-LLM verification plus targeted human review balances scale and trust.
  • Difficulty control with stem refinement measurably increases reasoning demand.
  • Masked Discussion in Part C uniquely tests synthesis from earlier sections.

04 Experiments & Results

The Test: Researchers built three 500-question subsets (A, B, C) from 500 PLOS Neglected Tropical Diseases articles and measured how well ten open AI models could answer multi-answer multiple-choice questions using set-based Exact Match and F1.

The Competition: Models ranged from small (around 3B parameters) to large (~70–110B). Families included Llama, Mistral, Qwen, GLM, and Microsoft Phi. All models were tested in a closed-book setting with only the provided document and options; some runs used Chain-of-Thought (CoT) prompting.

The Scoreboard (with context, like report cards):

  • Part A (text-grounded recall): Best EM ≈ 0.812 (Mistral-Large), which is like getting an A- when some others got Cs. Several models did well here because the task is explicit retrieval.
  • Part B (multi-step inference): Best EM ≈ 0.760 (Mistral-7B), a strong B+/A- in a hard class. Many models fell below 0.70 EM, showing inference is the main challenge.
  • Part C (masked conclusion reconstruction): Best EM ≈ 0.800 (Mistral-7B w/ or w/o CoT), also an A-, showing some models can synthesize conclusions from earlier sections.

Surprising Findings:

  1. Scale isn’t everything: Mistral-7B outperformed larger models on Parts B and C. Llama-3.1-8B often approached Llama-70B on inference. Models smaller than ~7B tended to struggle.
  2. Chain-of-Thought helps inference more than recall: CoT significantly boosted Part B for several models (e.g., Llama-3.1-8B EM 0.262 → 0.584), but gave mixed or minimal gains on A and C.
  3. Precision matters: Mistral-7B’s small F1–EM gap meant it picked the right options without over-selecting distractors; models that “hedged” by selecting extras lost EM points.
  4. Generator bias unlikely: Even though Qwen generated Part B items during construction, Mistral-7B (a different family) topped the chart, suggesting the benchmark measures real reasoning, not generator style.

Readable takeaways by subset:

  • A (Recall): If you can find it verbatim, you can score well. Most larger or well-tuned models do fine here; CoT doesn’t help much.
  • B (Inference): Hardest. You must connect multiple evidence pieces and apply domain principles (e.g., considering confounders, direction of effects). CoT often helps.
  • C (Reconstruction): Tricky synthesis without the Discussion safety net. The best models infer the conclusion from Results; others get stuck by plausible but unsupported distractors.

Bottom line: Today’s LLMs are decent at pulling facts, uneven at building arguments, and only a few are reliable at drawing author-level conclusions without being told them directly.

05 Discussion & Limitations

Limitations:

  • Narrow source: The dataset uses PLOS Neglected Tropical Diseases, possibly under-covering respiratory surveillance, chronic diseases, or policy-heavy topics.
  • Scale: 500 items per subset may miss rare, long-tail skills.
  • Generation model: Part B’s items were generated with a single Qwen model; while top performance came from Mistral-7B, cross-family generation could further reduce subtle artifacts.
  • Residual errors: Even with multi-LLM checks and human review, some LLM-generated artifacts may persist.
  • Model scope: Evaluations focus on open models up to ~110B; results may differ for larger proprietary systems.
  • Real-world gap: Public health decisions often need multiple documents, time series, and geographic context beyond single-article reasoning.

Required Resources:

  • Access to long-context inference, several candidate open models, and moderate GPU resources for evaluation.
  • For reconstruction of the benchmark or extensions: tools for entity extraction/linking, knowledge graph access, and multi-model orchestration.

When NOT to Use:

  • If your task is purely clinical at the patient level (e.g., bedside diagnosis), clinical QA benchmarks may be a better fit.
  • If you require multi-document synthesis across time and space, EpiQAL’s single-document focus may be too limited.
  • If you cannot accept multi-answer formats or need extractive spans only, the multi-select structure may not align with your needs.

Open Questions:

  • How well do models perform when reasoning across multiple papers and data sources (surveillance feeds, maps, lab data)?
  • Can retrieval or tool-use (e.g., calculators, causal diagrams) reliably boost multi-step inference without hallucination?
  • What training signals (e.g., counterfactual reasoning tasks, causal QA) best improve epidemiological inference?
  • How to calibrate models to avoid over-selection of plausible distractors while preserving recall?
  • Can richer difficulty control and diversified corpora generalize results across diseases and settings?

06 Conclusion & Future Work

Three-sentence summary: EpiQAL is a new benchmark that tests whether AI models can ground their answers in research papers and perform population-level epidemiological reasoning. It contains three parts—fact recall, multi-step inference, and masked conclusion reconstruction—built with expert taxonomy, multi-model verification, and difficulty control. Experiments show today’s models still struggle with inference, and size alone doesn’t guarantee success; Chain-of-Thought helps in reasoning but not always elsewhere.

Main achievement: A carefully controlled, trustworthy, and diagnostic benchmark that pinpoints where models fail on epidemiological evidence use—especially multi-step inference—rather than rewarding superficial text matching.

Future directions: Expand to broader corpora (respiratory, chronic, policy), add multi-document and temporal/geographic reasoning, explore retrieval/tool-augmented methods, and refine training signals for causal and statistical inference. Increase generator diversity and continue human-in-the-loop auditing to reduce residual artifacts.

Why remember this: EpiQAL shifts the focus from memorizing medical facts to proving epidemiological reasoning with evidence. It provides the detailed feedback researchers need to build safer, more reliable AI helpers for public health—where careful reasoning can guide smarter interventions and ultimately save lives.

Practical Applications

  • Select the right model for public health tasks by comparing performance on EpiQAL-A (recall), B (inference), and C (reconstruction).
  • Calibrate models to reduce over-selection of distractors, improving Exact Match on multi-answer questions.
  • Use Chain-of-Thought prompting selectively for multi-step inference tasks (like EpiQAL-B), not for simple fact retrieval.
  • Design training curricula that target causal reasoning and evidence synthesis using EpiQAL-B style items.
  • Develop retrieval- or tool-augmented pipelines for masked-input synthesis similar to EpiQAL-C.
  • Audit model outputs with multi-LLM verification or ensemble critiques before deployment in sensitive settings.
  • Benchmark new long-context or domain-tuned models to identify gains in evidence grounding.
  • Create difficulty-controlled practice sets for analysts learning to read epidemiology papers without shortcutting.
  • Monitor F1–EM gaps to balance coverage and precision when false positives are costly.
  • Extend EpiQAL to multi-document tracking for operational outbreak intelligence exercises.
#Epidemiological reasoning #Question answering #Benchmarking LLMs #Evidence grounding #Multi-step inference #Conclusion reconstruction #Difficulty control #Multi-LLM verification #Taxonomy-guided generation #Exact Match #F1 Score #Masked input #Knowledge graphs #Public health AI #Neglected tropical diseases