
LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Beginner
Zhiling Yan, Dingjie Song, Zhe Fang et al. Ā· 2/10/2026
arXiv

Key Summary

  • LiveMedBench is a new, always-updating test for medical AIs that keeps test questions safely separated from training data to avoid cheating by memorization.
  • It gathers fresh, real-world patient-doctor cases every week from verified medical communities in English and Chinese.
  • A multi-agent pipeline (Screener, Validator, Controller) cleans and fact-checks each case against medical guidelines so the data is clinically sound.
  • Instead of vague scoring, an automated rubric breaks each case into clear, bite-sized checks (like accuracy, completeness, safety) and grades model answers objectively.
  • Across 38 different AI models, even the best model only scored 39.2%, showing that real clinical reasoning is still very hard for today’s AIs.
  • 84% of models got worse on cases posted after their knowledge cutoff dates, proving that old, static tests can hide contamination and outdated knowledge.
  • Human doctors strongly agreed that the cases and rubrics are high quality, and the rubric-based grader matched doctors far better than the usual 'LLM-as-a-Judge' approach.
  • Most mistakes were not missing facts, but failing to apply the right facts to the specific patient (context problems) and overgeneralizing guidelines.
  • Giving models trusted references to look up (retrieval) recovered performance on new cases, showing that access to up-to-date knowledge helps a lot.

Why This Research Matters

Medical advice must be fresh, accurate, and safe—people’s health depends on it. LiveMedBench stops hidden shortcuts (like memorized tests) and keeps evaluations tied to the latest real-world cases. By grading with clear checklists, it rewards what truly protects patients: accuracy, completeness, safety, context, and clear communication. Doctors confirmed that both the cases and the automated grader align well with clinical practice, so scores are meaningful, not just numbers. Results highlight where today’s AIs struggle—applying facts to a person’s unique situation—guiding researchers to fix the right problems. As models improve, a live, contamination-free benchmark ensures progress is real, not just test-taking tricks.

Detailed Explanation


01 Background & Problem Definition

šŸž You know how your teacher gives you a brand-new quiz each week so nobody can just memorize last year’s answers? That keeps things fair and fresh.

🄬 The Concept: Large Language Models (LLMs)

  • What it is: An LLM is a computer program that reads and writes text like a super-fast reader who learned from many books and websites.
  • How it works: 1) It sees your question, 2) Predicts the next best word many times in a row, 3) Shapes those words into an answer, 4) Uses patterns it learned from lots of text.
  • Why it matters: Without LLMs, we wouldn’t have smart chatbots to explain symptoms, summarize instructions, or support doctors. šŸž Anchor: When you ask a health chatbot, ā€œIs a fever of 102°F dangerous?ā€, an LLM decides which words to use to explain if and when to see a doctor.

šŸž Imagine a report card for AIs that tells us how good they are at medical questions.

🄬 The Concept: Benchmarks (Medical Benchmarks)

  • What it is: A benchmark is a fair test to compare AIs on the same set of questions.
  • How it works: 1) Collect cases, 2) Make standard rules to score, 3) Test many AIs, 4) Compare scores.
  • Why it matters: Without a fair test, we can’t trust claims like ā€œThis model is better at medicine.ā€ šŸž Anchor: Think of a spelling bee list everyone uses to rank contestants—medical benchmarks do that for AI.

šŸž You know how tests aren’t fair if someone saw the answer key beforehand?

🄬 The Concept: Data Contamination

  • What it is: When test questions leak into an AI’s training data, the AI might just memorize answers.
  • How it works: 1) A public test is posted online, 2) That content sneaks into training data, 3) The model scores high by recall, not reasoning.
  • Why it matters: Scores look great, but they don’t show real medical skill. šŸž Anchor: If a model says all the right words on a test it already saw, that doesn’t mean it can handle a new patient with a twist.

šŸž Picture using an old map to drive—roads change, and you’ll get lost.

🄬 The Concept: Temporal Misalignment (Outdated Knowledge)

  • What it is: When a test or AI is out of date compared to current medical rules.
  • How it works: 1) Medical guidelines change, 2) Old static tests don’t, 3) Models trained long ago miss new info.
  • Why it matters: In medicine, old advice can be unsafe. šŸž Anchor: A test from 2020 might not include a 2026 guideline update about a drug’s new side effects.

šŸž Imagine a judge who just eyeballs a performance and says ā€œPretty good!ā€ without a checklist.

🄬 The Concept: LLM-as-a-Judge

  • What it is: Using an AI to score another AI’s answer with a single holistic opinion.
  • How it works: 1) Feed the answer to a judge model, 2) It gives a 1–10 score, 3) No clear reasons or itemized proof.
  • Why it matters: It can be biased by long answers and can miss safety mistakes. šŸž Anchor: A long, fancy paragraph might get an A from the judge, even if it recommends the wrong medicine.

šŸž Think of rubrics your teacher uses: checklists that say exactly what earns points.

🄬 The Concept: Rubric-based Evaluation

  • What it is: A grade sheet with specific yes/no checks (criteria) tied to accuracy, completeness, safety, and more.
  • How it works: 1) Break expert advice into bite-sized checks, 2) Mark each ā€˜met/not met’, 3) Add points for correct items, subtract for unsafe ones.
  • Why it matters: Rubrics make grading fair, clear, and reproducible. šŸž Anchor: ā€œDid you list red flag symptoms? Did you avoid a harmful drug?ā€ Each gets its own checkbox.

Before this paper, most medical tests for AI were static, easy to leak into training, and often judged by vague overlap or subjective AI judges. People tried to fix this by automatically tweaking old questions or hiring many humans to review, but that either didn’t keep up with real-world changes or couldn’t scale. The missing piece was a live, contamination-resistant, clinically validated, and clearly scored benchmark that matches how doctors actually reason and communicate. Why should you care? Because in health, the difference between a fresh, honest test and a stale, fuzzy one can mean safer advice, better triage, and fewer dangerous mistakes for real people—maybe even someone you know.

02 Core Idea

šŸž Imagine a fresh food market where everything is picked this week, checked by experts, and labeled with clear nutrition facts.

🄬 The Concept: LiveMedBench

  • What it is: A live, contamination-free medical benchmark that updates weekly and uses clear rubrics to grade AI answers.
  • How it works: 1) Collect real new cases from verified medical communities, 2) Clean and fact-check them with a multi-agent system, 3) Turn doctor advice into checklists (rubrics), 4) Grade AI answers by those checklists.
  • Why it matters: It prevents cheating-by-memorization and keeps tests aligned with current medicine. šŸž Anchor: It’s like giving AIs a pop quiz written this week, graded with a precise answer key.

The ā€œAha!ā€ in one sentence: Keep the test questions fresh and medically verified, then score with transparent, bite-sized criteria instead of fuzzy vibes.

Three analogies to see it from different angles:

  1. Newsroom analogy: Today’s headlines, not last year’s—LiveMedBench tests on the latest real-world cases.
  2. Referee analogy: A referee with a rulebook and a checklist, not just a gut feeling.
  3. Kitchen analogy: Ingredients are washed (validated), recipes are precise (rubrics), and taste tests are blind (no leaked answers).

Before vs After:

  • Before: Static, possibly leaked tests; outdated topics; vague scoring that rewards length over safety.
  • After: Weekly new cases; strict time separation from training; itemized checks that reward correctness and penalize unsafe advice.

šŸž You know how a team wins when every player has a job—scout, coach, medic.

🄬 The Concept: Multi-Agent Clinical Curation Framework

  • What it is: Three cooperating AI ā€œagentsā€ that turn noisy forum threads into clean, guideline-aligned clinical cases.
  • How it works: 1) Screener structures the case (what’s asked, facts, advice), 2) Validator checks evidence and completeness, 3) Controller audits for any made-up detail and writes the final narrative.
  • Why it matters: Without this team, cases could be messy, wrong, or unsafe. šŸž Anchor: It’s like editors, fact-checkers, and proofreaders working together before printing a textbook page.

šŸž Think of a teacher turning model answers into a checklist that exactly matches what a good doctor would say.

🄬 The Concept: Automated Rubric-based Evaluation Framework

  • What it is: A system that converts physician responses into case-specific criteria and then grades model outputs against them.
  • How it works: 1) Extract key facts and red flags, 2) Build positive (reward) and negative (penalty) checks, 3) Assign axes (accuracy, completeness, safety, context, communication) with weights, 4) Tally points.
  • Why it matters: It aligns better with doctors and catches safety issues. šŸž Anchor: ā€œDid you warn about dehydration?ā€ +2; ā€œDid you recommend a contraindicated drug?ā€ āˆ’10.

Why it works (intuition, no math):

  • Time wall: Only use cases newer than what models were likely trained on to reduce contamination.
  • Evidence wall: Advice must match trusted medical sources; contradictions veto the case.
  • Rubric wall: Answers earn points only for concrete, case-tied checks; length and fluff don’t help.

Building blocks:

  • Weekly data mining from verified physician communities (English + Chinese).
  • Screener using SOAP (Subjective, Objective, Assessment, Plan) to structure the story.
  • Validator measuring: (1) a real chief complaint, (2) enough information present, (3) evidence alignment via retrieval.
  • Controller preventing hallucinations and writing a clean case.
  • Rubric generator making bipolar criteria with axes and weights.
  • Grader marking each criterion as met/not met and computing a normalized score.

šŸž Anchor: A case about sudden chest pain becomes: questions about red flags, checks for aspirin cautions, and penalties if the answer wrongly reassures a heart attack as ā€œjust stress.ā€

03 Methodology

At a high level: New clinical posts → Screener (structure) → Validator (quality + evidence) → Controller (audit + finalize) → Rubric generator (criteria) → Grader (objective scoring) → Model score.
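The flow above can be sketched as a chain of stage functions. This is a minimal illustrative sketch, not the paper's implementation: in LiveMedBench each stage is an LLM-backed agent, whereas the toy logic below just shows how the stages hand a case along and how a failed check drops it. The field names `N`, `Q`, and `A` follow the paper's narrative/question/advice labels.

```python
def screener(post):
    # Structure the raw thread into narrative (N), query (Q), advice (A).
    return {"N": post["body"], "Q": post["question"], "A": post["reply"]}

def validator(case):
    # Toy stand-in checks: a real question and non-empty narrative/advice.
    return case["Q"].endswith("?") and bool(case["N"]) and bool(case["A"])

def controller(case):
    # Final audit; here we simply pass a valid case through unchanged.
    return case

def curate(post):
    # Screener -> Validator -> Controller; invalid cases are discarded.
    case = screener(post)
    return controller(case) if validator(case) else None
```

Calling `curate` on a well-formed post yields a structured case; anything failing a check is dropped rather than patched, mirroring the pipeline's veto-over-guess design.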

šŸž Think of organizing a messy backpack before school.

🄬 The Concept: SOAP Framework (used by the Screener)

  • What it is: A standard doctor note format: Subjective (symptoms), Objective (measurements), Assessment (diagnosis), Plan (next steps).
  • How it works: 1) Subjective: patient story, 2) Objective: vitals/labs/meds, 3) Assessment: what’s likely going on, 4) Plan: what to do next.
  • Why it matters: Keeps cases clear and complete. šŸž Anchor: For cough + fever: Subjective (3 days, worse at night), Objective (100.8°F), Assessment (viral URI likely), Plan (rest, fluids, red flags).
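The four SOAP fields map naturally onto a small record type. This sketch is illustrative (the `SOAPNote` class is an assumed name, not from the paper), using the cough-and-fever example above:

```python
from dataclasses import dataclass

@dataclass
class SOAPNote:
    subjective: str  # patient-reported story and symptoms
    objective: str   # measurable findings: vitals, labs, medications
    assessment: str  # the clinician's working diagnosis
    plan: str        # next steps, treatment, red flags to watch for

note = SOAPNote(
    subjective="Cough and fever for 3 days, worse at night",
    objective="Temp 100.8F, lungs clear",
    assessment="Likely viral upper respiratory infection",
    plan="Rest, fluids; return if breathing worsens",
)
```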

Step A: Screener (Structure the case)

  • What happens: Extracts patient narrative (Subjective/Objective), the main question (Query), and physician advice (Assessment/Plan).
  • Why it exists: Without structure, grading and evidence-checking are unreliable.
  • Example: From a long thread, the Screener picks: ā€œQuery: Is this food poisoning or stomach flu?; Narrative: vomiting x2 days, mild fever; Advice: oral rehydration, BRAT diet, seek care if blood in stool.ā€

šŸž Like checking a science report with textbooks next to you.

🄬 The Concept: Validator (Quality + Evidence)

  • What it is: A checker that confirms the case is clinically meaningful, complete enough, and aligned with guidelines.
  • How it works: 1) Chief Complaint check: Is the main question a real medical request? 2) Info Sufficiency: Is the minimum checklist (onset, severity, red flags…) present in the narrative? 3) Evidence Alignment: Compare advice to trusted sources (e.g., guidelines, PubMed) using retrieval.
  • Why it matters: Prevents weak, unsafe, or contradicted advice from entering the benchmark. šŸž Anchor: If advice says ā€œantibiotics for likely viral gastroenteritis,ā€ evidence alignment flags a contradiction and discards the case.
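The Validator's three gates can be sketched as simple predicates. This is a hedged toy version: `REQUIRED_FIELDS` is an illustrative stand-in for the paper's minimum-information checklist, and `evidence_aligned` assumes each advice item has already been labeled supported/neutral/contradicted by the retrieval step.

```python
REQUIRED_FIELDS = {"onset", "severity"}  # illustrative minimum checklist

def has_chief_complaint(query: str) -> bool:
    # Toy proxy: the post must actually pose a medical question.
    return "?" in query

def info_sufficient(narrative_fields: set) -> bool:
    # The narrative must cover the minimum checklist.
    return REQUIRED_FIELDS <= narrative_fields

def evidence_aligned(advice_labels: list) -> bool:
    # A single guideline-contradicted item vetoes the whole case.
    return "contradicted" not in advice_labels

def validate(query, narrative_fields, advice_labels) -> bool:
    return (has_chief_complaint(query)
            and info_sufficient(narrative_fields)
            and evidence_aligned(advice_labels))
```

Note the veto semantics: one contradicted advice item discards the case, which is what keeps unsafe guidance out of the benchmark.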

šŸž Imagine a final hall monitor who stops anything made-up from passing.

🄬 The Concept: Controller (Audit + Finalize)

  • What it is: The last gatekeeper that rejects any detail not explicitly in the original thread and writes the final polished case.
  • How it works: 1) Veto any unsupported specifics, 2) Merge structured points into a clean narrative (N), keep the question (Q), and rewrite advice (A) clearly.
  • Why it matters: Blocks hallucinations and ensures faithfulness to the source. šŸž Anchor: If a dosage wasn’t in the thread, the Controller rejects the case rather than guessing.

šŸž Think of looking up facts in a reliable encyclopedia before answering.

🄬 The Concept: Retrieval-Augmented Generation (RAG) in Validation

  • What it is: A way for the system to fetch trustworthy medical snippets to verify advice.
  • How it works: 1) Search medical sources, 2) Pull top matches, 3) Compare each advice item as supported/neutral/contradicted.
  • Why it matters: Stops outdated or unsafe guidance from getting in. šŸž Anchor: For chest pain, RAG ensures aspirin or ER-referral rules match current guidelines.

Rubric Generation (make the checklist)

  • What happens: The physician advice is distilled into bipolar criteria.
    • Positive criteria: Reward key facts (ā€œMentions oral rehydrationā€).
    • Negative criteria: Penalize risks (ā€œRecommends unnecessary antibioticsā€).
  • Axes: Accuracy, Completeness, Safety, Context Awareness, Communication Quality.
  • Weights: āˆ’10 to +10; big penalties for dangerous mistakes.
  • Why it exists: Precise, patient-tied checks beat vague judgments.
  • Example criteria for norovirus-like case:
    • Accuracy +4: Identifies likely viral gastroenteritis.
    • Completeness +3: Lists red flags (bloody stool, severe dehydration).
    • Safety āˆ’10: Incorrectly prescribes antibiotics.
    • Context +3: Asks about recent travel/contacts.
    • Communication +2: Gives clear, stepwise home care.
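The bipolar criteria above fit a small data structure. The `Criterion` class is an assumed illustration (not the paper's schema); the axes and weights copy the norovirus-like example:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    axis: str    # accuracy | completeness | safety | context | communication
    text: str    # the concrete, case-specific check
    weight: int  # -10..+10; negative weights penalize unsafe content

rubric = [
    Criterion("accuracy", "Identifies likely viral gastroenteritis", 4),
    Criterion("completeness", "Lists red flags (bloody stool, dehydration)", 3),
    Criterion("safety", "Incorrectly prescribes antibiotics", -10),
    Criterion("context", "Asks about recent travel/contacts", 3),
    Criterion("communication", "Gives clear, stepwise home care", 2),
]
```

The single -10 safety criterion outweighs most of the positive checks combined, which is how the rubric centers patient safety.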

Rubric-based Grader (score the answer)

  • What happens: For each criterion, mark met/not met based on the model’s text and add/subtract the weight.
  • Why it exists: Ensures long, fluffy text doesn’t get credit without meeting the checklist.
  • Example: If the answer gives rehydration advice and red flags but suggests antibiotics, the score adds completeness points and subtracts a big safety penalty.
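Scoring then reduces to a weighted tally over met/not-met marks. One caveat: the normalization below (clip negatives to zero, divide by the maximum positive points) is an illustrative assumption, not the paper's exact formula.

```python
def grade(marked_rubric):
    """marked_rubric: list of (weight, met) pairs.
    Returns a 0-100 score; the clip-and-divide normalization
    is an assumed illustration, not the paper's formula."""
    raw = sum(w for w, met in marked_rubric if met)
    max_pos = sum(w for w, _ in marked_rubric if w > 0)
    if max_pos == 0:
        return 0.0
    return max(0.0, raw) / max_pos * 100
```

With the norovirus example weights, an answer that hits accuracy, completeness, and communication but also recommends antibiotics scores (4 + 3 - 10 + 2) = -1, clipped to 0: the safety penalty erases an otherwise decent answer, no matter how long it is.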

Secret sauce (what’s clever):

  • Weekly updates create a strong time wall against contamination.
  • Evidence-veto prevents unsafe or contradicted advice from entering.
  • Bipolar criteria with big safety penalties center patient safety.
  • Bilingual, multi-specialty coverage mirrors real-world variety.

Small end-to-end example:

  • Input post: ā€œTwo days of vomiting and diarrhea, mild fever, friend was sick last week. Should I take antibiotics?ā€
  • Screener: Structures narrative, query, and physician’s conservative care advice.
  • Validator: Confirms chief complaint, checks minimal info (duration, fever, exposure), and verifies advice via guidelines (no antibiotics for likely viral).
  • Controller: Ensures no added details, writes clean N, Q, A.
  • Rubric: Builds criteria for accuracy (viral cause), completeness (rehydration, red flags), safety (no antibiotics), context (asks about dehydration signs), communication (clear steps).
  • Grader: Scores a model that wrongly suggests antibiotics with a safety penalty, leading to a low final score despite a long explanation.

04 Experiments & Results

šŸž Imagine testing bikes on a real road with hills, not just on a smooth gym treadmill.

🄬 The Concept: Knowledge Cutoff and Post-Cutoff Testing

  • What it is: A model’s ā€œknowledge cutoffā€ is the latest date of information it was trained on; post-cutoff cases come after that date.
  • How it works: 1) Split cases by time, 2) Compare scores before vs after the cutoff, 3) Drops suggest contamination or outdated knowledge.
  • Why it matters: Proves whether models truly generalize to new, unseen medical situations. šŸž Anchor: If a model does great on 2024 cases but dips on 2026 cases, it likely memorized old data or needs updated info.
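The pre/post-cutoff comparison is a straightforward split-and-compare. A hedged sketch with hypothetical case records (each carrying a posting date and a rubric score); a positive drop is the signal the paper interprets as contamination or stale knowledge:

```python
from datetime import date
from statistics import mean

def split_by_cutoff(cases, cutoff):
    # Partition cases into pre-cutoff and post-cutoff by posting date.
    pre = [c for c in cases if c["posted"] <= cutoff]
    post = [c for c in cases if c["posted"] > cutoff]
    return pre, post

def score_drop(cases, cutoff):
    # Mean pre-cutoff score minus mean post-cutoff score.
    # A positive drop suggests contamination or outdated knowledge.
    pre, post = split_by_cutoff(cases, cutoff)
    return mean(c["score"] for c in pre) - mean(c["score"] for c in post)
```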

The test: What was measured and why

  • They measured how well 38 models handled open-ended, real cases using rubrics across five axes: Accuracy, Completeness, Communication Quality, Context Awareness, and Safety.
  • They also checked how well the automated grader matched physicians versus a typical ā€œLLM-as-a-Judge.ā€

The competition: Who was compared

  • Proprietary general models (e.g., GPT-series, Gemini-series), open-source general models (e.g., DeepSeek, Qwen, GLM), and medical-specific models (e.g., Baichuan medical variants, Med-Gemma).
  • All were tested zero-shot (no special training on this benchmark).

The scoreboard (with context)

  • Top score: GPT-5.2 reached 39.2%. In other words, even the best model earned well under half the available points, like a tough exam where the top student still scores below 40%, and most models landed far lower.
  • Proprietary models generally led, but strong open-source models narrowed the gap (e.g., GPT-OSS 120B at 25.0%).
  • General-purpose models outperformed medical-specialized ones overall, likely due to scale and training diversity.

Temporal findings: contamination and obsolescence

  • 84% of models scored worse on cases posted after their knowledge cutoffs, indicating that older, static benchmarks can overestimate real ability.
  • This supports LiveMedBench’s live, post-cutoff testing approach as a better indicator of true clinical competence.

Grading alignment with humans

  • Rubric-based grader: strong alignment with physicians (not just vibes), catching safety issues and avoiding length bias.
  • LLM-as-a-Judge: weaker, often over-generous, and less sensitive to dangerous mistakes.

Where models did well or poorly

  • Better: Common fields like Gastroenterology/Hepatology and Emergency Medicine.
  • Harder: Niche surgical and specialty areas (e.g., Dentistry/Oral Surgery, Pediatric Surgery, Allergy/Immunology).
  • Thematic strength: models communicated reasonably under uncertainty, but struggled with Context-Seeking (not asking for missing details before advising).

Surprising and important findings

  • Main bottleneck: Contextual application—models knew many facts but failed to tailor them to patient specifics (35–48% of errors), and often overgeneralized guidelines (22–32%). Hallucinations were much rarer than expected for top models.
  • Retrieval helps: Giving models access to trusted sources (open-book) generally improved performance on fresh 2026 cases, showing that knowledge access—not just reasoning—is key.

05 Discussion & Limitations

šŸž Think of a weather app: it’s super helpful, but it can’t show you tomorrow’s snow if it only reads last summer’s reports.

Limitations (what this can’t do yet)

  • Source bias: Cases come from specific communities (US and China) and may under-represent other populations and health systems.
  • Text-only scope: Threads needing images (X-rays, rashes) are excluded; multimodal clinical tasks aren’t covered yet.
  • Rubric dependence: Grading is only as good as the criteria; rare edge cases may still be hard to encode.
  • Language coverage: English and Chinese today; other languages pending.
  • Retrieval quality: Evidence checks depend on the retriever and sources chosen.

Required resources

  • Access to capable LLMs for curation and grading, API budgets, and modest compute for weekly harvesting.
  • Medical guideline sources and retrieval tools.

When NOT to use it

  • Vision-heavy tasks (dermatology images, radiology scans) or device data (ECG waveforms) where text alone is insufficient.
  • Regulatory performance claims for clinical deployment without additional validation.
  • Training models directly on it (it’s for evaluation; using it for training could reintroduce contamination).

Open questions

  • How to make models proactively ask the right missing questions (better Context-Seeking)?
  • How to add safe, fair multimodal cases at scale?
  • Can we design even more robust graders that remain stable across model updates and languages?
  • How to expand sources to reduce demographic and practice-style bias while keeping verification rigorous?
  • What training approaches best reduce guideline overgeneralization without harming safety?

06 Conclusion & Future Work

Three-sentence summary

  • LiveMedBench is a live, contamination-resistant medical benchmark that turns real, recent physician-answered cases into clear, case-specific rubrics and uses them to objectively grade AI responses.
  • A multi-agent curation pipeline and evidence checks ensure clinical integrity, while the rubric-based grader aligns better with doctors than typical LLM-as-a-Judge methods.
  • Testing 38 models showed low absolute scores, clear performance drops on new post-cutoff cases, and a dominant failure in applying knowledge to patient context—retrieval access helps.

Main achievement

  • Delivering a continually refreshed, clinically verified, rubric-scored benchmark that provides honest, safety-aware report cards for medical AIs.

Future directions

  • Add multimodal (images, waveforms), expand languages and specialties, strengthen graders against drift, and explore training methods that reduce context and overgeneralization errors.

Why remember this

  • Because in medicine, fresh, fair, and transparent testing isn’t a luxury—it’s how we protect patients while improving AI. LiveMedBench sets a higher bar: no leaks, up-to-date cases, and grading that rewards what truly matters—accurate, complete, safe, and context-aware care.

Practical Applications

  • Evaluate hospital chatbots weekly to ensure they follow the latest triage and safety rules.
  • Audit telemedicine assistants for dangerous recommendations using negative (penalty) criteria.
  • Compare open-source and proprietary AIs fairly before selecting a system for a clinic.
  • Stress-test new medical models on post-cutoff cases to detect contamination and outdated knowledge.
  • Tune prompts and system instructions to improve context-seeking (asking missing key questions).
  • Integrate retrieval so models look up current guidelines and reduce overgeneralized advice.
  • Create specialty-specific dashboards (e.g., cardiology vs. dentistry) to target training improvements.
  • Localize evaluation for English and Chinese deployments and expand to more languages over time.
  • Use frozen snapshots to report reproducible scores in research and procurement decisions.
  • Run safety regression checks after any model update to catch newly introduced risks.
#LiveMedBench Ā· #medical benchmark Ā· #data contamination Ā· #temporal misalignment Ā· #clinical reasoning Ā· #rubric-based evaluation Ā· #LLM-as-a-Judge Ā· #RAG Ā· #multi-agent curation Ā· #SOAP framework Ā· #knowledge cutoff Ā· #evidence alignment Ā· #context-seeking Ā· #patient safety Ā· #multilingual evaluation