Papers (5)

#data contamination

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez, Alireza Amiri Bavandpour et al. · Feb 24 · arXiv

This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.

#LLM-as-a-Judge #mathematical proof evaluation #alignment gap

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Beginner

Zhiling Yan, Dingjie Song et al. · Feb 10 · arXiv

LiveMedBench is a new, continuously updated benchmark for medical LLMs that keeps test questions separate from training data, so models cannot score well simply by memorizing answers.

#LiveMedBench #medical benchmark #data contamination

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Intermediate

Zhen Wang, Fan Bai et al. · Feb 2 · arXiv

FIRE-Bench is a new benchmark that checks whether AI agents can rediscover real scientific findings step by step, rather than just guess the answers.

#FIRE-Bench #scientific agents #rediscovery benchmark

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Intermediate

Lecheng Yan, Ruizhe Li et al. · Jan 16 · arXiv

The paper shows that when an LLM is trained with spurious (misleading) rewards in RLVR, it can score higher by memorizing answers instead of reasoning.

#RLVR #data contamination #memorization shortcuts

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Intermediate

Jie Wu, Haoling Li et al. · Jan 11 · arXiv

X-Coder shows that models can learn expert-level competitive programming from fully synthetic data, with no real contest problems needed.

#competitive programming #synthetic data generation #feature-based synthesis
