QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Intermediate · Santiago Gonzalez, Alireza Amiri Bavandpour et al. · Feb 24 · arXiv
This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.
#LLM-as-a-Judge #mathematical proof evaluation #alignment gap