This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.
AgencyBench is a large benchmark that checks how well AI agents can handle realistic, long, multi-step tasks, not just short puzzles.