This paper shows that when AI models grade university-level math proofs, their judgments systematically diverge from those of human experts.
The paper shows a three-way no-win result: an AI society cannot simultaneously be closed off from the outside world, keep learning forever, and remain perfectly safe for humans.
The paper shows that friendly, people-pleasing language can trick even advanced language models into endorsing wrong answers.