ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
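To make the tournament idea concrete, here is a minimal sketch (not ArenaRL's actual algorithm): a batch of candidate answers is compared pairwise by a judge, and each answer's reward is its win rate against the rest of the batch rather than an absolute score. The `judge_prefers` function and the win-rate reward are assumptions for illustration only.

```python
import itertools
from typing import Callable, List


def tournament_rewards(
    answers: List[str],
    judge_prefers: Callable[[str, str], bool],
) -> List[float]:
    """Round-robin tournament over a batch of answers.

    Each answer is compared against every other answer once; its reward is
    the fraction of comparisons it wins. Rewards are therefore relative to
    the current batch, not a single noisy per-answer score.
    """
    wins = [0] * len(answers)
    for i, j in itertools.combinations(range(len(answers)), 2):
        if judge_prefers(answers[i], answers[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    games_per_answer = max(1, len(answers) - 1)
    return [w / games_per_answer for w in wins]


if __name__ == "__main__":
    # Toy judge that prefers the longer answer; stands in for an LLM judge.
    toy_judge = lambda a, b: len(a) > len(b)
    batch = ["short", "a medium answer", "a much longer, detailed answer"]
    print(tournament_rewards(batch, toy_judge))  # [0.0, 0.5, 1.0]
```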
This paper asks whether we are judging AI answers the right way, and introduces Sage, a new method for testing AI judges without relying on human-graded answers.
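The paper's own protocol is not reproduced here, but the sketch below shows one example of how a judge can be tested without any human-graded answers: check whether its verdict survives swapping the order in which the two answers are presented. The `judge` interface, the verdict labels "A"/"B", and the consistency metric are illustrative assumptions, not Sage's method.

```python
from typing import Callable, List, Tuple

# A judge takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of answer pairs where the judge's verdict is unchanged
    when the two answers swap presentation order.

    No gold labels are needed: we only check that the judge agrees with
    itself, which is one label-free signal of judge quality.
    """
    consistent = 0
    for question, ans1, ans2 in pairs:
        forward = judge(question, ans1, ans2)   # ans1 shown as "A"
        reversed_ = judge(question, ans2, ans1)  # ans1 shown as "B"
        # Consistent iff the same underlying answer wins both times.
        if (forward == "A") == (reversed_ == "B"):
            consistent += 1
    return consistent / max(1, len(pairs))
```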