Gaia2 is a new test that measures how well AI agents handle real-life messiness, such as changing events, deadlines, and team coordination.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
AgencyBench is a large test that checks how well AI agents can handle real, long, multi-step jobs rather than just short puzzles.