Terminal-Bench 2.0 is a demanding benchmark that checks how well AI agents can solve realistic, professional tasks by typing commands in a computer terminal.
AgencyBench is a large benchmark that checks how well AI agents can handle realistic, long, multi-step jobs rather than short puzzles.