This paper studies how AI agents get better while they are working, not just whether they finish the job.
AgencyBench is a large benchmark that checks how well AI agents handle real, long, multi-step jobs rather than just short puzzles.