AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Intermediate · Keyu Li, Junhao Shi et al. · Jan 16 · arXiv
AgencyBench is a large benchmark that checks how well AI agents handle real, long, multi-step jobs in contexts of around 1M tokens, not just short puzzles.
#autonomous agents #long-horizon evaluation #agent benchmarking