Papers (4)

#agent benchmarking

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Romain Froger, Pierre Andrews et al. · Feb 12 · arXiv

Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.

#Gaia2 #ARE platform #asynchronous environments


Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench #command line interface #Docker containers


AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate

Keyu Li, Junhao Shi et al. · Jan 16 · arXiv

AgencyBench is a large benchmark that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.

#autonomous agents #long-horizon evaluation #agent benchmarking


The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Beginner

Daocheng Fu, Jianbiao Mei et al. · Jan 13 · arXiv

The paper introduces Trainee-Bench, a new way to test AI agents that feels like a real first day at work, with tasks arriving over time, hidden clues, and changing priorities.

#Trainee-Bench #dynamic task scheduling #active exploration
