Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Intermediate · Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv
Terminal-Bench 2.0 is a tough benchmark that measures how well AI agents can solve realistic, professional tasks by running commands in a terminal.
#Terminal-Bench #command line interface #Docker containers