GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
CLI-Gym is a new way to create many realistic environment-repair tasks for AI agents by safely breaking and then repairing software environments inside containers.
ABC-Bench is a new test that checks if AI coding agents can really do backend work from start to finish, not just write a few lines of code.