CLI-Gym is a new method for generating many realistic environment-repair tasks for AI agents by safely breaking, and then repairing, software environments inside containers.
FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.