This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
CLI-Gym is a new way to create large numbers of realistic environment-repair tasks for AI agents by safely breaking and then repairing software environments inside containers.
FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.
This paper builds a team of AI agents that can make real full-stack websites (frontend, backend, and database) from plain English instructions.