SWE-rebench V2 is a large-scale, language-agnostic automated pipeline that turns real GitHub pull requests into safe, runnable software tasks for training AI coding agents.
This paper shows that you can vastly improve a model's command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.