Papers3

#regression testing

SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.

Not triaged yet

Large AI models perform well in the lab but stumble in real-world use because real-world conditions keep changing.

Not triaged yet

SWE-EVO is a new benchmark that checks whether AI coding agents can upgrade real software projects over many steps, not just fix one small bug.

Not triaged yet