Papers3

#software evolution

SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.

Not triaged yet

Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.

Not triaged yet

SWE-EVO is a new benchmark that checks whether AI coding agents can upgrade real software projects over many steps, not just fix one small bug.

Not triaged yet