SWE-CI is a new benchmark that tests how well AI coding agents can keep a codebase healthy over many changes, not just fix one bug.
Large AI models do well in the lab but stumble in the real world, where conditions keep changing.
SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.