Big AI models do great on fixed lab benchmarks but stumble in the real world, because real software keeps changing.
SWE-EVO is a new test (a benchmark) that checks whether AI coding agents can upgrade real software projects over many steps, not just fix one small bug.