Long tasks trip up most AI systems, which lose track of their goals and make small mistakes that snowball over many steps.
SWE-EVO is a new test (a benchmark) that checks whether AI coding agents can upgrade real software projects over many steps, not just fix one small bug.
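
To see why small mistakes snowball on long tasks, here is a rough back-of-the-envelope sketch. It assumes every step succeeds with the same probability and that steps fail independently of each other. Both are simplifying assumptions made for illustration only. The function name `chain_success_rate` and the 98% figure are hypothetical and do not come from SWE-EVO.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step in a task succeeds.

    Illustrative assumption: each step succeeds independently with the
    same probability, so the chances multiply together.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical agent that gets each individual step right 98% of the time.
for steps in (1, 10, 50):
    rate = chain_success_rate(0.98, steps)
    print(f"{steps:>3} steps: {rate:.2f} chance of a fully correct run")
```

Under these assumptions, an agent that is right 98% of the time on a single step finishes a 10-step task cleanly only about 82% of the time, and a 50-step task about 36% of the time. Real agents can sometimes notice and undo mistakes, so this is a simplified picture rather than a measurement.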