
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Intermediate
Minh V. T. Thai, Tue Le, Dung Nguyen Manh et al. Ā· 12/20/2025
arXiv Ā· PDF

Key Summary

  • SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.
  • Tasks come from seven mature Python libraries and ask agents to follow release notes to move a codebase from one version to the next.
  • Each task is big: the ideal solution edits about 21 files and is checked by around 874 tests on average, so tiny fixes are not enough.
  • Even top models struggle: GPT-5 solved about 21% of SWE-EVO tasks, compared to 65% on the easier SWE-Bench Verified benchmark.
  • SWE-EVO adds a soft metric called Fix Rate that gives credit for partial progress but gives zero if the agent breaks any previously passing tests.
  • Failures reveal what goes wrong: stronger models mostly misread tricky instructions in release notes, while smaller models trip on tools or syntax.
  • More pull requests linked to a change usually means a harder task, and stronger models spend more turns on harder cases.
  • SWE-EVO keeps SWE-Bench’s simple setup but raises difficulty with longer instructions, broader edits, and heavier regression checks.
  • This benchmark highlights the big gap between one-off fixes and true software evolution, guiding research toward agents that can plan across many files and versions.
  • It matters for real teams because most software work is maintenance and evolution, not writing brand-new code.

Why This Research Matters

Most real software work is evolving and maintaining big codebases, not writing tiny snippets. SWE-EVO directly tests whether AI agents can follow human instructions (release notes) to safely move a whole project to the next version. This matters for companies that want dependable AI help across many files without breaking production. It exposes where today’s agents fail—especially on instruction following and multi-file coordination—so teams can invest in the right fixes. The Fix Rate metric rewards steady, safe progress, encouraging development practices that mirror real engineering values. In short, SWE-EVO points the way to AI teammates that can truly help with complex, long-term software upgrades.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re helping rebuild a school over the summer. It’s not just fixing one squeaky door—you have to follow the principal’s plan, update many classrooms, keep the cafeteria working, and make sure the fire alarms still pass safety checks.

🄬 The Concept (Coding Agents):

  • What it is: Coding agents are AI helpers that read, write, and change code in software projects.
  • How it works:
    1. Read a goal described in words.
    2. Find the right files and functions.
    3. Edit the code and run tests.
    4. Repeat until the tests pass.
  • Why it matters: Without coding agents, we can’t scale help across huge codebases or speed up routine tasks for developers.

šŸž Anchor: A coding agent can change a data-loading function and then run tests to be sure it didn’t break the rest of the app.
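
A minimal sketch of that read-edit-test loop in Python: `propose_edit` is a placeholder for the model step, and this is only an illustration of the cycle, not the agent scaffolds (OpenHands, SWE-agent) evaluated in the paper.

```python
import subprocess
from pathlib import Path

def run_tests(repo: Path) -> tuple[bool, str]:
    """Run the repository's pytest suite; return (all_passed, tail of the log)."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=repo, capture_output=True, text=True)
    return result.returncode == 0, result.stdout[-2000:]

def propose_edit(goal: str, repo: Path, feedback: str) -> None:
    """Placeholder for the model step: read the goal plus test feedback, edit files in place."""
    raise NotImplementedError("plug in a model or agent framework here")

def agent_loop(goal: str, repo: Path, max_turns: int = 20) -> bool:
    """Steps 1-4 above: read the goal, edit the code, run tests, repeat until green."""
    feedback = ""
    for _ in range(max_turns):
        propose_edit(goal, repo, feedback)   # locate files and apply an edit
        passed, feedback = run_tests(repo)   # check the whole suite
        if passed:
            return True                      # done: nothing is failing
    return False
```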

šŸž Hook: You know how game patch notes tell you what’s new, fixed, or changed? Software has notes like that too.

🄬 The Concept (Release Notes):

  • What it is: Release notes are short guides telling what changed from one software version to the next.
  • How it works:
    1. Developers ship a new version.
    2. They list new features, fixes, and important warnings.
    3. Users (and AIs) read these notes to understand required changes.
  • Why it matters: Without release notes, agents won’t know what to build or fix for the next version.

šŸž Anchor: A release note might say, ā€œAdd Google login and fix a crash in payment retries.ā€

šŸž Hook: Following a recipe helps you bake the cake you actually want, not a random dessert.

🄬 The Concept (Software Requirement Specifications, SRS):

  • What it is: SRS are detailed instructions describing exactly what software should do.
  • How it works:
    1. Collect needs from users.
    2. Write clear, checkable requirements.
    3. Build and test against those requirements.
  • Why it matters: Without an SRS, you risk building the wrong thing or breaking what already works.

šŸž Anchor: ā€œThe app must allow login with Google while keeping old email sign-ins workingā€ is an SRS-style requirement.

šŸž Hook: Big group projects go faster when teammates split roles, like planner, builder, and checker.

🄬 The Concept (Multi-Agent Systems):

  • What it is: Multiple AI agents with different jobs working together.
  • How it works:
    1. Planner decides steps.
    2. Coder edits files.
    3. Tester runs checks.
    4. Coordinator loops them until done.
  • Why it matters: One agent can get overwhelmed; teams handle long, tricky tasks better.

šŸž Anchor: One agent maps files to change, another writes code, a third runs tests and reports failures.
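
A toy sketch of this division of labor; the role functions are stand-ins for model and test calls, not any specific multi-agent framework.

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    goal: str
    plan: list[str] = field(default_factory=list)   # steps still to do
    report: str = ""                                # latest test report

def planner(ws: Workspace) -> None:
    """Stand-in for a planning agent: split the goal into concrete steps."""
    ws.plan = [f"implement: {ws.goal}", f"verify: {ws.goal}"]

def coder(ws: Workspace) -> None:
    """Stand-in for a coding agent: carry out the next planned step."""
    ws.plan.pop(0)

def tester(ws: Workspace) -> None:
    """Stand-in for a testing agent: run checks and write a report."""
    ws.report = "all green" if not ws.plan else "steps remaining"

def coordinator(goal: str, max_rounds: int = 10) -> str:
    """Loop planner -> coder -> tester until the plan is exhausted."""
    ws = Workspace(goal)
    planner(ws)
    for _ in range(max_rounds):
        if not ws.plan:
            break
        coder(ws)
        tester(ws)
    return ws.report

print(coordinator("add GitHub login"))   # -> "all green"
```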

šŸž Hook: Growing a garden takes many seasons—plant, prune, fertilize, and keep it healthy.

🄬 The Concept (Long-Horizon Software Evolution):

  • What it is: Making many coordinated changes over time so a whole codebase grows from version A to version B without breaking.
  • How it works:
    1. Read high-level goals (release notes/SRS).
    2. Plan multi-step edits across many files.
    3. Implement and test repeatedly.
    4. Keep old features working (no regressions).
  • Why it matters: Real software work is mostly maintenance and evolution, not one-line fixes.

šŸž Anchor: Adding GitHub login touches routes, database, security checks, and settings, while keeping email login working.

The world before this paper: Early benchmarks like HumanEval and MBPP checked tiny, single-file tasks. SWE-Bench was a big jump because it used real GitHub issues and tests, but it still measured isolated fixes. Meanwhile, industry moved fast: by 2025, over 90% of teams used AI in development, and most day-to-day work was maintaining and evolving large, messy codebases.

The problem: Existing tests don’t ask agents to read long release notes, coordinate edits across many files, and pass huge test suites. As models got better, SWE-Bench scores climbed, but that didn’t prove agents could guide a whole codebase to the next release.

Failed attempts: People tried bigger models and clever agent scaffolds. They won more one-off issues but sometimes passed benchmarks with incomplete fixes or limited test coverage. The hard part—stitching many edits together safely—wasn’t truly measured.

The gap: We lacked a benchmark that says, ā€œHere’s version A, here are the release notes—evolve this repo to version B and don’t break anything.ā€

Real stakes: Think of upgrading a popular library used by millions. A wrong change could break many apps. Teams need agents that can follow specs, change many files, and keep thousands of tests green. That’s daily life in software maintenance, and that’s what SWE-EVO targets.

02Core Idea

šŸž Hook: Renovating a house isn’t just fixing one loose doorknob—it’s updating rooms, plumbing, and electricity while keeping the lights on.

🄬 The Concept (SWE-EVO):

  • What it is: SWE-EVO is a benchmark that tests if coding agents can evolve a whole software project from one release to the next by following release notes and passing big test suites.
  • How it works:
    1. Give the agent a real repo snapshot (before).
    2. Provide the release notes that describe the next version.
    3. The agent plans, edits many files, and runs tests.
    4. Success means all targeted failures are fixed and nothing else breaks.
  • Why it matters: Without a realistic evolution test, we can’t tell if agents are ready for real-world maintenance.

šŸž Anchor: For a data library, the agent must add a new warning behavior, fix two bugs, update docs, and still pass hundreds of old tests.
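
A sketch of what one SWE-EVO instance roughly hands the agent and what strict success requires, using illustrative field names rather than the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SweEvoTask:
    repo: str                 # one of the seven Python projects
    start_version: str        # tagged release the agent starts from
    end_version: str          # tagged release the notes describe
    release_notes: str        # long natural-language spec (~2,390 words on average)
    fail_to_pass: list[str]   # tests that must flip from failing to passing
    pass_to_pass: list[str]   # tests that must keep passing (no regressions)

def is_resolved(task: SweEvoTask, passing: set[str]) -> bool:
    """Strict success: every targeted test now passes and nothing else broke."""
    return (all(t in passing for t in task.fail_to_pass)
            and all(t in passing for t in task.pass_to_pass))
```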

The ā€œAha!ā€ in one sentence: Measure what we really need—can AI agents transform a codebase across versions using release notes as the map—then score them both on total success and meaningful partial progress.

Three analogies:

  • Garden: Nurture many plants (files) over a season (release) without letting weeds (regressions) spread.
  • Orchestra: Many instruments (modules) must play the new song (features/fixes) in sync, not off-key.
  • City upgrade: Update roads, bridges, and traffic lights (APIs, models, tests) while keeping traffic flowing.

Before vs After:

  • Before (isolated issues): One bug, one patch, few tests; good for quick wins but not whole-project change.
  • After (SWE-EVO): Long instructions (about 2,390 words), edits across ~21 files and ~51 functions, with ~874 total tests; agents must plan and coordinate.

šŸž Hook: Report cards that only say pass/fail can miss how much a student improved.

🄬 The Concept (Resolved Rate vs. Fix Rate):

  • What it is: Resolved Rate is strict pass/fail for a whole task; Fix Rate is a soft score that gives credit for how many targeted failing tests are fixed, but it becomes zero if any previously passing test breaks.
  • How it works:
    1. Split tests into FAIL_TO_PASS (should start failing, end passing) and PASS_TO_PASS (should stay passing).
    2. Resolved Rate: all must pass → 1, otherwise 0.
    3. Fix Rate: fraction of FAIL_TO_PASS that now pass, only if no regressions were introduced.
  • Why it matters: Without Fix Rate, near-misses look the same as no progress; with it, we see steady improvement and safer behavior.

šŸž Anchor: If an agent fixes 8 of 10 target tests but breaks no other tests, Fix Rate = 0.8; but if it breaks one old test, Fix Rate = 0.
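
A standalone sketch of both scores as defined above; the dictionaries map test names to whether they pass after the agent's patch (this mirrors the definitions, not the benchmark's official evaluation harness):

```python
def resolved_rate(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> int:
    """Strict score: 1 only if every targeted test now passes and no old test broke."""
    return int(all(fail_to_pass.values()) and all(pass_to_pass.values()))

def fix_rate(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> float:
    """Soft score: fraction of targeted tests fixed, but 0 if any regression appears."""
    if not all(pass_to_pass.values()):   # a previously passing test broke
        return 0.0
    if not fail_to_pass:
        return 0.0
    return sum(fail_to_pass.values()) / len(fail_to_pass)

# The anchor example: 8 of 10 targeted tests fixed, no regressions -> Fix Rate 0.8
targets = {f"test_target_{i}": i < 8 for i in range(10)}
old_tests = {"test_old_feature": True}
assert fix_rate(targets, old_tests) == 0.8
assert resolved_rate(targets, old_tests) == 0   # not all targets pass, so strict score is 0

# Break one previously passing test -> Fix Rate collapses to 0, whatever the progress
assert fix_rate(targets, {"test_old_feature": False}) == 0.0
```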

Why SWE-EVO works: Release notes mirror real SRS-style guidance; big test suites guard against hidden breakage; and the PR-count difficulty signal captures that some version upgrades bundle many coordinated changes. Together, this stresses instruction following, multi-file reasoning, and regression safety.

Building blocks:

  • Data: Version-tagged snapshots from 7 mature Python repos (e.g., scikit-learn, pydantic, dvc, dask, requests, modin, conan).
  • Tasks: Move the codebase from start_version to end_version per release notes.
  • Metrics: Resolved Rate (binary) and Fix Rate (soft, regression-safe).
  • Features: Much longer specs, wider edits (ā‰ˆ610 lines changed on average), bigger tests (ā‰ˆ81 FAIL_TO_PASS, ā‰ˆ874 total).

03Methodology

At a high level: Input (release note + start-version repo) → Plan multi-step changes → Edit many files → Run tests and iterate → Output a patch that upgrades the repo without regressions.

Step-by-step recipe:

  1. Repository selection and data scraping
  • What happens: The authors inherit repositories and environments from SWE-Bench/SWE-gym so agents can run code and tests out-of-the-box.
  • Why this step exists: Recreating environments is hard; reusing stable, executable setups keeps evaluation fair and plug-and-play.
  • Example: A scikit-learn snapshot includes code, tests, and a Docker image so agents can install dependencies and run pytest.
  2. Candidate selection using versions
  • What happens: They keep only cases where the base commit is a tagged release. They then define the task as the change described by the release notes to the next tagged release.
  • Why this step exists: Tagged versions give clear before/after states; release notes provide the human-facing SRS.
  • Example: Start at dask 2023.6.1 and aim for 2023.7.0 with the listed fixes and enhancements.
  3. Execution-based filtering
  • What happens: They apply just the test changes and check that at least one test is FAIL_TO_PASS (failing before, passing after the real patch) and that the environment runs cleanly.
  • Why this step exists: Guarantees each task has a measurable behavioral change and won’t crash the runner.
  • Example: Keep instances with at least 1 FAIL_TO_PASS and no installation/runtime errors (a sketch of this check appears after the recipe).
  4. Task formulation
  • Model input: The agent gets (i) the release-note text and (ii) the full pre-release codebase. Sometimes, extra PR/issue text referenced by the notes is also included.
  • Model output: A multi-file patch (diff) that edits lines across files to implement the change.
  • Tests: Split into FAIL_TO_PASS (should flip to passing) and PASS_TO_PASS (must stay passing to avoid regressions).
  • Why this step exists: Makes success measurable and aligns with real engineering practices.
  • Example numbers (means): 363 non-test files (~78K lines), gold patch edits ~610 lines in ~21 files and ~51 functions, with ~81 FAIL_TO_PASS and ~874 total tests.
  5. Evaluation metrics
  • Resolved Rate: 1 if all relevant tests pass, else 0.
  • Patch Apply Rate: percent of patches that apply cleanly (no syntax/context errors).
  • Fix Rate: partial credit for the fraction of FAIL_TO_PASS tests fixed, but zero if any PASS_TO_PASS test breaks.
  • Why this step exists: Resolved Rate is clear but coarse; Fix Rate reveals meaningful progress in large suites without rewarding regressions.
  6. Benchmark features
  • Compared to SWE-Bench: Release notes are far longer (~2,390 vs ~195 words), patches touch more lines/files/functions (~610 lines, ~21 files, ~51 functions), and tests are heavier (~81 FAIL_TO_PASS, ~874 total). This enforces long-context reading and coordinated edits.
  • Difficulty diversity: One SWE-EVO task often corresponds to multiple upstream PRs; more PRs generally means tougher, multi-step evolution.
  • Robustness: At least one FAIL_TO_PASS per instance; ~81% have two or more.
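
As a rough sketch of the execution-based filter from step 3 (not the authors' actual pipeline): apply only the gold test changes, record which tests fail on the start-version code, then confirm that the developers' real code patch makes at least one of them pass.

```python
import subprocess
from pathlib import Path

def test_passes(repo: Path, test_id: str) -> bool:
    """Run a single test with pytest; exit code 0 means it passed."""
    proc = subprocess.run(["python", "-m", "pytest", "-q", test_id],
                          cwd=repo, capture_output=True)
    return proc.returncode == 0

def git_apply(repo: Path, patch: Path) -> None:
    """Apply a patch file to the working tree."""
    subprocess.run(["git", "apply", str(patch)], cwd=repo, check=True)

def keep_instance(repo: Path, tests: list[str],
                  test_patch: Path, gold_code_patch: Path) -> bool:
    """Keep a candidate only if, with the new tests applied, at least one test
    fails on the start-version code and passes once the real code patch is applied."""
    git_apply(repo, test_patch)                        # bring in the new/updated tests
    before = {t: test_passes(repo, t) for t in tests}  # results on the old code
    git_apply(repo, gold_code_patch)                   # apply the developers' real fix
    after = {t: test_passes(repo, t) for t in tests}   # results on the new code
    fail_to_pass = [t for t in tests if not before[t] and after[t]]
    return len(fail_to_pass) >= 1                      # need a measurable behavior change
```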

The secret sauce:

  • Using release-note deltas as the high-level SRS ensures tasks demand instruction following, not just bug localization.
  • The Fix Rate’s regression rule (break anything → 0) strongly rewards safe progress, matching how real teams value ā€œdon’t break prod.ā€
  • PR-count-as-difficulty is a pragmatic proxy: many coordinated upstream changes usually mean deeper reasoning and wider code impact.

Concrete walk-through example:

  • Input: ā€œIn requests, ensure redirects preserve auth headers; add a warning for deprecated parameter X; fix timeout handling.ā€
  • Plan: Find auth handling and redirect code, identify deprecation sites, locate timeout logic and tests.
  • Edit: Update redirect function to keep headers; add a DeprecationWarning at the right call path; adjust timeout logic.
  • Test: Run FAIL_TO_PASS items; confirm no PASS_TO_PASS test fails.
  • Output: A single patch that edits multiple files and passes the whole suite.
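
To make these edits concrete, here is a hypothetical snippet (not the real requests codebase; the function names are invented) sketching what "keep auth headers on redirect", "warn on a deprecated parameter", and "fix timeout handling" might look like:

```python
import warnings

def build_redirect_headers(original: dict[str, str]) -> dict[str, str]:
    """Keep the Authorization header when following a redirect
    (an earlier version might have dropped it)."""
    return dict(original)   # copy everything, auth header included

def fetch(url: str, *, timeout: float = 30.0, retries: int | None = None) -> None:
    """Hypothetical entry point showing the other two edits."""
    if retries is not None:
        # The warning the release note asks for on the deprecated parameter.
        warnings.warn("'retries' is deprecated; use a session-level retry policy",
                      DeprecationWarning, stacklevel=2)
    if timeout <= 0:
        # Fixed timeout handling: reject invalid values instead of silently ignoring them.
        raise ValueError("timeout must be positive")
```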

04Experiments & Results

The test: Measure how well popular agent frameworks (OpenHands, SWE-agent) plus many LLMs evolve real repos per release notes. Report Resolved Rate (strict), Patch Apply Rate (valid diffs), and Fix Rate (partial but regression-safe).

The competition: Compare the same models on SWE-EVO versus SWE-Bench Verified to see how much harder long-horizon evolution is.

The scoreboard with context:

  • Top-line difficulty: Even GPT-5 resolves only about 19–21% of SWE-EVO tasks, versus 65% on SWE-Bench Verified. That’s like going from scoring an A on single homework problems to a C- on a full-semester project.
  • Scaling trend: Bigger models beat smaller ones consistently (e.g., gpt-5 > gpt-5-mini > gpt-5-nano). Rankings mirror SWE-Bench but with much lower absolute scores.
  • Apply rates: Many models achieve ā‰ˆ100% Patch Apply Rate, meaning they can produce syntactically valid diffs, but correctness under tests is the challenge.
  • Context helps a bit: Adding referenced PR/issue text gives modest gains, but agents still struggle to implement exactly what the release notes imply.

Fix Rate reveals hidden differences:

  • Under OpenHands, gpt-4.1 and gpt-oss-120b both resolve only ~2.08% (1/48), but Fix Rate separates them (~4.65% vs ~2.08%), showing one makes steadier partial progress.
  • Stronger models show higher Fix Rates (e.g., GPT-5 around 27–31% depending on agent), highlighting meaningful improvements even when full resolution is rare.

Surprising and instructive findings:

  • Failure fingerprints differ: Strong models rarely fail due to syntax/tooling but often misread or incompletely follow long, nuanced instructions (Instruction Following failures >60%). Smaller models more often stumble on tool use, syntax, or looping.
  • Effort matches difficulty for the best models: GPT-5 variants spend more turns on harder instances (with more PRs), suggesting adaptive planning. In contrast, some models (e.g., o3) use many turns regardless of difficulty, and others (e.g., deepseek-r1) often use few turns even when hard—risking premature decisions.
  • Difficulty proxy confirmed: Instances solved often have fewer PRs linked; the hardest buckets average ~15 PRs, while the easiest average under 2.

Big picture: SWE-EVO exposes a capability gap in sustained, multi-file reasoning and faithful instruction following. Passing many tests at once, without breaking others, is the true bottleneck.

05Discussion & Limitations

Limitations:

  • Language scope: Python-only for now; results may not generalize to Java, C++, or front-end stacks.
  • Specification source: Relies on release notes; some evolutions (e.g., performance tweaks or silent security patches) may be under-described.
  • Scale: 48 high-quality instances enable careful study but limit fine-grained statistical comparisons.
  • Context completeness: Linked PR/issue text isn’t always available or perfectly aligned with the release note intent.

Required resources:

  • Reproducible containers (Docker images), runnable test suites, and enough compute for up to ~100 agent iterations per task.
  • Agent frameworks (e.g., OpenHands, SWE-agent) and access to strong LLMs for best performance.

When NOT to use SWE-EVO:

  • If you only need function-level code completion or toy single-file bug fixes—lighter benchmarks (HumanEval, MBPP, SWE-Bench Lite) run faster and are more targeted.
  • If your stack isn’t Python and you need language-specific signals today.

Open questions:

  • How to improve instruction following on long, nuanced release notes (better parsing, retrieval, or planning)?
  • How to orchestrate multi-agent roles to cover planning, editing, verification, and rollback more reliably?
  • How to add memory and world-modeling so agents keep consistent plans across many edits?
  • Can we grow SWE-EVO to more languages, larger scales, and richer specs (e.g., design docs, ADRs, API schemas)?
  • What additional soft metrics (beyond Fix Rate) best capture safe, incremental progress without rewarding regressions?

06Conclusion & Future Work

Three-sentence summary: SWE-EVO is a realistic benchmark that asks coding agents to evolve full codebases from one tagged version to the next by following release notes and passing large test suites. Experiments show a sharp difficulty jump versus SWE-Bench Verified—top models solve only about one-fifth of tasks—revealing limits in long-horizon, multi-file reasoning and precise instruction following. The new Fix Rate metric adds valuable nuance by rewarding partial progress while strictly disallowing regressions.

Main achievement: Defining and operationalizing software evolution as the evaluation target—complete with versioned snapshots, long natural-language specs, and regression-heavy tests—so we can measure what real engineering actually needs.

Future directions: Expand to more languages and repos, incorporate richer specs (design docs, ADRs), study better planning/verification loops, and refine soft metrics. Investigate training methods (e.g., RL on evolution trajectories) and tool creation to strengthen instruction following and regression safety.

Why remember this: SWE-EVO shifts the goalposts from fixing one-off issues to safely steering entire systems between versions—the kind of work most engineers actually do. It reveals today’s capability gap, offers clearer progress signals (Fix Rate), and charts a research path toward truly autonomous, production-ready coding agents.

Practical Applications

  • Evaluate and compare coding agents on realistic, multi-file evolution tasks before adopting them in production.
  • Select the right LLM or agent scaffold (OpenHands vs. SWE-agent) based on Resolved Rate and Fix Rate.
  • Tune prompts and workflows to improve instruction following on long, nuanced release notes.
  • Train agents with reinforcement learning on SWE-EVO-style tasks to boost planning and regression safety.
  • Use Fix Rate to monitor incremental progress during long upgrades and prevent risky regressions.
  • Build multi-agent pipelines (planner, coder, tester) tailored to release-note-driven evolution.
  • Prioritize tasks by PR-count difficulty to allocate more compute or human review where needed.
  • Stress-test tool use and patch application reliability in large repositories.
  • Design adaptive stopping rules (more turns on harder instances) informed by SWE-EVO findings.
  • Create internal benchmarks mirroring SWE-EVO to track your agent’s performance over time.
#SWE-EVO Ā· #software evolution Ā· #coding agents Ā· #benchmarking Ā· #release notes Ā· #software requirements (SRS) Ā· #multi-agent systems Ā· #Fix Rate Ā· #Resolved Rate Ā· #SWE-Bench Ā· #long-horizon planning Ā· #regression testing Ā· #repository-level evaluation Ā· #pull requests Ā· #codebase evolution