
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Intermediate
Qixing Zhou, Jiacheng Zhang, Haiyang Wang et al., 2/11/2026
arXiv

Key Summary

  • FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.
  • It uses execution-based checks, meaning the code must actually run and pass unit tests to count as correct.
  • Tasks are created automatically from real open-source Python repositories by tracing which parts of the code each test touches.
  • Each task includes a clear, callable interface so agents know exactly what function names, inputs, and outputs to implement.
  • The benchmark keeps other features working using pass-to-pass (P2P) tests while the target feature is removed using fail-to-pass (F2P) tests.
  • It scales easily and can be updated over time, helping prevent data leakage from models memorizing old tasks.
  • Across 200 tasks from 24 repositories, top agents that excel on SWE-bench solved only about 11% here, showing FeatureBench is much harder.
  • Ablations show clear interfaces and visible tests help a lot, while missing interfaces and limited steps significantly reduce success.
  • The toolkit also provides 3,825 executable environments, making it valuable for both evaluation and agent training.

Why This Research Matters

When AI can build full features reliably, software teams can ship improvements faster and more safely. FeatureBench measures that real skill by forcing agents to pass exact interfaces and keep other parts of the codebase working. This helps companies choose better AI tools and helps researchers see where current systems fail, like cross-file reasoning and long-horizon planning. Because tasks are executable and continually updated from real repositories, the benchmark stays relevant and reduces data leakage. The 3,825 runnable environments also double as training grounds to make future agents stronger. In short, FeatureBench turns “AI can fix small issues” into “AI can help ship real features,” which is what truly matters in production software.

Detailed Explanation

01Background & Problem Definition

🍞 Hook: Imagine you’re asked to add a brand-new game mode to your favorite video game, not just fix a tiny visual glitch. That’s a lot harder, right?

🥬 The Concept (Agentic Coding): Agentic coding means an AI developer plans, decides, and writes code step by step like a real engineer. How it works: 1) Read a goal, 2) Make a plan, 3) Edit files, run tools, and tests, 4) Iterate until it works. Why it matters: Without this autonomy, AIs can fix tiny issues but struggle to build full features that touch many files. 🍞 Anchor: An AI not only fixes a typo in a Python function but also adds a new model class with proper imports, tests, and docs.

The World Before: AI coding assistants got surprisingly good at narrow tasks—especially fixing individual bugs from a single pull request. Benchmarks like SWE-bench became the scoreboards to measure these abilities. But software teams don’t spend most of their time on tiny bug patches; they plan and ship features that often span multiple files, multiple commits, and require keeping all the other parts of the app working. That’s where previous benchmarks fell short: they mainly tested bug-level edits, sometimes didn’t run the code end-to-end, and often depended on handcrafted or PR-only pipelines that missed the full “feature” picture.

🍞 Hook: You know how building a treehouse takes planning, tools, and keeping the rest of the yard intact? Feature work in codebases is just like that. 🥬 The Concept (Feature-level Coding Tasks): These are bigger missions that add new capabilities to software. How it works: 1) Understand the spec, 2) Wire interfaces, 3) Implement logic across files, 4) Keep existing stuff working. Why it matters: Without testing at this level, we can’t tell if AI can really “ship features,” not just patch holes. 🍞 Anchor: Instead of only fixing a failing test, the AI adds a new GPT-2 model class and ensures other models like BERT still work.

The Problem: Three blockers stood in the way of a realistic, scalable feature benchmark. First, ambiguous requirements meant multiple “valid” solutions that didn’t match what tests expected. Second, PR-based mining missed real feature patches scattered across commits and time, with missing tags or context. Third, it was hard to keep benchmarks fresh and executable as code evolves, leading to data leakage or stale tasks.

Failed Attempts: PR-only approaches capture clean, human-written snapshots but often miss multi-PR features; commit-synthesis methods can create tasks but don’t always guarantee the rest of the codebase still works; and LLM-judged tasks risk subjective scoring. Some pipelines dropped pass-to-pass (P2P) checks, so agents could “solve” the target while silently breaking neighbors—unlike real development where regression is a big no-no.

🍞 Hook: Think of test-driven building like following a checklist while assembling a bike—you know every bolt to tighten. 🥬 The Concept (Execution-based Evaluation): This checks code by actually running unit tests. How it works: 1) Set up an environment, 2) Run tests, 3) Pass means correct, fail means not done. Why it matters: Without running code, we can’t trust that it really works in practice. 🍞 Anchor: The AI’s new feature only counts if pytest passes on both new and existing tests.

The Gap: We needed a benchmark that (1) targets feature development explicitly, (2) evaluates by executing code and tests, and (3) scales automatically from real repositories while ensuring other features remain intact. We also needed clear interfaces so agents know exactly what to implement, avoiding ambiguity.

🍞 Hook: Imagine removing one Lego structure from a big build without collapsing the rest. 🥬 The Concept (Fail-to-Pass and Pass-to-Pass Tests): F2P tests should fail before and pass after the feature is implemented; P2P tests should pass both before and after, guarding other features. How it works: 1) Pick failing-on-purpose target tests (F2P), 2) Pick guard tests (P2P), 3) Validate both before-and-after. Why it matters: Without P2P, agents could “win” by breaking unrelated parts. 🍞 Anchor: Add GPT-2 support while BERT and LLaVA tests still pass.
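The F2P/P2P contract above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the real pipeline runs pytest inside Docker, while here the test runner is injected as a plain function so the logic is easy to follow.

```python
def validate_task(f2p_tests, p2p_tests, runner):
    """Check the fail-to-pass / pass-to-pass contract on the
    'undeveloped' codebase: every F2P test must fail (the target
    feature is truly missing) and every P2P test must pass (no
    neighboring feature broke).

    `runner(test)` returns True if that test passes. In the real
    pipeline this would invoke pytest in a container; here it is
    injected so the check is self-contained.
    """
    f2p_all_fail = all(not runner(t) for t in f2p_tests)
    p2p_all_pass = all(runner(t) for t in p2p_tests)
    return f2p_all_fail and p2p_all_pass

# Hypothetical test outcomes on a repo with GPT-2 support removed:
results = {
    "test_modeling_gpt2.py": False,  # target feature missing -> fails
    "test_modeling_bert.py": True,   # neighbors still work
    "test_modeling_llava.py": True,
}
ok = validate_task(
    ["test_modeling_gpt2.py"],
    ["test_modeling_bert.py", "test_modeling_llava.py"],
    runner=results.__getitem__,
)
print(ok)  # True: the task is well-formed
```

If the F2P test already passed before any patch, or a P2P guard failed, the task would be rejected, which is exactly the post-verification role described later.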

Real Stakes: Why should anyone care? Because modern teams want AI collaborators that help ship features, not just fix lint errors. In daily life, this can mean faster delivery of safer apps, smoother updates to libraries you depend on (like pandas or transformers), and trustworthy automation that doesn’t secretly break your favorite tool. It also matters for research and training: a benchmark with 3,825 runnable environments and 200 carefully verified tasks is a goldmine to teach and test robust coding behavior over time.

🍞 Hook: Think of a factory that can keep making new, fair tests as products evolve. 🥬 The Concept (Continually Updatable Benchmarking): The dataset can grow with new repositories and commits. How it works: 1) Configure install briefly, 2) Auto-discover tests, 3) Trace dependencies, 4) Generate tasks and environments, 5) Repeat as repos evolve. Why it matters: Without updates, models might memorize answers and look smart without truly improving. 🍞 Anchor: New tasks from 2025 commits get added so agents trained in 2024 face fresh challenges.

02Core Idea

🍞 Hook: Picture a treasure hunt map that not only shows where the treasure is, but also which bridges you must not break while you dig.

🥬 The Concept (The Aha!): Use tests and runtime tracing to carve out one missing feature—with crystal-clear interfaces—while protecting everything else, so we can fairly and automatically measure if agents can truly build features end to end. How it works: 1) Select F2P and P2P tests, 2) Trace which code objects they call, 3) Build a dependency graph, 4) Extract only the target feature’s code, 5) Verify the rest still works, 6) Generate a precise, callable interface for the agent to implement. Why it matters: This removes ambiguity, enables execution-based scoring, and scales to many repos. 🍞 Anchor: The agent is told “Implement class GPT2Model at path X with this exact signature,” then must pass GPT-2 tests while BERT tests keep passing.

Multiple Analogies:

  • Lego analogy: We remove only the blue tower (target feature) from a city without knocking over the red houses (other features), then ask the agent to rebuild the blue tower exactly to spec.
  • Cooking analogy: We give the chef the dish name, required ingredients, and plating rules for a new menu item, while ensuring all the other dishes in the kitchen still come out right.
  • School analogy: The exam highlights the exact questions to answer and how to format them, and you must solve them without messing up other answers on the page.

🍞 Hook: You know how following a well-written recipe makes cooking smoother and safer? 🥬 The Concept (Interface-Driven Tasks): Each task includes precise function names, input and output shapes, and import paths. How it works: 1) Extract signatures from code or generate via LLM when missing, 2) Include them in the prompt, 3) Require a directly callable module. Why it matters: Without strict interfaces, many correct-but-incompatible solutions would crash tests. 🍞 Anchor: The spec says forward(self, input_ids, ...) must return logits shaped (batch, seq, classes), so the grader can call it reliably.
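A simplified taste of what "directly callable" means in practice: the grader can only invoke the agent's code if the names and parameters line up. The check below uses Python's `inspect` module and is an assumption-level sketch; the benchmark's real validation also covers import paths, shapes, and annotations.

```python
import inspect

def matches_interface(obj, required_params):
    """Return True if callable `obj` exposes at least the required
    parameter names, in order. A simplified stand-in for the
    benchmark's stricter interface checks."""
    params = list(inspect.signature(obj).parameters)
    return params[: len(required_params)] == required_params

# Hypothetical agent-written class for a GPT-2-style task:
class GPT2Model:
    def forward(self, input_ids, attention_mask=None):
        ...

print(matches_interface(GPT2Model.forward, ["self", "input_ids"]))  # True
```

An implementation that renamed `input_ids` to `tokens` might be "correct" in spirit yet would crash every test call, which is why the spec pins these names down.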

Before vs After:

  • Before: Benchmarks often centered on bug fixes, vague requirements, or non-executable checks; PR-based mining missed multi-commit features.
  • After: FeatureBench targets full features, runs code in Docker, uses F2P/P2P to guard integrity, and scales automatically from unit tests and tracing.

Why It Works (intuition, no math):

  • Clear interfaces collapse ambiguity so many different “styles” of correct code still plug into the same test harness.
  • Dynamic tracing plus a dependency graph finds the real boundary of the feature—what to remove and what must stay.
  • P2P tests act like guardrails, ensuring extraction never breaks neighbors.
  • Post-verification locks in correctness: before patch the target fails and guards pass, after patch everything passes.

🍞 Hook: Think of a build-a-feature kit with labeled bags and a checklist to avoid mistakes. 🥬 The Concept (Building Blocks of FeatureBench):

  • Test discovery and selection
  • Runtime tracing and dependency graph construction
  • LLM-aided top-object classification
  • Code extraction via BFS over the graph
  • Post-verification of F2P fail and P2P pass
  • Prompt generation with precise interfaces
  • Two difficulty modes: extend repo (L1) vs from-scratch (L2)
  • Execution-based metrics: Resolved, Passed, Token IO

How it works: Each block reduces uncertainty and enforces realism. Why it matters: Together they produce challenging, fair, and scalable tasks. 🍞 Anchor: From transformers to pandas, the same pipeline makes new, verifiable feature tasks with a few minutes of setup per repo.

🍞 Hook: Imagine a playground where the rules are clear, the equipment is sturdy, and new games are added over time. 🥬 The Concept (Why This Benchmark Changes the Game): It widens the target from bug fixing to shipping features, aligns scoring with reality by running tests, and scales with minimal human effort. How it works: strict interfaces + tracing + F2P/P2P + Dockerized execution. Why it matters: When top agents drop from over 70% on SWE-bench to near 11% here, we learn where today’s systems really break—and how to improve them. 🍞 Anchor: An agent that aced small fixes now struggles to wire a multi-file model correctly, surfacing NameErrors and missing imports that real engineers must handle daily.

03Methodology

At a high level: Repository + quick install config → Discover tests and run (F2P/P2P) → Dynamic trace to build a dependency graph → Classify top-level tested objects with an LLM → Graph traversal to decide what to remove vs keep → Extract code to make the feature truly missing → Post-verify F2P fails and P2P passes → Generate problem statement with clear interfaces → Package Docker image and run execution-based evaluation.

🍞 Hook: You know how a detective maps who-talked-to-who to solve a mystery? 🥬 The Concept (Dynamic Tracing and Dependency Graph): The system records which functions call which during tests, then builds a graph of those relationships. How it works: 1) Run F2P and P2P tests, 2) Trace function calls, 3) Create nodes with metadata (file, line, called-by, etc.). Why it matters: Without the graph, we’d guess feature boundaries and risk breaking other parts. 🍞 Anchor: Running GPT-2 tests shows exactly which attention modules, tokenizers, and layers get touched.
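A toy version of this runtime tracing can be built with Python's `sys.settrace`: run the test, record caller-to-callee edges, and out comes a small dependency graph. This is a sketch of the idea only; the paper's tracer also captures file paths, line spans, and other node metadata.

```python
import sys
from collections import defaultdict

def trace_calls(fn, *args):
    """Run `fn` and record which function calls which, producing a
    caller -> {callees} graph. A toy version of the dynamic tracing
    the pipeline uses to find feature boundaries."""
    graph = defaultdict(set)

    def tracer(frame, event, arg):
        if event == "call":
            callee = frame.f_code.co_name
            caller = frame.f_back.f_code.co_name if frame.f_back else "<root>"
            graph[caller].add(callee)
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return graph

# Toy 'feature': attention depends on softmax
def softmax(x): return x
def attention(x): return softmax(x)
def test_attention(): attention([1.0])

graph = trace_calls(test_attention)
print("attention" in graph["test_attention"])  # True
print("softmax" in graph["attention"])         # True
```

Running the F2P test this way reveals every object the target feature touches; running the P2P tests the same way reveals what must stay.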

Step-by-step details:

  1. Environment setup
  • What happens: A maintainer lists install commands once (about three minutes). Scripts build a Docker image to ensure reproducibility.
  • Why this step exists: Agents don’t waste time hunting dependencies; evaluations won’t flake due to system differences.
  • Example: For transformers, pip install -e . plus extras gets a stable sandbox.

🍞 Hook: Think of picking the right questions for a fair quiz. 🥬 The Concept (Selecting F2P and P2P Tests): F2P tests target the missing feature; P2P tests guard existing features. How it works: 1) Use pytest to collect runnable test files, 2) Pick one as F2P and several as P2P, 3) Confirm F2P fails in the undeveloped codebase and P2P passes. Why it matters: Without this, agents could break other features to “pass.” 🍞 Anchor: Choose test_modeling_gpt2.py as F2P while test_modeling_bert.py is P2P.

  2. LLM-assisted top-object classification
  • What happens: An LLM reads the F2P test and marks which imported objects are the main targets versus helpers.
  • Why: This steers graph traversal so we remove the right code and keep utilities.
  • Example: Mark GPT2Model.forward as a target, but not test helper assert_* utilities.
  3. Graph traversal and node labeling (remained vs extracted)
  • What happens: Breadth-first search starts from target nodes. Any node touched during P2P runs is labeled remained; the rest along the F2P path are extracted.
  • Why: This ensures we only remove the feature-specific code while leaving shared foundations intact.
  • Example: Keep shared tensor utils used by BERT; remove GPT-2-specific blocks.
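The traversal step above can be sketched as a breadth-first search over a plain adjacency dict. The node names and graph shape are invented for illustration; the real pipeline operates on the traced graph with richer metadata.

```python
from collections import deque

def label_nodes(graph, targets, p2p_touched):
    """BFS from the target objects: any node also touched during P2P
    (guard) test runs stays ('remained'); the rest of the F2P path is
    removed ('extracted'). A sketch of the labeling rule, not the
    toolkit's actual implementation."""
    extracted, remained = set(), set()
    queue, seen = deque(targets), set()
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in p2p_touched:
            remained.add(node)   # shared foundation: keep it
        else:
            extracted.add(node)  # feature-specific: remove it
            queue.extend(graph.get(node, ()))
    return extracted, remained

# Hypothetical traced graph for a GPT-2 task:
graph = {
    "GPT2Model": ["GPT2Attention", "tensor_utils"],
    "GPT2Attention": ["tensor_utils"],
}
extracted, remained = label_nodes(
    graph, targets=["GPT2Model"],
    p2p_touched={"tensor_utils"},  # BERT tests also use it
)
print(sorted(extracted))  # ['GPT2Attention', 'GPT2Model']
print(sorted(remained))   # ['tensor_utils']
```

Note the search stops descending once it hits a shared node: anything P2P tests touch is treated as foundation, so its dependencies are never candidates for removal.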

🍞 Hook: It’s like carefully removing a single organ while keeping the patient healthy. 🥬 The Concept (Code Patch Extraction): We physically delete the extracted nodes’ code from the repo. How it works: 1) Identify source spans for extracted nodes, 2) Remove or stub them, 3) Save the complementary patch. Why it matters: If the feature isn’t truly missing, the task isn’t real. 🍞 Anchor: The altered repo fails GPT-2 tests but still passes BERT’s.

  4. Post-verification
  • What happens: Validate the undeveloped codebase fails all F2P tests, passes all P2P tests, and still exposes needed utilities.
  • Why: Catches bad extractions that accidentally broke unrelated code.
  • Example: If P2P now fails, traversal was too aggressive and must be adjusted.
  5. Problem statement generation
  • What happens: The system crafts a prompt with: high-level description, exact import path, function/class signature, shapes, and annotations. If docstrings are missing, an LLM writes them from code context.
  • Why: Clear, callable interfaces eliminate ambiguity and allow automated grading.
  • Example: from transformers import GPT2Model; forward(self, input_ids, ...) → logits shape (B, T, C).
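To make the ingredients of such a prompt concrete, here is a minimal template assembler. The field names are illustrative assumptions, not the toolkit's actual schema; the point is that every piece the grader needs to call the code (import path, signature, return shape) is stated explicitly.

```python
def build_problem_statement(description, import_path, signature, returns):
    """Assemble a task prompt with a precise, callable interface,
    in the spirit of the benchmark's generated statements."""
    return (
        "Implement the following feature.\n\n"
        f"Description: {description}\n"
        f"Import path: {import_path}\n"
        f"Signature: {signature}\n"
        f"Returns: {returns}\n"
    )

prompt = build_problem_statement(
    description="Add GPT-2 support to the library.",
    import_path="from transformers import GPT2Model",
    signature="forward(self, input_ids, attention_mask=None)",
    returns="logits of shape (batch, seq, classes)",
)
print(prompt.startswith("Implement"))  # True
```

When the original code lacks a docstring for the description field, the pipeline has an LLM write one from surrounding context, as noted above.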

🍞 Hook: Two ways to learn piano—add a melody to a song (guided) or compose from silence (harder). 🥬 The Concept (Two Difficulty Modes): L1 extends the existing repo; L2 builds from scratch with only the interface given. How it works: 1) For L1, remove feature code along traced path; for L2, remove the whole repo and test against the agent’s new library. Why it matters: L2 tests deep reasoning without context; L1 tests integration with surrounding code. 🍞 Anchor: Implement GPT2Model within transformers (L1) vs as a standalone pip-installable module (L2).

  6. Evaluation and metrics
  • What happens: Apply the agent’s patch, run pytest, parse results.
  • Why: Execution-based scoring is objective and reproducible.
  • Example metrics:
    • Resolved Rate: did all F2P and P2P pass?
    • Passed Rate: fraction of F2P test points passed.
    • Token IO: average input/output tokens consumed.
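The two accuracy metrics reduce to simple arithmetic over per-test booleans. A minimal sketch, assuming the harness has already parsed pytest results into True/False lists (the real harness parses full pytest reports, and Token IO is tallied from agent logs separately):

```python
def score_task(f2p_results, p2p_results):
    """Compute Resolved (all F2P and P2P green) and Passed (fraction
    of F2P test points passed) from per-test booleans."""
    resolved = all(f2p_results) and all(p2p_results)
    passed = sum(f2p_results) / len(f2p_results) if f2p_results else 0.0
    return {"resolved": resolved, "passed": passed}

print(score_task([True, False, True, False], [True, True]))
# {'resolved': False, 'passed': 0.5}
```

This is why Passed can sit well above Resolved in the results: an agent that gets half the F2P test points right still scores zero on Resolved if a single test stays red.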

🍞 Hook: Imagine an exam that forbids peeking at the answer key. 🥬 The Concept (Anti-cheating Protections): The system blacklists URLs and scans logs for suspicious file reads of installed packages. How it works: 1) Defensive prompts, 2) Regex checks for reading site-packages, 3) Flag violations. Why it matters: Prevents trivial solutions that don’t reflect real implementation ability. 🍞 Anchor: The agent can’t just pip install transformers and copy code.
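The log-scanning idea can be sketched with a single regex. The pattern below is illustrative only (hardcoded to one package name for clarity); the paper's actual checks scan agent trajectories more broadly and combine them with URL blacklists and defensive prompts.

```python
import re

# Flag log lines where the agent reads ground-truth sources out of an
# installed copy of the package, rather than implementing the feature.
SUSPICIOUS = re.compile(r"site-packages[/\\]transformers[/\\].*\.py")

def flag_violations(log_lines):
    """Return log lines that look like reads of installed-package
    source files. Reading the task repo itself is fine."""
    return [line for line in log_lines if SUSPICIOUS.search(line)]

log = [
    "read src/transformers/models/gpt2/modeling_gpt2.py",  # repo file: fine
    "read /usr/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py",
]
print(len(flag_violations(log)))  # 1: only the site-packages read
```

Flagged runs are then reviewed rather than auto-failed, since a path match alone is only evidence, not proof, of copying.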

Secret Sauce:

  • The combination of P2P guarding plus dynamic tracing yields surgical feature removal.
  • Interface-driven prompts transform the task into “write a directly callable feature,” closing gaps between implementation and tests.
  • Automated post-verification ensures both realism and fairness at scale.

🍞 Hook: Like a factory line with quality checks at every station. 🥬 The Concept (End-to-End Automation): After a brief install config, everything else is scripted—collection, tracing, extraction, verification, prompt generation, and packaging. How it works: 1) Scripts drive Docker, pytest, tracing, and LLM calls, 2) Outputs ready-to-run instances. Why it matters: This benchmark can keep growing with minimal human effort. 🍞 Anchor: 24 repositories produced 200 verified tasks and 3,825 runnable environments.

04Experiments & Results

The Test: FeatureBench evaluates whether agents can implement feature-level code that passes both F2P and P2P tests inside reproducible Docker environments. The score is execution-based. Key metrics are: Resolved Rate (all tests pass), Passed Rate (F2P fraction passed), and Token IO (efficiency). Two settings exist: a Full set of 200 tasks and a Lite set of 30 tasks, each containing Level 1 (extend repo) and Level 2 (from scratch) difficulties.

🍞 Hook: Think of a track meet where each runner competes on the same course, with judges timing the race precisely. 🥬 The Concept (Resolved, Passed, Token IO): These are the scoreboard numbers. How it works: 1) Resolved needs all tests green, 2) Passed shows partial progress on F2P, 3) Token IO measures tokens spent. Why it matters: Together they reveal accuracy, partial competence, and efficiency. 🍞 Anchor: An agent scoring 11% Resolved with 45% Passed is like finishing a lot of laps but not crossing the final line often.

The Competition: Multiple strong agent setups were tested, including OpenHands with Claude Opus 4.5, DeepSeek-V3.2, Qwen3-Coder-480B-A35B, Gemini-3-Pro-Preview, and a Codex scaffold with GPT-5.1-Codex. Internet access was allowed; up to 500 steps per task were used in baselines. Anti-cheating checks ensured agents couldn’t simply peek at ground-truth packages.

Scoreboard with Context:

  • Headline: Even top systems that excel on SWE-bench struggled here. For example, Claude Opus 4.5 previously achieved about 74.4% resolved on SWE-bench verified, but on a comparable FeatureBench subset it managed roughly 5.2% resolved. On the full FeatureBench, leading agents typically landed near 11–12.5% resolved.
  • Passed rates were notably higher, though still generally below 50%, showing agents make partial headway but stumble on full integration.
  • Token IO was huge (often near or above a million input tokens per task), signaling today’s agents are computationally heavy yet still underperform on complex feature building.
  • L2 (from scratch) is notably harder than L1 (extend repo), confirming that missing context raises reasoning demands.

Surprising Findings:

  • Interfaces matter a lot: Removing explicit function signatures and call paths from prompts caused large drops in both Resolved and Passed rates. Clear, callable interfaces are a key unlock.
  • Visible tests help: Giving agents access to unit tests greatly boosted performance, suggesting better test generation and exposure could enable stronger planning and debugging.
  • More steps, up to a point: Increasing max steps from 50 to 100 improved results, but going from 100 to 500 yielded only modest additional gains.
  • Failure modes reveal real weaknesses: NameError and AttributeError were common, indicating agents guessed cross-file names instead of consistently reading and wiring dependencies—a sign of shallow repository reasoning.
  • Complexity, not time, drives difficulty: Performance correlated negatively with lines of code to implement, but showed little dependence on task commit date. That implies task hardness, not recency, is the main barrier, though future leakage must still be monitored.

🍞 Hook: If you try to build a robot hand from scratch, it’s much harder without a sample part to study. 🥬 The Concept (L1 vs L2 Difficulty): L1 shows surrounding code; L2 hides it and demands full reconstruction from the interface. How it works: 1) L1: complete missing pieces in-place, 2) L2: build the feature as a standalone library. Why it matters: L2 exposes planning and organization gaps. 🍞 Anchor: Filling in GPT-2 within transformers (L1) is easier than re-creating it from nothing (L2).

Takeaway: FeatureBench exposes the limits of current agents on large, multi-file, interface-bound tasks. The low resolved rates (about 11–12.5%) versus high success on prior bug-fix benchmarks show we’ve hit a new frontier: building features with guardrails is much tougher than fixing isolated issues.

05Discussion & Limitations

Limitations:

  • Language and ecosystem scope: Current pipeline targets Python repositories with runnable unit tests; other languages and weaker-test repos are not yet covered.
  • LLM-aided classification is strong but imperfect: Identifying top-level tested objects achieved good precision and recall but can still misclassify utilities, requiring safeguards like P2P selection sanity checks.
  • Execution cost: Large token usage, long run times, and heavy Dockerized testing make large-scale evaluations computationally expensive.
  • Visibility trade-offs: Exposing unit tests can improve performance but is not always realistic; hiding them makes tasks much tougher and more representative of spec-only development.
  • Data leakage risk: Although the toolkit can add fresh tasks over time, widespread training on popular repos means ongoing vigilance is needed.

Required Resources:

  • Minimal human setup per repo (roughly three minutes) to record install commands.
  • Container runtime and storage for Docker images.
  • Sufficient compute for tracing, BFS extraction, and repeated pytest runs.
  • Agent-side budgets: long-context windows, tool use, and hundreds of steps per instance.

When Not to Use:

  • If you only need micro-bug benchmarks or single-file edits; FeatureBench is overkill.
  • Repos without stable unit tests or that depend on complex external services with flaky I/O.
  • When you cannot run Dockerized execution for policy or infrastructure reasons.

Open Questions:

  • How can agents develop robust repository-wide reading behaviors to avoid name and attribute errors?
  • Can planning modules and memory improve multi-step, multi-file reasoning without exploding tokens?
  • What is the best way to expose tests or partial specs to guide agents while staying realistic?
  • How to generalize this pipeline to other languages and monorepos with complex build systems?
  • Can we train on these verifiable environments to measurably improve long-horizon coding performance?

🍞 Hook: Imagine giving a student practice problems that exactly match the final exam’s format. 🥬 The Concept (Training Utility of FeatureBench): The same verifiable environments can be used to train better code agents. How it works: 1) Provide executable goals, 2) Reward passing tests, 3) Iterate with curriculum from Lite to Full, L1 to L2. Why it matters: This could close the gap revealed by today’s low resolved rates. 🍞 Anchor: Agents learn to avoid NameErrors by repeatedly practicing multi-file integration on real repos.

06Conclusion & Future Work

Three-sentence summary: FeatureBench is a feature-oriented, execution-based benchmark that automatically constructs realistic coding tasks from real Python repositories by tracing tests and extracting the exact missing feature. It supplies precise, callable interfaces and guards other features with pass-to-pass tests, producing fair, scalable, and verifiable evaluations. Experiments show state-of-the-art agents that perform well on bug-fix benchmarks struggle here, solving only about 11–12.5% of tasks.

Main achievement: Turning feature development into an automated, interface-driven, execution-verified benchmark—complete with F2P/P2P checks, dynamic dependency tracing, and post-verification—across 200 tasks and 3,825 environments from 24 repositories.

Future directions: Expand beyond Python, integrate smarter test exposure strategies, reduce token and runtime costs, and use these environments to train agents that plan, read repositories reliably, and wire multi-file features without brittle guesses. Continue automatic updates to keep tasks fresh and minimize leakage as models evolve.

Why remember this: FeatureBench raises the bar from fixing to building—measuring whether AI can truly ship real features without breaking the rest. It is a practical yardstick and a rich training ground, pointing directly at what needs to improve for AI to become a dependable teammate in real-world software engineering.

Practical Applications

  • Evaluate and compare AI coding agents on realistic feature-building tasks before adopting them in your team.
  • Use the Lite set for quick benchmarking of new agent scaffolds, then graduate to the Full set for deeper assessment.
  • Train agents with the executable environments to improve multi-file reasoning and reduce NameError or AttributeError failures.
  • Design curricula that start with L1 (extend repo) tasks and progress to L2 (from scratch) for stronger, generalizable skills.
  • Stress-test interface-following by removing or varying signature hints and measuring the performance drop.
  • Benchmark token efficiency (Token IO) to select cost-effective agents for long-horizon development.
  • Regression-proof new features in your own repos by adopting the F2P/P2P principle in your CI pipelines.
  • Analyze failure logs to build targeted tools (e.g., cross-file symbol search) that address common agent pitfalls.
  • Continuously add fresh tasks from updated commits to monitor model improvements and prevent memorization.
  • Compare agent behaviors when unit tests are visible versus hidden to pick realistic deployment settings.
#FeatureBench #agentic coding #execution-based evaluation #feature development #fail-to-pass tests #pass-to-pass tests #dependency graph #dynamic tracing #benchmarking code agents #interface specification #Docker environments #unit tests #SWE-bench comparison #code patch extraction #test-driven toolkit