NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Key Summary
- NL2Repo-Bench is a new benchmark that tests whether coding agents can build a whole Python library from just one long natural-language document and an empty folder.
- Unlike past tests that focus on short functions or bug fixes, this benchmark checks long-horizon skills like planning, architecture, and cross-file consistency.
- Agents are judged only by running the real project's upstream pytest suite in a controlled Docker environment: no opinions, just pass or fail.
- Across 104 real-world tasks, even the best agents averaged under 40.5% test pass rate and rarely finished a full repository correctly.
- Common failures include stopping too early, losing the big-picture plan, breaking imports across files, and weak dependency handling.
- Long context windows help a lot (1M-token models like Claude-4.5 lead), but context size alone isn't enough without strong planning and persistence.
- Revealing all tests during development boosts scores but still doesn't break 60%, showing the task is truly hard even with guidance.
- Tool-use analysis shows that explicit planning (task tracking) correlates strongly with better performance, while blind editing and endless navigation hurt.
- NL2Repo-Bench gives researchers a rigorous way to measure and improve the long-horizon reasoning that real software development needs.
Why This Research Matters
Real software is more than single functions: it's many files that must fit and run together. NL2Repo-Bench measures whether AI can handle that full journey, from reading a long spec to delivering a package that passes real tests. This matters for companies hoping to speed up development, reduce bugs, and automate routine engineering tasks safely. It helps researchers pinpoint where agents stumble (planning, dependency setup, cross-file consistency) so they can design better training and tools. It also gives a fair way to compare agents, since grading is based on actual execution, not opinions. Over time, progress here could enable trustworthy AI partners that can build, test, and ship real software.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how building a treehouse takes more than hammering one nail: you need a plan, a blueprint, supplies, and careful steps so every board fits together.
Filling (The Actual Concept)
- What it is: This paper introduces a new way to test whether AI coding agents can build whole software projects, not just tiny pieces, from a single long instruction document.
- How it works (story of the world before → problem → failed attempts → gap → stakes):
- The world before: AI models got good at helping with short tasks like writing one function or fixing a small bug. Benchmarks like HumanEval and MBPP measured those small wins well. Other tests like SWE-bench asked agents to fix issues in an existing repository, which still gives a lot of structure and clues.
- The problem: Real software isn't just one function. It's a city of files, folders, imports, tests, and packaging. To build it from scratch, an agent must plan far ahead, keep the big picture in mind, and make all parts work together over time. Older benchmarks didn't measure this long-horizon ability.
- Failed attempts: Some benchmarks tried whole-repo challenges but often included scaffolds (pre-made folders and function signatures) or used LLMs to judge correctness. That makes things easier or fuzzier than real life and can hide whether the code truly works.
- The gap: We needed a from-scratch challenge using only natural-language requirements, no scaffolding, and judged only by running the original project's real tests. That's the missing piece.
- Real stakes: If we want AI to be a reliable teammate or someday build apps end-to-end, it has to handle the full journey: plan, create, test, package, and ship. Long-horizon reasoning isn't optional; it's the backbone of real software engineering.
- Why it matters: Without a strict, real-world test, we could think agents are ready for prime time when they're not. That risks broken software, wasted time, and missed opportunities.
Bottom Bread (Anchor): Imagine giving a student one long assignment: "Build a calculator library with parsing, evaluation, and flexible comparison." No starter code, no hints. You grade by running the real unit tests. That's exactly what NL2Repo-Bench does for coding agents.
New Concept 1. Hook: Imagine a robot chef that can read a recipe and cook by itself. The Concept: Autonomous Coding Agents
- What it is: Programs that can plan, write, run, and fix code without humans doing each step.
- How it works:
- Read the goal written in plain English.
- Plan tasks (files to make, functions to write, tests to run).
- Write code and organize folders.
- Run tests and fix mistakes.
- Why it matters: Without autonomy, the agent keeps asking for help and never finishes big projects. Anchor: A coding agent gets "Build a math parser." It designs modules, writes a parser, runs tests, and fixes bugs on its own.
New Concept 2. Hook: Planning a cross-country road trip requires mapping many stops ahead. The Concept: Long-Horizon Reasoning
- What it is: Thinking and executing over many steps while staying on plan.
- How it works:
- Set the big goal (finish a working library).
- Break it into phases (design → implement → test → package).
- Track progress and adjust when surprises happen.
- Keep earlier decisions in mind while coding later parts.
- Why it matters: Without it, agents lose track, repeat mistakes, or stop early. Anchor: While building a parser, the agent remembers earlier design choices so the grader module imports the right symbols later.
New Concept 3. Hook: Architects draw blueprints before builders lift a hammer. The Concept: Architectural Design
- What it is: Planning the projectās file layout, modules, and how they connect.
- How it works:
- Choose package names and folders.
- Decide core modules (parser.py, grader.py).
- Define clear interfaces between them.
- Plan public APIs exposed in __init__.py.
- Why it matters: Without architecture, code grows messy and parts don't fit. Anchor: The math_verify project splits parsing and grading so tests can import functions cleanly. A layout sketch follows.
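To make this concrete, here is a minimal sketch of how an agent might lay down such an architecture at the start of a run. The parser/grader split and the parse/verify API come from the paper's math_verify running example, but the exact file names, layout, and stub signatures are illustrative assumptions, not the real library.

```python
"""Sketch: scaffolding a hypothetical layout for the math_verify running example.

The parser/grader split and the parse/verify API follow the paper's example;
the concrete file names and stub signatures are assumptions for illustration.
"""
from pathlib import Path

# Planned architecture: package name, core modules, and the public API surface.
LAYOUT = {
    "math_verify/__init__.py": (
        "from .parser import parse\n"
        "from .grader import verify\n"
        '__all__ = ["parse", "verify"]\n'   # the names upstream tests will import
    ),
    "math_verify/parser.py": "def parse(expr: str):\n    raise NotImplementedError\n",
    "math_verify/grader.py": "def verify(gold, answer, strict=True):\n    raise NotImplementedError\n",
}

def scaffold(root: Path) -> None:
    """Write the planned folders and stub files into an empty workspace."""
    for rel_path, content in LAYOUT.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

if __name__ == "__main__":
    scaffold(Path("workspace"))
    print(*sorted(str(p) for p in Path("workspace").rglob("*.py")), sep="\n")
```

The point is not the stub code itself but that the folder names, module boundaries, and re-exports are decided up front, so every later edit has a contract to respect.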
New Concept 4. Hook: Before you bake, you check you have flour, eggs, and sugar. The Concept: Dependency Management
- What it is: Making sure your project and its libraries are installed and linked correctly.
- How it works:
- List what you need (e.g., antlr4, sympy-like tools).
- Pin versions in pyproject.toml.
- Import modules carefully across files.
- Ensure __init__.py exposes the right names.
- Why it matters: Missing or mismatched dependencies break imports and tests. Anchor: If latex2sympy2_extended is missing, parsing LaTeX expressions will crash during tests. A dependency-check sketch follows.
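A minimal sketch of the kind of pre-flight check an agent could run before executing any tests. The distribution names come from the paper's math_verify example (antlr4-python3-runtime, latex2sympy2_extended); the mapping to import names and the helper itself are illustrative assumptions, not part of the benchmark.

```python
"""Sketch: a pre-flight dependency check before running any tests.

The distribution names come from the paper's math_verify example; the mapping
to import names and this helper are illustrative assumptions.
"""
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

# distribution name on PyPI -> module name the code actually imports (assumed)
REQUIRED = {
    "antlr4-python3-runtime": "antlr4",
    "latex2sympy2_extended": "latex2sympy2_extended",
    "sympy": "sympy",
}

def check_dependencies() -> list[str]:
    """Return human-readable problems; an empty list means imports should work."""
    problems = []
    for dist, module in REQUIRED.items():
        try:
            installed = version(dist)              # is it installed, and which version?
        except PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        try:
            import_module(module)                  # the import the test suite will perform
        except ImportError as exc:
            problems.append(f"{dist}=={installed} installed but `import {module}` fails: {exc}")
    return problems

if __name__ == "__main__":
    issues = check_dependencies()
    print("all dependencies importable" if not issues else "\n".join(issues))
```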
New Concept 5. Hook: Building a Lego city means making buildings that connect by roads. The Concept: Multi-Module Logic Implementation
- What it is: Writing several modules that cooperate to provide the full feature set.
- How it works:
- Implement each moduleās job.
- Connect them via clean function calls.
- Share common types and constants.
- Keep behavior consistent across files.
- Why it matters: If pieces don't cooperate, the whole project fails. Anchor: parser.py converts text to SymPy objects; grader.py checks equivalence, and they must agree on types.
New Concept 6. Hook: All puzzle pieces must fit, even after you move them. The Concept: Cross-File Consistency
- What it is: Making sure names, types, and expectations match across files.
- How it works:
- Keep the same function signatures everywhere.
- Use shared utilities for common rules.
- Update imports when files change.
- Test the whole package, not just one file.
- Why it matters: A tiny mismatch (like a renamed function) can break everything. Anchor: If verify() expects strict=False but grader.py forgot that argument, tests fail. A contract-check sketch follows.
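One way to catch that kind of mismatch early is a tiny contract check that inspects the public signatures before the full suite runs. The module and argument names below (math_verify, parse, verify, strict) are taken from the running example; the check itself is a hypothetical illustration, not part of the benchmark or the library.

```python
"""Sketch: a cross-file contract smoke check for the math_verify running example.

It fails fast if grader.verify drops a keyword (here `strict`) that other files
or the upstream tests rely on. Module and function names follow the paper's
example; the check itself is a hypothetical illustration.
"""
import inspect

def check_contract() -> None:
    from math_verify import parse, verify        # public API re-exported by __init__.py

    # verify() must keep the keyword the tests pass (strict=False in lenient mode).
    params = inspect.signature(verify).parameters
    assert "strict" in params, "grader.verify lost the `strict` keyword argument"

    # parse() output must be something verify() can consume without blowing up.
    assert verify(parse("sin(x) + x"), parse("sin(y) + y"), strict=False) in (True, False)

if __name__ == "__main__":
    check_contract()
    print("cross-file contract holds")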
New Concept 7. Hook: To check if a toy car works, you roll it; you don't just look at it. The Concept: Execution-Based Evaluation
- What it is: Judging by running the real tests, not by looks or guesses.
- How it works:
- Build the generated repo inside Docker.
- pip install it.
- Run the upstream pytest suite.
- Count passes/fails as the score.
- Why it matters: It measures actual behavior, not opinions. Anchor: If parse("sin(x)+x") and parse("sin(y)+y") compare equal in lenient mode, the test passes. A scoring-loop sketch follows.
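Here is a minimal sketch of what that grading loop could look like from inside the container: install the generated repo, run the upstream pytest suite, and count passes. The steps mirror the pipeline described above, but the actual NL2Repo-Bench harness is not shown in this summary, so the paths and flags are assumptions.

```python
"""Sketch: execution-based scoring inside the task container (illustrative).

Installs the agent-generated repository, runs the upstream pytest suite, and
derives a pass rate from pytest's JUnit XML report. The real benchmark harness
may differ; paths and flags here are assumptions.
"""
import subprocess
import sys
import xml.etree.ElementTree as ET

def score_repo(repo_dir: str, tests_dir: str) -> float:
    # 1) Install the generated package into the (containerized) environment.
    subprocess.run([sys.executable, "-m", "pip", "install", "-e", repo_dir], check=True)

    # 2) Run the upstream tests; failing tests must not abort scoring.
    subprocess.run(
        [sys.executable, "-m", "pytest", tests_dir, "-q", "--junitxml=report.xml"],
        check=False,
    )

    # 3) Pass rate = passed / collected, read from the XML summary attributes.
    root = ET.parse("report.xml").getroot()
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    tests = int(suite.get("tests", 0))
    not_passed = (int(suite.get("failures", 0)) + int(suite.get("errors", 0))
                  + int(suite.get("skipped", 0)))
    return (tests - not_passed) / tests if tests else 0.0

if __name__ == "__main__":
    print(f"pass rate: {score_repo('generated_repo', 'upstream_tests'):.1%}")
```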
New Concept 8. Hook: A teacher gives you just the assignment sheet and a blank notebook. The Concept: NL2Repo-Bench
- What it is: A benchmark where an agent gets one natural-language document and must create a full, installable Python library that passes the original projectās tests.
- How it works:
- Pick real-world repos with solid tests.
- Reverse-engineer them into long, precise NL documents.
- Give agents only that doc and an empty folder.
- Judge by running the upstream tests in Docker.
- Why it matters: It fairly tests true end-to-end software building. Anchor: The math_verify example document describes parsing and grading rules; the agent must rebuild the library so all official tests pass.
02 Core Idea
Top Bread (Hook): Imagine being told, "Build a full board game from this rulebook, with no pieces and no board, and then we'll judge by actually playing it."
Filling (The Actual Concept)
- The "Aha!" in one sentence: Test coding agents on full, from-scratch repository generation using only a long natural-language spec, and grade them by the original project's real pytest suite inside a controlled environment.
- Multiple analogies (3 ways):
- Cooking analogy: You get a detailed recipe book (the NL document) and an empty kitchen (workspace). You must cook the whole meal (repository) so that expert tasters (pytest) approve every dish.
- School project analogy: You receive a long assignment sheet and must produce a working science fair project without any starter kit. The judges run your experiment to see if it really works.
- Lego city analogy: You're handed city-planning notes and must build the entire Lego city from scratch. Inspectors then drive through it to verify all roads, bridges, and services function.
- Before vs After: • Before: Benchmarks mostly checked short code snippets, bug fixes, or completion with scaffolding. Agents could lean on pre-existing structure. • After: NL2Repo-Bench removes the training wheels: no code, no scaffolding, one long spec, and strict pass/fail by upstream tests. Now we see if agents can sustain a coherent plan for the whole piece of software.
- Why it works (intuition, no equations): • Real behavior is what matters: running upstream tests gives a binary, reliable signal of whether the software behaves as intended. • Long specs + empty workspaces force global thinking: with no hints like function signatures, agents must design architecture, manage dependencies, and maintain cross-file consistency. • Controlled Docker environments reduce noise: if a test fails, it's likely the agent's implementation, not a random system glitch.
- Building blocks of the idea:
- Task sourcing: Choose real Python libraries with solid pytest suites, modern upkeep, and manageable sizes.
- Reverse-engineered NL documents: Translate the repository into a precise, exhaustive natural-language spec using AST-assisted coverage so nothing essential is missing.
- Environment standardization: Ship a per-task Docker image where the original project is guaranteed to pass, isolating environment issues from model ability.
- Non-functional constraint relaxation: Remove build gotchas (like missing README checks) so results reflect code quality, not packaging trivia.
- Strict execution-based scoring: Install the generated package and run the original tests; report pass rates and Pass@1.
- Analysis beyond scores: Trace tool usage, turns, early stops, and context window effects to expose true long-horizon weaknesses.
Bottom Bread (Anchor): In the math_verify example, the doc explains how to parse numbers (like $1,000.99 or European decimals) and compare expressions in strict vs lenient mode. The agent must design modules, implement parsing and grading, package the library, and then pass the same tests the real project uses, with no shortcuts.
03 Methodology
At a high level: Input (one long NL document) → [Plan architecture] → [Implement multi-module code + manage dependencies] → [Package + install] → [Run upstream pytest in Docker] → Output (pass rate and analysis).
Step-by-step (what happens, why it exists, example):
- Task selection (realistic, testable repos)
- What: Curate real Python libraries with 300–120,000 LOC, recency, and fully passing pytest suites.
- Why: Ensures tasks are meaningful, modern, and verifiable.
- Example: Choose a math expression verification library whose upstream tests already pass.
- Reverse-engineering into a natural-language spec
- What: Human annotators study the repo and write a complete NL document describing goals, APIs, modules, and behaviors.
- Why: The agent must rely only on words, like real requirements.
- Example: The spec says parse("sin(x)+x") equals parse("sin(y)+y") in lenient mode; lists functions parse() and verify() with arguments and behaviors.
- AST-assisted coverage for API completeness
- What: Use an AST scanner to extract all functions/classes/signatures; cross-check that the spec documents each tested API.
- Why: Missing a required API would make the task impossible.
- Example: The tool finds grader.verify and parser.parse; the document includes both with correct parameters. (A scanner sketch follows this step list.)
- Environment building with Docker
- What: Build a per-task Docker image where the original repo passes its own tests (pin dependencies as needed).
- Why: Isolates model performance from environment flakiness.
- Example: Pin antlr4-python3-runtime and latex2sympy2_extended versions so upstream tests always pass.
- Non-functional constraint relaxation
- What: Remove packaging gotchas unrelated to core logic (e.g., strict README checks) or pre-create harmless files.
- Why: Prevents unfair failures and keeps focus on functionality.
- Example: If setup.py insists on a LICENSE file, provide a minimal one in the image.
- Quality assurance and refinement
- What: Human experts review for fidelity; static coverage ensures completeness; pilot runs identify ambiguities.
- Why: Guarantees each task is solvable and fair.
- Example: If a test fails due to ambiguous lenient comparison rules, refine the doc to clarify.
- Agent run protocol
- What: The agent receives only the spec and an empty workspace. It can use tools (edit files, run tests, plan) without human help or turn limits.
- Why: Measures truly autonomous, long-horizon behavior.
- Example: In OpenHands, the agent plans with task_tracker, edits with str_replace_editor, and runs pytest via execute_bash.
- Execution-based evaluation
- What: Package the generated repo, install it, and run upstream pytest. Count passes and report pass rates and Pass@1.
- Why: Objective, binary behavior check.
- Example: Even if some tests fail to collect initially, all runnable tests are executed to avoid scoring zero unfairly.
- Analysis diagnostics (the secret sauce)
- What: Track tool usage patterns, number of turns, early termination vs non-finish, and context window impact.
- Why: Reveals why agents fail (overconfidence, navigation traps, blind editing) and what helps (planning loops, edit-test cycles).
- Example: High task_tracker usage correlates strongly with better scores; models with huge contexts maintain coherence longer.
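As promised in the AST-coverage step above, here is a minimal sketch of the kind of scanner that step describes: walk the repository with Python's ast module and list every public function and class so the NL document can be cross-checked against it. The authors' actual tooling is not reproduced here; this is a simplified reconstruction (only positional parameter names are shown).

```python
"""Sketch: AST-based API inventory for spec-coverage checking (illustrative).

Walks every .py file under a repository and lists public functions and classes,
so annotators can cross-check that the NL document mentions each tested API.
"""
import ast
from pathlib import Path

def public_api(repo_root: str) -> list[str]:
    entries = []
    for path in Path(repo_root).rglob("*.py"):
        module = path.relative_to(repo_root).with_suffix("").as_posix().replace("/", ".")
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not node.name.startswith("_"):
                args = ", ".join(a.arg for a in node.args.args)
                entries.append(f"{module}.{node.name}({args})")
            elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
                entries.append(f"{module}.{node.name}")
    return sorted(entries)

if __name__ == "__main__":
    # e.g. prints "grader.verify(gold, answer, strict)"-style lines for the running example
    for entry in public_api("math_verify"):
        print(entry)
```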
Secret sauce highlights:
- Single-document, no-scaffold start forces true architecture and planning.
- Strict upstream tests avoid subjective grading.
- Docker standardization makes apples-to-apples comparisons possible.
- Rich telemetry (tools, turns, stop patterns) exposes long-horizon failure modes.
Extra Concept. Hook: It's hard to follow a very long story if you can't keep many pages in mind. The Concept: Context Window
- What it is: How much text the model can remember at once.
- How it works:
- Read the long spec and conversation history.
- Keep earlier decisions in memory.
- Use that memory to keep code consistent.
- Longer windows reduce forgetfulness.
- Why it matters: Small windows make models forget past choices and break cross-file logic. Anchor: With a 1M-token window, the agent can recall the API contract while editing distant files.
Extra Concept. Hook: Stopping a race before the finish line won't win any medals. The Concept: Premature Termination vs Non-Finish
- What it is: Two ways agents fail to complete: stopping too early with confidence, or never deciding to finish.
- How it works:
- Early stop: Agent says "done" before the repo is ready.
- Non-finish: Agent waits for user input or times out.
- Both lead to incomplete repos and low scores.
- Planning + persistence reduce both.
- Why it matters: These are top causes of failure on long tasks. Anchor: One model declared success after ~70 steps; tests later showed missing imports and failing cases. A finish-gate sketch follows.
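One simple mitigation, echoed in the Practical Applications at the end of this piece, is a "finish gate": the agent may only declare completion after the package installs and its own quick smoke tests pass. The sketch below is a hypothetical illustration of that pattern, not a component of NL2Repo-Bench or any specific agent framework.

```python
"""Sketch: a 'finish gate' that blocks premature completion (illustrative).

Before the agent reports "done", the generated repo must install cleanly and
pass the agent's own quick smoke tests (the upstream suite stays hidden during
development). Hypothetical pattern, not a benchmark or framework component.
"""
import subprocess
import sys

def may_finish(repo_dir: str, smoke_tests_dir: str) -> bool:
    """Return True only if the package installs and the smoke tests pass."""
    install = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", repo_dir], capture_output=True
    )
    if install.returncode != 0:
        return False                      # cannot even install: definitely not done

    smoke = subprocess.run(
        [sys.executable, "-m", "pytest", smoke_tests_dir, "-q", "-x"], capture_output=True
    )
    return smoke.returncode == 0          # any failure keeps the agent iterating

if __name__ == "__main__":
    print("finish allowed" if may_finish("generated_repo", "smoke_tests") else "keep working")
```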
Extra Concept. Hook: A checklist makes big chores safer and smoother. The Concept: Agentic Planning (task_tracker)
- What it is: The agent's to-do list and progress tracker.
- How it works:
- Create tasks (design, implement modules, package, test).
- Update status as each task completes.
- Replan when failures appear.
- Keep the big picture visible.
- Why it matters: Without a plan, the agent wanders and forgets crucial steps. Anchor: Agents with more task_tracker calls scored higher on NL2Repo-Bench. A minimal tracker sketch follows.
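The real task_tracker is an OpenHands tool whose interface is not reproduced in this summary; the sketch below is a hypothetical, minimal stand-in that captures the behavior described above (create tasks, update status, replan, keep the big picture visible).

```python
"""Sketch: a minimal task tracker in the spirit of the planning tool discussed
above. The actual OpenHands task_tracker interface is not reproduced here."""
from dataclasses import dataclass, field

@dataclass
class Task:
    title: str
    status: str = "todo"        # todo | in_progress | done | blocked

@dataclass
class TaskTracker:
    tasks: list[Task] = field(default_factory=list)

    def add(self, title: str) -> None:
        self.tasks.append(Task(title))

    def update(self, title: str, status: str) -> None:
        for task in self.tasks:
            if task.title == title:
                task.status = status

    def overview(self) -> str:
        """The 'big picture' the agent re-reads before each major decision."""
        return "\n".join(f"[{t.status:11}] {t.title}" for t in self.tasks)

if __name__ == "__main__":
    plan = TaskTracker()
    for step in ["design module layout", "implement parser.py",
                 "implement grader.py", "package + __init__ exports", "run tests"]:
        plan.add(step)
    plan.update("design module layout", "done")
    plan.update("implement parser.py", "in_progress")
    print(plan.overview())
```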
04 Experiments & Results
The test: Can agents turn a single long NL document into a fully installable repo that passes the upstream pytest suite, across 104 real-world tasks (web, ML, utilities, databases, system tools, and more)?
What they measured and why:
- Pass rate (% of tests passed): A direct measure of functional correctness (like getting an exam score).
- Pass@1: How many full repositories pass all tests on the first try. (Both metrics are sketched right after this list.)
- By difficulty (Easy/Medium/Hard): Shows how performance scales with project size/complexity.
- By category: Reveals strengths/weaknesses across domains.
- Tool usage and turns: Uncovers behavior patterns (planning, early stop, testing loops).
- Context window impact: Tests whether memory size helps maintain long-range coherence.
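To pin down the two headline metrics, here is a small sketch of how they could be computed from per-task test results. The exact aggregation the paper uses (e.g., averaging per task vs. pooling all tests) is not spelled out in this summary, so the per-task averaging below is an assumption, and the numbers in the example are toy values, not paper results.

```python
"""Sketch: computing the two headline metrics from per-task results.

`results` maps task name -> (tests_passed, tests_total). Per-task averaging is
an assumption; the toy numbers below are not the paper's results.
"""

def pass_rate(results: dict[str, tuple[int, int]]) -> float:
    """Average per-task fraction of upstream tests passed."""
    rates = [passed / total for passed, total in results.values() if total]
    return sum(rates) / len(rates) if rates else 0.0

def pass_at_1(results: dict[str, tuple[int, int]]) -> int:
    """Number of repositories whose entire upstream suite passes in one run."""
    return sum(1 for passed, total in results.values() if total and passed == total)

if __name__ == "__main__":
    toy = {"math_verify": (42, 42), "http_client": (10, 80), "cli_tool": (0, 25)}
    print(f"average pass rate: {pass_rate(toy):.1%}")
    print(f"Pass@1 repos: {pass_at_1(toy)}")
```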
Competition lineup:
- Closed-source: Claude-Sonnet-4/4.5, Gemini-3-pro, GPT-5.
- Open-source: DeepSeek-V3.1/V3.2, Kimi-k2, Qwen3-235B (Instruct/Thinking), GLM-4.6.
- Frameworks: OpenHands (main), plus Cursor-CLI and Claude Code for comparison.
Scoreboard with context:
- Top scores: Claude-Sonnet-4.5 variants reached around 40% pass rate (e.g., 40.2% using Claude Code). That's like an A- when the class average is closer to a C- or D.
- Many models scored below 20% (a failing grade), showing this benchmark is tough.
- Pass@1: The best model fully passed only 5 out of 104 repositories in a single run; complete wins are rare.
- Difficulty effect: Performance drops steadily from Easy to Hard; larger, more complex repos hurt everyone.
- Category differences: System tools and database interaction tended to be stronger; ML and networking were harder. Infrastructure-style projects with clearer packaging seemed easier than algorithm-heavy or protocol-heavy ones.
Surprising findings:
- Thinking-model paradox: Qwen3-Thinking often stopped early (about half the tasks), likely due to "I reasoned it out" overconfidence without real execution.
- GPT-5 conservatism: Low early-stop but very high non-finish rate (over 80%); it waits for human approval, which is fatal in a fully autonomous setting.
- Planning correlates with success: task_tracker usage showed a strong positive correlation with score (~0.71), signaling that explicit planning helps long projects.
- Edit–test loop wins: High performers cycled between code edits and running tests. Lower performers got stuck in navigation (ls/cd/read) or blind editing streaks.
Context window impact:
- 1M-token models (Claude series, Gemini-3-pro) outperformed smaller-window peers. Big memory helps keep the plan and cross-file details consistent over 100–200+ turns.
- But context is not enough: Some long-context models still lagged; persistence and planning quality also matter.
Ablations:
- Turn limits: Raising the max turns from 50 to 200 boosts scores; beyond ~200, gains flatten. The barrier is reasoning/coherence, not just more tries.
- Revealing tests: When agents can see the full test suite during development, scores jump (e.g., Claude-4.5 from ~40% to ~59%). Yet even then, it doesn't break 60%, proving end-to-end repo creation remains hard.
Takeaway in plain words: Today's agents can build pieces, but making an entire town, with roads, power, and rules that all work together, still trips them up. The winners planned well, tested often, remembered more, and kept going until the end.
05 Discussion & Limitations
Limitations (be specific):
- Language and ecosystem scope: Focused on Python libraries with pytest; results may differ for other languages or frameworks.
- From-scratch only: It doesn't evaluate repair-in-place skills (a different but also important capability).
- Tooling dependence: Agent behavior partly reflects the available toolset (e.g., OpenHands actions) and its reliability.
- Compute and time: Long-horizon evaluations demand many interaction turns and test runs, which is resource-intensive.
- Hidden test bias: Even with AST coverage, subtle behaviors might still be harder to reproduce from NL text than from reading tests directly.
Required resources:
- Per-task Docker images; reliable compute for running upstream tests many times.
- Models with large context windows and sufficient token budgets.
- Agent frameworks that support planning, editing, execution, and logging for analysis.
When not to use NL2Repo-Bench:
- If you only need function-level code completion measurements (HumanEval-style tasks are faster and cheaper).
- When human-in-the-loop collaboration is the goal (this benchmark expects full autonomy during the build phase).
- If you cannot allocate enough compute/time to let agents iterate and run full test suites.
Open questions:
- How to reduce early-stop and non-finish behaviors without hand-holding? Can self-verification loops (plan → execute → audit → replan) help reliably?
- What's the best way to maintain cross-file contracts over hundreds of turns: memory tools, code maps, or learned architectural priors?
- Can curriculum-style training (function → module → package) improve long-horizon skills more than scaling alone?
- How should agents balance edit–test cycles vs. heavy up-front design to avoid both blind editing and navigation traps?
- Beyond bigger context windows, what memory and planning mechanisms are truly necessary for agent-scale software engineering?
06 Conclusion & Future Work
3-sentence summary: NL2Repo-Bench is a rigorous benchmark that asks coding agents to build complete Python libraries from a single long natural-language document and then judges them by running the original projects' upstream tests in a controlled environment. Across 104 real-world tasks, even top agents struggled (~40% pass rate), revealing deep weaknesses in long-horizon planning, cross-file consistency, and persistent execution. The analysis shows that explicit planning, frequent edit–test loops, and very large context windows help but are not sufficient to solve end-to-end repository generation today.
Main achievement: Establishing a strict, execution-based, from-scratch repository benchmark that isolates true long-horizon agentic competence and exposes where current systems fail.
Future directions:
- Architectures for durable planning and self-auditing across hundreds of steps.
- Memory tools beyond raw context size (code maps, contract trackers, spec-to-API checkers).
- Training curricula and synthetic tasks that grow from functions to full packages.
- Better environment and dependency reasoning, including robust packaging and import management.
Why remember this: NL2Repo-Bench changes the question from "Can AIs write snippets?" to "Can AIs build software?" It sets a clear, fair bar (pass the real tests from scratch) and gives researchers the compass they need to make agents that can truly ship code.
Practical Applications
- Use NL2Repo-Bench to compare coding agents fairly before adopting them in your CI pipeline.
- Train agents to maintain explicit to-do lists (task tracking) and measure the score gains.
- Adopt edit–test loops in agent workflows to reduce blind editing and catch regressions early.
- Provide agents with repository maps (API contracts, import graphs) to improve cross-file consistency.
- Tune packaging and dependency heuristics (pyproject.toml, __init__.py exports) to cut ImportError rates.
- Simulate long-horizon curricula: start agents on small repos and scale to harder NL2Repo tasks.
- Instrument agents to detect and prevent premature finishes (require passing smoke tests before finishing).
- Leverage larger context windows or memory tools to preserve architectural decisions over many turns.
- Apply non-functional constraint relaxation in your internal evals to focus on functionality over packaging trivia.
- Use ablation-style evaluations (turn limits, revealed tests) to diagnose whether your agent struggles with planning or implementation.