NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Key Summary
- NL2Repo-Bench is a new benchmark that tests whether coding agents can build a whole Python library from just one long natural-language document and an empty folder.
- Unlike past tests that focus on short functions or bug fixes, this benchmark checks long-horizon skills like planning, architecture, and cross-file consistency.
- Agents are judged only by running the real project's upstream pytest suite in a controlled Docker environment: no opinions, just pass or fail.
- Across 104 real-world tasks, even the best agents averaged under 40.5% test pass rate and rarely finished a full repository correctly.
- Common failures include stopping too early, losing the big-picture plan, breaking imports across files, and weak dependency handling.
- Long context windows help a lot (1M-token models like Claude-4.5 lead), but context size alone isn't enough without strong planning and persistence.
- Revealing all tests during development boosts scores but still doesn't break 60%, showing the task is truly hard even with guidance.
- Tool-use analysis shows that explicit planning (task tracking) correlates strongly with better performance, while blind editing and endless navigation hurt.
- NL2Repo-Bench gives researchers a rigorous way to measure and improve the long-horizon reasoning that real software development needs.
Why This Research Matters
Real software is more than single functions: it's many files that must fit and run together. NL2Repo-Bench measures whether AI can handle that full journey, from reading a long spec to delivering a package that passes real tests. This matters for companies hoping to speed up development, reduce bugs, and automate routine engineering tasks safely. It helps researchers pinpoint where agents stumble (planning, dependency setup, cross-file consistency) so they can design better training and tools. It also gives a fair way to compare agents, since grading is based on actual execution, not opinions. Over time, progress here could enable trustworthy AI partners that can build, test, and ship real software.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how building a treehouse takes more than hammering one nail: you need a plan, a blueprint, supplies, and careful steps so every board fits together.
Filling (The Actual Concept)
- What it is: This paper introduces a new way to test whether AI coding agents can build whole software projects, not just tiny pieces, from a single long instruction document.
- How it works (story of the world before → problem → failed attempts → gap → stakes):
- The world before: AI models got good at helping with short tasks like writing one function or fixing a small bug. Benchmarks like HumanEval and MBPP measured those small wins well. Other tests like SWE-bench asked agents to fix issues in an existing repository, which still gives a lot of structure and clues.
- The problem: Real software isn't just one function. It's a city of files, folders, imports, tests, and packaging. To build it from scratch, an agent must plan far ahead, keep the big picture in mind, and make all parts work together over time. Older benchmarks didn't measure this long-horizon ability.
- Failed attempts: Some benchmarks tried whole-repo challenges but often included scaffolds (pre-made folders and function signatures) or used LLMs to judge correctness. That makes things easier or fuzzier than real life and can hide whether the code truly works.
- The gap: We needed a from-scratch challenge using only natural-language requirements, no scaffolding, and judged only by running the original project's real tests. That's the missing piece.
- Real stakes: If we want AI to be a reliable teammate or someday build apps end-to-end, it has to handle the full journey: plan, create, test, package, and ship. Long-horizon reasoning isn't optional; it's the backbone of real software engineering.
- Why it matters: Without a strict, real-world test, we could think agents are ready for prime time when they're not. That risks broken software, wasted time, and missed opportunities.
Bottom Bread (Anchor): Imagine giving a student one long assignment: "Build a calculator library with parsing, evaluation, and flexible comparison." No starter code, no hints. You grade by running the real unit tests. That's exactly what NL2Repo-Bench does for coding agents.
New Concept 1. Hook: Imagine a robot chef that can read a recipe and cook by itself. The Concept: Autonomous Coding Agents
- What it is: Programs that can plan, write, run, and fix code without humans doing each step.
- How it works:
- Read the goal written in plain English.
- Plan tasks (files to make, functions to write, tests to run).
- Write code and organize folders.
- Run tests and fix mistakes.
- Why it matters: Without autonomy, the agent keeps asking for help and never finishes big projects. Anchor: A coding agent gets "Build a math parser." It designs modules, writes a parser, runs tests, and fixes bugs on its own.
New Concept 2. Hook: Planning a cross-country road trip requires mapping many stops ahead. The Concept: Long-Horizon Reasoning
- What it is: Thinking and executing over many steps while staying on plan.
- How it works:
- Set the big goal (finish a working library).
- Break it into phases (design → implement → test → package).
- Track progress and adjust when surprises happen.
- Keep earlier decisions in mind while coding later parts.
- Why it matters: Without it, agents lose track, repeat mistakes, or stop early. Anchor: While building a parser, the agent remembers earlier design choices so the grader module imports the right symbols later.
New Concept 3. Hook: Architects draw blueprints before builders lift a hammer. The Concept: Architectural Design
- What it is: Planning the projectās file layout, modules, and how they connect.
- How it works:
- Choose package names and folders.
- Decide core modules (parser.py, grader.py).
- Define clear interfaces between them.
- Plan public APIs exposed in __init__.py.
- Why it matters: Without architecture, code grows messy and parts don't fit. Anchor: The math_verify project splits parsing and grading so tests can import functions cleanly. A layout sketch follows.
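To make this concrete, here is a minimal sketch of how an agent might lay down such an architecture at the start of a run. The parser/grader split and the parse/verify API come from the paper's math_verify running example, but the exact file names, layout, and stub signatures are illustrative assumptions, not the real library.

```python
"""Sketch: scaffolding a hypothetical layout for the math_verify running example.

The parser/grader split and the parse/verify API follow the paper's example;
the concrete file names and stub signatures are assumptions for illustration.
"""
from pathlib import Path

# Planned architecture: package name, core modules, and the public API surface.
LAYOUT = {
    "math_verify/__init__.py": (
        "from .parser import parse\n"
        "from .grader import verify\n"
        '__all__ = ["parse", "verify"]\n'   # the names upstream tests will import
    ),
    "math_verify/parser.py": "def parse(expr: str):\n    raise NotImplementedError\n",
    "math_verify/grader.py": "def verify(gold, answer, strict=True):\n    raise NotImplementedError\n",
}

def scaffold(root: Path) -> None:
    """Write the planned folders and stub files into an empty workspace."""
    for rel_path, content in LAYOUT.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

if __name__ == "__main__":
    scaffold(Path("workspace"))
    print(*sorted(str(p) for p in Path("workspace").rglob("*.py")), sep="\n")
```

The point is not the stub code itself but that the folder names, module boundaries, and re-exports are decided up front, so every later edit has a contract to respect.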
New Concept 4. Hook: Before you bake, you check you have flour, eggs, and sugar. The Concept: Dependency Management
- What it is: Making sure your project and its libraries are installed and linked correctly.
- How it works:
- List what you need (e.g., antlr4, sympy-like tools).
- Pin versions in pyproject.toml.
- Import modules carefully across files.
- Ensure __init__.py exposes the right names.
- Why it matters: Missing or mismatched dependencies break imports and tests. Anchor: If latex2sympy2_extended is missing, parsing LaTeX expressions will crash during tests. A dependency-check sketch follows.
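A minimal sketch of the kind of pre-flight check an agent could run before executing any tests. The distribution names come from the paper's math_verify example (antlr4-python3-runtime, latex2sympy2_extended); the mapping to import names and the helper itself are illustrative assumptions, not part of the benchmark.

```python
"""Sketch: a pre-flight dependency check before running any tests.

The distribution names come from the paper's math_verify example; the mapping
to import names and this helper are illustrative assumptions.
"""
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version

# distribution name on PyPI -> module name the code actually imports (assumed)
REQUIRED = {
    "antlr4-python3-runtime": "antlr4",
    "latex2sympy2_extended": "latex2sympy2_extended",
    "sympy": "sympy",
}

def check_dependencies() -> list[str]:
    """Return human-readable problems; an empty list means imports should work."""
    problems = []
    for dist, module in REQUIRED.items():
        try:
            installed = version(dist)              # is it installed, and which version?
        except PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        try:
            import_module(module)                  # the import the test suite will perform
        except ImportError as exc:
            problems.append(f"{dist}=={installed} installed but `import {module}` fails: {exc}")
    return problems

if __name__ == "__main__":
    issues = check_dependencies()
    print("all dependencies importable" if not issues else "\n".join(issues))
```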
New Concept 5. Hook: Building a Lego city means making buildings that connect by roads. The Concept: Multi-Module Logic Implementation
- What it is: Writing several modules that cooperate to provide the full feature set.
- How it works:
- Implement each moduleās job.
- Connect them via clean function calls.
- Share common types and constants.
- Keep behavior consistent across files.
- Why it matters: If pieces don't cooperate, the whole project fails. Anchor: parser.py converts text to SymPy objects; grader.py checks equivalence, and they must agree on types.
New Concept 6. Hook: All puzzle pieces must fit, even after you move them. The Concept: Cross-File Consistency
- What it is: Making sure names, types, and expectations match across files.
- How it works:
- Keep the same function signatures everywhere.
- Use shared utilities for common rules.
- Update imports when files change.
- Test the whole package, not just one file.
- Why it matters: A tiny mismatch (like a renamed function) can break everything. Anchor: If verify() expects strict=False but grader.py forgot that argument, tests fail. A contract-check sketch follows.
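One way to catch that kind of mismatch early is a tiny contract check that inspects the public signatures before the full suite runs. The module and argument names below (math_verify, parse, verify, strict) are taken from the running example; the check itself is a hypothetical illustration, not part of the benchmark or the library.

```python
"""Sketch: a cross-file contract smoke check for the math_verify running example.

It fails fast if grader.verify drops a keyword (here `strict`) that other files
or the upstream tests rely on. Module and function names follow the paper's
example; the check itself is a hypothetical illustration.
"""
import inspect

def check_contract() -> None:
    from math_verify import parse, verify        # public API re-exported by __init__.py

    # verify() must keep the keyword the tests pass (strict=False in lenient mode).
    params = inspect.signature(verify).parameters
    assert "strict" in params, "grader.verify lost the `strict` keyword argument"

    # parse() output must be something verify() can consume without blowing up.
    assert verify(parse("sin(x) + x"), parse("sin(y) + y"), strict=False) in (True, False)

if __name__ == "__main__":
    check_contract()
    print("cross-file contract holds")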
New Concept 7. Hook: To check if a toy car works, you roll it; you don't just look at it. The Concept: Execution-Based Evaluation
- What it is: Judging by running the real tests, not by looks or guesses.
- How it works:
- Build the generated repo inside Docker.
- pip install it.
- Run the upstream pytest suite.
- Count passes/fails as the score.
- Why it matters: It measures actual behavior, not opinions. Anchor: If parse("sin(x)+x") and parse("sin(y)+y") compare equal in lenient mode, the test passes. A scoring-loop sketch follows.
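Here is a minimal sketch of what that grading loop could look like from inside the container: install the generated repo, run the upstream pytest suite, and count passes. The steps mirror the pipeline described above, but the actual NL2Repo-Bench harness is not shown in this summary, so the paths and flags are assumptions.

```python
"""Sketch: execution-based scoring inside the task container (illustrative).

Installs the agent-generated repository, runs the upstream pytest suite, and
derives a pass rate from pytest's JUnit XML report. The real benchmark harness
may differ; paths and flags here are assumptions.
"""
import subprocess
import sys
import xml.etree.ElementTree as ET

def score_repo(repo_dir: str, tests_dir: str) -> float:
    # 1) Install the generated package into the (containerized) environment.
    subprocess.run([sys.executable, "-m", "pip", "install", "-e", repo_dir], check=True)

    # 2) Run the upstream tests; failing tests must not abort scoring.
    subprocess.run(
        [sys.executable, "-m", "pytest", tests_dir, "-q", "--junitxml=report.xml"],
        check=False,
    )

    # 3) Pass rate = passed / collected, read from the XML summary attributes.
    root = ET.parse("report.xml").getroot()
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    tests = int(suite.get("tests", 0))
    not_passed = (int(suite.get("failures", 0)) + int(suite.get("errors", 0))
                  + int(suite.get("skipped", 0)))
    return (tests - not_passed) / tests if tests else 0.0

if __name__ == "__main__":
    print(f"pass rate: {score_repo('generated_repo', 'upstream_tests'):.1%}")
```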
New Concept 8. Hook: A teacher gives you just the assignment sheet and a blank notebook. The Concept: NL2Repo-Bench
- What it is: A benchmark where an agent gets one natural-language document and must create a full, installable Python library that passes the original projectās tests.
- How it works:
- Pick real-world repos with solid tests.
- Reverse-engineer them into long, precise NL documents.
- Give agents only that doc and an empty folder.
- Judge by running the upstream tests in Docker.
- Why it matters: It fairly tests true end-to-end software building. Anchor: The math_verify example document describes parsing and grading rules; the agent must rebuild the library so all official tests pass.
02 Core Idea
Top Bread (Hook): Imagine being told, "Build a full board game from this rulebook, with no pieces and no board, and then we'll judge by actually playing it."
Filling (The Actual Concept)
- The "Aha!" in one sentence: Test coding agents on full, from-scratch repository generation using only a long natural-language spec, and grade them by the original project's real pytest suite inside a controlled environment.
- Multiple analogies (3 ways):
- Cooking analogy: You get a detailed recipe book (the NL document) and an empty kitchen (workspace). You must cook the whole meal (repository) so that expert tasters (pytest) approve every dish.
- School project analogy: You receive a long assignment sheet and must produce a working science fair project without any starter kit. The judges run your experiment to see if it really works.
- Lego city analogy: You're handed city-planning notes and must build the entire Lego city from scratch. Inspectors then drive through it to verify all roads, bridges, and services function.
- Before vs After: • Before: Benchmarks mostly checked short code snippets, bug fixes, or completion with scaffolding. Agents could lean on pre-existing structure. • After: NL2Repo-Bench removes the training wheels: no code, no scaffolding, one long spec, and strict pass/fail by upstream tests. Now we see if agents can sustain a coherent plan for the whole piece of software.
- Why it works (intuition, no equations): • Real behavior is what matters: running upstream tests gives a binary, reliable signal of whether the software behaves as intended. • Long specs + empty workspaces force global thinking: with no hints like function signatures, agents must design architecture, manage dependencies, and maintain cross-file consistency. • Controlled Docker environments reduce noise: if a test fails, it's likely the agent's implementation, not a random system glitch.
- Building blocks of the idea:
- Task sourcing: Choose real Python libraries with solid pytest suites, modern upkeep, and manageable sizes.
- Reverse-engineered NL documents: Translate the repository into a precise, exhaustive natural-language spec using AST-assisted coverage so nothing essential is missing.
- Environment standardization: Ship a per-task Docker image where the original project is guaranteed to pass, isolating environment issues from model ability.
- Non-functional constraint relaxation: Remove build gotchas (like missing README checks) so results reflect code quality, not packaging trivia.
- Strict execution-based scoring: Install the generated package and run the original tests; report pass rates and Pass@1.
- Analysis beyond scores: Trace tool usage, turns, early stops, and context window effects to expose true long-horizon weaknesses.
Bottom Bread (Anchor): In the math_verify example, the doc explains how to parse numbers (like $1,000.99 or European decimals) and compare expressions in strict vs lenient mode. The agent must design modules, implement parsing and grading, package the library, and then pass the same tests the real project uses, with no shortcuts.
03 Methodology
At a high level: Input (one long NL document) → [Plan architecture] → [Implement multi-module code + manage dependencies] → [Package + install] → [Run upstream pytest in Docker] → Output (pass rate and analysis).
Step-by-step (what happens, why it exists, example):
- Task selection (realistic, testable repos)
- What: Curate real Python libraries with 300–120,000 LOC, recency, and fully passing pytest suites.
- Why: Ensures tasks are meaningful, modern, and verifiable.
- Example: Choose a math expression verification library whose upstream tests already pass.
- Reverse-engineering into a natural-language spec
- What: Human annotators study the repo and write a complete NL document describing goals, APIs, modules, and behaviors.
- Why: The agent must rely only on words, like real requirements.
- Example: The spec says parse("sin(x)+x") equals parse("sin(y)+y") in lenient mode; lists functions parse() and verify() with arguments and behaviors.
- AST-assisted coverage for API completeness
- What: Use an AST scanner to extract all functions/classes/signatures; cross-check that the spec documents each tested API.
- Why: Missing a required API would make the task impossible.
- Example: The tool finds grader.verify and parser.parse; the document includes both with correct parameters. (A scanner sketch follows this step list.)
- Environment building with Docker
- What: Build a per-task Docker image where the original repo passes its own tests (pin dependencies as needed).
- Why: Isolates model performance from environment flakiness.
- Example: Pin antlr4-python3-runtime and latex2sympy2_extended versions so upstream tests always pass.
- Non-functional constraint relaxation
- What: Remove packaging gotchas unrelated to core logic (e.g., strict README checks) or pre-create harmless files.
- Why: Prevents unfair failures and keeps focus on functionality.
- Example: If setup.py insists on a LICENSE file, provide a minimal one in the image.
- Quality assurance and refinement
- What: Human experts review for fidelity; static coverage ensures completeness; pilot runs identify ambiguities.
- Why: Guarantees each task is solvable and fair.
- Example: If a test fails due to ambiguous lenient comparison rules, refine the doc to clarify.
- Agent run protocol
- What: The agent receives only the spec and an empty workspace. It can use tools (edit files, run tests, plan) without human help or turn limits.
- Why: Measures truly autonomous, long-horizon behavior.
- Example: In OpenHands, the agent plans with task_tracker, edits with str_replace_editor, and runs pytest via execute_bash.
- Execution-based evaluation
- What: Package the generated repo, install it, and run upstream pytest. Count passes and report pass rates and Pass@1.
- Why: Objective, binary behavior check.
- Example: Even if some tests fail to collect initially, all runnable tests are executed to avoid scoring zero unfairly.
- Analysis diagnostics (the secret sauce)
- What: Track tool usage patterns, number of turns, early termination vs non-finish, and context window impact.
- Why: Reveals why agents fail (overconfidence, navigation traps, blind editing) and what helps (planning loops, edit-test cycles).
- Example: High task_tracker usage correlates strongly with better scores; models with huge contexts maintain coherence longer.
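As promised in the AST-coverage step above, here is a minimal sketch of the kind of scanner that step describes: walk the repository with Python's ast module and list every public function and class so the NL document can be cross-checked against it. The authors' actual tooling is not reproduced here; this is a simplified reconstruction (only positional parameter names are shown).

```python
"""Sketch: AST-based API inventory for spec-coverage checking (illustrative).

Walks every .py file under a repository and lists public functions and classes,
so annotators can cross-check that the NL document mentions each tested API.
"""
import ast
from pathlib import Path

def public_api(repo_root: str) -> list[str]:
    entries = []
    for path in Path(repo_root).rglob("*.py"):
        module = path.relative_to(repo_root).with_suffix("").as_posix().replace("/", ".")
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not node.name.startswith("_"):
                args = ", ".join(a.arg for a in node.args.args)
                entries.append(f"{module}.{node.name}({args})")
            elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
                entries.append(f"{module}.{node.name}")
    return sorted(entries)

if __name__ == "__main__":
    # e.g. prints "grader.verify(gold, answer, strict)"-style lines for the running example
    for entry in public_api("math_verify"):
        print(entry)
```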
Secret sauce highlights:
- Single-document, no-scaffold start forces true architecture and planning.
- Strict upstream tests avoid subjective grading.
- Docker standardization makes apples-to-apples comparisons possible.
- Rich telemetry (tools, turns, stop patterns) exposes long-horizon failure modes.
Extra Concept. Hook: It's hard to follow a very long story if you can't keep many pages in mind. The Concept: Context Window
- What it is: How much text the model can remember at once.
- How it works:
- Read the long spec and conversation history.
- Keep earlier decisions in memory.
- Use that memory to keep code consistent.
- Longer windows reduce forgetfulness.
- Why it matters: Small windows make models forget past choices and break cross-file logic. Anchor: With a 1M-token window, the agent can recall the API contract while editing distant files.
Extra Concept. Hook: Stopping a race before the finish line won't win any medals. The Concept: Premature Termination vs Non-Finish
- What it is: Two ways agents fail to complete: stopping too early with confidence, or never deciding to finish.
- How it works:
- Early stop: Agent says "done" before the repo is ready.
- Non-finish: Agent waits for user input or times out.
- Both lead to incomplete repos and low scores.
- Planning + persistence reduce both.
- Why it matters: These are top causes of failure on long tasks. Anchor: One model declared success after ~70 steps; tests later showed missing imports and failing cases. A finish-gate sketch follows.
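One simple mitigation, echoed in the Practical Applications at the end of this piece, is a "finish gate": the agent may only declare completion after the package installs and its own quick smoke tests pass. The sketch below is a hypothetical illustration of that pattern, not a component of NL2Repo-Bench or any specific agent framework.

```python
"""Sketch: a 'finish gate' that blocks premature completion (illustrative).

Before the agent reports "done", the generated repo must install cleanly and
pass the agent's own quick smoke tests (the upstream suite stays hidden during
development). Hypothetical pattern, not a benchmark or framework component.
"""
import subprocess
import sys

def may_finish(repo_dir: str, smoke_tests_dir: str) -> bool:
    """Return True only if the package installs and the smoke tests pass."""
    install = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", repo_dir], capture_output=True
    )
    if install.returncode != 0:
        return False                      # cannot even install: definitely not done

    smoke = subprocess.run(
        [sys.executable, "-m", "pytest", smoke_tests_dir, "-q", "-x"], capture_output=True
    )
    return smoke.returncode == 0          # any failure keeps the agent iterating

if __name__ == "__main__":
    print("finish allowed" if may_finish("generated_repo", "smoke_tests") else "keep working")
```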
Extra Concept. Hook: A checklist makes big chores safer and smoother. The Concept: Agentic Planning (task_tracker)
- What it is: The agent's to-do list and progress tracker.
- How it works:
- Create tasks (design, implement modules, package, test).
- Update status as each task completes.
- Replan when failures appear.
- Keep the big picture visible.
- Why it matters: Without a plan, the agent wanders and forgets crucial steps. Anchor: Agents with more task_tracker calls scored higher on NL2Repo-Bench. A minimal tracker sketch follows.
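The real task_tracker is an OpenHands tool whose interface is not reproduced in this summary; the sketch below is a hypothetical, minimal stand-in that captures the behavior described above (create tasks, update status, replan, keep the big picture visible).

```python
"""Sketch: a minimal task tracker in the spirit of the planning tool discussed
above. The actual OpenHands task_tracker interface is not reproduced here."""
from dataclasses import dataclass, field

@dataclass
class Task:
    title: str
    status: str = "todo"        # todo | in_progress | done | blocked

@dataclass
class TaskTracker:
    tasks: list[Task] = field(default_factory=list)

    def add(self, title: str) -> None:
        self.tasks.append(Task(title))

    def update(self, title: str, status: str) -> None:
        for task in self.tasks:
            if task.title == title:
                task.status = status

    def overview(self) -> str:
        """The 'big picture' the agent re-reads before each major decision."""
        return "\n".join(f"[{t.status:11}] {t.title}" for t in self.tasks)

if __name__ == "__main__":
    plan = TaskTracker()
    for step in ["design module layout", "implement parser.py",
                 "implement grader.py", "package + __init__ exports", "run tests"]:
        plan.add(step)
    plan.update("design module layout", "done")
    plan.update("implement parser.py", "in_progress")
    print(plan.overview())
```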
04 Experiments & Results
The test: Can agents turn a single long NL document into a fully installable repo that passes the upstream pytest suite, across 104 real-world tasks (web, ML, utilities, databases, system tools, and more)?
What they measured and why:
- Pass rate (% of tests passed): A direct measure of functional correctness (like getting an exam score).
- Pass@1: How many full repositories pass all tests on the first try. (Both metrics are sketched right after this list.)
- By difficulty (Easy/Medium/Hard): Shows how performance scales with project size/complexity.
- By category: Reveals strengths/weaknesses across domains.
- Tool usage and turns: Uncovers behavior patterns (planning, early stop, testing loops).
- Context window impact: Tests whether memory size helps maintain long-range coherence.
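To pin down the two headline metrics, here is a small sketch of how they could be computed from per-task test results. The exact aggregation the paper uses (e.g., averaging per task vs. pooling all tests) is not spelled out in this summary, so the per-task averaging below is an assumption, and the numbers in the example are toy values, not paper results.

```python
"""Sketch: computing the two headline metrics from per-task results.

`results` maps task name -> (tests_passed, tests_total). Per-task averaging is
an assumption; the toy numbers below are not the paper's results.
"""

def pass_rate(results: dict[str, tuple[int, int]]) -> float:
    """Average per-task fraction of upstream tests passed."""
    rates = [passed / total for passed, total in results.values() if total]
    return sum(rates) / len(rates) if rates else 0.0

def pass_at_1(results: dict[str, tuple[int, int]]) -> int:
    """Number of repositories whose entire upstream suite passes in one run."""
    return sum(1 for passed, total in results.values() if total and passed == total)

if __name__ == "__main__":
    toy = {"math_verify": (42, 42), "http_client": (10, 80), "cli_tool": (0, 25)}
    print(f"average pass rate: {pass_rate(toy):.1%}")
    print(f"Pass@1 repos: {pass_at_1(toy)}")
```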
Competition lineup:
- Closed-source: Claude-Sonnet-4/4.5, Gemini-3-pro, GPT-5.
- Open-source: DeepSeek-V3.1/V3.2, Kimi-k2, Qwen3-235B (Instruct/Thinking), GLM-4.6.
- Frameworks: OpenHands (main), plus Cursor-CLI and Claude Code for comparison.
Scoreboard with context:
- Top scores: Claude-Sonnet-4.5 variants reached around 40% pass rate (e.g., 40.2% using Claude Code). That's like an A- when the class average is closer to a C- or D.
- Many models scored below 20% (a failing grade), showing this benchmark is tough.
- Pass@1: The best model fully passed only 5 out of 104 repositories in a single run; complete wins are rare.
- Difficulty effect: Performance drops steadily from Easy to Hard; larger, more complex repos hurt everyone.
- Category differences: System tools and database interaction tended to be stronger; ML and networking were harder. Infrastructure-style projects with clearer packaging seemed easier than algorithm-heavy or protocol-heavy ones.
Surprising findings:
- Thinking-model paradox: Qwen3-Thinking often stopped early (about half the tasks), likely due to "I reasoned it out" overconfidence without real execution.
- GPT-5 conservatism: Low early-stop but very high non-finish rate (over 80%); it waits for human approval, which is fatal in a fully autonomous setting.
- Planning correlates with success: task_tracker usage showed a strong positive correlation with score (~0.71), signaling that explicit planning helps long projects.
- Edit–test loop wins: High performers cycled between code edits and running tests. Lower performers got stuck in navigation (ls/cd/read) or blind editing streaks.
Context window impact:
- 1M-token models (Claude series, Gemini-3-pro) outperformed smaller-window peers. Big memory helps keep the plan and cross-file details consistent over 100–200+ turns.
- But context is not enough: Some long-context models still lagged; persistence and planning quality also matter.
Ablations:
- Turn limits: Raising the max turns from 50 to 200 boosts scores; beyond ~200, gains flatten. The barrier is reasoning/coherence, not just more tries.
- Revealing tests: When agents can see the full test suite during development, scores jump (e.g., Claude-4.5 from ~40% to ~59%). Yet even then, it doesn't break 60%, proving end-to-end repo creation remains hard.
Takeaway in plain words: Today's agents can build pieces, but making an entire town, with roads, power, and rules that all work together, still trips them up. The winners planned well, tested often, remembered more, and kept going until the end.
05 Discussion & Limitations
Limitations (be specific):
- Language and ecosystem scope: Focused on Python libraries with pytest; results may differ for other languages or frameworks.
- From-scratch only: It doesn't evaluate repair-in-place skills (a different but also important capability).
- Tooling dependence: Agent behavior partly reflects the available toolset (e.g., OpenHands actions) and its reliability.
- Compute and time: Long-horizon evaluations demand many interaction turns and test runs, which is resource-intensive.
- Hidden test bias: Even with AST coverage, subtle behaviors might still be harder to reproduce from NL text than from reading tests directly.
Required resources:
- Per-task Docker images; reliable compute for running upstream tests many times.
- Models with large context windows and sufficient token budgets.
- Agent frameworks that support planning, editing, execution, and logging for analysis.
When not to use NL2Repo-Bench:
- If you only need function-level code completion measurements (HumanEval-style tasks are faster and cheaper).
- When human-in-the-loop collaboration is the goal (this benchmark expects full autonomy during the build phase).
- If you cannot allocate enough compute/time to let agents iterate and run full test suites.
Open questions:
- How to reduce early-stop and non-finish behaviors without hand-holding? Can self-verification loops (plan → execute → audit → replan) help reliably?
- What's the best way to maintain cross-file contracts over hundreds of turns: memory tools, code maps, or learned architectural priors?
- Can curriculum-style training (function → module → package) improve long-horizon skills more than scaling alone?
- How should agents balance edit–test cycles vs. heavy up-front design to avoid both blind editing and navigation traps?
- Beyond bigger context windows, what memory and planning mechanisms are truly necessary for agent-scale software engineering?
06 Conclusion & Future Work
3-sentence summary: NL2Repo-Bench is a rigorous benchmark that asks coding agents to build complete Python libraries from a single long natural-language document and then judges them by running the original projects' upstream tests in a controlled environment. Across 104 real-world tasks, even top agents struggled (~40% pass rate), revealing deep weaknesses in long-horizon planning, cross-file consistency, and persistent execution. The analysis shows that explicit planning, frequent edit–test loops, and very large context windows help but are not sufficient to solve end-to-end repository generation today.
Main achievement: Establishing a strict, execution-based, from-scratch repository benchmark that isolates true long-horizon agentic competence and exposes where current systems fail.
Future directions:
- Architectures for durable planning and self-auditing across hundreds of steps.
- Memory tools beyond raw context size (code maps, contract trackers, spec-to-API checkers).
- Training curricula and synthetic tasks that grow from functions to full packages.
- Better environment and dependency reasoning, including robust packaging and import management.
Why remember this: NL2Repo-Bench changes the question from "Can AIs write snippets?" to "Can AIs build software?" It sets a clear, fair bar (pass the real tests from scratch) and gives researchers the compass they need to make agents that can truly ship code.
Practical Applications
- Use NL2Repo-Bench to compare coding agents fairly before adopting them in your CI pipeline.
- Train agents to maintain explicit to-do lists (task tracking) and measure the score gains.
- Adopt edit–test loops in agent workflows to reduce blind editing and catch regressions early.
- Provide agents with repository maps (API contracts, import graphs) to improve cross-file consistency.
- Tune packaging and dependency heuristics (pyproject.toml, __init__.py exports) to cut ImportError rates.
- Simulate long-horizon curricula: start agents on small repos and scale to harder NL2Repo tasks.
- Instrument agents to detect and prevent premature finishes (require passing smoke tests before finishing).
- Leverage larger context windows or memory tools to preserve architectural decisions over many turns.
- Apply non-functional constraint relaxation in your internal evals to focus on functionality over packaging trivia.
- Use ablation-style evaluations (turn limits, revealed tests) to diagnose whether your agent struggles with planning or implementation.