daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
Key Summary
- Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
- This paper turns real GitHub Pull Request (PR) chains into rich, step-by-step lessons that teach AIs to plan, stay consistent, and fix errors over time.
- Instead of single quick fixes, the data shows full feature evolution with reviews, tests, and bug-fixes that can be checked for correctness.
- The method builds training examples that average 85k tokens and 116 tool calls, yet only 239 samples already deliver a substantial performance boost.
- Fine-tuning GLM-4.6 on this data improves Toolathlon by about 47% and beats datasets tens to hundreds of times larger.
- The key is explicit supervision of three meta-skills: progressive task decomposition, long-term consistency, and verifiable refinement.
- A strict evaluator filters bad rollouts (score < 0.8), preventing the model from learning noisy or wrong behaviors.
- Scaling both the training horizon (longer PR chains) and test-time budgets helps even more, revealing long-horizon scaling laws.
- The approach is model-agnostic and also lifts Qwen variants, showing it transfers across architectures.
- This sets a practical path to unlock long-horizon agency by mining real software evolution, not expensive manual labels.
Why This Research Matters
Software work is rarely a one-step fix; it's a journey of planning, changing, testing, and refining. daVinci-Agency captures that journey directly from real PR histories so AIs can learn to handle long, messy, real-world problems. This makes assistants more reliable, faster, and less wasteful with tokens and tool calls, saving both compute and developer time. It also reduces the need for expensive human-labeled data by reusing the built-in checks (tests, reviews) that projects already have. As we deploy agents into IDEs, CI systems, and operations, these long-horizon skills translate to fewer regressions and safer automation. Over time, such agents can manage multi-version upgrades, guide large refactors, and sustain projects with less human babysitting.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine building a big LEGO city over many weekends. You can't finish it in one sitting: you plan, add districts, fix mistakes, and keep the whole city's style consistent.
The Concept: Large Language Models (LLMs)
- What it is: LLMs are smart text tools that read and write language and can call tools like code editors or test runners.
- How it works:
- Read a prompt
- Think (reason) about it
- Possibly use tools (like editing files or running tests)
- Produce an answer
- Why it matters: Without tool use and reasoning, they can't solve real coding tasks. Anchor: When you ask an AI to fix a bug, it reads the issue, edits code, runs tests, and explains its steps.
Hook: You know how running a marathon is different from a quick sprint? You need pacing, a route plan, and to adjust if you cramp.
The Concept: Long-horizon agentic tasks
- What it is: Tasks that take many steps, with choices that affect much later outcomes.
- How it works:
- Set a long-term goal
- Break it into stages
- Act, check results, and adjust
- Keep everything aligned over time
- Why it matters: Without long-horizon skill, the AI forgets goals and tiny errors snowball. Anchor: Upgrading a library across a project takes many coordinated PRs, not one tiny edit.
Hook: Think of being a team captain. You assign positions, plan plays, and keep everyone in sync.
The Concept: Understanding of task management
- What it is: Knowing how to split big goals into smaller, ordered jobs.
- How it works:
- Identify subgoals
- Order them logically
- Track progress and adjust
- Why it matters: Without it, the AI tackles steps out of order and gets stuck. Anchor: Fix tests first, then refactor code, then add a new feature.
Hook: Imagine a school project that lasts months. You keep drafts, feedback, and versions.
The Concept: Project workflow context
- What it is: The shared history, files, branches, and discussions around a software change.
- How it works:
- Read issues and comments
- See code versions
- Understand how parts fit together
- Why it matters: Without context, the AI makes changes that clash with the project. Anchor: A PR description explains the why; the diff shows the what; tests verify it.
Hook: Like checking homework with a rubric every time you submit a draft.
The Concept: Continuous integration (CI) principles
- What it is: Automated checks (tests, builds, lint) that verify each change.
- How it works:
- Run tests on each commit/PR
- Catch failures early
- Keep main branch healthy
- Why it matters: Without CI, errors sneak in and pile up. Anchor: A failing CI test tells the AI exactly what broke after its edit.
Hook: Picture a coach guiding a season: plan, play, review, improve, repeat.
The Concept: Software project management
- What it is: Coordinating tasks, timelines, and quality across features.
- How it works:
- Plan milestones
- Review work
- Merge when ready
- Why it matters: Without management, features drift and systems break. Anchor: Issues, PR reviews, and milestone tracking keep the team aligned.
Hook: Think of a teacher's notes on your essay: what to fix and why.
The Concept: Feedback mechanisms in software development
- What it is: Signals like test results, code reviews, and user reports that guide changes.
- How it works:
- Submit a change
- Receive feedback (tests/review)
- Revise and resubmit
- Why it matters: Without feedback, the AI cannot refine its solution. Anchor: A review comment: "This breaks NumPy 1.14 parsing; please use the PEP 440 parser."
Hook: Imagine keeping a journal of every time you fixed your bike and what went wrong.
The Concept: Bug-fix tracking
- What it is: Recording what bugs happened and how they were fixed.
- How it works:
- File an issue
- Propose a fix in a PR
- Link follow-up PRs if more fixes are needed
- Why it matters: Without this trail, future fixes repeat old mistakes. Anchor: PR #21 mentions it fixes a regression introduced in PR #15.
The world before: LLMs showed strong short-term tool use, but struggled with long, multi-stage problems: they lost the plot, repeated errors, or over-edited. Datasets were either tiny and manual (too costly) or synthetic and shallow (missing real failure-and-refine patterns). People tried distilling trajectories from teacher models or using simulated environments, but these often taught surface behaviors and single-step tricks. The gap was clear: agents lacked explicit training on the cross-stage evolution that real projects live through.
The real stakes: In everyday tools (IDEs, data pipelines, cloud deployments), a smart assistant must plan over many steps, respect earlier choices, and fix its own mistakes. Without long-horizon training, assistants waste tokens, loop tools, and ship brittle changes. That's why this paper turns real PR chains, the living history of software, into lessons that teach steady, marathon-style problem solving.
02 Core Idea
Hook: You know how comic books tell one big story over many issues, with each issue building on the last? You learn characters, plot twists, and how earlier choices matter later.
The Concept: Chain-of-Pull Requests (PRs)
- What it is: A sequence of related PRs that evolve one feature or fix over time, each building on the last.
- How it works:
- Find PRs that reference or fix each other
- Order them by their real dependency (not just time)
- Treat the whole chain as one long task with stages
- Why it matters: Without chains, training data misses the cause-and-effect links that teach persistence. Anchor: PR #15 adds a feature; PR #21 fixes its bug; later PRs polish edge cases, all one storyline.
The "Aha!" moment in one sentence: Real PR chains naturally encode the three meta-skills long tasks need (break the work up, keep the goal consistent, and refine with verifiable feedback), so use them as training data.
Three analogies:
- Recipe series: Start with a base cake, then layers, then frosting; each step depends on the last.
- School project drafts: First draft, peer review, revisions; feedback drives refinement.
- City building: Roads first, then utilities, then housing; consistency across phases avoids chaos.
Hook: When cleaning your room, you don't do it all at once; you tackle toys, then books, then clothes.
The Concept: Progressive task decomposition
- What it is: Turning a big goal into a clear, ordered list of sub-tasks across PR stages.
- How it works:
- Identify the step's intent from the PR text
- Localize the relevant files/modules
- Do focused edits and tests before moving on
- Why it matters: Without it, the agent tries everything everywhere and wastes steps. Anchor: Stage 1 introduces a PEP 440 version parser; Stage 2 updates callers; Stage 3 updates tests.
Hook: Imagine keeping the same art style across a comic series so readers aren't confused.
The Concept: Long-term consistency enforcement
- What it is: Ensuring later changes still satisfy the original functional goal and don't break earlier wins.
- How it works:
- Carry forward prior edits to the next stage
- Check against the shared objective and tests
- Adjust new changes to fit the whole picture
- Why it matters: Without consistency, later fixes undo earlier progress. Anchor: A refactor must still pass the original failing test it aimed to fix.
Hook: Think of taste-testing soup and adjusting salt before serving.
The Concept: Verifiable refinement
- What it is: Improving a solution using checks (tests/reviews) that confirm it really got better.
- How it works:
- Run tests/CI or read reviews
- Find precise failure
- Patch and re-check until green
- Why it matters: Without verification, the agent "fixes" things that don't actually work. Anchor: A CI failure shows an infinite recursion; the agent switches to calling super() and passes.
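To make this loop concrete, here is a minimal Python sketch of a verify-then-refine cycle, assuming a project tested with pytest and a hypothetical agent.propose_patch interface; it illustrates the idea rather than the paper's actual scaffolds.

```python
import subprocess


def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine_until_green(agent, repo_dir: str, goal: str, max_rounds: int = 3) -> bool:
    """Act, verify with tests, and refine using the precise failure signal."""
    for _ in range(max_rounds):
        passed, log = run_tests(repo_dir)
        if passed:
            return True                                    # verified: the change really works
        agent.propose_patch(repo_dir, goal, feedback=log)  # hypothetical agent interface
    return run_tests(repo_dir)[0]
```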
Before vs After:
- Before: Data showed one-off fixes; models learned quick sprints but not marathon skills.
- After: PR chains deliver authentic, staged supervision, so models learn to plan, align, and refine over time.
Why it works (intuition): PR chains capture real dependency structure (cause → effect), external verification (tests/reviews), and authentic error patterns (regressions, hotfixes). Training on these sequences teaches the model to track state over time, carry goals forward, and correct itself, because the data itself requires these behaviors to succeed.
Building blocks:
- Data sourcing from mature, interactive repos
- Graph-building to link semantically dependent PRs
- Query construction that gives intent and context but hides exact edits
- Stage-by-stage rollouts with state carryover
- A strict evaluator (score ≥ 0.8) to filter noisy samples
- Supervised fine-tuning on the accepted long trajectories
03 Methodology
High-level flow: Input (real PR chains) → [Construct dependency chains and intent queries] → [Stage-by-stage rollouts with state carryover and feedback] → [Rejection sampling and packing] → Output (long-horizon training trajectories).
Step A: PR-chain construction
- What happens: Use GitHub metadata (commit messages, review links) to find PRs that fix/extend each other; build an ordered chain by semantic references, not just time.
- Why it exists: Time-ordered lists miss true dependencies; without accurate links, later stages wonāt align with earlier intent.
- Example: PR #21 says "Fix regression from #15," so the chain [#15 → #21] is formed even if other PRs appeared in between.
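To show the flavor of Step A, here is a minimal Python sketch that links PRs through explicit textual references; the PR records, the reference patterns, and the tie-breaking rule are illustrative assumptions, and the real pipeline also uses richer semantic links that a regex cannot capture.

```python
import re
from collections import defaultdict

# Hypothetical minimal PR records: number plus free-text title/body/review comments.
PRS = [
    {"number": 15, "text": "Add PEP 440-style version parsing for import checks."},
    {"number": 21, "text": "Fix regression introduced in #15 for dotted pre-releases."},
]

# Explicit textual references such as "fixes #15" or "regression introduced in #15".
REF = re.compile(r"(?:fix(?:es)?|regression (?:introduced )?in|follow[- ]up to)\s+#(\d+)", re.I)


def build_chains(prs):
    """Link PRs to the PRs they say they fix or extend, then walk each chain from its root."""
    by_number = {pr["number"]: pr for pr in prs}
    dependents = defaultdict(list)   # earlier PR -> later PRs that build on it
    has_parent = set()
    for pr in prs:
        for ref in (int(m) for m in REF.findall(pr["text"])):
            if ref in by_number:
                dependents[ref].append(pr["number"])
                has_parent.add(pr["number"])
    chains = []
    for root in (n for n in by_number if n not in has_parent):
        chain, frontier = [root], dependents[root]
        while frontier:              # follow dependency links, not timestamps
            nxt = min(frontier)      # simplistic tie-break, good enough for a sketch
            chain.append(nxt)
            frontier = dependents[nxt]
        chains.append(chain)
    return chains


print(build_chains(PRS))             # -> [[15, 21]]
```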
Step B: Query construction (q = f(x, p̃, R))
- What happens: For each PR in the chain, synthesize a conceptual sub-query from its natural language (issue/description/comments) and its patch, while intentionally hiding literal code details. Provide a global overview for the whole chain at the start.
- Why it exists: If the prompt gives exact edits, the agent just copies; hiding specifics forces navigation, localization, and reasoning.
- Example: "Adjust version parsing to handle dotted pre-releases per semantic rules; focus on the initialization logic and how version strings are interpreted in import checks." No function names are given.
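A rough sketch of intent-only query building, under stated assumptions: the masking regex, the prompt template, and the file name in the example are hypothetical, and the actual q = f(x, p̃, R) synthesis presumably uses an LLM rather than pattern matching.

```python
import re

# Crude stand-ins for "literal code details": backticked spans, call-like names, file paths.
CODE_HINT = re.compile(r"`[^`]+`|\b\w+\(\)|[\w./-]+\.(?:py|pyx|c|h)\b")


def build_stage_query(chain_overview: str, pr_description: str) -> str:
    """Compose an intent-only sub-query: keep the 'why', hide the exact 'where/what'."""
    masked = CODE_HINT.sub("[code reference omitted]", pr_description)
    return (
        f"Overall goal of this PR chain:\n{chain_overview}\n\n"
        f"Current stage intent (exact edits withheld):\n{masked}\n\n"
        "Locate the relevant modules yourself, make focused edits, and run the tests."
    )


print(build_stage_query(
    "Make version handling follow PEP 440 consistently across the package.",
    "Adjust parsing in `_distributor_init.py` so dotted pre-releases like 1.2.3.dev0 parse correctly.",
))
```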
Step C: Rollout environment with state carryover
- What happens: Execute stages sequentially. The code edits from stage t-1 are applied to the base of stage t (S_init^(t) = B_t ⊕ Δτ_{t-1}, i.e., the stage-t base with the previous stage's accumulated edits applied). The agent must live with its prior choices.
- Why it exists: Without carryover, the agent never learns to manage long-term consequences and consistency.
- Example: If Stage 1 changes the version parser, Stage 2 starts with that change in place when updating downstream callers.
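A minimal sketch of state carryover using plain git commands; the function name and calling convention are assumptions, but the effect matches the description above: stage t starts from its base commit with the agent's accumulated diff from stage t-1 applied on top.

```python
import subprocess
from typing import Optional


def prepare_stage(repo_dir: str, base_commit: str, prior_diff: Optional[str]) -> None:
    """Initialize stage t: check out its base commit, then replay the agent's
    accumulated edits from stage t-1 so it must live with its earlier choices."""
    subprocess.run(["git", "checkout", "--force", base_commit], cwd=repo_dir, check=True)
    if prior_diff:
        # Feed the unified diff from the previous stage to `git apply` via stdin.
        subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                       input=prior_diff, text=True, check=True)
```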
Step D: Tool-rich scaffolds and logging
- What happens: Use two scaffolds (SII-CLI and mini-swe-agent) with plentiful tool calls (edit, grep, run tests). Log the full trajectory of observations, thoughts, and tool actions (often 100+ per sample).
- Why it exists: Long-horizon skills emerge when the agent repeatedly acts, checks, and adjusts; without tools, thereās nothing to practice.
- Example: The agent runs pytest, sees a failing test about ManyToMany inlines, searches files, edits a method, re-runs tests.
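A small sketch of what trajectory logging could look like; the step schema (observation, thought, tool, arguments) and the JSONL format are assumptions, not the actual SII-CLI or mini-swe-agent log layout.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TrajectoryStep:
    """One observation -> thought -> tool-call step of a rollout (assumed schema)."""
    observation: str
    thought: str
    tool: str
    tool_args: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def append_step(log_path: str, step: TrajectoryStep) -> None:
    """Append the step as one JSON line; long rollouts routinely accumulate 100+ of these."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(step)) + "\n")


append_step("rollout.jsonl", TrajectoryStep(
    observation="1 failed: test_many_to_many_inline (illustrative pytest output)",
    thought="The inline path still uses the old parser; edit the method and re-run.",
    tool="run_shell",
    tool_args={"cmd": "python -m pytest -q"},
))
```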
Step E: Evaluator and rejection sampling
- What happens: GLM-4.6 judges functional equivalence between the agent's patch and the ground-truth PR patch, giving a score s. Only samples with s ≥ 0.8 are kept; up to three refinement tries are allowed.
- Why it exists: Unfiltered self-generated data is noisy and can un-teach good behaviors; strict filtering preserves correctness signals.
- Example: Without the infinite-recursion fix, the score is low; after switching to a super() call, the score clears the threshold.
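A compact sketch of the rejection-sampling loop; judge_equivalence stands in for the GLM-4.6 judge and agent.rollout / agent.refine are hypothetical interfaces, while the 0.8 threshold and the up-to-three refinement attempts follow the description above.

```python
ACCEPT_THRESHOLD = 0.8   # strict filtering rule from the text
MAX_REFINEMENTS = 3      # up to three refinement tries


def rejection_sample(agent, judge_equivalence, task, gold_patch):
    """Keep a trajectory only if the judged functional-equivalence score clears the bar."""
    trajectory = agent.rollout(task)                        # hypothetical agent interface
    for attempt in range(MAX_REFINEMENTS + 1):              # initial try + refinements
        score, critique = judge_equivalence(trajectory.final_patch, gold_patch)
        if score >= ACCEPT_THRESHOLD:
            return trajectory                               # accepted training sample
        if attempt < MAX_REFINEMENTS:
            trajectory = agent.refine(task, trajectory, critique)
    return None                                             # rejected: too noisy to learn from
```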
Step F: Training data assembly and SFT
- What happens: Accepted trajectories are packed into training sets averaging ~85k tokens and ~117 tool calls. Models are fine-tuned with consistent hyperparameters.
- Why it exists: Long sequences with clear stage goals teach decomposition, consistency, and refinement in one go.
- Example data stats: Token max 3.14M, tool call max 1165; nine curated repos (e.g., numpy, scipy, pulsar).
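A sketch of how accepted trajectories might be packed into SFT records and summarized; the message schema and the tokenize hook are assumptions, shown only to indicate where averages like ~85k tokens and ~117 tool calls per sample would be computed.

```python
import json


def pack_for_sft(accepted, tokenize, out_path: str) -> dict:
    """Write accepted long-horizon trajectories as SFT records and summarize their scale."""
    token_counts, tool_counts = [], []
    with open(out_path, "w", encoding="utf-8") as f:
        for traj in accepted:                         # each traj: {"messages": [...]}
            text = "".join(m["content"] for m in traj["messages"])
            token_counts.append(len(tokenize(text)))
            tool_counts.append(sum(m["role"] == "tool_call" for m in traj["messages"]))
            f.write(json.dumps({"messages": traj["messages"]}) + "\n")
    n = max(len(token_counts), 1)
    return {
        "samples": len(token_counts),
        "avg_tokens": sum(token_counts) / n,          # where the ~85k average would show up
        "avg_tool_calls": sum(tool_counts) / n,       # and the ~117 tool calls per sample
    }
```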
The Secret Sauce:
- Real evolutionary structure: Authentic cause → effect links across PRs encode the meta-skills naturally.
- State carryover: Forces long-term consistency under the agentās own edits.
- Strict evaluator: Prevents drift from low-quality self-data, enabling effective on-policy self-distillation.
- Intent-only prompts: Push the model to navigate and reason, instead of copy-paste edits.
04 Experiments & Results
The test: Measure whether models learn long-horizon agency (planning over stages, keeping goals aligned, and refining with feedback) without wasting tokens or tool calls.
What we compared: Models fine-tuned on daVinci-Agency (239 samples) versus models trained on big agent datasets (e.g., SWE-Smith 66k, CodeAgent ~60k, CC-Bench 2.6k) and strong baselines (GLM-4.6, Kimi-K2, DeepSeek, Qwen variants).
Scoreboard with context:
- GLM-4.6 base vs +daVinci-Agency (239 samples):
  - SWE-bench: 0.608 → 0.632 (a solid bump on a hard software benchmark)
  - Toolathlon: 0.157 → 0.231 (~47% relative gain; like going from a B- to a strong B+/A-)
  - τ-Bench-retail/airline and DS-1000/SciCode: steady gains or maintained robustness
  - Overall average: 0.441 → 0.475 (a clear multi-benchmark lift)
- Against huge datasets:
  - Beats SWE-Smith (66k samples) on many metrics despite being ~275× smaller.
  - The 0.231 on Toolathlon is over 148% better than some baselines trained on far more data (e.g., 0.093); both percentages are recomputed in the quick check after this list.
- Cross-model transfer:
  - Qwen3-30B-A3B: 0.295 → 0.307 overall
  - Qwen3-32B: 0.280 → 0.292 overall
  - Even the small Qwen3-8B nudges upward on coding tasks
  - On AgencyBench Code, GLM-4.6-daVinci-Agency scores 15.9 vs 11–12 for strong peers.
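A quick arithmetic check of the relative gains quoted above, recomputed from the reported scores (no new results):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement as a percentage of the starting score."""
    return (after - before) / before * 100


print(round(relative_gain(0.157, 0.231)))   # 47  -> the ~47% Toolathlon gain for GLM-4.6
print(round(relative_gain(0.093, 0.231)))   # 148 -> the "over 148% better" comparison
```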
Surprising findings:
- Data efficiency: Only 239 samples with long, structured supervision outperformed datasets tens of thousands in size.
- Efficiency per token: On SWE-bench, average token use drops by 113.6k (GLM-4.6) to 288.8k (Qwen3-32B), and tool calls fall by up to ~26%: fewer steps, better results.
- Scaling laws: Longer training horizons (completing more PRs per chain) and larger inference-time budgets both widen the performance gap in favor of daVinci-Agency, showing that the method thrives when you let it plan further.
- Rejection sampling is critical: Removing it tanks performance (average ~0.205), proving quality filtering is necessary for self-distillation to help rather than harm.
Interpretation: Models trained on authentic PR evolution internalize the three meta-skills, which makes their plans tighter, their edits more stable across stages, and their fixes verifiably correct, so they can win long races without running in circles.
05 Discussion & Limitations
Limitations:
- PR-source dependence: Works best where rich PR discussions, tests, and reviews exist; sparse or low-quality repos provide weaker signals.
- Chain length cap: Current success rates limit chains to about five PRs; even longer arcs likely help but are harder to complete reliably.
- On-policy bias: Using the same family (GLM-4.6) for rollout and training can imprint its habits; strong filtering mitigates but doesn't remove this risk.
- Domain coverage: Focused on code-heavy, well-tested projects (e.g., NumPy/Scipy/Pulsar); domains without testable signals are tougher.
- Compute/context: Long sequences (85k tokens average) need memory and careful batching.
Required resources:
- Access to GitHub PR metadata and repo history; stable scaffolds (SII-CLI, mini-swe-agent); evaluator model; GPUs with large context support; storage for long logs.
When not to use:
- Tiny one-shot tasks where long-horizon structure doesn't exist.
- Repos with few tests or reviews (weak verification signals).
- Closed-source environments where diffs and history aren't accessible.
- Ultra-latency-sensitive settings where long context is infeasible.
Open questions:
- How far can chain length scale before returns diminish, or do they keep compounding?
- Can we generalize the paradigm to non-code domains with weaker ground truth (e.g., design docs, robotics logs)?
- What is the best evaluator mix (LLM+static analysis+unit fuzzing) to strengthen acceptance without false rejections?
- How to auto-discover deeper semantic links across distant PRs, not only explicit references?
- Can curriculum strategies (easy → hard chains) accelerate skill acquisition further?
06 Conclusion & Future Work
Three-sentence summary: This paper turns real GitHub PR chains into long, verifiable training lessons that teach AIs to plan in stages, keep goals aligned, and fix mistakes. With only 239 long, high-quality samples and a strict evaluator, models like GLM-4.6 gain big advantages on long-horizon benchmarks while using fewer tokens and tool calls. The method scales with longer chains and bigger test-time budgets, revealing a clear path to stronger agency.
Main achievement: Showing that modeling real software evolutionārather than synthetic one-offsāunlocks the three meta-skills of long-horizon agency (decomposition, consistency, refinement) in a data-efficient, verifiable way.
Future directions:
- Extend chains beyond five PRs and improve success rates in multi-stage rollouts.
- Hybrid evaluators combining LLM judgment, unit tests, and static analysis.
- Broaden domains (data engineering, docs, infra) where verification signals are available.
- Build curricula that grow horizon length and dependency complexity over time.
Why remember this: It reframes training data as living stories of change, not snapshots, teaching AIs to run marathons rather than sprints. By mining the world's real evolution trails (PRs), we get scalable, checkable supervision that reliably instills long-term, goal-directed behavior.
Practical Applications
- Automated multi-PR feature rollouts that keep tests green across stages.
- Guided large-scale refactors (e.g., API migration) with persistent state and verification.
- Continuous bug triage and fix cycles that link root causes to follow-up patches.
- CI co-pilots that propose patches, read failures, and refine until passing.
- Repository onboarding agents that localize where to implement a change and plan staged edits.
- Version upgrade assistants that adjust parsers and all downstream callers over multiple PRs.
- Code review aides that predict likely regressions by reading PR evolution patterns.
- Test gap fillers that learn from failures and propose targeted new tests.
- Tool-use tutors that reduce redundant shell/edit loops for efficient problem solving.
- Long-horizon planning bots for data pipelines or infra-as-code with drift control.