daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
Key Summary
- Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
- This paper turns real GitHub Pull Request (PR) chains into rich, step-by-step lessons that teach AIs to plan, stay consistent, and fix errors over time.
- Instead of single quick fixes, the data shows full feature evolution with reviews, tests, and bug-fixes that can be checked for correctness.
- The method builds training examples that average 85k tokens and 116 tool calls, yet only 239 samples already deliver a substantial performance boost.
- Fine-tuning GLM-4.6 on this data improves Toolathlon by about 47% and beats datasets tens to hundreds of times larger.
- The key is explicit supervision of three meta-skills: progressive task decomposition, long-term consistency, and verifiable refinement.
- A strict evaluator filters bad rollouts (score < 0.8), preventing the model from learning noisy or wrong behaviors.
- Scaling both the training horizon (longer PR chains) and test-time budgets helps even more, revealing long-horizon scaling laws.
- The approach is model-agnostic and also lifts Qwen variants, showing it transfers across architectures.
- This sets a practical path to unlock long-horizon agency by mining real software evolution, not expensive manual labels.
Why This Research Matters
Software work is rarely a one-step fix; it's a journey of planning, changing, testing, and refining. daVinci-Agency captures that journey directly from real PR histories so AIs can learn to handle long, messy, real-world problems. This makes assistants more reliable, faster, and less wasteful with tokens and tool calls, saving both compute and developer time. It also reduces the need for expensive human-labeled data by reusing the built-in checks (tests, reviews) that projects already have. As we deploy agents into IDEs, CI systems, and operations, these long-horizon skills translate to fewer regressions and safer automation. Over time, such agents can manage multi-version upgrades, guide large refactors, and sustain projects with less human babysitting.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine building a big LEGO city over many weekends. You can't finish it in one sitting: you plan, add districts, fix mistakes, and keep the whole city's style consistent.
The Concept: Large Language Models (LLMs)
- What it is: LLMs are smart text tools that read and write language and can call tools like code editors or test runners.
- How it works:
- Read a prompt
- Think (reason) about it
- Possibly use tools (like editing files or running tests)
- Produce an answer
- Why it matters: Without tool use and reasoning, they can't solve real coding tasks. Anchor: When you ask an AI to fix a bug, it reads the issue, edits code, runs tests, and explains its steps.
Hook: You know how running a marathon is different from a quick sprint? You need pacing, a route plan, and to adjust if you cramp.
The Concept: Long-horizon agentic tasks
- What it is: Tasks that take many steps, with choices that affect much later outcomes.
- How it works:
- Set a long-term goal
- Break it into stages
- Act, check results, and adjust
- Keep everything aligned over time
- Why it matters: Without long-horizon skill, the AI forgets goals and tiny errors snowball. Anchor: Upgrading a library across a project takes many coordinated PRs, not one tiny edit.
Hook: Think of being a team captain. You assign positions, plan plays, and keep everyone in sync.
The Concept: Understanding of task management
- What it is: Knowing how to split big goals into smaller, ordered jobs.
- How it works:
- Identify subgoals
- Order them logically
- Track progress and adjust
- Why it matters: Without it, the AI tackles steps out of order and gets stuck. Anchor: Fix tests first, then refactor code, then add a new feature.
Hook: Imagine a school project that lasts months. You keep drafts, feedback, and versions.
The Concept: Project workflow context
- What it is: The shared history, files, branches, and discussions around a software change.
- How it works:
- Read issues and comments
- See code versions
- Understand how parts fit together
- Why it matters: Without context, the AI makes changes that clash with the project. Anchor: A PR description explains the why; the diff shows the what; tests verify it.
Hook: Like checking homework with a rubric every time you submit a draft.
The Concept: Continuous integration (CI) principles
- What it is: Automated checks (tests, builds, lint) that verify each change.
- How it works:
- Run tests on each commit/PR
- Catch failures early
- Keep main branch healthy
- Why it matters: Without CI, errors sneak in and pile up. Anchor: A failing CI test tells the AI exactly what broke after its edit.
Hook: Picture a coach guiding a season: plan, play, review, improve, repeat.
The Concept: Software project management
- What it is: Coordinating tasks, timelines, and quality across features.
- How it works:
- Plan milestones
- Review work
- Merge when ready
- Why it matters: Without management, features drift and systems break. Anchor: Issues, PR reviews, and milestone tracking keep the team aligned.
Hook: Think of a teacher's notes on your essay: what to fix and why.
The Concept: Feedback mechanisms in software development
- What it is: Signals like test results, code reviews, and user reports that guide changes.
- How it works:
- Submit a change
- Receive feedback (tests/review)
- Revise and resubmit
- Why it matters: Without feedback, the AI cannot refine its solution. Anchor: A review comment: "This breaks NumPy 1.14 parsing; please use the PEP 440 parser."
Hook: Imagine keeping a journal of every time you fixed your bike and what went wrong.
The Concept: Bug-fix tracking
- What it is: Recording what bugs happened and how they were fixed.
- How it works:
- File an issue
- Propose a fix in a PR
- Link follow-up PRs if more fixes are needed
- Why it matters: Without this trail, future fixes repeat old mistakes. Anchor: PR #21 mentions it fixes a regression introduced in PR #15.
The world before: LLMs showed strong short-term tool use, but struggled with long, multi-stage problems: they lost the plot, repeated errors, or over-edited. Datasets were either tiny and manual (too costly) or synthetic and shallow (missing real failure-and-refine patterns). People tried distilling trajectories from teacher models or using simulated environments, but these often taught surface behaviors and single-step tricks. The gap was clear: agents lacked explicit training on the cross-stage evolution that real projects live through.
The real stakes: In everyday tools (IDEs, data pipelines, cloud deployments), a smart assistant must plan over many steps, respect earlier choices, and fix its own mistakes. Without long-horizon training, assistants waste tokens, loop tools, and ship brittle changes. That's why this paper turns real PR chains, the living history of software, into lessons that teach steady, marathon-style problem solving.
02 Core Idea
Hook: You know how comic books tell one big story over many issues, with each issue building on the last? You learn characters, plot twists, and how earlier choices matter later.
The Concept: Chain-of-Pull Requests (PRs)
- What it is: A sequence of related PRs that evolve one feature or fix over time, each building on the last.
- How it works:
- Find PRs that reference or fix each other
- Order them by their real dependency (not just time)
- Treat the whole chain as one long task with stages
- Why it matters: Without chains, training data misses the cause-and-effect links that teach persistence. Anchor: PR #15 adds a feature; PR #21 fixes its bug; later PRs polish edge cases, all one storyline.
The "Aha!" moment in one sentence: Real PR chains naturally encode the three meta-skills long tasks need (break the work up, keep the goal consistent, and refine with verifiable feedback), so use them as training data.
Three analogies:
- Recipe series: Start with a base cake, then layers, then frosting; each step depends on the last.
- School project drafts: First draft, peer review, revisions; feedback drives refinement.
- City building: Roads first, then utilities, then housing; consistency across phases avoids chaos.
Hook: When cleaning your room, you don't do it all at once; you tackle toys, then books, then clothes.
The Concept: Progressive task decomposition
- What it is: Turning a big goal into a clear, ordered list of sub-tasks across PR stages.
- How it works:
- Identify the step's intent from the PR text
- Localize the relevant files/modules
- Do focused edits and tests before moving on
- Why it matters: Without it, the agent tries everything everywhere and wastes steps. Anchor: Stage 1 introduces a PEP 440 version parser; Stage 2 updates callers; Stage 3 updates tests.
Hook: Imagine keeping the same art style across a comic series so readers aren't confused.
The Concept: Long-term consistency enforcement
- What it is: Ensuring later changes still satisfy the original functional goal and don't break earlier wins.
- How it works:
- Carry forward prior edits to the next stage
- Check against the shared objective and tests
- Adjust new changes to fit the whole picture
- Why it matters: Without consistency, later fixes undo earlier progress. Anchor: A refactor must still pass the original failing test it aimed to fix.
Hook: Think of taste-testing soup and adjusting salt before serving.
The Concept: Verifiable refinement
- What it is: Improving a solution using checks (tests/reviews) that confirm it really got better.
- How it works:
- Run tests/CI or read reviews
- Find precise failure
- Patch and re-check until green
- Why it matters: Without verification, the agent "fixes" things that don't actually work. Anchor: A CI failure shows an infinite recursion; the agent switches to calling super() and passes.
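To make this loop concrete, here is a minimal Python sketch of a verify-then-refine cycle, assuming a project tested with pytest and a hypothetical agent.propose_patch interface; it illustrates the idea rather than the paper's actual scaffolds.

```python
import subprocess


def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine_until_green(agent, repo_dir: str, goal: str, max_rounds: int = 3) -> bool:
    """Act, verify with tests, and refine using the precise failure signal."""
    for _ in range(max_rounds):
        passed, log = run_tests(repo_dir)
        if passed:
            return True                                    # verified: the change really works
        agent.propose_patch(repo_dir, goal, feedback=log)  # hypothetical agent interface
    return run_tests(repo_dir)[0]
```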
Before vs After:
- Before: Data showed one-off fixes; models learned quick sprints but not marathon skills.
- After: PR chains deliver authentic, staged supervision, so models learn to plan, align, and refine over time.
Why it works (intuition): PR chains capture real dependency structure (cause → effect), external verification (tests/reviews), and authentic error patterns (regressions, hotfixes). Training on these sequences teaches the model to track state over time, carry goals forward, and correct itself, because the data itself requires these behaviors to succeed.
Building blocks:
- Data sourcing from mature, interactive repos
- Graph-building to link semantically dependent PRs
- Query construction that gives intent and context but hides exact edits
- Stage-by-stage rollouts with state carryover
- A strict evaluator (score ≥ 0.8) to filter noisy samples
- Supervised fine-tuning on the accepted long trajectories
03 Methodology
High-level flow: Input (real PR chains) → [Construct dependency chains and intent queries] → [Stage-by-stage rollouts with state carryover and feedback] → [Rejection sampling and packing] → Output (long-horizon training trajectories).
Step A: PR-chain construction
- What happens: Use GitHub metadata (commit messages, review links) to find PRs that fix/extend each other; build an ordered chain by semantic references, not just time.
- Why it exists: Time-ordered lists miss true dependencies; without accurate links, later stages wonāt align with earlier intent.
- Example: PR #21 says "Fix regression from #15," so the chain [#15 → #21] is formed even if other PRs appeared in between.
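To show the flavor of Step A, here is a minimal Python sketch that links PRs through explicit textual references; the PR records, the reference patterns, and the tie-breaking rule are illustrative assumptions, and the real pipeline also uses richer semantic links that a regex cannot capture.

```python
import re
from collections import defaultdict

# Hypothetical minimal PR records: number plus free-text title/body/review comments.
PRS = [
    {"number": 15, "text": "Add PEP 440-style version parsing for import checks."},
    {"number": 21, "text": "Fix regression introduced in #15 for dotted pre-releases."},
]

# Explicit textual references such as "fixes #15" or "regression introduced in #15".
REF = re.compile(r"(?:fix(?:es)?|regression (?:introduced )?in|follow[- ]up to)\s+#(\d+)", re.I)


def build_chains(prs):
    """Link PRs to the PRs they say they fix or extend, then walk each chain from its root."""
    by_number = {pr["number"]: pr for pr in prs}
    dependents = defaultdict(list)   # earlier PR -> later PRs that build on it
    has_parent = set()
    for pr in prs:
        for ref in (int(m) for m in REF.findall(pr["text"])):
            if ref in by_number:
                dependents[ref].append(pr["number"])
                has_parent.add(pr["number"])
    chains = []
    for root in (n for n in by_number if n not in has_parent):
        chain, frontier = [root], dependents[root]
        while frontier:              # follow dependency links, not timestamps
            nxt = min(frontier)      # simplistic tie-break, good enough for a sketch
            chain.append(nxt)
            frontier = dependents[nxt]
        chains.append(chain)
    return chains


print(build_chains(PRS))             # -> [[15, 21]]
```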
Step B: Query construction (q = f(x, p̃, R))
- What happens: For each PR in the chain, synthesize a conceptual sub-query from its natural language (issue/description/comments) and its patch, while intentionally hiding literal code details. Provide a global overview for the whole chain at the start.
- Why it exists: If the prompt gives exact edits, the agent just copies; hiding specifics forces navigation, localization, and reasoning.
- Example: "Adjust version parsing to handle dotted pre-releases per semantic rules; focus on the initialization logic and how version strings are interpreted in import checks." No function names are given.
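A rough sketch of intent-only query building, under stated assumptions: the masking regex, the prompt template, and the file name in the example are hypothetical, and the actual q = f(x, p̃, R) synthesis presumably uses an LLM rather than pattern matching.

```python
import re

# Crude stand-ins for "literal code details": backticked spans, call-like names, file paths.
CODE_HINT = re.compile(r"`[^`]+`|\b\w+\(\)|[\w./-]+\.(?:py|pyx|c|h)\b")


def build_stage_query(chain_overview: str, pr_description: str) -> str:
    """Compose an intent-only sub-query: keep the 'why', hide the exact 'where/what'."""
    masked = CODE_HINT.sub("[code reference omitted]", pr_description)
    return (
        f"Overall goal of this PR chain:\n{chain_overview}\n\n"
        f"Current stage intent (exact edits withheld):\n{masked}\n\n"
        "Locate the relevant modules yourself, make focused edits, and run the tests."
    )


print(build_stage_query(
    "Make version handling follow PEP 440 consistently across the package.",
    "Adjust parsing in `_distributor_init.py` so dotted pre-releases like 1.2.3.dev0 parse correctly.",
))
```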
Step C: Rollout environment with state carryover
- What happens: Execute stages sequentially. The code edits from stage t-1 are applied to the base of stage t (S_init^(t) = B_t ⊕ Δτ_{t-1}, i.e., the stage-t base with the previous stage's accumulated edits applied). The agent must live with its prior choices.
- Why it exists: Without carryover, the agent never learns to manage long-term consequences and consistency.
- Example: If Stage 1 changes the version parser, Stage 2 starts with that change in place when updating downstream callers.
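A minimal sketch of state carryover using plain git commands; the function name and calling convention are assumptions, but the effect matches the description above: stage t starts from its base commit with the agent's accumulated diff from stage t-1 applied on top.

```python
import subprocess
from typing import Optional


def prepare_stage(repo_dir: str, base_commit: str, prior_diff: Optional[str]) -> None:
    """Initialize stage t: check out its base commit, then replay the agent's
    accumulated edits from stage t-1 so it must live with its earlier choices."""
    subprocess.run(["git", "checkout", "--force", base_commit], cwd=repo_dir, check=True)
    if prior_diff:
        # Feed the unified diff from the previous stage to `git apply` via stdin.
        subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                       input=prior_diff, text=True, check=True)
```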
Step D: Tool-rich scaffolds and logging
- What happens: Use two scaffolds (SII-CLI and mini-swe-agent) with plentiful tool calls (edit, grep, run tests). Log the full trajectory of observations, thoughts, and tool actions (often 100+ per sample).
- Why it exists: Long-horizon skills emerge when the agent repeatedly acts, checks, and adjusts; without tools, thereās nothing to practice.
- Example: The agent runs pytest, sees a failing test about ManyToMany inlines, searches files, edits a method, re-runs tests.
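A small sketch of what trajectory logging could look like; the step schema (observation, thought, tool, arguments) and the JSONL format are assumptions, not the actual SII-CLI or mini-swe-agent log layout.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TrajectoryStep:
    """One observation -> thought -> tool-call step of a rollout (assumed schema)."""
    observation: str
    thought: str
    tool: str
    tool_args: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def append_step(log_path: str, step: TrajectoryStep) -> None:
    """Append the step as one JSON line; long rollouts routinely accumulate 100+ of these."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(step)) + "\n")


append_step("rollout.jsonl", TrajectoryStep(
    observation="1 failed: test_many_to_many_inline (illustrative pytest output)",
    thought="The inline path still uses the old parser; edit the method and re-run.",
    tool="run_shell",
    tool_args={"cmd": "python -m pytest -q"},
))
```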
Step E: Evaluator and rejection sampling
- What happens: GLM-4.6 judges functional equivalence between the agent's patch and the ground-truth PR patch, giving a score s. Only samples with s ≥ 0.8 are kept; up to three refinement tries are allowed.
- Why it exists: Unfiltered self-generated data is noisy and can un-teach good behaviors; strict filtering preserves correctness signals.
- Example: Without the infinite-recursion fix, the score is low; after switching to a super() call, the score clears the threshold.
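A compact sketch of the rejection-sampling loop; judge_equivalence stands in for the GLM-4.6 judge and agent.rollout / agent.refine are hypothetical interfaces, while the 0.8 threshold and the up-to-three refinement attempts follow the description above.

```python
ACCEPT_THRESHOLD = 0.8   # strict filtering rule from the text
MAX_REFINEMENTS = 3      # up to three refinement tries


def rejection_sample(agent, judge_equivalence, task, gold_patch):
    """Keep a trajectory only if the judged functional-equivalence score clears the bar."""
    trajectory = agent.rollout(task)                        # hypothetical agent interface
    for attempt in range(MAX_REFINEMENTS + 1):              # initial try + refinements
        score, critique = judge_equivalence(trajectory.final_patch, gold_patch)
        if score >= ACCEPT_THRESHOLD:
            return trajectory                               # accepted training sample
        if attempt < MAX_REFINEMENTS:
            trajectory = agent.refine(task, trajectory, critique)
    return None                                             # rejected: too noisy to learn from
```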
Step F: Training data assembly and SFT
- What happens: Accepted trajectories are packed into training sets averaging ~85k tokens and ~117 tool calls. Models are fine-tuned with consistent hyperparameters.
- Why it exists: Long sequences with clear stage goals teach decomposition, consistency, and refinement in one go.
- Example data stats: Token max 3.14M, tool call max 1165; nine curated repos (e.g., numpy, scipy, pulsar).
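A sketch of how accepted trajectories might be packed into SFT records and summarized; the message schema and the tokenize hook are assumptions, shown only to indicate where averages like ~85k tokens and ~117 tool calls per sample would be computed.

```python
import json


def pack_for_sft(accepted, tokenize, out_path: str) -> dict:
    """Write accepted long-horizon trajectories as SFT records and summarize their scale."""
    token_counts, tool_counts = [], []
    with open(out_path, "w", encoding="utf-8") as f:
        for traj in accepted:                         # each traj: {"messages": [...]}
            text = "".join(m["content"] for m in traj["messages"])
            token_counts.append(len(tokenize(text)))
            tool_counts.append(sum(m["role"] == "tool_call" for m in traj["messages"]))
            f.write(json.dumps({"messages": traj["messages"]}) + "\n")
    n = max(len(token_counts), 1)
    return {
        "samples": len(token_counts),
        "avg_tokens": sum(token_counts) / n,          # where the ~85k average would show up
        "avg_tool_calls": sum(tool_counts) / n,       # and the ~117 tool calls per sample
    }
```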
The Secret Sauce:
- Real evolutionary structure: Authentic cause → effect links across PRs encode the meta-skills naturally.
- State carryover: Forces long-term consistency under the agentās own edits.
- Strict evaluator: Prevents drift from low-quality self-data, enabling effective on-policy self-distillation.
- Intent-only prompts: Push the model to navigate and reason, instead of copy-paste edits.
04 Experiments & Results
The test: Measure whether models learn long-horizon agency (planning over stages, keeping goals aligned, and refining with feedback) without wasting tokens or tool calls.
What we compared: Models fine-tuned on daVinci-Agency (239 samples) versus models trained on big agent datasets (e.g., SWE-Smith 66k, CodeAgent ~60k, CC-Bench 2.6k) and strong baselines (GLM-4.6, Kimi-K2, DeepSeek, Qwen variants).
Scoreboard with context:
- GLM-4.6 base vs +daVinci-Agency (239 samples):
  - SWE-bench: 0.608 → 0.632 (a solid bump on a hard software benchmark)
  - Toolathlon: 0.157 → 0.231 (~47% relative gain; like going from a B- to a strong B+/A-)
  - τ-Bench-retail/airline and DS-1000/SciCode: steady gains or maintained robustness
  - Overall average: 0.441 → 0.475 (a clear multi-benchmark lift)
- Against huge datasets:
  - Beats SWE-Smith (66k samples) on many metrics despite being ~275× smaller.
  - The 0.231 on Toolathlon is over 148% better than some baselines trained on far more data (e.g., 0.093); both percentages are recomputed in the quick check after this list.
- Cross-model transfer:
  - Qwen3-30B-A3B: 0.295 → 0.307 overall
  - Qwen3-32B: 0.280 → 0.292 overall
  - Even the small Qwen3-8B nudges upward on coding tasks
  - On AgencyBench Code, GLM-4.6-daVinci-Agency scores 15.9 vs 11–12 for strong peers.
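A quick arithmetic check of the relative gains quoted above, recomputed from the reported scores (no new results):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement as a percentage of the starting score."""
    return (after - before) / before * 100


print(round(relative_gain(0.157, 0.231)))   # 47  -> the ~47% Toolathlon gain for GLM-4.6
print(round(relative_gain(0.093, 0.231)))   # 148 -> the "over 148% better" comparison
```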
Surprising findings:
- Data efficiency: Only 239 samples with long, structured supervision outperformed datasets tens of thousands in size.
- Efficiency per token: On SWE-bench, average token use drops by 113.6k (GLM-4.6) to 288.8k (Qwen3-32B), and tool calls fall by up to ~26%: fewer steps, better results.
- Scaling laws: Longer training horizons (completing more PRs per chain) and larger inference-time budgets both widen the performance gap in favor of daVinci-Agency, showing that the method thrives when you let it plan further.
- Rejection sampling is critical: Removing it tanks performance (average ~0.205), proving quality filtering is necessary for self-distillation to help rather than harm.
Interpretation: Models trained on authentic PR evolution internalize the three meta-skills, which makes their plans tighter, their edits more stable across stages, and their fixes verifiably correct, so they can win long races without running in circles.
05 Discussion & Limitations
Limitations:
- PR-source dependence: Works best where rich PR discussions, tests, and reviews exist; sparse or low-quality repos provide weaker signals.
- Chain length cap: Current success rates limit chains to about five PRs; even longer arcs likely help but are harder to complete reliably.
- On-policy bias: Using the same family (GLM-4.6) for rollout and training can imprint its habits; strong filtering mitigates but doesn't remove this risk.
- Domain coverage: Focused on code-heavy, well-tested projects (e.g., NumPy/Scipy/Pulsar); domains without testable signals are tougher.
- Compute/context: Long sequences (85k tokens average) need memory and careful batching.
Required resources:
- Access to GitHub PR metadata and repo history; stable scaffolds (SII-CLI, mini-swe-agent); evaluator model; GPUs with large context support; storage for long logs.
When not to use:
- Tiny one-shot tasks where long-horizon structure doesn't exist.
- Repos with few tests or reviews (weak verification signals).
- Closed-source environments where diffs and history aren't accessible.
- Ultra-latency-sensitive settings where long context is infeasible.
Open questions:
- How far can chain length scale before returns diminish, or do they keep compounding?
- Can we generalize the paradigm to non-code domains with weaker ground truth (e.g., design docs, robotics logs)?
- What is the best evaluator mix (LLM+static analysis+unit fuzzing) to strengthen acceptance without false rejections?
- How to auto-discover deeper semantic links across distant PRs, not only explicit references?
- Can curriculum strategies (easy → hard chains) accelerate skill acquisition further?
06 Conclusion & Future Work
Three-sentence summary: This paper turns real GitHub PR chains into long, verifiable training lessons that teach AIs to plan in stages, keep goals aligned, and fix mistakes. With only 239 long, high-quality samples and a strict evaluator, models like GLM-4.6 gain big advantages on long-horizon benchmarks while using fewer tokens and tool calls. The method scales with longer chains and bigger test-time budgets, revealing a clear path to stronger agency.
Main achievement: Showing that modeling real software evolutionārather than synthetic one-offsāunlocks the three meta-skills of long-horizon agency (decomposition, consistency, refinement) in a data-efficient, verifiable way.
Future directions:
- Extend chains beyond five PRs and improve success rates in multi-stage rollouts.
- Hybrid evaluators combining LLM judgment, unit tests, and static analysis.
- Broaden domains (data engineering, docs, infra) where verification signals are available.
- Build curricula that grow horizon length and dependency complexity over time.
Why remember this: It reframes training data as living stories of change, not snapshots, teaching AIs to run marathons rather than sprints. By mining the world's real evolution trails (PRs), we get scalable, checkable supervision that reliably instills long-term, goal-directed behavior.
Practical Applications
- Automated multi-PR feature rollouts that keep tests green across stages.
- Guided large-scale refactors (e.g., API migration) with persistent state and verification.
- Continuous bug triage and fix cycles that link root causes to follow-up patches.
- CI co-pilots that propose patches, read failures, and refine until passing.
- Repository onboarding agents that localize where to implement a change and plan staged edits.
- Version upgrade assistants that adjust parsers and all downstream callers over multiple PRs.
- Code review aides that predict likely regressions by reading PR evolution patterns.
- Test gap fillers that learn from failures and propose targeted new tests.
- Tool-use tutors that reduce redundant shell/edit loops for efficient problem solving.
- Long-horizon planning bots for data pipelines or infra-as-code with drift control.