daVinci-Dev: Agent-native Mid-training for Software Engineering
Key Summary
- This paper teaches code AIs to work more like real software engineers by training them in the middle of their learning using real development workflows.
- The key idea is agent-native mid-training: feeding the model large examples that keep the whole story of how code was found, edited, tested, and fixed.
- They build two kinds of training journeys (trajectories): contextually-native (full PR context and edits) and environmentally-native (real tool runs and test feedback).
- By learning from both the "what" (files and edits) and the "how" (test failures and retries), the model becomes better at multi-step problem solving.
- On the tough SWE-Bench Verified benchmark, their 32B and 72B models reach 56.1% and 58.5% solved, the best among open recipes of similar sizes.
- They achieve these results with less than half the mid-training tokens of prior open methods like Kimi-Dev (~73.1B vs ~150B).
- The training also improves general coding tasks (like HumanEval) and even science reasoning benchmarks.
- This approach is scalable because GitHub has lots of Pull Requests for breadth, and tools like Dockerized tests add depth and authenticity.
- They release data-building code, recipes, and many checkpoints to help the community explore agentic mid-training.
- Bottom line: teaching AIs the full development loop (look, change, test, fix) makes them stronger, more reliable code agents.
Why This Research Matters
Software touches nearly everything we use: phones, cars, hospitals, and classrooms. Teaching AI to fix code the way real engineers do, by searching, editing, testing, and revising, means fewer crashes and faster bug fixes in everyday apps. Because this method is token-efficient and scales with public PRs and automated tests, more teams (not just those with huge budgets) can build strong code agents. It also strengthens general reasoning, so these models can help with scientific computing and data tasks beyond pure coding. Safer, faster software updates can improve cybersecurity, reduce downtime, and cut maintenance costs. Sharing data-building tools and checkpoints helps the whole community move forward together.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine learning to fix a bike. If someone only shows you a photo of a bike already fixed, you won't learn how to find the loose bolt, which wrench to try, or how to test the wheels after tightening. You need the whole process, not just the final picture.
The Concept (Large Language Models, or LLMs):
- What it is: An LLM is a very big computer program that learns to read and write by seeing lots of text.
- How it works: 1) Read mountains of text; 2) Learn patterns in words and code; 3) Predict the next token; 4) Get better at following instructions with extra training.
- Why it matters: Without a strong base, the model can't understand instructions, code, or long problems well. Anchor: When you ask an LLM to fix a buggy function, it uses its language understanding to read the code and your request before suggesting changes.
Hook: You know how real programmers don't fix bugs in one step? They search files, read code, try an edit, run tests, see failures, and try again.
The Concept (Agentic Software Engineering):
- What it is: It's when an AI acts like a developer: navigating a codebase, making edits, and running tests in multiple steps.
- How it works: 1) Localize (find the right files); 2) Read (understand the code); 3) Edit (apply a patch); 4) Test (run unit tests); 5) Revise (fix based on failures); repeat (sketched in code after this list).
- Why it matters: Without this loop, the AI guesses in one shot and misses hidden issues across files. Anchor: To fix a failing date parser, an agent searches for "parse_date", opens the file, changes the function, runs tests, sees a failure for empty strings, and then adds a guard to pass.
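To make this loop concrete, here is a minimal Python sketch of the localize-read-edit-test-revise cycle. It is only an illustration: the helper functions (localize, read_file, propose_patch, apply_patch, run_tests) are hypothetical stand-ins for whatever tools a real agent scaffold provides, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    passed: bool   # did all tests pass?
    output: str    # raw test report (failures, stack traces)

def agent_loop(issue: str,
               localize: Callable[[str], list[str]],
               read_file: Callable[[str], str],
               propose_patch: Callable[[str, dict, str], str],
               apply_patch: Callable[[str], None],
               run_tests: Callable[[], TestResult],
               max_rounds: int = 5) -> bool:
    """Localize -> read -> edit -> test -> revise until the tests pass."""
    files = localize(issue)                              # 1) localize candidate files
    context = {f: read_file(f) for f in files}           # 2) read them
    feedback = ""
    for _ in range(max_rounds):
        patch = propose_patch(issue, context, feedback)  # 3) edit: draft a patch
        apply_patch(patch)
        result = run_tests()                             # 4) test
        if result.passed:
            return True
        feedback = result.output                         # 5) revise using the failures
        context = {f: read_file(f) for f in files}       # re-read the edited files
    return False
```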
Hook: Think of studying only perfect homework solutions: no scratch work, no mistakes, no teacher notes. Would you learn the real thinking process?
The Concept (Distribution Mismatch):
- What it is: The training data the model sees (static code snapshots) doesn't match the interactive world it faces during real debugging (actions and feedback).
- How it works: 1) Training often shows final files and merged commits; 2) Real life requires step-by-step actions and tool outputs; 3) The gap confuses the model when deployed.
- Why it matters: Without matching practice, the model doesn't learn how to react to errors or tests. Anchor: A model trained on finished Pull Requests might know the final patch, but not how to find the right file or fix tests when they first fail.
Hook: Imagine a recipe book that only shows final meal photos. You'd wish it listed the ingredients, steps, and even what to do if the sauce splits.
The Concept (Pull Requests, or PRs):
- What it is: A PR is a developer's proposal to change code, including descriptions, changed files, and commit history.
- How it works: 1) Describe the problem; 2) Edit files in commits; 3) Get reviews; 4) Merge if approved.
- Why it matters: PRs are rich stories of how code was changed and why, which makes them great for teaching an agent the process, not just the result. Anchor: A PR titled "Fix header parsing for non-ASCII" includes the issue, the files edited, and the commits that gradually remove an incorrect strip() call.
Hook: When students learn, they often first get general knowledge, then special practice. AIs are similar.
The Concept (Post-training: SFT and RL):
- What it is: After base training, models get extra practice: SFT (teacher-corrected examples) and RL (learning from rewards like passing tests).
- How it works: 1) SFT shows right step-by-step answers; 2) RL lets the model try, get feedback, and improve policies.
- Why it matters: This sharpens skills, but depends on how good the base model already is and how much diverse data exists. Anchor: After learning general code, SFT shows "how to fix this bug" transcripts; RL rewards the model when unit tests pass.
The world before: Code models were pretty good at writing single functions from prompts. But real software tasks happen across big repositories. Most training data were static files or final diffs, which is great for "what code looks like," not "how to change and validate it."
The problem: When these models are deployed as agents, they must search, read, edit, run, and revise. Training didn't expose them to this step-by-step life, causing the distribution mismatch. Also, collecting large amounts of high-quality interactive data is costly, so post-training alone hits limits.
Failed attempts: Some tried factorized training (teach "find file" and "edit code" separately) or used simulated environments. But this breaks the natural dependency between steps and lacks authentic feedback (like real test failures), so models don't learn the full loop.
The gap: Models need mid-training on data that preserves the complete agent workflow and real feedback, earlier in the pipeline, so they enter post-training already thinking like agents.
Real stakes: Better code agents mean faster bug fixes, safer updates, and fewer frustrating software glitches in apps, websites, and devices we use daily.
02 Core Idea
Hook: You know how practicing a sport in real games (with a scoreboard and a referee) makes you better than just shooting hoops alone?
The Concept (Agentic Mid-training):
- What it is: A big, middle part of training where the model learns from data that look and feel like real software engineering: actions plus feedback.
- How it works: 1) Build giant datasets that keep the full story (issue → files → edits → tests); 2) Mix two kinds of journeys: context-rich PR workflows and real execution rollouts; 3) Train the base model on these so it internalizes the agent loop; 4) Later, do SFT/RL more efficiently.
- Why it matters: Without this, post-training must teach both basics and strategies at once, which is costly and brittle. Anchor: After mid-training, when the agent sees a failing test, it already "knows" to re-open files, search again, and try a smaller patch.
Aha! in one sentence: Teach the model the whole development dance (look, change, test, and fix) during mid-training using agent-native data, so post-training has a flying start.
Multiple analogies:
- Cooking: Don't just show the final cake; show the batter mixing, oven timing, and what to do if it sinks.
- Maps: Don't only give the destination; record the turns, detours, and traffic alerts the driver handled.
- Music: Don't just play the final song; include practice clips, missed notes, and how the musician corrected rhythm.
Hook: Imagine two types of practice: reading whole recipes and actually cooking with a hot stove.
The Concept (Agent-native Data):
- What it is: Training examples that preserve the agent's full info flow and real environment dynamics.
- How it works: 1) Contextually-native: PR-based workflows that unify localization and edits; 2) Environmentally-native: real tool calls, test runs, and error messages from executable repos; 3) Combine for breadth and depth.
- Why it matters: Without both, models either miss real feedback (depth) or lack variety and coverage (breadth). Anchor: A PR shows which files changed and why; a rollout shows that pytest failed with a TypeError after the first patch.
Hook: Think of a detective story where you follow all clues, then watch the suspect actually react when confronted.
The Concept (Contextually-native Trajectories):
- What it is: PR-derived samples that bundle issues, relevant files (at base), and the commit-by-commit edit path.
- How it works: 1) Collect PR metadata and issues; 2) Reconstruct base file contents and commits in order; 3) Organize into sections (context, files, edits) to mirror localize → read → edit (see the sketch after this list).
- Why it matters: If you split retrieval and editing into separate lessons, the agent loses the cause-and-effect chain. Anchor: For a bug "empty string crashes parse_date," the sample includes the base utils/date.py, then each commit's patch refining the fix.
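One way to picture such a sample is as a structured record: the issue, the base contents of the relevant files, and the ordered edit path. The field names and example contents below are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class CommitEdit:
    message: str  # commit summary (regenerated when the original is too terse)
    diff: str     # the patch applied at this step

@dataclass
class ContextualTrajectory:
    """Illustrative shape of a PR-derived, contextually-native sample."""
    issue: str                   # problem description or linked issue text
    base_files: dict[str, str]   # path -> file content at the PR's base commit
    edits: list[CommitEdit] = field(default_factory=list)  # ordered commit path

# Example instance mirroring the parse_date anchor above (contents invented)
sample = ContextualTrajectory(
    issue="empty string crashes parse_date",
    base_files={"utils/date.py": "def parse_date(s):\n    return s.split('-')\n"},
    edits=[CommitEdit(
        message="guard against empty input",
        diff="+    if not s:\n+        raise ValueError('empty date string')",
    )],
)
```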
Hook: Now turn on the oven: feel the heat, watch the timer, and taste-test.
The Concept (Environmentally-native Trajectories):
- What it is: Real action-observation loops from running an agent inside Dockerized repositories with actual tests.
- How it works: 1) Build executable envs from real PRs; 2) Let an agent act (search, view, patch, run tests); 3) Record outputs (errors, failures, passes); 4) Keep both passing and failing runs.
- Why it matters: Without genuine tool outputs, the model never learns how edits change test results. Anchor: The agent applies a patch, pytest shows one test still failing, the agent revises the code, and the next run passes.
Before vs After:
- Before: Models saw static code or isolated subtasks; at deployment they struggled to coordinate steps.
- After: Models pre-learn full workflows and feedback handling, so SFT/RL polish skills instead of building them from scratch.
Why it works (intuition):
- The model's training distribution matches the deployment world: sequences of actions and observations.
- PR data teaches where and how edits happen; environment rollouts teach how feedback drives revisions.
- Combining breadth (many workflows) with depth (authentic feedback) yields stronger, token-efficient learning.
Building blocks:
- Data breadth: D_ctx_gen (many languages) + D_ctx_py (Python focus for benchmarks).
- Feedback depth: D_env (real tool/test traces), with extra weight on passing rollouts.
- Mid-training schedule: stage-wise mixing of these datasets to shift the base model's capabilities toward agentic reasoning.
- Post-training: SFT on curated trajectories to align with the desired scaffold and task formats.
03 Methodology
High-level recipe: Input → Build two datasets (contextually-native PRs; environmentally-native rollouts) → Mid-train the base model on both → Post-train with SFT on strong trajectories → Output: a model that already "thinks" like a code agent.
Step A: Contextually-native trajectories (PR workflows)
- What happens: The team mines GitHub Pull Requests to reconstruct full developer workflows. For each PR, they gather the repository context, issues (if any), base versions of changed files, and the ordered commit sequence with diffs. They also generate short, clearer summaries for PRs and commits when messages are too terse.
- Why this step exists: If you only show the final diff, the model won't learn how localization and editing connect. Keeping files and edits together teaches the "localize → read → edit" chain.
- Example: Issue: "strip() wrongly removes non-ASCII headers." Relevant file: src/waitress/task.py at base. Edits: A commit removes value.strip() and adjusts logic. The sample shows the file content before, then the edit sequence.
Data construction details
- Two subsets: D_ctx_gen (26.7B tokens) spans top-starred repos across languages (broad coverage); D_ctx_py (41.9B) focuses on Python (aligned to SWE-Bench Verified).
- Collection: GitHub REST and GraphQL APIs fetch PR metadata, linked issues, base file contents, and commit diffs. Relevant files come from the symmetric diff set. They align files to the parent of the first commit for accuracy.
- Filtering: Keep merged PRs; drop bot PRs; for the Python subset, keep PRs with 1-5 changed Python files. Discard samples over 32k tokens. Decontaminate SWE-Bench Verified repos to avoid leakage (see the sketch after this list).
- Formatting: General PRs use XML-like tags and include reviews; Python PRs use a Markdown layout that matches agent edit actions (search-and-replace format), mirroring agent steps with interleaved reasoning.
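A minimal sketch of how those Python-subset filters might look in code. The pr dictionary fields, the token counter, and the decontamination list are assumptions for illustration, not the released pipeline.

```python
# Hedged sketch of the filtering rules above; `pr` is assumed to be a dict
# assembled from GitHub API responses, and `count_tokens` is any tokenizer.

SWE_BENCH_VERIFIED_REPOS = {"django/django", "sympy/sympy"}  # illustrative subset
MAX_SAMPLE_TOKENS = 32_000

def keep_python_pr(pr: dict, sample_text: str, count_tokens) -> bool:
    if not pr["merged"]:
        return False                                   # keep merged PRs only
    if pr["author_is_bot"]:
        return False                                   # drop bot-authored PRs
    py_files = [f for f in pr["changed_files"] if f.endswith(".py")]
    if not 1 <= len(py_files) <= 5:
        return False                                   # 1-5 changed Python files
    if count_tokens(sample_text) > MAX_SAMPLE_TOKENS:
        return False                                   # discard overlong samples
    if pr["repo"] in SWE_BENCH_VERIFIED_REPOS:
        return False                                   # decontaminate benchmark repos
    return True
```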
Step B: Environmentally-native trajectories (executable feedback)
- What happens: They build Docker environments from real PRs (as in SWE-REBENCH), including unit tests and tools. Then they run a capable agent (GLM-4.5/4.6 inside SWE-AGENT) to interact: search files, open code, apply patches, run tests, and read errors.
- Why this step exists: Authentic execution feedback (test failures, stack traces) can't be faked well. It teaches the agent how to react after each edit.
- Example: A rollout shows grep hits for parse_date, file views, apply_patch v1, pytest failing with AssertionError, then apply_patch v2, and finally all tests passing (see the recording sketch after this list).
- Dataset: D_env totals ~3.1B raw tokens (~4.5B effective), including both passing (0.7B) and non-passing trajectories. They keep both to expose real debugging.
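The essential shape of such a rollout is a log of action-observation pairs. The sketch below records one by running a fixed list of shell commands and capturing their output; in the paper's setup the commands would come from an agent policy acting inside a Docker environment, so treat the helper and the example script as illustrative assumptions.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Step:
    action: list[str]   # command issued by the agent (illustrative)
    observation: str    # raw tool output: grep hits, file text, pytest report

def record_rollout(commands: list[list[str]], cwd: str) -> list[Step]:
    """Log action-observation pairs for a scripted stand-in 'agent'."""
    steps = []
    for cmd in commands:
        out = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        steps.append(Step(action=cmd, observation=out.stdout + out.stderr))
    return steps

# Example script: search, read, test (run inside a prepared repo checkout)
script = [
    ["grep", "-rn", "parse_date", "src"],
    ["cat", "src/utils/date.py"],
    ["python", "-m", "pytest", "-q"],
]
# steps = record_rollout(script, cwd="/path/to/repo")  # run inside the environment
```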
Hook: You know how a coach reviews both your winning and losing plays to help you improve?
The Concept (Supervised Fine-tuning, or SFT):
- What it is: After mid-training, SFT polishes the model on high-quality demonstration trajectories to match the agent scaffold's format.
- How it works: 1) Feed curated step-by-step traces (e.g., SWE-Smith, D_env_pass); 2) Mask losses to learn only from desired parts; 3) Train for a few epochs.
- Why it matters: Without SFT, the model might be capable but not aligned to the exact tools and response formats used at deployment. Anchor: The model learns to call "apply_patch", "run_tests", and format thoughts exactly as SWE-AGENT expects.
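The "mask losses" step in item 2 above usually means that only the demonstrated agent responses contribute to the training loss. A minimal sketch, assuming the common PyTorch convention of -100 as the ignored label; the exact masking rules used in the paper are not spelled out here.

```python
import torch

IGNORE_INDEX = -100  # conventional "ignore" label for torch cross-entropy

def build_sft_labels(token_ids: list[int], is_response: list[bool]) -> torch.Tensor:
    """Copy token ids for the demonstrated response spans; mask everything else.

    `is_response[i]` marks tokens from the desired agent turns (thoughts, tool
    calls); prompt and tool-output tokens are masked and contribute no loss.
    """
    labels = [tid if keep else IGNORE_INDEX
              for tid, keep in zip(token_ids, is_response)]
    return torch.tensor(labels)

# Example: a 6-token sequence where only the last 3 tokens are the agent's reply
ids = [101, 2023, 2003, 7000, 7001, 7002]
mask = [False, False, False, True, True, True]
print(build_sft_labels(ids, mask))  # tensor([-100, -100, -100, 7000, 7001, 7002])
```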
Training pipeline details
- Base models: Qwen2.5-32B-Base and Qwen2.5-72B-Base.
- Mid-training (MT):
- Stage 1: D_ctx_gen to build broad software engineering priors.
- Stage 2: D_ctx_py (+ D_env when used) to specialize in Python and agent feedback.
- Hyperparameters: large global batch, warmup then cosine decay; no loss mask.
- Post-training (SFT):
- Datasets: public SWE-Smith and/or D_env_pass (passing rollouts), with D_env_pass upsampled 3× during MT and used again in SFT to activate capabilities.
- Hyperparameters: smaller batch, warmup then cosine decay; standard loss masks.
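Putting the pipeline together, an illustrative configuration might look like the sketch below. The dataset names and staging follow the text, but the mixing ratios, epoch count, and every other number are placeholders, not the paper's hyperparameters.

```python
# Illustrative recipe layout; all numeric values are placeholders.
RECIPE = {
    "mid_training": [
        {"stage": 1, "datasets": {"D_ctx_gen": 1.0},               # broad SWE priors
         "lr_schedule": "warmup + cosine decay", "loss_mask": False},
        {"stage": 2, "datasets": {"D_ctx_py": 0.9, "D_env": 0.1},  # Python + feedback
         "notes": "D_env_pass upsampled 3x within D_env",
         "lr_schedule": "warmup + cosine decay", "loss_mask": False},
    ],
    "sft": {
        "datasets": ["SWE-Smith", "D_env_pass"],
        "lr_schedule": "warmup + cosine decay",
        "loss_mask": True,   # learn only from desired response spans
        "epochs": 3,         # placeholder
    },
}
```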
Secret sauce
- Keep the whole story: Bundling base file contents with sequential edits mirrors how agents actually work.
- Learn from real friction: Dockerized tests and genuine tool outputs provide the "heat of the kitchen" the model must handle.
- Mix breadth and depth: PR breadth builds strong priors; environment depth teaches reactive strategies.
- Token efficiency: Because the data match deployment, the model learns faster per token than with synthetic or factorized data.
Concrete mini-walkthrough
- Input: Bug report says parse_date crashes on empty strings.
- Step A (PR-style learning): See date.py at base; a commit adds a guard; another commit tweaks error handling.
- Step B (Env-style learning): The agent applies its first patch; pytest shows one failing case; it narrows the fix and tries again; pass.
- Output: A model that, when deployed, naturally follows the "search → read → edit → test → revise" loop without being micromanaged.
04 Experiments & Results
Hook: If two teams play the same game on the same field with the same rules, you can fairly compare who plays better.
The Concept (Benchmark: SWE-Bench Verified):
- What it is: A standardized set of real GitHub issues turned into executable tasks with tests, used to measure if agents can fix them.
- How it works: 1) Each task has a repo snapshot and tests; 2) The agent gets a description and must make code changes; 3) Success = tests pass with the agentās changes.
- Why it matters: Without a common, executable yardstick, we can't compare methods reliably. Anchor: The model gets an issue like "date parsing fails on empty string," applies changes, runs tests, and earns a "pass" only if all tests succeed.
The test
- They evaluate with the SWE-AGENT scaffold (a standardized agent wrapper) using large context (128k) and up to 100 steps, reporting Pass@1 (solve rate). Multiple runs are averaged for stability. Some benchmark glitches were patched for fairness.
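Pass@1 here is simply the fraction of tasks resolved in a single attempt, averaged across the repeated evaluation runs. A minimal sketch of that computation, with invented example data:

```python
def pass_at_1(runs: list[list[bool]]) -> float:
    """Average resolve rate over evaluation runs.

    runs[r][i] is True if task i was resolved (its tests passed) in run r.
    """
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)

# Invented example: 3 evaluation runs over the same 4 tasks
runs = [
    [True, False, True, True],
    [True, False, False, True],
    [True, False, True, True],
]
print(f"Pass@1 = {pass_at_1(runs):.1%}")  # Pass@1 = 66.7%
```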
The competition
- Baselines include the prior open mid-training recipe Kimi-Dev and multiple SFT-only or RL-tuned systems. Comparisons are done at similar model sizes (32B and 72B) and with agentic scaffolds.
The scoreboard (with context)
- 32B scale:
- Strong SFT baseline (no MT): ~53.0%.
- daVinci-Dev-32B (with D_ctx + D_env MT, then SFT): 56.1%.
- This is state-of-the-art among open recipes at this size using agent scaffolds, despite starting from a non-coder base model.
- 72B scale:
- Kimi-Dev (MT + RL) reports 48.6% under SWE-AGENT.
- Strong SFT baseline (no MT): ~56.6%.
- Ours with D_ctx MT only + strong SFT: 58.2%.
- Full daVinci-Dev-72B (D_ctx + D_env MT + strong SFT): 58.5%, beating Kimi-Dev while using less than half the MT tokens (~73.1B vs ~150B).
- Translation of numbers: 58.5% is like scoring an A when strong baselines hover around high B to low A-, and a prior well-known method sits in the C+/B- range under the same scaffold.
Surprising and informative findings
- Mid-training helps even when SFT is already strong: For the 72B model, D_ctx alone boosts weak SFT from 38.0% to 46.4%, and strong SFT beyond 58%.
- Authentic rollouts matter: Adding D_env to MT nudges performance further (58.2% → 58.5%), showing value from environment dynamics.
- Zero-shot signal: With D_env alone, the 72B model reached ~47% zero-shot; mixing in D_ctx_py raised it to ~55%, proving PR grounding supplies essential knowledge breadth.
- Generalization beyond SWE: On HumanEval and EvalPlus, both 32B and 72B models jump significantly (e.g., +12 to +23 points on HumanEval), and even science reasoning tasks (GPQA, SciBench) see gains, suggesting the agentic loop trains broader reasoning.
Efficiency and scaling
- Token efficiency: Outperforms a prior ~150B-token recipe using only ~73.1B MT tokens by matching the training distribution to deployment.
- Scaling law hint: As MT steps increase on a D_ctx_py + D_env mix, Pass@1 rises roughly log-linearly, indicating headroom for more gains with more data/compute.
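"Roughly log-linear" means Pass@1 grows by about a constant amount each time the mid-training budget doubles, i.e. Pass@1 ≈ a + b·ln(steps). The sketch below fits that form to invented data points purely to show the shape of the claim; it does not reproduce the paper's measurements.

```python
import math

# Invented (steps, Pass@1 %) points that double the budget each time;
# they illustrate the log-linear shape only, not the paper's numbers.
points = [(5, 30.0), (10, 33.0), (20, 36.0), (40, 39.0)]

# Simple least-squares fit of Pass@1 = a + b * ln(steps)
xs = [math.log(s) for s, _ in points]
ys = [p for _, p in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
print(f"Pass@1 ~ {a:.1f} + {b:.1f} * ln(steps)")  # each doubling adds ~b*ln(2) points
```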
05 Discussion & Limitations
Limitations
- Privacy/attribution: The general PR corpus may include developer names or comments. Even if public, this raises privacy and memorization concerns; future releases should scrub identifiers or provide opt-out mechanisms.
- Evaluation sensitivity: A few benchmark test fixes were applied. While intended for fairness, such patches can introduce variance and make cross-paper comparisons harder unless shared openly.
- Scope: Results are on one base family (Qwen2.5) and one major benchmark (SWE-Bench Verified). Repeating on other bases and diverse, real-world production tasks is needed.
- Environment coverage: Building Dockerized, testable repos at scale remains resource-intensive; some repos are hard to reproduce exactly, which may bias data toward well-behaved projects.
Required resources
- Data pipelines for mining PRs, reconstructing base files and commit sequences, and generating summaries.
- Environment builders (Docker, unit tests), GPU compute for MT/SFT, and storage for tens of billions of tokens.
- Agent scaffolds (e.g., SWE-AGENT) and tool wrappers for consistent action-observation logging.
When not to use
- Ultra-short tasks where single-turn generation suffices; the agent loop adds overhead with little benefit.
- Domains without reliable executable feedback (no tests, no verifiable outcomes); environmentally-native data would be weak.
- Privacy-restricted codebases where PR mining or environment construction is disallowed.
Open questions
- How much failing-trajectory data is optimal? What is the best balance of pass vs fail for learning robust revision strategies?
- Can we automate environment creation for more languages and build systems without human fixes?
- What is the ideal staging schedule (D_ctx vs D_env mix) across model sizes?
- How do these mid-trained priors interact with different RL algorithms and reward designs?
- Can agent-native mid-training similarly boost agents beyond coding (e.g., data science notebooks, robotics control with simulators)?
06 Conclusion & Future Work
Three-sentence summary
- This paper shows that teaching code models during mid-training with agent-native data (complete PR workflows plus real execution rollouts) builds strong, foundational agent skills.
- The resulting models solve more SWE-Bench Verified tasks than prior open recipes of similar sizes, while using less than half the mid-training tokens.
- Beyond agent tasks, the approach also boosts general coding and even science reasoning, hinting that learning the fix-test-revise loop strengthens broad problem-solving.
Main achievement
- A practical, scalable recipe for agent-native mid-training that unifies contextually-native PR data (breadth) with environmentally-native rollouts (depth), yielding state-of-the-art open results at 32B and 72B under agentic scaffolds.
Future directions
- Scale PR mining to more languages and ecosystems, and automate executable environment creation further.
- Explore optimal mixes and curricula of passing vs failing trajectories, and study synergy with modern RL strategies.
- Extend the paradigm to adjacent domains (e.g., data pipelines, scientific computing, long-context research agents).
Why remember this
- Because it reframes training: don't just show AIs what finished code looks like; show them how real engineers navigate, edit, test, and fix. When training matches deployment, agents become more capable, reliable, and efficient per token, bringing better software to everyone faster.
Practical Applications
- Automated bug triage and patching in large codebases with fewer human cycles.
- Continuous integration bots that propose fixes when tests fail on nightly builds.
- Repository onboarding assistants that localize relevant files and explain edit plans.
- Refactoring agents that make small, test-verified changes across many modules safely.
- Security patch assistants that apply and validate mitigations with unit tests.
- Upgrade helpers that adapt code to new library versions and confirm with tests.
- Teaching tools that show students full fix-test-revise workflows on real PRs.
- Data pipeline fixers that diagnose and patch failing ETL jobs using logged errors.
- Scientific computing aides that adjust simulation code and validate against benchmarks.
- Documentation co-authors that summarize PR intent and explain commit-level changes.