daVinci-Dev: Agent-native Mid-training for Software Engineering
Key Summary
- This paper teaches code AIs to work more like real software engineers by training them in the middle of their learning using real development workflows.
- The key idea is agent-native mid-training: feeding the model large examples that keep the whole story of how code was found, edited, tested, and fixed.
- They build two kinds of training journeys (trajectories): contextually-native (full PR context and edits) and environmentally-native (real tool runs and test feedback).
- By learning from both the "what" (files and edits) and the "how" (test failures and retries), the model becomes better at multi-step problem solving.
- On the tough SWE-Bench Verified benchmark, their 32B and 72B models reach 56.1% and 58.5% solved, the best among open recipes of similar sizes.
- They achieve these results with less than half the mid-training tokens of prior open methods like Kimi-Dev (~73.1B vs ~150B).
- The training also improves general coding tasks (like HumanEval) and even science reasoning benchmarks.
- This approach is scalable because GitHub has lots of Pull Requests for breadth, and tools like Dockerized tests add depth and authenticity.
- They release data-building code, recipes, and many checkpoints to help the community explore agentic mid-training.
- Bottom line: teaching AIs the full development loop (look, change, test, fix) makes them stronger, more reliable code agents.
Why This Research Matters
Software touches nearly everything we use: phones, cars, hospitals, and classrooms. Teaching AI to fix code the way real engineers do, by searching, editing, testing, and revising, means fewer crashes and faster bug fixes in everyday apps. Because this method is token-efficient and scales with public PRs and automated tests, more teams (not just those with huge budgets) can build strong code agents. It also strengthens general reasoning, so these models can help with scientific computing and data tasks beyond pure coding. Safer, faster software updates can improve cybersecurity, reduce downtime, and cut maintenance costs. Sharing data-building tools and checkpoints helps the whole community move forward together.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine learning to fix a bike. If someone only shows you a photo of a bike already fixed, you won't learn how to find the loose bolt, which wrench to try, or how to test the wheels after tightening. You need the whole process, not just the final picture.
The Concept (Large Language Models, or LLMs):
- What it is: An LLM is a very big computer program that learns to read and write by seeing lots of text.
- How it works: 1) Read mountains of text; 2) Learn patterns in words and code; 3) Predict the next token; 4) Get better at following instructions with extra training.
- Why it matters: Without a strong base, the model can't understand instructions, code, or long problems well. Anchor: When you ask an LLM to fix a buggy function, it uses its language understanding to read the code and your request before suggesting changes.
Hook: You know how real programmers don't fix bugs in one step? They search files, read code, try an edit, run tests, see failures, and try again.
The Concept (Agentic Software Engineering):
- What it is: It's when an AI acts like a developer: navigating a codebase, making edits, and running tests in multiple steps.
- How it works: 1) Localize (find the right files); 2) Read (understand the code); 3) Edit (apply a patch); 4) Test (run unit tests); 5) Revise (fix based on failures); repeat (sketched in code after this list).
- Why it matters: Without this loop, the AI guesses in one shot and misses hidden issues across files. Anchor: To fix a failing date parser, an agent searches for "parse_date", opens the file, changes the function, runs tests, sees a failure for empty strings, and then adds a guard to pass.
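To make this loop concrete, here is a minimal Python sketch of the localize-read-edit-test-revise cycle. It is only an illustration: the helper functions (localize, read_file, propose_patch, apply_patch, run_tests) are hypothetical stand-ins for whatever tools a real agent scaffold provides, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    passed: bool   # did all tests pass?
    output: str    # raw test report (failures, stack traces)

def agent_loop(issue: str,
               localize: Callable[[str], list[str]],
               read_file: Callable[[str], str],
               propose_patch: Callable[[str, dict, str], str],
               apply_patch: Callable[[str], None],
               run_tests: Callable[[], TestResult],
               max_rounds: int = 5) -> bool:
    """Localize -> read -> edit -> test -> revise until the tests pass."""
    files = localize(issue)                              # 1) localize candidate files
    context = {f: read_file(f) for f in files}           # 2) read them
    feedback = ""
    for _ in range(max_rounds):
        patch = propose_patch(issue, context, feedback)  # 3) edit: draft a patch
        apply_patch(patch)
        result = run_tests()                             # 4) test
        if result.passed:
            return True
        feedback = result.output                         # 5) revise using the failures
        context = {f: read_file(f) for f in files}       # re-read the edited files
    return False
```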
Hook: Think of studying only perfect homework solutions: no scratch work, no mistakes, no teacher notes. Would you learn the real thinking process?
The Concept (Distribution Mismatch):
- What it is: The training data the model sees (static code snapshots) doesn't match the interactive world it faces during real debugging (actions and feedback).
- How it works: 1) Training often shows final files and merged commits; 2) Real life requires step-by-step actions and tool outputs; 3) The gap confuses the model when deployed.
- Why it matters: Without matching practice, the model doesn't learn how to react to errors or tests. Anchor: A model trained on finished Pull Requests might know the final patch, but not how to find the right file or fix tests when they first fail.
Hook: Imagine a recipe book that only shows final meal photos. You'd wish it listed the ingredients, steps, and even what to do if the sauce splits.
The Concept (Pull Requests, or PRs):
- What it is: A PR is a developer's proposal to change code, including descriptions, changed files, and commit history.
- How it works: 1) Describe the problem; 2) Edit files in commits; 3) Get reviews; 4) Merge if approved.
- Why it matters: PRs are rich stories of how code was changed and why, which makes them great for teaching an agent the process, not just the result. Anchor: A PR titled "Fix header parsing for non-ASCII" includes the issue, the files edited, and the commits that gradually remove an incorrect strip() call.
Hook: When students learn, they often first get general knowledge, then special practice. AIs are similar.
The Concept (Post-training: SFT and RL):
- What it is: After base training, models get extra practice: SFT (teacher-corrected examples) and RL (learning from rewards like passing tests).
- How it works: 1) SFT shows right step-by-step answers; 2) RL lets the model try, get feedback, and improve policies.
- Why it matters: This sharpens skills, but depends on how good the base model already is and how much diverse data exists. Anchor: After learning general code, SFT shows "how to fix this bug" transcripts; RL rewards the model when unit tests pass.
The world before: Code models were pretty good at writing single functions from prompts. But real software tasks happen across big repositories. Most training data were static files or final diffs, which is great for "what code looks like," not "how to change and validate it."
The problem: When these models are deployed as agents, they must search, read, edit, run, and revise. Training didn't expose them to this step-by-step life, causing the distribution mismatch. Also, collecting large amounts of high-quality interactive data is costly, so post-training alone hits limits.
Failed attempts: Some tried factorized training (teach "find file" and "edit code" separately) or used simulated environments. But this breaks the natural dependency between steps and lacks authentic feedback (like real test failures), so models don't learn the full loop.
The gap: Models need mid-training on data that preserves the complete agent workflow and real feedback, earlier in the pipeline, so they enter post-training already thinking like agents.
Real stakes: Better code agents mean faster bug fixes, safer updates, and fewer frustrating software glitches in apps, websites, and devices we use daily.
02 Core Idea
Hook: You know how practicing a sport in real games (with a scoreboard and a referee) makes you better than just shooting hoops alone?
The Concept (Agentic Mid-training):
- What it is: A big, middle part of training where the model learns from data that look and feel like real software engineering: actions plus feedback.
- How it works: 1) Build giant datasets that keep the full story (issue → files → edits → tests); 2) Mix two kinds of journeys: context-rich PR workflows and real execution rollouts; 3) Train the base model on these so it internalizes the agent loop; 4) Later, do SFT/RL more efficiently.
- Why it matters: Without this, post-training must teach both basics and strategies at once, which is costly and brittle. Anchor: After mid-training, when the agent sees a failing test, it already "knows" to re-open files, search again, and try a smaller patch.
Aha! in one sentence: Teach the model the whole development dance (look, change, test, and fix) during mid-training using agent-native data, so post-training has a flying start.
Multiple analogies:
- Cooking: Don't just show the final cake; show the batter mixing, oven timing, and what to do if it sinks.
- Maps: Don't only give the destination; record the turns, detours, and traffic alerts the driver handled.
- Music: Don't just play the final song; include practice clips, missed notes, and how the musician corrected rhythm.
Hook: Imagine two types of practice: reading whole recipes and actually cooking with a hot stove.
The Concept (Agent-native Data):
- What it is: Training examples that preserve the agent's full info flow and real environment dynamics.
- How it works: 1) Contextually-native: PR-based workflows that unify localization and edits; 2) Environmentally-native: real tool calls, test runs, and error messages from executable repos; 3) Combine for breadth and depth.
- Why it matters: Without both, models either miss real feedback (depth) or lack variety and coverage (breadth). Anchor: A PR shows which files changed and why; a rollout shows that pytest failed with a TypeError after the first patch.
Hook: Think of a detective story where you follow all clues, then watch the suspect actually react when confronted.
The Concept (Contextually-native Trajectories):
- What it is: PR-derived samples that bundle issues, relevant files (at base), and the commit-by-commit edit path.
- How it works: 1) Collect PR metadata and issues; 2) Reconstruct base file contents and commits in order; 3) Organize into sections (context, files, edits) to mirror localize → read → edit (see the sketch after this list).
- Why it matters: If you split retrieval and editing into separate lessons, the agent loses the cause-and-effect chain. Anchor: For a bug "empty string crashes parse_date," the sample includes the base utils/date.py, then each commit's patch refining the fix.
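One way to picture such a sample is as a structured record: the issue, the base contents of the relevant files, and the ordered edit path. The field names and example contents below are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class CommitEdit:
    message: str  # commit summary (regenerated when the original is too terse)
    diff: str     # the patch applied at this step

@dataclass
class ContextualTrajectory:
    """Illustrative shape of a PR-derived, contextually-native sample."""
    issue: str                   # problem description or linked issue text
    base_files: dict[str, str]   # path -> file content at the PR's base commit
    edits: list[CommitEdit] = field(default_factory=list)  # ordered commit path

# Example instance mirroring the parse_date anchor above (contents invented)
sample = ContextualTrajectory(
    issue="empty string crashes parse_date",
    base_files={"utils/date.py": "def parse_date(s):\n    return s.split('-')\n"},
    edits=[CommitEdit(
        message="guard against empty input",
        diff="+    if not s:\n+        raise ValueError('empty date string')",
    )],
)
```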
Hook: Now turn on the oven: feel the heat, watch the timer, and taste-test.
The Concept (Environmentally-native Trajectories):
- What it is: Real action-observation loops from running an agent inside Dockerized repositories with actual tests.
- How it works: 1) Build executable envs from real PRs; 2) Let an agent act (search, view, patch, run tests); 3) Record outputs (errors, failures, passes); 4) Keep both passing and failing runs.
- Why it matters: Without genuine tool outputs, the model never learns how edits change test results. Anchor: The agent applies a patch, pytest shows one test still failing, the agent revises the code, and the next run passes.
Before vs After:
- Before: Models saw static code or isolated subtasks; at deployment they struggled to coordinate steps.
- After: Models pre-learn full workflows and feedback handling, so SFT/RL polish skills instead of building them from scratch.
Why it works (intuition):
- The model's training distribution matches the deployment world: sequences of actions and observations.
- PR data teaches where and how edits happen; environment rollouts teach how feedback drives revisions.
- Combining breadth (many workflows) with depth (authentic feedback) yields stronger, token-efficient learning.
Building blocks:
- Data breadth: D_ctx_gen (many languages) + D_ctx_py (Python focus for benchmarks).
- Feedback depth: D_env (real tool/test traces), with extra weight on passing rollouts.
- Mid-training schedule: stage-wise mixing of these datasets to shift the base model's capabilities toward agentic reasoning.
- Post-training: SFT on curated trajectories to align with the desired scaffold and task formats.
03 Methodology
High-level recipe: Input → Build two datasets (contextually-native PRs; environmentally-native rollouts) → Mid-train the base model on both → Post-train with SFT on strong trajectories → Output: a model that already "thinks" like a code agent.
Step A: Contextually-native trajectories (PR workflows)
- What happens: The team mines GitHub Pull Requests to reconstruct full developer workflows. For each PR, they gather the repository context, issues (if any), base versions of changed files, and the ordered commit sequence with diffs. They also generate short, clearer summaries for PRs and commits when messages are too terse.
- Why this step exists: If you only show the final diff, the model won't learn how localization and editing connect. Keeping files and edits together teaches the "localize → read → edit" chain.
- Example: Issue: "strip() wrongly removes non-ASCII headers." Relevant file: src/waitress/task.py at base. Edits: A commit removes value.strip() and adjusts logic. The sample shows the file content before, then the edit sequence.
Data construction details
- Two subsets: D_ctx_gen (26.7B tokens) spans top-starred repos across languages (broad coverage); D_ctx_py (41.9B) focuses on Python (aligned to SWE-Bench Verified).
- Collection: GitHub REST and GraphQL APIs fetch PR metadata, linked issues, base file contents, and commit diffs. Relevant files come from the symmetric diff set. They align files to the parent of the first commit for accuracy.
- Filtering: Keep merged PRs; drop bot PRs; for the Python subset, keep PRs with 1-5 changed Python files. Discard samples over 32k tokens. Decontaminate SWE-Bench Verified repos to avoid leakage (see the sketch after this list).
- Formatting: General PRs use XML-like tags and include reviews; Python PRs use a Markdown layout that matches agent edit actions (search-and-replace format), mirroring agent steps with interleaved reasoning.
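A minimal sketch of how those Python-subset filters might look in code. The pr dictionary fields, the token counter, and the decontamination list are assumptions for illustration, not the released pipeline.

```python
# Hedged sketch of the filtering rules above; `pr` is assumed to be a dict
# assembled from GitHub API responses, and `count_tokens` is any tokenizer.

SWE_BENCH_VERIFIED_REPOS = {"django/django", "sympy/sympy"}  # illustrative subset
MAX_SAMPLE_TOKENS = 32_000

def keep_python_pr(pr: dict, sample_text: str, count_tokens) -> bool:
    if not pr["merged"]:
        return False                                   # keep merged PRs only
    if pr["author_is_bot"]:
        return False                                   # drop bot-authored PRs
    py_files = [f for f in pr["changed_files"] if f.endswith(".py")]
    if not 1 <= len(py_files) <= 5:
        return False                                   # 1-5 changed Python files
    if count_tokens(sample_text) > MAX_SAMPLE_TOKENS:
        return False                                   # discard overlong samples
    if pr["repo"] in SWE_BENCH_VERIFIED_REPOS:
        return False                                   # decontaminate benchmark repos
    return True
```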
Step B: Environmentally-native trajectories (executable feedback)
- What happens: They build Docker environments from real PRs (as in SWE-REBENCH), including unit tests and tools. Then they run a capable agent (GLM-4.5/4.6 inside SWE-AGENT) to interact: search files, open code, apply patches, run tests, and read errors.
- Why this step exists: Authentic execution feedback (test failures, stack traces) can't be faked well. It teaches the agent how to react after each edit.
- Example: A rollout shows grep hits for parse_date, file views, apply_patch v1, pytest failing with AssertionError, then apply_patch v2, and finally all tests passing (see the recording sketch after this list).
- Dataset: D_env totals ~3.1B raw tokens (~4.5B effective), including both passing (0.7B) and non-passing trajectories. They keep both to expose real debugging.
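The essential shape of such a rollout is a log of action-observation pairs. The sketch below records one by running a fixed list of shell commands and capturing their output; in the paper's setup the commands would come from an agent policy acting inside a Docker environment, so treat the helper and the example script as illustrative assumptions.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Step:
    action: list[str]   # command issued by the agent (illustrative)
    observation: str    # raw tool output: grep hits, file text, pytest report

def record_rollout(commands: list[list[str]], cwd: str) -> list[Step]:
    """Log action-observation pairs for a scripted stand-in 'agent'."""
    steps = []
    for cmd in commands:
        out = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        steps.append(Step(action=cmd, observation=out.stdout + out.stderr))
    return steps

# Example script: search, read, test (run inside a prepared repo checkout)
script = [
    ["grep", "-rn", "parse_date", "src"],
    ["cat", "src/utils/date.py"],
    ["python", "-m", "pytest", "-q"],
]
# steps = record_rollout(script, cwd="/path/to/repo")  # run inside the environment
```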
Hook: You know how a coach reviews both your winning and losing plays to help you improve?
The Concept (Supervised Fine-tuning, or SFT):
- What it is: After mid-training, SFT polishes the model on high-quality demonstration trajectories to match the agent scaffold's format.
- How it works: 1) Feed curated step-by-step traces (e.g., SWE-Smith, D_env_pass); 2) Mask losses to learn only from desired parts; 3) Train for a few epochs.
- Why it matters: Without SFT, the model might be capable but not aligned to the exact tools and response formats used at deployment. Anchor: The model learns to call "apply_patch", "run_tests", and format thoughts exactly as SWE-AGENT expects.
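The "mask losses" step in item 2 above usually means that only the demonstrated agent responses contribute to the training loss. A minimal sketch, assuming the common PyTorch convention of -100 as the ignored label; the exact masking rules used in the paper are not spelled out here.

```python
import torch

IGNORE_INDEX = -100  # conventional "ignore" label for torch cross-entropy

def build_sft_labels(token_ids: list[int], is_response: list[bool]) -> torch.Tensor:
    """Copy token ids for the demonstrated response spans; mask everything else.

    `is_response[i]` marks tokens from the desired agent turns (thoughts, tool
    calls); prompt and tool-output tokens are masked and contribute no loss.
    """
    labels = [tid if keep else IGNORE_INDEX
              for tid, keep in zip(token_ids, is_response)]
    return torch.tensor(labels)

# Example: a 6-token sequence where only the last 3 tokens are the agent's reply
ids = [101, 2023, 2003, 7000, 7001, 7002]
mask = [False, False, False, True, True, True]
print(build_sft_labels(ids, mask))  # tensor([-100, -100, -100, 7000, 7001, 7002])
```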
Training pipeline details
- Base models: Qwen2.5-32B-Base and Qwen2.5-72B-Base.
- Mid-training (MT):
- Stage 1: D_ctx_gen to build broad software engineering priors.
- Stage 2: D_ctx_py (+ D_env when used) to specialize in Python and agent feedback.
- Hyperparameters: large global batch, warmup then cosine decay; no loss mask.
- Post-training (SFT):
- Datasets: public SWE-Smith and/or D_env_pass (passing rollouts), with D_env_pass upsampled 3× during MT and used again in SFT to activate capabilities.
- Hyperparameters: smaller batch, warmup then cosine decay; standard loss masks.
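Putting the pipeline together, an illustrative configuration might look like the sketch below. The dataset names and staging follow the text, but the mixing ratios, epoch count, and every other number are placeholders, not the paper's hyperparameters.

```python
# Illustrative recipe layout; all numeric values are placeholders.
RECIPE = {
    "mid_training": [
        {"stage": 1, "datasets": {"D_ctx_gen": 1.0},               # broad SWE priors
         "lr_schedule": "warmup + cosine decay", "loss_mask": False},
        {"stage": 2, "datasets": {"D_ctx_py": 0.9, "D_env": 0.1},  # Python + feedback
         "notes": "D_env_pass upsampled 3x within D_env",
         "lr_schedule": "warmup + cosine decay", "loss_mask": False},
    ],
    "sft": {
        "datasets": ["SWE-Smith", "D_env_pass"],
        "lr_schedule": "warmup + cosine decay",
        "loss_mask": True,   # learn only from desired response spans
        "epochs": 3,         # placeholder
    },
}
```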
Secret sauce
- Keep the whole story: Bundling base file contents with sequential edits mirrors how agents actually work.
- Learn from real friction: Dockerized tests and genuine tool outputs provide the "heat of the kitchen" the model must handle.
- Mix breadth and depth: PR breadth builds strong priors; environment depth teaches reactive strategies.
- Token efficiency: Because the data match deployment, the model learns faster per token than with synthetic or factorized data.
Concrete mini-walkthrough
- Input: Bug report says parse_date crashes on empty strings.
- Step A (PR-style learning): See date.py at base; a commit adds a guard; another commit tweaks error handling.
- Step B (Env-style learning): The agent applies its first patch; pytest shows one failing case; it narrows the fix and tries again; pass.
- Output: A model that, when deployed, naturally follows the "search → read → edit → test → revise" loop without being micromanaged.
04 Experiments & Results
Hook: If two teams play the same game on the same field with the same rules, you can fairly compare who plays better.
The Concept (Benchmark: SWE-Bench Verified):
- What it is: A standardized set of real GitHub issues turned into executable tasks with tests, used to measure if agents can fix them.
- How it works: 1) Each task has a repo snapshot and tests; 2) The agent gets a description and must make code changes; 3) Success = tests pass with the agentās changes.
- Why it matters: Without a common, executable yardstick, we can't compare methods reliably. Anchor: The model gets an issue like "date parsing fails on empty string," applies changes, runs tests, and earns a "pass" only if all tests succeed.
The test
- They evaluate with the SWE-AGENT scaffold (a standardized agent wrapper) using large context (128k) and up to 100 steps, reporting Pass@1 (solve rate). Multiple runs are averaged for stability. Some benchmark glitches were patched for fairness.
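Pass@1 here is simply the fraction of tasks resolved in a single attempt, averaged across the repeated evaluation runs. A minimal sketch of that computation, with invented example data:

```python
def pass_at_1(runs: list[list[bool]]) -> float:
    """Average resolve rate over evaluation runs.

    runs[r][i] is True if task i was resolved (its tests passed) in run r.
    """
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)

# Invented example: 3 evaluation runs over the same 4 tasks
runs = [
    [True, False, True, True],
    [True, False, False, True],
    [True, False, True, True],
]
print(f"Pass@1 = {pass_at_1(runs):.1%}")  # Pass@1 = 66.7%
```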
The competition
- Baselines include the prior open mid-training recipe Kimi-Dev and multiple SFT-only or RL-tuned systems. Comparisons are done at similar model sizes (32B and 72B) and with agentic scaffolds.
The scoreboard (with context)
- 32B scale:
- Strong SFT baseline (no MT): ~53.0%.
- daVinci-Dev-32B (with D_ctx + D_env MT, then SFT): 56.1%.
- This is state-of-the-art among open recipes at this size using agent scaffolds, despite starting from a non-coder base model.
- 72B scale:
- Kimi-Dev (MT + RL) reports 48.6% under SWE-AGENT.
- Strong SFT baseline (no MT): ~56.6%.
- Ours with D_ctx MT only + strong SFT: 58.2%.
- Full daVinci-Dev-72B (D_ctx + D_env MT + strong SFT): 58.5%, beating Kimi-Dev while using less than half the MT tokens (~73.1B vs ~150B).
- Translation of numbers: 58.5% is like scoring an A when strong baselines hover around high B to low A-, and a prior well-known method sits in the C+/B- range under the same scaffold.
Surprising and informative findings
- Mid-training helps even when SFT is already strong: For the 72B model, D_ctx alone boosts weak SFT from 38.0% to 46.4%, and strong SFT beyond 58%.
- Authentic rollouts matter: Adding D_env to MT nudges performance further (58.2% → 58.5%), showing value from environment dynamics.
- Zero-shot signal: With D_env alone, the 72B model reached ~47% zero-shot; mixing in D_ctx_py raised it to ~55%, proving PR grounding supplies essential knowledge breadth.
- Generalization beyond SWE: On HumanEval and EvalPlus, both 32B and 72B models jump significantly (e.g., +12 to +23 points on HumanEval), and even science reasoning tasks (GPQA, SciBench) see gains, suggesting the agentic loop trains broader reasoning.
Efficiency and scaling
- Token efficiency: Outperforms a prior ~150B-token recipe using only ~73.1B MT tokens by matching the training distribution to deployment.
- Scaling law hint: As MT steps increase on a D_ctx_py + D_env mix, Pass@1 rises roughly log-linearly, indicating headroom for more gains with more data/compute.
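"Roughly log-linear" means Pass@1 grows by about a constant amount each time the mid-training budget doubles, i.e. Pass@1 ≈ a + b·ln(steps). The sketch below fits that form to invented data points purely to show the shape of the claim; it does not reproduce the paper's measurements.

```python
import math

# Invented (steps, Pass@1 %) points that double the budget each time;
# they illustrate the log-linear shape only, not the paper's numbers.
points = [(5, 30.0), (10, 33.0), (20, 36.0), (40, 39.0)]

# Simple least-squares fit of Pass@1 = a + b * ln(steps)
xs = [math.log(s) for s, _ in points]
ys = [p for _, p in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
print(f"Pass@1 ~ {a:.1f} + {b:.1f} * ln(steps)")  # each doubling adds ~b*ln(2) points
```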
05 Discussion & Limitations
Limitations
- Privacy/attribution: The general PR corpus may include developer names or comments. Even if public, this raises privacy and memorization concerns; future releases should scrub identifiers or provide opt-out mechanisms.
- Evaluation sensitivity: A few benchmark test fixes were applied. While intended for fairness, such patches can introduce variance and make cross-paper comparisons harder unless shared openly.
- Scope: Results are on one base family (Qwen2.5) and one major benchmark (SWE-Bench Verified). Repeating on other bases and diverse, real-world production tasks is needed.
- Environment coverage: Building Dockerized, testable repos at scale remains resource-intensive; some repos are hard to reproduce exactly, which may bias data toward well-behaved projects.
Required resources
- Data pipelines for mining PRs, reconstructing base files and commit sequences, and generating summaries.
- Environment builders (Docker, unit tests), GPU compute for MT/SFT, and storage for tens of billions of tokens.
- Agent scaffolds (e.g., SWE-AGENT) and tool wrappers for consistent action-observation logging.
When not to use
- Ultra-short tasks where single-turn generation suffices; the agent loop adds overhead with little benefit.
- Domains without reliable executable feedback (no tests, no verifiable outcomes); environmentally-native data would be weak.
- Privacy-restricted codebases where PR mining or environment construction is disallowed.
Open questions
- How much failing-trajectory data is optimal? What is the best balance of pass vs fail for learning robust revision strategies?
- Can we automate environment creation for more languages and build systems without human fixes?
- What is the ideal staging schedule (D_ctx vs D_env mix) across model sizes?
- How do these mid-trained priors interact with different RL algorithms and reward designs?
- Can agent-native mid-training similarly boost agents beyond coding (e.g., data science notebooks, robotics control with simulators)?
06 Conclusion & Future Work
Three-sentence summary
- This paper shows that teaching code models during mid-training with agent-native data (complete PR workflows plus real execution rollouts) builds strong, foundational agent skills.
- The resulting models solve more SWE-Bench Verified tasks than prior open recipes of similar sizes, while using less than half the mid-training tokens.
- Beyond agent tasks, the approach also boosts general coding and even science reasoning, hinting that learning the fix-test-revise loop strengthens broad problem-solving.
Main achievement
- A practical, scalable recipe for agent-native mid-training that unifies contextually-native PR data (breadth) with environmentally-native rollouts (depth), yielding state-of-the-art open results at 32B and 72B under agentic scaffolds.
Future directions
- Scale PR mining to more languages and ecosystems, and automate executable environment creation further.
- Explore optimal mixes and curricula of passing vs failing trajectories, and study synergy with modern RL strategies.
- Extend the paradigm to adjacent domains (e.g., data pipelines, scientific computing, long-context research agents).
Why remember this
- Because it reframes training: don't just show AIs what finished code looks like; show them how real engineers navigate, edit, test, and fix. When training matches deployment, agents become more capable, reliable, and efficient per token, bringing better software to everyone faster.
Practical Applications
- Automated bug triage and patching in large codebases with fewer human cycles.
- Continuous integration bots that propose fixes when tests fail on nightly builds.
- Repository onboarding assistants that localize relevant files and explain edit plans.
- Refactoring agents that make small, test-verified changes across many modules safely.
- Security patch assistants that apply and validate mitigations with unit tests.
- Upgrade helpers that adapt code to new library versions and confirm with tests.
- Teaching tools that show students full fix-test-revise workflows on real PRs.
- Data pipeline fixers that diagnose and patch failing ETL jobs using logged errors.
- Scientific computing aides that adjust simulation code and validate against benchmarks.
- Documentation co-authors that summarize PR intent and explain commit-level changes.