SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Key Summary
- SWE-Master is a fully open, step-by-step recipe for turning a regular coding model into a strong software-fixing agent that works across many steps, files, and tests.
- It builds skill in three phases: learn from good examples (long-horizon SFT), practice in real sandboxes with rewards (RL), and think smarter at test time with multiple tries and a smart checker (TTS with SWE-World).
- The team carefully makes and filters "teacher" examples so the model practices problems that are not too easy and not impossible, which speeds up learning and avoids bad habits.
- They change the RL recipe to be stable and fair, including a "forced submission" so partial progress still gets scored, which prevents training from collapsing.
- At inference, SWE-Master adds IDE-style tools powered by the Language Server Protocol (LSP), letting the agent jump to definitions and references like a developer using VS Code, which is faster and more precise than grep.
- On the SWE-bench Verified benchmark, SWE-Master gets 61.4% solved with the 32B model (Pass@1), and 70.8% with TTS@8, rivaling much bigger or closed systems.
- LSP-based navigation keeps success rates while cutting tokens and steps, making the agent cheaper and snappier to run.
- Everything is open-source and reproducible: data pipeline, training code, RL setup, and inference scaffold, so others can learn and build on it.
- Anti-"git hacking" protections block sneaky shortcuts (like git log/show) so the agent must truly reason, keeping evaluation honest.
- Results show a clear pattern: better curation → stronger SFT → safer, steadier RL → smarter test-time selection → big, reliable gains.
Why This Research Matters
Real apps break in messy, interconnected ways, and quick, correct fixes save users time and money. SWE-Master shows how open models can be trained to handle full, real workflows (reading issues, changing multiple files, and passing tests) without closed secrets. By adding IDE-style navigation (LSP), it reduces wasted reading and speeds up problem solving, which lowers cost and energy use. Its test-time scaling finds better answers without always running slow or risky executions. Because the whole pipeline is open and reproducible, students, startups, and researchers can learn from it, re-use parts, and push the field forward responsibly.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how fixing a big LEGO set is harder than snapping in one tiny piece? Real software is like a giant LEGO city: thousands of pieces that must fit perfectly, and changing one block can break a bridge far away.
The Concept (SWE agents): Software Engineering (SWE) agents are AI helpers that read an issue, explore a large codebase, change multiple files, run tests, and keep improving their fix until everything passes.
- How it works (step by step):
- Read the bug report
- Search and open related files
- Edit code in several places
- Run tests and read errors
- Repeat edits and tests until all pass
- Why it matters: Without an SWE agent, code AIs often write short snippets but struggle with big projects where many files connect.
Anchor: Imagine a bug that makes a weather app crash when loading maps. An SWE agent finds the right files, adjusts the map loader, updates imports, runs tests, and confirms no other feature breaks.
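To make that read-search-edit-test loop concrete, here is a minimal sketch of an issue-fixing agent loop. It is an illustration only: `call_model`, the tool names, and the action format are placeholder assumptions, not SWE-Master's actual scaffold.

```python
# Minimal sketch of the read -> search -> edit -> test loop described above.
# `call_model`, the tool names, and the action format are placeholders.

def run_agent(issue_text, tools, call_model, max_turns=50):
    """Drive a single issue-fixing episode and return the final patch (a diff string)."""
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_turns):
        # The model picks the next action from everything seen so far.
        action = call_model(history)  # e.g. {"tool": "run_tests", "args": {}}
        if action["tool"] == "submit":
            return action["args"]["patch"]  # final diff proposed by the agent
        # Execute the chosen tool (search, open file, edit, run tests, ...).
        observation = tools[action["tool"]](**action["args"])
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": str(observation)})
    return ""  # turn budget exhausted without a submission
```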
The world before: Code models were good at single functions or small tasks, like writing one recipe card. But real software jobs require a full kitchen: shopping for the right ingredients (search), cooking multiple dishes (editing many files), and taste-testing (unit tests) until the whole meal is ready. Benchmarks like SWE-bench Verified ask for that full-kitchen skill: agents must open repos, change code, run tests, and deliver a final patch that truly passes.
The problem: The best-performing systems were often closed or unclear about how they collected training data, how they fine-tuned models, what rewards they used, and what tools they enabled at inference. That made results hard to reproduce and slowed community progress. Also, training on long, multi-step tasks is tricky: if the data is too easy, agents don't learn depth; too hard, they flail. In RL, unstable rewards can cause the model to "give up" or learn shortcuts. At inference, simple greps and basic scaffolds make agents read lots of noisy text instead of understanding the code's structure.
Failed attempts people tried:
- Treating repo-level tasks like short code completion: works on tiny problems, fails on long ones.
- Using any "successful" trajectory without checking difficulty: teaches shallow habits.
- Black-box verifiers to pick the best patch: they often get confused by long, messy contexts.
- Append-only context or discard-all strategies: either the model drowns in logs or forgets crucial info.
- Allowing git log/show: agents can cheat by peeking at ground-truth patches.
The gap: We needed an open, end-to-end post-training framework that:
- Crafts and filters teacher trajectories with the right difficulty mix
- Trains on long-horizon interactions without overfitting environment noise
- Uses RL with real execution but stable rewards
- Adds smart inference tools (like IDE-level code navigation) to reduce wasted effort
- Scales at test time with a verifier that simulates execution logically, not just guesses
Hook: Imagine practicing piano with songs that are too easy or impossibly hard; you won't improve much.
The Concept (Post-training framework): A post-training framework is the coaching plan used after a base model is trained, to sharpen it for specific, complex tasks.
- How it works:
- Generate many step-by-step āteacherā solutions
- Filter them for quality and just-right difficulty
- Teach the model to imitate long, multi-step reasoning (SFT)
- Let it practice in a real sandbox with rewards (RL)
- Give it better tools and multiple tries at test time (inference + TTS)
- Why it matters: Without a smart plan, the model learns noisy habits, gets unstable in RL, and wastes compute reading irrelevant text.
Anchor: It's like turning a decent basketball player into a clutch point guard by giving curated scrimmages, live-game practice with a scoreboard, and a playbook that makes court navigation faster.
Real stakes (why you should care):
- Faster bug fixes in real projects mean fewer app crashes and happier users.
- Lower compute at inference (thanks to LSP tools) saves money and energy.
- Open, reproducible recipes let students, startups, and researchers build strong agents without secrets.
- Safer evaluation (blocking git hacks) keeps progress honest.
- Smarter test-time scaling finds better patches without risky or expensive full executions, which matters when running code could be costly or unsafe.
02 Core Idea
Hook: Imagine building a treehouse. Success takes three things: good lessons (learn), real practice with feedback (try and see), and smart choices on game day (pick the best plan).
The Concept (Key insight): SWE-Master's key insight is that great repo-level coding skill emerges when you combine curated long-horizon examples (SFT), stable practice in real sandboxes (RL), and smart test-time selection, plus IDE-style tools that help the agent navigate code like a pro.
- How it works:
- Make and filter teacher trajectories to teach multi-step thinking
- Do long-horizon SFT that learns actions and reasoning, not environment noise
- Run RL with reliable, shaped rewards to reinforce real success
- At test time, try multiple candidates and select using an LLM "simulated test runner" (SWE-World)
- Equip the agent with LSP tools for fast, precise code navigation
- Why it matters: Each part fixes a weak spot: data quality, learning stability, selection accuracy, and navigation efficiency.
Anchor: Like a cooking team: a recipe book of great dishes (SFT data), tasting and adjusting while cooking (RL), choosing the best plate to serve (TTS), and a kitchen with labeled drawers (LSP) so you find tools instantly.
Multiple analogies:
- Orchestra: SFT is rehearsals with good sheet music; RL is live practice with audience feedback; TTS is choosing the best recorded take; LSP is the conductor's score that shows every part clearly.
- Hiking: SFT is learning routes from past hikers; RL is testing paths with signposts and warnings; TTS is checking several GPS traces to pick the safest; LSP is a topographic map revealing terrain.
- Sports: SFT drills moves; RL scrimmages with scoreboards; TTS picks the best play from several options; LSP is a playbook showing player positions and routes.
Hook: You know how long puzzles need patience and seeing the big picture?
The Concept (Long-horizon SFT): Long-horizon SFT teaches the model from multi-step demos that include thinking and tool use, across many turns.
- How it works:
- Collect solved trajectories from strong teachers in real sandboxes
- Filter out broken, too-long, or too-easy/too-hard cases
- Train the model to produce thoughts and actions, while masking raw environment outputs
- Why it matters: Without it, the model copies logs instead of learning how to decide next actions.
Anchor: Like learning chess from grandmaster games: you replay the plans, not the crowd noise.
Hook: Ever train a puppy with treats for the right trick?
The Concept (Reinforcement Learning): RL lets the model try, get a reward if tests pass, and learn which choices led to success.
- How it works:
- Run in a Docker repo environment
- Submit a patch; if all tests pass, reward = 1, else 0
- Use a stable policy update (GRPO) with tweaks: forced submission, fair length scaling, clip-higher
- Why it matters: Without well-shaped rewards and stability tweaks, the agent loops or collapses.
Anchor: A dog learns "sit" faster when feedback is clear, timely, and consistent.
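To make the reward idea concrete, here is a minimal sketch of a group-relative advantage in the spirit of GRPO, assuming a binary pass/fail reward per rollout. The exact normalization, clipping, and other tweaks SWE-Master uses may differ.

```python
# Sketch of a GRPO-style group-relative advantage with binary rewards: rollouts
# of the same issue are compared against their own group's average outcome.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center and scale each rollout's reward against its group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of one issue, where only the second passed all tests.
print(group_advantages([0.0, 1.0, 0.0, 0.0]))  # the solver gets a positive advantage, the rest negative
```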
Hook: When you take a test, trying several solutions can help you pick the best answer.
The Concept (Test-Time Scaling): TTS runs multiple agent attempts and uses SWE-World, a simulator that predicts which patch would pass tests, to pick the winner.
- How it works:
- Generate N candidate patches
- For each, simulate test feedback multiple times
- Choose the patch with the best predicted score
- Why it matters: Without good selection, extra attempts donāt turn into extra wins.
Anchor: Like taking several photos and choosing the sharpest one.
Hook: Finding where a function is defined by scrolling forever is like looking for a friend in a crowd without a map.
The Concept (Language Server Protocol & IDE-level navigation): LSP gives the agent IDE-style powers (go to definition, find references, see call hierarchies) so it navigates code semantically, not just by keywords.
- How it works:
- The agent calls LSP tools (e.g., get_definition, get_references)
- The language server returns precise locations and summaries
- The agent jumps directly to the right spots and edits with confidence
- Why it matters: Grep is blind to code meaning; LSP sees structure. Without it, the agent wastes tokens and time.
Anchor: It's the difference between wandering a city and using GPS turn-by-turn directions.
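To see what this buys over keyword search, here is a tiny single-file stand-in built on Python's `ast` module. It is not an LSP client: a real setup would talk to a language server such as Pyright over the Language Server Protocol and resolve symbols project-wide, but the shape of the get_definition / get_references answers is the point.

```python
# Illustration only: an ast-based, single-file stand-in for LSP-style navigation.
import ast

def get_definition(source, name):
    """Return (line, column) of the function/class definition named `name`, if any."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) and node.name == name:
            return node.lineno, node.col_offset
    return None

def get_references(source, name):
    """Return every (line, column) where `name` is used as a plain identifier."""
    return [(n.lineno, n.col_offset)
            for n in ast.walk(ast.parse(source))
            if isinstance(n, ast.Name) and n.id == name]

code = "def swap_dims(ds):\n    return ds\n\nresult = swap_dims(None)\n"
print(get_definition(code, "swap_dims"))  # (1, 0): jump straight to the definition
print(get_references(code, "swap_dims"))  # [(4, 9)]: every usage, with no noisy grep hits
```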
Before vs after:
- Before: Disconnected training, noisy data, brittle RL, keyword-only search.
- After: Curated long-horizon SFT, stable RL with verifiable rewards, smart TTS selection, and IDE-grade navigation.
Why it works (intuition):
- Teach the right difficulties → the model learns decision depth.
- Mask environment logs → the model learns to reason, not copy.
- Shape rewards and force submission → steady learning signals.
- Parallel attempts + simulated verifier → convert compute into accuracy.
- LSP tools → less wandering, more precise fixes.
Building blocks:
- Teacher-trajectory synthesis and difficulty-based filtering
- Long-horizon SFT with environment-response masking
- RL with GRPO tweaks, forced submission, and budget awareness
- TTS with SWE-World simulated evaluation
- LSP tool integration and anti-git-hacking protections
03 Methodology
High-level overview: Input (issue + repo) → [Teacher trajectory synthesis + filtering] → [Long-horizon SFT] → [RL in Docker with shaped rewards] → [Inference with LSP tools + Test-Time Scaling] → Output (verified patch)
Step A: Teacher trajectory synthesis and data curation
- What happens: Strong teacher models (MiniMax-M2, GLM-4.6) act as agents inside real Dockerized repos to solve issues. We record their full multi-turn "thoughts + actions + observations" and whether tests passed. Then we filter.
- Why this step exists: Raw data is messy. Keeping only correct, well-formed, and "learnable" difficulty cases teaches useful habits instead of noise.
- Example: For an xarray bug, we keep trajectories where teachers searched, edited two files, added a test, and passed, but drop ones with broken tool calls or 100+ turns.
- Secret sauce 1: Difficulty-based filtering. For each issue, run N rollouts; keep issues where results are mixed (sometimes solved, sometimes not). This avoids trivial and intractable extremes and trains decision-making on the edge.
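As a rough illustration of that filter, the sketch below keeps an issue only when its teacher rollouts are mixed (neither all failures nor all successes). The thresholds are illustrative, not the paper's exact cutoffs.

```python
# Sketch of difficulty-based filtering over per-issue rollout outcomes.

def keep_issue(rollout_outcomes, lo=0.0, hi=1.0):
    """Keep issues that are neither trivially easy nor (apparently) intractable."""
    solve_rate = sum(rollout_outcomes) / len(rollout_outcomes)
    return lo < solve_rate < hi  # strictly between: sometimes solved, sometimes not

dataset = {
    "issue-a": [True, True, True, True],      # always solved -> dropped (too easy)
    "issue-b": [False, False, False, False],  # never solved  -> dropped (intractable for now)
    "issue-c": [True, False, True, False],    # mixed         -> kept for training
}
print([name for name, outcomes in dataset.items() if keep_issue(outcomes)])  # ['issue-c']
```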
Hook: You know how practicing math problems that are just beyond your comfort zone makes you learn fastest?
The Concept (Teacher trajectories): These are step-by-step recordings from a skilled model solving a task, used to teach the student model.
- How it works:
- Let teacher models solve issues in real sandboxes
- Save their thoughts, actions, and final success
- Filter for correctness and the right difficulty
- Why it matters: Without good teachers and filtering, the student copies bad habits or learns nothing new.
Anchor: Like watching a coach demonstrate a perfect layup, then practicing that same sequence.
Step B: Long-horizon Supervised Fine-Tuning (SFT)
- What happens: Train Qwen2.5-Coder-32B (and Qwen3-4B) on ~60K filtered multi-turn demos. Extend context (YaRN) to 80K and mask environment outputs so the loss focuses on the agent's reasoning and tool calls.
- Why this step exists: Teaches multi-step planning and the rhythm of "read → decide → act → reflect" across long interactions.
- Example: The model learns to open dataset.py, trace a function to its definition, edit safely, run pytest, and rescan logs before final submission, without overfitting to raw log text.
- Secret sauce 2: Environment-response masking keeps the model from parroting logs and instead learning to choose the next best action.
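A minimal sketch of that masking: assistant turns (thoughts and tool calls) keep their token labels, while environment observations are assigned an ignore index so they never contribute to the loss. The -100 convention and the toy tokenizer are assumptions for illustration; real tokenization and chat templating are more involved.

```python
# Sketch of environment-response masking for long-horizon SFT.
IGNORE_INDEX = -100  # the usual "skip this token in the loss" label

def build_labels(turns, tokenize):
    """turns: list of (role, text) pairs; returns flat input_ids and masked labels."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        # Train on what the agent thinks and does, not on raw environment logs.
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

toy_tokenize = lambda text: list(range(len(text.split())))  # one "token" per word, for illustration
_, labels = build_labels(
    [("user", "fix the swap_dims bug"),
     ("assistant", "open dataset.py then edit"),
     ("tool", "FAILED test_swap_dims - AssertionError")],
    toy_tokenize,
)
print(labels)  # [-100, -100, -100, -100, 0, 1, 2, 3, -100, -100, -100, -100]
```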
Step C: Reinforcement Learning (RL) with real environments
- What happens: Start from the SFT model. Let it act in Docker repos; if the submitted patch passes all tests, reward = 1, else 0. Optimize with GRPO plus stabilizing tweaks.
- Why this step exists: Imitation teaches basics; RL teaches confidence, exploration, and handling hard corners.
- Details that matter:
- Forced submission: If a run times out or hits max turns/tokens, automatically submit the current patch and score it. This prevents "almost-solved" work from becoming zero signal and stops collapse.
- Reward shaping: Penalize forced submissions a bit (e.g., ×0.5) so the model prefers timely, confident finishes (see the reward sketch after this step).
- Clip-higher and remove KL: Encourage exploration and avoid over-restricting probability growth.
- Budget awareness: Each turn includes "steps remaining" so the agent plans with time in mind.
- Anti-git-hacking: Block git log/show so rewards come from real reasoning, not shortcuts.
- Example: The agent loops test-edit-test, sees remaining steps are low, crafts a careful final patch, submits, and earns reward.
- Secret sauce 3: Combining forced submission + shaped rewards stabilized training and lifted success rates versus zeroing-out truncated runs.
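Putting those rules together, here is a hedged sketch of the shaped reward. The ×0.5 discount on forced submissions comes from the description above; the field names and the decision to exclude container-error runs entirely (returning None) are illustrative assumptions.

```python
# Sketch of the shaped reward with forced submission.

def shaped_reward(run, forced_penalty=0.5):
    """Return a scalar reward, or None to drop the run from the policy update."""
    if run["container_error"]:
        return None                    # broken sandbox: no useful signal either way
    base = 1.0 if run["all_tests_pass"] else 0.0
    if run["forced_submission"]:       # hit the turn/token budget and auto-submitted
        return base * forced_penalty   # partial credit beats a flat zero
    return base

print(shaped_reward({"container_error": False, "all_tests_pass": True,  "forced_submission": True}))   # 0.5
print(shaped_reward({"container_error": False, "all_tests_pass": True,  "forced_submission": False}))  # 1.0
print(shaped_reward({"container_error": True,  "all_tests_pass": False, "forced_submission": False}))  # None
```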
Hook: If you freeze up right before turning in homework, a friendly nudge to "submit what you have" can still earn credit and teach you to finish next time.
The Concept (Reward shaping with forced submission): A small twist to reward rules that scores near-complete attempts and reduces RL instability.
- How it works:
- If time/turns run out, force-submit the current patch
- Score it normally but apply a small penalty
- Ignore runs broken by container errors
- Why it matters: Without it, many promising runs give zero signal and training collapses.
Anchor: Like getting partial credit for showing your work, which keeps you learning.
Step D: Test-Time Scaling (TTS) with SWE-World
- What happens: At inference, generate N candidate patches. Use SWE-World (an LLM-based simulator) to predict which would pass tests, then submit the best.
- Why this step exists: Multiple tries only help if you can reliably pick the winner without running real tests each time.
- Example: Run 8 candidates; SWE-World ranks them using simulated test reports; the chosen patch boosts Pass@1 from 61.4% to 70.8%.
- Secret sauce 4: The simulated evaluator tracks close to the theoretical "oracle" upper bound, so compute turns into real gains.
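The selection loop itself is simple; a compact sketch is below, where `generate_patch` stands in for one full agent rollout and `verifier_score` for a SWE-World-style predicted pass likelihood.

```python
# Sketch of parallel test-time scaling: best-of-K candidate selection.

def best_of_k(issue, generate_patch, verifier_score, k=8):
    candidates = [generate_patch(issue) for _ in range(k)]            # K independent agent runs
    scores = [verifier_score(issue, patch) for patch in candidates]   # predicted pass likelihoods
    return candidates[max(range(k), key=scores.__getitem__)]          # submit the top-ranked patch
```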
Hook: When you take several photos, a good previewer helps you choose the sharpest shot before printing.
The Concept (SWE-World simulated verifier): A model that "pretends" to run tests and outputs a reasoned pass/fail report.
- How it works:
- Extract changed files, tests, and patch
- Simulate a test run and produce a report
- Repeat a few times and average, then rank candidates
- Why it matters: Real execution can be slow or risky; a good simulator saves time and still picks winners.
Anchor: Like a flight simulator that lets you practice safely before flying the real plane.
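Internally, such a verifier can be as simple as prompting an LLM to reason through the tests and emit a verdict, then repeating and averaging. The prompt wording, the VERDICT format, and the `llm` callable below are assumptions for illustration, not SWE-World's actual interface.

```python
# Sketch of an LLM-based simulated verifier: simulate a test run a few times and average.

def verifier_score(issue, patch, tests, llm, n_runs=3):
    prompt = (
        "You are a test runner. Given the issue, the candidate patch, and the relevant tests, "
        "reason step by step and finish with either 'VERDICT: PASS' or 'VERDICT: FAIL'.\n\n"
        f"Issue:\n{issue}\n\nPatch:\n{patch}\n\nTests:\n{tests}\n"
    )
    verdicts = [llm(prompt) for _ in range(n_runs)]       # a few independent simulated runs
    passes = sum("VERDICT: PASS" in v for v in verdicts)
    return passes / n_runs                                # averaged predicted pass rate, used for ranking
```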
Step E: LSP-powered IDE-level navigation
- What happens: The agent calls lsp_tool functions (go to definition, find references, call hierarchy). This is distilled into SWE-Master so it uses IDE-grade navigation at inference.
- Why this step exists: Grep is noisy; LSP is precise. This cuts wasted tokens, speeds localization, and reduces hallucinated edits.
- Example: Instead of scrolling for "swap_dims", the agent jumps to its exact definition and references, fixes the true root cause, and submits earlier.
- Secret sauce 5: LSP tools delivered efficiency gains (23-24% fewer input tokens; ~17% shorter trajectories) while maintaining the same success rate.
Optional manager: Summary-based context
- What happens: Periodically compress long histories into concise summaries, keeping a small raw sliding window.
- Why: Maintains long-term memory without blowing past context limits; helpful for bigger foundation models.
- Example: For GLM-4.7 and M2.1, token use dropped 36-42% with similar accuracy.
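A minimal sketch of this manager is below: fold everything older than a small sliding window into one summary line. The `summarize` argument stands in for an LLM summarization call; the default here is a trivial placeholder.

```python
# Sketch of summary-based context management: one running summary + a raw sliding window.

def compress_history(turns, keep_last=4,
                     summarize=lambda older: "SUMMARY: " + " | ".join(t[:40] for t in older)):
    if len(turns) <= keep_last:
        return list(turns)                    # short histories pass through untouched
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(older)] + recent        # compact memory of the past + recent raw turns

history = [f"turn {i}: opened a file and ran tests" for i in range(10)]
print(compress_history(history))  # ['SUMMARY: ...', 'turn 6: ...', 'turn 7: ...', 'turn 8: ...', 'turn 9: ...']
```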
Full pipeline recipe:
- Spin up decoupled Docker sandboxes per issue
- Roll out teacher agents; keep only valid, mixed-difficulty solves
- Train long-horizon SFT with env-response masking (80K context)
- Run RL with GRPO mods, forced submission, and budget awareness
- At inference, enable LSP tools; generate K candidates; use SWE-World to pick best
- Enforce anti-git-hacking and security rules to keep evaluation honest
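As one example of the last item, anti-git-hacking can be enforced with a simple command guard before any shell call reaches the sandbox; the blocklist and matching rule below are illustrative, not the exact policy used in SWE-Master's environments.

```python
# Sketch of an anti-git-hacking guard: refuse shell commands that could reveal
# the ground-truth fix (e.g. `git log`, `git show`).
import shlex

BLOCKED_PREFIXES = {("git", "log"), ("git", "show")}

def is_allowed(command):
    tokens = shlex.split(command)
    return tuple(tokens[:2]) not in BLOCKED_PREFIXES

print(is_allowed("git show HEAD"))     # False: would leak the reference patch
print(is_allowed("pytest tests/ -x"))  # True: normal test execution is fine
```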
04 Experiments & Results
The test: SWE-bench Verified (500 real, solvable GitHub issues across 12 Python repos) measures the true end-to-end skill: explore repo → edit files → run tests → submit a patch that passes all F2P and P2P tests. Score = resolve rate (percent of issues fully fixed).
The competition: SWE-Master uses the same open base model (Qwen2.5-Coder-32B) and runs under identical settings when comparing SFT vs RL. It's matched against strong open-source agents and foundation models that report SWE-bench Verified numbers.
Scoreboard with context:
- Long-horizon SFT alone: 57.8% Pass@1. That's already a big leap over many open agents, showing that careful data curation + trajectory-level supervision gets you far.
- Add RL with real execution: 61.4% Pass@1. Think of this like going from a solid A- to a clean A: the model learns to explore deeper and finish confidently.
- Add TTS@8 with SWE-World: 70.8% Pass@1. This is like moving from an A to near A+/frontier territory, rivaling systems with far more parameters or closed tricks.
- Pass@8: 76.2%. With 8 parallel tries and an oracle-like chooser, the potential tops three-quarters solved, which is evidence of strong headroom when selection is ideal.
Where it shines:
- Efficiency from LSP tools: After distilling LSP navigation, SWE-Master keeps the same success rate (~61%) while cutting input tokens by ~24% and shortening average turns by ~17%. It acts more like a developer with a code-aware IDE than a user scrolling logs.
- TTS selection quality: The SWE-World simulator's accuracy reaches ~77.6% and tracks Pass@K closely, meaning extra compute is well converted into more solves, not wasted attempts.
- RL behavior: Tool usage shifts show more execute_bash and file_replace in RL vs SFT, indicating more active debugging and iterative refinement. Interaction turns skew higher (peaks near 120-140) as the RL model uses budget to verify and polish.
Surprising findings:
- Forced submission helps a lot. About 24% of time/turn-truncated runs actually contained a fix that would have passed, just not submitted. Scoring them prevented reward collapse and stabilized training.
- More turns ≠ guaranteed success. Across datasets, as turns go very high, success rates can drop due to compounding noise. The sweet spot is using enough steps to localize and verify, but not so many that you wander.
- Anti-git-hacking reduces temptation and even slightly improves scores, likely because the model wasn't trained to exploit these commands. Blocking them kept reasoning on track.
Comparisons (illustrative highlights from the paper's table):
- SWE-Master-32B-RL: 61.4% (Pass@1), 70.8% (TTS@8)
- daVinci-Dev-32B: 56.1%, daVinci-Dev-72B: 58.5%
- SWE-Compressor (32B SFT): 57.6%
- DeepSWE-32B-Preview: 42.2% (Pass@1), 59.0% (TTS@16)
- Several large foundation models report 60-74% but use internal scaffolds; SWE-Master hits 61.4% with 32B and scales at test time to 70.8% under open, reproducible settings.
Sequential vs parallel scaling:
- Sequential (more turns): Gains up to ~125 turns, then plateaus; RL scales better than SFT here.
- Parallel (more candidates): Both SFT and RL curves climb as K increases; TTS@K closely follows Pass@K, showcasing a strong verifier.
Takeaway: Careful SFT curation + RL with shaped rewards + LSP navigation + TTS selection together unlock strong, reproducible performance that competes with bigger or closed systems, while also being more efficient and transparent.
05 Discussion & Limitations
Limitations (be specific):
- Language coverage: Experiments are Python-focused (Pyright LSP). Other languages need swapping in the right language server and re-validation.
- Compute and infra: Training long-horizon SFT and RL on a 32B model plus 13k Docker images demands serious GPU/CPU, storage, and orchestration.
- Reward sparsity: Even with forced submission, binary rewards can be sparse; subtle partial credit beyond pass/fail remains under-explored.
- Verifier fidelity: SWE-World achieves strong but not perfect agreement with real execution. Rare mis-rankings can still pick suboptimal patches.
- Context handling on smaller models: The summary-based manager helped large foundation models, but didn't yield clear efficiency gains on SWE-Master yet.
Required resources:
- GPUs for 32B SFT/RL, large batch sizes, and extended context (80-128K).
- Multiple CPU nodes hosting Docker images (~13,000), fast storage, and reliable networking.
- A disciplined data pipeline for rollouts, filtering, and logging; robust monitoring for container errors and timeouts.
When NOT to use:
- Tiny repos or single-file tasks: A simpler code model may be cheaper and just as good.
- Live production environments: Prefer offline sandboxes; real infra may be risky for trial-and-error.
- Severe compute limits: If you can't afford multi-epoch SFT, RL rollouts, or TTS, aim for smaller configs or pure SFT with fewer contexts.
- Closed toolchains: If LSP or Docker cannot be deployed, expect lower efficiency and more brittle behavior.
Open questions:
- Better intermediate rewards: Can we score partial progress (e.g., passing subsets of tests) without reward hacking?
- Cross-language, polyglot agents: How well do LSP tools scale across mixed-language monorepos?
- Memory and planning: What's the best blend of summaries, retrieval, and structured scratchpads for long debugging sessions?
- Verifier evolution: Can we push simulated evaluation closer to real test fidelity, or mix in cheap real checks safely?
- Safety and security: How to systematically prevent data leakage or unsafe commands across varied environments?
06 Conclusion & Future Work
3-sentence summary: SWE-Master shows that with the right post-training recipe (curated long-horizon SFT, stable RL in real sandboxes, and smart test-time selection), open models can become strong, efficient software engineering agents. Adding IDE-style LSP tools lets agents navigate codebases semantically, cutting tokens and time without hurting success rates. The whole pipeline is open and reproducible, enabling the community to learn, verify, and improve.
Main achievement: A transparent, end-to-end framework that lifts a 32B open model to 61.4% Pass@1 on SWE-bench Verified and 70.8% with TTS@8, while demonstrating practical efficiency gains via LSP navigation.
Future directions: Expand beyond Python with plug-and-play language servers, refine reward shaping with safe partial credits, strengthen simulated verifiers, and mature context management for small and mid-size models. Explore richer safety rules and better anti-shortcut protections across diverse repos.
Why remember this: SWE-Master is a blueprint for turning "code LLMs" into real repo-level problem solvers: open, testable, and efficient. It proves that data curation, stable RL, smart inference, and IDE-aware tools work best together, and sets a clear, reproducible path the community can follow and surpass.
Practical Applications
- Automated bug triage and repair bot that proposes verified patches for CI pipelines.
- Onboarding assistant that helps developers navigate large legacy repos with LSP jumps and call graphs.
- Refactoring helper that safely updates APIs across many files and runs tests to prevent regressions.
- Dependency upgrade assistant that edits code, updates configs, and verifies stability with pytest.
- Test authoring copilot that writes fail-to-pass tests, localizes issues, and proposes minimal fixes.
- Code review partner that suggests targeted diffs and explains impact via references and call chains.
- Rapid hotfix generator that prepares candidate patches and ranks them safely before production rollout.
- Repository explorer for security teams to trace vulnerable functions and patch call sites precisely.
- Education tool that shows students full debugging traces, from localization to verified fixes.
- Benchmark and research bed for studying long-horizon RL, verifiers, and tool-use in code agents.