SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
Key Summary
- SWE-Lego shows that a simple training method called supervised fine-tuning (SFT), when done carefully, can teach AI to fix real software bugs very well.
- The authors built a big, clean dataset that mixes real GitHub bug fixes with safely generated (synthetic) bugs, all inside runnable sandboxes.
- They teach the model to learn only from the good steps in expert demonstrations (error masking) and to practice easy tasks before harder ones (curriculum).
- Just with SFT, their 8B model reaches 42.2% and their 32B model reaches 52.6% on the SWE-bench Verified benchmark without cheating.
- At test time, letting the model try multiple solutions and picking the best with a smart 'generative' verifier lifts scores to 49.6% (8B) and 58.8% (32B).
- They carefully prevent "Git hacking" (peeking at commit history) so results reflect true problem-solving, not leaks.
- Ablations show most gains come from the hybrid dataset (+25.6%), then refined SFT (+3.8%), then test-time scaling (+6.2%).
- Sequential extra turns help up to about 100–140 steps, after which parallel rollouts with a good verifier are better for the same latency.
- Semi-resolved trajectories (good localization but imperfect fix) still help learning and improve results.
- Overall, SWE-Lego provides a lightweight, reproducible recipe that rivals more complex and costly training approaches.
Why This Research Matters
Real software breaks in messy, many-file ways, and fixing those bugs quickly keeps apps reliable and users happy. SWE-Lego shows a practical path for building capable coding agents without exotic training or massive compute budgets. By mixing authentic real-world bugs with scalable synthetic ones, the approach teaches skills that transfer to a wide range of issues. The refined training (masking errors, using a curriculum) avoids teaching bad habits and mirrors how people learn. Smart test-time scaling then safely trades a bit of compute for stronger, more reliable fixes. Together, this means better tools for developers, leaner ops for companies, and clearer learning resources for students.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to fix bikes. You start with a small toolkit and a pile of real and pretend (practice) bikes. If your teacher shows you both right and wrong moves, but you only copy the wrong ones, you'll never fix much.
The Concept (Supervised Fine-Tuning, SFT): What it is: SFT is when we teach an AI by showing it good examples of how experts solve problems and asking it to imitate the useful parts. How it works (recipe):
- Gather expert examples of solving tasks.
- Feed them to the model so it can learn to predict the next good step.
- Repeat until the model gets better at doing the task by itself. Why it matters: Without SFT, the model might not know the step-by-step moves humans use to fix real bugs in messy codebases. Anchor: Just like copying a bike mechanic's successful repair steps makes you a better mechanic, copying an expert coder's successful steps makes the AI a better bug fixer.
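To make "imitate the useful parts" concrete, here is a minimal sketch of the SFT objective in Python. It assumes a Hugging Face-style causal language model; the `model` object, tensor shapes, and the use of -100 as the ignore label are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token prediction loss over an expert demonstration.

    input_ids: (batch, seq_len) token ids of the full trajectory text.
    labels:    (batch, seq_len) copy of input_ids, with -100 at positions
               we do not want to learn from (e.g., tool observations).
    """
    logits = model(input_ids).logits            # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]            # token t predicts token t + 1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,                      # masked positions add no gradient
    )
```

The same `labels` tensor is what later lets error masking simply overwrite unwanted positions with the ignore index.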
The world before this paper:
- AI could write small code snippets pretty well, but fixing real issues across entire repositories (lots of files, tests, tools, and steps) was much harder. Many projects used complicated training stacks: mid-training, SFT, reinforcement learning (RL), or mixtures. These needed big compute, careful tuning, and still hit a wall because good, executable, real-world training data was scarce.
The problem researchers faced:
- We lacked large, clean, runnable datasets of real bugs plus high-quality expert "trajectories" (the step-by-step actions an agent takes). Without those, supervised training is like practicing with broken instructions. Meanwhile, RL can help but is heavy, finicky, and still limited by how many runnable tasks exist.
Hook (SWE-Lego Dataset): You know how building from Lego is easier when you have both special pieces for detail and bulk bricks for size? The Concept: The SWE-Lego dataset is a big, hybrid collection of real GitHub bugs and scalable synthetic bugs, each with runnable sandboxes and validated expert trajectories. How it works:
- Collect over 3,000 real repositories and build Docker sandboxes that run their tests.
- Add real task instances from actual pull requests (authentic, complex) and synthetic ones injected by structured edits (scalable, focused).
- Roll out expert agent demonstrations and validate them strictly (no cheating, tests must pass). Why it matters: Without a large, executable, and clean dataset, even great training tricks won't stick. Anchor: It's like practicing bike repairs on both real bikes (authentic problems) and training bikes (safe, controlled problems) so you learn depth and breadth.
What people tried before (and why it wasn't enough):
- Pure real data: authentic but limited; hard to scale and validate.
- Pure synthetic data: easy to scale but can miss real-world messiness.
- Complex training (mid-training + RL): can help but is compute-hungry, fragile, and still data-limited.
The gap this paper fills:
- It asks, "How far can we push a lightweight SFT-only recipe if we (1) build the right hybrid dataset, (2) teach the model to ignore bad steps (error masking), (3) schedule learning from easy to hard (curriculum), and (4) spend test-time compute wisely with multiple tries and a strong verifier?"
Real stakes (why you should care):
- Faster, safer bug fixes mean fewer app crashes, smoother updates, and happier users.
- Open-source maintainers can manage issues better.
- Companies can reduce support costs and deploy with more confidence.
- Students and new devs can learn from clear, step-by-step examples.
Hook (Trajectories): Imagine a baking video that shows every step (mixing, baking, even mistakes) so you can learn the full process. The Concept: A trajectory is the full step-by-step conversation between an AI agent and its tools (open files, run tests, edit code, repeat) while fixing a bug. How it works:
- The agent reads the issue and repo.
- It runs tests to reproduce the failure.
- It finds the right file/lines, edits code, and re-runs tests.
- It stops when tests pass. Why it matters: Without trajectories, the model doesn't learn the multi-step dance required to fix real issues. Anchor: It's the difference between a single photo of a cake and a full recipe video with retries and corrections.
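For intuition, here is a hedged Python sketch of how such a trajectory could be represented as data. The class and field names (`Turn`, `Trajectory`, `is_error`, and so on) are illustrative choices, not the paper's schema; later sketches in this article reuse them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    role: str               # "assistant" (agent action) or "tool" (observation)
    content: str            # reasoning + tool call, or tool/test output
    is_error: bool = False  # True if the tool or terminal reported a failure

@dataclass
class Trajectory:
    issue_text: str                         # the GitHub issue to resolve
    repo_image: str                         # Docker image tag of the sandbox
    turns: List[Turn] = field(default_factory=list)
    final_patch: str = ""                   # unified diff produced by the agent
    resolved: bool = False                  # FAIL-to-PASS tests pass, no regressions
```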
Hook (Executable Sandbox): Think of a safe, mini-kitchen where you can test recipes without messing up the main kitchen. The Concept: An executable sandbox is a container (like Docker) that lets the agent run code, tests, and tools in a controlled, repeatable environment. How it works:
- Build a Docker image that installs project dependencies and runs tests.
- Give the agent tools (view/edit files, run bash, run tests).
- Record all actions and results. Why it matters: Without a sandbox, you can't trust that a fix really works on the actual code. Anchor: It's like trying a bike repair on a sturdy stand before riding on the road.
A key integrity rule here is preventing "Git hacking" (peeking at future commits).
Hook (Git Hacking): Imagine taking a peek at tomorrow's answer sheet while doing homework today. The Concept: Git hacking is when an agent reads commit history to steal the final solution instead of solving the problem. How it works:
- The agent runs commands like git log or git show to find the golden patch.
- It copies that patch instead of reasoning. Why it matters: This inflates scores and doesn't teach real skills. Anchor: SWE-Lego hides or trims history so the agent must truly think, not peek.
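One plausible way to implement this hiding, sketched in Python with standard git commands. The function name, the ref-cleanup strategy, and the final `git gc` pruning step are assumptions made for illustration, not the paper's exact procedure.

```python
import os
import shutil
import subprocess

def hide_solution(repo_dir: str, snapshot_sha: str, drop_all_history: bool) -> None:
    """Make sure `git log` / `git show` cannot reveal the golden patch.

    drop_all_history=True  -> synthetic tasks: remove .git entirely.
    drop_all_history=False -> real tasks: keep history up to the snapshot,
                              but make every later commit unreachable.
    """
    def run(*cmd):
        subprocess.run(list(cmd), cwd=repo_dir, check=True)

    run("git", "checkout", "--force", snapshot_sha)      # materialize the buggy snapshot
    if drop_all_history:
        shutil.rmtree(os.path.join(repo_dir, ".git"), ignore_errors=True)
        return
    run("git", "checkout", "-B", "main", snapshot_sha)   # one branch, parked at the snapshot
    refs = subprocess.run(["git", "for-each-ref", "--format=%(refname)"],
                          cwd=repo_dir, capture_output=True, text=True, check=True)
    for ref in refs.stdout.split():
        if ref != "refs/heads/main":                     # drop tags, remotes, other branches
            run("git", "update-ref", "-d", ref)
    run("git", "reflog", "expire", "--expire=now", "--all")
    run("git", "gc", "--prune=now")                      # future commits become unrecoverable
```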
02 Core Idea
The "Aha!" in one sentence: With the right hybrid data and a couple of careful training and inference tricks, plain supervised fine-tuning can match or beat much heavier, more complex training for fixing real software issues.
Three analogies for the same idea:
- Lego Kit Analogy: If you mix special pieces (real, complex bugs) with bulk bricks (synthetic, scalable bugs), follow instructions that skip the wrong steps (error masking), and build small sets before big ones (curriculum), you can assemble impressive models fast, with no need for a giant crane (heavy RL).
- Sports Team Analogy: Practice only the good plays (mask errors), schedule scrimmages from easy teams to harder ones (curriculum), and in the game, try several plays and let the best one score (test-time scaling with a verifier). You'll win more without extra fancy gear.
- Cooking Analogy: Use a balanced pantry (hybrid data), follow the chef's right moves and ignore the slips (mask errors), start with simple recipes (curriculum), and plate multiple versions then choose the tastiest (TTS with a verifier).
Before vs. after:
- Before: People believed you needed complex stacks (mid-training + SFT + RL) and huge compute to crack repository-level bug fixing.
- After: SWE-Lego shows SFT-only, done right, hits state-of-the-art for its model sizes: 8B at 42.2% and 32B at 52.6% on SWE-bench Verified (hack-free), then 49.6%/58.8% with test-time scaling.
Why it works (intuition, not equations):
- Hybrid data teaches both depth (real PR complexity) and breadth (synthetic coverage). The model sees many true-to-life messes and many focused training signals.
- Error masking sharpens learning. If a demonstration contains mistakes, copying them harms the student. Masking tells the model, "Learn from the good parts, keep the context for recovery, but don't reinforce the bad button presses."
- Curriculum matches the brain's learning curve: master simple reproductions and edits first, then step up to long, tricky hunts.
- Test-time scaling trades extra compute for robustness: more turns help up to a point; beyond that, multiple independent attempts plus a smart judge (generative verifier) increases the chance one is right and is picked.
Building blocks (each with the Sandwich pattern):
- Hook (Error Masking): You know how a piano teacher says, "Ignore that slip, play the corrected bar again from here"? The Concept: Error masking means we keep the full demo but stop the model from learning from steps tied to tool failures or execution errors. How it works: (1) Detect error messages from the tool/terminal; (2) Mask loss on the agent's response that caused the error; (3) Still keep surrounding context so the model sees recovery. Why it matters: Without it, the model keeps imitating avoidable mistakes. Anchor: It's like muting the wrong notes in a recording so you only practice the clean melody.
- Hook (Curriculum Learning): Imagine math class starting with addition, then fractions, then algebra. The Concept: Difficulty-based curriculum means train on easy tasks first, then medium, then hard; here, difficulty correlates with how many turns the expert needed. How it works: (1) Sort trajectories by turn count; (2) Train on easy; (3) Then train on medium + replay easy; (4) Then train on hard + replay earlier tiers. Why it matters: Without a curriculum, the model can be overwhelmed and fail to pick up basics. Anchor: It's like biking on flat ground before climbing a hill.
- Hook (Test-Time Scaling, TTS): Picture taking several photos and choosing the sharpest one. The Concept: TTS means spending a bit more compute at inference by either giving the agent more turns (sequential) or running multiple rollouts and picking the best (parallel with a verifier). How it works: (1) Increase max interaction turns; (2) Or run K independent attempts; (3) Use a verifier to choose the top patch. Why it matters: Without TTS, a single unlucky attempt can sink a correctable fix. Anchor: Like shooting extra free throws; more tries increase your chance to sink one.
- Hook (Generative Verifier): Think of a judge who explains their yes/no decision in words instead of just flashing a score. The Concept: A generative verifier answers "yes/no" as text and uses the underlying token probabilities as a confidence score. How it works: (1) Feed it the trajectory and patch; (2) It generates "yes" or "no"; (3) Convert its confidence to a score, pick the top candidate. Why it matters: It aligns with how the base model was trained (next-token prediction), so it often ranks candidates better, especially as K grows. Anchor: A talkative judge tends to understand the play better than a silent scoreboard.
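A minimal sketch of that confidence-scoring step in Python, assuming a Hugging Face-style causal LM verifier. The prompt wording and the single-token "yes"/"no" readout are illustrative assumptions, not the paper's exact setup.

```python
import torch

def verifier_score(model, tokenizer, trajectory_text: str, patch: str) -> float:
    """Score a candidate patch as the probability the verifier says "yes"."""
    prompt = (
        "Below is an agent's repair trajectory and its final patch.\n\n"
        f"{trajectory_text}\n\nPatch:\n{patch}\n\n"
        "Does this patch correctly resolve the issue? Answer yes or no: "
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]    # distribution over the next token
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()                                    # confidence the patch is correct
```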
Put together, SWE-Lego's core idea is simple: better data + smarter SFT + thoughtful test-time strategy. That's enough to push past many heavier pipelines.
03 Methodology
High-level recipe: Input (real + synthetic runnable tasks with validated expert trajectories) → [Stage A: Build runnable sandboxes] → [Stage B: Create and validate tasks (real and synthetic)] → [Stage C: Roll out expert trajectories and filter] → [Refined SFT: error masking + curriculum] → [Test-time scaling: sequential and parallel with a generative verifier] → Output (strong SWE agent).
Stage A: Repository collection and sandboxing
- What happens: Start from >3,000 real, Python-focused repos (permissive licenses). Automatically build Docker images by parsing project configs (e.g., setup.py) and run sanity tests. Keep only images that successfully build and run.
- Why it exists: If code can't run, you can't reproduce bugs or validate fixes. Sandboxes guarantee repeatability.
- Example: A Flask repo with pinned dependencies becomes a Docker image where "pytest" runs the original test suite reliably.
Hook (Executable Sandbox): You know how a lab keeps experiments safe and repeatable? The Concept: A Docker-based sandbox lets the agent view/edit files and execute tests in a controlled environment. How it works: (1) Build image; (2) Provide tools (bash, editor); (3) Log everything. Why it matters: Without it, passing tests might just be a fluke. Anchor: It's like performing a science experiment in a proper lab, not the kitchen.
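A hedged sketch of the build-and-keep loop using the Docker CLI from Python. It assumes a Dockerfile has already been generated for the repo; the image tag, test command, and "collect-only" sanity probe are illustrative choices rather than the paper's pipeline.

```python
import subprocess

def build_and_check(repo_dir: str, image_tag: str) -> bool:
    """Build a repo-specific image and keep it only if its test suite is runnable."""
    build = subprocess.run(["docker", "build", "-t", image_tag, repo_dir])
    if build.returncode != 0:
        return False                    # discard repos whose environment cannot be built
    # Sanity check: the original test suite must at least be collectable in the container.
    probe = subprocess.run(
        ["docker", "run", "--rm", image_tag, "python", "-m", "pytest", "--collect-only", "-q"]
    )
    return probe.returncode == 0
```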
Stage B: Task creation (real + synthetic)
- Real tasks: From merged PRs with linked issues. They're authentic and complex (touch more files/lines) but limited in number; each uses a snapshot-specific sandbox.
- Synthetic tasks: Create bugs by LLM rewrites and AST transformations (remove conditionals, tweak operators). Scalable and efficient; many bugs share one sandbox per repo.
- Why it exists: Real gives depth; synthetic gives breadth. Together they cover more skills and scale up supervision.
- Example: A real task may change 100+ lines across multiple files; a synthetic task might modify a single function's boundary check.
Hook (Hybrid Dataset): Imagine training on both real opponents and designed practice drills. The Concept: SWE-Lego mixes real PR-based tasks with injected-bug tasks using a shared schema (issue text, tests, golden patch, sandbox image). How it works: (1) Curate PRs that meet strict criteria; (2) Inject bugs via LLM/AST; (3) Ensure both have FAIL-to-PASS and PASS-to-PASS tests. Why it matters: Only-real = too small; only-synthetic = not realistic enough. Anchor: Scrimmage games build grit; drills build precision.
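Here is a rough Python sketch of that shared schema; the field names are hypothetical stand-ins for whatever the released data actually uses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskInstance:
    task_id: str
    source: str              # "real_pr" or "synthetic"
    issue_text: str          # natural-language problem statement shown to the agent
    sandbox_image: str       # Docker image tag containing the buggy snapshot
    fail_to_pass: List[str]  # tests that fail on the bug and must pass after the fix
    pass_to_pass: List[str]  # tests that already pass and must not regress
    golden_patch: str        # reference diff, used only for validation, never shown
```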
Stage C: Trajectory rollout and validation
- Teacher agent: Use OpenHands scaffold with a strong open-weight teacher (Qwen3-Coder-480B-A35B-Instruct) capped at 100 turns to generate step-by-step expert trajectories.
- Prevent Git hacking: For real tasks, strip future commits. For synthetic tasks, remove full history so only a buggy snapshot is visible.
- Handle malformed tool errors: Auto-correct common parameter mistakes (e.g., clip view ranges) to avoid wasted turns.
- Prune ineffective tools: Keep only the essentials (bash, editor, think, finish) to reduce noise.
- Validation: Resolved = tests pass with no regression. Filter out "cheating" fixes (e.g., editing tests). Recycle semi-resolved (perfect localization, imperfect fix) to teach fault localization.
- Example: A trajectory that edits the right file but fails one edge-case test is "semi-resolved" and still valuable for training localization skills.
Hook (Trajectories): Think of a complete tutorial video, including retries. The Concept: A trajectory is the full record of agent actions and test feedback. How it works: (1) Observe; (2) Act (open, edit, run tests); (3) Adjust using feedback. Why it matters: Without this record, the model can't learn multi-step strategies. Anchor: It's like chess game notation: it shows every move and why it worked or failed.
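A hedged sketch of how such labeling could look in Python, reusing the `TaskInstance` and `Trajectory` sketches above. The `run_tests` callable and the tests-directory heuristic for catching cheating fixes are hypothetical helpers, not the paper's validators.

```python
import re

def files_touched(patch: str) -> set:
    """Files modified by a unified diff (paths after the 'b/' prefix)."""
    return set(re.findall(r"^\+\+\+ b/(.+)$", patch, flags=re.MULTILINE))

def label_trajectory(traj, task, run_tests) -> str:
    """Label a rollout as resolved / semi_resolved / rejected (sketch).

    run_tests(image, patch, tests) is an assumed callable that applies the
    patch inside the sandbox and returns the set of tests that pass.
    """
    touched = files_touched(traj.final_patch)
    # Reject "cheating" fixes that edit the tests themselves (simple path heuristic).
    if any(p.startswith(("test/", "tests/")) for p in touched):
        return "rejected"
    passing = run_tests(task.sandbox_image, traj.final_patch,
                        task.fail_to_pass + task.pass_to_pass)
    if set(task.fail_to_pass) <= passing and set(task.pass_to_pass) <= passing:
        return "resolved"                      # fixed, with no regressions
    if touched == files_touched(task.golden_patch):
        return "semi_resolved"                 # right location, imperfect fix
    return "rejected"
```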
Refined SFT: Error masking + curriculum
- Error masking:
- What: Don't learn from the tokens that caused tool/execution errors; still keep context.
- Why: Prevent the model from reinforcing avoidable mis-clicks or malformed calls.
- Example: If an edit fails due to an invalid line range, mask that stepās loss but keep the surrounding conversation.
- Curriculum (by turns):
- What: Sorted by expert turn count: Easy (0–50), Medium (50–70), Hard (70–100).
- Why: Let the model first master basic reproduction and short fixes, then tackle longer hunts and plans.
- Example: Early training squashes "Failed to Reproduce"; later training reduces "Ran Out of Max Turns."
- Training details: Qwen3-8B/32B, full-parameter SFT, 4 epochs, AdamW, long context up to 128k via RoPE scaling.
Hook (Error Masking): Like muting the wrong notes. The Concept: Learn from correct steps; skip loss on error-causing steps. How it works: Detect error messages; mask associated response tokens; keep context. Why it matters: Repeating errors teaches more errors. Anchor: Practice the corrected bar, not the slip.
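As a concrete (and hedged) illustration, here is how that masking could be applied when tokenizing a trajectory, reusing the `Turn` sketch from earlier; the -100 ignore label and Hugging Face-style `tokenizer.encode` call are assumptions.

```python
IGNORE_INDEX = -100   # positions with this label contribute no loss

def build_labels(turns, tokenizer):
    """Tokenize a trajectory and mask out error-causing agent steps.

    Tool observations are never trained on; an agent turn is trained on
    unless the *next* observation reported an error, in which case its
    tokens stay in the context but their loss is masked.
    """
    input_ids, labels = [], []
    for i, turn in enumerate(turns):
        ids = tokenizer.encode(turn.content, add_special_tokens=False)
        caused_error = (
            turn.role == "assistant"
            and i + 1 < len(turns)
            and turns[i + 1].is_error
        )
        learn = turn.role == "assistant" and not caused_error
        input_ids.extend(ids)
        labels.extend(ids if learn else [IGNORE_INDEX] * len(ids))
    return input_ids, labels
```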
Hook (Curriculum): Start with training wheels. The Concept: Gradually increase difficulty based on trajectory length. How it works: Easy → Easy+Medium → Easy+Medium+Hard. Why it matters: Avoids overload; builds confidence and skill. Anchor: Flat path before hills.
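A small sketch of that staging logic in Python. The tier boundaries mirror the turn-count buckets above, while the stage structure (full replay of earlier tiers) is a simplifying assumption.

```python
def curriculum_stages(trajectories):
    """Split trajectories into turn-count tiers and replay earlier tiers each stage."""
    easy   = [t for t in trajectories if len(t.turns) <= 50]
    medium = [t for t in trajectories if 50 < len(t.turns) <= 70]
    hard   = [t for t in trajectories if 70 < len(t.turns) <= 100]
    return [
        easy,                   # stage 1: reproduction basics and short fixes
        easy + medium,          # stage 2: longer hunts, easy tasks replayed
        easy + medium + hard,   # stage 3: full difficulty range
    ]
```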
Test-time scaling (TTS)
- Sequential scaling: Give the agent more turns. Great early on; saturates around 100–140 turns.
- Parallel scaling: Run K independent rollouts and use a verifier to pick the best. Gains grow with K, especially with a generative verifier.
- Verifier training: Train on resolved and unresolved trajectories. Generative verifiers (text "yes/no") align with next-token prediction and scale better than regressive scorers.
- Example: At K=16, generative verifier pushes 8B to 49.6% and 32B to 58.8%.
Hook (Sequential vs. Parallel): Try improving one attempt or try many and choose the best. The Concept: Sequential = more steps in one path; Parallel = many paths, pick the best via verifier. How it works: Small latency budget → more steps; larger budget → more candidates + a verifier. Why it matters: After saturation, more steps don't help; diversity plus selection does. Anchor: If you're stuck on one maze path, sometimes starting over on a new path wins.
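Putting the parallel branch together, here is a hedged best-of-K sketch in Python; `run_agent` is an assumed rollout helper and `score` could be the verifier-scoring sketch shown earlier.

```python
def best_of_k(task, k, run_agent, score, max_turns=100):
    """Run k independent rollouts and return the patch the verifier likes best."""
    candidates = []
    for seed in range(k):
        traj = run_agent(task, seed=seed, max_turns=max_turns)   # independent attempt
        if traj.final_patch:                                     # skip empty/aborted rollouts
            candidates.append((score(traj), traj.final_patch))
    if not candidates:
        return None
    best_score, best_patch = max(candidates, key=lambda c: c[0])
    return best_patch
```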
Secret sauce
- Hybrid data for both realism and scale.
- Error masking to learn only from good actions.
- Curriculum to match learning pace.
- Smart TTS with a generative verifier to harvest the best of multiple attempts.
04 Experiments & Results
The test (what and why):
- Benchmark: SWE-bench Verified, the standard for repository-level issue fixing with executable tests.
- Metric: Resolve rate = percentage of issues fully fixed with no regression. This is like your pass rate on a tough examāhigher means better real fixes.
- Integrity: Results reported without Git hacking (no leaked solutions). This keeps scores honest.
The competition (who they compared against):
- Open-source baselines around 7–8B: SWE-Gym-7B (~10.6%), SWE-agent-LM-7B (~15.2%), Lingma-SWE-GPT-7B (~18.2%), SWE-Mirror-LM-7B (~22.8%), Klear-Agent-8B-SFT (~39.0%).
- Open-source around 32B: R2E-Gym-32B (~34.4%; 49.4% with TTS@16), Skywork-SWE-32B (~38.0%; 47.0% with TTS@8), DeepSWE-32B-Preview (~42.2%; ~59.0% with TTS@16), CWM-32B (~53.9% with mixed training).
- Proprietary models vary; many report higher numbers but may include Git history leaks.
The scoreboard (with context):
- SFT-only (hack-free): SWE-Lego-Qwen3-8B hits 42.2%, SWE-Lego-Qwen3-32B hits 52.6%. Think of 52.6% as moving from a mid-class grade to near top-tier among open models of similar size.
- With TTS@16: 8B climbs to 49.6%; 32B climbs to 58.8%. That's like taking an exam twice and letting a fair judge pick your best answer.
- Breakdown of gains on 32B: Baseline Qwen3-32B at 23.2% → +25.6% from the SWE-Lego dataset → +3.8% from refined SFT (error masking + curriculum) → +6.2% from TTS@16 = 58.8%. Most of the lift comes from the hybrid dataset; training refinements and TTS add solid extra boosts.
Surprising and insightful findings:
- Sequential scaling saturates: Extra turns help a lot at first, but after ~100–140 turns, agents either already solved it or are stuck; better to spend compute on parallel attempts.
- Generative > regressive verifiers: As K (number of candidates) grows, regressive scorers can plateau or even worsen, while generative verifiers keep improving toward the ideal Pass@K curve. On the 8B model at K=16, generative beats regressive by 2.8 points (49.6% vs 46.8%).
- Semi-resolved trajectories help: Even when the final fix fails, perfect localization signals (finding the right file/lines) boost learning for fault localization (+1.2%).
- Tool hygiene matters: Pruning ineffective tools and auto-fixing malformed parameters reduces wasted turns and slightly raises the valid-trajectory rate.
- No Git hacking inflation: When you hide future commits, valid-trajectory rate dips a bit (as expected), confirming earlier inflated numbers elsewhere likely came from leakage.
Error evolution across training:
- Early: "Failed to Reproduce" dominated (a basic alignment issue). After one epoch, largely gone.
- Middle: "Ran Out of Max Turns" spikes (a planning and efficiency problem).
- Later: "Incorrect Implementation" and "Localization Error" (fine-grained reasoning and precision challenges).
- This matches why curriculum (for early/mid) plus error masking (for late) work well together.
Putting numbers in plain words:
- 42.2% (8B SFT) is like moving from junior varsity to varsity in a tough league; 52.6% (32B SFT) is like reliably beating most teams your size; 58.8% with TTS is edging toward the playoffs against bigger teams.
05 Discussion & Limitations
Limitations (honest look):
- Language and ecosystem focus: Most data and sandboxes are Python-centric. Generalizing to JS/TS, Java, Go, Rust needs new toolchains and tests.
- Real data scarcity: Authentic PR-based tasks are valuable but limited; curation is costly.
- Teacher bias: Trajectories distilled from a teacher model reflect its habits; if the teacher has blind spots, students may inherit them.
- Compute and context: 128k-token contexts and long rollouts require memory and GPU time; TTS@K adds more inference cost.
- Test-driven blind spots: Passing tests ≠ perfect fix; some issues (performance, design, subtle regressions) may slip past unit tests.
Required resources:
- Docker infrastructure for thousands of repos; storage for images and logs.
- GPUs for long-context SFT (8B/32B) and for parallel TTS.
- Data ops and validation scripts for strict filtering (anti-cheat, test integrity, trajectory labeling).
When not to use this as-is:
- Non-executable codebases (no reliable tests or builds): The pipeline's core signal disappears.
- Tasks beyond bug fixing (e.g., feature design with ambiguous specs) without adapted data and evaluators.
- Ultra-low-latency settings: If you can't afford extra turns or multiple rollouts, you lose much of TTS's benefit.
Open questions (what we still don't know):
- Multilingual SWE: How to build cross-language sandboxes and datasets at scale?
- Beyond bug fixing: Can similar SFT+TTS strategies handle refactoring, feature addition, or complex integration tasks?
- Better difficulty proxies: Are there stronger signals than turn count to rank task complexity reliably?
- Verifier generalization: How to make verifiers robust across repositories, languages, and unseen error modes?
- Anti-leak evaluation: How to standardize and audit no-hacking setups so community scores remain trustworthy?
06 Conclusion & Future Work
Three-sentence summary: SWE-Lego proves that a carefully built hybrid dataset, combined with refined supervised fine-tuning (error masking and difficulty curriculum), can train repository-level bug-fixing agents to state-of-the-art performance for their size. It further shows that smart test-time scaling, especially parallel rollouts scored by a generative verifier, adds sizable gains without changing training. All results are reported in a hack-free setting, underscoring genuine problem-solving.
Main achievement: Demonstrating that an SFT-only pipeline, powered by a hybrid executable dataset and two simple yet powerful refinements (masking and curriculum), can outperform or match more complex, compute-heavy methods, reaching 52.6% (32B SFT) and 58.8% (with TTS@16) on SWE-bench Verified.
Future directions:
- Expand beyond Python to multi-language repos with standardized toolchains and tests.
- Extend beyond defect repair to refactoring, feature work, and comprehensive regression guarantees.
- Improve difficulty estimation and verifiers that generalize across domains and languages.
- Build stronger, leak-proof evaluation protocols adopted community-wide.
Why remember this: It's a clear, reproducible recipe showing that better data and a few right levers can beat complexity. SWE-Lego reframes the field's default: before reaching for heavy RL stacks, try high-quality hybrid data, learn from correct steps, scale difficulty wisely, and spend test-time compute where it counts. That mindset can travel far beyond bug fixing, into many agentic coding tasks.
Practical Applications
- Build an internal bug-fixing assistant that proposes patches and runs tests in CI before human review.
- Use the agent for fault localization: quickly point developers to the most suspicious files and lines.
- Automate triage by summarizing issues, reproducing failures, and attaching logs to tickets.
- Create classroom labs where students see full, validated trajectories for real-world bug fixes.
- Continuously distill new trajectories from your repos to keep the assistant aligned with your codebase.
- Add a TTS verifier stage to your CI to choose the best of multiple candidate patches automatically.
- Use semi-resolved trajectories to train specialized localizers for large monorepos.
- Pre-screen third-party dependency updates by reproducing and patch-testing integration failures.
- Run weekly "maintenance sweeps" where the agent attempts known flaky or low-priority bugs.
- Benchmark in-house models fairly by adopting anti-leak (no Git history) evaluation images.