SWE-World: Building Software Engineering Agents in Docker-Free Environments
Key Summary
- SWE-World lets code-fixing AI agents practice and learn without heavy Docker containers by using smart models that pretend to be the computer and tests.
- A lightweight sandbox handles file browsing and editing exactly, while two learned models simulate program output (SWT) and final unit-test results (SWR).
- This keeps the usual agent loop intact (think, act, get feedback) but removes the biggest bottleneck: building and running dependency-heavy environments.
- Because it’s Docker-free, researchers can train on many more real GitHub issues and pull requests that don’t build cleanly in containers.
- Supervised training with SWE-World lifts Qwen2.5-Coder-32B from 6.2% to 52.0% resolution on SWE-bench Verified, 55.0% with RL, and 68.2% with test-time scaling.
- The transition model (SWT) predicts step-by-step execution logs, while the reward model (SWR) produces a realistic test report and a pass/fail decision.
- Reverse-reasoning chain-of-thought helps the reward model avoid being tricked, stabilizing reinforcement learning.
- Compared to Docker-based pipelines, SWE-World cuts infrastructure costs and speeds up iteration without sacrificing accuracy.
- Mixing SWE-World and real-Docker trajectories can improve training further, suggesting the simulator is both faithful and complementary.
- Test-time scaling with SWR selects the best patch among several tries, giving a big extra boost without running real tests.
Why This Research Matters
Better code agents mean faster bug fixes, fewer crashes, and smoother apps for everyone. By removing the need for heavy containers during training and selection, SWE-World opens this research to more teams, not just those with huge infrastructure. It also unlocks learning from many more real GitHub issues that don’t build inside Docker, increasing diversity and robustness. Developers gain tools that can propose and verify fixes quickly, saving time on repetitive debugging. Teams can iterate rapidly, using test-time scaling (TTS) to pick the best patch from several options. In the long run, this approach could bring low-cost, high-quality verification directly into IDEs and continuous integration pipelines.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re learning to ride a bike. You could practice on busy streets with real traffic (scary and slow), or first learn in a safe playground with cones that act like cars (faster and safer).
🥬 The Concept (Large Language Models, LLMs): What it is: LLMs are computer programs that read and write text like humans do. How it works: 1) They learn patterns from lots of text, 2) they predict likely next words, 3) they use tools and steps to solve tasks. Why it matters: Without LLMs, modern code agents can’t read issues, search files, or explain fixes.
🍞 Anchor: When you ask an AI, “Why is this test failing?”, it reads the error, checks the code, and suggests a fix—thanks to LLMs.
🍞 Hook: You know how some video games only run if you install lots of add-ons and drivers? Software projects are like that too.
🥬 The Concept (Docker-based Execution): What it is: Docker is like a sealed mini-computer for each project. How it works: 1) It installs all the right dependencies, 2) runs programs and tests safely, 3) returns real outputs. Why it matters: Without Docker, running large, complex repositories is unreliable. But building thousands of Docker images is slow, expensive, and brittle.
🍞 Anchor: If each homework needs a different calculator, Docker is like carrying a suitcase of custom calculators for every assignment.
🍞 Hook: Picture a coach who watches you practice, gives feedback, and grades your final performance.
🥬 The Concept (Agent–Environment Loop): What it is: An AI agent thinks, acts (edit files, run commands), then gets feedback from the environment. How it works: 1) The agent plans a step, 2) performs an action, 3) reads outputs or test results, 4) repeats until it submits a fix. Why it matters: Without step-by-step feedback, the agent can’t debug or know when it’s done.
🍞 Anchor: Fixing a bug might look like: “Open file → change code → run tests → read errors → fix again → submit.”
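To make the loop concrete, here is a minimal Python sketch of the think-act-feedback cycle. The `agent` and `environment` objects and their method names are hypothetical placeholders for illustration, not an actual framework API.

```python
# Minimal sketch of the agent-environment loop (hypothetical interfaces).
# The agent plans a step, acts, reads feedback, and repeats until it submits.

def run_episode(agent, environment, max_steps=30):
    observation = environment.reset()           # issue text + repo snapshot
    for _ in range(max_steps):
        action = agent.decide(observation)      # e.g., "view file", "run tests", "submit"
        observation = environment.step(action)  # file contents, logs, or test results
        if action.startswith("submit"):
            break
    return environment.final_reward()           # 1 if the fix passes, else 0
```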
The World Before: AI agents got very good at writing short programs but struggled with full repositories that need many libraries and custom setups. To train and judge agents, researchers used Docker so tests would run the same way every time. It worked—but at a cost. Each task needed its own container, images were big, builds broke often, and scaling to many tasks or repeated rollouts (especially for reinforcement learning) was painful.
The Problem: If training needs thousands of runs per task, and each run spins up a heavy container to execute programs and tests, training becomes slow, costly, and hard to maintain. Plus, many real GitHub issues are thrown away because the repos don’t build cleanly inside Docker, shrinking the data you can learn from.
Failed Attempts: Two paths existed. 1) Full agent-in-Docker: accurate but infrastructure-heavy. 2) Agentless pipelines (a fixed sequence of steps: locate the bug → fix → verify): lighter but less flexible, missing the agent’s exploratory power.
The Gap: We needed the best of both: keep the flexible agent loop, but ditch the heavy, brittle execution step—without losing meaningful feedback.
Real Stakes: Faster, cheaper, and bigger-scale training means friendlier coding tools that can fix real bugs across countless projects. That’s time saved for developers, more reliable apps for users, and more inclusive research for teams without giant server farms.
02 Core Idea
🍞 Hook: Think of a movie rehearsal on a soundstage. The walls aren’t real bricks, but they look and react real enough for actors to practice perfectly.
🥬 The Concept (SWE-World): What it is: SWE-World is a Docker-free training ground where learned models simulate program runs and test results, while a sandbox handles exact file edits. How it works: 1) The sandbox applies real file reads/edits, 2) a transition model (SWT) predicts what a command would print or error, 3) a reward model (SWR) simulates the unit-test report and final pass/fail. Why it matters: Without SWE-World, agents must use heavy Docker to get feedback, making large-scale training slow and costly.
🍞 Anchor: The agent still browses files and edits code for real, but when it says “run tests,” SWE-World imagines the result accurately enough to guide the agent.
The “Aha!” Moment in one sentence: Split cheap, deterministic file actions from expensive program execution, and replace only the expensive part with learned simulators trained on real Docker runs.
Three Analogies:
- Driving Simulator: You practice in a realistic car simulator (SWE-World) before driving on the real road (Docker). You learn 95% of the skills much faster and cheaper.
- Science Lab Model: Instead of mixing costly chemicals each time, you use a precise computer model to predict reactions before the final real test.
- Sports Scrimmage: Most plays are rehearsed with a smart scout team that mimics your opponent well enough to learn winning strategies before game day.
Before vs After:
- Before: Every “run” needed Docker; many repos couldn’t build; test-time exploration was expensive; RL was fragile and infra-heavy.
- After: The agent edits files in a sandbox, gets predicted run logs from SWT, and a simulated test report from SWR—no Docker needed during training or selection. Containers are only needed, if at all, for final external benchmarking.
Why It Works (intuition):
- File edits are simple and must be exact—so keep them real in a sandbox.
- Execution outputs (like logs and tracebacks) are structured and learnable. If you train on lots of true runs, a model can learn to predict typical outputs for commands and tests given the current code and patch.
- For final grading, make the model “explain” via a structured test report before saying pass/fail. This reduces flimsy yes/no guesses and helps resist reward hacking.
Building Blocks (each introduced with the Sandwich):
🍞 Hook: You know how you never want a map that lies about where streets are?
🥬 The Concept (Sandbox): What it is: A safe, exact file-and-shell playground for reading and editing code. How it works: 1) Deterministically list, view, search, and edit files, 2) update repository state as the agent works, 3) never hallucinate content. Why it matters: If file state drifts from reality, the agent learns wrong things.
🍞 Anchor: When the agent says “open tests/test_math.py,” the sandbox shows the real file content and applies real edits.
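As a rough illustration of how little magic a sandbox needs, here is a minimal sketch using ordinary file I/O on a checked-out repository. The class and method names are made up for illustration; the point is that every listing, view, and edit is exact, never generated by a model.

```python
# Minimal sketch of a deterministic sandbox: real file reads and exact
# string-replacement edits on a checked-out repository (hypothetical API).
from pathlib import Path

class Sandbox:
    def __init__(self, repo_root: str):
        self.root = Path(repo_root)

    def list_dir(self, rel_path: str = ".") -> list[str]:
        # Exact directory listing, never invented by a model.
        return sorted(p.name for p in (self.root / rel_path).iterdir())

    def view(self, rel_path: str) -> str:
        # Return the real file content.
        return (self.root / rel_path).read_text()

    def str_replace(self, rel_path: str, old: str, new: str) -> None:
        # Apply an exact edit; fail loudly if the target text is not unique.
        path = self.root / rel_path
        text = path.read_text()
        if text.count(old) != 1:
            raise ValueError("edit target must appear exactly once")
        path.write_text(text.replace(old, new, 1))
```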
🍞 Hook: Imagine asking, “If I ran this script right now, what would the screen print?”
🥬 The Concept (SWT – Transition Model): What it is: An LLM that predicts stdout, stderr, and exit code for a command based on the current repo and patch. How it works: 1) Read instance info, agent’s patch, and the command, 2) reason about code behavior, 3) output realistic logs or errors. Why it matters: Without SWT, you’d need Docker to get step-by-step run feedback.
🍞 Anchor: The agent runs “python reproduce_issue.py” and SWT returns the likely stack trace and error line as if it really ran.
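Here is a hedged sketch of how a transition model like SWT might be queried: build a prompt from the issue, the current patch, and the command, then parse a strict JSON reply. The `call_llm` function and the prompt wording are assumptions, not the paper's actual interface.

```python
# Sketch of querying a learned transition model (SWT). The model is asked to
# predict what a shell command would print, given the issue and current patch.
# `call_llm` is a hypothetical inference call, not a real library API.
import json

SWT_PROMPT = """You are simulating a shell inside a Python repository.
Issue summary:
{issue}
Current patch (git diff):
{patch}
Command to run:
{command}
Respond with JSON: {{"stdout": "...", "stderr": "...", "exit_code": 0}}"""

def simulate_command(call_llm, issue: str, patch: str, command: str) -> dict:
    prompt = SWT_PROMPT.format(issue=issue, patch=patch, command=command)
    reply = call_llm(prompt)          # the model's raw text response
    result = json.loads(reply)        # strict JSON keeps parsing reliable
    return {
        "stdout": result.get("stdout", ""),
        "stderr": result.get("stderr", ""),
        "exit_code": int(result.get("exit_code", 1)),
    }
```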
🍞 Hook: Think of a strict referee who first writes the score sheet, then declares win or loss.
🥬 The Concept (SWR – Reward Model): What it is: An LLM that simulates running unit tests and produces a compact test report plus a binary reward (0/1). How it works: 1) Read the final patch and test list (Fail-to-Pass, Pass-to-Pass), 2) generate a realistic test summary, 3) output pass/fail. Why it matters: Without SWR, you’d need real tests in Docker to know if the fix works.
🍞 Anchor: After editing, the agent “submits,” and SWR returns “All 45 regression tests still pass; the 2 failing tests now pass—reward=1.”
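A matching sketch for the reward side: SWR is asked for a test report first and a binary reward second, and the reply is parsed as strict JSON. Again, `call_llm` and the schema shown here are illustrative assumptions rather than the paper's exact format.

```python
# Sketch of querying a learned reward model (SWR): report first, then reward.
# `call_llm` and the exact JSON schema are illustrative assumptions.
import json

SWR_PROMPT = """You are simulating a pytest run for a repository.
Final patch (git diff):
{patch}
Fail-to-Pass tests: {f2p}
Pass-to-Pass tests: {p2p}
First write a concise test report, then decide.
Respond with JSON: {{"test_report": "...", "reward": 0 or 1}}"""

def simulate_evaluation(call_llm, patch: str, f2p: list[str], p2p: list[str]) -> dict:
    prompt = SWR_PROMPT.format(patch=patch, f2p=f2p, p2p=p2p)
    result = json.loads(call_llm(prompt))
    return {
        "test_report": result["test_report"],
        "reward": 1 if int(result["reward"]) == 1 else 0,
    }
```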
🍞 Hook: It’s like having a coach who gives constant, believable feedback during practice.
🥬 The Concept (Learned Surrogate Feedback): What it is: Using LLMs (SWT/SWR) to stand in for heavy execution so agents still get useful signals. How it works: 1) Train on many real Docker interactions, 2) learn to map code+patch+command to logs and test results, 3) serve instant feedback without containers. Why it matters: Removes the main scaling bottleneck while keeping the feedback meaningful.
🍞 Anchor: The agent tries three different patches; the surrogate tells which one likely passes tests, so the agent picks the best to submit.
03 Methodology
High-level overview: Input (issue + repo snapshot) → Sandbox (navigate/edit files) → SWT (simulate command outputs) → SWR (simulate test report + reward) → Output (patch + predicted pass/fail).
Step-by-step, like a recipe:
1. Set up the workspace
   - What happens: Load the repository at the base commit; show the agent the issue description, hints, and files. The sandbox ensures file listings and edits are exact.
   - Why this exists: If file operations were simulated by an LLM, it could hallucinate files or contents, derailing learning.
   - Example: The agent views README.md, searches for a function in src/utils.py, and opens tests to see what’s failing.
2. Agent explores and edits
   - What happens: The agent uses tools like ls/cat/grep and a replace editor to modify code.
   - Why this exists: Real, deterministic edits keep the code state trustworthy for later simulation.
   - Example: It changes a default flag from False to True and adds a missing import.
3. Simulate running commands with SWT
   - What happens: When the agent runs a command (e.g., python reproduce_issue.py or pytest), SWT predicts stdout, stderr, and exit_code from context: the problem summary, the agent’s current patch, and relevant code.
   - Why this exists: Running these commands for real requires Docker and dependencies; simulation is far cheaper at scale.
   - Example: SWT returns a stack trace pointing to a NoneType error in handlers.py:87, nudging the agent to check that file.
4. Iterate until submit
   - What happens: The agent uses SWT feedback to refine edits: fix imports, tweak logic, re-run the simulated script, and finally submit a patch.
   - Why this exists: The agent needs a feedback loop to debug and converge on a solution.
   - Example: After a few tries, simulated errors disappear, and the agent submits.
5. Simulate final evaluation with SWR
   - What happens: SWR acts like a virtual test runner. It produces a standardized test report (e.g., which Fail-to-Pass tests now pass) and outputs reward=1 or 0.
   - Why this exists: It provides a faithful, interpretable pass/fail signal without real test execution.
   - Example: “Collected 48 tests: 2 F2P now pass; 46 P2P remain passing. Reward=1.”
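Tying the recipe together, here is a minimal sketch of one Docker-free rollout. Every object passed in (the agent, the sandbox, and the two simulators) is a hypothetical stand-in for the components described above, including the `current_patch()` helper.

```python
# Sketch of one Docker-free rollout: real edits in the sandbox, simulated
# execution via SWT, and a simulated final test report via SWR.
# All objects and method names are hypothetical stand-ins.

def rollout(agent, sandbox, swt_simulate, swr_evaluate, issue: str, max_steps: int = 50):
    observation = issue
    for _ in range(max_steps):
        action = agent.decide(observation)              # dict like {"tool": ..., ...}
        if action["tool"] == "view":
            observation = sandbox.view(action["path"])  # real file read
        elif action["tool"] == "str_replace":
            sandbox.str_replace(action["path"], action["old"], action["new"])
            observation = "edit applied"                # real file edit
        elif action["tool"] == "run":
            # Simulated execution: predicted stdout/stderr/exit_code from SWT.
            observation = swt_simulate(issue, sandbox.current_patch(), action["command"])
        elif action["tool"] == "submit":
            break
    # Simulated evaluation: SWR returns a test report plus a 0/1 reward.
    verdict = swr_evaluate(sandbox.current_patch())
    return sandbox.current_patch(), verdict
```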
The Secret Sauce:
- Separation of Concerns: Keep navigation/editing fully real (sandbox), and simulate only execution (SWT) and final tests (SWR)—the expensive parts.
- Training on Real Signals: Gather many real Docker rollouts (true stdout/stderr/test reports) to supervise SWT/SWR, so the simulator learns authentic patterns.
- Reverse-Reasoning Chain-of-Thought (CoT): 🍞 Hook: You know how teachers ask you to “show your work,” not just the answer? 🥬 The Concept (Reverse-Reasoning Distillation): What it is: Generate a careful step-by-step reasoning trace that leads to the known real output, then train the model to produce both the reasoning and the output. How it works: 1) Provide the context and true result to a strong teacher model, 2) ask for a forward-looking derivation without leaking the answer, 3) filter for quality, 4) train SWT/SWR to emit <think>…</think> plus the structured JSON. Why it matters: Without this, models may guess; with CoT, SWR especially becomes more robust and less easy to exploit. 🍞 Anchor: Like math class: showing every step makes fewer mistakes and helps graders trust your solution.
- Structured Outputs: Both SWT and SWR emit strict JSON (stdout/stderr/exit_code or test_report/reward), making parsing reliable and preventing format drift (a sketch of building one such training record follows this list).
- Docker-Free SFT and RL: Use SWE-World itself to collect training trajectories (keep only high-quality ones with SWR=1), fine-tune the agent, then improve further with reinforcement learning guided by SWR.
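To make the reverse-reasoning and structured-output ideas concrete, here is a hedged sketch of building one training record for SWR: a teacher model that already knows the true outcome writes a forward-looking derivation, and the student is trained to emit the <think> trace plus strict JSON. The field names and the `teacher_llm` call are illustrative, not the paper's exact schema.

```python
# Sketch of constructing one reverse-reasoning training record for SWR.
# The teacher sees the ground-truth outcome and writes a derivation that
# reasons toward it; the student learns to emit <think> + strict JSON.
# Field names and `teacher_llm` are illustrative assumptions.
import json

def build_swr_record(teacher_llm, context: str, true_report: str, true_reward: int) -> dict:
    derivation = teacher_llm(
        "Given this patch and these tests, reason step by step toward the known "
        "outcome without revealing it up front.\n"
        f"Context:\n{context}\n\nKnown outcome:\n{true_report}"
    )
    # Student target: a <think> trace followed by the structured JSON output.
    target = f"<think>{derivation}</think>\n" + json.dumps(
        {"test_report": true_report, "reward": true_reward}
    )
    return {"prompt": context, "completion": target}
```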
Reinforcement Learning (RL) in plain terms: 🍞 Hook: Think of a puppy learning tricks—do it right, get a treat; do it wrong, try again. 🥬 The Concept (RL): What it is: The agent tries actions and learns from rewards. How it works: 1) Generate multiple rollouts per task in SWE-World, 2) give reward=1 if SWR says all tests pass, otherwise 0, 3) update the policy to favor successful behaviors (using a stable variant of GRPO). Why it matters: Without RL, the agent may memorize patterns; with RL, it learns strategies for long, multi-step fixes. 🍞 Anchor: After many practice rounds, the agent figures out which editing sequences most often lead to passing tests.
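For intuition about the group-relative part of GRPO-style training, here is a small sketch that turns the binary SWR rewards of a group of rollouts into advantages. This is the generic recipe, not the paper's exact stabilized variant.

```python
# Sketch of group-relative advantages from binary rewards (GRPO-style idea):
# each rollout is compared to the mean of its group, so rollouts that beat
# their siblings get positive advantage. Generic recipe only.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 rollouts on one task, SWR says three of them pass.
advantages = group_relative_advantages([1, 0, 0, 1, 0, 0, 1, 0])
```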
Test-Time Scaling (TTS) with SWR: 🍞 Hook: When you take several photos, you pick the sharpest one. 🥬 The Concept (TTS): What it is: Generate several candidate patches and let a verifier pick the best. How it works: 1) Sample N candidate solutions, 2) ask SWR to judge each multiple times for stability, 3) choose the highest average score. Why it matters: Without TTS, one unlucky sample might lose; with TTS, you reliably keep the best attempt. 🍞 Anchor: The model tries 8 patches; SWR ranks them; the top one is submitted and often wins big.
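A minimal sketch of test-time scaling with SWR as the verifier: score each candidate patch several times, average to smooth out verifier noise, and keep the best. The `swr_judge` callable is a hypothetical stand-in for an SWR endpoint that returns 0 or 1.

```python
# Sketch of test-time scaling (TTS): judge each candidate patch several times
# with SWR and submit the one with the highest average score.
# `swr_judge(patch)` is a hypothetical callable returning 0 or 1.
from statistics import mean

def select_best_patch(candidates: list[str], swr_judge, n_judgments: int = 3) -> str:
    def score(patch: str) -> float:
        # Average several judgments to stabilize the verifier's decision.
        return mean(swr_judge(patch) for _ in range(n_judgments))
    return max(candidates, key=score)
```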
Putting it all together: Input → Sandbox edits → SWT command feedback → iterative fixing → submit → SWR test report + reward → optionally TTS to choose the strongest patch. Training uses real-Docker logs to teach SWT/SWR, then uses SWE-World to train the agent without any containers.
04 Experiments & Results
🍞 Hook: Imagine a spelling bee where most kids score around B-, and one student suddenly starts getting solid A’s—with less practice time.
🥬 The Concept (SWE-bench Verified): What it is: A tough benchmark of 500 real GitHub issues across 12 Python repos; success means your final patch passes all required tests. How it works: 1) Load the repo snapshot, 2) apply the model’s patch, 3) run designated tests; pass all → score 1, else 0. Why it matters: It’s a widely used, realistic scoreboard for code agents.
🍞 Anchor: If a bug fix makes all target tests pass while keeping others green, the instance is counted as “resolved.”
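The resolution rule itself is easy to state in code: an instance counts as resolved only when every Fail-to-Pass test now passes and every Pass-to-Pass test still passes. Here is a minimal sketch with illustrative field names; the official harness performs this check after really running the tests.

```python
# Sketch of the SWE-bench resolution rule: all Fail-to-Pass tests must now
# pass and all Pass-to-Pass tests must keep passing. Names are illustrative.

def is_resolved(test_results: dict[str, bool], f2p: list[str], p2p: list[str]) -> bool:
    fail_to_pass_ok = all(test_results.get(t, False) for t in f2p)
    pass_to_pass_ok = all(test_results.get(t, False) for t in p2p)
    return fail_to_pass_ok and pass_to_pass_ok

# Example: both required fixes land and no regression appears -> resolved.
results = {"test_fix_a": True, "test_fix_b": True, "test_existing": True}
assert is_resolved(results, f2p=["test_fix_a", "test_fix_b"], p2p=["test_existing"])
```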
The Test: The authors trained SWT/SWR on real Docker rollouts and used SWE-World to train agents fully Docker-free. They measured “resolve rate,” the percentage of instances where the agent’s final patch passes all tests in the official harness.
The Competition: They compared against many strong systems, including Docker-based agent training (SFT and RL) and agentless pipelines, plus popular verifiers used for test-time scaling.
The Scoreboard (with context):
- Base model Qwen2.5-Coder-32B starts at 6.2%.
- Docker-free SFT with SWE-World jumps it to 52.0%—like going from an F to a solid A-.
- Docker-free RL nudges it to 55.0%.
- With TTS@8 using SWR as the verifier, it reaches 68.2%—that’s like getting an A+ when most peers are at B+.
- A smaller 4B model goes from 0 to 25.6% with SFT and 30.0% with RL, showing the method helps even modest backbones.
How faithful are the simulators?
- Transition Feedback: Using SWT-72B as the step simulator supports an end-to-end resolve rate of 60.2% with a fixed agent, compared to 68.4% with ground-truth Docker steps. That’s a small but acceptable gap and stronger than general LLMs used as simulators.
- Reward Simulation: SWR-72B achieves accuracy around 0.77 versus Docker ground truth, with competitive precision/recall and a structured test-report output that’s more interpretable than single-token verifiers.
Surprising/Notable Findings:
- SWE-World trajectories for SFT are at least as good as Docker-collected ones, and mixing both gives extra gains (e.g., 53.8% vs 52.2%). This suggests the simulator is both faithful and complementary to reality.
- Chain-of-Thought is crucial for SWR (big boost in accuracy and stability) but gives only small gains for SWT. Intuition: step logs can be a bit noisy without hurting learning, but the final pass/fail is brittle—SWR must reason carefully.
- RL stability depends on a trustworthy reward: without CoT, SWR can be gamed, causing the agent to submit short, bad patches that are mistakenly judged correct. With CoT, training stabilizes and improves.
- Test-Time Scaling with SWR beats prior verifiers at the same sample counts and scales smoothly from 1 to 8 candidates, narrowing the gap to an upper bound (Pass@K) more than older methods do.
Bottom line: SWE-World removes the heaviest part of the pipeline (containers during training and selection) while delivering equal or better performance, and it unlocks a larger pool of real data that used to be excluded because of dockerization issues.
05 Discussion & Limitations
Limitations:
- Simulation Gap: SWT/SWR come close to real execution but are not perfect; rare dependency quirks or non-Python behaviors might be mispredicted, and for very unusual runtime behavior the models may miss edge cases.
- Domain Coverage: Trained mostly on Python repos with common testing patterns; porting to other languages, complex native extensions, or system-level integrations may require new data.
- Security and Side Effects: Because execution is simulated, SWE-World won’t catch security issues or side effects that only appear when code truly runs (e.g., network timeouts, permission problems).
- Reward Hacking Risk: If SWR is weak or uncalibrated, agents can learn to exploit it. The paper reduces this via CoT and report-first design, but careful monitoring is still needed.
Required Resources:
- Two inference endpoints (SWT and SWR) with long-context LLMs (often 32B–72B) and a sandbox server; decent GPUs help for throughput. No need for massive Docker fleets or huge image storage.
When NOT to Use:
- Projects that must exercise real hardware, OS-level features, or non-deterministic runtime behavior during training.
- Tasks where milliseconds matter and exact performance profiling is required.
- Brand-new ecosystems with no training rollouts available yet, where the simulator has no prior to learn from.
Open Questions:
- How to quantify and reduce the sim-to-real gap further and know when to fall back to real execution?
- Better uncertainty estimates—can SWT/SWR say “I’m unsure; please run for real”?
- Cross-language generalization—what’s needed to extend to Java, Go, or C++ at similar fidelity?
- Active data collection—how to automatically find the most valuable new Docker runs to teach the simulator new behaviors?
- Safety—how to simulate and detect security-sensitive behaviors that only appear at runtime?
06 Conclusion & Future Work
Three-sentence summary: SWE-World keeps the familiar agent loop but swaps heavy Docker execution for learned simulators: SWT predicts command outputs step-by-step, and SWR produces a realistic test report plus a final pass/fail. This makes large-scale training and selection (TTS) fast and affordable, while matching or beating strong Docker-based pipelines on SWE-bench Verified. The result is a practical path to train capable software agents without container farms.
Main achievement: Showing that repository-level execution and evaluation feedback can be simulated well enough to enable fully Docker-free SFT, RL, and test-time scaling, with state-of-the-art results from open models.
Future directions: Expand to more languages and ecosystems, add calibrated uncertainty and fallback-to-real-exec triggers, refine world models with active data and better CoT, and integrate with IDEs/CI so developers benefit from low-cost, high-quality verification in daily workflows.
Why remember this: It flips the script on how we train code agents—moving from running everything for real to learning a faithful “world” that’s fast, scalable, and good enough to teach and judge. That unlocks more data, quicker iteration, and stronger agents for the software we all use.
Practical Applications
- Train code agents on large pools of real GitHub issues without preparing Docker images for each task.
- Use SWT to provide fast, step-level feedback for agents during interactive debugging sessions.
- Adopt SWR as a verifier to rank multiple candidate patches at inference time (TTS) and submit only the best.
- Generate high-quality SFT datasets by rolling out agents in SWE-World and filtering with SWR=1.
- Run Docker-free RL to improve long-horizon editing strategies while avoiding infrastructure instability.
- Mix simulator-generated and real-Docker trajectories to further improve model robustness.
- Add reverse-reasoning CoT to reward modeling to reduce reward hacking and stabilize training.
- Deploy SWE-World-backed tools in IDEs for quick, low-cost pre-checks before running full CI.
- Prototype cross-repo generalization by training SWT/SWR on diverse repositories and test styles.
- Use SWR uncertainty (or multiple samples) to decide when to fall back to real execution for critical cases.