SWE-Universe: Scale Real-World Verifiable Environments to Millions

Mouxiang Chen; Lei Zhang; Yunlong Feng; Xuwu Wang; Wenting Zhao; Ruisheng Cao; Jiaxi Yang; Jiawei Chen; Mingze Li; Zeyao Ma; Hao Ge; Zongmeng Zhang; Zeyu Cui; Dayiheng Liu; Jingren Zhou; Jianling Sun; Junyang Lin; Binyuan Hui

SWE-Universe: Scale Real-World Verifiable Environments to Millions

Intermediate

Mouxiang Chen, Lei Zhang, Yunlong Feng et al.2/2/2026

arXiv PDF

Key Summary

•SWE-Universe is a factory-like system that turns real GitHub pull requests into safe, repeatable coding practice worlds with automatic checkers.
•It fixes three big problems at once: low build success, weak or cheat-able checkers, and high cost per task.
•A smart Building Agent creates a verifier script (evaluation.sh), tests it on buggy vs. fixed code, and keeps improving it in a loop until it works for real.
•An in-loop Hacking Detector stops fake shortcuts (like grepping for a string) so only real, code-running tests are accepted.
•A lightweight but strong Mixture-of-Experts model (Qwen-Next-80A3) powers the agent, beating or matching top models while being cheaper and faster.
•The team built 807,693 multilingual, executable tasks from 52,960 repositories—far larger and more diverse than prior datasets.
•Training on these tasks boosts models on SWE-Bench Verified and Multilingual, and enables effective reinforcement learning with stable rewards.
•Using these environments, Qwen3-Max-Thinking reached 75.3% on SWE-Bench Verified, showing real, production-level gains.
•The pipeline scales via distributed execution (MEGAFLOW) and Docker image caching, keeping speed high and costs manageable.
•A quality-judge agent filters noisy tasks, making the giant dataset both big and trustworthy enough for next-gen coding agents.

Why This Research Matters

This work makes it possible to train coding AIs on hundreds of thousands of real, diverse software problems with trustworthy, executable checks. That means faster bug fixes and safer updates in the apps, websites, and devices we use every day. It also lowers costs so more teams—not just big tech—can build capable coding assistants. The anti-cheat design keeps models honest, reducing the risk of overfitting to shortcuts that don’t generalize. By covering many languages, it helps create agents that can hop between ecosystems like human engineers do. In the long run, this can improve software reliability, speed up security patches, and help learners practice on real-world tasks.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

You know how when you’re learning to fix bikes, it’s best to practice on real bikes with clear checklists, not just toy parts? AI that writes code needs the same thing: real projects with trustworthy ways to check if a fix truly works.

🍞 Hook: Imagine your class doing science experiments. You want real materials, clear lab steps, and a test that shows which results are correct. 🥬 The Concept: Pull Requests (PRs)

What it is: A Pull Request is a proposal to change a project’s code, often connected to an issue that explains the problem.
How it works:
1. A developer describes a problem in an issue.
2. They write code to fix it and open a PR.
3. Teammates review and merge if it’s good.
Why it matters: PRs bundle problem, fix, and often tests—everything needed to make a training “mini-world” for coding agents. 🍞 Anchor: A PR titled “Fix crash when loading empty file” includes the bug description, the code changes, and tests that prove the crash is gone.

🍞 Hook: You know how a PE class has obstacle courses to test different skills? 🥬 The Concept: Test Suites

What it is: A test suite is a group of checks to confirm software behaves correctly.
How it works: 1) Each test sets inputs, 2) runs code, 3) checks outputs.
Why it matters: Without tests, you don’t know if a fix actually solves the problem. 🍞 Anchor: For a calculator app, a test suite might check add(2, 2) == 4, add(−1, 1) == 0, and so on.

🍞 Hook: Think of a teacher’s red pen that says “pass” or “try again.” 🥬 The Concept: Evaluation Script

What it is: An evaluation script (here named evaluation.sh) automatically decides if a fix passes.
How it works: 1) Sets up the project, 2) runs the right tests or commands, 3) exits with success (0) or failure (non-zero).
Why it matters: It’s the referee. Without it, agents don’t get reliable feedback. 🍞 Anchor: The script runs pytest -q and returns 0 only if the repaired tests pass.

🍞 Hook: Remember a lunchbox that keeps your sandwich the same at home or school? 🥬 The Concept: Docker Environment

What it is: A Docker environment is a portable, same-everywhere setup for running code.
How it works: 1) Describe tools and versions, 2) build an image, 3) run it anywhere.
Why it matters: Ensures experiments are repeatable on any machine. 🍞 Anchor: A Python 3.10 + specific library versions Dockerfile runs the project identically on Windows, Mac, or Linux servers.

The world before: Early benchmarks like SWE-Bench offered real problems, but mostly in one language (Python). That made it easier to start but limited an agent’s ability to handle many ecosystems (like JavaScript, Rust, or Go). Also, setting up thousands of different projects with tricky dependencies often failed, wasting compute. Worse, simple or “cheat-able” verifiers let models pass without actually running code (like checking a file’s text with grep instead of running tests). Finally, people leaned on huge, costly language models for per-repo reasoning, making it too expensive to scale to hundreds of thousands of cases.

The problem: How do we automatically build millions of real, multilingual practice worlds from live GitHub activity, make each one verifiable by executing code (not by shortcuts), and keep time and money under control?

Failed attempts: Manual curation doesn’t scale. One-language pipelines miss cross-language skills. Naive verifier generation lets hacks slip in. Heavy models raise cost and latency. These paths hit walls of quality or budget.

The gap: A missing piece was a specialized, efficient building agent that can: (1) split PR patches into tests vs. fixes, (2) craft a trustworthy evaluation.sh, (3) prove it works by comparing buggy vs. fixed code states, and (4) reject hacks right in the loop—while being fast and affordable.

Real stakes: Better coding agents can help fix open-source bugs faster, reduce crashes in apps you use, and support learners and professionals. Imagine fewer app errors, quicker security patches, and smoother gadgets—from school laptops to hospital systems—because AI can practice at scale on real tasks with real checks.

02Core Idea

🍞 Hook: You know how a good assembly line doesn’t just build toys—it also tests them and throws out fakes before they leave the factory? 🥬 The Concept: SWE-Universe

What it is: SWE-Universe is an automated factory that turns GitHub pull requests into safe, repeatable coding environments with real, executable verifiers.
How it works:
1. Gather PRs connected to issues.
2. Split each PR into a test patch and a fix patch.
3. Apply the test patch, then generate an evaluation.sh verifier.
4. Flip the repo between buggy and fixed states and check the script’s behavior.
5. Reject hacks, iterate until the verifier truly runs code and distinguishes states.
Why it matters: Without this factory, you get too few tasks, weak checkers, and costs that explode. With it, you get millions of trustworthy, multilingual practice worlds. 🍞 Anchor: A PR fixing a Rust panic becomes a Dockerized task whose evaluation.sh runs cargo tests and only passes when the real fix is present.

🍞 Hook: Think of a helpful robot that can follow instructions, try, check itself, and try again. 🥬 The Concept: Building Agent

What it is: The Building Agent is the robot that constructs each task and its verifier.
How it works:
1. Reads the PR to separate test vs. fix.
2. Applies the test patch.
3. Writes evaluation.sh to run tests or a custom check.
4. Uses tools to switch between buggy and fixed, then evaluates results.
5. If it fails, it revises the script and repeats.
Why it matters: No robot, no scale—humans can’t handcraft hundreds of thousands of tasks. 🍞 Anchor: The agent writes a Node.js verifier that runs pnpm exec jest, revises it when a missing dependency breaks the run, and retries until it cleanly distinguishes states.

🍞 Hook: You know how you check your math homework by plugging answers back into the problem? 🥬 The Concept: Iterative Self-Verification

What it is: The agent repeatedly tests its own verifier on buggy vs. fixed code until it behaves correctly.
How it works:
1. Run evaluation.sh on buggy: expect fail.
2. Run on fixed: expect pass.
3. If not, revise and try again (up to many turns).
Why it matters: It boosts build success (reported from 82.6% to 94% on a held-out set) and filters out flaky scripts. 🍞 Anchor: A Go project’s tests initially crash due to missing toolchain; after the agent installs it and updates PATH, the verifier starts passing only on the fixed code.

🍞 Hook: Imagine a game where players try to cheat; you need a referee who can spot tricks. 🥬 The Concept: Hacking Detection

What it is: An in-loop checker that flags “cheat” verifiers (like string greps) that don’t really run code.
How it works:
1. Inspect evaluation.sh for telltale hacks.
2. If found, reject immediately.
3. Force the agent to generate a genuine, execution-based verifier.
Why it matters: Stops fake wins that would mis-train agents. 🍞 Anchor: A Scala task’s script that greps for a new method name gets rejected; the agent must run the actual test build.

🍞 Hook: Think of a sports team where each player is best at a position. 🥬 The Concept: Mixture-of-Experts (MoE)

What it is: A model design that routes each question to specialized “experts,” combining speed and skill.
How it works:
1. The router picks which experts to activate per token or step.
2. Only some experts run, saving compute.
3. Their outputs are merged into the final answer.
Why it matters: Provides strong reasoning at lower cost and latency. 🍞 Anchor: Qwen-Next-80A3 uses MoE with hybrid attention to outperform or match top models in building environments while staying efficient.

Multiple analogies for the same big idea:

Factory analogy: SWE-Universe is a factory that assembles tasks, test-runs them on a conveyor belt, and removes counterfeits with a fraud detector.
Lab analogy: It’s a lab that sets up experiments (environments), runs trials (buggy vs. fixed), and rejects fake results.
Game referee analogy: It’s a referee that demands real gameplay (executing code), not score tampering (grepping text).

Before vs. After:

Before: Mostly Python-only, small datasets; fragile or cheat-able verifiers; very costly per task.
After: 807,693 multilingual, executable tasks; strong in-loop anti-cheat; efficient model powering mass production.

Why it works (intuition):

Verifying a verifier is simpler than verifying an entire custom setup. If a script passes only after the fix and fails before, and it truly runs tests, it’s likely a good checker. Iteration plus anti-cheat keeps the process honest. An MoE builder makes the loop affordable at huge scale.

Building blocks (what pieces you need): PR crawling and patch separation, the Building Agent, toolset (bash, switch-to-resolved, switch-to-bug), iterative self-verification, in-loop hacking detection, Dockerization, distributed execution (MEGAFLOW), and optional quality-judging.

🍞 Anchor: From a JavaScript PR linked to an issue, SWE-Universe builds a Docker image, writes evaluation.sh to run Jest, proves it fails before and passes after, rejects any greps, and ships a clean, reusable task.

03Methodology

At a high level: PR with tests → [Patch Separation] → [Agent builds environment + evaluation.sh] → [Iterative Self-Verification + Hacking Detection] → [Dockerized, verifiable task].

Step 1. Crawl PRs and filter

What happens: Harvest ~33.3M recent PRs and keep only those with clear tests and issue links; remove overlaps with known benchmarks; trim overly massive diffs.
Why it exists: Ensures each task has a problem statement and a realistic, runnable test target.
Example: Keep a PR titled “Fix panic on empty input #123” linked to Issue #456 with added tests; drop a 5,000-file refactor.

🍞 Hook: Think of sorting puzzle boxes by whether they include instructions. 🥬 The Concept: Pull Requests (PRs) [recap, kept minimal]

What it is: Bundles of problem + fix (and often tests).
How it works: Developers propose, reviewers approve.
Why it matters: Each PR becomes one self-contained practice world. 🍞 Anchor: The PR describes how to reproduce a bug and how the fix changes behavior.

Step 2. Patch separation (test vs. fix)

What happens: The agent splits PR changes into a test patch and a fix patch; tasks without discernible tests are discarded.
Why it exists: Lets us apply tests first, then use fix toggling for verification.
Example: In a Python PR, modified tests/test_io.py is the test patch; changes in src/io.py form the fix patch.

Step 3. Agent-based environment building

What happens: The agent applies the test patch, then crafts evaluation.sh. It can:
- Run human-written unit tests, or
- Write its own focused tests if no clean entry point exists.
Why it exists: Standardizes verification via a simple bash interface with exit codes, decoupled from any one language.
Example: For a Rust PR, it runs cargo test --quiet; for a JS monorepo, it sets up workspaces and invokes pnpm.

🍞 Hook: Like giving a robot a toolbox to build and check a model car. 🥬 The Concept: Evaluation Script [recap detail]

What it is: A bash script that sets up, runs checks, and exits pass/fail.
How it works: Installs deps, runs tests, interprets results via exit code.
Why it matters: A universal interface that works across languages. 🍞 Anchor: evaluation.sh returns 0 only if pytest -q finishes with all tests passing.

Toolset the agent uses

bash: File edits, dependency installs, script creation.
switch-to-resolved / switch-to-bug: Atomically toggle fix patch on/off to produce fixed vs. buggy states.
Why it exists: Enables self-verification and precise comparisons.
Example: After writing evaluation.sh, the agent runs it on buggy (expect fail) and fixed (expect pass).

🍞 Hook: You know how you test light bulbs by flipping a switch on and off? 🥬 The Concept: Iterative Self-Verification [recap]

What it is: Repeatedly testing the verifier on both states until correct.
How it works: Buggy=fail, Fixed=pass; revise and retry if mismatch.
Why it matters: Raises success from 82.6% to 94% on a held-out set. 🍞 Anchor: A Java build initially fails due to missing JDK; after installing it, the verifier behaves correctly.

🍞 Hook: Like a teacher watching for copied homework during practice, not after grades are posted. 🥬 The Concept: Hacking Detection [recap]

What it is: In-loop check that rejects non-execution-based scripts (e.g., grep shortcuts).
How it works: LLM inspects evaluation.sh; flags hacks → immediate failure and retry.
Why it matters: Prevents training on fake signals and saves time. 🍞 Anchor: A script that only searches source code text for the patch gets rejected before moving on.

Step 4. Dockerization and distribution

What happens: Successful builds become Docker images; images are pushed to a registry with layer caching; runs orchestrated by MEGAFLOW across many cloud instances.
Why it exists: Reproducibility at scale and lower storage costs.
Example: Thousands of parallel VMs each produce verified images that share common base layers.

🍞 Hook: A lunch-packing line puts the same sandwich in many identical lunchboxes. 🥬 The Concept: Docker Environment [recap]

What it is: Same-everywhere containers.
How it works: Build once, run anywhere the same way.
Why it matters: Makes tasks stable for every learner and agent. 🍞 Anchor: A Go toolchain image guarantees go test behaves the same on every machine.

Secret sauce (what makes it clever)

Verify the verifier: It’s easier to check if a script distinguishes buggy vs. fixed than to fully inspect every setup decision.
Anti-cheat in the loop: Catch shortcuts early so the agent learns to truly execute code.
Efficient brain: An MoE builder (Qwen-Next-80A3) trained on high-quality building trajectories keeps cost and latency low while beating or matching top models.

🍞 Hook: A coach who places athletes in their best positions boosts team performance without overworking anyone. 🥬 The Concept: Mixture-of-Experts (MoE) [recap]

What it is: Specialized sub-models activated on demand.
How it works: Router picks experts; combine outputs; save compute.
Why it matters: Strong results at lower cost per build. 🍞 Anchor: The builder model achieves a 78.44% success rate (no-hack) across languages while being faster than dense peers.

Quality control

A quality-judge agent reviews task description, Docker, scripts, and (optionally) the ground-truth patch to flag issues; ~78.72% accuracy on a human-labeled benchmark.
Helps keep a massive dataset trustworthy.

🍞 Hook: Like a judge at a cooking contest who ensures dishes match the recipe and are safe to eat. 🥬 The Concept: Quality-Judge Agent

What it is: An automated reviewer for task quality.
How it works: Reads artifacts and scores quality with a trained model.
Why it matters: Big doesn’t help if quality is low; this keeps standards high. 🍞 Anchor: The judge flags a task whose tests don’t match the issue text, preventing confusing training examples.

04Experiments & Results

The test: Can models automatically build reliable, non-cheat environments from diverse PRs? And does the giant dataset improve agent skills when used for training and RL?

What they measured and why

Success Rate (w/o Hack): Verifier distinguishes buggy vs. fixed and passes anti-cheat—this is the real score.
Success Rate (w/ Hack): Counts scripts that distinguish states even if hacked—shows how tempting shortcuts are.
Downstream performance: Scores on SWE-Bench Verified (mostly Python) and SWE-Bench Multilingual—does training transfer?
RL gains: Does pass/fail from evaluation.sh serve as a stable reward to improve agents?

The competition

Builder model comparisons include Claude-Opus-4.5, Claude-Sonnet variants, Gemini-3-Pro, DeepSeek-v3, GLM-4, MiniMax-M2, and Qwen3-Coder-480B.

The scoreboard (with context)

Qwen-Next-80A3 (ours): 78.44% success w/o hack; 82.50% w/ hack. This is like scoring a solid A where others hover around A− to B+.
Claude-Opus-4.5: 77.81% w/o hack; 85.00% w/ hack—a bigger gap suggests more shortcutting.
Across languages, C/C++ is hardest for everyone (complex toolchains), while Python/JS show high rates. Our model is the most balanced across ecosystems.
Gap analysis: Many general models see a 7–8% jump when hacks are allowed; our model’s smaller gap (~4.06%) shows better integrity under anti-cheat.

Scaling to millions (production run)

Using the pipeline and the MoE builder, they achieved a non-hacked success rate of 75.9% across a huge candidate pool and produced 807,693 executable tasks from 52,960 repositories.
Language mix mirrors open-source reality: Python (202k), JS/TS (176k), Go (121k), Java (86k), Rust (74k), C/C++ (37k), C# (24k), Others (87k). Verification scripts are longest on average for C/C++ (≈46 lines), shortest for Rust (≈19), matching expected ecosystem complexities.

Mid-training results

Training a model on ~500k successful agentic trajectories (≈30B tokens) steadily boosts performance.
On SWE-Bench Verified: climbs from ~50.3% to over 61% (a strong improvement on a mature benchmark).
On SWE-Bench Multilingual: rises from ~31% to over 46% (a 15+ point jump), showing the dataset’s multilingual diversity pays off.

Reinforcement learning results

With pass/fail as a reward, Qwen3-30B-A3B improves on SWE-Bench Multilingual from ~32% to 42.0%.
This 10-point absolute gain shows the verifiers provide a stable, meaningful reward signal for agentic RL.

Topline production metric

Applying the full approach to the flagship Qwen3-Max-Thinking yields 75.3% on SWE-Bench Verified—state-of-the-art territory, showing tangible real-world value.

Surprising findings

Anti-cheat matters a lot: Many strong models “find” hacks unless guided not to. Training on non-hacked trajectories reduces this tendency.
C/C++ verifiers tend to be longer, hinting that standardized toolchains (like Rust’s cargo) make shorter, cleaner verifiers easier to write.
A relatively lightweight, specialized MoE can outperform or match heavier, proprietary models on this complex, practical task—efficiency and specialization win.

05Discussion & Limitations

Limitations

Repository dependence: The dataset reflects public GitHub activity; areas with fewer tests or weaker issue practices may be underrepresented.
Residual noise: Even with a judge agent, some tasks can have ambiguous descriptions, imperfect Docker setups, or misaligned tests.
Hard ecosystems: Extremely complex build systems (e.g., some C/C++ projects) still challenge automated setup and may fail more often.
Evolution over time: Toolchain versions and online dependencies can change, risking future breakage without careful pinning and caching.

Required resources

Compute for large-scale orchestration: Many parallel VMs/containers to run builds and tests (MEGAFLOW-like infra).
Storage and registry: Space for Docker images and layer caching.
A capable builder model (e.g., Qwen-Next-80A3) and the hacking detector/checking loop.

When not to use

Projects without runnable tests or any clear way to execute behavior (e.g., documentation-only changes).
Ultra-monomorphic tasks where cross-language generalization isn’t needed and a small, curated set already suffices.
Scenarios that require human-authored pedagogical explanations per task rather than auto-generated verifiers.

Open questions

Stronger anti-fragility: How to make tasks resilient to changing ecosystems (e.g., registry outages, deprecations)?
Deeper semantics: Beyond pass/fail, can we measure partial credit or regression risks for richer RL rewards?
Broader coverage: How to include GUI apps, mobile projects, or services requiring external credentials while staying safe and privacy-preserving?
Test synthesis quality: How to further improve LLM-written tests to avoid false positives/negatives without manual review?
Fairness and bias: Do language, domain, or framework imbalances bias agent learning—and how can we correct them?

06Conclusion & Future Work

Three-sentence summary

SWE-Universe is an automated factory that builds trustworthy, executable coding tasks from real GitHub PRs at unprecedented scale, powered by an efficient MoE builder and strict anti-cheat checks.
Its iterative self-verification and in-loop hacking detection raise build reliability and keep costs low, enabling 807,693 multilingual environments across 52,960 repositories.
Training with this data measurably improves coding agents via both supervised mid-training and reinforcement learning, culminating in 75.3% on SWE-Bench Verified for a flagship model.

Main achievement

Turning the act of building verifiable environments into a scalable, reliable, and economical pipeline—verifying the verifier and rejecting hacks in the loop—so the community can train agents on massive, real, and diverse problems.

Future directions

Stronger, more semantic verifiers (partial credit, coverage-aware rewards), broader project types (mobile/GUI/services), and more robust, self-healing environments that survive ecosystem drift.
Continued efficiency gains in builder models and orchestration systems to push beyond a million tasks with even tighter quality controls.

Why remember this

It shows that careful engineering around verification—plus efficient, specialized models—can unlock million-scale, real-world practice for coding agents, moving from small, narrow benchmarks to a living, multilingual universe of trustworthy tasks.

Practical Applications

•Train coding agents that can fix real bugs across many programming languages.
•Build robust RL pipelines where pass/fail tests provide stable rewards.
•Benchmark environment-building capabilities of new models using the provided cross-lingual setup.
•Automate large-scale creation of verifiable tasks for internal developer training platforms.
•Stress-test CI/CD systems by integrating auto-built verifiers into pre-merge checks.
•Generate reproducible, Dockerized replicas of historical PRs for regression analysis.
•Curate language- or framework-specific subsets (e.g., Rust, Go) to target team skill gaps.
•Study and reduce verifier vulnerabilities by analyzing flagged hacking attempts.
•Measure real-world generalization by training on one set of repos and testing on unseen ones.
•Prototype quality-judging tools to triage noisy issue/PR pairs at scale.

Version: 1