MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering
Key Summary
- This paper builds a smart team of AI helpers, called MEnvAgent, that automatically sets up the right computer environments for code projects in many languages.
- It follows a plan–do–check loop so it can fix mistakes by itself and keep improving until tests actually run and pass.
- A special reuse trick lets it start from a past, similar environment and only make small changes, saving lots of time and compute.
- The system proves each task truly matters by using the Fail-to-Pass (F2P) rule: the old code must fail tests and the fixed code must pass in the same environment.
- They made a new benchmark, MEnvBench, with 1,000 tasks across 10 languages to measure success fairly by running real tests.
- Across languages, MEnvAgent boosts strict F2P success by 8.6 percentage points and cuts time by 43% compared to strong baselines.
- It also built MEnvData-SWE, a large, realistic dataset of Docker environments and solution steps, which improves many coding models after fine-tuning.
- The method works best when there are similar past environments to reuse, and big C/C++ or Java builds still pose tough challenges.
- All code, benchmark, and data are released to help others build more trustworthy software agents.
Why This Research Matters
Reliable software progress depends on environments that actually run and verify code, not just guess. By automating environment setup across 10 languages and proving correctness with F2P, MEnvAgent turns fragile, manual steps into fast, repeatable ones. This unlocks bigger, fresher benchmarks and training data, so coding agents improve on real, modern repositories. Teams save time reproducing bugs, speeding up fixes and releases. Education and research benefit from shareable, Dockerized tasks that behave the same everywhere. In short, it brings trustworthy, scalable verification to the heart of AI-for-coding.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're baking cookies from 10 different countries. Each recipe needs special ingredients, tools, and oven settings. If any detail is wrong, the cookies flop.
🥬 Filling (The Actual Concept)
- What it is: In software, an executable environment is the exact computer setup (OS, tools, packages, versions) a project needs so its tests can run and tell us if the code works.
- How it works:
- Pick a base system (like a ready-to-cook kitchen image).
- Install the right tools and libraries with the correct versions.
- Run the project's tests to check behavior.
- Why it matters: Without the right environment, tests crash or mislead us, so we can’t trust results or train agents using real, verifiable feedback.
🍞 Bottom Bread (Anchor) For a Python repo needing Python 3.10, PyTest, and NumPy 1.24, using Python 3.12 or the wrong NumPy version can make tests fail for the wrong reasons.
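To make the anchor concrete, here is a minimal sketch, assuming a hypothetical project that needs Python 3.10 and NumPy 1.24 and is tested with pytest: before trusting a test run, a harness checks that the environment matches the pins, then runs the tests. The version pins and commands are illustrative, not taken from the paper.

```python
import subprocess
import sys

# Illustrative pins for a hypothetical project (not from the paper).
REQUIRED_PYTHON = (3, 10)
REQUIRED_NUMPY = "1.24"

def environment_matches() -> bool:
    """Check that the interpreter and a key package match the pinned versions."""
    if sys.version_info[:2] != REQUIRED_PYTHON:
        return False
    try:
        import numpy  # assumed dependency of the example project
    except ImportError:
        return False
    return numpy.__version__.startswith(REQUIRED_NUMPY)

def run_tests() -> bool:
    """Run the project's tests; the exit code is the verifiable signal."""
    result = subprocess.run(["pytest", "-rA", "tests/"])
    return result.returncode == 0

if __name__ == "__main__":
    if not environment_matches():
        # A mismatched environment makes any test result untrustworthy.
        sys.exit("Environment mismatch: test results would not be meaningful.")
    print("Tests passed:", run_tests())
```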
The World Before:
- AI coding agents got better at reading and writing code, but they often stumbled when setting up environments—especially across many languages like Python, Java, Go, Rust, and C/C++.
- Many benchmarks tested agents using rough signals (like static checks) instead of actually running tests. This was fast but not very trustworthy.
- High-quality, run-the-tests datasets existed mostly for Python, and many environments were built by hand—slow and not scalable.
The Problem:
- Real verification needs running tests in the exact right setup. But building that setup is hard: tools conflict, compilers break, and test commands differ (pytest vs mvn test vs go test, etc.).
- Starting from scratch each time wastes hours on installs and compiles, and a single mistake can force a full restart.
Failed Attempts:
- Static guesses: scanning files to guess dependencies is quick but misses runtime issues (like native libs or system packages).
- One-size scripts: fixed test commands or language-specific installers break on non-standard repos.
- Manual builds: accurate but too slow and usually limited to one language.
The Gap:
- We needed a way to automatically construct correct, executable environments across many languages—and to do it quickly and repeatedly at scale.
Real Stakes:
- Better benchmarks and training data let us build coding agents that truly fix real-world bugs.
- Companies spend huge time reproducing issues; fast, reliable environments shorten development cycles.
- Research like reinforcement learning with verifiable rewards needs trustworthy pass/fail signals; poor environments block progress.
🍞 Top Bread (Hook) You know how your teacher checks both the wrong answer and the corrected answer to see if you learned? That double-check matters.
🥬 Filling (The Actual Concept: Fail-to-Pass (F2P) Criterion)
- What it is: A strict rule that says a valid setup must make the original (buggy) code fail tests and the fixed code pass tests in the same environment.
- How it works:
- Apply the test patch only, run tests, expect failure.
- Apply the fix patch too, run tests again, expect success.
- Why it matters: Without F2P, we can’t be sure the failure was real or the pass wasn’t accidental; it proves the environment is faithful.
🍞 Bottom Bread (Anchor) If a bug is about date parsing, the environment must fail on the old code and pass after the fix—otherwise the setup can’t be trusted.
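Below is a minimal sketch of the F2P rule as a check over two test runs in the same environment. The helper names and the use of git apply plus pytest are assumptions for illustration; the paper's actual scripts differ per language.

```python
import subprocess

def apply_patch(patch_file: str) -> None:
    """Apply a patch to the working tree (assumes a git checkout)."""
    subprocess.run(["git", "apply", patch_file], check=True)

def run_tests(test_cmd: list[str]) -> bool:
    """Run the repo's test command; True means all tests passed."""
    return subprocess.run(test_cmd).returncode == 0

def fail_to_pass(test_patch: str, fix_patch: str, test_cmd: list[str]) -> bool:
    """F2P: buggy code must fail, fixed code must pass, in the same environment."""
    apply_patch(test_patch)                    # add the new/updated tests only
    failed_before = not run_tests(test_cmd)    # expect failure on the buggy code

    apply_patch(fix_patch)                     # now apply the code fix as well
    passed_after = run_tests(test_cmd)         # expect success on the fixed code

    return failed_before and passed_after

# Hypothetical usage:
# print(fail_to_pass("test.patch", "fix.patch", ["pytest", "-rA"]))
```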
02 Core Idea
🍞 Top Bread (Hook) Imagine a pit crew in a car race. One person checks the engine, another changes tires, another tests the brakes, and someone watches the clock. Working together, they get the car track-ready fast.
🥬 Filling (The Actual Concept: Multi-agent Architecture)
- What it is: A team of specialized AI agents that each handle a part of environment building and testing, coordinating in a loop.
- How it works:
- One agent analyzes the repository.
- Another chooses a base image and writes an install script.
- Another writes the test-running script.
- An execution agent runs installs and adjusts to errors.
- A verification agent runs tests and diagnoses failures.
- Feedback loops guide retries until success.
- Why it matters: One agent alone often misses details; teamwork raises success and speeds up recovery from mistakes.
🍞 Bottom Bread (Anchor) For a Java repo using Maven, the team picks Ubuntu + JDK, installs Maven, then runs mvn test; if surefire fails, feedback triggers a fix.
🍞 Top Bread (Hook) You know how you plan homework, do it, then check answers? That simple routine keeps you on track.
🥬 Filling (The Actual Concept: Planning–Execution–Verification Framework)
- What it is: A plan–do–check loop to design the environment, build it, and then confirm it works.
- How it works:
- Planning: analyze files, pick base image, draft install and test scripts.
- Execution: run installs, adapt to live errors.
- Verification: run tests, attribute failures, and feed reports back to planning.
- Why it matters: Without this loop, errors pile up and no one learns how to fix them.
🍞 Bottom Bread (Anchor) If pytest can’t find a plugin, verification reports a missing dependency; planning adds it; execution retries.
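Putting the agent roles and the plan–do–check loop together, the control flow can be pictured as an orchestrator that keeps cycling through plan, execute, and verify until verification succeeds or a retry budget runs out. The sketch below is a simplified, toy skeleton under that assumption; the real agents are LLM-driven and far richer.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    base_image: str
    install_script: list[str]
    test_script: list[str]

@dataclass
class Report:
    success: bool
    hints: list[str] = field(default_factory=list)  # e.g. "missing dependency: pytest-cov"

def plan(repo_path: str, feedback: list[str]) -> Plan:
    """Planning agent (toy): draft scripts, folding in hints from earlier rounds."""
    extras = [h.split(":", 1)[1].strip() for h in feedback if h.startswith("missing dependency:")]
    return Plan(
        base_image="python:3.10",
        install_script=["pip install -e ."] + [f"pip install {pkg}" for pkg in extras],
        test_script=["pytest", "-rA"],
    )

def execute(p: Plan) -> Report:
    """Execution agent (toy): pretend the build succeeds."""
    return Report(success=True)

def verify(p: Plan) -> Report:
    """Verification agent (toy): fail until the hinted package is installed."""
    if any("pytest-cov" in cmd for cmd in p.install_script):
        return Report(success=True)
    return Report(success=False, hints=["missing dependency: pytest-cov"])

def build_environment(repo_path: str, max_rounds: int = 5) -> Plan | None:
    feedback: list[str] = []
    for _ in range(max_rounds):
        p = plan(repo_path, feedback)          # plan (using accumulated hints)
        exec_report = execute(p)               # do
        if not exec_report.success:
            feedback.extend(exec_report.hints)
            continue
        verify_report = verify(p)              # check
        if verify_report.success:
            return p                           # environment is verified
        feedback.extend(verify_report.hints)   # feed diagnosis back to planning
    return None                                # give up after the retry budget

print(build_environment("/path/to/repo"))      # converges after one feedback round
```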
🍞 Top Bread (Hook) Imagine reusing last year’s science fair project board and only swapping a few parts instead of building everything from zero.
🥬 Filling (The Actual Concept: Environment Reuse Mechanism)
- What it is: A way to fetch a past, similar environment and patch it just enough to fit the new task.
- How it works:
- Search a pool of previously successful environments for the closest match (same repo/version if possible).
- Try running tests in that environment.
- If tests fail, generate a small patch (extra installs or tweaks) and retry.
- Why it matters: Avoids long rebuilds, reducing time and errors from scratch setups.
🍞 Bottom Bread (Anchor) A Home Assistant task fails due to a missing pyrainbird package; the agent installs just that, then all tests pass.
The Aha! Moment (one sentence):
- Instead of rebuilding every environment from scratch, let a team of agents reuse and minimally patch similar past environments within a plan–do–check loop, then prove correctness with Fail-to-Pass.
Multiple Analogies:
- Cookbook: Start from a known recipe (old environment), tweak spices (patches), taste-test (verification).
- Lego: Reuse a mostly built model and swap a few bricks, rather than pouring out the whole box again.
- School project: Use last semester’s poster layout and adjust only the text and pictures.
Before vs After:
- Before: Single-language focus, manual builds, fragile scripts, slow retries, unreliable verification signals.
- After: Polyglot automation, fast reuse, guided error diagnosis, strict F2P validation, measurable gains.
Why It Works (intuition):
- Software evolves incrementally; nearby snapshots share 80–90% of dependencies. Reusing a recent environment keeps most parts ready, so only a few missing or mismatched pieces need fixing. The verification loop prevents drift by catching real runtime failures and steering precise patches.
Building Blocks:
- Repository Analysis Agent, Environment Setup Agent, Test Configuration Agent, Environment Execution Agent, Verification Agent, EnvPatchAgent, and the Environment Pool with similarity search.
🍞 Top Bread (Hook) Ever keep a photo album of projects so you don’t start over next time?
🥬 Filling (The Actual Concept: Environment Pool and Retrieval)
- What it is: A library of previously verified environments with metadata for quick matching.
- How it works:
- Index environments by repo, version, and time.
- Prefer exact-version matches; otherwise, pick the closest newer one (backward-compatibility often holds).
- Why it matters: Good matches mean tiny patches and big time savings.
🍞 Bottom Bread (Anchor) Keycloak v2.5 reuses an environment from v2.6 with just a few dependency tweaks.
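A sketch of that retrieval preference, assuming each stored environment is tagged with its repository, version, and build timestamp (the field names and the Keycloak entries are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class StoredEnv:
    repo: str
    version: str          # e.g. a release tag or commit version
    built_at: datetime
    image: str            # Docker image name in the pool

def retrieve(pool: list[StoredEnv], repo: str, version: str,
             target_time: datetime) -> StoredEnv | None:
    """Prefer an exact repo+version match; otherwise pick the temporally closest
    newer environment for the same repo (backward compatibility often holds)."""
    same_repo = [e for e in pool if e.repo == repo]
    exact = [e for e in same_repo if e.version == version]
    if exact:
        return exact[0]
    newer = [e for e in same_repo if e.built_at >= target_time]
    if newer:
        return min(newer, key=lambda e: e.built_at - target_time)
    return None  # no suitable candidate: fall back to building from scratch

# Hypothetical usage:
pool = [
    StoredEnv("keycloak", "2.6", datetime(2024, 6, 1), "pool/keycloak-2.6"),
    StoredEnv("keycloak", "2.4", datetime(2024, 1, 1), "pool/keycloak-2.4"),
]
print(retrieve(pool, "keycloak", "2.5", datetime(2024, 3, 1)))  # picks the 2.6 image
```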
03 Methodology
Overview: Input → Planning → Execution → Verification → (Reuse or Retry) → Output
- Input: Task context from GitHub (repository snapshot, test patch, fix patch).
- Output: A Dockerized environment and scripts that satisfy F2P (buggy fails, fixed passes).
Step-by-step:
- Planning
- What happens:
a) Repository Analysis Agent scans manifest files (package.json, pom.xml, go.mod, Cargo.toml, CMakeLists.txt) and infers the language, build tools, and test framework.
b) Environment Setup Agent selects a base image (e.g., ubuntu:22.04, python:3.10, openjdk:17) and drafts an install script (P).
c) Test Configuration Agent writes a test script (T) aligned with the setup (pytest flags, mvn test goals, go test packages, etc.).
- Why it exists: A wrong base or missing install step derails everything; incorrect test commands hide success or fabricate failures.
- Example: For a Python repo with uv and pytest, the plan chooses ubuntu:22.04, installs uv and Python 3.10, then runs pytest -rA tests/.
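The manifest scan in this step can be pictured as a simple lookup from files to default build and test commands. The mapping below is a plain heuristic sketch; the images and commands are illustrative defaults, not the paper's actual rules or prompts.

```python
from pathlib import Path

# Heuristic mapping from manifest files to (base image, install cmd, test cmd).
# Choices are illustrative defaults, not the paper's exact picks.
MANIFEST_RULES = {
    "pyproject.toml": ("python:3.10",  "pip install -e .",            "pytest -rA"),
    "package.json":   ("node:20",      "npm ci",                      "npm test"),
    "pom.xml":        ("openjdk:17",   "mvn -q install -DskipTests",  "mvn test"),
    "go.mod":         ("golang:1.22",  "go mod download",             "go test ./..."),
    "Cargo.toml":     ("rust:1.78",    "cargo fetch",                 "cargo test"),
    "CMakeLists.txt": ("ubuntu:22.04", "cmake -B build && cmake --build build", "ctest --test-dir build"),
}

def draft_plan(repo: Path):
    """Return (base_image, install, test) for the first manifest found, else None."""
    for manifest, plan in MANIFEST_RULES.items():
        if (repo / manifest).exists():
            return plan
    return None  # unknown ecosystem: hand off to the LLM planner

print(draft_plan(Path(".")))
```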
- Execution
- What happens:
a) Environment Execution Agent starts a container from the base image.
b) Runs P line by line, watching logs to hot-fix simple issues (e.g., apt-get update before apt install).
c) On repeated failure, it stops and sends logs back to planning for a new strategy.
- Why it exists: Live deviations (mirror outages, version conflicts) require on-the-spot adjustments; otherwise timeouts and restarts waste hours.
- Example: If pip install fails due to build-essential missing, the agent injects apt-get install build-essential and retries.
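This hot-fix behavior can be sketched as running the install script one command at a time and injecting a known remedy when a familiar error appears in the log. The remedy table below is a tiny illustrative stand-in; the real agent reasons over logs with an LLM.

```python
import subprocess

# Illustrative log-pattern -> remedy table (the real agent reasons over logs).
KNOWN_FIXES = {
    "Unable to locate package": "apt-get update",
    "gcc: command not found":   "apt-get install -y build-essential",
}

def run_step(cmd: str) -> tuple[bool, str]:
    """Run one shell command inside the build container; return (ok, combined log)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def execute_install(script: list[str]) -> bool:
    for cmd in script:
        ok, log = run_step(cmd)
        if ok:
            continue
        # Hot-fix: look for a known remedy, apply it, then retry the failed step once.
        for pattern, remedy in KNOWN_FIXES.items():
            if pattern in log:
                run_step(remedy)
                ok, log = run_step(cmd)
                break
        if not ok:
            print("Escalating to planner with log tail:\n", log[-500:])
            return False
    return True

# Hypothetical usage inside a container:
# execute_install(["apt-get install -y libssl-dev", "pip install -e ."])
```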
- Verification
- What happens:
a) The Verification Agent executes T first with only the test patch (expect fail).
b) Then executes T with fix patch (expect pass).
c) If anything goes wrong, it attributes the error (missing dependency, wrong test path, flaky compile) and feeds structured hints upstream.
- Why it exists: F2P is the gold standard; precise attribution turns random trial-and-error into guided improvement.
- Example: mvn test fails due to a missing JDK tool; diagnosis: install openjdk-17-jdk-headless.
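Error attribution can be sketched as classifying the tail of a test log into a small set of causes that planning knows how to act on. The patterns below are illustrative; the paper's agent performs this diagnosis with an LLM rather than fixed rules.

```python
import re

# Coarse, illustrative failure categories the planner can act on.
ATTRIBUTION_RULES = [
    (r"ModuleNotFoundError: No module named '([\w.]+)'", "missing dependency: {0}"),
    (r"No such file or directory: '?([\w./-]*tests?[\w./-]*)'?", "wrong test path: {0}"),
    (r"package ([\w./-]+) is not in GOROOT", "missing Go package: {0}"),
    (r"error: linker `cc` not found", "missing C toolchain"),
]

def attribute(log: str) -> str:
    """Map a raw test log to a structured hint for the planning agent."""
    for pattern, template in ATTRIBUTION_RULES:
        m = re.search(pattern, log)
        if m:
            return template.format(*m.groups())
    return "unclassified failure: see raw log"

print(attribute("E   ModuleNotFoundError: No module named 'pyrainbird'"))
# -> "missing dependency: pyrainbird"
```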
Secret Sauce: Environment Reuse
- Retrieval: Search the Environment Pool for same repo and version; if none, choose the temporally closest newer environment to maximize backward compatibility.
- Adaptation: The EnvPatchAgent generates a tiny patch ΔP of extra commands only when verification fails.
🍞 Top Bread (Hook) Think of borrowing a nearly finished costume and just adjusting the size.
🥬 Filling (The Actual Concept: EnvPatchAgent)
- What it is: An agent that writes minimal, context-aware fix commands to adapt a reused environment.
- How it works:
- Read verification logs and the original setup context.
- Propose the smallest safe change (install one missing lib, set one env var, add one compiler flag).
- Apply and re-verify; repeat until pass.
- Why it matters: Small, targeted changes are faster and less risky than full rebuilds.
🍞 Bottom Bread (Anchor) In Home Assistant, it only added pip install pyrainbird after activating the right Conda env—tests then passed.
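The patch-and-reverify loop can be sketched as: verify, ask for the smallest fix suggested by the failure, apply it, and repeat within a small budget. Here suggest_patch is a purely hypothetical stand-in for the LLM-backed EnvPatchAgent.

```python
import subprocess

def verify(test_cmd: list[str]) -> tuple[bool, str]:
    """Run the test script in the reused environment; return (passed, log)."""
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def suggest_patch(log: str) -> str | None:
    """Stand-in for the EnvPatchAgent: propose one minimal fix command."""
    if "No module named" in log:
        missing = log.split("No module named")[-1].strip(" '\"\n").split("'")[0]
        return f"pip install {missing}"
    return None  # nothing obvious: fall back to full planning

def adapt_reused_env(test_cmd: list[str], max_patches: int = 3) -> bool:
    for _ in range(max_patches):
        passed, log = verify(test_cmd)
        if passed:
            return True                    # reused environment now satisfies the tests
        patch = suggest_patch(log)
        if patch is None:
            return False
        subprocess.run(patch, shell=True)  # apply the minimal patch (e.g. one pip install)
    return False

# Hypothetical usage: adapt_reused_env(["pytest", "-rA", "tests/"])
```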
Concrete Data Flow Example (Python):
- Input: Repo snapshot R, test patch adds failing test, fix patch repairs code.
- Planning chooses python:3.10, installs uv and project deps, test script is pytest -rA.
- Execution completes after adding build-essential.
- Verification: buggy fails; fixed passes → environment saved to pool.
Concrete Data Flow Example (Java):
- Input: Keycloak task; build via Maven.
- Planning picks openjdk:17, installs maven, sets MAVEN_OPTS, test script mvn -q -DskipITs=false test.
- Execution adds missing libssl-dev on error.
- Verification: initially fails due to plugin; EnvPatchAgent pins surefire version; then F2P succeeds.
04 Experiments & Results
The Test
- Benchmark: MEnvBench with 1,000 tasks spanning 10 languages (Python, Java, Go, JavaScript, TypeScript, Rust, C, C++, PHP, Ruby).
- Metrics:
- Pass Rate: fixed code passes tests.
- Fail-to-Pass (F2P) Rate: buggy fails and fixed passes in the same environment.
- Time Cost: wall-clock seconds per task.
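For concreteness, here is how these three numbers could be computed from per-task records, assuming each record stores whether the buggy run failed, whether the fixed run passed, and the wall-clock time (field names are illustrative, not the paper's schema).

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    buggy_failed: bool   # tests fail with only the test patch applied
    fixed_passed: bool   # tests pass after the fix patch is also applied
    seconds: float       # wall-clock build + verify time

def summarize(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    return {
        "pass_rate":   sum(r.fixed_passed for r in results) / n,
        "f2p_rate":    sum(r.buggy_failed and r.fixed_passed for r in results) / n,
        "avg_seconds": sum(r.seconds for r in results) / n,
    }

# Toy example with three tasks:
print(summarize([
    TaskResult(True, True, 120.0),   # counts toward both rates
    TaskResult(False, True, 90.0),   # passes but violates F2P (buggy did not fail)
    TaskResult(True, False, 300.0),  # environment built but the fix did not pass
]))
```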
The Competition
- Repo2Run (Python-focused).
- SWE-Bench-Live (6 languages).
- SWE-Factory (multi-agent baseline across all 10 languages).
The Scoreboard (contextualized)
- Averaged across languages and backbones, MEnvAgent lifts F2P by 8.6 percentage points and Pass Rate by 11.0 points while cutting time by 43%.
- Think of it like moving from a class average of 80 to 88 while finishing homework 43% faster.
- In the time-versus-success trade-off, MEnvAgent clusters in the top-left (faster and more successful), while baselines sit farther right (slower) or lower (fewer valid environments).
Surprising Findings
- Reuse scales with data: as more instances per repo are available (1 → 10), reuse success rises to 39%, time drops sharply, and pass rates improve.
- Language patterns: Go and Python do well thanks to standardized ecosystems; Java improves notably with better setup scripts; C/C++ remain tough due to complex compilers and native libs.
- Larger repos correlate with lower F2P (harder builds), confirming that big projects are inherently trickier.
Why The Gains Happen
- Reuse avoids repeating hours of installs and compiles.
- Verification-driven patching replaces guesswork with targeted, minimal fixes.
- Multi-agent specialization shortens the error-repair loop.
Downstream Utility
- Using MEnvAgent, they built MEnvData-SWE: 3,005 realistic, Dockerized instances across 10 languages plus 3,872 solution trajectories.
- Fine-tuning multiple models on these trajectories consistently improved scores on SWE-bench Verified and Multilingual, with the largest model matching or beating strong references in places.
05 Discussion & Limitations
Limitations
- Reuse depends on history: if there’s no similar past environment for a repo, benefits shrink and builds may fall back to scratch.
- Big native builds (C/C++) and intricate Java projects still cause timeouts or tricky linker/compiler errors.
- Flaky tests and evolving ecosystems can mislead verification unless guarded carefully.
Required Resources
- Containerized compute with Docker and preferably Kubernetes for high concurrency (1000+ builds).
- Access to capable LLMs for planning and diagnosis; logs storage for the environment pool.
- A time budget per task (global timeouts on the order of hours), especially for scratch builds.
When NOT to Use
- Highly proprietary or non-containerizable stacks where Docker isolation isn’t allowed.
- Projects with no automated tests (no verification signal) or tasks lacking clear test/fix patches.
- Ultra time-critical scenarios where even reuse is too slow (e.g., seconds-level SLAs).
Open Questions
- How to predict the best base image and patch sequence using learned policies to reduce retries further?
- Can we generalize reuse across related but different repos (cross-project transfer) safely?
- How to robustly handle flaky tests and nondeterministic builds in F2P verification?
- What’s the best way to represent and search the environment pool (semantic embeddings of build graphs)?
- Can reinforcement learning with verifiable rewards accelerate the planning and patching agents beyond supervised heuristics?
06 Conclusion & Future Work
Three-sentence Summary
- MEnvAgent is a team of AI helpers that plans, builds, verifies, and reuses environments across 10 languages, proving correctness with the strict Fail-to-Pass rule.
- By retrieving similar past environments and patching them minimally, it achieves higher success and much faster builds than strong baselines.
- The framework also powers a large, realistic dataset (MEnvData-SWE) that boosts many coding models after fine-tuning.
Main Achievement
- A scalable, polyglot environment-construction system with a reuse-and-patch mechanism that raises F2P by 8.6 percentage points while cutting time by 43% on a new 1,000-task benchmark.
Future Directions
- Smarter retrieval and patch policies learned from experience; better handling of large native builds; robust defenses against flaky tests; cross-repository reuse; integration with RL using verifiable rewards.
Why Remember This
- It shows that trustworthy software-agent progress depends on getting environments right—and that reuse plus intelligent verification can make this both fast and reliable at scale.
Practical Applications
- Auto-build runnable environments for GitHub issues to reproduce and fix bugs quickly.
- Keep an environment pool per repository to speed up CI for historical regressions.
- Use F2P verification when curating datasets to ensure tasks are meaningful and non-flaky.
- Patch only what’s missing (e.g., one package or flag) instead of full rebuilds to cut build time.
- Generate multilingual SWE training data from real repos to fine-tune coding models.
- Diagnose failing pipelines by attributing errors to environment vs test-command issues.
- Warm-start developer sandboxes by retrieving the closest verified environment snapshot.
- Accelerate benchmark maintenance by rapidly updating environments for new repo versions.
- Support RL with verifiable rewards by guaranteeing execution-based pass/fail signals.