MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era

Intermediate
Lei Zhang, Mouxiang Chen, Ruisheng Cao et al. · 1/12/2026
arXiv · PDF

Key Summary

  • MegaFlow is a new system that helps thousands of AI agents practice and test big, messy tasks (like fixing real software bugs) all at once without crashing or wasting money.
  • It splits work into three parts—Model Service, Agent Service, and Environment Service—so each part can grow or shrink on its own.
  • Instead of a few giant computers, MegaFlow uses many small, cheap machines, which cuts costs by 32% at 2,000 tasks and scales smoothly up to 10,000 tasks.
  • An event-driven design lets the system react instantly to changes (like a task finishing) without constant checking, saving time and resources.
  • A hybrid mode runs tasks either in fresh, temporary environments (safer) or long-lived ones (faster), picking the best option per job.
  • MegaFlow avoids storage headaches by pulling container images on demand from the cloud instead of keeping huge local copies.
  • Compared to centralized setups, MegaFlow keeps performance steady as you add more tasks and avoids slowdowns from network and resource fights.
  • It already managed over 2 million training runs in production and supports popular coding-agent frameworks like OpenHands and SWE-Agent.
  • For reinforcement learning, MegaFlow coordinates 1,024 parallel environments per step, helping larger models learn faster and better.
  • This fills a major infrastructure gap so researchers can focus on smarter agents instead of wrestling with scheduling, storage, and cluster rules.

Why This Research Matters

AI agents are moving from simple chat to real computer work—editing code, running apps, and fixing problems—which needs safe, repeatable environments at huge scale. MegaFlow makes that scale practical and affordable by splitting responsibilities, reacting to events instantly, and using many small cloud machines instead of a few giant ones. This means faster research cycles and better, more reliable agent training on real-world tasks. Companies can reduce costs and risks while testing thousands of workflows in parallel. Education and open-source communities can run larger benchmarks without specialized hardware. Ultimately, smoother orchestration unlocks smarter agents that can help with software quality, automation, and productivity across industries.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine running a giant after-school fair with thousands of games, each needing its own supplies, space, and helpers. If everyone shows up at once, you’d need a smart plan to avoid chaos—who gets what, where, and when. That’s what training modern AI agents feels like.

🥬 The Concept (The world before): For years, AI got really good at straight-line tasks like classifying pictures or answering one-off questions. These jobs fit neatly on big servers with predictable speeds and data. But a new wave—agentic AI—means AIs that interact step by step with messy environments: writing code, running tests, controlling apps, and fixing things across many tries. These aren’t just math problems; they’re living, changing adventures.

How it worked before (and why it hurt):

  1. People tried to run everything on a few super-powerful machines. That works for small demos but buckles when thousands of agents need their own tools and sandboxes.
  2. Teams stored huge “containers” (boxed-up software environments) locally. Even normal datasets like SWE-bench balloon to tens of terabytes if you keep every version nearby.
  3. Clusters often disallow “arbitrary containers” for security, so the very thing agents need—isolated sandboxes—is blocked by the rules of the building they’re in.
  4. Centralized setups got stuck on network traffic jams: pulling many large images at once or juggling too many tasks on the same network card.

What broke without better infrastructure:

  • Security: Agents need safe, isolated playgrounds. Cluster policies often forbid them.
  • Storage: Keeping all containers locally is like storing every carnival booth in your garage—impossible at scale.
  • Throughput: Running container-heavy tasks on a few big machines causes crowding; parallelism stalls.

🍞 Anchor: Think of 2,000 kids wanting 2,000 different craft kits at the same time. If the school has one supply closet and one hallway, everyone waits. You need many smaller stations, fast restocking, and a system that knows exactly when to send what where.

🍞 Hook: You know how you might split the fair team into three crews—one designs the games (brains), one runs the games (organizers), and one sets up the rooms and tables (spaces)?

🥬 The Concept (The gap MegaFlow fills): MegaFlow splits agent training into three independent services: Model Service (the brain math), Agent Service (the coordinator), and Environment Service (the sandbox where actions happen). Each has a simple, shared language (APIs), so they can scale and improve on their own.

How it works at a high level:

  1. The Agent Service picks tasks and talks to the Environment Service to start safe containers.
  2. The Model Service thinks (inference) and learns (training) when the Agent Service asks.
  3. The Environment Service runs the actions and returns observations, rewards, and stop signals.
  4. Everything is connected by events, so the system reacts right away without constantly checking.
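The four-step loop above can be sketched in a few lines. This is a minimal single-process sketch, not MegaFlow's actual API: every class and method name here (`ModelService.act`, `EnvironmentService.step`, and so on) is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    text: str
    reward: float = 0.0
    done: bool = False

class ModelService:
    """Hypothetical stand-in for the inference backend (the 'brain')."""
    def act(self, observation: str) -> str:
        # A real service would run LLM inference; here we return a fixed action.
        return f"run_tests after seeing: {observation}"

class EnvironmentService:
    """Hypothetical sandbox that executes actions and returns feedback."""
    def __init__(self, max_steps: int = 3):
        self.steps = 0
        self.max_steps = max_steps

    def reset(self) -> Observation:
        self.steps = 0
        return Observation(text="repo cloned, failing test found")

    def step(self, action: str) -> Observation:
        self.steps += 1
        done = self.steps >= self.max_steps
        return Observation(text=f"executed: {action}",
                           reward=1.0 if done else 0.0, done=done)

class AgentService:
    """Coordinator: drives the rollout loop between model and environment."""
    def __init__(self, model: ModelService, env: EnvironmentService):
        self.model, self.env = model, env

    def rollout(self) -> float:
        obs = self.env.reset()
        total = 0.0
        while not obs.done:
            action = self.model.act(obs.text)   # step 2: the brain thinks
            obs = self.env.step(action)         # step 3: the sandbox acts
            total += obs.reward
        return total

total_reward = AgentService(ModelService(), EnvironmentService()).rollout()
print(total_reward)  # 1.0
```

Because each class only talks to the others through small method contracts, any one of them can be replaced or scaled independently—which is the whole point of the split.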

Why it matters: Without this split, you get one giant, tangled machine where a slowdown in any part slows everything, and fixing storage, security, or scheduling becomes a nightmare.

🍞 Anchor: It’s like having separate teams for kitchen, servers, and dining room. The kitchen can add more cooks, the servers can speed up, and the dining room can add more tables—all without tripping over each other.

🍞 Hook: Picture two ways to book rooms for all those games: either reserve a fresh room for each game (perfectly clean, but some setup time) or keep a room running and reuse it (faster turnaround). Which do you pick?

🥬 The Concept (Prior failed attempts): Earlier systems mostly picked one path—either over-isolate (slow and expensive) or over-reuse (fast but riskier). They also tried to push everything through a few giant servers, causing bottlenecks.

Why that failed:

  • One huge server hits bandwidth and memory peaks, then sits idle.
  • Polling for status wastes time (like asking every minute, “Are we there yet?”).
  • Local storage fills up with massive container images.

🍞 Anchor: If you only have one gym and one hallway, lining up thousands of kids means a traffic jam no matter how fast the gym teacher is. You need more doors and smarter timing.

🍞 Hook: Imagine the fair now can rent many small rooms across town for a few hours each—quick to grab, cheap, and perfect for one game at a time.

🥬 The Concept (What MegaFlow changes): MegaFlow moves container-heavy work to elastic cloud machines, pulls images on demand from cloud storage, and coordinates everything with events. The system proved it can run tens of thousands of tasks smoothly, save ~32% at 2,000 tasks, and keep timing steady as load grows.

Why this matters in real life:

  • Better AI software helpers that actually compile, test, and fix real code at scale.
  • Faster research cycles—less waiting on computers, more learning from results.
  • Lower cost and fewer headaches setting up giant training runs.

🍞 Anchor: It’s like switching from one crowded cafeteria to many food trucks that appear when needed, bring their own gear, and leave no mess behind.

02Core Idea

🍞 Hook: You know how a big sports tournament runs best when the coaches (strategy), players (action), and stadium crew (fields and equipment) are managed separately—but stay in perfect sync?

🥬 The Concept (Aha! in one sentence): Decouple agent training into three services—Model, Agent, Environment—and coordinate them with an event-driven, elastic, hybrid-execution system so we can run massive, container-heavy tasks efficiently.

How it works—three analogies:

  1. City traffic analogy:
  • Model Service = traffic brain that plans routes.
  • Agent Service = dispatch that assigns drivers and tracks trips.
  • Environment Service = roads and intersections where driving happens.
  • Events = traffic lights that signal when to go/stop.
  • Elasticity = add more lanes (small instances) during rush hour.
  2. Restaurant analogy:
  • Kitchen (Model): cooks dishes (answers and training updates) to order.
  • Servers (Agent): decide table order, collect feedback, send new orders.
  • Dining Room (Environment): where customers interact.
  • Hybrid mode: private room per party (ephemeral) vs shared dining (persistent).
  • Events: bell rings when a dish is ready—no one keeps peeking.
  3. School fair analogy:
  • Designers (Model), Organizers (Agent), Rooms/Booths (Environment), Bell schedule (Events), Extra classrooms on demand (Elasticity).

Before vs After:

  • Before: One big machine did everything, storage ballooned, startup times exploded under load, and network clogged.
  • After: Many small, identical machines spin up fast; containers pull over fast internal links; tasks finish steadily even at 10,000-way parallelism; costs drop.

Why it works (intuition, not equations):

  • Separate concerns: Each service optimizes its own specialty without dragging others down.
  • Locality of pain: If containers are slow, scale Environment Service only; if thinking is slow, scale Model Service.
  • Avoid polling: Events tell us exactly when something changes, saving cycles and lag.
  • Many-small-instances: Sidesteps single-machine bandwidth wars and keeps usage predictable.
  • Hybrid execution: Reuse environments when safe for speed; rebuild fresh when isolation matters.

Building blocks (each with a mini sandwich):

🍞 Hook: Imagine calling one phone number that routes you to the right team every time. 🥬 Unified APIs: A simple shared language for the three services so they can talk cleanly.

  • How: Standard request/response for inference, training, rollout control, and environment lifecycle.
  • Why: Without it, every project reinvents glue code and breaks when parts change. 🍞 Anchor: Like a universal power outlet adapter for all your devices.
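A unified API boils down to shared request/response contracts. The paper does not publish its exact schemas, so the types and field names below are purely illustrative—a sketch of what "a simple shared language" between services could look like:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class InferenceRequest:
    prompt: str
    max_tokens: int = 256

@dataclass
class InferenceResponse:
    action: str

@dataclass
class EnvCreateRequest:
    image: str               # container image to launch
    mode: str = "ephemeral"  # or "persistent"

class ModelAPI(Protocol):
    """Any inference backend (vLLM, SGLang, ...) satisfying this shape plugs in."""
    def infer(self, req: InferenceRequest) -> InferenceResponse: ...

class EnvironmentAPI(Protocol):
    """Contract for environment lifecycle: create, execute, destroy."""
    def create(self, req: EnvCreateRequest) -> str: ...   # returns env_id
    def execute(self, env_id: str, command: str) -> str: ...
    def destroy(self, env_id: str) -> None: ...

# A toy backend: anything matching the Protocol works without glue code.
class EchoModel:
    def infer(self, req: InferenceRequest) -> InferenceResponse:
        return InferenceResponse(action=req.prompt.upper())

resp = EchoModel().infer(InferenceRequest(prompt="edit file"))
print(resp.action)  # EDIT FILE
```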

🍞 Hook: Think of renting many small study rooms instead of one giant auditorium. 🥬 Elastic Resource Strategy: Prefer many standard, small cloud instances over a few giant ones.

  • How: Scale out by task, deallocate immediately when done.
  • Why: Avoids contention and scarce-giant-machine limits. 🍞 Anchor: It’s easier to find 100 small rooms than one stadium on short notice.

🍞 Hook: A class starts when the bell rings—you don’t keep asking the clock. 🥬 Event-Driven Coordination: The system reacts to lifecycle and completion events.

  • How: Cloud event bus sends signals; services update states instantly.
  • Why: No wasteful polling; faster, more reliable reactions. 🍞 Anchor: Firefighters wait for the alarm, not constant door knocks.
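The no-polling idea can be shown with a minimal in-process event bus—a toy stand-in for a real cloud event bridge (topic names and payload fields are invented):

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process stand-in for a cloud event bridge (illustrative)."""
    def __init__(self):
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        # Handlers fire the moment an event is published -- no polling loop.
        for handler in self._subs[topic]:
            handler(payload)

freed = []
bus = EventBus()
# On task completion, immediately reclaim the instance.
bus.subscribe("task.completed", lambda e: freed.append(e["instance_id"]))

bus.publish("task.completed", {"instance_id": "i-001", "result": "stored"})
print(freed)  # ['i-001']
```

Contrast this with polling: no thread ever asks "is i-001 done yet?"—the completion event itself triggers the cleanup.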

🍞 Hook: Sometimes you want a fresh notebook; other times you reuse the same one. 🥬 Hybrid Execution Model: Ephemeral (fresh container per task) and Persistent (reuse warmed environments).

  • How: Scheduler picks mode per workload.
  • Why: Balance isolation vs speed; save minutes per task at scale. 🍞 Anchor: New exam = clean sheet; homework practice = same notebook.
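The scheduler's mode choice is, at heart, a small policy function. The rule below is a hypothetical example of such a policy (the paper does not specify its exact decision criteria), capturing the isolation-versus-speed trade-off:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    needs_isolation: bool  # e.g. untrusted code or a clean benchmark run
    est_runs: int          # how many times this environment will be reused

def pick_mode(task: Task) -> str:
    """Illustrative policy: isolation wins; otherwise reuse saves startup time."""
    if task.needs_isolation:
        return "ephemeral"    # fresh container per task, maximal safety
    if task.est_runs > 1:
        return "persistent"   # warm environment amortizes setup latency
    return "ephemeral"        # a one-off gains nothing from staying warm

print(pick_mode(Task("swe-bench-eval", needs_isolation=True, est_runs=1)))   # ephemeral
print(pick_mode(Task("rl-training", needs_isolation=False, est_runs=100)))   # persistent
```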

🍞 Hook: You don’t rebuild a kitchen; you order groceries when needed. 🥬 On-Demand Containers: Pull images from cloud registries over fast internal links.

  • How: Pre-provision metadata; fetch only what’s used.
  • Why: Avoids terabytes of local bloat. 🍞 Anchor: Like streaming a movie instead of owning every DVD.
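"Fetch only what's used" is the classic pull-through cache pattern: check the local copy first, hit the registry only on a miss. A tiny sketch, with a fake registry standing in for a real cloud one:

```python
class ImageCache:
    """Illustrative on-demand pull: fetch from the registry only on first use."""
    def __init__(self, pull_fn):
        self._pull = pull_fn
        self._local: dict[str, bytes] = {}
        self.pulls = 0  # how many times we actually hit the registry

    def get(self, image: str) -> bytes:
        if image not in self._local:          # cache miss: pull on demand
            self._local[image] = self._pull(image)
            self.pulls += 1
        return self._local[image]             # cache hit: no network at all

def fake_registry_pull(image: str) -> bytes:
    return f"layers-of-{image}".encode()

cache = ImageCache(fake_registry_pull)
cache.get("swe-bench:django-001")
cache.get("swe-bench:django-001")  # second use hits the local copy
print(cache.pulls)  # 1
```

Nothing is stored that was never requested, which is exactly how the tens-of-terabytes local-mirror problem disappears.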

The punchline: Coordination—not raw compute—is the real bottleneck for agent training. MegaFlow fixes coordination at scale.

03Methodology

At a high level: Inputs (task specs, datasets) → Agent Service plans rollouts → Environment Service provisions and runs containers → Model Service performs inference/training → Agent Service collects trajectories and rewards → Outputs (scores, logs, updated models).

We’ll walk through each step, sandwich-style for every new core component.

  1. Task Scheduler 🍞 Hook: Picture a librarian who hands out books in order so no one pushes ahead. 🥬 What it is: A fast, asynchronous scheduler that lines up tasks and assigns them fairly (FIFO). How it works:
  • Receives rollout requests and places them in a queue.
  • For ephemeral tasks: spins up a dedicated small instance, runs one task, then shuts it down.
  • For persistent tasks: picks a machine from a warm pool and launches the container inside it. Why it matters: Without a clear line and matching rooms to tasks, you get pileups, idle machines, or overbooked ones. 🍞 Anchor: Like assigning each test-taker a seat—no seat, no start.
  2. Resource Manager 🍞 Hook: Imagine a crossing guard who makes sure only the right number of kids cross at once. 🥬 What it is: The part that tracks capacity and sets limits so the system doesn’t overwhelm itself. How it works:
  • Uses distributed semaphores to cap concurrent tasks.
  • Respects user limits for Model Service calls to avoid downstream overloads.
  • Enforces admin quotas to keep things fair and prevent abuse. Why it matters: Without limits, a rush of tasks can knock over the whole system or starve other users. 🍞 Anchor: Like a turnstile at a stadium—it counts people in and out so the place stays safe.
  3. Environment Manager 🍞 Hook: Think of a clean science lab for each experiment so chemicals don’t mix. 🥬 What it is: The layer that prepares containerized, isolated sandboxes where agents act. How it works:
  • Prepares required images in cloud registries.
  • Launches containers quickly via high-bandwidth internal networks.
  • Uses two layers of isolation: the VM instance (resources) and containers (process/filesystem).
  • Delegates container lifecycle to proven agent frameworks (e.g., OpenHands, SWE-Agent). Why it matters: Without strict isolation, tasks can break each other, leak data, or fail unpredictably. 🍞 Anchor: Each student gets their own lab bench and goggles—no sharing beakers mid-experiment.
  4. Event-Driven Monitoring 🍞 Hook: Instead of constantly asking “Are we there yet?”, wait for a ping when the bus arrives. 🥬 What it is: Real-time signals that tell MegaFlow when instances are ready and tasks finish. How it works:
  • Cloud events announce instance up/down and task completion.
  • On event, MegaFlow updates states, frees resources, and stores results.
  • For details, it augments events with specific API calls (only when needed). Why it matters: Polling wastes time and money; events keep the system snappy and precise. 🍞 Anchor: The oven dings when the cookies are done—you don’t keep opening the door.
  5. Data Persistence 🍞 Hook: You keep your to-do list in a notebook and your photos in an album—they serve different needs. 🥬 What it is: Separate storage for live operations vs big artifacts. How it works:
  • Operational metadata (task specs, states, instance info) in a document DB; queues in fast in-memory stores.
  • Large artifacts (logs, trajectories, checkpoints) in cloud object storage. Why it matters: Mixing small, hot data with giant files slows everything; separation keeps both fast and safe. 🍞 Anchor: A backpack for daily supplies and a locker for bulky stuff.
  6. Model Service 🍞 Hook: The brain answers questions and learns from mistakes. 🥬 What it is: The compute part that runs inference (produce actions) and training (update weights). How it works:
  • Supports engines like Transformers, vLLM, and SGLang for efficient inference.
  • Trains via distributed frameworks like FSDP, Megatron, or VeRL/GSPO-based loops. Why it matters: Without a fast, scalable brain, agents can’t think quickly enough to keep up. 🍞 Anchor: Like a calculator that can also learn better shortcuts as it practices.
  7. Agent Service 🍞 Hook: The team captain who plans plays, records results, and asks coaches for new strategies. 🥬 What it is: The coordinator that chooses scaffolds (OpenHands, SWE-Agent, etc.), runs rollouts, and aggregates feedback. How it works:
  • Launches tasks across datasets; gathers observations, actions, rewards.
  • Feeds experiences to the Model Service to improve policy. Why it matters: Without a conductor, musicians (models and environments) won’t play in sync. 🍞 Anchor: The project manager who assigns tickets, tracks progress, and files reports.
  8. Environment Service 🍞 Hook: The playground where rules are enforced and scores are kept. 🥬 What it is: The runtime that executes actions, returns observations/rewards, and signals when to stop. How it works:
  • Provisions cloud instances; runs multiple containers as needed.
  • Enforces isolation and collects outputs. Why it matters: Without a consistent arena, you can’t trust results or compare runs. 🍞 Anchor: A game court with clear lines and a scoreboard.
  9. Hybrid Execution (Ephemeral vs Persistent) 🍞 Hook: Use a disposable plate for a picnic (clean) or a ceramic one at home (efficient). 🥬 What it is: Choose fresh-per-task containers (ephemeral) or reuse warmed environments (persistent). How it works:
  • Ephemeral: perfect isolation, slightly higher startup time.
  • Persistent: reuses images and setup to cut latency. Why it matters: Matching mode to task saves minutes per job at massive scale. 🍞 Anchor: New lab gloves for chemicals (safety) vs keeping your favorite pencil (speed and comfort).
  10. Many-Small-Instances Strategy 🍞 Hook: Ten small buses beat one mega-bus during city traffic. 🥬 What it is: Prefer lots of identical 8-core, 16GB machines—one task per instance. How it works:
  • Spin up to thousands of instances elastically; avoid bandwidth fights. Why it matters: Eliminates single-machine bottlenecks and availability constraints. 🍞 Anchor: More checkout lines shorten the wait even if each is modest.
  11. Unified APIs 🍞 Hook: One remote that controls TV, speakers, and lights. 🥬 What it is: Consistent request/response designs connecting services. How it works:
  • Clear contracts for inference calls, rollout control, and environment management. Why it matters: Teams can swap parts without breaking everything. 🍞 Anchor: USB plugs work everywhere; no custom cords needed.
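The Task Scheduler's FIFO queue and the Resource Manager's concurrency cap combine naturally. The paper's semaphores are distributed across machines; the single-process `asyncio` sketch below only illustrates the pattern (task bodies and limits are invented):

```python
import asyncio

async def run_task(task_id: int, sem: asyncio.Semaphore, log: list) -> None:
    """One rollout on its own (simulated) small instance."""
    async with sem:                 # stand-in for a distributed semaphore:
        log.append(("start", task_id))  # caps how many tasks run at once
        await asyncio.sleep(0)          # placeholder: container launch + rollout
        log.append(("done", task_id))   # instance is released on exit

async def main() -> list:
    sem = asyncio.Semaphore(2)      # admin quota: at most 2 concurrent tasks
    log: list = []
    # FIFO: tasks are submitted in request order and admitted as slots free up.
    await asyncio.gather(*(run_task(i, sem, log) for i in range(5)))
    return log

log = asyncio.run(main())
done = [e for e in log if e[0] == "done"]
print(len(done))  # 5
```

All five tasks complete, but at no point do more than two hold a slot—a burst of submissions cannot knock the system over, which is exactly the Resource Manager's job.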

Example with actual data flow:

  • Task: Fix a failing test in a real repo (from SWE-bench).
  • Agent Service schedules the rollout; Resource Manager grants a slot.
  • Environment Service launches a container; Event says “ready.”
  • Agent queries Model Service for next action (edit file, run tests).
  • Environment executes; returns logs and pass/fail signals.
  • On completion event, results go to object storage; Agent records reward and sends experience to Model Service for training.
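That data flow can be traced end to end in one function. Every callable here is a hypothetical stand-in for the corresponding service API, wired up with fakes just to exercise the sequence:

```python
def run_rollout(schedule, launch_env, infer, execute, store):
    """Illustrative end-to-end flow matching the steps above."""
    slot = schedule("fix-failing-test")      # Resource Manager grants a slot
    env_id = launch_env(slot)                # Environment Service: container up
    transcript = []
    obs = "test_foo fails: AssertionError"
    for _ in range(3):                       # bounded turns for the sketch
        action = infer(obs)                  # Model Service proposes an action
        obs, done = execute(env_id, action)  # Environment runs it, returns logs
        transcript.append((action, obs))
        if done:                             # completion event fires here
            break
    store(transcript)                        # results go to object storage
    return transcript

# Minimal fake implementations to exercise the flow:
stored = []
out = run_rollout(
    schedule=lambda name: f"slot-for-{name}",
    launch_env=lambda slot: "env-42",
    infer=lambda obs: "apply patch",
    execute=lambda env, act: ("tests pass", True),
    store=stored.append,
)
print(len(out), len(stored))  # 1 1
```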

Secret sauce:

  • Event-driven plus elastic scale avoids polling and hotspots.
  • Hybrid mode squeezes latency without sacrificing safety.
  • Delegation to existing agent frameworks prevents reinventing containers and lets MegaFlow focus on orchestration excellence.

04Experiments & Results

🍞 Hook: If two delivery companies race—one giant truck vs a fleet of small vans—who wins when 10,000 packages arrive at once?

🥬 The Test: The team measured how fast, how cheaply, and how smoothly MegaFlow runs huge numbers of real, container-heavy coding tasks. They tracked total time, cost, startup delays, and CPU/memory usage patterns.

The Competition:

  • Baseline: High-Spec Centralized—few huge servers (208 cores, 3TB RAM, 1 Gbps), up to 50 tasks per machine.
  • MegaFlow Distributed: Many 8-core, 16GB instances (100 Mbps each), usually one task per instance, elastic scaling.

Scoreboard with context:

  • Throughput/Time: MegaFlow stayed steady around ~100 minutes from 1 to 10,000 tasks; centralized crept worse (about 100→110 minutes) as contention grew. That’s like the fleet of vans finishing at the same time no matter how many packages show up, while the mega truck slows as the pile grows.
  • Scale limit: Centralized capped near 2,000 concurrent tasks due to limited giant machines; MegaFlow reached 10,000 by grabbing many small ones.
  • Cost: At 2,000 tasks, MegaFlow cost about $1,006 vs $1,470 for the centralized baseline, a 32% savings. Bigger scales likely save even more.
  • Resource usage: Centralized showed “bursty” spikes (CPU up to 25%, memory ~50%) then long idle valleys—hard to plan, wasteful. MegaFlow stayed steady (CPU ~5–10%, memory ~12%) with tight variability—predictable and efficient when multiplied across thousands of nodes.
  • Latency breakdown: Total time was ~75 min (persistent mode), ~90 min (ephemeral), and ~110 min (centralized). Environment startup for centralized got much worse with more tasks (1→13 minutes by 1,000 tasks). MegaFlow persistent stayed under ~1 minute; ephemeral rose gently (1→6 minutes), mostly due to registry pulls under high concurrency.

Surprising findings:

  • The real choke point wasn’t raw compute—it was coordination and local resource fights (network bandwidth, image pulls, initialization) on big servers.
  • Cloud registries handled high concurrent pulls reasonably well; the worst slowdowns came from local constraints in centralized setups.
  • Stable, low, and predictable utilization across many small machines beat higher peaks on a few giants in overall efficiency.

Reinforcement learning (RL) angle:

  • MegaFlow orchestrated 1,024 parallel environments per training step (64 tasks × 16 replicas), with agents allowed up to 100 turns and 128k-token contexts.
  • Using GSPO-style optimization, larger models (235B MoE) improved more and faster on SWE-bench Verified than a 30B MoE, showing the system can sustain heavy, realistic RL for coding agents.

🍞 Anchor: It’s like switching from a single crowded highway to a city grid of many streets and green-wave lights. Traffic keeps moving, costs drop, and you can always add a few more lanes where needed.

05Discussion & Limitations

🍞 Hook: Even the best amusement park has ride height limits, maintenance windows, and lines on holidays.

🥬 Limitations (honest talk):

  • Cloud dependence: The reported build runs on Alibaba Cloud; while APIs are abstracted, porting to another provider still needs engineering and testing.
  • Container image tug-of-war: Under extreme concurrency, even cloud registries can slow a bit; careful caching and prefetch policies matter.
  • Data gravity: Large artifacts (logs, trajectories, checkpoints) can grow quickly; teams still need lifecycle policies and budgets.
  • Not a magic database or Kubernetes replacement: MegaFlow delegates rather than reimplements; extremely complex multi-service dependency graphs may need tighter integration with container orchestrators.
  • Per-task isolation costs: Ephemeral mode adds startup time; choosing modes well is critical.

Required resources:

  • Access to elastic cloud compute with event bridges, object storage, and internal high-bandwidth registries.
  • Team familiarity with agent frameworks (e.g., OpenHands, SWE-Agent) and model serving stacks (vLLM/SGLang/Transformers).

When not to use it:

  • Tiny, single-machine experiments where containers and cloud spin-up dominate total time.
  • Ultra-low-latency, non-containerized tasks where a single optimized box is best.
  • Workloads with strict on-prem-only policies that forbid cloud or container use.

Open questions:

  • Automatic switching between ephemeral and persistent during a run based on live telemetry.
  • Deeper integration with Kubernetes for complex service chains (databases, web UIs, tools) per task, while keeping orchestration light.
  • Multi-cloud load balancing and spot-instance strategies for even lower cost.
  • Smarter image distribution (delta layers, peer-to-peer) to shave startup seconds at 10k+ scale.

🍞 Anchor: The park runs great today. Next, we add smarter line management, shared lockers for gear, and maybe a second location across town.

06Conclusion & Future Work

🍞 Hook: Picture three pit crews—one tunes the engine, one guides the driver, and one readies the track—each scaling up on demand and signaling each other the instant anything changes.

🥬 3-sentence summary:

  • MegaFlow cleanly separates agent training into Model, Agent, and Environment services, tied together by unified APIs and event-driven coordination.
  • Using many small cloud instances, hybrid execution (ephemeral vs persistent), and on-demand containers, it avoids storage, security, and network bottlenecks that cripple centralized setups.
  • In production-scale tests, it cut costs by ~32% at 2,000 tasks, scaled to 10,000 concurrent runs, and enabled high-throughput RL for complex coding agents.

Main achievement:

  • Turning coordination—not compute—into the solved problem for large-scale, container-heavy agent training by introducing a practical, modular, and elastic orchestration architecture.

Future directions:

  • Auto-switch execution modes mid-run, deeper Kubernetes integration for multi-service tasks, smarter image distribution, and multi-cloud elasticity.

Why remember this:

  • As AI agents move from toy demos to real, interactive work, the winners won’t just have smarter models—they’ll have smoother orchestration. MegaFlow is a blueprint for that smoothness at massive scale.

🍞 Anchor: It’s the difference between a single, overworked kitchen and a well-run food hall—more variety, faster service, happier customers.

Practical Applications

  • Run massive coding-agent evaluations (e.g., SWE-bench, SWE-Gym) without overhauling your cluster.
  • Train RL agents for software repair using 1,000+ parallel environments with automatic log and artifact storage.
  • Reduce infrastructure costs by switching from giant servers to many small elastic instances.
  • Shorten task startup times by reusing persistent environments for long training jobs.
  • Integrate with existing agent frameworks (OpenHands, SWE-Agent) via unified APIs without custom glue for each project.
  • Add event-driven alerts to auto-reclaim resources the moment tasks finish, cutting idle spend.
  • Adopt on-demand container pulls from cloud registries to avoid maintaining terabytes of local images.
  • Throttle model API calls via the Resource Manager to prevent cascading failures under peak load.
  • Port workloads across clouds by keeping to MegaFlow’s abstracted interfaces and storage patterns.
  • Collect, version, and analyze trajectories and rewards in object storage for reproducible research.
#agent-orchestration #distributed-systems #event-driven-architecture #containerized-environments #elastic-scaling #hybrid-execution #reinforcement-learning #model-serving #software-engineering-agents #cloud-native-infrastructure #on-demand-provisioning #resource-management #large-scale-evaluation #parallel-rollouts #unified-APIs