Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts
Key Summary
- Mixture-of-Experts (MoE) models often send far more tokens to a few "favorite" experts, which overloads some GPUs while others sit idle.
- Standard Expert Parallelism (EP) assumes the work is evenly spread, so it can slow down a lot or even crash with out-of-memory errors when routing is imbalanced.
- Least-Loaded Expert Parallelism (LLEP) watches the load in real time and spills extra tokens (and the matching expert weights) from busy GPUs to the least busy ones.
- LLEP keeps the math of the model exactly the same, so quality doesn't change; it only changes where the work happens.
- With heavy imbalance, LLEP makes MoE layers run up to about 5–6× faster and cuts peak memory per GPU by up to about 4–5× compared to standard EP.
- In full models like gpt-oss-20b and gpt-oss-120b, LLEP boosts end-to-end throughput by roughly 2.2× and 1.9× respectively.
- It decides to spill only when the transfer cost is worth it, using simple knobs: a capacity factor (α), a minimum useful chunk size (m), and an imbalance threshold (λ).
- LLEP works for both inference and training (with correct gradients), and it helps most when batches are large, models are wider, or there are many experts.
- Ablation studies show speedups grow with batch size, hidden size, and number of experts, and depend on tuning α and λ to the hardware.
- Because it balances time and memory across GPUs, LLEP enables serving bigger MoE models on fewer GPUs without crashes.
Why This Research Matters
Big AI models are expensive to run, and crashes or slowdowns waste time and money. By shifting overloads to the least busy GPUs at the right moment, LLEP makes models faster and more reliable without changing their answers. This lets teams serve larger MoE models on fewer GPUs, which lowers cloud costs and energy use. It also reduces out-of-memory failures, so engineers spend less time firefighting and more time improving products. For users, it means quicker responses and steadier performance, even during traffic spikes. For researchers, it enables efficient fine-tuning of specialized MoEs without breaking their learned expertise.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a big school where each student goes to a special tutor for different subjects. If most kids rush to the math tutor and almost no one goes to the art tutor, the math room gets crowded while other rooms are empty. The hallway gets jammed, and some kids can't even get in.
The Concept (Mixture-of-Experts, MoE): An MoE model is a smart system with many "experts" (tiny specialized brains) and a "router" that sends each token to a few top experts that fit it best.
- How it works:
- A router scores which experts match each token.
- Each token goes to its top-K experts.
- Those experts process the token and send results back.
- Why it matters: This keeps compute per token low while allowing a very large total model. Anchor: If the text is about math, math experts wake up more; if it's about grammar, language experts do more.
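To make the routing step concrete, here is a minimal sketch in PyTorch; the shapes, the top-4 choice, and the renormalization of kept weights are illustrative assumptions rather than the exact design of any particular MoE:

```python
import torch

def route_tokens(hidden_states: torch.Tensor, router_weight: torch.Tensor, top_k: int = 4):
    # hidden_states: [tokens, hidden]; router_weight: [hidden, num_experts]
    logits = hidden_states @ router_weight                   # score every expert for every token
    gates, expert_ids = torch.topk(logits.softmax(dim=-1), k=top_k, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)          # renormalize the kept top-K weights
    return expert_ids, gates                                 # which experts, and how much each one counts

# Tiny example: 8 tokens, hidden size 16, 32 experts, top-4 routing
expert_ids, gates = route_tokens(torch.randn(8, 16), torch.randn(16, 32))
```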
Hook: You know how a team project goes faster when everyone has a fair share, not just one person doing it all?
The Concept (Expert Parallelism, EP): EP spreads the experts across different GPUs so they can share the load at the same time.
- How it works:
- GPUs exchange tokens to reach the GPUs that host the needed experts (All-to-All communication).
- Each GPU runs its experts on the received tokens (fast matrix math called GEMMs).
- Results go back to the original GPUs (another All-to-All).
- Why it matters: Spreading experts lets models be huge without one GPU holding everything. Anchor: It's like different kitchens each cooking their assigned dishes, then swapping plates so every diner gets a full meal.
- The World Before: MoE let models grow giant by using only a few experts per token. To keep things tidy during pre-training, people encouraged balanced routing so experts saw similar amounts of work. EP became the standard way to train and serve MoE, relying on the idea that each GPU would get about the same number of tokens.
- The Problem: In the real world, even well-trained MoEs route tokens unevenly. Some experts become specialists (like "math" or "code") and attract most tokens when the task matches them. During fine-tuning or inference, this imbalance means some GPUs get flooded while others relax. Flooded GPUs run slowly or run out of memory, causing crashes.
- Failed Attempts:
- Shrink the batch size: avoids out-of-memory (OOM) but wastes throughput and makes everything slower.
- Process in small chunks with heavy checkpointing: still slow, still memory-limited.
- Replicate busy experts across GPUs (some inference-only tricks): costs lots of extra memory and still fails under extreme imbalance; not suitable for training.
- Reserve extra memory headroom: expensive in CPU/GPU memory, not always practical.
- The Gap: We need a system-level method that:
- Works with the model âas isâ (no changing its learned routing).
- Balances both speed and memory in real time.
- Decides smartly when moving work is worth it.
- Supports both inference and training correctly (including gradients).
- Real Stakes: If a few experts get all the tokens, latency spikes, users wait longer, servers can crash, and cloud bills rise. Teams may need more GPUs than necessary, or smaller orgs can't run big models at all. A method that keeps every GPU busy, but not overloaded, means faster answers, fewer crashes, lower costs, and greener computing.
Hook: Imagine a librarian who sees one checkout line exploding while others are empty. A quick call sends some books and customers to the shortest line, so everyone finishes faster.
The Concept (Routing Imbalance): Routing imbalance is when far more tokens go to a few experts than to others.
- How it works: The router picks top experts per token, and specialized experts attract more tokens for matching tasks, creating spikes.
- Why it matters: EP assumes balance; without it, some GPUs become bottlenecks or OOM. Anchor: In tests, certain expert positions (like E11) got up to about 20% load when a perfect balance would be near 3% per expert, and one GPU carried 30–35% instead of about 12.5%.
02 Core Idea
Hook: You know how, when a few store checkout lines get super long, the manager opens new lanes and guides people to the least busy cashier so everyone finishes sooner?
The Concept (Least-Loaded Expert Parallelism, LLEP): LLEP is a way to move extra tokens (and their experts' weights) off crowded GPUs to the least-loaded GPUs so all GPUs finish together without running out of memory.
- How it works:
- Measure how many tokens each expert gets (global loads) and see if routing is imbalanced past a threshold (λ).
- If it is, plan who should handle which chunk of each expertâs tokens so no GPU exceeds a target capacity (α) and each chunk is big enough to be efficient (m).
- For chunks assigned to a non-native GPU, send both the tokens and that expertâs weights there.
- Compute, send outputs back, and in training return weight gradients to the native GPU.
- Why it matters: Without this, a few GPUs become the slow and risky "long lines," causing latency spikes and OOMs. LLEP keeps time and memory balanced. Anchor: If one expert suddenly gets 95% of tokens, LLEP splits that work across many GPUs while preserving the exact same model answers.
- The "Aha!" Moment in one sentence: Don't fight the model's natural specialization; just move excess work to the least-loaded GPUs, exactly when it helps, so every GPU finishes at the same time.
- Multiple Analogies:
- Traffic: If one highway lane is jammed, open an express lane and guide overflow cars there so the trip ends sooner for everyone.
- Pizza Kitchen: If one oven is jammed with pepperoni pizzas, move a batch (with the recipe card) to another free oven so all pizzas finish together.
- Classroom: If one helper has a crowd of students, send some students (and a copy of the worksheet) to another helper with free time.
- Before vs After:
- Before (Standard EP): Assumes loads are even; if not, some GPUs become bottlenecks and memory explodes.
- After (LLEP): Detects imbalance, spills just enough work and weights to idle GPUs, and brings peak memory back under control while speeding up completion time.
- Why It Works (intuition without equations):
- Total time is set by the slowest GPU. If we siphon away its extra work, we shrink the worst time.
- Matrix math (GEMMs) runs more efficiently on bigger chunks; LLEP keeps chunks big enough (m) to stay fast.
- Moving data isn't free, so LLEP only spills when the compute-time savings beat the transfer cost, and it caps per-GPU work using α.
- When routing is already balanced, LLEP politely steps aside (controlled by λ) to avoid extra overhead.
- Building Blocks (what LLEP is made of):
- Imbalance Detector (λ): Checks if routing is skewed enough to act; otherwise fall back to standard EP.
- Capacity Guard (α): Sets a soft ceiling for how many tokens a GPU should process this step.
- Efficiency Floor (m): Ensures spilled chunks are large enough to keep GEMMs efficient.
- Least-Loaded Assignment (LLA): A greedy planner that assigns token chunks to the least-loaded GPUs first, minimizing weight transfers.
- Spill Procedure (LLAS): Repeats assignment until all extra tokens are placed.
- Exactness and Gradients: LLEP keeps outputs identical to the base model and returns spilled weight gradients to their home GPUs during training.
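As a concrete picture of these knobs, here is a small configuration sketch in Python; the default values are illustrative assumptions (only λ = 1.3 echoes the example in Step 3 below), and real deployments would tune them per model and hardware:

```python
from dataclasses import dataclass

@dataclass
class LLEPConfig:
    alpha: float = 1.2       # capacity factor: soft per-GPU ceiling = alpha * average tokens per GPU
    lam: float = 1.3         # imbalance threshold on max(expert load) / mean(expert load)
    min_chunk: int = 1024    # efficiency floor m: smallest token chunk worth spilling

    def capacity(self, total_routed_tokens: int, num_gpus: int) -> float:
        # Target number of routed tokens any single GPU should process this step.
        return self.alpha * total_routed_tokens / num_gpus
```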
03 Methodology
At a high level: Input tokens → Router picks top-K experts → Check imbalance (λ) → If balanced, do standard EP; else plan least-loaded assignments (α, m) → Dispatch tokens and, if needed, weights → Compute GEMMs on assigned GPUs → Combine outputs back → (Training) Return spilled gradients to native experts.
Step 1: Router selects experts
- What happens: For each token, the router gives scores and picks the top-K experts.
- Why this step exists: Specialization boosts quality and keeps per-token compute low.
- Example: A batch has math and general text; math tokens gravitate to math-focused experts; generic text goes to shared language experts.
Step 2: Measure global expert loads
- What happens: Count how many tokens were routed to each expert across all GPUs (global loads).
- Why it matters: This shows if a few experts will swamp their host GPUs.
- Example: Expert 11 gets 95% of tokens; others get scraps, a clear red flag.
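A sketch of this counting step with PyTorch collectives, assuming one expert-parallel process group is already initialized (names are illustrative):

```python
import torch
import torch.distributed as dist

def global_expert_loads(topk_expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    # topk_expert_ids: [local_tokens, K] expert indices chosen by the router on this rank
    loads = torch.bincount(topk_expert_ids.flatten(), minlength=num_experts)
    dist.all_reduce(loads)   # sum per-rank counts so every GPU sees the global picture
    return loads             # loads[e] = total tokens routed to expert e this step
```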
Step 3: Imbalance check (λ)
- What happens: Compute max(load)/mean(load). If below λ, routing is "balanced enough."
- Why it matters: When balance is fine, standard EP is already optimal; we skip LLEPâs extra planning.
- Example: If λ = 1.3 and the current ratio is 1.2, we do standard EP; if it's 8.0, we use LLEP.
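The check itself is a one-liner over those global loads; a sketch, with λ = 1.3 matching the example above:

```python
import torch

def is_imbalanced(loads: torch.Tensor, lam: float = 1.3) -> bool:
    loads = loads.float()
    return (loads.max() / loads.mean()).item() >= lam   # below lam: fall back to standard EP
```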
Step 4: Plan least-loaded assignments (LLA with α and m)
- What happens: Sort experts by how loaded they are and assign their token chunks so each GPU stays under a target capacity (α times average). If the native GPU canât handle all tokens, spill the extra to the least-loaded GPUs. Avoid tiny chunks smaller than m.
- Why it matters: This keeps every GPU busy but not overloaded, and avoids slow, tiny GEMMs.
- Example: 8 GPUs, 32 experts, top-4 routing. If Expert 11 has 120k tokens and average capacity is 40k, native GPU takes 40k, and the remaining 80k is split in big chunks (e.g., 20k) across the least-loaded GPUs.
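Below is a sketch of a greedy least-loaded planner along these lines, in plain Python; the data layout and helper names are illustrative assumptions, not the paper's reference implementation:

```python
from collections import defaultdict

def plan_least_loaded_assignment(expert_loads, expert_owner, num_gpus,
                                 alpha=1.2, min_chunk=1024):
    """expert_loads: {expert_id: routed token count}; expert_owner: {expert_id: native GPU rank}.
    Returns a list of (expert_id, gpu, num_tokens) chunks."""
    total = sum(expert_loads.values())
    capacity = alpha * total / num_gpus              # soft per-GPU ceiling (alpha)
    gpu_load = defaultdict(float)
    plan = []
    # Place the most loaded experts first.
    for eid, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        owner = expert_owner[eid]
        native = min(load, max(capacity - gpu_load[owner], 0))
        if native > 0:
            plan.append((eid, owner, int(native)))
            gpu_load[owner] += native
        remaining = load - native
        # Spill the excess to the least-loaded GPUs, in chunks of at least min_chunk (m).
        while remaining >= min_chunk:
            gpu = min(range(num_gpus), key=lambda g: gpu_load[g])
            chunk = min(remaining, max(capacity - gpu_load[gpu], min_chunk))
            plan.append((eid, gpu, int(chunk)))
            gpu_load[gpu] += chunk
            remaining -= chunk
        if remaining > 0:                            # leftovers smaller than m stay native
            plan.append((eid, owner, int(remaining)))
            gpu_load[owner] += remaining
    return plan
```

On the worked example above, the native GPU keeps roughly its 40k-token capacity for Expert 11 and the remaining 80k is handed, chunk by chunk, to whichever GPUs currently have the most headroom.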
Step 5: Build dispatch and weight-transfer plans
- What happens: For each assigned chunk, we decide which GPU gets which part of the tokens, and if that GPU doesn't own the expert weights, we schedule a P2P weight copy.
- Why it matters: Tokens are useless without the right weights; scheduling both avoids stalls.
- Example: If GPU 3 will process Expert 11's 20k-token chunk, we copy Expert 11's weights to GPU 3 before compute.
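Given such an assignment plan, the weight-copy schedule falls out directly; a sketch, reusing the (expert_id, gpu, num_tokens) chunk format from the planner above:

```python
def weight_transfer_plan(plan, expert_owner):
    # One P2P copy per (expert, destination) pair whose destination does not own the weights.
    copies = {(eid, expert_owner[eid], gpu)          # (expert, src GPU, dst GPU)
              for eid, gpu, _ in plan if gpu != expert_owner[eid]}
    return sorted(copies)
```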
Step 6: Dispatch (All-to-All) and weight P2P transfer
- What happens: Use All-to-All to send tokens to assigned GPUs; use P2P to fetch any required expert weights.
- Why it matters: This is how the right work reaches the right place in time.
- Example: Tokens destined for Experts 2, 5, and 11 are split and shipped to the GPUs that will compute them under the LLEP plan.
Hook: Think of a giant group chat where everyone can message everyone directly instead of passing notes through one person. The Concept (All-to-All Communication): All-to-All lets every GPU exchange data with every other GPU in one coordinated operation.
- How it works:
- Break data into per-destination chunks.
- Send all chunks at once; receive matching chunks from others.
- Reassemble on arrival.
- Why it matters: It's the backbone of EP/LLEP data movement and keeps transfers organized and fast. Anchor: Tokens from GPU 0 that need Expert 11 on GPU 2 get shipped directly there while GPU 2 ships its tokens elsewhere at the same time.
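A minimal sketch of this dispatch with torch.distributed, assuming an initialized NCCL process group and CUDA tensors; real systems also ship routing metadata alongside the hidden states and overlap these calls with compute:

```python
import torch
import torch.distributed as dist

def all_to_all_dispatch(chunks_per_rank):
    """chunks_per_rank: list (length = world size) of [n_i, hidden] tensors,
    where entry i holds the tokens this rank must send to rank i."""
    send = torch.cat(chunks_per_rank, dim=0)
    in_splits = [c.shape[0] for c in chunks_per_rank]

    # Exchange chunk sizes first so every rank knows how much it will receive.
    sizes_in = torch.tensor(in_splits, device=send.device)
    sizes_out = torch.empty_like(sizes_in)
    dist.all_to_all_single(sizes_out, sizes_in)
    out_splits = sizes_out.tolist()

    # One collective then moves every chunk to its destination rank.
    recv = send.new_empty(sum(out_splits), send.shape[1])
    dist.all_to_all_single(recv, send, output_split_sizes=out_splits, input_split_sizes=in_splits)
    return recv
```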
Step 7: Compute assigned chunks (GEMMs)
- What happens: Each GPU runs grouped matrix multiplications (GEMMs) for all its assigned native and foreign experts.
- Why it matters: This is where the heavy lifting happens; larger, well-packed chunks run much faster per token than many tiny ones.
- Example: GPU 5 computes three big GEMMs: one for its native Expert 20 and two for borrowed Experts 11 and 7.
Hook: Mixing batter in one big bowl is faster than making dozens of mini batches in tiny cups. The Concept (GEMM Operations): GEMMs (general matrix multiplications) are the fast matrix operations that power neural network layers.
- How it works:
- Stack token vectors into a matrix.
- Multiply by the expert's weight matrix.
- Apply gating weights and move on.
- Why it matters: Fewer, larger GEMMs are more efficient than many tiny ones; LLEP helps create big enough chunks. Anchor: Running one 20k-token GEMM is typically faster than twenty 1k-token GEMMs, even with the same total math.
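An illustrative microbenchmark of that effect (the 20k/1k split mirrors the anchor above; the hidden size is an arbitrary choice, and it assumes a CUDA GPU):

```python
import time
import torch

def bench(fn, iters=10):
    fn(); torch.cuda.synchronize()                 # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

hidden = 4096
x = torch.randn(20_000, hidden, device="cuda", dtype=torch.bfloat16)
w = torch.randn(hidden, hidden, device="cuda", dtype=torch.bfloat16)

one_big = bench(lambda: x @ w)                              # one 20k-token GEMM
many_small = bench(lambda: [c @ w for c in x.chunk(20)])    # twenty 1k-token GEMMs
print(f"one big: {one_big*1e3:.2f} ms, twenty small: {many_small*1e3:.2f} ms")
```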
Step 8: Combine outputs (reverse All-to-All)
- What happens: Expert outputs return to the tokensâ original GPUs and are merged.
- Why it matters: The model's next layers expect the batch in its original order.
- Example: Outputs from multiple experts are summed per token and restored to the source GPUâs batch layout.
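A sketch of the per-token merge after the reverse All-to-All; the bookkeeping of which routed row belongs to which original token is simplified, and the names are illustrative:

```python
import torch

def combine_outputs(expert_out: torch.Tensor, token_idx: torch.Tensor,
                    gate: torch.Tensor, num_tokens: int) -> torch.Tensor:
    # expert_out: [n_routed, hidden] results for each (token, expert) pair
    # token_idx:  [n_routed] original position of each routed row in the local batch
    # gate:       [n_routed] router weight for that (token, expert) pair
    combined = expert_out.new_zeros(num_tokens, expert_out.shape[1])
    combined.index_add_(0, token_idx, expert_out * gate.unsqueeze(1))   # weighted top-K sum
    return combined
```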
Step 9: Backward pass (training)
- What happens: Gradients for any spilled expert are returned to the expertâs native GPU and added to its native gradients.
- Why it matters: This keeps training exact and stable, with correct learning.
- Example: Gradients from GPU 3 for Expert 11 travel back to the home GPU, ensuring the weights update correctly.
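A sketch of this gradient hand-back for one spilled expert parameter, using point-to-point sends; real implementations typically run this inside an autograd hook and overlap it with other work, and the rank bookkeeping here is an assumption:

```python
import torch
import torch.distributed as dist

def return_spilled_grad(param: torch.nn.Parameter, owner: int, helper_ranks: list):
    """owner: rank that natively stores this expert; helper_ranks: ranks that
    computed on a borrowed copy this step and therefore hold a partial gradient."""
    rank = dist.get_rank()
    if rank == owner:
        if param.grad is None:
            param.grad = torch.zeros_like(param)
        buf = torch.empty_like(param)
        for src in helper_ranks:
            dist.recv(buf, src=src)
            param.grad.add_(buf)        # fold the borrowed copy's gradient into the native one
    elif rank in helper_ranks:
        dist.send(param.grad, dst=owner)
        param.grad = None               # the borrowed copy keeps no gradient of its own
```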
The Secret Sauce (why LLEP is clever):
- It's exact: no approximations, so model quality is unchanged.
- It's selective: acts only when imbalance hurts and only spills when it helps.
- It balances both time and memory: caps per-GPU tokens (α) and avoids tiny, slow chunks (m).
- It's practical: works with standard collectives (NCCL All-to-All) and simple P2P transfers; can be further optimized or fused.
04 Experiments & Results
- The Test: What did they measure and why?
- Layer-level speed and peak memory: To see how fast and safe the MoE layers run under heavy imbalance.
- Full-model throughput: To understand real-world serving speed when attention and other layers are included.
- Training wall-clock to accuracy: To check if LLEP shortens time to reach a target score while keeping accuracy.
- The Competition: Standard Expert Parallelism (EP) was the main baseline. Prior ideas like replicating busy experts (inference-only) are not directly comparable for training and often raise memory use.
- The Scoreboard (with context):
- Synthetic imbalance tests across popular MoE configs (gpt-oss-120b, DeepSeek-V3, Kimi-K2) on 8× H200:
- Speedup: LLEP beats EP in every imbalanced setting, up to roughly 6.1× in the most extreme case (for example, 95% of tokens to 1 expert). That's like finishing a 60-minute task in under 10 minutes.
- Peak memory: LLEP stays nearly flat across all imbalance levels, while EP's peak can balloon up to around 4–5× higher, risking OOM. That's like keeping your backpack the same size while EP's backpack overflows.
- Balanced case: LLEP matches EP thanks to the λ switch that turns LLEP off when it's not needed.
- End-to-end inference on real data (Megatron-Math) with pre-trained models:
- gpt-oss-20b: up to about 2.2× higher throughput.
- gpt-oss-120b: about 1.9× higher throughput.
- Note: Full models include attention and other fixed costs, so MoE-layer speedups are even larger than the end-to-end numbers show.
- Training (SFT) with full parameters, ZeRO-3 and CPU offloading:
- About 1.25× faster time-to-accuracy on AIME'25. Despite extra CPU/offload overheads, LLEP still cuts wall-time, which is practical for real training.
- Surprising Findings:
- Even top MoE models naturally route imbalanced tokens (for good reasons like specialization). For example, one expert position (E11) repeatedly dominated loads.
- Separate, highly tuned GEMM calls (like cuBLAS) can beat a single fused kernel for large matrices, so fewer, larger GEMMs are king; LLEP helps create those bigger chunks.
- Speedups grow with batch size, hidden size, and number of experts: the heavier the job, the more LLEP pays off.
- LLEP stays safe in balanced scenarios by auto-reverting to standard EP (λ guard), avoiding slowdowns when it's not needed.
Hook: Imagine testing a recipe by changing just one ingredient at a time to see what really makes it tasty. The Concept (Ablation Study): An ablation study changes one setting at a time to learn what matters most for performance.
- How it works:
- Fix everything else.
- Vary batch size, α, λ, hidden size, or number of experts.
- Measure speedup and memory.
- Why it matters: This shows which knobs to tune for your hardware and model. Anchor: Results showed bigger batches and wider layers make LLEP shine brighter, while α and λ need tuning to hit the sweet spot.
05 Discussion & Limitations
Limitations (what this can't do well yet):
- If routing is already balanced or batches are tiny, LLEP's planning and transfers may not help and can add a little overhead (it tries to avoid this via λ).
- Gains depend on network speed: slow interconnects (especially across nodes) reduce the benefit of spilling.
- Hyperparameter tuning (α, λ, m) matters; wrong settings can underperform or, in edge cases, still risk OOM on bad spikes.
- Weight transfers can contend with token traffic; low-level, fused kernels could squeeze more gains but require engineering effort.
Required Resources:
- Multiple GPUs with fast links (e.g., NVLink, high-bandwidth NICs) and NCCL for collectives.
- Ability to do peer-to-peer (P2P) weight copies.
- Telemetry to read per-expert token loads each step.
- Some GPU memory headroom for temporarily holding borrowed weights.
When NOT to Use:
- Single-GPU runs (no one to share work with).
- Highly balanced routing over time (λ stays low); standard EP is fine.
- Very small batches or very small models where data movement dominates compute.
- Multi-node clusters with slow inter-node links; prefer intra-node spilling only.
Open Questions:
- Auto-tuning: Can α, λ, and m be chosen automatically per layer, batch, and hardware topology?
- Prediction: Can we forecast imbalance a step ahead to pre-stage weights and overlap comms better?
- Heterogeneous clusters: How to assign spills when GPUs differ in speed or memory?
- Hybrid strategies: Combine LLEP with selective replication or caching to cut repeated weight transfers.
- System integration: Fused kernels with DeepEP or Triton-Distributed to further hide comms under compute.
- Multi-tenant serving: How does LLEP play with multiple concurrent models sharing GPUs?
06 Conclusion & Future Work
Three-sentence summary: MoE models naturally route unevenly to specialized experts, which overloads some GPUs and causes slowdowns or crashes under standard Expert Parallelism. Least-Loaded Expert Parallelism (LLEP) dynamically spills extra tokens (and the needed expert weights) from busy GPUs to the least-loaded ones, keeping the math exact while balancing time and memory. This delivers up to roughly 5–6× MoE-layer speedups, around 4–5× memory relief, and about 2× end-to-end throughput boosts in large models like gpt-oss-120b.
Main Achievement: A practical, exact, and hardware-conscious load balancer for MoE that works during both inference and training, delivering big speed and memory wins under real-world imbalance without changing model behavior.
Future Directions: Automated, topology-aware tuning of α, λ, and m; predictive scheduling to prefetch weights; fused compute-communication kernels; intra-node-first policies for multi-node clusters; and hybrids that pair LLEP with selective expert replication or caching.
Why Remember This: Instead of forcing experts to be "fair," LLEP respects specialization and fixes the system around it: moving work, not changing minds. That simple shift unlocks faster, cheaper, and more reliable MoE serving and training, especially as models and expert counts keep growing.
Practical Applications
- Speed up inference for MoE LLMs during traffic spikes by enabling LLEP to auto-balance overloaded experts.
- Reduce serving costs by keeping GPU memory flat under imbalance, allowing larger batches on fewer GPUs.
- Stabilize fine-tuning of specialized MoEs (e.g., math, code) by preventing OOM when domain experts get flooded.
- Improve multi-tenant GPU clusters by spilling work to least-loaded GPUs within a node for smoother QoS.
- Deploy on smaller GPU rigs (e.g., 8× H100/H200) to run bigger MoE models than EP would allow without crashes.
- Use α, λ, and m tuning guides from ablations to pick hardware-specific settings for best throughput.
- Combine with ZeRO and CPU offloading in training to shorten wall-clock time to target accuracy.
- Implement intra-node-first spilling policies to minimize slow inter-node transfers in multi-node clusters.
- A/B test EP vs. LLEP on real workloads to size clusters more accurately and cut over-provisioning.
- Integrate with monitoring to alert when imbalance rises and automatically switch between EP and LLEP.