Janus: Disaggregating Attention and Experts for Scalable MoE Inference

Intermediate
Zhexiang Zhang, Ye Wang, Xiangyu Wang et al. Ā· 12/15/2025
arXiv

Key Summary

  • Janus splits a Mixture-of-Experts (MoE) model into two parts—attention and experts—so each can use just the right number of GPUs.
  • It invents a two-phase way for the two parts to talk that sends fewer, bigger messages and cuts waiting time.
  • Janus puts the 'who-should-handle-this-token' gate on the expert side to keep communication simple and fast.
  • A tiny, super-fast GPU scheduler balances which expert replicas get used so no GPU gets overloaded.
  • It also copies (replicates) popular experts and spreads co-activated experts across GPUs to avoid traffic jams.
  • Janus scales attention and expert GPUs independently to meet a per-token speed promise (TPOT) while saving cost.
  • Across tests, Janus delivers up to 3.9Ɨ higher throughput per GPU than strong baselines while meeting latency SLOs.
  • Its scheduling overhead stays under 100 microseconds, so the 'planner' never becomes the bottleneck.
  • On real-world-like traffic, Janus can reduce GPU usage by about 25% compared to monolithic serving.
  • The key idea is simple: separate different kinds of work, talk smartly between them, and keep the load balanced.

Why This Research Matters

Janus makes large AI models cheaper and faster to run, which means more people can use powerful assistants without long waits. By matching the right work to the right GPUs and cutting network fuss, it keeps responses snappy even when traffic spikes. Companies can hit strict per-token speed promises while using fewer GPUs, saving serious money and energy. This helps services like chatbots, coding helpers, and search tools stay responsive during busy hours. It also makes it easier to add new users or features without rebuilding the whole system. In short, Janus helps turn cutting-edge AI into dependable, cost-effective everyday tools.

Detailed Explanation

01Background & Problem Definition

šŸž Hook: Imagine a big school where every question from students goes to one giant classroom. The teacher must handle math, history, and science all at once. It works, but it's slow and wastes energy when most students only need help in one subject.

🄬 The Concept (Mixture-of-Experts, MoE):

  • What it is: An MoE model is a team of many small specialist networks called 'experts' that only wake up when needed.
  • How it works: (1) Read the input token; (2) A tiny 'gate' picks the top-k experts for that token; (3) Only those experts run; (4) Their answers get combined for the final result.
  • Why it matters: Without MoE, every part of the big model would run on every token, wasting tons of compute and memory.

šŸž Anchor: Like sending a math question to math teachers only, not to the whole faculty.

šŸž Hook: You know how some chores use muscles (lifting) and some use memory (remembering a list)? Computers have the same split: compute vs. memory.

🄬 The Concept (Memory-bound vs. Compute-bound):

  • What it is: A job is memory-bound when speed is limited by how fast data moves, not by how fast math runs.
  • How it works: (1) If each step needs lots of data from memory, the GPU keeps waiting on memory; (2) If math is heavy and data reuse is high, you hit compute limits.
  • Why it matters: MoE layers tend to be memory-bound at real-world batch sizes, so reducing how many experts activate is more important than cranking math speed.

šŸž Anchor: It’s like trying to bake cookies fast but the oven (memory) is slow to heat—more dough (compute) won’t help.

šŸž Hook: Picture two types of classes: attention (organizing notes) and experts (deep subject help). They need different classrooms.

🄬 The Concept (Attention vs. MoE layers):

  • What it is: Attention tracks context (with KV caches) while MoE experts provide specialized processing.
  • How it works: (1) Attention reads and writes caches of past tokens; (2) MoE picks a few experts per token and runs them; (3) They alternate across layers.
  • Why it matters: They stress GPUs differently—attention can be steady while MoE latency grows with more activated experts.

šŸž Anchor: Think of attention as a librarian (steady organizing) and MoE as calling specific tutors (can pile up if many get called).

šŸž Hook: Suppose the school promises, ā€œEach answer comes in under 200 ms.ā€ That’s a Service Level Objective (SLO).

🄬 The Concept (TPOT SLO):

  • What it is: Time-Per-Output-Token (TPOT) is a per-token speed promise (e.g., each new word within 200 ms).
  • How it works: (1) Measure time to produce each next token; (2) Alert/scale if it slips; (3) Keep the experience snappy.
  • Why it matters: Without tracking TPOT, a system can look fine on averages but feel laggy to users.

šŸž Anchor: Like promising each slice of pizza arrives every few seconds instead of one big late delivery.

šŸž Hook: If you force every student, librarian, and tutor into one giant room, you can’t arrange chairs well for each.

🄬 The Concept (Monolithic deployment problem):

  • What it is: Running attention and MoE together as one fixed block (same parallelism, same GPUs).
  • How it works: (1) Load all experts across many GPUs; (2) Tie attention’s scaling to expert scaling; (3) Any change means reloading the whole block.
  • Why it matters: It wastes GPUs when attention doesn’t need as many, and it’s clumsy to scale when traffic changes.

šŸž Anchor: Like buying a bus for one student because some days there’s a crowd—you pay for empty seats most days.

šŸž Hook: What if we let the librarian team and the tutor team sit in different buildings and coordinate smartly?

🄬 The Concept (Disaggregation gap and solution):

  • What it is: Disaggregation means splitting attention and MoE into separate sub-clusters that scale independently.
  • How it works: (1) Put attention instances on one GPU pool; (2) Put MoE experts on another; (3) Communicate activations between them efficiently; (4) Scale each side based on its own load.
  • Why it matters: Without disaggregation, you over-provision and still struggle during spikes; with it, you right-size and adapt faster.

šŸž Anchor: Like giving librarians a quiet office and tutors a bigger study hall, and passing folders between them only when needed.

02Core Idea

šŸž Hook: Imagine running a city with traffic lights that adapt per neighborhood and delivery trucks that carry bigger, fewer bundles. Suddenly, rush hour feels smooth.

🄬 The Concept (Janus’s key insight):

  • What it is: Separate attention and experts, talk between them using fewer, bigger messages, and balance how many experts wake up on each GPU.
  • How it works: (1) Disaggregate attention and MoE into two GPU sub-clusters; (2) Use adaptive two-phase communication to aggregate small messages; (3) Keep the gate and a microsecond GPU scheduler on the MoE side to balance activated experts; (4) Replicate hot experts and spread co-activated experts across GPUs; (5) Independently autoscale attention and MoE to meet TPOT SLO.
  • Why it matters: Without this, memory-bound MoE layers stall, GPUs sit idle or overloaded, and per-token latency slips.

šŸž Anchor: It’s like splitting mail sorting and delivery into two hubs, bundling letters before crossing town, and sending vans so each has the same number of stops.

Multiple Analogies for the same idea:

  • Postal analogy: Sort (attention) and deliver (experts) in different centers; bundle mail before crossing town; balance delivery routes so no van gets too many stops.
  • Kitchen analogy: Prep station (attention) and specialist chefs (experts) are in separate zones; pass trays in batches; assign dishes so no chef’s station gets swamped.
  • School analogy: Librarians (attention) and tutors (experts) in different rooms; pass student folders in stacks; schedule tutors so each gets a similar number of students.

Before vs. After:

  • Before: One big block; attention and experts share the same GPU plan; lots of tiny, chatty messages; random or coarse expert use; hard scaling.
  • After: Two right-sized pools; fewer, bigger cross-pool transfers; balanced expert activations; replicas placed smartly; independent scaling to hit SLOs at lower cost.

Why it works (intuition, not equations):

  • MoE is memory-bound for normal batch sizes, so time grows with the number of experts that wake up—not with how hard they compute.
  • If you evenly spread which experts wake up, you cut the slowest GPU’s wait, shrinking overall latency.
  • Fewer, bigger messages reduce network overhead more than shaving a few bytes off each tiny message.
  • Independent scaling avoids paying attention’s bill when only experts need more capacity (or vice versa).
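
A tiny simulation of the second point, assuming a memory-bound regime where a GPU's MoE-layer time grows with the number of distinct experts it activates and the layer finishes only when the slowest GPU does (the per-expert cost is an illustrative constant):

```python
import random

def layer_time_ms(activated_per_gpu, ms_per_expert=0.4):
    """Memory-bound model: the slowest GPU (most activated experts) gates the layer."""
    return max(activated_per_gpu) * ms_per_expert

random.seed(0)
num_gpus, total_activated = 8, 40

# Unbalanced: activated experts land wherever their fixed placement happens to be.
unbalanced = [0] * num_gpus
for _ in range(total_activated):
    unbalanced[random.randrange(num_gpus)] += 1

# Balanced: a scheduler spreads the same activations evenly across GPUs.
balanced = [total_activated // num_gpus] * num_gpus

print("unbalanced:", unbalanced, "->", layer_time_ms(unbalanced), "ms")
print("balanced:  ", balanced, "->", layer_time_ms(balanced), "ms")
```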

Building Blocks (each as a sandwich):

  • šŸž Hook: Ever notice it’s faster to send one big package than 100 tiny ones? 🄬 The Concept (Adaptive two-phase communication): What it is: A two-step way to move activations between attention and MoE that aggregates traffic inside a node first, then sends larger bundles across nodes. How it works: (1) Intra-node all-gather to build a full local batch; (2) Inter-node send to a target node, which fan-outs locally via NVLink; adapt pattern based on load. Why it matters: Without it, you drown in small-message overhead and miss SLOs. šŸž Anchor: Like neighbors pooling parcels at one house before calling a single courier.
  • šŸž Hook: If everyone chooses a different plan at the same time, chaos! But if everyone runs the same tiny, fast rulebook, it just works. 🄬 The Concept (GPU-kernel, sync-free scheduler): What it is: A microsecond GPU scheduler that maps activated experts to replicas so each GPU has about the same number of experts to run. How it works: (1) Collect distinct activated experts from token top-k; (2) Greedily assign each expert to the least-loaded instance among its replicas; (3) Remap tokens to those replicas; (4) Every GPU runs the same deterministic kernel—no cross-GPU sync. Why it matters: Without it, some GPUs become stragglers. šŸž Anchor: Think traffic lights all following the same timing plan—no walkie-talkies needed.
  • šŸž Hook: Popular booths get long lines; spread them out and the fair moves faster. 🄬 The Concept (Activation-aware replication and placement): What it is: Copy hot experts more times and place frequently co-activated experts on different GPUs. How it works: (1) Count activations per expert; (2) Give extra replicas to hot ones; (3) Use a heuristic to place replicas so co-activated experts don’t pile up on the same GPU. Why it matters: Without it, a few experts become permanent bottlenecks. šŸž Anchor: Like putting two favorite food trucks on opposite sides of the park.
  • šŸž Hook: Sometimes you need more librarians, not more tutors—and sometimes the opposite. 🄬 The Concept (Independent autoscaling): What it is: Separate controllers grow/shrink attention and MoE pools to meet TPOT at the lowest GPU cost. How it works: (1) Watch queueing and latency; (2) Predict if adding/removing A or E instances will fix SLOs; (3) Choose the cheapest action that works. Why it matters: Without it, you overpay for GPUs you don’t need. šŸž Anchor: Like staffing the help desk and the classroom separately based on their own lines.

03Methodology

At a high level: Requests → Attention sub-cluster → Two-phase send to MoE → Gate + Load-balanced scheduling + MoE compute → Two-phase send back → Next layer → Output tokens

Step-by-step with purpose and examples:

  1. Attention sub-cluster receives tokens
  • What happens: Attention instances (each on one GPU) keep KV caches and process the decoding step for incoming requests.
  • Why it exists: Attention is common to all tokens and must be fast and steady; keeping it separate prevents expert-side needs from inflating its GPU count.
  • Example: 32 requests arrive; attention instances batch them and prepare activations for the next layer.
  2. Two-phase communication (Attention → MoE)
  • What happens: Within each node, attention instances first all-gather their small chunks over NVLink to form larger activation bundles. These bundles are then sent across nodes to the MoE nodes. When there are many destinations, each attention node sends one bundle to an aggregator on the target MoE node, which fans it out locally.
  • Why it exists: Avoids the O(mƗn) swarm of small messages and slashes latency dominated by message count.
  • Example: Instead of 64 tiny sends to 8 MoE GPUs, each attention node sends 2 large bundles to 2 MoE nodes.
  3. Gate on the MoE side
  • What happens: The MoE side computes the top-k expert IDs per token using the gating network.
  • Why it exists: Keeping gating here avoids shuffling token fragments and metadata on the attention side, cutting complexity and tiny-message churn.
  • Example: For a token, the gate picks experts E7 and E42 (top-2).
  4. Activation Load-Balanced Scheduling (AEBS) as a GPU kernel (a sketch of this mapping follows the step list)
  • What happens: The kernel (a) collects the set of distinct activated experts from the batch; (b) assigns each expert to the least-loaded instance among its replicas; (c) remaps tokens from logical expert IDs to physical replica IDs; (d) dispatches activations accordingly.
  • Why it exists: MoE is memory-bound; latency scales with the number of experts that wake up per GPU. Balancing this number across GPUs prevents stragglers.
  • Example: If E7 has replicas on GPU2 and GPU5, and GPU5 is lighter, this batch’s E7 requests go to GPU5; E42 goes to GPU2 to keep both even.
  5. MoE compute
  • What happens: Selected replicas run their FFN computations for their assigned tokens.
  • Why it exists: This is the core expert work; doing it on balanced loads lowers the slowest GPU’s time.
  • Example: Each MoE GPU runs roughly the same count of experts (say, 6 each), so the layer finishes together.
  6. Two-phase communication (MoE → Attention)
  • What happens: The MoE side combines intermediate results within a node (e.g., all-reduce/multicast), then sends back larger bundled results to attention nodes, which distribute to their instances via NVLink.
  • Why it exists: Mirrors the forward direction to keep cross-node transfers big and few.
  • Example: A MoE node sends 2 big replies instead of dozens of tiny ones.
  7. Next layer or output token
  • What happens: The process repeats across interleaved attention and MoE layers until a token is produced; TPOT is tracked.
  • Why it exists: Ensures each token meets the latency promise.
  • Example: The system records 150 ms TPOT—SLO met.
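
A minimal host-side sketch of the greedy mapping described in step 4, assuming the replica locations of each expert are known up front; in Janus this logic runs as a deterministic GPU kernel in microseconds, but the shape of the algorithm is the same.

```python
def schedule_experts(activated_experts, replica_map, num_gpus):
    """Assign each activated expert to the least-loaded GPU among its replicas."""
    load = [0] * num_gpus                      # activated experts per GPU so far
    assignment = {}
    for expert in sorted(activated_experts):   # deterministic order, so no cross-GPU sync
        candidates = replica_map[expert]       # GPUs holding a replica of this expert
        target = min(candidates, key=lambda g: load[g])
        assignment[expert] = target
        load[target] += 1
    return assignment, load

# Toy batch: 6 distinct activated experts, 4 MoE GPUs, some experts replicated.
replica_map = {
    "E5":  [0, 3], "E7":  [2, 3], "E12": [0, 1],
    "E20": [1],    "E31": [2],    "E42": [0, 2],
}
assignment, load = schedule_experts(replica_map.keys(), replica_map, num_gpus=4)
print("expert -> GPU:", assignment)
print("activated experts per GPU:", load)      # roughly even, so no straggler
```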

The Secret Sauce:

  • Fewer, bigger network messages beat many tiny ones. This is why the two-phase scheme wins.
  • The scheduler runs on GPUs in microseconds and is sync-free: all GPUs run the same deterministic mapping, so no cross-GPU chatter.
  • Replicate hot experts and spread co-activated pairs apart; this shrinks the per-GPU count of activated experts.
  • Independent autoscaling lets you buy just the librarians or just the tutors you need to hit TPOT at the lowest GPU cost.
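
A minimal sketch of the last point, choosing the cheapest scaling action that restores TPOT. The latency model and coefficients below are invented for illustration; the real controller works from measured queueing and latency signals.

```python
def predicted_tpot_ms(attn_gpus, moe_gpus, load_tokens_per_s):
    """Toy model: each side's share of TPOT shrinks as its own pool grows."""
    attn_ms = 30 + 0.0001 * load_tokens_per_s / attn_gpus
    moe_ms = 40 + 0.0025 * load_tokens_per_s / moe_gpus
    return attn_ms + moe_ms

def cheapest_scaling_action(attn, moe, load, slo_ms=200):
    """Try candidate actions and pick one that meets the SLO with the fewest GPUs."""
    candidates = [
        ("no change", attn, moe),
        ("add 1 attention GPU", attn + 1, moe),
        ("add 1 MoE GPU", attn, moe + 1),
        ("add 1 of each", attn + 1, moe + 1),
    ]
    feasible = [(a + m, name) for name, a, m in candidates
                if predicted_tpot_ms(a, m, load) <= slo_ms]
    return min(feasible)[1] if feasible else "scale further and re-evaluate"

# With this toy load, adding one MoE GPU fixes the SLO; adding attention does not.
print(cheapest_scaling_action(attn=4, moe=8, load=400_000))
```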

Concrete mini example:

  • Batch: 64 tokens, top-2 experts each.
  • Gate picks a set of 20 distinct experts across the batch.
  • AEBS sees E5 and E12 are hot; their replicas exist on GPUs 0 and 3. GPU 3 is lighter, so AEBS sends E5 there, and sends E12 to GPU 0 to even out.
  • Co-activated pair (E5, E12) is placed on different GPUs by the placement heuristic; fewer GPUs need to wake extra experts at once.
  • Result: Each MoE GPU activates ā‰ˆ5 experts, not 8 on one and 2 on another; layer time drops.
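
A minimal sketch of the replication-and-placement idea behind this example, assuming simple activation counts and one known co-activated pair; the real statistics collection and placement heuristic are more involved.

```python
def plan_replicas(activation_counts, total_slots):
    """Give every expert one replica, then hand the spare slots to the hottest experts."""
    replicas = {e: 1 for e in activation_counts}
    spare = total_slots - len(replicas)
    for e in sorted(activation_counts, key=activation_counts.get, reverse=True)[:spare]:
        replicas[e] += 1
    return replicas

def place_replicas(replicas, coactivation, num_gpus):
    """Greedy placement: prefer lightly loaded GPUs and penalize co-activated neighbours."""
    gpus = [set() for _ in range(num_gpus)]
    for expert in sorted(replicas, key=replicas.get, reverse=True):   # hottest first
        for _ in range(replicas[expert]):
            def cost(g):
                overlap = sum(coactivation.get(frozenset((expert, other)), 0)
                              for other in gpus[g])
                return len(gpus[g]) + 2 * overlap
            options = [g for g in range(num_gpus) if expert not in gpus[g]]
            best = min(options or range(num_gpus), key=cost)
            gpus[best].add(expert)
    return gpus

counts = {"E5": 90, "E12": 80, "E7": 30, "E42": 25, "E20": 10, "E31": 5}
coact = {frozenset(("E5", "E12")): 50}      # E5 and E12 often fire together
replicas = plan_replicas(counts, total_slots=8)
print("replica counts:", replicas)          # E5 and E12 get the two spare replicas
print("placement:", place_replicas(replicas, coact, num_gpus=4))
```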

Why each step matters (what breaks without it):

  • No disaggregation → You overprovision attention when only experts need help; scaling is clumsy.
  • No two-phase comms → Ocean of tiny packets; latency balloons.
  • Gating on attention side → Extra tensor packing and metadata churn; more tiny transfers.
  • No AEBS → Some GPUs wake many experts and become laggards.
  • No replication/placement → Hot experts and co-activated pairs create permanent hotspots.
  • No independent autoscaling → You pay for GPUs you don’t need or miss the TPOT SLO.

04Experiments & Results

The Test (what they measured and why):

  • Per-GPU throughput: How many tokens per second per GPU—measures resource efficiency.
  • TPOT (Time-Per-Output-Token): A per-token latency SLO target (e.g., ≤200 ms) that ensures snappy user experience.
  • Overheads: Scheduling time (must be microseconds), communication cost, and load balance across MoE GPUs.
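
These three measurements can be summarized with a small helper sketch (field names and sample values are illustrative, not the paper's logs):

```python
def per_gpu_throughput(tokens_generated, wall_clock_s, num_gpus):
    """Tokens per second per GPU over a measurement window."""
    return tokens_generated / wall_clock_s / num_gpus

def tpot_slo_attainment(tpot_ms_samples, slo_ms=200):
    """Fraction of output tokens produced within the TPOT target."""
    return sum(t <= slo_ms for t in tpot_ms_samples) / len(tpot_ms_samples)

def activation_imbalance(activated_experts_per_gpu):
    """Max minus min activated experts across MoE GPUs (smaller is more balanced)."""
    return max(activated_experts_per_gpu) - min(activated_experts_per_gpu)

print(per_gpu_throughput(tokens_generated=1_200_000, wall_clock_s=60, num_gpus=16))
print(tpot_slo_attainment([120, 180, 240, 150, 190]))
print(activation_imbalance([6, 5, 7, 4]))
```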

The Competition:

  • SGLang (monolithic): Attention and experts share one fixed plan; scales only in big chunks (whole-model replicas). Strong baseline.
  • DisAgg (naive disaggregation): Splits attention/experts but uses random expert scheduling and coarse, node-level scaling.

The Scoreboard (with context):

  • Throughput: Janus delivers up to 3.9Ɨ higher throughput per GPU than SGLang (and up to 2.8Ɨ over DisAgg) while still hitting the 200 ms TPOT SLO. That’s like getting an A+ when others are at B- and doing it using fewer class hours.
  • Latency: Janus meets the TPOT SLO across batch sizes; DisAgg misses it at the largest batch (512) due to comms and imbalance.
  • Ablations: Adding two-phase communication alone reduced TPOT by about 18% at large batches (512); adding load-balanced scheduling and placement shaved an extra ~7% in the low-to-medium batch range (16–64), when not all experts were already active.
  • Balance: Activation imbalance (max āˆ’ min activated experts per GPU) drops from ~8 to ~4 with Janus—half the gap—curbing stragglers.
  • Overhead: Scheduling stays under 100 microseconds across 8–16 MoE GPUs and 16–512 batch sizes—so the planner is never the slow part.
  • Real-world-like traces: Over two simulated days of dynamic traffic, Janus’s fine-grained, independent scaling tracks demand closely and saves ~25% GPUs vs monolithic SGLang while keeping service quality.

Surprising Findings:

  • Attention scaling can hurt latency at small batches (overhead outweighs benefit), while MoE often benefits more steadily—this validates why disaggregation and independent scaling are so powerful.
  • Gating on the MoE side, even though it may send a bit more data, wins because it slashes the number and fussiness of tiny messages.
  • When expert pools are so large that there’s little replica redundancy (e.g., too many distinct experts on too few GPUs), scheduling gains shrink—scaling MoE instances restores the headroom and the benefits.

05Discussion & Limitations

Limitations (be specific):

  • If your workload runs at very large batch sizes where MoE becomes compute-bound, balancing activated experts matters less.
  • When expert pools are huge and there’s no room for replicas, the scheduler has limited wiggle room; you may need to scale out MoE GPUs first.
  • Networks without strong intra-node bandwidth (e.g., no NVLink) or with very constrained interconnects may see smaller gains from two-phase communication.
  • Highly static, predictable traffic reduces the advantage of fine-grained autoscaling.
  • The placement heuristic depends on recent activation/co-activation stats; if patterns shift abruptly, it may take a reconfiguration cycle to catch up.

Required Resources:

  • Multi-GPU nodes with fast intra-node links (e.g., NVLink) and high-speed interconnects (e.g., InfiniBand with GPUDirect RDMA via UCX).
  • Ability to run GPU kernels for scheduling and to collect activation statistics periodically.
  • A control plane to scale attention and MoE instances independently.

When NOT to Use:

  • Small or single-GPU models that fit comfortably on one device; monolithic is simpler and fine.
  • Extremely compute-bound scenarios (very large batches) where communication and memory-bound behavior aren’t dominant.
  • Ultra-tight environments without room for expert replicas and with very weak networking.

Open Questions:

  • Best ways to combine this with prefill/decoding disaggregation and pipeline micro-batching for even higher goodput.
  • Smarter, possibly learning-based placement and replication that adapts faster to shifting popularity and co-activation.
  • Robustness under heterogeneous hardware pools and fault tolerance when GPUs join/leave frequently.
  • Fairness and isolation in multi-tenant settings where different apps share experts or GPUs.
  • Integrating and co-tuning with emerging GPU communication libraries to squeeze even more latency out.

06Conclusion & Future Work

Three-sentence summary: Janus separates attention and experts into their own GPU pools, talks between them using an adaptive two-phase scheme, and balances expert activations with a microsecond GPU scheduler. It also replicates hot experts and spreads co-activated ones apart, while independently autoscaling both sides to meet per-token latency SLOs at lower cost. The result is up to 3.9Ɨ higher throughput per GPU than strong baselines with SLO attainment and about 25% GPU savings on dynamic traces.

Main achievement: Showing that careful disaggregation plus communication bundling and activation-balanced scheduling can turn memory-bound MoE layers from a bottleneck into a well-balanced, scalable service.

Future directions: Combine with prefill/decoding disaggregation and micro-batching, add learning-based replication/placement, and extend to heterogeneous clusters and even faster data planes.

Why remember this: The simple idea—separate different kinds of work, send fewer bigger messages, and keep the load even—translates into real, measurable wins: faster responses, fewer GPUs, and a smoother user experience.

Practical Applications

  • Deploy cost-efficient MoE LLMs that meet strict per-token latency SLOs in production.
  • Autoscale attention and expert GPU pools independently to ride daily traffic waves without overprovisioning.
  • Use two-phase communication to reduce cross-node chatter in multi-node inference clusters.
  • Balance expert activations with a GPU-kernel scheduler to avoid straggler GPUs and stabilize latency.
  • Replicate hot experts and spread co-activated experts across GPUs to prevent recurring hotspots.
  • Retrofit existing disaggregated stacks by moving gating to the MoE side and bundling messages.
  • Simulate scaling plans to choose the cheapest action (add attention vs. add experts) that restores TPOT.
  • Apply the approach to heterogeneous clusters by mapping attention and MoE to best-fit GPU types.
  • Combine with micro-batching pipelines to overlap attention and MoE work for even higher goodput.
  • Integrate activation telemetry into monitoring to trigger periodic expert placement re-optimization.
#Mixture-of-Experts inference #disaggregated serving #activation load balancing #two-phase communication #expert replication #co-activation-aware placement #TPOT SLO #GPU kernel scheduler #memory-bound workloads #NVLink all-gather #GPUDirect RDMA UCX #per-GPU throughput #expert parallelism #autoscaling #DeepSeek-V2