•The lecture explains why simply making language models bigger (more parameters) helped for years, but also why data size and training time matter just as much. From BERT in 2018 to GPT‑2, GPT‑3, PaLM, Chinchilla, and Llama 2, the trend shows performance rises when models are scaled correctly with enough data and compute.
•We are running into limits: data is finite and training ever‑larger dense models is very expensive. The big question is how to get more capability without paying a lot more compute for every token processed.
•Sparse Activation addresses this by turning on only a small part of the network for each input. Think of a huge school where only the teachers needed for a student’s question are called into the room instead of the whole staff.
•There are two main paths to sparse activation: Conditional Computation and Pruning. Conditional Computation routes each input through a chosen sub‑network; Pruning removes unimportant connections or neurons to shrink and speed up the model.
•Mixture of Experts (MoE) is a popular form of Conditional Computation. A small "gate" network picks one or a few "experts" (mini‑networks) to process each token, and the expert outputs are combined to produce the final result.
•Mathematically, MoE outputs a weighted sum of expert outputs. The gate provides probabilities G_i(x) for each expert i on input x, and each expert E_i(x) returns its own vector; the model sums G_i(x) * E_i(x) across experts.
•Training MoE is tricky because the gate can collapse to always choosing the same expert, and experts see less data since they get only certain tokens. Load balancing losses and careful routing are used to keep traffic spread out.
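The weighted sum described above can be illustrated with a toy example. The two linear "experts" and the gate weights below are made up for the illustration; real experts would be MLPs inside a Transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_output(x, experts, gate_probs):
    """y(x) = sum_i G_i(x) * E_i(x): weighted sum of expert outputs."""
    return sum(p * expert(x) for p, expert in zip(gate_probs, experts))

# Two toy "experts": simple linear maps standing in for expert MLPs.
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
experts = [lambda v: W1 @ v, lambda v: W2 @ v]

x = rng.normal(size=4)
gate = np.array([0.9, 0.1])      # gate strongly prefers expert 1
y = moe_output(x, experts, gate)
print(y.shape)                   # (4,)
```

Note that the output lives in the same space as each expert's output, so experts can be mixed freely.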
Why This Lecture Matters
This lecture matters for anyone designing or deploying language models under real constraints—ML engineers, researchers, data scientists, and engineering managers. Simply making dense models bigger is hitting walls: we’re running out of high-quality data, training costs are soaring, and latency budgets are tight. Sparse Activation, especially Mixture of Experts, offers a path to greater capability without linear compute growth by activating only the specialists needed per input.
In practice, this knowledge helps you build models that meet accuracy targets while fitting within fixed GPU hours and inference SLAs. You can increase capacity by adding experts and balance traffic through load balancing losses, instead of increasing per-token compute. You can also apply pruning to existing models to ship faster, cheaper systems with minimal accuracy loss.
For projects, this means more flexible architectures: swap a feed-forward layer for an MoE layer, start with top‑1 routing, monitor expert usage, and incrementally scale experts. For your career, understanding MoE puts you at the forefront of efficient LLM design, a key theme as the industry optimizes cost-performance trade-offs. As organizations look to deploy capable models responsibly, the ability to use compute wisely—guided by scaling laws and sparse architectures—has become a core competitive skill.
Lecture Summary
01 Overview
This lecture teaches why the field moved from simply making dense language models bigger to using smarter architectures like Mixture of Experts (MoE) that activate only parts of the network for each input. It starts with a timeline: BERT (2018) showed that bigger wasn’t always better, GPT‑2 (2019) hinted at emergent abilities from scaling, GPT‑3 (2020) massively scaled parameters and revealed strong zero‑shot and few‑shot performance, Chinchilla (2022) argued that data and model size must be scaled together, PaLM (2022) demonstrated chain‑of‑thought reasoning at large scale, and Llama 2 (2023) showed that smaller models trained longer on more tokens can compete. These steps build a picture: capability rises when model size, data size, and training budget (compute) are all balanced.
With that context, the lecture introduces Sparse Activation—an approach that grows model capacity without paying the full compute cost for every token. Instead of turning on all neurons for each input (dense), we activate only the parts that matter (sparse). Two main families are described. Conditional Computation chooses a sub‑network at run time using a routing function; Pruning permanently removes less important weights or neurons to make models smaller and faster. Mixture of Experts (MoE) is a specific Conditional Computation design where a “gate” picks one or more “experts” (specialized sub‑models) to process each input.
In MoE, each token passes through a small router that outputs a probability distribution over experts. The system then sends the token’s representation to a few selected experts, collects their outputs, and combines them (often with the same probabilities). This lets the system host many experts (high total parameter count) but compute only with a few on each token (low per‑token compute). The lecture states the core formula: the output is the sum over experts of the gate’s probability times the expert’s output. It also highlights practical training challenges: gates can be hard to train, experts see fewer examples and can overfit, and the system can become unbalanced if one expert gets too many tokens.
To stabilize training, the lecture introduces several techniques. Gate training commonly uses softmax (a smooth distribution over experts) or sparsemax (which yields zeros and encourages using only a few experts). Load balancing losses push the gate to distribute tokens more evenly, avoiding a single “hot” expert. Dropout prevents experts from overfitting to their subset of tokens. Together, these ideas keep experts specialized yet broadly useful.
The target audience is students with basic knowledge of neural networks and Transformers: you should know what tokens are, how an encoder or decoder works, and why pre‑training and fine‑tuning matter. This lecture is friendly to learners who understand high‑level concepts but want a structured view of scaling history and the motivation for sparse architectures. You do not need advanced math to follow; the equations and algorithms are explained intuitively.
After this lecture, you will be able to explain why naive scaling hit limits and how compute, data, and training time must be balanced; describe the difference between dense and sparse activation; define Conditional Computation and Pruning; explain Mixture of Experts and its gate‑and‑experts structure; and outline how to integrate an MoE layer into a Transformer. You will be able to articulate common training issues (gate collapse, imbalance) and name standard fixes (load balancing, dropout, sparsemax). You will also understand practical concerns like how to choose the number of experts and how MoE applies to Transformers (e.g., Switch Transformer).
The lecture flows in three parts. First, it builds motivation by reviewing the history of scaling language models and deriving the “compute‑centric” lesson. Second, it introduces Sparse Activation, contrasting Conditional Computation and Pruning, with short examples and known results (e.g., movement pruning). Third, it dives into Mixture of Experts: the core idea, the governing equation, how gates and experts work, and typical training and deployment challenges. It closes with a short Q&A about the number of experts and using MoE inside Transformers, emphasizing the key benefit—more capacity for roughly the same per‑token compute—and the key caution—training and engineering are harder.
Key Takeaways
✓Balance model size, data, and training steps. Before adding parameters, check if your model is undertrained on data; training longer on more tokens may outperform a bigger model at the same compute. Use scaling law intuition to pick a good balance. This keeps costs predictable and results strong.
✓Start MoE small and simple. Replace one FFN with an MoE layer using top‑1 gating and a modest number of experts (e.g., 4–8). Verify stability by monitoring expert loads and gate entropy. Scale to more experts only after routing is healthy.
✓Use load balancing from day one. Add an auxiliary loss that penalizes uneven expert usage to avoid gate collapse. Tune its weight so it influences routing but doesn’t dominate the main loss. Track the loss over time to catch regressions.
✓Control capacity to prevent drops. Set capacity_factor so each expert can handle its expected tokens with a margin. Large drop rates hurt quality; aim for near-zero drops in steady state. Adjust batch size and top‑k along with capacity.
✓Prefer top‑1 gating initially. It minimizes per-token compute and simplifies debugging. Once stable, try top‑2 for potential quality gains, measuring latency and throughput. Keep an eye on communication overhead as k increases.
✓Instrument routing thoroughly. Log per-expert token counts, gate entropy, and drop rates per step. Visualize token topics per expert to confirm specialization. Use alerts for imbalance thresholds.
✓Regularize experts. Apply dropout and consider modest weight decay inside expert MLPs. This reduces overfitting to narrow token subsets. It improves generalization and stability.
Glossary
Language model
A system that predicts the next word (token) in a sequence given the previous words. It learns patterns from large text datasets to estimate these probabilities. Bigger models can learn more patterns, but they also need more data and compute. Language models can answer questions and write text by choosing likely next words repeatedly.
Transformer
A neural network architecture that uses attention to process sequences. Instead of reading words one by one, it looks at all words in a sentence at once and decides which are most relevant. It stacks layers of attention and feed-forward networks. Transformers power most modern language models because they scale well.
Dense model
A model where all parts are active for every input. Every neuron and layer participates in processing, which makes compute per input predictable but often large. Dense models are easy to implement and train. However, they may waste compute on irrelevant parts.
Sparse Activation
Only a subset of the network turns on for each input. This saves compute because you don’t run everything all the time. It allows the model to store many more parameters than it uses per token. The trick is deciding which parts to activate for each input.
•The gate is often trained with a softmax (smooth probabilities) or sparsemax (encourages zeros) over experts. Sparsemax makes it easier to pick just a few experts, which saves compute.
•Dropout (randomly dropping units during training) helps experts avoid overfitting to their tokens. Together with load balancing, it stabilizes training so experts specialize instead of competing for all inputs.
•The main benefits of MoE are higher capacity without higher per‑token compute, better accuracy when routing is healthy, and often stronger generalization. The costs are harder training, more debugging, and complex deployment.
•MoE works inside Transformers by replacing a feed‑forward layer with an MoE layer. Each token is routed to a small set of expert MLPs, outputs are combined, and the rest of the Transformer (attention, layer norms) stays the same.
•Choosing the number of experts depends on task complexity and compute budget. Practitioners try several counts and select what balances specialization, speed, and stability.
•Switch Transformer is a simple MoE variant that typically routes each token to its single best expert (top‑1 gating). It popularized large‑scale MoE for language modeling while keeping compute per token similar to a dense layer.
•Pruning is another sparse strategy: remove low‑value weights or entire neurons after or during training. Movement pruning showed that you can prune BERT heavily (up to 90%) while keeping accuracy on many tasks.
•The big picture: capability comes from compute managed well—right model size, enough data, enough training, and smarter architectures. MoE is one way to spend compute more wisely by using specialists only when needed.
02 Key Concepts
01
Capability ceiling and function approximator view: A language model can be seen as a function that maps a token sequence to a probability distribution over the next token. Making the function more expressive generally means giving it more parameters (more degrees of freedom). This is why early progress came from scaling up model size. But a function’s performance also depends on how much data it sees and how well it is trained. You can think of the model like a clay sculpture: more clay (parameters) helps, but you also need enough practice (data) and time to shape it well (training).
02
BERT (2018): BERT is a Transformer encoder with about 345 million parameters in its largest version. At the time, this felt very large, yet simply scaling BERT did not produce consistent downstream gains. This suggested that capacity alone was not the whole story; pre‑training objectives and data also matter. Researchers hypothesized that data wasn’t big enough or objectives weren’t aligned with downstream tasks. It planted the seed that we must think beyond just parameter count.
03
GPT‑2 (2019): GPT‑2 scaled to 1.5 billion parameters and showed surprising behaviors. The model learned tasks without explicit supervision when prompted in the right way. This demonstrated emergent abilities: the model did useful things even without fine‑tuning. The finding raised interest in zero‑shot and few‑shot learning. It indicated scaling can unlock qualitatively new capabilities.
04
GPT‑3 (2020): GPT‑3 jumped to 175 billion parameters, more than 100× GPT‑2. The paper emphasized that simply scaling the architecture led to stronger zero‑shot and few‑shot performance. This validated the idea that large language models can generalize widely with the right prompts. The lesson was: scaling can keep paying off. But the compute cost was enormous, which raised sustainability questions.
05
Chinchilla (2022): Chinchilla argued that GPT‑3 was undertrained on data for its size. It showed that for a fixed compute budget, using a smaller model with much more data yields better performance. Specifically, a ~70B parameter model trained on 1.4T tokens outperformed a larger model trained on far fewer tokens. The message: scale model size and data together to use compute efficiently. This reframed scaling as a balancing act, not a one‑way race to giant models.
06
PaLM and chain‑of‑thought (2022): PaLM scaled to 540B parameters and exhibited strong chain‑of‑thought reasoning when prompted with worked examples. The model could solve multi‑step problems like a tennis balls arithmetic story by reasoning step by step. This suggested that some forms of reasoning can emerge from scale and data without explicit symbolic modules. It nudged the field to think about prompting methods alongside scaling. It again showed that capacity plus data can unlock new behaviors.
07
Llama 2 (2023): Llama 2 models range from 7B to 70B parameters, smaller than GPT‑3 or PaLM, but trained on around 2T tokens. Despite fewer parameters, they were competitive due to longer, broader training. This reinforced that training duration and data scale can compensate for smaller model size. Practical takeaway: you can get excellent results from smaller models if you invest in data and training time.
08
Compute‑centric view: The big pattern across years is that compute—how much training work you do and how you allocate it—drives capability. Model size, data size, and training steps are three knobs tied to the same budget. The best results come from balancing them rather than maximizing only one. As we hit data limits and rising costs, smarter use of compute becomes crucial. This motivates architectures that spend compute where it matters most.
09
Sparse Activation concept: Instead of activating the entire network for every input, turn on only the parts that are useful. Picture a huge building of rooms (neurons); for each visitor (input), you unlock only the rooms they need. This lets you store many specialized rooms but pay to use only a few at a time. Capacity rises because you can host many rooms, but per‑visit cost stays similar. This contrasts with dense models that power up the whole building every time.
10
Conditional Computation: This approach dynamically selects a sub‑network based on each input. A routing function first looks at the input and then chooses which layers or modules to run. In one case the route might be layers 1 and 3; in another case, layers 2 and 4. You get a big network in total, but each input walks only a short path. This is compute efficient and can increase effective capacity if routing is good.
11
Pruning: Pruning removes weights or neurons that matter less, either after training or while training. Weight pruning zeroes small weights; neuron pruning deletes entire units. The benefits include a smaller model and faster inference with little or no accuracy loss. Movement pruning demonstrated up to 90% pruning of BERT with minimal drops on many tasks. Pruning is like trimming a tree: cut weak branches so the tree is lighter but still healthy.
12
Mixture of Experts (MoE) basics: MoE has many experts (sub‑models) and a gate that assigns inputs to experts. Each expert is a specialist; the gate outputs a probability distribution over experts for each input. The final output is a weighted sum of the experts’ outputs. This can greatly increase total parameters while keeping per‑input compute roughly the same. The key is smart routing so each token meets the right specialists.
13
MoE mathematics: The output y(x) = Σ_i G_i(x)·E_i(x), where G_i(x) are gate probabilities and E_i(x) are expert outputs. G is usually produced by a small neural network followed by softmax or sparsemax. Experts are often MLPs similar to Transformer feed‑forward blocks. Routing may pick top‑1 or top‑k experts to limit compute. The weighted sum combines expertise into a single representation.
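A minimal NumPy sketch of top‑k selection over a softmax gate, as described above (sizes, seed, and variable names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_experts, top_k = 8, 2
logits = rng.normal(size=(5, num_experts))   # router logits for 5 tokens
probs = softmax(logits)                      # G(x): distribution over experts

# Keep only the top-k experts per token and renormalize their weights,
# so per-token compute is limited to k expert evaluations.
topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]
topk_vals = np.take_along_axis(probs, topk_idx, axis=-1)
topk_vals = topk_vals / topk_vals.sum(axis=-1, keepdims=True)
```

With top‑1 the weight degenerates to 1.0 for the selected expert, which is the Switch Transformer setting mentioned later.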
14
Training the gate: The gate must learn to recognize which expert is best for which input. Softmax gives smooth probabilities to all experts; sparsemax makes many probabilities exactly zero to force sparsity. Without help, gates can collapse, sending most tokens to a single expert. Load balancing terms in the loss encourage even traffic. Good gates spread tokens enough to keep all experts trained and useful.
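Sparsemax, mentioned above, can be sketched as a Euclidean projection onto the probability simplex via the standard sorting algorithm; this NumPy version is illustrative, not the lecture's code:

```python
import numpy as np

def sparsemax(z):
    """Project logits z onto the probability simplex; yields exact zeros."""
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv        # which coordinates stay nonzero
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max      # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.5, 0.1, -1.0]))
# p sums to 1 and the two low-scoring experts get probability exactly zero
```

Unlike softmax, which gives every expert some mass, sparsemax zeroes out weak experts directly, so routing is sparse without a separate top‑k step.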
15
Training the experts: Each expert sees only a subset of tokens, which can cause overfitting or undertraining. Dropout helps by adding noise so experts learn more general patterns. Balanced routing gives every expert enough examples to learn. Experts often mirror standard Transformer MLPs to stay simple and efficient. Healthy experts specialize without becoming narrow memorization machines.
16
Benefits of MoE: You get higher capacity for similar per‑token compute, improved accuracy when routing is effective, and often better generalization. The architecture can host many specialized skills without slowing each token much. It also makes it easier to expand capacity by adding more experts. This is appealing when data is scarce or compute is limited. It’s like hiring more part‑time specialists rather than overworking a few full‑timers.
17
Challenges of MoE: Training can be unstable, with gate collapse or expert imbalance. Debugging is harder because behavior depends on routing decisions that change across batches. Deployment is more complex due to dynamic routing and potential communication overhead. Monitoring expert usage and latency is necessary in production. These costs require careful engineering and experimentation.
18
Using MoE with Transformers: Replace some feed‑forward layers with MoE layers. Keep attention and normalization the same so the model’s backbone remains familiar. Switch Transformer is a known example with simple top‑1 routing. This approach scales to many experts while controlling per‑token work. The result is a large‑capacity Transformer that’s still efficient per token.
19
Choosing the number of experts: There’s no single right answer; it depends on task complexity and budget. If the task has many distinct subskills, more experts may help. If the task is simple, a few experts (or a dense model) may suffice. Practitioners try several options and pick what balances performance and compute. Thinking of experts as specialists can guide initial guesses (e.g., one per expected specialization).
20
Overall message: Scaling isn’t only about parameter count; it’s about compute used wisely. Sparse Activation, and especially MoE, lets us increase stored knowledge and skills without paying for everything on every token. The key ingredients are good routing, balanced training, and stable optimization. This direction addresses data and cost limits while keeping models capable. It’s a practical path beyond “just make it bigger.”
03 Technical Details
Overall Architecture/Structure
Language model as function approximator (dense baseline)
Definition: A language model takes a sequence of tokens and predicts the next token’s probability distribution. In dense Transformers, every token passes through the same full stack of layers. All neurons in each layer are potentially active for every token. This yields predictable compute per token but forces you to pay for the whole network regardless of input complexity.
Data flow in dense Transformer: tokens → embedding → repeated blocks of [self‑attention → feed‑forward network (FFN) → residuals, layer norms] → logits → softmax over vocabulary.
Sparse Activation idea
Instead of activating all neurons, only a chosen subset is used per input. Imagine a city’s power grid: you don’t light every building all the time; you light only those currently in use. Architecturally, this means adding a routing decision that decides which parts to run. The model can hold many parameters (many potential routes) but spends compute on only a few routes per token.
Conditional Computation vs. Pruning
Conditional Computation: Adds a routing function that picks which sub‑network (layers or experts) to apply for a given input. The big network exists, but each input only traverses a short path. Pros: keeps capacity high, can specialize sub‑networks; cons: routing adds complexity and training challenges.
Pruning: Deletes unimportant weights/neurons permanently. Pros: simple inference, smaller memory footprint; cons: reduces maximum capacity since pruned parts are gone. Movement pruning is an example showing extreme pruning of BERT with small accuracy loss.
Mixture of Experts (MoE)
Core components: a gate (router) and a pool of experts (sub‑models). For each token representation x, the gate outputs a distribution G(x) over experts. Selected experts E_i process x and return outputs; the model combines them to yield y(x).
Formula: y(x) = Σ_i G_i(x)·E_i(x). This is a weighted sum; often only a few G_i(x) are non‑zero due to top‑k routing or sparsemax. The gate can be a small linear projection from token hidden size to num_experts, followed by a normalization (softmax/sparsemax).
Experts: commonly MLPs with the same input/output size as the replaced FFN. For example, in a Transformer with hidden size H, each expert is a two‑layer MLP (H → 4H → H) with activation (e.g., ReLU or GELU). Experts share the same interface so any token can be sent to any expert.
Data flow in an MoE Transformer layer
Input token vectors (shape: batch Ă— seq_len Ă— hidden) arrive at the MoE layer.
Router computes scores for each token: scores = x·W_g + b_g (W_g: hidden × num_experts). Apply softmax or sparsemax to get probabilities over experts.
Select top‑k experts per token (k often 1 or 2). Optionally compute a capacity for each expert (maximum number of tokens it can process this step) to avoid overload.
Dispatch: gather tokens for each selected expert (this is a permutation/pack operation). Each expert runs its MLP on its assigned token batch.
Combine: scatter expert outputs back to original token positions and weight-sum by gate probabilities. Apply residual connection and continue through the Transformer.
Compute vs. capacity
Total parameters can be very large because you can add many experts. But per‑token compute stays close to that of a single expert MLP if k is small (e.g., top‑1). Thus, you get large capacity (many specialists) but keep per‑token cost similar to a dense model with one FFN. This is the main appeal of MoE for large language models.
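A quick back‑of‑envelope count makes the capacity/compute split concrete (the sizes below are illustrative, not from the lecture):

```python
hidden, ffn_mult, num_experts, top_k = 1024, 4, 16, 1

# One expert MLP (H -> 4H -> H), ignoring biases for simplicity.
params_per_expert = hidden * (ffn_mult * hidden) + (ffn_mult * hidden) * hidden

total_expert_params = num_experts * params_per_expert    # stored capacity
active_params_per_token = top_k * params_per_expert      # compute actually used

# With 16 experts and top-1 routing: 16x the parameters of a dense FFN,
# but each token still touches only one expert's worth of weights.
print(total_expert_params // active_params_per_token)    # 16
```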
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative definitions of the Router and ExpertMLP used by MoELayer,
# matching the descriptions above (linear gate + softmax; H -> 4H -> H MLP).
class Router(nn.Module):
    def __init__(self, hidden, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden, num_experts)
    def forward(self, x):
        logits = self.proj(x)
        return F.softmax(logits, dim=-1), logits

class ExpertMLP(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, hidden, num_experts, top_k=1, capacity_factor=1.25):
        super().__init__()
        self.router = Router(hidden, num_experts)
        self.experts = nn.ModuleList([ExpertMLP(hidden) for _ in range(num_experts)])
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.num_experts = num_experts

    def forward(self, x):
        # x: [batch, seq, hidden] -> flatten tokens
        B, S, H = x.shape
        tokens = B * S
        x_flat = x.reshape(tokens, H)
        probs, logits = self.router(x_flat)                    # [tokens, E]
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)   # routing picks
        # capacity per expert (integer): how many tokens each expert can process this step
        capacity = int(self.capacity_factor * math.ceil(tokens * self.top_k / self.num_experts))
        # create lists of (token, rank) assignments per expert
        expert_assignments = [[] for _ in range(self.num_experts)]
        for t in range(tokens):
            for r in range(self.top_k):
                e = topk_idx[t, r].item()
                if len(expert_assignments[e]) < capacity:
                    expert_assignments[e].append((t, r))
                # else: token is dropped, or sent to a backup expert in practice
        # dispatch: build per-expert batches
        expert_inputs = [[] for _ in range(self.num_experts)]
        for e in range(self.num_experts):
            for (t, r) in expert_assignments[e]:
                expert_inputs[e].append(x_flat[t])
        expert_outputs = [None] * self.num_experts
        for e in range(self.num_experts):
            if expert_inputs[e]:
                x_e = torch.stack(expert_inputs[e], dim=0)
                expert_outputs[e] = self.experts[e](x_e)       # [tokens_e, H]
        # combine: start with zeros, then add weighted expert outputs
        y_flat = torch.zeros_like(x_flat)
        for e in range(self.num_experts):
            if expert_outputs[e] is None:
                continue
            for j, (t, r) in enumerate(expert_assignments[e]):
                y_flat[t] += topk_vals[t, r] * expert_outputs[e][j]
        y = y_flat.reshape(B, S, H)
        # auxiliary losses could be returned (e.g., load balancing) using probs/logits
        return y
Explanation:
Router forward pass computes per‑token probabilities over experts.
topk selects the best experts per token; capacity_factor controls how many tokens each expert can accept this step (prevents overload).
Dispatch gathers token vectors for each expert; combine scatters results back and weights them.
In practice, frameworks (DeepSpeed MoE, FairScale) implement efficient dispatch/combine with specialized kernels to avoid Python loops.
Auxiliary losses are computed from probs/logits to encourage balanced usage.
Important parameters and meanings
num_experts: how many experts are available. More experts increase capacity but raise communication/memory overhead.
top_k: how many experts a token uses. top‑1 is cheapest; top‑2 sometimes improves quality by allowing mixture.
capacity_factor: expert capacity per step relative to expected average load. Too small causes token drops; too large wastes compute.
gating temperature (if used): sharpens or smooths gate distributions.
load balancing weights: strength of the auxiliary loss that pushes toward even routing.
Training flow
Forward: compute routing, dispatch tokens, run experts, combine outputs, compute task loss (e.g., next‑token cross‑entropy), plus auxiliary losses (e.g., load balancing).
Backward: gradients flow through combine into expert weights and back through the router into gating parameters.
Optimization: standard optimizers (AdamW), learning rate schedule (warmup + decay), with careful tuning to avoid router collapse.
Tools/Libraries Used
PyTorch or JAX/Flax: general deep learning frameworks.
DeepSpeed MoE or FairScale: provide optimized MoE layers, routing, and distributed strategies out of the box.
CUDA kernels and communication libraries (NCCL): crucial for fast expert dispatch/combine across GPUs.
Tokenizers and data pipelines: same as dense models, but keep an eye on sequence length and batch size to maintain routing stability.
Why these tools?
MoE involves complex gather/scatter operations and possibly expert parallelism across devices. Libraries provide efficient primitives and load‑balancing utilities. They also include metrics for monitoring expert usage during training.
Step‑by‑Step Implementation Guide
Define your baseline Transformer.
Start from a known good dense model to ensure stability. Keep attention blocks unchanged for fair comparison.
Choose where to insert MoE.
Replace the FFN in some or all Transformer layers with an MoE layer. Begin with a single layer to debug routing before scaling up.
Design the experts.
Use the same MLP shape as your dense FFN (e.g., H→4H→H). Consider smaller experts if you add many of them to control parameter growth.
Implement the router.
A simple linear projection from H to num_experts followed by softmax or sparsemax is a good start. Add top‑k selection (k=1 or 2). Consider a temperature parameter to adjust sharpness.
Add capacity management.
Compute per‑expert capacity as capacity_factor × (tokens × top_k / num_experts). Decide what to do when an expert is full: drop the token, route to backup, or pad for the next pass. Minimizing drops preserves quality.
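The capacity computation from this step, checked with illustrative numbers:

```python
import math

tokens, top_k, num_experts = 4096, 1, 8
capacity_factor = 1.25

expected_load = tokens * top_k / num_experts     # 512 tokens per expert on average
capacity = int(capacity_factor * math.ceil(expected_load))  # headroom for imbalance
print(capacity)  # 640
```

The 25% margin absorbs routing imbalance; any expert receiving more than `capacity` tokens in a step must drop or reroute the excess.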
Add auxiliary load balancing loss.
A common design penalizes high variance between actual expert loads and uniform load. Another penalizes peaked gate probabilities averaged over tokens. Tune the weight so neither dominates the main task loss.
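One common design, similar in spirit to the Switch Transformer's auxiliary loss, multiplies the fraction of tokens each expert receives by its mean gate probability; the sketch below uses NumPy and illustrative names:

```python
import numpy as np

def load_balancing_loss(probs, assignments, num_experts):
    """num_experts * sum_i f_i * p_i, minimized when routing is uniform.
    probs: [tokens, E] gate probabilities; assignments: [tokens] chosen expert."""
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    p = probs.mean(axis=0)
    return num_experts * np.sum(f * p)

# Perfectly uniform routing attains the minimum value of 1.0.
uniform = np.full((8, 4), 0.25)
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(uniform, assignments, 4))  # 1.0
```

Collapsed routing (all tokens to one expert with peaked probabilities) drives the value well above 1, so adding this term with a small weight nudges the gate toward even traffic.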
Training setup.
Use AdamW with learning rate warmup. Enable dropout inside experts. Monitor gradients on router and experts to detect collapse or stagnation.
Instrumentation.
Log per‑expert token counts, entropy of gate distributions, fraction of dropped tokens, and router loss. Alert when an expert hogs traffic or many tokens are dropped.
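These metrics can be computed directly from the gate probabilities and expert assignments each step; `routing_metrics` below is a hypothetical helper, not part of any library:

```python
import numpy as np

def routing_metrics(probs, assignments, capacity, num_experts):
    """Per-step routing health metrics from gate probs and top-1 assignments."""
    counts = np.bincount(assignments, minlength=num_experts)
    # Mean gate entropy: near zero means the gate has collapsed onto one expert.
    entropy = -np.sum(probs * np.log(probs + 1e-9), axis=-1).mean()
    # Fraction of tokens exceeding each expert's capacity (would be dropped).
    dropped = np.maximum(counts - capacity, 0).sum() / len(assignments)
    return counts, entropy, dropped

probs = np.full((8, 4), 0.25)                 # maximally uncertain gate
assignments = np.array([0, 0, 0, 0, 0, 1, 2, 3])
counts, entropy, dropped = routing_metrics(probs, assignments,
                                           capacity=3, num_experts=4)
# counts = [5, 1, 1, 1]; two tokens over expert 0's capacity -> drop rate 0.25
```

Alerting on low entropy or a rising drop rate catches collapse and overload before they show up in the task loss.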
Scale up gradually.
Start with few experts (e.g., 4), top‑1 gating, and a single MoE layer. If stable, increase experts and add more MoE layers. Keep batch size and sequence length steady while tuning router loss.
Evaluation and ablations.
Compare to dense baseline at equal per‑token FLOPs. Run with/without load balancing, with top‑1 vs top‑2, and different capacity_factors. Evaluate accuracy and latency to understand trade‑offs.
Tips and Warnings
Router collapse: If one expert gets most tokens, increase load balancing loss weight, add dropout, or increase capacity_factor to reduce drops. Monitor gate entropy; extremely low entropy is a red flag.
Expert undertraining: If some experts rarely get tokens, consider reducing num_experts or raising the balancing strength. You can also seed the router or pretrain briefly with higher temperature to spread traffic.
Communication overhead: With many experts across GPUs, dispatch/combine can bottleneck. Co‑locate subsets of experts per device, increase batch size to amortize costs, and use libraries with fused ops.
Stability: Warm up the router slowly; a sudden sharp gate (low temperature) can destabilize early training. Consider gradient clipping and careful learning rate schedules.
Capacity vs. drops: Too small capacity causes many tokens to be dropped or rerouted, hurting quality. Aim for a capacity_factor that yields minimal drops (<1–2%).
Debugging: Visualize which tokens go to which experts (e.g., by topic or length). If experts don’t specialize, check if the task encourages specialization; otherwise, fewer experts may be better.
Deployment: Dynamic routing may increase latency variance. For production, pin experts, pre‑allocate buffers, and consider batching across requests with similar sequence lengths. Measure tail latency, not only averages.
Security & reliability: Ensure routing decisions are deterministic when required (e.g., for caching). Guard against OOM by enforcing capacity and backpressure.
Putting It All Together
MoE offers a path to more capacity without proportional per‑token compute. The architecture’s success hinges on a simple yet effective router, balanced expert loads, and robust engineering around dispatch/combine. With careful tuning, you can approach or exceed dense model accuracy at similar per‑token cost, while enjoying the ability to scale capacity by adding experts.
Connections to the Scaling Timeline
Early scaling showed benefits but faced compute and data limits. Sparse Activation, especially MoE, is a response: move from “activate everything always” to “activate just what you need.” This keeps the compute budget reasonable while letting the model store far more specialized knowledge. In practice, it complements lessons from Chinchilla (balance compute across model and data) by adding a smarter architecture that spends that compute where it counts.
04 Examples
💡
Chain-of-thought arithmetic: Prompt: “Roger has 5 tennis balls. He buys 2 more cans, each with 3 balls. How many now?” A large model like PaLM can list steps: 5 + (2 × 3) = 11, then answer 11. Input is the word problem; processing uses learned reasoning steps; output is the step-by-step text and final number. The key point is that large scale can unlock reasoning behaviors without explicit symbolic modules.
💡
Dense vs sparse activation: For a given sentence, a dense Transformer runs every layer and neuron, like turning on all lights in a building. A sparse model activates only a subset of layers/neurons based on the input, like lighting only rooms being used. The input is the same; processing differs by how much of the network is used; the output is the next-token probabilities. The point is saving compute while keeping or improving quality.
💡
Conditional computation layer routing: Input x goes through a routing function that chooses layers 1 and 3, skipping 2 and 4. Another input x' might be routed to layers 2 and 4. The output is computed only through the selected layers. This example shows dynamic sub-network selection per input, improving efficiency.
💡
Pruning weight example: After training a BERT model, we compute the importance of weights and set the smallest ones to zero. The input and tasks remain unchanged; the processing uses a sparser matrix, and output accuracy stays similar. Movement pruning demonstrated pruning up to 90% of weights with minimal accuracy loss. The key point: many dense weights are redundant.
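For intuition, here is a minimal sketch of magnitude pruning, the simplest criterion (zero the smallest-magnitude weights). Movement pruning, mentioned above, instead scores weights by how they change during fine-tuning; the function name here is illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of smallest-magnitude weights.

    The layer shape is unchanged; the matrix just becomes sparse,
    as described in the weight-pruning example above.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

W = np.array([[0.01, -2.0], [0.5, -0.02]])
W_sparse = magnitude_prune(W, sparsity=0.5)
# The two smallest-magnitude weights (0.01 and -0.02) are zeroed.
```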
💡
MoE gate with two experts: The gate outputs probabilities [0.8, 0.2] for Expert 1 and Expert 2. 80% of the token’s representation effectively goes to Expert 1, and 20% to Expert 2; both outputs are combined. The input is the token vector x; processing is E1(x) and E2(x) weighted by gate; output is 0.8E1(x)+0.2E2(x). This demonstrates weighted expert combination.
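The weighted combination y(x) = Σ_i G_i(x)·E_i(x) is just a few lines of code. A toy sketch with linear maps standing in for the expert FFNs (the expert functions are illustrative):

```python
import numpy as np

def moe_output(x, experts, gate_probs):
    """Weighted mixture: y(x) = sum_i G_i(x) * E_i(x)."""
    return sum(g * E(x) for g, E in zip(gate_probs, experts))

# Two toy 'experts': simple linear maps standing in for expert networks.
E1 = lambda x: 2.0 * x
E2 = lambda x: -1.0 * x

x = np.array([1.0, 2.0])
y = moe_output(x, [E1, E2], [0.8, 0.2])
# y = 0.8*E1(x) + 0.2*E2(x) = 0.8*(2x) + 0.2*(-x) = 1.4*x = [1.4, 2.8]
```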
💡
Topic-based specialization: Suppose Expert A tends to handle science words and Expert B handles sports words. The router learns to send “atom, electron, gravity” to A and “goal, coach, stadium” to B. Each expert learns more effectively because it repeatedly sees familiar patterns. The key idea is that experts become specialists.
💡
Softmax vs sparsemax gating: With softmax, an input might get probabilities [0.6, 0.3, 0.08, 0.02, …], giving every expert at least a little probability mass. With sparsemax, most probabilities become exactly zero, e.g., [0.7, 0.3, 0, 0, …], making routing simpler and cheaper. Processing then uses only the selected experts. The point is that sparsemax enforces sparsity in routing decisions.
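The contrast is easy to see numerically. Below is a NumPy sketch of sparsemax (the Euclidean projection of the logits onto the probability simplex) next to softmax; with the same logits, softmax assigns every expert nonzero mass while sparsemax zeroes the trailing ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: project logits onto the simplex; many outputs are exactly 0."""
    z_sorted = np.sort(z)[::-1]                 # logits in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cumsum        # experts that keep nonzero mass
    k = ks[support][-1]                         # size of the support set
    tau = (cumsum[k - 1] - 1) / k               # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.5, -1.0, -2.0])
p_soft = softmax(z)      # every expert gets some probability mass
p_sparse = sparsemax(z)  # [0.75, 0.25, 0.0, 0.0]: trailing experts are zero
```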
💡
Load balancing loss: Imagine the gate sends 90% of tokens to Expert 1 and starves others. We add a loss term that penalizes uneven usage, pushing traffic toward all experts. After training, token assignments even out, and each expert learns better. This avoids expert collapse and improves overall accuracy.
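One common formulation (used by the Switch Transformer) multiplies, per expert, the fraction of tokens routed there by the mean router probability, and sums. A sketch assuming top‑1 routing:

```python
import numpy as np

def load_balance_loss(gate_probs, expert_ids):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i, where f_i is the
    fraction of tokens routed to expert i and P_i is the mean router
    probability for expert i. It is minimized when traffic is uniform."""
    n_experts = gate_probs.shape[1]
    f = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    P = gate_probs.mean(axis=0)
    return n_experts * float(np.dot(f, P))

# Collapsed router: everything goes to expert 0 -> high loss.
collapsed = np.array([[0.97, 0.03]] * 8)
# Balanced router: traffic split evenly -> loss at its minimum of 1.0.
balanced = np.array([[0.9, 0.1], [0.1, 0.9]] * 4)

loss_c = load_balance_loss(collapsed, collapsed.argmax(1))  # 1.94
loss_b = load_balance_loss(balanced, balanced.argmax(1))    # 1.0
```

Adding this term (with a small weight) to the training loss pushes traffic toward all experts, as described above.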
💡
Dropout in experts: Each expert applies dropout during training so it doesn’t memorize its limited tokens. The input is the assigned token set; processing randomly drops some activations; output remains robust. Over time, experts generalize better. The point is to combat overfitting in narrow data slices.
💡
Switch Transformer top-1 routing: Each token is routed to its single best expert to minimize compute. The gate picks the top expert index; that expert processes the token; the output is that expert’s output, typically scaled by its gate probability. This keeps per-token cost near a dense FFN while allowing many experts overall. The example shows a simple, scalable MoE design.
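A minimal NumPy sketch of top‑1 routing, with toy functions standing in for expert FFNs (the gate matrix and expert definitions are illustrative assumptions):

```python
import numpy as np

def top1_moe(tokens, gate_W, experts):
    """Top-1 (Switch-style) routing: each token is processed only by its
    highest-probability expert, scaled by that expert's gate probability."""
    logits = tokens @ gate_W                             # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # softmax per token
    best = probs.argmax(axis=1)                          # one expert per token
    out = np.empty_like(tokens)
    for t, e in enumerate(best):
        out[t] = probs[t, e] * experts[e](tokens[t])     # only one expert runs
    return out, best

rng = np.random.default_rng(0)
experts = [lambda x: x + 1.0, lambda x: x - 1.0]  # toy stand-ins for FFNs
tokens = rng.normal(size=(4, 3))
gate_W = rng.normal(size=(3, 2))
out, assignment = top1_moe(tokens, gate_W, experts)
```

Note each token touches exactly one expert, so per-token compute stays near a single dense FFN regardless of how many experts exist.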
💡
Capacity factor and token drops: Suppose 10,000 token-expert assignments are expected across 10 experts (1,000 per expert). With a capacity factor of 1.25, each expert can handle up to 1,250 tokens this step. If more arrive, some tokens are dropped or rerouted, which can hurt quality. The lesson: set capacity high enough to avoid drops while keeping compute reasonable.
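The arithmetic above is exactly the standard capacity formula. A quick check (the function name is illustrative):

```python
import math

def expert_capacity(tokens_per_step, num_experts, capacity_factor):
    """Capacity per expert = ceil(tokens / experts * capacity_factor)."""
    return math.ceil(tokens_per_step / num_experts * capacity_factor)

cap = expert_capacity(10_000, 10, 1.25)   # 1,250 tokens per expert
# If an expert receives 1,400 assignments this step, the overflow is dropped
# or rerouted:
dropped = max(1_400 - cap, 0)             # 150 tokens
```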
💡
Replacing FFN with MoE in a Transformer layer: In layer 6 of a 12-layer model, swap the dense FFN for an MoE module with 16 experts. Tokens are routed to top-1 expert, experts process them, outputs are combined, and the model continues. Training shows improved accuracy at similar per-token FLOPs. The example demonstrates how MoE integrates into standard architectures.
💡
Choosing number of experts for a task: For a simple sentiment task, 4 experts suffice and train stably. For a multi-domain QA task with legal, medical, and coding text, 32 experts perform better by specializing. Practitioners test several counts and pick the sweet spot. This shows the specialists-per-domain intuition in practice.
💡
Debugging expert imbalance: During training, logs show Expert 0 gets 60% of tokens and others <5%. Increasing load balancing loss weight and slightly raising router temperature spreads traffic more evenly. Accuracy improves once all experts learn. The point: monitoring and adjustments are essential.
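Why raising the router temperature helps is visible in a two-line experiment: dividing logits by a larger temperature flattens the softmax, moving probability mass off the dominant expert.

```python
import numpy as np

def gate_with_temperature(logits, temperature):
    """Softmax over logits / T: higher T gives flatter routing probabilities."""
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.5, 0.5])   # one expert dominating
sharp = gate_with_temperature(logits, temperature=0.5)  # peaky, near-collapse
flat = gate_with_temperature(logits, temperature=4.0)   # spread across experts
# flat[0] < sharp[0]: raising T moves mass off the dominant expert
```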
💡
Practical deployment concern: In production, dynamic routing can increase latency variance due to uneven expert loads. Engineers cap per-expert capacity, pre-allocate buffers, and co-locate experts to reduce communication. After these steps, tail latency improves while average latency remains steady. The example highlights engineering needed for MoE at scale.
05 Conclusion
This lecture traced how the field learned that capability comes from compute used wisely—model size, data size, and training time must be balanced. It showed that dense scaling alone faces limits: data is finite and training costs are high. Sparse Activation offers a way forward: only turn on the parts of the network you need for each input. Conditional Computation dynamically picks a sub-network, while Pruning permanently removes less useful pieces to shrink and speed models.
Mixture of Experts is a powerful Conditional Computation design: a small gate routes each token to one or a few specialized experts, and their outputs are combined. The core equation y(x) = Σ_i G_i(x)·E_i(x) captures the weighted mixture. Training requires care: gates can collapse and experts can be imbalanced, but softmax or sparsemax gating, load balancing losses, and dropout help. In Transformers, replacing a feed-forward block with an MoE layer yields large capacity without increasing per-token compute much, as popularized by Switch Transformer.
To practice, start by swapping one FFN for an MoE module in a small Transformer and monitor expert usage. Tune load balancing loss and capacity factors to avoid token drops and expert starvation. Compare top‑1 versus top‑2 routing and measure both accuracy and latency. Try pruning on a dense baseline to see how much redundancy you can remove without hurting performance.
Next steps include exploring more advanced routing mechanisms, studying distributed expert placement for multi-GPU training, and experimenting with domain-specific experts. Learn how different tokenizations, sequence lengths, and batch sizes affect routing stability. Investigate data scaling strategies that align with your compute budget, inspired by the Chinchilla perspective.
The core message is simple: don’t pay to light the whole building when only a few rooms are needed. With MoE and Sparse Activation, we can hold far more knowledge and skills in a model while keeping per-token compute similar, provided we train and engineer the system carefully. This approach is a practical path to stronger language models when data and budgets are tight.
✓Co-locate experts and optimize dispatch. If you span experts across GPUs, communication can bottleneck. Group experts per device, use fused dispatch/combine kernels, and increase batch size to improve efficiency. Measure tail latency, not just averages.
✓Tune router sharpness. A temperature on softmax (or choice of sparsemax) affects how peaky routing is. Too sharp early causes instability; warm up to sharper routing later. Adjust in tandem with load balancing loss.
✓Compare fairly to dense baselines. Match per-token FLOPs when reporting accuracy gains. Run ablations: with/without load balancing, different top‑k, capacity settings. This isolates the benefit of MoE rather than extra compute.
✓Choose expert count based on task diversity. For multi-domain or skill-rich tasks, more experts can help; for simple tasks, fewer may suffice. Pilot with 4, 8, 16 experts and compare. Don’t assume more is always better.
✓Plan for deployment complexity. Dynamic routing introduces latency variance and memory pressure. Enforce capacity, pre-allocate buffers, and consider routing-aware batching. Continuously monitor expert loads in production.
✓Use pruning to speed baselines. Before adopting MoE, try pruning your dense model to cut inference cost. Movement or magnitude pruning can shrink models significantly with small accuracy loss. It provides a strong baseline for cost-performance.
✓Guard against expert starvation. If some experts get <1% of tokens, reduce num_experts or increase balancing. Consider curriculum or data mixing to expose diverse tokens early. Healthy specialization requires enough training examples per expert.
✓Stabilize with gradualism. Introduce MoE late in training or after a warmup period with higher router temperature and stronger regularization. Gradual changes reduce shocks to optimization. This avoids early collapse.
✓Document and version routing configs. Small changes in capacity, temperature, or loss weights can shift behavior. Keep configs under version control and note metrics per run. Reproducibility is critical for iterative tuning.
Conditional Computation
A method where the model chooses a sub-network to run based on the input. A routing function selects which layers or experts to execute. This increases efficiency, since each input follows only a small path. It can also improve specialization.
Pruning
Removing parts of a model that contribute little to accuracy. This can be individual weights or whole neurons. Pruning makes models smaller and faster. If done carefully, accuracy stays nearly the same.
Weight pruning
A pruning method that zeroes out individual weights deemed unimportant. The layer structure stays the same but becomes sparse. This often requires special libraries to get speedups. It is a fine-grained way to shrink models.
Neuron pruning
Removing entire neurons or channels from a layer. This gives immediate speed benefits because smaller matrices are multiplied. It is a coarse-grained way to simplify models, and it may require adjusting the input dimensions of the following layers.