•The lecture explains why simply making language models bigger (more parameters) helped for years, but also why data size and training time matter just as much. From BERT in 2018 to GPT‑2, GPT‑3, PaLM, Chinchilla, and Llama 2, the trend shows performance rises when models are scaled correctly with enough data and compute.
•We are running into limits: data is finite and training ever‑larger dense models is very expensive. The big question is how to get more capability without paying a lot more compute for every token processed.
•Sparse Activation addresses this by turning on only a small part of the network for each input. Think of a huge school where only the teachers needed for a student’s question are called into the room instead of the whole staff.
•There are two main paths to sparse activation: Conditional Computation and Pruning. Conditional Computation routes each input through a chosen sub‑network; Pruning removes unimportant connections or neurons to shrink and speed up the model.
•Mixture of Experts (MoE) is a popular form of Conditional Computation. A small "gate" network picks one or a few "experts" (mini‑networks) to process each token, and the expert outputs are combined to produce the final result.
•Mathematically, MoE outputs a weighted sum of expert outputs. The gate provides probabilities G_i(x) for each expert i on input x, and each expert E_i(x) returns its own vector; the model sums G_i(x) * E_i(x) across experts.
•Training MoE is tricky because the gate can collapse to always choosing the same expert, and experts see less data since they get only certain tokens. Load balancing losses and careful routing are used to keep traffic spread out.
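The weighted sum described above can be illustrated with a toy example. The two linear "experts" and the gate weights below are made up for the illustration; real experts would be MLPs inside a Transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_output(x, experts, gate_probs):
    """y(x) = sum_i G_i(x) * E_i(x): weighted sum of expert outputs."""
    return sum(p * expert(x) for p, expert in zip(gate_probs, experts))

# Two toy "experts": simple linear maps standing in for expert MLPs.
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
experts = [lambda v: W1 @ v, lambda v: W2 @ v]

x = rng.normal(size=4)
gate = np.array([0.9, 0.1])      # gate strongly prefers expert 1
y = moe_output(x, experts, gate)
print(y.shape)                   # (4,)
```

Note that the output lives in the same space as each expert's output, so experts can be mixed freely.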
Why This Lecture Matters
This lecture matters for anyone designing or deploying language models under real constraints—ML engineers, researchers, data scientists, and engineering managers. Simply making dense models bigger is hitting walls: we’re running out of high-quality data, training costs are soaring, and latency budgets are tight. Sparse Activation, especially Mixture of Experts, offers a path to greater capability without linear compute growth by activating only the specialists needed per input.
In practice, this knowledge helps you build models that meet accuracy targets while fitting within fixed GPU hours and inference SLAs. You can increase capacity by adding experts and balance traffic through load balancing losses, instead of increasing per-token compute. You can also apply pruning to existing models to ship faster, cheaper systems with minimal accuracy loss.
For projects, this means more flexible architectures: swap a feed-forward layer for an MoE layer, start with top‑1 routing, monitor expert usage, and incrementally scale experts. For your career, understanding MoE puts you at the forefront of efficient LLM design, a key theme as the industry optimizes cost-performance trade-offs. As organizations look to deploy capable models responsibly, the ability to use compute wisely—guided by scaling laws and sparse architectures—has become a core competitive skill.
Lecture Summary
01 Overview
This lecture teaches why the field moved from simply making dense language models bigger to using smarter architectures like Mixture of Experts (MoE) that activate only parts of the network for each input. It starts with a timeline: BERT (2018) showed that bigger wasn’t always better, GPT‑2 (2019) hinted at emergent abilities from scaling, GPT‑3 (2020) massively scaled parameters and revealed strong zero‑shot and few‑shot performance, Chinchilla (2022) argued that data and model size must be scaled together, PaLM (2022) demonstrated chain‑of‑thought reasoning at large scale, and Llama 2 (2023) showed that smaller models trained longer on more tokens can compete. These steps build a picture: capability rises when model size, data size, and training budget (compute) are all balanced.
With that context, the lecture introduces Sparse Activation—an approach that grows model capacity without paying the full compute cost for every token. Instead of turning on all neurons for each input (dense), we activate only the parts that matter (sparse). Two main families are described. Conditional Computation chooses a sub‑network at run time using a routing function; Pruning permanently removes less important weights or neurons to make models smaller and faster. Mixture of Experts (MoE) is a specific Conditional Computation design where a “gate” picks one or more “experts” (specialized sub‑models) to process each input.
In MoE, each token passes through a small router that outputs a probability distribution over experts. The system then sends the token’s representation to a few selected experts, collects their outputs, and combines them (often with the same probabilities). This lets the system host many experts (high total parameter count) but compute only with a few on each token (low per‑token compute). The lecture states the core formula: the output is the sum over experts of the gate’s probability times the expert’s output. It also highlights practical training challenges: gates can be hard to train, experts see fewer examples and can overfit, and the system can become unbalanced if one expert gets too many tokens.
To stabilize training, the lecture introduces several techniques. Gate training commonly uses softmax (a smooth distribution over experts) or sparsemax (which yields zeros and encourages using only a few experts). Load balancing losses push the gate to distribute tokens more evenly, avoiding a single “hot” expert. Dropout prevents experts from overfitting to their subset of tokens. Together, these ideas keep experts specialized yet broadly useful.
The target audience is students with basic knowledge of neural networks and Transformers: you should know what tokens are, how an encoder or decoder works, and why pre‑training and fine‑tuning matter. This lecture is friendly to learners who understand high‑level concepts but want a structured view of scaling history and the motivation for sparse architectures. You do not need advanced math to follow; the equations and algorithms are explained intuitively.
After this lecture, you will be able to explain why naive scaling hit limits and how compute, data, and training time must be balanced; describe the difference between dense and sparse activation; define Conditional Computation and Pruning; explain Mixture of Experts and its gate‑and‑experts structure; and outline how to integrate an MoE layer into a Transformer. You will be able to articulate common training issues (gate collapse, imbalance) and name standard fixes (load balancing, dropout, sparsemax). You will also understand practical concerns like how to choose the number of experts and how MoE applies to Transformers (e.g., Switch Transformer).
The lecture flows in three parts. First, it builds motivation by reviewing the history of scaling language models and deriving the “compute‑centric” lesson. Second, it introduces Sparse Activation, contrasting Conditional Computation and Pruning, with short examples and known results (e.g., movement pruning). Third, it dives into Mixture of Experts: the core idea, the governing equation, how gates and experts work, and typical training and deployment challenges. It closes with a short Q&A about the number of experts and using MoE inside Transformers, emphasizing the key benefit—more capacity for roughly the same per‑token compute—and the key caution—training and engineering are harder.
Key Takeaways
✓Balance model size, data, and training steps. Before adding parameters, check if your model is undertrained on data; training longer on more tokens may outperform a bigger model at the same compute. Use scaling law intuition to pick a good balance. This keeps costs predictable and results strong.
✓Start MoE small and simple. Replace one FFN with an MoE layer using top‑1 gating and a modest number of experts (e.g., 4–8). Verify stability by monitoring expert loads and gate entropy. Scale to more experts only after routing is healthy.
✓Use load balancing from day one. Add an auxiliary loss that penalizes uneven expert usage to avoid gate collapse. Tune its weight so it influences routing but doesn’t dominate the main loss. Track the loss over time to catch regressions.
✓Control capacity to prevent drops. Set capacity_factor so each expert can handle its expected tokens with a margin. Large drop rates hurt quality; aim for near-zero drops in steady state. Adjust batch size and top‑k along with capacity.
✓Prefer top‑1 gating initially. It minimizes per-token compute and simplifies debugging. Once stable, try top‑2 for potential quality gains, measuring latency and throughput. Keep an eye on communication overhead as k increases.
✓Instrument routing thoroughly. Log per-expert token counts, gate entropy, and drop rates per step. Visualize token topics per expert to confirm specialization. Use alerts for imbalance thresholds.
✓Regularize experts. Apply dropout and consider modest weight decay inside expert MLPs. This reduces overfitting to narrow token subsets. It improves generalization and stability.
Glossary
Language model
A system that predicts the next word (token) in a sequence given the previous words. It learns patterns from large text datasets to estimate these probabilities. Bigger models can learn more patterns, but they also need more data and compute. Language models can answer questions and write text by choosing likely next words repeatedly.
Transformer
A neural network architecture that uses attention to process sequences. Instead of reading words one by one, it looks at all words in a sentence at once and decides which are most relevant. It stacks layers of attention and feed-forward networks. Transformers power most modern language models because they scale well.
Dense model
A model where all parts are active for every input. Every neuron and layer participates in processing, which makes compute per input predictable but often large. Dense models are easy to implement and train. However, they may waste compute on irrelevant parts.
Sparse Activation
Only a subset of the network turns on for each input. This saves compute because you don’t run everything all the time. It allows the model to store many more parameters than it uses per token. The trick is deciding which parts to activate for each input.
•The gate is often trained with a softmax (smooth probabilities) or sparsemax (encourages zeros) over experts. Sparsemax makes it easier to pick just a few experts, which saves compute.
•Dropout (randomly dropping units during training) helps experts avoid overfitting to their tokens. Together with load balancing, it stabilizes training so experts specialize instead of competing for all inputs.
•The main benefits of MoE are higher capacity without higher per‑token compute, better accuracy when routing is healthy, and often stronger generalization. The costs are harder training, more debugging, and complex deployment.
•MoE works inside Transformers by replacing a feed‑forward layer with an MoE layer. Each token is routed to a small set of expert MLPs, outputs are combined, and the rest of the Transformer (attention, layer norms) stays the same.
•Choosing the number of experts depends on task complexity and compute budget. Practitioners try several counts and select what balances specialization, speed, and stability.
•Switch Transformer is a simple MoE variant that typically routes each token to its single best expert (top‑1 gating). It popularized large‑scale MoE for language modeling while keeping compute per token similar to a dense layer.
•Pruning is another sparse strategy: remove low‑value weights or entire neurons after or during training. Movement pruning showed that you can prune BERT heavily (up to 90%) while keeping accuracy on many tasks.
•The big picture: capability comes from compute managed well—right model size, enough data, enough training, and smarter architectures. MoE is one way to spend compute more wisely by using specialists only when needed.
02 Key Concepts
01
Capability ceiling and function approximator view: A language model can be seen as a function that maps a token sequence to a probability distribution over the next token. Making the function more expressive generally means giving it more parameters (more degrees of freedom). This is why early progress came from scaling up model size. But a function’s performance also depends on how much data it sees and how well it is trained. You can think of the model like a clay sculpture: more clay (parameters) helps, but you also need enough practice (data) and time to shape it well (training).
02
BERT (2018): BERT is a Transformer encoder with about 345 million parameters in its largest version. At the time, this felt very large, yet simply scaling BERT did not produce consistent downstream gains. This suggested that capacity alone was not the whole story; pre‑training objectives and data also matter. Researchers hypothesized that data wasn’t big enough or objectives weren’t aligned with downstream tasks. It planted the seed that we must think beyond just parameter count.
03
GPT‑2 (2019): GPT‑2 scaled to 1.5 billion parameters and showed surprising behaviors. The model learned tasks without explicit supervision when prompted in the right way. This demonstrated emergent abilities: the model did useful things even without fine‑tuning. The finding raised interest in zero‑shot and few‑shot learning. It indicated scaling can unlock qualitatively new capabilities.
04
GPT‑3 (2020): GPT‑3 jumped to 175 billion parameters, more than 100× GPT‑2. The paper emphasized that simply scaling the architecture led to stronger zero‑shot and few‑shot performance. This validated the idea that large language models can generalize widely with the right prompts. The lesson was: scaling can keep paying off. But the compute cost was enormous, which raised sustainability questions.
05
Chinchilla (2022): Chinchilla argued that GPT‑3 was undertrained on data for its size. It showed that for a fixed compute budget, using a smaller model with much more data yields better performance. Specifically, a ~70B parameter model trained on 1.4T tokens outperformed a larger model trained on far fewer tokens. The message: scale model size and data together to use compute efficiently. This reframed scaling as a balancing act, not a one‑way race to giant models.
06
PaLM and chain‑of‑thought (2022): PaLM scaled to 540B parameters and exhibited strong chain‑of‑thought reasoning when prompted with worked examples. The model could solve multi‑step problems like a tennis balls arithmetic story by reasoning step by step. This suggested that some forms of reasoning can emerge from scale and data without explicit symbolic modules. It nudged the field to think about prompting methods alongside scaling. It again showed that capacity plus data can unlock new behaviors.
07
Llama 2 (2023): Llama 2 models range from 7B to 70B parameters, smaller than GPT‑3 or PaLM, but trained on around 2T tokens. Despite fewer parameters, they were competitive due to longer, broader training. This reinforced that training duration and data scale can compensate for smaller model size. Practical takeaway: you can get excellent results from smaller models if you invest in data and training time.
08
Compute‑centric view: The big pattern across years is that compute—how much training work you do and how you allocate it—drives capability. Model size, data size, and training steps are three knobs tied to the same budget. The best results come from balancing them rather than maximizing only one. As we hit data limits and rising costs, smarter use of compute becomes crucial. This motivates architectures that spend compute where it matters most.
09
Sparse Activation concept: Instead of activating the entire network for every input, turn on only the parts that are useful. Picture a huge building of rooms (neurons); for each visitor (input), you unlock only the rooms they need. This lets you store many specialized rooms but pay to use only a few at a time. Capacity rises because you can host many rooms, but per‑visit cost stays similar. This contrasts with dense models that power up the whole building every time.
10
Conditional Computation: This approach dynamically selects a sub‑network based on each input. A routing function first looks at the input and then chooses which layers or modules to run. In one case the route might be layers 1 and 3; in another case, layers 2 and 4. You get a big network in total, but each input walks only a short path. This is compute efficient and can increase effective capacity if routing is good.
11
Pruning: Pruning removes weights or neurons that matter less, either after training or while training. Weight pruning zeroes small weights; neuron pruning deletes entire units. The benefits include a smaller model and faster inference with little or no accuracy loss. Movement pruning demonstrated up to 90% pruning of BERT with minimal drops on many tasks. Pruning is like trimming a tree: cut weak branches so the tree is lighter but still healthy.
12
Mixture of Experts (MoE) basics: MoE has many experts (sub‑models) and a gate that assigns inputs to experts. Each expert is a specialist; the gate outputs a probability distribution over experts for each input. The final output is a weighted sum of the experts’ outputs. This can greatly increase total parameters while keeping per‑input compute roughly the same. The key is smart routing so each token meets the right specialists.
13
MoE mathematics: The output y(x) = Σ_i G_i(x)·E_i(x), where G_i(x) are gate probabilities and E_i(x) are expert outputs. G is usually produced by a small neural network followed by softmax or sparsemax. Experts are often MLPs similar to Transformer feed‑forward blocks. Routing may pick top‑1 or top‑k experts to limit compute. The weighted sum combines expertise into a single representation.
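A minimal NumPy sketch of top‑k selection over a softmax gate, as described above (sizes, seed, and variable names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_experts, top_k = 8, 2
logits = rng.normal(size=(5, num_experts))   # router logits for 5 tokens
probs = softmax(logits)                      # G(x): distribution over experts

# Keep only the top-k experts per token and renormalize their weights,
# so per-token compute is limited to k expert evaluations.
topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]
topk_vals = np.take_along_axis(probs, topk_idx, axis=-1)
topk_vals = topk_vals / topk_vals.sum(axis=-1, keepdims=True)
```

With top‑1 the weight degenerates to 1.0 for the selected expert, which is the Switch Transformer setting mentioned later.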
14
Training the gate: The gate must learn to recognize which expert is best for which input. Softmax gives smooth probabilities to all experts; sparsemax makes many probabilities exactly zero to force sparsity. Without help, gates can collapse, sending most tokens to a single expert. Load balancing terms in the loss encourage even traffic. Good gates spread tokens enough to keep all experts trained and useful.
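Sparsemax, mentioned above, can be sketched as a Euclidean projection onto the probability simplex via the standard sorting algorithm; this NumPy version is illustrative, not the lecture's code:

```python
import numpy as np

def sparsemax(z):
    """Project logits z onto the probability simplex; yields exact zeros."""
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv        # which coordinates stay nonzero
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max      # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.5, 0.1, -1.0]))
# p sums to 1 and the two low-scoring experts get probability exactly zero
```

Unlike softmax, which gives every expert some mass, sparsemax zeroes out weak experts directly, so routing is sparse without a separate top‑k step.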
15
Training the experts: Each expert sees only a subset of tokens, which can cause overfitting or undertraining. Dropout helps by adding noise so experts learn more general patterns. Balanced routing gives every expert enough examples to learn. Experts often mirror standard Transformer MLPs to stay simple and efficient. Healthy experts specialize without becoming narrow memorization machines.
16
Benefits of MoE: You get higher capacity for similar per‑token compute, improved accuracy when routing is effective, and often better generalization. The architecture can host many specialized skills without slowing each token much. It also makes it easier to expand capacity by adding more experts. This is appealing when data is scarce or compute is limited. It’s like hiring more part‑time specialists rather than overworking a few full‑timers.
17
Challenges of MoE: Training can be unstable, with gate collapse or expert imbalance. Debugging is harder because behavior depends on routing decisions that change across batches. Deployment is more complex due to dynamic routing and potential communication overhead. Monitoring expert usage and latency is necessary in production. These costs require careful engineering and experimentation.
18
Using MoE with Transformers: Replace some feed‑forward layers with MoE layers. Keep attention and normalization the same so the model’s backbone remains familiar. Switch Transformer is a known example with simple top‑1 routing. This approach scales to many experts while controlling per‑token work. The result is a large‑capacity Transformer that’s still efficient per token.
19
Choosing the number of experts: There’s no single right answer; it depends on task complexity and budget. If the task has many distinct subskills, more experts may help. If the task is simple, a few experts (or a dense model) may suffice. Practitioners try several options and pick what balances performance and compute. Thinking of experts as specialists can guide initial guesses (e.g., one per expected specialization).
20
Overall message: Scaling isn’t only about parameter count; it’s about compute used wisely. Sparse Activation, and especially MoE, lets us increase stored knowledge and skills without paying for everything on every token. The key ingredients are good routing, balanced training, and stable optimization. This direction addresses data and cost limits while keeping models capable. It’s a practical path beyond “just make it bigger.”
03 Technical Details
Overall Architecture/Structure
Language model as function approximator (dense baseline)
Definition: A language model takes a sequence of tokens and predicts the next token’s probability distribution. In dense Transformers, every token passes through the same full stack of layers. All neurons in each layer are potentially active for every token. This yields predictable compute per token but forces you to pay for the whole network regardless of input complexity.
Data flow in dense Transformer: tokens → embedding → repeated blocks of [self‑attention → feed‑forward network (FFN) → residuals, layer norms] → logits → softmax over vocabulary.
Sparse Activation idea
Instead of activating all neurons, only a chosen subset is used per input. Imagine a city’s power grid: you don’t light every building all the time; you light only those currently in use. Architecturally, this means adding a routing decision that decides which parts to run. The model can hold many parameters (many potential routes) but spends compute on only a few routes per token.
Conditional Computation vs. Pruning
Conditional Computation: Adds a routing function that picks which sub‑network (layers or experts) to apply for a given input. The big network exists, but each input only traverses a short path. Pros: keeps capacity high, can specialize sub‑networks; cons: routing adds complexity and training challenges.
Pruning: Deletes unimportant weights/neurons permanently. Pros: simple inference, smaller memory footprint; cons: reduces maximum capacity since pruned parts are gone. Movement pruning is an example showing extreme pruning of BERT with small accuracy loss.
Mixture of Experts (MoE)
Core components: a gate (router) and a pool of experts (sub‑models). For each token representation x, the gate outputs a distribution G(x) over experts. Selected experts E_i process x and return outputs; the model combines them to yield y(x).
Formula: y(x) = Σ_i G_i(x)·E_i(x). This is a weighted sum; often only a few G_i(x) are non‑zero due to top‑k routing or sparsemax. The gate can be a small linear projection from token hidden size to num_experts, followed by a normalization (softmax/sparsemax).
Experts: commonly MLPs with the same input/output size as the replaced FFN. For example, in a Transformer with hidden size H, each expert is a two‑layer MLP (H → 4H → H) with activation (e.g., ReLU or GELU). Experts share the same interface so any token can be sent to any expert.
Data flow in an MoE Transformer layer
Input token vectors (shape: batch Ă— seq_len Ă— hidden) arrive at the MoE layer.
Router computes scores for each token: scores = x·W_g + b_g (W_g: hidden × num_experts). Apply softmax or sparsemax to get probabilities over experts.
Select top‑k experts per token (k often 1 or 2). Optionally compute a capacity for each expert (maximum number of tokens it can process this step) to avoid overload.
Dispatch: gather tokens for each selected expert (this is a permutation/pack operation). Each expert runs its MLP on its assigned token batch.
Combine: scatter expert outputs back to original token positions and weight-sum by gate probabilities. Apply residual connection and continue through the Transformer.
Compute vs. capacity
Total parameters can be very large because you can add many experts. But per‑token compute stays close to that of a single expert MLP if k is small (e.g., top‑1). Thus, you get large capacity (many specialists) but keep per‑token cost similar to a dense model with one FFN. This is the main appeal of MoE for large language models.
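A quick back‑of‑envelope count makes the capacity/compute split concrete (the sizes below are illustrative, not from the lecture):

```python
hidden, ffn_mult, num_experts, top_k = 1024, 4, 16, 1

# One expert MLP (H -> 4H -> H), ignoring biases for simplicity.
params_per_expert = hidden * (ffn_mult * hidden) + (ffn_mult * hidden) * hidden

total_expert_params = num_experts * params_per_expert    # stored capacity
active_params_per_token = top_k * params_per_expert      # compute actually used

# With 16 experts and top-1 routing: 16x the parameters of a dense FFN,
# but each token still touches only one expert's worth of weights.
print(total_expert_params // active_params_per_token)    # 16
```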
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative definitions of the Router and ExpertMLP used by MoELayer,
# matching the descriptions above (linear gate + softmax; H -> 4H -> H MLP).
class Router(nn.Module):
    def __init__(self, hidden, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden, num_experts)
    def forward(self, x):
        logits = self.proj(x)
        return F.softmax(logits, dim=-1), logits

class ExpertMLP(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, hidden, num_experts, top_k=1, capacity_factor=1.25):
        super().__init__()
        self.router = Router(hidden, num_experts)
        self.experts = nn.ModuleList([ExpertMLP(hidden) for _ in range(num_experts)])
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.num_experts = num_experts

    def forward(self, x):
        # x: [batch, seq, hidden] -> flatten tokens
        B, S, H = x.shape
        tokens = B * S
        x_flat = x.reshape(tokens, H)
        probs, logits = self.router(x_flat)                    # [tokens, E]
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)   # routing picks
        # capacity per expert (integer): how many tokens each expert can process this step
        capacity = int(self.capacity_factor * math.ceil(tokens * self.top_k / self.num_experts))
        # create lists of (token, rank) assignments per expert
        expert_assignments = [[] for _ in range(self.num_experts)]
        for t in range(tokens):
            for r in range(self.top_k):
                e = topk_idx[t, r].item()
                if len(expert_assignments[e]) < capacity:
                    expert_assignments[e].append((t, r))
                # else: token is dropped, or sent to a backup expert in practice
        # dispatch: build per-expert batches
        expert_inputs = [[] for _ in range(self.num_experts)]
        for e in range(self.num_experts):
            for (t, r) in expert_assignments[e]:
                expert_inputs[e].append(x_flat[t])
        expert_outputs = [None] * self.num_experts
        for e in range(self.num_experts):
            if expert_inputs[e]:
                x_e = torch.stack(expert_inputs[e], dim=0)
                expert_outputs[e] = self.experts[e](x_e)       # [tokens_e, H]
        # combine: start with zeros, then add weighted expert outputs
        y_flat = torch.zeros_like(x_flat)
        for e in range(self.num_experts):
            if expert_outputs[e] is None:
                continue
            for j, (t, r) in enumerate(expert_assignments[e]):
                y_flat[t] += topk_vals[t, r] * expert_outputs[e][j]
        y = y_flat.reshape(B, S, H)
        # auxiliary losses could be returned (e.g., load balancing) using probs/logits
        return y
Explanation:
Router forward pass computes per‑token probabilities over experts.
topk selects the best experts per token; capacity_factor controls how many tokens each expert can accept this step (prevents overload).
Dispatch gathers token vectors for each expert; combine scatters results back and weights them.
In practice, frameworks (DeepSpeed MoE, FairScale) implement efficient dispatch/combine with specialized kernels to avoid Python loops.
Auxiliary losses are computed from probs/logits to encourage balanced usage.
Important parameters and meanings
num_experts: how many experts are available. More experts increase capacity but raise communication/memory overhead.
top_k: how many experts a token uses. top‑1 is cheapest; top‑2 sometimes improves quality by allowing mixture.
capacity_factor: expert capacity per step relative to expected average load. Too small causes token drops; too large wastes compute.
gating temperature (if used): sharpens or smooths gate distributions.
load balancing weights: strength of the auxiliary loss that pushes toward even routing.
Training flow
Forward: compute routing, dispatch tokens, run experts, combine outputs, compute task loss (e.g., next‑token cross‑entropy), plus auxiliary losses (e.g., load balancing).
Backward: gradients flow through combine into expert weights and back through the router into gating parameters.
Optimization: standard optimizers (AdamW), learning rate schedule (warmup + decay), with careful tuning to avoid router collapse.
Tools/Libraries Used
PyTorch or JAX/Flax: general deep learning frameworks.
DeepSpeed MoE or FairScale: provide optimized MoE layers, routing, and distributed strategies out of the box.
CUDA kernels and communication libraries (NCCL): crucial for fast expert dispatch/combine across GPUs.
Tokenizers and data pipelines: same as dense models, but keep an eye on sequence length and batch size to maintain routing stability.
Why these tools?
MoE involves complex gather/scatter operations and possibly expert parallelism across devices. Libraries provide efficient primitives and load‑balancing utilities. They also include metrics for monitoring expert usage during training.
Step‑by‑Step Implementation Guide
Define your baseline Transformer.
Start from a known good dense model to ensure stability. Keep attention blocks unchanged for fair comparison.
Choose where to insert MoE.
Replace the FFN in some or all Transformer layers with an MoE layer. Begin with a single layer to debug routing before scaling up.
Design the experts.
Use the same MLP shape as your dense FFN (e.g., H→4H→H). Consider smaller experts if you add many of them to control parameter growth.
Implement the router.
A simple linear projection from H to num_experts followed by softmax or sparsemax is a good start. Add top‑k selection (k=1 or 2). Consider a temperature parameter to adjust sharpness.
Add capacity management.
Compute per‑expert capacity as capacity_factor × (tokens × top_k / num_experts). Decide what to do when an expert is full: drop the token, route to backup, or pad for the next pass. Minimizing drops preserves quality.
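The capacity computation from this step, checked with illustrative numbers:

```python
import math

tokens, top_k, num_experts = 4096, 1, 8
capacity_factor = 1.25

expected_load = tokens * top_k / num_experts     # 512 tokens per expert on average
capacity = int(capacity_factor * math.ceil(expected_load))  # headroom for imbalance
print(capacity)  # 640
```

The 25% margin absorbs routing imbalance; any expert receiving more than `capacity` tokens in a step must drop or reroute the excess.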
Add auxiliary load balancing loss.
A common design penalizes high variance between actual expert loads and uniform load. Another penalizes peaked gate probabilities averaged over tokens. Tune the weight so neither dominates the main task loss.
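One common design, similar in spirit to the Switch Transformer's auxiliary loss, multiplies the fraction of tokens each expert receives by its mean gate probability; the sketch below uses NumPy and illustrative names:

```python
import numpy as np

def load_balancing_loss(probs, assignments, num_experts):
    """num_experts * sum_i f_i * p_i, minimized when routing is uniform.
    probs: [tokens, E] gate probabilities; assignments: [tokens] chosen expert."""
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    p = probs.mean(axis=0)
    return num_experts * np.sum(f * p)

# Perfectly uniform routing attains the minimum value of 1.0.
uniform = np.full((8, 4), 0.25)
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(uniform, assignments, 4))  # 1.0
```

Collapsed routing (all tokens to one expert with peaked probabilities) drives the value well above 1, so adding this term with a small weight nudges the gate toward even traffic.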
Training setup.
Use AdamW with learning rate warmup. Enable dropout inside experts. Monitor gradients on router and experts to detect collapse or stagnation.
Instrumentation.
Log per‑expert token counts, entropy of gate distributions, fraction of dropped tokens, and router loss. Alert when an expert hogs traffic or many tokens are dropped.
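These metrics can be computed directly from the gate probabilities and expert assignments each step; `routing_metrics` below is a hypothetical helper, not part of any library:

```python
import numpy as np

def routing_metrics(probs, assignments, capacity, num_experts):
    """Per-step routing health metrics from gate probs and top-1 assignments."""
    counts = np.bincount(assignments, minlength=num_experts)
    # Mean gate entropy: near zero means the gate has collapsed onto one expert.
    entropy = -np.sum(probs * np.log(probs + 1e-9), axis=-1).mean()
    # Fraction of tokens exceeding each expert's capacity (would be dropped).
    dropped = np.maximum(counts - capacity, 0).sum() / len(assignments)
    return counts, entropy, dropped

probs = np.full((8, 4), 0.25)                 # maximally uncertain gate
assignments = np.array([0, 0, 0, 0, 0, 1, 2, 3])
counts, entropy, dropped = routing_metrics(probs, assignments,
                                           capacity=3, num_experts=4)
# counts = [5, 1, 1, 1]; two tokens over expert 0's capacity -> drop rate 0.25
```

Alerting on low entropy or a rising drop rate catches collapse and overload before they show up in the task loss.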
Scale up gradually.
Start with few experts (e.g., 4), top‑1 gating, and a single MoE layer. If stable, increase experts and add more MoE layers. Keep batch size and sequence length steady while tuning router loss.
Evaluation and ablations.
Compare to dense baseline at equal per‑token FLOPs. Run with/without load balancing, with top‑1 vs top‑2, and different capacity_factors. Evaluate accuracy and latency to understand trade‑offs.
Tips and Warnings
Router collapse: If one expert gets most tokens, increase load balancing loss weight, add dropout, or increase capacity_factor to reduce drops. Monitor gate entropy; extremely low entropy is a red flag.
Expert undertraining: If some experts rarely get tokens, consider reducing num_experts or raising the balancing strength. You can also seed the router or pretrain briefly with higher temperature to spread traffic.
Communication overhead: With many experts across GPUs, dispatch/combine can bottleneck. Co‑locate subsets of experts per device, increase batch size to amortize costs, and use libraries with fused ops.
Stability: Warm up the router slowly; a sudden sharp gate (low temperature) can destabilize early training. Consider gradient clipping and careful learning rate schedules.
Capacity vs. drops: Too small capacity causes many tokens to be dropped or rerouted, hurting quality. Aim for a capacity_factor that yields minimal drops (<1–2%).
Debugging: Visualize which tokens go to which experts (e.g., by topic or length). If experts don’t specialize, check if the task encourages specialization; otherwise, fewer experts may be better.
Deployment: Dynamic routing may increase latency variance. For production, pin experts, pre‑allocate buffers, and consider batching across requests with similar sequence lengths. Measure tail latency, not only averages.
Security & reliability: Ensure routing decisions are deterministic when required (e.g., for caching). Guard against OOM by enforcing capacity and backpressure.
Putting It All Together
MoE offers a path to more capacity without proportional per‑token compute. The architecture’s success hinges on a simple yet effective router, balanced expert loads, and robust engineering around dispatch/combine. With careful tuning, you can approach or exceed dense model accuracy at similar per‑token cost, while enjoying the ability to scale capacity by adding experts.
Connections to the Scaling Timeline
Early scaling showed benefits but faced compute and data limits. Sparse Activation, especially MoE, is a response: move from “activate everything always” to “activate just what you need.” This keeps the compute budget reasonable while letting the model store far more specialized knowledge. In practice, it complements lessons from Chinchilla (balance compute across model and data) by adding a smarter architecture that spends that compute where it counts.
04 Examples
💡
Chain-of-thought arithmetic: Prompt: “Roger has 5 tennis balls. He buys 2 more cans, each with 3 balls. How many now?” A large model like PaLM can list steps: 5 + (2 × 3) = 11, then answer 11. Input is the word problem; processing uses learned reasoning steps; output is the step-by-step text and final number. The key point is that large scale can unlock reasoning behaviors without explicit symbolic modules.
💡
Dense vs sparse activation: For a given sentence, a dense Transformer runs every layer and neuron, like turning on all lights in a building. A sparse model activates only a subset of layers/neurons based on the input, like lighting only rooms being used. The input is the same; processing differs by how much of the network is used; the output is the next-token probabilities. The point is saving compute while keeping or improving quality.
💡
Conditional computation layer routing: Input x goes through a routing function that chooses layers 1 and 3, skipping 2 and 4. Another input x' might be routed to layers 2 and 4. The output is computed only through the selected layers. This example shows dynamic sub-network selection per input, improving efficiency.
💡
Pruning weight example: After training a BERT model, we compute the importance of weights and set the smallest ones to zero. The input and tasks remain unchanged; the processing uses a sparser matrix, and output accuracy stays similar. Movement pruning demonstrated pruning up to 90% of weights with minimal accuracy loss. The key point: many dense weights are redundant.
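For intuition, here is a minimal sketch of magnitude pruning, the simplest criterion (zero the smallest-magnitude weights). Movement pruning, mentioned above, instead scores weights by how they change during fine-tuning; the function name here is illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of smallest-magnitude weights.

    The layer shape is unchanged; the matrix just becomes sparse,
    as described in the weight-pruning example above.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

W = np.array([[0.01, -2.0], [0.5, -0.02]])
W_sparse = magnitude_prune(W, sparsity=0.5)
# The two smallest-magnitude weights (0.01 and -0.02) are zeroed.
```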
💡
MoE gate with two experts: The gate outputs probabilities [0.8, 0.2] for Expert 1 and Expert 2. 80% of the token’s representation effectively goes to Expert 1, and 20% to Expert 2; both outputs are combined. The input is the token vector x; processing is E1(x) and E2(x) weighted by gate; output is 0.8E1(x)+0.2E2(x). This demonstrates weighted expert combination.
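The weighted combination y(x) = Σ_i G_i(x)·E_i(x) is just a few lines of code. A toy sketch with linear maps standing in for the expert FFNs (the expert functions are illustrative):

```python
import numpy as np

def moe_output(x, experts, gate_probs):
    """Weighted mixture: y(x) = sum_i G_i(x) * E_i(x)."""
    return sum(g * E(x) for g, E in zip(gate_probs, experts))

# Two toy 'experts': simple linear maps standing in for expert networks.
E1 = lambda x: 2.0 * x
E2 = lambda x: -1.0 * x

x = np.array([1.0, 2.0])
y = moe_output(x, [E1, E2], [0.8, 0.2])
# y = 0.8*E1(x) + 0.2*E2(x) = 0.8*(2x) + 0.2*(-x) = 1.4*x = [1.4, 2.8]
```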
💡
Topic-based specialization: Suppose Expert A tends to handle science words and Expert B handles sports words. The router learns to send “atom, electron, gravity” to A and “goal, coach, stadium” to B. Each expert learns more effectively because it repeatedly sees familiar patterns. The key idea is that experts become specialists.
💡
Softmax vs sparsemax gating: With softmax, an input might get probabilities [0.6, 0.3, 0.08, 0.02, …], giving every expert at least a little probability mass. With sparsemax, most probabilities become exactly zero, e.g., [0.7, 0.3, 0, 0, …], making routing simpler and cheaper. Processing then uses only the selected experts. The point is that sparsemax enforces sparsity in routing decisions.
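The contrast is easy to see numerically. Below is a NumPy sketch of sparsemax (the Euclidean projection of the logits onto the probability simplex) next to softmax; with the same logits, softmax assigns every expert nonzero mass while sparsemax zeroes the trailing ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: project logits onto the simplex; many outputs are exactly 0."""
    z_sorted = np.sort(z)[::-1]                 # logits in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cumsum        # experts that keep nonzero mass
    k = ks[support][-1]                         # size of the support set
    tau = (cumsum[k - 1] - 1) / k               # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.5, -1.0, -2.0])
p_soft = softmax(z)      # every expert gets some probability mass
p_sparse = sparsemax(z)  # [0.75, 0.25, 0.0, 0.0]: trailing experts are zero
```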
💡
Load balancing loss: Imagine the gate sends 90% of tokens to Expert 1 and starves others. We add a loss term that penalizes uneven usage, pushing traffic toward all experts. After training, token assignments even out, and each expert learns better. This avoids expert collapse and improves overall accuracy.
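One common formulation (used by the Switch Transformer) multiplies, per expert, the fraction of tokens routed there by the mean router probability, and sums. A sketch assuming top‑1 routing:

```python
import numpy as np

def load_balance_loss(gate_probs, expert_ids):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i, where f_i is the
    fraction of tokens routed to expert i and P_i is the mean router
    probability for expert i. It is minimized when traffic is uniform."""
    n_experts = gate_probs.shape[1]
    f = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    P = gate_probs.mean(axis=0)
    return n_experts * float(np.dot(f, P))

# Collapsed router: everything goes to expert 0 -> high loss.
collapsed = np.array([[0.97, 0.03]] * 8)
# Balanced router: traffic split evenly -> loss at its minimum of 1.0.
balanced = np.array([[0.9, 0.1], [0.1, 0.9]] * 4)

loss_c = load_balance_loss(collapsed, collapsed.argmax(1))  # 1.94
loss_b = load_balance_loss(balanced, balanced.argmax(1))    # 1.0
```

Adding this term (with a small weight) to the training loss pushes traffic toward all experts, as described above.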
💡
Dropout in experts: Each expert applies dropout during training so it doesn’t memorize its limited tokens. The input is the assigned token set; processing randomly drops some activations; output remains robust. Over time, experts generalize better. The point is to combat overfitting in narrow data slices.
💡
Switch Transformer top-1 routing: Each token is routed to its single best expert to minimize compute. The gate picks the top expert index; that expert processes the token; the output is that expert’s output, typically scaled by its gate probability. This keeps per-token cost near a dense FFN while allowing many experts overall. The example shows a simple, scalable MoE design.
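A minimal NumPy sketch of top‑1 routing, with toy functions standing in for expert FFNs (the gate matrix and expert definitions are illustrative assumptions):

```python
import numpy as np

def top1_moe(tokens, gate_W, experts):
    """Top-1 (Switch-style) routing: each token is processed only by its
    highest-probability expert, scaled by that expert's gate probability."""
    logits = tokens @ gate_W                             # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # softmax per token
    best = probs.argmax(axis=1)                          # one expert per token
    out = np.empty_like(tokens)
    for t, e in enumerate(best):
        out[t] = probs[t, e] * experts[e](tokens[t])     # only one expert runs
    return out, best

rng = np.random.default_rng(0)
experts = [lambda x: x + 1.0, lambda x: x - 1.0]  # toy stand-ins for FFNs
tokens = rng.normal(size=(4, 3))
gate_W = rng.normal(size=(3, 2))
out, assignment = top1_moe(tokens, gate_W, experts)
```

Note each token touches exactly one expert, so per-token compute stays near a single dense FFN regardless of how many experts exist.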
💡
Capacity factor and token drops: Suppose 10,000 token-expert assignments are expected across 10 experts (1,000 per expert). With a capacity factor of 1.25, each expert can handle up to 1,250 tokens this step. If more arrive, some tokens are dropped or rerouted, which can hurt quality. The lesson: set capacity high enough to avoid drops while keeping compute reasonable.
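The arithmetic above is exactly the standard capacity formula. A quick check (the function name is illustrative):

```python
import math

def expert_capacity(tokens_per_step, num_experts, capacity_factor):
    """Capacity per expert = ceil(tokens / experts * capacity_factor)."""
    return math.ceil(tokens_per_step / num_experts * capacity_factor)

cap = expert_capacity(10_000, 10, 1.25)   # 1,250 tokens per expert
# If an expert receives 1,400 assignments this step, the overflow is dropped
# or rerouted:
dropped = max(1_400 - cap, 0)             # 150 tokens
```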
💡
Replacing FFN with MoE in a Transformer layer: In layer 6 of a 12-layer model, swap the dense FFN for an MoE module with 16 experts. Tokens are routed to top-1 expert, experts process them, outputs are combined, and the model continues. Training shows improved accuracy at similar per-token FLOPs. The example demonstrates how MoE integrates into standard architectures.
💡
Choosing number of experts for a task: For a simple sentiment task, 4 experts suffice and train stably. For a multi-domain QA task with legal, medical, and coding text, 32 experts perform better by specializing. Practitioners test several counts and pick the sweet spot. This shows the specialists-per-domain intuition in practice.
💡
Debugging expert imbalance: During training, logs show Expert 0 gets 60% of tokens and others <5%. Increasing load balancing loss weight and slightly raising router temperature spreads traffic more evenly. Accuracy improves once all experts learn. The point: monitoring and adjustments are essential.
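Why raising the router temperature helps is visible in a two-line experiment: dividing logits by a larger temperature flattens the softmax, moving probability mass off the dominant expert.

```python
import numpy as np

def gate_with_temperature(logits, temperature):
    """Softmax over logits / T: higher T gives flatter routing probabilities."""
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.5, 0.5])   # one expert dominating
sharp = gate_with_temperature(logits, temperature=0.5)  # peaky, near-collapse
flat = gate_with_temperature(logits, temperature=4.0)   # spread across experts
# flat[0] < sharp[0]: raising T moves mass off the dominant expert
```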
💡
Practical deployment concern: In production, dynamic routing can increase latency variance due to uneven expert loads. Engineers cap per-expert capacity, pre-allocate buffers, and co-locate experts to reduce communication. After these steps, tail latency improves while average latency remains steady. The example highlights engineering needed for MoE at scale.
05 Conclusion
This lecture traced how the field learned that capability comes from compute used wisely—model size, data size, and training time must be balanced. It showed that dense scaling alone faces limits: data is finite and training costs are high. Sparse Activation offers a way forward: only turn on the parts of the network you need for each input. Conditional Computation dynamically picks a sub-network, while Pruning permanently removes less useful pieces to shrink and speed models.
Mixture of Experts is a powerful Conditional Computation design: a small gate routes each token to one or a few specialized experts, and their outputs are combined. The core equation y(x) = Σ_i G_i(x)·E_i(x) captures the weighted mixture. Training requires care: gates can collapse and experts can be imbalanced, but softmax or sparsemax gating, load balancing losses, and dropout help. In Transformers, replacing a feed-forward block with an MoE layer yields large capacity without increasing per-token compute much, as popularized by Switch Transformer.
To practice, start by swapping one FFN for an MoE module in a small Transformer and monitor expert usage. Tune load balancing loss and capacity factors to avoid token drops and expert starvation. Compare top‑1 versus top‑2 routing and measure both accuracy and latency. Try pruning on a dense baseline to see how much redundancy you can remove without hurting performance.
Next steps include exploring more advanced routing mechanisms, studying distributed expert placement for multi-GPU training, and experimenting with domain-specific experts. Learn how different tokenizations, sequence lengths, and batch sizes affect routing stability. Investigate data scaling strategies that align with your compute budget, inspired by the Chinchilla perspective.
The core message is simple: don’t pay to light the whole building when only a few rooms are needed. With MoE and Sparse Activation, we can hold far more knowledge and skills in a model while keeping per-token compute similar, provided we train and engineer the system carefully. This approach is a practical path to stronger language models when data and budgets are tight.
✓Co-locate experts and optimize dispatch. If you span experts across GPUs, communication can bottleneck. Group experts per device, use fused dispatch/combine kernels, and increase batch size to improve efficiency. Measure tail latency, not just averages.
✓Tune router sharpness. A temperature on softmax (or choice of sparsemax) affects how peaky routing is. Too sharp early causes instability; warm up to sharper routing later. Adjust in tandem with load balancing loss.
✓Compare fairly to dense baselines. Match per-token FLOPs when reporting accuracy gains. Run ablations: with/without load balancing, different top‑k, capacity settings. This isolates the benefit of MoE rather than extra compute.
✓Choose expert count based on task diversity. For multi-domain or skill-rich tasks, more experts can help; for simple tasks, fewer may suffice. Pilot with 4, 8, 16 experts and compare. Don’t assume more is always better.
✓Plan for deployment complexity. Dynamic routing introduces latency variance and memory pressure. Enforce capacity, pre-allocate buffers, and consider routing-aware batching. Continuously monitor expert loads in production.
✓Use pruning to speed baselines. Before adopting MoE, try pruning your dense model to cut inference cost. Movement or magnitude pruning can shrink models significantly with small accuracy loss. It provides a strong baseline for cost-performance.
✓Guard against expert starvation. If some experts get <1% of tokens, reduce num_experts or increase balancing. Consider curriculum or data mixing to expose diverse tokens early. Healthy specialization requires enough training examples per expert.
✓Stabilize with gradualism. Introduce MoE late in training or after a warmup period with higher router temperature and stronger regularization. Gradual changes reduce shocks to optimization. This avoids early collapse.
✓Document and version routing configs. Small changes in capacity, temperature, or loss weights can shift behavior. Keep configs under version control and note metrics per run. Reproducibility is critical for iterative tuning.
Conditional Computation
A method where the model chooses a sub-network to run based on the input. A routing function selects which layers or experts to execute. This increases efficiency, since each input follows only a small path. It can also improve specialization.
Pruning
Removing parts of a model that contribute little to accuracy. This can be individual weights or whole neurons. Pruning makes models smaller and faster. If done carefully, accuracy stays nearly the same.
Weight pruning
A pruning method that zeroes out individual weights deemed unimportant. The layer structure stays the same but becomes sparse. This often requires special libraries to get speedups. It is a fine-grained way to shrink models.
Neuron pruning
Removing entire neurons or channels from a layer. This gives immediate speed benefits because smaller matrices are multiplied. It is a coarse-grained way to simplify models, and it may require adjusting the input dimensions of the following layers.