Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 11: Scaling Laws 2
Key Summary
- Scaling laws relate a model’s log loss (how surprised it is by the next token) to three knobs: number of parameters (N), dataset size (D), and compute budget (C). As you increase N, D, and C, loss usually drops smoothly. But this only holds when you keep many other things steady and consistent.
- Data quality is a first-class factor. If your dataset is full of spam, markup, and non-language junk, doubling the data may not help and can even hurt. Clean, diverse, deduplicated text is essential for predictable scaling.
- Tokenization changes the number of tokens in the same text, which changes D. Using different tokenizers can make scaling curves look wrong because the effective dataset size changes. A simple fix is to standardize on one tokenizer for all runs.
- Architecture matters. Dense Transformers and Mixture-of-Experts (MoE) do not share the same scaling behavior. You can’t directly port scaling laws measured on a dense model to an MoE without re-measuring.
- Evaluation choice changes what you see. Scaling laws tracked with log loss may not match trends on downstream tasks like question answering, summarization, or translation. Always evaluate on a fixed, consistent task set when comparing.
- Emergent abilities are capabilities that appear suddenly as models get bigger, like doing arithmetic or multi-step reasoning. Whether these are truly emergent or just due to better data and engineering is debated. Classic scaling laws don’t predict these jumps well.
- Phase transitions are model behavior shifts similar to water turning into ice or steam. As model size or data scale grows, internal weight distributions can change and the model can act differently. This suggests we may need scaling laws that include phase changes.
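The tokenization point above is easy to demonstrate: the same text yields very different token counts under different schemes, which changes the effective dataset size D. This toy sketch contrasts word-level and character-level splitting (real subword tokenizers like BPE fall in between):

```python
# Toy illustration: the same text produces different token counts under
# different tokenization schemes, changing the effective dataset size D.
text = "Scaling laws relate loss to parameters, data, and compute."

whitespace_tokens = text.split()  # crude word-level tokenization
char_tokens = list(text)          # character-level tokenization

print(len(whitespace_tokens))  # 9 tokens
print(len(char_tokens))        # 58 tokens
```

A run trained on "58 tokens" of character data and one trained on "9 tokens" of word data saw the same text, so mixing tokenizers within one scaling curve makes D meaningless.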
Why This Lecture Matters
People building language and multi-modal models must plan how to spend compute and data wisely. Scaling laws turn vague guesses into measurable trade-offs between model size, dataset tokens, and budget. By understanding what keeps these laws stable—data quality, consistent tokenization, fixed architecture and evaluation—you can run cleaner experiments and make better budget decisions. This knowledge helps ML engineers, researchers, and product teams avoid wasted runs and pick the best model size for their needs. At the same time, modern systems show behaviors that classic laws don’t fully capture: emergent abilities, phase transitions, architecture shifts like MoE, and parameter-efficient fine-tuning. Recognizing these factors prevents overconfidence in smooth extrapolations and encourages regime-aware planning. In real projects, this means you can detect when a task needs a bigger jump in scale, when PEFT can cheaply unlock domain performance, or when multi-modal alignment is the true bottleneck. Career-wise, mastering scaling principles makes you effective at large-model planning, benchmarking, and cost control—skills highly valued across AI teams. You’ll be able to design fair comparisons, interpret why a model under- or over-performs, and choose the right adaptation method. As the industry moves toward ever-larger, multi-modal systems, using nuanced scaling practices will keep your models competitive while staying within resource limits.
Lecture Summary
Overview
This lecture focuses on practical and modern views of scaling laws for language models. Scaling laws are the observed patterns that show how a model’s loss improves as you increase three main levers: the number of parameters (N), the size of the dataset (D), and the compute budget (C). While past work gave clean relationships—like how to choose an optimal data-to-parameters balance under a fixed compute budget—real training projects uncover many details that can disrupt these patterns. The lecture explains those practical considerations and then digs into newer developments that stretch or break the classic laws.
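One example of such a clean relationship is the Chinchilla-style parametric loss, which writes loss as a function of N and D. A minimal sketch follows; the constants are the fits published by Hoffmann et al. (2022) and should be treated as illustrative, since the lecture's point is that they shift with data, tokenizer, and architecture:

```python
# Chinchilla-style parametric loss: L(N, D) = E + A/N**alpha + B/D**beta.
# Constants are the published Hoffmann et al. (2022) fits; in practice
# they must be re-fit for your own data and setup.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted log loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling the data at fixed model size lowers predicted loss.
l1 = predicted_loss(1e9, 20e9)
l2 = predicted_loss(1e9, 40e9)
assert l2 < l1
```

The additive form makes the trade-off explicit: past a point, shrinking the A/N term (bigger model) buys less than shrinking the B/D term (more tokens), which is exactly the balance the compute-optimal analysis solves for.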
The session begins with a quick recap: log loss (negative log probability of the next token) typically drops when you scale N, D, and C. The common experimental setup is to fix compute and sweep model sizes and dataset sizes to find the best trade-off. But to do this well, you must control confounders—data quality, tokenization, architecture, and evaluation choices. Each of these can shift curves or flatten gains, making it seem like scaling laws fail when the setup is actually inconsistent.
From there, the lecture moves to new frontiers that don’t fit older laws neatly. Emergent abilities are capabilities that appear at larger scales, such as basic arithmetic or chain-of-thought reasoning, which smaller models lack. Whether these are sudden “phase changes” or simply smoother improvements that look sharp due to measurement choices is debated. The main point: classical scaling laws don’t predict these sudden jumps, highlighting their limitations.
Next, the idea of phase transitions is introduced using a physics analogy: just like water turning into ice at 0°C, models may undergo behavior changes when certain scale thresholds are crossed. Evidence includes shifts in weight distributions and qualitative changes in model outputs. These transitions suggest that new, richer scaling laws might be needed—laws that allow for regime changes rather than assuming one smooth curve forever.
The lecture then covers Parameter-Efficient Fine-Tuning (PEFT), which adapts large pre-trained models by updating only a tiny fraction of parameters. Techniques like LoRA and prefix tuning show that high-quality adaptation is possible with far less compute and memory than full fine-tuning. Because PEFT updates so few parameters, the original scaling laws for full-model training may not apply, so practitioners should expect different trends and hyperparameter optima.
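A quick back-of-the-envelope calculation shows why PEFT sits in a different regime. For LoRA, the update to a d_out × d_in weight matrix is factored into two low-rank matrices, and only those are trained. The sketch below just counts parameters; the matrix size and rank are arbitrary illustration values:

```python
# Rough trainable-parameter comparison for LoRA on one weight matrix.
# LoRA replaces the update to a d_out x d_in matrix W with low-rank
# factors B (d_out x r) and A (r x d_in); only A and B are trained.
def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

# Example: a 4096 x 4096 projection with rank r = 8.
full = full_params(4096, 4096)     # 16,777,216 trainable weights
lora = lora_params(4096, 4096, 8)  # 65,536 trainable weights
print(f"LoRA trains {lora / full:.2%} of the full matrix")  # 0.39%
```

With well under 1% of the parameters receiving gradients, it is unsurprising that scaling trends and hyperparameter optima measured for full-model training do not transfer directly.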
Finally, the lecture turns to multi-modality—training models that work with text plus images, audio, or video. These models can scale well, but they follow different rules because tokenization, information density, and alignment challenges differ across modalities. Systems like Flamingo demonstrate impressive results, yet their scaling curves and best practices are not identical to pure-text models. The lecture closes by summarizing practical best practices (standardize tokenizer, data quality, evaluation; be mindful of architecture choice) and by emphasizing that modern research is pushing beyond simple formulas. Learners come away with both the classic picture and a set of caveats and extensions that match how state-of-the-art systems are actually built today.
This material suits intermediate learners who know what a language model and a Transformer are, and who are familiar with loss, tokens, and training budgets. After studying this, you will be able to design controlled scaling experiments, interpret scaling curves, decide between dense and MoE architectures, plan PEFT fine-tuning efficiently, and anticipate differences when moving to multi-modal setups. The lecture is structured as: (1) recap of log loss and classic scaling setup, (2) practical variables to control (data quality, tokenization, architecture, evaluation), and (3) recent developments (emergent abilities, phase transitions, PEFT, multi-modality), ending with key takeaways.
Key Takeaways
- ✓Standardize everything non-essential to the experiment. Use a single tokenizer, fixed data filtering, one architecture family, and a constant evaluation suite. Changing these midstream makes your curves noisy or misleading. Control variables tightly so N and D are the only major differences.
- ✓Invest in data quality before scaling. Deduplicate aggressively, remove spam and boilerplate, and language-detect and normalize content. High-quality tokens give more learning per token and make scaling laws hold better. It’s cheaper to clean data than to buy more compute for junk.
- ✓Plan with a fixed compute budget. Build an N–D grid that fits your GPU-hours or FLOPs. Often, a moderate N with more D beats a huge N with too little data. Validate this with careful sweeps rather than guessing.
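A fixed-budget N–D grid can be laid out with the common approximation C ≈ 6·N·D FLOPs for one training pass. The budget and model sizes below are arbitrary illustration points, not recommendations:

```python
# Sketch of an N-D grid under a fixed compute budget, using the common
# approximation C ~= 6 * N * D FLOPs for one training pass.
# Budget and model sizes are arbitrary illustration values.
BUDGET_FLOPS = 1e21

for n_params in [1e8, 3e8, 1e9, 3e9, 1e10]:
    n_tokens = BUDGET_FLOPS / (6 * n_params)
    print(f"N = {n_params:.0e} params -> D = {n_tokens:.2e} tokens "
          f"({n_tokens / n_params:.0f} tokens per parameter)")
```

Reading down the grid makes the trade-off concrete: every step up in model size cuts the affordable token count proportionally, so the sweep is really asking which tokens-per-parameter ratio minimizes loss at this budget.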
- ✓Keep tokenizer constant to keep D consistent. Switching tokenizers changes token counts and breaks comparisons. If a new tokenizer is necessary, rebuild baselines from scratch. Never mix tokenizers within a scaling curve.
- ✓Measure both log loss and downstream tasks. Loss is smooth and informative, but real tasks can plateau or jump. Use a fixed suite to catch capability changes and avoid cherry-picking. Decide wins based on a balanced view.
- ✓Expect architecture-specific scaling. Dense and MoE models won’t follow the same curves. Re-measure scaling exponents when you switch families. Tune routing and expert configurations for MoE to unlock gains.
- ✓Watch for emergent abilities and phase transitions. If a capability suddenly appears, don’t force a smooth fit over it. Consider piecewise or regime-aware models for better predictions. Document internal signals like weight distribution shifts.
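The "piecewise fit" idea in the last takeaway can be sketched on synthetic data. The accuracy values and the threshold below are invented for illustration; the point is only that fitting separate trends on either side of a suspected jump recovers the regime change that a single smooth fit would smear out:

```python
import numpy as np

# Toy illustration with synthetic data: a capability metric that jumps
# at a scale threshold. A single smooth fit misses the jump; fitting
# separate lines on each side of a candidate threshold does not.
log_n = np.array([7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0])  # log10(params)
acc   = np.array([0.02, 0.02, 0.03, 0.03, 0.55, 0.70, 0.80])

below, above = log_n < 8.7, log_n >= 8.7  # hypothetical threshold
fit_lo = np.polyfit(log_n[below], acc[below], 1)  # [slope, intercept]
fit_hi = np.polyfit(log_n[above], acc[above], 1)
print("slope below threshold:", fit_lo[0])
print("slope above threshold:", fit_hi[0])  # much steeper after the jump
```

In practice the threshold itself is unknown, so one would sweep candidate break points and compare fit quality, which is exactly the "regime-aware" modeling the takeaway suggests.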
Glossary
Log loss
A number that tells how surprised a model is by the next token. Lower means the model guessed well, higher means it guessed poorly. It is the negative log of the predicted probability for the correct token. It’s a core way to measure language modeling quality. When log loss improves, perplexity also improves.
Perplexity
A way to express how confused a language model is. Lower perplexity means the model predicts text better. It’s mathematically related to log loss: lower log loss gives lower perplexity. People often track perplexity to compare models. It is easier to read than raw log loss for many.
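The relationship between the two metrics is a one-line conversion: perplexity is the exponential of the average log loss (in nats, i.e., natural log). A minimal sketch:

```python
import math

# Perplexity is the exponential of the average log loss in nats.
def perplexity(avg_log_loss: float) -> float:
    return math.exp(avg_log_loss)

# A model with average log loss of 2.3 nats per token is roughly as
# confused as a uniform choice over ~10 tokens.
print(perplexity(2.3))  # ~9.97
assert perplexity(2.0) < perplexity(2.3)  # lower loss, lower perplexity
```

This is why the two always move together: exp is monotonic, so any improvement in log loss is an improvement in perplexity.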
Parameters (N)
The numbers inside a model that it learns during training. More parameters usually means more capacity to learn patterns. But too many without enough data can cause problems. Scaling laws track how performance changes as N grows.
Dataset size (D)
How many tokens (pieces of text) the model sees during training. More tokens generally help the model learn better. But only if those tokens are high quality. Tokenization changes how many tokens the same text becomes.
