•Scaling laws are empirical rules that show how a model’s loss (error) drops as you grow model size, data, or compute. They take a power-law form: Loss = A × N^(-α), where N can be parameters, data tokens, or compute, and α is the scaling exponent. This lets us predict how bigger models might perform without training them.
•The 2020 Kaplan et al. study trained many GPT-2–style models and found neat power-law fits for three knobs: model size (parameters), data size (tokens), and compute (FLOPs). The exponents they reported were roughly α_params ≈ 0.076, α_data ≈ 0.074, and α_compute ≈ 0.069. These similar exponents suggest model size and data are about equally useful for improving loss.
•A key insight is that loss vs. scale looks like a straight line on a log-log plot, which is the signature of a power law. You can fit a line to log(loss) vs. log(N) and the slope gives you the exponent (up to a sign). This simple procedure turns messy training results into a clear predictive model.
•Hoffmann et al. (2022) studied the best way to spend a fixed compute budget, called compute-optimal scaling. They used the relation C ∝ N × D (compute ~ parameters × tokens) and found that you should grow model size and data together. With more compute, both the best N and D increase roughly in proportion to the square root of C in this simplified view.
•Compute-optimal scaling explains why just making a giant model without enough data wastes compute (overfitting), and using too much data for a tiny model underuses compute (underfitting capacity). There’s a sweet spot for each compute budget. Picking N and D well can save time and money while improving performance.
•Why do scaling laws appear? Learning curves often follow a power law with steps T: loss ∝ T^(−β). Combine this with compute C ∝ N × T, and you get loss ∝ C^(−β/(1+β)), another power law. Intuitively, this reflects diminishing returns: each extra batch helps a bit less than the last.
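The fit procedure described in these bullets can be sketched in a few lines of Python (NumPy assumed available). The losses below are synthetic values that follow the power law exactly, so the regression recovers the known exponent; real measurements would be noisier:

```python
import numpy as np

# Synthetic losses following Loss = A * N^(-alpha); A and alpha are made-up
A, alpha = 3.0, 0.076
N = np.array([5e7, 1e8, 2e8, 5e8, 1e9])  # parameter counts
loss = A * N ** (-alpha)

# Fit a line to log(loss) vs. log(N); the slope is -alpha, intercept is log(A)
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_hat = -slope
A_hat = np.exp(intercept)
print(alpha_hat, A_hat)  # recovers ~0.076 and ~3.0
```

With clean data the recovery is essentially exact; the same call works unchanged for data tokens or compute on the x-axis.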
Why This Lecture Matters
Scaling laws turn large-model development from an expensive guessing game into a predictable engineering process. For ML engineers, researchers, and product teams, they provide a compact formula to forecast the returns from increasing parameters, data, or compute. This helps set realistic budgets, timelines, and goals, especially when training runs can cost millions of dollars. With compute-optimal scaling, you can choose the best mix of model size and data for a fixed compute budget, avoiding wasteful runs that either overfit or underutilize capacity.
In real projects, leaders need to justify investments and plan infrastructure. Scaling laws quantify expected gains (e.g., around 5% loss reduction per doubling at today’s exponents), allowing teams to prioritize between collecting more clean data or engineering larger models. They also guide validation protocols—keeping tokenization and datasets consistent—so improvements are attributable to scale rather than setup changes. For applied scientists, scaling laws suggest when to stop: if marginal gains fall below a threshold, it may be time to improve data quality or algorithms instead of just scaling.
Career-wise, understanding scaling laws is now core literacy for professionals building or deploying LLMs. It shows you can reason about trade-offs, avoid costly mistakes, and communicate with stakeholders in concrete, quantitative terms. In an industry where models and datasets grow rapidly, scaling literacy helps you stay efficient and competitive. It’s a lens to make smarter decisions in research exploration, infrastructure design, and product planning.
Lecture Summary
01 Overview
This lecture explains scaling laws for large language models (LLMs): simple empirical rules that show how a model’s error (loss) shrinks as you increase model size (number of parameters), data size (number of training tokens), or compute (number of floating-point operations, FLOPs). Scaling laws are valuable because they let you predict the performance of models you haven’t trained yet. In 2020, before these laws were widely known, training a much bigger model was a costly guess. Now it is more scientific: by fitting a power law to smaller models, you can project to larger ones with surprising accuracy.
The lecture focuses on three pillars. First, the general form: loss = A × N^(−α), where N stands for the scale you’re changing (parameters, data, or compute), A is a constant, and α is the scaling exponent. Second, major empirical results: Kaplan et al. (2020) showed that loss follows power laws for parameters, data, and compute with similar exponents (around 0.07–0.08), and Hoffmann et al. (2022) introduced compute-optimal scaling to choose the best mix of model size and data for a fixed compute budget using C ∝ N × D. Third, basic theory: learning curves often follow a power law in training steps and, when combined with compute relations, naturally produce scaling laws. The lecture also outlines important limitations: emergence at large scales, data quality, and architecture dependence can bend or break simple power-law predictions.
This content is pitched at learners who know the basics of neural networks and Transformers but may be new to systematic scaling. If you understand what parameters, tokens, loss, and FLOPs are, you can follow the arguments. No deep math beyond logarithms and linear regression is required; the main technique is fitting a straight line on a log-log plot. If you have never seen cross-entropy loss or perplexity, this lecture gently explains their roles in measuring performance.
After this lecture, you will be able to: describe the standard scaling-law form and what the constants mean; read and interpret loss-vs-scale plots; fit your own scaling exponents by training multiple models and running a log-log linear regression; estimate how much you gain by doubling model size or data; and choose model size and data size intelligently for a fixed compute budget using the compute-optimal idea. You will also be able to list common caveats so you avoid overconfident extrapolations when data quality shifts or architectures change.
The lecture is structured in four parts. It starts with a definition and intuition for scaling laws, including why log-log straight lines indicate power laws. Next, it reviews landmark papers: Kaplan et al. (2020) for basic parameter/data/compute scaling and Hoffmann et al. (2022) for compute-optimal trade-offs, illustrating with conceptual plots and the relation C ∝ N × D. Then it provides a simple theoretical justification using learning curves and the law of diminishing returns to derive loss ∝ C^(−β/(1+β)). Finally, it covers other related scaling laws (e.g., transfer scaling) and important limitations: breakdown at extreme scales, emergent behaviors, data quality issues, and architecture dependence. The session ends with practical advice on how to estimate the scaling exponent α in your own projects by training several points, logging losses, and fitting a line in log space.
Key Takeaways
✓Fit in log space, plan in real space: Always log-transform scale and loss, then fit a line; use the resulting α to plan real-world runs. This keeps your math simple and your predictions easy to apply. Visualize the fit to catch outliers early. A clean fit is your best planning tool.
✓Balance model size and data: Treat parameters and data as co-equal levers since their exponents are similar. If memory is tight but storage is cheap, lean into more data; if data is scarce, scale parameters. Avoid extreme imbalances that cause overfitting or underutilization. Balance yields better returns at fixed compute.
✓Use compute as the budget frame: Think in terms of C ∝ N × D to plan feasible runs. For a fixed C, increasing one dimension forces you to decrease the other. This prevents overpromising and underdelivering on timelines and costs. It aligns engineering constraints with achievable gains.
✓Train to convergence parity: Ensure larger models are trained long enough; undertraining makes them look worse and distorts your fit. Watch validation curves to confirm plateaus. Equal effort across scales gives a fair comparison. It protects your exponent estimate.
✓Keep validation and tokenization constant: Use the same validation set and tokenizer across runs. Changing either can shift loss values unrelated to true improvements. This keeps your scaling curve honest. Consistency beats noisy metrics.
✓Estimate gains per doubling: Translate α into a simple rule of thumb, like ~5% loss drop per doubling. Use it to set stakeholder expectations about returns on added compute. This prevents unrealistic hopes about instant large gains. It supports steady, compounding progress.
Glossary
Scaling law
A rule that shows how a model’s error changes when you make the model bigger, give it more data, or use more compute. It usually looks like a power law, which is a simple math formula with an exponent. This makes it easy to draw as a straight line on a log-log plot. It helps predict results for bigger models without training them all. It turns many experiments into one clear pattern.
Power law
A math relationship where one quantity equals a constant times another quantity raised to a power. In scaling laws, loss changes as a power of model size, data, or compute. On a log-log plot, a power law becomes a straight line. The line’s slope equals the negative exponent. This makes finding the exponent easy with linear regression.
Exponent (α)
The exponent tells how fast loss falls when you scale up. A larger α means loss drops faster for each doubling. A smaller α means gains are smaller per doubling but still steady. In language models, α is often around 0.07. Even small exponents matter when you scale many times.
Constant (A)
The constant A sets the overall level of loss in the power law. It depends on the task, data quality, and training setup. Two teams might have the same α but different A because their data or recipe differs. A shifts the line up or down on a log-log plot. It captures difficulty and setup effects.
•Plots from scaling-law papers show families of curves: loss vs. parameters for several fixed data sizes, and loss vs. data for several fixed model sizes. In both views, adding more of one resource increases the payoff from the other: bigger models benefit from more data, and more data pays off more with bigger models.
•Scaling laws are powerful but have limits. At very large scales, new behaviors can emerge that weren’t visible earlier. Data quality, architecture changes, and training tricks can also bend or break simple power-law predictions.
•Emergent behavior means a model suddenly gains skills (like coding or reasoning) that were not explicitly programmed. Simple power-law fits may not forecast exactly when these jumps appear. So, extrapolations should be done with care, especially far beyond observed scales.
•Data quality matters: noisy or biased datasets can spoil the expected gains. If training text contains lots of profanity or incorrect facts, the model may learn them. Scaling laws usually assume reasonably clean, representative data.
•Architecture matters too: scaling exponents are measured under fairly fixed architecture families (e.g., GPT-2–like Transformers). Changing architecture or training recipes (regularization, tokenization, optimization) can change constants and even exponents. Treat scaling laws as guides for similar setups, not universal laws of nature.
•Practically, to estimate α, train several models at different sizes or data amounts, record loss, log-transform both axes, and run linear regression. The line’s slope gives the exponent and the intercept gives A. With that fit, you can forecast what happens if you double parameters or data, or quadruple compute.
02 Key Concepts
01
What scaling laws are: Scaling laws are empirical rules showing how model error (loss) falls as you increase model size, data, or compute. They usually take a power-law form: loss = A × N^(−α), where A is a constant and α is the scaling exponent. A power law means a straight line on a log-log plot, making it easy to fit and extrapolate. These laws let you predict the loss of larger models without training them, saving time and money. In practice, they guide decisions about how big to make models and how much data to gather.
02
Power laws and log-log plots: A power law relates two quantities with an exponent. If loss = A × N^(−α), then log(loss) = log(A) − α log(N), a straight line in log space. The slope of that line is −α, and the intercept is log(A). Seeing a straight line when plotting log(loss) vs. log(N) signals a power-law fit. This simple geometry turns noisy training results into a simple, predictive formula.
03
Loss as the performance yardstick: Loss (often cross-entropy for language models) measures how wrong a model’s predictions are. Lower loss means better predictions and often lower perplexity (perplexity is exp(loss) when loss uses natural logs). Because loss is smooth and consistent across datasets, it is commonly used in scaling-law studies. Tracking loss across different model sizes or data amounts reveals how efficiency changes with scale. It’s the main quantity that scaling laws try to predict.
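As a small illustration of the loss-to-perplexity relationship mentioned above (helper names here are my own), the conversion is a one-liner in each direction:

```python
import math

def perplexity(loss_nats: float) -> float:
    """Convert cross-entropy loss in nats to perplexity: exp(loss)."""
    return math.exp(loss_nats)

def perplexity_from_bits(loss_bits: float) -> float:
    """If loss is reported in bits per token, use base 2 instead."""
    return 2.0 ** loss_bits

# A loss of ~3.0 nats corresponds to perplexity ~20
print(round(perplexity(3.0), 1))
```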
04
Kaplan et al. (2020) results: Kaplan and colleagues trained GPT-2–style models and found clear power laws for three knobs: parameters (N), data tokens (D), and compute (C). They reported exponents of roughly α_params ≈ 0.076, α_data ≈ 0.074, and α_compute ≈ 0.069. These exponents were similar for parameters and data, suggesting both are equally helpful for reducing loss. The compute relation offered a way to think about gains as you spend more FLOPs. Their plots showed straight lines in log space, validating the power-law view.
05
Equal importance of model size and data: Because α for parameters and α for data were close, increasing either one brought similar marginal benefits. This means you can often improve by making the model bigger or by feeding it more tokens. In practice, the choice may depend on available data, engineering constraints, and memory limits. Balanced growth tends to unlock the best gains. This parity of exponents is one reason compute-optimal rules advise increasing both together.
06
Compute as a key driver: Compute connects parameters and data through the relation C ∝ N × D. Spending more compute lets you move to larger models and larger datasets together, typically improving loss along an optimal frontier. From a planning perspective, compute is the budget that unlocks how far you can scale both N and D. This framing helps teams allocate resources for the biggest return. It motivates compute-optimal strategies rather than scaling one dimension alone.
07
Hoffmann et al. (2022) and compute-optimal scaling: Hoffmann and colleagues studied how to choose N and D for a fixed compute budget. With C ∝ N × D, they found that optimal model size and data size grow together as you raise C. In simplified form, both N and D scale roughly with the square root of C. This prevents overfitting (too big N, too little D) and underutilization (too small N, too much D). The idea guides you to the best mix for the compute you can afford.
08
Learning curves and diminishing returns: Learning curves plot loss versus training steps and often look like a power law: loss ∝ T^(−β). The exponent β captures how fast loss drops as training progresses. Because each new batch teaches less than the last, gains slow down—a phenomenon called diminishing returns. When combined with compute C ∝ N × T, you get loss ∝ C^(−β/(1+β)), again a power law. This provides a simple theoretical reason why scaling laws emerge.
09
Interpreting exponents: The exponent α tells you how sensitive loss is to scaling. If α is 0.076 for parameters, doubling parameters multiplies loss by 2^(−0.076), a modest but steady gain. Small exponents mean improvements are real but gradual, requiring large scale-ups for big changes. This explains why very large models and datasets are needed for substantial loss reductions. Understanding α helps budget expectations and timelines.
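The arithmetic behind “modest but steady” is easy to check directly; the exponents below are the Kaplan et al. values quoted above:

```python
# Per-doubling loss multiplier 2^(-alpha) for each reported exponent
for alpha in (0.076, 0.074, 0.069):
    factor = 2 ** (-alpha)
    print(f"alpha={alpha}: x{factor:.3f} per doubling ({(1 - factor) * 100:.1f}% drop)")

# Gains compound: ten doublings (1024x scale) at alpha = 0.076
ten_doublings = 2 ** (-10 * 0.076)
print(ten_doublings)  # ~0.59, i.e., a ~41% total loss reduction
```

Each doubling alone shaves off only about 5%, but a thousand-fold scale-up compounds into a large improvement, which is the core budgeting intuition.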
10
Reading scaling-law plots: Typical plots show loss on the y-axis and parameters or data on the x-axis, with multiple lines for different fixed settings of the other variable. Lines slope downward in log-log space, indicating power-law behavior. The spacing between lines shows how extra data helps a fixed model size and vice versa. The best line at a given compute marks the efficient frontier. These visuals make trade-offs tangible.
11
Extrapolating to larger models: Once you fit A and α, you can forecast what loss you might achieve at, say, 10× parameters or 4× data. This lets you plan whether the expected gain is worth the cost. Extrapolation is most reliable when not too far beyond your training range. Still, it provides a baseline expectation before spending on huge runs. Teams use this to set budgets and milestones.
12
Fitting α in practice: To find α, train multiple models at different sizes or data amounts and record the final losses. Take logarithms of both the x-axis (scale) and y-axis (loss), then run linear regression. The slope in the log-log plot gives −α and the intercept gives log(A). More points and careful measurement reduce noise and improve confidence. This straightforward process turns experiments into an actionable formula.
13
Limits: emergence and breakdown: At very large scales, models can display emergent abilities like coding help or reasoning that were not visible before. These jumps can bend the simple power-law curve. As a result, extrapolations far beyond the data may mispredict. Treat scaling laws as strong trends, not absolute guarantees. Monitor outcomes as you scale to detect departures early.
14
Data quality and bias: Scaling laws often assume data is high quality and representative. If data is noisy, biased, or contains undesired content, the model can learn those patterns. In such cases, the expected gain from scaling may be lower or qualitatively different. Cleaning data or curating better sources can shift the curve. Quality often matters as much as quantity.
15
Architecture dependence: The measured exponents come from specific architectures and training setups. Changing model architecture, tokenization, optimization, or regularization can change exponents and constants. Therefore, scaling laws are most trustworthy when applied to similar model families. For different setups, you should refit the exponents. This ensures predictions stay relevant to your actual system.
16
Transfer scaling example: Research on transfer scaling shows that pretraining on more source data can improve a target task as a power law. For example, training on many animal images helps when later fine-tuning on cat images. Performance on the target task improves with the size of the source dataset, often following a predictable curve. This extends scaling ideas beyond pure language modeling. It highlights the broad reach of power laws in machine learning.
17
Overfitting vs. underfitting at scale: Too large a model with too little data risks overfitting, memorizing training details instead of generalizing. Too small a model with very large data may underfit, unable to capture all patterns. Compute-optimal rules help avoid both by balancing N and D. The result is a better use of every FLOP you spend. This balance is essential for efficient large-scale training.
18
Compute budgeting with C ∝ N × D: The relation C ∝ N × D gives a simple mental model for planning training runs. If you double N at fixed D, compute doubles, and vice versa. For a fixed budget, increasing one forces you to decrease the other. Finding the sweet spot yields the lowest loss within your compute limit. This is a practical lever for teams with constrained resources.
19
Perplexity and bits-per-token: In language modeling, loss is often measured in nats or bits-per-token; perplexity is a related metric. Lower loss means lower perplexity, which usually corresponds to better text prediction. Scaling laws typically focus on loss because it’s directly optimized. Converting between loss and perplexity helps interpret improvements. It also bridges academic reporting and practical benchmarks.
20
Practical workflow for scaling studies: Start by choosing a family of models and a clean dataset. Train multiple points across parameters or data sizes, keeping other settings fixed. Log the final loss for each and fit a power law in log space. Use the fit to make forecasts and choose future training runs. Refit if you change architectures or data quality.
03 Technical Details
Overall Architecture/Structure of the Ideas
The power-law form: The central relationship is loss = A × N^(−α). Here, N stands for what you are scaling: parameters (model size), data tokens (dataset size), or compute (FLOPs). A is a positive constant capturing how hard the problem and setup are, and α is the scaling exponent describing how quickly loss falls as N grows. Taking logs gives log(loss) = log(A) − α log(N), a straight line in log space.
Three knobs and three exponents: Kaplan et al. measured exponents for parameters (α_params ≈ 0.076), data (α_data ≈ 0.074), and compute (α_compute ≈ 0.069). These numbers are small, meaning doubling scale gives steady but modest improvements: loss is multiplied by 2^(−α), which is close to but less than 1. For example, 2^(−0.076) ≈ 0.95, a ~5% reduction in loss per doubling of parameters. Across many doublings, these small gains compound into large improvements, explaining the massive resource investment in modern LLMs.
Compute as the unifying budget: The relation C ∝ N × D links parameters (N) and data size (D) to compute (C). Doubling parameters at fixed data doubles compute; doubling data at fixed parameters also doubles compute. For a fixed compute budget, you can’t increase both N and D freely—if one rises, the other must fall to keep C constant. This trade-off is the heart of compute-optimal scaling.
Compute-optimal choice of N and D: Hoffmann et al. studied how to choose N and D to get the lowest loss for a given C. In simplified terms, they found that as you increase C, you should increase both N and D together, each roughly scaling like the square root of the compute (N ∝ √C and D ∝ √C, in this simplified explanation). Intuitively, this balances model capacity and data coverage so the model neither memorizes too-small data nor leaves capacity unused. The result is a curve of best-achievable loss as compute grows.
Theoretical sketch via learning curves: Learning curves often show loss ∝ T^(−β), where T is training steps and β > 0. If you assume compute is proportional to N × T, you can rearrange to express T in terms of C and N, and then substitute into the loss formula. Under reasonable simplifications, this yields loss ∝ C^(−β/(1+β)), a power law in compute. This connects the micro (training steps) to the macro (compute budget) and explains why simple, robust scaling trends arise.
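The substitution can be written out explicitly. One simplification that yields the quoted exponent is to assume the compute-optimal model size grows as N ∝ T^β (equivalently, inversely with the loss); under that assumption:

```latex
\text{loss} \propto T^{-\beta}, \qquad C \propto N \cdot T, \qquad N \propto T^{\beta}
\;\Rightarrow\; C \propto T^{\,1+\beta}
\;\Rightarrow\; T \propto C^{\,1/(1+\beta)}
\;\Rightarrow\; \text{loss} \propto C^{-\beta/(1+\beta)}
```

Other allocation assumptions give different compound exponents, but the qualitative conclusion is the same: a power law in steps plus a compute constraint produces a power law in compute.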
Data Flow and Measurement
Inputs: The main inputs are model size (parameters), dataset size (tokens), and compute budget (FLOPs). You also need a consistent training recipe (optimizer, learning rate schedule, tokenization) so results are comparable.
Process: Train several models, each at a different N (or D), keeping other variables as controlled as possible. For data scaling, you might fix N and vary D. For parameter scaling, fix D and vary N. If fitting compute scaling, vary C through combinations of N and D or length of training.
Outputs: Measure the final training or validation loss (often cross-entropy) for each run. Record pairs (N, loss), (D, loss), or (C, loss). These pairs become points on a log-log plot for fitting the power law.
Even without providing executable code, the steps are clear enough to implement in any ML stack (PyTorch, JAX, TensorFlow):
Preparing data and models:
Choose an architecture family (e.g., GPT-2–style Transformer) and keep it fixed except for size. Parameterize model size by depth (layers), width (hidden size), and attention heads so you can create versions at, say, 50M, 100M, 200M, 500M, and 1B parameters.
Prepare a clean dataset and a consistent tokenization. Ensure that when you vary D (data tokens), you do so in clean increments (e.g., 5B, 10B, 20B tokens) drawn from the same distribution.
Training runs:
For parameter-scaling experiments, pick a fixed D and train each model to convergence under a similar training schedule (same optimizer type, similar learning-rate schedule adapted to scale when needed). Record the final validation loss.
For data-scaling experiments, fix N and vary D, again recording losses. When varying data, either train longer or sample more fresh tokens; avoid reusing tokens excessively when claiming larger D.
For compute-scaling experiments, manipulate N, D, or training steps to produce distinct compute budgets C. Estimating C is often done by counting forward/backward FLOPs per token per parameter and multiplying by tokens processed; a simplified assumption is C ∝ N × D.
Fitting the power law:
Take logs: x = log(N) (or log(D), or log(C)), y = log(loss).
Fit a linear regression y = b0 + b1 x. The slope b1 should be approximately −α, so α ≈ −b1, and A ≈ exp(b0).
Check goodness-of-fit (e.g., R^2) and residuals in log space. If residuals show curvature, consider whether you are mixing regimes or whether the architecture/training recipe changed between points.
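The fitting and diagnostic steps above can be sketched as follows. The (scale, loss) values are hypothetical measurements from five runs, and the regression and R² are done with plain NumPy rather than scikit-learn:

```python
import numpy as np

# Hypothetical (scale, loss) measurements from five training runs
scale = np.array([5e7, 1e8, 2e8, 5e8, 1e9])
loss = np.array([3.95, 3.76, 3.57, 3.35, 3.18])

x, y = np.log(scale), np.log(loss)
b1, b0 = np.polyfit(x, y, 1)      # y = b0 + b1 * x in log space
alpha, A = -b1, np.exp(b0)

# Goodness of fit (R^2) and residuals, both in log space
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot
residuals = y - y_hat             # inspect for curvature (mixed regimes)

print(f"alpha={alpha:.3f}, A={A:.2f}, R^2={r2:.4f}")
```

Systematic curvature in `residuals` is the warning sign mentioned above that you are mixing regimes or that the recipe changed between points.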
Using the fit to forecast:
To predict loss at a larger N*, compute log(loss*) = log(A) − α log(N*) and exponentiate.
To estimate how much gain you get from doubling N, compute the factor 2^(−α). For α ≈ 0.076, the gain is about 5% loss reduction per doubling of parameters.
Compute-optimal planning:
With C fixed, choose N and D to lie on or near the compute-optimal frontier. In the simplified square-root rule, N ∝ √C and D ∝ √C.
Practically, you pick a candidate N given your memory limits, then set D so that N × D ≈ C. If you have flexibility, sweep nearby N and D to verify which gives the lowest loss.
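A minimal sketch of this budgeting logic, under the simplified C ∝ N × D assumption (function names are illustrative):

```python
import math

def balanced_split(C: float) -> tuple[float, float]:
    """Simplified square-root rule: N and D each ~ sqrt(C) when C ∝ N * D."""
    root = math.sqrt(C)
    return root, root

def tokens_for_budget(C: float, N: float) -> float:
    """Given a memory-limited model size N, spend the rest of the budget on data."""
    return C / N

N, D = balanced_split(1e20)
print(N, D)                          # 1e10 parameters, 1e10 tokens
print(tokens_for_budget(1e20, 1e9))  # a 1B-param model gets 1e11 tokens
```

In practice you would sweep a few (N, D) pairs near these values, as the text suggests, since the true optimum depends on constants the simplified rule ignores.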
Tools/Libraries Used (conceptual)
Deep learning frameworks: PyTorch, JAX, or TensorFlow to define and train Transformer models of varying sizes.
Tokenizers: Byte-Pair Encoding (BPE) or unigram tokenizers to produce consistent tokens across runs.
Logging and plotting: NumPy/Pandas for data handling, Matplotlib/Seaborn for plotting log-log figures, and scikit-learn or built-in routines for linear regression in log space.
Compute estimators: Simple scripts to estimate C from N and D. While detailed FLOP counting depends on architecture, the proportionality C ∝ N × D is sufficient for fitting and planning.
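A sketch of such a compute estimator; the constant 6 (roughly 2 forward plus 4 backward FLOPs per parameter per token) is a common rule of thumb for dense Transformers and should be treated as approximate:

```python
def estimate_flops(n_params: float, n_tokens: float, k: float = 6.0) -> float:
    """Rough training-compute estimate: C ≈ k * N * D.

    k ≈ 6 is a widely used approximation for dense Transformers; the exact
    constant depends on architecture, so treat this as a planning figure.
    """
    return k * n_params * n_tokens

# A 1B-parameter model trained on 20B tokens: ~1.2e20 FLOPs
print(estimate_flops(1e9, 2e10))
```

Since scaling-law fits only need C up to a constant factor, any fixed k gives the same exponent.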
Step 1: Choose scale points
Decide on 5–7 scale points (e.g., N ∈ {50M, 100M, 200M, 500M, 1B, 2B}). More points across a wide range improve fit quality.
Step 2: Prepare data slices
If scaling data, curate subsets with increasing token counts from the same source. Keep validation data constant and clean across all runs to compare losses fairly.
If scaling parameters, ensure the same training data is used across runs.
Step 3: Train and log
Train each run to comparable convergence criteria (e.g., validation loss plateau). Record final validation loss and any stability notes (e.g., if a run diverged).
Save metadata: run ID, N/D/C values, learning rate, batch size, steps, wall-clock time, and compute estimates.
Step 4: Fit the power law
Build a table with columns [scale_value, loss]. Compute [log_scale_value, log_loss].
Run linear regression: log_loss = b0 + b1 × log_scale_value. Extract α = −b1 and A = exp(b0).
Plot the data and the fitted line on log-log axes, and inspect residuals for patterns.
Step 5: Forecast and plan
Use the fit to estimate loss at larger scales you might try next. Compute expected gains from doubling or tripling scale.
For compute budgeting, apply C ∝ N × D and the simplified square-root guidance: as C grows, plan to grow both N and D together.
Tips and Warnings
Keep training recipes consistent: Changing architecture, optimizer, or tokenization across points can bend the curve, giving misleading exponents. If you must change, segment the data and fit each regime separately.
Beware of data contamination and quality: If validation data leaks into training or if data quality varies across scale points, your loss comparisons won’t be apples-to-apples. Curate carefully.
Convergence parity matters: Ensure larger models are trained sufficiently; undertraining bigger models can make them look worse than they would be if fully converged, flattening the fitted log-log slope and underestimating α.
Measure the right loss: Use the same validation loss definition (cross-entropy) and the same tokenization for all runs. Converting to perplexity is fine but fit on loss for simplicity.
Extrapolate cautiously: Power laws work well within observed ranges but can miss emergent jumps or slowdowns at extreme scales. Add safety margins to forecasts.
Compute estimates are approximate: While C ∝ N × D is the guiding relation, real costs include activation memory, optimizer state, and system overhead. Track empirical wall-clock and GPU-hours alongside theoretical FLOPs.
Overfitting signals: If loss keeps dropping on training data but stalls or worsens on validation, your N:D balance or regularization may be off. Adjust data size or regularization before concluding the scaling law failed.
Underfitting signals: If even training loss stalls high at small models, capacity is insufficient. Increase N or use more expressive architectures—but remeasure α after changes.
Log-space regression details: Using ordinary least squares on log-transformed data is standard. Consider robust regression if you suspect outliers (e.g., a run that diverged late). Report confidence intervals for α.
Interpret small α properly: Small exponents don’t mean scaling isn’t useful—gains compound across many doublings. Plan for the long game and combine scaling with data quality and optimization improvements.
Applying to Other Domains
Image classification: Similar power-law trends appear when scaling model size or data for vision models; loss or error rates often drop predictably. The exact α depends on architecture and datasets.
Reinforcement learning: While noisier, researchers have reported scaling-like patterns in RL when averaging across tasks and seeds. Care is needed due to variance and environment differences.
Transfer learning: Source-data scaling can improve target-task performance, often following power laws. This encourages pretraining on large, diverse corpora before specialization.
Putting It All Together
Scaling laws offer a compact, practical model: measure a handful of points carefully; fit a line in log space; use the fit to predict and plan; and adjust as you change setups. Compute binds model size and data size, so the best results come from balanced growth. Theoretical learning-curve intuition explains why the world often looks linear in log-log space. Finally, stay alert to limits: emergence, data quality, and architecture shifts can move you off the simple line—so measure, fit, and re-validate as you scale.
04 Examples
💡
Doubling model size example: Suppose loss = A × N^(−0.076). If you double parameters from N to 2N, the predicted loss becomes A × (2N)^(−0.076) = A × 2^(−0.076) × N^(−0.076). Since 2^(−0.076) ≈ 0.95, loss drops about 5%. This shows gains are steady but modest per doubling.
💡
Doubling data size example: With α_data ≈ 0.074, doubling tokens from D to 2D multiplies loss by 2^(−0.074) ≈ 0.95. So more data helps similarly to more parameters. If your model memory is maxed out, gathering more clean tokens can achieve comparable gains. This equivalence guides practical choices.
💡
Compute scaling example: Using α_compute ≈ 0.069, doubling compute multiplies loss by 2^(−0.069) ≈ 0.95. Spending more compute (via larger N, larger D, or longer training) gives predictable benefits. This helps budgeting GPU-hours for expected returns. It frames loss reduction per extra compute dollar.
💡
Choosing N and D with a fixed compute budget: If your compute budget is C and C ∝ N × D, and you can afford C = 10^20 FLOPs, you could pick (N, D) pairs like (10^10, 10^10) or (10^11, 10^9). The compute-optimal view says grow both together—so stay near balanced pairs rather than extreme imbalances. This avoids overfitting (too big N, too little D) or underutilization (too small N, too much D). You get better loss for the same C.
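A toy comparison of budget splits makes the sweet spot concrete. The additive loss form L = N^(−0.076) + D^(−0.074) and its unit coefficients are assumptions for this sketch, not something the studies prescribe:

```python
# Sketch: comparing (N, D) splits under a fixed budget C = N * D.
# The additive toy loss and its coefficients are illustrative assumptions.
C = 1e20
candidates = [1e9, 1e10, 3e10, 1e11]    # candidate parameter counts N

for N in candidates:
    D = C / N                            # tokens remaining in the budget
    toy_loss = N ** -0.076 + D ** -0.074
    print(f"N={N:.0e}, D={D:.0e}, toy loss={toy_loss:.4f}")
```

Among these candidates, the balanced split near N ≈ D ≈ sqrt(C) = 10^10 gives the lowest toy loss, while the extreme splits pay a penalty on one term or the other.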
💡
Forecasting at 10× parameters: Fit loss = A × N^(−0.076). If your current model with N0 has loss L0, then at 10N0 the predicted loss is L0 × 10^(−0.076) ≈ L0 × 0.84. That’s a 16% drop. Teams use such estimates to decide whether 10× parameters is worth the cost.
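The 10× forecast above can be sketched directly; the starting loss L0 = 2.5 here is a made-up value for illustration:

```python
# Forecasting loss at 10x parameters from a fitted exponent.
alpha = 0.076        # the lecture's rough parameter exponent
L0 = 2.5             # hypothetical current validation loss
scale = 10           # 10x parameter count

predicted = L0 * scale ** (-alpha)       # L0 * 10^(-0.076)
drop_pct = (1 - scale ** (-alpha)) * 100
print(f"predicted loss at 10x params: {predicted:.3f} (~{drop_pct:.0f}% drop)")
```

Since 10^(−0.076) ≈ 0.84, the predicted loss is about 84% of the current one, matching the ~16% drop cited above.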
💡
Reading a log-log plot: Imagine a chart with model size on the x-axis (log scale) and loss on the y-axis (log scale). Straight, downward-sloping lines indicate power-law fits. Different colored lines for different data sizes show that more data shifts the line downward (better loss) at the same model size. The gap between lines shows the benefit of more data for fixed N.
💡
Learning-curve intuition: Plot loss over training steps T and see loss ∝ T^(−β). Early on, loss drops fast; later it flattens (diminishing returns). When you note that compute grows with both N and T, you can express loss as a function of compute, giving a power-law relation. This explains why power laws appear so robust across setups.
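One way to recover the C^(−β/(1+β)) exponent numerically is to assume the model size needed grows as N ∝ T^β, so that C ∝ N × T ∝ T^(1+β). That growth assumption is illustrative (the lecture states only the resulting power law), but it makes the algebra checkable:

```python
import numpy as np

# Numeric check of loss ∝ C^(-beta/(1+beta)).
# Illustrative assumption: required model size grows as N ∝ T^beta,
# so compute C ∝ N * T ∝ T^(1 + beta).
beta = 0.3
T = np.logspace(3, 8, 6)     # training steps
loss = T ** -beta            # learning-curve power law
C = (T ** beta) * T          # C ∝ N * T with N = T^beta

# Slope of log(loss) vs log(C) should equal -beta / (1 + beta)
slope = np.polyfit(np.log(C), np.log(loss), 1)[0]
print(slope, -beta / (1 + beta))
```

Both numbers come out near −0.231, confirming that a T-power-law learning curve plus compute that grows with both N and T yields another power law in C.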
💡
Transfer scaling scenario: You want a cat-image classifier, but have limited cat photos. You pretrain on a huge dataset of other animals, then fine-tune on cats. As you increase source data (other animals), your cat classifier improves in a predictable, power-law-like way. This shows scaling benefits even when the target task data is scarce.
💡
Data quality counterexample: If your added tokens are noisy or biased (e.g., lots of profanity or contradictions), the expected 5% gain per doubling might not materialize. The model can learn unwanted patterns, reducing real-world performance. This illustrates why scaling laws assume reasonably clean, representative data. Curating data maintains the curve’s reliability.
💡
Emergent behavior caution: When scaling from 1B to 100B parameters, models may suddenly show new skills like tool use or code generation. These behaviors can cause bends in the loss curve not predicted by small-scale fits. Thus, extrapolations far beyond observed ranges carry uncertainty. Monitoring and re-fitting at intermediate scales reduces surprise.
💡
Undertraining bias example: If you train the largest model fewer steps than needed, its validation loss will be higher than it should be. Fitting a line through such a point can make α look smaller (or the fit noisier). Ensuring convergence parity across runs keeps the fit honest. Always check training curves before finalizing fits.
💡
Budget planning example: You have a budget to double compute every quarter. Using α_compute ≈ 0.069, you can expect about 5% loss reduction each quarter if you scale optimally. Setting these realistic expectations helps stakeholders plan milestones. Combining scaling with data cleaning may yield even better-than-predicted gains.
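The quarterly plan compounds multiplicatively; a short sketch makes the trajectory explicit (the starting loss of 3.0 is hypothetical):

```python
# Compounding ~5% per-doubling gains: one compute doubling per quarter.
alpha = 0.069        # the lecture's rough compute exponent
loss = 3.0           # hypothetical starting validation loss

for quarter in range(1, 9):          # two years of quarterly doublings
    loss *= 2 ** (-alpha)            # each doubling multiplies loss by ~0.95
    print(f"Q{quarter}: predicted loss {loss:.3f}")
```

After eight doublings the predicted loss is 3.0 × 2^(−8 × 0.069) ≈ 2.05, a cumulative ~32% reduction even though each individual quarter delivers only ~5%.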
💡
Choosing between more data or more params: With similar exponents for data and parameters, if storage is cheap but memory is tight, choose more data; if data is scarce but you can afford larger models, choose more parameters. Either path gives similar marginal returns. This flexibility is a practical benefit of the findings. You tailor scaling to your constraints.
💡
Validation set consistency example: If you change the validation set between runs, loss changes might reflect dataset differences rather than true improvements. Keeping a fixed validation set makes comparisons fair. This practice strengthens the reliability of your scaling-law fit. Consistency turns experiments into trustworthy predictions.
💡
Log-space regression workflow: After collecting (N, loss) across six model sizes, compute x = log(N) and y = log(loss). Fit y = b0 + b1 x with linear regression. The slope b1 gives −α and intercept gives log(A). Plot the points and the fitted line to visually confirm the power-law relationship.
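A minimal NumPy sketch of this workflow, using synthetic losses generated from a known A and α so the fit can be checked against ground truth (the six model sizes and the true values are made up for the example):

```python
import numpy as np

# Log-space regression workflow on synthetic power-law data.
# true_A and true_alpha are made up so we can verify the recovery.
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])   # six model sizes
true_A, true_alpha = 10.0, 0.076
loss = true_A * N ** (-true_alpha)             # ideal power-law losses

x, y = np.log(N), np.log(loss)                 # move to log space
b1, b0 = np.polyfit(x, y, 1)                   # slope b1, intercept b0

alpha_hat = -b1                                # slope gives -alpha
A_hat = np.exp(b0)                             # intercept gives log(A)
print(f"alpha ~ {alpha_hat:.4f}, A ~ {A_hat:.3f}")
```

With real measurements the points will scatter around the line; plotting residuals, as the checklist below suggests, is the quickest way to judge whether a single power law fits.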
05 Conclusion
Scaling laws give a simple, powerful way to understand and predict how language models improve as you scale parameters, data, and compute. The core form, loss = A × N^(−α), becomes a straight line in log space, making it easy to fit and extrapolate. Landmark studies found similar exponents for parameters and data, suggesting both are equally valuable, with compute tying them together through C ∝ N × D. Compute-optimal scaling advises growing model size and data together for a given budget, avoiding overfitting and underutilization. A basic theoretical view using learning curves and diminishing returns explains why power laws appear so consistently.
To put this into practice, design clean scaling experiments, train several points, and fit a line to log(loss) vs. log(scale). Use the exponent α to forecast gains from doubling size or data and to choose the best balance of N and D for your compute. Keep training recipes consistent, ensure convergence parity, and maintain a steady validation set—these make your fit reliable. Stay mindful of limits: at very large scales, emergent behaviors, data quality changes, and architecture tweaks can bend the curve, so update your fits as you scale.
For immediate practice, run a small scaling study: pick a model family, vary parameters across 5–7 sizes at a fixed dataset, and fit α_params. Then repeat by varying data size at a fixed model size. Try a compute-budget scenario using C ∝ N × D to plan the next run. As next steps, explore transfer scaling (pretrain size vs. downstream gains) and study how different tokenizers or training schedules shift A and α.
The core message: scaling laws turn high-stakes LLM training from guesswork into a predictable, budgetable process. Use them to plan, to decide where to spend compute, and to set realistic expectations. Respect their limits, refit when conditions change, and combine scaling with data quality and good engineering. With this approach, you can scale smarter, not just bigger.
✓Re-fit after major changes: If you change architecture, data quality, or training recipes, remeasure points and refit α. Old exponents may no longer apply. This avoids misallocation of budgets based on outdated trends. Keep your scaling model current.
✓Prioritize data quality: Scaling assumes reasonably clean data; noisy or biased data can blunt gains. Invest in curation and filtering to keep the curve favorable. Quality often lowers A, improving the whole curve. It’s a multiplier on scaling benefits.
✓Beware of emergence at extreme scales: Large models can show new abilities that bend simple trends. Extrapolate with caution when going far beyond observed data. Plan intermediate checks and be ready to revise forecasts. Let measurements guide you.
✓Use compute-optimal planning: For fixed compute, increase N and D together rather than just one. This avoids overfitting (huge N, tiny D) and underuse (tiny N, huge D). The simplified square-root rule provides an intuitive starting point. Validate near your planned point to fine-tune the balance.
✓Communicate uncertainty: When sharing forecasts, include confidence intervals and note assumptions (fixed architecture, consistent data). This builds trust and makes plans robust. Stakeholders can then weigh risks and options. Responsible forecasting is part of good engineering.
✓Look for straightness and residuals: A straight log-log line and small, structureless residuals suggest a good power-law fit. Curved residuals hint at mixed regimes or setup changes. Use this diagnostic to refine experiments. Good fits lead to credible predictions.
✓Track real compute and time: The C ∝ N × D rule is a guide; track wall-clock time, GPU-hours, and memory too. System overhead can shift practical limits. This helps you plan infrastructure and scheduling. It ties theory to reality.
✓Combine scaling with algorithmic improvements: Scaling isn’t the only lever. Better optimization, architectures, or tokenization can reduce A or even change α. Use both scaling and innovation for the best results. It’s not either-or.
✓Start small, iterate: Begin with a few well-chosen points, fit α, then expand. Each round improves your forecasts. This iterative approach saves compute and avoids dead ends. It’s a practical path to confident scaling.
Model size (parameters)
The number of adjustable weights in a neural network. More parameters usually let a model learn more complex patterns. In scaling studies, this is often the main x-axis. It’s commonly measured in millions or billions. Making models bigger is a key way to reduce loss.
Data size (tokens)
The number of token pieces the model sees during training. More tokens give more examples to learn from. Token counts can be in billions or trillions for large LMs. Bigger datasets usually lower loss. Clean, diverse tokens help the most.
Compute (FLOPs)
The amount of calculation work done during training, often measured in floating-point operations (FLOPs). More compute lets you train larger models on more data or for more steps. Compute binds together parameters and data. It’s the main budget you spend in large-scale training. Careful planning maximizes gains per FLOP.
Loss (cross-entropy)
A number that measures how wrong the model’s predictions are. For language models, cross-entropy compares predicted probabilities to the true next token. Lower loss means better predictions. It’s smooth and comparable across scales. It’s the standard metric for scaling-law fits.