Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 11: Scaling Laws 2 | How I Study AI
📚 Stanford CS336: Language Modeling from Scratch (Lecture 11 of 17)
Level: Intermediate | Source: Stanford Online | Tags: LLM, YouTube

Key Summary

  • Scaling laws relate a model’s log loss (how surprised it is by the next token) to three knobs: number of parameters (N), dataset size (D), and compute budget (C). As you increase N, D, and C, loss usually drops smoothly. But this only holds when you keep many other things steady and consistent.
  • Data quality is a first-class factor. If your dataset is full of spam, markup, and non-language junk, doubling the data may not help and can even hurt. Clean, diverse, deduplicated text is essential for predictable scaling.
  • Tokenization changes the number of tokens in the same text, which changes D. Using different tokenizers can make scaling curves look wrong because the effective dataset size changes. A simple fix is to standardize on one tokenizer for all runs.
  • Architecture matters. Dense Transformers and Mixture-of-Experts (MoE) do not share the same scaling behavior. You can’t directly port scaling laws measured on a dense model to an MoE without re-measuring.
  • Evaluation choice changes what you see. Scaling laws tracked with log loss may not match trends on downstream tasks like question answering, summarization, or translation. Always evaluate on a fixed, consistent task set when comparing.
  • Emergent abilities are capabilities that appear suddenly as models get bigger, like doing arithmetic or multi-step reasoning. Whether these are truly emergent or just due to better data and engineering is debated. Classic scaling laws don’t predict these jumps well.
  • Phase transitions are model behavior shifts similar to water turning into ice or steam. As model size or data scale grows, internal weight distributions can change and the model can act differently. This suggests we may need scaling laws that include phase changes.

Why This Lecture Matters

People building language and multi-modal models must plan how to spend compute and data wisely. Scaling laws turn vague guesses into measurable trade-offs between model size, dataset tokens, and budget. By understanding what keeps these laws stable—data quality, consistent tokenization, fixed architecture and evaluation—you can run cleaner experiments and make better budget decisions. This knowledge helps ML engineers, researchers, and product teams avoid wasted runs and pick the best model size for their needs.

At the same time, modern systems show behaviors that classic laws don’t fully capture: emergent abilities, phase transitions, architecture shifts like MoE, and parameter-efficient fine-tuning. Recognizing these factors prevents overconfidence in smooth extrapolations and encourages regime-aware planning. In real projects, this means you can detect when a task needs a bigger jump in scale, when PEFT can cheaply unlock domain performance, or when multi-modal alignment is the true bottleneck.

Career-wise, mastering scaling principles makes you effective at large-model planning, benchmarking, and cost control—skills highly valued across AI teams. You’ll be able to design fair comparisons, interpret why a model under- or over-performs, and choose the right adaptation method. As the industry moves toward ever-larger, multi-modal systems, using nuanced scaling practices will keep your models competitive while staying within resource limits.

Lecture Summary

01Overview

This lecture focuses on practical and modern views of scaling laws for language models. Scaling laws are the observed patterns that show how a model’s loss improves as you increase three main levers: the number of parameters (N), the size of the dataset (D), and the compute budget (C). While past work gave clean relationships—like how to choose an optimal data-to-parameters balance under a fixed compute budget—real training projects uncover many details that can disrupt these patterns. The lecture explains those practical considerations and then digs into newer developments that stretch or break the classic laws.

The session begins with a quick recap: log loss (negative log probability of the next token) typically drops when you scale N, D, and C. The common experimental setup is to fix compute and sweep model sizes and dataset sizes to find the best trade-off. But to do this well, you must control confounders—data quality, tokenization, architecture, and evaluation choices. Each of these can shift curves or flatten gains, making it seem like scaling laws fail when the setup is actually inconsistent.

From there, the lecture moves to new frontiers that don’t fit older laws neatly. Emergent abilities are capabilities that appear at larger scales, such as basic arithmetic or chain-of-thought reasoning, which smaller models lack. Whether these are sudden “phase changes” or simply smoother improvements that look sharp due to measurement choices is debated. The main point: classical scaling laws don’t predict these sudden jumps, highlighting their limitations.

Next, the idea of phase transitions is introduced using a physics analogy: just like water turning into ice at 0°C, models may undergo behavior changes when certain scale thresholds are crossed. Evidence includes shifts in weight distributions and qualitative changes in model outputs. These transitions suggest that new, richer scaling laws might be needed—laws that allow for regime changes rather than assuming one smooth curve forever.

The lecture then covers Parameter-Efficient Fine-Tuning (PEFT), which adapts large pre-trained models by updating only a tiny fraction of parameters. Techniques like LoRA and prefix tuning show that high-quality adaptation is possible with far less compute and memory than full fine-tuning. Because PEFT updates so few parameters, the original scaling laws for full-model training may not apply, so practitioners should expect different trends and hyperparameter optima.

Finally, the lecture turns to multi-modality—training models that work with text plus images, audio, or video. These models can scale well, but they follow different rules because tokenization, information density, and alignment challenges differ across modalities. Systems like Flamingo demonstrate impressive results, yet their scaling curves and best practices are not identical to pure-text models. The lecture closes by summarizing practical best practices (standardize tokenizer, data quality, evaluation; be mindful of architecture choice) and by emphasizing that modern research is pushing beyond simple formulas. Learners come away with both the classic picture and a set of caveats and extensions that match how state-of-the-art systems are actually built today.

This material suits intermediate learners who know what a language model and a Transformer are, and who are familiar with loss, tokens, and training budgets. After studying this, you will be able to design controlled scaling experiments, interpret scaling curves, decide between dense and MoE architectures, plan PEFT fine-tuning efficiently, and anticipate differences when moving to multi-modal setups. The lecture is structured as: (1) recap of log loss and classic scaling setup, (2) practical variables to control (data quality, tokenization, architecture, evaluation), and (3) recent developments (emergent abilities, phase transitions, PEFT, multi-modality), ending with key takeaways.

Key Takeaways

  • ✓ Standardize everything non-essential to the experiment. Use a single tokenizer, fixed data filtering, one architecture family, and a constant evaluation suite. Changing these midstream makes your curves noisy or misleading. Control variables tightly so N and D are the only major differences.
  • ✓ Invest in data quality before scaling. Deduplicate aggressively, remove spam and boilerplate, and language-detect and normalize content. High-quality tokens give more learning per token and make scaling laws hold better. It’s cheaper to clean data than to buy more compute for junk.
  • ✓ Plan with a fixed compute budget. Build an N–D grid that fits your GPU-hours or FLOPs. Often, a moderate N with more D beats a huge N with too little data. Validate this with careful sweeps rather than guessing.
  • ✓ Keep the tokenizer constant to keep D consistent. Switching tokenizers changes token counts and breaks comparisons. If a new tokenizer is necessary, rebuild baselines from scratch. Never mix tokenizers within a scaling curve.
  • ✓ Measure both log loss and downstream tasks. Loss is smooth and informative, but real tasks can plateau or jump. Use a fixed suite to catch capability changes and avoid cherry-picking. Decide wins based on a balanced view.
  • ✓ Expect architecture-specific scaling. Dense and MoE models won’t follow the same curves. Re-measure scaling exponents when you switch families. Tune routing and expert configurations for MoE to unlock gains.
  • ✓ Watch for emergent abilities and phase transitions. If a capability suddenly appears, don’t force a smooth fit over it. Consider piecewise or regime-aware models for better predictions. Document internal signals like weight distribution shifts.

Glossary

Log loss

A number that tells how surprised a model is by the next token. Lower means the model guessed well, higher means it guessed poorly. It is the negative log of the predicted probability for the correct token. It’s a core way to measure language modeling quality. When log loss improves, perplexity also improves.

Perplexity

A way to express how confused a language model is. Lower perplexity means the model predicts text better. It’s mathematically related to log loss: lower log loss gives lower perplexity. People often track perplexity to compare models because it is easier to interpret than raw log loss.

Parameters (N)

The numbers inside a model that it learns during training. More parameters usually means more capacity to learn patterns. But too many without enough data can cause problems. Scaling laws track how performance changes as N grows.

Dataset size (D)

How many tokens (pieces of text) the model sees during training. More tokens generally help the model learn better. But only if those tokens are high quality. Tokenization changes how many tokens the same text becomes.

Compute budget (C)

The total amount of computation available for training, usually measured in FLOPs or GPU-hours. With C fixed, you must trade off model size (N) against tokens seen (D). Scaling studies commonly hold C constant while sweeping N and D to find the best balance.
  • Parameter-Efficient Fine-Tuning (PEFT) updates only a tiny portion of weights during fine-tuning. Methods like LoRA (low-rank adapters) and prefix tuning can match full fine-tuning quality while training 0.1%–1% of parameters. This likely obeys different scaling behaviors than full-model training.
  • Multi-modality mixes text with images, audio, or video. Models like Flamingo show strong performance when scaled, but their scaling laws differ from pure text. Expect different trends because modalities have different tokenization, information density, and noise.
  • Compute-aware planning is key. Many teams fix compute C and choose the best split between model size N and tokens D. You must balance the two carefully: too big a model with too little data overfits; too small a model wastes potential.
  • Standardization and controls are mandatory. Keep tokenizer, data filtering, architecture, and evaluation sets fixed when building scaling curves. Without this, trends will be noisy or misleading.
  • The field is evolving. Today’s scaling laws are good approximations, but emergent abilities, phase transitions, PEFT, and multi-modal training show their limits. Expect refined laws that include quality, modality, and behavior shifts.
  02Key Concepts

    • 01

      🎯 Log loss: A measure of how surprised the model is by the next token. 🏠 It’s like a quiz where the model bets on the next word; if it guesses well, it loses fewer points. 🔧 Technically, it’s negative log probability: -log p(x_next | context). 💡 Lower log loss means better predictions and better language modeling. 📝 Example: If a model predicts “cat” with high probability after “The black,” and the next token is indeed “cat,” its log loss for that token is small.

    • 02

      🎯 Scaling laws: Rules that show how loss changes with model size (N), data size (D), and compute (C). 🏠 Think of three sliders on a music player—move each slider and the sound (performance) changes. 🔧 Empirically, as N, D, and C increase, validation log loss tends to drop in a predictable way. 💡 These laws help you plan budgets and choose the right model/data mix. 📝 Example: Under a fixed compute budget, increasing tokens D often beats making N huge if the model would otherwise be undertrained.
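A minimal sketch of what such a law looks like in code, assuming the common parametric form L(N, D) = E + A/N^α + B/D^β. The constants below are illustrative values in the neighborhood of published fits, not numbers from this lecture:

```python
# Parametric scaling law sketch: L(N, D) = E + A / N**alpha + B / D**beta.
# E is the irreducible loss; the other terms shrink as N and D grow.
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted validation log loss for a model with n_params parameters
    trained on n_tokens tokens (illustrative constants)."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss drops smoothly as both N and D grow:
small = predicted_loss(1e8, 2e9)    # 100M params, 2B tokens
large = predicted_loss(1e10, 2e11)  # 10B params, 200B tokens
assert large < small
```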

    • 03

      🎯 Fixed-compute scaling: The common practice of holding C constant and trading off N vs. D. 🏠 It’s like having a set amount of fuel for a road trip—decide whether to drive a heavier car (bigger model) or go farther (more tokens). 🔧 You distribute compute between forward/backward passes for different N and D settings. 💡 This reveals the most efficient use of your GPU hours. 📝 Example: At the same GPU-days, training a medium model on more tokens can outperform a very large model trained on too few tokens.
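The trade-off can be sketched numerically. This assumes the standard rule of thumb that training costs roughly 6 FLOPs per parameter per token, plus an illustrative parametric loss (the constants are assumptions, not lecture values):

```python
# Under a fixed FLOP budget C, the approximation C ≈ 6 * N * D ties model
# size and token count together: choosing N determines D = C / (6N).
def predicted_loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n**alpha + B / d**beta

C = 1e21  # total training FLOPs (hypothetical budget)
candidates = [1e8, 3e8, 1e9, 3e9, 1e10, 3e10]

# Trace the fixed-compute frontier and pick the best model size.
best_n = min(candidates, key=lambda n: predicted_loss(n, C / (6 * n)))

# The optimum sits in the interior of the sweep: an oversized N starves D,
# an undersized N wastes the budget on tokens the model cannot absorb.
assert best_n not in (candidates[0], candidates[-1])
```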

    • 04

      🎯 Data quality: The cleanliness and usefulness of your training text. 🏠 It’s like cooking; ingredients full of sand (spam/HTML junk) make a bad meal no matter how big your pot is. 🔧 Deduplication, language detection, spam filtering, and formatting cleanup raise effective signal per token. 💡 Poor quality breaks scaling expectations because extra tokens add noise, not knowledge. 📝 Example: Doubling tokens using unfiltered web crawl may barely reduce loss, while doubling with curated, deduplicated text lowers loss significantly.

    • 05

      🎯 Tokenization: Turning text into tokens that the model processes. 🏠 Like cutting bread into slices—thin or thick slices change how many pieces you get from the same loaf. 🔧 Different tokenizers (e.g., BPE, SentencePiece) produce different token counts and token boundaries. 💡 If you switch tokenizers between runs, D changes even if raw text is identical, distorting scaling curves. 📝 Example: “hello” could be one token in one tokenizer and two tokens like “hel” + “lo” in another, changing dataset length and loss comparisons.
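A toy illustration of why this matters for D. The two "tokenizers" below are deliberately simplistic stand-ins, not real BPE or SentencePiece implementations:

```python
# Two toy tokenizers over identical raw text yield different token counts,
# so the effective dataset size D differs even though the corpus does not.
text = "scaling laws need consistent tokenization " * 1000

def whitespace_tokenize(s):
    # coarse: roughly one token per word
    return s.split()

def char_pair_tokenize(s):
    # finer: one token per two characters
    return [s[i:i + 2] for i in range(0, len(s), 2)]

d_a = len(whitespace_tokenize(text))
d_b = len(char_pair_tokenize(text))
assert d_a != d_b  # same text, different D -> scaling curves not comparable
```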

    • 06

      🎯 Architecture dependence: Scaling laws change with model architecture. 🏠 Different engines in cars (sedan vs. hybrid) don’t use fuel the same way. 🔧 Dense Transformers and Mixture-of-Experts (MoE) have different parameter utilization and compute-per-token patterns. 💡 You cannot reuse dense-model scaling exponents for MoE without re-measurement. 📝 Example: An MoE with the same parameter count may have lower per-token compute, shifting the best N–D trade-off.

    • 07

      🎯 Evaluation choice: The tasks and metrics you track to judge progress. 🏠 It’s like testing a runner with sprints vs. marathons—one test can’t summarize all ability. 🔧 Loss/perplexity may trend smoothly, but downstream tasks like QA, summarization, or translation can plateau or jump. 💡 Use a fixed, consistent evaluation suite to compare fairly across scales. 📝 Example: A model’s log loss improves, but factual QA accuracy barely moves until a larger size, then jumps sharply.

    • 08

      🎯 Emergent abilities: New skills that appear as models get bigger. 🏠 Like a kid who suddenly starts solving puzzles once they reach a certain level of practice. 🔧 Examples include arithmetic or chain-of-thought reasoning appearing at certain scales. 💡 Classic scaling laws don’t predict these step-like changes. 📝 Example: A 300M model fails multi-step math, but a 10B model with similar training starts solving it reliably.

    • 09

      🎯 Competing theories for emergence: Memorization vs. richer representations. 🏠 Like knowing many phone numbers by heart vs. learning how to calculate on the fly. 🔧 Larger models may simply memorize more examples, or they may learn better internal structures for numbers and logic. 💡 Likely both mechanisms contribute depending on task and data. 📝 Example: A model could memorize common arithmetic answers but also learn positional number representations that enable generalization.

    • 10

      🎯 Limits of classic scaling laws: They assume smooth, continuous improvement. 🏠 Imagine a road map that doesn’t show bridges—fine until you reach a river. 🔧 When abilities suddenly appear or evaluation metrics jump, smooth power-law fits can mislead. 💡 Use scaling laws as guides, not absolute rules, especially beyond measured regimes. 📝 Example: Extrapolating a small-model curve to predict chain-of-thought accuracy at 100B parameters often fails.

    • 11

      🎯 Phase transitions: Regime changes in model behavior as scale grows. 🏠 Like water freezing into ice at 0°C—same molecules, new structure. 🔧 Researchers observe weight distribution shifts or abrupt changes in internal activations at certain sizes/datasets. 💡 These can trigger qualitative capability jumps and change optimization dynamics. 📝 Example: A model’s attention patterns become more focused at a certain scale, coinciding with improved reasoning consistency.

    • 12

      🎯 Weight distribution shifts: Internal parameters re-organize with scale. 🏠 Like a city where traffic concentrates on a few main roads as it grows. 🔧 As models scale, certain layers or heads may dominate, altering variance and sparsity patterns. 💡 These shifts can signal phase transitions and forecast new abilities. 📝 Example: Histograms of layer weights narrow in some layers and widen in others right before a leap in a reasoning benchmark.

    • 13

      🎯 New scaling laws with transitions: Models may need piecewise or regime-aware fits. 🏠 A hiking trail map with switchbacks is clearer than a straight line drawn over a mountain. 🔧 Instead of one power law, use segmented curves or models that allow abrupt slope changes. 💡 This improves predictions around ability thresholds. 📝 Example: Fit one exponent below 1B parameters and another above 10B to capture different improvement rates.

    • 14

      🎯 Parameter-Efficient Fine-Tuning (PEFT): Update a small subset of weights to adapt a big model. 🏠 It’s like adding a small steering device to a big ship instead of rebuilding the engine. 🔧 Methods such as LoRA and prefix tuning adjust lightweight components while freezing the backbone. 💡 This reduces compute, memory, and storage while keeping high accuracy. 📝 Example: Fine-tuning only 0.5% of parameters for a domain-specific QA task can match full fine-tuning performance.

    • 15

      🎯 LoRA (Low-Rank Adaptation): Insert trainable low-rank matrices into weight paths. 🏠 Think of adding a small set of dials that slightly reshape big knobs. 🔧 You decompose updates into low-rank factors (A, B) and learn them while freezing original weights. 💡 Achieves strong adaptation with minimal parameter overhead. 📝 Example: Add rank-8 adapters to attention projections to learn domain style without touching the full weight matrix.
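A minimal NumPy sketch of the LoRA idea, with assumed dimensions (d_model=512, rank 8) chosen only for illustration:

```python
import numpy as np

# LoRA sketch: the frozen weight W is augmented by a trainable low-rank
# update, so the effective weight is W + (alpha / r) * B @ A.
d_model, r = 512, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_model, d_model))        # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_model))  # trainable low-rank factor
B = np.zeros((d_model, r))                     # trainable, zero-initialized
alpha = 16.0                                   # scaling hyperparameter

def lora_forward(x):
    # base path plus low-rank adapter path; only A and B would be trained
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# Parameter accounting: the adapter is a tiny fraction of the full matrix.
full_params = W.size                 # 512 * 512
adapter_params = A.size + B.size     # 2 * 8 * 512
assert adapter_params * 32 == full_params  # ~3% here; far less at real scale
```

Because B starts at zero, the adapter path contributes nothing at initialization, so fine-tuning begins exactly at the pretrained model's behavior.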

    • 16

      🎯 Prefix tuning: Learn short trainable vectors prepended to inputs. 🏠 Like giving the model a hint card before every question. 🔧 These prefix embeddings condition the model to a task without changing main weights. 💡 Very memory-efficient and easy to apply across tasks. 📝 Example: Train a 20-token prefix so the model adopts a medical explanation style.
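The mechanism can be sketched in a few lines; the prefix length and embedding size here are assumptions for illustration:

```python
import numpy as np

# Prefix-tuning sketch: a short block of trainable embeddings is prepended
# to the frozen model's token embeddings before each forward pass.
d_model, prefix_len = 512, 20
rng = np.random.default_rng(0)

prefix = rng.normal(scale=0.02, size=(prefix_len, d_model))  # trainable

def with_prefix(token_embeddings):
    # token_embeddings: (seq_len, d_model); only `prefix` would be trained
    return np.concatenate([prefix, token_embeddings], axis=0)

x = rng.normal(size=(100, d_model))
assert with_prefix(x).shape == (120, d_model)  # 20 prefix + 100 real tokens
```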

    • 17

      🎯 Full fine-tuning vs. PEFT: Two ways to adapt models. 🏠 Rebuilding a house (full FT) vs. adding smart extensions (PEFT). 🔧 Full FT updates all weights, costing more compute and risking forgetting; PEFT updates very few, keeping base knowledge intact. 💡 PEFT often reaches similar performance with far less cost, but sometimes full FT still wins for hard distribution shifts. 📝 Example: For a niche legal domain, PEFT matches full FT on summaries; for a new language script, full FT may outperform.

    • 18

      🎯 Multi-modality: Models that handle text plus images/audio/video. 🏠 Like a student who can read, listen, and watch to learn. 🔧 Training mixes modalities, requiring alignment (e.g., image-text pairs) and cross-attention to fuse signals. 💡 Scaling laws differ because token counts, noise, and information density vary by modality. 📝 Example: A model trained on image–caption pairs shows improved visual reasoning as both visual encoder and text decoder scale.

    • 19

      🎯 Flamingo-style systems: A concrete multi-modal family. 🏠 Think of a language model with eyes. 🔧 These systems accept images plus text and produce text outputs, using special attention to connect visual features to words. 💡 Scaling data and model size helps, but not identically to text-only models. 📝 Example: Increasing image–text pairs boosts visual QA more than adding plain text tokens of similar count.

    • 20

      🎯 Practical workflow for scaling studies: Standardize, sweep, evaluate. 🏠 Like a fair science experiment with one variable changed at a time. 🔧 Fix tokenizer, data cleaning, and evaluation; select an architecture; run N–D sweeps at fixed C; fit curves; check anomalies. 💡 This prevents misleading conclusions and reveals true trade-offs. 📝 Example: With a single BPE tokenizer and a fixed QA benchmark, you compare 300M, 1B, and 3B models over 50B, 100B, and 200B tokens to find the best compute use.

    03Technical Details

    Overall Architecture/Structure of Scaling Experiments

    1. Objective and Metric
    • Goal: Minimize validation log loss (negative log probability), which correlates with language modeling quality and, often, better downstream task performance.
    • Primary metric: Log loss/perplexity on a held-out validation set. Secondary metrics: accuracy on QA, ROUGE for summarization, BLEU for translation, etc.
    2. Control Variables and Independent Variables
    • Fixed compute (C): Keep total GPU-hours or FLOPs constant across runs for a fair comparison.
    • Independent variables: Model parameters (N) and dataset tokens (D). Choose a grid of (N, D) settings to explore.
    • Controlled variables: Tokenizer, data cleaning pipeline, architecture family, optimizer, learning rate schedule, batch size policy, and evaluation set.
    3. Data Flow
    • Data collection: Start with a large raw corpus (web text, books, code, etc.).
    • Cleaning and filtering: Remove duplicates, strip markup (excess HTML), detect language, block spam, and normalize text.
    • Tokenization: Encode texts with a single, consistent tokenizer to count tokens accurately and feed the model.
    • Training: For each (N, D) run, train to a target number of steps/tokens under the compute budget. Track training and validation loss.
    • Evaluation: Measure validation loss and a fixed downstream task suite after training. Store results for all sweeps.
    • Curve fitting: Fit empirical curves (often power laws) to loss as a function of N and D under fixed C.
    4. Interpretation Loop
    • Compare runs and identify best-performing trade-offs. Check for anomalies (jumps, plateaus).
    • Diagnose confounders: If curves are noisy, verify tokenizer consistency, data quality, architecture consistency, and evaluation setup.
    • Decide production plan: Choose N and D that deliver best performance per compute, and consider future scaling direction.
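The curve-fitting step above can be sketched on synthetic data. This assumes the irreducible term E is known, which in practice it is not (it becomes a third fitted parameter); a single-variable power law then reduces to a linear fit in log-log space:

```python
import numpy as np

# Recover a known power-law exponent from a synthetic N-sweep:
# L(N) = E + a * N**(-alpha) implies log(L - E) = log(a) - alpha * log(N).
E_true, a_true, alpha_true = 1.7, 300.0, 0.3
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
loss = E_true + a_true * N ** (-alpha_true)

slope, intercept = np.polyfit(np.log(N), np.log(loss - E_true), 1)
alpha_hat = -slope
assert abs(alpha_hat - alpha_true) < 1e-6  # exact on noiseless data
```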

    Key Implementation Details and Roles of Components

    • Tokenizer: Defines tokens and counts D. A mismatch can make two runs incomparable even with identical raw text. Use one tokenizer across all variants. Common families include BPE and SentencePiece.
    • Model architecture: Dense Transformer vs. MoE. Dense models activate all parameters each token; MoE activates a subset of experts per token, altering compute patterns and scaling behavior.
    • Optimizer and schedule: AdamW or similar; cosine decay or linear warmup schedules. Keep these fixed to isolate N and D.
    • Data loader: Must avoid data skew and ensure uniform sampling. Shuffling and deterministic seeding help reproducibility.
    • Logging & checkpoints: Record FLOPs, wall-clock time, steps, and validation scores to enable fair fixed-compute comparisons.

    Parameter, Dataset, and Compute Interactions

    • Undertraining: If N is large but D is small, the model can overfit or fail to fully learn, wasting parameters.
    • Data saturation: If D is huge but N is tiny, the model may not have enough capacity to represent the data patterns.
    • Compute-limited regime: When C is fixed, there is an optimal N:D balance. Past compute-optimal scaling studies indicate that increasing D often beats simply increasing N within certain ranges, but outcomes depend on your data quality and architecture.

    Architecture Choice and Scaling Behavior

    • Dense Transformer: Predictable memory and compute usage; classic scaling curves often measured here.
    • Mixture-of-Experts (MoE): Introduces routing to a subset of experts per token. Total parameter count is high, but per-token active parameters and compute are lower than a dense model with the same total N. Scaling can favor larger nominal N under fixed compute, but effective capacity and generalization trends may differ.
    • Implication: Laws measured in dense settings may not carry over. Rebuild curves for MoE with standardized conditions.
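The dense-vs-MoE difference comes down to parameter accounting. A sketch, assuming a standard two-matrix MLP expert and hypothetical dimensions:

```python
# MoE accounting sketch: with top-k routing, only k of E expert MLPs run
# per token, so active (per-token) parameters sit far below total parameters.
def moe_params(d_model, d_ff, n_experts, top_k):
    per_expert = 2 * d_model * d_ff   # up-projection + down-projection
    total = n_experts * per_expert    # nominal parameter count
    active = top_k * per_expert       # parameters actually used per token
    return total, active

total, active = moe_params(d_model=2048, d_ff=8192, n_experts=16, top_k=2)
assert active * 8 == total  # 8x more nominal capacity at dense-like compute
```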

    Evaluation: Metrics and Task Suites

    • Log loss/perplexity: Smooth and sensitive to incremental improvements.
    • Downstream tasks: QA, summarization, translation, reasoning tests. These can show plateaus or step changes, which smooth loss curves may not capture.
    • Best practice: Maintain a fixed evaluation suite to detect genuine improvements and to avoid cherry-picking tasks that flatter a specific model size.

    Emergent Abilities

    • Observed phenomenon: Abilities like arithmetic or chain-of-thought reasoning appear at larger scales even when smaller models trained similarly fail.
    • Competing theories: (1) Memorization—larger models store more examples, which helps on familiar patterns. (2) Representation learning—larger models build richer internal structures (e.g., number representations, logical templates) enabling generalization.
    • Measurement caution: Binary metrics (pass/fail) can make a smooth improvement look sudden if a threshold is crossed. Nonetheless, practitioners do observe capability thresholds that classic smooth laws do not predict well.

    Phase Transitions

    • Concept: Abrupt changes in internal organization and behavior as scale increases, analogous to physical phase changes.
    • Indicators: Shifts in weight distributions, activation patterns, attention sparsity/density, or training dynamics (e.g., sudden stability changes).
    • Modeling approach: Consider piecewise scaling laws or regime-aware models where exponents change past certain scales. This improves prediction near capability thresholds.
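A regime-aware fit can be sketched as two separate log-log slopes around a candidate breakpoint; the data below is synthetic, with an artificial regime change planted at 1B parameters:

```python
import numpy as np

# Piecewise sketch: fit separate power-law exponents below and above a
# breakpoint instead of forcing one slope over all scales.
def piecewise_exponents(N, loss, breakpoint):
    lo, hi = N < breakpoint, N >= breakpoint
    slope_lo = np.polyfit(np.log(N[lo]), np.log(loss[lo]), 1)[0]
    slope_hi = np.polyfit(np.log(N[hi]), np.log(loss[hi]), 1)[0]
    return -slope_lo, -slope_hi

# Toy data whose improvement rate changes at 1B parameters:
N = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
loss = np.where(N < 1e9, 50 * N ** -0.2, 200 * N ** -0.35)

exp_lo, exp_hi = piecewise_exponents(N, loss, 1e9)
assert exp_hi > exp_lo  # steeper improvement in the large-scale regime
```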

    Parameter-Efficient Fine-Tuning (PEFT)

    • Motivation: Full fine-tuning a large model is expensive in compute and memory, and risks overwriting base capabilities.
    • Methods:
      • LoRA: Insert trainable low-rank adapters into key weight matrices (e.g., attention projections, MLPs). You learn A and B (low-rank factors) while freezing the original weights. Memory and compute overhead are low, and adapters can be saved/loaded separately.
      • Prefix tuning: Learn trainable continuous prompts (prefix embeddings) prepended to the model’s input or key/value caches. The backbone stays frozen, and task behavior is steered by the learned prefix.
    • Outcomes: Often achieves near full fine-tuning performance on many tasks while training only 0.1%–1% of parameters.
    • Scaling note: Because PEFT updates a tiny slice of parameters, its scaling behavior (with respect to N, D, and C) can differ from full-model training. For instance, adding more data may help less if the bottleneck is the adapter capacity rather than the base model.

    Multi-Modal Scaling

    • Differences from text-only:
      • Tokenization: Visual or audio features are chunked differently from text tokens; counting “tokens” across modalities is nontrivial.
      • Information density: An image can encode more information than the same number of text tokens; comparing D across modalities needs care.
      • Alignment: Cross-modal learning requires matching images to captions, audio to transcripts, etc. Noise in alignment reduces scaling efficiency.
    • Example: Flamingo-like systems accept images and text, producing text outputs. As model and data scale, performance improves, but the curve shape and best training recipes differ from text-only models.

    Step-by-Step Implementation Guide for a Scaling Study

    Step 1: Prepare data and tokenizer

    • Collect a clean text corpus. Apply deduplication, language filters, and HTML/markup stripping. Keep a validation set fixed across all runs.
    • Choose one tokenizer (e.g., a BPE with a fixed vocabulary). Do not change tokenizers mid-study. Compute token counts for D precisely.

    Step 2: Choose architecture family and hyperparameters

    • Pick dense Transformer or MoE. Fix depth/width scaling rules, positional embeddings, activation functions, and normalization settings.
    • Fix optimizer (e.g., AdamW), learning rate schedule (warmup + decay), batch size, gradient clipping, and weight decay.

    Step 3: Define compute budget (C)

    • Set total allowable FLOPs or GPU-days. Estimate per-step FLOPs for each N to understand feasible D (number of tokens/steps).
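Step 3 can be sketched with the common approximation of roughly 6 FLOPs per parameter per training token (forward plus backward pass); the budget value is hypothetical:

```python
# Given a FLOP budget C and the approximation C ≈ 6 * N * D, each candidate
# model size N fixes the number of tokens D the budget can afford.
def feasible_tokens(total_flops: float, n_params: float) -> float:
    return total_flops / (6.0 * n_params)

budget = 1e21  # hypothetical total training FLOPs
for n in (3e8, 1e9, 3e9):
    d = feasible_tokens(budget, n)
    print(f"N={n:.0e}: can afford D={d:.2e} tokens")
```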

    Step 4: Create an N–D sweep grid

    • Select 4–8 model sizes (e.g., small to large) and, for each, 2–4 token budgets, all fitting within C.
    • Ensure comparisons are apples-to-apples: same tokenizer, data filtering, optimizer/schedule, and evaluation set.

    Step 5: Train runs and log results

    • Train each (N, D) configuration to the planned token count. Log training/validation losses, FLOPs, and time.
    • Evaluate on a fixed downstream suite (QA, summarization, translation, reasoning) at the same checkpoints.

    Step 6: Fit and analyze scaling curves

    • Fit a power-law or similar function to validation loss vs. N and D. Consider separate fits per architecture.
    • If you notice abrupt jumps or plateaus, consider piecewise fits or flag possible phase transitions.

    Step 7: Decide optimal operating point

    • Under fixed C, select the (N, D) that minimizes validation loss and achieves strong downstream results.
    • Plan next steps: improve data quality, adjust architecture, or expand compute for further gains.

    Tips and Warnings

    • Keep the tokenizer constant: Changing it changes D and breaks comparability.
    • Prioritize data quality: Deduplicate aggressively; filter out spam and boilerplate; language-detect and normalize.
    • Beware overfitting at large N with small D: Monitor validation loss and downstream generalization.
    • Don’t assume dense laws apply to MoE: Rebuild curves for each architecture family.
    • Evaluate consistently: Fix a specific benchmark suite to avoid cherry-picking results.
    • Consider PEFT for downstream adaptation: It saves compute and storage; store adapters separately for deployment flexibility.
    • Anticipate modality-specific issues: In multi-modal setups, ensure robust pairing (image–caption), verify feature extraction quality, and align token counts meaningfully.
    • Understand emergent behavior limits: Smooth fits can’t predict sudden ability jumps; plan buffers in budgets and expectations.

    Debugging Methods

    • If scaling curves look noisy: Verify identical tokenization and data sampling order; reseed and re-run smaller pilots.
    • If downstream metrics don’t improve despite better loss: Check domain mismatch between training data and evaluation tasks; consider targeted fine-tuning or PEFT.
    • If MoE underperforms dense at similar C: Inspect routing load balance and expert capacity; adjust number of experts or routing temperature.
    • If multi-modal scaling stalls: Audit alignment quality; increase high-quality paired data rather than only adding unpaired text.

    Putting It All Together

    • A robust scaling study is a controlled experiment. Control everything except N and D, measure carefully, and be ready to use piecewise or regime-aware models if you observe phase transitions. For production, choose the best point under your compute, and apply PEFT to cheaply adapt the model to specific domains. For multi-modal tasks, plan for different scaling dynamics and focus on high-quality cross-modal alignment to realize the gains of larger models and datasets.

    04 Examples

    • đź’ˇ

      Clean vs. noisy data doubling: Start with a 50B-token clean corpus and measure validation loss. Then double to 100B tokens by adding unfiltered web crawl. Training shows only a tiny loss improvement because the new tokens are noisy and repetitive. Key point: Data quality controls are essential for scaling benefits.

    • đź’ˇ

      Tokenizer mismatch: Train a 1B model with Tokenizer A and count D = 100B tokens. Next, train the same model with Tokenizer B on the same raw text; now D = 120B tokens because B splits words more. The second run appears to have better loss per token, but the comparison is unfair. Key point: Keep the tokenizer fixed to make scaling curves valid.
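    The token-count mismatch is easy to see with two toy tokenizers on the same text. Both "tokenizers" below are deliberately simplistic stand-ins, not real BPE implementations.

```python
# Toy illustration: the same text yields different token counts under
# different tokenizers, so "D tokens" is only comparable when the
# tokenizer is held fixed. Both tokenizers are simplistic stand-ins.

text = "scaling laws relate loss to parameters data and compute"

tokens_a = text.split()                         # word-level "Tokenizer A"
tokens_b = [piece for word in text.split()      # cruder "Tokenizer B":
            for piece in (word[:4], word[4:])   # splits long words in two
            if piece]

print(len(tokens_a), len(tokens_b))  # → 9 13: B sees more tokens in the same text
```

    The same raw corpus thus yields a larger D under Tokenizer B, which is exactly why per-token loss comparisons across tokenizers mislead.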

    • đź’ˇ

      Dense vs. MoE at fixed compute: Compare a 2B dense model and an MoE with 16 experts totaling 8B parameters but similar per-token compute. The MoE activates only a subset of experts per token and reaches lower loss at the same compute. However, its downstream gains vary by task. Key point: Architecture changes alter scaling behavior; re-measure curves.

    • đź’ˇ

      Evaluation divergence: Two 3B models show similar validation loss. But on a reasoning QA benchmark, Model A jumps 10 points in accuracy while Model B barely moves. The loss curves didn’t predict this gap. Key point: Track downstream tasks alongside loss.

    • đź’ˇ

      Emergent arithmetic: A 300M model fails two-step arithmetic prompts, giving wrong sums. A 10B model trained on the same dataset and tokenizer starts solving them consistently. This jump looks abrupt compared to smooth loss improvements. Key point: Emergent abilities are not well captured by classic scaling laws.

    • đź’ˇ

      Phase transition signal: As model size increases from 1B to 10B, histograms of certain layer weights narrow while others widen, and attention heads become more focused. Right after this shift, reasoning accuracy improves sharply. The timing suggests a regime change. Key point: Internal distribution shifts can accompany capability jumps.

    • đź’ˇ

      LoRA for domain adaptation: Freeze a 7B base model and add LoRA rank-8 adapters on attention and MLP layers. Train on 1M domain-specific documents, updating only the adapters. The adapted model matches full fine-tuning on the domain QA set with a fraction of compute and storage. Key point: PEFT delivers efficient, high-quality adaptation.
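    The LoRA mechanics can be sketched in a few lines of numpy: the frozen weight W is augmented by a low-rank product B·A, and only A and B would be trained. Shapes, rank, and scaling are illustrative assumptions.

```python
import numpy as np

# Minimal LoRA sketch (numpy): frozen weight W plus a trainable low-rank
# update B @ A. Shapes and the scaling factor are illustrative.

d_out, d_in, rank = 64, 64, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))              # frozen base weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection (init 0)
scaling = 1.0 / rank                            # common alpha/r-style scaling

def lora_forward(x):
    # base path + low-rank adapter path
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
print("trainable params:", A.size + B.size, "vs full:", W.size)
```

    Here the adapter trains 1,024 parameters against a 4,096-parameter base layer; at 7B scale the ratio is far more dramatic, which is where the compute and storage savings come from.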

    • đź’ˇ

      Prefix tuning for style control: Add a 20-token trainable prefix to steer the model toward a medical explanatory style. Keep all base weights frozen and fine-tune only the prefix on a small corpus. The model answers in a consistent medical tone with minimal compute. Key point: Prefix tuning is a lightweight way to inject task/style behavior.

    • đź’ˇ

      Undertraining large model: Train a 10B model on just 10B tokens and compare to a 3B model trained on 50B tokens, under similar compute. The 10B model overfits and has worse validation loss and downstream accuracy. Key point: Bigger isn’t always better if D is too small for N.
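    A quick way to spot this failure mode before training is the tokens-per-parameter ratio; Chinchilla-style compute-optimal training suggests roughly ~20 tokens per parameter as a reference point. The threshold used below is an illustrative assumption.

```python
# Undertraining check: tokens-per-parameter ratio for each candidate run.
# Compute-optimal (Chinchilla-style) training suggests roughly ~20 tokens
# per parameter; far below that hints at undertraining. The cutoff of 5
# is an illustrative assumption.

runs = {"10B params / 10B tokens": (10e9, 10e9),
        "3B params / 50B tokens": (3e9, 50e9)}

for name, (n, d) in runs.items():
    ratio = d / n
    flag = "likely undertrained" if ratio < 5 else "closer to compute-optimal"
    print(f"{name}: {ratio:.1f} tokens/param -> {flag}")
```

    The 10B run sits at 1 token per parameter, far short of the reference point, matching the overfitting observed in the example.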

    • đź’ˇ

      Multi-modal scaling: Train a text–image model on 50M high-quality image–caption pairs and then 100M pairs. The model’s visual QA accuracy rises more steeply than when adding the same number of plain text tokens. Key point: Multi-modal scaling depends on paired data quality and quantity, not just text token counts.

    • đź’ˇ

      Compute budgeting: Given a fixed 2,000 GPU-hours, test three settings: (A) 1B params with 300B tokens, (B) 3B params with 150B tokens, and (C) 6B params with 75B tokens. After training, (B) achieves the best validation loss and balanced downstream results. Key point: The optimal N–D split depends on compute and data quality.
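    A back-of-envelope check of the three settings with the C ≈ 6·N·D heuristic is worth doing before spending any GPU-hours (GPU-hours translate to FLOPs via hardware throughput, which this sketch leaves abstract). Note the heuristic shows setting A is not perfectly FLOP-matched to B and C; catching that kind of imbalance is exactly the point of the check.

```python
# Back-of-envelope FLOPs for the three settings, via the C ≈ 6*N*D heuristic.
# GPU-hours map to FLOPs through hardware throughput, left abstract here.

settings = {"A": (1e9, 300e9), "B": (3e9, 150e9), "C": (6e9, 75e9)}

for name, (n, d) in settings.items():
    print(f"{name}: N={n:.0e}, D={d:.0e}, C ≈ {6 * n * d:.2e} FLOPs,"
          f" tokens/param = {d / n:.0f}")
```

    Setting B lands near the middle on both FLOPs and tokens-per-parameter, consistent with it achieving the best balance in the example.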

    • đź’ˇ

      Piecewise scaling fit: Fit a single power law to loss vs. model size from 100M to 10B and observe poor predictions near 10B. Refit with two segments: 100M–1B and 1B–10B, each with its own slope. Predictions near 10B improve significantly. Key point: Regime-aware fits capture transitions better.
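    The two-segment refit can be sketched as two independent log–log regressions sharing a breakpoint. The data below is synthetic, constructed with a deliberate slope change at 1B parameters; the breakpoint is assumed known rather than estimated.

```python
import numpy as np

# Two-segment ("piecewise") power-law fit in log–log space with a fixed,
# assumed breakpoint at 1B parameters. Data is synthetic, with a slope
# change at 1e9 built in (and made continuous at the breakpoint).

n = np.array([1e8, 2e8, 5e8, 1e9, 2e9, 5e9, 1e10])
loss = np.where(n <= 1e9,
                50.0 * n ** -0.15,                 # shallower slope below 1B
                50.0 * 1e9 ** 0.10 * n ** -0.25)   # steeper slope above

lo, hi = n <= 1e9, n >= 1e9                        # breakpoint in both segments
slope_lo = np.polyfit(np.log(n[lo]), np.log(loss[lo]), 1)[0]
slope_hi = np.polyfit(np.log(n[hi]), np.log(loss[hi]), 1)[0]
print(f"exponent below 1B ≈ {-slope_lo:.2f}, above 1B ≈ {-slope_hi:.2f}")
```

    In practice the breakpoint is unknown and can itself be fit (e.g., by scanning candidate breakpoints and minimizing total residual), but the segment-wise mechanics are the same.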

    • đź’ˇ

      Routing audit in MoE: An MoE model underperforms. Inspection shows most tokens being routed to a few experts, causing overload and underuse of others. Adjust routing temperature and capacity; performance improves. Key point: MoE scaling gains require balanced expert utilization.
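    One simple audit metric is the coefficient of variation of per-expert token counts: 0 means perfectly balanced traffic, large values mean a few experts are overloaded. The routing counts below are made up for illustration.

```python
import numpy as np

# Routing audit sketch: measure expert load balance from per-expert token
# counts. A high coefficient of variation (CV) means a few experts are
# overloaded while others sit idle. Counts below are illustrative.

def load_balance(expert_counts):
    counts = np.asarray(expert_counts, dtype=float)
    return counts.std() / counts.mean()   # CV: 0.0 = perfectly balanced

balanced = [1000] * 16                        # equal traffic to all 16 experts
collapsed = [14000] + [1000] * 2 + [0] * 13   # routing collapsed onto one expert

print(f"balanced CV  = {load_balance(balanced):.2f}")
print(f"collapsed CV = {load_balance(collapsed):.2f}")
```

    Tracking this metric during training makes routing collapse visible early, before it shows up as a loss gap against the dense baseline.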

    • đź’ˇ

      Downstream mismatch: Validation loss improves steadily, but medical QA accuracy lags. Adding small domain-specific PEFT fine-tuning dramatically boosts medical QA, even though base loss didn’t change much. Key point: General scaling helps, but targeted adaptation can unlock task-specific gains.

    • đź’ˇ

      Alignment in multi-modal data: A dataset has many mismatched image–caption pairs. After filtering and better alignment, the model’s visual reasoning improves without changing total pair count. Key point: Quality and alignment, not just quantity, drive multi-modal scaling.

    05 Conclusion

    Scaling laws are powerful tools for planning and predicting language model improvements as you vary parameters (N), dataset size (D), and compute (C). In controlled conditions—consistent tokenization, clean data, fixed architecture family, and a stable evaluation suite—log loss typically falls smoothly with scale. However, real practice exposes limits: downstream tasks may improve unevenly, emergent abilities can appear abruptly, and architecture shifts like MoE alter scaling dynamics. These realities suggest that simple, single-slope laws are approximations, and that regime-aware or piecewise descriptions may be necessary, especially near capability thresholds.

    The lecture also highlights modern techniques and contexts that change the scaling picture. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and prefix tuning adapt large models by updating very few parameters, often matching full fine-tuning while saving major compute and memory. Because they update so little of the model, their scaling behavior differs from full-model training. Multi-modal systems further complicate scaling predictions, since modality mixing, alignment quality, and information density shift how data and model size translate into gains. Evidence of phase transitions—abrupt changes in weight distributions and capabilities—pushes us toward more nuanced scaling laws.

    To put this into practice, design careful, controlled scaling studies: fix the tokenizer, standardize data filtering, keep the architecture and optimizer constant, and use a consistent evaluation suite. Sweep N and D under a fixed compute budget, log everything, and fit curves thoughtfully, considering piecewise models when you see jumps. Use PEFT for efficient domain adaptation, and plan for modality-specific rules in multi-modal projects. As you move forward, expect evolving theories that incorporate data quality, architectural differences, emergent abilities, and phase transitions into richer, more predictive scaling frameworks.

    The core message is simple: scaling laws are essential guides, but not unbreakable rules. Respect the details—data, tokenization, architecture, and evaluation—because they determine whether your model follows the expected curve or wanders off it. Be ready to update your mental model as the field discovers new regimes and behaviors. With disciplined experimentation and flexible thinking, you can harness scaling to build better systems while avoiding costly missteps.

  • âś“Use PEFT for efficient domain adaptation. LoRA and prefix tuning can match full fine-tuning with 0.1%–1% of parameters updated. This saves compute and storage, and keeps base knowledge intact. Prefer PEFT first; use full fine-tuning only when necessary.
  • âś“Avoid undertraining large models. A very big N with too small D wastes capacity and can overfit. Check validation loss and generalization on tasks. Sometimes a smaller model trained longer works better.
  • âś“In multi-modal projects, prioritize alignment quality. Good image–text or audio–text pairing beats raw quantity of unaligned data. Scaling rules differ across modalities, so rebuild curves. Track modality-specific metrics to guide training.
  • âś“Fit curves thoughtfully and consider piecewise models. A single global power law can miss real regime changes. Break the range into segments where behavior is stable. This improves planning near capability thresholds.
  • âś“Log everything for reproducibility. Record FLOPs, steps, seeds, loss curves, and task scores. This lets you audit odd results and verify compute fairness. It also helps compare future experiments cleanly.
  • âś“Check for confounders when results look off. Tokenizer drift, data contamination, or architectural tweaks can skew outcomes. Re-run small pilots with strict controls to isolate causes. Don’t accept surprising curves without investigation.
  • âś“Balance storage and deployment with PEFT. Store small adapters instead of full model copies for each domain. This simplifies shipping multiple variants. It speeds iteration for downstream teams.
  • âś“Use downstream mismatch as a signal for adaptation. If loss improves but task scores stall, try PEFT or targeted data. Small, focused fine-tuning can unlock big task gains. Don’t rely on base scaling alone for specialized domains.
    Compute budget (C)

    The total amount of computing you can spend training. It can be measured in GPU-hours or FLOPs. With fixed compute, you must trade off model size and token count. Good planning makes the best use of this budget.

    Scaling laws

    Patterns that show how model quality improves as you increase parameters, data, and compute. They often look like smooth curves when conditions are controlled. They are used to predict what happens at larger scales. But they are approximations, not perfect rules.

    Tokenizer

    A tool that splits text into tokens that the model reads. Different tokenizers cut the same sentence into different pieces. This changes the token count and affects training length. Keeping the tokenizer fixed makes comparisons fair.

    Byte Pair Encoding (BPE)

    A common tokenization method that merges frequent character pairs. It balances between character-level and word-level pieces. It helps handle rare words by splitting them. Many language models use BPE.
