Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 12: Evaluation | How I Study AI
šŸ“š Stanford CS336: Language Modeling from Scratch (12 / 17)

Intermediate
Stanford Online
LLM Ā· YouTube

Key Summary

  • Evaluation tells us how good a language model really is. There are two big ways to judge models: intrinsic (measure the model directly) and extrinsic (measure it through real tasks). Intrinsic is fast and clean but might not reflect real-world usefulness. Extrinsic is realistic and practical but slow and complicated to run.
  • Intrinsic evaluation for language models mainly uses perplexity. Perplexity summarizes how surprised the model is by a test set of text, normalized by length. Lower perplexity means the model assigns higher probability to the test text, which is better. Always compute perplexity on a held‑out test set, not on training data.
  • Perplexity can be understood as the model’s average branching factor. For example, a perplexity of 100 means the model acts like it has about 100 equally likely next-word choices on average. This makes the idea very intuitive, like the number of doors the model must pick from at each step. But perplexity depends on vocabulary size, so compare only models with the same tokenization and vocab.
  • An example test set 'I like cats. I hate cats.' shows how to compute perplexity by multiplying conditional probabilities and normalizing. Two models with different conditional probabilities yield different perplexities (e.g., 6.68 vs. 2.85). The lower number indicates the better model on that test set. This concretely demonstrates how to compare models.
  • Extrinsic evaluation measures how well a language model helps with tasks like text generation, summarization, and machine translation. You plug your model into the task pipeline and score the outputs. This shows the real, practical value of the model. But it’s slower and harder to isolate the model’s impact because many parts influence performance.
  • For text generation and translation, you can use human evaluation or automatic metrics. Human evaluation checks fluency, coherence, and relevance but costs time and money. Automatic metrics (BLEU, ROUGE, METEOR, TER) are cheap and fast but imperfect. The best practice is to use both.

Why This Lecture Matters

Evaluation is how you decide whether a language model is ready for real use. Engineers, researchers, and product teams need reliable ways to compare models and pick the best one for a job. Intrinsic metrics like perplexity quickly show whether a model predicts text well, which is useful during training and iteration. Extrinsic metrics and human evaluation tell you how the model performs on actual tasks such as translation, summarization, and text generation—where meaning, fluency, and usefulness matter most. This knowledge solves common problems: overfitting to training data, chasing a single score that doesn’t reflect user needs, and comparing models unfairly due to different tokenizations or setups. By using held‑out test sets, proper automatic metrics, and human judgments, you can make informed, trustworthy decisions. In real projects, this means building evaluation pipelines, establishing rubrics, and reporting with transparency, which speeds up development while avoiding costly mistakes. For your career, understanding evaluation makes you a stronger model builder and a better decision-maker. You learn to communicate results clearly, defend choices with evidence, and align metrics with product goals. In today’s industry, where models are deployed across many applications, robust evaluation is a critical skill. It ensures that improvements on paper translate into better user experiences and safer, more reliable AI systems.

Lecture Summary


01 Overview

This lecture focuses on how to evaluate language models so we can answer practical questions like: Is my model good? How do I compare two models fairly? The discussion centers on two complementary approaches: intrinsic evaluation, which measures the model directly, and extrinsic evaluation, which measures how well the model helps on real tasks. The lecture’s core goal is to help you choose the right metrics, understand what they mean, and use them together to get a reliable picture of quality.

Intrinsic evaluation is about scoring the model’s own probability assignments to text. The main tool here is perplexity, which converts the model’s probability for a test set into a single, length‑normalized number. Lower perplexity means the model is less surprised by the test text—so it predicts well. You should always compute this on a held‑out test set that the model never saw during training to measure generalization, not memorization.

Perplexity has useful intuition: it can be read as the model’s branching factor. Think of it like the average number of equally likely next‑word choices the model is juggling at each step. However, perplexity is sensitive to vocabulary size and tokenization. That means you should only compare perplexity values between models that share the same tokenization and vocabulary; otherwise, the numbers can mislead you.

Extrinsic evaluation, in contrast, checks how well the model performs when used in a real task pipeline. Example tasks include text generation (like story writing or code generation), text summarization, question answering, and machine translation. Here, the main options are human evaluation—asking people to rate fluency, coherence, and relevance—and automatic metrics like BLEU, ROUGE, METEOR, and TER. Human evaluation is high‑quality but expensive and slow; automatic metrics are fast but imperfect because they focus on surface overlaps rather than deep meaning.

The lecture highlights how common automatic metrics work and their limits. BLEU measures n‑gram overlap and applies a brevity penalty so models don’t just output very short strings; it is widely used in machine translation and sometimes text generation. ROUGE emphasizes recall and is popular for summarization, where covering reference content matters most. METEOR tries to match stems and synonyms to reward meaning-preserving changes in wording. TER counts edits needed to match the reference, offering an intuitive ā€œdistanceā€ measure. Still, these metrics don’t capture everything—like when two sentences mean the same thing but use different words (for example, ā€œI enjoy felinesā€ vs. ā€œI like catsā€).

The lecture’s practical advice is to combine methods. Use intrinsic evaluation (perplexity) to quickly understand the model’s predictive quality. Then, use extrinsic evaluation on the tasks you care about. Rely on automatic metrics for fast iteration and include human evaluation to check meaning and naturalness—especially before deployment. This balanced approach gives a truer measure of how good your model is in the real world.

This material is aimed at students who have built n‑gram, neural, LSTM, or Transformer language models and now need to measure their performance. You should know basic probability, conditional probability, and how language models predict the next token P(x_t | x_<t). No advanced math is required beyond multiplying probabilities and understanding why we normalize by sequence length.

After this lecture, you will be able to: compute and interpret perplexity; set up fair, held‑out test evaluations; understand the pros and cons of human vs. automatic metrics; and choose task‑appropriate metrics for text generation and machine translation. You will also learn why you should not rely on a single score and how to read perplexity as the model’s confusion level (branching factor). The lecture is structured as: a reminder of the language modeling objective; a clear split between intrinsic and extrinsic evaluation; a deep dive into perplexity; and an overview of text generation and machine translation evaluation with common metrics and their limitations, ending with a practical summary on combining approaches.

Key Takeaways

  • āœ“Always separate training, validation, and test data. Tune on validation and report final numbers on the held‑out test only. This prevents test leakage and keeps results honest. Keep your splits stable for reproducibility.
  • āœ“Use perplexity to quickly screen and compare models. Compute it on a held‑out test set and keep tokenization and vocabulary identical across models. Report PPL alongside average NLL to show both views. Consider confidence intervals via bootstrapping.
  • āœ“Interpret perplexity as branching factor for intuition. A lower PPL means fewer equally likely next choices on average. This helps explain why even modest PPL drops can matter. Communicate this intuition to non-technical stakeholders.
  • āœ“Do not rely on perplexity alone for generation quality. It measures next-token prediction, not global fluency or repetition control. Include extrinsic evaluations for tasks you care about. Review samples qualitatively.
  • āœ“Pick task-appropriate automatic metrics. Use BLEU and TER for machine translation and ROUGE for summarization. Add METEOR to better reward paraphrases. Choose metrics that align with your task’s needs.
  • āœ“Design a simple human evaluation rubric. Rate fluency, coherence, and relevance on a clear scale (e.g., 1–5). Blind model identities and randomize order to reduce bias. Use multiple raters per item and report agreement.
  • āœ“Standardize your evaluation pipeline. Fix tokenization, decoding parameters, and metric scripts. Use the same data splits and references across models. This makes comparisons fair and results reproducible.

Glossary

Intrinsic evaluation

A way to judge a model directly by how well it predicts text. It uses the model’s own probability outputs without involving a task like translation. It is fast to compute and easy to repeat. It gives a clean view of prediction quality but may not match real-world usefulness. Perplexity is the main intrinsic metric for language models.

Extrinsic evaluation

A way to judge a model by how well it performs in real tasks. The model is plugged into a pipeline like translation or summarization. Scores come from task outputs compared to references or human ratings. It measures actual utility but is slower and more complex. It is often called evaluation in the wild.

Perplexity (PPL)

A number that shows how surprised a model is by a test text on average. Lower is better and means the model predicted well. It is the inverse of the test text probability, normalized by length. It can be viewed as the model’s branching factor—the number of likely next choices. Perplexity depends on the vocabulary and tokenization used.

Held‑out test set

A part of data kept separate from training and tuning. It is used only for final evaluation. It prevents overfitting from giving fake, high scores. It makes sure we measure generalization to new data. It should match the domain we care about.

#language model evaluation#perplexity#intrinsic evaluation#extrinsic evaluation#branching factor#held-out test set#machine translation#text generation#BLEU#ROUGE#METEOR#TER#n-gram overlap#human evaluation#fluency#coherence#relevance#negative log-likelihood#tokenization
  • BLEU compares n‑gram overlap between system output and reference text, with a brevity penalty to discourage very short outputs. It is common for machine translation and often used for generation. However, BLEU can miss meaning if synonyms or paraphrases are used (e.g., 'I enjoy felines' vs. 'I like cats'). So, a low BLEU does not always mean a bad output.
  • ROUGE focuses on recall—how much of the reference appears in the system output. It’s popular for summarization where covering key content matters. Variants like ROUGE‑1 (unigrams), ROUGE‑2 (bigrams), and ROUGE‑L (longest common subsequence) capture different aspects. Still, ROUGE doesn’t fully understand meaning.
  • METEOR tries to be more meaning-aware than BLEU by matching stems and synonyms. This helps when wording differs but meaning aligns. It aligns words between hypothesis and reference and scores based on matches and penalties for fragmentation. It can correlate better with human judgments than BLEU in some cases.
  • TER (Translation Edit Rate) counts the edits needed to transform a system’s translation into the reference. Fewer edits (lower TER) mean better quality. It is intuitive like a 'distance' measure but can still miss meaning-preserving paraphrases. It’s often used alongside BLEU.
  • Automatic metrics are great for quick comparisons and optimization loops. But they can reward surface similarity and punish valid paraphrases. Human review remains important for checking meaning, naturalness, and usefulness. A balanced evaluation plan mixes intrinsic metrics, automatic extrinsic metrics, and human judgment.
  • Perplexity can be low while generated text is repetitive or unhelpful. This happens because perplexity measures next-token prediction, not overall discourse quality. Therefore, do not rely on one metric alone. Combine metrics and include qualitative review to avoid surprises.
  • Always evaluate on a held‑out test set to avoid overfitting. Do not peek at the test set while tuning; use a validation set instead. Keep tokenization and vocabulary consistent across models when comparing perplexity. Record your evaluation setup for reproducibility.
  • In practice, create an evaluation pipeline: compute perplexity on a test set, run task-specific benchmarks, and collect both automatic scores and human ratings. Use the same data splits and scripts for fair comparisons. Check that improvements are statistically meaningful, not random noise. This workflow gives a trustworthy picture of model quality.
  • The main message is to measure what matters for your goal. If you need polished writing, emphasize human fluency ratings. If you need translation accuracy, consider BLEU and TER together with human adequacy and fluency checks. Let your end use guide which metrics carry the most weight.
02 Key Concepts

    • 01

      Intrinsic vs. Extrinsic Evaluation: Intrinsic evaluation measures the model directly by checking how well it predicts text, while extrinsic evaluation measures how well the model supports downstream tasks. Intrinsic is fast and clean because it only needs a test set and the model’s probabilities. Extrinsic is realistic because it reflects actual use cases like translation and summarization. However, intrinsic scores may not always predict task performance, and extrinsic tests can be slow to set up and run. For a complete picture, both views are necessary.

    • 02

      Perplexity (Definition and Intuition): Perplexity summarizes the model’s average surprise on a test set, normalized by sequence length. It’s computed from the model’s assigned probability to the entire test sequence. Lower perplexity means higher model confidence on the test data. Intuitively, it’s like the number of equally likely choices the model considers at each step. This makes it a handy, single-number way to compare models when everything else is equal.

    • 03

      Perplexity Formula and Normalization: The perplexity of a sequence W = w1..wn is P(W)^(-1/n), which equals the product over i of P(wi | w1..wi-1)^(-1/n). The negative one over n exponent normalizes for length so longer test sets aren’t unfairly penalized. This also makes perplexity comparable across test sets of different sizes. The chain rule links sequence probability to conditional probabilities of each token. Using log space, perplexity can be related to average negative log-likelihood.

    • 04

      Why Use a Held-Out Test Set: Evaluating on training data leads to overfitting and misleading scores. A model can memorize training text and look excellent on it while failing on new, unseen data. A held‑out test set checks generalization—the real goal of modeling. Never tune hyperparameters on the test set to avoid ā€œtest set leakage.ā€ Use a separate validation set for tuning and the test set once at the end.

    • 05

      Perplexity as Branching Factor: Reading perplexity as branching factor makes it concrete. A perplexity of 100 suggests the model behaves as if 100 next tokens are equally likely on average. This frames how ā€œconfusedā€ the model is at each prediction step. Lower branching factor means sharper predictions. This intuition helps explain why a drop in perplexity is valuable even if the number itself seems abstract.

    • 06

      Vocabulary Sensitivity: Perplexity depends on the size and nature of the vocabulary and tokenization. Larger vocabularies often make perplexity higher because the model must spread probability over more tokens. Therefore, compare perplexity only when models share the same tokenization and vocabulary. Otherwise, numbers reflect tokenization choices, not model quality. Consistent preprocessing is key for fair comparisons.

    • 07

      Limits of Perplexity: Perplexity focuses on next-token prediction accuracy, not on the global quality of generated text. A model could achieve low perplexity yet repeat phrases or lack coherence in longer outputs. This gap means perplexity alone is not a full measure of usefulness. You still need extrinsic checks. Especially for generation tasks, human judgments of fluency and coherence are critical.

    • 08

      Extrinsic Evaluation Basics: Extrinsic evaluation measures how a model helps on tasks like text generation, translation, or summarization. You integrate the language model into a task pipeline and run benchmarks. This shows practical value, such as whether outputs are accurate and useful. But pipelines have many moving parts, making it hard to isolate the model’s contribution. Despite the extra work, extrinsic evaluation aligns with real-world goals.

    • 09

      Human Evaluation of Generated Text: Human raters score outputs on criteria like fluency (how natural it sounds), coherence (how well it fits together), and relevance (how on-topic it is). This approach captures meaning and nuance that automatic metrics miss. It is considered the gold standard but costs time and money. Clear rubrics and multiple raters improve reliability. Use human evaluation to validate key findings before deployment.

    • 10

      Automatic Metrics for Generation and Translation: Automatic metrics quickly compare systems using reference texts. BLEU measures n‑gram overlap with a brevity penalty, ROUGE emphasizes recall, METEOR matches stems and synonyms, and TER counts edit steps. They’re efficient for research and iteration. However, they can reward surface similarity while penalizing valid paraphrases. Always interpret them in context.

    • 11

      BLEU in Practice: BLEU computes modified n‑gram precisions (typically up to 4-grams) and applies a brevity penalty to discourage too-short outputs. It’s standard for machine translation but also used elsewhere. BLEU correlates with human judgment at system-level comparisons but can fail at sentence-level. It struggles when the system uses paraphrases or synonyms. Use BLEU with awareness of its blind spots.

    • 12

      ROUGE for Summarization: ROUGE measures how much of the reference summary appears in the system output. ROUGE‑1 (unigrams) and ROUGE‑2 (bigrams) capture content overlap, while ROUGE‑L captures longest common subsequence. It suits summarization where covering key points is essential. But ROUGE can miss cases where different wording still conveys the same meaning. Supplement with human evaluation for nuance.

    • 13

      METEOR and Meaning: METEOR aligns words between hypothesis and reference using stemming and synonyms to reward meaning preservation. It penalizes fragmented matches to favor coherent phrasing. This can align better with human judgments than BLEU in some settings. Still, it’s not perfect and depends on resources like synonym lists. Use it as one piece of the evaluation puzzle.

    • 14

      TER (Translation Edit Rate): TER counts the number of edits—insertions, deletions, substitutions, and shifts—needed to transform the system output into the reference translation. Lower TER means the system is closer to the reference. It’s intuitive like a distance metric. However, valid paraphrases can still look distant if word choices differ. Combine TER with other metrics and human checks.

    • 15

      The 'I like cats' Example for Perplexity: A toy test set with 'I like cats' and 'I hate cats' shows how to compute perplexity. You multiply the conditional probabilities of each token and take the āˆ’1/n power. Two models with different probabilities produce different perplexities (e.g., 6.68 vs. 2.85), letting you pick the better one. This concretely demonstrates the procedure. Even simple examples reveal real differences.

    • 16

      Automatic Metrics Miss Paraphrase Meaning: Consider a translation that changes 'I like cats' into 'I enjoy felines.' BLEU might score this low due to different words despite identical meaning. This highlights a key limitation: surface overlap is not the same as semantic equivalence. Human evaluators would likely rate it good. So, never depend on automatic metrics alone.

    • 17

      Evaluation 'In the Wild': Extrinsic evaluation is sometimes called evaluation in the wild because it measures how the model behaves in real applications. It factors in messy inputs, domain shifts, and user expectations. This approach tests whether the model is actually useful. While harder to run, it reveals practical strengths and weaknesses. It’s the ultimate check before deployment.

    • 18

      End-to-End Evaluation Strategy: A robust plan includes intrinsic perplexity, automatic task metrics, and human ratings. Start with perplexity on a held‑out test set to screen models. Then run task benchmarks using standardized scripts and references. Finally, confirm findings with human evaluation. This layered approach reduces the risk of being misled by any single metric.

    • 19

      Fair Comparison Practices: Keep tokenization and vocabulary identical when comparing perplexity. Use the same data splits, decoding settings, and references when comparing task metrics. Avoid peeking at the test set during tuning to prevent overfitting. Document settings to ensure reproducibility. These habits make results trustworthy.

    03 Technical Details

    Overall Architecture/Structure of Evaluation

    1. What are we evaluating?
    • Object: A language model (LM) that predicts the next token’s probability given prior context: P(x_t | x_<t).
    • Goal: Measure how good those predictions are and how useful the LM is when embedded in real tasks.
    • Views: Intrinsic (directly about probabilities) and extrinsic (about downstream task performance).
    2. How does data flow during evaluation?
    • Intrinsic: Feed a held‑out test set into the model; for each position i, record P(w_i | w_1..w_{i-1}); combine these into a single score (perplexity).
    • Extrinsic: Plug the LM into an application pipeline (e.g., translation). Generate outputs on a benchmark dataset, then score with automatic metrics and/or human judges.
    • Reporting: Aggregate scores (e.g., average BLEU across the test set; average human ratings) and compare across models.
    3. Role of each component
    • Language Model: Produces conditional probabilities or generates tokens used to compute metrics.
    • Test Set: Provides unseen text for measuring generalization; must match the target domain if possible.
    • Metric Computation: Converts raw outputs and reference texts into numeric scores (perplexity, BLEU, ROUGE, METEOR, TER).
    • Human Evaluation: Supplies qualitative judgment that captures meaning, fluency, and coherence not fully reflected by automatic metrics.

    Intrinsic Evaluation: Perplexity

    Definition and Formula

    • Given a test sequence W = w1, w2, …, wn, define the sequence probability via the chain rule: P(W) = Ī _{i=1..n} P(w_i | w_1..w_{i-1}).
    • Perplexity (PPL) is the inverse probability normalized by length: PPL(W) = P(W)^{āˆ’1/n}.
    • Equivalently, working in logs: log P(W) = Ī£ log P(w_i | context). Average negative log-likelihood (NLL) is āˆ’(1/n) Ī£ log P(w_i | context). Then PPL(W) = exp(average NLL) if logs are natural. This reveals perplexity as a smooth transform of average NLL.

    Interpretation

    • Perplexity measures how surprised the model is by the test text on average. Lower is better because it means higher assigned probability.
    • Branching factor view: If a model always chose uniformly among B next tokens, its perplexity would be B. So, PPL ā‰ˆ effective number of equally likely choices per step.

    Normalization Rationale

    • Without normalization, longer sequences would have smaller probabilities (due to multiplying many numbers < 1), unfairly lowering P(W). The āˆ’1/n exponent neutralizes length so we can compare across test sets and models.

    Practical Computation Steps

    • Step 1: Ensure consistent tokenization and vocabulary across models if you plan to compare PPL.
    • Step 2: Prepare a held‑out test set (never used in training or tuning).
    • Step 3: For each position i in the test set, compute P(w_i | w_1..w_{iāˆ’1}). Accumulate log probabilities to avoid numerical underflow.
    • Step 4: Compute average NLL = āˆ’(1/n) Ī£ log P(w_i | context).
    • Step 5: Compute PPL = exp(average NLL). Report this number; lower is better.
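The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full pipeline: `log_probs` stands in for whatever per-token natural-log probabilities your model produces on the test set.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities.

    Accumulating in log space (Step 3) avoids underflow from
    multiplying many probabilities < 1.
    """
    avg_nll = -sum(log_probs) / len(log_probs)  # Step 4: average NLL
    return math.exp(avg_nll)                    # Step 5: PPL = exp(avg NLL)

# Sanity check: a model that always spreads probability uniformly
# over B next-token choices has perplexity exactly B,
# which is the branching-factor reading of PPL.
B = 100
uniform = [math.log(1.0 / B)] * 10
print(perplexity(uniform))  # 100.0 (up to float rounding)
```

The uniform-model check also doubles as a quick unit test for any perplexity implementation.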

    Toy Example Walkthrough

    • Test set: ā€œI like cats.ā€ ā€œI hate cats.ā€ Treat the tokens as [I, like, cats, I, hate, cats] with n = 6, or use n = 4 under the lecture’s simplified token counting; the core idea is unchanged either way.
    • Model 1: P(I)=0.25, P(like|I)=0.5, P(hate|I)=0.5, P(cats|like)=0.1, P(cats|hate)=0.1. Multiply the appropriate conditionals for each sentence and normalize with āˆ’1/n.
    • Model 2: P(I)=0.25, P(like|I)=0.3, P(hate|I)=0.7, P(cats|like)=0.8, P(cats|hate)=0.3. Repeat the calculation. The resulting perplexities (e.g., 6.68 vs. 2.85) show Model 2 is better on this test set because it assigns higher probability to the observed continuations.
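Running the toy example with the six-token reading gives concrete numbers. (With n = 6 the absolute values come out differently from the lecture’s 6.68 and 2.85, which use a different token count, but the ranking is the same: Model 2 wins.)

```python
import math

def ppl(probs):
    # PPL = (Ī  p_i)^(-1/n), computed in log space to avoid underflow
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Conditional probabilities for the six tokens [I, like, cats, I, hate, cats]
model1 = [0.25, 0.5, 0.1, 0.25, 0.5, 0.1]
model2 = [0.25, 0.3, 0.8, 0.25, 0.7, 0.3]

print(round(ppl(model1), 2), round(ppl(model2), 2))
assert ppl(model2) < ppl(model1)  # Model 2 is less surprised by the test set
```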

    Important Caveats

    • Vocabulary Sensitivity: Larger vocabularies tend to raise perplexity because probability mass is spread over more tokens. Always keep tokenization and vocabulary fixed when comparing PPL across models.
    • Domain Mismatch: A model trained on one domain (e.g., news) may show higher perplexity on another (e.g., poetry). Perplexity then reflects distribution shift, not only model quality.
    • Not a Fluency Guarantee: Low perplexity does not ensure non-repetitive, globally coherent generations; it only reflects local prediction quality.

    Extrinsic Evaluation: General Workflow

    Pipeline Setup

    • Choose a task (e.g., text generation, machine translation) and a benchmark dataset (with references when needed).
    • Integrate the LM: For generation, use the LM to produce outputs given prompts/inputs; for translation, map source sentences to target language outputs.
    • Decide evaluation methods: Select automatic metrics suitable for the task and plan a human evaluation protocol when quality matters.

    Human Evaluation

    • Criteria: Common axes include fluency (naturalness of language), coherence (logical flow and consistency), and relevance/adequacy (does it answer or translate the content).
    • Scales: Likert scales (1–5 or 1–7) or pairwise comparisons (A vs. B) help reduce variability.
    • Procedure: Randomize and blind examples so raters do not know which model produced which output. Aggregate multiple raters per item to improve reliability.

    Automatic Metrics: How They Work

    BLEU (Bilingual Evaluation Understudy)

    • Purpose: Measures n‑gram overlap between system output (hypothesis) and one or more references.
    • Mechanics: Compute modified n‑gram precision for n = 1..4 (count hypothesis n‑grams clipped by reference counts). Combine with a brevity penalty BP to penalize overly short outputs: BP = 1 if hypothesis length > reference length, else exp(1 āˆ’ ref_len/hyp_len).
    • Score: BLEU = BP Ɨ exp(Ī£ w_n log p_n) where p_n are modified precisions and w_n are weights (often 0.25 each for up to 4-grams).
    • Strengths: Widely used, fast, and good for system-level comparisons in MT.
    • Weaknesses: Sensitive to wording differences; can undervalue valid paraphrases; less reliable at sentence-level.
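A bare-bones BLEU makes these mechanics concrete. This is a simplified, unsmoothed, single-reference sketch (standard tooling such as sacreBLEU adds smoothing and canonical tokenization); because the example sentences have only three words, the demo caps n at 2 so the n-gram lists are non-empty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4):
    """Unsmoothed single-reference sentence BLEU: modified n-gram
    precision (clipped by reference counts) times the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision collapses unsmoothed BLEU
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the hypothesis is longer than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)  # uniform weights 1/max_n

ref = "I like cats".split()
print(bleu("I like cats".split(), ref, max_n=2))      # identical: 1.0
print(bleu("I enjoy felines".split(), ref, max_n=2))  # paraphrase: 0.0
```

The second call reproduces the lecture’s paraphrase pitfall: a perfectly good paraphrase scores zero because only surface n-grams are compared.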

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

    • Purpose: Emphasizes how much of the reference content appears in the system output; widely used in summarization.
    • Mechanics: ROUGE‑1 and ROUGE‑2 compute recall over unigrams and bigrams. ROUGE‑L uses the longest common subsequence to capture sentence-level structure.
    • Strengths: Captures coverage—important for summaries.
    • Weaknesses: Penalizes concise paraphrases and stylistic diversity; not meaning-aware.
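The ROUGE variants above reduce to small computations. Below is a sketch of ROUGE-1 recall and the longest-common-subsequence core of ROUGE-L; the example sentences are illustrative, not from the lecture.

```python
from collections import Counter

def rouge1_recall(hyp, ref):
    """ROUGE-1 recall: fraction of reference unigrams covered
    by the hypothesis (counts clipped on both sides)."""
    hyp_counts = Counter(hyp)
    overlap = sum(min(c, hyp_counts[w]) for w, c in Counter(ref).items())
    return overlap / len(ref)

def lcs_len(a, b):
    # Longest common subsequence via dynamic programming (core of ROUGE-L)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
hyp = "the cat lay on the mat".split()
print(rouge1_recall(hyp, ref))       # 5 of 6 reference unigrams covered
print(lcs_len(hyp, ref) / len(ref))  # ROUGE-L recall: LCS length / ref length
```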

    METEOR

    • Purpose: Improve alignment with human judgments by matching stems, synonyms, and exact words.
    • Mechanics: Align hypothesis and reference tokens using exact, stem, and synonym matches; compute precision, recall, and a fragmentation penalty; combine into a single score.
    • Strengths: Rewards meaning-preserving rewording better than BLEU in some cases.
    • Weaknesses: Requires linguistic resources; can still miss deeper semantics.

    TER (Translation Edit Rate)

    • Purpose: Measures the number of edits needed to transform hypothesis into the reference.
    • Mechanics: Count insertions, deletions, substitutions, and shifts; normalize by reference length.
    • Strengths: Intuitive and easy to interpret as an error rate.
    • Weaknesses: Penalizes valid paraphrases and synonym swaps.
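TER’s core is a word-level edit distance. The sketch below omits the shift operation that full TER includes, so it is an approximation (an upper bound on true TER), normalized by reference length as described above.

```python
def edit_distance(hyp, ref):
    # Word-level Levenshtein distance: insertions, deletions, substitutions
    dp = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete a hypothesis word
                        dp[j - 1] + 1,    # insert a reference word
                        prev + (h != r))  # substitute (free if words match)
            prev = cur
    return dp[len(ref)]

def ter_approx(hyp, ref):
    """TER without the shift operation: edits / reference length."""
    return edit_distance(hyp, ref) / len(ref)

ref = "I like cats".split()
print(ter_approx("I like cats".split(), ref))      # 0.0: no edits needed
print(ter_approx("I enjoy felines".split(), ref))  # 2 substitutions / 3 words
```

Note how the paraphrase again looks ā€œfarā€ from the reference even though the meaning matches, which is exactly the weakness listed above.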

    Putting It Together: A Balanced Plan

    • Start with intrinsic metrics (perplexity) on held‑out test sets to screen models quickly.
    • Run extrinsic benchmarks and compute BLEU/ROUGE/METEOR/TER as appropriate.
    • Conduct human evaluations on a representative sample to assess fluency, coherence, and adequacy.
    • Compare models fairly: same data splits, same decoding settings (for generation), same references, and the same metric scripts.

    Step-by-Step Implementation Guide

    Intrinsic (Perplexity)

    1. Prepare data: Split your corpus into train/validation/test. Ensure the test set is never used for training or hyperparameter tuning.
    2. Tokenize consistently: Use the same tokenizer and vocabulary for all models you will compare.
    3. Compute probabilities: For each token in the test set, query the model for P(token | preceding tokens). Accumulate log probabilities to avoid underflow.
    4. Aggregate: Compute average negative log-likelihood = āˆ’(1/n) Ī£ log P_i.
    5. Convert to perplexity: PPL = exp(average NLL). Report with confidence intervals if possible (e.g., via bootstrapping over documents).
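The bootstrapped confidence interval mentioned in step 5 can be sketched as a percentile bootstrap over documents. The `docs` data here is hypothetical per-document token log-probabilities, standing in for your model’s real outputs.

```python
import math
import random

def corpus_ppl(doc_log_probs):
    # doc_log_probs: list of per-document lists of token log-probabilities
    logs = [lp for doc in doc_log_probs for lp in doc]
    return math.exp(-sum(logs) / len(logs))

def bootstrap_ppl_ci(doc_log_probs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample documents with replacement
    and recompute corpus perplexity for each resample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resample = [rng.choice(doc_log_probs) for _ in doc_log_probs]
        samples.append(corpus_ppl(resample))
    samples.sort()
    lo = samples[int((alpha / 2) * n_boot)]
    hi = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical log-probs (natural logs) for three test documents
docs = [[math.log(0.2)] * 5, [math.log(0.1)] * 3, [math.log(0.3)] * 4]
print(corpus_ppl(docs), bootstrap_ppl_ci(docs))
```

Resampling at the document level (rather than per token) respects the fact that tokens within a document are correlated.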

    Extrinsic (Text Generation)

    1. Choose dataset and prompts: e.g., a set of story starters or questions with reference answers if available.
    2. Generate outputs: Fix decoding settings (e.g., temperature, top‑k/top‑p, or greedy) to make comparisons fair.
    3. Automatic scoring: If references exist, compute BLEU/ROUGE/METEOR; otherwise, design human evaluation.
    4. Human evaluation: Recruit raters, create a rubric (fluency, coherence, relevance), and blind the model identities.
    5. Analyze: Report averages and, if possible, inter‑rater agreement (e.g., Cohen’s kappa) to show reliability.
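Inter-rater agreement via Cohen’s kappa (step 5) is straightforward to compute for two raters. The ratings below are hypothetical 1–5 fluency scores, purely for illustration.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(r1) == len(r2)
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal label distribution
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical fluency ratings from two raters on six model outputs
r1 = [5, 4, 4, 3, 5, 2]
r2 = [5, 4, 3, 3, 5, 2]
print(round(cohens_kappa(r1, r2), 3))
```

Kappa near 1 indicates strong agreement beyond chance; values much below ~0.6 suggest the rubric or rater training needs work.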

    Extrinsic (Machine Translation)

    1. Pick a benchmark: Use a standard parallel dataset with source sentences and one or more reference translations.
    2. Translate: Use your LM‑based translation system to produce hypotheses for each source sentence.
    3. Automatic metrics: Compute BLEU and TER (and possibly METEOR). Use standardized tooling to avoid tokenization mishaps.
    4. Human evaluation: Sample sentences for human adequacy and fluency judgments; compare models pairwise when possible.
    5. Report: Provide overall metrics and example cases illustrating strengths/weaknesses (e.g., handling of idioms).
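
    A minimal corpus-level BLEU, with modified n-gram precision for n = 1–4 and a brevity penalty, can be sketched as follows; standardized tools such as sacreBLEU additionally handle tokenization and smoothing, so prefer them for reported numbers:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Minimal corpus-level BLEU: geometric mean of modified n-gram precisions
# (n = 1..4) times a brevity penalty; one reference per sentence.
def bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Modified precision: clip hypothesis counts by reference counts.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if 0 in matches:
        return 0.0  # no smoothing: any empty n-gram level zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyps = ["the cat sat on the mat".split()]
refs = ["the cat sat on the mat".split()]
print(bleu(hyps, refs))  # identical output → 1.0
```

    Note how a valid paraphrase with different wording would score poorly here, which is exactly the weakness the paraphrase pitfall below illustrates.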

    Tips and Warnings

    • Tokenization Consistency: Changing tokenization or vocabulary changes perplexity. Keep them fixed across comparisons.
    • Data Hygiene: Don’t leak test data into training; even indirect tuning on the test set inflates results.
    • Metric Limitations: BLEU/ROUGE/METEOR/TER focus on surface overlap; always validate with human checks for critical applications.
    • Reproducibility: Seed your decoders, fix decoding parameters, log versions of metric tools, and save evaluation scripts.
    • Statistical Significance: Small metric differences can be noise. Use bootstrap resampling or significance tests to check if improvements are real.
    • Efficiency: Automatic metrics enable fast iteration. Use them to narrow candidates, then invest in human evaluation for finalists.
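
    The bootstrap check mentioned above can be sketched as a paired resampling test over per-example scores (the numbers here are illustrative, not real benchmark results):

```python
import random

# Paired bootstrap test: resample per-example metric scores with replacement
# and count how often system B beats system A on the resampled sets. A win
# fraction near 1.0 suggests the improvement is unlikely to be noise.
def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Hypothetical per-document scores for two systems on the same test set.
a = [0.31, 0.28, 0.35, 0.30, 0.29, 0.33, 0.27, 0.32]
b = [0.33, 0.30, 0.34, 0.32, 0.31, 0.35, 0.29, 0.33]
print(paired_bootstrap(a, b))
```

    Resampling the same indices for both systems is what makes the test paired; it controls for which documents happen to be easy or hard.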

    Concrete Illustrations Embedded in Practice

    • Paraphrase Pitfall: ā€œI like catsā€ vs. ā€œI enjoy felinesā€ shows BLEU’s weakness when surface forms differ but meaning matches. Human judges usually rate the paraphrase as good despite low overlap.
    • Repetition Issue: A model can have low perplexity yet produce repetitive text in free generation. This is why extrinsic, qualitative checks matter.
    • Branching Factor Insight: Moving from PPL 80 to PPL 40 halves the model’s effective branching factor, removing one bit of average uncertainty per next-token choice. This often improves task performance, but it does not guarantee fluent generation.
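
    The relationship behind this insight is PPL = 2^H, where H is the cross-entropy in bits per token, so halving perplexity removes exactly one bit of uncertainty:

```python
import math

# Perplexity and cross-entropy are two views of the same quantity:
# PPL = 2**H, with H in bits per token, so log2(PPL) recovers the bits.
def bits_per_token(ppl):
    return math.log2(ppl)

# Halving perplexity (80 -> 40) removes exactly one bit per token.
print(round(bits_per_token(80) - bits_per_token(40), 6))  # → 1.0
```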

    In summary, intrinsic evaluation with perplexity quantifies how well the model predicts text, while extrinsic evaluation with automatic and human metrics tells you how useful the model is in real tasks. Both are needed: perplexity for quick, direct insight and task metrics/human ratings for real-world validation. Use consistent setups, understand metric limitations, and triangulate with multiple methods for reliable, fair conclusions.

    05Conclusion

    Evaluation answers the essential question: how good is a language model, both in theory and in practice? Intrinsic evaluation gives a direct, fast view using perplexity, which summarizes how surprised the model is by a held‑out test set. Perplexity’s branching‑factor intuition makes it easy to grasp, but it depends on vocabulary and does not guarantee fluent or non‑repetitive generations. Extrinsic evaluation shows practical value in tasks like text generation and machine translation, using both human judgments and automatic metrics such as BLEU, ROUGE, METEOR, and TER. Automatic metrics are efficient but focus on surface overlap; human evaluation captures meaning and naturalness but is costly.

    For best results, build a balanced evaluation plan. Start with perplexity to screen models, then run task-specific benchmarks, and cap it with human reviews on a representative sample. Keep tokenization, decoding settings, and data splits consistent to ensure fair comparisons and reproducible results. Where improvements are small, use statistical tests to confirm that gains are real, not noise.

    To practice, compute perplexity on a clean test set for your current model, then evaluate its outputs on a small translation or generation benchmark with BLEU/ROUGE and a brief human rating study. As next steps, deepen your understanding of metric behavior by examining samples where metrics disagree with humans, and refine your task-specific evaluation rubrics. The core message to remember: measure what matters for your goal, and never rely on a single number. Combining intrinsic and extrinsic methods gives the clearest, most trustworthy picture of model quality.

  • āœ“Watch for vocabulary and tokenization effects. Perplexity is sensitive to how text is split and what tokens exist. Keep these settings the same when comparing models. Document them in your reports.
  • āœ“Combine metrics instead of chasing a single score. Automatic metrics are fast but imperfect, and human evaluation is rich but slow. Triangulate using both. This reduces the risk of being misled.
  • āœ“Analyze cases where metrics disagree with humans. Look at samples where BLEU is low but humans rate high (e.g., paraphrases). Use findings to refine your choice of metrics and rubrics. Share example cases in reports.
  • āœ“Use statistical checks for small gains. When improvements are small, verify they are significant. Apply bootstrap or other tests to avoid over-claiming. Report confidence intervals when possible.
  • āœ“Report both averages and qualitative examples. Numbers show trends, while examples reveal strengths and weaknesses. Include diverse cases, not only cherry-picked wins. This builds trust in your conclusions.
  • āœ“Align evaluation with deployment goals. If user satisfaction matters, weigh human ratings heavily. If exact wording is key, emphasize overlap metrics. Let the end use guide metric selection.
  • āœ“Plan for evaluation cost and speed. Use automatic metrics for rapid iteration, then spend human evaluation budget on finalists. This balances efficiency and quality. It speeds up development without sacrificing insight.
  • āœ“Document everything. Record dataset versions, splits, tokenization, decoding settings, and metric tool versions. This ensures others can reproduce your results. It also helps you debug differences later.
  • Overfitting

    When a model memorizes training data instead of learning patterns that generalize. It performs great on training examples but poorly on new ones. Testing on training data hides this problem. Using a held‑out test set reveals it. Strong regularization and more data can help reduce overfitting.

    Downstream task

    A real application where a model’s output is used to do something useful. Examples include translation, summarization, and question answering. Performance on these tasks shows practical value. Metrics or human ratings score how well the task is done. It’s the end goal of many models.

    BLEU

    An automatic metric that measures how much an output shares n‑grams with a reference text. It uses modified precision for 1‑ to 4‑grams and a brevity penalty to punish short outputs. It’s common in machine translation. It’s fast and useful for system-level comparisons. It can be unfair to paraphrases with different wording.

    ROUGE

    A family of metrics that focus on recall: how much of the reference appears in the output. ROUGE‑1 and ROUGE‑2 use unigrams and bigrams, and ROUGE‑L uses the longest common subsequence. It’s popular for summarization tasks. It values coverage of key points. It can miss meaning when paraphrases are used.
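
    A minimal single-reference ROUGE-L F-measure via longest common subsequence can be sketched as follows (official ROUGE tooling adds stemming and multi-reference support):

```python
# Length of the longest common subsequence between two token lists.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# ROUGE-L: F-measure over LCS-based precision and recall.
def rouge_l(hyp_tokens, ref_tokens):
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

hyp = "the cat was found under the bed".split()
ref = "the cat was under the bed".split()
print(round(rouge_l(hyp, ref), 3))  # → 0.923
```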
