Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 12: Evaluation | How I Study AI
šŸ“š Stanford CS336: Language Modeling from Scratch (12 / 17)

Intermediate
Stanford Online
LLM Ā· YouTube

Key Summary

  • Evaluation tells us how good a language model really is. There are two big ways to judge models: intrinsic (measure the model directly) and extrinsic (measure it through real tasks). Intrinsic is fast and clean but might not reflect real-world usefulness. Extrinsic is realistic and practical but slow and complicated to run.
  • Intrinsic evaluation for language models mainly uses perplexity. Perplexity summarizes how surprised the model is by a test set of text, normalized by length. Lower perplexity means the model assigns higher probability to the test text, which is better. Always compute perplexity on a held‑out test set, not on training data.
  • Perplexity can be understood as the model’s average branching factor. For example, a perplexity of 100 means the model acts like it has about 100 equally likely next-word choices on average. This makes the idea very intuitive, like the number of doors the model must pick from at each step. But perplexity depends on vocabulary size, so compare only models with the same tokenization and vocab.
  • An example test set 'I like cats. I hate cats.' shows how to compute perplexity by multiplying conditional probabilities and normalizing. Two models with different conditional probabilities yield different perplexities (e.g., 6.68 vs. 2.85). The lower number indicates the better model on that test set. This concretely demonstrates how to compare models.
  • Extrinsic evaluation measures how well a language model helps with tasks like text generation, summarization, and machine translation. You plug your model into the task pipeline and score the outputs. This shows the real, practical value of the model. But it’s slower and harder to isolate the model’s impact because many parts influence performance.
  • For text generation and translation, you can use human evaluation or automatic metrics. Human evaluation checks fluency, coherence, and relevance but costs time and money. Automatic metrics (BLEU, ROUGE, METEOR, TER) are cheap and fast but imperfect. The best practice is to use both.

Why This Lecture Matters

Evaluation is how you decide whether a language model is ready for real use. Engineers, researchers, and product teams need reliable ways to compare models and pick the best one for a job. Intrinsic metrics like perplexity quickly show whether a model predicts text well, which is useful during training and iteration. Extrinsic metrics and human evaluation tell you how the model performs on actual tasks such as translation, summarization, and text generation—where meaning, fluency, and usefulness matter most. This knowledge solves common problems: overfitting to training data, chasing a single score that doesn’t reflect user needs, and comparing models unfairly due to different tokenizations or setups. By using held‑out test sets, proper automatic metrics, and human judgments, you can make informed, trustworthy decisions. In real projects, this means building evaluation pipelines, establishing rubrics, and reporting with transparency, which speeds up development while avoiding costly mistakes. For your career, understanding evaluation makes you a stronger model builder and a better decision-maker. You learn to communicate results clearly, defend choices with evidence, and align metrics with product goals. In today’s industry, where models are deployed across many applications, robust evaluation is a critical skill. It ensures that improvements on paper translate into better user experiences and safer, more reliable AI systems.

Lecture Summary


01 Overview

This lecture focuses on how to evaluate language models so we can answer practical questions like: Is my model good? How do I compare two models fairly? The discussion centers on two complementary approaches: intrinsic evaluation, which measures the model directly, and extrinsic evaluation, which measures how well the model helps on real tasks. The lecture’s core goal is to help you choose the right metrics, understand what they mean, and use them together to get a reliable picture of quality.

Intrinsic evaluation is about scoring the model’s own probability assignments to text. The main tool here is perplexity, which converts the model’s probability for a test set into a single, length‑normalized number. Lower perplexity means the model is less surprised by the test text—so it predicts well. You should always compute this on a held‑out test set that the model never saw during training to measure generalization, not memorization.

Perplexity has useful intuition: it can be read as the model’s branching factor. Think of it like the average number of equally likely next‑word choices the model is juggling at each step. However, perplexity is sensitive to vocabulary size and tokenization. That means you should only compare perplexity values between models that share the same tokenization and vocabulary; otherwise, the numbers can mislead you.

Extrinsic evaluation, in contrast, checks how well the model performs when used in a real task pipeline. Example tasks include text generation (like story writing or code generation), text summarization, question answering, and machine translation. Here, the main options are human evaluation—asking people to rate fluency, coherence, and relevance—and automatic metrics like BLEU, ROUGE, METEOR, and TER. Human evaluation is high‑quality but expensive and slow; automatic metrics are fast but imperfect because they focus on surface overlaps rather than deep meaning.

The lecture highlights how common automatic metrics work and their limits. BLEU measures n‑gram overlap and applies a brevity penalty so models don’t just output very short strings; it is widely used in machine translation and sometimes text generation. ROUGE emphasizes recall and is popular for summarization, where covering reference content matters most. METEOR tries to match stems and synonyms to reward meaning-preserving changes in wording. TER counts edits needed to match the reference, offering an intuitive ā€œdistanceā€ measure. Still, these metrics don’t capture everything—like when two sentences mean the same thing but use different words (for example, ā€œI enjoy felinesā€ vs. ā€œI like catsā€).

The lecture’s practical advice is to combine methods. Use intrinsic evaluation (perplexity) to quickly understand the model’s predictive quality. Then, use extrinsic evaluation on the tasks you care about. Rely on automatic metrics for fast iteration and include human evaluation to check meaning and naturalness—especially before deployment. This balanced approach gives a truer measure of how good your model is in the real world.

This material is aimed at students who have built n‑gram, neural, LSTM, or Transformer language models and now need to measure their performance. You should know basic probability, conditional probability, and how language models predict the next token P(x_t | x_<t). No advanced math is required beyond multiplying probabilities and understanding why we normalize by sequence length.

After this lecture, you will be able to: compute and interpret perplexity; set up fair, held‑out test evaluations; understand the pros and cons of human vs. automatic metrics; and choose task‑appropriate metrics for text generation and machine translation. You will also learn why you should not rely on a single score and how to read perplexity as the model’s confusion level (branching factor). The lecture is structured as: a reminder of the language modeling objective; a clear split between intrinsic and extrinsic evaluation; a deep dive into perplexity; and an overview of text generation and machine translation evaluation with common metrics and their limitations, ending with a practical summary on combining approaches.

Key Takeaways

  • āœ“Always separate training, validation, and test data. Tune on validation and report final numbers on the held‑out test only. This prevents test leakage and keeps results honest. Keep your splits stable for reproducibility.
  • āœ“Use perplexity to quickly screen and compare models. Compute it on a held‑out test set and keep tokenization and vocabulary identical across models. Report PPL alongside average NLL to show both views. Consider confidence intervals via bootstrapping.
  • āœ“Interpret perplexity as branching factor for intuition. A lower PPL means fewer equally likely next choices on average. This helps explain why even modest PPL drops can matter. Communicate this intuition to non-technical stakeholders.
  • āœ“Do not rely on perplexity alone for generation quality. It measures next-token prediction, not global fluency or repetition control. Include extrinsic evaluations for tasks you care about. Review samples qualitatively.
  • āœ“Pick task-appropriate automatic metrics. Use BLEU and TER for machine translation and ROUGE for summarization. Add METEOR to better reward paraphrases. Choose metrics that align with your task’s needs.
  • āœ“Design a simple human evaluation rubric. Rate fluency, coherence, and relevance on a clear scale (e.g., 1–5). Blind model identities and randomize order to reduce bias. Use multiple raters per item and report agreement.
  • āœ“Standardize your evaluation pipeline. Fix tokenization, decoding parameters, and metric scripts. Use the same data splits and references across models. This makes comparisons fair and results reproducible.

Glossary

Intrinsic evaluation

A way to judge a model directly by how well it predicts text. It uses the model’s own probability outputs without involving a task like translation. It is fast to compute and easy to repeat. It gives a clean view of prediction quality but may not match real-world usefulness. Perplexity is the main intrinsic metric for language models.

Extrinsic evaluation

A way to judge a model by how well it performs in real tasks. The model is plugged into a pipeline like translation or summarization. Scores come from task outputs compared to references or human ratings. It measures actual utility but is slower and more complex. It is often called evaluation in the wild.

Perplexity (PPL)

A number that shows how surprised a model is by a test text on average. Lower is better and means the model predicted well. It is the inverse of the test text probability, normalized by length. It can be viewed as the model’s branching factor—the number of likely next choices. Perplexity depends on the vocabulary and tokenization used.

Held‑out test set

A part of data kept separate from training and tuning. It is used only for final evaluation. It prevents overfitting from giving fake, high scores. It makes sure we measure generalization to new data. It should match the domain we care about.

#language model evaluation#perplexity#intrinsic evaluation#extrinsic evaluation#branching factor#held-out test set#machine translation#text generation#BLEU#ROUGE#METEOR#TER#n-gram overlap#human evaluation#fluency#coherence#relevance#negative log-likelihood#tokenization
  • BLEU compares n‑gram overlap between system output and reference text, with a brevity penalty to discourage very short outputs. It is common for machine translation and often used for generation. However, BLEU can miss meaning if synonyms or paraphrases are used (e.g., 'I enjoy felines' vs. 'I like cats'). So, a low BLEU does not always mean a bad output.
  • ROUGE focuses on recall—how much of the reference appears in the system output. It’s popular for summarization where covering key content matters. Variants like ROUGE‑1 (unigrams), ROUGE‑2 (bigrams), and ROUGE‑L (longest common subsequence) capture different aspects. Still, ROUGE doesn’t fully understand meaning.
  • METEOR tries to be more meaning-aware than BLEU by matching stems and synonyms. This helps when wording differs but meaning aligns. It aligns words between hypothesis and reference and scores based on matches and penalties for fragmentation. It can correlate better with human judgments than BLEU in some cases.
  • TER (Translation Edit Rate) counts the edits needed to transform a system’s translation into the reference. Fewer edits (lower TER) mean better quality. It is intuitive like a 'distance' measure but can still miss meaning-preserving paraphrases. It’s often used alongside BLEU.
  • Automatic metrics are great for quick comparisons and optimization loops. But they can reward surface similarity and punish valid paraphrases. Human review remains important for checking meaning, naturalness, and usefulness. A balanced evaluation plan mixes intrinsic metrics, automatic extrinsic metrics, and human judgment.
  • Perplexity can be low while generated text is repetitive or unhelpful. This happens because perplexity measures next-token prediction, not overall discourse quality. Therefore, do not rely on one metric alone. Combine metrics and include qualitative review to avoid surprises.
  • Always evaluate on a held‑out test set to avoid overfitting. Do not peek at the test set while tuning; use a validation set instead. Keep tokenization and vocabulary consistent across models when comparing perplexity. Record your evaluation setup for reproducibility.
  • In practice, create an evaluation pipeline: compute perplexity on a test set, run task-specific benchmarks, and collect both automatic scores and human ratings. Use the same data splits and scripts for fair comparisons. Check that improvements are statistically meaningful, not random noise. This workflow gives a trustworthy picture of model quality.
  • The main message is to measure what matters for your goal. If you need polished writing, emphasize human fluency ratings. If you need translation accuracy, consider BLEU and TER together with human adequacy and fluency checks. Let your end use guide which metrics carry the most weight.
02 Key Concepts

    • 01

      Intrinsic vs. Extrinsic Evaluation: Intrinsic evaluation measures the model directly by checking how well it predicts text, while extrinsic evaluation measures how well the model supports downstream tasks. Intrinsic is fast and clean because it only needs a test set and the model’s probabilities. Extrinsic is realistic because it reflects actual use cases like translation and summarization. However, intrinsic scores may not always predict task performance, and extrinsic tests can be slow to set up and run. For a complete picture, both views are necessary.

    • 02

      Perplexity (Definition and Intuition): Perplexity summarizes the model’s average surprise on a test set, normalized by sequence length. It’s computed from the model’s assigned probability to the entire test sequence. Lower perplexity means higher model confidence on the test data. Intuitively, it’s like the number of equally likely choices the model considers at each step. This makes it a handy, single-number way to compare models when everything else is equal.

    • 03

      Perplexity Formula and Normalization: The perplexity of a sequence W = w1..wn is P(W)^(-1/n), which equals the product over i of P(wi | w1..wi-1)^(-1/n). The negative one over n exponent normalizes for length so longer test sets aren’t unfairly penalized. This also makes perplexity comparable across test sets of different sizes. The chain rule links sequence probability to conditional probabilities of each token. Using log space, perplexity can be related to average negative log-likelihood.

    • 04

      Why Use a Held-Out Test Set: Evaluating on training data leads to overfitting and misleading scores. A model can memorize training text and look excellent on it while failing on new, unseen data. A held‑out test set checks generalization—the real goal of modeling. Never tune hyperparameters on the test set to avoid ā€œtest set leakage.ā€ Use a separate validation set for tuning and the test set once at the end.

    • 05

      Perplexity as Branching Factor: Reading perplexity as branching factor makes it concrete. A perplexity of 100 suggests the model behaves as if 100 next tokens are equally likely on average. This frames how ā€œconfusedā€ the model is at each prediction step. Lower branching factor means sharper predictions. This intuition helps explain why a drop in perplexity is valuable even if the number itself seems abstract.

    • 06

      Vocabulary Sensitivity: Perplexity depends on the size and nature of the vocabulary and tokenization. Larger vocabularies often make perplexity higher because the model must spread probability over more tokens. Therefore, compare perplexity only when models share the same tokenization and vocabulary. Otherwise, numbers reflect tokenization choices, not model quality. Consistent preprocessing is key for fair comparisons.

    • 07

      Limits of Perplexity: Perplexity focuses on next-token prediction accuracy, not on the global quality of generated text. A model could achieve low perplexity yet repeat phrases or lack coherence in longer outputs. This gap means perplexity alone is not a full measure of usefulness. You still need extrinsic checks. Especially for generation tasks, human judgments of fluency and coherence are critical.

    • 08

      Extrinsic Evaluation Basics: Extrinsic evaluation measures how a model helps on tasks like text generation, translation, or summarization. You integrate the language model into a task pipeline and run benchmarks. This shows practical value, such as whether outputs are accurate and useful. But pipelines have many moving parts, making it hard to isolate the model’s contribution. Despite the extra work, extrinsic evaluation aligns with real-world goals.

    • 09

      Human Evaluation of Generated Text: Human raters score outputs on criteria like fluency (how natural it sounds), coherence (how well it fits together), and relevance (how on-topic it is). This approach captures meaning and nuance that automatic metrics miss. It is considered the gold standard but costs time and money. Clear rubrics and multiple raters improve reliability. Use human evaluation to validate key findings before deployment.

    • 10

      Automatic Metrics for Generation and Translation: Automatic metrics quickly compare systems using reference texts. BLEU measures n‑gram overlap with a brevity penalty, ROUGE emphasizes recall, METEOR matches stems and synonyms, and TER counts edit steps. They’re efficient for research and iteration. However, they can reward surface similarity while penalizing valid paraphrases. Always interpret them in context.

    • 11

      BLEU in Practice: BLEU computes modified n‑gram precisions (typically up to 4-grams) and applies a brevity penalty to discourage too-short outputs. It’s standard for machine translation but also used elsewhere. BLEU correlates with human judgment at system-level comparisons but can fail at sentence-level. It struggles when the system uses paraphrases or synonyms. Use BLEU with awareness of its blind spots.

    • 12

      ROUGE for Summarization: ROUGE measures how much of the reference summary appears in the system output. ROUGE‑1 (unigrams) and ROUGE‑2 (bigrams) capture content overlap, while ROUGE‑L captures longest common subsequence. It suits summarization where covering key points is essential. But ROUGE can miss cases where different wording still conveys the same meaning. Supplement with human evaluation for nuance.

    • 13

      METEOR and Meaning: METEOR aligns words between hypothesis and reference using stemming and synonyms to reward meaning preservation. It penalizes fragmented matches to favor coherent phrasing. This can align better with human judgments than BLEU in some settings. Still, it’s not perfect and depends on resources like synonym lists. Use it as one piece of the evaluation puzzle.

    • 14

      TER (Translation Edit Rate): TER counts the number of edits—insertions, deletions, substitutions, and shifts—needed to transform the system output into the reference translation. Lower TER means the system is closer to the reference. It’s intuitive like a distance metric. However, valid paraphrases can still look distant if word choices differ. Combine TER with other metrics and human checks.

    • 15

      The 'I like cats' Example for Perplexity: A toy test set with 'I like cats' and 'I hate cats' shows how to compute perplexity. You multiply the conditional probabilities of each token and take the āˆ’1/n power. Two models with different probabilities produce different perplexities (e.g., 6.68 vs. 2.85), letting you pick the better one. This concretely demonstrates the procedure. Even simple examples reveal real differences.

    • 16

      Automatic Metrics Miss Paraphrase Meaning: Consider a translation that changes 'I like cats' into 'I enjoy felines.' BLEU might score this low due to different words despite identical meaning. This highlights a key limitation: surface overlap is not the same as semantic equivalence. Human evaluators would likely rate it good. So, never depend on automatic metrics alone.

    • 17

      Evaluation 'In the Wild': Extrinsic evaluation is sometimes called evaluation in the wild because it measures how the model behaves in real applications. It factors in messy inputs, domain shifts, and user expectations. This approach tests whether the model is actually useful. While harder to run, it reveals practical strengths and weaknesses. It’s the ultimate check before deployment.

    • 18

      End-to-End Evaluation Strategy: A robust plan includes intrinsic perplexity, automatic task metrics, and human ratings. Start with perplexity on a held‑out test set to screen models. Then run task benchmarks using standardized scripts and references. Finally, confirm findings with human evaluation. This layered approach reduces the risk of being misled by any single metric.

    • 19

      Fair Comparison Practices: Keep tokenization and vocabulary identical when comparing perplexity. Use the same data splits, decoding settings, and references when comparing task metrics. Avoid peeking at the test set during tuning to prevent overfitting. Document settings to ensure reproducibility. These habits make results trustworthy.

    03 Technical Details

    Overall Architecture/Structure of Evaluation

    1. What are we evaluating?
    • Object: A language model (LM) that predicts the next token’s probability given prior context: P(x_t | x_<t).
    • Goal: Measure how good those predictions are and how useful the LM is when embedded in real tasks.
    • Views: Intrinsic (directly about probabilities) and extrinsic (about downstream task performance).
    2. How does data flow during evaluation?
    • Intrinsic: Feed a held‑out test set into the model; for each position i, record P(w_i | w_1..w_{i-1}); combine these into a single score (perplexity).
    • Extrinsic: Plug the LM into an application pipeline (e.g., translation). Generate outputs on a benchmark dataset, then score with automatic metrics and/or human judges.
    • Reporting: Aggregate scores (e.g., average BLEU across the test set; average human ratings) and compare across models.
    3. Role of each component
    • Language Model: Produces conditional probabilities or generates tokens used to compute metrics.
    • Test Set: Provides unseen text for measuring generalization; must match the target domain if possible.
    • Metric Computation: Converts raw outputs and reference texts into numeric scores (perplexity, BLEU, ROUGE, METEOR, TER).
    • Human Evaluation: Supplies qualitative judgment that captures meaning, fluency, and coherence not fully reflected by automatic metrics.

    Intrinsic Evaluation: Perplexity

    Definition and Formula

    • Given a test sequence W = w1, w2, …, wn, define the sequence probability via the chain rule: P(W) = Ī _{i=1..n} P(w_i | w_1..w_{i-1}).
    • Perplexity (PPL) is the inverse probability normalized by length: PPL(W) = P(W)^{āˆ’1/n}.
    • Equivalently, working in logs: log P(W) = Ī£ log P(w_i | context). Average negative log-likelihood (NLL) is āˆ’(1/n) Ī£ log P(w_i | context). Then PPL(W) = exp(average NLL) if logs are natural. This reveals perplexity as a smooth transform of average NLL.

    Interpretation

    • Perplexity measures how surprised the model is by the test text on average. Lower is better because it means higher assigned probability.
    • Branching factor view: If a model always chose uniformly among B next tokens, its perplexity would be B. So, PPL ā‰ˆ effective number of equally likely choices per step.

    Normalization Rationale

    • Without normalization, longer sequences would have smaller probabilities (due to multiplying many numbers < 1), unfairly lowering P(W). The āˆ’1/n exponent neutralizes length so we can compare across test sets and models.

    Practical Computation Steps

    • Step 1: Ensure consistent tokenization and vocabulary across models if you plan to compare PPL.
    • Step 2: Prepare a held‑out test set (never used in training or tuning).
    • Step 3: For each position i in the test set, compute P(w_i | w_1..w_{iāˆ’1}). Accumulate log probabilities to avoid numerical underflow.
    • Step 4: Compute average NLL = āˆ’(1/n) Ī£ log P(w_i | context).
    • Step 5: Compute PPL = exp(average NLL). Report this number; lower is better.
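The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full pipeline: `log_probs` stands in for whatever per-token natural-log probabilities your model produces on the test set.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities.

    Accumulating in log space (Step 3) avoids underflow from
    multiplying many probabilities < 1.
    """
    avg_nll = -sum(log_probs) / len(log_probs)  # Step 4: average NLL
    return math.exp(avg_nll)                    # Step 5: PPL = exp(avg NLL)

# Sanity check: a model that always spreads probability uniformly
# over B next-token choices has perplexity exactly B,
# which is the branching-factor reading of PPL.
B = 100
uniform = [math.log(1.0 / B)] * 10
print(perplexity(uniform))  # 100.0 (up to float rounding)
```

The uniform-model check also doubles as a quick unit test for any perplexity implementation.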

    Toy Example Walkthrough

    • Test set: ā€œI like cats.ā€ ā€œI hate cats.ā€ Treat the tokens as [I, like, cats, I, hate, cats] with n = 6, or use n = 4 under the lecture’s simplified token counting; the core idea is unchanged either way.
    • Model 1: P(I)=0.25, P(like|I)=0.5, P(hate|I)=0.5, P(cats|like)=0.1, P(cats|hate)=0.1. Multiply the appropriate conditionals for each sentence and normalize with āˆ’1/n.
    • Model 2: P(I)=0.25, P(like|I)=0.3, P(hate|I)=0.7, P(cats|like)=0.8, P(cats|hate)=0.3. Repeat the calculation. The resulting perplexities (e.g., 6.68 vs. 2.85) show Model 2 is better on this test set because it assigns higher probability to the observed continuations.
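Running the toy example with the six-token reading gives concrete numbers. (With n = 6 the absolute values come out differently from the lecture’s 6.68 and 2.85, which use a different token count, but the ranking is the same: Model 2 wins.)

```python
import math

def ppl(probs):
    # PPL = (Ī  p_i)^(-1/n), computed in log space to avoid underflow
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Conditional probabilities for the six tokens [I, like, cats, I, hate, cats]
model1 = [0.25, 0.5, 0.1, 0.25, 0.5, 0.1]
model2 = [0.25, 0.3, 0.8, 0.25, 0.7, 0.3]

print(round(ppl(model1), 2), round(ppl(model2), 2))
assert ppl(model2) < ppl(model1)  # Model 2 is less surprised by the test set
```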

    Important Caveats

    • Vocabulary Sensitivity: Larger vocabularies tend to raise perplexity because probability mass is spread over more tokens. Always keep tokenization and vocabulary fixed when comparing PPL across models.
    • Domain Mismatch: A model trained on one domain (e.g., news) may show higher perplexity on another (e.g., poetry). Perplexity then reflects distribution shift, not only model quality.
    • Not a Fluency Guarantee: Low perplexity does not ensure non-repetitive, globally coherent generations; it only reflects local prediction quality.

    Extrinsic Evaluation: General Workflow

    Pipeline Setup

    • Choose a task (e.g., text generation, machine translation) and a benchmark dataset (with references when needed).
    • Integrate the LM: For generation, use the LM to produce outputs given prompts/inputs; for translation, map source sentences to target language outputs.
    • Decide evaluation methods: Select automatic metrics suitable for the task and plan a human evaluation protocol when quality matters.

    Human Evaluation

    • Criteria: Common axes include fluency (naturalness of language), coherence (logical flow and consistency), and relevance/adequacy (does it answer or translate the content).
    • Scales: Likert scales (1–5 or 1–7) or pairwise comparisons (A vs. B) help reduce variability.
    • Procedure: Randomize and blind examples so raters do not know which model produced which output. Aggregate multiple raters per item to improve reliability.

    Automatic Metrics: How They Work

    BLEU (Bilingual Evaluation Understudy)

    • Purpose: Measures n‑gram overlap between system output (hypothesis) and one or more references.
    • Mechanics: Compute modified n‑gram precision for n = 1..4 (count hypothesis n‑grams clipped by reference counts). Combine with a brevity penalty BP to penalize overly short outputs: BP = 1 if hypothesis length > reference length, else exp(1 āˆ’ ref_len/hyp_len).
    • Score: BLEU = BP Ɨ exp(Ī£ w_n log p_n) where p_n are modified precisions and w_n are weights (often 0.25 each for up to 4-grams).
    • Strengths: Widely used, fast, and good for system-level comparisons in MT.
    • Weaknesses: Sensitive to wording differences; can undervalue valid paraphrases; less reliable at sentence-level.
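A bare-bones BLEU makes these mechanics concrete. This is a simplified, unsmoothed, single-reference sketch (standard tooling such as sacreBLEU adds smoothing and canonical tokenization); because the example sentences have only three words, the demo caps n at 2 so the n-gram lists are non-empty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4):
    """Unsmoothed single-reference sentence BLEU: modified n-gram
    precision (clipped by reference counts) times the brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision collapses unsmoothed BLEU
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the hypothesis is longer than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)  # uniform weights 1/max_n

ref = "I like cats".split()
print(bleu("I like cats".split(), ref, max_n=2))      # identical: 1.0
print(bleu("I enjoy felines".split(), ref, max_n=2))  # paraphrase: 0.0
```

The second call reproduces the lecture’s paraphrase pitfall: a perfectly good paraphrase scores zero because only surface n-grams are compared.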

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

    • Purpose: Emphasizes how much of the reference content appears in the system output; widely used in summarization.
    • Mechanics: ROUGE‑1 and ROUGE‑2 compute recall over unigrams and bigrams. ROUGE‑L uses the longest common subsequence to capture sentence-level structure.
    • Strengths: Captures coverage—important for summaries.
    • Weaknesses: Penalizes concise paraphrases and stylistic diversity; not meaning-aware.
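The ROUGE variants above reduce to small computations. Below is a sketch of ROUGE-1 recall and the longest-common-subsequence core of ROUGE-L; the example sentences are illustrative, not from the lecture.

```python
from collections import Counter

def rouge1_recall(hyp, ref):
    """ROUGE-1 recall: fraction of reference unigrams covered
    by the hypothesis (counts clipped on both sides)."""
    hyp_counts = Counter(hyp)
    overlap = sum(min(c, hyp_counts[w]) for w, c in Counter(ref).items())
    return overlap / len(ref)

def lcs_len(a, b):
    # Longest common subsequence via dynamic programming (core of ROUGE-L)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
hyp = "the cat lay on the mat".split()
print(rouge1_recall(hyp, ref))       # 5 of 6 reference unigrams covered
print(lcs_len(hyp, ref) / len(ref))  # ROUGE-L recall: LCS length / ref length
```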

    METEOR

    • Purpose: Improve alignment with human judgments by matching stems, synonyms, and exact words.
    • Mechanics: Align hypothesis and reference tokens using exact, stem, and synonym matches; compute precision, recall, and a fragmentation penalty; combine into a single score.
    • Strengths: Rewards meaning-preserving rewording better than BLEU in some cases.
    • Weaknesses: Requires linguistic resources; can still miss deeper semantics.

    TER (Translation Edit Rate)

    • Purpose: Measures the number of edits needed to transform hypothesis into the reference.
    • Mechanics: Count insertions, deletions, substitutions, and shifts; normalize by reference length.
    • Strengths: Intuitive and easy to interpret as an error rate.
    • Weaknesses: Penalizes valid paraphrases and synonym swaps.
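TER’s core is a word-level edit distance. The sketch below omits the shift operation that full TER includes, so it is an approximation (an upper bound on true TER), normalized by reference length as described above.

```python
def edit_distance(hyp, ref):
    # Word-level Levenshtein distance: insertions, deletions, substitutions
    dp = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete a hypothesis word
                        dp[j - 1] + 1,    # insert a reference word
                        prev + (h != r))  # substitute (free if words match)
            prev = cur
    return dp[len(ref)]

def ter_approx(hyp, ref):
    """TER without the shift operation: edits / reference length."""
    return edit_distance(hyp, ref) / len(ref)

ref = "I like cats".split()
print(ter_approx("I like cats".split(), ref))      # 0.0: no edits needed
print(ter_approx("I enjoy felines".split(), ref))  # 2 substitutions / 3 words
```

Note how the paraphrase again looks ā€œfarā€ from the reference even though the meaning matches, which is exactly the weakness listed above.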

    Putting It Together: A Balanced Plan

    • Start with intrinsic metrics (perplexity) on held‑out test sets to screen models quickly.
    • Run extrinsic benchmarks and compute BLEU/ROUGE/METEOR/TER as appropriate.
    • Conduct human evaluations on a representative sample to assess fluency, coherence, and adequacy.
    • Compare models fairly: same data splits, same decoding settings (for generation), same references, and the same metric scripts.

    Step-by-Step Implementation Guide

    Intrinsic (Perplexity)

    1. Prepare data: Split your corpus into train/validation/test. Ensure the test set is never used for training or hyperparameter tuning.
    2. Tokenize consistently: Use the same tokenizer and vocabulary for all models you will compare.
    3. Compute probabilities: For each token in the test set, query the model for P(token | preceding tokens). Accumulate log probabilities to avoid underflow.
    4. Aggregate: Compute average negative log-likelihood = āˆ’(1/n) Ī£ log P_i.
    5. Convert to perplexity: PPL = exp(average NLL). Report with confidence intervals if possible (e.g., via bootstrapping over documents).
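The bootstrapped confidence interval mentioned in step 5 can be sketched as a percentile bootstrap over documents. The `docs` data here is hypothetical per-document token log-probabilities, standing in for your model’s real outputs.

```python
import math
import random

def corpus_ppl(doc_log_probs):
    # doc_log_probs: list of per-document lists of token log-probabilities
    logs = [lp for doc in doc_log_probs for lp in doc]
    return math.exp(-sum(logs) / len(logs))

def bootstrap_ppl_ci(doc_log_probs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample documents with replacement
    and recompute corpus perplexity for each resample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resample = [rng.choice(doc_log_probs) for _ in doc_log_probs]
        samples.append(corpus_ppl(resample))
    samples.sort()
    lo = samples[int((alpha / 2) * n_boot)]
    hi = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical log-probs (natural logs) for three test documents
docs = [[math.log(0.2)] * 5, [math.log(0.1)] * 3, [math.log(0.3)] * 4]
print(corpus_ppl(docs), bootstrap_ppl_ci(docs))
```

Resampling at the document level (rather than per token) respects the fact that tokens within a document are correlated.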

    Extrinsic (Text Generation)

    1. Choose dataset and prompts: e.g., a set of story starters or questions with reference answers if available.
    2. Generate outputs: Fix decoding settings (e.g., temperature, top‑k/top‑p, or greedy) to make comparisons fair.
    3. Automatic scoring: If references exist, compute BLEU/ROUGE/METEOR; otherwise, design human evaluation.
    4. Human evaluation: Recruit raters, create a rubric (fluency, coherence, relevance), and blind the model identities.
    5. Analyze: Report averages and, if possible, inter‑rater agreement (e.g., Cohen’s kappa) to show reliability.
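Inter-rater agreement via Cohen’s kappa (step 5) is straightforward to compute for two raters. The ratings below are hypothetical 1–5 fluency scores, purely for illustration.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(r1) == len(r2)
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal label distribution
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical fluency ratings from two raters on six model outputs
r1 = [5, 4, 4, 3, 5, 2]
r2 = [5, 4, 3, 3, 5, 2]
print(round(cohens_kappa(r1, r2), 3))
```

Kappa near 1 indicates strong agreement beyond chance; values much below ~0.6 suggest the rubric or rater training needs work.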

    Extrinsic (Machine Translation)

    1. Pick a benchmark: Use a standard parallel dataset with source sentences and one or more reference translations.
    2. Translate: Use your LM‑based translation system to produce hypotheses for each source sentence.
    3. Automatic metrics: Compute BLEU and TER (and possibly METEOR). Use standardized tooling to avoid tokenization mishaps.
    4. Human evaluation: Sample sentences for human adequacy and fluency judgments; compare models pairwise when possible.
    5. Report: Provide overall metrics and example cases illustrating strengths/weaknesses (e.g., handling of idioms).
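
    A minimal corpus-level BLEU, with modified n-gram precision for n = 1–4 and a brevity penalty, can be sketched as follows; standardized tools such as sacreBLEU additionally handle tokenization and smoothing, so prefer them for reported numbers:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Minimal corpus-level BLEU: geometric mean of modified n-gram precisions
# (n = 1..4) times a brevity penalty; one reference per sentence.
def bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Modified precision: clip hypothesis counts by reference counts.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if 0 in matches:
        return 0.0  # no smoothing: any empty n-gram level zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

hyps = ["the cat sat on the mat".split()]
refs = ["the cat sat on the mat".split()]
print(bleu(hyps, refs))  # identical output → 1.0
```

    Note how a valid paraphrase with different wording would score poorly here, which is exactly the weakness the paraphrase pitfall below illustrates.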

    Tips and Warnings

    • Tokenization Consistency: Changing tokenization or vocabulary changes perplexity. Keep them fixed across comparisons.
    • Data Hygiene: Don’t leak test data into training; even indirect tuning on the test set inflates results.
    • Metric Limitations: BLEU/ROUGE/METEOR/TER focus on surface overlap; always validate with human checks for critical applications.
    • Reproducibility: Seed your decoders, fix decoding parameters, log versions of metric tools, and save evaluation scripts.
    • Statistical Significance: Small metric differences can be noise. Use bootstrap resampling or significance tests to check if improvements are real.
    • Efficiency: Automatic metrics enable fast iteration. Use them to narrow candidates, then invest in human evaluation for finalists.
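
    The bootstrap check mentioned above can be sketched as a paired resampling test over per-example scores (the numbers here are illustrative, not real benchmark results):

```python
import random

# Paired bootstrap test: resample per-example metric scores with replacement
# and count how often system B beats system A on the resampled sets. A win
# fraction near 1.0 suggests the improvement is unlikely to be noise.
def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Hypothetical per-document scores for two systems on the same test set.
a = [0.31, 0.28, 0.35, 0.30, 0.29, 0.33, 0.27, 0.32]
b = [0.33, 0.30, 0.34, 0.32, 0.31, 0.35, 0.29, 0.33]
print(paired_bootstrap(a, b))
```

    Resampling the same indices for both systems is what makes the test paired; it controls for which documents happen to be easy or hard.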

    Concrete Illustrations Embedded in Practice

    • Paraphrase Pitfall: ā€œI like catsā€ vs. ā€œI enjoy felinesā€ shows BLEU’s weakness when surface forms differ but meaning matches. Human judges usually rate the paraphrase as good despite low overlap.
    • Repetition Issue: A model can have low perplexity yet produce repetitive text in free generation. This is why extrinsic, qualitative checks matter.
    • Branching Factor Insight: Moving from PPL 80 to PPL 40 halves the model’s effective branching factor, removing one bit of average uncertainty per next-token choice. This often improves task performance, but it does not guarantee fluent generation.
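
    The relationship behind this insight is PPL = 2^H, where H is the cross-entropy in bits per token, so halving perplexity removes exactly one bit of uncertainty:

```python
import math

# Perplexity and cross-entropy are two views of the same quantity:
# PPL = 2**H, with H in bits per token, so log2(PPL) recovers the bits.
def bits_per_token(ppl):
    return math.log2(ppl)

# Halving perplexity (80 -> 40) removes exactly one bit per token.
print(round(bits_per_token(80) - bits_per_token(40), 6))  # → 1.0
```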

    In summary, intrinsic evaluation with perplexity quantifies how well the model predicts text, while extrinsic evaluation with automatic and human metrics tells you how useful the model is in real tasks. Both are needed: perplexity for quick, direct insight and task metrics/human ratings for real-world validation. Use consistent setups, understand metric limitations, and triangulate with multiple methods for reliable, fair conclusions.

    05Conclusion

    Evaluation answers the essential question: how good is a language model, both in theory and in practice? Intrinsic evaluation gives a direct, fast view using perplexity, which summarizes how surprised the model is by a held‑out test set. Perplexity’s branching‑factor intuition makes it easy to grasp, but it depends on vocabulary and does not guarantee fluent or non‑repetitive generations. Extrinsic evaluation shows practical value in tasks like text generation and machine translation, using both human judgments and automatic metrics such as BLEU, ROUGE, METEOR, and TER. Automatic metrics are efficient but focus on surface overlap; human evaluation captures meaning and naturalness but is costly.

    For best results, build a balanced evaluation plan. Start with perplexity to screen models, then run task-specific benchmarks, and cap it with human reviews on a representative sample. Keep tokenization, decoding settings, and data splits consistent to ensure fair comparisons and reproducible results. Where improvements are small, use statistical tests to confirm that gains are real, not noise.

    To practice, compute perplexity on a clean test set for your current model, then evaluate its outputs on a small translation or generation benchmark with BLEU/ROUGE and a brief human rating study. As next steps, deepen your understanding of metric behavior by examining samples where metrics disagree with humans, and refine your task-specific evaluation rubrics. The core message to remember: measure what matters for your goal, and never rely on a single number. Combining intrinsic and extrinsic methods gives the clearest, most trustworthy picture of model quality.

  • āœ“Watch for vocabulary and tokenization effects. Perplexity is sensitive to how text is split and what tokens exist. Keep these settings the same when comparing models. Document them in your reports.
  • āœ“Combine metrics instead of chasing a single score. Automatic metrics are fast but imperfect, and human evaluation is rich but slow. Triangulate using both. This reduces the risk of being misled.
  • āœ“Analyze cases where metrics disagree with humans. Look at samples where BLEU is low but humans rate high (e.g., paraphrases). Use findings to refine your choice of metrics and rubrics. Share example cases in reports.
  • āœ“Use statistical checks for small gains. When improvements are small, verify they are significant. Apply bootstrap or other tests to avoid over-claiming. Report confidence intervals when possible.
  • āœ“Report both averages and qualitative examples. Numbers show trends, while examples reveal strengths and weaknesses. Include diverse cases, not only cherry-picked wins. This builds trust in your conclusions.
  • āœ“Align evaluation with deployment goals. If user satisfaction matters, weigh human ratings heavily. If exact wording is key, emphasize overlap metrics. Let the end use guide metric selection.
  • āœ“Plan for evaluation cost and speed. Use automatic metrics for rapid iteration, then spend human evaluation budget on finalists. This balances efficiency and quality. It speeds up development without sacrificing insight.
  • āœ“Document everything. Record dataset versions, splits, tokenization, decoding settings, and metric tool versions. This ensures others can reproduce your results. It also helps you debug differences later.
  • Overfitting

    When a model memorizes training data instead of learning patterns that generalize. It performs great on training examples but poorly on new ones. Testing on training data hides this problem. Using a held‑out test set reveals it. Strong regularization and more data can help reduce overfitting.

    Downstream task

    A real application where a model’s output is used to do something useful. Examples include translation, summarization, and question answering. Performance on these tasks shows practical value. Metrics or human ratings score how well the task is done. It’s the end goal of many models.

    BLEU

    An automatic metric that measures how much an output shares n‑grams with a reference text. It uses modified precision for 1‑ to 4‑grams and a brevity penalty to punish short outputs. It’s common in machine translation. It’s fast and useful for system-level comparisons. It can be unfair to paraphrases with different wording.

    ROUGE

    A family of metrics that focus on recall: how much of the reference appears in the output. ROUGE‑1 and ROUGE‑2 use unigrams and bigrams, and ROUGE‑L uses the longest common subsequence. It’s popular for summarization tasks. It values coverage of key points. It can miss meaning when paraphrases are used.
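
    A minimal single-reference ROUGE-L F-measure via longest common subsequence can be sketched as follows (official ROUGE tooling adds stemming and multi-reference support):

```python
# Length of the longest common subsequence between two token lists.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# ROUGE-L: F-measure over LCS-based precision and recall.
def rouge_l(hyp_tokens, ref_tokens):
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

hyp = "the cat was found under the bed".split()
ref = "the cat was under the bed".split()
print(round(rouge_l(hyp, ref), 3))  # → 0.923
```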
