Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 14: Data 2 | How I Study AI
📚 Stanford CS336: Language Modeling from Scratch (Lecture 14 of 17)
Intermediate · Stanford Online · LLM · YouTube

Key Summary

• The lecture explains why rare words are a core challenge in language modeling. Most corpora follow Zipf’s law, where a few words appear very often and a huge number appear very rarely. Rare words make probability estimates unreliable and inflate vocabulary size, which increases memory and slows training and inference.
• One simple fix is word replacement: swap all words that occur below a threshold (like fewer than 5 times) with a special token such as <UNK>. This shrinks the vocabulary and lets the model learn a stable probability for unknown words. Variants include part-of-speech-aware tokens like <UNK-NOUN> and statistical methods to avoid replacing informative words.
• Subword tokenization breaks words into smaller, reusable pieces so rare words can be built from common parts. Byte Pair Encoding (BPE) merges the most frequent character pairs, while WordPiece merges pairs that most improve the likelihood of the data. This lets models handle unseen words by composing them from known subwords.
• Character-level models predict the next character instead of the next word or subword. This removes the out-of-vocabulary problem because the alphabet is small and fixed, but sequences become much longer and harder to learn, making training more expensive. Hybrids often combine subword tokens with character-level modeling to balance flexibility and efficiency.
• The lecture then moves to structured data and knowledge, such as database tables and knowledge graphs. Structured data contains entities and relations in explicit, clean formats, which can help models be more accurate and robust. Using this data can augment training, constrain outputs, or guide learning.
• One method is to augment training data with sentences generated from structured sources in a domain (e.g., medicine). This teaches the model domain-specific facts like symptoms of diseases or drug side effects. It boosts performance on domain tasks by filling gaps left by raw web text.

Why This Lecture Matters

Anyone building or fine-tuning language models faces the same core pain points: rare words, factual accuracy, and limited data. Product teams, research scientists, data engineers, and ML practitioners benefit from knowing how to manage these issues. Handling rare and unknown words makes models faster, smaller, and more reliable, which matters in production where latency and cost are real constraints. Subwords and sensible unknown-token strategies reduce failures on new names, typos, and domain terms. Structured knowledge integration improves correctness for question answering, search assistants, and domain tools (healthcare, finance, law), where wrong answers have real consequences. Knowledge graph embeddings add reasoning power and enable discovery and validation. Data augmentation closes gaps when labeled or in-domain data is scarce, improving robustness to paraphrasing, noise, and incomplete inputs—key for chatbots, classification, and translation systems used by diverse users. These techniques also help career growth: they are foundational skills that show up across modern NLP pipelines and interviews. In an industry where factuality, efficiency, and reliability are decisive, mastering rare word handling, structured data use, and smart augmentation equips you to build LMs that perform better in the real world and scale with evolving user demands.

Lecture Summary


01 Overview

This lecture focuses on three pillars of data handling for language modeling: (1) managing rare and unknown words, (2) incorporating structured data and knowledge into models, and (3) using data augmentation to strengthen generalization. It begins by revisiting how text data differs from uniform datasets because of Zipf’s law: a small number of words appear extremely often, while an enormous number occur rarely or even once. This long tail makes it hard to estimate reliable probabilities for many words, inflates vocabulary size, and strains memory and compute. The lecture then offers practical solutions with clear trade-offs: simple word replacement with a special token, subword tokenization (such as Byte Pair Encoding and WordPiece), and character-level models, often combined for balance.

The second part turns to structured data and knowledge. Instead of learning only from raw, messy text, the lecture highlights resources like database tables and knowledge graphs that encode facts and relations directly. It outlines three ways to use these: augment training data with structured facts, constrain outputs using knowledge graphs to prevent invalid answers, and guide learning with curated parallel datasets (e.g., for translation). The lecture also introduces knowledge graph embeddings, with TransE as a core example: entities and relations are mapped to vectors so that relations act like translations in vector space, enabling reasoning, question answering, and even discovery of missing links.

Since real-world knowledge graphs are often noisy or incomplete, the lecture addresses resilience strategies. Robust embedding methods like DistMult and ComplEx can better handle noise and diverse relation types. Knowledge graph completion techniques predict missing facts based on existing patterns, while cleaning methods find and fix likely errors, sometimes with human review.

Finally, the lecture covers data augmentation—creating new training examples from existing ones to improve model robustness, especially when training data is scarce. It details back translation (translate text to a second language and back), synonym replacement (swap words for meaning-preserving alternatives), random insertion (add extra words to simulate noise), and random deletion (remove words to simulate incomplete input). It also points to advanced techniques like contextual augmentation using language models and GAN-based synthetic data generation. Each augmentation method aims to teach models to cope with variations in phrasing, noise, and missing information without drifting away from the original meaning.

This lecture is intended for students with a basic understanding of language modeling concepts—such as tokens, vocabularies, and probabilities—and some familiarity with how models are trained from text. Prior exposure to tokenization (covered previously) helps, as today’s material builds on those foundations. The content serves both those building classic n-gram or neural language models and those interested in modern large language models, as the data issues and solutions apply widely.

After completing this lecture, you will be able to: explain why rare words are challenging and choose among replacement, subword, or character-level strategies; understand how and when to integrate structured data to improve accuracy and reliability; describe and implement knowledge graph embeddings like TransE (and know when to reach for DistMult or ComplEx); and select and apply data augmentation methods that fit your task without breaking semantics. The lecture is structured in three segments—rare/unknown words, structured data and knowledge, and data augmentation—with examples such as “unbelievable” split into subwords, a QA constraint for “What is the capital of France?”, and augmentation edits to “The cat sat on the mat.” It balances conceptual explanations, everyday analogies, and practical pipelines so you can apply these techniques in real projects.

Key Takeaways

• ✓ Start with subword tokenization as your default. Train a BPE or WordPiece model on your corpus with a target vocabulary around 30k–50k. Check how often unknown or long token sequences appear and adjust size accordingly. Subwords give strong coverage with manageable compute.
• ✓ Use <UNK> replacement when you stick with word-level vocabularies or for safety nets. Choose a frequency threshold that meaningfully reduces vocabulary without erasing critical domain terms. Consider POS-aware unknown tokens to preserve grammar hints. Validate the impact on a dev set before full training.
• ✓ Avoid over-replacing informative rare words. Use simple statistics (e.g., TF-IDF or domain term lists) to keep key terms while replacing typos and noise. This maintains domain accuracy while shrinking the vocabulary. Always sample and inspect replacements for sanity.
• ✓ Remember the compute trade-offs of character-level models. Longer sequences mean higher memory and slower training, especially for transformers. If you need character coverage, consider hybrid architectures. Profile training to confirm feasibility on your hardware.
• ✓ Leverage structured data to boost factuality. Generate natural language statements from databases or KGs to augment training. Mix them in gradually to avoid style drift. Track domain task metrics to measure benefit.
• ✓ Constrain outputs for tasks where correctness matters. For QA, build candidate sets from a knowledge graph or verify answers post-generation. Constraints reduce nonsense and increase trust. Provide fallbacks when constraints are empty or uncertain.

Glossary

Zipf's law

A rule that says a few items are very common while most are very rare. In language, a small set of words appear a lot, and a huge number appear only a few times. This creates a long tail of rare words. It makes it hard for models to learn good statistics for many words. It also increases vocabulary size and memory needs.

Rare words

Words that appear very few times in a dataset. The model has little evidence to estimate their probabilities. This can lead to poor predictions when those words appear. They also increase vocabulary size and slow down training. Handling them is key to better models.

Vocabulary

The list of all tokens a model knows how to handle. Bigger vocabularies need more memory and make the output layer larger. Smaller vocabularies can miss important words or details. Finding a good size is important for speed and accuracy. Tokenization choices control this size.

Tokenization

The process of splitting text into pieces called tokens. Tokens can be words, subwords, or characters. Good tokenization helps the model learn patterns. Poor tokenization makes learning harder. Choosing the right level affects performance.

#zipf's law#rare words#unknown token#subword tokenization#byte pair encoding#wordpiece#character-level model#knowledge graph#transE#distmult#complex#knowledge graph embeddings#data augmentation#back translation#synonym replacement#random insertion#random deletion#contextual augmentation#parallel corpus
• Another method is to constrain the model’s answers using a knowledge graph. For a question like “What is the capital of France?”, the system restricts outputs to valid capitals, cutting down nonsensical answers. Constraints serve as guardrails to increase reliability.
• Structured data can also guide learning through aligned data like a parallel corpus for translation. Paired sentences in different languages teach the model how to map meaning across languages. This improves fluency and accuracy for machine translation tasks.
• Knowledge graph embeddings turn nodes (entities) and edges (relations) into vectors that preserve structure. TransE models a relation as a translation: e_head + e_relation ≈ e_tail (e.g., Paris + is_capital_of ≈ France). These embeddings can support better reasoning, question answering, and discovery of new links.
• Noisy or incomplete knowledge graphs are common, and the lecture discusses coping strategies. More robust embedding methods like DistMult and ComplEx better handle noise and richer relation patterns. Knowledge graph completion infers missing facts, and cleaning detects and fixes likely errors.
• Data augmentation creates new training examples from existing ones to improve generalization, especially with limited data. Back translation translates text to another language and back, producing varied yet faithful sentences. It teaches the model to be robust to paraphrasing and phrasing shifts.
• Simple text edits like synonym replacement, random insertion, and random deletion increase resilience to wording and noise. For example, replacing “sat” with “perched,” inserting “fluffy,” or deleting “the” from “The cat sat on the mat” introduces variety. More advanced methods include contextual augmentation with language models and even GANs for synthetic data.
• Each technique has trade-offs: <UNK> is simple but loses meaning, subwords balance coverage and efficiency, and char-level handles any string but costs more compute. Structured knowledge boosts correctness but requires maintenance and noise handling. Augmentation improves robustness but must preserve meaning to avoid hurting performance.
• The lecture’s practical message is to pick methods based on needs and constraints. Start with subwords for modern LMs and add <UNK> rules for edge cases. Use structured data where accuracy and consistency matter, and apply targeted augmentation when data is scarce.
02 Key Concepts

    • 01

      Zipf’s Law in Language Data: Zipf’s law says a few words are used a lot and a massive number are used very rarely. Think of it like city sizes: a few giant cities and many small towns. In text, this means you have many words appearing once or only a few times, making it hard to learn good probabilities for them. With unreliable counts, models misjudge rare word likelihoods and produce worse predictions. It also blows up vocabulary size, making training slower and memory usage higher.
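The long tail is easy to check on any corpus. A minimal sketch, using an illustrative sample text (not from the lecture):

```python
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat by the door "
    "a rare aardvark wandered past the quiet bungalow"
)
counts = Counter(text.split())

# Hapax legomena: word types that occur exactly once.
hapaxes = [w for w, c in counts.items() if c == 1]
print(counts.most_common(2))  # a few words dominate the counts
print(len(hapaxes), "of", len(counts), "word types occur only once")
```

Even in this tiny sample, most word types are hapaxes; real corpora show the same shape at scale.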

    • 02

      Why Rare Words Hurt Models: Rare words give too little data to estimate accurate probabilities for n-grams or neural embeddings. It’s like trying to judge a movie from one scene—you don’t have enough to be confident. This leads to poor predictions and mistakes in generation or translation when rare words appear. Large vocabularies also increase parameter counts, slow lookups, and complicate softmax layers. Managing rare words is key to improving both accuracy and efficiency.

    • 03

      Word Replacement with <UNK>: Word replacement swaps all low-frequency words (e.g., frequency < 5) for a single token like <UNK>. It’s like putting all unknown spices in a jar labeled “spice” when cooking—you lose the specific flavor, but the recipe still works. Technically, it reduces vocabulary size and lets the model learn a stable probability for unknown words. This improves robustness to unseen words and helps with memory and speed. The cost is loss of meaning for specific rare words.

    • 04

      Part-of-Speech-Aware Unknown Tokens: Instead of one <UNK>, use tags like <UNK-NOUN> or <UNK-VERB> to keep grammatical hints. It’s like knowing the mystery ingredient is a vegetable or a spice, even if you don’t know which one. This preserves some semantic and syntactic information, which can help tasks like translation. Implementation requires a part-of-speech tagger to label rare words before replacement. You retain efficiency while improving grammatical correctness.

    • 05

      Selective Replacement via Statistics: Rather than a fixed frequency cutoff, use tests to avoid replacing informative low-frequency terms. Imagine keeping rare but crucial medical terms while discarding typos. Techniques can compare expected vs observed frequencies to flag which words deserve preservation. This balances vocabulary control with domain accuracy. The result is fewer harmful replacements and better downstream performance.

    • 06

      Subword Tokenization: Subword tokenization breaks words into common pieces, like “un + believe + able.” It’s like building words from Lego bricks, so even rare castles can be made from common blocks. Algorithms such as BPE and WordPiece learn which pieces are most useful. This covers unseen words by composing them from known parts, improving probability estimates. It keeps vocabularies compact while handling novelty well.

    • 07

      Byte Pair Encoding (BPE): BPE merges the most frequent character pairs again and again to create subword units. Picture repeatedly gluing the two puzzle pieces that most often sit together. The process continues until you reach a target vocabulary size, giving a practical balance between granularity and coverage. At inference, words are split into the learned units, enabling representation of rare words. BPE is simple, fast, and widely used.

    • 08

      WordPiece: WordPiece chooses merges that most increase training data likelihood instead of just frequency. It’s like picking puzzle merges that make the overall picture clearest, not just the most common pair. This optimization can yield subword sets that better model language structure. WordPiece often improves handling of morphology and compound words. It remains conceptually similar to BPE but with a different selection rule.

    • 09

      Character-Level Models: Character-level models predict the next character, eliminating out-of-vocabulary issues. It’s like reading letter by letter—slow but never stuck on a new word. Sequences are much longer, making training more expensive and long-range dependencies harder to learn. Despite cost, character models handle typos, creative spellings, and rare names naturally. They are often combined with subword models for balance.

    • 10

      Hybrid Subword + Character Approaches: Hybrids use subwords for efficiency and characters for flexibility. Think of subwords as main roads and characters as alleys that reach every house. One method predicts characters within subwords, capturing fine details without huge vocabularies. This reduces OOV problems while keeping computations manageable. It benefits domains with many new terms, like user-generated text.

    • 11

      Structured Data and Knowledge: Structured data organizes facts clearly in rows, columns, and graph edges. It’s like a well-labeled library catalog instead of a pile of books. Using structured data can improve model accuracy, consistency, and factuality. It helps when raw text is noisy or incomplete. Models can learn or be guided by clean relationships and attributes.

    • 12

      Augmenting Training with Structured Data: We can generate sentences from domain databases (e.g., medical facts) and add them to training. It’s like adding notes from a reliable encyclopedia to your study guide. This fills blind spots left by web text and improves domain performance. The method is especially useful in specialized areas like healthcare or law. It makes models more knowledgeable and precise in those domains.

    • 13

      Constraining Outputs with Knowledge Graphs: Knowledge graphs can act as guardrails to restrict model answers. For “What is the capital of France?”, only known capitals are allowed as candidates. This reduces nonsense outputs and improves trustworthiness. The constraint step checks generated answers against graph facts. It’s a practical way to increase factual accuracy in QA systems.

    • 14

      Guiding Learning with Parallel Corpora: Parallel corpora provide aligned sentence pairs across languages, guiding translation models. It’s like having a bilingual tutor point to the exact match for each sentence. The model learns how ideas map between languages, improving fluency and accuracy. This guidance reduces reliance on noisy, unaligned web text. It’s a cornerstone of machine translation training.

    • 15

      Knowledge Graph Embeddings (KGE): KGE turns entities and relations into vectors that preserve graph structure. Imagine arranging points so that connected items are near in meaningful ways. With vectors for nodes and edges, models can measure plausibility of facts and infer new ones. Embeddings support QA, link prediction, and reasoning. They bridge structured knowledge with neural models.

    • 16

      TransE: TransE models a relation as a vector that translates the head entity to the tail entity (e_head + e_rel ≈ e_tail). It’s like an arrow from Paris through “is_capital_of” that lands near France. This simple geometry enables fast training and intuitive scoring of triples. It works well for one-to-one relations but struggles with more complex patterns. Still, it’s a foundational, widely used approach.

    • 17

      Handling Noisy or Incomplete Knowledge Graphs: Real knowledge graphs have mistakes and gaps. Robust embedding models like DistMult and ComplEx better handle symmetric, asymmetric, and noisy relations. Knowledge graph completion infers likely missing edges based on patterns in embeddings. Cleaning methods detect suspicious triples for human review and correction. These steps make structured knowledge more reliable for downstream use.

    • 18

      Data Augmentation Basics: Data augmentation creates new, slightly different training examples from existing ones. It’s like practicing a song in different keys so you can play it anywhere. The goal is to improve generalization, especially with small datasets. Care is needed to keep the meaning intact, or models may learn the wrong patterns. Done right, it improves robustness to phrasing, noise, and missing words.

    • 19

      Back Translation: Back translation translates text to another language and back to create paraphrases. It’s like telling a story in French and then retelling it in English with slightly different phrasing. The augmented sentences keep meaning but vary surface form, teaching robustness to rewording. This technique is especially effective in translation and summarization tasks. It scales with available translation systems.

    • 20

      Synonym Replacement: Synonym replacement swaps words with similar-meaning alternatives. Replacing “sat” with “perched” keeps the idea but changes the surface. This builds a model’s tolerance for varied wording. Care is needed to choose context-appropriate synonyms to avoid changing meaning. It’s simple, fast, and useful for many text tasks.

    • 21

      Random Insertion and Deletion: Random insertion adds extra words like “fluffy,” and random deletion removes words like “the.” These perturbations teach the model to handle noise and incomplete input. They simulate real-world text issues, such as typos or clipped messages. Overuse can damage grammar and semantics, so moderate application matters. The goal is robustness without confusion.

    • 22

      Advanced Augmentation (Contextual and GANs): Contextual augmentation uses a language model to insert or replace words that fit the sentence. GANs can generate synthetic text similar to the training data. These advanced methods diversify data while trying to keep meaning consistent. They require quality control to avoid drifting from ground truth. When curated, they can meaningfully boost performance.

03 Technical Details

    Overall Architecture/Structure

    1. Managing Rare and Unknown Words
    • Definition and challenge: In natural language data, word frequencies follow Zipf’s law, so many words appear only a few times or once. Classic count-based models and modern neural models both struggle when there is too little data to estimate a word’s distribution. Large vocabularies also balloon parameter counts (e.g., softmax output layers) and memory.
    • High-level solution set: (a) Replace rare words with a special token; (b) use subword tokenization to build rare words from common parts; (c) drop to the character level to eliminate out-of-vocabulary entirely; (d) combine these for a balanced approach.
    • Data flow: raw corpus → tokenization strategy (word/subword/char) → vocabulary construction and frequency stats → rare word handling (replacement or subword decomposition) → training data preparation → model training.
2. Incorporating Structured Data and Knowledge
    • Definition: Structured data includes database tables and knowledge graphs where entities and relations are explicit. This structure can improve factuality, consistency, and coverage.
    • Integration modes: (a) Augment training with sentences or templates derived from structured sources; (b) Constrain outputs using knowledge graphs as a filter or candidate set; (c) Guide learning using aligned datasets (e.g., parallel corpora for translation).
    • Data flow: structured source → transformation/generation → augmented training set or constraint module or supervised pairing → model training/inference with constraints.
3. Data Augmentation
    • Definition: Generate new training samples from existing ones while preserving meaning. Objectives include improved generalization, robustness to paraphrase, noise tolerance, and reduced overfitting.
    • Techniques: back translation; synonym replacement; random insertion/deletion; contextual augmentation; GAN-based synthetic data. Data flow: original text → augmentation pipeline(s) → combined dataset → training.

Code/Implementation Details (Conceptual)

A) Word Replacement (<UNK>)

    • Thresholding: Count word frequencies. Choose a threshold T (e.g., 5). For any word with frequency < T, replace with <UNK> in the training corpus.
    • POS-aware replacement: Run a part-of-speech tagger on sentences. For rare nouns, use <UNK-NOUN>; rare verbs → <UNK-VERB>; similarly for adjectives/adverbs if needed. You maintain a small set of unknown tokens and keep some syntactic information.
    • Statistical selection: Instead of a flat threshold, compute a significance score to decide whether a rare token carries disproportionate information (e.g., domain-specific proper nouns). If significant, keep it; otherwise replace. This may use heuristics such as term frequency–inverse document frequency (TF-IDF) or domain lists.
    • Pros/Cons: Pro: simple, memory-efficient, stable probability for unknowns. Con: discards specific semantics; repeated <UNK> tokens reduce expressivity.
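The thresholded replacement above can be sketched in a few lines; the corpus and threshold here are illustrative:

```python
from collections import Counter

def replace_rare(tokens, threshold=5, unk="<UNK>"):
    """Replace tokens occurring fewer than `threshold` times with `unk`."""
    counts = Counter(tokens)
    return [t if counts[t] >= threshold else unk for t in tokens]

corpus = "the cat sat on the mat the cat ran".split()
# "the" (3x) and "cat" (2x) survive; singletons become <UNK>.
print(replace_rare(corpus, threshold=2))
# → ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>']
```

A POS-aware variant would tag each rare token first and emit <UNK-NOUN>, <UNK-VERB>, etc., instead of a single <UNK>.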

    B) Subword Tokenization

    • BPE training procedure:
      1. Initialize the vocabulary with all characters seen in the corpus.
      2. Segment every word into characters (with boundary markers if desired).
      3. Count all adjacent symbol pairs across corpus.
      4. Merge the most frequent pair into a new symbol; update corpus by replacing that pair with its merged symbol.
      5. Repeat steps 3–4 until reaching the target vocabulary size (e.g., 30k tokens).
    • Inference (tokenization): For a new word, greedily apply the longest subwords learned during training to segment it. If needed, back off to smaller pieces down to characters.
    • WordPiece training procedure:
      1. Start with characters as tokens.
      2. Propose merges and compute which merge yields the greatest increase in the likelihood of the data under a simplified language model.
      3. Accept the best merge, update tokens, and iterate until the vocabulary reaches the target size.
    • Differences: BPE optimizes local pair frequency; WordPiece maximizes an objective related to data likelihood, often capturing morphological patterns better.
    • Parameters: target vocab size, minimum pair count, reserved special tokens (e.g., <PAD>, <UNK>, <BOS>, <EOS>), continuation markers (like ## in WordPiece) to indicate subword continuation.
    • Pros/Cons: Pro: handles unseen words, keeps vocabulary manageable, improves generalization. Con: introduces segmentation complexity; very small subword vocabularies increase sequence length; very large vocabularies reduce flexibility.
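The BPE merge loop (steps 3–4 above) can be sketched on a toy word-count table. The counts follow the classic low/lower/newest/widest illustration; ties are broken by taking the first pair seen, which is one of several reasonable conventions:

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    vocab = {tuple(word): c for word, c in word_counts.items()}  # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, c in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first on ties)
        merges.append(best)
        new_vocab = {}
        for symbols, c in vocab.items():  # rewrite every word with the merged symbol
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges

counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(bpe_merges(counts, 3))  # first merges capture the frequent "es"/"est" suffix
```

Production tokenizers also track boundary markers and stop at a target vocabulary size rather than a fixed merge count.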

    C) Character-Level Models

    • Mechanics: Model predicts next character given history. The token set is small (letters, digits, punctuation, whitespace), eliminating OOV issues.
    • Sequence length: If average word length is ~5 characters plus spaces, character sequences are about 5–6× longer than word sequences. For transformers with self-attention cost of O(L^2), this increases compute and memory sharply.
    • Learning long-range dependencies: The model must connect distant characters to form words and phrases, which is harder than word-level context. Regularization, curriculum learning, and longer training may be required.
    • Hybrid approach: Use subwords as the main units; within each subword, a character-level module can refine or generate characters, balancing coverage and speed.
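The length blow-up is easy to see with a quick back-of-envelope comparison (the sentence is illustrative):

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()
chars = list(sentence)  # character tokens, including spaces

print(len(words), "word tokens vs", len(chars), "character tokens")
# Self-attention cost grows with the square of sequence length, so the
# character sequence is roughly (len(chars)/len(words))**2 times more
# expensive per attention layer than the word sequence.
print(round((len(chars) / len(words)) ** 2, 1), "x attention cost")
```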

    D) Structured Data Integration

    • Augment training data: Create natural language sentences from a database or knowledge graph. For example, templates like “X is the capital of Y” produce factual sentences. Add these to the training corpus to boost domain knowledge.
    • Constrain outputs: During inference for QA, first use the knowledge graph to propose valid candidates (e.g., all world capitals). Force the model to choose from these or to verify generated answers by checking the graph.
    • Guide learning: For translation, paired sentences form supervised data that maps source to target language. Training aligns model representations across languages using cross-entropy or other sequence-to-sequence losses.
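The constraint idea for QA can be sketched as a post-generation check. The capitals table and the `constrained_answer` helper below are hypothetical stand-ins for a real knowledge graph lookup:

```python
# Hypothetical capitals table standing in for a knowledge graph.
CAPITAL_OF = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}
VALID_CAPITALS = set(CAPITAL_OF.values())

def constrained_answer(generated, country):
    """Accept a generated answer only if the graph confirms it; else fall back."""
    if generated in VALID_CAPITALS and CAPITAL_OF.get(country) == generated:
        return generated
    # Fallback: answer directly from the graph when the model disagrees with it.
    return CAPITAL_OF.get(country, "unknown")

print(constrained_answer("Paris", "France"))  # model agrees with the graph
print(constrained_answer("Lyon", "France"))   # rejected; graph answer used instead
```

The same pattern works as a pre-generation filter: build the candidate set from the graph first and restrict decoding to it.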

    E) Knowledge Graph Embeddings (KGE)

    • Entities and relations: A knowledge graph has nodes (entities) and edges (relations). Facts are indexed as triples (h, r, t) for head entity h, relation r, and tail entity t (e.g., (Paris, is_capital_of, France)).
    • TransE:
      • Core idea: Learn vectors e_h, e_r, e_t so that e_h + e_r ≈ e_t.
      • Scoring: score(h, r, t) = -||e_h + e_r - e_t|| (L1 or L2 norm). Higher scores, i.e., smaller distances, mean more plausible triples.
      • Training: Use positive triples from the graph and generate negative samples by corrupting the head or tail (replace Paris with Berlin, etc.). Optimize a margin-based ranking loss or logistic loss to separate true from false triples.
      • Strengths/limits: Simple, fast, good for one-to-one relations; struggles with one-to-many, many-to-one, or many-to-many without tweaks.
    • DistMult:
      • Core idea: score(h, r, t) = e_h^T diag(e_r) e_t, a bilinear form with a diagonal relation matrix. It captures symmetric relations well but not antisymmetric ones.
      • Strength: More robust to certain noise and handles diverse relation patterns better than simple translations.
    • ComplEx:
      • Core idea: Extend DistMult to complex-valued embeddings. score(h, r, t) uses complex inner products that can model asymmetry.
      • Benefit: Better fits a wide range of relation types (symmetric, antisymmetric, inverse) and tolerates noise in real graphs.
    • Usage: After training, use KGE to rank candidate tails given (h, r, ?) or candidate heads given (?, r, t). Also use embeddings as features in downstream models.
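A minimal TransE scoring-and-ranking sketch; the 2-d embeddings are hand-set for illustration (a trained model would learn them):

```python
import math

def score(h, r, t):
    """TransE score: negative L2 distance of (h + r) from t; higher is more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy embeddings, hand-set so that Paris + is_capital_of ≈ France.
E = {"Paris": (1.0, 0.0), "Berlin": (0.0, 1.0),
     "France": (2.0, 1.0), "Germany": (1.0, 2.0)}
R = {"is_capital_of": (1.0, 1.0)}

# Rank candidate tails for the query (Paris, is_capital_of, ?).
candidates = ["France", "Germany"]
ranked = sorted(candidates,
                key=lambda t: score(E["Paris"], R["is_capital_of"], E[t]),
                reverse=True)
print(ranked)  # → ['France', 'Germany']
```

Training would additionally generate corrupted triples (e.g., swapping France for Germany) and push their scores below the true triples with a margin loss.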

    F) Handling Noisy/Incomplete Knowledge Graphs

    • Robust training: Prefer models like DistMult or ComplEx when data is noisy. Add regularization (L2, dropout in embedding layers), early stopping, and careful negative sampling.
    • Knowledge graph completion: Predict missing edges using trained embeddings by ranking candidate triples. Add high-confidence predictions back to the graph, possibly after human verification.
    • Cleaning: Train classifiers or anomaly detectors over triples using graph features (e.g., degree, embedding distance) to flag likely errors. Route flagged triples for human review or automated correction based on rules.

    G) Data Augmentation Pipelines

    • Back translation:
      1. Choose a pivot language (e.g., English → French → English).
      2. Translate each sentence to the pivot language using a machine translation system.
      3. Translate back to the original language.
      4. Keep paraphrases that preserve meaning; optionally filter with semantic similarity or entailment checks.
    • Synonym replacement:
      1. For selected tokens (not stopwords), retrieve synonyms from a thesaurus or a contextual model.
      2. Replace a small percentage (e.g., 10–20%) to avoid semantic drift.
      3. Validate with simple heuristics (e.g., POS match) to keep grammar.
    • Random insertion:
      1. Sample adjectives or adverbs (e.g., “fluffy”) or use a language model to suggest plausible insertions.
      2. Insert at random positions at a low rate (e.g., 1–2 tokens per sentence).
    • Random deletion:
      1. Remove tokens at a small probability (e.g., 10%).
      2. Ensure key content words are not removed too often.
    • Contextual augmentation:
      1. Mask a token and let a language model propose context-fitting replacements.
      2. Accept replacements above a probability threshold.
    • GAN-based augmentation:
      1. Train a generator to produce text and a discriminator to tell real from fake.
      2. Use generated samples that pass quality checks to augment data.
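The lighter-weight augmentations above (synonym replacement, random insertion, random deletion) can be sketched in a few stdlib-only functions. The tiny thesaurus and filler lists are placeholder assumptions; a real pipeline would draw synonyms from WordNet or a contextual model and add POS checks.

```python
import random

SYNONYMS = {"sat": ["perched"], "big": ["large"]}   # toy thesaurus (assumption)
FILLERS = ["fluffy", "really"]                      # harmless insertion candidates
STOPWORDS = {"the", "a", "on", "of"}

def synonym_replace(tokens, rate=0.2, rng=random):
    # Replace a small fraction of non-stopword tokens that have synonyms.
    out = []
    for tok in tokens:
        if tok not in STOPWORDS and tok in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

def random_insert(tokens, n=1, rng=random):
    # Insert a few filler words at random positions (low rate).
    out = list(tokens)
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), rng.choice(FILLERS))
    return out

def random_delete(tokens, p=0.1, rng=random):
    # Drop tokens with small probability, but never return an empty sentence.
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]
```

Passing a seeded `random.Random` instance as `rng` keeps augmentation reproducible across runs, which helps when ablating each augmentation's contribution.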

    Tools/Libraries (Typical Choices)

    • Tokenization: SentencePiece (BPE/Unigram), Hugging Face Tokenizers for BPE/WordPiece.
    • POS tagging: spaCy, NLTK.
    • Knowledge graphs: RDFLib, Neo4j, NetworkX for graph utilities.
    • KGE: PyKEEN, AmpliGraph, OpenKE for training TransE/DistMult/ComplEx.
    • MT for back translation: MarianNMT, OpenNMT, Hugging Face Transformers with pre-trained translation models.

    Step-by-Step Implementation Guide

    1. Decide Vocabulary Strategy
    • Start with subword tokenization for most tasks (target vocab size ~30k–50k). Train BPE or WordPiece on your corpus; reserve special tokens.
    • If data is extremely noisy or domain-specific, consider POS-aware <UNK> for rare tokens that remain or as a fallback.
    2. Train Tokenizer
    • Fit BPE/WordPiece on the training corpus (shuffle for coverage). Inspect learned merges for sanity (e.g., common morphemes captured). Save tokenizer artifacts.
    3. Prepare Data
    • Tokenize the corpus into subwords. Replace tokens below threshold with <UNK> only if using word-level vocabularies; for subword workflows, this is usually not needed.
    • Build datasets with input-target pairs (e.g., next-token prediction for LMs).
    4. Optional: Character-Level or Hybrid Modeling
    • If you need maximal flexibility (e.g., heavy misspellings), use char-level modeling or a hybrid architecture. Be ready for longer sequences and higher compute.
    5. Integrate Structured Knowledge
    • Augmentation: Select domain triples and convert to sentences via templates (e.g., “X is the capital of Y”). Mix with raw text, balancing proportions.
    • Constraints: For QA, construct candidate sets from knowledge graphs; restrict decoding to these candidates or post-verify model outputs.
    • Guided learning: For translation, assemble a clean parallel corpus and train a sequence-to-sequence model with teacher forcing.
    6. Train Knowledge Graph Embeddings (Optional)
    • Choose TransE for simplicity or DistMult/ComplEx for noisy graphs. Split triples into train/valid/test. Train with negative sampling; tune margin or regularization. Evaluate with link prediction metrics (MRR, Hits@k).
    7. Data Augmentation
    • Back translation: Run a batch translation pipeline; filter paraphrases with semantic similarity thresholds.
    • Synonym replacement: Replace a controlled fraction of words, ensuring POS match; skip named entities.
    • Random insertion/deletion: Apply low rates to avoid degrading grammar; audit samples to calibrate.
    • Contextual augmentation: Use a masked language model to propose replacements; accept only high-confidence substitutes.
    8. Training and Evaluation
    • Train the language model on augmented and original data. Track validation loss and downstream task metrics. Ablate augmentation components to measure impact. Use held-out test sets for final evaluation.
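The tokenizer-training step can be illustrated with a minimal, pure-Python BPE merge loop, a toy stand-in for SentencePiece or Hugging Face Tokenizers. Words are held as tuples of symbols with their corpus frequencies; each iteration merges the most frequent adjacent pair, exactly as described in the lecture.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the frequency-weighted corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with its concatenation.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    # word_freqs: e.g. {"abab": 3, "abc": 2}; returns learned merges
    # and the corpus rewritten with the merged subwords.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words
```

Real tokenizer training (e.g., with Hugging Face's `BpeTrainer`) follows the same merge logic but adds pre-tokenization, byte-level fallbacks, and special-token handling.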

    Tips and Warnings

    • Threshold tuning: If using <UNK>, pick a threshold that meaningfully reduces vocabulary but does not erase important domain terms. Validate on a dev set.
    • Subword vocab size: Too small → longer sequences; too large → less flexibility and more memory. Start around 30k and adjust based on sequence length distributions and hardware.
    • Char-level costs: Remember that self-attention cost grows quadratically with sequence length; budget GPU memory accordingly and consider gradient checkpointing.
    • Knowledge graph quality: Constraints are only as good as the graph. Keep a cleaning pipeline and track error rates; allow fallbacks when constraints are empty.
    • Augmentation semantics: Over-aggressive edits can change meanings. Use small rates, semantic similarity checks, and human spot-checks.
    • Back translation pitfalls: Poor translation models can distort meaning; choose strong models and filter aggressively.
    • Synonym context: Many words are not interchangeable in context. Use POS and contextual similarity to avoid awkward or incorrect replacements.
    • Balancing augmented data: Do not swamp original data; a 1:1 or 1:2 ratio is a reasonable starting point, then tune based on validation.
    • Evaluation fidelity: Always evaluate on unaugmented, real data. The goal is real-world robustness, not just improved training metrics.
    • Domain drift: If you augment with templates from a structured source, ensure style and distribution do not drift too far from your target data; keep a mix.

    04 Examples

    • 💡

      Zipf’s Law Visualization Example: Suppose you list English words by frequency and plot rank vs frequency on log-log axes. You’ll see a straight, downward-sloping line, meaning the 1st ranked word is vastly more frequent than the 10,000th. Input is raw word counts; processing is sorting and plotting; output is a near-linear pattern on a log-log chart. The key point is that most words are rare, creating data sparsity.

    • 💡

      UNK Replacement Example: In a corpus, any word seen fewer than 5 times is replaced with <UNK>. Input is raw sentences; processing counts word frequencies and swaps rare tokens; output sentences contain <UNK> where rare words were. This reduces vocabulary size and lets the model learn a stable probability for unknowns. It trades away the specific meaning of those rare words.

    • 💡

      POS-Aware UNK Example: A sentence contains a rare verb “gallivanted” and a rare noun “bricolage.” The pipeline replaces them with <UNK-VERB> and <UNK-NOUN> after POS tagging. The processed sentence keeps clues about grammar even when exact words are unknown. This helps a translator or generator preserve structure.

    • 💡

      BPE Merge Example: Start from individual characters and observe that the pair “ab” appears most often across the corpus. The training loop merges a+b into “ab,” then repeats on the updated text to form frequent subwords. At inference, a rare word like “dabbed” might tokenize as “d + ab + bed,” assuming later merges also produced “bed.” The example demonstrates how frequent pairs become reusable blocks.

    • 💡

      WordPiece Likelihood Example: Consider candidate merges “un+believe” and “believe+able.” WordPiece evaluates which merge improves overall data likelihood more. It chooses the merge that best fits usage patterns across the corpus. The result often segments “unbelievable” as “un + believable,” capturing morphemes more cleanly.

    • 💡

      Character-Level Sequence Length Example: A 1,000-word paragraph may become around 6,000–10,000 characters. The model predicts character by character, making attention windows and memory needs much larger. Training is slower and capturing distant dependencies is harder. This shows the compute trade-off for guaranteed coverage of any string.

    • 💡

      Hybrid Subword+Char Example: The model tokenizes into subwords like “un,” “believe,” and “able,” but a character decoder refines spelling for rare variants. Input is text; processing is subword encoding with a character-level submodule; output predictions keep efficiency while handling odd spellings. The key point is balancing flexibility and speed.

    • 💡

      QA Constraint Example with Capitals: For the question “What is the capital of France?”, the system builds a candidate set from a knowledge graph of countries and capitals. During decoding, it restricts answers to that set or verifies the output against the graph. Output is “Paris,” a valid capital. The guardrail prevents nonsensical answers like “Triangle.”

    • 💡

      Knowledge Graph Embedding Triple Scoring: Given (Paris, is_capital_of, France), TransE computes the distance ||e_Paris + e_rel - e_France||. Lower distance indicates a more plausible triple. For a corrupted triple like (Berlin, is_capital_of, France), the distance should be higher. This example shows how embeddings support reasoning.

    • 💡

      Knowledge Graph Completion Inference: If the graph has (Paris, is_capital_of, France) and (France, is_in, Europe), the system infers (Paris, is_in, Europe). Input is two known triples; processing uses graph patterns or embedding-based link prediction; output is a new, plausible triple. This fills gaps in knowledge.

    • 💡

      Noisy Graph Cleaning Example: A model flags (Paris, is_capital_of, Germany) as suspicious due to high embedding distance. A human reviewer confirms it’s wrong and removes it. The cleaning step improves graph accuracy for future queries. It illustrates automatic detection plus human validation.

    • 💡

      Back Translation Example: The sentence “The meeting starts at noon” is translated to French (“La réunion commence à midi”) and back to English as “The meeting begins at noon.” The input is one English sentence; the processing is two translations; the output is a paraphrase with the same meaning. It teaches the model to handle paraphrasing.

    • 💡

      Synonym Replacement Example: “The cat sat on the mat” becomes “The cat perched on the mat.” Input is the original sentence; processing selects a replaceable verb and finds a synonym; output is a semantically similar sentence. The lesson is improved robustness to wording changes.

    • 💡

      Random Insertion Example: “The cat sat on the mat” becomes “The cat fluffy sat on the mat.” Input is the sentence; processing randomly chooses an insertion point and adds an adjective; output contains noise. This trains the model to ignore irrelevant words.

    • 💡

      Random Deletion Example: “The cat sat on the mat” becomes “Cat sat on the mat.” Input is the sentence; processing removes a word like “the”; output is a clipped sentence still understandable. The model learns to handle missing function words.
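The <UNK> replacement procedure from the examples above reduces to a few lines with `collections.Counter`: count every token, keep those at or above the threshold, and map the rest to the unknown token.

```python
from collections import Counter

def apply_unk(sentences, min_count=5, unk="<UNK>"):
    # Count every token across the corpus, then replace tokens whose
    # frequency falls below the threshold with the unknown token.
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok if tok in vocab else unk for tok in sent]
            for sent in sentences]
```

A POS-aware variant would run a tagger first and emit tokens like `<UNK-VERB>` instead of a single `<UNK>`.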

    05 Conclusion

    This lecture brings together three vital themes for effective language modeling: handling rare and unknown words, integrating structured knowledge, and augmenting data. Because language follows Zipf’s law, most words are rare, which undermines probability estimates and inflates vocabularies. Practical strategies include replacing rare words with <UNK> (optionally POS-aware), adopting subword tokenization like BPE or WordPiece to build rare words from common parts, and, when needed, using character-level modeling or hybrids to guarantee coverage. These choices directly affect accuracy, memory, and speed.

    Structured data and knowledge—such as database tables and knowledge graphs—offer clean, explicit facts that complement noisy raw text. You can use them to generate extra training sentences, constrain outputs to valid answers, or guide learning with aligned data (e.g., parallel corpora). Knowledge graph embeddings like TransE, and robust variants like DistMult and ComplEx, map entities and relations into vectors that preserve graph structure, enabling link prediction, question answering, and better factual reasoning. Because real graphs can be noisy or incomplete, completion and cleaning steps are crucial for reliability.

    Data augmentation rounds out the toolkit. Back translation produces paraphrases that keep meaning but vary wording, while synonym replacement, random insertion, and random deletion teach resilience to surface changes and noise. More advanced methods like contextual augmentation and GAN-based generation give additional diversity when curated carefully. The unifying idea is to create varied, meaning-preserving examples so models generalize better from limited data.

    To practice, start by training a BPE or WordPiece tokenizer and comparing results with and without <UNK> rules. Build a small knowledge graph or use an open one to constrain a simple QA task and experiment with TransE embeddings. Implement a back-translation pipeline and light synonym/noise augmentations, and measure their effects on validation performance. Next steps include exploring robust KGE models (ComplEx), hybrid subword–character architectures, and automated augmentation filtering using semantic similarity.

    The core message: choose data strategies that match your task’s needs and constraints. Subwords are a strong default; use structured data to boost factuality and guardrails; apply careful augmentation to improve robustness. With these tools, you can make language models more accurate, efficient, and reliable in real-world settings.

  • ✓Guide learning with clean supervision when possible. Use parallel corpora to train translation or alignment tasks. Clean, aligned pairs outperform noisy web text. Maintain quality with filters and deduplication.
  • ✓Adopt knowledge graph embeddings for reasoning tasks. Start with TransE for simplicity; move to DistMult or ComplEx if relations are complex or data is noisy. Evaluate with MRR and Hits@k to ensure your embeddings capture structure. Use embeddings to rank answers or predict missing facts.
  • ✓Plan for noisy or incomplete knowledge graphs. Add a completion step to infer likely missing edges and a cleaning step to remove suspicious ones. Combine automated flags with human review for high-stakes domains. Keep logs to track graph quality over time.
  • ✓Implement back translation for paraphrase diversity. Pick strong MT models, translate out and back, then filter by semantic similarity. Limit the number of paraphrases per sentence to avoid data imbalance. Measure gains on validation tasks before scaling up.
  • ✓Use synonym replacement sparingly and context-aware. Match parts of speech and avoid named entities. Keep replacement rates low (e.g., 10–20%) to preserve meaning. Spot-check samples to ensure quality.
  • ✓Apply random insertion and deletion at low rates to teach robustness to noise. Insertion can add harmless adjectives; deletion can remove function words. Avoid heavy edits that break grammar severely. Monitor if these augmentations actually improve validation.
  • ✓Try contextual augmentation for smarter edits. Mask tokens and let a language model propose replacements with high confidence. This preserves coherence better than random changes. Ensure the base LM is good enough to avoid drift.
  • ✓Balance augmented and original data. Start with a 1:1 or 1:2 ratio (augmented:original) and adjust based on validation gains. Too much augmentation can distort distribution and hurt performance. Keep the original data as the backbone.
  • ✓Evaluate on clean, unaugmented test sets. The goal is real-world robustness, not just fitting augmented noise. Track multiple metrics (accuracy, calibration, factual consistency). Use ablations to see which augmentations help.
  • ✓Manage vocabulary size to control compute costs. Larger vocabularies enlarge the softmax and memory use; smaller vocabularies lengthen sequences. Tune vocab size with profiling and watch your training throughput. Pick the sweet spot for your hardware and task.
  • ✓Document your data decisions. Record thresholds, tokenizer settings, augmentation rates, and graph integrity checks. This makes experiments reproducible and debuggable. It also helps teammates understand trade-offs and results.
  • <UNK> (unknown token)

    A special token used to replace rare or unseen words. It stands for “I don’t know this exact word.” It reduces vocabulary size and gives a stable probability to unknowns. But it removes the word’s precise meaning. Variations include POS-aware unknowns.

    Part of Speech (POS)

    A label for a word’s role in a sentence, like noun, verb, or adjective. Knowing POS helps keep grammar and meaning. POS-aware unknown tokens preserve structure even when exact words are unknown. This can improve translation and generation. It’s a simple way to keep useful hints.

    Subword

    A piece of a word learned from data, often a common prefix, root, or suffix. Subwords allow rare or new words to be built from known parts. This handles unknown words without huge vocabularies. It balances flexibility and efficiency. Most modern LMs rely on subwords.

    Byte Pair Encoding (BPE)

    A subword algorithm that merges the most frequent character pairs repeatedly. It builds larger units from common smaller parts. Training stops at a target vocabulary size. It’s simple and fast. It works well across many languages.
