Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 1: Overview and Tokenization

Beginner
Stanford Online
NLP · YouTube

Key Summary

  • This session introduces a brand-new course on building language models from scratch. You learn what language modeling is, where it’s used (speech recognition, translation, text generation, classification), and how different modeling families work. The class emphasizes implementing models yourself in Python and PyTorch, plus how to train and evaluate them.
  • A language model assigns probabilities to word sequences and predicts the next word. For example, given “the cat sat on the,” it estimates which word most likely comes next (like “mat”). Thinking in probabilities lets computers judge how natural or likely a sentence is.
  • Classic N-gram models estimate the next word using counts of short word windows (like bigrams and trigrams). You count how often sequences occur in a corpus and compute conditional probabilities. This approach is simple, fast, and a foundation for understanding modern neural methods.
  • Neural language models learn patterns beyond fixed windows and can handle long-range relationships. Recurrent neural networks (RNNs) and LSTMs capture sequences over time. Transformers use attention to focus on important words and are state-of-the-art for large language models.
  • Evaluation focuses on perplexity and BLEU. Perplexity is like the average number of choices the model considers at each step; lower is better. BLEU compares generated text to human references and is common in machine translation.
  • Tokenization breaks text into pieces called tokens, often words and punctuation. Simple whitespace splitting fails on punctuation and contractions (like “I’m”). Smarter tokenizers split “I’m” into “I” and “’m” and separate punctuation like periods.
  • Common tokenization tools include NLTK and spaCy, plus specialized tokenizers for code or social media. Good tokenization matters because models learn from and predict tokens. Poor tokenization confuses models and lowers accuracy.

Why This Lecture Matters

This lecture sets a practical foundation for anyone who needs to build or understand language systems. Software engineers, data scientists, and researchers gain a clear roadmap from raw text to working models. By explaining where language models are used—speech recognition, translation, generation, and classification—you can connect the concepts to real products like voice assistants, chatbots, and content tools.

The emphasis on tokenization and subwords addresses one of the most common failure points in NLP pipelines. Many projects underperform not because the model is weak, but because the text was split and normalized poorly. Learning robust tokenization, handling contractions and punctuation, and normalizing numbers immediately improves model quality. Subword tokenization (BPE, WordPiece, SentencePiece) further solves rare-word and vocabulary-size problems, empowering you to scale beyond toy datasets toward real-world corpora.

Understanding classic N-gram models prepares you to reason about probabilities, sparsity, and evaluation with perplexity and BLEU. These skills transfer directly to modern neural models like Transformers. With the course’s coding focus in Python and PyTorch and GPU support, you can implement and train models that matter in industry settings. Building this knowledge now strengthens your career, as language modeling underpins today’s most impactful AI applications—from large language models powering enterprise copilots to tools that automate analysis and content creation. In short, mastering these fundamentals lets you diagnose problems, design better pipelines, and communicate model behavior clearly to teammates and stakeholders. It’s the difference between hoping a prebuilt model works and confidently building one that does.

Lecture Summary


01 Overview

This first lecture launches a hands-on course about building language models—the computer systems that assign probabilities to sequences of words and can predict the next word. The focus is on learning both the ideas and the practical skills to construct models from scratch. The class covers classic N-gram models, modern neural models (RNNs, LSTMs, Transformers), training processes, evaluation metrics, and especially the crucial preprocessing step: tokenization. Along the way, you will learn how to prepare data, select hyperparameters, and measure how well your model works.

The lecture starts with course goals: deeply understand language modeling; implement systems in Python and PyTorch; and master training and evaluation. You’ll see where language models are used every day—speech recognition, machine translation, text generation, and text classification. For example, in speech recognition, a language model helps choose between ambiguous sound interpretations like “recognize speech” versus “wreck a nice beach.” In translation, it pushes the system toward fluent, grammatical outputs. In generation, it produces new text that resembles the training data; and in classification, it helps distinguish categories like spam vs. not spam.

You learn what a language model does in precise terms: it estimates the probability of a word sequence, such as P(w1, w2, …, wn), and uses that to predict likely continuations, like the word after “the cat sat on the.” Classic N-gram models compute these probabilities from counts of short windows of words (like bigrams and trigrams). Neural language models go further by learning richer patterns and capturing long-distance relationships: RNNs process sequences step by step, LSTMs handle long-term dependencies, and Transformers use attention to focus on the most relevant words, defining today’s state-of-the-art.

Evaluation uses metrics that quantify how well a model performs. Perplexity measures how many choices a model effectively juggles on average at each step—the lower, the better. BLEU (often used for machine translation) compares model output to human references to gauge similarity and fluency. These metrics guide you to improve models and compare different approaches.

Before you can train any model, you must tokenize text—that is, break it into tokens. Naive whitespace splitting fails on real text because it mishandles punctuation and contractions. A better tokenizer splits “I’m going to the store.” into [“I”, “’m”, “going”, “to”, “the”, “store”, “.”]. Tokenization tools like NLTK and spaCy can handle these rules, and there are specialized tokenizers for domains like code and social media. You also learn about tokenization challenges: contractions, punctuation, and numbers (which come in many formats, like 1,000.00 and “one thousand”). Normalizing numbers can simplify the model’s job.

Subword tokenization addresses rare words and large vocabularies by breaking words into smaller pieces. For example, “unbelievable” becomes “un,” “believe,” and “able.” Common subword methods include Byte Pair Encoding (BPE), WordPiece, and SentencePiece. BPE starts with characters and merges the most frequent adjacent pairs repeatedly until reaching a desired vocabulary size. A toy example with words like “low,” “lower,” “newest,” and “widest” shows how frequent pairs such as “e”+“s” merge into “es,” and then “es”+“t” into “est.”

The lecture also covers course structure and logistics. You’ll have lectures, homework, and a final project; grading is 60% homework, 30% project, and 10% participation. Prerequisites include introductory programming (CS106A or equivalent) and probability/statistics (CS109 or equivalent); deep learning familiarity is helpful but not required. Resources include a course website with notes and assignments, Piazza for Q&A, recorded lectures, and office hours offered by the instructor and TAs (Alice, Bob, and Carol). You’ll code in Python and PyTorch and have access to GPUs for training. For extra learning, the PyTorch website and CS230 materials provide excellent tutorials.

Finally, you get a preview of what’s next: a deeper dive into N-gram language models, including smoothing, backoff, and interpolation—methods that handle the problem of unseen word sequences by sensibly redistributing probability. This foundation prepares you for the neural approaches covered later, culminating in modern transformer-based large language models.

Key Takeaways

  • ✓Start with clean tokenization. Split punctuation and contractions, and choose a consistent number normalization policy. Test tokenizers (NLTK, spaCy) on real samples from your domain. Keep the same rules for both training and inference to avoid mismatches. Small tokenization fixes often yield big accuracy gains.
  • ✓Use subword tokenization to control vocabulary size. Train BPE/WordPiece/SentencePiece on representative data. Begin with a 30K vocabulary and adjust based on coverage and performance. Subwords reduce OOVs and improve generalization to rare words. They also save memory compared to giant word-level vocabularies.
  • ✓Build a simple N-gram model first. Implement bigrams or trigrams and compute perplexity on a held-out set. This baseline clarifies how data and tokenization affect results. It helps you understand probabilities and sparsity before jumping to neural models. You’ll make better decisions later with this grounding.
  • ✓Measure with perplexity and, if relevant, BLEU. Perplexity tells you how well the model predicts next words. BLEU is useful for translation or text generation tasks with references. Track these metrics as you change tokenization and vocabulary size. Let numbers guide your improvements.
  • ✓Normalize numbers early. Decide how to handle commas, decimals, and word numbers (“one thousand”). Map equivalent forms to a single representation. This prevents the model from wasting capacity on superficial differences. It simplifies counts and improves generalization.
  • ✓Treat punctuation as tokens. Separate periods, commas, and question marks from words. This reveals sentence boundaries and improves downstream tasks. You’ll get cleaner counts and better language structure learning. Avoid attaching punctuation to words like “store.”.

Glossary

Language model (LM)

A system that assigns probabilities to sequences of words and predicts likely next words. It helps computers judge how natural a sentence is. By learning patterns from text, it can continue text or rank choices. LMs power tasks like speech recognition and translation. Good LMs make outputs fluent and sensible.

Tokenization

The process of splitting text into small pieces called tokens, such as words and punctuation. It makes raw text easier for models to understand. Good tokenization handles contractions and punctuation correctly. It reduces confusion and makes patterns clearer. Poor tokenization leads to messy learning.

Token

A unit of text used by a model, often a word, punctuation mark, or subword piece. Tokens are like puzzle pieces; the model learns how they fit together. They form sequences that the model predicts over. Clear token boundaries improve learning. Tokens are mapped to IDs for model input.

N-gram

A sequence of N tokens. N-grams capture short context windows in text. Models use them to estimate the next word’s probability from recent words. Larger N captures more context but increases sparsity. N-grams are a classic, simple modeling approach.

#language-modeling #tokenization #n-gram #bigram #trigram #neural-language-model #rnn #lstm #transformer #attention #perplexity #bleu #bpe #wordpiece #sentencepiece #normalization #subword #pytorch #nlp
  • Tokenization challenges include contractions, punctuation, and numbers. Numbers appear in many forms (1,000 vs 1000.00 vs “one thousand”), so normalization helps. Separating punctuation as tokens and splitting contractions improves downstream tasks.
  • Subword tokenization breaks words into meaningful pieces like “un,” “believe,” and “able.” This handles rare words and reduces the vocabulary size. It also helps models generalize and saves memory.
  • Byte Pair Encoding (BPE) starts with characters and repeatedly merges the most frequent adjacent pairs to create subwords. WordPiece and SentencePiece are other popular subword methods used in modern models. These methods allow the vocabulary to cover many words without exploding in size.
  • A concrete BPE example uses a tiny corpus with words like “low,” “lower,” “newest,” and “widest.” The algorithm counts frequent character pairs (like “e” + “s”) and merges them to form subwords (“es,” then “est”). It keeps merging until reaching a chosen vocabulary size.
  • Course logistics: homework is 60%, final project 30%, participation 10%. Prerequisites include programming and probability/statistics; deep learning familiarity helps but isn’t required. You’ll use Python and PyTorch, have access to GPUs, and can watch recorded lectures.
  • Resources include the course website, Piazza, and office hours with the instructor and TAs (Alice, Bob, Carol). There’s no required textbook, but readings and links (like to the Jurafsky and Martin book) will be provided. You can find great PyTorch tutorials on the official site and CS230 materials.
  • The next class dives deeper into N-gram models, including smoothing, backoff, and interpolation. These techniques fix data sparsity by sharing probability mass across less-seen sequences. Understanding them sets the stage for modern neural methods.
02 Key Concepts

    • 01

      🎯 What is a language model? It is a system that assigns probabilities to sequences of words and predicts what word likely comes next. 🏠 Think of it like a smart text predictor that guesses the next word based on what came before. 🔧 Technically, it estimates P(w1, w2, …, wn) and uses that to choose the most probable continuation. 💡 Without this, machines can’t judge which sentences sound natural or make sense. 📝 Example: after “the cat sat on the,” a language model predicts “mat” as a likely next word.

    • 02

      🎯 Why language models matter. They power speech recognition, machine translation, text generation, and classification. 🏠 It’s like having a helpful assistant that knows common phrasing and grammar. 🔧 In each task, the model provides fluency and context-aware choices among alternatives. 💡 Without this, systems would make awkward or incorrect outputs. 📝 Example: translation systems use an LM to choose smooth, grammatical sentences in the target language.

    • 03

      🎯 Speech recognition with LMs. The model helps pick the right words from ambiguous audio. 🏠 Imagine hearing a mumble and deciding between “recognize speech” and “wreck a nice beach.” 🔧 The LM assigns higher probability to the phrase that fits the conversation context. 💡 Without it, speech systems would mishear and produce nonsense. 📝 Example: the LM prefers “recognize speech” in a tech talk, not “wreck a nice beach.”

    • 04

      🎯 Machine translation with LMs. They steer outputs to be fluent in the target language. 🏠 It’s like having a grammar coach that guides phrasing. 🔧 The LM scores candidate translations and boosts those that are grammatical and natural. 💡 Without it, translations can be choppy or ungrammatical. 📝 Example: for English→French, the LM prefers common, correct French constructions.

    • 05

      🎯 Text generation. LMs create new text in the style of the training data. 🏠 It’s like a storyteller who learned from many examples and can continue the tale. 🔧 The model samples likely next words repeatedly to produce a paragraph or document. 💡 Without probabilistic guidance, generated text would be random or repetitive. 📝 Example: trained on news, it generates articles that sound like news reports.

    • 06

      🎯 Text classification help from LMs. They provide features or probabilities that help label text. 🏠 It’s like using hints about word patterns to decide spam vs. not spam. 🔧 The LM captures patterns characteristic of classes (like spammy phrases). 💡 Without such signals, classifiers may miss subtle cues. 📝 Example: an LM-informed classifier flags emails with certain patterns as spam.

    • 07

      🎯 N-gram language models. They predict the next word using only the last N−1 words. 🏠 Like remembering only the last few words of a sentence to guess what comes next. 🔧 You count N-grams in a corpus and compute conditional probabilities like P(word | previous words). 💡 Without these counts, you can’t estimate probabilities from data. 📝 Example: bigram uses one previous word; trigram uses two.

    • 08

      🎯 Counting and estimating in N-grams. You tally how often sequences appear. 🏠 It’s like keeping a scoreboard of word pairs and triples. 🔧 Probability P(sat | the cat) = count(the cat sat)/count(the cat). 💡 Without enough counts, estimates are unreliable or zero. 📝 Example: if “the cat” appears 1000 times and “the cat sat” appears 500, the probability is 0.5.

    • 09

      🎯 Neural language models. They learn complex patterns beyond fixed windows. 🏠 Think of them as readers who remember more than just the last few words. 🔧 RNNs process tokens step-by-step; LSTMs hold long-term information; Transformers use attention to focus on key words. 💡 Without neural methods, capturing long-range dependencies is hard. 📝 Example: models remember a topic mentioned many words earlier.

    • 10

      🎯 Recurrent Neural Networks (RNNs). They pass a hidden state along a sequence. 🏠 Like taking notes as you read, and using those notes to understand the next sentence. 🔧 At each step, the next hidden state depends on the current token and prior state. 💡 Without recurrence, models can’t track order well. 📝 Example: predicting next words as a story unfolds.

    • 11

      🎯 LSTMs (Long Short-Term Memory). They add gates to store and forget information. 🏠 Like a smart notebook that knows what to keep and what to erase. 🔧 Input, forget, and output gates control what info flows through time. 💡 Without this, long-distance effects fade quickly. 📝 Example: remembering the subject “dogs” many words later when choosing verb agreement.

    • 12

      🎯 Transformers and attention. They compare each word with every other to find what matters. 🏠 Like a reader who can instantly cross-reference any sentence with the rest of the page. 🔧 Attention scores relevance and lets the model focus on key tokens; Transformers stack layers of this. 💡 Without attention, handling distant dependencies is harder. 📝 Example: linking a pronoun to a faraway noun.

    • 13

      🎯 Perplexity. It measures how well a model predicts the next word. 🏠 Imagine the model having to pick among several doors; fewer doors mean it’s less “perplexed.” 🔧 Lower perplexity means the model assigns higher probabilities to correct next words. 💡 Without a clear metric, you can’t compare models. 📝 Example: a model with perplexity 10 is considering roughly 10 options on average.

    • 14

      🎯 BLEU score. It evaluates how close generated text is to human references. 🏠 Like grading an essay by seeing how much it matches a sample solution. 🔧 It counts overlapping word chunks (n-grams) and penalizes overly short outputs. 💡 Without BLEU, comparing translations would be subjective. 📝 Example: translation systems report BLEU to show improvement.

    • 15

      🎯 Tokenization. It splits text into tokens like words and punctuation. 🏠 Like cutting a sentence into puzzle pieces. 🔧 A good tokenizer handles punctuation and contractions correctly. 💡 Without proper tokens, models learn messy, inconsistent patterns. 📝 Example: “I’m going to the store.” → [“I”, “’m”, “going”, “to”, “the”, “store”, “.”].

    • 16

      🎯 Pitfalls of whitespace splitting. It treats punctuation and contractions incorrectly. 🏠 Like cutting puzzle pieces at the wrong places so they don’t fit. 🔧 “I’m” remains one token and “store.” keeps the period attached, which is unhelpful. 💡 This hurts grammar rules and counts. 📝 Example: naive split yields [“I’m”, “going”, “to”, “the”, “store.”].

    • 17

      🎯 Tokenization tools. Libraries like NLTK and spaCy provide robust tokenizers. 🏠 It’s like using a sharp, well-designed cutter instead of scissors for delicate shapes. 🔧 They use language rules to separate words, punctuation, and contractions. 💡 Good tools save time and improve model quality. 📝 Example: spaCy correctly separates punctuation by default.

    • 18

      🎯 Handling numbers and normalization. Numbers appear in many formats that mean the same thing. 🏠 Like different-looking receipts that total the same amount. 🔧 Normalizing can convert “1,000.00” and “one thousand” to a consistent form. 💡 Without normalization, the model wastes capacity on differences that don’t matter. 📝 Example: map “1,000” and “1000” to the same canonical token.

    • 19

      🎯 Subword tokenization. It splits words into frequent pieces to handle rare words. 🏠 Like breaking a long word into Lego bricks the model already knows. 🔧 Common methods include BPE, WordPiece, and SentencePiece. 💡 Without subwords, vocabulary explodes and rare words become unknown. 📝 Example: “unbelievable” → [“un”, “believe”, “able”].

    • 20

      🎯 Byte Pair Encoding (BPE). It builds subwords by merging frequent adjacent pairs. 🏠 Like fusing common letter pairs into bigger chunks. 🔧 Start with characters, count pair frequencies, merge the top pair, and repeat until hitting a target vocab size. 💡 This balances vocabulary size and coverage. 📝 Example: merge “e”+“s” → “es,” then “es”+“t” → “est.”

    • 21

      🎯 WordPiece and SentencePiece. They are alternative subword algorithms used in modern systems. 🏠 Think of them as different recipes for making the same kind of Lego bricks. 🔧 WordPiece optimizes likelihood; SentencePiece can operate over raw text and multiple languages. 💡 Having options helps for multilingual or domain-specific data. 📝 Example: BERT uses WordPiece; multilingual setups often use SentencePiece.

    • 22

      🎯 Vocabulary size trade-offs. Bigger vocabularies reduce sequence length but raise memory and sparsity issues. 🏠 Like having too many puzzle pieces: it’s hard to manage and many rarely get used. 🔧 Subwords shrink the vocab but lengthen sequences slightly. 💡 The right balance improves speed and accuracy. 📝 Example: a 30K subword vocab is a common sweet spot.

    • 23

      🎯 Course structure and grading. You’ll have lectures, homework, and a final project. 🏠 Think of it as training, practice, and a capstone build. 🔧 Grading: 60% homework, 30% project, 10% participation. 💡 Knowing the breakdown helps plan your time. 📝 Example: start homework early to spread the 60% load.

    • 24

      🎯 Prerequisites and resources. Programming and probability/statistics are required; deep learning familiarity helps. 🏠 It’s like needing a toolbox (coding) and a math map (probability) before building. 🔧 You’ll use Python and PyTorch, get GPUs, and have recorded lectures. 💡 These supports keep you productive. 📝 Example: use PyTorch’s official tutorials to ramp up quickly.

    • 25

      🎯 What’s next: N-gram smoothing, backoff, and interpolation. These fix zero counts and data sparsity. 🏠 Like sharing a little weight with neighbors when your bucket is empty. 🔧 They redistribute probability so unseen events still get a small chance. 💡 Without them, N-gram models fail on new sequences. 📝 Example: next lecture dives into these techniques.

    03 Technical Details

    Overall architecture/structure

    1. Data to tokens to probabilities
    • Start with raw text data (a corpus). The first essential step is tokenization: break text into tokens—words and punctuation, or subwords. From tokens, you build a vocabulary (the set of unique tokens). With a vocabulary and tokenized sequences, you can train a model to estimate probabilities of word sequences and the next-word distribution.
    • In classic N-gram models, the model memorizes counts of token sequences and converts them to conditional probabilities. In neural models, you feed token IDs into a neural network that outputs a probability distribution over the next token. In both cases, the goal is the same: given context, assign high probability to the correct next token and good overall probabilities to full sequences.
    2. Training and evaluation loop
    • Training involves iterating over the tokenized data, updating parameters (or counts) so predicted probabilities match observed data better. For N-grams, this means computing relative frequencies (and later smoothing). For neural models, this means gradient-based optimization (though this lecture only introduces the idea broadly).
    • Evaluation uses metrics like perplexity (how well the model predicts next tokens) and BLEU (comparing generated text to references). Lower perplexity indicates better predictive ability; higher BLEU suggests closer match to human-like text in specific tasks like translation.
    3. Inference/generation
    • During inference, the model consumes a prefix (context) and outputs a probability distribution over the next token. Repeating this step produces new text. The quality of generation depends on both the learned probabilities and the tokenization scheme.
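The inference loop described above can be sketched in a few lines. Here `step_fn` is a hypothetical stand-in for any trained model: a function that maps the current context to a probability distribution over next tokens (the toy distribution below is made up purely for illustration).

```python
import random

def generate(step_fn, prefix, n_tokens, seed=0):
    """Repeatedly sample the next token from a model's next-token distribution.

    step_fn(context) -> dict mapping token -> probability. Any trained LM
    (N-gram or neural) can play this role.
    """
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(n_tokens):
        dist = step_fn(tokens)
        words = list(dist)
        # Sample proportionally to the model's probabilities.
        tokens.append(rng.choices(words, weights=[dist[w] for w in words])[0])
    return tokens

# A hypothetical toy model: after "the", favor "mat"; otherwise emit "the".
toy = lambda ctx: {"mat": 0.7, "hat": 0.3} if ctx[-1] == "the" else {"the": 1.0}
print(generate(toy, ["the", "cat", "sat", "on", "the"], 1))
```

Swapping in a real model only changes `step_fn`; the loop itself is the same whether the distribution comes from counts or a Transformer.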

    Tokenization in detail

    A) Why tokenization is necessary

    • Models cannot operate directly on raw text because characters and spacing lack explicit boundaries for words, punctuation, and contractions. Tokenization standardizes input so the model can learn consistent patterns. Good tokenization reduces ambiguity and noise; poor tokenization causes the model to learn mismatched or overly specific patterns.

    B) Basic tokenization: whitespace vs rule-based

    • Whitespace tokenization simply splits on spaces. This fails on punctuation and contractions. Example: “I’m going to the store.” becomes ["I’m", "going", "to", "the", "store."]. This merges punctuation with words and keeps contractions intact, which often isn’t desired.
    • Rule-based/tokenizer library approaches use language-aware rules to separate punctuation and split contractions into meaningful parts. A better split is ["I", "’m", "going", "to", "the", "store", "."]. Treating punctuation as separate tokens improves sentence boundary detection and clarity for downstream tasks such as part-of-speech tagging or parsing.
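A minimal rule-based tokenizer along these lines can be sketched with two regular expressions. This is a toy illustration, not the actual rules NLTK or spaCy apply, and it only handles a handful of common English contraction suffixes.

```python
import re

def simple_tokenize(text):
    """Split punctuation and common contractions into separate tokens.

    A minimal sketch: real tokenizers apply far richer, language-aware rules.
    """
    # Detach contraction suffixes like ’m, ’re, n’t from the word stem.
    text = re.sub(r"(\w)(’m|’re|’s|’ve|’ll|’d|n’t)", r"\1 \2", text)
    # Surround punctuation with spaces so each mark becomes its own token.
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(simple_tokenize("I’m going to the store."))
# → ['I', '’m', 'going', 'to', 'the', 'store', '.']
```

Note the rules assume curly apostrophes (’); handling straight quotes, quotes-as-quotation-marks, and abbreviations like "U.S." is exactly where library tokenizers earn their keep.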

    C) Handling contractions

    • Contractions compress words (I am → I’m; cannot → can’t). Splitting them into components helps models learn true word forms and grammatical relations. For example, “I’m” becomes [“I”, “’m”], which signals the verb form “am.” Without splitting, the model must memorize every contracted form as separate words, inflating the vocabulary and reducing generalization.

    D) Handling punctuation

    • Punctuation marks like periods, commas, and question marks carry sentence structure cues. Treating them as separate tokens helps the model learn sentence boundaries and pauses. For example, “The cat sat on the mat.” → ["The", "cat", "sat", "on", "the", "mat", "."]. If punctuation is glued to words, the model learns extra, meaningless tokens like “mat.” which dilute counts and patterns.

    E) Handling numbers and normalization

    • Numbers appear in many textual forms: 1,000; 1000.00; one thousand. Models can waste capacity learning all variants. Normalization strategies convert different formats into a consistent representation (e.g., "1000"). This makes counting and learning more efficient. In neural models, normalization also reduces rare-token issues and improves generalization.
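One possible normalization policy can be sketched as below. The exact rules are a design choice for your pipeline, not something the lecture prescribes; mapping spelled-out numbers like "one thousand" would need an additional lookup step not shown here.

```python
import re

def normalize_numbers(text):
    """Map numeric variants to one canonical digit form (a sketch)."""
    # Remove thousands separators inside digit groups: 1,000 -> 1000
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Drop zero-valued decimal tails: 1000.00 -> 1000
    text = re.sub(r"(?<=\d)\.0+\b", "", text)
    return text

print(normalize_numbers("It cost 1,000.00 dollars, not 2,500."))
# → "It cost 1000 dollars, not 2500."
```

With this policy, "1,000", "1000", and "1000.00" all collapse to the single token "1000", so their counts pool together instead of being spread across variants.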

    F) Tokenization tools and choices

    • NLTK and spaCy provide robust tokenizers that handle punctuation and contractions well. Domain-specific tokenizers exist for code (handling symbols like ==, {}, and camelCase) or social media (handling @mentions, #hashtags, and emojis). Choosing a tokenizer depends on the domain and the model’s requirements. Early, careful tokenization choices pay long-term dividends in accuracy and simplicity.

    Subword tokenization and vocabulary management

    A) Motivation for subwords

    • Word-level vocabularies become massive, especially for morphologically rich languages (languages where words change form a lot). Large vocabularies cause memory bloat, sparsity (many rare words), and out-of-vocabulary (OOV) problems—unknown words the model has never seen. Subword tokenization breaks words into frequent pieces, drastically reducing OOVs and controlling vocabulary size.

    B) Byte Pair Encoding (BPE)

    • BPE begins with a character-level vocabulary. You count frequencies of adjacent token pairs (initially character pairs) in the corpus. Then you merge the most frequent pair to form a new token. You repeat merging until you reach a desired vocabulary size (e.g., 30,000). This process yields subwords that reflect common letter or morpheme sequences.
    • Toy example: Training data includes “low” (x5), “lower” (x2), “newest” (x6), “widest” (x3). Split into characters with word boundary markers (conceptually). Count pair frequencies: if “e”+“s” is most frequent (occurs 9 times), merge to “es.” Recount pairs; if “es”+“t” is now most frequent (again 9 times), merge to “est.” Continue until the vocabulary reaches the target size. These merges create reusable chunks that compose many words.
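The merge procedure from the toy example can be implemented directly. This sketch works on the same word-frequency table; it omits the end-of-word boundary markers that production BPE implementations add, but reproduces the “es” then “est” merges described above.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: count} dict (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 2))  # → [('e', 's'), ('es', 't')]
```

Running it on the lecture's corpus, “e”+“s” wins the first round (6 from “newest” + 3 from “widest” = 9 occurrences) and “es”+“t” wins the second, exactly as described.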

    C) WordPiece and SentencePiece

    • WordPiece, used in models like BERT, selects merges that maximize the likelihood of the data under a language model objective, rather than simply the most frequent pair. SentencePiece can build subword vocabularies directly from raw text (no pre-tokenization step) and supports multilingual corpora easily. While the lecture mentions them briefly, the key takeaway is that multiple subword strategies exist to meet different goals.

    D) Vocabulary size trade-offs

    • With word-level tokens, the vocabulary equals the set of unique words—often hundreds of thousands or millions—leading to memory and sparsity issues. Subwords reduce the vocabulary (e.g., 30K to 50K tokens), which is easier to handle and covers virtually all words by composition. The trade-off is slightly longer token sequences because words become multiple subwords. In practice, this trade-off improves overall performance and resource usage.

    E) Practical tips

    • Train the subword vocabulary on data similar to your training/inference domain. Pick a vocabulary size that balances coverage and efficiency (30K is a common starting point for English). Ensure consistent tokenization during training and inference; mismatches can degrade performance severely. Keep punctuation and basic normalization consistent across datasets.

    Classic N-gram modeling

    A) Core idea

    • An N-gram model estimates P(w_t | w_{t−N+1}, …, w_{t−1}) using counts from a corpus. For bigrams (N=2), this becomes P(w_t | w_{t−1}); for trigrams (N=3), P(w_t | w_{t−2}, w_{t−1}). The estimator is count(w_{t−N+1}^{t}) / count(w_{t−N+1}^{t−1}). Example: P("sat" | "the", "cat") = count("the cat sat") / count("the cat"). This simple ratio turns observed frequencies into predictions.
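This estimator is just two `Counter` lookups over the corpus. A minimal bigram sketch (the corpus string below is invented for illustration):

```python
from collections import Counter

def bigram_prob(tokens, w_prev, w):
    """MLE bigram estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # every token that serves as a context
    if contexts[w_prev] == 0:
        return 0.0  # unseen context; smoothing (next lecture) handles this better
    return bigrams[(w_prev, w)] / contexts[w_prev]

corpus = "the cat sat on the mat the cat ate".split()
print(bigram_prob(corpus, "the", "cat"))  # count("the cat")=2, count("the")=3 → 2/3
```

Note the zero returned for unseen contexts is exactly the sparsity problem that smoothing, backoff, and interpolation address.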

    B) Data sparsity (preview)

    • Many valid sequences never appear in a finite corpus, leading to zero counts and zero probabilities. Smoothing, backoff, and interpolation (covered next lecture) fix this by redistributing probability mass to unseen or low-frequency events. Even if not detailed here, you should know these techniques are essential to make N-grams robust.

    C) Strengths and limits

    • Strengths: fast, simple, interpretable; strong baseline; helps you understand probability and sequence modeling. Limits: fixed context window misses long-range dependencies; heavy sparsity for larger N; struggles with rare and unseen phrases. These limits motivate neural approaches.

    Neural language models (high-level overview)

    A) RNNs and LSTMs

    • RNNs pass information forward via a hidden state updated at each token. LSTMs add gates to control what to remember or forget, enabling longer-term context tracking. These architectures learn to map sequences of token embeddings to next-token probabilities. While powerful, they process tokens sequentially, which can be slower compared to Transformers.
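The hidden-state update can be sketched in pure Python for one vanilla RNN cell; the weights below are hand-set for illustration only (real weights are learned, and real models use PyTorch tensors):

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    """One vanilla-RNN update: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w_xh[i][j] * x[j] for j in range(len(x)))
            + sum(w_hh[i][j] * h[j] for j in range(len(h)))
            + b[i]
        )
        for i in range(len(h))
    ]

# Tiny hand-set weights (illustrative; not trained).
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

h = [0.0, 0.0]                      # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:  # two token embeddings
    h = rnn_step(x, h, w_xh, w_hh, b)
print(h)  # hidden state now summarizes both tokens
```

The key point is the recurrence: the same step function is applied at every position, so information from earlier tokens flows forward through `h`, but the steps cannot be parallelized across time.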

    B) Transformers and attention

    • Attention lets the model compute how much each token should pay attention to other tokens in the sequence. Transformers perform these attention operations in parallel across tokens, enabling efficient training on large datasets. They stack attention and feed-forward layers to learn deep, contextual representations. This approach dominates modern large language models.
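A minimal sketch of scaled dot-product attention, the core Transformer operation, in pure Python (vectors are illustrative; production code uses batched matrix multiplies on GPU):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One query attending over three tokens, 2-d vectors (illustrative numbers).
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(q, k, v))  # ≈ [[3.0, 4.0]]: a weighted mix of the value vectors
```

Each output is a weighted average of all value vectors, with weights determined by query–key similarity; because every query can be processed independently, the whole computation parallelizes across tokens.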

    Evaluation metrics

    A) Perplexity

    • Perplexity summarizes how well the model predicts test data. Intuitively, if the model tends to narrow down to about k plausible options at each step, perplexity is about k. Formally, it is the exponentiated average negative log-likelihood on the test set. Lower perplexity signals better predictive performance.
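The formal definition translates directly into code; the probability lists below are illustrative stand-ins for a model's per-token predictions on a test set:

```python
import math

def perplexity(probs):
    """Perplexity = exp(average negative log-likelihood) over token probabilities."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

# A model that always assigns the true next token probability 0.1 has
# perplexity 10: as if choosing uniformly among ~10 options per step.
print(perplexity([0.1] * 20))  # 10.0 (up to floating-point error)
print(perplexity([0.5] * 20))  # 2.0 — a sharper model scores lower
```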

    B) BLEU

    • BLEU compares generated text (candidate) with human-written references by measuring n-gram overlaps and applying brevity penalties for too-short outputs. Higher BLEU scores indicate closer matches to human-like phrasing in tasks like machine translation. BLEU is not perfect but provides a common yardstick across systems.
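A deliberately simplified BLEU-style score can illustrate the two ingredients, clipped n-gram precision and the brevity penalty. Note this sketch uses only unigrams; real BLEU geometrically averages clipped precisions for n = 1..4:

```python
import math
from collections import Counter

def simple_bleu(candidate, reference):
    """Simplified BLEU-style score: clipped unigram precision x brevity penalty.
    (Real BLEU geometrically averages clipped n-gram precisions for n = 1..4.)"""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate token's count by its count in the reference,
    # so repeating a common word cannot inflate the score.
    overlap = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(simple_bleu("the the the", "the cat sat"))  # clipping caps this at 1/3
```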

    Step-by-step implementation guide (from scratch mindset)

    1. Basic word-level tokenizer and normalizer
    • Step 1: Lowercase text if your application benefits from case-insensitivity (decide based on task). Step 2: Split punctuation from words (e.g., using regex) so that periods, commas, and question marks become separate tokens. Step 3: Split contractions into components (e.g., "I’m" → ["I", "’m"]). Step 4: Normalize numbers to a common form (e.g., remove commas: "1,000" → "1000"). Step 5: Build a vocabulary of unique tokens from the training set.
    • Tips: Keep the same rules for training and inference. Store mappings token→id and id→token. Consider special tokens like <unk> (unknown), <bos> (beginning of sentence), and <eos> (end of sentence) for sequence boundaries.
    2. Training a simple N-gram model
    • Step 1: Choose N (start with bigram or trigram). Step 2: For each sentence, add boundary tokens (e.g., <bos> at start, <eos> at end) and update N-gram counts by sliding a window of size N. Step 3: For each unique (N−1)-gram context, compute conditional probabilities by dividing N-gram counts by context counts. Step 4: Store probabilities in a dictionary keyed by context.
    • Tips: Use a default small probability for unseen events (or wait for next lecture’s smoothing methods). Ensure you handle <unk> consistently for tokens not in the vocabulary.
    3. Computing perplexity for evaluation
    • Step 1: Tokenize and prepare a held-out test set. Step 2: For each position t, retrieve P(w_t | context). Step 3: Sum the log probabilities across all positions and compute the average negative log-likelihood. Step 4: Exponentiate to get perplexity. Lower values indicate a better model.
    • Tips: Avoid zero probabilities by applying smoothing (preview) or a small floor probability; otherwise, perplexity becomes infinite.
    4. Building a minimal BPE tokenizer (conceptual outline)
    • Step 1: Start with a vocabulary of all characters present in the corpus. Represent each word as a sequence of characters (optionally with a word boundary marker). Step 2: Count all adjacent pair frequencies across the corpus. Step 3: Merge the most frequent pair into a new token and update all words accordingly. Step 4: Repeat Steps 2–3 until you reach a target vocabulary size (e.g., 30K merges).
    • Tips: Save the merge list because you must apply the same merges in the same order at inference. Choose the vocab size based on domain and resource constraints. Ensure space or boundary handling is consistent.
    5. Using NLTK/spaCy tokenizers
    • Step 1: Install NLTK or spaCy. Step 2: For NLTK, use word_tokenize to split sentences into tokens; for spaCy, load a language model (e.g., en_core_web_sm) and process text to get tokens. Step 3: If needed, customize rules (e.g., special-case URLs or emojis). Step 4: Verify outputs on sample texts (contractions, punctuation, numbers) before large-scale preprocessing.
    • Tips: spaCy models must be downloaded; NLTK tokenizers may require punkt downloads. Always test the tokenizer on your domain data.
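The tokenizer, N-gram training, and perplexity steps above can be sketched end to end in Python. The toy corpus and regex rules are illustrative; a real pipeline needs more normalization and the smoothing methods from the next lecture:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Toy rule-based tokenizer: lowercase, split contractions and punctuation."""
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

train = "the cat sat on the mat . the cat sat on the rug ."
tokens = ["<bos>"] + tokenize(train) + ["<eos>"]

# Bigram counts, context counts, and conditional probabilities.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
FLOOR = 1e-6  # small floor for unseen events (proper smoothing comes next lecture)

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if bigrams[(prev, w)] else FLOOR

def perplexity(test_tokens):
    logs = [math.log(p(w, prev)) for prev, w in zip(test_tokens, test_tokens[1:])]
    return math.exp(-sum(logs) / len(logs))

test = ["<bos>"] + tokenize("the cat sat on the mat .") + ["<eos>"]
print(perplexity(test))  # ≈ 1.414 for this in-domain sentence; unseen bigrams raise it
```

The same `tokenize` function must be used for training and test text; swapping tokenizers between the two is exactly the train/inference mismatch the tips warn about.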
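The BPE outline in step 4 can likewise be sketched on the toy corpus from the lecture (low ×5, lower ×2, newest ×6, widest ×3); this minimal version omits word-boundary markers and the saved-merge replay needed at inference:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, weighted by corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(bpe_merges(corpus, 2))  # [('e', 's'), ('es', 't')]
```

The first two merges reproduce the lecture's toy example: "e"+"s" → "es" (9 occurrences from newest and widest), then "es"+"t" → "est".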

    Tools and libraries

    • Python: general-purpose language for data preparation and modeling. PyTorch: deep learning framework used throughout the course to build neural models. NLTK and spaCy: tokenization and NLP utilities. GPU access: accelerates training for neural models.
    • Why these tools: Python is accessible and widely supported; PyTorch balances flexibility and performance; NLTK/spaCy offer reliable tokenization; GPUs make training practical on realistic datasets.

    Tips and warnings

    • Consistency is king: Use the exact same tokenization and vocabulary for training and inference; mismatches cause severe performance drops. Handle unknowns: Decide on <unk> strategies and apply them uniformly. Numbers: Pick a normalization policy early and stick to it. Punctuation: Keep it separate to preserve sentence boundaries.
    • Data domain: Train tokenizers and vocabularies on data similar to your task domain; mismatched domains reduce effectiveness. Vocabulary size: Too large wastes memory and increases sparsity; too small leads to long sequences and potential loss of nuance. Start around 30K and adjust. Evaluation: Always evaluate on held-out data to avoid overfitting illusions.

    Putting it all together

    • A practical pipeline: Collect data → tokenize (word or subword) → build vocabulary → choose model family (N-gram or neural) → train → evaluate with perplexity (and BLEU if relevant) → iterate on tokenization, vocabulary size, and hyperparameters. For a first project, start with a trigram model and a careful tokenizer; compute perplexity; then consider subword tokenization to reduce vocabulary and handle rare words. As you advance, move to neural models in PyTorch and explore Transformers.

    Course logistics recap

    • Grading: 60% homework, 30% final project, 10% participation. Prerequisites: programming (CS106A or equivalent), probability/statistics (CS109 or equivalent); deep learning familiarity helps but isn’t required. Resources: course website, Piazza, recorded lectures, office hours by the instructor and TAs (Alice, Bob, Carol). Tools: Python, PyTorch; GPUs available; PyTorch tutorials on the official site and CS230 provide additional guidance.

    Next steps

    • The upcoming focus: N-gram smoothing, backoff, and interpolation—methods to address zero counts and data sparsity by smartly redistributing probability. Mastering these completes your foundation in classic language modeling and sets you up to understand why and how neural models improve on these ideas.

    04 Examples

    • 💡

      Next-word prediction: Input the prefix “the cat sat on the”. The model examines the context and assigns probabilities to possible next words. Because “mat” is a common continuation, it often receives a high probability. Output: the model predicts “mat” as the next token. Key point: language models use context to produce natural continuations.

    • 💡

      Speech recognition ambiguity: An audio clip could be heard as “recognize speech” or “wreck a nice beach.” The language model scores these options given the conversation context. It prefers the phrase that is more probable in that context (often “recognize speech”). Output: correct phrase selection thanks to language-model probabilities. Key point: LMs help disambiguate noisy acoustic signals.

    • 💡

      Machine translation fluency: Given an English sentence to translate into French, the system generates many candidate French sequences. The language model boosts candidates that are grammatical and natural. It penalizes awkward, rare constructions. Output: a fluent, human-like French sentence. Key point: LMs steer generation toward fluent target-language outputs.

    • 💡

      Text generation from a news corpus: Train a language model on news articles. At inference, feed a starting phrase like “Officials announced today.” The model repeatedly predicts next words, generating a paragraph that matches news style. Output: a coherent news-like article. Key point: LMs can produce text resembling their training data.

    • 💡

      Spam classification: A system uses LM-derived features or probabilities to help classify emails. Certain patterns (like suspicious phrases or token distributions) raise spam likelihood. The classifier then labels messages accordingly. Output: emails labeled as spam or not spam. Key point: LMs capture subtle text patterns useful for classification.

    • 💡

      Whitespace tokenization pitfall: Input “I’m going to the store.” A naive split yields [“I’m”, “going”, “to”, “the”, “store.”]. Punctuation sticks to words, and contractions are not split. Output: messy tokens that confuse downstream models. Key point: whitespace-only tokenization harms accuracy.

    • 💡

      Rule-based tokenization fix: Input “I’m going to the store.” A smarter tokenizer outputs [“I”, “’m”, “going”, “to”, “the”, “store”, “.”]. Punctuation is separated, and the contraction is split. Output: clean, language-aware tokens. Key point: proper tokenization supports better learning and evaluation.

    • 💡

      Number normalization: Consider “1,000”, “1000.00”, and “one thousand.” A normalization step converts all to “1000.” The model now sees these as the same value. Output: consistent numeric tokens. Key point: normalization reduces meaningless variation.
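A tiny regex-based sketch of this normalization step (the patterns are illustrative and only cover digit forms; mapping spelled-out numbers like "one thousand" would need extra rules):

```python
import re

def normalize_numbers(text):
    """Toy normalization: strip thousands separators and trailing zero decimals."""
    text = re.sub(r"(\d),(?=\d{3})", r"\1", text)  # "1,000"   -> "1000"
    text = re.sub(r"(\d+)\.0+\b", r"\1", text)     # "1000.00" -> "1000"
    return text

print(normalize_numbers("1,000"))    # 1000
print(normalize_numbers("1000.00"))  # 1000
```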

    • 💡

      Subword tokenization for rare words: Take “unbelievable.” Subword methods split it into [“un”, “believe”, “able”]. The model likely knows each subword well, even if the full word is rare. Output: tokens that the model understands and can recombine. Key point: subwords handle rare and new words effectively.

    • 💡

      BPE toy example with merges: Training data contains “low” (x5), “lower” (x2), “newest” (x6), “widest” (x3). Start from characters and count adjacent pairs. Merge the most frequent pair “e”+“s” → “es”; recount; then merge “es”+“t” → “est.” Repeat until reaching the target size. Output: a subword vocabulary with chunks like “es” and “est.” Key point: frequent pairs build reusable pieces.

    • 💡

      Trigram probability estimate: Suppose “the cat” appears 1000 times and “the cat sat” 500 times in a corpus. The trigram conditional probability P(sat | the cat) = 500/1000 = 0.5. This probability captures how often “sat” follows “the cat.” Output: a numeric estimate guiding predictions. Key point: N-grams turn counts into conditional probabilities.

    • 💡

      Perplexity intuition: A model with perplexity 10 acts like it’s choosing among about 10 likely words at each step. If training improves the model, perplexity drops (e.g., from 15 to 10). Lower perplexity means better next-word predictions. Output: a single score summarizing predictive quality. Key point: perplexity provides an easy comparison across models.

    • 💡

      BLEU for translation: The system outputs a French sentence; you compare it to reference translations. BLEU measures n-gram overlaps and applies a brevity penalty if it’s too short. A higher BLEU suggests a closer match to human translations. Output: a numeric BLEU score used for benchmarking. Key point: BLEU standardizes evaluation in MT.

    • 💡

      NLTK vs spaCy tokenization: Run both on “He emailed Dr. Smith, Jr. at 3:00 p.m.” Each tool applies different rules for abbreviations and punctuation. You inspect outputs to see which suits your task. Output: chosen tokenizer with consistent, correct splits. Key point: evaluate tokenizers on your domain text.

    • 💡

      Sentence boundaries with punctuation: Input “The cat sat on the mat. It purred.” Proper tokenization yields [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”, “It”, “purred”, “.”]. Separating periods makes sentence segmentation clear. Output: clean sentence boundaries for training. Key point: punctuation as tokens improves downstream processing.

    05 Conclusion

    This kickoff lecture builds the foundation for a hands-on journey into language modeling. You learned that a language model assigns probabilities to sequences of words and predicts likely next tokens, enabling applications like speech recognition, machine translation, text generation, and classification. Classic N-gram models estimate probabilities from counts over short contexts, while neural models—RNNs, LSTMs, and especially Transformers with attention—capture long-range dependencies and now dominate state-of-the-art systems. Evaluation uses perplexity to measure predictive power and BLEU to compare generated text with human references.

    Tokenization emerged as a central practical concern: splitting text into tokens that models can learn from. Naive whitespace splitting causes issues with punctuation and contractions, while robust tokenizers (like NLTK and spaCy) produce clean, consistent tokens. Handling numbers through normalization reduces unnecessary variation. Subword tokenization (BPE, WordPiece, SentencePiece) solves rare-word and vocabulary-size challenges by breaking words into reusable pieces, as seen in the toy BPE example that merges frequent pairs (“es,” then “est”). Thoughtful tokenization choices strongly affect model performance and memory usage.

    Course logistics set expectations and support: 60% homework, 30% final project, 10% participation; prerequisites in programming and probability/statistics; deep learning familiarity is helpful but not required. You will code in Python and PyTorch, have GPU access, and benefit from recorded lectures, the course website, Piazza, and office hours from the instructor and TAs (Alice, Bob, Carol). For learning PyTorch, official tutorials and CS230 materials are excellent resources.

    To practice, start by building a tokenizer that separates punctuation and splits contractions, normalize numbers, and construct a word-level vocabulary. Implement a bigram or trigram model, compute perplexity on held-out text, then try subword tokenization (like BPE) to reduce vocabulary size and handle rare words. Next, deepen your understanding with N-gram smoothing, backoff, and interpolation to fix zero-count problems—a focus of the next lecture. The core message is that careful data preparation (especially tokenization) and a solid grasp of probability set the stage for success with both classic and modern language models. Remember: consistent tokenization, sensible vocabulary sizes, and clear evaluation metrics are the pillars of reliable language modeling.

  • ✓Split contractions into parts. Handle forms like “I’m,” “can’t,” and “they’ll.” This reduces vocabulary bloat and clarifies grammatical relations. It improves tagging, parsing, and prediction. Consistency is more important than perfection.
  • ✓Choose tools that match your domain. NLTK and spaCy are strong general-purpose tokenizers. For code or social media, explore specialized tokenizers that understand symbols and emojis. Test on your data before large-scale preprocessing. Tool choice directly impacts model quality.
  • ✓Document preprocessing decisions. Record tokenization rules, normalization choices, and vocabulary size. Save BPE merge lists or SentencePiece models for reuse. This ensures reproducibility and consistent inference. Your future self will thank you.
  • ✓Plan for unseen data. Even with good tokenization, test sets will contain surprises. Subwords reduce OOVs, but smoothing/backoff/interpolation (for N-grams) are still needed. For neural models, robust tokenization and regularization help. Always validate on held-out data.
  • ✓Balance vocabulary size and sequence length. Larger vocabularies shorten sequences but increase memory use and sparsity. Smaller vocabularies save memory and keep coverage high through subword composition, at the cost of longer sequences. Start moderate (e.g., 30K) and tune. Watch GPU memory and training speed.
  • ✓Use GPUs for neural training. They cut training time and enable larger models. Confirm your environment (drivers, CUDA) and batch sizes. Monitor utilization and memory to avoid crashes. Faster experiments accelerate learning.
  • ✓Leverage PyTorch tutorials and examples. Start with small models to grasp the workflow. Move step by step from N-grams to RNN/LSTM to Transformer. Keep experiments organized with clear configs. Build confidence through incremental complexity.
  • ✓Keep evaluation honest. Never tune on the test set. Use a validation set for choices like vocabulary size or tokenizer rules. Report final numbers on a held-out test. This guards against overfitting and inflated expectations.
  • ✓Iterate systematically. Change one factor at a time: tokenization rule, vocabulary size, or model type. Record results so you can compare fairly. Let perplexity and BLEU drive decisions. Small, measured steps beat random tweaking.
  • ✓Prepare for the next step: smoothing and backoff. These are essential for practical N-gram models. They fix zero counts and stabilize predictions. Learning them will clarify why neural models later perform better. Master the basics to understand the advances.
  • ✓Connect applications to modeling choices. If you’re building a speech recognizer, focus on language-domain tokenization and perplexity reductions. For translation, subwords and BLEU matter a lot. For classification, clean tokens boost feature quality. Tailor your pipeline to the task.
  • Bigram

    An N-gram with N=2, focusing on pairs of consecutive tokens. A bigram model predicts the next word from the previous word. It is simple and fast to compute. But it misses longer patterns. It’s a good starting baseline.

    Trigram

    An N-gram with N=3 that uses the last two tokens to predict the next. It captures more context than bigrams. It can model short phrases and common 3-word patterns. But it still struggles with rare sequences. Smoothing helps reduce zero probabilities.

    Neural language model

    A model that uses neural networks to learn word sequence patterns. It can capture complex relationships and long-range dependencies. Common types include RNNs, LSTMs, and Transformers. They outperform N-grams on many tasks. They require more compute and data.

    Recurrent Neural Network (RNN)

    A neural network designed for sequences that passes a hidden state along each time step. This hidden state stores information from previous tokens. RNNs model order and context over time. However, they can struggle with very long dependencies. LSTMs improve on that.
