Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 14: Data 2 | How I Study AI
📚 Stanford CS336: Language Modeling from Scratch (Lecture 14 of 17)
Intermediate · Stanford Online · LLM · YouTube

Key Summary

• The lecture explains why rare words are a core challenge in language modeling. Most corpora follow Zipf’s law, where a few words appear very often and a huge number appear very rarely. Rare words make probability estimates unreliable and inflate vocabulary size, which increases memory and slows training and inference.
• One simple fix is word replacement: swap all words that occur below a threshold (like fewer than 5 times) with a special token such as <UNK>. This shrinks the vocabulary and lets the model learn a stable probability for unknown words. Variants include part-of-speech-aware tokens like <UNK-NOUN> and statistical methods to avoid replacing informative words.
• Subword tokenization breaks words into smaller, reusable pieces so rare words can be built from common parts. Byte Pair Encoding (BPE) merges the most frequent character pairs, while WordPiece merges pairs that most improve the likelihood of the data. This lets models handle unseen words by composing them from known subwords.
• Character-level models predict the next character instead of the next word or subword. This removes the out-of-vocabulary problem because the alphabet is small and fixed, but sequences become much longer and harder to learn, making training more expensive. Hybrids often combine subword tokens with character-level modeling to balance flexibility and efficiency.
• The lecture then moves to structured data and knowledge, such as database tables and knowledge graphs. Structured data contains entities and relations in explicit, clean formats, which can help models be more accurate and robust. Using this data can augment training, constrain outputs, or guide learning.
• One method is to augment training data with sentences generated from structured sources in a domain (e.g., medicine). This teaches the model domain-specific facts like symptoms of diseases or drug side effects. It boosts performance on domain tasks by filling gaps left by raw web text.

Why This Lecture Matters

Anyone building or fine-tuning language models faces the same core pain points: rare words, factual accuracy, and limited data. Product teams, research scientists, data engineers, and ML practitioners benefit from knowing how to manage these issues. Handling rare and unknown words makes models faster, smaller, and more reliable, which matters in production where latency and cost are real constraints. Subwords and sensible unknown-token strategies reduce failures on new names, typos, and domain terms. Structured knowledge integration improves correctness for question answering, search assistants, and domain tools (healthcare, finance, law), where wrong answers have real consequences. Knowledge graph embeddings add reasoning power and enable discovery and validation. Data augmentation closes gaps when labeled or in-domain data is scarce, improving robustness to paraphrasing, noise, and incomplete inputs—key for chatbots, classification, and translation systems used by diverse users. These techniques also help career growth: they are foundational skills that show up across modern NLP pipelines and interviews. In an industry where factuality, efficiency, and reliability are decisive, mastering rare word handling, structured data use, and smart augmentation equips you to build LMs that perform better in the real world and scale with evolving user demands.

Lecture Summary


01 Overview

This lecture focuses on three pillars of data handling for language modeling: (1) managing rare and unknown words, (2) incorporating structured data and knowledge into models, and (3) using data augmentation to strengthen generalization. It begins by revisiting how text data differs from uniform datasets because of Zipf’s law: a small number of words appear extremely often, while an enormous number occur rarely or even once. This long tail makes it hard to estimate reliable probabilities for many words, inflates vocabulary size, and strains memory and compute. The lecture then offers practical solutions with clear trade-offs: simple word replacement with a special token, subword tokenization (such as Byte Pair Encoding and WordPiece), and character-level models, often combined for balance.

The second part turns to structured data and knowledge. Instead of learning only from raw, messy text, the lecture highlights resources like database tables and knowledge graphs that encode facts and relations directly. It outlines three ways to use these: augment training data with structured facts, constrain outputs using knowledge graphs to prevent invalid answers, and guide learning with curated parallel datasets (e.g., for translation). The lecture also introduces knowledge graph embeddings, with TransE as a core example: entities and relations are mapped to vectors so that relations act like translations in vector space, enabling reasoning, question answering, and even discovery of missing links.

Since real-world knowledge graphs are often noisy or incomplete, the lecture addresses resilience strategies. Robust embedding methods like DistMult and ComplEx can better handle noise and diverse relation types. Knowledge graph completion techniques predict missing facts based on existing patterns, while cleaning methods find and fix likely errors, sometimes with human review.

Finally, the lecture covers data augmentation—creating new training examples from existing ones to improve model robustness, especially when training data is scarce. It details back translation (translate text to a second language and back), synonym replacement (swap words for meaning-preserving alternatives), random insertion (add extra words to simulate noise), and random deletion (remove words to simulate incomplete input). It also points to advanced techniques like contextual augmentation using language models and GAN-based synthetic data generation. Each augmentation method aims to teach models to cope with variations in phrasing, noise, and missing information without drifting away from the original meaning.

This lecture is intended for students with a basic understanding of language modeling concepts—such as tokens, vocabularies, and probabilities—and some familiarity with how models are trained from text. Prior exposure to tokenization (covered previously) helps, as today’s material builds on those foundations. The content serves both those building classic n-gram or neural language models and those interested in modern large language models, as the data issues and solutions apply widely.

After completing this lecture, you will be able to: explain why rare words are challenging and choose among replacement, subword, or character-level strategies; understand how and when to integrate structured data to improve accuracy and reliability; describe and implement knowledge graph embeddings like TransE (and know when to reach for DistMult or ComplEx); and select and apply data augmentation methods that fit your task without breaking semantics. The lecture is structured in three segments—rare/unknown words, structured data and knowledge, and data augmentation—with examples such as “unbelievable” split into subwords, a QA constraint for “What is the capital of France?”, and augmentation edits to “The cat sat on the mat.” It balances conceptual explanations, everyday analogies, and practical pipelines so you can apply these techniques in real projects.

Key Takeaways

• ✓ Start with subword tokenization as your default. Train a BPE or WordPiece model on your corpus with a target vocabulary around 30k–50k. Check how often unknown or long token sequences appear and adjust size accordingly. Subwords give strong coverage with manageable compute.
• ✓ Use <UNK> replacement when you stick with word-level vocabularies or for safety nets. Choose a frequency threshold that meaningfully reduces vocabulary without erasing critical domain terms. Consider POS-aware unknown tokens to preserve grammar hints. Validate the impact on a dev set before full training.
• ✓ Avoid over-replacing informative rare words. Use simple statistics (e.g., TF-IDF or domain term lists) to keep key terms while replacing typos and noise. This maintains domain accuracy while shrinking the vocabulary. Always sample and inspect replacements for sanity.
• ✓ Remember the compute trade-offs of character-level models. Longer sequences mean higher memory and slower training, especially for transformers. If you need character coverage, consider hybrid architectures. Profile training to confirm feasibility on your hardware.
• ✓ Leverage structured data to boost factuality. Generate natural language statements from databases or KGs to augment training. Mix them in gradually to avoid style drift. Track domain task metrics to measure benefit.
• ✓ Constrain outputs for tasks where correctness matters. For QA, build candidate sets from a knowledge graph or verify answers post-generation. Constraints reduce nonsense and increase trust. Provide fallbacks when constraints are empty or uncertain.

Glossary

Zipf's law

A rule that says a few items are very common while most are very rare. In language, a small set of words appear a lot, and a huge number appear only a few times. This creates a long tail of rare words. It makes it hard for models to learn good statistics for many words. It also increases vocabulary size and memory needs.

Rare words

Words that appear very few times in a dataset. The model has little evidence to estimate their probabilities. This can lead to poor predictions when those words appear. They also increase vocabulary size and slow down training. Handling them is key to better models.

Vocabulary

The list of all tokens a model knows how to handle. Bigger vocabularies need more memory and make the output layer larger. Smaller vocabularies can miss important words or details. Finding a good size is important for speed and accuracy. Tokenization choices control this size.

Tokenization

The process of splitting text into pieces called tokens. Tokens can be words, subwords, or characters. Good tokenization helps the model learn patterns. Poor tokenization makes learning harder. Choosing the right level affects performance.

#zipf's law#rare words#unknown token#subword tokenization#byte pair encoding#wordpiece#character-level model#knowledge graph#transE#distmult#complex#knowledge graph embeddings#data augmentation#back translation#synonym replacement#random insertion#random deletion#contextual augmentation#parallel corpus
• Another method is to constrain the model’s answers using a knowledge graph. For a question like “What is the capital of France?”, the system restricts outputs to valid capitals, cutting down nonsensical answers. Constraints serve as guardrails to increase reliability.
• Structured data can also guide learning through aligned data like a parallel corpus for translation. Paired sentences in different languages teach the model how to map meaning across languages. This improves fluency and accuracy for machine translation tasks.
• Knowledge graph embeddings turn nodes (entities) and edges (relations) into vectors that preserve structure. TransE models a relation as a translation: e_head + e_relation ≈ e_tail (e.g., Paris + is_capital_of ≈ France). These embeddings can support better reasoning, question answering, and discovery of new links.
• Noisy or incomplete knowledge graphs are common, and the lecture discusses coping strategies. More robust embedding methods like DistMult and ComplEx better handle noise and richer relation patterns. Knowledge graph completion infers missing facts, and cleaning detects and fixes likely errors.
• Data augmentation creates new training examples from existing ones to improve generalization, especially with limited data. Back translation translates text to another language and back, producing varied yet faithful sentences. It teaches the model to be robust to paraphrasing and phrasing shifts.
• Simple text edits like synonym replacement, random insertion, and random deletion increase resilience to wording and noise. For example, replacing “sat” with “perched,” inserting “fluffy,” or deleting “the” from “The cat sat on the mat” introduces variety. More advanced methods include contextual augmentation with language models and even GANs for synthetic data.
• Each technique has trade-offs: <UNK> is simple but loses meaning, subwords balance coverage and efficiency, and char-level handles any string but costs more compute. Structured knowledge boosts correctness but requires maintenance and noise handling. Augmentation improves robustness but must preserve meaning to avoid hurting performance.
• The lecture’s practical message is to pick methods based on needs and constraints. Start with subwords for modern LMs and add <UNK> rules for edge cases. Use structured data where accuracy and consistency matter, and apply targeted augmentation when data is scarce.
02 Key Concepts

    • 01

      Zipf’s Law in Language Data: Zipf’s law says a few words are used a lot and a massive number are used very rarely. Think of it like city sizes: a few giant cities and many small towns. In text, this means you have many words appearing once or only a few times, making it hard to learn good probabilities for them. With unreliable counts, models misjudge rare word likelihoods and produce worse predictions. It also blows up vocabulary size, making training slower and memory usage higher.
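The long tail is easy to check on any corpus. A minimal sketch, using an illustrative sample text (not from the lecture):

```python
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat by the door "
    "a rare aardvark wandered past the quiet bungalow"
)
counts = Counter(text.split())

# Hapax legomena: word types that occur exactly once.
hapaxes = [w for w, c in counts.items() if c == 1]
print(counts.most_common(2))  # a few words dominate the counts
print(len(hapaxes), "of", len(counts), "word types occur only once")
```

Even in this tiny sample, most word types are hapaxes; real corpora show the same shape at scale.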

    • 02

      Why Rare Words Hurt Models: Rare words give too little data to estimate accurate probabilities for n-grams or neural embeddings. It’s like trying to judge a movie from one scene—you don’t have enough to be confident. This leads to poor predictions and mistakes in generation or translation when rare words appear. Large vocabularies also increase parameter counts, slow lookups, and complicate softmax layers. Managing rare words is key to improving both accuracy and efficiency.

    • 03

      Word Replacement with <UNK>: Word replacement swaps all low-frequency words (e.g., frequency < 5) for a single token like <UNK>. It’s like putting all unknown spices in a jar labeled “spice” when cooking—you lose the specific flavor, but the recipe still works. Technically, it reduces vocabulary size and lets the model learn a stable probability for unknown words. This improves robustness to unseen words and helps with memory and speed. The cost is loss of meaning for specific rare words.

    • 04

      Part-of-Speech-Aware Unknown Tokens: Instead of one <UNK>, use tags like <UNK-NOUN> or <UNK-VERB> to keep grammatical hints. It’s like knowing the mystery ingredient is a vegetable or a spice, even if you don’t know which one. This preserves some semantic and syntactic information, which can help tasks like translation. Implementation requires a part-of-speech tagger to label rare words before replacement. You retain efficiency while improving grammatical correctness.

    • 05

      Selective Replacement via Statistics: Rather than a fixed frequency cutoff, use tests to avoid replacing informative low-frequency terms. Imagine keeping rare but crucial medical terms while discarding typos. Techniques can compare expected vs observed frequencies to flag which words deserve preservation. This balances vocabulary control with domain accuracy. The result is fewer harmful replacements and better downstream performance.

    • 06

      Subword Tokenization: Subword tokenization breaks words into common pieces, like “un + believe + able.” It’s like building words from Lego bricks, so even rare castles can be made from common blocks. Algorithms such as BPE and WordPiece learn which pieces are most useful. This covers unseen words by composing them from known parts, improving probability estimates. It keeps vocabularies compact while handling novelty well.

    • 07

      Byte Pair Encoding (BPE): BPE merges the most frequent character pairs again and again to create subword units. Picture repeatedly gluing the two puzzle pieces that most often sit together. The process continues until you reach a target vocabulary size, giving a practical balance between granularity and coverage. At inference, words are split into the learned units, enabling representation of rare words. BPE is simple, fast, and widely used.

    • 08

      WordPiece: WordPiece chooses merges that most increase training data likelihood instead of just frequency. It’s like picking puzzle merges that make the overall picture clearest, not just the most common pair. This optimization can yield subword sets that better model language structure. WordPiece often improves handling of morphology and compound words. It remains conceptually similar to BPE but with a different selection rule.

    • 09

      Character-Level Models: Character-level models predict the next character, eliminating out-of-vocabulary issues. It’s like reading letter by letter—slow but never stuck on a new word. Sequences are much longer, making training more expensive and long-range dependencies harder to learn. Despite cost, character models handle typos, creative spellings, and rare names naturally. They are often combined with subword models for balance.

    • 10

      Hybrid Subword + Character Approaches: Hybrids use subwords for efficiency and characters for flexibility. Think of subwords as main roads and characters as alleys that reach every house. One method predicts characters within subwords, capturing fine details without huge vocabularies. This reduces OOV problems while keeping computations manageable. It benefits domains with many new terms, like user-generated text.

    • 11

      Structured Data and Knowledge: Structured data organizes facts clearly in rows, columns, and graph edges. It’s like a well-labeled library catalog instead of a pile of books. Using structured data can improve model accuracy, consistency, and factuality. It helps when raw text is noisy or incomplete. Models can learn or be guided by clean relationships and attributes.

    • 12

      Augmenting Training with Structured Data: We can generate sentences from domain databases (e.g., medical facts) and add them to training. It’s like adding notes from a reliable encyclopedia to your study guide. This fills blind spots left by web text and improves domain performance. The method is especially useful in specialized areas like healthcare or law. It makes models more knowledgeable and precise in those domains.

    • 13

      Constraining Outputs with Knowledge Graphs: Knowledge graphs can act as guardrails to restrict model answers. For “What is the capital of France?”, only known capitals are allowed as candidates. This reduces nonsense outputs and improves trustworthiness. The constraint step checks generated answers against graph facts. It’s a practical way to increase factual accuracy in QA systems.

    • 14

      Guiding Learning with Parallel Corpora: Parallel corpora provide aligned sentence pairs across languages, guiding translation models. It’s like having a bilingual tutor point to the exact match for each sentence. The model learns how ideas map between languages, improving fluency and accuracy. This guidance reduces reliance on noisy, unaligned web text. It’s a cornerstone of machine translation training.

    • 15

      Knowledge Graph Embeddings (KGE): KGE turns entities and relations into vectors that preserve graph structure. Imagine arranging points so that connected items are near in meaningful ways. With vectors for nodes and edges, models can measure plausibility of facts and infer new ones. Embeddings support QA, link prediction, and reasoning. They bridge structured knowledge with neural models.

    • 16

      TransE: TransE models a relation as a vector that translates the head entity to the tail entity (e_head + e_rel ≈ e_tail). It’s like an arrow from Paris through “is_capital_of” that lands near France. This simple geometry enables fast training and intuitive scoring of triples. It works well for one-to-one relations but struggles with more complex patterns. Still, it’s a foundational, widely used approach.

    • 17

      Handling Noisy or Incomplete Knowledge Graphs: Real knowledge graphs have mistakes and gaps. Robust embedding models like DistMult and ComplEx better handle symmetric, asymmetric, and noisy relations. Knowledge graph completion infers likely missing edges based on patterns in embeddings. Cleaning methods detect suspicious triples for human review and correction. These steps make structured knowledge more reliable for downstream use.

    • 18

      Data Augmentation Basics: Data augmentation creates new, slightly different training examples from existing ones. It’s like practicing a song in different keys so you can play it anywhere. The goal is to improve generalization, especially with small datasets. Care is needed to keep the meaning intact, or models may learn the wrong patterns. Done right, it improves robustness to phrasing, noise, and missing words.

    • 19

      Back Translation: Back translation translates text to another language and back to create paraphrases. It’s like telling a story in French and then retelling it in English with slightly different phrasing. The augmented sentences keep meaning but vary surface form, teaching robustness to rewording. This technique is especially effective in translation and summarization tasks. It scales with available translation systems.

    • 20

      Synonym Replacement: Synonym replacement swaps words with similar-meaning alternatives. Replacing “sat” with “perched” keeps the idea but changes the surface. This builds a model’s tolerance for varied wording. Care is needed to choose context-appropriate synonyms to avoid changing meaning. It’s simple, fast, and useful for many text tasks.

    • 21

      Random Insertion and Deletion: Random insertion adds extra words like “fluffy,” and random deletion removes words like “the.” These perturbations teach the model to handle noise and incomplete input. They simulate real-world text issues, such as typos or clipped messages. Overuse can damage grammar and semantics, so moderate application matters. The goal is robustness without confusion.

    • 22

      Advanced Augmentation (Contextual and GANs): Contextual augmentation uses a language model to insert or replace words that fit the sentence. GANs can generate synthetic text similar to the training data. These advanced methods diversify data while trying to keep meaning consistent. They require quality control to avoid drifting from ground truth. When curated, they can meaningfully boost performance.

03 Technical Details

    Overall Architecture/Structure

    1. Managing Rare and Unknown Words
    • Definition and challenge: In natural language data, word frequencies follow Zipf’s law, so many words appear only a few times or once. Classic count-based models and modern neural models both struggle when there is too little data to estimate a word’s distribution. Large vocabularies also balloon parameter counts (e.g., softmax output layers) and memory.
    • High-level solution set: (a) Replace rare words with a special token; (b) use subword tokenization to build rare words from common parts; (c) drop to the character level to eliminate out-of-vocabulary entirely; (d) combine these for a balanced approach.
    • Data flow: raw corpus → tokenization strategy (word/subword/char) → vocabulary construction and frequency stats → rare word handling (replacement or subword decomposition) → training data preparation → model training.
2. Incorporating Structured Data and Knowledge
    • Definition: Structured data includes database tables and knowledge graphs where entities and relations are explicit. This structure can improve factuality, consistency, and coverage.
    • Integration modes: (a) Augment training with sentences or templates derived from structured sources; (b) Constrain outputs using knowledge graphs as a filter or candidate set; (c) Guide learning using aligned datasets (e.g., parallel corpora for translation).
    • Data flow: structured source → transformation/generation → augmented training set or constraint module or supervised pairing → model training/inference with constraints.
3. Data Augmentation
    • Definition: Generate new training samples from existing ones while preserving meaning. Objectives include improved generalization, robustness to paraphrase, noise tolerance, and reduced overfitting.
    • Techniques: back translation; synonym replacement; random insertion/deletion; contextual augmentation; GAN-based synthetic data. Data flow: original text → augmentation pipeline(s) → combined dataset → training.

Code/Implementation Details (Conceptual)

A) Word Replacement (<UNK>)

    • Thresholding: Count word frequencies. Choose a threshold T (e.g., 5). For any word with frequency < T, replace with <UNK> in the training corpus.
    • POS-aware replacement: Run a part-of-speech tagger on sentences. For rare nouns, use <UNK-NOUN>; rare verbs → <UNK-VERB>; similarly for adjectives/adverbs if needed. You maintain a small set of unknown tokens and keep some syntactic information.
    • Statistical selection: Instead of a flat threshold, compute a significance score to decide whether a rare token carries disproportionate information (e.g., domain-specific proper nouns). If significant, keep it; otherwise replace. This may use heuristics such as term frequency–inverse document frequency (TF-IDF) or domain lists.
    • Pros/Cons: Pro: simple, memory-efficient, stable probability for unknowns. Con: discards specific semantics; repeated <UNK> tokens reduce expressivity.
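The thresholded replacement above can be sketched in a few lines; the corpus and threshold here are illustrative:

```python
from collections import Counter

def replace_rare(tokens, threshold=5, unk="<UNK>"):
    """Replace tokens occurring fewer than `threshold` times with `unk`."""
    counts = Counter(tokens)
    return [t if counts[t] >= threshold else unk for t in tokens]

corpus = "the cat sat on the mat the cat ran".split()
# "the" (3x) and "cat" (2x) survive; singletons become <UNK>.
print(replace_rare(corpus, threshold=2))
# → ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>']
```

A POS-aware variant would tag each rare token first and emit <UNK-NOUN>, <UNK-VERB>, etc., instead of a single <UNK>.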

    B) Subword Tokenization

    • BPE training procedure:
      1. Initialize the vocabulary with all characters seen in the corpus.
      2. Segment every word into characters (with boundary markers if desired).
      3. Count all adjacent symbol pairs across corpus.
      4. Merge the most frequent pair into a new symbol; update corpus by replacing that pair with its merged symbol.
      5. Repeat steps 3–4 until reaching the target vocabulary size (e.g., 30k tokens).
    • Inference (tokenization): For a new word, greedily apply the longest subwords learned during training to segment it. If needed, back off to smaller pieces down to characters.
    • WordPiece training procedure:
      1. Start with characters as tokens.
      2. Propose merges and compute which merge yields the greatest increase in the likelihood of the data under a simplified language model.
      3. Accept the best merge, update tokens, and iterate until the vocabulary reaches the target size.
    • Differences: BPE optimizes local pair frequency; WordPiece maximizes an objective related to data likelihood, often capturing morphological patterns better.
    • Parameters: target vocab size, minimum pair count, reserved special tokens (e.g., <PAD>, <UNK>, <BOS>, <EOS>), continuation markers (like ## in WordPiece) to indicate subword continuation.
    • Pros/Cons: Pro: handles unseen words, keeps vocabulary manageable, improves generalization. Con: introduces segmentation complexity; very small subword vocabularies increase sequence length; very large vocabularies reduce flexibility.
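The BPE merge loop (steps 3–4 above) can be sketched on a toy word-count table. The counts follow the classic low/lower/newest/widest illustration; ties are broken by taking the first pair seen, which is one of several reasonable conventions:

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    vocab = {tuple(word): c for word, c in word_counts.items()}  # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, c in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first on ties)
        merges.append(best)
        new_vocab = {}
        for symbols, c in vocab.items():  # rewrite every word with the merged symbol
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges

counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(bpe_merges(counts, 3))  # first merges capture the frequent "es"/"est" suffix
```

Production tokenizers also track boundary markers and stop at a target vocabulary size rather than a fixed merge count.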

    C) Character-Level Models

    • Mechanics: Model predicts next character given history. The token set is small (letters, digits, punctuation, whitespace), eliminating OOV issues.
    • Sequence length: If average word length is ~5 characters plus spaces, character sequences are about 5–6× longer than word sequences. For transformers with self-attention cost of O(L^2), this increases compute and memory sharply.
    • Learning long-range dependencies: The model must connect distant characters to form words and phrases, which is harder than word-level context. Regularization, curriculum learning, and longer training may be required.
    • Hybrid approach: Use subwords as the main units; within each subword, a character-level module can refine or generate characters, balancing coverage and speed.
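The length blow-up is easy to see with a quick back-of-envelope comparison (the sentence is illustrative):

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()
chars = list(sentence)  # character tokens, including spaces

print(len(words), "word tokens vs", len(chars), "character tokens")
# Self-attention cost grows with the square of sequence length, so the
# character sequence is roughly (len(chars)/len(words))**2 times more
# expensive per attention layer than the word sequence.
print(round((len(chars) / len(words)) ** 2, 1), "x attention cost")
```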

    D) Structured Data Integration

    • Augment training data: Create natural language sentences from a database or knowledge graph. For example, templates like “X is the capital of Y” produce factual sentences. Add these to the training corpus to boost domain knowledge.
    • Constrain outputs: During inference for QA, first use the knowledge graph to propose valid candidates (e.g., all world capitals). Force the model to choose from these or to verify generated answers by checking the graph.
    • Guide learning: For translation, paired sentences form supervised data that maps source to target language. Training aligns model representations across languages using cross-entropy or other sequence-to-sequence losses.
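The constraint idea for QA can be sketched as a post-generation check. The capitals table and the `constrained_answer` helper below are hypothetical stand-ins for a real knowledge graph lookup:

```python
# Hypothetical capitals table standing in for a knowledge graph.
CAPITAL_OF = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}
VALID_CAPITALS = set(CAPITAL_OF.values())

def constrained_answer(generated, country):
    """Accept a generated answer only if the graph confirms it; else fall back."""
    if generated in VALID_CAPITALS and CAPITAL_OF.get(country) == generated:
        return generated
    # Fallback: answer directly from the graph when the model disagrees with it.
    return CAPITAL_OF.get(country, "unknown")

print(constrained_answer("Paris", "France"))  # model agrees with the graph
print(constrained_answer("Lyon", "France"))   # rejected; graph answer used instead
```

The same pattern works as a pre-generation filter: build the candidate set from the graph first and restrict decoding to it.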

    E) Knowledge Graph Embeddings (KGE)

    • Entities and relations: A knowledge graph has nodes (entities) and edges (relations). Facts are indexed as triples (h, r, t) for head entity h, relation r, and tail entity t (e.g., (Paris, is_capital_of, France)).
    • TransE:
      • Core idea: Learn vectors e_h, e_r, e_t so that e_h + e_r ≈ e_t.
      • Scoring: score(h, r, t) = -||e_h + e_r - e_t|| (L1 or L2 norm). Higher scores, i.e., smaller distances, mean more plausible triples.
      • Training: Use positive triples from the graph and generate negative samples by corrupting the head or tail (replace Paris with Berlin, etc.). Optimize a margin-based ranking loss or logistic loss to separate true from false triples.
      • Strengths/limits: Simple, fast, good for one-to-one relations; struggles with one-to-many, many-to-one, or many-to-many without tweaks.
    • DistMult:
      • Core idea: score(h, r, t) = e_h^T diag(e_r) e_t, a bilinear form with a diagonal relation matrix. It captures symmetric relations well but not antisymmetric ones.
      • Strength: More robust to certain noise and handles diverse relation patterns better than simple translations.
    • ComplEx:
      • Core idea: Extend DistMult to complex-valued embeddings. score(h, r, t) uses complex inner products that can model asymmetry.
      • Benefit: Better fits a wide range of relation types (symmetric, antisymmetric, inverse) and tolerates noise in real graphs.
    • Usage: After training, use KGE to rank candidate tails given (h, r, ?) or candidate heads given (?, r, t). Also use embeddings as features in downstream models.
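A minimal TransE scoring-and-ranking sketch; the 2-d embeddings are hand-set for illustration (a trained model would learn them):

```python
import math

def score(h, r, t):
    """TransE score: negative L2 distance of (h + r) from t; higher is more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy embeddings, hand-set so that Paris + is_capital_of ≈ France.
E = {"Paris": (1.0, 0.0), "Berlin": (0.0, 1.0),
     "France": (2.0, 1.0), "Germany": (1.0, 2.0)}
R = {"is_capital_of": (1.0, 1.0)}

# Rank candidate tails for the query (Paris, is_capital_of, ?).
candidates = ["France", "Germany"]
ranked = sorted(candidates,
                key=lambda t: score(E["Paris"], R["is_capital_of"], E[t]),
                reverse=True)
print(ranked)  # → ['France', 'Germany']
```

Training would additionally generate corrupted triples (e.g., swapping France for Germany) and push their scores below the true triples with a margin loss.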

    F) Handling Noisy/Incomplete Knowledge Graphs

    • Robust training: Prefer models like DistMult or ComplEx when data is noisy. Add regularization (L2, dropout in embedding layers), early stopping, and careful negative sampling.
    • Knowledge graph completion: Predict missing edges using trained embeddings by ranking candidate triples. Add high-confidence predictions back to the graph, possibly after human verification.
    • Cleaning: Train classifiers or anomaly detectors over triples using graph features (e.g., degree, embedding distance) to flag likely errors. Route flagged triples for human review or automated correction based on rules.

    G) Data Augmentation Pipelines

    • Back translation:
      1. Choose a pivot language (e.g., English → French → English).
      2. Translate each sentence to the pivot language using a machine translation system.
      3. Translate back to the original language.
      4. Keep paraphrases that preserve meaning; optionally filter with semantic similarity or entailment checks.
    • Synonym replacement:
      1. For selected tokens (not stopwords), retrieve synonyms from a thesaurus or a contextual model.
      2. Replace a small percentage (e.g., 10–20%) to avoid semantic drift.
      3. Validate with simple heuristics (e.g., POS match) to keep grammar.
    • Random insertion:
      1. Sample adjectives or adverbs (e.g., “fluffy”) or use a language model to suggest plausible insertions.
      2. Insert at random positions at a low rate (e.g., 1–2 tokens per sentence).
    • Random deletion:
      1. Remove tokens at a small probability (e.g., 10%).
      2. Ensure key content words are not removed too often.
    • Contextual augmentation:
      1. Mask a token and let a language model propose context-fitting replacements.
      2. Accept replacements above a probability threshold.
    • GAN-based augmentation:
      1. Train a generator to produce text and a discriminator to tell real from fake.
      2. Use generated samples that pass quality checks to augment data.
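The lighter-weight augmentations above (synonym replacement, random insertion, random deletion) can be sketched in a few stdlib-only functions. The tiny thesaurus and filler lists are placeholder assumptions; a real pipeline would draw synonyms from WordNet or a contextual model and add POS checks.

```python
import random

SYNONYMS = {"sat": ["perched"], "big": ["large"]}   # toy thesaurus (assumption)
FILLERS = ["fluffy", "really"]                      # harmless insertion candidates
STOPWORDS = {"the", "a", "on", "of"}

def synonym_replace(tokens, rate=0.2, rng=random):
    # Replace a small fraction of non-stopword tokens that have synonyms.
    out = []
    for tok in tokens:
        if tok not in STOPWORDS and tok in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

def random_insert(tokens, n=1, rng=random):
    # Insert a few filler words at random positions (low rate).
    out = list(tokens)
    for _ in range(n):
        out.insert(rng.randrange(len(out) + 1), rng.choice(FILLERS))
    return out

def random_delete(tokens, p=0.1, rng=random):
    # Drop tokens with small probability, but never return an empty sentence.
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]
```

Passing a seeded `random.Random` instance as `rng` keeps augmentation reproducible across runs, which helps when ablating each augmentation's contribution.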

    Tools/Libraries (Typical Choices)

    • Tokenization: SentencePiece (BPE/Unigram), Hugging Face Tokenizers for BPE/WordPiece.
    • POS tagging: spaCy, NLTK.
    • Knowledge graphs: RDFLib, Neo4j, NetworkX for graph utilities.
    • KGE: PyKEEN, AmpliGraph, OpenKE for training TransE/DistMult/ComplEx.
    • MT for back translation: MarianNMT, OpenNMT, Hugging Face Transformers with pre-trained translation models.

    Step-by-Step Implementation Guide

    1. Decide Vocabulary Strategy
    • Start with subword tokenization for most tasks (target vocab size ~30k–50k). Train BPE or WordPiece on your corpus; reserve special tokens.
    • If data is extremely noisy or domain-specific, consider POS-aware <UNK> for rare tokens that remain or as a fallback.
    2. Train Tokenizer
    • Fit BPE/WordPiece on the training corpus (shuffle for coverage). Inspect learned merges for sanity (e.g., common morphemes captured). Save tokenizer artifacts.
    3. Prepare Data
    • Tokenize the corpus into subwords. Replace tokens below threshold with <UNK> only if using word-level vocabularies; for subword workflows, this is usually not needed.
    • Build datasets with input-target pairs (e.g., next-token prediction for LMs).
    4. Optional: Character-Level or Hybrid Modeling
    • If you need maximal flexibility (e.g., heavy misspellings), use char-level modeling or a hybrid architecture. Be ready for longer sequences and higher compute.
    5. Integrate Structured Knowledge
    • Augmentation: Select domain triples and convert to sentences via templates (e.g., “X is the capital of Y”). Mix with raw text, balancing proportions.
    • Constraints: For QA, construct candidate sets from knowledge graphs; restrict decoding to these candidates or post-verify model outputs.
    • Guided learning: For translation, assemble a clean parallel corpus and train a sequence-to-sequence model with teacher forcing.
    6. Train Knowledge Graph Embeddings (Optional)
    • Choose TransE for simplicity or DistMult/ComplEx for noisy graphs. Split triples into train/valid/test. Train with negative sampling; tune margin or regularization. Evaluate with link prediction metrics (MRR, Hits@k).
    7. Data Augmentation
    • Back translation: Run a batch translation pipeline; filter paraphrases with semantic similarity thresholds.
    • Synonym replacement: Replace a controlled fraction of words, ensuring POS match; skip named entities.
    • Random insertion/deletion: Apply low rates to avoid degrading grammar; audit samples to calibrate.
    • Contextual augmentation: Use a masked language model to propose replacements; accept only high-confidence substitutes.
    8. Training and Evaluation
    • Train the language model on augmented and original data. Track validation loss and downstream task metrics. Ablate augmentation components to measure impact. Use held-out test sets for final evaluation.
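The tokenizer-training step can be illustrated with a minimal, pure-Python BPE merge loop, a toy stand-in for SentencePiece or Hugging Face Tokenizers. Words are held as tuples of symbols with their corpus frequencies; each iteration merges the most frequent adjacent pair, exactly as described in the lecture.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the frequency-weighted corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with its concatenation.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    # word_freqs: e.g. {"abab": 3, "abc": 2}; returns learned merges
    # and the corpus rewritten with the merged subwords.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words
```

Real tokenizer training (e.g., with Hugging Face's `BpeTrainer`) follows the same merge logic but adds pre-tokenization, byte-level fallbacks, and special-token handling.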

    Tips and Warnings

    • Threshold tuning: If using <UNK>, pick a threshold that meaningfully reduces vocabulary but does not erase important domain terms. Validate on a dev set.
    • Subword vocab size: Too small → longer sequences; too large → less flexibility and more memory. Start around 30k and adjust based on sequence length distributions and hardware.
    • Char-level costs: Remember that self-attention cost grows quadratically with sequence length; budget GPU memory accordingly and consider gradient checkpointing.
    • Knowledge graph quality: Constraints are only as good as the graph. Keep a cleaning pipeline and track error rates; allow fallbacks when constraints are empty.
    • Augmentation semantics: Over-aggressive edits can change meanings. Use small rates, semantic similarity checks, and human spot-checks.
    • Back translation pitfalls: Poor translation models can distort meaning; choose strong models and filter aggressively.
    • Synonym context: Many words are not interchangeable in context. Use POS and contextual similarity to avoid awkward or incorrect replacements.
    • Balancing augmented data: Do not swamp original data; a 1:1 or 1:2 ratio is a reasonable starting point, then tune based on validation.
    • Evaluation fidelity: Always evaluate on unaugmented, real data. The goal is real-world robustness, not just improved training metrics.
    • Domain drift: If you augment with templates from a structured source, ensure style and distribution do not drift too far from your target data; keep a mix.

    04 Examples

    • 💡

      Zipf’s Law Visualization Example: Suppose you list English words by frequency and plot rank vs frequency on log-log axes. You’ll see a straight, downward-sloping line, meaning the 1st ranked word is vastly more frequent than the 10,000th. Input is raw word counts; processing is sorting and plotting; output is a near-linear pattern on a log-log chart. The key point is that most words are rare, creating data sparsity.

    • 💡

      UNK Replacement Example: In a corpus, any word seen fewer than 5 times is replaced with <UNK>. Input is raw sentences; processing counts word frequencies and swaps rare tokens; output sentences contain <UNK> where rare words were. This reduces vocabulary size and lets the model learn a stable probability for unknowns. It trades away the specific meaning of those rare words.

    • 💡

      POS-Aware UNK Example: A sentence contains a rare verb “gallivanted” and a rare noun “bricolage.” The pipeline replaces them with <UNK-VERB> and <UNK-NOUN> after POS tagging. The processed sentence keeps clues about grammar even when exact words are unknown. This helps a translator or generator preserve structure.

    • 💡

      BPE Merge Example: Start from individual characters and observe that the pair “ab” appears most often across the corpus. The training loop merges a+b into “ab,” then repeats on the updated text to form frequent subwords. At inference, a rare word like “dabbed” might tokenize as “d + ab + bed,” assuming later merges also produced “bed.” The example demonstrates how frequent pairs become reusable blocks.

    • 💡

      WordPiece Likelihood Example: Consider candidate merges “un+believe” and “believe+able.” WordPiece evaluates which merge improves overall data likelihood more. It chooses the merge that best fits usage patterns across the corpus. The result often segments “unbelievable” as “un + believable,” capturing morphemes more cleanly.

    • 💡

      Character-Level Sequence Length Example: A 1,000-word paragraph may become around 6,000–10,000 characters. The model predicts character by character, making attention windows and memory needs much larger. Training is slower and capturing distant dependencies is harder. This shows the compute trade-off for guaranteed coverage of any string.

    • 💡

      Hybrid Subword+Char Example: The model tokenizes into subwords like “un,” “believe,” and “able,” but a character decoder refines spelling for rare variants. Input is text; processing is subword encoding with a character-level submodule; output predictions keep efficiency while handling odd spellings. The key point is balancing flexibility and speed.

    • 💡

      QA Constraint Example with Capitals: For the question “What is the capital of France?”, the system builds a candidate set from a knowledge graph of countries and capitals. During decoding, it restricts answers to that set or verifies the output against the graph. Output is “Paris,” a valid capital. The guardrail prevents nonsensical answers like “Triangle.”

    • 💡

      Knowledge Graph Embedding Triple Scoring: Given (Paris, is_capital_of, France), TransE computes the distance ||e_Paris + e_rel - e_France||. Lower distance indicates a more plausible triple. For a corrupted triple like (Berlin, is_capital_of, France), the distance should be higher. This example shows how embeddings support reasoning.

    • 💡

      Knowledge Graph Completion Inference: If the graph has (Paris, is_capital_of, France) and (France, is_in, Europe), the system infers (Paris, is_in, Europe). Input is two known triples; processing uses graph patterns or embedding-based link prediction; output is a new, plausible triple. This fills gaps in knowledge.

    • 💡

      Noisy Graph Cleaning Example: A model flags (Paris, is_capital_of, Germany) as suspicious due to high embedding distance. A human reviewer confirms it’s wrong and removes it. The cleaning step improves graph accuracy for future queries. It illustrates automatic detection plus human validation.

    • 💡

      Back Translation Example: The sentence “The meeting starts at noon” is translated to French (“La réunion commence à midi”) and back to English as “The meeting begins at noon.” The input is one English sentence; the processing is two translations; the output is a paraphrase with the same meaning. It teaches the model to handle paraphrasing.

    • 💡

      Synonym Replacement Example: “The cat sat on the mat” becomes “The cat perched on the mat.” Input is the original sentence; processing selects a replaceable verb and finds a synonym; output is a semantically similar sentence. The lesson is improved robustness to wording changes.

    • 💡

      Random Insertion Example: “The cat sat on the mat” becomes “The cat fluffy sat on the mat.” Input is the sentence; processing randomly chooses an insertion point and adds an adjective; output contains noise. This trains the model to ignore irrelevant words.

    • 💡

      Random Deletion Example: “The cat sat on the mat” becomes “Cat sat on the mat.” Input is the sentence; processing removes a word like “the”; output is a clipped sentence still understandable. The model learns to handle missing function words.
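The <UNK> replacement procedure from the examples above reduces to a few lines with `collections.Counter`: count every token, keep those at or above the threshold, and map the rest to the unknown token.

```python
from collections import Counter

def apply_unk(sentences, min_count=5, unk="<UNK>"):
    # Count every token across the corpus, then replace tokens whose
    # frequency falls below the threshold with the unknown token.
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok if tok in vocab else unk for tok in sent]
            for sent in sentences]
```

A POS-aware variant would run a tagger first and emit tokens like `<UNK-VERB>` instead of a single `<UNK>`.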

    05 Conclusion

    This lecture brings together three vital themes for effective language modeling: handling rare and unknown words, integrating structured knowledge, and augmenting data. Because language follows Zipf’s law, most words are rare, which undermines probability estimates and inflates vocabularies. Practical strategies include replacing rare words with <UNK> (optionally POS-aware), adopting subword tokenization like BPE or WordPiece to build rare words from common parts, and, when needed, using character-level modeling or hybrids to guarantee coverage. These choices directly affect accuracy, memory, and speed.

    Structured data and knowledge—such as database tables and knowledge graphs—offer clean, explicit facts that complement noisy raw text. You can use them to generate extra training sentences, constrain outputs to valid answers, or guide learning with aligned data (e.g., parallel corpora). Knowledge graph embeddings like TransE, and robust variants like DistMult and ComplEx, map entities and relations into vectors that preserve graph structure, enabling link prediction, question answering, and better factual reasoning. Because real graphs can be noisy or incomplete, completion and cleaning steps are crucial for reliability.

    Data augmentation rounds out the toolkit. Back translation produces paraphrases that keep meaning but vary wording, while synonym replacement, random insertion, and random deletion teach resilience to surface changes and noise. More advanced methods like contextual augmentation and GAN-based generation give additional diversity when curated carefully. The unifying idea is to create varied, meaning-preserving examples so models generalize better from limited data.

    To practice, start by training a BPE or WordPiece tokenizer and comparing results with and without <UNK> rules. Build a small knowledge graph or use an open one to constrain a simple QA task and experiment with TransE embeddings. Implement a back-translation pipeline and light synonym/noise augmentations, and measure their effects on validation performance. Next steps include exploring robust KGE models (ComplEx), hybrid subword–character architectures, and automated augmentation filtering using semantic similarity.

    The core message: choose data strategies that match your task’s needs and constraints. Subwords are a strong default; use structured data to boost factuality and guardrails; apply careful augmentation to improve robustness. With these tools, you can make language models more accurate, efficient, and reliable in real-world settings.

  • ✓Guide learning with clean supervision when possible. Use parallel corpora to train translation or alignment tasks. Clean, aligned pairs outperform noisy web text. Maintain quality with filters and deduplication.
  • ✓Adopt knowledge graph embeddings for reasoning tasks. Start with TransE for simplicity; move to DistMult or ComplEx if relations are complex or data is noisy. Evaluate with MRR and Hits@k to ensure your embeddings capture structure. Use embeddings to rank answers or predict missing facts.
  • ✓Plan for noisy or incomplete knowledge graphs. Add a completion step to infer likely missing edges and a cleaning step to remove suspicious ones. Combine automated flags with human review for high-stakes domains. Keep logs to track graph quality over time.
  • ✓Implement back translation for paraphrase diversity. Pick strong MT models, translate out and back, then filter by semantic similarity. Limit the number of paraphrases per sentence to avoid data imbalance. Measure gains on validation tasks before scaling up.
  • ✓Use synonym replacement sparingly and context-aware. Match parts of speech and avoid named entities. Keep replacement rates low (e.g., 10–20%) to preserve meaning. Spot-check samples to ensure quality.
  • ✓Apply random insertion and deletion at low rates to teach robustness to noise. Insertion can add harmless adjectives; deletion can remove function words. Avoid heavy edits that break grammar severely. Monitor if these augmentations actually improve validation.
  • ✓Try contextual augmentation for smarter edits. Mask tokens and let a language model propose replacements with high confidence. This preserves coherence better than random changes. Ensure the base LM is good enough to avoid drift.
  • ✓Balance augmented and original data. Start with a 1:1 or 1:2 ratio (augmented:original) and adjust based on validation gains. Too much augmentation can distort distribution and hurt performance. Keep the original data as the backbone.
  • ✓Evaluate on clean, unaugmented test sets. The goal is real-world robustness, not just fitting augmented noise. Track multiple metrics (accuracy, calibration, factual consistency). Use ablations to see which augmentations help.
  • ✓Manage vocabulary size to control compute costs. Larger vocabularies enlarge the softmax and memory use; smaller vocabularies lengthen sequences. Tune vocab size with profiling and watch your training throughput. Pick the sweet spot for your hardware and task.
  • ✓Document your data decisions. Record thresholds, tokenizer settings, augmentation rates, and graph integrity checks. This makes experiments reproducible and debuggable. It also helps teammates understand trade-offs and results.
  • <UNK> (unknown token)

    A special token used to replace rare or unseen words. It stands for “I don’t know this exact word.” It reduces vocabulary size and gives a stable probability to unknowns. But it removes the word’s precise meaning. Variations include POS-aware unknowns.

    Part of Speech (POS)

    A label for a word’s role in a sentence, like noun, verb, or adjective. Knowing POS helps keep grammar and meaning. POS-aware unknown tokens preserve structure even when exact words are unknown. This can improve translation and generation. It’s a simple way to keep useful hints.

    Subword

    A piece of a word learned from data, often a common prefix, root, or suffix. Subwords allow rare or new words to be built from known parts. This handles unknown words without huge vocabularies. It balances flexibility and efficiency. Most modern LMs rely on subwords.

    Byte Pair Encoding (BPE)

    A subword algorithm that merges the most frequent character pairs repeatedly. It builds larger units from common smaller parts. Training stops at a target vocabulary size. It’s simple and fast. It works well across many languages.
