Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1 | How I Study AI


Intermediate
Stanford Online
LLM · YouTube

Key Summary

  • This class explains why data is the most important part of building language models. You learn where text data comes from (books, the web, and human feedback) and what each source is good and bad at. The instructor stresses that most of your time in real projects goes into finding, collecting, cleaning, and filtering data, not model code.
  • Books are high quality and well written but can be old and narrow in topic. The Books3 dataset is a large example (about 200 GB). Web data is huge, diverse, and fresh, but very messy and noisy. Human feedback data is precise for tasks like instruction following but is expensive to create.
  • Common Crawl is a nonprofit project that has crawled a large chunk of the web every month since 2011, producing petabytes of data. It’s free and great for general pretraining, but very messy with HTML, scripts, boilerplate, and spam. You must extract text and filter aggressively. It cannot just be downloaded onto a laptop due to its size.
  • C4 (Colossal Clean Crawled Corpus) is a cleaned version of Common Crawl created with rules. It keeps lines that end with punctuation and removes pages with offensive words or JavaScript-heavy content. It is still large (about 300 GB) and cleaner, but not perfect. MC4 is a multilingual version with around 100 languages.
  • Other web datasets include RealNews (news articles), WebText (used for GPT-2), Pushshift.io (Reddit comments), and CCNet (another cleaned Common Crawl). Social data like Twitter has become harder to obtain because its API now costs money. Each dataset carries different biases and noise. You must choose based on your goal.
  • You can get data in three main ways: download ready-made datasets, pull from APIs, or crawl websites yourself. Downloading is easiest if a dataset already exists. APIs give structured access but often require payment now. Crawling is most flexible but takes the most engineering effort.

Why This Lecture Matters

This lecture matters because data is the single biggest driver of language model success in real projects. Engineers, data scientists, and researchers often spend most of their time obtaining, cleaning, and filtering text, not tweaking model code. Knowing where to find data (books, web, human feedback), how to collect it (downloads, APIs, crawling), and how to filter it (quality, safety, relevance) directly translates into better model behavior, fewer surprises, and safer outputs.

The knowledge here solves practical problems: getting timely content despite API changes, extracting meaningful text from messy HTML, and removing junk and harmful content before it poisons training. It also helps build specialized models by keeping only domain-relevant text, like medicine or law, so models become experts instead of generalists. Understanding perplexity-based filtering, the limits of heuristics, and the subtleties of length, vocabulary size, and domain match prevents throwing out good data or keeping bad data.

In real work, this guidance helps you design robust data pipelines that respect websites (robots.txt, rate limits), anticipate shifting access rules, and track filtering decisions for auditing and improvement. It supports careers by moving you beyond toy datasets to production-scale thinking, an essential skill for ML engineers and researchers. In today’s industry, the teams that master data acquisition and filtering tend to build stronger, safer, and more competitive models, because the model’s diet determines its health. Mastering these skills puts you in the driver’s seat of modern LLM development.

Lecture Summary


01 Overview

This lecture focuses on the first and most important part of building language models: data. The goal is to show what kinds of data exist, where to find them, how to collect them, and how to filter them so that training works well. While transformers, neural networks, and autoregressive generation matter, real-world work in language modeling is mostly about data—finding it, cleaning it, processing it, and deciding what to keep or drop. The instructor emphasizes that about 80% of project time goes into data tasks, and that models are highly sensitive to what they are trained on.

The lecture divides text data into three main sources: books, the web, and human feedback. Books (for example, the Books3 dataset, roughly 200 GB) are well-written and carefully proofread, but often outdated and narrow in topic coverage. The web is vast, timely, and diverse—covering news, blogs, forums, and social media—but very noisy and inconsistent. Human feedback data (like MT Bench or Alpaca Farm) is directly helpful for tasks such as alignment and instruction-following, but it is expensive to create because you must pay people to write or rate examples.

To make web-scale data usable, the lecture spotlights Common Crawl, a nonprofit that has crawled a large portion of the web each month since 2011. Common Crawl provides petabytes of free data, which is excellent for general-purpose pretraining but too messy to use raw. It contains HTML, CSS, JavaScript, PDFs, images, and lots of boilerplate (navigation bars, footers, legal disclaimers), plus spam and machine-translated text. Therefore, it needs heavy text extraction and filtering. A well-known cleaned subset is C4 (Colossal Clean Crawled Corpus), which applies heuristics such as keeping lines that end with punctuation and filtering pages with offensive words or JavaScript, yielding a still-large (≈300 GB) dataset. There is also MC4, a multilingual version spanning roughly 100 languages. Even so, some junk remains, like “Skip to content. Breaking news …”, so additional filtering is often needed.
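
The boilerplate-stripping step can be sketched with Python’s standard-library HTML parser. This is only a minimal illustration, not the extractor actually used for C4; production pipelines use far more robust main-content extraction. The tag list and sample HTML below are invented for the example:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collects text from <p> tags while skipping script/style/nav/footer blocks."""
    SKIP = {"script", "style", "nav", "footer", "header"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # inside how many boilerplate tags we currently are
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Keep only text that sits inside a <p> and outside boilerplate tags.
        if self.in_p and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

html = """
<html><head><style>p{color:red}</style></head>
<body><nav>Skip to content</nav>
<p>The article body we want.</p>
<script>trackUser();</script>
<footer>Legal disclaimer</footer></body></html>
"""
parser = MainTextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # -> The article body we want.
```

Note how the navigation text, script, and footer vanish while the article sentence survives; this is exactly the kind of cleaning C4-style pipelines apply at scale.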

Beyond Common Crawl and C4, many other datasets are relevant: RealNews (news articles), WebText (used for GPT-2), Pushshift.io (Reddit comments), CCNet (another cleaned Common Crawl), and social media like Twitter (now harder to access due to paid APIs). Choosing among these depends on your task and constraints, including quality, coverage, recency, and availability.

The lecture then explains three ways to acquire data: download pre-packaged datasets, use APIs, or crawl the web yourself. Downloading is simplest when a suitable dataset already exists. APIs provide structured access to sites like Wikipedia or the New York Times, though more APIs now require payment. Crawling is the most general, but requires careful engineering and etiquette. Ethical and polite crawling includes rate limiting requests, identifying the crawler as a bot, and obeying robots.txt files that specify what is allowed.
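
The robots.txt etiquette mentioned above can be checked with Python’s standard library. The robots.txt content and bot name below are made up for illustration; in practice you would point `set_url` at the live file and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for the example.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 1
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Check which paths our (hypothetical) bot may fetch.
print(rp.can_fetch("my-research-bot", "https://example.com/articles/page1"))  # True
print(rp.can_fetch("my-research-bot", "https://example.com/private/data"))    # False
```

Calling `can_fetch` before every request, and honoring any crawl delay, is the minimum politeness a crawler owes the sites it visits.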

Once data is collected, you must filter it. The lecture organizes filtering around three criteria: quality, safety, and relevance. Quality filtering removes low-quality or nonsensical text (like machine-translated junk or random strings). Safety filtering removes hateful content or personally identifiable information (PII). Relevance filtering removes content unrelated to your target domain, such as medicine or law. Heuristics (simple rules) can help, like requiring sentence-ending punctuation or maintaining blocklists of offensive words, but they are imperfect. More advanced filtering uses machine learning, including language models.
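
A minimal sketch of such rule-based quality filtering in Python; the blocklist and the exact rules here are placeholders, not the lecture’s actual lists:

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder offensive-word list

def keep_line(line: str) -> bool:
    """C4-style heuristics: keep only lines that look like real sentences."""
    line = line.strip()
    if not line.endswith((".", "!", "?", '"')):
        return False  # drop menu items and sentence fragments
    if re.search(r"\bjavascript\b", line, re.IGNORECASE):
        return False  # drop script warnings and JS-heavy boilerplate
    words = set(re.findall(r"[a-z']+", line.lower()))
    return not (words & BLOCKLIST)  # drop lines containing blocklisted words

doc = [
    "The study followed 500 patients over two years.",
    "Skip to content",
    "Please enable JavaScript to view this site.",
]
print([line for line in doc if keep_line(line)])
```

Only the first line survives: the second lacks terminal punctuation and the third trips the JavaScript rule, matching the heuristics described above.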

Language models can filter by scoring text with perplexity, a number that reflects how surprising the text is under a model trained on high-quality data. Lower perplexity means the text fits well; higher means it looks like gibberish. A simple demonstration compares “The quick brown fox jumps over the lazy dog” (low perplexity) with “Asdf jkl qwerty” (high perplexity). Important subtleties include normalizing for length (e.g., per-word), using models with suitable vocabulary sizes, and matching the domain of the scoring model to the content; otherwise, good domain-specific text may be mistakenly rejected. The instructor also warns that using a language model to filter data can introduce bias: if the filter model is trained on biased data, it can unfairly down-rank texts from underrepresented groups.
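
The idea can be illustrated with a toy unigram model. Real filters train far stronger language models on large high-quality corpora; treat this only as a sketch of length-normalized perplexity, with a tiny invented training corpus:

```python
import math
from collections import Counter

# Tiny stand-in for a "high-quality" training corpus.
corpus = ("the quick brown fox jumps over the lazy dog . "
          "the dog sleeps near the fox .").split()
counts = Counter(corpus)
vocab = len(counts) + 1  # +1 slot for unseen words

def perplexity(text: str) -> float:
    """Per-word perplexity under a unigram model with add-one smoothing."""
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        p = (counts[w] + 1) / (len(corpus) + vocab)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))  # normalize by length

good = perplexity("the quick brown fox jumps over the lazy dog")
junk = perplexity("asdf jkl qwerty")
print(f"good: {good:.1f}  junk: {junk:.1f}")
assert good < junk  # keep the sentence, drop the gibberish
```

Dividing the log-probability by the word count is the length normalization the lecture warns about: without it, long but perfectly good documents would accumulate “surprise” and be unfairly discarded.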

Finally, the lecture recaps the main message: data selection and filtering heavily shape language model performance. You learned about data sources, acquisition methods, and filtering strategies, and why each step matters. In the next session, the focus will move to preparing data—tokenization, normalization, and vocabulary building—so the cleaned text can be fed into models effectively.

Key Takeaways

  • ✓Start with the right sources for your goal. Combine books for clean structure, web for diversity and recency, and human feedback for instruction-following. Each source adds a different strength to your model. Balance them to match your use case.
  • ✓Prefer existing datasets when they fit. Downloading C4, RealNews, or other curated sets is faster than crawling from scratch. Save crawling for niches that lack good datasets. This saves engineering time and reduces risk.
  • ✓Be a polite, transparent crawler. Always check robots.txt, set a clear User-Agent, and limit your request rate. Politeness prevents blocks and protects sites. It also builds trust with web admins.
  • ✓Clean HTML aggressively before training. Strip scripts, styles, headers, footers, and navigation. Keep only the main article text. Training on layout junk harms model quality.
  • ✓Use simple heuristics as a first filter. Keep lines ending with punctuation and drop pages with offensive words or heavy JavaScript. Heuristics are cheap and fast at scale. They prepare data for smarter filters later.
  • ✓Score text with a language model to catch subtle junk. Perplexity reveals nonsensical or machine-generated text that looks okay to simple rules. Normalize for length to be fair to long documents. Tune thresholds on a small validation set.
  • ✓Match your filter model to your domain. A general model may misjudge specialized language. Use a domain-aware model for perplexity scoring when possible. This keeps valuable domain text from being discarded.

Glossary

Language model

A language model is a program that guesses the next word (token) based on previous words. It learns these guesses by reading lots of text and finding patterns. The more and better text it sees, the smarter its guesses become. It does not understand like a human, but it gets very good at predicting likely text. Its quality depends strongly on the data it trains on.

Token

A token is a small piece of text the model reads, like a word or subword. Models don’t read whole paragraphs at once; they read tokens one after another. Tokenization is how text is split into tokens. Shorter tokens make the model’s job easier in some ways but change vocabulary size. Good token choices make training smoother.

Autoregressive generation

Autoregressive generation means the model writes text one token at a time, using what it already wrote to decide the next token. It’s like typing a story by looking at the last word and picking the next. The process repeats until the output is done. This is how many modern language models generate text.

Transformer

A transformer is a neural network architecture that handles sequences well. It uses attention to focus on important parts of the input when predicting the next token. This allows it to capture long-range patterns in text. Transformers are the standard for many language models today.

Tags: common crawl, c4, mc4, web crawling, robots.txt, beautiful soup, scrapy, requests, data filtering, perplexity, safety filtering, pii, topic modeling, keywords filtering, books3, webtext, realnews, pushshift, ccnet, api access
  • When crawling, be respectful and polite to websites. Limit your request rate so you don’t overload servers, identify yourself as a bot, and obey robots.txt rules. These rules tell you where crawling is allowed and not allowed. Polite crawling keeps the web stable and keeps you from being blocked.
  • Useful Python tools include urllib and requests for HTTP requests, Beautiful Soup for parsing HTML, and Scrapy for building full crawlers. These tools help fetch pages, extract main text, and follow links. You combine them to move from a seed page to many related pages. Each tool solves a piece of the pipeline.
  • Filtering is critical because a lot of collected text is low quality, unsafe, or irrelevant. Quality filtering removes gibberish, spam, and machine-translated junk. Safety filtering removes hateful content and sensitive personal information (PII). Relevance filtering keeps only text related to your domain, like medicine or law.
  • Heuristic filters are simple rule-based checks, like keeping lines that end in punctuation or blocking pages with certain offensive words. They are fast and easy to run at large scale. But people can bypass them with misspellings or new slang, and they might erase useful content by mistake. They are a first pass, not the final answer.
  • Language models can act as filters by scoring text with perplexity. Perplexity measures how surprising text is to a model trained on good English; low perplexity means likely good text, high means gibberish. You filter out high-perplexity text, but you must adjust for text length, model vocabulary size, and the domain the model was trained on. Wrong choices here can toss out useful domain content.
  • Relevance can be checked with keywords or a topic model trained on your domain. Keywords are simple but brittle and can miss synonyms. Topic models can capture broader themes but take more setup. Either way, you keep what fits your task and drop the rest.
  • The instructor used a simple example to show perplexity filtering: “The quick brown fox jumps over the lazy dog” vs “Asdf jkl qwerty.” The first sentence is grammatical and gets low perplexity; the second is nonsense and gets high perplexity. You would keep the first and drop the second. This shows why numeric scoring helps automate quality control.
  • Common Crawl raw HTML contains tags, CSS, JavaScript, navigation bars, and legal footers. You must strip boilerplate and extract the main content. Even cleaned datasets like C4 still include unwanted lines like “Skip to content” and “Breaking news,” so additional filtering is needed. There is no single magic dataset that is perfect.
  • Using a language model for filtering can introduce bias. If the filter model was trained mostly on one group’s writing, it might down-rank other groups’ language. This can unfairly remove valuable content. You must pick filter models carefully and be aware of bias risks.
  • Data choices affect every stage of the project: training, debugging, evaluation, and deployment. Better data leads to better predictions because language models are like giant memory tables of token patterns. The more diverse, clean, and relevant data the model sees, the better it performs. The next class will cover preparing data: tokenization, normalization, and building vocabularies.
  • APIs and social platforms change access rules, so plan for data availability to shift over time. If an API becomes paid or restricted, you may need to switch to curated datasets or build crawlers. Always check and respect terms of service. Sustainable data pipelines depend on both technical and legal care.
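
The keyword-based relevance check described above might look like the sketch below; the term list and hit threshold are invented for illustration, and a topic model would replace this when keywords are too brittle:

```python
import re

MEDICAL_TERMS = {"diagnosis", "treatment", "symptoms", "patient", "clinical"}

def is_relevant(doc: str, keywords: set, min_hits: int = 2) -> bool:
    """Keep documents that mention enough domain keywords (cheap but brittle)."""
    words = set(re.findall(r"[a-z]+", doc.lower()))
    return len(words & keywords) >= min_hits

print(is_relevant("The patient received treatment after a diagnosis.", MEDICAL_TERMS))  # True
print(is_relevant("Top ten travel destinations this summer.", MEDICAL_TERMS))           # False
```

Requiring more than one keyword hit reduces false positives from passing mentions, at the cost of missing documents that use synonyms, which is exactly the brittleness the lecture flags.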
02 Key Concepts

    • 01

      Why Data Volume Matters: Definition: Data volume is how much text your model sees during training. Analogy: It’s like knowing a friend named Alex—after 10 years you can predict their next sentence better than after one hour. Technical: Language models memorize statistical patterns of token sequences, so more samples make probability estimates more reliable. Importance: With too little data, estimates are noisy and the model predicts poorly. Example: A model trained on millions of webpages recognizes common phrasing in news better than a model trained on a few books.

    • 02

      Language Models as Giant Lookup Tables: Definition: A language model stores patterns about which token sequences commonly follow others. Analogy: It’s like a giant table of next-word guesses built from past reading. Technical: During training, it learns probabilities of token sequences; at inference, it chooses likely next tokens. Importance: If patterns were not learned from rich data, next-token predictions would be weak. Example: Seeing many email sign-offs helps the model predict “Best regards,” after “Thank you for your time.”

    • 03

      Books as a Data Source: Definition: Books are long-form, carefully edited text. Analogy: They’re like polished essays bound into a library—reliable but sometimes dated. Technical: Datasets like Books3 (~200 GB) contain high-quality sentences and rich vocabulary but limited topical freshness and breadth. Importance: They provide clean grammar and structure to stabilize training. Example: Training with books improves coherent paragraph generation but might lack today’s slang or recent events.

    • 04

      Web as a Data Source: Definition: The web is a massive collection of online pages (news, blogs, forums, social media). Analogy: It’s like a giant city bazaar—everything is there, but it’s loud and messy. Technical: Web data is diverse and timely, but includes HTML, scripts, boilerplate, spam, and low-quality text that must be extracted and filtered. Importance: Models learn current language use, variety of styles, and up-to-date facts. Example: Including web forums helps a model answer casual questions in everyday tone.

    • 05

      Human Feedback Data: Definition: Human feedback data consists of examples written or rated by people for specific tasks. Analogy: It’s like getting a tutor’s custom corrections and practice problems. Technical: Datasets such as MT Bench and Alpaca Farm contain instruction-following or alignment examples tailored to desired behaviors. Importance: It directly teaches models how to follow instructions safely and helpfully. Example: Fine-tuning on human-ranked responses improves response quality for question-answering.

    • 06

      Common Crawl Basics: Definition: Common Crawl is a nonprofit monthly crawl of a large fraction of the web since 2011. Analogy: It’s like a giant public snapshot album of the internet taken every month. Technical: It provides petabytes of raw HTML data, free to use, but far too big and messy to consume directly on a laptop. Importance: It’s a foundational source for general-purpose pretraining. Example: Starting from Common Crawl gives a model broad knowledge of topics from cooking blogs to tech news.

    • 07

      The Messiness of Raw HTML: Definition: Raw HTML is the source code of web pages that mixes content with layout and scripts. Analogy: It’s like a magazine page covered with stickers and ads—you must peel them off to read the article. Technical: Pages include tags, CSS, JavaScript, navigation bars, footers, and legal text; you must extract main text and drop boilerplate. Importance: Without extraction, the model would learn from junk and format noise. Example: Removing “Skip to content” and menu items before training yields cleaner sentences.

    • 08

      C4 (Colossal Clean Crawled Corpus): Definition: C4 is a cleaned dataset built from Common Crawl using rules. Analogy: It’s like washing a huge basket of fruit and tossing out visibly bad pieces. Technical: It keeps lines ending with punctuation, removes pages with offensive words and JavaScript-heavy content, producing ~300 GB of cleaner text. Importance: You get a large, more trainable corpus with less noise. Example: Training on C4 reduces the model’s tendency to reproduce navigation menus.

    • 09

      MC4 (Multilingual C4): Definition: MC4 is a multilingual extension of C4. Analogy: It’s like having the same cleaned fruit basket but for many languages. Technical: It covers roughly 100 languages, enabling multilingual pretraining from similar cleaning rules. Importance: It supports building models that understand and generate many languages. Example: A model trained on MC4 can handle questions in English, Spanish, and Hindi.

    • 10

      Other Web Datasets: Definition: Collections like RealNews, WebText, Pushshift.io (Reddit), and CCNet are specialized web corpora. Analogy: They’re like themed sections of a library—news, forums, and curated shelves. Technical: Each dataset has its own source, cleaning approach, and style; Twitter data is now harder to obtain due to paid APIs. Importance: Picking the right sources shapes your model’s voice and knowledge. Example: Adding Reddit comments improves conversational tone, while RealNews strengthens news summarization.

    • 11

      Three Ways to Get Data: Definition: You can download datasets, use APIs, or crawl websites yourself. Analogy: It’s like buying ready-made food, ordering from a menu, or cooking from scratch. Technical: Downloads are simplest, APIs offer structured access (often paid), and crawling is most flexible but hardest. Importance: Your choice affects engineering time, cost, and coverage. Example: Use Wikipedia’s API for reliable pages, or Scrapy to harvest a niche site unavailable via API.

    • 12

      Polite Web Crawling: Definition: Polite crawling means fetching pages without harming websites. Analogy: It’s like visiting someone’s home and knocking gently, not kicking the door. Technical: Limit request rates, identify as a bot with a user-agent, and obey robots.txt rules about allowed paths. Importance: It prevents server overload and blocking and respects site policies. Example: Checking wikipedia.org/robots.txt before crawling ensures you avoid disallowed endpoints.

    • 13

      Core Libraries for Crawling and Parsing: Definition: Tools like urllib/requests, Beautiful Soup, and Scrapy help fetch and process web pages. Analogy: They’re your backpack tools—map, knife, and rope for a web hike. Technical: requests makes HTTP calls, Beautiful Soup parses HTML to extract text, and Scrapy orchestrates large-scale crawls with queues and pipelines. Importance: Using the right tool makes data collection faster and cleaner. Example: Combine requests + Beautiful Soup to pull article paragraphs from a news site.

    • 14

      Filtering by Quality: Definition: Quality filtering removes gibberish, spam, and machine-translated junk. Analogy: It’s like tossing bruised or rotten fruit from a basket. Technical: Use heuristics (punctuation-ending lines, blocklists) or language-model scoring (perplexity) to decide what to keep. Importance: Better data leads to clearer, more grammatical model outputs. Example: Dropping pages with random keyboard mash text improves training stability.

    • 15

      Heuristics for Filtering: Definition: Heuristics are simple rules for quick filtering. Analogy: They’re like quick visual checks when sorting fruit—no microscope needed. Technical: Examples include requiring sentence-ending punctuation, stripping pages with JavaScript blocks, and excluding offensive word lists. Importance: They scale cheaply to billions of lines. Example: A first-pass filter removes lines without ., ?, or ! to reduce fragments and boilerplate.

    • 16

      Language Models as Filters (Perplexity): Definition: Perplexity measures how surprising text is to a trained model. Analogy: It’s like a familiarity meter—the model smiles at normal sentences and frowns at nonsense. Technical: Train a model on high-quality text and score new text; keep low-perplexity lines, drop high ones. Importance: It catches subtle junk that heuristics miss. Example: “The quick brown fox …” gets low perplexity; “Asdf jkl qwerty” gets high and is removed.

    • 17

      Perplexity Subtleties: Length: Definition: Long texts naturally accumulate more surprise. Analogy: Reading a long story offers more chances for odd parts than a short sentence. Technical: Normalize scores by length (e.g., per-word) so fair comparisons are possible. Importance: Without normalization, longer but good documents may be thrown away. Example: A 1,000-word article shouldn’t be penalized just for being long.

    • 18

      Perplexity Subtleties: Vocabulary Size: Definition: A model’s vocabulary affects how it scores words. Analogy: A small dictionary makes many words look unusual, even if they’re common elsewhere. Technical: Larger vocabularies reduce artificial surprise for legitimate tokens. Importance: Using a tiny vocabulary can inflate perplexity and drop good text. Example: Technical terms look rare to a small-vocab model but normal to a large-vocab one.

    • 19

      Perplexity Subtleties: Domain Match: Definition: A filter model should know the target domain’s language. Analogy: A chef trained on Italian dishes may misjudge great sushi. Technical: A general model may rate medical jargon as odd; a medical-tuned model won’t. Importance: Wrong-domain filters delete useful domain text. Example: A cardiology article might be wrongly filtered by a general news-trained model.

    • 20

      Safety Filtering (Hate and PII): Definition: Safety filtering removes harmful or sensitive content. Analogy: It’s like locking away sharp tools from kids. Technical: Use blocklists and ML classifiers trained to detect hate speech and personally identifiable information (PII). Importance: It protects users and meets legal/ethical requirements. Example: A page exposing phone numbers and addresses is detected and excluded.

    • 21

      Relevance Filtering with Keywords: Definition: Relevance filtering keeps text that matches your topic. Analogy: It’s like keeping only the puzzle pieces that fit your picture. Technical: Keep documents containing domain-specific terms; drop others. Importance: It focuses training on what the model must do well. Example: For a medical model, keep pages mentioning “diagnosis,” “treatment,” or “symptoms.”

    • 22

      Relevance Filtering with Topic Models: Definition: Topic models group documents by themes. Analogy: They’re librarians clustering books into themed shelves. Technical: Train on domain text to learn topic distributions, then keep documents matching your domain topics. Importance: They catch relevance even when keywords differ. Example: An article about “clinical trials” and “adverse events” is kept for a pharma model.

    • 23

      Bias Risks in LM-Based Filtering: Definition: Filter models can carry biases from their training data. Analogy: A crooked ruler makes all measurements wrong. Technical: If a model saw mostly one group’s writing, it may down-rank others’ language styles. Importance: It can unfairly erase valuable perspectives. Example: A male-dominated corpus could lower scores for texts written by women.

    • 24

      Twitter and API Access Changes: Definition: Data access rules can change over time, especially for social media. Analogy: A road you used yesterday might be toll-only today. Technical: Paid APIs limit collection; you may need to switch to curated datasets or crawling. Importance: Plans must adapt to access and budget. Example: If Twitter’s API is too expensive, rely on Reddit or news datasets instead.

    • 25

      Data Shapes the Whole Pipeline: Definition: Data decisions affect training, debugging, evaluation, and deployment. Analogy: Bad ingredients ruin the whole meal. Technical: Clean, safe, relevant data yields better loss curves, fewer bugs, and higher-quality outputs. Importance: It’s the foundation under all modeling work. Example: A carefully filtered corpus reduces hallucinations and toxic outputs.

    03 Technical Details

    Overall Architecture/Structure

    1. Source Selection
    • Definition: Source selection means deciding which collections of text you will use. Analogy: It’s like choosing the markets you’ll shop at before cooking a feast. Technical: You weigh books (high-quality but dated), web (diverse but noisy), and human feedback (task-aligned but costly). Importance: Wrong sources lead to poorly aligned models. Example: For a conversational assistant, you might prefer web forums plus instruction data.
    2. Acquisition Methods
    • Definition: Acquisition is how you actually get the text. Analogy: Buying ready-made meals (downloads), ordering from restaurants (APIs), or cooking from scratch (crawling). Technical: Pre-packaged datasets (C4, RealNews) are easiest; APIs (Wikipedia, NYT) provide structured access; crawling builds your own dataset when others don’t fit. Importance: Acquisition choices determine your engineering effort and cost. Example: Use APIs when you need guaranteed structure and metadata.
    3. Raw Data Handling
    • Definition: Raw handling means turning whatever you fetched (HTML, JSON) into plain text. Analogy: Removing wrapping paper and sorting what’s inside. Technical: For the web, strip HTML tags, CSS, and scripts; remove boilerplate like menus and footers. Importance: Models should learn language, not layout. Example: From a blog page, keep the article body but drop the sidebar.
    4. Filtering (Quality, Safety, Relevance)
    • Definition: Filtering removes what you don’t want or need. Analogy: Sifting flour before baking. Technical: Apply heuristics, LM-based scoring (perplexity), safety classifiers, and domain relevance checks (keywords or topic models). Importance: Filtering quality directly impacts model behavior. Example: Excluding hateful or PII content prevents unsafe outputs.
    5. Preparation (Preview)
    • Definition: Preparation converts clean text into model-ready format. Analogy: Cutting ingredients to the right size before cooking. Technical: Tokenization, normalization, and vocabulary building organize text into tokens the model can understand. Importance: Poor preparation increases training errors and confusion. Example: Normalizing Unicode and consistent casing improves tokenizer stability.

    Data Flow

    • Step 1: Choose sources (books, web, human feedback) based on task needs and constraints.
    • Step 2: Acquire data (download datasets, call APIs, or crawl sites) while respecting policies.
    • Step 3: Extract main text from raw formats (HTML to text, JSON fields to strings).
    • Step 4: Filter aggressively for quality, safety, and relevance (rules + models).
    • Step 5: Store the clean text in a structured repository for later preparation and training.

    Code/Implementation Details (Conceptual, Tools Named in Lecture)

    Languages/Frameworks: Python is a common choice.

    • requests (or urllib): Fetch pages via HTTP GET requests.
    • Beautiful Soup: Parse HTML and extract content (e.g., article paragraphs).
    • Scrapy: Build scalable crawlers with spiders, queues, and pipelines.

    What Each Tool Does:

    • requests: Sends HTTP requests, receives responses, handles timeouts and headers.
      • Important parameters: headers (User-Agent string to identify bot), timeout (avoid hanging), backoff/retry logic (politeness and resilience).
    • urllib: Standard library alternative for HTTP access; lower-level than requests.
    • Beautiful Soup: Parses HTML into a tree; lets you select and remove tags; you can find <p> tags, strip scripts and styles.
    • Scrapy: Defines spiders that start from seed URLs, follow links matching rules, and process items through pipelines (for cleaning and storage). Handles concurrency and politeness settings.

    Role of Each Component:

    • Fetcher: Pulls pages/APIs (requests/urllib) while rate-limiting and identifying the bot.
    • Parser: Turns HTML into clean text (Beautiful Soup).
    • Crawler Orchestrator: Manages large-scale crawling across sites (Scrapy), obeys robots.txt.
    • Filter: Applies heuristics and ML scoring to keep good text.
    • Storage: Saves cleaned documents with metadata for later use.

    Important Parameters and Meanings:

    • User-Agent: Identifies your crawler; being explicit increases trust and helps site admins.
    • robots.txt rules: Define allowed/disallowed paths; crawlers must check before fetching.
    • Delay/Rate Limit: Controls the pace of requests (e.g., 1 request/sec) to avoid overload.
    • Timeouts/Retries: Prevents stalls and handles transient errors gracefully.
    • Heuristic Thresholds: Punctuation rules, offensive word lists, and JavaScript detection flags.
    • Perplexity Threshold: A cutoff (e.g., >100) to drop high-surprise text; should be tuned.

    Execution Order and Flow:

    1. Initialize sources and seed URLs.
    2. Load robots.txt and configure allowed paths and delays.
    3. Fetch a page, check HTTP status, and respect backoff upon errors.
    4. Parse HTML, remove scripts/styles/nav, keep main content nodes.
    5. Run heuristics: drop non-sentence lines, filter offensive words.
    6. Score remaining text with a language model for perplexity; drop high-perplexity content.
    7. For domain tasks, run keyword or topic filters; keep relevant documents.
    8. Store cleaned documents with source metadata and timestamps.

    Tools/Libraries Used

    • requests: Why chosen: simplicity and readability; Basic usage: requests.get(url, headers=...).
    • urllib: Why chosen: built-in, no extra install; Basic usage: urllib.request.urlopen(url).
    • Beautiful Soup: Why chosen: easy HTML parsing; Basic usage: BeautifulSoup(html, 'html.parser').find_all('p').
    • Scrapy: Why chosen: scalable crawling framework; Basic usage: define a Spider class with start_urls and parse() to follow links.
    • APIs (Wikipedia, NYT): Why chosen: structured content and metadata; Basic usage: send GET requests with API keys and query parameters (note: many APIs now require payment).

    Step-by-Step Implementation Guide (High-Level)

    Step 1: Decide on Sources

    • Goal: Choose books, web, and/or human feedback depending on your model’s purpose.
    • Action: For a general model, prefer broad web corpora (Common Crawl/C4) plus some books; for instruction-following, add human feedback datasets.

    Step 2: Choose Acquisition Method

    • If a good dataset exists (e.g., C4), download it.
    • If you need specific sites (e.g., Wikipedia), use their API.
    • If your domain is niche and lacks datasets, plan a polite crawler.

    Step 3: Set Up Polite Crawling (if needed)

    • Check site robots.txt (e.g., https://example.com/robots.txt) for allowed paths and crawl delays.
    • Set a descriptive User-Agent to identify your bot.
    • Configure rate limits (e.g., 1 request/sec) and exponential backoff on errors.
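The robots.txt check above can be sketched with Python's standard-library `urllib.robotparser`; the robots.txt content and the bot name `YourBot/1.0` here are hypothetical stand-ins (in practice you would fetch the real file from the site first):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch it from
# https://example.com/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check which paths our (hypothetical) bot may fetch, and how fast.
print(rp.can_fetch("YourBot/1.0", "https://example.com/articles/1"))  # True
print(rp.can_fetch("YourBot/1.0", "https://example.com/private/x"))   # False
print(rp.crawl_delay("YourBot/1.0"))  # 1
```

The parsed rules then drive both the fetcher (skip disallowed paths) and the rate limiter (honor the crawl delay).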

    Step 4: Fetch Pages

    • Use requests.get(url, headers={'User-Agent': 'YourBot/1.0'}) with timeouts.
    • Handle 200 OK responses; skip 4xx/5xx errors; respect retry-after headers.
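A minimal retry-with-backoff sketch, written around an injected `fetch` callable so it runs without network access; in practice `fetch` would be something like `requests.get(url, headers={'User-Agent': 'YourBot/1.0'}, timeout=10)`. The function name and defaults are illustrative, not from the lecture:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=0.01):
    """Call fetch(url); on a transient error, wait and retry with
    exponentially growing delays (backoff, 2*backoff, 4*backoff, ...)."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# Usage with a fake fetcher that fails twice, then succeeds:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient error")
    return "<html>ok</html>"

print(fetch_with_retry(flaky, "https://example.com"))  # <html>ok</html>
```

Passing the fetcher in as a parameter keeps politeness logic (retries, pacing) separate from the HTTP library you choose.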

    Step 5: Parse and Extract Text

    • Remove <script>, <style>, nav bars, headers, and footers using Beautiful Soup.
    • Extract main content (e.g., paragraphs); join lines into sentences.
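A dependency-free sketch of the extraction step using only the standard library's `html.parser` (Beautiful Soup expresses the same idea in less code); the sample HTML is made up:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, ignoring <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.skip = 0          # depth inside script/style tags
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "p":
            self.in_p = True
            self.paragraphs.append("")
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p and not self.skip:
            self.paragraphs[-1] += data

page = """<html><head><style>p {color:red}</style></head>
<body><nav>Home | About</nav>
<p>First paragraph.</p><script>var x = 1;</script>
<p>Second paragraph.</p></body></html>"""

parser = ParagraphExtractor()
parser.feed(page)
print(parser.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

Note how the nav bar and the script body never reach the output: only data inside `<p>` tags, outside any `<script>`/`<style>` block, is kept.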

    Step 6: Heuristic Filtering

    • Keep lines ending in ., ?, or !
    • Filter pages containing offensive-word lists.
    • Drop pages with embedded JavaScript code blocks.
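The three heuristics above can be sketched as follows; the blocklist entries and the JavaScript markers are placeholders, not the actual C4 lists:

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder offensive-word list

def keep_line(line):
    """C4-style line heuristic: keep lines that end like sentences."""
    return line.rstrip().endswith((".", "?", "!"))

def keep_page(text):
    """Drop pages with blocklisted words or obvious JavaScript."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & BLOCKLIST:
        return False
    if "javascript" in text.lower() or "function(" in text:
        return False
    return True

page = "Welcome!\nMenu Home About\nThis is a full sentence.\nRead more"
clean = [l for l in page.splitlines() if keep_line(l)]
print(clean)  # ['Welcome!', 'This is a full sentence.']
print(keep_page("Please enable JavaScript to continue."))  # False
```

Rules like these are cheap enough to run over billions of pages, which is exactly why C4 uses them as a first pass.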

    Step 7: LM-Based Filtering (Perplexity)

    • Score each document with a language model trained on high-quality text.
    • Normalize for length (e.g., per word) to compare documents fairly.
    • Set a perplexity threshold; drop documents above threshold; tune as needed.
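A toy illustration of length-normalized perplexity filtering, assuming a simple add-one-smoothed unigram model in place of the real filter LM (which would be a full language model trained on high-quality text):

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Fit a unigram model with add-one smoothing on 'high-quality' text."""
    tokens = corpus.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def perplexity(prob, text):
    """Length-normalized perplexity: exp of mean negative log-probability."""
    tokens = text.lower().split()
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

prob = train_unigram("the quick brown fox jumps over the lazy dog " * 50)
good = perplexity(prob, "the quick brown fox")
junk = perplexity(prob, "asdf jkl qwerty")
print(good < junk)  # True: familiar text scores lower
```

A real pipeline would keep documents whose perplexity falls below a tuned cutoff; the cutoff itself depends on the filter model's domain and vocabulary, so treat any fixed number as a starting point only.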

    Step 8: Safety Filtering

    • Run models or rules to detect hate speech and PII.
    • Remove or redact sensitive content.
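A minimal sketch of rule-based redaction, assuming simple regex patterns for phone numbers and email addresses; real PII detection needs much broader coverage (names, addresses, IDs) and usually an ML classifier on top:

```python
import re

# Illustrative patterns only; they cover one common phone format
# and typical email addresses, nothing more.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text):
    """Replace detected PII spans with placeholder tokens."""
    text = PHONE.sub("[PHONE]", text)
    text = EMAIL.sub("[EMAIL]", text)
    return text

print(redact("Call 555-123-4567 or mail jane@example.com today."))
# Call [PHONE] or mail [EMAIL] today.
```

Redaction (replacing the span) keeps the surrounding text usable; dropping the whole page is the safer choice when sensitive content dominates.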

    Step 9: Relevance Filtering

    • Use keywords for quick domain checks.
    • Optionally apply a topic model trained on domain text to keep on-topic content.
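The keyword check can be sketched as a coverage score; the keyword set mirrors the medical example from the lecture, and the 0.5 threshold is an arbitrary choice you would tune:

```python
MEDICAL_KEYWORDS = {"diagnosis", "treatment", "clinical", "symptoms"}

def relevance_score(text, keywords=MEDICAL_KEYWORDS):
    """Fraction of domain keywords present: a cheap relevance proxy."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)

docs = [
    "Early diagnosis and treatment improve clinical outcomes.",
    "The team won the championship game last night.",
]
kept = [d for d in docs if relevance_score(d) >= 0.5]
print(kept)  # only the medical document survives
```

A topic model generalizes this idea: instead of exact word hits, it keeps documents whose topic distribution resembles the domain, catching relevant text that never uses the listed keywords.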

    Step 10: Store Clean Data

    • Save cleaned text with metadata (URL, timestamp, language) in a datastore.
    • Maintain logs of what was filtered and why, for auditing and tuning.
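A sketch of the storage step, assuming JSON Lines as the format (one document per line with its metadata); the field names here are illustrative, not a standard schema:

```python
import json, os, tempfile
from datetime import datetime, timezone

def store(docs, path):
    """Append cleaned documents as JSON Lines with source metadata."""
    with open(path, "a", encoding="utf-8") as f:
        for doc in docs:
            record = {
                "text": doc["text"],
                "url": doc["url"],
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "language": doc.get("language", "en"),
            }
            f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
store([{"text": "A clean sentence.", "url": "https://example.com/a"}], path)
with open(path, encoding="utf-8") as f:
    print(json.loads(f.readline())["url"])  # https://example.com/a
```

Append-only JSONL streams well at scale and keeps each document's provenance (URL, timestamp, language) next to its text for later auditing.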

    Tips and Warnings

    • Respect Robots and Rates: Always obey robots.txt and keep rates low to avoid blocking or harm.
    • Expect Noise: Even “clean” datasets like C4 need extra filtering; boilerplate sneaks through.
    • Tune Thresholds: Perplexity cutoffs are not one-size-fits-all; calibrate on a held-out sample.
    • Domain Matters: Use a domain-appropriate language model for filtering or you’ll drop good content.
    • Vocabulary Size: Filter models with small vocabularies inflate perplexity; prefer larger vocabularies.
    • Bias Awareness: Filter models can encode biases; monitor for systematic removal of certain language styles.
    • API Costs and Limits: Many APIs are now paid; budget for access or switch strategies.
    • Logging and Auditing: Keep records of filter reasons to debug over- or under-filtering.
    • Storage Planning: Large-scale data (Common Crawl) cannot fit on laptops; plan cloud storage and compute.
    • Future Preparation: After filtering, you’ll still need tokenization, normalization, and vocabulary setup to feed the model.

    Perplexity Details (Intuition-Oriented)

    • Definition: Perplexity measures how well a model predicts a text sequence; lower is better.
    • Analogy: A familiarity score—if the text looks like the training data, perplexity is low.
    • Technical: Perplexity is the exponentiated average negative log-probability of the tokens under the model (equivalently, the inverse sequence probability normalized per token); length normalization lets you compare documents of different sizes fairly. In practice, set a threshold (e.g., >100) to flag likely junk.
    • Importance: It spots subtle nonsense that simple rules miss. Example: “Asdf jkl qwerty” has high perplexity because it doesn’t match learned English patterns.

    Heuristics (From C4) and Why They Help

    • Sentence-Ending Punctuation: Keeps lines likely to be full sentences, improving grammaticality.
    • Offensive Word Filtering: Removes potentially harmful content at scale, though it can miss coded language or create false positives.
    • JavaScript Removal: Avoids code-heavy pages where language isn’t the main content.
    • Caveat: Heuristics are blunt tools; combine them with ML scoring for best results.

    Dataset Landscape Overview

    • Common Crawl: Massive, free, monthly web snapshots; needs heavy cleaning.
    • C4: Cleaned with rules; still large and imperfect; good general baseline.
    • MC4: Multilingual C4; spans ≈100 languages; suitable for multi-language models.
    • RealNews: Focused news articles; good for factual and formal style.
    • WebText: GPT-2’s pretraining corpus; web pages linked from Reddit posts with at least 3 karma, using the links as a quality signal.
    • Pushshift.io (Reddit): Conversational style, slang, and informal dialogue.
    • CCNet: Another cleaned Common Crawl pipeline; alternative to C4.
    • Twitter: Harder due to paid API; plan alternatives.

    Human Feedback Datasets

    • MT Bench: A multi-turn benchmark of open-ended prompts used to evaluate instruction-following and chat quality.
    • Alpaca Farm: A framework built around instruction-following data for studying how models learn from feedback.
    • Use: Add to pretraining or fine-tuning to improve helpfulness and compliance.
    • Cost: Requires paying annotators or leveraging previous curated resources.

    Putting It All Together

    • Start with broad corpora (C4/CCNet) for coverage.
    • Add domain-specific sources (news for current events, forums for conversational tone).
    • Layer in human feedback for instruction-following behavior.
    • Apply multi-stage filtering: heuristics → perplexity → safety → relevance.
    • Audit for bias and adjust filter models and thresholds.
    • Prepare data next (tokenize, normalize, build vocabulary) for training.

    Core Message

    • Data is the bread and butter of language modeling. Models are extremely sensitive to what they see. The right mix of sources, careful acquisition, and rigorous filtering are essential. Without this, even the best model architecture underperforms.

    04 Examples

    • 💡

      Books vs. Web Trade-off: Input: Choose between Books3 and Common Crawl for pretraining. Processing: Evaluate quality (books are polished), freshness and diversity (web), and noise levels (web is messy). Output: A combined plan: use books for clean structure and web for breadth and recency. Key Point: Blending sources balances quality with coverage.

    • 💡

      Common Crawl Raw HTML Extraction: Input: A raw Common Crawl HTML page with tags, CSS, and JavaScript. Processing: Use a parser to remove scripts/styles and boilerplate, then extract main text paragraphs. Output: Clean sentences ready for downstream filtering. Key Point: You must strip format noise before training.

    • 💡

      C4 Heuristic Filtering: Input: A batch of web pages. Processing: Keep only lines ending with ., ?, or !; remove pages with offensive words; drop pages heavy in JavaScript. Output: A cleaned subset similar to C4’s approach. Key Point: Simple rules at scale remove lots of junk fast.

    • 💡

      Perplexity Scoring Demo: Input: Two sentences—“The quick brown fox jumps over the lazy dog” and “Asdf jkl qwerty.” Processing: Score with a language model trained on English; normalize for length. Output: Low perplexity for the first, high for the second; keep the first, drop the second. Key Point: Perplexity separates real language from gibberish.

    • 💡

      Safety Filtering for PII: Input: A web page listing someone’s phone number and address. Processing: Run a PII detector or regular expressions to find sensitive data. Output: Remove or redact the page from the dataset. Key Point: Protect users and meet safety standards.

    • 💡

      Domain Relevance with Keywords: Input: Mixed-topic articles for a medical model. Processing: Keep documents with domain keywords like “diagnosis,” “treatment,” “clinical,” and “symptoms.” Output: A focused corpus of medical text. Key Point: Simple keyword checks keep your dataset on-topic.

    • 💡

      Topic Model Relevance Check: Input: A large mixed corpus with varied subjects. Processing: Train a topic model on medical text, then keep documents whose topic distribution matches medical themes. Output: Retained documents that are relevant even without exact keywords. Key Point: Topic models find relevance beyond simple word matches.

    • 💡

      API vs. Crawling Decision: Input: Need Wikipedia articles. Processing: Compare Wikipedia API (structured, reliable) to crawling (more work, less structure). Output: Choose the API for faster, cleaner access. Key Point: APIs reduce effort when available and affordable.

    • 💡

      Polite Crawling Practice: Input: Plan to crawl a news site. Processing: Check robots.txt for allowed paths and crawl delays; set a descriptive User-Agent; throttle requests. Output: A stable crawl that avoids being blocked or harming the site. Key Point: Politeness ensures sustainable data collection.

    • 💡

      Filtering C4 Residual Boilerplate: Input: A C4 page snippet containing “Skip to content. Breaking news …”. Processing: Apply additional rules to remove boilerplate phrases and banners. Output: Cleaner, article-only text. Key Point: Even cleaned datasets need extra passes.

    • 💡

      Choosing Reddit for Conversational Tone: Input: You want a chatty, informal model voice. Processing: Add Pushshift.io (Reddit comments) while filtering for quality and safety. Output: A model that speaks more casually and understands slang. Key Point: Source choice shapes the model’s tone.

    • 💡

      Adapting to API Paywalls: Input: Need social media data but Twitter now charges for API access. Processing: Evaluate budget and alternatives, like Reddit or curated datasets. Output: Switch sources to keep costs reasonable. Key Point: Data strategies must adapt to changing access rules.

    • 💡

      Bias Check in LM Filtering: Input: A filter model trained on imbalanced data. Processing: Observe that texts from certain groups get higher perplexity. Output: Adjust filter model or thresholds to reduce unfair drops. Key Point: LM-based filtering can silently encode bias.

    • 💡

      Combining Heuristics and LM Scoring: Input: A large web crawl with mixed quality. Processing: First, apply punctuation and blocklist rules; then use perplexity to catch subtle junk. Output: Smaller, higher-quality dataset. Key Point: Multi-stage filtering is stronger than any single method.

    05 Conclusion

    This session centered on the most decisive ingredient in language modeling: data. You learned where data comes from—books, the web, and human feedback—and the strengths and weaknesses of each. You explored major datasets like Common Crawl (massive and messy), C4/MC4 (rule-cleaned, large, still imperfect), and domain-focused sources like RealNews, WebText, and Reddit comments. You also saw how to get data via downloads, APIs, or your own crawlers, and how to be a polite and responsible crawler by obeying robots.txt, rate limiting, and identifying your bot.

    The second major theme was filtering: quality, safety, and relevance. Heuristics such as punctuation checks and offensive-word filters can quickly remove obvious junk, while language-model scoring using perplexity can catch subtler nonsense. You learned the practical subtleties of perplexity—length normalization, vocabulary size effects, and domain matching—to avoid throwing away useful text. Safety filtering protects against harmful content and personal data leaks, and relevance filtering, using keywords or topic models, keeps your dataset focused on the tasks you care about. A key warning is that language models used for filtering can introduce bias if they were trained on biased data, so you must monitor and adjust.

    Practically, the process looks like this: select sources, acquire data (download/API/crawl), extract text, run multi-stage filters, and store the clean corpus for later preparation. Each choice influences model behavior, tone, and safety. Next, you will move into preparing data—tokenization, normalization, and building vocabularies—so the cleaned corpus can feed into training smoothly. The core message to remember is simple but powerful: the quality and fit of your data largely determine your model’s success. Treat data selection and filtering as first-class engineering tasks, and your models will repay that care with better performance and safer behavior.

  • ✓Watch out for vocabulary effects. Filter models with tiny vocabularies can inflate perplexity for normal words. Prefer models with larger vocabularies for scoring. This reduces false positives.
  • ✓Protect safety by filtering hate and PII. Use both blocklists and ML classifiers. Redact or remove sensitive content to avoid harm. This step is non-negotiable for deployed systems.
  • ✓Focus your corpus with relevance filters. Start with keywords, then consider topic models for broader theme matching. Keep only what supports your application’s domain. This makes your model an expert, not a generalist.
  • ✓Audit for bias in LM-based filtering. Check if certain groups’ texts are disproportionately dropped. Adjust models, data, or thresholds to reduce unfairness. Transparency in filtering decisions helps trust and quality.
  • ✓Plan for changing data access. APIs can become paid or limited suddenly. Maintain alternative sources or strategies (other datasets, crawling) to stay resilient. Budget time and money for access.
  • ✓Log every filtering decision. Record what was removed and why (heuristic hit, high perplexity, safety flag). These logs help you debug over-filtering or leaks. They also support compliance and reproducibility.
  • ✓Iterate your pipeline. Evaluate early samples, adjust thresholds, and re-run filters. Small tuning steps can massively improve final data quality. Treat data curation as continuous, not one-and-done.
  • ✓Keep storage and compute constraints in mind. Massive datasets like Common Crawl won’t fit on a laptop. Use cloud storage and compute plans that scale. Avoid bottlenecks with batching and streaming.
  • ✓Blend sources to shape tone and capability. Add Reddit for conversational style, news for factual writing, and books for structure. The mix teaches the model how to speak and what to know. Choose intentionally.
  • ✓Prepare next: tokenize and normalize. After filtering, ensure text is standardized and split into tokens. Good prep reduces training issues and improves stability. Don’t skip these final steps before training.
  • Dataset

    A dataset is a collection of data used for training or evaluation. For language models, this usually means lots of text. Different datasets have different styles, topics, and quality levels. Choosing the right dataset is crucial for good results.

    Common Crawl

    Common Crawl is a nonprofit project that crawls a large part of the web every month and releases the data for free. It includes raw HTML for many webpages. It’s huge, covering petabytes, and very diverse. But it’s messy and needs lots of cleaning.

    C4 (Colossal Clean Crawled Corpus)

    C4 is a cleaned version of Common Crawl created by applying simple rules to remove low-quality text. It keeps lines ending with punctuation and filters pages with offensive words and JavaScript. It is still large (around 300 GB) and not perfect. It gives a better starting point than raw web data.

    MC4

    MC4 is a multilingual version of C4. It includes many languages, roughly about 100. It uses similar cleaning ideas but across different languages. It helps build multilingual models.
