Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 13: Data 1 | How I Study AI


Intermediate
Stanford Online
LLM · YouTube

Key Summary

  • This class explains why data is the most important part of building language models. You learn where text data comes from (books, the web, and human feedback) and what each source is good and bad at. The instructor stresses that most of your time in real projects goes into finding, collecting, cleaning, and filtering data, not model code.
  • Books are high quality and well written but can be old and narrow in topic. The Books3 dataset is a large example (about 200 GB). Web data is huge, diverse, and fresh, but very messy and noisy. Human feedback data is precise for tasks like instruction following but is expensive to create.
  • Common Crawl is a nonprofit project that has crawled a large chunk of the web every month since 2011, producing petabytes of data. It’s free and great for general pretraining, but very messy with HTML, scripts, boilerplate, and spam. You must extract text and filter aggressively. It cannot just be downloaded onto a laptop due to its size.
  • C4 (Colossal Clean Crawled Corpus) is a cleaned version of Common Crawl created with rules. It keeps lines that end with punctuation and removes pages with offensive words or JavaScript-heavy content. It is still large (about 300 GB) and cleaner, but not perfect. MC4 is a multilingual version with around 100 languages.
  • Other web datasets include RealNews (news articles), WebText (used for GPT-2), Pushshift.io (Reddit comments), and CCNet (another cleaned Common Crawl). Social data like Twitter has become harder to obtain because its API now costs money. Each dataset carries different biases and noise. You must choose based on your goal.
  • You can get data in three main ways: download ready-made datasets, pull from APIs, or crawl websites yourself. Downloading is easiest if a dataset already exists. APIs give structured access but often require payment now. Crawling is most flexible but takes the most engineering effort.

Why This Lecture Matters

This lecture matters because data is the single biggest driver of language model success in real projects. Engineers, data scientists, and researchers often spend most of their time obtaining, cleaning, and filtering text, not tweaking model code. Knowing where to find data (books, web, human feedback), how to collect it (downloads, APIs, crawling), and how to filter it (quality, safety, relevance) directly translates into better model behavior, fewer surprises, and safer outputs.

The knowledge here solves practical problems: getting timely content despite API changes, extracting meaningful text from messy HTML, and removing junk and harmful content before it poisons training. It also helps build specialized models by keeping only domain-relevant text, like medicine or law, so models become experts instead of generalists. Understanding perplexity-based filtering, the limits of heuristics, and the subtleties of length, vocabulary size, and domain match prevents throwing out good data or keeping bad data.

In real work, this guidance helps you design robust data pipelines that respect websites (robots.txt, rate limits), anticipate shifting access rules, and track filtering decisions for auditing and improvement. It supports careers by moving you beyond toy datasets to production-scale thinking, an essential skill for ML engineers and researchers. In today’s industry, the teams that master data acquisition and filtering tend to build stronger, safer, and more competitive models, because the model’s diet determines its health. Mastering these skills puts you in the driver’s seat of modern LLM development.

Lecture Summary


01 Overview

This lecture focuses on the first and most important part of building language models: data. The goal is to show what kinds of data exist, where to find them, how to collect them, and how to filter them so that training works well. While transformers, neural networks, and autoregressive generation matter, real-world work in language modeling is mostly about data—finding it, cleaning it, processing it, and deciding what to keep or drop. The instructor emphasizes that about 80% of project time goes into data tasks, and that models are highly sensitive to what they are trained on.

The lecture divides text data into three main sources: books, the web, and human feedback. Books (for example, the Books3 dataset, roughly 200 GB) are well-written and carefully proofread, but often outdated and narrow in topic coverage. The web is vast, timely, and diverse—covering news, blogs, forums, and social media—but very noisy and inconsistent. Human feedback data (like MT Bench or Alpaca Farm) is directly helpful for tasks such as alignment and instruction-following, but it is expensive to create because you must pay people to write or rate examples.

To make web-scale data usable, the lecture spotlights Common Crawl, a nonprofit that has crawled a large portion of the web each month since 2011. Common Crawl provides petabytes of free data, which is excellent for general-purpose pretraining but too messy to use raw. It contains HTML, CSS, JavaScript, PDFs, images, and lots of boilerplate (navigation bars, footers, legal disclaimers), plus spam and machine-translated text. Therefore, it needs heavy text extraction and filtering. A well-known cleaned subset is C4 (Colossal Clean Crawled Corpus), which applies heuristics such as keeping lines that end with punctuation and filtering pages with offensive words or JavaScript, yielding a still-large (≈300 GB) dataset. There is also MC4, a multilingual version spanning roughly 100 languages. Even so, some junk remains, like “Skip to content. Breaking news …”, so additional filtering is often needed.
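
The boilerplate-stripping step can be sketched with Python’s standard-library HTML parser. This is only a minimal illustration, not the extractor actually used for C4; production pipelines use far more robust main-content extraction. The tag list and sample HTML below are invented for the example:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collects text from <p> tags while skipping script/style/nav/footer blocks."""
    SKIP = {"script", "style", "nav", "footer", "header"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # inside how many boilerplate tags we currently are
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Keep only text that sits inside a <p> and outside boilerplate tags.
        if self.in_p and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

html = """
<html><head><style>p{color:red}</style></head>
<body><nav>Skip to content</nav>
<p>The article body we want.</p>
<script>trackUser();</script>
<footer>Legal disclaimer</footer></body></html>
"""
parser = MainTextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # -> The article body we want.
```

Note how the navigation text, script, and footer vanish while the article sentence survives; this is exactly the kind of cleaning C4-style pipelines apply at scale.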

Beyond Common Crawl and C4, many other datasets are relevant: RealNews (news articles), WebText (used for GPT-2), Pushshift.io (Reddit comments), CCNet (another cleaned Common Crawl), and social media like Twitter (now harder to access due to paid APIs). Choosing among these depends on your task and constraints, including quality, coverage, recency, and availability.

The lecture then explains three ways to acquire data: download pre-packaged datasets, use APIs, or crawl the web yourself. Downloading is simplest when a suitable dataset already exists. APIs provide structured access to sites like Wikipedia or the New York Times, though more APIs now require payment. Crawling is the most general, but requires careful engineering and etiquette. Ethical and polite crawling includes rate limiting requests, identifying the crawler as a bot, and obeying robots.txt files that specify what is allowed.
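
The robots.txt etiquette mentioned above can be checked with Python’s standard library. The robots.txt content and bot name below are made up for illustration; in practice you would point `set_url` at the live file and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for the example.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 1
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Check which paths our (hypothetical) bot may fetch.
print(rp.can_fetch("my-research-bot", "https://example.com/articles/page1"))  # True
print(rp.can_fetch("my-research-bot", "https://example.com/private/data"))    # False
```

Calling `can_fetch` before every request, and honoring any crawl delay, is the minimum politeness a crawler owes the sites it visits.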

Once data is collected, you must filter it. The lecture organizes filtering around three criteria: quality, safety, and relevance. Quality filtering removes low-quality or nonsensical text (like machine-translated junk or random strings). Safety filtering removes hateful content or personally identifiable information (PII). Relevance filtering removes content unrelated to your target domain, such as medicine or law. Heuristics (simple rules) can help, like requiring sentence-ending punctuation or maintaining blocklists of offensive words, but they are imperfect. More advanced filtering uses machine learning, including language models.
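
A minimal sketch of such rule-based quality filtering in Python; the blocklist and the exact rules here are placeholders, not the lecture’s actual lists:

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder offensive-word list

def keep_line(line: str) -> bool:
    """C4-style heuristics: keep only lines that look like real sentences."""
    line = line.strip()
    if not line.endswith((".", "!", "?", '"')):
        return False  # drop menu items and sentence fragments
    if re.search(r"\bjavascript\b", line, re.IGNORECASE):
        return False  # drop script warnings and JS-heavy boilerplate
    words = set(re.findall(r"[a-z']+", line.lower()))
    return not (words & BLOCKLIST)  # drop lines containing blocklisted words

doc = [
    "The study followed 500 patients over two years.",
    "Skip to content",
    "Please enable JavaScript to view this site.",
]
print([line for line in doc if keep_line(line)])
```

Only the first line survives: the second lacks terminal punctuation and the third trips the JavaScript rule, matching the heuristics described above.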

Language models can filter by scoring text with perplexity, a number that reflects how surprising the text is under a model trained on high-quality data. Lower perplexity means the text fits well; higher means it looks like gibberish. A simple demonstration compares “The quick brown fox jumps over the lazy dog” (low perplexity) with “Asdf jkl qwerty” (high perplexity). Important subtleties include normalizing for length (e.g., per-word), using models with suitable vocabulary sizes, and matching the domain of the scoring model to the content; otherwise, good domain-specific text may be mistakenly rejected. The instructor also warns that using a language model to filter data can introduce bias: if the filter model is trained on biased data, it can unfairly down-rank texts from underrepresented groups.
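
The idea can be illustrated with a toy unigram model. Real filters train far stronger language models on large high-quality corpora; treat this only as a sketch of length-normalized perplexity, with a tiny invented training corpus:

```python
import math
from collections import Counter

# Tiny stand-in for a "high-quality" training corpus.
corpus = ("the quick brown fox jumps over the lazy dog . "
          "the dog sleeps near the fox .").split()
counts = Counter(corpus)
vocab = len(counts) + 1  # +1 slot for unseen words

def perplexity(text: str) -> float:
    """Per-word perplexity under a unigram model with add-one smoothing."""
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        p = (counts[w] + 1) / (len(corpus) + vocab)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))  # normalize by length

good = perplexity("the quick brown fox jumps over the lazy dog")
junk = perplexity("asdf jkl qwerty")
print(f"good: {good:.1f}  junk: {junk:.1f}")
assert good < junk  # keep the sentence, drop the gibberish
```

Dividing the log-probability by the word count is the length normalization the lecture warns about: without it, long but perfectly good documents would accumulate “surprise” and be unfairly discarded.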

Finally, the lecture recaps the main message: data selection and filtering heavily shape language model performance. You learned about data sources, acquisition methods, and filtering strategies, and why each step matters. In the next session, the focus will move to preparing data—tokenization, normalization, and vocabulary building—so the cleaned text can be fed into models effectively.

Key Takeaways

  • ✓Start with the right sources for your goal. Combine books for clean structure, web for diversity and recency, and human feedback for instruction-following. Each source adds a different strength to your model. Balance them to match your use case.
  • ✓Prefer existing datasets when they fit. Downloading C4, RealNews, or other curated sets is faster than crawling from scratch. Save crawling for niches that lack good datasets. This saves engineering time and reduces risk.
  • ✓Be a polite, transparent crawler. Always check robots.txt, set a clear User-Agent, and limit your request rate. Politeness prevents blocks and protects sites. It also builds trust with web admins.
  • ✓Clean HTML aggressively before training. Strip scripts, styles, headers, footers, and navigation. Keep only the main article text. Training on layout junk harms model quality.
  • ✓Use simple heuristics as a first filter. Keep lines ending with punctuation and drop pages with offensive words or heavy JavaScript. Heuristics are cheap and fast at scale. They prepare data for smarter filters later.
  • ✓Score text with a language model to catch subtle junk. Perplexity reveals nonsensical or machine-generated text that looks okay to simple rules. Normalize for length to be fair to long documents. Tune thresholds on a small validation set.
  • ✓Match your filter model to your domain. A general model may misjudge specialized language. Use a domain-aware model for perplexity scoring when possible. This keeps valuable domain text from being discarded.

Glossary

Language model

A language model is a program that guesses the next word (token) based on previous words. It learns these guesses by reading lots of text and finding patterns. The more and better text it sees, the smarter its guesses become. It does not understand like a human, but it gets very good at predicting likely text. Its quality depends strongly on the data it trains on.

Token

A token is a small piece of text the model reads, like a word or subword. Models don’t read whole paragraphs at once; they read tokens one after another. Tokenization is how text is split into tokens. Shorter tokens make the model’s job easier in some ways but change vocabulary size. Good token choices make training smoother.

Autoregressive generation

Autoregressive generation means the model writes text one token at a time, using what it already wrote to decide the next token. It’s like typing a story by looking at the last word and picking the next. The process repeats until the output is done. This is how many modern language models generate text.

Transformer

A transformer is a neural network architecture that handles sequences well. It uses attention to focus on important parts of the input when predicting the next token. This allows it to capture long-range patterns in text. Transformers are the standard for many language models today.

Tags: common crawl, c4, mc4, web crawling, robots.txt, beautiful soup, scrapy, requests, data filtering, perplexity, safety filtering, pii, topic modeling, keywords filtering, books3, webtext, realnews, pushshift, ccnet, api access
  • When crawling, be respectful and polite to websites. Limit your request rate so you don’t overload servers, identify yourself as a bot, and obey robots.txt rules. These rules tell you where crawling is allowed and not allowed. Polite crawling keeps the web stable and keeps you from being blocked.
  • Useful Python tools include urllib and requests for HTTP requests, Beautiful Soup for parsing HTML, and Scrapy for building full crawlers. These tools help fetch pages, extract main text, and follow links. You combine them to move from a seed page to many related pages. Each tool solves a piece of the pipeline.
  • Filtering is critical because a lot of collected text is low quality, unsafe, or irrelevant. Quality filtering removes gibberish, spam, and machine-translated junk. Safety filtering removes hateful content and sensitive personal information (PII). Relevance filtering keeps only text related to your domain, like medicine or law.
  • Heuristic filters are simple rule-based checks, like keeping lines that end in punctuation or blocking pages with certain offensive words. They are fast and easy to run at large scale. But people can bypass them with misspellings or new slang, and they might erase useful content by mistake. They are a first pass, not the final answer.
  • Language models can act as filters by scoring text with perplexity. Perplexity measures how surprising text is to a model trained on good English; low perplexity means likely good text, high means gibberish. You filter out high-perplexity text, but you must adjust for text length, model vocabulary size, and the domain the model was trained on. Wrong choices here can toss out useful domain content.
  • Relevance can be checked with keywords or a topic model trained on your domain. Keywords are simple but brittle and can miss synonyms. Topic models can capture broader themes but take more setup. Either way, you keep what fits your task and drop the rest.
  • The instructor used a simple example to show perplexity filtering: “The quick brown fox jumps over the lazy dog” vs “Asdf jkl qwerty.” The first sentence is grammatical and gets low perplexity; the second is nonsense and gets high perplexity. You would keep the first and drop the second. This shows why numeric scoring helps automate quality control.
  • Common Crawl raw HTML contains tags, CSS, JavaScript, navigation bars, and legal footers. You must strip boilerplate and extract the main content. Even cleaned datasets like C4 still include unwanted lines like “Skip to content” and “Breaking news,” so additional filtering is needed. There is no single magic dataset that is perfect.
  • Using a language model for filtering can introduce bias. If the filter model was trained mostly on one group’s writing, it might down-rank other groups’ language. This can unfairly remove valuable content. You must pick filter models carefully and be aware of bias risks.
  • Data choices affect every stage of the project: training, debugging, evaluation, and deployment. Better data leads to better predictions because language models are like giant memory tables of token patterns. The more diverse, clean, and relevant data the model sees, the better it performs. The next class will cover preparing data: tokenization, normalization, and building vocabularies.
  • APIs and social platforms change access rules, so plan for data availability to shift over time. If an API becomes paid or restricted, you may need to switch to curated datasets or build crawlers. Always check and respect terms of service. Sustainable data pipelines depend on both technical and legal care.
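
The keyword-based relevance check described above might look like the sketch below; the term list and hit threshold are invented for illustration, and a topic model would replace this when keywords are too brittle:

```python
import re

MEDICAL_TERMS = {"diagnosis", "treatment", "symptoms", "patient", "clinical"}

def is_relevant(doc: str, keywords: set, min_hits: int = 2) -> bool:
    """Keep documents that mention enough domain keywords (cheap but brittle)."""
    words = set(re.findall(r"[a-z]+", doc.lower()))
    return len(words & keywords) >= min_hits

print(is_relevant("The patient received treatment after a diagnosis.", MEDICAL_TERMS))  # True
print(is_relevant("Top ten travel destinations this summer.", MEDICAL_TERMS))           # False
```

Requiring more than one keyword hit reduces false positives from passing mentions, at the cost of missing documents that use synonyms, which is exactly the brittleness the lecture flags.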
02 Key Concepts

    • 01

      Why Data Volume Matters: Definition: Data volume is how much text your model sees during training. Analogy: It’s like knowing a friend named Alex—after 10 years you can predict their next sentence better than after one hour. Technical: Language models memorize statistical patterns of token sequences, so more samples make probability estimates more reliable. Importance: With too little data, estimates are noisy and the model predicts poorly. Example: A model trained on millions of webpages recognizes common phrasing in news better than a model trained on a few books.

    • 02

      Language Models as Giant Lookup Tables: Definition: A language model stores patterns about which token sequences commonly follow others. Analogy: It’s like a giant table of next-word guesses built from past reading. Technical: During training, it learns probabilities of token sequences; at inference, it chooses likely next tokens. Importance: If patterns were not learned from rich data, next-token predictions would be weak. Example: Seeing many email sign-offs helps the model predict “Best regards,” after “Thank you for your time.”

    • 03

      Books as a Data Source: Definition: Books are long-form, carefully edited text. Analogy: They’re like polished essays bound into a library—reliable but sometimes dated. Technical: Datasets like Books3 (~200 GB) contain high-quality sentences and rich vocabulary but limited topical freshness and breadth. Importance: They provide clean grammar and structure to stabilize training. Example: Training with books improves coherent paragraph generation but might lack today’s slang or recent events.

    • 04

      Web as a Data Source: Definition: The web is a massive collection of online pages (news, blogs, forums, social media). Analogy: It’s like a giant city bazaar—everything is there, but it’s loud and messy. Technical: Web data is diverse and timely, but includes HTML, scripts, boilerplate, spam, and low-quality text that must be extracted and filtered. Importance: Models learn current language use, variety of styles, and up-to-date facts. Example: Including web forums helps a model answer casual questions in everyday tone.

    • 05

      Human Feedback Data: Definition: Human feedback data consists of examples written or rated by people for specific tasks. Analogy: It’s like getting a tutor’s custom corrections and practice problems. Technical: Datasets such as MT Bench and Alpaca Farm contain instruction-following or alignment examples tailored to desired behaviors. Importance: It directly teaches models how to follow instructions safely and helpfully. Example: Fine-tuning on human-ranked responses improves response quality for question-answering.

    • 06

      Common Crawl Basics: Definition: Common Crawl is a nonprofit monthly crawl of a large fraction of the web since 2011. Analogy: It’s like a giant public snapshot album of the internet taken every month. Technical: It provides petabytes of raw HTML data, free to use, but far too big and messy to consume directly on a laptop. Importance: It’s a foundational source for general-purpose pretraining. Example: Starting from Common Crawl gives a model broad knowledge of topics from cooking blogs to tech news.

    • 07

      The Messiness of Raw HTML: Definition: Raw HTML is the source code of web pages that mixes content with layout and scripts. Analogy: It’s like a magazine page covered with stickers and ads—you must peel them off to read the article. Technical: Pages include tags, CSS, JavaScript, navigation bars, footers, and legal text; you must extract main text and drop boilerplate. Importance: Without extraction, the model would learn from junk and format noise. Example: Removing “Skip to content” and menu items before training yields cleaner sentences.

    • 08

      C4 (Colossal Clean Crawled Corpus): Definition: C4 is a cleaned dataset built from Common Crawl using rules. Analogy: It’s like washing a huge basket of fruit and tossing out visibly bad pieces. Technical: It keeps lines ending with punctuation, removes pages with offensive words and JavaScript-heavy content, producing ~300 GB of cleaner text. Importance: You get a large, more trainable corpus with less noise. Example: Training on C4 reduces the model’s tendency to reproduce navigation menus.

    • 09

      MC4 (Multilingual C4): Definition: MC4 is a multilingual extension of C4. Analogy: It’s like having the same cleaned fruit basket but for many languages. Technical: It covers roughly 100 languages, enabling multilingual pretraining from similar cleaning rules. Importance: It supports building models that understand and generate many languages. Example: A model trained on MC4 can handle questions in English, Spanish, and Hindi.

    • 10

      Other Web Datasets: Definition: Collections like RealNews, WebText, Pushshift.io (Reddit), and CCNet are specialized web corpora. Analogy: They’re like themed sections of a library—news, forums, and curated shelves. Technical: Each dataset has its own source, cleaning approach, and style; Twitter data is now harder to obtain due to paid APIs. Importance: Picking the right sources shapes your model’s voice and knowledge. Example: Adding Reddit comments improves conversational tone, while RealNews strengthens news summarization.

    • 11

      Three Ways to Get Data: Definition: You can download datasets, use APIs, or crawl websites yourself. Analogy: It’s like buying ready-made food, ordering from a menu, or cooking from scratch. Technical: Downloads are simplest, APIs offer structured access (often paid), and crawling is most flexible but hardest. Importance: Your choice affects engineering time, cost, and coverage. Example: Use Wikipedia’s API for reliable pages, or Scrapy to harvest a niche site unavailable via API.

    • 12

      Polite Web Crawling: Definition: Polite crawling means fetching pages without harming websites. Analogy: It’s like visiting someone’s home and knocking gently, not kicking the door. Technical: Limit request rates, identify as a bot with a user-agent, and obey robots.txt rules about allowed paths. Importance: It prevents server overload and blocking and respects site policies. Example: Checking wikipedia.org/robots.txt before crawling ensures you avoid disallowed endpoints.

    • 13

      Core Libraries for Crawling and Parsing: Definition: Tools like urllib/requests, Beautiful Soup, and Scrapy help fetch and process web pages. Analogy: They’re your backpack tools—map, knife, and rope for a web hike. Technical: requests makes HTTP calls, Beautiful Soup parses HTML to extract text, and Scrapy orchestrates large-scale crawls with queues and pipelines. Importance: Using the right tool makes data collection faster and cleaner. Example: Combine requests + Beautiful Soup to pull article paragraphs from a news site.

    • 14

      Filtering by Quality: Definition: Quality filtering removes gibberish, spam, and machine-translated junk. Analogy: It’s like tossing bruised or rotten fruit from a basket. Technical: Use heuristics (punctuation-ending lines, blocklists) or language-model scoring (perplexity) to decide what to keep. Importance: Better data leads to clearer, more grammatical model outputs. Example: Dropping pages with random keyboard mash text improves training stability.

    • 15

      Heuristics for Filtering: Definition: Heuristics are simple rules for quick filtering. Analogy: They’re like quick visual checks when sorting fruit—no microscope needed. Technical: Examples include requiring sentence-ending punctuation, stripping pages with JavaScript blocks, and excluding offensive word lists. Importance: They scale cheaply to billions of lines. Example: A first-pass filter removes lines without ., ?, or ! to reduce fragments and boilerplate.

    • 16

      Language Models as Filters (Perplexity): Definition: Perplexity measures how surprising text is to a trained model. Analogy: It’s like a familiarity meter—the model smiles at normal sentences and frowns at nonsense. Technical: Train a model on high-quality text and score new text; keep low-perplexity lines, drop high ones. Importance: It catches subtle junk that heuristics miss. Example: “The quick brown fox …” gets low perplexity; “Asdf jkl qwerty” gets high and is removed.

    • 17

      Perplexity Subtleties: Length: Definition: Long texts naturally accumulate more surprise. Analogy: Reading a long story offers more chances for odd parts than a short sentence. Technical: Normalize scores by length (e.g., per-word) so fair comparisons are possible. Importance: Without normalization, longer but good documents may be thrown away. Example: A 1,000-word article shouldn’t be penalized just for being long.

    • 18

      Perplexity Subtleties: Vocabulary Size: Definition: A model’s vocabulary affects how it scores words. Analogy: A small dictionary makes many words look unusual, even if they’re common elsewhere. Technical: Larger vocabularies reduce artificial surprise for legitimate tokens. Importance: Using a tiny vocabulary can inflate perplexity and drop good text. Example: Technical terms look rare to a small-vocab model but normal to a large-vocab one.

    • 19

      Perplexity Subtleties: Domain Match: Definition: A filter model should know the target domain’s language. Analogy: A chef trained on Italian dishes may misjudge great sushi. Technical: A general model may rate medical jargon as odd; a medical-tuned model won’t. Importance: Wrong-domain filters delete useful domain text. Example: A cardiology article might be wrongly filtered by a general news-trained model.

    • 20

      Safety Filtering (Hate and PII): Definition: Safety filtering removes harmful or sensitive content. Analogy: It’s like locking away sharp tools from kids. Technical: Use blocklists and ML classifiers trained to detect hate speech and personally identifiable information (PII). Importance: It protects users and meets legal/ethical requirements. Example: A page exposing phone numbers and addresses is detected and excluded.

    • 21

      Relevance Filtering with Keywords: Definition: Relevance filtering keeps text that matches your topic. Analogy: It’s like keeping only the puzzle pieces that fit your picture. Technical: Keep documents containing domain-specific terms; drop others. Importance: It focuses training on what the model must do well. Example: For a medical model, keep pages mentioning “diagnosis,” “treatment,” or “symptoms.”

    • 22

      Relevance Filtering with Topic Models: Definition: Topic models group documents by themes. Analogy: They’re librarians clustering books into themed shelves. Technical: Train on domain text to learn topic distributions, then keep documents matching your domain topics. Importance: They catch relevance even when keywords differ. Example: An article about “clinical trials” and “adverse events” is kept for a pharma model.

    • 23

      Bias Risks in LM-Based Filtering: Definition: Filter models can carry biases from their training data. Analogy: A crooked ruler makes all measurements wrong. Technical: If a model saw mostly one group’s writing, it may down-rank others’ language styles. Importance: It can unfairly erase valuable perspectives. Example: A male-dominated corpus could lower scores for texts written by women.

    • 24

      Twitter and API Access Changes: Definition: Data access rules can change over time, especially for social media. Analogy: A road you used yesterday might be toll-only today. Technical: Paid APIs limit collection; you may need to switch to curated datasets or crawling. Importance: Plans must adapt to access and budget. Example: If Twitter’s API is too expensive, rely on Reddit or news datasets instead.

    • 25

      Data Shapes the Whole Pipeline: Definition: Data decisions affect training, debugging, evaluation, and deployment. Analogy: Bad ingredients ruin the whole meal. Technical: Clean, safe, relevant data yields better loss curves, fewer bugs, and higher-quality outputs. Importance: It’s the foundation under all modeling work. Example: A carefully filtered corpus reduces hallucinations and toxic outputs.

    03 Technical Details

    Overall Architecture/Structure

    1. Source Selection
    • Definition: Source selection means deciding which collections of text you will use. Analogy: It’s like choosing the markets you’ll shop at before cooking a feast. Technical: You weigh books (high-quality but dated), web (diverse but noisy), and human feedback (task-aligned but costly). Importance: Wrong sources lead to poorly aligned models. Example: For a conversational assistant, you might prefer web forums plus instruction data.
    2. Acquisition Methods
    • Definition: Acquisition is how you actually get the text. Analogy: Buying ready-made meals (downloads), ordering from restaurants (APIs), or cooking from scratch (crawling). Technical: Pre-packaged datasets (C4, RealNews) are easiest; APIs (Wikipedia, NYT) provide structured access; crawling builds your own dataset when others don’t fit. Importance: Acquisition choices determine your engineering effort and cost. Example: Use APIs when you need guaranteed structure and metadata.
    3. Raw Data Handling
    • Definition: Raw handling means turning whatever you fetched (HTML, JSON) into plain text. Analogy: Removing wrapping paper and sorting what’s inside. Technical: For the web, strip HTML tags, CSS, and scripts; remove boilerplate like menus and footers. Importance: Models should learn language, not layout. Example: From a blog page, keep the article body but drop the sidebar.
    4. Filtering (Quality, Safety, Relevance)
    • Definition: Filtering removes what you don’t want or need. Analogy: Sifting flour before baking. Technical: Apply heuristics, LM-based scoring (perplexity), safety classifiers, and domain relevance checks (keywords or topic models). Importance: Filtering quality directly impacts model behavior. Example: Excluding hateful or PII content prevents unsafe outputs.
    5. Preparation (Preview)
    • Definition: Preparation converts clean text into model-ready format. Analogy: Cutting ingredients to the right size before cooking. Technical: Tokenization, normalization, and vocabulary building organize text into tokens the model can understand. Importance: Poor preparation increases training errors and confusion. Example: Normalizing Unicode and consistent casing improves tokenizer stability.

    Data Flow

    • Step 1: Choose sources (books, web, human feedback) based on task needs and constraints.
    • Step 2: Acquire data (download datasets, call APIs, or crawl sites) while respecting policies.
    • Step 3: Extract main text from raw formats (HTML to text, JSON fields to strings).
    • Step 4: Filter aggressively for quality, safety, and relevance (rules + models).
    • Step 5: Store the clean text in a structured repository for later preparation and training.

    Code/Implementation Details (Conceptual, Tools Named in Lecture)

    Languages/Frameworks: Python is a common choice.

    • requests (or urllib): Fetch pages via HTTP GET requests.
    • Beautiful Soup: Parse HTML and extract content (e.g., article paragraphs).
    • Scrapy: Build scalable crawlers with spiders, queues, and pipelines.

    What Each Tool Does:

    • requests: Sends HTTP requests, receives responses, handles timeouts and headers.
      • Important parameters: headers (User-Agent string to identify bot), timeout (avoid hanging), backoff/retry logic (politeness and resilience).
    • urllib: Standard library alternative for HTTP access; lower-level than requests.
    • Beautiful Soup: Parses HTML into a tree; lets you select and remove tags; you can find <p> tags, strip scripts and styles.
    • Scrapy: Defines spiders that start from seed URLs, follow links matching rules, and process items through pipelines (for cleaning and storage). Handles concurrency and politeness settings.

    Role of Each Component:

    • Fetcher: Pulls pages/APIs (requests/urllib) while rate-limiting and identifying the bot.
    • Parser: Turns HTML into clean text (Beautiful Soup).
    • Crawler Orchestrator: Manages large-scale crawling across sites (Scrapy), obeys robots.txt.
    • Filter: Applies heuristics and ML scoring to keep good text.
    • Storage: Saves cleaned documents with metadata for later use.

    Important Parameters and Meanings:

    • User-Agent: Identifies your crawler; being explicit increases trust and helps site admins.
    • robots.txt rules: Define allowed/disallowed paths; crawlers must check before fetching.
    • Delay/Rate Limit: Controls the pace of requests (e.g., 1 request/sec) to avoid overload.
    • Timeouts/Retries: Prevents stalls and handles transient errors gracefully.
    • Heuristic Thresholds: Punctuation rules, offensive word lists, and JavaScript detection flags.
    • Perplexity Threshold: A cutoff (e.g., >100) to drop high-surprise text; should be tuned.

    Execution Order and Flow:

    1. Initialize sources and seed URLs.
    2. Load robots.txt and configure allowed paths and delays.
    3. Fetch a page, check HTTP status, and respect backoff upon errors.
    4. Parse HTML, remove scripts/styles/nav, keep main content nodes.
    5. Run heuristics: drop non-sentence lines, filter offensive words.
    6. Score remaining text with a language model for perplexity; drop high-perplexity content.
    7. For domain tasks, run keyword or topic filters; keep relevant documents.
    8. Store cleaned documents with source metadata and timestamps.

    Tools/Libraries Used

    • requests: Why chosen: simplicity and readability; Basic usage: requests.get(url, headers=...).
    • urllib: Why chosen: built-in, no extra install; Basic usage: urllib.request.urlopen(url).
    • Beautiful Soup: Why chosen: easy HTML parsing; Basic usage: BeautifulSoup(html, 'html.parser').find_all('p').
    • Scrapy: Why chosen: scalable crawling framework; Basic usage: define a Spider class with start_urls and parse() to follow links.
    • APIs (Wikipedia, NYT): Why chosen: structured content and metadata; Basic usage: send GET requests with API keys and query parameters (note: many APIs now require payment).

    Step-by-Step Implementation Guide (High-Level)

    Step 1: Decide on Sources

    • Goal: Choose books, web, and/or human feedback depending on your model’s purpose.
    • Action: For a general model, prefer broad web corpora (Common Crawl/C4) plus some books; for instruction-following, add human feedback datasets.

    Step 2: Choose Acquisition Method

    • If a good dataset exists (e.g., C4), download it.
    • If you need specific sites (e.g., Wikipedia), use their API.
    • If your domain is niche and lacks datasets, plan a polite crawler.

    Step 3: Set Up Polite Crawling (if needed)

    • Check site robots.txt (e.g., https://example.com/robots.txt) for allowed paths and crawl delays.
    • Set a descriptive User-Agent to identify your bot.
    • Configure rate limits (e.g., 1 request/sec) and exponential backoff on errors.
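The robots.txt check above can be sketched with Python's standard-library `urllib.robotparser`; the robots.txt content and the bot name `YourBot/1.0` here are hypothetical stand-ins (in practice you would fetch the real file from the site first):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch it from
# https://example.com/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check which paths our (hypothetical) bot may fetch, and how fast.
print(rp.can_fetch("YourBot/1.0", "https://example.com/articles/1"))  # True
print(rp.can_fetch("YourBot/1.0", "https://example.com/private/x"))   # False
print(rp.crawl_delay("YourBot/1.0"))  # 1
```

The parsed rules then drive both the fetcher (skip disallowed paths) and the rate limiter (honor the crawl delay).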

    Step 4: Fetch Pages

    • Use requests.get(url, headers={'User-Agent': 'YourBot/1.0'}) with timeouts.
    • Handle 200 OK responses; skip 4xx/5xx errors; respect retry-after headers.
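A minimal retry-with-backoff sketch, written around an injected `fetch` callable so it runs without network access; in practice `fetch` would be something like `requests.get(url, headers={'User-Agent': 'YourBot/1.0'}, timeout=10)`. The function name and defaults are illustrative, not from the lecture:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=0.01):
    """Call fetch(url); on a transient error, wait and retry with
    exponentially growing delays (backoff, 2*backoff, 4*backoff, ...)."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# Usage with a fake fetcher that fails twice, then succeeds:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient error")
    return "<html>ok</html>"

print(fetch_with_retry(flaky, "https://example.com"))  # <html>ok</html>
```

Passing the fetcher in as a parameter keeps politeness logic (retries, pacing) separate from the HTTP library you choose.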

    Step 5: Parse and Extract Text

    • Remove <script>, <style>, nav bars, headers, and footers using Beautiful Soup.
    • Extract main content (e.g., paragraphs); join lines into sentences.
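A dependency-free sketch of the extraction step using only the standard library's `html.parser` (Beautiful Soup expresses the same idea in less code); the sample HTML is made up:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, ignoring <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.skip = 0          # depth inside script/style tags
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "p":
            self.in_p = True
            self.paragraphs.append("")
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p and not self.skip:
            self.paragraphs[-1] += data

page = """<html><head><style>p {color:red}</style></head>
<body><nav>Home | About</nav>
<p>First paragraph.</p><script>var x = 1;</script>
<p>Second paragraph.</p></body></html>"""

parser = ParagraphExtractor()
parser.feed(page)
print(parser.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

Note how the nav bar and the script body never reach the output: only data inside `<p>` tags, outside any `<script>`/`<style>` block, is kept.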

    Step 6: Heuristic Filtering

    • Keep lines ending in ., ?, or !
    • Filter pages containing offensive-word lists.
    • Drop pages with embedded JavaScript code blocks.
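The three heuristics above can be sketched as follows; the blocklist entries and the JavaScript markers are placeholders, not the actual C4 lists:

```python
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder offensive-word list

def keep_line(line):
    """C4-style line heuristic: keep lines that end like sentences."""
    return line.rstrip().endswith((".", "?", "!"))

def keep_page(text):
    """Drop pages with blocklisted words or obvious JavaScript."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & BLOCKLIST:
        return False
    if "javascript" in text.lower() or "function(" in text:
        return False
    return True

page = "Welcome!\nMenu Home About\nThis is a full sentence.\nRead more"
clean = [l for l in page.splitlines() if keep_line(l)]
print(clean)  # ['Welcome!', 'This is a full sentence.']
print(keep_page("Please enable JavaScript to continue."))  # False
```

Rules like these are cheap enough to run over billions of pages, which is exactly why C4 uses them as a first pass.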

    Step 7: LM-Based Filtering (Perplexity)

    • Score each document with a language model trained on high-quality text.
    • Normalize for length (e.g., per word) to compare documents fairly.
    • Set a perplexity threshold; drop documents above threshold; tune as needed.
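A toy illustration of length-normalized perplexity filtering, assuming a simple add-one-smoothed unigram model in place of the real filter LM (which would be a full language model trained on high-quality text):

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Fit a unigram model with add-one smoothing on 'high-quality' text."""
    tokens = corpus.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def perplexity(prob, text):
    """Length-normalized perplexity: exp of mean negative log-probability."""
    tokens = text.lower().split()
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

prob = train_unigram("the quick brown fox jumps over the lazy dog " * 50)
good = perplexity(prob, "the quick brown fox")
junk = perplexity(prob, "asdf jkl qwerty")
print(good < junk)  # True: familiar text scores lower
```

A real pipeline would keep documents whose perplexity falls below a tuned cutoff; the cutoff itself depends on the filter model's domain and vocabulary, so treat any fixed number as a starting point only.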

    Step 8: Safety Filtering

    • Run models or rules to detect hate speech and PII.
    • Remove or redact sensitive content.
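A minimal sketch of rule-based redaction, assuming simple regex patterns for phone numbers and email addresses; real PII detection needs much broader coverage (names, addresses, IDs) and usually an ML classifier on top:

```python
import re

# Illustrative patterns only; they cover one common phone format
# and typical email addresses, nothing more.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text):
    """Replace detected PII spans with placeholder tokens."""
    text = PHONE.sub("[PHONE]", text)
    text = EMAIL.sub("[EMAIL]", text)
    return text

print(redact("Call 555-123-4567 or mail jane@example.com today."))
# Call [PHONE] or mail [EMAIL] today.
```

Redaction (replacing the span) keeps the surrounding text usable; dropping the whole page is the safer choice when sensitive content dominates.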

    Step 9: Relevance Filtering

    • Use keywords for quick domain checks.
    • Optionally apply a topic model trained on domain text to keep on-topic content.
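The keyword check can be sketched as a coverage score; the keyword set mirrors the medical example from the lecture, and the 0.5 threshold is an arbitrary choice you would tune:

```python
MEDICAL_KEYWORDS = {"diagnosis", "treatment", "clinical", "symptoms"}

def relevance_score(text, keywords=MEDICAL_KEYWORDS):
    """Fraction of domain keywords present: a cheap relevance proxy."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)

docs = [
    "Early diagnosis and treatment improve clinical outcomes.",
    "The team won the championship game last night.",
]
kept = [d for d in docs if relevance_score(d) >= 0.5]
print(kept)  # only the medical document survives
```

A topic model generalizes this idea: instead of exact word hits, it keeps documents whose topic distribution resembles the domain, catching relevant text that never uses the listed keywords.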

    Step 10: Store Clean Data

    • Save cleaned text with metadata (URL, timestamp, language) in a datastore.
    • Maintain logs of what was filtered and why, for auditing and tuning.
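A sketch of the storage step, assuming JSON Lines as the format (one document per line with its metadata); the field names here are illustrative, not a standard schema:

```python
import json, os, tempfile
from datetime import datetime, timezone

def store(docs, path):
    """Append cleaned documents as JSON Lines with source metadata."""
    with open(path, "a", encoding="utf-8") as f:
        for doc in docs:
            record = {
                "text": doc["text"],
                "url": doc["url"],
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "language": doc.get("language", "en"),
            }
            f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
store([{"text": "A clean sentence.", "url": "https://example.com/a"}], path)
with open(path, encoding="utf-8") as f:
    print(json.loads(f.readline())["url"])  # https://example.com/a
```

Append-only JSONL streams well at scale and keeps each document's provenance (URL, timestamp, language) next to its text for later auditing.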

    Tips and Warnings

    • Respect Robots and Rates: Always obey robots.txt and keep rates low to avoid blocking or harm.
    • Expect Noise: Even “clean” datasets like C4 need extra filtering; boilerplate sneaks through.
    • Tune Thresholds: Perplexity cutoffs are not one-size-fits-all; calibrate on a held-out sample.
    • Domain Matters: Use a domain-appropriate language model for filtering or you’ll drop good content.
    • Vocabulary Size: Filter models with small vocabularies inflate perplexity; prefer larger vocabularies.
    • Bias Awareness: Filter models can encode biases; monitor for systematic removal of certain language styles.
    • API Costs and Limits: Many APIs are now paid; budget for access or switch strategies.
    • Logging and Auditing: Keep records of filter reasons to debug over- or under-filtering.
    • Storage Planning: Large-scale data (Common Crawl) cannot fit on laptops; plan cloud storage and compute.
    • Future Preparation: After filtering, you’ll still need tokenization, normalization, and vocabulary setup to feed the model.

    Perplexity Details (Intuition-Oriented)

    • Definition: Perplexity measures how well a model predicts a text sequence; lower is better.
    • Analogy: A familiarity score—if the text looks like the training data, perplexity is low.
    • Technical: Perplexity is the exponentiated average negative log-probability of the tokens under the model (equivalently, the inverse sequence probability normalized per token); length normalization lets you compare documents of different sizes fairly. In practice, set a threshold (e.g., >100) to flag likely junk.
    • Importance: It spots subtle nonsense that simple rules miss. Example: “Asdf jkl qwerty” has high perplexity because it doesn’t match learned English patterns.

    Heuristics (From C4) and Why They Help

    • Sentence-Ending Punctuation: Keeps lines likely to be full sentences, improving grammaticality.
    • Offensive Word Filtering: Removes potentially harmful content at scale, though it can miss coded language or create false positives.
    • JavaScript Removal: Avoids code-heavy pages where language isn’t the main content.
    • Caveat: Heuristics are blunt tools; combine them with ML scoring for best results.

    Dataset Landscape Overview

    • Common Crawl: Massive, free, monthly web snapshots; needs heavy cleaning.
    • C4: Cleaned with rules; still large and imperfect; good general baseline.
    • MC4: Multilingual C4; spans ≈100 languages; suitable for multi-language models.
    • RealNews: Focused news articles; good for factual and formal style.
    • WebText: GPT-2’s pretraining corpus; web pages linked from Reddit posts with at least 3 karma, using the links as a quality signal.
    • Pushshift.io (Reddit): Conversational style, slang, and informal dialogue.
    • CCNet: Another cleaned Common Crawl pipeline; alternative to C4.
    • Twitter: Harder due to paid API; plan alternatives.

    Human Feedback Datasets

    • MT Bench: A multi-turn benchmark of open-ended prompts used to evaluate instruction-following and chat quality.
    • Alpaca Farm: A framework built around instruction-following data for studying how models learn from feedback.
    • Use: Add to pretraining or fine-tuning to improve helpfulness and compliance.
    • Cost: Requires paying annotators or leveraging previous curated resources.

    Putting It All Together

    • Start with broad corpora (C4/CCNet) for coverage.
    • Add domain-specific sources (news for current events, forums for conversational tone).
    • Layer in human feedback for instruction-following behavior.
    • Apply multi-stage filtering: heuristics → perplexity → safety → relevance.
    • Audit for bias and adjust filter models and thresholds.
    • Prepare data next (tokenize, normalize, build vocabulary) for training.

    Core Message

    • Data is the bread and butter of language modeling. Models are extremely sensitive to what they see. The right mix of sources, careful acquisition, and rigorous filtering are essential. Without this, even the best model architecture underperforms.

    04 Examples

    • 💡

      Books vs. Web Trade-off: Input: Choose between Books3 and Common Crawl for pretraining. Processing: Evaluate quality (books are polished), freshness and diversity (web), and noise levels (web is messy). Output: A combined plan: use books for clean structure and web for breadth and recency. Key Point: Blending sources balances quality with coverage.

    • 💡

      Common Crawl Raw HTML Extraction: Input: A raw Common Crawl HTML page with tags, CSS, and JavaScript. Processing: Use a parser to remove scripts/styles and boilerplate, then extract main text paragraphs. Output: Clean sentences ready for downstream filtering. Key Point: You must strip format noise before training.

    • 💡

      C4 Heuristic Filtering: Input: A batch of web pages. Processing: Keep only lines ending with ., ?, or !; remove pages with offensive words; drop pages heavy in JavaScript. Output: A cleaned subset similar to C4’s approach. Key Point: Simple rules at scale remove lots of junk fast.

    • 💡

      Perplexity Scoring Demo: Input: Two sentences—“The quick brown fox jumps over the lazy dog” and “Asdf jkl qwerty.” Processing: Score with a language model trained on English; normalize for length. Output: Low perplexity for the first, high for the second; keep the first, drop the second. Key Point: Perplexity separates real language from gibberish.

    • 💡

      Safety Filtering for PII: Input: A web page listing someone’s phone number and address. Processing: Run a PII detector or regular expressions to find sensitive data. Output: Remove or redact the page from the dataset. Key Point: Protect users and meet safety standards.

    • 💡

      Domain Relevance with Keywords: Input: Mixed-topic articles for a medical model. Processing: Keep documents with domain keywords like “diagnosis,” “treatment,” “clinical,” and “symptoms.” Output: A focused corpus of medical text. Key Point: Simple keyword checks keep your dataset on-topic.

    • 💡

      Topic Model Relevance Check: Input: A large mixed corpus with varied subjects. Processing: Train a topic model on medical text, then keep documents whose topic distribution matches medical themes. Output: Retained documents that are relevant even without exact keywords. Key Point: Topic models find relevance beyond simple word matches.

    • 💡

      API vs. Crawling Decision: Input: Need Wikipedia articles. Processing: Compare Wikipedia API (structured, reliable) to crawling (more work, less structure). Output: Choose the API for faster, cleaner access. Key Point: APIs reduce effort when available and affordable.

    • 💡

      Polite Crawling Practice: Input: Plan to crawl a news site. Processing: Check robots.txt for allowed paths and crawl delays; set a descriptive User-Agent; throttle requests. Output: A stable crawl that avoids being blocked or harming the site. Key Point: Politeness ensures sustainable data collection.

    • 💡

      Filtering C4 Residual Boilerplate: Input: A C4 page snippet containing “Skip to content. Breaking news …”. Processing: Apply additional rules to remove boilerplate phrases and banners. Output: Cleaner, article-only text. Key Point: Even cleaned datasets need extra passes.

    • 💡

      Choosing Reddit for Conversational Tone: Input: You want a chatty, informal model voice. Processing: Add Pushshift.io (Reddit comments) while filtering for quality and safety. Output: A model that speaks more casually and understands slang. Key Point: Source choice shapes the model’s tone.

    • 💡

      Adapting to API Paywalls: Input: Need social media data but Twitter now charges for API access. Processing: Evaluate budget and alternatives, like Reddit or curated datasets. Output: Switch sources to keep costs reasonable. Key Point: Data strategies must adapt to changing access rules.

    • 💡

      Bias Check in LM Filtering: Input: A filter model trained on imbalanced data. Processing: Observe that texts from certain groups get higher perplexity. Output: Adjust filter model or thresholds to reduce unfair drops. Key Point: LM-based filtering can silently encode bias.

    • 💡

      Combining Heuristics and LM Scoring: Input: A large web crawl with mixed quality. Processing: First, apply punctuation and blocklist rules; then use perplexity to catch subtle junk. Output: Smaller, higher-quality dataset. Key Point: Multi-stage filtering is stronger than any single method.

    05 Conclusion

    This session centered on the most decisive ingredient in language modeling: data. You learned where data comes from—books, the web, and human feedback—and the strengths and weaknesses of each. You explored major datasets like Common Crawl (massive and messy), C4/MC4 (rule-cleaned, large, still imperfect), and domain-focused sources like RealNews, WebText, and Reddit comments. You also saw how to get data via downloads, APIs, or your own crawlers, and how to be a polite and responsible crawler by obeying robots.txt, rate limiting, and identifying your bot.

    The second major theme was filtering: quality, safety, and relevance. Heuristics such as punctuation checks and offensive-word filters can quickly remove obvious junk, while language-model scoring using perplexity can catch subtler nonsense. You learned the practical subtleties of perplexity—length normalization, vocabulary size effects, and domain matching—to avoid throwing away useful text. Safety filtering protects against harmful content and personal data leaks, and relevance filtering, using keywords or topic models, keeps your dataset focused on the tasks you care about. A key warning is that language models used for filtering can introduce bias if they were trained on biased data, so you must monitor and adjust.

    Practically, the process looks like this: select sources, acquire data (download/API/crawl), extract text, run multi-stage filters, and store the clean corpus for later preparation. Each choice influences model behavior, tone, and safety. Next, you will move into preparing data—tokenization, normalization, and building vocabularies—so the cleaned corpus can feed into training smoothly. The core message to remember is simple but powerful: the quality and fit of your data largely determine your model’s success. Treat data selection and filtering as first-class engineering tasks, and your models will repay that care with better performance and safer behavior.

  • ✓Watch out for vocabulary effects. Filter models with tiny vocabularies can inflate perplexity for normal words. Prefer models with larger vocabularies for scoring. This reduces false positives.
  • ✓Protect safety by filtering hate and PII. Use both blocklists and ML classifiers. Redact or remove sensitive content to avoid harm. This step is non-negotiable for deployed systems.
  • ✓Focus your corpus with relevance filters. Start with keywords, then consider topic models for broader theme matching. Keep only what supports your application’s domain. This makes your model an expert, not a generalist.
  • ✓Audit for bias in LM-based filtering. Check if certain groups’ texts are disproportionately dropped. Adjust models, data, or thresholds to reduce unfairness. Transparency in filtering decisions helps trust and quality.
  • ✓Plan for changing data access. APIs can become paid or limited suddenly. Maintain alternative sources or strategies (other datasets, crawling) to stay resilient. Budget time and money for access.
  • ✓Log every filtering decision. Record what was removed and why (heuristic hit, high perplexity, safety flag). These logs help you debug over-filtering or leaks. They also support compliance and reproducibility.
  • ✓Iterate your pipeline. Evaluate early samples, adjust thresholds, and re-run filters. Small tuning steps can massively improve final data quality. Treat data curation as continuous, not one-and-done.
  • ✓Keep storage and compute constraints in mind. Massive datasets like Common Crawl won’t fit on a laptop. Use cloud storage and compute plans that scale. Avoid bottlenecks with batching and streaming.
  • ✓Blend sources to shape tone and capability. Add Reddit for conversational style, news for factual writing, and books for structure. The mix teaches the model how to speak and what to know. Choose intentionally.
  • ✓Prepare next: tokenize and normalize. After filtering, ensure text is standardized and split into tokens. Good prep reduces training issues and improves stability. Don’t skip these final steps before training.
  • Dataset

    A dataset is a collection of data used for training or evaluation. For language models, this usually means lots of text. Different datasets have different styles, topics, and quality levels. Choosing the right dataset is crucial for good results.

    Common Crawl

    Common Crawl is a nonprofit project that crawls a large part of the web every month and releases the data for free. It includes raw HTML for many webpages. It’s huge, covering petabytes, and very diverse. But it’s messy and needs lots of cleaning.

    C4 (Colossal Clean Crawled Corpus)

    C4 is a cleaned version of Common Crawl created by applying simple rules to remove low-quality text. It keeps lines ending with punctuation and filters pages with offensive words and JavaScript. It is still large (around 300 GB) and not perfect. It gives a better starting point than raw web data.

    MC4

    MC4 is a multilingual version of C4. It includes many languages, roughly about 100. It uses similar cleaning ideas but across different languages. It helps build multilingual models.
