Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL 1 | How I Study AI
📚 Stanford CS336: Language Modeling from Scratch — Lecture 16 of 17
Intermediate
Stanford Online
Tags: RLHF, YouTube

Key Summary

  • This session introduces alignment for language models and why next‑token prediction alone is not enough. When models only learn to guess the next word, they can hallucinate facts, produce toxic or biased text, and follow tricky prompts the wrong way. Alignment aims to make models helpful, honest, and harmless so they do what people actually want. The lecture lays out a practical recipe to achieve this with RLHF (Reinforcement Learning from Human Feedback).
  • RLHF is presented as a three‑step pipeline. First, collect prompts and human‑written (or edited) responses that show the behavior you want. Second, train a reward model that scores which response is better for a given prompt using pairwise preferences. Third, fine‑tune the language model with reinforcement learning so it generates responses the reward model prefers.
  • Data collection choices strongly shape outcomes. You decide which prompt types to include (questions, instructions, creative writing), where they come from (humans, models, or both), and how many you need. You also choose how responses are produced (fully human, model‑generated then edited, or mixtures) and how many alternatives per prompt. Diversity and quality control are key so the model learns broad, reliable behavior.
  • The reward model maps (prompt, response) → a scalar score that represents “how good” the response is. It is trained with a pairwise ranking loss that prefers the higher‑quality response among two candidates. The common loss is −log(sigmoid(R(P,Y1) − R(P,Y2))) when Y1 is labeled better than Y2. Minimizing this loss teaches the model to rank good responses above weaker ones.
  • Any model that can read text and output a single number can be a reward model, but a language model backbone is a common choice. You can reuse the base LM, a smaller LM, or even a different architecture. The key is that the reward model learns human preferences from labeled comparisons. Careful training and validation are needed because reward models can also make mistakes.

Why This Lecture Matters

Alignment with RLHF turns general language skill into trustworthy, goal‑directed behavior. This is crucial for anyone deploying models in real products—engineers, data scientists, product managers, and safety researchers—because customers care about accuracy, usefulness, and safety, not just fluency. Without alignment, models can hallucinate facts, echo toxic content, or be manipulated by tricky prompts, creating legal, ethical, and user‑experience risks. RLHF offers a concrete pipeline to encode human preferences and values into the training process: collect targeted prompts and responses, learn a reward signal from comparisons, and fine‑tune with robust RL (PPO) to optimize for what users prefer. In practice, this improves instruction following, reduces harmful outputs, and increases customer trust. This knowledge directly applies to building assistants, helpdesk bots, coding copilots, educational tutors, and domain‑specific experts (like medical or legal triage tools). Teams can stand up a focused RLHF system in weeks by curating domain prompts, gathering preference labels, and bootstrapping a reward model. It also opens career paths in applied AI safety, preference learning, and RL for language, where demand is growing rapidly as companies operationalize AI. Finally, alignment methods are a key differentiator in the industry: organizations that master data collection, reward modeling, and stable RL can deliver safer, more reliable AI systems that win user trust and meet regulatory expectations.

Lecture Summary


01 Overview

This lecture focuses on alignment for large language models: how to train models not just to predict the next word but to do what people actually want. The central motivation is that pure next‑token prediction on internet text yields fluent models that sometimes make up facts (hallucinations), can echo toxic or biased content, can be misled by adversarial prompts, and are often inconsistent at following instructions. To correct these shortcomings, we want models that are helpful (follow instructions and solve tasks), honest (factual and transparent about uncertainty), and harmless (avoid toxic, biased, and dangerous outputs). The practical method introduced is Reinforcement Learning from Human Feedback (RLHF), a three‑step approach that has become the standard path to alignment in practice.

The three steps are: (1) data collection of prompts and human responses that exhibit the desired behavior; (2) reward modeling, where a model is trained to predict which response is better for a given prompt from pairwise human preferences; and (3) reinforcement learning, where the language model is fine‑tuned to produce responses that the reward model scores highly. The reward model is a simple function that inputs a (prompt, response) pair and outputs a single scalar score that reflects quality. It is trained with a pairwise ranking loss so that better responses receive higher scores than weaker ones. For the reinforcement learning step, Proximal Policy Optimization (PPO) is introduced as a robust policy‑gradient algorithm that updates the model while constraining how much it can change at once, improving stability.

The lecture begins by defining the problem with just next‑token training and articulating alignment goals. It then delves into data collection: choosing prompt types (questions, instructions, creative writing), deciding sources (human‑written prompts, model‑generated prompts, or a mix), and ensuring diversity and representativeness. It also covers response generation trade‑offs: purely human responses, model‑generated then human‑edited responses, and how many candidate responses per prompt to create. A concrete example is given: building a CS Q&A assistant using questions from sources like Stack Overflow or Reddit with human‑written answers to seed preference data.

Next, it explains the reward model in detail. You can use a language model backbone or other architectures to score responses, but the common approach is to fine‑tune an LM to output a scalar score for each (prompt, response). The model is trained with a pairwise logistic loss: for two responses Y1 and Y2 to prompt P where Y1 is judged better, the loss −log(sigmoid(R(P,Y1) − R(P,Y2))) encourages the model to score Y1 higher than Y2. This naturally learns a ranking over responses that reflects human preferences, which becomes the reward signal for RL.
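The pairwise logistic loss is simple enough to sketch in a few lines of plain Python (a toy sketch: here the scores are bare numbers, whereas in practice R(P, Y) would come from an LM with a scalar head):

```python
import math

def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """-log(sigmoid(s1 - s2)); near zero when the preferred response scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_other))))

# Preferred response scores much higher -> loss is small
print(round(pairwise_loss(4.0, 1.0), 3))  # 0.049
# Ranking inverted -> loss is large, pushing the model to widen s1 - s2
print(round(pairwise_loss(1.0, 4.0), 3))  # 3.049
```

Note that only the score difference matters, which is why pairwise labels sidestep the calibration problems of absolute 1–10 ratings.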

Finally, it introduces reinforcement learning with PPO to fine‑tune the policy (the LM). In this framing, the agent is the LM generating a response token by token, the environment is the prompting task, and the reward comes from the learned reward model applied to the completed response. PPO maintains an old policy (used to collect data) and a new policy (being optimized), using a clipped objective that keeps the probability ratio of new vs. old policy close to 1, multiplied by an advantage estimate. This combination improves responses without taking destabilizing steps, leading to better instruction following and alignment.
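The clipped update can be illustrated per token (a minimal sketch; `eps` plays the role of ε, and the probability ratio and advantage are assumed to be precomputed scalars):

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One token's contribution to PPO's clipped surrogate objective (maximized)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min of the unclipped and clipped terms is pessimistic:
    # the policy gets no extra credit for drifting outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped * advantage)

# A large ratio is clipped, so the incentive to drift further vanishes:
print(round(ppo_clipped_term(1.5, 2.0), 3))  # 2.4 rather than 3.0
```

With a negative advantage the same min keeps the penalty unclipped in the harmful direction, which is what makes the objective a lower bound.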

The lecture also addresses a key question: since the reward model is a language model, can it hallucinate too? Yes, and that is a concern. Three mitigation strategies are described: train the reward model on factual data to reduce hallucinations, swap to a smaller or different model architecture that may hallucinate less, and use uncertainty estimation to down‑weight uncertain reward predictions. These techniques help prevent the fine‑tuned policy from overfitting to reward model errors.

The session targets learners who already understand basic language modeling and are ready to move into aligning models with human values and tasks. You should be comfortable with concepts like tokens, likelihood, and gradient‑based training. After completing this lecture, you will be able to describe the motivation for alignment, explain the RLHF pipeline end to end, define and train a reward model using pairwise preferences, and understand how PPO fine‑tunes a language model toward human‑preferred behavior. The lecture closes by previewing next time’s focus: limitations of RLHF and alternative alignment methods.

Key Takeaways

  • ✓Start with clear goals: define helpful, honest, and harmless for your domain. Write a simple rubric so labelers agree on what “better” means. Include accuracy, relevance, clarity, safety, and calibration of uncertainty. A shared rubric improves preference data quality and downstream results.
  • ✓Curate diverse prompts that reflect real use. Mix instructions, Q&A, reasoning, safety‑sensitive cases, and creative tasks. Tag prompts by type and difficulty to diagnose performance later. More diversity reduces overfitting to narrow behaviors.
  • ✓Collect multiple responses per prompt to fuel pairwise comparisons. Combine human‑written answers with model‑generated variants at different temperatures. This makes the reward model robust across styles and qualities. More contrast improves learning of preferences.
  • ✓Train a reward model with pairwise ranking instead of absolute scores. It’s easier for annotators and yields more consistent signals. Use the −log(sigmoid(score_diff)) loss and monitor pairwise accuracy on a held‑out set. Keep inputs formatted consistently with clear separators.
  • ✓Mitigate reward model hallucination from the start. Add factual calibration tasks, consider a smaller or different architecture, and estimate uncertainty with ensembles or dropout. Down‑weight uncertain scores during PPO. Revisit and retrain the reward model with hard negative examples periodically.
  • ✓Use SFT to create a strong starting policy. Fine‑tune on the best human responses before RL. SFT stabilizes PPO and reduces the number of RL updates required. This also anchors behavior for KL control.

Glossary

Alignment

Making a language model do what people actually want instead of only predicting the next word. It focuses on being helpful, honest, and harmless. Alignment adds goals and rules on top of raw language skill. It turns vague human wishes into specific training signals. Without it, models can sound smart but act in ways we don’t want.

RLHF (Reinforcement Learning from Human Feedback)

A method to align models using human preferences. People compare responses to the same prompt and pick the better one. A reward model learns from these comparisons to score responses. Then the language model is trained with RL to maximize that score.

Prompt

The input text you give a language model to start a task. It can be a question, an instruction, or a creative request. Good prompts clearly state what is wanted. They set the stage for the model’s response.

Response

The text the language model generates after reading a prompt. Responses can be short answers, explanations, or creative writing. In alignment, we judge responses by quality and safety. Better responses follow instructions and avoid harm.

Reward model

A model that reads a (prompt, response) pair and outputs a single scalar score for how good the response is. It is trained from pairwise human preferences so that preferred responses score higher. Its score becomes the reward signal during RL fine‑tuning.
  • Reinforcement learning turns the language model into an agent that generates responses (actions) in an environment (the prompting task) to maximize reward (the reward model’s score). PPO (Proximal Policy Optimization) is a standard algorithm to do this safely and steadily. PPO uses an “old” policy and a “new” policy and constrains updates so the new policy doesn’t drift too far. This stability prevents performance from collapsing while still improving reward.
  • PPO’s clipped objective centers on the probability ratio between new and old policy for the sampled actions. It multiplies this ratio by the advantage (how much better an action was than usual) and then clips the ratio within [1−ε, 1+ε] to avoid big, risky updates. This encourages helpful changes while discouraging drastic ones. In practice, this reliably improves instruction following.
  • The lecture addresses the concern that reward models can hallucinate too. Three remedies are highlighted: (1) train the reward model on factual datasets to sharpen accuracy, (2) use a smaller or different architecture that may hallucinate less, and (3) estimate uncertainty and down‑weight uncertain scores. These steps help align the policy without overfitting to reward model errors.
  • A concrete example illustrates dataset design: to build a CS Q&A assistant, gather questions from sources like Stack Overflow or Reddit. Have humans write high‑quality answers to these prompts. Use those pairs for preference labeling and reward modeling. Finally, run PPO to fine‑tune the LM toward those preferred answers.
  • Alignment goals are summarized as helpfulness, honesty, and harmlessness. Helpfulness means following instructions and providing useful content. Honesty means avoiding hallucinations and being truthful about uncertainty. Harmlessness means staying away from toxic, biased, or dangerous content.
  • The session emphasizes that data decisions are strategic. Choosing prompt domains, balancing human vs. model‑generated content, and collecting multiple responses per prompt all influence what the final model learns. More diverse and representative prompts lead to better generalization. Clear quality standards for responses raise the ceiling on alignment.
  • The overall message: next‑token training builds a strong foundation, but alignment aims the model at human goals. RLHF operationalizes “what we want” through human preferences and optimization. PPO provides a practical, stable way to close the loop from reward to policy. Next steps will cover RLHF’s limitations and alternative approaches.
02 Key Concepts

    • 01

      🎯 Alignment: Training LMs to do what people actually want. 🏠 It’s like teaching a smart parrot not only to repeat words but to be a helpful assistant who answers questions safely and honestly. 🔧 Technically, alignment adjusts a pretrained LM so its outputs optimize for human‑defined objectives, not just next‑token probability. 💡 Without alignment, models may hallucinate, be toxic or biased, and follow adversarial prompts. 📝 A practical target is the trio of helpful, honest, and harmless (HHH).

    • 02

      🎯 Shortcomings of next‑token prediction: fluent but unreliable behavior. 🏠 It’s like a student who memorized lots of books but isn’t graded on truthfulness, safety, or instructions. 🔧 Maximizing likelihood over web text learns patterns, not task goals or values; it can amplify dataset biases. 💡 This leads to hallucinations, toxicity, and weak instruction following. 📝 Fixing these requires adding objectives beyond next‑token prediction.

    • 03

      🎯 RLHF: Reinforcement Learning from Human Feedback. 🏠 It’s like a coach who watches two attempts and points to the better one so the player learns what “better” means. 🔧 Pipeline: collect prompts and responses, train a reward model from pairwise preferences, then RL‑fine‑tune the LM to maximize that reward. 💡 RLHF translates fuzzy human preferences into a trainable signal. 📝 It is widely used to make assistant‑style models.

    • 04

      🎯 Data collection for RLHF. 🏠 It’s like building a good test: choose varied questions, write reference answers, and include multiple choices to compare. 🔧 Decide prompt types (Q&A, instructions, creative), sources (human vs. model), volume, diversity, and response quality. 💡 Poor or narrow data narrows what the model learns to do. 📝 For a CS helper, gather Stack Overflow questions and craft high‑quality answers.

    • 05

      🎯 Reward model: scoring response quality. 🏠 It’s a judge that reads the prompt and response and gives a single score. 🔧 Function R(P,Y) → scalar; higher means better according to learned human preferences. 💡 This turns human taste into a numerical reward for RL. 📝 It can be an LM head fine‑tuned for scalar outputs.

    • 06

      🎯 Pairwise ranking loss. 🏠 It’s like saying “response A beats response B” and training the judge to agree. 🔧 With preferred Y1 over Y2 for prompt P, optimize −log(sigmoid(R(P,Y1) − R(P,Y2))). 💡 This directly teaches relative quality, which is easier to label than absolute scores. 📝 Over many pairs, the reward model learns a consistent ranking.

    • 07

      🎯 PPO (Proximal Policy Optimization). 🏠 Think of it as a safety harness that lets you improve steadily without jumping too far at once. 🔧 It uses the ratio of new vs. old policy probabilities, multiplies by advantage, and clips the ratio within [1−ε, 1+ε]. 💡 Clipping prevents destabilizing updates and collapse. 📝 PPO is a common choice for RLHF fine‑tuning.

    • 08

      🎯 Policy, environment, reward mapping for LMs. 🏠 It’s like a game where a player writes a reply, the world gives a score, and the player tries to improve. 🔧 Policy = LM generating tokens; environment = prompting task; reward = reward model’s score of the completed response. 💡 This framing lets us apply RL to text generation. 📝 Each sampled response is a trajectory with a final reward.

    • 09

      🎯 Advantage: how much better an action is than usual. 🏠 It’s a bonus compared to the normal outcome, like getting extra points for an above‑average move. 🔧 Estimated from returns minus a baseline (often a value function) to reduce variance. 💡 Using advantage stabilizes learning. 📝 PPO multiplies the probability ratio by the advantage to guide updates.

    • 10

      🎯 Prompt and response design choices. 🏠 Like writing assignments: the type of question changes what skills get practiced. 🔧 Include instructions, reasoning questions, safety‑sensitive prompts, and creative tasks; collect multiple responses per prompt to support pairwise comparisons. 💡 Variety improves generalization and safety. 📝 Human editing of model drafts can speed data creation while preserving quality.

    • 11

      🎯 Hallucination and toxicity risks. 🏠 Like a well‑spoken friend who sometimes confidently says the wrong thing or repeats rude phrases heard online. 🔧 Pretraining data contains errors and bias; next‑token objective doesn’t penalize falsehoods or harm. 💡 Alignment must explicitly discourage unsafe and untrue outputs. 📝 HHH (helpful, honest, harmless) captures the goals.

    • 12

      🎯 Reward model hallucination concern. 🏠 A judge can be fooled too if their sources are shaky. 🔧 Mitigate by training on factual data, using smaller/different architectures, and estimating uncertainty to down‑weight shaky scores. 💡 This prevents the policy from over‑optimizing flawed signals. 📝 Calibrated reward models yield better aligned policies.

    • 13

      🎯 Old vs. new policy in PPO. 🏠 Like practicing a routine: compare today’s moves to yesterday’s version and don’t change too abruptly. 🔧 Data is collected with the old policy; the new policy is trained using a clipped objective relative to the old. 💡 This on‑policy setup reduces distribution drift. 📝 Periodically update the old policy to the new one and repeat.

    • 14

      🎯 Preference labels over absolute scores. 🏠 Choosing a better essay is easier than giving it a 1–10 score. 🔧 Pairwise labels reduce rater disagreement and calibration issues. 💡 Leads to higher‑quality reward models with less annotation friction. 📝 Use multiple pairs per prompt for robust learning.

    • 15

      🎯 Response count per prompt. 🏠 More choices give a clearer sense of what “better” means. 🔧 Sampling multiple candidate responses supports richer pairwise comparisons. 💡 Improves ranking quality and reward model generalization. 📝 Combine human‑written and model‑generated alternatives.

    • 16

      🎯 Safety via constrained updates. 🏠 Guardrails stop the car from swerving off the road while still moving forward. 🔧 PPO’s clipping and optional KL penalties keep the policy near its prior behavior. 💡 This reduces mode collapse and reward hacking. 📝 Monitor KL and adjust learning schedules accordingly.

    • 17

      🎯 Uncertainty estimation for rewards. 🏠 If the judge is unsure, treat their score with caution. 🔧 Use ensembles, Monte Carlo dropout, or calibrated probabilities to estimate uncertainty and weight rewards. 💡 This reduces overfitting to noisy judgments. 📝 Down‑weight or filter high‑uncertainty training samples.

    • 18

      🎯 Choosing model sizes for reward modeling. 🏠 A smaller critic can be stricter and less imaginative, which may help factual judgment. 🔧 Smaller LMs or non‑LM classifiers can reduce hallucinations in scoring. 💡 Trade‑off: smaller models may miss nuance; larger ones may hallucinate. 📝 Validate with held‑out preference tests.

    • 19

      🎯 Concrete CS Q&A pipeline. 🏠 Like building a helpful forum moderator. 🔧 Gather CS questions (e.g., Stack Overflow), craft high‑quality answers, collect pairwise preferences, train the reward model, and run PPO fine‑tuning. 💡 Results in a model that answers CS questions more reliably. 📝 This example shows end‑to‑end RLHF in a focused domain.

    • 20

      🎯 Big picture: from prediction to preference‑aligned behavior. 🏠 Turning a good mimic into a good teammate. 🔧 RLHF injects human goals as a reward and uses stable RL to optimize for them. 💡 This addresses hallucinations, toxicity, and instruction following. 📝 Later sessions consider RLHF limitations and alternatives.

    03 Technical Details

    Overall architecture/structure

    1. Pretraining baseline
    • Start with a pretrained language model (LM) that predicts the next token given context. This model is fluent but not aligned: it can output untrue, unsafe, or unhelpful content because it has not been optimized for human intentions.
    2. Supervised fine‑tuning (SFT) on instructions (implicit in step 1 of RLHF)
    • The lecture’s first RLHF step—collect prompts with human responses—doubles as SFT data. Fine‑tune the base LM to mimic high‑quality human answers to prompts. This creates an instruction‑following policy π_SFT that is usually safer and more compliant than the raw pretrained model. SFT provides a starting point that makes subsequent RL more stable and sample‑efficient.
    3. Reward modeling (preference learning)
    • Build a dataset of tuples (P, Y_a, Y_b, label), where P is a prompt, Y_a and Y_b are two candidate responses to P, and label indicates which is preferred by humans (e.g., Y_a > Y_b). The reward model R(P, Y) outputs a scalar score for any (P, Y). Training minimizes a pairwise logistic loss so that R(P, Y_pref) > R(P, Y_nonpref).
    4. Reinforcement learning optimization (PPO)
    • Treat the LM as a stochastic policy π_θ(Y | P) that generates a response Y token by token. For each sampled response, compute a reward r = R(P, Y). Then update the policy with PPO using an advantage estimate that encourages responses with higher‑than‑expected reward. PPO uses the probability ratio between the new policy and the old policy to constrain updates.

    Data flow

    • Prompts are sampled from a curated pool. The current policy (often π_SFT initially) generates K candidate responses per prompt. Humans (or a trained preference model in a bootstrapping loop) rank pairs of responses. The reward model R is trained on these pairwise rankings. During PPO, the policy generates responses, is scored by R, and parameters are updated using the PPO objective. This loop repeats: periodically refresh rollouts, update R if new preference data arrives, and continue policy optimization.

    Reward model training in detail

    • Input representation: Concatenate prompt and response with clear separators (e.g., “<|prompt|> ... <|response|> ...”). Optionally include system instructions to anchor desired style (e.g., “Be helpful, honest, and harmless.”). Tokenize with the LM’s tokenizer.
    • Architecture: Commonly a pretrained LM with its final hidden state pooled (e.g., via mean or using the last token) and projected to a scalar head. Alternative architectures include sequence classifiers or smaller encoders that process (P, Y) jointly.
    • Loss: For a labeled pair (P, Y1, Y2) where Y1 is preferred, define s1 = R(P, Y1), s2 = R(P, Y2). The loss is L = −log(sigmoid(s1 − s2)). Intuition: if s1 ≫ s2, sigmoid approaches 1 and loss approaches 0; if s1 < s2, sigmoid approaches 0 and loss grows large, pushing the model to increase s1 − s2.
    • Optimization: Use standard optimizers (AdamW), small learning rates, gradient clipping, and early stopping on a validation set of preference pairs. Regularization (weight decay, dropout) and data augmentation (prompt order, formatting variants) improve robustness.
    • Calibration and sanity checks: Evaluate pairwise accuracy (how often R ranks the human‑preferred response higher). Plot score distributions to check for collapse or saturation. Use held‑out prompts with carefully created gold preferences.
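As a toy illustration of this loss and optimizer loop, the sketch below trains a linear reward head by gradient descent on two hand‑made preference pairs. The 2‑d feature vectors are a stand‑in for an LM backbone's pooled hidden states, and the data, learning rate, and iteration count are all illustrative assumptions:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Each (prompt, response) is reduced to a toy 2-d feature vector; in practice
# these would come from the LM backbone's pooled final hidden state.
preference_pairs = [  # (features of preferred response, features of rejected one)
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
]

w = [0.0, 0.0]  # scalar head: R(x) = w . x
lr = 0.5
for _ in range(200):
    for x_pref, x_rej in preference_pairs:
        diff = sum(wi * (a - b) for wi, a, b in zip(w, x_pref, x_rej))
        grad = sigmoid(diff) - 1.0  # d/d(diff) of -log(sigmoid(diff))
        for i in range(len(w)):
            w[i] -= lr * grad * (x_pref[i] - x_rej[i])

# After training, preferred responses outscore rejected ones:
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert all(score(a) > score(b) for a, b in preference_pairs)
```

The same pairwise accuracy check used in the final assert is what you would monitor on a held‑out validation set of preference pairs.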

    Handling reward model hallucination

    • Accuracy training: Supplement the preference data with factual verification tasks (true/false statements, fact‑checking questions). Train multi‑task or staged: first factual calibration, then preference ranking.
    • Architecture choice: Smaller or more conservative models may hallucinate less. Try distilled LMs or encoders with limited generative capacity.
    • Uncertainty estimation: Use ensembling (multiple reward models, average scores and measure variance), Monte Carlo dropout (multiple stochastic forward passes), or temperature‑scaled logits to estimate confidence. Down‑weight high‑uncertainty scores during PPO, or filter samples exceeding an uncertainty threshold.
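A minimal sketch of the ensemble variant follows; the 1/(1 + std) weighting is an illustrative choice for down‑weighting, not a formula from the lecture:

```python
from statistics import mean, pstdev

def weighted_reward(ensemble_scores: list[float]) -> tuple[float, float]:
    """Average an ensemble's reward scores; the weight shrinks with disagreement.

    Returns (mean score, weight in (0, 1]); the weight can scale this sample's
    contribution during PPO, or gate it out below some threshold.
    """
    mu = mean(ensemble_scores)
    weight = 1.0 / (1.0 + pstdev(ensemble_scores))  # assumed weighting scheme
    return mu, weight

confident = weighted_reward([0.80, 0.82, 0.79])  # low variance -> weight near 1
uncertain = weighted_reward([0.10, 0.90, 0.50])  # high variance -> weight shrinks
assert confident[1] > uncertain[1]
```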

    PPO fine‑tuning in detail

    • Policy and old policy: Maintain θ_old (frozen for a short horizon) and θ (trainable). Generate rollouts using π_{θ_old}. Optimize θ on these rollouts for several epochs; then set θ_old ← θ and collect fresh rollouts.
    • Objective: For each tokenized action sequence, PPO defines the likelihood ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t). Compute an advantage estimate A_t for each action. The clipped surrogate objective for policy loss is: L_policy = − E_t [ min( r_t A_t, clip(r_t, 1 − ε, 1 + ε) A_t ) ]. This penalizes updates that push r_t outside the [1−ε, 1+ε] band, limiting drift.
    • Advantage estimation: Compute returns from the final reward (which is sequence‑level). A common approach is to assign the same sequence reward to all tokens, then subtract a learned baseline V(s_t) (value function) and optionally use Generalized Advantage Estimation (GAE) to reduce variance. Even with a terminal reward, GAE helps stabilize training by smoothing credit assignment.
    • Entropy bonus: Add −β H(π_θ(· | s_t)) to encourage exploration and avoid early over‑commitment to narrow behaviors.
    • KL control: Optionally add a KL penalty between π_θ and a reference policy (often π_SFT) to keep the policy close to the supervised model. This can be adaptive: adjust the KL coefficient to target a desired KL per update.
    • Value loss: Train a value function V_φ(s) to predict expected returns for baseline subtraction. The total loss typically combines policy loss, value loss (with coefficient c_v), and entropy bonus.
    • Batching and epochs: Split rollouts into minibatches; run several epochs of PPO updates per batch. Monitor ratio statistics, clipped fraction, KL divergence, and reward trends to tune ε, learning rate, and batch sizes.
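Under the sequence‑level‑reward assumption (the reward model's scalar arrives only at the final token), the GAE computation described above can be sketched as:

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t] is the per-step reward (zero everywhere except the final token,
    which carries the reward model's sequence-level score); values[t] is the
    value head's estimate V(s_t). gamma and lam are illustrative defaults.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0  # terminal V = 0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running  # exponentially weighted sum
        advantages[t] = running
    return advantages

# Terminal reward of 1.0 on a 3-token response with a rough value estimate:
print([round(a, 3) for a in gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.8])])
```

Even though the reward is terminal, the value baseline spreads credit backward through the TD errors, which is the smoothing effect the bullet above refers to.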

    Credit assignment and sequence‑level reward

    • The reward model usually gives a single scalar for the entire response, but PPO operates per token. A simple approach is to assign that scalar uniformly to each token’s step (with discount γ=1). To reduce bias, normalize rewards across a batch (z‑scoring) and use a value baseline. Some implementations add small shaping terms (e.g., per‑token style checks) but must avoid reward hacking.
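The batch z‑scoring step can be sketched as:

```python
def zscore(rewards: list[float]) -> list[float]:
    """Normalize a batch of sequence-level rewards to zero mean, unit variance."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    sd = var ** 0.5 or 1.0  # guard against a zero-variance batch
    return [(r - mu) / sd for r in rewards]

print(zscore([1.0, 2.0, 3.0]))  # symmetric around 0
```

Normalizing per batch keeps the advantage scale stable across PPO updates even as the raw reward distribution drifts during training.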

    Sampling strategies for candidates

    • During data collection: sample multiple responses per prompt with temperature or nucleus sampling to create variety for pairwise comparisons. Include human‑written baselines to anchor quality. For difficult prompts, ask experts to produce gold answers.
    • During PPO rollouts: control sampling to balance exploration (higher temperature) and training stability (enough diversity to learn but not too noisy). Trim or reject obviously unsafe outputs using simple filters during training data generation.

    Quality and safety controls

    • Prompt diversity: Ensure representation across domains (reasoning, coding, safety‑sensitive, creative) and difficulty. Track per‑domain reward to detect overfitting.
    • Harmlessness and honesty: Include safety prompts and factual checks in the data; incorporate refusal patterns for unsafe requests; reward appropriately calibrated uncertainty (e.g., “I don’t know” when appropriate).
    • Bias mitigation: Include prompts that stress fairness and neutral tone; penalize biased language in preferences.

    Implementation guide (step by step)

    • Step 1: Define objectives and scope. Specify what “helpful, honest, harmless” means for your domain. Write a lightweight rubric for raters (accuracy, relevance, clarity, safety).
    • Step 2: Build the prompt set. Gather from real sources (e.g., Stack Overflow for CS Q&A) and synthesize extras with the base LM. Deduplicate, filter low‑quality items, and tag prompts by type.
    • Step 3: Produce responses. For each prompt, collect K responses: some human‑written, some model‑generated at varied temperatures, optionally human‑edited model drafts. Keep full provenance metadata (who/what produced each response, sampling settings, time).
    • Step 4: Collect preferences. Present pairs (and occasionally triads) to human labelers with the rubric. Require justifications on disagreements to improve future rater calibration.
    • Step 5: Train the reward model. Format inputs consistently, fine‑tune with the pairwise ranking loss, monitor pairwise accuracy and score distributions, and validate on a held‑out set.
    • Step 6: Prepare the SFT policy. Fine‑tune the base LM on the best human responses so it follows instructions reasonably well before RL.
    • Step 7: PPO setup. Initialize θ from the SFT model. Configure ε (e.g., 0.1–0.2), learning rate, batch size, and entropy/KL coefficients. Implement advantage estimation with a value head.
    • Step 8: Rollouts and updates. Sample prompts, generate responses with π_{θ_old}, score with R, compute returns and advantages, and optimize PPO for several epochs. Periodically update θ_old ← θ and repeat.
    • Step 9: Safety checks in loop. Track KL to SFT, reward model uncertainty, and simple safety filters. If KL drifts high, increase KL penalty; if uncertainty spikes, down‑weight those samples.
    • Step 10: Evaluation. Build a benchmark of prompts with human‑preferred references. Measure win‑rate against SFT and base models, refusal quality for unsafe prompts, and fact‑checking accuracy.

    Tips and warnings

    • Preference data quality dominates outcomes: ambiguous or inconsistent labeling confuses the reward model. Invest in rater training and clear rubrics.
    • Avoid reward hacking: models can learn to exploit reward model quirks (e.g., overly long or overly polite boilerplate). Regularly audit top‑scored responses and retrain R with hard negatives.
    • Keep updates gentle: too large an ε or learning rate can cause collapse or mode narrowing. Monitor the clipped fraction; high values suggest the optimizer is often trying to make too‑big updates.
    • Use reference model KL: tethering to the SFT policy preserves helpfulness while improving on specific deficiencies.
    • Sequence length discipline: ensure prompts and responses fit within context limits. Truncation can silently corrupt training signals.
    • Track uncertainty: don’t give full credit to shaky reward model judgments; filter or down‑weight them.
    • Maintain a validation loop with humans: occasionally run side‑by‑side evaluations to ensure progress matches human preferences, not just reward scores.
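    Two of the monitors mentioned above, KL drift and clipped fraction, can be computed directly from per‑token log‑probabilities. This is a minimal sketch; the function name and the particular KL estimator are assumptions, not code from the lecture.

```python
import math

def ppo_monitors(logp_new, logp_old, eps=0.2):
    """Batch diagnostics for a PPO step (a sketch, not tied to any framework).

    logp_new / logp_old: per-token log-probabilities under the current
    and rollout policies. Returns (approx_kl, clipped_fraction).
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    # Low-variance KL(old || new) estimator: (r - 1) - log r, always >= 0.
    approx_kl = sum((r - 1.0) - math.log(r) for r in ratios) / len(ratios)
    # Fraction of ratios outside the PPO trust region [1 - eps, 1 + eps];
    # a persistently high value means updates are too aggressive.
    clipped_fraction = sum(1 for r in ratios if abs(r - 1.0) > eps) / len(ratios)
    return approx_kl, clipped_fraction
```

    If `approx_kl` climbs epoch over epoch, increase the KL penalty or lower the learning rate, per the tips above.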

    Concrete example: CS Q&A assistant

    • Goal: A model that answers computer science questions reliably.
    • Data: Collect thousands of CS questions from Stack Overflow/Reddit (respecting licenses), add expert human answers, and generate additional model answers for variety.
    • Preferences: Show pairs to CS‑savvy labelers; prefer answers that are correct, concise, reference sources when possible, and state uncertainty if needed.
    • Reward model: Fine‑tune an LM head to score (prompt, answer). Validate that it ranks expert answers above weak ones.
    • PPO: Start from SFT. Set ε=0.2, add a KL penalty to SFT with a target KL of ~0.1–0.2 nats per token. Train with batch rollouts, monitor factuality and harmful content filters.
    • Outcome: Improved win‑rate vs. SFT on a held‑out CS benchmark, fewer hallucinations, better instruction adherence.
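    One common way to hold KL near a ~0.1–0.2 nats/token target is an adaptive penalty coefficient, in the style of Ziegler et al. (2019). The sketch below is a simplified variant; the thresholds and multipliers are illustrative choices, not values from the lecture.

```python
def update_kl_coef(beta, observed_kl, target_kl=0.15, band=1.5):
    """Adaptive KL penalty coefficient (a simplified sketch).

    If the measured per-token KL to the SFT policy drifts well above
    target, strengthen the penalty; if it falls well below, relax it;
    inside the band, leave it alone.
    """
    if observed_kl > target_kl * band:
        beta *= 1.5
    elif observed_kl < target_kl / band:
        beta /= 1.5
    return beta
```

    Run this once per rollout batch on the measured KL; the band keeps the coefficient from oscillating on small fluctuations.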

    By following this architecture and training recipe, you translate human preferences into a stable optimization signal that tunes a language model from “good next‑token predictor” into “useful assistant” aligned with helpfulness, honesty, and harmlessness.

    05Conclusion

    This lecture reframes language modeling from pure next‑token prediction to goal‑directed behavior that serves human needs. The core idea is alignment: we want models that are helpful, honest, and harmless, not just fluent. Reinforcement Learning from Human Feedback (RLHF) is the practical mechanism: collect prompts and high‑quality responses, learn a reward model from human preferences using pairwise ranking, and fine‑tune the language model with PPO so it generates outputs the reward model favors. PPO’s clipped objective and the old/new policy setup provide stability, improving responses without sudden destructive shifts.

    A careful data pipeline underpins everything. Prompt and response design determines what the model can learn; multiple candidates per prompt fuel pairwise preferences; and rater rubrics keep labels consistent. The reward model must be trustworthy: train it well, validate rigorously, and mitigate its own hallucinations through factual training, architectural choices, and uncertainty estimation. In the RL loop, use advantage estimation, entropy bonuses, and KL control to balance improvement with safety, and monitor metrics like KL drift, clipped fraction, and reward model uncertainty.

    To practice, try building a focused assistant such as a CS Q&A helper: gather real prompts, craft strong human answers, collect preference labels, train a reward model, and run PPO from an SFT base. Evaluate with human side‑by‑side tests and factuality checks to confirm real gains, not just higher reward scores. As a next step, study RLHF’s limitations and alternative alignment methods, and explore advanced techniques for safety, calibration, and evaluation. The key message to remember: pretraining gives language skill, but alignment shapes purpose. RLHF operationalizes human intent into a learnable signal, steering models toward behavior we prefer and trust.

  • ✓Configure PPO conservatively for stability. Use ε around 0.1–0.2, add an entropy bonus, and include a KL penalty to the SFT policy. Monitor the clipped ratio fraction and KL drift each epoch. If training becomes unstable, lower the learning rate or increase KL strength.
  • ✓Handle sequence‑level rewards carefully. Assign the final reward across tokens and use a value baseline with advantage estimation (e.g., GAE). Normalize rewards and advantages per batch. This reduces variance and improves sample efficiency.
  • ✓Prevent reward hacking with audits and countermeasures. Regularly inspect top‑scored outputs for boilerplate or exploits. Add those as negative examples to retrain the reward model. Adjust KL and entropy settings to discourage brittle strategies.
  • ✓Build a reliable evaluation suite. Use side‑by‑side human comparisons, factual QA tests, safety scenarios, and instruction‑following checks. Track win‑rate over SFT and base models. Evaluate per domain to see where to collect more data.
  • ✓Keep a human‑in‑the‑loop feedback cycle. As failure modes appear, collect new preference data to target them. Update the reward model and continue PPO. Iteration is key to sustained improvement.
  • ✓Document provenance and settings for every example. Record prompt IDs, generators, sampler settings, and rater IDs. This allows debugging odd behaviors and retraining with the right subsets. Good data hygiene saves time later.
  • ✓Treat uncertainty as a first‑class signal. Filter or reduce weight for samples where the reward model is unsure. This prevents the policy from chasing noise. It also highlights areas where new human labels would help most.
  • ✓Use safety filters during rollouts. Block obviously harmful outputs before scoring. This keeps bad samples from shaping the policy. It also protects labelers and infrastructure.
  • ✓Adjust exploration during training. Use modest temperature and nucleus sampling to balance diversity with learnability. Too little exploration stalls learning; too much creates noisy gradients. Tune these with validation feedback.
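    The advantage‑estimation item above (spread the final reward across tokens, then apply GAE against a value baseline) can be sketched as follows; the reward list is zero everywhere except the last token, which carries the sequence‑level reward model score. The function name and interface are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one response (a sketch).

    rewards: per-token rewards; for RLHF typically all zeros except the
    final token, which carries the sequence-level reward.
    values: value-head estimates, one per token plus a trailing
    bootstrap value (0.0 at the end of an episode).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards, accumulating discounted TD errors (deltas).
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

    With gamma = lam = 1 and a zero baseline this reduces to reward-to-go; in training, normalize the resulting advantages per batch as the checklist suggests.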
    Reward model

    A model that reads a prompt and a response and outputs a single score for how good the response is. It learns from human preference comparisons. Higher scores mean better alignment with what people prefer. This score becomes the reward for RL.

    Pairwise ranking loss

    A training loss that teaches the reward model which of two responses is better. It pushes the score of the preferred response above the non‑preferred one. The common form uses −log(sigmoid(score_diff)). This makes ranking more reliable than asking for absolute scores.
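    As a quick numeric check of this loss (a sketch; the function name is illustrative): equal scores give −log(0.5) ≈ 0.693, and the loss shrinks toward 0 as the preferred response's score pulls ahead.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_diff)), as in the definition above."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```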

    Sigmoid

    A math function that squashes any real number to a value between 0 and 1. It’s often used to turn score differences into probabilities. If the input is large and positive, output is near 1; if negative, near 0. It helps in pairwise ranking loss.

    PPO (Proximal Policy Optimization)

    An RL algorithm that improves policies steadily while preventing big, risky updates. It compares the new policy to the old one and clips changes that are too large. This keeps training stable. It’s widely used for fine‑tuning LMs with RLHF.
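    The clipping described here can be written in a few lines. This is the standard per‑sample form of the clipped surrogate objective, sketched with illustrative names; real implementations average it over a batch and negate it for gradient descent.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped objective (a sketch of the standard form).

    ratio = pi_new / pi_old; taking the min with the clipped ratio
    removes any incentive to move the policy more than eps away from
    the old policy in a single update.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

    With a positive advantage the objective is capped at (1 + eps) * advantage, so even a very large ratio cannot produce a big update.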
