Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL 1
Key Summary
- This session introduces alignment for language models and why next-token prediction alone is not enough. When models only learn to guess the next word, they can hallucinate facts, produce toxic or biased text, and follow tricky prompts the wrong way. Alignment aims to make models helpful, honest, and harmless so they do what people actually want. The lecture lays out a practical recipe to achieve this with RLHF (Reinforcement Learning from Human Feedback).
- RLHF is presented as a three-step pipeline. First, collect prompts and human-written (or edited) responses that show the behavior you want. Second, train a reward model that scores which response is better for a given prompt using pairwise preferences. Third, fine-tune the language model with reinforcement learning so it generates responses the reward model prefers.
- Data collection choices strongly shape outcomes. You decide which prompt types to include (questions, instructions, creative writing), where they come from (humans, models, or both), and how many you need. You also choose how responses are produced (fully human, model-generated then edited, or mixtures) and how many alternatives per prompt. Diversity and quality control are key so the model learns broad, reliable behavior.
- The reward model maps (prompt, response) → a scalar score that represents "how good" the response is. It is trained with a pairwise ranking loss that prefers the higher-quality response among two candidates. The common loss is −log(sigmoid(R(P,Y1) − R(P,Y2))) when Y1 is labeled better than Y2. Minimizing this loss teaches the model to rank good responses above weaker ones.
- Any model that can read text and output a single number can be a reward model, but a language model backbone is a common choice. You can reuse the base LM, a smaller LM, or even a different architecture. The key is that the reward model learns human preferences from labeled comparisons. Careful training and validation are needed because reward models can also make mistakes.
Why This Lecture Matters
Alignment with RLHF turns general language skill into trustworthy, goal-directed behavior. This is crucial for anyone deploying models in real products (engineers, data scientists, product managers, and safety researchers) because customers care about accuracy, usefulness, and safety, not just fluency. Without alignment, models can hallucinate facts, echo toxic content, or be manipulated by tricky prompts, creating legal, ethical, and user-experience risks. RLHF offers a concrete pipeline to encode human preferences and values into the training process: collect targeted prompts and responses, learn a reward signal from comparisons, and fine-tune with robust RL (PPO) to optimize for what users prefer. In practice, this improves instruction following, reduces harmful outputs, and increases customer trust. This knowledge directly applies to building assistants, helpdesk bots, coding copilots, educational tutors, and domain-specific experts (like medical or legal triage tools). Teams can stand up a focused RLHF system in weeks by curating domain prompts, gathering preference labels, and bootstrapping a reward model. It also opens career paths in applied AI safety, preference learning, and RL for language, where demand is growing rapidly as companies operationalize AI. Finally, alignment methods are a key differentiator in the industry: organizations that master data collection, reward modeling, and stable RL can deliver safer, more reliable AI systems that win user trust and meet regulatory expectations.
Lecture Summary
Overview
This lecture focuses on alignment for large language models: how to train models not just to predict the next word but to do what people actually want. The central motivation is that pure next-token prediction on internet text yields fluent models that sometimes make up facts (hallucinations), can echo toxic or biased content, can be misled by adversarial prompts, and are often inconsistent at following instructions. To correct these shortcomings, we want models that are helpful (follow instructions and solve tasks), honest (factual and transparent about uncertainty), and harmless (avoid toxic, biased, and dangerous outputs). The practical method introduced is Reinforcement Learning from Human Feedback (RLHF), a three-step approach that has become the standard path to alignment in practice.
The three steps are: (1) data collection of prompts and human responses that exhibit the desired behavior; (2) reward modeling, where a model is trained to predict which response is better for a given prompt from pairwise human preferences; and (3) reinforcement learning, where the language model is fine-tuned to produce responses that the reward model scores highly. The reward model is a simple function that inputs a (prompt, response) pair and outputs a single scalar score that reflects quality. It is trained with a pairwise ranking loss so that better responses receive higher scores than weaker ones. For the reinforcement learning step, Proximal Policy Optimization (PPO) is introduced as a robust policy-gradient algorithm that updates the model while constraining how much it can change at once, improving stability.
The lecture begins by defining the problem with just next-token training and articulating alignment goals. It then delves into data collection: choosing prompt types (questions, instructions, creative writing), deciding sources (human-written prompts, model-generated prompts, or a mix), and ensuring diversity and representativeness. It also covers response generation trade-offs: purely human responses, model-generated then human-edited responses, and how many candidate responses per prompt to create. A concrete example is given: building a CS Q&A assistant using questions from sources like Stack Overflow or Reddit with human-written answers to seed preference data.
Next, it explains the reward model in detail. You can use a language model backbone or other architectures to score responses, but the common approach is to fine-tune an LM to output a scalar score for each (prompt, response). The model is trained with a pairwise logistic loss: for two responses Y1 and Y2 to prompt P where Y1 is judged better, the loss −log(sigmoid(R(P,Y1) − R(P,Y2))) encourages the model to score Y1 higher than Y2. This naturally learns a ranking over responses that reflects human preferences, which becomes the reward signal for RL.
Finally, it introduces reinforcement learning with PPO to fine-tune the policy (the LM). In this framing, the agent is the LM generating a response token by token, the environment is the prompting task, and the reward comes from the learned reward model applied to the completed response. PPO maintains an old policy (used to collect data) and a new policy (being optimized), using a clipped objective that keeps the probability ratio of new vs. old policy close to 1, multiplied by an advantage estimate. This combination improves responses without taking destabilizing steps, leading to better instruction following and alignment.
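To make the scoring setup concrete, here is a hypothetical sketch: assume a backbone has already encoded each (prompt, response) pair into a fixed-size feature vector `phi(P, Y)`, and a linear head maps features to the scalar reward. The pairwise loss then corresponds to a Bradley-Terry model over score differences; `reward_head` and `preference_probability` are illustrative names, not the lecture's code:

```python
import math

def reward_head(features, weights, bias=0.0):
    """Scalar reward head on top of a backbone: R(P, Y) = w · phi(P, Y) + b."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def preference_probability(feat_y1, feat_y2, weights, bias=0.0):
    """P(Y1 preferred over Y2 | P) = sigmoid(R(P,Y1) - R(P,Y2)).

    Note the bias cancels in the difference: only relative scores matter,
    which is why pairwise labels suffice to train the reward model."""
    diff = reward_head(feat_y1, weights, bias) - reward_head(feat_y2, weights, bias)
    return 1.0 / (1.0 + math.exp(-diff))

# Identical features -> no preference either way (probability 0.5).
print(preference_probability([1.0, 2.0], [1.0, 2.0], [0.3, -0.1]))  # 0.5
```

In practice the head sits on the LM's final hidden state and is trained jointly, but the score-difference structure is the same.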
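The clipped objective described above can be illustrated for a single action. This is a simplified per-token sketch under stated assumptions (real PPO averages over batched trajectories and adds value-function and entropy terms):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO surrogate for one action (token):
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Unchanged policy (ratio = 1): the objective is just the advantage.
print(ppo_clipped_objective(0.0, 0.0, 1.0))  # 1.0
# A large ratio with positive advantage is clipped at 1 + eps, so the
# policy gains nothing by moving further than the trust region allows.
print(ppo_clipped_objective(1.0, 0.0, 1.0))  # 1.2
```

The `min` makes the objective pessimistic: in-range improvements count fully, but out-of-range moves get no extra credit, which is what keeps updates from destabilizing the model.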
The lecture also addresses a key question: since the reward model is a language model, can it hallucinate too? Yes, and that is a concern. Three mitigation strategies are described: train the reward model on factual data to reduce hallucinations, swap to a smaller or different model architecture that may hallucinate less, and use uncertainty estimation to down-weight uncertain reward predictions. These techniques help prevent the fine-tuned policy from overfitting to reward model errors.
The session targets learners who already understand basic language modeling and are ready to move into aligning models with human values and tasks. You should be comfortable with concepts like tokens, likelihood, and gradient-based training. After completing this lecture, you will be able to describe the motivation for alignment, explain the RLHF pipeline end to end, define and train a reward model using pairwise preferences, and understand how PPO fine-tunes a language model toward human-preferred behavior. The lecture closes by previewing next time's focus: limitations of RLHF and alternative alignment methods.
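One way to realize the down-weighting mitigation is a simple ensemble: score the response with several reward models and subtract a penalty proportional to their disagreement. A minimal sketch; the `penalty` coefficient is an illustrative hyperparameter, not a value from the lecture:

```python
import math

def uncertainty_penalized_reward(ensemble_scores, penalty=1.0):
    """Mean ensemble reward minus penalty * standard deviation.

    High disagreement among reward models signals uncertainty, so the
    policy earns less for responses the ensemble cannot score reliably,
    discouraging it from exploiting any single model's errors."""
    n = len(ensemble_scores)
    mean = sum(ensemble_scores) / n
    variance = sum((s - mean) ** 2 for s in ensemble_scores) / n
    return mean - penalty * math.sqrt(variance)

# Full agreement: the effective reward equals the mean.
print(uncertainty_penalized_reward([2.0, 2.0, 2.0]))  # 2.0
# Same mean but high disagreement: the effective reward drops.
print(uncertainty_penalized_reward([0.0, 2.0, 4.0]))
```

Dropout-based Monte Carlo sampling from a single reward model can substitute for a full ensemble when training several models is too expensive.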
Key Takeaways
- Start with clear goals: define helpful, honest, and harmless for your domain. Write a simple rubric so labelers agree on what "better" means. Include accuracy, relevance, clarity, safety, and calibration of uncertainty. A shared rubric improves preference data quality and downstream results.
- Curate diverse prompts that reflect real use. Mix instructions, Q&A, reasoning, safety-sensitive cases, and creative tasks. Tag prompts by type and difficulty to diagnose performance later. More diversity reduces overfitting to narrow behaviors.
- Collect multiple responses per prompt to fuel pairwise comparisons. Combine human-written answers with model-generated variants at different temperatures. This makes the reward model robust across styles and qualities. More contrast improves learning of preferences.
- Train a reward model with pairwise ranking instead of absolute scores. It's easier for annotators and yields more consistent signals. Use the −log(sigmoid(score_diff)) loss and monitor pairwise accuracy on a held-out set. Keep inputs formatted consistently with clear separators.
- Mitigate reward model hallucination from the start. Add factual calibration tasks, consider a smaller or different architecture, and estimate uncertainty with ensembles or dropout. Down-weight uncertain scores during PPO. Revisit and retrain the reward model with hard negative examples periodically.
- Use SFT to create a strong starting policy. Fine-tune on the best human responses before RL. SFT stabilizes PPO and reduces the number of RL updates required. This also anchors behavior for KL control.
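The KL-control point from the takeaways is commonly implemented by shaping the reward the policy optimizes: subtract a penalty proportional to how far the policy's log-probability of the sampled response drifts from the SFT reference model. A minimal per-sequence sketch; `beta` is an illustrative coefficient:

```python
def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """RLHF reward with KL control:
    r = R(P, Y) - beta * (log pi(Y|P) - log pi_ref(Y|P)).

    The penalty discourages the policy from drifting far from the SFT
    reference, guarding against over-optimizing reward-model errors."""
    return rm_score - beta * (logp_policy - logp_reference)

# No drift from the reference: the reward-model score passes through.
print(kl_shaped_reward(1.5, -10.0, -10.0))  # 1.5
# The policy makes the sampled response far likelier than the reference
# does, so the effective reward is reduced.
print(kl_shaped_reward(1.5, -5.0, -10.0))   # 1.0
```

Tuning `beta` trades off reward gains against staying close to the anchored SFT behavior; too small invites reward hacking, too large freezes the policy.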
Glossary
Alignment
Making a language model do what people actually want instead of only predicting the next word. It focuses on being helpful, honest, and harmless. Alignment adds goals and rules on top of raw language skill. It turns vague human wishes into specific training signals. Without it, models can sound smart but act in ways we don't want.
RLHF (Reinforcement Learning from Human Feedback)
A method to align models using human preferences. People compare responses to the same prompt and pick the better one. A reward model learns from these comparisons to score responses. Then the language model is trained with RL to maximize that score.
Prompt
The input text you give a language model to start a task. It can be a question, an instruction, or a creative request. Good prompts clearly state what is wanted. They set the stage for the model's response.
Response
The text the language model generates after reading a prompt. Responses can be short answers, explanations, or creative writing. In alignment, we judge responses by quality and safety. Better responses follow instructions and avoid harm.
