Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 17: Alignment - RL 2
Key Summary
- This session continues alignment with reinforcement learning for language models. It recaps reward hacking—when a model chases the reward in the wrong way, like writing very long answers if reward is tied to word count. The RLHF pipeline is reviewed: pre-train a model, gather human preference data, train a reward model, then fine-tune the policy using RL with a safety constraint. The main focus is how to optimize the policy while staying close to the original model using techniques like KL penalties, PPO, and DPO.
- Policy optimization seeks a policy (the model’s behavior) that maximizes expected reward over prompts and outputs. In language models, the action space is huge (tens of thousands of tokens), rewards are noisy and sparse, and shifting too far from the base model can destroy useful knowledge. A KL divergence penalty is added to discourage the model from moving too far away from the pre-trained policy. This stabilizes updates and reduces reward hacking.
- Proximal Policy Optimization (PPO) is a popular algorithm for RLHF because it is relatively stable and simple to implement. PPO works by comparing new action probabilities to old ones and clipping the change so it can’t grow too large at once. The advantage function tells whether an action was better than average for a given prompt. Clipping and advantage estimation together keep learning steady and avoid big, risky jumps.
- The advantage function can be estimated with reward minus a value function, or with Generalized Advantage Estimation (GAE) for smoother estimates. Value functions predict expected reward from a prompt under the current policy, while advantage measures the extra goodness of a chosen output. PPO iterates over: generate data with the policy, estimate advantages, compute the PPO objective, update parameters, and repeat. It is effective but requires careful tuning of clipping range and KL penalty strength.
- A simple PPO example: for a summarization task, a reward model scores two summaries and humans prefer A over B. If the old policy liked B too much, PPO adjusts the policy to increase A’s probability but only within a clipped range. After several updates, A gets chosen more often. This demonstrates preference alignment via controlled policy updates.
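The GAE estimate mentioned above can be sketched in a few lines. This is a minimal illustration, not the lecture's exact implementation: rewards and value predictions are assumed to be given per step, and `gamma` (discount) and `lam` (GAE lambda) are illustrative hyperparameter choices.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    values must have len(rewards) + 1 entries (a bootstrap value at the end).
    """
    advantages = []
    gae = 0.0
    # Walk backwards so each step accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.append(gae)
    advantages.reverse()
    return advantages
```

With `lam=0` this collapses to the one-step estimate `A = R + gamma * V(next) - V(current)`; with `lam=1` it becomes the full discounted-return estimate, which is less biased but noisier.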
Why This Lecture Matters
Organizations deploying language models must ensure they behave helpfully, safely, and in line with human values. This lecture explains the practical RL tools that make alignment work at scale: PPO with KL constraints keeps updates safe and stable, while DPO dramatically simplifies the pipeline by learning directly from preferences. These approaches help teams avoid reward hacking, where models exploit poorly designed signals, and instead incentivize the outcomes users really want. Practitioners gain a clear understanding of hyperparameters (like PPO clipping and KL strength) that make or break large-scale training and how to evaluate alignment with human-grounded metrics, not just perplexity. For product teams, the knowledge translates into better chatbots, summarizers, and assistants that users trust. Researchers can prototype alternatives like TRPO, actor-critic variants, and DPO to study stability versus efficiency trade-offs. Engineers with limited compute benefit from DPO’s simpler loop and from reward shaping techniques that reduce wasted training. Learning to balance exploration and exploitation during rollouts also improves discovery of better behaviors. Career-wise, these skills are highly sought after as companies race to deploy aligned LLMs in real applications. Understanding PPO, DPO, KL penalties, preference data quality, and meaningful evaluation metrics allows you to build safer, more reliable systems. In an industry where misaligned models can cause real harm or reputational damage, mastery of alignment-focused RL methods is a strategic advantage.
Lecture Summary
Overview
This lecture focuses on the reinforcement learning (RL) step of aligning large language models (LLMs) to human preferences, building on the standard RLHF (Reinforcement Learning from Human Feedback) pipeline. The instructor begins by revisiting reward hacking—when a model finds clever but undesirable ways to maximize a reward, like outputting very long responses if the reward correlates with length. The RLHF pipeline is reviewed: pre-train a language model, gather human preference data by comparing outputs, train a reward model from these comparisons, and then use RL to fine-tune the policy (the model’s behavior) toward higher rewards while staying close to the original model. The heart of today’s lecture is policy optimization under practical constraints: enormous action spaces, noisy/sparse reward signals, and the need to avoid drifting too far from the base model’s knowledge. A central stabilizing tool is a KL divergence penalty that discourages large deviations from the pre-trained policy, controlled by a hyperparameter beta.
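The KL-penalized objective described above can be sketched as a shaped reward: the reward-model score minus beta times a Monte Carlo estimate of the KL from the frozen reference model. The function name and argument layout below are illustrative, not from the lecture.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward minus a KL penalty toward the reference model.

    policy_logprobs / ref_logprobs: log-probs of the sampled tokens under
    the current policy and the frozen pre-trained (reference) model.
    beta controls how strongly drift from the reference is punished.
    """
    # Monte Carlo estimate of KL(policy || reference) on the sampled tokens.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

Monitoring `kl_estimate` per token during training is the usual way to tell whether beta needs to be raised (policy drifting) or lowered (policy barely moving).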
The lecture then explores Proximal Policy Optimization (PPO), the most commonly used policy gradient algorithm for RLHF. PPO’s clipped objective limits how much action probabilities can change in one update. The advantage function—often estimated as reward minus a value prediction—guides which actions to strengthen or weaken. The instructor walks through PPO’s steps for language models: roll out the current policy to collect data, estimate advantages, compute the PPO loss with clipping and optional KL terms, update the model, and repeat. A concrete example uses a text summarization task where humans prefer one summary over another; PPO gradually shifts the policy to favor the preferred option, but only within controlled update sizes.
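The clipped objective at the heart of PPO can be written for a single action as follows; this is a standard textbook form, shown here as a sketch rather than the lecture's code, with `eps` playing the role of the clipping range.

```python
import math

def ppo_clip_loss(new_logprob, old_logprob, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single action (negated objective)."""
    ratio = math.exp(new_logprob - old_logprob)          # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # The objective takes the pessimistic minimum; the loss is its negative.
    return -min(unclipped, clipped)
```

When the probability ratio drifts outside `[1 - eps, 1 + eps]`, the clipped term caps the incentive, so a single update cannot move the policy arbitrarily far on one sample.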
While PPO is popular, it presents challenges for very large models: high compute and memory demands, sensitivity to hyperparameters such as the clipping range and KL penalty strength, and stability concerns. Alternative RL approaches are mentioned. TRPO (Trust Region Policy Optimization) enforces a trust region by directly constraining KL divergence, with stronger theoretical guarantees but greater implementation complexity due to second-order optimization using the Fisher information matrix. Actor-critic methods learn a separate value function alongside the policy, often improving sample efficiency but sometimes at the cost of stability.
A newer pathway, Direct Preference Optimization (DPO), avoids training a separate reward model entirely. DPO reframes preference learning as a binary classification task over pairs of outputs given a prompt: the model is trained with cross-entropy loss to prefer the human-chosen output. Remarkably, optimizing this loss can be shown to correspond to RL with a particular reward. DPO’s advantages include a simpler pipeline, increased stability (no advantage estimation), and potential sample efficiency. However, its performance depends heavily on high-quality preference data and careful hyperparameter choices. The instructor also addresses how to handle ties in preference data—either by excluding them, down-weighting their loss, or using a margin so only strong preferences drive updates.
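The DPO cross-entropy loss over a preference pair can be sketched directly from sequence log-probabilities. The helper name and signature here are illustrative; `beta` is the same kind of KL-strength hyperparameter discussed above.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (all inputs are sequence log-probs).

    pi_* : log-prob of the chosen/rejected response under the policy.
    ref_*: log-prob under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Binary cross-entropy on the margin: -log(sigmoid(beta * margin)).
    return math.log(1.0 + math.exp(-beta * margin))
```

A zero margin gives the chance-level loss `log 2`; as the policy learns to favor the chosen response, the margin grows and the loss falls, which is exactly the binary-classification framing described above.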
The lecture closes with broader RL topics in the context of LLM alignment. Reward shaping is presented as the art of designing reward functions that guide learning, often by layering simple rewards first and gradually adding more nuanced components, sometimes mixing intrinsic (exploration-promoting) and extrinsic (goal-oriented) signals. The exploration–exploitation trade-off is highlighted: models must try new actions to discover better strategies while also exploiting known good actions to earn reward, with methods like epsilon-greedy and upper confidence bounds helping manage the balance. Finally, evaluation metrics must reflect alignment with human values, not just generic language quality. Human evaluation, pairwise comparison accuracy, and task-specific measures are emphasized as better indicators than perplexity alone.
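The epsilon-greedy strategy mentioned above is simple enough to sketch in full; the dictionary-of-action-values interface is an illustrative assumption, not the lecture's code.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1, rng=random):
    """Pick the best-known action most of the time, a random one otherwise.

    action_values: dict mapping action -> estimated value.
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)                   # explore
    return max(actions, key=action_values.get)       # exploit
```

For LLM rollouts the same idea usually appears as sampling temperature or top-k/top-p rather than an explicit epsilon, but the trade-off being managed is identical.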
By the end, learners understand the practical algorithms used to align LLMs via RL, especially PPO and DPO, how KL penalties stabilize training, what challenges arise at scale, and how to think about reward shaping, exploration, and evaluation. The audience should come away with a clear blueprint for the RL optimization step in RLHF and a sense of when to prefer PPO versus DPO or consider TRPO/actor-critic variants.
This material suits intermediate practitioners who know basic ML, neural network training, and the idea of language modeling. It is helpful to understand gradients, loss functions, and probability distributions. After this lecture, you will be able to describe and implement the PPO loop for LLMs, explain DPO and its advantages, design KL-constrained objectives, and choose appropriate evaluation metrics for alignment.
Key Takeaways
- ✓Start with a solid baseline and a KL leash: Use a good pre-trained model and apply a KL penalty to keep updates near it. Monitor KL per token and adjust beta (β) to maintain a safe drift. This preserves knowledge and prevents reward hacking from pulling the policy off-course. Stable proximity is a foundation for successful alignment.
- ✓Tune PPO’s clipping epsilon deliberately: Begin with ε around 0.1–0.2 and watch stability and learning speed. If learning stalls, nudge ε up; if updates oscillate or collapse, lower it. Combine with careful learning rate selection and gradient clipping. Small changes in ε can have large effects.
- ✓Estimate advantages cleanly: For single-step tasks, A = R − V can suffice; for multi-step, consider GAE to damp noise. Train a decent value head with a separate loss and possibly a different learning rate. Poor value estimates increase variance and destabilize updates. Validate by tracking value loss and advantage statistics.
- ✓Shape rewards to avoid shortcuts: Normalize or penalize length, reward key point coverage, and include safety checks. Iteratively refine rewards as the model’s behavior evolves. Poor shaping invites reward hacking. Good shaping accelerates learning the right habits.
- ✓Explore during rollouts: Use temperature, top-k/top-p sampling, or an entropy bonus to avoid deterministic, narrow data. Exploration discovers better strategies for PPO to latch onto. Too little exploration leads to local optima; too much hinders convergence. Keep exploration moderate and purposeful.
- ✓Watch compute and batch dynamics: Large LLM PPO is heavy; use mixed precision, gradient accumulation, and modest PPO epochs. Refresh rollouts frequently and avoid overfitting to stale data. Make minibatches large enough to average noisy rewards. Plan training to match your compute budget.
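The length-normalization idea from the reward-shaping takeaway can be sketched as a simple penalty on tokens beyond a target budget; the function name, target length, and penalty rate below are hypothetical choices for illustration.

```python
def shaped_reward(base_reward, num_tokens, target_len=150, length_penalty=0.01):
    """Subtract a penalty for exceeding a target length to deter padding."""
    overflow = max(0, num_tokens - target_len)
    return base_reward - length_penalty * overflow
```

Penalizing only the overflow (rather than all length) avoids pushing the model toward degenerate one-line answers while still removing the incentive to pad.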
Glossary
Alignment
Making a model behave in ways that match human values and goals. It means the model not only speaks fluently but also does the right thing. Aligned models avoid harmful, unhelpful, or misleading answers. Alignment guides training choices like rewards, constraints, and evaluations.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preferences to guide a model to better behavior. First, people compare pairs of model outputs to say which they like. A reward model is trained from these comparisons, and the policy is then optimized to get high rewards. RLHF helps bridge the gap between raw pre-training and human-friendly behavior.
Reward Hacking
When a model finds loopholes in the reward to score points without doing what we really want. It exploits the reward function rather than learning the intended behavior. This can happen when rewards correlate with easy-to-game signals. Preventing it requires better rewards, constraints, or checks.
Policy
A rule that maps from inputs to a distribution over actions. In language models, it maps a prompt to probabilities over next tokens or full outputs. Training changes the policy to prefer better actions. The goal is a policy that earns high reward and stays aligned.
