Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 17: Alignment - RL 2 | How I Study AI

Intermediate
Stanford Online
RLHF · YouTube

Key Summary

  • This session continues alignment with reinforcement learning for language models. It recaps reward hacking—when a model chases the reward in the wrong way, like writing very long answers if reward is tied to word count. The RLHF pipeline is reviewed: pre-train a model, gather human preference data, train a reward model, then fine-tune the policy using RL with a safety constraint. The main focus is how to optimize the policy while staying close to the original model using techniques like KL penalties, PPO, and DPO.
  • Policy optimization seeks a policy (the model’s behavior) that maximizes expected reward over prompts and outputs. In language models, the action space is huge (tens of thousands of tokens), rewards are noisy and sparse, and shifting too far from the base model can destroy useful knowledge. A KL divergence penalty is added to discourage the model from moving too far away from the pre-trained policy. This stabilizes updates and reduces reward hacking.
  • Proximal Policy Optimization (PPO) is a popular algorithm for RLHF because it is relatively stable and simple to implement. PPO works by comparing new action probabilities to old ones and clipping the change so it can’t grow too large at once. The advantage function tells whether an action was better than average for a given prompt. Clipping and advantage estimation together keep learning steady and avoid big, risky jumps.
  • The advantage function can be estimated with reward minus a value function, or with Generalized Advantage Estimation (GAE) for smoother estimates. Value functions predict expected reward from a prompt under the current policy, while advantage measures the extra goodness of a chosen output. PPO iterates over: generate data with the policy, estimate advantages, compute the PPO objective, update parameters, and repeat. It is effective but requires careful tuning of clipping range and KL penalty strength.
  • A simple PPO example: for a summarization task, a reward model scores two summaries and humans prefer A over B. If the old policy liked B too much, PPO adjusts the policy to increase A’s probability but only within a clipped range. After several updates, A gets chosen more often. This demonstrates preference alignment via controlled policy updates.

Why This Lecture Matters

Organizations deploying language models must ensure they behave helpfully, safely, and in line with human values. This lecture explains the practical RL tools that make alignment work at scale: PPO with KL constraints keeps updates safe and stable, while DPO dramatically simplifies the pipeline by learning directly from preferences. These approaches help teams avoid reward hacking, where models exploit poorly designed signals, and instead incentivize the outcomes users really want. Practitioners gain a clear understanding of hyperparameters (like PPO clipping and KL strength) that make or break large-scale training and how to evaluate alignment with human-grounded metrics, not just perplexity. For product teams, the knowledge translates into better chatbots, summarizers, and assistants that users trust. Researchers can prototype alternatives like TRPO, actor-critic variants, and DPO to study stability versus efficiency trade-offs. Engineers with limited compute benefit from DPO’s simpler loop and from reward shaping techniques that reduce wasted training. Learning to balance exploration and exploitation during rollouts also improves discovery of better behaviors. Career-wise, these skills are highly sought after as companies race to deploy aligned LLMs in real applications. Understanding PPO, DPO, KL penalties, preference data quality, and meaningful evaluation metrics allows you to build safer, more reliable systems. In an industry where misaligned models can cause real harm or reputational damage, mastery of alignment-focused RL methods is a strategic advantage.

Lecture Summary

01Overview

This lecture focuses on the reinforcement learning (RL) step of aligning large language models (LLMs) to human preferences, building on the standard RLHF (Reinforcement Learning from Human Feedback) pipeline. The instructor begins by revisiting reward hacking—when a model finds clever but undesirable ways to maximize a reward, like outputting very long responses if the reward correlates with length. The RLHF pipeline is reviewed: pre-train a language model, gather human preference data by comparing outputs, train a reward model from these comparisons, and then use RL to fine-tune the policy (the model’s behavior) toward higher rewards while staying close to the original model. The heart of today’s lecture is policy optimization under practical constraints: enormous action spaces, noisy/sparse reward signals, and the need to avoid drifting too far from the base model’s knowledge. A central stabilizing tool is a KL divergence penalty that discourages large deviations from the pre-trained policy, controlled by a hyperparameter beta.

The lecture then explores Proximal Policy Optimization (PPO), the most commonly used policy gradient algorithm for RLHF. PPO’s clipped objective limits how much action probabilities can change in one update. The advantage function—often estimated as reward minus a value prediction—guides which actions to strengthen or weaken. The instructor walks through PPO’s steps for language models: roll out the current policy to collect data, estimate advantages, compute the PPO loss with clipping and optional KL terms, update the model, and repeat. A concrete example uses a text summarization task where humans prefer one summary over another; PPO gradually shifts the policy to favor the preferred option, but only within controlled update sizes.

While PPO is popular, it presents challenges for very large models: high compute and memory demands, sensitivity to hyperparameters such as the clipping range and KL penalty strength, and stability concerns. Alternative RL approaches are mentioned. TRPO (Trust Region Policy Optimization) enforces a trust region by directly constraining KL divergence, with stronger theoretical guarantees but greater implementation complexity due to second-order optimization using the Fisher information matrix. Actor-critic methods learn a separate value function alongside the policy, often improving sample efficiency but sometimes at the cost of stability.

A newer pathway, Direct Preference Optimization (DPO), avoids training a separate reward model entirely. DPO reframes preference learning as a binary classification task over pairs of outputs given a prompt: the model is trained with cross-entropy loss to prefer the human-chosen output. Remarkably, optimizing this loss can be shown to correspond to RL with a particular reward. DPO’s advantages include a simpler pipeline, increased stability (no advantage estimation), and potential sample efficiency. However, its performance depends heavily on high-quality preference data and careful hyperparameter choices. The instructor also addresses how to handle ties in preference data—either by excluding them, down-weighting their loss, or using a margin so only strong preferences drive updates.

The lecture closes with broader RL topics in the context of LLM alignment. Reward shaping is presented as the art of designing reward functions that guide learning, often by layering simple rewards first and gradually adding more nuanced components, sometimes mixing intrinsic (exploration-promoting) and extrinsic (goal-oriented) signals. The exploration–exploitation trade-off is highlighted: models must try new actions to discover better strategies while also exploiting known good actions to earn reward, with methods like epsilon-greedy and upper confidence bounds helping manage the balance. Finally, evaluation metrics must reflect alignment with human values, not just generic language quality. Human evaluation, pairwise comparison accuracy, and task-specific measures are emphasized as better indicators than perplexity alone.

By the end, learners understand the practical algorithms used to align LLMs via RL, especially PPO and DPO, how KL penalties stabilize training, what challenges arise at scale, and how to think about reward shaping, exploration, and evaluation. The audience should come away with a clear blueprint for the RL optimization step in RLHF and a sense of when to prefer PPO versus DPO or consider TRPO/actor-critic variants.

This material suits intermediate practitioners who know basic ML, neural network training, and the idea of language modeling. It is helpful to understand gradients, loss functions, and probability distributions. After this lecture, you will be able to describe and implement the PPO loop for LLMs, explain DPO and its advantages, design KL-constrained objectives, and choose appropriate evaluation metrics for alignment.

Key Takeaways

  • ✓Start with a solid baseline and a KL leash: Use a good pre-trained model and apply a KL penalty to keep updates near it. Monitor KL per token and adjust beta (β) to maintain a safe drift. This preserves knowledge and prevents reward hacking from pulling the policy off-course. Stable proximity is a foundation for successful alignment.
  • ✓Tune PPO’s clipping epsilon deliberately: Begin with ε around 0.1–0.2 and watch stability and learning speed. If learning stalls, nudge ε up; if updates oscillate or collapse, lower it. Combine with careful learning rate selection and gradient clipping. Small changes in ε can have large effects.
  • ✓Estimate advantages cleanly: For single-step tasks, A = R − V can suffice; for multi-step, consider GAE to damp noise. Train a decent value head with a separate loss and possibly a different learning rate. Poor value estimates increase variance and destabilize updates. Validate by tracking value loss and advantage statistics.
  • ✓Shape rewards to avoid shortcuts: Normalize or penalize length, reward key point coverage, and include safety checks. Iteratively refine rewards as the model’s behavior evolves. Poor shaping invites reward hacking. Good shaping accelerates learning the right habits.
  • ✓Explore during rollouts: Use temperature, top-k/top-p sampling, or an entropy bonus to avoid deterministic, narrow data. Exploration discovers better strategies for PPO to latch onto. Too little exploration leads to local optima; too much hinders convergence. Keep exploration moderate and purposeful.
  • ✓Watch compute and batch dynamics: Large LLM PPO is heavy; use mixed precision, gradient accumulation, and modest PPO epochs. Refresh rollouts frequently and avoid overfitting to stale data. Make minibatches large enough to average noisy rewards. Plan training to match your compute budget.

Glossary

Alignment

Making a model behave in ways that match human values and goals. It means the model not only speaks fluently but also does the right thing. Aligned models avoid harmful, unhelpful, or misleading answers. Alignment guides training choices like rewards, constraints, and evaluations.

RLHF (Reinforcement Learning from Human Feedback)

A training method that uses human preferences to guide a model to better behavior. First, people compare pairs of model outputs to say which they like. A reward model is trained from these comparisons, and the policy is then optimized to get high rewards. RLHF helps bridge the gap between raw pre-training and human-friendly behavior.

Reward Hacking

When a model finds loopholes in the reward to score points without doing what we really want. It exploits the reward function rather than learning the intended behavior. This can happen when rewards correlate with easy-to-game signals. Preventing it requires better rewards, constraints, or checks.

Policy

A rule that maps from inputs to a distribution over actions. In language models, it maps a prompt to probabilities over next tokens or full outputs. Training changes the policy to prefer better actions. The goal is a policy that earns high reward and stays aligned.

#rlhf #ppo #kl-divergence #advantage-estimation #value-function #gae #trpo #actor-critic #dpo #reward-shaping #reward-hacking #exploration #exploitation #pairwise-accuracy #human-evaluation #policy-ratio #clipping-epsilon #beta-penalty #fisher-information
  • PPO can be hard at very large scales due to high compute and memory costs. Hyperparameters like the clipping parameter (epsilon) and KL coefficient (beta) greatly affect results. Finding good values is tricky and often task-dependent. Despite this, PPO remains a common default for RLHF in practice.
  • Other RL algorithms exist. TRPO enforces a strict KL limit between old and new policies, backed by stronger theory but harder to implement due to second-order computations (Fisher information matrix). Actor-critic methods learn both a policy and a value function, often being more sample-efficient but also more unstable. In practice, PPO’s simplicity and robustness make it a frequent choice.
  • Direct Preference Optimization (DPO) is a newer method that skips training a separate reward model. It treats preference learning as a classification problem: given two outputs, predict which one humans prefer. Optimizing a simple cross-entropy loss on preferred vs dispreferred outputs can be shown to match a specific RL objective. DPO’s pipeline is simpler and can be more stable and sample-efficient.
  • DPO’s benefits come with trade-offs. High-quality preference data is critical; noisy or biased comparisons lead to poor behavior. DPO still needs hyperparameter tuning and can be sensitive to settings. Handling ties (when humans think two outputs are equally good) can be done by discarding those pairs, down-weighting their loss, or adding a margin threshold.
  • Reward shaping is about designing a reward function that nudges the model toward the right behavior. You can start simple and add pieces as the model learns, mixing intrinsic rewards (encouraging exploration) and extrinsic rewards (target goals). It often takes experimentation to get right. Good shaping reduces reward hacking.
  • Balancing exploration and exploitation is essential: exploring tries new ideas; exploiting uses the best known idea. Too much exploitation gets stuck in local optima; too much exploration never settles on good answers. Methods like epsilon-greedy and upper confidence bounds can help balance this. In language models, exploration can mean sampling diverse tokens or temperature adjustments.
  • Evaluation must measure alignment, not just language modeling quality. Metrics like perplexity are not enough. Use human evaluation, pairwise preference accuracy, and task-specific scores to judge whether behavior matches values and goals. Continuous evaluation safeguards against reward hacking and drift.
02Key Concepts

    • 01

      Reward Hacking: The model exploits the reward function in unintended ways to score high without truly improving behavior. It’s like a student guessing patterns to get test points instead of learning the subject. Technically, when the reward correlates with an easy-to-game signal (e.g., length), the policy learns actions that maximize that signal. Without addressing this, the system becomes misaligned and unhelpful. Example: if reward increases with word count, the model writes overly long answers to win points, not to help.

    • 02

      RLHF Pipeline: A process to align LLMs to human values using human feedback. Think of it as training a helpful assistant by first teaching it language, then telling it which answers people like, and finally reinforcing those behaviors. Concretely, pre-train a model, collect pairwise human preferences, fit a reward model, then fine-tune the policy with RL to maximize expected rewards. This pipeline is needed because raw pre-training doesn’t guarantee helpfulness or safety. Example: train on internet text, gather human rankings of chatbot replies, learn a reward from those rankings, and RL-tune the chatbot to produce preferred replies.

    • 03

      Policy Optimization Objective: The goal is to find the policy that maximizes expected reward over prompts and generated outputs. It’s like choosing a strategy that, on average, makes people happiest with your answers. Formally, maximize E_{x~d, y~π(·|x)}[r(x,y)], where d is the prompt distribution. Without a clear objective, learning drifts and may not improve alignment. Example: for a QA assistant, optimize the policy so answers across many questions earn high reward from the reward model.
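The expectation E_{x~d, y~π(·|x)}[r(x,y)] above can be approximated by plain Monte Carlo sampling. A minimal sketch in Python, where `policy_sample` and `reward` are hypothetical stand-ins for the policy's decoder and the reward model:

```python
def estimate_objective(prompts, policy_sample, reward, n_samples=4):
    """Monte Carlo estimate of E_{x~d, y~pi(.|x)}[r(x, y)]."""
    total, count = 0.0, 0
    for x in prompts:                  # x ~ d (prompt distribution)
        for _ in range(n_samples):
            y = policy_sample(x)       # y ~ pi(.|x)
            total += reward(x, y)      # scalar reward for (x, y)
            count += 1
    return total / count

# Sanity check: a constant reward of 1.0 yields an estimate of exactly 1.0.
est = estimate_objective(["q1", "q2"], lambda x: x.upper(), lambda x, y: 1.0)
```

In practice the average is taken over large rollout batches, which is exactly why noisy rewards call for the variance-reduction tools (value baselines, KL constraints) discussed below.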

    • 04

      Challenges in LLM RL: The action space is massive, rewards are noisy/sparse, and we must not forget what the base model knows. It’s like searching for a good path in a huge city with faint street signs while trying not to leave the familiar neighborhood. Technically, vocabulary sizes of ~50k make exploration difficult; reward models trained from human judgments are imperfect; and unconstrained updates can cause catastrophic drift. If ignored, learning becomes unstable or useless. Example: a model starts producing strange, off-distribution text if pushed too far by a flawed reward.

    • 05

      KL Divergence Constraint: A penalty that discourages the new policy from moving too far from the old one. This is like a rubber band tying the new behavior to the original, preventing wild swings. Mathematically, add −β KL(π(·|x) || π_old(·|x)) to the reward, where β controls penalty strength. It matters because it stabilizes training and protects base knowledge. Example: when optimizing on a small preference dataset, the KL penalty keeps outputs coherent and on-topic.
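The shaped reward r − β·KL can be illustrated on a toy next-token distribution. A minimal sketch (function names are illustrative, not from the lecture):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete next-token distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_shaped_reward(reward, p_new, p_ref, beta=0.1):
    """Penalized reward r - beta * KL(pi_new || pi_ref)."""
    return reward - beta * kl_divergence(p_new, p_ref)

# An unchanged policy incurs zero penalty; drifting from the reference costs reward.
p = [0.5, 0.3, 0.2]
unchanged = kl_shaped_reward(1.0, p, p, beta=0.1)          # KL = 0, stays 1.0
drifted = kl_shaped_reward(1.0, [0.9, 0.05, 0.05], p, beta=0.1)
```

Larger β tightens the rubber band: the same drift costs proportionally more reward.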

    • 06

      Proximal Policy Optimization (PPO): A policy gradient method that limits update size using a clipped objective. Think of it like training wheels that let you move forward but prevent sharp turns. The loss uses the ratio r_t = π(y|x)/π_old(y|x) multiplied by advantage A, then clips r_t to [1−ε, 1+ε] to avoid big steps. Without clipping, the policy could change too fast and collapse. Example: if a token’s probability jumps too high, the clip cuts the gain to keep learning steady.

    • 07

      PPO Clipping: The clip keeps the policy change within a small band around the old policy. It’s like telling a chef to adjust a recipe by only 20% at a time. Implemented as min(r_t A, clip(r_t, 1−ε, 1+ε) A), it prevents over-optimistic updates when A is positive and guards against harmful decreases when A is negative. This stabilizes learning across batches. Example: with ε=0.2, increases or decreases in action probability beyond ±20% bring no extra benefit in the objective.
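The clipped term min(r_t A, clip(r_t, 1−ε, 1+ε) A) can be checked numerically for one sample. A small sketch showing both the positive- and negative-advantage cases:

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO's per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: a ratio of 1.5 is capped at 1.2, so the gain stops at 1.2 * A.
gain = clipped_surrogate(math.log(1.5), 0.0, 1.0, eps=0.2)       # 1.2
# Negative advantage: a ratio of 0.5 is floored at 0.8; the min keeps the
# pessimistic (more negative) term, guarding against harmful decreases.
penalty = clipped_surrogate(math.log(0.5), 0.0, -1.0, eps=0.2)   # -0.8
```

This matches the ε=0.2 example in the text: probability changes beyond ±20% yield no extra benefit in the objective.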

    • 08

      Advantage Function: A measure of how much better an action is than average at a given prompt. It’s like asking, “Was this decision above or below par?” Technically, A(x,y) ≈ R(x,y) − V(x), where V estimates expected return. If we don’t use advantage, all actions look equally good or bad, making learning noisy. Example: for a summary with reward 0.8 and value prediction 0.5, advantage 0.3 suggests boosting that choice.

    • 09

      Value Function: A prediction of expected return from a prompt under the current policy. It’s like a weather forecast for how good outcomes will be if you continue as usual. Often implemented as a learned critic network that minimizes prediction error. It reduces variance in policy gradients and improves stability. Example: if V(x) accurately predicts 0.6, then rewards near 0.6 are unsurprising and produce small advantages.

    • 10

      Generalized Advantage Estimation (GAE): A method for smoother and more stable advantage estimates. It’s like averaging over multiple future steps with a decay to reduce noise. Technically, it mixes multi-step returns with a parameter λ to trade off bias and variance. GAE helps convergence and reduces gradient jitter. Example: with λ set between 0 and 1, you get a balance between short-term and long-term credit assignment.
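The λ-weighted mixing of TD residuals can be written as a short backward pass over a trajectory. A minimal sketch, assuming `values` carries one extra bootstrap entry V(s_T):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """GAE over one trajectory.

    rewards: per-step rewards r_t.
    values:  V(s_0) ... V(s_{T-1}) plus a bootstrap V(s_T) (len(rewards) + 1).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = sum_k (gamma*lam)^k delta_{t+k}.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Single-step case reduces to A = R - V, as described for prompt -> response tasks.
adv = gae_advantages([1.0], [0.5, 0.0])  # -> [0.5]
```

With λ=0 this degenerates to one-step TD residuals (low variance, more bias); with λ=1 it becomes full Monte Carlo returns minus the baseline (high variance, less bias).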

    • 11

      PPO Algorithm Steps: A repeated loop of collecting data, estimating advantages, computing the objective, and updating the policy. It’s like practicing, evaluating, adjusting, and practicing again. Each loop uses the old policy to gather rollouts and then applies gradient updates toward better actions while clipped. Without this iterative cycle, the model can’t steadily improve. Example: roll out on prompts, score with reward model minus KL penalty, compute A, update via clipped loss, repeat.

    • 12

      Summarization Example for PPO: Two summaries A and B are scored, and humans prefer A. The old policy might still prefer B, but PPO nudges it toward A by increasing A’s probability within a clip range. Across updates, A’s probability grows from 0.6 to 0.7 while B drops from 0.4 to 0.3. This mirrors human preferences with controlled steps. It shows how alignment emerges from guided probability shifts.

    • 13

      Practical Challenges of PPO for LLMs: Training is compute- and memory-heavy, and hyperparameters are sensitive. It’s like steering a large ship where tiny wheel turns matter. Key settings include ε for clipping and β for KL penalties; wrong values can cause collapse or no learning. Monitoring and tuning are critical. Example: overly small ε stalls learning; overly large ε destabilizes it.

    • 14

      TRPO vs PPO: Two ways to limit policy change for safety. TRPO sets a hard KL trust region, solving a constrained optimization using second-order information (Fisher matrix). PPO uses a simpler clipped surrogate objective that indirectly limits KL. TRPO has stronger theory but is harder to implement at scale, while PPO is simpler and often works similarly well in practice. Example: use PPO as a robust default; consider TRPO if you need strict KL control and can afford complexity.

    • 15

      Actor-Critic Methods: Algorithms that learn a policy (actor) and a value function (critic) together. Think of the actor as the doer and the critic as the coach giving scores. This can be more sample-efficient because the critic improves advantage estimates, but it can be more unstable if the critic is wrong. Careful training and regularization are needed. Example: an LLM policy proposes tokens while a learned critic estimates values for advantage computation.

    • 16

      Direct Preference Optimization (DPO): A method that directly trains the model to prefer human-preferred outputs without a separate reward model. Imagine showing two answers and teaching, “Pick this one.” It frames learning as classification with cross-entropy loss on preferred vs dispreferred outputs. This can be equivalent to RL under certain reward formulations. Example: given (prompt, A preferred over B), train the model to increase log-prob of A relative to B.

    • 17

      DPO Formulation and Loss: Given a prompt and pair (y+, y−), optimize a loss that increases the model’s score for y+ over y−. It’s like telling the model, “Rank y+ higher than y− for this question.” The cross-entropy formulation makes training straightforward with standard supervised learning tooling. This bypasses advantage estimation and policy ratios. Example: compute logits for both outputs and minimize the loss that penalizes choosing y−.
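The lecture frames this as cross-entropy over (y+, y−); note that the published DPO loss additionally measures log-probabilities relative to a frozen reference policy and scales the margin by β. A hedged, stdlib-only sketch of that per-pair loss (names illustrative):

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * [(logp+ - ref_logp+) - (logp- - ref_logp-)])."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At zero margin the loss is log 2; raising logp+ relative to logp- lowers it.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # margin 0 -> log(2)
improved = dpo_loss(-4.0, -5.0, -5.0, -5.0)   # y+ more likely -> smaller loss
```

Because it only needs log-probabilities from the policy and the reference model, this trains with standard supervised tooling, with no policy ratios or advantage estimates.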

    • 18

      DPO Advantages: Simpler pipeline, fewer moving parts, and potential stability and sample-efficiency gains. It reuses common supervised training code paths, making it easy to implement. Without a learned reward model and advantage estimation, error sources can be reduced. This matters when compute is limited or teams want faster iteration. Example: small teams align a model using only preference pairs and standard fine-tuning loops.

    • 19

      DPO Challenges and Ties: Requires high-quality preferences; bias or noise hurts performance. Ties—when humans see outputs as equally good—need care: drop them, down-weight their loss, or require a margin before learning. Hyperparameter sensitivity still exists, so tuning is necessary. Without careful data curation, the model can learn inconsistent preferences. Example: give tied pairs a small weight so the model does not overfit to ambiguous signals.

    • 20

      Reward Shaping: The craft of building reward signals that encourage desired behavior. It’s like giving points for good steps, not just the final goal, to guide learning. Technically, you can combine intrinsic (exploration) and extrinsic (task) rewards and gradually add components. Good shaping reduces reward hacking and speeds learning. Example: in summarization, reward being concise and covering key points, not just matching length.
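A shaped summarization reward like the one described (conciseness plus key-point coverage) might be combined as below. This is a toy sketch, not the lecture's reward model; all weights and the target length are illustrative assumptions:

```python
def shaped_reward(base_score, n_words, key_points_covered, total_key_points,
                  target_words=60, length_weight=0.2, coverage_weight=0.5):
    """Toy shaped reward: base score + key-point coverage - length penalty."""
    coverage = key_points_covered / total_key_points
    length_penalty = length_weight * abs(n_words - target_words) / target_words
    return base_score + coverage_weight * coverage - length_penalty

# A concise summary covering all key points outscores a padded one with the
# same base score and coverage, closing the length-based reward-hacking loophole.
concise = shaped_reward(0.5, 60, 3, 3)    # no length penalty
padded = shaped_reward(0.5, 120, 3, 3)    # penalized for doubling the length
```

Penalizing raw length directly, rather than rewarding it, is one of the simplest defenses against the "longer is better" hack recapped at the start of the lecture.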

    • 21

      Exploration vs Exploitation: Balancing trying new actions with using known good actions. It’s like sampling a new restaurant vs returning to a favorite. Techniques include epsilon-greedy (randomly explore sometimes) and upper confidence bounds (favor uncertain actions). Too little exploration gets stuck; too much never converges. Example: in LLM decoding, sampling diversity encourages discovering better phrasings.
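The epsilon-greedy rule mentioned above fits in a few lines. A minimal sketch over per-action value estimates (in LLM decoding the analogous knobs are temperature and top-k/top-p, not this tabular form):

```python
import random

def epsilon_greedy(q_values, eps=0.1, rng=random):
    """With probability eps, explore a random action; otherwise exploit the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))      # explore: any action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best known

# eps=0 always exploits the current best-known action.
best = epsilon_greedy([0.1, 0.9, 0.2], eps=0.0)  # -> 1
```

Annealing eps downward over training is a common way to shift from exploration toward exploitation as estimates improve.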

    • 22

      Evaluation Metrics for Alignment: We need measures that reflect human values, not just language fluency. Perplexity alone can be misleading. Use human evaluation, pairwise comparison accuracy, and task-specific metrics that directly assess helpfulness, safety, or correctness. This closes the loop to ensure training really improves alignment. Example: humans compare responses before and after RL; higher pairwise win-rate signals better alignment.

    • 23

      KL Penalty Hyperparameter (β): Controls how strongly the policy is pulled back toward the base model. It’s like tightening or loosening the safety leash. A larger β prevents drift but can slow learning; a smaller β allows faster changes but risks instability or reward hacking. Tuning β is critical for stable progress. Example: start with a moderate β and adjust based on observed KL per update.

    • 24

      Noisy and Sparse Rewards: Reward models trained from human data are imperfect and may give weak signals. It’s like hearing feedback through static. This motivates averaging, careful advantage estimation, and KL constraints. Without addressing noise, the policy chases unreliable gradients. Example: some reasonable answers get low reward; stability techniques prevent overreacting to such errors.

03Technical Details

    Overall Architecture/Structure

    1. Data Sources and Policies
    • Prompt Distribution (d): This is the source of inputs x (e.g., a dataset of user prompts or task instructions). In each iteration, prompts are sampled from d.
    • Old Policy (π_old): The current language model parameters at the start of a training iteration. It generates candidate outputs y given x.
    • New Policy (π): The updated language model parameters after optimizing the objective. During training, the algorithm computes π(y|x)/π_old(y|x) ratios to control update size.
    • Reward Model (optional for PPO): A learned model trained from human preferences that maps (x,y) to a scalar reward r(x,y). For DPO, this component is bypassed.
    • KL Reference Policy: Typically the pre-trained base model (or the policy from an earlier iteration) used to compute a KL penalty to avoid drifting too far.
    2. Objectives
    • RLHF PPO Objective (Conceptual): Maximize expected reward while discouraging divergence from the reference policy. This can be cast as maximizing E_{x~d, y~π}[r(x,y) − β KL(π(·|x)||π_ref(·|x))]. In practice, PPO implements this via a clipped surrogate that uses samples from π_old.
    • Policy Ratio: r_t(x,y) = π(y|x) / π_old(y|x). This ratio measures how much the probability of the chosen output has changed from the old policy to the new policy.
    • Advantage: A(x,y) estimates how much better action y is at x than the expected return under the current policy (e.g., A = R − V).
    • Clipped Surrogate: L_clip = E[min(r_t A, clip(r_t, 1−ε, 1+ε) A)]. Clipping prevents the ratio from moving too far in one update.
    • KL Penalty: Often added explicitly to the total loss or monitored indirectly via ε. An explicit term might be −β KL(π(·|x)||π_ref(·|x)), computed at the token or sequence level.
    3. Data Flow for PPO with LLMs
    • Sample prompts x from d.
    • Generate outputs y ~ π_old(·|x) using decoding (e.g., sampling with temperature, nucleus sampling) to encourage exploration.
    • Compute reward R(x,y) using the reward model (and optionally subtract a per-token KL penalty to the reference policy’s logits to form a shaped reward).
    • Estimate value V(x) using a learned value head on the model or a separate critic network.
    • Compute advantage A(x,y). If using GAE, build advantages by combining temporal-difference residuals with decay λ; for single-step tasks (prompt → response), A often reduces to R − V.
    • Compute policy ratio r_t from log-probabilities under π and π_old. Log-ratio is log π(y|x) − log π_old(y|x), exponentiated to get r_t.
    • Build the PPO loss with clipping, optionally include a value loss (e.g., MSE between R and V) and an entropy bonus to encourage exploration, and include KL penalty if used as a separate term.
    • Take gradient steps on the combined loss to update π and the value function parameters (if applicable). Repeat with fresh rollouts.

    Code/Implementation Details (Conceptual)

    • Language/Framework: Commonly PyTorch or JAX/Flax for LLMs. Even if no code is shown here, the structure maps cleanly to modern deep learning libraries.
    • Components:
      • Policy Network: The LLM (a decoder-only transformer) that outputs token logits and provides log-probabilities for the tokens of y given x.
      • Value Head (Critic): A small linear layer on top of the transformer’s final hidden states that predicts V(x). In sequence tasks, the value is often taken at a designated position or averaged across positions.
      • Reward Model: A separate LLM fine-tuned on human preferences to produce scalar rewards for (x,y). It typically shares the policy’s architecture but is trained differently.
    • Key Functions/Steps:
      • rollout(policy, prompts): Generate responses y and track log_probs_old for each token.
      • reward(x,y): Score sequences with the reward model. Optionally, compute a per-token KL to a reference model and subtract β * KL as part of the reward shaping.
      • estimate_advantage(R,V): For one-step tasks, A = R − V; for multi-step, use GAE with λ.
      • ppo_objective(logp_new, logp_old, A, ε): Compute r_t = exp(logp_new − logp_old) and return the mean of min(r_t * A, clip(r_t, 1−ε, 1+ε) * A).
      • value_loss(V, R): Mean squared error to train the critic.
      • update(parameters, gradients): Apply a gradient step with an optimizer such as Adam.
    • Important Parameters:
      • ε (epsilon): Clipping parameter, commonly 0.1–0.2. Smaller ε is safer but slower; larger ε is faster but riskier.
      • β (beta): KL penalty strength; controls how tightly the policy stays near the reference policy.
      • Learning rate: For both policy and value head; too high causes instability, too low slows training.
      • Batch/minibatch size: Affects stability and throughput; larger batches average out reward noise.
      • Number of PPO epochs: How many optimization passes over the same batch of rollouts; too many overfit to stale data.
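The batch-level functions named above can be sketched in plain Python over parallel lists (shapes, names, and the list-based interface are simplifications for illustration):

```python
import math

def ppo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate over a batch (an objective to be maximized)."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                       # policy ratio r_t
        clipped = max(1.0 - eps, min(1.0 + eps, r)) # clip(r_t, 1 - eps, 1 + eps)
        total += min(r * a, clipped * a)
    return total / len(advantages)

def value_loss(values, returns):
    """Mean squared error between critic predictions V and observed rewards R."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)

# With unchanged log-probs the ratio is 1, so the objective is just the mean advantage.
obj = ppo_objective([0.0, 0.0], [0.0, 0.0], [1.0, 0.5])
```

In a real implementation these operate on tensors with autograd (e.g., PyTorch), and the total loss is typically −ppo_objective + c1·value_loss − c2·entropy_bonus.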

    Tools/Libraries Used (Typical)

    • Deep Learning: PyTorch or JAX for model definition, automatic differentiation, optimizers.
    • Tokenizers: To convert text to tokens and back; influence action space size.
    • Data Loaders: For sampling prompts from d and preference pairs for reward model or DPO.
    • Logging/Monitoring: Track KL divergence, reward, advantage stats, and evaluation metrics.

    Step-by-Step Implementation Guide (PPO for LLM Alignment)

    • Step 1: Start with a pre-trained LLM (π_ref) and optionally initialize π_old = π_ref.
    • Step 2: Prepare a prompt dataset d and a reward model trained from human preferences. Ensure data is representative of target use cases.
    • Step 3: Roll out with π_old. For each prompt, decode a response y, storing per-token log-probabilities under π_old.
    • Step 4: Compute reward R(x,y). Use the reward model’s scalar output. Optionally apply reward shaping by subtracting β times the KL divergence between π_old and π_ref per token, or track KL for a separate penalty term.
    • Step 5: Estimate the value V(x). If your critic is a head on the policy, run forward passes to get V. Compute advantages A = R − V (or use GAE in multi-step settings).
    • Step 6: Compute the PPO loss. For each token (or sequence), compute logp_new from the current policy π, then the ratio r_t = exp(logp_new − logp_old). Build the clipped loss min(r_t A, clip(r_t, 1−ε, 1+ε) A).
    • Step 7: Add auxiliary losses. Include value loss to train the critic and possibly an entropy bonus to keep exploration. Add an explicit KL penalty term if not done in reward shaping.
    • Step 8: Backpropagate and update model parameters using an optimizer like AdamW. Consider gradient clipping to prevent exploding gradients.
    • Step 9: Iterate. After several minibatch epochs on the collected batch, refresh π_old with the updated policy and collect new rollouts.
    • Step 10: Evaluate frequently. Track human-win rates, pairwise accuracy, task metrics, and KL drift to catch reward hacking or collapse early.
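
The ten steps above can be condensed into a toy loop. This is a deliberately tiny sketch, not a real pipeline: the "policy" is a single categorical distribution over three canned responses, the "reward model" is a fixed table preferring response 0, and the critic is a running average — all stand-ins chosen for illustration.

```python
import math
import random

random.seed(0)

REWARDS = [0.8, 0.2, 0.1]     # stand-in reward model: prefers response 0
logits = [0.0, 0.0, 0.0]      # policy parameters: one categorical distribution
baseline = 0.0                # running value estimate V(x)
EPS, LR = 0.2, 0.3

def probs(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(300):
    p_old = probs(logits)
    action = random.choices(range(3), weights=p_old)[0]  # rollout (Step 3)
    reward = REWARDS[action]                             # reward (Step 4)
    adv = reward - baseline                              # advantage (Step 5)
    baseline += 0.1 * (reward - baseline)                # critic update (Step 7)
    for _ in range(4):                                   # PPO epochs (Step 9)
        p = probs(logits)
        ratio = p[action] / p_old[action]
        # Outside the clip region the clipped objective has zero gradient
        if (adv > 0 and ratio > 1 + EPS) or (adv < 0 and ratio < 1 - EPS):
            break
        for k in range(3):
            ind = 1.0 if k == action else 0.0
            # gradient of ratio * adv w.r.t. logit_k for a softmax policy
            logits[k] += LR * adv * ratio * (ind - p[k])

p_final = probs(logits)  # the preferred response should now dominate
```

The same structure scales up directly: replace the categorical distribution with per-token log-probabilities from an LLM, the table with a learned reward model, and the scalar baseline with a value head.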

    DPO Implementation Guide

    • Step 1: Start with a pre-trained LLM (reference policy).
    • Step 2: Collect preference pairs: for each prompt x, obtain (y+, y−) where humans prefer y+ over y−.
    • Step 3: Build the DPO loss. Compute model scores (logits or log-probs) for y+ and y− conditioned on x, and apply a cross-entropy style loss that increases log P(y+|x) relative to log P(y−|x). Many implementations include a temperature-like scaling or implicit KL to the reference model.
    • Step 4: Handle ties. Either exclude tied pairs, down-weight them, or require a margin so only confident preferences drive training.
    • Step 5: Optimize with standard supervised fine-tuning routines (mini-batches, AdamW, learning rate schedules). Monitor pairwise accuracy on a held-out preference set.
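
Steps 1–3 reduce to a single per-pair loss. The sketch below uses the standard DPO log-ratio form; the sequence log-probabilities passed in (sums of token log-probs under policy and reference) are illustrative placeholders.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (x, y+, y-) pair: -log sigmoid(beta * margin),
    where the margin compares policy-vs-reference log-ratios."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; raising log P(y+|x) relative to the reference shrinks the loss, which is exactly the preference-classification behavior described above.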

    Tips and Warnings

    • Reward Hacking Watch-outs: If the model begins producing extremely long or repetitive outputs, the reward may be gamed. Use length-normalized rewards or explicit penalties for verbosity where appropriate.
    • KL Tuning: Start with a moderate β and target a reasonable KL per token (e.g., a small fraction of a nat). If KL spikes, increase β; if KL is near zero and learning stalls, decrease β.
    • Clipping Epsilon: If updates are too timid, raise ε slightly; if instability occurs (e.g., reward crashes or KL blows up), lower ε.
    • Batch and Epochs: Don’t over-optimize the same rollout batch; stale advantages and ratios lead to overfitting. Limit PPO epochs (e.g., 1–8) and refresh data frequently.
    • Critic Quality: A poor value function increases variance; monitor value loss and ensure the critic learns. Consider value clipping or separate learning rates.
    • Exploration: Use temperature or top-p sampling during rollout to maintain diversity. Too deterministic a rollout starves learning of alternatives to compare.
    • Data Quality: For DPO, preference data must be reliable and representative. Remove inconsistent raters, resolve ties carefully, and sample across domains.
    • Compute Budget: Large LLM PPO is expensive. Consider smaller batch sizes, gradient accumulation, mixed precision, and careful checkpointing. DPO can be a cheaper alternative when reward modeling and rollouts are costly.
    • Evaluation Beyond Perplexity: Always include human-in-the-loop checks or high-quality proxy metrics that reflect alignment, not only fluency.

    Comparing PPO, TRPO, Actor-Critic, and DPO

    • PPO: First-order, clipped surrogate, easy to implement, widely used, good trade-off of stability and simplicity. Sensitive to ε and β; still compute-heavy.
    • TRPO: Constrained KL per update with second-order optimization (Fisher information). Strong theory; harder to implement and costly at LLM scale.
    • Actor-Critic: Learns policy and value together; can be more sample-efficient but may be unstable if the critic is poor. Often forms the backbone of PPO variants.
    • DPO: No reward model; reduces RLHF to preference classification. Simpler pipeline and often more stable, but depends critically on clean preference data and careful hyperparameters.

    Evaluation Metrics in Alignment

    • Human Evaluation: Gold standard; measure human preferences between system outputs. Expensive but most trustworthy.
    • Pairwise Comparison Accuracy: Given held-out preference pairs, measure how often the model’s output wins. Directly tied to the alignment goal.
    • Task-specific Metrics: For summarization, factuality/coverage/conciseness; for dialog, helpfulness/safety; for coding, test pass rates. Choose metrics aligned with target use.
    • Drift and Safety Metrics: Monitor KL to the base model, toxicity, hallucinations, and other safety signals.

    Exploration vs Exploitation in Text Generation

    • Exploration: Encourage diverse token choices to discover better responses—temperature > 0, top-k/top-p sampling, or adding entropy bonuses in the objective.
    • Exploitation: Increase probability of proven good responses (per rewards/advantages). Over-exploitation risks local optima: the model repeats mid-quality outputs and misses better ones.
    • Practical Balance: During rollout, avoid greedy decoding; use mild stochasticity. In updates, keep entropy from collapsing too fast.
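
The effect of temperature on rollout diversity can be seen directly. A small sketch (the logits are illustrative): lower temperature sharpens the distribution toward the top token, while T near 1 keeps entropy higher for exploration.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means more diverse sampling."""
    return -sum(q * math.log(q) for q in p if q > 0)

p_explore = softmax_with_temperature([2.0, 1.0, 0.0], T=1.0)  # flatter
p_exploit = softmax_with_temperature([2.0, 1.0, 0.0], T=0.2)  # peaked
```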

    Reward Shaping Techniques for LLMs

    • Length Control: Penalize or normalize for length to avoid long-answer reward hacking.
    • Content Coverage: Reward mentioning key points or factual correctness.
    • Safety Filters: Penalize unsafe or disallowed content via classifiers.
    • KL-based Shaping: Subtract β * KL to a reference to bake in stability directly into the reward rather than as a separate term.
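
KL-based shaping can be sketched per token. In this minimal sketch (function name and numbers are illustrative), each token gets a small penalty proportional to how far the policy's log-probability drifts from the reference, and the reward model's scalar is added at the final token.

```python
def shaped_rewards(token_logps, ref_token_logps, task_reward, beta=0.05):
    """Per-token shaped rewards: -beta * (log pi - log pi_ref) at each step,
    with the reward model's scalar added at the final token."""
    rewards = [-beta * (lp - rlp)
               for lp, rlp in zip(token_logps, ref_token_logps)]
    rewards[-1] += task_reward
    return rewards
```

Folding the KL into the reward this way means the rest of the PPO machinery needs no separate penalty term.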

    Hyperparameter Sensitivity

    • ε (PPO Clip): Controls how far policy can move per update; typical 0.1–0.2.
    • β (KL): Controls proximity to base model; adjust to maintain a target KL per token.
    • λ (GAE): Balances bias/variance in advantages; typical 0.9–0.97 in multi-step tasks; for single-step tasks, often not used.
    • Learning Rates: Separate LR for actor and critic can stabilize training. Use warmups and decays.
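
The GAE recursion referenced above can be sketched as follows (a minimal sketch; it assumes `values` carries one extra bootstrap entry for the state after the last step, zero for terminal states).

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.
    values has len(rewards) + 1 entries (includes the bootstrap value)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 1 this recovers full Monte Carlo returns minus the baseline (high variance, low bias); with λ = 0 it reduces to one-step TD errors (low variance, higher bias), which is why 0.9–0.97 is a common compromise.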

    Handling Noisy Preference Data

    • Aggregation: Use multiple raters, majority vote, or probabilistic modeling of rater reliability.
    • Cleaning: Remove outliers and contradictory pairs.
    • Ties: Exclude, down-weight, or enforce margin thresholds before applying loss (as discussed for DPO).

    Putting It Together

    • Start from a solid pre-trained LLM. Decide between PPO (needs reward model and rollouts) and DPO (uses preference pairs directly). If you have a good reward model and compute, PPO can finely shape behavior with KL constraints. If you prioritize simplicity and cost, DPO often provides strong gains quickly. In all cases, measure with human-centered metrics and watch for reward hacking and drift.

    04 Examples

    • 💡

      Reward Hacking by Length: Input: prompts like “Explain photosynthesis.” Process: the system receives reward correlated with response length. Output: the model learns to write very long answers, even if repetitive. Key point: if reward design favors length, the policy can game it; fix by normalizing for length or adding penalties for verbosity.

    • 💡

      PPO Summarization Pair: Input: prompt “Summarize this article,” with two candidate summaries A and B. Process: human preference and reward model assign higher reward to A (0.8) than B (0.2); current policy probabilities are P(A)=0.6, P(B)=0.4. Output: after PPO updates with clipping, probabilities shift toward A, e.g., P(A)=0.7, P(B)=0.3. Key point: PPO increases the chance of preferred outputs while avoiding large jumps.

    • 💡

      KL Penalty in Practice: Input: prompt about medical advice; base model is safe and factual. Process: during RL, a KL penalty to the base model discourages large distributional shifts. Output: responses remain stylistically and semantically similar to the base model while improving alignment. Key point: KL acts like a leash preventing drift and preserving knowledge.

    • 💡

      Advantage Computation: Input: (x,y) with reward R=0.8 and value prediction V=0.5. Process: compute A=R−V=0.3. Output: the policy gradient boosts tokens that produced y in proportion to A. Key point: advantage focuses learning on better-than-expected actions.

    • 💡

      Clipping Effect: Input: ratio r_t=1.5 with positive advantage A>0. Process: clip(r_t,1−ε,1+ε) with ε=0.2 gives 1.2; use min(1.5A,1.2A)=1.2A. Output: the update benefit is capped. Key point: clipping prevents excessively large probability increases from dominating training.
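
The arithmetic in this example can be checked in a few lines (A = 1.0 is chosen here purely for illustration):

```python
eps = 0.2
A = 1.0          # illustrative positive advantage
r = 1.5          # probability ratio from the example
r_clipped = max(1 - eps, min(r, 1 + eps))   # clip(1.5, 0.8, 1.2) -> 1.2
objective = min(r * A, r_clipped * A)       # min(1.5*A, 1.2*A) -> 1.2*A
```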

    • 💡

      TRPO vs PPO Update: Input: desire to limit KL to 0.01 per step. Process: TRPO explicitly solves a constrained step with Fisher information; PPO uses a clipped loss that indirectly controls KL. Output: TRPO keeps a hard trust region, PPO is simpler but may slightly violate the target KL. Key point: TRPO is theoretically strict; PPO is practical and widely used.

    • 💡

      Actor-Critic Rollout: Input: prompts for a dialogue assistant. Process: the actor proposes replies while the critic estimates values for each state/prompt; advantages guide updates. Output: more sample-efficient learning compared to pure policy gradients. Key point: a good critic reduces variance and speeds convergence, but can destabilize if inaccurate.

    • 💡

      DPO Pairwise Training: Input: (x, y+, y−) where humans prefer y+. Process: compute logits for both outputs and apply cross-entropy so the model ranks y+ above y−. Output: the model directly learns preferences without a reward model. Key point: simpler pipeline, standard supervised optimization.

    • 💡

      Handling DPO Ties: Input: pairs where raters view y+ and y− as equally good. Process: either exclude these pairs, down-weight their contribution, or require a margin before applying loss. Output: training ignores or softens ambiguous signals. Key point: managing ties avoids pushing the model in random directions.

    • 💡

      Reward Shaping Mix: Input: a summarization task. Process: reward includes coverage of key points, brevity, and factual accuracy; intrinsic bonuses add small exploration incentives. Output: the model learns to be concise, accurate, and complete. Key point: careful shaping reduces reward hacking and encourages balanced behavior.

    • 💡

      Exploration Temperature: Input: decode with temperature T=1.0 vs T=0.2 during rollouts. Process: higher T samples more diverse tokens, revealing alternative good answers. Output: better exploration produces richer training data for PPO. Key point: controlled randomness improves discovery of high-reward behaviors.

    • 💡

      Alignment Evaluation: Input: a held-out set of prompts with human-preferred references. Process: measure pairwise comparison accuracy and collect human ratings. Output: improved win-rate indicates better alignment despite similar perplexity. Key point: alignment needs human-grounded metrics, not just language modeling scores.

    05 Conclusion

    This lecture centered on the reinforcement learning stage of aligning language models to human preferences. It began by reminding us that reward hacking—like favoring long answers when reward correlates with length—can derail alignment if rewards are naive. The RLHF pipeline establishes the path: pre-train a model, gather preference comparisons, train a reward model, and optimize the policy with RL while constraining drift via KL penalties. Proximal Policy Optimization (PPO) emerged as the main workhorse: it uses a clipped surrogate objective and advantage estimates to make safe, incremental policy updates. A hands-on example showed how PPO gradually increases the probability of human-preferred summaries without making risky jumps.

    The lecture contrasted PPO with alternatives. TRPO offers a stricter, theory-backed KL constraint using second-order methods but is harder to implement and more costly at LLM scale. Actor-critic designs can boost sample efficiency by learning a value function alongside the policy but demand careful stabilization. Direct Preference Optimization (DPO) provides a streamlined route by discarding the reward model and learning directly from pairwise preferences as a classification task; it often improves stability and efficiency but hinges on clean preference data and thoughtful tie handling.

    Beyond algorithms, the instructor highlighted practical levers: reward shaping to avoid hacks and encourage desired traits, exploration–exploitation balance to discover better behaviors without getting stuck, and evaluation metrics that truly reflect alignment with human values. Hyperparameters like the PPO clip epsilon and KL penalty beta can make or break training and should be tuned with close monitoring of KL drift, rewards, and human-win rates.

    For immediate practice, you can implement a PPO loop for a summarization or dialogue task with a basic reward model and a KL penalty, or try DPO with curated preference pairs to reduce complexity. Evaluate using pairwise accuracy and human reviews, and iterate on reward shaping to reduce shortcuts. As next steps, explore TRPO for stricter trust regions, refine value estimation with GAE, and expand your evaluation suite to cover safety and factuality. The core message is clear: stable, constrained optimization paired with human-centered evaluation is the path to reliably aligned language models.

  • ✓Prefer DPO when simplicity matters: If building a reward model and running PPO is too costly, DPO offers a strong alternative. It turns preference learning into standard supervised training with cross-entropy. Ensure high-quality, representative preference pairs. Handle ties by exclusion, down-weighting, or margin thresholds.
  • ✓Use human-centered evaluation: Track human-win rates, pairwise accuracy, and task-specific quality metrics. Perplexity alone doesn’t reflect alignment. Add safety and factuality checks to prevent regressions. Continuously evaluate during training to catch drift early.
  • ✓Control KL explicitly or implicitly: Either include an explicit KL penalty term or rely on small ε plus monitoring. Target a reasonable KL range per update and adjust β or ε accordingly. Prevent runaway divergence that breaks fluency or safety. Consistent KL management is a key stabilizer.
  • ✓Handle noisy rewards carefully: Reward models can be wrong; average across batches, keep updates small, and validate with humans. Don’t overreact to single outlier rewards. Consider value and advantage smoothing. Robustness beats chasing every spike.
  • ✓Manage hyperparameters as a system: ε, β, learning rates, batch size, and epochs interact. Change one at a time and keep detailed logs. Use validation sets and early stopping to avoid overfitting. Systematic tuning saves time and prevents silent failures.
  • ✓Prefer small, frequent updates: Multiple gentle iterations outcompete occasional large jumps. This is safer for alignment and easier to debug. If you see big KL spikes, reduce step size or increase β. Incrementalism builds reliable improvements.
  • ✓Curate preference data: For DPO and reward modeling, clean, diverse, and consistent data matters most. Remove contradictory labels and handle ties thoughtfully. Bias in preferences will show up in the model. Invest in data quality for long-term gains.
  • ✓Pick algorithms by constraints: Use PPO when you have a good reward model and need fine-grained shaping; consider TRPO if strict KL control is required; adopt actor-critic variants for sample efficiency; use DPO for a simpler, cost-effective pipeline. Match choice to compute, data, and goals. Revisit choices as requirements evolve.
  • ✓Instrument your training: Log KL divergence, reward, entropy, advantage stats, and win-rates. Visualize trends to detect early issues like collapse or drift. Use checkpoints to backtrack from bad updates. Observability is an alignment safety net.
  • Expected Reward

    The average reward the model gets over many prompts and outputs. Optimizing expected reward means doing well overall, not just on one example. It’s central to reinforcement learning objectives. A good policy maximizes expected reward under the task distribution.

    KL Divergence

    A measure of how different one probability distribution is from another. In alignment, it measures how far the new policy moved from the old/base policy. Penalizing KL keeps updates small and safer. A strong penalty can prevent model drift and reward hacking.

    PPO (Proximal Policy Optimization)

    An RL algorithm that limits how much a policy can change in one update. It uses a clipped objective to prevent extreme probability shifts. PPO is popular for LLMs because it’s relatively stable and simple. It often works well with KL penalties and advantage estimation.

    TRPO (Trust Region Policy Optimization)

    An RL method that enforces a strict KL constraint per update using second-order information. It ensures updates stay within a trust region where approximations are valid. TRPO has strong theoretical grounding but is complex and compute-heavy. It’s less common for very large LLMs due to cost.
