Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL 1 | How I Study AI
📚 Stanford CS336: Language Modeling from Scratch — Lecture 16 of 17
Intermediate
Stanford Online
Tags: RLHF, YouTube

Key Summary

  • This session introduces alignment for language models and why next‑token prediction alone is not enough. When models only learn to guess the next word, they can hallucinate facts, produce toxic or biased text, and follow tricky prompts the wrong way. Alignment aims to make models helpful, honest, and harmless so they do what people actually want. The lecture lays out a practical recipe to achieve this with RLHF (Reinforcement Learning from Human Feedback).
  • RLHF is presented as a three‑step pipeline. First, collect prompts and human‑written (or edited) responses that show the behavior you want. Second, train a reward model that scores which response is better for a given prompt using pairwise preferences. Third, fine‑tune the language model with reinforcement learning so it generates responses the reward model prefers.
  • Data collection choices strongly shape outcomes. You decide which prompt types to include (questions, instructions, creative writing), where they come from (humans, models, or both), and how many you need. You also choose how responses are produced (fully human, model‑generated then edited, or mixtures) and how many alternatives per prompt. Diversity and quality control are key so the model learns broad, reliable behavior.
  • The reward model maps (prompt, response) → a scalar score that represents “how good” the response is. It is trained with a pairwise ranking loss that prefers the higher‑quality response among two candidates. The common loss is −log(sigmoid(R(P,Y1) − R(P,Y2))) when Y1 is labeled better than Y2. Minimizing this loss teaches the model to rank good responses above weaker ones.
  • Any model that can read text and output a single number can be a reward model, but a language model backbone is a common choice. You can reuse the base LM, a smaller LM, or even a different architecture. The key is that the reward model learns human preferences from labeled comparisons. Careful training and validation are needed because reward models can also make mistakes.

Why This Lecture Matters

Alignment with RLHF turns general language skill into trustworthy, goal‑directed behavior. This is crucial for anyone deploying models in real products—engineers, data scientists, product managers, and safety researchers—because customers care about accuracy, usefulness, and safety, not just fluency. Without alignment, models can hallucinate facts, echo toxic content, or be manipulated by tricky prompts, creating legal, ethical, and user‑experience risks. RLHF offers a concrete pipeline to encode human preferences and values into the training process: collect targeted prompts and responses, learn a reward signal from comparisons, and fine‑tune with robust RL (PPO) to optimize for what users prefer. In practice, this improves instruction following, reduces harmful outputs, and increases customer trust. This knowledge directly applies to building assistants, helpdesk bots, coding copilots, educational tutors, and domain‑specific experts (like medical or legal triage tools). Teams can stand up a focused RLHF system in weeks by curating domain prompts, gathering preference labels, and bootstrapping a reward model. It also opens career paths in applied AI safety, preference learning, and RL for language, where demand is growing rapidly as companies operationalize AI. Finally, alignment methods are a key differentiator in the industry: organizations that master data collection, reward modeling, and stable RL can deliver safer, more reliable AI systems that win user trust and meet regulatory expectations.

Lecture Summary


01 Overview

This lecture focuses on alignment for large language models: how to train models not just to predict the next word but to do what people actually want. The central motivation is that pure next‑token prediction on internet text yields fluent models that sometimes make up facts (hallucinations), can echo toxic or biased content, can be misled by adversarial prompts, and are often inconsistent at following instructions. To correct these shortcomings, we want models that are helpful (follow instructions and solve tasks), honest (factual and transparent about uncertainty), and harmless (avoid toxic, biased, and dangerous outputs). The practical method introduced is Reinforcement Learning from Human Feedback (RLHF), a three‑step approach that has become the standard path to alignment in practice.

The three steps are: (1) data collection of prompts and human responses that exhibit the desired behavior; (2) reward modeling, where a model is trained to predict which response is better for a given prompt from pairwise human preferences; and (3) reinforcement learning, where the language model is fine‑tuned to produce responses that the reward model scores highly. The reward model is a simple function that inputs a (prompt, response) pair and outputs a single scalar score that reflects quality. It is trained with a pairwise ranking loss so that better responses receive higher scores than weaker ones. For the reinforcement learning step, Proximal Policy Optimization (PPO) is introduced as a robust policy‑gradient algorithm that updates the model while constraining how much it can change at once, improving stability.

The lecture begins by defining the problem with just next‑token training and articulating alignment goals. It then delves into data collection: choosing prompt types (questions, instructions, creative writing), deciding sources (human‑written prompts, model‑generated prompts, or a mix), and ensuring diversity and representativeness. It also covers response generation trade‑offs: purely human responses, model‑generated then human‑edited responses, and how many candidate responses per prompt to create. A concrete example is given: building a CS Q&A assistant using questions from sources like Stack Overflow or Reddit with human‑written answers to seed preference data.

Next, it explains the reward model in detail. You can use a language model backbone or other architectures to score responses, but the common approach is to fine‑tune an LM to output a scalar score for each (prompt, response). The model is trained with a pairwise logistic loss: for two responses Y1 and Y2 to prompt P where Y1 is judged better, the loss −log(sigmoid(R(P,Y1) − R(P,Y2))) encourages the model to score Y1 higher than Y2. This naturally learns a ranking over responses that reflects human preferences, which becomes the reward signal for RL.
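The pairwise logistic loss is simple enough to sketch in a few lines of plain Python (a toy sketch: here the scores are bare numbers, whereas in practice R(P, Y) would come from an LM with a scalar head):

```python
import math

def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """-log(sigmoid(s1 - s2)); near zero when the preferred response scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_other))))

# Preferred response scores much higher -> loss is small
print(round(pairwise_loss(4.0, 1.0), 3))  # 0.049
# Ranking inverted -> loss is large, pushing the model to widen s1 - s2
print(round(pairwise_loss(1.0, 4.0), 3))  # 3.049
```

Note that only the score difference matters, which is why pairwise labels sidestep the calibration problems of absolute 1–10 ratings.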

Finally, it introduces reinforcement learning with PPO to fine‑tune the policy (the LM). In this framing, the agent is the LM generating a response token by token, the environment is the prompting task, and the reward comes from the learned reward model applied to the completed response. PPO maintains an old policy (used to collect data) and a new policy (being optimized), using a clipped objective that keeps the probability ratio of new vs. old policy close to 1, multiplied by an advantage estimate. This combination improves responses without taking destabilizing steps, leading to better instruction following and alignment.
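The clipped update can be illustrated per token (a minimal sketch; `eps` plays the role of ε, and the probability ratio and advantage are assumed to be precomputed scalars):

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One token's contribution to PPO's clipped surrogate objective (maximized)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min of the unclipped and clipped terms is pessimistic:
    # the policy gets no extra credit for drifting outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped * advantage)

# A large ratio is clipped, so the incentive to drift further vanishes:
print(round(ppo_clipped_term(1.5, 2.0), 3))  # 2.4 rather than 3.0
```

With a negative advantage the same min keeps the penalty unclipped in the harmful direction, which is what makes the objective a lower bound.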

The lecture also addresses a key question: since the reward model is a language model, can it hallucinate too? Yes, and that is a concern. Three mitigation strategies are described: train the reward model on factual data to reduce hallucinations, swap to a smaller or different model architecture that may hallucinate less, and use uncertainty estimation to down‑weight uncertain reward predictions. These techniques help prevent the fine‑tuned policy from overfitting to reward model errors.

The session targets learners who already understand basic language modeling and are ready to move into aligning models with human values and tasks. You should be comfortable with concepts like tokens, likelihood, and gradient‑based training. After completing this lecture, you will be able to describe the motivation for alignment, explain the RLHF pipeline end to end, define and train a reward model using pairwise preferences, and understand how PPO fine‑tunes a language model toward human‑preferred behavior. The lecture closes by previewing next time’s focus: limitations of RLHF and alternative alignment methods.

Key Takeaways

  • ✓Start with clear goals: define helpful, honest, and harmless for your domain. Write a simple rubric so labelers agree on what “better” means. Include accuracy, relevance, clarity, safety, and calibration of uncertainty. A shared rubric improves preference data quality and downstream results.
  • ✓Curate diverse prompts that reflect real use. Mix instructions, Q&A, reasoning, safety‑sensitive cases, and creative tasks. Tag prompts by type and difficulty to diagnose performance later. More diversity reduces overfitting to narrow behaviors.
  • ✓Collect multiple responses per prompt to fuel pairwise comparisons. Combine human‑written answers with model‑generated variants at different temperatures. This makes the reward model robust across styles and qualities. More contrast improves learning of preferences.
  • ✓Train a reward model with pairwise ranking instead of absolute scores. It’s easier for annotators and yields more consistent signals. Use the −log(sigmoid(score_diff)) loss and monitor pairwise accuracy on a held‑out set. Keep inputs formatted consistently with clear separators.
  • ✓Mitigate reward model hallucination from the start. Add factual calibration tasks, consider a smaller or different architecture, and estimate uncertainty with ensembles or dropout. Down‑weight uncertain scores during PPO. Revisit and retrain the reward model with hard negative examples periodically.
  • ✓Use SFT to create a strong starting policy. Fine‑tune on the best human responses before RL. SFT stabilizes PPO and reduces the number of RL updates required. This also anchors behavior for KL control.

Glossary

Alignment

Making a language model do what people actually want instead of only predicting the next word. It focuses on being helpful, honest, and harmless. Alignment adds goals and rules on top of raw language skill. It turns vague human wishes into specific training signals. Without it, models can sound smart but act in ways we don’t want.

RLHF (Reinforcement Learning from Human Feedback)

A method to align models using human preferences. People compare responses to the same prompt and pick the better one. A reward model learns from these comparisons to score responses. Then the language model is trained with RL to maximize that score.

Prompt

The input text you give a language model to start a task. It can be a question, an instruction, or a creative request. Good prompts clearly state what is wanted. They set the stage for the model’s response.

Response

The text the language model generates after reading a prompt. Responses can be short answers, explanations, or creative writing. In alignment, we judge responses by quality and safety. Better responses follow instructions and avoid harm.

Reward model

A model that reads a (prompt, response) pair and outputs a single scalar score for how good the response is. It is trained from pairwise human preferences so that preferred responses score higher. Its score becomes the reward signal during RL fine‑tuning.
  • Reinforcement learning turns the language model into an agent that generates responses (actions) in an environment (the prompting task) to maximize reward (the reward model’s score). PPO (Proximal Policy Optimization) is a standard algorithm to do this safely and steadily. PPO uses an “old” policy and a “new” policy and constrains updates so the new policy doesn’t drift too far. This stability prevents performance from collapsing while still improving reward.
  • PPO’s clipped objective centers on the probability ratio between new and old policy for the sampled actions. It multiplies this ratio by the advantage (how much better an action was than usual) and then clips the ratio within [1−ε, 1+ε] to avoid big, risky updates. This encourages helpful changes while discouraging drastic ones. In practice, this reliably improves instruction following.
  • The lecture addresses the concern that reward models can hallucinate too. Three remedies are highlighted: (1) train the reward model on factual datasets to sharpen accuracy, (2) use a smaller or different architecture that may hallucinate less, and (3) estimate uncertainty and down‑weight uncertain scores. These steps help align the policy without overfitting to reward model errors.
  • A concrete example illustrates dataset design: to build a CS Q&A assistant, gather questions from sources like Stack Overflow or Reddit. Have humans write high‑quality answers to these prompts. Use those pairs for preference labeling and reward modeling. Finally, run PPO to fine‑tune the LM toward those preferred answers.
  • Alignment goals are summarized as helpfulness, honesty, and harmlessness. Helpfulness means following instructions and providing useful content. Honesty means avoiding hallucinations and being truthful about uncertainty. Harmlessness means staying away from toxic, biased, or dangerous content.
  • The session emphasizes that data decisions are strategic. Choosing prompt domains, balancing human vs. model‑generated content, and collecting multiple responses per prompt all influence what the final model learns. More diverse and representative prompts lead to better generalization. Clear quality standards for responses raise the ceiling on alignment.
  • The overall message: next‑token training builds a strong foundation, but alignment aims the model at human goals. RLHF operationalizes “what we want” through human preferences and optimization. PPO provides a practical, stable way to close the loop from reward to policy. Next steps will cover RLHF’s limitations and alternative approaches.
02 Key Concepts

    • 01

      🎯 Alignment: Training LMs to do what people actually want. 🏠 It’s like teaching a smart parrot not only to repeat words but to be a helpful assistant who answers questions safely and honestly. 🔧 Technically, alignment adjusts a pretrained LM so its outputs optimize for human‑defined objectives, not just next‑token probability. 💡 Without alignment, models may hallucinate, be toxic or biased, and follow adversarial prompts. 📝 A practical target is the trio of helpful, honest, and harmless (HHH).

    • 02

      🎯 Shortcomings of next‑token prediction: fluent but unreliable behavior. 🏠 It’s like a student who memorized lots of books but isn’t graded on truthfulness, safety, or instructions. 🔧 Maximizing likelihood over web text learns patterns, not task goals or values; it can amplify dataset biases. 💡 This leads to hallucinations, toxicity, and weak instruction following. 📝 Fixing these requires adding objectives beyond next‑token prediction.

    • 03

      🎯 RLHF: Reinforcement Learning from Human Feedback. 🏠 It’s like a coach who watches two attempts and points to the better one so the player learns what “better” means. 🔧 Pipeline: collect prompts and responses, train a reward model from pairwise preferences, then RL‑fine‑tune the LM to maximize that reward. 💡 RLHF translates fuzzy human preferences into a trainable signal. 📝 It is widely used to make assistant‑style models.

    • 04

      🎯 Data collection for RLHF. 🏠 It’s like building a good test: choose varied questions, write reference answers, and include multiple choices to compare. 🔧 Decide prompt types (Q&A, instructions, creative), sources (human vs. model), volume, diversity, and response quality. 💡 Poor or narrow data narrows what the model learns to do. 📝 For a CS helper, gather Stack Overflow questions and craft high‑quality answers.

    • 05

      🎯 Reward model: scoring response quality. 🏠 It’s a judge that reads the prompt and response and gives a single score. 🔧 Function R(P,Y) → scalar; higher means better according to learned human preferences. 💡 This turns human taste into a numerical reward for RL. 📝 It can be an LM head fine‑tuned for scalar outputs.

    • 06

      🎯 Pairwise ranking loss. 🏠 It’s like saying “response A beats response B” and training the judge to agree. 🔧 With preferred Y1 over Y2 for prompt P, optimize −log(sigmoid(R(P,Y1) − R(P,Y2))). 💡 This directly teaches relative quality, which is easier to label than absolute scores. 📝 Over many pairs, the reward model learns a consistent ranking.

    • 07

      🎯 PPO (Proximal Policy Optimization). 🏠 Think of it as a safety harness that lets you improve steadily without jumping too far at once. 🔧 It uses the ratio of new vs. old policy probabilities, multiplies by advantage, and clips the ratio within [1−ε, 1+ε]. 💡 Clipping prevents destabilizing updates and collapse. 📝 PPO is a common choice for RLHF fine‑tuning.

    • 08

      🎯 Policy, environment, reward mapping for LMs. 🏠 It’s like a game where a player writes a reply, the world gives a score, and the player tries to improve. 🔧 Policy = LM generating tokens; environment = prompting task; reward = reward model’s score of the completed response. 💡 This framing lets us apply RL to text generation. 📝 Each sampled response is a trajectory with a final reward.

    • 09

      🎯 Advantage: how much better an action is than usual. 🏠 It’s a bonus compared to the normal outcome, like getting extra points for an above‑average move. 🔧 Estimated from returns minus a baseline (often a value function) to reduce variance. 💡 Using advantage stabilizes learning. 📝 PPO multiplies the probability ratio by the advantage to guide updates.

    • 10

      🎯 Prompt and response design choices. 🏠 Like writing assignments: the type of question changes what skills get practiced. 🔧 Include instructions, reasoning questions, safety‑sensitive prompts, and creative tasks; collect multiple responses per prompt to support pairwise comparisons. 💡 Variety improves generalization and safety. 📝 Human editing of model drafts can speed data creation while preserving quality.

    • 11

      🎯 Hallucination and toxicity risks. 🏠 Like a well‑spoken friend who sometimes confidently says the wrong thing or repeats rude phrases heard online. 🔧 Pretraining data contains errors and bias; next‑token objective doesn’t penalize falsehoods or harm. 💡 Alignment must explicitly discourage unsafe and untrue outputs. 📝 HHH (helpful, honest, harmless) captures the goals.

    • 12

      🎯 Reward model hallucination concern. 🏠 A judge can be fooled too if their sources are shaky. 🔧 Mitigate by training on factual data, using smaller/different architectures, and estimating uncertainty to down‑weight shaky scores. 💡 This prevents the policy from over‑optimizing flawed signals. 📝 Calibrated reward models yield better aligned policies.

    • 13

      🎯 Old vs. new policy in PPO. 🏠 Like practicing a routine: compare today’s moves to yesterday’s version and don’t change too abruptly. 🔧 Data is collected with the old policy; the new policy is trained using a clipped objective relative to the old. 💡 This on‑policy setup reduces distribution drift. 📝 Periodically update the old policy to the new one and repeat.

    • 14

      🎯 Preference labels over absolute scores. 🏠 Choosing a better essay is easier than giving it a 1–10 score. 🔧 Pairwise labels reduce rater disagreement and calibration issues. 💡 Leads to higher‑quality reward models with less annotation friction. 📝 Use multiple pairs per prompt for robust learning.

    • 15

      🎯 Response count per prompt. 🏠 More choices give a clearer sense of what “better” means. 🔧 Sampling multiple candidate responses supports richer pairwise comparisons. 💡 Improves ranking quality and reward model generalization. 📝 Combine human‑written and model‑generated alternatives.

    • 16

      🎯 Safety via constrained updates. 🏠 Guardrails stop the car from swerving off the road while still moving forward. 🔧 PPO’s clipping and optional KL penalties keep the policy near its prior behavior. 💡 This reduces mode collapse and reward hacking. 📝 Monitor KL and adjust learning schedules accordingly.

    • 17

      🎯 Uncertainty estimation for rewards. 🏠 If the judge is unsure, treat their score with caution. 🔧 Use ensembles, Monte Carlo dropout, or calibrated probabilities to estimate uncertainty and weight rewards. 💡 This reduces overfitting to noisy judgments. 📝 Down‑weight or filter high‑uncertainty training samples.

    • 18

      🎯 Choosing model sizes for reward modeling. 🏠 A smaller critic can be stricter and less imaginative, which may help factual judgment. 🔧 Smaller LMs or non‑LM classifiers can reduce hallucinations in scoring. 💡 Trade‑off: smaller models may miss nuance; larger ones may hallucinate. 📝 Validate with held‑out preference tests.

    • 19

      🎯 Concrete CS Q&A pipeline. 🏠 Like building a helpful forum moderator. 🔧 Gather CS questions (e.g., Stack Overflow), craft high‑quality answers, collect pairwise preferences, train the reward model, and run PPO fine‑tuning. 💡 Results in a model that answers CS questions more reliably. 📝 This example shows end‑to‑end RLHF in a focused domain.

    • 20

      🎯 Big picture: from prediction to preference‑aligned behavior. 🏠 Turning a good mimic into a good teammate. 🔧 RLHF injects human goals as a reward and uses stable RL to optimize for them. 💡 This addresses hallucinations, toxicity, and instruction following. 📝 Later sessions consider RLHF limitations and alternatives.

    03 Technical Details

    Overall architecture/structure

    1. Pretraining baseline
    • Start with a pretrained language model (LM) that predicts the next token given context. This model is fluent but not aligned: it can output untrue, unsafe, or unhelpful content because it has not been optimized for human intentions.
    2. Supervised fine‑tuning (SFT) on instructions (implicit in step 1 of RLHF)
    • The lecture’s first RLHF step—collect prompts with human responses—doubles as SFT data. Fine‑tune the base LM to mimic high‑quality human answers to prompts. This creates an instruction‑following policy π_SFT that is usually safer and more compliant than the raw pretrained model. SFT provides a starting point that makes subsequent RL more stable and sample‑efficient.
    3. Reward modeling (preference learning)
    • Build a dataset of tuples (P, Y_a, Y_b, label), where P is a prompt, Y_a and Y_b are two candidate responses to P, and label indicates which is preferred by humans (e.g., Y_a > Y_b). The reward model R(P, Y) outputs a scalar score for any (P, Y). Training minimizes a pairwise logistic loss so that R(P, Y_pref) > R(P, Y_nonpref).
    4. Reinforcement learning optimization (PPO)
    • Treat the LM as a stochastic policy π_θ(Y | P) that generates a response Y token by token. For each sampled response, compute a reward r = R(P, Y). Then update the policy with PPO using an advantage estimate that encourages responses with higher‑than‑expected reward. PPO uses the probability ratio between the new policy and the old policy to constrain updates.

    Data flow

    • Prompts are sampled from a curated pool. The current policy (often π_SFT initially) generates K candidate responses per prompt. Humans (or a trained preference model in a bootstrapping loop) rank pairs of responses. The reward model R is trained on these pairwise rankings. During PPO, the policy generates responses, is scored by R, and parameters are updated using the PPO objective. This loop repeats: periodically refresh rollouts, update R if new preference data arrives, and continue policy optimization.

    Reward model training in detail

    • Input representation: Concatenate prompt and response with clear separators (e.g., “<|prompt|> ... <|response|> ...”). Optionally include system instructions to anchor desired style (e.g., “Be helpful, honest, and harmless.”). Tokenize with the LM’s tokenizer.
    • Architecture: Commonly a pretrained LM with its final hidden state pooled (e.g., via mean or using the last token) and projected to a scalar head. Alternative architectures include sequence classifiers or smaller encoders that process (P, Y) jointly.
    • Loss: For a labeled pair (P, Y1, Y2) where Y1 is preferred, define s1 = R(P, Y1), s2 = R(P, Y2). The loss is L = −log(sigmoid(s1 − s2)). Intuition: if s1 ≫ s2, sigmoid approaches 1 and loss approaches 0; if s1 < s2, sigmoid approaches 0 and loss grows large, pushing the model to increase s1 − s2.
    • Optimization: Use standard optimizers (AdamW), small learning rates, gradient clipping, and early stopping on a validation set of preference pairs. Regularization (weight decay, dropout) and data augmentation (prompt order, formatting variants) improve robustness.
    • Calibration and sanity checks: Evaluate pairwise accuracy (how often R ranks the human‑preferred response higher). Plot score distributions to check for collapse or saturation. Use held‑out prompts with carefully created gold preferences.
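As a toy illustration of this loss and optimizer loop, the sketch below trains a linear reward head by gradient descent on two hand‑made preference pairs. The 2‑d feature vectors are a stand‑in for an LM backbone's pooled hidden states, and the data, learning rate, and iteration count are all illustrative assumptions:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Each (prompt, response) is reduced to a toy 2-d feature vector; in practice
# these would come from the LM backbone's pooled final hidden state.
preference_pairs = [  # (features of preferred response, features of rejected one)
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
]

w = [0.0, 0.0]  # scalar head: R(x) = w . x
lr = 0.5
for _ in range(200):
    for x_pref, x_rej in preference_pairs:
        diff = sum(wi * (a - b) for wi, a, b in zip(w, x_pref, x_rej))
        grad = sigmoid(diff) - 1.0  # d/d(diff) of -log(sigmoid(diff))
        for i in range(len(w)):
            w[i] -= lr * grad * (x_pref[i] - x_rej[i])

# After training, preferred responses outscore rejected ones:
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
assert all(score(a) > score(b) for a, b in preference_pairs)
```

The same pairwise accuracy check used in the final assert is what you would monitor on a held‑out validation set of preference pairs.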

    Handling reward model hallucination

    • Accuracy training: Supplement the preference data with factual verification tasks (true/false statements, fact‑checking questions). Train multi‑task or staged: first factual calibration, then preference ranking.
    • Architecture choice: Smaller or more conservative models may hallucinate less. Try distilled LMs or encoders with limited generative capacity.
    • Uncertainty estimation: Use ensembling (multiple reward models, average scores and measure variance), Monte Carlo dropout (multiple stochastic forward passes), or temperature‑scaled logits to estimate confidence. Down‑weight high‑uncertainty scores during PPO, or filter samples exceeding an uncertainty threshold.
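A minimal sketch of the ensemble variant follows; the 1/(1 + std) weighting is an illustrative choice for down‑weighting, not a formula from the lecture:

```python
from statistics import mean, pstdev

def weighted_reward(ensemble_scores: list[float]) -> tuple[float, float]:
    """Average an ensemble's reward scores; the weight shrinks with disagreement.

    Returns (mean score, weight in (0, 1]); the weight can scale this sample's
    contribution during PPO, or gate it out below some threshold.
    """
    mu = mean(ensemble_scores)
    weight = 1.0 / (1.0 + pstdev(ensemble_scores))  # assumed weighting scheme
    return mu, weight

confident = weighted_reward([0.80, 0.82, 0.79])  # low variance -> weight near 1
uncertain = weighted_reward([0.10, 0.90, 0.50])  # high variance -> weight shrinks
assert confident[1] > uncertain[1]
```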

    PPO fine‑tuning in detail

    • Policy and old policy: Maintain θ_old (frozen for a short horizon) and θ (trainable). Generate rollouts using π_{θ_old}. Optimize θ on these rollouts for several epochs; then set θ_old ← θ and collect fresh rollouts.
    • Objective: For each tokenized action sequence, PPO defines the likelihood ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t). Compute an advantage estimate A_t for each action. The clipped surrogate objective for policy loss is: L_policy = − E_t [ min( r_t A_t, clip(r_t, 1 − ε, 1 + ε) A_t ) ]. This penalizes updates that push r_t outside the [1−ε, 1+ε] band, limiting drift.
    • Advantage estimation: Compute returns from the final reward (which is sequence‑level). A common approach is to assign the same sequence reward to all tokens, then subtract a learned baseline V(s_t) (value function) and optionally use Generalized Advantage Estimation (GAE) to reduce variance. Even with a terminal reward, GAE helps stabilize training by smoothing credit assignment.
    • Entropy bonus: Add −β H(π_θ(· | s_t)) to encourage exploration and avoid early over‑commitment to narrow behaviors.
    • KL control: Optionally add a KL penalty between π_θ and a reference policy (often π_SFT) to keep the policy close to the supervised model. This can be adaptive: adjust the KL coefficient to target a desired KL per update.
    • Value loss: Train a value function V_φ(s) to predict expected returns for baseline subtraction. The total loss typically combines policy loss, value loss (with coefficient c_v), and entropy bonus.
    • Batching and epochs: Split rollouts into minibatches; run several epochs of PPO updates per batch. Monitor ratio statistics, clipped fraction, KL divergence, and reward trends to tune ε, learning rate, and batch sizes.
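Under the sequence‑level‑reward assumption (the reward model's scalar arrives only at the final token), the GAE computation described above can be sketched as:

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t] is the per-step reward (zero everywhere except the final token,
    which carries the reward model's sequence-level score); values[t] is the
    value head's estimate V(s_t). gamma and lam are illustrative defaults.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0  # terminal V = 0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running  # exponentially weighted sum
        advantages[t] = running
    return advantages

# Terminal reward of 1.0 on a 3-token response with a rough value estimate:
print([round(a, 3) for a in gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.8])])
```

Even though the reward is terminal, the value baseline spreads credit backward through the TD errors, which is the smoothing effect the bullet above refers to.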

    Credit assignment and sequence‑level reward

    • The reward model usually gives a single scalar for the entire response, but PPO operates per token. A simple approach is to assign that scalar uniformly to each token’s step (with discount γ=1). To reduce bias, normalize rewards across a batch (z‑scoring) and use a value baseline. Some implementations add small shaping terms (e.g., per‑token style checks) but must avoid reward hacking.
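The batch z‑scoring step can be sketched as:

```python
def zscore(rewards: list[float]) -> list[float]:
    """Normalize a batch of sequence-level rewards to zero mean, unit variance."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    sd = var ** 0.5 or 1.0  # guard against a zero-variance batch
    return [(r - mu) / sd for r in rewards]

print(zscore([1.0, 2.0, 3.0]))  # symmetric around 0
```

Normalizing per batch keeps the advantage scale stable across PPO updates even as the raw reward distribution drifts during training.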

    Sampling strategies for candidates

    • During data collection: sample multiple responses per prompt with temperature or nucleus sampling to create variety for pairwise comparisons. Include human‑written baselines to anchor quality. For difficult prompts, ask experts to produce gold answers.
    • During PPO rollouts: control sampling to balance exploration (higher temperature) and training stability (enough diversity to learn but not too noisy). Trim or reject obviously unsafe outputs using simple filters during training data generation.

    Quality and safety controls

    • Prompt diversity: Ensure representation across domains (reasoning, coding, safety‑sensitive, creative) and difficulty. Track per‑domain reward to detect overfitting.
    • Harmlessness and honesty: Include safety prompts and factual checks in the data; incorporate refusal patterns for unsafe requests; reward appropriately calibrated uncertainty (e.g., “I don’t know” when appropriate).
    • Bias mitigation: Include prompts that stress fairness and neutral tone; penalize biased language in preferences.

    Implementation guide (step by step)

    • Step 1: Define objectives and scope. Specify what “helpful, honest, harmless” means for your domain. Write a lightweight rubric for raters (accuracy, relevance, clarity, safety).
    • Step 2: Build the prompt set. Gather from real sources (e.g., Stack Overflow for CS Q&A) and synthesize extras with the base LM. Deduplicate, filter low‑quality items, and tag prompts by type.
    • Step 3: Produce responses. For each prompt, collect K responses: some human‑written, some model‑generated at varied temperatures, optionally human‑edited model drafts. Keep full provenance metadata (who/what produced each response, sampling settings, time).
    • Step 4: Collect preferences. Present pairs (and occasionally triads) to human labelers with the rubric. Require justifications on disagreements to improve future rater calibration.
    • Step 5: Train the reward model. Format inputs consistently, fine‑tune with the pairwise ranking loss, monitor pairwise accuracy and score distributions, and validate on a held‑out set.
    • Step 6: Prepare the SFT policy. Fine‑tune the base LM on the best human responses so it follows instructions reasonably well before RL.
    • Step 7: PPO setup. Initialize θ from the SFT model. Configure ε (e.g., 0.1–0.2), learning rate, batch size, and entropy/KL coefficients. Implement advantage estimation with a value head.
    • Step 8: Rollouts and updates. Sample prompts, generate responses with π_{θ_old}, score with R, compute returns and advantages, and optimize PPO for several epochs. Periodically update θ_old ← θ and repeat.
    • Step 9: Safety checks in loop. Track KL to SFT, reward model uncertainty, and simple safety filters. If KL drifts high, increase KL penalty; if uncertainty spikes, down‑weight those samples.
    • Step 10: Evaluation. Build a benchmark of prompts with human‑preferred references. Measure win‑rate against SFT and base models, refusal quality for unsafe prompts, and fact‑checking accuracy.

    Tips and warnings

    • Preference data quality dominates outcomes: ambiguous or inconsistent labeling confuses the reward model. Invest in rater training and clear rubrics.
    • Avoid reward hacking: models can learn to exploit reward model quirks (e.g., overly long or overly polite boilerplate). Regularly audit top‑scored responses and retrain R with hard negatives.
    • Keep updates gentle: too large an ε or learning rate can cause collapse or mode narrowing. Monitor the clipped fraction; high values suggest the optimizer is often trying to make too‑big updates.
    • Use reference model KL: tethering to the SFT policy preserves helpfulness while improving on specific deficiencies.
    • Sequence length discipline: ensure prompts and responses fit within context limits. Truncation can silently corrupt training signals.
    • Track uncertainty: don’t give full credit to shaky reward model judgments; filter or down‑weight them.
    • Maintain a validation loop with humans: occasionally run side‑by‑side evaluations to ensure progress matches human preferences, not just reward scores.
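    Two of the monitors mentioned above, KL drift and clipped fraction, can be computed directly from per‑token log‑probabilities. This is a minimal sketch; the function name and the particular KL estimator are assumptions, not code from the lecture.

```python
import math

def ppo_monitors(logp_new, logp_old, eps=0.2):
    """Batch diagnostics for a PPO step (a sketch, not tied to any framework).

    logp_new / logp_old: per-token log-probabilities under the current
    and rollout policies. Returns (approx_kl, clipped_fraction).
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    # Low-variance KL(old || new) estimator: (r - 1) - log r, always >= 0.
    approx_kl = sum((r - 1.0) - math.log(r) for r in ratios) / len(ratios)
    # Fraction of ratios outside the PPO trust region [1 - eps, 1 + eps];
    # a persistently high value means updates are too aggressive.
    clipped_fraction = sum(1 for r in ratios if abs(r - 1.0) > eps) / len(ratios)
    return approx_kl, clipped_fraction
```

    If `approx_kl` climbs epoch over epoch, increase the KL penalty or lower the learning rate, per the tips above.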

    Concrete example: CS Q&A assistant

    • Goal: A model that answers computer science questions reliably.
    • Data: Collect thousands of CS questions from Stack Overflow/Reddit (respecting licenses), add expert human answers, and generate additional model answers for variety.
    • Preferences: Show pairs to CS‑savvy labelers; prefer answers that are correct, concise, reference sources when possible, and state uncertainty if needed.
    • Reward model: Fine‑tune an LM head to score (prompt, answer). Validate that it ranks expert answers above weak ones.
    • PPO: Start from SFT. Set ε=0.2, add a KL penalty to SFT with a target KL of ~0.1–0.2 nats per token. Train with batch rollouts, monitor factuality and harmful content filters.
    • Outcome: Improved win‑rate vs. SFT on a held‑out CS benchmark, fewer hallucinations, better instruction adherence.
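    One common way to hold KL near a ~0.1–0.2 nats/token target is an adaptive penalty coefficient, in the style of Ziegler et al. (2019). The sketch below is a simplified variant; the thresholds and multipliers are illustrative choices, not values from the lecture.

```python
def update_kl_coef(beta, observed_kl, target_kl=0.15, band=1.5):
    """Adaptive KL penalty coefficient (a simplified sketch).

    If the measured per-token KL to the SFT policy drifts well above
    target, strengthen the penalty; if it falls well below, relax it;
    inside the band, leave it alone.
    """
    if observed_kl > target_kl * band:
        beta *= 1.5
    elif observed_kl < target_kl / band:
        beta /= 1.5
    return beta
```

    Run this once per rollout batch on the measured KL; the band keeps the coefficient from oscillating on small fluctuations.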

    By following this architecture and training recipe, you translate human preferences into a stable optimization signal that tunes a language model from “good next‑token predictor” into “useful assistant” aligned with helpfulness, honesty, and harmlessness.

    05Conclusion

    This lecture reframes language modeling from pure next‑token prediction to goal‑directed behavior that serves human needs. The core idea is alignment: we want models that are helpful, honest, and harmless, not just fluent. Reinforcement Learning from Human Feedback (RLHF) is the practical mechanism: collect prompts and high‑quality responses, learn a reward model from human preferences using pairwise ranking, and fine‑tune the language model with PPO so it generates outputs the reward model favors. PPO’s clipped objective and the old/new policy setup provide stability, improving responses without sudden destructive shifts.

    A careful data pipeline underpins everything. Prompt and response design determines what the model can learn; multiple candidates per prompt fuel pairwise preferences; and rater rubrics keep labels consistent. The reward model must be trustworthy: train it well, validate rigorously, and mitigate its own hallucinations through factual training, architectural choices, and uncertainty estimation. In the RL loop, use advantage estimation, entropy bonuses, and KL control to balance improvement with safety, and monitor metrics like KL drift, clipped fraction, and reward model uncertainty.

    To practice, try building a focused assistant such as a CS Q&A helper: gather real prompts, craft strong human answers, collect preference labels, train a reward model, and run PPO from an SFT base. Evaluate with human side‑by‑side tests and factuality checks to confirm real gains, not just higher reward scores. As a next step, study RLHF’s limitations and alternative alignment methods, and explore advanced techniques for safety, calibration, and evaluation. The key message to remember: pretraining gives language skill, but alignment shapes purpose. RLHF operationalizes human intent into a learnable signal, steering models toward behavior we prefer and trust.

  • ✓Configure PPO conservatively for stability. Use ε around 0.1–0.2, add an entropy bonus, and include a KL penalty to the SFT policy. Monitor the clipped ratio fraction and KL drift each epoch. If training becomes unstable, lower the learning rate or increase KL strength.
  • ✓Handle sequence‑level rewards carefully. Assign the final reward across tokens and use a value baseline with advantage estimation (e.g., GAE). Normalize rewards and advantages per batch. This reduces variance and improves sample efficiency.
  • ✓Prevent reward hacking with audits and countermeasures. Regularly inspect top‑scored outputs for boilerplate or exploits. Add those as negative examples to retrain the reward model. Adjust KL and entropy settings to discourage brittle strategies.
  • ✓Build a reliable evaluation suite. Use side‑by‑side human comparisons, factual QA tests, safety scenarios, and instruction‑following checks. Track win‑rate over SFT and base models. Evaluate per domain to see where to collect more data.
  • ✓Keep a human‑in‑the‑loop feedback cycle. As failure modes appear, collect new preference data to target them. Update the reward model and continue PPO. Iteration is key to sustained improvement.
  • ✓Document provenance and settings for every example. Record prompt IDs, generators, sampler settings, and rater IDs. This allows debugging odd behaviors and retraining with the right subsets. Good data hygiene saves time later.
  • ✓Treat uncertainty as a first‑class signal. Filter or reduce weight for samples where the reward model is unsure. This prevents the policy from chasing noise. It also highlights areas where new human labels would help most.
  • ✓Use safety filters during rollouts. Block obviously harmful outputs before scoring. This keeps bad samples from shaping the policy. It also protects labelers and infrastructure.
  • ✓Adjust exploration during training. Use modest temperature and nucleus sampling to balance diversity with learnability. Too little exploration stalls learning; too much creates noisy gradients. Tune these with validation feedback.
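    The advantage‑estimation item above (spread the final reward across tokens, then apply GAE against a value baseline) can be sketched as follows; the reward list is zero everywhere except the last token, which carries the sequence‑level reward model score. The function name and interface are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one response (a sketch).

    rewards: per-token rewards; for RLHF typically all zeros except the
    final token, which carries the sequence-level reward.
    values: value-head estimates, one per token plus a trailing
    bootstrap value (0.0 at the end of an episode).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards, accumulating discounted TD errors (deltas).
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

    With gamma = lam = 1 and a zero baseline this reduces to reward-to-go; in training, normalize the resulting advantages per batch as the checklist suggests.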
    Reward model

    A model that reads a prompt and a response and outputs a single score for how good the response is. It learns from human preference comparisons. Higher scores mean better alignment with what people prefer. This score becomes the reward for RL.

    Pairwise ranking loss

    A training loss that teaches the reward model which of two responses is better. It pushes the score of the preferred response above the non‑preferred one. The common form uses −log(sigmoid(score_diff)). This makes ranking more reliable than asking for absolute scores.
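    As a quick numeric check of this loss (a sketch; the function name is illustrative): equal scores give −log(0.5) ≈ 0.693, and the loss shrinks toward 0 as the preferred response's score pulls ahead.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_diff)), as in the definition above."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```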

    Sigmoid

    A math function that squashes any real number to a value between 0 and 1. It’s often used to turn score differences into probabilities. If the input is large and positive, output is near 1; if negative, near 0. It helps in pairwise ranking loss.

    PPO (Proximal Policy Optimization)

    An RL algorithm that improves policies steadily while preventing big, risky updates. It compares the new policy to the old one and clips changes that are too large. This keeps training stable. It’s widely used for fine‑tuning LMs with RLHF.
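    The clipping described here can be written in a few lines. This is the standard per‑sample form of the clipped surrogate objective, sketched with illustrative names; real implementations average it over a batch and negate it for gradient descent.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped objective (a sketch of the standard form).

    ratio = pi_new / pi_old; taking the min with the clipped ratio
    removes any incentive to move the policy more than eps away from
    the old policy in a single update.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

    With a positive advantage the objective is capped at (1 + eps) * advantage, so even a very large ratio cannot produce a big update.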
