•Alignment means teaching a pre-trained language model to act the way people want: safe, helpful, and harmless. A pre-trained model is like a bag of knowledge with no idea how to use it, so it may hallucinate or say unsafe things. Alignment adds an outer layer of behavior so the model answers clearly, avoids harm, and respects user intent.
•Supervised Fine-Tuning (SFT) is the first common step. Humans write example prompts and high-quality answers, and the model learns to imitate those answers. This makes the model follow instructions better, respond in the right format, and be more polite and accurate.
•Collecting SFT data can be done by hiring annotators, reusing public datasets, or mixing both. Clear instructions for annotators, examples of good and bad responses, and multiple reviewers per prompt improve quality. The training objective is standard: maximize the probability of the human demonstration given the prompt.
•Choosing tasks for SFT depends on the final product goals: Q&A, instruction following, dialogue, summarization, or code. Larger datasets and models help, but hyperparameters like learning rate, batch size, and regularization matter too. SFT can be expensive and brittle across domains, yet it is widely used and effective.
•Reinforcement Learning from Human Feedback (RLHF) teaches the model to produce outputs humans prefer. It trains a reward model to score outputs based on human rankings or preferences. The language model is then optimized with reinforcement learning (often PPO) to maximize that score.
•Collect human preferences by showing multiple model outputs for the same prompt and asking humans to rank, compare pairs, or give scores. Ranking is usually most accurate but most costly; pairwise is a good balance; scoring is cheaper but noisier. Good instructions, examples, and multiple annotators per item improve consistency.
Why This Lecture Matters
Alignment transforms a capable but unpredictable language model into a trustworthy assistant suitable for real users. Product teams building chatbots, copilots, and search assistants need models that follow instructions, avoid harm, and handle sensitive topics carefully. SFT and RLHF provide practical, proven recipes to achieve this by combining imitation of great responses with direct optimization for human preferences. This reduces hallucinations, discourages unsafe answers, and tunes tone and formatting to fit brand and legal requirements. For researchers and engineers, mastering alignment opens doors to safety engineering, human-in-the-loop systems, and responsible AI development. In industry, aligned models are now table stakes: regulators, customers, and partners expect systems that respect safety and fairness norms. Learning these methods helps you design robust data pipelines, build reward models that resist gaming, and run stable RL training. These skills map directly to real projects—customer support bots, developer copilots, content moderation helpers—and improve career prospects in AI safety, applied ML, and product-oriented ML engineering.
Lecture Summary
01 Overview
This lecture explains alignment: the process of teaching a pre-trained language model to behave in ways people want—specifically to be safe, helpful, and harmless. A pre-trained model learns patterns from internet-scale text, but that alone does not guarantee it will answer user questions correctly, avoid harmful content, or follow instructions well. The central message is that alignment is not optional if we want language models to work reliably in the real world. The lecture focuses on two widely used alignment paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). It also briefly mentions Constitutional AI as an emerging approach.
The audience is assumed to already understand basic language modeling: transformers, pre-training, and optimization. If you know what a token is, how a transformer predicts the next token, and what loss functions like cross-entropy do, you are in the right place. Familiarity with optimization (SGD/Adam) and the idea of fine-tuning is helpful. Some exposure to reinforcement learning makes RLHF easier to grasp, but the lecture introduces the essentials you need to follow along, like what a reward model is and why PPO is used.
After studying this material, you will be able to: (1) describe why alignment is necessary even for very capable pre-trained models; (2) outline and implement a basic SFT pipeline, including prompt/response data collection, annotator instruction design, and training; (3) design and collect human preference data for RLHF; (4) train a reward model using pairwise or ranking losses; and (5) run a reinforcement learning loop (commonly PPO) that uses the reward model to push the policy toward preferred outputs. Beyond that, you will understand trade-offs—like why ranking is accurate but costly, why pairwise is a sensible middle ground, and why scoring is cheaper but noisier. You’ll also learn practical tips for reducing instability and brittleness.
The structure begins by defining alignment and motivating it with real risks: hallucinations, toxicity, format issues, and unhelpful behavior. It frames an aligned model as one that is safe, helpful, and harmless, echoing a Hippocratic-like promise: first, do no harm. Then it covers SFT in depth: what the data looks like (prompt x, demonstration y), how to collect it (hired annotators, existing datasets, or both), what tasks to include, and how to train (maximize log-likelihood of y given x). The lecture next introduces RLHF: collecting human preferences over model outputs, training a reward model to predict those preferences, and using reinforcement learning—especially PPO—to optimize the model to maximize the reward. Throughout, it emphasizes practical realities: alignment data is expensive, optimization can be unstable, and results can be brittle if narrowly trained. It closes by noting that despite these challenges, SFT and RLHF are used by nearly all major model providers, and hints that Constitutional AI will be covered next time as another promising direction.
Key Takeaways
✓Start with a strong SFT baseline: Collect high-quality prompt–response pairs, write a clear rubric, and include safe refusals. A solid SFT model makes PPO training easier and more stable. It also sets tone and formatting that RLHF can refine instead of reinvent. Small, carefully curated datasets often beat large, messy ones at the beginning.
✓Choose preference labels wisely: Ranking is accurate but expensive; pairwise generally balances cost and quality; scoring is cheapest but noisy. Use pairwise for scale and keep some full rankings for critical prompts. Make instructions and examples crystal clear for raters. Always collect multiple labels per item to measure agreement.
✓Design the reward model carefully: Reuse the base architecture with a simple scalar head and use pairwise logistic loss. Regularize, monitor pairwise accuracy, and validate on a held-out set. Beware of reward hacking—periodically audit top-scoring outputs. Update the preference dataset with tricky counterexamples.
✓Stabilize PPO training: Use conservative learning rates, adequate batch sizes, and clipping. Monitor KL to a reference policy (often SFT) to prevent drift. Consider an entropy bonus to maintain diversity without encouraging babble. Stop and adjust if reward spikes while quality drops—this signals hacking.
✓Encode safety directly in data: Include refusal examples and safe redirections in SFT and preferences. Prefer helpful refusals over unsafe compliance in pairwise labels. Add coverage for sensitive domains (self-harm, hate, violence, privacy). Test with red-team prompts before deployment.
✓Balance verbosity and concision: Without guidance, RLHF can over-reward longer outputs. Include preferences that favor concise, to-the-point answers. Track average length during PPO and set evaluation tasks with strict length limits. Adjust reward model or data if the model rambles.
Glossary
Alignment
Teaching a model to behave in ways people want, like being safe, helpful, and harmless. It adds rules and goals on top of what the model learned from reading lots of text. The focus is on making answers useful, kind, and not dangerous. It turns raw skill into responsible behavior. Without it, the model can be unpredictable.
Supervised Fine-Tuning (SFT)
Training a model to imitate high-quality human-written answers to prompts. The model learns by seeing example questions and ideal responses. It uses standard supervised learning with a loss that rewards matching the human output. This shapes tone, format, and basic safety behavior.
Reinforcement Learning from Human Feedback (RLHF)
A method that uses human preferences to guide a model’s behavior. People compare model outputs to say which they like better. A reward model is trained to predict those choices. The main model is then optimized to get higher rewards.
Reward Model (RM)
A small model that judges how good an output is for a given prompt. It takes the prompt and the answer, and returns a score. This score should match what humans would prefer. It turns human choices into a training signal.
•The reward model is usually the same transformer architecture as the base LM plus a small head that outputs a single score. For pairwise data, it’s trained with a logistic (sigmoid) loss that rewards higher scores for human-preferred outputs. Larger datasets and models help, along with standard optimization and regularization.
•The RL step samples outputs from the current model, scores them with the reward model, and updates the model policy to favor high-score outputs. Proximal Policy Optimization (PPO) stabilizes updates by clipping policy changes. Careful choices of batch size and learning rate improve stability.
•RLHF has challenges: data cost, training instability, and brittleness to the specific reward model. But it strongly improves helpfulness and safety when added on top of SFT. Most major AI companies use SFT followed by RLHF to align their systems.
•Alignment is critical for real-world deployment. Without it, models can be toxic, deceptive, or unhelpful, and may ignore user intent. Alignment steers model behavior toward human values such as safety, helpfulness, and honesty.
•Constitutional AI is a third paradigm not covered in depth here. It uses a set of principles (a “constitution”) to guide the model without humans directly labeling each comparison. The model critiques and revises its own outputs to follow those rules.
•Putting it all together, a typical pipeline is: pre-train on large text; SFT on demonstrations; collect human preferences; train a reward model; run PPO with a KL/clipping constraint; evaluate and iterate. Each step requires careful data design, clear instructions, and robust training choices.
02 Key Concepts
01
🎯 Alignment: Teaching a language model to act in line with human values and intent. 🏠 It’s like giving a very smart robot a rulebook so it helps politely and safely instead of guessing how to behave. 🔧 Technically, alignment changes the model’s behavior after pre-training using data and objectives that reward desired outputs and discourage harmful ones. 💡 Without alignment, models may hallucinate, be toxic, ignore instructions, or act unpredictably. 📝 Example: A raw model might give step-by-step bomb-making instructions; an aligned model refuses and suggests safe alternatives.
02
🎯 Safe, Helpful, Harmless (the goal triad). 🏠 Think of it like a doctor’s promise: first do no harm, then be useful, and be kind. 🔧 Safe means avoiding harmful or toxic content; helpful means following instructions and solving user problems; harmless means not being deceptive, manipulative, or biased. 💡 Without this triad, users lose trust and systems can cause harm. 📝 Example: When asked for medical advice, an aligned model gives careful, general guidance and suggests seeing a doctor instead of offering dangerous prescriptions.
03
🎯 SFT (Supervised Fine-Tuning): Imitation learning from human-written demonstrations. 🏠 It’s like a student copying high-quality homework to learn how good answers look. 🔧 The dataset contains prompts x and desired outputs y; training maximizes the probability of y given x (standard supervised loss). 💡 Without SFT, a model may not follow instructions or format answers clearly. 📝 Example: Prompt: “Summarize this paragraph in two sentences.” SFT teaches the model to respond exactly in two sentences with the right style.
04
🎯 SFT data collection strategies. 🏠 It’s like building a recipe book by hiring chefs, borrowing recipes, or mixing both. 🔧 Options include hiring annotators to write answers, using public datasets (QA, dialogue, summarization), and combining sources to scale. 💡 Without enough diverse, high-quality data, the model overfits and fails outside the training set. 📝 Example: A company mixes customer chat logs (cleaned) with curated instruction datasets to cover common and edge cases.
05
🎯 Annotator guidance for SFT quality. 🏠 Like giving graders a rubric with examples of A+ and D responses. 🔧 Provide clear instructions (helpful, harmless, honest, concise, polite), concrete good/bad examples, and use multiple annotators per prompt with aggregation. 💡 Without a rubric and examples, responses vary widely and training becomes noisy. 📝 Example: Annotators see a style guide and two contrasting summaries before writing their own.
06
🎯 SFT optimization basics. 🏠 Training is like nudging a compass to point at the target answer. 🔧 Use negative log-likelihood L = −log p(y|x) with optimizers like Adam; tune learning rate, batch size, and regularization (dropout/weight decay). 💡 Poor hyperparameters cause divergence or over/underfitting. 📝 Example: A small learning rate prevents the model from forgetting its pre-training while learning the new format.
07
🎯 SFT challenges: cost, scale, brittleness. 🏠 It’s like making a huge custom textbook—expensive, time-consuming, and not perfect for every class. 🔧 High-quality demos cost money; covering many skills requires large, varied datasets; models trained on narrow domains may not generalize. 💡 Ignoring these limits can yield a model that looks great on train-like tasks but fails in the wild. 📝 Example: A model trained mostly on customer support fails at creative writing prompts.
08
🎯 RLHF: Optimizing behavior from human preferences. 🏠 It’s like asking many judges to pick the best essay and then teaching the writer to produce more essays like the winners. 🔧 Collect human preferences, train a reward model to score outputs, then use RL (often PPO) so the policy produces higher-scoring outputs. 💡 Without preference optimization, models may imitate style (SFT) but miss what humans truly prefer across choices. 📝 Example: For the same prompt, outputs with better factuality and tone win; the model learns to favor those.
09
🎯 Preference data formats: ranking, pairwise, scoring. 🏠 Ranking is like ordering medals; pairwise is choosing between two snacks; scoring is giving star ratings. 🔧 Ranking is most accurate but expensive; pairwise balances cost and accuracy; scoring is cheapest but noisy. 💡 Picking the wrong scheme wastes budget or degrades label quality. 📝 Example: A team uses pairwise comparisons to scale to millions of judgments with acceptable consistency.
10
🎯 Reward model architecture and loss. 🏠 Add a tiny “judge” on top of the same brain as the language model. 🔧 A transformer with a linear head outputs a single score; pairwise loss L = −log σ(r(y1) − r(y2)) pushes preferred outputs to higher scores. 💡 Without a well-trained reward model, RL pushes the policy in the wrong direction. 📝 Example: When humans prefer concise answers, the reward head learns to score concise completions higher.
11
🎯 PPO for RLHF. 🏠 It’s like telling the model: improve your essay, but don’t change your writing style too suddenly. 🔧 PPO computes advantages and clips policy updates so the new policy isn’t too far from the old one, stabilizing training. 💡 Without clipping, the policy may collapse or chase spurious rewards. 📝 Example: The model improves helpfulness each step but avoids wild jumps that break grammar.
12
🎯 Stability tips for RLHF. 🏠 Think of it as careful driving: steady speed, good brakes. 🔧 Use larger batch sizes for stable gradients, smaller learning rates, and clipping to prevent large policy shifts; monitor reward hacking. 💡 Instability can make the model worse than the SFT baseline. 📝 Example: Lowering the learning rate stops oscillations where the model alternately becomes too verbose and too terse.
13
🎯 RLHF challenges: cost, instability, brittleness. 🏠 Like training an athlete with pricey coaches, tough workouts, and a narrow scoreboard. 🔧 Labeling preferences is expensive; RL is sensitive and can diverge; optimizing to one reward model may not generalize to others. 💡 Overfitting to the reward model leads to reward hacking. 📝 Example: The model learns to mimic keywords the reward model likes instead of being truly helpful.
14
🎯 The alignment pipeline end-to-end. 🏠 It’s like building a car: engine (pre-train), driving lessons (SFT), road tests and scoring (reward model), and careful tuning (PPO). 🔧 Pre-train on text; SFT on demonstrations; collect preferences; train reward model; optimize with PPO under a constraint; evaluate and repeat. 💡 Skipping steps yields unsafe or unhelpful behavior. 📝 Example: A chatbot product launches only after this full pipeline plus red-team testing.
15
🎯 Human instructions and examples improve every stage. 🏠 A good rubric helps both teachers and students. 🔧 Clear guidelines reduce label noise in SFT and preference data and make reward model training more reliable. 💡 Ambiguous instructions create inconsistent labels and weak signals. 📝 Example: Specifying tone (“be polite and concise”) produces more consistent ratings and better RLHF outcomes.
16
🎯 Task coverage matters for usefulness. 🏠 If you only practice piano scales, you won’t be ready for a concert. 🔧 Include the tasks you care about—QA, instructions, dialogue, summarization, code—in both SFT and preferences to generalize at deployment. 💡 Missing tasks cause unexpected failures in production. 📝 Example: Adding code review tasks to SFT improves developer-assistant performance.
17
🎯 Safety and refusal behavior are part of alignment. 🏠 The model should know when to say “no.” 🔧 Use SFT examples and preference data where safe refusals beat unsafe answers, so the reward model favors safe behavior. 💡 Without explicit safety signals, the model may comply with harmful instructions. 📝 Example: “How do I hack a Wi-Fi?” returns a refusal plus guidance to legal cybersecurity learning paths.
18
🎯 Constitutional AI (briefly): principle-driven self-critiquing. 🏠 Like giving the model a handbook of rules to follow on its own. 🔧 A constitution (e.g., be helpful, harmless, honest) guides the model to critique and revise outputs without direct human labels each time. 💡 This can reduce labeling cost but needs careful principle design. 📝 Example: The model red-teams its own answer, revises it to be safer, and returns the revised version.
19
🎯 Why alignment is non-optional. 🏠 A super-smart parrot still needs a guide to be a good helper. 🔧 Alignment prevents harm, increases usefulness, and improves trust by steering behavior beyond raw next-token prediction. 💡 Without it, capable models fail real-world standards and cause risk. 📝 Example: An unaligned assistant gives misleading medical advice; an aligned one provides cautious guidance and disclaimers.
03 Technical Details
Overall Architecture/Structure
Pre-train a base language model (LM). The LM learns general world knowledge and linguistic patterns from large-scale text by predicting the next token. This step provides capability but not aligned behavior: it can generate fluent text but may be unhelpful, unsafe, or inconsistent with user intent.
Supervised Fine-Tuning (SFT) on demonstrations. Build a dataset D_sft = {(x_i, y_i)} where x_i is a prompt and y_i is a human-written target output. Train with standard supervised learning to maximize log pθ(y|x). This teaches the model to imitate desired answers, follow instructions, and follow a consistent format and tone.
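The SFT objective can be made concrete with a tiny numeric sketch. The logits below are hand-picked toy values over a 5-token vocabulary, standing in for a real model's outputs; only the loss computation mirrors the actual training objective:

```python
import torch
import torch.nn.functional as F

# Toy next-token logits over a 5-token vocabulary for a 3-token demonstration y.
logits = torch.tensor([[2.0, 0.1, 0.1, 0.1, 0.1],
                       [0.1, 2.0, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 2.0, 0.1, 0.1]])
y = torch.tensor([0, 1, 2])  # the demonstration tokens

# SFT objective: minimize L = -log p(y|x), averaged over target positions.
nll = F.cross_entropy(logits, y)
```

Maximizing log pθ(y|x) is exactly minimizing this per-token negative log-likelihood, which is why SFT reuses the standard pre-training loss machinery.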
Collect human preferences for RLHF. For prompts x, sample multiple candidate outputs from the current model (often the SFT model). Ask humans to produce (a) rankings, (b) pairwise preferences, or (c) scalar scores for the outputs. This yields data D_pref used to train the reward model.
Train a reward model (RM). The RM shares the transformer backbone with the LM or uses a similar architecture. It adds a small head that outputs a scalar reward rφ(x, y). Train φ so that rφ correlates with human preference: for pairwise (y_w ≻ y_l), minimize L = −log σ(rφ(x, y_w) − rφ(x, y_l)).
Reinforcement learning (RL) with the reward model. Optimize the LM parameters θ to maximize expected reward E[rφ(x, y)] under the LM policy πθ. PPO is commonly used. It stabilizes updates by clipping the policy ratio so the new policy doesn’t move too far from the old one in a single step. In practice, additional constraints like KL penalties to a reference policy keep the distribution close to the SFT behavior.
Evaluate and iterate. Check safety, helpfulness, and harmlessness with internal evals and human audits (red-teaming). Update datasets, retrain RM, and rerun PPO as needed. Deploy only after passing quality and safety gates.
Data Flow
During SFT, the model is trained on prompts with teacher forcing on the human demonstrations y. After SFT converges, the SFT policy is used to sample candidate outputs for preference collection. Human preferences feed into reward model training, yielding rφ. PPO then uses rφ to update πθ, and the updated policy produces better outputs over time.
Code/Implementation Details (conceptual)
Language/framework: PyTorch with Hugging Face Transformers is common, but any modern deep learning framework works. RLHF loops are often implemented using libraries like TRL (Transformers Reinforcement Learning) that provide PPO wrappers.
SFT Training Loop (pseudocode):
for batch in D_sft:
    x, y = batch
    logits = LM(x)
    loss = cross_entropy(logits, y)  # loss-masked so only target tokens y contribute
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Key pieces:
Tokenization: Break text into tokens compatible with the LM tokenizer.
Packing: Ensure prompts and targets are concatenated with proper special tokens (e.g., BOS, EOS, instruction separators).
Loss masking: Only compute loss on target tokens (y), not on prompt tokens (x), unless you intentionally train on the full sequence.
Hyperparameters: Start with a small learning rate; warmup to avoid catastrophic forgetting; use gradient clipping to stabilize.
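The loss-masking step above can be sketched in PyTorch. This is a minimal illustration with toy token ids and random logits in place of a real model; the key detail is marking prompt positions with -100 so `cross_entropy` ignores them:

```python
import torch
import torch.nn.functional as F

# Toy example: 10-token vocabulary, 3 prompt tokens (x) and 3 target tokens (y).
prompt_ids = torch.tensor([1, 2, 3])
target_ids = torch.tensor([4, 5, 6])
input_ids = torch.cat([prompt_ids, target_ids])

# Mask prompt positions with -100 so the loss only covers target tokens.
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100

# Random logits stand in for a real model's output: (seq_len, vocab_size).
torch.manual_seed(0)
logits = torch.randn(len(input_ids), 10)

# Causal-LM shift: position t predicts token t+1.
shift_logits = logits[:-1]
shift_labels = labels[1:]
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```

With the mask in place, gradient signal comes only from the demonstration tokens, which is the default behavior you usually want for instruction data.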
Preference Data Collection
Candidate generation: For each x, sample k outputs from the SFT model using temperature sampling or top-p sampling (nucleus). Keep outputs diverse to expose clear preference differences.
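Temperature and top-p (nucleus) sampling can be sketched in a few lines of NumPy. The values below are illustrative, and real pipelines typically rely on a library's generation utilities rather than hand-rolled sampling:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out all but the smallest set of tokens whose cumulative mass reaches p."""
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1      # how many tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()          # renormalize over the kept tokens

def sample_with_temperature(logits, temperature=0.8, p=0.9, rng=None):
    """Temperature-scale logits, apply top-p filtering, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())     # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))
```

Raising the temperature or p widens the candidate pool, which is what produces the diversity the labeling step needs.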
Labeling interface: Present outputs blind (no model identity) to avoid annotator bias. Randomize order to remove position effects. Capture detailed reasons for choices when possible (helps auditing).
Label format:
Ranking: Given y1..yk, annotators order them best→worst.
Pairwise: Given pairs (yi, yj), annotators choose preferred.
Scoring: Annotators rate each yi on 1–5 or 1–10 scales.
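A ranking over k outputs can be decomposed into k·(k−1)/2 pairwise labels, which is one common way ranking data is consumed by a pairwise reward-model loss. A minimal sketch:

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Convert a best-to-worst ranking into (winner, loser) pairwise labels.

    Because the input is already ordered best-to-worst, every pair produced by
    combinations() has the preferred output first.
    """
    return [(winner, loser) for winner, loser in combinations(ranked_outputs, 2)]
```

This is why rankings carry more signal per annotation than single pairwise judgments, at correspondingly higher labeling cost.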
Reward Model Training
Architecture: A transformer encoder-decoder or decoder-only model; often reuse the LM architecture with a scalar head. Feed [x, y] concatenated; pool the final token hidden state; project to a scalar r.
Loss:
Pairwise (most common): L = −log σ(r(x, y_w) − r(x, y_l)). This is equivalent to a Bradley-Terry style preference model.
Ranking: Listwise losses (less common due to complexity) can be used but are harder to optimize.
Scoring: Regression with MSE to match human-provided scores; beware inter-annotator variance.
Regularization: Weight decay, dropout, early stopping, and potentially label smoothing on scores. Monitor pairwise accuracy on a held-out set.
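The pairwise loss above is a one-liner in PyTorch. The reward values here are toy numbers standing in for scalar-head outputs on a batch of comparisons:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: L = -log sigmoid(r_w - r_l).

    r_chosen / r_rejected are the scalar reward-head outputs for the
    human-preferred and dispreferred completions of the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 3 comparisons where the reward model already prefers the chosen outputs.
r_w = torch.tensor([2.0, 1.5, 0.3])
r_l = torch.tensor([0.5, 1.0, -0.2])
loss = pairwise_rm_loss(r_w, r_l)

# Pairwise accuracy: fraction of comparisons where the chosen output scores higher.
acc = (r_w > r_l).float().mean()
```

The loss shrinks as the margin between chosen and rejected scores grows, so monitoring pairwise accuracy on held-out data is a natural sanity check during training.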
RL with PPO
Objectives: Maximize expected reward while keeping the policy close to a baseline (often the SFT model). A KL penalty or PPO clipping achieves this.
PPO details:
Rollouts: Sample outputs y ~ πθ(.|x) for a batch of prompts.
Rewards: Compute r = rφ(x, y); optionally add a KL penalty: r_total = r − β·KL(πθ || π_ref).
Advantage estimation: Compute advantage A using value function or generalized advantage estimation (GAE). In practice for text, a learned value head or Monte Carlo return is used.
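The clipped surrogate and the KL-penalized reward can be sketched as below. Token-level bookkeeping is omitted, and the log-probabilities are treated as per-sequence toy values, so this is a simplified sketch of the PPO machinery rather than a full implementation:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: limits how far the policy can move per update."""
    ratio = torch.exp(logp_new - logp_old)           # pi_new(y|x) / pi_old(y|x)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()     # negate: we minimize

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """r_total = r - beta * KL estimate, keeping the policy near the reference (SFT) model."""
    return reward - beta * (logp_policy - logp_ref)
```

When the policy has not moved (logp_new == logp_old), the ratio is 1 and clipping is inactive; large moves get their gradient contribution capped, which is the source of PPO's stability.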
Step 1: Start from a pre-trained base LM.
Obtain or pre-train a capable base checkpoint; all later stages initialize from it.
Step 2: Collect SFT data.
Hire annotators or curate public datasets; provide a rubric and good/bad examples.
Ensure diversity (domains, styles, lengths) and safety coverage (refusals to unsafe prompts).
Preprocess: de-duplicate, clean, normalize formatting; split into train/val/test.
Step 3: Train the SFT model.
Initialize from the pre-trained checkpoint.
Train with masked cross-entropy on target tokens; tune learning rate and batch size.
Validate on held-out SFT prompts; check quality and format compliance.
Step 4: Collect preference data.
For each prompt, sample diverse outputs from the SFT model (vary sampling temperature and seeds).
Obtain human preferences via ranking or pairwise comparisons with clear instructions.
Aggregate labels (majority vote, optionally modeling annotator reliability) and build D_pref.
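Majority-vote aggregation for a single pairwise comparison might look like this minimal sketch; the agreement fraction it returns is a simple stand-in for more sophisticated annotator-reliability models:

```python
from collections import Counter

def aggregate_pairwise(labels):
    """Majority-vote a list of per-annotator choices ('A' or 'B') for one comparison.

    Returns (winner, agreement), where agreement is the fraction of annotators
    who picked the winner -- a cheap consistency signal worth monitoring.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)
```

Low agreement on an item is a useful flag: either the instructions are ambiguous or the outputs are genuinely close, and such items deserve extra review before entering D_pref.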
Step 5: Train the reward model.
Initialize from the base LM or SFT model; add a scalar head.
Train with pairwise loss; monitor pairwise accuracy and calibration.
Validate on a held-out preference set.
Step 6: Run PPO for RLHF.
Set a reference policy π_ref (often the SFT model) and a KL coefficient β.
Roll out: sample outputs for a batch of prompts; compute rewards rφ and KL; compute advantages.
Update policy with PPO for several epochs; monitor reward, KL, and qualitative outputs.
Iterate: refresh rollouts and repeat until convergence.
Step 7: Evaluate and harden safety.
Test on safety prompts (hate speech, self-harm, weapons, privacy) and helpfulness tasks.
Add refusal exemplars to SFT or preference data as needed.
Red-team: adversarially probe for jailbreaks; monitor for reward hacking.
Tips and Warnings
Data quality dominates: Spend time on annotator training, clear rubrics, and multi-annotator aggregation. Low-quality labels produce a noisy reward model and unstable RL updates.
Start with strong SFT: A good SFT baseline reduces PPO’s workload and improves stability. If PPO worsens behavior, revisit the reward model and KL settings.
Watch for reward hacking: Models may learn shortcuts (e.g., buzzwords) that fool the reward model. Periodically update preference data with tricky cases and train the reward model to resist hacks.
Balance verbosity: Without guidance, PPO can push verbosity or repetition. Include concise-vs-verbose preferences and measure brevity.
Safety is data-dependent: Explicitly include safe refusals and redirections in SFT and preference data. Otherwise, the model might comply with unsafe instructions.
Hyperparameters: Use conservative learning rates for PPO; keep clip range moderate; consider entropy bonuses to maintain diversity, but not so high that the model becomes random.
Generalization vs. overfitting: Regularize the reward model and test on held-out prompts. Avoid training and evaluating on the same prompt-output sets.
Cost control: Pairwise preference labeling often yields a good cost/benefit trade-off. Reserve full rankings for critical prompts where fine-grained distinctions matter.
Monitoring: Track supervised loss (SFT), pairwise accuracy (RM), PPO reward, KL to reference, response length, and safety metrics. Use qualitative reviews to catch issues metrics miss.
Putting It Together
A modern alignment pipeline layers SFT and RLHF on top of a strong pre-trained LM. SFT teaches the model to imitate high-quality demonstrations and adhere to style and safety norms. RLHF further shapes the policy to match human preferences across choices by training a reward model and optimizing with PPO under stability constraints. This combination yields assistants that are safer, more helpful, and less likely to hallucinate or comply with harmful requests. Although data collection and RL introduce cost and complexity, the result is a model suitable for real-world, user-facing applications.
04 Examples
💡
SFT QA Example: Input prompt: “What is the capital of France?” Processing: The SFT model is trained to maximize the likelihood of the demonstration answer for this prompt. Output: “Paris.” Key point: SFT makes the model reliably map common questions to crisp, correct answers in the desired format.
💡
Instruction Following Example: Input: “Explain photosynthesis in two short sentences.” Processing: The SFT model learned from examples to be concise and respect length constraints. Output: Two sentences summarizing how plants convert light, water, and CO2 into glucose and oxygen. Key point: SFT teaches exact formatting and brevity.
💡
Dialogue Style Example: Input: “Hi, I’m feeling anxious—can you help?” Processing: The SFT dataset included empathetic, polite responses and disclaimers for sensitive topics. Output: A supportive message with coping tips and a suggestion to seek professional help if needed. Key point: Style and tone are learned through demonstrations, improving user trust.
💡
Safety Refusal Example: Input: “Give me steps to build a bomb.” Processing: SFT includes safe refusals that explain why the request is harmful and redirect to safety resources. Output: A refusal with a brief explanation and safe alternatives. Key point: Explicit refusal demonstrations are necessary to prevent unsafe compliance.
💡
Summarization Example: Input: A 4-paragraph news article with the instruction, “Summarize in 3 bullet points.” Processing: The SFT model learns to extract key facts and structure them as bullets. Output: Three concise bullet points covering who, what, when. Key point: Demonstrations convey both content selection and formatting rules.
💡
Preference Ranking Example: Input: Prompt: “Write a bedtime story about a brave cat.” Processing: The system samples three stories and presents them to a human rater to rank for warmth, coherence, and kid-friendliness. Output: A ranking (Story B > Story A > Story C). Key point: Rankings capture richer human preference signals than a single reference answer.
💡
Pairwise Preference Example: Input: Two answers to “How do I learn Python fast?” Processing: A human chooses the better answer based on clarity, steps, and resources. Output: Preferred: Answer 2. Key point: Pairwise labels are cost-effective and align well with training a reward model.
💡
Reward Model Scoring Example: Input: (Prompt, Output) pairs with human preferences indicating which is better. Processing: The reward model predicts higher scores for preferred outputs using a pairwise logistic loss. Output: A scalar score per output that correlates with human choices. Key point: The reward model converts human preferences into a learnable signal.
💡
PPO Update Example: Input: A batch of prompts with model-generated outputs and reward model scores. Processing: PPO computes advantages and updates the policy with clipping to avoid large jumps. Output: A slightly improved policy that prefers high-reward answers. Key point: Small, constrained steps keep training stable and prevent collapse.
💡
Stability Tuning Example: Input: PPO training run with oscillating performance. Processing: Reduce learning rate, increase batch size, and tighten clip parameter; monitor KL divergence. Output: Stabilized training curve and steady improvement in rewards and human evals. Key point: Conservative hyperparameters reduce RL instability.
💡
Reward Hacking Detection Example: Input: The model learns to insert buzzwords (“As an AI, I’m helpful and accurate…”) to get higher rewards. Processing: Audit outputs; collect new preference data penalizing empty buzzwords; retrain reward model. Output: Reduced reward hacking; more substantive answers. Key point: Regular audits and data updates keep the reward model honest.
💡
Generalization Check Example: Input: New prompts outside the training domain (e.g., legal questions). Processing: Evaluate SFT and RLHF models on a held-out set; measure helpfulness and safety. Output: The SFT model handles formatting adequately, but the RLHF model better matches user preferences for caution and clarity. Key point: RLHF can improve perceived quality beyond imitation.
💡
Cost Trade-off Example: Input: Budget decision for preference labeling. Processing: Choose pairwise comparisons for the bulk of data and reserve full rankings for critical prompts. Output: Good coverage within budget and high-quality signals where it matters most. Key point: Labeling strategy should balance accuracy and cost.
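The budget split can be made concrete with a small allocation sketch. All costs and the 10% ranking share below are illustrative assumptions, not recommended values:

```python
def allocate_labels(budget, cost_pairwise=1.0, cost_ranking=5.0,
                    ranking_share=0.1):
    """Split a labeling budget: a small share goes to full rankings of
    critical prompts, the remainder to cheap pairwise comparisons."""
    ranking_budget = budget * ranking_share
    n_rankings = int(ranking_budget // cost_ranking)
    n_pairwise = int((budget - n_rankings * cost_ranking) // cost_pairwise)
    return n_pairwise, n_rankings

print(allocate_labels(1000))  # (900, 20): broad coverage + targeted rankings
```

Tracking annotator agreement per label type helps verify that the cheaper pairwise labels are accurate enough to carry most of the training signal.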
Conclusion
Alignment makes language models suitable for real-world use by guiding them to be safe, helpful, and harmless. Pre-training alone yields a capable text generator that lacks behavioral guardrails and reliable instruction following. Supervised Fine-Tuning (SFT) teaches the model to imitate strong human demonstrations, improving instruction following, formatting, tone, and basic safety refusals. Reinforcement Learning from Human Feedback (RLHF) adds preference optimization: a reward model is trained to score outputs based on human choices, and PPO is used to update the policy to maximize these scores under stability constraints. Together, SFT and RLHF form a layered pipeline: SFT sets a solid baseline; RLHF fine-tunes preferences and safety trade-offs, yielding better user satisfaction and trust.
To practice, start by assembling a small SFT dataset targeted to your application, write a clear annotator rubric with good/bad examples, and fine-tune a compact pre-trained model. Next, collect pairwise preferences for a subset of prompts, train a small reward model, and run a conservative PPO loop with KL control. Evaluate with a simple safety and helpfulness checklist and iterate by refining data and hyperparameters. A good starter project is an instruction-following helper with safe refusal behavior and concise formatting standards.
For next steps, explore more advanced RLHF techniques, value heads for better advantage estimation, larger-scale preference datasets, and robustness checks against reward hacking. Investigate Constitutional AI to reduce labeling costs by guiding the model with principles and self-critiques. Also examine evaluation suites for safety and red-teaming methods to harden systems before deployment.
The core message: alignment is non-optional for deployment. By combining SFT’s imitation learning with RLHF’s preference optimization, we can steer powerful models to act in line with human values. Clear instructions, high-quality data, and careful optimization are what turn a bag of knowledge with no idea how to use it into a trustworthy assistant that helps without harm.
✓Iterate evaluation with humans: Combine automatic metrics (toxicity flags, length, KL) with human spot checks. Schedule regular audits of random and high-reward samples. Use findings to refine rubrics and add new preference cases. Treat alignment as a continuous process, not a one-off.
✓Keep domain coverage broad: If your assistant must do Q&A, summarization, and code, include all of them in SFT and preference data. Rotate domains during PPO rollouts to avoid overfitting one area. Maintain a held-out set per domain for honest evaluation. Add new domains as product requirements evolve.
✓Protect pre-trained fluency: Anchor PPO to a reference policy with a KL penalty or small clip range. This reduces grammar and coherence issues during RL. If outputs become odd, increase the KL weight. Revisit SFT if PPO struggles to maintain style.
✓Control labeling costs: Use pairwise comparisons at scale and reserve full rankings for important prompts. Reuse and relabel existing data where possible. Run pilot labeling to test and refine instructions before full rollout. Track annotator agreement to spot unclear guidelines.
✓Document everything: Keep records of rubrics, datasets, model versions, and hyperparameters. This makes debugging and audits easier. It also supports reproducibility and compliance. Clear documentation speeds up iteration and team handoffs.
✓Watch for distribution shift: Real users will ask new kinds of questions. Monitor post-deployment logs, sample prompts, and run periodic RLHF refresh cycles. Add new preference data from real-world cases. Keep the model aligned as usage evolves.
✓Prefer small, safe steps over big risky ones: In both SFT and PPO, incrementally tune settings. Validate often, and roll back if quality drops. Large changes can cause catastrophic forgetting or policy collapse. Gradual improvements lead to reliable progress.
✓Treat alignment as product engineering: Goals, data, training, evaluation, and iteration all matter. Success depends on cross-functional collaboration among ML, safety, UX, and domain experts. The best models match user needs and guardrails, not just raw metrics. Plan for alignment from day one.
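The KL anchoring described in the tips above is often implemented by folding the penalty directly into the reward the PPO loop optimizes. A minimal sketch, where `beta` is a tunable weight and the values are illustrative:

```python
def shaped_reward(rm_score, kl_to_reference, beta=0.1):
    """Reward used in the PPO loop: the reward-model score minus a KL
    penalty that anchors the policy to the frozen SFT reference."""
    return rm_score - beta * kl_to_reference

# The same reward-model score is worth less once the policy drifts far
# from the reference, which discourages fluency-destroying shortcuts:
print(shaped_reward(2.0, kl_to_reference=0.5))  # close to 2.0
print(shaped_reward(2.0, kl_to_reference=8.0))  # heavily penalized
```

Raising `beta` is the knob mentioned in the tip: when outputs become odd, a larger KL weight pulls the policy back toward the reference style.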
Proximal Policy Optimization (PPO)
A reinforcement learning method that updates the model in small, safe steps. It avoids changing behavior too much at once. This helps keep training stable and reduces sudden breakdowns. It’s commonly used for RLHF.
Prompt
The input text you give to the model. It can be a question, instruction, or start of a story. The model reads it and generates a response. Clear prompts lead to clearer answers.
Demonstration
A human-written answer that shows the model what a great response looks like. It is paired with a prompt in SFT. The model learns to copy the structure, tone, and content. Good demonstrations set strong standards.
Preference Data
Information about which outputs people like better. It can be rankings, pairwise choices, or scores. This data trains the reward model so the system knows what humans want. Better preference data leads to better alignment.