Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Ethics
Key Summary
- The lecture explains regularization, a method to reduce overfitting by adding a penalty to the cost (loss) function that discourages overly complex models. Overfitting is when a model memorizes noise in the training data and fails to generalize. Regularization keeps model parameters (weights) from growing too large, which helps models generalize better to new data.
- Two main types are covered: L2 regularization (Ridge) and L1 regularization (Lasso). L2 adds a penalty equal to the sum of squared weights, which shrinks weights toward zero but rarely makes them exactly zero. L1 adds a penalty equal to the sum of absolute values of weights, which often pushes some weights exactly to zero, effectively selecting features.
- The penalty strength is controlled by a hyperparameter called lambda (λ), which you choose, not the model. A larger lambda means a stronger penalty and smaller weights; a smaller lambda means a weaker penalty and larger weights. Picking lambda balances bias and variance: too large raises bias (underfitting), too small raises variance (overfitting).
- The lecture uses Mean Squared Error (MSE) as the base cost function: the average of squared differences between predictions and true values. Regularization adds a penalty term to this cost, so the optimizer minimizes both data error and weight size. This encourages simpler models that still fit the data reasonably well.
- A geometric view helps build intuition: without regularization, you find the very bottom of the loss landscape (the center of nested ellipses). With L2 (Ridge), you add a circular constraint on the weights and pick the point where the lowest ellipse just touches the circle (tangency). Making the circle smaller (larger lambda) forces smaller weights.
- For L1 (Lasso), the constraint is a diamond (sum of absolute values bounded). The lowest ellipse tends to touch the diamond at its corners, which lie on the axes. Touching a corner means one of the weights equals zero, which explains why L1 creates sparse models with some weights exactly zero.
- L2 is usually preferred for best predictive accuracy because it reduces weight size smoothly and handles many small effects well. L1 is preferred when you want automatic feature selection, because it can zero out unimportant features completely. Elastic Net mixes both L1 and L2 to get a balance of sparsity and stability.
- Regularization changes the optimization goal from "fit as well as possible" to "fit well but stay small." This reduces sensitivity to noise and prevents the model from chasing outliers. The result is often better performance on unseen data.
- Lambda is a hyperparameter because it is set by you, not learned like the weights. You typically try several lambda values and choose the one that works best using a validation method. If lambda is too high, the model underfits; if too low, it may overfit.
- The lecture's visuals show contours (ellipses) of equal loss for two weights, W1 and W2. L2 draws a circle constraint, while L1 draws a diamond constraint; the optimum lies where the lowest ellipse meets the constraint boundary. This simple picture explains the shrinkage vs. sparsity behavior.
- Elastic Net adds both penalties with two hyperparameters that control the L1 and L2 strengths. It is useful when features are correlated (L2 helps share weight among them) but you still want some features removed (L1 zeros out the weakest). It can outperform pure L1 or L2 in many real cases.
- In practice, you start with a plain model and observe overfitting, then add L2 for a stable improvement, or L1 if you need feature selection. You tune lambda to balance fit and simplicity. Regularization is a core tool for controlling model complexity and improving generalization.
Why This Lecture Matters
Regularization is one of the most important tools in a machine learning practitioner's kit. Data scientists, ML engineers, analysts, and researchers all face the challenge of overfitting, especially with many features or limited data. By adding a penalty to the loss function, regularization keeps models from becoming too complex and chasing noise, which greatly improves performance on unseen data. It helps solve practical problems like unstable predictions, poor test performance, and confusing models with too many irrelevant features. This knowledge is directly applicable to real work: you can stabilize forecasting models, simplify feature sets for interpretability, and handle correlated inputs gracefully. In domains such as finance, healthcare, and operations, where reliability and trust matter, regularization makes models consistent and safer to deploy. Mastering L1, L2, and Elastic Net lets you tune models to your exact goals, whether that is best predictive accuracy, lean feature sets, or a careful balance of both. Understanding lambda as a hyperparameter and using validation to pick it gives you a repeatable process for building robust models. From a career perspective, being fluent in regularization signals a solid grasp of the fundamentals of generalization. The technique underpins classic linear models, logistic regression, and modern deep learning (via weight decay). In an industry where data is noisy and high-dimensional, and where interpretability can be critical, regularization is not optional; it is essential. Knowing when and how to apply L1, L2, and Elastic Net will make your models more trustworthy, your analysis more transparent, and your deployments more successful.
Lecture Summary
01 Overview
This lecture teaches a powerful technique called regularization, which helps machine learning models avoid overfitting. Overfitting happens when a model learns not only the true patterns in the training data but also the random noise. As a result, it performs well on the training set but poorly on new, unseen data. Regularization fixes this by gently punishing complex models, nudging them to be simpler and more stable. The core idea is simple: add a penalty to the cost (loss) function that grows when the model's parameters (weights) are large.
The lecture focuses on two main types of regularization: L2 regularization (also called Ridge) and L1 regularization (also called Lasso). With L2, you add a term equal to the sum of the squares of the weights, scaled by a hyperparameter called lambda (λ). This makes all weights smaller but rarely exactly zero. With L1, you add a term equal to the sum of the absolute values of the weights, again scaled by lambda. This version can push some weights exactly to zero, which means the model completely ignores those features. Because of this, L1 naturally performs feature selection. There's also a combined approach called Elastic Net, which mixes L1 and L2 and requires tuning two hyperparameters.
The lecture uses Mean Squared Error (MSE) as the base cost function example. MSE measures how far predictions are from true values, by averaging the squared differences. Regularization adds a penalty term to MSE, so the optimizer tries to minimize both the prediction errors and the size of the weights. This trade-off is controlled by lambda: a bigger lambda means more penalty on large weights and a simpler model, while a smaller lambda means less penalty and a more flexible model. Lambda is a hyperparameter, which means you choose it rather than the model learning it during training.
To build strong intuition, the instructor uses a geometric picture with two weights (W1 and W2) so we can draw everything in two dimensions. The loss function looks like a set of nested ellipses centered at the best-fit point. L2 regularization acts like putting a circle around the origin; you must pick the best point inside that circle. The optimal point ends up at the place where a loss ellipse just touches the circle, like a tire kissing a curb. Making the circle smaller (increasing lambda) forces the solution closer to the origin, shrinking the weights. With L1, the constraint is a diamond; the optimal point often lands on a corner of the diamond, which lies on the axes, meaning one weight becomes exactly zero. This explains why L1 encourages sparse solutions.
This lecture is for beginners and intermediates who want a practical and intuitive understanding of why and how regularization helps generalization. You should know what a cost function is, what model parameters (weights) are, and have a basic grasp of MSE and the bias-variance tradeoff. If you remember that overfitting means too much variance and not enough bias, regularization will make sense because it increases bias slightly to reduce variance a lot.
After this lecture, you will be able to explain what regularization is, name and describe L1 and L2 (Lasso and Ridge), and understand why L1 produces feature selection while L2 produces smooth shrinkage. You will be able to describe the role of lambda, why it is a hyperparameter, and what happens when you increase or decrease it. You will also be able to visualize regularization using the circle (L2) and diamond (L1) constraints and the tangency with loss ellipses. While the lecture centers on linear models and MSE, the same ideas extend to many other models and loss functions.
The structure of the lecture begins by recalling overfitting and the bias-variance tradeoff. It then introduces the basic idea of regularization as a penalty added to the loss. Next, it defines L2 and L1 regularization precisely, explains lambda as a hyperparameter, and highlights the practical consequences: prediction quality versus feature selection. Finally, it cements understanding with a clear geometric visualization: ellipses for loss, a circle for L2, a diamond for L1, and the solution at the point of tangency inside the constraint. The closing note reminds you that other tools like cross-validation help pick lambda well, and that regularization is a central technique for building models that generalize.
02 Key Concepts
- 01 What Regularization Is: Definition: Regularization is a method to reduce overfitting by adding a penalty to the model's loss function. Analogy: It's like adding a gentle brake to a speeding bike so it doesn't wobble and crash. How it works: The penalty grows when model weights get large, so the optimizer prefers smaller, simpler models. Why it matters: Without it, models can become overly complex and memorize noise, hurting performance on new data. Example: Adding an extra term to Mean Squared Error that depends on the size of the weights.
- 02 Overfitting and the Bias-Variance Tradeoff: Definition: Overfitting is when a model learns noise and fails on new data; the bias-variance tradeoff balances simplicity and flexibility. Analogy: A short blanket (high bias) leaves you cold; a super-stretchy blanket (high variance) flaps around and doesn't stay put. How it works: Regularization adds bias (simplifies the model) to reduce variance (wild swings on new data). Why it matters: Finding the right balance improves generalization. Example: A polynomial curve that wiggles through every training point but predicts future points poorly is reduced to a smoother curve with regularization.
- 03 Cost Function (Loss) and MSE: Definition: A cost function measures how wrong predictions are; Mean Squared Error averages squared differences between predictions and true values. Analogy: It's like averaging the squared distances of darts from the bullseye. How it works: You sum (prediction − truth)^2 across data and average; the training algorithm tries to minimize this. Why it matters: Regularization adds to this cost to balance data fit and model simplicity. Example: J = (1/(2N)) Σ (y_hat_i − y_i)^2 plus a penalty term.
- 04 Hyperparameter Lambda (λ): Definition: Lambda is a knob you set that controls how strong the regularization penalty is. Analogy: It's like the tightness of a guitar string: tighter makes it harder to wiggle. How it works: The penalty term is multiplied by lambda, so larger lambda shrinks weights more. Why it matters: Too large leads to underfitting; too small allows overfitting. Example: Trying λ = 0.01, 0.1, 1.0 and choosing the one that gives the best validation score.
- 05 L2 Regularization (Ridge): Definition: L2 adds a penalty equal to the sum of squared weights. Analogy: It's like attaching each weight to a rubber band that gently pulls it toward zero. How it works: The optimizer minimizes loss + (λ/(2N)) Σ w_j^2; weights shrink smoothly but rarely become exactly zero. Why it matters: It improves prediction stability, handles many small effects, and reduces sensitivity to noise. Example: In linear regression, ridge often outperforms plain least squares on new data.
- 06 L1 Regularization (Lasso): Definition: L1 adds a penalty equal to the sum of absolute values of weights. Analogy: It's like having a magnet at zero that can snap some weights exactly to zero. How it works: The optimizer minimizes loss + (λ/N) Σ |w_j|; the shape of the penalty makes exact zeros common. Why it matters: It performs automatic feature selection by turning off unimportant features. Example: A model with 100 features ends up using only 12 because the other weights become zero.
- 07 Feature Selection via L1: Definition: Feature selection means automatically choosing which input variables matter. Analogy: It's like choosing a few key spices for a soup instead of dumping in the whole spice rack. How it works: L1's diamond-shaped constraint often lands solutions on axes where some weights are zero. Why it matters: Simpler, more interpretable models are easier to explain and faster to run. Example: Lasso removes noisy or redundant features and keeps the strongest predictors.
- 08 Prediction Stability via L2: Definition: Prediction stability means small changes in data don't cause big swings in predictions. Analogy: It's like adding shock absorbers to a car to smooth out bumps. How it works: L2 shrinks all weights, reducing sensitivity to individual points and noise. Why it matters: Models with lower variance perform better on unseen data. Example: Ridge regression remains steady when a few training points are slightly perturbed.
- 09 Elastic Net (Mix of L1 and L2): Definition: Elastic Net combines L1 and L2 penalties in one model. Analogy: It's like wearing both a safety belt (L2) and choosing a lighter backpack (L1). How it works: The cost adds α·L1 + (1−α)·L2 penalties, with two hyperparameters to tune. Why it matters: It keeps sparsity from L1 and stability from L2, especially with correlated features. Example: When several features are similar, Elastic Net shares weights and still removes the weakest.
- 10 Geometric View: Contours and Constraints: Definition: The loss landscape can be drawn as nested ellipses when there are two weights. Analogy: Think of topographic lines on a map, with the bullseye at the bottom of a valley. How it works: Without regularization, you go to the valley's bottom; with constraints, you must stay inside a circle (L2) or diamond (L1). Why it matters: The point where the lowest ellipse touches the constraint explains shrinkage or sparsity. Example: The solution is at the tangency point of the ellipse and the constraint boundary.
- 11 L2 as a Circle Constraint: Definition: L2 regularization constrains the squared sum of weights to be ≤ C, which is a circle in 2D. Analogy: It's like searching for the best campsite within a circular fence. How it works: You minimize loss subject to w1^2 + w2^2 ≤ C; the best point is where an ellipse is tangent to the circle. Why it matters: Tightening the circle (bigger λ) shrinks all weights toward the origin. Example: Reducing the circle radius pushes the solution closer to (0, 0).
- 12 L1 as a Diamond Constraint: Definition: L1 regularization constrains the sum of absolute weights to be ≤ C, a diamond in 2D. Analogy: It's like choosing the best spot inside a kite-shaped boundary. How it works: The lowest ellipse often touches a corner of the diamond, which lies on the axes. Why it matters: Touching a corner means one weight is exactly zero, explaining L1's sparsity. Example: The optimum lands on the w2 = 0 axis, turning off that feature.
- 13 Choosing Between L1 and L2: Definition: The choice depends on whether you value feature selection (L1) or smooth shrinkage and predictive accuracy (L2). Analogy: Do you want to carry fewer items (L1) or spread the weight more evenly (L2)? How it works: L1 zeros out some weights; L2 reduces all weights. Why it matters: Matching the penalty to your goal gives better models. Example: Use L1 to find the top predictors; use L2 for robust forecasting.
- 14 Role of Lambda in Model Behavior: Definition: Lambda controls the strength of the penalty and therefore the model's complexity. Analogy: It's like a volume knob for how strongly you discourage large weights. How it works: Increasing lambda tightens the feasible region and shrinks (or zeros) weights more. Why it matters: Tuning lambda balances underfitting and overfitting. Example: A small lambda fits training data closely but may generalize poorly.
- 15 Hyperparameters vs. Parameters: Definition: Parameters (weights) are learned; hyperparameters (like lambda) are chosen by you. Analogy: You set the oven's temperature (hyperparameter), but the cake rises on its own (parameters). How it works: Training optimizes weights; you pick hyperparameters by trying values and checking performance. Why it matters: Good hyperparameters are crucial for strong results. Example: Fit models at several lambda values and evaluate which gives the best validation score.
- 16 Interpretability and Sparsity: Definition: Interpretability means understanding which features matter; sparsity means many weights are exactly zero. Analogy: A tidy desk with only essential tools is easier to navigate. How it works: L1 drives sparsity, revealing a small, important subset of features. Why it matters: Sparse models are easier to explain and can reduce data collection costs. Example: A medical model using only five key lab values is easier for clinicians to trust.
- 17 What Happens Without Regularization: Definition: Without any penalty, models can grow overly complex to chase every training point. Analogy: A student who memorizes every practice question but fails new ones. How it works: The optimizer moves weights as needed to reduce training error, even if that means huge, unstable weights. Why it matters: High training accuracy can mask poor real-world performance. Example: A polynomial model that fits noise predicts future values wildly.
- 18 Effect on Optimization: Definition: Regularization changes the shape of the loss surface that the optimizer explores. Analogy: It smooths the road so the car doesn't swerve into potholes. How it works: L2 makes the surface more convex and well-conditioned; L1 adds sharp points that encourage zeros. Why it matters: This leads to more stable training and solutions that generalize better. Example: Gradient descent converges faster and more reliably with L2 than without it.
- 19 When to Use Elastic Net: Definition: Use Elastic Net when features are correlated and you want both stability and sparsity. Analogy: It's like wearing both boots (grip) and light shoes (speed) depending on the terrain. How it works: The L2 component shares weight among correlated features; the L1 component removes the weakest. Why it matters: It avoids L1's tendency to pick only one among correlated features while still selecting. Example: In text models with many similar word features, Elastic Net performs reliably.
- 20 Visual Intuition and Tangency: Definition: Tangency is the point where the lowest loss contour just touches the constraint boundary. Analogy: Like a soap bubble touching a wall at a single point. How it works: This point is the solution to the constrained optimization: the best fit that stays within the allowed region. Why it matters: It gives a simple picture of why weights shrink or zero out. Example: With L2, the tangent point lies on the circle; with L1, it often lies at a diamond corner.
03 Technical Details
Overall Architecture/Structure
- Data and Model: You start with input features X and target values y. A model, such as linear regression, makes predictions y_hat = X·w (plus possibly a bias term). The training goal is to choose weights w that make y_hat close to y.
- Base Cost Function: The base cost function often used in regression is Mean Squared Error (MSE). In a simple form, J_data(w) = (1/(2N)) Σ_i (y_hat_i − y_i)^2, where N is the number of data points. This measures the average squared error between predicted and true values.
- Adding Regularization: Regularization adds a penalty term that grows with the size of w. For L2, the penalty is (λ/(2N)) Σ_j w_j^2. For L1, the penalty is (λ/N) Σ_j |w_j|. The full objective becomes J_total(w) = J_data(w) + J_penalty(w).
- Optimization Objective: The training algorithm no longer minimizes just data error, but the sum of data error and penalty. This balances fitting the data with keeping weights small (simple model). The balance strength is controlled by the hyperparameter lambda (λ).
- Geometric Constraint View: Minimizing J_total(w) is equivalent to minimizing J_data(w) subject to a constraint on w. For L2: Σ_j w_j^2 ≤ C (a circle in 2D, a sphere in higher dimensions). For L1: Σ_j |w_j| ≤ C (a diamond in 2D, a cross-polytope in higher dimensions). The solution is at the tangency of the lowest data-loss contour and the constraint set. (A code sketch of the full objective follows this list.)
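To make the objective concrete, here is a minimal NumPy sketch of J_total under the lecture's (1/(2N)) MSE convention; the function name total_loss and the argument lam are illustrative choices, not from the lecture.

```python
import numpy as np

def total_loss(X, y, w, lam, penalty="l2"):
    """J_total(w) = J_data(w) + J_penalty(w) for a linear model y_hat = X @ w."""
    N = X.shape[0]
    y_hat = X @ w
    j_data = np.sum((y_hat - y) ** 2) / (2 * N)   # (1/(2N)) Σ_i (y_hat_i − y_i)^2
    if penalty == "l2":
        j_pen = lam / (2 * N) * np.sum(w ** 2)    # (λ/(2N)) Σ_j w_j^2
    else:  # "l1"
        j_pen = lam / N * np.sum(np.abs(w))       # (λ/N) Σ_j |w_j|
    return j_data + j_pen
```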
Data Flow
- 1) Input features X and targets y feed into the model to compute predictions y_hat. 2) Compute data loss: MSE across all data points. 3) Compute penalty: L2 or L1 based on current weights and chosen lambda. 4) Sum them to get total loss J_total. 5) The optimizer updates weights to reduce J_total. 6) Iterate until convergence or stopping criteria are met.
Code/Implementation Details (Conceptual, works across libraries)
- Language/Framework: Any ML framework (scikit-learn, PyTorch, TensorFlow) supports L1/L2 penalties. The math is the same: add a penalty to the loss.
- L2 (Ridge) in Practice: In linear regression, ridge regression solves (X^T X + λI) w = X^T y, where I is the identity matrix. This closed-form solution shows how L2 stabilizes inversion by adding λ to the diagonal, improving numerical conditioning. In gradient descent, you update w ← w − η(∇J_data + (λ/N)w), where η is the learning rate. Both are sketched in code after this list.
- L1 (Lasso) in Practice: L1 lacks a simple closed-form due to the absolute value, which creates kinks at zero. Optimizers use methods like coordinate descent, proximal gradient, or subgradient techniques. The key behavior is that optimality conditions favor exact zeros for some coordinates.
- Elastic Net: Uses a mix of L1 and L2 penalties. Many libraries implement parameters alpha (overall strength) and l1_ratio (how much of alpha goes to L1). For example: total penalty = alpha * (l1_ratio * L1 + (1 − l1_ratio) * L2).
- Parameters and Hyperparameters: Weights w are learned by minimizing J_total. Lambda (and Elastic Net's split) are hyperparameters you choose, often by trying multiple values and keeping the best one according to a validation metric.
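A minimal NumPy sketch of the two ridge updates described above (the closed form and the L2 gradient step); the helper names are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + λI) w = X^T y; the λI term improves conditioning."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def ridge_gd_step(X, y, w, lam, eta):
    """One step of w ← w − η(∇J_data + (λ/N) w)."""
    N = X.shape[0]
    grad_data = X.T @ (X @ w - y) / N   # gradient of (1/(2N)) Σ (y_hat − y)^2
    return w - eta * (grad_data + lam / N * w)
```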
Important Parameters and Meanings
- Lambda (λ): Strength of the penalty. Larger λ means stronger shrinkage (and, for L1, more zeros). Smaller λ means weaker regularization.
- N (number of samples): Standardizing by N keeps penalty comparable as dataset size changes. The formulas in the lecture divide by N or 2N, but the minimizer is unaffected by constant scalings.
- D (number of features): The penalty sums over all D weights. High-D problems benefit strongly from regularization to avoid overfitting.
Optimization Flow
- 1) Initialize weights w (e.g., zeros or small random values). 2) At each iteration, compute predictions y_hat. 3) Compute data loss and penalty. 4) Compute gradients (or subgradients for L1). 5) Update weights using an optimizer (gradient descent, coordinate descent, etc.). 6) Repeat until loss stabilizes or a stopping criterion is met.
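As a concrete instance of this flow for L1, here is a minimal proximal-gradient (ISTA) sketch, assuming the MSE-plus-(λ/N)·Σ|w_j| objective from earlier; soft_threshold and lasso_ista are illustrative names, and the step size comes from the Lipschitz constant of the smooth part.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t·|·|: shrinks toward zero, snapping small values to exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    N, D = X.shape
    eta = N / (np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth MSE part
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / N                        # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam / N)   # proximal step handles |w|
    return w
```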
Tools/Libraries Used (Common Options)
- Scikit-learn (Python): ridge regression (Ridge), lasso regression (Lasso), and ElasticNet classes for regression. These accept alpha (similar to λ) and, for ElasticNet, l1_ratio. Fit with .fit(X, y) and predict with .predict(X); a usage sketch follows this list.
- PyTorch/TensorFlow: Define the base loss (e.g., MSE) and add weight decay (L2) by including λ Σ w^2 either in the loss or as an optimizer parameter (weight_decay). L1 can be added manually by summing |w| across parameters and adding to loss.
- General Tip: Standardize or normalize features before applying regularization so that penalties treat features fairly (this is especially important for L1).
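A minimal usage sketch of the scikit-learn classes above; the synthetic data and alpha values are illustrative, and standardization happens inside a pipeline per the tip above.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: only the first feature truly matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "enet": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)),
}
for name, model in models.items():
    model.fit(X, y)                      # fit with .fit(X, y)
    print(name, model.predict(X[:2]))    # predict with .predict(X)
```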
Step-by-Step Implementation Guide
- Step 1: Prepare data. Split your dataset into training and validation sets. Standardize features (zero mean, unit variance), especially important for L1 and Elastic Net.
- Step 2: Choose a base model. For regression, start with linear regression. For classification, consider logistic regression (which also supports L1/L2 penalties).
- Step 3: Pick a type of regularization. If you need feature selection, start with L1 (Lasso). If predictive accuracy and stability are priorities, start with L2 (Ridge). If features are correlated and you still want sparsity, try Elastic Net.
- Step 4: Select candidate lambda values. Create a grid of values (e.g., 1e-4, 1e-3, 1e-2, 1e-1, 1, 10). Plan to try each and compare validation performance (this loop is sketched in code after this list).
- Step 5: Train the model for each lambda. Fit the model using the training data and compute validation error (MSE for regression, accuracy or log loss for classification). Record performance and model characteristics (like number of nonzero weights for L1/Elastic Net).
- Step 6: Compare and choose lambda. Look for the value that minimizes validation error and avoids extreme underfitting or overfitting. Prefer simpler models when validation performance is similar (the simpler one is often more robust).
- Step 7: Retrain on combined training+validation data (optional). Once the best lambda is selected, retrain using more data for a final model. Then evaluate on a held-out test set to estimate true performance.
- Step 8: Interpret and deploy. With L1/Elastic Net, examine which features remain; with L2, note the shrinkage pattern. Ensure the preprocessing pipeline (e.g., standardization) is carried to production.
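A minimal end-to-end sketch of Steps 1 through 7, assuming scikit-learn; the synthetic data and the alpha (λ) grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: prepare (illustrative) data, split, and standardize via a pipeline.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=300)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-6: linear model with L2, lambda grid, validation comparison.
best_alpha, best_mse = None, np.inf
for alpha in [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_valid, model.predict(X_valid))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print("best alpha:", best_alpha, "validation MSE:", best_mse)

# Step 7 (optional): retrain on all available data with the chosen lambda.
final = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha)).fit(X, y)
```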
Tips and Warnings
- Start Simple: Begin with L2 for stability. If you also want interpretability and feature selection, try L1 or Elastic Net.
- Watch Lambda Size: Too large a λ causes underfitting (high bias) with tiny weights; too small a λ can overfit (high variance). Plot validation error vs. λ to see the U-shaped curve.
- Scaling Matters: Without feature scaling, L1 may prefer features just because of units. Standardization ensures fair penalty across features.
- Interpretation: L1 zeros out weights, but remember this depends on data and preprocessing. Do not assume causality from selection; it shows predictive usefulness, not cause-effect.
- Numerical Stability: L2 aids stability by making (X^T X + λI) invertible and better-conditioned. This helps when features are nearly collinear.
- Optimization Details: L1 introduces non-differentiable points at zero; use algorithms that handle this (coordinate descent, proximal methods). L2 is smooth and friendly for standard gradient methods.
- Evaluate on Unseen Data: Always assess regularized models on validation/test sets to ensure real generalization gains. Training loss alone can mislead.
- Combine with Other Practices: Use proper train/validation splits and, if needed, cross-validation to pick λ more reliably. Regularization complements, not replaces, good evaluation.
Deepening the Geometric Intuition
- Loss Contours: For two weights, the unregularized MSE loss forms ellipses centered at the least-squares solution. Moving along these ellipses increases loss.
- Constraint Sets: L2's circle (or hypersphere) and L1's diamond (or cross-polytope) define allowed regions for the weights. The best constrained solution lies where an ellipse first touches the boundary (tangency); a plotting sketch follows this list.
- Why L2 Shrinks Smoothly: The circle has no sharp corners, so the tangency is unlikely to land exactly on an axis. Most weights are nonzero but reduced in magnitude.
- Why L1 Zeros Out: The diamond has sharp corners aligned with axes. Ellipses often touch corners, forcing one or more weights to be exactly zero. This is the geometric source of sparsity.
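A small matplotlib sketch of this picture: elliptical loss contours plus the L2 circle and L1 diamond. The quadratic loss, its center, and the constraint budgets are illustrative choices, not from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# Elliptical loss contours centered at an illustrative least-squares solution.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = 3.0 * (w1 - 1.2) ** 2 + 1.0 * (w2 - 0.8) ** 2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(w1, w2, loss, levels=12)

# L2 constraint w1^2 + w2^2 <= C: a circle of radius sqrt(C).
C = 0.6
theta = np.linspace(0, 2 * np.pi, 200)
ax.plot(np.sqrt(C) * np.cos(theta), np.sqrt(C) * np.sin(theta), "b", label="L2 circle")

# L1 constraint |w1| + |w2| <= C1: a diamond with corners on the axes.
C1 = 0.8
ax.plot([C1, 0, -C1, 0, C1], [0, C1, 0, -C1, 0], "r", label="L1 diamond")

ax.set_xlabel("W1"); ax.set_ylabel("W2"); ax.set_aspect("equal"); ax.legend()
plt.show()
```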
Extending Beyond Linear Regression (Conceptual)
- Logistic Regression: Replace MSE with logistic loss; add L1 or L2 penalties the same way. L1 still encourages sparsity; L2 still encourages smooth shrinkage.
- Other Models: Neural networks, support vector machines, and many others support L2-like penalties (often called weight decay). The philosophy remains: limit weight size to control complexity (a PyTorch sketch follows).
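A minimal PyTorch sketch of the weight-decay idea above, with L2 applied via the optimizer's weight_decay argument and L1 added manually to the loss; the toy model, data, and lambda values are illustrative.

```python
import torch

model = torch.nn.Linear(10, 1)
# L2 via weight_decay: the optimizer adds weight_decay * w to each gradient.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-3)

X = torch.randn(64, 10)   # illustrative random data
y = torch.randn(64, 1)
lam_l1 = 1e-3

for _ in range(100):
    optimizer.zero_grad()
    mse = torch.nn.functional.mse_loss(model(X), y)
    # Manual L1: sum |w| across parameters and add it to the loss.
    l1 = sum(p.abs().sum() for p in model.parameters())
    loss = mse + lam_l1 * l1
    loss.backward()
    optimizer.step()
```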
Putting It All Together
- Problem: Overfitting hurts generalization. Solution: Add regularization to loss to penalize large weights.
- Choices: L2 (Ridge) for stable shrinkage; L1 (Lasso) for sparsity and feature selection; Elastic Net to blend both. Control with lambda, chosen as a hyperparameter using validation.
- Intuition: Circle vs. diamond constraints explain the different behaviors. Tangency shows how the solution changes as λ varies.
- Practice: Standardize features, tune λ carefully, and evaluate on unseen data. Prefer simpler models when performance is similar, for robustness.
04 Examples
- 💡 MSE With and Without Regularization: Input: A dataset with features X and targets y, using linear regression. Processing: Compute unregularized MSE to fit w, then add an L2 penalty term and refit. Output: The L2 model has slightly higher training error but lower validation error. Key point: Regularization trades a tiny bit of fit for much better generalization.
- 💡 L2 Shrinkage Effect: Input: A model with several moderately useful features. Processing: Apply L2 (Ridge) with λ = 0.1, then λ = 1.0. Output: As λ grows, all weights become smaller, but most remain nonzero. Key point: L2 reduces sensitivity to noise and produces stable predictions.
- 💡 L1 Feature Selection: Input: A model with 100 features, many weak or redundant. Processing: Apply L1 (Lasso) with increasing λ values. Output: Weights for unimportant features become exactly zero; only the strongest 10–20 remain. Key point: L1 provides automatic feature selection and interpretability.
- 💡 Elastic Net on Correlated Features: Input: Two features that are highly correlated. Processing: Train L1-only and Elastic Net models. Output: L1 often picks just one feature; Elastic Net tends to share weight across both while still shrinking others. Key point: Elastic Net handles correlation better while keeping sparsity.
- 💡 Geometric Picture: L2 Circle: Input: A 2D weight space with W1 and W2. Processing: Draw loss ellipses and a circle constraint w1^2 + w2^2 ≤ C. Output: The optimal point is where the lowest ellipse touches the circle. Key point: Visualizes how L2 shrinks weights toward the origin.
- 💡 Geometric Picture: L1 Diamond: Input: The same 2D weight space. Processing: Draw loss ellipses and a diamond constraint |w1| + |w2| ≤ C. Output: The optimal point often lies at a diamond corner, making one weight exactly zero. Key point: Explains why L1 creates sparse solutions.
- 💡 Too Small Lambda (Overfitting): Input: Choose λ = 0 (no regularization). Processing: Fit the model to minimize MSE only. Output: Training error is very low, but validation error is high. Key point: Without regularization, the model can memorize noise and perform poorly on new data.
- 💡 Too Large Lambda (Underfitting): Input: Choose a very large λ. Processing: Fit the model with a strong penalty. Output: Both training and validation errors are high; weights are tiny, and the model is too simple. Key point: Over-regularization removes useful signal and hurts performance.
- 💡 Choosing Lambda by Validation: Input: Candidate λ values [0.001, 0.01, 0.1, 1, 10]. Processing: Train a model for each λ and compute validation MSE. Output: The best λ minimizes validation MSE while keeping the model reasonably simple. Key point: Lambda is a hyperparameter chosen by trying and comparing.
- 💡 L1 vs. L2 on Noisy Features: Input: A dataset with many weak, noisy features. Processing: Train L1 and L2 models. Output: L1 zeros out many features; L2 keeps small nonzero weights on many features. Key point: L1 is better for selecting a few strong signals; L2 keeps a blend of small effects.
- 💡 Interpreting L1 Coefficients: Input: A lasso model with nonzero weights on five features. Processing: Examine which features remain and their signs (positive/negative). Output: Only key predictors remain, and their direction of influence is clear. Key point: L1 models are easier to explain and justify.
- 💡 Stability to Data Perturbations: Input: Slightly perturb some training points. Processing: Compare how unregularized and L2-regularized models change. Output: Unregularized weights swing more; L2 weights change little. Key point: L2 improves robustness to small data changes.
- 💡 Scaling Before L1: Input: Unscaled features with different units. Processing: Train an L1 model before and after standardization. Output: Before scaling, the model favors certain units; after scaling, selection is fairer. Key point: Proper scaling is crucial for L1 and Elastic Net.
- 💡 Model Complexity Curve: Input: Vary λ over a wide range. Processing: Plot the number of nonzero weights (for L1) or average weight size (for L2) against validation error. Output: You see sparsity increase with λ and a U-shaped validation error curve. Key point: Visualizing helps choose a good λ.
- 💡 Retraining with Best Lambda: Input: The best λ from validation. Processing: Retrain the chosen regularized model on combined training and validation data, then test. Output: Final test performance reflects true generalization. Key point: This workflow yields a reliable model ready for use.
05 Conclusion
This lecture presents regularization as a simple and essential tool for fighting overfitting. By adding a penalty to the cost function, regularization discourages large weights and thus overly complex models. Two main forms are emphasized: L2 (Ridge), which smoothly shrinks all weights toward zero without usually making them exactly zero, and L1 (Lasso), which often pushes some weights to be exactly zero, performing feature selection. Elastic Net combines both, balancing sparsity and stability, especially when features are correlated. The geometric view (loss ellipses with a circle constraint for L2 or a diamond constraint for L1) explains why L2 yields shrinkage and L1 yields sparsity: the best point is where the lowest loss contour is tangent to the constraint.
In practice, you will tune the hyperparameter lambda to control the penalty's strength. A small lambda may overfit, while a large one may underfit. The right lambda provides the best trade-off between fitting the data and keeping the model simple. Using validation methods to select lambda is standard practice. L2 is usually favored for strong predictive performance and stability; L1 is valuable when you want interpretability and feature selection.
To practice, start with a basic linear regression and add L2 regularization; observe how training and validation errors change as you increase lambda. Then repeat with L1 and note which features remain. Try Elastic Net when you have many correlated features, adjusting the mix between L1 and L2. Make sure to standardize features, especially when using L1 or Elastic Net, to ensure fair treatment by the penalty.
Next steps include learning formal validation techniques to pick lambda more reliably and exploring regularization in other models such as logistic regression and neural networks. As you build more complex systems, remember the lecture's core message: regularization is a principled way to control complexity. It adds a gentle brake to your model so it rides smoothly over new data instead of wobbling after every bump. Keep the circle and diamond picture in mind; it is a powerful mental model that explains the behavior of L2 and L1 in a single glance.
Key Takeaways
- Use regularization whenever you see signs of overfitting. Start with L2 (Ridge) for a stable, smooth reduction of weights. Expect slightly higher training error but better validation/test performance. This trade-off is usually beneficial in real applications.
- Choose L1 (Lasso) when you want feature selection and interpretability. L1 can zero out unimportant features, making models simpler to explain. Make sure to standardize features first so the penalty treats them fairly. Inspect which features remain to communicate insights.
- Try Elastic Net if features are correlated and you still want sparsity. Tune both the overall strength and the L1/L2 mix. Elastic Net often outperforms pure L1 or L2 in correlated settings. It shares weight among similar features while pruning weaker ones.
- Tune lambda (λ) carefully using a validation set. Too small a λ risks overfitting; too large a λ causes underfitting. Sweep λ across a logarithmic range and pick the value with the lowest validation error. Prefer simpler models if performance is similar.
- Always evaluate on unseen data to confirm generalization. Training error alone can mislead, especially without regularization. Keep a clean split between training and validation (and test). Track both errors as you adjust λ to see the bias-variance tradeoff.
- Standardize or normalize features before using L1 or Elastic Net. Different units can bias which features get zeroed. Scaling ensures the penalty is applied fairly to all features. This leads to more reliable feature selection.
- Monitor model complexity while tuning. For L1, track the number of nonzero coefficients; for L2, track average coefficient size. Plot validation error versus λ to find a sweet spot. Use these plots to explain decisions to stakeholders.
- Prefer L2 for stability when many small effects matter. It shrinks all weights smoothly and reduces sensitivity to noise. This typically yields better predictive accuracy than plain models. It's a solid default when interpretability is secondary.
- Use L1 to discover key drivers in high-dimensional data. It can reduce hundreds of features to a manageable shortlist. Double-check selected features with domain experts. Remember: selection shows predictive value, not necessarily causation.
- Check for underfitting if validation error rises as λ increases. If both training and validation errors are high, λ is likely too large. Reduce λ to let the model capture more signal. Keep adjusting until you see an improvement on validation.
- Validate assumptions with simple visualizations. For linear models, examine residual plots and, in 2D, the contour picture with circle (L2) or diamond (L1) constraints. These visuals help teams grasp shrinkage and sparsity intuitively. Clear visuals speed decision-making.
- Document the full pipeline for deployment. Record preprocessing steps (like scaling), the chosen λ, and why you selected L1, L2, or Elastic Net. Ensure the same steps run in production as in training. Consistency preserves the benefits of regularization.
- Be careful with data leakage during lambda tuning. Keep validation data separate from training data. If you do multiple rounds of tuning, consider a final test set for an unbiased estimate. This prevents overly optimistic results.
- Use appropriate optimization methods. Standard gradient descent works smoothly with L2. For L1, prefer solvers like coordinate descent or proximal methods that handle non-differentiable points well. Correct solvers find reliable sparse solutions.
- Communicate the trade-offs clearly to stakeholders. Emphasize that a small increase in training error can mean a large decrease in test error. Point out how L1 improves interpretability, while L2 improves stability. Align the choice with business goals.
Glossary
Overfitting
When a model memorizes the training data, including noise, and does poorly on new data. It looks great during training but fails to generalize. This often happens when the model is too complex. Regularization helps prevent this. The goal is to balance fit and simplicity.
Bias-Variance Tradeoff
A balance between a model being too simple (high bias) and too flexible (high variance). Adding regularization increases bias a little to reduce variance a lot. Finding the right balance improves new-data performance. This is central to building good models.
Regularization
A technique that adds a penalty to the loss to discourage complex models with large weights. It keeps models simpler and reduces overfitting. Common types are L1 and L2. The penalty strength is set by a hyperparameter called lambda.
Cost Function (Loss)
A number that measures how wrong a model's predictions are. Training tries to make this number small. Regularization adds an extra part to this number to penalize complexity. Minimizing total loss balances fit and simplicity.
Mean Squared Error (MSE)
The average of squared differences between predictions and true values. Squaring makes big errors count more. It's smooth and easy for optimization. Often used in regression as the base loss.
Parameters (Weights)
Numbers inside the model that scale inputs to make predictions. Training changes these numbers to fit the data. Large weights can mean the model is too complex. Regularization pushes them to be small.
Hyperparameter
A setting you choose before training that controls how the model learns. It is not learned from the data like weights are. Lambda (λ) for regularization is a hyperparameter. You pick it by trying values and comparing results.
Lambda (Ī»)
The knob that controls how strong the regularization penalty is. Bigger λ means a stronger penalty and smaller weights. Smaller λ means a weaker penalty and more flexible models. It balances underfitting and overfitting.
