Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM tuning

Key Summary

  • Regularization is a method to prevent overfitting by adding a penalty for model complexity. Overfitting happens when a model memorizes training data, including noise, and performs poorly on new data. By discouraging overly complex patterns, regularization helps the model generalize better.
  • Overfitting often shows up as a model with lots of 'wiggles' that match every tiny bump in the training set. While training error goes down, test error eventually goes up as the model becomes too complex. Regularization pushes the model toward smoother, simpler functions.
  • L2 regularization (Ridge) adds a penalty equal to the sum of squared weights. This encourages all weights to be small but not exactly zero, creating a smooth model. Think of it like a spring pulling each weight gently toward zero.
  • L1 regularization (Lasso) adds a penalty equal to the sum of absolute weight values. This encourages many weights to become exactly zero, creating a sparse model. Sparse models help with feature selection because zero-weight features can be dropped.
  • Elastic Net combines L1 and L2 penalties to get the benefits of both. It can select features like L1 while also stabilizing solutions like L2. Two hyperparameters control the blend: one for L1 strength and one for L2 strength.
  • The penalty strength is controlled by a hyperparameter called lambda. A larger lambda means stronger punishment of big weights and a simpler model, and a smaller lambda means weaker punishment and a more flexible model. Lambda must be tuned carefully to avoid underfitting or overfitting.
  • You choose lambda using cross-validation. Split data into training, validation, and test sets; train on the training set, pick lambda using the validation set, and report final performance on the test set. Try many lambda values and pick the one with the best validation score.
  • Polynomial regression is a clear example of how complexity affects error. As polynomial degree increases, training error falls but test error eventually rises due to overfitting. Regularization reduces this effect by limiting the size of the coefficients.
  • Regularization is especially important in neural networks, which have many parameters. Techniques include L2/L1 weight penalties, dropout, and batch normalization. These methods reduce overfitting and improve generalization.
  • Dropout randomly turns off some neurons during training, which prevents the network from relying too much on any one path. This acts like training many smaller networks and averaging them. It’s a powerful regularizer for deep models.
  • Batch normalization stabilizes activations within layers, which can indirectly regularize the model. By normalizing intermediate outputs, it makes training smoother and more robust. This often improves generalization along with speed.
  • There is a Bayesian perspective that explains regularization as expressing a prior belief about parameters. L2 corresponds to a Gaussian (normal) prior that prefers small weights. L1 corresponds to a Laplace prior that prefers many exact zeros.
  • In practice, use L2 when you want smoothness and don’t need feature selection. Use L1 when you want a sparse model and automatic feature selection. Use Elastic Net when you want a balance and added stability, especially with correlated features.
  • The cost function with regularization equals prediction error plus a penalty term. For L2, it’s sum of squared errors plus lambda times sum of squared weights. For L1, it’s sum of squared errors plus lambda times sum of absolute weight values.
  • Choosing the train/validation/test split properly is key to honest model evaluation. The validation set helps you tune lambda, while the test set tells you how the final choice generalizes. Never tune on the test set, or you’ll overfit to it.
  • Regularization makes models more robust to noise and small data sets. By limiting parameter magnitude or number, it prevents the model from chasing random patterns. This leads to more reliable predictions on new, unseen data.

Why This Lecture Matters

Regularization matters because nearly every real-world machine learning problem faces the risk of overfitting. Data scientists and ML engineers work with noisy, limited, or high-dimensional data where models can easily learn false patterns. Regularization keeps models simple enough to generalize, making predictions more reliable in production settings. This reliability is vital for roles in healthcare (predicting risk without chasing random signals), finance (avoiding models that memorize rare market blips), retail (stable demand forecasting), and beyond. Without regularization, model performance can look great during development but collapse in the field.

By controlling parameter size or count, regularization reduces variance and sensitivity to small data changes. Teams can use it to create interpretable models too, especially with L1, which highlights the most important features. Regularization also supports better teamwork and reproducibility: by applying cross-validation and keeping the test set clean, organizations avoid misleading metrics and maintain trust in results.

For career development, mastering regularization is foundational. It shows you can build robust models, tune hyperparameters properly, and explain the theory (like the Bayesian view) behind your choices. These skills transfer directly to deep learning, where weight decay, dropout, and batch normalization are everyday tools. In today's industry, where models must be dependable and auditable, regularization isn't optional; it's a core competency that separates quick demos from durable, real-world systems.

Lecture Summary


01 Overview

This lecture teaches the core idea of regularization in machine learning: how to prevent overfitting by penalizing model complexity so that models generalize well to new data. The central message is simple but powerful: a model that fits every tiny bump in the training data—often shown as a function with many 'wiggles'—may look perfect on past data but fail on future data. Regularization fixes this by adding a penalty into the training objective that discourages overly large or numerous parameters, leading the learning algorithm to prefer simpler, smoother solutions.

The lecture focuses on two widely used forms of regularization: L2 regularization (also called Ridge regression) and L1 regularization (also called Lasso regression). L2 regularization adds a penalty based on the sum of squared weights, which encourages all weights to be small but rarely exactly zero. This acts like a gentle pull toward zero and results in smoother functions. L1 regularization adds a penalty based on the sum of absolute weight values, which encourages many weights to become exactly zero. This sparsity helps with feature selection by turning off unimportant features automatically. The lecture also introduces Elastic Net regularization, which combines L1 and L2 to get the benefits of both.

A practical question addressed here is how to choose the penalty strength, usually called lambda. The answer is to treat lambda as a hyperparameter and pick it using cross-validation. Specifically, you split your data into training, validation, and test sets. You train models with different lambda values on the training set, choose the one that performs best on the validation set, and finally report performance on the test set so you get an honest estimate of generalization. This workflow prevents sneaky overfitting to the test set.

Polynomial regression serves as an intuitive example of how complexity can harm generalization. Increasing the polynomial degree typically drives the training error down by capturing more patterns. However, after a point, the test error starts rising because the model starts fitting noise and making many sharp turns—the 'wiggles.' Regularization reduces the size of the coefficients, which calms down these wiggles and improves test performance.

The lecture emphasizes that regularization is even more important for neural networks because they often have a huge number of parameters. Methods such as L2/L1 penalties, dropout, and batch normalization help deep models generalize better. Dropout randomly turns off some neurons during training, which stops the network from depending too much on any one path and works like averaging many small networks. Batch normalization normalizes intermediate activations and tends to stabilize training and indirectly regularize the network.

Another key perspective presented is the Bayesian interpretation. In Bayesian terms, regularization encodes your prior belief about what parameter values are likely before seeing data. L2 corresponds to a Gaussian (normal) prior that prefers small weights, while L1 corresponds to a Laplace prior that strongly prefers many exact zeros. This view gives a theoretical justification for why these penalties make sense.

By the end, you understand what regularization is, why it’s necessary, and how to apply L1, L2, and Elastic Net. You also know how to choose lambda using cross-validation and how these ideas extend naturally to neural networks. The flow of the lecture moves from the motivation (overfitting and wiggly functions), to concrete tools (L1/L2/Elastic Net), to practical model selection (cross-validation), to broader applications (neural networks), and finally to theory (Bayesian priors).

02 Key Concepts

  • 01

    Regularization (definition and purpose): Regularization is a method to prevent overfitting by adding a penalty to model complexity. It’s like placing a speed limit on a car to keep it safe and controlled rather than letting it go dangerously fast. Technically, you add a term to the loss function that punishes large or numerous weights. This matters because without it, models can memorize noise and fail on new data. For example, a wiggly curve that touches every training point may look perfect but predicts poorly on future points.

  • 02

    Overfitting and generalization: Overfitting happens when a model learns the noise in the training data instead of the true pattern. Imagine drawing a line through every dot on a homework page, including accidental smudges; it fits the page perfectly but won’t match the real trend. Technically, training error decreases while test error eventually increases as model complexity grows. Generalization means performing well on new, unseen data, not just the training set. In practice, we watch test or validation error to detect overfitting and apply regularization to reduce it.

  • 03

    Complexity and 'wiggles' in functions: A model with many 'wiggles' can change direction sharply to nail each training point. Picture a roller coaster twisting to touch every flag in a park; it’s entertaining but not a smooth path. In equations, these wiggles come from large, sensitive parameters that react strongly to tiny input changes. Penalizing large weights reduces sudden swings and makes the function smoother. This helps predictions remain stable and reliable.

  • 04

    Polynomial regression as an illustration: Polynomial regression fits a curve using powers of x (like x, x^2, x^3, and so on). As you raise the degree, the model can bend more and more, which cuts training error. But after some degree, the test error rises because the curve starts chasing noise and creating sharp turns. Regularization dampens the coefficients and smooths the curve. For example, a 10th-degree polynomial might overfit, and adding L2 can calm it down.

  • 05

    L2 regularization (Ridge): L2 penalizes the sum of squared weights, encouraging all weights to be small. It’s like attaching springs from each weight to zero, gently pulling them back. Technically, you add lambda times the sum of squared weights to the loss function. This reduces variance by shrinking parameters without typically making them exactly zero. In practice, Ridge is great when you want smoothness and don’t need automatic feature selection.

  • 06

    Effect of lambda in L2: Lambda controls how strong the penalty is. Like a dial on the spring’s tightness, higher lambda pulls weights closer to zero and lowers model flexibility. If lambda is too small, the model may still overfit; if too large, it may underfit and miss patterns. Tuning lambda is essential for balancing bias and variance. For instance, in a linear model with many correlated features, a moderate lambda can stabilize coefficients.

  • 07

    L1 regularization (Lasso): L1 penalizes the sum of absolute weight values, pushing many weights to exactly zero. It’s like a tight rope (lasso) that can completely shut down weak features. Mathematically, you add lambda times the sum of absolute weights to the loss function. This creates sparsity and automatic feature selection. In practice, Lasso is useful when you suspect many features are irrelevant.

  • 08

    Sparsity and feature selection with L1: Sparsity means lots of weights become zero. Imagine cleaning your backpack by throwing out items you barely use—your bag gets lighter and simpler. L1 does this pruning by favoring solutions where unhelpful features are turned off. This matters because it improves interpretability and reduces overfitting. For example, with 1,000 features, L1 might keep only the 50 most helpful ones.

  • 09

    Elastic Net (combining L1 and L2): Elastic Net blends L1 and L2 penalties to get both sparsity and stability. Think of it as using both a rope (L1) and springs (L2) to control the model. Technically, the loss includes lambda1 times absolute weights plus lambda2 times squared weights. This guards against issues like selecting only one of several correlated features, which L1 alone might do. In practice, Elastic Net often performs well on high-dimensional, correlated data.

  • 10

    Choosing lambda with cross-validation: Lambda is a hyperparameter you must pick using a validation process. The analogy is test-driving many cars and choosing the one that handles best on a practice track. We train models on a training set with multiple lambda values and evaluate on a validation set, selecting the best one. Finally, we report performance on a test set to estimate real-world accuracy. This separation prevents overfitting the test set.

  • 11

    Train/validation/test split workflow: We divide data into three parts to make honest decisions. The training set teaches the model, the validation set helps tune lambda, and the test set checks generalization at the end. Without a validation set, you might tune on the test set and trick yourself into fake success. The clear split keeps results trustworthy. For example, use 60% train, 20% validation, and 20% test as a simple starting point.

  • 12

    Regularization in neural networks: Neural networks have many parameters and can easily overfit. Regularization methods like L2/L1 penalties, dropout, and batch normalization are crucial. Dropout randomly turns off some neurons during training to prevent co-dependence. Batch normalization stabilizes activations and often helps generalization. Together, these strategies make deep learning models more robust.

  • 13

    Dropout as a regularizer: Dropout randomly sets some neuron outputs to zero during training. It’s like training a team where some players sit out randomly so the team doesn’t rely only on stars. Technically, this forces the network to spread knowledge across many pathways. It reduces overfitting and often boosts test accuracy. During testing, dropout is turned off and outputs are scaled appropriately.

  • 14

    Batch normalization’s stabilizing effect: Batch normalization normalizes layer activations to have stable mean and variance. Like keeping a thermostat steady, it prevents wild swings inside the network. While mainly used for faster, more stable training, it also provides a regularizing effect. This improves generalization in many cases. In practice, it’s often combined with dropout and L2.

  • 15

    Bayesian view of regularization: In Bayesian terms, regularization encodes prior beliefs about parameters before seeing data. L2 corresponds to a Gaussian prior that prefers small weights centered at zero. L1 corresponds to a Laplace prior that prefers many exact zeros. This gives a theoretical reason for why regularization makes sense. It explains how prior beliefs and observed data blend to shape the final model.

  • 16

    When to favor L2 vs L1: Choose L2 when you want smoothness and don’t need to drop features. Choose L1 when you want a sparse model and automatic feature selection. Elastic Net offers a middle path when features are correlated or when you want both stability and sparsity. The right choice depends on your data and goals. Trying all three with cross-validation is a solid strategy.

  • 17

    Cost function with regularization: The training objective equals prediction error plus a penalty term. For L2, add lambda times the sum of squared weights; for L1, add lambda times the sum of absolute weights. This changes the solution the optimizer prefers. With a penalty, the model trades a tiny bit more training error for a big gain in test performance. The key is picking lambda so this trade-off is optimal.

  • 18

    Effect on polynomial regression coefficients: Regularization shrinks large coefficients that cause sharp bends. Imagine tightening screws so a flexible rod doesn’t bend wildly. Smaller coefficients lead to a smoother curve that ignores tiny fluctuations in data. This improves stability and test accuracy. For example, Ridge often calms high-degree polynomials without zeroing terms, while Lasso may zero out some powers entirely.

  • 19

    Hyperparameters and honest evaluation: Lambda is a hyperparameter, not learned directly from training loss. Picking it on the validation set keeps the test set clean for final evaluation. This avoids test set overfitting, which gives an overly optimistic view of performance. Always report the test result only once, after tuning is done. This process builds trust in model claims.

  • 20

    Why regularization matters in practice: Real-world data often has noise, small sample sizes, or many features. Without regularization, models can latch onto false patterns and fail in production. Penalties keep parameters under control and predictions steady. This leads to safer, more reliable systems. From healthcare to finance, this reliability is critical.

03 Technical Details

  1. Overall Architecture/Structure of Regularized Learning
  • Goal (What are we solving?): We are solving supervised learning problems (like regression or classification) where we want good predictions on new, unseen data. The core challenge is overfitting—when a model fits training data too closely and performs poorly on test data.
  • Basic Idea (How do we address it?): We add a penalty term to the loss function that punishes model complexity. Complexity is often measured by the size of the parameters (weights). The modified training objective becomes: Loss = Prediction Error + Regularization Penalty.
  • Components: (a) Data split: training, validation, test. (b) Model: linear regression, polynomial regression, neural networks, etc. (c) Loss function: measures how well predictions match targets (e.g., sum of squared errors). (d) Regularization: L2 (sum of squared weights) or L1 (sum of absolute weights), or a mix (Elastic Net). (e) Hyperparameter lambda: controls how strong the penalty is. (f) Optimizer: algorithm to find parameters that minimize the total loss.
  • Data Flow: 1) Choose a model family. 2) Define loss with regularization. 3) Train on the training set to minimize this loss. 4) Evaluate models with different lambda values on the validation set. 5) Pick the best lambda. 6) Retrain on train + validation (optional) with the chosen lambda. 7) Evaluate once on the test set for the final performance.
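
To make the objective concrete, here is a minimal sketch of the regularized loss described above; `regularized_loss` is a hypothetical helper and the data `X`, `y` are assumed:

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Prediction error plus a complexity penalty (hypothetical helper)."""
    residuals = X @ w - y
    error = np.sum(residuals ** 2)      # sum of squared errors
    if penalty == "l2":
        reg = lam * np.sum(w ** 2)      # Ridge: sum of squared weights
    else:
        reg = lam * np.sum(np.abs(w))   # Lasso: sum of absolute weights
    return error + reg
```
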
  2. L2 Regularization (Ridge) in Detail
  • Intuition: L2 encourages all weights to be small but not exactly zero. This smooths predictions, reduces sensitivity to noise, and prevents extreme parameter values.
  • Formal Description in Words: We add a penalty equal to lambda multiplied by the sum of squares of all weights. For a linear model y_hat = w^T x, the training objective becomes: sum of squared residuals plus lambda times sum of squared weights. The intercept (bias) term is often excluded from penalization so the baseline prediction remains flexible.
  • Effects: (a) Shrinkage—coefficients are pulled toward zero. (b) Stability—helps when features are correlated by distributing weight among them. (c) Bias-Variance Trade-Off—adds some bias but reduces variance, usually improving test error.
  • Practical Behavior: If lambda is tiny, the model acts like an unregularized one and can overfit. If lambda is huge, the model underfits because most weights are forced near zero.
  • Optimization: The total loss is smooth and convex for linear regression, so it has a unique global minimum. Many solvers (closed-form matrix solution for Ridge; gradient-based methods) can compute it efficiently.
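
As a sketch of the closed-form solution mentioned above, assuming features are centered so the unpenalized intercept can be handled separately:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w = (X^T X + lam * I)^(-1) X^T y -- unique minimizer of the L2-penalized SSE."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```
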
  3. L1 Regularization (Lasso) in Detail
  • Intuition: L1 encourages many weights to be exactly zero, creating a sparse model. This naturally performs feature selection.
  • Formal Description in Words: Add lambda times the sum of absolute values of the weights to the prediction error. This creates 'kinks' in the loss surface at zero, which makes some weights settle exactly at zero.
  • Effects: (a) Sparsity—turns off unhelpful features. (b) Interpretability—fewer nonzero coefficients make the model easier to understand. (c) Guard against overfitting—eliminates noisy signals.
  • Practical Behavior: L1 can be unstable with highly correlated features because it may pick one and drop others arbitrarily. In such cases, Elastic Net can be better.
  • Optimization: The objective is convex but not smooth at zero because of absolute values. Specialized solvers (like coordinate descent) handle it well.
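
A small illustration of Lasso's sparsity using scikit-learn (the synthetic data and alpha value are assumptions; note that scikit-learn calls the penalty strength `alpha` rather than lambda):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 50 features, most irrelevant
w_true = np.zeros(50)
w_true[:5] = 2.0                   # only 5 features actually matter
y = X @ w_true + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)  # alpha plays the role of lambda
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```
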
  4. Elastic Net Regularization
  • What It Is: A combination of L1 and L2 that blends sparsity and stability. The loss includes lambda1 times absolute weights plus lambda2 times squared weights.
  • Why Use It: When you want feature selection but also want to avoid L1’s tendency to pick a single feature among a group of correlated features. L2 helps share weights among related features.
  • Behavior: By adjusting the two lambdas (or a total lambda with a mixing ratio), you can move between mostly-L1 behavior and mostly-L2 behavior.
  • Optimization: Still convex for linear models. Coordinate descent works well in practice.
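
A sketch of Elastic Net on two nearly duplicated features (the synthetic data and settings are assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # feature 1 ~ feature 0
y = X[:, 0] + rng.normal(scale=0.5, size=200)

# alpha sets the total penalty strength; l1_ratio mixes L1 (1.0) and L2 (0.0)
model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print(model.coef_[:2])   # the L2 part tends to spread weight across both
```
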
  5. Choosing Lambda via Cross-Validation
  • Hyperparameter Tuning: Lambda is not learned directly from training loss; it must be selected using a validation procedure. Use a grid (e.g., powers of 10) or a finer search over promising ranges.
  • Data Splitting: Keep a clean separation: training for fitting parameters; validation for choosing lambda; test for final performance only. Do not use test data to make any choices.
  • K-Fold Cross-Validation: Instead of a single validation set, you can split training data into K folds. Train on K-1 folds and validate on the remaining fold, rotating so each fold is used once as validation. Average performance across folds to select lambda.
  • Practical Tips: Start with a wide search (e.g., lambda ∈ {1e-4,...,1e4}). If the best lambda is at a boundary, expand the search range.
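
A sketch of a K-fold search over a log-spaced grid (the variables `X_train`, `y_train` are assumed to exist):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

lambdas = np.logspace(-4, 4, 9)   # wide first pass; refine around the winner
scores = [
    cross_val_score(Ridge(alpha=lam), X_train, y_train,
                    cv=5, scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]
best_lam = lambdas[int(np.argmax(scores))]  # re-search if this hits a boundary
```
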
  6. Polynomial Regression Example Mechanics
  • Model: y_hat = w0 + w1*x + w2*x^2 + ... + wd*x^d. Higher degree d increases flexibility.
  • Without Regularization: As d grows, training error shrinks drastically, but the curve becomes wiggly and test error rises after a point. Large coefficients for high powers cause sharp bends.
  • With L2: Penalizes large coefficients evenly, reducing severe bends and improving generalization. Coefficients are shrunk but typically not zeroed.
  • With L1: Can completely drop some polynomial terms (e.g., set w7 = 0), yielding a simpler polynomial. Good when many powers are unnecessary.
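
A minimal pipeline sketch for a regularized degree-10 polynomial fit (the 1-D arrays `x`, `y` and the alpha value are assumptions):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Scaling matters: high powers of x have wildly different magnitudes,
# and the L2 penalty treats all coefficients alike.
model = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
model.fit(x.reshape(-1, 1), y)
```
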
  7. Regularization in Neural Networks
  • Why Critical: Neural networks have many parameters and can memorize training data easily.
  • L2/L1 in NNs: Add penalties on all layer weights to discourage overly large values. Many deep learning frameworks call L2 'weight decay' because weights decay toward zero during training.
  • Dropout: Randomly zeros a fraction (e.g., 0.5) of neuron outputs during training. This forces the network to distribute learning and reduces co-adaptation of neurons.
  • Batch Normalization: Normalizes activations within a minibatch to stabilize training. Although primarily for optimization stability, it often improves generalization as a side effect.
  • Implementation Notes: Apply L2/weight decay via optimizer settings; insert dropout layers after activations; include batch normalization layers near the start of each block.
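
A PyTorch sketch combining the three techniques above (the layer sizes and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes activations within each minibatch
    nn.ReLU(),
    nn.Dropout(p=0.3),    # randomly zeros 30% of activations during training
    nn.Linear(64, 1),
)

# weight_decay adds an L2-style penalty to all parameters during updates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

model.train()   # dropout active; batch norm uses minibatch statistics
model.eval()    # dropout off; batch norm uses running statistics
```
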
  8. Bayesian Interpretation
  • Prior Beliefs: Before seeing data, assume parameters are likely small (L2) or many are exactly zero (L1). Data updates these beliefs to produce the final parameters (the posterior).
  • L2 ↔ Gaussian Prior: Prefers small weights around zero, with probability decreasing smoothly as weights grow. This leads naturally to the squared penalty.
  • L1 ↔ Laplace Prior: Has a sharp peak at zero and heavier tails, encouraging many exact zeros but allowing some larger weights. This leads naturally to the absolute value penalty.
  • Why This Matters: Provides a principled, theoretical reason for penalties and helps understand how assumptions translate into model behavior.
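
In symbols, a sketch of the maximum a posteriori (MAP) derivation, assuming a Gaussian noise model so the negative log-likelihood is the sum of squared errors; lambda absorbs the noise and prior scale parameters:

```latex
% MAP estimation: maximize log-likelihood plus log-prior
\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \log p(\mathcal{D} \mid w) + \log p(w)

% Gaussian prior, \log p(w) = -\tfrac{1}{2\sigma^2}\lVert w \rVert_2^2 + \mathrm{const}:
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; \mathrm{SSE}(w) + \lambda \lVert w \rVert_2^2 \quad \text{(Ridge)}

% Laplace prior, \log p(w) = -\tfrac{1}{b}\lVert w \rVert_1 + \mathrm{const}:
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; \mathrm{SSE}(w) + \lambda \lVert w \rVert_1 \quad \text{(Lasso)}
```
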
  9. Practical Implementation Guide

Step 1: Split Data

  • Divide data into training, validation, and test sets (e.g., 60/20/20). Ensure the test set is used only once, at the end.

Step 2: Choose a Model Family

  • For tabular regression: start with linear or polynomial regression. For high-dimensional or noisy data, consider regularized linear models. For complex patterns, consider neural networks with regularization.

Step 3: Define Loss with Regularization

  • L2: Loss = squared error + lambda * sum of squared weights. L1: Loss = squared error + lambda * sum of absolute weights. Elastic Net: mix both with their own strengths.

Step 4: Tune Lambda

  • Prepare a candidate set (e.g., log-spaced values). Train the model for each lambda on the training set and evaluate on the validation set or via K-fold CV.

Step 5: Pick and Retrain

  • Select the lambda giving the best validation performance. Optionally retrain on combined train+validation data using this lambda.

Step 6: Final Test

  • Evaluate once on the test set to estimate real-world performance.
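
Putting steps 1–6 together, a minimal end-to-end sketch (the dataset `X`, `y` and the grid are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: 60/20/20 split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

# Steps 3-4: fit per lambda on the training set, score on validation
best_lam, best_err = None, float("inf")
for lam in np.logspace(-4, 2, 7):
    err = mean_squared_error(
        y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
    if err < best_err:
        best_lam, best_err = lam, err

# Step 5: retrain on train+validation; Step 6: touch the test set once
final = Ridge(alpha=best_lam).fit(X_trainval, y_trainval)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```
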
  10. Tools/Libraries (Conceptual Usage)
  • Scikit-learn (Python): Ridge, Lasso, ElasticNet, and CV versions like RidgeCV, LassoCV, ElasticNetCV make tuning easier. They handle scaling and coordinate descent internally.
  • Deep Learning Frameworks: PyTorch/Keras/TensorFlow support weight decay (L2), dropout layers, and batch normalization layers directly in model definitions and optimizers.
  11. Tips and Warnings
  • Don’t tune on the test set: This leaks information and inflates results. Always use a validation set or cross-validation for hyperparameters.
  • Start with L2: If you don’t need feature selection, L2 is a safe default for stability and smoothness. Try L1 or Elastic Net when you suspect many irrelevant features.
  • Check for underfitting: If validation error is high and training error is also high, lambda may be too large. Reduce lambda to allow more flexibility.
  • Check for overfitting: If training error is low but validation error is high, lambda may be too small. Increase lambda to penalize complexity.
  • Neural Networks: Use a combination—weight decay plus dropout (and often batch norm). Tune dropout rate (e.g., 0.1–0.5) and weight decay carefully.
  12. Worked Walkthrough (Conceptual)
  • Example Goal: Predict house prices from features like size, rooms, and age.
  • Baseline: Train plain linear regression and observe overfitting if many features are noisy.
  • Apply Ridge: Add L2 penalty; try lambda in {1e-4, 1e-3, ..., 1e2}. Pick the best by validation error; expect smoother, more stable coefficients.
  • Apply Lasso: Try L1 for feature selection; many coefficients may become zero, simplifying the model.
  • Elastic Net: If features are correlated (e.g., size and square footage), Elastic Net can share weights and avoid dropping useful features entirely.
  13. Understanding the Trade-Off
  • Training vs Generalization: Regularization slightly increases training error but often reduces validation/test error. The penalty trades a bit of fit for more reliability on new data.
  • Smoother Functions: Especially visible in polynomial regression, where regularization reduces wild oscillations. Smoother functions are less sensitive to tiny data changes.
  • Feature Pruning (L1): When many features exist, L1 helps automatically remove the unhelpful ones. This often improves interpretability and reduces variance.
  14. Connecting to Theory
  • Convexity: For linear models with L1/L2 penalties, the optimization problems are convex, ensuring a global optimum. This makes solutions stable and reliable.
  • Bayesian Grounding: Priors explain why penalties work and how they encode assumptions. This perspective aligns practical tuning with theoretical understanding.
  15. Extending to Classification
  • Although the lecture focuses on regression language, the same penalties apply to classification models (e.g., logistic regression). You simply use a different prediction error term (e.g., log loss) but keep the same L1/L2 penalties. The logic of lambda tuning and cross-validation remains the same. This generality makes regularization a universal tool.
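
For instance, a hedged sketch with scikit-learn's logistic regression (note that it exposes `C = 1/lambda`, so a smaller `C` means stronger regularization; `X_train`, `y_train` are assumed):

```python
from sklearn.linear_model import LogisticRegression

clf_l2 = LogisticRegression(penalty="l2", C=1.0)                      # Ridge-style
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")  # Lasso-style
clf_l2.fit(X_train, y_train)
clf_l1.fit(X_train, y_train)
```
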

04 Examples

  • 💡

    Polynomial degree tuning: Input is 1D x values with noisy y targets. Without regularization, a 10th-degree polynomial perfectly fits training points but shows high test error due to sharp wiggles. Adding L2 with a moderate lambda shrinks high-order coefficients, making the curve smoother and test error lower. The key point is that regularization calms overfitting by controlling coefficient size.

  • 💡

    Ridge regression on house prices: Features include square footage, number of rooms, and age; the target is price. Plain linear regression overfits because some features are noisy and correlated. Using Ridge with lambda chosen by cross-validation makes all coefficients smaller and more stable, improving test performance. The instructor emphasized that L2 encourages small weights and smooth predictions.

  • 💡

    Lasso for feature selection in text data: Input is a bag-of-words matrix with thousands of word features. Lasso sets many tiny coefficients to zero, keeping only the most informative words. This reduces model size and helps interpret which words matter for prediction. The point is that L1 creates sparsity and aids feature selection.

  • 💡

    Elastic Net on correlated features: Two features capture similar information (e.g., size in meters and size in feet). Lasso might keep only one, dropping the other entirely. Elastic Net balances L1 and L2 so both get nonzero weights, sharing credit more fairly. The takeaway is that Elastic Net handles correlated features better than pure L1.

  • 💡

    Tuning lambda via validation curve: Try lambda values from very small to very large. Plot validation error versus lambda; you’ll often see a U-shaped curve with a sweet spot. Pick the lambda at the bottom of the U for best generalization. This demonstrates how cross-validation guides hyperparameter choice (a plotting sketch appears at the end of this section).

  • 💡

    Neural network with dropout: A small MLP predicts a target from several features. Without dropout, training accuracy is high but test accuracy lags, indicating overfitting. Adding dropout of 0.3 reduces co-dependence and improves test accuracy, even if training accuracy drops slightly. This shows dropout’s regularizing effect.

  • 💡

    Neural network with L2 weight decay: The same MLP now includes L2 on all weight matrices. We tune weight decay (the L2 strength) by validation and find a value that improves test performance. The weights stay small, and the model becomes less sensitive to noise. The key is that weight decay is L2 regularization for networks.

  • 💡

    Batch normalization in a deep model: A deeper network struggles with unstable training. Adding batch normalization layers stabilizes activation distributions and speeds learning. Test accuracy improves, suggesting a regularization benefit alongside training stability. The point is that batch norm often improves generalization indirectly.

  • 💡

    Bayesian interpretation of L2: Assume a Gaussian prior that prefers weights near zero before seeing data. After observing training data, the posterior balances this prior with the likelihood from data. The resulting solution mirrors Ridge regression’s effect of shrinking weights. This example connects practice to theory.

  • 💡

    Bayesian interpretation of L1: Assume a Laplace prior with a sharp peak at zero, preferring many exact zeros. Observing data updates this belief, leading to many weights being exactly zero. This aligns with Lasso’s sparsity behavior. The example shows why L1 naturally performs feature selection.

  • 💡

    Underfitting due to too-large lambda: Choose an extremely large lambda in Ridge. The model’s weights shrink too much, and both training and validation errors become high. Reducing lambda brings back flexibility and lowers error. The lesson is that over-regularization can be as harmful as under-regularization.

  • 💡

    Overfitting due to too-small lambda: Use a tiny lambda in Lasso on noisy data. The model barely penalizes complexity and learns noise, causing validation error to rise. Increasing lambda reduces noise chasing and lowers test error. This reinforces that lambda must be tuned, not guessed.

  • 💡

    Train/validation/test discipline: Split your dataset 60/20/20. Use the 60% to fit models across lambdas, the 20% validation to pick the best lambda, and the final 20% test to report accuracy once. This prevents accidental test overfitting. The example highlights proper experimental hygiene.

  • 💡

    High-dimensional genetics data: Thousands of genetic markers predict a health outcome. Lasso identifies a small subset of markers with nonzero weights, improving interpretability and avoiding overfitting. Elastic Net may further stabilize selection when markers are correlated. The point is that regularization is vital in high-dimensional problems.

  • 💡

    Time series with polynomial trend: Fit a polynomial trend to smooth a time series. Without penalties, high-degree terms create unrealistic oscillations between points. Adding L2 keeps the trend smooth and robust to outliers. The key insight is that regularization suppresses spurious bends.
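
Returning to the "Tuning lambda via validation curve" example above, a minimal plotting sketch (the dataset `X`, `y` and the grid are assumptions; scikit-learn's `alpha` plays the role of lambda):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

lambdas = np.logspace(-4, 4, 17)
train_sc, val_sc = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=lambdas,
    cv=5, scoring="neg_mean_squared_error",
)
plt.semilogx(lambdas, -val_sc.mean(axis=1))   # expect a U shape
plt.xlabel("lambda (alpha)")
plt.ylabel("cross-validated MSE")
plt.show()                                    # pick lambda near the minimum
```
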

05 Conclusion

Regularization is a cornerstone of building models that truly generalize. The main idea is simple: add a penalty for model complexity so the learning algorithm prefers simpler, smoother solutions. L2 regularization (Ridge) shrinks weights toward zero, improving stability without usually zeroing them out. L1 regularization (Lasso) pushes many weights to exactly zero, providing sparsity and automatic feature selection. Elastic Net blends both, offering a practical balance when features are correlated or when you want both sparsity and stability.

The choice of penalty strength, lambda, is critical and must be tuned using cross-validation. By splitting data into training, validation, and test sets—and only using the test set once at the very end—you ensure an honest estimate of how the model will perform in the real world. Polynomial regression offers an intuitive picture of overfitting: as flexibility grows, training error can drop while test error rises. Regularization counters this by shrinking large coefficients and smoothing away the wiggles.

For neural networks, regularization is even more important because of their large number of parameters. L2/L1 weight penalties, dropout, and batch normalization help these models avoid memorizing noise and instead learn patterns that transfer to new data. Finally, the Bayesian perspective connects practice to theory by showing that regularization encodes prior beliefs about parameter sizes and sparsity.

Right after this, you can apply L2, L1, or Elastic Net to your own regression or classification tasks, tune lambda via cross-validation, and check how your test performance improves. If you work with deep learning, add weight decay and dropout, and consider batch normalization to stabilize training. The core message to remember is this: controlling complexity through regularization is essential for building reliable machine learning models that perform well beyond the training data.

Key Takeaways

  • Start with L2 regularization when you want a stable, smooth model. Tune lambda using cross-validation and pick the value that minimizes validation error. Watch for underfitting if lambda is too large and overfitting if it’s too small. Report final performance on the test set only once.
  • Use L1 regularization when you need sparsity and automatic feature selection. Expect many coefficients to become exactly zero, which simplifies the model and improves interpretability. Be cautious with highly correlated features because L1 may keep just one and drop the others. Consider Elastic Net when correlations are strong.
  • Apply Elastic Net to balance sparsity and stability. Adjust the blend of L1 and L2 to handle correlated features and maintain robust performance. Tune both the overall strength and the mixing ratio by cross-validation. This often yields strong results in high-dimensional problems.
  • Always separate training, validation, and test sets. Train on the training set, tune lambda on the validation set, and evaluate once on the test set. This discipline avoids hidden test overfitting and inflated claims. Keep the test set untouched until the very end.
  • Check validation curves to understand the effect of lambda. If the best value is at the edge of your search range, expand the range and try again. A U-shaped validation error curve is common, with a sweet spot in the middle. Choose the lambda at or near the minimum.
  • Diagnose underfitting versus overfitting by comparing training and validation errors. High errors on both sets suggest underfitting and possibly too-strong regularization. Low training but high validation error suggests overfitting and too-weak regularization. Adjust lambda accordingly.
  • For polynomial regression or other flexible models, use regularization to calm sharp wiggles. Shrinking high-order coefficients makes the curve smoother and more reliable. L2 typically shrinks all terms, while L1 can drop some powers entirely. Validate to find the best balance.
  • In neural networks, combine weight decay (L2) with dropout for strong regularization. Tune the dropout rate (e.g., 0.1–0.5) and weight decay through validation. Consider batch normalization to stabilize training and improve generalization. Use a consistent evaluation protocol with a held-out test set.
  • Remember the Bayesian story to explain regularization choices. L2 corresponds to a Gaussian prior that favors small weights; L1 corresponds to a Laplace prior that favors zeros. This perspective helps justify settings to stakeholders. It also clarifies why these penalties produce the behaviors they do.
  • Favor L2 when you do not need feature selection and want stability with correlated features. Use L1 when you believe many features are irrelevant and want a compact model. Elastic Net is a practical default when unsure, especially with correlated inputs. Validate all choices rather than guessing.
  • Scale your hyperparameter search over several orders of magnitude. Log-spaced grids (like 1e-4 to 1e2) quickly find the right region for lambda. If the optimum lies near the boundary, widen the search. This saves time and prevents missing the best setting.
  • Interpret model simplicity as a strength, not a weakness. Regularization may increase training error slightly but will usually lower test error. The goal is not perfect training performance; it’s reliable generalization. Communicate this trade-off clearly to your team.
  • Use sparse solutions from L1 to improve interpretability and deployment speed. Dropping unneeded features simplifies data pipelines and reduces inference cost. This is especially helpful in resource-limited environments. Validate that performance remains strong after pruning.
  • Document your cross-validation procedure and chosen lambda. Record the search space, best values, and scores to ensure reproducibility. This builds trust and makes future experiments faster. It also helps you explain decisions during reviews.
  • For small datasets, regularization is even more important. With limited data, models can easily memorize noise. Stronger regularization often improves stability and protects against fragile predictions. Cross-validate carefully to avoid picking a noisy lambda.
  • When performance drifts in production, consider re-tuning lambda. Data distributions change, and your original regularization strength may no longer be optimal. Periodic cross-validation on fresh data helps maintain generalization. Automate this process if possible.

Glossary

Regularization

A method to prevent overfitting by adding a penalty for model complexity. It keeps a model from becoming too wiggly or sensitive to noise. The penalty pushes the model to use smaller or fewer weights. This usually improves performance on new, unseen data.

Overfitting

When a model learns the noise in the training data instead of the true pattern. Training error is very low, but test error is high. The model fits tiny bumps and outliers that won’t repeat. It usually happens when the model is too complex.

Generalization

How well a model performs on new, unseen data. A good model doesn’t just memorize; it captures true patterns. Regularization improves generalization by limiting complexity. We measure it with validation or test sets.

Training set

The part of data used to fit the model’s parameters. The model learns patterns from this set. If it only focuses here, it may overfit. Regularization helps keep learning balanced.

Validation set

The part of data used to tune hyperparameters like lambda. It acts as a practice field for choosing settings. You do not train on it directly, just evaluate. It guides which model choice is best.

Test set

The final dataset used to report performance after all choices are made. It must not influence model or hyperparameter selection. It tells you how the model may perform in the real world. Using it early risks overfitting to the test.

Hyperparameter

A setting chosen before training that shapes how the model learns. It’s not learned from the training loss directly. Examples include lambda in regularization or dropout rate. You pick it using validation or cross-validation.

Lambda (regularization strength)

A number that controls how strongly the penalty term pushes weights down. A larger lambda means stronger penalty and a simpler model. A smaller lambda means weaker penalty and more flexibility. It must be tuned for best performance.


Tags: regularization, ridge regression, lasso, elastic net, overfitting, generalization, cross-validation, lambda, weight decay, dropout, batch normalization, neural networks, polynomial regression, feature selection, sparsity, Bayesian inference, Gaussian prior, Laplace prior, train/validation/test split, hyperparameter tuning