Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 7 - Agentic LLMs

Key Summary

  • This lecture explains L1 regularization, also called LASSO, as a way to prevent overfitting by adding a penalty to the loss that depends on the absolute values of model weights. Overfitting means a model memorizes the training data but fails on new data. By penalizing large weights, L1 helps the model focus on the strongest, most useful features.
  • Regularization adds a term to the loss: objective = loss + λ × penalty, where λ (lambda) controls how strong the penalty is. A bigger λ means the model cares more about keeping weights small, even if training loss rises a bit. This creates a trade-off between fitting the data and keeping the model simple.
  • The L1 penalty uses the L1 norm: the sum of absolute values of the weights. In contrast, L2 (Ridge) uses the sum of squared weights. This difference leads to very different behavior when optimizing.
  • L1 often sets some weights exactly to zero, which is called sparsity. Zero weights mean those features are effectively removed, so L1 performs automatic feature selection. L2 usually makes weights small but not exactly zero, so it shrinks without selecting.
  • The geometric picture explains why L1 gives zeros: in 2D, the L1 constraint region is a diamond, which has sharp corners on the axes. When the loss contours touch the diamond at a corner, one weight becomes exactly zero.
  • For L2, the constraint region is a circle, which is smooth and has no corners. When the loss contours touch the circle, the point rarely lies exactly on an axis. That’s why L2 seldom produces exact zeros.
  • When you have many features and believe most are irrelevant, L1 is a strong choice. It can select a smaller, more meaningful subset of features automatically. This also makes the model easier to interpret.
  • When all features are somewhat relevant or you want more stability, L2 is a good choice. It keeps every feature but reduces their influence by shrinking the weights. This often leads to more stable results when data changes a little.
  • Elastic Net is a blend of L1 and L2 penalties. It uses two hyperparameters to mix them and can combine sparsity with stability. It’s helpful when features are correlated and pure L1 is unstable.
  • Optimizing with L1 can be computationally more expensive because the absolute value function is not differentiable at zero. This stops vanilla gradient descent from working directly. Methods like subgradient descent and coordinate descent are used instead.
  • Subgradient descent uses a generalized notion of the derivative to step downhill even at kinks like zero. Coordinate descent updates one weight at a time while holding others fixed, which works well with L1’s structure. These methods can be slower than standard gradient descent used for L2.
  • Choosing λ is crucial and usually done by cross-validation. Small λ risks overfitting, while large λ can underfit by zeroing too many coefficients. The best λ balances accuracy and simplicity.
  • L1 can be less stable than L2: small changes in data can flip which features get selected. This is especially true when features are highly correlated. Elastic Net can soften this issue by adding some L2.
  • If you already know exactly which features are irrelevant, you can remove them before training. But in practice, you often don’t know, and L1 can discover them automatically. L1 can also keep tiny effects if λ isn’t too strong.
  • The core idea: regularization adds a gentle leash on model complexity. L1’s leash tugs some weights all the way to zero, simplifying the model. L2’s leash pulls all weights toward zero without cutting them off.
  • In summary, use L1 when you want feature selection and interpretability, L2 when you want stability and smooth shrinkage, and Elastic Net when you want a mix. Always tune λ and evaluate on validation data. The geometry (diamond vs circle) explains the different results you’ll see.

Why This Lecture Matters

L1 regularization is vital for anyone working with high-dimensional data, like data scientists, ML engineers, and analysts in fields such as text mining, bioinformatics, finance, and marketing. It solves the real problem of too many features by automatically selecting a small, useful subset, making models simpler, faster, and easier to interpret. In real projects, you rarely know which features are irrelevant ahead of time; L1 discovers this during training. By controlling overfitting, L1 helps models perform better on new, unseen data—exactly what matters in production. This knowledge maps directly to daily work: you can use L1 to prune bloated feature sets, communicate which inputs matter, and reduce maintenance costs. It also provides a pathway when stakeholders demand transparent models, as sparse coefficients clearly show what drives predictions. Understanding the geometry (diamond vs circle) and the optimization issues (non-differentiability at zero) prepares you to choose the right method and optimizer, and to explain trade-offs to teammates and managers. In today’s industry, where datasets can have thousands or millions of features, L1 is a core tool that turns complexity into clarity—a valuable skill that strengthens your career and the reliability of your models.

Lecture Summary

01 Overview

This lecture teaches L1 regularization (also called LASSO), a powerful technique to prevent overfitting by adding a penalty to a model’s loss function that depends on the sum of absolute values of the model’s weights. Overfitting happens when a model fits training data extremely well but performs poorly on new, unseen data. The key idea behind regularization is to balance two goals at once: fit the data well and keep the model simple. L1 regularization creates simplicity by pushing many weights to be exactly zero, which automatically removes unhelpful features and makes the model easier to interpret.

The lecture starts by reminding you why we regularize at all: models with too many degrees of freedom can latch onto noise in the dataset. To fight this, we add a penalty to the loss, giving the objective function: objective = loss + λ × penalty. The parameter λ (lambda) controls how strong the penalty is; higher λ means stronger regularization, which usually shrinks or zeroes more coefficients. The lecture then contrasts two main types of penalties: L2 (Ridge), which uses the sum of squared weights, and L1 (LASSO), which uses the sum of absolute values. This difference might look small at first, but it leads to very different outcomes when optimizing.

A central part of the lecture is the geometric intuition. In two dimensions with weights w1 and w2, you can imagine the optimization as moving contour lines (like elevation lines on a map) of the loss function until they just touch a “constraint region.” For L2, the constraint region is a circle (because w1^2 + w2^2 ≤ c). For L1, the constraint region is a diamond (because |w1| + |w2| ≤ c). Circles are smooth, while diamonds have sharp corners that sit right on the axes. When the loss contours press against the diamond, they are more likely to touch at a corner, which makes one weight exactly zero. This geometric picture explains why L1 creates sparse solutions (lots of zeros), while L2 rarely does.

The lecture then discusses when to use each method. If you have many features and suspect many are irrelevant (common in high-dimensional problems), L1 is very useful because it performs feature selection automatically. If you have a moderate number of features and believe most matter at least a little, L2 is often better, as it shrinks all weights toward zero without turning them off completely. Sometimes you might want both effects: some sparsity and some stability. That is where Elastic Net comes in—a combined penalty that blends L1 and L2, controlled by a mixing hyperparameter.

Next, the talk covers practical pros and cons. Advantages of L1 include built-in feature selection, which leads to simpler, more interpretable models with fewer non-zero coefficients. This can also reduce storage and computation for prediction. Disadvantages include computational considerations: the absolute value function in L1 is not differentiable at zero (it has a sharp corner), so basic gradient descent doesn’t apply directly. Instead, you use methods like subgradient descent or coordinate descent, which can be slower than the smooth-gradient methods commonly used for L2. Another drawback is stability: L1 solutions can change a lot if the data changes slightly, especially when features are highly correlated, because different features may take turns being selected.

The instructor also answers common questions. Why is L1 computationally heavier? Because its penalty is non-differentiable at zero, forcing us to use slower methods like subgradient or coordinate descent. Why not just remove irrelevant features before training? You often don’t know which ones are irrelevant, or they may have tiny effects that are still useful; L1 can discover this automatically and keep small but real signals if λ isn’t overly large. Finally, choosing λ is critical and is typically done by cross-validation: you try different λ values, check validation performance, and pick the best balance.

By the end, you should be able to: explain why we regularize, write the L1 objective function, describe how λ controls the trade-off between fit and simplicity, explain geometrically why L1 sets coefficients to zero while L2 does not, decide when to use L1 vs L2 vs Elastic Net, and understand the computational differences and algorithms used for L1. The lecture is designed for beginners to intermediate learners who know basic linear models and loss functions. Prior knowledge helpful here includes linear regression, the idea of fitting a loss function, and basic optimization concepts like gradients. With these ideas, you can apply L1 to build simpler, more interpretable, and better-generalizing models.

02 Key Concepts

  • 01

    L1 Regularization (LASSO)

    • 🎯 One-line definition: A penalty added to the loss that sums absolute values of weights, encouraging many weights to be exactly zero.
    • 🏠 Analogy: It’s like packing a backpack for a trip and throwing out items you don’t truly need to keep your load light.
    • 🔧 Technical explanation: The objective is loss + λ × Σ|wi|; the absolute value creates sharp corners in the feasible region, leading to sparse solutions.
    • 💡 Why it matters: It prevents overfitting and performs automatic feature selection, improving generalization and interpretability.
    • 📝 Example: In a dataset with 10,000 features where only 50 matter, L1 will push most coefficients to zero, keeping only the useful ones.
  • 02

    Regularization in General

    • 🎯 One-line definition: Techniques that limit model complexity to reduce overfitting by adding a penalty to the loss.
    • 🏠 Analogy: It’s like setting a speed limit so a car (the model) doesn’t drive recklessly fast and crash on new roads.
    • 🔧 Technical explanation: We minimize objective = loss + λ × penalty; higher λ prioritizes simplicity over perfect training fit.
    • 💡 Why it matters: Without it, models can memorize noise and fail on new data.
    • 📝 Example: A polynomial regression with high degree wildly fits training data but fails on test data; adding regularization smooths the curve.
  • 03

    Overfitting

    • 🎯 One-line definition: When a model fits training data very well but performs poorly on unseen data.
    • 🏠 Analogy: Studying only past test answers and not the concepts, then failing a new test with different questions.
    • 🔧 Technical explanation: High variance models latch onto noise; penalties reduce variance by shrinking or zeroing weights.
    • 💡 Why it matters: Overfit models give unreliable predictions in real-world settings.
    • 📝 Example: A spam filter that memorizes specific emails but misses new spam styles performs poorly without regularization.
  • 04

    The Role of λ (Lambda)

    • 🎯 One-line definition: A hyperparameter controlling how strongly we penalize large coefficients.
    • 🏠 Analogy: Think of λ as a dial: turning it up quiets the model down (simpler, smoother fits), while turning it down lets the model follow the data more loudly (a closer but riskier fit).
    • 🔧 Technical explanation: As λ increases, more coefficients shrink toward zero; at very large λ, most may be zero.
    • 💡 Why it matters: Choosing λ balances bias (too large) and variance (too small).
    • 📝 Example: Cross-validation can select λ that yields the best validation error.
  • 05

    L1 vs L2 Penalties

    • 🎯 One-line definition: L1 sums absolute values; L2 sums squared values of weights.
    • 🏠 Analogy: L1 is like cutting off items entirely from your packing list; L2 is like making every item smaller but keeping them all.
    • 🔧 Technical explanation: L1 leads to sparse solutions due to the diamond-shaped constraint; L2 leads to smooth shrinkage due to the circular constraint.
    • 💡 Why it matters: It determines whether your model selects features (L1) or keeps them with reduced influence (L2).
    • 📝 Example: A model with many weak features benefits from L1; one where all features matter benefits from L2.
  • 06

    Geometric Intuition (Diamond vs Circle)

    • 🎯 One-line definition: L1 constraint regions are diamonds; L2 constraint regions are circles.
    • 🏠 Analogy: A map where your allowed area is a diamond with sharp corners versus a smooth circle.
    • 🔧 Technical explanation: Loss contours tend to touch the L1 diamond at corners (axes), making some weights zero; L2’s smooth circle rarely hits exactly on axes.
    • 💡 Why it matters: This picture explains why L1 performs feature selection and L2 does not.
    • 📝 Example: In 2D, the L1 optimum often lies on w1 = 0 or w2 = 0; L2’s optimum typically has both non-zero.
  • 07

    Feature Selection via Sparsity

    • 🎯 One-line definition: Automatically setting many coefficients to zero, keeping only the most useful features.
    • 🏠 Analogy: Cleaning a messy room by tossing out unneeded items, not just hiding them.
    • 🔧 Technical explanation: The L1 penalty’s kink at zero makes zero an attractive solution for many weights during optimization.
    • 💡 Why it matters: Reduces complexity, improves interpretability, and can speed up inference.
    • 📝 Example: In genetics, L1 can pick a handful of genes that best predict a trait from thousands measured.
  • 08

    When to Prefer L1

    • 🎯 One-line definition: Use L1 when you have many features and expect many are irrelevant.
    • 🏠 Analogy: When backpack space is tiny, you only carry essentials.
    • 🔧 Technical explanation: L1 drives many coefficients exactly to zero, simplifying the model.
    • 💡 Why it matters: It removes noise features and highlights the truly predictive ones.
    • 📝 Example: Bag-of-words text models with huge vocabularies benefit from L1 to select key terms.
  • 09

    When to Prefer L2

    • 🎯 One-line definition: Use L2 when most features are at least somewhat relevant or you want stability.
    • 🏠 Analogy: You keep all tools in a smaller size instead of throwing any away.
    • 🔧 Technical explanation: L2 reduces coefficient magnitudes smoothly, rarely to exact zero.
    • 💡 Why it matters: More stable solutions when data changes slightly or features are correlated.
    • 📝 Example: Sensor data where each signal contributes a bit is well-handled by L2.
  • 10

    Elastic Net

    • 🎯 One-line definition: A combination of L1 and L2 penalties controlled by a mixing parameter.
    • 🏠 Analogy: Using both trimming and resizing to lighten your backpack.
    • 🔧 Technical explanation: Objective = loss + λ(α Σ|wi| + (1−α) Σwi^2); α tunes between L1 and L2 behavior.
    • 💡 Why it matters: Provides sparsity with improved stability, helpful for correlated features.
    • 📝 Example: In finance with many correlated indicators, Elastic Net balances selection and robustness.
  • 11

    Computational Challenge of L1

    • 🎯 One-line definition: L1 is harder to optimize because |x| is not differentiable at x=0.
    • 🏠 Analogy: Rolling a ball down a hill that has sharp edges—it’s harder to decide which way to roll at the corner.
    • 🔧 Technical explanation: Gradient descent needs derivatives; L1’s kink at zero demands subgradients or coordinate-wise updates.
    • 💡 Why it matters: Training can be slower than with the smooth L2 penalty.
    • 📝 Example: You might switch to coordinate descent, updating one weight at a time efficiently.
  • 12

    Subgradient Descent

    • 🎯 One-line definition: An optimization method that uses a generalized slope when the true derivative doesn’t exist.
    • 🏠 Analogy: If the road has a sudden corner, you pick any safe direction that still heads downhill.
    • 🔧 Technical explanation: At zero, the subgradient of |w| is any value in [−1, 1]; the algorithm uses these to step toward lower objective.
    • 💡 Why it matters: Enables training with non-differentiable penalties like L1.
    • 📝 Example: For a coefficient hovering near zero, subgradient picks a direction that encourages sparsity.
  • 13

    Coordinate Descent

    • 🎯 One-line definition: An algorithm that optimizes one coefficient at a time while holding others fixed.
    • 🏠 Analogy: Cleaning your house room by room instead of all at once.
    • 🔧 Technical explanation: Each step solves a one-dimensional problem, naturally handling the L1 penalty and often yielding exact zeros.
    • 💡 Why it matters: Efficient and popular for L1-regularized models.
    • 📝 Example: Iteratively updating w1, then w2, …, until convergence gives sparse solutions quickly.
  • 14

    Stability Considerations

    • 🎯 One-line definition: L1 solutions can change a lot if the data changes slightly.
    • 🏠 Analogy: A seesaw that tips from one side to the other with a tiny nudge.
    • 🔧 Technical explanation: When features are correlated, L1 may pick one and drop others; small data shifts can flip the choice.
    • 💡 Why it matters: Model selections may vary run to run; interpret with care.
    • 📝 Example: Two similar words in text may trade off being selected depending on tiny sample differences.
  • 15

    Constrained vs Penalized Views

    • 🎯 One-line definition: Regularization can be seen as minimizing loss with a penalty or as minimizing loss subject to a norm constraint.
    • 🏠 Analogy: Pay a fine for packing too much or set a strict weight limit on your bag.
    • 🔧 Technical explanation: Penalized form: loss + λ × penalty; constrained form: loss minimized with Σ|wi| ≤ c for L1, Σwi^2 ≤ c for L2.
    • 💡 Why it matters: The geometric intuition comes from the constrained picture and explains sparsity.
    • 📝 Example: The L1 constraint forms a diamond region; the optimum often lands on its axes.
  • 16

    Interpretability Benefits

    • 🎯 One-line definition: Sparse models are easier to understand because only a few features are active.
    • 🏠 Analogy: A recipe with only five ingredients is simpler to follow than one with fifty.
    • 🔧 Technical explanation: Zero coefficients remove features from the prediction rule, clarifying which inputs matter most.
    • 💡 Why it matters: Helps stakeholders trust and act on model insights.
    • 📝 Example: In healthcare, seeing that three specific lab tests drive risk scores can guide clinical focus.
  • 17

    Why Not Remove Features First?

    • 🎯 One-line definition: You often don’t know which features are irrelevant ahead of time.
    • 🏠 Analogy: You only realize which tools you don’t need once you start fixing things.
    • 🔧 Technical explanation: L1 automatically tests usefulness during training, keeping small but real effects if λ is tuned well.
    • 💡 Why it matters: Saves manual effort and finds subtle signals.
    • 📝 Example: In marketing data, L1 may keep a low-impact but consistent feature that a human might prematurely drop.
  • 18

    Bias-Variance Trade-off with L1

    • 🎯 One-line definition: Increasing λ increases bias but reduces variance; L1 controls this by removing features.
    • 🏠 Analogy: Speaking in simpler sentences (higher bias) reduces the chance of being misunderstood (lower variance).
    • 🔧 Technical explanation: Stronger L1 raises training error but improves test error by avoiding noise-fitting.
    • 💡 Why it matters: The right balance improves generalization.
    • 📝 Example: A small λ overfits; a large λ underfits by zeroing too many features; cross-validation finds the sweet spot.
  • 19

    Practical Choice: Try Both L1 and L2

    • 🎯 One-line definition: In practice, you should try L1, L2, and Elastic Net, then choose by validation results.
    • 🏠 Analogy: Try on different shoe types to see which fits best for the journey.
    • 🔧 Technical explanation: Train multiple models with tuned λ values; compare metrics on held-out data.
    • 💡 Why it matters: Real datasets vary; no single method wins everywhere.
    • 📝 Example: A dataset with few but strong features favors L1; another with many mild features favors L2.

03 Technical Details

  1. Overall Architecture/Structure of Regularized Learning

At a high level, supervised learning with regularization has these components:

  • Data: Inputs X (features) and outputs y (targets).
  • Model: A function f(x; w) with parameters w (e.g., linear regression f(x) = w·x).
  • Loss: A measure of prediction error (e.g., mean squared error for regression).
  • Regularizer: A penalty that discourages complex models by shrinking weights.
  • Objective: The sum of the loss and the penalty, weighted by λ (lambda), the regularization strength.
  • Optimizer: An algorithm to minimize the objective (e.g., subgradient or coordinate descent for L1).
  • Validation: A method to select λ and compare models (e.g., cross-validation).

Data flow: You start with data (X, y). You choose a model and define the loss. You add a regularization term. You minimize the objective to find weights w*. During optimization, the regularizer influences w* by shrinking or zeroing coefficients. Finally, you evaluate on held-out data to ensure generalization.

  2. L1 Objective Function and Constrained Form

For a linear regression model, the L1-regularized objective is: Objective(w) = Loss(y, Xw) + λ × Σ_i |w_i|.

  • Loss(y, Xw) is often the mean squared error: (1/2n) Σ (y_j − x_j·w)^2, where j indexes data points.
  • The L1 penalty is the L1 norm: ||w||_1 = Σ_i |w_i|.
  • λ ≥ 0 controls the strength of the penalty.

Constrained view: Minimize Loss(y, Xw) subject to ||w||_1 ≤ c for some constant c. There is a one-to-one mapping between λ in the penalized problem and c in the constrained problem (for convex problems). The constrained view gives geometric insight:

  • L1 constraint: |w1| + |w2| ≤ c forms a diamond in 2D.
  • L2 constraint: w1^2 + w2^2 ≤ c forms a circle in 2D.
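As a minimal sketch, the penalized form above can be computed directly. The synthetic data, feature count, and λ value below are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # n = 100 samples, 5 features
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])  # a sparse "true" weight vector
y = X @ w_true + 0.1 * rng.normal(size=100)

def l1_objective(w, X, y, lam):
    """Objective(w) = (1/2n) * sum((y - Xw)^2) + lam * sum(|w_i|)."""
    n = len(y)
    loss = 0.5 / n * np.sum((y - X @ w) ** 2)
    penalty = lam * np.sum(np.abs(w))
    return loss + penalty

# At w = 0 the penalty vanishes, so only the loss term remains
print(l1_objective(np.zeros(5), X, y, lam=0.1))
```

Raising lam increases the objective for any non-zero w, which is exactly the pressure that shrinks coefficients toward zero.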

  3. Geometric Intuition: Why L1 Creates Sparsity

Imagine the loss function’s contour lines (like height lines on a hill) around its unconstrained minimum. As you tighten the constraint (smaller c) or increase λ, the feasible region shrinks toward the origin. For L1, the diamond has corners on the axes. When the loss contours are pushed outward until they just touch the diamond, they often touch at a corner—exactly where one coordinate is zero. This makes the solution lie on an axis (some w_i = 0). In higher dimensions, the L1 feasible set is a polytope with many corners. Touch points at these corners naturally create many zeros, leading to sparsity.

In contrast, L2’s circle is smooth and symmetrical with no corners. When contours touch it, the solution tends to have all components non-zero, though they are small. This smoothness also makes L2 easy to optimize with standard gradient methods.
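The sparsity this picture predicts is easy to observe empirically. A hedged sketch with scikit-learn, assuming synthetic data in which only the first three of twenty features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]            # only the first 3 features matter
y = X @ w_true + 0.5 * rng.normal(size=200)

# In scikit-learn, `alpha` plays the role of the lecture's lambda
lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=0.2).fit(X, y)

print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))  # many
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))  # typically none
```

The L1 fit zeroes out most of the irrelevant coefficients exactly, while the L2 fit shrinks all twenty without eliminating any.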

  4. Optimization: Why L1 Needs Special Methods

The absolute value function |w| is not differentiable at w = 0; its left derivative is −1 and right derivative is +1. Gradient descent relies on the derivative to tell you which way to move. Because of the kink at zero, you cannot compute a unique gradient there. Two common methods handle this:

  • Subgradient Descent: The subgradient generalizes the derivative for convex but non-smooth functions. For |w|, the subgradient at w ≠ 0 is sign(w), and at w = 0 it is any value in [−1, 1]. Subgradient descent takes small steps in a subgradient direction to reduce the objective. It works but can be slower to converge because directions are less precise than true gradients.

  • Coordinate Descent: Instead of updating all coefficients at once, coordinate descent updates one coefficient at a time, holding the others fixed. For objectives with separable penalties like L1, each 1D subproblem has a simple structure. The update often applies a thresholding rule that either shrinks the coefficient or sets it exactly to zero, naturally producing sparsity. This method is computationally efficient for high-dimensional problems and is widely used in practice for L1-regularized linear models.

Note: Smooth L2 penalties allow fast methods (like standard gradient descent with line search or second-order methods) because the gradient is well-behaved everywhere. This is why L2 training is typically simpler and faster.
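The one-dimensional subproblem in coordinate descent has a closed form: soft-thresholding. A minimal sketch, assuming gaussian synthetic data with roughly standardized columns (not a production implementation):

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n          # per-feature curvature
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]     # residual excluding feature j
            rho = X[:, j] @ r / n              # correlation with that residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
w_true = np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=100)
w_hat = lasso_coordinate_descent(X, y, lam=0.1)
print(np.round(w_hat, 2))   # most entries land exactly at zero
```

The thresholding step is where exact zeros come from: whenever a feature's correlation with the residual falls inside [−λ, λ], its weight is set to zero rather than merely shrunk.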

  5. Practical Behavior: L1 vs L2 vs Elastic Net
  • L1 (LASSO): Encourages sparsity by setting many weights to exactly zero. Great for feature selection, interpretability, and when only a few features truly matter. Can be unstable when features are highly correlated; slight data changes can swap which feature is chosen. Training requires non-smooth optimization methods.

  • L2 (Ridge): Shrinks all weights smoothly toward zero but rarely makes them exactly zero. Good when all features have some predictive power, and when you prefer stability. Easy and fast to optimize with smooth methods.

  • Elastic Net: Combines L1 and L2 penalties. This can select groups of correlated features rather than arbitrarily picking just one, improving stability. The mixing parameter α (between 0 and 1) controls how much L1 vs L2 you use, and λ controls overall strength.
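A short sketch of Elastic Net behavior on correlated features. Note the naming: scikit-learn's `alpha` is the overall strength (the lecture's λ) and `l1_ratio` is the mixing parameter (the lecture's α); the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=150)  # nearly duplicate features
y = X[:, 0] + 0.1 * rng.normal(size=150)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_[:2], 2))  # the correlated pair tends to share weight
```

Pure L1 would typically keep one of the near-duplicate pair and drop the other; the L2 component makes splitting the weight between them cheaper, which is the stability benefit described above.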

  6. Choosing λ (Regularization Strength)

λ controls the balance between fitting the training data and keeping the model simple. If λ is too small, the penalty is weak: the model may overfit. If λ is too large, too many weights go to zero (for L1) or become too small (for L2), causing underfitting. The standard approach is to pick λ by cross-validation:

  • Split data into training and validation folds.
  • Train models over a grid of λ values.
  • Choose the λ that gives the best validation performance.

In practice, you may use a logarithmic grid for λ (e.g., 10^-4, 10^-3, …, 10^2) to explore a wide range. Some workflows also consider the “one-standard-error rule” to pick a simpler model with performance statistically close to the best.
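That grid search might look like the following hedged sketch, using scikit-learn's LassoCV (which again calls the strength `alpha`); the grid bounds and synthetic data are assumptions:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 30))
w_true = np.zeros(30)
w_true[:4] = [2.0, -2.0, 1.0, -1.0]          # 4 relevant features out of 30
y = X @ w_true + 0.5 * rng.normal(size=200)

alphas = np.logspace(-4, 2, 25)              # log grid: 10^-4 ... 10^2
model = LassoCV(alphas=alphas, cv=5).fit(X, y)
print("chosen lambda:", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```

LassoCV fits the full path over the grid within each fold and picks the value with the best average validation error, which is exactly the procedure described above.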

  7. Interpreting Results and Stability Considerations

L1’s sparsity makes interpretation straightforward: non-zero coefficients highlight important features. However, when features are correlated or redundant, L1 may pick one and drop the rest arbitrarily. This can make interpretations shift when the data changes slightly. To mitigate this:

  • Consider Elastic Net to encourage group selection.
  • Check stability across resamples or folds to see if the same features keep getting selected.
  • Report sets of candidate features rather than a single definitive list when correlations are high.

  8. Advantages and Disadvantages Summarized

Advantages of L1 (LASSO):

  • Automatic feature selection (sparsity).
  • Clearer interpretability (few non-zero coefficients).
  • Potentially faster prediction at inference time (fewer active features).

Disadvantages of L1:

  • Non-differentiable penalty requires specialized, sometimes slower optimization.
  • Less stable under data perturbations, especially with correlated features.
  • May discard weak but jointly useful features if λ is too large.

Advantages of L2 (Ridge):

  • Smooth optimization with standard gradient methods.
  • Stable coefficients that change gradually with data.
  • Keeps all features, which can help when each carries small signal.

Disadvantages of L2:

  • No exact zeros, so no automatic feature selection.
  • Models can be harder to interpret with many small but non-zero coefficients.

  9. Implementation Guide (Framework-Agnostic)

Step 1: Prepare data.

  • Gather features X and target y. Handle missing values and ensure features are numeric.
  • Optionally standardize features (mean 0, variance 1) so the penalty treats them comparably.

Step 2: Choose the model and loss.

  • For regression, choose linear regression with mean squared error (MSE).
  • For classification (logistic regression), the loss is logistic (cross-entropy), and L1 can also be applied similarly.

Step 3: Define the L1 objective.

  • Objective(w) = Loss(y, Xw) + λ × Σ|w_i|.

Step 4: Select optimization method.

  • Coordinate descent: Efficient for L1; iterate over features, update each weight, repeat until convergence.
  • Subgradient descent: Use appropriate step sizes; can be slower.

Step 5: Tune λ via cross-validation.

  • Create a grid of λ values. Train models for each λ and evaluate on validation folds.
  • Pick the λ with best validation score (or the simplest within one standard error of the best).

Step 6: Fit final model.

  • Retrain on the full training set using the chosen λ.
  • Record which features are non-zero; these are your selected features.

Step 7: Evaluate on test data.

  • Measure performance (e.g., RMSE for regression, accuracy/AUC for classification).
  • Check calibration and residuals for regression to ensure no obvious patterns remain.

Step 8: Communicate results.

  • Report selected features and their coefficients.
  • Explain trade-offs, especially if sparsity caused some features to be dropped.
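The steps above can be strung together in one hedged sketch: a pipeline that standardizes inside each cross-validation fold, tunes λ, refits, and evaluates on held-out data. All data and names below are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 40))
w_true = np.zeros(40)
w_true[:5] = [2.0, -3.0, 1.5, -2.0, 1.0]     # 5 relevant features out of 40
y = X @ w_true + 0.5 * rng.normal(size=300)

# Step 1: hold out a test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-5: standardize and tune lambda with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipe.fit(X_tr, y_tr)

# Steps 6-7: record selected (non-zero) features, then evaluate held-out fit
coefs = pipe.named_steps["lassocv"].coef_
selected = np.flatnonzero(coefs)
print("selected features:", selected)
print("test R^2:", round(pipe.score(X_te, y_te), 3))
```

Putting the scaler inside the pipeline matters: it prevents test-fold statistics from leaking into the standardization during cross-validation.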

  10. Tips and Warnings
  • Standardize features: the L1 penalty treats all coefficients equally, but a feature on a larger scale needs a smaller coefficient for the same effect, so without standardization it is effectively penalized less and the selection is skewed.
  • Beware correlated features: L1 may pick one and drop others; consider Elastic Net when groups of features represent similar signals.
  • Don’t over-penalize: Very large λ can produce an empty or nearly empty model; always validate.
  • Check stability: Use bootstrap or repeated cross-validation to see if the same features keep appearing.
  • Start simple: Try pure L1 and pure L2 first to understand behaviors, then try Elastic Net for a balance.
  • Monitor convergence: For coordinate descent, ensure you iterate until changes in coefficients are tiny; for subgradient, use appropriate step-size schedules.
  • Interpret carefully: Zero coefficients mean exclusion; small non-zero values under L2 may still be meaningful signals.

  11. Real-World Use Cases
  • High-dimensional text (bag-of-words or n-grams): L1 selects a compact vocabulary of informative terms.
  • Genomics and proteomics: L1 narrows thousands of biological markers to a handful that predict outcomes.
  • Click-through rate prediction: L1 picks the most predictive user or context features among many.
  • Sensor fusion: When most sensors are noisy, L1 can isolate the few that carry strong signal.
  • Risk modeling: Sparse models aid compliance and interpretability in regulated industries.

  12. Summary of Core Mathematics
  • Objective: L(w) + λ||w||_1, with L convex (e.g., MSE or logistic loss) and ||w||_1 = Σ|w_i|.
  • Subgradient for |w_i|: sign(w_i) if w_i ≠ 0; any value in [−1, 1] if w_i = 0.
  • Optimality intuition: If the gradient of the loss with respect to w_i is small in magnitude, L1 can set w_i to zero; if large, it can overcome the penalty and keep w_i non-zero.
  • Geometry: L1 constraint is a cross-polytope (diamond in 2D); loss contours touch at corners → sparse solutions.
  • Trade-off: Larger λ → higher bias, lower variance; smaller λ → lower bias, higher variance.
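The optimality intuition above can be checked numerically: at a lasso solution, the loss gradient lies inside [−λ, λ] for every zero coefficient and equals −λ·sign(w_i) for every non-zero one. A sketch on synthetic data (fit_intercept=False so the fitted objective matches the formula exactly):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 6))
y = X @ np.array([1.5, 0, 0, -1.0, 0, 0]) + 0.1 * rng.normal(size=200)

lam = 0.1
model = Lasso(alpha=lam, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, y)
w = model.coef_
grad = -X.T @ (y - X @ w) / len(y)   # gradient of the (1/2n) squared loss

for j in range(len(w)):
    if w[j] == 0:
        assert abs(grad[j]) <= lam + 1e-6          # subgradient condition
    else:
        assert np.isclose(grad[j], -lam * np.sign(w[j]), atol=1e-4)
print("subgradient optimality conditions verified")
```

This is the subgradient story made concrete: zero coefficients are those whose loss gradient the penalty can fully absorb.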

These technical details equip you to understand why L1 behaves differently from L2, how to optimize with L1’s non-smooth penalty, and how to apply, tune, and interpret models that use L1 in practice.

04 Examples

  • 💡

    High-Dimensional Text Features: Input is a bag-of-words representation with 50,000 word features predicting sentiment. Processing with L1 pushes most word coefficients to zero, keeping only the few terms that truly indicate positive or negative sentiment. Output is a sparse model with maybe a few hundred active words. Key point: L1 performs automatic feature selection in very wide (many-feature) spaces.

  • 💡

    Polynomial Regression with Many Terms: Input uses polynomial expansions up to degree 5 on 100 base features, creating thousands of terms. L1 regularization drives many high-degree terms to zero while keeping a small subset that improves fit. Output is a simple polynomial model focusing on the most impactful interactions. Key point: L1 tames feature explosion by pruning.

  • 💡

    Sensor Selection: Input is data from 200 sensors predicting equipment failure. L1 retains only the signals from a handful of sensors that consistently change before failures. Output is a sparse coefficient vector highlighting those sensors. Key point: Interpretability improves because you see which sensors matter.

  • 💡

    Medical Risk Prediction: Input includes lab tests, age, and vitals for thousands of patients. L1 drops many lab test features that don’t add predictive value while keeping a short list of strong predictors. Output is a model a clinician can understand and trust. Key point: Sparse models aid decision-making in high-stakes settings.

  • 💡

    Correlated Features and Stability: Input has two highly correlated features (e.g., word synonyms). L1 may select one and zero out the other, but which one is picked can change with slight data shifts. Output shows either feature A or feature B selected across different folds. Key point: L1 can be unstable under correlation; consider Elastic Net.

  • 💡

    Small vs Large λ: Input is a dataset with some noise. With very small λ, the model overfits and uses many features; with very large λ, most weights go to zero and accuracy drops. Output shows a U-shaped validation error curve over λ. Key point: Cross-validation helps find the λ that balances fit and simplicity.
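One way to see this trade-off is a quick sweep over a log-spaced λ grid with a held-out split. The coordinate-descent helper below is a generic sketch with made-up data, not code from the lecture:

```python
import numpy as np

def soft(z, lam):
    # soft-thresholding: the 1-D lasso solution
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=300):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's current contribution removed
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)  # noisy labels

X_tr, X_va, y_tr, y_va = X[:70], X[70:], y[:70], y[70:]
lams = np.logspace(-3, 1, 9)  # log-spaced grid of penalty strengths
errs = [np.mean((y_va - X_va @ lasso_cd(X_tr, y_tr, lam)) ** 2) for lam in lams]
for lam, e in zip(lams, errs):
    print(f"lambda={lam:8.3f}  val MSE={e:.3f}")
```

At the largest λ every coefficient is zeroed and validation error is poor; at intermediate λ the error drops, tracing out the curve the example describes.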

  • 💡

    Diamond vs Circle Geometry: Consider two coefficients w1 and w2. The L1 constraint |w1| + |w2| ≤ c forms a diamond, and the optimal point often lies on an axis, producing a zero coefficient. The L2 constraint w1^2 + w2^2 ≤ c forms a circle, and the optimal point rarely lies exactly on an axis. Key point: Geometry explains sparsity vs shrinkage.
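The corner-seeking behavior can be checked numerically: project the same unconstrained optimum onto an L1 ball (the diamond) and an L2 ball (the circle) and compare. The projection helper below is a standard sorting-based routine, included as an illustrative sketch with an arbitrary radius:

```python
import numpy as np

def project_l1_ball(v, c):
    """Euclidean projection of v onto the diamond {w : ||w||_1 <= c}."""
    if np.abs(v).sum() <= c:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # magnitudes, descending
    cumsum = np.cumsum(u)
    rho = np.nonzero(u - (cumsum - c) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (cumsum[rho] - c) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

v = np.array([1.0, 0.3])               # unconstrained optimum of a round loss
l1_pt = project_l1_ball(v, 0.5)        # nearest point in the diamond
l2_pt = v * (0.5 / np.linalg.norm(v))  # nearest point in the circle of radius 0.5

print(l1_pt)  # [0.5 0. ] -> lands on a corner: w2 is exactly zero
print(l2_pt)  # both coordinates stay non-zero
```

The diamond projection lands on the axis (w2 = 0 exactly), while the circle projection merely scales both coordinates down, matching the geometric picture.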

  • 💡

    Manual Feature Removal vs L1: Input includes 1,000 engineered features, and you suspect some are useless. Manually removing features risks dropping helpful ones or keeping harmful ones. L1 automatically searches and zeros out many that don’t help, while keeping small but useful effects if λ is tuned. Key point: Automated selection beats guesswork.

  • 💡

    Model Interpretability: Input is a marketing dataset with demographics, behavior metrics, and campaign info. L1 yields a compact model with 10-20 non-zero features, making it clear which factors drive response. Output supports business storytelling and action. Key point: Sparsity improves communication and trust.

  • 💡

    Training Algorithms: Input is the same dataset trained with L1 using subgradient descent versus coordinate descent. Subgradient descent converges, but more slowly, because of the penalty's non-smoothness; coordinate descent converges faster by solving a 1D problem per feature. Output shows similar sparse solutions with different training times. Key point: Choose the right optimizer for L1.
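A rough sketch of the coordinate-descent route described here, assuming a standardized design matrix (function names and toy data are ours): each sweep solves a closed-form 1D lasso per feature via soft-thresholding.

```python
import numpy as np

def soft(z, lam):
    # soft-thresholding: the exact solution of the 1-D lasso subproblem
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove feature j's current contribution
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft(rho, lam) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5))
X = (X - X.mean(0)) / X.std(0)        # standardize columns
y = 3 * X[:, 0] - 2 * X[:, 2]         # only features 0 and 2 matter

w = lasso_cd(X, y, lam=0.5)
print(np.round(w, 2))  # features 1, 3, 4 should be at or near exactly zero
```

Each inner step is cheap and exact, which is why coordinate descent is the workhorse for lasso-type problems; a subgradient loop would reach a similar sparse solution more slowly.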

  • 💡

    Feature Scaling Impact: Input has features on very different scales (e.g., dollars vs counts). Without scaling, L1 penalizes coefficients unevenly, distorting selection. After standardization, L1 fairly compares features and selects appropriately. Key point: Scale features to get meaningful L1 solutions.
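A tiny numeric sketch of why scale distorts the penalty (all numbers made up for illustration): rescaling a feature rescales its coefficient inversely, so the L1 cost that feature pays changes with its units.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1000)      # a feature on a "natural" scale
dollars = 1000.0 * z               # the same feature recorded in larger units

# To make identical predictions, the coefficient must shrink by 1000x,
# so the L1 penalty this feature pays shrinks by 1000x as well:
lam = 0.5
coef_small_scale = 1.0
coef_large_scale = 1.0 / 1000.0
print(lam * abs(coef_small_scale))   # 0.5    -> heavily penalized
print(lam * abs(coef_large_scale))   # 0.0005 -> effectively unpenalized

# Standardizing removes the discrepancy: both versions become the same column.
standardized = (dollars - dollars.mean()) / dollars.std()
print(np.allclose(standardized, (z - z.mean()) / z.std()))  # True
```

After standardization the two encodings are literally the same column, so L1 judges them identically; without it, the large-scale version quietly escapes the penalty.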

  • 💡

    Comparing L1 and L2 Outcomes: Input is a dataset where all features carry some small signal. L1 zeros many and keeps a few, possibly losing some small signals; L2 keeps all but shrinks them. Output shows L2 with better stability and slightly better test error in this scenario. Key point: Prefer L2 when all features matter a bit.

  • 💡

    Elastic Net Rescue: Input has many correlated feature groups (e.g., trigrams in text). Pure L1 picks one from each group inconsistently; Elastic Net keeps small groups together. Output is a more stable model with slightly denser coefficients but better generalization. Key point: Elastic Net balances sparsity and robustness.

  • 💡

    Cross-Validation for λ: Input is split into k folds. For each λ in a grid, models are trained on k−1 folds and validated on the remaining fold, repeating across folds. Output is average validation error per λ; pick the best (or simplest within one standard error). Key point: Tuning λ is essential for performance.
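The fold-and-grid loop can be sketched compactly using a one-feature lasso, whose solution has a closed form; the data and helper names are illustrative, not from the lecture:

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def fit_1d_lasso(x, y, lam):
    # closed-form 1-D lasso: soft-threshold the least-squares solution
    return soft(x @ y / len(x), lam) / (x @ x / len(x))

rng = np.random.default_rng(3)
x = rng.standard_normal(90)
y = 2.0 * x + rng.standard_normal(90)

k = 5
folds = np.array_split(np.arange(90), k)   # k index folds
lams = np.logspace(-2, 1, 7)               # log-spaced lambda grid

cv_err = []
for lam in lams:
    fold_errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_1d_lasso(x[trn], y[trn], lam)      # train on k-1 folds
        fold_errs.append(np.mean((y[val] - w * x[val]) ** 2))
    cv_err.append(np.mean(fold_errs))              # average over folds

best = lams[int(np.argmin(cv_err))]
print(f"best lambda: {best:.3g}")
```

The same loop works unchanged for multi-feature models; only the fit function changes. In practice you would refit on all the training data at the chosen λ before touching the test set.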

  • 💡

    Edge Case—Too Strong L1: Input is a modest dataset with moderate noise. With very high λ, the model collapses to predicting the mean, as nearly all coefficients go to zero. Output is poor test performance due to underfitting. Key point: Avoid over-regularizing; always check validation metrics.

05 Conclusion

This lecture focused on L1 regularization (LASSO) as a tool to prevent overfitting by adding a penalty equal to the sum of absolute values of model weights. We contrasted L1 with L2 (Ridge), showing how L1 tends to set many coefficients exactly to zero, achieving feature selection, while L2 shrinks weights smoothly without making them zero. The geometric intuition clarified this: the L1 constraint is a diamond with sharp corners that often force solutions onto axes, whereas the L2 constraint is a smooth circle that rarely produces exact zeros. We covered the role of the hyperparameter λ, which balances fitting the data against keeping the model simple; selecting λ via cross-validation is essential. We also addressed the computational challenge: L1’s non-differentiability at zero requires methods like subgradient descent or coordinate descent, which can be slower than methods for L2.

Practically, use L1 when you expect many irrelevant features and want interpretability through sparsity. Use L2 when most features are somewhat useful and you want stable, smooth shrinkage. When features are correlated or you want both sparsity and stability, Elastic Net blends the two approaches. Keep in mind the potential instability of L1 when data changes slightly and consider remedies like Elastic Net and stability checks.

To solidify learning, try fitting simple linear models with and without L1 on a dataset with many features, and compare which features become zero as you vary λ. Plot validation error across a λ grid to observe the U-shaped curve and pick a good λ. Experiment with coordinate descent versus subgradient descent to see differences in convergence speed. For next steps, explore Elastic Net in deeper detail, learn more about optimization methods for non-smooth problems, and practice with real high-dimensional datasets such as text or genomics. The core message to remember: regularization is about wise restraint. L1’s special power is to trim models down to their most important pieces, delivering better generalization and clearer stories about which features matter.

Key Takeaways

  • Use L1 when you need feature selection: It sets many coefficients to zero, trimming your model to essentials. Start with a wide feature set, standardize features, and tune λ by cross-validation. Watch the validation curve to avoid over- or under-regularizing. Report the final set of non-zero features for interpretability.
  • Prefer L2 when all features matter somewhat: It keeps all coefficients but shrinks them. This is more stable when data changes slightly or features are correlated. Tune λ to balance bias and variance. Expect smoother training with standard gradient methods.
  • Try Elastic Net when features are correlated: Pure L1 may pick one feature and drop its twins, causing instability. Elastic Net blends L1 and L2 to keep small groups and improve robustness. Tune both α and λ to get a good mix of sparsity and stability. Validate thoroughly to confirm gains.
  • Choose λ with cross-validation: Use a logarithmic grid over several orders of magnitude. Plot validation error versus λ to find the minimum or the simplest model within one standard error of the best. Refit on the full training set with the chosen λ. Finally, test on a held-out set.
  • Standardize features before L1: Different scales can distort which coefficients are penalized more. Scaling ensures fair comparison across features. This leads to more reliable selection. Always fit the scaler on the training set and apply that same transform to validation/test.
  • Understand the geometric intuition: L1’s diamond-shaped constraint favors solutions on axes, causing zeros; L2’s circle does not. This picture explains observed behavior in practice. Use it to justify modeling choices to stakeholders. It also helps debug unexpected sparsity or lack of it.
  • Expect computational differences: L1 is non-differentiable at zero, so avoid plain gradient descent. Use coordinate descent or subgradient methods suited to L1. Monitor convergence and consider warm starts across λ values. Anticipate longer training than L2 for large problems.
  • Check model stability: For L1, repeat training with different data splits or bootstraps. See if the same features remain non-zero. If not, consider Elastic Net or grouping correlated features. Communicate uncertainty in selected features.
  • Beware of over-regularization: Very large λ can wipe out useful signals and cause underfitting. Always inspect training and validation error trends. Aim for the simplest model that still performs well on validation. Don’t equate more sparsity with better accuracy.
  • Interpret sparse models carefully: Non-zero coefficients indicate features that matter under the current data and λ. Small changes in data can swap which correlated features are chosen. Consider domain knowledge when presenting selected features. Validate that selections make practical sense.
  • Compare L1 and L2 on your data: No single method wins everywhere. Train both (and Elastic Net) and compare on held-out data. Choose based on metrics and practical needs like interpretability. Document your choice and the trade-offs.
  • Use L1 beyond linear regression: L1 can regularize logistic regression and other convex models. The same intuition—sparsity and feature selection—applies. Optimization still uses subgradient or coordinate-like methods. The payoff remains better generalization and simpler models.
  • Start with a broad feature set, then prune with L1: Don’t prematurely discard features; let L1 test them. This can uncover small but consistent signals. After selection, you can retrain a simpler model on the chosen features. This can improve speed and maintainability.
  • Communicate trade-offs clearly: Explain λ as the dial between fit and simplicity. Describe L1 as turning some features off and L2 as turning all features down. Use the diamond vs circle picture to make it visual. This builds trust with non-technical stakeholders.
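The "use L1 beyond linear regression" takeaway can be sketched with proximal gradient descent (ISTA) on logistic loss; this is a generic recipe with illustrative data, not something specific from the lecture:

```python
import numpy as np

def soft(z, lam):
    # elementwise soft-thresholding (proximal operator of lam * ||w||_1)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Proximal gradient (ISTA) for logistic loss + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (p_hat - y) / n           # gradient of the smooth part
        w = soft(w - step * grad, step * lam)  # gradient step, then prox
    return w

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 6))
logits = 4 * X[:, 0] - 4 * X[:, 1]             # only two informative features
y = (logits + rng.standard_normal(200) > 0).astype(float)

w = l1_logistic(X, y, lam=0.05)
print(np.round(w, 2))  # weight should concentrate on the first two features
```

The smooth gradient step is ordinary logistic regression; the prox step is the same soft-thresholding as in the linear case, which is why the sparsity intuition carries over to any convex loss.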

Glossary

L1 regularization

A method that adds the sum of absolute values of model weights to the loss to discourage complex models. It tends to push many weights to exactly zero. This creates simple, easy-to-understand models. It helps stop the model from memorizing noise.

LASSO

Another name for L1 regularization, short for Least Absolute Shrinkage and Selection Operator. It both shrinks and selects by setting some weights to zero. This gives sparse models. It’s useful when many features are not helpful.

L2 regularization

A method that adds the sum of squared weights to the loss. It shrinks all weights toward zero but usually doesn’t make them exactly zero. It’s smoother and easier to optimize. It helps stabilize models.

Ridge regression

The common name for L2 regularization applied to linear regression. It adds a squared-weight penalty to the mean squared error loss. This prevents large coefficients. It’s helpful when all features matter a little.

Regularization

Any technique that limits model complexity to reduce overfitting. It adds a penalty to the loss to discourage overly flexible models. This improves performance on new data. It balances fit and simplicity.

Overfitting

When a model fits training data very well but fails on new data. It often happens when the model is too complex. The model learns noise instead of true patterns. Regularization helps reduce this.

Feature selection

Choosing a subset of input features that truly help the prediction. It can be done manually or automatically. L1 does this automatically by setting some weights to zero. This leads to simpler models.

Sparsity

When most entries in a vector are zero. In models, it means most weights are zero. Sparse models are simpler and easier to interpret. L1 drives sparsity naturally.
