Stanford CS230 | Autumn 2025 | Lecture 10: What’s Going On Inside My Model?
Beginner Key Summary
- •This lecture teaches linear regression as a simple but powerful way to predict a number (output) from input features. The model assumes a straight-line relationship between inputs and output, written as y = W^T x. Each weight in W shows how much a feature pushes the prediction up or down.
- •Supervised learning needs three things: a model family, an objective function, and an optimization method. Here, the model is linear, the objective is to make prediction errors small, and the optimization uses calculus to find the best weights. These three choices define how we learn from data.
- •The objective function used is the Residual Sum of Squares (RSS). RSS adds up the squared differences between each prediction and the true answer. Minimizing RSS finds the line (or hyperplane) that sits closest to all points in the vertical direction.
- •We represent the dataset with a matrix X (n rows for data points, d columns for features) and a vector Y (n outputs). The parameter vector W has d numbers, one per feature. In compact form, the RSS is ||Y − XW||^2, where ||·|| is the L2 (Euclidean) norm.
- •To minimize RSS, we take the derivative with respect to W and set it to zero. This gives the normal equation: W = (X^T X)^{-1} X^T Y. It directly computes the best weights in one step, when the inverse exists.
- •The solution requires X^T X to be invertible (full rank). If the columns of X are linearly dependent, X^T X cannot be inverted, so there isn’t a unique solution. A common case is when there are more features (d) than data points (n).
- •Overfitting happens when the model fits noise instead of the true pattern, especially with many features. It leads to very small training error but poor performance on new data. Linear regression can overfit when X^T X is close to singular or when d is large compared to n.
- •Mean Squared Error (MSE) is used to evaluate how well the model predicts. MSE is just RSS divided by the number of data points. Lower MSE means better average prediction accuracy.
- •Regularization helps prevent overfitting by discouraging large weights. We add a penalty to the objective so the algorithm prefers smaller W values. This keeps the model simpler and more stable.
- •L2 regularization (ridge regression) adds λ||W||^2 to the RSS. The hyperparameter λ controls how strongly we punish large weights. Bigger λ shrinks weights more, trading a bit of bias for lower variance.
- •L1 regularization (lasso regression) adds λ||W||_1, the sum of absolute values of weights. L1 often makes some weights exactly zero, creating a sparse model. This doubles as feature selection because unused features get zero weight.
- •Geometrically, regularization changes the shape of constraints around the minimum. L2 looks like circles (or spheres), nudging weights toward the origin smoothly. L1 looks like diamonds with sharp corners, which are likely to land exactly on axes, producing zeros.
- •When d > n, the columns of X are dependent, so there are many W that produce the same predictions. Regularization makes learning well-posed by preferring the smallest or sparsest weights among many solutions. This yields a unique, stable solution.
- •Weights are easy to interpret: positive weights increase predictions when the feature grows; negative weights decrease them. Large magnitude means the feature has a big influence. This interpretability is a key advantage of linear regression.
- •Optimization here uses direct calculus and linear algebra, so we get a closed-form answer. In contrast to iterative methods, this solution is fast for small to medium d. But when d is huge or X^T X is ill-conditioned, regularization is safer.
- •Choosing λ is crucial: too small offers little protection; too large oversmooths and underfits. In practice, we pick λ by checking validation performance. The right λ balances fit and simplicity.
- •RSS vs MSE: RSS is the total squared error, MSE is the average squared error. They give the same minimizer W because dividing by n doesn’t change where the minimum is. MSE is preferred for comparing models across datasets.
- •L1’s sparsity comes from its “kink” at zero and diamond-shaped level sets. The best point under the diamond constraint often sits at a corner where some weights are exactly zero. This is why lasso selects features naturally.
Why This Lecture Matters
Linear regression is one of the most important foundations in machine learning and data science. For analysts, engineers, and researchers, it provides a fast, interpretable way to understand how inputs relate to a numeric outcome. The model’s weights tell clear stories—what pushes predictions up or down and by how much—so business stakeholders and scientists can trust and act on results. Because the objective (RSS) and solution (normal equation) are explicit, you can diagnose problems like non-invertibility and fix them with regularization. In real projects, you often face tabular data with many features and limited examples. Linear regression, paired with MSE for evaluation, gives a strong baseline that is quick to build and sets a performance reference for more complex models. Regularization makes it practical in high-dimensional or noisy settings, stabilizing estimates and improving generalization. L1 regularization’s sparsity doubles as automatic feature selection, which reduces complexity and focuses attention on the most informative inputs. Mastering this lecture’s content helps your career by strengthening your mathematical intuition and practical skills. You’ll know when a closed-form solution applies, how to interpret coefficients, and how to prevent overfitting by tuning λ. These skills transfer directly to many advanced topics—logistic regression, generalized linear models, and even modern deep learning concepts like weight decay (an L2-like penalty). In today’s industry, where explainability, speed, and reliability matter, linear regression remains a go-to tool that solves real problems effectively.
Lecture Summary
01 Overview
This lecture teaches the foundations of linear regression, a core model in supervised learning that predicts a numeric output from input features by assuming a straight-line (linear) relationship. You will learn the full pipeline: how to define the model, how to choose an objective function that measures error, and how to solve for the best weights mathematically. The lecture uses clean matrix notation to handle many data points and features at once, and it shows how calculus and linear algebra combine to produce a closed-form solution known as the normal equation. On top of this, you will see how to evaluate model quality using mean squared error and how to control overfitting with regularization (both L2/ridge and L1/lasso), including why L1 tends to create sparse, feature-selecting solutions.
The lecture starts by setting up the notation: X is the data matrix with n rows (one per data point) and d columns (one per feature), Y is the vector of outputs, and W is the vector of model parameters (one weight per feature). The linear model predicts each output as a weighted sum of features, written compactly as y = W^T x for a single example, or Y ≈ XW across the entire dataset. The objective is to find W that makes predictions close to true outputs. This closeness is measured by the Residual Sum of Squares (RSS), which adds up squared differences between predictions and actual values. Minimizing RSS gives the best-fitting line or hyperplane in the least-squares sense.
To minimize RSS, the lecture uses calculus: take the derivative of RSS with respect to W, set it to zero, and solve. Expanding the squared norm ||Y − XW||^2 and applying matrix calculus rules leads to the normal equation: W = (X^T X)^{-1} X^T Y. This formula gives the unique minimizer when X^T X is invertible (full rank). The lecture explains why invertibility can fail when there are more features than data points (d > n): then the columns of X are linearly dependent, so X^T X is not full rank and cannot be inverted, leading to no unique solution without additional constraints.
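As a rough illustration of this derivation, the following NumPy sketch (with randomly generated data and hypothetical sizes n = 100, d = 3) solves X^T X W = X^T Y; a linear solver is used rather than forming the inverse explicitly, which gives the same W when X^T X is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                               # hypothetical sizes: 100 points, 3 features
X = rng.normal(size=(n, d))                 # data matrix, n x d
true_W = np.array([1.5, -2.0, 0.5])         # made-up "true" weights
Y = X @ true_W + 0.1 * rng.normal(size=n)   # outputs with a little noise

# Normal equation: X^T X W = X^T Y.  Solve the linear system rather than
# inverting X^T X explicitly (same result when X^T X is invertible).
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)   # should come out close to true_W
```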
The lecture then turns to how we judge model performance using Mean Squared Error (MSE), which is simply RSS divided by n, the number of data points. MSE reports average squared error and is widely used because it is easy to understand and compare. Although multiplying by 1/n doesn’t change which W minimizes the objective, it does make the error scale interpretable and consistent across datasets of different sizes.
Next comes a central practical concern: overfitting. Especially with many features, linear regression can fit not only the true pattern but also random noise in the training data. The lecture presents regularization as the main defense: we add a penalty to discourage large weights, which smooths the solution and reduces variance. Two common choices are L2 (ridge), which penalizes the sum of squared weights, and L1 (lasso), which penalizes the sum of absolute values. A key insight is that L1 often yields sparse solutions by setting some weights exactly to zero, effectively selecting features. The geometric view helps: L2’s penalty contour is round and smooth, shrinking weights evenly, while L1’s diamond shape has sharp corners that tend to catch the solution at an axis, forcing some weights to be zero.
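To make the overfitting discussion concrete, here is a minimal sketch under assumed conditions (synthetic data, d chosen close to n, an arbitrary λ = 10 for ridge) showing low training MSE but much higher test MSE without regularization, and the improvement a moderate ridge penalty can bring.

```python
import numpy as np

rng = np.random.default_rng(6)
n_train, n_test, d = 30, 200, 25            # hypothetical: d is close to n_train
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
true_W = np.zeros(d)
true_W[:3] = [2.0, -1.0, 0.5]               # only a few features actually matter
Y_tr = X_tr @ true_W + rng.normal(size=n_train)
Y_te = X_te @ true_W + rng.normal(size=n_test)

def fit(X, Y, lam):
    # Ridge solution; lam = 0 recovers the plain normal equation.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def mse(X, Y, W):
    r = Y - X @ W
    return (r @ r) / len(Y)

for lam in [0.0, 10.0]:
    W = fit(X_tr, Y_tr, lam)
    print(f"lambda={lam}: train MSE={mse(X_tr, Y_tr, W):.3f}, test MSE={mse(X_te, Y_te, W):.3f}")
# Expect: with lambda=0, training MSE sits far below test MSE (overfitting);
# a moderate lambda trades a little training error for a much lower test MSE.
```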
By the end of the lecture, you will understand: how to pose linear regression as minimizing RSS, how to derive and interpret the normal equation, why invertibility and rank matter, when overfitting appears, how regularization counters it, and how to evaluate with MSE. You will also be able to interpret the meaning of weights (positive increases prediction, negative decreases) and appreciate the trade-off controlled by the regularization strength λ. The material is foundational and prepares you for the next steps like logistic regression and more advanced models.
This lecture is aimed at beginners to early intermediates in machine learning who have basic familiarity with linear algebra (vectors, matrices, transpose, inverse) and calculus (derivative concepts). You should be comfortable with ideas like features and labels, and with reading equations involving matrices. After studying this material, you will be able to build and reason about linear regression models end-to-end, diagnose when the normal equation applies, decide when to add regularization, and use MSE to evaluate your model’s performance on data. The structure of the lecture flows from problem setup and notation, to model and objective, to optimization and solution, to performance evaluation, and finally to extensions via regularization, closing with intuitive geometric reasons for lasso’s sparsity.
02 Key Concepts
- 01
Supervised Learning’s Three Ingredients: 🎯 A recipe to learn from labeled data by defining a model, an objective, and an optimization method. 🏠 It’s like choosing a kind of house (model), how you judge if it feels right (objective), and how you pick and buy it (optimization). 🔧 We choose a function family (here, linear), measure fit with an error function (RSS/MSE), and find the best weights by calculus. 💡 Without all three, learning is vague—you wouldn’t know what to search, how to score it, or how to actually find it. 📝 Example: Predict house price from size and age; pick linear regression, minimize squared errors, solve for the best weights using the normal equation.
- 02
Data Matrix, Outputs, and Weights (X, Y, W): 🎯 Notation that organizes inputs, outputs, and parameters. 🏠 Think of X as a spreadsheet with rows as people and columns as traits; Y is the column of answers; W is the dial setting for each trait. 🔧 X is n×d (n data points, d features), Y is n×1, and W is d×1; predictions are XW. 💡 Clear shapes prevent math mistakes and guide which operations are valid. 📝 Example: For 100 houses (n=100) with 3 features (d=3), X is 100×3, Y is 100×1, W is 3×1, and the predicted prices are the 100 entries of XW.
- 03
Linear Regression Model (y = W^T x): 🎯 A model that predicts outputs as a weighted sum of inputs. 🏠 It’s like mixing colors with sliders; each slider (feature) adds or subtracts to make the final shade (prediction). 🔧 For each example x, the prediction is ŷ = W^T x; across all data, Ŷ = XW. 💡 Without a clear functional form, you can’t compute or optimize predictions. 📝 Example: If W=[2, −1] and x=[3, 4], then ŷ = 2·3 + (−1)·4 = 2.
- 04
Interpreting Weights: 🎯 Each weight shows how a feature pushes the prediction. 🏠 Like a thermostat: turning it up or down changes room temperature; positive weights warm, negative cool. 🔧 If wj > 0, increasing feature j raises ŷ; if wj < 0, it lowers ŷ; |wj| measures strength. 💡 Interpretation helps explain the model and detect odd or useless features. 📝 Example: In a salary model, a positive weight on years of experience increases predicted salary; a negative weight on long commute time decreases it.
- 05
Residual Sum of Squares (RSS): 🎯 The total squared error between predictions and truths. 🏠 It’s like measuring how far a set of darts missed the bullseye, squaring misses so big mistakes count a lot. 🔧 RSS = ∑(ŷi − yi)^2 = ||Y − XW||^2, using the L2 norm (Euclidean length). 💡 Without an objective like RSS, there’s no target to minimize. 📝 Example: For errors [1, −2, 3], RSS = 1^2 + (−2)^2 + 3^2 = 14.
- 06
L2 Norm: 🎯 A way to measure the length of a vector. 🏠 Like using a ruler to measure the straight-line distance from the origin. 🔧 For v, ||v||_2 = sqrt(∑ v_j^2); the squared L2 norm is ∑ v_j^2. 💡 It lets us write RSS compactly and reason with geometry. 📝 Example: If v=[3, 4], ||v||_2 = 5 and ||v||_2^2 = 25.
- 07
Geometric Picture of Least Squares: 🎯 Fit the line that keeps vertical residuals as small as possible. 🏠 Imagine a tight clothesline that passes through a cloud of hanging shirts, minimizing droops. 🔧 We minimize squared vertical distances from points to the line/hyperplane defined by W. 💡 This view builds intuition about what the model is trying to do. 📝 Example: In 2D, the best-fit line balances points above and below so that squared vertical gaps are minimal.
- 08
Optimization via Calculus: 🎯 Use derivatives to find the minimum of RSS. 🏠 Like rolling a ball downhill until it settles in the lowest spot. 🔧 Compute ∂/∂W ||Y − XW||^2, set it to zero, and solve for W. 💡 Without taking derivatives, you can’t find the exact minimizer efficiently. 📝 Example: The derivative gives −2X^T Y + 2X^T X W; setting it to zero yields the normal equation.
- 09
Normal Equation: 🎯 A closed-form formula for the best weights. 🏠 Like a direct recipe: mix exact amounts to bake the perfect cake once, no trial and error. 🔧 W = (X^T X)^{-1} X^T Y, assuming X^T X is invertible. 💡 It avoids iterative tuning and gives the exact least-squares solution. 📝 Example: With small d, compute X^T X, invert it, multiply by X^T Y to get W in one shot.
- 10
Invertibility and Full Rank: 🎯 A condition ensuring the solution is unique. 🏠 Think of independent tools in a toolbox; if some are copies, you don’t have everything you need. 🔧 X^T X is invertible only if X’s columns are linearly independent (full column rank). 💡 Without invertibility, there are many W that fit equally well, so there’s no single answer. 📝 Example: If one feature equals the sum of two others, columns are dependent, and X^T X can’t be inverted.
- 11
When d > n (More Features Than Data Points): 🎯 A common case where uniqueness fails. 🏠 Too many knobs for too few examples means different knob settings can yield the same outcome. 🔧 With d > n, columns of X are dependent, making X^T X rank-deficient and non-invertible. 💡 You cannot get a unique W without extra rules like regularization. 📝 Example: With 5 features but only 3 data points, multiple W can produce identical predictions.
- 12
Mean Squared Error (MSE): 🎯 The average squared difference between predictions and truths. 🏠 It’s like averaging how far your daily step goal was missed each day, squaring to stress big misses. 🔧 MSE = (1/n) ∑(ŷi − yi)^2 = RSS/n and doesn’t change the minimizer compared to RSS. 💡 It standardizes error, letting you compare across datasets of different sizes. 📝 Example: RSS 200 over 100 samples gives MSE 2.0.
- 13
Overfitting: 🎯 Fitting noise instead of the true signal. 🏠 Like memorizing answers to last year’s test and failing on new questions. 🔧 Happens when the model is too flexible or when features are many or redundant; training error is low but test error is high. 💡 Without guarding against it, your model won’t generalize to new data. 📝 Example: A model with 100 features and 50 samples might perfectly fit training data but predict poorly on new cases.
- 14
Regularization (General Idea): 🎯 Add a penalty to keep weights small and prevent overfitting. 🏠 Like adding a leash to keep a dog from running too far. 🔧 Modify the objective to RSS + λ·Penalty(W), where λ controls how tight the leash is. 💡 It stabilizes learning, picks simpler models, and solves non-uniqueness. 📝 Example: With noisy data, regularization reduces wild swings in weights and improves test MSE.
- 15
L2 Regularization (Ridge): 🎯 Penalize the sum of squared weights. 🏠 Like a soft bungee cord pulling all weights gently toward zero. 🔧 Objective: RSS + λ||W||_2^2, shrinking all weights smoothly; larger λ means stronger shrinkage. 💡 It reduces variance and handles near-dependencies in features. 📝 Example: Doubling λ typically shrinks all weights but rarely to exactly zero.
- 16
L1 Regularization (Lasso): 🎯 Penalize the sum of absolute weight values. 🏠 Like a diamond-shaped fence with sharp corners that catch and hold some weights at zero. 🔧 Objective: RSS + λ||W||_1; the kink at zero encourages exact zeros in W (sparsity). 💡 It performs feature selection by turning off unhelpful features. 📝 Example: With many weak features, lasso may keep only a handful with nonzero weights.
- 17
Geometric View of Regularization: 🎯 Understand how penalties shape the solution. 🏠 L2 is a circle/sphere; L1 is a diamond with corners. 🔧 Minimization under these shapes often hits smooth points for L2 and corners (zeros) for L1. 💡 Geometry explains why ridge shrinks broadly and lasso sparsifies. 📝 Example: In 2D weight space, the lasso solution frequently lands exactly on the x- or y-axis.
- 18
Lambda (Regularization Strength): 🎯 A knob that balances fit and simplicity. 🏠 Like adjusting noise-canceling: too low doesn’t help; too high blocks useful sound. 🔧 Larger λ increases penalty weight, shrinking (L2) or zeroing (L1) coefficients more. 💡 Picking λ well controls overfitting and underfitting. 📝 Example: Try several λ values and choose the one with lowest validation MSE.
- 19
Contours and Constraints Intuition: 🎯 Visualize optimization as finding where error contours meet penalty boundaries. 🏠 Like sliding a bowl (error surface) against a fence (penalty shape) until they touch. 🔧 Without regularization, minimum is wherever RSS is lowest; with it, the solution is the lowest error point within small-weight regions. 💡 This picture clarifies why small weights are favored. 📝 Example: L1’s fence corners promote intersections at axes (zeros).
- 20
Closed-Form vs. Algorithm Choice: 🎯 Here we solve directly instead of iterating. 🏠 It’s like using a formula to jump to the answer rather than step-by-step guessing. 🔧 The normal equation gives W exactly when X^T X is invertible; otherwise, add regularization. 💡 Direct solutions are exact and fast for small d but can be unstable without regularization. 📝 Example: For small tabular datasets, compute (X^T X)^{-1} X^T Y; for many features, add λ.
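As a sanity check on the calculus in the concepts above, this sketch (small random data, all values hypothetical) compares the analytic gradient −2 X^T Y + 2 X^T X W against a finite-difference approximation, and confirms the gradient vanishes at the normal-equation solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
W = rng.normal(size=d)             # arbitrary point at which to check the gradient

def rss(w):
    r = Y - X @ w
    return r @ r                   # ||Y - XW||^2

# Analytic gradient from the lecture: -2 X^T Y + 2 X^T X W
grad = -2 * X.T @ Y + 2 * X.T @ X @ W

# Central finite-difference approximation of the same gradient.
eps = 1e-6
num_grad = np.array([
    (rss(W + eps * np.eye(d)[j]) - rss(W - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])
print(np.max(np.abs(grad - num_grad)))   # should be tiny

# At the normal-equation solution the gradient is (numerically) zero.
W_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.max(np.abs(-2 * X.T @ Y + 2 * X.T @ X @ W_star)))
```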
03 Technical Details
- Overall Architecture/Structure
-
Data and Notation: We organize the dataset into a matrix X ∈ R^{n×d}, where n is the number of data points and d is the number of features. Each row X_i: is a 1×d feature vector for the i-th data point, and each column corresponds to one feature across all points. Outputs are collected in a vector Y ∈ R^{n×1}, and parameters (weights) are in W ∈ R^{d×1}. The model predicts Ŷ = XW.
-
Model Family: Linear regression assumes the output is a linear function of inputs. For a single data point x ∈ R^d, the prediction is ŷ = W^T x. This is a weighted sum: each entry w_j tells how strongly feature j influences the prediction; sign indicates direction, magnitude indicates strength.
-
Objective Function: We want W that makes predictions close to actual outputs. We measure closeness with Residual Sum of Squares (RSS): RSS(W) = ||Y − XW||_2^2 = (Y − XW)^T (Y − XW) = ∑_{i=1}^n (y_i − x_i^T W)^2. Squaring emphasizes larger errors and yields a smooth, convex objective that is easy to optimize.
-
Optimization Strategy: Use calculus to minimize the differentiable objective. Take the derivative of RSS with respect to W, set it to zero, and solve the resulting linear system. This produces a closed-form answer (the normal equation) under an invertibility condition.
-
Solution: The gradient of RSS(W) is ∇_W RSS = −2 X^T Y + 2 X^T X W. Setting ∇_W RSS = 0 gives X^T X W = X^T Y. If X^T X is invertible (full rank d), we solve W = (X^T X)^{-1} X^T Y. This W uniquely minimizes RSS because RSS is convex and quadratic in W.
-
Invertibility and Rank: X^T X is invertible if and only if the columns of X are linearly independent. Linear independence means no column can be formed as a combination of the others. If d > n, there cannot be d independent columns in n-dimensional space, so dependence is guaranteed. In that case, multiple weight vectors yield the same predictions and RSS, so there’s no unique minimizer without further constraints.
-
Evaluation Metric: Mean Squared Error (MSE) = (1/n) RSS = (1/n) ∑(y_i − x_i^T W)^2. Though dividing by n doesn’t change the optimizer, it makes the error interpretable as an average per example, useful for comparing across datasets or different n.
-
Overfitting Risk: With many features or with near-linear dependencies, the model can fit noise in training data, making training error low but test error high. Overfitting is especially likely when d is large relative to n or when some features are redundant. Coefficients can become very large in magnitude to chase tiny variations, amplifying noise.
-
Regularization: To curb overfitting and fix non-uniqueness, we adjust the objective to penalize large weights. Two common forms: L2 (ridge): minimize RSS + λ||W||_2^2; L1 (lasso): minimize RSS + λ||W||_1. The hyperparameter λ ≥ 0 controls penalty strength. Larger λ prefers smaller W, trading some bias for lower variance (more stable predictions).
-
Geometry of Penalties: L2 penalty forms circular/spherical level sets in weight space, gently shrinking all coefficients toward zero. L1 penalty forms diamond-shaped level sets with sharp corners at axes; minimization often meets the objective at these corners, setting some coefficients exactly to zero. This geometric difference explains ridge’s smooth shrinkage versus lasso’s sparsity.
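To see the sparsity argument in the simplest possible setting, the sketch below works the one-dimensional case with toy numbers: the ridge penalty rescales the least-squares value toward zero, while the lasso penalty applies soft-thresholding, which sets the weight exactly to zero once λ is large enough.

```python
import numpy as np

# One-dimensional picture of the two penalties (toy numbers).
# Unregularized least squares in 1D reduces to matching a target value `a`:
#   minimize (w - a)^2 + penalty(w)
a = 1.0
lams = np.array([0.0, 0.5, 1.0, 2.0, 4.0])

# Ridge: (w - a)^2 + lam * w^2  ->  w = a / (1 + lam)   (shrinks, never exactly 0)
ridge_w = a / (1 + lams)

# Lasso: (w - a)^2 + lam * |w|  ->  soft-thresholding with threshold lam / 2
lasso_w = np.sign(a) * np.maximum(np.abs(a) - lams / 2, 0.0)

print(ridge_w)   # [1.    0.667 0.5   0.333 0.2  ]  -- small but never exactly zero
print(lasso_w)   # [1.    0.75  0.5   0.    0.   ]  -- exactly zero once lam >= 2
```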
- Code/Implementation Details (Conceptual, No Specific Language Required)
-
Matrix Computations: The core operations are X^T X (d×d), X^T Y (d×1), and solving a d×d linear system. Numerically, computing (X^T X)^{-1} explicitly is less stable than solving the system X^T X W = X^T Y via a linear solver. However, conceptually, the closed-form inverse clarifies the solution.
-
Derivative Steps in Detail: Let f(W) = ||Y − XW||_2^2 = (Y − XW)^T (Y − XW). Expand: f(W) = Y^T Y − 2 Y^T X W + W^T X^T X W. Take derivative w.r.t. W: ∇ f(W) = −2 X^T Y + 2 X^T X W. Set to zero: X^T X W = X^T Y ⇒ W = (X^T X)^{-1} X^T Y (if invertible).
-
Conditions and Edge Cases: • If X^T X is not invertible (rank-deficient), there are infinitely many minimizers of RSS. Regularization with L2 makes the matrix X^T X + λI invertible for any λ > 0, yielding a unique solution. • With L1, the solution is not given by a simple matrix inverse; specialized optimization (e.g., coordinate descent) is typically used to find the sparse minimizer.
-
Regularized Objective Derivatives (Intuition): • L2 (ridge): J(W) = ||Y − XW||^2 + λ||W||^2. The derivative adds 2λW, giving (X^T X + λI) W = X^T Y, so W = (X^T X + λI)^{-1} X^T Y. • L1 (lasso): J(W) = ||Y − XW||^2 + λ∑|w_j|. The absolute value is not differentiable at zero, creating the “kink” that induces sparsity. Solutions often lie at exact zeros for some coefficients, which is why lasso selects features.
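A minimal sketch of the ridge closed form above, using synthetic data and an arbitrary grid of λ values; as λ grows, the coefficients shrink smoothly toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
X = rng.normal(size=(n, d))
Y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=n)   # made-up coefficients

def ridge(X, Y, lam):
    # Ridge closed form: (X^T X + lam I) W = X^T Y, solved as a linear system.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, np.round(ridge(X, Y, lam), 3))
# As lam grows, all coefficients shrink smoothly toward zero (but rarely hit 0 exactly).
```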
- Tools/Libraries Used
- The lecture stays at the mathematical level and does not depend on specific libraries. In practice, any scientific computing environment (such as NumPy, MATLAB, or R) can perform the needed matrix operations: transposes, multiplications, and solving linear systems. No special machine learning framework is required for basic linear regression with the normal equation.
- Step-by-Step Implementation Guide (Conceptual)
-
Step 1: Prepare Data • Collect features into X (n×d) and outputs into Y (n×1). Ensure each row of X matches the corresponding entry of Y. Consider adding a column of ones to X if you include an intercept (bias term); this was not explicitly discussed here, but it’s a common practice so the model can fit a nonzero baseline.
-
Step 2: Define Objective • Use RSS(W) = ||Y − XW||^2 if you seek a pure least-squares fit. For evaluation or comparison, compute MSE = RSS/n.
-
Step 3: Solve for W (Unregularized) • Compute X^T X and X^T Y. If X^T X is invertible, compute W = (X^T X)^{-1} X^T Y (or solve the linear system directly using a solver). This yields the unique least-squares minimizer.
-
Step 4: Check Invertibility / Overfitting Risk • If d > n or features are collinear, X^T X may not be invertible or may be poorly conditioned. Expect instability or multiple solutions; move to regularization.
-
Step 5: Add Regularization (If Needed) • Ridge (L2): Solve (X^T X + λI) W = X^T Y for a chosen λ ≥ 0. Larger λ shrinks coefficients more and improves stability when features are correlated or d > n. • Lasso (L1): Solve minimize ||Y − XW||^2 + λ||W||_1 using an appropriate algorithm (e.g., coordinate descent). Expect some weights to become exactly zero, simplifying the model.
-
Step 6: Evaluate with MSE • Compute predictions Ŷ = XW on validation or test data and calculate MSE. Compare MSE across different settings (e.g., various λ) to choose the best trade-off.
-
Step 7: Interpret Weights • Inspect signs and magnitudes: positive means increasing the feature increases prediction; negative means it decreases. Very large magnitudes may indicate overfitting or feature scaling issues.
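Putting the steps above together, here is one possible end-to-end sketch (synthetic data, a hypothetical 80/20 train/validation split, and an arbitrary λ grid): build X with an intercept column, fit ridge for several λ values, keep the one with the lowest validation MSE, and inspect the weights. For brevity the intercept is penalized along with the other weights, which is usually avoided in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 120, 6
X_raw = rng.normal(size=(n, d))
Y = 2.0 + X_raw @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Step 1: add a column of ones so the model can fit an intercept.
X = np.hstack([np.ones((n, 1)), X_raw])

# Simple train/validation split (hypothetical 80/20).
X_tr, X_va = X[:96], X[96:]
Y_tr, Y_va = Y[:96], Y[96:]

def ridge_fit(X, Y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def mse(X, Y, W):
    r = Y - X @ W
    return (r @ r) / len(Y)

# Steps 5-6: sweep lambda and keep the value with the lowest validation MSE.
best = None
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    W = ridge_fit(X_tr, Y_tr, lam)
    score = mse(X_va, Y_va, W)
    if best is None or score < best[1]:
        best = (lam, score, W)

lam, score, W = best
print("chosen lambda:", lam, "validation MSE:", round(score, 3))
# Step 7: inspect signs and magnitudes of the learned weights.
print("weights:", np.round(W, 3))
```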
- Tips and Warnings
-
When d > n or columns are nearly dependent, regularization is not optional—it’s essential for stability and uniqueness. Even mild ridge regularization can dramatically reduce variance and improve generalization.
-
L1 vs L2 Choice: If you desire interpretability and feature selection, lasso is attractive due to sparsity. If you want to reduce variance smoothly without forcing zeros, ridge is a safer, more stable shrinker, especially when many features have small, real effects.
-
Lambda Sensitivity: Start with a range of λ values from very small to moderately large and pick the one with lowest validation MSE. Too-small λ provides little protection; too-large λ oversmooths, increasing bias and underfitting.
-
Numerical Stability: Although the formula uses a matrix inverse, in computation prefer solving linear systems or using decompositions (like QR) instead of explicitly inverting X^T X. This reduces numerical errors (a short sketch follows this list).
-
Interpreting MSE: Remember MSE averages squared errors; large outliers can dominate. If you see very high MSE driven by a few points, investigate data quality or consider robust alternatives (outside this lecture’s scope).
-
Feature Understanding: If a weight is zero under lasso, that feature is effectively unused—this can guide feature selection. If all weights shrink too close to zero with ridge or lasso, you may have set λ too high or features may be weak predictors.
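Illustrating the numerical-stability tip above, a short sketch comparing three equivalent routes to the least-squares solution on synthetic data; the explicit inverse is shown only for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=n)

# Three routes to the same least-squares W:
W_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)       # explicit inverse (least stable in general)
W_solve = np.linalg.solve(X.T @ X, X.T @ Y)      # linear solver on the normal equations
Q, R = np.linalg.qr(X)                           # QR: factor X itself, avoiding X^T X entirely
W_qr = np.linalg.solve(R, Q.T @ Y)

print(np.allclose(W_inv, W_solve), np.allclose(W_solve, W_qr))  # True True on data like this
# On well-conditioned data all three agree; when X^T X is nearly singular,
# the QR (or lstsq) route typically loses the least precision.
```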
04 Examples
- 💡
Simple 1D Fit: Suppose we have x = [1, 2, 3] and y = [2, 3, 5]. We model ŷ = w·x and minimize RSS = ∑(w·xi − yi)^2. Solving gives a slope that best fits points in the vertical least-squares sense. The key point is that squared vertical gaps determine the best straight line.
- 💡
Matrix Shapes Check: With 4 data points and 2 features, X is 4×2, Y is 4×1, and W is 2×1. Predictions are Ŷ = XW, a 4×1 vector. RSS = ||Y − XW||^2 is valid because Y and XW have the same shape. The instructor emphasized that knowing shapes prevents algebra mistakes.
- 💡
Deriving the Gradient: Start from f(W) = ||Y − XW||^2 = (Y − XW)^T (Y − XW). Expand to get Y^T Y − 2 Y^T X W + W^T X^T X W. Taking ∇_W yields −2 X^T Y + 2 X^T X W. Setting to zero and solving leads to the normal equation.
- 💡
Normal Equation Solution: For small d, compute X^T X, invert it, and multiply by X^T Y to get W. For example, if X^T X = [[10, 2],[2, 5]] and X^T Y = [8, 3]^T, then W = (X^T X)^{-1} X^T Y. This produces the RSS-minimizing coefficients. It directly demonstrates the closed-form nature of the solution.
- 💡
Non-Invertible Case (d > n): Let n=2 and d=3 with X = [[1, 0, 1],[2, 0, 2]]. The third column equals the first column, so columns are dependent. X^T X is rank-deficient, so (X^T X)^{-1} does not exist. This shows why no unique W exists without regularization; a code sketch reproducing this case appears after this list.
- 💡
MSE Calculation: If RSS is 50 over n = 10 data points, MSE = 50/10 = 5. If another model’s RSS is 45 over n = 5 points, its MSE is 9. Even though RSS is lower (45 < 50), the average error is higher. This illustrates why MSE is better for comparing models across dataset sizes.
- 💡
Overfitting Illustration: With n=5 and d=5, a model can fit training data nearly perfectly by twisting weights to chase noise. Training RSS can be near zero, but test MSE will be high. This mismatch reveals overfitting. The lesson: low training error alone is not proof of a good model.
- 💡
L2 Regularization Effect: Consider ridge objective J(W) = ||Y − XW||^2 + λ||W||^2. With λ=0.1 versus λ=10, the larger λ shrinks all weights more. Predictions become smoother and less sensitive to small data fluctuations. The trade-off is a slight increase in training error for better generalization.
- 💡
L1 Regularization Sparsity: For many weakly related features, lasso will often set most weights to zero. Suppose 20 features have little effect and 2 matter; with a suitable λ, lasso keeps only the 2 important ones. The result is a sparse, simpler model. This demonstrates feature selection via L1.
- 💡
Geometric Intuition (Contours): Visualize error contours as nested ovals and regularization constraints as shapes around the origin. With L2 (circles), the touch point is usually at a smooth spot, shrinking all weights but keeping them nonzero. With L1 (diamonds), the touch point often lands on a corner (an axis), forcing a zero weight. This explains ridge shrinking vs lasso zeroing.
- 💡
Vertical Distances in 2D: Plot points on an x–y plane and fit a line ŷ = a x + b. The residuals are the vertical gaps from points to the line. Minimizing the sum of their squares finds the best-fitting line. This picture closely matches the instructor’s verbal explanation.
- 💡
Interpreting a Negative Weight: In a model predicting energy usage, a negative weight on outside temperature means warmer days reduce heating needs. Each degree increase lowers the predicted energy. The magnitude tells how strong the effect is. This showcases weight sign and size meaning.
- 💡
Choosing λ by Validation: Try λ in {0, 0.01, 0.1, 1, 10} and compute validation MSE. Pick the λ with the smallest validation MSE. This balances fit quality and simplicity. It highlights λ as a crucial knob for generalization.
- 💡
Handling Redundant Features: If feature 3 = feature 1 + feature 2, columns are dependent. Unregularized normal equation fails to produce a unique W. Ridge fixes this by adding λI, making the system solvable and stable. This example underlines the role of regularization in resolving non-uniqueness.
- 💡
RSS vs MSE Minimizer: Minimizing RSS or MSE yields the same W because MSE is just RSS divided by n. Division by a positive constant does not change the minimizer. However, MSE makes comparing across datasets meaningful. This clarifies why both metrics appear and how they relate.
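The non-invertible and redundant-feature examples above can be reproduced directly; the sketch below uses the same 2×3 matrix, checks the rank of X^T X, and shows that adding λI (ridge, with an arbitrary λ = 0.1) makes the system solvable.

```python
import numpy as np

# Same matrix as the "Non-Invertible Case (d > n)" example: the third column
# duplicates the first, so the columns are linearly dependent.
X = np.array([[1.0, 0.0, 1.0],
              [2.0, 0.0, 2.0]])
Y = np.array([1.0, 2.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))        # 1 < 3: rank-deficient, no unique inverse

# The plain normal equation either fails or gives a meaningless answer here...
try:
    print("unregularized solve gave:", np.linalg.solve(A, X.T @ Y))
except np.linalg.LinAlgError as err:
    print("unregularized solve failed:", err)

# ...but adding lam * I (ridge) makes the system invertible for any lam > 0.
lam = 0.1
W_ridge = np.linalg.solve(A + lam * np.eye(3), X.T @ Y)
print(W_ridge)
```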
05 Conclusion
This lecture built linear regression from the ground up, from notation to solution to safeguards against overfitting. We began by organizing data in matrix form (X for features, Y for outputs, W for weights) and defining the linear model ŷ = W^T x. We chose the Residual Sum of Squares (RSS) as the objective to measure total squared prediction error and used calculus to derive the normal equation, W = (X^T X)^{-1} X^T Y, the closed-form minimizer when X^T X is invertible. The lecture clarified why invertibility requires full column rank and why this fails when there are more features than data points (d > n), resulting in no unique solution without added constraints.
We then learned to evaluate model performance using Mean Squared Error (MSE), an average form of RSS that enables fair comparisons across datasets. Recognizing the risk of overfitting—especially with many or redundant features—we introduced regularization. L2 (ridge) adds a squared-weight penalty that smoothly shrinks coefficients, improves stability, and often resolves near-dependencies. L1 (lasso) adds an absolute-value penalty that frequently sets some weights exactly to zero, providing natural feature selection. A geometric view explained the different effects: L2’s smooth circular contours encourage small but nonzero weights, while L1’s diamond-shaped contours with sharp corners make zero weights likely.
To practice, start by fitting an unregularized linear model on a small dataset and verify the normal equation solution. Next, explore d > n or near-collinearity to witness non-uniqueness or instability, then add ridge regularization and observe how λ stabilizes the solution and affects MSE. Finally, try lasso on a dataset with many weak features and confirm that some coefficients become exactly zero—an exercise in feature selection.
For next steps, extend this understanding to logistic regression for classification problems and to generalized linear models. Study strategies for choosing λ effectively, such as cross-validation, and learn more about numerical stability and matrix decompositions for practical computing. The core message to remember is the three-part structure of supervised learning—model, objective, optimization—and how linear regression makes each piece explicit and solvable. With RSS as the objective and the normal equation as the optimizer’s result, you gain a precise, interpretable, and often surprisingly strong baseline model, made robust by regularization when needed.
Key Takeaways
- ✓Always define the three pillars: model, objective, optimization. Pick a model family (here linear), a clear error measure (RSS/MSE), and a method to find the best parameters (normal equation). Without this structure, training is guesswork. Keep these pillars in mind for every supervised task you tackle.
- ✓Use matrix shapes to avoid mistakes. Confirm X is n×d, Y is n×1, and W is d×1 before computing Ŷ = XW. If shapes don’t line up, stop and fix your data assembly. This simple check prevents many silent bugs.
- ✓Interpret weights to understand your model. Positive weights mean increasing a feature increases the prediction; negative weights do the opposite. Large magnitudes indicate strong effects. This interpretability is a major reason to start with linear regression.
- ✓Minimize RSS to fit the model; report MSE to compare results. RSS measures total squared error, while MSE averages it. They share the same minimizer but MSE is easier to compare across datasets. Always state which you’re using.
- ✓Derive the normal equation to get W in one step. Compute X^T X and X^T Y, and solve X^T X W = X^T Y when invertible. This is efficient and exact for small to medium feature counts. Prefer linear solvers to explicit matrix inversion for numerical stability.
- ✓Check invertibility and rank before trusting the solution. If d > n or columns are collinear, X^T X is not invertible and solutions aren’t unique. Don’t force an inverse—switch to regularization. This turns an ill-posed problem into a stable one.
- ✓Expect overfitting with too many or redundant features. Very low training error can be misleading if test MSE is high. Use validation or a held-out test set to assess generalization. If overfitting appears, add regularization.
- ✓Use ridge (L2) when you need smooth shrinkage and stability. It reduces variance and handles collinearity by adding λI to X^T X. Start with a small λ and increase until validation MSE improves. Avoid extreme λ that over-shrinks all weights.
- ✓Use lasso (L1) when you want sparsity and feature selection. L1 pushes some weights to exactly zero thanks to its kink at zero. This simplifies the model and highlights important features. Tune λ carefully to avoid dropping truly useful predictors.
- ✓Pick λ by validation, not by guesswork. Try a range of λ values and choose the one with the lowest validation MSE. If training and validation errors diverge widely, adjust λ upward. If both are high, reduce λ or revisit features.
- ✓Visualize the geometric intuition when deciding between L1 and L2. L2’s circular contours imply gentle, uniform shrinkage; L1’s diamond contours imply axis-hitting sparsity. This mental picture predicts how your solution will behave. Use it to set expectations for interpretability and stability.
- ✓Keep an eye on coefficient magnitude. Extremely large weights often signal overfitting or poorly scaled features. Regularization can tame this, but also consider cleaning or rescaling inputs. Stable, moderate coefficients usually generalize better.
- ✓Remember that MSE is sensitive to outliers. A few large errors can dominate the average. Inspect residuals to spot anomalies (see the short residual-inspection sketch after this list). Consider robust strategies if extreme outliers are common (beyond this lecture’s scope).
- ✓Ensure data rows and labels align perfectly. Any mismatch corrupts RSS and the learned W. Double-check indexing and joins when building X and Y. Small alignment errors can destroy model performance.
- ✓Document model choices and results. Record whether you used unregularized, ridge, or lasso; the λ value; and the achieved training/validation MSE. This makes your work reproducible and comparable. It also helps you defend decisions to stakeholders.
- ✓Start simple, then add complexity. Linear regression is fast and interpretable, making it an ideal baseline. If it performs well, you may not need a more complex model. If not, its results guide what to try next.
- ✓Use regularization by default when d is large or features are correlated. Even a small λ can stabilize learning dramatically. L2 is a safe first choice; add L1 if you want sparsity. Validate the effect on MSE before finalizing.
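Echoing the outlier takeaway above, a small sketch with synthetic data and one deliberately corrupted label, showing how inspecting residuals reveals a point that dominates the MSE.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 2
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n)
Y[7] += 25.0                      # one corrupted label (hypothetical outlier)

W = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ W

mse = np.mean(residuals ** 2)
worst = np.argmax(np.abs(residuals))
print("MSE:", round(mse, 3))
print("largest residual at index", worst, "=", round(residuals[worst], 2))
# Excluding that single point from the average changes the reported MSE dramatically:
mask = np.arange(n) != worst
print("MSE without it:", round(np.mean(residuals[mask] ** 2), 3))
```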
Glossary
Supervised learning
A way for computers to learn from examples where each input comes with the correct answer. The goal is to find a rule that maps inputs to outputs so new inputs get good predictions. You must pick a model type, a way to measure mistakes, and a method to find the best model. This setup helps the computer improve by reducing measured errors. It’s like a student learning with an answer key.
Feature
A measurable property or input used for prediction. Each feature is a column in the data matrix and describes one aspect of the example. Features can be numbers like height or age. The choice and quality of features greatly affect model performance. Bad or redundant features can cause trouble.
Label (output)
The correct answer the model tries to predict. In regression, the label is a number like a price or temperature. We compare the model’s prediction to this label to measure error. Labels are stored in a vector Y, one entry per data point. Good labels are required for supervised learning.
Data matrix (X)
A big table (matrix) where each row is a data point and each column is a feature. We write it as X with shape n by d. Using a matrix lets us process all points at once with linear algebra. It makes formulas clean and fast. It’s the main input to the model.
Parameter vector (W)
A list of weights, one per feature, that the model learns. Each weight says how much its feature pushes the prediction up or down. Positive means increase; negative means decrease. The size of the weight shows strength. W has length d, matching the number of features.
Linear regression
A model that predicts a number by taking a weighted sum of features. It assumes a straight-line relationship between inputs and output. It’s simple, fast, and often works well. The prediction is ŷ = W^T x for each input x. It’s a common baseline in machine learning.
Prediction (y-hat)
The model’s guess for the label given an input. In linear regression, ŷ is the dot product of W and x. We compare ŷ to the true y to compute error. Improving the model means making ŷ closer to y. Predictions drive decisions in the real world.
Residual
The difference between the true label and the prediction: residual = y − ŷ. It shows how much the model missed for that example. Squaring residuals gives positive numbers and punishes big mistakes more. Residuals guide model improvement. Patterns in residuals can reveal problems.
