Stanford CS230 | Autumn 2025 | Lecture 3: Full Cycle of a DL project


Key Summary

  • This lecture introduces supervised learning for regression, where the goal is to predict a real number (like house price) from input features (like square footage, bedrooms, and location). You represent each example as a d-dimensional vector x with a target y. Linear regression models this relationship with a straight-line formula: f(x) = w^T x + b. The focus is on learning weights w and bias b that best map inputs to outputs.
  • The model parameters are learned by minimizing a loss function, most commonly Mean Squared Error (MSE). MSE averages the squared differences between true values and predictions across all training examples. Squaring makes bigger mistakes count more than small ones. Lower MSE means the model’s predictions are closer to the true values.
  • Gradient descent is used to find w and b that minimize MSE. It starts from an initial guess, computes the gradient (the direction of steepest increase), and moves in the opposite direction to reduce the loss. The size of each move is controlled by a learning rate. This process repeats until the loss stops improving significantly.
  • Choosing the learning rate is crucial: too large and the loss can bounce around or even blow up; too small and training crawls slowly. A good approach is to start small, increase until the loss stops decreasing reliably, then back off. Learning rate schedules gradually reduce the rate over time to stabilize training. Adaptive optimizers like Adam can adjust effective learning rates per parameter automatically.
  • Linear regression has clear advantages: it’s simple, fast, and easy to interpret. It often serves as a strong baseline before trying fancier models. On large datasets, it can be trained efficiently and deployed quickly. If it performs well enough, you might not need anything more complex.
  • However, it assumes a linear relationship between inputs and outputs, which is not always true. It’s sensitive to outliers because MSE squares errors, making extreme points dominate the fit. It can also overfit, learning noise in the training data instead of the true signal. Regularization helps by gently restricting the model’s complexity.
  • Regularization adds a penalty to the loss to discourage overly complex models. L1 regularization (lasso) penalizes the absolute value of weights and pushes many to exactly zero, effectively selecting features. L2 regularization (ridge) penalizes the square of weights, shrinking them but rarely to zero. A hyperparameter lambda controls how strong this penalty is.
  • Polynomial regression extends linear regression to model non-linear relationships by adding powers of features (like x^2) as new inputs. The model is still linear in parameters but now can fit curves. This is done by expanding the feature set, then applying ordinary linear regression on the expanded features. It’s a flexible way to capture bends without switching to a completely different algorithm.
  • Ridge regression is linear regression with L2 regularization. It reduces overfitting by keeping weights small and tends to improve stability, especially when features are correlated. Lasso regression is linear regression with L1 regularization. It performs feature selection by zeroing out less important weights.
  • The lecture uses relatable examples of non-linear patterns: electricity demand versus temperature (high at both hot and cold extremes) and tree height versus age (fast growth that slows over time). These show when a straight line won’t fit well. Polynomial features can help capture such U-shaped or saturating patterns. Regularization prevents these richer models from overfitting.
  • Practically, learning involves iterating: compute predictions, compute MSE, compute gradients with respect to w and b, and update parameters. This loop repeats until changes are small. Monitoring the loss helps you detect convergence and diagnose learning rate issues. Outlier checks and data cleaning can significantly improve stability.
  • The bias term b lets the model fit data that doesn’t pass through the origin. Without b, the line must cross (0, 0) in feature space, which is often unrealistic. Together, w and b define a plane (or hyperplane) that best approximates the data. This simple geometry underlies most of the lecture’s methods.
  • Lambda, the regularization strength, is a hyperparameter you set before training. Larger lambda means stronger pressure to simplify the model; smaller lambda allows more flexibility. Setting it too high can underfit; too low can overfit. The right balance produces better generalization to new data.
  • Even with the same data, L1 and L2 regularization behave differently. L1 tends to produce sparse solutions with many zeros, which is handy for picking a subset of important features. L2 spreads the shrinkage across all weights, improving numerical stability and robustness. Knowing which to use depends on whether you value feature selection or smooth shrinkage.
  • After mastering linear regression and its extensions, you’re ready to study logistic regression for classification. The optimization ideas (loss, gradients, learning rate) carry over. The main change is the loss function and link function suited for probabilities. This continuity helps you build from regression to classification smoothly.

Why This Lecture Matters

This lecture’s content is essential for anyone building predictive systems where the output is a number: data analysts estimating prices, forecasters predicting demand, and engineers modeling performance metrics. Linear regression forms a baseline that is fast to train, easy to interpret, and often surprisingly strong, so it saves time by telling you early whether you need more complex methods. Understanding MSE and gradient descent gives you a reusable playbook for optimizing models: define a clear loss, compute gradients, pick a learning rate, and iterate. Learning about learning rate schedules and adaptive methods like Adam prepares you for practical stability and speed when training on real datasets. Regularization (L1 and L2) directly addresses overfitting, one of the most costly failures in real projects, where a model appears great in development but fails in production. L1 teaches you how to perform feature selection automatically and build sparse, interpretable models; L2 teaches you how to stabilize fits, especially when features correlate, improving robustness. Polynomial regression shows how to capture curves while keeping the simplicity of linear-in-parameter models, making it a pragmatic bridge to non-linear modeling without jumping to complex architectures. By mastering these tools, you gain the ability to quickly prototype, evaluate, and deploy regression models that generalize. In a career context, these skills are foundational: they appear in interviews, code challenges, and day-to-day modeling tasks. In the broader industry, even advanced systems often start with or benchmark against linear models; being fluent here helps you judge when to move to more sophisticated approaches and how to regularize them properly. Ultimately, this knowledge equips you to make data-driven decisions with confidence, control model complexity, and deliver reliable predictions that matter in business and science.

Lecture Summary


01 Overview

This lecture teaches the core ideas behind supervised learning for regression, focusing on the classic and foundational method: linear regression. In supervised learning, you are given input–output pairs: each input x is a vector of features (for example, house characteristics like square footage, number of bedrooms, and location), and each output y is the real number you want to predict (such as the house price). The goal is to learn a function f that maps any new input x to a good prediction of y. Linear regression assumes this mapping is a straight line (technically, a hyperplane) in feature space: f(x) = w^T x + b, where w is a vector of weights and b is a bias (intercept).

To measure how well the model predicts, the lecture uses Mean Squared Error (MSE), which averages the squares of the differences between actual values and predicted values. Squaring emphasizes larger mistakes more than smaller ones, which helps the optimization focus on big errors. The parameters w and b are found by minimizing MSE. The tool for this is gradient descent, an iterative algorithm: start with an initial guess for w and b, compute the gradient (the direction that increases the loss fastest), and step in the opposite direction to reduce the loss. The size of each step is determined by a learning rate hyperparameter.

The lecture explains that choosing a good learning rate is essential. If it is too large, the updates can overshoot the minimum, causing the loss to bounce or even explode. If it is too small, progress is painfully slow and training takes a long time. Helpful strategies include starting small and increasing until improvement slows, then backing off, using learning rate schedules that gradually reduce the rate during training, or using adaptive methods like Adam that adjust the effective learning rate for each parameter automatically.

Linear regression is praised for being simple, interpretable, and computationally efficient. It is often recommended as a baseline: try it first, and if it performs sufficiently well, you may not need a more complex model. However, it has limitations. The most important one is that it assumes a linear relationship between features and the target, which is not always realistic. The lecture gives examples of non-linear relationships: electricity demand versus temperature tends to be high at both hot and cold extremes, which looks like a U-shaped curve rather than a straight line; tree height versus age shows rapid growth early and slower growth later, another clearly non-linear pattern. Linear regression is also sensitive to outliers due to the squared-error loss, and it can overfit when it memorizes training noise rather than learning general patterns.

To address overfitting, regularization adds a penalty to the loss that discourages overly complex models. The lecture presents two main types: L1 regularization (lasso) adds the sum of absolute values of the weights to the loss, promoting sparsity (many weights become exactly zero) and thus performing feature selection. L2 regularization (ridge) adds the sum of squared weights, shrinking weights toward zero but rarely turning them off completely, which stabilizes the fit and reduces variance. A hyperparameter lambda controls the strength of this penalty: higher lambda means a stronger push toward simpler models, while lower lambda allows more flexibility.

The lecture also introduces polynomial regression as an extension that can capture non-linear relationships without abandoning linear regression’s simplicity. The trick is to expand the input features by adding polynomial terms (e.g., x^2, x^3), and then apply standard linear regression on the expanded feature set. This keeps the model linear in the parameters but allows curved fits. Ridge regression (linear regression with L2 regularization) and lasso regression (linear regression with L1 regularization) are highlighted as practical tools to control complexity and improve generalization.

By the end, you understand the full loop of building a basic regression model: define the model form (linear), choose a loss (MSE), minimize it with gradient descent (tuning the learning rate), be aware of outliers and non-linearity, and use regularization to prevent overfitting. With these ideas in place, you are set up to move on to logistic regression for classification problems, where many of the same optimization and hyperparameter principles apply. The lecture’s structure flows from the problem setup (supervised regression), to the model and loss, to optimization via gradient descent, to practical concerns (learning rate, pros/cons, outliers), and finally to extensions (regularization and polynomial features), giving a complete and clear foundation for predictive modeling with linear methods.

02 Key Concepts

  • 01

    🎯 Supervised Learning (Regression)

    • 🏠 It’s like a teacher giving you questions (inputs) and the correct answers (outputs) so you can learn to answer new questions.
    • 🔧 Technically, you have feature vectors x and target numbers y, and you learn a function f mapping x to y.
    • 💡 This matters because many problems (like house pricing or temperature forecasting) require predicting real numbers.
    • 📝 Example: Predict housing price from square footage, bedrooms, and location.
  • 02

    🎯 Linear Regression Model

    • 🏠 It’s like fitting a straight ruler to a cloud of points to best pass through them.
    • 🔧 The model is f(x) = w^T x + b, with weights w and bias b to be learned from data.
    • 💡 Without a clear model, we cannot systematically make predictions for new inputs.
    • 📝 Example: A positive weight on square footage means bigger houses predict higher prices.
  • 03

    🎯 Features and Targets

    • 🏠 Features are ingredients; the target is the final dish you want to make.
    • 🔧 Each input x is a d-dimensional vector of features; each target y is a real number.
    • 💡 Organizing data into features and targets lets learning algorithms see patterns.
    • 📝 Example: Features = [1,500 sq ft, 3 beds, good school rating]; Target = $640,000.
  • 04

    🎯 Mean Squared Error (MSE)

    • 🏠 It’s like averaging how far off you are, but big mistakes are punished much more.
    • 🔧 MSE = (1/n) Σ (y_i − f(x_i))^2; squaring emphasizes large errors.
    • 💡 Without a loss to measure mistakes, you cannot judge or improve a model.
    • 📝 Example: If predictions are off by 10, 2, and 0, MSE highlights the 10-error most.
  • 05

    🎯 Gradient Descent

    • 🏠 It’s like walking downhill in fog: feel the slope and step in the steepest-down direction.
    • 🔧 Compute gradients of the loss with respect to w and b, then update: new = old − learning_rate × gradient.
    • 💡 Without iterative updates, we would not know how to change parameters to reduce error.
    • 📝 Example: After each pass, if MSE drops, you’re moving in the right direction.
  • 06

    🎯 Learning Rate

    • 🏠 It’s like the size of your footsteps going downhill: too big and you stumble; too small and you crawl.
    • 🔧 A hyperparameter that scales the gradient step; chosen by trial, schedules, or adaptive methods like Adam.
    • 💡 A bad learning rate can stop training from converging or make it painfully slow.
    • 📝 Example: Start at 0.01, increase until loss stops decreasing smoothly, then reduce slightly.
  • 07

    🎯 Bias (Intercept)

    • 🏠 It’s like adjusting where your ruler starts so it fits the overall height of the data.
    • 🔧 b allows the fitted line or plane to shift up or down; without b, the fit must pass through the origin.
    • 💡 Without b, many real datasets would be poorly fit, increasing the error.
    • 📝 Example: Even a small house has a baseline cost; b captures that base price.
  • 08

    🎯 Advantages of Linear Regression

    • 🏠 It’s like a simple, reliable tool you can use quickly and understand easily.
    • 🔧 It’s interpretable, fast, scalable, and a strong baseline.
    • 💡 Starting simple saves time and often achieves good-enough performance.
    • 📝 Example: On many tabular datasets, linear models perform competitively with minimal tuning.
  • 09

    🎯 Disadvantages of Linear Regression

    • 🏠 It’s like trying to draw a straight line through a curved road—sometimes the shape just doesn’t fit.
    • 🔧 Assumes linearity, is sensitive to outliers, and can overfit.
    • 💡 Recognizing limits avoids drawing wrong conclusions from a misfit model.
    • 📝 Example: A few extreme-priced houses can pull the line and distort most predictions.
  • 10

    🎯 Outliers

    • 🏠 It’s like one very tall person in a class photo making everyone else look short by comparison.
    • 🔧 Because of squared errors, extreme points can dominate the loss and skew the fit.
    • 💡 If not handled, outliers can degrade accuracy for the majority of cases.
    • 📝 Example: A mansion priced at 10× others can tilt the line upward for all predictions.
  • 11

    🎯 Overfitting

    • 🏠 It’s like memorizing answers instead of learning the material; you do great on old questions but fail new ones.
    • 🔧 The model fits noise in training data and performs poorly on unseen data.
    • 💡 Preventing overfitting leads to better generalization.
    • 📝 Example: A model that nails training prices but misses on new houses has overfit.
  • 12

    🎯 Regularization (General)

    • 🏠 It’s like gently tying a leash to keep the model from running wild.
    • 🔧 Adds a penalty term to the loss to discourage large or complex weights.
    • 💡 This reduces overfitting and improves stability on new data.
    • 📝 Example: With a moderate penalty, the model favors simpler, more robust trends.
  • 13

    🎯 L1 Regularization (Lasso)

    • 🏠 It’s like packing light for a trip: you end up leaving many items (features) at home.
    • 🔧 Adds λ Σ |w_j| to the loss; pushes many weights to exactly zero.
    • 💡 Helps with feature selection by turning off unhelpful inputs.
    • 📝 Example: Among dozens of house features, only a handful get nonzero weights.
  • 14

    🎯 L2 Regularization (Ridge)

    • 🏠 It’s like tightening all screws evenly so nothing sticks out too far.
    • 🔧 Adds λ Σ w_j^2 to the loss; shrinks weights smoothly without making many exactly zero.
    • 💡 Reduces variance and helps when features are correlated.
    • 📝 Example: All house features get smaller, more stable weights that generalize better.
  • 15

    🎯 Lambda (Regularization Strength)

    • 🏠 It’s like a knob that controls how tight the leash is.
    • 🔧 Larger λ means stronger penalties and simpler models; smaller λ allows more complex fits.
    • 💡 Wrong λ can underfit (too large) or overfit (too small).
    • 📝 Example: Try λ values on a scale (like 0.001, 0.01, 0.1, 1) to find balance.
  • 16

    🎯 Polynomial Regression

    • 🏠 It’s like bending your ruler into a curve to follow a winding path.
    • 🔧 Add polynomial terms (x, x^2, x^3, …) as new features, then do linear regression on the expanded set.
    • 💡 Captures non-linear patterns while keeping the model linear in parameters.
    • 📝 Example: A quadratic term models U-shaped demand versus temperature.
  • 17

    🎯 Non-linear Relationships

    • 🏠 It’s like noticing that some things grow fast then slow down, or go up at both extremes.
    • 🔧 Real-world patterns (e.g., electricity demand vs. temperature; tree height vs. age) are often curved.
    • 💡 Recognizing non-linearity tells you when to add polynomial features or choose other models.
    • 📝 Example: Demand spikes on very hot and very cold days—best fit by a curve, not a line.
  • 18

    🎯 Learning Rate Schedules and Adam

    • 🏠 It’s like starting with big steps and then taking smaller, careful steps as you approach your goal.
    • 🔧 Schedules reduce the learning rate over time; Adam adapts step sizes per parameter automatically.
    • 💡 These methods can stabilize training and speed convergence.
    • 📝 Example: Begin at 0.01 and halve it every few epochs as loss plateaus.
  • 19

    🎯 Training Loop and Convergence

    • 🏠 It’s like practicing a skill: try, check, adjust, and repeat until improvements are tiny.
    • 🔧 Repeatedly compute predictions, loss, gradients, and parameter updates until changes are small.
    • 💡 Clear stopping criteria prevent wasted time and over-updating.
    • 📝 Example: Stop when MSE improvement falls below a threshold for several rounds.
  • 20

    🎯 Ridge and Lasso as Extensions

    • 🏠 They’re like two ways to keep your model neat: one trims steadily (ridge), one prunes aggressively (lasso).
    • 🔧 Ridge uses L2; lasso uses L1; both reduce overfitting but behave differently with features.
    • 💡 Choosing the right one depends on whether you want feature selection or smooth shrinkage.
    • 📝 Example: Use lasso to drop unhelpful features; use ridge to stabilize correlated features.

03 Technical Details

Overall Architecture/Structure

  1. Problem Setup
  • You have a dataset of n training examples. Each example has a feature vector x_i ∈ R^d (d features) and a real-valued target y_i ∈ R.
  • The goal is to learn a function f(x) that predicts y for new, unseen inputs x.
  • In linear regression, f(x) = w^T x + b, where w ∈ R^d (weights) and b ∈ R (bias) are the model parameters you must learn from data.
  2. Loss Function: Mean Squared Error (MSE)
  • Define predictions: for each i, ŷ_i = w^T x_i + b.
  • The MSE loss is L(w, b) = (1/n) Σ_{i=1..n} (y_i − ŷ_i)^2.
  • Squaring the errors penalizes large deviations more heavily than small ones, making the optimization focus on outliers’ impact (a strength and a weakness).
  3. Optimization: Gradient Descent
  • We minimize L(w, b) iteratively by following the negative gradient of the loss.
  • Gradients for MSE (derivations):
    • ∂L/∂w = −(2/n) Σ x_i (y_i − ŷ_i) = (2/n) X^T (Xw + b1 − y), where X is the n×d matrix of features, y is the n-vector of targets, and 1 is the all-ones vector.
    • ∂L/∂b = −(2/n) Σ (y_i − ŷ_i) = (2/n) Σ (ŷ_i − y_i) = (2/n) 1^T (Xw + b1 − y).
  • Update rules with learning rate α:
    • w ← w − α ∂L/∂w
    • b ← b − α ∂L/∂b
  • Repeat until convergence (e.g., until loss reduction is below a small threshold for several iterations, or a maximum iteration count is reached).
  4. Regularization (to prevent overfitting)
  • L1 regularization (lasso): L(w, b) = (1/n) Σ (y_i − ŷ_i)^2 + λ Σ |w_j|.
  • L2 regularization (ridge): L(w, b) = (1/n) Σ (y_i − ŷ_i)^2 + λ Σ w_j^2.
  • Effects on gradients:
    • L2 adds 2λw to ∂L/∂w (for the common convention without 1/2 factors). The bias b is often not regularized.
    • L1 adds λ sign(w_j) to each component’s gradient. At w_j = 0, the subgradient is any value in [−λ, λ]; optimization uses subgradient methods or specialized solvers.
  • λ (lambda) is a hyperparameter controlling penalty strength: higher values shrink more strongly.
  5. Extensions: Polynomial Regression
  • Expand x with polynomial features: for a scalar x, use [x, x^2, x^3, …]; for vector x, include squares and cross-terms if desired (though cross-terms were not emphasized in the lecture).
  • Train the same linear regression on the expanded feature vector. The model remains linear in parameters but now can fit curves.
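
The model, loss, gradients, and update rules above can be combined into a short runnable NumPy sketch (the toy data, α, and iteration count are illustrative choices, not the lecture's code):

```python
import numpy as np

# Toy data generated from known parameters (illustrative)
rng = np.random.default_rng(42)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_w, true_b = np.array([3.0, -1.5]), 2.0
y = X @ true_w + true_b          # noiseless for simplicity

w, b = np.zeros(d), 0.0          # initialize parameters
alpha, T = 0.1, 500              # learning rate and iteration budget

for _ in range(T):
    error = X @ w + b - y                 # y_hat - y
    w -= alpha * (2 / n) * X.T @ error    # w <- w - α ∂L/∂w
    b -= alpha * (2 / n) * error.sum()    # b <- b - α ∂L/∂b

print(np.round(w, 3), round(b, 3))  # approaches true_w and true_b
```

Because the data is noiseless and linear, the recovered w and b converge to the generating values, which makes the loop easy to verify end to end.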

Data Flow

  • Input: Matrix X (n×d) and vector y (n×1).
  • Forward pass: Compute predictions ŷ = Xw + b·1.
  • Loss computation: MSE between ŷ and y; optionally add regularization.
  • Backward pass: Compute gradients w.r.t. w and b (plus regularization terms if used).
  • Parameter update: Apply gradient descent steps; repeat for multiple iterations.
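
The backward pass can be sanity-checked numerically. The sketch below (a standard finite-difference check, not part of the lecture; variable names are illustrative) compares the analytic MSE gradients to central differences:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.5
n = len(y)

def mse(w, b):
    residual = X @ w + b - y
    return (residual ** 2).mean()

# Analytic gradients (backward pass)
residual = X @ w + b - y
grad_w = (2 / n) * X.T @ residual
grad_b = (2 / n) * residual.sum()

# Central finite differences for comparison
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps  # perturb only the first weight coordinate
num_w0 = (mse(w + e0, b) - mse(w - e0, b)) / (2 * eps)
num_b = (mse(w, b + eps) - mse(w, b - eps)) / (2 * eps)
print(abs(num_w0 - grad_w[0]) < 1e-5, abs(num_b - grad_b) < 1e-5)  # True True
```

Since MSE is quadratic, the central differences agree with the analytic gradients up to floating-point error, which is a quick way to catch sign or scaling mistakes.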

Code/Implementation Details (Conceptual Pseudocode)

Language: Python with NumPy (conceptually)

  1. Ordinary Linear Regression (MSE + Gradient Descent)
  • Initialize w = zeros(d), b = 0 (or small random values).
  • For t in 1..T:
    • y_hat = X @ w + b
    • error = y_hat − y
    • grad_w = (2/n) * X.T @ error
    • grad_b = (2/n) * np.sum(error)
    • w = w − α * grad_w
    • b = b − α * grad_b
    • Optionally record loss: (1/n) * np.sum(error**2)
  • Stop early if loss stops improving materially.
  2. With L2 Regularization (Ridge)
  • Same as above, but add the regularization term to the gradient:
    • grad_w = (2/n) * X.T @ error + 2λ * w
    • grad_b unchanged (commonly not regularized)
  3. With L1 Regularization (Lasso)
  • Replace grad_w by a subgradient:
    • grad_w = (2/n) * X.T @ error + λ * sign(w) (componentwise)
    • Handle w_j = 0 carefully (the subgradient is any value in [−λ, λ]); in practice, use proximal gradient (soft-thresholding) or coordinate descent for stability.
  4. Polynomial Regression
  • Create an expanded feature matrix Φ:
    • For degree p: Φ = [x, x^2, …, x^p] (if x is 1-D), or augment columns accordingly if x is multi-dimensional.
    • Run the same regression loop on Φ and y.
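
For a 1-D input, the expanded matrix Φ can be built with np.vander, dropping the constant column since the bias b supplies the intercept (a sketch, not the lecture's code):

```python
import numpy as np

def poly_features(x, degree):
    # Columns [x, x^2, ..., x^degree]; the bias b supplies the constant term
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

x = np.array([1.0, 2.0, 3.0])
print(poly_features(x, 3))
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

Feature scaling matters more after expansion, since x^3 can dwarf x in magnitude and slow gradient descent.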

Important Parameters and Their Meanings

  • α (learning rate): Step size for gradient descent. Too large causes divergence; too small leads to slow convergence.
  • λ (regularization strength): Controls penalty weight. Higher values reduce variance but can increase bias.
  • T (max iterations or epochs): Upper limit to guard against endless loops.
  • Initialization: Starting values for w and b; zeros or small random values are common.

Code Execution Order and Flow

  1. Prepare data: X, y; optionally standardize features if scales differ a lot (helps stability, though the lecture did not require it).
  2. Initialize parameters w, b.
  3. Loop: forward pass → loss → backward pass (gradients) → parameter update.
  4. Check stopping conditions; possibly adjust α (learning rate schedule) if loss plateaus.
  5. Output the learned w, b and evaluate predictions on new data.
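
The execution order above, including a simple stopping check on loss improvement, can be sketched as one small function (the name train and the tolerance are illustrative assumptions):

```python
import numpy as np

def train(X, y, alpha=0.1, tol=1e-10, max_iter=10_000):
    # Steps 2-4: initialize, then loop forward -> loss -> gradients -> update,
    # stopping when loss improvement falls below tol (illustrative criterion).
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    prev_loss = np.inf
    loss = prev_loss
    for _ in range(max_iter):
        error = X @ w + b - y
        loss = (error ** 2).mean()
        if prev_loss - loss < tol:
            break
        prev_loss = loss
        w -= alpha * (2 / n) * X.T @ error
        b -= alpha * (2 / n) * error.sum()
    return w, b, loss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + 3.0   # noiseless toy data
w, b, final_loss = train(X, y)
print(np.round(w, 2), round(b, 2))    # w ≈ [1, -2], b ≈ 3
```

The stopping check ensures the loop exits once improvements become negligible rather than always running to max_iter.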

Tools/Libraries Used (Common Choices)

  • NumPy: For matrix operations (X @ w), sums, and vectorized computations.
  • Optional: scikit-learn provides ready-made LinearRegression, Ridge, and Lasso implementations; PolynomialFeatures helps create polynomial inputs. While not part of the lecture’s code, these tools mirror the discussed techniques and are standard in practice.

Step-by-Step Implementation Guide

Step 1: Define the task

  • Choose a regression problem (e.g., predict house price). Identify features (inputs) and the target (output).

Step 2: Collect and prepare data

  • Build your feature matrix X (n×d) and target vector y (n×1). Ensure no missing values in the simplest setup; remove or impute if present.
  • Optionally remove obvious outliers or winsorize extreme values (because MSE is sensitive to outliers).

Step 3: Initialize parameters and hyperparameters

  • Pick α (e.g., 0.01) and set λ if using regularization (start with a small value such as 0.01). Initialize w (zeros) and b (0).

Step 4: Train with gradient descent

  • Repeat: ŷ = Xw + b; compute loss; compute gradients; update w, b.
  • Monitor loss to ensure it decreases steadily. If it stalls or oscillates, adjust α.

Step 5: Tune learning rate

  • Try small α first; if loss decreases smoothly, you can consider slightly increasing. If it spikes or oscillates, reduce α immediately.
  • Consider a schedule (e.g., reduce α by half after certain iterations without improvement) or use adaptive methods like Adam.
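
A plateau-based halving schedule like the one described can be sketched as follows (the function name, patience, and tol threshold are illustrative assumptions):

```python
def maybe_decay(alpha, loss_history, patience=5, tol=1e-4):
    # Halve the learning rate when the loss has not improved by at least
    # tol over the last `patience` recorded losses (illustrative rule).
    if len(loss_history) > patience:
        recent_best = min(loss_history[-patience:])
        earlier_best = min(loss_history[:-patience])
        if earlier_best - recent_best < tol:
            return alpha / 2
    return alpha

# Loss has plateaued at 0.3, so the rate is halved
history = [1.0, 0.5, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
print(maybe_decay(0.01, history))  # 0.005
```

Calling this once per epoch with the recorded losses gives a crude but serviceable schedule; Adam-style optimizers make such manual rules largely unnecessary.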

Step 6: Add regularization if overfitting

  • For ridge: add 2λw to the gradient. For lasso: use subgradient or a proximal step to encourage sparsity.
  • Adjust λ: increase if overfitting, decrease if underfitting.
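
The proximal (soft-thresholding) step for lasso can be sketched as an ISTA-style update; the soft_threshold helper, hyperparameters, and toy data are illustrative, not from the lecture:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t·||w||_1: shrink each entry toward zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, alpha=0.01, T=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(T):
        error = X @ w + b - y
        # Gradient step on the MSE part, then soft-threshold for the L1 part
        w = soft_threshold(w - alpha * (2 / n) * X.T @ error, alpha * lam)
        b -= alpha * (2 / n) * error.sum()   # bias is not regularized
    return w, b

# Feature 1 is irrelevant; lasso should zero out its weight
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + 0.5
w, b = lasso_ista(X, y, lam=0.5)
print(np.round(w, 2), round(b, 2))  # w[1] ends up exactly 0
```

Unlike a plain subgradient step, the soft-threshold operator produces exact zeros, which is what gives lasso its feature-selection behavior.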

Step 7: Extend with polynomial features if non-linearity is evident

  • Add x^2 (and possibly x^3) terms to features. Retrain and re-tune λ (polynomial features can increase overfitting risk).

Step 8: Evaluate on new inputs

  • Use f(x) = w^T x + b for predictions. Compare to known values when available to estimate generalization quality.

Tips and Warnings

  • Learning rate pitfalls: Too large causes divergence (loss shoots up or becomes NaN); too small wastes time. Adjust promptly based on loss behavior.
  • Outlier sensitivity: Because errors are squared, a few extreme points can dominate the fit. Consider outlier detection or robust preprocessing.
  • Regularization choice: L1 is good for feature selection and sparse models; L2 stabilizes and spreads shrinkage. λ must be tuned for balance.
  • Feature scaling: While not explicitly required in the lecture, scaling features can help gradient descent converge more smoothly, especially when feature magnitudes differ widely.
  • Bias handling: Typically, do not regularize b; regularization focuses on limiting feature weights.
  • Stopping criteria: Use thresholds on loss decrease or gradient norm to avoid unnecessary extra iterations.
  • Non-linearity cues: U-shapes, saturations, or bends in scatter plots suggest adding polynomial features or considering non-linear models.
  • Initialization: Zeros are fine for linear regression; complex initializations are unnecessary here.
  • Monitoring: Plot training loss over iterations to visually confirm steady descent; sudden spikes indicate α is too high.

Why These Pieces Fit Together

  • The linear model defines how predictions are made from parameters. The MSE loss defines what it means to be “good.” Gradients tell us how to change parameters to improve. The learning rate determines how fast (and safely) we change. Regularization ensures the model doesn’t just memorize training quirks. Polynomial expansion gives flexibility to capture curves while staying within the linear-in-parameters framework. Combined, they form a practical, end-to-end recipe for building strong baseline regressors.

04 Examples

  • 💡

    House Price Prediction Setup: Input features include square footage, number of bedrooms, and location; the target is the sale price. The model computes ŷ = w^T x + b. Training minimizes MSE by adjusting w and b via gradient descent. Key point: A positive weight on square footage raises predicted prices, showing interpretability.

  • 💡

    Gradient Descent Step-by-Step: Start with w = 0, b = 0. Compute predictions ŷ, find errors ŷ − y, compute gradients, and update parameters with learning rate α. Repeat until the loss barely changes. Key point: This loop is the practical engine that learns from data.

  • 💡

    Learning Rate Too Large: Suppose α = 1.0 and the loss spikes higher with each iteration. The updates overshoot the minimum, making the process unstable. Reducing α to 0.01 stabilizes training and steadily decreases the loss. Key point: Proper α selection is essential for convergence.

  • 💡

    Learning Rate Too Small: With α = 1e−6, the loss decreases, but only microscopically per iteration. Training takes far too long to reach a useful solution. Increasing α to 0.01 speeds progress dramatically. Key point: Don’t crawl downhill; choose a rate that makes steady progress.

  • 💡

    Electricity Demand vs. Temperature: Demand is high on very hot and very cold days and lower on mild days, making a U-shaped curve. A plain linear model fits poorly here. Adding a quadratic term (temperature^2) enables the model to capture the U-shape. Key point: Polynomial features let linear regression fit curved relationships.

  • 💡

    Tree Age vs. Height: Young trees grow quickly, then growth slows, forming a curve that flattens over time. A linear fit misses this bend. Adding polynomial terms captures the early rapid rise and later slowdown. Key point: Non-linear growth patterns are common and need curved models.

  • 💡

    Effect of Outliers on MSE: One mansion priced far above others heavily influences the MSE and skews the fitted line upward. Most normal homes then get over-predicted. Removing or capping outliers brings the line back to a fairer fit. Key point: Squared errors make outlier handling important.

  • 💡

    L2 Regularization in Practice (Ridge): Train with λ = 0.1; weights shrink and predictions become less wiggly. The model generalizes better to new houses by avoiding over-reliance on any single feature. Increasing λ to 10 may underfit by over-shrinking. Key point: Tuning λ balances bias and variance.

  • 💡

    L1 Regularization in Practice (Lasso): With λ = 0.05, several feature weights go exactly to zero. The model becomes simpler and highlights the most informative inputs. If λ is too high, too many weights drop, and the model underfits. Key point: L1 doubles as feature selection.

  • 💡

    Bias Term’s Role: Without b, the line must pass through the origin, which rarely makes sense in pricing. Adding b allows a baseline house price independent of features. This improves fit and reduces systematic error. Key point: The intercept captures the base level of the target.

  • 💡

    Monitoring Convergence: Track MSE over iterations: it should decrease smoothly. If it oscillates or increases, the learning rate is likely too high. If it flattens early, consider lowering α gradually (schedule) or adding more iterations. Key point: Visual monitoring prevents wasted training time.

  • 💡

    Polynomial Degree Choice: Degree 2 (quadratic) often captures simple curves like U-shapes. Degree 3 (cubic) can model more complex bends but risks overfitting. Regularization becomes more important as degree rises. Key point: Start low and increase degree only as needed.

  • 💡

    Ridge vs. Lasso on Correlated Features: With highly correlated house features (e.g., lot size and total square footage), ridge spreads weight across them. Lasso may pick one and zero out the other. Both reduce overfitting but yield different interpretations. Key point: Choose based on whether you prefer sharing or selecting.

  • 💡

    Simple Training Loop Pseudocode: y_hat = Xw + b; error = y_hat − y; grad_w = (2/n) X^T error (+ reg); grad_b = (2/n) sum(error); update w, b by subtracting learning rate times gradients. Repeat until change in loss is tiny. Key point: This minimal loop captures the whole learning process.

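The pseudocode above translates almost line-for-line into NumPy; this sketch runs it on synthetic data (the weights, learning rate, and stopping threshold are illustrative choices, not values from the lecture):

```python
import numpy as np

# Synthetic regression data with known parameters.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b + rng.normal(0, 0.1, size=n)

w, b = np.zeros(d), 0.0
lr, lam = 0.1, 0.0           # learning rate alpha; set lam > 0 for L2 regularization
prev_loss = np.inf
for step in range(10_000):
    y_hat = X @ w + b                      # predict
    error = y_hat - y                      # measure
    loss = np.mean(error ** 2)             # MSE
    if prev_loss - loss < 1e-10:           # stop when improvement is tiny
        break
    prev_loss = loss
    grad_w = (2 / n) * X.T @ error + 2 * lam * w   # the "(+ reg)" term
    grad_b = (2 / n) * np.sum(error)
    w -= lr * grad_w                       # update
    b -= lr * grad_b

print(step, loss)   # loss settles near the noise floor
print(w, b)         # close to true_w and true_b
```

Tracking `loss` each iteration gives exactly the convergence curve described earlier: it should fall smoothly, and the loop exits once improvements become negligible.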
05 Conclusion

This lecture laid out a complete, practical path to building a regression model that predicts real numbers from input features. You started with supervised learning, where labeled pairs (x, y) train a function to generalize to new inputs. Linear regression provided the foundational model f(x) = w^T x + b, and Mean Squared Error gave a clear objective to minimize. Gradient descent supplied the iterative method for tuning weights and bias, with the learning rate as a crucial hyperparameter controlling the speed and stability of learning.

You learned that linear regression is simple, fast, and interpretable—often the best baseline to try first. But it assumes linearity, is sensitive to outliers, and can overfit, especially when data is noisy or complex. Regularization addresses overfitting by penalizing large weights: L2 (ridge) shrinks them smoothly to stabilize the model, while L1 (lasso) drives many to zero, performing feature selection. The hyperparameter lambda governs how aggressively the model is simplified. To account for curved relationships, polynomial regression expands the feature set with powers like x^2, allowing the method to capture U-shapes and slowdowns without abandoning linear-in-parameter simplicity.

Practically, the learning loop is straightforward: predict, measure error with MSE, compute gradients, and update parameters until improvements taper off. Monitoring the loss guides learning rate adjustments and signals convergence. Outlier handling and thoughtful choices about regularization and polynomial degree meaningfully improve results. These ideas set the stage for logistic regression next, where the same optimization mindset applies to classification problems with different loss functions.

To solidify your understanding, implement a small project: predict housing prices using linear regression with and without regularization, try adding a quadratic feature, and experiment with learning rate schedules. Pay attention to how the loss behaves and how λ tunes complexity. The core message to remember: start simple, measure with a clear loss, optimize carefully, control complexity with regularization, and extend the model only when the data’s shape calls for it. Mastering this workflow equips you for a wide range of predictive tasks and prepares you to step confidently into more advanced models.

Key Takeaways

  • Start with a clear mapping from inputs to outputs: define features x and target y precisely. Clean up obvious outliers since MSE heavily penalizes large errors. Organize your data into a matrix X (rows are examples, columns are features). This structure makes training and debugging smoother.
  • Use the simplest effective model first: linear regression with bias. Its formula f(x) = w^T x + b is interpretable and fast to train. If it performs well, you save time by avoiding unnecessary complexity. If not, you’ll have a clear baseline to beat.
  • Minimize Mean Squared Error to train the model. MSE averages squared residuals, accentuating large mistakes so the model pays attention to them. Monitor MSE each iteration to confirm learning progresses. If MSE increases or oscillates, adjust the learning rate.
  • Implement gradient descent in a tight, efficient loop. Compute predictions, errors, gradients, and then update parameters. Keep track of loss to gauge progress and detect issues quickly. Stop when improvements are consistently tiny.
  • Tune the learning rate carefully; it is the most sensitive hyperparameter in basic training. Too large causes divergence; too small wastes time. Start modestly, then adjust based on the shape of the loss curve. Consider a simple schedule to reduce the rate as training settles.
  • Include a bias term to avoid forcing the fit through the origin. The intercept captures the baseline level of your target. Omitting it can add systematic error across all predictions. Always check that b is learned and makes sense.
  • Regularize when you see signs of overfitting or instability. L2 (ridge) shrinks weights smoothly and helps with correlated features. L1 (lasso) zeros many weights, acting as feature selection. Choose λ by trying a range and watching generalization behavior.
  • Use polynomial features when patterns are clearly curved. A quadratic term captures U-shapes like electricity demand vs. temperature. Higher degrees can fit complex curves but raise overfitting risk. Pair higher degrees with stronger regularization.
  • Handle outliers before or during modeling to avoid skewed fits. Consider capping extremes or analyzing leverage points. Since MSE magnifies their influence, a few odd data points can ruin the line for the majority. Robust preprocessing pays off.
  • Track convergence with clear criteria to avoid over-training. Stop when loss decreases fall below a threshold for several steps. This prevents wasted computation and guards against numerical drift. It also signals when hyperparameters might need adjustment.
  • Prefer L1 when you want interpretability and automatic feature selection. Sparsity simplifies communication with stakeholders and can reduce maintenance. Be mindful that too aggressive λ may drop useful features. Validate that predictive performance remains strong.
  • Prefer L2 when you want stability and smooth shrinkage across features. It’s especially helpful with multicollinearity where features overlap in information. Expect fewer exact zeros but more balanced weights. This often yields better numerical behavior.
  • Experiment with learning rate schedules or Adam for smoother training. Start with a fixed rate, then lower it as progress slows. Adaptive methods can remove some trial-and-error. Always verify they produce stable, decreasing loss.
  • Incrementally extend your model only when needed. Add polynomial terms based on visible non-linear patterns. Each extension should be justified by data behavior and validated by performance. Keep complexity in check with regularization.
  • Document your choices: α, λ, polynomial degree, and preprocessing steps. This makes results reproducible and debuggable. Clear records help you and teammates understand what worked and why. It accelerates future iterations.
  • Interpret learned weights to gain domain insights. Positive or negative signs and magnitudes reveal how features influence the target. Watch for suspiciously large weights that may indicate scaling issues or overfitting. Use insights to refine features and models.

Glossary

Supervised learning

A way to teach a model using examples that include both inputs and the correct outputs. The model learns patterns that connect inputs to outputs. After training, it can predict outputs for new inputs it has never seen. It’s like a teacher giving practice problems with answer keys.

Regression

A type of supervised learning where the output is a real number. It focuses on predicting continuous values instead of categories. Examples include prices, temperatures, or speeds. It answers questions like 'How much?' or 'How many?'

Linear regression

A simple model that predicts a number by drawing a straight line (or flat plane) through data points. It assumes the relationship between inputs and output is linear. The formula is f(x) = w^T x + b. It is easy to train and understand.

Feature

A measurable piece of information used as input to a model. Features are the ingredients that help the model make predictions. They are usually numbers in a vector. Good features carry useful signals about the target.

Target (label)

The output value the model is trying to predict. In regression, it’s a real number. During training, targets are known and guide learning. The model aims to match targets closely.

Weight (w)

A parameter that tells how strongly a feature affects the prediction. Larger absolute weights mean the feature has more influence. Positive weights push predictions up; negative weights push them down. Weights are learned from data.

Bias (b, intercept)

A constant term added to the prediction that shifts the line up or down. It lets the model fit data that doesn’t pass through the origin. Without it, many datasets would fit poorly. The bias is learned like the weights.

Prediction (y-hat)

The model’s guess for the target given an input. It is computed using the current weights and bias. In linear regression, ŷ = w^T x + b. Predictions are compared to true targets to measure error.

