L2 Regularization (Ridge/Weight Decay)
Key Points
- L2 regularization (also called ridge or weight decay) adds a penalty proportional to the sum of squared weights to discourage large parameters.
- It acts like a soft pull (a spring) that shrinks coefficients toward zero without setting many exactly to zero, improving generalization.
- In linear regression, ridge has a closed-form solution via the normal equations with a λI term, which also stabilizes an ill-conditioned X^T X.
- The gradient of the L2 penalty is simply λw, so SGD updates become weight decay steps: w ← (1 − ηλ)w − η∇L(w).
- Bias terms (intercepts) are usually not regularized; standardizing features before applying L2 is important to balance the penalty.
- L2 has a Bayesian view: it is equivalent to a Gaussian prior on the weights and yields a MAP estimate.
- Use L2 when you want smooth shrinkage, to handle multicollinearity, reduce variance, and improve numerical stability.
- Choose λ via cross-validation; too large over-smooths (underfits), too small barely helps (overfits).
Prerequisites
- Linear algebra (vectors, matrices, matrix multiplication) — Ridge uses X^T X, linear systems, and norms, which require matrix operations.
- Differential calculus and gradients — L2 changes gradients by adding λw; understanding updates requires derivatives.
- Least squares and ordinary linear regression — Ridge is a regularized variant of least squares with a modified normal equation.
- Optimization algorithms (SGD, gradient descent) — In practice, L2 is implemented as weight decay in first-order methods.
- Feature scaling/standardization — L2 penalizes coefficients, so scaling ensures fair and stable shrinkage across features.
- Model selection and cross-validation — Choosing λ requires validation to balance bias and variance.
- Probability and MAP estimation — L2 corresponds to a Gaussian prior on parameters; MAP connects regularization and Bayesian views.
Detailed Explanation
Overview
Hook: Imagine training a model that fits your training data perfectly but fails miserably on new data—the classic overfitting problem. What if you could add a gentle brake that discourages wild parameter values without forcing them to be exactly zero? Concept: L2 regularization adds a penalty to the objective that grows with the square of each weight, nudging the model toward smaller, more conservative parameters. This reduces variance, improves generalization, and makes solutions more stable when features are correlated. In linear regression, this is called ridge regression; in gradient-based training (including deep learning), the equivalent update form is weight decay. Example: Suppose two features are nearly duplicates. Ordinary least squares can swing coefficients to large, opposite values to fit noise. Ridge balances them by shrinking both moderately, producing a more reliable predictor with smaller coefficients and better test performance.
Intuition & Analogies
Hook: Think of your model’s weights like knobs on a stereo system. If you let every knob crank up freely, you might amplify both music and static. Adding L2 regularization is like attaching each knob to a soft spring pulling it back toward zero: you still get volume where needed, but the system resists extremes. Concept: Squaring the weights makes the penalty grow faster for large values, so very large weights are strongly discouraged, while small weights are only lightly touched. Unlike L1 (which is like a budget that can force exact zeros), L2 acts smoothly and continuously, spreading the shrinkage across all weights. Example: Picture fitting a line to noisy data. Without regularization, the slope might tilt sharply to chase noise points. With L2, the slope and intercept are tugged back a bit (usually leaving the intercept unpenalized), giving a line that may fit slightly worse on training data but will be more robust on new data. Another analogy: friction in a physical system—L2 adds a uniform drag that prevents parameters from accelerating to large magnitudes; in optimization steps, this appears as weight decay, gradually fading weights unless there is consistent gradient signal to sustain them.
Formal Definition
Given a data loss L(w) (for example, squared error or cross-entropy), L2 regularization minimizes the penalized objective J(w) = L(w) + (λ/2)∥w∥^2, where ∥w∥^2 = Σ_j w_j^2 and λ ≥ 0 controls the strength of the penalty. For linear regression this is ridge regression: minimize ∥Xw − y∥^2 + λ∥w∥^2, whose unique minimizer is w = (X^T X + λI)^{-1} X^T y for any λ > 0. The intercept is typically excluded from the penalty.
When to Use
Hook: If your model performs great on training data but degrades on validation, ask whether you need a gentle constraint. Concept: Use L2 regularization when you want to reduce variance without inducing sparsity, especially with correlated features or high-dimensional inputs. It is also effective when X^T X is ill-conditioned: adding λI improves conditioning and numerical stability. Use cases: (1) Ridge regression for tabular data with multicollinearity, (2) Text or polynomial features where many small coefficients are preferable to a few large ones, (3) Deep learning as weight decay to curb exploding weights and improve generalization, (4) Online learning with SGD where L2 provides continuous shrinkage during updates. Example: In predicting house prices with many overlapping features (square footage, total rooms, bedrooms), ridge keeps coefficients modest and reduces sensitivity to noise; in a neural network, adding weight decay to linear layers prevents weight magnitudes from drifting upward over long training.
Common Mistakes
Hook: L2 is simple to add but surprisingly easy to misuse. Concept: Common pitfalls include (1) not standardizing features—L2 penalizes coefficients, so unscaled features skew shrinkage; (2) regularizing the bias—penalizing the intercept often hurts performance; (3) picking λ by guesswork—use cross-validation; (4) mixing conventions—forgetting whether λ couples with a 1/n factor in your loss, leading to mismatched strengths; (5) confusing λw vs. 2λw gradients—using λ∥w∥^2/2 gives gradient λw; (6) assuming L2 gives sparsity—it rarely sets weights exactly to zero; (7) with adaptive optimizers, equating L2 with weight decay—AdamW uses decoupled weight decay, which behaves better than naive L2. Example: A practitioner feeds raw dollar amounts and counts into a ridge model without scaling; the large-scale feature gets over-penalized into near-zero influence while small-scale features dominate. Another example: enabling regularization on the bias collapses predictions toward zero mean, degrading fit.
Key Formulas
Regularized Objective
J(w) = L(w) + (λ/2)∥w∥^2
Explanation: The total objective equals the data loss plus half lambda times the squared L2 norm of the weights. The 1/2 factor makes the gradient of the penalty simply λw.
L2 Norm
∥w∥_2 = sqrt(Σ_j w_j^2), so ∥w∥^2 = Σ_j w_j^2
Explanation: The L2 norm measures the Euclidean length of the vector. The squared L2 norm is used in L2 regularization to penalize large coefficients.
Gradient with L2
∇J(w) = ∇L(w) + λw
Explanation: The gradient of the regularized objective is the sum of the data-loss gradient and λ times the weight vector. This is what drives weight decay in first-order methods.
Weight Decay Update
w ← (1 − ηλ)w − η∇L(w)
Explanation: In SGD with step size η, L2 adds a multiplicative shrinkage (1 − ηλ) to the weights each step, plus the usual data-gradient step. This is the computational view of L2 in iterative training.
Ridge Objective (Scaled)
min_w (1/(2n))∥Xw − y∥^2 + (λ/2)∥w∥^2
Explanation: Ridge regression minimizes mean squared error with an L2 penalty. Different libraries absorb factors of n into λ; be consistent when comparing values.
Ridge Normal Equations
w = (X^T X + λI)^{-1} X^T y
Explanation: Adding λI to X^T X yields a well-conditioned linear system. Solving it gives the ridge coefficients in closed form (up to scaling conventions).
Hat Matrix and Degrees of Freedom
H(λ) = X(X^T X + λI)^{-1} X^T, df(λ) = tr(H(λ))
Explanation: The hat matrix maps the targets to the fitted values. Its trace defines the effective degrees of freedom, which decrease as λ increases.
Logistic Regression with L2
min_w Σ_i log(1 + exp(−y_i w^T x_i)) + (λ/2)∥w∥^2
Explanation: For binary labels y_i ∈ {−1, +1}, the logistic loss plus an L2 penalty yields better generalization and smoother decision boundaries.
Gaussian Prior Equivalence
w ~ N(0, τ^2 I) ⇒ MAP penalty (1/(2τ^2))∥w∥^2
Explanation: Placing a zero-mean isotropic Gaussian prior on the weights leads to an L2 penalty in the MAP objective, with λ = 1/τ^2 (up to scaling of the data term).
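The prior-to-penalty correspondence can be written out explicitly (a standard derivation; the noise variance σ^2 is kept visible, which is why λ is only determined up to the scaling of the data term):

```latex
% MAP estimation with Gaussian likelihood and Gaussian prior:
\hat{w}_{\mathrm{MAP}} = \arg\max_w \; p(y \mid X, w)\, p(w),
\qquad p(w) = \mathcal{N}(0, \tau^2 I),
\qquad y_i \mid x_i, w \sim \mathcal{N}(w^\top x_i, \sigma^2)
% Taking negative logs and dropping constants:
\hat{w}_{\mathrm{MAP}} = \arg\min_w \;
  \frac{1}{2\sigma^2}\lVert Xw - y \rVert_2^2
  + \frac{1}{2\tau^2}\lVert w \rVert_2^2
% Multiplying through by \sigma^2 gives the ridge objective with
\lambda = \frac{\sigma^2}{\tau^2}
```

With the data term left unscaled (σ^2 = 1), this reduces to λ = 1/τ^2: a tighter prior (smaller τ) means a larger penalty.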
Tikhonov Regularization
min_w ∥Xw − y∥^2 + ∥Γw∥^2
Explanation: Generalizes ridge by penalizing ∥Γw∥_2^2, allowing you to shape which directions in parameter space are shrunk more. Ridge is the special case Γ = sqrt(λ)·I.
Complexity Analysis
Closed-form ridge via the normal equations costs O(nd^2) to form X^T X, plus O(d^3) for the Cholesky factorization and solves; memory is O(d^2). SGD with weight decay costs O(nd) per epoch with only O(d) memory beyond the data, so it scales better when d is large or data arrive in a stream.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute X^T X (d x d) and X^T y (d)
static void computeXtX_Xty(const vector<vector<double>>& X, const vector<double>& y,
                           vector<vector<double>>& XtX, vector<double>& Xty) {
    size_t n = X.size();
    size_t d = X[0].size();
    XtX.assign(d, vector<double>(d, 0.0));
    Xty.assign(d, 0.0);
    for (size_t i = 0; i < n; ++i) {
        const vector<double>& xi = X[i];
        double yi = y[i];
        for (size_t a = 0; a < d; ++a) {
            Xty[a] += xi[a] * yi;
            double xia = xi[a];
            for (size_t b = 0; b < d; ++b) {
                XtX[a][b] += xia * xi[b];
            }
        }
    }
}

// Cholesky decomposition for SPD matrix A: A = L L^T (lower triangular L)
static bool choleskyDecompose(const vector<vector<double>>& A, vector<vector<double>>& L) {
    size_t n = A.size();
    L.assign(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve A x = b given Cholesky L such that A = L L^T
static vector<double> choleskySolve(const vector<vector<double>>& L, const vector<double>& b) {
    size_t n = L.size();
    vector<double> y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = (int)n - 1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Fit ridge regression: minimize 1/2 ||X w - y||^2 + (lambda/2) ||w||^2
// Optionally exclude an index (bias_index) from regularization by not adding lambda to its diagonal.
static vector<double> ridge_fit(const vector<vector<double>>& X, const vector<double>& y,
                                double lambda, int bias_index = -1) {
    vector<vector<double>> XtX; vector<double> Xty;
    computeXtX_Xty(X, y, XtX, Xty);

    size_t d = XtX.size();
    for (size_t i = 0; i < d; ++i) {
        if ((int)i == bias_index) continue; // do not penalize bias
        XtX[i][i] += lambda;
    }

    vector<vector<double>> L;
    if (!choleskyDecompose(XtX, L)) {
        throw runtime_error("Matrix not SPD; increase lambda or check data.");
    }
    return choleskySolve(L, Xty);
}

// Predict y = X w
static vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    vector<double> yhat(X.size(), 0.0);
    for (size_t i = 0; i < X.size(); ++i) {
        double s = 0.0;
        for (size_t j = 0; j < w.size(); ++j) s += X[i][j] * w[j];
        yhat[i] = s;
    }
    return yhat;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a toy dataset with a bias column (first column of ones)
    // True model: y = 3 + 2*x1 - x2 + noise
    int n = 200, d = 3; // [1, x1, x2]
    vector<vector<double>> X(n, vector<double>(d, 1.0));
    vector<double> y(n);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 1.0);
    uniform_real_distribution<double> unif(-3.0, 3.0);
    for (int i = 0; i < n; ++i) {
        double x1 = unif(rng);
        double x2 = unif(rng) + 0.5 * x1; // introduce correlation
        X[i][1] = x1;
        X[i][2] = x2;
        y[i] = 3.0 + 2.0 * x1 - 1.0 * x2 + noise(rng);
    }

    double lambda = 10.0; // regularization strength
    // bias is column 0; do not penalize it
    vector<double> w = ridge_fit(X, y, lambda, /*bias_index=*/0);

    // Report weights
    cout << fixed << setprecision(4);
    cout << "Ridge weights (bias, w1, w2): ";
    for (double wi : w) cout << wi << ' ';
    cout << "\n";

    // Evaluate training RMSE
    vector<double> yhat = predict(X, w);
    double se = 0.0;
    for (int i = 0; i < n; ++i) se += (yhat[i] - y[i]) * (yhat[i] - y[i]);
    cout << "Train RMSE: " << sqrt(se / n) << "\n";
}
This program fits ridge regression using the closed-form normal equations with a Cholesky factorization. We build X^T X and X^T y, add λ to the diagonal (skipping the bias index to avoid penalizing the intercept), and solve (X^T X + λI)w = X^T y. The toy data introduces correlated features to highlight ridge’s stabilizing effect.
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Train logistic regression with L2 using SGD/minibatch weight decay
struct LogisticL2 {
    vector<double> w; // weights for features (excluding bias)
    double b = 0.0;   // bias (unpenalized)

    // Initialize weights to zeros
    explicit LogisticL2(size_t d) : w(d, 0.0), b(0.0) {}

    // One training epoch over data (minibatch SGD)
    void train_epoch(const vector<vector<double>>& X, const vector<int>& y,
                     double lr, double lambda, size_t batch_size = 32, mt19937* rng = nullptr) {
        size_t n = X.size();
        vector<size_t> idx(n);
        iota(idx.begin(), idx.end(), 0);
        if (rng) shuffle(idx.begin(), idx.end(), *rng);

        for (size_t start = 0; start < n; start += batch_size) {
            size_t end = min(n, start + batch_size);
            // Accumulate gradients on minibatch
            vector<double> gw(w.size(), 0.0);
            double gb = 0.0;
            for (size_t ii = start; ii < end; ++ii) {
                size_t i = idx[ii];
                // Model prediction: p = sigma(w^T x + b)
                double z = b;
                for (size_t j = 0; j < w.size(); ++j) z += w[j] * X[i][j];
                double p = sigmoid(z);
                // Labels y are in {0,1}; gradient of average cross-entropy: (p - y)
                double diff = p - static_cast<double>(y[i]);
                for (size_t j = 0; j < w.size(); ++j) gw[j] += diff * X[i][j];
                gb += diff;
            }
            double m = static_cast<double>(end - start);
            for (size_t j = 0; j < w.size(); ++j) gw[j] /= m;
            gb /= m;

            // Decoupled weight decay (equivalent to L2 for SGD): shrink weights, not bias
            for (size_t j = 0; j < w.size(); ++j) w[j] *= (1.0 - lr * lambda);

            // Gradient step
            for (size_t j = 0; j < w.size(); ++j) w[j] -= lr * gw[j];
            b -= lr * gb; // do not decay bias
        }
    }

    // Predict probability p(y=1|x)
    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict_label(const vector<double>& x, double thresh = 0.5) const {
        return predict_proba(x) >= thresh ? 1 : 0;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate a toy binary classification dataset in 2D
    int n = 1000; int d = 2;
    vector<vector<double>> X(n, vector<double>(d));
    vector<int> y(n);

    mt19937 rng(123);
    normal_distribution<double> ga(0.0, 1.0), gb(0.0, 1.0), noise(0.0, 0.5);
    // Two clusters separated roughly by a line
    for (int i = 0; i < n; ++i) {
        if (i < n/2) {
            X[i][0] = ga(rng) - 2.0; X[i][1] = ga(rng);
            y[i] = 0;
        } else {
            X[i][0] = gb(rng) + 2.0; X[i][1] = gb(rng);
            y[i] = 1;
        }
        // Add some noise to make it non-trivial
        X[i][0] += noise(rng); X[i][1] += noise(rng);
    }

    // Standardize features for fair L2 penalization
    for (int j = 0; j < d; ++j) {
        double mean = 0.0; for (int i = 0; i < n; ++i) mean += X[i][j]; mean /= n;
        double var = 0.0; for (int i = 0; i < n; ++i) { double t = X[i][j]-mean; var += t*t; }
        double stdv = sqrt(var / n + 1e-12);
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }

    LogisticL2 clf(d);
    double lr = 0.1, lambda = 0.01;
    for (int epoch = 0; epoch < 30; ++epoch) {
        clf.train_epoch(X, y, lr, lambda, 64, &rng);
    }

    // Evaluate accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) correct += (clf.predict_label(X[i]) == y[i]);
    cout << fixed << setprecision(4);
    cout << "Train accuracy: " << (100.0 * correct / n) << "%\n";
    cout << "Weights: "; for (double wi : clf.w) cout << wi << ' '; cout << "| bias: " << clf.b << "\n";
}
This example trains a logistic regression classifier using minibatch SGD with L2 regularization implemented as weight decay. We standardize features, shrink weights each step by (1−ηλ), then apply the gradient of the cross-entropy loss. The bias term is not penalized. The code demonstrates how L2 appears as a simple multiplicative decay in iterative optimization.