
L1 Regularization (Lasso)

Key Points

  • L1 regularization (Lasso) adds a penalty \(\lambda \sum_{i=1}^{p} |w_i|\) to the loss, which pushes many coefficients exactly to zero and performs feature selection.
  • The key computational tool behind Lasso is the soft-thresholding (shrinkage) operator, which zeroes small coefficients and reduces large ones by a constant amount.
  • For squared-error regression, efficient solvers include coordinate descent and proximal gradient (ISTA), both of which have per-iteration cost roughly O(n p) for dense data.
  • Standardizing features and not penalizing the intercept are crucial practical steps to make Lasso behave well and to choose \(\lambda\) meaningfully.
  • L1 is convex but not differentiable at zero; we use subgradients or proximal operators to optimize it reliably.
  • Lasso works best when you expect only a small subset of features to be relevant, especially in high-dimensional settings.
  • Choosing \(\lambda\) is typically done via cross-validation or information criteria; larger \(\lambda\) yields sparser models but more bias.
  • If features are highly correlated, Lasso may arbitrarily pick one; elastic net (L1+L2) is more stable in such cases.

Prerequisites

  • Linear algebra (vectors, matrices, norms) — Understanding matrix–vector products, norms, and eigenvalues is essential for expressing the Lasso objective and gradients.
  • Convex optimization basics — Lasso is a convex but non-smooth problem; subgradients, proximal operators, and KKT conditions underpin the algorithms.
  • Calculus and gradients — Computing and interpreting gradients of the squared loss is necessary for ISTA and analysis.
  • Probability and statistics — Noise models, mean squared error, and cross-validation are important for modeling and hyperparameter selection.
  • C++ programming (vectors, loops, numerics) — Implementing iterative solvers efficiently and correctly requires familiarity with arrays, loops, and numeric stability.
  • Data preprocessing (centering, scaling) — Feature standardization strongly affects L1 behavior and the practical choice of \(\lambda\).

Detailed Explanation


01 Overview

L1 regularization, widely known as Lasso (Least Absolute Shrinkage and Selection Operator), augments a predictive model’s loss function with the sum of absolute values of the weights. For linear regression with parameters w and intercept b, Lasso solves a convex optimization problem that trades off goodness of fit against model simplicity. The penalty term \(\lambda \lVert w \rVert_1 = \lambda \sum_i |w_i|\) encourages sparsity: many coefficients become exactly zero. This property makes Lasso a powerful, built-in feature selection method, particularly useful when the number of features p is large compared to the number of samples n, or when interpretability matters.

Unlike L2 (ridge) regularization, which spreads shrinkage smoothly across all coefficients, L1 has a kink at zero, creating non-differentiability and enabling coefficients to be set exactly to zero. Optimization leverages subgradient methods or proximal operators, most notably the soft-thresholding function. Practical algorithms include coordinate descent, which updates one coefficient at a time via closed-form soft-thresholding, and proximal gradient (ISTA), which takes gradient steps on the smooth part followed by soft-thresholding. Careful preprocessing—centering/standardizing features and leaving the intercept unpenalized—makes solutions numerically stable and makes \(\lambda\) meaningfully comparable across problems.

02 Intuition & Analogies

Imagine you’re packing a suitcase under a baggage fee that charges in proportion to each item’s size. If a small item adds little value, paying its fee isn’t worth it, so you drop it and keep only the most useful items. L1 regularization acts similarly on model coefficients: it adds a “tax” proportional to the absolute size of each coefficient. If a coefficient’s benefit to prediction is smaller than its tax, the optimizer sets it to zero—leaving only a handful of valuable features in the model.

Another analogy is cleaning background noise from an audio track using a gate. The gate mutes any signal below a threshold, but lets louder signals through (with a slight volume reduction). Soft-thresholding, the proximal operator of L1, does exactly this to coefficients: numbers with magnitude below a threshold become zero; larger ones are reduced by a constant amount. This is different from L2 (ridge), which is more like turning down the overall volume smoothly—everything gets a bit quieter but nothing is fully muted.

From a geometric point of view, the L1 penalty forms a diamond-shaped constraint region (an (\ell_1) ball). When this region touches the elliptical contours of the loss, the corners of the diamond often line up with axes, making it likely that some coordinates are exactly zero. That geometric “corner effect” explains why L1 favors sparse solutions while remaining convex and computationally tractable.

03 Formal Definition

Given data matrix \(X \in \mathbb{R}^{n \times p}\) (rows are samples), response vector \(y \in \mathbb{R}^{n}\), parameters \(w \in \mathbb{R}^{p}\), and intercept \(b \in \mathbb{R}\), the Lasso for least squares solves \[ \min_{w,b} \; \frac{1}{2n} \lVert y - (Xw + b\mathbf{1}) \rVert_2^2 \; + \; \lambda \lVert w \rVert_1, \] where \(\lVert w \rVert_1 = \sum_{i=1}^{p} |w_i|\), \(\lambda \ge 0\) is a regularization parameter, and \(\mathbf{1}\) is an all-ones vector. The objective is convex but non-differentiable at coordinates where \(w_i = 0\). The subgradient optimality conditions are \[ 0 \in \frac{1}{n} X^\top (Xw + b\mathbf{1} - y) + \lambda\, \partial \lVert w \rVert_1, \qquad 0 = \frac{1}{n} \mathbf{1}^\top (Xw + b\mathbf{1} - y). \] Here, \(\partial \lVert w \rVert_1\) is the Cartesian product of subgradients of each \(|w_i|\). The proximal operator of \(\lambda \lVert w \rVert_1\) is the soft-thresholding map \(S_{\lambda}(z)_i = \operatorname{sign}(z_i) \max(|z_i| - \lambda, 0)\). Standard solvers include coordinate descent with updates \[ w_j \leftarrow \frac{1}{\lVert X_j \rVert_2^2} \, S_{\lambda}\!\left( \sum_{i=1}^{n} x_{ij} \Big( y_i - b - \sum_{k \ne j} x_{ik} w_k \Big) \right), \] and proximal gradient (ISTA): \(w^{k+1} = S_{\lambda t}(w^{k} - t \nabla f(w^{k}))\) with step size \(t \le 1/L\), where \(f(w) = \frac{1}{2n} \lVert y - Xw \rVert_2^2\) and \(L\) is the Lipschitz constant of \(\nabla f\).

04 When to Use

Use Lasso when you expect sparsity—only a small subset of features truly matters—or when interpretability and automatic feature selection are important. It is effective in high-dimensional settings (p comparable to or greater than n), such as genomics, text with bag-of-words features, or sensor selection. Lasso can reduce variance and improve generalization when unregularized models overfit.

Prefer Lasso over ridge when your goal is to zero out irrelevant features; prefer ridge when all features contribute a bit and multicollinearity is high. If features are strongly correlated and you want grouped behavior, consider elastic net (a mixture of L1 and L2) to avoid Lasso’s tendency to select a single representative feature from a correlated cluster. Lasso also integrates well with pipelines that require model compression or fast inference, since many weights become zero and can be pruned.

Choose \(\lambda\) by cross-validation or criteria like AIC/BIC; larger \(\lambda\) increases sparsity but also bias. Always standardize/center features and avoid penalizing the intercept. For non-squared losses (e.g., logistic), L1 still works via proximal gradient; coordinate descent remains effective when individual coordinate updates have closed forms or cheap 1D solves.

⚠️Common Mistakes

  • Not standardizing features: Without centering and scaling, features with large units dominate the penalty, leading to arbitrary selections. Always center columns and scale to unit variance (or equal \(\ell_2\) norm).
  • Penalizing the intercept: The bias term should typically be unpenalized; penalizing it can shift predictions and degrade fit.
  • Mis-tuning \(\lambda\): Too small gives overfitting with little sparsity; too large collapses the model to zeros. Use cross-validation and, if necessary, warm starts along a \(\lambda\)-path.
  • Using plain gradient descent: L1 is non-differentiable; use subgradient, coordinate descent, or proximal gradient (ISTA/FISTA). Forgetting the proximal step results in incorrect updates.
  • Ignoring correlated features: Lasso may pick one feature from a correlated group unpredictably. If stability is desired, use elastic net or preprocessing (e.g., PCA or grouping).
  • Early stopping or loose tolerances: Insufficient convergence checks yield biased estimates. Track objective decrease or parameter change norms.
  • Mishandling the residual in coordinate descent: Recomputing the residual from scratch each step is slow, while long runs of incremental updates can accumulate floating-point error. Maintain a residual vector, update it incrementally, and refresh it occasionally.
  • Comparing coefficients across unscaled problems: The magnitude of \(\lambda\) is not directly comparable across datasets unless features are standardized and the loss is normalized by n.

Key Formulas

Lasso objective (squared loss)

\[ \min_{w,b} \; \frac{1}{2n} \lVert y - (Xw + b\mathbf{1}) \rVert_2^2 + \lambda \lVert w \rVert_1 \]

Explanation: We minimize prediction error plus an L1 penalty on coefficients. The parameter \(λ\) controls the trade-off between fit and sparsity.

L1 norm

\[ \lVert w \rVert_1 = \sum_{i=1}^{p} |w_i| \]

Explanation: The L1 norm sums the absolute values of all coefficients. It encourages many coefficients to be exactly zero when used as a penalty.

Soft-thresholding (prox of L1)

\[ S_{\tau}(z)_i = \operatorname{sign}(z_i) \max(|z_i| - \tau, 0) \]

Explanation: This operator shrinks each component toward zero by \(τ\), setting it to zero if its magnitude is below \(τ\). It is the key building block of L1 optimization.

Subgradient of absolute value

\[ \partial |w_i| = \begin{cases} \{\operatorname{sign}(w_i)\} & w_i \ne 0 \\ [-1,\, 1] & w_i = 0 \end{cases} \]

Explanation: Because |w| is not differentiable at 0, we use its subgradient. Any value between -1 and 1 is valid at zero; elsewhere, the subgradient is the sign.

Stationarity (KKT) for Lasso

\[ 0 \in \frac{1}{n} X^\top (Xw + b\mathbf{1} - y) + \lambda \, \partial \lVert w \rVert_1 \]

Explanation: At the optimum, the negative gradient of the loss must be balanced by a subgradient of the L1 norm scaled by \(λ\). This characterizes which coordinates are zero or nonzero.

Coordinate descent update

\[ \rho_j = \sum_{i=1}^{n} x_{ij} \Big( y_i - b - \sum_{k \ne j} x_{ik} w_k \Big), \qquad w_j \leftarrow \frac{1}{\lVert X_j \rVert_2^2} \, S_{\lambda}(\rho_j) \]

Explanation: Holding other coordinates fixed, the one-dimensional subproblem has a closed-form solution via soft-thresholding. The denominator rescales by the column’s squared norm.

ISTA update (no intercept shown)

\[ w^{k+1} = S_{\lambda t}\!\left( w^{k} - t \, \frac{1}{n} X^\top (X w^{k} - y) \right) \]

Explanation: Each iteration takes a gradient step on the smooth squared loss and then applies soft-thresholding. The step size t must be at most the inverse Lipschitz constant of the gradient.

Lipschitz constant for squared loss

\[ L = \lambda_{\max}\!\left( \frac{1}{n} X^\top X \right) \]

Explanation: The gradient of the squared loss has Lipschitz constant equal to the largest eigenvalue of \(\frac{1}{n} X^\top X\). Using \(t \le 1/L\) guarantees convergence of ISTA.

Complexity Analysis

For dense data with n samples and p features, both coordinate descent and proximal gradient (ISTA) have per-iteration cost dominated by matrix–vector operations.

Coordinate descent: maintaining the residual \(r = y - (Xw + b\mathbf{1})\), each update of a single coordinate j requires computing \(\rho_j = X_j^\top r + \lVert X_j \rVert_2^2 w_j\) (O(n)) and then adjusting the residual via \(r \leftarrow r - X_j (w_j^{\text{new}} - w_j)\) (another O(n)). A full sweep over all p coordinates therefore costs O(n p); updating the intercept is O(n). With T_cd sweeps, the total time is O(T_cd · n p). Memory is O(n p) for X plus O(n + p) for r and w. If X is sparse with nnz nonzeros, the per-sweep cost becomes O(nnz).

ISTA: each iteration computes the residual \(e = Xw + b\mathbf{1} - y\) (O(n p)), the gradient \(g = \frac{1}{n} X^\top e\) (O(n p)), and applies soft-thresholding (O(p)), for a per-iteration cost of O(n p). If we estimate the Lipschitz constant L via power iteration, each power-iteration step also costs O(n p); running I_pi steps adds O(I_pi · n p) once at the start. With T_ista iterations, total time is O((I_pi + T_ista) · n p). Memory is O(n p) for X and O(n + p) for working vectors.

Convergence rates: ISTA has objective error O(1/k), while accelerated variants (FISTA) achieve O(1/k²). Coordinate descent often converges very quickly in practice due to efficient one-dimensional updates and can exploit feature standardization and warm starts. Stopping criteria commonly use relative parameter change, objective decrease, or the duality gap. Overall runtime depends strongly on the sparsity of X and of the solution; as more coefficients are driven to zero, caches and sparse updates can further reduce costs.

Code Examples

Lasso via Coordinate Descent with Soft-Thresholding (squared loss)
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding operator S_tau(z) = sign(z) * max(|z|-tau, 0)
double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic dense dataset: n samples, p features
    int n = 200, p = 50, k_true = 5; // k_true nonzero weights in ground truth
    unsigned seed = 42;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    vector<double> w_true(p, 0.0), y(n, 0.0);

    // Create sparse true weights
    for (int i = 0; i < k_true; ++i) w_true[i] = (i % 2 ? -1.5 : 2.0);

    // Generate features ~ N(0,1)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) X[i][j] = nd(rng);
    }

    // Standardize each column: zero mean, unit variance
    vector<double> meanX(p, 0.0), stdX(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) meanX[j] += X[i][j];
        meanX[j] /= n;
        for (int i = 0; i < n; ++i) stdX[j] += (X[i][j] - meanX[j]) * (X[i][j] - meanX[j]);
        stdX[j] = sqrt(stdX[j] / max(1, n - 1));
        if (stdX[j] == 0) stdX[j] = 1; // guard against zero variance
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - meanX[j]) / stdX[j];
    }

    // Build noisy response: y = X w_true + b_true + noise
    double b_true = 0.5;
    normal_distribution<double> noise(0.0, 0.5);
    for (int i = 0; i < n; ++i) {
        double yi = b_true;
        for (int j = 0; j < p; ++j) yi += X[i][j] * w_true[j];
        y[i] = yi + noise(rng);
    }

    // Lasso hyperparameters
    double lambda = 0.2;   // regularization strength
    int max_sweeps = 200;  // max coordinate descent sweeps
    double tol = 1e-6;     // stopping tolerance on max |delta w|

    // Initialize weights and intercept (unpenalized)
    vector<double> w(p, 0.0);
    double b = 0.0;

    // Residual r = y - (X w + b)
    vector<double> r = y; // since w=0, b=0 initially

    // Precompute column squared norms ||X_j||^2
    vector<double> colNorm2(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) colNorm2[j] += X[i][j] * X[i][j];
        if (colNorm2[j] == 0) colNorm2[j] = 1; // guard
    }

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        double max_change = 0.0;

        // Update intercept (unpenalized): b <- b + mean(r), r <- r - mean(r)
        double mean_r = accumulate(r.begin(), r.end(), 0.0) / n;
        b += mean_r;
        for (int i = 0; i < n; ++i) r[i] -= mean_r;

        // Update each coordinate with soft-thresholding
        for (int j = 0; j < p; ++j) {
            // rho_j = X_j^T r + ||X_j||^2 * w_j (partial residual trick)
            double rho = 0.0;
            for (int i = 0; i < n; ++i) rho += X[i][j] * r[i];
            rho += colNorm2[j] * w[j];

            double w_new = soft_threshold(rho, lambda) / colNorm2[j];
            double delta = w_new - w[j];
            if (delta != 0.0) {
                // Update residual: r <- r - X_j * delta
                for (int i = 0; i < n; ++i) r[i] -= X[i][j] * delta;
                w[j] = w_new;
                max_change = max(max_change, fabs(delta));
            }
        }

        if (max_change < tol) {
            // Converged
            break;
        }
    }

    // Evaluate training MSE and sparsity
    double mse = 0.0; int nnz = 0;
    for (int i = 0; i < n; ++i) {
        double pred = b;
        for (int j = 0; j < p; ++j) pred += X[i][j] * w[j];
        double err = pred - y[i];
        mse += err * err;
    }
    mse /= n;
    for (int j = 0; j < p; ++j) if (fabs(w[j]) > 1e-8) ++nnz;

    cout << fixed << setprecision(6);
    cout << "Coordinate Descent Lasso results\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    double l1 = 0.0; for (double wi : w) l1 += fabs(wi);
    cout << "L1 norm of w: " << l1 << "\n";

    return 0;
}

This program generates a synthetic regression dataset, standardizes features, and solves the Lasso using coordinate descent. It maintains the residual vector to make each coordinate update O(n), uses the closed-form soft-thresholding update for each weight, and separately updates the intercept without penalty. The loop stops when parameter changes are sufficiently small. Finally, it reports sparsity, intercept, mean squared error, and the L1 norm of the solution.

Time: O(T · n · p), where T is the number of coordinate descent sweeps until convergence.
Space: O(n · p) to store X, plus O(n + p) for residuals and weights.
Lasso via Proximal Gradient (ISTA) with Power-Iteration Step Size
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding applied elementwise to a vector
void soft_threshold_vec(vector<double>& v, double tau) {
    for (double &z : v) {
        if (z > tau) z -= tau;
        else if (z < -tau) z += tau;
        else z = 0.0;
    }
}

// Multiply y = X * w (X: n x p, w: p)
void matvec_Xw(const vector<vector<double>>& X, const vector<double>& w, vector<double>& y) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < p; ++j) s += X[i][j] * w[j];
        y[i] = s;
    }
}

// Multiply g = (1/n) X^T * e
void grad_XT_e_over_n(const vector<vector<double>>& X, const vector<double>& e, vector<double>& g) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    g.assign(p, 0.0);
    for (int j = 0; j < p; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += X[i][j] * e[i];
        g[j] = s / n;
    }
}

// Power iteration to estimate largest eigenvalue of (1/n) X^T X
double estimate_L(const vector<vector<double>>& X, int iters = 50) {
    int p = (int)X[0].size();
    int n = (int)X.size();
    vector<double> v(p, 0.0), Xv(n, 0.0), w(p, 0.0);
    // random init
    mt19937 rng(123);
    normal_distribution<double> nd(0.0, 1.0);
    for (int j = 0; j < p; ++j) v[j] = nd(rng);
    // Normalize v
    auto norm = [&](const vector<double>& a){ double s=0; for(double x:a) s+=x*x; return sqrt(s); };
    double nv = norm(v); if (nv==0) nv=1; for (double &x: v) x/=nv;

    double lambda = 0.0;
    for (int t = 0; t < iters; ++t) {
        // w = (1/n) X^T (X v)
        matvec_Xw(X, v, Xv);
        vector<double> e = Xv; // reuse as temp
        grad_XT_e_over_n(X, e, w);
        // Rayleigh quotient approximation
        lambda = 0.0; for (int j = 0; j < p; ++j) lambda += v[j] * w[j];
        // v <- w / ||w||
        nv = norm(w); if (nv == 0) break;
        for (int j = 0; j < p; ++j) v[j] = w[j] / nv;
    }
    // lambda approximates largest eigenvalue of (1/n) X^T X
    return max(lambda, 1e-12); // guard against zero
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data
    int n = 300, p = 80, k_true = 7;
    mt19937 rng(7);
    normal_distribution<double> nd(0.0, 1.0);
    vector<vector<double>> X(n, vector<double>(p));
    vector<double> y(n), w_true(p, 0.0);

    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) X[i][j] = nd(rng);
    // Standardize columns
    for (int j = 0; j < p; ++j) {
        double mu=0, ss=0; for (int i = 0; i < n; ++i) mu += X[i][j]; mu/=n;
        for (int i = 0; i < n; ++i) { X[i][j] -= mu; ss += X[i][j]*X[i][j]; }
        double st = sqrt(ss / max(1, n-1)); if (st==0) st=1;
        for (int i = 0; i < n; ++i) X[i][j] /= st;
    }
    // True sparse weights and noisy target
    for (int j = 0; j < k_true; ++j) w_true[j] = (j%2? -2.0 : 2.5);
    double b_true = -0.3;
    normal_distribution<double> noise(0.0, 0.6);
    for (int i = 0; i < n; ++i) {
        double s = b_true;
        for (int j = 0; j < p; ++j) s += X[i][j] * w_true[j];
        y[i] = s + noise(rng);
    }

    // ISTA hyperparameters
    double lambda = 0.25;
    int max_iters = 300;
    double tol = 1e-6;

    // Estimate Lipschitz constant L for f(w) = (1/2n)||Xw - y||^2
    double L = estimate_L(X, 40);
    double t = 1.0 / L; // safe fixed step size

    vector<double> w(p, 0.0);
    double b = 0.0;

    vector<double> Xw(n, 0.0), e(n, 0.0), g(p, 0.0), w_new(p, 0.0);

    auto l1_norm = [&](const vector<double>& a){ double s=0; for(double z:a) s += fabs(z); return s; };

    double prev_obj = numeric_limits<double>::infinity();
    for (int it = 0; it < max_iters; ++it) {
        // Residual e = Xw + b - y
        matvec_Xw(X, w, Xw);
        for (int i = 0; i < n; ++i) e[i] = Xw[i] + b - y[i];
        // Gradient g = (1/n) X^T e
        grad_XT_e_over_n(X, e, g);
        // Gradient step then prox (soft-thresholding)
        w_new = w;
        for (int j = 0; j < p; ++j) w_new[j] = w[j] - t * g[j];
        soft_threshold_vec(w_new, t * lambda);
        // Intercept update (unpenalized): b <- b - t * mean(e)
        double mean_e = accumulate(e.begin(), e.end(), 0.0) / n;
        double b_new = b - t * mean_e;

        // Check convergence via parameter change
        double diff = 0.0, normw = 0.0;
        for (int j = 0; j < p; ++j) { double d = w_new[j] - w[j]; diff += d*d; normw += w[j]*w[j]; }
        diff = sqrt(diff);
        double denom = max(1e-12, sqrt(normw));
        if (diff/denom < tol && fabs(b_new - b) < 1e-8) { w = w_new; b = b_new; break; }

        // Optional: objective monitoring
        matvec_Xw(X, w_new, Xw);
        double mse = 0.0; for (int i = 0; i < n; ++i) { double r = Xw[i] + b_new - y[i]; mse += r*r; }
        double obj = 0.5 * mse / n + lambda * l1_norm(w_new);
        if (prev_obj - obj < 1e-10) {
            // tiny improvement; could break or continue a bit more
        }
        prev_obj = obj;

        // Commit updates
        w.swap(w_new);
        b = b_new;
    }

    // Report results
    int nnz = 0; for (double z : w) if (fabs(z) > 1e-8) ++nnz;
    double mse = 0.0; matvec_Xw(X, w, Xw);
    for (int i = 0; i < n; ++i) { double r = Xw[i] + b - y[i]; mse += r*r; }
    mse /= n;

    cout << fixed << setprecision(6);
    cout << "ISTA Lasso results\n";
    cout << "Step size t: " << t << ", estimated L: " << L << "\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    cout << "L1 norm of w: " << l1_norm(w) << "\n";

    return 0;
}

This implementation solves the Lasso with proximal gradient descent (ISTA). It estimates a safe step size using power iteration to approximate the Lipschitz constant L of the squared-loss gradient, applies a gradient step, and then performs soft-thresholding. The intercept is updated via a standard gradient step and is not penalized. The algorithm stops when the parameter change is small or the objective stops improving.

Time: O((I_pi + T) · n · p), where I_pi is the number of power-iteration steps and T is the number of ISTA iterations.
Space: O(n · p) for X, plus O(n + p) for working vectors.