Elastic Net Regularization
Key Points
- Elastic Net regularization combines L1 (Lasso) and L2 (Ridge) penalties to produce models that are both sparse and stable.
- It is especially effective when features are correlated, encouraging grouped selection rather than picking just one feature from a correlated set.
- The optimization remains convex, so global minima exist and efficient algorithms like coordinate descent and proximal gradient work well.
- Standardizing features (zero mean, unit variance) is crucial so the penalty treats all coefficients fairly.
- The penalty can be written with two strengths \((\lambda_1, \lambda_2)\) or with a single overall strength \(\lambda\) and a mixing factor \(\alpha\), where \(\lambda_1 = \alpha\lambda\) and \(\lambda_2 = (1 - \alpha)\lambda\).
- For linear regression, coordinate descent has simple closed-form updates using soft-thresholding.
- For logistic regression and other smooth losses, proximal gradient descent applies a gradient step (including L2) followed by soft-thresholding (for L1).
- Hyperparameters are typically tuned via cross-validation to balance bias, variance, and sparsity.
Prerequisites
- Linear and logistic regression — Elastic Net is applied to these models by adding penalties to their loss functions.
- Vector norms (L1 and L2) — Understanding how \(\lVert w \rVert_1\) induces sparsity and \(\lVert w \rVert_2^2\) provides shrinkage explains Elastic Net’s behavior.
- Convex optimization basics — Elastic Net uses convex objectives; knowing gradients, subgradients, and optimality conditions helps understand solvers.
- Gradient descent and proximal methods — Implementation for non-smooth penalties (L1) relies on proximal steps or coordinate descent.
- Feature standardization — Scaling ensures the penalty treats all coefficients comparably and improves numerical stability.
- Cross-validation — Choosing λ and α requires validation to balance bias, variance, and sparsity.
Detailed Explanation
01 Overview
Elastic Net regularization is a technique for preventing overfitting by adding a penalty to the model’s loss function that blends the strengths of L1 (Lasso) and L2 (Ridge) penalties. In predictive modeling, especially with many features (possibly more than observations), models can fit noise rather than signal. L1 regularization encourages sparsity by driving many coefficients to exactly zero, effectively selecting features. L2 regularization spreads shrinkage across coefficients, stabilizing estimates and handling multicollinearity. Elastic Net combines these: it promotes sparsity while also grouping correlated features, often leading to better generalization than using L1 or L2 alone. The method is widely used in linear and logistic regression, and it extends to generalized linear models and beyond. Because the objective remains convex, a variety of efficient solvers—coordinate descent for squared loss and proximal gradient methods for smooth losses—can find globally optimal solutions. Hyperparameters controlling the balance and strength of the penalties are typically selected using k-fold cross-validation. In practice, Elastic Net has become a robust default when you have many, potentially correlated predictors and you need a model that is interpretable (sparse), stable, and accurate.
02 Intuition & Analogies
Imagine packing a backpack for a hike. You want to carry only the items that matter (sparsity) but also distribute weight evenly so the bag is comfortable (stability). L1 regularization is like a strict baggage fee that forces you to drop many items entirely; it drives some weights to zero, leaving a light, simple pack. L2 regularization is like using elastic straps around the backpack: it doesn’t remove items, but it presses everything closer to the center, reducing the influence of any single item. Now think about carrying two highly similar items—say, two near-identical water bottles. L1 alone often makes you keep one and drop the other, which can be unstable if they’re nearly indistinguishable. L2 alone keeps both but shrinks them, which may keep redundancy. Elastic Net combines the baggage fee and the elastic straps. It removes truly unnecessary items but, when two items are redundant, it tends to keep them together at reduced weight (the “grouping effect”). This makes the final pack both lean and balanced. In data terms, when features are correlated, Elastic Net tends to keep correlated features together with similar coefficients, rather than arbitrarily selecting one. This leads to more stable models across different samples. The penalty strength (λ) controls how aggressively you pack light, and the mixing factor (α) controls how much you care about dropping items entirely (L1) versus compressing them (L2).
03 Formal Definition
Given data \(X \in \mathbb{R}^{n \times p}\) and targets \(y\), Elastic Net solves the convex problem
\[ \min_{b,\,w} \; \frac{1}{2n} \lVert y - b\mathbf{1} - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2, \]
where \(\lambda_1 = \alpha\lambda\) and \(\lambda_2 = (1 - \alpha)\lambda\) for an overall strength \(\lambda \ge 0\) and mixing factor \(\alpha \in [0, 1]\). The intercept \(b\) is left unpenalized. For other models, the squared loss is replaced by any convex differentiable loss, such as the logistic loss.
04 When to Use
Use Elastic Net when you have many features, especially with substantial correlation among them, and you want both interpretability and robustness. Typical scenarios include genomics (many correlated gene expression measurements), text features (high-dimensional sparse vectors with co-occurring terms), and finance (highly correlated indicators). If pure Lasso is unstable because it arbitrarily selects one among correlated predictors, Elastic Net’s L2 component helps keep groups together, improving stability and predictive performance. When you need exact zeros for feature selection, keep a nonzero L1 component (α > 0). When overfitting is severe but you still value keeping related predictors, choose a moderate α (e.g., 0.2–0.8) and tune λ by cross-validation. If p > n (more features than samples), Elastic Net often outperforms Ridge and can outperform Lasso due to its grouping effect. It is also useful when you plan to compute regularization paths: sweeping λ while monitoring validation performance offers a clear bias–variance–sparsity trade-off. Remember to standardize features so that the same penalty scale applies across coefficients; otherwise, features with larger scales are penalized disproportionately.
⚠️ Common Mistakes
- Skipping feature standardization: Without zero-mean and unit-variance scaling, coefficients tied to large-scale features are over-penalized, biasing selection. Always standardize X (but do not penalize the intercept).
- Penalizing the intercept: The intercept should be free; include it in the model but exclude it from L1/L2 penalties.
- Using plain gradient descent for L1: The L1 term is non-smooth; use coordinate descent or proximal gradient with soft-thresholding. A vanilla gradient step on |w| is undefined at zero.
- Ignoring correlation structure: Pure Lasso may drop useful correlated predictors; if you see instability across folds, increase the L2 share (lower α) to encourage grouping.
- Mis-scaling λ with sample size: When comparing across datasets with different n, remember the loss is usually averaged by n; keep λ on a comparable scale or re-tune via cross-validation.
- Early stopping without convergence checks: For coordinate descent, monitor the maximum coefficient change; for proximal methods, track the objective decrease or the gradient-mapping norm.
- Data leakage during scaling or CV: Fit scalers on training folds only, then transform validation folds; re-fit after model selection on all training data.
Key Formulas
Elastic Net Objective (Linear)
\[ \min_{b,\,w} \; \frac{1}{2n} \lVert y - b\mathbf{1} - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]
Explanation: Penalizes the mean squared error with a combination of L1 and L2 terms. The L1 term induces sparsity while the L2 term stabilizes and shrinks coefficients. The intercept \(b\) is not penalized.
Elastic Net Objective (General)
\[ \min_{w} \; f(w) + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]
Explanation: Generalizes to any convex differentiable loss \(f\), such as the logistic loss. Optimization uses proximal methods due to the non-smooth L1 term.
Mixing Parameterization
\[ \lambda_1 = \alpha \lambda, \qquad \lambda_2 = (1 - \alpha)\,\lambda \]
Explanation: Uses a single overall strength λ and a mixing factor α to balance L1 vs L2. α = 1 gives Lasso, α = 0 gives Ridge, and values in between give Elastic Net.
Soft-thresholding Operator
\[ S(z, \tau) = \operatorname{sign}(z)\,\max(\lvert z \rvert - \tau,\, 0) \]
Explanation: This is the proximal operator of the L1 norm. It zeroes small values and shrinks large ones by τ, enabling sparse solutions.
Coordinate Descent Update (Squared Loss)
\[ w_j \leftarrow \frac{S\!\left(\frac{1}{n} \sum_{i=1}^{n} x_{ij}\, r_i^{(j)},\; \lambda_1\right)}{\frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 + \lambda_2}, \qquad r^{(j)} = y - b - \sum_{k \neq j} x_{\cdot k}\, w_k \]
Explanation: Closed-form update for a single coefficient with the others fixed. The numerator is a soft-thresholded partial-residual correlation; the denominator is the data curvature plus the L2 penalty.
Logistic Loss Gradient
\[ \nabla_w f(w) = \frac{1}{n} X^\top \left( \sigma(Xw + b\mathbf{1}) - y \right) \]
Explanation: Gives the gradient of the smooth logistic data term. For Elastic Net, add \(\lambda_2 w\) to this gradient and handle \(\lambda_1 \lVert w \rVert_1\) via the proximal soft-thresholding step.
Proximal Operator for L1
\[ \operatorname{prox}_{\eta \lambda_1 \lVert \cdot \rVert_1}(v) = S(v,\, \eta \lambda_1) \]
Explanation: In proximal gradient, after a gradient step of size \(\eta\) produces \(v\), apply soft-thresholding with threshold \(\eta \lambda_1\) to obtain sparsity.
Optimality (KKT) Condition
\[ \nabla f(w^\star) + \lambda_2 w^\star + \lambda_1 s = 0, \qquad s \in \partial \lVert w^\star \rVert_1 \]
where \(s_j = \operatorname{sign}(w_j^\star)\) when \(w_j^\star \neq 0\) and \(s_j \in [-1, 1]\) when \(w_j^\star = 0\).
Explanation: At the solution, the gradient of the smooth part plus the ridge term is balanced by a subgradient of the L1 norm. This characterizes optimality in convex Elastic Net problems.
Complexity Analysis
For dense data with n samples and p features, one full cycle of coordinate descent costs O(np): each coordinate update is O(n) thanks to the maintained residual vector, and there are p coordinates per epoch. Proximal gradient descent likewise costs O(np) per iteration (one pass to form predictions, one to accumulate the gradient). Memory is O(np) for the data plus O(p) for coefficients. The number of epochs or iterations to reach a given tolerance depends on conditioning; a larger L2 share (bigger λ2) improves conditioning and typically speeds convergence.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding operator S(z, tau) = sign(z) * max(|z| - tau, 0)
static inline double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

struct StandardScaler {
    vector<double> mean, stdv;
    // Fit on X (n x p) and transform in-place (the intercept is handled separately)
    void fit_transform(vector<vector<double>>& X) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        mean.assign(p, 0.0);
        stdv.assign(p, 0.0);
        for (int j = 0; j < p; ++j) {
            for (int i = 0; i < n; ++i) mean[j] += X[i][j];
            mean[j] /= n;
            for (int i = 0; i < n; ++i) stdv[j] += (X[i][j] - mean[j]) * (X[i][j] - mean[j]);
            stdv[j] = sqrt(stdv[j] / max(1, n - 1));
            if (stdv[j] == 0.0) stdv[j] = 1.0; // avoid divide-by-zero
            for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean[j]) / stdv[j];
        }
    }
};

struct ElasticNetCD {
    int max_epochs = 2000;
    double tol = 1e-6;
    double lambda1; // L1 strength
    double lambda2; // L2 strength

    ElasticNetCD(double l1, double l2) : lambda1(l1), lambda2(l2) {}

    // Fits: minimize (1/(2n)) * ||y - b - X w||^2 + lambda1 * ||w||_1 + (lambda2/2) * ||w||_2^2
    // (the 1/(2n) scaling is what makes the closed-form update below exact).
    // Returns pair {intercept b, weights w}
    pair<double, vector<double>> fit(const vector<vector<double>>& X, const vector<double>& y) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        vector<double> w(p, 0.0);
        vector<double> r = y; // residual r = y - b - Xw
        double b = 0.0;

        // Initialize intercept as mean(y)
        double y_mean = accumulate(y.begin(), y.end(), 0.0) / n;
        b = y_mean;
        for (int i = 0; i < n; ++i) r[i] = y[i] - b; // since w = 0 initially

        // Precompute per-column squared norms: (1/n) * sum x_{ij}^2
        vector<double> col_norm_sq(p, 0.0);
        for (int j = 0; j < p; ++j) {
            double s2 = 0.0;
            for (int i = 0; i < n; ++i) s2 += X[i][j] * X[i][j];
            col_norm_sq[j] = s2 / n;
        }

        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            double max_change = 0.0;

            // Coordinate-wise updates
            for (int j = 0; j < p; ++j) {
                if (col_norm_sq[j] == 0.0) { w[j] = 0.0; continue; }
                // Add back old contribution of feature j to residual
                for (int i = 0; i < n; ++i) r[i] += X[i][j] * w[j];

                // Partial residual correlation: (1/n) * sum x_{ij} * r_i
                double zj = 0.0;
                for (int i = 0; i < n; ++i) zj += X[i][j] * r[i];
                zj /= n;

                // Closed-form update with soft-thresholding
                double new_wj = soft_threshold(zj, lambda1) / (col_norm_sq[j] + lambda2);
                double delta = new_wj - w[j];
                w[j] = new_wj;

                // Subtract new contribution to residual
                for (int i = 0; i < n; ++i) r[i] -= X[i][j] * w[j];

                max_change = max(max_change, fabs(delta));
            }

            // Optimal intercept for squared loss: b = mean(y - Xw).
            // Given r = y - b - Xw, shifting b by mean(r) re-centers the residual.
            double mean_r = accumulate(r.begin(), r.end(), 0.0) / n;
            b += mean_r;
            for (int i = 0; i < n; ++i) r[i] -= mean_r; // keep r consistent

            if (max_change < tol) break;
        }

        return {b, w};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic dense dataset: n samples, p features
    int n = 200, p = 12;
    std::mt19937 rng(42);
    std::normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // True sparse weights with correlated features
    vector<double> w_true(p, 0.0);
    w_true[0] = 3.0; w_true[1] = -2.0; w_true[5] = 1.5; // sparse support
    double b_true = 1.0;

    // Create correlation between features 0 and 1 (mix a large fraction of 0 into 1)
    for (int i = 0; i < n; ++i) X[i][1] = 0.8 * X[i][0] + 0.2 * X[i][1];

    // Generate y with noise
    vector<double> y(n, b_true);
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) y[i] += X[i][j] * w_true[j];
    for (int i = 0; i < n; ++i) y[i] += 0.5 * nd(rng); // noise

    // Standardize features (important!)
    StandardScaler scaler;
    scaler.fit_transform(X);

    // Fit Elastic Net with moderate mixing
    double lambda = 0.2; // overall strength (tune via CV in practice)
    double alpha = 0.6;  // mixing: 0 -> ridge, 1 -> lasso
    double lambda1 = lambda * alpha;
    double lambda2 = lambda * (1.0 - alpha);

    ElasticNetCD model(lambda1, lambda2);
    auto [b_hat, w_hat] = model.fit(X, y);

    cout << fixed << setprecision(4);
    cout << "Intercept: " << b_hat << "\nWeights:\n";
    for (int j = 0; j < p; ++j) cout << "w[" << j << "] = " << w_hat[j] << "\n";

    return 0;
}
```
This program fits an Elastic Net linear regression model using cyclic coordinate descent. Features are standardized so the L1/L2 penalties treat coefficients comparably. The algorithm maintains the residual vector to update each coefficient in O(n) time via a soft-thresholded closed-form. The intercept is updated to the mean of residual-adjusted targets and is not penalized.
```cpp
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

static inline double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

struct StandardScaler {
    vector<double> mean, stdv;
    void fit_transform(vector<vector<double>>& X) {
        int n = (int)X.size(), p = (int)X[0].size();
        mean.assign(p, 0.0); stdv.assign(p, 0.0);
        for (int j = 0; j < p; ++j) {
            for (int i = 0; i < n; ++i) mean[j] += X[i][j];
            mean[j] /= n;
            for (int i = 0; i < n; ++i) stdv[j] += (X[i][j] - mean[j]) * (X[i][j] - mean[j]);
            stdv[j] = sqrt(stdv[j] / max(1, n - 1));
            if (stdv[j] == 0.0) stdv[j] = 1.0;
            for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean[j]) / stdv[j];
        }
    }
};

struct ElasticNetLogisticPGD {
    int max_iters = 2000;
    double tol = 1e-6;
    double lambda1; // L1
    double lambda2; // L2

    ElasticNetLogisticPGD(double l1, double l2) : lambda1(l1), lambda2(l2) {}

    pair<double, vector<double>> fit(const vector<vector<double>>& X, const vector<int>& y01) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        vector<double> w(p, 0.0);
        double b = 0.0; // intercept (unpenalized)

        // Step size from a Lipschitz bound on the smooth part.
        // For logistic loss the Hessian satisfies ||H|| <= (1/(4n)) * lambda_max(X^T X);
        // we bound lambda_max conservatively by the trace, giving
        // L <= 0.25 * sum_j (1/n)||x_j||^2 + lambda2.
        vector<double> col_norm_sq(p, 0.0);
        for (int j = 0; j < p; ++j) {
            double s2 = 0.0; for (int i = 0; i < n; ++i) s2 += X[i][j] * X[i][j];
            col_norm_sq[j] = s2 / n;
        }
        double sum_col = 0.0; for (double v : col_norm_sq) sum_col += v;
        double L = 0.25 * sum_col + lambda2 + 1e-12; // avoid divide-by-zero
        double eta = 1.0 / L;

        vector<double> grad_w(p), p_hat(n);
        for (int it = 0; it < max_iters; ++it) {
            // Predictions
            for (int i = 0; i < n; ++i) {
                double z = b;
                for (int j = 0; j < p; ++j) z += X[i][j] * w[j];
                p_hat[i] = sigmoid(z);
            }

            // Gradient of data term: (1/n) X^T (p - y)
            fill(grad_w.begin(), grad_w.end(), 0.0);
            double grad_b = 0.0;
            for (int i = 0; i < n; ++i) {
                double diff = p_hat[i] - (double)y01[i];
                grad_b += diff;
                for (int j = 0; j < p; ++j) grad_w[j] += X[i][j] * diff;
            }
            grad_b /= n;
            for (int j = 0; j < p; ++j) grad_w[j] = grad_w[j] / n + lambda2 * w[j];

            // Gradient step on smooth part (data + L2)
            vector<double> w_temp(p);
            for (int j = 0; j < p; ++j) w_temp[j] = w[j] - eta * grad_w[j];
            double b_new = b - eta * grad_b; // intercept not penalized

            // Proximal step for L1
            double thresh = eta * lambda1;
            double max_change = 0.0;
            for (int j = 0; j < p; ++j) {
                double new_wj = soft_threshold(w_temp[j], thresh);
                max_change = max(max_change, fabs(new_wj - w[j]));
                w[j] = new_wj;
            }
            max_change = max(max_change, fabs(b_new - b));
            b = b_new;

            if (max_change < tol) break;
        }
        return {b, w};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 400, p = 10;
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // True weights and intercept
    vector<double> w_true(p, 0.0); w_true[0] = 2.0; w_true[1] = -1.5; w_true[3] = 1.0; w_true[7] = 0.5;
    double b_true = -0.3;

    // Correlate features 0 and 1
    for (int i = 0; i < n; ++i) X[i][1] = 0.7 * X[i][0] + 0.3 * X[i][1];

    // Generate probabilities and labels
    vector<int> y(n);
    for (int i = 0; i < n; ++i) {
        double z = b_true;
        for (int j = 0; j < p; ++j) z += X[i][j] * w_true[j];
        double pr = sigmoid(z);
        std::bernoulli_distribution bd(pr);
        y[i] = bd(rng);
    }

    // Standardize features
    StandardScaler scaler; scaler.fit_transform(X);

    double lambda = 0.15, alpha = 0.5;
    double lambda1 = lambda * alpha;
    double lambda2 = lambda * (1.0 - alpha);

    ElasticNetLogisticPGD clf(lambda1, lambda2);
    auto [b_hat, w_hat] = clf.fit(X, y);

    cout << fixed << setprecision(4);
    cout << "Intercept: " << b_hat << "\nWeights:\n";
    for (int j = 0; j < p; ++j) cout << "w[" << j << "] = " << w_hat[j] << "\n";

    // Quick accuracy check
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = b_hat; for (int j = 0; j < p; ++j) z += X[i][j] * w_hat[j];
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        if (pred == y[i]) ++correct;
    }
    cout << "Training accuracy: " << (double)correct / n << "\n";
    return 0;
}
```
This program trains an Elastic Net logistic regression classifier using proximal gradient descent. A gradient step on the smooth part (the logistic data term plus λ2w) is followed by the L1 proximal step (soft-thresholding), which produces sparsity. The step size is set from a conservative Lipschitz bound on the smooth part to ensure convergence. The intercept is not penalized.