
Proximal Operators & Methods

Key Points

  • A proximal operator pulls a point x toward minimizing a function f while penalizing how far it moves, acting like a denoiser or projector depending on f.
  • It enables fast algorithms for nonsmooth convex problems by splitting a smooth part (handled by gradients) and a nonsmooth part (handled by a prox).
  • Common proximals include soft-thresholding for L1 norms and projections for constraints; many have simple closed forms.
  • Proximal gradient (ISTA) converges at rate O(1/k), and its accelerated variant FISTA achieves O(1/k^2) for convex objectives.
  • The step size must be chosen using the Lipschitz constant of the smooth gradient, or via backtracking line search.
  • Moreau's identity and envelope connect proximals to smooth approximations and duality, giving useful theoretical and computational tools.
  • Proximal methods scale well to large sparse problems because each iteration is mostly a matrix-vector multiply and a simple vector prox.
  • In C++, you can implement proximals as small vector-wise functions and combine them with gradient steps for efficient solvers.

Prerequisites

  • →Convex functions and sets — Proximal operators rely on convexity for uniqueness, stability, and convergence guarantees.
  • →Norms and basic projections — Understanding L1/L2 norms and Euclidean projections helps derive and implement common proximals.
  • →Subgradients and subdifferentials — The optimality condition for proximals uses subgradients; many nonsmooth terms require this concept.
  • →Lipschitz continuity of gradients — Step-size selection in proximal gradient depends on the Lipschitz constant.
  • →Gradient descent and Nesterov acceleration — Proximal gradient extends gradient descent; FISTA adds Nesterov momentum.
  • →Fenchel conjugates and Moreau envelope (optional) — Useful for deriving new proximals and understanding smooth approximations.
  • →Linear algebra and spectral norm — Computing gradients like A^T(Ax−b) and estimating ||A||_2^2 require linear algebra basics.
  • →C++ vectors and loops — Efficiently implementing vector-wise proximals and matrix-vector multiplies requires comfort with C++ containers.

Detailed Explanation


01Overview

Proximal operators are building blocks for solving optimization problems that include nonsmooth terms (like absolute values or constraints). Given a convex function f, its proximal operator maps a point x to the point z that best balances two goals: (1) making f(z) small, and (2) staying close to x. Formally, it solves a small quadratic-regularized subproblem around x. This makes proximals behave like denoisers or projectors: for example, the proximal of the L1 norm is soft-thresholding (it shrinks small components to zero), and the proximal of an indicator function of a set is the projection onto that set.

These operators let us extend gradient-based optimization to nonsmooth settings in a principled way. In composite problems of the form minimize h(x) + g(x), where h is smooth with a Lipschitz gradient and g is convex (possibly nonsmooth), we make a gradient step for h and then apply the proximal operator of g. This is called the proximal gradient method (ISTA) and converges under mild conditions. With Nesterov-style momentum, the accelerated version FISTA achieves faster theoretical rates. Proximals also underlie a wide family of splitting methods (e.g., ADMM, Douglas–Rachford) and connect deeply to convex analysis via Moreau envelopes and Fenchel duality.

02Intuition & Analogies

Imagine you are standing at point x on a landscape defined by a cost f. You want to move to a better point, but a rubber band tied to your waist penalizes moving too far from x. The proximal operator of f with parameter λ asks: where should you stand so that the pull of the landscape (wanting low f) and the pull of the rubber band (wanting proximity to x) are in balance? If f is steep near some components, you will accept more stretch of the band in those directions; if f is flat or has kinks (like absolute value), you may land exactly at a kink (producing sparsity).

For L1 regularization, that balance yields soft-thresholding: each component is pulled toward zero by a constant amount λ, stopping at exactly zero if it would cross. This is like a friction that wipes out small noise (“denoising”), yet keeps large signals. For constraints, take f as an indicator (0 if z is in the set, ∞ otherwise). The rubber band then just projects you back into the allowed region: the proximal becomes the nearest feasible point. When we combine a smooth data-fit (say least squares) with a nonsmooth regularizer or constraint, a practical iteration is: step downhill on the smooth part (pure gradient move), then snap back with the appropriate proximal (denoise or project). Acceleration (FISTA) adds a slingshot-like momentum that extrapolates the iterate, improving convergence speed but still applying the same simple proximal after each gradient step.

03Formal Definition

For a proper, closed, convex function f: R^n → R ∪ {+∞} and a parameter λ > 0, the proximal operator is defined by prox_{λf}(x) = argmin_{z ∈ R^n} f(z) + (1/(2λ))∥z − x∥_2^2. The optimality (first-order) condition for z = prox_{λf}(x) is 0 ∈ ∂f(z) + (1/λ)(z − x), where ∂f denotes the subdifferential. Equivalently, prox_{λf} is the resolvent of the monotone operator ∂f: prox_{λf} = (I + λ∂f)^{−1}. Proximal mappings are single-valued and firmly nonexpansive for convex f, which implies stability and contraction-like properties crucial for convergence proofs. In composite optimization with F(x) = h(x) + g(x), where h is differentiable with L-Lipschitz gradient and g is convex (possibly nonsmooth), the proximal gradient step with step size λ ∈ (0, 1/L] is x^{k+1} = prox_{λg}(x^k − λ∇h(x^k)). Accelerated variants (FISTA) use an extrapolated point y^k to evaluate the gradient and then apply the same prox. Moreau's decomposition connects a function and its Fenchel conjugate via proximals, and the Moreau envelope M_{λf}(x) provides a smooth approximation of f with gradient (1/λ)(x − prox_{λf}(x)).

04When to Use

Use proximal operators whenever your objective has a nonsmooth convex part that still admits an easy proximal mapping. Typical cases include: (1) sparsity via L1 regularization (LASSO), where the prox is soft-thresholding; (2) constraints represented as indicator functions of simple sets (e.g., box constraints, nonnegativity, Euclidean balls), where the prox is just a projection; (3) group sparsity (L2,1 norms) and nuclear norm penalties, both with known closed-form proximals; (4) robust losses such as Huber, with simple componentwise proximals.

They are especially effective in large-scale machine learning and signal processing because each iteration reduces to a gradient on the smooth data-fit plus a cheap, often separable, vector-wise operation. If the smooth part’s gradient is L-Lipschitz and you can compute its gradient quickly (e.g., sparse matrix-vector multiplies), proximal gradient (ISTA/FISTA) is a strong default. Choose ISTA for simplicity and stable behavior; pick FISTA when you want faster convergence on well-behaved convex problems. If the prox is not easy but splitting across variables is, methods like ADMM may be preferable. When the nonsmooth term is nonconvex, some proximal ideas still apply, but guarantees weaken—use with caution.

⚠️Common Mistakes

  • Using the wrong step size: If λ > 1/L where L is the Lipschitz constant of ∇h, proximal gradient can diverge or oscillate. Either estimate L (e.g., by power iteration on A^T A for least squares) or use backtracking.
  • Forgetting parameter scaling in the prox: prox_{λ||·||_1}(x) uses threshold λ, but if your composite step is x − τ∇h(x), the correct threshold is τ·λ. Mixing these leads to either too much or too little shrinkage.
  • Confusing L2 and squared L2: The prox of ||x||_2 is a vector shrinkage toward zero with a different formula than the prox of (1/2)||x||_2^2, which is a simple rescaling. Always double-check which norm or squared norm your model uses.
  • Ignoring separability: Many proximals act componentwise (L1, box constraints). Implement them as vectorized element-wise operations to get O(n) time; don’t solve a full optimization subproblem unnecessarily.
  • Poor stopping criteria: Stopping only on a fixed iteration count may waste time or stop too early. Monitor relative iterate change, objective decrease, or a dual/gradient mapping norm.
  • Numerical issues: Very small thresholds can suffer from catastrophic cancellation. Use max(|x|-τ, 0) rather than branching on signs, and clamp projections carefully to avoid NaNs or infinities.
  • Misapplied acceleration: FISTA can overshoot on ill-conditioned problems or with backtracking missteps. If you see oscillations, restart momentum (set it to zero) or revert to ISTA temporarily.

Key Formulas

Proximal Operator

prox_{λf}(x) = argmin_{z ∈ R^n} f(z) + (1/(2λ))∥z − x∥_2^2

Explanation: Defines the point z that balances minimizing f and staying close to x. The parameter λ controls how strongly we penalize moving away from x.

Optimality Condition

0 ∈ ∂f(z) + (1/λ)(z − x) ⟺ z = prox_{λf}(x)

Explanation: At the proximal point z, the subgradient of f balances the quadratic pull back to x. This condition is used to derive closed-form proximals.

Proximal Gradient (ISTA) Step

x^{k+1} = prox_{λg}(x^k − λ∇h(x^k))

Explanation: Update by a gradient step on the smooth part h followed by a prox on the nonsmooth part g. Converges for λ ∈ (0, 1/L] when ∇h is L-Lipschitz.

FISTA Updates

x^{k+1} = prox_{λg}(y^k − λ∇h(y^k)),
t_{k+1} = (1 + √(1 + 4t_k^2)) / 2,
y^{k+1} = x^{k+1} + ((t_k − 1)/t_{k+1}) (x^{k+1} − x^k)

Explanation: Adds Nesterov momentum through the sequence t_k and the extrapolated point y^k. Achieves an O(1/k^2) rate for convex objectives.

Soft-Thresholding

(prox_{λ∥·∥_1}(x))_i = sign(x_i) · max{|x_i| − λ, 0}

Explanation: Closed-form proximal for the L1 norm applied componentwise. It shrinks small values to zero, promoting sparsity.

Projection onto a Box

(prox_{I_{[l,u]^n}}(x))_i = min{max{x_i, l}, u}

Explanation: The indicator’s proximal equals Euclidean projection onto the box [l,u]^n. It simply clamps each coordinate.

Moreau Envelope and Gradient

M_{λf}(x) = min_z f(z) + (1/(2λ))∥z − x∥_2^2,   ∇M_{λf}(x) = (1/λ)(x − prox_{λf}(x))

Explanation: Smooths f while preserving minimizers approximately. The gradient is cheap to compute once the prox is available.

Moreau Decomposition

x = prox_{λf}(x) + λ · prox_{f*/λ}(x/λ)

Explanation: Relates the proximal of a function to the proximal of its Fenchel conjugate. Useful for deriving new proximals from known ones.

Firm Nonexpansiveness

∥prox_{λf}(x) − prox_{λf}(y)∥_2^2 ≤ ⟨prox_{λf}(x) − prox_{λf}(y), x − y⟩

Explanation: Proximals are stable mappings that do not expand distances. This property is central to convergence analysis of splitting methods.

ISTA Convergence Rate

F(x^k) − F(x*) ≤ L∥x^0 − x*∥_2^2 / (2k)

Explanation: For a convex composite F = h + g with L-Lipschitz ∇h, ISTA attains O(1/k) suboptimality. More iterations reduce the optimality gap inversely with k.

FISTA Convergence Rate

F(x^k) − F(x*) ≤ 2L∥x^0 − x*∥_2^2 / (k + 1)^2

Explanation: Acceleration improves the rate to O(1/k2) for convex problems. It significantly reduces iterations for a given accuracy.

Lipschitz Gradient

∥∇h(u) − ∇h(v)∥_2 ≤ L∥u − v∥_2

Explanation: Bounds how fast the gradient can change. The step size in proximal gradient must not exceed 1/L for convergence without line search.

Resolvent Relation

prox_{λf} = (I + λ∂f)^{−1}

Explanation: Connects the proximal operator to monotone operator theory. Many splitting algorithms can be seen as fixed-point iterations of resolvents.

Complexity Analysis

The runtime of proximal methods decomposes cleanly into two parts per iteration: a gradient evaluation of the smooth term and an application of a proximal operator. For a least-squares data-fit h(x) = (1/2)∥Ax − b∥_2^2 with a dense m×n matrix, the gradient ∇h(x) = A^T(Ax − b) costs O(mn) time and O(m + n) extra memory beyond storing A. If A is sparse with nnz(A) nonzeros, the cost reduces to O(nnz(A)). Many useful proximals are separable and run in linear time O(n): L1 soft-thresholding is componentwise; box and nonnegativity constraints are simple clamps; group norms act per group. Thus, the per-iteration cost is typically dominated by a small number of matrix-vector multiplies. Choosing a step size λ ≤ 1/L requires estimating the Lipschitz constant L of ∇h. For least squares, L = ∥A∥_2^2 equals the largest eigenvalue of A^T A; a few steps of power iteration estimate this in O(nnz(A)·T) time for T iterations. Alternatively, backtracking line search adapts λ on the fly at the expense of extra gradient/forward evaluations per iteration. Convergence rates for convex problems are sublinear: ISTA achieves O(1/k) and FISTA O(1/k^2) in objective suboptimality after k iterations. The total runtime to reach ε-accuracy scales like O((cost per iteration)·k(ε)), where k(ε) ≈ O(L/ε) for ISTA and O(√(L/ε)) for FISTA (hiding constants). Memory use is modest: O(n) for vectors plus storage for A (O(mn) dense or O(nnz(A)) sparse). Implementation details such as caching Ax and reusing residuals can cut constant factors significantly.

Code Examples

Core proximal operators: soft-thresholding (L1) and projection onto a box
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;

// Soft-thresholding proximal for L1 norm: prox_{tau * ||.||_1}(x)
vector<double> soft_threshold(const vector<double>& x, double tau) {
    vector<double> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        double v = x[i];
        double m = fabs(v) - tau;                        // shrink magnitude by tau
        if (m > 0.0) y[i] = (v >= 0.0 ? 1.0 : -1.0) * m; // keep sign
        else y[i] = 0.0;                                 // wipe out small values
    }
    return y;
}

// Projection onto a box [lo, hi]^n: prox_{I_{[lo,hi]^n}}(x)
vector<double> project_box(const vector<double>& x, double lo, double hi) {
    vector<double> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = min(max(x[i], lo), hi); // clamp each coordinate
    }
    return y;
}

void print_vec(const vector<double>& v) {
    cout << "[";
    for (size_t i = 0; i < v.size(); ++i) {
        cout << v[i];
        if (i + 1 < v.size()) cout << ", ";
    }
    cout << "]\n";
}

int main() {
    vector<double> x = {-0.2, 0.5, 3.0, -4.2, 0.05};

    // Example 1: L1 proximal (soft-thresholding)
    double lambda = 0.8; // threshold
    vector<double> y = soft_threshold(x, lambda);
    cout << "Soft-thresholding with lambda=" << lambda << "\n";
    cout << "x    = "; print_vec(x);
    cout << "prox = "; print_vec(y);

    // Example 2: Projection onto a box [0, 2]
    double lo = 0.0, hi = 2.0;
    vector<double> z = project_box(x, lo, hi);
    cout << "\nProjection onto box [" << lo << ", " << hi << "]\n";
    cout << "x    = "; print_vec(x);
    cout << "proj = "; print_vec(z);

    return 0;
}

This program implements two of the most common proximal operators. The soft-thresholding operator is the proximal for the L1 norm and promotes sparsity by shrinking each component toward zero by a fixed amount lambda, truncating to zero if it crosses. The projection operator is the proximal of an indicator function of a box set and simply clamps each coordinate between the lower and upper bounds. Both operations act componentwise and run in linear time in the vector length.

Time: O(n) per call, where n is the vector dimension. Space: O(n) additional space to store the output vector.
FISTA (accelerated proximal gradient) for LASSO: min 1/2||Ax-b||^2 + lambda||x||_1
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <numeric>
using namespace std;

// Basic linear algebra helpers for dense matrices stored as vector<vector<double>>
using Matrix = vector<vector<double>>;

double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0; size_t n = a.size();
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

double norm2(const vector<double>& a) { return sqrt(max(0.0, dot(a, a))); }

double l1norm(const vector<double>& a) {
    double s = 0.0; for (double v : a) s += fabs(v); return s;
}

vector<double> soft_threshold(const vector<double>& x, double tau) {
    vector<double> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        double v = x[i]; double m = fabs(v) - tau;
        y[i] = (m > 0.0) ? ((v >= 0.0 ? 1.0 : -1.0) * m) : 0.0;
    }
    return y;
}

vector<double> matvec(const Matrix& A, const vector<double>& x) {
    size_t m = A.size(), n = x.size();
    vector<double> y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0; const auto& Ai = A[i];
        for (size_t j = 0; j < n; ++j) s += Ai[j] * x[j];
        y[i] = s;
    }
    return y;
}

vector<double> matTvec(const Matrix& A, const vector<double>& y) {
    size_t m = A.size(), n = A[0].size();
    vector<double> x(n, 0.0);
    for (size_t i = 0; i < m; ++i) {
        const auto& Ai = A[i]; double yi = y[i];
        for (size_t j = 0; j < n; ++j) x[j] += Ai[j] * yi;
    }
    return x;
}

// Power iteration to estimate L = ||A||_2^2 = largest eigenvalue of A^T A
double estimate_L(const Matrix& A, int iters = 100) {
    size_t n = A[0].size();
    vector<double> u(n, 0.0);
    // random init
    std::mt19937 rng(42);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (size_t j = 0; j < n; ++j) u[j] = nd(rng);
    double nu = norm2(u);
    if (nu == 0.0) u[0] = 1.0; else for (size_t j = 0; j < n; ++j) u[j] /= nu;

    double rayleigh = 0.0;
    for (int k = 0; k < iters; ++k) {
        vector<double> v = matvec(A, u);  // v = A u
        vector<double> w = matTvec(A, v); // w = A^T v = A^T A u
        double nw = norm2(w);
        if (nw == 0.0) break;
        for (size_t j = 0; j < n; ++j) u[j] = w[j] / nw; // normalize
        rayleigh = dot(u, w); // since ||u||=1, u^T (A^T A) u
    }
    return max(rayleigh, 1e-12); // avoid zero
}

// Objective: 0.5*||Ax-b||^2 + lambda*||x||_1
struct Lasso {
    const Matrix& A; const vector<double>& b; double lambda;
    vector<double> Ax(const vector<double>& x) const { return matvec(A, x); }
    double value(const vector<double>& x) const {
        vector<double> r = Ax(x);
        for (size_t i = 0; i < r.size(); ++i) r[i] -= b[i];
        return 0.5 * dot(r, r) + lambda * l1norm(x);
    }
    vector<double> grad_h(const vector<double>& x) const {
        vector<double> r = Ax(x);
        for (size_t i = 0; i < r.size(); ++i) r[i] -= b[i]; // r = Ax - b
        return matTvec(A, r); // A^T (Ax - b)
    }
};

vector<double> fista_lasso(const Matrix& A, const vector<double>& b, double lambda,
                           int max_iters, double tol, bool verbose = true) {
    size_t n = A[0].size();
    Lasso prob{A, b, lambda};

    // Step size from Lipschitz estimate L of grad_h
    double L = estimate_L(A, 100);
    double step = 1.0 / L;

    vector<double> x(n, 0.0), x_prev(n, 0.0), y(n, 0.0);
    double t = 1.0;

    double f0 = prob.value(x);
    if (verbose) {
        cout << "Initial objective: " << f0 << " (L ~ " << L << ")\n";
    }

    for (int k = 0; k < max_iters; ++k) {
        // Gradient at extrapolated point y
        vector<double> g = prob.grad_h(y);
        // Prox step (soft-threshold) with threshold step*lambda
        vector<double> x_new(n);
        for (size_t i = 0; i < n; ++i) x_new[i] = y[i] - step * g[i];
        x_new = soft_threshold(x_new, step * lambda);

        double t_new = 0.5 * (1.0 + sqrt(1.0 + 4.0 * t * t));
        // Extrapolation (Nesterov momentum)
        vector<double> y_new(n);
        double beta = (t - 1.0) / t_new;
        for (size_t i = 0; i < n; ++i) y_new[i] = x_new[i] + beta * (x_new[i] - x[i]);

        // Check convergence by relative change
        double dx = 0.0, xnorm = 0.0;
        for (size_t i = 0; i < n; ++i) { double d = x_new[i] - x[i]; dx += d * d; xnorm += x[i] * x[i]; }
        dx = sqrt(dx); xnorm = sqrt(xnorm);
        if (verbose && (k % 50 == 0 || k == max_iters - 1)) {
            cout << "Iter " << k << ": f = " << prob.value(x_new) << ", ||dx||/max(1,||x||) = "
                 << dx / max(1.0, xnorm) << "\n";
        }
        if (dx / max(1.0, xnorm) < tol) {
            if (verbose) cout << "Converged at iter " << k << "\n";
            x = x_new; break;
        }

        x_prev = x;
        x = x_new; y = y_new; t = t_new;
    }
    return x;
}

// Utility to count nonzeros by threshold
size_t count_nnz(const vector<double>& x, double thr = 1e-6) {
    size_t c = 0; for (double v : x) if (fabs(v) > thr) ++c; return c;
}

int main() {
    // Synthetic LASSO instance: b = A x_true + noise, with sparse x_true
    size_t m = 80, n = 200; int k_sparse = 10; double lambda = 0.1;
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);

    Matrix A(m, vector<double>(n));
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) A[i][j] = nd(rng) / sqrt((double)m); // scale to keep L moderate
    }

    vector<double> x_true(n, 0.0);
    // Random support of size k_sparse
    vector<size_t> idx(n); iota(idx.begin(), idx.end(), 0);
    shuffle(idx.begin(), idx.end(), rng);
    for (int t = 0; t < k_sparse; ++t) x_true[idx[t]] = 3.0 * nd(rng); // moderately large coeffs

    vector<double> b(m, 0.0);
    // b = A x_true + small noise
    vector<double> Ax = matvec(A, x_true);
    for (size_t i = 0; i < m; ++i) b[i] = Ax[i] + 0.01 * nd(rng);

    // Solve with FISTA
    vector<double> x_hat = fista_lasso(A, b, lambda, /*max_iters=*/1000, /*tol=*/1e-6, /*verbose=*/true);

    cout << "\nRecovered nonzeros (>1e-3): " << count_nnz(x_hat, 1e-3) << " of true " << k_sparse << "\n";
    // Report objective value
    Lasso prob{A, b, lambda};
    cout << "Final objective: " << prob.value(x_hat) << "\n";

    return 0;
}

This program solves a LASSO problem using FISTA: it alternates a gradient step on the least-squares term and a soft-thresholding prox on the L1 term, with Nesterov momentum to accelerate convergence. The Lipschitz constant of the gradient is estimated by power iteration on A^T A, and the step size is set to 1/L. The synthetic test constructs a sparse ground-truth vector, forms measurements with a random dense matrix, adds small noise, and then recovers a sparse estimate. The code prints progress every 50 iterations and reports the final objective and sparsity level.

Time: O(mn) per iteration for dense A (two matrix-vector multiplies), or O(nnz(A)) if A is sparse. Space: O(mn) to store A (dense), plus O(m+n) working memory for vectors.
#proximal operator#ista#fista#soft-thresholding#lasso#projection#moreau envelope#admm#l1 regularization#convex optimization#forward-backward splitting#lipschitz constant#sparse learning#gradient step#resolvent