Convex Optimization
Key Points
- Convex optimization studies minimizing convex functions over convex sets, where every local minimum is guaranteed to be a global minimum.
- Convexity can be checked by geometry (the epigraph is convex) or by calculus (the Hessian is positive semidefinite).
- Common convex models in ML include LASSO, logistic regression, support vector machines, and many quadratic or linear programs.
- Convexity is preserved under positive weighted sums, composition with affine maps, and pointwise maxima.
- First-order methods (gradient, subgradient, proximal/ISTA) scale well to large problems by using only gradients and simple projections.
- Second-order methods (Newton, quasi-Newton) converge faster but require solving a linear system per iteration.
- Interior-point and barrier methods solve general conic problems, LPs, and QPs to high accuracy in a small number of iterations.
- C++ implementations require careful linear algebra (vectors, matrix–vector multiplies, line search) and attention to numerical stability.
Prerequisites
- Linear algebra (vectors, matrices, norms, eigenvalues) — Convexity tests, gradients/Hessians, and algorithms rely on matrix calculus and spectral properties.
- Calculus and multivariable optimization — Gradients, Hessians, Taylor expansions, and optimality conditions are central to algorithm design.
- Probability and statistics — ML losses (logistic, least squares) and regularization have probabilistic interpretations used in modeling.
- Numerical methods and stability — Line search, conditioning, and linear system solvers affect convergence and robustness of implementations.
- Data structures and algorithmic complexity — Choosing algorithms (first- vs. second-order) depends on time/space trade-offs and sparsity patterns.
Detailed Explanation
01 Overview
Convex optimization is the study of minimizing a convex objective function over a convex feasible set. Intuitively, a set is convex if it has no “dents”: for any two points inside, the straight line segment between them remains entirely inside. A function is convex if it curves upward everywhere—averaging inputs never yields a value larger than the corresponding average of function values. These two properties form a powerful pair: when both the objective and the feasible region are convex, every local minimum is automatically a global minimum, removing the need to worry about bad local traps.

This field provides both a clean theory and practical algorithms. Theoretically, convexity allows tight guarantees on optimality (via duality and KKT conditions) and convergence rates of algorithms. Practically, many real-world problems—like resource allocation, portfolio optimization, regularized regression, and margin maximization—fit naturally as convex programs. Algorithms range from scalable first-order methods (gradient, subgradient, proximal) to fast-converging second-order methods (Newton, interior-point). Modern modeling frameworks can express a wide variety of problems and call efficient solvers, but it is also educational (and often necessary in systems programming) to implement core methods directly in C++ for speed and control. With careful numerical linear algebra, line search, and stopping criteria, robust convex optimization routines can be built for machine learning and signal processing tasks.
02 Intuition & Analogies
Imagine hiking in a bowl-shaped valley. No matter where you start, if you walk downhill, you eventually reach the bottom—and crucially, there is only one bottom. That is convex optimization: the valley is the convex function, and the bottom is the global minimum. In contrast, a landscape with multiple pits would be nonconvex; you could get stuck in a small puddle that is not the lowest point overall.

A convex set is like a block of gelatin: poke two toothpicks anywhere inside, and the straight stick connecting them lies entirely within the gelatin. There are no caves or overhangs. When you minimize a convex function over such a shape, you are guaranteed that moving a little downhill cannot lead you into a deceptive side pocket. Key building rules preserve this bowl-like shape: adding bowls makes a bigger bowl (positive weighted sum), tilting or shifting a bowl (affine transformation) keeps it bowl-shaped, and taking the upper envelope of bowls (pointwise max) is still a bowl. This explains why so many machine learning losses plus regularizers are convex: squared error or logistic loss (smooth bowl) plus an L1/L2 penalty (cone- or ridge-shaped bowl) remains convex.

From an algorithmic perspective, first-order methods use the slope (gradient) as a compass pointing downhill. Proximal operators are like a “magnet” that pulls you back to structures you want (sparsity via soft-thresholding). Newton’s method examines local curvature to take smarter, longer strides down the valley, while interior-point methods slide along the interior, repelled from constraints by a virtual force field (barrier), heading to the optimum efficiently.
03 Formal Definition
A set $C \subseteq \mathbb{R}^n$ is convex if for all $x, y \in C$ and all $\theta \in [0,1]$, the point $\theta x + (1-\theta) y$ also lies in $C$. A function $f$ is convex if its domain is a convex set and $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all such $x, y, \theta$. A convex optimization problem in standard form is: minimize $f_0(x)$ subject to $f_i(x) \le 0$, $i = 1, \dots, m$, and $Ax = b$, where $f_0, \dots, f_m$ are convex. For such problems, any local minimizer is a global minimizer; if $f_0$ is strictly convex, the minimizer is unique.
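The defining inequality can also be probed numerically. Below is a minimal, self-contained sketch (not from the original text; the helper name `looks_convex` and the test functions are illustrative) that samples random point pairs and checks the inequality, distinguishing a convex function from a nonconvex one:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Empirically test the convexity inequality
// f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples.
// A single violation disproves convexity; passing all samples is
// only evidence, not a proof.
static bool looks_convex(function<double(double)> f, int trials = 100000) {
    mt19937 rng(123);
    uniform_real_distribution<double> U(-3.0, 3.0), T(0.0, 1.0);
    for (int i = 0; i < trials; ++i) {
        double x = U(rng), y = U(rng), t = T(rng);
        double lhs = f(t*x + (1-t)*y);
        double rhs = t*f(x) + (1-t)*f(y);
        if (lhs > rhs + 1e-9) return false;  // inequality violated
    }
    return true;
}

int main() {
    cout << boolalpha;
    cout << "x^4 + 2x^2 looks convex: "
         << looks_convex([](double x){ return x*x*x*x + 2*x*x; }) << "\n"; // true
    cout << "sin(x) looks convex:     "
         << looks_convex([](double x){ return sin(x); }) << "\n";          // false
    return 0;
}
```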
04 When to Use
Use convex optimization whenever your model can be expressed with a convex objective and convex constraints. Typical ML cases include: LASSO (sparse linear regression) using an L1 penalty; ridge regression and elastic net; logistic regression for classification; support vector machines (hinge loss + L2); and many matrix factorization surrogates that are convex in one block at a time. In operations research, linear and quadratic programs model resource allocation, scheduling, and portfolio optimization. In signal processing, convex denoising and compressed sensing rely on convex regularizers.

Choose first-order methods when the dataset is large (n, d in the millions) and you can tolerate moderate accuracy; their iterations are cheap and memory-light. Proximal methods are ideal when your regularizer promotes structure (sparsity, constraints) via simple proximal maps (e.g., soft-thresholding, projection onto a box or simplex). Use Newton or quasi-Newton when the feature dimension d is moderate (so that forming/solving linear systems is feasible) and you want fast, high-accuracy convergence. Employ interior-point methods if you need a general-purpose, high-accuracy solver for LP/QP or small-to-medium conic problems. When integrating into C++, consider numerical stability, cache-friendly dense/sparse multiplies, and stopping rules based on gradient norm, duality gap, or relative objective decrease.
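As a concrete illustration of the stopping rules just mentioned, here is a hedged sketch of a combined test; the `StoppingTest` struct and its threshold values are illustrative assumptions, not a standard API:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Illustrative stopping test combining two criteria from the text:
// (1) small gradient norm, (2) small *relative* objective decrease.
// Thresholds are typical defaults, not prescriptive values.
struct StoppingTest {
    double grad_tol = 1e-6;   // ||grad|| threshold
    double rel_tol  = 1e-9;   // relative decrease threshold
    double f_prev   = numeric_limits<double>::infinity();

    bool done(double f_cur, double grad_norm) {
        bool small_grad = grad_norm < grad_tol;
        // Skip the decrease test on the first call; guard the
        // denominator with max(1, |f_prev|) for tiny objectives.
        bool small_decrease = isfinite(f_prev) &&
            fabs(f_prev - f_cur) / max(1.0, fabs(f_prev)) < rel_tol;
        f_prev = f_cur;
        return small_grad || small_decrease;
    }
};

int main() {
    StoppingTest stop;
    // Mock a few iterations of a decreasing objective.
    double fs[] = {10.0, 2.0, 1.0001, 1.0001};
    double gs[] = {5.0, 1.0, 1e-3, 1e-3};
    for (int k = 0; k < 4; ++k)
        if (stop.done(fs[k], gs[k])) { cout << "stop at iter " << k << "\n"; break; }
    return 0;
}
```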
⚠️ Common Mistakes
- Assuming convexity without verification. Always check that each component (loss, regularizer, constraints) is convex and that the operations used preserve convexity (e.g., composition rules). A single nonconvex part breaks the guarantee.
- Using an aggressive fixed step size in gradient methods. Without line search or knowledge of the Lipschitz constant, you can oscillate or diverge. Employ backtracking with the Armijo condition or estimate the Lipschitz constant.
- Ignoring scaling and conditioning. Poorly scaled features produce ill-conditioned Hessians, slowing first-order methods and destabilizing Newton steps. Standardize features and consider preconditioning (see the sketch after this list).
- Stopping too early or too late. A tiny absolute change in objective may still be far from optimal on large-scale problems; prefer relative criteria and gradient (or subgradient) norms. Conversely, over-optimizing beyond the noise level wastes time.
- Mishandling nonsmooth terms. For L1 norms or hinge losses, vanilla gradient descent is invalid; use subgradients or proximal operators (ISTA/FISTA). Incorrectly implementing soft-thresholding or projections leads to wrong solutions.
- Forming dense Hessians unnecessarily. For high-dimensional sparse data, use Hessian–vector products or quasi-Newton (L-BFGS), and avoid O(d^2) storage.
- Violating domain constraints. Barrier or projection methods require feasible initialization or a careful line search to remain in the domain (e.g., log barriers need strictly interior points).
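Because poor scaling appears in several of these pitfalls, a standardization pass is often the first fix. The sketch below is a minimal illustration (the helper `standardize_columns` is not from the original; it assumes dense row-major data) that rescales each feature to zero mean and unit variance:

```cpp
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

// Standardize each column of X (n rows, d features) in place:
// subtract the column mean and divide by the column standard deviation.
// This improves conditioning for both first- and second-order methods.
static void standardize_columns(Mat &X) {
    if (X.empty()) return;
    size_t n = X.size(), d = X[0].size();
    for (size_t j = 0; j < d; ++j) {
        double mean = 0;
        for (size_t i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0;
        for (size_t i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        double sd = sqrt(var / n);
        if (sd < 1e-12) sd = 1.0;  // constant column: leave it centered only
        for (size_t i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / sd;
    }
}

int main() {
    Mat X = {{1000, 0.1}, {2000, 0.2}, {3000, 0.3}};  // wildly different scales
    standardize_columns(X);
    for (auto &row : X) { for (double v : row) cout << v << ' '; cout << '\n'; }
    return 0;
}
```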
Key Formulas
Convexity (Jensen-type inequality)
$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all $x, y \in \operatorname{dom} f$ and $\theta \in [0, 1]$
Explanation: This is the defining inequality of convex functions. It states that the function value at any convex combination of two points is no greater than the corresponding convex combination of the values.
Epigraph
$\operatorname{epi} f = \{(x, t) : x \in \operatorname{dom} f,\ f(x) \le t\}$
Explanation: The epigraph of f is convex if and only if f is convex. This geometric viewpoint is often easier to reason about in proofs.
Hessian PSD test
$\nabla^2 f(x) \succeq 0$ for all $x \in \operatorname{dom} f$
Explanation: For twice-differentiable f, convexity is equivalent to the Hessian being positive semidefinite everywhere. This connects geometry with calculus.
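In code, a practical way to apply this test is to attempt a Cholesky factorization of $H + \varepsilon I$: success indicates the matrix is positive semidefinite up to the tolerance $\varepsilon$. A minimal sketch (the `cholesky_ok` helper is illustrative, not from the original):

```cpp
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

// Attempt an in-place Cholesky factorization. Success implies the matrix
// is numerically positive definite; testing H + eps*I gives a practical
// positive *semi*definite check up to tolerance eps.
static bool cholesky_ok(Mat H, double eps = 1e-10) {
    int n = (int)H.size();
    for (int i = 0; i < n; ++i) H[i][i] += eps;   // shift for the semidefinite case
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double s = H[i][j];
            for (int k = 0; k < j; ++k) s -= H[i][k] * H[j][k];
            if (i == j) {
                if (s <= 0) return false;          // nonpositive pivot: not PSD
                H[i][i] = sqrt(s);
            } else {
                H[i][j] = s / H[j][j];             // store the L factor in place
            }
        }
    }
    return true;
}

int main() {
    Mat psd  = {{2, 1}, {1, 2}};    // eigenvalues 1 and 3: PSD
    Mat notp = {{1, 2}, {2, 1}};    // eigenvalues 3 and -1: not PSD
    cout << boolalpha << cholesky_ok(psd) << ' ' << cholesky_ok(notp) << "\n";
    return 0;
}
```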
Gradient descent update
$x_{k+1} = x_k - t_k \nabla f(x_k)$
Explanation: Each iteration moves opposite the gradient by a step size $t_k$. For convex f with Lipschitz gradients, this converges to a global minimizer.
ISTA (soft-thresholding)
$x_{k+1} = S_{\lambda t}\big(x_k - t \nabla g(x_k)\big)$, where $S_\tau(u)_i = \operatorname{sign}(u_i) \max(|u_i| - \tau,\ 0)$
Explanation: For the composite objective $g(x) + \lambda \|x\|_1$ with g smooth, the proximal step becomes soft-thresholding. This efficiently handles sparsity-inducing L1 penalties.
Armijo condition
$f(x + t s) \le f(x) + c\, t\, \nabla f(x)^{\top} s$, with $c \in (0, 1)$ and descent direction $s$
Explanation: A sufficient decrease rule used in backtracking line search. It prevents overly large steps that increase the objective.
LASSO objective
$\min_x\ \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1$
Explanation: Balances data fit (least squares) against sparsity (L1 norm). This is convex but nonsmooth and is well-suited to proximal methods.
Lagrangian
$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$
Explanation: Combines the objective with constraints using multipliers. Leads to dual problems and KKT conditions for optimality.
First-order optimality
$\nabla f(x^\star) = 0$ (smooth case), or $0 \in \partial f(x^\star)$ (nonsmooth case)
Explanation: For convex problems, a zero gradient (or zero in the subdifferential) characterizes global optimality, not just local.
Convergence rates
$f(x_k) - f^\star \le \dfrac{L \|x_0 - x^\star\|_2^2}{2k}$ (L-smooth convex); $\quad f(x_k) - f^\star \le \left(1 - \tfrac{m}{L}\right)^{k} \big(f(x_0) - f^\star\big)$ (additionally m-strongly convex)
Explanation: Gradient descent on L-smooth convex functions reaches $\varepsilon$-accuracy in $O(L/\varepsilon)$ iterations; if f is also m-strongly convex, convergence is linear, yielding the second rate.
Logistic regression Newton system
$(X^{\top} W X + \lambda I)\, s = -\big(X^{\top}(p - y) + \lambda w\big)$
Explanation: The Newton direction solves H s = -g. Here W is diagonal with entries $p_i(1 - p_i)$ and $p = \sigma(Xw)$. This yields fast convergence for moderate d.
Complexity Analysis
For dense data with n samples and d features, the methods above have the following typical costs. A gradient (or ISTA) iteration costs O(nd) time and O(d) extra memory, dominated by matrix–vector products; the soft-thresholding step adds only O(d). Newton's method costs O(nd^2) to form the Hessian and O(d^3) to factor it, with O(d^2) storage, which is why it suits moderate d. In iteration counts, gradient descent on an L-smooth convex objective needs O(L/ε) iterations for ε-accuracy (accelerated variants such as FISTA need O(√(L/ε))), strong convexity yields linear (geometric) convergence, and Newton's method is locally quadratically convergent. Interior-point methods typically converge in a few tens of iterations, each requiring a linear system solve.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Utility linear algebra for small dense problems
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }

// Quadratic objective: f(x) = 0.5 x^T Q x - b^T x, with SPD Q
struct Quadratic {
    Mat Q; Vec b;
    double f(const Vec &x) const {
        Vec Qx = matvec(Q,x);
        return 0.5*dot(x,Qx) - dot(b,x);
    }
    Vec grad(const Vec &x) const {
        Vec g = matvec(Q,x);                          // g = Qx - b
        for(size_t i=0;i<g.size();++i) g[i]-=b[i];
        return g;
    }
};

// Backtracking line search with Armijo condition
static double backtracking(const Quadratic &obj, const Vec &x, const Vec &g,
                           double init_t=1.0, double c=1e-4, double beta=0.5){
    double t = init_t;
    double fx = obj.f(x);
    double gnorm2 = dot(g,g);
    while(true){
        Vec x_new = add(x, scal(g, -t));
        double f_new = obj.f(x_new);
        if(f_new <= fx - c*t*gnorm2) break;           // sufficient decrease
        t *= beta;
        if(t < 1e-16) break;                          // safeguard
    }
    return t;
}

int main(){
    // Example: minimize 0.5 x^T Q x - b^T x, Q SPD
    int d = 3;
    Mat Q = {{4,1,0},{1,3,0},{0,0,2}};                // SPD
    Vec b = {1,2,3};
    Quadratic obj{Q,b};

    Vec x(d, 0.0);                                    // start at zero
    double tol = 1e-8;
    int maxit = 10000;

    for(int k=0;k<maxit;++k){
        Vec g = obj.grad(x);
        double gnorm = norm2(g);
        if(gnorm < tol){ cout << "Converged in " << k << " iterations\n"; break; }
        double t = backtracking(obj, x, g);
        x = add(x, scal(g, -t));
        if(k%1000==0){ cout << "Iter " << k << ", f(x)=" << obj.f(x) << ", ||g||=" << gnorm << "\n"; }
    }

    cout << fixed << setprecision(6);
    cout << "Solution x*: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "f(x*) = " << obj.f(x) << "\n";
    return 0;
}
```
This program minimizes a strongly convex quadratic using gradient descent with a backtracking Armijo line search. The gradient is g = Qx − b. Backtracking adaptively chooses a safe step size that ensures sufficient decrease. Because the objective is convex with SPD Q, any stationary point is the global minimizer, and gradient descent converges linearly to it.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }

// Soft-thresholding operator S_tau(u) = sign(u) * max(|u| - tau, 0)
static Vec soft_threshold(const Vec &u, double tau){
    Vec x=u;
    for(double &v: x){
        double s = (v>0) - (v<0);
        double a = fabs(v) - tau;
        v = (a>0 ? s*a : 0.0);
    }
    return x;
}

struct Lasso {
    Mat A; Vec b; double lambda;
    // Smooth part g(x) = 0.5*||Ax - b||^2
    double g(const Vec &x) const {
        Vec r = matvec(A,x);                           // r = Ax - b
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return 0.5*dot(r,r);
    }
    Vec grad_g(const Vec &x) const {
        Vec r = matvec(A,x);
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return matTvec(A, r);                          // A^T (Ax - b)
    }
    double f(const Vec &x) const {                     // full objective
        double val = g(x);
        double l1=0; for(double xi: x) l1 += fabs(xi);
        return val + lambda*l1;
    }
};

int main(){
    // Small synthetic example: A in R^{m x d}, b in R^m
    int m=50, d=20; double lambda=0.1;
    std::mt19937 rng(42); std::normal_distribution<double> N(0.0,1.0);
    Mat A(m, Vec(d)); Vec x_true(d,0.0);
    // sparse ground truth with 3 nonzeros
    x_true[2]=2.0; x_true[7]=-1.5; x_true[15]=1.0;
    for(int i=0;i<m;++i) for(int j=0;j<d;++j) A[i][j] = N(rng)/sqrt(d);
    Vec b(m,0.0);
    for(int i=0;i<m;++i){ for(int j=0;j<d;++j) b[i]+=A[i][j]*x_true[j]; b[i]+=0.05*N(rng); }

    Lasso prob{A,b,lambda};

    Vec x(d,0.0);          // start at zero
    double t = 1.0;        // initial step; backtracking ensures majorization
    double beta = 0.5;     // shrink factor
    double tol = 1e-6; int maxit = 1000;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad_g(x);
        // Backtracking to satisfy: g(y) <= g(x) + grad^T(y-x) + (1/(2t))*||y-x||^2
        double t_cur = t;
        Vec y;
        while(true){
            Vec u = add(x, scal(g, -t_cur));
            y = soft_threshold(u, prob.lambda * t_cur);
            // Check the majorization inequality
            double gx = prob.g(x);
            double gy = prob.g(y);
            Vec s = add(y, x, -1.0);                   // s = y - x
            double rhs = gx + dot(g, s) + (0.5/t_cur)*dot(s,s);
            if(gy <= rhs || t_cur < 1e-16) break;
            t_cur *= beta;
        }
        Vec s = add(y, x, -1.0);
        x = y;
        if(norm2(s) / max(1.0, norm2(x)) < tol){ cout << "Converged in " << k+1 << " iterations\n"; break; }
        if((k+1)%50==0){ cout << "Iter " << k+1 << ", f(x)=" << prob.f(x) << "\n"; }
        t = t_cur * 1.1;                               // slight increase heuristic
    }

    cout << fixed << setprecision(4);
    cout << "Recovered x: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "Objective f(x): " << prob.f(x) << "\n";
    return 0;
}
```
Implements ISTA (proximal gradient) for the LASSO. Each iteration takes a gradient step on the smooth part g(x)=0.5||Ax−b||^2 and then applies the soft-thresholding proximal map for the L1 penalty. A backtracking test enforces a valid step size without computing the Lipschitz constant. This converges to the global minimizer because the objective is convex.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }
// Numerically stable sigmoid
static double sigmoid(double z){ if(z>=0){ double e=exp(-z); return 1.0/(1.0+e); } else { double e=exp(z); return e/(1.0+e); } }

// Cholesky solve for the SPD system H s = b
static bool cholesky_solve(const Mat &H, const Vec &b, Vec &x){
    int n = (int)H.size();
    vector<vector<double>> L(n, vector<double>(n,0.0));
    for(int i=0;i<n;++i){
        for(int j=0;j<=i;++j){
            double sum=H[i][j];
            for(int k=0;k<j;++k) sum -= L[i][k]*L[j][k];
            if(i==j){ if(sum<=1e-14) return false; L[i][i]=sqrt(sum); }
            else{ L[i][j]=sum/L[j][j]; }
        }
    }
    // Forward solve L y = b
    Vec y(n,0.0);
    for(int i=0;i<n;++i){ double s=b[i]; for(int k=0;k<i;++k) s-=L[i][k]*y[k]; y[i]=s/L[i][i]; }
    // Backward solve L^T x = y
    x.assign(n,0.0);
    for(int i=n-1;i>=0;--i){ double s=y[i]; for(int k=i+1;k<n;++k) s-=L[k][i]*x[k]; x[i]=s/L[i][i]; }
    return true;
}

struct LogisticL2 {
    Mat X; Vec y; double lambda;                       // y in {0,1}
    double f(const Vec &w) const {
        Vec z = matvec(X, w);
        double loss=0;
        for(size_t i=0;i<z.size();++i){
            double p=sigmoid(z[i]); double yi=y[i];
            loss += -(yi*log(max(p,1e-15)) + (1-yi)*log(max(1-p,1e-15)));
        }
        return loss + 0.5*lambda*dot(w,w);
    }
    Vec grad(const Vec &w) const {
        Vec z = matvec(X, w);
        Vec p(z.size()); for(size_t i=0;i<z.size();++i) p[i]=sigmoid(z[i]);
        Vec r = p; for(size_t i=0;i<r.size();++i) r[i]-=y[i];
        Vec g = matTvec(X, r);                         // X^T (p - y)
        for(size_t j=0;j<w.size();++j) g[j] += lambda*w[j];
        return g;
    }
    Mat hess(const Vec &w) const {
        // H = X^T W X + lambda*I, W = diag(p_i*(1-p_i))
        Vec z = matvec(X,w); size_t n=X.size(), d=X[0].size();
        Vec wdiag(n); for(size_t i=0;i<n;++i){ double p=sigmoid(z[i]); wdiag[i]=p*(1-p); }
        Mat H(d, vector<double>(d,0.0));
        for(size_t i=0;i<n;++i){
            double wi=wdiag[i]; if(wi==0) continue;    // add wi * x_i x_i^T (lower triangle)
            for(size_t a=0;a<d;++a){ double xa=X[i][a]*wi; for(size_t b=0;b<=a;++b) H[a][b] += xa*X[i][b]; }
        }
        // symmetrize and add lambda*I
        for(size_t a=0;a<d;++a){ for(size_t b=0;b<a;++b) H[b][a]=H[a][b]; H[a][a]+=lambda; }
        return H;
    }
};

int main(){
    // Synthetic binary classification
    int n=200, d=5; double lambda=1e-1;
    std::mt19937 rng(7); std::normal_distribution<double> N(0,1);
    Mat X(n, Vec(d)); Vec w_true(d);
    for(int j=0;j<d;++j) w_true[j]=N(rng);
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j]=N(rng);
    Vec y(n);
    for(int i=0;i<n;++i){
        double z=0; for(int j=0;j<d;++j) z+=X[i][j]*w_true[j];
        double p=1.0/(1.0+exp(-z));
        y[i] = (std::generate_canonical<double,10>(rng) < p ? 1.0 : 0.0);
    }

    LogisticL2 prob{X,y,lambda};
    Vec w(d,0.0);
    double tol=1e-6; int maxit=50; double c=1e-4; double beta=0.5;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad(w);
        double gnorm = sqrt(dot(g,g));
        if(gnorm < tol){ cout << "Converged (grad norm) in " << k << " iterations\n"; break; }
        Mat H = prob.hess(w);
        Vec s(d);
        // Solve H s = -g via Cholesky
        Vec rhs = g; for(double &v: rhs) v = -v;
        if(!cholesky_solve(H, rhs, s)){ cerr << "Hessian not SPD; aborting\n"; break; }
        // Backtracking Armijo line search along the Newton direction s
        double t=1.0, f0=prob.f(w), decr=0;
        for(size_t i=0;i<w.size();++i) decr += g[i]*s[i];
        while(true){
            Vec w_new = add(w, scal(s, t));
            double f_new = prob.f(w_new);
            if(f_new <= f0 + c*t*decr || t<1e-16){ w=w_new; break; }
            t *= beta;
        }
        cout << "Iter "<<k<<": f(w)="<<prob.f(w)<<", ||g||="<<gnorm<<"\n";
    }

    cout << fixed << setprecision(5);
    cout << "Estimated w: "; for(double wi: w) cout << wi << ' '; cout << "\n";
    cout << "Final objective: " << prob.f(w) << "\n";
    return 0;
}
```
Trains L2-regularized logistic regression with Newton’s method. Each iteration builds the Hessian H = X^T W X + λI and gradient g = X^T(p − y) + λw, solves Hs = −g via Cholesky, and uses backtracking to ensure sufficient decrease. For moderate feature dimension, Newton steps achieve rapid convergence to the global optimum.