📚 Theory · Intermediate

Convex Optimization

Key Points

  • Convex optimization studies minimizing convex functions over convex sets, where every local minimum is guaranteed to be a global minimum.
  • Convexity can be checked by geometry (epigraph is convex) or calculus (Hessian is positive semidefinite).
  • Common convex models in ML include LASSO, logistic regression, support vector machines, and many quadratic or linear programs.
  • Convexity is preserved under positive weighted sums, composition with affine maps, and pointwise maxima.
  • First-order methods (gradient, subgradient, proximal/ISTA) scale well to large problems by using only gradients and simple projections.
  • Second-order methods (Newton, quasi-Newton) converge faster but require solving linear systems per iteration.
  • Interior-point and barrier methods solve LPs, QPs, and general conic programs to high accuracy in a small number of iterations.
  • C++ implementations require careful linear algebra (vectors, matrix–vector multiplies, line search) and attention to numerical stability.

Prerequisites

  • Linear algebra (vectors, matrices, norms, eigenvalues): convexity tests, gradients/Hessians, and algorithms rely on matrix calculus and spectral properties.
  • Calculus and multivariable optimization: gradients, Hessians, Taylor expansions, and optimality conditions are central to algorithm design.
  • Probability and statistics: ML losses (logistic, least squares) and regularization have probabilistic interpretations used in modeling.
  • Numerical methods and stability: line search, conditioning, and linear system solvers affect convergence and robustness of implementations.
  • Data structures and algorithmic complexity: choosing algorithms (first- vs second-order) depends on time/space trade-offs and sparsity patterns.

Detailed Explanation


01 Overview

Convex optimization is the study of minimizing a convex objective function over a convex feasible set. Intuitively, a set is convex if it has no “dents”: for any two points inside, the straight line segment between them remains entirely inside. A function is convex if it curves upward everywhere: averaging inputs never yields a value larger than the corresponding average of function values. These two properties form a powerful pair: when both the objective and the feasible region are convex, every local minimum is automatically a global minimum, removing the need to worry about bad local traps.

This field provides both a clean theory and practical algorithms. Theoretically, convexity allows tight guarantees on optimality (via duality and KKT conditions) and on the convergence rates of algorithms. Practically, many real-world problems, such as resource allocation, portfolio optimization, regularized regression, and margin maximization, fit naturally as convex programs. Algorithms range from scalable first-order methods (gradient, subgradient, proximal) to fast-converging second-order methods (Newton, interior-point).

Modern modeling frameworks can express a wide variety of problems and call efficient solvers, but it is also educational (and often necessary in systems programming) to implement core methods directly in C++ for speed and control. With careful numerical linear algebra, line search, and stopping criteria, robust convex optimization routines can be built for machine learning and signal processing tasks.

02 Intuition & Analogies

Imagine hiking in a bowl-shaped valley. No matter where you start, if you walk downhill, you eventually reach the bottom, and crucially, there is only one bottom. That is convex optimization: the valley is the convex function, and the bottom is the global minimum. In contrast, a landscape with multiple pits would be nonconvex; you could get stuck in a small puddle that is not the lowest point overall.

A convex set is like a block of gelatin: poke two toothpicks anywhere inside, and the straight stick connecting them lies entirely within the gelatin. There are no caves or overhangs. When you minimize a convex function over such a shape, you are guaranteed that moving a little downhill cannot lead you into a deceptive side pocket. Key building rules preserve this bowl-like shape: adding bowls makes a bigger bowl (positive weighted sum), tilting or shifting a bowl (affine transformation) keeps it bowl-shaped, and taking the upper envelope of bowls (pointwise max) still yields a bowl. This explains why so many machine learning losses plus regularizers are convex: squared error or logistic loss (a smooth bowl) plus an L1/L2 penalty (a cone- or ridge-shaped bowl) remains convex.

From an algorithmic perspective, first-order methods use the slope (gradient) as a compass pointing downhill. Proximal operators are like a “magnet” that pulls you back to structures you want (sparsity via soft-thresholding). Newton’s method examines local curvature to take smarter, longer strides down the valley, while interior-point methods slide along the interior, repelled from constraints by a virtual force field (barrier), heading to the optimum efficiently.

03 Formal Definition

A set C ⊆ ℝ^n is convex if for all x, y ∈ C and all t ∈ [0, 1], the point tx + (1 − t)y ∈ C. A function f: ℝ^n → ℝ is convex if for all x, y and t ∈ [0, 1], f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y). Equivalently, the epigraph epi(f) = {(x, t) | f(x) ≤ t} is a convex set. If f is twice differentiable, a necessary and sufficient condition for convexity is that its Hessian is positive semidefinite everywhere: ∇²f(x) ⪰ 0. A convex optimization problem has the form minimize f(x) subject to x ∈ C, where f is convex and C is a convex set (often described by convex inequalities g_i(x) ≤ 0 and affine equalities h_j(x) = 0).

Optimality is characterized by first-order conditions and duality: the Karush–Kuhn–Tucker (KKT) conditions are necessary and sufficient under Slater’s condition. For smooth unconstrained problems, x⋆ is optimal iff ∇f(x⋆) = 0. For nonsmooth convex objectives, 0 ∈ ∂f(x⋆), where ∂f denotes the subdifferential. Strong convexity (there exists m > 0 with ∇²f(x) ⪰ m I) implies uniqueness of the minimizer and linear convergence of gradient descent with a suitable step size. Algorithmic frameworks include first-order methods (gradient, subgradient, proximal), second-order methods (Newton, quasi-Newton), and interior-point methods for general cones (LP/QP/SOCP/SDP).
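
As a quick sanity check that complements these definitions, one can numerically probe the Jensen-type inequality on random samples. The following minimal sketch (an illustrative test, not a proof technique) checks the inequality for the convex softplus function; passing the test never proves convexity, but a single violation disproves it. The sample range, count, and tolerance are assumptions chosen for illustration.

#include <cmath>
#include <iostream>
#include <random>
using namespace std;

int main(){
    // Convex test function: softplus f(x) = log(1 + e^x)
    auto f = [](double x){ return log1p(exp(x)); };
    mt19937 rng(0);
    uniform_real_distribution<double> U(-10.0, 10.0), T(0.0, 1.0);
    int violations = 0;
    for(int i=0;i<100000;++i){
        double x = U(rng), y = U(rng), t = T(rng);
        double lhs = f(t*x + (1-t)*y);       // f at a convex combination
        double rhs = t*f(x) + (1-t)*f(y);    // convex combination of values
        if(lhs > rhs + 1e-12) ++violations;  // small tolerance for round-off
    }
    cout << "Jensen violations: " << violations << "\n"; // expect 0 for a convex f
    return 0;
}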

04 When to Use

Use convex optimization whenever your model can be expressed with a convex objective and convex constraints. Typical ML cases include: LASSO (sparse linear regression) using an L1 penalty; ridge regression and elastic net; logistic regression for classification; support vector machines (hinge loss + L2); and many matrix factorization surrogates that are convex in one block at a time. In operations research, linear and quadratic programs model resource allocation, scheduling, and portfolio optimization. In signal processing, convex denoising and compressed sensing rely on convex regularizers. Choose first-order methods when the dataset is large (n, d in the millions) and you can tolerate moderate accuracy; their iterations are cheap and memory-light. Proximal methods are ideal when your regularizer promotes structure (sparsity, constraints) via simple proximal maps (e.g., soft-thresholding, projection onto a box or simplex). Use Newton or quasi-Newton when feature dimension d is moderate (so that forming/solving linear systems is feasible) and you want fast, high-accuracy convergence. Employ interior-point methods if you need a general-purpose, high-accuracy solver for LP/QP or small-to-medium conic problems. When integrating into C++, consider numerical stability, cache-friendly dense/sparse multiplies, and stopping rules based on gradient norm, duality gap, or relative objective decrease.
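
The “simple proximal maps” mentioned above are short enough to show directly. Below is a minimal sketch of two of them, soft-thresholding (the proximal map of the L1 norm) and coordinate-wise projection onto a box; the function names and sample values are illustrative assumptions, and the same soft-thresholding operator reappears in the LASSO example later in this section.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

// prox of tau*||.||_1: shrink each coordinate toward zero by tau
static vector<double> soft_threshold(vector<double> u, double tau){
    for(double &v : u){ double a = fabs(v) - tau; v = (a > 0) ? (v > 0 ? a : -a) : 0.0; }
    return u;
}

// Euclidean projection onto the box [lo, hi]^n: clip each coordinate
static vector<double> project_box(vector<double> u, double lo, double hi){
    for(double &v : u) v = clamp(v, lo, hi);
    return u;
}

int main(){
    vector<double> u = {1.2, -0.3, 0.05, -2.0};
    for(double v : soft_threshold(u, 0.5)) cout << v << ' ';   // 0.7 0 0 -1.5
    cout << '\n';
    for(double v : project_box(u, -1.0, 1.0)) cout << v << ' '; // 1 -0.3 0.05 -1
    cout << '\n';
    return 0;
}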

⚠️ Common Mistakes

  • Assuming convexity without verification. Always check that each component (loss, regularizer, constraints) is convex and that the operations used preserve convexity (e.g., composition rules). A single nonconvex part breaks the guarantee.
  • Using an aggressive fixed step size in gradient methods. Without line search or knowledge of the Lipschitz constant, you can oscillate or diverge; the sketch after this list makes this concrete. Employ backtracking with the Armijo condition or estimate the Lipschitz constant.
  • Ignoring scaling and conditioning. Poorly scaled features produce ill-conditioned Hessians, slowing first-order methods and destabilizing Newton steps. Standardize features and consider preconditioning.
  • Stopping too early or too late. A tiny absolute change in objective may still be far from optimal on large-scale problems; prefer relative criteria and gradient (or subgradient) norms. Conversely, over-optimizing beyond the noise level wastes time.
  • Mishandling nonsmooth terms. For L1 norms or hinge losses, vanilla gradient descent is invalid; use subgradients or proximal operators (ISTA/FISTA). Incorrectly implementing soft-thresholding or projections leads to wrong solutions.
  • Forming dense Hessians unnecessarily. For high-dimensional sparse data, use Hessian–vector products or quasi-Newton (L-BFGS), and avoid O(d^2) storage.
  • Violating domain constraints. Barrier or projection methods require feasible initialization or a careful line search to remain in the domain (e.g., log barriers need strictly interior points).
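
The step-size pitfall in the second bullet is easy to demonstrate. The sketch below runs fixed-step gradient descent on the one-dimensional quadratic f(x) = 0.5 L x^2, whose gradient is L x, so iterates scale by (1 − tL) each step and diverge once t exceeds 2/L; the constants are illustrative.

#include <cmath>
#include <iostream>
using namespace std;

int main(){
    double L = 10.0;                      // Lipschitz constant of f'(x) = L x
    for(double t : {0.05, 0.19, 0.21}){   // threshold for stability is 2/L = 0.2
        double x = 1.0;
        for(int k=0;k<50;++k) x -= t * (L * x);  // fixed-step gradient descent
        cout << "t = " << t << " -> |x_50| = " << fabs(x)
             << (t < 2.0/L ? "  (converges)" : "  (diverges)") << "\n";
    }
    return 0;
}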

Key Formulas

Convexity (Jensen-type inequality)

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) for all x, y and t ∈ [0, 1]

Explanation: This is the defining inequality of convex functions. It states that the function value at any convex combination of two points is no greater than the corresponding convex combination of the values.

Epigraph

epi(f) = {(x, t) ∈ ℝ^n × ℝ | f(x) ≤ t}

Explanation: The epigraph of f is convex if and only if f is convex. This geometric viewpoint is often easier to reason about in proofs.

Hessian PSD test

∇²f(x) ⪰ 0 for all x in the domain

Explanation: For twice-differentiable f, convexity is equivalent to the Hessian being positive semidefinite everywhere. This connects geometry with calculus.

Gradient descent update

x_{k+1} = x_k − t_k ∇f(x_k)

Explanation: Each iteration moves opposite the gradient by a step size t_k. For convex f with Lipschitz gradients, this converges to a global minimizer.

ISTA (soft-thresholding)

x_{k+1} = S_{λt}(x_k − t ∇g(x_k)), where S_τ(u)_i = sign(u_i) · max(|u_i| − τ, 0)

Explanation: For the objective g(x) + λ||x||_1 with g smooth, the proximal step becomes soft-thresholding. This efficiently handles sparsity-inducing L1 penalties.

Armijo condition

f(x + t s) ≤ f(x) + c t ∇f(x)^T s, with c ∈ (0, 1) and descent direction s

Explanation: A sufficient decrease rule used in backtracking line search. It prevents overly large steps that increase the objective.

LASSO objective

minimize_x 0.5||Ax − b||² + λ||x||_1

Explanation: Balances data fit (least squares) against sparsity (L1 norm). This is convex but nonsmooth and is well-suited to proximal methods.

Lagrangian

L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x), with λ_i ≥ 0

Explanation: Combines the objective with constraints using multipliers. Leads to dual problems and KKT conditions for optimality.

First-order optimality

∇f(x⋆) = 0 (smooth case), or 0 ∈ ∂f(x⋆) (nonsmooth case)

Explanation: For convex problems, a zero gradient (or zero in the subdifferential) characterizes global optimality, not just local.

Convergence rates

f(x_k) − f⋆ ≤ O(L/k) for L-smooth convex f; f(x_k) − f⋆ ≤ O((1 − m/L)^k) if also m-strongly convex

Explanation: Gradient descent on L-smooth convex functions reaches ε-accuracy in O(L/ε) iterations; if the function is also m-strongly convex, convergence is linear, yielding the second rate.

Logistic regression Newton system

(X^T W X + λI) s = −g, where W = diag(p_i(1 − p_i)) and p = σ(Xw)

Explanation: The Newton direction solves H s = −g. Here W has diagonal entries p_i(1 − p_i) with p = σ(Xw). This yields fast convergence for moderate d.

Complexity Analysis

For dense problems with variable dimension d and sample size n, complexity is governed by how many passes over the data are needed and how expensive each pass is. Gradient descent for a smooth convex function with Lipschitz gradient costs O(C_grad) per iteration, where C_grad is the cost to evaluate the gradient. For least squares or logistic regression with dense data X ∈ ℝ^{n×d}, C_grad = O(nd). Convergence to ε-accuracy requires O(L/ε) iterations for general convex L-smooth functions, or O(√(L/m) log(1/ε)) iterations (the accelerated rate) for m-strongly convex problems, so total time is iterations × per-iteration cost.

Proximal gradient (ISTA) adds a cheap proximal step. For LASSO, soft-thresholding is O(d), so the dominant cost remains the matrix–vector multiplies at O(nd). Backtracking line search may require a few extra function evaluations per iteration, each also O(nd), but usually a small constant factor. Memory is O(nd) to store the data (or less for streaming) and O(d) for iterates.

Newton’s method uses second-order curvature, achieving quadratic convergence near the solution. Each iteration forms the Hessian H in O(nd²) and solves Hs = −g in O(d³) with dense Cholesky. Thus it is attractive when d is modest (hundreds to a few thousands) and high accuracy is needed. For large d, one uses Hessian–vector products with conjugate gradients, reducing memory to O(d) and per-iteration cost to a few gradient-like passes. Interior-point methods for LP/QP typically need tens of iterations, each solving a KKT system whose cost depends on sparsity and factorization; for dense QP with d variables, per-iteration cost is roughly O(d³). Overall, the choice of algorithm should match problem size, sparsity, conditioning, required accuracy, and available memory/cache characteristics.
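
As a sketch of the Hessian-free alternative mentioned above, the routine below solves H s = rhs by conjugate gradients using only a Hessian–vector product callback, so H is never formed and memory stays at O(d). The diagonal toy operator is an assumption standing in for a real product (for logistic regression it would be v ↦ X^T(W(Xv)) + λv); names and tolerances are illustrative.

#include <cmath>
#include <functional>
#include <iostream>
#include <vector>
using namespace std;
using Vec = vector<double>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }

// Matrix-free conjugate gradients for SPD systems H s = rhs
static Vec cg(const function<Vec(const Vec&)> &Hv, const Vec &rhs, int maxit=100, double tol=1e-10){
    Vec s(rhs.size(), 0.0), r = rhs, p = rhs;  // start from s = 0, so residual r = rhs
    double rs = dot(r, r);
    for(int k=0; k<maxit && sqrt(rs) > tol; ++k){
        Vec hp = Hv(p);                        // the only access to H
        double alpha = rs / dot(p, hp);
        for(size_t i=0;i<s.size();++i){ s[i] += alpha*p[i]; r[i] -= alpha*hp[i]; }
        double rs_new = dot(r, r);
        for(size_t i=0;i<p.size();++i) p[i] = r[i] + (rs_new/rs)*p[i];
        rs = rs_new;
    }
    return s;
}

int main(){
    // Toy SPD operator H = diag(1..5), applied without forming a matrix
    auto Hv = [](const Vec &v){ Vec y(v.size()); for(size_t i=0;i<v.size();++i) y[i] = double(i+1)*v[i]; return y; };
    Vec g(5, 1.0);
    Vec s = cg(Hv, g);
    for(double si : s) cout << si << ' ';      // expect 1, 0.5, 0.333..., 0.25, 0.2
    cout << '\n';
    return 0;
}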

Code Examples

Gradient Descent with Backtracking for a Strongly Convex Quadratic
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

// Utility linear algebra for small dense problems
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }

// Quadratic objective: f(x) = 0.5 x^T Q x - b^T x, with SPD Q
struct Quadratic {
    Mat Q; Vec b;
    double f(const Vec &x) const {
        Vec Qx = matvec(Q,x);
        return 0.5*dot(x,Qx) - dot(b,x);
    }
    Vec grad(const Vec &x) const {
        Vec g = matvec(Q,x);                        // g = Qx - b
        for(size_t i=0;i<g.size();++i) g[i]-=b[i];
        return g;
    }
};

// Backtracking line search with Armijo condition
static double backtracking(const Quadratic &obj, const Vec &x, const Vec &g, double init_t=1.0, double c=1e-4, double beta=0.5){
    double t = init_t;
    double fx = obj.f(x);
    double gnorm2 = dot(g,g);
    while(true){
        Vec x_new = add(x, scal(g, -t));
        double f_new = obj.f(x_new);
        if(f_new <= fx - c*t*gnorm2) break;  // sufficient decrease
        t *= beta;
        if(t < 1e-16) break;                 // safeguard
    }
    return t;
}

int main(){
    // Example: minimize 0.5 x^T Q x - b^T x, Q SPD
    int d = 3;
    Mat Q = {{4,1,0},{1,3,0},{0,0,2}};  // SPD
    Vec b = {1,2,3};
    Quadratic obj{Q,b};

    Vec x(d, 0.0);  // start at zero
    double tol = 1e-8;
    int maxit = 10000;

    for(int k=0;k<maxit;++k){
        Vec g = obj.grad(x);
        double gnorm = norm2(g);
        if(gnorm < tol) { cout << "Converged in " << k << " iterations\n"; break; }
        double t = backtracking(obj, x, g);
        x = add(x, scal(g, -t));
        if(k%1000==0){ cout << "Iter " << k << ", f(x)=" << obj.f(x) << ", ||g||=" << gnorm << "\n"; }
    }

    cout << fixed << setprecision(6);
    cout << "Solution x*: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "f(x*) = " << obj.f(x) << "\n";
    return 0;
}

This program minimizes a strongly convex quadratic using gradient descent with backtracking Armijo line search. The gradient is g = Qx − b. Backtracking adaptively chooses a safe step size that ensures sufficient decrease. Because the objective is convex with SPD Q, any stationary point is the global minimizer, and gradient descent converges.

Time: O(d^2) per iteration (dense Q) for the matvec; total O(k d^2) for k iterations.
Space: O(d^2) to store Q, plus O(d) for vectors.

Proximal Gradient (ISTA) for LASSO: min 0.5||Ax − b||^2 + λ||x||_1
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
using Vec = vector<double>; using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }

// Soft-thresholding operator S_tau(u) = sign(u) * max(|u| - tau, 0)
static Vec soft_threshold(const Vec &u, double tau){
    Vec x=u;
    for(double &v: x){ double s = (v>0) - (v<0); double a = fabs(v) - tau; v = (a>0 ? s*a : 0.0); }
    return x;
}

struct Lasso {
    Mat A; Vec b; double lambda;
    // Smooth part g(x) = 0.5||Ax - b||^2
    double g(const Vec &x) const {
        Vec r = matvec(A,x);                        // r = Ax - b
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return 0.5*dot(r,r);
    }
    Vec grad_g(const Vec &x) const {
        Vec r = matvec(A,x);
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return matTvec(A, r);                       // A^T (Ax - b)
    }
    double f(const Vec &x) const {                  // full objective
        double val = g(x);
        double l1=0; for(double xi: x) l1 += fabs(xi);
        return val + lambda*l1;
    }
};

int main(){
    // Small synthetic example: A in R^{mxd}, b in R^m
    int m=50, d=20; double lambda=0.1;
    std::mt19937 rng(42); std::normal_distribution<double> N(0.0,1.0);
    Mat A(m, Vec(d)); Vec x_true(d,0.0);
    // sparse ground truth with 3 nonzeros
    x_true[2]=2.0; x_true[7]=-1.5; x_true[15]=1.0;
    for(int i=0;i<m;++i) for(int j=0;j<d;++j) A[i][j] = N(rng)/sqrt(d);
    Vec b(m,0.0); for(int i=0;i<m;++i){ for(int j=0;j<d;++j) b[i]+=A[i][j]*x_true[j]; b[i]+=0.05*N(rng); }

    Lasso prob{A,b,lambda};

    Vec x(d,0.0);      // start at zero
    double t = 1.0;    // initial step; backtracking ensures majorization
    double beta = 0.5; // shrink factor
    double tol = 1e-6; int maxit = 1000;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad_g(x);
        // Backtracking to satisfy: g(y) <= g(x) + grad^T(y-x) + (1/(2t))||y-x||^2
        double t_cur = t;
        Vec y;
        while(true){
            Vec u = add(x, scal(g, -t_cur));
            y = soft_threshold(u, prob.lambda * t_cur);
            // Check majorization inequality
            double gx = prob.g(x);
            double gy = prob.g(y);
            Vec s = add(y, x, -1.0); // s = y - x
            double rhs = gx + dot(g, s) + (0.5/t_cur)*dot(s,s);
            if(gy <= rhs || t_cur < 1e-16) break;
            t_cur *= beta;
        }
        Vec s = add(y, x, -1.0);
        x = y;
        if(norm2(s) / max(1.0, norm2(x)) < tol){ cout << "Converged in " << k+1 << " iterations\n"; break; }
        if((k+1)%50==0){ cout << "Iter " << k+1 << ", f(x)=" << prob.f(x) << "\n"; }
        t = t_cur * 1.1; // slight increase heuristic
    }

    cout << fixed << setprecision(4);
    cout << "Recovered x: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "Objective f(x): " << prob.f(x) << "\n";
    return 0;
}

Implements ISTA (proximal gradient) for the LASSO. Each iteration takes a gradient step on the smooth part g(x)=0.5||Ax−b||^2 and then applies the soft-thresholding proximal map for the L1 penalty. A backtracking test enforces a valid step size without computing the Lipschitz constant. This converges to the global minimizer because the objective is convex.

Time: O(nd) per iteration for dense A, with a small constant-factor overhead for backtracking.
Space: O(nd) to store A plus O(d) for vectors.
Newton's Method with Backtracking for L2-regularized Logistic Regression
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
using namespace std; using Vec=vector<double>; using Mat=vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }
static double sigmoid(double z){ if(z>=0){ double e=exp(-z); return 1.0/(1.0+e);} else { double e=exp(z); return e/(1.0+e);} }

// Cholesky solve for SPD system H s = b
static bool cholesky_solve(const Mat &H, const Vec &b, Vec &x){
    int n=H.size(); vector<vector<double>> L(n, vector<double>(n,0.0));
    for(int i=0;i<n;++i){
        for(int j=0;j<=i;++j){
            double sum=H[i][j];
            for(int k=0;k<j;++k) sum -= L[i][k]*L[j][k];
            if(i==j){ if(sum<=1e-14) return false; L[i][i]=sqrt(sum); }
            else{ L[i][j]=sum/L[j][j]; }
        }
    }
    // Forward solve L y = b
    Vec y(n,0.0); for(int i=0;i<n;++i){ double s=b[i]; for(int k=0;k<i;++k) s-=L[i][k]*y[k]; y[i]=s/L[i][i]; }
    // Backward solve L^T x = y
    x.assign(n,0.0); for(int i=n-1;i>=0;--i){ double s=y[i]; for(int k=i+1;k<n;++k) s-=L[k][i]*x[k]; x[i]=s/L[i][i]; }
    return true;
}

struct LogisticL2 {
    Mat X; Vec y; double lambda; // y in {0,1}
    double f(const Vec &w) const {
        Vec z = matvec(X, w);
        double loss=0;
        for(size_t i=0;i<z.size();++i){ double p=sigmoid(z[i]); double yi=y[i]; loss += -(yi*log(max(p,1e-15)) + (1-yi)*log(max(1-p,1e-15))); }
        double reg = 0.5*lambda*dot(w,w);
        return loss + reg;
    }
    Vec grad(const Vec &w) const {
        Vec z = matvec(X, w); Vec p(z.size()); for(size_t i=0;i<z.size();++i) p[i]=sigmoid(z[i]);
        Vec r = p; for(size_t i=0;i<r.size();++i) r[i]-=y[i];
        Vec g = matTvec(X, r);                                // X^T (p - y)
        for(size_t j=0;j<w.size();++j) g[j] += lambda*w[j];   // + lambda w
        return g;
    }
    Mat hess(const Vec &w) const {
        // H = X^T W X + lambda I, W = diag(p*(1-p))
        Vec z = matvec(X,w); size_t n=X.size(), d=X[0].size();
        Vec wdiag(n); for(size_t i=0;i<n;++i){ double p=sigmoid(z[i]); wdiag[i]=p*(1-p); }
        Mat H(d, vector<double>(d,0.0));
        for(size_t i=0;i<n;++i){
            double wi=wdiag[i]; if(wi==0) continue; // add wi * x_i x_i^T
            for(size_t a=0;a<d;++a){ double xa=X[i][a]*wi; for(size_t b=0;b<=a;++b){ H[a][b]+= xa*X[i][b]; } }
        }
        // symmetrize and add lambda I
        for(size_t a=0;a<d;++a){ for(size_t b=0;b<a;++b){ H[b][a]=H[a][b]; } H[a][a]+=lambda; }
        return H;
    }
};

int main(){
    // Synthetic binary classification
    int n=200, d=5; double lambda=1e-1; std::mt19937 rng(7); std::normal_distribution<double> N(0,1);
    Mat X(n, Vec(d)); Vec w_true(d); for(int j=0;j<d;++j) w_true[j]=N(rng);
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j]=N(rng);
    Vec y(n);
    for(int i=0;i<n;++i){ double z=0; for(int j=0;j<d;++j) z+=X[i][j]*w_true[j]; double p=1.0/(1.0+exp(-z)); y[i]=( (double)std::generate_canonical<double,10>(rng) < p ? 1.0:0.0 ); }

    LogisticL2 prob{X,y,lambda};
    Vec w(d,0.0);
    double tol=1e-6; int maxit=50; double c=1e-4; double beta=0.5;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad(w);
        double gnorm = sqrt(dot(g,g));
        if(gnorm < tol){ cout << "Converged (grad norm) in " << k << " iterations\n"; break; }
        Mat H = prob.hess(w);
        Vec s(d, 0.0);
        // Solve H s = -g via Cholesky
        Vec rhs = g; for(double &v: rhs) v = -v;
        if(!cholesky_solve(H, rhs, s)){ cerr << "Hessian not SPD; aborting\n"; break; }
        // Backtracking Armijo along the Newton direction
        double t=1.0; double f0=prob.f(w); double decr=0; for(size_t i=0;i<w.size();++i) decr += g[i]*s[i];
        while(true){
            Vec w_new = add(w, scal(s, t));
            double f_new = prob.f(w_new);
            if(f_new <= f0 + c*t*decr || t<1e-16) { w=w_new; break; }
            t *= beta;
        }
        cout << "Iter "<<k<<": f(w)="<<prob.f(w)<<", ||g||="<<gnorm<<"\n";
    }

    cout << fixed << setprecision(5);
    cout << "Estimated w: "; for(double wi: w) cout << wi << ' '; cout << "\n";
    cout << "Final objective: " << prob.f(w) << "\n";
    return 0;
}

Trains L2-regularized logistic regression with Newton’s method. Each iteration builds the Hessian H = X^T W X + λI and gradient g = X^T(p − y) + λw, solves Hs = −g via Cholesky, and uses backtracking to ensure sufficient decrease. For moderate feature dimension, Newton steps achieve rapid convergence to the global optimum.

Time: O(nd^2) to form H and O(d^3) to factor per iteration (dense).
Space: O(nd) to store X and O(d^2) for H.