Convex Optimization
Key Points
- Convex optimization studies minimizing convex functions over convex sets, where every local minimum is guaranteed to be a global minimum.
- Convexity can be checked by geometry (the epigraph is convex) or by calculus (the Hessian is positive semidefinite).
- Common convex models in ML include LASSO, logistic regression, support vector machines, and many quadratic or linear programs.
- Convexity is preserved under positive weighted sums, composition with affine maps, and pointwise maxima.
- First-order methods (gradient, subgradient, proximal/ISTA) scale well to large problems by using only gradients and simple projections.
- Second-order methods (Newton, quasi-Newton) converge faster but require solving a linear system per iteration.
- Interior-point and barrier methods solve general conic problems, LPs, and QPs to high accuracy in a small number of iterations.
- C++ implementations require careful linear algebra (vectors, matrix–vector multiplies, line search) and attention to numerical stability.
Prerequisites
- Linear algebra (vectors, matrices, norms, eigenvalues) — Convexity tests, gradients/Hessians, and algorithms rely on matrix calculus and spectral properties.
- Calculus and multivariable optimization — Gradients, Hessians, Taylor expansions, and optimality conditions are central to algorithm design.
- Probability and statistics — ML losses (logistic, least squares) and regularization have probabilistic interpretations used in modeling.
- Numerical methods and stability — Line search, conditioning, and linear system solvers affect convergence and robustness of implementations.
- Data structures and algorithmic complexity — Choosing algorithms (first- vs. second-order) depends on time/space trade-offs and sparsity patterns.
Detailed Explanation
01 Overview
Convex optimization is the study of minimizing a convex objective function over a convex feasible set. Intuitively, a set is convex if it has no “dents”: for any two points inside, the straight line segment between them remains entirely inside. A function is convex if it curves upward everywhere—averaging inputs never yields a value larger than the corresponding average of function values. These two properties form a powerful pair: when both the objective and the feasible region are convex, every local minimum is automatically a global minimum, removing the need to worry about bad local traps.

This field provides both a clean theory and practical algorithms. Theoretically, convexity allows tight guarantees on optimality (via duality and KKT conditions) and convergence rates of algorithms. Practically, many real-world problems—like resource allocation, portfolio optimization, regularized regression, and margin maximization—fit naturally as convex programs. Algorithms range from scalable first-order methods (gradient, subgradient, proximal) to fast-converging second-order methods (Newton, interior-point). Modern modeling frameworks can express a wide variety of problems and call efficient solvers, but it is also educational (and often necessary in systems programming) to implement core methods directly in C++ for speed and control. With careful numerical linear algebra, line search, and stopping criteria, robust convex optimization routines can be built for machine learning and signal processing tasks.
02 Intuition & Analogies
Imagine hiking in a bowl-shaped valley. No matter where you start, if you walk downhill, you eventually reach the bottom—and crucially, there is only one bottom. That is convex optimization: the valley is the convex function, and the bottom is the global minimum. In contrast, a landscape with multiple pits would be nonconvex; you could get stuck in a small puddle that is not the lowest point overall.

A convex set is like a block of gelatin: poke two toothpicks anywhere inside, and the straight stick connecting them lies entirely within the gelatin. There are no caves or overhangs. When you minimize a convex function over such a shape, you are guaranteed that moving a little downhill cannot lead you into a deceptive side pocket. Key building rules preserve this bowl-like shape: adding bowls makes a bigger bowl (positive weighted sum), tilting or shifting a bowl (affine transformation) keeps it bowl-shaped, and taking the upper envelope of bowls (pointwise max) is still a bowl. This explains why so many machine learning losses plus regularizers are convex: squared error or logistic loss (smooth bowl) plus an L1/L2 penalty (cone- or ridge-shaped bowl) remains convex.

From an algorithmic perspective, first-order methods use the slope (gradient) as a compass pointing downhill. Proximal operators are like a “magnet” that pulls you back to structures you want (sparsity via soft-thresholding). Newton’s method examines local curvature to take smarter, longer strides down the valley, while interior-point methods slide along the interior, repelled from constraints by a virtual force field (barrier), heading to the optimum efficiently.
03 Formal Definition
A set $C \subseteq \mathbb{R}^n$ is convex if for all $x, y \in C$ and all $\theta \in [0,1]$, the point $\theta x + (1-\theta) y$ also lies in $C$. A function $f$ is convex if its domain is a convex set and $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all such $x, y, \theta$. A convex optimization problem in standard form is: minimize $f_0(x)$ subject to $f_i(x) \le 0$, $i = 1, \dots, m$, and $Ax = b$, where $f_0, \dots, f_m$ are convex. For such problems, any local minimizer is a global minimizer; if $f_0$ is strictly convex, the minimizer is unique.
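The defining inequality can also be probed numerically. Below is a minimal, self-contained sketch (not from the original text; the helper name `looks_convex` and the test functions are illustrative) that samples random point pairs and checks the inequality, distinguishing a convex function from a nonconvex one:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Empirically test the convexity inequality
// f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples.
// A single violation disproves convexity; passing all samples is
// only evidence, not a proof.
static bool looks_convex(function<double(double)> f, int trials = 100000) {
    mt19937 rng(123);
    uniform_real_distribution<double> U(-3.0, 3.0), T(0.0, 1.0);
    for (int i = 0; i < trials; ++i) {
        double x = U(rng), y = U(rng), t = T(rng);
        double lhs = f(t*x + (1-t)*y);
        double rhs = t*f(x) + (1-t)*f(y);
        if (lhs > rhs + 1e-9) return false;  // inequality violated
    }
    return true;
}

int main() {
    cout << boolalpha;
    cout << "x^4 + 2x^2 looks convex: "
         << looks_convex([](double x){ return x*x*x*x + 2*x*x; }) << "\n"; // true
    cout << "sin(x) looks convex:     "
         << looks_convex([](double x){ return sin(x); }) << "\n";          // false
    return 0;
}
```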
04 When to Use
Use convex optimization whenever your model can be expressed with a convex objective and convex constraints. Typical ML cases include: LASSO (sparse linear regression) using an L1 penalty; ridge regression and elastic net; logistic regression for classification; support vector machines (hinge loss + L2); and many matrix factorization surrogates that are convex in one block at a time. In operations research, linear and quadratic programs model resource allocation, scheduling, and portfolio optimization. In signal processing, convex denoising and compressed sensing rely on convex regularizers.

Choose first-order methods when the dataset is large (n, d in the millions) and you can tolerate moderate accuracy; their iterations are cheap and memory-light. Proximal methods are ideal when your regularizer promotes structure (sparsity, constraints) via simple proximal maps (e.g., soft-thresholding, projection onto a box or simplex). Use Newton or quasi-Newton when the feature dimension d is moderate (so that forming/solving linear systems is feasible) and you want fast, high-accuracy convergence. Employ interior-point methods if you need a general-purpose, high-accuracy solver for LP/QP or small-to-medium conic problems. When integrating into C++, consider numerical stability, cache-friendly dense/sparse multiplies, and stopping rules based on gradient norm, duality gap, or relative objective decrease.
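As a concrete illustration of the stopping rules just mentioned, here is a hedged sketch of a combined test; the `StoppingTest` struct and its threshold values are illustrative assumptions, not a standard API:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Illustrative stopping test combining two criteria from the text:
// (1) small gradient norm, (2) small *relative* objective decrease.
// Thresholds are typical defaults, not prescriptive values.
struct StoppingTest {
    double grad_tol = 1e-6;   // ||grad|| threshold
    double rel_tol  = 1e-9;   // relative decrease threshold
    double f_prev   = numeric_limits<double>::infinity();

    bool done(double f_cur, double grad_norm) {
        bool small_grad = grad_norm < grad_tol;
        // Skip the decrease test on the first call; guard the
        // denominator with max(1, |f_prev|) for tiny objectives.
        bool small_decrease = isfinite(f_prev) &&
            fabs(f_prev - f_cur) / max(1.0, fabs(f_prev)) < rel_tol;
        f_prev = f_cur;
        return small_grad || small_decrease;
    }
};

int main() {
    StoppingTest stop;
    // Mock a few iterations of a decreasing objective.
    double fs[] = {10.0, 2.0, 1.0001, 1.0001};
    double gs[] = {5.0, 1.0, 1e-3, 1e-3};
    for (int k = 0; k < 4; ++k)
        if (stop.done(fs[k], gs[k])) { cout << "stop at iter " << k << "\n"; break; }
    return 0;
}
```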
⚠️ Common Mistakes
- Assuming convexity without verification. Always check that each component (loss, regularizer, constraints) is convex and that the operations used preserve convexity (e.g., composition rules). A single nonconvex part breaks the guarantee.
- Using an aggressive fixed step size in gradient methods. Without line search or knowledge of the Lipschitz constant, you can oscillate or diverge. Employ backtracking with the Armijo condition or estimate the Lipschitz constant.
- Ignoring scaling and conditioning. Poorly scaled features produce ill-conditioned Hessians, slowing first-order methods and destabilizing Newton steps. Standardize features and consider preconditioning (see the sketch after this list).
- Stopping too early or too late. A tiny absolute change in objective may still be far from optimal on large-scale problems; prefer relative criteria and gradient (or subgradient) norms. Conversely, over-optimizing beyond the noise level wastes time.
- Mishandling nonsmooth terms. For L1 norms or hinge losses, vanilla gradient descent is invalid; use subgradients or proximal operators (ISTA/FISTA). Incorrectly implementing soft-thresholding or projections leads to wrong solutions.
- Forming dense Hessians unnecessarily. For high-dimensional sparse data, use Hessian–vector products or quasi-Newton (L-BFGS), and avoid O(d^2) storage.
- Violating domain constraints. Barrier or projection methods require feasible initialization or a careful line search to remain in the domain (e.g., log barriers need strictly interior points).
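Because poor scaling appears in several of these pitfalls, a standardization pass is often the first fix. The sketch below is a minimal illustration (the helper `standardize_columns` is not from the original; it assumes dense row-major data) that rescales each feature to zero mean and unit variance:

```cpp
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

// Standardize each column of X (n rows, d features) in place:
// subtract the column mean and divide by the column standard deviation.
// This improves conditioning for both first- and second-order methods.
static void standardize_columns(Mat &X) {
    if (X.empty()) return;
    size_t n = X.size(), d = X[0].size();
    for (size_t j = 0; j < d; ++j) {
        double mean = 0;
        for (size_t i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0;
        for (size_t i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        double sd = sqrt(var / n);
        if (sd < 1e-12) sd = 1.0;  // constant column: leave it centered only
        for (size_t i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / sd;
    }
}

int main() {
    Mat X = {{1000, 0.1}, {2000, 0.2}, {3000, 0.3}};  // wildly different scales
    standardize_columns(X);
    for (auto &row : X) { for (double v : row) cout << v << ' '; cout << '\n'; }
    return 0;
}
```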
Key Formulas
Convexity (Jensen-type inequality)
$f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all $x, y \in \operatorname{dom} f$ and $\theta \in [0, 1]$
Explanation: This is the defining inequality of convex functions. It states that the function value at any convex combination of two points is no greater than the corresponding convex combination of the values.
Epigraph
$\operatorname{epi} f = \{(x, t) : x \in \operatorname{dom} f,\ f(x) \le t\}$
Explanation: The epigraph of f is convex if and only if f is convex. This geometric viewpoint is often easier to reason about in proofs.
Hessian PSD test
$\nabla^2 f(x) \succeq 0$ for all $x \in \operatorname{dom} f$
Explanation: For twice-differentiable f, convexity is equivalent to the Hessian being positive semidefinite everywhere. This connects geometry with calculus.
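In code, a practical way to apply this test is to attempt a Cholesky factorization of $H + \varepsilon I$: success indicates the matrix is positive semidefinite up to the tolerance $\varepsilon$. A minimal sketch (the `cholesky_ok` helper is illustrative, not from the original):

```cpp
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

// Attempt an in-place Cholesky factorization. Success implies the matrix
// is numerically positive definite; testing H + eps*I gives a practical
// positive *semi*definite check up to tolerance eps.
static bool cholesky_ok(Mat H, double eps = 1e-10) {
    int n = (int)H.size();
    for (int i = 0; i < n; ++i) H[i][i] += eps;   // shift for the semidefinite case
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double s = H[i][j];
            for (int k = 0; k < j; ++k) s -= H[i][k] * H[j][k];
            if (i == j) {
                if (s <= 0) return false;          // nonpositive pivot: not PSD
                H[i][i] = sqrt(s);
            } else {
                H[i][j] = s / H[j][j];             // store the L factor in place
            }
        }
    }
    return true;
}

int main() {
    Mat psd  = {{2, 1}, {1, 2}};    // eigenvalues 1 and 3: PSD
    Mat notp = {{1, 2}, {2, 1}};    // eigenvalues 3 and -1: not PSD
    cout << boolalpha << cholesky_ok(psd) << ' ' << cholesky_ok(notp) << "\n";
    return 0;
}
```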
Gradient descent update
$x_{k+1} = x_k - t_k \nabla f(x_k)$
Explanation: Each iteration moves opposite the gradient by a step size $t_k$. For convex f with Lipschitz gradients, this converges to a global minimizer.
ISTA (soft-thresholding)
$x_{k+1} = S_{\lambda t}\big(x_k - t \nabla g(x_k)\big)$, where $S_\tau(u)_i = \operatorname{sign}(u_i) \max(|u_i| - \tau,\ 0)$
Explanation: For the composite objective $g(x) + \lambda \|x\|_1$ with g smooth, the proximal step becomes soft-thresholding. This efficiently handles sparsity-inducing L1 penalties.
Armijo condition
$f(x + t s) \le f(x) + c\, t\, \nabla f(x)^{\top} s$, with $c \in (0, 1)$ and descent direction $s$
Explanation: A sufficient decrease rule used in backtracking line search. It prevents overly large steps that increase the objective.
LASSO objective
$\min_x\ \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1$
Explanation: Balances data fit (least squares) against sparsity (L1 norm). This is convex but nonsmooth and is well-suited to proximal methods.
Lagrangian
$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$
Explanation: Combines the objective with constraints using multipliers. Leads to dual problems and KKT conditions for optimality.
First-order optimality
$\nabla f(x^\star) = 0$ (smooth case), or $0 \in \partial f(x^\star)$ (nonsmooth case)
Explanation: For convex problems, a zero gradient (or zero in the subdifferential) characterizes global optimality, not just local.
Convergence rates
$f(x_k) - f^\star \le \dfrac{L \|x_0 - x^\star\|_2^2}{2k}$ (L-smooth convex); $\quad f(x_k) - f^\star \le \left(1 - \tfrac{m}{L}\right)^{k} \big(f(x_0) - f^\star\big)$ (additionally m-strongly convex)
Explanation: Gradient descent on L-smooth convex functions reaches $\varepsilon$-accuracy in $O(L/\varepsilon)$ iterations; if f is also m-strongly convex, convergence is linear, yielding the second rate.
Logistic regression Newton system
$(X^{\top} W X + \lambda I)\, s = -\big(X^{\top}(p - y) + \lambda w\big)$
Explanation: The Newton direction solves H s = -g. Here W is diagonal with entries $p_i(1 - p_i)$ and $p = \sigma(Xw)$. This yields fast convergence for moderate d.
Complexity Analysis
For dense data with n samples and d features, the methods above have the following typical costs. A gradient (or ISTA) iteration costs O(nd) time and O(d) extra memory, dominated by matrix–vector products; the soft-thresholding step adds only O(d). Newton's method costs O(nd^2) to form the Hessian and O(d^3) to factor it, with O(d^2) storage, which is why it suits moderate d. In iteration counts, gradient descent on an L-smooth convex objective needs O(L/ε) iterations for ε-accuracy (accelerated variants such as FISTA need O(√(L/ε))), strong convexity yields linear (geometric) convergence, and Newton's method is locally quadratically convergent. Interior-point methods typically converge in a few tens of iterations, each requiring a linear system solve.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Utility linear algebra for small dense problems
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }

// Quadratic objective: f(x) = 0.5 x^T Q x - b^T x, with SPD Q
struct Quadratic {
    Mat Q; Vec b;
    double f(const Vec &x) const {
        Vec Qx = matvec(Q,x);
        return 0.5*dot(x,Qx) - dot(b,x);
    }
    Vec grad(const Vec &x) const {
        Vec g = matvec(Q,x);                          // g = Qx - b
        for(size_t i=0;i<g.size();++i) g[i]-=b[i];
        return g;
    }
};

// Backtracking line search with Armijo condition
static double backtracking(const Quadratic &obj, const Vec &x, const Vec &g,
                           double init_t=1.0, double c=1e-4, double beta=0.5){
    double t = init_t;
    double fx = obj.f(x);
    double gnorm2 = dot(g,g);
    while(true){
        Vec x_new = add(x, scal(g, -t));
        double f_new = obj.f(x_new);
        if(f_new <= fx - c*t*gnorm2) break;           // sufficient decrease
        t *= beta;
        if(t < 1e-16) break;                          // safeguard
    }
    return t;
}

int main(){
    // Example: minimize 0.5 x^T Q x - b^T x, Q SPD
    int d = 3;
    Mat Q = {{4,1,0},{1,3,0},{0,0,2}};                // SPD
    Vec b = {1,2,3};
    Quadratic obj{Q,b};

    Vec x(d, 0.0);                                    // start at zero
    double tol = 1e-8;
    int maxit = 10000;

    for(int k=0;k<maxit;++k){
        Vec g = obj.grad(x);
        double gnorm = norm2(g);
        if(gnorm < tol){ cout << "Converged in " << k << " iterations\n"; break; }
        double t = backtracking(obj, x, g);
        x = add(x, scal(g, -t));
        if(k%1000==0){ cout << "Iter " << k << ", f(x)=" << obj.f(x) << ", ||g||=" << gnorm << "\n"; }
    }

    cout << fixed << setprecision(6);
    cout << "Solution x*: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "f(x*) = " << obj.f(x) << "\n";
    return 0;
}
```
This program minimizes a strongly convex quadratic using gradient descent with a backtracking Armijo line search. The gradient is g = Qx − b. Backtracking adaptively chooses a safe step size that ensures sufficient decrease. Because the objective is convex with SPD Q, any stationary point is the global minimizer, and gradient descent converges linearly to it.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static double norm2(const Vec &a){ return sqrt(dot(a,a)); }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }

// Soft-thresholding operator S_tau(u) = sign(u) * max(|u| - tau, 0)
static Vec soft_threshold(const Vec &u, double tau){
    Vec x=u;
    for(double &v: x){
        double s = (v>0) - (v<0);
        double a = fabs(v) - tau;
        v = (a>0 ? s*a : 0.0);
    }
    return x;
}

struct Lasso {
    Mat A; Vec b; double lambda;
    // Smooth part g(x) = 0.5*||Ax - b||^2
    double g(const Vec &x) const {
        Vec r = matvec(A,x);                           // r = Ax - b
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return 0.5*dot(r,r);
    }
    Vec grad_g(const Vec &x) const {
        Vec r = matvec(A,x);
        for(size_t i=0;i<r.size();++i) r[i]-=b[i];
        return matTvec(A, r);                          // A^T (Ax - b)
    }
    double f(const Vec &x) const {                     // full objective
        double val = g(x);
        double l1=0; for(double xi: x) l1 += fabs(xi);
        return val + lambda*l1;
    }
};

int main(){
    // Small synthetic example: A in R^{m x d}, b in R^m
    int m=50, d=20; double lambda=0.1;
    std::mt19937 rng(42); std::normal_distribution<double> N(0.0,1.0);
    Mat A(m, Vec(d)); Vec x_true(d,0.0);
    // sparse ground truth with 3 nonzeros
    x_true[2]=2.0; x_true[7]=-1.5; x_true[15]=1.0;
    for(int i=0;i<m;++i) for(int j=0;j<d;++j) A[i][j] = N(rng)/sqrt(d);
    Vec b(m,0.0);
    for(int i=0;i<m;++i){ for(int j=0;j<d;++j) b[i]+=A[i][j]*x_true[j]; b[i]+=0.05*N(rng); }

    Lasso prob{A,b,lambda};

    Vec x(d,0.0);          // start at zero
    double t = 1.0;        // initial step; backtracking ensures majorization
    double beta = 0.5;     // shrink factor
    double tol = 1e-6; int maxit = 1000;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad_g(x);
        // Backtracking to satisfy: g(y) <= g(x) + grad^T(y-x) + (1/(2t))*||y-x||^2
        double t_cur = t;
        Vec y;
        while(true){
            Vec u = add(x, scal(g, -t_cur));
            y = soft_threshold(u, prob.lambda * t_cur);
            // Check the majorization inequality
            double gx = prob.g(x);
            double gy = prob.g(y);
            Vec s = add(y, x, -1.0);                   // s = y - x
            double rhs = gx + dot(g, s) + (0.5/t_cur)*dot(s,s);
            if(gy <= rhs || t_cur < 1e-16) break;
            t_cur *= beta;
        }
        Vec s = add(y, x, -1.0);
        x = y;
        if(norm2(s) / max(1.0, norm2(x)) < tol){ cout << "Converged in " << k+1 << " iterations\n"; break; }
        if((k+1)%50==0){ cout << "Iter " << k+1 << ", f(x)=" << prob.f(x) << "\n"; }
        t = t_cur * 1.1;                               // slight increase heuristic
    }

    cout << fixed << setprecision(4);
    cout << "Recovered x: "; for(double xi: x) cout << xi << ' '; cout << "\n";
    cout << "Objective f(x): " << prob.f(x) << "\n";
    return 0;
}
```
Implements ISTA (proximal gradient) for the LASSO. Each iteration takes a gradient step on the smooth part g(x)=0.5||Ax−b||^2 and then applies the soft-thresholding proximal map for the L1 penalty. A backtracking test enforces a valid step size without computing the Lipschitz constant. This converges to the global minimizer because the objective is convex.
```cpp
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<vector<double>>;

static double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
static Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
static Vec scal(const Vec &a, double s){ Vec c=a; for(double &v: c) v*=s; return c; }
static Vec matvec(const Mat &A, const Vec &x){ size_t n=A.size(), m=x.size(); Vec y(n,0.0); for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
static Vec matTvec(const Mat &A, const Vec &y){ size_t n=A.size(), m=A[0].size(); Vec x(m,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x[j]+=A[i][j]*y[i]; } return x; }
// Numerically stable sigmoid
static double sigmoid(double z){ if(z>=0){ double e=exp(-z); return 1.0/(1.0+e); } else { double e=exp(z); return e/(1.0+e); } }

// Cholesky solve for the SPD system H s = b
static bool cholesky_solve(const Mat &H, const Vec &b, Vec &x){
    int n = (int)H.size();
    vector<vector<double>> L(n, vector<double>(n,0.0));
    for(int i=0;i<n;++i){
        for(int j=0;j<=i;++j){
            double sum=H[i][j];
            for(int k=0;k<j;++k) sum -= L[i][k]*L[j][k];
            if(i==j){ if(sum<=1e-14) return false; L[i][i]=sqrt(sum); }
            else{ L[i][j]=sum/L[j][j]; }
        }
    }
    // Forward solve L y = b
    Vec y(n,0.0);
    for(int i=0;i<n;++i){ double s=b[i]; for(int k=0;k<i;++k) s-=L[i][k]*y[k]; y[i]=s/L[i][i]; }
    // Backward solve L^T x = y
    x.assign(n,0.0);
    for(int i=n-1;i>=0;--i){ double s=y[i]; for(int k=i+1;k<n;++k) s-=L[k][i]*x[k]; x[i]=s/L[i][i]; }
    return true;
}

struct LogisticL2 {
    Mat X; Vec y; double lambda;                       // y in {0,1}
    double f(const Vec &w) const {
        Vec z = matvec(X, w);
        double loss=0;
        for(size_t i=0;i<z.size();++i){
            double p=sigmoid(z[i]); double yi=y[i];
            loss += -(yi*log(max(p,1e-15)) + (1-yi)*log(max(1-p,1e-15)));
        }
        return loss + 0.5*lambda*dot(w,w);
    }
    Vec grad(const Vec &w) const {
        Vec z = matvec(X, w);
        Vec p(z.size()); for(size_t i=0;i<z.size();++i) p[i]=sigmoid(z[i]);
        Vec r = p; for(size_t i=0;i<r.size();++i) r[i]-=y[i];
        Vec g = matTvec(X, r);                         // X^T (p - y)
        for(size_t j=0;j<w.size();++j) g[j] += lambda*w[j];
        return g;
    }
    Mat hess(const Vec &w) const {
        // H = X^T W X + lambda*I, W = diag(p_i*(1-p_i))
        Vec z = matvec(X,w); size_t n=X.size(), d=X[0].size();
        Vec wdiag(n); for(size_t i=0;i<n;++i){ double p=sigmoid(z[i]); wdiag[i]=p*(1-p); }
        Mat H(d, vector<double>(d,0.0));
        for(size_t i=0;i<n;++i){
            double wi=wdiag[i]; if(wi==0) continue;    // add wi * x_i x_i^T (lower triangle)
            for(size_t a=0;a<d;++a){ double xa=X[i][a]*wi; for(size_t b=0;b<=a;++b) H[a][b] += xa*X[i][b]; }
        }
        // symmetrize and add lambda*I
        for(size_t a=0;a<d;++a){ for(size_t b=0;b<a;++b) H[b][a]=H[a][b]; H[a][a]+=lambda; }
        return H;
    }
};

int main(){
    // Synthetic binary classification
    int n=200, d=5; double lambda=1e-1;
    std::mt19937 rng(7); std::normal_distribution<double> N(0,1);
    Mat X(n, Vec(d)); Vec w_true(d);
    for(int j=0;j<d;++j) w_true[j]=N(rng);
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j]=N(rng);
    Vec y(n);
    for(int i=0;i<n;++i){
        double z=0; for(int j=0;j<d;++j) z+=X[i][j]*w_true[j];
        double p=1.0/(1.0+exp(-z));
        y[i] = (std::generate_canonical<double,10>(rng) < p ? 1.0 : 0.0);
    }

    LogisticL2 prob{X,y,lambda};
    Vec w(d,0.0);
    double tol=1e-6; int maxit=50; double c=1e-4; double beta=0.5;

    for(int k=0;k<maxit;++k){
        Vec g = prob.grad(w);
        double gnorm = sqrt(dot(g,g));
        if(gnorm < tol){ cout << "Converged (grad norm) in " << k << " iterations\n"; break; }
        Mat H = prob.hess(w);
        Vec s(d);
        // Solve H s = -g via Cholesky
        Vec rhs = g; for(double &v: rhs) v = -v;
        if(!cholesky_solve(H, rhs, s)){ cerr << "Hessian not SPD; aborting\n"; break; }
        // Backtracking Armijo line search along the Newton direction s
        double t=1.0, f0=prob.f(w), decr=0;
        for(size_t i=0;i<w.size();++i) decr += g[i]*s[i];
        while(true){
            Vec w_new = add(w, scal(s, t));
            double f_new = prob.f(w_new);
            if(f_new <= f0 + c*t*decr || t<1e-16){ w=w_new; break; }
            t *= beta;
        }
        cout << "Iter "<<k<<": f(w)="<<prob.f(w)<<", ||g||="<<gnorm<<"\n";
    }

    cout << fixed << setprecision(5);
    cout << "Estimated w: "; for(double wi: w) cout << wi << ' '; cout << "\n";
    cout << "Final objective: " << prob.f(w) << "\n";
    return 0;
}
```
Trains L2-regularized logistic regression with Newton’s method. Each iteration builds the Hessian H = X^T W X + λI and gradient g = X^T(p − y) + λw, solves Hs = −g via Cholesky, and uses backtracking to ensure sufficient decrease. For moderate feature dimension, Newton steps achieve rapid convergence to the global optimum.