Convex Optimization Problems
Key Points
- A convex optimization problem minimizes a convex function over a convex set, guaranteeing that every local minimum is a global minimum.
- Convex problems are shaped like bowls; moving downhill with gradient-based methods reliably reaches the bottom.
- Standard form uses convex inequality constraints and affine equality constraints, enabling powerful duality theory and KKT optimality conditions.
- First-order (gradient or subgradient) methods scale to large problems and have provable convergence rates under mild smoothness assumptions.
- Projected and proximal gradient methods handle constraints and nonsmooth terms like L1 norms via simple geometric/proximal operations.
- Newton and quasi-Newton methods converge faster when second-order curvature information is affordable and the problem is well-conditioned.
- Strong convexity and Lipschitz-smoothness constants determine step sizes and convergence speed in practice.
- C++ implementations rely on efficient vector/matrix operations and careful line search or step-size selection for stability.
Prerequisites
- Multivariable calculus (gradients and Hessians) — Understanding derivatives in multiple dimensions is essential for defining convexity, gradients, and Newton steps.
- Linear algebra (vectors, matrices, norms) — Optimization over R^n relies on matrix-vector operations, eigenvalues, and norms used in complexity and convergence analysis.
- Basic real analysis/inequalities — Properties like Jensen's inequality and continuity support convexity proofs and bounds.
- Numerical methods and conditioning — Practical solvers require stable linear algebra and awareness of conditioning for convergence speed.
- Probability and statistics (optional for ML) — Interpreting loss functions and regularization in machine learning contexts benefits from probabilistic modeling.
- Algorithm design and analysis — Selecting algorithms and analyzing time/space complexity is key for scalable implementations.
Detailed Explanation
Overview
Convex optimization is the study and practice of minimizing a convex objective function over a convex set of feasible points. A set is convex if for any two points in the set, the straight line segment connecting them remains entirely within the set. A function is convex if its graph always lies below the straight line between any two points on the graph. This structure leads to a remarkable property: every local minimum is a global minimum. That means algorithms that take small, local improving steps—like gradient descent—are much more reliable than in non-convex settings. The general form includes convex inequality constraints and affine equality constraints, unifying a wide range of practical problems in machine learning (e.g., LASSO, logistic regression), engineering design, finance (portfolio optimization), and operations research (resource allocation). The theory provides tools like duality and Karush–Kuhn–Tucker (KKT) conditions to certify optimality. Practically, we choose methods based on problem size and smoothness: first-order methods for massive data, proximal methods for nonsmooth regularization, and Newton-type methods for smaller problems needing high accuracy. Good implementations hinge on numerically stable linear algebra, step-size strategies, and exploiting problem structure (sparsity, separability).
Intuition & Analogies
Picture a smooth, wide bowl on a table. Drop a marble anywhere in the bowl: if you let it roll downhill, it will eventually settle at the very bottom. That bowl shape is the essence of convexity. There aren’t hidden pockets or multiple valleys to trap your marble; no matter where you begin, the lowest point is the same. Now add fences around the bowl to restrict where the marble is allowed to roll—those fences are constraints. If the fences form a convex region (like a circular ring or a rectangle), the marble can still slide along the surface but will stop either at the bottom or somewhere on the fence where it can’t go any lower while staying feasible. The slope at each point (the gradient) tells you the steepest downhill direction; following that slope reduces height, just like descending in a landscape. If the bowl is steeper in some directions than others (ill-conditioned), your steps may zig-zag; preconditioning or second-order curvature (Newton’s method) helps you stride directly toward the bottom. When the surface has sharp edges (like adding an L1 penalty that creates a ridged floor), you can’t use plain gradients everywhere; instead, you use subgradients or a proximal step—think of it as a smart move that both goes downhill and snaps you to a simpler shape (like encouraging exact zeros). Projected gradient methods are like taking a downhill step and then nudging the marble back inside the fences by the shortest possible move. This geometric picture explains why convex optimization is both powerful and dependable: the landscape’s shape makes finding the bottom tractable.
Formal Definition
A set C ⊆ R^n is convex if for all x, y ∈ C and θ ∈ [0, 1], the point θx + (1 − θ)y also lies in C. A function f: R^n → R is convex if its domain is convex and f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) for all x, y in the domain and θ ∈ [0, 1]; it is strictly convex if the inequality is strict whenever x ≠ y and θ ∈ (0, 1). A convex optimization problem in standard form minimizes a convex objective f_0(x) subject to convex inequality constraints f_i(x) ≤ 0, i = 1, …, m, and affine equality constraints Ax = b. For such problems, every local minimum is a global minimum and the set of optimal points is itself convex.
When to Use
Use convex optimization when your objective is convex and constraints define a convex feasible region, or when you can reformulate the problem into a convex one. Common scenarios include: training linear models (ridge regression, LASSO, logistic regression), designing portfolios that balance risk and return (quadratic programs), fitting signals with sparsity or smoothness priors (L1/L2 regularization), resource allocation and scheduling with linear or convex costs (linear and second-order cone programs), and control or robotics problems with convex safety margins. If the dimension is large and data are dense, prefer first-order methods (gradient, accelerated gradient, ADMM, proximal gradient) due to low per-iteration cost and easy parallelization. If variables are few to moderate and high accuracy is needed, Newton or quasi-Newton methods converge faster. If constraints are simple sets (boxes, balls, simplices), projected gradient is efficient because projections are cheap. If your objective includes nonsmooth terms like |x|_{1} or indicator functions, use proximal methods that admit closed-form proximal operators (soft-thresholding, clipping). When integer or combinatorial variables are essential, the problem is not convex; consider relaxations to get bounds or approximate solutions, or switch to mixed-integer methods.
⚠️ Common Mistakes
- Assuming convexity without verification: Many losses are convex only after specific transformations (e.g., squared loss is convex; some robust losses are not). Always check that the Hessian is positive semidefinite or apply definition tests.
- Ignoring conditioning: Poorly scaled data lead to slow zig-zagging in gradient descent. Center and scale features, or use preconditioning/second-order methods.
- Bad step-size selection: Fixed steps that are too large can diverge; steps that are too small stagnate. Use backtracking line search or steps tied to Lipschitz constants.
- Mishandling constraints: Forgetting to project can yield infeasible iterates; an incorrect projection can break convergence. For complicated sets, use proximal or splitting methods (ADMM) instead of ad hoc clipping.
- Misusing KKT: KKT conditions are sufficient for global optimality in convex problems, but not in non-convex ones; don't over-interpret multipliers outside convex settings.
- Overlooking nonsmoothness: Applying vanilla gradient descent to L1-regularized problems can fail; use subgradients or proximal operators like soft-thresholding.
- Numerical instability: Forming the normal equations (A^{\top} A) can square the condition number; prefer matrix-vector products, iterative solvers, or QR/Cholesky when exact solves are required.
- Expecting convex methods to solve integer problems: Convex solvers provide relaxations or heuristics; exact integrality requires mixed-integer techniques.
Key Formulas
Convexity of a function
f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)  for all x, y and θ ∈ [0, 1]
Explanation: This inequality states that the function lies below the straight line between any two points. It is the defining property of convex functions.
Lipschitz gradient
‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖
Explanation: The gradient does not change faster than linearly with distance. L bounds curvature and guides step-size choices like 1/L.
First-order condition
f(y) ≥ f(x) + ∇f(x)^T (y − x)
Explanation: For differentiable convex f, the tangent plane is a global underestimator. This underpins descent methods and optimality checks.
Subgradient inequality
f(y) ≥ f(x) + g^T (y − x)  for all g ∈ ∂f(x)
Explanation: Extends the first-order condition to nonsmooth convex functions using subgradients. It is key for L1 penalties and hinge losses.
Strong convexity
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (μ/2) ‖y − x‖^2
Explanation: Adds a quadratic term that lower-bounds curvature. Strong convexity yields faster (linear) convergence for gradient methods.
Standard convex program
minimize f_0(x)  subject to  f_i(x) ≤ 0, i = 1, …, m;  Ax = b
Explanation: A canonical form with convex inequalities and affine equalities. Many problems can be reformulated into this structure.
Lagrangian
L(x, λ, ν) = f_0(x) + Σ_{i=1}^{m} λ_i f_i(x) + ν^T (Ax − b)
Explanation: Combines objective and constraints with multipliers to form the dual problem. Dual variables have economic interpretations.
KKT conditions
f_i(x*) ≤ 0,  Ax* = b,  λ_i* ≥ 0,  λ_i* f_i(x*) = 0,  ∇f_0(x*) + Σ_i λ_i* ∇f_i(x*) + A^T ν* = 0
Explanation: Necessary and sufficient for optimality in convex problems under mild conditions. They provide certificates of optimal solutions.
Gradient descent update
x_{k+1} = x_k − t ∇f(x_k)
Explanation: Moves opposite the gradient by a step size. For convex smooth functions, proper step sizes guarantee monotone decrease.
Proximal gradient (ISTA)
x_{k+1} = prox_{t g}( x_k − t ∇h(x_k) )
Explanation: Splits the objective into a smooth part h and a nonsmooth part g, taking a gradient step on h and a proximal step on g. Ideal for L1 regularization.
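In the L1 case the proximal step has a closed form. Because the minimization in the proximal operator separates across coordinates for g(x) = λ‖x‖_1, each entry solves a one-dimensional problem whose solution is soft-thresholding:

```latex
\operatorname{prox}_{t\lambda\|\cdot\|_1}(v)
  = \arg\min_{x} \; \tfrac{1}{2t}\|x - v\|_2^2 + \lambda\|x\|_1,
\qquad
\big[\operatorname{prox}_{t\lambda\|\cdot\|_1}(v)\big]_i
  = \operatorname{sign}(v_i)\,\max\!\big(|v_i| - t\lambda,\; 0\big).
```

This is exactly the shrink-and-snap-to-zero rule implemented by the soft_threshold function in the ISTA code example below.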
Projected gradient
x_{k+1} = Π_C( x_k − t ∇f(x_k) )
Explanation: Takes a gradient step then projects onto the feasible set C. Works well when projection is simple (boxes, balls, simplices).
Newton step
x_{k+1} = x_k − [∇²f(x_k)]^{−1} ∇f(x_k)
Explanation: Uses second-order curvature to adjust the step. Quadratically convergent near the optimum when the Hessian is positive definite.
Sublinear rate (GD)
f(x_k) − f* ≤ L ‖x_0 − x*‖^2 / (2k)
Explanation: For L-smooth convex f, gradient descent with constant step achieves an O(1/k) decrease in suboptimality. This sets expectations for large-scale runs.
Linear rate (strongly convex)
f(x_k) − f* ≤ (1 − μ/L)^k (f(x_0) − f*)
Explanation: With strong convexity (μ) and smoothness (L), gradient descent converges linearly. The rate depends on the condition number κ = L/μ.
Complexity Analysis
For a dense m × n data matrix, each gradient or proximal-gradient iteration is dominated by matrix-vector products, costing O(mn) time (O(n^2) for the n × n quadratic example below), and the data occupies O(mn) memory. A Newton step additionally solves an n × n linear system at O(n^3), which is why second-order methods suit small-to-moderate n. Iteration counts follow the rates above: O(1/ε) gradient iterations for an ε-accurate solution of an L-smooth convex problem, improving to O(κ log(1/ε)) under strong convexity, where κ = L/μ is the condition number.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute y = Qx for symmetric positive definite Q
vector<double> matvec(const vector<vector<double>>& Q, const vector<double>& x) {
    int n = (int)Q.size();
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j) sum += Q[i][j] * x[j];
        y[i] = sum;
    }
    return y;
}

double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const vector<double>& x) {
    return sqrt(dot(x, x));
}

// f(x) = 0.5 x^T Q x + b^T x
struct Quadratic {
    vector<vector<double>> Q; // SPD matrix
    vector<double> b;

    double value(const vector<double>& x) const {
        vector<double> Qx = matvec(Q, x);
        return 0.5 * dot(x, Qx) + dot(b, x);
    }

    vector<double> grad(const vector<double>& x) const {
        // ∇f(x) = Qx + b
        vector<double> g = matvec(Q, x);
        for (size_t i = 0; i < g.size(); ++i) g[i] += b[i];
        return g;
    }
};

vector<double> gradient_descent_backtracking(const Quadratic& prob, vector<double> x0,
                                             double tol = 1e-8, int max_iters = 1000,
                                             double alpha0 = 1.0, double c = 1e-4, double rho = 0.5) {
    vector<double> x = x0;
    for (int k = 0; k < max_iters; ++k) {
        vector<double> g = prob.grad(x);
        double gnorm = norm2(g);
        if (gnorm < tol) {
            cout << "Converged in " << k << " iterations.\n";
            break;
        }
        // Descent direction is -g
        vector<double> d = g;
        for (double& v : d) v = -v;

        // Backtracking line search (Armijo)
        double t = alpha0;
        double fx = prob.value(x);
        double dg = dot(g, d); // should be negative
        while (true) {
            vector<double> x_new = x;
            for (size_t i = 0; i < x.size(); ++i) x_new[i] += t * d[i];
            double fx_new = prob.value(x_new);
            if (fx_new <= fx + c * t * dg) {
                x = std::move(x_new);
                break;
            }
            t *= rho;
            if (t < 1e-20) { // safeguard
                // Take a very small step to ensure progress
                for (size_t i = 0; i < x.size(); ++i) x[i] += 1e-20 * d[i];
                break;
            }
        }
        if (k % 20 == 0) {
            cout << "Iter " << k << ": f = " << prob.value(x) << ", ||g|| = " << gnorm << "\n";
        }
    }
    return x;
}

int main() {
    // Example SPD Q and vector b for n = 3
    vector<vector<double>> Q = {
        {4.0, 1.0, 0.0},
        {1.0, 3.0, 0.5},
        {0.0, 0.5, 2.0}
    };
    vector<double> b = {-1.0, 2.0, -3.0};

    Quadratic prob{Q, b};
    vector<double> x0 = {0.0, 0.0, 0.0};
    vector<double> x_star = gradient_descent_backtracking(prob, x0);

    cout << fixed << setprecision(6);
    cout << "Solution: ";
    for (double xi : x_star) cout << xi << ' ';
    cout << "\nFinal f(x): " << prob.value(x_star) << "\n";
    return 0;
}
This program minimizes a strictly convex quadratic using gradient descent with Armijo backtracking line search. The gradient is Qx + b and the line search ensures sufficient decrease each step. Because the objective is convex with Lipschitz gradient, the method converges to the unique global minimizer.
#include <bits/stdc++.h>
using namespace std;

struct DenseMatrix {
    int m, n; // m rows, n cols
    vector<vector<double>> A; // A[m][n]

    DenseMatrix(int m_, int n_): m(m_), n(n_), A(m_, vector<double>(n_, 0.0)) {}

    vector<double> matvec(const vector<double>& x) const {
        vector<double> y(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            y[i] = s;
        }
        return y;
    }
    vector<double> matTvec(const vector<double>& y) const {
        vector<double> x(n, 0.0);
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int i = 0; i < m; ++i) s += A[i][j] * y[i];
            x[j] = s;
        }
        return x;
    }
};

vector<double> soft_threshold(const vector<double>& v, double tau) {
    vector<double> x(v.size());
    for (size_t i = 0; i < v.size(); ++i) {
        double z = v[i];
        if (z > tau) x[i] = z - tau;
        else if (z < -tau) x[i] = z + tau;
        else x[i] = 0.0;
    }
    return x;
}

double objective(const DenseMatrix& A, const vector<double>& x, const vector<double>& y, double lambda) {
    vector<double> r = A.matvec(x);
    for (size_t i = 0; i < r.size(); ++i) r[i] -= y[i];
    double data = 0.0;
    for (double ri : r) data += 0.5 * ri * ri;
    double l1 = 0.0;
    for (double xi : x) l1 += fabs(xi);
    return data + lambda * l1;
}

// Power iteration to estimate L = largest eigenvalue of A^T A (i.e., ||A||_2^2)
double estimate_L(const DenseMatrix& A, int iters = 50) {
    int n = A.n;
    vector<double> x(n, 0.0);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (int i = 0; i < n; ++i) x[i] = dist(rng);
    auto normalize = [](vector<double>& v) {
        double s = 0;
        for (double vi : v) s += vi * vi;
        s = sqrt(s);
        if (s > 0) for (double& vi : v) vi /= s;
    };
    normalize(x);
    double lam = 0.0;
    for (int k = 0; k < iters; ++k) {
        vector<double> y = A.matvec(x);
        vector<double> z = A.matTvec(y); // z = A^T A x
        lam = 0.0;
        for (size_t i = 0; i < z.size(); ++i) lam += x[i] * z[i];
        x = z;
        normalize(x);
    }
    return max(lam, 1e-12); // avoid zero
}

vector<double> ista_lasso(const DenseMatrix& A, const vector<double>& y, double lambda,
                          int max_iters = 1000, double tol = 1e-6) {
    int n = A.n;
    vector<double> x(n, 0.0), x_old(n, 0.0);
    // Lipschitz constant for gradient of 0.5||Ax - y||^2 is L = ||A||_2^2
    double L = estimate_L(A, 80);
    double t = 1.0 / L;
    for (int k = 0; k < max_iters; ++k) {
        x_old = x;
        // Gradient: A^T (A x - y)
        vector<double> Ax = A.matvec(x);
        for (size_t i = 0; i < Ax.size(); ++i) Ax[i] -= y[i];
        vector<double> g = A.matTvec(Ax);
        // Gradient step then proximal step (soft-thresholding)
        vector<double> v(n);
        for (int i = 0; i < n; ++i) v[i] = x[i] - t * g[i];
        x = soft_threshold(v, lambda * t);

        // Check convergence by relative change
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = x[i] - x_old[i];
            num += d * d;
            den += x_old[i] * x_old[i];
        }
        double rel = sqrt(num) / (sqrt(den) + 1e-12);
        if (k % 50 == 0) {
            cout << "Iter " << k << ": obj = " << objective(A, x, y, lambda) << ", rel_change = " << rel << "\n";
        }
        if (rel < tol) {
            cout << "Converged in " << k << " iterations.\n";
            break;
        }
    }
    return x;
}

int main() {
    // Create a small synthetic LASSO instance (m = 50 samples, n = 20 features)
    int m = 50, n = 20;
    DenseMatrix A(m, n);
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            A.A[i][j] = nd(rng) / sqrt((double)m); // scale for conditioning

    vector<double> x_true(n, 0.0);
    // Make a sparse ground truth with 5 nonzeros
    for (int j = 0; j < 5; ++j) x_true[j] = (j % 2 ? -1.5 : 2.0);

    vector<double> y = A.matvec(x_true);
    // Add small noise
    for (double& yi : y) yi += 0.05 * nd(rng);

    double lambda = 0.1;
    vector<double> x_est = ista_lasso(A, y, lambda, 2000, 1e-6);

    cout << fixed << setprecision(6);
    cout << "Recovered x (first 10 entries): ";
    for (int i = 0; i < min(10, (int)x_est.size()); ++i) cout << x_est[i] << ' ';
    cout << "\nObjective: " << objective(A, x_est, y, lambda) << "\n";
    return 0;
}
This implementation solves the LASSO problem with ISTA: a gradient step on the quadratic loss followed by a soft-thresholding proximal step for the L1 penalty. The step size is chosen as 1/L where L approximates the largest eigenvalue of A^T A via power iteration. The method converges because the quadratic loss is smooth and convex, and the L1 norm is convex with a simple proximal operator.
#include <bits/stdc++.h>
using namespace std;

struct DenseMat {
    int m, n;
    vector<vector<double>> A;
    DenseMat(int m_, int n_): m(m_), n(n_), A(m_, vector<double>(n_, 0.0)) {}
    vector<double> matvec(const vector<double>& x) const {
        vector<double> y(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            y[i] = s;
        }
        return y;
    }
    vector<double> matTvec(const vector<double>& y) const {
        vector<double> x(n, 0.0);
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int i = 0; i < m; ++i) s += A[i][j] * y[i];
            x[j] = s;
        }
        return x;
    }
};

void project_box(vector<double>& x, double l = 0.0, double u = 1.0) {
    for (double& xi : x) {
        if (xi < l) xi = l;
        else if (xi > u) xi = u;
    }
}

double obj_ls(const DenseMat& A, const vector<double>& x, const vector<double>& y) {
    vector<double> r = A.matvec(x);
    for (size_t i = 0; i < r.size(); ++i) r[i] -= y[i];
    double s = 0.0;
    for (double ri : r) s += 0.5 * ri * ri;
    return s;
}

vector<double> projected_gradient_ls_box(const DenseMat& A, const vector<double>& y, int max_iters = 2000,
                                         double tol = 1e-6, double alpha0 = 1.0, double c = 1e-4, double rho = 0.5) {
    int n = A.n;
    vector<double> x(n, 0.5); // start at center of box
    for (int k = 0; k < max_iters; ++k) {
        // Gradient: g = A^T (A x - y)
        vector<double> Ax = A.matvec(x);
        for (size_t i = 0; i < Ax.size(); ++i) Ax[i] -= y[i];
        vector<double> g = A.matTvec(Ax);

        double gnorm = 0.0;
        for (double gi : g) gnorm += gi * gi;
        gnorm = sqrt(gnorm);
        if (gnorm < tol) {
            cout << "Converged in " << k << " iterations.\n";
            break;
        }

        // Backtracking on the projected step
        double t = alpha0;
        double fx = obj_ls(A, x, y);
        while (true) {
            vector<double> x_trial(n);
            for (int i = 0; i < n; ++i) x_trial[i] = x[i] - t * g[i];
            project_box(x_trial, 0.0, 1.0);
            double fx_new = obj_ls(A, x_trial, y);

            // Generalized Armijo condition with the projected point:
            // directional derivative approximation along d = x_trial - x
            vector<double> d(n);
            for (int i = 0; i < n; ++i) d[i] = x_trial[i] - x[i];
            double dg = 0.0;
            for (int i = 0; i < n; ++i) dg += g[i] * d[i];

            if (fx_new <= fx + c * dg) { x = std::move(x_trial); break; }
            t *= rho;
            if (t < 1e-20) { x = std::move(x_trial); break; }
        }
        if (k % 50 == 0) {
            cout << "Iter " << k << ": f = " << obj_ls(A, x, y) << ", ||g|| = " << gnorm << "\n";
        }
    }
    return x;
}

int main() {
    int m = 80, n = 30;
    DenseMat A(m, n);
    std::mt19937 rng(7);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            A.A[i][j] = nd(rng) / sqrt((double)m);

    vector<double> x_true(n, 0.0);
    for (int j = 0; j < n; ++j) x_true[j] = min(1.0, max(0.0, 0.6 + 0.2 * nd(rng)));
    vector<double> y = A.matvec(x_true);
    for (double& yi : y) yi += 0.02 * nd(rng);

    vector<double> x = projected_gradient_ls_box(A, y, 1500, 1e-6);
    cout << fixed << setprecision(6);
    cout << "First 10 variables (should be in [0,1]): ";
    for (int i = 0; i < min(10, (int)x.size()); ++i) cout << x[i] << ' ';
    cout << "\nObjective: " << obj_ls(A, x, y) << "\n";
    return 0;
}
This example minimizes a convex least-squares objective subject to simple box constraints using projected gradient descent. Each step takes a gradient move and then projects back onto [0,1]^n by clipping, followed by a backtracking rule to guarantee sufficient decrease. The projection is cheap and preserves feasibility at every iteration.