Optimization Theory
Key Points
- Optimization theory studies how to choose variables to minimize or maximize an objective while respecting constraints.
- For convex problems, any local minimum is also a global minimum, enabling reliable algorithms and guarantees.
- First- and second-order conditions (∇f(x*) = 0 and ∇²f(x*) ⪰ 0) characterize optimal points for smooth unconstrained problems.
- Gradient descent updates x by stepping opposite the gradient; with smooth convexity it converges at an O(1/t) rate, and linearly for strongly convex functions.
- Momentum and adaptive methods (AdaGrad/RMSprop/Adam) often speed up convergence in practice.
- Constrained problems use the Lagrangian and KKT conditions, and convex problems often enjoy strong duality.
- Projection and proximal operators let us handle constraints by “pulling” iterates back into the feasible set.
- Optimization is central to AI/ML because training models is solving large, often nonconvex, optimization problems.
Prerequisites
- Multivariable calculus — Gradients, Hessians, Taylor expansions, and line searches are the mathematical backbone of optimization.
- Linear algebra — Convex quadratics, eigenvalues, norms, and condition numbers determine curvature and convergence.
- Convex analysis — Understanding convex sets, convex functions, and Jensen’s inequality is essential for guarantees and KKT conditions.
- Numerical computing basics — Floating-point precision, stopping criteria, and algorithmic stability affect practical implementations.
- Data structures and algorithms — Efficient projections (sorting), matrix–vector products, and sparse representations determine runtime and memory.
Detailed Explanation
01 Overview
Optimization theory is about making the best possible choice given a quantitative measure of quality (the objective) and a set of rules (constraints). In its simplest form, we minimize a function f(x) over variables x. When there are no constraints, this is unconstrained optimization; otherwise, we have constrained optimization with equalities and/or inequalities. A powerful and practical special case is convex optimization, where the objective and feasible set are convex; then any local minimum is guaranteed to be global, giving strong theoretical and algorithmic guarantees. Optimization algorithms range from first-order methods (using gradients), to second-order methods (using Hessians), to specialized techniques for constraints (projected methods, Lagrange multipliers, interior-point methods). In machine learning, almost all training problems—from linear regression to deep neural networks—are cast as minimization of a loss function, possibly with regularization and constraints. Understanding convergence rates, learning rates (step sizes), momentum, and duality helps you pick algorithms that are both fast and reliable. This overview builds intuition, states precise definitions, and gives concrete C++ implementations for core methods like gradient descent, Nesterov acceleration, and projected gradient.
02 Intuition & Analogies
Imagine hiking to the lowest point in a landscape. The height at any location is the objective value f(x), and the map coordinates are the variables x. If you stand somewhere and look at the steepest downward slope, that direction is the negative gradient. Taking small steps downhill repeatedly is gradient descent. If the terrain is a nice bowl (convex), then no matter where you start, going downhill eventually gets you to the bottom, and any dip you find is the unique global minimum. If the bowl is very elongated (ill-conditioned), you zig-zag: progress is slow in the narrow direction and fast in the wide one. Momentum is like adding a bit of inertia to your movement: once you’re rolling downhill, you keep some speed, which helps traverse flat plateaus and reduce zig-zagging. Adaptive methods behave like smart shoes that adjust their grip per direction: if a direction consistently has big slopes, they take smaller steps there, stabilizing the ride. Constraints are like fences and trails: you can only move within allowed regions. Projected methods let you step downhill and then snap back to the nearest allowed point, like taking a step off-trail but immediately returning to the trail’s edge. The Lagrangian introduces tolls (multipliers) for violating fences; at optimality, the balance between wanting to go downhill and paying tolls is perfect (KKT conditions). Duality is like viewing the same hike from the ranger’s perspective: instead of picking a path, you pick tolls that make any illegal path unattractive, revealing the same optimal bottom from another angle.
03 Formal Definition
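In standard form (a minimal statement, using the notation of the Overview), the general problem is:

minimize f(x) over x ∈ ℝⁿ
subject to gᵢ(x) ≤ 0, i = 1, …, m, and hⱼ(x) = 0, j = 1, …, p.

The feasible set is S = {x : gᵢ(x) ≤ 0 for all i, hⱼ(x) = 0 for all j}. A point x* ∈ S is a global minimizer if f(x*) ≤ f(x) for every x ∈ S, and a local minimizer if the same holds in some neighborhood of x*. The problem is convex when f and every gᵢ are convex and every hⱼ is affine; in that case every local minimizer is global.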
04 When to Use
- Use vanilla gradient descent when you have a differentiable objective, moderate dimension, and can compute exact gradients efficiently. It is robust and simple for convex and mildly nonconvex problems.
- Use accelerated gradient (Nesterov) for smooth convex problems when you want faster theoretical rates without computing Hessians; it often helps on ill-conditioned but convex landscapes.
- Use momentum (Polyak) in deep learning to damp oscillations and traverse ravines; it typically improves wall-clock time even without convexity.
- Use adaptive methods (AdaGrad/RMSprop/Adam) when gradients vary in scale across coordinates (sparse features, NLP) or you want less manual learning rate tuning.
- Use projected gradient when variables must satisfy simple convex constraints (box bounds, simplex, norms); per-iteration it’s as cheap as unconstrained methods plus a projection.
- Use Lagrangian/KKT analysis to reason about constrained optima, derive dual problems (for bounds or algorithms), and detect when constraints are active at the solution.
- Use second-order methods (Newton, quasi-Newton like BFGS/L-BFGS) when you can afford curvature information; they converge in fewer iterations on well-behaved problems but have higher per-iteration cost.
- In ML training, start with Adam or momentum SGD; switch to SGD with a decayed learning rate for best generalization, and consider line-search GD for convex tasks like logistic regression (a minimal sketch of the momentum and Adam updates follows this list).
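As a concrete reference for the momentum and adaptive updates mentioned above, here is a minimal, self-contained sketch of the Polyak-momentum and Adam update rules on a toy separable quadratic. The objective f(x) = Σᵢ cᵢ xᵢ² and all constants are illustrative and not taken from the reference implementations later in this section.

#include <bits/stdc++.h>
using namespace std;

int main() {
    const int n = 4;
    vector<double> c = {1.0, 10.0, 0.1, 5.0};   // per-coordinate curvature of f(x) = sum_i c_i * x_i^2 (toy example)
    auto grad = [&](const vector<double>& x, vector<double>& g) {
        for (int i = 0; i < n; ++i) g[i] = 2.0 * c[i] * x[i];
    };

    // Polyak momentum: v <- beta*v + grad, x <- x - eta*v
    {
        vector<double> x(n, 1.0), v(n, 0.0), g(n);
        double eta = 0.02, beta = 0.9;
        for (int t = 0; t < 500; ++t) {
            grad(x, g);
            for (int i = 0; i < n; ++i) { v[i] = beta * v[i] + g[i]; x[i] -= eta * v[i]; }
        }
        double f = 0; for (int i = 0; i < n; ++i) f += c[i] * x[i] * x[i];
        cout << "momentum: f = " << f << "\n";
    }

    // Adam: bias-corrected first/second moment estimates, per-coordinate step scaling
    {
        vector<double> x(n, 1.0), m(n, 0.0), s(n, 0.0), g(n);
        double eta = 0.05, b1 = 0.9, b2 = 0.999, eps = 1e-8;
        for (int t = 1; t <= 500; ++t) {
            grad(x, g);
            for (int i = 0; i < n; ++i) {
                m[i] = b1 * m[i] + (1 - b1) * g[i];
                s[i] = b2 * s[i] + (1 - b2) * g[i] * g[i];
                double mhat = m[i] / (1 - pow(b1, t));   // bias-corrected first moment
                double shat = s[i] / (1 - pow(b2, t));   // bias-corrected second moment
                x[i] -= eta * mhat / (sqrt(shat) + eps);
            }
        }
        double f = 0; for (int i = 0; i < n; ++i) f += c[i] * x[i] * x[i];
        cout << "adam:     f = " << f << "\n";
    }
    return 0;
}

The per-coordinate scaling in the Adam branch is what makes the steps comparable across the mismatched curvatures cᵢ, which is exactly the situation the bullet on adaptive methods describes.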
⚠️ Common Mistakes
- Using a learning rate that is too large, causing divergence or oscillation; mitigate with backtracking line search or by respecting η < 2/L when f is L-smooth.
- Assuming any stationary point is a minimum; in nonconvex problems, stationary points can be saddles or maxima. Check second-order information or use perturbations to escape saddles.
- Ignoring scaling/conditioning; ill-conditioned problems make GD very slow. Precondition (feature scaling, whitening) or use momentum/accelerated methods.
- Violating constraints during updates; always project (projected gradient) or incorporate constraints via penalties or Lagrange multipliers.
- Misapplying KKT conditions without verifying constraint qualifications (e.g., Slater’s condition), which can lead to incorrect conclusions about optimality.
- Over-trusting adaptive optimizers; they may converge fast in training loss but generalize worse. Consider learning rate schedules and switching optimizers.
- Forgetting a proper stopping criterion; use gradient norm, duality gap, or relative decrease tests rather than a fixed iteration count only (a minimal stopping-test sketch follows this list).
- Implementing vector math naively (copy-heavy) in code; prefer in-place updates and avoid excessive allocations for performance.
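A minimal sketch of a combined stopping test and an in-place update loop, illustrating the last two points. The helper name should_stop and the toy objective f(x) = 0.5‖x‖² are invented for this sketch.

#include <bits/stdc++.h>
using namespace std;

// Combined stopping test: gradient norm below an absolute tolerance, or the
// relative decrease of f over one iteration below a tolerance.
static bool should_stop(double grad_norm, double f_prev, double f_curr,
                        double g_tol = 1e-8, double rel_tol = 1e-12) {
    if (grad_norm < g_tol) return true;
    double rel_dec = (f_prev - f_curr) / max(1.0, fabs(f_prev));
    return rel_dec >= 0.0 && rel_dec < rel_tol;
}

int main() {
    // Gradient descent on f(x) = 0.5*||x||^2, updated in place (no per-iteration copies)
    int n = 3;
    vector<double> x(n, 4.0), g(n);
    double eta = 0.5, f_prev = numeric_limits<double>::infinity();
    for (int it = 1; it <= 100000; ++it) {
        double f = 0.0, gn = 0.0;
        for (int i = 0; i < n; ++i) { g[i] = x[i]; f += 0.5 * x[i] * x[i]; gn += g[i] * g[i]; }
        gn = sqrt(gn);
        if (should_stop(gn, f_prev, f)) {
            cout << "stopped at iter " << it << ", ||g|| = " << gn << ", f = " << f << "\n";
            break;
        }
        for (int i = 0; i < n; ++i) x[i] -= eta * g[i]; // in-place update, no temporary vectors
        f_prev = f;
    }
    return 0;
}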
Key Formulas
Convexity inequality
f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y)  for all x, y and all λ ∈ [0, 1]
Explanation: This inequality defines convex functions: the function lies below the line segment between any two points. It ensures local minima are global.
First-order optimality
∇f(x*) = 0
Explanation: At an unconstrained smooth local minimum, the gradient must be zero. It is necessary but not sufficient for nonconvex problems.
Second-order condition
∇²f(x*) ⪰ 0  (positive semidefinite)
Explanation: A positive semidefinite Hessian at a stationary point indicates local convexity. Together with zero gradient, it implies a local minimum.
Strong convexity
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (μ/2)‖y − x‖²  for all x, y
Explanation: This inequality says the function curves upward at least quadratically. It leads to linear convergence of gradient methods.
L-smoothness (Lipschitz gradient)
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖²  (equivalently, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖)
Explanation: This upper bound controls how much f can increase. It justifies choosing learning rates up to 2/L for gradient descent stability.
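Plugging y = x − η∇f(x) into this quadratic upper bound gives the standard one-step decrease guarantee (a short derivation included here for reference):

f(x − η∇f(x)) ≤ f(x) − η(1 − Lη/2)‖∇f(x)‖²

so any step size 0 < η < 2/L strictly decreases f at non-stationary points, and η = 1/L maximizes the guaranteed decrease at ‖∇f(x)‖²/(2L).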
Gradient descent update
x_{t+1} = x_t − η ∇f(x_t)
Explanation: Each iteration moves opposite the gradient scaled by step size. It is the workhorse of first-order optimization.
Sublinear rate for convex GD
f(x_t) − f(x*) ≤ L‖x_0 − x*‖² / (2t)
Explanation: For L-smooth convex f and a fixed step η = 1/L, the error decays like 1/t. More iterations give diminishing improvements.
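For example, under the same assumptions, guaranteeing f(x_t) − f(x*) ≤ ε requires at most t ≥ L‖x_0 − x*‖² / (2ε) iterations, so halving the target accuracy doubles the iteration count.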
Linear rate for strongly convex GD
f(x_t) − f(x*) ≤ (1 − μ/L)^t (f(x_0) − f(x*))
Explanation: When f is μ-strongly convex and L-smooth, GD converges geometrically. The factor 1 − μ/L shows how conditioning affects speed.
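Concretely, using 1 − μ/L ≤ exp(−μ/L), the error drops below ε once t ≥ (L/μ) · log((f(x_0) − f(x*))/ε). The condition number κ = L/μ multiplies the iteration count, which is why preconditioning and acceleration matter.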
Polyak momentum
v_{t+1} = β v_t + ∇f(x_t),  x_{t+1} = x_t − η v_{t+1}
Explanation: Adds an exponentially decaying average of past gradients to the update. This reduces oscillations and can accelerate convergence.
Nesterov acceleration
y_t = x_t + β_t (x_t − x_{t−1}),  x_{t+1} = y_t − η ∇f(y_t)
Explanation: Computes the gradient at a lookahead point, improving theoretical rates for smooth convex problems. Widely used in practice.
AdaGrad
G_t = G_{t−1} + g_t ⊙ g_t,  x_{t+1} = x_t − η g_t / (√G_t + ε)  (per coordinate)
Explanation: Accumulates squared gradients per coordinate to adapt steps. Helpful when features are sparse or scales differ across dimensions.
Adam
m_t = β₁ m_{t−1} + (1 − β₁) g_t,  v_t = β₂ v_{t−1} + (1 − β₂) g_t ⊙ g_t,  x_{t+1} = x_t − η · (m_t / (1 − β₁^t)) / (√(v_t / (1 − β₂^t)) + ε)
Explanation: Combines momentum and adaptive scaling with bias correction. A default choice in deep learning due to robustness to tuning.
Lagrangian
L(x, λ, ν) = f(x) + Σᵢ λᵢ gᵢ(x) + Σⱼ νⱼ hⱼ(x)  for constraints gᵢ(x) ≤ 0 and hⱼ(x) = 0
Explanation: Augments the objective with penalties for violating constraints, weighted by multipliers. Central to duality and KKT conditions.
KKT conditions
Stationarity: ∇f(x*) + Σᵢ λᵢ ∇gᵢ(x*) + Σⱼ νⱼ ∇hⱼ(x*) = 0;  primal feasibility: gᵢ(x*) ≤ 0, hⱼ(x*) = 0;  dual feasibility: λᵢ ≥ 0;  complementary slackness: λᵢ gᵢ(x*) = 0
Explanation: These conditions characterize optimality for constrained problems under mild assumptions. For convex problems, they are necessary and sufficient.
Dual problem
maximize over λ ≥ 0, ν:  q(λ, ν) = inf_x L(x, λ, ν)
Explanation: The dual maximizes the best lower bound on the primal objective. Strong duality means the bound equals the optimum.
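A small worked example, included here for illustration: minimize x² subject to x ≥ 1, i.e. g(x) = 1 − x ≤ 0. The Lagrangian is L(x, λ) = x² + λ(1 − x) with λ ≥ 0. Stationarity gives 2x = λ, and complementary slackness λ(1 − x) = 0 forces the constraint active, so x* = 1 and λ* = 2 ≥ 0, satisfying all KKT conditions. The dual function is q(λ) = inf_x (x² + λ(1 − x)) = λ − λ²/4, maximized at λ = 2 with value 1 = f(x*), so strong duality holds (Slater’s condition is satisfied, e.g. by the strictly feasible point x = 2).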
Projection operator
Π_S(y) = argmin_{x ∈ S} ‖x − y‖₂
Explanation: Projection maps any point to the closest feasible point in a convex set. It enables simple handling of constraints via projected gradient.
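For simple sets the projection has a closed form. For a box [l, u] it is the per-coordinate clamp Π_[l,u](y)ᵢ = min(uᵢ, max(lᵢ, yᵢ)), and for the nonnegative orthant it simply zeroes out negative coordinates. The simplex projection used in the code examples below needs the sorting-based routine shown there.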
Complexity Analysis
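For the methods implemented in the code examples below: each gradient or objective evaluation of the dense quadratic costs O(n²) time (one matrix–vector product), and storing A takes O(n²) memory; backtracking adds one function evaluation per trial step, and the simplex projection costs O(n log n) due to sorting. In iteration counts, plain gradient descent needs O(1/ε) iterations on smooth convex problems and O(κ log(1/ε)) under strong convexity with κ = L/μ, while Nesterov acceleration improves these to O(1/√ε) and O(√κ log(1/ε)).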
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Minimize f(x) = 0.5 x^T A x + b^T x with A symmetric positive definite (SPD)
// Gradient: g(x) = A x + b

struct Quadratic {
    vector<vector<double>> A; // dense SPD matrix
    vector<double> b;

    double f(const vector<double>& x) const {
        int n = (int)x.size();
        // compute 0.5 x^T A x + b^T x
        double quad = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            quad += x[i] * s;
        }
        double lin = 0.0;
        for (int i = 0; i < n; ++i) lin += b[i] * x[i];
        return 0.5 * quad + lin;
    }

    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size();
        g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            g[i] = s + b[i];
        }
    }
};

// Armijo backtracking line search: find eta so that f(x - eta*g) <= f(x) - alpha*eta*||g||^2
static double backtracking(const Quadratic& Q, const vector<double>& x, const vector<double>& g,
                           double alpha = 1e-4, double beta = 0.5, double eta0 = 1.0) {
    double fx = Q.f(x);
    double gg = 0.0; for (double v : g) gg += v * v;
    double eta = eta0;
    int n = (int)x.size();
    vector<double> xnew(n);
    while (true) {
        for (int i = 0; i < n; ++i) xnew[i] = x[i] - eta * g[i];
        double fxnew = Q.f(xnew);
        if (fxnew <= fx - alpha * eta * gg) break;
        eta *= beta;            // shrink
        if (eta < 1e-20) break; // safeguard
    }
    return eta;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 5;
    // Build a simple SPD matrix A = D + 0.1*ones, with increasing diagonal
    vector<vector<double>> A(n, vector<double>(n, 0.1));
    for (int i = 0; i < n; ++i) A[i][i] = 1.0 + i; // diagonally dominant => SPD
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = (i % 2 == 0 ? 1.0 : -2.0);

    Quadratic Q{A, b};

    vector<double> x(n, 3.0); // initial guess
    vector<double> g;

    double tol = 1e-8;
    int maxit = 10000;

    cout << fixed << setprecision(6);
    cout << "Initial f = " << Q.f(x) << "\n";

    for (int it = 1; it <= maxit; ++it) {
        Q.grad(x, g);
        double gn = 0.0; for (double v : g) gn += v * v; gn = sqrt(gn);
        if (gn < tol) {
            cout << "Converged at iter " << it << ", ||grad|| = " << gn << "\n";
            break;
        }
        double eta = backtracking(Q, x, g);
        for (int i = 0; i < n; ++i) x[i] -= eta * g[i];
        if (it % 1000 == 0) cout << "iter " << it << ": f = " << Q.f(x) << ", ||g|| = " << gn << ", eta = " << eta << "\n";
        if (it == maxit) cout << "Reached max iterations.\n";
    }

    cout << "Final f = " << Q.f(x) << "\n";
    cout << "Solution x = ";
    for (double xi : x) cout << xi << ' ';
    cout << "\n";
    return 0;
}
Implements gradient descent with an Armijo backtracking line search on a dense symmetric positive definite quadratic. The line search ensures sufficient decrease and prevents overly large steps. The gradient norm is used for termination. Because the function is strongly convex, the method converges linearly with an appropriate step size; backtracking adapts the step automatically.
#include <bits/stdc++.h>
using namespace std;

// Same quadratic model: f(x) = 0.5 x^T A x + b^T x
struct Quadratic {
    vector<vector<double>> A;
    vector<double> b;
    double f(const vector<double>& x) const {
        int n = (int)x.size();
        double quad = 0, lin = 0;
        for (int i = 0; i < n; ++i) {
            double s = 0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            quad += x[i] * s;
        }
        for (int i = 0; i < n; ++i) lin += b[i] * x[i];
        return 0.5 * quad + lin;
    }
    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size();
        g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) {
            double s = 0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            g[i] = s + b[i];
        }
    }
};

// Estimate L (largest eigenvalue of A) via power iteration
static double estimate_L(const vector<vector<double>>& A, int iters = 200) {
    int n = (int)A.size();
    vector<double> v(n, 1.0 / n), w(n);
    auto mv = [&](const vector<double>& x, vector<double>& y) {
        y.assign(n, 0.0);
        for (int i = 0; i < n; ++i) {
            double s = 0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            y[i] = s;
        }
    };
    for (int t = 0; t < iters; ++t) {
        mv(v, w);
        double nrm = 0; for (double z : w) nrm += z * z; nrm = sqrt(nrm);
        for (int i = 0; i < n; ++i) v[i] = w[i] / (nrm + 1e-30);
    }
    mv(v, w);
    double num = 0, den = 0;
    for (int i = 0; i < n; ++i) { num += v[i] * w[i]; den += v[i] * v[i]; }
    return num / (den + 1e-30);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 6;
    vector<vector<double>> A(n, vector<double>(n, 0.05));
    for (int i = 0; i < n; ++i) A[i][i] = 1.0 + 5.0 * i; // ill-conditioned SPD
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = (i % 2 ? 3.0 : -1.0);
    Quadratic Q{A, b};

    // Nesterov accelerated gradient with convex schedule beta_t = (t-1)/(t+2)
    double L = estimate_L(A); // step size 1/L ensures stability for L-smooth f
    double eta = 1.0 / max(L, 1e-12);

    vector<double> x(n, 5.0), x_prev = x, y = x, g;

    int T = 2000;
    double best = Q.f(x);
    cout << fixed << setprecision(6);
    cout << "Estimated L = " << L << ", step = " << eta << "\n";
    for (int t = 1; t <= T; ++t) {
        double beta = (t - 1.0) / (t + 2.0); // classic Nesterov convex schedule
        for (int i = 0; i < n; ++i) y[i] = x[i] + beta * (x[i] - x_prev[i]);
        Q.grad(y, g);
        x_prev.swap(x);
        for (int i = 0; i < n; ++i) x[i] = y[i] - eta * g[i];
        if (t % 200 == 0) cout << "iter " << t << ": f = " << Q.f(x) << "\n";
        best = min(best, Q.f(x));
    }
    cout << "Final f = " << Q.f(x) << " (best seen " << best << ")\n";
    cout << "x = "; for (double xi : x) cout << xi << ' '; cout << "\n";
    return 0;
}
Implements Nesterov’s accelerated gradient for a smooth convex quadratic. The step size is set to 1/L using a power iteration estimate of the Lipschitz constant, and the standard convex schedule β_t=(t−1)/(t+2) provides acceleration. On ill-conditioned bowls, Nesterov typically reduces zig-zagging and accelerates convergence compared to plain gradient descent.
#include <bits/stdc++.h>
using namespace std;

// Objective: minimize f(x) = 0.5 x^T A x + b^T x
// Subject to x >= 0 and sum_i x_i = 1 (simplex constraint)
// Algorithm: Projected Gradient Descent (PGD): x^{k+1} = Pi_S( x^k - eta * grad f(x^k) )

struct Quadratic {
    vector<vector<double>> A;
    vector<double> b;
    double f(const vector<double>& x) const {
        int n = (int)x.size();
        double quad = 0, lin = 0;
        for (int i = 0; i < n; ++i) {
            double s = 0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            quad += x[i] * s;
        }
        for (int i = 0; i < n; ++i) lin += b[i] * x[i];
        return 0.5 * quad + lin;
    }
    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size();
        g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) {
            double s = 0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            g[i] = s + b[i];
        }
    }
};

// Projection onto simplex {x: x_i >= 0, sum x_i = 1}
// Algorithm (sorting-based): see "Efficient Projections onto the L1-Ball for Learning in High Dimensions" (Duchi et al.)
static void project_simplex(vector<double>& v) {
    int n = (int)v.size();
    vector<double> u = v;
    sort(u.begin(), u.end(), greater<double>());
    double css = 0.0, theta = 0.0;
    for (int i = 0; i < n; ++i) {
        css += u[i];
        double t = (css - 1.0) / (i + 1);
        if (u[i] - t > 0) theta = t; // threshold from the largest valid prefix
    }
    for (int i = 0; i < n; ++i) v[i] = max(0.0, v[i] - theta);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 8;
    vector<vector<double>> A(n, vector<double>(n, 0.02));
    for (int i = 0; i < n; ++i) A[i][i] = 0.5 + 0.3 * i; // SPD and relatively well-conditioned
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = (i % 3 == 0 ? -1.0 : 0.2);

    Quadratic Q{A, b};

    vector<double> x(n, 1.0 / n); // feasible start on simplex
    vector<double> g(n), x_old(n);

    // Choose a conservative step size; for an SPD quadratic, L is the max eigenvalue (~ max diagonal + off-diagonal row sum)
    double eta = 0.5; // small step for safety; could estimate L and set 1/L

    int T = 3000;
    double tol = 1e-8;
    cout << fixed << setprecision(6);
    cout << "Initial f = " << Q.f(x) << "\n";
    for (int t = 1; t <= T; ++t) {
        Q.grad(x, g);
        x_old = x;
        for (int i = 0; i < n; ++i) x[i] -= eta * g[i];
        project_simplex(x);
        // Stationarity measure for PGD: the gradient-mapping norm ||x_old - x_new|| / eta.
        // (The plain gradient need not vanish at a constrained optimum.)
        double gm = 0;
        for (int i = 0; i < n; ++i) { double d = (x_old[i] - x[i]) / eta; gm += d * d; }
        gm = sqrt(gm);
        if (t % 300 == 0) cout << "iter " << t << ": f = " << Q.f(x) << ", ||G(x)|| = " << gm << "\n";
        if (gm < tol) { cout << "Converged at iter " << t << ", ||G(x)|| = " << gm << "\n"; break; }
    }

    cout << "Final f = " << Q.f(x) << "\n";
    cout << "x (on simplex) = "; for (double xi : x) cout << xi << ' '; cout << "\n";

    // KKT insight: components with x_i == 0 have active inequality constraints (nonnegative multipliers);
    // the sum constraint contributes a single multiplier nu (not computed explicitly; PGD enforces feasibility each step).
    return 0;
}
Shows how to handle constraints via projected gradient descent on the probability simplex. After each gradient step, the iterate is projected onto the simplex with a sorting-based O(n log n) algorithm, ensuring x ≥ 0 and sum_i x_i = 1. Termination is checked with the gradient-mapping norm ‖x_old − x_new‖/η, which, unlike the raw gradient norm, does vanish at a constrained optimum. This mirrors KKT optimality: components held at zero correspond to active inequality constraints with nonnegative multipliers, and the equality constraint contributes a single dual variable (not computed explicitly).
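To make the KKT insight concrete, here is the standard specialization of the conditions to this simplex-constrained problem (a short derivation, not part of the code above). For min f(x) subject to x ≥ 0 and 1ᵀx = 1, the Lagrangian is L(x, λ, ν) = f(x) − λᵀx + ν(1ᵀx − 1) with λ ≥ 0. Stationarity gives ∇f(x*)ᵢ = λᵢ − ν, and complementary slackness λᵢ x*ᵢ = 0 then implies:

∇f(x*)ᵢ = −ν  whenever x*ᵢ > 0,  and  ∇f(x*)ᵢ ≥ −ν  whenever x*ᵢ = 0.

So all coordinates kept positive by the projection share one common partial derivative, and coordinates driven to zero have a partial derivative at least as large; checking this on the final iterate is a quick optimality test for the PGD solution.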