📚 Theory · Intermediate

Optimization Theory

Key Points

  • Optimization theory studies how to choose variables to minimize or maximize an objective while respecting constraints.
  • For convex problems, any local minimum is also a global minimum, enabling reliable algorithms and guarantees.
  • First- and second-order conditions (∇f(x^*) = 0 and ∇²f(x^*) ⪰ 0) characterize optimal points for smooth unconstrained problems.
  • Gradient descent updates x by stepping opposite the gradient; with L-smooth convexity it converges at a sublinear O(1/t) rate, and linearly for strongly convex functions.
  • Momentum and adaptive methods (AdaGrad/RMSprop/Adam) often speed up convergence in practice.
  • Constrained problems use the Lagrangian and KKT conditions, and convex problems often enjoy strong duality.
  • Projection and proximal operators let us handle constraints by “pulling” iterates back into the feasible set.
  • Optimization is central to AI/ML because training models is solving large, often nonconvex, optimization problems.

Prerequisites

  • Multivariable calculus: Gradients, Hessians, Taylor expansions, and line integrals are the mathematical backbone of optimization.
  • Linear algebra: Convex quadratics, eigenvalues, norms, and condition numbers determine curvature and convergence.
  • Convex analysis: Understanding convex sets, convexity, and Jensen’s inequality is essential for guarantees and KKT conditions.
  • Numerical computing basics: Floating-point precision, stopping criteria, and algorithmic stability affect practical implementations.
  • Data structures and algorithms: Efficient projections (sorting), matrix–vector products, and sparse representations determine runtime and memory.

Detailed Explanation


01 Overview

Optimization theory is about making the best possible choice given a quantitative measure of quality (the objective) and a set of rules (constraints). In its simplest form, we minimize a function f(x) over variables x. When there are no constraints, this is unconstrained optimization; otherwise, we have constrained optimization with equalities and/or inequalities. A powerful and practical special case is convex optimization, where the objective and feasible set are convex; then any local minimum is guaranteed to be global, giving strong theoretical and algorithmic guarantees. Optimization algorithms range from first-order methods (using gradients), to second-order methods (using Hessians), to specialized techniques for constraints (projected methods, Lagrange multipliers, interior-point methods). In machine learning, almost all training problems—from linear regression to deep neural networks—are cast as minimization of a loss function, possibly with regularization and constraints. Understanding convergence rates, learning rates (step sizes), momentum, and duality helps you pick algorithms that are both fast and reliable. This overview builds intuition, states precise definitions, and gives concrete C++ implementations for core methods like gradient descent, Nesterov acceleration, and projected gradient.

02 Intuition & Analogies

Imagine hiking to the lowest point in a landscape. The height at any location is the objective value f(x), and the map coordinates are the variables x. If you stand somewhere and look at the steepest downward slope, that direction is the negative gradient. Taking small steps downhill repeatedly is gradient descent. If the terrain is a nice bowl (convex), then no matter where you start, going downhill eventually gets you to the bottom, and any dip you find is the unique global minimum. If the bowl is very elongated (ill-conditioned), you zig-zag: progress is slow in the narrow direction and fast in the wide one. Momentum is like adding a bit of inertia to your movement: once you’re rolling downhill, you keep some speed, which helps traverse flat plateaus and reduce zig-zagging. Adaptive methods behave like smart shoes that adjust their grip per direction: if a direction consistently has big slopes, they take smaller steps there, stabilizing the ride. Constraints are like fences and trails: you can only move within allowed regions. Projected methods let you step downhill and then snap back to the nearest allowed point, like taking a step off-trail but immediately returning to the trail’s edge. The Lagrangian introduces tolls (multipliers) for violating fences; at optimality, the balance between wanting to go downhill and paying tolls is perfect (KKT conditions). Duality is like viewing the same hike from the ranger’s perspective: instead of picking a path, you pick tolls that make any illegal path unattractive, revealing the same optimal bottom from another angle.

03 Formal Definition

We study problems of the form: minimize f(x) subject to g_i(x) ≤ 0 for i = 1..m and h_j(x) = 0 for j = 1..p. The Lagrangian is L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x) with multipliers λ ≥ 0. The dual function is d(λ, ν) = inf_x L(x, λ, ν), and the dual problem is to maximize d(λ, ν) subject to λ ≥ 0. For convex problems satisfying Slater’s condition, strong duality holds: the optimal primal and dual values are equal, and the KKT conditions (stationarity, primal feasibility, dual feasibility, complementary slackness) are necessary and sufficient for optimality. For unconstrained smooth problems, first-order stationarity ∇f(x^*) = 0 is necessary; second-order conditions refine this: ∇²f(x^*) ⪰ 0 is necessary at a local minimizer, and ∇²f(x^*) ≻ 0 (together with ∇f(x^*) = 0) makes x^* a strict local minimizer. Convergence of first-order methods depends on properties such as L-smoothness (∇f Lipschitz with constant L) and μ-strong convexity. Gradient descent with step size η ∈ (0, 2/L) achieves the sublinear rate O(1/t) for convex L-smooth f, and the linear rate O((1 − μ/L)^t) for μ-strongly convex L-smooth f. Momentum (Polyak, Nesterov) and adaptive algorithms (AdaGrad, RMSprop, Adam) modify the updates to exploit curvature or gradient history for faster practical convergence.
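
For concreteness, here is a tiny worked instance of the KKT conditions and strong duality; the specific objective and constraint are illustrative choices, not taken from the text above: minimize f(x) = x² subject to the single inequality g(x) = 1 − x ≤ 0.

\begin{aligned}
&L(x,\lambda) = x^2 + \lambda(1-x), \qquad \lambda \ge 0,\\
&\text{stationarity: } 2x - \lambda = 0, \qquad \text{complementary slackness: } \lambda(1-x) = 0,\\
&\Rightarrow\; x^{*} = 1,\ \lambda^{*} = 2, \qquad d(\lambda) = \inf_x L(x,\lambda) = \lambda - \tfrac{\lambda^2}{4},\\
&\max_{\lambda \ge 0} d(\lambda) = d(2) = 1 = f(x^{*}) \quad \text{(strong duality holds).}
\end{aligned}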

04 When to Use

  • Use vanilla gradient descent when you have a differentiable objective, moderate dimension, and can compute exact gradients efficiently. It is robust and simple for convex and mildly nonconvex problems.
  • Use accelerated gradient (Nesterov) for smooth convex problems when you want faster theoretical rates without computing Hessians; it often helps on ill-conditioned but convex landscapes.
  • Use momentum (Polyak) in deep learning to damp oscillations and traverse ravines; it typically improves wall-clock time even without convexity.
  • Use adaptive methods (AdaGrad/RMSprop/Adam) when gradients vary in scale across coordinates (sparse features, NLP) or you want less manual learning rate tuning.
  • Use projected gradient when variables must satisfy simple convex constraints (box bounds, simplex, norms); per-iteration it’s as cheap as unconstrained methods plus a projection.
  • Use Lagrangian/KKT analysis to reason about constrained optima, derive dual problems (for bounds or algorithms), and detect when constraints are active at the solution.
  • Use second-order methods (Newton, quasi-Newton like BFGS/L-BFGS) when you can afford curvature information; they converge in fewer iterations on well-behaved problems but have higher per-iteration cost (a minimal 1-D Newton sketch follows this list).
  • In ML training, start with Adam or momentum SGD; switch to SGD with decayed learning rate for best generalization, and consider line-search GD for convex tasks like logistic regression.
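
Since Newton’s method is mentioned above but not implemented in the larger examples below, here is a minimal one-dimensional sketch; the function x − ln x and the starting point are illustrative choices.

#include <bits/stdc++.h>
using namespace std;

// Newton's method for minimizing f(x) = x - ln(x) on x > 0.
// f'(x) = 1 - 1/x, f''(x) = 1/x^2; the minimizer is x = 1.
int main() {
    double x = 0.5; // illustrative starting point
    cout << setprecision(12);
    for (int k = 1; k <= 6; ++k) {
        double g = 1.0 - 1.0 / x;   // first derivative
        double h = 1.0 / (x * x);   // second derivative (positive here, so the step descends)
        x = x - g / h;              // Newton step
        cout << "iter " << k << ": x = " << x << "\n";
    }
    return 0;
}

Running it shows the hallmark quadratic convergence: the number of correct digits roughly doubles each iteration until x reaches 1.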

⚠️ Common Mistakes

  • Using a learning rate that is too large, causing divergence or oscillation; mitigate with backtracking line search or by respecting η < 2/L when f is L-smooth.
  • Assuming any stationary point is a minimum; in nonconvex problems, stationary points can be saddles or maxima. Check second-order information or use perturbations to escape saddles.
  • Ignoring scaling/conditioning; ill-conditioned problems make GD very slow. Precondition (feature scaling, whitening) or use momentum/accelerated methods.
  • Violating constraints during updates; always project (projected gradient) or incorporate constraints via penalties or Lagrange multipliers.
  • Misapplying KKT conditions without verifying constraint qualifications (e.g., Slater’s condition) leading to incorrect conclusions about optimality.
  • Over-trusting adaptive optimizers; they may converge fast in training loss but generalize worse. Consider learning rate schedules and switching optimizers.
  • Forgetting a principled stopping rule; use gradient norm, duality gap, or relative-decrease criteria rather than a fixed iteration count only (a minimal check is sketched after this list).
  • Implementing vector math naively (copy-heavy) in code; prefer in-place updates and avoid excessive allocations for performance.
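
A minimal sketch of such a stopping check, combining a gradient-norm test with a relative-decrease (patience) test; the thresholds and the tiny quadratic used to exercise it are illustrative choices, not taken from the examples later on this page.

#include <bits/stdc++.h>
using namespace std;

// Combined stopping rule: stop when ||grad|| is small, or when the relative decrease
// in f has stalled for several consecutive iterations ("patience").
struct StoppingCheck {
    double grad_tol = 1e-8, rel_tol = 1e-12;
    int patience = 5, stalled = 0;
    double f_prev = numeric_limits<double>::infinity();
    bool should_stop(double f, double grad_norm) {
        if (grad_norm < grad_tol) return true;
        double rel = fabs(f_prev - f) / max(1.0, fabs(f_prev));
        stalled = (rel < rel_tol) ? stalled + 1 : 0;
        f_prev = f;
        return stalled >= patience;
    }
};

int main() {
    // Exercise the rule with plain gradient descent on f(x) = 0.5 x^2 (illustrative only).
    StoppingCheck check;
    double x = 10.0, eta = 0.1;
    int it = 0;
    while (true) {
        double g = x; // gradient of 0.5 x^2
        if (check.should_stop(0.5 * x * x, fabs(g))) break;
        x -= eta * g;
        ++it;
    }
    cout << "stopped after " << it << " iterations, x = " << x << "\n";
    return 0;
}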

Key Formulas

Convexity inequality

f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y) for all x, y and θ ∈ [0, 1]

Explanation: This inequality defines convex functions: the function lies below the line segment between any two points. It ensures local minima are global.

First-order optimality

∇f(x^*) = 0

Explanation: At an unconstrained smooth local minimum, the gradient must be zero. It is necessary but not sufficient for nonconvex problems.

Second-order condition

∇f(x^*) = 0 and ∇²f(x^*) ⪰ 0 (necessary); ∇²f(x^*) ≻ 0 gives a strict local minimum

Explanation: A positive semidefinite Hessian at a stationary point is necessary for a local minimum and indicates local convexity. Together with a zero gradient, a positive definite Hessian is sufficient for a strict local minimum.

Strong convexity

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (μ/2) ||y − x||^2 for all x, y

Explanation: This inequality says the function curves upward at least quadratically. It leads to linear convergence of gradient methods.

L-smoothness (Lipschitz gradient)

||∇f(x) − ∇f(y)|| ≤ L ||x − y||, which implies f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ||y − x||^2

Explanation: This upper bound controls how much f can increase. It justifies choosing learning rates up to 2/L for gradient descent stability.

Gradient descent update

x_{t+1} = x_t − η ∇f(x_t)

Explanation: Each iteration moves opposite the gradient scaled by step size. It is the workhorse of first-order optimization.

Sublinear rate for convex GD

f(x_t) − f(x^*) ≤ L ||x_0 − x^*||^2 / (2t)

Explanation: For L-smooth convex f and fixed step η = 1/L, the error decays like 1/t. More iterations give diminishing improvements.

Linear rate for strongly convex GD

||x_t − x^*||^2 ≤ (1 − μ/L)^t ||x_0 − x^*||^2 (with step η = 1/L)

Explanation: When f is μ-strongly convex and L-smooth, GD converges geometrically. The factor 1 − μ/L shows how conditioning affects speed.

Polyak momentum

x_{t+1} = x_t − η ∇f(x_t) + β (x_t − x_{t−1}); equivalently v_{t+1} = β v_t + ∇f(x_t), x_{t+1} = x_t − η v_{t+1}

Explanation: Adds an exponentially decaying average of past gradients to the update. This reduces oscillations and can accelerate convergence.
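
A compact C++ sketch of the heavy-ball update on a one-dimensional quadratic; the curvature a, step size eta, and momentum beta below are illustrative values chosen so the iteration is stable.

#include <bits/stdc++.h>
using namespace std;

// Heavy-ball (Polyak momentum) on f(x) = 0.5 * a * x^2, whose gradient is a*x.
int main() {
    const double a = 10.0, eta = 0.05, beta = 0.9; // illustrative values
    double x = 5.0, x_prev = 5.0;
    for (int t = 0; t < 300; ++t) {
        double g = a * x;                                  // gradient
        double x_next = x - eta * g + beta * (x - x_prev); // heavy-ball update
        x_prev = x;
        x = x_next;
    }
    cout << "x after 300 iterations: " << x << "\n";
    return 0;
}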

Nesterov acceleration

y_t = x_t + β_t (x_t − x_{t−1}), x_{t+1} = y_t − η ∇f(y_t), with β_t = (t − 1)/(t + 2)

Explanation: Computes the gradient at a lookahead point, improving theoretical rates for smooth convex problems. Widely used in practice.

AdaGrad

G_t = G_{t−1} + g_t ⊙ g_t, x_{t+1} = x_t − η g_t / (√G_t + ε) (element-wise)

Explanation: Accumulates squared gradients per coordinate to adapt steps. Helpful when features are sparse or scales differ across dimensions.
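
A small sketch of the per-coordinate AdaGrad update on a separable quadratic whose curvatures span four orders of magnitude; the coefficients, step size, and epsilon are illustrative choices. Because the first gradient normalizes itself, the first step has the same size in every coordinate regardless of scale.

#include <bits/stdc++.h>
using namespace std;

// AdaGrad on f(x) = 0.5 * sum_i c_i * x_i^2 (gradient c_i * x_i).
int main() {
    vector<double> c = {100.0, 1.0, 0.01}; // very different scales per coordinate
    vector<double> x = {1.0, 1.0, 1.0};
    vector<double> G(x.size(), 0.0);       // accumulated squared gradients
    const double eta = 0.5, eps = 1e-8;    // illustrative step and smoothing constant
    for (int t = 0; t < 500; ++t) {
        for (size_t i = 0; i < x.size(); ++i) {
            double g = c[i] * x[i];
            G[i] += g * g;                          // per-coordinate accumulation
            x[i] -= eta * g / (sqrt(G[i]) + eps);   // adaptive step
        }
    }
    cout << "x = " << x[0] << ' ' << x[1] << ' ' << x[2] << "\n";
    return 0;
}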

Adam

m_t = β1 m_{t−1} + (1 − β1) g_t, v_t = β2 v_{t−1} + (1 − β2) g_t ⊙ g_t, m̂_t = m_t / (1 − β1^t), v̂_t = v_t / (1 − β2^t), x_{t+1} = x_t − η m̂_t / (√v̂_t + ε)

Explanation: Combines momentum and adaptive scaling with bias correction. A default choice in deep learning due to robustness to tuning.
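
A corresponding sketch of the Adam update on the same kind of separable quadratic; the hyperparameters follow commonly quoted defaults and the objective is illustrative. The point is the mechanics: exponential moving averages of the gradient and its square, bias correction, and a roughly scale-free per-coordinate step early on.

#include <bits/stdc++.h>
using namespace std;

// Adam on f(x) = 0.5 * sum_i c_i * x_i^2 (gradient c_i * x_i).
int main() {
    vector<double> c = {100.0, 1.0, 0.01};
    vector<double> x = {1.0, 1.0, 1.0};
    vector<double> m(x.size(), 0.0), v(x.size(), 0.0);
    const double eta = 0.05, b1 = 0.9, b2 = 0.999, eps = 1e-8; // common defaults
    for (int t = 1; t <= 2000; ++t) {
        for (size_t i = 0; i < x.size(); ++i) {
            double g = c[i] * x[i];
            m[i] = b1 * m[i] + (1 - b1) * g;       // first-moment EMA
            v[i] = b2 * v[i] + (1 - b2) * g * g;   // second-moment EMA
            double mhat = m[i] / (1 - pow(b1, t)); // bias correction
            double vhat = v[i] / (1 - pow(b2, t));
            x[i] -= eta * mhat / (sqrt(vhat) + eps);
        }
    }
    cout << "x = " << x[0] << ' ' << x[1] << ' ' << x[2] << "\n";
    return 0;
}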

Lagrangian

L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x), with λ ≥ 0

Explanation: Augments the objective with penalties for violating constraints, weighted by multipliers. Central to duality and KKT conditions.

KKT conditions

Stationarity: ∇f(x^*) + Σ_i λ_i ∇g_i(x^*) + Σ_j ν_j ∇h_j(x^*) = 0; primal feasibility: g_i(x^*) ≤ 0, h_j(x^*) = 0; dual feasibility: λ_i ≥ 0; complementary slackness: λ_i g_i(x^*) = 0

Explanation: These conditions characterize optimality for constrained problems under mild assumptions. For convex problems, they are necessary and sufficient.

Dual problem

maximize d(λ, ν) = inf_x L(x, λ, ν) subject to λ ≥ 0

Explanation: The dual maximizes the best lower bound on the primal objective. Strong duality means the bound equals the optimum.

Projection operator

Π_C(x) = argmin_{y ∈ C} ||y − x||

Explanation: Projection maps any point to the closest feasible point in a convex set. It enables simple handling of constraints via projected gradient.
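
For the simplest case of a box constraint, the projection is a coordinate-wise clamp; a minimal sketch follows, with illustrative bounds and test point (the simplex projection used later requires the sorting-based routine in the third code example).

#include <bits/stdc++.h>
using namespace std;

// Projection onto a box {x : lo_i <= x_i <= hi_i} clamps each coordinate independently.
int main() {
    vector<double> x  = {-1.5, 0.3, 2.7}; // illustrative point outside the box
    vector<double> lo = { 0.0, 0.0, 0.0};
    vector<double> hi = { 1.0, 1.0, 1.0};
    for (size_t i = 0; i < x.size(); ++i)
        x[i] = min(hi[i], max(lo[i], x[i])); // nearest feasible value per coordinate
    cout << "projected x = " << x[0] << ' ' << x[1] << ' ' << x[2] << "\n";
    return 0;
}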

Complexity Analysis

Per-iteration complexity depends on computing gradients and handling constraints. For dense problems with n variables, evaluating a quadratic gradient costs O(n^2) (matrix–vector multiply), while evaluating f and its gradient for general models may cost anywhere from O(n) to O(n^2) depending on structure. Vanilla gradient descent (GD) performs one gradient evaluation and a few vector operations per iteration, so its per-iteration time is dominated by gradient cost and memory bandwidth; space is O(n) for vectors (plus O(n^2) if storing a dense A). Backtracking line search may add a small constant factor by evaluating f several times per iteration until the Armijo condition holds. For smooth convex functions, GD with step size 1/L reaches accuracy f(x_t) − f(x^*) ≤ ε in O(L/ε) iterations (sublinear O(1/t) decay). If f is additionally μ-strongly convex and L-smooth, GD achieves linear convergence: O(κ log(1/ε)) iterations, where κ = L/μ is the condition number. Thus, ill-conditioning (large κ) slows convergence. Momentum and Nesterov acceleration improve constants and, for convex smooth problems, reduce iteration complexity to O(√(L/ε)) for function error, although practical gains depend on tuning and noise. Projected gradient descent adds a projection step. For simple sets like boxes, projection is O(n); for the probability simplex, it is O(n log n) due to sorting; space remains O(n). Second-order methods (Newton) require solving a linear system each iteration; direct dense methods take O(n^3) time and O(n^2) space but can converge in few iterations (quadratically near the optimum). In large-scale ML, first-order methods dominate because they scale linearly in n (and in data size) and can exploit sparsity to reduce both time and memory footprints.

Code Examples

Gradient Descent with Backtracking Line Search on a Convex Quadratic
#include <bits/stdc++.h>
using namespace std;

// Minimize f(x) = 0.5 x^T A x + b^T x with A symmetric positive definite (SPD)
// Gradient: g(x) = A x + b

struct Quadratic {
    vector<vector<double>> A; // dense SPD matrix
    vector<double> b;

    double f(const vector<double>& x) const {
        int n = (int)x.size();
        // compute 0.5 x^T A x + b^T x
        double quad = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            quad += x[i] * s;
        }
        double lin = 0.0;
        for (int i = 0; i < n; ++i) lin += b[i] * x[i];
        return 0.5 * quad + lin;
    }

    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size();
        g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            g[i] = s + b[i];
        }
    }
};

// Armijo backtracking line search: find eta so that f(x - eta*g) <= f(x) - alpha*eta*||g||^2
static double backtracking(const Quadratic& Q, const vector<double>& x, const vector<double>& g,
                           double alpha = 1e-4, double beta = 0.5, double eta0 = 1.0) {
    double fx = Q.f(x);
    double gg = 0.0;
    for (double v : g) gg += v * v;
    double eta = eta0;
    int n = (int)x.size();
    vector<double> xnew(n);
    while (true) {
        for (int i = 0; i < n; ++i) xnew[i] = x[i] - eta * g[i];
        double fxnew = Q.f(xnew);
        if (fxnew <= fx - alpha * eta * gg) break;
        eta *= beta;            // shrink the step
        if (eta < 1e-20) break; // safeguard against an infinite loop
    }
    return eta;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 5;
    // Build a simple SPD matrix A = D + 0.1*ones, with increasing diagonal
    vector<vector<double>> A(n, vector<double>(n, 0.1));
    for (int i = 0; i < n; ++i) A[i][i] = 1.0 + i; // diagonally dominant => SPD
    vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = (i % 2 == 0 ? 1.0 : -2.0);

    Quadratic Q{A, b};

    vector<double> x(n, 3.0); // initial guess
    vector<double> g;

    double tol = 1e-8;
    int maxit = 10000;

    double f0 = Q.f(x);
    cout << fixed << setprecision(6);
    cout << "Initial f = " << f0 << "\n";

    for (int it = 1; it <= maxit; ++it) {
        Q.grad(x, g);
        double gn = 0.0;
        for (double v : g) gn += v * v;
        gn = sqrt(gn);
        if (gn < tol) {
            cout << "Converged at iter " << it << ", ||grad|| = " << gn << "\n";
            break;
        }
        double eta = backtracking(Q, x, g);
        for (int i = 0; i < n; ++i) x[i] -= eta * g[i];
        if (it % 1000 == 0) cout << "iter " << it << ": f = " << Q.f(x) << ", ||g|| = " << gn << ", eta = " << eta << "\n";
        if (it == maxit) cout << "Reached max iterations.\n";
    }

    cout << "Final f = " << Q.f(x) << "\n";
    cout << "Solution x = ";
    for (double xi : x) cout << xi << ' ';
    cout << "\n";
    return 0;
}

Implements gradient descent with an Armijo backtracking line search on a dense symmetric positive definite quadratic. The line search ensures sufficient decrease and prevents overly large steps. The gradient norm is used for termination. Because the function is strongly convex, the method converges linearly with an appropriate step size; backtracking adapts the step automatically.

Time: Per iteration, O(n^2) for the dense matrix–vector product in the gradient, plus a few O(n^2) function evaluations during backtracking.
Space: O(n^2) to store A and O(n) for vectors.

Nesterov Accelerated Gradient (NAG) for Smooth Convex Minimization
#include <bits/stdc++.h>
using namespace std;

// Same quadratic model: f(x) = 0.5 x^T A x + b^T x
struct Quadratic {
    vector<vector<double>> A; vector<double> b;
    double f(const vector<double>& x) const {
        int n = (int)x.size(); double quad = 0, lin = 0;
        for (int i = 0; i < n; ++i) { double s = 0; for (int j = 0; j < n; ++j) s += A[i][j]*x[j]; quad += x[i]*s; }
        for (int i = 0; i < n; ++i) lin += b[i]*x[i];
        return 0.5*quad + lin;
    }
    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size(); g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) { double s = 0; for (int j = 0; j < n; ++j) s += A[i][j]*x[j]; g[i] = s + b[i]; }
    }
};

// Estimate L (largest eigenvalue of A) via power iteration
static double estimate_L(const vector<vector<double>>& A, int iters = 200) {
    int n = (int)A.size(); vector<double> v(n, 1.0/n), w(n);
    auto mv = [&](const vector<double>& x, vector<double>& y) {
        y.assign(n, 0.0);
        for (int i = 0; i < n; ++i) { double s = 0; for (int j = 0; j < n; ++j) s += A[i][j]*x[j]; y[i] = s; }
    };
    for (int t = 0; t < iters; ++t) {
        mv(v, w);
        double nrm = 0; for (double z : w) nrm += z*z; nrm = sqrt(nrm);
        for (int i = 0; i < n; ++i) v[i] = w[i]/(nrm + 1e-30);
    }
    mv(v, w);
    double num = 0, den = 0;
    for (int i = 0; i < n; ++i) { num += v[i]*w[i]; den += v[i]*v[i]; }
    return num/(den + 1e-30);
}

int main() {
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int n = 6; vector<vector<double>> A(n, vector<double>(n, 0.05));
    for (int i = 0; i < n; ++i) A[i][i] = 1.0 + 5.0*i; // ill-conditioned SPD
    vector<double> b(n); for (int i = 0; i < n; ++i) b[i] = (i % 2 ? 3.0 : -1.0);
    Quadratic Q{A, b};

    // Nesterov accelerated gradient with convex schedule beta_t = (t-1)/(t+2)
    double L = estimate_L(A); // step size 1/L ensures stability for L-smooth f
    double eta = 1.0 / max(L, 1e-12);

    vector<double> x(n, 5.0), x_prev = x, y = x, g;

    int T = 2000; double best = Q.f(x);
    cout << fixed << setprecision(6);
    cout << "Estimated L = " << L << ", step = " << eta << "\n";
    for (int t = 1; t <= T; ++t) {
        double beta = (t - 1.0)/(t + 2.0); // classic Nesterov convex schedule
        for (int i = 0; i < n; ++i) y[i] = x[i] + beta * (x[i] - x_prev[i]);
        Q.grad(y, g);
        x_prev.swap(x);
        for (int i = 0; i < n; ++i) x[i] = y[i] - eta * g[i];
        if (t % 200 == 0) cout << "iter " << t << ": f = " << Q.f(x) << "\n";
        best = min(best, Q.f(x));
    }
    cout << "Final f = " << Q.f(x) << " (best seen " << best << ")\n";
    cout << "x = "; for (double xi : x) cout << xi << ' '; cout << "\n";
    return 0;
}

Implements Nesterov’s accelerated gradient for a smooth convex quadratic. The step size is set to 1/L using a power iteration estimate of the Lipschitz constant, and the standard convex schedule β_t=(t−1)/(t+2) provides acceleration. On ill-conditioned bowls, Nesterov typically reduces zig-zagging and accelerates convergence compared to plain gradient descent.

Time: O(n^2) per iteration for dense gradients, plus an upfront O(n^2 · iters) cost to estimate L via power iteration.
Space: O(n^2) for A and O(n) for vectors.

Projected Gradient Descent onto the Probability Simplex (Constraint Handling + KKT Insight)
#include <bits/stdc++.h>
using namespace std;

// Objective: minimize f(x) = 0.5 x^T A x + b^T x
// Subject to x >= 0 and sum_i x_i = 1 (simplex constraint)
// Algorithm: Projected Gradient Descent (PGD): x^{k+1} = Pi_S( x^k - eta * grad f(x^k) )

struct Quadratic {
    vector<vector<double>> A; vector<double> b;
    double f(const vector<double>& x) const {
        int n = (int)x.size(); double quad = 0, lin = 0;
        for (int i = 0; i < n; ++i) { double s = 0; for (int j = 0; j < n; ++j) s += A[i][j]*x[j]; quad += x[i]*s; }
        for (int i = 0; i < n; ++i) lin += b[i]*x[i];
        return 0.5*quad + lin;
    }
    void grad(const vector<double>& x, vector<double>& g) const {
        int n = (int)x.size(); g.assign(n, 0.0);
        for (int i = 0; i < n; ++i) { double s = 0; for (int j = 0; j < n; ++j) s += A[i][j]*x[j]; g[i] = s + b[i]; }
    }
};

// Projection onto simplex {x: x_i >= 0, sum x_i = 1}
// Sorting-based algorithm: see "Efficient Projections onto the L1-Ball for Learning in High Dimensions" (Duchi et al.)
static void project_simplex(vector<double>& v) {
    int n = (int)v.size();
    vector<double> u = v; sort(u.begin(), u.end(), greater<double>());
    double css = 0.0, theta = 0.0;
    for (int i = 0; i < n; ++i) {
        css += u[i];
        double t = (css - 1.0) / (i + 1);
        if (u[i] - t > 0) theta = t; // the largest index with a positive component fixes the shift
    }
    for (int i = 0; i < n; ++i) v[i] = max(0.0, v[i] - theta);
}

int main() {
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n = 8; vector<vector<double>> A(n, vector<double>(n, 0.02));
    for (int i = 0; i < n; ++i) A[i][i] = 0.5 + 0.3*i; // SPD and relatively well-conditioned
    vector<double> b(n); for (int i = 0; i < n; ++i) b[i] = (i % 3 == 0 ? -1.0 : 0.2);

    Quadratic Q{A, b};

    vector<double> x(n, 1.0/n); // feasible start on the simplex
    vector<double> g(n), x_prev(n);

    // Choose a conservative step size; for an SPD quadratic, L is the max eigenvalue (~ max diagonal + row sum)
    double eta = 0.5; // small step for safety; could estimate L and set 1/L

    int T = 3000; double tol = 1e-8;
    cout << fixed << setprecision(6);
    cout << "Initial f = " << Q.f(x) << "\n";
    for (int t = 1; t <= T; ++t) {
        Q.grad(x, g);
        x_prev = x;
        for (int i = 0; i < n; ++i) x[i] -= eta * g[i];
        project_simplex(x);
        // Stationarity proxy: at a constrained optimum the projected step leaves x unchanged,
        // so stop when ||x - x_prev|| is small (the raw gradient norm need not vanish here).
        double dn = 0; for (int i = 0; i < n; ++i) dn += (x[i] - x_prev[i]) * (x[i] - x_prev[i]);
        dn = sqrt(dn);
        if (t % 300 == 0) cout << "iter " << t << ": f = " << Q.f(x) << ", ||x - x_prev|| = " << dn << "\n";
        if (dn < tol) break;
    }

    cout << "Final f = " << Q.f(x) << "\n";
    cout << "x (on simplex) = "; for (double xi : x) cout << xi << ' '; cout << "\n";

    // KKT insight: components with x_i == 0 form the active set with nonnegative multipliers;
    // the sum constraint contributes a single multiplier nu (not computed explicitly here,
    // but PGD enforces feasibility at every step).
    return 0;
}

Shows how to handle constraints via projected gradient descent on the probability simplex. After a gradient step, we project onto the simplex using a sorting-based O(n log n) algorithm, ensuring x ≥ 0 and sum x_i = 1. This mirrors KKT optimality: active components at zero correspond to nonnegative multipliers, and the equality constraint enforces a single dual variable (not explicitly computed).

Time: O(n^2) per iteration for the dense gradient plus O(n log n) for the simplex projection (the projection dominates when n is large and A is sparse).
Space: O(n^2) for A and O(n) for vectors; the projection uses O(n) extra memory.