Gradient Descent Convergence Theory
Key Points
- •Gradient descent updates parameters by stepping opposite the gradient: x_{t+1} = x_t - \eta \nabla f(x_t).
- •If f is L-smooth, a safe fixed step size is \eta = 1/L, which guarantees descent at every step (see the descent lemma below).
- •For \mu-strongly convex and L-smooth functions, gradient descent converges linearly: \|x_t - x^*\| \le (1 - \mu/L)^t \|x_0 - x^*\| with \eta = 1/L.
- •For merely convex and L-smooth functions, the objective gap shrinks sublinearly: f(x_t) - f(x^*) \le O(1/t).
- •The condition number \kappa = L/\mu determines how many iterations are needed; larger \kappa slows convergence.
- •In non-convex settings, gradient descent converges to stationary points, with the average gradient norm squared decaying like O(1/t).
- •Stochastic gradient descent (SGD) with decreasing step sizes achieves \mathbb{E}[f(\bar{x}_t)] - f(x^*) \le O(1/\sqrt{t}) for convex objectives.
- •Choosing learning rates well is crucial in ML: too large causes divergence, too small makes training sluggish; the short sketch below demonstrates both regimes.
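A minimal sketch of the divergence threshold on the simplest possible objective, f(x) = (L/2) x^2; the constants and setup here are illustrative, not from the original text. For this f, one GD step multiplies x by (1 - \eta L), so steps with \eta < 2/L contract and steps with \eta > 2/L blow up:

#include <bits/stdc++.h>
using namespace std;

// Sketch: GD on f(x) = (L/2) x^2, where f'(x) = L x, so the iteration is
// x_{t+1} = (1 - eta*L) x_t. The iterates contract iff |1 - eta*L| < 1,
// i.e., 0 < eta < 2/L; beyond 2/L they oscillate with growing magnitude.
int main() {
    const double L = 10.0;
    const vector<double> etas = {1.0 / L, 1.9 / L, 2.1 / L}; // safe, near-critical, divergent
    for (double eta : etas) {
        double x = 1.0;
        for (int t = 0; t < 20; ++t) x -= eta * (L * x); // gradient step
        cout << "eta = " << eta << " -> |x_20| = " << fabs(x) << "\n";
    }
    return 0;
}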
Prerequisites
- →Differential Calculus — Understanding gradients, directional derivatives, and Taylor expansions is essential for GD updates and smoothness.
- →Linear Algebra — Norms, inner products, eigenvalues, and positive definiteness underlie L, μ, and condition numbers.
- →Convexity and Strong Convexity — Convergence rates depend on convexity properties; knowing these definitions is critical.
- →Probability and Expectation — SGD analysis uses unbiasedness and variance of stochastic gradients and expectations over randomness.
- →Introduction to Optimization — Line search, stopping criteria, and basic iterative methods provide context for GD behavior.
Detailed Explanation
01 Overview
Hook: Imagine rolling a ball down a hilly landscape to reach the lowest valley. If the hills are smooth and bowl-shaped, the ball quickly settles at the bottom. If the surface is rough or flat in places, progress can be slow or the ball can get stuck at flat spots.
Concept: Gradient descent (GD) is the mathematical version of this process. At each step, you look at the slope (the gradient) and take a step downhill proportional to it, x_{t+1} = x_t - \eta \nabla f(x_t). The theory of convergence explains when and how fast this process approaches the minimum. Key properties of the function—smoothness and convexity—govern the guarantees. L-smoothness controls how quickly the slope can change, and \mu-strong convexity ensures the function is bowl-shaped with a unique minimum.
Example: For a quadratic function f(x) = \tfrac{1}{2} x^\top A x - b^\top x with A positive definite, GD with a properly chosen step size will shrink the distance to the solution by a constant factor each step, leading to fast, predictable convergence. In contrast, for a non-convex function like x^4 - x^2, GD still finds points where the slope is nearly zero, but we cannot guarantee it reaches the global minimum.
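To see why the quadratic example contracts by a constant factor, here is a short worked derivation (a standard computation, stated in this page's notation): since \nabla f(x) = Ax - b and x^* = A^{-1} b,

x_{t+1} - x^* = x_t - \eta (A x_t - b) - x^* = (I - \eta A)(x_t - x^*),

so \|x_{t+1} - x^*\| \le \max_i |1 - \eta \lambda_i| \, \|x_t - x^*\|, where \lambda_i are the eigenvalues of A. With \eta = 1/L, L = \max_i \lambda_i, and \mu = \min_i \lambda_i, the contraction factor is exactly 1 - \mu/L, which is the linear rate quoted in the Key Formulas below.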
02 Intuition & Analogies
Hook: Suppose you're hiking with foggy glasses. You can't see the whole mountain, but you can feel the slope beneath your feet. You take a step downhill—the steeper the slope, the longer the step—but not too long to avoid tripping.
Concept: The gradient is your sense of steepness and direction; the learning rate (step size) is how long your stride is. If the terrain is consistently bowl-shaped (strongly convex), every step gets you closer to the unique bottom, and with a good stride length you progress at a steady, geometric pace. If the terrain is just generally sloping down (convex but not strongly), you still move toward lower altitudes, but the landscape might flatten, making your progress taper like 1/t. If the terrain is bumpy (non-convex), you can still move to flatter regions where the slope is small, even if you can’t be sure it’s the absolute lowest valley.
Example: Think of a perfectly round bowl (strongly convex): roll a marble and watch it spiral quickly to the bottom—each lap is shorter than the last by a constant factor. Now think of a wide, shallow plate (convex but not strongly): the marble moves inward but slows significantly near the center. On a wavy plate (non-convex), the marble may settle in a local dip; however, if you keep nudging it carefully (small steps), you can at least get to a place where it barely moves—the slope there is essentially zero.
03 Formal Definition
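Gradient descent starts from an initial point x_0 \in \mathbb{R}^d and iterates

x_{t+1} = x_t - \eta \nabla f(x_t), \quad t = 0, 1, 2, \ldots

where \eta > 0 is the step size (learning rate). The guarantees on this page use two standard regularity conditions. A differentiable f is L-smooth if its gradient is L-Lipschitz:

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y,

and \mu-strongly convex (\mu > 0) if

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2 \quad \forall x, y;

plain convexity is the \mu = 0 case. For convex f, a point x^* is a global minimizer exactly when \nabla f(x^*) = 0, and strong convexity makes x^* unique. In the non-convex case, the achievable target is a stationary point, i.e., a point where \|\nabla f(x)\| \approx 0.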
04 When to Use
Use gradient descent when: (1) You can compute exact or approximate gradients cheaply relative to function evaluation, such as in linear models, logistic regression, and neural networks. (2) The function is L-smooth (most differentiable ML losses are) and possibly convex, where theory guarantees progress and rates. (3) You have large-scale data where second-order methods are too costly; GD/SGD scale linearly with data and parameters. (4) You’re training neural networks; while non-convex, GD variants still reliably reach stationary points and good solutions in practice. Prefer fixed step sizes (\eta = 1/L or tuned constants) when L (or a bound) is known and the surface is well-behaved; prefer backtracking or line search when L is unknown or varies. In ill-conditioned problems (large \kappa), consider preconditioning or momentum/accelerated methods. Use SGD or mini-batch GD when full gradients are too expensive; apply diminishing or adaptive learning rates to manage noise. For strongly convex problems and high accuracy needs, GD with \eta = 1/L (or 2/(L+\mu) when known) quickly achieves small errors; for merely convex objectives, expect slower 1/t decay and consider acceleration.
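As a concrete sketch of the backtracking option mentioned above, here is a standard Armijo backtracking loop; the test function, constants, and starting point are illustrative, not from the original text:

#include <bits/stdc++.h>
using namespace std;

// Sketch: Armijo backtracking line search for GD when L is unknown.
// At each iterate we start from an optimistic step and halve it until the
// sufficient-decrease condition
//   f(x - eta*g) <= f(x) - c * eta * ||g||^2
// holds (guaranteed for eta <= 1/L when c <= 1/2 and f is L-smooth).
int main() {
    auto f = [](double x) { return 0.25 * x * x * x * x - 0.5 * x * x; };
    auto g = [](double x) { return x * x * x - x; };

    double x = 3.0;       // deliberately far out, where local curvature is large
    const double c = 0.5; // sufficient-decrease constant
    for (int t = 0; t < 50; ++t) {
        double grad = g(x), eta = 1.0; // optimistic initial step
        while (f(x - eta * grad) > f(x) - c * eta * grad * grad)
            eta *= 0.5;                // backtrack until the Armijo condition holds
        x -= eta * grad;
    }
    printf("x = %.6f, f'(x) = %.2e\n", x, g(x)); // near a stationary point
    return 0;
}

Because halving stops as soon as the test passes, every accepted step has size at least min(1, 1/(2L)) without ever computing L explicitly, which is why backtracking is the usual fallback when L is unknown or varies across the domain.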
⚠️Common Mistakes
- 1) Learning rate too large: If \eta > 2/L in quadratic-like regions, GD can diverge or oscillate. Fix by estimating L (e.g., via power iteration on the Hessian for quadratics; see the sketch after this list), using \eta \le 1/L, or adopting backtracking line search.
- 2) Confusing convergence in function value with distance: Sublinear rates like O(1/t) on f(x_t) - f(x^*) do not imply the same rate on \|x_t - x^*\| unless strong convexity holds.
- 3) Ignoring conditioning: With \kappa = L/\mu \gg 1, progress is painfully slow; normalize features, add regularization, or precondition to reduce \kappa.
- 4) Mixing deterministic and stochastic guarantees: SGD with constant \eta does not converge to x^* but to a noise ball; use diminishing \eta_t or iterate averaging to get O(1/\sqrt{t}) rates in convex problems.
- 5) Misapplying non-convex results: Guarantees are about stationarity (small \|\nabla f\|), not global optimality; a small gradient does not prove you found the best minimum.
- 6) Poor stopping criteria: Stopping on tiny step size or tiny |f(x_{t+1}) - f(x_t)| can be misleading; monitor gradient norm and, when possible, optimality gaps.
- 7) Not scaling data/parameters: Unscaled features enlarge L and \kappa, degrading step size choices and speed; standardize inputs to improve convergence.
- 8) Using fixed \eta in stochastic settings without tuning: Either decay \eta_t (e.g., \eta_0/\sqrt{t}) or use adaptive methods; otherwise progress can stall.
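A minimal sketch of the power-iteration estimate mentioned in item 1, assuming a quadratic f(x) = 0.5 x^T A x - b^T x so that the Hessian is the constant matrix A (the matrix values are illustrative):

#include <bits/stdc++.h>
using namespace std;

// Sketch: estimate L = lambda_max(A) by power iteration on the Hessian A
// of a quadratic. For non-quadratics the same loop can be run on
// Hessian-vector products evaluated at the current iterate.
int main() {
    // Small symmetric positive definite A (illustrative values)
    vector<vector<double>> A = {{4.0, 1.0, 0.0},
                                {1.0, 3.0, 1.0},
                                {0.0, 1.0, 2.0}};
    int n = (int)A.size();
    vector<double> v(n, 1.0); // arbitrary nonzero start vector
    double lambda = 0.0;
    for (int it = 0; it < 100; ++it) {
        vector<double> Av(n, 0.0); // Av = A * v
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) Av[i] += A[i][j] * v[j];
        double nrm = 0.0;
        for (double y : Av) nrm += y * y;
        nrm = sqrt(nrm);
        lambda = nrm; // with ||v|| = 1, ||A v|| -> lambda_max
        for (int i = 0; i < n; ++i) v[i] = Av[i] / nrm; // re-normalize
    }
    printf("estimated L = %.6f; safe step eta = 1/L = %.6f\n", lambda, 1.0 / lambda);
    return 0;
}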
Key Formulas
Gradient Descent Update
x_{t+1} = x_t - \eta \nabla f(x_t)
Explanation: Each iteration moves opposite the gradient scaled by the learning rate. This decreases the function value when the step is small enough.
L-smoothness (Lipschitz Gradient)
\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y
Explanation: The gradient does not change too abruptly. This regularity enables using fixed step sizes like 1/L with convergence guarantees.
Descent Lemma
f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2
Explanation: An upper bound on f(y) in terms of a quadratic model around x. It ensures GD decreases f if \eta \le 1/L.
Strong Convexity Inequality
f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2
Explanation: A lower bound that enforces bowl-shaped curvature. It implies uniqueness of the minimizer and linear convergence of GD.
Linear Convergence (Strongly Convex)
\|x_t - x^*\| \le (1 - \eta\mu)^t \, \|x_0 - x^*\|, \quad 0 < \eta \le 1/L
Explanation: When \mu > 0 and \eta \le 1/L, the distance to the minimizer shrinks by a constant factor each step. With \eta = 1/L, the factor is 1 - \mu/L.
Sublinear Rate (Convex, L-smooth)
f(x_t) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2t}
Explanation: For convex but not strongly convex functions, GD with \eta = 1/L reduces the objective gap like 1/t. Progress slows near the optimum.
Condition Number
\kappa = L / \mu
Explanation: The ratio of smoothness to strong convexity controls the speed. Iteration complexity scales like O(\kappa \log(1/\epsilon)).
Non-convex Stationarity Rate
\min_{0 \le s < t} \|\nabla f(x_s)\|^2 \le \frac{2L \,(f(x_0) - f^*)}{t}
Explanation: In smooth non-convex problems with \eta = 1/L, the smallest gradient norm squared among the first t iterates decreases like 1/t.
SGD Convex Rate
\mathbb{E}[f(\bar{x}_t)] - f(x^*) \le O(1/\sqrt{t})
Explanation: With unbiased gradients of bounded variance and diminishing steps (e.g., \eta_t \propto 1/\sqrt{t}), averaged SGD attains O(1/\sqrt{t}) expected suboptimality.
Optimal Fixed Step for Quadratics
\eta^* = \frac{2}{L + \mu}, \qquad \|x_{t+1} - x^*\| \le \frac{\kappa - 1}{\kappa + 1} \|x_t - x^*\|
Explanation: For strongly convex quadratics, the best fixed step is 2/(L+\mu), yielding contraction factor (\kappa - 1)/(\kappa + 1), faster than 1 - \mu/L. For example, with \kappa = 10 the factors are 9/11 \approx 0.818 versus 0.9.
Complexity Analysis
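Summarizing the rates above: each GD iteration costs one gradient evaluation, which is O(nd) for a finite sum over n examples in d dimensions (SGD reduces this to O(d) per step by sampling a single example). The iteration counts to reach accuracy \epsilon then follow directly from the Key Formulas:
- •Strongly convex, L-smooth: O(\kappa \log(1/\epsilon)) iterations.
- •Convex, L-smooth: O(L \|x_0 - x^*\|^2 / \epsilon) iterations.
- •Smooth non-convex (smallest squared gradient norm below \epsilon): O(L (f(x_0) - f^*) / \epsilon) iterations.
- •SGD on convex problems with diminishing steps: O(1/\epsilon^2) iterations in expectation.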
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x, with diagonal SPD A
// For diagonal A, L = max diag(A), mu = min diag(A). x* = A^{-1} b.

struct Quad {
    vector<double> diagA; // diagonal entries of A (positive)
    vector<double> b;     // vector b

    // Evaluate f(x)
    double f(const vector<double>& x) const {
        double v = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            v += 0.5 * diagA[i] * x[i] * x[i] - b[i] * x[i];
        }
        return v;
    }

    // Gradient: A x - b (since A is diagonal)
    vector<double> grad(const vector<double>& x) const {
        vector<double> g(x.size());
        for (size_t i = 0; i < x.size(); ++i) g[i] = diagA[i] * x[i] - b[i];
        return g;
    }

    // Closed-form minimizer x* = A^{-1} b (element-wise since A is diagonal)
    vector<double> x_star() const {
        vector<double> xs(diagA.size());
        for (size_t i = 0; i < diagA.size(); ++i) xs[i] = b[i] / diagA[i];
        return xs;
    }
};

// Euclidean norm
double norm2(const vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return sqrt(s);
}

vector<double> sub(const vector<double>& a, const vector<double>& b) {
    vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 5D problem with condition number kappa = L/mu = 10
    vector<double> diagA = {1.0, 2.0, 3.0, 5.0, 10.0}; // mu=1, L=10
    vector<double> b = {1.0, -2.0, 0.5, 3.0, -1.0};
    Quad prob{diagA, b};

    const double L = *max_element(diagA.begin(), diagA.end());
    const double mu = *min_element(diagA.begin(), diagA.end());
    const double eta = 1.0 / L; // theoretical safe step size

    vector<double> x(5, 0.0); // start at zero
    vector<double> xstar = prob.x_star();
    double dist0 = norm2(sub(x, xstar));

    cout << fixed << setprecision(6);
    cout << "# t\t||x_t - x*||\tBound (1 - mu/L)^t * ||x0 - x*||\t f(x_t) - f(x*)\n";

    // Precompute f(x*)
    double fxstar = prob.f(xstar);

    int T = 40;
    for (int t = 0; t <= T; ++t) {
        double bound = pow(1.0 - mu / L, t) * dist0; // linear convergence bound for eta=1/L
        double dist = norm2(sub(x, xstar));
        double gap = prob.f(x) - fxstar;
        cout << t << "\t" << dist << "\t" << bound << "\t" << gap << "\n";

        // Next iterate: x_{t+1} = x_t - eta * grad
        auto g = prob.grad(x);
        for (size_t i = 0; i < x.size(); ++i) x[i] -= eta * g[i];
    }

    return 0;
}
This program minimizes a strongly convex diagonal quadratic. It uses η = 1/L and reports the distance to the optimum and the theoretical linear bound (1 − μ/L)^t ||x0 − x*||. Because the problem is diagonal, L and μ are simply the largest and smallest diagonal entries. The measured distance decays at the same geometric rate as the bound (their ratio levels off at a constant below 1), illustrating linear convergence.
#include <bits/stdc++.h>
using namespace std;

// Non-convex 1D function: f(x) = 0.25 * x^4 - 0.5 * x^2
// Gradient: f'(x) = x^3 - x; L-smooth on bounded sets; globally non-convex.

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto f = [](double x) { return 0.25 * x * x * x * x - 0.5 * x * x; };
    auto g = [](double x) { return x * x * x - x; };

    double x = 0.9;   // start near a non-convex region
    double eta = 0.2; // modest step size to ensure descent

    cout << fixed << setprecision(6);
    cout << "# t\tx_t\tf(x_t)\t|grad|\tmin_{s<=t}|grad|^2 * t\n";

    double minGrad2 = numeric_limits<double>::infinity();
    int T = 200;
    for (int t = 1; t <= T; ++t) {
        double grad = g(x);
        minGrad2 = min(minGrad2, grad * grad);
        // Print a diagnostic capturing the O(1/t) stationarity trend
        cout << t << "\t" << x << "\t" << f(x) << "\t" << fabs(grad) << "\t" << (minGrad2 * t) << "\n";
        // Update
        x = x - eta * grad;
    }

    return 0;
}
We optimize a simple non-convex function with two minima at x = ±1 and a local maximum at x = 0. Although we cannot in general guarantee reaching the global minimum, the smallest gradient norm squared among the first t iterates decreases at least proportionally to 1/t. The printed quantity min_{s≤t} ||∇f(x_s)||^2 · t should therefore remain bounded; in this run it actually decays toward zero, because GD converges linearly to the local minimum at x = 1, which is consistent with (and stronger than) the non-convex O(1/t) stationarity bound.
#include <bits/stdc++.h>
using namespace std;

// 1D stochastic least squares: f(w) = (1/(2N)) sum_i (a_i*w - b_i)^2
// True minimizer: w* = (sum a_i b_i) / (sum a_i^2). We simulate SGD with eta_t = eta0 / sqrt(t).

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);
    std::uniform_real_distribution<double> a_dist(0.5, 2.0);

    const int N = 2000;
    const double w_true = 1.7;

    vector<double> a(N), b(N);
    for (int i = 0; i < N; ++i) {
        a[i] = a_dist(rng);
        b[i] = a[i] * w_true + noise(rng); // y = a w_true + noise
    }

    // Compute closed-form w* and f(w*) to quantify suboptimality
    double Sa2 = 0.0, Sab = 0.0;
    for (int i = 0; i < N; ++i) { Sa2 += a[i]*a[i]; Sab += a[i]*b[i]; }
    double w_star = Sab / Sa2;
    auto f = [&](double w){
        double s = 0.0;
        for (int i = 0; i < N; ++i) {
            double r = a[i]*w - b[i];
            s += 0.5 * r * r;
        }
        return s / N;
    };
    double f_star = f(w_star);

    // SGD with diminishing steps; sample one index per iteration
    std::uniform_int_distribution<int> uni(0, N-1);

    double w = 0.0;    // initialize
    double eta0 = 0.5; // base step size (tuned)

    cout << fixed << setprecision(6);
    cout << "# t\tw_t\tf(w_t)-f*\t sqrt(t)*(f(w_t)-f*)\n";

    int T = 20000;
    for (int t = 1; t <= T; ++t) {
        int i = uni(rng);
        double grad = (a[i]*w - b[i]) * a[i]; // stochastic gradient from one sample
        double eta = eta0 / sqrt((double)t);
        w -= eta * grad;

        if (t % 1000 == 0) {
            double gap = max(0.0, f(w) - f_star);
            cout << t << "\t" << w << "\t" << gap << "\t" << (sqrt((double)t) * gap) << "\n";
        }
    }

    return 0;
}
We run SGD on a 1D least-squares problem where w* and f(w*) are known in closed form. With η_t = η0/√t, convex SGD theory predicts an O(1/√t) expected gap; strictly speaking, the classical guarantee is for the averaged iterate, but on this well-conditioned problem the last iterate tracked here behaves comparably. The diagnostic √t·(f(w_t)−f*) should remain bounded and roughly stabilize (up to sampling noise), illustrating the sublinear rate.