📚 Theory · Intermediate

Gradient Descent Convergence Theory

Key Points

  • Gradient descent updates parameters by stepping opposite the gradient: x_{t+1} = x_t - \eta \nabla f(x_t).
  • If f is L-smooth, a safe fixed step size is \eta = 1/L, which guarantees monotone descent and convergence for convex functions.
  • For \mu-strongly convex and L-smooth functions, gradient descent converges linearly: \|x_t - x^*\| \le (1 - \mu/L)^t \|x_0 - x^*\|.
  • For merely convex and L-smooth functions, the objective gap shrinks sublinearly: f(x_t) - f(x^*) = O(1/t).
  • The condition number \kappa = L/\mu determines how many iterations are needed; larger \kappa slows convergence.
  • In non-convex settings, gradient descent converges to stationary points, with the average squared gradient norm decaying like O(1/t).
  • Stochastic gradient descent (SGD) with decreasing step sizes achieves E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t}) for convex objectives.
  • Choosing learning rates well is crucial in ML: too large causes divergence, too small makes training sluggish.

Prerequisites

  • Differential Calculus: Understanding gradients, directional derivatives, and Taylor expansions is essential for GD updates and smoothness.
  • Linear Algebra: Norms, inner products, eigenvalues, and positive definiteness underlie L, μ, and condition numbers.
  • Convexity and Strong Convexity: Convergence rates depend on convexity properties; knowing these definitions is critical.
  • Probability and Expectation: SGD analysis uses unbiasedness and variance of stochastic gradients and expectations over randomness.
  • Introduction to Optimization: Line search, stopping criteria, and basic iterative methods provide context for GD behavior.

Detailed Explanation


01 Overview

Hook: Imagine rolling a ball down a hilly landscape to reach the lowest valley. If the hills are smooth and bowl-shaped, the ball quickly settles at the bottom. If the surface is rough or flat in places, progress can be slow or the ball can get stuck at flat spots. Concept: Gradient descent (GD) is the mathematical version of this process. At each step, you look at the slope (the gradient) and take a step downhill proportional to it, x_{t+1} = x_t - \eta \nabla f(x_t). The theory of convergence explains when and how fast this process approaches the minimum. Key properties of the function—smoothness and convexity—govern the guarantees. L-smoothness controls how quickly the slope can change, and \mu-strong convexity ensures the function is bowl-shaped with a unique minimum. Example: For a quadratic function f(x) = \tfrac{1}{2} x^\top A x - b^\top x with A positive definite, GD with a properly chosen step size will shrink the distance to the solution by a constant factor each step, leading to fast, predictable convergence. In contrast, for a non-convex function like x^4 - x^2, GD still finds points where the slope is nearly zero, but we cannot guarantee it reaches the global minimum.

02 Intuition & Analogies

Hook: Suppose you're hiking with foggy glasses. You can't see the whole mountain, but you can feel the slope beneath your feet. You take a step downhill—the steeper the slope, the longer the step—but not too long to avoid tripping. Concept: The gradient is your sense of steepness and direction; the learning rate (step size) is how long your stride is. If the terrain is consistently bowl-shaped (strongly convex), every step gets you closer to the unique bottom, and with a good stride length you progress at a steady, geometric pace. If the terrain is just generally sloping down (convex but not strongly), you still move toward lower altitudes, but the landscape might flatten, making your progress taper like 1/t. If the terrain is bumpy (non-convex), you can still move to flatter regions where the slope is small, even if you can’t be sure it’s the absolute lowest valley. Example: Think of a perfectly round bowl (strongly convex): roll a marble and watch it spiral quickly to the bottom—each lap is shorter than the last by a constant factor. Now think of a wide, shallow plate (convex but not strongly): the marble moves inward but slows significantly near the center. On a wavy plate (non-convex), the marble may settle in a local dip; however, if you keep nudging it carefully (small steps), you can at least get to a place where it barely moves—the slope there is essentially zero.

03 Formal Definition

Gradient descent iterates x_{t+1} = x_t - \eta \nabla f(x_t) with step size \eta > 0 to minimize a differentiable function f: \mathbb{R}^d \to \mathbb{R}. A function is L-smooth if its gradient is Lipschitz continuous: \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| for all x, y, equivalently f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2. It is \mu-strongly convex if f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2 for all x, y, which implies a unique minimizer x^*. Under L-smoothness and convexity, GD with \eta = 1/L satisfies f(x_t) - f(x^*) \le \tfrac{L\|x_0 - x^*\|^2}{2t}. Under L-smoothness and \mu-strong convexity, GD with 0 < \eta \le 1/L yields linear convergence in distance: \|x_t - x^*\| \le (1 - \eta\mu)^t \|x_0 - x^*\|; with \eta = 1/L this becomes (1 - \mu/L)^t. The condition number \kappa = L/\mu controls the rate: larger \kappa slows convergence. For non-convex L-smooth objectives, GD with \eta \le 1/L yields a guarantee on stationarity: \min_{s \le t} \|\nabla f(x_s)\|^2 \le \tfrac{2L(f(x_0) - f^*)}{t}. In stochastic settings with unbiased gradient estimates of bounded variance, SGD with diminishing step sizes (e.g., \eta_t \propto 1/\sqrt{t}) achieves E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t}) in the convex case.
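
To see where the safe step \eta = 1/L comes from, substitute y = x_{t+1} = x_t - \eta \nabla f(x_t) into the L-smoothness upper bound (a standard derivation, sketched here for reference):

f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \tfrac{L}{2}\|x_{t+1} - x_t\|^2 = f(x_t) - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\|\nabla f(x_t)\|^2.

With \eta = 1/L each step decreases f by at least \tfrac{1}{2L}\|\nabla f(x_t)\|^2; summing this guaranteed decrease over t iterations is what produces both the convex O(1/t) rate and the non-convex stationarity bound above.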

04 When to Use

Use gradient descent when: (1) You can compute exact or approximate gradients cheaply relative to function evaluation, such as in linear models, logistic regression, and neural networks. (2) The function is L-smooth (most differentiable ML losses are) and possibly convex, where theory guarantees progress and rates. (3) You have large-scale data where second-order methods are too costly; GD/SGD scale linearly with data and parameters. (4) You’re training neural networks; while non-convex, GD variants still reliably reach stationary points and good solutions in practice. Prefer fixed step sizes (\eta = 1/L or tuned constants) when L (or a bound) is known and the surface is well-behaved; prefer backtracking or line search when L is unknown or varies. In ill-conditioned problems (large \kappa), consider preconditioning or momentum/accelerated methods. Use SGD or mini-batch GD when full gradients are too expensive; apply diminishing or adaptive learning rates to manage noise. For strongly convex problems and high accuracy needs, GD with \eta = 1/L (or 2/(L+\mu) when known) quickly achieves small errors; for merely convex objectives, expect slower 1/t decay and consider acceleration.
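
When L is unknown, the backtracking line search mentioned above can pick the step size automatically by shrinking a trial step until a sufficient-decrease (Armijo) condition holds. Below is a minimal C++ sketch under the assumption that the objective is available as a callable; the names backtracking_step, f, and g are illustrative and not part of the worked examples later in this section.

#include <bits/stdc++.h>
using namespace std;

// Backtracking (Armijo) line search for one gradient descent step.
// Shrinks a trial step size until f(x - eta*g) <= f(x) - c*eta*||g||^2.
double backtracking_step(const function<double(const vector<double>&)>& f,
                         const vector<double>& x,
                         const vector<double>& g,   // gradient at x
                         double eta0 = 1.0,         // initial trial step
                         double beta = 0.5,         // shrink factor
                         double c = 1e-4) {         // sufficient-decrease constant
    double fx = f(x);
    double g2 = 0.0;
    for (double gi : g) g2 += gi * gi;              // ||g||^2
    double eta = eta0;
    while (true) {
        vector<double> y(x.size());
        for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] - eta * g[i];
        // Accept if the Armijo condition holds (or the step became negligibly small).
        if (f(y) <= fx - c * eta * g2 || eta < 1e-12) return eta;
        eta *= beta;                                // shrink and retry
    }
}

int main() {
    // Tiny usage example on f(x) = 0.5 * (10 x0^2 + x1^2), where L = 10 is never told to the search.
    auto f = [](const vector<double>& x) { return 0.5 * (10.0 * x[0] * x[0] + x[1] * x[1]); };
    vector<double> x = {1.0, 1.0};
    for (int t = 0; t < 20; ++t) {
        vector<double> g = {10.0 * x[0], x[1]};     // gradient of the quadratic
        double eta = backtracking_step(f, x, g);
        for (size_t i = 0; i < x.size(); ++i) x[i] -= eta * g[i];
    }
    cout << "final point: " << x[0] << " " << x[1] << "\n"; // should be near the origin
    return 0;
}

Each call may evaluate f several times, so backtracking trades extra function evaluations for not needing L; in smooth regions the accepted step typically lands within a constant factor of 1/L.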

⚠️ Common Mistakes

  1. Learning rate too large: If \eta > 2/L in quadratic-like regions, GD can diverge or oscillate. Fix by estimating L (e.g., via power iteration on the Hessian for quadratics; see the sketch after this list), using \eta \le 1/L, or adopting backtracking line search.
  2. Confusing convergence in function value with distance: Sublinear rates like O(1/t) on f(x_t) - f(x^*) do not imply the same rate on \|x_t - x^*\| unless strong convexity holds.
  3. Ignoring conditioning: With \kappa = L/\mu \gg 1, progress is painfully slow; normalize features, add regularization, or precondition to reduce \kappa.
  4. Mixing deterministic and stochastic guarantees: SGD with constant \eta does not converge to x^* but to a noise ball; use diminishing \eta_t or iterate averaging to get O(1/\sqrt{t}) rates in convex problems.
  5. Misapplying non-convex results: Guarantees are about stationarity (small \|\nabla f\|), not global optimality; a small gradient does not prove you found the best minimum.
  6. Poor stopping criteria: Stopping on a tiny step size or a tiny |f(x_{t+1}) - f(x_t)| can be misleading; monitor the gradient norm and, when possible, optimality gaps.
  7. Not scaling data/parameters: Unscaled features enlarge L and \kappa, degrading step-size choices and speed; standardize inputs to improve convergence.
  8. Using a fixed \eta in stochastic settings without tuning: Either decay \eta_t (e.g., \eta_0/\sqrt{t}) or use adaptive methods; otherwise progress can stall.
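
For mistake 1, when the objective is a quadratic f(x) = \tfrac{1}{2} x^\top A x - b^\top x, the smoothness constant L equals the largest eigenvalue of the (constant) Hessian A, which power iteration can estimate. A minimal sketch with a small symmetric positive definite matrix whose values are chosen only for illustration:

#include <bits/stdc++.h>
using namespace std;

// Power iteration to estimate L = lambda_max(A) for a symmetric positive definite A.
// For f(x) = 0.5 x^T A x - b^T x, this L is the gradient-Lipschitz constant, so eta = 1/L is safe.
int main() {
    vector<vector<double>> A = {{4.0, 1.0, 0.0},
                                {1.0, 3.0, 1.0},
                                {0.0, 1.0, 2.0}};   // illustrative SPD matrix
    int d = A.size();
    vector<double> v(d, 1.0);                        // arbitrary nonzero start vector
    double lambda = 0.0;
    for (int it = 0; it < 200; ++it) {
        vector<double> Av(d, 0.0);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j) Av[i] += A[i][j] * v[j];
        double nrm = 0.0;
        for (double x : Av) nrm += x * x;
        nrm = sqrt(nrm);
        for (int i = 0; i < d; ++i) v[i] = Av[i] / nrm; // normalize
        lambda = nrm;                                   // converges to lambda_max for unit v
    }
    cout << "estimated L = " << lambda << ", safe step eta = " << 1.0 / lambda << "\n";
    return 0;
}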

Key Formulas

Gradient Descent Update

x_{t+1} = x_t - \eta \nabla f(x_t)

Explanation: Each iteration moves opposite the gradient, scaled by the learning rate. This decreases the function value when the step is small enough.

L-smoothness (Lipschitz Gradient)

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|

Explanation: The gradient does not change too abruptly. This regularity enables using fixed step sizes like 1/L with convergence guarantees.

Descent Lemma

f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2

Explanation: An upper bound on f(y) in terms of a quadratic model around x. It ensures GD decreases f if \eta \le 1/L.

Strong Convexity Inequality

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2

Explanation: A lower bound that enforces bowl-shaped curvature. It implies uniqueness of the minimizer and linear convergence of GD.

Linear Convergence (Strongly Convex)

\|x_t - x^*\| \le (1 - \eta\mu)^t \|x_0 - x^*\|

Explanation: When \mu > 0 and \eta \le 1/L, the distance to the minimizer shrinks by a constant factor each step. With \eta = 1/L, the factor is 1 - \mu/L.

Sublinear Rate (Convex, L-smooth)

f(x_t) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2t}

Explanation: For convex but not strongly convex functions, GD with \eta = 1/L reduces the objective gap like 1/t. Progress slows near the optimum.

Condition Number

\kappa = \frac{L}{\mu}

Explanation: The ratio of smoothness to strong convexity controls the speed. Iteration complexity scales like O(\kappa \log(1/\epsilon)).

Non-convex Stationarity Rate

\min_{s \le t} \|\nabla f(x_s)\|^2 \le \frac{2L\,(f(x_0) - f^*)}{t}

Explanation: In smooth non-convex problems with \eta \le 1/L, the smallest gradient norm squared among the first t iterates decreases like 1/t.

SGD Convex Rate

E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t})

Explanation: With unbiased gradients of bounded variance and diminishing steps (e.g., \eta_t \propto 1/\sqrt{t}), averaged SGD attains O(1/\sqrt{t}) expected suboptimality.

Optimal Fixed Step for Quadratics

\eta^* = \frac{2}{L + \mu}

Explanation: For strongly convex quadratics, the best fixed step is 2/(L+\mu), yielding contraction factor (\kappa - 1)/(\kappa + 1), faster than 1 - \mu/L.
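
A quick numerical check of this formula: the sketch below runs GD on a 2D diagonal quadratic with \mu = 1 and L = 10 (values chosen only for illustration) and compares the fixed steps \eta = 1/L and \eta = 2/(L+\mu).

#include <bits/stdc++.h>
using namespace std;

// Compare fixed steps eta = 1/L and eta = 2/(L+mu) on f(x) = 0.5*(mu*x0^2 + L*x1^2),
// whose minimizer is the origin. Worst-case per-step contraction is max(|1 - eta*mu|, |1 - eta*L|).
int main() {
    const double mu = 1.0, L = 10.0;
    auto run = [&](double eta) {
        array<double, 2> x = {1.0, 1.0};
        for (int t = 0; t < 50; ++t) {
            x[0] -= eta * mu * x[0];            // gradient of 0.5*mu*x0^2
            x[1] -= eta * L * x[1];             // gradient of 0.5*L*x1^2
        }
        return sqrt(x[0] * x[0] + x[1] * x[1]); // distance to the optimum after 50 steps
    };
    cout << fixed << setprecision(10);
    cout << "eta = 1/L      : ||x_50 - x*|| = " << run(1.0 / L) << "\n";
    cout << "eta = 2/(L+mu) : ||x_50 - x*|| = " << run(2.0 / (L + mu)) << "\n";
    // Expected per-step factors: 1 - mu/L = 0.9 versus (kappa-1)/(kappa+1) = 9/11 ~ 0.818.
    return 0;
}

After 50 steps the 2/(L+\mu) run ends up orders of magnitude closer to the optimum, matching the per-step factors 1 - \mu/L = 0.9 versus (\kappa - 1)/(\kappa + 1) \approx 0.818.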

Complexity Analysis

In gradient descent, each iteration computes the gradient and performs an O(d) vector update, where d is the parameter dimension. For generic objectives where evaluating the gradient costs G(d) operations, the per-iteration complexity is O(G(d)). In many ML settings with dense data of size n-by-d, a full gradient over n samples costs O(nd); thus batch GD requires O(nd) per iteration and O(nd · T) overall for T iterations. Memory usage is typically O(d), storing the current iterate, gradient, and a few scratch vectors. In strongly convex, L-smooth problems, the number of iterations to reach accuracy \epsilon in distance is T = O(\kappa \log(1/\epsilon)), where \kappa = L/\mu. Accordingly, the total arithmetic work becomes O(G(d) · \kappa \log(1/\epsilon)). When n is large, computing full gradients is expensive; stochastic gradient descent reduces per-iteration cost to O(bd) for mini-batch size b (often b ≪ n). However, more iterations are needed: for convex problems with diminishing step sizes, expected suboptimality decays as O(1/\sqrt{t}), so achieving accuracy \epsilon in function value requires T = O(1/\epsilon^2) stochastic steps. The total work is then O(bd/\epsilon^2). In strongly convex settings, with carefully tuned diminishing steps (e.g., \eta_t = c/(\mu(t+1))), SGD can achieve O(1/t) rates in expectation, improving the total work to O(bd/\epsilon). Space complexity remains O(d) for both GD and SGD, plus O(bd) transiently to load a mini-batch. Ill-conditioning increases \kappa, slowing deterministic GD; preconditioning (feature scaling) effectively reduces \kappa and thus the iteration count without changing per-iteration cost.
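
To make the O(bd) per-iteration cost concrete, one mini-batch SGD step for least squares can be written as two nested loops over the b sampled rows and the d coordinates. A minimal sketch; the function and variable names are illustrative and not taken from the worked examples below.

#include <bits/stdc++.h>
using namespace std;

// One mini-batch SGD step for least squares f(w) = (1/(2N)) * sum_i (a_i . w - b_i)^2.
// The loops touch b rows of dimension d, so the step costs O(b*d) time and O(d) extra space.
void minibatch_sgd_step(const vector<vector<double>>& A,   // N x d data matrix
                        const vector<double>& y,           // N targets
                        vector<double>& w,                 // parameters (updated in place)
                        const vector<int>& batch,          // b sampled row indices
                        double eta) {                      // step size
    size_t d = w.size();
    vector<double> grad(d, 0.0);
    for (int i : batch) {
        double r = -y[i];
        for (size_t j = 0; j < d; ++j) r += A[i][j] * w[j];   // residual a_i . w - b_i
        for (size_t j = 0; j < d; ++j) grad[j] += r * A[i][j];
    }
    for (size_t j = 0; j < d; ++j) w[j] -= eta * grad[j] / batch.size();
}

int main() {
    // Tiny usage example: 4 samples, d = 2, one step with mini-batch size b = 2.
    vector<vector<double>> A = {{1, 0}, {0, 1}, {1, 1}, {2, 1}};
    vector<double> y = {1, 2, 3, 5};
    vector<double> w = {0.0, 0.0};
    vector<int> batch = {0, 2};
    minibatch_sgd_step(A, y, w, batch, 0.1);
    cout << "w after one step: " << w[0] << " " << w[1] << "\n";
    return 0;
}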

Code Examples

Linear convergence on a strongly convex quadratic with η = 1/L
#include <bits/stdc++.h>
using namespace std;

// Strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x, with diagonal SPD A
// For diagonal A, L = max diag(A), mu = min diag(A). x* = A^{-1} b.

struct Quad {
    vector<double> diagA; // diagonal entries of A (positive)
    vector<double> b;     // vector b

    // Evaluate f(x)
    double f(const vector<double>& x) const {
        double v = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            v += 0.5 * diagA[i] * x[i] * x[i] - b[i] * x[i];
        }
        return v;
    }

    // Gradient: A x - b (since A is diagonal)
    vector<double> grad(const vector<double>& x) const {
        vector<double> g(x.size());
        for (size_t i = 0; i < x.size(); ++i) g[i] = diagA[i] * x[i] - b[i];
        return g;
    }

    // Closed-form minimizer x* = A^{-1} b (element-wise since A is diagonal)
    vector<double> x_star() const {
        vector<double> xs(diagA.size());
        for (size_t i = 0; i < diagA.size(); ++i) xs[i] = b[i] / diagA[i];
        return xs;
    }
};

// Euclidean norm
double norm2(const vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return sqrt(s);
}

vector<double> sub(const vector<double>& a, const vector<double>& b) {
    vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 5D problem with condition number kappa = L/mu = 10
    vector<double> diagA = {1.0, 2.0, 3.0, 5.0, 10.0}; // mu=1, L=10
    vector<double> b = {1.0, -2.0, 0.5, 3.0, -1.0};
    Quad prob{diagA, b};

    const double L = *max_element(diagA.begin(), diagA.end());
    const double mu = *min_element(diagA.begin(), diagA.end());
    const double eta = 1.0 / L; // theoretical safe step size

    vector<double> x(5, 0.0); // start at zero
    vector<double> xstar = prob.x_star();
    double dist0 = norm2(sub(x, xstar));

    cout << fixed << setprecision(6);
    cout << "# t\t||x_t - x*||\tBound (1 - mu/L)^t * ||x0 - x*||\t f(x_t) - f(x*)\n";

    // Precompute f(x*)
    double fxstar = prob.f(xstar);

    int T = 40;
    for (int t = 0; t <= T; ++t) {
        double bound = pow(1.0 - mu / L, t) * dist0; // linear convergence bound for eta=1/L
        double dist = norm2(sub(x, xstar));
        double gap = prob.f(x) - fxstar;
        cout << t << "\t" << dist << "\t" << bound << "\t" << gap << "\n";

        // Next iterate: x_{t+1} = x_t - eta * grad
        auto g = prob.grad(x);
        for (size_t i = 0; i < x.size(); ++i) x[i] -= eta * g[i];
    }

    return 0;
}

This program minimizes a strongly convex diagonal quadratic. It uses η = 1/L and reports the distance to the optimum and the theoretical linear bound (1 − μ/L)^t ||x0 − x*||. Because the problem is diagonal, L and μ are simply the largest and smallest diagonal entries. You can observe near-perfect agreement with the bound, illustrating linear convergence.

Time: O(d · T), since each iteration computes a gradient and updates in O(d).
Space: O(d), storing x, gradient, and constants.
Non-convex example: stationarity improves at O(1/t)
#include <bits/stdc++.h>
using namespace std;

// Non-convex 1D function: f(x) = 0.25 * x^4 - 0.5 * x^2
// Gradient: f'(x) = x^3 - x; L-smooth on bounded sets; globally non-convex.

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto f = [](double x) { return 0.25 * x * x * x * x - 0.5 * x * x; };
    auto g = [](double x) { return x * x * x - x; };

    double x = 0.9;   // start near a non-convex region
    double eta = 0.2; // modest step size to ensure descent

    cout << fixed << setprecision(6);
    cout << "# t\tx_t\tf(x_t)\t|grad|\tmin_{s<=t}|grad|^2 * t\n";

    double minGrad2 = numeric_limits<double>::infinity();
    int T = 200;
    for (int t = 1; t <= T; ++t) {
        double grad = g(x);
        minGrad2 = min(minGrad2, grad * grad);
        // Print a diagnostic capturing the O(1/t) stationarity trend
        cout << t << "\t" << x << "\t" << f(x) << "\t" << fabs(grad) << "\t" << (minGrad2 * t) << "\n";
        // Update
        x = x - eta * grad;
    }

    return 0;
}

We optimize a simple non-convex function with two minima at x = ±1 and a stationary point (local maximum) at x = 0. Although we cannot guarantee reaching the global minimum in general, the minimum gradient norm squared among the first t iterates is bounded by O(1/t). The printed quantity min_{s≤t} ||∇f(x_s)||^2 · t should remain bounded (here it actually decays, since GD quickly settles into the basin around x = 1), illustrating the non-convex O(1/t) stationarity guarantee.

Time: O(T), constant-time gradient and update per step.
Space: O(1), only scalars are stored.
SGD in 1D least squares with diminishing steps (O(1/√t) expected rate)
#include <bits/stdc++.h>
using namespace std;

// 1D stochastic least squares: f(w) = (1/(2N)) sum_i (a_i*w - b_i)^2
// True minimizer: w* = (sum a_i b_i) / (sum a_i^2). We simulate SGD with eta_t = eta0 / sqrt(t).

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);
    std::uniform_real_distribution<double> a_dist(0.5, 2.0);

    const int N = 2000;
    const double w_true = 1.7;

    vector<double> a(N), b(N);
    for (int i = 0; i < N; ++i) {
        a[i] = a_dist(rng);
        b[i] = a[i] * w_true + noise(rng); // y = a * w_true + noise
    }

    // Compute closed-form w* and f(w*) to quantify suboptimality
    double Sa2 = 0.0, Sab = 0.0;
    for (int i = 0; i < N; ++i) { Sa2 += a[i]*a[i]; Sab += a[i]*b[i]; }
    double w_star = Sab / Sa2;
    auto f = [&](double w){
        double s = 0.0;
        for (int i = 0; i < N; ++i) {
            double r = a[i]*w - b[i];
            s += 0.5 * r * r;
        }
        return s / N;
    };
    double f_star = f(w_star);

    // SGD with diminishing steps; sample one index per iteration
    std::uniform_int_distribution<int> uni(0, N-1);

    double w = 0.0;    // initialize
    double eta0 = 0.5; // base step size (tuned)

    cout << fixed << setprecision(6);
    cout << "# t\tw_t\tf(w_t)-f*\t sqrt(t)*(f(w_t)-f*)\n";

    int T = 20000;
    for (int t = 1; t <= T; ++t) {
        int i = uni(rng);
        double grad = (a[i]*w - b[i]) * a[i]; // stochastic gradient from one sample
        double eta = eta0 / sqrt((double)t);
        w -= eta * grad;

        if (t % 1000 == 0) {
            double gap = max(0.0, f(w) - f_star);
            cout << t << "\t" << w << "\t" << gap << "\t" << (sqrt((double)t) * gap) << "\n";
        }
    }

    return 0;
}

We run SGD on a 1D least-squares problem where w* and f(w*) are known in closed form. With η_t = η0/√t, convex SGD theory predicts an O(1/√t) expected gap. The diagnostic √t·(f(w_t)−f*) should trend toward a constant (up to noise), illustrating the sublinear rate.

Time: O(T), one data point processed per iteration; occasional O(N) passes only for evaluation.
Space: O(N) to store the dataset and O(1) for parameters; if evaluations are skipped or streamed, it can be O(1).