βš™οΈAlgorithmIntermediate

Momentum Methods

Key Points

  • β€’
    Momentum methods add an exponentially weighted memory of past gradients to make descent steps smoother and faster, especially in ravines and ill-conditioned problems.
  • β€’
    Classical (Polyak/Heavy-ball) momentum keeps a velocity vt​ = Ξ² vtβˆ’1​ + βˆ‡ L(ΞΈt​) and updates parameters by ΞΈt+1​ = ΞΈt​ - Ξ· vt​.
  • β€’
    Nesterov Accelerated Gradient (NAG) computes the gradient at a look-ahead point, often giving better theoretical and practical convergence on smooth convex problems.
  • β€’
    The momentum coefficient Ξ² ∈ [0,1) controls how much past gradient information persists; typical values are 0.8–0.99.
  • β€’
    Per-iteration cost stays O(d) where d is the number of parameters, with only an extra velocity vector of size d in memory.
  • β€’
    Momentum reduces oscillations across steep directions and increases effective step length along shallow directions.
  • β€’
    Choosing learning rate Ξ· too large with high Ξ² can cause divergence; stable choices follow known bounds for smooth strongly convex losses.
  • β€’
    Momentum is widely used in deep learning and classic optimization to speed up SGD and mitigate noisy gradients.

Prerequisites

  • β†’Gradient Descent β€” Momentum builds directly on gradient descent updates by modifying how gradients accumulate.
  • β†’Vector and Matrix Calculus β€” Computing gradients and understanding updates in high dimensions require basic calculus on vectors and matrices.
  • β†’Convexity and Strong Convexity β€” Convergence guarantees and parameter choices for momentum depend on these properties.
  • β†’Lipschitz Smoothness β€” Smoothness constants bound stable learning rates and appear in optimal momentum settings.
  • β†’Linear Algebra (Eigenvalues, Condition Number) β€” Ill-conditioning explains why momentum helps and provides intuition for acceleration.
  • β†’Stochastic Optimization Basics β€” Momentum is commonly used with mini-batch SGD, which introduces gradient noise.
  • β†’Numerical Stability and Scaling β€” Feature scaling and step-size selection impact stability of momentum methods.
  β€’ β†’Regularization (Weight Decay) β€” Combining momentum with regularization correctly matters for implementing training properly.

Detailed Explanation


01 Overview

Momentum methods are optimization techniques that augment gradient descent with a memory of past updates. Instead of moving only in the direction of the current gradient, they maintain a running "velocity" vector that blends recent gradients using exponential smoothing. This idea, introduced in the heavy-ball method by Polyak, helps reduce zig-zagging in narrow valleys of the loss landscape and accelerates motion along gentle slopes. In practice, momentum improves both deterministic gradient descent and stochastic gradient descent (SGD), which samples gradients on mini-batches of data. Two popular variants are classical (Polyak) momentum and Nesterov Accelerated Gradient (NAG). Classical momentum forms a velocity as a moving average of gradients and then steps. NAG takes a look-ahead step using the previous velocity before computing the gradient, often providing tighter theoretical guarantees for smooth convex losses. These methods incur minimal overhead: they add one extra vector the size of the parameters and a couple of vector operations per iteration, so they are easy to implement and scale. Momentum is now a standard component of modern deep learning optimizers and a useful tool for speeding up convergence in a wide range of optimization problems.

02 Intuition & Analogies

Imagine pushing a heavy shopping cart down a bumpy aisle. If you push a little every second, the cart does not instantly stop when you stop pushing; it keeps rolling due to momentum. Similarly, in optimization, each gradient "push" does not immediately reset direction; previous pushes accumulate, giving inertia to motion. This inertia helps on terrains with ridges: if the aisle slopes slightly forward (a shallow direction) but has rumble strips across it (steep directions), naive steps will waste effort bouncing side-to-side. Momentum damps these sideways oscillations while letting the cart build speed forward, reaching the destination faster. A second analogy is smoothing noisy signals. If you measure the wind direction every second, readings fluctuate. Taking an exponentially weighted moving average dampens noise but still follows the trend. Momentum does the same to gradients: it smooths noisy mini-batch directions so updates are more stable and decisive. Nesterov momentum adds a predictive twist: before measuring the next gradient (checking the wind), you move a little in the direction you expect to go (a look-ahead with your current speed), then adjust based on what you find there. This predictive correction often avoids over-shooting and gives a more accurate course. In both analogies, a friction parameter (the momentum coefficient Ξ²) controls how much past motion matters: high Ξ² means strong memory (more inertia), low Ξ² means updates respond quickly to new information but may wobble more.

03 Formal Definition

Let L(ΞΈ) be a differentiable loss with parameters ΞΈ ∈ ℝ^d. At iteration t, define the gradient g_t = βˆ‡L(ΞΈ_t) (or a stochastic estimate). Classical (Polyak) momentum maintains a velocity v_t via v_t = Ξ² v_{tβˆ’1} + g_t and ΞΈ_{t+1} = ΞΈ_t βˆ’ Ξ· v_t, where 0 ≀ Ξ² < 1 is the momentum coefficient and Ξ· > 0 is the learning rate. Unrolling the recursion shows v_t is an exponentially weighted moving average of past gradients: v_t = Ξ£_{k=0}^{t} Ξ²^k g_{tβˆ’k}. Nesterov Accelerated Gradient (NAG) evaluates the gradient at a look-ahead point ΞΈΜƒ_t = ΞΈ_t βˆ’ Ξ·Ξ² v_{tβˆ’1}, then sets v_t = Ξ² v_{tβˆ’1} + βˆ‡L(ΞΈΜƒ_t) and ΞΈ_{t+1} = ΞΈ_t βˆ’ Ξ· v_t. For L that is ΞΌ-strongly convex and L-smooth (gradient Lipschitz with constant L), the classical choices Ξ· = 4/(√L + √μ)Β² and Ξ² = ((√L βˆ’ √μ)/(√L + √μ))Β² yield accelerated linear convergence. In the scalar quadratic case L(ΞΈ) = (a/2)(ΞΈ βˆ’ ΞΈ*)Β², the error e_t = ΞΈ_t βˆ’ ΞΈ* obeys e_{t+1} = (1 βˆ’ Ξ·a + Ξ²)e_t βˆ’ Ξ² e_{tβˆ’1}, whose characteristic polynomial determines stability constraints on (Ξ·, Ξ²).

04 When to Use

Use momentum when gradients oscillate or when curvature is highly anisotropic (some directions steep, others flat). Typical scenarios include training deep neural networks where narrow ravines arise from layer-wise scaling, and convex optimization with ill-conditioned Hessians. In stochastic settings (mini-batch SGD), momentum reduces variance from noisy gradients, helping both speed and stability. Choose classical momentum for simplicity and robustness; prefer Nesterov momentum if you want better performance on smooth convex problems or empirically faster progress in some deep-learning tasks. Momentum is helpful when you need inexpensive acceleration: it adds only O(d) memory and arithmetic, unlike second-order methods that require Hessians or large matrices. It is also useful when you cannot or do not want to tune learning rates frequently because momentum can "rescue" slightly aggressive step sizes by damping oscillations. However, if gradients are already well-behaved (well-conditioned problems or with strong adaptive methods like Adam), the incremental benefit may be smaller. Use momentum together with learning-rate schedules (e.g., decay or warmup) for best results.

⚠️Common Mistakes

  • Setting \beta too high with a large \eta. High inertia plus big steps often overshoots and diverges; start with \beta in [0.8, 0.95] and increase cautiously.
  • Forgetting to scale updates consistently. The learning rate \eta multiplies the velocity in the parameter update, not the gradient term inside v_t for classical momentum; mixing these conventions changes effective step sizes.
  • Mis-implementing Nesterov. The gradient must be evaluated at the look-ahead point \theta_t - \eta \beta v_{t-1}; computing it at \theta_t defeats the purpose.
  • Not initializing v_0 to zero (or appropriately). Starting with stale or mismatched velocity can cause a big first step.
  • Confusing weight decay (L2 regularization) with momentum. Weight decay should be applied to parameters (or gradients) consistently; momentum applies to the running average of gradients.
  • Ignoring data scaling. Poorly scaled features cause ill-conditioning that momentum can mitigate but not fully fix; standardize inputs for better stability.
  • Using the same hyperparameters across very different batch sizes. Larger batches reduce gradient noise and may allow slightly higher \eta or lower \beta; small batches may benefit from higher \beta.
  • Expecting momentum to fix non-differentiable objectives without smoothing. Momentum assumes gradients (or subgradients) are available and informative.

Key Formulas

Classical Momentum (Heavy-ball)

v_t = Ξ² v_{tβˆ’1} + βˆ‡L(ΞΈ_t),  ΞΈ_{t+1} = ΞΈ_t βˆ’ Ξ· v_t

Explanation: Velocity vt​ is the exponential moving average of past gradients and parameters move opposite to velocity scaled by learning rate. This reduces zig-zags and accelerates along consistent directions.

Nesterov Accelerated Gradient (NAG)

ΞΈΜƒ_t = ΞΈ_t βˆ’ Ξ·Ξ² v_{tβˆ’1},  v_t = Ξ² v_{tβˆ’1} + βˆ‡L(ΞΈΜƒ_t),  ΞΈ_{t+1} = ΞΈ_t βˆ’ Ξ· v_t

Explanation: Compute a look-ahead point using previous velocity, then evaluate the gradient there for a corrective update. This often yields better theoretical rates on smooth convex functions.

EMA Expansion

v_t = Ξ£_{k=0}^{t} Ξ²^k βˆ‡L(ΞΈ_{tβˆ’k})

Explanation: Unrolling the recursion shows momentum is an exponentially weighted moving average of past gradients. Recent gradients get higher weight, older ones decay by powers of Ξ².

Scalar Quadratic Error Recurrence

e_{t+1} = (1 βˆ’ Ξ·a + Ξ²) e_t βˆ’ Ξ² e_{tβˆ’1}

Explanation: For L(ΞΈ) = (a/2)(ΞΈ βˆ’ ΞΈ*)Β², the error follows a second-order linear recurrence. Stability requires the roots of rΒ² βˆ’ (1 βˆ’ Ξ·a + Ξ²) r + Ξ² = 0 to lie inside the unit circle.

Heavy-ball Optimal Parameters (Strongly Convex)

η⋆ = 4/(√L + √μ)²,  β⋆ = ((√L βˆ’ √μ)/(√L + √μ))²

Explanation: For ΞΌ-strongly convex and L-smooth functions, these parameters minimize the spectral radius of the iteration in the worst case. They provide accelerated linear convergence.

Condition Number

ΞΊ = L/ΞΌ

Explanation: The ratio of largest to smallest curvature controls convergence rates. Larger ΞΊ means ill-conditioning and slower progress for plain gradient descent; momentum mitigates this effect.

Lipschitz Gradient

βˆ₯βˆ‡L(x)βˆ’βˆ‡L(y)βˆ₯≀Lβˆ₯xβˆ’yβˆ₯

Explanation: This smoothness condition bounds how quickly the gradient can change, which is key to deriving stable step sizes and convergence guarantees.

Arithmetic Series (reference)

Ξ£_{i=1}^{n} i = n(n+1)/2

Explanation: Useful when summing iteration-dependent terms in analyses. The closed form simplifies complexity and convergence proofs involving cumulative steps.

Per-epoch Complexity

T = O(nd) per epoch,  M = O(d)

Explanation: Optimizing models with d parameters over n examples costs O(nd) to compute one pass of gradients. Momentum adds only one extra d-dimensional vector in memory.

Complexity Analysis

Let d be the number of parameters and n the number of data points. A single momentum step computes a gradient g_t and performs a few vector operations: scaling and adding the velocity and updating parameters. These arithmetic costs are O(d). The gradient computation usually dominates cost. For full-batch settings, one iteration costs O(nd); for mini-batch size b, it is O(bd). Thus, one epoch (processing all data once) costs O(nd), regardless of momentum. Momentum adds an auxiliary vector v of length d, increasing memory by O(d) beyond storing parameters and any gradients. Stability affects effective complexity: with well-chosen (Ξ·, Ξ²), momentum reduces the number of iterations to reach a target accuracy compared to plain gradient descent, especially on ill-conditioned problems. For ΞΌ-strongly convex and L-smooth objectives, heavy-ball and Nesterov achieve accelerated linear rates with a contraction factor depending on √κ. In practice, fewer passes are needed to achieve similar loss reduction, which lowers wall-clock time for a target accuracy even though per-iteration cost is nearly unchanged. However, aggressive hyperparameters can cause divergence, wasting compute. Proper tuning (often Ξ² ∈ [0.8, 0.99] with a modest Ξ·) yields robust improvements. Space complexity remains O(d): parameters, velocity, and temporary gradient buffers. On modern hardware, memory bandwidth for reading/writing these vectors can impact throughput; using contiguous arrays and avoiding unnecessary allocations helps maintain O(d) per-step efficiency.

Code Examples

Classical Momentum on a 2D Ill-Conditioned Quadratic
#include <bits/stdc++.h>
using namespace std;

// Minimize L(theta) = 0.5 * theta^T H theta, with H = diag([1000, 1])
// Gradient: g(theta) = H * theta

struct Momentum {
    double eta;          // learning rate
    double beta;         // momentum coefficient
    vector<double> v;    // velocity
    Momentum(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0) {}
    void step(vector<double>& theta, const vector<double>& grad) {
        // v_t = beta * v_{t-1} + g_t
        for (size_t i = 0; i < theta.size(); ++i) {
            v[i] = beta * v[i] + grad[i];
        }
        // theta_{t+1} = theta_t - eta * v_t
        for (size_t i = 0; i < theta.size(); ++i) {
            theta[i] -= eta * v[i];
        }
    }
};

int main() {
    // Problem setup
    vector<double> theta = {5.0, 5.0};     // initial point
    vector<double> diagH = {1000.0, 1.0};  // Hessian diagonal

    // Hyperparameters: the learning rate must respect the steep direction
    // (lambda_max = 1000); eta = 0.0015 is stable with beta = 0.9 here,
    // while much larger steps diverge.
    double eta = 0.0015;  // learning rate
    double beta = 0.9;    // momentum
    Momentum opt(theta.size(), eta, beta);

    auto grad = [&](const vector<double>& x){
        vector<double> g(2);
        g[0] = diagH[0] * x[0];
        g[1] = diagH[1] * x[1];
        return g;
    };

    auto loss = [&](const vector<double>& x){
        return 0.5 * (diagH[0]*x[0]*x[0] + diagH[1]*x[1]*x[1]);
    };

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int t = 0; t < 200; ++t) {
        vector<double> g = grad(theta);
        opt.step(theta, g);
        if (t % 20 == 0) {
            cout << "iter " << t << ": theta=[" << theta[0] << ", " << theta[1]
                 << "] L=" << loss(theta) << "\n";
        }
    }

    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";
    return 0;
}

This program minimizes an ill-conditioned quadratic bowl where one direction is 1000 times steeper than the other. Plain gradient descent would zig-zag heavily; classical momentum uses a velocity v_t = Ξ² v_{tβˆ’1} + g_t and takes updates ΞΈ_{t+1} = ΞΈ_t βˆ’ Ξ· v_t to damp oscillations and accelerate along the shallow direction. You can experiment by setting beta=0 to see slower convergence and more zig-zag.

Time: O(d) per iteration (here d=2); full run O(T d) for T iterations
Space: O(d) extra for the velocity vector
Nesterov Accelerated Gradient (NAG) on the Same Quadratic
#include <bits/stdc++.h>
using namespace std;

struct Nesterov {
    double eta;          // learning rate
    double beta;         // momentum coefficient
    vector<double> v;    // velocity
    Nesterov(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0) {}
    // Step requires gradient at look-ahead point: theta_look = theta - eta * beta * v
    template <class GradFunc>
    void step(vector<double>& theta, GradFunc grad_func) {
        vector<double> theta_look(theta.size());
        for (size_t i = 0; i < theta.size(); ++i) theta_look[i] = theta[i] - eta * beta * v[i];
        vector<double> g = grad_func(theta_look);  // g_t = grad at look-ahead
        for (size_t i = 0; i < theta.size(); ++i) {
            v[i] = beta * v[i] + g[i];  // update velocity
            theta[i] -= eta * v[i];     // update parameters
        }
    }
};

int main() {
    // H = diag([1000, 1])
    vector<double> diagH = {1000.0, 1.0};
    auto grad = [&](const vector<double>& x){
        vector<double> g(2);
        g[0] = diagH[0] * x[0];
        g[1] = diagH[1] * x[1];
        return g;
    };
    auto loss = [&](const vector<double>& x){
        return 0.5 * (diagH[0]*x[0]*x[0] + diagH[1]*x[1]*x[1]);
    };

    vector<double> theta = {5.0, 5.0};
    // eta = 0.001 keeps eta * lambda_max = 1, inside NAG's stability region
    // for beta = 0.9; larger steps diverge along the steep direction.
    Nesterov opt(theta.size(), 0.001, 0.9);

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int t = 0; t < 200; ++t) {
        opt.step(theta, grad);
        if (t % 20 == 0) {
            cout << "iter " << t << ": theta=[" << theta[0] << ", " << theta[1]
                 << "] L=" << loss(theta) << "\n";
        }
    }

    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";
    return 0;
}

This implementation computes the gradient at a look-ahead point ΞΈ βˆ’ Ξ·Ξ² v, then updates velocity and parameters. Compared to classical momentum on the same problem, Nesterov often reduces overshoot in the shallow direction while keeping oscillations small in the steep direction.

Time: O(d) per iteration; O(T d) total for T steps
Space: O(d) extra for the velocity and temporary look-ahead vector
Mini-batch Logistic Regression with Classical Momentum (SGD)
#include <bits/stdc++.h>
using namespace std;

// Binary logistic regression with L2 regularization trained by SGD + momentum.
// Model: p(y=1|x) = sigmoid(w^T x + b)

struct Momentum {
    double eta, beta;
    vector<double> v;    // velocity for weights
    double vb;           // velocity for bias
    Momentum(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0), vb(0.0) {}
    void step(vector<double>& w, double& b, const vector<double>& gw, double gb) {
        for (size_t i = 0; i < w.size(); ++i) v[i] = beta * v[i] + gw[i];
        vb = beta * vb + gb;
        for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * v[i];
        b -= eta * vb;
    }
};

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

int main() {
    // Generate a simple 2D synthetic dataset
    mt19937 rng(42);
    normal_distribution<double> n01(0.0, 1.0);
    int n = 2000; int d = 2;
    vector<vector<double>> X(n, vector<double>(d));
    vector<int> y(n);
    // True weights
    vector<double> w_true = {2.0, -3.0}; double b_true = 0.5;
    for (int i = 0; i < n; ++i) {
        double x0 = n01(rng), x1 = n01(rng);
        X[i][0] = x0; X[i][1] = x1;
        double logit = w_true[0]*x0 + w_true[1]*x1 + b_true + 0.2*n01(rng);
        double p = sigmoid(logit);
        y[i] = (uniform_real_distribution<double>(0.0, 1.0)(rng) < p) ? 1 : 0;
    }

    // Initialize parameters
    vector<double> w(d, 0.0); double b = 0.0;
    double lambda = 1e-3;  // L2 regularization
    double eta = 0.1; double beta = 0.9; int batch = 64; int epochs = 10;
    Momentum opt(d, eta, beta);

    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);

    auto loss_and_grad = [&](const vector<int>& batch_idx){
        vector<double> gw(d, 0.0); double gb = 0.0; double loss = 0.0;
        for (int id : batch_idx) {
            const auto& xi = X[id]; int yi = y[id];
            double z = inner_product(xi.begin(), xi.end(), w.begin(), 0.0) + b;
            double p = sigmoid(z);
            // Cross-entropy loss contribution
            loss += -(yi ? log(max(p, 1e-12)) : log(max(1 - p, 1e-12)));
            // Gradient wrt w and b
            double err = (p - yi);
            for (int j = 0; j < d; ++j) gw[j] += err * xi[j];
            gb += err;
        }
        // Average over the batch
        double m = (double)batch_idx.size();
        for (int j = 0; j < d; ++j) gw[j] /= m;
        gb /= m;
        loss /= m;
        // Add L2 regularization: 0.5*lambda*||w||^2
        for (int j = 0; j < d; ++j) { loss += 0.5 * lambda * w[j]*w[j]; gw[j] += lambda * w[j]; }
        return tuple<double, vector<double>, double>(loss, gw, gb);
    };

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int e = 0; e < epochs; ++e) {
        shuffle(idx.begin(), idx.end(), rng);
        double epoch_loss = 0.0; int steps = 0;
        for (int i = 0; i < n; i += batch) {
            vector<int> bidx; bidx.reserve(batch);
            for (int k = i; k < min(i + batch, n); ++k) bidx.push_back(idx[k]);
            auto [L, gw, gb] = loss_and_grad(bidx);
            epoch_loss += L; ++steps;
            opt.step(w, b, gw, gb);
        }
        epoch_loss /= steps;
        // Compute accuracy on the whole dataset (for demonstration)
        int correct = 0;
        for (int i = 0; i < n; ++i) {
            double p = sigmoid(inner_product(X[i].begin(), X[i].end(), w.begin(), 0.0) + b);
            int pred = (p >= 0.5) ? 1 : 0; correct += (pred == y[i]);
        }
        cout << "epoch " << e << ": loss=" << epoch_loss << ", acc=" << (double)correct/n << "\n";
    }

    cout << "Final w: [" << w[0] << ", " << w[1] << "] b=" << b << "\n";
    return 0;
}

This program trains a logistic regression classifier with mini-batch SGD and classical momentum. The velocity smooths noisy gradient estimates, typically improving stability and convergence speed compared to plain SGD. It also demonstrates how to integrate L2 regularization and compute basic metrics.

Time: O(nd) per epoch for dataset size n and parameter dimension d; each step is O(bd) for batch size b
Space: O(d) extra for the velocity plus O(bd) temporary for batch data access
#momentum #heavy-ball #polyak-momentum #nesterov #nag #stochastic-gradient-descent #optimization #ill-conditioned #learning-rate #exponential-moving-average #logistic-regression #convex-optimization #deep-learning #accelerated-methods #convergence