🎓 How I Study AI
⚙️ Algorithm · Intermediate

Gradient Descent

Key Points

  • Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.
  • Batch gradient descent uses the full dataset to compute the gradient each step, making it stable but potentially slower per iteration.
  • Choosing a good learning rate (step size) is crucial; too large diverges, too small converges very slowly.
  • For convex, smooth losses like linear regression, batch gradient descent reliably converges to the global minimum.
  • Per-iteration cost is O(n d), where n is the number of examples and d is the number of features.
  • Feature scaling (standardization) often dramatically improves convergence speed and stability.
  • Stopping criteria typically monitor gradient norm, loss improvement, or a maximum number of iterations.
  • C++ implementations should pay attention to numerical stability (overflow in exp, underflow in probabilities) and careful vectorized loops.

Prerequisites

  • Multivariable calculus (gradients and partial derivatives) — Gradient descent requires computing and understanding ∇L(θ).
  • Linear algebra (vectors, matrices, matrix-vector products) — Batch gradients for linear/logistic regression use X^T(Xθ−y) and X^T(p−y).
  • Convexity basics — Convergence guarantees and step-size selection rely on convexity and smoothness assumptions.
  • Floating-point arithmetic and numerical stability — C++ implementations must avoid overflow/underflow, especially for exp/log in logistic regression.

Detailed Explanation


01 Overview

Hook: Imagine hiking down a hill in the fog with only a local slope meter—you can’t see the valley, but you can feel which way is downhill. Gradient descent is that hiker. It repeatedly measures the slope (the gradient) at the current position and takes a step downhill to reduce the loss (error).

Concept: In machine learning and optimization, we often want to find parameters θ that minimize a loss function L(θ). The gradient ∇L(θ) tells us the direction of steepest ascent; stepping in the opposite direction decreases the loss. Batch gradient descent computes this gradient using the entire dataset each iteration, making each step exact (for that dataset) but computationally heavier than stochastic variants.

Example: In linear regression, L(θ) is the mean squared error between predictions and targets. By computing the gradient over all samples, we update θ so that predictions get closer to targets overall, iteration by iteration. Over time—and with a suitable learning rate—θ converges toward parameters that minimize error. The algorithm is simple, widely applicable to differentiable losses, and forms the backbone of many learning systems.

02 Intuition & Analogies

Hook: Think of trying to flatten a wrinkled bedsheet. You press where the bump is highest and move your hands outward, repeatedly smoothing until it’s flat. Gradient descent “presses” where the error is greatest and moves parameters in the direction that flattens the error surface.

Concept: The gradient is like a compass pointing toward the steepest increase of error. If we walk the other way by some small step, we reduce the error. The learning rate is how big that step is—like how far you move your hands each time you press the sheet. Too big and you overshoot and create new wrinkles; too small and it takes forever. Batch gradient descent uses all the evidence (every data point) to choose the movement direction, like polling everyone in a crowd before deciding.

Example: For a simple parabola L(θ) = (θ − 3)^2, the slope at θ tells you whether 3 is to the left or right and how far. If θ = 0, the slope is −6, so stepping right (opposite of the negative slope) moves you toward θ = 3. For linear regression with many features, each feature’s contribution to the error becomes a component in the gradient vector. Summing over the whole dataset (batch) ensures our step points toward reducing the average error, not just the error of one sample.

03 Formal Definition

Hook: The essence of batch gradient descent is one line.

Concept: Given a differentiable loss L: ℝ^d → ℝ, batch gradient descent updates parameters by θ_{t+1} = θ_t − η ∇L(θ_t), where η > 0 is the learning rate. The gradient ∇L(θ_t) is computed exactly (for the dataset) by aggregating contributions from all samples. For empirical risk minimization with dataset D = {(x_i, y_i)}_{i=1}^{n}, L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i), and thus ∇L(θ) = (1/n) Σ_{i=1}^{n} ∇_θ ℓ(θ; x_i, y_i).

Example: Linear regression with mean squared error ℓ_i = (1/2)(x_i^⊤θ − y_i)^2 yields ∇L(θ) = (1/n) X^⊤(Xθ − y), where X ∈ ℝ^{n×d}, y ∈ ℝ^n. Under standard assumptions (L-smoothness and possibly μ-strong convexity), an appropriate η ensures convergence: a sublinear O(1/t) rate for convex smooth losses, and a linear (geometric) rate for strongly convex ones. Stopping criteria often check ‖∇L(θ_t)‖₂ ≤ ε or relative loss decrease.
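The linear-regression gradient quoted above follows directly from the chain rule; written out in LaTeX notation for clarity:

```latex
% Per-sample loss and its gradient via the chain rule
\ell_i(\theta) = \tfrac{1}{2}\,(x_i^\top \theta - y_i)^2
\quad\Rightarrow\quad
\nabla_\theta \ell_i(\theta) = (x_i^\top \theta - y_i)\, x_i .

% Averaging over the n samples and stacking the rows x_i^\top into X:
\nabla L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (x_i^\top \theta - y_i)\, x_i
                 = \frac{1}{n}\, X^\top (X\theta - y).
```

The 1/2 in the per-sample loss exists purely to cancel the exponent when differentiating, which is why the gradient carries no stray factor of 2.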

04 When to Use

Hook: Use the right hammer for the right nail. Batch gradient descent is the hammer for smooth, differentiable losses when you can afford full passes over the data.

Concept: Apply batch gradient descent when the dataset fits in memory or streaming a full pass is acceptable, and when you want stable, low-variance gradient estimates per step. It shines in convex problems like linear regression and logistic regression, where global minima exist and each step steadily progresses. It also serves as a strong baseline and sanity check for more complex optimizers.

Example use cases: (1) Fitting linear or logistic regression on medium-sized datasets (n·d manageable). (2) Training small neural networks for didactic purposes or when determinism and stability matter. (3) Computing precise gradients for ill-conditioned problems before switching to quasi-Newton methods. Consider alternatives when: data is huge (prefer mini-batch/SGD), the loss is highly non-convex and noisy (adaptive optimizers may help), or when second-order curvature information (Newton/LM/L-BFGS) yields faster convergence per iteration despite higher cost.

⚠️Common Mistakes

Hook: Most failures come from two knobs: step size and scaling.

Concept: Common pitfalls include (1) a learning rate too large, causing divergence or oscillations; (2) a learning rate too small, leading to painfully slow progress; (3) unscaled features, which distort the loss surface and make a single learning rate ineffective; (4) forgetting the bias term or mishandling it during scaling; (5) using integer arithmetic or low precision, causing rounding errors; (6) incorrect gradient signs or averaging factors (missing 1/n); (7) poor stopping rules—either stopping too early due to noisy loss or too late due to tiny improvement; (8) ignoring numerical stability in functions like exp for logistic regression.

Example fixes: Standardize features (zero mean, unit variance) and keep the bias unscaled or handled separately. Start with a conservative η (e.g., a 1/L estimate, or a small value like 1e-2) and consider decay schedules. Verify gradients with finite-difference checks on small problems. Monitor both loss and gradient norm. Use double precision in C++ and clamp logits for sigmoid to avoid overflow. Log progress and experiment with different η values to find a stable, fast regime.

Key Formulas

Gradient Descent Update

θ_{t+1} = θ_t − η ∇L(θ_t)

Explanation: Update parameters by stepping opposite the gradient scaled by learning rate η. This is the core rule for batch gradient descent.

Empirical Risk

L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i)

Explanation: The training loss is the average of per-sample losses. Batch gradient descent differentiates this sum to form the update.

MSE Loss (Linear Regression)

L(θ) = (1/(2n)) Σ_{i=1}^{n} (x_i^⊤θ − y_i)²

Explanation: This measures average squared error between predictions and targets. The 1/2 factor simplifies the gradient expression.

Linear Regression Gradient

∇L(θ) = (1/n) X^⊤(Xθ − y)

Explanation: For mean squared error, the gradient equals the data matrix transpose times residuals, averaged over n samples.

Sigmoid

σ(z) = 1 / (1 + e^{−z})

Explanation: Maps any real input to a probability in (0,1). Used to convert linear scores to probabilities in logistic regression.

Logistic Regression Loss

L(θ) = (1/n) Σ_{i=1}^{n} [−y_i log p_i − (1 − y_i) log(1 − p_i)],  p_i = σ(x_i^⊤θ)

Explanation: Cross-entropy penalizes wrong confident predictions heavily. It is convex in θ for binary logistic regression.

Logistic Regression Gradient

∇L(θ) = (1/n) X^⊤(σ(Xθ) − y)

Explanation: The gradient is the data matrix transpose times the probability error vector. It averages errors over all samples.

Sublinear Convergence (Convex, L-smooth)

f(x_t) − f(x*) ≤ L‖x_0 − x*‖² / (2t)

Explanation: With step size 1/L on an L-smooth convex function, gradient descent’s suboptimality shrinks as O(1/t). This gives a rate guarantee without strong convexity.

Linear Convergence (Strongly Convex)

f(x_t) − f(x*) ≤ (1 − μ/L)^t (f(x_0) − f(x*))

Explanation: If f is μ-strongly convex and L-smooth, gradient descent with a suitable step size converges geometrically. The ratio μ/L controls the speed.

Learning Rate Decay

η_t = η_0 / (1 + kt)

Explanation: A decaying step size can stabilize and improve convergence when curvature varies. k controls how quickly the learning rate shrinks.

Gradient-Norm Stopping

‖∇L(θ_t)‖₂ ≤ ε

Explanation: Stop when the gradient is small, indicating a (near) stationary point. This is robust for smooth problems.

Relative Loss Decrease Stopping

|L(θ_t) − L(θ_{t−1})| / max(1, L(θ_{t−1})) ≤ δ

Explanation: Stop when successive loss improvements become negligible. The max avoids division by very small numbers.

Complexity Analysis

Let n be the number of samples and d the number of features (including bias). In batch gradient descent, each iteration computes predictions for all n samples, accumulates residuals, and forms the gradient by multiplying X^T with the residual vector. A straightforward implementation does this in O(n d) time per iteration: O(n d) for computing ŷ = Xθ, O(n) for residuals, and O(n d) for X^T r. Memory usage for parameters is O(d); if the dataset must be stored in memory, it requires O(n d). Additional workspace for predictions and gradients is O(n + d), typically dominated by the dataset. Over E iterations (or epochs, since batch uses the full dataset each time), the total time is O(E n d).

Convergence speed in iterations depends on problem conditioning. For L-smooth convex functions with a fixed step size η ≤ 1/L, the suboptimality decreases as O(1/t); for μ-strongly convex functions, it decreases geometrically at rate (1 − μ/L)^t. Poor conditioning (large κ = L/μ) slows convergence, often mitigated by feature scaling or preconditioning. Numerically, using double precision reduces rounding error, and careful implementations of nonlinearities (e.g., sigmoid with clamping) avoid overflow. If line search is used, each iteration’s time increases due to multiple function/gradient evaluations but may reduce the number of iterations. In practice, batch GD trades higher per-iteration cost for low variance and stable progress compared to SGD or mini-batch variants.

Code Examples

Batch Gradient Descent on a 1D Quadratic: Mechanics and Convergence
#include <bits/stdc++.h>
using namespace std;

// Minimize L(theta) = (theta - 3)^2 using batch gradient descent
// Analytical gradient: dL/dtheta = 2*(theta - 3)

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto L = [](double th) { return (th - 3.0) * (th - 3.0); };
    auto grad = [](double th) { return 2.0 * (th - 3.0); };

    double theta = 0.0;  // initial guess
    double eta = 0.2;    // learning rate
    int max_iters = 50;
    double tol = 1e-9;   // gradient-norm stopping

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        double g = grad(theta);
        cout << "iter=" << t
             << " theta=" << theta
             << " L=" << L(theta)
             << " |grad|=" << fabs(g) << "\n";
        if (fabs(g) <= tol) break;
        theta -= eta * g;  // theta_{t+1} = theta_t - eta * grad
    }

    cout << "Final theta ~ " << theta << " (true minimizer is 3.0)\n";
    return 0;
}

This program demonstrates the basic gradient descent update on a simple convex function. The gradient is computed exactly, and the parameter is updated by stepping opposite the gradient. You can experiment with the learning rate: too large can cause divergence or oscillation; moderate values converge quickly to 3.0.

Time: O(T) per run, where T is the number of iterations. Space: O(1)
Batch Gradient Descent for Multivariate Linear Regression (MSE)
#include <bits/stdc++.h>
using namespace std;

// Linear regression with batch gradient descent
// Minimize L(theta) = (1/(2n)) * sum_i (x_i^T theta - y_i)^2
// Gradient: (1/n) * X^T (X theta - y)

struct Dataset {
    vector<vector<double>> X; // n x d (will include bias as last column)
    vector<double> y;         // n
};

// Generate synthetic linear data with bias and Gaussian noise
Dataset make_data(int n, int d_no_bias, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.3);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1); // include bias at end
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0.0);

    uniform_real_distribution<double> xdist(-2.0, 2.0);
    for (int i = 0; i < n; ++i) {
        double yclean = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            yclean += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias feature
        yclean += true_w[d_no_bias];
        ds.y[i] = yclean + noise(rng);
    }
    return ds;
}

// Standardize each non-bias feature: zero mean, unit variance
void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    // Exclude last column (bias) from scaling
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12; // avoid divide-by-zero
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500;       // samples
    int d_no_bias = 5; // real features (we will append bias)
    Dataset ds = make_data(n, d_no_bias, 123);

    // Scale features (not bias)
    standardize_features(ds.X);

    int d = d_no_bias + 1;        // include bias feature
    vector<double> theta(d, 0.0); // initialize to zeros

    double eta = 0.1;     // learning rate
    int max_iters = 1000;
    double tol = 1e-8;    // gradient-norm stopping

    vector<double> yhat(n), residual(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double sse = 0.0;
        for (int i = 0; i < n; ++i) {
            double pred = 0.0;
            for (int j = 0; j < d; ++j) pred += ds.X[i][j] * th[j];
            double e = pred - ds.y[i];
            sse += e * e;
        }
        return 0.5 * sse / n; // (1/(2n)) * ||X theta - y||^2
    };

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // yhat = X theta
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < d; ++j) s += ds.X[i][j] * theta[j];
            yhat[i] = s;
        }
        // residual = yhat - y
        for (int i = 0; i < n; ++i) residual[i] = yhat[i] - ds.y[i];
        // grad = (1/n) * X^T residual
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * residual[i];
            grad[j] = gj / n;
        }
        // Check convergence every iteration; report progress periodically
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        // Update
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (including bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';
    return 0;
}

This program fits a linear regression model with batch gradient descent. It standardizes features for better conditioning, appends a bias feature, and iteratively updates parameters using the full-dataset gradient. The loss decreases steadily when the learning rate is reasonable, illustrating stable batch updates.

Time: O(T n d) total; O(n d) per iteration. Space: O(n d) to store data + O(d) parameters + O(n + d) workspace
Batch Gradient Descent for Binary Logistic Regression (Cross-Entropy)
#include <bits/stdc++.h>
using namespace std;

// Logistic regression with batch gradient descent
// Loss: L(theta) = (1/n) * sum_i [ -y_i log p_i - (1-y_i) log(1-p_i) ], p_i = sigmoid(x_i^T theta)
// Gradient: (1/n) * X^T (p - y)

struct Dataset {
    vector<vector<double>> X; // n x d (includes bias column)
    vector<int> y;            // n labels in {0,1}
};

Dataset make_data(int n, int d_no_bias, unsigned seed = 7) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.5);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1);
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0);
    normal_distribution<double> xdist(0.0, 1.0);
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            z += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias
        z += true_w[d_no_bias];
        z += noise(rng);
        // Probability via sigmoid, then sample label
        double p = 1.0 / (1.0 + exp(-z));
        bernoulli_distribution bd(p);
        ds.y[i] = (int)bd(rng);
    }
    return ds;
}

void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12;
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

// Numerically stable sigmoid
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 1000;
    int d_no_bias = 4;
    Dataset ds = make_data(n, d_no_bias, 2024);
    standardize_features(ds.X);

    int d = d_no_bias + 1; // include bias
    vector<double> theta(d, 0.0);
    vector<double> p(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double L = 0.0;
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * th[j];
            double pi = sigmoid(z);
            // Clamp to avoid log(0)
            double eps = 1e-12;
            pi = min(max(pi, eps), 1.0 - eps);
            L += -ds.y[i] * log(pi) - (1 - ds.y[i]) * log(1 - pi);
        }
        return L / n;
    };

    double eta = 0.2;
    int max_iters = 1000;
    double tol = 1e-6;

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // Compute probabilities p = sigmoid(X theta)
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
            p[i] = sigmoid(z);
        }
        // Gradient: (1/n) * X^T (p - y)
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * (p[i] - ds.y[i]);
            grad[j] = gj / n;
        }
        // Check convergence every iteration; report progress periodically
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';

    // Quick training accuracy check
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == ds.y[i]);
    }
    cout << "Training accuracy ~ " << (100.0 * correct / n) << "%\n";
    return 0;
}

This example trains a binary logistic regression model using batch gradient descent. It computes cross-entropy loss, uses a numerically stable sigmoid, standardizes features, and updates parameters using the full gradient each iteration. The model’s loss decreases and accuracy improves as training proceeds.

Time: O(T n d) total; O(n d) per iteration. Space: O(n d) for data + O(d) for parameters + O(n + d) workspace
#gradient descent#batch gradient descent#learning rate#linear regression#logistic regression#mean squared error#cross-entropy#convex optimization#feature scaling#lipschitz smoothness#strong convexity#condition number#stopping criterion#line search#c++ machine learning