📚 Theory · Intermediate

Gradient Descent Convergence Theory

Key Points

  • Gradient descent updates parameters by stepping opposite the gradient: x_{t+1} = x_t - \eta \nabla f(x_t).
  • If f is L-smooth, a safe fixed step size is \eta = 1/L, which guarantees monotone descent and convergence for convex functions.
  • For \mu-strongly convex and L-smooth functions, gradient descent converges linearly: \|x_t - x^*\| \le (1 - \mu/L)^t \|x_0 - x^*\|.
  • For merely convex and L-smooth functions, the objective gap shrinks sublinearly: f(x_t) - f(x^*) = O(1/t).
  • The condition number \kappa = L/\mu determines how many iterations are needed; larger \kappa slows convergence.
  • In non-convex settings, gradient descent converges to stationary points, with the average squared gradient norm decaying like O(1/t).
  • Stochastic gradient descent (SGD) with decreasing step sizes achieves E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t}) for convex objectives.
  • Choosing learning rates well is crucial in ML: too large causes divergence, too small makes training sluggish.

Prerequisites

  • Differential Calculus: Understanding gradients, directional derivatives, and Taylor expansions is essential for GD updates and smoothness.
  • Linear Algebra: Norms, inner products, eigenvalues, and positive definiteness underlie L, μ, and condition numbers.
  • Convexity and Strong Convexity: Convergence rates depend on convexity properties; knowing these definitions is critical.
  • Probability and Expectation: SGD analysis uses unbiasedness and variance of stochastic gradients and expectations over randomness.
  • Introduction to Optimization: Line search, stopping criteria, and basic iterative methods provide context for GD behavior.

Detailed Explanation


01 Overview

Hook: Imagine rolling a ball down a hilly landscape to reach the lowest valley. If the hills are smooth and bowl-shaped, the ball quickly settles at the bottom. If the surface is rough or flat in places, progress can be slow or the ball can get stuck at flat spots. Concept: Gradient descent (GD) is the mathematical version of this process. At each step, you look at the slope (the gradient) and take a step downhill proportional to it, x_{t+1} = x_t - \eta \nabla f(x_t). The theory of convergence explains when and how fast this process approaches the minimum. Key properties of the function—smoothness and convexity—govern the guarantees. L-smoothness controls how quickly the slope can change, and \mu-strong convexity ensures the function is bowl-shaped with a unique minimum. Example: For a quadratic function f(x) = \tfrac{1}{2} x^\top A x - b^\top x with A positive definite, GD with a properly chosen step size will shrink the distance to the solution by a constant factor each step, leading to fast, predictable convergence. In contrast, for a non-convex function like x^4 - x^2, GD still finds points where the slope is nearly zero, but we cannot guarantee it reaches the global minimum.

02 Intuition & Analogies

Hook: Suppose you're hiking with foggy glasses. You can't see the whole mountain, but you can feel the slope beneath your feet. You take a step downhill—the steeper the slope, the longer the step—but not too long to avoid tripping. Concept: The gradient is your sense of steepness and direction; the learning rate (step size) is how long your stride is. If the terrain is consistently bowl-shaped (strongly convex), every step gets you closer to the unique bottom, and with a good stride length you progress at a steady, geometric pace. If the terrain is just generally sloping down (convex but not strongly), you still move toward lower altitudes, but the landscape might flatten, making your progress taper like 1/t. If the terrain is bumpy (non-convex), you can still move to flatter regions where the slope is small, even if you can’t be sure it’s the absolute lowest valley. Example: Think of a perfectly round bowl (strongly convex): roll a marble and watch it spiral quickly to the bottom—each lap is shorter than the last by a constant factor. Now think of a wide, shallow plate (convex but not strongly): the marble moves inward but slows significantly near the center. On a wavy plate (non-convex), the marble may settle in a local dip; however, if you keep nudging it carefully (small steps), you can at least get to a place where it barely moves—the slope there is essentially zero.

03 Formal Definition

Gradient descent iterates x_{t+1} = x_t - \eta \nabla f(x_t) with step size \eta > 0 to minimize a differentiable function f: \mathbb{R}^d \to \mathbb{R}. A function is L-smooth if its gradient is Lipschitz continuous: \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| for all x, y, equivalently f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2. It is \mu-strongly convex if f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2 for all x, y, which implies a unique minimizer x^*. Under L-smoothness and convexity, GD with \eta = 1/L satisfies f(x_t) - f(x^*) \le \tfrac{L\|x_0 - x^*\|^2}{2t}. Under L-smoothness and \mu-strong convexity, GD with 0 < \eta \le 1/L yields linear convergence in distance: \|x_t - x^*\| \le (1 - \eta\mu)^t \|x_0 - x^*\|; with \eta = 1/L this becomes (1 - \mu/L)^t. The condition number \kappa = L/\mu controls the rate: larger \kappa slows convergence. For non-convex L-smooth objectives, GD with \eta \le 1/L yields a guarantee on stationarity: \min_{s \le t} \|\nabla f(x_s)\|^2 \le \tfrac{2L(f(x_0) - f^*)}{t}. In stochastic settings with unbiased gradient estimates of bounded variance, SGD with diminishing step sizes (e.g., \eta_t \propto 1/\sqrt{t}) achieves E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t}) in the convex case.
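
To see where the safe step \eta = 1/L comes from, substitute y = x_{t+1} = x_t - \eta \nabla f(x_t) into the L-smoothness upper bound (a standard derivation, sketched here for reference):

f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \tfrac{L}{2}\|x_{t+1} - x_t\|^2 = f(x_t) - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\|\nabla f(x_t)\|^2.

With \eta = 1/L each step decreases f by at least \tfrac{1}{2L}\|\nabla f(x_t)\|^2; summing this guaranteed decrease over t iterations is what produces both the convex O(1/t) rate and the non-convex stationarity bound above.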

04 When to Use

Use gradient descent when: (1) You can compute exact or approximate gradients cheaply relative to function evaluation, such as in linear models, logistic regression, and neural networks. (2) The function is L-smooth (most differentiable ML losses are) and possibly convex, where theory guarantees progress and rates. (3) You have large-scale data where second-order methods are too costly; GD/SGD scale linearly with data and parameters. (4) You’re training neural networks; while non-convex, GD variants still reliably reach stationary points and good solutions in practice. Prefer fixed step sizes (\eta = 1/L or tuned constants) when L (or a bound) is known and the surface is well-behaved; prefer backtracking or line search when L is unknown or varies. In ill-conditioned problems (large \kappa), consider preconditioning or momentum/accelerated methods. Use SGD or mini-batch GD when full gradients are too expensive; apply diminishing or adaptive learning rates to manage noise. For strongly convex problems and high accuracy needs, GD with \eta = 1/L (or 2/(L+\mu) when known) quickly achieves small errors; for merely convex objectives, expect slower 1/t decay and consider acceleration.
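
When L is unknown, the backtracking line search mentioned above can pick the step size automatically by shrinking a trial step until a sufficient-decrease (Armijo) condition holds. Below is a minimal C++ sketch under the assumption that the objective is available as a callable; the names backtracking_step, f, and g are illustrative and not part of the worked examples later in this section.

#include <bits/stdc++.h>
using namespace std;

// Backtracking (Armijo) line search for one gradient descent step.
// Shrinks a trial step size until f(x - eta*g) <= f(x) - c*eta*||g||^2.
double backtracking_step(const function<double(const vector<double>&)>& f,
                         const vector<double>& x,
                         const vector<double>& g,   // gradient at x
                         double eta0 = 1.0,         // initial trial step
                         double beta = 0.5,         // shrink factor
                         double c = 1e-4) {         // sufficient-decrease constant
    double fx = f(x);
    double g2 = 0.0;
    for (double gi : g) g2 += gi * gi;              // ||g||^2
    double eta = eta0;
    while (true) {
        vector<double> y(x.size());
        for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] - eta * g[i];
        // Accept if the Armijo condition holds (or the step became negligibly small).
        if (f(y) <= fx - c * eta * g2 || eta < 1e-12) return eta;
        eta *= beta;                                // shrink and retry
    }
}

int main() {
    // Tiny usage example on f(x) = 0.5 * (10 x0^2 + x1^2), where L = 10 is never told to the search.
    auto f = [](const vector<double>& x) { return 0.5 * (10.0 * x[0] * x[0] + x[1] * x[1]); };
    vector<double> x = {1.0, 1.0};
    for (int t = 0; t < 20; ++t) {
        vector<double> g = {10.0 * x[0], x[1]};     // gradient of the quadratic
        double eta = backtracking_step(f, x, g);
        for (size_t i = 0; i < x.size(); ++i) x[i] -= eta * g[i];
    }
    cout << "final point: " << x[0] << " " << x[1] << "\n"; // should be near the origin
    return 0;
}

Each call may evaluate f several times, so backtracking trades extra function evaluations for not needing L; in smooth regions the accepted step typically lands within a constant factor of 1/L.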

⚠️ Common Mistakes

  1. Learning rate too large: If \eta > 2/L in quadratic-like regions, GD can diverge or oscillate. Fix by estimating L (e.g., via power iteration on the Hessian for quadratics; see the sketch after this list), using \eta \le 1/L, or adopting backtracking line search.
  2. Confusing convergence in function value with distance: Sublinear rates like O(1/t) on f(x_t) - f(x^*) do not imply the same rate on \|x_t - x^*\| unless strong convexity holds.
  3. Ignoring conditioning: With \kappa = L/\mu \gg 1, progress is painfully slow; normalize features, add regularization, or precondition to reduce \kappa.
  4. Mixing deterministic and stochastic guarantees: SGD with constant \eta does not converge to x^* but to a noise ball; use diminishing \eta_t or iterate averaging to get O(1/\sqrt{t}) rates in convex problems.
  5. Misapplying non-convex results: Guarantees are about stationarity (small \|\nabla f\|), not global optimality; a small gradient does not prove you found the best minimum.
  6. Poor stopping criteria: Stopping on a tiny step size or a tiny |f(x_{t+1}) - f(x_t)| can be misleading; monitor the gradient norm and, when possible, optimality gaps.
  7. Not scaling data/parameters: Unscaled features enlarge L and \kappa, degrading step-size choices and speed; standardize inputs to improve convergence.
  8. Using a fixed \eta in stochastic settings without tuning: Either decay \eta_t (e.g., \eta_0/\sqrt{t}) or use adaptive methods; otherwise progress can stall.
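
For mistake 1, when the objective is a quadratic f(x) = \tfrac{1}{2} x^\top A x - b^\top x, the smoothness constant L equals the largest eigenvalue of the (constant) Hessian A, which power iteration can estimate. A minimal sketch with a small symmetric positive definite matrix whose values are chosen only for illustration:

#include <bits/stdc++.h>
using namespace std;

// Power iteration to estimate L = lambda_max(A) for a symmetric positive definite A.
// For f(x) = 0.5 x^T A x - b^T x, this L is the gradient-Lipschitz constant, so eta = 1/L is safe.
int main() {
    vector<vector<double>> A = {{4.0, 1.0, 0.0},
                                {1.0, 3.0, 1.0},
                                {0.0, 1.0, 2.0}};   // illustrative SPD matrix
    int d = A.size();
    vector<double> v(d, 1.0);                        // arbitrary nonzero start vector
    double lambda = 0.0;
    for (int it = 0; it < 200; ++it) {
        vector<double> Av(d, 0.0);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j) Av[i] += A[i][j] * v[j];
        double nrm = 0.0;
        for (double x : Av) nrm += x * x;
        nrm = sqrt(nrm);
        for (int i = 0; i < d; ++i) v[i] = Av[i] / nrm; // normalize
        lambda = nrm;                                   // converges to lambda_max for unit v
    }
    cout << "estimated L = " << lambda << ", safe step eta = " << 1.0 / lambda << "\n";
    return 0;
}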

Key Formulas

Gradient Descent Update

x_{t+1} = x_t - \eta \nabla f(x_t)

Explanation: Each iteration moves opposite the gradient, scaled by the learning rate. This decreases the function value when the step is small enough.

L-smoothness (Lipschitz Gradient)

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|

Explanation: The gradient does not change too abruptly. This regularity enables using fixed step sizes like 1/L with convergence guarantees.

Descent Lemma

f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2

Explanation: An upper bound on f(y) in terms of a quadratic model around x. It ensures GD decreases f if \eta \le 1/L.

Strong Convexity Inequality

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2

Explanation: A lower bound that enforces bowl-shaped curvature. It implies uniqueness of the minimizer and linear convergence of GD.

Linear Convergence (Strongly Convex)

\|x_t - x^*\| \le (1 - \eta\mu)^t \|x_0 - x^*\|

Explanation: When \mu > 0 and \eta \le 1/L, the distance to the minimizer shrinks by a constant factor each step. With \eta = 1/L, the factor is 1 - \mu/L.

Sublinear Rate (Convex, L-smooth)

f(x_t) - f(x^*) \le \frac{L \|x_0 - x^*\|^2}{2t}

Explanation: For convex but not strongly convex functions, GD with \eta = 1/L reduces the objective gap like 1/t. Progress slows near the optimum.

Condition Number

\kappa = \frac{L}{\mu}

Explanation: The ratio of smoothness to strong convexity controls the speed. Iteration complexity scales like O(\kappa \log(1/\epsilon)).

Non-convex Stationarity Rate

\min_{s \le t} \|\nabla f(x_s)\|^2 \le \frac{2L\,(f(x_0) - f^*)}{t}

Explanation: In smooth non-convex problems with \eta \le 1/L, the smallest gradient norm squared among the first t iterates decreases like 1/t.

SGD Convex Rate

E[f(\bar{x}_t)] - f(x^*) = O(1/\sqrt{t})

Explanation: With unbiased gradients of bounded variance and diminishing steps (e.g., \eta_t \propto 1/\sqrt{t}), averaged SGD attains O(1/\sqrt{t}) expected suboptimality.

Optimal Fixed Step for Quadratics

\eta^* = \frac{2}{L + \mu}

Explanation: For strongly convex quadratics, the best fixed step is 2/(L+\mu), yielding contraction factor (\kappa - 1)/(\kappa + 1), faster than 1 - \mu/L.
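
A quick numerical check of this formula: the sketch below runs GD on a 2D diagonal quadratic with \mu = 1 and L = 10 (values chosen only for illustration) and compares the fixed steps \eta = 1/L and \eta = 2/(L+\mu).

#include <bits/stdc++.h>
using namespace std;

// Compare fixed steps eta = 1/L and eta = 2/(L+mu) on f(x) = 0.5*(mu*x0^2 + L*x1^2),
// whose minimizer is the origin. Worst-case per-step contraction is max(|1 - eta*mu|, |1 - eta*L|).
int main() {
    const double mu = 1.0, L = 10.0;
    auto run = [&](double eta) {
        array<double, 2> x = {1.0, 1.0};
        for (int t = 0; t < 50; ++t) {
            x[0] -= eta * mu * x[0];            // gradient of 0.5*mu*x0^2
            x[1] -= eta * L * x[1];             // gradient of 0.5*L*x1^2
        }
        return sqrt(x[0] * x[0] + x[1] * x[1]); // distance to the optimum after 50 steps
    };
    cout << fixed << setprecision(10);
    cout << "eta = 1/L      : ||x_50 - x*|| = " << run(1.0 / L) << "\n";
    cout << "eta = 2/(L+mu) : ||x_50 - x*|| = " << run(2.0 / (L + mu)) << "\n";
    // Expected per-step factors: 1 - mu/L = 0.9 versus (kappa-1)/(kappa+1) = 9/11 ~ 0.818.
    return 0;
}

After 50 steps the 2/(L+\mu) run ends up orders of magnitude closer to the optimum, matching the per-step factors 1 - \mu/L = 0.9 versus (\kappa - 1)/(\kappa + 1) \approx 0.818.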

Complexity Analysis

In gradient descent, each iteration computes the gradient and performs an O(d) vector update, where d is the parameter dimension. For generic objectives where evaluating the gradient costs G(d) operations, the per-iteration complexity is O(G(d)). In many ML settings with dense data of size n-by-d, a full gradient over n samples costs O(nd); thus batch GD requires O(nd) per iteration and O(nd · T) overall for T iterations. Memory usage is typically O(d), storing the current iterate, gradient, and a few scratch vectors. In strongly convex, L-smooth problems, the number of iterations to reach accuracy \epsilon in distance is T = O(\kappa \log(1/\epsilon)), where \kappa = L/\mu. Accordingly, the total arithmetic work becomes O(G(d) · \kappa \log(1/\epsilon)). When n is large, computing full gradients is expensive; stochastic gradient descent reduces per-iteration cost to O(bd) for mini-batch size b (often b ≪ n). However, more iterations are needed: for convex problems with diminishing step sizes, expected suboptimality decays as O(1/\sqrt{t}), so achieving accuracy \epsilon in function value requires T = O(1/\epsilon^2) stochastic steps. The total work is then O(bd/\epsilon^2). In strongly convex settings, with carefully tuned diminishing steps (e.g., \eta_t = c/(\mu(t+1))), SGD can achieve O(1/t) rates in expectation, improving the total work to O(bd/\epsilon). Space complexity remains O(d) for both GD and SGD, plus O(bd) transiently to load a mini-batch. Ill-conditioning increases \kappa, slowing deterministic GD; preconditioning (feature scaling) effectively reduces \kappa and thus the iteration count without changing per-iteration cost.
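
To make the O(bd) per-iteration cost concrete, one mini-batch SGD step for least squares can be written as two nested loops over the b sampled rows and the d coordinates. A minimal sketch; the function and variable names are illustrative and not taken from the worked examples below.

#include <bits/stdc++.h>
using namespace std;

// One mini-batch SGD step for least squares f(w) = (1/(2N)) * sum_i (a_i . w - b_i)^2.
// The loops touch b rows of dimension d, so the step costs O(b*d) time and O(d) extra space.
void minibatch_sgd_step(const vector<vector<double>>& A,   // N x d data matrix
                        const vector<double>& y,           // N targets
                        vector<double>& w,                 // parameters (updated in place)
                        const vector<int>& batch,          // b sampled row indices
                        double eta) {                      // step size
    size_t d = w.size();
    vector<double> grad(d, 0.0);
    for (int i : batch) {
        double r = -y[i];
        for (size_t j = 0; j < d; ++j) r += A[i][j] * w[j];   // residual a_i . w - b_i
        for (size_t j = 0; j < d; ++j) grad[j] += r * A[i][j];
    }
    for (size_t j = 0; j < d; ++j) w[j] -= eta * grad[j] / batch.size();
}

int main() {
    // Tiny usage example: 4 samples, d = 2, one step with mini-batch size b = 2.
    vector<vector<double>> A = {{1, 0}, {0, 1}, {1, 1}, {2, 1}};
    vector<double> y = {1, 2, 3, 5};
    vector<double> w = {0.0, 0.0};
    vector<int> batch = {0, 2};
    minibatch_sgd_step(A, y, w, batch, 0.1);
    cout << "w after one step: " << w[0] << " " << w[1] << "\n";
    return 0;
}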

Code Examples

Linear convergence on a strongly convex quadratic with η = 1/L
#include <bits/stdc++.h>
using namespace std;

// Strongly convex quadratic: f(x) = 0.5 * x^T A x - b^T x, with diagonal SPD A
// For diagonal A, L = max diag(A), mu = min diag(A). x* = A^{-1} b.

struct Quad {
    vector<double> diagA; // diagonal entries of A (positive)
    vector<double> b;     // vector b

    // Evaluate f(x)
    double f(const vector<double>& x) const {
        double v = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            v += 0.5 * diagA[i] * x[i] * x[i] - b[i] * x[i];
        }
        return v;
    }

    // Gradient: A x - b (since A is diagonal)
    vector<double> grad(const vector<double>& x) const {
        vector<double> g(x.size());
        for (size_t i = 0; i < x.size(); ++i) g[i] = diagA[i] * x[i] - b[i];
        return g;
    }

    // Closed-form minimizer x* = A^{-1} b (element-wise since A is diagonal)
    vector<double> x_star() const {
        vector<double> xs(diagA.size());
        for (size_t i = 0; i < diagA.size(); ++i) xs[i] = b[i] / diagA[i];
        return xs;
    }
};

// Euclidean norm
double norm2(const vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return sqrt(s);
}

vector<double> sub(const vector<double>& a, const vector<double>& b) {
    vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 5D problem with condition number kappa = L/mu = 10
    vector<double> diagA = {1.0, 2.0, 3.0, 5.0, 10.0}; // mu=1, L=10
    vector<double> b = {1.0, -2.0, 0.5, 3.0, -1.0};
    Quad prob{diagA, b};

    const double L = *max_element(diagA.begin(), diagA.end());
    const double mu = *min_element(diagA.begin(), diagA.end());
    const double eta = 1.0 / L; // theoretical safe step size

    vector<double> x(5, 0.0); // start at zero
    vector<double> xstar = prob.x_star();
    double dist0 = norm2(sub(x, xstar));

    cout << fixed << setprecision(6);
    cout << "# t\t||x_t - x*||\tBound (1 - mu/L)^t * ||x0 - x*||\t f(x_t) - f(x*)\n";

    // Precompute f(x*)
    double fxstar = prob.f(xstar);

    int T = 40;
    for (int t = 0; t <= T; ++t) {
        double bound = pow(1.0 - mu / L, t) * dist0; // linear convergence bound for eta=1/L
        double dist = norm2(sub(x, xstar));
        double gap = prob.f(x) - fxstar;
        cout << t << "\t" << dist << "\t" << bound << "\t" << gap << "\n";

        // Next iterate: x_{t+1} = x_t - eta * grad
        auto g = prob.grad(x);
        for (size_t i = 0; i < x.size(); ++i) x[i] -= eta * g[i];
    }

    return 0;
}

This program minimizes a strongly convex diagonal quadratic. It uses η = 1/L and reports the distance to the optimum and the theoretical linear bound (1 − μ/L)^t ||x0 − x*||. Because the problem is diagonal, L and μ are simply the largest and smallest diagonal entries. You can observe near-perfect agreement with the bound, illustrating linear convergence.

Time: O(d · T), since each iteration computes a gradient and updates in O(d).
Space: O(d), storing x, gradient, and constants.
Non-convex example: stationarity improves at O(1/t)
#include <bits/stdc++.h>
using namespace std;

// Non-convex 1D function: f(x) = 0.25 * x^4 - 0.5 * x^2
// Gradient: f'(x) = x^3 - x; L-smooth on bounded sets; globally non-convex.

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto f = [](double x) { return 0.25 * x * x * x * x - 0.5 * x * x; };
    auto g = [](double x) { return x * x * x - x; };

    double x = 0.9;   // start near a non-convex region
    double eta = 0.2; // modest step size to ensure descent

    cout << fixed << setprecision(6);
    cout << "# t\tx_t\tf(x_t)\t|grad|\tmin_{s<=t}|grad|^2 * t\n";

    double minGrad2 = numeric_limits<double>::infinity();
    int T = 200;
    for (int t = 1; t <= T; ++t) {
        double grad = g(x);
        minGrad2 = min(minGrad2, grad * grad);
        // Print a diagnostic capturing the O(1/t) stationarity trend
        cout << t << "\t" << x << "\t" << f(x) << "\t" << fabs(grad) << "\t" << (minGrad2 * t) << "\n";
        // Update
        x = x - eta * grad;
    }

    return 0;
}

We optimize a simple non-convex function with two minima at x = ±1 and a stationary point (local maximum) at x = 0. Although we cannot guarantee reaching the global minimum in general, the minimum gradient norm squared among the first t iterates is bounded by O(1/t). The printed quantity min_{s≤t} ||∇f(x_s)||^2 · t should remain bounded (here it actually decays, since GD quickly settles into the basin around x = 1), illustrating the non-convex O(1/t) stationarity guarantee.

Time: O(T), constant-time gradient and update per step.
Space: O(1), only scalars are stored.
SGD in 1D least squares with diminishing steps (O(1/√t) expected rate)
#include <bits/stdc++.h>
using namespace std;

// 1D stochastic least squares: f(w) = (1/(2N)) sum_i (a_i*w - b_i)^2
// True minimizer: w* = (sum a_i b_i) / (sum a_i^2). We simulate SGD with eta_t = eta0 / sqrt(t).

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);
    std::uniform_real_distribution<double> a_dist(0.5, 2.0);

    const int N = 2000;
    const double w_true = 1.7;

    vector<double> a(N), b(N);
    for (int i = 0; i < N; ++i) {
        a[i] = a_dist(rng);
        b[i] = a[i] * w_true + noise(rng); // y = a * w_true + noise
    }

    // Compute closed-form w* and f(w*) to quantify suboptimality
    double Sa2 = 0.0, Sab = 0.0;
    for (int i = 0; i < N; ++i) { Sa2 += a[i]*a[i]; Sab += a[i]*b[i]; }
    double w_star = Sab / Sa2;
    auto f = [&](double w){
        double s = 0.0;
        for (int i = 0; i < N; ++i) {
            double r = a[i]*w - b[i];
            s += 0.5 * r * r;
        }
        return s / N;
    };
    double f_star = f(w_star);

    // SGD with diminishing steps; sample one index per iteration
    std::uniform_int_distribution<int> uni(0, N-1);

    double w = 0.0;    // initialize
    double eta0 = 0.5; // base step size (tuned)

    cout << fixed << setprecision(6);
    cout << "# t\tw_t\tf(w_t)-f*\t sqrt(t)*(f(w_t)-f*)\n";

    int T = 20000;
    for (int t = 1; t <= T; ++t) {
        int i = uni(rng);
        double grad = (a[i]*w - b[i]) * a[i]; // stochastic gradient from one sample
        double eta = eta0 / sqrt((double)t);
        w -= eta * grad;

        if (t % 1000 == 0) {
            double gap = max(0.0, f(w) - f_star);
            cout << t << "\t" << w << "\t" << gap << "\t" << (sqrt((double)t) * gap) << "\n";
        }
    }

    return 0;
}

We run SGD on a 1D least-squares problem where w* and f(w*) are known in closed form. With η_t = η0/√t, convex SGD theory predicts an O(1/√t) expected gap. The diagnostic √t·(f(w_t)−f*) should trend toward a constant (up to noise), illustrating the sublinear rate.

Time: O(T), one data point processed per iteration; occasional O(N) passes only for evaluation.
Space: O(N) to store the dataset and O(1) for parameters; if evaluations are skipped or streamed, it can be O(1).