Adam & Adaptive Methods
Key Points
- Adam is an optimization algorithm that combines momentum (first moment) with RMSProp-style adaptive learning rates (second moment).
- It keeps exponentially decaying averages of gradients and squared gradients, then corrects their bias in early steps.
- The update step rescales each parameter by an estimate of its recent gradient magnitude, stabilizing training across coordinates.
- Typical hyperparameters (\alpha=10^{-3}, \beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}) often work out of the box, but tuning still matters for best results.
- Adam uses more memory than plain SGD because it stores two extra vectors (m and v) per parameter.
- On ill-conditioned problems, Adam typically converges faster than vanilla SGD, but it may generalize worse unless properly regularized (e.g., with AdamW).
- Bias correction is essential in early iterations; omitting it makes steps too small and slows learning.
- Momentum, RMSProp, and Adam are closely related: Adam effectively blends momentum with RMSProp-style per-parameter scaling, plus bias correction.
Prerequisites
- Gradient and differentiation — Adam uses gradients of the loss with respect to parameters for updates.
- Stochastic gradient descent (SGD) — Adam builds on SGD and improves its convergence via moments and adaptivity.
- Exponential moving averages — Understanding EMA is necessary to interpret Adam's first and second moment updates.
- Vectorized operations and norms — Parameter updates are element-wise and rely on vector math efficiency.
- Mean squared error and basic loss functions — Examples compute gradients of losses; familiarity helps follow derivations.
- Numerical stability — The epsilon term and scaling guard against division by zero and overflow/underflow.
- Regularization (L2, weight decay) — Knowing the difference between L2 and decoupled weight decay clarifies Adam vs. AdamW.
Detailed Explanation
01 Overview
Adam (Adaptive Moment Estimation) is a popular first-order optimization algorithm used to train machine learning models, especially deep neural networks. It extends stochastic gradient descent (SGD) by keeping track of two exponentially weighted moving averages: the first moment (mean) of gradients and the second raw moment (uncentered variance) of gradients. By combining momentum (which smooths directions) with per-parameter adaptive step sizes (which stabilize updates), Adam often achieves faster convergence with minimal hyperparameter tuning. In practice, Adam adjusts each parameter’s learning rate based on how large and how stable its recent gradients have been, enabling robust training on noisy, sparse, or ill-conditioned problems. The algorithm also applies bias correction to compensate for the initial zero-initialized moving averages, preventing early steps from being underestimated. Adam’s defaults (e.g., \beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}) commonly work well, which contributes to its wide adoption. However, Adam requires additional memory for its moment estimates and can sometimes yield worse generalization than SGD without proper regularization (leading to variants like AdamW).
02 Intuition & Analogies
Imagine hiking down a mountain toward a valley in foggy weather. Vanilla SGD is like taking steps directly downhill based on the local slope: quick but jittery, and it can bounce side-to-side in narrow valleys. Momentum adds a sense of inertia—if you’ve been moving east for a while, you keep some eastward push—smoothing noise and accelerating through gentle slopes. RMSProp adds a smart pair of boots that adjust stride length per direction: if recent steps in a direction were steep (large gradients), you shorten your stride there to avoid overshooting; if they were gentle (small gradients), you lengthen the stride. Adam puts both ideas together: it remembers your preferred direction (momentum) and adapts stride length for each direction (RMSProp-like scaling). Early in the hike, because your memory of past motion is still forming, Adam corrects for this by slightly amplifying those early estimates (bias correction) so you don’t crawl unnecessarily. In high-dimensional terrains like neural networks, different parameters can experience very different slopes and noise levels at the same time. Adam’s per-parameter stride adjustment means each coordinate gets an appropriate step size automatically. This is especially helpful when some features are rare or gradients are sparse—Adam can take larger steps where gradients are consistently small and smaller steps where gradients are volatile, often leading to faster, steadier progress toward good solutions.
03 Formal Definition
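Stated compactly (all operations element-wise, with m_0 = v_0 = 0 and g_t the gradient of the loss at step t), one Adam step is:

```latex
\begin{aligned}
g_t &= \nabla_\theta f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

Here \alpha is the learning rate, \beta_1 and \beta_2 are the decay rates of the two moving averages, and \epsilon is a small constant for numerical stability.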
04 When to Use
Use Adam when training deep neural networks with noisy, sparse, or highly non-stationary gradients—such as NLP models with rare tokens, recommender systems, or vision models in early training. It’s especially effective when your features vary in scale or when you lack time to fine-tune per-layer learning rates. Adam often shines for quick prototyping because the default hyperparameters perform reasonably well across many tasks. If your objective is ill-conditioned (e.g., skinny valleys) or you observe oscillations with SGD, Adam tends to stabilize and speed up training. Prefer RMSProp when you want simplicity with adaptive scaling but without momentum’s full effect, or Momentum SGD when you care about final generalization on vision tasks and can afford to tune learning-rate schedules carefully. Choose AdamW over classic Adam when you need weight decay regularization that better matches SGD’s effect. For convex, well-conditioned problems (e.g., simple linear regression with normalized features), plain SGD with momentum or even vanilla SGD may be sufficient and more memory-efficient.
⚠️ Common Mistakes
- Forgetting bias correction: Omitting \hat{m}_t and \hat{v}_t makes early steps too small, slowing convergence and misleading you about optimal learning rates. Fix: always include the bias-correction factors 1-\beta_1^t and 1-\beta_2^t.
- Using too large an \epsilon: While \epsilon prevents division by zero, setting it too high dampens adaptivity. Typical values are 10^{-8} to 10^{-7}.
- Confusing weight decay with L2 regularization: In Adam, adding L2 to the loss is not equivalent to decoupled weight decay (AdamW). Fix: if you want SGD-like regularization, use AdamW's decoupled decay.
- Not tuning the learning rate: Defaults often work, but \alpha still matters. If the loss plateaus or diverges, reduce \alpha or use warmup/schedules.
- Forgetting to reset optimizer state when reusing models: Adam's m and v must match the current training run; stale state can harm convergence.
- Over-trusting adaptive optimizers for generalization: Adam may converge faster but overfit or generalize worse. Consider switching to SGD later in training or applying proper regularization.
- Mixing batch normalization with an aggressive \beta_2: Very high \beta_2 (close to 1) can make v_t adapt too slowly. Try a slightly smaller \beta_2 or a larger batch size.
Key Formulas
First Moment Update: m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
Explanation: This keeps an exponential moving average of recent gradients. It smooths noise and preserves direction, similar to momentum.
Second Moment Update: v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Explanation: This tracks the recent magnitude of gradients (their uncentered variance). Larger v_t implies smaller effective step sizes in those coordinates.
Bias Correction: \hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)
Explanation: Because m and v start at zero, their early values are biased low. Dividing by (1 - \beta^t) corrects this and yields unbiased estimates.
Adam Update: \theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
Explanation: Parameters move opposite the gradient mean, scaled by the RMS of recent gradients. Epsilon prevents division by zero and stabilizes updates.
Momentum SGD: m_t = \beta m_{t-1} + (1 - \beta) g_t, \quad \theta_t = \theta_{t-1} - \alpha m_t
Explanation: Momentum SGD averages gradients over time, accelerating along consistent directions and damping oscillations without adaptive per-coordinate scaling.
RMSProp: v_t = \beta v_{t-1} + (1 - \beta) g_t^2, \quad \theta_t = \theta_{t-1} - \alpha g_t / (\sqrt{v_t} + \epsilon)
Explanation: RMSProp divides the gradient by a running RMS of gradient magnitudes, yielding per-parameter adaptive learning rates.
AMSGrad: \hat{v}^{max}_t = \max(\hat{v}^{max}_{t-1}, \hat{v}_t), \quad \theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}^{max}_t} + \epsilon)
Explanation: AMSGrad ensures the denominator does not decrease over time, which helps theoretical convergence by preventing overly large steps.
AdamW (Decoupled Weight Decay): \theta_t = \theta_{t-1} - \alpha (\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_{t-1})
Explanation: AdamW applies weight decay separately from the adaptive gradient step, matching the regularization effect typical of SGD with weight decay.
Closed-form EMA: m_t = (1 - \beta) \sum_{i=1}^{t} \beta^{t-i} g_i
Explanation: The exponential moving average at time t is a weighted sum of all past values with exponentially decaying weights. It clarifies the memory depth of EMAs.
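A useful corollary of the closed form: the weights sum to 1 - \beta^t (exactly the bias-correction factor), and the effective averaging window is roughly 1/(1-\beta) steps, so the default decay rates correspond to:

```latex
\sum_{i=1}^{t} (1-\beta)\,\beta^{t-i} = 1 - \beta^t,
\qquad \text{effective window} \approx \frac{1}{1-\beta}:
\quad \beta_1 = 0.9 \Rightarrow \approx 10 \text{ steps},
\quad \beta_2 = 0.999 \Rightarrow \approx 1000 \text{ steps}.
```

This is why the first moment reacts quickly to direction changes while the second moment, and hence the per-parameter step size, changes much more slowly.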
Adam Complexity: time O(n d) per epoch, extra memory O(d)
Explanation: For n data points and d parameters, each pass costs linear time in both n and d, and memory is linear in d due to the two extra moment vectors.
Complexity Analysis
Each Adam step adds O(d) arithmetic on top of the gradient computation, plus O(d) extra memory for the two moment vectors m and v — roughly three times the parameter state of plain SGD. Over an epoch of n examples, gradient evaluation dominates and the total time is O(n d), the same asymptotic cost as SGD.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Adam {
    // Hyperparameters
    double alpha;  // learning rate
    double beta1;  // decay for first moment
    double beta2;  // decay for second moment
    double eps;    // numerical stability

    // State
    vector<double> m;  // first moment
    vector<double> v;  // second moment
    long long t;       // time step

    Adam(size_t dim, double alpha_ = 1e-3, double beta1_ = 0.9,
         double beta2_ = 0.999, double eps_ = 1e-8)
        : alpha(alpha_), beta1(beta1_), beta2(beta2_), eps(eps_),
          m(dim, 0.0), v(dim, 0.0), t(0) {}

    // Update parameters theta in-place given gradient g
    void step(vector<double>& theta, const vector<double>& g) {
        assert(theta.size() == g.size());
        if (m.size() != theta.size()) {
            m.assign(theta.size(), 0.0);
            v.assign(theta.size(), 0.0);
        }
        t++;
        double b1t = pow(beta1, (double)t);
        double b2t = pow(beta2, (double)t);
        double corr1 = 1.0 - b1t;  // bias-correction denominator for m
        double corr2 = 1.0 - b2t;  // bias-correction denominator for v
        for (size_t i = 0; i < theta.size(); ++i) {
            // Update biased first and second raw moments
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i];
            v[i] = beta2 * v[i] + (1.0 - beta2) * (g[i] * g[i]);
            // Bias-corrected moments
            double mhat = m[i] / corr1;
            double vhat = v[i] / corr2;
            // Parameter update
            theta[i] -= alpha * mhat / (sqrt(vhat) + eps);
        }
    }
};

int main() {
    // Example: minimize f(theta) = 0.5 * sum_i theta_i^2 (quadratic bowl)
    // True minimum at theta = 0
    const size_t d = 3;
    vector<double> theta = {5.0, -3.0, 2.0};

    Adam opt(d, 0.1);  // a larger LR works fine for this toy problem

    auto grad = [&](const vector<double>& th) {
        vector<double> g(d);
        for (size_t i = 0; i < d; ++i) g[i] = th[i];  // gradient of 0.5*||theta||^2 is theta
        return g;
    };

    for (int it = 1; it <= 200; ++it) {
        vector<double> g = grad(theta);
        opt.step(theta, g);
        if (it % 50 == 0) {
            double f = 0.0;
            for (double vi : theta) f += 0.5 * vi * vi;
            cout << "iter " << it << ": f= " << f << " theta= ";
            for (double vi : theta) cout << vi << ' ';
            cout << '\n';
        }
    }
    return 0;
}
```
This program defines a minimal Adam optimizer for dense parameter vectors, including bias correction. It then minimizes a simple quadratic function whose gradient is theta itself. The optimizer updates first and second moment estimates and applies the Adam step to move parameters toward zero.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct SGD {
    double lr;
    explicit SGD(double lr_ = 1e-2) : lr(lr_) {}
    void step(vector<double>& theta, const vector<double>& g) {
        for (size_t i = 0; i < theta.size(); ++i) theta[i] -= lr * g[i];
    }
};

struct Adam {
    double alpha, beta1, beta2, eps;
    vector<double> m, v;
    long long t;
    Adam(size_t d, double a = 1e-2, double b1 = 0.9, double b2 = 0.999, double e = 1e-8)
        : alpha(a), beta1(b1), beta2(b2), eps(e), m(d, 0.0), v(d, 0.0), t(0) {}
    void step(vector<double>& theta, const vector<double>& g) {
        t++;
        double corr1 = 1 - pow(beta1, (double)t);
        double corr2 = 1 - pow(beta2, (double)t);
        for (size_t i = 0; i < theta.size(); ++i) {
            m[i] = beta1 * m[i] + (1 - beta1) * g[i];
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
            double mhat = m[i] / corr1;
            double vhat = v[i] / corr2;
            theta[i] -= alpha * mhat / (sqrt(vhat) + eps);
        }
    }
};

struct Dataset { vector<double> x, y; };

Dataset make_data(int n, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.5);
    uniform_real_distribution<double> ux(-2.0, 2.0);
    Dataset D;
    D.x.resize(n);
    D.y.resize(n);
    // True model: y = 3x + 2 + noise
    for (int i = 0; i < n; ++i) {
        double xi = ux(rng);
        D.x[i] = xi;
        D.y[i] = 3.0 * xi + 2.0 + noise(rng);
    }
    return D;
}

// Compute gradients for linear model y_hat = w*x + b under MSE
vector<double> grad_batch(const Dataset& D, const vector<int>& idx, const vector<double>& theta) {
    double w = theta[0], b = theta[1];
    double gw = 0.0, gb = 0.0;
    int B = (int)idx.size();
    for (int i : idx) {
        double err = w * D.x[i] + b - D.y[i];
        gw += 2.0 * err * D.x[i];
        gb += 2.0 * err;
    }
    gw /= B;
    gb /= B;
    return {gw, gb};
}

int main() {
    Dataset D = make_data(400);
    int n = (int)D.x.size();

    // Initialize parameters (w, b)
    vector<double> theta_sgd = {0.0, 0.0};
    vector<double> theta_adam = {0.0, 0.0};

    SGD sgd(1e-2);
    Adam adam(2, 5e-2);  // Adam can often use a larger LR on this toy

    mt19937 rng(123);
    int epochs = 20, B = 32;

    vector<int> order(n);
    iota(order.begin(), order.end(), 0);

    // Train with mini-batches
    for (int ep = 1; ep <= epochs; ++ep) {
        shuffle(order.begin(), order.end(), rng);
        for (int s = 0; s < n; s += B) {
            vector<int> idx;
            idx.reserve(B);
            for (int i = s; i < min(n, s + B); ++i) idx.push_back(order[i]);
            auto g1 = grad_batch(D, idx, theta_sgd);
            auto g2 = grad_batch(D, idx, theta_adam);
            sgd.step(theta_sgd, g1);
            adam.step(theta_adam, g2);
        }
        // Evaluate MSE at epoch end
        auto mse = [&](const vector<double>& th) {
            double loss = 0.0;
            for (int i = 0; i < n; ++i) {
                double e = th[0] * D.x[i] + th[1] - D.y[i];
                loss += e * e;
            }
            return loss / n;
        };
        cout << "epoch " << ep << " | MSE SGD= " << mse(theta_sgd)
             << " | MSE Adam= " << mse(theta_adam)
             << " | (w,b) Adam= (" << theta_adam[0] << "," << theta_adam[1] << ")\n";
    }
    return 0;
}
```
This program fits a simple linear regression model y = w x + b using mini-batch training. It trains two models in parallel: one with vanilla SGD and one with Adam, showing typical faster and more stable convergence for Adam on this toy problem. Bias correction and per-parameter scaling are included in Adam.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Optim {
    virtual void step(vector<double>& theta, const vector<double>& g) = 0;
    virtual ~Optim() = default;
};

struct SGD : Optim {
    double lr;
    explicit SGD(double lr_) : lr(lr_) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) th[i] -= lr * g[i];
    }
};

struct Momentum : Optim {
    double lr, beta;
    vector<double> m;
    Momentum(size_t d, double lr_, double beta_ = 0.9) : lr(lr_), beta(beta_), m(d, 0.0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) {
            m[i] = beta * m[i] + (1 - beta) * g[i];
            th[i] -= lr * m[i];
        }
    }
};

struct RMSProp : Optim {
    double lr, beta, eps;
    vector<double> v;
    RMSProp(size_t d, double lr_, double beta_ = 0.99, double eps_ = 1e-8)
        : lr(lr_), beta(beta_), eps(eps_), v(d, 0.0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) {
            v[i] = beta * v[i] + (1 - beta) * g[i] * g[i];
            th[i] -= lr * g[i] / (sqrt(v[i]) + eps);
        }
    }
};

struct Adam : Optim {
    double a, b1, b2, eps;
    vector<double> m, v;
    long long t;
    Adam(size_t d, double a_ = 1e-2, double b1_ = 0.9, double b2_ = 0.999, double e_ = 1e-8)
        : a(a_), b1(b1_), b2(b2_), eps(e_), m(d, 0.0), v(d, 0.0), t(0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        t++;
        double c1 = 1 - pow(b1, (double)t), c2 = 1 - pow(b2, (double)t);
        for (size_t i = 0; i < th.size(); ++i) {
            m[i] = b1 * m[i] + (1 - b1) * g[i];
            v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i];
            double mh = m[i] / c1, vh = v[i] / c2;
            th[i] -= a * mh / (sqrt(vh) + eps);
        }
    }
};

// f(x,y) = 0.5*(100*x^2 + y^2)
// grad = (100*x, y)
int main() {
    auto grad = [](const vector<double>& th) { return vector<double>{100.0 * th[0], th[1]}; };
    auto loss = [](const vector<double>& th) { return 0.5 * (100.0 * th[0] * th[0] + th[1] * th[1]); };

    vector<pair<string, unique_ptr<Optim>>> opts;
    vector<double> init = {1.0, 1.0};
    opts.push_back({"SGD", make_unique<SGD>(0.01)});
    opts.push_back({"Momentum", make_unique<Momentum>(2, 0.05, 0.9)});
    opts.push_back({"RMSProp", make_unique<RMSProp>(2, 0.1, 0.99, 1e-8)});
    opts.push_back({"Adam", make_unique<Adam>(2, 0.05)});

    for (auto& kv : opts) {
        const string& name = kv.first;
        auto& opt = kv.second;
        vector<double> th = init;
        for (int it = 1; it <= 2000; ++it) {
            auto g = grad(th);
            opt->step(th, g);
            if (it == 200 || it == 1000 || it == 2000) {
                cout << name << " it=" << it << " loss=" << loss(th)
                     << " theta=(" << th[0] << "," << th[1] << ")\n";
            }
        }
    }
    return 0;
}
```
This experiment minimizes an ill-conditioned quadratic where curvature differs by 100× across axes. SGD zig-zags along the steep direction, Momentum damps oscillations, RMSProp adapts step sizes per-coordinate, and Adam combines momentum with adaptivity. You should observe Adam and RMSProp stabilize the x-direction quickly, with Momentum also helping relative to plain SGD.