Adam & Adaptive Methods
Key Points
- Adam is an optimization algorithm that combines momentum (first moment) with RMSProp-style adaptive learning rates (second moment).
- It keeps exponentially decaying averages of gradients and squared gradients, then corrects their bias in early steps.
- The update step rescales each parameter by an estimate of its recent gradient magnitude, stabilizing training across coordinates.
- Typical hyperparameters (\alpha=10^{-3}, \beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}) often work out of the box, but tuning still matters for best results.
- Adam uses more memory than plain SGD because it stores two extra vectors (m and v) per parameter.
- On ill-conditioned problems, Adam typically converges faster than vanilla SGD, but it may generalize worse unless properly regularized (e.g., with AdamW).
- Bias correction is essential in early iterations; omitting it makes steps too small and slows learning.
- Momentum, RMSProp, and Adam are closely related: Adam effectively blends momentum with RMSProp-style per-parameter scaling, plus bias correction.
Prerequisites
- Gradient and differentiation — Adam uses gradients of the loss with respect to parameters for updates.
- Stochastic gradient descent (SGD) — Adam builds on SGD and improves its convergence via moments and adaptivity.
- Exponential moving averages — Understanding EMA is necessary to interpret Adam's first and second moment updates.
- Vectorized operations and norms — Parameter updates are element-wise and rely on vector math efficiency.
- Mean squared error and basic loss functions — Examples compute gradients of losses; familiarity helps follow derivations.
- Numerical stability — The epsilon term and scaling guard against division by zero and overflow/underflow.
- Regularization (L2, weight decay) — Knowing the difference between L2 and decoupled weight decay clarifies Adam vs. AdamW.
Detailed Explanation
01 Overview
Adam (Adaptive Moment Estimation) is a popular first-order optimization algorithm used to train machine learning models, especially deep neural networks. It extends stochastic gradient descent (SGD) by keeping track of two exponentially weighted moving averages: the first moment (mean) of gradients and the second raw moment (uncentered variance) of gradients. By combining momentum (which smooths directions) with per-parameter adaptive step sizes (which stabilize updates), Adam often achieves faster convergence with minimal hyperparameter tuning. In practice, Adam adjusts each parameter’s learning rate based on how large and how stable its recent gradients have been, enabling robust training on noisy, sparse, or ill-conditioned problems. The algorithm also applies bias correction to compensate for the initial zero-initialized moving averages, preventing early steps from being underestimated. Adam’s defaults (e.g., \beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}) commonly work well, which contributes to its wide adoption. However, Adam requires additional memory for its moment estimates and can sometimes yield worse generalization than SGD without proper regularization (leading to variants like AdamW).
02 Intuition & Analogies
Imagine hiking down a mountain toward a valley in foggy weather. Vanilla SGD is like taking steps directly downhill based on the local slope: quick but jittery, and it can bounce side-to-side in narrow valleys. Momentum adds a sense of inertia—if you’ve been moving east for a while, you keep some eastward push—smoothing noise and accelerating through gentle slopes. RMSProp adds a smart pair of boots that adjust stride length per direction: if recent steps in a direction were steep (large gradients), you shorten your stride there to avoid overshooting; if they were gentle (small gradients), you lengthen the stride. Adam puts both ideas together: it remembers your preferred direction (momentum) and adapts stride length for each direction (RMSProp-like scaling). Early in the hike, because your memory of past motion is still forming, Adam corrects for this by slightly amplifying those early estimates (bias correction) so you don’t crawl unnecessarily. In high-dimensional terrains like neural networks, different parameters can experience very different slopes and noise levels at the same time. Adam’s per-parameter stride adjustment means each coordinate gets an appropriate step size automatically. This is especially helpful when some features are rare or gradients are sparse—Adam can take larger steps where gradients are consistently small and smaller steps where gradients are volatile, often leading to faster, steadier progress toward good solutions.
03 Formal Definition
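Stated compactly (all operations element-wise, with m_0 = v_0 = 0 and g_t the gradient of the loss at step t), one Adam step is:

```latex
\begin{aligned}
g_t &= \nabla_\theta f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

Here \alpha is the learning rate, \beta_1 and \beta_2 are the decay rates of the two moving averages, and \epsilon is a small constant for numerical stability.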
04 When to Use
Use Adam when training deep neural networks with noisy, sparse, or highly non-stationary gradients—such as NLP models with rare tokens, recommender systems, or vision models in early training. It’s especially effective when your features vary in scale or when you lack time to fine-tune per-layer learning rates. Adam often shines for quick prototyping because the default hyperparameters perform reasonably well across many tasks. If your objective is ill-conditioned (e.g., skinny valleys) or you observe oscillations with SGD, Adam tends to stabilize and speed up training. Prefer RMSProp when you want simplicity with adaptive scaling but without momentum’s full effect, or Momentum SGD when you care about final generalization on vision tasks and can afford to tune learning-rate schedules carefully. Choose AdamW over classic Adam when you need weight decay regularization that better matches SGD’s effect. For convex, well-conditioned problems (e.g., simple linear regression with normalized features), plain SGD with momentum or even vanilla SGD may be sufficient and more memory-efficient.
⚠️ Common Mistakes
- Forgetting bias correction: Omitting \hat{m}_t and \hat{v}_t makes early steps too small, slowing convergence and misleading you about optimal learning rates. Fix: always include the bias-correction factors 1-\beta_1^t and 1-\beta_2^t.
- Using too large an \epsilon: While \epsilon prevents division by zero, setting it too high dampens adaptivity. Typical values are 10^{-8} to 10^{-7}.
- Confusing weight decay with L2 regularization: In Adam, adding L2 to the loss is not equivalent to decoupled weight decay (AdamW). Fix: if you want SGD-like regularization, use AdamW's decoupled decay.
- Not tuning the learning rate: Defaults often work, but \alpha still matters. If the loss plateaus or diverges, reduce \alpha or use warmup/schedules.
- Forgetting to reset optimizer state when reusing models: Adam's m and v must match the current training run; stale state can harm convergence.
- Over-trusting adaptive optimizers for generalization: Adam may converge faster but overfit or generalize worse. Consider switching to SGD later in training or applying proper regularization.
- Mixing batch normalization with an aggressive \beta_2: Very high \beta_2 (close to 1) can make v_t adapt too slowly. Try a slightly smaller \beta_2 or a larger batch size.
Key Formulas
First Moment Update: m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
Explanation: This keeps an exponential moving average of recent gradients. It smooths noise and preserves direction, similar to momentum.
Second Moment Update: v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Explanation: This tracks the recent magnitude of gradients (their uncentered variance). Larger v_t implies smaller effective step sizes in those coordinates.
Bias Correction: \hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)
Explanation: Because m and v start at zero, their early values are biased low. Dividing by (1 - \beta^t) corrects this and yields unbiased estimates.
Adam Update: \theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
Explanation: Parameters move opposite the gradient mean, scaled by the RMS of recent gradients. Epsilon prevents division by zero and stabilizes updates.
Momentum SGD: m_t = \beta m_{t-1} + (1 - \beta) g_t, \quad \theta_t = \theta_{t-1} - \alpha m_t
Explanation: Momentum SGD averages gradients over time, accelerating along consistent directions and damping oscillations without adaptive per-coordinate scaling.
RMSProp: v_t = \beta v_{t-1} + (1 - \beta) g_t^2, \quad \theta_t = \theta_{t-1} - \alpha g_t / (\sqrt{v_t} + \epsilon)
Explanation: RMSProp divides the gradient by a running RMS of gradient magnitudes, yielding per-parameter adaptive learning rates.
AMSGrad: \hat{v}^{max}_t = \max(\hat{v}^{max}_{t-1}, \hat{v}_t), \quad \theta_t = \theta_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}^{max}_t} + \epsilon)
Explanation: AMSGrad ensures the denominator does not decrease over time, which helps theoretical convergence by preventing overly large steps.
AdamW (Decoupled Weight Decay): \theta_t = \theta_{t-1} - \alpha (\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_{t-1})
Explanation: AdamW applies weight decay separately from the adaptive gradient step, matching the regularization effect typical of SGD with weight decay.
Closed-form EMA: m_t = (1 - \beta) \sum_{i=1}^{t} \beta^{t-i} g_i
Explanation: The exponential moving average at time t is a weighted sum of all past values with exponentially decaying weights. It clarifies the memory depth of EMAs.
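A useful corollary of the closed form: the weights sum to 1 - \beta^t (exactly the bias-correction factor), and the effective averaging window is roughly 1/(1-\beta) steps, so the default decay rates correspond to:

```latex
\sum_{i=1}^{t} (1-\beta)\,\beta^{t-i} = 1 - \beta^t,
\qquad \text{effective window} \approx \frac{1}{1-\beta}:
\quad \beta_1 = 0.9 \Rightarrow \approx 10 \text{ steps},
\quad \beta_2 = 0.999 \Rightarrow \approx 1000 \text{ steps}.
```

This is why the first moment reacts quickly to direction changes while the second moment, and hence the per-parameter step size, changes much more slowly.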
Adam Complexity: time O(n d) per epoch, extra memory O(d)
Explanation: For n data points and d parameters, each pass costs linear time in both n and d, and memory is linear in d due to the two extra moment vectors.
Complexity Analysis
Each Adam step adds O(d) arithmetic on top of the gradient computation, plus O(d) extra memory for the two moment vectors m and v — roughly three times the parameter state of plain SGD. Over an epoch of n examples, gradient evaluation dominates and the total time is O(n d), the same asymptotic cost as SGD.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Adam {
    // Hyperparameters
    double alpha;  // learning rate
    double beta1;  // decay for first moment
    double beta2;  // decay for second moment
    double eps;    // numerical stability

    // State
    vector<double> m;  // first moment
    vector<double> v;  // second moment
    long long t;       // time step

    Adam(size_t dim, double alpha_ = 1e-3, double beta1_ = 0.9,
         double beta2_ = 0.999, double eps_ = 1e-8)
        : alpha(alpha_), beta1(beta1_), beta2(beta2_), eps(eps_),
          m(dim, 0.0), v(dim, 0.0), t(0) {}

    // Update parameters theta in-place given gradient g
    void step(vector<double>& theta, const vector<double>& g) {
        assert(theta.size() == g.size());
        if (m.size() != theta.size()) {
            m.assign(theta.size(), 0.0);
            v.assign(theta.size(), 0.0);
        }
        t++;
        double b1t = pow(beta1, (double)t);
        double b2t = pow(beta2, (double)t);
        double corr1 = 1.0 - b1t;  // bias-correction denominator for m
        double corr2 = 1.0 - b2t;  // bias-correction denominator for v
        for (size_t i = 0; i < theta.size(); ++i) {
            // Update biased first and second raw moments
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i];
            v[i] = beta2 * v[i] + (1.0 - beta2) * (g[i] * g[i]);
            // Bias-corrected moments
            double mhat = m[i] / corr1;
            double vhat = v[i] / corr2;
            // Parameter update
            theta[i] -= alpha * mhat / (sqrt(vhat) + eps);
        }
    }
};

int main() {
    // Example: minimize f(theta) = 0.5 * sum_i theta_i^2 (quadratic bowl)
    // True minimum at theta = 0
    const size_t d = 3;
    vector<double> theta = {5.0, -3.0, 2.0};

    Adam opt(d, 0.1);  // a larger LR works fine for this toy problem

    auto grad = [&](const vector<double>& th) {
        vector<double> g(d);
        for (size_t i = 0; i < d; ++i) g[i] = th[i];  // gradient of 0.5*||theta||^2 is theta
        return g;
    };

    for (int it = 1; it <= 200; ++it) {
        vector<double> g = grad(theta);
        opt.step(theta, g);
        if (it % 50 == 0) {
            double f = 0.0;
            for (double vi : theta) f += 0.5 * vi * vi;
            cout << "iter " << it << ": f= " << f << " theta= ";
            for (double vi : theta) cout << vi << ' ';
            cout << '\n';
        }
    }
    return 0;
}
```
This program defines a minimal Adam optimizer for dense parameter vectors, including bias correction. It then minimizes a simple quadratic function whose gradient is theta itself. The optimizer updates first and second moment estimates and applies the Adam step to move parameters toward zero.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct SGD {
    double lr;
    explicit SGD(double lr_ = 1e-2) : lr(lr_) {}
    void step(vector<double>& theta, const vector<double>& g) {
        for (size_t i = 0; i < theta.size(); ++i) theta[i] -= lr * g[i];
    }
};

struct Adam {
    double alpha, beta1, beta2, eps;
    vector<double> m, v;
    long long t;
    Adam(size_t d, double a = 1e-2, double b1 = 0.9, double b2 = 0.999, double e = 1e-8)
        : alpha(a), beta1(b1), beta2(b2), eps(e), m(d, 0.0), v(d, 0.0), t(0) {}
    void step(vector<double>& theta, const vector<double>& g) {
        t++;
        double corr1 = 1 - pow(beta1, (double)t);
        double corr2 = 1 - pow(beta2, (double)t);
        for (size_t i = 0; i < theta.size(); ++i) {
            m[i] = beta1 * m[i] + (1 - beta1) * g[i];
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
            double mhat = m[i] / corr1;
            double vhat = v[i] / corr2;
            theta[i] -= alpha * mhat / (sqrt(vhat) + eps);
        }
    }
};

struct Dataset { vector<double> x, y; };

Dataset make_data(int n, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.5);
    uniform_real_distribution<double> ux(-2.0, 2.0);
    Dataset D;
    D.x.resize(n);
    D.y.resize(n);
    // True model: y = 3x + 2 + noise
    for (int i = 0; i < n; ++i) {
        double xi = ux(rng);
        D.x[i] = xi;
        D.y[i] = 3.0 * xi + 2.0 + noise(rng);
    }
    return D;
}

// Compute gradients for linear model y_hat = w*x + b under MSE
vector<double> grad_batch(const Dataset& D, const vector<int>& idx, const vector<double>& theta) {
    double w = theta[0], b = theta[1];
    double gw = 0.0, gb = 0.0;
    int B = (int)idx.size();
    for (int i : idx) {
        double err = w * D.x[i] + b - D.y[i];
        gw += 2.0 * err * D.x[i];
        gb += 2.0 * err;
    }
    gw /= B;
    gb /= B;
    return {gw, gb};
}

int main() {
    Dataset D = make_data(400);
    int n = (int)D.x.size();

    // Initialize parameters (w, b)
    vector<double> theta_sgd = {0.0, 0.0};
    vector<double> theta_adam = {0.0, 0.0};

    SGD sgd(1e-2);
    Adam adam(2, 5e-2);  // Adam can often use a larger LR on this toy

    mt19937 rng(123);
    int epochs = 20, B = 32;

    vector<int> order(n);
    iota(order.begin(), order.end(), 0);

    // Train with mini-batches
    for (int ep = 1; ep <= epochs; ++ep) {
        shuffle(order.begin(), order.end(), rng);
        for (int s = 0; s < n; s += B) {
            vector<int> idx;
            idx.reserve(B);
            for (int i = s; i < min(n, s + B); ++i) idx.push_back(order[i]);
            auto g1 = grad_batch(D, idx, theta_sgd);
            auto g2 = grad_batch(D, idx, theta_adam);
            sgd.step(theta_sgd, g1);
            adam.step(theta_adam, g2);
        }
        // Evaluate MSE at epoch end
        auto mse = [&](const vector<double>& th) {
            double loss = 0.0;
            for (int i = 0; i < n; ++i) {
                double e = th[0] * D.x[i] + th[1] - D.y[i];
                loss += e * e;
            }
            return loss / n;
        };
        cout << "epoch " << ep << " | MSE SGD= " << mse(theta_sgd)
             << " | MSE Adam= " << mse(theta_adam)
             << " | (w,b) Adam= (" << theta_adam[0] << "," << theta_adam[1] << ")\n";
    }
    return 0;
}
```
This program fits a simple linear regression model y = w x + b using mini-batch training. It trains two models in parallel: one with vanilla SGD and one with Adam, showing typical faster and more stable convergence for Adam on this toy problem. Bias correction and per-parameter scaling are included in Adam.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Optim {
    virtual void step(vector<double>& theta, const vector<double>& g) = 0;
    virtual ~Optim() = default;
};

struct SGD : Optim {
    double lr;
    explicit SGD(double lr_) : lr(lr_) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) th[i] -= lr * g[i];
    }
};

struct Momentum : Optim {
    double lr, beta;
    vector<double> m;
    Momentum(size_t d, double lr_, double beta_ = 0.9) : lr(lr_), beta(beta_), m(d, 0.0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) {
            m[i] = beta * m[i] + (1 - beta) * g[i];
            th[i] -= lr * m[i];
        }
    }
};

struct RMSProp : Optim {
    double lr, beta, eps;
    vector<double> v;
    RMSProp(size_t d, double lr_, double beta_ = 0.99, double eps_ = 1e-8)
        : lr(lr_), beta(beta_), eps(eps_), v(d, 0.0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        for (size_t i = 0; i < th.size(); ++i) {
            v[i] = beta * v[i] + (1 - beta) * g[i] * g[i];
            th[i] -= lr * g[i] / (sqrt(v[i]) + eps);
        }
    }
};

struct Adam : Optim {
    double a, b1, b2, eps;
    vector<double> m, v;
    long long t;
    Adam(size_t d, double a_ = 1e-2, double b1_ = 0.9, double b2_ = 0.999, double e_ = 1e-8)
        : a(a_), b1(b1_), b2(b2_), eps(e_), m(d, 0.0), v(d, 0.0), t(0) {}
    void step(vector<double>& th, const vector<double>& g) override {
        t++;
        double c1 = 1 - pow(b1, (double)t), c2 = 1 - pow(b2, (double)t);
        for (size_t i = 0; i < th.size(); ++i) {
            m[i] = b1 * m[i] + (1 - b1) * g[i];
            v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i];
            double mh = m[i] / c1, vh = v[i] / c2;
            th[i] -= a * mh / (sqrt(vh) + eps);
        }
    }
};

// f(x,y) = 0.5*(100*x^2 + y^2)
// grad = (100*x, y)
int main() {
    auto grad = [](const vector<double>& th) { return vector<double>{100.0 * th[0], th[1]}; };
    auto loss = [](const vector<double>& th) { return 0.5 * (100.0 * th[0] * th[0] + th[1] * th[1]); };

    vector<pair<string, unique_ptr<Optim>>> opts;
    vector<double> init = {1.0, 1.0};
    opts.push_back({"SGD", make_unique<SGD>(0.01)});
    opts.push_back({"Momentum", make_unique<Momentum>(2, 0.05, 0.9)});
    opts.push_back({"RMSProp", make_unique<RMSProp>(2, 0.1, 0.99, 1e-8)});
    opts.push_back({"Adam", make_unique<Adam>(2, 0.05)});

    for (auto& kv : opts) {
        const string& name = kv.first;
        auto& opt = kv.second;
        vector<double> th = init;
        for (int it = 1; it <= 2000; ++it) {
            auto g = grad(th);
            opt->step(th, g);
            if (it == 200 || it == 1000 || it == 2000) {
                cout << name << " it=" << it << " loss=" << loss(th)
                     << " theta=(" << th[0] << "," << th[1] << ")\n";
            }
        }
    }
    return 0;
}
```
This experiment minimizes an ill-conditioned quadratic where curvature differs by 100× across axes. SGD zig-zags along the steep direction, Momentum damps oscillations, RMSProp adapts step sizes per-coordinate, and Adam combines momentum with adaptivity. You should observe Adam and RMSProp stabilize the x-direction quickly, with Momentum also helping relative to plain SGD.