Gradient descent variants, adaptive methods, and learning rate schedules that drive neural network training.
14 concepts
Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.
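As a minimal illustration (the objective and step size here are illustrative choices, not from the source), the update x ← x − η·∇f(x) on the toy function f(x) = (x − 3)²:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0     # starting point
lr = 0.1    # step size (learning rate)
for _ in range(100):
    x -= lr * grad(x)   # step in the direction opposite the gradient

# x moves steadily downhill toward the minimizer x* = 3
```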
Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.
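A toy sketch of the idea, with made-up data, step size, and batch size: each update uses the gradient of the loss over a small random subset rather than the full dataset.

```python
import random

# Toy SGD: fit y = w * x (true w = 2) using random mini-batches.
# Dataset, learning rate, and batch size are illustrative choices.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]

w, lr, batch_size = 0.0, 5e-5, 8
for _ in range(2000):
    batch = random.sample(data, batch_size)   # small random subset of the data
    # gradient of the mean squared error over the mini-batch only
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * g
```

Each step touches only 8 of the 100 examples, yet the noisy updates still drive w to the true value.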
Momentum methods add an exponentially weighted memory of past gradients to make descent steps smoother and faster, especially in ravines and ill-conditioned problems.
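A sketch of classical (heavy-ball) momentum on an ill-conditioned quadratic "ravine"; the objective, step size, and β are illustrative choices.

```python
# Heavy-ball momentum on f(x, y) = x^2 + 10*y^2, an ill-conditioned ravine:
# the velocity v is an exponentially weighted memory of past gradients.
def grad(p):
    x, y = p
    return [2.0 * x, 20.0 * y]

p = [10.0, 1.0]     # starting point
v = [0.0, 0.0]      # velocity (accumulated gradient memory)
lr, beta = 0.05, 0.9
for _ in range(300):
    g = grad(p)
    v = [beta * vi + gi for vi, gi in zip(v, g)]   # smooth past and current gradients
    p = [pi - lr * vi for pi, vi in zip(p, v)]     # step along the smoothed direction
```

The memory damps the zig-zagging across the steep y-direction while accumulating speed along the shallow x-direction.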
Adam is an optimization algorithm that combines momentum (first moment) with RMSProp-style adaptive learning rates (second moment).
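A minimal sketch of the Adam update on a toy objective, using the standard defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the objective and step size 0.1 are illustrative choices.

```python
import math

# Adam on f(x) = (x - 3)^2: momentum (first moment) plus an
# RMSProp-style adaptive scale (second moment), with bias correction.
def grad(x):
    return 2.0 * (x - 3.0)

x, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = b1 * m + (1 - b1) * g        # first moment: momentum
    v = b2 * v + (1 - b2) * g * g    # second moment: adaptive scale
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
```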
Learning rate schedules control how fast a model learns over time by changing the learning rate across iterations or epochs.
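One common schedule, sketched with illustrative bounds: cosine annealing, which decays the rate smoothly from a maximum to a minimum over training.

```python
import math

# Cosine learning-rate schedule: decays from lr_max to lr_min over total_steps.
# The bounds here (0.1 and 0.001) are illustrative defaults, not canonical values.
def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    t = step / total_steps   # progress through training, in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

The rate starts at lr_max, passes the midpoint of the two bounds halfway through, and ends at lr_min.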
Lagrange multipliers let you optimize a function while strictly satisfying equality constraints by introducing auxiliary variables (the multipliers).
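A worked example (chosen for illustration, not from the source): maximize f(x, y) = xy subject to x + y = 10.

```latex
% Lagrangian: the multiplier \lambda enforces the constraint g(x, y) = x + y - 10 = 0.
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 10)
% Setting all partial derivatives to zero:
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 10) = 0
% Hence x = y = \lambda = 5, and the constrained maximum is f(5, 5) = 25.
```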
Gradient clipping limits how large gradient values or their overall magnitude can become during optimization to prevent exploding updates.
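A sketch of clipping by global L2 norm, the scheme behind utilities such as `torch.nn.utils.clip_grad_norm_`: rescale the whole gradient only when its norm exceeds the cap.

```python
import math

# Clip a gradient vector by its global L2 norm: if the norm exceeds
# max_norm, shrink every component by the same factor (preserving direction).
def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads   # small gradients pass through unchanged
```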
Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.
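One standard scheme, sketched here: Xavier/Glorot uniform initialization, which scales the range of the random draws by the layer's fan-in and fan-out.

```python
import math
import random

# Xavier/Glorot uniform init: draw weights from U(-a, a) with
# a = sqrt(6 / (fan_in + fan_out)), keeping signal variance roughly
# constant across layers (derived for tanh-like activations).
def xavier_uniform(fan_in, fan_out, seed=0):
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_uniform(100, 50)   # 100-in, 50-out layer; here limit = sqrt(6/150) = 0.2
```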
A loss landscape is the "terrain" of a model's loss as you move through parameter space; valleys are good solutions and peaks are bad ones.
Sharpness-Aware Minimization (SAM) trains models to perform well even when their weights are slightly perturbed, seeking flatter minima that generalize better.
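A one-dimensional sketch of the two-step SAM update (objective, step size, and radius ρ are illustrative choices): ascend to the worst-case point within radius ρ, then update the original weights with the gradient taken there.

```python
# SAM sketch on f(x) = (x - 3)^2: perturb toward the worst case, then descend.
def grad(x):
    return 2.0 * (x - 3.0)

x, lr, rho = 0.0, 0.1, 0.05
for _ in range(200):
    g = grad(x)
    eps = rho * g / (abs(g) + 1e-12)   # ascent step: worst-case perturbation of size rho
    g_sam = grad(x + eps)              # gradient evaluated at the perturbed weights
    x -= lr * g_sam                    # apply it to the ORIGINAL weights
```

Because the update must also do well at the perturbed point, sharp minima (where the perturbed loss spikes) are penalized relative to flat ones.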
Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
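A sketch of the Lion update rule on a toy objective: the step uses only the sign of an interpolated momentum. Both betas are set to 0.9 here so the toy problem settles quickly (the paper's defaults are β₁ = 0.9, β₂ = 0.99), and the step size is an illustrative choice.

```python
# Lion-style update on f(x) = (x - 3)^2: fixed-magnitude, sign-based steps.
def sign(z):
    return (z > 0) - (z < 0)

def grad(x):
    return 2.0 * (x - 3.0)

x, m = 0.0, 0.0
lr, b1, b2, wd = 0.01, 0.9, 0.9, 0.0   # wd: decoupled weight decay (off here)
for _ in range(600):
    g = grad(x)
    c = b1 * m + (1 - b1) * g          # interpolate momentum and current gradient
    x -= lr * (sign(c) + wd * x)       # step uses only the SIGN of c
    m = b2 * m + (1 - b2) * g          # momentum is updated after the step
```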
Data parallelism splits the training data across workers that compute gradients in parallel on a shared model.
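A single-process sketch of the pattern (dataset and step size are illustrative): each "worker" holds a shard of the data and identical weights; per-step gradients are averaged (the all-reduce) before one shared update.

```python
# Data-parallel sketch with two simulated workers fitting y = w * x (true w = 2).
def grad_on_shard(w, shard):
    # gradient of the mean squared error on this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]   # split the examples across 2 workers

w, lr = 0.0, 0.005
for _ in range(200):
    grads = [grad_on_shard(w, s) for s in shards]   # computed in parallel in a real system
    g = sum(grads) / len(grads)                     # all-reduce: average the gradients
    w -= lr * g                                     # every worker applies the same update
```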
Mixed precision training stores and computes tensors in low precision (FP16/BF16) for speed and memory savings while keeping a master copy of weights in FP32 for accurate updates.
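A NumPy sketch of the mechanics (objective, step size, and loss scale are illustrative): compute gradients in FP16, keep an FP32 master weight, and scale the loss so small FP16 gradients do not underflow to zero.

```python
import numpy as np

# Mixed-precision sketch: FP16 compute, FP32 master weight, loss scaling.
master_w = np.float32(0.0)       # FP32 master copy of the weight
lr = np.float32(0.1)
loss_scale = np.float32(1024.0)  # guards tiny FP16 gradients against underflow

def scaled_grad_fp16(w16):
    # gradient of (w - 3)^2 evaluated entirely in half precision, pre-scaled
    return np.float16(2.0) * (w16 - np.float16(3.0)) * np.float16(loss_scale)

for _ in range(100):
    w16 = np.float16(master_w)                          # cast down for fast compute
    g = np.float32(scaled_grad_fp16(w16)) / loss_scale  # unscale in FP32
    master_w = master_w - lr * g                        # accurate FP32 update
```

The FP16 copy converges only to half-precision resolution, but the FP32 master accumulates the small updates accurately.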