Gradient descent variants, adaptive methods, and learning rate schedules that drive neural network training.
14 concepts
Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.
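As a minimal illustration (the objective and step size here are illustrative choices, not from the source), the update x ← x − η·∇f(x) on the toy function f(x) = (x − 3)²:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0     # starting point
lr = 0.1    # step size (learning rate)
for _ in range(100):
    x -= lr * grad(x)   # step in the direction opposite the gradient

# x moves steadily downhill toward the minimizer x* = 3
```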
Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.
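A toy sketch of the idea, with made-up data, step size, and batch size: each update uses the gradient of the loss over a small random subset rather than the full dataset.

```python
import random

# Toy SGD: fit y = w * x (true w = 2) using random mini-batches.
# Dataset, learning rate, and batch size are illustrative choices.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]

w, lr, batch_size = 0.0, 5e-5, 8
for _ in range(2000):
    batch = random.sample(data, batch_size)   # small random subset of the data
    # gradient of the mean squared error over the mini-batch only
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * g
```

Each step touches only 8 of the 100 examples, yet the noisy updates still drive w to the true value.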
Momentum methods add an exponentially weighted memory of past gradients to make descent steps smoother and faster, especially in ravines and ill-conditioned problems.
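A sketch of classical (heavy-ball) momentum on an ill-conditioned quadratic "ravine"; the objective, step size, and β are illustrative choices.

```python
# Heavy-ball momentum on f(x, y) = x^2 + 10*y^2, an ill-conditioned ravine:
# the velocity v is an exponentially weighted memory of past gradients.
def grad(p):
    x, y = p
    return [2.0 * x, 20.0 * y]

p = [10.0, 1.0]     # starting point
v = [0.0, 0.0]      # velocity (accumulated gradient memory)
lr, beta = 0.05, 0.9
for _ in range(300):
    g = grad(p)
    v = [beta * vi + gi for vi, gi in zip(v, g)]   # smooth past and current gradients
    p = [pi - lr * vi for pi, vi in zip(p, v)]     # step along the smoothed direction
```

The memory damps the zig-zagging across the steep y-direction while accumulating speed along the shallow x-direction.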
Adam is an optimization algorithm that combines momentum (first moment) with RMSProp-style adaptive learning rates (second moment).
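A minimal sketch of the Adam update on a toy objective, using the standard defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the objective and step size 0.1 are illustrative choices.

```python
import math

# Adam on f(x) = (x - 3)^2: momentum (first moment) plus an
# RMSProp-style adaptive scale (second moment), with bias correction.
def grad(x):
    return 2.0 * (x - 3.0)

x, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = b1 * m + (1 - b1) * g        # first moment: momentum
    v = b2 * v + (1 - b2) * g * g    # second moment: adaptive scale
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
```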
Learning rate schedules control how fast a model learns over time by changing the learning rate across iterations or epochs.
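One common schedule, sketched with illustrative bounds: cosine annealing, which decays the rate smoothly from a maximum to a minimum over training.

```python
import math

# Cosine learning-rate schedule: decays from lr_max to lr_min over total_steps.
# The bounds here (0.1 and 0.001) are illustrative defaults, not canonical values.
def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    t = step / total_steps   # progress through training, in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

The rate starts at lr_max, passes the midpoint of the two bounds halfway through, and ends at lr_min.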
Lagrange multipliers let you optimize a function while strictly satisfying equality constraints by introducing auxiliary variables (the multipliers).
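A worked example (chosen for illustration, not from the source): maximize f(x, y) = xy subject to x + y = 10.

```latex
% Lagrangian: the multiplier \lambda enforces the constraint g(x, y) = x + y - 10 = 0.
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 10)
% Setting all partial derivatives to zero:
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 10) = 0
% Hence x = y = \lambda = 5, and the constrained maximum is f(5, 5) = 25.
```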
Gradient clipping limits how large gradient values or their overall magnitude can become during optimization to prevent exploding updates.
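A sketch of clipping by global L2 norm, the scheme behind utilities such as `torch.nn.utils.clip_grad_norm_`: rescale the whole gradient only when its norm exceeds the cap.

```python
import math

# Clip a gradient vector by its global L2 norm: if the norm exceeds
# max_norm, shrink every component by the same factor (preserving direction).
def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads   # small gradients pass through unchanged
```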
Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.
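One standard scheme, sketched here: Xavier/Glorot uniform initialization, which scales the range of the random draws by the layer's fan-in and fan-out.

```python
import math
import random

# Xavier/Glorot uniform init: draw weights from U(-a, a) with
# a = sqrt(6 / (fan_in + fan_out)), keeping signal variance roughly
# constant across layers (derived for tanh-like activations).
def xavier_uniform(fan_in, fan_out, seed=0):
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_uniform(100, 50)   # 100-in, 50-out layer; here limit = sqrt(6/150) = 0.2
```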
A loss landscape is the "terrain" of a model's loss as you move through parameter space; valleys are good solutions and peaks are bad ones.
Sharpness-Aware Minimization (SAM) trains models to perform well even when their weights are slightly perturbed, seeking flatter minima that generalize better.
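A one-dimensional sketch of the two-step SAM update (objective, step size, and radius ρ are illustrative choices): ascend to the worst-case point within radius ρ, then update the original weights with the gradient taken there.

```python
# SAM sketch on f(x) = (x - 3)^2: perturb toward the worst case, then descend.
def grad(x):
    return 2.0 * (x - 3.0)

x, lr, rho = 0.0, 0.1, 0.05
for _ in range(200):
    g = grad(x)
    eps = rho * g / (abs(g) + 1e-12)   # ascent step: worst-case perturbation of size rho
    g_sam = grad(x + eps)              # gradient evaluated at the perturbed weights
    x -= lr * g_sam                    # apply it to the ORIGINAL weights
```

Because the update must also do well at the perturbed point, sharp minima (where the perturbed loss spikes) are penalized relative to flat ones.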
Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
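A sketch of the Lion update rule on a toy objective: the step uses only the sign of an interpolated momentum. Both betas are set to 0.9 here so the toy problem settles quickly (the paper's defaults are β₁ = 0.9, β₂ = 0.99), and the step size is an illustrative choice.

```python
# Lion-style update on f(x) = (x - 3)^2: fixed-magnitude, sign-based steps.
def sign(z):
    return (z > 0) - (z < 0)

def grad(x):
    return 2.0 * (x - 3.0)

x, m = 0.0, 0.0
lr, b1, b2, wd = 0.01, 0.9, 0.9, 0.0   # wd: decoupled weight decay (off here)
for _ in range(600):
    g = grad(x)
    c = b1 * m + (1 - b1) * g          # interpolate momentum and current gradient
    x -= lr * (sign(c) + wd * x)       # step uses only the SIGN of c
    m = b2 * m + (1 - b2) * g          # momentum is updated after the step
```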
Data parallelism splits the training data across workers that compute gradients in parallel on a shared model.
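A single-process sketch of the pattern (dataset and step size are illustrative): each "worker" holds a shard of the data and identical weights; per-step gradients are averaged (the all-reduce) before one shared update.

```python
# Data-parallel sketch with two simulated workers fitting y = w * x (true w = 2).
def grad_on_shard(w, shard):
    # gradient of the mean squared error on this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]   # split the examples across 2 workers

w, lr = 0.0, 0.005
for _ in range(200):
    grads = [grad_on_shard(w, s) for s in shards]   # computed in parallel in a real system
    g = sum(grads) / len(grads)                     # all-reduce: average the gradients
    w -= lr * g                                     # every worker applies the same update
```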
Mixed precision training stores and computes tensors in low precision (FP16/BF16) for speed and memory savings while keeping a master copy of weights in FP32 for accurate updates.
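A NumPy sketch of the mechanics (objective, step size, and loss scale are illustrative): compute gradients in FP16, keep an FP32 master weight, and scale the loss so small FP16 gradients do not underflow to zero.

```python
import numpy as np

# Mixed-precision sketch: FP16 compute, FP32 master weight, loss scaling.
master_w = np.float32(0.0)       # FP32 master copy of the weight
lr = np.float32(0.1)
loss_scale = np.float32(1024.0)  # guards tiny FP16 gradients against underflow

def scaled_grad_fp16(w16):
    # gradient of (w - 3)^2 evaluated entirely in half precision, pre-scaled
    return np.float16(2.0) * (w16 - np.float16(3.0)) * np.float16(loss_scale)

for _ in range(100):
    w16 = np.float16(master_w)                          # cast down for fast compute
    g = np.float32(scaled_grad_fp16(w16)) / loss_scale  # unscale in FP32
    master_w = master_w - lr * g                        # accurate FP32 update
```

The FP16 copy converges only to half-precision resolution, but the FP32 master accumulates the small updates accurately.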