How I Study AI - Learn AI Papers & Lectures the Easy Way
🎯 Optimization

Gradient descent variants, adaptive methods, and learning rate schedules that drive neural network training.

14 concepts

Intermediate (13)

βš™οΈAlgorithmIntermediate

Gradient Descent

Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.

#gradient descent#batch gradient descent#learning rate+12
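The update rule above can be sketched in a few lines. This is a minimal illustration, not a production optimizer; the objective f(w) = (w - 3)², the learning rate, and the step count are all illustrative choices, not from the card.

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3).
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                  # illustrative starting point
    for _ in range(steps):
        grad = 2 * (w - 3)   # analytic gradient of the toy objective
        w -= lr * grad       # step in the opposite direction of the gradient
    return w
```

With lr = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, so 100 steps land essentially at w = 3.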
βš™οΈAlgorithmIntermediate

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.

#stochastic gradient descent#mini-batch#random shuffling+12
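A minimal NumPy sketch of mini-batch SGD on noise-free least squares; the synthetic data, batch size, and epoch count are illustrative assumptions.

```python
import numpy as np

# Mini-batch SGD for least squares: y = X @ w_true, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))            # random shuffling each epoch
    for s in range(0, len(X), batch):
        b = idx[s:s + batch]                 # one mini-batch of indices
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # batch gradient
        w -= lr * grad
```

Each update touches only 32 rows of X, which is the memory saving the card describes.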
βš™οΈAlgorithmIntermediate

Momentum Methods

Momentum methods add an exponentially weighted memory of past gradients to make descent steps smoother and faster, especially in ravines and ill-conditioned problems.

#momentum#heavy-ball#polyak momentum+12
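The heavy-ball update can be sketched on an ill-conditioned quadratic, a "ravine" whose curvatures differ by 100×; the Hessian, learning rate, and momentum coefficient here are illustrative.

```python
import numpy as np

# Heavy-ball momentum on f(w) = 0.5 * w^T diag(1, 100) w (an ill-conditioned ravine).
H = np.diag([1.0, 100.0])
w = np.array([1.0, 1.0])
v = np.zeros(2)
lr, beta = 0.01, 0.9
for _ in range(300):
    grad = H @ w
    v = beta * v + grad   # exponentially weighted memory of past gradients
    w -= lr * v           # step along the accumulated velocity
```

Plain gradient descent with the same lr would crawl along the flat direction; the velocity term lets both coordinates converge at a similar rate.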
βš™οΈAlgorithmIntermediate

Adam & Adaptive Methods

Adam is an optimization algorithm that combines momentum (first moment) with RMSProp-style adaptive learning rates (second moment).

#adam#adaptive methods#rmsprop+12
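The two-moment structure can be written out directly. This sketch uses the commonly cited default hyperparameters; the one-dimensional objective f(w) = (w - 3)² and the step count are illustrative assumptions.

```python
import numpy as np

# Adam on f(w) = (w - 3)^2: momentum (m) plus RMSProp-style scaling (v).
def adam(steps=2000, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)
        m = b1 * m + (1 - b1) * g        # first moment: momentum
        v = b2 * v + (1 - b2) * g * g    # second moment: squared-gradient average
        m_hat = m / (1 - b1 ** t)        # bias correction for zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

Dividing by the square root of the second moment makes the effective step size roughly scale-free, which is the "adaptive" part.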
βš™οΈAlgorithmIntermediate

Learning Rate Schedules

Learning rate schedules control how fast a model learns over time by changing the learning rate across iterations or epochs.

#learning rate schedules#step decay#cosine annealing+12
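Two of the tagged schedules can be written as pure functions of the step index; the base rate, drop factor, and horizon here are illustrative constants.

```python
import math

# Step decay: multiply the rate by `drop` every `every` steps.
def step_decay(step, base_lr=0.1, drop=0.5, every=30):
    return base_lr * drop ** (step // every)

# Cosine annealing: smoothly decay from base_lr to min_lr over `total` steps.
def cosine_annealing(step, total=100, base_lr=0.1, min_lr=0.0):
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / total))
```

A training loop would simply call the chosen schedule each iteration to get the current learning rate.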
βˆ‘ Math · Intermediate

Lagrange Multipliers & Constrained Optimization

Lagrange multipliers let you optimize a function while exactly satisfying equality constraints by introducing auxiliary variables (the multipliers).

#lagrange multipliers · #constrained optimization · #kkt conditions · +11
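A small worked instance: maximize f(x, y) = xy subject to x + y = 1. The Lagrangian is L = xy - λ(x + y - 1), and setting its partial derivatives to zero gives a linear system we can solve directly; the example problem is an illustrative choice.

```python
import numpy as np

# Stationarity of L = x*y - lam*(x + y - 1):
#   dL/dx = y - lam = 0
#   dL/dy = x - lam = 0
#   g     = x + y - 1 = 0
A = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)   # x = y = 0.5, lam = 0.5
```

The multiplier λ = 0.5 also has a meaning: it is the rate at which the optimal value changes if the constraint level is relaxed.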
βš™οΈAlgorithmIntermediate

Gradient Clipping & Normalization

Gradient clipping limits how large gradient values or their overall magnitude can become during optimization to prevent exploding updates.

#gradient clipping#clipping by norm#clipping by value+12
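Clipping by global norm can be sketched as follows: compute the norm over all parameter gradients together, and rescale the whole set if it exceeds a threshold. The function name and threshold are illustrative, not from any specific library.

```python
import numpy as np

# Global-norm clipping: rescale every gradient by the same factor so the
# combined norm never exceeds max_norm; gradients below the cap pass unchanged.
def clip_by_global_norm(grads, max_norm=1.0):
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```

Rescaling everything by one factor preserves the gradient's direction, which is why norm clipping is usually preferred over per-value clipping.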
πŸ“š Theory · Intermediate

Weight Initialization Strategies

Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.

#xavier · #glorot · #he · +12
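The two tagged schemes reduce to picking a variance from the layer's fan-in/fan-out; the dense-layer shapes below are illustrative.

```python
import numpy as np

# Xavier/Glorot uniform: variance balanced for tanh/sigmoid-style activations.
def xavier_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# He normal: variance 2/fan_in, sized for ReLU activations (which zero half the units).
def he_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

Either way, the goal is that the variance of activations stays roughly constant from layer to layer at the start of training.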
πŸ“š Theory · Intermediate

Loss Landscape Analysis

A loss landscape is the "terrain" of a model's loss as you move through parameter space; valleys are good solutions and peaks are bad ones.

#loss landscape · #sharpness · #hessian eigenvalues · +12
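A common way to look at this terrain is a 1-D slice: evaluate the loss along a random direction through a point in parameter space. The toy "bumpy bowl" loss and the probing range below are illustrative assumptions.

```python
import numpy as np

# Toy loss: a quadratic bowl with a sinusoidal bump, so the slice is not flat.
def loss(w):
    return np.sum(w ** 2) + 0.5 * np.sin(3 * w[0])

rng = np.random.default_rng(0)
w0 = np.zeros(2)                       # point whose neighborhood we probe
d = rng.normal(size=2)
d /= np.linalg.norm(d)                 # unit-length random direction
alphas = np.linspace(-2, 2, 101)
profile = [loss(w0 + a * d) for a in alphas]   # loss along the 1-D slice
```

Plotting `profile` against `alphas` shows whether the point sits in a flat or sharp region; curvature of this slice relates to Hessian eigenvalues along d.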
βš™οΈAlgorithmIntermediate

Sharpness-Aware Minimization (SAM)

Sharpness-Aware Minimization (SAM) trains models to perform well even when their weights are slightly perturbed, seeking flatter minima that generalize better.

#sharpness-aware minimization#sam optimizer#robust optimization+11
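One SAM step has two halves: first ascend to the (approximately) worst nearby point within a radius ρ, then apply the gradient measured there to the original weights. The radius and learning rate below are illustrative.

```python
import numpy as np

# One SAM step for an arbitrary gradient function grad_fn(w).
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction, radius rho
    return w - lr * grad_fn(w + eps)             # descend using the worst-case gradient
```

On a sharp loss the perturbed gradient differs a lot from the local one, so SAM is pushed away from sharp minima toward flatter ones.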
βš™οΈAlgorithmIntermediate

Lion Optimizer

Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.

#lion optimizer#sign-based optimization#momentum+12
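The Lion update is short enough to write out: the step direction is the sign of an interpolation between the momentum and the current gradient, so every coordinate moves by exactly the learning rate. The hyperparameter values follow the commonly cited defaults; this is a sketch, not a reference implementation.

```python
import numpy as np

# One Lion step: sign of interpolated momentum, with decoupled weight decay.
def lion_step(w, g, m, lr=1e-2, b1=0.9, b2=0.99, wd=0.0):
    update = np.sign(b1 * m + (1 - b1) * g)  # sign-based step direction
    m = b2 * m + (1 - b2) * g                # momentum tracks the raw gradient
    w = w - lr * (update + wd * w)
    return w, m
```

Because only the sign is used, Lion's updates have uniform magnitude across coordinates, which also halves optimizer state relative to Adam (one moment instead of two).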
βš™οΈAlgorithmIntermediate

Distributed & Parallel Optimization

Data parallelism splits the training data across workers that compute gradients in parallel on a shared model.

#data parallelism#synchronous sgd#asynchronous sgd+12
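Synchronous data-parallel SGD can be simulated in a single process: each "worker" computes a gradient on its own shard, and the averaging step stands in for the all-reduce a real framework would perform. Data, shard count, and learning rate are illustrative.

```python
import numpy as np

# Simulated synchronous data-parallel SGD on noise-free least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
w_true = rng.normal(size=4)
y = X @ w_true

w = np.zeros(4)
shards = np.array_split(np.arange(len(X)), 4)   # 4 simulated workers
for step in range(200):
    # Each worker computes a gradient on its shard, in parallel in a real system.
    grads = [2 * X[s].T @ (X[s] @ w - y[s]) / len(s) for s in shards]
    g = np.mean(grads, axis=0)                  # the all-reduce: average gradients
    w -= 0.1 * g
```

With equal shard sizes the averaged gradient equals the full-batch gradient exactly, which is why synchronous SGD behaves like large-batch training.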
βš™οΈAlgorithmIntermediate

Mixed Precision Training

Mixed precision training stores and computes tensors in low precision (FP16/BF16) for speed and memory savings while keeping a master copy of weights in FP32 for accurate updates.

#mixed precision#fp16#bf16+10
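Why the FP32 master copy matters can be shown in four lines: near 1.0 the spacing between FP16 values is about 1e-3, so a small update added in FP16 rounds away entirely, while the same update survives in FP32. The specific values are illustrative.

```python
import numpy as np

w16 = np.float16(1.0)
update = np.float16(1e-4)      # a small gradient step
lost = w16 + update            # rounds back to 1.0: FP16 spacing at 1.0 is ~9.8e-4

master = np.float32(1.0)       # FP32 master copy of the same weight
master += np.float32(update)   # the same update is retained in FP32
```

This is the failure mode the master-weights trick (together with loss scaling) exists to prevent.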

Advanced (1)

βš™οΈAlgorithmAdvanced

Newton's Method & Second-Order Optimization

Newton's method uses both the gradient and the Hessian to take steps that aim directly at the local optimum by fitting a quadratic model of the loss around the current point.

#newton's method#second-order optimization#hessian+12
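In one dimension the Hessian is just the second derivative, so a Newton step is gradient over curvature: w ← w - f'(w)/f''(w), the exact minimizer of the local quadratic model. The test function f(w) = e^w - 2w, whose minimum is at w = ln 2, is an illustrative choice.

```python
import math

# Newton's method for 1-D minimization: repeatedly jump to the minimum
# of the quadratic model built from f' and f''.
def newton_minimize(fp, fpp, w=1.0, steps=10):
    for _ in range(steps):
        w -= fp(w) / fpp(w)    # gradient divided by curvature
    return w

# Minimize f(w) = exp(w) - 2w; f'(w) = exp(w) - 2, f''(w) = exp(w).
w_star = newton_minimize(lambda w: math.exp(w) - 2, lambda w: math.exp(w))
```

Convergence is quadratic near the optimum, which is why ten steps are ample here; the price in high dimensions is forming and solving with the Hessian.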