Theoretical foundations explaining why and how deep networks learn: generalization, expressiveness, optimization landscape, and implicit bias.
12 concepts
The Universal Approximation Theorem (UAT) says a feedforward neural network with one hidden layer and a non-polynomial activation (like sigmoid or ReLU) can approximate any continuous function on a compact set to arbitrary accuracy, provided the hidden layer is wide enough.
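A minimal sketch of the idea, with assumed details (target function, grid, widths, and weight scale are all illustrative): fix a random one-hidden-layer ReLU feature map and fit only the linear readout by least squares; even this restricted scheme already drives the worst-case error down on a compact interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact set [0, 1].
x = np.linspace(0, 1, 2000)[:, None]
y = np.sin(2 * np.pi * x[:, 0])

# One hidden layer of ReLU units with random, fixed weights and biases;
# only the linear readout layer is fit here (illustrative choice).
H = 500                                  # hidden width: "enough" units
W = rng.normal(size=(1, H)) * 10
b = rng.normal(size=H) * 10
phi = np.maximum(0, x @ W + b)           # hidden activations, shape (2000, H)

c, *_ = np.linalg.lstsq(phi, y, rcond=None)  # least-squares readout weights
err = np.max(np.abs(phi @ c - y))            # sup-norm error on the grid
```

Increasing `H` shrinks `err` further, which is the quantitative content of the theorem: width buys accuracy.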
Depth adds compositional power: stacking layers lets a network build functions by composition, representing certain highly structured functions with exponentially fewer neurons than any single wide hidden layer would need.
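The classic depth-separation example (Telgarsky's sawtooth construction) can be sketched in a few lines: composing a tent map with itself k times yields a function with 2^k linear pieces, even though each tent map is just two ReLU units, while a one-hidden-layer ReLU network needs on the order of 2^k units for the same function.

```python
import numpy as np

def tent(x):
    # One tiny ReLU layer: tent(x) = 2x for x < 1/2, 2(1 - x) for x >= 1/2,
    # i.e. 2*relu(x) - 4*relu(x - 0.5) on [0, 1].
    return 2 * np.minimum(x, 1 - x)

x = np.linspace(0, 1, 10001)
y = x.copy()
depth = 4
for _ in range(depth):
    y = tent(y)  # composing k tent maps gives a sawtooth with 2^k linear pieces

# Count linear pieces via sign changes of the sampled slope.
slopes = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(np.diff(slopes))
```

Here `pieces` grows as 2^depth while the parameter count grows only linearly in depth, which is the compositional advantage in miniature.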
Double descent describes how test error first follows the classic U-shape with increasing model complexity, spikes near the interpolation threshold, and then drops again in the highly overparameterized regime.
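A toy sketch of the phenomenon, with assumed details (random ReLU features, a noisy sine target, and the min-norm least-squares fit): the interpolation threshold sits where the number of features equals the number of training points (here H = 40), and the exact shape of the error curve depends on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 1
x_tr = rng.uniform(-1, 1, (n_train, d))
x_te = rng.uniform(-1, 1, (n_test, d))
y_tr = np.sin(3 * x_tr[:, 0]) + 0.1 * rng.normal(size=n_train)
y_te = np.sin(3 * x_te[:, 0])

def features(x, W, b):
    return np.maximum(0, x @ W + b)  # random ReLU features

errs = {}
for H in [5, 20, 40, 400]:  # below, near, at, and far past the threshold
    W, b = rng.normal(size=(d, H)), rng.normal(size=H)
    Phi_tr, Phi_te = features(x_tr, W, b), features(x_te, W, b)
    w = np.linalg.pinv(Phi_tr) @ y_tr        # minimum-norm least-squares fit
    errs[H] = np.mean((Phi_te @ w - y_te) ** 2)
```

Plotting `errs` over a denser sweep of `H` typically shows the classic U-shape, a spike near H = 40, and a second descent in the overparameterized regime.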
The Lottery Ticket Hypothesis (LTH) says that inside a large dense neural network there exist small sparse subnetworks that, when trained in isolation from their original initialization, can reach comparable accuracy to the full model.
In underdetermined linear systems (more variables than equations), gradient descent started at zero converges to the minimum Euclidean norm solution without any explicit regularizer.
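This implicit bias is easy to verify numerically (system size and step count below are arbitrary): run plain gradient descent on the squared loss from a zero initialization and compare the result against the pseudoinverse solution, which is the minimum Euclidean norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))   # 5 equations, 20 unknowns: underdetermined
b = rng.normal(size=5)

x = np.zeros(20)               # start at the origin
lr = 0.01
for _ in range(10000):
    x -= lr * A.T @ (A @ x - b)    # gradient descent on ||Ax - b||^2 / 2

x_min_norm = np.linalg.pinv(A) @ b   # the minimum Euclidean norm solution
gap = np.linalg.norm(x - x_min_norm)
```

The reason: every gradient A.T @ (A @ x - b) lies in the row space of A, so iterates starting at zero never leave it, and the unique interpolating solution in the row space is the min-norm one.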
Scaling laws say that test loss typically falls as a power law in model parameters, dataset size, or compute, so performance improves predictably as you scale any of the three.
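Because a power law L(N) = a * N^(-alpha) is a straight line in log-log space, the exponent can be recovered with an ordinary linear fit. The measurements below are synthetic, with an assumed exponent, purely to illustrate the fitting step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical losses at several model sizes, generated from
# L(N) = a * N^(-alpha) with small multiplicative noise.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
alpha_true, a = 0.076, 30.0
L = a * N ** (-alpha_true) * np.exp(0.01 * rng.normal(size=N.size))

# Straight-line fit in log-log space; the slope is -alpha.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat = -slope
```

The same fit, applied to real (parameters, loss) measurements, is how scaling-law papers extrapolate performance to larger budgets.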
Grokking is when a model suddenly starts to generalize well long after it has already memorized the training set.
Neural Tangent Kernel (NTK) describes how wide neural networks train like kernel machines, turning gradient descent into kernel regression in the infinite-width limit.
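A sketch of the empirical NTK for a toy one-hidden-layer network (the architecture, width, and inputs are all illustrative): the kernel is the Gram matrix of parameter gradients, K(x, x') = ∇f(x) · ∇f(x'), and in the infinite-width limit this kernel stays approximately fixed during training, so gradient descent on the network reduces to kernel regression with K.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 1000                          # wide hidden layer
w = rng.normal(size=H)            # input weights (scalar input for simplicity)
a = rng.normal(size=H)            # output weights

def grad_f(x):
    # f(x) = a . relu(w * x) / sqrt(H); gradient wrt (w, a) in closed form
    pre = w * x
    dw = a * (pre > 0) * x / np.sqrt(H)
    da = np.maximum(0, pre) / np.sqrt(H)
    return np.concatenate([dw, da])

xs = np.array([-1.0, 0.5, 2.0])
J = np.stack([grad_f(x) for x in xs])  # Jacobian of outputs wrt all params
K = J @ J.T                            # empirical NTK Gram matrix

eigs = np.linalg.eigvalsh(K)           # PSD by construction
```

Recomputing `K` after a training step on a wide network would change it only slightly, which is what licenses the kernel-regression view.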
Generalization bounds aim to explain why deep neural networks can perform well on unseen data despite having many parameters; classical bounds (e.g., those based on VC dimension) are often vacuous at modern scale, motivating norm-, margin-, and compression-based analyses.
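The simplest non-vacuous example is a held-out-set bound rather than a training-set bound: by Hoeffding's inequality, the true error of a fixed model exceeds its error on n held-out examples by more than eps = sqrt(log(1/delta) / (2n)) with probability at most delta. The numbers below are hypothetical.

```python
import numpy as np

n = 10_000            # held-out examples (hypothetical)
observed_err = 0.034  # error rate measured on the held-out set (hypothetical)
delta = 0.01          # allowed failure probability

# Hoeffding deviation bound for a bounded [0, 1] loss.
eps = np.sqrt(np.log(1 / delta) / (2 * n))
upper_bound = observed_err + eps   # holds with probability >= 1 - delta
```

Bounds that hold for the training set instead must additionally pay for the complexity of the hypothesis class, which is where the difficulty for deep networks lies.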