Theoretical foundations explaining why and how deep networks learn: generalization, expressiveness, optimization landscape, and implicit bias.
12 concepts
The Universal Approximation Theorem (UAT) says a feedforward neural network with one hidden layer and a non-polynomial activation (like sigmoid or ReLU) can approximate any continuous function on a compact set to arbitrary accuracy, provided the hidden layer is wide enough.
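A minimal sketch of the idea, with assumed details (target function, grid, widths, and weight scale are all illustrative): fix a random one-hidden-layer ReLU feature map and fit only the linear readout by least squares; even this restricted scheme already drives the worst-case error down on a compact interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact set [0, 1].
x = np.linspace(0, 1, 2000)[:, None]
y = np.sin(2 * np.pi * x[:, 0])

# One hidden layer of ReLU units with random, fixed weights and biases;
# only the linear readout layer is fit here (illustrative choice).
H = 500                                  # hidden width: "enough" units
W = rng.normal(size=(1, H)) * 10
b = rng.normal(size=H) * 10
phi = np.maximum(0, x @ W + b)           # hidden activations, shape (2000, H)

c, *_ = np.linalg.lstsq(phi, y, rcond=None)  # least-squares readout weights
err = np.max(np.abs(phi @ c - y))            # sup-norm error on the grid
```

Increasing `H` shrinks `err` further, which is the quantitative content of the theorem: width buys accuracy.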
Depth adds compositional power: stacking layers lets a network build functions by composition, representing certain highly structured functions with exponentially fewer neurons than any single wide hidden layer would need.
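The classic depth-separation example (Telgarsky's sawtooth construction) can be sketched in a few lines: composing a tent map with itself k times yields a function with 2^k linear pieces, even though each tent map is just two ReLU units, while a one-hidden-layer ReLU network needs on the order of 2^k units for the same function.

```python
import numpy as np

def tent(x):
    # One tiny ReLU layer: tent(x) = 2x for x < 1/2, 2(1 - x) for x >= 1/2,
    # i.e. 2*relu(x) - 4*relu(x - 0.5) on [0, 1].
    return 2 * np.minimum(x, 1 - x)

x = np.linspace(0, 1, 10001)
y = x.copy()
depth = 4
for _ in range(depth):
    y = tent(y)  # composing k tent maps gives a sawtooth with 2^k linear pieces

# Count linear pieces via sign changes of the sampled slope.
slopes = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(np.diff(slopes))
```

Here `pieces` grows as 2^depth while the parameter count grows only linearly in depth, which is the compositional advantage in miniature.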
Double descent describes how test error first follows the classic U-shape with increasing model complexity, spikes near the interpolation threshold, and then drops again in the highly overparameterized regime.
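A toy sketch of the phenomenon, with assumed details (random ReLU features, a noisy sine target, and the min-norm least-squares fit): the interpolation threshold sits where the number of features equals the number of training points (here H = 40), and the exact shape of the error curve depends on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 1
x_tr = rng.uniform(-1, 1, (n_train, d))
x_te = rng.uniform(-1, 1, (n_test, d))
y_tr = np.sin(3 * x_tr[:, 0]) + 0.1 * rng.normal(size=n_train)
y_te = np.sin(3 * x_te[:, 0])

def features(x, W, b):
    return np.maximum(0, x @ W + b)  # random ReLU features

errs = {}
for H in [5, 20, 40, 400]:  # below, near, at, and far past the threshold
    W, b = rng.normal(size=(d, H)), rng.normal(size=H)
    Phi_tr, Phi_te = features(x_tr, W, b), features(x_te, W, b)
    w = np.linalg.pinv(Phi_tr) @ y_tr        # minimum-norm least-squares fit
    errs[H] = np.mean((Phi_te @ w - y_te) ** 2)
```

Plotting `errs` over a denser sweep of `H` typically shows the classic U-shape, a spike near H = 40, and a second descent in the overparameterized regime.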
The Lottery Ticket Hypothesis (LTH) says that inside a large dense neural network there exist small sparse subnetworks that, when trained in isolation from their original initialization, can reach comparable accuracy to the full model.
In underdetermined linear systems (more variables than equations), gradient descent started at zero converges to the minimum Euclidean norm solution without any explicit regularizer.
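This implicit bias is easy to verify numerically (system size and step count below are arbitrary): run plain gradient descent on the squared loss from a zero initialization and compare the result against the pseudoinverse solution, which is the minimum Euclidean norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))   # 5 equations, 20 unknowns: underdetermined
b = rng.normal(size=5)

x = np.zeros(20)               # start at the origin
lr = 0.01
for _ in range(10000):
    x -= lr * A.T @ (A @ x - b)    # gradient descent on ||Ax - b||^2 / 2

x_min_norm = np.linalg.pinv(A) @ b   # the minimum Euclidean norm solution
gap = np.linalg.norm(x - x_min_norm)
```

The reason: every gradient A.T @ (A @ x - b) lies in the row space of A, so iterates starting at zero never leave it, and the unique interpolating solution in the row space is the min-norm one.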
Scaling laws say that test loss typically falls as a power law in model parameters, dataset size, or compute, so performance improves predictably as you scale any of the three.
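Because a power law L(N) = a * N^(-alpha) is a straight line in log-log space, the exponent can be recovered with an ordinary linear fit. The measurements below are synthetic, with an assumed exponent, purely to illustrate the fitting step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical losses at several model sizes, generated from
# L(N) = a * N^(-alpha) with small multiplicative noise.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
alpha_true, a = 0.076, 30.0
L = a * N ** (-alpha_true) * np.exp(0.01 * rng.normal(size=N.size))

# Straight-line fit in log-log space; the slope is -alpha.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat = -slope
```

The same fit, applied to real (parameters, loss) measurements, is how scaling-law papers extrapolate performance to larger budgets.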
Grokking is when a model suddenly starts to generalize well long after it has already memorized the training set.
Neural Tangent Kernel (NTK) describes how wide neural networks train like kernel machines, turning gradient descent into kernel regression in the infinite-width limit.
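A sketch of the empirical NTK for a toy one-hidden-layer network (the architecture, width, and inputs are all illustrative): the kernel is the Gram matrix of parameter gradients, K(x, x') = ∇f(x) · ∇f(x'), and in the infinite-width limit this kernel stays approximately fixed during training, so gradient descent on the network reduces to kernel regression with K.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 1000                          # wide hidden layer
w = rng.normal(size=H)            # input weights (scalar input for simplicity)
a = rng.normal(size=H)            # output weights

def grad_f(x):
    # f(x) = a . relu(w * x) / sqrt(H); gradient wrt (w, a) in closed form
    pre = w * x
    dw = a * (pre > 0) * x / np.sqrt(H)
    da = np.maximum(0, pre) / np.sqrt(H)
    return np.concatenate([dw, da])

xs = np.array([-1.0, 0.5, 2.0])
J = np.stack([grad_f(x) for x in xs])  # Jacobian of outputs wrt all params
K = J @ J.T                            # empirical NTK Gram matrix

eigs = np.linalg.eigvalsh(K)           # PSD by construction
```

Recomputing `K` after a training step on a wide network would change it only slightly, which is what licenses the kernel-regression view.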
Generalization bounds aim to explain why deep neural networks can perform well on unseen data despite having many parameters; classical bounds (e.g., those based on VC dimension) are often vacuous at modern scale, motivating norm-, margin-, and compression-based analyses.
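The simplest non-vacuous example is a held-out-set bound rather than a training-set bound: by Hoeffding's inequality, the true error of a fixed model exceeds its error on n held-out examples by more than eps = sqrt(log(1/delta) / (2n)) with probability at most delta. The numbers below are hypothetical.

```python
import numpy as np

n = 10_000            # held-out examples (hypothetical)
observed_err = 0.034  # error rate measured on the held-out set (hypothetical)
delta = 0.01          # allowed failure probability

# Hoeffding deviation bound for a bounded [0, 1] loss.
eps = np.sqrt(np.log(1 / delta) / (2 * n))
upper_bound = observed_err + eps   # holds with probability >= 1 - delta
```

Bounds that hold for the training set instead must additionally pay for the complexity of the hypothesis class, which is where the difficulty for deep networks lies.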