Mathematical foundations of regularization methods that prevent overfitting: from classical penalties to modern implicit techniques.
11 concepts
L1 regularization (Lasso) adds a penalty \(\lambda \sum_{i=1}^{p} |w_i|\) to the loss, which pushes many coefficients exactly to zero and performs feature selection.
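Why the L1 penalty zeroes coefficients can be seen in its proximal operator, soft-thresholding: any coefficient whose magnitude falls below the threshold is set exactly to zero. A minimal pure-Python sketch (the function names and constants are illustrative, not from any particular library):

```python
def soft_threshold(w, t):
    # proximal operator of the L1 penalty: shrinks w toward zero,
    # and maps any |w| <= t exactly to zero (the source of sparsity)
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [0.8, -0.05, 0.3, 0.02, -1.2]
lam, lr = 0.5, 0.1  # hypothetical penalty strength and step size
sparse = [soft_threshold(w, lr * lam) for w in weights]
# the small coefficients (-0.05 and 0.02) are zeroed out
```

One proximal-gradient step would apply this shrinkage after an ordinary gradient step on the data loss.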
L2 regularization (also called ridge or weight decay) adds a penalty \(\lambda \sum_{i=1}^{p} w_i^2\) to the loss to discourage large parameters; unlike L1, it shrinks all weights toward zero but rarely makes any of them exactly zero.
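The "weight decay" name comes from what the penalty does inside a gradient step: its gradient contribution is proportional to the weight itself, so each update multiplicatively shrinks the weight. A small sketch, with illustrative hyperparameter values:

```python
def sgd_step_with_weight_decay(w, grad, lr=0.1, wd=0.01):
    # the L2 penalty 0.5 * wd * w**2 contributes wd * w to the gradient,
    # which is equivalent to shrinking w by a factor (1 - lr * wd)
    return w - lr * (grad + wd * w)

w = 2.0
w1 = sgd_step_with_weight_decay(w, grad=0.0)
# even with a zero data gradient, the weight decays toward zero
```

With grad=0 the update is exactly `w * (1 - lr * wd)`, which is why repeated steps pull weights geometrically toward zero without ever snapping them there.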
Elastic Net regularization combines L1 (Lasso) and L2 (Ridge) penalties to produce models that are both sparse and stable.
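One common way to write the combined update is a proximal step that first soft-thresholds (the L1 part, producing exact zeros) and then shrinks multiplicatively (the L2 part, stabilizing the surviving coefficients). A sketch under those assumptions, with made-up parameter values:

```python
def elastic_net_prox(w, lr, l1, l2):
    # L1 part: soft-threshold, zeroing small coefficients
    t = lr * l1
    if w > t:
        w = w - t
    elif w < -t:
        w = w + t
    else:
        return 0.0
    # L2 part: multiplicative shrinkage of the survivors
    return w / (1.0 + lr * l2)
```

Small coefficients still vanish exactly (sparsity from L1), while large ones are shrunk smoothly (stability from L2).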
Dropout randomly turns off (zeros) some neurons during training to prevent the network from memorizing the training data.
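A minimal sketch of "inverted" dropout, the common variant in which survivors are rescaled during training so that no correction is needed at inference (function name and default rate are illustrative):

```python
import random

def dropout(activations, p=0.5, training=True):
    # zero each unit with probability p; scale survivors by 1/(1-p)
    # so the expected activation is unchanged between train and test
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0
            for a in activations]
```

At inference time (`training=False`) the layer is the identity, since the rescaling during training already matched expectations.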
Batch Normalization rescales and recenters activations using mini-batch statistics to stabilize and speed up neural network training.
Layer Normalization rescales and recenters each sample across its feature dimensions, making it independent of batch size.
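The two normalizations above differ only in which axis the statistics are taken over. A pure-Python sketch contrasting them on a tiny batch (variable names are illustrative, and the learnable scale/shift parameters are omitted for brevity):

```python
def normalize(values, eps=1e-5):
    # subtract the mean and divide by the standard deviation
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features

# Batch norm: normalize each feature across the batch (per column);
# bn is laid out feature-by-batch here because of the zip transpose
bn = [normalize(col) for col in zip(*batch)]

# Layer norm: normalize each sample across its features (per row),
# so one sample's result does not depend on the rest of the batch
ln = [normalize(row) for row in batch]
```

This is why layer normalization behaves identically at batch size 1, while batch normalization needs running statistics at inference.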
Data augmentation expands the training distribution by applying label-preserving transformations to inputs, which lowers overfitting and improves generalization.
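A tiny sketch of one label-preserving transformation, a horizontal flip of an image stored as a list of rows (the representation and names are illustrative; real pipelines compose many such transforms, often applied randomly per epoch):

```python
def hflip(image):
    # reverse each row: a horizontal mirror preserves most class labels
    return [row[::-1] for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
augmented = [img, hflip(img)]  # both variants keep the original label
```

The key property is that the transform is its own inverse and changes the pixels without changing what the image depicts, so the model sees a wider input distribution for the same labels.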
Label smoothing replaces a hard one-hot target with a slightly softened distribution to reduce model overconfidence.
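The softened target is typically a mixture of the one-hot vector and a uniform distribution over the K classes. A minimal sketch with an illustrative smoothing factor:

```python
def smooth_labels(one_hot, eps=0.1):
    # move eps of the probability mass from the one-hot target
    # to a uniform distribution over all k classes
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0], eps=0.1)
# the true class keeps most of the mass; the rest is spread evenly
```

The result is still a valid probability distribution, so it can be plugged directly into a cross-entropy loss, and the model is never pushed toward infinite logits for the true class.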
Early stopping halts training when the validation loss stops improving, preventing overfitting and saving compute.
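A common implementation tracks the best validation loss seen so far and stops after a fixed number of non-improving epochs ("patience"). A sketch with illustrative class and parameter names:

```python
class EarlyStopping:
    # stop when validation loss has not improved for `patience` epochs
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # call once per epoch; returns True when training should stop
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In practice this is usually paired with checkpointing, so the weights restored at the end come from the best-validation epoch rather than the last one.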
Spectral regularization controls how much a weight matrix can stretch inputs by constraining its largest singular value (spectral norm).
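The largest singular value can be estimated with power iteration, and the matrix then rescaled so it cannot stretch any input beyond a target norm. A pure-Python sketch under those assumptions (helper names and the iteration count are illustrative; practical implementations reuse the iteration vector across training steps):

```python
def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def spectral_norm(m, iters=50):
    # power iteration on M^T M converges to the top singular direction;
    # the norm of M v at that direction is the largest singular value
    v = [1.0] * len(m[0])
    for _ in range(iters):
        u = matvec(m, v)
        v = matvec(transpose(m), u)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = matvec(m, v)
    return sum(x * x for x in u) ** 0.5

def spectrally_normalize(m, target=1.0):
    # rescale so the layer stretches no input by more than `target`
    s = spectral_norm(m)
    return [[x * target / s for x in row] for row in m]
```

Dividing the whole matrix by its spectral norm caps the layer's Lipschitz constant, which is the property this form of regularization controls.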
Stochastic Depth randomly drops whole residual layers during training while keeping the full network at inference time.
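For a residual block \(x + f(x)\), dropping the layer means falling back to the identity path. A sketch of one common formulation, in which the inference-time branch is scaled by the survival probability to match the training-time expectation (names and the default probability are illustrative):

```python
import random

def residual_block(x, f, survival_prob=0.8, training=True):
    # stochastic depth: during training, skip the residual branch f
    # entirely with probability 1 - survival_prob
    if training:
        if random.random() < survival_prob:
            return x + f(x)
        return x  # identity: the whole layer is dropped this step
    # at inference the full network runs, with f scaled so the
    # expected output matches what training saw on average
    return x + survival_prob * f(x)
```

Because whole layers vanish at random, each training step effectively trains a shallower subnetwork, while inference still uses the full depth.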