Minimum Description Length (MDL) picks the model that compresses the data best by minimizing L(M) + L(D|M).
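A toy two-part MDL comparison can make this concrete. The sketch below scores a constant model against a linear model on near-linear data; the coding scheme is an assumption for illustration (8 bits per parameter for L(M), and a Gaussian-residual surrogate (n/2)·log2(RSS/n) for L(D|M)), not the unique MDL code.

```python
import math

# Toy two-part MDL: compare a constant model vs. a linear model.
# Assumed surrogate code lengths (for illustration only):
#   L(M)   = 8 bits per parameter
#   L(D|M) = (n/2) * log2(RSS / n)   (Gaussian noise model)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.0, 3.9, 6.1, 8.0]   # roughly y = 2x

def rss_constant(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def rss_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def mdl(rss, n_params, n):
    l_model = 8 * n_params                              # L(M)
    l_data = 0.5 * n * math.log2(max(rss, 1e-12) / n)   # L(D|M)
    return l_model + l_data

n = len(xs)
score_const = mdl(rss_constant(ys), 1, n)
score_lin = mdl(rss_linear(xs, ys), 2, n)
# The linear model compresses this near-linear data better even though it
# pays for one extra parameter.
```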
Rényi entropy generalizes Shannon entropy by measuring uncertainty with a tunable emphasis on common versus rare outcomes.
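A minimal implementation shows the tuning knob in action: H_alpha(p) = log2(sum_i p_i^alpha) / (1 - alpha), with the Shannon entropy recovered as alpha approaches 1.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) = log2(sum p_i^alpha) / (1 - alpha).

    alpha -> 1 recovers Shannon entropy; large alpha emphasizes the most
    common outcomes, alpha < 1 emphasizes rare ones.
    """
    if abs(alpha - 1.0) < 1e-12:   # Shannon limit
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
h_shannon = renyi_entropy(p, 1.0)   # 1.75 bits
h_max = renyi_entropy(p, 0.0)       # Hartley entropy: log2(support size) = 2
h_min = renyi_entropy(p, 64.0)      # approaches min-entropy -log2(max p) = 1
```

Note how the entropy is monotonically non-increasing in alpha: rare outcomes inflate the alpha=0 value, while only the single most likely outcome matters as alpha grows.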
The Weak Law of Large Numbers (WLLN) says that the sample average of independent, identically distributed (i.i.d.) random variables with finite mean gets close to the true mean with high probability as the sample size grows.
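This is easy to check empirically. The sketch below draws i.i.d. Uniform(0, 1) samples (true mean 0.5) and watches the sample mean's deviation shrink as n grows; the seed is fixed only to make the run reproducible.

```python
import random

# Empirical check of the WLLN: the sample mean of i.i.d. Uniform(0, 1)
# draws concentrates near the true mean 0.5 as the sample size grows.
random.seed(0)   # fixed seed for reproducibility

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

devs = {n: abs(sample_mean(n) - 0.5) for n in (10, 1_000, 100_000)}
# devs[100_000] is tiny; devs[10] can still be noticeably off.
```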
Mixed precision training stores and computes tensors in low precision (FP16/BF16) for speed and memory savings while keeping a master copy of weights in FP32 for accurate updates.
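The FP32 master copy matters because small updates can round away entirely in half precision. The sketch below simulates FP16 storage with the standard-library `struct` half-float format; it is a numerical illustration of the rounding problem, not a training loop.

```python
import struct

def to_fp16(x):
    """Round a Python float through IEEE half precision (simulated FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

master = 1.0          # FP32/FP64 master weight
lr, grad = 1e-4, 1.0  # a small but legitimate update step

# FP16-only update: the step vanishes, because the spacing between
# representable half-precision values near 1.0 (~1e-3) is larger than
# the update itself, so 1.0 - 1e-4 rounds back to 1.0.
fp16_weight = to_fp16(to_fp16(master) - to_fp16(lr * grad))

# Mixed precision: apply the same update to the full-precision master
# copy, where it is preserved; the FP16 copy is re-derived each step.
master = master - lr * grad
```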
Data parallelism splits the training data across workers that compute gradients in parallel on a shared model.
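A minimal sketch of the pattern: each "worker" computes a gradient on its own shard, the gradients are averaged (the all-reduce step), and one shared model is updated. The scalar model and loss here are illustrative assumptions.

```python
# Data-parallel sketch: model is a scalar w, loss is mean (w*x - y)^2.
def grad_on_shard(w, shard):
    # d/dw mean (w*x - y)^2 = mean 2*x*(w*x - y)
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

data = [(x, 3.0 * x) for x in range(1, 9)]   # y = 3x exactly
shards = [data[0:4], data[4:8]]              # two workers, disjoint shards

w = 0.0
for _ in range(200):
    grads = [grad_on_shard(w, s) for s in shards]  # parallel in a real system
    avg_grad = sum(grads) / len(grads)             # all-reduce (average)
    w -= 0.001 * avg_grad                          # every worker applies the
                                                   # same update, so replicas
                                                   # stay in sync
```

The averaged gradient equals the gradient over the full dataset (up to shard-size weighting), which is why the replicas converge to the same w = 3 as single-worker training.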
Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
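Lion's distinctive trait is that the update direction is the sign of an interpolated momentum, so every coordinate moves by exactly ±lr (plus decoupled weight decay). A 1-D sketch of the published update rule, with a toy quadratic for the demo (the demo's beta2 is shortened from the paper's 0.99 default so the tiny run settles quickly):

```python
def lion_step(w, g, m, lr=1e-2, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update (1-D sketch of the update rule)."""
    sign = lambda v: (v > 0) - (v < 0)
    update = sign(beta1 * m + (1 - beta1) * g)   # interpolate, then take sign
    w_new = w - lr * (update + wd * w)           # decoupled weight decay
    m_new = beta2 * m + (1 - beta2) * g          # momentum EMA
    return w_new, m_new

# Minimize f(w) = (w - 5)^2, gradient 2*(w - 5).
w, m = 0.0, 0.0
for _ in range(600):
    g = 2 * (w - 5.0)
    w, m = lion_step(w, g, m, beta2=0.9)   # short momentum for this tiny demo
```

Because the step size is fixed at ±lr, the iterate walks steadily toward the minimum and then oscillates in a band around it rather than converging exactly; in practice this is handled by learning-rate schedules.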
Sharpness-Aware Minimization (SAM) trains models to perform well even when their weights are slightly perturbed, seeking flatter minima that generalize better.
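SAM's inner loop is a two-step procedure: first ascend to the worst nearby point within a radius rho, then descend using the gradient evaluated there. A 1-D sketch (a simple quadratic stands in for a real loss surface):

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step (1-D sketch).

    1) Ascend to the adversarial point w + eps, ||eps|| <= rho, along the
       gradient direction.
    2) Descend using the gradient evaluated *there*, so the update also
       lowers the loss under small weight perturbations.
    """
    g = grad_fn(w)
    norm = abs(g) or 1.0            # avoid division by zero at a stationary point
    eps = rho * g / norm            # ascent step (in 1-D: rho * sign(g))
    g_perturbed = grad_fn(w + eps)  # gradient at the perturbed weights
    return w - lr * g_perturbed

grad = lambda w: 2 * (w - 2.0)      # f(w) = (w - 2)^2
w = 0.0
for _ in range(100):
    w = sam_step(w, grad)
```

Each SAM step costs two gradient evaluations instead of one; the payoff is that the minimizer found is one whose loss stays low throughout a rho-ball around it.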
The Moore–Penrose pseudoinverse generalizes matrix inversion to rectangular or singular matrices and is denoted A⁺.
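For a full-column-rank A the pseudoinverse is A⁺ = (AᵀA)⁻¹Aᵀ. The sketch below takes the simplest case, a single column a, where AᵀA is just the scalar ‖a‖², and checks one Penrose condition plus the least-squares use of A⁺.

```python
# Pseudoinverse of a single-column matrix a (m x 1): a+ = a^T / ||a||^2.
a = [3.0, 4.0]
norm_sq = sum(x * x for x in a)       # ||a||^2 = 25
a_pinv = [x / norm_sq for x in a]     # 1 x 2 row matrix

# Penrose condition A A+ A = A (here: a * (a+ . a) = a, since a+ . a = 1).
apa = [ai * sum(p * x for p, x in zip(a_pinv, a)) for ai in a]

# Least-squares use: x = A+ b minimizes ||A x - b|| even though A x = b
# has no exact solution for b = [1, 2].
b = [1.0, 2.0]
x = sum(p * bi for p, bi in zip(a_pinv, b))   # (3*1 + 4*2) / 25 = 0.44
```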
A sparse matrix stores only its nonzero entries, saving huge amounts of memory when most entries are zero.
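One standard layout is CSR (compressed sparse row): the nonzero values, their column indices, and row-start offsets. A minimal sketch, including a matrix-vector product that touches only the stored nonzeros:

```python
# Build CSR storage from a small dense matrix with mostly-zero entries.
dense = [
    [5.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
]

values, col_idx, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0.0:
            values.append(v)      # nonzero entries only
            col_idx.append(j)     # their column positions
    row_ptr.append(len(values))   # where each row's entries end

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x, iterating over stored nonzeros only."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])   # [5.0, 2.0, 3.0]
```

For an n x n matrix with nnz nonzeros, CSR stores O(nnz + n) numbers instead of n², and the matvec runs in O(nnz) instead of O(n²).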
The Kronecker product A ⊗ B expands a small matrix into a larger block matrix by multiplying every entry of A with the whole matrix B.
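The block structure is easy to see in a direct implementation: entry a_ij of A becomes the block a_ij · B, so an m×n A and p×q B produce an mp×nq result.

```python
def kron(A, B):
    """Kronecker product: each entry a_ij is replaced by the block a_ij * B."""
    rows = []
    for a_row in A:
        for b_row in B:
            # One output row: a_i0*b_row, a_i1*b_row, ... laid side by side.
            rows.append([a * b for a in a_row for b in b_row])
    return rows

A = [[1, 2],
     [3, 4]]
I = [[1, 0],
     [0, 1]]
K = kron(A, I)   # 4 x 4 block matrix whose (i, j) block is a_ij * I
```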
Orthogonal (real) and unitary (complex) matrices are length- and angle-preserving transformations, like perfect rotations and reflections.
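A 2-D rotation matrix is the canonical example: QᵀQ = I, and applying Q leaves every vector's length unchanged.

```python
import math

# A rotation matrix is orthogonal: applying it preserves vector lengths.
theta = math.pi / 3
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [3.0, 4.0]
w = matvec(Q, v)
len_before = math.hypot(*v)   # 5.0
len_after = math.hypot(*w)    # still 5.0, up to floating-point rounding
```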
Message passing treats meshes and point clouds as graphs where nodes exchange information with neighbors to learn useful features.
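A minimal sketch of one round with mean aggregation: each node's new feature is the average of its neighbors' features. The toy graph and features are illustrative assumptions; real GNN layers add learned transforms and typically mix in the node's own state.

```python
# One round of mean-aggregation message passing on a tiny undirected graph.
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}    # adjacency lists
feat = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}         # one scalar feature per node

def message_pass(adj, feat):
    new_feat = {}
    for node, neighbors in adj.items():
        msgs = [feat[n] for n in neighbors]     # gather neighbor messages
        new_feat[node] = sum(msgs) / len(msgs)  # mean aggregation
    return new_feat

feat1 = message_pass(adj, feat)   # {0: 2.5, 1: 1.0, 2: 2.5, 3: 3.0}
```

Stacking k such rounds lets information travel k hops, which is how mesh vertices or point-cloud points come to encode their local geometric context.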