๐ŸŽ“How I Study AIHISA
๐Ÿ“–Read
๐Ÿ“„Papers๐Ÿ“ฐBlogs๐ŸŽฌCourses
๐Ÿ’กLearn
๐Ÿ›ค๏ธPaths๐Ÿ“šTopics๐Ÿ’กConcepts๐ŸŽดShorts
๐ŸŽฏPractice
๐Ÿ“Daily Log๐ŸŽฏPrompts๐Ÿง Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Concepts (532)

Groups

๐Ÿ“Linear Algebra15๐Ÿ“ˆCalculus & Differentiation10๐ŸŽฏOptimization14๐ŸŽฒProbability Theory12๐Ÿ“ŠStatistics for ML9๐Ÿ“กInformation Theory10๐Ÿ”บConvex Optimization7๐Ÿ”ขNumerical Methods6๐Ÿ•ธGraph Theory for Deep Learning6๐Ÿ”ตTopology for ML5๐ŸŒDifferential Geometry6โˆžMeasure Theory & Functional Analysis6๐ŸŽฐRandom Matrix Theory5๐ŸŒŠFourier Analysis & Signal Processing9๐ŸŽฐSampling & Monte Carlo Methods10๐Ÿง Deep Learning Theory12๐Ÿ›ก๏ธRegularization Theory11๐Ÿ‘๏ธAttention & Transformer Theory10๐ŸŽจGenerative Model Theory11๐Ÿ”ฎRepresentation Learning10๐ŸŽฎReinforcement Learning Mathematics9๐Ÿ”„Variational Methods8๐Ÿ“‰Loss Functions & Objectives10โฑ๏ธSequence & Temporal Models8๐Ÿ’ŽGeometric Deep Learning8

๐Ÿ“šTheoryAdvanced

Feature Learning vs Kernel Regime

The kernel (lazy) regime keeps neural network parameters close to their initialization, making training equivalent to kernel regression with a fixed kernel such as the Neural Tangent Kernel (NTK).

#neural tangent kernel · #kernel ridge regression · #lazy training · +12 more
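A compact way to state the lazy regime (standard NTK notation, not taken from the card) is to Taylor-expand the network in its parameters around initialization; gradient descent on the resulting linear-in-parameters model is exactly kernel regression:

```latex
% Lazy (kernel) regime: parameters stay near their initialization \theta_0,
% so the network is well approximated by its first-order expansion in \theta.
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}\,(\theta - \theta_0)
% Gradient descent on this model, which is linear in \theta, is kernel regression with the
% fixed kernel \Theta(x,x') = \nabla_\theta f(x;\theta_0)^{\top} \nabla_\theta f(x';\theta_0),
% i.e. the Neural Tangent Kernel at initialization.
```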
๐Ÿ“šTheoryIntermediate

Grokking & Delayed Generalization

Grokking is the phenomenon in which a model suddenly starts to generalize well long after it has already memorized the training set.

#grokking · #delayed generalization · #weight decay · +12 more
๐Ÿ“šTheoryAdvanced

Mean Field Theory of Neural Networks

Mean field theory treats very wide randomly initialized neural networks as averaging machines where each neuron behaves like a sample from a common distribution.

#mean field theory · #neural tangent kernel · #neural network gaussian process · +12 more
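As a quick numerical illustration of the "neurons as samples from a common distribution" picture, here is a minimal numpy sketch (my own example, assuming Gaussian weights with a 1/√d_in scaling): for one fixed input, the preactivations of a very wide random layer have the variance the infinite-width theory predicts.

```python
import numpy as np

# Minimal illustration (not code from the card): for one fixed input x, the
# preactivations of a wide random layer with N(0, sigma_w^2 / d_in) weights are
# approximately i.i.d. N(0, sigma_w^2 * ||x||^2 / d_in), as the mean-field /
# infinite-width picture predicts.
rng = np.random.default_rng(0)
d_in, width, sigma_w = 128, 50_000, 1.0

x = rng.standard_normal(d_in)                                     # one fixed input
W = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(width, d_in))  # random layer
z = W @ x                                                         # one preactivation per neuron

predicted_var = sigma_w**2 * (x @ x) / d_in
print(f"empirical var {z.var():.4f}  vs  predicted var {predicted_var:.4f}")
```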
๐Ÿ“šTheoryAdvanced

Information Bottleneck in Deep Learning

The Information Bottleneck (IB) principle formalizes learning compact representations T that keep only the information about X that is useful for predicting Y.

#information bottleneck · #variational information bottleneck · #mutual information · +11 more
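For reference, the IB trade-off is usually written as a Lagrangian over stochastic encoders p(t | x), with β controlling how much predictive information is kept per bit of compression (standard notation, not copied from the card):

```latex
% Information Bottleneck: pick a stochastic encoder p(t \mid x) that compresses X
% while retaining what is predictive of Y; \beta > 0 sets the trade-off.
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y)
```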
๐Ÿ“šTheoryAdvanced

Generalization Bounds for Deep Learning

Generalization bounds explain why deep neural networks can perform well on unseen data despite having many parameters.

#generalization bounds · #pac-bayes · #compression bounds · +12 more
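To show the flavor of such results, here is one McAllester-style PAC-Bayes bound, with constants that vary slightly between published versions: with probability at least 1 − δ over an i.i.d. sample S of size n, simultaneously for every posterior Q over hypotheses,

```latex
% McAllester-style PAC-Bayes bound (exact constants differ between versions).
% P: data-independent prior over hypotheses, Q: any posterior,
% L(Q): true risk averaged over Q, \hat{L}_S(Q): empirical risk on S, |S| = n.
L(Q) \;\le\; \hat{L}_S(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```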
๐Ÿ“šTheoryIntermediate

Implicit Bias of Gradient Descent

In underdetermined linear systems (more variables than equations), gradient descent started at zero converges to the minimum Euclidean norm solution without any explicit regularizer.

#implicit bias · #gradient descent · #minimum norm · +12 more
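This is easy to verify numerically; the following small numpy sketch (my own example, not code from the card) runs plain gradient descent from zero on a random underdetermined system and compares the result to the pseudoinverse solution.

```python
import numpy as np

# Small numerical check (illustration only): for an underdetermined system
# (5 equations, 20 unknowns), gradient descent on 0.5 * ||Ax - b||^2 started
# at x = 0 converges to the minimum Euclidean norm interpolator pinv(A) @ b,
# with no explicit regularizer anywhere.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
b = rng.standard_normal(5)

x = np.zeros(20)
lr = 0.01
for _ in range(20_000):
    x -= lr * A.T @ (A @ x - b)          # gradient step on the squared residual

x_min_norm = np.linalg.pinv(A) @ b       # minimum Euclidean norm solution
print(np.abs(x - x_min_norm).max())      # effectively zero: GD found the minimum-norm solution
```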
๐Ÿ“šTheoryIntermediate

Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis (LTH) says that inside a large dense neural network there exist small sparse subnetworks that, when trained in isolation from their original initialization, can reach comparable accuracy to the full model.

#lottery ticket hypothesis · #magnitude pruning · #sparsity · +12 more
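The core recipe is magnitude pruning plus "rewinding" the surviving weights to their original initialization. A toy numpy sketch of one pruning round (illustration only; real LTH experiments train actual networks) might look like:

```python
import numpy as np

# Toy sketch of one round of lottery-ticket style magnitude pruning
# (w_trained is a stand-in for weights obtained by actually training a network):
# 1) store the initial weights, 2) train, 3) keep the largest-magnitude
# trained weights, 4) rewind the kept weights to their initial values.
rng = np.random.default_rng(0)
w_init = rng.standard_normal(1_000)                       # weights at initialization
w_trained = w_init + 0.5 * rng.standard_normal(1_000)     # stand-in for trained weights

prune_fraction = 0.8
threshold = np.quantile(np.abs(w_trained), prune_fraction)
mask = (np.abs(w_trained) >= threshold).astype(float)     # 1 = keep, 0 = prune

w_ticket = mask * w_init                                  # sparse subnetwork, rewound to init
print(f"kept {int(mask.sum())} of {mask.size} weights")
```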
๐Ÿ“šTheoryIntermediate

Double Descent Phenomenon

Double descent describes how test error first follows the classic U-shape with increasing model complexity, spikes near the interpolation threshold, and then drops again in the highly overparameterized regime.

#double descent · #interpolation threshold · #overparameterization · +12 more
๐Ÿ“šTheoryAdvanced

Neural Tangent Kernel (NTK)

Neural Tangent Kernel (NTK) describes how wide neural networks train like kernel machines, turning gradient descent into kernel regression in the infinite-width limit.

#neural tangent kernel · #ntk · #nngp · +12 more
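Concretely, under gradient flow on squared loss, the infinite-width predictions evolve by kernel gradient descent with a kernel that stays essentially frozen at its initial value (standard notation, not taken from the card):

```latex
% NTK and its induced training dynamics under gradient flow on squared loss;
% in the infinite-width limit \Theta stays (approximately) fixed during training.
\Theta(x, x') \;=\; \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta)
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\, f_t(x) \;=\; -\,\eta \sum_{i=1}^{n} \Theta(x, x_i)\,\bigl(f_t(x_i) - y_i\bigr)
```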
๐Ÿ“šTheoryIntermediate

Depth vs Width Tradeoffs

Depth adds compositional power: stacking layers lets neural networks represent functions with many repeated patterns using far fewer neurons than a single wide layer.

#depth vs width · #relu · #piecewise linear · +12 more
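A classic worked example of this compositional advantage is the iterated ReLU "tent" map; the numpy sketch below (my own illustration, in the spirit of standard depth-separation constructions) builds a function with 2^depth linear pieces from only O(depth) units.

```python
import numpy as np

# The "tent" map needs only 2 ReLUs, and composing it with itself `depth`
# times gives a function with 2**depth linear pieces on [0, 1] using O(depth)
# units, whereas a single hidden ReLU layer needs roughly one unit per piece.
def tent(x):
    relu = lambda z: np.maximum(z, 0.0)
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)   # maps [0, 1] onto [0, 1]

depth = 3
x = np.linspace(0.0, 1.0, 9)                     # the grid points k / 8
y = x.copy()
for _ in range(depth):
    y = tent(y)                                  # each composition doubles the number of pieces
print(y)   # alternates 0, 1, 0, 1, ...: the composed map sweeps across [0, 1] four times
```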
โš™๏ธAlgorithmIntermediate

Stratified & Latin Hypercube Sampling

Stratified sampling reduces Monte Carlo variance by dividing the domain into non-overlapping regions (strata) and sampling within each region; Latin hypercube sampling extends the idea to many dimensions by stratifying each coordinate separately.

#stratified sampling · #latin hypercube sampling · #variance reduction · +11 more
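Here is a minimal numpy implementation of Latin hypercube sampling on the unit cube (my own sketch, not a specific library's API): each coordinate axis is split into n equal-width strata, every stratum receives exactly one point, and independent permutations decide how strata are paired up across dimensions.

```python
import numpy as np

# Minimal Latin hypercube sampler on [0, 1]^d: along each dimension the n
# points occupy the n strata [k/n, (k+1)/n) exactly once, with an independent
# random permutation choosing which point lands in which stratum.
def latin_hypercube(n, d, rng):
    samples = np.empty((n, d))
    for j in range(d):
        strata = rng.permutation(n)        # which stratum each point occupies in dim j
        jitter = rng.uniform(size=n)       # uniform position inside that stratum
        samples[:, j] = (strata + jitter) / n
    return samples

rng = np.random.default_rng(0)
pts = latin_hypercube(8, 2, rng)
print(pts)   # every column hits each of the 8 bins [k/8, (k+1)/8) exactly once
```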
๐Ÿ“šTheoryIntermediate

Reparameterization Trick

The reparameterization trick rewrites a random variable as a deterministic function of noise that does not depend on the parameters, such as z = ฮผ + ฯƒ ยท ฮต with ฮต ~ N(0, 1).

#reparameterization trick · #pathwise derivative · #variational autoencoder · +11 more
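A tiny numpy sketch of the resulting pathwise gradient estimator (my own example, with f(z) = z² chosen so the exact answer is known): differentiating through the samples z = μ + σ·ε gives an unbiased estimate of ∇_μ E[f(z)].

```python
import numpy as np

# Pathwise (reparameterization) gradient: estimate d/dmu E[f(z)] for
# z ~ N(mu, sigma^2) by writing z = mu + sigma * eps with eps ~ N(0, 1)
# and differentiating f through the samples. For f(z) = z**2 the exact
# gradient is 2 * mu, so the estimate is easy to check.
rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.7, 100_000

eps = rng.standard_normal(n)
z = mu + sigma * eps                 # reparameterized samples; dz/dmu = 1
grad_estimate = np.mean(2.0 * z)     # chain rule: f'(z) * dz/dmu
print(f"pathwise estimate {grad_estimate:.3f}  vs  exact {2.0 * mu:.3f}")
```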