How I Study AI - Learn AI Papers & Lectures the Easy Way

Concepts (356)

Groups

๐Ÿ“Linear Algebra15๐Ÿ“ˆCalculus & Differentiation10๐ŸŽฏOptimization14๐ŸŽฒProbability Theory12๐Ÿ“ŠStatistics for ML9๐Ÿ“กInformation Theory10๐Ÿ”บConvex Optimization7๐Ÿ”ขNumerical Methods6๐Ÿ•ธGraph Theory for Deep Learning6๐Ÿ”ตTopology for ML5๐ŸŒDifferential Geometry6โˆžMeasure Theory & Functional Analysis6๐ŸŽฐRandom Matrix Theory5๐ŸŒŠFourier Analysis & Signal Processing9๐ŸŽฐSampling & Monte Carlo Methods10๐Ÿง Deep Learning Theory12๐Ÿ›ก๏ธRegularization Theory11๐Ÿ‘๏ธAttention & Transformer Theory10๐ŸŽจGenerative Model Theory11๐Ÿ”ฎRepresentation Learning10๐ŸŽฎReinforcement Learning Mathematics9๐Ÿ”„Variational Methods8๐Ÿ“‰Loss Functions & Objectives10โฑ๏ธSequence & Temporal Models8๐Ÿ’ŽGeometric Deep Learning8

Theory · Intermediate

Autoregressive Models

Autoregressive (AR) models represent a joint distribution by multiplying conditional probabilities in a fixed order, using the chain rule of probability.

#autoregressive, #ar model, #n-gram (+11 more)
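
The chain-rule factorization can be sketched in a few lines. This is a minimal illustration, assuming a made-up first-order model in which each token depends only on the previous one (a general AR model conditions on all earlier tokens); the `COND` table and the `<s>` start symbol are hypothetical.

```python
import math

# Hypothetical conditionals p(x_t | x_{t-1}); "<s>" marks the sequence start.
COND = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(tokens):
    """Sum log p(x_t | x_{t-1}) from left to right (chain rule, fixed order)."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

# p(a, b, b) = p(a | <s>) * p(b | a) * p(b | b) = 0.6 * 0.9 * 0.5
p = math.exp(sequence_log_prob(["a", "b", "b"]))
```

Working in log space avoids underflow when many small conditionals are multiplied.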

Math · Intermediate

Wasserstein Distance & Optimal Transport

Wasserstein distance (Earth Mover's Distance) measures how much "work" is needed to transform one probability distribution into another by moving mass with minimal total cost.

#wasserstein distance, #earth mover's distance, #optimal transport (+12 more)
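
In one dimension the optimal transport plan is easy to compute, which makes for a compact example. The sketch below assumes two equal-sized samples with equal weight per point and absolute distance as the cost; under those assumptions the minimal-cost plan matches sorted points in order. The function name is illustrative.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1-D samples.

    With equal weights and cost |x - y|, the cheapest plan pairs the
    i-th smallest x with the i-th smallest y.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

# Shifting every point by 1 costs exactly 1 unit of work per unit of mass.
shift_cost = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```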

Theory · Intermediate

Maximum Likelihood & Generative Models

Maximum Likelihood Estimation (MLE) picks parameters that make the observed data most probable under a chosen probabilistic model.

#maximum likelihood, #generative models, #naive bayes (+12 more)
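
A small worked case, assuming a Bernoulli (coin-flip) model and made-up data: the likelihood-maximizing parameter has the closed form "fraction of ones", and a brute-force grid search over the log-likelihood lands on the same value.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-probability of 0/1 observations under a coin with P(1) = p."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical observations

# Closed-form MLE for a Bernoulli model: the sample mean.
p_closed_form = sum(data) / len(data)

# Numerical check: evaluate the log-likelihood on a grid and take the argmax.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))
```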

Theory · Intermediate

Mixture of Experts (MoE)

A Mixture of Experts (MoE) routes each input to a small subset of specialized models called experts, enabling conditional computation.

#mixture of experts, #moe, #gating network (+12 more)
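
One common way to route is top-k gating, sketched below under simplifying assumptions: the experts are toy scalar functions, and the gate logits are passed in directly rather than produced by a learned gating network. `EXPERTS` and `moe_forward` are hypothetical names.

```python
import math

# Toy "experts"; in a real model these would be learned sub-networks.
EXPERTS = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x, lambda x: x * x]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_logits, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize the gate over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * EXPERTS[i](x) for w, i in zip(weights, top))
    return output, top

out, chosen = moe_forward(3.0, [0.0, 5.0, 5.0, -1.0], k=2)
```

Only the `k` selected experts are evaluated, which is where the conditional-computation saving comes from.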

Theory · Intermediate

Key-Value Memory Systems

Key-Value memory systems store information as pairs where keys are used to look up values by similarity rather than exact match.

#key-value memory, #attention, #scaled dot-product (+12 more)
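
A soft lookup can be sketched as follows, assuming dot-product similarity and a softmax over the scores (one common choice, not the only one). Instead of returning a single matching value, the read is a similarity-weighted blend of all stored values.

```python
import math

def soft_lookup(query, keys, values):
    """Return a similarity-weighted average of values, plus the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
# The query points mostly along the first key, so the read is close to 10.
read, weights = soft_lookup([5.0, 0.0], keys, values)
```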

Math · Intermediate

Softmax & Temperature Scaling

Softmax turns arbitrary real-valued scores (logits) into probabilities that sum to one.

#softmax, #temperature scaling, #logits (+12 more)
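
A minimal sketch of softmax with a temperature parameter, assuming the usual convention of dividing the logits by the temperature before exponentiating. Higher temperatures flatten the distribution and lower temperatures sharpen it.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.2)
plain = softmax(logits)
flat = softmax(logits, temperature=5.0)
```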

Math · Intermediate

Positional Encoding Mathematics

Sinusoidal positional encoding represents each token's position using pairs of sine and cosine waves at exponentially spaced frequencies.

#positional encoding, #sinusoidal, #transformer (+11 more)
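
The encoding for a single position can be sketched like this. The base of 10000 is the value commonly used in Transformer implementations and is an assumption here, as is the interleaved sine/cosine layout.

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Encode one position as d_model / 2 (sin, cos) pairs."""
    if d_model % 2:
        raise ValueError("d_model must be even")
    encoding = []
    for i in range(d_model // 2):
        # Wavelengths grow geometrically with the pair index i.
        angle = pos / base ** (2 * i / d_model)
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

pe0 = positional_encoding(0, 4)  # position 0: every sine is 0, every cosine is 1
pe7 = positional_encoding(7, 8)
```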
โš™๏ธAlgorithmIntermediate

Efficient Attention Mechanisms

Standard softmax attention costs \(O(n^2)\) in sequence length because every token compares with every other token.

#linear attention, #efficient attention, #kernel trick (+12 more)
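
One family of fixes, linear attention, replaces the softmax similarity with a kernel phi(q)·phi(k) so the sums over keys can be computed once and reused for every query. The sketch below assumes the feature map elu(x) + 1 (a common choice, not the only one) and checks that the reordered computation matches the pairwise one. Note that this approximates, rather than reproduces, softmax attention.

```python
import math

def phi(vec):
    """elu(x) + 1: a positive feature map (assumed for illustration)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """Kernel attention computed pairwise: n * n similarity evaluations."""
    out = []
    for q in Q:
        sims = [dot(phi(q), phi(k)) for k in K]
        z = sum(sims)
        out.append([sum(s * v[d] for s, v in zip(sims, V)) / z
                    for d in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result, but keys and values are summarized in a single pass."""
    r, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(r)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * r                       # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(r):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][d] for a in range(r)) / denom
                    for d in range(dv)])
    return out
```

The second function touches each key and each query once, so its cost grows linearly with sequence length for a fixed feature dimension.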

Theory · Intermediate

Self-Attention as Graph Neural Network

Self-attention can be viewed as message passing on a fully connected graph where each token (node) sends a weighted message to every other token.

#self-attention, #graph neural network, #message passing (+11 more)
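
The graph view can be sketched as a single aggregation step. This assumes the attention weights have already been computed and are supplied as a dense edge-weight matrix whose rows sum to one, with self-loops included; the function name is illustrative.

```python
def message_passing_step(node_features, edge_weights):
    """Update every node with a weighted sum of messages from all nodes.

    edge_weights[i][j] is the weight on the message sent from node j to
    node i. In self-attention these are the softmax attention weights.
    """
    n, dim = len(node_features), len(node_features[0])
    updated = []
    for i in range(n):
        aggregated = [0.0] * dim
        for j in range(n):  # fully connected graph: every j sends to i
            for d in range(dim):
                aggregated[d] += edge_weights[i][j] * node_features[j][d]
        updated.append(aggregated)
    return updated

features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]
averaged = message_passing_step(features, uniform)  # each node gets the mean
```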

Theory · Intermediate

Multi-Head Attention

Multi-Head Attention runs several attention mechanisms in parallel so each head can focus on different relationships in the data.

#multi-head attention, #scaled dot-product attention, #transformer (+12 more)
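
The split, attend, and concatenate pattern can be sketched as below. To stay short, this omits the learned projection matrices that a full multi-head layer applies before and after the heads, so each head simply sees a slice of the input features. All function names are illustrative.

```python
import math

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z
                    for d in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice the feature dimension into num_heads equal chunks."""
    d = len(X[0])
    if d % num_heads:
        raise ValueError("feature size must be divisible by num_heads")
    size = d // num_heads
    return [[row[h * size:(h + 1) * size] for row in X]
            for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    """Run attention per head, then concatenate head outputs per token."""
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]
```

Because each head computes its own attention weights, two heads can attend to different tokens for the same query position.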

Theory · Intermediate

Scaled Dot-Product Attention

Scaled dot-product attention determines how much each value in V contributes to a query's output by taking the query's dot products with the keys K, dividing by \(\sqrt{d_k}\), applying softmax, and forming a weighted sum of the values.

#scaled dot-product attention, #softmax, #transformer (+10 more)
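
Written as a single formula, assuming the queries, keys, and values are stacked row-wise into matrices \(Q\), \(K\), and \(V\) (the symbol \(Q\) is introduced here), with \(d_k\) the key dimension and softmax applied to each row:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]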

Theory · Intermediate

Stochastic Depth

Stochastic Depth randomly drops whole residual layers during training while keeping the full network at inference time.

#stochastic depth, #resnet, #residual block (+12 more)
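
A single residual block under this scheme might look like the sketch below. It assumes a fixed survival probability per block and, at inference, scales the residual branch by that probability so its expected contribution matches training; that scaling rule is a common convention and is an assumption here. Names are illustrative.

```python
import random

def residual_block(x, f, survival_prob, training, rng=random):
    """Apply x + f(x), randomly skipping f during training."""
    if training:
        if rng.random() < survival_prob:
            return x + f(x)  # block survives this pass
        return x             # block dropped: only the identity shortcut
    # Inference: always run the block, scaled to its expected contribution.
    return x + survival_prob * f(x)

def double(v):
    return 2 * v

rng = random.Random(0)
train_outputs = {residual_block(1.0, double, 0.8, True, rng) for _ in range(200)}
eval_output = residual_block(1.0, double, 0.8, False)
```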