Mathematical foundations of attention mechanisms, Transformer architectures, and their theoretical properties.
10 concepts
Scaled dot-product attention decides how much each value in V should contribute to a query by taking dot products with the keys K, dividing by \(\sqrt{d_k}\) so the scores stay in a range where softmax gradients do not vanish, applying softmax, and forming a weighted sum of the values.
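A minimal NumPy sketch of this formula, assuming unbatched 2-D arrays for Q, K, and V (the shapes and random inputs here are illustrative, not from the source):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))   # 2 queries, d_k = 4
K = rng.standard_normal((3, 4))   # 3 keys
V = rng.standard_normal((3, 4))   # 3 values
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the keys, so each output row is a convex combination of the value rows.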
Multi-Head Attention runs several attention mechanisms in parallel so each head can focus on different relationships in the data.
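The standard split-project-concatenate pattern can be sketched as follows; the projection matrices here are random placeholders for trained weights, and NumPy is assumed:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape each projection into per-head blocks: (n_heads, n, d_head)
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                      # attention per head
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                                # final output projection

rng = np.random.default_rng(1)
d_model, n, n_heads = 8, 5, 2
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Because each head attends in its own `d_head`-dimensional subspace, the heads can specialize in different relationships at no extra parameter cost versus one full-width head.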
Self-attention can be viewed as message passing on a fully connected graph where each token (node) sends a weighted message to every other token.
Standard softmax attention costs O(n²) in sequence length because every token compares with every other token.
Sinusoidal positional encoding represents each token's position using pairs of sine and cosine waves at exponentially spaced frequencies.
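A sketch of the standard construction, assuming NumPy and an even model dimension (the sizes below are arbitrary examples):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2) pair index
    angles = pos / (10000 ** (2 * i / d_model))    # exponentially spaced frequencies
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 16)
```

Each sin/cos pair is a point on a unit circle rotating at its own frequency, which is what lets relative offsets be expressed as fixed linear transformations of the encoding.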
Softmax turns arbitrary real-valued scores (logits) into probabilities that sum to one.
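A numerically stable NumPy sketch: subtracting the maximum logit before exponentiating leaves the result unchanged mathematically but avoids overflow for large scores.

```python
import numpy as np

def softmax(logits):
    # shift by the max logit for numerical stability (softmax is shift-invariant)
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()   # normalize so the outputs sum to one

p = softmax(np.array([2.0, 1.0, 0.1]))
```

Shift invariance also means only the differences between logits matter, not their absolute values.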
Key-Value memory systems store information as pairs where keys are used to look up values by similarity rather than exact match.
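A toy illustration of similarity-based lookup, assuming NumPy; the two-slot memory, the values 10 and 20, and the temperature parameter are all made up for the example:

```python
import numpy as np

# A tiny soft key-value memory: a query retrieves a blend of values
# weighted by key similarity (dot product + softmax), not exact match.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0]])       # one key per memory slot
values = np.array([[10.0],
                   [20.0]])         # value stored in each slot

def read(query, temperature=0.1):
    sims = keys @ query / temperature   # similarity of the query to each key
    w = np.exp(sims - sims.max())
    w /= w.sum()                        # soft "address" over the slots
    return w @ values                   # similarity-weighted blend of values

out = read(np.array([1.0, 0.1]))    # close to the first key, not identical
```

A low temperature sharpens the lookup toward the nearest key; a high temperature blends values more evenly.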
A Mixture of Experts (MoE) routes each input to a small subset of specialized models called experts, enabling conditional computation.
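A minimal top-1 routing sketch, assuming NumPy; the two toy "experts" and the sign-based gate are hypothetical stand-ins for learned networks:

```python
import numpy as np

# Two toy experts: simple functions standing in for learned sub-networks.
experts = [lambda x: 2.0 * x,      # expert 0: doubles the input
           lambda x: x + 100.0]    # expert 1: shifts the input

def gate(x):
    # toy gating network: prefers expert 0 for negative x, expert 1 for positive
    logits = np.array([-x, x])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def moe_forward(x, top_k=1):
    probs = gate(x)
    chosen = np.argsort(probs)[-top_k:]        # indices of the top-k experts
    total = probs[chosen].sum()
    # only the selected experts run: this is the conditional computation
    return sum(probs[i] / total * experts[i](x) for i in chosen)

y_neg = moe_forward(-3.0)   # gate routes to expert 0
y_pos = moe_forward(5.0)    # gate routes to expert 1
```

With top-1 routing, each input pays the compute cost of a single expert even though the total parameter count grows with the number of experts.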
Transformer expressiveness studies what kinds of sequence-to-sequence mappings a Transformer can represent or approximate.
In-context learning (ICL) means a model learns from examples provided in the input itself, without updating its parameters.