Key-Value memory systems store information as pairs where keys are used to look up values by similarity rather than exact match.
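A minimal sketch of similarity-based lookup, assuming NumPy and made-up toy keys and values: instead of retrieving one exact entry, the query is scored against every key and the returned value is a similarity-weighted blend.

```python
import numpy as np

# Toy memory: two keys with associated scalar values (made-up numbers).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
values = np.array([10.0, 20.0])

def soft_lookup(query, keys, values):
    # Score each key by dot-product similarity with the query.
    scores = keys @ query
    # Turn scores into weights that sum to one, then blend the values.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values
```

A query close to the first key, e.g. `soft_lookup(np.array([5.0, 0.0]), keys, values)`, returns a result near 10.0, because most of the weight lands on the matching key rather than requiring an exact match.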
Softmax turns arbitrary real-valued scores (logits) into probabilities that sum to one.
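A short NumPy sketch of the definition above; subtracting the maximum logit before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit so np.exp never overflows; the shift
    # cancels out in the normalization.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()
```

For example, `softmax(np.array([1.0, 2.0, 3.0]))` yields probabilities that sum to one and preserve the ordering of the logits.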
Self-attention can be viewed as message passing on a fully connected graph where each token (node) sends a weighted message to every other token.
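The graph view above can be made concrete with NumPy on a toy sequence (random embeddings, made-up sizes): the attention matrix `A` is a dense adjacency matrix where `A[i, j]` is the weight of the message token `j` sends to token `i`, and each row is a distribution over senders.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# Toy sequence: 4 tokens, 3-dimensional embeddings (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Dense "adjacency" over the fully connected token graph:
# row i holds token i's weights over all senders, summing to one.
A = softmax(X @ X.T / np.sqrt(X.shape[1]))

# Each token aggregates a weighted message from every token (itself included).
messages = A @ X
```

Because the graph is fully connected, every token can receive information from every other token in a single step; the weights, not the topology, decide who is heard.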
Multi-Head Attention runs several attention mechanisms in parallel so each head can focus on different relationships in the data.
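A simplified NumPy sketch of the parallel-heads idea (the projection matrices, sizes, and loop-per-head layout are illustrative choices, not a canonical implementation): the model dimension is split across heads, each head attends independently, and the per-head outputs are concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    # Project the inputs, split the model dimension into n_heads slices,
    # run attention per head, then concatenate the head outputs.
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(n_heads):
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(dh))   # each head's own attention map
        outs.append(A @ v)
    return np.concatenate(outs, axis=1)
```

Because each head works in its own subspace, one head can specialize in, say, positional relationships while another tracks content similarity, at roughly the same total cost as a single full-width head.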
Scaled dot-product attention computes how much each value in V should contribute to a query's output by taking dot products of the query with the keys K, dividing by \(\sqrt{d_k}\), applying softmax, and forming the resulting weighted sum of values.
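The steps in the definition map directly onto a few lines of NumPy (a sketch for single, unbatched Q, K, V matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # 1. Dot products of queries with keys, divided by sqrt(d_k)
    #    so the softmax inputs stay at a moderate scale.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Softmax over the key axis turns scores into weights per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Weighted sum of the values.
    return weights @ V
```

Without the \(\sqrt{d_k}\) division, dot products grow with the key dimension, pushing the softmax into a near-one-hot regime with vanishing gradients; the scaling keeps the score variance roughly independent of \(d_k\).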