Multi-Head Attention runs several attention mechanisms in parallel so each head can focus on different relationships in the data.
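A minimal NumPy sketch of the idea: the input is split into per-head slices, each head computes its own attention, and the head outputs are concatenated. (Real implementations apply learned projection matrices \(W_Q, W_K, W_V, W_O\) per head; this sketch omits them, and all names here are illustrative.)

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    # Q, K, V: (seq_len, d_model); d_model must be divisible by num_heads
    seq_len, d_model = Q.shape
    d_k = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Each head sees only its slice of the feature dimension
        q = Q[:, h * d_k:(h + 1) * d_k]
        k = K[:, h * d_k:(h + 1) * d_k]
        v = V[:, h * d_k:(h + 1) * d_k]
        scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len)
        heads.append(softmax(scores) @ v)    # (seq_len, d_k)
    # Concatenate head outputs back to (seq_len, d_model)
    return np.concatenate(heads, axis=-1)
```

Because each head operates on a different slice, the heads can specialize in different relationships even though they all run the same attention computation.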
Scaled dot-product attention scores how relevant each key K is to a query Q by taking dot products, dividing by \(\sqrt{d_k}\), and applying softmax; the resulting weights then form a weighted sum of the values V: \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\).
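The steps above map directly to a few lines of NumPy (a sketch, not a production implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query-key similarities, scaled
    # Row-wise softmax: each query's weights over the keys sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # weighted sum of values: (n_q, d_v)
```

The \(\sqrt{d_k}\) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into a near-one-hot regime with vanishing gradients.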