Multi-Head Attention runs several attention mechanisms in parallel so each head can focus on different relationships in the data.
Attention computes a weighted sum of the values V, where each weight reflects how similar a query in Q is to a key in K.
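The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. It assumes the common scaled dot-product form, in which similarity is the dot product of a query and a key divided by sqrt(d_k) and then passed through a softmax; the text does not specify the similarity function. The names `attention`, `multi_head_attention`, `W_q`, `W_k`, `W_v`, `W_o`, and `num_heads` are hypothetical and chosen only for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    # Assumption: similarity is the scaled dot product between queries and keys.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row of weights sums to 1
    return weights @ V, weights         # weighted sum of the values


def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); every W_*: (d_model, d_model).
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # The leading axis is the head index, so each head attends independently.
    heads, _ = attention(Q, K, V)
    # Concatenate the heads back to (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o


# Usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)  # (5, 8)
```

Splitting the model dimension across heads keeps the total cost close to that of a single full-width head, while letting each head learn its own projection of queries, keys, and values.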