Transformers are powerful but slow on long inputs: standard self-attention compares every token with every other token, so its time and memory cost grows quadratically with sequence length.
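To make the quadratic cost concrete, here is a minimal sketch of self-attention in NumPy. It is deliberately simplified — it skips the learned query/key/value projections and multi-head splitting (assumptions for illustration) — but it shows the `(seq_len, seq_len)` score matrix that every-token-to-every-token comparison produces:

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, d_model). For simplicity we use x itself as
    # queries, keys, and values; a real layer learns projections.
    q, k, v = x, x, x
    d = x.shape[-1]
    # scores: (seq_len, seq_len) — this pairwise matrix is why
    # cost grows quadratically with sequence length.
    scores = q @ k.T / np.sqrt(d)
    # softmax over each row to get attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (seq_len, d_model)

x = np.random.randn(8, 16)   # 8 tokens, 16-dim embeddings
out = self_attention(x)
print(out.shape)             # (8, 16)
```

Doubling the sequence length from 8 to 16 tokens quadruples the size of the score matrix (64 to 256 entries), which is the scaling problem the sentence above describes.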
The paper introduces the Transformer, a model that maps input sequences to output sequences (such as sentences) using attention alone, without recurrent or convolutional layers.