Transformers are powerful but slow on long inputs: standard self-attention compares every token with every other token, so its time and memory cost grows quadratically with sequence length.
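To make the quadratic cost concrete, here is a minimal sketch of self-attention in NumPy. It is deliberately simplified — it skips the learned query/key/value projections and multi-head splitting (assumptions for illustration) — but it shows the `(seq_len, seq_len)` score matrix that every-token-to-every-token comparison produces:

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, d_model). For simplicity we use x itself as
    # queries, keys, and values; a real layer learns projections.
    q, k, v = x, x, x
    d = x.shape[-1]
    # scores: (seq_len, seq_len) — this pairwise matrix is why
    # cost grows quadratically with sequence length.
    scores = q @ k.T / np.sqrt(d)
    # softmax over each row to get attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (seq_len, d_model)

x = np.random.randn(8, 16)   # 8 tokens, 16-dim embeddings
out = self_attention(x)
print(out.shape)             # (8, 16)
```

Doubling the sequence length from 8 to 16 tokens quadruples the size of the score matrix (64 to 256 entries), which is the scaling problem the sentence above describes.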
The paper introduces the Transformer, a model that maps input sequences to output sequences (such as sentences) using attention alone, without recurrent or convolutional layers.