This paper shows how to convert a large pretrained Transformer into a faster hybrid model that mixes attention layers with RNN layers, using comparatively little training data (about 2.3B tokens).
Transformers slow down on very long inputs because standard attention scores every pair of tokens, so its compute cost grows quadratically with sequence length.
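To make the scaling argument concrete, the sketch below counts token-level operations for a pure-attention stack and for a hybrid stack. It is only an illustration: the function names and the layer counts (32 attention layers versus 8 attention plus 24 recurrent layers) are assumptions chosen for the example, not figures from the paper, and the count ignores hidden size, heads, and constant factors.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key scores computed by one full attention layer: one per token pair."""
    return seq_len * seq_len


def recurrent_steps(seq_len: int) -> int:
    """State updates performed by one recurrent layer: one per token."""
    return seq_len


def stack_cost(seq_len: int, n_attention: int, n_recurrent: int) -> int:
    """Rough token-level operation count for a stack of layers.

    Treats one pair score and one state update as one unit each,
    which is a simplification made for this illustration.
    """
    return (n_attention * attention_pairs(seq_len)
            + n_recurrent * recurrent_steps(seq_len))


if __name__ == "__main__":
    # Hypothetical layer counts, used only to show the trend.
    for n in (1_000, 10_000, 100_000):
        full = stack_cost(n, n_attention=32, n_recurrent=0)
        hybrid = stack_cost(n, n_attention=8, n_recurrent=24)
        print(f"seq_len={n:>7}  full={full:.3e}  hybrid={hybrid:.3e}  "
              f"ratio={full / hybrid:.2f}")
```

Under these assumptions the recurrent layers contribute a term that grows only linearly with sequence length, so the remaining attention layers dominate the hybrid's cost on long inputs.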