This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN layers using far less training data (about 2.3B tokens).
The paper introduces Fast-weight Product Key Memory (FwPKM), a memory layer that can quickly learn from the current text it reads, not just from past training.
This paper shows how a language model can keep learning while you use it, so it handles very long inputs without slowing down.