Long inputs make language models slow because, for every new token they generate, they must store and re-read a growing memory called the KV cache, which expands with the length of the text.
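The slowdown is easy to see from the arithmetic: the KV cache grows linearly with the number of tokens. Below is a minimal sketch of that calculation; the model dimensions (32 layers, 32 heads, head size 128, fp16) are hypothetical values chosen to resemble a 7B-class Transformer, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size for a decoder-only Transformer.

    Keys and values (factor of 2) are stored for every layer, head,
    and token, so memory grows linearly with sequence length.
    """
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * seq_len

# Illustrative growth with the hypothetical dimensions above:
# 1,000 tokens  -> ~0.5 GB of cache
# 100,000 tokens -> ~52 GB, and every new token must re-read all of it
print(kv_cache_bytes(1_000) / 1e9)    # ~0.52 GB
print(kv_cache_bytes(100_000) / 1e9)  # ~52.4 GB
```

Because each generated token attends over the entire cache, both memory and per-token latency scale with input length, which is what motivates replacing some attention layers with RNN layers that keep a fixed-size state.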
This paper shows how to convert a large Transformer into a faster hybrid model that mixes attention and RNN layers, using far less training data (about 2.3B tokens) than training such a model from scratch would require.
TimeBill is a system that helps large AI models finish their responses within a time budget without sacrificing answer quality.