Fast weight models remember context with a tiny, fixed memory, but standard next-token training teaches them to think only one word ahead.
Long texts slow language models down because, for every new word they write, the models must keep and re-check a huge memory called the KV cache, which grows as the text gets longer.
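To make the contrast between the two kinds of memory concrete, here is a minimal sketch in plain Python. It counts how many numbers each approach has to store as more words arrive. The function names, the toy key/value vectors, and the additive outer-product update are all illustrative assumptions, not the method of any particular model.

```python
# Sketch only: compares memory growth of a KV cache with a fixed-size
# fast-weight state. Vector contents and the update rule are assumptions.

def toy_vector(t, d):
    """Hypothetical stand-in for a learned key or value vector at step t."""
    return [float(t + 1)] * d


def kv_cache_sizes(num_tokens, d):
    """Stored floats after each token when one key and one value are cached per token."""
    keys, values = [], []
    sizes = []
    for t in range(num_tokens):
        keys.append(toy_vector(t, d))
        values.append(toy_vector(t, d))
        # Every past key and value is kept, so the memory grows with t.
        sizes.append(sum(len(k) for k in keys) + sum(len(v) for v in values))
    return sizes


def fast_weight_sizes(num_tokens, d):
    """Stored floats after each token when context is folded into one d x d matrix."""
    W = [[0.0] * d for _ in range(d)]
    sizes = []
    for t in range(num_tokens):
        k, v = toy_vector(t, d), toy_vector(t, d)
        # Assumed update rule: add the outer product of value and key to W.
        for i in range(d):
            for j in range(d):
                W[i][j] += v[i] * k[j]
        # W never changes shape, so the memory stays constant.
        sizes.append(sum(len(row) for row in W))
    return sizes


if __name__ == "__main__":
    print(kv_cache_sizes(4, 2))     # [4, 8, 12, 16]
    print(fast_weight_sizes(4, 2))  # [4, 4, 4, 4]
```

With four words and vectors of size 2, the cache grows from 4 to 16 stored numbers, while the fast-weight matrix stays at 4 no matter how many words have been read.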