Fast weight models remember context with a tiny, fixed memory, but standard next-token training teaches them to think only one word ahead.
MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention—sparse and linear—to read very long texts quickly and accurately.
Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.