Papers (3)

#RULER

Fast weight models remember context with a tiny, fixed memory, but standard next-token training teaches them to think only one word ahead.

Not triaged yet
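As a rough illustration of what a "tiny, fixed memory" can mean, here is a minimal sketch of an outer-product fast weight memory. This is a generic textbook-style update, not necessarily the one used in the paper summarized above. The function names `fast_weight_memory` and `read` are hypothetical.

```python
def fast_weight_memory(keys, values):
    """Write every (key, value) pair into one fixed-size matrix.

    Assumption: a plain additive outer-product write. The matrix has
    shape (d_v, d_k) no matter how many tokens are written.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    memory = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for i in range(d_v):
            for j in range(d_k):
                memory[i][j] += v[i] * k[j]  # outer product v k^T, accumulated
    return memory


def read(memory, query):
    """Retrieve a value by multiplying the memory matrix with a query vector."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in memory]


# Two tokens with orthogonal keys, so each value can be read back exactly.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
memory = fast_weight_memory(keys, values)
print(read(memory, [0.0, 1.0]))  # [8.0, 9.0, 10.0]
```

With more tokens than key dimensions the keys can no longer all be orthogonal, so reads start to interfere with each other. That is the cost of keeping the memory size constant.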

MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention—sparse and linear—to read very long texts quickly and accurately.

Not triaged yet

Long texts make language models slow because they must keep and re-check a huge memory called the KV cache for every new word they write.

Not triaged yet
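To give a feel for how large the KV cache gets, here is a back-of-the-envelope sketch. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical and are not taken from any of the papers listed here.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single sequence.

    The leading factor 2 counts keys and values; bytes_per_value=2
    assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(131072, n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(full // 1024**3, "GiB at 131072 tokens")   # 16 GiB at 131072 tokens
```

The cache grows linearly with the number of tokens, and each new token has to attend over all of it, which is why long inputs cost both memory and time.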