The paper studies Mamba-2 (a fast, linear-time alternative to standard attention) and pares it down to the pieces that truly boost accuracy.
MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention, sparse and linear, to read very long texts quickly and accurately.
This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN layers, using only about 2.3B tokens of training data.
Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.
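To make the quadratic cost concrete, here is a minimal sketch of full attention (shapes and names are illustrative, not from the paper): the score matrix has one entry per token pair, so its size grows as n squared with sequence length n.

```python
import numpy as np

def full_attention(q, k, v):
    # Every query attends to every key, so the score matrix is n x n:
    # both compute and memory grow quadratically with sequence length.
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax rows
    return weights @ v                                    # (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Doubling the number of tokens quadruples the size of the score matrix, which is why video-length sequences make full attention expensive.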
Yume1.5 is a model that turns text or a single image into a living, explorable video world that you can move through using the keyboard.
The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.
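One simple way to realize this kind of direct neighbor-to-neighbor sharing is a short causal convolution, where each token mixes in a few immediate predecessors. The sketch below is an illustration of that general idea, not the paper's exact Canon-layer implementation; the kernel size and weights are made up.

```python
import numpy as np

def local_mix(x, kernel=(0.5, 0.3, 0.2)):
    # Each position blends its own features with those of its few
    # predecessors (causal: no look-ahead) -- "passing notes along the row".
    # kernel[s] weighs the token s steps back; values here are illustrative.
    n, _ = x.shape
    out = np.zeros_like(x)
    for shift, w in enumerate(kernel):
        out[shift:] += w * x[:n - shift] if shift else w * x
    return out

# 4 tokens with one-hot features: each output row shows which
# earlier tokens contributed, and by how much.
x = np.eye(4)
mixed = local_mix(x)
print(mixed[1])  # [0.3 0.5 0.  0. ]: token 1 blends itself with token 0
```

A kernel of length 3 means information hops at most two positions per layer, but stacking layers lets it travel further, which is the cheap local channel the add-ons provide.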
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
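For intuition on the "local focus" half: sliding-window attention restricts each token to a fixed-size window of recent tokens instead of the whole sequence. A minimal mask sketch (the window size is illustrative; the long-range memory half, Gated DeltaNet, instead carries a running recurrent state and is not shown):

```python
import numpy as np

def sliding_window_mask(n, window):
    # Token i may attend only to tokens j with i - window < j <= i:
    # causal (no future tokens) and local (bounded look-back),
    # so cost per token is O(window) rather than O(n).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(5, 2)
print(mask.astype(int))
```

Each row of the mask has at most `window` true entries, which is what keeps this half of the model fast; anything outside the window must reach the token through the linear memory module instead.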