This paper shows how to turn a big Transformer into a faster hybrid model that mixes attention and RNN layers, using far less training data (about 2.3B tokens).
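As a rough illustration of the hybrid idea (not the paper's actual architecture), the sketch below interleaves ordinary self-attention layers with simple recurrent layers in a single stack; the layer sizes, the mixing ratio, and the tanh recurrence are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, Wq, Wk, Wv):
    """Full self-attention: every token attends to every other token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def rnn_layer(x, Wx, Wh):
    """Simple recurrent layer: tokens are processed left to right,
    carrying a fixed-size hidden state instead of scoring all pairs."""
    h = np.zeros(Wh.shape[0])
    out = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

# Hypothetical hybrid stack: alternate attention and RNN layers.
d = 16
rng = np.random.default_rng(0)
x = rng.normal(size=(8, d))          # 8 tokens, dimension 16
for i in range(4):
    if i % 2 == 0:                   # keep attention in some layers ...
        x = attention_layer(x, *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
    else:                            # ... and use cheaper RNN layers elsewhere
        x = rnn_layer(x, rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
print(x.shape)                       # (8, 16): same shape out, cheaper mixing inside
```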
Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive because the cost grows with the square of the sequence length.
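To make that cost concrete, here is a tiny back-of-the-envelope calculation; the token counts are made up for illustration, but the quadratic trend is the point.

```python
# Hypothetical token counts; real videos vary, but the growth pattern is what matters.
for n_tokens in [1_000, 10_000, 100_000]:
    pairs = n_tokens * n_tokens          # full attention scores one pair per (i, j)
    print(f"{n_tokens:>7} tokens -> {pairs:>18,} attention pairs")
# 10x more tokens means ~100x more pairs: quadratic growth.
```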
Yume1.5 is a model that turns text or a single image into a living, explorable video world you can move through with keyboard keys.
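The explore-by-keyboard loop can be pictured roughly as below; the key-to-action mapping, the `generate_next_frames` helper, and every name here are hypothetical placeholders for illustration, not Yume1.5's real interface.

```python
# Hypothetical sketch of keyboard-driven exploration (not Yume1.5's actual API).
ACTIONS = {"w": "move forward", "s": "move back", "a": "turn left", "d": "turn right"}

def generate_next_frames(world_frames, action):
    """Placeholder for the video model: it would extend the video,
    conditioned on what has been generated so far and the chosen action."""
    return world_frames + [f"frame after '{action}'"]

world = ["frame from the starting text or image prompt"]
for key in "wwad":                      # pretend the user pressed W, W, A, D
    world = generate_next_frames(world, ACTIONS[key])
print(len(world), "frames so far; latest:", world[-1])
```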
The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.
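One simple way to realize "nearby tokens share information directly" is a short causal mixing step where each position adds in a weighted sum of its few previous neighbours; the sketch below assumes that form with a window of 3 and fixed weights, which may differ from the paper's exact design.

```python
import numpy as np

def canon_like_layer(x, window=3):
    """Each token adds a weighted mix of its `window` previous neighbours
    (weights are fixed here purely for illustration)."""
    seq_len, dim = x.shape
    weights = np.array([0.5, 0.3, 0.2])[:window]   # assumed weights, nearest neighbour first
    out = x.copy()
    for t in range(seq_len):
        for k, w in enumerate(weights, start=1):   # look back 1..window tokens
            if t - k >= 0:
                out[t] += w * x[t - k]             # the "note" passed from a nearby desk
    return out

x = np.arange(12, dtype=float).reshape(6, 2)       # 6 tokens, 2-dim embeddings
print(canon_like_layer(x))
```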
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
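A rough sketch of how the two ideas can sit side by side: a sliding-window attention step that only looks at nearby tokens, plus a gated linear recurrence that carries a compressed long-range memory. The recurrence is a much simplified stand-in for Gated DeltaNet (whose real update rule is more involved), and the dimensions, window size, and decay gate are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(x, window=4):
    """Local focus: token t only attends to the last `window` tokens (causal)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    for t in range(n):
        lo = max(0, t - window + 1)
        mask[t, lo:t + 1] = 0.0              # allow only the recent window
    return softmax(scores + mask) @ x

def gated_linear_memory(x, decay=0.9):
    """Long-term memory: a running state updated token by token
    (a simplified stand-in for Gated DeltaNet)."""
    n, d = x.shape
    state = np.zeros((d, d))
    out = np.zeros_like(x)
    for t in range(n):
        state = decay * state + np.outer(x[t], x[t])   # gate old memory, add new
        out[t] = state @ x[t]
    return out

rng = np.random.default_rng(1)
tokens = rng.normal(size=(10, 8))
mixed = sliding_window_attention(tokens) + gated_linear_memory(tokens)
print(mixed.shape)   # (10, 8): local detail plus compressed long-range memory
```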