The paper introduces CoPE, a simple change to how models track word positions that makes long documents much easier for them to understand.
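To make the idea concrete, here is a toy sketch of content-dependent positions in the spirit of CoPE (assuming CoPE refers to contextual position encoding): instead of a token's fixed index, each query counts earlier tokens through a learned gate, so "position" depends on what the text actually contains. This is an illustration of the idea, not the paper's code.

```python
import torch

def contextual_positions(q, k):
    """Toy contextual positions: how far back token j is from query i,
    counted by content-dependent gates instead of fixed indices.
    q, k: (seq_len, dim).  Fractional results would index interpolated
    position embeddings in a real model."""
    seq_len = q.shape[0]
    gates = torch.sigmoid(q @ k.T)                     # how much each past token "counts"
    causal = torch.tril(torch.ones(seq_len, seq_len))  # ignore future tokens
    gates = gates * causal
    # position[i, j] = sum of gates over tokens j..i (reverse cumulative sum over keys)
    positions = torch.flip(torch.cumsum(torch.flip(gates, dims=[-1]), dim=-1), dims=[-1])
    return positions

q, k = torch.randn(6, 16), torch.randn(6, 16)
print(contextual_positions(q, k))
```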
FASA is a training-free method that makes large language models faster and lighter on memory by keeping only the most useful past tokens during decoding.
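"Keeping only the most useful past tokens" in practice means shrinking the key/value cache that grows at every decoding step. The sketch below shows the generic version of that idea, scoring cached tokens by how much attention they have received and dropping the rest; the scoring rule, the budget, and the names `prune_kv_cache`, `keep`, and `keep_recent` are assumptions for illustration, not FASA's actual algorithm.

```python
import torch

def prune_kv_cache(keys, values, attn_scores, keep=256, keep_recent=32):
    """Generic sketch of attention-based KV-cache pruning (not FASA's exact rule).

    keys, values : (seq_len, dim) tensors cached from past decoding steps.
    attn_scores  : (seq_len,) accumulated attention each cached token has received.
    Keeps the `keep_recent` newest tokens plus the highest-scoring older ones,
    so the cache never holds more than `keep` entries."""
    seq_len = keys.shape[0]
    if seq_len <= keep:
        return keys, values, attn_scores
    recent = torch.arange(seq_len - keep_recent, seq_len)          # always keep the newest tokens
    older_scores = attn_scores[: seq_len - keep_recent]
    topk = torch.topk(older_scores, k=keep - keep_recent).indices  # most useful older tokens
    idx = torch.cat([topk.sort().values, recent])
    return keys[idx], values[idx], attn_scores[idx]

keys, values = torch.randn(512, 64), torch.randn(512, 64)
scores = torch.rand(512)
k2, v2, s2 = prune_kv_cache(keys, values, scores)
print(k2.shape)  # torch.Size([256, 64])
```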
This paper shows how to turn a big Transformer model into a faster hybrid model that mixes attention and RNN layers, using far less training data than building such a model from scratch (about 2.3B tokens).
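For a rough picture of what "mixing attention and RNN layers" can look like, here is a toy hybrid stack where some layers keep attention and the rest use a recurrent cell. The every-third-layer pattern and the GRU choice are illustrative assumptions; the paper's actual contribution is converting an existing Transformer with very little data, which this sketch does not show.

```python
import torch
import torch.nn as nn

class HybridBlockStack(nn.Module):
    """Toy hybrid stack: some layers use attention, the rest use a recurrent cell."""
    def __init__(self, dim=64, n_layers=6, keep_attention_every=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            if i % keep_attention_every == 0:
                self.layers.append(nn.MultiheadAttention(dim, num_heads=4, batch_first=True))
            else:
                self.layers.append(nn.GRU(dim, dim, batch_first=True))

    def forward(self, x):                # x: (batch, seq, dim)
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                out, _ = layer(x, x, x, need_weights=False)
            else:
                out, _ = layer(x)
            x = x + out                  # residual connection around each block
        return x

x = torch.randn(2, 32, 64)
print(HybridBlockStack()(x).shape)       # torch.Size([2, 32, 64])
```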
OmniTransfer is a single system that learns from a whole reference video, not just one image, so it can copy how things look (identity and style) and how they move (motion, camera, effects).
Ministral 3 is a new family of small-but-mighty AI language models (3B, 8B, 14B) that learn from a larger model using a step-by-step tutoring method called Cascade Distillation.
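For readers unfamiliar with distillation, the building block is a soft-label loss that pushes the student's predictions toward the teacher's. The snippet below shows that standard loss only; how Cascade Distillation actually chains the models (for example, larger students teaching smaller ones in sequence) is an assumption here, not a confirmed detail of Ministral 3's training.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-label distillation: KL divergence between softened
    teacher and student distributions (the textbook ingredient, not the
    paper's specific Cascade Distillation recipe)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

# Cascade idea (assumed): each smaller model learns from the one above it,
# e.g. 14B from the big teacher, 8B from 14B, 3B from 8B.
teacher_logits = torch.randn(4, 32000)   # (batch, vocab) from the larger model
student_logits = torch.randn(4, 32000, requires_grad=True)
print(distill_loss(student_logits, teacher_logits))
```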
K-EXAONE is a super-sized language model that speaks six languages and can read very long documents (up to 256,000 tokens) without forgetting important details.
The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.
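A Canon-style layer can be pictured as a tiny causal mixer: each token adds in a weighted copy of its few previous neighbors. The sketch below implements that picture with a small depthwise convolution added residually; the window size, placement, and the class name `LocalMixer` are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LocalMixer(nn.Module):
    """Sketch of a Canon-style layer: each token blends in its few previous neighbors."""
    def __init__(self, dim, window=4):
        super().__init__()
        self.window = window
        # depthwise conv: each channel mixes only across nearby positions
        self.conv = nn.Conv1d(dim, dim, kernel_size=window, groups=dim, bias=False)

    def forward(self, x):                                # x: (batch, seq, dim)
        h = x.transpose(1, 2)                            # (batch, dim, seq) for Conv1d
        h = nn.functional.pad(h, (self.window - 1, 0))   # left-pad so no token sees the future
        h = self.conv(h).transpose(1, 2)
        return x + h                                     # residual: original token + local mix

x = torch.randn(2, 10, 32)
print(LocalMixer(32)(x).shape)  # torch.Size([2, 10, 32])
```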
This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch’s embedding in an image sequence, just like language models predict the next word.
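Here is a toy version of the next-patch-embedding idea: split an image into a sequence of patches, run a causal model over them, and regress each position's prediction onto the embedding of the patch that comes next. The encoder, the MSE loss, and the use of the same embedding layer for inputs and targets are simplifying assumptions, not NEPA's exact setup.

```python
import torch
import torch.nn as nn

class NextPatchPredictor(nn.Module):
    """Toy next-patch-embedding objective: a causal Transformer reads patches
    0..i and regresses the embedding of patch i+1, the way a language model
    predicts the next word.  Targets come from the same embedding layer here,
    which is a simplification."""
    def __init__(self, patch_dim=48, dim=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def forward(self, patches):                      # (batch, n_patches, patch_dim)
        z = self.embed(patches)
        mask = nn.Transformer.generate_square_subsequent_mask(z.shape[1])
        h = self.backbone(z, mask=mask)              # causal: each patch sees only earlier ones
        pred = self.head(h[:, :-1])                  # prediction for each "next" patch
        target = z[:, 1:].detach()
        return nn.functional.mse_loss(pred, target)  # regress the next patch's embedding

patches = torch.rand(2, 16, 48)  # e.g. 16 flattened 4x4x3 patches from one image
print(NextPatchPredictor()(patches))
```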
Large language models usually line words up in fixed order slots, which can waste the model's attention and make it harder to pick out the important parts of a long or noisy text.
GRAPE is a new way to tell Transformers where each word is in a sentence by using neat math moves called group actions.
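GRAPE's exact construction is not spelled out above, but one familiar example of a group action used for positions is rotation (the RoPE construction): position n acts on a feature pair by rotating it n times, and because rotations compose the way positions add, dot products end up depending only on relative offsets. The snippet below shows that example purely to illustrate what "group actions" buy you; GRAPE's own group may be different.

```python
import numpy as np

def rotate(vec, n, theta=0.1):
    """Apply the position-n element of a rotation group to a 2-D feature pair."""
    angle = n * theta
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ vec

q, k = np.array([1.0, 0.0]), np.array([1.0, 0.0])
# Rotations compose like positions add, so the dot product of the rotated
# vectors depends only on the relative offset between the two positions:
print(np.dot(rotate(q, 7), rotate(k, 4)))   # offset 3
print(np.dot(rotate(q, 10), rotate(k, 7)))  # same offset 3 -> same value
```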