OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.
DeepSeek-OCR 2 teaches a computer to “read” pictures of documents in a smarter order, more like how people read.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
HyperVL is a small but smart model that understands images and text, designed to run fast on phones and tablets.
The paper shows that video AIs do not need long, human-like chains of thought to reason well.