OmniSIFT is a new method for compressing audio and video tokens so that omni-modal language models can reason faster without discarding important details.
Transformers slow down on very long inputs because standard self-attention compares every pair of tokens, so its cost grows quadratically with sequence length.
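To see where the quadratic cost comes from, here is a minimal sketch (not OmniSIFT itself) of scaled dot-product attention scores in NumPy; the function name `attention_scores` and the shapes are illustrative assumptions. The score matrix has one entry per token pair, so it is n x n for a sequence of n tokens:

```python
import numpy as np

def attention_scores(q, k):
    # Scaled dot-product attention scores: one entry per (query, key)
    # token pair, so the result is n x n for a length-n sequence.
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d)

n, d = 1024, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

scores = attention_scores(q, k)
print(scores.shape)  # (1024, 1024)
```

Doubling the sequence length quadruples the size of this matrix, which is why shrinking the token stream before attention (as token-compression methods like OmniSIFT aim to do) pays off so directly.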