OmniSIFT is a new way to compress audio and video tokens so that omni-modal language models can run faster without losing important details.
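To make the compression idea concrete, here is a minimal sketch of importance-based token pruning in PyTorch: score each audio or video token, keep only the top fraction, and pass the smaller set on to the language model. The scoring rule (plain token norm) and the keep ratio are illustrative assumptions for this sketch, not OmniSIFT's actual method.

```python
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the most 'important' fraction of tokens.

    tokens: (num_tokens, hidden_dim) audio or video features.
    Importance here is just the L2 norm of each token -- a stand-in for
    whatever learned scoring rule a real method would use.
    """
    num_keep = max(1, int(tokens.shape[0] * keep_ratio))
    scores = tokens.norm(dim=-1)                             # one score per token
    keep_idx = scores.topk(num_keep).indices.sort().values   # keep original order
    return tokens[keep_idx]

# Example: 1,024 video tokens compressed to 256 before reaching the language model.
video_tokens = torch.randn(1024, 768)
compressed = prune_tokens(video_tokens, keep_ratio=0.25)
print(compressed.shape)  # torch.Size([256, 768])
```

Fewer tokens means fewer attention computations per layer, which is where the speedup comes from; the open question any such method has to answer is how to score tokens so the discarded ones really are unimportant.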
Multimodal AI models can mix up what they see and what they hear, inventing details from one sense that were never in the other; this is called cross-modal hallucination.
FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, rather than just explaining what already happened.
JavisGPT is a single AI model that can both understand sounding videos (audio + video together) and generate new ones that stay in sync.