FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
Key Summary
- FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
- It contains 919 videos and 1,034 multiple-choice questions across 8 domains, with tricky distractors that force true audio-visual reasoning.
- Across 20 models, the best score is only 64.8% (Gemini 3 Flash), showing future forecasting is still hard, especially when speech is important.
- The authors built a scalable, AI-assisted, human-checked pipeline to pick videos where audio really matters and to precisely locate events in time.
- They propose OFF (Omni-Modal Future Forecasting), a 7K-sample instruction-tuning strategy that teaches models to explain why a future follows from a given audio-video moment.
- OFF improves open-source models on FutureOmni and also helps on other audio-visual and even video-only benchmarks, showing better generalization.
- Speech-heavy cases are the toughest; models do better with music or clear sound effects than with spoken language tied to visuals.
- Ablation studies show audio plus video (A+V) beats using either alone, and even beats video plus subtitles, so raw audio carries unique signals text can't fully replace.
- Attention visualizations suggest OFF helps models focus more on key audio and video frames, which likely drives the gains.
- The work highlights real-world stakes like safer driving, better home assistants, and more reliable video understanding when quick, forward-looking decisions matter.
Why This Research Matters
In real life, we constantly predict what happens next to stay safe and act wisely, and FutureOmni checks whether AI can do the same with video and sound together. This can make self-driving systems more cautious by listening for sirens while watching for pedestrians. It can make home assistants more helpful by hearing warnings and seeing actions, like noticing boiling kettles or crying babies. It helps video tools summarize, edit, or flag risky moments before they fully unfold, saving time and preventing mistakes. In emergencies, combining radio chatter with chaotic visuals can guide faster, safer choices. By teaching AI to explain its predictions, we also make it more trustworthy and easier to correct when it errs.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine watching a movie scene where you hear a siren and see a person run across the street; you instantly think, "Uh-oh, something is about to happen." You're not just describing the scene; you're predicting the next beat.
Filling (Multimodal Large Language Models, or MLLMs):
- What it is: MLLMs are AI systems that can take in different kinds of information, like text, pictures, and sounds, and talk about them.
- How it works: 1) A vision part looks at images or video frames; 2) an audio part listens to sounds and speech; 3) a language brain ties everything together to answer questions or make predictions.
- Why it matters: Without MLLMs, an AI might only read text or only see pictures, missing important clues that are in sound or motion. Anchor: If you ask an AI, "What's happening in this video of a guitar lesson?" MLLMs can look at the teacher's fingers and listen to the chords to answer more accurately than using just one modality.
Hook: You know how a line of dominoes falls: knock one, then the next, then the next?
Filling (Causal and Temporal Reasoning):
- What it is: It's understanding what causes what (cause → effect) and in what order things happen (earlier → later).
- How it works: 1) Spot the key event (like a warning shout); 2) note the timing; 3) connect it to what logically follows (someone stops moving); 4) rule out things that are out of order.
- Why it matters: Without cause-and-time logic, an AI might mix up the past and future or think a result came before its cause. Anchor: If a coach blows a whistle (cause) during a game, players stop (effect) right after. That ordered link is what causal and temporal reasoning captures.
Hook: Think of a magician who tells a story while showing pictures and playing music. You need all three to get the trick.
Filling (Cross-Modal Reasoning):
- What it is: It's when AI connects clues across different senses, like matching a spoken warning (audio) to someone freezing (video).
- How it works: 1) Listen for key sounds or speech; 2) watch for matching actions or objects; 3) check timing alignment; 4) combine them to decide what's likely next.
- Why it matters: Without cross-modal links, the AI might believe a fake answer that looks right but is contradicted by the sound, or vice versa. Anchor: If a chef on video says, "Now add salt," and you see a hand pour white grains into a pot, the audio and video together confirm the action.
The world before this paper: AI benchmarks mostly tested "retrospective understanding," which means describing or analyzing what already happened in a video. These tests pushed progress in captioning, identifying objects, and even explaining scenes, but they didn't focus on forecasting the future.
The problem: Real life needs foresight. Cars should predict a pedestrian stepping onto the street; a home assistant should guess that a boiling kettle will whistle; coaches or editors want to foresee the next play or cut. Yet few tools check if MLLMs can predict what happens next from both audio and video together.
Failed attempts: Some benchmarks asked models to predict futures, but mostly from text only (no real video or sound), or from vision + text while muting audio. That misses cases where sound is the main clue (like off-screen footsteps or a shouted command). Also, many prediction datasets were short and not varied enough, or required frequent updates to avoid data leakage.
The gap: We needed a benchmark that: 1) truly uses both audio and video, 2) demands causal and temporal reasoning, and 3) makes cheating hard by adding distractor answers that only audio or only video would falsely favor.
Hook: Picture a science fair judge who asks not just "What did you do?" but "What will happen next if you change this?"
Filling (FutureOmni):
- What it is: FutureOmni is a benchmark that tests whether AI can predict future events from both sound and video together.
- How it works: 1) Pick videos where audio matters; 2) find exact event times; 3) build causeâfuture pairs; 4) create tricky distractors (visual-only, audio-only, delayed, reverse-causal); 5) ask multiple-choice questions; 6) score models.
- Why it matters: Without a fair, audio-visual future test, we can't tell if models truly understand unfolding events or just parrot descriptions. Anchor: In a cartoon, stepping on tacks plus hearing a strict warning sets up the prediction "he runs out quietly" rather than "he screams now," because the sound cue (warning) changes the likely future.
Real stakes: This matters for safer streets (anticipating hazards), better home devices (preventing accidents), smarter content tools (editing or summarizing videos that haven't finished), and rescue or surveillance (noticing a sound that predicts a dangerous next step). It's not enough to know the past; useful AI needs to peer a few moments ahead, just like we do.
02 Core Idea
Hook: You know how crossing a busy street safely means looking and listening (seeing the cars and hearing the honk), then deciding when to move next?
Filling (Aha! Moment):
- One-sentence insight: To predict the future well, AI must fuse audio and video with causal-and-time logic, and we must evaluate and teach this skill directly.
- Multiple analogies:
- Traffic scout: Eyes on the lights, ears on the sirens, then choose the next safe step.
- Movie trailer editor: Hear the rising music and see the hero's glance, then cut to the big reveal at just the right moment.
- Sports commentator: Hear the coach shout "switch!" and see the defender move, then predict the pass to the corner.
- Before vs. After: Before, models mostly described past scenes and audio was often ignored, so they fell for look-alike but wrong futures. After, with FutureOmni + OFF, models learn to tie sounds and sights together in time and justify why one specific future follows.
- Why it works (intuition): If we test on tricky, audio-visual futures and train with reasoned explanations (rationales) that link cause to effect, models learn to look for the right moments (keyframes) and the right cues (speech, sounds, music) in the right time order.
- Building blocks:
- Audio-coordinated selection: keep videos where audio truly changes the meaning.
- Temporal localization + audio fulfilling: find precise start/end times and annotate matching sound cues.
- Causal pair discovery + verification: select past→future pairs that are close in time and logically linked.
- Adversarial distractors: options that trick single-modality shortcuts, forcing real cross-modal reasoning.
- OFF instruction-tuning with rationales: 7K examples that explicitly teach models the "why." Anchor: Like teaching a student not just to answer "What's next?" in a science lab, but to show the safety rule or law of motion that proves why that next step must happen.
Hook: Think of a riddle where the picture and the soundtrack both hide clues; missing either makes you guess wrong.
Filling (Adversarial Distractors):
- What it is: Carefully designed wrong answers that look tempting if you only watch or only listen or mix up time.
- How it works: 1) Visual-only plausible but audio-contradicted; 2) audio-only plausible but visually wrong; 3) delayed/past events; 4) reverse-causal (cause instead of effect).
- Why it matters: Without them, a model could score well by guessing from one modality or by mixing up past/future. Anchor: If a king whispers "One more sound and off with your head," then a "scream now" option is a trap; only by using the audio threat plus the visual setup do you choose "he runs out quietly."
Hook: Imagine picking only videos where closing your eyes or muting the sound would change the story.
Filling (Audio-Coordinated Selection):
- What it is: A way to filter videos so that audio truly influences the meaning.
- How it works: 1) Caption a video with audio on; 2) caption again with audio muted; 3) keep videos where the captions differ a lot, meaning audio mattered.
- Why it matters: Without this, many clips would have decorative music only, and models could ignore audio and still do fine. Anchor: A doorbell heard off-screen changes the next action (someone goes to the door); background elevator music does not.
Hook: Think of a sports replay that freezes at the exact moment of a foul, with a clear whistle heard.
Filling (Temporal Localization and Audio Fulfilling):
- What it is: Pinpointing exact start/end times of meaningful events and the matching sounds.
- How it works: 1) Scan the video to find plot-relevant events; 2) set tight timestamps; 3) check time boundaries with audio patterns; 4) add audio details (speech, sound effects, music) that align with visuals.
- Why it matters: Without precise timing, the cause and effect might overlap or be too far apart to be truly predictive. Anchor: "01:52–02:00 King warns softly" followed by "02:03–02:07 Tom tiptoes out" is a clear, timed pair.
Hook: You know how detectives check each other's notes to avoid mistakes?
Filling (Causal Pair Discovery and Dual-Stage Verification):
- What it is: Finding truly causal past→future pairs and double-checking them with AI and humans.
- How it works: 1) Propose candidate pairs with reasons; 2) keep pairs where the future closely follows the past (<30s); 3) auto-check logic; 4) human-verify quality.
- Why it matters: Without verification, pairs might be coincidental, vague, or out of order. Anchor: "Coach shouts 'Switch!' at 04:10" followed by "defender rotates at 04:14" is a verified causal pair.
Hook: Think of tutoring that not only gives answers but also shows the steps so you can do the next problem yourself.
Filling (OFF, Omni-Modal Future Forecasting):
- What it is: A training strategy using 7K instruction-tuning samples where each example explains why a specific future follows from the audio-video premise.
- How it works: 1) Keep audio/video encoders fixed; 2) fine-tune the language core with LoRA; 3) include step-by-step rationales; 4) train briefly but effectively.
- Why it matters: Without rationales, models may memorize patterns; with rationales, they learn the reasoning recipe and transfer it to new tasks. Anchor: After OFF, models pay more attention to the exact whistle moment and the player's movement frame, helping them predict the pass correctly.
03 Methodology
At a high level: Input (videos with audio) → Audio-coordinated filtering → Temporal localization + audio fulfilling → Causal pair discovery → QA construction with adversarial distractors → Dual-stage verification → Benchmark evaluation; plus OFF training: Build 7K rationale-tuned samples → LoRA fine-tune text backbone → Evaluate on FutureOmni and other benchmarks.
Step A: Audio-coordinated video selection
- What happens: Collect ~18K YouTube videos (30s–20min). Remove static/low-change clips via frame-similarity. Then caption each video twice: with and without audio. Keep the half where captions differ most (audio matters).
- Why it exists: If audio doesnât change meaning, future prediction can be faked by visuals alone.
- Example: A classroom clip where the teacherâs spoken instruction changes the next action is kept; a scenic montage with only background music is dropped.
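A minimal sketch of this filter, assuming hypothetical captioning helpers (one run with audio, one with audio muted) and a simple text-similarity score; the paper's actual captioner and selection threshold are not specified here.

```python
from difflib import SequenceMatcher

def caption_similarity(cap_with_audio: str, cap_muted: str) -> float:
    """Rough text similarity in [0, 1]; lower means audio changed the caption more."""
    return SequenceMatcher(None, cap_with_audio.lower(), cap_muted.lower()).ratio()

def select_audio_coordinated(videos, caption_with_audio, caption_muted, keep_ratio=0.5):
    """Keep the fraction of videos whose captions change most when audio is muted.

    `videos` is a list of video paths; `caption_with_audio` and `caption_muted`
    are hypothetical captioning callables standing in for the MLLM captioner.
    """
    scored = []
    for path in videos:
        sim = caption_similarity(caption_with_audio(path), caption_muted(path))
        scored.append((sim, path))
    scored.sort()  # most audio-dependent (lowest similarity) first
    n_keep = int(len(scored) * keep_ratio)  # "keep the half where captions differ most"
    return [path for _, path in scored[:n_keep]]
```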
Step B: Audio-visual temporal localization and calibration
- What happens: Use a strong multimodal model to identify plot-relevant events and set tight timestamps (MM:SS). Validate boundaries using audio features (e.g., MFCC differences) and add audio details (speech, sound effects, music) synchronized with visuals.
- Why it exists: Prediction needs precise, time-ordered pieces so a cause clearly comes before its effect.
- Example: 01:52–02:00 "King warns softly," 02:03–02:07 "Tom tiptoes out," validated by an audio change at 02:00.
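A hedged sketch of how an MFCC-based boundary check could work, using librosa; the window size and the mean-plus-two-standard-deviations threshold are illustrative choices, not the authors' exact calibration rule.

```python
import numpy as np
import librosa

def audio_change_near(audio_path: str, boundary_sec: float,
                      window: float = 1.0, hop_length: int = 512) -> bool:
    """Return True if frame-to-frame MFCC change spikes near a proposed event boundary."""
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    delta = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)   # acoustic change per frame
    times = librosa.frames_to_time(np.arange(delta.size), sr=sr, hop_length=hop_length)
    near = delta[np.abs(times - boundary_sec) <= window]    # frames within ±1 s of the boundary
    threshold = delta.mean() + 2 * delta.std()              # illustrative spike threshold
    return near.size > 0 and near.max() >= threshold
```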
Step C: Causal pair discovery
- What happens: Analyze neighboring events to pick true cause→future pairs within a 30s gap. Require the model to output Premise, Target (future), and a Rationale explaining the link. Score whether audio is causal (0–2) and label audio type (speech, sound, music).
- Why it exists: Temporal neighbors arenât always causal; we need events where the future is predictable from the past.
- Example: "Ref blows whistle" (speech/sound) → "Players stop" (visual) within 5 seconds; rationale: rules of play.
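A simplified sketch of the pair-filtering logic described above; the event schema, field names, and the way the 0–2 audio-causality score is used are modeled on this description, not taken from released code.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # seconds
    end: float
    description: str
    audio_type: str   # "speech", "sound", or "music"

@dataclass
class CausalPair:
    premise: Event
    target: Event             # the future event to be predicted
    rationale: str            # why the target follows from the premise
    audio_causal_score: int   # 0-2: how much audio drives the link

def keep_pair(pair: CausalPair, max_gap: float = 30.0, min_score: int = 1) -> bool:
    """Keep pairs where the future starts after the premise ends, within 30s,
    and audio is judged at least partially causal (threshold assumed)."""
    gap = pair.target.start - pair.premise.end
    return 0.0 <= gap <= max_gap and pair.audio_causal_score >= min_score
```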
Step D: QA construction with adversarial distractors
- What happens: Build multiple-choice questions from each causal pair and add four distractors:
- Visual-only plausible but audio-contradicted,
- Audio-only plausible but visually wrong,
- Delayed (a past event that tempts time confusion),
- Reverse-causal (cause instead of effect).
- Why it exists: Forces cross-modal reasoning and correct time direction.
- Example: After the "quiet warning," the distractor "scream now" looks visually plausible in a slapstick scene but is contradicted by the audio threat.
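A sketch of how one multiple-choice item might be assembled from a verified pair plus the four distractor types; the option texts would come from the generation model, so the helper below only handles packaging and shuffling, and the field names are hypothetical.

```python
import random

def build_mcq(question: str, correct: str, visual_only: str, audio_only: str,
              delayed: str, reverse_causal: str, seed: int = 0) -> dict:
    """Package one FutureOmni-style question: the correct future plus the four
    adversarial distractors, shuffled into labeled options."""
    options = [correct, visual_only, audio_only, delayed, reverse_causal]
    random.Random(seed).shuffle(options)
    labels = ["A", "B", "C", "D", "E"]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(correct)],  # letter of the correct future
    }
```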
Step E: Dual-stage verification
- What happens: 1) Automated logic check; 2) human review to ensure clarity, causality, and timing.
- Why it exists: Prevents ambiguous, coincidental, or mislabeled pairs.
- Example: If effect starts before cause, the pair is rejected.
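The automated stage could look roughly like the screen below, run before human review; the specific rejection rules beyond time ordering and the 30s gap are assumptions based on the description, and the event dicts use a hypothetical schema with "start"/"end" times in seconds.

```python
def auto_check(premise: dict, target: dict, rationale: str,
               max_gap: float = 30.0) -> tuple[bool, str]:
    """First-stage automated screen over a candidate pair; survivors go to human review."""
    if target["start"] < premise["end"]:
        return False, "effect starts before cause ends"
    if target["start"] - premise["end"] > max_gap:
        return False, "future event too far from the premise to be predictive"
    if not rationale.strip():
        return False, "missing rationale linking premise to target"
    return True, "ok"
```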
Step F: Benchmark statistics and scope
- What happens: Final set has 919 videos and 1,034 QAs across 8 domains (Cartoon, Education, Emergency, Surveillance, Daily life, Movie, Game, Documentary) with varied audio (speech, sound, music) and longer durations (avg ~163.5s).
- Why it exists: Diverse, longer videos require richer context and make shortcuts harder.
- Example: Documentaries often rely on narration; Games have predictable physics; Emergencies have chaotic audio-visual cues.
Step G: OFF training (Omni-Modal Future Forecasting)
- What happens: Build FutureOmni-7K instruction-tuning data with rationales. Fine-tune open-source omni-models using LoRA, freezing audio/video encoders and updating the language core with a small learning rate for 1 epoch.
- Why it exists: Teach models the chain-of-thought linking audio+video to a specific near-future, improving transfer.
- Example: A rationale explains that "because the referee says 'Timeout!' and the clock horn sounds, play will pause immediately."
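A hedged sketch of the OFF fine-tuning setup using Hugging Face PEFT: perception encoders frozen, LoRA adapters on the language backbone, one short epoch. The module-name substrings, LoRA rank, target projections, and learning rate below are illustrative assumptions, not the authors' exact configuration.

```python
from peft import LoraConfig, get_peft_model

def prepare_off_model(model):
    """Freeze audio/video encoders and attach LoRA adapters to the language core.

    `model` is an already-loaded open-source omni-model; the substring matching
    below ("audio", "vision", "visual") is a placeholder for its real module names.
    """
    for name, param in model.named_parameters():
        if any(key in name for key in ("audio", "vision", "visual")):
            param.requires_grad = False  # keep perception encoders fixed

    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common LLM attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_cfg)

# Training would then run for 1 epoch on the 7K rationale-augmented samples with a
# small learning rate on the adapters, using a standard next-token loss over the
# answer plus rationale text.
```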
The secret sauce:
- Tricky distractors that punish single-modality guessing.
- Audio-first filtering that ensures sound truly changes meaning.
- Rationale-enhanced OFF that shifts attention to key audio/video frames at the right layers, improving generalization beyond forecasting.
Mini data walk-through:
- Input: A 2:30 clip of a shop scene. At 01:10 a bell rings (audio) while the door opens (video). At 01:13 the clerk looks up (video). QA: "Given the bell and opening door, what happens next?" Correct: "Clerk greets the customer." Distractors: "Clerk turns off lights" (opposite mood), "Customer already paid earlier" (delayed/past), "A loud crash" (audio-only mismatch), "Door locks" (visual but audio contradicts).
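For concreteness, the walk-through above could be stored as a record like the one below; the field names are illustrative, not the dataset's released schema.

```python
# One hypothetical FutureOmni-style record for the shop-scene walk-through.
sample = {
    "video_id": "shop_scene_demo",
    "premise_span": ["01:10", "01:13"],           # bell rings, door opens, clerk looks up
    "question": "Given the bell and the opening door, what happens next?",
    "options": {
        "A": "Clerk greets the customer",          # correct future
        "B": "Clerk turns off the lights",         # opposite mood
        "C": "Customer already paid earlier",      # delayed / past event
        "D": "A loud crash is heard",              # audio-only mismatch
        "E": "The door locks",                     # visual but contradicted by the audio
    },
    "answer": "A",
    "audio_type": "sound",
}
```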
04 Experiments & Results
The test: Measure how accurately models choose the correct future event in multiple-choice questions when both audio and video are provided. This tests causal + temporal reasoning and true cross-modal fusion.
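In code, this protocol reduces to multiple-choice accuracy; a minimal sketch, assuming a hypothetical predict_choice(model, item) wrapper that feeds video, audio, and question to the model and returns an option letter.

```python
def evaluate(model, items, predict_choice) -> float:
    """Multiple-choice accuracy over FutureOmni items (each with an "answer" letter)."""
    correct = sum(1 for item in items if predict_choice(model, item) == item["answer"])
    return correct / max(len(items), 1)
```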
The competition: 20 models total, 13 omni-modal (audio+video) and 7 video-only, covering open-source systems (e.g., Qwen2.5-Omni, video-SALMONN 2, Ola) and proprietary ones (e.g., Gemini 2.5 Pro/Flash, Gemini 3 Flash, Claude Haiku 4.5), plus GPT-4o as a strong video-only baseline.
The scoreboard (with context):
- Top score: 64.8% by Gemini 3 Flash, like getting a solid B when the class average hovers around C; good, but far from perfect.
- Proprietary omni-models average around 61%, while the best open-source omni-model reaches ~53%, a notable gap that signals room to grow.
- Video-only models trail omni-models: even strong systems (like GPT-4o as tested here) score below many omni-models, showing audio is crucial for future prediction.
Surprising and key findings:
- Speech is hardest: Even the best performer does better on music and clean sound effects than on speech-driven futures. Speech needs high-level language understanding and alignment with visuals.
- Contextual cold start: Very short videos ([0,2) minutes) have the lowest scores across models; performance improves with a bit more history, then dips slightly for very long clips. Forecasting needs enough build-up.
- Modality ablation: Audio+Video (A+V) beats Video-only (V) or Audio-only (A). It also beats Video+Subtitles or Video+Captions, meaning raw audio carries emotional/ambient signals that text alone can't fully capture.
- Error analysis (on Gemini 3 Flash failures): 51.6% are video perception errors (missed a small visual cue), 30.8% are audio-video reasoning failures (saw and heard the pieces but didn't combine them), 15.1% are audio perception errors, and only 2.5% reflect a lack of knowledge. So the main barriers are sensing and fusing, not missing facts.
OFF improvements:
- Training open-source omni-models with FutureOmni-7K + rationales yields consistent gains (e.g., video-SALMONN 2 +3.87 points overall). Speech-heavy improvements are especially notable, suggesting better spoken-language grounding with visuals.
- Generalization: OFF-trained models also improve on other benchmarks (WorldSense, DailyOmni, JointAVBench, OmniVideoBench) and even on video-only tests (Video-MME, MLVU), indicating broader benefits to attention and reasoning.
- Attention insights: After OFF, attention on known keyframes (both audio and video) increases in critical transformer layers, matching the observed performance bumps.
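One way to quantify this kind of attention claim is sketched below: per layer, measure how much of the final position's attention lands on known keyframe tokens, and compare the curves before and after OFF. This assumes a Hugging Face-style model that can return attentions and that the keyframe token positions are known; it does not reflect the authors' exact analysis code.

```python
import torch

@torch.no_grad()
def keyframe_attention_per_layer(model, inputs, keyframe_token_idx):
    """Fraction of the last position's attention mass on keyframe tokens, per layer.

    `inputs` is the model's prepared multimodal input dict and `keyframe_token_idx`
    lists the token positions of key audio/video frames (assumed to be known).
    """
    out = model(**inputs, output_attentions=True)
    fractions = []
    for layer_attn in out.attentions:            # each tensor: (batch, heads, seq, seq)
        last_query = layer_attn[0, :, -1, :]     # attention from the final position
        on_keyframes = last_query[:, keyframe_token_idx].sum(dim=-1)
        fractions.append((on_keyframes / last_query.sum(dim=-1)).mean().item())
    return fractions  # compare these curves for the base vs. OFF-tuned model
```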
Bottom line: Future forecasting with real audio-visual cues is still challenging; even top systems miss many speech-linked futures. But targeted training with rationales (OFF) nudges models to look and listen at the right times, improving both forecasting and general skills.
05 Discussion & Limitations
Limitations:
- Benchmark scope: While FutureOmni spans 8 domains and ~1K QAs, the world is bigger. Some niches (e.g., medical or industrial alarms) aren't included, so models trained here might still struggle in those.
- Multiple-choice format: It's great for controlled evaluation but may simplify the real world, where many plausible futures exist and answers aren't neatly listed.
- Annotation reliance: The pipeline uses strong proprietary models and then human verification. If upstream AI introduces subtle biases (e.g., speech accents, domain styles), they could echo in the dataset.
- Speech complexity: Accents, overlapping talk, and noisy scenes can still confuse models; improving speech-visual grounding remains a core challenge.
- Long-context processing: Very long videos slightly reduce performance; efficient context strategies and better temporal memory could help.
Required resources:
- To use FutureOmni for evaluation: just inference capability with audio+video inputs.
- To train with OFF: GPUs for LoRA fine-tuning, access to a compatible open-source omni-model, and data loading for videos + audio.
When not to use:
- Purely text tasks or static images without sound; this benchmark won't be a good fit there.
- Domains where future events are inherently unpredictable or random (e.g., jump-scare edits with no causal hints).
Open questions:
- How to broaden beyond multiple-choice to free-form prediction while keeping evaluation fair?
- How to robustly handle speech-heavy, noisy, or overlapped audio with strong visual grounding?
- Can we design models that learn event physics and social rules more explicitly to reduce perception errors?
- What architectures best fuse audio and video over long timelines without losing early clues?
- How can we better measure and train "arrow of time" understanding so that reverse-causal traps fool models less often?
06 Conclusion & Future Work
Three-sentence summary: FutureOmni is a new benchmark that checks whether AI can predict near-future events from both sound and video using true causal and temporal reasoning. Evaluations of 20 models show the task is hard, especially with speech, while the proposed OFF training with 7K rationale-rich samples noticeably improves open-source models and even helps on other benchmarks. Attention analyses suggest OFF makes models focus more on key audio and visual frames in the right layers, supporting better generalization.
Main achievement: Defining, building, and validating the first comprehensive, adversarially robust, audio-visual future forecasting benchmark, and demonstrating a practical training recipe (OFF) that measurably boosts foresight and transfer.
Future directions: Expand domains and durations, move beyond multiple choice to structured or free-text futures with automatic scoring, develop stronger speech-visual grounding, and explore architectures that keep long-range temporal cues. Also, design training that teaches models explicit "time arrows" and domain rules (sports, traffic, emergencies) to reduce perception and fusion errors.
Why remember this: It shifts multimodal AI from just narrating the past to anticipating the next moment, which is how people act safely and smartly in the real world. By making models look and listen together, and explain their reasoning, FutureOmni and OFF push us toward AI that can plan, warn, and assist before problems happen.
Practical Applications
- Evaluate and select multimodal models for autonomous driving that must anticipate hazards using both street sounds and visuals.
- Train home assistants to predict near-future risks (e.g., a pot boiling over) by fusing kitchen sounds with camera views.
- Build smarter video editing tools that auto-suggest the next cut based on rising music and visual setup.
- Improve surveillance triage by predicting escalating situations from shouted commands and crowd movement.
- Enhance sports analytics to forecast plays using coach shouts (audio) plus player positioning (video).
- Design classroom feedback systems that link teacher instructions (speech) to student actions (video) to foresee confusion points.
- Develop content moderation that anticipates harmful acts when audio cues and gestures align.
- Assist emergency operations centers in predicting next steps from sirens, radio chatter, and body-cam footage.
- Create accessibility tools that narrate likely next events for users with hearing or vision impairments.
- Benchmark and fine-tune robots that must listen and watch to decide their next safe move in dynamic environments.