FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
Key Summary
- FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
- It contains 919 videos and 1,034 multiple-choice questions across 8 domains, with tricky distractors that force true audio-visual reasoning.
- Across 20 models, the best score is only 64.8% (Gemini 3 Flash), showing future forecasting is still hard, especially when speech is important.
- The authors built a scalable, AI-assisted, human-checked pipeline to pick videos where audio really matters and to precisely locate events in time.
- They propose OFF (Omni-Modal Future Forecasting), a 7K-sample instruction-tuning strategy that teaches models to explain why a future follows from a given audio-video moment.
- OFF improves open-source models on FutureOmni and also helps on other audio-visual and even video-only benchmarks, showing better generalization.
- Speech-heavy cases are the toughest; models do better with music or clear sound effects than with spoken language tied to visuals.
- Ablation studies show audio plus video (A+V) beats using either alone, and even beats video plus subtitles, so raw audio carries unique signals text can't fully replace.
- Attention visualizations suggest OFF helps models focus more on key audio and video frames, which likely drives the gains.
- The work highlights real-world stakes like safer driving, better home assistants, and more reliable video understanding when quick, forward-looking decisions matter.
Why This Research Matters
In real life, we constantly predict what happens next to stay safe and act wisely, and FutureOmni checks whether AI can do the same with video and sound together. This can make self-driving systems more cautious by listening for sirens while watching for pedestrians. It can make home assistants more helpful by hearing warnings and seeing actions, like noticing boiling kettles or crying babies. It helps video tools summarize, edit, or flag risky moments before they fully unfold, saving time and preventing mistakes. In emergencies, combining radio chatter with chaotic visuals can guide faster, safer choices. By teaching AI to explain its predictions, we also make it more trustworthy and easier to correct when it errs.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine watching a movie scene where you hear a siren and see a person run across the street; you instantly think, "Uh-oh, something is about to happen." You're not just describing the scene; you're predicting the next beat.
Filling (Multimodal Large Language Models, or MLLMs):
- What it is: MLLMs are AI systems that can take in different kinds of information, like text, pictures, and sounds, and talk about them.
- How it works: 1) A vision part looks at images or video frames; 2) an audio part listens to sounds and speech; 3) a language brain ties everything together to answer questions or make predictions.
- Why it matters: Without MLLMs, an AI might only read text or only see pictures, missing important clues that are in sound or motion. Anchor: If you ask an AI, "What's happening in this video of a guitar lesson?" MLLMs can look at the teacher's fingers and listen to the chords to answer more accurately than using just one modality.
Hook: You know how a line of dominoes falls: knock one, then the next, then the next?
Filling (Causal and Temporal Reasoning):
- What it is: It's understanding what causes what (cause → effect) and in what order things happen (earlier → later).
- How it works: 1) Spot the key event (like a warning shout); 2) note the timing; 3) connect it to what logically follows (someone stops moving); 4) rule out things that are out of order.
- Why it matters: Without cause-and-time logic, an AI might mix up the past and future or think a result came before its cause. Anchor: If a coach blows a whistle (cause) during a game, players stop (effect) right after. That ordered link is what causal and temporal reasoning captures.
Hook: Think of a magician who tells a story while showing pictures and playing music. You need all three to get the trick.
Filling (Cross-Modal Reasoning):
- What it is: It's when AI connects clues across different senses, like matching a spoken warning (audio) to someone freezing (video).
- How it works: 1) Listen for key sounds or speech; 2) watch for matching actions or objects; 3) check timing alignment; 4) combine them to decide what's likely next.
- Why it matters: Without cross-modal links, the AI might believe a fake answer that looks right but is contradicted by the sound, or vice versa. Anchor: If a chef on video says, "Now add salt," and you see a hand pour white grains into a pot, the audio and video together confirm the action.
The world before this paper: AI benchmarks mostly tested "retrospective understanding," which means describing or analyzing what already happened in a video. These tests pushed progress in captioning, identifying objects, and even explaining scenes, but they didn't focus on forecasting the future.
The problem: Real life needs foresight. Cars should predict a pedestrian stepping onto the street; a home assistant should guess that a boiling kettle will whistle; coaches or editors want to foresee the next play or cut. Yet few tools check if MLLMs can predict what happens next from both audio and video together.
Failed attempts: Some benchmarks asked models to predict futures, but mostly from text only (no real video or sound), or from vision + text while muting audio. That misses cases where sound is the main clue (like off-screen footsteps or a shouted command). Also, many prediction datasets were short and not varied enough, or required frequent updates to avoid data leakage.
The gap: We needed a benchmark that: 1) truly uses both audio and video, 2) demands causal and temporal reasoning, and 3) makes cheating hard by adding distractor answers that only audio or only video would falsely favor.
Hook: Picture a science fair judge who asks not just "What did you do?" but "What will happen next if you change this?"
Filling (FutureOmni):
- What it is: FutureOmni is a benchmark that tests whether AI can predict future events from both sound and video together.
- How it works: 1) Pick videos where audio matters; 2) find exact event times; 3) build causeâfuture pairs; 4) create tricky distractors (visual-only, audio-only, delayed, reverse-causal); 5) ask multiple-choice questions; 6) score models.
- Why it matters: Without a fair, audio-visual future test, we can't tell if models truly understand unfolding events or just parrot descriptions. Anchor: In a cartoon, stepping on tacks plus hearing a strict warning sets up the prediction "he runs out quietly" rather than "he screams now," because the sound cue (warning) changes the likely future.
Real stakes: This matters for safer streets (anticipating hazards), better home devices (preventing accidents), smarter content tools (editing or summarizing videos that haven't finished), and rescue or surveillance (noticing a sound that predicts a dangerous next step). It's not enough to know the past; useful AI needs to peer a few moments ahead, just like we do.
02 Core Idea
Hook: You know how crossing a busy street safely means looking and listening (seeing the cars and hearing the honk), then deciding when to move next?
Filling (Aha! Moment):
- One-sentence insight: To predict the future well, AI must fuse audio and video with causal-and-time logic, and we must evaluate and teach this skill directly.
- Multiple analogies:
- Traffic scout: Eyes on the lights, ears on the sirens, then choose the next safe step.
- Movie trailer editor: Hear the rising music and see the hero's glance, then cut to the big reveal at just the right moment.
- Sports commentator: Hear the coach shout "switch!" and see the defender move, then predict the pass to the corner.
- Before vs. After: Before, models mostly described past scenes and audio was often ignored, so they fell for look-alike but wrong futures. After, with FutureOmni + OFF, models learn to tie sounds and sights together in time and justify why one specific future follows.
- Why it works (intuition): If we test on tricky, audio-visual futures and train with reasoned explanations (rationales) that link cause to effect, models learn to look for the right moments (keyframes) and the right cues (speech, sounds, music) in the right time order.
- Building blocks:
- Audio-coordinated selection: keep videos where audio truly changes the meaning.
- Temporal localization + audio fulfilling: find precise start/end times and annotate matching sound cues.
- Causal pair discovery + verification: select past→future pairs that are close in time and logically linked.
- Adversarial distractors: options that trick single-modality shortcuts, forcing real cross-modal reasoning.
- OFF instruction-tuning with rationales: 7K examples that explicitly teach models the "why." Anchor: Like teaching a student not just to answer "What's next?" in a science lab, but to show the safety rule or law of motion that proves why that next step must happen.
Hook: Think of a riddle where the picture and the soundtrack both hide clues; missing either makes you guess wrong.
Filling (Adversarial Distractors):
- What it is: Carefully designed wrong answers that look tempting if you only watch or only listen or mix up time.
- How it works: 1) Visual-only plausible but audio-contradicted; 2) audio-only plausible but visually wrong; 3) delayed/past events; 4) reverse-causal (cause instead of effect).
- Why it matters: Without them, a model could score well by guessing from one modality or by mixing up past/future. Anchor: If a king whispers "One more sound and off with your head," then a "scream now" option is a trap; only by using the audio threat plus the visual setup do you choose "he runs out quietly."
Hook: Imagine picking only videos where closing your eyes or muting the sound would change the story.
Filling (Audio-Coordinated Selection):
- What it is: A way to filter videos so that audio truly influences the meaning.
- How it works: 1) Caption a video with audio on; 2) caption again with audio muted; 3) keep videos where the captions differ a lot, meaning audio mattered.
- Why it matters: Without this, many clips would have decorative music only, and models could ignore audio and still do fine. Anchor: A doorbell heard off-screen changes the next action (someone goes to the door); background elevator music does not.
Hook: Think of a sports replay that freezes at the exact moment of a foul, with a clear whistle heard.
Filling (Temporal Localization and Audio Fulfilling):
- What it is: Pinpointing exact start/end times of meaningful events and the matching sounds.
- How it works: 1) Scan the video to find plot-relevant events; 2) set tight timestamps; 3) check time boundaries with audio patterns; 4) add audio details (speech, sound effects, music) that align with visuals.
- Why it matters: Without precise timing, the cause and effect might overlap or be too far apart to be truly predictive. Anchor: "01:52–02:00 King warns softly" followed by "02:03–02:07 Tom tiptoes out" is a clear, timed pair.
Hook: You know how detectives check each other's notes to avoid mistakes?
Filling (Causal Pair Discovery and Dual-Stage Verification):
- What it is: Finding truly causal past→future pairs and double-checking them with AI and humans.
- How it works: 1) Propose candidate pairs with reasons; 2) keep pairs where the future closely follows the past (<30s); 3) auto-check logic; 4) human-verify quality.
- Why it matters: Without verification, pairs might be coincidental, vague, or out of order. Anchor: "Coach shouts 'Switch!' at 04:10" followed by "defender rotates at 04:14" is a verified causal pair.
Hook: Think of tutoring that not only gives answers but also shows the steps so you can do the next problem yourself.
Filling (OFF, Omni-Modal Future Forecasting):
- What it is: A training strategy using 7K instruction-tuning samples where each example explains why a specific future follows from the audio-video premise.
- How it works: 1) Keep audio/video encoders fixed; 2) fine-tune the language core with LoRA; 3) include step-by-step rationales; 4) train briefly but effectively.
- Why it matters: Without rationales, models may memorize patterns; with rationales, they learn the reasoning recipe and transfer it to new tasks. Anchor: After OFF, models pay more attention to the exact whistle moment and the player's movement frame, helping them predict the pass correctly.
03 Methodology
At a high level: Input (videos with audio) → Audio-coordinated filtering → Temporal localization + audio fulfilling → Causal pair discovery → QA construction with adversarial distractors → Dual-stage verification → Benchmark evaluation; plus OFF training: Build 7K rationale-tuned samples → LoRA fine-tune text backbone → Evaluate on FutureOmni and other benchmarks.
Step A: Audio-coordinated video selection
- What happens: Collect ~18K YouTube videos (30s–20min). Remove static/low-change clips via frame-similarity. Then caption each video twice: with and without audio. Keep the half where captions differ most (audio matters).
- Why it exists: If audio doesnât change meaning, future prediction can be faked by visuals alone.
- Example: A classroom clip where the teacherâs spoken instruction changes the next action is kept; a scenic montage with only background music is dropped.
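A minimal sketch of this filter, assuming hypothetical captioning helpers (one run with audio, one with audio muted) and a simple text-similarity score; the paper's actual captioner and selection threshold are not specified here.

```python
from difflib import SequenceMatcher

def caption_similarity(cap_with_audio: str, cap_muted: str) -> float:
    """Rough text similarity in [0, 1]; lower means audio changed the caption more."""
    return SequenceMatcher(None, cap_with_audio.lower(), cap_muted.lower()).ratio()

def select_audio_coordinated(videos, caption_with_audio, caption_muted, keep_ratio=0.5):
    """Keep the fraction of videos whose captions change most when audio is muted.

    `videos` is a list of video paths; `caption_with_audio` and `caption_muted`
    are hypothetical captioning callables standing in for the MLLM captioner.
    """
    scored = []
    for path in videos:
        sim = caption_similarity(caption_with_audio(path), caption_muted(path))
        scored.append((sim, path))
    scored.sort()  # most audio-dependent (lowest similarity) first
    n_keep = int(len(scored) * keep_ratio)  # "keep the half where captions differ most"
    return [path for _, path in scored[:n_keep]]
```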
Step B: Audio-visual temporal localization and calibration
- What happens: Use a strong multimodal model to identify plot-relevant events and set tight timestamps (MM:SS). Validate boundaries using audio features (e.g., MFCC differences) and add audio details (speech, sound effects, music) synchronized with visuals.
- Why it exists: Prediction needs precise, time-ordered pieces so a cause clearly comes before its effect.
- Example: 01:52–02:00 "King warns softly," 02:03–02:07 "Tom tiptoes out," validated by an audio change at 02:00.
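A hedged sketch of how an MFCC-based boundary check could work, using librosa; the window size and the mean-plus-two-standard-deviations threshold are illustrative choices, not the authors' exact calibration rule.

```python
import numpy as np
import librosa

def audio_change_near(audio_path: str, boundary_sec: float,
                      window: float = 1.0, hop_length: int = 512) -> bool:
    """Return True if frame-to-frame MFCC change spikes near a proposed event boundary."""
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    delta = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)   # acoustic change per frame
    times = librosa.frames_to_time(np.arange(delta.size), sr=sr, hop_length=hop_length)
    near = delta[np.abs(times - boundary_sec) <= window]    # frames within ±1 s of the boundary
    threshold = delta.mean() + 2 * delta.std()              # illustrative spike threshold
    return near.size > 0 and near.max() >= threshold
```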
Step C: Causal pair discovery
- What happens: Analyze neighboring events to pick true cause→future pairs within a 30s gap. Require the model to output Premise, Target (future), and a Rationale explaining the link. Score whether audio is causal (0–2) and label audio type (speech, sound, music).
- Why it exists: Temporal neighbors arenât always causal; we need events where the future is predictable from the past.
- Example: "Ref blows whistle" (speech/sound) → "Players stop" (visual) within 5 seconds; rationale: rules of play.
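A simplified sketch of the pair-filtering logic described above; the event schema, field names, and the way the 0–2 audio-causality score is used are modeled on this description, not taken from released code.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # seconds
    end: float
    description: str
    audio_type: str   # "speech", "sound", or "music"

@dataclass
class CausalPair:
    premise: Event
    target: Event             # the future event to be predicted
    rationale: str            # why the target follows from the premise
    audio_causal_score: int   # 0-2: how much audio drives the link

def keep_pair(pair: CausalPair, max_gap: float = 30.0, min_score: int = 1) -> bool:
    """Keep pairs where the future starts after the premise ends, within 30s,
    and audio is judged at least partially causal (threshold assumed)."""
    gap = pair.target.start - pair.premise.end
    return 0.0 <= gap <= max_gap and pair.audio_causal_score >= min_score
```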
Step D: QA construction with adversarial distractors
- What happens: Build multiple-choice questions from each causal pair and add four distractors:
- Visual-only plausible but audio-contradicted,
- Audio-only plausible but visually wrong,
- Delayed (a past event that tempts time confusion),
- Reverse-causal (cause instead of effect).
- Why it exists: Forces cross-modal reasoning and correct time direction.
- Example: After the "quiet warning," the distractor "scream now" looks visually plausible in a slapstick scene but is contradicted by the audio threat.
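A sketch of how one multiple-choice item might be assembled from a verified pair plus the four distractor types; the option texts would come from the generation model, so the helper below only handles packaging and shuffling, and the field names are hypothetical.

```python
import random

def build_mcq(question: str, correct: str, visual_only: str, audio_only: str,
              delayed: str, reverse_causal: str, seed: int = 0) -> dict:
    """Package one FutureOmni-style question: the correct future plus the four
    adversarial distractors, shuffled into labeled options."""
    options = [correct, visual_only, audio_only, delayed, reverse_causal]
    random.Random(seed).shuffle(options)
    labels = ["A", "B", "C", "D", "E"]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(correct)],  # letter of the correct future
    }
```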
Step E: Dual-stage verification
- What happens: 1) Automated logic check; 2) human review to ensure clarity, causality, and timing.
- Why it exists: Prevents ambiguous, coincidental, or mislabeled pairs.
- Example: If effect starts before cause, the pair is rejected.
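The automated stage could look roughly like the screen below, run before human review; the specific rejection rules beyond time ordering and the 30s gap are assumptions based on the description, and the event dicts use a hypothetical schema with "start"/"end" times in seconds.

```python
def auto_check(premise: dict, target: dict, rationale: str,
               max_gap: float = 30.0) -> tuple[bool, str]:
    """First-stage automated screen over a candidate pair; survivors go to human review."""
    if target["start"] < premise["end"]:
        return False, "effect starts before cause ends"
    if target["start"] - premise["end"] > max_gap:
        return False, "future event too far from the premise to be predictive"
    if not rationale.strip():
        return False, "missing rationale linking premise to target"
    return True, "ok"
```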
Step F: Benchmark statistics and scope
- What happens: Final set has 919 videos and 1,034 QAs across 8 domains (Cartoon, Education, Emergency, Surveillance, Daily life, Movie, Game, Documentary) with varied audio (speech, sound, music) and longer durations (avg ~163.5s).
- Why it exists: Diverse, longer videos require richer context and make shortcuts harder.
- Example: Documentaries often rely on narration; Games have predictable physics; Emergencies have chaotic audio-visual cues.
Step G: OFF training (Omni-Modal Future Forecasting)
- What happens: Build FutureOmni-7K instruction-tuning data with rationales. Fine-tune open-source omni-models using LoRA, freezing audio/video encoders and updating the language core with a small learning rate for 1 epoch.
- Why it exists: Teach models the chain-of-thought linking audio+video to a specific near-future, improving transfer.
- Example: A rationale explains that "because the referee says 'Timeout!' and the clock horn sounds, play will pause immediately."
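A hedged sketch of the OFF fine-tuning setup using Hugging Face PEFT: perception encoders frozen, LoRA adapters on the language backbone, one short epoch. The module-name substrings, LoRA rank, target projections, and learning rate below are illustrative assumptions, not the authors' exact configuration.

```python
from peft import LoraConfig, get_peft_model

def prepare_off_model(model):
    """Freeze audio/video encoders and attach LoRA adapters to the language core.

    `model` is an already-loaded open-source omni-model; the substring matching
    below ("audio", "vision", "visual") is a placeholder for its real module names.
    """
    for name, param in model.named_parameters():
        if any(key in name for key in ("audio", "vision", "visual")):
            param.requires_grad = False  # keep perception encoders fixed

    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common LLM attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_cfg)

# Training would then run for 1 epoch on the 7K rationale-augmented samples with a
# small learning rate on the adapters, using a standard next-token loss over the
# answer plus rationale text.
```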
The secret sauce:
- Tricky distractors that punish single-modality guessing.
- Audio-first filtering that ensures sound truly changes meaning.
- Rationale-enhanced OFF that shifts attention to key audio/video frames at the right layers, improving generalization beyond forecasting.
Mini data walk-through:
- Input: A 2:30 clip of a shop scene. At 01:10 a bell rings (audio) while the door opens (video). At 01:13 the clerk looks up (video). QA: "Given the bell and opening door, what happens next?" Correct: "Clerk greets the customer." Distractors: "Clerk turns off lights" (opposite mood), "Customer already paid earlier" (delayed/past), "A loud crash" (audio-only mismatch), "Door locks" (visual but audio contradicts).
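For concreteness, the walk-through above could be stored as a record like the one below; the field names are illustrative, not the dataset's released schema.

```python
# One hypothetical FutureOmni-style record for the shop-scene walk-through.
sample = {
    "video_id": "shop_scene_demo",
    "premise_span": ["01:10", "01:13"],           # bell rings, door opens, clerk looks up
    "question": "Given the bell and the opening door, what happens next?",
    "options": {
        "A": "Clerk greets the customer",          # correct future
        "B": "Clerk turns off the lights",         # opposite mood
        "C": "Customer already paid earlier",      # delayed / past event
        "D": "A loud crash is heard",              # audio-only mismatch
        "E": "The door locks",                     # visual but contradicted by the audio
    },
    "answer": "A",
    "audio_type": "sound",
}
```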
04 Experiments & Results
The test: Measure how accurately models choose the correct future event in multiple-choice questions when both audio and video are provided. This tests causal + temporal reasoning and true cross-modal fusion.
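In code, this protocol reduces to multiple-choice accuracy; a minimal sketch, assuming a hypothetical predict_choice(model, item) wrapper that feeds video, audio, and question to the model and returns an option letter.

```python
def evaluate(model, items, predict_choice) -> float:
    """Multiple-choice accuracy over FutureOmni items (each with an "answer" letter)."""
    correct = sum(1 for item in items if predict_choice(model, item) == item["answer"])
    return correct / max(len(items), 1)
```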
The competition: 20 models total, 13 omni-modal (audio+video) and 7 video-only, covering open-source systems (e.g., Qwen2.5-Omni, video-SALMONN 2, Ola) and proprietary ones (e.g., Gemini 2.5 Pro/Flash, Gemini 3 Flash, Claude Haiku 4.5), plus GPT-4o as a strong video-only baseline.
The scoreboard (with context):
- Top score: 64.8% by Gemini 3 Flash, like getting a solid B when the class average hovers around C; good, but far from perfect.
- Proprietary omni-models average around 61%, while the best open-source omni-model reaches ~53%, a notable gap that signals room to grow.
- Video-only models trail omni-models: even strong systems (like GPT-4o as tested here) score below many omni-models, showing audio is crucial for future prediction.
Surprising and key findings:
- Speech is hardest: Even the best performer does better on music and clean sound effects than on speech-driven futures. Speech needs high-level language understanding and alignment with visuals.
- Contextual cold start: Very short videos ([0,2) minutes) have the lowest scores across models; performance improves with a bit more history, then dips slightly for very long clips. Forecasting needs enough build-up.
- Modality ablation: Audio+Video (A+V) beats Video-only (V) or Audio-only (A). It also beats Video+Subtitles or Video+Captions, meaning raw audio carries emotional/ambient signals that text alone can't fully capture.
- Error analysis (on Gemini 3 Flash failures): 51.6% are video perception errors (missed a small visual cue), 30.8% are audio-video reasoning failures (saw and heard the pieces but didn't combine them), 15.1% are audio perception errors, and only 2.5% reflect a lack of knowledge. So the main barriers are sensing and fusing, not missing facts.
OFF improvements:
- Training open-source omni-models with FutureOmni-7K + rationales yields consistent gains (e.g., video-SALMONN 2 +3.87 points overall). Speech-heavy improvements are especially notable, suggesting better spoken-language grounding with visuals.
- Generalization: OFF-trained models also improve on other benchmarks (WorldSense, DailyOmni, JointAVBench, OmniVideoBench) and even on video-only tests (Video-MME, MLVU), indicating broader benefits to attention and reasoning.
- Attention insights: After OFF, attention on known keyframes (both audio and video) increases in critical transformer layers, matching the observed performance bumps.
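One way to quantify this kind of attention claim is sketched below: per layer, measure how much of the final position's attention lands on known keyframe tokens, and compare the curves before and after OFF. This assumes a Hugging Face-style model that can return attentions and that the keyframe token positions are known; it does not reflect the authors' exact analysis code.

```python
import torch

@torch.no_grad()
def keyframe_attention_per_layer(model, inputs, keyframe_token_idx):
    """Fraction of the last position's attention mass on keyframe tokens, per layer.

    `inputs` is the model's prepared multimodal input dict and `keyframe_token_idx`
    lists the token positions of key audio/video frames (assumed to be known).
    """
    out = model(**inputs, output_attentions=True)
    fractions = []
    for layer_attn in out.attentions:            # each tensor: (batch, heads, seq, seq)
        last_query = layer_attn[0, :, -1, :]     # attention from the final position
        on_keyframes = last_query[:, keyframe_token_idx].sum(dim=-1)
        fractions.append((on_keyframes / last_query.sum(dim=-1)).mean().item())
    return fractions  # compare these curves for the base vs. OFF-tuned model
```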
Bottom line: Future forecasting with real audio-visual cues is still challenging; even top systems miss many speech-linked futures. But targeted training with rationales (OFF) nudges models to look and listen at the right times, improving both forecasting and general skills.
05 Discussion & Limitations
Limitations:
- Benchmark scope: While FutureOmni spans 8 domains and ~1K QAs, the world is bigger. Some niches (e.g., medical or industrial alarms) aren't included, so models trained here might still struggle in those.
- Multiple-choice format: It's great for controlled evaluation but may simplify the real world, where many plausible futures exist and answers aren't neatly listed.
- Annotation reliance: The pipeline uses strong proprietary models and then human verification. If upstream AI introduces subtle biases (e.g., speech accents, domain styles), they could echo in the dataset.
- Speech complexity: Accents, overlapping talk, and noisy scenes can still confuse models; improving speech-visual grounding remains a core challenge.
- Long-context processing: Very long videos slightly reduce performance; efficient context strategies and better temporal memory could help.
Required resources:
- To use FutureOmni for evaluation: just inference capability with audio+video inputs.
- To train with OFF: GPUs for LoRA fine-tuning, access to a compatible open-source omni-model, and data loading for videos + audio.
When not to use:
- Purely text tasks or static images without sound; this benchmark won't be a good fit there.
- Domains where future events are inherently unpredictable or random (e.g., jump-scare edits with no causal hints).
Open questions:
- How to broaden beyond multiple-choice to free-form prediction while keeping evaluation fair?
- How to robustly handle speech-heavy, noisy, or overlapped audio with strong visual grounding?
- Can we design models that learn event physics and social rules more explicitly to reduce perception errors?
- What architectures best fuse audio and video over long timelines without losing early clues?
- How can we better measure and train "arrow of time" understanding so that reverse-causal traps fool models less often?
06 Conclusion & Future Work
Three-sentence summary: FutureOmni is a new benchmark that checks whether AI can predict near-future events from both sound and video using true causal and temporal reasoning. Evaluations of 20 models show the task is hard, especially with speech, while the proposed OFF training with 7K rationale-rich samples noticeably improves open-source models and even helps on other benchmarks. Attention analyses suggest OFF makes models focus more on key audio and visual frames in the right layers, supporting better generalization.
Main achievement: Defining, building, and validating the first comprehensive, adversarially robust, audio-visual future forecasting benchmark, and demonstrating a practical training recipe (OFF) that measurably boosts foresight and transfer.
Future directions: Expand domains and durations, move beyond multiple choice to structured or free-text futures with automatic scoring, develop stronger speech-visual grounding, and explore architectures that keep long-range temporal cues. Also, design training that teaches models explicit "time arrows" and domain rules (sports, traffic, emergencies) to reduce perception and fusion errors.
Why remember this: It shifts multimodal AI from just narrating the past to anticipating the next moment, which is how people act safely and smartly in the real world. By making models look and listen together, and explain their reasoning, FutureOmni and OFF push us toward AI that can plan, warn, and assist before problems happen.
Practical Applications
- Evaluate and select multimodal models for autonomous driving that must anticipate hazards using both street sounds and visuals.
- Train home assistants to predict near-future risks (e.g., a pot boiling over) by fusing kitchen sounds with camera views.
- Build smarter video editing tools that auto-suggest the next cut based on rising music and visual setup.
- Improve surveillance triage by predicting escalating situations from shouted commands and crowd movement.
- Enhance sports analytics to forecast plays using coach shouts (audio) plus player positioning (video).
- Design classroom feedback systems that link teacher instructions (speech) to student actions (video) to foresee confusion points.
- Develop content moderation that anticipates harmful acts when audio cues and gestures align.
- Assist emergency operations centers in predicting next steps from sirens, radio chatter, and body-cam footage.
- Create accessibility tools that narrate likely next events for users with hearing or vision impairments.
- Benchmark and fine-tune robots that must listen and watch to decide their next safe move in dynamic environments.