A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Key Summary
- This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
- Instead of only multiple-choice, it uses open-ended and multi-turn questions that feel like real conversations with intent (what the user is trying to do).
- Every question comes with a weighted rubric so we can see exactly what the AI got right or wrong and why.
- They also introduce LongShOTAgent, a smart helper that uses tools (like speech-to-text, audio event detectors, and visual search) to find and combine the right parts of very long videos.
- On this benchmark, top models still struggle: Gemini-2.5-Flash scores 52.95%, LongShOTAgent 44.66%, and many open-source models stay below 30%.
- Performance drops as videos get longer, showing that long-term memory and reasoning are hard for today's models.
- The dataset covers 157 videos averaging 45 minutes each, with 3,092 Q&A instances and human-verified rubrics.
- The evaluation is traceable and fair: models are judged criterion-by-criterion using open LLMs, and each model uses its own native video processing defaults.
- This work gives the community both a tough, realistic test and a practical, modular agent to make progress on long-form, omni-modal understanding.
- Real-world impact includes better video assistants for lectures, sports, meetings, safety monitoring, accessibility, and customer support.
Why This Research Matters
Much of life happens in long videos: classes, sports matches, meetings, tutorials, and events. To be truly helpful, AI must follow stories across time and combine what it sees, hears, and reads. This work gives the community a realistic test to measure that skill and a practical agent to improve it. Because the scoring uses weighted rubrics, we can see exactly what to fix (facts, timing, grounding, or tool use) instead of guessing. That means faster progress on assistants that can study with you, summarize meetings, explain plays in games, and spot safety issues. It also helps build fairer, more transparent evaluations the whole community can reproduce.
Detailed Explanation
01 Background & Problem Definition
You know how watching a full game or a long class recording takes focus because you have to remember what was said earlier, notice what you're seeing now, and connect sounds, voices, and actions over time? That's exactly what we want AI to do with videos, but it's much harder than it sounds.
The world before
- AI models started as great readers of text. Later, they learned to look at pictures and listen to audio. But videos are like moving stories with pictures, words, and sounds mixed together, often over an hour long.
- Many tests (benchmarks) focused on short clips. Those are fine for quick actions, but they don't show whether an AI can follow a lesson, a match, or a documentary from start to finish.
- Most older tests also ignored key parts like raw audio or speech timing, or they boiled results down to one simple score. That hides exactly where the AI is messing up.
The problem
- Long videos demand three things at once: understanding visuals (what's on screen), speech (what people say), and ambient audio (like claps, doorbells, or engine sounds). And the AI must link these clues across long stretches of time.
- We also want the AI to act more like a helper: ask clarifying questions, use tools (like a calculator or speech transcriber), and adapt as the situation changes.
Top Bread (Hook): Imagine you ask a friend, "Tell me what mattered in that whole science class video." You don't want a yes/no; you want a thoughtful answer. Filling (The Actual Concept)
- What it is: Open-Ended Questioning means asking questions that need explanations, not just yes/no.
- How it works:
- Ask a natural question ("Why did the experiment fail?"),
- The AI gathers clues from speech, visuals, and sounds,
- It explains in its own words,
- It can answer follow-ups in a multi-turn chat.
- Why it matters: Without it, tests feel fake and miss real mistakes, like mixing up who said what or why something happened. Bottom Bread (Anchor): Example: In a cooking tutorial, "How did the chef keep pasta from getting sticky?" needs a real explanation, not a checkbox.
Top Bread (Hook): You know how detectives piece together a mystery using footprints (visual), alibis (speech), and odd noises (audio)? Filling (The Actual Concept)
- What it is: Multimodal Reasoning is when AI connects images, speech, and sounds to understand what's truly happening.
- How it works:
- See what's on screen (who, what, where),
- Listen to speech (what's said) and audio (non-speech sounds),
- Align them in time,
- Use logic to link causes and effects across the video.
- Why it matters: Without it, AI might hear "the winner is…" but miss who's on camera, or see a door close but miss the slam sound that explains why someone jumped. Bottom Bread (Anchor): Example: During a lesson, the teacher says "Now watch this part," points to a diagram, and a beep confirms a sensor activation. Multimodal reasoning ties all three together.
Failed attempts
- Some tests used long videos but skipped audio or speech, missing essential cues.
- Others had multi-modality but only on short clips or narrow tasks (like captioning one moment).
- Many relied on multiple-choice or a single overall score judged by another LLM, which hides whether the AI failed at seeing, listening, or reasoning.
Top Bread (Hook): Think of a Swiss Army knife: you pull out the tool you need at the right time. Filling (The Actual Concept)
- What it is: Agentic Tools are helper tools an AI can call, like speech transcribers, object detectors, calculators, or web search, to solve a task.
- How it works:
- The AI breaks the problem into parts,
- Picks the right tool (e.g., transcribe speech, detect sounds),
- Uses results to refine its answer,
- Repeats if needed.
- Why it matters: Without tools, the AI guesses or misses details hidden in long videos. Bottom Bread (Anchor): Example: "Who spoke just before the goal?" The AI calls a speech tool to get the exact words and time.
What was missing
- A benchmark that: (1) uses long videos with real audio and speech, (2) asks open-ended and multi-turn, intent-driven questions, (3) checks tool use, (4) scores answers with detailed, explainable rubrics, and (5) scales with human-verified quality.
Top Bread (Hook): When teachers grade projects, they don't just give one number; they check creativity, facts, and clarity. Filling (The Actual Concept)
- What it is: Weighted Rubrics for Evaluation break a score into parts (like key facts, timing, and correctness), each with its own weight.
- How it works:
- List must-have facts (high weight),
- Add helpful details (medium),
- Optional extras (low),
- Penalize mistakes,
- Add up for a fair, traceable score.
- Why it matters: Without rubrics, we can't see where the AI fell short: was it missing a fact, the time, or the reasoning? Bottom Bread (Anchor): Example: For "How did the chef stop pasta from sticking?" high-weight items might include "lots of water," "stirring," and "cooking to al dente."
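To make the weighting concrete, here is a tiny illustrative sketch in Python. The criteria, weights, and penalty are made up for this example and are not taken from the benchmark itself.

```python
# Hypothetical rubric for "How did the chef stop pasta from sticking?"
# Each entry: (criterion text, weight, satisfied-by-the-answer?)
criteria = [
    ("mentions using lots of water",    3, True),   # high priority
    ("mentions stirring while cooking", 3, True),   # high priority
    ("mentions checking for al dente",  3, False),  # high priority, missed
    ("mentions salting the water",      1, True),   # low priority, nice to have
]
penalties = [
    ("states an incorrect timestamp or invented fact", 2, False),
]

earned   = sum(w for _, w, met in criteria if met)
possible = sum(w for _, w, _ in criteria)
deducted = sum(w for _, w, hit in penalties if hit)

score = max(earned - deducted, 0) / possible * 100
print(f"score = {score:.1f}%")  # 70.0%: partial credit, and the missed criterion is visible
```

The point of the weighted sum is that the final number stays traceable: you can always look back and see which individual checks passed or failed.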
Real stakes
- Better note-taking for long classes and meetings.
- Smarter sports recaps that explain turning points.
- Safer monitoring (e.g., noticing a warning beep before a machine error).
- More accessible videos for people who need clear summaries or captions.
- Honest model improvement: see exactly what to fix, whether vision, audio, or long-term memory.
02 Core Idea
The "Aha!" moment in one sentence: To truly judge and improve AI on long, real-world videos, you need a benchmark that blends all modalities, asks intent-driven open questions (single- and multi-turn), checks tool use, and scores with transparent rubrics, plus a working agent that shows how to tackle the challenge in practice.
Top Bread (Hook): Imagine a school science fair where projects are tested with real experiments, graded with clear rubrics, and the best teams use the right tools at the right time. Filling (The Actual Concept)
- What it is: LongShOTBench is that "science fair" for AI on long videos, and LongShOTAgent is the well-organized team captain using tools wisely.
- How it works (big picture):
- Build long, multi-modal video samples with aligned visual, speech, and audio captions,
- Create scenario-based, intent-driven questions (single and multi-turn),
- Provide reference answers and weighted rubrics,
- Evaluate models criterion-by-criterion,
- Offer LongShOTAgent, a modular agent that preprocesses, searches, refines, and uses tools to answer well.
- Why it matters: This reveals exactly what AIs can and can't do in realistic conditions, and shows a practical path to do better. Bottom Bread (Anchor): Example: A 1-hour lecture becomes questions like "What proof did the teacher give?", "Where did students get stuck?", and "Summarize the 3 key takeaways," each scored by rubrics.
Three analogies
- Detective casefile: The video is the case. Photos (visual), witness statements (speech), and background noises (audio) must be tied together across time; the rubric is the checklist; the agent is the lead detective assigning tasks to specialists.
- Orchestra: Video, speech, and audio are instrument sections. The rubric is the sheet music marking what must be played well. The agent is the conductor bringing in soloists (tools) at the right moment.
- Library research: The benchmark is the reading list and grading rubric; the agent is the student who skims, searches indexes, quotes sources, and synthesizes a clean report.
Before vs. after
- Before: Short clips, multiple-choice, single overall scores, and missing modalities hid the real problems.
- After: Long, omni-modal videos; open-ended, multi-turn questions; tool use is expected; scores are interpretable with rubrics; and a working agent shows the method.
Why it works (intuition, no equations)
- Diachronic alignment: Keeping speech, audio, and visuals lined up in time unlocks cause-and-effect.
- Intent-driven prompts: Questions that match real goals (find, plan, explain) force deeper reasoning.
- Rubrics: Breaking "correctness" into small, checkable bites reduces confusion and gives partial credit.
- Agentic orchestration: Specialized tools beat guessing; picking the right one at the right time improves reliability.
Building blocks
- Long videos (avg ~45 min), split by speech activity, with visual scene descriptions and audio events.
- Scenario framing and task taxonomy (32 tasks across perception, information, multimodal, reasoning, and agentic).
- Open-ended Q&A (single and multi-turn), with difficulty scaling from recall to complex temporal/causal inference.
- Hierarchical, weighted rubrics (high/medium/low priority and penalties) for traceable scores.
- Human validation to ensure clarity, correctness, and fairness.
Top Bread (Hook): You know how a school test isn't just one question? It has reading, math, and science sections. Filling (The Actual Concept)
- What it is: LongShOTBench is a comprehensive test for long, multi-modal videos.
- How it works:
- Collect and align visuals, speech, and audio,
- Ask intent-driven open questions (some multi-turn),
- Score with rubrics so we see strengths and weaknesses.
- Why it matters: Without this, we'd never know if AI failed at hearing, seeing, or reasoning across time. Bottom Bread (Anchor): Example: A phone review video: "Which camera part did the reviewer praise most, and why?" scored by facts (part name), reasons (evidence), and timing.
Top Bread (Hook): Think of a team captain who knows when to call the goalie, the striker, or the coach during a tough game. Filling (The Actual Concept)
- What it is: LongShOTAgent is a modular AI that routes the question through preprocessing, search, refinement, and tool calls to answer well.
- How it works:
- Preprocess: sample scenes, transcribe speech, index embeddings,
- Search: find the most relevant moments,
- Refine: use stronger tools (better ASR, audio analysis, dense captions),
- Synthesize: combine clues into a clear answer.
- Why it matters: Without a smart "captain," the AI wastes time, misses key moments, or guesses. Bottom Bread (Anchor): Example: "When did the teacher switch to examples?" The agent finds the mention in speech, checks the frames, and returns the exact moment with context.
03 Methodology
At a high level: Long video input → Multimodal preprocessing → Scenario-based question creation → Open-ended answers → Weighted rubrics → Human validation → Benchmark; and for the agent: User question + video → Preprocess and index → Search relevant clips → Refine with specialist tools → Final answer.
Part A: Building LongShOTBench (the dataset and evaluation)
- Multimodal preprocessing (captions and alignment)
- What happens: Split videos by speech activity. For each segment, transcribe speech (Whisper-large-v3), generate visual scene descriptions (Qwen2.5-VL-32B), and detect audio events (Audio-Flamingo-3). Then fuse these into a coherent, time-aligned summary.
- Why this step exists: If you don't align sight, speech, and sound, the AI can't do true multimodal reasoning across time.
- Example: In a cooking tutorial, the system aligns "chef dices onions" (visual), "dice them small" (speech), and sizzling sounds (audio) to the same timestamps.
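To make the fusion step concrete, here is a minimal sketch, assuming the per-segment text has already been produced by the ASR, captioning, and audio models named above. The data layout and helper function are illustrative assumptions, not the paper's actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    speech: str         # ASR transcript (e.g., from Whisper-large-v3)
    visual: str         # scene description (e.g., from a VLM captioner)
    audio_events: list  # non-speech sounds (e.g., from an audio-language model)

def fuse(segments: list) -> str:
    """Merge per-segment modality outputs into one time-aligned summary."""
    lines = []
    for seg in segments:
        sounds = ", ".join(seg.audio_events) or "no notable sounds"
        lines.append(
            f"[{seg.start:.0f}s-{seg.end:.0f}s] visual: {seg.visual} | "
            f"speech: \"{seg.speech}\" | audio: {sounds}"
        )
    return "\n".join(lines)

# Toy example mirroring the cooking-tutorial illustration above.
segments = [
    Segment(615, 624, "dice them small", "chef dices onions on a board", ["sizzling"]),
]
print(fuse(segments))
```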
- Scenario framing and task mapping
- What happens: For each video, generate a few realistic viewing scenarios (e.g., "a student trying to study this lecture"). Map each scenario to tasks across perception, information, multimodal, reasoning, and agentic categories (32 total tasks).
- Why it exists: Intent matters. If we don't mirror real goals (find, plan, explain), we test the wrong thing.
- Example: For a phone review, scenarios include "buyer comparing battery life" (information retrieval + comparative analysis) or "creator checking stabilization" (motion analysis + multimodal verification).
- Question generation (single- and multi-turn)
- What happens: Create open-ended questions that fit each scenario and task. Control difficulty: Levels 1–2 (recall), 3 (moderate reasoning), 4–5 (temporal/causal/contextual). Include both single-turn and multi-turn dialogues.
- Why it exists: Multiple-choice can hide reasoning gaps. Open-ended, multi-turn questions reveal depth.
- Example: Single-turn: "How did the chef stop the pasta from getting sticky?" Multi-turn: "What went wrong the first time?" → "How did they fix it?"
- Answer generation
- What happens: Produce grounded, conversational reference answers that only use evidence from the video's fused metadata. Keep it clear and helpful; say "not enough info" if uncertain.
- Why it exists: Reference answers set the gold standard and prevent hallucination.
- Example: "They used a large pot with lots of water, stirred during cooking, and checked for al dente before draining."
- Hierarchical, weighted rubrics
- What happens: For each Q&A, create criteria with weights (high/medium/low priority) plus penalties for errors. Judges check each criterion independently, then compute a score.
- Why it exists: A single score hides failure modes. Rubrics expose exactly what was right or wrong and allow partial credit.
- Example: High (must mention "lots of water", "stirring", "al dente"), Medium (mention salt), Penalty (incorrect timing or fake fact).
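Here is a hedged sketch of how criterion-by-criterion judging could be wired up. The `ask_judge` function is a stand-in (a simple keyword check so the example runs end to end); in the actual benchmark each yes/no verdict comes from an open LLM judge, and the rubric entries below are hypothetical.

```python
def ask_judge(question: str, answer: str, criterion: str) -> bool:
    # Stand-in for an open LLM judge prompted with one criterion at a time.
    # Faked here with a keyword check so the sketch is runnable.
    return criterion.lower() in answer.lower()

def score_answer(question: str, answer: str, rubric: list) -> float:
    """rubric: list of (criterion, weight, is_penalty) tuples."""
    earned, possible, deducted = 0, 0, 0
    for criterion, weight, is_penalty in rubric:
        met = ask_judge(question, answer, criterion)  # each criterion judged independently
        if is_penalty:
            deducted += weight if met else 0
        else:
            possible += weight
            earned += weight if met else 0
    return max(earned - deducted, 0) / possible * 100

rubric = [("lots of water", 3, False), ("stirring", 3, False),
          ("al dente", 3, False), ("salted water", 1, False)]
answer = "They used lots of water and kept stirring the pasta."
print(score_answer("How did the chef stop the pasta from sticking?", answer, rubric))  # 60.0
```

Judging each criterion on its own is what keeps the score diagnostic: a 60% here immediately tells you which facts were missing rather than hiding them in one aggregate number.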
- Human validation
- What happens: Trained annotators review and fix questions, answers, and rubrics. Remove unclear items. Ensure everything is factual, grounded, and fair.
- Why it exists: LLMs can drift; humans bring clarity and consistency.
- Example: If a question is ambiguous about "they," editors clarify who "they" refers to.
The secret sauce for the benchmark
- Intent-driven, real-feeling questions that cover 32 task types.
- Clear, weighted rubrics giving traceable, diagnostic scores.
- Long videos (avg ~45 minutes) with full multimodal alignment and human-verified quality.
Part B: LongShOTAgent (the modular reasoning pipeline)
- Orchestrator receives user query + video
- What happens: A compact LLM controller plans which tools to call.
- Why it exists: Without a planner, tools are used randomly or not at all.
- Example: For "Who explained the main formula, and when?" it plans to transcribe speech, search for "formula," and verify with visuals.
- Preprocessor tool (fast pass)
- What happens: Scene detection, lightweight speech transcription (Whisper-small), SigLIP embeddings for frames, OCR, audio analysis. Store features in a vector database.
- Why it exists: Creates a searchable index so the agent doesn't comb through hours blindly.
- Example: Indexes every second with text, sounds, and visual features.
- Search tool (retrieve likely moments)
- What happens: Semantic similarity search across modalities retrieves top-k segments.
- Why it exists: Focuses attention where it matters; saves compute and reduces noise.
- Example: Finds the 10 most relevant clips where "pasta" and "stirring" co-occur in speech and visuals.
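A minimal sketch of the index-then-search idea follows. The `embed` function is a placeholder for the real encoders (e.g., SigLIP for frames, a text embedder for transcripts), and a plain NumPy array stands in for the vector database; all names and segments below are illustrative assumptions.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real encoder: hash tokens into a fixed-size unit vector
    # so the sketch runs without downloading any model.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# "Vector database": one row per indexed segment (speech + visual + audio text).
segments = [
    "0:12:30 chef adds pasta to boiling water, speech: 'use plenty of water', sound: bubbling",
    "0:14:05 chef stirs the pot, speech: 'keep stirring so it does not stick', sound: sizzling",
    "0:41:10 chef plates the dish, speech: 'garnish with basil', sound: applause",
]
index = np.stack([embed(s) for s in segments])

def search(query: str, top_k: int = 2) -> list:
    scores = index @ embed(query)  # cosine similarity, since all rows are unit-norm
    return [segments[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("how did they keep the pasta from sticking?"))
```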
- Refiner tools (deep dive)
- What happens: Use stronger, slower models when needed: Whisper-large-v3 for high-quality ASR, Audio-Flamingo-3 for detailed audio events, and a video refiner for dense captions.
- Why it exists: Ensures accuracy on the most important segments without processing the entire video at maximal cost.
- Example: Re-transcribe a 20-second clip around the moment the chef says "al dente."
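A small sketch of this escalation policy, with the two ASR models represented by plain callables. The confidence threshold and the example strings are assumptions for illustration, not the paper's settings.

```python
def refine_transcript(clip, fast_text, fast_confidence, accurate_asr, threshold=0.6):
    """Escalate to the expensive ASR only when the fast pass looks unreliable."""
    if fast_confidence >= threshold:
        return fast_text           # keep the cheap result (e.g., from Whisper-small)
    return accurate_asr(clip)      # re-transcribe the short clip (e.g., Whisper-large-v3)

def heavy_asr(clip_path):
    # Stand-in for a call to a stronger model on a short, targeted clip.
    return "check that it is al dente before draining"

print(refine_transcript("clip_18m20s.wav", "check that it is all day", 0.31, heavy_asr))
```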
- Synthesis and final answer
- What happens: The orchestrator merges multimodal evidence, checks for consistency, and writes a concise answer.
- Why it exists: Someone (the agent) must combine puzzle pieces into a clear explanation.
- Example: "The teacher introduced the formula at 18:32, wrote it on the board, and explained the variables right after a student's question."
The secret sauce for the agent
- Adaptive routing: Only call expensive tools when needed.
- Cross-modal memory: Keep track of what was seen/heard already.
- Iterative refinement: Re-check tricky parts rather than guessing.
Top Bread (Hook): Imagine using a map app: you search, zoom in, and then switch to satellite view when details matter. Filling (The Actual Concept)
- What it is: Agentic Tools let the AI switch views and instruments as needed (transcribe, detect, search, calculate).
- How it works:
- 1) Plan → 2) Retrieve → 3) Refine → 4) Conclude.
- Why it matters: Without tools, long videos are haystacks with no way to find the needle. Bottom Bread (Anchor): Example: To answer "Which speaker sounded worried before the alarm?" the agent checks audio tone, speech content, and the timeline right before the alarm sound.
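Putting the four steps together, here is a hedged sketch of a plan-retrieve-refine-conclude loop. The tool registry and the fixed plan are simplified assumptions for illustration; the actual LongShOTAgent controller plans adaptively with an LLM.

```python
def run_agent(question, tools, max_steps=4):
    """Minimal orchestration loop: plan which tools to call, gather evidence, answer.

    `tools` maps step names to callables; everything here is a stand-in.
    """
    evidence = []
    plan = ["retrieve", "refine"]  # a real controller would choose steps adaptively
    for step in plan[:max_steps]:
        result = tools[step](question, evidence)
        evidence.append((step, result))
    return tools["answer"](question, evidence)

# Toy tools so the sketch runs end to end.
tools = {
    "retrieve": lambda q, ev: ["18:32 board shows the formula", "18:40 teacher explains variables"],
    "refine":   lambda q, ev: "ASR: 'let me write the formula on the board'",
    "answer":   lambda q, ev: f"Answer to {q!r}, grounded in {len(ev)} pieces of evidence.",
}
print(run_agent("Who explained the main formula, and when?", tools))
```

The design point is separation of concerns: the loop stays tiny, while all the heavy lifting lives in swappable tools, which is what makes the approach modular and training-free.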
04 Experiments & Results
The test: What was measured and why
- They evaluated models across four big areas: Core Perception (seeing/hearing basics), Reasoning (cause/effect, comparisons, math), Information tasks (summaries, instructions), and Multimodal tasks (aligning and combining modalities). There's also a set of Agentic tasks that require smart tool use.
- The dataset: 157 long videos averaging ~45 minutes (over 117 hours total), with 3,092 Q&A instances. This stresses long-horizon memory and multimodal integration.
The competition: Who was tested
- Closed-source: Gemini-2.5-Flash.
- Open-source: Qwen2.5-Omni-7B, Qwen2.5-VL-7B, Qwen3-VL-8B, InternVL3.5-8B, LLaVA-OneVision-7B, LLaVA-NeXT-Video-7B.
- LongShOTAgent (this paperâs agentic system) was also evaluated.
- Fairness note: Each model used its own native video ingestion defaults (no fixed frame policy) to avoid evaluator bias.
The scoreboard (with context)
- Overall: Gemini-2.5-Flash scored 52.95%, like getting a solid B on a really tough final when many others are failing. LongShOTAgent scored 44.66%, impressive for a training-free, modular system. Most open-source models stayed below 30% overall.
- By category:
- Core Perception: Gemini ~41%, LongShOTAgent ~36%. Open-source ranged ~10–25%, showing basic cross-modal perception is still hard in long videos.
- Reasoning: Gemini ~62% (strong), LongShOTAgent ~49% (solid), others ~8–29%. Complex thinking across time is the key separator.
- Information tasks: Gemini ~55%, LongShOTAgent ~45%, others ~9–26%. Extracting instructions and summaries over long spans is challenging.
- Multimodal tasks: Gemini ~54%, LongShOTAgent ~48%, others ~10–37%. Aligning sounds, speech, and visuals remains tough.
- Agentic tasks: LongShOTAgent ~38.25% and Gemini ~40.27% on tool-using scenarios, showing the agent holds its own against a strong proprietary model.
- Duration effect: Performance drops as videos get longer. Example: Gemini falls from ~55.6% on 0–30 min to ~47.2% on >60 min; LongShOTAgent shows a similar trend. This is like running out of mental "sticky notes" over time.
Surprising findings
- Even top systems stumble on hour-long, open-ended, multi-turn questions that demand tight alignment of speech, visuals, and background audio.
- Some models do okay on motion-related tasks but still fail to tie movements to spoken explanations or sounds, revealing gaps in true multimodal reasoning.
- The agentic, tool-using approach narrows the gap with a much larger closed model on several categories, evidence that orchestration and retrieval matter as much as sheer size.
What this means
- Today's AI can handle parts of long videos but struggles to connect the dots over time and across modalities, especially under conversational, intent-driven questioning.
- Weighted rubrics make the weaknesses obvious: was it the fact, the timing, the modality grounding, or the tool use that failed? That clarity is the win.
05 Discussion & Limitations
Limitations (be specific)
- Long context is still a brick wall: Scores drop as duration increases, showing present-day models lose track over time.
- Compute-hungry: Full runs take significant GPU hours; large-scale evaluations are expensive.
- Judge dependency: Although criteria are fine-grained, LLM-as-judge can still introduce subtle biases, even if mitigated by checking each criterion separately.
- Modality coverage in the wild: While videos include raw audio and speech, real-world noise, accents, and music can still trip up ASR and audio event detectors.
- Tool coverage: The 16-tool set is broad but not infinite; some niche tasks might need specialized tools not included.
Required resources
- GPUs with ample memory (e.g., 4×RTX A6000 for main experiments), fast storage for long videos, and a vector database for embeddings.
- Access to ASR, VLMs, audio models, and orchestration code (all modular, replaceable components).
When NOT to use
- If you only need short clips (<30s) or single-modality tasksâsimpler benchmarks may be enough.
- If you only care about multiple-choice accuracy, not open-ended reasoning.
- If your setup cannot process raw video+audio and must rely on pre-extracted frames only.
Open questions
- How can models maintain reliable long-term memory without exploding compute costs?
- What's the best way to fuse speech, audio, and visuals so that timing stays precise and robust to noise?
- Can we design even more interpretable, semi-automatic judges that require less human oversight yet remain trustworthy?
- How do we generalize tool use so agents pick the right toolchains for totally new video domains?
06 Conclusion & Future Work
Three-sentence summary
- This paper introduces LongShOTBench, a tough, realistic benchmark for long, omni-modal video understanding, with open-ended and multi-turn questions and weighted, interpretable rubrics. It also presents LongShOTAgent, a modular, training-free agent that preprocesses, retrieves, refines, and uses tools to answer questions about long videos. Experiments show that even the best models struggle over hour-long contexts, but agentic orchestration closes part of the gap.
Main achievement
- Making evaluation both realistic and explainable: LongShOTBench's intent-driven questions with rubric-based scoring reveal exactly where models succeed or fail across vision, speech, audio, reasoning, and tool use.
Future directions
- Build longer-context memory and retrieval methods, better audio-visual-speech synchronization, stronger tool selection and chaining, and even clearer judging protocols. Expand to more languages, domains, and accessibility needs.
Why remember this
- It gives the community two things that were missing: (1) a benchmark that reflects how people actually use long videos, and (2) a practical agentic framework that shows how to do better right now. Together, they set a new standard for testing and improving AI's real-world video understanding.
Practical Applications
- Lecture companion that finds key moments, answers follow-up questions, and summarizes the main ideas.
- Sports video analyst that explains turning points, who did what, and why a play worked.
- Meeting assistant that extracts action items, instructions, and disagreements with timestamps.
- Customer support triage that reviews tutorial or troubleshooting videos and gives step-by-step help.
- Safety and compliance monitor that flags risky events early by linking warning sounds and on-screen actions.
- Accessible viewing aid that describes scenes, clarifies speech, and aligns captions with real sounds.
- Content moderation and review tool that explains context behind controversial clips, not just single frames.
- Education study tool that compares solved examples across a long class and explains differences.
- Product review analyzer that aligns spoken claims with what's shown on screen and background audio cues.
- Video research assistant that searches long archives for specific evidence and composes a sourced report.