A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Key Summary
- This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
- Instead of only multiple-choice, it uses open-ended and multi-turn questions that feel like real conversations with intent (what the user is trying to do).
- Every question comes with a weighted rubric so we can see exactly what the AI got right or wrong and why.
- They also introduce LongShOTAgent, a smart helper that uses tools (like speech-to-text, audio event detectors, and visual search) to find and combine the right parts of very long videos.
- On this benchmark, top models still struggle: Gemini-2.5-Flash scores 52.95%, LongShOTAgent 44.66%, and many open-source models stay below 30%.
- Performance drops as videos get longer, showing that long-term memory and reasoning are hard for today's models.
- The dataset covers 157 videos averaging 45 minutes each, with 3,092 Q&A instances and human-verified rubrics.
- The evaluation is traceable and fair: models are judged criterion-by-criterion using open LLMs, and each model uses its own native video processing defaults.
- This work gives the community both a tough, realistic test and a practical, modular agent to make progress on long-form, omni-modal understanding.
- Real-world impact includes better video assistants for lectures, sports, meetings, safety monitoring, accessibility, and customer support.
Why This Research Matters
Much of life happens in long videos: classes, sports matches, meetings, tutorials, and events. To be truly helpful, AI must follow stories across time and combine what it sees, hears, and reads. This work gives the community a realistic test to measure that skill and a practical agent to improve it. Because the scoring uses weighted rubrics, we can see exactly what to fix (facts, timing, grounding, or tool use) instead of guessing. That means faster progress on assistants that can study with you, summarize meetings, explain plays in games, and spot safety issues. It also helps build fairer, more transparent evaluations the whole community can reproduce.
Detailed Explanation
01 Background & Problem Definition
You know how watching a full game or a long class recording takes focus because you have to remember what was said earlier, notice what you're seeing now, and connect sounds, voices, and actions over time? That's exactly what we want AI to do with videos, but it's much harder than it sounds.
The world before
- AI models started as great readers of text. Later, they learned to look at pictures and listen to audio. But videos are like moving stories with pictures, words, and sounds mixed together, often over an hour long.
- Many tests (benchmarks) focused on short clips. Those are fine for quick actions, but they don't show whether an AI can follow a lesson, a match, or a documentary from start to finish.
- Most older tests also ignored key parts like raw audio or speech timing, or they boiled results down to one simple score. That hides exactly where the AI is messing up.
The problem
- Long videos demand three things at once: understanding visuals (what's on screen), speech (what people say), and ambient audio (like claps, doorbells, or engine sounds). And the AI must link these clues across long stretches of time.
- We also want the AI to act more like a helper: ask clarifying questions, use tools (like a calculator or speech transcriber), and adapt as the situation changes.
Top Bread (Hook): Imagine you ask a friend, "Tell me what mattered in that whole science class video." You don't want a yes/no; you want a thoughtful answer. Filling (The Actual Concept)
- What it is: Open-Ended Questioning means asking questions that need explanations, not just yes/no.
- How it works:
- Ask a natural question ("Why did the experiment fail?"),
- The AI gathers clues from speech, visuals, and sounds,
- It explains in its own words,
- It can answer follow-ups in a multi-turn chat.
- Why it matters: Without it, tests feel fake and miss real mistakes, like mixing up who said what or why something happened. Bottom Bread (Anchor): Example: In a cooking tutorial, "How did the chef keep pasta from getting sticky?" needs a real explanation, not a checkbox.
Top Bread (Hook): You know how detectives piece together a mystery using footprints (visual), alibis (speech), and odd noises (audio)? Filling (The Actual Concept)
- What it is: Multimodal Reasoning is when AI connects images, speech, and sounds to understand what's truly happening.
- How it works:
- See what's on screen (who, what, where),
- Listen to speech (what's said) and audio (non-speech sounds),
- Align them in time,
- Use logic to link causes and effects across the video.
- Why it matters: Without it, AI might hear "the winner is…" but miss who's on camera, or see a door close but miss the slam sound that explains why someone jumped. Bottom Bread (Anchor): Example: During a lesson, the teacher says "Now watch this part," points to a diagram, and a beep confirms a sensor activation. Multimodal reasoning ties all three together.
Failed attempts
- Some tests used long videos but skipped audio or speech, missing essential cues.
- Others had multi-modality but only on short clips or narrow tasks (like captioning one moment).
- Many relied on multiple-choice or a single overall score judged by another LLM, which hides whether the AI failed at seeing, listening, or reasoning.
Top Bread (Hook): Think of a Swiss Army knife: you pull out the tool you need at the right time. Filling (The Actual Concept)
- What it is: Agentic Tools are helper tools an AI can call, like speech transcribers, object detectors, calculators, or web search, to solve a task.
- How it works:
- The AI breaks the problem into parts,
- Picks the right tool (e.g., transcribe speech, detect sounds),
- Uses results to refine its answer,
- Repeats if needed.
- Why it matters: Without tools, the AI guesses or misses details hidden in long videos. Bottom Bread (Anchor): Example: "Who spoke just before the goal?" The AI calls a speech tool to get the exact words and time.
What was missing
- A benchmark that: (1) uses long videos with real audio and speech, (2) asks open-ended and multi-turn, intent-driven questions, (3) checks tool use, (4) scores answers with detailed, explainable rubrics, and (5) scales with human-verified quality.
Top Bread (Hook): When teachers grade projects, they don't just give one number; they check creativity, facts, and clarity. Filling (The Actual Concept)
- What it is: Weighted Rubrics for Evaluation break a score into parts (like key facts, timing, and correctness), each with its own weight.
- How it works:
- List must-have facts (high weight),
- Add helpful details (medium),
- Optional extras (low),
- Penalize mistakes,
- Add up for a fair, traceable score.
- Why it matters: Without rubrics, we can't see where the AI fell short: was it missing a fact, the time, or the reasoning? Bottom Bread (Anchor): Example: For "How did the chef stop pasta from sticking?" high-weight items might include "lots of water," "stirring," and "cooking to al dente."
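To make the weighting concrete, here is a tiny illustrative sketch in Python. The criteria, weights, and penalty are made up for this example and are not taken from the benchmark itself.

```python
# Hypothetical rubric for "How did the chef stop pasta from sticking?"
# Each entry: (criterion text, weight, satisfied-by-the-answer?)
criteria = [
    ("mentions using lots of water",    3, True),   # high priority
    ("mentions stirring while cooking", 3, True),   # high priority
    ("mentions checking for al dente",  3, False),  # high priority, missed
    ("mentions salting the water",      1, True),   # low priority, nice to have
]
penalties = [
    ("states an incorrect timestamp or invented fact", 2, False),
]

earned   = sum(w for _, w, met in criteria if met)
possible = sum(w for _, w, _ in criteria)
deducted = sum(w for _, w, hit in penalties if hit)

score = max(earned - deducted, 0) / possible * 100
print(f"score = {score:.1f}%")  # 70.0%: partial credit, and the missed criterion is visible
```

The point of the weighted sum is that the final number stays traceable: you can always look back and see which individual checks passed or failed.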
Real stakes
- Better note-taking for long classes and meetings.
- Smarter sports recaps that explain turning points.
- Safer monitoring (e.g., noticing a warning beep before a machine error).
- More accessible videos for people who need clear summaries or captions.
- Honest model improvement: see exactly what to fix, whether vision, audio, or long-term memory.
02 Core Idea
The "Aha!" moment in one sentence: To truly judge and improve AI on long, real-world videos, you need a benchmark that blends all modalities, asks intent-driven open questions (single- and multi-turn), checks tool use, and scores with transparent rubrics, plus a working agent that shows how to tackle the challenge in practice.
Top Bread (Hook): Imagine a school science fair where projects are tested with real experiments, graded with clear rubrics, and the best teams use the right tools at the right time. Filling (The Actual Concept)
- What it is: LongShOTBench is that "science fair" for AI on long videos, and LongShOTAgent is the well-organized team captain using tools wisely.
- How it works (big picture):
- Build long, multi-modal video samples with aligned visual, speech, and audio captions,
- Create scenario-based, intent-driven questions (single and multi-turn),
- Provide reference answers and weighted rubrics,
- Evaluate models criterion-by-criterion,
- Offer LongShOTAgent, a modular agent that preprocesses, searches, refines, and uses tools to answer well.
- Why it matters: This reveals exactly what AIs can and can't do in realistic conditions, and shows a practical path to do better. Bottom Bread (Anchor): Example: A 1-hour lecture becomes questions like "What proof did the teacher give?", "Where did students get stuck?", and "Summarize the 3 key takeaways," each scored by rubrics.
Three analogies
- Detective casefile: The video is the case. Photos (visual), witness statements (speech), and background noises (audio) must be tied together across time; the rubric is the checklist; the agent is the lead detective assigning tasks to specialists.
- Orchestra: Video, speech, and audio are instrument sections. The rubric is the sheet music marking what must be played well. The agent is the conductor bringing in soloists (tools) at the right moment.
- Library research: The benchmark is the reading list and grading rubric; the agent is the student who skims, searches indexes, quotes sources, and synthesizes a clean report.
Before vs. after
- Before: Short clips, multiple-choice, single overall scores, and missing modalities hid the real problems.
- After: Long, omni-modal videos; open-ended, multi-turn questions; tool use is expected; scores are interpretable with rubrics; and a working agent shows the method.
Why it works (intuition, no equations)
- Diachronic alignment: Keeping speech, audio, and visuals lined up in time unlocks cause-and-effect.
- Intent-driven prompts: Questions that match real goals (find, plan, explain) force deeper reasoning.
- Rubrics: Breaking "correctness" into small, checkable bites reduces confusion and gives partial credit.
- Agentic orchestration: Specialized tools beat guessing; picking the right one at the right time improves reliability.
Building blocks
- Long videos (avg ~45 min), split by speech activity, with visual scene descriptions and audio events.
- Scenario framing and task taxonomy (32 tasks across perception, information, multimodal, reasoning, and agentic).
- Open-ended Q&A (single and multi-turn), with difficulty scaling from recall to complex temporal/causal inference.
- Hierarchical, weighted rubrics (high/medium/low priority and penalties) for traceable scores.
- Human validation to ensure clarity, correctness, and fairness.
Top Bread (Hook): You know how a school test isn't just one question? It has reading, math, and science sections. Filling (The Actual Concept)
- What it is: LongShOTBench is a comprehensive test for long, multi-modal videos.
- How it works:
- Collect and align visuals, speech, and audio,
- Ask intent-driven open questions (some multi-turn),
- Score with rubrics so we see strengths and weaknesses.
- Why it matters: Without this, we'd never know if AI failed at hearing, seeing, or reasoning across time. Bottom Bread (Anchor): Example: A phone review video: "Which camera part did the reviewer praise most, and why?" scored by facts (part name), reasons (evidence), and timing.
Top Bread (Hook): Think of a team captain who knows when to call the goalie, the striker, or the coach during a tough game. Filling (The Actual Concept)
- What it is: LongShOTAgent is a modular AI that routes the question through preprocessing, search, refinement, and tool calls to answer well.
- How it works:
- Preprocess: sample scenes, transcribe speech, index embeddings,
- Search: find the most relevant moments,
- Refine: use stronger tools (better ASR, audio analysis, dense captions),
- Synthesize: combine clues into a clear answer.
- Why it matters: Without a smart "captain," the AI wastes time, misses key moments, or guesses. Bottom Bread (Anchor): Example: "When did the teacher switch to examples?" The agent finds the mention in speech, checks the frames, and returns the exact moment with context.
03 Methodology
At a high level: Long video input → Multimodal preprocessing → Scenario-based question creation → Open-ended answers → Weighted rubrics → Human validation → Benchmark; and for the agent: User question + video → Preprocess and index → Search relevant clips → Refine with specialist tools → Final answer.
Part A: Building LongShOTBench (the dataset and evaluation)
- Multimodal preprocessing (captions and alignment)
- What happens: Split videos by speech activity. For each segment, transcribe speech (Whisper-large-v3), generate visual scene descriptions (Qwen2.5-VL-32B), and detect audio events (Audio-Flamingo-3). Then fuse these into a coherent, time-aligned summary.
- Why this step exists: If you don't align sight, speech, and sound, the AI can't do true multimodal reasoning across time.
- Example: In a cooking tutorial, the system aligns "chef dices onions" (visual), "dice them small" (speech), and sizzling sounds (audio) to the same timestamps.
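To make the fusion step concrete, here is a minimal sketch, assuming the per-segment text has already been produced by the ASR, captioning, and audio models named above. The data layout and helper function are illustrative assumptions, not the paper's actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds
    end: float
    speech: str         # ASR transcript (e.g., from Whisper-large-v3)
    visual: str         # scene description (e.g., from a VLM captioner)
    audio_events: list  # non-speech sounds (e.g., from an audio-language model)

def fuse(segments: list) -> str:
    """Merge per-segment modality outputs into one time-aligned summary."""
    lines = []
    for seg in segments:
        sounds = ", ".join(seg.audio_events) or "no notable sounds"
        lines.append(
            f"[{seg.start:.0f}s-{seg.end:.0f}s] visual: {seg.visual} | "
            f"speech: \"{seg.speech}\" | audio: {sounds}"
        )
    return "\n".join(lines)

# Toy example mirroring the cooking-tutorial illustration above.
segments = [
    Segment(615, 624, "dice them small", "chef dices onions on a board", ["sizzling"]),
]
print(fuse(segments))
```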
- Scenario framing and task mapping
- What happens: For each video, generate a few realistic viewing scenarios (e.g., "a student trying to study this lecture"). Map each scenario to tasks across perception, information, multimodal, reasoning, and agentic categories (32 total tasks).
- Why it exists: Intent matters. If we don't mirror real goals (find, plan, explain), we test the wrong thing.
- Example: For a phone review, scenarios include "buyer comparing battery life" (information retrieval + comparative analysis) or "creator checking stabilization" (motion analysis + multimodal verification).
- Question generation (single- and multi-turn)
- What happens: Create open-ended questions that fit each scenario and task. Control difficulty: Levels 1–2 (recall), 3 (moderate reasoning), 4–5 (temporal/causal/contextual). Include both single-turn and multi-turn dialogues.
- Why it exists: Multiple-choice can hide reasoning gaps. Open-ended, multi-turn questions reveal depth.
- Example: Single-turn: "How did the chef stop the pasta from getting sticky?" Multi-turn: "What went wrong the first time?" → "How did they fix it?"
- Answer generation
- What happens: Produce grounded, conversational reference answers that only use evidence from the video's fused metadata. Keep it clear and helpful; say "not enough info" if uncertain.
- Why it exists: Reference answers set the gold standard and prevent hallucination.
- Example: "They used a large pot with lots of water, stirred during cooking, and checked for al dente before draining."
- Hierarchical, weighted rubrics
- What happens: For each Q&A, create criteria with weights (high/medium/low priority) plus penalties for errors. Judges check each criterion independently, then compute a score.
- Why it exists: A single score hides failure modes. Rubrics expose exactly what was right or wrong and allow partial credit.
- Example: High (must mention "lots of water", "stirring", "al dente"), Medium (mention salt), Penalty (incorrect timing or fake fact).
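Here is a hedged sketch of how criterion-by-criterion judging could be wired up. The `ask_judge` function is a stand-in (a simple keyword check so the example runs end to end); in the actual benchmark each yes/no verdict comes from an open LLM judge, and the rubric entries below are hypothetical.

```python
def ask_judge(question: str, answer: str, criterion: str) -> bool:
    # Stand-in for an open LLM judge prompted with one criterion at a time.
    # Faked here with a keyword check so the sketch is runnable.
    return criterion.lower() in answer.lower()

def score_answer(question: str, answer: str, rubric: list) -> float:
    """rubric: list of (criterion, weight, is_penalty) tuples."""
    earned, possible, deducted = 0, 0, 0
    for criterion, weight, is_penalty in rubric:
        met = ask_judge(question, answer, criterion)  # each criterion judged independently
        if is_penalty:
            deducted += weight if met else 0
        else:
            possible += weight
            earned += weight if met else 0
    return max(earned - deducted, 0) / possible * 100

rubric = [("lots of water", 3, False), ("stirring", 3, False),
          ("al dente", 3, False), ("salted water", 1, False)]
answer = "They used lots of water and kept stirring the pasta."
print(score_answer("How did the chef stop the pasta from sticking?", answer, rubric))  # 60.0
```

Judging each criterion on its own is what keeps the score diagnostic: a 60% here immediately tells you which facts were missing rather than hiding them in one aggregate number.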
- Human validation
- What happens: Trained annotators review and fix questions, answers, and rubrics. Remove unclear items. Ensure everything is factual, grounded, and fair.
- Why it exists: LLMs can drift; humans bring clarity and consistency.
- Example: If a question is ambiguous about "they," editors clarify who "they" refers to.
The secret sauce for the benchmark
- Intent-driven, real-feeling questions that cover 32 task types.
- Clear, weighted rubrics giving traceable, diagnostic scores.
- Long videos (avg ~45 minutes) with full multimodal alignment and human-verified quality.
Part B: LongShOTAgent (the modular reasoning pipeline)
- Orchestrator receives user query + video
- What happens: A compact LLM controller plans which tools to call.
- Why it exists: Without a planner, tools are used randomly or not at all.
- Example: For "Who explained the main formula, and when?" it plans to transcribe speech, search for "formula," and verify with visuals.
- Preprocessor tool (fast pass)
- What happens: Scene detection, lightweight speech transcription (Whisper-small), SigLIP embeddings for frames, OCR, audio analysis. Store features in a vector database.
- Why it exists: Creates a searchable index so the agent doesn't comb through hours blindly.
- Example: Indexes every second with text, sounds, and visual features.
- Search tool (retrieve likely moments)
- What happens: Semantic similarity search across modalities retrieves top-k segments.
- Why it exists: Focuses attention where it matters; saves compute and reduces noise.
- Example: Finds the 10 most relevant clips where "pasta" and "stirring" co-occur in speech and visuals.
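A minimal sketch of the index-then-search idea follows. The `embed` function is a placeholder for the real encoders (e.g., SigLIP for frames, a text embedder for transcripts), and a plain NumPy array stands in for the vector database; all names and segments below are illustrative assumptions.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real encoder: hash tokens into a fixed-size unit vector
    # so the sketch runs without downloading any model.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# "Vector database": one row per indexed segment (speech + visual + audio text).
segments = [
    "0:12:30 chef adds pasta to boiling water, speech: 'use plenty of water', sound: bubbling",
    "0:14:05 chef stirs the pot, speech: 'keep stirring so it does not stick', sound: sizzling",
    "0:41:10 chef plates the dish, speech: 'garnish with basil', sound: applause",
]
index = np.stack([embed(s) for s in segments])

def search(query: str, top_k: int = 2) -> list:
    scores = index @ embed(query)  # cosine similarity, since all rows are unit-norm
    return [segments[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("how did they keep the pasta from sticking?"))
```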
- Refiner tools (deep dive)
- What happens: Use stronger, slower models when needed: Whisper-large-v3 for high-quality ASR, Audio-Flamingo-3 for detailed audio events, and a video refiner for dense captions.
- Why it exists: Ensures accuracy on the most important segments without processing the entire video at maximal cost.
- Example: Re-transcribe a 20-second clip around the moment the chef says "al dente."
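A small sketch of this escalation policy, with the two ASR models represented by plain callables. The confidence threshold and the example strings are assumptions for illustration, not the paper's settings.

```python
def refine_transcript(clip, fast_text, fast_confidence, accurate_asr, threshold=0.6):
    """Escalate to the expensive ASR only when the fast pass looks unreliable."""
    if fast_confidence >= threshold:
        return fast_text           # keep the cheap result (e.g., from Whisper-small)
    return accurate_asr(clip)      # re-transcribe the short clip (e.g., Whisper-large-v3)

def heavy_asr(clip_path):
    # Stand-in for a call to a stronger model on a short, targeted clip.
    return "check that it is al dente before draining"

print(refine_transcript("clip_18m20s.wav", "check that it is all day", 0.31, heavy_asr))
```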
- Synthesis and final answer
- What happens: The orchestrator merges multimodal evidence, checks for consistency, and writes a concise answer.
- Why it exists: Someone (the agent) must combine puzzle pieces into a clear explanation.
- Example: "The teacher introduced the formula at 18:32, wrote it on the board, and explained the variables right after a student's question."
The secret sauce for the agent
- Adaptive routing: Only call expensive tools when needed.
- Cross-modal memory: Keep track of what was seen/heard already.
- Iterative refinement: Re-check tricky parts rather than guessing.
Top Bread (Hook): Imagine using a map app: you search, zoom in, and then switch to satellite view when details matter. Filling (The Actual Concept)
- What it is: Agentic Tools let the AI switch views and instruments as needed (transcribe, detect, search, calculate).
- How it works:
- 1) Plan → 2) Retrieve → 3) Refine → 4) Conclude.
- Why it matters: Without tools, long videos are haystacks with no way to find the needle. Bottom Bread (Anchor): Example: To answer "Which speaker sounded worried before the alarm?" the agent checks audio tone, speech content, and the timeline right before the alarm sound.
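Putting the four steps together, here is a hedged sketch of a plan-retrieve-refine-conclude loop. The tool registry and the fixed plan are simplified assumptions for illustration; the actual LongShOTAgent controller plans adaptively with an LLM.

```python
def run_agent(question, tools, max_steps=4):
    """Minimal orchestration loop: plan which tools to call, gather evidence, answer.

    `tools` maps step names to callables; everything here is a stand-in.
    """
    evidence = []
    plan = ["retrieve", "refine"]  # a real controller would choose steps adaptively
    for step in plan[:max_steps]:
        result = tools[step](question, evidence)
        evidence.append((step, result))
    return tools["answer"](question, evidence)

# Toy tools so the sketch runs end to end.
tools = {
    "retrieve": lambda q, ev: ["18:32 board shows the formula", "18:40 teacher explains variables"],
    "refine":   lambda q, ev: "ASR: 'let me write the formula on the board'",
    "answer":   lambda q, ev: f"Answer to {q!r}, grounded in {len(ev)} pieces of evidence.",
}
print(run_agent("Who explained the main formula, and when?", tools))
```

The design point is separation of concerns: the loop stays tiny, while all the heavy lifting lives in swappable tools, which is what makes the approach modular and training-free.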
04 Experiments & Results
The test: What was measured and why
- They evaluated models across four big areas: Core Perception (seeing/hearing basics), Reasoning (cause/effect, comparisons, math), Information tasks (summaries, instructions), and Multimodal tasks (aligning and combining modalities). There's also a set of Agentic tasks that require smart tool use.
- The dataset: 157 long videos averaging ~45 minutes (over 117 hours total), with 3,092 Q&A instances. This stresses long-horizon memory and multimodal integration.
The competition: Who was tested
- Closed-source: Gemini-2.5-Flash.
- Open-source: Qwen2.5-Omni-7B, Qwen2.5-VL-7B, Qwen3-VL-8B, InternVL3.5-8B, LLaVA-OneVision-7B, LLaVA-NeXT-Video-7B.
- LongShOTAgent (this paperâs agentic system) was also evaluated.
- Fairness note: Each model used its own native video ingestion defaults (no fixed frame policy) to avoid evaluator bias.
The scoreboard (with context)
- Overall: Gemini-2.5-Flash scored 52.95%, like getting a solid B on a really tough final when many others are failing. LongShOTAgent scored 44.66%, impressive for a training-free, modular system. Most open-source models stayed below 30% overall.
- By category:
- Core Perception: Gemini ~41%, LongShOTAgent ~36%. Open-source ranged ~10–25%, showing basic cross-modal perception is still hard in long videos.
- Reasoning: Gemini ~62% (strong), LongShOTAgent ~49% (solid), others ~8–29%. Complex thinking across time is the key separator.
- Information tasks: Gemini ~55%, LongShOTAgent ~45%, others ~9–26%. Extracting instructions and summaries over long spans is challenging.
- Multimodal tasks: Gemini ~54%, LongShOTAgent ~48%, others ~10–37%. Aligning sounds, speech, and visuals remains tough.
- Agentic tasks: LongShOTAgent ~38.25% and Gemini ~40.27% on tool-using scenarios, showing the agent holds its own against a strong proprietary model.
- Duration effect: Performance drops as videos get longer. Example: Gemini falls from ~55.6% on 0–30 min to ~47.2% on >60 min; LongShOTAgent shows a similar trend. This is like running out of mental "sticky notes" over time.
Surprising findings
- Even top systems stumble on hour-long, open-ended, multi-turn questions that demand tight alignment of speech, visuals, and background audio.
- Some models do okay on motion-related tasks but still fail to tie movements to spoken explanations or sounds, revealing gaps in true multimodal reasoning.
- The agentic, tool-using approach narrows the gap with a much larger closed model on several categories, evidence that orchestration and retrieval matter as much as sheer size.
What this means
- Today's AI can handle parts of long videos but struggles to connect the dots over time and across modalities, especially under conversational, intent-driven questioning.
- Weighted rubrics make the weaknesses obvious: was it the fact, the timing, the modality grounding, or the tool use that failed? That clarity is the win.
05 Discussion & Limitations
Limitations (be specific)
- Long context is still a brick wall: Scores drop as duration increases, showing present-day models lose track over time.
- Compute-hungry: Full runs take significant GPU hours; large-scale evaluations are expensive.
- Judge dependency: Although criteria are fine-grained, LLM-as-judge can still introduce subtle biases, even if mitigated by checking each criterion separately.
- Modality coverage in the wild: While videos include raw audio and speech, real-world noise, accents, and music can still trip up ASR and audio event detectors.
- Tool coverage: The 16-tool set is broad but not infinite; some niche tasks might need specialized tools not included.
Required resources
- GPUs with ample memory (e.g., 4×RTX A6000 for main experiments), fast storage for long videos, and a vector database for embeddings.
- Access to ASR, VLMs, audio models, and orchestration code (all modular, replaceable components).
When NOT to use
- If you only need short clips (<30s) or single-modality tasksâsimpler benchmarks may be enough.
- If you only care about multiple-choice accuracy, not open-ended reasoning.
- If your setup cannot process raw video+audio and must rely on pre-extracted frames only.
Open questions
- How can models maintain reliable long-term memory without exploding compute costs?
- What's the best way to fuse speech, audio, and visuals so that timing stays precise and robust to noise?
- Can we design even more interpretable, semi-automatic judges that require less human oversight yet remain trustworthy?
- How do we generalize tool use so agents pick the right toolchains for totally new video domains?
06 Conclusion & Future Work
Three-sentence summary
- This paper introduces LongShOTBench, a tough, realistic benchmark for long, omni-modal video understanding, with open-ended and multi-turn questions and weighted, interpretable rubrics. It also presents LongShOTAgent, a modular, training-free agent that preprocesses, retrieves, refines, and uses tools to answer questions about long videos. Experiments show that even the best models struggle over hour-long contexts, but agentic orchestration closes part of the gap.
Main achievement
- Making evaluation both realistic and explainable: LongShOTBench's intent-driven questions with rubric-based scoring reveal exactly where models succeed or fail across vision, speech, audio, reasoning, and tool use.
Future directions
- Build longer-context memory and retrieval methods, better audio-visual-speech synchronization, stronger tool selection and chaining, and even clearer judging protocols. Expand to more languages, domains, and accessibility needs.
Why remember this
- It gives the community two things that were missing: (1) a benchmark that reflects how people actually use long videos, and (2) a practical agentic framework that shows how to do better right now. Together, they set a new standard for testing and improving AI's real-world video understanding.
Practical Applications
- Lecture companion that finds key moments, answers follow-up questions, and summarizes the main ideas.
- Sports video analyst that explains turning points, who did what, and why a play worked.
- Meeting assistant that extracts action items, instructions, and disagreements with timestamps.
- Customer support triage that reviews tutorial or troubleshooting videos and gives step-by-step help.
- Safety and compliance monitor that flags risky events early by linking warning sounds and on-screen actions.
- Accessible viewing aid that describes scenes, clarifies speech, and aligns captions with real sounds.
- Content moderation and review tool that explains context behind controversial clips, not just single frames.
- Education study tool that compares solved examples across a long class and explains differences.
- Product review analyzer that aligns spoken claims with what's shown on screen and background audio cues.
- Video research assistant that searches long archives for specific evidence and composes a sourced report.