Active Perception Agent for Omnimodal Audio-Video Understanding

Intermediate
Keda Tao, Wenjie Du, Bohan Yu et al. · 12/29/2025
arXiv · PDF

Key Summary

  ‱ This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
  ‱ Instead of watching every frame, it first uses audio to find the right moments, then zooms in on the matching video windows for details.
  ‱ It works in a loop: Think → Act → Observe → Reflect, choosing the best tool (audio or video) at each step.
  ‱ A new audio-guided event localization tool helps the agent quickly pinpoint when important things happen.
  ‱ By orchestrating strong single-sense tools (like speech-to-text and clip-focused video QA), it avoids hard cross-modal alignment training.
  ‱ Across three benchmarks (Daily-Omni, OmniVideoBench, WorldSense), OmniAgent improves accuracy by about 10–20% without extra training.
  ‱ It is explainable, because all tool calls and reflections are visible and auditable.
  ‱ The agent’s secret is a coarse-to-fine plan: listen globally for timing, then look locally in high resolution for proof.
  ‱ Ablations show each tool (audio QA, event location, video clip QA) is necessary; removing any part hurts results.
  ‱ This approach suggests a new path: active, tool-using agents for reliable, fine-grained audio-video understanding.

Why This Research Matters

Videos are long, and most of the time only a few moments really answer your question. OmniAgent uses audio like a time compass to jump straight to those moments, saving time and reducing mistakes. This helps students find key parts in lectures, families locate highlights in home videos, and professionals analyze meetings or news clips faster. Safety and accessibility also benefit: systems can catch urgent sounds (sirens, alarms, shouts) and verify what’s happening visually. The approach is transparent—every tool call and reflection can be audited—building trust. It opens a practical path beyond monolithic end-to-end models toward modular, explainable, and accurate multimedia understanding.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: You know how you watch a movie with your family and sometimes you close your eyes but still know something exciting is happening because the music gets loud or someone shouts? Your ears can guide your eyes to the best parts.

đŸ„Ź The Concept (Audio Understanding): It means figuring out what’s going on just by listening. How it works: (1) Capture the sounds and speech, (2) turn speech into text with timestamps, (3) detect special sounds like meows or applause, (4) summarize the overall audio scene. Why it matters: Without solid listening, you won’t know when to look; you’ll waste time searching everywhere visually.

🍞 Anchor: Imagine hunting for a cat in a vlog. You first listen for a meow to find the exact second, then you peek at that moment in the video.

🍞 Hook: Imagine flipping through a photo book. If you don’t know which page has the birthday cake, you’ll waste time scanning every page.

đŸ„Ź The Concept (Video Understanding): It means using visual frames to answer questions about objects, actions, and scenes. How it works: (1) Sample frames to get the gist, (2) zoom into a short clip with higher frame rate and resolution, (3) ask targeted visual questions. Why it matters: Without careful visual zooming, tiny but crucial details (like text on a sign) are easy to miss.

🍞 Anchor: After you hear “meow” at 1:48, you zoom into the video frames around 1:48 to see who picked up the kitten.

🍞 Hook: Think of a school project where you both watch a nature clip and listen to narration to learn. You understand more when your eyes and ears agree.

đŸ„Ź The Concept (Multimodal Understanding): It means combining sight and sound to understand better than with either alone. How it works: (1) Align what’s heard with what’s seen, (2) cross-check facts across modalities, (3) resolve conflicts when audio and video disagree. Why it matters: Without this, the model might trust the wrong channel—like mishearing a name or misreading a sign.

🍞 Anchor: If the audio says “Nán Kē” and the sign shows those characters at 1:30–1:40, you know the blogger is explaining that sign before the kitten appears.

The World Before: End-to-end omnimodal models try to learn everything at once—hearing and seeing in a single big network. They got good at general tasks, but stumbled when questions needed fine-grained facts: exact times, small text, brief sounds, or tiny actions. The Problem: (1) Fine-grained cross-modal reasoning is hard because audio and video don’t always line up perfectly in time. (2) Training one big model to align all senses is costly and brittle. (3) When videos are long, scanning every frame is expensive and noisy.

Failed Attempts: (a) End-to-end fusion: Powerful, but struggles to dynamically shift attention between audio and video at the right moment, often missing precise details. (b) Caption-all-frames agents: They pre-caption many frames, store text, and search it later. This burns compute, adds noise, and still may miss the relevant seconds. (c) Static pipelines: Fixed tool orders cannot adapt to each unique question’s needs.

🍞 Hook: You know how a good detective doesn’t interview everyone at once—they start with a clue (like a sound) and then focus on the exact spot.

đŸ„Ź The Concept (Cross-Modal Alignment Challenge): It’s the difficulty of matching what’s heard with what’s seen at the right time and place. How it works: (1) Detect events in audio, (2) map them to visual windows, (3) verify with both. Why it matters: If you can’t line them up, you miss the answer or chase the wrong evidence.

🍞 Anchor: Hearing “Conan” at 1:26–1:27 tells you to inspect the nearby video frames to confirm the sign and its meaning.

The Gap: What was missing was an active way to ask, “Should I listen first? Or look first?” and to call just the right tool at the right moment. The paper fills this by proposing an agent that plans, acts, observes, and reflects—like a student solving a mystery by checking the audio for timing clues, then zooming into the matched video clip for proof.

Real Stakes: This matters for everyday life—finding moments in long lectures, summarizing family videos, understanding news clips, helping safety systems notice urgent sounds (sirens, alarms) and match them with visuals. It saves time, reduces mistakes, and makes AI explanations more transparent because you can see every step it took.

02 Core Idea

🍞 Hook: Imagine a treasure hunt where the bell rings exactly when you’re near the chest; you follow the bell to the spot, then look closely to dig up the prize.

đŸ„Ź The Concept (Active Perception): It’s when an AI doesn’t passively watch everything but instead chooses when to listen and when to look, step by step. How it works: (1) Plan what to check first, (2) call a tool (audio or video), (3) observe the result, (4) reflect and adjust the plan, (5) repeat until confident. Why it matters: Without active choices, the AI wastes effort, misses tiny clues, and gets confused in long videos.

🍞 Anchor: To answer “What did the sign say?” the agent listens for the mention of the sign, then inspects only that short visual clip in high resolution.

The Aha! Moment in one sentence: Let audio quickly tell you when to look, then zoom into just those video moments with fine detail, all controlled by a think–act–observe–reflect loop.

Three Analogies:

  1. Librarian: First skim the table of contents (audio), then turn to the exact page (video clip) to read the fine print.
  2. Detective: Listen for the shout “over here!” (audio localization), then examine the footprints and note the license plate (high-res video QA).
  3. Chef: Smell the kitchen to find which pot is boiling (audio cue), then lift only that lid to check the ingredients (clip QA).

🍞 Hook: You know how rough maps get you to the neighborhood, and street-view gets you to the right door?

đŸ„Ź The Concept (Coarse-to-Fine Audio-Guided Perception): Start broad with audio to find times, then go fine with video at higher FPS/resolution. How it works: (1) Global audio captioning sets context, (2) event listing outlines the timeline, (3) event location pinpoints exact seconds, (4) clip video QA confirms details. Why it matters: Without starting coarse, you burn compute everywhere; without finishing fine, you miss tiny details.

🍞 Anchor: Hear a kitten meow at 01:48–01:49, then examine those exact frames to see who picked it up.
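
To make the coarse-to-fine idea concrete, here is a minimal Python sketch (an illustration, not the paper's code; the function name and the 5-second padding are assumptions) that turns an audio event's timestamps into a slightly wider video window to inspect at higher frame rate.

```python
# Minimal sketch of coarse-to-fine focusing (illustrative; not the paper's code).
# A coarse audio event is expanded into a small padded video window that can
# then be decoded at higher FPS and resolution for fine-grained checks.

def audio_event_to_clip_window(start_s: float, end_s: float,
                               video_duration_s: float,
                               pad_s: float = 5.0) -> tuple[float, float]:
    """Pad an audio event [start_s, end_s] into a video clip window."""
    clip_start = max(0.0, start_s - pad_s)
    clip_end = min(video_duration_s, end_s + pad_s)
    return clip_start, clip_end

# Example: a meow detected at 108-109 s (01:48-01:49) in a 3-minute video
# becomes a ~103-114 s window to examine closely.
print(audio_event_to_clip_window(108.0, 109.0, video_duration_s=180.0))
```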

Before vs After:

  • Before: Models tried to learn everything together, often missing exact moments or small text.
  • After: The agent first asks the audio “when?” then asks the video “what exactly?”, raising accuracy and saving compute.

🍞 Hook: Think of a Swiss Army knife—you don’t use all blades at once, you pick the right one.

đŸ„Ź The Concept (Dynamic Tools Invocation): The AI selects the best tool (ASR, audio QA, event location, global video QA, clip QA) at each step. How it works: (1) Read the question, (2) choose a tool based on need, (3) gather evidence, (4) decide the next tool or answer. Why it matters: Without tool picking, the system either overuses heavy tools or misses key evidence.

🍞 Anchor: If the question is about what was said, it calls ASR; if it’s about small text on a sign, it calls clip QA in a narrow window.
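
As a rough illustration of tool picking, the sketch below uses a hand-written heuristic as a stand-in for the LLM planner described here; the tool names and trigger words are assumptions for illustration, not the paper's API.

```python
# Illustrative tool-selection heuristic (a stand-in for the LLM-driven planner;
# tool names are hypothetical labels, not the paper's interface).

def pick_tool(question: str, have_timestamps: bool) -> str:
    q = question.lower()
    if "say" in q or "said" in q or "speech" in q:
        return "asr"               # spoken content -> timestamped transcript
    if not have_timestamps:
        return "event_location"    # no timing yet -> localize events via audio
    if "text" in q or "sign" in q or "read" in q:
        return "video_clip_qa"     # tiny visual detail -> high-res clip QA
    return "video_global_qa"       # otherwise, a cheap coarse visual pass

print(pick_tool("What text is on the signboard?", have_timestamps=True))             # video_clip_qa
print(pick_tool("What did the blogger say about the sign?", have_timestamps=False))  # asr
```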

Why It Works (intuition): Audio has high signal-for-time—sounds like speech, alarms, names, or meows sharply mark when important things happen. Video has high signal-for-detail—small text, objects, and actions are clearest when you zoom into a time slice with more frames and pixels. The looped thinking lets the agent cross-check, correct mistakes, and only dig deeper when needed.

Building Blocks (sandwich-style):

  • 🍞 Hook: Like a student using notes before reading the whole book. đŸ„Ź Concept (Think–Act–Observe–Reflect Loop): Plan → call a tool → read output → reflect and replan. Why it matters: Prevents rushing to wrong answers or wasting effort. 🍞 Anchor: After hearing “Conan” at 1:26, the agent reflects and decides to visually confirm the sign around 1:30–1:40.
  • 🍞 Hook: Like making a timeline of a story. đŸ„Ź Concept (Event Localization): Find the exact time an event happens using audio. Why it matters: Narrow time windows make video checks cheap and sharp. 🍞 Anchor: “Meow” at 01:48 marks where to look.
  • 🍞 Hook: Like checking a word’s spelling under a magnifying glass. đŸ„Ź Concept (Video Clip QA): Ask precise questions about a short, high-resolution video segment. Why it matters: That’s how you read tiny text or spot subtle actions. 🍞 Anchor: Read the characters on the “NĂĄn Kē” sign.
  • 🍞 Hook: Like turning speech into subtitles. đŸ„Ź Concept (ASR): Convert speech to timestamped text. Why it matters: Lets the agent search for key phrases by time. 🍞 Anchor: Find where the blogger says “NĂĄn Kē.”
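
Tying the ASR and event-localization building blocks together, here is a tiny sketch of searching a timestamped transcript for a key phrase; the transcript format is an illustrative assumption rather than the exact tool output.

```python
# Sketch of phrase search over a timestamped transcript (format is assumed).

transcript = [
    (86.0, 87.0, "Oh, Nan Ke, I saw Conan use this name in an episode."),
    (108.0, 109.0, "[kitten meows]"),
]

def find_phrase(segments, phrase):
    """Return (start_s, end_s) of the first utterance containing the phrase."""
    for start, end, text in segments:
        if phrase.lower() in text.lower():
            return start, end
    return None

print(find_phrase(transcript, "Conan"))  # -> (86.0, 87.0), i.e. 01:26-01:27
```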

03 Methodology

At a high level: Input (user question + video + audio) → Plan (which sense first?) → Act (call one tool) → Observe (read tool output) → Reflect (revise plan) → Repeat until ready → Answer.
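
A minimal control-loop sketch of this cycle is below; `llm_plan` and `call_tool` are placeholders for the LLM planner and the audio/video tools (assumptions for illustration, not the paper's implementation).

```python
# Minimal Think -> Act -> Observe -> Reflect loop (illustrative skeleton).
# llm_plan and call_tool are placeholder callables supplied by the caller.

def run_agent(question, llm_plan, call_tool, max_steps=8):
    memory = []                                   # auditable log of every step
    for _ in range(max_steps):
        decision = llm_plan(question, memory)     # Think: pick a tool or answer
        if decision["action"] == "answer":
            return decision["answer"], memory
        observation = call_tool(decision["tool"], decision["args"])   # Act
        memory.append({"tool": decision["tool"],  # Observe: record the evidence
                       "args": decision["args"],
                       "observation": observation})
        # Reflect: the next llm_plan call sees the updated memory and replans.
    return "uncertain", memory
```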

Step-by-step (with recipe-style details and examples):

  1. Initialize memory and read the question
  ‱ What happens: The agent stores the question and prepares an empty memory to record every tool call and observation.
  ‱ Why it exists: Without memory, the agent repeats mistakes or forgets clues.
  ‱ Example: Question: “Before picking up the kitten, the blogger explains a sign. Which concepts can it be associated with?”
  2. Active thinking: choose where to start (listen-first vs look-first)
  ‱ What happens: The agent judges whether the question is time-sensitive (often audio-first) or purely visual (video-first).
  ‱ Why it exists: Different questions need different senses and costs. Starting wrong wastes time.
  ‱ Example: The agent suspects the audio mentions the sign earlier, so it listens first.
  3. Audio Global Caption (AGC): set the big-picture context
  ‱ What happens: Summarize overall audio topics, mood, and segments.
  ‱ Why it exists: Without a map of the audio, event hunting is guesswork.
  ‱ Example: “The speaker chats about a place name; later a kitten meows.”
  4. Event List (EL): sketch a rough timeline
  ‱ What happens: Build a continuous outline of audio segments (00:00–end) labeling big shifts (topics, speaker changes, music on/off).
  ‱ Why it exists: Ensures full coverage with no gaps; helps target likely windows.
  ‱ Example: “01:25–01:30: mentions ‘NĂĄn Kē’; 01:48–01:49: kitten meow.”
  5. Event Location (ELO): pinpoint exact times
  ‱ What happens: For a specific query (e.g., “first mention of ‘Conan’”), return exact timestamp(s) or ranges.
  ‱ Why it exists: Precision reduces video search cost and boosts accuracy.
  ‱ Example: “01:26–01:27: ‘Oh, NĂĄn Kē, I saw Conan
’”
  6. ASR: get a timestamped transcript
  ‱ What happens: Convert speech to text with times for each utterance.
  ‱ Why it exists: Lets the agent verify wording and align audio with frames.
  ‱ Example: “01:26–01:27: Oh, NĂĄn Kē, I saw Conan
”
  7. Audio QA (AQ): ask targeted audio questions
  ‱ What happens: Query the audio for specifics (who said what, emotion, sound type) at chosen times.
  ‱ Why it exists: Some questions are answerable by audio alone; this avoids unnecessary video cost.
  ‱ Example: “What exactly was said about the sign’s meaning?”
  8. Video Global QA (VGA): get a coarse visual sense
  ‱ What happens: Sample sparse frames to see major objects/scenes.
  ‱ Why it exists: Offers a cheap first look to guide clip selection.
  ‱ Example: “Around 90–100s, is there a hanging signboard visible?” → “Yes, ‘NĂĄn Kē’.”
  9. Video Clip QA (VCA): zoom in for fine details
  ‱ What happens: Load a short, high-FPS, high-resolution clip around the targeted time; ask precise visual questions.
  ‱ Why it exists: This is where you read small text, confirm actions, or see micro-events.
  ‱ Example: “Between 90–100s, what text is on the signboard?” → “NĂĄn Kē.”
  10. Reflection and cross-modal check
  ‱ What happens: The agent compares audio findings with visual evidence to confirm or resolve conflicts.
  ‱ Why it exists: Prevents errors from a single noisy tool; ensures both channels agree.
  ‱ Example: Audio says “NĂĄn Kē” and video shows the same characters—consistency achieved.
  11. Answer or repeat
  ‱ What happens: If evidence is enough, answer. If not, rethink and call another tool.
  ‱ Why it exists: Avoids premature guesses and unnecessary extra steps.
  ‱ Example: Having both the transcript and the sign text, the agent answers how the sign’s concepts relate (e.g., classic tale reference and anime bar naming).
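
For intuition, the agent's auditable memory for the running "NĂĄn Kē" example might look roughly like the trace below; the record format and field names are assumptions, and the timestamps simply follow the example values above.

```python
# Illustrative tool-call trace for one question (format and field names assumed).

trace = [
    {"tool": "audio_global_caption", "args": {},
     "obs": "Speaker chats about a place name; later a kitten meows."},
    {"tool": "event_location", "args": {"query": "first mention of 'Conan'"},
     "obs": "01:26-01:27"},
    {"tool": "asr", "args": {"window": "01:20-01:40"},
     "obs": "01:26-01:27: Oh, Nan Ke, I saw Conan ..."},
    {"tool": "video_clip_qa",
     "args": {"window": "01:30-01:40", "question": "What text is on the sign?"},
     "obs": "A hanging signboard reading 'Nan Ke'."},
]

for step in trace:
    print(f'{step["tool"]}: {step["obs"]}')
```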

The Secret Sauce (what makes it clever):

  • Audio-guided temporal grounding: Audio is a strong time compass; it tells you precisely when to look.
  • Coarse-to-fine focusing: Start broad and cheap; finish narrow and sharp.
  • Tool orchestration: Pick the right specialist for the job (ASR for speech, VCA for tiny text) instead of a one-size-fits-all pass.
  • Reflective loop: Think–Act–Observe–Reflect catches mistakes early and adapts dynamically.

Sandwich explanations for key method concepts:

  • 🍞 Hook: Like asking “Where in the song is the chorus?” đŸ„Ź Concept (Temporal Grounding): Finding exact times of events, mostly via audio. Why it matters: Without it, video search is a haystack hunt. 🍞 Anchor: Chorus at 01:10–01:25 tells you which frames to inspect for the crowd’s reaction.
  • 🍞 Hook: Like choosing the right school supply. đŸ„Ź Concept (Modality-Aware Toolset): Separate expert tools for audio, video, and events. Why it matters: Each tool excels at a specific job; mixing them blindly wastes effort. 🍞 Anchor: Use ASR for the exact quote; use clip QA for the exact sign text.
  • 🍞 Hook: Like checking your work in math. đŸ„Ź Concept (Cross-Modal Reflection): Compare what you heard vs what you saw before finalizing. Why it matters: Reduces hallucinations and mismatches. 🍞 Anchor: If audio says “left door” but video shows the right door opening, investigate again.
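
As a toy version of that reflection step, the check below flags agreement when the audio-derived fact shows up in the visual evidence; simple string matching stands in for the LLM-based comparison (an assumption for illustration).

```python
# Toy cross-modal consistency check (string matching stands in for LLM reflection).

def consistent(audio_finding: str, visual_finding: str) -> bool:
    """Agreement when the audio-derived fact appears in the visual evidence."""
    return audio_finding.lower() in visual_finding.lower()

audio_says = "Nan Ke"
video_shows = "A hanging signboard reading 'Nan Ke'"

if consistent(audio_says, video_shows):
    print("Evidence agrees; answer with confidence.")
else:
    print("Conflict detected; call another tool to verify.")
```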

04 Experiments & Results

The Tests (what and why):

  • Daily-Omni (30s/60s clips): Measures short-form audio-video reasoning with temporal alignment.
  • OmniVideoBench (0–30 min): Tests long-form comprehension where locating the right moments matters most.
  ‱ WorldSense (medium length, 8 domains): Checks general, real-world audio-video understanding across topics.

Why these: They stress fine-grained timing (when) and detail recognition (what), the core strengths of OmniAgent.

The Competition (who it was compared against):

  • Strong end-to-end OmniLLMs: Qwen3-Omni, Gemini 2.5-Flash, GPT-4o.
  • Agent baselines: Daily-Omni agent, XGC-avis, DVD.
  • Mix of open-source and closed-source, with and without explicit reasoning.

The Scoreboard (with context):

  • Daily-Omni: OmniAgent hits about 82.71% accuracy. That’s like scoring an A+ when many others are getting B’s (e.g., Qwen3-Omni at ~72.08%, Gemini 2.5-Flash-Thinking at ~72.7%).
  • OmniVideoBench: About 59.1% average. That’s a big jump over notable baselines (e.g., Qwen3-Omni-30B at ~38.4%), showing the agent’s edge in long videos where timing is critical.
  • WorldSense: About 61.2% average, topping both open and closed models listed, again without extra training.

Surprising/Insightful Findings:

  • Audio-first planning helped almost every backbone: Starting with AGC/EL/ELO typically led to fewer, smarter video calls.
  • The reflective loop mattered: LLMs that converged too fast (e.g., relying on global video answers) missed fine details.
  • Temporal grounding quality is pivotal: Better event location tools (e.g., strong timestamp grounding) lifted the whole pipeline.
  • Tool ablations showed necessity: Removing Video Clip QA, Audio QA, or Event Location sliced accuracy notably—each tool is a pillar, not a luxury.

Behavior Analysis Highlights:

  • Common pattern across datasets: Start with global audio context, end with clip-focused visual verification—matching the designed coarse-to-fine strategy.
  • LLM choice shapes habits: More deliberate LLMs (e.g., o3) used fine-grained tools at the right time, balancing cost and accuracy; faster-but-shallower models often stopped early on coarse cues and underperformed.

Efficiency Notes:

  • Compared to caption-all-frames agents, OmniAgent reduced token usage and latency by avoiding unnecessary visual processing.
  ‱ Cost per question using public APIs was measured at roughly $0.05–$0.11, varying with video length and question complexity.

05 Discussion & Limitations

Limitations (specific):

  • External tool dependency: The agent’s ceiling is limited by the accuracy of its ASR, event localization, and visual QA tools.
  • Iteration cost: The Think–Act–Observe–Reflect loop adds latency when questions are complex or evidence is scarce.
  • Temporal asynchrony: If audio and video are out of sync, the agent can need extra steps to reconcile them.
  • Premature convergence risk: Some LLM backbones may answer too early on coarse evidence without verification.

Required Resources:

  • Access to reliable audio tools (ASR, audio QA, event list/location) and visual tools (global QA, clip QA) via APIs or local models.
  • Enough compute to load short high-res clips on demand.
  • A reasoning-capable LLM to plan, reflect, and adapt.

When NOT to Use:

  • Purely visual tasks with no audio (e.g., silent timelapse reading of a static screen) may be solved faster by a simple visual model.
  • Real-time, ultra-low-latency settings where iterative reasoning is unacceptable.
  • Extremely noisy audio where event localization is unreliable and cannot be improved by preprocessing.

Open Questions:

  • How to train an end-to-end agentic OmniLLM with built-in tool self-calling to cut latency without losing accuracy?
  • How to strengthen open-source temporal grounding in audio and audio-video jointly?
  • How to build better long-term memory so the agent avoids repeated tool calls on similar sub-questions?
  • How to detect and repair cross-modal conflicts automatically (e.g., learned confidence and consistency scores)?

06 Conclusion & Future Work

Three-sentence summary: This paper proposes OmniAgent, an active perception agent that uses audio to find the right times and video to verify fine details, guided by a Think–Act–Observe–Reflect loop. By orchestrating specialized audio, video, and event tools, it sidesteps hard cross-modal alignment training and achieves strong, fine-grained understanding. On three benchmarks, it outperforms leading models by 10–20% without extra training.

Main Achievement: Demonstrating that audio-guided, coarse-to-fine active perception—implemented as dynamic tool orchestration with reflection—substantially improves omnimodal accuracy and explainability.

Future Directions: Train an agentic OmniLLM with native tool self-calling, strengthen open-source temporal grounding in audio and joint audio-video, design better multimodal memory, and reduce iteration cost with smarter planning. Expand toolsets to other modalities (e.g., sensors), and build more rigorous benchmarks for holistic reasoning.

Why Remember This: It reframes multimodal understanding from “watch everything and hope” to “listen smart, then look sharp,” delivering both better answers and clearer explanations—much like a careful detective who first follows the sound, then inspects the scene.

Practical Applications

  ‱ Smart video search: Jump to exact moments when specific words, sounds, or events occur.
  ‱ Lecture and meeting summarization: Use audio to find topic shifts, then capture key slides or whiteboard details.
  ‱ Customer support QA triage: Quickly locate complaint moments in call+screen recordings and verify on-screen actions.
  ‱ Content moderation: Detect suspicious audio cues (e.g., certain phrases) and visually confirm policy violations.
  ‱ News analysis: Align speeches with crowd reactions or on-screen graphics to fact-check claims.
  ‱ Sports highlights: Use crowd roars/commentary to locate big plays, then verify with replay frames.
  ‱ Accessibility tools: Provide timestamped transcripts and aligned visual context for users with hearing or vision challenges.
  ‱ Home video organization: Find birthdays, names spoken, or pet moments, then auto-generate labeled clips.
  ‱ Education platforms: Auto-answer questions about specific parts of a lesson by aligning teacher’s words with board content.
  ‱ Compliance auditing: In regulated industries, align spoken approvals with on-screen forms to verify proper procedures.
#active perception #omnimodal understanding #audio-guided event localization #multimodal agent #tool orchestration #coarse-to-fine reasoning #temporal grounding #video clip QA #ASR with timestamps #cross-modal alignment #reflective reasoning loop #long video understanding #audio-visual benchmarks #dynamic planning #fine-grained multimodal QA