See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang et al. · 12/1/2025
arXiv · PDF

Key Summary

  ‱ This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.
  ‱ Unlike many video tests that can be solved by looking only, AV-SpeakerBench forces models to use both audio and video together to get the right answer.
  ‱ The benchmark has 3,212 carefully written multiple-choice questions focused on speakers, not just scenes.
  ‱ Human accuracy on this test is 93.74%, but the best model (Gemini 2.5 Pro) scores 73.04%, showing a big gap.
  ‱ Recent open models like Qwen3-Omni-30B do better than older ones but still fall far behind the best proprietary models.
  ‱ Adding audio helps Gemini 2.5 Pro a lot (10–20 percentage-point gains across tasks), but helps Qwen3-Omni-30B much less and sometimes even hurts.
  ‱ Most model mistakes come from hearing problems (audio perception) and getting time wrong (temporal grounding).
  ‱ As scenes get busier with more people, all models struggle more with who spoke and when.
  ‱ A newer model released later (Gemini 3 Pro) reaches 77.62%, which is better but still below humans.

Why This Research Matters

Real life is multimodal: in meetings, classrooms, video calls, and TV shows, understanding depends on both seeing and hearing. AV-SpeakerBench pushes AI to match voices to faces and to get the timing right, which is exactly what’s needed for helpful assistants that take accurate notes or answer questions about conversations. Better audiovisual reasoning means fewer mistakes in transcripts, smarter highlights from long videos, and more reliable accessibility tools (like captions that match the right speaker). It also builds trust: when AI can explain who said what and when, people can verify it. Over time, this will improve remote work, telehealth, education, and entertainment search, where precise, speaker-aware understanding is essential.

Detailed Explanation

01Background & Problem Definition

🍞 Top Bread (Hook) Imagine watching a movie with the sound off. You can still guess some things—who’s on screen, who waves, who walks—but it’s hard to know exactly who said what and when. Now imagine only listening with your eyes closed—you hear words, but it’s tricky to match voices to faces.

đŸ„Ź Filling (The Actual Concept)

  • What it is: Modern AI called multimodal large language models (MLLMs) try to do both at once: see the video and hear the audio to understand what’s happening.
  • How it works (recipe):
    1. The model looks at video frames (faces, gestures, actions).
    2. It listens to the audio (voices, words, loudness, pitch).
    3. It links sights and sounds over time (who spoke, what was said, when it happened).
    4. It answers questions.
  • Why it matters: Without linking audio and video, AI can guess the wrong speaker, miss words, or mix up the timing, like blaming the line on the wrong person.

🍞 Bottom Bread (Anchor) Think about a talk show clip: three people on a couch, laughing and interrupting. A smart AI should tell you who said, “That’s hilarious!” right after someone clapped—and not mix up the speakers.
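
Here is a tiny, runnable sketch of that four-step recipe. The encoders and the fusion step are toy placeholders invented for illustration, not the internals of any real MLLM.

```python
# Toy sketch of the see -> hear -> link -> answer recipe. Everything here is a
# placeholder for illustration; real MLLMs use learned encoders, not strings.
from dataclasses import dataclass

@dataclass
class Clip:
    frames: list   # step 1 input: sampled video frames (here, text descriptions)
    audio: list    # step 2 input: timed audio events (here, text descriptions)

def fuse_over_time(frames, audio):
    # Step 3: link sights and sounds on a shared timeline (here, naive pairing).
    return list(zip(frames, audio))

def answer(clip, question, choices):
    timeline = fuse_over_time(clip.frames, clip.audio)
    # Step 4: a real model reasons over the fused timeline; this stub just
    # shows the interface and always returns the first option.
    return sorted(choices)[0]

clip = Clip(
    frames=["0s: three people on a couch", "1s: the man on the left claps"],
    audio=["0.0-1.2s: laughter", "1.3-2.0s: someone says 'That's hilarious!'"],
)
print(answer(clip, "Who says 'That's hilarious!'?", {"A": "woman", "B": "man on the left"}))
```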

—

🍞 Top Bread (Hook) You know how your brain combines what you see (a friend smiling) and what you hear (their voice) to figure out if they’re joking? That teamwork is your superpower.

đŸ„Ź Filling (New Concept): Multimodal Large Language Models (MLLMs)

  • What it is: A kind of AI that understands information from more than one sense—like sight and hearing—together.
  • How it works:
    1. A vision part reads images and video frames.
    2. An audio part listens to speech and sounds.
    3. A language brain reasons with both to answer questions.
  • Why it matters: If the model only trusts vision, it’ll miss what was said; if it only trusts sound, it’ll miss who said it.

🍞 Bottom Bread (Anchor) When you ask, “Who said ‘That’s not true’ after the door slammed?”, an MLLM should use both the slam (audio) and the person’s face (video) to get it right.

—

🍞 Top Bread (Hook) Imagine a group chat in real life. Everyone looks similar from far away. To follow the story, you track each person: who they are, when they speak, and what they say.

đŸ„Ź Filling (New Concept): Speaker‑centric Audiovisual Reasoning

  • What it is: Focusing AI’s attention on people as the main units—linking each voice to the right face and moment.
  • How it works:
    1. Spot the visible people.
    2. Detect who is speaking now.
    3. Match the words to the right person.
    4. Keep track of turn-taking over time.
  • Why it matters: If AI centers on the whole scene instead of the speakers, it confuses voices and faces, and gets conversation details wrong.

🍞 Bottom Bread (Anchor) In a clip where a woman says, “I have an idea,” AI must choose her (not her silent friend) as the speaker who said that line.
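
To make this concrete, here is a minimal sketch (with invented data) of the speaker-centric bookkeeping: each utterance carries the visible person it was matched to plus a time span, so "Who said X?" becomes a lookup.

```python
# Minimal sketch of speaker-centric bookkeeping. All data is invented.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str    # visible person the voice was attributed to
    text: str
    start: float    # seconds from the start of the clip
    end: float

turns = [
    Utterance("woman in blue jacket", "I have an idea", 2.1, 3.0),
    Utterance("man in gray shirt", "Let's hear it", 3.2, 4.0),
]

def who_said(phrase, turns):
    for u in turns:
        if phrase.lower() in u.text.lower():
            return u.speaker
    return None

print(who_said("I have an idea", turns))   # -> woman in blue jacket
```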

—

🍞 Top Bread (Hook) You know how tiny clues—like a whisper, a quick nod, or a short “Okay”—can change the meaning of a scene?

đŸ„Ź Filling (New Concept): Fine‑grained Reasoning

  • What it is: Paying attention to small details in both sound and sight to make precise decisions.
  • How it works:
    1. Notice short phrases (“Oh, I see”).
    2. Compare how fast or loud someone talks.
    3. Anchor these to exact moments (before/after actions).
    4. Count who speaks and how often.
  • Why it matters: Without detail, AI guesses coarsely (e.g., “someone talked”), missing who exactly said what and when.

🍞 Bottom Bread (Anchor) Question: “After the woman takes a sip, who says ‘Finally’?” Fine-grained reasoning ties the word to the exact sip moment and the right person.

—

🍞 Top Bread (Hook) Teachers make tests that really check what you learned—not just easy questions you can guess.

đŸ„Ź Filling (New Concept): Benchmark (AV‑SpeakerBench)

  • What it is: A carefully designed test with 3,212 multiple-choice questions that force models to use both audio and video to understand human speech.
  • How it works:
    1. Pick real YouTube clips with multiple people talking.
    2. Write questions that require linking words to faces and moments.
    3. Add answer choices that sound plausible but are wrong unless you fuse audio and video.
    4. Have experts review and time-stamp everything.
  • Why it matters: Without a good test, we can’t tell if models truly “see, hear, and understand.”

🍞 Bottom Bread (Anchor) Example: “Who speaks immediately after the man in the gray shirt wiggles his fingers?” You must watch and listen—no shortcut from a single frame.
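
For flavor, here is what one benchmark item might look like as data. Every field name and value below is an invented example for illustration, not AV-SpeakerBench's actual schema.

```python
# A plausible shape for one multiple-choice item, written as a Python dict.
item = {
    "video_id": "example_clip_001",              # hypothetical identifier
    "segment": {"start": 5.0, "end": 17.0},      # a 5-30 s window with several people talking
    "task_type": "speaker detection",
    "question": "Who speaks immediately after the man in the gray shirt wiggles his fingers?",
    "choices": {
        "A": "the man in the gray shirt",
        "B": "the woman in the red dress",
        "C": "the man with glasses",
        "D": "the man in the black suit",
    },
    "answer": "B",                               # made-up answer key for illustration
}
print(item["question"], "->", item["choices"][item["answer"]])
```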

02Core Idea

🍞 Top Bread (Hook) Think of a duet: the piano and the singer must stay in sync. If you only listen to the piano, you’ll miss the lyrics; if you only listen to the singer, you’ll miss the rhythm. Great understanding needs both.

đŸ„Ź Filling (The “Aha!” Moment)

  • One sentence: Make the speaker—not the scene—the hero of the story, and bake audio+video fusion directly into the questions so the only path to the right answer is to align who spoke, what they said, and when it happened.

Multiple Analogies (3 ways):

  1. Theater analogy: Instead of asking, “What’s on stage?”, ask, “Which actor said this line right after the curtain falls?”—you must watch timing and the actor’s mouth and listen to the words.
  2. Detective analogy: Don’t just match footprints (visual) or voices (audio); solve the case by proving which suspect (face) spoke a specific sentence at a specific time.
  3. Sports commentary analogy: The right question isn’t “Who’s on the field?” but “Who shouted ‘Go left!’ right after the whistle?”—requires hearing the words and seeing who shouted when the whistle blew.

Before vs After:

  • Before: Many video QA tests could be solved by looking only; models often ignored audio or used it loosely.
  • After: AV-SpeakerBench locks the door on shortcuts: questions explicitly demand cross-modal matching (voices ↔ faces) with precise timing (before/after/when), making true fusion necessary.

Why It Works (intuition):

  • Fusion is forced by design: Questions encode dependencies like “Who says X right after Y happens?” If you only see or only hear, you can’t reliably choose the correct option.
  • Temporal anchors keep models honest: Phrases like “just before,” “immediately after,” and “until the end” require time-aware reasoning, not static frame inspection.
  • Diverse distractors eliminate guessing: Wrong options are plausible (real people in the clip, real actions), so shallow cues don’t help.

Building Blocks (mini Sandwiches for key pieces):

🍞 Hook: You know how you check the clock to say if recess happened before lunch? đŸ„Ź Concept: Temporal Grounding

  • What it is: Tying speech and actions to exact moments (before/after/when).
  • How it works: Pick anchors (events/phrases), then restrict reasoning to the exact window.
  • Why it matters: Without it, AI mixes moments and answers with the wrong time. 🍞 Anchor: “How many people are visible when he says ‘Let’s go’?” means count at that instant, not earlier.
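
A minimal sketch of temporal grounding in code, assuming an "immediately after" question means a short window right after the anchor event; the window size and all timings below are invented.

```python
# Pick the anchor, define the window it implies, and only consider speech
# inside that window. Assumed window size and invented data.
anchor_end = 4.2                              # moment the anchor event finishes
window = (anchor_end, anchor_end + 2.0)       # "immediately after" as a short window

speech_segments = [
    {"speaker": "woman in red", "start": 3.0, "end": 4.0},   # before the anchor
    {"speaker": "man in gray",  "start": 4.5, "end": 5.1},   # starts inside the window
    {"speaker": "man in black", "start": 7.0, "end": 7.8},   # too late
]

after_anchor = [s for s in speech_segments if window[0] <= s["start"] <= window[1]]
print(after_anchor[0]["speaker"])             # -> man in gray
```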

🍞 Hook: Matching a phone voice to the person in the room is tricky! đŸ„Ź Concept: Cross‑modal Attribution

  • What it is: Assigning the heard words to the right visible speaker.
  • How it works: Compare voice timbre and timing with mouth motion and presence.
  • Why it matters: Without it, AI may credit the wrong person. 🍞 Anchor: Picking “the man in the striped jacket” as the one who said “Oh, I see what’s going on.”

🍞 Hook: Whisper vs shout, fast vs slow, high vs low—you notice these even without words. đŸ„Ź Concept: Paralinguistic Attributes (rate, pitch, intensity)

  • What it is: How someone speaks, not just what they say.
  • How it works: Measure speed (syllables/time), pitch (how high/low), loudness (energy).
  • Why it matters: Many questions compare speakers by these traits. 🍞 Anchor: “Who has the lowest pitch among those who speak?”
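
Here is a minimal sketch of comparing speakers by these traits. The per-speaker numbers are invented; a real system would estimate them from the audio (pitch tracking, energy, word timings), which is beyond this sketch.

```python
# Compare speakers by rate (words per second), pitch, and loudness (RMS energy).
speakers = {
    "man in striped jacket": {"words": 24, "seconds": 6.0, "pitch_hz": 110, "rms": 0.08},
    "woman in red dress":    {"words": 18, "seconds": 7.5, "pitch_hz": 210, "rms": 0.05},
    "man with glasses":      {"words": 30, "seconds": 6.0, "pitch_hz": 130, "rms": 0.12},
}

rate = {name: s["words"] / s["seconds"] for name, s in speakers.items()}
print("slowest speaker:", min(rate, key=rate.get))                               # lowest rate
print("lowest pitch:",    min(speakers, key=lambda n: speakers[n]["pitch_hz"]))
print("loudest speaker:", max(speakers, key=lambda n: speakers[n]["rms"]))
```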

🍞 Bottom Bread (Anchor) Example transformation: Instead of “What is happening in the scene?”, AV-SpeakerBench asks, “Who speaks immediately after the man in the red shirt and the man in the gray T-shirt do a fist bump?” Now the only safe route is to watch the fist bump (video), hear the next line (audio), and match it to a visible person (identity).

03Methodology

At a high level: Real videos → Careful clip selection → Speaker-focused, fusion-required question writing → Expert review and timing → Multiple-choice evaluation with strict letter-only answers.

Step-by-step, like a recipe:

  1. Source real conversational videos
  • What happens: Collect YouTube clips rich in human speech: movie clips, interviews, podcasts, game shows, vlogs.
  • Why this step exists: Natural, messy conversations create real challenges—overlapping talkers, quick turn-taking.
  • Example: A 12-second group interview snippet with four visible people.
  2. Clip selection with speaker complexity
  • What happens: Annotators watch full videos to pick 5–30s segments with multiple visible people and meaningful changes (e.g., who speaks shifts after an action).
  • Why it matters: If only one person speaks or nothing changes, questions become trivial (“Who talks after X?” is always the same person).
  • Example: Choose the window where the second speaker enters and interrupts.
  3. Fusion-driven question design (the secret sauce)
  • What happens: Write four-choice questions that encode audio+video dependencies right in the wording.
  • Why it matters: This prevents solving with one modality only.
  • Examples:
    • Link phrase to identity: “Who says, ‘Oh, I see what’s going on’?”
    • Visual-to-audio timing: “What does the woman say just before she drinks from her glass?”
    • Audio-to-visual timing: “When the man says, ‘We are not cool,’ how many people are visible?”
    • Multi-speaker coordination: “After the man in the gray shirt wiggles his fingers, how many times is ‘red line’ mentioned by all speakers?”
  4. Construct strong distractors
  • What happens: Wrong answers are taken from real people/actions/phrases in the same clip, or realistic recombinations.
  • Why it matters: Eliminates giveaway options; shallow cues won’t work.
  • Example: If the correct speaker is “man in black suit,” other options include “man in gray sweater,” “woman in red dress,” “man with glasses”—all actually present.
  5. Multi-stage expert review and refinement
  • What happens: Each question goes through (i) cross-review by another researcher, (ii) language polish by an LLM, (iii) final verification by two more researchers.
  • Why it matters: Catches ambiguity, timing errors, or accidental shortcuts (like burned-in captions revealing the line).
  • Example: Remove a question if captions show the exact quote, making audio unnecessary.
  6. Time anchoring and labeling
  • What happens: Annotators record precise start/end times and anchor moments.
  • Why it matters: Enforces temporal grounding; answers must be judged inside the exact window.
  • Example: “From 0:05 until 0:11” is the only counted span for speech counting.
  7. Unified evaluation prompt and strict answer format
  • What happens: Models get the same instruction and must reply with only A/B/C/D.
  • Why it matters: Keeps results fair and comparable across models.
  ‱ Example: “Select the best answer 
 Respond with only the letter (A, B, C, or D).” Any other output is counted as wrong (a minimal grading sketch follows this list).
  8. Model coverage and fair inputs
  • What happens: Evaluate proprietary (Gemini family) and open-source A+V models (e.g., Qwen3/Qwen2.5-Omni, VITA, Unified-IO 2, Video-LLaMA, PandaGPT, etc.). Use each model’s default frame sampling and pass the full audio.
  • Why it matters: Respects model design; avoids hidden advantages.
  • Example: 1 fps for some models, 5–8 frames for others, with full audio track.
  9. Task taxonomy (12 types) spanning who/what/when/how
  • Speaker-centric: detection, recognition, counting.
  • Speaker–visual: activity recognition, visual counting, attribute recognition.
  • Speech-centric: speech recognition, speech counting, duration, pitch, rate, intensity.
  • Why it matters: Covers recognition, alignment, comparison, and temporal reasoning—comprehensively.
  • Example: “Among those who speak, who has the lowest rate of speech?” (rate) and “Who speaks right after the fist bump?” (temporal + attribution).
  10. Quality filters to remove trivial cases
  • What happens: Exclude segments where answers don’t require moment-specific reasoning (e.g., only one person visible the whole time, or always-on subtitles).
  • Why it matters: Preserves the challenge of true audiovisual fusion.
  • Example: Remove: “How many people are visible when he says X?” if the scene never changes and always shows just one person.
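
Here is the minimal grading sketch promised in step 7: the strict letter-only format makes scoring a one-liner. The exact rule the authors use may differ; in this sketch anything that is not exactly one letter from A–D (after trimming whitespace) counts as wrong.

```python
# Strict letter-only grading: only a bare A/B/C/D that matches the key scores.
def grade(model_reply: str, correct: str) -> bool:
    reply = model_reply.strip().upper()
    return reply in {"A", "B", "C", "D"} and reply == correct

print(grade("B", "B"))                    # True
print(grade("The answer is B.", "B"))     # False: extra text counts as wrong
```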

Mini Sandwiches for two core mechanisms:

🍞 Hook: You pause a video at the exact clap to count hands raised. đŸ„Ź Concept: Temporal Localization

  • What it is: Finding the exact frame/time slice a question talks about.
  • How it works: Use events/phrases as pointers; zoom into that instant.
  • Why it matters: If you pick the wrong instant, counts and attributions go wrong. 🍞 Anchor: “At the moment he says ‘Great to meet you,’ how many people are visible?”
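
A minimal sketch of temporal localization, assuming frames are sampled once per second; the sampling rate and the per-frame counts below are invented for illustration.

```python
# Map a spoken moment to the nearest sampled frame, then read off what is visible.
fps_sampled = 1.0                                   # assumed: one frame per second
visible_people_per_frame = [1, 1, 3, 3, 4, 4, 2]    # invented per-frame counts

def people_visible_at(t_seconds):
    idx = min(int(round(t_seconds * fps_sampled)), len(visible_people_per_frame) - 1)
    return visible_people_per_frame[idx]

print(people_visible_at(4.3))   # people visible when he says "Great to meet you" -> 4
```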

🍞 Hook: Hearing a line is easy; proving who said it is harder. đŸ„Ź Concept: Speech–Speaker Attribution

  • What it is: Matching a spoken string to the right face.
  • How it works: Align timing (mouth motion ↔ audio), voice traits (timbre, pitch), and visibility.
  • Why it matters: Avoids crediting the wrong person for the line. 🍞 Anchor: “Who says, ‘That would be so much fun’?” among four similar-looking people.
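
One common attribution cue can be sketched in a few lines: credit the line to the visible person whose mouth-motion track best matches the audio's activity while the line is spoken. The tracks below are invented, and real systems rely on learned audiovisual features rather than a single correlation.

```python
# Score each candidate face by correlating mouth openness with audio activity.
def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

speech_activity = [0.1, 0.8, 0.9, 0.7, 0.1]             # audio energy while the line is spoken
mouth_openness = {
    "man in black suit":  [0.1, 0.7, 0.8, 0.6, 0.2],    # moves with the audio
    "woman in red dress": [0.5, 0.4, 0.5, 0.4, 0.5],    # mouth barely changes
}

best = max(mouth_openness, key=lambda p: correlation(mouth_openness[p], speech_activity))
print(best)   # -> man in black suit
```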

Secret Sauce (why this method is clever):

  • The questions themselves are booby-trapped against shortcuts: every plausible wrong option tempts a model that relies on just vision or just audio; only real fusion plus timing wins.
  • Expert curation and time stamps keep the challenge sharp and fair.
  • A rich mix of tasks mirrors real conversations—counting turns, matching quotes, and judging how people talk (fast/slow, loud/soft, high/low).

04Experiments & Results

The Test: What was measured and why

  • Metric: Multiple-choice accuracy—did the model pick the right letter?
  • Why: It’s clear, fair, and comparable across tasks and models.
  • Scope: 3,212 MCQs across 12 task types, all centered on who spoke, what was said, and when.
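
Scoring is simple enough to sketch: overall accuracy is the fraction of items where the chosen letter matches the key, with a per-task breakdown showing where a model struggles. The records below are invented.

```python
# Overall and per-task multiple-choice accuracy from (task, prediction, answer) records.
from collections import defaultdict

records = [
    {"task": "speaker detection", "pred": "B", "answer": "B"},
    {"task": "speech counting",   "pred": "A", "answer": "C"},
    {"task": "pitch",             "pred": "D", "answer": "D"},
]

overall = sum(r["pred"] == r["answer"] for r in records) / len(records)

by_task = defaultdict(list)
for r in records:
    by_task[r["task"]].append(r["pred"] == r["answer"])
per_task = {task: sum(hits) / len(hits) for task, hits in by_task.items()}

print(f"overall accuracy: {overall:.2%}")
print(per_task)
```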

The Competition: Who was compared

  • Proprietary: Gemini family (2.0 Flash, 2.5 Flash, 2.5 Pro), later Gemini 3 Pro in the supplement.
  • Open-source A+V: Qwen2.5-Omni (3B/7B), Qwen3-Omni (30B), VITA/VITA-1.5, Unified-IO 2, Video-LLaMA/2, PandaGPT, OneLLM, Phi-4 Multimodal, AnyGPT, OLA.

The Scoreboard (with context)

  • Human performance: 93.74% (like scoring an A+).
  • Gemini 2.5 Pro (thinking): 73.04% (a strong B), best across almost all tasks in the main study.
  • Gemini 2.5 Flash (thinking): 67.84% (solid but behind Pro); non-thinking Flash: 60.27%.
  • Gemini 2.0 Flash: 53.21%.
  • Qwen3-Omni-30B: 54.14%—roughly on par with or a bit above Gemini 2.0 Flash, and the strongest open-source model tested, but far behind Gemini 2.5 Pro.
  • Many older/open video models hovered near chance on tougher tasks, indicating that true speaker-centric fusion remains hard.
  • Supplement (released after main paper): Gemini 3 Pro (thinking) reaches 77.62%, +4.6 points over 2.5 Pro, but still below humans.

Deeper Findings

  1. Audio really helps—for some models
  • Modality ablation shows Gemini 2.5 Pro gains ~10–20 percentage points when audio is added across most tasks (e.g., large boosts in speech counting and intensity).
  • Qwen3-Omni-30B shows much smaller gains; sometimes audio even hurts—evidence of unstable fusion.
  • Takeaway: The big gap is not just vision quality; it’s fusion quality.
  2. Where models stumble the most
  • Error analysis buckets:
    • Audio perception: 31.7% (mishearing words, missing overlapping speech).
    • Temporal grounding: 25.0% (mixing up “before”/“after” windows).
    • Temporal localization: 16.7% (locking onto the wrong moment).
    • Visual perception: 13.3% (mis-seeing who or what).
    • Cross-modal attribution: 13.3% (matching the line to the wrong face).
  • Takeaway: Hearing clearly and getting time right are the main blockers.
  3. Vision-only can sometimes work (and that’s okay)
  • Strong models sometimes answer correctly from video alone using mouth motion and gestures as clues.
  • But the same cues can mislead (e.g., assuming slower gestures mean slower speech). Adding audio resolves ambiguity.
  • The benchmark allows this: when audio isn’t strictly needed but would normally help, correct vision-only answers still count—just like humans.
  4. More people, more trouble
  • Accuracy drops as the number of visible people increases for all tested models.
  • Crowded scenes make attribution and counting harder, especially under time constraints and overlapping speech.

Surprising/Notable Highlights

  • Qwen3-Omni-30B’s overall strength among open-source models shows real progress but also reveals how hard robust fusion is.
  • Gemini 2.5 Pro’s consistent audio gains suggest strong temporal alignment mechanisms.
  • The supplement’s Gemini 3 Pro jump (+4.6) signals that better fusion and timing are active areas of improvement, yet a sizable human–AI gap remains (~16 points).

05Discussion & Limitations

Limitations (be specific)

  • Occasional visual-only solvability: Some questions can be answered from mouth motion or gestures without audio; the benchmark tolerates this because humans do it too, but it can blur how much audio helped.
  • Domain scope: Clips are short (5–30s) and conversation-heavy; very long, noisy, or specialized domains (e.g., call centers with crosstalk) aren’t directly covered.
  • Speech attribute ambiguity: Pitch/loudness/rate can be affected by mic distance or background noise; though annotators try to avoid this, some edge cases remain tricky.
  • Resource needs: Evaluating big omni-models takes time and compute; some “thinking” modes are slow or costly, limiting broad replication.

Required Resources

  • Data: Access to the benchmark (3,212 MCQs, time stamps, and video references under CC BY-NC-SA 4.0).
  • Models: A+V-capable MLLMs that accept frames + full audio.
  • Compute: Enough GPU/CPU to run inference at each model’s default frame sampling and audio pipeline.
  • Evaluation code: A strict parser expecting single-letter answers (A/B/C/D).

When NOT to Use

  • Training datasets: AV-SpeakerBench is for evaluation only; do not train on it.
  • Non-speech tasks: If your use case is about ambient sounds (sirens, music genres) without speakers, other datasets fit better.
  • Biometric identification or surveillance: Disallowed by the license and out of scope ethically and technically.

Open Questions

  • Fusion mechanisms: What architectures most reliably align voices to faces over time, especially with overlapping speech?
  • Robust timing: How can models better lock onto exact before/after/when windows in the presence of edits and quick cuts?
  • Prosody understanding: Can models more accurately compare pitch/rate/intensity despite changing microphones and room acoustics?
  • Scale vs. technique: How much of the remaining gap is closed by bigger models vs. smarter cross-modal training and objectives?
  • Generalization: How well will speaker-centric fusion methods transfer to long-form meetings, multilingual speech, or noisy phones?

06Conclusion & Future Work

Three-sentence summary

  • AV-SpeakerBench is a new benchmark that tests whether multimodal AI can truly see, hear, and understand conversations by aligning who speaks, what is said, and when it happens.
  • By making the speaker the center of each question and baking audio+video dependencies into the options, it prevents easy shortcuts and fairly measures fine-grained audiovisual reasoning.
  • Results show a large human–AI gap: strong proprietary models (Gemini 2.5 Pro; later Gemini 3 Pro) lead but still lag behind humans, and open-source models especially need better fusion.

Main Achievement

  • A rigorously curated, expert-reviewed, speaker-centric audiovisual benchmark (3,212 MCQs, 12 task types) that decisively measures cross-modal fusion and temporal grounding.

Future Directions

  • Build and evaluate fusion modules that more robustly align voices and faces under overlap and noise.
  • Expand to longer conversations, multilingual settings, and diverse recording conditions.
  • Create training objectives and synthetic curricula targeted at attribution, timing, and prosody comparisons.

Why Remember This

  • It reframes video understanding around people and their speech—the core of real conversations—and sets a high, clear standard for models that must truly see, hear, and understand together.

Practical Applications

  ‱ Meeting assistants that attribute quotes to the correct person and summarize who said what and when.
  ‱ Video conferencing tools that produce accurate, speaker-labeled transcripts and action-item extraction.
  ‱ Lecture and podcast indexing that links exact phrases to the right speaker and timestamp for fast retrieval.
  ‱ Customer support call review that identifies who spoke key phrases (e.g., cancellations, commitments) and in what order.
  ‱ Courtroom or council session analysis that anchors statements to speakers with precise timing for auditing.
  ‱ Media production tools that auto-generate captions with correct speaker tags and align them to on-screen faces.
  ‱ Interview research tools that count turns, compare speaking rates, and highlight important quotes per speaker.
  ‱ Interactive tutoring systems that explain scenes by tying each line to the correct character and moment.
  ‱ Broadcast compliance checking that flags loudness spikes or sensitive phrases with accurate speaker attribution.
  ‱ Accessibility enhancements that let viewers filter by speaker or jump to moments when specific phrases are said.
#audiovisual reasoning · #speaker attribution · #temporal grounding · #multimodal fusion · #MLLM · #benchmark · #AV-SpeakerBench · #Gemini 2.5 Pro · #Qwen3-Omni · #video question answering · #speech attributes · #cross-modal integration · #temporal localization · #speaker-centric evaluation · #human upper bound
Version: 1