
VABench: A Comprehensive Benchmark for Audio-Video Generation

Beginner
Daili Hua, Xizhi Wang, Bohan Zeng et al. · 12/10/2025
arXiv · PDF

Key Summary

  • VABench is a new, all-in-one test that checks how well AI makes videos with matching sound and pictures.
  • It covers three tasks: turning text into audio-video, turning a single image into audio-video, and making realistic stereo (left-right) audio.
  • The benchmark grades models across 15 dimensions, including lip-sync, timing, realism, and how well audio, video, and text agree with each other.
  • It uses both expert audio-visual tools and multimodal language models to score results, plus detailed Q&A checks for each sample.
  • Seven content categories test real-life and creative cases, like animals, music, human voices, physical events, complex scenes, environments, and virtual worlds.
  • In tests, integrated end-to-end models (like Veo3, Sora2, and Wan2.5) usually beat stitched-together video+audio pipelines.
  • Veo3 performed best overall; Wan2.5 led lip-sync timing; Sora2 excelled in realism but trailed in audio aesthetics.
  • Human studies showed VABench’s scores align well with what people prefer and notice.
  • Stereo audio remains hard: models rarely place sounds to the correct left or right in a consistent, meaningful way.
  • VABench sets a clearer, fairer scoreboard so researchers can improve audio-video generation where it matters most.

Why This Research Matters

Better audio-video benchmarking means videos that look and sound right, at the right time, every time. For classrooms, clearer lips and voices help students understand and trust what they watch. For filmmakers and creators, strong sync and stereo make stories more immersive and emotional. For safety and training, physics-aware sound (like alarms or impacts) prevents confusion. For accessibility, natural speech and timing support captions and lip-reading. Overall, VABench pushes AI to make videos that feel truly real and respectful of how humans see, hear, and feel scenes.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine watching a cartoon where the character claps, but the sound happens a second later. You would notice right away that something feels off.

🥬 The Concept: Audio-Video Synchronization is making sure sounds happen at the same time as what you see. How it works: 1) detect key visual moments (like a clap), 2) locate matching audio events (like a clap sound), 3) line up their timings, 4) check they stay aligned over time. Why it matters: if timing drifts, speech looks wrong, actions feel fake, and viewers lose trust.

🍞 Anchor: When a person says hello, their lip movement must match the hello sound right then, not two frames later.
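To make step 3 (lining up timings) concrete, here is a minimal sketch that slides a per-frame motion-energy curve against an audio onset envelope and reports the best-matching offset. The signals, frame rate, and search window are illustrative assumptions; the benchmark's own Desync metric relies on Synchformer rather than this toy cross-correlation.

```python
import numpy as np

def estimate_av_offset(motion_energy, audio_onsets, fps=25, max_shift_s=1.0):
    """Estimate a global audio-video offset in seconds by sliding a per-frame
    motion-energy curve against an audio onset envelope sampled at the same
    frame rate. A positive result means the audio lags the video."""
    # Normalize both curves so the comparison depends on shape, not scale.
    m = (motion_energy - motion_energy.mean()) / (motion_energy.std() + 1e-8)
    a = (audio_onsets - audio_onsets.mean()) / (audio_onsets.std() + 1e-8)

    max_shift = int(max_shift_s * fps)
    best_d, best_score = 0, -np.inf
    for d in range(-max_shift, max_shift + 1):
        if d >= 0:
            score = np.dot(m[:len(m) - d], a[d:])   # compares m[t] with a[t + d]
        else:
            score = np.dot(m[-d:], a[:len(a) + d])  # compares m[t] with a[t + d], t >= -d
        if score > best_score:
            best_d, best_score = d, score
    return best_d / fps

# Toy check: a clap seen at frame 40 but heard at frame 43 -> audio lags by ~0.12 s.
video_events = np.zeros(100); video_events[40] = 1.0
audio_events = np.zeros(100); audio_events[43] = 1.0
print(round(estimate_av_offset(video_events, audio_events), 2))  # 0.12
```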

🍞 Hook: You know how flipping a flipbook makes the drawing move smoothly? If one page jumps, the movement looks jerky.

🥬 The Concept: Temporal Consistency means motion and appearance stay smooth and believable across frames. How it works: 1) compare neighboring frames, 2) check objects keep shape, color, and position, 3) make sure motion speeds change gradually, 4) flag sudden, impossible jumps. Why it matters: without it, videos feel glitchy, hurting realism.

🍞 Anchor: A running dog should keep the same fur color and size from frame to frame, with paws moving smoothly.
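A minimal sketch of the frame-to-frame check described above: it measures the average pixel change between neighboring frames and flags sudden jumps. The pixel-difference proxy and the threshold are illustrative assumptions, not a metric from the paper.

```python
import numpy as np

def temporal_consistency(frames, jump_threshold=0.25):
    """Score frame-to-frame smoothness for a video given as a (T, H, W, C)
    float array in [0, 1]. Returns a score in [0, 1] and the indices of
    frames where the content jumps suspiciously."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    jumps = np.where(diffs > jump_threshold)[0] + 1  # frame index right after each jump
    score = float(np.clip(1.0 - diffs.mean() / jump_threshold, 0.0, 1.0))
    return score, jumps.tolist()

# Toy usage: a smooth fade with one abrupt brightness cut at frame 8.
frames = np.linspace(0, 0.2, 16)[:, None, None, None] * np.ones((16, 8, 8, 3))
frames[8:] += 0.6  # sudden, impossible jump
score, jumps = temporal_consistency(frames)
print(score, jumps)  # score is penalized by the cut; the jump is reported at frame 8
```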

🍞 Hook: Close your eyes at a concert—you can still tell where the drums are. Sound can feel left, right, or far away.

🥬 The Concept: Stereophonic Audio Generation makes two-channel sound that feels like it comes from different directions. How it works: 1) create left and right channel signals, 2) adjust timing (ITD) and loudness (ILD) between them, 3) manage phase and width to form a stable soundstage, 4) keep signals clean so they play well on many devices. Why it matters: without stereo, everything sounds stuck in the middle and flat.

🍞 Anchor: If waves crash on the left of the screen, your left ear should hear them louder or earlier than your right ear.
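Here is a toy sketch of how ITD and ILD place a sound: it delays and attenuates one channel of a mono signal depending on a target direction. The delay and loudness constants below are rough rules of thumb for illustration, not an HRTF model and not the paper's generator.

```python
import numpy as np

SAMPLE_RATE = 48_000  # the benchmark stores 48 kHz stereo; the value is reused here

def place_source(mono, azimuth_deg, sr=SAMPLE_RATE):
    """Toy stereo renderer: pan a mono signal with an interaural time
    difference (ITD, arrival-time gap) and an interaural level difference
    (ILD, loudness gap). Constants are rough rules of thumb, not physics."""
    az = np.deg2rad(np.clip(azimuth_deg, -90, 90))   # -90 = hard left, +90 = hard right
    itd_s = 0.0007 * abs(np.sin(az))                 # up to ~0.7 ms earlier at the near ear
    ild_db = 6.0 * abs(np.sin(az))                   # up to ~6 dB louder at the near ear

    delay = int(round(itd_s * sr))
    near = mono * 10 ** (ild_db / 40)                                         # half the dB gap up
    far = np.concatenate([np.zeros(delay), mono])[:len(mono)] * 10 ** (-ild_db / 40)  # half down, delayed

    left, right = (near, far) if azimuth_deg < 0 else (far, near)
    return np.stack([left, right], axis=0)           # shape (2, num_samples)

# Waves placed to the left: the left channel is louder and arrives first.
waves = np.random.default_rng(0).normal(size=SAMPLE_RATE)  # 1 s of noise as a stand-in
stereo = place_source(waves, azimuth_deg=-60)
print(stereo.shape)  # (2, 48000)
```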

🍞 Hook: Turning a written story into a mini-movie with pictures and sound feels like magic.

🥬 The Concept: Text-to-Audio-Video Generation (T2AV) creates both visuals and audio from text prompts. How it works: 1) read the prompt, 2) plan visuals and sounds, 3) generate video frames, 4) generate matching audio, 5) sync them. Why it matters: without T2AV tests, we can’t tell if models followed the story well.

🍞 Anchor: Given the text “A kitten knocks a cup,” we should see a kitten tip a cup and hear the clink exactly when it hits.

🍞 Hook: Imagine a still photo that comes alive—moving, sounding, and telling its own mini-story.

🥬 The Concept: Image-to-Audio-Video Generation (I2AV) turns one image into a moving, sounding scene. How it works: 1) analyze what’s in the image, 2) guess possible motions and sounds, 3) animate, 4) produce fitting audio, 5) keep them synced. Why it matters: it checks if models can reason from a single snapshot to a believable moment.

🍞 Anchor: A beach photo should become gentle waves rolling with soft crashes and distant gulls.

🍞 Hook: School report cards don’t just say good or bad—they grade math, reading, and science separately.

🥬 The Concept: A Benchmark is a standard test that fairly compares different models. How it works: 1) collect test cases, 2) define scoring rules, 3) run all models the same way, 4) publish scores. Why it matters: without a benchmark, improvements are just guesses.

🍞 Anchor: VABench is the report card for audio-video generation, with 15 subjects.
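As a minimal sketch of what a benchmark harness does, the loop below runs every model on the same test cases and scores each output with the same metric functions. The function names and data layout are hypothetical; VABench's actual pipeline is described in the Methodology section.

```python
from statistics import mean

def run_benchmark(models, test_cases, metrics):
    """Minimal benchmark harness: every model sees the same test cases and is
    scored by the same metric functions, so results stay comparable.
    `models` maps name -> generate(prompt) callable; `metrics` maps
    dimension name -> score(sample, case) callable. All names are placeholders."""
    scoreboard = {}
    for model_name, generate in models.items():
        per_dimension = {dim: [] for dim in metrics}
        for case in test_cases:
            sample = generate(case["prompt"])          # the generated audio-video output
            for dim, score_fn in metrics.items():
                per_dimension[dim].append(score_fn(sample, case))
        scoreboard[model_name] = {dim: mean(vals) for dim, vals in per_dimension.items()}
    return scoreboard

# Toy usage with stub models and a single stub metric.
models = {"stub_a": lambda p: {"prompt": p, "quality": 0.7},
          "stub_b": lambda p: {"prompt": p, "quality": 0.9}}
metrics = {"audio_quality": lambda sample, case: sample["quality"]}
cases = [{"prompt": "a kitten knocks a cup"}, {"prompt": "waves on a beach"}]
print(run_benchmark(models, cases, metrics))
```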

🍞 Hook: When you watch a movie, your brain checks if sounds fit what you see and what the story says.

🥬 The Concept: Cross-Modal Semantic Alignment means the text, video, and audio all tell the same story. How it works: 1) read the text, 2) analyze video content, 3) analyze audio content, 4) compare meanings, 5) score agreement. Why it matters: without it, you might see rain but hear a lawnmower.

🍞 Anchor: Prompt says electric guitar; the video should show a guitar and the audio should sound like one—not a piano.
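A minimal sketch of the agreement check: embed each modality into a shared space and compare with cosine similarity. The embeddings below are random stand-ins; the benchmark obtains real embeddings from ViCLIP, CLAP, and ImageBind, as described later.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def semantic_alignment(text_emb, frame_embs, audio_emb):
    """Toy cross-modal agreement: assume text, video frames, and audio have
    already been embedded into a shared space as NumPy vectors of the same
    dimensionality, then average pairwise cosine similarities."""
    video_emb = np.mean(frame_embs, axis=0)            # pool per-frame embeddings
    return {
        "text_video": cosine(text_emb, video_emb),
        "text_audio": cosine(text_emb, audio_emb),
        "audio_visual": cosine(audio_emb, video_emb),
    }

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
scores = semantic_alignment(rng.normal(size=512), rng.normal(size=(16, 512)), rng.normal(size=512))
print(scores)
```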

🍞 Hook: Think of a very smart helper that can read, watch, and listen, then explain what it notices.

🥬 The Concept: Multimodal Large Language Models (MLLMs) judge complicated audio-visual results like a careful human would. How it works: 1) take the video and audio, 2) answer targeted questions, 3) give 1–5 scores on alignment, realism, and feeling, 4) keep judgment consistent with rules. Why it matters: without MLLMs, fully human scoring is slow and costly, and single-number metrics miss the big picture.

🍞 Anchor: An MLLM can check if a lightning flash appears before thunder and score how realistic that feels.
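A sketch of MLLM-as-judge scoring under an assumed API: the rubric pins the 1-5 scale and asks for timestamped evidence, and the judge's JSON reply is parsed and range-checked. `mllm_call` is a placeholder for whatever multimodal model interface you have, not a specific library, and the rubric text is illustrative rather than the paper's exact prompt.

```python
import json

RUBRIC = """You are grading a generated audio-video clip.
Score each dimension from 1 (poor) to 5 (excellent):
- alignment: do audio, video, and the prompt tell the same story?
- audio_realism: do sounds obey physics (e.g., thunder after lightning)?
- visual_realism: do objects and motion stay plausible?
Reply with JSON only: {"alignment": n, "audio_realism": n, "visual_realism": n,
"evidence": "timestamped reasons"}"""

def judge_clip(mllm_call, prompt_text, video_path):
    """Ask a multimodal LLM to act as a rubric-bound judge.
    `mllm_call(system, user, video_path)` is a placeholder that must return
    the model's text reply; swap in your own client here."""
    user = f"Prompt the clip was generated from: {prompt_text}"
    reply = mllm_call(RUBRIC, user, video_path)
    scores = json.loads(reply)                     # expect the JSON schema above
    assert all(1 <= scores[k] <= 5 for k in ("alignment", "audio_realism", "visual_realism"))
    return scores
```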

Before VABench, most tests focused only on how good the video looked. They missed whether sounds lined up, whether lips matched speech, whether physics-based effects (like the Doppler effect when a plane zooms past) felt right, and whether stereo made sense. Early attempts either needed real ground-truth videos (not possible for from-scratch text prompts) or relied on lots of manual grading, which doesn’t scale. The field needed a fair, complete, and automated way to check not just pretty pictures but the whole audio-video story working together. That missing piece is exactly what VABench provides: a multi-task, multi-metric, stereo-aware benchmark covering seven real and imaginative categories, from animals and music to complex scenes and virtual worlds. It matters for daily life because better syncing and realism make educational clips clearer, movies more immersive, safety messages more trustworthy, and creative tools more fun and useful.

02Core Idea

🍞 Hook: Imagine a science fair where every project is judged not just for looks, but also for how it sounds, how well it follows instructions, and whether the story makes sense.

🥬 The Concept: The Aha! is to grade audio and video together—systematically, across many angles—so models can’t hide weak sound behind pretty pictures. How it works: 1) design tasks (T2AV, I2AV, stereo), 2) build diverse test prompts with humans and LLMs, 3) score with expert models for precision and MLLMs for human-like sensemaking, 4) cover 15 dimensions including sync, semantics, realism, and Q&A checks. Why it matters: without a rich, unified benchmark, we can’t fairly compare models or know where to improve.

🍞 Anchor: Think of VABench as a team of referees: a timing ref (sync), a translator ref (text-to-audio/video meaning), a realism ref (physics), and a music ref (audio aesthetics) all judging the same play.

Three analogies for the idea:

  • Report card: Instead of giving just one grade, VABench gives many grades—lip-sync, timing, realism, and more—so strengths and weaknesses show clearly.
  • Orchestra test: It’s not enough if the violins (video) play well; the percussion (audio) must also be on-beat, and the conductor’s notes (text) must match the performance.
  • Triangle check: Text, video, and audio must form a triangle of agreement; if any side is weak, the triangle wobbles.

Before vs. After:

  • Before: Models were compared mostly on visuals; audio realism, sync, and emotional fit were under-checked.
  • After: VABench gives a holistic scoreboard, revealing tradeoffs—for example, one model may be super realistic but weaker on audio aesthetics; another may nail lip-sync but slip on semantics.

Why it works (intuition, no equations):

  • Expert models act like magnifying glasses for specific details (e.g., timing offsets, lip-speech alignment, text-audio match), catching tiny errors that humans may miss.
  • MLLMs act like knowledgeable judges, scoring big-picture qualities (alignment, artistry, expressiveness, realism) and answering fine-grained Q&A.
  • Covering seven content categories ensures models face real physics (thunder after lightning), human nuance (tone of voice), and style (music genres)—not just toy cases.
  • Stereo tests force attention to spatial hearing, not merely mixing everything into the center.

Building blocks:

  • Tasks: T2AV and I2AV test planning from words or a single image; stereo tests directional hearing.
  • Data curation: Experts plan categories; LLMs draft prompts and Q&A; humans verify correctness and fairness.
  • Metrics: Uni-modal audio quality, cross-modal alignment (text-video, text-audio, audio-visual), synchronization (lip-sync, desync), and MLLM macro/micro scoring.
  • Stereo analysis: Width, timing and loudness balance (ITD/ILD stability), phase coherence, and mono compatibility to ensure the soundstage is wide yet stable.

🍞 Anchor: With VABench, if a prompt says a can opener turns and metal scrapes, the video should show that action, the audio should sound metallic and on-beat with the turning, and the model gets rewarded only if all three agree.

03Methodology

At a high level: Input prompts or images → models generate audio-video → expert metrics measure timing, quality, and alignment → MLLMs score big-picture realism and answer Q&A → results aggregate into 15 dimensions.

  1. Tasks and Data 🍞 Hook: Think of three sports—story-to-movie (T2AV), photo-to-mini-movie (I2AV), and a stereo listening test. 🥬 The Concept: VABench evaluates T2AV, I2AV, and stereo prompts across seven categories. How it works: 1) experts define category coverage, 2) LLMs craft prompts and Q&A, 3) humans verify physics, semantics, and safety, 4) data balanced across animals, human sounds, music, environments, physical events, complex scenes, and virtual worlds. Why it matters: broad, clean data makes scores meaningful. 🍞 Anchor: A beach image in I2AV must lead to gentle wave motion plus soft ocean audio, not subway noises.

  2. Expert Model-based Metrics

  • Uni-modal audio quality 🍞 Hook: Like checking if a radio station sounds clear before judging the song. 🥬 The Concept: SpeechClarity (DNSMOS) and Speech Quality & Naturalness (NISQA) rate the speech subsets; AudioAesthetic (Audiobox) summarizes content enjoyment (CE), content usefulness (CU), and production quality (PQ) against production complexity (PC). Steps: 1) feed audio, 2) predict clarity/quality, 3) combine the aesthetic axes as CE+CU+PQ−PC, 4) normalize (a small sketch of this aggregation appears after this group of metrics). Why it matters: noisy or fake-sounding speech ruins believability. 🍞 Anchor: If a vlogger talks off-screen, the speech should be clear without hiss or robotic tone.

  • Cross-modal semantic alignment 🍞 Hook: Reading a caption while watching a scene—you notice if it matches. 🥬 The Concept: Text-Video Align (ViCLIP), Text-Audio Align (CLAP), Audio-Visual Align (ImageBind) score meaning agreement. Steps: 1) embed each modality, 2) measure similarity, 3) average across frames or segments, 4) score higher for closer matches. Why it matters: mismatch (dog image but piano sound) confuses viewers. 🍞 Anchor: Prompt says classical guitar arpeggio; the video should show fingerpicking and the audio should sound like nylon strings.

  • Temporal synchronization 🍞 Hook: You know when a cartoon punch lands because the thwack hits exactly at impact. 🥬 The Concept: Desync (Synchformer) estimates audio-video time offsets; Lip-Sync confidence checks voice and lips for talking heads. Steps: 1) slice head and tail segments, 2) estimate offset, 3) compute lip-movement to speech alignment, 4) combine for timing health. Why it matters: even a few frames off breaks immersion. 🍞 Anchor: When a can opens, we must hear the metal scrape exactly as hands twist the opener.
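As noted in the audio-quality item above, here is a minimal sketch of the AudioAesthetic aggregation: the four Audiobox-style axes combine as CE + CU + PQ − PC and the result is squashed into a 0-1 range. The per-axis 1-10 scale and the min-max normalization are assumptions for illustration, not the benchmark's published constants.

```python
def audio_aesthetic(ce, cu, pq, pc, axis_max=10.0):
    """Toy aesthetic aggregation: enjoyment (CE), usefulness (CU), and
    production quality (PQ) add to the score while production complexity (PC)
    subtracts, then the raw value is mapped to [0, 1]. The 1-10 per-axis scale
    is an assumption."""
    raw = ce + cu + pq - pc                            # CE + CU + PQ - PC
    lo, hi = 3 * 1.0 - axis_max, 3 * axis_max - 1.0    # theoretical min/max of `raw`
    return (raw - lo) / (hi - lo)

print(round(audio_aesthetic(7.5, 7.0, 8.0, 3.0), 3))   # clear, well-produced clip -> high score
```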

  3. MLLM-based Evaluation
  • Macro scoring 🍞 Hook: A film critic judges storytelling, not just pixels and decibels. 🥬 The Concept: An MLLM scores Alignment, Artistry, Expressiveness, Audio Realism, and Visual Realism on a 1–5 rubric. Steps: 1) show the video, 2) ask for dimension-specific judgments with rules, 3) require timestamped reasons, 4) record final scores. Why it matters: it captures human-like judgments of mood, narrative, and plausibility. 🍞 Anchor: The MLLM can reward tense strings that swell right as a character reaches for a door.

  • Micro scoring (QA pairs) 🍞 Hook: A treasure hunt with detailed clues proves you really explored the scene. 🥬 The Concept: For each sample, 3–7 audio Qs and 3–7 visual Qs test fine details. Steps: 1) generate targeted Q&A from the prompt, 2) MLLM checks if the video satisfies each, 3) compute accuracy per sample, 4) average across the set. Why it matters: passing detail checks means the model didn’t just vibe—it followed instructions precisely. 🍞 Anchor: Question: Are gull cries audible above the wind at the shoreline? The model only passes if the audio clearly includes gulls.
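A minimal sketch of the micro Q&A scoring loop: each clip's audio and visual questions are answered by an MLLM (here a placeholder callable), per-clip accuracy is computed, and accuracies are averaged over the set. The data layout and the `answer_with_mllm` callable are hypothetical.

```python
def micro_qa_accuracy(samples, answer_with_mllm):
    """Fine-grained Q&A scoring sketch. Each sample carries its audio and
    visual yes/no questions; `answer_with_mllm(video_path, question)` is a
    placeholder returning True when the clip satisfies the question."""
    per_sample = []
    for sample in samples:
        questions = sample["audio_questions"] + sample["visual_questions"]
        passed = sum(answer_with_mllm(sample["video_path"], q) for q in questions)
        per_sample.append(passed / len(questions))   # accuracy for this clip
    return sum(per_sample) / len(per_sample)         # benchmark-level average

# Toy usage: one clip, five questions, a fake judge that passes four of them.
clip = {"video_path": "beach.mp4",
        "audio_questions": ["Are gull cries audible?", "Do waves crash softly?"],
        "visual_questions": ["Is the shoreline visible?", "Are gulls on screen?", "Is it daytime?"]}
fake_judge = lambda path, q: q != "Are gulls on screen?"
print(micro_qa_accuracy([clip], fake_judge))  # 0.8
```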

  4. Stereo Analysis 🍞 Hook: Think of balancing on a beam—too wide and you wobble; too narrow and you feel stuck. 🥬 The Concept: Stereo tests measure Spatial Imaging Quality and Signal Integrity & Compatibility. Steps: 1) compute width (mid/side energy), 2) check ITD/ILD stability for localization, 3) analyze envelope correlation and transient sync, 4) compute phase coherence across bands, 5) assess mono compatibility. Why it matters: a wide but wobbly stereo image feels fake; a narrow but clean image feels small. 🍞 Anchor: If waves should be left and seagulls right, stereo metrics reveal whether channels differ in the right way rather than being identical.
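A toy version of the stereo checks described above: width from mid/side energy, an ILD proxy per short frame (its standard deviation approximates localization stability), and channel correlation as a mono-compatibility proxy. The frame length and exact formulas are illustrative choices, not the benchmark's implementation.

```python
import numpy as np

def stereo_imaging_report(left, right, sr=48_000, frame_s=0.05):
    """Toy stereo-imaging check for a two-channel clip."""
    mid, side = (left + right) / 2, (left - right) / 2
    width = np.sum(side ** 2) / (np.sum(mid ** 2) + 1e-12)       # 0 = mono, larger = wider

    hop = int(frame_s * sr)
    ild_db = []
    for start in range(0, len(left) - hop, hop):
        l_e = np.sum(left[start:start + hop] ** 2) + 1e-12
        r_e = np.sum(right[start:start + hop] ** 2) + 1e-12
        ild_db.append(10 * np.log10(l_e / r_e))                  # positive = left is louder
    ild_db = np.array(ild_db)

    mono_corr = float(np.corrcoef(left, right)[0, 1])            # near -1 = phase-cancellation risk
    return {
        "width": float(width),
        "ild_mean_db": float(ild_db.mean()),
        "ild_stability_db": float(ild_db.std()),                 # small = stable localization
        "mono_compatibility": mono_corr,
    }

# Toy example with the source panned hard left: strong positive ILD, non-zero width.
t = np.linspace(0, 1, 48_000, endpoint=False)
left = np.sin(2 * np.pi * 220 * t)
right = 0.3 * np.sin(2 * np.pi * 220 * t)
print(stereo_imaging_report(left, right))
```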

  5. Aggregation and Reporting 🍞 Hook: After a decathlon, judges combine event scores into a final leaderboard. 🥬 The Concept: VABench aggregates expert and MLLM metrics into 15 dimensions and plots comparisons across tasks and categories. Steps: 1) run all metrics, 2) normalize where needed, 3) group by task and category, 4) compare models fairly. Why it matters: clear, multi-angle reporting shows tradeoffs and progress. 🍞 Anchor: A radar chart can show one model’s great stereo width but weaker mono compatibility, guiding improvements.
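A minimal sketch of the aggregation step: min-max normalize each dimension across models so different metric scales become comparable, then average into an overall ranking. Equal weighting of dimensions is an assumption here, not the benchmark's published weighting.

```python
import numpy as np

def normalize_and_rank(raw_scores):
    """Aggregation sketch. `raw_scores` maps model -> {dimension: value}."""
    models = list(raw_scores)
    dims = list(next(iter(raw_scores.values())))
    table = np.array([[raw_scores[m][d] for d in dims] for m in models], dtype=float)

    lo, hi = table.min(axis=0), table.max(axis=0)
    normed = (table - lo) / np.where(hi - lo > 0, hi - lo, 1.0)   # per-dimension min-max

    overall = normed.mean(axis=1)                                 # equal weights (assumption)
    order = np.argsort(-overall)
    return [(models[i], round(float(overall[i]), 3)) for i in order]

toy = {
    "model_a": {"lip_sync": 0.82, "text_audio_align": 0.61, "stereo_width": 0.40},
    "model_b": {"lip_sync": 0.74, "text_audio_align": 0.70, "stereo_width": 0.55},
}
print(normalize_and_rank(toy))
```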

04Experiments & Results

The Test: VABench evaluates 778 T2AV and 521 I2AV samples across seven categories plus dedicated stereo prompts (116 for left/right separation). It measures uni-modal audio quality, cross-modal alignment (text-video, text-audio, audio-visual), synchronization (lip-sync, desync), MLLM macro scores (alignment, artistry, expressiveness, audio/visual realism), micro QA accuracy, and stereo spatial metrics.

The Competition: Two families of systems were compared. End-to-end AV models: Veo3-fast, Sora2, Wan2.5 Preview. Decoupled V+A pipelines: video generators (Seedance-1.0-lite, Wan2.2-TI2V, Kling2.5 Turbo) paired with audio models (MMAudio, ThinkSound light).

Scoreboard (with context):

  • Overall, end-to-end AV models led. Veo3 was strongest overall, especially in audio quality and cross-modal semantic alignment (think: A-level report card with many top marks). Sora2 excelled in realism (both audio and visual) but trailed in audio aesthetics. Wan2.5 achieved the best synchronization—especially lip-sync—like a drummer with perfect timing.
  • In V+A pipelines, MMAudio generally outperformed ThinkSound for broad tasks, though ThinkSound shined in music. Among video backbones, Kling ranked highest in visual quality; Seedance helped unlock better lip-sync from paired audio.
  • I2AV narrowed gaps between models compared to T2AV because the input image grounded visual content; some V+A combos (e.g., Seedance+MMAudio) even beat AV models on alignment, though AV models remained stronger in T2AV and fine-grained QA overall.
  • Category analysis: Models did relatively well on Music and Animals but struggled with Human Sounds (speech nuance and lip-sync) and Complex Scenes (many sources interacting). Virtual Worlds was paradoxically easier for some models, likely because internal style consistency is more forgiving than strict real-world physics.
  • Stereo: A tradeoff emerged between width and fidelity. Wan2.5 had the cleanest, most consistent channels but the narrowest image; Sora2 had the widest field but relied heavily on inter-channel phase tricks, hurting stable localization; Veo3 balanced width and stability best. Crucially, no model reliably followed prompts like waves left, gulls right, showing semantic stereo remains a big challenge.

Surprising Findings:

  • End-to-end training seems to create a shared semantic space that boosts alignment and timing stability, leading to smaller performance gaps and more reliability.
  • Human Sounds (speech) remains hardest across the board—even the best V+A combo didn’t surpass the weakest end-to-end AV model in that category.
  • Physics-aware prompts (like Doppler effect or thunder after lightning) exposed clear differences: Veo3 often showed the cleanest Doppler arc; all models captured lightning-before-thunder ordering to some extent, but with room for more realistic dynamics.

Human Validation: A pilot user study with six expert raters found strong correlation between human preferences and VABench scores across semantics, synchronization, and realism. That means VABench’s automated scoring lines up well with what people actually like and trust.

05Discussion & Limitations

Limitations:

  • Stereo evaluation covers two-channel mixes; it does not yet assess advanced surround formats (e.g., 5.1, ambisonics) or head-tracked spatial audio.
  • MLLM-based scoring can inherit biases from training data; while prompts and rubrics reduce randomness, nuanced cultural or emotional cues may still be misjudged.
  • Clip durations and default frame rates come from each system’s standard settings; extremely long or variable-length videos are not deeply explored.
  • While seven categories are broad, some edge cases (medical alarms, rare instruments, cross-language speech) are limited.
  • The benchmark focuses on reference-free generation; tasks that require exact ground-truth matching use a different evaluation philosophy.

Required Resources:

  • Access to video generation APIs or local models, audio models for V+A setups, GPU resources for computing expert metrics and running MLLMs, and storage for 48 kHz stereo audio.
  • The provided evaluation code, prompts, and QA templates.

When NOT to Use:

  • If you must compare outputs to a known ground truth (e.g., exact dubbing fidelity), use reference-based metrics instead.
  • For surround or VR audio research where head movement and room acoustics are critical.
  • For ultra-long films where long-term story structure is the main focus.
  • For speech recognition or translation accuracy tasks; VABench rates generation quality, not transcription.

Open Questions:

  • How to measure and encourage semantic stereo so models place sources meaningfully left/right on command.
  • Better metrics for human vocal nuance (tone, prosody) and micro lip-sync under varied cameras and lighting.
  • Fairness and demographic balance: building tests and metrics to reveal and reduce appearance or cultural biases.
  • Robustness to noisy prompts and adversarial cases while keeping creativity.
  • Scaling to longer sequences and interactive editing while keeping timing rock solid.

06Conclusion & Future Work

Three-sentence summary: VABench is a comprehensive benchmark that fairly grades how well AI generates synchronized audio and video from text or images, including stereo sound. It combines precise expert tools with human-like MLLM scoring and Q&A checks across 15 dimensions and seven content categories. The results show clear tradeoffs among models and highlight that end-to-end training often delivers the best overall audio-video coherence.

Main achievement: VABench sets a new standard for evaluating audio-video generation by unifying tasks, metrics, and stereo analysis into one human-aligned, multi-dimensional scoreboard.

Future directions: Expand to longer clips and interactive editing, add richer spatial audio formats, deepen speech and emotional nuance metrics, broaden fairness checks, and grow the dataset with more cultures and edge cases. A public leaderboard and community submissions could accelerate progress.

Why remember this: VABench moves the field from judging pretty pictures to judging the whole experience—sound, sight, timing, meaning, and space—so future AI videos feel real, stay in sync, and tell stories that make sense.

Practical Applications

  • Choose the right model for your project by comparing VABench scores in the categories you care about (e.g., lip-sync for interviews, stereo for nature scenes).
  • Diagnose weaknesses in a generation pipeline by checking which VABench dimensions scored lowest, then target those areas.
  • Tune prompts for better alignment by using micro QA feedback to see which details the model misses.
  • Improve audio post-processing using stereo metrics (e.g., reduce phase tricks, increase stable ITD/ILD cues) to get a cleaner soundstage.
  • Benchmark updates to your model to verify real gains rather than relying on subjective impressions.
  • Design training data with more physics-rich clips (e.g., thunder after lightning, Doppler scenes) to raise realism scores.
  • Pick V+A pairings (video + audio models) guided by VABench results for your content type (e.g., ThinkSound for music-heavy clips).
  • Validate educational and explainer videos with macro MLLM scores to ensure clarity, mood, and narrative support.
  • Debug lip-sync issues by inspecting Desync and Lip-Sync metrics before release.
  • Assess stereo prompt adherence (left vs. right placement) using the stereo radar chart to guide iterative improvements.
#audio-video benchmark · #synchronization · #lip-sync · #text-to-audio-video (T2AV) · #image-to-audio-video (I2AV) · #cross-modal alignment · #multimodal LLM evaluation · #stereo width · #ITD/ILD · #phase coherence · #mono compatibility · #audio aesthetics · #Doppler effect · #video realism · #QA-based evaluation