T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Key Summary
- T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
- It uses 500 tricky, realistic prompts so models must handle complex scenes, multiple events, and layered sounds.
- The benchmark measures both hard numbers (signal quality and timing) and soft, human-like judgment (following instructions and realism).
- A reasoning-first MLLM-as-a-Judge checks detailed checklists before scoring, making results more transparent and explainable.
- Models often fail at Audio Realism, especially making sounds that match materials and spaces (like metal vs. wood or echo in a gym).
- No single system wins at everything; even the best models still struggle with fine synchronization and realistic audio textures.
- Composed pipelines (separate best-in-class video plus audio models) can beat end-to-end systems on certain realism measures.
- T2AV-Compass highlights how much room there is to improve cross-modal alignment, instruction following, and perceptual realism.
- This benchmark can guide researchers and companies to fix weaknesses that real users care about, such as lip sync and believable sound.
- By unifying evaluation, T2AV-Compass sets a clearer path for building better, more trustworthy text-to-audio-video AI.
Why This Research Matters
As AI tools start making short videos with sound from simple text, people expect them to feel real, not just look pretty. T2AV-Compass checks the whole experience—video, audio, timing, and whether instructions were followed—so makers can fix what viewers actually notice. It exposes the Audio Realism Bottleneck, which explains why some clips still sound “off” even when they look great. Companies can use it to deliver better product demos, safer content, and more trustworthy media generation. Researchers get a clear path to improve models and prove progress. In short, this benchmark helps turn flashy demos into believable, useful, and reliable audio-video stories.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a great movie makes sense because the pictures and the sounds fit together—footsteps land exactly when a character walks, and a thunderclap rumbles when lightning flashes? That perfect match is what makes it feel real.
🥬 Filling (The Actual Concept):
- What it is: Text-to-Audio-Video (T2AV) generation is when an AI turns words into a video with matching sound.
- How it works (step by step):
- You type a description (the prompt),
- The AI imagines the visuals (video frames) and the sounds (audio track),
- It puts them together so they happen at the right time,
- It delivers one movie-like clip.
- Why it matters: Without this, AI-made clips look cool but feel wrong—like clapping hands with no sound, or a dog barking two seconds late.
🍞 Bottom Bread (Anchor): Imagine typing, "A glass falls and shatters on a tile floor in a small kitchen" and getting a short clip where you see the fall and hear a sharp, echoey smash at the exact moment it hits.
🍞 Top Bread (Hook): Think of a school talent show: you judge the performance, not just the costume or the song. Everything has to work together.
🥬 Filling (The Actual Concept):
- What it is: A benchmark is a fair test that checks how well different AIs do the same tasks.
- How it works:
- Gather a set of well-designed prompts,
- Have many models produce results,
- Grade each along important dimensions,
- Compare scores to see strengths and weaknesses.
- Why it matters: Without a good benchmark, we can’t tell if a model is truly improving or just looking lucky on easy examples.
🍞 Bottom Bread (Anchor): If five runners race the same distance on the same track, you can honestly compare who’s fastest; a benchmark is that shared track for AI.
🍞 Top Bread (Hook): Imagine you’re watching a drumline. If the sticks hit slightly off-beat, you feel it instantly.
🥬 Filling (The Actual Concept):
- What it is: Cross-modal alignment means the sound and the picture tell the same story at the same time.
- How it works:
- Look at what’s on screen,
- Listen to what’s in the audio,
- Check that they match in meaning (a car engine with a car) and timing (rev happens when the car accelerates),
- Score how close they are.
- Why it matters: Without alignment, AI videos feel fake—like a roaring ocean sound over a quiet library scene.
🍞 Bottom Bread (Anchor): If a door slams visually at 3.2 seconds, you should hear the slam at 3.2 seconds, not 3.7.
🍞 Top Bread (Hook): You know how a recipe with just "Make cake" isn’t enough? Precise steps matter.
🥬 Filling (The Actual Concept):
- What it is: Instruction following is how faithfully an AI follows detailed prompt rules (who, what, where, style, and sounds).
- How it works:
- Break the prompt into checkable parts (like lighting, number of people, actions, camera moves, sound types),
- Generate the clip,
- Verify each part,
- Add up how many were met.
- Why it matters: Without checking instructions, the video might be pretty but wrong—like having two dogs when you asked for one.
🍞 Bottom Bread (Anchor): If the prompt says, "Three red balloons rise as the camera slowly zooms out," the clip should show exactly three red balloons and a smooth zoom-out.
🍞 Top Bread (Hook): Sometimes, teachers use both a ruler and their eyes: the ruler gives exact numbers, and the eyes judge beauty.
🥬 Filling (The Actual Concept):
- What it is: Objective signal-level metrics are math-based measurements for video and audio quality (like sharpness, noise, or timing).
- How it works:
- Extract signals from the video/audio,
- Compute scores for clarity, artifacts, and alignment,
- Average across clips,
- Compare models.
- Why it matters: Without numbers, you can’t track progress or fairly compare systems.
🍞 Bottom Bread (Anchor): Like timing a runner with a stopwatch, these metrics tell you exactly how “fast” or “clean” the signals are, not just how they look to the eye.
🍞 Top Bread (Hook): Giving more detailed directions helps a friend find a hidden treehouse.
🥬 Filling (The Actual Concept):
- What it is: Prompt complexity means how rich and specific the text instructions are (multiple subjects, layered actions, camera moves, and sound events).
- How it works:
- Include clear subjects and setting,
- Add timed events and interactions,
- Specify camera moves and style,
- Describe sound effects, speech, and music.
- Why it matters: Simple prompts hide weaknesses; complex prompts reveal if models can handle real-world storytelling.
🍞 Bottom Bread (Anchor): “A boy kicks a ball” is simple; “Two kids pass a muddy soccer ball, rain starts, the coach whistles twice, and the camera tilts up to the gray sky” is complex—and a much better test.
🍞 Top Bread (Hook): When everyone grades differently, it’s confusing; when they agree on a plan, it’s clear.
🥬 Filling (The Actual Concept):
- What it is: Before this paper, evaluations were split—some checked only video, others only audio; few checked whether they matched each other or the instructions.
- How it works:
- Unimodal tests graded parts separately,
- Cross-modal timing and realism were often missed,
- Results were hard to compare across studies.
- Why it matters: This fragmentation made it hard to know if a model truly made convincing, instruction-following audio-video.
🍞 Bottom Bread (Anchor): It’s like judging a song only by its lyrics or only by its melody—good tests listen to the whole song together.
02 Core Idea
🍞 Top Bread (Hook): Imagine a compass that not only points north but also tells you if your backpack is packed right, your shoes are tied, and your water bottle is full—so you’re fully ready for the hike.
🥬 Filling (The Actual Concept):
- What it is: The key insight is a unified, dual-level evaluation—objective numbers plus reasoning-based judging—driven by complex, carefully designed prompts.
- How it works:
- Build 500 rich prompts using a taxonomy (organized categories) and real video inversion so they’re realistic,
- Run many T2AV models on these prompts,
- Score objective signal quality and cross-modal alignment,
- Use an MLLM-as-a-Judge with checklists to rate instruction following and realism,
- Aggregate the results for clear, diagnostic comparison.
- Why it matters: Without a unified approach, we miss critical failures like off-timing sounds or physics-breaking visuals.
🍞 Bottom Bread (Anchor): It’s like testing a cooking robot by giving it tough recipes, measuring temperatures and textures (numbers), and also having a chef explain why the dish does or doesn’t match the recipe.
🍞 Top Bread (Hook): You know how coaches use drills and scrimmages? Drills test technique; scrimmages test game sense.
🥬 Filling (The Actual Concept): Dual-level evaluation framework.
- What it is: A two-part test—objective metrics for signals, and subjective, reasoning-based judging for semantics and realism.
- How it works:
- Objective: measure sharpness, noise, and timing alignment,
- Subjective: checklist-based instruction following plus realism tests for motion, structure, and materials-to-sound matching,
- Combine scores to get a balanced grade.
- Why it matters: Numbers catch technical flaws; reasoning catches meaning and plausibility—together they see the whole picture.
🍞 Bottom Bread (Anchor): Like grading a speech on timing and volume (objective) and also on clarity and persuasiveness (subjective).
🍞 Top Bread (Hook): Think of a careful judge who explains every decision so everyone learns.
🥬 Filling (The Actual Concept): MLLM-as-a-Judge.
- What it is: A multimodal large language model that must reason first, then score.
- How it works:
- Convert each prompt into a QA checklist (e.g., “Are there three bells?”),
- The judge watches/listens and writes a rationale,
- It assigns a 1–5 score per dimension,
- Results are stored with explanations for transparency.
- Why it matters: Without rationales, scores feel mysterious; with rationales, we can pinpoint errors and improve models.
🍞 Bottom Bread (Anchor): Like a math teacher marking each step you got right or wrong, not just giving a final grade.
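To make the reasoning-first flow just described concrete, here is a minimal Python sketch of checklist-based judging. Everything in it is an illustrative stand-in rather than the paper's actual protocol: `call_judge` represents whatever MLLM client is used, and the prompt template and JSON response format are assumptions.

```python
import json
import re
from typing import Callable

# A minimal reasoning-first judging loop. `call_judge` is a hypothetical stand-in
# for whatever MLLM client is used: it receives a text prompt plus paths to the
# generated video and audio, and returns the judge's raw textual reply.
def judge_clip(prompt: str,
               checklist: list[str],
               video_path: str,
               audio_path: str,
               call_judge: Callable[[str, str, str], str]) -> dict:
    judge_prompt = (
        "You are evaluating a generated audio-video clip.\n"
        f"Original prompt: {prompt}\n"
        "Answer each checklist question, explaining your reasoning first, "
        "and only then give a single 1-5 instruction-following score.\n"
        "Checklist:\n" + "\n".join(f"- {q}" for q in checklist) +
        '\nRespond as JSON: {"rationale": "...", "answers": [true/false, ...], "score": 1-5}'
    )
    raw = call_judge(judge_prompt, video_path, audio_path)
    try:
        # Keep the rationale alongside the score so failures stay explainable.
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: grab the last 1-5 digit if the judge drifts off-format.
        digits = re.findall(r"[1-5]", raw)
        result = {"rationale": raw, "answers": [], "score": int(digits[-1]) if digits else None}
    return result
```

Storing the rationale next to the score is what makes the grading auditable: when a clip fails, you can read exactly which checklist item the judge believed was violated.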
🍞 Top Bread (Hook): When you tap glass, wood, or metal, they sound different—that’s how your ears know what they are.
🥬 Filling (The Actual Concept): Audio realism bottleneck.
- What it is: Today’s models struggle most with making sounds that truly match materials and spaces.
- How it works:
- Identify the visual material (glass vs. wood),
- Expect matching timbre and echo based on the room,
- Compare what should be heard vs. what is produced,
- Detect mismatches (e.g., metal that sounds dull and plasticky).
- Why it matters: If materials and spaces don’t sound right, viewers sense fakeness instantly.
🍞 Bottom Bread (Anchor): Seeing a chain net on a hoop but hearing a soft cloth swish breaks the illusion, even if the picture looks great.
🍞 Top Bread (Hook): A good map has categories—mountains, rivers, roads—so you can navigate.
🥬 Filling (The Actual Concept): Taxonomy-driven prompts and video inversion.
- What it is: A method to build diverse, realistic prompts by organizing key elements and anchoring some in real videos to ensure physical plausibility.
- How it works:
- Collect and cluster prompts to cover many themes,
- Enrich them with precise details (lighting, camera, sound),
- Add 100 real-video-derived prompts via dense captions and human checks,
- Finalize 500 challenging prompts.
- Why it matters: Hard, realistic prompts reveal true strengths and weaknesses.
🍞 Bottom Bread (Anchor): Instead of “Make a car video,” the prompt becomes “A red vintage car idles under a streetlamp at night, camera dolly-in, soft rain patter, distant jazz sax,” which truly tests the model.
Before vs After:
- Before: Mixed, unimodal metrics, scattered tests, unclear comparisons.
- After: One compass with numbers plus reasoning for video, audio, alignment, instructions, and realism.
Why it works (intuition): Objective metrics catch signal-level flaws, while reasoning-based judging catches semantic mismatches and physical implausibility—together they reduce blind spots. Building prompts from a taxonomy plus real-video inversion ensures tests are both broad and grounded.
Building blocks:
- Complex prompt suite (taxonomy + inversion),
- Objective pillars (video quality, audio quality, cross-modal alignment),
- Subjective pillars (instruction following with checklists, perceptual realism with diagnostic scores),
- Aggregation and reporting for actionable insights.
03 Methodology
High-level pipeline: Input (500 prompts) → Generate clips with each model → Objective scoring (video/audio/alignment) → Subjective judging (instruction following + realism) → Aggregate results → Analyze strengths and failures.
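As a rough Python sketch of that loop: the three callables below stand in for a model's generation endpoint and for the benchmark's objective and subjective scorers (all names are hypothetical), and the function simply averages every metric over the prompt suite.

```python
from typing import Callable

def evaluate_model(prompts: list[str],
                   generate_clip: Callable[[str], dict],
                   objective_scores: Callable[[dict], dict],
                   subjective_scores: Callable[[str, dict], dict]) -> dict:
    """Run one model over the prompt suite and average every metric."""
    per_prompt = []
    for prompt in prompts:
        clip = generate_clip(prompt)                  # text -> {"video": ..., "audio": ...}
        scores = {**objective_scores(clip),           # signal quality, alignment, sync
                  **subjective_scores(prompt, clip)}  # checklist IF + realism facets
        per_prompt.append(scores)
    if not per_prompt:
        return {}
    keys = per_prompt[0].keys()
    return {k: sum(p[k] for p in per_prompt) / len(per_prompt) for k in keys}
```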
🍞 Top Bread (Hook): Imagine organizing a science fair: same experiment cards for all teams, lab instruments for precise measurements, and teachers who explain their grades.
🥬 Filling (The Actual Concept): Data construction with taxonomy-driven curation.
- What it is: Building a balanced, challenging set of prompts using categories and clustering.
- How it works:
- Collect prompts from multiple sources and embed them in a semantic space,
- Deduplicate via similarity thresholds and square-root sampling to keep variety,
- Enrich prompts (subjects, motion, cinematography, sounds),
- Human review for plausibility.
- Why it matters: Without careful curation, tests skew easy or repetitive and don’t expose real issues.
🍞 Bottom Bread (Anchor): Like picking science fair topics so there aren’t 20 volcanoes and no robotics.
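Below is a toy version of this curation step, assuming prompt embeddings and cluster labels have already been computed. The similarity threshold, the greedy dedup, and the exact square-root quota rule are my illustrative choices, not the paper's published recipe.

```python
import numpy as np

def curate_prompts(embeddings: np.ndarray,
                   cluster_ids: np.ndarray,
                   sim_threshold: float = 0.9,
                   target_total: int = 500,
                   seed: int = 0) -> list[int]:
    """Toy curation pass: drop near-duplicates, then square-root sample per cluster."""
    rng = np.random.default_rng(seed)

    # 1) Greedy near-duplicate removal by cosine similarity (O(n^2), fine for a sketch).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(float(normed[i] @ normed[j]) < sim_threshold for j in kept):
            kept.append(i)

    # 2) Square-root sampling: each cluster's quota grows with sqrt(size), so
    #    over-represented themes are flattened while rare ones keep a seat.
    kept_arr = np.array(kept)
    clusters = cluster_ids[kept_arr]
    weights = {c: np.sqrt((clusters == c).sum()) for c in np.unique(clusters)}
    total_w = sum(weights.values())
    selected: list[int] = []
    for c, w in weights.items():
        quota = max(1, int(round(target_total * w / total_w)))
        members = kept_arr[clusters == c]
        take = min(quota, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return selected  # indices into the original prompt list (quotas are approximate)
```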
🍞 Top Bread (Hook): Sometimes you double-check homework answers by looking at the worked example in the textbook.
🥬 Filling (The Actual Concept): Video inversion.
- What it is: Creating prompts from real, short video clips using dense captions with human verification.
- How it works:
- Select diverse 4–10 second real videos,
- Auto-caption them with detailed, time-aligned descriptions,
- Humans fix mismatches,
- Use these verified prompts to anchor realism.
- Why it matters: Keeps prompts physically grounded and reduces fantasy-only edge cases.
🍞 Bottom Bread (Anchor): If a real clip shows a chain-net basket in a small court, the prompt will mention its sharp metallic rattle and tight, indoor echo.
Objective evaluation (signals):
- Video quality (technical + aesthetic): measure clarity (artifacts, blur) and pleasing composition.
- Audio quality: measure signal fidelity (cleanliness) and content usefulness (are the sounds meaningful, not just noise?).
- Cross-modal alignment: use embedding similarity to check text↔audio, text↔video, and audio↔video semantics; use temporal sync metrics to check timing.
🍞 Top Bread (Hook): Using both a ruler and a level helps you build a straight bookshelf.
🥬 Filling (The Actual Concept): Embedding-based alignment metrics (group concept).
- What it is: Tools that place text, audio, and video into a shared space so similarity can be measured.
- How it works:
- Convert each modality into vectors,
- Compute cosine similarity between pairs (text–audio, text–video, audio–video),
- Higher similarity = better semantic match,
- Average across clips.
- Why it matters: Ensures what you hear/see matches what was asked.
🍞 Bottom Bread (Anchor): If the prompt says “barking dog,” the audio vector should sit near the “dog-bark” region, not near “violin.”
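A minimal sketch of the pairwise similarity computation described above, assuming the text, video, and audio embeddings have already been produced by encoders that project into a shared (or at least comparable) space; which encoders the benchmark actually uses is not specified here.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_scores(text_emb: np.ndarray,
                     video_emb: np.ndarray,
                     audio_emb: np.ndarray) -> dict:
    # Pairwise semantic agreement; higher means the modalities tell the same story.
    return {
        "text_video": cosine(text_emb, video_emb),
        "text_audio": cosine(text_emb, audio_emb),
        "audio_video": cosine(audio_emb, video_emb),
    }
```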
🍞 Top Bread (Hook): Clapping out a beat with a metronome shows if you’re early or late.
🥬 Filling (The Actual Concept): Temporal synchronization metrics (group concept).
- What it is: Measures that check whether sounds and visuals happen at the same time, and whether lips match speech.
- How it works:
- Detect visual onsets (like a hit) and audio onsets (like a thump),
- Compute the time offset between them,
- Lower offsets mean better sync,
- Specialized checks for lip-sync.
- Why it matters: Even small delays break the illusion.
🍞 Bottom Bread (Anchor): A drum hit that sounds half a second late feels wrong even if everything else looks great.
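Here is a simplified sketch of onset-based sync checking. It uses librosa's generic onset detector for audio and a crude motion-energy threshold for video; the benchmark's actual synchronization and lip-sync metrics presumably rely on dedicated models, so treat this as an intuition-builder only.

```python
import numpy as np
import librosa

def audio_onsets(wav_path: str) -> np.ndarray:
    """Audio event times in seconds via spectral-flux onset detection."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    return librosa.onset.onset_detect(y=y, sr=sr, units="time")

def video_onsets(frames: np.ndarray, fps: float, k: float = 3.0) -> np.ndarray:
    """Visual event times: frames where motion energy jumps above mean + k*std.

    `frames` is a (T, H, W) or (T, H, W, C) array of decoded video frames.
    """
    gray = frames.mean(axis=-1) if frames.ndim == 4 else frames
    motion = np.abs(np.diff(gray.astype(np.float32), axis=0)).mean(axis=(1, 2))
    threshold = motion.mean() + k * motion.std()
    return (np.flatnonzero(motion > threshold) + 1) / fps

def mean_av_offset(audio_times: np.ndarray, video_times: np.ndarray) -> float:
    """Average |audio onset - nearest visual onset| in seconds; lower = tighter sync."""
    if len(audio_times) == 0 or len(video_times) == 0:
        return float("nan")
    return float(np.mean([np.abs(video_times - t).min() for t in audio_times]))
```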
Subjective evaluation (meaning + realism):
- Instruction Following (IF): prompts become checklists across seven dimensions (Attributes, Dynamics, Cinematography, Aesthetics, Relations, World Knowledge, Sound). An MLLM judge must write its reasoning, then score each dimension from 1–5.
- Perceptual Realism (PR): the judge scores five diagnostic facets (MSS, OIS, TCS, AAS, MTC), described below.
🍞 Top Bread (Hook): A good coach uses a checklist: stance, swing, follow-through—so players know exactly what to fix.
🥬 Filling (The Actual Concept): Checklist-based IF.
- What it is: Turning a rich prompt into yes/no questions per sub-dimension (e.g., “Is the camera panning left?”).
- How it works:
- Extract slots from the prompt (counts, colors, motions, lighting, sounds),
- Form binary QA checks,
- Judge answers with rationale,
- Convert to a 1–5 score.
- Why it matters: Pinpoints missed instructions instead of vague overall grades.
🍞 Bottom Bread (Anchor): “Are there exactly three red balloons?” is clearer to verify than “Does the scene seem right?”
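A tiny sketch of the final conversion step above, assuming a simple linear mapping from checklist pass rate to the 1–5 scale; the paper's exact conversion rule may differ.

```python
def checklist_to_score(answers: list[bool]) -> float:
    """Map binary checklist outcomes to a 1-5 instruction-following score.

    Assumed linear mapping: 0% of checks passed -> 1.0, 100% passed -> 5.0.
    """
    if not answers:
        return 1.0
    pass_rate = sum(answers) / len(answers)
    return round(1.0 + 4.0 * pass_rate, 2)

# Example: 5 of 6 checks satisfied -> 4.33
print(checklist_to_score([True, True, True, False, True, True]))
```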
Realism facets (the “secret sauce”):
🍞 Top Bread (Hook): Your eyes and ears instantly notice when a puppet’s strings jerk or a voice echoes in the wrong room.
🥬 Filling (The Actual Concept): Motion Smoothness Score (MSS).
- What it is: Rates how steady and natural motion looks over time.
- How it works:
- Check for flicker and smearing,
- Judge frame-to-frame fluidity,
- Allow natural blur in fast scenes,
- Score 1–5.
- Why it matters: Jitter and tearing break realism even if the content is correct.
🍞 Bottom Bread (Anchor): A smooth camera dolly-in should glide, not wobble like a shaky phone video.
🍞 Top Bread (Hook): Like a doctor noticing if an arm bends the wrong way.
🥬 Filling (The Actual Concept): Object Integrity Score (OIS).
- What it is: Checks for anatomical and structural plausibility.
- How it works:
- Look for limb stretching and impossible joint angles,
- Ensure rigid objects don’t wobble like jelly,
- Check textures remain stable,
- Score 1–5.
- Why it matters: Broken bodies or warping cars feel uncanny.
🍞 Bottom Bread (Anchor): A basketball shouldn’t melt into an oval mid-bounce.
🍞 Top Bread (Hook): A character shouldn’t disappear and reappear like a magic trick unless the story says so.
🥬 Filling (The Actual Concept): Temporal Coherence Score (TCS).
- What it is: Tracks whether objects exist and stay consistent over time.
- How it works:
- Detect sudden vanish/appear events,
- Check identity and appearance stability,
- Verify occlusion logic,
- Score 1–5.
- Why it matters: Keeps the story believable.
🍞 Bottom Bread (Anchor): If a red hat dips behind a wall, it should reappear as the same red hat.
🍞 Top Bread (Hook): Sound engineers can hear when audio has “robot shimmer” or weird clicks.
🥬 Filling (The Actual Concept): Acoustic Artifact Score (AAS).
- What it is: Rates technical audio cleanliness (no metallic tone, pops, or pumping).
- How it works:
- Listen for electronic artifacts,
- Check stability (no dropouts),
- Ensure signal integrity (no clipping),
- Score 1–5.
- Why it matters: Dirty audio screams “fake.”
🍞 Bottom Bread (Anchor): A city scene shouldn’t hiss like a broken radio.
🍞 Top Bread (Hook): Tapping metal, wood, or glass tells your brain what it is—even with your eyes closed.
🥬 Filling (The Actual Concept): Material–Timbre Consistency (MTC).
- What it is: Judges if the sound texture matches the visual material and space.
- How it works:
- Identify material and action (metal chain rattling, foot on gravel),
- Check timbre and loudness envelope,
- Match room echo (small room vs. hall),
- Score 1–5.
- Why it matters: Perfect pictures with wrong sounds still feel wrong.
🍞 Bottom Bread (Anchor): A chain-net basketball hoop should ring sharply with quick metallic overtones, not a dull thud.
Secret sauce: Reasoning-first judging + realism diagnostics expose subtle cross-modal errors that pure numbers miss, while objective metrics prevent hand-wavy scoring. Together, they make the evaluation robust and actionable.
04 Experiments & Results
The test: The team evaluated 11 representative systems—closed-source end-to-end models (like Veo-3.1, Sora-2, Kling-2.6, Wan-2.6/2.5, Seedance-1.5, PixVerse-V5.5), open-source end-to-end (Ovi-1.1, JavisDiT), and composed pipelines (Wan-2.2 + HunyuanVideo-Foley; AudioLDM2 + MTV)—on all 500 prompts. They measured objective video/audio quality, cross-modal semantic alignment, synchronization, instruction following (video/audio), and realism (video/audio).
🍞 Top Bread (Hook): Think of it like a decathlon: sprinting (speed), long jump (power), and pole vault (technique) all count—so one star can’t just win by doing one thing well.
🥬 Filling (The Actual Concept): Scoreboard with context.
- What it is: A balanced report that shows who leads per category and where gaps remain.
- How it works:
- Compare objective pillars: video quality (technical + aesthetic), audio quality (fidelity + usefulness), and alignment (text–audio, text–video, audio–video) plus sync (lower desync is better),
- Compare subjective pillars: instruction following (video + audio) and realism (MSS, OIS, TCS, AAS, MTC),
- Summarize which models are top-tier and where they fall short.
- Why it matters: A single big number can hide critical weaknesses; this breakdown reveals what to fix.
🍞 Bottom Bread (Anchor): A model might get an “A” in video sharpness but a “C–” in sound realism—that’s a very different product experience than two “B+”s.
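To show what a per-pillar breakdown looks like in practice, here is a small pandas sketch with made-up numbers and hypothetical model names (not the paper's reported results): keeping scores per pillar and ranking within each one prevents a single flattering overall average from hiding a weak category.

```python
import pandas as pd

# Made-up pillar averages for two hypothetical models (not the paper's numbers).
results = pd.DataFrame(
    {
        "video_quality": {"model_A": 0.82, "model_B": 0.79},
        "audio_quality": {"model_A": 0.71, "model_B": 0.75},
        "alignment":     {"model_A": 0.68, "model_B": 0.66},
        "instr_follow":  {"model_A": 4.1,  "model_B": 3.8},
        "audio_realism": {"model_A": 2.7,  "model_B": 3.0},
    }
)

# Rank within each pillar so one overall average cannot mask a weak category.
print(results)
print(results.rank(ascending=False))
```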
Highlights:
- Closed-source models generally outperformed open-source ones across most dimensions. Veo-3.1 often landed top or near-top averages across pillars.
- No single winner: while Veo-3.1 had the highest overall average in subjective scoring, it still showed major deficiencies in Audio Realism—especially MTC (material–timbre matching).
- Audio Realism Bottleneck: Even strong systems hovered around the mid-to-low ranges for audio realism. One of the top systems achieved only around the mid-50s on Audio Realism—a passable but far from studio-grade result—while many clustered in the 30s–40s.
- Composed pipelines can excel at specific tasks: Wan-2.2 + HunyuanVideo-Foley achieved best-in-class Video Realism in some analyses, suggesting specialist chaining can help preserve unimodal fidelity.
- Cross-modal alignment and sync: Some models achieved competitive text–video and text–audio similarity, yet temporal synchronization (e.g., action-to-sound timing, lip sync) still revealed room for improvement (desync errors remained noticeable in various scenarios).
🍞 Top Bread (Hook): Surprises are like plot twists—some can teach you more than a predictable ending.
🥬 Filling (The Actual Concept): Surprising findings.
- What it is: Lessons the numbers alone wouldn’t predict.
- How it works:
- Analyze where composed systems beat end-to-end ones,
- Inspect where A/V semantic alignment looks fine but realism feels off,
- Use judge rationales to connect symptoms to causes (e.g., “metallic sheen” artifact or “two-headed dog” visual glitch),
- Identify consistent failure modes.
- Why it matters: These insights redirect future research to the true bottlenecks.
🍞 Bottom Bread (Anchor): A model that nails the look of a basketball court but fails to deliver squeaky shoes and chain-net rattle on time isn’t yet game-ready.
Contextualizing numbers (plain language):
- Objective video quality: Top systems produced clean, sharp frames with solid aesthetics—think crisp HD video. However, occasional motion jitter and subtle structural warping still occurred in complex scenes.
- Objective audio quality: Many outputs were clear enough for casual listening, yet experts could hear metallic tones, muffled highs, and unstable noise floors—like a good demo, not a polished soundtrack.
- Alignment (semantic) was decent on average (the right kinds of sounds with the right visuals), but alignment (temporal) faltered more often; small but perceivable delays broke immersion.
- Instruction following: Closed-source leaders tracked composition, counts, lighting, and camera moves reasonably well; dynamics (complex motion and interactions) remained the hardest.
- Realism: Visual realism (MSS, OIS, TCS) scores were comparatively stronger than audio realism (AAS, MTC). MTC showed the largest gap—models struggled to match materials and room acoustics.
Takeaway: The field has advanced far beyond simple demos, but the last 10–20%—especially believable, physics-matched audio—still separates current systems from human-level believability.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best maps have blank spots labeled “Here be dragons”—honest explorers mark them so others won’t get lost.
🥬 Filling (The Actual Concept): Limitations.
- What it is: Things this benchmark can’t do perfectly yet.
- How it works:
- MLLM-as-a-Judge costs: reasoning-first judging is compute-heavy, so huge test sweeps are expensive,
- Bias: the judge model can prefer certain styles or audio textures, subtly steering scores,
- Coverage: 500 prompts are diverse but can’t capture every rare physical interaction or niche art style,
- Duration: most clips are short (4–10s); long-form coherence isn’t fully tested.
- Why it matters: Knowing limits helps interpret scores and plan improvements.
🍞 Bottom Bread (Anchor): Like a microscope with great magnification but a small field of view—amazing detail, limited area.
Required resources:
- Running all metrics and the MLLM judge requires GPU/TPU resources, stable audio/video toolchains, and time. Practitioners should budget for storage (clips + logs), multi-run variance checks, and judge costs.
When NOT to use:
- If you only need a quick sanity check on one dimension (e.g., video sharpness), T2AV-Compass may be overkill.
- If your content is ultra-long-form storytelling (>1 minute), current settings may miss some failure modes.
Open questions:
- How to distill the MLLM judge into a faster, cheaper evaluator without losing reasoning quality?
- Can native, tightly coupled audio-visual generative architectures reduce the Audio Realism Bottleneck?
- How to better score multi-source audio scenes (on-screen vs. off-screen sounds) under cluttered visuals?
- What’s the best way to incorporate human-in-the-loop refinements to align metrics with nuanced perception over time?
Overall assessment: T2AV-Compass is a strong, diagnostic foundation that surfaces where models still fail in ways users feel—especially sound realism and tight synchronization—while providing interpretable paths to fix them.
06 Conclusion & Future Work
Three-sentence summary: T2AV-Compass is a unified benchmark that tests text-to-audio-video systems using both objective signal metrics and a reasoning-first MLLM-as-a-Judge. Its 500 taxonomy-driven and real-video–anchored prompts reveal strengths and weaknesses in video quality, audio quality, alignment, instruction following, and realism. Extensive results show major progress yet a clear Audio Realism Bottleneck and no single model dominating all dimensions.
Main achievement: The paper delivers a practical, interpretable, end-to-end evaluation recipe—hard numbers plus explainable judgments—so researchers can see exactly what to improve and why.
Future directions: Scale to longer clips, distill lighter judges, deepen multi-source sound testing, and push native joint audio-visual generation to better capture physical correlations (especially material and room acoustics). More human-in-the-loop calibration can further align automated assessments with real viewer perception.
Why remember this: It reframes “Does the video look cool?” into “Does the whole experience feel real, follow the instructions, and line up across sight and sound?”—which is what audiences actually care about. T2AV-Compass is the compass pointing the field toward that standard.
Practical Applications
- Model benchmarking: Compare new T2AV models against a strong, unified standard before release.
- Regression testing: Re-run a subset of prompts after each update to catch drops in sync or realism.
- Targeted debugging: Use checklist rationales to pinpoint missed instructions (e.g., wrong camera move or missing SFX).
- Audio QA: Focus on MTC and AAS to improve material sound textures and remove metallic artifacts.
- Prompt engineering: Stress-test prompt formats to see which phrasing boosts instruction following.
- Safety and policy audits: Verify models don’t hallucinate harmful or misleading audio-visual cues.
- Dataset curation: Use insights to collect more training data for weak spots (e.g., off-screen sound, multi-source mixes).
- Reward modeling: Train evaluators or reward models using the dual-level signals (objective + judge rationale).
- Product evaluation: Pick composed vs. end-to-end pipelines depending on which scores matter most for your app.
- Research diagnostics: Study where sync fails (e.g., lip vs. event) to design better joint audio-visual architectures.