
Apollo: Unified Multi-Task Audio-Video Joint Generation

Intermediate
Jun Wang, Chunyu Qiang, Yuxin Guo et al. Ā· 1/7/2026
arXiv Ā· PDF

Key Summary

  • APOLLO is a single, unified model that can make video and audio together or separately, and it keeps them tightly in sync.
  • It uses one shared Transformer tower with an Omni-Full Attention layer so audio, video, and their captions can talk to each other at every step.
  • A new positional trick called MixD-RoPE helps the model understand where and when things happen across both video frames and audio time.
  • Training is done progressively across many tasks (text-to-video, text-to-audio, text-to-audio+video, and image-conditioned versions) to prevent one skill from overpowering the others.
  • Random modality masking lets the same model practice single skills (like only video) and team skills (audio+video) without separate models.
  • An automated pipeline built an 81 million-sample dataset with dense captions so the model can learn world knowledge and exact timing.
  • Across many tests, APOLLO beats earlier open models and performs close to Veo-3 on joint audio–video generation.
  • It shows strong lip-sync, clear emotions, accurate sound effects, and robust results even on out-of-distribution prompts.
  • Unlike cascaded systems, it avoids error pile-up by learning everything end-to-end inside one tower.
  • This work points to a scalable path for next-generation, instruction-following audio–video creation.

Why This Research Matters

APOLLO makes videos and sounds that finally feel like they belong together—mouths match words, actions match noises, and emotions look and sound right. This helps creators make better movies, ads, and lessons that are clear and engaging without heavy manual editing. It improves accessibility by producing accurate timing for captions and sound descriptions. It supports dubbing and localization where lips and speech must align across languages. It also builds trust in AI media by reducing distracting mismatches that break immersion. Finally, it offers a scalable recipe—one tower, full attention, strong data—for future multimodal systems, including AR/VR and spatial audio.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine watching a cartoon where the character’s mouth moves, but the words come a second late. It feels wrong, right? Now imagine the music doesn’t match the dancing—double wrong.

🄬 The World Before: Before APOLLO, AI could make videos (T2V) or sounds (T2A), and some systems tried to make both together (T2AV). But even fancy systems often had problems: lips didn’t match speech, sound effects lagged, and models that tried to do everything sometimes forgot how to do one thing well (like making great video on its own). Many designs used two separate towers (one for audio and one for video) that only met in a few places, like two classmates who work on their own homework and only compare at the end.

šŸž Anchor: Think of a school play where the actors and the sound crew practice separately and only run one rehearsal together—the show will likely be out of sync.

šŸž Hook: You know how a good teacher helps the whole class talk to each other, not just to the teacher? That’s what attention does in AI.

🄬 Attention Mechanism (What/How/Why):

  • What it is: A way for a model to decide which parts of its input are most important right now.
  • How it works: (1) Look at all inputs; (2) score how useful each piece is; (3) focus more on high scores; (4) use that focused info to decide the next step.
  • Why it matters: Without attention, the model treats everything equally—like listening to ten people talk at once and not knowing who to follow.

šŸž Anchor: When asked ā€œWhat’s the capital of France?ā€, attention helps focus on ā€œcapitalā€ and ā€œFrance,ā€ which leads to ā€œParis.ā€

šŸž Hook: Imagine two walkie-talkies on different channels—they can’t talk clearly. That was many audio–video models.

🄬 Neural Network Architecture (What/How/Why):

  • What it is: The blueprint that decides how parts of a model connect and share information.
  • How it works: Layers pass messages through organized paths so patterns can be learned.
  • Why it matters: If audio and video live in separate towers and only whisper occasionally, they won’t sync lips to words or footsteps to sounds.

šŸž Anchor: One classroom (single path) makes group work easier than two classrooms with a tiny hallway in between.

šŸž Hook: Think of foggy pictures that slowly become clear—like a Polaroid developing.

🄬 Diffusion Models (What/How/Why):

  • What they are: Generators that start with noise and learn to remove it step by step until a clean sample (image, video, or audio) appears.
  • How they work: (1) Add noise during training; (2) learn to reverse the noise; (3) at test time, start from noise; (4) denoise repeatedly to create output.
  • Why they matter: They make high-quality, detailed results and are stable to train, but timing across audio and video is tricky without smart design.

šŸž Anchor: Like sculpting a statue by carefully shaving off clay until the figure appears.

šŸž Hook: Transformers are like super listeners that can pay attention to every word in a story at once.

🄬 Transformer Models (What/How/Why):

  • What they are: Neural nets that use attention to relate all parts of a sequence to each other.
  • How they work: Tokens talk to all other tokens via attention, then pass through feed-forward layers to mix information.
  • Why they matter: For syncing sounds and frames, every sound moment may need to look at many frames and vice versa.

šŸž Anchor: Reading a comic strip where each panel can reference any other panel to keep the plot consistent.

šŸž Hook: You know how a great library has not just many books, but well-labeled shelves and helpful notes?

🄬 Dataset Construction, Data Annotation, and Data Curation (What/How/Why):

  • What they are: Building big data collections, writing detailed labels (captions, transcripts), and carefully filtering for quality.
  • How they work: (1) Gather lots of audio–video; (2) split into clean scenes; (3) add dense captions (what’s seen/heard, who speaks, what is said); (4) remove low-quality or unsafe pieces.
  • Why they matter: If training data is messy or unlabeled, the model can’t learn exact lip–speech matches or correct sound timing.

šŸž Anchor: A well-organized cookbook with clear recipes helps you cook reliably, unlike a crumpled pile of random notes.

Where Things Failed: Many older methods trained on just one task (like only text-to-video). That makes the model great at one skill but weak at others, and it forgets cross-modal world rules (like ā€œclapping should make a clap sound right thenā€). Cascaded systems (video first, audio second) pile up errors: if the video timing is off, the audio generator must guess, and guesses stack.

The Gap: We needed a single, unified model where audio, video, and text all learn together; a training schedule that grows skills step by step; and a huge, clean, densely captioned dataset so the model learns precise timing and semantics.

Real Stakes: Better dubbing, clearer educational videos, safer and more accessible media (accurate captions/sounds), and creative tools that truly follow instructions—no more talking fish with silent mouths, unless you ask for it!

02 Core Idea

šŸž Hook: Think of a band where every musician listens to everyone else in real time, not just to a metronome—that’s how you get perfect harmony.

🄬 The ā€œAha!ā€ in one sentence: Put all audio, video, and text into one shared Transformer (single tower) with Omni-Full Attention so every part can coordinate at every step, then train it progressively across many tasks using lots of well-annotated, clean data.

šŸž Anchor: It’s like moving from two walkie-talkies to everyone on one group call, plus practicing with varied songs until the timing is second nature.

Multiple Analogies:

  • School analogy: Instead of separate math and reading classes that rarely meet, you hold a single integrated class where reading problems include math stories, and everyone practices together.
  • Kitchen analogy: One big pot stew (single tower) where flavors (audio, video, text) mix throughout cooking (full attention), not separate pots mixed at the end.
  • Sports analogy: A soccer team that trains passing, defense, and shooting together, not in silos, and uses scrimmages of increasing difficulty (progressive training).

Before vs. After:

  • Before: Dual towers with shallow cross-talk; single-task training; small or weakly labeled datasets; common lip-sync failures and timing drift.
  • After: Single tower with full, all-to-all attention; multi-task progressive training; large dense-caption data; strong lip–speech alignment, better audio–video timing, and robust unimodal quality.

Why It Works (intuition, no equations):

  • If audio and video tokens sit in one attention space, they can constantly correct each other’s timing and meaning—like a drummer and dancer watching each other’s moves.
  • MixD-RoPE shares a time axis between audio and video, giving them a common ā€œbeatā€ so frames know which audio moments match.
  • Progressive multi-task practice prevents overfitting to one skill and strengthens general world knowledge that transfers across tasks.
  • Flow matching gives a smooth path from noise to clean output, making the generation stable and detailed.

Building Blocks (each as a sandwich):

  • šŸž Hook: Imagine one big classroom instead of two. 🄬 Single-Tower Architecture:

    • What: One shared Transformer backbone for audio, video, and text.
    • How: All tokens enter one model; features are mixed layer by layer; outputs branch to audio and video decoders at the end.
    • Why: Constant cross-talk keeps timing tight and semantics aligned. šŸž Anchor: A choir practices in one room, so voices blend naturally.
  • šŸž Hook: Picture an octopus looking everywhere at once. 🄬 Omni-Full Attention:

    • What: Full attention across audio tokens, video tokens, and both captions simultaneously.
    • How: Concatenate all streams, compute attention jointly, then split back to modalities.
    • Why: No siloed guessing—every part sees the full context each step. šŸž Anchor: While generating a clap, the model also sees the frame of hands meeting.
  • šŸž Hook: Think of a shared drum beat. 🄬 MixD-RoPE:

    • What: Positional embeddings that align 3D video positions (time, height, width) with audio time via a shared temporal axis.
    • How: Use 3D RoPE for video and 1D time RoPE for audio, with synchronized time IDs.
    • Why: Gives both modalities the same timeline, reducing drift. šŸž Anchor: Like syncing a metronome for the band and the dancer. (A small sketch of this shared timeline appears after this list.)
  • šŸž Hook: Imagine clearing fog step by step. 🄬 MMDiT (Multimodal Diffusion Transformer):

    • What: A Transformer that denoises multimodal latents (audio+video) together.
    • How: Encode inputs; run joint diffusion with attention; decode to waveforms and frames.
    • Why: One engine learns cross-modal patterns during generation, not as an afterthought. šŸž Anchor: Cooking one stew instead of separate soups.
  • šŸž Hook: Practice the right mix of drills. 🄬 Progressive Multi-Task Training:

    • What: Train in stages across tasks (T2V, T2A, T2AV, TI2V, TI2AV) and rebalance where weak.
    • How: Pretrain broadly → specialize underperforming skills → refine on high-quality data.
    • Why: Avoids forgetting, boosts generalization, and polishes fidelity. šŸž Anchor: Start with fundamentals, then scrimmages, then championship-level drills.
  • šŸž Hook: Play peek-a-boo with senses. 🄬 Random Modality Masking:

    • What: Hide/query only certain streams so the same model can act as T2V, T2A, or T2AV.
    • How: Mask keys/queries to select which modalities interact during training.
    • Why: Teaches single and joint skills without separate models, preventing collapse. šŸž Anchor: Sometimes the model listens only to text+video; other times only text+audio.
  • šŸž Hook: A clean, well-labeled library beats a messy attic. 🄬 Automated Data-Construction Pipeline + Large-Scale Dataset:

    • What: An automated system that filters, splits scenes, transcribes, captions, and checks sync to build 81M high-quality triplets.
    • How: Remove low quality; diarize speakers; add dense audio/video captions; verify temporal and semantic alignment.
    • Why: Precise labels and variety teach the model exact lip–speech ties and real-world timing. šŸž Anchor: Like recipe cards with steps and photos, not just a list of ingredients.
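
To make these building blocks concrete, here is a hedged NumPy sketch of the concat, attend-over-everything, split-back pattern of Omni-Full Attention, with a shared time axis standing in for MixD-RoPE (the sketch promised in the MixD-RoPE item above). The token counts, latent rates, feature size, and the sine-based positional signal are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Toy token streams for a 2-second clip (counts are invented for illustration).
video  = rng.normal(size=(6, d))      # ~3 video latents per second
audio  = rng.normal(size=(86, d))     # ~43 audio latents per second
v_text = rng.normal(size=(10, d))     # video-caption tokens
a_text = rng.normal(size=(8, d))      # audio-caption / speech-text tokens

# Shared temporal axis: both streams count time in seconds, so a clapping frame
# and its thud get (nearly) the same time ID. Text tokens get no time here.
time_ids = np.concatenate([np.arange(6) / 3.0, np.arange(86) / 43.0,
                           np.zeros(10), np.zeros(8)])

def omni_full_attention(streams, time_ids):
    X = np.concatenate(streams)                        # one sequence for everything
    X = X + np.sin(time_ids)[:, None]                  # toy positional signal (not real RoPE)
    scores = X @ X.T / np.sqrt(d)                      # all-to-all: audio sees frames,
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # frames see sounds and captions
    w /= w.sum(axis=-1, keepdims=True)
    Y = w @ X
    sizes = [len(s) for s in streams]
    return np.split(Y, np.cumsum(sizes)[:-1])          # back to per-modality streams

video_out, audio_out, _, _ = omni_full_attention([video, audio, v_text, a_text], time_ids)
print(video_out.shape, audio_out.shape)                # (6, 32) (86, 32)
```

Because every stream sits in the same attention matrix, an audio token can put most of its weight on the video token with the matching time ID, which is the lip-sync behavior the paper is after.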

03 Methodology

At a high level: Inputs (video, video caption, audio caption/speech text, audio) → Encoders → Single-Tower MMDiT with Omni-Full Attention + MixD-RoPE → Audio and Video Decoders → Outputs (audio, video) using flow matching.

Step-by-step (with why and examples):

  1. Input preparation
  • What: Four inputs—video, video-related text, audio-related text (captions/speech), and audio.
  • Why: Each carries different clues: lip shapes, scene context, words being said, and actual sound qualities.
  • Example: A person saying, ā€œYes! I did it!ā€ in a bright kitchen; the video shows smiles and a pan; the audio caption notes joyful tone; audio has sizzling and speech.
  2. Encoding each modality
  • What: Compress video to spatiotemporal latents (3D VAE at ~3 Hz); compress audio to temporal latents (Audio-VAE at ~43 Hz); encode captions via text encoders.
  • Why: Smaller latents make joint attention feasible and let the model compare time points across modalities efficiently.
  • Example: 2-second clip becomes a handful of video tokens per frame and audio tokens per small time slice, plus text tokens.
  3. Mixed-Dimension Rotary Position Embedding (MixD-RoPE)
  • What: Assigns position to tokens so the model knows where and when each piece belongs.
  • Why: Without positions, a clap sound might attach to a random frame; shared time IDs align moments across audio and video.
  • Example: The instant a hand hits a table shares the same time ID with the sharp thud in audio.
  4. Omni-Full Attention inside the single MMDiT tower
  • What: Concatenate all hidden states (video, video caption, audio caption, audio), normalize/scale, attend all-to-all, then split back.
  • Why: Every audio token can see all frames and words; every frame can see all sounds and captions. Without this, alignment depends on weak, late fusion.
  • Example: While forming a mouth shape, the model checks the exact phoneme timing and the caption meaning to avoid mismatched syllables.
  5. Flow-matching denoising (generation core)
  • What: Start from noise and learn a velocity field that smoothly moves latents toward clean audio/video data.
  • Why: Provides stable, high-fidelity generation for both modalities in sync.
  • Example: From a hiss of noise to crisp speech and a smooth, coherent video sequence.
  6. Decoding back to waveforms and frames
  • What: Use Audio-Decoder to turn audio latents into 44.1 kHz waveforms; Video-Decoder to turn video latents into frames.
  • Why: Latents are the model’s working language; decoders translate back to human-listenable/seeable forms.
  • Example: You hear the exact consonant plosive when the lips close, and you see the moment the omelette flips.
  7. Random Modality Masking during training
  • What: Selectively restrict queries/keys so the model sometimes trains as T2V, sometimes as T2A, and sometimes as T2AV/TI2V/TI2AV.
  • Why: Prevents unimodal collapse, keeps skills balanced, and uses scarce paired AV data efficiently by bootstrapping from single-modality tasks.
  • Example: On one batch, only text→video is active; on another, text→audio; later, both are active together. (A minimal sketch of this masking, combined with flow matching, appears after this list.)
  8. Progressive multi-task curriculum
  • Stage I (Pre-train):
    • What: Broad training across mixed tasks to learn basic A/V generation, timing, and semantics.
    • Why: Gives the model sturdy foundations.
    • Example: It learns that footsteps should land with step sounds—not a second later.
  • Stage II (Specialized Post-train):
    • What: Focus extra practice where metrics show weakness (e.g., global cross-modal alignment).
    • Why: Addresses blind spots without losing earlier skills.
    • Example: Improve ImageBind alignment by adjusting data mix with more challenging scenes.
  • Stage III (Quality-Refined Post-train):
    • What: Finetune on carefully curated, high-quality samples.
    • Why: Polish realism, lip details, and acoustic texture.
    • Example: Sharper smiles, clearer breathing sounds between sung phrases.
  9. Automated data-construction pipeline
  • What: Filter low-quality A/V; split into single scenes; transcribe/diarize speech; caption audio and video; verify temporal and semantic sync; merge into dense captions.
  • Why: Garbage in, garbage out—clean, well-labeled data is essential for timing and meaning.
  • Example: Remove clips with >20% silence or unstable camera; keep clean, aligned samples.
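
Below is a rough sketch of one training step combining flow matching with random modality masking, as promised in step 7 above. The single linear "model", the latent sizes, the uniform task sampling, and the choice to apply the mask to inputs and loss (the paper masks attention keys/queries) are all simplifications for illustration, not APOLLO's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_a = 24, 16                                      # toy latent sizes (video, audio)
W = rng.normal(size=(d_v + d_a, d_v + d_a)) * 0.01     # stand-in "model": one linear map

def model(x_t, t):
    return x_t @ W + t                                 # placeholder for the MMDiT tower

def training_step(video_latent, audio_latent):
    # 1) Random modality masking: practice T2V only, T2A only, or joint T2AV.
    task = rng.choice(["t2v", "t2a", "t2av"])
    mask = np.concatenate([
        np.ones(d_v) if task in ("t2v", "t2av") else np.zeros(d_v),
        np.ones(d_a) if task in ("t2a", "t2av") else np.zeros(d_a),
    ])

    # 2) Flow matching: pick a point on the straight path from noise to clean latents.
    x1 = np.concatenate([video_latent, audio_latent])  # clean target latents
    x0 = rng.normal(size=x1.shape)                     # pure noise
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    target_velocity = x1 - x0                          # direction the sample should move

    # 3) Regress the velocity, counting only the active modalities in the loss.
    pred = model(x_t * mask, t)
    loss = np.mean(((pred - target_velocity) * mask) ** 2)
    return task, float(loss)

print(training_step(rng.normal(size=d_v), rng.normal(size=d_a)))
```

Sampling a different task per batch is what lets one set of weights act as a T2V, T2A, or T2AV model; the paper applies the mask inside attention rather than on the loss, but the bookkeeping idea is the same.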

The Secret Sauce:

  • Unified single-tower + Omni-Full Attention: constant cross-modal conversation prevents timing drift.
  • MixD-RoPE with shared time axis: audio and video literally count time together.
  • Progressive, masked multi-task training: the model practices solo and ensemble skills, then polishes performance.
  • Scaled, dense-caption data: teaches nuanced lip shapes, prosody, and scene-sound relationships.

What breaks without each step:

  • No single tower: audio/video sync weakens due to late or shallow fusion.
  • No Omni-Full Attention: modalities can’t fully consult each other; lip-sync suffers.
  • No MixD-RoPE: time references diverge; sounds slip against frames.
  • No progressive multi-task: overfitting to one task; unimodal quality drops.
  • No clean dense data: poor alignment, artifacts, and unreliable instruction following.

04 Experiments & Results

The Test (what and why): APOLLO is evaluated on Verse-Bench for text-to-audio–video (T2AV) and also checked on unimodal skills (T2V, T2A). The goals are (1) high video quality (looks good, moves naturally), (2) high audio quality (clear, realistic), (3) strong semantic match to the prompt, and (4) tight synchronization (lips/phonemes, actions/sounds).

The Competition (who):

  • Cascaded baselines: OpenSora + FoleyGen (T2V then V2A), AudioLDM2 + TemoTkn (T2A then A2V). These often suffer error pile-up.
  • Joint models: JavisDiT, UniVerse-1, Ovi. These typically use dual towers or limited fusion.

The Scoreboard (with context):

  • Video quality: Motion Score (MS) and Aesthetic Score (AS). APOLLO hits MS ā‰ˆ 0.48 and AS ā‰ˆ 0.51—think of this as smooth movement and pleasing visuals that beat most open baselines.
  • Identity consistency (ID): ā‰ˆ 0.59—like recognizing the same person across frames; APOLLO is stronger here than most joint baselines.
  • Audio quality: Lower FrĆ©chet Distance (FD ā‰ˆ 1.36) and KL (ā‰ˆ 1.06) are better. APOLLO shows improved realism over cascades and dual-tower systems.
  • Semantic alignment (CLAP): ā‰ˆ 0.232—solid text–audio matching, on par or better than prior work.
  • Synchronization: AV-A (ā‰ˆ 0.028, lower is better) and SyncNet Confidence (ā‰ˆ 6.79, higher is better). This is like getting an A+ in timing where many others are at B’s or C’s—APOLLO’s lips and sounds line up closely.
  • Global cross-modal alignment (ImageBind score): ā‰ˆ 0.316, indicating overall A/V consistency.

Key Takeaways:

  • Against cascades: APOLLO avoids error compounding by generating audio and video jointly inside one tower, explaining its large sync and quality gains.
  • Against dual towers: Omni-Full Attention with a shared time axis (MixD-RoPE) keeps modalities in lockstep, beating methods that only fuse late or lightly.

Surprising/Notable Findings:

  • Unimodal boost from multimodal training: APOLLO’s video and audio quality stay strong—even surpassing some specialized T2V/T2A systems—suggesting that learning cross-modal rules improves single-modality skills.
  • Robust OOD generalization: On prompts outside typical training, APOLLO maintains better sync and semantics than baselines, likely due to diverse, dense-caption data and multi-task exposure.

Qualitative Highlights:

  • Lip-sync precision: Mouth shapes match phonemes, including tongue/teeth positions; competitors show delays or mismatches.
  • Emotional expressiveness: Facial features and prosody align (joy, sadness, excitement), avoiding robotic looks.
  • Music/rap performance: Pitch and rhythm changes align with facial tension and breathing; baselines often drift.
  • Image-to-AV: Preserves identity and adds natural motion; others tend to drift or feel mechanical.

Ablations (what changes what):

  • Single vs dual tower: Single tower with Omni-Full Attention scores higher on ID, MOS (listening quality), WER (lower is better), de-sync (lower), Sync-Conf (higher), and IB score. Even when pretraining the audio tower, dual-tower alignment lags due to distribution mismatch.
  • Multi-task masking: Training across all tasks beats training on T2AV alone—representations become more general, and T2V/I2V also improve.
  • Progressive training: Removing the staged schedule drops performance; adding specialized and quality-refined stages lifts alignment and realism further.

Bottom line: APOLLO consistently outperforms open baselines across video, audio, and synchronization, and comes close to Veo-3 on joint audio–video generation.

05 Discussion & Limitations

Limitations:

  • Data reliance: Automated annotations, while scalable, can include occasional label noise; rare edge cases (e.g., unusual instruments or dialects) may be underrepresented.
  • Compute cost: A single 26B-parameter tower with joint attention is resource-heavy to train and serve, especially for long clips or high resolutions.
  • Very long-range timing: While improved, extremely long videos or complex overlapping soundscapes can still challenge perfect sync and scene continuity.
  • Domain shifts: In highly specialized domains (e.g., scientific instrumentation sounds), performance can dip without targeted fine-tuning.

Required Resources:

  • High-performance GPUs/TPUs for joint attention over multimodal latents and long sequences.
  • Storage and I/O for large datasets (81M samples) and high-rate audio/video latents.
  • Inference acceleration (e.g., caching, chunked attention) for practical deployment.

When NOT to Use:

  • Ultra-low-latency or on-device cases where memory and compute are tiny; a lighter unimodal model may be better.
  • Settings needing absolute guarantees against any hallucination (e.g., certain medical/legal uses); extra verification layers are needed.
  • Situations with strict domain data mismatches and no ability to fine-tune or adapt.

Open Questions:

  • Can we achieve similar alignment with far smaller models via distillation or sparse attention?
  • How to extend to 3D/VR with spatial audio while keeping perfect sync and low compute?
  • Can the pipeline self-correct label noise (e.g., active learning with human-in-the-loop) at massive scale?
  • How to encode longer contexts (minutes) without losing lip detail or audio texture?
  • What are the best safety and watermarking tools for unified AV generation that preserve quality and sync?

06 Conclusion & Future Work

Three-sentence summary: APOLLO unifies audio, video, and text inside one Transformer tower with Omni-Full Attention and a shared temporal rhythm (MixD-RoPE), then trains progressively across multiple tasks on a massive, densely captioned dataset. This design delivers high-fidelity outputs with tight lip–speech and action–sound alignment, robust instruction following, and strong unimodal performance. Across benchmarks, it outperforms previous open models and performs close to Veo-3 on joint audio–video generation.

Main achievement: Showing that a single-tower, fully attentive, multi-task framework—supported by careful positional timing and large-scale dense data—can fix classic AV failures (asynchrony, lip mismatch) without sacrificing single-modality quality.

Future directions:

  • Make it lighter: distill and optimize inference for real-time or on-device use.
  • Make it longer and richer: push to long-form videos, richer scene changes, and spatial audio.
  • Make it safer and more controllable: stronger guardrails, watermarks, and fine-grained editing controls.

Why remember this: APOLLO demonstrates that the simplest shape—one tower where everything talks to everything—paired with the right training recipe and data, can unlock reliable, synchronized, and scalable audio–video generation that feels natural to humans.

Practical Applications

  • Automatic dubbing of videos with lip-sync preserved across languages.
  • Educational videos with synchronized narration and on-screen highlights.
  • Social media content creation where music beats and motion edits stay in lockstep.
  • Film previsualization with quick audio–video mockups that follow storyboards.
  • Game cutscene generation with tightly matched voice lines and character animations.
  • Marketing and product demos with voiceovers and sound effects aligned to visuals.
  • Accessibility tools that generate accurate audio descriptions and timed captions.
  • Podcast-to-video converters that create talking head clips with lip-sync.
  • Karaoke and music videos with synchronized singing faces and instrumental cues.
  • Virtual influencers or avatars that speak and emote naturally in real time (with optimization).
Tags: audio-video generation, multimodal diffusion, single-tower transformer, omni-full attention, MixD-RoPE, flow matching, multi-task learning, random modality masking, dense captions, lip-sync, text-to-video, text-to-audio, image-to-video, joint generation, dataset curation