JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Key Summary
- JavisGPT is a single AI that can both understand sounding videos (audio + video together) and create new ones that stay in sync.
- It uses a special fusion step called SyncFusion to line up what it hears with what it sees at the exact place and time.
- Inside, it follows an encoder → LLM → decoder recipe, and talks to a powerful diffusion generator (JavisDiT) using learnable query tokens.
- A three-stage training pipeline (pretraining, fine-tuning, instruction tuning) teaches it to listen, watch, reason, and then create.
- The team built JavisInst-Omni, a 200K-sample instruction dataset that mixes questions, answers, and generation across text, audio, and video.
- On understanding tests, JavisGPT reached top results for synchronized audio–video reasoning (e.g., 93.8% on AVQA) using fewer training samples than others.
- On generation tests, it made higher-quality, better-synced sounding videos (e.g., FVD 317.5 and JavisScore 0.157) than prior unified models.
- SyncFusion improved accuracy while using fewer tokens and running at lower latency than simple concatenation or interleaving strategies.
- Jointly training understanding and generation made both skills better, especially the realism and instruction-following of generated videos.
Why This Research Matters
In daily life, people want assistants that can watch and listen like we do, then explain or create new clips that feel real. JavisGPT shows how one model can align sound and sight precisely, which is essential for education videos, creative editing, and accessibility tools. It turns multi-turn instructions into synchronized results, making it useful for storytelling, sports breakdowns, or lab demos. Because it learns both understanding and generation together, it’s better at following instructions and preserving timing. As more apps demand trustworthy media creation, strong synchrony is the difference between gimmicks and tools people depend on.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine watching a movie where footsteps sound exactly when a character steps and the clap comes right when hands meet. When sight and sound line up, your brain relaxes and the story feels real. But when they drift apart, everything feels off.
🥬 Filling (The Actual Concept): Before this work, AI models that handled pictures, videos, or sounds usually learned them separately. Some models could describe images with text; others could answer questions about a short video; a few could caption audio. But doing all of this together—listening and watching at once, reasoning about both, and then creating a new synchronized sounding video—was rare and tricky. How it worked before (step by step):
- Build separate tools: one for video, one for audio.
- At test time, run video tool then audio tool (or vice versa) and hope they match.
- If they don’t match, try to fix timing by hand (e.g., shift the sound a bit). Why this matters: Without tight timing, a bark might happen after the dog closes its mouth, and your brain notices the mismatch instantly.
🍞 Bottom Bread (Anchor): Think of a cartoon where the character’s mouth moves a second too late for the words—funny in a meme, but bad for trust in educational clips, safety cameras, or virtual assistants.
🍞 Top Bread (Hook): You know how a marching band must keep the drum beat perfectly aligned with the marching steps? If the beat drifts, the parade looks messy.
🥬 Filling (The Problem): For AI, the hard part is synchrony—making sound and visuals line up exactly in space (where) and time (when). Earlier multimodal LLMs often glued audio and video features next to each other without truly teaching them to align at each video patch and moment. How that led to failure:
- Simple concatenation: toss audio and video features together and hope the LLM figures it out.
- Interleaving frames: weave audio frames between video frames; often slow and clunky.
- Separate generators: create video first, then audio—errors pile up and outputs desynchronize. Why it matters: Real-world scenes are full of micro-timings—door slams, engine revs, claps—that must match the exact visual change.
🍞 Bottom Bread (Anchor): If a pot hits the floor at frame 100, the clang must also start right then—otherwise even a child can tell it’s fake.
🍞 Top Bread (Hook): Picture a class project where you must watch a science experiment, explain it in your own words, and then recreate the experiment on video with the correct sounds. You need to understand first, then generate.
🥬 Filling (The Gap): The field was missing a single, unified system that could both understand and generate synchronized audio–video, guided by natural language instructions, and do multi-turn conversations about it. Attempts that fell short:
- Pipelines of separate tools: flexible but fragile—errors from one step flow into the next.
- Unified models without tight sync: could talk about audio and video but didn’t nail space-time alignment.
- Generators without a strong language brain: could produce media but missed the fine-grained intent in instructions. Why this matters: Everyday uses—learning music, sports analysis, filmmaking, accessibility—need both understanding and faithful, synchronized generation.
🍞 Bottom Bread (Anchor): A learner asking, “When does the engine start?” and then saying, “Now make a clip of a red race car, number 27, roaring through turns,” needs one assistant that listens, reasons, and creates, all in sync.
🍞 Top Bread (Hook): Imagine a helpful studio assistant who can watch a scene with you, answer your questions, then spin up a new clip that keeps the sounds and sights perfectly together.
🥬 Filling (The Solution): JavisGPT fills this gap with a simple but powerful recipe: encoder → LLM → decoder, with a new SyncFusion module that teaches the model to align audio and video at the right place and time. It also adds special learnable queries that translate the LLM’s intentions into conditions for a pretrained diffusion generator that makes synchronized audio and video. Why it matters: Without a unifying brain and these bridges, the model either misunderstands the scene or generates out-of-sync media.
🍞 Bottom Bread (Anchor): The system can answer, “The engine sound happens after the man begins speaking,” and then, in the very next turn, generate a short, synchronized clip of a red race car #27 screaming through tight corners—exactly as requested.
02 Core Idea
Aha! Moment in one sentence: If one brain learns to watch and listen together—and uses timing-aware bridges to a good generator—it can both understand sounding videos and create new ones that stay in sync.
Multiple analogies (3 ways):
- Orchestra: The LLM is the conductor, SyncFusion seats the instruments (audio/video) in the right spots, and the diffusion decoder is the orchestra producing the final music and show.
- News studio: The LLM is the anchor, SyncFusion is the producer lining up clips and soundbites, and the diffusion model is the broadcast system that airs the synchronized segment.
- Puppet show: The LLM writes the script, SyncFusion connects each puppet’s movement with its sound, and the generator is the stage crew that makes the full performance.
Before vs After:
- Before: Audio and video were often treated separately; generation came in a pipeline with mismatches and lag; instruction-following was limited for synchronized media.
- After: Audio and video are fused with space-time awareness; a unified LLM reasons over them; and a diffusion decoder makes tightly synchronized sounding videos—all directed by natural language.
Why it works (intuition without equations):
- Synchrony needs local alignment: The model must know which visual patch is making which sound right now. SyncFusion injects audio cues into each frame’s patches so the LLM sees who is sounding where and when (see the code sketch after this list).
- Generation needs good conditions: Learnable queries let the LLM pack its intention into the exact format the diffusion model understands, so the decoder knows the what (semantic condition) and the where/when (space-time prior).
- Stepwise learning helps: Pretrain listening and mapping to the generator’s style, then fine-tune for sync, then teach instruction-following across tasks. Each stage reduces confusion and stabilizes training.
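To make the local-alignment idea concrete, here is a minimal PyTorch sketch of per-frame audio→video cross-attention in the spirit of SyncFusion. It is an illustration under assumed shapes and module choices (a single nn.MultiheadAttention layer, audio split evenly across frames), not the authors' implementation.

```python
import torch
import torch.nn as nn

class SyncFusionSketch(nn.Module):
    """Illustrative per-frame audio->video cross-attention; not the official SyncFusion code."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Video patches act as queries; the audio slice for the same frame acts as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, P, D), T frames with P patches each
        # audio_tokens: (B, A, D), A audio steps covering the same clip (A divisible by T)
        B, T, P, D = video_tokens.shape
        audio_per_frame = audio_tokens.reshape(B, T, -1, D)      # split audio to match frames
        v = video_tokens.reshape(B * T, P, D)                    # fold frames into the batch
        a = audio_per_frame.reshape(B * T, -1, D)
        fused, _ = self.cross_attn(query=v, key=a, value=a)      # inject audio into each patch
        fused = self.norm(v + fused)                             # residual keeps the visual info
        return fused.reshape(B, T, P, D)                         # sync-aware AV tokens

# Toy run: 2 clips, 16 frames, 64 patches per frame, 4 audio steps per frame.
out = SyncFusionSketch()(torch.randn(2, 16, 64, 1024), torch.randn(2, 64, 1024))
print(out.shape)  # torch.Size([2, 16, 64, 1024])
```

The design choice mirrored here is that each frame's patches only attend to the audio slice covering that frame, which is how a clap at frame 12 can end up inside the patches where the hands meet.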
Building blocks (Sandwich explanations):
- 🍞 Top Bread (Hook): You know how a teacher first asks you to look and listen, then explain what’s happening? 🥬 Filling (The Actual Concept): Unified Multimodal LLM (MLLM) is one brain that understands text, audio, and video together, and can also talk back or create new media.
- What: One model that handles many modalities jointly.
- How: Encoders turn audio/video into tokens; the LLM reasons over them with text; a decoder makes new sound + video.
- Why: Separate brains can’t easily stay in sync; one brain can. 🍞 Bottom Bread (Anchor): The model watches a guitar solo, hears the strum, answers timing questions, then creates a fresh solo clip with matching sound.
- 🍞 Top Bread (Hook): Imagine matching a drum hit to the exact frame where the stick strikes. 🥬 Filling: Audio–Visual Synchrony means sounds match the correct place and moment in the video.
- What: Precise where+when matching of audio with visuals.
- How: Learn local alignments at frame/patch level and preserve timing in generation.
- Why: Without it, everything feels fake. 🍞 Anchor: The bark must start the instant the dog’s mouth opens.
- 🍞 Top Bread (Hook): Think of mixing two puzzle pieces so their edges fit perfectly. 🥬 Filling: SyncFusion is a module that merges audio clues into visual patches using attention so the LLM sees synchronized audio–video tokens.
- What: A fusion layer that injects audio into frame-wise video features.
- How: Split audio to match frames, cross-attend audio→video per frame, output sync-aware tokens.
- Why: Without it, the model can’t tell which patch is making which sound. 🍞 Anchor: When a car revs in the top-left, the fused token at that spot carries the rev sound for that frame.
- 🍞 Top Bread (Hook): Like sending clear stage directions to your film crew. 🥬 Filling: Hierarchical JavisQueries are learnable tokens the LLM uses to package its intent for the diffusion generator.
- What: Two kinds of queries: semantic (what happens) and space-time prior (where/when it happens).
- How: The LLM fills these queries; small MLPs map them to the generator’s condition spaces (see the code sketch after this list).
- Why: Without them, the generator guesses and timing drifts. 🍞 Anchor: “Red race car #27 speeds up at 2s, exits at 7s” becomes precise conditions the generator follows.
- 🍞 Top Bread (Hook): Like studying basics before advanced projects. 🥬 Filling: Multimodal Pretraining teaches the model to hear, see, and start aligning with the generator.
- What: First-stage training with audio-text and caption alignment.
- How: Learn to caption audio; learn to map LLM’s intent into the generator’s condition space.
- Why: Without this, later training is shaky. 🍞 Anchor: The model learns what an engine “sounds like” and how to ask the generator for one.
- 🍞 Top Bread (Hook): A chef fine-tunes a recipe after tasting. 🥬 Filling: Audio–Video Fine-Tuning sharpens space-time sync for both understanding and generation.
- What: Second-stage training to improve synchrony.
- How: Train SyncFusion on sounding-video captions; refine query mappings with diffusion loss.
- Why: Without this step, timing is still sloppy. 🍞 Anchor: Now the drum hit matches the exact stick impact frame.
- 🍞 Top Bread (Hook): A workbook full of mixed practice makes you flexible. 🥬 Filling: JavisInst-Omni is a 200K-sample instruction dataset mixing QA and generation across audio, video, and both together.
- What: Diverse dialogues, single- and multi-turn, for understanding and generation.
- How: Curated with GPT-4o and verified by humans; includes synchrony-aware tasks.
- Why: Without lots of good practice, the model won’t follow real instructions well. 🍞 Anchor: “I don’t like this angle—get closer to the fingers without changing music” becomes an edited, synchronized clip on the next turn.
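To make the JavisQuery idea from the list above concrete, the sketch below shows learnable query tokens appended to the LLM input and two small MLP projectors that map the hidden states at those positions into a semantic condition and a space–time prior. Dimensions, query counts, and names here are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class JavisQuerySketch(nn.Module):
    """Illustrative learnable queries + MLP projectors (sizes and names are assumed)."""

    def __init__(self, llm_dim=3584, sem_dim=4096, prior_dim=1024,
                 num_sem_queries=32, num_prior_queries=32):
        super().__init__()
        # Learnable tokens the LLM "fills in" when asked to generate.
        self.sem_queries = nn.Parameter(torch.randn(num_sem_queries, llm_dim) * 0.02)
        self.prior_queries = nn.Parameter(torch.randn(num_prior_queries, llm_dim) * 0.02)
        # Small MLPs that map LLM hidden states into the generator's condition spaces.
        self.sem_proj = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, sem_dim))
        self.prior_proj = nn.Sequential(nn.Linear(llm_dim, llm_dim), nn.GELU(),
                                        nn.Linear(llm_dim, prior_dim))

    def append_queries(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (B, L, llm_dim), the embedded text + SyncAV tokens.
        B = input_embeds.size(0)
        q = torch.cat([self.sem_queries, self.prior_queries], dim=0).expand(B, -1, -1)
        return torch.cat([input_embeds, q], dim=1)

    def extract_conditions(self, hidden_states: torch.Tensor):
        # hidden_states: (B, L + Q, llm_dim), LLM output; conditions sit at the query slots.
        n_sem, n_pri = self.sem_queries.size(0), self.prior_queries.size(0)
        sem_hidden = hidden_states[:, -(n_sem + n_pri):-n_pri]
        prior_hidden = hidden_states[:, -n_pri:]
        return self.sem_proj(sem_hidden), self.prior_proj(prior_hidden)

# Toy usage with a fake prompt and a stand-in for the LLM's output hidden states.
bridge = JavisQuerySketch()
llm_in = bridge.append_queries(torch.randn(1, 128, 3584))          # (1, 128 + 64, 3584)
sem_cond, st_prior = bridge.extract_conditions(torch.randn_like(llm_in))
print(sem_cond.shape, st_prior.shape)                              # (1, 32, 4096) (1, 32, 1024)
```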
03 Methodology
At a high level: Inputs (text + audio + video) → encoders → SyncFusion (makes sync-aware tokens) → LLM (reasons, answers, or prepares generation conditions) → diffusion decoder (JavisDiT) → output (text and/or synchronized sounding video).
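As a rough orientation, the whole flow can be read as the pseudocode below. Every function is a stand-in for the component named in the text (the encoders, SyncFusion, the LLM, and JavisDiT); shapes and return values are placeholders, not the project's real API.

```python
import torch

# Stand-ins for the real components; shapes and outputs are placeholders for illustration.
def encode_video(video):   return torch.randn(1, 16, 64, 1024)   # (B, frames, patches, D)
def encode_audio(audio):   return torch.randn(1, 64, 1024)       # (B, audio steps, D)
def sync_fusion(v, a):     return v + a.reshape(1, 16, 4, 1024).mean(dim=2, keepdim=True)
def llm(text, av_tokens):  return {"answer": "The engine starts after the man begins speaking.",
                                   "sem_cond": torch.randn(1, 32, 4096),
                                   "st_prior": torch.randn(1, 32, 1024)}
def javis_dit(sem_cond, st_prior): return ("generated_video.mp4", "generated_audio.wav")

def javisgpt_step(text, audio, video, generate: bool = False):
    """Sense -> align -> think -> (optionally) condition -> create."""
    av_tokens = sync_fusion(encode_video(video), encode_audio(audio))  # sync-aware AV tokens
    plan = llm(text, av_tokens)                                        # reason over text + AV
    if not generate:
        return plan["answer"]                                          # Path A: text response
    return javis_dit(plan["sem_cond"], plan["st_prior"])               # Path B: sounding video

print(javisgpt_step("When does the engine start?", audio=None, video=None))
```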
Step-by-step recipe with why each step exists and a tiny example:
- Collect inputs and tokenize.
- What happens: The user supplies any mix of text, audio, and video; the system also stores past dialogue turns for context.
- Why it exists: The model must know what to watch/listen to and what the user wants.
- Example: User: “When does the engine start?” with a 4-second clip.
- Visual encoding (frozen Qwen2.5-VL video encoder + projector).
- What: The video is turned into a grid of patch features across frames.
- Why: The LLM needs compact tokens instead of raw pixels.
- Example: 16 frames at 240p become sequences of vectors, each tied to a specific patch and time.
- Audio encoding (frozen BEATs + small MLP projector).
- What: The raw waveform → mel-spectrogram → audio features → projected into the LLM’s token space.
- Why: The LLM can only work with token embeddings, not raw waves.
- Example: An engine hum becomes a time–frequency map, then a sequence of audio tokens.
- SyncFusion (cross-attention that injects audio into video patches per frame).
- What: Split audio features to match the number of video frames; cross-attend audio→video frame by frame; produce SyncAV tokens that carry local where/when sound info.
- Why: Without precise per-patch, per-frame fusing, the model can’t ground sounds.
- Example: A clap at frame 12 is embedded into the patches where hands meet.
- LLM reasoning (Qwen2.5-VL-7B with LoRA adapters during training).
- What: The LLM reads text tokens, the SyncAV tokens, and special learnable JavisQuery tokens.
- Why: The LLM is the brain that understands, follows instructions, and plans answers or generation.
- Example: It infers “engine starts after the man begins speaking.”
- Two output paths from the LLM.
- Path A: Text response.
- What: The LLM simply answers in natural language.
- Why: For QA, explanations, or conversation.
- Example: “The engine sound begins right after the man’s first sentence.”
- Path B: Generation intent via learnable queries (JavisQueries → JavisCond).
- What: The LLM fills two types of queries: semantic condition (what to make) and space–time prior (where/when it happens). Small MLPs map these to the diffusion model’s condition spaces.
- Why: The generator needs clear, aligned instructions for both meaning and timing.
- Example: “Red race car #27, loud engine begins at 1.5s, sharp turns at 2–3s.”
- Diffusion decoder (JavisDiT) creates synchronized audio + video.
- What: A pretrained diffusion transformer takes noise and the conditions (semantic + space–time prior), then gradually denoises into a final sounding video.
- Why: Diffusion models are strong at high-quality, realistic generation with control signals.
- Example: It outputs a 4-second 240p/24fps video with 16kHz audio that matches the plan.
- Training in three progressive stages.
- Stage I: Multimodal Pretraining (MM-PreTrain).
- What: Teach audio understanding and align LLM outputs with the generator’s text condition space.
- Why: Gives the model strong ears and a first bridge to the decoder, stabilizing later learning.
- Example: Audio captioning and an alignment loss to match the decoder’s style.
- Stage II: Audio–Video Fine-Tuning (AV-FineTune).
- What: Train SyncFusion on sounding-video captions (for comprehension) and refine generation queries with diffusion loss.
- Why: Sharpen space–time sync for both understanding and creation.
- Example: Better timing on “gunfire while the player shoots.”
- Stage III: Multimodal Instruction Tuning (MM-InstTune).
- What: Use JavisInst-Omni (200K) to teach instruction-following, multi-turn dialogue, and interleaved understand–generate tasks.
- Why: Real users talk in rich, back-and-forth ways; the model must follow and adapt.
- Example: “I don’t like this angle—get closer to the fingers (keep music).” The next clip is re-framed with the same audio.
- The secret sauce elements explained (Sandwich style):
- 🍞 Hook: Like a map that shows both landmarks and time schedules. 🥬 Filling: Space–Time Prior (ST-prior) conditions tell the generator where objects appear and when sounds fire. What: Learnable queries produce priors aligned to the decoder’s prior space. How: The LLM fills these queries; a projector maps them; an alignment loss keeps them on target. Why: Without ST-priors, events happen but drift in space or time. 🍞 Anchor: “Car enters top-left at 0.5s, engine peaks at 2s, fades by 3.5s.”
- 🍞 Hook: Like a translator helping two teams cooperate. 🥬 Filling: Alignment loss bridges the LLM’s internal language to the decoder’s condition language. What: A penalty that nudges LLM-produced conditions to match the decoder’s reference style. How: Compare predicted conditions with those from a frozen text/prior encoder and minimize the gap (a loss sketch follows this list). Why: Without this, training is unstable and quality drops. 🍞 Anchor: The LLM’s “loud engine” condition ends up in the exact embedding the decoder expects.
- 🍞 Hook: Like adding clip-on training wheels instead of rebuilding the bike. 🥬 Filling: LoRA adapters lightly tune the LLM without changing all its weights. What: Small, low-rank trainable add-ons to big layers. How: Train the adapters during fine-tuning; merge them back for inference at no extra cost (a sketch appears at the end of this section). Why: Full fine-tuning is heavy; no tuning is too rigid. 🍞 Anchor: The model adapts to AV tasks on limited GPUs while staying fast at test time.
- Efficiency choices that matter:
- Frozen encoders (vision/audio) speed training and preserve reliable features.
- SyncFusion reduces token count and latency compared to interleaving, while boosting accuracy.
- Keeping the diffusion generator frozen and aligning to it cuts compute and stabilizes learning.
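The alignment loss from the secret-sauce list above can be pictured with the minimal sketch below: the LLM-predicted condition is pulled toward the embedding produced by the generator's frozen text/prior encoder. The exact distance used in the paper is not restated here; mean-squared error and the shapes shown are simple stand-ins.

```python
import torch
import torch.nn.functional as F

def condition_alignment_loss(pred_cond: torch.Tensor, ref_cond: torch.Tensor) -> torch.Tensor:
    """Nudge LLM-produced conditions toward the frozen reference encoder's embeddings.

    pred_cond: conditions projected from the JavisQuery hidden states, shape (B, Q, D).
    ref_cond:  reference conditions from the generator's frozen text/prior encoder, (B, Q, D).
    """
    return F.mse_loss(pred_cond, ref_cond.detach())   # detach: no gradient into the frozen branch

# Toy usage: gradients flow only into the predicted conditions.
pred = torch.randn(2, 32, 4096, requires_grad=True)
ref = torch.randn(2, 32, 4096)
condition_alignment_loss(pred, ref).backward()
print(pred.grad.shape)  # torch.Size([2, 32, 4096])
```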
Put together, these steps form a clean pipeline: sense (encoders) → align (SyncFusion) → think (LLM) → condition (queries + projectors) → create (diffusion).
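For the LoRA adapters mentioned in the secret-sauce list, here is a generic low-rank adapter sketch: the pretrained weight stays frozen and only two small matrices are trained, and their product can later be merged back into the base weight. Rank, scaling, and layer size are illustrative choices, not the values used for JavisGPT.

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Generic low-rank adapter around a frozen linear layer (illustrative values)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # the big pretrained layer stays frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_a / lora_b receive gradients; the low-rank update can be merged for inference.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinearSketch(nn.Linear(3584, 3584))
print(layer(torch.randn(2, 3584)).shape)  # torch.Size([2, 3584])
```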
04 Experiments & Results
The tests: The authors checked three abilities—(1) video-only understanding, (2) audio-only understanding, and (3) synchronized audio–video understanding—plus (4) text-to-audio–video generation. Why? To prove JavisGPT can watch, listen, reason across both, and then make new, synchronized media.
The competition: JavisGPT was compared to strong multimodal systems like VideoLLaMA2/2.1, Qwen2.5-Omni, UnifiedIO-2, and NExT-GPT, as well as to pure diffusion generators like JavisDiT.
The scoreboard with context:
- Video understanding benchmarks (ActivityNet, Perception, MVBench): JavisGPT matched or slightly exceeded top vision LLMs. For example, ActivityNet 58.1, Perception 70.2, MVBench 68.4. Think of this as getting an A- in visual comprehension while learning two subjects at once (audio + video).
- Audio understanding (ClothoAQA, TUT2017): It showed strong listening skills—ClothoAQA 67.3, TUT2017 82.1—like scoring high in music class while also excelling in art.
- Audio–video synchronized understanding (AVQA, MU-AVQA, AVSD): JavisGPT led the pack—AVQA 93.8, MU-AVQA 82.1, AVSD 62.2—showing best-in-class skill at figuring out who made which sound when. That’s like acing the hardest part of the test: timing-based questions.
- Generation quality (JavisBench-mini): JavisGPT beat other unified models and even slightly surpassed the base JavisDiT in overall quality and synchrony. Key numbers: FVD 317.5 (lower is better; closer to real videos), CLIP 0.324 and CLAP 0.308 (higher is better; stronger text–media match), AV-IB 0.202 (higher is better; stronger audio–video semantic match), and JavisScore 0.157 (higher is better; tighter spatio-temporal synchrony). In school terms, this is moving from a solid B+ to an A- in making realistic, on-cue media.
Meaningful comparisons:
- Against pipeline-based models (NExT-GPT, UnifiedIO-2), JavisGPT created cleaner, more synchronized output and followed instructions better, especially in multi-turn chats.
- Human study over 100 interleaved dialogue cases: Annotators rated JavisGPT higher in instruction following, QA accuracy, generation quality, context reasoning, and proactive thinking. It didn’t just answer; it remembered your preferences and used them.
Surprising findings and ablations:
- SyncFusion vs alternatives: SyncFusion achieved top AV understanding with fewer tokens (~2.0K vs 3.5K) and lower latency (~224 ms vs up to ~555 ms). Translation: better and faster.
- Training stages matter: Skipping pretraining made diffusion fine-tuning unstable; skipping AV fine-tuning hurt synchrony. The full three-stage pipeline worked best and most reliably.
- Joint training helps both ways: When understanding and generation are trained together, the generator learns more precise conditions (better quality, better instruction following), and understanding also benefits. Synchrony improved modestly with scale, hinting there’s more headroom with bigger models or more unified objectives.
- ST-prior queries: Removing them barely changed overall quality but slightly hurt synchrony—exactly their purpose.
Takeaway: The numbers show JavisGPT doesn’t just talk about sounding videos well; it also makes them well, and crucially, keeps sound and sight in step—its signature strength.
05 Discussion & Limitations
Limitations (be specific):
- Mixed training objectives: The LLM is trained with next-token prediction for text, while the generator learns with diffusion loss for media. These mismatched objectives send different signals into the shared brain, which may cap performance.
- Asymmetric flow: Inputs use continuous embeddings for understanding, but outputs are injected via query conditions for generation. Understanding helps generation (queries can attend to what was seen), but the reverse path is weaker.
- Scale not fully explored: Results are on a ~7B LLM with public data. Larger backbones and more diverse multimodal data could push quality and synchrony further.
- Preference alignment: Current instruction tuning is strong, but more advanced alignment (e.g., reinforcement learning for better reasoning and media quality) could boost real-world performance and safety.
Required resources:
- GPUs (e.g., 8×A100 80GB) for staged training with frozen encoders and LoRA on the LLM.
- Pretrained vision/audio encoders (Qwen2.5-VL, BEATs) and a pretrained diffusion generator (JavisDiT).
- Curated instruction data (JavisInst-Omni) with synchrony-aware tasks.
When NOT to use it:
- Ultra-low-latency live production where every millisecond matters; diffusion decoding can be too slow.
- High-fidelity speech lip-sync for close-up faces if you require broadcast-grade precision; specialized speech-lip models may be better today.
- Domains far outside training data (e.g., rare instruments or extreme camera rigs); outputs may be less reliable and need human review.
Open questions:
- Can we unify the training objective so understanding and generation share the same learning signal (e.g., autoregressive multimodal tokens) without losing diffusion-level quality?
- How far will synchrony and realism climb with larger LLMs and more long-form training data?
- What are best practices for safe, watermark-protected generation and robust deepfake detection in such capable systems?
- How can user preferences be captured and remembered responsibly across sessions to improve proactive generation without privacy risks?
06 Conclusion & Future Work
Three-sentence summary: JavisGPT is a unified model that learns to watch and listen together, reason about synchronized audio–video, and then generate new sounding videos that keep sight and sound in step. It introduces SyncFusion for precise space–time alignment and uses learnable queries to talk clearly to a pretrained diffusion generator, all trained through a three-stage pipeline and a 200K-sample instruction dataset. The result is state-of-the-art synchronized understanding and generation with strong instruction-following in multi-turn conversations.
Main achievement: Making one brain that both understands and creates synchronized sounding videos—reliably and interactively—by aligning audio and video locally in time and space and then steering a diffusion generator with semantically rich, timing-aware conditions.
Future directions: Unify training signals (e.g., autoregressive multimodal tokens) so understanding and generation help each other even more; scale up the backbone and data; use reinforcement learning to boost reasoning quality, preference alignment, and synchrony; add fine-grained editing controls.
Why remember this: It turns the long-standing wish—“one assistant that can watch, listen, explain, and then make a new, perfectly synced clip”—into a practical, tested system. As multimodal AI becomes everyday, getting timing right is the difference between toy demos and truly useful tools.
Practical Applications
- Educational videos that highlight exactly when sounds happen (e.g., identifying instrument entries in music class).
- Accessible summaries that align visual actions with corresponding sounds for low-vision or hard-of-hearing users.
- Video editing assistants that keep original audio in sync while changing camera angles or zooms.
- Sports analysis clips that match crowd reactions or whistle blows to exact plays and timestamps.
- Content creation tools for vloggers to generate B-roll with perfectly timed ambient sounds.
- Safety and security recaps where alarms, footsteps, or impacts align to the precise frame.
- Game streaming highlights that pair voiceovers, sound effects, and on-screen action without lag.
- Cinematic previsualization where directors sketch a scene in text and get a timed, sound-on storyboard.
- Music-driven visualizations that sync animations to beats and instrument entrances.
- Customer support demos that show and tell device noises (e.g., engine or appliance sounds) at the right moments.