Bridging Your Imagination with Audio-Video Generation via a Unified Director
Key Summary
- UniMAGE is a single “director” AI that writes a film-like script and draws the key pictures for each shot, so stories stay clear and characters look the same from scene to scene.
- It uses a Mixture-of-Transformers to blend a text brain (for writing) and an image brain (for drawing) inside one model that shares attention.
- First, it learns with text and images mixed together (interleaved) so it understands how words and pictures connect in long stories.
- Then, it trains the writing and drawing parts separately (disentangled) so each becomes an expert without confusing the other.
- Special ID tokens act like name tags for characters and places, helping the model keep faces, clothes, and locations consistent across many shots.
- During training, scripts are split into a pre-context and a follow-up, teaching the model to extend or continue a story naturally from where it left off.
- On the ViStoryBench tests, UniMAGE beats open-source baselines in character consistency, alignment with prompts, and overall coherence.
- Ablation studies show performance drops without ID prompting or pre-context splitting, confirming these ideas are key.
- Users preferred UniMAGE’s results in a study, citing better plots and more stable characters over long sequences.
- This approach helps everyday creators turn simple ideas into multi-shot, film-like stories that existing audio–video generators can bring to life.
Why This Research Matters
Long stories are hard: characters must look the same, places must be consistent, and the plot has to move forward logically. UniMAGE gives everyday creators a single tool to plan multi-shot scripts and draw stable keyframes that match the story. This makes it easier to produce educational videos, ads, trailers, and short films without a big team. Because it outputs structured scripts with dialogue, sound cues, and visuals, today’s audio–video generators can follow along more reliably. By practicing story continuation, UniMAGE helps creators build series and episodes that feel connected. In short, it lowers the barrier to film-like storytelling and raises the ceiling on quality and coherence.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how making a good movie takes both a writer to plan the story and a director to plan the shots? If they don’t talk to each other, the movie can feel confusing.
🥬 The Concept (What the world looked like before): For a long time, AI tools for video creation split the job into two separate parts: a language model wrote the script, and an image model drew key pictures for each shot. These two parts didn’t fully share a single brain, so the story logic (who does what, when, and why) and the visual look (who looks like whom, what the place looks like) didn’t always match up—especially across many scenes. Short, single-shot videos looked great, but long, multi-shot stories often fell apart: characters changed faces or outfits, scenes repeated, and the plot wandered.
How it worked back then (step by step):
- User typed a prompt.
- An LLM expanded it into a rough script.
- A separate image model drew keyframes, one by one.
- A video model tried to animate those frames.
Why it mattered: Without tight teamwork, long stories broke. Characters looked different shot to shot, and the script didn’t always line up with the images, making it hard to build a film-like experience.
🍞 Anchor: Imagine you’re telling a bedtime story and a friend draws pictures as you go—but they can’t hear you clearly. Your hero has black hair in one picture and blond in the next. The magic ring is square in one scene and round in the next. The story stops making sense.
🍞 Hook: Imagine trying to build a train with two teams: one lays tracks (story), the other makes cars (pictures). If they don’t coordinate distances and turns, the train will derail.
🥬 The Problem: Researchers needed a way to keep story logic and visuals in sync across many shots. Past fixes, like powerful prompts or small add-on modules for consistency, helped a little but not enough for long narratives with multiple characters.
How earlier fixes worked (and struggled):
- Agent pipelines: Many tiny helper AIs passed notes to each other (script agent, storyboard agent, image agent). This required per-case prompt tuning and often produced fragile connections.
- Autoregressive text+image models: They could write and draw in one stream, but image quality often lagged behind diffusion models.
- Unified backbones with diffusion modules: Better image quality, but still not great at very long, multi-shot coherence.
- Per-story training: Worked for one story instance but didn’t generalize.
Why they fell short: None truly learned long-range story structure and stable identities while keeping high image quality and flexible continuation.
🍞 Anchor: Think of a group project where everyone emails separate drafts. Even if each piece is good, the final report can repeat parts, skip steps, or change character names mid-way.
🍞 Hook: You know how a great director keeps the whole orchestra together—strings, brass, drums—so the music tells one big story?
🥬 The Gap: We were missing a single “director” model that thinks through both the words and the images together for long stories, then lets each part specialize without losing the common plan.
How the new idea fills it: UniMAGE brings script writing and keyframe drawing under one roof (a Mixture-of-Transformers). It first learns with text and images interleaved so the two parts understand each other deeply, then it separates training so each becomes an expert, while special ID tokens keep characters and places consistent.
Why we should care (real stakes):
- Everyday creators can turn simple prompts into multi-shot, film-like scripts and images.
- Teachers can storyboard history lessons or science demos.
- Small studios can plan ads, trailers, or shorts faster, with fewer continuity errors.
- Fans can create longer, coherent fan films.
- All of this plugs into today’s video and audio generators (like Veo/Sora-style systems) that need strong, structured prompts to shine.
🍞 Anchor: Picture a recipe app that not only writes the recipe but also draws clear step photos with the same cook, same kitchen, and same tools for all steps. Now cooking (and filming) becomes much easier and more reliable.
02 Core Idea
🍞 Hook: Imagine a comic book creator who can both write the panels and draw them—and keeps every character’s face the same from the first page to the last.
🥬 The Aha Moment (one sentence): Put one “director” brain in charge of both script reasoning and keyframe drawing, teach it by mixing text and images together first, then let the writing and drawing parts specialize, with name-tag tokens to lock in who’s who and where’s where, and with practice on picking up stories mid-way.
Multiple analogies:
- Orchestra conductor: One leader keeps melody (story) and rhythm (visuals) in harmony.
- LEGO instruction maker: First design the whole build with pictures and captions together, then polish the captions and photos separately, but still follow the same plan.
- Sports playbook: The coach draws plays (images) and writes notes (text) on the same page, then later drills offense and defense separately while using consistent jersey numbers (IDs).
🍞 Anchor: Think of a long classroom poster project. If one student writes and draws on the same draft first, then later refines text and images apart—while using sticky notes labeled “Hero A,” “City Park,” etc.—the final poster stays consistent.
Now, the key building blocks in the right order, each with the Sandwich pattern:
- Unified Director Model
- 🍞 Hook: You know how a movie director oversees both the script and the camera shots?
- 🥬 What it is: A single AI that writes structured scripts and generates keyframe images for multi-shot stories.
How it works:
- Takes a user prompt.
- Writes global details (characters, places) and shot-by-shot text.
- Draws a key picture for each shot.
- Feeds these to audio–video generators.
Why it matters: Without one boss brain, text and images drift apart; long stories break.
- 🍞 Anchor: A user writes, “A treasure hunt in a jungle.” The model returns a cast list, scene plan, and matching keyframes that keep the hero’s look stable.
- Mixture-of-Transformers (MoT)
- 🍞 Hook: Imagine two expert teachers sharing one classroom, taking turns to lead while listening to the same students.
- 🥬 What it is: One transformer backbone with two experts—one for understanding/writing, one for image generation—that share attention.
How it works:
- Text, ViT (understanding), and VAE (generation) tokens flow through shared self-attention.
- The text expert predicts next text tokens.
- The image expert predicts the clean image signal (via rectified flow) from noisy latents.
Why it matters: Shared attention lets words and pictures inform each other without losing each one’s strengths.
- 🍞 Anchor: When the script mentions “a glowing tablet,” the shared attention helps the image expert paint glow correctly, and the text expert remembers that glow later.
- Interleaved Concept Learning
- 🍞 Hook: Imagine learning a language by reading a page of story, then looking at a picture that matches it—over and over.
- 🥬 What it is: Training on sequences where text and images are mixed in order, so the model links narrative steps to visuals.
How it works:
- Feed [text → image → text → image] for many shots.
- Jointly train both experts so text guides images and images sharpen text understanding.
- Capture long-range story structure.
Why it matters: Without interleaving, the model can’t deeply learn how scenes evolve across shots.
- 🍞 Anchor: The script says, “She enters the hidden chamber,” then a chamber image appears; the model learns that these go together.
- Disentangled Expert Learning
- 🍞 Hook: Think of band practice: first everyone plays together, then the choir and the orchestra rehearse separately to polish their parts.
- 🥬 What it is: After interleaving, train the text expert on pure scripts and the image expert on text–image pairs, with the text branch frozen when improving images.
How it works:
- Train writer on only text to boost logic and continuation.
- Train painter on text–image while stopping gradients into writer.
- Each expert gets sharper without confusing the other.
Why it matters: Without this separation, the model may be less flexible at continuing or extending stories, and image fidelity may be capped.
- 🍞 Anchor: The model learns to continue a scene cleverly from partial context while also improving faces and lighting in pictures.
- In-Context ID Prompting
- 🍞 Hook: Imagine giving actors name tags and scenes labels so everyone stays consistent.
- 🥬 What it is: Special tokens placed among the vision tokens that mark frame, character, and environment IDs.
How it works:
- Insert ID tokens among ViT/VAE tokens.
- Give them full attention connections.
- Tie each image to the right people/places across time.
Why it matters: Without IDs, characters drift: hair, faces, and clothes change across shots.
- 🍞 Anchor: “Character2” stays the same blond-haired young man from Shot 1 through Shot 9.
- Pre-Context Script Splitting
- 🍞 Hook: You know how storytellers ask, “Previously on…” before adding the next episode?
- 🥬 What it is: Randomly split scripts and insert an <Extension> or <Continuation> signal so the model learns to extend or continue from context.
How it works:
- Cut a script into two parts.
- Add a short prompt or system signal.
- Train the model to write the next shots smoothly.
Why it matters: Without this, continuations repeat content or break logic.
- 🍞 Anchor: After “They reach the temple,” the model naturally adds “the floor glows and guardians rise,” not a random rewind.
Before vs After:
- Before: Separate writing and drawing often led to mismatched stories and faces.
- After: One director model plans both, keeps IDs consistent, and continues stories smoothly.
Why it works (intuition):
- Shared attention lets text and image tokens co-inform long context.
- Interleaving teaches tight word–picture alignment.
- Disentangling sharpens each expert.
- IDs lock down who’s who and where.
- Pre-context practice builds the “keep going” skill.
🍞 Anchor: It’s like writing and illustrating a graphic novel with character bios, scene labels, and chapter recaps—first in one notebook, then editing text and art separately while keeping the same binder of labels.
03 Methodology
High-level pipeline: Input prompt → Generate structured script text → (Optional) Extend/Continue script → Split into shots → Generate keyframe images per shot → Hand off to audio–video generator.
Step 1: Script structure (G, C, F)
- 🍞 Hook: Imagine a travel plan that has a guest list, a day-by-day plan, and photos of key places.
- 🥬 What it is: The script has Global descriptions (G: characters and environments), Content (C: per-shot Frame and Video descriptions), and Keyframes (F: images per shot). Special tokens like <Character1> and <Environment2> reference the same entities across shots.
How it works:
- The model writes G first: who and where.
- Then C: frame-level (static layout) and video-level (actions/dialogue/timing).
- Then F: one keyframe per shot.
Why it matters: Without a consistent structure, the model can’t keep track of people/places across many scenes.
- 🍞 Anchor: “<Character1> a scientist” appears in G, and every shot that includes her uses <Character1>, keeping her identity stable.
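To make the G/C/F layout concrete, here is a minimal data-structure sketch in Python. The class and field names (Script, Shot, keyframe_path, and so on) are hypothetical illustrations of the structure described above, not the paper's actual code; only the three-part layout and the <CharacterN>/<EnvironmentN> reference tags come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Shot:
    frame_description: str               # static layout of the keyframe (frame-level C)
    video_description: str               # actions, dialogue, timing (video-level C)
    keyframe_path: Optional[str] = None  # filled in once the keyframe (F) is generated

@dataclass
class Script:
    characters: Dict[str, str] = field(default_factory=dict)    # "<Character1>" -> description (G)
    environments: Dict[str, str] = field(default_factory=dict)  # "<Environment1>" -> description (G)
    shots: List[Shot] = field(default_factory=list)             # per-shot content (C) plus keyframes (F)

script = Script(
    characters={"<Character1>": "a scientist in a weathered field jacket"},
    environments={"<Environment1>": "a jungle ruin overgrown with vines"},
    shots=[Shot(
        frame_description="<Character1> stands at the entrance of <Environment1>, torch raised.",
        video_description="<Character1> steps inside as distant drums swell on the soundtrack.",
    )],
)
```

Because every shot refers back to the same tags, the model (and any downstream tool) can resolve who and where each shot is about without restating descriptions.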
Step 2: Mixture-of-Transformers details
- Tokens:
- Text tokens: for script writing (Next Token Prediction, i.e., predict one wordpiece at a time).
- ViT tokens: for understanding images and linking them to text meaning.
- VAE tokens: for image generation latents, refined using Rectified Flow to remove noise.
- Why this step exists: Shared attention makes text and images inform each other so story beats and visual cues align.
- Example with data: When the text says “glowing tablet,” ViT tokens and VAE tokens attend to that phrase and learn to show consistent glow.
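The sketch below shows one way such a block could look in PyTorch: a single self-attention shared by all tokens, followed by a per-modality feed-forward "expert". The module names, dimensions, and boolean modality mask are assumptions for illustration; the paper's exact architecture and routing are only summarized in prose above, not reproduced here.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One transformer block with shared attention and per-modality FFN experts (illustrative)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared by all tokens

        def make_ffn():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        self.norm2 = nn.LayerNorm(dim)
        self.text_expert = make_ffn()    # serves text tokens (next-token prediction branch)
        self.image_expert = make_ffn()   # serves ViT/VAE tokens (rectified-flow branch)

    def forward(self, x: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_text: (batch, seq) bool mask marking text tokens.
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)        # text, ViT, and VAE tokens attend jointly
        x = x + h
        h = self.norm2(x)
        # For clarity, both experts run on every token and the mask selects the output;
        # a real implementation would route tokens instead of wasting the extra compute.
        out = torch.where(is_text.unsqueeze(-1), self.text_expert(h), self.image_expert(h))
        return x + out

tokens = torch.randn(1, 10, 512)                       # e.g. 6 text tokens followed by 4 image tokens
modality = torch.tensor([[True] * 6 + [False] * 4])
y = MoTBlock()(tokens, modality)
```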
Step 3: Interleaved Concept Learning (joint training)
- What happens:
- Feed long sequences where text and images alternate by shots.
- Train both experts together: text predicts next tokens; image predicts denoised latents.
- Learn global narrative patterns and per-shot visual alignment.
- Why it exists: Long stories need the model to connect earlier events to later images.
- Example: Shot 1 sets “who/where,” Shot 7 shows the same people in a new place without losing identity.
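A minimal sketch of the two training signals, assuming PyTorch: standard next-token cross-entropy for the text stream, and a rectified-flow objective for the image latents (predict the constant velocity along a straight-line noise-to-data path). The loss weighting, time sampling, and exact flow parameterization used in the paper are not specified here, so treat this as the generic textbook formulation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); targets are the same token ids shifted by one position.
    return F.cross_entropy(logits[:, :-1].transpose(1, 2), target_ids[:, 1:])

def rectified_flow_loss(image_expert, clean_latent: torch.Tensor) -> torch.Tensor:
    # image_expert is a placeholder for the image branch: it maps (x_t, t) -> predicted velocity.
    # clean_latent: (batch, channels, height, width) VAE latent of the keyframe.
    noise = torch.randn_like(clean_latent)           # x_0 ~ N(0, I)
    t = torch.rand(clean_latent.shape[0], 1, 1, 1)   # random time in [0, 1]
    x_t = (1 - t) * noise + t * clean_latent         # straight-line interpolation
    target_velocity = clean_latent - noise           # constant velocity along the path
    return F.mse_loss(image_expert(x_t, t), target_velocity)

# During interleaved learning both losses are optimized together, e.g.
#   loss = next_token_loss(text_logits, text_ids) + rectified_flow_loss(image_expert, latents)
```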
Step 4: In-Context ID Prompting (stabilizing identities)
- What happens:
- Insert special ID tokens (frame, character, environment) among ViT/VAE tokens.
- Give them full attention links to their image tokens.
- Encode consistent identity signals across time.
- Why it exists: Prevents face/clothes/location drift across multi-shot sequences.
- Example: The blond-haired Character2 keeps the same hair and coat in every jungle shot.
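A minimal sketch of how ID tags might be laid out alongside each shot's vision tokens. The tag strings (<FrameN>, <CharacterN>, <EnvironmentN>) follow the description above; the placeholder vision tokens and the exact placement are assumptions, since the paper describes full attention between ID tokens and image tokens rather than a specific textual layout.

```python
from typing import List

def build_shot_tokens(frame_index: int, character_ids: List[int],
                      environment_id: int, num_vision_tokens: int) -> List[str]:
    tokens = [f"<Frame{frame_index}>"]                          # frame ID tag
    tokens += [f"<Character{c}>" for c in character_ids]        # who appears in this shot
    tokens.append(f"<Environment{environment_id}>")             # where the shot takes place
    tokens += [f"<vis_{i}>" for i in range(num_vision_tokens)]  # stand-ins for ViT/VAE tokens
    return tokens

# Shot 3 shows Character2 inside Environment1; the same tags reappear in every shot
# that reuses these entities, so attention can tie their appearance together over time.
print(build_shot_tokens(3, character_ids=[2], environment_id=1, num_vision_tokens=4))
```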
Step 5: Disentangled Expert Learning (specializing)
- What happens:
- Text expert trains on large pure-text script sets (no images) to strengthen logic and continuation skills.
- Image expert trains on interleaved text–image data and on single-shot pairs for fidelity, freezing the text branch (stop-gradient) so visuals improve without shifting text understanding.
- Why it exists: Relaxing the tight coupling improves flexibility: better story continuation and crisper images.
- Example: The text expert learns to add new twists (“guardians awaken”) from partial context, while the image expert learns clean lighting and faces from high-quality pairs.
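A minimal sketch of the freezing step, assuming a PyTorch model whose text-expert parameters can be identified by name (as in the MoTBlock sketch earlier). Exactly which parameters are frozen, and whether shared attention is included, is an assumption; the description above only says gradients stop flowing into the text branch while the image expert trains.

```python
import torch.nn as nn

def freeze_text_branch(model: nn.Module) -> None:
    # Stop-gradient into the writer: image-side training no longer shifts text understanding.
    for name, param in model.named_parameters():
        if "text_expert" in name:
            param.requires_grad = False

# In the same stage, the text expert is trained separately on pure-text scripts
# with the usual next-token loss, so each branch specializes without a tug-of-war.
```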
Step 6: Pre-Context Script Splitting (extension and continuation)
- What happens:
- Randomly split scripts.
- Insert <Extension> with a user-like hint (summarized by a helper LLM) or <Continuation> before the last shot.
- Train the text expert to keep going naturally.
- Why it exists: Prevents repetition and helps the model evolve the plot over long arcs.
- Example: After “They find the chamber,” the model continues with “The floor patterns glow; stone guardians rise,” not a reset.
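A minimal sketch of turning one finished script into an extension/continuation training sample. The <Extension>/<Continuation> control tokens and the helper-LLM-summarized hint come from the description above; the function name, split policy, and plain-text formatting are illustrative assumptions.

```python
import random
from typing import List, Optional

def make_pre_context_sample(shot_texts: List[str], hint: Optional[str] = None) -> str:
    split = random.randint(1, len(shot_texts) - 1)   # random split point inside the script
    context = "\n".join(shot_texts[:split])          # what the model is given ("previously on...")
    target = "\n".join(shot_texts[split:])           # what the model must learn to write next
    control = f"<Extension> {hint}" if hint else "<Continuation>"
    return f"{context}\n{control}\n{target}"

sample = make_pre_context_sample(
    ["Shot 1: They reach the temple.",
     "Shot 2: The floor patterns glow.",
     "Shot 3: Stone guardians rise."],
    hint="introduce the temple guardians",
)
```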
Step 7: Inference (how users use it)
- What happens:
- User gives a prompt.
- Model writes G and C (multi-shot script).
- User can insert new prompts to extend or ask the model to continue.
- Model segments by shots and renders F (keyframes) with ID-stable visuals.
- The outputs guide an external audio–video generator.
- Why it exists: Separating text-first then images keeps logic tight and visuals consistent while letting users steer the story.
- Example: An astrophysicist’s speech becomes a multi-shot script and matching keyframes; a video model then animates it with aligned voice and music.
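A minimal sketch of the user-facing flow, with the model and the external generator abstracted behind callables. write_script, draw_keyframe, and the lambda stubs are placeholders rather than real APIs; the sketch only mirrors the text-first-then-images ordering described in this step.

```python
from typing import Callable, List, Tuple

def create_story(prompt: str,
                 write_script: Callable[[str], List[str]],
                 draw_keyframe: Callable[[str, int], str],
                 extension_hints: Tuple[str, ...] = ()) -> List[Tuple[str, str]]:
    shots = write_script(prompt)                            # G + C as per-shot text
    for hint in extension_hints:                            # user steers the plot mid-way
        shots = shots + write_script(f"<Extension> {hint}")
    # Render one ID-stable keyframe per shot; an external audio-video generator
    # then animates each (shot text, keyframe) pair.
    return [(text, draw_keyframe(text, i)) for i, text in enumerate(shots)]

# Stub usage, just to show the call shape:
demo = create_story(
    "A treasure hunt in a jungle.",
    write_script=lambda p: [f"Shot planned from: {p}"],
    draw_keyframe=lambda text, i: f"keyframe_{i}.png",
)
```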
Secret sauce (why this method is clever):
- First interleave to build a shared mental map of story+image.
- Then disentangle to let each brain excel without tug-of-war.
- Lace the visuals with identity name tags (ID tokens) so faces and places don’t drift.
- Teach “resume storytelling from here” via pre-context splitting, so long narratives remain fresh and coherent.
What breaks without each part:
- Without interleaving: weaker story–image alignment.
- Without ID prompting: characters change faces or clothes unexpectedly.
- Without disentangling: weaker continuation and capped image fidelity.
- Without pre-context splitting: repetitive or stalled continuation.
04 Experiments & Results
The test (what they measured and why): The team evaluated how well UniMAGE keeps characters consistent, follows the prompt’s plot, and produces good-looking images—especially across many shots. They used ViStoryBench, a public benchmark for story visualization, plus human user studies and ablation tests.
The competition (baselines):
- StoryDiffusion: focuses on consistency across images via attention tweaks.
- Story2Board: training-free storyboard creation.
- SEED-Story: a unified model but trained mainly on animated styles and often needs per-story setups.
- TheaterGen and Story-Adapter (for broader context comparisons).
The scoreboard (with context):
- Character Identification Similarity (CIDS): UniMAGE hits about 59.2. Think of this like recognizing your friend in a class photo every time—UniMAGE does it more reliably than others.
- Onstage Character Count Matching (OCCM): ~88.07. That’s like calling attendance correctly almost all the time, even when scenes get crowded.
- Alignment (how closely images match text prompts and script): ~80.8—like getting an A when many others score in the B range.
- Image Quality (Inception) and Aesthetics: UniMAGE is competitive; while a few baselines may win on a single score here or there, they don’t balance this with consistency and alignment as well as UniMAGE.
What the qualitative examples showed:
- Multi-character scripts: Baselines often changed faces or hair between shots. UniMAGE kept identities steady across new camera angles and locations.
- Long-form scripts: Some baselines repeated scenes or styles. UniMAGE varied scenes appropriately while preserving plot and style.
- Cross-domain generalization: SEED-Story struggled outside stylized animation. UniMAGE handled a broader range of genres and looks.
Surprising findings:
- SEED-Story’s high style similarity (CSD) often came from overfitting to a narrow animation style—great style sameness, weaker generalization.
- Removing ID Prompting (the ablation) sharply reduced consistency metrics (CSD, CIDS, OCCM), confirming that the name-tag idea is essential.
- Without Pre-Context Script Splitting, continuations tended to repeat content and weaken narrative flow; with it, the model advanced plots more naturally.
User study (making numbers human): In a study with 50 people judging 40 stories, UniMAGE consistently ranked best for overall quality, plot alignment, and character consistency. Think of this as most viewers preferring UniMAGE’s “mini-movies” because the plot made sense and the characters stayed recognizable.
Takeaway: UniMAGE didn’t just chase pretty pictures. It balanced pretty pictures with stable characters and a plot that moves forward logically—exactly what you need for film-like storytelling.
05 Discussion & Limitations
Limitations (honest look):
- Emotional pacing and fine directorial style: UniMAGE keeps logic and identities, but nuanced “mood arcs,” advanced cinematography choices, and subtle pacing are still early.
- Dependency on data and compute: Training the shared transformer and experts on hundreds of thousands of sequences needs strong GPUs and curated data.
- External video/audio quality limits: Final moving videos depend on separate video/audio generators, which can still drift in voice or facial features over very long outputs.
- Granular control: While you can extend/continue stories, super-precise control (exact lens choice or micro-acting beats) remains limited.
Required resources:
- A MoT-based foundation model with text and diffusion components.
- Large mixed datasets: multi-shot text–image scripts, pure-text scripts, single-shot text–image pairs.
- Training time for both interleaved and disentangled stages.
When not to use:
- If you only need a single image or ultra-short clip, simpler tools are faster.
- If you must lock down cinematic micro-details (e.g., exact 35mm lens, shot speed ramps) in every shot, you may need specialized control systems.
- If your domain has strict, rare visual styles with no training data, results may be less stable.
Open questions:
- How to model emotional beats and pacing over long arcs with explicit controls (music swells, tension curves)?
- How to bring camera grammar (lenses, framing, blocking) into the script structure as first-class, controllable elements?
- How to improve identity and voice persistence in full-length generated videos, not just keyframes?
- How to reduce compute and data needs while keeping long-form coherence?
- How to expand beyond visuals and dialogue to include sound design and score timing as co-equal script elements?
06 Conclusion & Future Work
Three-sentence summary: UniMAGE is a single “director” model that writes structured, multi-shot scripts and generates matching keyframes, keeping long stories coherent. It learns by first interleaving text and images to align reasoning and visuals, then disentangling writing and drawing so each becomes an expert—helped by ID name tags and pre-context splitting. The result is state-of-the-art open-source performance on character consistency and narrative alignment for long-form story creation.
Main achievement: Showing that a unified, mixture-of-transformers director—trained “first interleave, then disentangle,” with identity tags and continuation practice—can bridge user imagination to film-like scripts and pictures that hold together across many shots.
Future directions:
- Add richer cinematic controls (lenses, blocking, pacing curves) and emotional arcs.
- Tighten identity/voice consistency in full-length audio–video outputs.
- Make training more efficient and broaden domain coverage (documentary, sports, education).
Why remember this: It turns scattered tools into one director brain that plans the story and the visuals together, making long, coherent, and consistent story creation accessible to non-experts—and giving today’s video generators the strong, structured guidance they need to shine.
Practical Applications
- Create multi-shot storyboards for short films, ads, and trailers with consistent characters.
- Generate lesson-story scripts and keyframes for classrooms (history reenactments, science demos).
- Draft episodic web series outlines with smooth continuation between episodes.
- Produce previsualization packs (script + keyframes) for indie filmmakers before shooting.
- Generate marketing narratives with stable brand characters across scenes.
- Design game cutscene scripts and reference frames to guide in-engine cinematics.
- Make accessible audiobooks with scene-by-scene images and clear dialogue cues.
- Plan documentary segments (interviews, b-roll) with coherent visual guides.
- Prototype interactive stories where users extend or branch the narrative mid-way.
- Feed structured outputs to video generators to reduce prompt engineering time.