
AutoMV: An Automatic Multi-Agent System for Music Video Generation

Intermediate
Xiaoxuan Tang, Xinping Lei, Chaoran Zhu et al. Ā· 12/13/2025
arXiv Ā· PDF

Key Summary

  • AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.
  • It first listens to the song to find parts like intro, verse, and chorus, separates the vocals, and lines up the lyrics with the music.
  • A Screenwriter AI plans the story and characters, and a Director AI writes shot-by-shot instructions and makes key images to start each shot.
  • The system uses different video tools for different needs: general story shots or close-up singing shots with lip-sync.
  • A Verifier AI checks if each clip follows the script, looks realistic, and keeps characters consistent, and it asks for retries if needed.
  • AutoMV outperforms strong commercial tools on a 30-song test in audio–visual alignment and expert-rated quality while costing about 10–20 USD and taking roughly 30 minutes per song.
  • Removing lyrics hurts storytelling and music matching, removing the character bank breaks identity consistency, and removing the verifier reduces realism and polish.
  • AutoMV also proposes a fair scoring checklist with 12 criteria and tests large multimodal models as automatic judges, which are promising but still not as good as human experts.
  • This approach lowers the barrier for indie musicians to get coherent, song-synced videos without a big team or budget.
  • There is still a gap to professional music videos, especially in long-term character consistency, perfect beat-synced motion, and top-tier creativity.

Why This Research Matters

AutoMV lowers the cost and time for making real music videos, helping indie artists and small teams compete with bigger studios. It turns a song directly into a coherent, story-driven video that actually respects the music’s structure and lyrics, not just a set of random pretty scenes. This can boost artist discovery on social platforms where tight audio–visual sync wins attention. Teachers and students can also use it to learn storytelling, editing, and music analysis by seeing how structure maps to visuals. The evaluation tools encourage fairer comparisons and faster progress for future research. Overall, it democratizes a key creative medium without needing a large crew or budget.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you want to make a music video for your favorite song, but instead of hiring a whole crew—writer, director, camera team—you could ask one smart system to do it all in an afternoon.

🄬 The Concept (Music-to-Video, or M2V): M2V is the task of creating a video that fits a song’s sound, structure, and lyrics. How it worked before AutoMV:

  1. A model takes some input like text prompts or a short audio clip.
  2. It generates a short, pretty video.
  3. It struggles to stay consistent across minutes, keep the same characters, or match the beat and lyrics. Why it matters: Without true music-awareness and long-term planning, you get a patchwork of clips instead of a real music video.

šŸž Anchor: Think of earlier systems like flipping through random TikTok clips while a song plays—fun for a moment, but not a true music video that tells a story.

šŸž Hook: You know how songs have parts—like the intro that sets the mood, verses that tell the story, and the chorus that repeats the big idea?

🄬 The Concept (Music Structure): Music structure is how a song is divided into sections (intro, verse, chorus, bridge) over time. How it works:

  1. Detect boundaries where the music changes.
  2. Label each section (verse, chorus, etc.).
  3. Use those labels to plan scenes and shot timing. Why it matters: Without structure, the video can cut or change at the wrong times, missing emotional peaks like the chorus.

šŸž Anchor: When the chorus hits, the video should also open up—brighter shots, wider angles, or a big dance moment—so it feels powerful.

šŸž Hook: Tap your foot to a beat—your body finds the rhythm without thinking.

🄬 The Concept (Beat and Lyric Alignment): Beat alignment means visuals line up with the rhythm; lyric alignment means mouth shapes and scenes match the sung words. How it works:

  1. Track the rhythm and syllables.
  2. Place cuts and actions on strong beats.
  3. Match lip motion to vowel/consonant timing. Why it matters: Without this, the video feels off—cuts arrive late, dancing looks out-of-sync, and lips don’t match words.

šŸž Anchor: If the singer says ā€˜shine’ and the shot brightens right then, it feels magical; if it happens a second later, it feels wrong.

šŸž Hook: Imagine reading a comic where the hero randomly changes hair and clothes every page—you’d lose the story.

🄬 The Concept (Temporal Consistency): Temporal consistency keeps the same characters, style, and story logic across minutes. How it works:

  1. Define character looks once.
  2. Reuse those details in every shot.
  3. Keep camera style and colors steady. Why it matters: Without consistency, viewers can’t follow who’s who or feel connected to the story.

šŸž Anchor: If the main singer suddenly has a different face in the next shot, you notice—and it breaks the mood.

The world before AutoMV:

  • AI could make stunning short clips but stumbled on long, song-length videos.
  • Adding lyrics to prompts helped a little, but videos still missed the music’s structure and often changed characters mid-video.
  • Creators either: (a) paid big teams over many weeks, or (b) settled for short, disjointed AI clips.

The problem researchers faced:

  • How to make a full, multi-minute MV that stays consistent, follows the song’s structure and beat, and matches lyrics—all automatically.

Failed attempts and why they fell short:

  • Single-model, single-prompt approaches: Looked good for 5–10 seconds but fell apart across a full song (lost identity, missed the chorus, weird timing).
  • Prompting with lyrics only: Gave literal scenes but ignored rhythm and section changes.
  • One-size-fits-all video backends: Great for scenery, weak for lip-sync; great for faces, less flexible for creative scenes.

The gap this paper fills:

  • A music-aware, role-based ā€˜film crew’ of agents that plans the story by song structure, stabilizes character identity, picks the right video tool per shot, and automatically checks quality.
  • A fair way to evaluate long MVs with criteria humans actually care about: timing, story, visuals, and artfulness.

Real stakes in daily life:

  • Indie musicians and small creators can’t afford $10k per track and weeks of studio time.
  • Social video platforms reward tight audio–visual sync and clear stories.
  • A practical system could make professional-feeling MVs accessible, boosting discovery and income for small artists.

šŸž Anchor: With AutoMV, an artist can upload a song in the morning and share a coherent, song-synced music video that afternoon—without hiring a crew.

02Core Idea

šŸž Hook: Picture a movie studio where each expert knows exactly what to do—writer, director, casting, camera, editor—and they all read the same script.

🄬 The Concept (The Aha!): Treat music video making like a coordinated movie studio run by specialized AI agents that listen to the song first, plan by its structure and lyrics, then generate and verify each shot. How it works (at a glance):

  1. Listen and extract: Find structure, beats, vocals, and time-aligned lyrics.
  2. Plan: A Screenwriter agent writes a time-aligned story with characters.
  3. Direct: A Director agent turns that plan into shot prompts and keyframes.
  4. Render: Use the best video tool per shot (story vs. singing).
  5. Verify: A Verifier agent checks realism, identity, and alignment, then accepts or retries. Why it matters: Without this orchestration, you get pretty but mismatched clips; with it, you get a coherent MV that rides the music.

šŸž Anchor: It’s like a sports team with clear positions; when each role does its part and passes the ball on time, the whole play works.

Three analogies:

  1. Orchestra: The song is the sheet music; agents are sections (strings, brass, percussion); the Verifier is the conductor keeping everyone in time.
  2. Comic storyboard: The screenwriter drafts panels (scenes), the director inks camera angles, and the renderer colors; the verifier checks continuity.
  3. Lego set: MIR tools sort bricks by color/shape (beats/sections), the script is the manual, the director assembles, and the verifier ensures nothing wobbles.

Before vs. After:

  • Before: One-shot prompting, short clips, identity drift, poor lyric/beat sync.
  • After: Time-aligned script, stable characters, beat-aware cuts, targeted lip-sync shots, and automated quality control.

Why it works (intuition, not equations):

  • Planning by the music’s map (structure + lyrics) creates the right places to change shots and emotions.
  • Separating roles reduces confusion: writers write, directors direct, renderers render, verifiers judge.
  • Switching backends per need is like using the right tool: a general video model for broad scenes, a lip-sync model for close-up singing.
  • Automatic verification closes the loop: catch mistakes, regenerate, and improve consistency.

Building blocks (each one matters):

šŸž Hook: You know how a GPS first builds a map before giving directions? 🄬 The Concept (Music Information Retrieval, MIR): MIR turns sound into structured info—sections, beats, vocals, and lyrics with timestamps. How it works:

  1. Segment into intro/verse/chorus/bridge.
  2. Separate vocals from instruments.
  3. Transcribe and align lyrics in time. Why it matters: Without this map, the story can’t land at the chorus when the music does. šŸž Anchor: When the chorus arrives at 0:49, the plan knows to widen the shot and lift the mood right then.

šŸž Hook: Imagine a writer who can hear the song and instantly sketch the story beats. 🄬 The Concept (Screenwriter Agent): An AI that writes a time-aligned mini-script and designs characters. How it works:

  1. Read structure and lyrics.
  2. Propose segments (3–15 seconds) with meanings and moods.
  3. Define characters and their roles. Why it matters: Without a script, shots feel random instead of telling a story. šŸž Anchor: For the line 'It was just two lovers,' the script plans a tender park scene for that exact time window.
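As a rough illustration of what such a time-aligned mini-script could look like as data, here is a small sketch; the field names and example values are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScriptSegment:
    """One planned shot window, tied to the song's timeline."""
    start: float                       # seconds into the song
    end: float
    section: str                       # e.g., "verse", "chorus"
    lyric: str                         # lyric sung in this window ("" if instrumental)
    mood: str                          # e.g., "tender", "euphoric"
    scene: str                         # natural-language scene description
    characters: List[str] = field(default_factory=list)  # names defined by the Screenwriter

# Illustrative segment matching the park-scene example above.
segment = ScriptSegment(
    start=18.0, end=25.0, section="verse",
    lyric="It was just two lovers", mood="tender",
    scene="Two lovers stroll through a sunlit park; soft light, relaxed movement.",
    characters=["Singer", "Partner"],
)
print(segment)
```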

šŸž Hook: A director turns words into cameras, actions, and key images. 🄬 The Concept (Director Agent + Keyframes): The Director writes shot details and generates keyframes to anchor the look. How it works:

  1. Retrieve character details from a shared bank.
  2. Write camera moves, actions, and environment.
  3. Generate a keyframe; reuse last frames to keep continuity. Why it matters: Without keyframes and consistent descriptions, characters and styles drift. šŸž Anchor: The last frame of Shot 1 becomes the starting image of Shot 2 to prevent sudden outfit or face changes.

šŸž Hook: Sing-alongs work best close-up—you watch the mouth. 🄬 The Concept (Lip-sync Backend): A specialized renderer for accurate singing shots. How it works:

  1. Feed the clean vocal track.
  2. Map syllables to mouth shapes.
  3. Render close-ups where articulation matters (e.g., chorus entry). Why it matters: Without proper lip-sync, singing looks fake. šŸž Anchor: On the word 'shine,' the mouth shape and the visual brightness peak together.

šŸž Hook: Quality control is like a safety inspector who won’t let a wobbly bridge open. 🄬 The Concept (Verifier Agent): An AI judge that checks realism, script match, and identity consistency. How it works:

  1. Score candidates for physical feasibility (no warped limbs, plausible motion).
  2. Check alignment with script, captions, and lyrics.
  3. Approve the best or ask for a retry. Why it matters: Without verification, small errors pile up into a distracting video. šŸž Anchor: If a dance move looks floaty or a face changes, the verifier flags it and requests a new take.

Put together, these pieces turn a raw song into a planned, rendered, and checked long-form MV that feels like the music.

03Methodology

At a high level: Input song → Music-aware preprocessing → Time-aligned script and characters → Keyframes + shot prompts → Backend video generation (story or singing) → Verifier scoring and reruns → Assemble full MV.

Step-by-step recipe with why each step exists and concrete examples:

šŸž Hook: Imagine building a house—you need a blueprint before lifting a hammer. 🄬 The Concept (Music-aware Preprocessing): Convert the audio into a blueprint of sections, lyrics, and clean vocals. What happens:

  1. Music captioning: Describe genre, mood, instruments, and likely singer traits.
  2. Structure analysis: Split the song into intro/verse/chorus/bridge.
  3. Source separation: Split vocals from backing track for cleaner processing.
  4. Lyrics transcription and alignment: Turn singing into timed words. Why this step exists: Without a blueprint, timing drifts and storytelling misses the right musical moments. Example: The system detects Verse 1 from 0:18–0:49 and chorus from 0:49–1:02, and aligns the word 'golden' to 0:52.3–0:52.9. šŸž Anchor: When chorus starts at 0:49, the plan switches to brighter shots and wider angles to match the musical lift.
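The sketch below shows one way two of these steps could be approximated with common open-source tools: librosa for rough, unlabeled section boundaries and openai-whisper for word-level lyric timing. The paper does not specify these exact tools, section labeling and music captioning would need additional models, and the file paths are placeholders.

```python
import librosa
import whisper  # openai-whisper: one possible choice for timed lyrics

SONG = "song.wav"      # placeholder path to the full mix
VOCALS = "vocals.wav"  # placeholder path to a vocal stem from a source-separation tool

# Structure boundaries (unlabeled): cluster chroma features into k segments.
y, sr = librosa.load(SONG)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
k = 8  # assumed number of sections; a real system would estimate this
boundary_times = librosa.frames_to_time(librosa.segment.agglomerative(chroma, k), sr=sr)
print("Candidate section boundaries (s):", [round(float(t), 1) for t in boundary_times])
# Labeling boundaries as intro/verse/chorus and writing a music caption would
# require additional models and is not shown here.

# Time-aligned lyrics from the separated vocal track.
model = whisper.load_model("base")
result = model.transcribe(VOCALS, word_timestamps=True)
for seg in result["segments"][:2]:
    for w in seg.get("words", []):
        print(f'{w["word"].strip()}: {w["start"]:.2f}-{w["end"]:.2f} s')
```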

šŸž Hook: A good story starts with who and where, not just pretty pictures. 🄬 The Concept (Screenwriter Agent): Turn the blueprint into a time-aligned story plan. What happens:

  1. Segment the track into 3–15 s shots that respect lyric and section boundaries.
  2. For each segment, extract the meaning (theme, emotion, imagery) from the lyrics and music.
  3. Propose scene descriptions that match the meaning and the mood. Why this step exists: Without a story plan, shots feel like a random slideshow. Example: For 'It was just two lovers,' the screenwriter schedules a 7 s park scene with soft sunlight and relaxed movement. šŸž Anchor: The plan marks 'shine' at 0:51.8 and requests a glow effect and a cut to a smiling close-up right there.
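As a simple illustration of the segmentation step, here is a sketch of a heuristic that splits one song section into shot windows no longer than 15 seconds. A real planner would also snap boundaries to lyric lines and strong beats and enforce the 3-second minimum; the numbers below are illustrative.

```python
import math
from typing import List, Tuple

def split_into_shots(start: float, end: float, max_len: float = 15.0) -> List[Tuple[float, float]]:
    """Split one song section into shot windows no longer than max_len seconds.

    Minimal heuristic: use as few shots as possible, then spread the duration
    evenly across them.
    """
    duration = end - start
    n_shots = max(1, math.ceil(duration / max_len))
    step = duration / n_shots
    return [(round(start + i * step, 3), round(start + (i + 1) * step, 3)) for i in range(n_shots)]

# Example: a 31-second verse (0:18-0:49) becomes three shots of about 10.3 s each.
print(split_into_shots(18.0, 49.0))
```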

šŸž Hook: Casting keeps a movie believable; you don’t want a hero who looks different in every scene. 🄬 The Concept (Character Bank): A shared profile that locks identity details. What happens:

  1. Define age, gender, skin tone, hair, outfit, and style notes.
  2. Store profiles where all agents can retrieve them.
  3. Inject details into every keyframe and shot prompt. Why this step exists: Without it, faces/outfits drift and viewers lose the thread. Example: 'Singer: mid-20s, curly dark hair, denim jacket, warm skin tone, silver pendant.' These attributes reappear across shots. šŸž Anchor: When the camera cuts from a wide street scene to a close-up, the necklace and hair curls match—so you trust it’s the same person.
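A character bank can be as simple as a shared lookup of locked identity strings that get injected into every prompt. The sketch below is a minimal illustration; the profile fields and the example 'Singer' entry echo the text above but are otherwise assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class CharacterProfile:
    name: str
    appearance: str   # locked identity details reused in every prompt
    wardrobe: str
    style_notes: str

CHARACTER_BANK = {
    "Singer": CharacterProfile(
        name="Singer",
        appearance="mid-20s, curly dark hair, warm skin tone, silver pendant",
        wardrobe="denim jacket over a white tee",
        style_notes="natural look, soft warm lighting",
    ),
}

def inject_characters(shot_prompt: str, names: List[str]) -> str:
    """Prepend locked identity details so every keyframe and shot prompt agrees."""
    details = "; ".join(
        f"{c.name}: {c.appearance}, wearing {c.wardrobe} ({c.style_notes})"
        for c in (CHARACTER_BANK[n] for n in names)
    )
    return f"{details}. {shot_prompt}"

print(inject_characters("Close-up, she smiles as the chorus hits.", ["Singer"]))
```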

šŸž Hook: A single image can lock in the look so the following frames don’t wander. 🄬 The Concept (Keyframe Generation): Create a reference image that defines each shot’s style and characters. What happens:

  1. Director writes an image prompt from the script and character bank.
  2. Generate 1–3 candidate keyframes.
  3. Reuse the last frame of the previous clip to preserve continuity when needed. Why this step exists: Without keyframes, the video model can drift in style or character. Example: The city street background, jacket texture, and pendant glint stay steady across two adjacent shots. šŸž Anchor: The last frame of Shot A becomes the first frame of Shot B; the hand position and lighting still match.
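For the reuse-the-last-frame trick, here is a small sketch using OpenCV to grab the final frame of a rendered clip so it can condition the next shot. How that image is then passed to an image-to-video backend depends on the specific API, which is not shown here; the file names are placeholders.

```python
import cv2

def save_last_frame(video_path: str, out_path: str) -> str:
    """Save the final frame of a rendered clip so the next shot can start from it."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # may be approximate for some codecs
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames - 1, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read the last frame of {video_path}")
    cv2.imwrite(out_path, frame)
    return out_path

# The saved image would then be handed to an image-to-video backend as the
# first-frame condition for the next shot (backend API not shown).
# save_last_frame("shot_A.mp4", "shot_B_start.png")
```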

šŸž Hook: Different jobs need different tools—a paintbrush is not a chisel. 🄬 The Concept (Backend Selection): Choose the right video engine for the shot. What happens:

  1. For general story and cinematic shots, use a cinematic video generator (e.g., Doubao API), possibly by stitching 1–3 subclips.
  2. For singing shots where mouth shapes matter, use a lip-sync-focused model (e.g., Qwen-Wan-2.2) with the clean vocal track.
  3. Align all subclips to a frame grid (like 1/24 s) so audio and video stay in lockstep. Why this step exists: Without switching tools, either story shots look stiff or lip-sync looks fake. Example: The chorus close-up uses the lip-sync backend; the following dance montage uses the cinematic backend. šŸž Anchor: On 'shine,' the singer’s mouth shape is perfect and the cut to a wider dance shot still lands exactly on the beat.
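A minimal sketch of the routing and frame-grid idea is shown below; the backend names and the close-up-with-vocals rule are illustrative assumptions, not the paper's exact selection logic.

```python
FPS = 24  # target frame grid (1/24 s) so audio and video stay in lockstep

def snap_to_frame_grid(t: float, fps: int = FPS) -> float:
    """Round a timestamp to the nearest frame boundary."""
    return round(t * fps) / fps

def choose_backend(shot: dict) -> str:
    """Pick a renderer per shot.

    Assumed routing rule: close-ups with vocals go to the lip-sync engine,
    everything else to the general cinematic engine.
    """
    if shot.get("has_vocals") and shot.get("framing") == "close-up":
        return "lip_sync_backend"   # fed the clean vocal stem
    return "cinematic_backend"      # general text/image-to-video model

shot = {"start": 49.2, "end": 52.5, "has_vocals": True, "framing": "close-up"}
shot["start"] = snap_to_frame_grid(shot["start"])
shot["end"] = snap_to_frame_grid(shot["end"])
print(choose_backend(shot), shot)
```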

šŸž Hook: A second pair of eyes catches mistakes before they hit the screen. 🄬 The Concept (Verifier and Rerun): Automatically check quality and alignment. What happens:

  1. Score keyframes for realism and instruction match (pose, lighting, character correctness).
  2. Score video candidates for physical plausibility and script/lyric alignment; check identity consistency.
  3. If a clip fails, regenerate and pick the best candidate. Why this step exists: Without automated checks, small glitches (weird hands, face swaps, off-beat cuts) sneak through. Example: If a hand clips through a guitar or a face shifts mid-shot, the verifier asks for a new take. šŸž Anchor: The accepted take keeps fingers on the fretboard and the face stable; timing still matches the chorus.
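The accept-or-retry loop can be sketched independently of any particular scorer. Below, render and score are placeholder callables standing in for the video backend and the verifier, and the threshold and attempt budget are assumed values, not the paper's settings.

```python
import random
from typing import Callable, Optional

def generate_with_verification(
    render: Callable[[], str],      # returns a path to a candidate clip
    score: Callable[[str], float],  # verifier score in [0, 1]
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> Optional[str]:
    """Render up to max_attempts candidates and keep the best one that passes.

    The real verifier would score realism, script/lyric alignment, and identity
    consistency separately; this sketch collapses them into one number.
    """
    best_path, best_score = None, -1.0
    for _ in range(max_attempts):
        path = render()
        s = score(path)
        if s > best_score:
            best_path, best_score = path, s
        if s >= threshold:
            break  # good enough: accept early
    return best_path if best_score >= threshold else None

# Placeholder render/score functions so the sketch runs end to end.
clip_counter = iter(range(100))

def fake_render() -> str:
    return f"candidate_{next(clip_counter)}.mp4"

def fake_score(path: str) -> float:
    return random.uniform(0.4, 0.95)

print(generate_with_verification(fake_render, fake_score))
```

A production system might fall back to the best available candidate instead of returning nothing when every attempt misses the threshold.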

šŸž Hook: Think of editing as threading beads—you want the pattern to flow. 🄬 The Concept (Assembly): Concatenate vetted shots into the final MV. What happens:

  1. Place shots exactly on their planned time boundaries.
  2. Crossfade or hard cut to match rhythm and emotion.
  3. Mix the original full song audio back on top. Why this step exists: Without careful assembly, even good shots can feel jumpy or out of time. Example: A soft dissolve during a gentle bridge, a crisp cut on a snare hit entering the chorus. šŸž Anchor: The final MV watches like a single performance—not like stitched random clips.
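As a minimal sketch of assembly, the snippet below hard-cut-concatenates vetted clips with ffmpeg's concat demuxer and lays the original song back on top. It assumes ffmpeg is installed, that all clips share codec, resolution, and frame rate (true if they were rendered on the same frame grid), and it skips crossfades, which would need an ffmpeg filter graph instead; the file names are placeholders.

```python
import subprocess
from pathlib import Path
from typing import List

def assemble_mv(clip_paths: List[str], song_path: str, out_path: str = "final_mv.mp4") -> None:
    """Hard-cut concatenation of vetted shots, then mux the original song on top."""
    concat_list = Path("clips.txt")
    concat_list.write_text("".join(f"file '{p}'\n" for p in clip_paths))
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-f", "concat", "-safe", "0", "-i", str(concat_list),  # video: stitched shots
            "-i", song_path,                                        # audio: original full mix
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy", "-c:a", "aac", "-shortest",
            out_path,
        ],
        check=True,
    )

# assemble_mv(["shot_01.mp4", "shot_02.mp4"], "song.mp3")
```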

The secret sauce:

  • Music-first planning: Sections and lyric timing drive where story moments and cuts happen.
  • Role specialization: Writer, director, renderer, and verifier each do one job well.
  • Identity lock: A character bank plus keyframes glues the look together.
  • Tool switching: Story shots and singing shots use different engines.
  • Automatic guardrails: The verifier catches realism and alignment issues before they compound.

Concrete micro-example with data:

  • Detected: Chorus from 0:49.2–1:02.1; lyric 'shine' at 0:51.8.
  • Plan: At 0:49.2, punch in to a face close-up; at 0:51.8, trigger a light bloom and smile.
  • Backend: Lip-sync engine for 0:49.2–0:52.5; cinematic engine for 0:52.5–1:02.1 wide shots.
  • Verify: Reject take where eyes flicker unrealistically; accept the smooth one with correct mouth shapes.
  • Assemble: Hard cut at 0:49.2, crossfade at 0:52.5, original audio restored across both shots.

04Experiments & Results

šŸž Hook: When you test a game controller, you don’t just see if it turns on—you try every button during real gameplay.

🄬 The Concept (The Test): Measure if long music videos truly match the music, tell a story, look consistent, and feel artistic. How it works:

  1. Use 30 real songs from YouTube across several languages.
  2. Run AutoMV end-to-end and compare to two strong commercial baselines.
  3. Score with both automatic measures and expert human ratings on 12 criteria grouped into Technical, Post-production, Content, and Art. Why it matters: Without thorough tests, short pretty clips might look good but fail as full MVs.

šŸž Anchor: Think of an exam that checks math, writing, art, and gym—because a great MV needs many skills.

The competition:

  • Baselines: Two commercial, closed-source pipelines (narrative modes and image-to-video workflows) known for good short-form results.
  • Upper bound: Human-made professional MVs for the same songs.

Key metrics and what they mean:

šŸž Hook: Imagine a universal translator that measures how closely pictures and sounds ā€˜agree’ in meaning. 🄬 The Concept (ImageBind Score): A model compares the video frames and the audio in a shared space; higher similarity means better audio–visual alignment. How it works:

  1. Embed audio and video frames into a single representation space.
  2. Compute similarity scores across the MV.
  3. Average to get an overall alignment score. Why it matters: Without alignment, visuals don’t reflect the music. šŸž Anchor: AutoMV’s higher ImageBind score means its pictures and sounds are more in sync than the baselines.
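The scoring arithmetic behind such an alignment measure is simple once embeddings exist. The sketch below uses random placeholder vectors where, in practice, a shared audio-visual model such as ImageBind would supply per-frame and per-window embeddings; the dimensions and pairing scheme are assumptions for illustration.

```python
import numpy as np

def audio_visual_alignment(frame_embs: np.ndarray, audio_embs: np.ndarray) -> float:
    """Average cosine similarity between per-frame and co-occurring audio-window embeddings."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    n = min(len(f), len(a))  # pair each sampled frame with its audio window
    return float(np.mean(np.sum(f[:n] * a[:n], axis=1)))

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 1024))  # e.g., one sampled frame per second
audio = rng.normal(size=(120, 1024))   # matching one-second audio windows
print(f"alignment score: {audio_visual_alignment(frames, audio):.3f}")
```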

šŸž Hook: In school, rubrics make grading fair; here, the rubric checks what humans care about in MVs. 🄬 The Concept (LLM-as-Judge with 12 Criteria): A multimodal model watches the MV and scores 12 items in 4 groups: Technical, Post-production, Content, and Art. How it works:

  1. Feed the full MV and song.
  2. Score 12 sub-criteria (e.g., character consistency, lip-sync, storytelling, visual quality) from 1–5.
  3. Weight groups (e.g., Content and Art at 30% each) to get a final score. Why it matters: Traditional numbers alone can’t grade story, emotion, or artistry. šŸž Anchor: The judge checks if a chorus looks like a chorus, if faces stay the same, and if the edit breathes with the beat.
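The weighting step is plain arithmetic. In the sketch below, the 30% weights for Content and Art come from the description above, while the 20/20 split for the other two groups, the sub-criterion names, and the example ratings are assumptions for illustration.

```python
# Assumed rubric layout: 12 sub-criteria in 4 groups.
GROUP_WEIGHTS = {"technical": 0.20, "post_production": 0.20, "content": 0.30, "art": 0.30}

scores = {  # each sub-criterion rated 1-5 by the multimodal judge (example values)
    "technical": {"character_consistency": 4, "lip_sync": 3, "physical_plausibility": 4},
    "post_production": {"shot_continuity": 4, "audio_visual_correlation": 5, "editing_rhythm": 4},
    "content": {"theme_relevance": 5, "storytelling": 4, "emotion": 4},
    "art": {"visual_quality": 4, "creativity": 3, "aesthetic_style": 4},
}

def weighted_score(scores: dict, weights: dict) -> float:
    """Average each group's sub-criteria, then combine group averages by weight."""
    return sum(
        weights[group] * (sum(items.values()) / len(items))
        for group, items in scores.items()
    )

print(f"final score (1-5 scale): {weighted_score(scores, GROUP_WEIGHTS):.2f}")
```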

Scoreboard with context:

  • Audio–visual alignment: AutoMV achieves about 24.4% on ImageBind vs. roughly 18.5–19.9% for the two baselines—like moving from a B- to an A- in matching the music.
  • Cost and time: AutoMV costs about 10–20 USD per song and runs in roughly 30 minutes on the authors’ hardware, comparable or cheaper than baselines.
  • LLM-judged quality: AutoMV consistently ranks above both baselines across Technical (e.g., identity stability, lip-sync), Post-production (e.g., shot continuity, audio–visual correlation), and Content (e.g., theme relevance, storytelling, emotion). Some LLMs rate AI-styled ā€˜novelty’ higher for a baseline, but AutoMV remains stronger in classic filmic quality.
  • Human experts: Professionals still prefer human-directed MVs overall, as expected, but they rate AutoMV clearly above commercial baselines and note its stronger music-aware planning.

Surprising or notable findings:

  • AI novelty bias: Some automatic judges reward flashy AI artifacts as ā€˜novel art,’ which can inflate Art scores for certain baselines.
  • Judge quality varies: Stronger video-understanding models (e.g., Gemini family) correlate better with human experts; others saturate scores and can’t separate methods well.
  • Structured planning matters: Gains are largest in measures tied to music alignment and high-level coherence, not just raw visual fidelity.

Ablation insights (what breaks when pieces are removed):

  • No lyrics/timestamps: Biggest drops in Content (theme, story, emotion) and audio–visual correlation; Technical stability is less affected.
  • No character bank: Identity consistency collapses; viewers report face and outfit swaps across shots; storytelling clarity falls.
  • No verifier: More physical artifacts, lower visual polish, and occasional off-script details; some timing measures may stay similar, but overall quality suffers.

Bottom line: AutoMV beats current automatic tools on aligning with the music, keeping characters consistent, and telling a coherent story—at a practical cost and runtime—while still trailing professional human work in the very top levels of craft and creativity.

05Discussion & Limitations

šŸž Hook: Even great tools have weak spots—like a camera that’s amazing in sunlight but struggles in the dark.

🄬 The Concept (Limitations): What AutoMV can’t reliably do yet. What it is: Known gaps in realism, precision, and creativity. How it shows up:

  1. Physical plausibility: Hands and object contacts can fail; rare pose glitches sneak through.
  2. Beat-precise motion: Dance trajectories may not perfectly lock to micro-beats.
  3. Text rendering: On-screen letters or signs can wobble or morph over time.
  4. Lip-sync without vocal separation: Mixed audio reduces mouth accuracy; separation helps but isn’t perfect for all singers.
  5. Long-run identity drift: Over many minutes, small changes can accumulate. Why it matters: Without addressing these, the last 10–20% of professional sheen remains out of reach.

šŸž Anchor: A handwritten letter close-up may look pretty but the letters can change shape frame to frame.

Resources required:

  • Access to APIs for image/video generation and lip-sync.
  • A GPU server (e.g., multiple 80 GB GPUs) helps with speed and longer clips.
  • Budget per song (about 10–20 USD in the paper’s setup) and ~30 minutes of compute time.

When not to use:

  • Ultra-precise choreography where every limb must hit micro-beats.
  • Legal-risk scenarios (e.g., look-alike characters of real people without consent).
  • Projects demanding bespoke, innovative cinematography or hand-crafted art direction beyond current AI flexibility.

šŸž Hook: Safety belts and road rules don’t stop trips—they make them safer. 🄬 The Concept (Ethical Safeguards): Practices that reduce harm while enabling creativity. What it is: Transparency, consent, and attribution. How it works:

  1. Label AI-generated content to avoid confusion.
  2. Respect copyright—share links, not raw protected media.
  3. Avoid portraits resembling real people without permission. Why it matters: Trust and fairness keep the creative ecosystem healthy. šŸž Anchor: An MV upload that clearly says ā€˜AI-generated’ with creator credits and music rights info respects viewers and artists.

Open questions ahead:

  • Can we fuse beat analysis with motion generation for reliably on-beat dance?
  • How do we keep characters 100% stable across very long videos, varied lighting, and complex actions?
  • Can automatic judges better match expert taste without overvaluing AI artifacts?
  • What training or fine-tuning would close the creativity gap while keeping costs low?

Takeaway: AutoMV is a big leap for practical, music-aware MV creation, but the final stretch to professional artistry needs advances in motion control, identity tracking, typography, and evaluation.

06Conclusion & Future Work

Three-sentence summary: AutoMV is a music-first, multi-agent system that turns a full song into a coherent music video by planning with structure and lyrics, stabilizing characters, choosing the right video tools per shot, and auto-verifying quality. It significantly outperforms strong commercial baselines on music alignment and expert-rated coherence while keeping costs and runtime practical. Although there remains a gap to human-directed productions, AutoMV narrows it and proposes fairer evaluation for long-form MVs.

Main achievement: The paper shows that a role-based, music-aware pipeline—with a screenwriter, director, character bank, specialized renderers, and a verifier—can reliably produce long, song-synced videos that hold together as stories, not just pretty seconds-long clips.

Future directions:

  • Tighter beat-to-motion control for dance and complex staging.
  • Stronger long-run identity locks and style preservation.
  • Smarter, cheaper, and more human-aligned automatic judges.
  • Faster, lighter inference pipelines and broader creative toolsets.

Why remember this: It reframes MV generation from ā€˜one model, one prompt’ to ā€˜a studio of specialists guided by the music,’ proving that planning and verification make the difference between a montage and a music video.

Practical Applications

  • Indie artists auto-generate concept MVs for new singles to test audience reactions before paying for full productions.
  • Labels create quick lyric-synced teasers for social media aligned to chorus hits and hooks.
  • Music educators demonstrate song structure by showing how verses and choruses drive visual changes.
  • Content creators repurpose catalog songs into story-driven visuals for playlists or channel branding.
  • Karaoke venues display lip-synced performer visuals that match lyrics in real time for popular tracks.
  • Advertising teams prototype music-backed commercials where cuts land exactly on beats.
  • Event organizers generate stage-screen visuals that switch with musical sections during live shows.
  • Songwriters pitch demos with instant narrative videos that reflect the intended mood and theme.
  • Fan communities create respectful, clearly labeled AI MVs to celebrate favorite tracks.
  • UX researchers study how timing and emotional cues affect viewer engagement across platforms.
#music-to-video-generation #multi-agent-system #music-information-retrieval #lyrics-alignment #beat-synchronization #character-consistency #keyframe-conditioning #lip-sync-video #video-verification #ImageBind #LLM-evaluation #long-form-video-generation #story-driven-MV #audio-visual-alignment #automatic-benchmarking