
The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Intermediate
Chenyu Mu, Xin He, Qu Yang et al. · 1/25/2026
arXiv · PDF

Key Summary

  • This paper teaches AI to turn simple dialogue into full movie scenes by first writing a detailed script and then filming it step by step.
  • The system has three helpers: ScripterAgent (writes the script), DirectorAgent (shoots the video with smooth scene links), and CriticAgent (scores how well it matches the plan).
  • A new dataset called ScriptBench trains the writer using real dialogue, audio, and character positions, checked by experts for movie realism.
  • ScripterAgent is trained in two stages: it first learns script format, then learns taste and style using a reward system that mixes rules and human preferences (GRPO).
  • DirectorAgent solves the “videos are too short” problem by cutting the story at smart places and using frame-anchoring to keep looks and layout consistent across scenes.
  • A new score called VSA (Visual-Script Alignment) checks not just what appears but when it appears, matching timing in the script to timing in the video.
  • Across many top video models, using these scripts boosts faithfulness and coherence; average human ratings jump from about a B to an A.
  • There’s a trade-off: some models make the most beautiful shots, others follow the script more strictly; this framework helps both do better.
  • Limits remain, like perfect lip sync and tiny action timing, but the approach is a big step toward automated, long, story-like videos.
  • Bottom line: the script is the bridge between ideas (dialogue) and finished, movie-like video, and these agents build and cross that bridge.

Why This Research Matters

Stories are how we learn, remember, and connect, and this work helps AI tell longer, clearer stories on screen. With script-first planning, small creators and teachers can produce scene-accurate videos that used to require big teams. Advertisers and studios can keep brand looks and character identities stable across many scenes, saving time and reshoots. Timing-aware evaluation (VSA) means the important beats land when they should, making explanations, lessons, and emotions more effective. Even with different video models, this framework boosts faithfulness and pacing, making tools more reliable. It also shines a light on a key choice—spectacle vs. strict adherence—so teams can pick the right model for their goals.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you and a friend act out a story with only a few lines like “I’m tired” or “I found the pen.” You can say the lines, but how do you know where to stand, what the camera shows, or when to cut to a close-up? Without a plan, your “movie” feels messy.

🥬 The World Before: For years, AI could make short, pretty video clips from single sentences, like “a dog running on a beach.” That’s cool for moments, but not for movies. Movies need planning: who’s on screen, which camera angle, when to cut, and how emotions change over time. Most AI video tools didn’t plan; they just tried to paint moving pictures from one big prompt.

🍞 Anchor: It’s like trying to film a soccer game with only the words “play soccer” and no positions, plays, or camera plan—you’d miss the passes and goals.

🍞 Hook: You know how a school play starts with a script that tells each actor what to say and where to stand? Without the script, everyone bumps into each other.

🥬 The Problem: Turning dialogue (“I found the pen”) into a whole scene is a big leap. Dialogue is sparse—many things are implied but not said: the mood, the setting, who holds what, where the camera is, and how shots connect. AI struggled to fill in these missing pieces, so long videos fell apart: faces changed, clothes shifted, and timing didn’t match.

🍞 Anchor: If a character says “Thank you,” the camera might need a close-up of a relieved smile right then. Without a plan, the close-up might happen too early or too late—or not at all.

🍞 Hook: Think of assembling a LEGO castle. The picture on the box isn’t enough; you need step-by-step instructions.

🥬 Failed Attempts: People tried longer prompts (“a cinematic scene with two people on a rooftop…”) or simple outlines, but these still missed details: exact shot times, camera moves, character blocking (where people stand and move), and continuity across scenes. Another attempt was to generate separate clips and stitch them, but that caused identity drift—haircuts, clothes, and lighting would change between clips.

🍞 Anchor: It’s like baking a cake by throwing all ingredients together without the recipe steps. You get something cake-like, but not the cake you wanted.

🍞 Hook: You know how a recipe tells you what to do now and what to do next?

🥬 The Gap: The missing piece was the script itself—a precise, shot-by-shot plan that translates vague dialogue into exact instructions any video model can follow. But making such scripts also needs film knowledge (shot types, pacing) and creative reasoning. No public dataset focused on teaching AI to write these director-level scripts from dialogue.

🍞 Anchor: Without a script, the AI is guessing; with a script, it follows a recipe for every second of the scene.

🍞 Hook: Imagine a relay race where each runner hands the baton cleanly so the speed stays high.

🥬 Real Stakes: Long, coherent videos matter for education, entertainment, advertising, and communication. If AI can keep characters, places, and emotions consistent over minutes, it can help indie creators, teachers, and studios make better content faster. But we also need tools to check if the video really follows the plan, moment by moment, so the “relay” doesn’t drop the baton between shots.

🍞 Anchor: A teacher could turn a class dialogue into a short film with the right angles at the right times, keeping students engaged and the story clear.

Now, let’s build the simple ideas you need first.

🍞 Hook: You know how a student learns from examples and gets better with feedback? 🥬 Prerequisite Concept — Basic AI Learning:

  • What it is: AI is a computer program that learns patterns from lots of data to make predictions.
  • How it works: 1) See examples; 2) Guess an answer; 3) Get told how close it was; 4) Adjust to get closer next time.
  • Why it matters: Without learning from examples and feedback, the AI can’t improve. 🍞 Anchor: Like practicing free throws: shoot, watch if it goes in, adjust, repeat.

🍞 Hook: Talking to a helpful librarian. 🥬 Prerequisite Concept — Dialogue Systems:

  • What it is: AI that understands and responds to conversations.
  • How it works: 1) Read your words; 2) Find meaning; 3) Choose a reply; 4) Speak or type it.
  • Why it matters: Our input is just dialogue; the AI must squeeze hidden context out of short lines. 🍞 Anchor: When you say “I’m late,” a good assistant infers hurry, route, and next steps.

🍞 Hook: A flipbook where each page is a frame. 🥬 Prerequisite Concept — Video Generation:

  • What it is: AI creates sequences of images that change over time.
  • How it works: 1) Understand the prompt; 2) Imagine the first frames; 3) Predict small changes; 4) Repeat to form smooth motion.
  • Why it matters: To tell a story, frames must change in the right way at the right time. 🍞 Anchor: Draw a ball higher each page to show it bouncing down.

🍞 Hook: A director’s toolbox. 🥬 Prerequisite Concept — Cinematic Storytelling:

  • What it is: The craft of using shots, camera moves, and timing to tell a clear, emotional story.
  • How it works: 1) Pick shot types (wide, medium, close-up); 2) Plan camera moves; 3) Block actors; 4) Cut at emotional beats.
  • Why it matters: Without this language, the video looks random, not cinematic. 🍞 Anchor: Zooming close when a secret is revealed makes the moment land.

02Core Idea

🍞 Hook: You know how a treasure map turns a vague idea (“treasure is somewhere”) into clear steps (“walk 50 paces north, then turn right”)?

🥬 The Aha! Moment in one sentence: If you first turn dialogue into a professional movie script, then use that script to guide video models scene by scene, you can get long, coherent, cinematic videos.

Multiple Analogies:

  1. Recipe First: Dialogue is ingredients; the script is the recipe; the video is the cake.
  2. GPS for Filmmaking: Dialogue is the destination; the script is turn-by-turn directions; the video is the ride with smooth lane changes.
  3. Orchestra: Dialogue is the melody; the script is the sheet music; the video is the performance staying in rhythm.

Before vs. After:

  • Before: One big prompt tried to cover everything. Results were short, pretty, but often jumbled over time.
  • After: A script breaks the story into shots with times, angles, and actions. Each part is generated with continuity so characters, places, and timing stay consistent.

Why It Works (intuition, no equations):

  • Dialogue is too short to fully specify a movie. A script adds missing details (who, where, how long, which angle) in a structured way.
  • Structured instructions lower confusion for video models—fewer guesses mean fewer mistakes.
  • Chaining scenes with visual anchors gives the next scene a memory of the last frame, so looks and layouts don’t drift.
  • A critic that checks timing and content (VSA) closes the loop, rewarding what truly stays on-script.

Building Blocks (each with a Sandwich Explanation):

🍞 Hook: Think of a great screenwriter who turns simple lines into a whole scene plan. 🥬 ScripterAgent:

  • What it is: An AI “screenwriter” that turns sparse dialogue into a shot-by-shot, time-stamped movie script with camera and action details.
  • How it works: 1) Learn script format (SFT); 2) Learn taste and style with feedback (GRPO); 3) Output a structured script (who, where, when, how to film).
  • Why it matters: Without an exact plan, video models guess and break continuity. 🍞 Anchor: From “I found the pen” to “0–4s wide shot on rooftop; 4–7s medium on handoff; 7–13s close-up of relieved smile.”
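To make this concrete, here is a minimal sketch of what such a shot-by-shot script could look like as structured data. The field names (shot_type, start_s, camera, and so on) are illustrative assumptions, not the paper’s exact JSON schema.

```python
# A hypothetical shot-by-shot script for the rooftop dialogue.
# Field names are illustrative; the paper's exact schema may differ.
script = {
    "scene": "rooftop at sunset",
    "characters": [
        {"name": "A", "look": "blue jacket", "position": "near the railing"},
        {"name": "B", "look": "red scarf", "position": "left of A"},
    ],
    "shots": [
        {"id": 1, "start_s": 0, "end_s": 4, "shot_type": "wide",
         "camera": "static", "action": "A leans on the railing, wind ambience",
         "dialogue": "Ah... I'm so tired, but I'm finally up here."},
        {"id": 2, "start_s": 4, "end_s": 7, "shot_type": "medium",
         "camera": "slow push-in", "action": "B hands A the pen",
         "dialogue": "I found the pen."},
        {"id": 3, "start_s": 7, "end_s": 13, "shot_type": "close-up",
         "camera": "handheld", "action": "A smiles with relief",
         "dialogue": "Thank you, I'll keep it safe."},
    ],
}
```

Every downstream step (segmentation, frame-anchoring, VSA scoring) can key off the explicit timestamps in a structure like this.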

🍞 Hook: A sturdy practice field built to train players. 🥬 ScriptBench:

  • What it is: A carefully made training set with dialogue, audio, and character positions, plus expert-checked scripts.
  • How it works: 1) Rebuild scene context; 2) Plan shots under film rules; 3) Auto-check and fix errors until scripts are consistent.
  • Why it matters: Garbage in, garbage out; clean, cinematic data teaches the model real directing habits. 🍞 Anchor: Like a workbook with solved examples and corrections.

🍞 Hook: Judges at a science fair comparing several projects side by side. 🥬 GRPO (Group Relative Policy Optimization):

  • What it is: A way to improve the writer by comparing several candidate scripts at once and nudging it toward the better ones.
  • How it works: 1) Generate multiple versions; 2) Score them with rules and human-style taste; 3) Prefer the ones that win within the group.
  • Why it matters: Creative tasks have many “right” answers; relative judging teaches style without being rigid. 🍞 Anchor: Picking the best of eight drafts and learning why that one sings.
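A minimal sketch of the group-relative idea, assuming each draft already has a scalar reward; the reward values below are made up for illustration.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style scoring: each draft is judged against its own group.

    Drafts above the group mean get a positive advantage (reinforced);
    below-average drafts get a negative advantage (discouraged).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Eight candidate scripts for one dialogue, scored by the hybrid reward
# (values are made up for illustration).
rewards = [0.62, 0.71, 0.55, 0.80, 0.68, 0.59, 0.74, 0.66]
advantages = group_relative_advantages(rewards)  # draft 4 wins the group
```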

🍞 Hook: A movie director who knows exactly when to cut and how to keep the look steady across scenes. 🥬 DirectorAgent:

  • What it is: An AI “director” that turns the script into video segments while keeping story flow and visuals consistent.
  • How it works: 1) Cut at smart boundaries; 2) Fit each chunk within model limits; 3) Use frame-anchoring to keep faces, clothes, and layout steady.
  • Why it matters: Most video models can’t do very long clips; this stitches them cleanly. 🍞 Anchor: The last frame of scene A becomes the first hint for scene B, so hair and lighting don’t suddenly change.

🍞 Hook: A storyteller who weaves between chapters without losing the thread. 🥬 Cross-Scene Continuous Generation:

  • What it is: A plan to generate long stories as a chain of well-timed scenes that flow.
  • How it works: 1) Respect shot integrity; 2) Cut at emotional beats; 3) Avoid cutting mid-complex camera moves.
  • Why it matters: Random cuts break rhythm and confuse viewers. 🍞 Anchor: Ending a shot after a line finishes, not mid-sentence.

🍞 Hook: Keeping a bookmark so you start the next page exactly where you left off. 🥬 Frame-Anchoring:

  • What it is: Using the last frame of one scene as a visual reference for the first frame of the next.
  • How it works: 1) Extract last frame; 2) Feed it to the next generation; 3) Add text cue “continue from previous.”
  • Why it matters: Prevents identity drift and layout jumps. 🍞 Anchor: The blue jacket and window light stay the same across shots.

🍞 Hook: A fair referee who checks if the play matched the playbook and happened at the right time. 🥬 CriticAgent and VSA:

  • What it is: An AI critic plus a timing-aware score (VSA) that checks if visuals match the script and if moments happen when they should.
  • How it works: 1) Score camera work, acting, fidelity, emotion, pacing; 2) Use VSA to match script instructions to the correct time windows.
  • Why it matters: Seeing a pen sometime isn’t enough; it must appear exactly during the “pen handoff” seconds. 🍞 Anchor: If the script says “close-up at 7–13s,” VSA checks for a close-up then—not earlier or later.

03Methodology

At a high level: Dialogue → ScripterAgent writes a detailed script → DirectorAgent films it in linked chunks → CriticAgent scores script and video (including VSA).

Step 1: Build the learning playground (ScriptBench) 🍞 Hook: You know how a good workbook gives clear examples and checks your answers? 🥬 ScriptBench:

  • What happens: The team collected rich dialogue scenes with audio and character positions, then expanded each into a shot-level script. They rebuilt context (who, where, why), planned shots with rules (integrity, duration, semantic cuts, feasibility), and ran multi-round error checks (dialogue covered, character looks consistent, scene coherent, positions physically possible).
  • Why this step exists: If training data is sloppy, the model learns sloppy habits.
  • Example: Dialogue says “I’m finally up here.” The script clarifies “rooftop at sunset,” places characters near a railing, and sets shot times (0–4s wide, 4–7s medium, etc.). 🍞 Anchor: Like turning a rough math problem into a full step-by-step solution key.
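Here is a toy sketch of what one round of automatic error checking could look like, reusing the script structure sketched earlier; the specific checks and field names are assumptions, not ScriptBench’s published spec.

```python
def validate_record(record):
    """One round of automatic consistency checks (illustrative only;
    checks and field names are assumptions, not ScriptBench's spec)."""
    errors = []
    # Every dialogue line must be covered by some shot.
    spoken = " ".join(s.get("dialogue", "") for s in record["shots"])
    for line in record["dialogue_lines"]:
        if line not in spoken:
            errors.append(f"uncovered dialogue line: {line!r}")
    # Shots must tile the timeline with no gaps or overlaps.
    for prev, cur in zip(record["shots"], record["shots"][1:]):
        if cur["start_s"] != prev["end_s"]:
            errors.append(f"gap/overlap between shots {prev['id']} and {cur['id']}")
    # Durations must be physically sensible.
    if any(s["end_s"] <= s["start_s"] for s in record["shots"]):
        errors.append("non-positive shot duration")
    return errors  # an empty list means the record passes this round
```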

Step 2: Teach the writer to speak “movie” (ScripterAgent SFT) 🍞 Hook: Practice scales before playing a concerto. 🥬 SFT (Supervised Fine-Tuning):

  • What happens: The model (Qwen-Omni-7B) reads dialogue and learns to output the correct JSON script format with fields like shot type, camera movement, timestamps, and blocking.
  • Why this step exists: Format and completeness first; creativity later. Without structure, later steps break.
  • Example: From “Thank you, I’ll keep it safe,” the model learns to produce a close-up with hand-and-pen details at the exact seconds. 🍞 Anchor: It’s learning the sheet-music rules before improvising jazz.
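A minimal sketch of how one SFT example could be formatted, assuming a standard text-in, JSON-out causal-LM setup; the real model (Qwen-Omni-7B) also consumes audio and position context, which this sketch omits, and the instruction template is hypothetical.

```python
import json

def format_sft_example(dialogue, expert_script):
    """Format one dialogue -> script pair for supervised fine-tuning.

    Standard causal-LM SFT would compute the loss on the completion
    (target) tokens only, teaching format before style.
    """
    prompt = (
        "Expand the following dialogue into a shot-by-shot film script "
        "in JSON, with shot types, camera moves, timestamps, and blocking.\n"
        f"Dialogue: {dialogue}\nScript: "
    )
    return {"prompt": prompt,
            "completion": json.dumps(expert_script, ensure_ascii=False)}
```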

Step 3: Teach taste and timing (ScripterAgent GRPO with hybrid rewards) 🍞 Hook: A coach who scores both the form (rules) and the artistry (style). 🥬 GRPO + Hybrid Reward:

  • What happens: For each dialogue, the model writes several script drafts. Each draft gets two scores: rule-based structure (JSON correct, dialogue covered, consistency, physical plausibility) and human-preference style (shot division, acting notes, aesthetics, directorial intent), learned from expert ratings. The model then favors drafts that rank better within the group.
  • Why this step exists: Movies aren’t only correct—they must feel right. GRPO teaches preference without needing one single “perfect” answer.
  • Example: Two drafts both valid; the chosen one times the close-up right after the reveal and picks a handheld move to raise tension. 🍞 Anchor: Like choosing the best of eight rehearsals because it hits both the notes and the emotion.
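A hedged sketch of the hybrid reward, assuming boolean rule validators and a learned preference scorer in [0, 1]; the equal mixing weight is an assumption, not the paper’s value.

```python
def hybrid_reward(draft, rule_checks, preference_score, alpha=0.5):
    """Blend rule-based structure checks with a learned style score.

    rule_checks: callables returning True/False (JSON parses, dialogue
    covered, blocking physically plausible, ...).
    preference_score: callable returning a style score in [0, 1],
    trained on expert ratings. alpha=0.5 is an assumed mixing weight.
    """
    rule = sum(bool(check(draft)) for check in rule_checks) / len(rule_checks)
    style = preference_score(draft)
    return alpha * rule + (1 - alpha) * style
```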

Step 4: Turn the script into video (DirectorAgent) 🍞 Hook: A chef plating a big meal in courses so everything stays hot and coordinated. 🥬 Intelligent, Shot-Aware Segmentation:

  • What happens: The script is split at natural film boundaries: complete shots, near dialogue beats, avoiding mid-complex moves, and within model time limits (with a safety buffer).
  • Why this step exists: Video models have short generation windows. Smart cuts prevent mid-action breaks and keep rhythm.
  • Example: A 60s scene becomes 5 chunks that each include whole shots, not fragments. 🍞 Anchor: You wouldn’t cut a song mid-chorus; finish the phrase, then transition.
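A simplified greedy version of shot-aware segmentation: whole shots are packed under the generator’s time limit with a safety buffer. The paper’s policy also respects dialogue beats and avoids cutting mid-complex camera moves, which this sketch deliberately ignores.

```python
def segment_script(shots, max_len_s, buffer_s=1.0):
    """Greedily pack whole shots into chunks under the model's window.

    Shots are never split; a chunk closes when the next shot would
    exceed the generator's limit minus a safety buffer.
    """
    chunks, current, used = [], [], 0.0
    budget = max_len_s - buffer_s
    for shot in shots:
        duration = shot["end_s"] - shot["start_s"]
        if current and used + duration > budget:
            chunks.append(current)
            current, used = [], 0.0
        current.append(shot)
        used += duration
    if current:
        chunks.append(current)
    return chunks

# Example: a 13s script under a 7s generation window -> whole-shot chunks.
# segment_script(script["shots"], max_len_s=7.0)
```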

🍞 Hook: Keeping a bookmark so the next page starts in the right place. 🥬 Frame-Anchoring:

  • What happens: The last frame from chunk i is fed as a visual reference into chunk i+1, plus a textual cue (“Continuing from the previous scene”).
  • Why this step exists: Prevents identity and layout drift between chunks.
  • Example: The red scarf, window angle, and background chairs match across cuts. 🍞 Anchor: A baton pass in a relay—smooth handoff keeps speed and direction.
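A minimal sketch of the anchoring loop, assuming a hypothetical image-conditioned video generator with a `generate(prompt, image=...)` interface; the real system’s API will differ.

```python
def render_prompt(chunk):
    # Flatten shot descriptions into one text prompt (illustrative).
    return " ".join(f"[{s['shot_type']}] {s['action']}" for s in chunk)

def generate_long_video(chunks, video_model):
    """Chain chunks with frame-anchoring (assumed generator interface)."""
    segments, anchor = [], None
    for chunk in chunks:
        prompt = render_prompt(chunk)
        if anchor is not None:
            prompt = "Continuing from the previous scene. " + prompt
        clip = video_model.generate(prompt, image=anchor)  # image-conditioned
        anchor = clip.frames[-1]  # last frame seeds the next chunk
        segments.append(clip)
    return segments
```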

Step 5: Judge the results (CriticAgent + VSA) 🍞 Hook: A referee who checks both what happened and when it happened. 🥬 CriticAgent + VSA:

  • What happens: The AI critic scores camera articulation, body blocking, visual fidelity, emotion arc, and pacing. The VSA metric compares each frame’s content to the script’s shot instructions inside the right time windows.
  • Why this step exists: Regular metrics catch “what,” but miss “when.” VSA verifies timing fidelity.
  • Example: If the script says “medium shot from 4–7s,” VSA rewards matching frames during 4–7s, not 0–3s. 🍞 Anchor: Like a music judge who checks that the solo starts at bar 16, not bar 12.
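A toy version of a timing-aware score in the spirit of VSA. The `frame_matches` callable stands in for a vision-language scorer and is an assumption; the key point is that each instruction is only credited inside its own time window.

```python
def vsa_score(shots, frame_matches, fps=8):
    """Toy timing-aware alignment in the spirit of VSA.

    frame_matches(shot, t) -> match score in [0, 1] for the video
    content at time t against that shot's instructions (in practice a
    vision-language model; here an assumed callable).
    """
    per_shot = []
    for shot in shots:
        n = max(1, round((shot["end_s"] - shot["start_s"]) * fps))
        times = [shot["start_s"] + i / fps for i in range(n)]
        scores = [frame_matches(shot, t) for t in times]
        per_shot.append(sum(scores) / len(scores))
    return sum(per_shot) / len(per_shot)
```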

Secret Sauce (what’s clever):

  • The script-as-bridge: turning vague dialogue into precise, executable instructions.
  • Relative learning (GRPO): comparing multiple drafts teaches nuanced style.
  • Cross-scene continuity: segmentation plus frame-anchoring stretches short video models into long, coherent stories.
  • Timing-aware evaluation (VSA): measuring not just presence but punctuality of visual events.

Mini Walkthrough with Data:

  • Input dialogue: “Ah… I’m so tired, but I’m finally up here.” “I found the pen.” “Thank you, I’ll keep it safe.”
  • ScripterAgent output: Shot 1 (0–4s wide rooftop, wind ambience), Shot 2 (4–7s medium hand + pen), Shot 3 (7–13s close-up relief), Shot 4 (13–15s cutaway to skyline), with blocking and camera notes.
  • DirectorAgent: Splits into 2–3 chunks under model limits, uses last frame from chunk 1 to anchor chunk 2.
  • CriticAgent + VSA: Scores high because the close-up appears exactly during 7–13s and characters stay consistent across scenes.
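Putting the walkthrough together, here is a high-level orchestration sketch with hypothetical agent interfaces; none of these signatures, nor the threshold or retry step, come from the paper.

```python
def dialogue_to_video(dialogue, scripter, director, critic, vsa_threshold=0.8):
    """Three-agent loop with hypothetical interfaces.

    scripter(dialogue) -> script dict; director(script) -> video;
    critic(script, video) -> {"vsa": float, ...}.
    """
    script = scripter(dialogue)
    video = director(script)
    report = critic(script, video)
    if report["vsa"] < vsa_threshold:
        # A fuller system could use the critic's feedback to revise the
        # script or re-shoot only the off-timed chunks.
        video = director(script)
        report = critic(script, video)
    return video, report
```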

04Experiments & Results

The Test: What did they measure and why?

  • Script quality: Structure, shot division, completeness, and narrative flow—because a clean recipe makes better cooking.
  • Video faithfulness and cinema quality: Camera work, acting/blocking, visual detail match, emotion arc, and pacing—because a movie is more than pretty frames.
  • Temporal-semantic alignment (VSA): Checks that events occur when the script says—because timing is storytelling.

The Competition: What was it compared against?

  • Script generation baselines: CHAE, MoPS, SEED-Story, and a multi-agent movie system. These can write or plan stories but lack this paper’s tight, shot-level film script focus from dialogue.
  • Video models: A wide set of top generators (e.g., HYVideo1.5, Wan2.6, Sora2-Pro, and Veo3.1). Each was tested with and without ScripterAgent’s scripts.

The Scoreboard (with context):

  • Scripts: Human experts gave ScripterAgent higher Dramatic Tension (about 4.1 vs ~3.6–3.8) and Visual Imagery (about 4.3 vs ~3.3–3.8). Think moving from a B to an A on “is this filmable and exciting?”
  • Videos without scripts: Average human ratings were around 3.7 (a solid C+/B-): decent visuals but shaky story fidelity.
  • Videos with scripts: Average human ratings rose to about 4.2 (a strong B+/A-). Script Faithfulness especially jumped (e.g., Wan2.6 from ~3.2 to ~4.0), and Character Consistency improved thanks to frame-anchoring.
  • VSA: Timing alignment improved across the board (e.g., Veo3.1 from ~51.4 to ~53.8; HYVideo1.5 near 54.8), meaning shots appeared at the right moments more often.

Surprising Findings:

  • Trade-off revealed: Some models shine at beauty and physics (spectacle), others at obeying the script (fidelity). With ScripterAgent, both kinds improved, but their personalities remained: spectacle-first models still looked spectacular; fidelity-first models still followed plans tightly. The framework helps you choose the right tool for your goal.
  • More action with scripts: Dynamic motion scores rose when using scripts—clear action notes nudge models to avoid dull “talking heads.”

Concrete Examples (kid-friendly):

  • If the script says “close-up of the pen handoff at 4–7s,” VSA checks those seconds. With scripts, the handoff appears right on cue more often.
  • If the character wears a blue jacket, frame-anchoring helps keep it blue in the next shot too, not suddenly red.

Why These Numbers Matter:

  • A +0.4 point gain on a 0–5 expert scale is like lifting your grade from 84 to 92—noticeable in professional work.
  • The VSA gains mean the story beats land at the intended times, which makes scenes easier to follow and more emotionally effective.

Ablations (what pieces help the most?):

  • Script only: Big gains in Visual Fidelity and Body Blocking—better descriptions lead to clearer actions and looks.
  • Script + segmentation: Pacing and camera articulation improve—cuts happen at smarter times.
  • Full Agent (add frame-anchoring): Highest scores overall—continuity locks in character identity and layout across scenes.

Takeaway: The script-first, continuity-aware pipeline makes many different video models tell clearer, steadier stories, not just prettier ones.

05Discussion & Limitations

Limitations (honest talk):

  • Lip sync and micro-actions: Tiny mouth shapes or hand timings can still drift off by a little; the model may get the “what and when” mostly right, but not every subtle twitch.
  • Style rigidity: Following the script strictly can slightly reduce free-form visual flash in some models—there’s a balance between dazzling shots and exact obedience.
  • Longest horizons: While the method chains scenes well, extremely long films may still show small continuity hiccups without further memory tools.
  • Data coverage: ScriptBench is large and expert-checked, but no dataset covers every genre, culture, or camera trick; rare styles may need extra tuning.

Required Resources:

  • A script-writing model (7B scale in the paper), fine-tuning compute, and access to high-end video generators.
  • Storage for multi-minute video segments and caching anchor frames.
  • Optional: human experts for preference data (or a learned proxy critic, as used here).

When NOT to Use:

  • Purely improvisational, experimental visuals where strict timing is unimportant.
  • Single short clips (5–10s) with no story—overhead of scripting may not pay off.
  • Real-time applications with ultra-low latency; splitting and anchoring add orchestration steps.

Open Questions:

  • Can we align lips and micro-gestures precisely to speech while keeping global coherence?
  • Can the director learn when to break rules artistically (e.g., jump cuts) without losing clarity?
  • Can we build style-adaptive agents that shift between “spectacle” and “fidelity” on command?
  • Can VSA expand to multi-character interactions (e.g., who holds what when) with even finer granularity?
  • How do we scale beyond minutes to episodes while keeping memory compact and robust?

06Conclusion & Future Work

Three-Sentence Summary:

  • This paper shows that writing a professional, shot-by-shot script from dialogue first, then filming it with continuity tools, turns short, pretty clips into longer, coherent, cinematic videos.
  • The system uses three agents—ScripterAgent (writer), DirectorAgent (filmer), CriticAgent (judge)—plus a new timing-aware metric (VSA) and a strong training set (ScriptBench) to bridge the gap from words to movie.
  • Across many top video models, this boosts faithfulness, pacing, and consistency, while revealing a real trade-off between “looks amazing” and “sticks to the plan.”

Main Achievement:

  • Proving that “the script is the bridge”: adding a precise, director-level script in the middle dramatically improves long-horizon video generation from dialogue.

Future Directions:

  • Nail lip sync and micro-gestures with speech-aligned control.
  • Teach the director to choose and blend styles (spectacle vs. fidelity) on demand.
  • Extend VSA to even finer event tracking and multi-character interactions.
  • Scale the continuity chain to episodes or feature-length content with compact memory.

Why Remember This:

  • It reframes the challenge: don’t prompt harder—plan smarter. By treating the script as the core translation layer between ideas and images, the system achieves the thing movies live on: timing, clarity, and emotional flow. The next wave of AI filmmaking won’t just draw moving pictures; it will direct them.

Practical Applications

  • Turn classroom dialogues into short films with accurate shot timing to boost student engagement.
  • Create storyboard-quality animatics from scripts to pre-visualize scenes before full production.
  • Generate consistent multi-scene ads where logos, outfits, and locations stay stable across cuts.
  • Produce training videos that precisely align actions (like safety steps) with voiceover timing.
  • Rapidly explore multiple directing styles for the same dialogue by swapping camera plans.
  • Assist indie filmmakers in planning and shooting order with clear shot lists and blocking.
  • Localize scenes by rewriting scripts to match regional styles while preserving timing and beats.
  • Prototype episodic content by chaining scenes with frame-anchoring to keep character identity.
  • Automate video documentation of customer support dialogues with on-script visuals.
  • Use VSA-based QA to flag off-timed or missing shots before final renders go live.
#dialogue-to-video · #cinematic script generation · #ScripterAgent · #DirectorAgent · #CriticAgent · #ScriptBench · #frame-anchoring · #cross-scene generation · #VSA metric · #GRPO · #long-horizon coherence · #video continuity · #shot-aware segmentation · #preference alignment · #automated filmmaking