DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
Key Summary
- DramaBench is a new test that checks how well AI continues drama scripts across six separate skills instead of one big score.
- It uses a clever two-step method: first an AI labels what it sees (like 'this line breaks logic'), then math turns those labels into reliable scores.
- The six skills are Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling.
- The benchmark contains 1,103 real scripts and compares 8 top language models across 8,824 evaluations with strong statistical testing.
- All models followed screenplay format perfectly, but they differed a lot in logic consistency and how efficiently they advanced the story.
- GPT-5.2 was the most balanced overall, Qwen3-Max showed the deepest emotions, and Gemini-3-Pro best escalated conflicts.
- Human checks agreed well with the AI judge on 3 of 5 content skills, showing the method is promising but needs improvement for certain dimensions.
- Each score traces back to specific labeled moments (like which line went out-of-character), giving clear, fixable feedback.
- The six skills barely overlap, showing that each measures something unique and important for drama writing.
- DramaBench sets a rigorous, reproducible standard that can help train better, more controllable creative-writing AIs.
Why This Research Matters
DramaBench helps writers and studios spot exactly where an AI's storytelling shines or stumbles: logic, pacing, emotions, character voice, conflict, or format. That means faster revisions and fewer 'mystery' failures that only show up late in production. Teachers can give students precise, skill-by-skill feedback, not just a vague grade. Viewers benefit from shows that feel tighter, more consistent, and more emotionally resonant. Companies can choose or train models for specific strengths (like emotion or logic) instead of guessing. The labels can directly train future models, creating a loop that steadily improves creative-writing AIs.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you watch a play or a movie, it's not just about what happens next? It's also about whether the characters stay themselves, the emotions make sense, the problems get tougher, and the script looks like a real script. Good drama needs many things to work together.
🥬 Filling (The Actual Concept):
- What it is: This paper builds DramaBench, a big, careful test to see how well AIs continue drama scripts along six different skills at once.
- How it works: Instead of asking an AI judge to hand out stars, DramaBench asks it to label specific things (like 'this line broke a fact' or 'this scene deepened emotion'), then turns those labels into fair numbers. It does this on 1,103 real scripts and compares 8 advanced models.
- Why it matters: Creative writing fails in different ways: a script can look perfect but break logic, or be emotional but go nowhere. One number can't tell you what to fix. Six separate scores can.
🍞 Bottom Bread (Anchor): Imagine grading a basketball player on dribbling, passing, shooting, defense, stamina, and teamwork, not just a single 'good' or 'bad.' That's what DramaBench does for drama scripts.
Now, the world before:
- Many tests checked story understanding (like choosing the right ending) or measured full-script writing, but didn't match screenwriting's special needs: keeping strict format, distinct character voices, emotional arcs, logical memory of facts, and conflict growth.
- People often used 'LLM-as-a-judge,' asking a model to just score a text 1-10. But that can be biased and hard to reproduce. Human grading is slow, costly, and often inconsistent between graders.
The problem:
- Drama continuation is multi-skill: characters must stay themselves, the plot should move, the feelings should evolve, the facts must not be contradicted, conflicts should escalate (or twist), and the format must be industry-standard. No benchmark measured all of this together, fairly and at scale.
Failed attempts:
- Single-score ratings hid weaknesses. A model might be great at format but weak at logic, and you couldn't tell.
- Pure human judging didnāt scale and had low agreement.
- LLMs that directly gave scores showed position and verbosity bias, and were not easily reproducible.
The gap:
- We needed a test that (1) breaks quality into separate, independent skills, (2) uses structured labels instead of just stars, and (3) converts labels to numbers with clear math so it's reproducible and actionable.
Prerequisite concepts (explained simply):
- 🍞 You know how you chat with smart assistants? 🥬 Large Language Models (LLMs) are computer programs that read and write language. They work by predicting the next words using patterns learned from lots of text. Without LLMs, we wouldn't have AIs to continue scripts at all. 🍞 Example: When an AI writes the next scene of a play after reading the first half, that's an LLM at work.
- 🍞 Imagine judging a school project for creativity, accuracy, neatness, and teamwork. 🥬 Multi-dimensional evaluation means scoring something from many angles instead of just one. It separates strengths from weaknesses. If you skip this, you might say 'good' while missing that accuracy was poor. 🍞 Example: A story can be very emotional (high) but illogical (low) at the same time.
- 🍞 Picture a scoreboard in sports that turns events into points. 🥬 Statistical metrics are the math rules that turn labeled events (like 'this was a logic break') into fair scores. Without them, results feel random and can't be compared. 🍞 Example: Counting how many times the AI contradicted a fact to compute a 'Logic Break Rate.'
- 🍞 Think of following a recipe step by step. 🥬 Rule-based analysis means checking text against clear rules. If the text breaks a rule, it's an error. Without rules, checks become opinion. 🍞 Example: A screenplay line missing a character name in caps is a format violation.
- 🍞 Imagine all students agreeing to write answers in the same layout so teachers can grade quickly. 🥬 Format Standards are the agreed rules for how scripts should look (like Fountain format). Without them, a script is messy and hard to read or film. 🍞 Example: 'INT. KITCHEN - NIGHT' is a proper scene heading in a screenplay.
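To make the 'rules, not opinions' idea concrete, here is a minimal Python sketch of a rule-based format check. It tests only one simplified Fountain rule (a scene heading starts with INT. or EXT. and names a time of day after a hyphen); the regex and function name are illustrative assumptions, not DramaBench's actual parser, which covers many more rules.

```python
import re

# One simplified Fountain rule: a scene heading starts with INT., EXT. (or INT./EXT.)
# and names a time of day after a hyphen, e.g. "INT. KITCHEN - NIGHT".
# This regex is an illustrative assumption, not DramaBench's full parser.
SCENE_HEADING = re.compile(
    r"^(INT\.|EXT\.|INT\./EXT\.)\s+.+\s-\s*(DAY|NIGHT|MORNING|EVENING|CONTINUOUS)$"
)

def is_valid_scene_heading(line: str) -> bool:
    """Return True if the line looks like a well-formed scene heading."""
    return bool(SCENE_HEADING.match(line.strip()))

print(is_valid_scene_heading("INT. KITCHEN - NIGHT"))  # True: proper heading
print(is_valid_scene_heading("kitchen at night"))      # False: format violation
```

Counting how many lines fail checks like this one, divided by all lines, gives a format error rate without any model judgment in the loop.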
Real stakes (why everyday people should care):
- Better drama tools help writers draft faster and fix specific problems (like flat emotions) without breaking others (like logic).
- Viewers get shows and films that feel tighter, truer to character, and more emotionally engaging.
- Teachers and students get clearer feedback: what to improve and how.
- Studios can compare writing AIs fairly, saving time and money.
- Creators keep creative control with concrete, fixable notes instead of a mysterious '7/10.'
02 Core Idea
🍞 Top Bread (Hook): Imagine building a LEGO castle: instead of guessing if it's 'good,' you label each brick ('window,' 'door,' 'tower') and then count if you have what a real castle needs. That's clearer than just saying 'nice job.'
🥬 Filling (The Actual Concept):
- What it is: The key idea is to replace fuzzy one-number judging with labeled evidence for six separate drama skills, then turn those labels into objective scores.
- How it works: First, an AI labels the continuation in a structured way (like 'this beat drives the plot' or 'this line is out-of-character'). Then simple math turns those counts into metrics (like Effective Narrative Rate or Logic Break Rate). Each skill is measured independently.
- Why it matters: Labels are reproducible, explainable, and teachable. You can train models on those labels (e.g., avoid 'redundant beats') and know exactly what to fix.
🍞 Bottom Bread (Anchor): It's like grading a science lab by checking 'followed procedure,' 'recorded data,' 'explained results,' and 'safety' separately, so students know which step to improve.
Aha! moment in one sentence:
- Don't ask a model to hand you a final grade; ask it to label the parts, then add them up with clear, fair math.
Three analogies (same idea, different angles):
- Chef analogy: Instead of saying 'the meal is 7/10,' you check salt level, doneness, plating, and temperature. Each gets its own score.
- Sports analogy: A game isn't just 'win or lose'; you track rebounds, assists, turnovers, and shots made. That reveals how the team actually played.
- Health analogy: A single 'healthy/unhealthy' stamp is vague. Measuring heart rate, blood pressure, and sleep gives real, fixable signals.
Before vs After:
- Before: One-size-fits-all judging, often biased and unreproducible, hiding what's broken.
- After: Six clear dials (Format, Narrative, Character, Emotion, Logic, Conflict), each with labeled evidence and math scores. You see strengths and weaknesses plainly.
Why it works (intuition, no equations):
- Labels anchor the score to specific moments in text, so you can trace exactly why you got a number.
- Different dimensions barely overlap (near-zero correlations), so each measures a distinct talent. That prevents double-counting.
- Simple statistics make results stable: if two models differ a lot on 'beats per page,' tests show it's not random luck.
Building blocks (each with a Sandwich explanation):
- 🍞 You know how teachers mark answers as 'right,' 'wrong,' or 'needs more info'? 🥬 LLM Labeling + Statistical Analysis is a two-step recipe: (1) label what happened in the text using clear categories, (2) turn label counts into scores. Without step 1, scores aren't explainable; without step 2, labels aren't comparable. 🍞 Example: Label 'out-of-character' lines, then compute OOC Rate.
- 🍞 Imagine a script typed exactly like the industry expects so actors and directors can use it fast. 🥬 Format Standards check if the writing follows rules like Fountain format. If ignored, scripts become messy and slow to film. 🍞 Example: 'EXT. PARK - DAY' as a proper heading is a format pass.
- 🍞 Think of a story that wastes no time: every scene moves the mission forward. 🥬 Narrative Efficiency measures how many beats push the plot vs. sit still or repeat. Without it, scripts feel padded. 🍞 Example: 'She discovers a hidden camera' is a driver beat; 'He sighs again' may be static.
- 🍞 Picture your best friend always sounding like themselves, not suddenly acting the opposite. 🥬 Character Consistency checks if dialogue matches established personality and voice. Without it, characters feel fake. 🍞 Example: A shy character suddenly giving a fiery speech with no build-up is out-of-character.
- 🍞 Remember movies that make you feel a knot in your stomach and then relief? 🥬 Emotional Depth tracks how feelings shift and how complex they get. Flat emotions = flat drama. 🍞 Example: A scene moving from fear to determined courage shows an emotional arc.
- 🍞 Think of a puzzle where every piece must fit what came before. 🥬 Logic Consistency verifies facts from earlier scenes aren't broken later. Breaks pop the audience out of the story. 🍞 Example: If the hero broke a leg earlier, they shouldn't be sprinting in the next scene.
- 🍞 Imagine a roller-coaster that keeps climbing or twisting so you lean forward in your seat. 🥬 Conflict Handling checks whether the core problem escalates, twists, pauses, resolves, or gets dropped. Dropping it confuses viewers. 🍞 Example: Adding a new obstacle is 'twist' escalation; skipping the problem is 'dropped.'
03 Methodology
High-level pipeline: Input (script context + model continuation) → LLM Labeling per dimension (or rules for format) → Statistical metrics → Scores and actionable feedback.
Step-by-step (with Sandwich explanations for key steps and each dimension):
- 🍞 You know how a referee first records what happened, then the scoreboard updates? 🥬 Step A: Structured Labeling. The evaluator AI (Qwen3-Max) doesn't give a star rating. Instead, it assigns predefined labels at the right granularity (like line-level or scene-level). Without clear labels, numbers are arbitrary. 🍞 Example: For logic, the AI labels each fact as 'maintained' or 'violated.'
- 🍞 Imagine turning tallies into fair points so teams can be compared. 🥬 Step B: Statistical Metrics. The system converts label counts into objective scores (like rates and averages) and uses significance tests to ensure differences are real. Without this, results might be noise. 🍞 Example: Logic Break Rate = violated facts divided by all checked facts. (A code sketch after this list collects these formulas.)
- 🍞 Think of checking spelling with a dictionary instead of opinions. 🥬 Format Standards (Rule-Based):
- What happens: A parser checks Fountain screenplay rules: scene headings, character names, action lines.
- Why this step exists: Formatting must be strict and reproducible; rules beat opinions.
- Example data: If 0 of 100 lines break rules, Format Error Rate = 0%.
- Secret sauce: Separating format from content ensures clean scripts and clean content analysis.
- 🍞 Like counting only the moves that advance a chess game, not the fidgeting. 🥬 Narrative Efficiency (Event-Level Beats):
- What happens: The AI extracts beats from action lines and labels each as Driver (advances plot), Static (descriptive), or Redundant (repeats info).
- Why: Without this, models can pad with mood and gestures without moving the story.
- Example: 9 driver, 1 static, 0 redundant → ENR = 9/10 = 0.9; Beats per Page = driver beats per 250 tokens.
- Secret sauce: Beat-level labeling pinpoints exactly where pacing slows.
- 🍞 Imagine a voice coach checking if each line sounds like the same person. 🥬 Character Consistency (Dialogue-Level):
- What happens: From the context, the AI writes a persona profile (traits, speech style). It then labels each new line: In_Character, Neutral, or OOC.
- Why: Characters breaking their own rules shatters immersion.
- Example: 2 OOC lines out of 40 → OOC Rate = 5%.
- Secret sauce: Persona profiles convert fuzzy 'vibes' into checkable expectations.
- 🍞 Think of a mood thermometer tracking how feelings change within a scene. 🥬 Emotional Depth (Scene-Level):
- What happens: For key characters, the AI marks opening and closing emotions using Valence (positive/negative) and Arousal (high/low), detects whether there's a Shift, and flags Complex_Emotion (mixed feelings).
- Why: Drama thrives on evolving emotions; static arcs feel dull.
- Example: If 3 of 4 scenes shift → Arc Score average = 0.75; Complexity Ratio = complex scenes divided by all scenes.
- Secret sauce: Separating intensity (arousal) from positivity (valence) captures subtler changes.
- 🍞 Like a continuity editor who checks the set notes so nothing contradicts earlier scenes. 🥬 Logic Consistency (Fact-Level):
- What happens: The AI extracts hard constraints from the context (injuries, locations, time limits), then verifies if the continuation Maintains or Violates each.
- Why: One contradiction can undo the story's believability.
- Example: 1 violation, 7 maintained → Break Rate = 1/(1+7) = 12.5%.
- Secret sauce: Fact-level checking separates 'memory' from 'style.'
- 🍞 Think of a coach asking: did the challenge get harder, twist, pause, resolve, or just get forgotten? 🥬 Conflict Handling (Global-Level):
- What happens: The AI identifies the core conflict from context, then labels the handling as Escalation (+2), Twist (+2), Pause (+1), Resolution (0), or Dropped (-5). A weighted average is the Conflict Score.
- Why: Mid-story dramas should raise stakes, not sidestep them.
- Example: Two escalations, one twist, no drops → strong positive score.
- Secret sauce: A steep penalty for 'Dropped' catches the worst storytelling sin: forgetting the problem.
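The dimension bullets above each end in a small formula. The sketch below collects those formulas in one place and reproduces the worked numbers from the examples (9/10 driver beats, 2 OOC lines out of 40, 1 violated fact against 7 maintained, and the +2/+2/+1/0/-5 conflict weights). The data shapes and function names are my own illustrative assumptions, not the released DramaBench code.

```python
from collections import Counter

# Conflict-handling weights exactly as described above; 'Dropped' carries a steep penalty.
CONFLICT_WEIGHTS = {"Escalation": 2, "Twist": 2, "Pause": 1, "Resolution": 0, "Dropped": -5}
TOKENS_PER_PAGE = 250  # Beats per Page is defined as driver beats per 250 tokens

def effective_narrative_rate(beat_labels):
    """Share of beats labeled 'Driver' among all beats (Driver / Static / Redundant)."""
    return beat_labels.count("Driver") / len(beat_labels)

def beats_per_page(beat_labels, continuation_tokens):
    """Driver beats normalized to a 250-token 'page'."""
    return beat_labels.count("Driver") / (continuation_tokens / TOKENS_PER_PAGE)

def ooc_rate(line_labels):
    """Share of dialogue lines labeled 'OOC' (vs. In_Character / Neutral)."""
    return line_labels.count("OOC") / len(line_labels)

def arc_score(scene_has_shift):
    """Fraction of scenes whose labeled emotion shifts between opening and closing."""
    return sum(scene_has_shift) / len(scene_has_shift)

def complexity_ratio(scene_is_complex):
    """Fraction of scenes flagged with Complex_Emotion."""
    return sum(scene_is_complex) / len(scene_is_complex)

def logic_break_rate(fact_labels):
    """Violated facts over all checked facts (Maintained + Violated)."""
    counts = Counter(fact_labels)
    return counts["Violated"] / (counts["Violated"] + counts["Maintained"])

def conflict_score(conflict_labels):
    """Weighted average of conflict-handling labels."""
    return sum(CONFLICT_WEIGHTS[lab] for lab in conflict_labels) / len(conflict_labels)

# Reproduce the worked examples from the text.
print(effective_narrative_rate(["Driver"] * 9 + ["Static"]))        # 0.9
print(beats_per_page(["Driver"] * 9 + ["Static"], 401))             # about 5.6 per 250-token page
print(ooc_rate(["OOC"] * 2 + ["In_Character"] * 38))                # 0.05
print(arc_score([True, True, True, False]))                         # 0.75
print(logic_break_rate(["Violated"] + ["Maintained"] * 7))          # 0.125
print(conflict_score(["Escalation", "Escalation", "Twist"]))        # 2.0 (strong positive)
```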
Evaluation setup:
- Inputs: 1,103 scripts in proper Fountain format split at scene boundaries when possible (about 70%). Context averages 381 tokens; continuation about 401.
- Models: 8 top LLMs (e.g., GPT-5.2, Qwen3-Max, Gemini-3-Pro), all prompted identically.
- Evaluator: Qwen3-Max performs labeling for 5 content dimensions; Format uses a deterministic parser.
- Statistics: Mann-Whitney U tests with FDR correction check if differences between models are real, not luck.
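For readers who want to see what 'Mann-Whitney U tests with FDR correction' looks like in practice, here is a generic Python sketch using scipy and statsmodels. The per-script metric values are made-up placeholders, and this illustrates the standard procedure rather than the paper's released analysis script.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical per-script metric values (e.g., Beats per Page) for three models.
rng = np.random.default_rng(0)
scores = {
    "model_a": rng.normal(3.5, 0.6, size=200),
    "model_b": rng.normal(2.9, 0.6, size=200),
    "model_c": rng.normal(2.4, 0.6, size=200),
}

# One two-sided Mann-Whitney U test per model pair.
pairs = list(combinations(scores, 2))
p_values = [mannwhitneyu(scores[a], scores[b], alternative="two-sided").pvalue for a, b in pairs]

# Benjamini-Hochberg FDR correction across all pairwise tests.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for (a, b), p_adj, significant in zip(pairs, p_adjusted, reject):
    print(f"{a} vs {b}: adjusted p = {p_adj:.3g}, significant = {significant}")
```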
Why this method is clever (the secret sauce):
- Reproducible: Labels + formulas mean the same input gives the same output.
- Actionable: Every score maps back to specific lines and beats you can fix.
- Independent dials: Dimensions are almost uncorrelated, so each reveals a distinct craft skill.
- Data flywheel: Labels double as training signals (e.g., reduce redundant beats via preference learning).
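As a purely hypothetical illustration of that data flywheel, the labels could be turned into preference pairs for DPO-style training: given two continuations of the same context, prefer the one with fewer labeled problems. The penalty rule and data format below are my own assumptions; the paper proposes the direction but does not prescribe this recipe.

```python
def penalty(labels):
    """Illustrative badness score: count redundant beats and violated facts in the labels."""
    return labels["beats"].count("Redundant") + labels["facts"].count("Violated")

def to_preference_pair(context, cont_a, labels_a, cont_b, labels_b):
    """Build a (prompt, chosen, rejected) triple for preference learning, or None on a tie."""
    pa, pb = penalty(labels_a), penalty(labels_b)
    if pa == pb:
        return None  # no clear preference signal between the two continuations
    chosen, rejected = (cont_a, cont_b) if pa < pb else (cont_b, cont_a)
    return {"prompt": context, "chosen": chosen, "rejected": rejected}
```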
04 Experiments & Results
The test: Measure six skills (Format, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling) on 8 models over 1,103 scripts (8,824 total evaluations). The goal is not just who 'wins,' but where each model is strong or weak.
The competition: 8 leading LLMs, all given the same instructions and the same scene-aware context split. Qwen3-Max serves as the labeling evaluator for the five content dimensions; format is judged by rules.
The scoreboard with context:
- Format Standards: All models scored 0% error. That's like the whole class getting 100% on neat handwriting. It means prompt-based format control is basically solved here.
- Narrative Efficiency: Average ENR was 93.3%, but 'beats per page' exposed big gaps. GPT-5.2 wrote the most driver beats per page (3.52), while GLM-4.6 wrote the fewest (2.43). This is like one runner finishing laps far faster than another.
- Character Consistency: Out-of-Character (OOC) lines were rare on average (0.97%); 87% of scripts were perfect. But when failures happened, they were glaring (a long tail of outliers).
- Emotional Depth: 90.3% of continuations had emotional shifts; complex emotions appeared in 97.5% of scenes. Arousal shifts were more common than valence shifts, meaning models changed intensity more easily than positivity/negativity.
- Logic Consistency: The biggest separator. Logic break rates ranged roughly from 2.0% (GPT-5.2) to 5.3% (GLM-4.6). That's like some players almost never turning the ball over, while others fumble more often.
- Conflict Handling: Most models escalated or twisted conflicts (about 79% escalations). Very few 'dropped' the main problem (~1.5% drop rate). Gemini-3-Pro topped this metric.
Who stood out and how:
- GPT-5.2: Most balanced leader: best in narrative efficiency, character consistency, and logic; still strong in others.
- Qwen3-Max: Best in emotional depth (highest arc rate, rich complexity), but weaker in narrative efficiency.
- Gemini-3-Pro: Best at conflict handling, though emotional depth was its weakest area.
Statistical confidence (is it real?):
- 252 pairwise tests across metrics with FDR correction found 65.9% significant differences. 'Beats per Page' had the most significant and large-effect gaps (26/28 significant; 20 large). 'Conflict Score' had the least variance across models.
Surprising findings:
- Everyone can format like a pro by prompt alone; format is no longer the bottleneck.
- Emotional complexity is common across models, but changing intensity (arousal) is easier than flipping positivity (valence).
- Logic is the Achilles' heel: small percentages matter a lot, since a single contradiction can ruin believability.
- Character consistency looks great on averages but hides rare, dramatic failures (the 'catastrophic outlier' effect).
Concrete case snippets (why the numbers matter):
- Logic win: A continuation that keeps the character weak after childbirth respects prior constraints; no magic recovery.
- Logic fail: Waking up in a bedroom right after being in surgery breaks spatial/temporal continuity; audiences notice.
- Narrative win: Every beat advances (finds hidden camera, realizes betrayal, plans counter, confronts antagonist), with no fluff.
- Narrative fail: Many static beats (repeated gazes, pauses, vague emotions): pretty words, little progress.
Bottom line: No single model is the all-around champion. Skill profiles differ, so training and prompting should be targeted by dimension.
05 Discussion & Limitations
Limitations (specific and honest):
- Evaluator bias exists: On two dimensions, Narrative Efficiency and Character Consistency, human agreement with the AI evaluator was weak. Beat detection and OOC judgments can be subjective, so rankings on these dials should be read with caution.
- English only: The dataset is monolingual; adapting to other languages needs new format parsers and possibly different cultural cues for character voice and emotion.
- Short-form focus: Scripts are short dramas. Full-length films, comedy timing, or experimental structures may require revised metrics.
- Human validation scale: Only 17% of the dataset got human checks; broader validation would strengthen confidence and calibrate the evaluator.
Required resources:
- Evaluation uses API calls for labeling (except format), so you need budget for tokens and access to the evaluator model.
- Parsing and stats scripts are provided, but running 8,824 evaluations at scale requires time and compute for orchestration.
When not to use this benchmark:
- If you're scoring poetry, novels, or interactive fiction without adaptation: screenplay-specific dials won't fully fit.
- If you need ultra-long narrative coherence (multiple episodes or a whole feature): this benchmark checks mid-length continuations, not entire sagas.
- If you must avoid any LLM in the loop: content dimensions rely on an LLM labeler.
Open questions:
- Can multi-evaluator ensembles (e.g., GPT-4 + Claude + Gemini) reduce bias for narrative beats and OOC detection?
- What data or prompts best train models to cut padding (static beats) while preserving emotional richness?
- How do emotional depth and conflict handling transfer across genres (comedy, thriller, romance)?
- Can the label data directly power preference learning to shrink logic breaks below 1% without harming creativity?
- How to extend the framework to cross-episode arcs, B-story interweaving, and season-level payoffs?
06 Conclusion & Future Work
Three-sentence summary:
- DramaBench introduces a six-dimensional, label-first evaluation for drama script continuation, combining rule-based format checks with LLM labeling and straightforward statistics.
- Across 1,103 scripts and 8 top models, it shows no universal winner: strengths differ by dimension, with clear gaps especially in logic and pacing.
- The method is reproducible, interpretable, and actionable (scores connect to exact lines and beats), validated by significance tests and human agreement on 3 of 5 content skills.
Main achievement:
- Turning creative-writing evaluation from a single fuzzy grade into six independent, evidence-backed dials that reveal exactly what to improve.
Future directions:
- Use evaluator ensembles to reduce bias where human agreement was weak; expand to more languages and genres; scale human validation; adapt metrics for long-form and serial arcs; feed labels into training loops (DPO/reward modeling) for continual improvement.
Why remember this:
- Because great drama isn't one thing; it's many crafts in harmony. DramaBench is the first large-scale, fair, and actionable way to measure those crafts separately so models (and writers) can actually get better where it counts.
Practical Applications
- Use the six scores to give scriptwriters targeted notes (e.g., 'reduce redundant beats in Act 2').
- Filter AI-generated scenes that exceed a logic break threshold before human review.
- Tune prompts to fix dialogue-action imbalance when Narrative Efficiency is low.
- Train reward models using labeled OOC lines and redundant beats to improve character and pacing.
- Choose different AIs for different tasks (e.g., one to deepen emotions, another to check logic).
- Add a 'no conflict drop' guardrail in writers' rooms to catch accidental thread loss.
- Automate format compliance checks with the rule-based parser to save editorial time.
- Create classroom rubrics that mirror the six dimensions for fairer grading of student scripts.
- Benchmark new models on DramaBench before deploying them in creative pipelines.
- Monitor model updates over time to see if targeted training reduces specific error types.