Deriving Character Logic from Storyline as Codified Decision Trees
Key Summary
- The paper turns messy character descriptions from stories into neat, executable rules so role-playing AIs act like the character in each specific scene.
- It builds a Codified Decision Tree (CDT): a flowchart of if-then rules validated against lots of scene-action evidence from the original stories.
- Each tree node stores short, verified behavior statements (like "often cheers teammates") and each edge asks a checkable question about the scene (like "Is this a rehearsal?").
- When a new scene arrives, the system walks down the tree, collects only the rules that match the scene, and gives them to the AI to guide the next action.
- The rules are not guessed once and kept forever: they are proposed, tested on data, kept if strong, refined if partial, and thrown out if false.
- Across two big benchmarks (Fandom and Bandori), CDT beats fine-tuning, retrieval, text-only profiles, and even human-written profiles.
- A lighter version (CDT-Lite) keeps most of the gains while being cheaper to build and run.
- More training data steadily improves CDT, and a special mode (goal-driven CDT) can focus on relationships like "how Character A acts toward Character B".
- The trees are interpretable and easy to edit, so developers can inspect, fix, or extend character logic transparently.
Why This Research Matters
CDTs make character behavior predictable, explainable, and easy to fix, which is crucial for safe and enjoyable AI interactions. Games get non-player characters that truly act in character, improving immersion without brittle scripts. Creative writing tools can keep voices consistent across chapters while still reacting to each scene. Educational and support agents gain reliable tone control (polite at office hours, encouraging during practice). Because the rules are executable and validated, teams can audit and edit them, reducing hallucinations and mismatched behavior. Finally, the approach scales with more data and adapts to special goals (like modeling relationships), making it practical for real products.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing pretend with friends. If the scene is a spooky cave, your brave friend goes first; if it's a math class, your sleepy friend zones out. Who does what depends on the situation.
The Concept – Role-playing agents:
- What it is: A role-playing (RP) agent is an AI that acts like a specific character across many scenes.
- How it works: 1) Read the scene, 2) Recall what the character is like, 3) Pick the next action that fits the character.
- Why it matters: Without a clear "who am I?" guide, the AI's behavior drifts (sometimes brave, sometimes not), confusing players and readers. Anchor: In a school festival scene, a "cheerful leader" character rallies the group; in a quiet library scene, the same character whispers advice.
Hook: You know how a character bio card lists traits like "brave" or "shy"? That card is handy but often too vague when the scene changes.
The Concept – Behavioral profiles:
- What it is: A profile is a description that reminds the AI how this character usually behaves.
- How it works: 1) Write traits and habits, 2) Give them to the AI, 3) Hope the AI stays consistent.
- Why it matters: Plain text profiles are not executable, so the AI can misread or ignore them when scenes get tricky. Anchor: A profile saying "always helps friends" doesn't explain what to do in a storm, a test, or a concert, so the AI might waffle.
Hook: Think of a GPS that only shows a map picture, not turn-by-turn directions. That's harder to follow.
The Concept – Grounding:
- What it is: Grounding is feeding the AI scene-specific facts so decisions match what's happening now.
- How it works: 1) Extract what's true in this scene, 2) connect those facts to character rules, 3) choose actions that fit both.
- Why it matters: Without grounding, the AI may act out of place (cheering at a funeral, shouting in a library). Anchor: If the scene says "it's raining" and "the team is late," grounding helps the AI suggest "grab umbrellas and run," not "start a picnic."
Hook: A checklist is clearer than a long essay when you need to act fast.
The Concept – Codification:
- What it is: Codification turns text rules into small, executable checks like tiny programs.
- How it works: 1) Rewrite "brave during challenges" as a function that asks "Is there a challenge now?", 2) if yes, activate "be brave".
- Why it matters: Executable rules are consistent and testable; essays are not. Anchor: "If the scene is a rehearsal, encourage bandmates" becomes a checkable switch the AI can flip only when rehearsal is true (see the sketch below).
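To make this concrete, here is a minimal sketch of codification: a textual trait becomes an executable predicate plus the behavior it unlocks. The `scene_mentions` keyword test is a toy stand-in of our own; a real system would put this question to a classifier or an LLM rather than matching strings.

```python
# Minimal codification sketch: a text rule becomes an executable check.
# `scene_mentions` is a toy stand-in for a real scene classifier or LLM call.

def scene_mentions(scene: str, keyword: str) -> bool:
    """Toy scene check: does the scene text mention the keyword?"""
    return keyword.lower() in scene.lower()

def rehearsal_rule(scene: str):
    """'If the scene is a rehearsal, encourage bandmates' as an if-then rule."""
    if scene_mentions(scene, "rehearsal"):
        return "encourage bandmates"
    return None

print(rehearsal_rule("The band gathers for rehearsal after class."))  # -> encourage bandmates
print(rehearsal_rule("A quiet afternoon in the library."))            # -> None
```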
Hook: You've seen flowcharts: "If you finished homework → play; else → finish homework."
The Concept – Decision trees:
- What it is: A decision tree is a flowchart that asks questions and follows branches to a conclusion.
- How it works: 1) Start at the root, 2) ask a question, 3) go left or right, 4) repeat until you land on an answer.
- Why it matters: Trees give clear, step-by-step choices instead of vague guesses. Anchor: "Is it class time?" If yes → "be quiet and take notes"; if no → "chat with friends." A minimal walk through such a tree follows.
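The same idea as a tiny runnable tree: internal nodes pair a question with yes/no branches, and a recursive walk lands on an answer. The keyword question here is deliberately naive.

```python
# A toy decision tree: internal nodes are (question, yes_branch, no_branch),
# leaves are answer strings. The question is a naive keyword check.

def walk(tree, scene: str) -> str:
    if isinstance(tree, str):          # leaf: we have an answer
        return tree
    question, yes_branch, no_branch = tree
    return walk(yes_branch if question(scene) else no_branch, scene)

tree = (lambda s: "class" in s, "be quiet and take notes", "chat with friends")
print(walk(tree, "math class begins"))  # -> be quiet and take notes
print(walk(tree, "lunch break"))        # -> chat with friends
```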
The world before: RP agents had big language skills but shaky character consistency. People tried three things:
- Write longer human profiles: rich but vague; not executable.
- Fine-tune models on past scenes: memorizes surface patterns; struggles with new contexts.
- Retrieval or aggregation: pull similar scenes or merge mini-summaries; mixes different situations and blurs when rules apply.
The problem: Text descriptions conflate behaviors from very different scenes. Without scene-aware switches, AIs can act brave in class or quiet in battle.
What was missing: A structured, validated, and executable way to say, "In this kind of scene, this character tends to do that," and flip it on only when the scene truly matches.
Real stakes: This powers safer helpers (staying in character), better games (NPCs act predictably but lively), clearer creative writing tools (consistent styles), and easier debugging (you can see and edit the rules).
02 Core Idea
Hook: Think of building a fair-ride operator's panel with labeled switches: "night scene", "on stage", "friend in trouble". You only flip the switches that the current scene triggers.
The Concept – Codified Decision Trees (CDT):
- What it is: A CDT is an if-then flowchart whose nodes store short, validated behavior statements, and whose edges ask scene checks. Following true edges collects only the rules that fit the current scene.
- How it works: 1) Propose candidate if-then rules from clusters of similar scenes and actions, 2) validate them across all data, 3) keep, refine, or discard, 4) repeat to grow a tree, 5) at runtime, traverse the tree and gather matching rules to guide the AI.
- Why it matters: Without CDTs, rules stay fuzzy and sometimes fire at the wrong time; with CDTs, you deterministically retrieve only context-appropriate guidance. Anchor: For a "cheerful band leader," the CDT might add "encourage teammates" only when the edge question "Is this rehearsal?" is true. A data-structure sketch follows.
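As a minimal data-structure sketch (field names are ours, not the paper's): statements live on nodes, and scene questions live on edges.

```python
# Sketch of a CDT node: statements on nodes, scene questions on edges.
# Field names are illustrative, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class CDTNode:
    statements: list = field(default_factory=list)  # validated behavior statements
    children: list = field(default_factory=list)    # (edge_question, CDTNode) pairs

root = CDTNode(statements=["positive character", "direct expression"])
root.children.append(
    ("Is this a rehearsal?", CDTNode(statements=["encourages bandmates and joins in"]))
)
```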
Aha! in one sentence: Instead of summarizing characters as one long paragraph, learn a tree of validated if-then rules so the right pieces snap into place only when the scene truly matches.
Three analogies:
- Cookbook: The CDT is like a cookbook with tabs. You open the "rainy day" tab (edge true) and see "make soup" (node rule); you don't see "grill outdoors" because that tab stays shut.
- Museum audio guide: It only plays track 12 when you're near Exhibit 12 (scene condition true), not all tracks at once.
- Toolbox with safety locks: Tools (behaviors) unlock only when the correct key (scene predicate) turns.
Before vs after:
- Before: Human text profiles and retrieval lump scenes together; rules are untested and not executable.
- After: A hierarchical, validated tree routes scenes to precise, reusable rules; you can inspect, edit, and update locally.
Why it works (intuition, no equations):
- Clustering groups scenes that likely share triggers, reducing noise while proposing rules.
- Validation tests each rule broadly; strong ones get promoted, weak ones are tossed, and in-between ones become subtrees for specialization.
- Traversal ensures we only collect rules whose questions are truly satisfied now; no more "always brave everywhere."
The Concept – Scene-action pairs:
- What it is: Evidence entries like "scene text → next action text" drawn from the original story.
- How it works: 1) Use many pairs to see patterns, 2) propose triggers from recurring scene conditions to recurring actions.
- Why it matters: Without evidence pairs, rules would be guesses. Anchor: Scenes showing "late to rehearsal" often lead to "apologize and rush in."
The Concept – Triggers (if-then rules):
- What it is: Candidates of the form "If scene has A, then character tends to do B."
- How it works: 1) Propose from clusters, 2) validate on all pairs, 3) keep/refine/discard.
- Why it matters: Triggers make behavior conditional, not universal. Anchor: "If a class question is asked → character tends to stay silent" only fires in class, not on stage.
The Concept – Validation (NLI-style check):
- What it is: A data check that asks, "Does this behavior statement really support the observed action?" across many examples.
- How it works: 1) Compare statement to action via entail/neutral/contradict, 2) compute accuracy, 3) accept, refine, or reject.
- Why it matters: Without validation, trees fill with wishful rules that break later. Anchor: A statement "always helps friends" that contradicts many actions gets removed or narrowed.
The Concept – Traversal:
- What it is: Walking the tree by asking edge questions and collecting only the node statements you passed.
- How it works: 1) Start at the root, 2) for each child edge, ask its question, 3) follow true edges, 4) gather visited node statements, 5) give them to the RP model to produce the next action.
- Why it matters: It converts a giant rule set into a scene-specific mini-guide on demand. Anchor: In a "moon concert idea" scene, you follow the "excited plan?" and "rehearsal?" branches but skip "classroom?", so study behaviors don't leak in.
Building blocks:
- Instruction-following embeddings to cluster scene-action pairs by their likely next-verb meaning (reduces surface-similarity traps).
- A recursive hypothesis-validation loop to grow precise, specialized subtrees.
- Usability-oriented Top-K selection at inference to keep the most helpful statements.
- A lightweight variant (CDT-Lite) that replaces expensive validators with a small distilled model.
Anchor overall: Give a new scene to CDT; it returns 2–4 crisp statements like "is excited by wild ideas; encourages bandmates; plays guitar and sings" that the AI then uses to write the next, in-character line.
03 Methodology
High-level recipe: Input (scene-action pairs) → [Cluster similar pairs] → [Hypothesize if-then triggers] → [Validate on whole dataset] → [Grow tree recursively] → Output (a CDT that grounds the RP model per scene).
Hook: Imagine sorting thousands of photos into albums before writing captions; the right sorting makes the captions crisper.
The Concept – Embeddings and clustering:
- What it is: Turning scenes and actions into vectors so similar items clump together.
- How it works: 1) Build an action embedding that reflects likely next verbs, 2) build a scene embedding guided by "Thus, Character decides to …", 3) run K-Means on the joint scene-action vectors.
- Why it matters: Without clustering, hypotheses mix too many behaviors and become mushy. Anchor: Scenes "late to practice," "forgot pick," and "coach waiting" cluster together, hinting at apology/effort actions. A minimal clustering sketch follows.
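To make the step concrete, here is a minimal clustering sketch. The `embed` function is a random placeholder for the instruction-following embedder, and the pairs are invented; any real sentence-embedding model could stand in.

```python
# Clustering sketch: joint scene-action vectors fed to K-Means.
# `embed` is a random placeholder; swap in a real instruction-following embedder.

import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    rng = np.random.default_rng(0)            # placeholder: random vectors
    return rng.normal(size=(len(texts), 32))

pairs = [("late to practice",          "apologizes and rushes in"),
         ("forgot her pick",           "apologizes and borrows one"),
         ("asked a question in class", "stays silent, unsure"),
         ("quiz handed back",          "hides the low score")]
X = np.hstack([embed([s for s, _ in pairs]), embed([a for _, a in pairs])])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```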
Step A – Propose hypotheses in each cluster:
- What happens: In each cluster, the system drafts pairs (q, h): a scene check q ("Is Character in rehearsal now?") and a behavior statement h ("Character encourages bandmates and joins in"). It also proposes global statements (h with no q) when a behavior seems always true.
- Why this exists: To guess clean if-then regularities from locally consistent evidence.
- Example: Cluster about class scenes → q: "Is Character being questioned by a teacher?"; h: "Character tends to be unsure of answers."
Hook: Scientists don't stop at guesses; they test them.
The Concept – Validation with NLI:
- What it is: A truth test that checks whether statement h actually supports action a across the dataset (entailed / neutral / contradicted).
- How it works: 1) Try making h global; if accuracy ≥ the accept threshold, store h at the current node, 2) else filter the data by q and test h there, 3) above accept → make a leaf; below reject → discard; in between → make a child node and recurse on the filtered subset.
- Why it matters: It promotes precise rules, discards fragile ones, and refines ambiguous ones. Anchor: "If rehearsal then encourages bandmates" becomes a child node because it is very accurate only in rehearsal-filtered scenes. A sketch of this check follows.
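A sketch of the check under these definitions. The `nli` function below fakes the entail/neutral verdict with word overlap; the paper uses an LLM or a small NLI model for this judgment.

```python
# Validation sketch: how well does behavior statement h explain observed actions?
# `nli` is a toy stand-in (word overlap); use a real NLI model or LLM in practice.

def nli(statement: str, action: str) -> str:
    overlap = set(statement.lower().split()) & set(action.lower().split())
    return "entail" if overlap else "neutral"

def accuracy(h: str, pairs) -> float:
    """Fraction of actions that statement h supports (is entailed by the check)."""
    verdicts = [nli(h, action) for _, action in pairs]
    return sum(v == "entail" for v in verdicts) / max(len(verdicts), 1)
```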
Step B – Grow nodes recursively (four cases; a condensed code sketch follows the list):
- h is globally strong: add h to the current node's statement set H.
- h is weak even after the q-filter: abolish (q, h).
- h is very strong after the q-filter: add a leaf child with h.
- h is middling after the q-filter: add a child (empty H), pass down the filtered data, and continue the loop until depth or size limits stop.
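Putting the four cases together, a condensed sketch of the grow step, reusing the `CDTNode` and `accuracy` sketches above. Here `propose` and `holds` are assumed stand-ins for the cluster-and-hypothesize step and the scene-question check, and the thresholds are illustrative rather than the paper's values.

```python
# Recursive growth sketch; thresholds and helpers are illustrative assumptions.

THETA_ACC, THETA_REJ, MIN_PAIRS = 0.8, 0.4, 8

def holds(question: str, scene: str) -> bool:
    """Assumed scene check: a toy keyword test standing in for an LLM/NLI call."""
    return question.rstrip("?").split()[-1].lower() in scene.lower()

def propose(pairs):
    """Stand-in for the cluster-and-hypothesize step: yields (q, h) candidates."""
    return [("Is this a rehearsal?", "encourages bandmates")]

def grow(node, pairs, depth=0, max_depth=3):
    if depth >= max_depth or len(pairs) < MIN_PAIRS:
        return                                         # safety rails
    for q, h in propose(pairs):
        if accuracy(h, pairs) >= THETA_ACC:            # case 1: globally strong
            node.statements.append(h)
            continue
        subset = [p for p in pairs if holds(q, p[0])]  # keep scenes where q is true
        acc = accuracy(h, subset) if subset else 0.0
        if acc < THETA_REJ:                            # case 2: weak even after filter
            continue                                   #   -> discard (q, h)
        if acc >= THETA_ACC:                           # case 3: strong after filter
            node.children.append((q, CDTNode(statements=[h])))  # -> leaf child
        else:                                          # case 4: middling -> recurse
            child = CDTNode()
            node.children.append((q, child))
            grow(child, subset, depth + 1, max_depth)
```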
Safety rails and knobs:
- Accept threshold θ_acc, reject threshold θ_rej, and filtering fraction θ_f to avoid endless recursion and keep trees compact.
- Diversification: pass down established statements and path questions to discourage redundant hypotheses.
- Depth cap and minimum data per node to prevent overfitting and blow-ups.
Hook: When walking in a theme park, you only visit rides that match your height and interest.
The Concept – Traversal and Top-K selection:
- What it is: Routing the scene through true edges, collecting node statements, then keeping a top subset to guide the RP model.
- How it works: 1) Start at the root; append its statements, 2) test each child edge question on the new scene, 3) follow true edges recursively, 4) rank collected statements (e.g., by usability), 5) pass the top-K to the RP prompt.
- Why it matters: Too many statements overwhelm; ranking prefers broadly helpful, applicable ones. Anchor: For a concert-planning scene, the chosen top-K might be "easily excited," "encourages friends," "vocalist-guitarist," skipping "poor at studying." A traversal sketch follows.
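A traversal sketch in the same spirit: `holds` is the assumed scene-question check from the growth sketch above, and `usability` stands in for the paper's usability-oriented ranking score.

```python
# Traversal sketch: follow true edges, collect statements, keep the top-K.

def traverse(node, scene, collected=None):
    collected = [] if collected is None else collected
    collected.extend(node.statements)
    for question, child in node.children:
        if holds(question, scene):         # only descend edges that are true now
            traverse(child, scene, collected)
    return collected

def ground(root, scene, usability, k=4):
    statements = traverse(root, scene)
    statements.sort(key=usability, reverse=True)   # rank, e.g. by usability score
    return statements[:k]                          # goes into the RP model's prompt
```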
The secret sauce:
- Instruction-following scene embeddings: Instead of grouping by surface similarity (same episode), embeddings reflect "what action this scene seems to trigger," leading to cleaner clusters.
- Recursive hypothesis-validation: The tree grows only where data support it, giving high precision without losing coverage.
- Executable checks: Every edge is a question the system can answer now (True/False/Unknown), making behavior switches deterministic.
- CDT-Lite: Replace expensive validators with a small, distilled NLI model; most accuracy remains, and costs drop sharply.
The Concept – CDT-Lite:
- What it is: A thrifty version that uses a compact discriminator for validation and traversal.
- How it works: Distill a small model from a stronger validator on a subset, then use it everywhere checks are needed (see the sketch below).
- Why it matters: Validation dominates cost; a light checker makes CDTs practical at scale. Anchor: On Bandori dialogues, CDT-Lite kept top performance while cutting the expense of validation calls by large margins.
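A toy distillation sketch of the CDT-Lite idea: label a small sample with the expensive validator, train a tiny classifier to imitate it, and use the student wherever a check is needed. The teacher rule, the data, and the bag-of-words features below are all invented for illustration; the paper distills a small NLI model from a stronger validator.

```python
# CDT-Lite in miniature: distill an expensive validator into a small model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def teacher(statement: str, action: str) -> str:
    """Pretend-expensive validator (toy word-overlap rule)."""
    overlap = set(statement.split()) & set(action.split())
    return "entail" if overlap else "neutral"

data = [("encourages bandmates", "kasumi encourages the whole band"),
        ("encourages bandmates", "kasumi reads quietly in the library"),
        ("poor at studying",     "kasumi is poor at the pop quiz"),
        ("poor at studying",     "kasumi wins the spelling bee")]
texts  = [s + " [SEP] " + a for s, a in data]
labels = [teacher(s, a) for s, a in data]          # teacher labels the sample

vec = CountVectorizer().fit(texts)
student = LogisticRegression().fit(vec.transform(texts), labels)
# The cheap student now replaces the teacher for validation and traversal checks.
```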
Worked mini-example:
- Input cluster: Scenes where Arisa hesitates and Kasumi proposes a wild plan; actions show Kasumi pumps up the team.
- Hypothesis: q = "Is there an exciting plan?"; h = "Kasumi gets fired up and encourages others."
- Validation: Across all pairs, h is strong only when q is true → add a child node.
- Traversal: In a "live on the moon" scene, q is True → collect h, plus root statements like "positive character; direct expression."
- Output grounding: ["easily excited," "encourages bandmates," "vocalist-guitarist"] → the RP model produces an in-character reply like cheering the team into action.
04 Experiments & Results
Hook: Think of a spelling bee. You don't just say "I'm good"; you spell lots of words while judges compare you to others.
The Concept – Benchmarks:
- What it is: Shared, carefully built test sets so everyone measures on the same playing field.
- How it works: Here, two benchmarks: Fandom (fine-grained story actions across 8 series, 45 characters, 20,778 pairs) and Bandori (dialogue-level role-play across 8 bands, 40 characters, 7,866 pairs), plus 77,182 Bandori event-story pairs for scaling/OOD tests.
- Why it matters: Without benchmarks, we can't fairly compare methods. Anchor: Next-action prediction is scored automatically, so differences are clear and repeatable.
The Concept – Baselines (the competition):
- What it is: Other strong ways to use the same data.
- How it works: Vanilla prompting; fine-tuning on past scene-action pairs; Retrieval-based In-Context Learning (RICL); Extract-Then-Aggregate (ETA) textual profiles; Human Profile; and Codified Human Profile.
- Why it matters: If CDT only beat weak rivals, it would not be convincing; here it outperforms strong ones too. Anchor: CDT is compared to both automated and hand-written profile strategies.
The Concept – NLI score:
- What it is: A number based on whether the ground-truth action entails, is neutral to, or contradicts the model's predicted action (100/50/0).
- How it works: For each test scene, score the prediction against the reference; average over all scenes (a miniature implementation follows).
- Why it matters: It captures "does this follow the same character logic?" rather than exact wording. Anchor: Saying "encourages teammates" vs "cheers everyone up" still counts as entailed.
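The metric in miniature; `judge` is an assumed NLI call that labels each (reference, prediction) pair.

```python
# NLI score sketch: entail -> 100, neutral -> 50, contradict -> 0, averaged.

SCORES = {"entail": 100, "neutral": 50, "contradict": 0}

def nli_score(examples, judge):
    """examples: (reference_action, predicted_action) pairs; judge: an NLI labeler."""
    return sum(SCORES[judge(ref, pred)] for ref, pred in examples) / len(examples)
```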
The scoreboard with context:
- Fandom (fine-grained): CDT reaches 60.8 vs the Human Profile's 58.3; it also beats Vanilla, fine-tuning, RICL, and ETA across all 8 series.
- Interpretation: That's like getting an A when the best alternative gets a high B.
- Bandori (conversational): CDT-Lite reaches 79.0, topping all bands and surpassing human and codified human profiles.
- Interpretation: On fast, chatty dialogues, situation-aware rules shine even more.
Surprising findings:
- Fine-tuning sometimes underperforms Vanilla because it memorizes early-story token habits and struggles with the strict chronological split to future scenes.
- CDT-Lite often matches or slightly beats the heavier CDT, showing that smart structure beats sheer validator size for this task.
- Verbalized/Wikified CDT (text-only versions of the tree) still beats human profiles, proving the tree induces higher-quality knowledge, even when flattened.
- Scaling law: More scene-action pairs keep improving CDT. On Fandom, with just 64 pairs, CDT already beats human profiles.
- Goal-driven CDT (relations): Focusing the tree on interactions like Haruhi-Kyon yields extra gains in scenes where those characters interact, and it helps on the overall test too.
OOD (out-of-domain) generalization:
- Bandori event stories (77K actions) show that CDT trained on main stories transfers well: it beats Vanilla and Human Profiles across all bands.
- Open-ended rollouts: Human and LLM judges prefer CDT's behavior in more cases than human profiles, indicating better stability in fresh scenarios.
Ablations and variants (what matters most):
- Remove validation? Performance drops: keeping only tested rules is crucial.
- No clustering or no instruction-following embeddings? Performance drops again: clean clusters seed clean rules.
- Depth: Deeper helps until it saturates; too shallow loses specificity.
- Top-K ranking: Usability-ranked statements ground best on average.
05 Discussion & Limitations
Limitations (an honest look):
- Uses only storyline scene-action data: it doesn't directly include author notes or official trait sheets; that avoids mismatched priors but may miss rare, canonical facts.
- Built offline: trees are static; they don't yet update during live interaction.
- Single-character focus: multi-party joint profiling is future work (though relation-focused trees help for pairs).
- Text-only scenes: multimodal signals (game states, visuals) are not yet integrated.
Required resources:
- A corpus of scene-action pairs (the more, the better), an RP model, and a validator (an LLM or a small distilled discriminator) for the hypothesis-validation loop.
- Some compute for clustering and validation; CDT-Lite greatly reduces the latter.
When not to use:
- If you have almost no data (tiny or noisy storylines), the tree may overfit or stay too shallow to help.
- If scenes rely on non-text signals (UI states, physics) with no text form, you first need a text grounding layer.
- If you require on-the-fly personality shifts (characters who canonically change from scene to scene mid-story), you'll need incremental updating.
Open questions:
- Continual learning: How to safely add new branches as stories grow, without breaking old logic?
- Multi-character/global constraints: How to coordinate several CDTs so group dynamics remain coherent?
- Multimodal grounding: How to plug in images, audio, or game states as additional edge questions?
- Robustness: How to detect and repair spurious triggers or dataset artifacts automatically?
- Human-in-the-loop editing: What is the best UI for inspecting, approving, and revising branches at scale?
06 Conclusion & Future Work
Three-sentence summary:
- This paper turns narrative evidence (scene-action pairs) into a Codified Decision Tree: an executable, interpretable map of when a character tends to do what.
- The tree grows by proposing if-then triggers, validating them on all data, and refining them into specialized subtrees; at runtime it routes scenes to collect only matching rules.
- Across benchmarks, CDT (and CDT-Lite) beats fine-tuning, retrieval, and even human-written profiles, stays easy to inspect, and scales with more data.
Main achievement:
- Demonstrating that validated, situation-aware codification, not just longer text profiles, yields reliably grounded, in-character behavior, and doing so in an interpretable, editable structure.
Future directions:
- Continual and online CDT updates, multi-character joint trees for group logic, and multimodal edges that check vision/game-state signals.
- Richer goal-driven subtrees (relations, emotions, goals) and better human-in-the-loop editing tools.
Why remember this:
- CDT shows a practical path from story data to executable character logic: transparent enough to trust, modular enough to edit, and strong enough to outperform human profiles, turning "persona" from paragraph to program.
Practical Applications
- Build in-character game NPCs that reliably react to quests, battles, or town scenes using CDT traversal.
- Design chat companions that maintain tone and habits (supportive, calming, upbeat) only when scene checks apply.
- Convert long franchise lore into editable behavior trees for consistent storytelling assistants.
- Ground customer-service bots with situation-aware rules (refund cases, escalations) that are auditable by policy teams.
- Create writer tools that keep character voices consistent across new chapters or spin-offs.
- Deploy educational tutors that switch strategies (hinting, quizzing, encouraging) based on lesson scene checks.
- Model character relationships (goal-driven CDT) for more precise dialogue between specific pairs (mentor-student, rivals).
- Use CDT-Lite to scale across many characters with limited compute while preserving interpretability.
- Translate trees into human-readable wiki profiles for documentation or community review.
- Continuously refine rules by validating new scene-action logs and updating only the affected subtrees.