Deriving Character Logic from Storyline as Codified Decision Trees
Key Summary
- The paper turns messy character descriptions from stories into neat, executable rules so role-playing AIs act like the character in each specific scene.
- It builds a Codified Decision Tree (CDT): a flowchart of if-then rules validated against lots of scene-action evidence from the original stories.
- Each tree node stores short, verified behavior statements (like "often cheers teammates") and each edge asks a checkable question about the scene (like "Is this a rehearsal?").
- When a new scene arrives, the system walks down the tree, collects only the rules that match the scene, and gives them to the AI to guide the next action.
- The rules are not guessed once and kept forever: they are proposed, tested on data, kept if strong, refined if partial, and thrown out if false.
- Across two big benchmarks (Fandom and Bandori), CDT beats fine-tuning, retrieval, text-only profiles, and even human-written profiles.
- A lighter version (CDT-Lite) keeps most of the gains while being cheaper to build and run.
- More training data steadily improves CDT, and a special mode (goal-driven CDT) can focus on relationships like "how Character A acts toward Character B".
- The trees are interpretable and easy to edit, so developers can inspect, fix, or extend character logic transparently.
Why This Research Matters
CDTs make character behavior predictable, explainable, and easy to fix, which is crucial for safe and enjoyable AI interactions. Games get non-player characters that truly act in character, improving immersion without brittle scripts. Creative writing tools can keep voices consistent across chapters while still reacting to each scene. Educational and support agents gain reliable tone control (polite at office hours, encouraging during practice). Because the rules are executable and validated, teams can audit and edit them, reducing hallucinations and mismatched behavior. Finally, the approach scales with more data and adapts to special goals (like modeling relationships), making it practical for real products.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing pretend with friends. If the scene is a spooky cave, your brave friend goes first; if it's a math class, your sleepy friend zones out. Who does what depends on the situation.
The Concept – Role-playing agents:
- What it is: A role-playing (RP) agent is an AI that acts like a specific character across many scenes.
- How it works: 1) Read the scene, 2) Recall what the character is like, 3) Pick the next action that fits the character.
- Why it matters: Without a clear "who am I?" guide, the AI's behavior drifts (sometimes brave, sometimes not), confusing players and readers. Anchor: In a school festival scene, a "cheerful leader" character rallies the group; in a quiet library scene, the same character whispers advice.
Hook: You know how a character bio card lists traits like "brave" or "shy"? That card is handy but often too vague when the scene changes.
The Concept – Behavioral profiles:
- What it is: A profile is a description that reminds the AI how this character usually behaves.
- How it works: 1) Write traits and habits, 2) Give them to the AI, 3) Hope the AI stays consistent.
- Why it matters: Plain text profiles are not executable, so the AI can misread or ignore them when scenes get tricky. Anchor: A profile saying "always helps friends" doesn't explain what to do in a storm, a test, or a concert, so the AI might waffle.
Hook: Think of a GPS that only shows a map picture, not turn-by-turn directions. That's harder to follow.
The Concept – Grounding:
- What it is: Grounding is feeding the AI scene-specific facts so decisions match what's happening now.
- How it works: 1) Extract what's true in this scene, 2) connect those facts to character rules, 3) choose actions that fit both.
- Why it matters: Without grounding, the AI may act out of place (cheering at a funeral, shouting in a library). Anchor: If the scene says "it's raining" and "the team is late," grounding helps the AI suggest "grab umbrellas and run," not "start a picnic."
Hook: A checklist is clearer than a long essay when you need to act fast.
The Concept – Codification:
- What it is: Codification turns text rules into small, executable checks like tiny programs.
- How it works: 1) Rewrite "brave during challenges" as a function that asks "Is there a challenge now?", 2) if yes, activate "be brave".
- Why it matters: Executable rules are consistent and testable; essays are not. Anchor: "If the scene is a rehearsal, encourage bandmates" becomes a checkable switch the AI can flip only when rehearsal is true (see the sketch below).
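To make this concrete, here is a minimal sketch of codification: a textual trait becomes an executable predicate plus the behavior it unlocks. The `scene_mentions` keyword test is a toy stand-in of our own; a real system would put this question to a classifier or an LLM rather than matching strings.

```python
# Minimal codification sketch: a text rule becomes an executable check.
# `scene_mentions` is a toy stand-in for a real scene classifier or LLM call.

def scene_mentions(scene: str, keyword: str) -> bool:
    """Toy scene check: does the scene text mention the keyword?"""
    return keyword.lower() in scene.lower()

def rehearsal_rule(scene: str):
    """'If the scene is a rehearsal, encourage bandmates' as an if-then rule."""
    if scene_mentions(scene, "rehearsal"):
        return "encourage bandmates"
    return None

print(rehearsal_rule("The band gathers for rehearsal after class."))  # -> encourage bandmates
print(rehearsal_rule("A quiet afternoon in the library."))            # -> None
```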
Hook: You've seen flowcharts: "If you finished homework → play; else → finish homework."
The Concept – Decision trees:
- What it is: A decision tree is a flowchart that asks questions and follows branches to a conclusion.
- How it works: 1) Start at the root, 2) ask a question, 3) go left or right, 4) repeat until you land on an answer.
- Why it matters: Trees give clear, step-by-step choices instead of vague guesses. Anchor: "Is it class time?" If yes → "be quiet and take notes"; if no → "chat with friends." A minimal walk through such a tree follows.
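The same idea as a tiny runnable tree: internal nodes pair a question with yes/no branches, and a recursive walk lands on an answer. The keyword question here is deliberately naive.

```python
# A toy decision tree: internal nodes are (question, yes_branch, no_branch),
# leaves are answer strings. The question is a naive keyword check.

def walk(tree, scene: str) -> str:
    if isinstance(tree, str):          # leaf: we have an answer
        return tree
    question, yes_branch, no_branch = tree
    return walk(yes_branch if question(scene) else no_branch, scene)

tree = (lambda s: "class" in s, "be quiet and take notes", "chat with friends")
print(walk(tree, "math class begins"))  # -> be quiet and take notes
print(walk(tree, "lunch break"))        # -> chat with friends
```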
The world before: RP agents had big language skills but shaky character consistency. People tried three things:
- Write longer human profiles: rich but vague; not executable.
- Fine-tune models on past scenes: memorizes surface patterns; struggles with new contexts.
- Retrieval or aggregation: pull similar scenes or merge mini-summaries; mixes different situations and blurs when rules apply.
The problem: Text descriptions conflate behaviors from very different scenes. Without scene-aware switches, AIs can act brave in class or quiet in battle.
What was missing: A structured, validated, and executable way to say, "In this kind of scene, this character tends to do that," and flip it on only when the scene truly matches.
Real stakes: This powers safer helpers (staying in character), better games (NPCs act predictably but lively), clearer creative writing tools (consistent styles), and easier debugging (you can see and edit the rules).
02 Core Idea
Hook: Think of building a fair-ride operator's panel with labeled switches: "night scene", "on stage", "friend in trouble". You only flip the switches that the current scene triggers.
The Concept – Codified Decision Trees (CDT):
- What it is: A CDT is an if-then flowchart whose nodes store short, validated behavior statements, and whose edges ask scene checks. Following true edges collects only the rules that fit the current scene.
- How it works: 1) Propose candidate if-then rules from clusters of similar scenes and actions, 2) validate them across all data, 3) keep, refine, or discard, 4) repeat to grow a tree, 5) at runtime, traverse the tree and gather matching rules to guide the AI.
- Why it matters: Without CDTs, rules stay fuzzy and sometimes fire at the wrong time; with CDTs, you deterministically retrieve only context-appropriate guidance. Anchor: For a "cheerful band leader," the CDT might add "encourage teammates" only when the edge question "Is this rehearsal?" is true. A data-structure sketch follows.
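As a minimal data-structure sketch (field names are ours, not the paper's): statements live on nodes, and scene questions live on edges.

```python
# Sketch of a CDT node: statements on nodes, scene questions on edges.
# Field names are illustrative, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class CDTNode:
    statements: list = field(default_factory=list)  # validated behavior statements
    children: list = field(default_factory=list)    # (edge_question, CDTNode) pairs

root = CDTNode(statements=["positive character", "direct expression"])
root.children.append(
    ("Is this a rehearsal?", CDTNode(statements=["encourages bandmates and joins in"]))
)
```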
Aha! in one sentence: Instead of summarizing characters as one long paragraph, learn a tree of validated if-then rules so the right pieces snap into place only when the scene truly matches.
Three analogies:
- Cookbook: The CDT is like a cookbook with tabs. You open the "rainy day" tab (edge true) and see "make soup" (node rule); you don't see "grill outdoors" because that tab stays shut.
- Museum audio guide: It only plays track 12 when you're near Exhibit 12 (scene condition true), not all tracks at once.
- Toolbox with safety locks: Tools (behaviors) unlock only when the correct key (scene predicate) turns.
Before vs after:
- Before: Human text profiles and retrieval lump scenes together; rules are untested and not executable.
- After: A hierarchical, validated tree routes scenes to precise, reusable rules; you can inspect, edit, and update locally.
Why it works (intuition, no equations):
- Clustering groups scenes that likely share triggers, reducing noise while proposing rules.
- Validation tests each rule broadly; strong ones get promoted, weak ones are tossed, and in-between ones become subtrees for specialization.
- Traversal ensures we only collect rules whose questions are truly satisfied now; no more "always brave everywhere."
The Concept – Scene-action pairs:
- What it is: Evidence entries like "scene text → next action text" drawn from the original story.
- How it works: 1) Use many pairs to see patterns, 2) propose triggers from recurring scene conditions to recurring actions.
- Why it matters: Without evidence pairs, rules would be guesses. Anchor: Scenes showing "late to rehearsal" often lead to "apologize and rush in."
The Concept – Triggers (if-then rules):
- What it is: Candidates of the form "If scene has A, then character tends to do B."
- How it works: 1) Propose from clusters, 2) validate on all pairs, 3) keep/refine/discard.
- Why it matters: Triggers make behavior conditional, not universal. Anchor: "If a class question is asked → character tends to stay silent" only fires in class, not on stage.
The Concept – Validation (NLI-style check):
- What it is: A data check that asks, "Does this behavior statement really support the observed action?" across many examples.
- How it works: 1) Compare statement to action via entail/neutral/contradict, 2) compute accuracy, 3) accept, refine, or reject.
- Why it matters: Without validation, trees fill with wishful rules that break later. Anchor: A statement "always helps friends" that contradicts many actions gets removed or narrowed.
The Concept – Traversal:
- What it is: Walking the tree by asking edge questions and collecting only the node statements you passed.
- How it works: 1) Start at the root, 2) for each child edge, ask its question, 3) follow true edges, 4) gather visited node statements, 5) give them to the RP model to produce the next action.
- Why it matters: It converts a giant rule set into a scene-specific mini-guide on demand. Anchor: In a "moon concert idea" scene, you follow the "excited plan?" and "rehearsal?" branches but skip "classroom?", so study behaviors don't leak in.
Building blocks:
- Instruction-following embeddings to cluster scene-action pairs by their likely next-verb meaning (reduces surface-similarity traps).
- A recursive hypothesis-validation loop to grow precise, specialized subtrees.
- Usability-oriented Top-K selection at inference to keep the most helpful statements.
- A lightweight variant (CDT-Lite) that replaces expensive validators with a small distilled model.
Anchor overall: Give a new scene to CDT; it returns 2–4 crisp statements like "is excited by wild ideas; encourages bandmates; plays guitar and sings" that the AI then uses to write the next, in-character line.
03 Methodology
High-level recipe: Input (scene-action pairs) → [Cluster similar pairs] → [Hypothesize if-then triggers] → [Validate on whole dataset] → [Grow tree recursively] → Output (a CDT that grounds the RP model per scene).
Hook: Imagine sorting thousands of photos into albums before writing captions; the right sorting makes the captions crisper.
The Concept – Embeddings and clustering:
- What it is: Turning scenes and actions into vectors so similar items clump together.
- How it works: 1) Build an action embedding that reflects likely next verbs, 2) build a scene embedding guided by "Thus, Character decides to …", 3) run K-Means on the joint scene-action vectors.
- Why it matters: Without clustering, hypotheses mix too many behaviors and become mushy. Anchor: Scenes "late to practice," "forgot pick," and "coach waiting" cluster together, hinting at apology/effort actions. A minimal clustering sketch follows.
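To make the step concrete, here is a minimal clustering sketch. The `embed` function is a random placeholder for the instruction-following embedder, and the pairs are invented; any real sentence-embedding model could stand in.

```python
# Clustering sketch: joint scene-action vectors fed to K-Means.
# `embed` is a random placeholder; swap in a real instruction-following embedder.

import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    rng = np.random.default_rng(0)            # placeholder: random vectors
    return rng.normal(size=(len(texts), 32))

pairs = [("late to practice",          "apologizes and rushes in"),
         ("forgot her pick",           "apologizes and borrows one"),
         ("asked a question in class", "stays silent, unsure"),
         ("quiz handed back",          "hides the low score")]
X = np.hstack([embed([s for s, _ in pairs]), embed([a for _, a in pairs])])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```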
Step A – Propose hypotheses in each cluster:
- What happens: In each cluster, the system drafts pairs (q, h): a scene check q ("Is Character in rehearsal now?") and a behavior statement h ("Character encourages bandmates and joins in"). It also proposes global statements (h with no q) when a behavior seems always true.
- Why this exists: To guess clean if-then regularities from locally consistent evidence.
- Example: Cluster about class scenes → q: "Is Character being questioned by a teacher?"; h: "Character tends to be unsure of answers."
Hook: Scientists don't stop at guesses; they test them.
The Concept – Validation with NLI:
- What it is: A truth test that checks whether statement h actually supports action a across the dataset (entailed / neutral / contradicted).
- How it works: 1) Try making h global; if accuracy ≥ the accept threshold, store h at the current node, 2) else filter the data by q and test h there, 3) above accept → make a leaf; below reject → discard; in between → make a child node and recurse on the filtered subset.
- Why it matters: It promotes precise rules, discards fragile ones, and refines ambiguous ones. Anchor: "If rehearsal then encourages bandmates" becomes a child node because it is very accurate only in rehearsal-filtered scenes. A sketch of this check follows.
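A sketch of the check under these definitions. The `nli` function below fakes the entail/neutral verdict with word overlap; the paper uses an LLM or a small NLI model for this judgment.

```python
# Validation sketch: how well does behavior statement h explain observed actions?
# `nli` is a toy stand-in (word overlap); use a real NLI model or LLM in practice.

def nli(statement: str, action: str) -> str:
    overlap = set(statement.lower().split()) & set(action.lower().split())
    return "entail" if overlap else "neutral"

def accuracy(h: str, pairs) -> float:
    """Fraction of actions that statement h supports (is entailed by the check)."""
    verdicts = [nli(h, action) for _, action in pairs]
    return sum(v == "entail" for v in verdicts) / max(len(verdicts), 1)
```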
Step B – Grow nodes recursively (four cases; a condensed code sketch follows the list):
- h is globally strong: add h to the current node's statement set H.
- h is weak even after the q-filter: abolish (q, h).
- h is very strong after the q-filter: add a leaf child with h.
- h is middling after the q-filter: add a child (empty H), pass down the filtered data, and continue the loop until depth or size limits stop.
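Putting the four cases together, a condensed sketch of the grow step, reusing the `CDTNode` and `accuracy` sketches above. Here `propose` and `holds` are assumed stand-ins for the cluster-and-hypothesize step and the scene-question check, and the thresholds are illustrative rather than the paper's values.

```python
# Recursive growth sketch; thresholds and helpers are illustrative assumptions.

THETA_ACC, THETA_REJ, MIN_PAIRS = 0.8, 0.4, 8

def holds(question: str, scene: str) -> bool:
    """Assumed scene check: a toy keyword test standing in for an LLM/NLI call."""
    return question.rstrip("?").split()[-1].lower() in scene.lower()

def propose(pairs):
    """Stand-in for the cluster-and-hypothesize step: yields (q, h) candidates."""
    return [("Is this a rehearsal?", "encourages bandmates")]

def grow(node, pairs, depth=0, max_depth=3):
    if depth >= max_depth or len(pairs) < MIN_PAIRS:
        return                                         # safety rails
    for q, h in propose(pairs):
        if accuracy(h, pairs) >= THETA_ACC:            # case 1: globally strong
            node.statements.append(h)
            continue
        subset = [p for p in pairs if holds(q, p[0])]  # keep scenes where q is true
        acc = accuracy(h, subset) if subset else 0.0
        if acc < THETA_REJ:                            # case 2: weak even after filter
            continue                                   #   -> discard (q, h)
        if acc >= THETA_ACC:                           # case 3: strong after filter
            node.children.append((q, CDTNode(statements=[h])))  # -> leaf child
        else:                                          # case 4: middling -> recurse
            child = CDTNode()
            node.children.append((q, child))
            grow(child, subset, depth + 1, max_depth)
```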
Safety rails and knobs:
- Accept threshold θ_acc, reject threshold θ_rej, and filtering fraction θ_f to avoid endless recursion and keep trees compact.
- Diversification: pass down established statements and path questions to discourage redundant hypotheses.
- Depth cap and minimum data per node to prevent overfitting and blow-ups.
Hook: When walking in a theme park, you only visit rides that match your height and interest.
The Concept – Traversal and Top-K selection:
- What it is: Routing the scene through true edges, collecting node statements, then keeping a top subset to guide the RP model.
- How it works: 1) Start at the root; append its statements, 2) test each child edge question on the new scene, 3) follow true edges recursively, 4) rank collected statements (e.g., by usability), 5) pass the top-K to the RP prompt.
- Why it matters: Too many statements overwhelm; ranking prefers broadly helpful, applicable ones. Anchor: For a concert-planning scene, the chosen top-K might be "easily excited," "encourages friends," "vocalist-guitarist," skipping "poor at studying." A traversal sketch follows.
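A traversal sketch in the same spirit: `holds` is the assumed scene-question check from the growth sketch above, and `usability` stands in for the paper's usability-oriented ranking score.

```python
# Traversal sketch: follow true edges, collect statements, keep the top-K.

def traverse(node, scene, collected=None):
    collected = [] if collected is None else collected
    collected.extend(node.statements)
    for question, child in node.children:
        if holds(question, scene):         # only descend edges that are true now
            traverse(child, scene, collected)
    return collected

def ground(root, scene, usability, k=4):
    statements = traverse(root, scene)
    statements.sort(key=usability, reverse=True)   # rank, e.g. by usability score
    return statements[:k]                          # goes into the RP model's prompt
```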
The secret sauce:
- Instruction-following scene embeddings: Instead of grouping by surface similarity (same episode), embeddings reflect "what action this scene seems to trigger," leading to cleaner clusters.
- Recursive hypothesis-validation: The tree grows only where data support it, giving high precision without losing coverage.
- Executable checks: Every edge is a question the system can answer now (True/False/Unknown), making behavior switches deterministic.
- CDT-Lite: Replace expensive validators with a small, distilled NLI model; most accuracy remains, and costs drop sharply.
The Concept – CDT-Lite:
- What it is: A thrifty version that uses a compact discriminator for validation and traversal.
- How it works: Distill a small model from a stronger validator on a subset, then use it everywhere checks are needed (see the sketch below).
- Why it matters: Validation dominates cost; a light checker makes CDTs practical at scale. Anchor: On Bandori dialogues, CDT-Lite kept top performance while cutting the expense of validation calls by large margins.
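A toy distillation sketch of the CDT-Lite idea: label a small sample with the expensive validator, train a tiny classifier to imitate it, and use the student wherever a check is needed. The teacher rule, the data, and the bag-of-words features below are all invented for illustration; the paper distills a small NLI model from a stronger validator.

```python
# CDT-Lite in miniature: distill an expensive validator into a small model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def teacher(statement: str, action: str) -> str:
    """Pretend-expensive validator (toy word-overlap rule)."""
    overlap = set(statement.split()) & set(action.split())
    return "entail" if overlap else "neutral"

data = [("encourages bandmates", "kasumi encourages the whole band"),
        ("encourages bandmates", "kasumi reads quietly in the library"),
        ("poor at studying",     "kasumi is poor at the pop quiz"),
        ("poor at studying",     "kasumi wins the spelling bee")]
texts  = [s + " [SEP] " + a for s, a in data]
labels = [teacher(s, a) for s, a in data]          # teacher labels the sample

vec = CountVectorizer().fit(texts)
student = LogisticRegression().fit(vec.transform(texts), labels)
# The cheap student now replaces the teacher for validation and traversal checks.
```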
Worked mini-example:
- Input cluster: Scenes where Arisa hesitates and Kasumi proposes a wild plan; actions show Kasumi pumps up the team.
- Hypothesis: q = "Is there an exciting plan?"; h = "Kasumi gets fired up and encourages others."
- Validation: Across all pairs, h is strong only when q is true → add a child node.
- Traversal: In a "live on the moon" scene, q is True → collect h, plus root statements like "positive character; direct expression."
- Output grounding: ["easily excited," "encourages bandmates," "vocalist-guitarist"] → the RP model produces an in-character reply like cheering the team into action.
04 Experiments & Results
Hook: Think of a spelling bee. You don't just say "I'm good"; you spell lots of words while judges compare you to others.
The Concept – Benchmarks:
- What it is: Shared, carefully built test sets so everyone measures on the same playing field.
- How it works: Here, two benchmarks: Fandom (fine-grained story actions across 8 series, 45 characters, 20,778 pairs) and Bandori (dialogue-level role-play across 8 bands, 40 characters, 7,866 pairs), plus 77,182 Bandori event-story pairs for scaling/OOD tests.
- Why it matters: Without benchmarks, we can't fairly compare methods. Anchor: Next-action prediction is scored automatically, so differences are clear and repeatable.
The Concept – Baselines (the competition):
- What it is: Other strong ways to use the same data.
- How it works: Vanilla prompting; fine-tuning on past scene-action pairs; Retrieval-based In-Context Learning (RICL); Extract-Then-Aggregate (ETA) textual profiles; Human Profile; and Codified Human Profile.
- Why it matters: If CDT only beat weak rivals, it would not be convincing; here it outperforms strong ones too. Anchor: CDT is compared to both automated and hand-written profile strategies.
The Concept – NLI score:
- What it is: A number based on whether the ground-truth action entails, is neutral to, or contradicts the model's predicted action (100/50/0).
- How it works: For each test scene, score the prediction against the reference; average over all scenes (a miniature implementation follows).
- Why it matters: It captures "does this follow the same character logic?" rather than exact wording. Anchor: Saying "encourages teammates" vs "cheers everyone up" still counts as entailed.
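The metric in miniature; `judge` is an assumed NLI call that labels each (reference, prediction) pair.

```python
# NLI score sketch: entail -> 100, neutral -> 50, contradict -> 0, averaged.

SCORES = {"entail": 100, "neutral": 50, "contradict": 0}

def nli_score(examples, judge):
    """examples: (reference_action, predicted_action) pairs; judge: an NLI labeler."""
    return sum(SCORES[judge(ref, pred)] for ref, pred in examples) / len(examples)
```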
The scoreboard with context:
- Fandom (fine-grained): CDT reaches 60.8 vs the Human Profile's 58.3; it also beats Vanilla, fine-tuning, RICL, and ETA across all 8 series.
- Interpretation: That's like getting an A when the best alternative gets a high B.
- Bandori (conversational): CDT-Lite reaches 79.0, topping all bands and surpassing human and codified human profiles.
- Interpretation: On fast, chatty dialogues, situation-aware rules shine even more.
Surprising findings:
- Fine-tuning sometimes underperforms Vanilla because it memorizes early-story token habits and struggles with the strict chronological split to future scenes.
- CDT-Lite often matches or slightly beats the heavier CDT, showing that smart structure beats sheer validator size for this task.
- Verbalized/Wikified CDT (text-only versions of the tree) still beats human profiles, proving the tree induces higher-quality knowledge, even when flattened.
- Scaling law: More scene-action pairs keep improving CDT. On Fandom, with just 64 pairs, CDT already beats human profiles.
- Goal-driven CDT (relations): Focusing the tree on interactions like Haruhi-Kyon yields extra gains in scenes where those characters interact, and it helps on the overall test too.
OOD (out-of-domain) generalization:
- Bandori event stories (77K actions) show that CDT trained on main stories transfers well: it beats Vanilla and Human Profiles across all bands.
- Open-ended rollouts: Human and LLM judges prefer CDT's behavior in more cases than human profiles, indicating better stability in fresh scenarios.
Ablations and variants (what matters most):
- Remove validation? Performance drops: keeping only tested rules is crucial.
- No clustering or no instruction-following embeddings? Performance drops again: clean clusters seed clean rules.
- Depth: Deeper helps until it saturates; too shallow loses specificity.
- Top-K ranking: Usability-ranked statements ground best on average.
05 Discussion & Limitations
Limitations (an honest look):
- Uses only storyline scene-action data: it doesn't directly include author notes or official trait sheets; that avoids mismatched priors but may miss rare, canonical facts.
- Built offline: trees are static; they don't yet update during live interaction.
- Single-character focus: multi-party joint profiling is future work (though relation-focused trees help for pairs).
- Text-only scenes: multimodal signals (game states, visuals) are not yet integrated.
Required resources:
- A corpus of scene-action pairs (the more, the better), an RP model, and a validator (an LLM or a small distilled discriminator) for the hypothesis-validation loop.
- Some compute for clustering and validation; CDT-Lite greatly reduces the latter.
When not to use:
- If you have almost no data (tiny or noisy storylines), the tree may overfit or stay too shallow to help.
- If scenes rely on non-text signals (UI states, physics) with no text form, you first need a text grounding layer.
- If you require on-the-fly personality shifts (characters who canonically change from scene to scene mid-story), you'll need incremental updating.
Open questions:
- Continual learning: How to safely add new branches as stories grow, without breaking old logic?
- Multi-character/global constraints: How to coordinate several CDTs so group dynamics remain coherent?
- Multimodal grounding: How to plug in images, audio, or game states as additional edge questions?
- Robustness: How to detect and repair spurious triggers or dataset artifacts automatically?
- Human-in-the-loop editing: What is the best UI for inspecting, approving, and revising branches at scale?
06 Conclusion & Future Work
Three-sentence summary:
- This paper turns narrative evidence (scene-action pairs) into a Codified Decision Tree: an executable, interpretable map of when a character tends to do what.
- The tree grows by proposing if-then triggers, validating them on all data, and refining them into specialized subtrees; at runtime it routes scenes to collect only matching rules.
- Across benchmarks, CDT (and CDT-Lite) beats fine-tuning, retrieval, and even human-written profiles, stays easy to inspect, and scales with more data.
Main achievement:
- Demonstrating that validated, situation-aware codification, not just longer text profiles, yields reliably grounded, in-character behavior, and doing so in an interpretable, editable structure.
Future directions:
- Continual and online CDT updates, multi-character joint trees for group logic, and multimodal edges that check vision/game-state signals.
- Richer goal-driven subtrees (relations, emotions, goals) and better human-in-the-loop editing tools.
Why remember this:
- CDT shows a practical path from story data to executable character logic: transparent enough to trust, modular enough to edit, and strong enough to outperform human profiles, turning "persona" from paragraph to program.
Practical Applications
- Build in-character game NPCs that reliably react to quests, battles, or town scenes using CDT traversal.
- Design chat companions that maintain tone and habits (supportive, calming, upbeat) only when scene checks apply.
- Convert long franchise lore into editable behavior trees for consistent storytelling assistants.
- Ground customer-service bots with situation-aware rules (refund cases, escalations) that are auditable by policy teams.
- Create writer tools that keep character voices consistent across new chapters or spin-offs.
- Deploy educational tutors that switch strategies (hinting, quizzing, encouraging) based on lesson scene checks.
- Model character relationships (goal-driven CDT) for more precise dialogue between specific pairs (mentor-student, rivals).
- Use CDT-Lite to scale across many characters with limited compute while preserving interpretability.
- Translate trees into human-readable wiki profiles for documentation or community review.
- Continuously refine rules by validating new scene-action logs and updating only the affected subtrees.