MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences
Key Summary
- MeepleLM is a special AI that reads a board game's rulebook and pretends to be different kinds of players to give helpful, honest feedback.
- It uses the MDA idea (Mechanics → Dynamics → Aesthetics) to first imagine how the rules will play out and then explain how that would feel for a player.
- The team collected 1,727 cleaned-up rulebooks and 150,000 high-quality reviews to teach the model what real players like and dislike.
- They discovered five player personas (like System Purist and Social Lubricator) so the AI can tailor feedback to different tastes instead of giving one-size-fits-all answers.
- MeepleLM beats strong general models (like GPT-5.1 and Gemini3-Pro) at matching community opinions and writing grounded, varied critiques.
- It avoids the common AI "play-it-safe" habit by capturing both praise and complaints, including strong negative opinions when deserved.
- User tests show people preferred MeepleLM's feedback about 70% of the time for being more authentic and useful.
- The method helps designers spot problems early, try fixes faster, and understand which audiences will enjoy a game.
- Even though it's text-only today, future versions aim to look at boards, cards, and art, and to learn individual player tastes.
- This approach could also test other interactive systems, not just board games, making Human-AI teamwork more audience-aware.
Why This Research Matters
MeepleLM gives designers fast, honest previews of how different players will actually feel about their game, before costly printing or long human playtests. It helps teams catch pacing issues, fairness problems, or teaching hurdles earlier, when they are cheaper to fix. Players benefit too: the model can surface which audiences will likely love or dislike a game, improving recommendations. Because the approach focuses on experience (not just text), it encourages human-centered design that respects different tastes. This same recipe (rules → imagined dynamics → persona reactions) can be applied to many interactive systems like learning apps or training simulations. In short, it nudges AI from being a rule checker to being a partner that understands people.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you build a new board game with your friends. You write the rules, print some cards, and feel excited. But to really know if it's fun, you need lots of playtests with many different people. That takes time, money, and many, many evenings.
🥬 Filling (The Actual Concept):
- What it is: This paper introduces MeepleLM, a virtual playtester: an AI that reads game rules, pretends to be different kinds of players, and gives feedback as if those people just played your game.
- How it works: It learns from real rulebooks and real player reviews, thinks through how rules turn into gameplay moments, and then explains how different players would feel about it.
- Why it matters: Without this, designers either wait for long, expensive human playtests or rely on generic AI opinions that don't match real players.
🍞 Bottom Bread (Anchor): Think of a party game like One Night Ultimate Werewolf. Some players love the chaos; others hate the confusion. MeepleLM can explain both views before you print your prototype.
You know how: When you read a recipe (the rules), you can kind of imagine how the dish will taste, but you still don't fully know until you cook it and eat it with different friends? That's board game testing. The recipe (rulebook) is just the start; the experience comes out when people actually play.
🥬 The Problem Before: Designers and new AI tools could already write rules, generate game ideas, or even code simple engines. But two big things were missing.
- Inferring Latent Dynamics: Rules are static text. Real fun happens when rules collide during play (like bluffing, blocking, or push-your-luck moments). Most AIs don't have a full game engine to simulate turns, so they struggle to truly imagine gameplay.
- Modeling Subjective Preferences: People are different. A tight, brain-burny strategy can thrill a logic lover but exhaust a social butterfly. If AI gives a single average answer, it becomes bland and unhelpful.
🍞 Sandwich: Explaining Large Language Models (LLMs) 🍞 Hook: You know how a super-helpful librarian can answer almost any question? 🥬 Concept: LLMs are smart computer programs that read and write human language.
- How it works: 1) Read a lot of text. 2) Learn patterns. 3) Predict the next words to answer or explain.
- Why it matters: Without LLMs, we can't build a text-based virtual playtester that reads rulebooks and writes critiques. 🍞 Anchor: When you ask an AI, "Explain how this game works," that's an LLM talking.
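To make the "predict the next words" loop concrete, here is a minimal sketch using the Hugging Face transformers library; the tiny gpt2 checkpoint is only a stand-in for demonstration (MeepleLM itself builds on Qwen3-8B):

```python
# Minimal next-token generation loop; gpt2 is a small stand-in model,
# not the one used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "In this board game, each player secretly"
inputs = tokenizer(prompt, return_tensors="pt")

# generate() repeatedly predicts the most likely next token and appends it.
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```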
🍞 Sandwich: Explaining Emergent Experience 🍞 Hook: Imagine starting a snowball rolling down a hill; the path it takes depends on bumps and twists you can't see from the top. 🥬 Concept: Emergent experience means the fun you feel isn't just in the rules; it's created during play by players' choices and interactions.
- How it works: Rules (Mechanics) cause interactions (Dynamics), which lead to feelings (Aesthetics).
- Why it matters: If you only read the rules, you might miss what it truly feels like to play. 🍞 Anchor: In a bluffing game, a single clever lie can flip the whole table's mood from calm to giggly chaos.
Failed Attempts: Earlier systems gave feedback like grammar checkers for rulebooks or made new mechanics by focusing on syntax. That often led to "correct" but boring or broken ideas. Others tried to simulate users without grounding in real data, which risked stereotypes or fake-sounding opinions.
The Gap: Designers needed a tool that could:
- Imagine gameplay without a full engine (make the invisible dynamics visible), and
- Talk like real players with real preferences (not just average opinions).
What This Paper Adds: MeepleLM learns from 1,727 structured rulebooks and 150,000 carefully filtered reviews. It uses the MDA (Mechanics → Dynamics → Aesthetics) chain as a thinking path and speaks from five real, data-derived player personas. This creates critiques that are both grounded (matching the rules) and personal (matching the audience).
Real Stakes: Why care? Faster feedback shortens design cycles. Better alignment avoids frustrating players. And clearer, audience-specific notes help you decide who a game is for (family night, strategy club, or party crowd) before you spend your budget.
02 Core Idea
🍞 Top Bread (Hook): You know how a good movie trailer helps you picture how the full film will feel, and how different people (your grandma vs. your best friend) will react differently? That's what this paper does for games.
🥬 The Aha in One Sentence: Teach an AI to first imagine how rules play out (M→D→A), then speak in the voice of different kinds of players (personas) to give targeted, useful critiques.
🍞 Sandwich: Explaining the MDA Framework 🍞 Hook: Imagine baking cookies: ingredients (flour, sugar), baking process (mixing, heat), final taste (yum!). 🥬 Concept: MDA is a way to think about games: Mechanics (rules) → Dynamics (interactions in play) → Aesthetics (player feelings).
- How it works: 1) Identify key rules, 2) Imagine the interactions those rules cause, 3) Predict how players will feel.
- Why it matters: Without walking through this chain, AI might judge a game by text alone and miss the real fun or frustration. 🍞 Anchor: A "role-swap" rule (Mechanic) causes confusion and bluffing (Dynamics), which some find hilarious and others find stressful (Aesthetics).
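As a sketch, the chain can be pictured as a tiny data structure that forces every judgment to pass through all three layers (the field names here are illustrative, not the paper's actual format):

```python
from dataclasses import dataclass

# Hypothetical container for one MDA reasoning step; field names are
# illustrative, not taken from the paper's data format.
@dataclass
class MDAChain:
    mechanic: str   # a concrete rule, quoted from the rulebook
    dynamic: str    # the play pattern that rule is expected to create
    aesthetic: str  # the feeling that pattern produces for a given player

chain = MDAChain(
    mechanic="At night, some roles secretly swap cards.",
    dynamic="Players bluff and reason over shifting, unreliable information.",
    aesthetic="Hilarious chaos for some players, broken deduction for others.",
)
print(f"{chain.mechanic} -> {chain.dynamic} -> {chain.aesthetic}")
```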
🍞 Sandwich: Explaining Persona Modeling 🍞 Hook: Think of your friend group: there's a strategist, a storyteller, a party-starter, a speed-lover, and a thrill-chaser. 🥬 Concept: Personas are profiles that summarize what different types of players care about.
- How it works: 1) Gather many real reviews, 2) Cluster similar viewpoints, 3) Refine into clear archetypes (like System Purist, Social Lubricator), 4) Use them to guide critiques.
- Why it matters: Without personas, feedback becomes an average opinion that helps no one. 🍞 Anchor: A System Purist may complain about randomness that a Thrill Seeker celebrates.
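A small sketch of the idea that a persona is a full profile rather than a bare name; the persona names come from the paper, but the profile text below is invented for illustration:

```python
# Persona names are from the paper; the profile text is invented here
# purely to illustrate "full profile as context, not just a label."
PERSONAS = {
    "System Purist": {
        "values": "elegant, deterministic systems where skill decides the winner",
        "dislikes": "swingy luck, kingmaking, ambiguous rules",
    },
    "Social Lubricator": {
        "values": "laughter, table talk, and easy onboarding for new players",
        "dislikes": "long silent turns, heavy upkeep, analysis paralysis",
    },
}

def persona_context(name: str) -> str:
    """Build the profile string handed to the model as context."""
    p = PERSONAS[name]
    return f"You are a {name}. You value {p['values']}. You dislike {p['dislikes']}."

print(persona_context("System Purist"))
```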
Three Analogies for the Core Idea:
- Cooking Show: Read the recipe (rules), picture the cooking (dynamics), taste in your mind (aesthetics), then ask five different judges for opinions (personas).
- Wind Tunnel for Games: Put the rulebook in, simulate many play winds, then measure how different flyers (personas) react to the turbulence.
- Book Club with Characters: The same story gets five different reviews because each reader values something different: pacing, logic, drama, humor, or surprise.
Before vs. After:
- Before: AI feedback felt generic, often ignored how plays unfold, and didn't match real community splits (love vs. hate).
- After: MeepleLM predicts the range of opinions, ties comments to specific rules and moments, and aligns with authentic player voices.
Why It Works (Intuition):
- Forcing the AI to think Mechanics → Dynamics → Aesthetics keeps critiques grounded and logical.
- Conditioning on a persona focuses the emotional lens, so the same gameplay moment gets honestly different reactions.
- Training on cleaned rulebooks and high-quality, facet-scored reviews reduces noise and teaches realistic patterns.
Building Blocks:
- Structured Rulebooks: Clean, consistent rule summaries so the model won't hallucinate.
- High-Quality Reviews: Filtered by reasonableness (MDA links, constructive content).
- Persona Discovery: Data-driven clusters refined with experts into five clear archetypes.
- MDA Chain-of-Thought: An explicit, step-by-step reasoning path from rules to feelings.
- Verifier Loop: A second model checks that the reasoning matches the real rating and the rules.
- Persona-Conditional Tuning: Fine-tuning the model to produce reasoning + critique that fits each persona's style.
🍞 Bottom Bread (Anchor): If the rulebook says "night roles swap cards," MeepleLM predicts bluffing chaos, then explains why the Social Lubricator loves the laughter while the System Purist dislikes the lost certainty.
03 Methodology
At a high level: Input (Structured Rulebook + Target Persona) → MDA Reasoning (think-through of Mechanics → Dynamics → Aesthetics) → Persona-Aligned Critique (rating + review).
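Read as code, the flow might look like the hypothetical function below; `llm` stands in for the fine-tuned model as a plain prompt-to-text callable, and all names are illustrative:

```python
# Hypothetical end-to-end call; `llm` is any callable mapping a prompt
# string to a completion string, and all names are illustrative.
def virtual_playtest(rulebook: str, persona_profile: str, llm) -> dict:
    """Structured rulebook + persona profile -> MDA reasoning -> critique."""
    prompt = (
        f"{persona_profile}\n\n"
        f"Rulebook:\n{rulebook}\n\n"
        "Reason step by step: Mechanics -> Dynamics -> Aesthetics. "
        "Then give a 1-10 rating and a short review in this persona's voice."
    )
    return {"persona": persona_profile, "critique": llm(prompt)}
```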
Step 1: Curate and Structure Rulebooks
- What happens: 1,727 official rulebooks are parsed from PDFs, reorganized into the same clean format (objective, components, setup, flow, scoring, edge cases) and cross-checked for accuracy.
- Why it exists: The model needs a reliable source of truth. Messy, inconsistent text leads to errors.
- Example: "Place 7 artifacts" vs. "place 5 artifacts." The pipeline fixes such mismatches so critiques stay grounded.
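A sketch of what that shared clean format could look like as a schema, with a toy cross-check for the artifact-count example; the field names mirror the list above but are otherwise assumptions:

```python
import re
from dataclasses import dataclass, field

# Hypothetical schema; the sections mirror Step 1's list (objective,
# components, setup, flow, scoring, edge cases).
@dataclass
class StructuredRulebook:
    title: str
    objective: str
    components: list[str] = field(default_factory=list)
    setup: list[str] = field(default_factory=list)
    turn_flow: list[str] = field(default_factory=list)
    scoring: str = ""
    edge_cases: list[str] = field(default_factory=list)

def find_count_mismatches(rb: StructuredRulebook) -> list[str]:
    """Toy cross-check: flag '7 artifacts' in one section vs '5 artifacts' in another."""
    seen: dict[str, str] = {}
    issues = []
    for line in rb.components + rb.setup:
        for num, noun in re.findall(r"(\d+)\s+([a-z]+)", line.lower()):
            if noun in seen and seen[noun] != num:
                issues.append(f"{noun}: {seen[noun]} vs {num}")
            seen[noun] = num
    return issues
```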
Step 2: Filter and Score Reviews
- What happens: From 1.8M raw reviews, the team keeps ~150K high-quality ones using three checks: remove junk (too short, off-topic), score how well they link Mechanics → Dynamics → Aesthetics, and tag content facets (like luck vs. strategy, pacing, rule clarity).
- Why it exists: The AI should learn from reviews that actually explain why a game felt the way it did.
- Example: "The role-swap ruins deduction" gets high causal scores; "Cool art!" gets filtered out for being gameplay-irrelevant.
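A compressed sketch of the three checks; `score_mda_link` and `tag_facets` are hypothetical stand-ins for the paper's model-based scorer and facet tagger:

```python
# Toy review filter. `score_mda_link` and `tag_facets` are stand-ins for
# the paper's model-based scorer and facet tagger.
def filter_review(text: str, score_mda_link, tag_facets,
                  min_len: int = 40, min_score: float = 0.5):
    # Check 1: drop junk (too short or off-topic to explain anything).
    if len(text.strip()) < min_len:
        return None
    # Check 2: keep only reviews that causally link rules to feelings.
    if score_mda_link(text) < min_score:
        return None
    # Check 3: tag content facets (luck vs. strategy, pacing, rule clarity).
    return {"text": text, "facets": tag_facets(text)}

# "Cool art!" fails the checks; "The role-swap ruins deduction" passes.
```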
🍞 Sandwich: Chain-of-Thought (CoT) 🍞 Hook: You know how a math teacher asks you to "show your work" so they can see your steps? 🥬 Concept: CoT is the AI writing down its thinking path before giving the final answer.
- How it works: It lists Mechanics (What), infers Dynamics (How it plays), and ends with Aesthetics (How it feels).
- Why it matters: Without showing its steps, the AI might jump to a shallow or wrong conclusion. 🍞 Anchor: For a push-your-luck rule, the model first notes the rule, then imagines risk-taking turns, then concludes "tense and exciting" for Thrill Seekers or "too swingy" for Purists.
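A hypothetical prompt template that enforces this "show your work" order (the paper's exact wording is not reproduced here):

```python
# Hypothetical CoT prompt; the paper's exact wording is not reproduced.
COT_TEMPLATE = """\
Rulebook excerpt:
{rule}

Persona: {persona}

Think step by step before answering:
1. Mechanics (What): restate the key rule.
2. Dynamics (How it plays): describe the turns and interactions it causes.
3. Aesthetics (How it feels): predict this persona's reaction.

Then give a 1-10 rating and a two-sentence review.
"""

prompt = COT_TEMPLATE.format(
    rule="You may keep drawing cards on your turn, but busting loses everything.",
    persona="Thrill Seeker",
)
```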
Step 3: Discover Personas
- What happens: Reviews are embedded with extra tags (sentiment tiers and content facets), clustered, then interpreted with an expert-in-the-loop into five personas: System Purist, Efficiency Essentialist, Narrative Architect, Social Lubricator, Thrill Seeker.
- Why it exists: People weigh different values. Modeling these differences is key to useful, non-generic critiques.
- Example: "Dice ruin control" → Purist; "Epic adventure!" → Narrative Architect; "Best ice-breaker ever!" → Social Lubricator.
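Mechanically, the clustering half of this step could look like the sketch below; sentence-transformers and scikit-learn are assumptions standing in for whatever embedding model and clustering method the paper actually uses:

```python
# Illustrative persona discovery: embed reviews, then cluster. Library and
# model choices are assumptions, not the paper's actual pipeline.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

reviews = [
    "Dice ruin control; skill should decide the winner.",
    "Epic adventure, the story carried the whole night!",
    "Best ice-breaker ever, even non-gamers laughed.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(reviews)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(embeddings)
# The paper augments embeddings with sentiment tiers and content facets,
# then refines clusters with an expert-in-the-loop into five named personas.
print(dict(zip(reviews, labels)))
```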
Step 4: Synthesize MDA Reasoning Chains
- What happens: A strong teacher model reads each rulebook-review pair and reconstructs the missing steps (Mechanic → Dynamic → Aesthetic), specifically grounded in the rulebook and the persona's values.
- Why it exists: Real reviews don't always spell out the chain. Teaching the model the missing middle helps it infer gameplay from rules.
- Example: Mechanic: hidden roles. Dynamic: bluffing + uncertainty. Aesthetic: laughter for Social Lubricator, frustration for Purist.
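A sketch of the teacher call that fills in the missing middle; the prompt wording and the `teacher_llm` interface are hypothetical:

```python
# Hypothetical teacher prompt; `teacher_llm` is any strong model exposed
# as a prompt -> text callable.
def synthesize_chain(rulebook: str, review: str, persona: str, teacher_llm) -> str:
    prompt = (
        f"Rulebook:\n{rulebook}\n\n"
        f"Review written by a {persona}:\n{review}\n\n"
        "The review states a feeling but skips the steps. Reconstruct the "
        "chain: which Mechanic (quote the rule) creates which Dynamic, and "
        "why that Dynamic produces this persona's Aesthetic reaction. "
        "Use only rules that actually appear in the rulebook above."
    )
    return teacher_llm(prompt)
```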
Step 5: Verifier-Guided Filtration
- What happens: Another model double-checks that the chain matches the actual rating and doesn't invent rules. Bad chains get regenerated or tossed.
- Why it exists: This keeps the training set honest and coherent.
- Example: If the review is a 3/10, but the reasoning sounds super positive, the verifier flags it.
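A toy version of that consistency check, assuming the verifier can be asked for a rating estimate; `rating_implied_by_chain` is a hypothetical model call:

```python
# Toy verifier: reject chains whose tone contradicts the review's real rating.
# `rating_implied_by_chain` is a hypothetical verifier-model call.
def chain_is_consistent(chain: str, true_rating: float,
                        rating_implied_by_chain, tolerance: float = 2.0) -> bool:
    implied = rating_implied_by_chain(chain)  # e.g., a 1-10 estimate
    # A glowing chain attached to a 3/10 review fails and gets regenerated.
    return abs(implied - true_rating) <= tolerance
```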
Step 6: Persona-Conditional Instruction Tuning
- What happens: The base model (Qwen3-8B) is fine-tuned to output both the reasoning chain and the final critique, with the persona's full profile provided as context (not just a label).
- Why it exists: The same game moment must generate different reactions depending on who's "speaking."
- Example: Given the rulebook and "You are a System Purist who hates swingy luck," the model writes a critique focused on fairness and control.
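The shape of one tuning record might look like this; the field names and wording are invented, and only the structure (full profile as context, chain plus critique as target) reflects the setup described above:

```python
# Illustrative instruction-tuning record; field names and wording are
# invented, but the structure mirrors Step 6: the full persona profile is
# context, and the target output is reasoning chain + critique together.
training_example = {
    "instruction": (
        "You are a System Purist: you value elegant, deterministic systems "
        "and dislike swingy luck. Read the rulebook and review this game."
    ),
    "input": "Rulebook: At night, some roles secretly swap cards. ...",
    "output": (
        "Mechanics: night-time role swaps. "
        "Dynamics: information decays, so deduction turns into guesswork. "
        "Aesthetics: a frustrating loss of control. Rating: 4/10. "
        "The swap rule breaks clean deduction; I'd want a no-swap variant."
    ),
}
```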
Step 7: Inference Protocol (How Itâs Used)
- What happens: For each test game, the model runs many simulated reviews (e.g., 100), sampling personas in the same proportions as real communities.
- Why it exists: Real communities are mixes of people, not just one type. This recovers realistic rating distributions.
- Example: For Werewolf-style games, you'll see polarized opinions: high scores from socialites, low scores from purists.
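A sketch of that protocol; the persona mix below is invented (the paper matches real per-community proportions), and `simulate_review` stands in for one model call returning a rating:

```python
import random

# Illustrative inference protocol. The persona mix is invented; the paper
# samples personas in real community proportions. `simulate_review` stands
# in for one model call returning a 1-10 rating.
def simulate_community(rulebook: str, simulate_review, n: int = 100) -> list[float]:
    mix = {
        "System Purist": 0.25, "Efficiency Essentialist": 0.15,
        "Narrative Architect": 0.20, "Social Lubricator": 0.25,
        "Thrill Seeker": 0.15,
    }
    personas = random.choices(list(mix), weights=list(mix.values()), k=n)
    return [simulate_review(rulebook, p) for p in personas]

# For a Werewolf-style game the histogram should come back polarized:
# highs from Social Lubricators, lows from System Purists.
```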
The Secret Sauce:
- Forcing the MDA chain keeps critiques tied to rules and believable play moments.
- Feeding full persona profiles teaches the model to switch values and voice naturally.
- Verifier checks reduce hallucinations and mismatches to ratings.
Concrete Mini-Example (One Night Ultimate Werewolf):
- Input: Rulebook says some roles swap cards at night; players discuss and vote.
- CoT: Mechanic (role swaps) → Dynamic (bluffing, shifting information) → Aesthetic (funny chaos vs. broken deduction).
- Output: Social Lubricator gives a high-energy, party-friendly thumbs-up; System Purist warns about logic breakdown and suggests variant rules.
04 Experiments & Results
The Test: Can MeepleLM match what real communities think, stay factually grounded in the rules, and give advice that predicts what players will actually say?
Setups and Competitors:
- Data: 207 held-out games (not in training), covering easy-to-heavy complexity and old-to-new releases.
- Comparisons: GPT-5.1, Gemini3-Pro, Qwen3-235B, and the base Qwen3-8B.
- Protocol: For each game, run many simulated reviews with persona proportions matching real data.
What They Measured (with friendly meanings):
- Preference Alignment: Can the model predict average ratings and rank games like players do? (See the sketch after this list.) • MAE (Mean Absolute Error): Lower is better; think "closer dart hits." • Wasserstein Distance: Lower means the whole shape of the predicted rating distribution matches humans (not just the average). • Kendall's tau: Higher is better; it rewards putting games in roughly the same order as the community.
- Review Quality: • Factual Correctness: Does the critique match the actual rules? • Dist-2: Word variety; avoids sounding repetitive. • Diversity of Perspectives: Within a batch, do the reviews cover different angles like pacing, balance, and interaction?
- Practical Utility: • Opinion Recovery (Op-Rec): Do simulated reviews re-discover real viewpoints mined from human reviews?
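The three alignment metrics are standard and easy to compute with NumPy and SciPy; the numbers below are made up purely for illustration:

```python
# Standard metric computations with NumPy/SciPy; all numbers are made up.
import numpy as np
from scipy.stats import kendalltau, wasserstein_distance

human = np.array([7.8, 6.2, 8.5, 5.9])  # real mean ratings per game
model = np.array([7.5, 6.6, 8.3, 6.4])  # simulated mean ratings per game

mae = np.mean(np.abs(human - model))    # lower = "closer dart hits"
tau, _ = kendalltau(human, model)       # higher = same ordering as players
# Wasserstein compares whole rating distributions, not just their means:
w = wasserstein_distance([1, 8, 9, 9, 2], [2, 7, 9, 8, 3])
print(f"MAE={mae:.2f}  tau={tau:.2f}  W={w:.2f}")
```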
The Scoreboard with Context:
- MeepleLM had the best alignment across the board. Its MAE of about 0.66 is like scoring an A when others are hovering around B's. Its tiny Wasserstein distance (~0.22) means it gets the shape of community opinions right, not just the middle.
- It beats GPT-5.1 and Gemini3-Pro on ranking and distribution, and avoids the "play-it-safe" positivity bias where models crowd scores around 7–9.
- Factual accuracy is very high (~98–99%) while also using richer, more varied language and covering more different topics per batch.
- Opinion Recovery is highest, meaning itâs good at forecasting what the market will actually say.
Surprising Findings:
- Polarization Recovery: In games people argue about, other models tend to smooth over differences. MeepleLM actually recreates the strong split between lovers and haters.
- Persona Helps Ranking: Removing personas drops ranking alignment a lot: proof that modeling different tastes is essential, not optional.
- Grounding is Non-Negotiable: Removing the rulebook context destroys factual accuracy and makes critiques untrustworthy.
Ablations (Turning Features Off):
- No Rulebook: Factual accuracy plummets (~99% to ~60%).
- No Persona: Rankings get worse and critiques feel generic.
- No MDA Chain: Lower recovery of real opinions; the model loses the bridge between rules and experience.
User Study (Blind A/B):
- Participants compared MeepleLM vs. GPT-5.1 on both familiar and unfamiliar games.
- Results: ~70% preferred MeepleLM overall (~78% when familiar with the game; ~74% when unfamiliar). People called it more authentic, less like marketing, and better at pinpointing likely pain points.
05 Discussion & Limitations
Limitations (Be Specific):
- Text-Only: The current model can't see boards, art, or iconography. That matters because visuals affect clarity, theme, and even pacing.
- Coarse Personas: Five personas capture broad tastes but miss personal quirks. Real people can be a blend or change moods.
- Assumes Honest Rulebooks: If the rulebook is incomplete or unclear, the model's reasoning can be thrown off.
- No Full Turn-by-Turn Engine: It imagines dynamics via reasoning, not by literally simulating every move, so very precise balance issues might be missed.
Resources Needed:
- Clean, structured rulebooks; a sizable review corpus; GPU time to fine-tune a strong base model (e.g., Qwen3-8B) with LoRA; and a verifier model for quality control.
When NOT to Use:
- Highly visual or dexterity-heavy games where component feel or table presence drives the fun.
- Ultra-novel prototypes with mechanics so new they barely resemble training data; expect shakier predictions.
- Cases where you need exact numeric balance proofs (like economic micro-tuning) rather than audience reactions.
Open Questions:
- Multimodal Integration: How much does seeing the board art and icons improve predictions?
- Individual Modeling: Can we build stable, privacy-safe profiles for specific players to get hyper-personal feedback?
- Interactive Loops: What happens if the AI "plays" multiple imagined rounds, updating beliefs each turn?
- Bias and Fairness: How to ensure personas donât drift into stereotypes and remain respectful, helpful lenses?
- Beyond Board Games: How well does this generalize to apps, learning tools, or other interactive systems with rules and users?
06 Conclusion & Future Work
Three-Sentence Summary: MeepleLM is a virtual playtester that reads rulebooks, imagines how the game will actually play (Mechanics → Dynamics → Aesthetics), and delivers critiques in the voices of different player personas. Trained on 1,727 rulebooks and 150K carefully filtered reviews, it gives grounded, diverse, and community-aligned feedback. It outperforms general models, predicts real opinions better, and helps designers iterate faster.
Main Achievement: Turning static rules into believable, persona-specific experiences by enforcing an MDA reasoning chain and grounding it in real community data.
Future Directions: Add vision to understand boards and components; shift from group personas to individual player models; and build interactive, multi-turn simulations that refine feedback as fictional plays unfold.
Why Remember This: It shows how to make AI feedback audience-aware and experience-first, not just text-first. That's a big step toward Human-AI collaboration where creative tools don't just function; they empathize with different people's fun.
Practical Applications
- Early-stage playtesting: Predict likely fun/pain points from a draft rulebook without organizing full tables.
- Persona-focused design: Tune a game for System Purists vs. Social Lubricators by comparing their simulated critiques.
- Variant testing: Try house rules (e.g., limit randomness) and see how each persona's satisfaction changes.
- Rulebook QA: Spot unclear steps, missing edge cases, or mismatched counts via grounded critique checks.
- Market fit analysis: Estimate which audiences and channels (party vs. strategy clubs) will rate a game highly.
- Recommendation support: Suggest games to players based on their persona-like tastes.
- Competitive analysis: Compare two similar titles and see where each persona prefers one over the other.
- Post-launch triage: Mine simulated reviews to prioritize fixes that matter most to specific player groups.
- Teaching aids: Generate persona-aware teaching tips (e.g., how to onboard non-gamers smoothly).
- Publisher pitching: Attach believable, persona-aligned feedback to strengthen a game's pitch deck.