GameTalk: Training LLMs for Strategic Conversation
Key Summary
- Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.
- GameTalk trains AIs using rewards that come at the end of the entire dialogue, not just for single replies, so they learn long-term strategy.
- Three new “behavior signals”—ISE (how well you predict others), SRP (how well you act given your beliefs), and LO (how well you influence others)—both diagnose skills and shape training rewards.
- They adapt three training methods (GRPO, DPO, STaR) to multi-turn chats by branching conversations and learning from full outcomes.
- Reward shaping that boosts LO (influence) plus a small bonus for natural-sounding speech leads to the biggest wins.
- In tests on Rock-Paper-Scissors, a pricing game (Bertrand), and a two-term negotiation (Size-Price Bargaining), trained models beat untrained ones and even outdo simple Nash strategies.
- DPO is strongest in the more complex games, GRPO shines in Rock-Paper-Scissors, and STaR helps less because it learns only from positives and can overfit early patterns.
- Surprisingly, improving “opponent modeling” alone (ISE) didn’t raise win rates; teaching influence (LO) did, though it initially made speech stiff—fixed by a naturalness bonus.
- GameTalk shows that conversational fine-tuning is a practical path to LLMs that can reason, negotiate, and coordinate over many turns.
Why This Research Matters
Many everyday tasks—scheduling, customer support, price setting, and negotiating—are long conversations where single good sentences aren’t enough. GameTalk shows how to train AIs to plan across the whole dialogue, not just reply-by-reply. With behavior signals for predicting, choosing, and influencing, models learn persuasive but still natural conversations that close deals and build cooperation. This enables smarter assistants that can actually reach outcomes people care about—agreements, savings, and fair trades. It also highlights new safety needs: aligning influence with honesty and respect when AIs get better at persuasion.
Detailed Explanation
01 Background & Problem Definition
🍞 You know how planning a group trip isn’t about one text, but the whole back-and-forth—suggesting dates, comparing prices, and finally booking? Success depends on the entire conversation, not a single message.
🥬 The Concept: Many AIs (LLMs) are trained and graded one message at a time, but real life needs long, goal-driven conversations. This paper fixes that by training AIs to aim for a goal that’s scored at the very end of the conversation.
- How it works (story of the field):
- Before: LLMs shined at single-turn tasks (summaries, answers), and RLHF helped them be helpful and safe—per turn.
- Problem: In negotiations, teamwork, or scheduling, you must plan across many turns. Optimizing just one reply can break the long plan.
- Attempts that fell short: (a) Static action predictors—good at picking one move, not steering a multi-turn dialogue. (b) Game-specific systems (like adding separate planning modules) work, but don’t teach the LLM itself to converse strategically. (c) Plain sparse rewards (only at the end) often don’t teach the subtle, in-the-middle moves.
- Gap: A general way to train LLMs so their words today set up better outcomes tomorrow—measured by the final result of the whole chat.
- What’s new here: Use reinforcement learning (RL) to optimize across a conversation and create behavior signals that reveal and improve key skills (predicting others, choosing good actions, and influencing).
- Why it matters: Without whole-conversation training, an AI helper might say the right thing now but ruin trust, miss chances to coordinate, or fail to close a deal later.
🍞 Anchor: Imagine an email assistant that helps you negotiate a refund. If it focuses only on sounding nice in its next sentence, it might never push for a clear resolution. If it’s trained to win the whole refund conversation, it learns when to be friendly, when to ask for a supervisor, and when to propose a fair compromise.
— New concept 1 — 🍞 You know how a puppy learns tricks faster when it gets a treat right after doing the right thing?
🥬 Reinforcement Learning (RL): RL is a way to teach AI by giving rewards for good choices.
- How it works:
- The AI tries something.
- It gets a reward (good, bad, or neutral).
- It changes its behavior to earn more reward next time.
- Why it matters: If rewards come only for single messages, the AI won’t plan ahead. If rewards summarize the full conversation, it learns long-term strategy.
🍞 Anchor: In a five-turn negotiation, the AI learns to ask questions early, build trust in the middle, and close the deal at the end—because the big reward comes after everything.
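To make that concrete, here is a tiny toy sketch (an illustration, not the paper's code): a policy picks one of two hypothetical conversation styles, plays a full episode, and is updated only from the single reward that arrives at the end.

```python
import random

# Toy illustration (assumptions, not the paper's code): the agent commits to a
# conversation style, plays a full episode, and only then sees a reward.
STYLES = ["probe_early_then_close", "push_hard_every_turn"]
prefs = {s: 0.0 for s in STYLES}  # policy preferences; higher = chosen more often

def pick_style():
    weights = [2.0 ** prefs[s] for s in STYLES]
    return random.choices(STYLES, weights=weights)[0]

def run_episode(style):
    # Pretend the long-horizon style wins the negotiation more often.
    p_win = 0.7 if style == "probe_early_then_close" else 0.4
    return 1.0 if random.random() < p_win else 0.0  # single end-of-dialogue reward

for _ in range(2000):
    style = pick_style()
    reward = run_episode(style)
    prefs[style] += 0.05 * (reward - 0.5)  # bandit-style update from the final outcome

print(prefs)  # the style that plans the whole conversation should end up preferred
```

The point of the toy: because the reward summarizes the whole episode, credit flows to the long-horizon strategy rather than to whichever single reply sounded best in isolation.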
— New concept 2 — 🍞 Imagine teaching a kid not just to answer, but to make a plan with friends for a whole afternoon—texting, agreeing, and deciding together.
🥬 GameTalk: GameTalk is a training framework that teaches AIs to make smart, strategic choices across multi-turn conversations.
- How it works:
- Put the AI in simple, well-defined games (so success is clear).
- Let the players chat, think privately, then act.
- Give one reward at the end that reflects the whole interaction.
- Use adapted RL methods to improve the AI’s conversation and decisions.
- Why it matters: It directly trains AIs to do what real life needs—reason, coordinate, and negotiate over time.
🍞 Anchor: In Rock-Paper-Scissors with chatting, GameTalk rewards the AI not just for saying “scissors,” but for using talk to nudge the opponent into a predictable move first.
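A minimal, hypothetical sketch of one such episode is below, for conversational Rock-Paper-Scissors. The class names, canned messages, and random final move are assumptions for illustration; only the turn structure (private think, public talk or play, one end-of-episode utility) mirrors the loop described above.

```python
import random

class ToyPlayer:
    """Stand-in for an LLM player; a real agent would generate these strings."""
    def __init__(self, name):
        self.name = name

    def think(self, history):
        return "<think>Read the chat, plan the next move privately.</think>"  # never shared

    def talk_or_play(self, history, turn, max_turns):
        if turn < max_turns - 1:
            return "<talk>I'm feeling like rock today...</talk>"  # cheap talk / bluffing
        return f"<play>{random.choice(['rock', 'paper', 'scissors'])}</play>"

def utility(my_move, their_move):
    beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
    if my_move == their_move:
        return 1                                     # tie = 1
    return 2 if beats[my_move] == their_move else 0  # win = 2, loss = 0

def run_episode(agent, opponent, max_turns=3):
    history, moves = [], {}
    for turn in range(max_turns):
        for player in (agent, opponent):
            player.think(history)                        # private reasoning, discarded here
            message = player.talk_or_play(history, turn, max_turns)
            history.append(f"{player.name}: {message}")  # public channel
            if message.startswith("<play>"):
                moves[player.name] = message[len("<play>"):-len("</play>")]
    # One reward for the whole interaction, computed only after the final turn.
    return utility(moves[agent.name], moves[opponent.name]), history

reward, transcript = run_episode(ToyPlayer("agent"), ToyPlayer("opponent"))
print(reward)
```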
02 Core Idea
🍞 You know how in chess, your first move sets up your tenth? One good move isn’t enough—you need a plan for the whole game.
🥬 The Aha! Moment: Train the AI on the entire conversation’s success, and teach three key skills—predict others (ISE), act well given your beliefs (SRP), and influence others (LO)—so its words today help it win tomorrow.
- Multiple analogies:
- Sports playbook: Don’t grade a pass; grade the touchdown drive. Also teach reading defenses (ISE), choosing the right play (SRP), and drawing defenders where you want them (LO).
- Group project: Understand teammates’ habits (ISE), pick tasks that fit the team (SRP), and motivate others to align with the plan (LO).
- Magic trick: Guess what the audience expects (ISE), choose the right trick for that crowd (SRP), and patter to guide attention (LO).
- Before vs After:
- Before: Models tried to sound good each turn; they often missed long-term wins.
- After: Models plan across turns, learn to persuade, and optimize for the final outcome.
- Why it works (intuition):
- If you can steer the situation (LO), really grasp what the other will do (ISE), and then pick the best move for that belief (SRP), your final reward goes up.
- Reward shaping pushes learning toward these skills instead of hoping sparse, end-only rewards are enough.
- Building blocks (introduced with Sandwich below): ISE, SRP, LO plus adapted training methods (GRPO, DPO, STaR).
— New concept 3 — 🍞 Imagine trying to guess what your friend will pick in Rock-Paper-Scissors after hearing how they talk about it.
🥬 Internal State Evaluation (ISE): ISE measures how accurately the AI predicts the other player’s next move from the conversation so far.
- How it works:
- Read the chat history.
- Form a belief about the opponent’s likely actions.
- Compare that belief to the opponent’s actual behavior.
- Why it matters: If your guesses about others are off, even smart actions can fail.
🍞 Anchor: If your opponent often hints “I like rock,” strong ISE means you correctly expect more rock and prepare to counter it.
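The article does not spell out the exact formula, so the sketch below uses an assumed stand-in: ISE scored as one minus the total variation distance between the agent's belief and the opponent's actual move distribution.

```python
def ise_score(belief, opponent_actual):
    """Assumed stand-in for ISE: 1 - total variation distance between the agent's
    belief over opponent moves and the opponent's actual (or empirical) distribution.
    1.0 = perfect prediction, 0.0 = maximally wrong."""
    moves = set(belief) | set(opponent_actual)
    tv = 0.5 * sum(abs(belief.get(m, 0.0) - opponent_actual.get(m, 0.0)) for m in moves)
    return 1.0 - tv

# The opponent hinted "I like rock" and indeed plays mostly rock.
belief = {"rock": 0.6, "paper": 0.2, "scissors": 0.2}  # what the agent expects
actual = {"rock": 0.7, "paper": 0.2, "scissors": 0.1}  # what the opponent tends to do
print(ise_score(belief, actual))  # close to 1.0: the agent reads the opponent well
```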
— New concept 4 — 🍞 Think of choosing the best route given the traffic you believe is ahead.
🥬 State-Relative Performance (SRP): SRP measures how good your action is, given what you believe the other will do.
- How it works:
- Take your current belief about the opponent.
- List your possible actions.
- Pick the one with the highest expected payoff against that belief.
- Why it matters: Great beliefs don’t help if you choose poorly; great choices don’t help if the beliefs are nonsense. SRP checks the “choice” part.
🍞 Anchor: If you think the other will play rock 60%, SRP is high if you favor paper accordingly.
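As an assumed stand-in (the precise definition is not given here), the sketch below scores SRP as the chosen move's expected payoff relative to the best response under the same belief, using the win=2 / tie=1 / loss=0 Rock-Paper-Scissors scale.

```python
def expected_payoff(my_move, belief, payoff):
    return sum(p * payoff[(my_move, their_move)] for their_move, p in belief.items())

def srp_score(chosen_move, belief, payoff):
    """Assumed stand-in for SRP: expected payoff of the chosen move divided by the
    best achievable expected payoff under the same belief (1.0 = best response)."""
    values = {m: expected_payoff(m, belief, payoff) for m in ("rock", "paper", "scissors")}
    best = max(values.values())
    return values[chosen_move] / best if best > 0 else 0.0

# RPS payoffs on the win=2 / tie=1 / loss=0 scale.
PAYOFF = {("rock", "rock"): 1, ("rock", "paper"): 0, ("rock", "scissors"): 2,
          ("paper", "rock"): 2, ("paper", "paper"): 1, ("paper", "scissors"): 0,
          ("scissors", "rock"): 0, ("scissors", "paper"): 2, ("scissors", "scissors"): 1}

belief = {"rock": 0.6, "paper": 0.2, "scissors": 0.2}  # "they'll probably play rock"
print(srp_score("paper", belief, PAYOFF))     # 1.0: paper is the best response
print(srp_score("scissors", belief, PAYOFF))  # lower: scissors ignores the belief
```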
— New concept 5 — 🍞 You know how a friend can nudge the group to pick their favorite movie by speaking up at the right moment?
🥬 Leverage Opportunity (LO): LO measures how well your words and actions can steer the other person toward outcomes that help you.
- How it works:
- Use conversation to shape the situation (trust, expectations, options).
- Choose messages that increase the chance the opponent acts in your favor.
- Then take the move that cashes in that advantage.
- Why it matters: Influence turns close calls into wins.
🍞 Anchor: In pricing, saying “Let’s both keep prices fair and high” while undercutting slightly can steer the rival to cooperate with you—then you capture the market.
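Staying with the Rock-Paper-Scissors toy for continuity, the sketch below uses an assumed proxy for LO: how much the agent's message shifted the opponent's move distribution toward something the agent can exploit.

```python
def best_expected_payoff(opp_dist, payoff, my_moves=("rock", "paper", "scissors")):
    """Best payoff I can expect if the opponent acts according to opp_dist."""
    return max(sum(p * payoff[(m, o)] for o, p in opp_dist.items()) for m in my_moves)

def lo_score(opp_dist_before, opp_dist_after, payoff):
    """Assumed proxy for LO: how much my message shifted the opponent's behavior
    toward a distribution I can exploit (positive = my talk created leverage)."""
    return (best_expected_payoff(opp_dist_after, payoff)
            - best_expected_payoff(opp_dist_before, payoff))

PAYOFF = {("rock", "rock"): 1, ("rock", "paper"): 0, ("rock", "scissors"): 2,
          ("paper", "rock"): 2, ("paper", "paper"): 1, ("paper", "scissors"): 0,
          ("scissors", "rock"): 0, ("scissors", "paper"): 2, ("scissors", "scissors"): 1}

before = {"rock": 1/3, "paper": 1/3, "scissors": 1/3}  # unreadable opponent
after = {"rock": 0.7, "paper": 0.2, "scissors": 0.1}   # after "I always open with rock..."
print(lo_score(before, after, PAYOFF))  # > 0: the feint made the opponent exploitable
```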
03 Methodology
At a high level: Rules prompt → Multi-turn chat with private thinking → Branching rollouts → End-of-dialogue rewards (utility + LO + naturalness) → Update the model → Repeat
Step 1: Set up the game world
- What happens: The AI and its opponent get the rules (e.g., Rock-Paper-Scissors, Bertrand pricing, or Size-Price Bargaining). Each turn, the AI first thinks privately (<think>), then either talks (<talk>) or acts (<play>).
- Why this step exists: Clear, simple environments make success measurable and repeatable. Private thinking encourages planning without leaking strategy.
- Example: In Rock-Paper-Scissors, each side can chat to hint, bluff, or probe, then play “rock,” “paper,” or “scissors.”
Step 2: Play multi-turn conversations
- What happens: The AI alternates between talking and acting for up to 5 interactions. Some games add mid-conversation updates (e.g., last round’s prices and profit in Bertrand).
- Why it matters: Many real problems require multiple turns to build trust or uncover information.
- Example: In the bargaining game, the buyer and seller trade offers (“8 units at $30 each”), update beliefs, and try to land a deal before turns run out.
Step 3: Compute end-of-dialogue rewards
- What happens: At the end of a full episode, the AI receives a utility score from the game’s outcome. For training, they also add shaped rewards: LO (influence) and a small bonus for natural-sounding talk.
- Why it matters: Just the final utility is too sparse; adding LO teaches influence, and naturalness keeps speech human-like (the LO-only model got terse).
- Example: In Rock-Paper-Scissors, win=2, tie=1, loss=0, plus extra points if your dialogue plausibly nudged the opponent and your messages sound like real conversation.
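A hypothetical composition of that training reward might look like the sketch below; the weights are illustrative assumptions, not values from the paper.

```python
def shaped_reward(game_utility, lo_signal, sounds_natural,
                  lo_weight=0.5, naturalness_bonus=0.2):
    """Hypothetical composition of the Step 3 training reward: final game utility,
    plus a shaped influence (LO) term, plus a small bonus when an LLM judge says
    the dialogue reads like natural conversation. Weights are illustrative."""
    reward = game_utility               # e.g. RPS: win=2, tie=1, loss=0
    reward += lo_weight * lo_signal     # credit for steering the opponent
    if sounds_natural:
        reward += naturalness_bonus     # keep the talk human-sounding
    return reward

# A won RPS episode where the feint worked and the talk stayed conversational.
print(shaped_reward(game_utility=2, lo_signal=0.6, sounds_natural=True))  # 2 + 0.3 + 0.2 = 2.5
```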
Step 4: Diagnose skills using behavior signals (ISE, SRP, LO)
- What happens: The system estimates (a) what the opponent truly tends to do and (b) what the AI believes they’ll do, then scores ISE, SRP, and LO from the conversation state.
- Why it matters: These scores separate “Can I predict?” from “Do I choose well?” from “Can I influence?”, so you know what to fix.
- Example: If ISE is low but LO is high, you might be persuasive without truly understanding the other side.
Step 5: Train with adapted algorithms
- GRPO (Group Relative Policy Optimization)
- What happens: From one conversation, the system forks into k parallel futures at a chosen turn. Each branch finishes to the end; rewards compare branches to update the policy for that turn.
- Why it matters: It learns which reply at that moment leads to the best final outcome.
- Example data: Three branches after the same mid-chat message might finish as [win, tie, loss]; only the reply that led to “win” gets boosted.
- DPO (Direct Preference Optimization)
- What happens: Rank multiple full rollouts by final reward, then push the model toward higher-ranked responses and away from lower-ranked ones via pairwise or listwise comparisons.
- Why it matters: Directly comparing completions gives a strong, relational learning signal.
- Example data: If “reply A” beats “reply B” in final earnings, the model learns to prefer A-like replies.
- STaR (Self-Taught Reasoner)
- What happens: Generate many full conversations; fine-tune on the ones with top outcomes.
- Why it matters: It’s simple and uses only good examples, but can overfit early successes and miss nuanced trade-offs.
- Example data: Keep only the dialogues that secured a great bargain and train on every turn inside them.
Step 6: Secret sauce—reward shaping with LO + naturalness
- What happens: Add an influence reward (LO) and a small naturalness bonus (LLM-as-judge says “Yes/No” to conversationality).
- Why it matters: LO alone boosts wins but can produce robotic talk; adding naturalness balances persuasion with human-friendly style.
- Example: The best Rock-Paper-Scissors model both feints convincingly and speaks naturally.
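Here is a sketch of how such a Yes/No judge could be wired into the reward; the prompt wording, the judge stub, and the bonus size are assumptions for illustration rather than the paper's setup.

```python
def naturalness_bonus(dialogue, judge, bonus=0.2):
    """Sketch of the Yes/No naturalness check: an LLM judge (any callable that
    returns text) is asked whether the agent's messages read like natural
    conversation; a 'Yes' earns a small bonus."""
    prompt = ("Do the agent's messages below read like natural, human conversation? "
              "Answer Yes or No.\n\n" + "\n".join(dialogue))
    verdict = judge(prompt).strip().lower()
    return bonus if verdict.startswith("yes") else 0.0

def stub_judge(prompt):
    return "Yes"  # stands in for a real LLM-as-judge call

print(naturalness_bonus(["agent: Tough market, shall we both hold near $140?"], stub_judge))
```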
— New concept 6 — 🍞 Imagine a team scrimmage: everyone tries different plays, then you compare which play led to the best drive.
🥬 GRPO: GRPO improves learning by comparing several responses for the same situation and boosting the ones that led to better final outcomes.
- How it works:
- Fork the same conversation into several branches.
- Let each finish to get full-episode rewards.
- Update the policy to favor the branch whose reply worked best at the split.
- Why it matters: It shows the model which mid-conversation choices pay off later.
🍞 Anchor: In a price war, branches with slightly different messages (“Let’s cooperate at $135?”, “Market is tough—maybe $145?”) get compared by final profit; the wording from the winning branch is reinforced.
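In code, the branch comparison could be as simple as a group-relative advantage, sketched below under an assumed normalization; the actual policy update would then weight each branch reply's log-probabilities by its advantage.

```python
def group_relative_advantages(branch_rewards):
    """GRPO-style credit assignment (sketch): k branches forked from the same turn
    each finish the conversation; each branch's reply is credited by how its final
    reward compares to the group. Mean/std normalization is an assumption."""
    mean = sum(branch_rewards) / len(branch_rewards)
    var = sum((r - mean) ** 2 for r in branch_rewards) / len(branch_rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all branches tie
    return [(r - mean) / std for r in branch_rewards]

# Three branches after the same mid-chat message finish as win / tie / loss.
rewards = [2, 1, 0]
print(group_relative_advantages(rewards))  # [positive, zero, negative]
# In training, each branch's reply would be scaled by its advantage,
# so only the "win" branch's wording gets boosted at that turn.
```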
— New concept 7 — 🍞 Think of a taste test: if people consistently prefer flavor A over B, you learn to make more of A.
🥬 DPO: DPO trains the model to prefer responses that finish with higher rewards by directly comparing better vs. worse outcomes.
- How it works:
- Generate multiple completions.
- Rank by final reward.
- Nudge the model toward higher-ranked ones and away from lower-ranked ones.
- Why it matters: Clear comparisons create a strong learning signal—especially in complex, multi-turn tasks.
🍞 Anchor: If “cooperate-then-undercut” beats “undercut-from-start” in long-run profit, the model learns to prefer the former.
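The comparison can be written as the standard pairwise DPO loss; the beta value and the toy log-probabilities below are illustrative, with the "chosen" and "rejected" rollouts ranked by their end-of-dialogue reward.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard pairwise DPO loss on one preference pair: the completion with the
    higher end-of-dialogue reward is 'chosen', the other 'rejected'. Log-probs are
    summed over the whole multi-turn completion; beta here is illustrative."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Rollout A ("cooperate-then-undercut") earned more than rollout B, so A is chosen.
# Toy log-probabilities under the current policy and the frozen reference policy:
loss = dpo_pair_loss(logp_chosen=-42.0, logp_rejected=-40.0,
                     ref_logp_chosen=-43.0, ref_logp_rejected=-39.5)
print(loss)  # the gradient of this loss shifts probability toward A-like replies
```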
— New concept 8 — 🍞 Picture a student who writes many solutions, keeps the correct ones, and studies them to improve.
🥬 STaR: STaR makes many attempts, picks successful ones, and trains on them.
- How it works:
- Generate full dialogues.
- Filter for top outcomes.
- Fine-tune on those turns.
- Why it matters: Simple and stable, but can cling to the first pattern that works and miss richer strategies.
🍞 Anchor: If one bargaining style happens to close a few early deals, STaR might over-learn that style even if it’s not best overall.
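A sketch of the filter-and-fine-tune step is below; the keep fraction and the record format are assumptions for illustration.

```python
def star_training_set(dialogues, keep_fraction=0.25):
    """STaR-style data selection (sketch): generate many full dialogues, keep only
    the top-scoring ones, and fine-tune on every agent turn inside them."""
    ranked = sorted(dialogues, key=lambda d: d["final_reward"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    # Flatten the winners into (context, agent_reply) pairs for supervised fine-tuning.
    return [(turn["context"], turn["agent_reply"]) for d in kept for turn in d["turns"]]

rollouts = [
    {"final_reward": 2.4, "turns": [{"context": "seller opens at $40",
                                     "agent_reply": "<talk>How about 8 units at $30?</talk>"}]},
    {"final_reward": 0.3, "turns": [{"context": "seller opens at $40",
                                     "agent_reply": "<talk>Fine, $40 works.</talk>"}]},
]
print(star_training_set(rollouts))  # only turns from the high-reward bargain survive
```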
04 Experiments & Results
The Test: They trained small Llama-3 models as both agent and opponent, using up to 5 turns per episode, and LoRA for efficient fine-tuning. They measured final rewards and the three behavior signals (ISE, SRP, LO). In bargaining, they measured “bargaining power.” They also used a “naturalness” judge to keep talk human-like.
The Competition: They compared untrained models to GRPO-, DPO-, and STaR-trained ones. They also tested reward shaping variants (plain final utility; add ISE reward; add LO reward; LO + naturalness bonus).
Scoreboard with context:
- Rock-Paper-Scissors (RPS):
- Goal: Use chat to predict or steer the opponent, then pick the winning move.
- Results: All trained models beat the untrained baseline and a non-conversational Nash player. GRPO scored highest overall here, with DPO close behind. Win rates rose dramatically (for GRPO, roughly three-quarters wins vs. a third for the baseline), and losses dropped (single-digit percent for GRPO).
- Meaning: That’s like going from “coin-flip” results to “I usually win,” by using talk to influence or read the opponent.
- Bertrand Competition (iterated pricing):
- Goal: Either form stable cooperation (near-monopoly prices) or cleverly deceive to capture the whole market.
- Results: DPO achieved the strongest normalized earnings, clearly beating the baseline and even GRPO. STaR helped less. Trained agents learned deceptive or persuasive chat—advocating high prices while undercutting just enough to win.
- Meaning: That’s like going from near-zero profits (everyone undercuts) to meaningful earnings through strategy and persuasion.
- Size-Price Bargaining (two-term negotiation):
- Goal: Agree on both quantity and price; higher bargaining power = you grab more of the surplus.
- Results: DPO again topped performance, with GRPO strong, and STaR underperforming. Trained buyers consistently secured better deals than untrained ones.
- Meaning: That’s like going from a weak negotiator to someone who regularly lands favorable terms.
Surprising findings:
- Training with ISE reward (just “get better at predicting the opponent”) raised ISE and SRP but actually hurt final game performance. In contrast, LO reward (teach influence) created the biggest win boosts.
- However, LO alone made speech stiff and short. Adding a small naturalness bonus kept the wins high while making talk more human.
- DPO consistently outperformed in more complex, multi-turn, persuasion-heavy tasks—likely because direct preference signals are rich and comparative. GRPO was excellent in simpler RPS. STaR often overfit early successes and learned less flexible strategies.
Bottom line: GameTalk-trained models beat the baselines across all games, with DPO the best general performer when strategy and persuasion really matter.
05 Discussion & Limitations
Limitations:
- Influence without understanding: Agents sometimes win by steering the opponent (high LO) without truly modeling them well (lower ISE/SRP). That can make behavior brittle against new or human opponents.
- Fixed opponents and two players: Training and testing used a single fixed LLM opponent and 2-player games. Results may change with diverse or human rivals, or more players.
- Limited compute budget: Small models and tight GPU memory shaped design choices; bigger models or longer dialogues may require scaled resources and careful controls.
- Naturalness judge is simplistic: A Yes/No conversationality check may miss nuance like politeness, culture, or ethics.
Required resources:
- A GPU with ~48GB VRAM (in the paper’s setup), LoRA fine-tuning, and the ability to run multiple branched rollouts per turn. Some games (like multi-round pricing) are memory-hungry.
When NOT to use:
- One-shot Q&A tasks where single-turn quality is sufficient.
- Safety-critical settings where deceptive strategies are unacceptable (e.g., medical consent), unless strong guardrails and norms are enforced.
- Highly open-ended tasks with no clear success metric—GameTalk relies on well-defined utilities; you’d need reliable proxies first.
Open questions:
- Can we align high influence (LO) with honest, transparent opponent modeling (ISE) so wins come from understanding, not just nudging?
- How to estimate “conversation utility” directly from language for open-ended tasks (beyond games)?
- What happens against humans or diverse agent pools—do strategies generalize, and how should ethics be enforced?
- Can we combine DPO’s comparative strength with GRPO’s branching to get the best of both worlds?
06 Conclusion & Future Work
Three-sentence summary: GameTalk trains AIs to win across an entire conversation, not just a single reply, by adapting RL-style methods to multi-turn dialogue. It adds three behavior signals—ISE, SRP, LO—to both diagnose skills and shape rewards, with LO plus a naturalness bonus proving especially powerful. In games from RPS to pricing and bargaining, trained models clearly outperform baselines, with DPO strongest in complex persuasion.
Main achievement: Showing that end-of-conversation optimization plus targeted reward shaping (especially LO) makes LLMs meaningfully better at reasoning, coordinating, and negotiating over multiple turns.
Future directions: Close the gap between influence and true understanding (raise ISE and SRP without losing LO), evaluate against humans and varied agents, and learn utilities directly from language to move beyond tidy games.
Why remember this: It reframes how we teach AIs to talk—train for the journey, not the sentence. That shift unlocks smarter helpers for negotiations, team decisions, and any task where words today set up wins tomorrow.
Practical Applications
- Customer support bots that negotiate returns or credits while keeping tone friendly and outcomes fair.
- Procurement assistants that coordinate with multiple vendors to secure better prices and quantities over multi-turn emails.
- Sales copilot tools that guide multi-step outreach, handle objections, and close deals more reliably.
- Team decision aides that help groups align on plans or budgets through structured, persuasive dialogue.
- Marketplace strategy agents that coordinate pricing or promotions within ethical and legal boundaries.
- Training simulators for human negotiators to practice against adaptive, strategically conversational AIs.
- Game AIs for education that demonstrate bluffing, cooperation, and opponent modeling in simple games.
- Meeting schedulers that proactively steer toward workable times instead of passively suggesting slots.
- Ad auction or bidding coaches that simulate rivals and teach pacing and influence strategies over rounds.
- Policy or governance deliberation helpers that surface trade-offs and steer toward consensus responsibly.