
Multi-agent cooperation through in-context co-player inference

Intermediate
Marissa A. Weis, Maciej Wołczyk, Rajai Nasser et al. · 2/18/2026
arXiv

Key Summary

  • The paper shows that AI agents can learn to cooperate simply by playing lots of different kinds of opponents and figuring them out on the fly, without hardcoding how those opponents learn.
  • Agents use in-context learning during each game to infer who they are facing and adapt quickly, like changing tactics mid-match.
  • This fast, in-episode adaptation makes them exploitable at first (they can be "extorted"), which strangely creates pressure that eventually pushes both sides toward fair cooperation.
  • Training in a mixed pool (half diverse, simple tabular opponents; half learning agents) is the key ingredient that teaches agents to infer and adapt in context.
  • A new method called Predictive Policy Improvement (PPI) uses a sequence model as both a world model and a policy prior, improving actions by "planning" inside the model.
  • Both PPI and a standard method (A2C) learn to cooperate in the Iterated Prisoner’s Dilemma when trained in the mixed pool, but they fail (defect) without that diversity or when given opponent IDs.
  • The three-step mechanism is: diversity → in-context best response → vulnerability to extortion → mutual shaping → cooperation.
  • This approach removes the need for meta-learning timescales or assuming a specific opponent learning rule, making it simpler and more scalable.
  • The work connects how modern foundation models learn in context with how to get multi-agent cooperation using standard decentralized RL.
  • Results suggest a practical path to cooperative AI systems for real-world settings like traffic, markets, and robot teams.

Why This Research Matters

Real-world systems—traffic, markets, online platforms, and robot teams—require many independent decision-makers to coordinate fairly and efficiently. This paper shows a simple, scalable recipe for getting cooperation: train with diverse, hidden partners so agents learn to read and adapt to others on the fly. By avoiding hand-engineered opponent models and meta-learning machinery, it becomes easier to build robust cooperative behaviors into practical systems. The same in-context skills used by modern foundation models naturally support this cooperation in multi-agent settings. Ultimately, this can reduce congestion, prevent bidding wars that hurt everyone, and help teams of machines (and people with machines) work together smoothly.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine two kids playing a repeated game of "share or steal". If both share, both win. If one steals while the other shares, the stealer wins big. If both steal, nobody gains much. How do they learn to share without a teacher?

🄬 The Concept (Reinforcement Learning): Reinforcement Learning (RL) is teaching agents to make decisions by rewarding good actions and discouraging not-so-good ones.

  • How it works:
    1. The agent tries an action.
    2. It gets a reward (good or bad).
    3. It updates its choices to get more good rewards in the future.
  • Why it matters: Without RL, agents can’t improve from experience. šŸž Anchor: A robot learns to press the green button because it got candy the last time it did.
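The try/reward/update loop above can be sketched with the simplest possible RL setup, a two-armed bandit. This is an illustrative toy, not the paper's environment; all numbers (learning rate, exploration rate, reward probabilities) are made up.

```python
import random

# Minimal illustrative sketch (not the paper's method): a two-action
# bandit agent that nudges its value estimates toward observed rewards.
def run_bandit(steps=2000, lr=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0, 0.0]        # estimated reward for actions 0 and 1
    true_means = [0.2, 0.8]    # action 1 is genuinely better
    for _ in range(steps):
        # step 1: explore occasionally, else pick the current best estimate
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max(range(2), key=lambda i: values[i])
        # step 2: get a (noisy) reward
        r = 1.0 if rng.random() < true_means[a] else 0.0
        # step 3: update the estimate toward the reward just received
        values[a] += lr * (r - values[a])
    return values

values = run_bandit()  # the agent's estimate for action 1 ends up higher
```

After enough steps the agent's value estimates track the true reward rates, so it mostly picks the better action, which is the whole point of step 3.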

šŸž Hook: You know how a soccer game has two teams learning at once? No one coach controls everyone.

🄬 The Concept (Decentralized Multi-Agent RL): Multi-Agent RL is when many agents learn at the same time, each seeing only their own viewpoint (no central boss).

  • How it works:
    1. Each agent observes its own signals and past moves.
    2. Each picks actions to boost its own reward.
    3. They all learn together from repeated interactions.
  • Why it matters: Without this setup, we can’t study realistic many-player worlds like traffic or markets. šŸž Anchor: Multiple self-driving cars decide how to merge, each seeing slightly different things.

šŸž Hook: Picture a classic trust game you keep playing with the same partner.

🄬 The Concept (Iterated Prisoner’s Dilemma, IPD): IPD is a repeated two-player game where each round you choose to cooperate (help) or defect (take advantage).

  • How it works:
    1. In each round, both secretly pick C or D.
    2. Payoffs are highest if you defect while the other cooperates, but mutual cooperation beats mutual defection over time.
    3. The game repeats many rounds, so history matters.
  • Why it matters: It captures the real-life tension between short-term gain and long-term trust. šŸž Anchor: Two classmates can share notes every day (both benefit) or one can freeload.
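The payoff structure in steps 1 and 2 can be pinned down with standard illustrative numbers (T=5 > R=3 > P=1 > S=0; the paper's exact matrix may differ):

```python
# Illustrative IPD payoffs; entries are (my_payoff, their_payoff).
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (R)
    ("C", "D"): (0, 5),  # I'm exploited (S, T)
    ("D", "C"): (5, 0),  # I exploit (T, S)
    ("D", "D"): (1, 1),  # mutual defection (P)
}

def play(rounds):
    """Total payoffs for a list of (my_move, their_move) rounds."""
    mine = theirs = 0
    for joint in rounds:
        a, b = PAYOFFS[joint]
        mine, theirs = mine + a, theirs + b
    return mine, theirs

# Defecting against a cooperator pays most in one round, but over ten
# rounds mutual cooperation (30, 30) beats mutual defection (10, 10):
coop = play([("C", "C")] * 10)
defect = play([("D", "D")] * 10)
```

This is exactly the tension the concept describes: the one-round temptation (5) exceeds the cooperative payoff (3), yet repeated cooperation dominates repeated defection.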

šŸž Hook: Have you ever tried to hit a moving target? It’s harder than a still one.

🄬 The Concept (Non-stationarity): In multi-agent learning, the "environment" keeps changing because other players are learning too.

  • How it works:
    1. As you update your strategy, others update theirs.
    2. What worked yesterday may stop working today.
    3. Your data distribution keeps shifting.
  • Why it matters: Single-agent RL expects a steady world; without handling non-stationarity, learning breaks. šŸž Anchor: If your friend changes their chess style after every game, your plan must keep changing too.

šŸž Hook: Think of a stable agreement where no one can do better by changing alone.

🄬 The Concept (Nash Equilibrium and Social Dilemmas): A Nash equilibrium is a set of strategies where no one benefits from solo changes; in social dilemmas, this can be bad for everyone (like mutual defection).

  • How it works:
    1. Each player’s move is best given others’ moves.
    2. In one-shot Prisoner’s Dilemma, mutual defection is the only Nash equilibrium.
    3. But in repetition, better outcomes (cooperation) are possible, yet hard to reach.
  • Why it matters: Agents can get stuck in "everyone loses a little" rather than "everyone wins more." šŸž Anchor: Two stores keep lowering prices to undercut each other, and both make less money.
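The claim in step 2, that mutual defection is the only one-shot equilibrium, can be checked mechanically. A tiny sketch with standard illustrative payoffs (not necessarily the paper's numbers):

```python
# Row player's payoff for (my_move, their_move), standard PD numbers
# (T=5 > R=3 > P=1 > S=0); illustrative, not taken from the paper.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def best_response(their_move):
    """My payoff-maximizing move given the opponent's one-shot move."""
    return max("CD", key=lambda my: PAYOFF[(my, their_move)])

# Defecting is the best response to both C and D, so (D, D) is the
# unique one-shot Nash equilibrium, even though (C, C) pays more.
br_vs_c, br_vs_d = best_response("C"), best_response("D")
```

Since neither player can gain by switching away from D alone, both are stuck at the payoff of 1 instead of 3, which is precisely the social dilemma.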

šŸž Hook: You know how you can guess a teacher’s rules just by paying attention to a few examples in class?

🄬 The Concept (In-Context Learning): In-context learning means adapting your behavior using the clues inside the current stream of experience, without changing long-term memory.

  • How it works:
    1. Read the ongoing history (who did what, when, with what rewards).
    2. Infer the pattern or type of opponent on the fly.
    3. Adjust your next actions accordingly, all within the same episode.
  • Why it matters: It lets agents adapt fast, even before updating their weights. šŸž Anchor: You switch from polite play to defensive play mid-game after noticing your opponent’s tricks.

šŸž Hook: When you play a board game with a friend, you learn their habits and plan around them.

🄬 The Concept (Co-Player Learning Awareness): This is when an agent reasons about how the other side is learning and adjusts to shape their future choices.

  • How it works:
    1. Watch how the co-player reacts to your moves.
    2. Predict how they’ll update next.
    3. Choose actions that lead them toward outcomes you prefer.
  • Why it matters: Without this, agents often end up in mutual defection instead of cooperation. šŸž Anchor: You know your friend forgives quickly, so you send a peace signal after a clash to restore teamwork.

The world before: Many MARL systems either hardcoded how opponents learn (differentiating through their updates) or split agents into "fast naive learners" and "slow meta-learners." These were powerful but brittle, needing strong assumptions or complex timescale separations.

The problem: Can we get cooperation among self-interested agents without assuming a specific opponent learning rule or setting up special fast/slow learner roles?

Failed attempts: Purely independent learners often collapsed to defection; explicit opponent IDs or training only against similar learners didn’t force genuine inference and led to poor cooperation.

The gap: A simple, scalable recipe that makes agents infer co-players in context and reach cooperation, with standard tools and no hand-crafted opponent models.

Real stakes: In traffic, markets, online platforms, and robot teams, we need agents that quickly figure out who they’re dealing with and settle into fair, stable cooperation without brittle hacks.

02Core Idea

Aha! moment in one sentence: Train sequence-model agents against a diverse mix of co-players so they learn to infer and adapt in context within each episode; this fast adaptation makes them temporarily exploitable, creating mutual pressure that ultimately stabilizes into cooperation—no hardcoded opponent rules needed.

šŸž Hook: Imagine you don’t know your partner for a group project, so you watch the first few minutes and adapt your work style to match.

🄬 The Concept (Sequence-Model Agent): A sequence-model agent remembers and reasons over the whole history, not just the latest step.

  • How it works:
    1. It encodes past observations, actions, and rewards.
    2. It predicts what happens next and how good different next actions might be.
    3. It picks actions by balancing its prior expectations with estimated future returns.
  • Why it matters: Without memory over history, the agent can’t infer opponent types or adapt mid-game. šŸž Anchor: It’s like a teammate who remembers every play so far and uses that to call the next play.

šŸž Hook: Think of practicing with many different sparring partners so you become good at reading anyone.

🄬 The Concept (Mixed Pool Training): Mixed pool training pairs the agent against a diverse set of co-players—some simple, some learning—forcing robust inference.

  • How it works:
    1. Half the time, play against simple tabular opponents sampled from many behaviors.
    2. Half the time, play against other learning agents.
    3. Don’t reveal who’s who; require inference from behavior.
  • Why it matters: Without diversity and hidden identities, agents don’t need to learn real in-context inference and often end up defecting. šŸž Anchor: Training with mystery partners makes you great at spotting styles quickly and responding well.

šŸž Hook: You taste soup as you cook and adjust spices right away.

🄬 The Concept (In-Context Best Response): The agent uses the current episode’s history to guess the co-player’s strategy and chooses the best counter-strategy right now.

  • How it works:
    1. Observe early-round patterns (e.g., tit-for-tat or random).
    2. Infer the likely rule.
    3. Switch to the action pattern that gets the highest long-run reward versus that rule.
  • Why it matters: Without this fast switch, you waste many rounds on the wrong tactic. šŸž Anchor: After three turns, you realize your opponent mirrors you—so you cooperate to lock in mutual gains.
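The three steps above can be caricatured as a hand-written heuristic. In the paper this behavior is learned inside a sequence model, not coded by hand; the function name, the 0.8 threshold, and the move encoding here are all ours.

```python
# Hypothetical sketch of in-context best response: guess from the episode
# history whether the opponent mirrors our last move (tit-for-tat-like),
# then pick the counter-strategy.
def infer_and_respond(history):
    """history: list of (my_move, their_move) rounds, oldest first."""
    if len(history) < 2:
        return "C"  # probe cooperatively while evidence is thin
    # does their move at round t track my move at round t-1 (mirroring)?
    mirrors = sum(
        their == prev_mine
        for (prev_mine, _), (_, their) in zip(history, history[1:])
    )
    if mirrors / (len(history) - 1) > 0.8:
        return "C"  # against a mirror, sustained cooperation pays best
    return "D"      # against an unresponsive opponent, defection pays more

tft_history = [("C", "C"), ("D", "C"), ("C", "D"), ("C", "C")]  # mirror-like
```

Against the mirror-like history the heuristic locks in cooperation; against an opponent who defects no matter what, it switches to defection.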

šŸž Hook: If you always try to please someone who reacts to your behavior, they can push you around.

🄬 The Concept (Mutual Extortion Dynamics): Fast adapters are exploitable at first; two such agents trying to shape each other end up pressured toward fair cooperation.

  • How it works:
    1. One agent learns to push an in-context learner into cooperating more than is fair (extortion).
    2. When both can extort, they keep shaping each other.
    3. The stable outcome that neither can improve upon tends to be mutual cooperation.
  • Why it matters: Without this shaping pressure, both might settle on mutual defection instead. šŸž Anchor: Two tough negotiators realize that constant hardball stalls the deal, so they land on a fair split.

Multiple analogies (3 ways to see it):

  • Playground analogy: Kids first test each other (poke/prod), then learn the other’s style quickly and settle on playing nicely because bullying backfires when both can do it.
  • Dance analogy: Partners feel each other’s moves mid-song; if both try to lead too hard, they stumble, so they agree—without words—on a cooperative rhythm.
  • Cooking analogy: You taste and adjust live; if both cooks adjust aggressively, the dish can swing wildly, but they soon agree on a balanced seasoning.

Before vs. after:

  • Before: Agents either assumed the opponent’s learning rule or split roles into fast naive vs. slow meta learners. Cooperation needed engineered setups.
  • After: Plain decentralized RL + sequence models + diverse opponents is enough. Agents self-learn in-episode best responses, become temporarily extortable, and converge to cooperation through mutual shaping.

Why it works (intuition): Diversity teaches the agent to infer types quickly. In-episode adaptation makes each agent influenceable (extortable), giving the other a training signal to shape them. When both can shape each other, the tug-of-war resolves at a cooperative point, because one-sided exploitation becomes unstable.

Building blocks:

  • Mixed pool with hidden identities (forces true inference).
  • Sequence model memory (stores rich context to infer types).
  • Predictive Policy Improvement (uses the model to plan better next actions).
  • In-context best response (fast adaptation inside episodes).
  • Mutual shaping across timescales (fast in-context + slow weight updates) pushing toward cooperation.

03Methodology

At a high level: Interaction history → Sequence model predicts next tokens and values → Policy improvement (reweight by estimated returns) → Play games to collect new trajectories → Retrain the model on all data → Repeat.

Step A: Pretrain a sequence model on diverse simple opponents.

  • What happens: Train a GRU-based sequence model to predict next observations, actions, and rewards from games between random tabular agents.
  • Why this step exists: It gives the model basic world knowledge and a vocabulary of behaviors so it can start recognizing patterns early.
  • Example: Pretraining on 200,000 IPD trajectories where tabular agents use five parameters (initial cooperate chance and four response probabilities to CC, CD, DC, DD).

šŸž Hook: You study old game tapes before your season starts. 🄬 The Concept (Tabular Agents): Simple opponents defined by five numbers: initial cooperation chance and how they react to last round’s outcome.

  • How it works: Given the last joint action, they sample C/D by their preset probability.
  • Why it matters: They provide a rich, controllable diversity of styles for training. šŸž Anchor: A "mostly tit-for-tat" tabular agent cooperates after CC but defects after CD.
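A five-parameter tabular opponent, as described above, might look like the following sketch (class and parameter names are ours; the paper only specifies the five probabilities):

```python
import random

# One initial cooperation probability plus one cooperation probability
# per last joint outcome (CC, CD, DC, DD).
class TabularAgent:
    def __init__(self, p_init, p_cc, p_cd, p_dc, p_dd, seed=0):
        self.p_init = p_init
        self.p = {("C", "C"): p_cc, ("C", "D"): p_cd,
                  ("D", "C"): p_dc, ("D", "D"): p_dd}
        self.rng = random.Random(seed)

    def act(self, last_joint=None):
        """last_joint = (my_last_move, opponent_last_move); None on round 1."""
        prob_c = self.p_init if last_joint is None else self.p[last_joint]
        return "C" if self.rng.random() < prob_c else "D"

# The tit-for-tat anchor, made deterministic: cooperate exactly when the
# opponent (second element of the joint action) cooperated last round.
tft = TabularAgent(p_init=1.0, p_cc=1.0, p_cd=0.0, p_dc=1.0, p_dd=0.0)
```

Sampling the five probabilities uniformly yields everything from always-defect to forgiving mirrors, which is what gives the training pool its diversity.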

Step B: Mixed pool training with hidden identities.

  • What happens:
    1. Sample opponent: 50% another learning agent, 50% a random tabular agent.
    2. No opponent ID is given; the agent must infer from behavior.
    3. Play 100-round episodes, store full histories.
  • Why this step exists: Forces genuine in-context inference; otherwise, the agent can memorize labels or overfit to one partner type.
  • Example: Agents trained only vs. other learners or with explicit IDs collapse to defection; mixed/hidden training leads to cooperation.
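The sampling scheme in steps 1 and 2 is simple to sketch. The 50/50 split comes from the text; the tuple labels below are bookkeeping only, since the learner itself never sees the opponent's identity.

```python
import random

def sample_opponent(learner_pool, rng):
    """50%: another learning agent; 50%: a fresh random tabular opponent."""
    if rng.random() < 0.5:
        return rng.choice(learner_pool)           # another learning agent
    params = [rng.random() for _ in range(5)]     # five tabular parameters
    return ("tabular", params)

rng = random.Random(42)
learner_pool = [("learner", i) for i in range(4)]
draws = [sample_opponent(learner_pool, rng) for _ in range(1000)]
n_tabular = sum(1 for d in draws if d[0] == "tabular")  # roughly half
```

Because identities stay hidden, the only way for the agent to do well is to infer the opponent's style from behavior, which is exactly the skill the mixed pool is meant to train.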

šŸž Hook: You don’t know your teammate, so you watch first and adapt. 🄬 The Concept (In-Context Best Response): The agent reads early-round signals to infer the opponent and switches strategy inside the episode.

  • How it works: The sequence model encodes history; the policy improves by giving higher weight to actions predicted to yield better returns versus that inferred type.
  • Why it matters: Without this, the agent reacts too slowly and gets stuck in bad patterns. šŸž Anchor: Recognizing a forgiving opponent, the agent sends a cooperative signal to build trust.

Step C: Predictive Policy Improvement (PPI) for planning inside the model.

  • What happens:
    1. The sequence model predicts future trajectories (observations/actions/rewards) given a candidate next action.
    2. Monte Carlo rollouts (e.g., 15 steps) estimate the expected return of each action from the current history.
    3. The improved policy reweights the model’s prior action probabilities by exp(beta × estimated return).
    4. Use this improved policy to collect fresh data.
    5. Retrain the sequence model on all accumulated trajectories (self-supervised next-token losses for obs/action/reward).
  • Why this step exists: It turns a predictive model into a decision-maker, aligning actions with predicted long-run payoffs without a separate critic network.
  • Example: At round t, the agent considers "Cooperate" or "Defect," simulates 15-step futures in the model, sees that cooperation against a tit-for-tat-like opponent yields higher returns, and boosts the probability of cooperate.

šŸž Hook: You imagine several next moves and pick the one leading to the best future. 🄬 The Concept (Monte Carlo Rollouts): Simulate possible futures in your head (the model) to estimate how good each action is.

  • How it works: Sample multiple continuations conditioned on the candidate action and average the returns.
  • Why it matters: Without lookahead, the agent might pick shortsighted moves that harm long-run cooperation. šŸž Anchor: Trying out ā€œwhat if I cooperate now?ā€ in your mental model, you see the opponent likely reciprocates for many rounds.
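Putting Step C together, the reweighting rule might look like this toy sketch. The real method samples futures from a learned sequence model; here `toy_rollout` is a deterministic stand-in, and the payoff numbers are illustrative.

```python
import math, random

# Sketch of PPI's improvement rule: reweight the model's prior action
# probabilities by exp(beta * estimated return), with returns averaged
# over Monte Carlo rollouts.
def ppi_action_probs(prior, rollout_fn, beta=1.0, n_rollouts=32, horizon=15, seed=0):
    rng = random.Random(seed)
    weights = {}
    for action, p in prior.items():
        # average return over simulated futures conditioned on this action
        est = sum(rollout_fn(action, horizon, rng) for _ in range(n_rollouts)) / n_rollouts
        weights[action] = p * math.exp(beta * est)
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def toy_rollout(action, horizon, rng):
    # vs. a tit-for-tat-like opponent: cooperating averages 3 per round;
    # defecting grabs 5 once, then 1 per round of mutual defection after
    return 3.0 if action == "C" else (5.0 + (horizon - 1) * 1.0) / horizon

improved = ppi_action_probs({"C": 0.5, "D": 0.5}, toy_rollout, beta=2.0)
```

Starting from a uniform prior, the exponential reweighting pushes nearly all probability onto cooperation, because the 15-step lookahead sees past the one-round temptation to defect.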

Step D: Alternative baseline—A2C with a sequence model.

  • What happens: A GRU-based actor-critic learns a policy and a value head, using advantage estimates and standard RL updates.
  • Why this step exists: To test whether a standard, model-free approach also benefits from mixed/hidden training.
  • Example: A2C learns in-context best responses vs. tabular agents and shows the same mechanisms (extortability, mutual shaping), but with some instability across random seeds.

šŸž Hook: One teammate plans by imagination; the other learns by trial and error. 🄬 The Concept (A2C): Advantage Actor-Critic updates a policy using advantage signals while learning a value function.

  • How it works: Estimate how much better an action was than expected and nudge the policy accordingly.
  • Why it matters: It’s a strong, simple baseline to compare against PPI. šŸž Anchor: When a move pays off more than predicted, A2C makes that move more likely next time.
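The "nudge the policy by the advantage" idea can be written out in a tabular toy version (the paper's A2C uses a GRU; the learning rates, state layout, and reward here are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def a2c_update(logits, values, action, reward, next_value,
               gamma=0.99, lr_pi=0.1, lr_v=0.1, state=0):
    """One-step advantage actor-critic update on tabular logits/values."""
    advantage = reward + gamma * next_value - values[state]  # TD advantage
    values[state] += lr_v * advantage                        # critic step
    probs = softmax(logits[state])
    for a in range(len(probs)):                              # actor step
        grad = (1.0 if a == action else 0.0) - probs[a]      # grad of log-prob
        logits[state][a] += lr_pi * advantage * grad
    return advantage

logits, values = [[0.0, 0.0]], [0.0]
# a better-than-expected reward for action 0 raises its probability
adv = a2c_update(logits, values, action=0, reward=1.0, next_value=0.0)
```

A positive advantage raises the chosen action's logit and the state's value estimate together, which is the trial-and-error counterpart to PPI's explicit planning.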

The secret sauce:

  • Using a single sequence model as both a world model and a policy prior, then improving actions by reweighting with estimated returns. This neatly couples prediction and control.
  • Mixed pool with hidden identities creates the need (and training signal) for true in-context inference.
  • Fast in-episode adaptation makes agents temporarily extortable, which paradoxically supplies the gradients that push the system toward stable cooperation.

What breaks without each piece:

  • No diversity → no need to infer; policies drift to defection.
  • Reveal opponent IDs → memorization instead of inference; defection persists.
  • No rollouts/value signal → shortsighted choices that fail to build cooperation.
  • No memory over history → can’t detect opponent type; best response fails to emerge.

04Experiments & Results

The test: Can mixed-pool, hidden-identity training teach agents to infer opponents in context, become initially extortable, and then converge to cooperation? Measured by cooperation rates over 100-round IPD episodes and average rewards.

The competition (baselines/ablations):

  • Pure learning-only training (no tabular diversity).
  • Policies given explicit opponent IDs (no inference needed).
  • Two learning algorithms: PPI (model-based with sequence rollouts) and A2C (model-free actor-critic).

Scoreboard with context:

  • Mixed pool + hidden IDs: Both PPI and A2C agents converge to high cooperation against other learners—think of moving from a class average of C (mutual defection) to an A or A- (robust cooperation across seeds).
  • No mixed pool (only learners) or with explicit opponent identification: Agents collapse to mutual defection—like a team that never practices reading new rivals, so they default to safe-but-bad play.

Mechanism deep-dive (three-step validation):

  1. Diversity → In-context best response: Train a PPI agent solely against random tabular opponents. Result: Within an episode, the agent rapidly identifies the opponent’s style and shifts to the best counter-strategy—like recognizing tit-for-tat and choosing sustained cooperation. This shows fast, goal-directed adaptation.
  2. In-context learners are extortable: Freeze that agent as a "Fixed In-Context Learner" (Fixed-ICL). Train a fresh PPI agent only against it. Result: The new agent learns to extort the Fixed-ICL, getting a larger share of the total reward by exploiting its adaptation. This proves the key vulnerability that supplies learning pressure.
  3. Mutual extortion → Cooperation: Initialize two agents with the learned extortion strategy and train them against each other. Within episodes they shape each other toward more cooperation, and weight updates reinforce this across episodes. The tug-of-war stabilizes into cooperative behavior.

A2C parallels and differences:

  • Step 1: A2C also learns in-context best responses vs. tabular agents.
  • Step 2: A2C sometimes finds even stronger exploiters of its own Fixed-ICL, hinting at complex adversarial tactics.
  • Step 3: A2C pairs initially move toward cooperation but can be seed-sensitive, occasionally reverting to defection later (training instability).

Surprising findings:

  • Giving explicit opponent IDs actually hurts cooperation. When inference is too easy (just read the label), agents don’t learn the flexible, general-purpose in-context skills needed for stable cooperation.
  • Simple tabular agents are critical teachers. Their controlled diversity forces genuine inference and provides clean signals that train best-response adaptation.

Takeaway in plain terms: Training against a rich, hidden mix of partners makes agents quick readers of behavior. That quickness first makes them exploitable, but when everyone can exploit, they end up agreeing (through learning pressure) on fair cooperation. The numbers trend from below 50% cooperation to strong majority cooperation in the full setup, compared with persistent defection in the ablations—akin to going from a class-wide D to a solid A- in teamwork.

05Discussion & Limitations

Limitations:

  • Diversity dependence: If the training pool isn’t varied enough, agents may learn brittle heuristics instead of true in-context inference, leading back to defection.
  • Scaling and complexity: IPD is simple; extending to larger, partially observable, or many-action games may need more capacity and careful rollout budgeting.
  • Instability (A2C): Model-free baselines can be seed-sensitive and sometimes drift back to defection.
  • Model misspecification: If the sequence model poorly predicts co-player responses, planning (PPI) can mislead action choices.
  • Safety/fairness: Temporary extortability is useful for learning but could be risky in safety-critical domains if exploited before cooperation stabilizes.

Required resources:

  • Compute for sequence models and Monte Carlo rollouts (PPI), memory for replay datasets, and time to gather diverse trajectories.
  • A library of opponent types (tabular or scripted strategies) to ensure training diversity.
  • Standard RL frameworks and sequence modeling toolkits.

When not to use:

  • Pure zero-sum adversarial settings where cooperation isn’t a goal (fast adaptation mainly fuels arms races).
  • Extremely short episodes that leave no time for in-context inference.
  • Safety-critical contexts without safeguards, since agents are initially exploit-prone.
  • Environments with non-reactive, fixed opponents where simpler best responses suffice.

Open questions:

  • Beyond IPD: How does this scale to multi-player commons, auctions, or networked resource sharing?
  • Communication: Do cheap-talk or shared signals help agents coordinate faster and more fairly?
  • Robustness: How to harden against adversaries designed to exploit in-context learners without drifting back to defection?
  • Theory: Stronger guarantees for convergence, sample complexity, and equilibrium selection under function approximation.
  • Human-AI ecosystems: How do these dynamics play out when some players are people with richer goals and norms?

06Conclusion & Future Work

Three-sentence summary: Training sequence-model agents against a diverse, hidden mix of co-players teaches them to infer and adapt in context within each episode. That fast adaptation initially makes them extortable, creating mutual shaping pressure that ultimately stabilizes into cooperation—no meta-gradients or hardcoded opponent rules required. A simple decentralized RL recipe (mixed pool + sequence models + planning-based policy improvement) is enough to get robust cooperation in IPD.

Main achievement: Showing that in-context co-player inference, induced by mixed-pool training, can replace engineered naive/meta learner setups and naturally produce the extortion-to-cooperation pathway.

Future directions: Scale to richer games and populations, add communication, study robustness to adversaries, and develop stronger theory and safety guardrails. Explore human-AI teams where norms and fairness matter.

Why remember this: It bridges how foundation models learn in context with how multi-agent cooperation can emerge using standard tools, offering a practical, scalable path to cooperative AI in the messy, many-player real world.

Practical Applications

  • Traffic merging assistants that infer nearby drivers’ styles and nudge toward smooth, cooperative merges.
  • Warehouse robot fleets that quickly read each other’s priorities and avoid gridlock while sharing aisles.
  • Online marketplace agents that settle on fair pricing and inventory sharing, avoiding destructive price wars.
  • Autonomous drones that cooperatively patrol areas, adapting mid-flight when partners change tactics.
  • Energy grid controllers that coordinate load balancing with other controllers to prevent brownouts.
  • Multi-player negotiation bots that identify counterpart types and converge to fair, stable agreements.
  • Customer support agent swarms that route tickets cooperatively based on inferred agent strengths.
  • Game AI that adapts mid-match to player styles yet tends toward fair play instead of griefing.
  • Federated learning participants that coordinate update schedules and data sharing without a central boss.
  • Multi-robot search-and-rescue teams that infer teammates’ behaviors to stabilize cooperation under stress.
#multi-agent reinforcement learning#in-context learning#co-player inference#iterated prisoner’s dilemma#opponent shaping#extortion dynamics#sequence models#predictive policy improvement#actor-critic#decentralized RL#performative prediction#world models#cooperation emergence#best response#foundation model agents