Discovering Multiagent Learning Algorithms with Large Language Models
Key Summary
- The paper shows how a code-writing AI (a large language model) can invent brand-new multi-agent learning algorithms instead of humans having to hand-design them.
- They use an evolutionary system called AlphaEvolve that edits the actual source code of algorithms and keeps the versions that play games better (are less exploitable).
- For CFR-style learning, the AI discovered VAD-CFR, which adapts how much it forgets old mistakes based on how "shaky" learning currently is, gives small boosts to good moves, and delays averaging until the strategy is stable.
- VAD-CFR beats strong baselines like DCFR and DPCFR+ on most benchmark games, lowering exploitability faster (often reaching under 0.001 when others plateau higher).
- For PSRO-style population training, the AI discovered SHOR-PSRO, which blends a steady regret-based solver with a soft, temperature-controlled push toward the best pure strategies.
- SHOR-PSRO smartly changes its blending and exploration over time (annealing), helping it explore early and then lock down a solid equilibrium later.
- Across many poker-like and dice/card games, these evolved algorithms converge faster and more reliably than standard methods with fixed rules.
- The key advance is "semantic code evolution": the LLM mutates and rewrites logic, not just tuning numbers, so it can invent new mechanisms humans didn't try.
- This approach could generalize: future solvers might be co-designed by humans and AI, speeding up progress in complex multi-agent settings.
- The paper highlights limits too: results depend on the training games, compute for many evaluations, and good prompting for the LLM mutations.
Why This Research Matters
Many real systems (auctions, cybersecurity, traffic control, and online markets) involve multiple decision-makers with partial information. Faster, more reliable multi-agent solvers help these systems reach stable, fair, and safe behaviors more quickly. Automating algorithm design with LLMs speeds discovery beyond what manual trial-and-error can manage. Because the evolved code is readable, engineers can adopt and adapt it, not just treat it as a black box. The methods reduce early-training noise and improve late-stage precision, which is especially valuable when mistakes are costly. Over time, human+AI co-design could produce families of algorithms tailored to new domains with less effort and better performance.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you and your friends are playing a strategy board game with hidden cards. Everyone wants to win, but no one can see all the information, and the best move changes as others learn. How do you all get better without getting stuck in bad habits?
The world before (what it is): Multi-Agent Reinforcement Learning (MARL) is about many learners improving together in shared environments, like teams or opponents in games. For years, people made these learners stronger with careful, hand-crafted rules.
- How it worked: Researchers chose an algorithm family (like CFR or PSRO), picked update rules (how to count mistakes, how much to forget, how to average strategies), and tuned lots of knobs by trial and error.
- Why it mattered: Good rules meant faster learning and stronger strategies; bad rules meant slow progress or getting fooled by clever opponents.
Anchor: Think of coaches designing soccer drills by guesswork. Some drills help the whole team; others waste time. That was MARL design: useful, but slow and human-heavy.
Hook: You know how a smart writing assistant can suggest better sentences than you might think of yourself? What if it could suggest better learning rules too?
The problem (what it is): Hand-designing multi-agent algorithms means exploring a huge maze of possibilities with limited time and human intuition.
- How it works today: People try fixed schedules (like always averaging equally or always discounting the same way) because they're simple to analyze.
- Why it matters: Those simple choices can be far from optimal, so we miss speed and stability, especially in tricky games with hidden information.
Anchor: It's like always using the same study plan for every test: fine sometimes, but not when the material changes.
Hook: Imagine letting a creative, careful coder automatically tweak your programs, test them in minutes, and keep only the best ideas.
Failed attempts (what they are): Past improvements often nudged parameters (hyperparameter tuning) or used neural nets to learn update rules that were hard to interpret.
- How they worked: Either small number tweaks around known algorithms or black-box policies that worked but were hard to understand.
- Why it mattered: We needed a way to search beyond knobs, into the actual logic, while keeping results readable and reusable.
Anchor: It's like not just seasoning your soup but actually rethinking the recipe, while still understanding the steps.
Hook: You know how sometimes the missing puzzle piece is not a new tool, but a smarter way to use what you have?
The gap (what it is): We lacked a system that could evolve the algorithm's code itself, adding new rules, changing flow, and inventing mechanisms, then test them quickly.
- How it should work: Use a large language model to propose code edits that change the algorithm's structure (not just numbers), evaluate on games, and keep the good ones.
- Why it matters: This "semantic evolution" can uncover ideas humans might skip because they seem odd or complex at first.
Anchor: It's like having a teammate who not only suggests a different play but also rewrites the playbook, then runs scrimmages to prove it works.
Hook: Think of a fair referee who says, "Show me, don't tell me."
Real stakes (what it is): Many real systems, such as auctions, traffic, cybersecurity, and online markets, are multi-agent and partially hidden.
- How it works: Better solvers mean faster discovery of stable, fair, or safe strategies in these systems.
- Why it matters: Without reliable, fast-converging algorithms, systems can be gamed, stall, or behave unpredictably.
Anchor: When you ask an assistant, "Is this strategy safe against cheaters?", you want an answer quickly and confidently. This research moves us closer to that.
02 Core Idea
Hook: You know how a good chef tastes the dish, then decides whether to add salt or sweetness based on how it actually tastes right now, not by a fixed recipe?
The "Aha!" (what it is): Let an LLM evolve the source code of multi-agent algorithms, changing the rules themselves, then keep what reduces exploitability fastest.
- How it works:
- Start with working baseline code (CFR or PSRO).
- Ask an LLM to rewrite small parts (logic, control flow, formulas).
- Auto-run the new code on training games, score by exploitability.
- Keep winners, repeat. That's semantic evolution.
- Why it matters: It jumps beyond tuning numbers to discovering fresh mechanisms humans didnât try.
Anchor: It's like having a coach who invents new drills, scrimmages them right away, and only keeps drills that make the team score more and get scored on less.
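The mutate-evaluate-select loop above can be sketched in a few lines of Python. This is a toy stand-in, not the paper's system: `llm_propose_edit` and `evaluate` are hypothetical placeholders (a random parameter tweak and a simple score in place of a real LLM code rewrite and a real exploitability run), but the selection logic is the same shape.

```python
import random

def llm_propose_edit(code_params):
    """Stand-in for the LLM mutation step. In the real system the model
    rewrites source code; here we just perturb one entry of a parameter dict."""
    child = dict(code_params)
    key = random.choice(list(child))
    child[key] *= random.uniform(0.8, 1.2)
    return child

def evaluate(code_params):
    """Stand-in fitness: lower is better, mimicking exploitability.
    A real evaluator would run the candidate algorithm on training games."""
    return sum((v - 1.0) ** 2 for v in code_params.values())

def evolve(seed_params, generations=200, rng_seed=0):
    """Keep only candidates that score better than the current best."""
    random.seed(rng_seed)
    best, best_fit = seed_params, evaluate(seed_params)
    for _ in range(generations):
        child = llm_propose_edit(best)
        fit = evaluate(child)
        if fit < best_fit:  # selection: keep winners, discard the rest
            best, best_fit = child, fit
    return best, best_fit
```

With a real evaluator plugged in, the same loop selects for code that lowers exploitability fastest on the training games.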
Hook: Imagine learning to ride a bike by paying more attention when you wobble and relaxing when you're steady.
Concept 1 – Volatility-Adaptive Discounted CFR (VAD-CFR) (what it is): A CFR variant that changes how much it "forgets" past regret based on current volatility and delays averaging until the policy calms down.
- How it works:
- Measure volatility using a moving average of instant regret magnitudes.
- If volatility is high, discount old regrets more; if low, keep more history.
- Boost positive instantaneous regret slightly (help good ideas take hold fast).
- Wait (warm-start) before averaging policies; then weight later, informative iterations more, especially those with meaningful regret signals.
- Why it matters: Without adapting to volatility, early noisy data can poison the average; without boosting, good actions take too long to shine.
Anchor: Like practicing piano: ignore messy warmups, then record your best takes once your hands are steady.
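A minimal sketch of the volatility mechanism, assuming an EWMA of the mean absolute instantaneous regret and a linear map from volatility to a retention factor. The statistic and the mapping are illustrative; the evolved code's exact formulas may differ.

```python
def update_volatility(vol, instant_regrets, beta=0.9):
    """EWMA of the mean absolute instantaneous regret (assumed form)."""
    magnitude = sum(abs(r) for r in instant_regrets) / len(instant_regrets)
    return beta * vol + (1 - beta) * magnitude

def adaptive_discount(vol, vol_scale=1.0, d_min=0.5, d_max=0.99):
    """High volatility -> stronger discounting of old regret (small retention
    factor); low volatility -> keep more history (retention near d_max)."""
    v = min(vol / vol_scale, 1.0)  # normalise volatility into [0, 1]
    return d_max - v * (d_max - d_min)
```

Old accumulated regret would then be multiplied by the returned retention factor before the new instantaneous regret is added.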
Hook: Picture a tug-of-war between a careful planner and a bold sprinter; sometimes you need balance.
Concept 2 – SHOR-PSRO (what it is): A PSRO meta-solver that linearly blends Optimistic Regret Matching with a smoothed best-pure-strategy push, and anneals this blend over time.
- How it works:
- Compute a stable distribution via Optimistic Regret Matching (ORM).
- Compute a softmax over pure strategies (temperature controls how sharp).
- Mix them with a blending factor λ.
- Anneal λ and exploration bonuses: explore early, refine late.
- Use different settings for training (averaged) vs evaluation (last iterate, low noise).
- Why it matters: Fixed solvers over-explore or over-exploit. The hybrid adapts to the training stage and yields stronger, steadier populations.
Anchor: Early-season scrimmages try many lineups (explore), playoffs lock a winning lineup (exploit). SHOR-PSRO automates that schedule.
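The blend itself is a one-liner once the two ingredient distributions exist. In the sketch below, plain regret matching stands in for the optimistic variant, and all names and parameter values are illustrative.

```python
import math

def regret_matching(regrets):
    """Clip negative regrets to zero and normalize the positives."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n

def softmax(payoffs, temperature):
    """Temperature-controlled push toward the best pure strategies."""
    m = max(payoffs)  # subtract max for numerical stability
    exps = [math.exp((p - m) / temperature) for p in payoffs]
    z = sum(exps)
    return [e / z for e in exps]

def shor_blend(regrets, payoffs, lam, temperature):
    """sigma = (1 - lam) * sigma_RM + lam * softmax(payoffs / T)."""
    stable = regret_matching(regrets)
    greedy = softmax(payoffs, temperature)
    return [(1 - lam) * s + lam * g for s, g in zip(stable, greedy)]
```

With lam near 0 the meta-strategy is almost pure regret matching (steady); with lam near 1 and a low temperature it concentrates on the strongest pure strategies (bold).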
Hook: You know how spell-check suggests entire sentence rewrites, not just comma fixes?
Concept 3 – Semantic code evolution (what it is): The LLM edits the algorithm's logic (the "recipe"), not only the seasoning (parameters).
- How it works:
- Treat code as the genome.
- LLM proposes mutations that add/remove rules, change control flow, or inject new calculations.
- Evaluate fitness (exploitability) on multiple games.
- Select, keep, and iterate.
- Why it matters: This opens the door to discovering mechanics like warm-start thresholds and regret-sensitive averaging that aren't obvious.
Anchor: It's like redesigning the car's engine, not just adjusting tire pressure.
Before vs After:
- Before: Static discounts, always-on averaging, fixed PSRO solvers.
- After: Volatility-aware forgetting, delayed/weighted averaging, hybrid meta-solvers with annealing and eval-time asymmetry.
Why it works (intuition):
- In hidden-information games, early signals are noisy. Adapting discounting + warm-start filters noise but preserves late precision.
- In population games, exploration then exploitation is key; a mixed solver with temperature and λ gives a smooth handoff rather than a hard switch.
Building blocks (with mini "sandwiches"):
- Hook: Think of "regret" as wishing you'd picked a better move. Regret (what it is): A score of how much better another action would have done. Steps: compute action values, compare to chosen action, accumulate over time. Why: Minimizing regret leads toward equilibrium. Anchor: After a chess move, you compare "what I did" to "best reply"; that gap is regret.
- Hook: Imagine focusing more on stronger clues. Regret Matching (what it is): Turn positive regrets into probabilities. Steps: clip negatives to zero, normalize positives. Why: Steers play toward actions with proven upside. Anchor: If two past routes got you home faster, you pick them more often.
- Hook: Like cooling soup gradually so flavors settle. Annealing (what it is): Slowly change parameters (like λ or temperature) over time. Steps: start exploratory, end stable. Why: Prevents getting stuck early; finishes with precision. Anchor: Practice many songs early in the semester; perfect one for the recital.
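The regret and regret-matching sandwiches above can be made concrete with a tiny self-play experiment in rock-paper-scissors: accumulate regrets, match on the positives, and watch the average strategy approach the uniform Nash equilibrium. This is a standard textbook illustration, not code from the paper.

```python
# Payoff matrix for the row player in rock-paper-scissors (zero-sum).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def regret_matching(regrets):
    """Clip negatives to zero, normalize positives; uniform if none positive."""
    pos = [max(r, 0.0) for r in regrets]
    z = sum(pos)
    return [p / z for p in pos] if z > 0 else [1.0 / len(regrets)] * len(regrets)

def self_play(iterations=5000):
    # Tiny asymmetric seed so play does not start exactly at the fixed point.
    regrets = [1.0, 0.0, 0.0]
    avg = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strat = regret_matching(regrets)
        # Value of each action against the (identical) opponent strategy.
        action_vals = [sum(A[i][j] * strat[j] for j in range(3)) for i in range(3)]
        ev = sum(strat[i] * action_vals[i] for i in range(3))
        for i in range(3):
            # Regret: how much better action i would have done than our play.
            regrets[i] += action_vals[i] - ev
        avg = [a + s for a, s in zip(avg, strat)]
    return [a / iterations for a in avg]
```

The per-iteration strategies cycle, but the time-averaged strategy drifts toward (1/3, 1/3, 1/3), the Nash equilibrium of the game.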
03 Methodology
At a high level: Input (baseline code + training games) → LLM-driven mutation (propose code edits) → Automated evaluation (run on games, compute exploitability) → Selection (keep winners) → Output (evolved algorithms: VAD-CFR, SHOR-PSRO).
Step 1: Preparing the playground (search spaces)
- What happens: The authors expose specific, swappable code hooks.
- CFR hooks: update_accumulate_regret, get_updated_current_policy, update_accumulate_policy.
- PSRO hooks: TrainMetaStrategySolver.get_meta_strategy and EvalMetaStrategySolver.get_meta_strategy.
- Why this step: Without clear "plug points," changes would break the code or only tweak parameters. Hooks let the LLM safely rewire logic.
- Example: In CFR, a mutation may change how old regrets are discounted or when to start policy averaging.
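The three CFR plug points can be pictured as methods on a small class. The method names come from the paper's search space; the bodies below are illustrative vanilla-CFR defaults for a single information set, written so that a mutation only needs to rewrite one method.

```python
class CFRUpdateHooks:
    """Swappable CFR plug points (names from the paper's search space);
    the vanilla-CFR bodies are illustrative defaults, per information set."""

    def update_accumulate_regret(self, cum_regret, instant_regret, t):
        # Vanilla CFR: simply add. Evolved variants discount adaptively here.
        return [c + r for c, r in zip(cum_regret, instant_regret)]

    def get_updated_current_policy(self, cum_regret):
        # Regret matching over the positive accumulated regret.
        pos = [max(r, 0.0) for r in cum_regret]
        z = sum(pos)
        n = len(cum_regret)
        return [p / z for p in pos] if z > 0 else [1.0 / n] * n

    def update_accumulate_policy(self, cum_policy, policy, reach_prob, t):
        # Uniform-weight averaging. Evolved variants warm-start and reweight.
        return [c + reach_prob * p for c, p in zip(cum_policy, policy)]
```

A mutation that changes how old regrets are discounted, or when averaging starts, only touches the corresponding method, which keeps the rest of the solver intact.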
Step 2: LLM-driven mutation (semantic evolution)
- What happens: The LLM reads the parent code and suggests logical edits: adding volatility tracking, optimism terms, or hybrid blending.
- Why this step: It enables discovering new mechanisms (e.g., warm-start threshold, regret-magnitude weighting) beyond human-typical heuristics.
- Example: Propose a hard warm-start at iteration 500 and non-linear scaling of positive projected regrets.
Step 3: Automated evaluation
- What happens: Each candidate runs on several proxy games (e.g., Kuhn Poker, Leduc Poker, Goofspiel, Liar's Dice). The system computes exploitability exactly at a fixed iteration budget (e.g., 1000 for CFR, 100 for PSRO).
- Why this step: It gives a quick, comparable fitness score; without it, selection would be guesswork.
- Example: If exploitability drops faster than baselines across the training set, the candidate survives.
Step 4: Evolutionary selection and repetition
- What happens: Keep valid, better-performing variants, discard the rest, and iterate.
- Why this step: Repeated selection pressures the population toward robust, general improvements.
- Example: After many generations, two stars emerge: VAD-CFR and SHOR-PSRO.
Secret Sauce 1 – VAD-CFR (CFR path)
- What happens (recipe):
- Measure volatility: EWMA of instantaneous regret magnitudes.
- Adaptive discounting: If volatility high, discount past more; if low, remember more. Use different discounting for positive vs negative accumulated regret.
- Asymmetric boosting: Multiply positive instantaneous regret by ~1.1.
- Policy projection: When forming the current policy, project what regrets will be after this step, then apply non-linear scaling to positives.
- Warm-start + weighted averaging: Don't average policies before a threshold (e.g., 500). After that, weight by time, stability, and regret magnitude.
- Why it exists: Early CFR iterations are noisy in imperfect-information games. Averaging too early cements noise; static discounting can either over-forget or over-remember.
- Example with data: Suppose at iteration 200, volatility is high. VAD-CFR increases discounting so bad early memories fade. By iteration 700, volatility calms; averaging activates with high weights on these more trustworthy steps, pulling exploitability down quickly.
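Two of these ingredients are easy to sketch in isolation: the warm-started, weighted averaging and the asymmetric boost. The weight formula (time × stability × regret signal) and the list-based representation are illustrative assumptions, not the evolved code.

```python
def vad_policy_average(cum_policy, policy, t, regret_magnitude,
                       warm_start=500, stability=1.0):
    """Warm-started, weighted policy averaging in the spirit of VAD-CFR.
    Skips noisy early iterations, then weights later ones more heavily;
    the exact weight formula here is an illustrative assumption."""
    if t < warm_start:
        return list(cum_policy)  # before the threshold, do not average at all
    weight = (t - warm_start + 1) * stability * (1.0 + regret_magnitude)
    return [c + weight * p for c, p in zip(cum_policy, policy)]

def boost_positive_regret(instant_regret, boost=1.1):
    """Asymmetric boost: multiply positive instantaneous regret by ~1.1
    so promising actions take hold faster; negatives pass through unchanged."""
    return [boost * r if r > 0 else r for r in instant_regret]
```

Dividing the accumulated policy by the accumulated weights at the end would yield the weighted-average strategy that is actually evaluated.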
Secret Sauce 2 – SHOR-PSRO (PSRO path)
- What happens (recipe):
- Compute ORM strategy: Stability via optimistic regret matching.
- Compute softmax over pure strategies: Temperature controls sharpness; lower means more "greedy."
- Hybrid blend: σ = (1 − λ)·σ_ORM + λ·σ_softmax.
- Anneal λ, temperature, and diversity bonus across PSRO iterations: more exploration early, more exploitation late.
- Train vs Eval solvers: Training returns an averaged strategy (stability); evaluation returns last-iterate with tiny λ and temp (sharpness) for low-noise exploitability.
- Why it exists: A static meta-solver can be too timid or too reckless. The hybrid and schedules automate the healthy arc from curiosity to certainty.
- Example with data: In 3-player Leduc Poker, SHOR-PSRO starts with λ ≈ 0.3 and temperature ≈ 0.5 to explore, then by later iterations λ ≈ 0.05 and temperature ≈ 0.01 to solidify, giving lower exploitability than PRD or Uniform.
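A generic annealing helper is enough to reproduce the schedule shapes described above. The linear default and the `power` knob are assumptions; the paper's exact schedules are not quoted here.

```python
def anneal(start, end, t, horizon, power=1.0):
    """Interpolate a parameter from start to end over a run of `horizon`
    iterations; power=1.0 gives a linear schedule, power>1 back-loads it."""
    frac = min(t / horizon, 1.0) ** power
    return start + frac * (end - start)

# Schedule matching the 3-player Leduc illustration in the text:
# lambda ~0.3 -> ~0.05 and temperature ~0.5 -> ~0.01 across 100 iterations.
lam_early = anneal(0.3, 0.05, 0, 100)     # exploratory blend at the start
lam_late = anneal(0.3, 0.05, 100, 100)    # near-greedy blend at the end
temp_late = anneal(0.5, 0.01, 100, 100)   # sharp softmax at the end
```

Feeding these annealed values into the hybrid blend each PSRO iteration gives the smooth handoff from exploration to exploitation.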
Mini "sandwiches" for key moving parts:
- Hook: Like listening more carefully when a room is noisy. Volatility (what it is): A measure of how jumpy recent regrets are. Steps: EWMA of absolute instantaneous regrets; map to [0,1]; adjust discounts. Why: High volatility means don't trust long history; low means you can. Anchor: When a game's strategy swings wildly, VAD-CFR forgets faster.
- Hook: Don't average your practice runs while you're still warming up. Warm-start averaging (what it is): Delay building the average policy until stability. Steps: If t < threshold, skip averaging; else use time- and regret-weighted updates. Why: Prevents early noise from polluting the final strategy. Anchor: Start counting your batting average only after your swing is steady.
- Hook: Blend peanut butter (smooth) with jelly (bold) for balance. Hybrid meta-strategy (what it is): Mix ORM (steady) with a softmax over pure strategies (bold). Steps: compute both; pick λ; combine; anneal λ over time. Why: Balance exploration and exploitation throughout training. Anchor: Early scrimmages try bold plays; late games stick with what works.
What breaks without each step:
- No volatility-adaptive discounting: You either cling to noisy history or forget too much.
- No warm-start: Average policy encodes early mistakes; convergence slows.
- No hybrid/annealing in PSRO: Populations drift or stall; exploitability plateaus higher.
04 Experiments & Results
The test (what they measured and why):
- Metric: Exploitability, i.e., how much a perfect opponent could gain against your strategy. Lower is better.
- Why: It directly reflects how close you are to a Nash equilibrium in these games.
- Protocol: Fixed horizons (e.g., 1000 CFR iterations; 100 PSRO iterations). Exact exploitability computed via full game-tree traversal.
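For a two-player zero-sum game small enough to enumerate, exact exploitability (NashConv) is just a pair of best-response computations. The matrix-game sketch below stands in for the full game-tree traversal used in the paper.

```python
def exploitability(A, x, y):
    """NashConv for a two-player zero-sum matrix game: the total gain both
    players could get by best-responding. Exactly 0 at a Nash equilibrium."""
    n = len(A)
    # Player 1's best response value against y (payoffs A[i][j]).
    br1 = max(sum(A[i][j] * y[j] for j in range(n)) for i in range(n))
    # Player 2's best response value against x (payoffs -A[i][j]).
    br2 = max(sum(-A[i][j] * x[i] for i in range(n)) for j in range(n))
    return br1 + br2

# Rock-paper-scissors: the uniform strategy is the Nash equilibrium.
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
rock = [1.0, 0.0, 0.0]
```

`exploitability(A, uniform, uniform)` is 0, while the always-rock profile is maximally exploitable; sequential games replace the matrix products with best-response traversals of the game tree.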
The competition (baselines):
- CFR family: CFR, CFR+, LCFR, DCFR, PCFR+, DPCFR+, HS-PCFR+(30).
- PSRO solvers: Uniform, Nash (LP for 2-player), AlphaRank, Projected Replicator Dynamics (PRD), Regret Matching (RM).
Scoreboard with context:
- VAD-CFR vs state-of-the-art:
- Training set (3p Kuhn, 2p Leduc, 4-card Goofspiel, 5-sided Liar's Dice): VAD-CFR consistently drops exploitability faster. Think of getting an A when others get B's.
- Test set (4p Kuhn, 3p Leduc, 5-card Goofspiel, 6-sided Liar's Dice): In 3p Leduc, VAD-CFR reaches below 10^-3 while many baselines plateau higher, like acing a hard exam most classmates struggle with. In 6-sided Liar's Dice, it matches or beats DCFR, showing robustness in larger state spaces.
- Broad sweep (11 games, appendix): VAD-CFR matches or surpasses prior best results in 10/11 games; 4p Kuhn is the outlier where it doesn't win.
- SHOR-PSRO vs standard solvers:
- Training set: SHOR-PSRO hits exploitability under ~10^-3 faster than PRD/RM in simpler games (Kuhn), indicating speed plus stability, like finishing a race laps ahead without wobbling.
- Test set: In 3p Leduc and 6-sided Liar's Dice (harder, more chaotic), SHOR-PSRO matches or outperforms top baselines; the hybrid and annealing keep progress steady instead of bouncing around.
Surprising findings:
- The warm-start threshold around 500 iterations emerged from evolution, not from prompt hints, despite evaluation being at 1000 iterations. The system "rediscovered" the value of ignoring early noise and then averaging more decisively.
- Regret-magnitude weighting helped emphasize "informative" iterations, a nuance beyond simple linear or polynomial averaging.
- The training/evaluation asymmetry in PSRO (averaging during training, last-iterate for evaluation) reduced evaluation noise while keeping training stable.
Interpretation (why these wins make sense):
- CFR side: Hidden-information games produce jittery early signals. VAD-CFR's volatility-aware forgetting and delayed/weighted averaging let the algorithm learn the shape of a good policy first, then crystallize it cleanly.
- PSRO side: Populations need breadth first (diversity), then depth (refinement). SHOR-PSRO's hybrid solver and annealed parameters provide a smooth on-ramp from exploring to locking in low-exploitability mixtures.
Practical take: If you can compute exact exploitability, these evolved schemes give faster, stabler drops, like moving from a steady jog to a smooth sprint without tripping.
05 Discussion & Limitations
Limitations (honest look):
- Dependence on training distribution: Evolution optimizes on chosen games. If deployment games are very different, gains may shrink.
- Compute budget: Many candidates must be compiled and run; exact exploitability is expensive beyond small/medium games.
- Prompting and LLM quality: Mutations depend on clear prompts and capable models; weaker LLMs may produce trivial or broken edits.
- Interpretability vs complexity: Although code is readable, evolved logic can grow intricate (multiple schedules, exponents), making formal analysis harder.
- Fixed iteration horizons: Designs like warm-start thresholds may implicitly fit the evaluation budget unless revalidated for other horizons.
Required resources:
- A strong LLM with code-editing skill, an evaluation farm (cluster) to run many candidates, and a suite of representative training games with exact exploitability tools (e.g., OpenSpiel).
- Engineering to build safe sandboxes, regression tests, and selection logic.
When NOT to use:
- Very large games where exact exploitability is impossible and approximate evaluators are too noisy; selection pressure may become unreliable.
- Highly non-stationary real-time environments where evaluation lags make fitness outdated before selection.
- Strictly theory-first contexts where proofs of convergence are required before any deployment; the evolved mechanisms may precede formal guarantees.
Open questions:
- Theory: Under what conditions do volatility-adaptive discounting and warm-started, regret-weighted averaging preserve or improve CFR's convergence bounds?
- Generality: How well do SHOR-style hybrids transfer to non-zero-sum or general-sum settings?
- Scaling: Can we replace exact exploitability with faithful proxies to evolve on bigger games?
- Safety: How to ensure mutations never smuggle in unfair advantages (e.g., peeking at hidden info) and remain robust to implementation quirks?
- Auto-curricula: Can the system co-evolve training tasks (games, scenarios) along with algorithms for even faster progress?
06 Conclusion & Future Work
3-sentence summary:
- This paper uses an LLM-driven evolutionary system, AlphaEvolve, to edit and improve the source code of multi-agent learning algorithms, selecting variants that lower exploitability fastest.
- It discovers VAD-CFR for regret minimization, which adapts forgetting to volatility, boosts good moves, and delays/weights policy averaging to avoid early noise; and SHOR-PSRO for population methods, which blends regret-based stability with softmax-driven greed and anneals this blend over time.
- Across many benchmark games, both variants converge faster and more reliably than strong baselines, showing that semantic code evolution can invent effective, non-intuitive mechanisms.
Main achievement:
- Demonstrating that LLMs can conduct semantic algorithm evolution, discovering new learning rules (not just tuning parameters) that deliver state-of-the-art empirical performance in imperfect-information games.
Future directions:
- Scale to larger games via proxy fitness, sampling, or learned exploitability estimators; extend to deep RL settings; add proof-guided prompts to bias toward theoretically grounded variants; co-evolve curricula and solvers; and explore cooperative, general-sum games.
Why remember this:
- It marks a shift from human-only heuristic design to human+AI co-discovery, where readable code evolves to include clever schedules and hybrids that match how real learning dynamics behave. Like a great coach who both invents and validates new drills, the system shows that creative, testable algorithm design can be automated, and that this can accelerate progress wherever many agents must learn together.
Practical Applications
- Design stronger poker or strategy-game bots that converge faster and are harder to exploit.
- Improve bidding strategies in ad auctions by finding robust equilibria under changing market conditions.
- Stabilize automated negotiations among software agents by evolving better regret and mixing rules.
- Boost cybersecurity simulations where attackers and defenders co-train, reducing exploitable weaknesses.
- Optimize traffic-routing agents that must adapt to volatile patterns while avoiding early noisy decisions.
- Enhance robot multi-agent coordination (e.g., warehouse fleets) via adaptive exploration-to-exploitation schedules.
- Accelerate research: auto-discover algorithmic variants before investing in long theoretical analysis.
- Create teaching tools that evolve and visualize learning rules, helping students see why certain mechanisms work.
- Customize solvers to new games or markets by running evolution on representative training scenarios.
- Prototype general-sum or cooperative variants by extending hybrid meta-solvers with diversity-aware objectives.