
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Intermediate
Zhiyuan Hu, Yunhai Hu, Juncheng Liu et al. Ā· 1/14/2026
arXiv Ā· PDF

Key Summary

  • This paper introduces MATTRL, a way for multiple AI agents to learn from their own conversations at test time using short, reusable text notes instead of retraining their weights.
  • It forms a small expert team, lets them debate for a few rounds, and injects helpful past experiences into the dialog to reach a better decision.
  • A special scoring system gives credit to the most useful utterances, turning them into a searchable experience pool for future problems.
  • Across medicine, math, and education tasks, MATTRL beats strong single- and multi-agent baselines, improving accuracy by 3.67% over multi-agent and 8.67% over single-agent setups on average.
  • Difference Rewards (a counterfactual credit method) gave the sharpest top-1 accuracy, while Shapley-style credit spread gains more thinly and hurt precision.
  • An adaptive router that chooses between single-agent and MATTRL did even better, showing the right collaboration style depends on the case.
  • Adding random few-shot examples didn’t match MATTRL’s gains, showing structured experience and credit matter more than just extra context.
  • Teams of three experts balanced precision and diversity best; bigger teams helped recall (Hit@10) but could harm top-1 precision.
  • Because model weights never change, MATTRL adapts to new domains at inference while preserving the model’s original skills.
  • Main limits are extra compute/latency for multi-agent discussions and keeping the experience pool clean and up-to-date.

Why This Research Matters

Real teams learn from their best moments; MATTRL lets AI teams do the same, instantly, without costly retraining. Because experiences are compact and human-readable, the system is more transparent and auditable than black-box updates. In medicine, this can mean safer shortlists and clearer tie-breakers for rare conditions. In education, it helps teachers ask better questions that move students from confusion to understanding. In math and other analytical domains, it reduces errors by retrieving reminders to cross-check logic. And since model weights stay fixed, organizations can adapt to new tasks while preserving general abilities and safety constraints.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how group projects can be amazing when everyone brings a different talent—but also messy if no one keeps notes on what actually worked? Teams do better when they remember their best moves.

🄬 Filling (The Actual Concept):

  • What it is: This paper studies multi-agent systems—teams of AI experts that talk, debate, and solve hard problems together—and introduces a way for them to learn from their own conversations at test time without retraining.
  • How it works (story before the method):
    1. Before: Single AIs often struggled with tricky, shifting problems (like unusual medical cases), while multi-agent teams helped by cross-checking—but training these teams with reinforcement learning (RL) was expensive and unstable.
    2. People tried: Full-blown multi-agent RL (lots of compute, fragile), clever prompting (often brittle), and few-shot examples (not targeted enough).
    3. The missing piece: A stable, low-cost way to adapt during inference, using the rich signals already inside the team’s dialogue.
  • Why it matters: If teams could reuse their best reasoning steps as small, searchable notes, they could adapt quickly to new situations without losing their general skills.

šŸž Bottom Bread (Anchor): Imagine a hospital meeting where specialists jot down short, proven rules after each case (ā€œWhen X and Y appear together, down-rank Zā€). Next time, those notes guide the team to a smarter, faster diagnosis.

— New Concept — Multi-Agent Systems šŸž Hook: Imagine a soccer team: the goalie, defenders, midfielders, and forwards each play a unique role. 🄬 The Concept:

  • What it is: Multi-agent systems are groups of AI agents with different roles that collaborate to solve a task.
  • How it works:
    1. Each agent looks at the same problem from its specialty.
    2. They share, critique, and revise ideas across a few rounds.
    3. A coordinator helps them reach a final decision.
  • Why it matters: One agent might miss a clue another catches; together, they are more robust to tricky cases and distribution shifts. šŸž Anchor: A math "team" with Algebra, Geometry, and Calculus experts cross-check each other to avoid a sneaky mistake.

— New Concept — Reinforcement Learning (RL) and its multi-agent pain points šŸž Hook: Think of a video game where you try moves and learn from the score at the end. 🄬 The Concept:

  • What it is: RL adjusts behavior based on rewards; in multi-agent RL, many teammates learn at once.
  • How it works:
    1. Agents act, get a reward, and update policies.
    2. In multi-agent RL, everyone changes simultaneously.
    3. This makes the "world" feel like it’s moving under your feet (non-stationary), and rewards can be sparse and noisy.
  • Why it matters: Training becomes unstable, costly, and can overfit to one domain, hurting general skills. šŸž Anchor: If every player on a team constantly changes tactics mid-season, it’s hard to know which change helped win the match.

— New Concept — Distribution Shift šŸž Hook: You practiced spelling "color," but your test uses "colour." Same idea, different setting. 🄬 The Concept:

  • What it is: Distribution shift is when new problems look different from training data.
  • How it works: The model’s old habits might not fit new styles or edge cases.
  • Why it matters: Performance drops if the system can’t adapt quickly. šŸž Anchor: Rare diseases or puzzle-like math questions often don’t look like the training set.

— New Concept — Test-Time Adaptation šŸž Hook: A chef learns a new trick while cooking tonight’s dinner, not next semester. 🄬 The Concept:

  • What it is: Adjusting behavior during inference using information available right now, without changing weights.
  • How it works:
    1. Observe the current case.
    2. Pull in relevant, structured hints.
    3. Condition the next steps on these hints.
  • Why it matters: Quick, safe adaptation without retraining. šŸž Anchor: A doctor team consulting past case notes during a live consult.

What the world needed: a way to keep the strength of multi-agent debate but avoid the cost/instability of training, by turning the rich, step-by-step dialogue into reusable, test-time experiences. That is the gap this paper fills with MATTRL.

02 Core Idea

šŸž Top Bread (Hook): Imagine a study group that writes the smartest lines from past sessions on sticky notes and brings those notes to the next meeting. Suddenly, the group makes fewer mistakes and agrees faster.

🄬 Filling (The Actual Concept):

  • What it is (one sentence): MATTRL lets multi-agent teams reuse their best past reasoning—packaged as short, structured text experiences—during inference, so they adapt to new problems without retraining.
  • How it works (big picture):
    1. Build a small team of specialists.
    2. Let them discuss a case in rounds.
    3. Score each utterance and the final outcome; keep the best bits as compact experiences.
    4. Next time, retrieve matching experiences and inject them into the discussion to guide better reasoning.
  • Why it matters: Stable, low-cost adaptation; dense guidance at every turn; and preserved generality because weights don’t change.

šŸž Bottom Bread (Anchor): In a medical consult, past high-quality notes like ā€œAnchor on key discriminators firstā€ are retrieved to help the team rank rare diseases more accurately.

— New Concept — Textual Experience šŸž Hook: You know how a coach leaves short sideline notes like "Mark #10 tightly" that change the game? 🄬 The Concept:

  • What it is: A compact, structured text snippet that captures an actionable step and why it helped.
  • How it works:
    1. Capture the local context and the useful move.
    2. Add a one-line rationale.
    3. Store it so it’s easy to search and reuse.
  • Why it matters: Reusable, human-readable guidance beats a vague scalar reward. šŸž Anchor: "Good practice: Clarify leukocoria locus before assuming a subtype [helpful]."
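
To make this concrete, here is a minimal Python sketch of one such snippet; the field names (action, rationale, context, score) and the example values are illustrative assumptions rather than the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One reusable textual experience distilled from a high-credit turn."""
    action: str         # the useful move, e.g. a concrete check or re-ranking rule
    rationale: str      # one-line reason the move helped
    context: str        # minimal tag describing when the tip applies
    score: float = 0.0  # combined credit (utterance quality + outcome share)

    def as_hint(self) -> str:
        # Render the snippet the way it would be shown to an agent.
        return f"{self.action} (why: {self.rationale})"

hint = Experience(
    action="Clarify leukocoria locus before assuming a subtype",
    rationale="prevented a premature retinoblastoma call in a past case",
    context="pediatric neuro-ocular presentations",
    score=0.82,
)
```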

— New Concept — Structured Experience Integration šŸž Hook: Like organizing recipe tips into a neat cookbook you can quickly flip through while cooking. 🄬 The Concept:

  • What it is: Retrieving the most relevant experiences and inserting them into each agent’s prompt during the debate.
  • How it works:
    1. Embed experiences and the current query.
    2. Retrieve top matches.
    3. Append them under a fixed "Experience Hints" block.
  • Why it matters: The team gets targeted nudges at exactly the right time. šŸž Anchor: A Calculus agent retrieves "Cross-check calculus with inequality bounds" before finalizing a maximum-area proof.
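
A minimal sketch of the injection step is below; the "Experience Hints" heading and prompt layout are assumptions, since the paper describes a fixed template but not its exact wording.

```python
from typing import List

def inject_hints(agent_prompt: str, hints: List[str]) -> str:
    """Append retrieved experience snippets under a fixed "Experience Hints"
    block at the end of the agent's prompt. The exact heading and layout
    are assumptions; only the idea of a fixed block comes from the paper."""
    if not hints:
        return agent_prompt
    block = "\n\n### Experience Hints\n" + "\n".join(f"- {h}" for h in hints)
    return agent_prompt + block

# Example: build one specialist's prompt for the next round.
prompt = inject_hints(
    "You are the Calculus expert. Problem: maximize the area of the inscribed rectangle ...",
    ["Cross-check calculus with inequality bounds (caught a sign error in a past proof)"],
)
```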

— New Concept — Credit Assignment šŸž Hook: In a group project, who deserves the gold star? The idea-spark, the proof-checker, or the tie-breaker? 🄬 The Concept:

  • What it is: Scoring which agent utterances actually moved the team toward a better final answer.
  • How it works:
    1. Score each utterance on qualities like correctness and information gain.
    2. Mix in credit from the final outcome, redistributed across turns by a decay rule (which turns get more weight depends on the design).
    3. Keep the highest-scoring snippets.
  • Why it matters: Only the truly helpful steps become experiences, keeping the pool sharp. šŸž Anchor: A Pediatrics note that correctly down-weights CMV earns high credit and becomes a reusable hint.
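
The per-utterance scoring step could look roughly like this sketch, which assumes a hypothetical call_llm() placeholder for the judge model and an equal-weight average over the rubric dimensions; the paper's judge prompt and weighting may differ.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for the LLM judge call; swap in a real client.
    Here it returns a canned JSON response so the sketch runs."""
    return json.dumps({"correctness": 0.9, "information_gain": 0.8,
                       "relevance": 0.9, "clarity": 0.7})

def judge_utterance(utterance: str, case_context: str) -> float:
    """Ask the judge to rate one utterance on the rubric dimensions the paper
    lists (correctness, information gain, relevance, clarity), then average.
    The equal-weight average is an assumption."""
    prompt = ("Rate this expert utterance from 0 to 1 on correctness, "
              "information_gain, relevance, and clarity. Reply as JSON.\n"
              f"Case: {case_context}\nUtterance: {utterance}")
    scores = json.loads(call_llm(prompt))
    return sum(scores.values()) / len(scores)

quality = judge_utterance(
    "Down-weight CMV; it rarely explains isolated leukocoria.",
    "Infant with leukocoria and developmental delay",
)  # 0.825 with the canned response above
```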

— New Concept — Difference Rewards vs. Shapley-style šŸž Hook: To see if a player mattered, imagine replaying the match without them. 🄬 The Concept:

  • What it is: Difference Rewards approximate each agent’s impact by comparing the team with vs. without that agent; Shapley averages impact across many coalitions.
  • How it works:
    1. Difference: Replace an agent’s utterance with a neutral one and measure the change.
    2. Shapley: Sample permutations of agent orders and average marginal gains.
    3. Normalize and select experiences.
  • Why it matters: Difference gave sharper top-1 precision (less credit "smear"), while Shapley was fair but diluted decisive moves. šŸž Anchor: Removing Neurology’s decisive "peroxisomal first" nudge lowers the final rank accuracy—so that utterance gets high Difference credit.
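
Both schemes can be sketched as counterfactual evaluations over the team's utterances. In the sketch below, the team_score evaluator, the neutral replacement text, and the toy scoring rule are all illustrative assumptions.

```python
import random
from typing import Callable, Dict

def difference_reward(utterances: Dict[str, str],
                      team_score: Callable[[Dict[str, str]], float],
                      agent: str,
                      neutral: str = "No comment.") -> float:
    """Counterfactual credit: team score with the agent's real utterance
    minus the score when that utterance is swapped for a neutral placeholder."""
    counterfactual = dict(utterances, **{agent: neutral})
    return team_score(utterances) - team_score(counterfactual)

def shapley_credit(utterances: Dict[str, str],
                   team_score: Callable[[Dict[str, str]], float],
                   agent: str,
                   n_samples: int = 50,
                   neutral: str = "No comment.") -> float:
    """Monte-Carlo Shapley value: average the agent's marginal gain over
    random orders in which specialists 'join' the discussion."""
    agents = list(utterances)
    total = 0.0
    for _ in range(n_samples):
        order = random.sample(agents, len(agents))
        present = {a: neutral for a in agents}   # all-neutral baseline coalition
        for a in order:
            if a == agent:
                before = team_score(present)
                present[a] = utterances[a]
                total += team_score(present) - before
                break
            present[a] = utterances[a]
    return total / n_samples

# Toy evaluator: one point for every utterance that names the decisive clue.
toy_score = lambda utts: float(sum("peroxisomal" in u for u in utts.values()))
utts = {"Neurology": "Rank peroxisomal disorders first.",
        "Pediatrics": "Growth chart looks normal.",
        "Ophthalmology": "Leukocoria present; examine the retina."}
print(difference_reward(utts, toy_score, "Neurology"))  # 1.0
print(shapley_credit(utts, toy_score, "Neurology"))     # ~1.0
```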

Three analogies for the "Aha!"

  • Sticky-notes: Save the smartest lines, reuse them next time.
  • Playbook: Keep winning plays and call them when the pattern matches.
  • Recipe margin notes: "Bake 5 min longer if the batter is dense", a tiny note that saves the cake.

Before vs After

  • Before: Multi-agent debate helps, but gains are limited; training new RL policies is expensive and unstable.
  • After: Same agents, but now guided by curated experiences at test time—more accurate, more robust.

Why it works (intuition)

  • Keep weights fixed → stability.
  • Turn-level scoring → dense signals beat sparse rewards.
  • Retrieval → context-matched hints, less noise.
  • Coordinator and convergence checks → disciplined debates.

Building blocks

  • Team formation; round-based dialogue; meeting bulletin; coordinator summary.
  • Experience construction with utterance scoring + terminal outcome credit (decay).
  • Retrieval via embeddings + FAISS; fixed injection template.
  • Credit assignment options (Naive, Difference, Shapley).

03 Methodology

At a high level: Input (task record) → Team Formation → Multi-round Dialogue with Experience Retrieval → Coordinator Synthesis → Final Decision, while simultaneously building and updating a test-time experience pool from high-credit turns.
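
As a rough orientation, the sketch below wires those stages together; every helper (retrieve, speak, synthesize) is an injected stub rather than the authors' implementation, and a real run would back them with LLM calls and the experience pool.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    name: str
    converged: bool = False

def mattrl_episode(case: str,
                   agents: List[Agent],
                   retrieve: Callable[[str, int], List[str]],
                   speak: Callable[[Agent, str, List[str], List[str]], str],
                   synthesize: Callable[[str, List[str]], str],
                   max_rounds: int = 3) -> str:
    """Skeleton of one MATTRL episode: round-based debate with retrieved
    experience hints, a shared bulletin, and a final coordinator synthesis.
    The callables are injected so the sketch runs with toy stubs."""
    bulletin: List[str] = []
    for _ in range(max_rounds):
        updates = []
        for agent in agents:
            if agent.converged:          # converged specialists sit out
                continue
            hints = retrieve(case, 3)    # top-K experiences for this context
            updates.append(speak(agent, case, bulletin, hints))
        bulletin.extend(updates)         # lightweight MEETING bulletin
        if all(a.converged for a in agents):
            break
    return synthesize(case, bulletin)    # coordinator's final decision

# Toy run with stub behaviours in place of real LLM agents.
team = [Agent("Pediatrics"), Agent("Neurology"), Agent("Ophthalmology")]
answer = mattrl_episode(
    "Infant with leukocoria and developmental delay",
    team,
    retrieve=lambda ctx, k: ["Anchor on key discriminators first"],
    speak=lambda a, c, b, h: f"{a.name}: hypothesis informed by '{h[0]}'",
    synthesize=lambda c, b: "Final ranked differential drawing on: " + " | ".join(b),
)
```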

Step A: Team Formation

  • What happens: A coordinator selects a small set of specialists (e.g., Pediatrics, Neurology, Ophthalmology) based on the case or problem.
  • Why it exists: Right roles prevent noisy, off-topic advice.
  • Example: For a rare-disease case with neuro-ocular signs, choose Pediatrics, Neurology, Ophthalmology.

— New Concept — Experience Pool šŸž Hook: Like a shared folder of best tips from past meetings. 🄬 The Concept:

  • What it is: A database of high-quality textual experiences distilled from earlier team turns.
  • How it works:
    1. Score turns using an LLM judge for correctness, relevance, and information gain.
    2. Blend with the final case outcome credit (with a decay over turns).
    3. Summarize top turns into short, structured snippets and store them with embeddings for search.
  • Why it matters: Each new case can benefit from what worked before. šŸž Anchor: "Honest uncertainty when evidence is thin" becomes a general rule in the pool.

Step B: Multi-round Dialogue with Retrieval

  • What happens: Each round, non-converged specialists retrieve top-K relevant experiences and revise their opinions. A lightweight MEETING step aggregates the freshest updates into a shared bulletin. Agents see the bulletin next round to avoid loops.
  • Why it exists: Retrieval delivers the right nudge at the right time; the bulletin aligns the team.
  • Example: Ophthalmology retrieves "clarify leukocoria locus first," changing the rank order.
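
A minimal stand-in for the MEETING/bulletin bookkeeping might look like this; the round-tagging format and the cap on how many entries are kept are assumptions for illustration.

```python
from typing import Dict, List

def update_bulletin(bulletin: List[str],
                    round_idx: int,
                    updates: Dict[str, str],
                    max_items: int = 9) -> List[str]:
    """Fold the freshest per-specialist updates into the shared meeting bulletin.
    Entries are tagged with round and author; only the most recent max_items
    are kept so the next round's prompts stay short (the cap is an assumption)."""
    for name, text in updates.items():
        bulletin.append(f"[round {round_idx}] {name}: {text}")
    return bulletin[-max_items:]

bulletin = update_bulletin([], 1, {
    "Ophthalmology": "Clarify leukocoria locus before assuming a subtype.",
    "Neurology": "Rank peroxisomal disorders above CMV here.",
})
```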

— New Concept — Retrieval with Embeddings šŸž Hook: Like searching a library by meaning, not exact words. 🄬 The Concept:

  • What it is: Use an embedding model and a FAISS index to find the most semantically similar experiences.
  • How it works:
    1. Encode the agent’s current context; encode all experiences.
    2. Compute similarity; pick top-K.
    3. Insert them under a fixed "Experience Hints" block in the prompt.
  • Why it matters: Fast, relevant, low-noise guidance. šŸž Anchor: A geometry specialist retrieves "cross-check via inequality bounds" for a max-area problem.
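
Here is a small runnable sketch of that retrieval step with FAISS. The embed() helper is a deterministic placeholder, so its similarities are not meaningful; the paper uses Qwen3-Embedding-4B, and the index type and K below are illustrative choices.

```python
import hashlib
import numpy as np
import faiss  # pip install faiss-cpu

def embed(texts, dim=64):
    """Placeholder embedder: a deterministic pseudo-random unit vector per text.
    Similarities are therefore not meaningful here; a real embedding model
    (e.g. Qwen3-Embedding-4B, as in the paper) would replace this."""
    vecs = []
    for t in texts:
        seed = int(hashlib.md5(t.encode()).hexdigest()[:8], 16)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    vecs = np.stack(vecs)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit norm: inner product == cosine
    return vecs.astype("float32")

experiences = [
    "Cross-check calculus with inequality bounds before finalizing a proof",
    "Anchor on key discriminators first when ranking rare diseases",
    "State uncertainty honestly when the evidence is thin",
]
index = faiss.IndexFlatIP(64)            # flat inner-product index
index.add(embed(experiences))

query = embed(["maximize the area of a rectangle inscribed in a circle"])
scores, ids = index.search(query, 2)     # top-K most similar experiences
top_hints = [experiences[i] for i in ids[0]]
```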

Step C: Convergence and Coordinator Synthesis

  • What happens: If a specialist has no new changes, it’s marked converged. When all converge or the cap is reached, the coordinator summarizes evidence and outputs the final decision.
  • Why it exists: Prevents endless debate and makes outcomes auditable.
  • Example: The coordinator fuses ranked differentials into a precise top-10 list.
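
The convergence bookkeeping can be as simple as comparing each specialist's latest answer to its previous one, as in this sketch; the paper's actual comparison rule may be more lenient.

```python
from typing import Dict, List, Set

def mark_converged(prev_answers: Dict[str, List[str]],
                   new_answers: Dict[str, List[str]]) -> Set[str]:
    """Flag specialists whose ranked answer did not change this round;
    a simple stand-in for the 'no new changes' rule described in the text."""
    return {name for name, answer in new_answers.items()
            if prev_answers.get(name) == answer}

converged = mark_converged(
    {"Pediatrics": ["Zellweger", "CMV"], "Neurology": ["Zellweger", "NPC"]},
    {"Pediatrics": ["Zellweger", "CMV"], "Neurology": ["NPC", "Zellweger"]},
)
# converged == {"Pediatrics"}; debate stops when every specialist converges or the round cap is hit.
```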

Step D: Experience Construction (Credit Assignment)

  • What happens:
    1. Per-utterance scoring by an LLM judge (correctness, information gain, relevance, clarity).
    2. Terminal outcome credit allocated back to turns with a decay weight (later or earlier turns prioritized, per design).
    3. Combine the two into a final per-utterance score; threshold to keep the best.
    4. Summarize kept utterances into compact experiences (action + rationale), tag with minimal context, embed, and index.
  • Why it exists: Ensures only truly helpful steps become reusable tips.
  • Example: A Pediatrics utterance that reorders top candidates based on decisive features gets saved; a vague "seems consistent" remark gets filtered out.
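
A compact sketch of the combine-and-threshold step is below, taking the judge quality and the redistributed outcome credit as inputs; the 50/50 blend and the 0.75 keep-threshold are illustrative choices, not the paper's hyperparameters.

```python
from typing import Dict, List

def select_experiences(turns: List[dict],
                       outcome_credit: Dict[int, float],
                       threshold: float = 0.75) -> List[dict]:
    """Blend per-utterance judge quality with the redistributed outcome credit,
    then keep only the best turns as candidate experiences.

    Each turn dict carries 'agent', 'text', and 'quality' (from the LLM judge);
    outcome_credit maps turn index -> share of the final reward. The 50/50
    blend and the 0.75 threshold are illustrative, not the paper's values."""
    kept = []
    for i, turn in enumerate(turns):
        score = 0.5 * turn["quality"] + 0.5 * outcome_credit.get(i, 0.0)
        if score >= threshold:
            kept.append({**turn, "score": score})
    return kept

turns = [
    {"agent": "Pediatrics", "text": "Reorder: peroxisomal disorders above CMV.", "quality": 0.9},
    {"agent": "Neurology",  "text": "Seems consistent with the history.",        "quality": 0.4},
]
experiences = select_experiences(turns, outcome_credit={0: 0.7, 1: 0.3})
# Only the decisive Pediatrics turn (score 0.80) survives; the vague remark is filtered out.
```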

— New Concept — Decay over Turns šŸž Hook: First drafts matter, but later tie-breakers can clinch the win. 🄬 The Concept:

  • What it is: A simple way to give different emphasis to early vs. late turns when redistributing the final outcome credit.
  • How it works:
    1. Assign larger/smaller share to certain turns (e.g., later decisive moves).
    2. Split a turn’s share across agents by their contribution ratios.
    3. Blend with utterance quality to pick experiences.
  • Why it matters: Rewards decisive moments without ignoring useful setup. šŸž Anchor: A final-round argument that locks the correct diagnosis gets slightly extra credit.
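
One simple way to implement such a decay is a normalized geometric weighting over turns, as sketched below; the geometric form and the 0.8 decay value are assumptions for illustration.

```python
from typing import List

def turn_weights(n_turns: int, decay: float = 0.8, favor_late: bool = True) -> List[float]:
    """Split the terminal outcome credit across turns with a geometric decay.
    With favor_late=True, later (often decisive) turns receive more weight;
    the geometric form and the 0.8 decay are illustrative assumptions."""
    raw = [decay ** (n_turns - 1 - t) if favor_late else decay ** t
           for t in range(n_turns)]
    total = sum(raw)
    return [w / total for w in raw]

weights = turn_weights(3)  # about [0.26, 0.33, 0.41]: the final tie-breaker gets the largest share
```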

Step E: Secret Sauce

  • Dense, textual feedback: Turn-level text is richer than a single number reward.
  • Fixed weights: No parameter updates—stable and domain-safe.
  • Targeted retrieval: Only the most relevant experiences are injected.
  • Practical credit: Difference Rewards isolate key contributors at modest compute, often boosting top-1 precision.

Concrete Walkthrough (Medicine)

  1. Team Formation: Recruit Pediatrics, Neurology, Ophthalmology.
  2. Round 1: Each posts top-10 hypotheses; meeting bulletin shares highlights.
  3. Retrieval: Pediatrics pulls "Anchor on key discriminators first" and refines.
  4. Round 2: Opinions tighten; Ophthalmology retrieves "clarify leukocoria locus."
  5. Convergence: No new changes; coordinator synthesizes the report and final top-10.
  6. Experience Construction: High-credit utterances become reusable hints.
  7. Next Case: Agents retrieve matching hints and start closer to the right answer.

04 Experiments & Results

The Tests (What and Why)

  • Medicine (RareBench Task 4): Can teams rank the correct rare disease in the top-k? Measures both precision (Hit@1) and coverage (Hit@10) plus MRR.
  • Math (HLE): Can the team solve expert-level problems (exact-match accuracy)?
  • Education (SuperGPQA tutoring): Can the team of teachers improve a student’s post-test accuracy (āˆ†Acc)?

The Competition (Baselines)

  • Medicine: MDAgents; RareAgents; RareAgents-Refined (stronger prompts and peer review).
  • Math/Education: Single-agent vs. plain multi-agent vs. MATTRL.

The Scoreboard (with context)

  • Medicine (Hit@k, higher is better):
    • MDAgents: Hit@1 0.32, Hit@3 0.49, Hit@5 0.57, Hit@10 0.68, MRR 0.46
    • RareAgents: Hit@1 0.29, Hit@3 0.38, Hit@5 0.47, Hit@10 0.68, MRR 0.42
    • RareAgents-Refined: Hit@1 0.35, Hit@3 0.49, Hit@5 0.57, Hit@10 0.70, MRR 0.47
    • MATTRL: Hit@1 0.39, Hit@3 0.51, Hit@5 0.61, Hit@10 0.75, MRR 0.51
    Interpretation: MATTRL’s 0.39 Hit@1 is like moving from a solid B to an A- in top-spot precision, and its 0.75 Hit@10 means more reliable shortlists under uncertainty.

  • Math (HLE Accuracy):
    • Single agent: 0.27
    • Multi-agent (no experience): 0.33 (+0.06)
    • MATTRL: 0.36 (+0.09 vs single)
    Interpretation: Debate helps, but curated experience adds another clear step-up—like moving from 27/100 to 36/100 on a famously tough exam.

  • Education (SuperGPQA Tutoring):
    • All start at Acc_pre = 0.44.
    • Single teacher: Acc_post 0.60 (āˆ†Acc 0.16)
    • Multi-teacher: Acc_post 0.73 (āˆ†Acc 0.29)
    • MATTRL: Acc_post 0.77 (āˆ†Acc 0.33)
    Interpretation: Collaboration nearly doubles learning gains over single-teacher; experience-augmented collaboration adds a meaningful extra push.

Surprising/Notable Findings

  • Credit Assignment Ablation (Medicine):
    • Naive: Hit@1 0.39, Hit@3 0.51, Hit@5 0.61, Hit@10 0.75
    • Difference: Hit@1 0.40, Hit@3 0.53, Hit@5 0.61, Hit@10 0.74
    • Shapley: Hit@1 0.35, Hit@3 0.49, Hit@5 0.59, Hit@10 0.75
    Takeaway: Difference Rewards sharpen top-1/3 precision by curbing free-riding; Shapley spreads credit (fair) but dilutes decisive moves unless heavily sampled.

  • Adaptive Router (Medicine), Hit@1/3/5/10:
    • Single-agent: 0.39 / 0.49 / 0.56 / 0.64
    • MATTRL: 0.39 / 0.51 / 0.61 / 0.75
    • Adaptive: 0.45 / 0.58 / 0.66 / 0.79
    Takeaway: Choosing between single- vs. multi-agent per case is best—some cases are clean enough for one agent; others benefit from team cross-checking.

  • Team Size Scaling: Three experts were the sweet spot for top-1 precision; bigger teams improved broad coverage (Hit@10) but could add noise for the top spot.

  • Few-shot vs Test-time Experience: Adding 3 few-shot examples to RareAgents gave tiny/negative changes (Hit@1 up a bit, others down), while MATTRL significantly improved across the board—evidence that structured, credited experience beats generic extra context.

Compute/Settings Notes

  • Backbone: GPT-5 for agents; 3 experts; max 3 rounds.
  • Experience construction: top 25% utterances from 30 cases.
  • Retrieval: Qwen3-Embedding-4B + FAISS; top-K similarity.

Bottom line: Across three very different domains, experience-conditioned collaboration consistently beats both single-agent and plain multi-agent setups.

05 Discussion & Limitations

Limitations

  • Extra compute and latency: Multi-agent, multi-round debates plus retrieval add cost. Tight budgets may need routing and early stopping.
  • Experience pool drift: Without curation, outdated or spurious tips can accumulate; deduplication and recency weighting are needed.
  • Judge dependence: Utterance scores come from an LLM judge; biased or noisy judging could miscredit experiences.
  • Domain coverage: Extremely novel domains may not have enough early experiences to help much.
  • Prompt sensitivity: Poorly formatted hints or overlong prompts may distract agents.

Required Resources

  • A capable LLM for specialists and the coordinator; an embedding model and FAISS index; storage for the experience pool; and orchestration code for rounds, retrieval, and scoring.

When NOT to Use

  • Simple, standardized tasks where a single pass works well (routing to single-agent may be best).
  • Ultra-low-latency settings (e.g., real-time controls) where multi-turn debate is too slow.
  • Highly regulated contexts without guardrails for what experiences may be stored or reused.

Open Questions

  • Lifecycle management: How to best prune, refresh, and rank experiences over time?
  • Better credit: Can we further stabilize/improve precision with hybrid credit schemes or learned evaluators?
  • Safety and privacy: How to prevent leakage of sensitive details while still keeping experiences useful?
  • Theory: What guarantees can we give about convergence and robustness when conditioning on retrieved experiences?
  • Adaptive control: How to tune team size, round budgets, and hint counts on the fly based on confidence?

06 Conclusion & Future Work

Three-Sentence Summary

  • MATTRL strengthens multi-agent reasoning by capturing high-value dialogue turns as compact textual experiences and retrieving them at test time—no weight updates needed.
  • It consistently outperforms strong single- and multi-agent baselines in medicine, math, and education, with Difference Rewards giving the best top-1 precision among credit schemes.
  • An adaptive router that chooses between single- and multi-agent further boosts performance, showing that matching collaboration style to case complexity matters.

Main Achievement

  • Turning turn-level, credit-assigned dialogue into a reusable, searchable experience pool that reliably improves multi-agent consensus and accuracy under distribution shift.

Future Directions

  • Smarter routing and dynamic budgets for speed/accuracy trade-offs; lifecycle management for the experience pool; safer, fairer, and more sample-efficient credit assignment; and theoretical analyses of stability.

Why Remember This

  • MATTRL’s key idea, "don’t retrain the brains; upgrade the conversation with trusted notes," offers a practical path to robust, adaptable reasoning that scales across domains while preserving general skills.

Practical Applications

  • Clinical decision support: Retrieve high-credit differential rules during multi-specialty consults to improve top-1 and shortlist accuracy.
  • STEM problem-solving assistants: Inject proven tactics (e.g., inequality cross-checks) into math/physics collaboration rounds.
  • AI tutoring systems: Reuse role- and topic-specific teaching experiences to design better questions and feedback.
  • Incident response triage: Multi-agent investigation with retrieved past postmortems to prioritize likely root causes.
  • Legal/contract review: Specialists (risk, compliance, finance) consult experiences for edge clauses and common pitfalls.
  • Data analysis pipelines: Collaborative agents recall prior anomaly patterns and validation steps to reduce false positives.
  • Product support bots: Retrieve high-impact troubleshooting steps learned from previous chats to shorten resolution time.
  • Scientific literature review: Agents reuse structured appraisal notes (quality checks, evidence weight) to rank hypotheses.
  • Business forecasting: Teams retrieve scenario-planning experiences to stress-test assumptions before final projections.
  • Education content creation: Use high-credit question-design experiences to craft assessments that target misconceptions.
#multi-agent systems#test-time reinforcement learning#experience retrieval#credit assignment#Difference Rewards#Shapley value#structured experience#retrieval-augmented generation#reasoning LLMs#consensus building#distribution shift#FAISS embeddings#medical diagnosis#tutoring LLMs
Version: 1