Scaling Multiagent Systems with Process Rewards
Key Summary
- This paper teaches AI teams to get better by scoring every move they make, not just the final answer.
- Their method, called MAPPA, uses a coach AI to rate each agent’s action on a 0–10 scale with context like tool outputs and error messages.
- This fixes the classic credit assignment problem by blaming or praising the right agent at the right step.
- Dense, per-action feedback makes learning far more sample-efficient, because one long attempt yields many training signals instead of just one.
- They fine-tune each agent’s weights separately, so agents can specialize without forgetting or interfering with each other.
- On hard math contests (AIME/AMC), accuracy jumps by +5.0 to +17.5 percentage points; on data-science pipelines, success rises by +16.7 points and quality improves by up to 47%.
- They use REINFORCE++ with global normalization so training stays stable even when each agent sees different inputs from upstream teammates.
- A stronger coach AI helps, but even weaker coaches can work thanks to extra context (tool logs) and the easier task of critiquing rather than creating.
- They also show quirks: the system can over-favor some task types (like regression), revealing how coach biases can shape what agents learn.
- Overall, MAPPA is a practical recipe for scaling multiagent systems on complex, tool-using tasks with minimal human supervision.
Why This Research Matters
In real jobs, work often fails at the handoffs—exactly where multiagent systems need the most help. MAPPA makes those handoffs visible and learnable by scoring each step with context, so teams improve where it counts. This cuts wasted compute and time, because one long attempt becomes many useful lessons. It also enables small, specialized agents to outperform a single big generalist for complex pipelines, saving cost at scale. By reducing the need for ground truth at every step, MAPPA unlocks training in domains where labels are scarce but logs and tool outputs are rich. As AI teams spread into analytics, research, and operations, process-aware coaching becomes the backbone of reliable, scalable systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine a school project where three classmates split the work: one brainstorms, one runs experiments, and one writes the final report. If the project fails, who deserves the blame? The brainstormer, the experimenter, or the writer?
🥬 The Concept (Multiagent Systems): Multiple AI agents work together like a team, each with a role (reasoning, coding, checking) to solve big, tricky problems. How it works:
- Give each agent a job (e.g., plan, compute, verify)
- Let them pass information and tools back and forth (like code execution)
- Combine their outputs to produce a final answer
Why it matters: One super-general agent can’t be best at everything. Teams allow specialization without mixing up skills.
🍞 Anchor: A math solver drafts an approach, a coder tests it with Python, and a verifier prints the final boxed answer.
🍞 Hook: You know how a coach watches every play in a game, not just the final score? That’s how good feedback works.
🥬 The Concept (Credit Assignment): Credit assignment means figuring out which agent’s action helped or hurt the outcome. How it works:
- Watch who did what and when
- Connect later problems to earlier causes (e.g., file-missing errors)
- Reward or penalize the right step and the right teammate
Why it matters: If you grade only the final score, the wrong player might get blamed, and the team won’t learn correctly.
🍞 Anchor: If the analyst can’t find a file, the coach blames the data engineer who forgot to save it, not the analyst who reported the error.
🍞 Hook: Think of getting a sticker every time you follow a good step in a recipe, not just when the cake is done.
🥬 The Concept (Sample Efficiency): Sample efficiency is learning a lot from a little—making every practice count. How it works:
- Turn one long attempt into many small lessons
- Score each step, not just the ending
- Use all those step-scores to update the agents
Why it matters: Complex tasks are expensive to run; you want maximal learning from each run.
🍞 Anchor: A single multiagent attempt can contain dozens of scored actions, giving dozens of learning signals instead of one.
The world before: Large language models could reason, use tools like code interpreters, and even debate. To make multiple agents collaborate, people mostly used clever prompts and roles. That works, but it doesn’t let agents truly specialize by changing their own weights. When teams were trained end-to-end, two roadblocks stopped progress: (1) credit assignment across agents and (2) low sample efficiency from sparse, end-only rewards.
What people tried (and why it fell short):
- Prompt engineering: Good for orchestration, but doesn’t change what agents actually know.
- Outcome-only RL: Gives one reward per full attempt—too sparse, wastes most of the trajectory.
- Verifier-heavy methods: Require ground truth labels or specialized verifiers for many steps, which aren’t always available.
- GRPO-style normalization: Normalizes rewards within groups of rollouts that share the same prompt and starting state, but multiagent pipelines produce wildly different intermediate states even for the same task.
The gap: A way to give frequent, fair, and context-aware feedback at every action, even when there’s no ground truth—plus a training recipe that stays stable when each agent sees different things.
Real stakes (why you should care):
- Data science pipelines fail at tiny handoff mistakes (like a missing file). Fixing the right step saves hours.
- Math solutions improve when reasoning, coding, and verifying each get tailored feedback.
- In real workflows (support, research, analytics), teams of AIs must be reliable, efficient, and fixable. Better feedback makes better teams.
02 Core Idea
🍞 Hook: You know how a good coach doesn’t wait until the end of the season to give feedback? They give tips after every play so the team improves fast.
🥬 The Concept (MAPPA – Per-Action Process Rewards from AI Feedback): MAPPA teaches AI teams by scoring each action on a 0–10 scale using a coach AI that sees roles, inputs, and tool results. How it works:
- Agents act step-by-step (e.g., plan, code, verify)
- A coach AI rates each action in context (role, inputs, tool logs)
- Each agent updates its own weights using those scores (reinforcement learning)
- Repeat to build specialized, non-interfering experts
Why it matters: Without per-action coaching, teams get only one end score. They learn slowly, assign credit poorly, and may punish the wrong agent.
🍞 Anchor: If a Python script errors with FileNotFound, the coach scores the earlier agent low for not saving the file, and the downstream agent isn’t unfairly penalized for noticing the error.
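To make the core loop concrete, here is a minimal sketch of a per-action coach call. The prompt wording, the `score_action` helper, and the choice of an OpenAI-style chat client and model are illustrative assumptions, not the paper's exact implementation; the key idea it shows is that the coach sees the agent's role, input, action, and tool output, and returns a 0–10 score.

```python
import json
from openai import OpenAI  # any chat-completion client would do; this choice is an assumption

client = OpenAI()

COACH_PROMPT = """You are coaching a multiagent team.
Agent role: {role}
Agent input: {agent_input}
Agent action: {action}
Tool output (stdout/stderr/files): {tool_output}

Rate how much this single action helped the team move toward solving the task.
Reply with JSON only: {{"score": <integer 0-10>, "reason": "<one sentence>"}}"""

def score_action(role: str, agent_input: str, action: str, tool_output: str) -> float:
    """Ask the coach model for a 0-10 process reward for one action (sketch)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical coach model
        messages=[{"role": "user", "content": COACH_PROMPT.format(
            role=role, agent_input=agent_input, action=action, tool_output=tool_output)}],
    )
    # Assumes the coach replies with bare JSON; real pipelines would parse more defensively.
    parsed = json.loads(response.choices[0].message.content)
    return float(parsed["score"])
```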
Three analogies for the same idea:
- Sports: Instead of grading only the final scoreboard, the coach reviews every pass, shot, and defensive move.
- Cooking: A taster checks flavor after each step—marinade, sauté, sauce—so mistakes don’t snowball.
- School project: The teacher gives quick notes on research, drafts, and slides, not just the final presentation.
Before vs. After:
- Before: One big reward at the end; fuzzy blame; slow, wasteful learning; agents can interfere with each other’s skills.
- After: Many small rewards; precise blame/praise; fast, rich learning; agents specialize with separate weights.
🍞 Hook: Picture a referee who doesn’t just judge the final winner but also explains why each call happened.
🥬 The Concept (Coach vs. Judge): A judge checks if the final answer is correct; a coach gives context-aware scores for each action on the way there. How it works:
- The coach reads the agent’s role, inputs, and tool outputs
- It decides if this specific action helped the team move forward
- It assigns a 0–10 score and optional correctness if ground truth is available
Why it matters: Judges are great for the end, but coaches are needed to improve the process itself.
🍞 Anchor: On DSBench, the coach uses metrics (like RMSE or F1), file lists, and error logs to fairly score the Data Engineer, Modeler, or Analyst.
🍞 Hook: Imagine two runners training on different tracks—you can’t compare them by “distance from the same starting line.”
🥬 The Concept (REINFORCE++ with Global Normalization): A stable way to update agents’ policies even when each sees different inputs from upstream teammates. How it works:
- Compute a learning signal (advantage) per action from the coach’s scores
- Normalize advantages across the whole batch (not per prompt), so diverse states are fine
- Add a gentle “stay-close” penalty so the agent doesn’t change too fast
- Update policies with clipped steps for stability
Why it matters: Methods that assume identical states per prompt can misfire in pipelines where states vary.
🍞 Anchor: Two runs of the same math problem may produce different drafts; global normalization keeps learning stable across those differences.
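A minimal sketch of the difference between per-prompt (group-wise) normalization and the batch-wide normalization described above. The function names are mine; the point is simply that all per-action advantages, from all agents and all tasks, are standardized together.

```python
import numpy as np

def normalize_per_group(advantages, group_ids):
    """GRPO-style: standardize within each group of rollouts sharing a prompt.
    Breaks down when "same prompt" no longer means "same state", as in pipelines."""
    advantages = np.asarray(advantages, dtype=np.float64)
    out = np.empty_like(advantages)
    for g in set(group_ids):
        mask = np.array([gid == g for gid in group_ids])
        group = advantages[mask]
        out[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return out

def normalize_globally(advantages):
    """Batch-wide standardization: one mean/std over every action
    from every agent and every task in the batch."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)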
Building blocks of MAPPA:
- Agent topology: Who does what, in what order (e.g., Solver → Coder → Verifier)
- Process rewards: 0–10 per-action coaching, with or without ground truth
- Separate weights: Each agent fine-tunes independently to avoid interference
- Training loop: REINFORCE++ with global normalization and a small “don’t drift too far” penalty
- Tool-augmented context: Coach sees code output, errors, and file artifacts, enabling causal credit assignment
Why it works (intuition, no equations):
- More signals: Dozens of small lessons beat one big lesson
- Better causality: Tool logs show who caused what
- Stable updates: Batch-wide normalization tames noisy scores
- Specialization: Independent weights let skills grow without stepping on each other
03 Methodology
At a high level: Task → Agents take turns with tools → Coach scores each action → Build training examples → Update each agent’s policy → Repeat.
Step 1: Rollout in a tool-using world
- What happens: Agents act in a set order (like Problem Solver → Code Executor → Verifier). When an agent writes code, it runs in a sandbox; the agent gets stdout/stderr, file lists, and error messages.
- Why this step exists: Real tasks depend on tools and files. Without tool feedback, you can’t fairly judge who helped or hurt.
- Example: The Code Executor runs Python to check a math step; the sandbox returns a traceback on a bug.
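A toy sketch of the kind of tool feedback an agent (and later the coach) might receive from a sandboxed code run: stdout, stderr, and the files left in the workspace. Using `subprocess` in a temporary directory is my simplification; real systems would use a hardened sandbox.

```python
import pathlib
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: int = 30) -> dict:
    """Execute agent-written Python in an isolated workspace and return
    the observations the coach later sees (simplified illustration)."""
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="agent_ws_"))
    script = workdir / "action.py"
    script.write_text(code)
    try:
        proc = subprocess.run(
            [sys.executable, str(script)], cwd=workdir,
            capture_output=True, text=True, timeout=timeout,
        )
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        stdout, stderr = "", f"TimeoutExpired after {timeout}s"
    return {
        "stdout": stdout,
        "stderr": stderr,                                     # e.g. a traceback on a bug
        "files": sorted(p.name for p in workdir.iterdir()),   # artifacts like model.pkl
    }
```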
🍞 Hook: You know how a teacher can give better advice when they see your scratch work, not just your final answer?
🥬 The Concept (Process Rewards): Scores for each action, not just the end result. How it works:
- After every agent action, ask a coach AI to rate it 0–10
- The coach reads the role, inputs, action text, and any tool outputs
- The coach can use ground truth if available (often only for the final step)
Why it matters: Without step-by-step scoring, you waste most of the attempt and mis-assign blame.
🍞 Anchor: In DSBench, if submission.csv is missing, the coach checks if X_test.pkl or model.pkl existed to blame the right upstream agent.
Step 2: Coach evaluation per action
- What happens: For each action, collect (agent_id, input, action, tool observations, score). If ground truth applies (e.g., final predictions), the coach folds in metrics like RMSE or F1—but still gives a process score, not just correctness.
- Why this step exists: Dense feedback turns one trajectory into many training signals; causal checks prevent unfair penalties downstream.
- Example: Analyst tries to predict but finds no model.pkl; the coach scores the Modeler low, not the Analyst.
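A sketch of how one rollout could be turned into per-action records, reusing a coach call like the `score_action` sketch above (passed in as an argument). The field names and the optional ground-truth hook are illustrative assumptions about the record format, not the paper's exact schema.

```python
def collect_experiences(trajectory, score_action, final_metric=None):
    """Turn one multiagent rollout into one coach-scored record per action.

    `trajectory` is assumed to be a list of dicts with keys:
    agent_id, role, agent_input, action, tool_output.
    `final_metric` (e.g. RMSE or F1 against held-out labels) is folded into the
    last step's context when ground truth exists (illustrative choice)."""
    records = []
    for i, step in enumerate(trajectory):
        tool_output = step["tool_output"]
        if final_metric is not None and i == len(trajectory) - 1:
            tool_output += f"\nGround-truth metric: {final_metric}"
        reward = score_action(step["role"], step["agent_input"],
                              step["action"], tool_output)
        records.append({
            "agent_id": step["agent_id"],     # which specialist gets this lesson
            "observation": step["agent_input"],
            "action": step["action"],
            "reward": reward,                 # 0-10 process score from the coach
        })
    return records
```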
🍞 Hook: Think of seatbelts—they keep you from veering too far when you swerve.
🥬 The Concept (“Stay-close” Penalty): A gentle penalty that discourages the policy from changing too quickly in one update. How it works:
- Compare the new policy to a reference snapshot
- Add a small cost for drifting too far
- Keep updates within safe bounds (clipping)
Why it matters: Prevents wild swings when coach scores are noisy, especially across many agents.
🍞 Anchor: After a few coached steps, the Code Executor doesn’t suddenly stop using code—it adjusts gradually toward better code usage.
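A minimal sketch of the "stay-close" idea: penalize per-token divergence from a frozen reference policy. Folding a simple log-ratio KL estimate into the reward is one common choice in RLHF-style recipes; treat the exact form and coefficient here as assumptions.

```python
import torch

def reward_with_kl_penalty(token_rewards, logprobs_new, logprobs_ref, kl_coef=0.01):
    """Subtract a small KL-style penalty so the policy stays near a reference snapshot.

    token_rewards: (T,) per-token rewards (e.g. the coach score placed on the final token)
    logprobs_new:  (T,) log-probs of the taken tokens under the current policy
    logprobs_ref:  (T,) log-probs of the same tokens under the frozen reference policy
    """
    kl_estimate = logprobs_new - logprobs_ref        # simple per-token KL estimator
    return token_rewards - kl_coef * kl_estimate     # gentle "don't drift too far" cost
```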
Step 3: Build training examples (experiences)
- What happens: Turn each action into a tuple: (agent_id, what it saw, what it did, the reward). Compute a return per action that also reflects later feedback so earlier helpful moves get credit.
- Why this step exists: Each action needs its own learning target; earlier smart choices should share in later success.
- Example: The Problem Solver’s clear plan earns credit when the final answer is correct.
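A small sketch of computing a per-action return so earlier actions share in later feedback. Discounted accumulation over subsequent coach scores is my assumption about how "reflects later feedback" could be implemented.

```python
def per_action_returns(rewards, gamma=0.9):
    """Discounted return for each action: its own coach score plus
    (discounted) scores of everything that happened afterwards.

    rewards: per-action coach scores for one trajectory, in order.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a clear plan (first action) gets credit when later steps score well.
# per_action_returns([6, 8, 9], gamma=0.9) -> approximately [20.5, 16.1, 9.0]
```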
🍞 Hook: Imagine runners on different tracks—you compare improvement across the whole team, not only within one lane.
🥬 The Concept (Global Advantage Normalization): Standardize learning signals across all actions in the batch, not per prompt. How it works:
- Aggregate advantages from all agents and tasks
- Normalize once globally to reduce variance
- Use these cleaned signals to update policies
Why it matters: Pipelines create diverse states; per-prompt normalization can break when inputs differ.
🍞 Anchor: Two math runs with different reasoning chains still contribute fairly to the same training update.
Step 4: Policy updates with REINFORCE++
- What happens: Each agent’s model updates its weights using the coach scores, the global-normalized advantages, the stay-close penalty, and safe, clipped steps.
- Why this step exists: It translates feedback into skill growth for each specialist without mixing up their roles.
- Example: Verifier learns to be concise and precise; Code Executor learns when to call tools and how much to print.
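A condensed sketch of a clipped policy-gradient loss on globally normalized advantages, in the spirit of the REINFORCE++-style update described above. The tensor shapes and clipping constant are assumptions; each agent would run this only on its own records.

```python
import torch

def policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss on per-token (or per-action) advantages (sketch).

    logprobs_new: log-probs under the policy being updated
    logprobs_old: log-probs recorded when the actions were sampled
    advantages:   globally normalized, KL-penalized advantages
    """
    ratio = torch.exp(logprobs_new - logprobs_old)                  # how far the policy moved
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # minimize the negative surrogate
```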
Step 5: Distributed, on-policy training loop
- What happens: Many workers run tasks in parallel, overlap coach scoring with rollouts, and then synchronize updates. After each update, new rollouts use the latest policies.
- Why this step exists: On-policy RL needs fresh data every time; multiagent runs are slow, so we pipeline everything to use GPUs efficiently.
- Example: Prompts are split across workers; updated weights are broadcast back to inference engines so the next round uses the improved agents.
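A high-level skeleton of this on-policy loop. Every helper here is a placeholder injected as an argument (rollout workers, coach, weight broadcast), and `collect_experiences` / `normalize_globally` refer to the sketches above; none of this is the paper's actual API.

```python
import random

def training_loop(agents, tasks, coach, rollout_fn, broadcast_fn,
                  n_iterations=10, batch_size=32):
    """On-policy skeleton: fresh rollouts -> coach scores -> per-agent updates.

    rollout_fn(agents, batch) -> list of trajectories (run by parallel workers)
    broadcast_fn(agents)      -> pushes updated weights to the inference engines
    """
    for _ in range(n_iterations):
        batch = random.sample(tasks, batch_size)

        # 1. Rollouts with the latest policies; coach scoring overlaps with generation.
        trajectories = rollout_fn(agents, batch)

        # 2. Per-action process rewards, with tool logs as context.
        records = [r for traj in trajectories
                   for r in collect_experiences(traj, coach.score_action)]

        # 3. Globally normalized advantages, then one update per specialist agent.
        advantages = normalize_globally([r["reward"] for r in records])
        for agent in agents:
            agent_rows = [(r, a) for r, a in zip(records, advantages)
                          if r["agent_id"] == agent.agent_id]
            agent.update(agent_rows)          # each agent fine-tunes its own weights

        # 4. Broadcast updated weights so the next rollouts stay on-policy.
        broadcast_fn(agents)
```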
The Secret Sauce (what makes it clever):
- Per-action, context-aware coaching assigns blame/praise to the right teammate
- Dense signals squeeze maximum learning from expensive rollouts
- Separate weights unlock safe specialization (no catastrophic forgetting across roles)
- Global normalization steadies learning despite state diversity
- Tool logs (stdout/stderr/files) give the coach causal clues no agent alone could infer
04 Experiments & Results
The tests: They trained and evaluated on two very different multiagent systems.
- MathChat (competition math): Three agents—Problem Solver, Code Executor with Python, Verifier. Evaluated on held-out AIME and AMC problems.
- DSBench (data science pipelines): Three agents—Data Engineer, Modeler, Analyst—must pass files and models correctly to produce predictions.
Why these: MathChat stresses reasoning with tools; DSBench mirrors real-world pipelines where one person’s mistake can block the next.
The competition: Baselines used the same agents without MAPPA’s per-action coaching. In DSBench, success means the full pipeline produced valid predictions; quality means metrics like Accuracy, F1, MAE, RMSE (including “fair” versions that penalize failures).
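The exact definition of the "fair" metrics isn't spelled out here; one plausible reading (an assumption on my part) is that failed pipelines count as worst-case scores instead of being dropped, so quality and reliability are measured together.

```python
def fair_accuracy(results):
    """Average accuracy over ALL tasks, counting failed pipelines as 0.

    results: list of (succeeded: bool, accuracy: float or None) per task.
    Illustrative assumption about how a failure-penalizing ("fair") metric works.
    """
    if not results:
        return 0.0
    return sum(acc if ok else 0.0 for ok, acc in results) / len(results)

# Example: three tasks, one pipeline break ->
# fair_accuracy([(True, 0.9), (True, 0.8), (False, None)]) ≈ 0.567
```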
The scoreboard (with context):
- MathChat improvements: +5.0–17.5 percentage points across AIME and AMC, depending on the base model size. That’s like moving from a B- to a solid A on tough contests.
- DSBench success rate: +16.7 percentage points overall at the best checkpoint—more complete, working pipelines.
- DSBench quality: Up to 47% better on fair metrics (which account for both correctness and reliability). That’s like not only scoring higher on the tests you finish, but also finishing more tests.
Behavioral changes: The bigger Qwen3-4B model learned to use tool calls more effectively while also becoming more concise. The smaller 1.5B model improved accuracy without big behavior shifts—showing that process rewards can help even when capacity limits behavior changes.
Partial information test: Even when agents saw only the immediately previous agent’s output (less context), MAPPA still boosted accuracy (+3.9 to +5.8 points). This shows robustness when visibility is limited.
Surprising findings: On DSBench, as training went on, the system started to specialize in regression tasks—regression quality kept improving, while classification gains faded back toward baseline. The analysis traced this to the coach giving slightly higher scores to regression steps, revealing how coach biases can shape what agents focus on. Fair metrics helped pick the best checkpoint before that imbalance grew too large.
Takeaway: MAPPA’s dense, per-action coaching turned one long attempt into many learning nudges, improved end-to-end reliability, and delivered large gains on both math reasoning and real data pipelines.
05 Discussion & Limitations
Limitations (be specific):
- Coach bias: If the coach subtly prefers some task types (e.g., regression) or styles (verbosity), the team may drift toward those, even if that’s not desired.
- Small eval sets: Benchmarks like AIME/AMC and DSBench are limited in size, so variance can be higher; multiple seeds would strengthen claims.
- Stateless coaching: The coach scores each step without remembering training history, so it can’t detect or correct its own scoring imbalances over time.
- Reward hacking risk: Agents might learn to please the coach’s scoring style without truly improving final outcomes if not monitored.
Required resources:
- Multiple GPUs for parallel rollouts and on-policy updates (they used 8Ă—H100s)
- A coach model (stronger is better for lower noise, but weaker can still work with tool logs)
- A sandboxed tool environment (e.g., Python execution, file workspace) so the coach can see causal signals
When not to use:
- Short, single-turn tasks where a final correctness check is cheap and sufficient
- Settings with no tool or environment feedback (harder for the coach to assign fair credit)
- Extremely resource-limited environments where on-policy RL and distributed rollouts aren’t feasible
Open questions:
- Trainable coach: Can the coach learn from outcomes, stronger models, or humans to reduce bias over time?
- Outcome-aware reward backpropagation: Can we decompose final results backward to the exact steps that mattered most?
- Richer feedback: Beyond a single score, can the coach provide corrected actions for supervised fine-tuning or preference learning?
- Stability and scale: How do we maintain balance across many agents and many task types as systems grow to dozens of specialists?
06 Conclusion & Future Work
Three-sentence summary: MAPPA fine-tunes multiagent systems by giving each action a context-aware 0–10 score from a coach AI, fixing credit assignment and boosting sample efficiency. With dense per-action rewards and stable updates, agents specialize safely and learn from every step—even when the final attempt fails. The approach delivers sizable gains in math reasoning and real data pipelines without heavy reliance on ground truth.
Main achievement: Turning one expensive multiagent attempt into many precise lessons by coaching every action, so the right teammate learns the right thing at the right time.
Future directions: Make the coach adaptive and trainable, decompose outcomes backward to key steps, combine scalar scores with corrected demonstrations, and scale to larger, more diverse teams. Monitoring behavioral metrics (like tool use and response length) will be crucial to catch bias or reward hacking early.
Why remember this: MAPPA shows a practical path to grow AI teams through feedback-rich, end-to-end training—moving beyond prompt engineering to true specialization—so they can tackle long, complex, tool-heavy tasks with minimal human supervision.
Practical Applications
- Train a multiagent data science team to reliably deliver submission files on Kaggle-style tasks with fewer pipeline breaks.
- Improve math-solving pipelines by coaching the reasoning, coding, and verifying steps separately.
- Harden code-generation workflows where one agent writes code, another tests it, and a third deploys it.
- Build analytics report teams: data cleaner, chart maker, and narrative writer, each coached on their own outputs.
- Operate ETL pipelines with agents that extract, transform, validate, and publish data, with precise blame for failures.
- Run research assistants that draft hypotheses, design experiments (code), and validate results with per-step coaching.
- Augment customer support triage: one agent classifies, another searches knowledge bases, another crafts the final response.
- Automate QA for software: spec reader, test generator, and bug triager, scored on each action from logs and results.
- Teach small on-device agents to collaborate (e.g., summarize, translate, verify) with minimal labeling by using tool feedback.
- Monitor and fine-tune enterprise agent workflows using fair metrics and behavioral dashboards (tool-use rates, response lengths).