AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
Key Summary
- AgentArk teaches one language model to think like a whole team of models that debate, so it can solve tough problems quickly without running a long, expensive debate at answer time.
- It does this by shifting the heavy work from inference time to training time, distilling the team's reasoning steps into the single model's weights.
- AgentArk uses three layers of learning: Reasoning-Enhanced Fine-Tuning (learn final answers plus reasoning), Trajectory-Based Data Augmentation (learn multiple correct paths), and Process-Aware Distillation (learn step-by-step self-checking with a reward model).
- Across many datasets and models, the distilled single model improves by about 4.8% on average and closely approaches multi-agent performance while keeping single-model speed.
- The Process-Aware Distillation (PAD) method is the most reliable, consistently improving step decomposition, self-checking, and error correction, not just final accuracy.
- High-quality supervision matters more than just more data; a strong Process Reward Model (PRM) helps even small students, and adding too many noisy trajectories can hurt.
- Scaling up teacher agents helps only if the student is big enough to learn from them; small students quickly saturate.
- The distilled skills transfer to out-of-domain tasks and even to multimodal models, showing better generalization and robustness than simple fine-tuning.
- AgentArk reduces inference-time cost and latency, making on-device and real-time reasoning more practical.
- Care is needed to avoid inheriting bad habits from teachers; the paper recommends correctness checks and audits for safety.
Why This Research Matters
AgentArk brings team-like reasoning power to a single model, which makes smart apps faster and cheaper to run. This is crucial for on-device assistants, classroom tools, and customer support bots that need quick, reliable answers. By teaching the process (how to think), not just the result, models become better at breaking problems down, checking their work, and fixing mistakes. The approach also transfers across tasks and even into multimodal setups, so improvements aren't locked to one dataset. Finally, reducing inference-time coordination shrinks latency and carbon costs, making advanced reasoning more accessible. With proper safety checks, this can raise the floor of everyday AI reliability. It's a practical path to smarter AI that fits in real-world budgets and timelines.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a group of friends can solve a hard puzzle by talking it through? Each person spots different clues, and together they fix mistakes. Teams are great, but they can be slow and noisy.
🥬 Filling (The Actual Concept):
- What it is: Before this paper, many cutting-edge AI systems used Multi-Agent Systems (MAS), where several language models talk, critique, and reach a better answer together.
- How it worked: 1) Multiple AI agents read the same problem. 2) They propose ideas. 3) They argue (politely!) and point out mistakes. 4) They refine answers over several rounds. 5) They pick a final answer by consensus.
- Why it mattered: MAS often outperformed single AIs on tricky reasoning tasks, like multi-step math or multi-hop questions.
🍞 Bottom Bread (Anchor): Just like a debate team that keeps improving its argument each round, MAS made AI answers better by catching errors and refining steps.
The World Before: Single language models could solve many questions but often stumbled on problems needing careful, multi-step logic, like long math chains, multi-hop reading, or medical exam-style reasoning. Multi-agent debate fixed this by adding diversity (many viewpoints) and iteration (multiple passes), which boosted correctness and reasoning clarity.
The Problem: MAS is powerful but expensive and risky at inference time. Expensive because many agents talking for many rounds takes lots of time and compute. Risky because mistakes can spread; if one agent is confidently wrong, others may echo it, amplifying errors.
Failed Attempts: Early distillation ideas tried to teach a single model by copying only final answers or shallow traces. That helped a bit but lost the heart of MAS: the step-by-step conflict, critique, and correction that actually produces strong reasoning. Imitating just the last answer is like studying only the final page of a math solution; you miss the method.
The Gap: We needed a way for a single model to internalize the debate's thinking process, not just its result. That means learning to generate, check, and fix its own steps in one go, like a skilled student who has practiced not just answers but how to think.
Real Stakes: Why care? Because long debates make apps laggy and pricey. Think on-device assistants, classroom tools, medical triage helpers, or real-time customer chatbots: they need fast, reliable reasoning. Also, fewer moving parts at inference time means fewer ways for errors to spread. If we can pack team-style thinking into one model, we get near-team performance with single-model speed.
So this paper asks a bold question: Can we make one model act like a whole team (thinking in multiple ways, critiquing itself, and fixing mistakes) without actually running a debate at answer time?
02 Core Idea
🍞 Top Bread (Hook): Imagine practicing with a whole orchestra for weeks and then walking on stage alone, yet sounding like the full group because you learned everyone's parts. That's the dream: one performer with the wisdom of many.
🥬 Filling (The Actual Concept):
- What it is (one sentence): AgentArk distills the debate behaviors of multiple AI agents into a single model's weights so the model can self-debate internally in one pass.
- How it works: 1) Generate rich debate logs from multiple agents. 2) Extract high-quality, corrective reasoning paths. 3) Train the single model with three layers: Reasoning-Enhanced Fine-Tuning (R-SFT), Trajectory-Based Data Augmentation (DA), and Process-Aware Distillation (PAD) with a Process Reward Model (PRM) and GRPO.
- Why it matters: This shifts compute from inference time to training time, giving single-model speed with multi-agent-style reasoning and fewer chances for error amplification.
🍞 Bottom Bread (Anchor): Like learning chess by studying grandmaster debates, then playing alone but thinking through the same moves and counter-moves in your head.
Multiple Analogies (three ways):
- Team-to-Solo Sports: A basketball player practices team drills (MAS) and later plays streetball alone, still making team-smart moves because those patterns are internalized.
- Study Group to Exam: You prep with friends who challenge your steps (debate). In the exam hall, you're alone but can mentally replay critiques and fix mistakes (distilled reasoning).
- Map App Offline: You download the maps ahead of time (training). Later you navigate quickly without loading tiles (inference), because the knowledge is already on your device.
Before vs After:
- Before: Strong reasoning needed live debates: slow, costly, and sometimes unstable.
- After: A single model emulates debate logic internally: fast, cheaper, and more robust.
Why It Works (intuition, no equations):
- The win from multi-agent debate isn't the fancy structure; it's the behavior: proposing diverse ideas, catching errors, and revising. If we expose the student model to many examples of these behaviors (especially corrections) and then reward it for good intermediate steps, it learns to stage a tiny, efficient debate inside itself.
Building Blocks (with concept sandwiches):
🍞 You know how big group projects (MAS) split work and then combine ideas? 🥬 MAS (Multi-Agent Systems):
- What it is: Several AIs interact (debate, critique, and agree) to solve hard problems.
- How it works: 1) Many agents propose solutions. 2) They read each other's steps. 3) They point out mistakes. 4) They revise and converge.
- Why it matters: Without it, tough multi-step tasks often fail because one pass misses errors. 🍞 Anchor: Like a science fair team where each member tests a different hypothesis, and the team merges the best parts.
🍞 Imagine drawing a path from your house to school. 🥬 Reasoning Trajectories:
- What it is: The step-by-step path a model takes from question to answer.
- How it works: 1) Break the problem into steps. 2) Write each step's logic. 3) Reach an answer.
- Why it matters: Without the path, you can't see or fix where you went wrong. 🍞 Anchor: A math solution that shows each calculation, not just the final number. (A minimal data-structure sketch follows.)
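To make trajectories concrete, here is a minimal data-structure sketch; the `ReasoningStep` and `Trajectory` names are illustrative choices of mine, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningStep:
    """One intermediate step on the path from question to answer."""
    text: str                    # the written logic, e.g. "2 bags of 4 apples = 8 apples"
    is_correction: bool = False  # True if this step revises an earlier mistake

@dataclass
class Trajectory:
    """A full reasoning path: question -> ordered steps -> final answer."""
    question: str
    steps: List[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""

# A tiny trajectory with every calculation written out, not just the final number.
traj = Trajectory(
    question="Sarah has 3 apples and buys 2 bags of 4. How many apples does she have now?",
    steps=[
        ReasoningStep("2 bags of 4 apples = 8 apples"),
        ReasoningStep("3 + 8 = 11 apples in total"),
    ],
    final_answer="11",
)
```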
🍞 Think of a teacher marking answers right or wrong. 🥬 Supervised Learning:
- What it is: Training by showing inputs and desired outputs (and sometimes the reasoning).
- How it works: 1) Show examples. 2) Model guesses. 3) Compare to correct. 4) Adjust to be closer next time.
- Why it matters: Without supervision, the model doesn't know what to imitate. 🍞 Anchor: Practicing math worksheets with an answer key.
🍞 Picture practicing not just answers but full worked solutions. 🥬 Reasoning-Enhanced Fine-Tuning (R-SFT):
- What it is: Fine-tuning on both reasoning steps and final answers.
- How it works: 1) Feed the debates' correct traces. 2) Train to produce those steps. 3) Train to produce the final answer linked to those steps.
- Why it matters: Without the steps, models overfit to shortcuts and generalize poorly. 🍞 Anchor: Learning long division by writing every intermediate subtraction, not only the quotient. (See the training-example sketch below.)
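Here is a minimal sketch of what an R-SFT training example could look like, assuming a standard prompt/target format for causal-LM fine-tuning; the template below is illustrative, not the paper's exact one.

```python
def build_rsft_example(question, steps, final_answer):
    """Format one R-SFT example: the target contains the reasoning AND the answer,
    so the student is trained to produce both, not just the final number."""
    prompt = f"Question: {question}\nAnswer with step-by-step reasoning.\n"
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    target = f"{reasoning}\nFinal: {final_answer}"
    return {"prompt": prompt, "target": target}

example = build_rsft_example(
    question="A pencil costs 3 dollars and a notebook costs 4. What do 2 pencils and 1 notebook cost?",
    steps=["2 pencils cost 2 * 3 = 6 dollars", "Add the notebook: 6 + 4 = 10 dollars"],
    final_answer="10",
)
print(example["prompt"] + example["target"])
# In standard supervised fine-tuning, the cross-entropy loss would cover the target
# tokens (reasoning + answer), not only the final answer.
```

The key point is that the loss covers the reasoning tokens as well as the answer, so the model is rewarded for producing the method, not only the result.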
🍞 Imagine learning different ways to solve the same puzzle. 🥬 Trajectory-Based Data Augmentation (DA):
- What it is: Train on multiple, diverse correct reasoning paths to the same answer.
- How it works: 1) Find varied correct traces. 2) Filter for diversity. 3) Train the model on all of them.
- Why it matters: Without variety, the model may break when the usual path doesn't fit. 🍞 Anchor: Solving 24 as 12+12, 8×3, or 25−1, so you're flexible. (See the augmentation sketch below.)
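A small sketch of how diverse correct paths could be collected for one problem, assuming the debate logs give several (reasoning, answer) candidates; the token-overlap deduplication heuristic is my simplification, not the paper's selection rule.

```python
def augment_with_diverse_paths(question, candidate_paths, gold_answer,
                               max_paths=3, overlap_threshold=0.8):
    """Keep up to `max_paths` correct, mutually distinct reasoning paths for one question.

    candidate_paths: list of (reasoning_text, predicted_answer) tuples from debate logs.
    """
    def token_overlap(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    kept = []
    for reasoning, answer in candidate_paths:
        if answer.strip() != gold_answer.strip():
            continue  # correctness-first: discard paths with wrong final answers
        if any(token_overlap(reasoning, prev) > overlap_threshold for prev in kept):
            continue  # too similar to an already-kept path; keep the set diverse
        kept.append(reasoning)
        if len(kept) == max_paths:
            break
    return [{"question": question, "target": r + f"\nFinal: {gold_answer}"} for r in kept]
```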
🍞 Think of a coach who scores each drill step, not just the final game result. 🥬 Reinforcement Learning:
- What it is: Learning by trying actions and getting rewards or penalties.
- How it works: 1) Try steps. 2) Get feedback. 3) Prefer better steps over time.
- Why it matters: Without step feedback, the model can't learn good intermediate habits. 🍞 Anchor: Practicing piano and getting a thumbs-up for correct fingerings in each bar, not only the final song.
🍞 Imagine practicing how to catch and fix your own mistakes. 🥬 Process-Aware Distillation (PAD):
- What it is: Teach the model to value good intermediate steps using a learned step-level reward.
- How it works: 1) Train a Process Reward Model (PRM) that scores reasoning steps. 2) Use those scores to guide the student to produce better steps. 3) Optimize the policy to prefer high-scoring reasoning.
- Why it matters: Without process awareness, you improve answers only a little; with it, you improve thinking. 🍞 Anchor: A math tutor who praises each correct sub-step and helps you spot exactly where a slip happened.
🍞 Picture a referee grading every move in a routine. 🥬 Process Reward Model (PRM):
- What it is: A model that scores whether each reasoning step is likely correct.
- How it works: 1) Learn from debate traces which steps align with good solutions. 2) Score steps relatively (better vs worse). 3) Provide fine-grained feedback.
- Why it matters: Without a PRM, feedback is too blunt (just right/wrong at the end), and learning is shaky. 🍞 Anchor: A dance coach scoring each spin and jump, not just the final pose. (See the toy scoring sketch below.)
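To show the shape of the interface, here is a toy stand-in for a PRM: it returns one score in [0, 1] per step. A real PRM would be a trained neural model; the arithmetic-checking heuristic below is only a runnable placeholder of my own.

```python
import re

def toy_prm_score_steps(steps):
    """Return one score per reasoning step (higher = more likely correct).

    This placeholder only checks simple 'a + b = c' style arithmetic inside a step;
    a real Process Reward Model would be learned from debate traces.
    """
    scores = []
    for step in steps:
        match = re.search(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)", step)
        if match is None:
            scores.append(0.5)  # no checkable arithmetic: neutral score
            continue
        a, op, b, claimed = int(match[1]), match[2], int(match[3]), int(match[4])
        actual = {"+": a + b, "-": a - b, "*": a * b}[op]
        scores.append(1.0 if actual == claimed else 0.0)
    return scores

print(toy_prm_score_steps(["2 * 3 = 6 dollars", "6 + 4 = 11 dollars"]))  # [1.0, 0.0]
```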
🍞 Imagine choosing the best of several attempts by comparing them side-by-side. 🥬 Group Relative Policy Optimization (GRPO):
- What it is: A way to improve the model by comparing a small group of its own outputs and nudging it toward the better ones.
- How it works: 1) Sample a few answers. 2) Score them with the PRM. 3) Push the policy toward higher-scored ones while keeping it stable.
- Why it matters: Without relative comparison, training can be unstable or slow. 🍞 Anchor: Trying three word problems, keeping notes on which solution path scored best, and practicing that pattern. (See the advantage sketch below.)
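A minimal sketch of the group-relative part of GRPO: sample a small group of candidate solutions, score each, and convert the scores into standardized advantages that would weight the policy-gradient update. The full GRPO objective also includes a clipped probability ratio and a KL penalty, which are omitted here.

```python
def group_relative_advantages(rewards):
    """Turn raw per-sample rewards into advantages relative to the group.

    Each sampled output's advantage is its reward minus the group mean, divided by
    the group standard deviation, so 'better than my siblings' gets pushed up and
    'worse than my siblings' gets pushed down.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: three sampled solutions to the same problem, each scored by a PRM
# (e.g., the mean of its per-step scores).
rewards = [0.9, 0.4, 0.6]
print(group_relative_advantages(rewards))
# The highest-scoring sample gets a positive advantage; the policy is nudged
# toward producing reasoning like it, while staying close to its previous behavior.
```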
03 Methodology
High-Level Recipe: Input → Multi-Agent Debate (data generation) → Knowledge Extraction (select corrective, diverse traces) → Three Distillation Layers (R-SFT, DA, PAD) → Single Fast Reasoning Model
Step 1: Data Generation through Multi-Agent Debate
- What happens: A small team (e.g., 5 agents) solves each problem for several rounds, reading each other's steps, pointing out errors, and revising. This yields multiple reasoning trajectories and final answers (see the debate-loop sketch after this step).
- Why this step exists: We need the raw "thinking behaviors" (diverse ideas, critiques, and corrections) that make MAS strong. Without it, there's nothing rich to distill.
- Example: On a GSM8K math word problem, one agent uses arithmetic decomposition, another uses algebra, and they converge after catching a miscalculation.
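A minimal sketch of a debate-style data generation loop, assuming a generic `propose()` / `revise()` agent interface; these method names, the stub agent, and the round structure are illustrative, not the paper's exact protocol.

```python
def run_debate(agents, question, num_rounds=3):
    """Collect reasoning trajectories from a small team of agents.

    Each agent proposes a solution, then repeatedly revises it after reading the
    other agents' latest attempts. Every intermediate version is logged, because
    the corrections are exactly what we want to distill later.
    """
    logs = {i: [] for i in range(len(agents))}

    # Round 0: independent proposals.
    current = [agent.propose(question) for agent in agents]
    for i, solution in enumerate(current):
        logs[i].append(solution)

    # Later rounds: each agent reads its peers and revises its own solution.
    for _ in range(num_rounds - 1):
        revised = []
        for i, agent in enumerate(agents):
            peers = [s for j, s in enumerate(current) if j != i]
            revised.append(agent.revise(question, own=current[i], peers=peers))
        current = revised
        for i, solution in enumerate(current):
            logs[i].append(solution)
    return logs  # agent_id -> one solution per round


class StubAgent:
    """Trivial stand-in so the sketch runs; a real agent would call an LLM."""
    def __init__(self, name):
        self.name = name
    def propose(self, question):
        return f"{self.name} draft: working on '{question}'"
    def revise(self, question, own, peers):
        return own + " (revised after reading peers)"


logs = run_debate([StubAgent("A"), StubAgent("B"), StubAgent("C")], "What is 17 * 6?")
```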
Step 2: Knowledge Extraction (Correctness-First)
- What happens: From the debate logs, keep only the final correct answers and the reasoning paths that lead to them, especially those that pivot from a wrong step to a correct one after critique (see the filtering sketch after this step).
- Why this step exists: Corrective traces teach the model how to fix itself. Without filtering, noisy or wrong trajectories could confuse the student.
- Example: Keep the trace where an agent first misreads a unit, then corrects it when peers point it out.
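A sketch of the correctness-first filter, assuming each logged trajectory records its per-round answers; the "corrective" flag below (wrong earlier, right at the end) is a simple proxy of mine for traces that pivot after critique.

```python
def extract_training_traces(debate_logs, gold_answer, answer_of):
    """Keep trajectories whose final answer is correct, and flag corrective ones.

    debate_logs: mapping agent_id -> list of solutions, one per debate round.
    answer_of:   function that pulls the final answer string out of a solution.
    """
    kept = []
    for agent_id, rounds in debate_logs.items():
        answers = [answer_of(sol) for sol in rounds]
        if answers[-1].strip() != gold_answer.strip():
            continue  # correctness-first: the final answer must match the reference
        # 'Corrective' means some earlier round was wrong but the last one is right.
        corrective = any(a.strip() != gold_answer.strip() for a in answers[:-1])
        kept.append({"agent": agent_id, "trace": rounds[-1], "corrective": corrective})
    # Corrective traces first: they demonstrate the self-repair behavior we want to distill.
    kept.sort(key=lambda t: t["corrective"], reverse=True)
    return kept

# Toy usage: agent 0 was wrong in round 1 and fixed itself, agent 1 was right all along.
logs = {0: ["... so the answer is 101", "... corrected: the answer is 102"],
        1: ["... the answer is 102", "... the answer is 102"]}
traces = extract_training_traces(logs, "102", answer_of=lambda s: s.split()[-1])
```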
Step 3A: Reasoning-Enhanced Fine-Tuning (R-SFT)
- What happens: Train the student to generate both the reasoning steps and the correct final answer.
- Why this step exists: Teaching only final answers encourages shortcuts; including steps teaches structure.
- Example: The student learns to break "How many apples does Sarah have now?" into small, labeled sub-steps.
Step 3B: Distillation with Data Augmentation (DA)
- What happens: From the correct set, a high-capacity teacher picks multiple distinct reasoning paths to the same answer (e.g., different identities, heuristics). The student is trained on these multiple paths.
- Why this step exists: Variety builds flexibility. Without it, the student may fail when the usual path doesnāt fit.
- Example: Solve 120 as 10×12 in one path, 15×8 in another, or repeated addition in a third.
Step 3C: Process-Aware Distillation (PAD)
- What happens: Train a PRM to score step-by-step correctness and then use GRPO to fine-tune the policy so the student prefers higher-scored reasoning paths (a reward-wiring sketch follows this step).
- Why this step exists: Step-level rewards teach the student to decompose, self-check, and correct mid-solution. Without PAD, models may improve less and be brittle.
- Example: On a multi-step fraction problem, the PRM boosts steps that correctly simplify fractions and penalizes missing common denominators.
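A small sketch of the reward wiring in PAD: per-step PRM scores are collapsed into one trajectory-level reward per sampled solution, and those rewards would then be normalized group-relatively and drive the GRPO update sketched earlier. The mean aggregation and optional length penalty are my assumptions, not the paper's stated formula.

```python
def trajectory_reward(step_scores, length_penalty=0.0):
    """Collapse a list of per-step PRM scores into one scalar reward for PAD.

    Using the mean keeps rewards comparable across solutions of different lengths;
    an optional penalty discourages padding with trivial steps.
    """
    return sum(step_scores) / len(step_scores) - length_penalty * len(step_scores)

# Three sampled solutions for the same problem, scored step-by-step by the PRM.
samples = [[1.0, 1.0, 1.0], [1.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
rewards = [trajectory_reward(s) for s in samples]
print(rewards)  # [1.0, 0.833..., 0.5]
# These per-sample rewards would then be normalized group-relatively and used to
# weight the GRPO update, so the student is nudged toward the most careful path.
```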
The Secret Sauce:
- Quality over quantity: PAD's process-aware signals consistently beat just adding more traces.
- Corrective focus: Traces that show mistakes being fixed are gold; they teach internal error correction.
- Capacity matching: Big teacher ensembles help only if the student can absorb them; otherwise, simpler is better.
Concrete Mini-Examples:
- R-SFT example: Train the model to output: "Step 1: Let x be… Step 2: Substitute… Final: 42." This links answer to reasoning.
- DA example: Show three correct solutions for the same geometry area problem (coordinate geometry, dissection, and formula-based) so the student can pick the best fit later.
- PAD example: The PRM assigns higher scores to steps that check intermediate arithmetic; GRPO then nudges the student to include these checks naturally.
Putting It Together (like a recipe; a compact pipeline sketch follows this list):
- Collect debate logs from multiple agents across tasks (GSM8K, MATH, MetaMathQA, MedMCQA).
- Keep correctness-first, diverse, and corrective reasoning trajectories.
- Train the student with R-SFT to imitate structured reasoning and answers.
- Enrich with DA so the student learns multiple valid paths.
- Apply PAD (PRM + GRPO) to reward good intermediate steps and internalize self-correction.
- Deploy the single model: one pass, multi-agent-style thinking.
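Putting the recipe into one place, here is a compact orchestration sketch; every callable it takes (`run_debate`, `extract_traces`, `finetune`, `augment`, `train_prm`, `grpo_finetune`) is a placeholder for the corresponding stage, and the staging order is the only thing the sketch asserts.

```python
def distill_agentark(student, problems, gold_answers, agents,
                     run_debate, extract_traces,
                     finetune, augment, train_prm, grpo_finetune):
    """End-to-end AgentArk-style pipeline (staging only; each stage is a callable)."""
    # 1) Training-time debates produce rich, corrective reasoning logs.
    logs = {q: run_debate(agents, q) for q in problems}

    # 2) Correctness-first extraction keeps correct and corrective trajectories.
    traces = {q: extract_traces(logs[q], gold_answers[q]) for q in problems}

    # 3a) R-SFT: imitate structured reasoning plus the final answer.
    student = finetune(student, traces)

    # 3b) DA: add multiple diverse correct paths per problem.
    student = finetune(student, augment(traces))

    # 3c) PAD: learn a step-level reward model, then optimize with GRPO.
    prm = train_prm(traces)
    student = grpo_finetune(student, prm, problems)

    # Deploy: one model, one forward pass, internalized debate behavior.
    return student
```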
Why it won't work without each stage:
- Without Debate Data: No rich behaviors to learn.
- Without Filtering: Noise and wrong steps pollute learning.
- Without R-SFT: The student may hallucinate steps or jump to answers poorly.
- Without DA: The student becomes brittle on new forms of problems.
- Without PAD: The student improves less on decomposition, verification, and fixing errors, the heart of debate reasoning.
04 Experiments & Results
The Test: The authors measured how well distilled single models solve reasoning-heavy tasks and how robustly they generalize beyond training. They used accuracy for math and medical QA, step-quality metrics (like reasoning coherence), perplexity of reasoning tokens (lower is better), and OOD measures (F1/ROUGE/BERTScore) on open-ended tasks.
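For the perplexity metric, a quick reminder of the computation: perplexity is the exponentiated average negative log-likelihood of the (reasoning) tokens, so lower means the model finds its own reasoning more predictable. The sketch below assumes you already have per-token log-probabilities.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the given tokens.

    For 'reasoning perplexity', token_logprobs would cover only the tokens of the
    reasoning trace, not the question or the final answer.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: a model that assigns each reasoning token probability ~0.5 has
# perplexity ~2; sharper (more confident, coherent) reasoning drives it toward 1.
print(perplexity([math.log(0.5)] * 10))  # ~2.0
```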
The Competition: Baselines included single-agent models (no debate), standard supervised fine-tuning (SFT) on answers, and full multi-agent debate at inference time (strong but slow). AgentArk compared R-SFT, DA, and PAD across several student families (Qwen3, Llama3, Gemma) and sizes (0.6B to 8B; teachers up to 32B/27B).
The Scoreboard (with context):
- Overall: AgentArk's distilled single agents improved by about 4.8% on average, approaching multi-agent performance while keeping single-model speed.
- In-Distribution vs Out-of-Distribution: Biggest gains were in-distribution (up to ~30% in best cases), with smaller but real gains out-of-distribution (~1–7%). This fits expectations: ID is easier; OOD proves transfer.
- Across Methods: PAD was the most consistent winner. R-SFT and DA sometimes helped but were less stable across datasets.
- Datasets: MetaMathQA and GSM8K saw the largest boosts (they need deep reasoning), MATH moderate, and MedMCQA the least (more domain facts than pure reasoning).
Surprising (and useful) findings:
- PRM capacity matters a lot: A strong PRM can lift even small students; a weak PRM limits gains.
- Student capacity bounds teacher benefits: More agents (e.g., 10 or 20) help only if the student is big enough; small students saturate or even degrade.
- Quality beats quantity: Simply adding more trajectories doesnāt guarantee improvement; PADās step-quality signal is more reliable.
- Behavior, not just accuracy: PAD models showed better step decomposition, self-checking, and error correction, evidenced by lower reasoning perplexity and higher LLM-judge scores for coherence and verification.
Robustness and Transfer:
- TruthfulQA: All distillation methods beat the base; PAD achieved the best BLEU/ROUGE, suggesting better factual discipline and stability.
- Open-ended transfer: Training only on math (e.g., GSM8K) still improved performance on HotpotQA (multi-hop), QASPER (long-context), and QMSum (summarization). This indicates AgentArk boosts general reasoning habits, not just memorized patterns.
Multimodal Note:
- Distilling from a larger vision-language model into a smaller one showed consistent, modest gains, even when trained on text-only reasoning data, suggesting the distilled reasoning is model-agnostic and reusable.
Bottom line: AgentArk nears the reasoning quality of multi-agent debates while keeping the speed and simplicity of a single model, especially when using PAD to teach stepwise thinking.
05 Discussion & Limitations
Limitations (be specific):
- Task coverage: Experiments target math, medical QA, and a few OOD tasks; more domains and modalities (vision, audio, tools) would test broader applicability.
- Debate focus: The pipeline uses debate to generate traces; other MAS forms (e.g., planner-executor, verifier loops) may add new insights not covered here.
- Reward learning risks: The PRM can inherit teacher biases or be imperfect; bad step scores can misguide training.
- Student capacity: Very small models saturate quickly; adding more agents or data may not help and could hurt.
Required Resources:
- Training-time compute: PAD is the most expensive (PRM + GRPO); R-SFT and DA are lighter. However, inference stays fast (single pass). Teams need GPUs (e.g., H100s) for large-scale PAD.
When NOT to Use:
- Ultra-low compute for training: If you can't afford any extra training time, you may prefer simpler fine-tuning or lightweight prompting tricks.
- Purely factual tasks with little reasoning: If tasks are mostly look-up (e.g., static knowledge recall), the gains from process-level distillation may be small.
- Tiny students with strict memory limits: If the student is too small, it may not absorb the debate behaviors effectively.
Open Questions:
- Can we make PRMs modular (logic, arithmetic, verification) to tailor guidance per task?
- Can we auto-tune how many agents to use based on student size and task difficulty?
- How to best combine PAD with tool use and retrieval, so process signals and external knowledge reinforce each other?
- What safety checks ensure we don't distill flawed reasoning or biases, and how can PRMs help prevent that?
06 Conclusion & Future Work
Three-Sentence Summary: AgentArk distills the brainpower of a debating team of AI models into one model that thinks through steps, checks itself, and fixes mistakes, all in a single pass. It does this with a three-layer approach: Reasoning-Enhanced Fine-Tuning, Trajectory-Based Data Augmentation, and Process-Aware Distillation powered by a Process Reward Model and GRPO. The result is near multi-agent performance with single-agent speed, better generalization, and improved robustness.
Main Achievement: Proving that the benefits of multi-agent debate come from its reasoning behaviors, and that these behaviors can be internalized into a single model through process-aware distillation, especially PAD.
Future Directions: Build stronger and more modular PRMs, combine PAD with tool use and retrieval, extend to richer multimodal settings, and develop adaptive strategies that match teacher complexity to student capacity. Also, standardize safety audits and correctness checks for PRMs and distilled policies.
Why Remember This: It shows a practical path to powerful reasoning without slow debates at answer time, shifting compute to training so real-time apps can be both smart and fast. If you remember one idea, remember this: teach the process, not just the answer, and a single model can think like a team.
Practical Applications
- On-device tutoring apps that explain math step-by-step without needing cloud-scale debates.
- Customer support chatbots that self-check and correct answers quickly to reduce handle time.
- Clinical triage assistants that reason through symptoms and verify intermediate checks before suggesting options.
- Coding helpers that decompose tasks, verify intermediate logic, and correct errors in one pass.
- Document summarizers that plan, verify facts, and refine claims internally for higher factuality.
- Education platforms that provide multiple solution methods to the same problem for flexible learning.
- Search assistants that perform multi-hop reasoning quickly without slow multi-agent orchestration.
- Data analysts that produce stepwise justifications and sanity checks for calculated metrics.
- Robotics or planning agents that internally debate action sequences and pick robust plans faster.
- Multimodal assistants that apply distilled reasoning to images plus text (e.g., charts, diagrams).