Behavior Knowledge Merge in Reinforced Agentic Models
Key Summary
- The paper solves a big problem: when you merge several reinforcement-learned models, their special skills get watered down by simple averaging.
- It introduces RAM, a method that keeps shared knowledge averaged but preserves and boosts each model’s unique, task-specific updates.
- RAM first finds which parameters are shared versus unique across agents by making a simple “on/off” map of changes.
- It then computes a fair scaling factor for unique parts based on how much overlap each agent has with the others.
- Finally, it merges by averaging shared parts and either preserving unique parts as-is (RAM) or multiplying them by the scaling factor (RAM+).
- Across coding, tool use, and long-context memory tasks, RAM+ achieves a new state of the art and even beats the individual specialists on most tests.
- The method is fast, data-free, and works across different model families (Qwen and Llama), showing strong generalization.
- This approach reduces storage and training costs by turning many specialist agents into one strong generalist without losing skills.
- RAM is especially helpful because RL updates are sparse and differ across tasks, which makes standard SFT-style merging fail.
- Limitations include potential conflicts when merging many agents, the simplifying assumption behind the rescaling rule, and uncertain behavior at very large (70B+) scales.
Why This Research Matters
RAM lets organizations combine many specialist AI agents into one dependable generalist without retraining or sharing private data. This reduces infrastructure costs by storing and serving a single model while keeping expert-level performance in coding, tool use, and long-context memory. It is especially practical when joint RL training across tasks is infeasible because each task needs its own environment and reward setup. By preserving unique skills and averaging shared knowledge, RAM often creates positive synergy, so the merged model can even outperform the original specialists. The method works across different model families (Qwen and Llama) and scales efficiently, making it useful in real systems. It also keeps general instruction-following ability more stable than common alternatives, reducing the risk of catastrophic forgetting. Overall, RAM is a simple, plug-in solution that turns fragmented expertise into a cohesive, high-performing AI assistant.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how each friend in a group has a special talent—one codes, one remembers details, and one is great with tools? Wouldn’t it be awesome if one friend could do all three just as well?
🥬 Filling (The Actual Concept)
- What it is: This paper is about combining several specialized AI agents (each trained with rewards using reinforcement learning) into one powerful generalist without losing their special skills.
- How it works (big picture): Before this paper, people merged models by averaging their changes, a strategy built for supervised fine-tuning (SFT). But RL-trained agents change only small, special parts of the model (sparse updates). When you average those, the important, rare changes get shrunk—this is called signal dilution. The paper proposes RAM, a method that treats shared changes and unique changes differently so nothing important gets washed out.
- Why it matters: Without a better way, companies must keep separate agents for each skill (costly and clunky), or accept that the merged model will forget crucial abilities.
🍞 Bottom Bread (Anchor): Imagine trying to mix three milkshakes—chocolate, strawberry, and vanilla—by averaging them. You end up with a bland flavor. Instead, RAM lets you pour the shared “milk” together but keeps each flavor’s syrup strong, so the final mix still tastes like each original.
— New Concept — Reinforcement Learning (RL) 🍞 Hook: Imagine earning stars every time you solve a puzzle. You try different moves and remember what earns you more stars. 🥬 The Concept: RL is a way to train AI by giving it rewards for good actions.
- How it works: (1) The AI tries actions; (2) a reward tells it how good they were; (3) it updates itself to get better rewards next time; (4) repeat to learn strong behaviors.
- Why it matters: RL creates very targeted skills (like tool-use or coding) by focusing updates in small, important areas. 🍞 Anchor: A coding agent gets higher reward when unit tests pass, so it learns specific coding “moves” that help pass tests.
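To make the try, score, update loop concrete, here is a tiny runnable sketch (a made-up two-action toy, not anything from the paper): the agent keeps a score estimate for each action and drifts toward whichever action earns more reward.

```python
import random

# Toy illustration of the RL loop: two actions, action 1 pays off more often.
# All names and numbers here are invented for illustration.
reward_prob = {0: 0.2, 1: 0.8}   # hidden environment: chance each action earns a "star"
value = {0: 0.0, 1: 0.0}         # the agent's current estimate of each action's worth
lr, epsilon = 0.1, 0.1           # learning rate and exploration rate

for step in range(1000):
    # (1) try an action: mostly pick the best-looking one, sometimes explore
    action = random.choice([0, 1]) if random.random() < epsilon else max(value, key=value.get)
    # (2) the environment returns a reward
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    # (3) update the estimate toward the observed reward
    value[action] += lr * (reward - value[action])

print(value)  # action 1's estimate ends up clearly higher, so the agent prefers it
```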
— New Concept — Agentic Models 🍞 Hook: Think of a smart assistant that can plan steps, call tools, and check its own work. 🥬 The Concept: Agentic models are AIs that take actions in a loop: think, act (maybe with tools), observe, and improve.
- How it works: They plan, use tools or search, read results, and adjust strategies.
- Why it matters: These models benefit most from RL and are the main target for merging. 🍞 Anchor: A travel-planning agent books flights, compares prices, then re-checks options before buying.
— New Concept — Model Merging 🍞 Hook: You know how teams share notes and make one master doc? Model merging is like that for AIs. 🥬 The Concept: Model merging combines several specialized models (trained from the same base) into one model without re-training from scratch.
- How it works: Compute each model’s changes from the base (task vectors), then fuse them into a single set of changes.
- Why it matters: Saves time, compute, and data sharing; ideally keeps all skills. 🍞 Anchor: Instead of storing three big agents (coding, memory, tools), you store one merged model that does all three.
— New Concept — Task Vector 🍞 Hook: Imagine marking up a base recipe with your edits in red. Your red marks are what you changed. 🥬 The Concept: A task vector is the difference between the base model and a task-tuned model (the “edits”).
- How it works: For each parameter, note how much it changed for that task; that collection of changes is the task vector.
- Why it matters: Merging is about combining these vectors; if we mess this up, we lose task skills. 🍞 Anchor: The coding agent’s vector highlights the exact weights that improved passing unit tests.
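As a minimal sketch (assuming plain PyTorch state dicts; the layer name and numbers are invented for illustration), a task vector is just the element-wise difference between the tuned agent and the shared base:

```python
import torch

def task_vector(base_state, tuned_state):
    """Element-wise difference between a task-tuned model and its base model."""
    return {name: tuned_state[name] - base_state[name] for name in base_state}

# Toy usage with a made-up one-layer "model"
base  = {"layer.w": torch.tensor([0.50, 0.50, 0.50])}
coder = {"layer.w": torch.tensor([0.60, 0.50, 0.55])}
print(task_vector(base, coder))  # roughly {'layer.w': tensor([0.10, 0.00, 0.05])}
```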
— New Concept — Sparse Parameter Updates 🍞 Hook: Think of fixing a bike: you don’t replace every part, just the few that matter. 🥬 The Concept: RL often changes only a small part of the model (sparse updates), not everything.
- How it works: Rewards push the model to tweak only the pieces tied to the winning behavior.
- Why it matters: Sparse updates are powerful but fragile—averaging can shrink them too much. 🍞 Anchor: The paper shows a coding agent only changes about 3.2% of parameters, while memory agents change far more.
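A quick way to measure this sparsity is sketched below; treating any weight whose change is smaller than a tiny threshold as "unchanged" is an assumption for illustration, not a value from the paper:

```python
import torch

def sparsity(task_vec, tau=1e-6):
    """Fraction of parameters an agent actually changed (|delta| > tau)."""
    changed = total = 0
    for delta in task_vec.values():
        changed += (delta.abs() > tau).sum().item()
        total += delta.numel()
    return changed / total

# Toy task vector: only 2 of 4 weights moved, so sparsity() returns 0.5.
# A coding-style agent might report ~0.03, a memory-style agent far more.
vec = {"layer.w": torch.tensor([0.10, 0.00, 0.05, 0.00])}
print(sparsity(vec))  # 0.5
```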
The World Before: SFT merging methods (like simple averaging, Fisher weighting, TIES, DARE) worked okay when task vectors were dense and similar. But in RL, different agents touch very different parts of the model, and often only a few.
The Problem: When you average, a coding agent’s unique updates get divided by the number of agents, weakening the very skills that made it strong.
Failed Attempts: Averaging (Task Arithmetic), Fisher weighting, trimming (TIES), or random dropping + rescaling (DARE) all still treat unique and shared changes too similarly, causing signal dilution.
The Gap: We need a way to separate shared vs. unique regions and handle them differently.
Real Stakes: You want one model that can code, remember long documents, and use tools—reliably—without re-training, extra data, or losing privacy. RAM makes that practical.
02 Core Idea
— New Concept — Signal Dilution 🍞 Hook: If three kids whisper different answers at once, the sound blends into a murmur and you can’t hear any one clearly. 🥬 The Concept: Signal dilution is when important, unique changes from one agent get shrunk by averaging with many zeros from other agents.
- How it works: Unique parameters from one agent are averaged with no-change values from others, reducing their size.
- Why it matters: The very skills that made the agent special become too weak, hurting performance. 🍞 Anchor: The paper shows that shrinking just the unique part of a coding agent’s updates quickly hurts coding accuracy.
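The arithmetic behind the dilution fits in a few lines (numbers invented for illustration): a weight that only the coding agent moved gets divided by the number of merged agents under plain averaging.

```python
# One parameter that only the coding agent changed, merged by naive averaging
# with two agents that left it at zero (illustrative numbers).
coding_delta, memory_delta, tool_delta = 0.12, 0.0, 0.0

naive_average = (coding_delta + memory_delta + tool_delta) / 3
print(naive_average)  # 0.04: the unique coding signal shrank to a third of its size
```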
— New Concept — Shared vs. Unique Regions 🍞 Hook: In a Venn diagram, some parts overlap (shared) and others don’t (unique). 🥬 The Concept: Shared regions are parameters changed by multiple agents; unique regions are changed by only one.
- How it works: Build a mask of where each agent changed parameters; count how many agents changed each spot.
- Why it matters: Shared parts benefit from averaging (consensus), while unique parts must be preserved (or boosted) to avoid dilution. 🍞 Anchor: If both tool and memory agents change a weight, average it. If only coding changes it, keep it strong.
The Aha! Moment (in one sentence): Treat shared and unique updates differently—average the shared, preserve and smartly rescale the unique—so no skill gets washed out.
Three Analogies:
- Orchestra: Shared notes (the melody) are blended; solo parts (unique riffs) must be kept loud and clear.
- Trail mix: Nuts and raisins (shared basics) are mixed evenly, but the rare chocolate chips (unique treats) shouldn’t be thinned out.
- Group project: Common sections are harmonized; each person’s unique expertise is kept intact and highlighted.
Before vs. After:
- Before: Averaging treated every change the same, shrinking unique RL signals and hurting specialization.
- After: RAM separates shared from unique; shared is averaged for balance, unique is preserved and gently amplified to keep expert power.
Why It Works (intuition):
- RL updates are sparse and often non-overlapping, so unique changes rarely harm other tasks.
- Averaging helps when multiple agents agree (shared), but it needlessly weakens unique knowledge.
- Preserving unique updates—and slightly scaling them when lots of overlap elsewhere causes contraction—keeps experts expert while still gaining cross-task balance.
— New Concept — Overlap-Unique Ratio (how much is shared vs. unique) 🍞 Hook: Think of a playlist: how many songs are on everyone’s list (shared) versus only on yours (unique)? 🥬 The Concept: The overlap-unique ratio compares how much of an agent’s changes are shared with others versus unique to itself.
- How it works: Count an agent’s changed parameters that overlap with other agents; compare to how many only it changed.
- Why it matters: If an agent overlaps a lot, its shared parts will be averaged (shrunk), so its unique parts may need a bit more boost to compensate. 🍞 Anchor: A tool agent with many shared changes gets its unique parts scaled slightly higher to balance the shrinkage in shared areas.
— New Concept — Rescaling Unique Regions 🍞 Hook: If your microphone volume drops for half the song, you turn it up a little to balance the final recording. 🥬 The Concept: RAM multiplies unique updates by a small, carefully chosen factor to counteract the shrinkage from averaging shared parts.
- How it works: Compute a ratio based on overlap; turn it into a safe scaling number (with a cap) and multiply only unique parameters for that agent.
- Why it matters: This avoids over-boosting while keeping expert signals strong enough to perform. 🍞 Anchor: With a modest rescaling, the memory agent’s long-context accuracy bounces back after merging.
— New Concept — Reinforced Agent Merging (RAM) 🍞 Hook: Imagine building a superhero with shared powers combined smoothly and each hero’s signature move kept powerful. 🥬 The Concept: RAM is a distribution-aware merging method for RL-trained agents that averages shared changes and preserves/rescales unique changes.
- How it works: (1) Detect where each agent changed parameters; (2) identify shared vs unique; (3) average shared; (4) keep/boost unique with a small factor; (5) leave untouched parts at base values.
- Why it matters: It prevents unique skills from being washed out while still benefiting from shared agreement. 🍞 Anchor: After RAM, a single model codes better, uses tools more reliably, and remembers long documents—often matching or beating the original specialists.
03 Methodology
At a high level: Task vectors from several RL agents → Probe who changed what → Compute a small boost for unique parts per agent → Merge by averaging shared and boosting unique → One strong generalist model.
Step A: Probe the distribution (who changed what)
- What happens: For each agent, mark every parameter as changed or not (using a tiny threshold). Then, for each parameter, count how many agents changed it.
- Why this step exists: Without knowing shared vs. unique spots, we’d be back to one-size-fits-all averaging, which causes signal dilution.
- Example: Suppose parameter #17 was changed by the coding and memory agents but not the tool agent. That parameter is shared (count = 2). Parameter #42 changed only for the coding agent—so it’s unique to coding.
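Here is a minimal sketch of this probing step, assuming task vectors stored as PyTorch tensors; the threshold and toy numbers are illustrative, not the paper's released code:

```python
import torch

def change_masks(task_vectors, tau=1e-6):
    """Boolean 'on/off' map per agent: True wherever |delta| > tau."""
    return [{name: (tv[name].abs() > tau) for name in tv} for tv in task_vectors]

def overlap_counts(masks):
    """For every parameter position, count how many agents changed it (0, 1, 2, ...)."""
    return {name: sum(m[name].int() for m in masks) for name in masks[0]}

# Toy example: three agents, one 4-weight "layer"
coding = {"w": torch.tensor([0.10, 0.00, 0.05, 0.00])}
memory = {"w": torch.tensor([0.00, 0.08, 0.05, 0.00])}
tool   = {"w": torch.tensor([0.00, 0.00, 0.05, 0.00])}
masks  = change_masks([coding, memory, tool])
print(overlap_counts(masks)["w"])  # tensor([1, 1, 3, 0]): unique, unique, shared, untouched
```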
Step B: Compute a per-agent “unique boost”
- What happens: For each agent, compare how many of its changed parameters overlap with others versus how many are unique. Turn that ratio into a safe scaling number (like 1.0 meaning no change, 1.1 meaning a 10% boost), capped so it doesn’t explode.
- Why this step exists: When shared regions get averaged, they contract. A small boost in the unique region helps keep total task performance steady.
- Example: If the tool agent overlaps a lot with others, its shared parts will shrink more from averaging. So we give its unique parts a slightly higher multiplier than, say, the very sparse coding agent.
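One way such a per-agent boost could be computed is sketched below; the exact formula, strength, and cap are illustrative assumptions, not the paper's:

```python
import torch

def unique_boost(agent_mask, overlap_count, strength=0.2, cap=1.3):
    """Turn an agent's overlap-vs-unique ratio into a small, capped multiplier.

    Illustrative rule: agents whose changes mostly land in shared (averaged,
    hence shrunk) regions get a slightly larger boost for the parameters that
    only they changed.
    """
    shared = (agent_mask & (overlap_count > 1)).sum().item()
    unique = (agent_mask & (overlap_count == 1)).sum().item()
    if unique == 0:
        return 1.0                              # nothing unique to rescale
    ratio = shared / (shared + unique)          # share of this agent's edits that will be averaged
    return min(1.0 + strength * ratio, cap)     # 1.0 = no boost, 1.1 = 10% boost, capped

# Toy single-layer case: which weights this agent changed, and how many agents changed each weight
coding_mask   = torch.tensor([True, False, True, False])
overlap_count = torch.tensor([1, 1, 3, 0])
print(unique_boost(coding_mask, overlap_count))  # 1.1: half of this agent's edits will be averaged away
```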
Step C: Selective merging
- What happens: For each parameter:
- If no agent changed it, keep the base model value (effectively add zero change).
- If exactly one agent changed it (unique), apply that agent’s small boost to its change and keep it.
- If multiple agents changed it (shared), average their changes.
- Why this step exists: Averaging shared parts smooths agreement; preserving and boosting unique parts protects expert skills from dilution.
- Example with toy numbers: For parameter #10, coding says +0.06, memory says +0.04, tool says 0.00. This is shared by two agents, so average (+0.05). For parameter #21, only coding changed it by +0.10; coding’s boost is 1.1, so we keep +0.11. If parameter #30 wasn’t changed by anyone, it stays at +0.00.
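A sketch of this per-parameter rule (an illustration of the described behavior in PyTorch, not the authors' implementation), checked against the toy numbers above:

```python
import torch

def selective_merge(task_vectors, boosts, tau=1e-6):
    """Average shared changes, boost unique ones, leave untouched weights at zero delta."""
    merged = {}
    for name in task_vectors[0]:
        deltas = torch.stack([tv[name] for tv in task_vectors])   # [num_agents, ...]
        count = (deltas.abs() > tau).sum(dim=0)                   # how many agents touched each weight
        # shared (count >= 2): average over the agents that actually changed it
        shared_avg = deltas.sum(dim=0) / count.clamp(min=1)
        # unique (count == 1): apply that agent's boost to its change
        boosted = torch.stack([b * tv[name] for b, tv in zip(boosts, task_vectors)]).sum(dim=0)
        merged[name] = torch.where(count >= 2, shared_avg,
                       torch.where(count == 1, boosted, torch.zeros_like(shared_avg)))
    return merged

# Toy check against the three parameters in the example above
coding = {"w": torch.tensor([0.06, 0.10, 0.00])}
memory = {"w": torch.tensor([0.04, 0.00, 0.00])}
tool   = {"w": torch.tensor([0.00, 0.00, 0.00])}
print(selective_merge([coding, memory, tool], boosts=[1.1, 1.05, 1.0])["w"])
# tensor([0.0500, 0.1100, 0.0000]): shared averaged, unique boosted, untouched left at zero
```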
Putting it all together (Input → Output):
- Inputs: Several RL-trained agents’ task vectors relative to a common base.
- Process: Build change maps; count overlaps; compute per-agent unique boosts; combine values parameter-by-parameter using the selective rules.
- Output: One merged task vector that, when added to the base model, yields a single generalist agent.
Concrete mini-example (three parameters, two agents):
- Base: [B1, B2, B3]
- Coding vector: [0.10, 0.00, 0.05]
- Memory vector: [0.00, 0.08, 0.05]
- Overlaps: Param1 unique to coding; Param2 unique to memory; Param3 shared by both.
- Suppose boosts: coding unique boost 1.1; memory unique boost 1.05.
- Merge: Param1 → 1.1×0.10 = 0.11; Param2 → 1.05×0.08 = 0.084; Param3 → average(0.05, 0.05) = 0.05.
- Final merged vector: [0.11, 0.084, 0.05], which keeps both experts strong and agrees where they overlap.
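The same mini-example can be replayed in a few lines (placeholder base values; a sketch, not the paper's code), including the final step of adding the merged vector back onto the base:

```python
import torch

# Placeholder base values for B1, B2, B3; the deltas and boosts match the mini-example above.
base         = torch.tensor([1.00, 1.00, 1.00])
coding_delta = torch.tensor([0.10, 0.00, 0.05])
memory_delta = torch.tensor([0.00, 0.08, 0.05])
boosts       = {"coding": 1.10, "memory": 1.05}

merged_delta = torch.stack([
    boosts["coding"] * coding_delta[0],        # param 1: unique to coding, boost and keep
    boosts["memory"] * memory_delta[1],        # param 2: unique to memory, boost and keep
    (coding_delta[2] + memory_delta[2]) / 2,   # param 3: shared by both, so average
])
print(merged_delta)         # tensor([0.1100, 0.0840, 0.0500])
print(base + merged_delta)  # the single generalist: base weights plus the merged task vector
```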
The Secret Sauce:
- RAM is distribution-aware: it looks at which parameters are changed by whom and uses that structure to decide what to average and what to preserve.
- It avoids data access and extra training: you only need the task vectors, not the datasets or reward models.
- It stays stable: unique boosts are small and capped to avoid over-amplification.
— New Concept — On-Policy RL (why merging is hard without RAM) 🍞 Hook: Imagine learning to play a game while the rules can change unless you always practice in the real game itself. 🥬 The Concept: On-policy RL updates the model using data collected by the current model in its own environment.
- How it works: The agent acts, collects feedback, and updates from that exact experience stream.
- Why it matters: Jointly training many on-policy RL tasks at once is impractical (needs multiple live environments/reward models), so merging separately trained agents is the practical path. 🍞 Anchor: A UI agent must click real buttons and read real screens to learn; doing that simultaneously for many tasks is messy, so we train separately and then merge.
— New Concept — Base Model and Task Vectors (together) 🍞 Hook: Think of a blank notebook (base) and different colored sticky notes for each subject (task vectors). 🥬 The Concept: The base model is the common starting point; each task vector is a set of changes taped on.
- How it works: RAM builds one neat, combined set of sticky notes by smartly stacking shared notes and preserving rare but important notes.
- Why it matters: A clean, single notebook beats juggling three. 🍞 Anchor: One merged model that codes, remembers, and uses tools replaces three standalone agents.
04 Experiments & Results
The Test (what and why):
- Domains: Coding (LiveBench, LiveCodeBench), Tool Use (Berkeley Function Calling Leaderboard, Live and Non-Live), Long-Context Memory (RULER-HotpotQA and RULER-SQuAD at various lengths).
- Metrics: Accuracy and unit-test pass rates for coding; strict function-call correctness for tools (AST-based); exact match for long-context QA.
- Why these: They stress very different, specialized behaviors, which is exactly where signal dilution shows up.
The Competition (baselines):
- Task Arithmetic (simple averaging), Fisher (curvature-weighted averaging), TIES (trim and sign-consensus), DARE (random drop + rescale, often combined with TA/TIES). All are designed around SFT-style dense updates, not sparse RL ones.
The Scoreboard (with context):
- Overall: RAM averages 64.82 vs. the strongest baseline (DARE+TA) at 63.33. RAM+ (with unique rescaling) hits 66.55—like moving from a strong B to a solid A.
- Across 12 tasks: RAM+ sets state-of-the-art results on 9/12, and achieves the best overall average. It often even beats the original specialists in their own domains (the “A+ beats individual tutors” moment).
- Coding: On LiveBench/LiveCodeBench, RAM/RAM+ match or surpass the coding specialist, suggesting beneficial synergy from tool/memory reasoning.
- Tool Use: On complex parallel and multi-parallel calls (the toughest categories), RAM+ significantly outperforms the tool specialist, indicating that preserved unique circuits and shared averaging both help.
- Long-Context Memory: RAM+ shines, reaching top scores at long lengths like 64K, where small mistakes in preserving memory-related changes would normally be costly.
Surprising Findings:
- Positive Synergy: The merged generalist not only prevents forgetting—it often outperforms the individual experts in their own specialties, meaning skills can reinforce each other when merged the right way.
- Architecture Generalization: On a different base (Llama 3.2-3B) with math, search, and tools, RAM/RAM+ once again outperform baselines and retain capabilities better.
- Safety vs. Forgetting: RAM keeps general instruction-following ability stable or improved compared to the base on Qwen; even on the smaller Llama, it forgets much less than TIES/DARE.
- Efficiency: RAM reaches a better performance–time trade-off than prior SOTA merging methods—often faster than complex baselines and clearly more accurate than the fastest simple averages.
Plain-English Meaning:
- RAM isn’t just avoiding damage; it’s unlocking helpful cross-task transfers. By not squashing rare, unique skills, it lets them coexist and even boost each other—like a team that plays better together because everyone still has their superpower.
05 Discussion & Limitations
Limitations (be specific):
- Many Agents at Once: As you merge more and more agents, shared regions might collide in tricky ways; simple averaging of shared parts might need smarter conflict resolution.
- Simplifying Assumption: The unique-boost math assumes all parameters contribute somewhat evenly on average; real models have uneven importance, so a finer-grained (but costlier) importance estimate could help.
- Tuning Needs: The default boost strength works well across tested models, but very different data types or modalities may need light tuning.
- Scale Unknowns: Results are strong at 3B–7B; it’s not yet proven how well the approach scales at 70B+.
Required Resources:
- Task vectors from agents trained on the same base model; minimal compute to build masks, count overlaps, and merge; no data, no reward models, and no extra RL runs.
When NOT to Use:
- If agents were trained from different base models (no shared starting point), task vectors aren’t directly compatible.
- If your tasks heavily interfere in the same parameters and directions (extreme overlap with conflicting signs), you may need advanced conflict handling beyond simple averaging of shared parts.
- If you can do a full joint multi-task on-policy RL training with all environments and rewards (rare in practice), merging may be unnecessary.
Open Questions:
- Can we estimate parameter importance more precisely (e.g., curvature-aware) without heavy cost and further reduce conflicts?
- What are the best strategies when merging 5–10+ agents with dense conflicts?
- How does RAM behave at very large scales (70B+) and with multimodal agents (vision + text + tools)?
- Can dynamic, per-layer or per-module boosts outperform a single global boost per agent?
06 Conclusion & Future Work
Three-Sentence Summary: This paper identifies why standard merging fails for RL-trained agents: sparse, heterogeneous updates cause signal dilution when averaged. It introduces RAM, which separates shared versus unique changes, averages the former, and preserves/rescales the latter to keep specialized skills strong. RAM/RAM+ set new state-of-the-art results across coding, tool use, and long-context memory—often beating the original specialists—while being efficient and general across architectures.
Main Achievement: Turning many RL-trained specialists into one reliable generalist by a simple, distribution-aware merge that prevents unique skills from being washed out.
Future Directions: Handle more agents with better conflict resolution; add light-weight importance estimates for finer control; test on 70B+ and multimodal settings; explore per-layer adaptive boosts and automated hyperparameter selection.
Why Remember This: RAM changes how we think about combining RL agents—don’t average everything; protect and gently boost what’s unique. That simple idea unlocks strong, practical generalists without retraining, extra data, or privacy risks.
Practical Applications
- Deploy one merged assistant that codes, runs tools, and remembers long documents instead of maintaining three separate agents.
- Create domain generalists (e.g., finance + legal + data analysis) by merging RL specialists trained in isolated environments.
- Enterprise privacy: merge locally trained task experts without sharing raw datasets or prompts.
- Rapid skill updates: add a new specialist (e.g., calendar tools) by merging its task vector into the existing generalist.
- Edge and on-device AI: reduce storage by replacing multiple specialists with one merged model.
- Research prototyping: quickly test cross-task synergy by merging agents from different domains without re-training.
- Operations robustness: maintain instruction-following while adding specialized behaviors, reducing catastrophic forgetting.
- Multi-agent systems: periodically merge high-performing agents into a single checkpoint for simpler deployment.
- Tool ecosystems: merge agents specialized in different API families to improve coverage and reliability.
- Education platforms: combine math reasoning, search, and code-tutoring agents for a single teaching assistant.