LatentMem: Customizing Latent Memory for Multi-Agent Systems
Key Summary
- LatentMem is a new memory system that helps teams of AI agents remember the right things for their specific jobs without overloading them with text.
- It fixes two big problems in multi-agent systems: everyone using the same memory (homogenization) and too much information to read (information overload).
- LatentMem keeps a simple experience bank of raw past teamwork and uses a small model (the composer) to create short, role-aware memory snippets called latent memories.
- A special training method, LMPO, teaches the composer which memories actually help finish tasks by pushing learning signals back through the latent memories.
- Across six benchmarks and four popular multi-agent frameworks, LatentMem improves performance by up to 19.36% over vanilla systems.
- It uses about 50% fewer tokens and cuts inference time to roughly two-thirds compared to common memory baselines.
- It generalizes well to new domains and unseen agent organizations, with gains like +7.10% on PDDL and +7.90% on CAMEL.
- LatentMem works without changing the underlying agent frameworks, so it can be plugged in easily.
- Compared to direct multi-agent fine-tuning (MARTI), LatentMem consistently performs better under the same training budget.
- By making memories compact and tailored to each role, LatentMem reduces mistakes like repeating steps or blindly copying old trajectories.
Why This Research Matters
When AI teammates remember the right things for their jobs, they solve problems faster and make fewer mistakes. LatentMem cuts through clutter by giving each agent a tiny, role-tailored memory, which means lower costs and faster responses in real applications. It stays strong even when the subject or team setup changes, so builders don't have to redesign everything for new tasks. The system plugs into existing multi-agent frameworks without rewiring them, speeding up adoption. Because it learns what memories actually help, it keeps getting better over time. This leads to more reliable coding assistants, smarter research helpers, and sturdier planning systems. In short, it turns "more memory" from a liability into a superpower.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a school group project where everyone shares one giant notebook. The artist, the writer, and the researcher all have to flip through the same pages to find what they need. It's slow, messy, and people miss important notes.
The Concept (Multi-Agent Systems, MAS): MAS are teams of AI "teammates" with different roles that work together on a task. How it works:
- Each agent has a job (like planning, coding, testing, or summarizing).
- They talk to each other and the environment.
- They use memory to remember what worked before.
Why it matters: Without clear roles and good memory, they step on each other's toes, repeat mistakes, and lose time.
Anchor: In a coding task, a Strategy Agent plans, a Code Agent writes code, a Test Agent runs tests, and a Summarizer wraps it up. Together, they can beat one big all-in-one agent.
Hook: You know how your phone's photo album gets crowded if you never organize it? Finding one picture becomes a chore.
The Concept (MAS Memory): MAS memory stores what the team did before so future tasks go faster and better. How it works:
- Save past conversations, actions, and results (trajectories).
- Retrieve similar past cases.
- Feed them to agents so they can reuse good ideas.
Why it matters: Without memory, every task starts from scratch, wasting time and missing lessons.
Anchor: If the team once built a similar game level, memory helps the new task reuse the winning level-generation trick.
Hook: Imagine giving the same study guide to the artist and the math whiz. One of them won't get what they need.
The Concept (Memory Homogenization): Many systems give identical memory to all agents, ignoring their different roles. How it works:
- A shared memory is retrieved.
- It's copy-pasted to every agent.
- Agents try to sift through it themselves.
Why it matters: Roles blur, everyone chases the same clues, and small mistakes snowball into team-wide errors.
Anchor: The Test Agent doesn't need full design history; they need test hints. Without role-aware memory, they get lost.
Hook: Ever tried to read a 200-page manual to find one tiny instruction? Too much info can be worse than too little.
The Concept (Information Overload): Multi-agent systems often stuff agents with long, fine-grained memories. How it works:
- Store detailed, multi-level notes (raw logs, summaries, skills).
- Retrieve lots of entries for safety.
- Jam them into the context window.
Why it matters: The signal (what's useful) gets buried in noise, slowing inference and causing confusion.
Anchor: When answering "What's the capital of France?", the agent shouldn't wade through entire travel blogs.
Hook: Think of a tidy pencil case vs. a messy backpack. The tidy case holds exactly what you need, in a small space.
The Concept (The Gap Before This Paper): Systems lacked a learnable way to craft role-aware, compact memories. How it works:
- Prior methods hand-engineered schemas and long text stores.
- Little customization per role.
- No end-to-end learning that teaches memory what really helps.
Why it matters: Without compact, trained, role-aware memory, teams remain slow and error-prone.
Anchor: We need a "smart memory maker" that packs the right tools for each teammate.
The world before LatentMem looked like this: teams of LLM agents could collaborate but struggled to adapt their memories to roles, and they wasted tokens lugging around long histories. People tried multi-granularity stores (raw logs, insights, procedural skills) and better retrieval, but two things still broke: (1) agents got the same memory regardless of job; (2) contexts got too long. What was missing was a learnable, compact, role-aware memory that plugs into any MAS. That matters in daily life because AI teams are writing code, answering questions, and planning tasks; when they remember smarter, they help us faster, cheaper, and more reliably.
02 Core Idea
Hook: You know how a coach gives different tips to a goalie and a striker? Same game, different roles, custom advice.
The Concept (LatentMem): LatentMem is a role-aware, token-efficient memory system that turns raw past teamwork into small, useful memory snippets for each agent. How it works:
- Keep an experience bank of raw multi-agent trajectories.
- Retrieve relevant past cases for a new task.
- Use a memory composer to compress them into short, role-conditioned latent memories.
- Inject these latent tokens into each agent's model before it reasons.
Why it matters: Without role-aware compressed memory, teams waste tokens, miss key role-specific cues, and underperform.
Anchor: The Code Agent gets code patterns, the Test Agent gets testing cues, and the Strategy Agent gets planning hints, each in just a few latent tokens.
Aha! Moment in one sentence: If we treat memory as small, trainable, role-conditioned latent tokens, we can keep the most helpful bits while throwing away the rest, so each agent gets exactly what helps them decide next.
Three analogies:
- Backpack packing: Each student packs only what their class needs (role-aware), using space-saving organizers (latent tokens).
- Radio tuner: The composer tunes to the right station for each role, filtering static into a clear, short signal.
- Recipe cards: Instead of a whole cookbook, each chef gets a tiny role-specific recipe card for tonight's menu.
Before vs. After:
- Before: Long, shared text memories; agents scan too much; roles blur; tokens spike; decisions slow.
- After: Short, role-tailored latent tokens; agents see only what matters; tokens drop; decisions sharpen.
Hook: Think of summarizing a whole year's class notes into a few flashcards that still make you ace the test.
The Concept (Latent Memory): A fixed-length vector memory injected as tokens into the model's hidden space. How it works:
- Retrieve raw trajectories.
- Condition on the agent's role profile.
- Compress into L′ latent tokens with the composer.
Why it matters: Without fixed-length latent tokens, context grows unbounded and gets noisy.
Anchor: Eight tiny tokens carry the strongest code patterns for the Code Agent, saving thousands of text tokens.
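To make "fixed-length" concrete, here is a minimal sketch (illustrative only; the hidden size D = 4096 is an assumption, not a number from the paper) showing that the memory handed to an agent is always the same small tensor, no matter how long the retrieved history was:

```python
import torch

L_PRIME, D = 8, 4096   # L' latent tokens; D = backbone hidden size (assumed value)

# A latent memory is a fixed-length matrix of L' token vectors, m_j in R^{L' x D},
# regardless of whether the retrieved trajectories span 500 or 5,000 text tokens.
latent_memory = torch.randn(L_PRIME, D)
print(latent_memory.shape)   # torch.Size([8, 4096]) -- 8 positions instead of thousands
```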
Hook: A tailor measures you before sewing the suit; one-size-fits-all never fits well.
The Concept (Role Profiles): Short descriptions of each agent's job to steer memory customization. How it works:
- Read the role profile (e.g., "Test Agent: ensure quality via tests").
- Ask: Which past bits help this role right now?
- Compose a role-aligned latent memory.
Why it matters: Without role profiles, memories collapse into the same generic bundle.
Anchor: The Summarizer gets high-level progress cues, not raw stack traces.
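As an illustration, role profiles can be as simple as short strings keyed by role; the wording below is a hypothetical paraphrase for the sketch (only the Test Agent line echoes the example above), not copied from the paper's prompts:

```python
# Hypothetical role profiles used to condition the composer (illustrative wording).
ROLE_PROFILES = {
    "strategy":   "Strategy Agent: decompose the task and plan the next steps.",
    "code":       "Code Agent: write correct, runnable code for the current subtask.",
    "test":       "Test Agent: ensure quality via tests; surface edge cases and past failures.",
    "summarizer": "Summarizer: report concise, high-level progress, not raw stack traces.",
}
```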
Hook: Practicing the piano with a metronome gives feedback that keeps you on beat.
The Concept (LMPO): A training rule that sends task success signals back through latent memories to improve the composer. How it works:
- Run several trajectories on a question, score each (rewards).
- Compare them within the group (advantages).
- Adjust the composer token-by-token so helpful memories become more likely.
Why it matters: Without LMPO, the composer can't learn which compressed memories truly help.
Anchor: If shorter testing tips led to more passing tests, LMPO nudges the composer to keep and refine those tips.
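A minimal sketch of the "compare within the group" step, assuming a GRPO-style baseline: each trajectory's reward is centered on the group mean, and the standard-deviation scaling is a common stabilizer added here as an assumption rather than a detail from the paper:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Convert G per-trajectory rewards into advantages relative to the group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)   # positive = better than the group average

# Four rollouts on the same question: 1.0 = task solved, 0.0 = failed.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))   # winners get positive advantages
```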
Why it works (intuition):
- Latent tokens give a tight slot for memory, forcing focus.
- Role profiles align the focus to each job.
- LMPO supplies the learning signal so the composer keeps what improves final answers.
Building blocks:
- Experience bank (raw, lightweight store).
- Similarity retrieval (find relevant pasts).
- Memory composer (neural compressor conditioned on role profile).
- Latent token injection (append to hidden states).
- LMPO (grouped relative rewards, token-level objective, differentiable path back to the composer).
03 Methodology
At a high level: Query → Retrieve raw trajectories (Experience Bank) → Compose role-aware latent memory (Composer) → Inject latent tokens into the agent → Agent reasons and acts → Store new trajectory.
Hook: Think of a library (experience bank), a librarian who makes custom summaries (composer), and students who tuck those summary cards into their notebooks (latent injection) before taking a test.
The Concept (Experience Bank): A super-light store of raw multi-agent trajectories. What happens:
- Keep raw step-by-step logs: who acted, the prompt, and the output.
- When a new query q arrives, compute an embedding and find the top-K similar past trajectories.
Why this step exists: Without a record of what really happened, the composer has nothing factual to squeeze into helpful memory.
Example: For "Design a parkour runner game," retrieval might bring a previous game-making run with level generation and test fixes.
Anchor: Like pulling last year's finished science fair logs before starting a similar project.
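A minimal sketch of the experience bank, assuming a generic `embed()` callable (a hypothetical stand-in for a lightweight sentence-embedding model such as MiniLM, returning unit-normalized vectors) and cosine similarity for top-K retrieval; the class and method names are illustrative, not the paper's API:

```python
import numpy as np

class ExperienceBank:
    """Lightweight store of raw multi-agent trajectories with top-K similarity retrieval."""

    def __init__(self, embed):
        self.embed = embed                 # callable: str -> unit-normalized np.ndarray
        self._query_vecs, self._trajectories = [], []

    def add(self, query, trajectory):
        # Store the raw step-by-step log (who acted, prompt, output) under the query's embedding.
        self._query_vecs.append(self.embed(query))
        self._trajectories.append(trajectory)

    def retrieve(self, query, k=3):
        # Return the k most similar past trajectories (cosine similarity on unit vectors).
        if not self._trajectories:
            return []
        sims = np.stack(self._query_vecs) @ self.embed(query)
        return [self._trajectories[i] for i in np.argsort(-sims)[:k]]
```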
Hook: Imagine a role-aware squeezer that turns a long diary into just a few perfect flashcards for the person who needs them.
The Concept (Memory Composer): A small transformer (initialized from the LLM, trained with LoRA) that turns retrieved trajectories + the active agent's role profile into a fixed-length latent memory of L′ tokens. What happens:
- Input: role profile γ_role and retrieved trajectories T_q.
- Output: m_j ∈ R^{L′×D}, a short matrix of latent tokens.
- We set L′=8 by default for a good accuracy-cost trade-off.
Why this step exists: Without compression, contexts bloat; without role conditioning, memories become generic and less useful.
Example: For the Test Agent, the composer emits tokens emphasizing edge-case tests and prior failure modes.
Anchor: The Test Agent's flashcards focus on pass/fail clues, not whole code histories.
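A toy sketch of a composer, assuming one plausible design in which L′ learned query vectors cross-attend over the embedded role profile and retrieved trajectories; the paper's composer is initialized from the backbone LLM and LoRA-tuned, which this stand-in does not reproduce:

```python
import torch
import torch.nn as nn

class MemoryComposer(nn.Module):
    """Toy composer: L' learned queries attend over role-profile + trajectory embeddings."""

    def __init__(self, d_model=4096, n_latent=8, n_heads=8):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, context_embeds):
        # context_embeds: (batch, T, d_model) = embedded role profile + retrieved trajectories.
        batch = context_embeds.size(0)
        q = self.latent_queries.unsqueeze(0).expand(batch, -1, -1)
        m, _ = self.cross_attn(q, context_embeds, context_embeds)
        return self.out_proj(m)            # (batch, L'=8, d_model) latent memory m_j
```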
Hook: Sliding a bookmark into the exact page you need beats carrying a whole stack of notes.
The Concept (Latent Memory Injection): Concatenate latent tokens to the agent's hidden states before decoding. What happens:
- The agent encodes its current prompt into hidden states h_j.
- We append latent memory m_j to form [h_j; m_j].
- The agent's next-token predictions condition on this augmented input.
Why this step exists: Without direct injection, the memory can't guide token-by-token reasoning.
Example: The Code Agent's next lines of Python are influenced by compact code patterns in m_j.
Anchor: It's like inserting a concise hint card into your notebook right before answering a question.
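A minimal sketch of injection with a Hugging Face-style causal LM, assuming the standard `get_input_embeddings()` / `inputs_embeds` interface; exactly where LatentMem splices the latent tokens relative to the prompt is simplified here:

```python
import torch

def forward_with_latent_memory(model, tokenizer, prompt, latent_memory):
    """Decode with [h_j ; m_j]: prompt embeddings followed by L' latent memory tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    h = model.get_input_embeddings()(ids)                 # (1, T, D) prompt input embeddings
    m = latent_memory.unsqueeze(0).to(h.dtype)            # (1, L'=8, D)
    augmented = torch.cat([h, m], dim=1)                  # concatenate along the sequence axis
    mask = torch.ones(augmented.shape[:2], dtype=torch.long)
    out = model(inputs_embeds=augmented, attention_mask=mask)
    return out.logits                                     # next-token predictions condition on m_j
```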
Hook: When a team wins, they circle back to learn what helped most, so they can repeat it.
The Concept (Online Update): After finishing the task, append the new trajectory to the experience bank. What happens:
- Save the new multi-agent dialog and actions.
- Make it available for future retrieval.
Why this step exists: Without updates, the system can't learn from fresh tasks or adapt to new distributions.
Example: A tricky PDDL puzzle solution becomes tomorrow's helpful reference.
Anchor: After a school competition, you add your notes to the team binder for next year.
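Continuing the `ExperienceBank` sketch from the retrieval step (the placeholder `embed` and the trajectory contents below are purely illustrative), the online update is a single call once the task finishes:

```python
import numpy as np

embed = lambda text: np.ones(8) / np.sqrt(8)   # placeholder unit-vector embedding
bank = ExperienceBank(embed)                   # class from the Experience Bank sketch above

# After the multi-agent run completes, file the fresh trajectory for future retrieval.
bank.add(query="Solve this PDDL blocks-world puzzle",
         trajectory=["Planner: stack A on B", "Executor: actions valid", "Verifier: goal reached"])
print(len(bank.retrieve("Solve a similar blocks-world puzzle", k=1)))   # 1
```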
Hook: Think of a music teacher who listens to a group performance, ranks the takes, and then tweaks what the students practice next.
The Concept (LMPO, Latent Memory Policy Optimization): The training loop that improves the composer using task-level rewards. What happens:
- For each query and retrieved T_q, sample G trajectories using the MAS with latent memory injection.
- Score each trajectory with a reward (e.g., pass@1 for code, accuracy for QA).
- Compute group-based advantages: each run is compared to the group mean (stabilizes learning).
- Optimize a token-level clipped objective (PPO-style), where the ratio depends on the memory-augmented policy; gradients flow through the latent memory into the composer.
Why this step exists: Trajectory-level training can dilute credit across many tokens; token-level updates help capture long-horizon coordination cues precisely.
Example with data: On TriviaQA, runs that answer correctly get higher relative advantages; the composer learns to emit latent tokens that emphasize entity linking and retrieval cues for the Assistant.
Anchor: Like adjusting which flashcard facts to keep based on which practice tests you got right.
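A minimal sketch of the token-level clipped surrogate, in the spirit of PPO/GRPO; the exact objective, clipping range, and any regularization terms follow the paper's formulation, which this toy function only approximates. The key point is that `logp_new` is computed under the memory-augmented policy, so the gradient can flow back through the latent tokens into the composer:

```python
import torch

def lmpo_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """
    Token-level clipped surrogate.
    logp_new:  (T,) log-probs of sampled tokens under the current memory-augmented policy
               (requires grad; the graph reaches the composer through the latent memory).
    logp_old:  (T,) log-probs under the behavior policy that produced the rollout (no grad).
    advantage: scalar or (T,) group-relative advantage broadcast over the trajectory's tokens.
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.minimum(unclipped, clipped).mean()
```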
The recipe with concrete I/O:
- Input: New query q, agent role profile γ, experience bank B.
- Step A (Retrieve): Compute embeddings and fetch top-K trajectories T_q.
- Step B (Compose): Feed γ and T_q to the composer to get m (L′=8 latent tokens).
- Step C (Inject): Concatenate m with the agent's hidden states; decode responses.
- Step D (Update): Store the new trajectory back into B.
- Output: Final answers/actions; improved composer after LMPO training.
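Pulling the recipe together, here is a hedged end-to-end sketch of one inference pass; `compose`, `generate_with_memory`, and the agent objects are illustrative stand-ins for whatever interface a given MAS framework exposes, not the authors' API:

```python
def latentmem_step(query, agents, bank, composer, k=3):
    """One LatentMem-augmented pass: retrieve -> compose per role -> inject -> act -> store."""
    trajectories = bank.retrieve(query, k=k)        # Step A: top-K raw past trajectories
    transcript = []
    for agent in agents:                            # e.g., Strategy, Code, Test, Summarizer
        # Step B: compress retrieved pasts into L'=8 role-conditioned latent tokens.
        m = composer.compose(role_profile=agent.profile, trajectories=trajectories)
        # Step C: the agent decodes with its prompt embeddings concatenated to m.
        reply = agent.generate_with_memory(query, transcript, latent_memory=m)
        transcript.append((agent.name, reply))
    bank.add(query, transcript)                     # Step D: online update of the experience bank
    return transcript
```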
What breaks without each step:
- No retrieval: Composer guesses without evidence; memories become generic.
- No role profile: All agents receive similar memories; roles blur; errors correlate.
- No latent injection: The memory can't shape token-by-token reasoning; weak impact.
- No LMPO: Composer can't learn what actually helps; token waste returns.
Secret sauce:
- Fixed-length, role-aware latent tokens tame context length while preserving the most helpful bits.
- Differentiable memory injection lets rewards backprop to the composer.
- Token-level LMPO targets the exact places in long interactions where memory matters most.
04 Experiments & Results
Hook: Think of a school tournament where teams compete across subjects: trivia, coding, planning. We don't just want raw scores; we want to know who improved most, who used fewer pages of notes, and who finished faster.
The Concept (The Test): Evaluate accuracy and efficiency across tasks and frameworks. What they measured:
- Task success (e.g., accuracy on QA, pass rates on code tests).
- Token usage (how many tokens the system reads/writes).
- Inference time (how long it takes to finish).
Why it matters: Good memory should boost correctness while using fewer resources.
Anchor: It's like scoring A's while studying fewer pages in less time.
Competition:
- Baselines included memory-free systems and popular memory designs: Voyager, Generative, JoyAgent, MetaGPT, ChatDev, OAgents, G-Memory.
- MAS frameworks tested: AutoGen, MacNet (seen in training), CAMEL, DyLAN (unseen).
- Datasets: TriviaQA, StrategyQA, PopQA (QA); KodCode, BigCodeBench (code); PDDL (symbolic planning).
Scoreboard with context:
- Performance: LatentMem improved results by up to 19.36% over vanilla settings and consistently beat memory baselines. Examples: +16.20% on AutoGen + TriviaQA; +18.45% on code (Llama-3.1-8B, AutoGen + KodCode). Think: moving from a class average of B- to a solid A.
- Efficiency: About 50% fewer tokens and roughly two-thirds the inference time compared to mainstream memories. That's like solving the test in two-thirds the time with half the notes.
- Generalization: +7.10% on PDDL (out-of-domain) and +7.90% on CAMEL (unseen MAS). Unlike many baselines that drop in new settings, LatentMem stays strong: role-aware memory travels well.
Surprises and insights:
- LMPO vs. direct fine-tuning (MARTI): Under the same budget, LatentMem often wins by notable margins (e.g., +11.73% on AutoGen + TriviaQA). Teaching the memory maker can be better than retraining all agents.
- Scaling token count K: While some baselines degrade with more retrieved trajectories (overload), LatentMem keeps improving because it always compresses to a fixed-length latent memory.
- Role clustering: t-SNE plots show memories cluster cleanly by role across datasets and frameworks. That means the composer reliably avoids homogenization.
Real examples of impact:
- In PDDL, LatentMem reduces step repetition and avoids blindly copying mismatched past steps.
- In code gen, it guides the Test Agent toward failure cases and the Code Agent toward implementable patterns, boosting pass rates.
Bottom line: LatentMem doesn't just score higher; it does so with fewer tokens and faster inference, and it keeps those gains when you change the subject or the team lineup.
05 Discussion & Limitations
Limitations:
- Training cost: LMPO requires rollout sampling and reward evaluation; collecting and scoring multi-agent trajectories can be compute- and time-intensive.
- Composer dependence: If role profiles are poorly written or misleading, the composer may tailor unhelpful memories.
- Sparse rewards: Tasks with weak or noisy rewards may slow learning; careful reward shaping helps.
- Very small tasks: For trivial problems, the overhead of retrieval and composition may not pay off.
Required resources:
- A lightweight embedding model for retrieval (e.g., MiniLM) and a small transformer composer (LoRA-tuned) per backbone family.
- Access to multi-agent rollouts and a reward function (e.g., exact-match QA or code unit tests).
When not to use:
- Tiny, single-shot tasks with no benefit from prior experience.
- Settings with no reliable reward signal or where trajectories are too few to retrieve meaningful neighbors.
- Extremely constrained runtimes where even small retrieval/composition overheads are unacceptable.
Open questions:
- Adaptive length: Can the composer dynamically choose L′ per task/role for even better efficiency?
- Multi-source retrieval: How to best blend heterogeneous memories (tools, docs, external KBs) before compression?
- Safety and robustness: How to detect and downweight harmful or spurious trajectories during composition?
- Co-evolution: What happens if we also lightly adapt agent prompts or routing jointly with the composer for even bigger gains?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces LatentMem, a learnable, role-aware memory system that compresses raw multi-agent experiences into short latent tokens for each agent. A new training method, LMPO, pushes task rewards back through these latent memories so the composer learns what truly helps. The result is higher accuracy, fewer tokens, faster inference, and stronger generalization across tasks and team organizations.
Main achievement: Showing that fixed-length, role-conditioned latent memories, trained end-to-end, can beat hand-crafted, text-heavy memories in both performance and efficiency without changing agent frameworks.
Future directions:
- Make memory length adaptive and uncertainty-aware.
- Blend multiple retrieval sources and add safety filters.
- Co-train light agent routing/prompting with the composer for synergistic gains.
Why remember this: LatentMem reframes memory as compact, learned, role-specific signals rather than long text dumps, an idea that scales better, generalizes further, and makes multi-agent teams both smarter and leaner.
Practical Applications
- Software teams of agents: Strategy, Coding, Testing, and Summarizing agents solve tickets with fewer tokens and higher pass rates.
- Customer support triage: Role-aware memories for retrieval, diagnosis, and response agents speed up accurate ticket resolution.
- Research assistants: Planner, Retriever, and Writer agents use compact memories to craft grounded literature reviews faster.
- Data engineering: Pipeline-builder, Validator, and Monitor agents remember schema and failure patterns without bloated logs.
- Game design: Designer, Coder, and QA agents reuse level-generation and balancing tricks via short latent hints.
- Education: Tutor, Hints, and Grader agents adapt to student history using compact, role-specific guidance.
- Robotics planning: Navigator and Manipulator agents reuse successful motion and recovery routines in tight memory budgets.
- Healthcare workflow prototyping: Intake, Risk-check, and Plan agents recall role-specific protocols while minimizing context size.
- Business analytics: Analyst, Forecaster, and Reporter agents compress prior analyses into role-aligned latent cues.
- Tool-using agents: Router and Executor roles get tailored memory for which tools to pick and how to apply them safely.