LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
Key Summary
- Multi-agent LLM systems often use LoRA adapters so each agent has a specialized role, but the agents all rebuild almost the same KV cache, wasting memory and time.
- This paper observes that most of what is stored in the cache is nearly identical across agents; only a tiny low-rank piece differs because of the LoRA adapter.
- LRAgent splits the value cache into a shared base cache (the big common part) and a small adapter part stored in low-rank form (the LR cache).
- Two schemes are proposed: BaseShared (share the base cache; keep per-agent low-rank caches) and BaseLRShared (share both the base and low-rank caches when the LoRA down-projection is shared).
- A new kernel, Flash-LoRA-Attention, reorders the math so the low-rank piece is handled cheaply before it is expanded, saving substantial compute.
- Across HotpotQA and ScienceQA with 8B models, accuracy stays close to the non-shared baseline while delivering large efficiency gains.
- BaseLRShared approaches the throughput and first-token latency of fully shared caching, without the larger accuracy drops other methods see.
- Memory use drops to roughly one third of the non-shared baseline, letting much longer contexts fit on the same GPU.
- Compared to DroidSpeak and fully shared caches, LRAgent strikes a better balance: fast like full sharing, accurate like non-sharing.
- Code is available so teams can adopt these ideas in real multi-agent, tool-using systems.
Why This Research Matters
Multi-agent assistants often read the same long context many times, wasting memory and compute; LRAgent fixes that without dulling each agent’s special skills. By sharing the big common cache and keeping only tiny personalized parts, teams can serve longer documents on the same hardware. Faster first tokens and higher throughput mean smoother user experiences, especially when tools like web search are in the loop. The approach avoids expensive retraining or architectural overhauls, making it practical to adopt. On-device or cost-sensitive deployments benefit because memory footprints shrink significantly. Finally, the ideas generalize beyond one benchmark, pointing toward scalable, efficient long-context multi-agent systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine three friends studying the same textbook. Each friend makes their own giant photocopy of every chapter before adding a few sticky notes with personal thoughts. That’s a lot of paper for almost the same pages!
🥬 The Concept (LLM agents and their memory): Large Language Models (LLMs) can work as a team of agents that take turns thinking, using tools, and writing answers; they keep notes about past words in something called a KV cache so they don’t re-read the whole book each time. How it works:
- An LLM agent reads a long history (the trajectory) that includes past thoughts and tool outputs.
- It builds a key-value (KV) cache to speed up attention so it can focus on the right parts later.
- In multi-agent systems, several agents read almost the same long history and each builds its own KV cache. Why it matters: If every agent builds and stores its own huge cache for the same text, memory explodes and everything slows down. 🍞 Anchor: Think of three siblings watching the same movie on separate tablets, each downloading the full video. That’s triple the bandwidth and storage for the same show.
🍞 Hook: You know how you can customize a backpack with a tiny patch instead of sewing a whole new bag? A small change can specialize something big.
🥬 The Concept (LoRA): LoRA is a way to fine-tune a big model by learning two small low-rank matrices instead of changing all the giant weights. How it works:
- Keep the original model frozen.
- Learn a down-projection (A) that squeezes information into a tiny r-dimensional space.
- Learn an up-projection (B) that maps the tiny space back to the model’s size. Why it matters: You get role-specific behavior using only a small number of new parameters, saving training and memory. 🍞 Anchor: It’s like adding a small snap-on lens to a camera to specialize it, rather than buying a whole new camera.
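To make the shapes concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer; the sizes (d_model = 4096, rank r = 8) and variable names are illustrative assumptions, not the paper's code.

```python
import torch

d_model, r = 4096, 8                     # illustrative sizes; the paper trains rank-8 adapters
W = torch.randn(d_model, d_model)        # frozen backbone weight (never updated)
A = torch.randn(d_model, r) * 0.01       # LoRA down-projection (trained)
B = torch.zeros(r, d_model)              # LoRA up-projection (trained, initialized to zero)

def lora_linear(h: torch.Tensor) -> torch.Tensor:
    """Backbone output plus a small low-rank, role-specific correction."""
    base = h @ W                         # big, role-agnostic part
    low_rank = (h @ A) @ B               # tiny r-dimensional detour
    return base + low_rank

h = torch.randn(16, d_model)             # hidden states for 16 tokens
out = lora_linear(h)                     # same shape as the backbone output
```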
🍞 Hook: Imagine roommates sharing a big apartment but decorating their own corners differently.
🥬 The Concept (Multi-LoRA): Multi-LoRA means several agents share one big pretrained model but each uses its own tiny LoRA adapter to specialize for a role. How it works:
- All agents use the same backbone weights (the shared apartment).
- Each agent adds its own LoRA A and B (their decor) to act differently.
- At inference time, each agent runs with the shared backbone + its adapter. Why it matters: You get many specialized agents without duplicating the whole model. 🍞 Anchor: A planner, a tool user, and a checker all live in one model, each with a different LoRA that makes them good at their jobs.
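A hedged sketch of how several role adapters can hang off one frozen backbone; the role names and dictionary layout are assumptions chosen for illustration.

```python
import torch

d_model, r = 4096, 8
W_v = torch.randn(d_model, d_model)      # shared, frozen backbone projection

# One tiny (A, B) pair per agent role; only these small matrices are trained.
adapters = {
    role: (torch.randn(d_model, r) * 0.01, torch.zeros(r, d_model))
    for role in ("plan", "action", "reflect")
}

def project(h: torch.Tensor, role: str) -> torch.Tensor:
    """Same backbone for everyone; the active role only swaps in its adapter."""
    A, B = adapters[role]
    return h @ W_v + (h @ A) @ B
```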
🍞 Hook: When you read a chapter once and bookmark it, you don’t need to reread every word later; you just jump to the bookmark.
🥬 The Concept (KV cache): A KV cache stores the keys and values that attention needs so the model can quickly look back without recomputing everything. How it works:
- As the model reads tokens, it builds key and value vectors.
- These vectors are saved as the KV cache.
- Later tokens use attention over the cache to stay fast. Why it matters: Without KV caches, long-context generation would be very slow and expensive. 🍞 Anchor: It’s like keeping an index of a book so you can find any topic instantly without re-flipping every page.
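A minimal sketch of the bookkeeping a KV cache does during decoding (single attention head, no batching); tensor names and sizes are illustrative.

```python
import math
import torch

d = 64                                   # head dimension (illustrative)
k_cache, v_cache = [], []                # grow by one entry per processed token

def attend(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Append the new key/value, then attend over everything cached so far."""
    k_cache.append(k)
    v_cache.append(v)
    K = torch.stack(k_cache)             # (tokens_seen, d)
    V = torch.stack(v_cache)             # (tokens_seen, d)
    scores = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
    return scores @ V                    # earlier tokens are never reprojected
```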
The world before: Teams built multi-agent LLM systems because specialized roles (plan, act, reflect) solve complex tasks better. They used LoRA to cheaply fine-tune each role. But each agent processed mostly the same long context, built its own KV cache, and repeated a lot of computation. Memory filled up, latency grew, and serving costs climbed.
The problem: How can we let agents share the big common part of their caches while keeping each agent’s special LoRA effect, so we save memory and time without hurting accuracy?
Failed attempts:
- Fully shared caches (treating all agents as identical) can be fast, but the small agent-specific differences vanish and accuracy drops.
- Positional-alignment tricks help reuse chunks but still recompute a lot and aren’t tailored to multi-LoRA.
- Selective recomputation (like DroidSpeak) saves some cache layers but still forces most hidden-state computation for already-seen text.
The gap: No one had a cache-sharing method that directly leverages multi-LoRA’s nature: the backbone is shared and similar across agents, while adapters are small and different.
Real stakes: Faster customer chatbots that use web search, cheaper knowledge assistants that read long documents, and on-device multi-persona helpers that must fit in tight memory all depend on solving this wasteful duplication. With long contexts (tens of thousands of tokens), the difference is the line between smooth experiences and out-of-memory crashes.
02 Core Idea
🍞 Hook: Imagine three kids copying the same 100-page worksheet. Instead, one kid copies the pages once, and the others just add their tiny notes. Everyone finishes faster with less paper.
🥬 The Concept (Key insight): Split the value cache into two parts: a big common base cache shared by all agents and a tiny low-rank adapter part that’s unique to each agent; share the big part, store the tiny part compactly, and only expand it when needed. How it works:
- Notice that across agents reading the same text, the shared backbone’s outputs are nearly the same, while LoRA adds small, low-rank differences.
- Save and reuse the base cache for everyone.
- Save each agent’s LoRA contribution in low-rank form (the LR cache), not full size.
- Reconstruct the full effect only when necessary, and do it cleverly so it stays cheap. Why it matters: You cut memory and repeated computation while keeping each agent’s unique behavior. 🍞 Anchor: One central library holds the big textbook; each student keeps only small, personalized sticky notes.
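A sketch of the core split, assuming LoRA is applied to the value projection as in the paper's setup: the full-width base part is computed and stored once, while only the tiny r-dimensional middle activation is kept as the LR cache. Sizes and names are illustrative.

```python
import torch

d_model, r, L = 4096, 8, 1024            # hidden size, LoRA rank, context length (illustrative)
W_v = torch.randn(d_model, d_model)      # frozen backbone value projection
A = torch.randn(d_model, r) * 0.01       # an agent's LoRA down-projection
B = torch.zeros(r, d_model)              # that agent's LoRA up-projection

h = torch.randn(L, d_model)              # hidden states for the shared context

base_cache = h @ W_v                     # (L, d_model): stored once, shared by all agents
lr_cache = h @ A                         # (L, r): tiny, agent-specific (or shared if A is shared)

# Reconstruct the full, agent-specific value cache only when it is actually needed.
full_values = base_cache + lr_cache @ B  # equals h @ W_v + (h @ A) @ B
```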
Multiple analogies:
- Library and sticky notes: Base cache = library book; LR cache = small sticky notes per student.
- Kitchen and spices: Base cache = the main soup; LR cache = a tiny spice mix each chef adds at the end.
- Highway and exits: Base cache = the main highway used by all; LR cache = short exit ramps added per destination.
Before vs After:
- Before: Every agent rebuilt the whole cache and kept full copies; speed and memory suffered.
- After: Agents reuse a shared base cache; only their tiny adapter parts are kept compactly and applied late, so memory drops and speed rises.
Why it works (intuition):
- The pretrained backbone encodes general language knowledge that doesn’t change across roles, so its cache is nearly identical across agents.
- LoRA adds low-rank tweaks; low-rank means we can store and process them in a tiny space and expand only at the very end.
- Because the adapter’s effect is small and decorrelated across agents, sharing the base preserves accuracy while avoiding conflicts.
Building blocks (each as a mini sandwich):
- 🍞 Hook: You know how you keep a master copy of a worksheet and only write your answer on a small answer sheet? 🥬 The Concept (Base cache): The base cache is the part of the value cache produced by the frozen backbone that’s almost the same for all agents. How it works: Run the backbone once, store its value outputs, and let everyone reuse them. Why it matters: It avoids N copies of nearly the same thing. 🍞 Anchor: One shared master printout on the classroom wall.
- 🍞 Hook: Imagine you carry only a small card with your personal notes. 🥬 The Concept (Adapter output): The adapter output is the LoRA-induced difference from the base outputs. How it works: It’s computed as (hidden states × A) × B, where (hidden states × A) is small. Why it matters: This small piece carries the role’s personality. 🍞 Anchor: Your signature sticky notes.
- 🍞 Hook: Folding a big poster into a small pamphlet saves space. 🥬 The Concept (LR cache): The LR cache stores the adapter’s tiny middle piece (hidden states × A) of size rank r. How it works: Save only the compact r-dimensional activations, then multiply by B later to get full size. Why it matters: Memory shrinks dramatically. 🍞 Anchor: Keep the pamphlet, not the poster.
- 🍞 Hook: Sharing is caring (and faster). 🥬 The Concept (KV cache sharing): Reusing already computed caches across agents. How it works: Store once, reuse many times. Why it matters: Less memory and less work. 🍞 Anchor: Everyone uses the same map instead of drawing their own.
03 Methodology
At a high level: Input tokens → build shared base cache once → store tiny per-agent (or shared) LR caches → during attention, add the tiny adapter influence efficiently → Output tokens.
We now introduce the two schemes and the speed-up kernel, each with the sandwich pattern when first mentioned.
🍞 Hook: Picture all teams sharing the main cookbook, while each chef keeps a small spice packet.
🥬 The Concept (BaseShared): BaseShared shares the base value cache across all agents and keeps a separate low-rank (LR) cache per agent. How it works (step by step):
- First agent reads the context and computes the value cache from the frozen backbone; save this as the base cache.
- That first agent also saves its small LR cache (hidden states × A_agent) while generating.
- When another agent arrives, it reuses the base cache (no redoing the big value projection), and it computes its own LR cache for any parts it hasn’t seen yet.
- During attention, reconstruct the adapter’s effect by multiplying the LR cache by B_agent when needed. Why it matters: Memory drops to roughly 1/N of the original for N agents (plus a tiny LR part), but you still pay some compute to build LR caches for earlier text the new agent hasn’t touched. 🍞 Anchor: One big shared soup for everyone, but each chef quickly remixes their own spice packet before serving.
What breaks without it: If you don’t share the base, you store N almost-identical caches and run N redundant projections. If you store full adapter outputs instead of LR, you lose most memory savings.
Concrete example: Three agents read a 30k-token trajectory. Instead of 3 full caches, you keep 1 full base cache plus 3 tiny LR caches, saving tens of gigabytes at long lengths.
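To make the BaseShared bookkeeping concrete, here is a hedged sketch: one shared base cache, one tiny LR cache per agent, reconstructed only when that agent attends. The function names and structure are illustrative, not the paper's implementation.

```python
import torch

d_model, r = 4096, 8
W_v = torch.randn(d_model, d_model)      # frozen backbone value projection

base_cache = torch.empty(0, d_model)     # one copy, shared by every agent
lr_caches = {}                           # role -> (tokens, r) tensor, one per agent

def extend_base(h_new: torch.Tensor) -> None:
    """Run once per new chunk, by whichever agent processes it first."""
    global base_cache
    base_cache = torch.cat([base_cache, h_new @ W_v])

def lr_prefill(h: torch.Tensor, role: str, A_role: torch.Tensor) -> None:
    """Each agent fills its own tiny LR cache for tokens it has not covered yet."""
    prev = lr_caches.get(role, torch.empty(0, r))
    lr_caches[role] = torch.cat([prev, h @ A_role])

def agent_values(role: str, B_role: torch.Tensor) -> torch.Tensor:
    """Reconstruct this agent's value cache: shared base plus its low-rank tweak."""
    return base_cache + lr_caches[role] @ B_role
```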
🍞 Hook: Imagine that all chefs agree to use the same grinder that preps the spices; now you can even share some spice prep across the team.
🥬 The Concept (Shared-A and BaseLRShared): In many setups, the down-projection A is nearly the same across tasks; if everyone literally shares A, then the LR cache (hidden states × A) becomes common, too. BaseLRShared shares both base cache and LR cache. How it works (step by step):
- Train or configure agents so they share the same A (the down-projection), but keep different B’s (up-projections) for their roles.
- As tokens stream in, compute and save the base cache once and the single shared LR cache once.
- Any agent can form its own adapter effect by multiplying the shared LR with its own B at attention time.
- When switching agents, there is no need to recompute hidden states for earlier text; only new tokens are processed. Why it matters: You cut both memory and compute. Prefill work is no longer repeated per agent; speed approaches that of fully shared caches, but accuracy stays high. 🍞 Anchor: One common spice preparation (LR cache) for the whole kitchen; each chef adds their own final twist (B) at plating time.
What breaks without it: If A isn’t shared, the LR caches differ and you must rebuild them per agent, losing the compute win.
Concrete example: Over a 60k-token trajectory with three agents, BaseLRShared processes the past once; each agent only handles the new chunk, approaching the best possible throughput.
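A sketch of the BaseLRShared variant under the shared-A assumption: both caches are built exactly once, and switching agents only swaps the up-projection B applied at attention time. Names and structure are again illustrative.

```python
import torch

d_model, r = 4096, 8
W_v = torch.randn(d_model, d_model)        # frozen backbone value projection
A_shared = torch.randn(d_model, r) * 0.01  # single down-projection shared by all agents

base_cache = torch.empty(0, d_model)       # built once, shared
lr_cache = torch.empty(0, r)               # also built once, because A is shared

def extend(h_new: torch.Tensor) -> None:
    """Only genuinely new tokens are processed, regardless of which agent is active."""
    global base_cache, lr_cache
    base_cache = torch.cat([base_cache, h_new @ W_v])
    lr_cache = torch.cat([lr_cache, h_new @ A_shared])

def agent_values(B_role: torch.Tensor) -> torch.Tensor:
    """Switching agents just swaps in a different B; no prefill over earlier text."""
    return base_cache + lr_cache @ B_role
```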
🍞 Hook: Instead of blowing up a tiny thumbnail into full size for every page, first decide how much that page even matters, then scale the thumbnail once.
🥬 The Concept (Flash-LoRA-Attention): A custom attention kernel that rearranges the math so the LR cache is combined with attention weights before expanding, keeping most work in the tiny rank r space. How it works (step by step):
- Usual way: Expand LR cache to full size (big) for all past tokens, then do attention — expensive.
- Smart way: Do attention weighting on the LR cache first (cheap, because it’s r-dimensional along the long sequence), then multiply by B once at the end of the block.
- Implement this reordering inside FlashAttention’s memory-efficient kernel to keep it fast and GPU-friendly. Why it matters: The heavy cost tied to the long sequence length runs in the tiny rank r instead of the large model dimension, saving lots of time. 🍞 Anchor: Skim pages with a small magnifying glass (low-rank) before you decide which single page to print big.
What breaks without it: If you expand LR to full size first, cost scales with both long sequence length and large output dimensions; you lose most of the speed gains.
Concrete example with data: Suppose rank r = 8 while the full value dimension is on the order of a thousand or more; computing along the sequence length L in the tiny r-dimensional space is orders of magnitude cheaper than expanding to full size for every cached token.
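The reordering can be checked directly: because matrix multiplication is associative, applying the attention weights to the tiny (L, r) LR cache first and the up-projection B once afterwards gives the same result as expanding first. The sketch below is the plain (unfused) version of that idea, not the FlashAttention-integrated kernel; sizes are illustrative.

```python
import torch

L, r, d_v = 4096, 8, 1024                # context length, LoRA rank, value width (illustrative)
scores = torch.softmax(torch.randn(1, L), dim=-1)  # attention weights for one query
lr_cache = torch.randn(L, r)             # cached per-token adapter activations (h @ A)
B = torch.randn(r, d_v)                  # the active agent's up-projection

# Naive order: expand every cached token to full width, then attend.
# Work along the long axis: roughly L*r*d_v + L*d_v multiply-adds.
naive = scores @ (lr_cache @ B)

# Reordered: attend in the tiny r-dimensional space, expand once at the end.
# Work along the long axis: roughly L*r, plus a single r*d_v expansion.
reordered = (scores @ lr_cache) @ B

assert torch.allclose(naive, reordered, atol=1e-4)  # identical result, far less work
```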
Putting it all together (recipe view):
- Input → Build/Reuse base cache → Build/Reuse LR cache (per agent for BaseShared; once for BaseLRShared) → During attention, compute base part normally and compute LR part via Flash-LoRA-Attention → Add them → Output.
- Secret sauce: Decouple what’s common (share it) from what’s special (store it small), and delay expensive steps until after you’ve already shrunk the work.
Example walkthrough across agents:
- Step 1 (Plan agent): Reads 10k tokens, saves base cache and LR cache (shared or per-agent, depending on scheme), answers or calls a tool.
- Step 2 (Action agent): Reuses base cache. In BaseShared, it LR-prefills earlier tokens; in BaseLRShared, it skips that and only handles new tokens (say +2k). Attends with fast LR math.
- Step 3 (Reflect agent): Same reuse logic; decides to finish or send back for another plan-pass, and so on.
04 Experiments & Results
🍞 Hook: Think of a relay race. If runners can reuse the same warm-up and hand off their gear efficiently, the whole team finishes faster without tripping.
🥬 The Concept (The test): The authors tested accuracy, speed (throughput and time-to-first-token), and memory on two 8B instruction-tuned models (LLaMA-3.1-8B-Instruct and Ministral-8B-Instruct) across two multi-step QA tasks (HotpotQA and ScienceQA) using three agents (plan, action, reflect). How it works:
- Train role-specific LoRA adapters (rank 8 on query and value) using agent trajectories.
- Evaluate several cache strategies: Non-Shared (baseline), FullShared (everything shared), DroidSpeak (selective recomputation), BaseShared, and BaseLRShared.
- Measure not just raw speed but also accuracy and end-to-end latency with tool calls, because lower accuracy often makes agents take more steps and grow the context. Why it matters: A good method must be both fast and accurate; speed alone isn’t enough if it causes more steps or wrong answers. 🍞 Anchor: A relay team that runs a bit faster but drops the baton more often won’t win overall.
The competition (baselines):
- Non-Shared: Accurate but memory- and compute-heavy.
- FullShared: Fastest in theory, but often hurts accuracy because it ignores agent differences.
- DroidSpeak: Recomputes critical layers to balance reuse and accuracy, but still does a lot of hidden-state work.
Scoreboard with context:
- Accuracy: BaseShared stayed within about 0.7 percentage points of Non-Shared on average; BaseLRShared within about 1.5 points. FullShared and DroidSpeak dropped more (up to around 5.3 and 2.6 points). That’s like keeping an A- when others slipped to B or C on some tests.
- Throughput: With Flash-LoRA-Attention, BaseLRShared nearly matched FullShared across sequence lengths up to tens of thousands of tokens, achieving large speedups versus Non-Shared. BaseShared also sped up and matched or beat DroidSpeak in many traces.
- Time-to-first-token (TTFT): BaseLRShared cut TTFT by up to about 4.4×, approaching FullShared; BaseShared also reduced TTFT and was competitive with DroidSpeak.
- Memory: Both LRAgent schemes used nearly one third of the memory of Non-Shared for long contexts and were close to FullShared.
Surprising findings:
- Sharing only the base cache (not the whole cache) preserved accuracy much better than fully sharing everything. This supports the paper’s insight that the base is common but the adapter is crucial.
- Sharing A (the down-projection) not only enabled compute savings but also improved accuracy compared to separate A’s — a win-win.
- Methods that look fast in fixed traces can be slower end-to-end if they hurt accuracy and cause longer trajectories; BaseLRShared avoided this trap.
Concrete number sense:
- Think of throughput jumps like going from a typical neighborhood speed to highway speed; BaseLRShared approaches the fast lane of FullShared without the accuracy ticket.
- Memory savings let you handle 60k+ token sequences on a single 48GB GPU, where Non-Shared might hit out-of-memory.
🍞 Anchor: The best relay team reuses shared prep, keeps each runner’s strengths, and passes the baton smartly; that’s why they finish fast and still hit the target time.
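A back-of-envelope check of the 48GB claim above, assuming a LLaMA-3.1-8B-style cache layout (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 values); exact numbers depend on the serving stack.

```python
# Back-of-envelope KV-cache sizing; the model configuration below is an assumption,
# not a figure from the paper.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
tokens, agents = 60_000, 3

per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # keys + values, all layers
one_copy_gib = tokens * per_token / 2**30                   # ~7.3 GiB for one full cache

non_shared = agents * one_copy_gib                          # ~22 GiB of cache alone
shared_base = one_copy_gib                                  # plus tiny per-agent LR caches

print(f"non-shared ~ {non_shared:.1f} GiB, shared base ~ {shared_base:.1f} GiB")
# With ~15 GiB of fp16 weights on top, the non-shared setup crowds a 48 GB GPU,
# while the shared-base setup leaves headroom for much longer contexts.
```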
05 Discussion & Limitations
🍞 Hook: Even the best backpacks have weight limits; you still need to pack wisely and know where you’re going.
🥬 The Concept (Limitations and when not to use): LRAgent shines when multiple agents share most of their context and use LoRA adapters, especially with shared A; it’s less ideal when agents read very different texts or when adapters modify keys heavily. How it works (caveats and needs):
- BaseShared still recomputes LR caches for earlier tokens when agents switch; it saves memory big-time but compute savings are modest.
- BaseLRShared depends on shared-A multi-LoRA; without shared A, you lose the shared LR cache and most compute gains.
- If LoRA also modifies the key projections (the setups here adapt only Q and V), reordering the attention math becomes trickier because of positional encodings (like RoPE), and the benefits may shrink.
- The custom Flash-LoRA-Attention kernel must be integrated carefully with your stack (FlashAttention versions, GPU types). Why it matters: Knowing boundaries helps you pick the right scheme and avoid surprises. 🍞 Anchor: If your teammates all study different books, sharing one big summary doesn’t help much.
Required resources:
- A GPU that supports FlashAttention-style kernels.
- Multi-LoRA checkpoints for your agents; ideally a shared-A setup for BaseLRShared.
- Engineering time to integrate the kernel and cache manager.
When not to use:
- Tasks where agents rarely share context or read totally different documents.
- Pipelines where LoRA adapts the key projections heavily and you can’t accept smaller speedups.
- Extremely constrained hardware where custom kernels can’t be deployed.
Open questions:
- How far does this extend to larger models, more agents, or 1M-token contexts?
- Can dynamic routers (mixtures of B’s) co-exist with shared LR caches cleanly?
- How well does it combine with quantization or other KV compression tricks?
- Can similar reordering speedups help when LoRA adapts keys under RoPE?
- What are the best policies to decide when to reconstruct adapter effects or to cache intermediate products?
Overall assessment: LRAgent delivers a strong accuracy–efficiency balance in realistic multi-agent, long-context settings and offers a practical path to lower memory and latency without retraining exotic architectures.
06 Conclusion & Future Work
Three-sentence summary: LRAgent observes that across agents the big part of the value cache is nearly the same, while the LoRA-induced part is small and low-rank. It splits the cache into a shared base and a compact LR cache and uses a new Flash-LoRA-Attention kernel to apply the adapter effect cheaply at the right time. This keeps accuracy near non-shared levels while approaching the speed and memory benefits of fully shared caching.
Main achievement: Turning multi-LoRA’s structure into a concrete cache-sharing strategy (BaseShared and BaseLRShared) that reduces memory and redundant compute, plus a kernel that makes the low-rank math pay off in practice.
Future directions:
- Extend reordering tricks to key-side adapters and other positional schemes.
- Explore hybrid policies (some shared, some selective recomputation) guided by live accuracy signals.
- Combine with quantization or cross-layer SVD for even larger memory wins.
- Scale to more agents, bigger models, and million-token contexts.
Why remember this: It’s a clear example of “share the common, compress the special, compute late and small,” turning a simple observation about similarity into a measurable win for real multi-agent systems that read long, tool-augmented contexts.
Practical Applications
- Speed up enterprise multi-agent copilots that plan, search internal wikis, and draft reports from long documents.
- Cut cloud costs for customer-support bots by sharing caches across planner, tool-caller, and verifier agents.
- Enable multi-persona chat on a single GPU or high-end edge device by fitting longer contexts in memory.
- Accelerate RAG pipelines where several agents iterate over retrieved passages and summaries.
- Improve TTFT for interactive tool-using assistants, reducing user wait times in web and mobile apps.
- Scale multi-step scientific assistants (plan/act/reflect) to longer papers without out-of-memory failures.
- Boost throughput in batch serving of multi-agent workflows for analytics and data exploration.
- Combine with shared-A training to simplify adapter management while gaining both accuracy and speed.
- Deploy more robust long-context tutoring systems that iterate over textbooks, notes, and web sources.
- Strengthen incident response copilots that repeatedly consult logs and knowledge bases across agent roles.