Memory Caching: RNNs with Growing Memory
Key Summary
- Recurrent neural networks (RNNs) are fast but forgetful because they squeeze everything they've seen into a tiny, fixed memory.
- Transformers remember a lot by keeping all past tokens, but that is slow and costly because it grows with the square of the sequence length.
- Memory Caching (MC) lets RNNs keep compact "chapter summaries" of the past, so their effective memory grows with the sequence length.
- MC splits a long sequence into segments, caches each segment's final hidden state, and lets the current token read from both the live (online) memory and selected cached memories.
- There are four ways to combine cached memories: simple residual addition, Gated Residual Memory (GRM), Memory Soup (mixing parameters), and Sparse Selective Caching (SSC), which picks only the most relevant caches.
- MC provides a smooth trade-off from O(L) RNNs to O(L^2) Transformers, landing in the middle at O(NL), where N is the number of cached segments.
- Across language modeling, long-context, and retrieval benchmarks, MC consistently boosts RNN-style models and narrows the gap with Transformers on recall-heavy tasks.
- SSC gives strong accuracy with low overhead by loading only a few relevant caches per token, improving both memory use and speed on long sequences.
- Even applied post-training, simple MC improves length extrapolation, showing it is a flexible, general technique rather than a single architecture.
- MC turns the old belief that RNNs must have fixed memory on its head by letting their memory grow sensibly with context.
Why This Research Matters
Many real tasks involve very long inputs: legal contracts, medical notes, codebases, logs, and book-length documents. Transformers recall well but become slow and memory-hungry as inputs grow, while traditional RNNs are fast but forget the distant past. Memory Caching offers a middle path: keep compact, searchable summaries so models remember more without paying full Transformer costs. This means better answers on long documents, fewer missed references, and stronger factual grounding. It can be added to several recurrent architectures and even used post-training, making it practical in deployed systems. SSC further makes long-context use cases affordable by consulting only the most relevant caches.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine trying to remember a whole school year's lessons using just one tiny sticky note. You'll quickly run out of space and forget important details. But if you write short summaries at the end of each week and keep them, you can remember much more without carrying your whole notebook every day.
🥬 The Concept (RNNs):
- What it is: Recurrent Neural Networks (RNNs) are models that process one token at a time and keep a fixed-size hidden state (a tiny memory) to remember the past.
- How it works:
- Read the next token.
- Update the hidden state to squeeze in whatās new.
- Use that hidden state to make the next prediction.
- Why it matters: Without enough space, RNNs must overwrite older information; on very long inputs they forget crucial facts, hurting recall. 🍞 Anchor: Like writing over the same sticky note every day; last week's math formula gets erased by today's history date.
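To make the overwriting concrete, here is a minimal toy sketch; the blending rule and the decay constant are illustrative, not any specific RNN:

```python
# Toy fixed-size recurrent memory (illustrative): each step blends the new
# token into the hidden state, so earlier tokens are gradually overwritten.

def rnn_step(hidden, token, decay=0.5):
    # Keep half of the old memory, mix in half of the new token.
    return [decay * h + (1 - decay) * t for h, t in zip(hidden, token)]

hidden = [0.0, 0.0]
for token in [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]:
    hidden = rnn_step(hidden, token)

# The first token's trace has shrunk to 1/16 after three later updates.
print(hidden)   # [0.0625, 0.875]
```

The hidden state stays the same size no matter how long the input gets, which is exactly why distant facts fade.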
🍞 Hook: Now picture a friend who keeps every page they ever wrote so they can look back at anything anytime. That's amazing for memory, but their backpack gets very heavy.
🥬 The Concept (Transformers and Attention):
- What it is: Transformers use attention to look back at all past tokens directly; they "cache" every token so memory grows with the sequence.
- How it works:
- Turn each token into key, query, and value vectors.
- For the current token, compare its query to all past keys to find relevance.
- Mix past values using those relevance scores to produce the output.
- Why it matters: This great recall comes with quadratic cost; doubling the length nearly quadruples the work and memory. 🍞 Anchor: Like flipping through the whole photo album every time to find the right picture: it works well, but it is slow and storage-heavy.
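The three attention steps above can be sketched for a single query; this is a simplified 1-D version without the usual square-root-of-dimension scaling, and all names and numbers are illustrative:

```python
import math

def attend(query, keys, values):
    # Compare the query to every past key...
    scores = [query * k for k in keys]
    # ...turn the scores into softmax relevance weights...
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    # ...and mix the past values by those weights.
    return sum(e / z * v for e, v in zip(exps, values))

# One token must touch every past key: O(L) work per token, O(L^2) in total.
out = attend(1.0, keys=[2.0, 0.0, -2.0], values=[10.0, 0.0, -10.0])
print(round(out, 3))   # dominated by the best-matching key's value, 10.0
```

Because nothing is ever compressed, recall is excellent, but every new token repeats this full scan over the growing past.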
🍞 Hook: People tried to be clever: can we keep things fast like RNNs but still look back far like Transformers?
🥬 The Concept (Linear/efficient attention and recent RNNs):
- What it is: Linear attention and efficient RNNs compress the past to keep costs lower than Transformers.
- How it works:
- Replace the full pairwise attention with a compact summary (a running memory).
- Update the summary with each new token.
- Read from that summary for output.
- Why it matters: They're faster but still suffer: a single fixed-size summary can't hold everything for ultra-long context, so recall degrades. 🍞 Anchor: It's like keeping one index card for all your subjects; efficient, but you'll forget chapter 1 by chapter 10.
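A minimal sketch of this running-summary idea in the flavor of linear attention; the update M += k v^T is the standard linear-attention form, but the tiny 2-D sizes and values are illustrative:

```python
# The whole past is squeezed into one fixed-size matrix: M += k v^T.

def update(M, k, v):
    return [[M[i][j] + k[i] * v[j] for j in range(len(v))] for i in range(len(k))]

def read(M, q):
    # Reading is q^T M: a query-weighted lookup into the compressed past.
    return [sum(q[i] * M[i][j] for i in range(len(q))) for j in range(len(M[0]))]

M = [[0.0, 0.0], [0.0, 0.0]]
M = update(M, k=[1.0, 0.0], v=[3.0, 0.0])   # token 1
M = update(M, k=[0.0, 1.0], v=[0.0, 5.0])   # token 2
print(read(M, q=[1.0, 0.0]))   # recalls token 1's value: [3.0, 0.0]
```

With only d-by-d slots in M, keys start to collide as the sequence grows, which is exactly the overflow the next section addresses.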
🍞 Hook: What's missing is a middle path: not keeping everything (too expensive) and not keeping almost nothing (too forgetful).
🥬 The Concept (The Gap this paper fills):
- What it is: A simple method for RNN-style models to grow their effective memory with sequence length, without paying Transformer-level costs.
- How it works:
- Break the sequence into segments.
- After finishing a segment, save (cache) the segmentās final memory state.
- When processing new tokens, read both the current memory and some cached memories.
- Why it matters: This scales memory capacity with length, improving recall while keeping computation between RNNs and Transformers. 🍞 Anchor: Like keeping weekly summary cards: today, you can glance at your current notes and also at any past week's summary that seems relevant.
🍞 Hook: Why care in daily life? Think of reading a long book, coding over many files, or searching a months-long chat; important details might be far back.
🥬 The Concept (Real Stakes):
- What it is: Long-context tasks in language, code, logs, and documents need reliable recall over tens of thousands of tokens.
- How it works:
- Keep compact summaries so models can scale to long inputs.
- Add selection, so only the most relevant summaries are consulted.
- Balance speed and memory: use more caches when needed, fewer when not.
- Why it matters: Better recall improves answers, reduces hallucinations, and boosts usefulness on real-world, long documents without massive compute. 🍞 Anchor: A help-bot that can correctly cite a paragraph from 200 pages ago, without grinding to a halt or running out of memory.
02 Core Idea
🍞 Hook: You know how detectives keep case files with timelines and summaries? They don't remember every detail in their heads, but they store compact notes they can pull out when needed.
🥬 The Concept (Memory Caching):
- What it is: Memory Caching (MC) lets RNNs save compact memory "checkpoints" for segments of the input and later read from them, so effective memory grows with sequence length.
- How it works:
- Split the long input into segments (chapters).
- For each segment, run the usual RNN update and save the final memory state (the chapter summary).
- For a new token, combine the current live memory with one or more cached summaries to produce the output.
- Why it matters: It bridges the gap: still efficient like RNNs, but with scalable recall like Transformers, controlled by how many caches you consult. 🍞 Anchor: Today's question can use the current page's notes plus any past chapter summaries that match the topic.
Multiple Analogies:
- Library analogy: Each segment is a book chapter; MC keeps chapter summaries. When stuck, you open the relevant summaries instead of rereading the whole book.
- Breadcrumb trail: As you walk a long path, you drop labeled breadcrumbs (caches). Later, you follow only the breadcrumbs that match your current goal.
- Backpack packing: Instead of carrying every item (Transformers) or only a tiny pouch (RNNs), you pack light but keep foldable checklists from previous trips (caches) you can quickly unfold.
Before vs After:
- Before MC: RNNs had one small, ever-overwriting memory; long-range recall faded, especially for very long sequences.
- After MC: RNNs keep a gallery of compact, queryable memories; recall improves notably, while remaining much cheaper than full attention.
Why It Works (intuition):
- Single summaries overflow: a fixed-size memory must forget. Segmenting creates multiple summaries, preventing early information from being erased.
- Relevance matters: With gating/selectors, tokens consult only the most relevant summaries, limiting noise and cost.
- Checkpoints are optimization waypoints: Each cached state is a snapshot of the model's internal learning on that segment, preserving older patterns.
Building Blocks (five mini-concepts, each Sandwich-style):
🍞 Hook: You don't read a 500-page book without chapters; chapters help organize memory. 🥬 The Concept (Segmentation):
- What: Split the input into segments and compress each into a final memory state.
- How:
- Choose a segment size (constant or logarithmic, etc.).
- Run the RNN updates inside each segment.
- Cache the final state as that segmentās summary.
- Why: Creating multiple summaries prevents the single-memory overflow and enables scalable recall. 🍞 Anchor: Weekly school notes instead of one endless page.
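The three steps above can be sketched as follows; the segment size and the additive `step` rule are illustrative stand-ins for a real recurrent update:

```python
# Minimal segment-and-cache sketch: run the recurrent update as usual and
# snapshot the memory at every segment boundary.

def segment_and_cache(tokens, seg_size, step):
    caches, state = [], 0.0
    for i, tok in enumerate(tokens, start=1):
        state = step(state, tok)       # ordinary recurrent update
        if i % seg_size == 0:
            caches.append(state)       # save the "chapter summary"
    return caches

caches = segment_and_cache(list(range(1, 9)), seg_size=4, step=lambda s, t: s + t)
print(caches)   # two summaries for eight tokens: [10.0, 36.0]
```

Here the state carries over across boundaries (the checkpoint philosophy discussed in the Methodology section); an independent variant would reset `state` after each snapshot instead.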
🍞 Hook: Sometimes you need just a sprinkle of salt, not the whole shaker. 🥬 The Concept (Residual Memory):
- What: Read from live memory plus the sum of cached memories.
- How:
- Compute output from current memory.
- Add outputs from each cached memory.
- Use the sum as the final answer.
- Why: Dead simple and already helps, but treats all caches equally. 🍞 Anchor: Skim all chapter summaries and add their hints together.
🍞 Hook: You don't weigh all past chapters equally; you prefer the ones most related to your current question. 🥬 The Concept (Gated Residual Memory, GRM):
- What: Like Residual Memory, but each cache gets a gate (a learned weight) based on how relevant it is to the current token.
- How:
- Make a small vector from the current token (a connector).
- Compare it with each segmentās pooled representation to get relevance scores.
- Softmax the scores to gates and take a weighted sum of cached outputs plus the live memory.
- Why: Without gates, irrelevant caches add noise; with gates, recall is sharper and more accurate. 🍞 Anchor: You give more attention to the chapter titled "Volcanoes" when the question is about lava.
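A minimal GRM sketch under toy assumptions: 1-D memory outputs, a hand-set `connector` vector, and one reasonable arrangement in which the cached reads are gated while the live read is added ungated:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def grm(live_out, cached_outs, connector, summaries):
    # Relevance of each cached segment to the current token...
    scores = [sum(c * s for c, s in zip(connector, summ)) for summ in summaries]
    gates = softmax(scores)   # gates sum to 1
    # ...then a gate-weighted sum of cached reads plus the live read.
    return live_out + sum(g * o for g, o in zip(gates, cached_outs))

out = grm(live_out=1.0,
          cached_outs=[10.0, -10.0],
          connector=[1.0, 0.0],
          summaries=[[5.0, 0.0], [0.0, 5.0]])   # first cache matches the token
print(round(out, 3))   # pulled strongly toward the relevant cache's output
```

The softmax makes the relevant cache dominate while the irrelevant one contributes almost nothing, which is exactly the noise suppression the gates are for.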
🍞 Hook: Sometimes the recipe changes: you don't just add soups; you blend ingredients to make a new flavor. 🥬 The Concept (Memory Soup):
- What: Instead of summing outputs, mix the cached memory parameters themselves (for deep memories) to build an input-dependent memory module on the fly.
- How:
- Compute gates like GRM.
- Interpolate the parameters of cached memories using those gates.
- Apply the newly mixed memory to the current query.
- Why: For deep/nonlinear memories, mixing parameters can create a specialized retriever tailored to this token. 🍞 Anchor: Like blending past chapter notes into a custom study guide just for today's question.
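A toy sketch of the difference between mixing parameters (Soup) and mixing outputs (GRM-style), using a one-parameter-pair nonlinear "memory"; all numbers and the tiny ReLU module are illustrative:

```python
def apply_memory(params, q):
    # A tiny nonlinear "deep memory": one weight, one bias, a ReLU.
    w, b = params
    return max(0.0, w * q + b)

def soup(cached_params, gates, q):
    # Interpolate the cached modules' parameters with the gates...
    w = sum(g * p[0] for g, p in zip(gates, cached_params))
    b = sum(g * p[1] for g, p in zip(gates, cached_params))
    # ...then apply the one blended module, built per token.
    return apply_memory((w, b), q)

cached = [(2.0, 1.0), (-1.0, 0.0)]
gates = [0.75, 0.25]
blended = soup(cached, gates, q=2.0)
mixed_outputs = sum(g * apply_memory(p, 2.0) for g, p in zip(gates, cached))
print(blended, mixed_outputs)   # 3.25 vs 3.75: they differ because the memory is nonlinear
```

For a strictly linear memory the two quantities would coincide, matching the equivalence for the linear case noted in the Methodology section.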
🍞 Hook: When your closet is too full, you keep only your favorites in reach. 🥬 The Concept (Sparse Selective Caching, SSC):
- What: A router picks only the top-k most relevant caches to use, saving compute and memory.
- How:
- Score each cached segment by similarity to the current token.
- Choose the best few (top-k) caches.
- Combine them (via GRM or Soup) with the current memory.
- Why: Using all caches is wasteful; SSC keeps costs low on ultra-long sequences while preserving recall. 🍞 Anchor: Instead of opening every folder, you open only the two most likely to contain the answer.
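The routing step can be sketched like this; the scores are assumed precomputed (for example, from the same dot products GRM uses), and renormalizing over the survivors is one simple choice:

```python
def top_k_route(scores, k):
    # Rank caches by relevance score and keep only the k best.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize the surviving scores into gates.
    total = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / total for i in chosen}
    return gates   # every unchosen cache costs nothing to load

gates = top_k_route(scores=[0.1, 3.0, 0.2, 1.0], k=2)
print(gates)   # only caches 1 and 3 are loaded: {1: 0.75, 3: 0.25}
```

However many segments have accumulated, only `k` cached memories are read per token, which is where the memory-traffic savings come from.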
03 Methodology
At a high level: Input sequence → Segment into chunks → Per-segment RNN update and cache final state → For each token, aggregate online memory with selected cached memories → Output.
Step-by-step recipe (with mini Sandwiches where new ideas appear):
- Prepare the inputs
- What happens: Tokenize the long sequence into a list of vectors.
- Why this step: Models need vectors to process symbols.
- Example: A 16,000-token article becomes 16,000 small vectors.
- Segment the sequence 🍞 Hook: You organize a binder with tabs so you can find things later. 🥬 The Concept (Segmentation choice):
- What: Decide how to split the sequence: equal-size (e.g., 256 tokens), or logarithmic (few very long, some shorter), etc.
- How:
- Pick a policy: constant C, or log-sized buckets.
- Mark segment boundaries.
- Keep a small summary vector for each segment (mean-pooled keys, for example) for fast relevance checks.
- Why: Segment size trades off recall against cost. Smaller segments = more caches (better recall, higher cost). Larger segments = fewer caches (cheaper, less precise). 🍞 Anchor: Chapters every 10 pages vs. 50 pages; more chapters help you locate facts, but take extra time to create and store.
- Per-segment memory updates (online memory)
- What happens: Inside a segment, run your chosen recurrent update rule (e.g., Linear Attention, Deep Linear Attention, or Titans) to produce a final memory state.
- Why this step: This compresses the segmentās information into a compact memory.
- Example with a simple linear attention flavor: Each new token adds a small key-value outer product into a running matrix; the last state is the segment's cache.
- Cache the segmentās final memory
- What happens: When the segment ends, save its final memory state to a cache list.
- Why this step: This saved state is the segment's "chapter summary." Without caching, older information would be irretrievable.
- Example: After 256 tokens, store M_segment as Cache #7.
- Retrieve at each new token using aggregation 🍞 Hook: When answering a quiz question, you glance at today's notes and any past summaries that fit. 🥬 The Concept (Aggregation):
- What: Combine the online memory with one or more cached memories to produce the output for the current token.
- How:
- Compute the current tokenās query.
- Option A: Residual addition (sum all cached outputs + online output).
- Option B: Gated Residual Memory (compute relevance gates and take weighted sum).
- Option C: Memory Soup (mix parameters of cached memories using gates, then apply).
- Option D: SSC (route to the top-k caches, then combine via B or C).
- Why: Without aggregation, you only use the tiny online memory and forget the long past; with aggregation, effective memory grows with the sequence. 🍞 Anchor: Final answer = today's note + the most relevant chapter summaries.
- Choosing gates and routing (the secret sauce) 🍞 Hook: You don't call every friend for advice; you call the one who knows this topic best. 🥬 The Concept (Relevance scores and gates):
- What: Compute a similarity between the current tokenās small connector vector and each segmentās pooled representation to set gates.
- How:
- Make u_t from the token (like a mini fingerprint of its needs).
- Compare u_t with each segment summary (dot product) to score relevance.
- Softmax to get gates so they sum to 1.
- Why: Without relevance, you mix in unrelated caches and add noise; with it, the model zooms in on helpful memory. 🍞 Anchor: Ask the "volcano" expert for volcano questions; don't poll the poetry club.
- Complexity and the dial between speed and recall
- What happens: Using N cached segments with sequence length L yields about O(NL) compute. N=1 is RNN-like O(L). N=L is Transformer-like O(L^2). SSC keeps N small per token.
- Why this step: It clarifies the budget youāre spending. You turn the dial by choosing segment sizes and how many caches to consult.
- Example: With 16K tokens and 256-token segments, N ≈ 62; SSC with top-k=4 uses only 4 caches per token, holding costs down.
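The arithmetic in this example is worth spelling out; this is a back-of-envelope sketch, and the "reads" counts are relative per-token costs, up to constants:

```python
L = 16_000      # sequence length
C = 256         # segment size
N = L // C      # number of cached segments
print(N)        # 62

rnn_reads = 1           # O(L) total: one live memory per token
full_mc_reads = 1 + N   # O(NL) total: live memory plus every cache per token
ssc_reads = 1 + 4       # SSC with top-k=4: live memory plus 4 caches per token
print(rnn_reads, full_mc_reads, ssc_reads)   # 1 63 5
```

Turning the dial means changing C (which sets N) and the top-k budget; N=1 recovers the RNN end of the curve, N=L the Transformer end.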
- Variants for different memory modules
- Linear Attention and SWLA: GRM and Soup are equivalent mathematically in the linear case; SSC helps by selecting fewer caches.
- Deep Linear Attention (DLA) / Titans: Soup becomes distinct and powerful because parameters are nonlinear; mixing them yields custom retrievers per token.
- Why this matters: For deep memories, Memory Soup can craft a specialized memory on the fly, while GRM remains a strong, simpler default.
- Checkpoints vs. independent compressors 🍞 Hook: Do you keep editing the same master document (checkpoints), or do you keep separate notebooks per chapter (independent)? 🥬 The Concept (Two caching philosophies):
- What: Either treat each segmentās cache as a checkpoint of the same evolving memory, or as a fresh memory trained only on that segment.
- How:
- Checkpoints: Initialize each new segmentās memory from the last segmentās final state.
- Independent: Initialize each segment's memory from a fixed start, so segments don't interfere.
- Why: Checkpoints keep continuity; independent avoids interference. Which is better may depend on the task and length. 🍞 Anchor: One big evolving Google Doc vs. a separate doc per chapter.
Example walk-through with tiny numbers:
- Sequence length L=12, segment size=4 → segments S1, S2, S3.
- Process S1 tokens; save Cache1.
- Process S2; save Cache2.
- Now in S3 at token t, compute its query.
- Router picks the top-2 relevant caches (say Cache1 and Cache2).
- GRM computes gates [0.2, 0.5, 0.3] for [online, Cache1, Cache2].
- Final output = 0.2*Online(q) + 0.5*Cache1(q) + 0.3*Cache2(q).
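The walkthrough's final step, in code; the three memory reads are hypothetical numbers, and only the gates come from the example above:

```python
gates = [0.2, 0.5, 0.3]     # [online, Cache1, Cache2]; softmax output sums to 1
assert abs(sum(gates) - 1.0) < 1e-9

reads = [4.0, 10.0, -2.0]   # hypothetical outputs of online memory and caches
output = sum(g * r for g, r in zip(gates, reads))
print(round(output, 6))     # 0.2*4.0 + 0.5*10.0 + 0.3*(-2.0) = 5.2
```

The gate on Cache1 is largest, so the relevant segment's read dominates the final answer.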
Secret sauce summary:
- Caching compact summaries prevents memory overflow.
- Gating/routing focuses computation on what matters.
- The N dial lets you pick your spot on the speed-recall curve.
- Deep memories + Soup can tailor a retriever per token.
04 Experiments & Results
The Test: The authors asked, "Can Memory Caching improve RNN-style models on tough recall and long-context tasks while staying efficient?" They measured language modeling (perplexity), commonsense QA accuracy, long-context recall (Needle-in-a-Haystack), in-context retrieval (datasets like SQuAD, NQ, DROP), and long-context understanding (LongBench). They also checked training throughput and ablations to see which pieces mattered most.
The Competition: Baselines included standard Transformers, hybrid efficient models, and strong recurrent/linear-attention families (e.g., SWLA, DLA, Titans, RetNet, RWKV, DeltaNet, and an improved Log-Linear++ baseline). MC was plugged into SWLA, DLA, and Titans in several flavors (Residual/GRM, Memory Soup, and SSC).
The Scoreboard (with context):
- Language modeling and commonsense QA: Adding MC to DLA and Titans consistently improved results over their base versions. Think of it as turning a B into a B+ or A-: a solid step up without changing the whole school.
- Needle-in-a-Haystack (long-context retrieval): MC variants beat their base RNNs and did especially well as context length grew. Versus Log-Linear, MC held up better at 16K tokens, meaning it kept finding the "needle" more often.
- In-context retrieval (SQuAD, NQ, DROP, etc.): Transformers still lead, but MC narrowed the gap and outperformed other recurrent approaches. That's like moving from mid-pack to near the leaders on recall-heavy tracks.
- LongBench (mixed long-context tasks): All MC-enhanced models improved over their base RNNs; GRM and Soup were particularly steady, with SSC offering a strong efficiency/accuracy mix.
- Throughput and efficiency: As sequences grew, MC (especially SSC) kept training/inference speed much closer to RNNs than to Transformers. You get more memory without paying the full Transformer bill.
Surprising/Notable findings:
- Even the simplest Residual Memory (no gates) helped, suggesting that just having multiple summaries already combats forgetting.
- GRM often gave the best accuracy among MC variants, showing that relevance-aware gating is powerful.
- Memory Soup and GRM are equivalent for strictly linear memories but diverge for deep ones; for deep modules, Soup can craft task-specific retrievers, sometimes matching GRM.
- SSC delivered a sweet spot: near-GRM accuracy with lower overhead by consulting only a few top caches per token.
- Post-training MC (no extra learning, just caching during inference) improved length extrapolation, evidence that MC is a general, portable idea.
Big picture: MC reliably boosts recall and long-context behavior for RNN-like models. While Transformers remain the top recall champs in the hardest settings, MC moves efficient RNNs much closer without abandoning their speed advantages.
05 Discussion & Limitations
Limitations:
- Transformers still win on the very hardest recall tasks that demand pinpoint token-level lookup across huge contexts.
- Segmenting introduces design choices: too large and you lose detail; too small and costs rise.
- Gating/routing quality matters; if relevance scoring is off, the model may consult the wrong caches.
- Cached states require extra memory traffic; on some hardware stacks, loading many caches per token can become a bottleneck unless SSC is used.
Required resources:
- Modest extra RAM/disk to store caches; compute for scoring and combining caches (kept small with SSC).
- Training data and standard accelerators (GPUs/TPUs) similar to recurrent baselines; no exotic infrastructure is needed.
When not to use:
- Tiny sequences where a simple RNN already remembers enough (MC overhead might not pay off).
- Tasks requiring exact per-token cross interactions like full attention over short windows (a small Transformer block may be simpler).
- Cases with strictly real-time constraints and ultra-light models where even top-k routing is too costly.
Open questions:
- Best segmentation automatically: Can we learn segment sizes or boundaries on the fly?
- Better routers: Can more advanced similarity functions (or learned retrieval indexes) further improve selection without high cost?
- Theory of capacity: Exactly how does effective memory scale with segments and gates for different architectures?
- Cross-layer caching: What if we cache from multiple layers or combine caches across layers for richer retrieval?
- Continual use: How does MC behave in streaming settings where input never ends and caches must be rotated intelligently?
06 Conclusion & Future Work
Three-sentence summary: Memory Caching lets RNN-style models save compact summaries of past segments and read them later, so effective memory grows with the sequence. This delivers stronger long-range recall and long-context understanding than baseline RNNs, while staying much cheaper than full Transformers. With variants like GRM, Memory Soup, and SSC, MC offers a tunable speed-recall trade-off and works across multiple architectures.
Main achievement: The paper overturns the assumption that RNNs must have a fixed memory by introducing a simple, general technique, segment-and-cache, that scales their effective memory with context length and substantially improves recall-heavy performance.
Future directions: Learn segmentation and routing end-to-end, explore richer pooling and similarity measures, combine cache signals across layers, and integrate MC with retrieval-augmented or tool-using systems for even longer horizons. Post-training MC for existing models and hardware-aware cache loading could further widen its impact.
Why remember this: MC is a practical middle path between forgetting (RNNs) and full recall at high cost (Transformers). It's simple, architecture-agnostic, and proven to help on real benchmarks: an idea you can carry into many sequence models whenever you need longer memory without breaking the bank.
Practical Applications
- Summarizing and querying long legal or policy documents while keeping the ability to cite earlier sections.
- Code assistance across multi-file projects, recalling definitions or patterns written thousands of tokens earlier.
- Customer support chatbots that remember crucial details from long, multi-session conversations.
- Log analysis where rare but important events happened far back in time and must be retrieved efficiently.
- Academic or technical reading assistants that track references and figures across lengthy papers.
- Medical documentation tools that recall symptoms and diagnoses from prior visits without scanning everything.
- Financial report analysis over long quarterly filings, preserving links between distant sections.
- Long-form creative writing aids that stay consistent with earlier plot points and character facts.
- Regulatory compliance checkers that verify rules against large, cross-referenced specifications.
- Search and retrieval over extended transcripts or meeting notes with selective, low-cost memory lookups.