Memory in the Age of AI Agents
Key Summary
- This survey explains how AI agents remember things and organizes the whole topic into three clear parts: forms, functions, and dynamics.
- Agent memory is different from LLM memory, RAG, and context engineering, even though they sometimes use similar tools.
- There are three main forms of memory: token-level (explicit notes), parametric (knowledge baked into model weights), and latent (hidden states inside the model).
- There are three main functions of memory: factual (facts about users and the world), experiential (lessons from past tries), and working (the scratchpad used right now).
- The dynamics of memory cover how memories are formed, evolve (consolidate, update, forget), and are retrieved at the right time.
- Token-level memory can be flat (1D), planar/graph-like (2D), or hierarchical (3D), each trading simplicity for stronger reasoning.
- The survey maps many recent systems into this taxonomy and compiles benchmarks and open-source frameworks to help researchers build and test memory.
- It highlights future frontiers like automated memory management, reinforcement learning for memory, multimodal memory, shared memory for teams of agents, and trustworthy memory.
- The big takeaway is to treat memory as a first-class part of an AI agent, not just an add-on.
- This clearer picture helps build agents that adapt over time, remember responsibly, and work better in the real world.
Why This Research Matters
AI agents are moving from single-turn chat to long-term helpers that plan, learn, and adapt, and none of that works well without solid memory. With this survey’s map, builders can pick the right memory form for the job, define its purpose clearly, and keep it healthy over time. This means assistants that truly personalize, research agents that improve with experience, and coding agents that reuse proven fixes safely. It also leads to better governance: traceable facts, controlled forgetting, and safer handling of sensitive data. In short, this roadmap turns memory from a messy afterthought into a reliable engine for real-world, trustworthy AI.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you build a LEGO robot helper. At first, it can follow instructions, but it forgets everything the moment you turn it off. It doesn’t remember your favorite snack, the tricky parts of your homework, or where it left the broom last time. That’s not very helpful.
🥬 The Story: Before this survey, many AI systems were like that LEGO robot: strong at answering one question now, weak at remembering important things later. Large language models (LLMs) got very good at reading and writing, but they weren’t designed to store a personal diary, a toolbox of tricks learned from mistakes, or a scratchpad of steps for a long project. As people built “agents” (LLMs that plan, use tools, and act over time), memory became the make-or-break feature. Without good memory, agents repeat work, lose track of goals, and can’t improve.
- The World Before:
- LLMs were trained on huge text piles and could answer many questions, but they couldn’t easily update their inner knowledge (their parameters) on the fly.
- Developers tried to stuff extra text into the prompt window (context), but that window is limited and messy.
- Some systems added retrieval (RAG) to look up facts from a database, which helped with accuracy but didn’t let the agent remember new experiences.
- Different teams used different words—“episodic,” “semantic,” “long-term,” “short-term”—often meaning slightly different things, which made comparing methods confusing.
- The Problem:
- The research got fragmented. One paper called something “LLM memory” when it was really agent memory; another called a retrieval trick “memory” though it never learned from experience.
- Old categories like “short-term vs. long-term” didn’t capture what modern agents do, like turning raw interactions into reusable skills or keeping multi-layer knowledge graphs.
- There was no common map to show how all these pieces fit together.
- Failed Attempts (and why they fell short):
- Only using longer context windows: helps a bit, but the agent still wades through lots of old text and can’t maintain a clean, growing knowledge base.
- Pure RAG pipelines: great for pulling facts from Wikipedia-like stores, but not for remembering your specific preferences or learning strategies from past failures.
- Ad-hoc memory buffers: easy to start, but they get noisy and unorganized and don’t scale to long projects.
- The Gap:
- The field needed a clean, shared language and a practical framework: What is memory in an agent? What forms does it take? What jobs does it do? How does it change over time?
- It also needed a guide to benchmarks and tools so teams could test fairly and build faster.
- Real Stakes (Why you should care):
- Everyday assistants: A calendar bot that remembers you hate early meetings and automatically avoids them.
- Education: A tutor that learns where you struggle and adjusts lessons over weeks, not just one chat.
- Science and software: Research or coding agents that build up tactics, tools, and lessons so they get better with each project.
- Safety and trust: A health or finance assistant that keeps accurate, traceable memories—so important decisions are both right and explainable.
🍞 Anchor: Think of a smart kitchen helper. With strong memory, it remembers your allergies (factual), learns which pancake flip worked best (experiential), and keeps today’s recipe steps on a sticky note (working). This survey explains how to build that kind of memory into AI agents—clearly and consistently.
02 Core Idea
🍞 Hook: You know how a good school binder has tabs (math, science), pockets (handouts), and a planner (what to do today)? That structure keeps your brain calm and your work on track.
🥬 The Core Idea (in one sentence): The paper’s aha moment is to organize agent memory with one simple triangle: forms (how memory is stored), functions (what jobs memory does), and dynamics (how memory changes over time).
Multiple Analogies:
- Toolbox analogy: Forms are the containers (drawer, pegboard, backpack), functions are the tools’ purposes (hammer vs. screwdriver), and dynamics is how you add, sharpen, or retire tools.
- Library analogy: Forms are the formats (books, e-books, index cards), functions are why you keep them (facts, study notes, to-do lists), dynamics is how books are acquired, cataloged, updated, or weeded.
- Sports team analogy: Forms are positions (goalie, striker), functions are roles in a play (defend, attack, pass), dynamics is player training, strategy updates, and substitutions.
Before vs After:
- Before: People mixed up ideas (like RAG vs. memory), used fuzzy labels, and squeezed everything into short/long-term bins.
- After: We have a clear map: three forms (token-level, parametric, latent), three functions (factual, experiential, working), and three dynamics (formation, evolution, retrieval). Now teams can compare fairly and build better.
Why it Works (intuition):
- Separating what memory looks like (form) from what it’s for (function) avoids confusion. A memory graph (form) could store facts or skills (functions).
- Adding dynamics captures the real life of memory: it gets created, improved, and sometimes forgotten—on purpose.
- This structure scales: it fits simple chatbots and giant multi-agent labs.
Building Blocks (with sandwich explanations; a short code sketch of the three memory functions follows this list):
- 🍞 Token-level Memory
- What: Explicit, human-readable memories (like notes in a notebook).
- How: Store chunks/snippets, optionally link them in graphs/trees, and retrieve by search or rules.
- Why: Without it, agents can’t show their work or be audited.
- 🍞 Example: A customer-support bot keeps per-user preference notes it can read back.
- 🍞 Parametric Memory
- What: Knowledge baked into the model’s weights.
- How: Train or fine-tune so the model “just knows” patterns and styles.
- Why: Without it, the agent must look everything up and generalizes worse.
- 🍞 Example: A role-playing bot naturally stays in character because the persona is in its parameters.
- 🍞 Latent Memory
- What: Hidden internal states (like KV caches) carried inside the model during computation.
- How: Generate, reuse, or compress internal embeddings/tokens over time.
- Why: Without it, the model repeatedly reprocesses long inputs and loses efficiency.
- 🍞 Example: A long-document assistant holds compact latent summaries so it doesn’t reread 200 pages.
- 🍞 Factual Memory (function)
- What: Stable facts about the user and environment.
- How: Extract, store, update, and retrieve facts with IDs and time stamps.
- Why: Without it, the agent keeps asking the same questions.
- 🍞 Example: It remembers you’re allergic to peanuts.
- 🍞 Experiential Memory (function)
- What: Lessons from past successes/failures.
- How: Save trajectories, distill strategies/skills, reuse them on new tasks.
- Why: Without it, the agent never improves.
- 🍞 Example: It learns that searching before coding avoids errors.
- 🍞 Working Memory (function)
- What: The scratchpad for the current task.
- How: Keep steps, subgoals, and partial results; prune and summarize as you go.
- Why: Without it, the agent loses track mid-problem.
- 🍞 Example: A planner stores today’s to-dos and checks them off.
- 🍞 Memory Formation (dynamic)
- What: Turning raw interaction into useful memory entries.
- How: Summarize, structure, and choose what’s worth keeping.
- Why: Without it, memory fills with noise.
- 🍞 Example: From a 30-message chat, it saves only key preferences.
- 🍞 Memory Evolution (dynamic)
- What: Consolidating, updating, and forgetting.
- How: Merge duplicates, fix conflicts, and retire stale bits.
- Why: Without it, memory becomes a junk drawer.
- 🍞 Example: It updates your new address and drops the old one.
- 🍞 Memory Retrieval (dynamic)
- What: Pulling the right memory at the right time.
- How: Build a query from the current goal and fetch the best matches.
- Why: Without it, the agent either uses nothing or everything.
- 🍞 Example: When booking flights, it retrieves your seat and meal preferences.
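To make the three functions concrete, here is a minimal Python sketch of what the corresponding memory entries might look like. It is an illustration, not code from the survey or any particular framework; every class, field, and value below is a hypothetical choice.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

@dataclass
class FactualEntry:
    """Stable fact about the user or environment (e.g., an allergy)."""
    subject: str          # e.g., "user"
    attribute: str        # e.g., "allergy"
    value: str            # e.g., "peanuts"
    source: str           # provenance: which conversation/tool produced it
    updated_at: str = field(default_factory=now)

@dataclass
class ExperientialEntry:
    """A lesson distilled from a past success or failure."""
    task_type: str        # e.g., "web_research", "bug_fix"
    lesson: str           # e.g., "search the codebase before editing"
    outcome: str          # "success" or "failure" the lesson came from
    times_reused: int = 0

@dataclass
class WorkingItem:
    """Scratchpad item for the task currently in progress."""
    subgoal: str
    status: str = "pending"   # "pending" | "done"
    notes: str = ""

# A toy agent memory grouping the three functions side by side.
facts = [FactualEntry("user", "allergy", "peanuts", source="chat-2024-05-01")]
experience = [ExperientialEntry("coding", "run the tests before claiming a fix", "failure")]
scratchpad = [WorkingItem("find flight options"), WorkingItem("check seat preference")]

# Working memory is pruned when the task ends; facts and experience persist.
scratchpad = [item for item in scratchpad if item.status != "done"]
print(facts[0], experience[0], scratchpad, sep="\n")
```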
Bonus: Sub-forms of token-level memory (a code sketch of all three follows this list)
- 🍞 Flat (1D): A pile of notes; simple but unstructured.
- How: Append, search by similarity.
- Why: Without structure, large piles get messy.
- 🍞 Example: A chat history with occasional summaries.
- 🍞 Planar (2D): A graph/tree; adds relationships for smarter traversal.
- How: Link related notes; traverse edges.
- Why: Without links, multi-hop reasoning is hard.
- 🍞 Example: A knowledge graph linking people, places, events.
- 🍞 Hierarchical (3D): Layers of abstraction; zoom in and out across levels.
- How: Cluster, summarize, and connect levels.
- Why: Without layers, big memories get slow and confusing.
- 🍞 Example: Topic → event summaries → raw records.
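A rough sketch of the three token-level shapes, using plain Python structures in place of real vector indexes and graph databases. All helper names and the word-overlap "search" are assumptions made for illustration.

```python
# Flat (1D): an append-only pile of notes, searched by naive word overlap.
flat_memory: list[str] = []

def flat_add(note: str) -> None:
    flat_memory.append(note)

def flat_search(query: str, k: int = 3) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(flat_memory, key=lambda n: -len(q & set(n.lower().split())))
    return scored[:k]

# Planar (2D): notes become nodes, and edges let the agent hop between related ones.
graph_nodes: dict[str, str] = {}             # node_id -> note text
graph_edges: dict[str, set[str]] = {}        # node_id -> linked node_ids

def graph_add(node_id: str, note: str, links: tuple[str, ...] = ()) -> None:
    graph_nodes[node_id] = note
    graph_edges.setdefault(node_id, set()).update(links)
    for other in links:
        graph_edges.setdefault(other, set()).add(node_id)  # undirected link

def graph_neighbors(node_id: str) -> list[str]:
    return [graph_nodes[n] for n in graph_edges.get(node_id, set()) if n in graph_nodes]

# Hierarchical (3D): raw records at the bottom, summaries above, topics on top.
hierarchy = {
    "topic": "travel planning",
    "summaries": ["May trip: prefers aisle seat, vegetarian meal"],
    "raw_records": ["User: aisle seat please", "User: vegetarian meal if possible"],
}

flat_add("user prefers an aisle seat")
graph_add("pref:seat", "prefers aisle seat")
graph_add("trip:may", "May trip to Lisbon", links=("pref:seat",))
print(flat_search("seat preference"), graph_neighbors("trip:may"), hierarchy["summaries"])
```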
🍞 Anchor: Think of this framework like a tidy backpack: pockets (forms), contents’ purpose (functions), and how you pack/unpack over a trip (dynamics). With this, AI agents stop being forgetful tourists and start being reliable travel partners.
03 Methodology
At a high level (as a recipe): Input (papers, systems, and benchmarks) → [Step A: Define and compare key concepts] → [Step B: Classify memory by forms] → [Step C: Classify memory by functions] → [Step D: Describe memory dynamics] → [Step E: Map existing systems and benchmarks to the taxonomy] → Output (a clean, shared map of agent memory).
Step A: Define and compare key concepts
- What happens: The survey draws clear lines between Agent Memory, LLM Memory, RAG, and Context Engineering.
- Why it exists: Without clean definitions, people talk past each other.
- Example data: A Venn-style comparison showing overlapping techniques (e.g., graph retrieval appears in both RAG and memory) but different goals (a persistent, self-evolving state vs. a static external lookup).
Sandwich cards for the comparisons (a tiny code contrast follows this list):
- 🍞 Agent Memory vs. LLM Memory
- What: Agent memory is about a persistent, evolving store an agent uses across tasks; LLM memory often means internal mechanisms like KV cache or long-context architectures.
- How: Agent memory includes external notes, graphs, and skills; LLM memory includes attention tricks and cache management.
- Why: Without this split, we confuse efficiency tweaks with true learning over time.
- 🍞 Anchor: A diary you keep (agent memory) vs. your brain’s short-term focus tricks (LLM memory).
- 🍞 Agent Memory vs. RAG
- What: Both retrieve info, but RAG mainly pulls from static sources per task, while agent memory evolves from the agent’s own experiences.
- How: RAG pipelines index facts; agent memory stores facts, experiences, and working state that grow over time.
- Why: Without this, we call any retrieval “memory,” even when nothing is learned.
- 🍞 Anchor: Looking up a recipe online (RAG) vs. writing family recipes in your cookbook as you improve them (agent memory).
- 🍞 Agent Memory vs. Context Engineering
- What: Context engineering packs the prompt window efficiently; agent memory is the durable knowledge/experience behind it.
- How: One manages space; the other manages long-lived content.
- Why: Without both, you either overflow the window or forget essential knowledge.
- 🍞 Anchor: Arranging your backpack (context engineering) vs. deciding what to keep for the whole trip (agent memory).
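To see the boundary in code form, the contrast below puts a static lookup next to an agent that writes back what it learned. This is a deliberately tiny sketch; `rag_answer`, `memory_answer`, and the in-memory corpus are made-up stand-ins, not APIs from any real library.

```python
# Static RAG: retrieve from a fixed corpus, answer, and change nothing.
CORPUS = {
    "capital of france": "Paris is the capital of France.",
}

def rag_answer(question: str) -> str:
    # Fetch the best-matching document; the corpus never changes.
    return CORPUS.get(question.lower(), "No document found.")

# Agent memory: the agent also writes back what it learned from the interaction.
agent_memory: list[dict] = []

def memory_answer(question: str) -> str:
    # 1) Retrieve prior experience relevant to this question.
    prior = [m for m in agent_memory if m["question"] == question.lower()]
    answer = rag_answer(question)            # retrieval can still be used as a tool
    # 2) Write back a new entry so the next run starts smarter.
    agent_memory.append({"question": question.lower(), "lesson": "user cares about this topic"})
    return f"{answer} (informed by {len(prior)} past entries)"

print(rag_answer("capital of France"))       # static lookup, nothing changes
print(memory_answer("capital of France"))    # 0 past entries
print(memory_answer("capital of France"))    # now 1 past entry: the agent remembered
```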
Step B: Classify memory by forms (what carries memory?)
- What happens: The survey defines three forms—token-level, parametric, latent—and subtypes for token-level (flat, planar, hierarchical).
- Why it exists: Different tasks need different containers (explicit notes vs. hidden states vs. baked-in knowledge).
- Example data: Systems like MemGPT (token-level), CharacterLM (parametric), and KV-reuse methods (latent) are mapped to these forms.
Step C: Classify memory by functions (why does the agent need memory?)
- What happens: The survey identifies three jobs: factual (stable facts), experiential (lessons/skills), and working (current scratchpad).
- Why it exists: Purpose matters—storing a birthday (fact) differs from storing a coding trick (experience).
- Example data: Personalization agents emphasize factual user memory; coding agents emphasize experiential skills; planners emphasize working memory.
Step D: Describe memory dynamics (how does memory operate and evolve?)
- What happens: The lifecycle is split into formation (what to keep), evolution (consolidate/update/forget), and retrieval (when/how to use it).
- Why it exists: Real memory isn’t a dump; it’s curated and timed.
- Example data: Summarization for formation; conflict resolution and forgetting curves for evolution; query construction and reranking for retrieval.
Now, dive deeper into the lifecycle with step-by-step “how it works” and “what breaks without it” (a minimal code sketch of the full loop follows this list):
- 🍞 Memory Formation
- What: Turn raw logs into reusable nuggets (facts, cases, skills, summaries).
- How (recipe):
- Collect artifacts (messages, tool outputs, actions, results).
- Extract key info (who/what/when/why), or distill strategies.
- Structure it (cards, graph nodes/edges, indexed chunks).
- Validate (filter redundancy, detect conflicts, add provenance).
- Save with metadata (time, source, confidence).
- Why: Without formation, the memory becomes a noisy transcript that’s hard to use.
- 🍞 Anchor: From a 50-step web-browsing trace, keep the final sources, the winning plan, and the pitfalls.
- 🍞 Memory Evolution (Consolidate, Update, Forget)
- What: Keep memory healthy and current.
- How (recipe):
- Consolidate: Merge duplicates, connect related nodes, promote stable patterns to summaries.
- Update: Resolve contradictions, refresh facts (addresses change!), refine strategies.
- Forget: Retire stale or low-value entries to reduce clutter and errors.
- Why: Without evolution, memory rots—slow, contradictory, and untrustworthy.
- 🍞 Anchor: Your contact list cleans itself: merge duplicate entries, fix new phone numbers, delete disconnected lines.
- 🍞 Memory Retrieval
- What: Fetch exactly what helps right now.
- How (recipe):
- Detect need (planning step, ambiguity, user query).
- Build a query (keywords, entities, task state, subgoal).
- Retrieve (vector search, graph traversal, rules).
- Rerank and compress (top-k, summarize, de-duplicate).
- Inject into the model (context text, soft tokens, tool inputs).
- Why: Without smart retrieval, the agent either drowns in context or flies blind.
- 🍞 Anchor: Booking travel? Pull seat/meal preferences, frequent airports, and budget caps—nothing more, nothing less.
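The three lifecycle steps can be chained into one small loop. The sketch below uses naive heuristics (keyword filters, first-words deduplication, word-overlap retrieval) where a production system would use an LLM summarizer, embedding search, and learned policies; every name in it is illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    text: str
    kind: str                       # "fact" | "experience"
    source: str                     # provenance
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    stale: bool = False

store: list[MemoryEntry] = []

# --- Formation: turn raw dialogue into a few curated entries ---------------
def form_memories(transcript: list[str], source: str) -> list[MemoryEntry]:
    entries = []
    for line in transcript:
        lowered = line.lower()
        if "allerg" in lowered or "prefer" in lowered:            # keep only key facts
            entries.append(MemoryEntry(text=line, kind="fact", source=source))
        elif "that worked" in lowered or "that failed" in lowered:  # distill lessons
            entries.append(MemoryEntry(text=line, kind="experience", source=source))
    return entries                                                # noise is dropped

# --- Evolution: consolidate duplicates, update conflicts, forget stale ------
def evolve(new_entries: list[MemoryEntry]) -> None:
    for entry in new_entries:
        for old in store:
            if old.kind == entry.kind and old.text.split()[:2] == entry.text.split()[:2]:
                old.stale = True                                  # newer entry supersedes it
        store.append(entry)
    store[:] = [e for e in store if not e.stale]                  # forget retired entries

# --- Retrieval: build a query from the current goal, fetch best matches -----
def retrieve(goal: str, k: int = 2) -> list[MemoryEntry]:
    goal_words = set(goal.lower().split())
    ranked = sorted(store, key=lambda e: -len(goal_words & set(e.text.lower().split())))
    return ranked[:k]

evolve(form_memories(
    ["I prefer aisle seats", "I'm allergic to peanuts", "ok", "lol", "that worked: book early"],
    source="chat-001",
))
print([e.text for e in retrieve("book aisle seats on the flight")])
```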
Step E: Map systems and benchmarks
- What happens: The survey catalogs representative papers under each form/function/dynamic and lists datasets/frameworks (e.g., long-dialogue, deep research, software tasks, lifelong streams).
- Why it exists: To help practitioners pick the right benchmarks and scaffolds.
- Example data: LoCoMo, LongMemEval (dialogue), GAIA/XBench/BrowseComp (research/web), SWE-bench Verified (code), StreamBench (lifelong).
The secret sauce of this method:
- One triangle (forms–functions–dynamics) explains a crowded field cleanly.
- It’s flexible enough to fit new advances (e.g., RL-managed memory, multimodal memory, shared memory for teams).
- It keeps engineering and cognition aligned: you can choose containers (form) to match jobs (function) and manage them over time (dynamics).
04 Experiments & Results
Because this is a survey, the authors do not run a single new system; instead, they summarize how the community evaluates memory and what patterns show up. Here’s how to make the numbers and tests meaningful.
- The Test: What do benchmarks measure, and why?
- Long conversation memory: Can an agent remember identities, preferences, and past events across many turns? (e.g., LoCoMo, LongMemEval.)
- Complex research and web tasks: Does memory help multi-step planning, source tracking, and iterative refinement? (e.g., GAIA, XBench, BrowseComp.)
- Software/code repair: Can agents reuse debugging insights and tool-usage experience? (e.g., SWE-bench Verified.)
- Lifelong learning streams: Can agents continually absorb and update knowledge without forgetting? (e.g., StreamBench.)
- Multimodal settings: Can agents track objects, scenes, or time in video/embodied tasks using memory?
- The Competition: What kinds of systems get compared?
- Token-level systems: Flat logs/summaries, graph-based memories, hierarchical pyramids.
- Parametric systems: Knowledge/persona edited into weights; adapters/LoRA add-ons for modular memory.
- Latent systems: KV cache reuse, soft tokens, compressed hidden states for efficiency and long-range recall.
- RAG and context-engineering baselines: Strong at static fact lookup and context packing but not designed for evolving agent experiences.
- The Scoreboard (with context instead of raw percentages):
- Long dialogue personalization: Hierarchical and graph-shaped token memories tend to behave like going from a messy binder to a well-tabbed one—agents stay on-topic and consistent more often than flat buffers.
- Deep research and browsing: Systems that combine experiential memory (lessons, strategies) with factual stores often feel like getting an A when the simple RAG baseline earns a B—plans are tighter, and repeated pitfalls are avoided.
- Coding agents: Experience-as-skills (e.g., code snippets, tool recipes) can be a big step up over plain context stuffing, similar to having a reusable toolbox instead of re-inventing fixes each time.
- Latent memory for long inputs: Reusing or compressing internal states often turns slow, clunky reading into smoother, more efficient runs—like summarizing a chapter into crisp index cards without losing key ideas.
- Surprising Findings:
- Blurry borders: Some RAG systems behave like memory (they evolve), and some memory systems are just fancy retrieval. The taxonomy helps separate intent (persistent, self-evolving) from mechanism (retrieval, graphs).
- Structure pays off: Graphs and hierarchies can amplify recall and reasoning, but they need careful maintenance or they bloat.
- RL is rising: Letting agents learn when to write/retrieve/forget memory can outperform hand-crafted rules in complex tasks.
- Multimodal memory matters: When agents see the world (video/embodied), explicit memory of objects, scenes, and timelines makes them far more capable.
- Trust and safety: Memory errors (wrong fact, stale preference) are more damaging when they persist; provenance and conflict resolution are not “nice-to-haves”—they’re required.
Bottom line: Across benchmarks, systems that treat memory as a first-class capability—aligned with form, function, and dynamics—generally outperform approaches that treat memory as just longer context or simple retrieval. The exact gains vary by domain, but the qualitative trend is consistent: better structure and lifecycle management lead to more reliable agents.
05 Discussion & Limitations
Limitations (be specific):
- Surveys can’t exhaustively test every method; some comparisons are qualitative, not head-to-head.
- The field moves fast; new hybrids (e.g., agentic RAG that learns) blur boundaries and can outgrow neat boxes.
- Standardized, memory-specific benchmarks are still forming, especially for experiential and working memory.
- Multimodal memory and multi-agent shared memory are early-stage; best practices for scale, latency, and synchronization are open.
- Trust, safety, and privacy for persistent memory (e.g., consent, redaction, provenance) require more rigorous, standardized tooling.
Required resources to use these ideas:
- Storage: Vector DBs, graph stores, or key–value systems to hold token-level memory.
- Model capacity: LLMs or VLMs that support long contexts or latent memory tricks.
- Engineering toolkits: Summarizers, deduplication, conflict resolution, and retrieval/reranking.
- Governance: Policies for consent, retention, forgetting, and auditing.
When NOT to use certain memory types:
- Heavy token-level graphs in tiny tasks: Overkill—setup cost outweighs benefits.
- Pure parametric editing for volatile facts: Risky—weights are slow to update and can forget old knowledge.
- Raw KV reuse without pruning: Memory bloat and latency spikes.
- Uncurated experience logs: They accumulate noise; better to distill strategies or skills.
Open questions:
- Automated memory controllers: How can agents learn optimal write/retrieve/forget policies under budget and risk constraints?
- Provenance and truthfulness: How to attach, verify, and maintain source trails as memories evolve?
- Cross-agent sharing: How do teams avoid echo chambers, resolve conflicts, and ensure privacy when they share memory?
- Multimodal scaling: What’s the right structure for time, space, and objects across video and embodied streams?
- Parametric + token + latent fusion: What are the best recipes for mixing all three without duplicating or drifting information?
Honest assessment: The taxonomy won’t solve everything, but it gives the community a shared compass. It helps teams pick the right memory forms for the job, define the functions clearly, and manage the lifecycle responsibly.
06 Conclusion & Future Work
Three-sentence summary: This survey reframes agent memory with a simple, powerful triangle: forms (token-level, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval). It clarifies how agent memory differs from LLM memory, RAG, and context engineering, while mapping many recent systems and benchmarks into one shared picture. The result is a practical playbook for building agents that remember better, learn continually, and act more responsibly.
Main achievement: A clear, up-to-date taxonomy that unifies a fragmented field and directly connects engineering choices (forms) to goals (functions) and operations (dynamics).
Future directions:
- Automated memory management (learned policies for write/retrieve/forget under constraints).
- Deeper integration with reinforcement learning to internalize memory skills.
- Multimodal memory that handles time, space, and objects at scale.
- Shared memory for multi-agent teams with safety, privacy, and conflict resolution.
- Trustworthy memory with provenance, auditing, and right-to-be-forgotten controls.
Why remember this: When you treat memory as a first-class part of an AI agent—not an afterthought—everything improves: personalization, planning, reliability, and safety. This survey gives you the language and the blueprint to do it right.
Practical Applications
- Design a chatbot memory: use token-level profiles for user facts, experiential notes for lessons, and a working scratchpad for current goals.
- Upgrade a RAG pipeline: add an experiential store so the system learns better retrieval plans from past successes and failures.
- Speed up long reads: use latent memory (soft tokens/KV compression) to keep key context without reprocessing entire documents.
- Personalize recommendations: maintain factual preference memory with timestamps and conflict resolution.
- Stabilize role-playing agents: encode persona in parametric memory and keep a small token-level diary for evolving details.
- Improve coding agents: store reusable skills (snippets, tool recipes) and retrieve them by task signatures.
- Build trustworthy assistants: attach provenance to factual memory, enable redaction, and add scheduled forgetting policies.
- Support multimodal tasks: maintain object/scene graphs as token-level memory and latent temporal embeddings for video.
- Teach memory policies: train an RL controller to decide when to write, retrieve, summarize, or forget under token budgets (a rule-based starting point is sketched after this list).
- Enable team agents: create a shared memory with namespaces, access controls, and conflict-resolution rules.
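The “teach memory policies” item above can be prototyped before any RL training: a hand-written controller that decides, under a token budget, whether to write, retrieve, summarize, or forget. The sketch below is a rule-based stand-in with hypothetical state fields and thresholds; a learned policy would replace the if/else rules with a trained decision model.

```python
from dataclasses import dataclass

@dataclass
class MemoryState:
    stored_tokens: int          # tokens currently held in memory
    budget_tokens: int          # hard cap the controller must respect
    candidate_tokens: int       # size of the new item the agent wants to write
    candidate_importance: float # 0..1 score (e.g., from a heuristic or small model)
    query_pending: bool         # does the current step need a lookup?

def memory_policy(state: MemoryState) -> str:
    """Rule-based stand-in for a policy over {write, retrieve, summarize, forget, skip}."""
    if state.query_pending:
        return "retrieve"                              # answer the current need first
    over_budget = state.stored_tokens + state.candidate_tokens > state.budget_tokens
    if over_budget and state.candidate_importance >= 0.7:
        return "summarize"                             # compress old memory to fit a valuable item
    if over_budget:
        return "forget"                                # evict stale entries rather than overspend
    if state.candidate_importance >= 0.3:
        return "write"
    return "skip"                                      # not worth the tokens

# Example: memory is nearly full and a high-value fact arrives.
print(memory_policy(MemoryState(
    stored_tokens=3900, budget_tokens=4000,
    candidate_tokens=300, candidate_importance=0.9,
    query_pending=False,
)))  # -> "summarize"
```

Swapping this function for a small policy network trained with RL would keep the same interface while letting the thresholds be learned rather than hand-set.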