
AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

Intermediate
Jiafeng Liang, Hao Li, Chang Li et al. · 12/29/2025
arXiv · PDF

Key Summary

  • This survey links how human brains remember things to how AI agents should remember things so they can act smarter over time.
  • It explains three kinds of AI memory: what’s inside the model’s weights, what fits in the context window, and what’s stored outside in a memory bank.
  • It proposes two easy ways to sort agent memory: by nature (episodic vs. semantic) and by scope (inside one task vs. across many tasks).
  • It lays out a full memory life cycle for agents: extract memories, update them, retrieve the right ones, and use them for better decisions.
  • It compares storage places (context vs. external banks) and storage formats (text, graphs, parameters, and latent vectors).
  • It reviews benchmarks that test memory for facts and profiles (semantic) and for doing real tasks on the web and in tools (episodic).
  • It highlights memory security risks like stealing private info or planting bad memories, and it surveys defenses.
  • It points to the future: multimodal memory (text, images, audio, video) and shareable skills that move between agents.
  • The big idea is that brain-inspired design helps AI agents remember the right stuff, at the right time, in the right way.
  • With strong memory, agents can stay consistent, personalize to users, and learn from experience instead of repeating mistakes.

Why This Research Matters

Smart memory makes AI feel less like a forgetful assistant and more like a steady teammate that learns with you. It lets agents remember your needs, reuse successful strategies, and avoid repeating old mistakes. This saves time in real tasks like shopping online, planning trips, or troubleshooting software. It helps tutors build on what you already know and doctors’ assistants respect medical preferences safely. Strong security keeps your private data safe from leaks and bad actors. Overall, memory turns one-off answers into growing experience that pays off tomorrow.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine your backpack for school. If you only carry today’s worksheet and forget everything you learned last week, solving new problems gets really hard. Brains (and helpful AI) need good backpacks for memory.

🥬 The Concept: Memory systems are how minds and machines keep helpful information so they can use it later.

  • What it is: A way to store, manage, and reuse information across time.
  • How it works: 1) Capture what happened, 2) Keep what matters, 3) Find it again later, 4) Use it to decide better.
  • Why it matters: Without memory, every problem is brand new, and you keep making the same mistakes.

🍞 Anchor: Your brain remembers how you did a tough math problem before; next time, you finish faster. An AI agent can do that too—if it has the right memory.

  1. The World Before: For years, Large Language Models (LLMs) acted like super-smart goldfish—their answers were great, but they forgot everything once a chat ended. Engineers stretched context windows, but huge windows are slow and expensive and still suffer from the “lost-in-the-middle” problem. Updating model weights with new facts was pricey and could make the model forget old knowledge.

  2. The Problem: Agents need to keep track of who you are, what worked last time, and how a multi-step task is going. But LLMs are stateless by default, so agents couldn’t smoothly connect past and future or carry lessons from one task to the next.

  3. Failed Attempts: People tried stuffing more text into the prompt, but it got messy, slow, and distracted the model. Others tried retraining the model to bake in new knowledge, but that’s costly, static, and risks overwriting what the model already knows. Pure retrieval systems helped with facts but didn’t capture the agent’s lived experience (what it tried and whether it worked).

  4. The Gap: We needed a brain-inspired, end-to-end way to treat memory as a living system: not just storage, but a full cycle—extract what matters, update it as the world changes, retrieve the right pieces, and use them to act better. Plus, we needed clear categories (what kind of memory is this?) and strong defenses (how do we keep it safe?).

  5. Real Stakes: Good memory means assistants that remember your allergies, web agents that don’t re-click the same broken button, copilots that reuse working solutions, and tutors that build on what you already know. Without it, AI wastes time and makes repeat mistakes. With it, AI becomes a steady partner that learns like we do.

🍞 Anchor: Think of an agent booking your trip. If it remembers your budget, that you prefer trains, and the last route that worked, it plans smarter next time. That’s memory turning past into future wins.

🍞 Hook: You know how scientists study the brain to learn how we think and remember?

🥬 The Concept (Cognitive Neuroscience): Cognitive neuroscience studies how the brain supports memory.

  • What it is: A science that connects brain activity to thinking and remembering.
  • How it works: 1) The brain encodes experiences, 2) Stabilizes them with replay and rest, 3) Retrieves them with cues, 4) Updates them when surprised.
  • Why it matters: It shows which memory ideas are likely to work in AI agents.

🍞 Anchor: If scientists see the brain replay memories during sleep to make them stronger, AI can replay past tasks to strengthen its skills, too.

🍞 Hook: Imagine a robot that can talk well but forgets everything after each chat.

🥬 The Concept (LLMs): Large Language Models generate text but don’t naturally remember across sessions.

  • What it is: A text-predicting system trained on lots of data.
  • How it works: It uses learned patterns (weights) plus whatever you put in the current prompt.
  • Why it matters: Great at language, but needs extra help to build lasting memories.

🍞 Anchor: Chatting with an LLM is like asking a smart friend with short-term memory—unless you give it a memory system.

🍞 Hook: Picture three notebooks: one glued inside the robot’s brain, one sticky note in front of it, and one big filing cabinet nearby.

🥬 The Concept (Three memory carriers in AI): Parametric, working, and external memory.

  • What it is: 1) Parametric memory (inside weights), 2) Working memory (context window), 3) External memory (databases/graphs).
  • How it works: Inside weights give general knowledge; context holds what’s immediately relevant; external storage keeps long-term facts and experiences.
  • Why it matters: You need all three to act smart over time.

🍞 Anchor: The robot uses its brain-knowledge to read a map (parametric), keeps today’s to-do list visible (working), and files finished trips for later (external). A small code sketch of the three carriers follows.
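To make the three carriers concrete, here is a minimal Python sketch. It is my illustration, not code from the survey: the class layout and the overflow rule are assumptions, and a real agent would back the memory bank with a database rather than a list.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Toy model of the three memory carriers described above."""
    # Parametric memory: knowledge baked into model weights.
    # We can't inspect it directly; the agent just "knows" it.
    model_name: str = "some-llm"

    # Working memory: whatever currently fits in the context window.
    context_window: list = field(default_factory=list)
    max_context_items: int = 8

    # External memory: a persistent bank that outlives any one session.
    memory_bank: list = field(default_factory=list)

    def observe(self, event: str) -> None:
        """New events enter working memory; old ones spill to the bank."""
        self.context_window.append(event)
        while len(self.context_window) > self.max_context_items:
            oldest = self.context_window.pop(0)
            self.memory_bank.append({"text": oldest, "source": "overflow"})

memory = AgentMemory()
for step in ["opened map", "chose train", "booked hotel near station"]:
    memory.observe(step)
print(memory.context_window)  # short-term view
print(memory.memory_bank)     # long-term store (empty until overflow)
```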

02 Core Idea

🍞 Hook: You know how coaches watch game videos to help players learn patterns and improve next time? What if AI did that with its own memories, using brain-inspired rules?

🥬 The Concept (Aha!): The key insight is to design agent memory the way brains do: treat memory as a living system with types, places, formats, and a full life cycle—so agents can remember, adapt, and improve over time.

  • What it is: A unified, brain-to-agent blueprint for building, storing, managing, testing, and protecting memory.
  • How it works: 1) Categorize memory by nature (episodic vs. semantic) and scope (inside one task vs. across tasks), 2) Store it in the right place (context vs. memory bank) and format (text, graphs, parameters, latent vectors), 3) Manage it with a cycle (extract, update, retrieve, use), 4) Evaluate it with benchmarks, 5) Secure it from attacks.
  • Why it matters: Without this, agents drown in text, forget key details, and can be tricked. With it, they become steady learners.

🍞 Anchor: Like a well-run library: books sorted by type, stored on the right shelves, tracked with a catalog, checked out when needed, graded by readers, and protected by locks.

Multiple analogies:

  • Backpack analogy: Episodic memories are your trip logs; semantic memories are your facts book; inside-trail is today’s worksheet; cross-trail is your portfolio across the year.
  • Kitchen analogy: Working memory is the countertop; external memory is the pantry; parametric memory is your cooking instincts. The recipe: prep (extract), taste and adjust (update), grab the right spice (retrieve), and serve (use).
  • Sports analogy: Film room (external memory), player intuition (parametric), play calls on the wristband (working). Practice cycle: record, review, refine, replay.

Before vs. After:

  • Before: Memory was mostly “stuff more into the prompt” or “retrain the model,” which is clumsy and brittle.
  • After: Memory is a brain-inspired system with clear types, smart storage, a management life cycle, fair tests, and security rules.

Why it works (intuition, no equations):

  • Separating memory by type and scope reduces confusion (don’t mix recipes with shopping lists).
  • Putting memory in the right place balances speed and capacity (fast counter vs. big pantry).
  • A life cycle prevents clutter and stale info (clean as you go, keep what works).
  • Benchmarks guide progress (practice drills), and security stops cheats (referees and locks).

Building blocks (with sandwiches):

🍞 Hook: Imagine a diary with step-by-step adventures.

🥬 The Concept (Episodic memory): Stores who-did-what-when-where from past tasks.

  • How it works: Logs tool calls, steps taken, outcomes.
  • Why it matters: Reuse wins and avoid repeat mistakes.

🍞 Anchor: “Last time the login failed until I cleared cookies—do that first.”

🍞 Hook: Think of your fact cards for a quiz.

🥬 The Concept (Semantic memory): Stores facts, rules, and profiles.

  • How it works: Records definitions, preferences, and stable knowledge.
  • Why it matters: Keeps agents consistent and accurate.

🍞 Anchor: “Alex is vegetarian; suggest a plant-based dinner.”

🍞 Hook: You don’t bring the whole library to the desk.

🥬 The Concept (Inside-trail vs. cross-trail): Inside-trail memory helps now, within one task; cross-trail memory helps later, across tasks.

  • How it works: Temporary vs. persistent storage.
  • Why it matters: Stay focused today; get smarter tomorrow.

🍞 Anchor: Scratch paper vs. a saved study guide.

🍞 Hook: Some info fits on a sticky note; some belongs in a folder.

🥬 The Concept (Storage place and format): Context vs. memory bank; text, graphs, weights, or latent vectors.

  • How it works: Choose place and format by speed, size, and structure needs.
  • Why it matters: Right tool, right job.

🍞 Anchor: A shopping list as text; a family tree as a graph.

🍞 Hook: Good habits keep your room tidy.

🥬 The Concept (Life cycle): Extract, update, retrieve, use.

  • How it works: Turns messy streams into clean, helpful memories.
  • Why it matters: Prevents overload and stale info.

🍞 Anchor: Summarize notes, fix mistakes, find what you need, apply it.

🍞 Hook: You can’t win games if anyone can steal your playbook.

🥬 The Concept (Security): Stop theft and poisoning of memory.

  • How it works: Filter, monitor, and lock down data.
  • Why it matters: Keeps agents trustworthy.

🍞 Anchor: A tutor who never leaks your grades or learns bad facts on purpose.

03 Methodology

At a high level: Inputs (interactions, tools, web pages) → Step A: Extract memories → Step B: Update them (inside-trail and cross-trail) → Step C: Retrieve the right ones → Step D: Use them (context or parameters) → Outputs (better plans and actions).

Step A: Memory Extraction (three flavors)

🍞 Hook: You don’t copy an entire book—just the useful parts.

🥬 The Concept (Flat extraction): Save raw chunks or light summaries.

  • What happens: Log steps, results, and key sentences; maybe summarize.
  • Why it exists: Simple, fast, and covers everything.
  • Example: After a web task, store “Tried login → failed; cleared cookies → success.” (A toy sketch follows below.)

🍞 Anchor: Like highlighting important lines in a textbook.
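A flat extractor can be as small as a loop over the trajectory log. The record format below (dicts with "action" and "result" keys) is a hypothetical convention for illustration, not one the paper prescribes.

```python
def flat_extract(trajectory: list, max_len: int = 120) -> list:
    """Flat extraction: keep raw step records, lightly trimmed."""
    memories = []
    for step in trajectory:
        record = f"Tried {step['action']} -> {step['result']}"
        memories.append(record[:max_len])  # light truncation, no clever summarizing
    return memories

trajectory = [
    {"action": "login", "result": "failed"},
    {"action": "clear cookies", "result": "success"},
]
print(flat_extract(trajectory))
# ['Tried login -> failed', 'Tried clear cookies -> success']
```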

🍞 Hook: Big stories need chapters and headings.

🥬 The Concept (Hierarchical extraction): Build layers—gists up top, details below.

  • What happens: Make a pyramid: brief summaries linked to detailed blocks.
  • Why it exists: You can zoom in only when needed.
  • Example: Top node “Plan a 3-day Beijing trip”; child nodes for flights, stay, and food, each with details. (See the sketch below.)

🍞 Anchor: A table of contents that jumps to the right page.
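One way to picture hierarchical extraction is a tree of gists with details underneath. The MemoryNode class below is an invented toy, assuming a simple parent-child layout.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    """One layer of a gist-over-details memory pyramid (illustrative)."""
    gist: str
    details: list = field(default_factory=list)

    def zoom(self, depth: int = 0) -> None:
        """Print gists first; drill into details only as deep as they go."""
        print("  " * depth + self.gist)
        for child in self.details:
            child.zoom(depth + 1)

trip = MemoryNode("Plan a 3-day Beijing trip", [
    MemoryNode("Flights", [MemoryNode("Harbin -> Beijing, morning, 800 RMB")]),
    MemoryNode("Stay", [MemoryNode("Hotel near Beijing railway station")]),
    MemoryNode("Food", [MemoryNode("Vegetarian options within 1 km")]),
])
trip.zoom()  # prints the gist pyramid, top level first
```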

🍞 Hook: Instead of carrying notes, generate a fresh cheat sheet right when you need it.

🥬 The Concept (Generative extraction): Rebuild compact context on the fly.

  • What happens: Compress current and past steps into a minimal, task-ready summary.
  • Why it exists: Keeps the context short, focused, and fast.
  • Example: Before step 20, produce “So far: chose train; hotel near station; need vegetarian dinner nearby.” (Sketched below.)

🍞 Anchor: Making a quick study card before a quiz.
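Generative extraction leans on the model itself to write the cheat sheet. In the sketch below, llm_summarize is a placeholder for any LLM call; its signature and the prompt wording are assumptions, and a stand-in function replaces the model so the example runs offline.

```python
def generative_extract(history: list, llm_summarize) -> str:
    """Generative extraction: rebuild a compact, task-ready context on the fly."""
    prompt = (
        "Compress the following agent history into one short status line "
        "that keeps decisions, constraints, and open to-dos:\n"
        + "\n".join(history)
    )
    return llm_summarize(prompt)

def fake_llm(prompt: str) -> str:
    # Stand-in "model" so the sketch runs without any API key.
    return "So far: chose train; hotel near station; need vegetarian dinner."

history = [
    "step 1: compared flights vs trains",
    "step 12: booked hotel near station",
    "step 19: dinner still unresolved",
]
print(generative_extract(history, fake_llm))
```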

Step B: Memory Updating (two places)

🍞 Hook: Clean-as-you-go beats an end-of-day mess.

🥬 The Concept (Inside-trail updating): Refresh the working context during a task.

  • What happens: Keep relevant details, drop noise, and re-summarize at checkpoints.
  • Why it exists: Context windows are small; attention is limited.
  • Example: Keep only the latest successful login method; remove failed tries. (See the sketch below.)

🍞 Anchor: Erasing scratch work that’s no longer needed.
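A minimal inside-trail updater might deduplicate by topic and let successes supersede failures, as sketched below. The entry schema and the "success supersedes failure" rule are illustrative choices, not the survey's algorithm.

```python
def update_working_context(context: list, max_items: int = 5) -> list:
    """Inside-trail update: drop superseded failures, keep what still matters.

    Each entry is assumed to look like {"topic", "note", "ok"}.
    """
    latest = {}
    for entry in context:
        # A success on a topic supersedes earlier failed attempts on it.
        if entry["ok"] or entry["topic"] not in latest:
            latest[entry["topic"]] = entry
    pruned = list(latest.values())
    return pruned[-max_items:]  # respect the small window

context = [
    {"topic": "login", "note": "password retry failed", "ok": False},
    {"topic": "login", "note": "cleared cookies, login worked", "ok": True},
    {"topic": "search", "note": "found hotel list", "ok": True},
]
print(update_working_context(context))
# keeps the working login fix and the hotel list; the failed retry is gone
```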

🍞 Hook: Your year-long binder needs curating.

🥬 The Concept (Cross-trail updating): Maintain the long-term memory bank.

  • What happens: Merge duplicates, forget stale info, integrate new lessons.
  • Why it exists: Prevents bloat; keeps the best, most current knowledge.
  • Example: Replace “prefers taxis” with “prefers trains” after repeated choices; archive the old preference. (Sketched below.)

🍞 Anchor: Updating a profile card when a friend switches favorite sports.
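Cross-trail updating can be sketched as a bank that flips a stored preference only after repeated evidence, archiving what it replaces. The slot layout and the evidence counter below are assumptions for illustration.

```python
import time

def update_memory_bank(bank: dict, key: str, value: str, now=None) -> None:
    """Cross-trail update: overwrite gradually, archive what we replace."""
    now = now or time.time()
    slot = bank.setdefault(key, {"value": value, "count": 0,
                                 "updated": now, "history": []})
    if slot["value"] != value:
        # Require repeated evidence before flipping a stored preference.
        slot["count"] -= 1
        if slot["count"] <= 0:
            slot["history"].append(slot["value"])  # archive, don't silently lose it
            slot["value"], slot["count"] = value, 1
    else:
        slot["count"] += 1
    slot["updated"] = now

bank = {}
for seen in ["taxis", "trains", "trains", "trains"]:
    update_memory_bank(bank, "transport_preference", seen)
print(bank["transport_preference"]["value"])    # 'trains'
print(bank["transport_preference"]["history"])  # ['taxis']
```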

Step C: Memory Retrieval (two strategies)

🍞 Hook: Sometimes you search by keywords; other times by tags like “recent” or “important.”

🥬 The Concept (Similarity-based retrieval): Find the nearest semantic matches.

  • What happens: Embed the query; pick the top-k most similar memory chunks. (See the sketch below.)
  • Why it exists: Great for facts and straightforward references.
  • Example: Query “vegetarian dinner near station”; retrieve similar past restaurant picks.

🍞 Anchor: Using a search bar that finds related pages.
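Here is a dependency-free sketch of similarity retrieval. The bag-of-words "embedding" is a deliberately crude stand-in; production systems use neural embedders and a vector index, but the embed-then-rank shape is the same.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use neural embedders."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, memories: list, k: int = 2) -> list:
    """Similarity retrieval: embed the query, return the top-k nearest memories."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]

memories = [
    "vegetarian restaurant two blocks from beijing station",
    "user prefers trains over taxis",
    "museum tickets sell out on weekends",
]
print(retrieve("vegetarian dinner near station", memories))
```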

🍞 Hook: A good librarian also checks date, author, and popularity.

🥬 The Concept (Multi-factor retrieval): Combine relevance with recency, importance, and efficiency.

  • What happens: Score memories by multiple signals; pick the best mix. (Sketched below.)
  • Why it exists: Reduces noise and aids long workflows.
  • Example: Pick the most recent, high-importance steps that led to wins.

🍞 Anchor: Pinning top notes and ignoring outdated ones.
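A multi-factor scorer might blend relevance, recency, and importance with fixed weights, as below. The weights and the exponential decay are invented for illustration; real systems tune these signals.

```python
import math
import time

def multi_factor_score(memory: dict, relevance: float, now: float,
                       w_rel: float = 0.6, w_rec: float = 0.25,
                       w_imp: float = 0.15) -> float:
    """Blend relevance with recency and importance (illustrative weights).

    `memory` is assumed to carry a "timestamp" and an "importance" in [0, 1].
    """
    age_hours = (now - memory["timestamp"]) / 3600
    recency = math.exp(-age_hours / 24)  # exponential decay: ~63% lost per day
    return w_rel * relevance + w_rec * recency + w_imp * memory["importance"]

now = time.time()
candidates = [
    {"text": "old login fix", "timestamp": now - 72 * 3600,
     "importance": 0.9, "rel": 0.8},
    {"text": "today's login fix", "timestamp": now - 1 * 3600,
     "importance": 0.9, "rel": 0.8},
]
ranked = sorted(candidates, key=lambda m: multi_factor_score(m, m["rel"], now),
                reverse=True)
print([m["text"] for m in ranked])  # the fresher, equally relevant memory wins
```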

Step D: Memory Application (two modes)

🍞 Hook: You can either read the note while solving, or memorize it for good.

🥬 The Concept (Contextual augmentation): Insert retrieved memories into the prompt.

  • What happens: Build a tailored context that guides reasoning. (See the sketch below.)
  • Why it exists: Immediate, flexible, and safe to test.
  • Example: Add the user profile and the last plan summary to the next planning prompt.

🍞 Anchor: Keeping a formula sheet next to your workbook.
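Contextual augmentation is ultimately string assembly: put the profile and the retrieved memories where the model will see them. The section layout below is one reasonable convention, not a format mandated by the survey.

```python
def build_prompt(task: str, profile: dict, retrieved: list) -> str:
    """Contextual augmentation: splice retrieved memories into the next prompt."""
    profile_lines = "\n".join(f"- {k}: {v}" for k, v in profile.items())
    memory_lines = "\n".join(f"- {m}" for m in retrieved)
    return (
        f"User profile:\n{profile_lines}\n\n"
        f"Relevant memories:\n{memory_lines}\n\n"
        f"Task: {task}\nPlan the next step, respecting profile and memories."
    )

prompt = build_prompt(
    task="Book dinner for tonight in Beijing",
    profile={"diet": "vegetarian", "budget": "5000 RMB total"},
    retrieved=["vegetarian restaurant two blocks from Beijing station"],
)
print(prompt)
```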

🍞 Hook: Or you can practice until the skill becomes second nature.

🥬 The Concept (Parameter internalization): Distill experiences into the model’s weights.

  • What happens: Fine-tune or reinforce with good trajectories and rules. (Sketched below.)
  • Why it exists: Faster inference, stable habits, no retrieval delay.
  • Example: Train on many successful login-fix examples so the agent proactively clears cookies when needed.

🍞 Anchor: After many drills, you just “know” the move.
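The first step of parameter internalization is usually data curation: keep the wins, turn them into training pairs, then hand them to a fine-tuning stack. The record schema below is a hypothetical sketch of that filtering step, not the paper's pipeline.

```python
import json

def build_finetune_set(trajectories: list) -> list:
    """Distill wins into prompt/completion pairs for supervised fine-tuning."""
    examples = []
    for traj in trajectories:
        if not traj["success"]:
            continue  # only bake in behavior that actually worked
        examples.append({
            "prompt": f"Situation: {traj['situation']}\n"
                      "What should the agent do first?",
            "completion": traj["winning_action"],
        })
    return examples

trajectories = [
    {"situation": "login fails repeatedly",
     "winning_action": "clear cookies, then retry", "success": True},
    {"situation": "login fails repeatedly",
     "winning_action": "guess passwords", "success": False},
]
print(json.dumps(build_finetune_set(trajectories), indent=2))
# Feed these examples to the fine-tuning stack of your choice.
```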

Storage choices (place and format)

🍞 Hook: Tools in the right drawer save time.

🥬 The Concept (Context window vs. memory bank): Short-term vs. persistent storage.

  • How it works: Context = quick access; bank = a large, long-term library.
  • Why it matters: A speed vs. capacity trade-off.

🍞 Anchor: Countertop spices vs. pantry bulk.

🍞 Hook: Not all notes look the same.

🥬 The Concept (Text, graphs, parameters, latent vectors): Different formats for different jobs.

  • How it works: Text is simple; graphs keep relationships; parameters are instincts; latent vectors are compact and trainable. (See the sketch below.)
  • Why it matters: Choose for interpretability, structure, speed, or compression.

🍞 Anchor: A paragraph, a map, a habit, or a dense code.
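The same fact can be written down in each format. The snippet below shows text, a graph triple, and a latent vector side by side; the vector is random purely to show the shape of the idea, and parametric storage has no printable form because it lives in weights.

```python
import random

# One fact ("Alex is vegetarian") in three of the four storage formats.
fact_as_text = "Alex is vegetarian."                   # readable, easy to edit
fact_as_graph = [("Alex", "has_diet", "vegetarian")]   # triple: keeps structure
fact_as_latent = [random.random() for _ in range(8)]   # compact, trainable vector
# fact_as_parameters: would live inside model weights after fine-tuning,
# so there is nothing to print; see the internalization sketch above.

print(fact_as_text)
print(fact_as_graph)
print([round(x, 2) for x in fact_as_latent])
```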

Secret sauce of the method

  • Dual taxonomy (nature and scope) keeps design clear.
  • A full life cycle prevents memory clutter.
  • Structured formats (graphs, hierarchies) beat flat piles of text.
  • Retrieval that blends relevance with recency cuts noise.
  • Security guardrails protect trust.

Example with real data (travel planning):

Input: “Plan a 3-day trip Harbin→Beijing, budget 5,000 RMB; I’m vegetarian.”

  • A) Extract: Save steps (flight checks, hotel options), summarize costs, note “vegetarian.”
  • B) Update: Inside-trail—keep the cheapest valid flight and hotel; cross-trail—update the user preference if it repeats.
  • C) Retrieve: Pull the last successful Beijing itinerary and the user profile.
  • D) Use: Insert the profile and best-practice steps into the prompt; generate a plan that fits the budget and diet.

Output: A trip plan aligned with budget and food preference, produced faster than starting from scratch. A self-contained toy version of this cycle appears below.
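To tie the four stages together, here is a self-contained toy of the whole cycle in Python. Every function body, data value, and the keyword-overlap retrieval are stand-ins invented for illustration; the survey describes the stages, not this code.

```python
def run_memory_cycle(request: str) -> str:
    """Toy end-to-end pass: extract -> update -> retrieve -> use."""
    # A) Extract: light summaries of what the agent just did (hypothetical).
    notes = ["checked flights: train is cheaper",
             "hotel near station: 400 RMB/night"]
    # B) Update: cross-trail profile maintenance (a static stand-in here).
    profile = {"diet": "vegetarian", "budget": "5000 RMB"}
    # C) Retrieve: naive word overlap standing in for embedding search.
    bank = ["last beijing itinerary finished under budget",
            "user disliked red-eye flights"]
    words = set(request.lower().split())
    relevant = [m for m in bank + notes if words & set(m.lower().split())]
    # D) Use: assemble the augmented prompt for the planner model.
    return (f"Task: {request}\nProfile: {profile}\nMemories:\n"
            + "\n".join(f"- {m}" for m in relevant))

print(run_memory_cycle("plan 3-day beijing trip under budget"))
```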

04 Experiments & Results

The Test: What did they measure and why?

  • Semantic-oriented tests measure memory fidelity (recalling facts over long text and many chats), dynamics (keeping profiles updated and consistent), and generalization (using stored experience to do better on new tasks).
  • Episodic-oriented tests measure how memory lifts real task performance: web browsing, tool use, and environment interaction, where agents must remember states, preferences, and past fixes.

🍞 Hook: Like exams for different subjects—spelling (facts), writing (coherence), and projects (doing real things).

🥬 The Concept (Benchmarks as exams): Collections of tasks that score how well memory helps an agent.

  • How it works: Provide long contexts or multi-step tasks; check recall, updates, and success rates.
  • Why it matters: Shows whether memory systems really help beyond demos.

🍞 Anchor: Report cards with A’s for accurate recall and improved task completion.

The Competition: Compared against what?

  • Baselines: Plain LLMs with long prompts but no structured memory; RAG systems without agent experience logs; and single-discipline designs that miss brain-inspired life cycles.

The Scoreboard (with context):

  • On semantic memory tests like LoCoMo, LOCCO, RULER, BABILong, and MPR, structured memory and smart retrieval raise recall accuracy and reduce distortion when distractors grow. For example, LoCoMo has reported up to around 80% correct retrieval in multi-round settings when memory is well-managed—like getting an A when many models still hover around B-level under heavy context.
  • On personalization sets like PersonaMem, MemDaily, and PrefEval, agents with stable profiles pick user-aligned actions more often, especially when preferences change over time.
  • On episodic tasks like WebArena, WebShop, and Mind2Web, memory helps track state across clicks, reuse past fixes (e.g., login workarounds), and complete longer workflows—turning C-level partial progress into B+ or A− end-to-end success on multi-step goals.
  • Tool-use tests such as ToolBench and GAIA show that memory reduces “execution illusions” (calling the wrong tool or forgetting parameters) by retrieving schemas and successful prior calls.
  • Environment tasks like BabyAI and ScienceWorld reward agents that remember intermediate states and causal chains, boosting sample efficiency and final scores.

Surprising Findings:

  • More context is not always better—multi-factor retrieval often beats simple similarity because it avoids flooding the prompt with near-duplicates or stale info.
  • Highly compressed “gists” can work—if paired with on-demand drill-down into details.
  • Personalized profiles work best when updated gradually; hard overwrites can break consistency.
  • Security matters: subtle poisoning can quietly derail decisions unless memory sources are filtered and monitored.

Takeaway Numbers with Meaning:

  • Think of 80% retrieval accuracy on long-context recall as an A: it means an agent usually grabs the right nuggets from a haystack, even across rounds. On web tasks, memory often shifts results from incomplete sequences (C range) to end-to-end finishes (B+ to A−), which is a big practical jump.

Bottom line: Across many benchmarks, agents with brain-inspired, lifecycle-managed memory are more accurate, more consistent, and more capable of finishing hard, long tasks.

05 Discussion & Limitations

Limitations:

  • Memory quality depends on extraction and update rules; bad summaries or missing merges create clutter or drift.
  • Similarity-only retrieval struggles with procedural know-how; multi-factor or structure-aware methods are needed.
  • Parameter internalization is fast at inference but costly to train and can forget older knowledge if not done carefully.
  • Generative compression can drop rare but crucial details if not paired with selective drill-down.
  • Security remains hard: backdoors and privacy leaks can hide in large memory stores.

Required Resources:

  • Storage for external memory banks and indexes (vector DBs or graphs).
  • Compute for summarization, retrieval, and occasional fine-tuning.
  • Tooling for logging trajectories, profiling users (ethically), and maintaining access controls and audit trails.

When NOT to Use:

  • Ultra-short, one-shot tasks where simple prompting suffices.
  • Highly sensitive settings without strong privacy, redaction, and audit protections in place.
  • Situations with tiny budgets where retrieval/indexing costs outweigh benefits.

Open Questions:

  • Multimodal memory: How to align and retrieve across text, images, audio, and video with shared structure?
  • Skill libraries: Best ways to extract, compose, and safely reuse procedural knowledge across agents and domains?
  • Trust and safety: How to automatically detect and repair poisoned or outdated memories during retrieval?
  • Lifelong balance: How to schedule forgetting vs. remembering to stay both accurate and adaptable?
  • Evaluation: How to fairly score memory’s impact on long-horizon planning and real-world reliability (latency, cost, and safety together)?

06 Conclusion & Future Work

Three-sentence summary: This survey connects how the brain builds and uses memory to how AI agents should store, manage, retrieve, and apply memories. It presents a clear taxonomy (episodic vs. semantic; inside-trail vs. cross-trail), practical storage choices (place and format), a full life cycle (extract–update–retrieve–use), fair benchmarks, and security guardrails. The result is a blueprint for reliable, personalized, and steadily improving agents.

Main Achievement: A unified, brain-inspired framework that turns memory from a pile of past text into a living system that improves agent reasoning, planning, and personalization while staying efficient and secure.

Future Directions: Build multimodal memory that fuses text, images, audio, and video; extract and share reusable skills safely across agents; strengthen defenses against memory theft and poisoning; and refine benchmarks that reflect real-world, long-horizon challenges.

Why Remember This: Memory is how yesterday powers tomorrow. With the right kinds, in the right places, managed the right way—and protected—AI agents stop re-learning the same lessons and start growing like good teammates who remember, adapt, and help us more each day.

Practical Applications

  • Personalized assistants that remember preferences (diet, budget, schedule) and plan accordingly.
  • Web agents that reuse successful workflows (login fixes, search routes) to finish complex tasks faster.
  • Developer copilots that store, retrieve, and generalize code patterns to reduce bugs and repetition.
  • Customer support bots that maintain consistent profiles and solutions across long-running tickets.
  • Education tutors that track progress and tailor lessons, revisiting weak spots and building on strengths.
  • Healthcare triage helpers that remember allergies and care plans while enforcing privacy controls.
  • Enterprise knowledge bots that convert meeting notes into structured, searchable, and secure memories.
  • Tool-use agents that recall correct API parameters and sequences, reducing failed calls.
  • Robotic or simulation agents that learn reusable skills and share them across environments.
  • Compliance monitors that detect and quarantine suspicious or poisoned memory entries automatically.
#agent memory · #episodic memory · #semantic memory · #context window · #retrieval-augmented generation · #hierarchical memory · #graph memory · #latent memory · #memory consolidation · #memory retrieval · #reinforcement learning · #personalization · #benchmarks · #memory security · #backdoor attacks