MemEvolve: Meta-Evolution of Agent Memory Systems
Key Summary
- MemEvolve teaches AI agents not only to remember past experiences but also to improve the way they remember, like a student who upgrades their study habits over time.
- It uses a two-loop (inner and outer) learning process: the agent learns from tasks, and then the memory system itself gets redesigned to work better next time.
- The paper splits memory into four swappable parts (Encode, Store, Retrieve, Manage) so the system can evolve each part safely and sensibly.
- EvolveLab is a unified, open codebase that re-implements 12 well-known memory systems with the same plug-and-play interface.
- Across four tough benchmarks (GAIA, WebWalkerQA, xBench-DeepSearch, TaskCraft), MemEvolve boosts performance and transfers well to new models and frameworks.
- It delivers up to a 17.06% improvement in some settings and reaches a pass@3 of 80.61% on GAIA, which is competitive with strong multi-agent systems.
- The evolution balances accuracy with cost and speed using Pareto-style selection, so gains don't explode the API bill or latency.
- MemEvolve's evolved memories generalize across tasks from the same family and even across different LLM backbones without re-tuning.
- Compared to fixed, human-designed memories, MemEvolve's adaptive memories are more reliable and consistently helpful.
- The work suggests a future where agents automatically find the right way to learn for the job at hand, just like adaptive human learners.
Why This Research Matters
MemEvolve means AI helpers can keep getting better not only by remembering more, but by upgrading the way they remember to fit each job. That leads to faster, more accurate results without blowing up costs or waiting times. It reduces the need for constant human re-engineering when tasks, models, or frameworks change. People and teams can count on more reliable web research, planning, and reasoning support in day-to-day work. Over time, this approach can unlock robust automation across many knowledge tasks, from education to industry. In short, it moves us closer to truly adaptive digital teammates.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how some students keep messy notes and others keep neat, organized notebooks? Now imagine a student who doesn't just take better notes; they also figure out a smarter way to take notes for each subject over time.
The Concept: Before this paper, most AI agents either had no memory or used a fixed way to remember. They could store past attempts and reuse them, but their memory style (what to save, how to save it, how to find it later) stayed the same.
- What it is: A memory architecture is the plan for how an agent turns experiences into helpful reminders, where it keeps them, and how it finds them again.
- How it worked before:
- Store raw attempts (trajectories) and a few good examples.
- Later, people built smarter summaries like tips, templates, or even small tools.
- But all of these were hand-designed and static: great for some tasks, not others.
- Why it matters: If the memory style can't change, an agent might be great at one kind of task (like browsing) but weak at another (like math reasoning).
Anchor: Think of a debate team and a math team. The exact same notebook format won't work for both. If you can't change your note-taking style, you'll struggle in at least one.
Hook: Imagine riding a bike on roads, trails, and sand. If your bike can't change tires or gears, some routes will be bumpy and slow.
The Concept (Task Adaptation): Task adaptation means changing how you learn based on what the task needs.
- What it is: Matching your memory strategy to the job: what to keep, how detailed, and how to look it up.
- How it works:
- Notice what the task type demands (short facts, long plans, tools, or math patterns).
- Adjust how you encode (summaries vs. skills), store (lists vs. graphs), and retrieve (keyword vs. reasoning-aware search).
- Keep tuning as tasks shift.
- Why it matters: Without adaptation, even a good agent wastes time, grabs the wrong memories, or misses helpful patterns.
Anchor: For a cooking task you keep recipes; for algebra you keep solution steps; for web research you keep tool shortcuts and sources.
Hook: Picture three learners: one forgets everything, one keeps neat but unchanging notes, and one improves both notes and note-taking style over time.
The Concept (Self-evolving Memory Systems): These let agents save experiences and re-use them to perform better later, like a skillful learner.
- What it is: A system that records, distills, and reuses knowledge from past runs.
- How it works:
- After each task, extract experiences (what worked/failed).
- Turn them into tips, templates, or tools.
- Retrieve relevant pieces for the next task.
- Why it matters: Without self-evolving memory, the agent repeats old mistakes instead of getting smarter.
Anchor: If yesterday you learned that Wikipedia's "history" page counts revisions best, a self-evolving memory nudges you to use that trick again today.
Hook: Imagine you always revise not only your notes, but also your entire way of taking notes after every test to score higher on the next one.
The Concept (The Problem): Existing AI memories are static. They help agents learn from experience, but the memory's own design can't change.
- What it is: A fixed memory pipeline: how you encode, store, and retrieve.
- How it works: Designers choose one approach hoping it works everywhere.
- Why it breaks: Different tasks need different memory shapes. A one-size-fits-all memory underperforms or even hurts.
Anchor: A tip-based memory might be great for browsing, but clumsy for coding, where reusable tools matter more.
Hook: Think of a toolbox with swappable parts, like Lego. If a build isn't sturdy, you replace or rearrange the blocks, not the whole set.
The Concept (The Gap): We needed memory systems that can change their own architecture, that is, how they learn from experience.
- What it is: Meta-adaptation for memory itself.
- How it works: Observe performance, diagnose bottlenecks, and redesign memory parts.
- Why it matters: Without meta-adaptation, agents hit a ceiling: they learn slower and miss task-specific advantages.
Anchor: If your searching task needs better filtering, the system should notice and swap in a smarter retrieval method automatically.
Hook: Why should you care? Because this means more reliable AI assistants that research faster, make fewer repeated mistakes, and adapt to new work without costly re-engineering.
The Concept (Real Stakes): In daily life, tasks vary wildly: finding accurate info on the web, writing code, solving math, or planning errands.
- What it is: A way for AI to keep working well even when tasks shift.
- How it works: By evolving both what it learns and how it learns.
- Why it matters: Saves time, money, and headaches: stronger results with stable costs and speed.
Anchor: Imagine a homework helper that gets better every week, not just remembering facts but also improving how it studies them for each subject.
02 Core Idea
Hook: You know how a great coach improves both the players' skills and the training plan itself? They don't just practice more; they redesign the drills to learn faster.
The Concept (Aha! in one sentence): MemEvolve teaches agents to evolve their experiences and also evolve the very memory system that learns from those experiences.
- What it is: A meta-evolution framework with two loops: one learns from tasks, the other redesigns the memory architecture.
- How it works:
- Inner loop: Agent tackles tasks using a current memory system; memory fills with new experiences.
- Outer loop: The framework evaluates performance, cost, and speed; then redesigns the memory's parts to do better next time.
- Repeat: Co-evolution: better memory → better agent → clearer feedback → even better memory.
- Why it matters: Without this, agents may collect lots of data but won't improve how they learn from it.
Anchor: Like a student who doesn't just study more, but also upgrades their study method after each exam based on what actually raised their scores.
Hook: Imagine three analogies.
The Concept (Multiple Analogies):
- Analogy 1 (Gardener): The plant (agent experience) grows every day. But the gardener (meta-evolution) also changes soil, pots, and watering schedules (memory parts) to help future growth.
- Analogy 2 (Backpack): You carry tips, tools, and maps (memories). But after each trip, you also reorganize the backpack (memory design) so next time you pack smarter.
- Analogy 3 (School Notebook): You keep notes (experiences), and after tests you redesign your note-taking system (headings, summaries, color-coding) to learn faster.
Anchor: After failing a web-research question, the system doesn't just save the failure; it also decides to add a better retrieval filter next time.
Hook: Think of upgrading from a fixed recipe to a chef who improves the recipe after every dinner based on diners' feedback.
The Concept (Before vs. After):
- Before: Fixed memory: maybe great for one domain, weak in others; improvements stall.
- After: Adaptive memory: modules can be swapped or tuned; performance improves and transfers better to similar tasks and even different LLMs.
- Why it matters: This turns memory from a static tool into a learning partner that adapts with the agent.
Anchor: A search agent that once relied on simple keyword recall now adds a reasoning-aware gate and an LLM guardrail because that combo proved faster and more accurate.
Hook: Imagine solving a maze while also redrawing your map style to make future mazes easier.
The Concept (Why it works, intuition):
- What it is: Bilevel optimization: improve task skill (inner) and improve how learning happens (outer).
- How it works:
- Inner loop collects real signals: success rate, token cost, latency.
- Outer loop uses those signals to select better memory designs with good trade-offs (Pareto ranking).
- Diagnosis highlights weak spots (e.g., messy summaries, slow search); design proposes targeted fixes.
- Why it matters: Real performance data steers architectural changes, avoiding guesswork.
Anchor: If many failures stem from irrelevant context, the diagnosis chooses a stricter retrieval filter and the design installs it for the next round.
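For readers who like a formula, one way to write the two loops as a bilevel problem is sketched below; the symbols (A for a memory architecture, M_t for its contents after t tasks) are our shorthand for the intuition above, not notation taken from the paper.

```latex
% Schematic bilevel view (illustrative shorthand, not the paper's exact notation):
% the inner loop grows memory contents M_t under a fixed architecture A,
% the outer loop picks the architecture with the best observed trade-off.
\begin{align*}
  M_t &= \mathrm{Manage}_A\!\big(M_{t-1} \cup \mathrm{Encode}_A(\tau_t)\big), \quad t = 1,\dots,T \\
  A^{*} &\in \operatorname*{arg\,max}_{A \in \mathcal{A}}
      \big(\mathrm{Perf}(A, M_T),\ -\mathrm{Cost}(A, M_T),\ -\mathrm{Delay}(A, M_T)\big)
\end{align*}
% The arg max is read in the Pareto (non-dominated) sense, not over a single scalar.
```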
Hook: Lego time: build big from small, swappable blocks.
The Concept (Building Blocks): The memory system is split into four modules so we can safely evolve parts.
- What it is: Encode, Store, Retrieve, Manage.
- How it works:
- Encode: Turn raw runs into tips, schemas, or tools.
- Store: Keep them in vectors, JSON, graphs, or libraries.
- Retrieve: Bring back what matters, possibly with hybrid or LLM-guarded methods.
- Manage: Clean up: merge, prune, or forget to stay sharp.
- Why it matters: Modular design lets the system change only what's needed, keeping everything compatible.
Anchor: For GAIA-style research, the system might switch from plain semantic search to hybrid retrieval plus an LLM guard that filters fluff, because that combo won before.
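To make the Lego picture concrete, here is a minimal Python sketch of a plug-and-play memory interface with these four stations. EvolveLab's BaseMemoryProvider plays this role in the paper, but the method names and signatures below are illustrative assumptions, not the actual API.

```python
from abc import ABC, abstractmethod


class BaseMemoryProvider(ABC):
    """Illustrative four-module memory interface; method names are assumptions, not EvolveLab's real API."""

    @abstractmethod
    def encode(self, trajectory: dict) -> list[dict]:
        """Turn a raw task run into structured items: tips, schemas, or tool snippets."""

    @abstractmethod
    def store(self, items: list[dict]) -> None:
        """Persist encoded items (vector index, JSON, graph, or code library)."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> list[dict]:
        """Fetch the most relevant items for the current task or stage."""

    @abstractmethod
    def manage(self) -> None:
        """Offline maintenance: merge duplicates, prune stale or low-confidence items."""
```

Because every candidate design exposes the same four calls, the outer loop can swap one module (say, a stricter retrieve) without touching the agent that calls it.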
03 Methodology
At a high level: Input (tasks and histories) → Inner Loop (agent learns with current memory) → Outer Loop (diagnose and redesign memory) → Output (an evolved memory architecture and a smarter agent).
Hook: Imagine a game where you play a level (learn), then you upgrade your controller settings (redesign) so the next level is easier and faster.
The Concept (Inner vs. Outer Loops):
- What it is: Two nested learning processes working together.
- How it works:
- Inner loop: Use the current memory system to solve a batch of tasks. Save successes, failures, and costs.
- Summarize results into a score vector: performance, API cost, and latency.
- Outer loop: Pick the best memory designs using Pareto-style selection and create improved variants.
- Why it matters: This lets real outcomes guide architectural upgrades, not just intuition.
Anchor: If the agent spends too many tokens fetching irrelevant context, the outer loop designs a stricter retrieval policy for next time.
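Here is a compact Python sketch of how the two loops could fit together, using the batch and population sizes reported in this write-up (60 tasks per candidate, top-1 parent, 3 children, 3 iterations). Every callable passed in (sample_tasks, run_agent, summarize, pareto_select, diagnose_and_design) and the build_empty method are placeholders, not the paper's code.

```python
def meta_evolve(seed_design, sample_tasks, run_agent, summarize,
                pareto_select, diagnose_and_design,
                iterations=3, n_children=3):
    """Toy outline of MemEvolve's nested loops; all helpers are caller-supplied stubs."""
    population = [seed_design]
    best = seed_design
    for _ in range(iterations):
        scored = []
        for design in population:
            memory = design.build_empty()                  # fair start: empty store
            tasks = sample_tasks(new=40, repeated=20)      # 60 tasks per candidate
            outcomes = [run_agent(task, memory) for task in tasks]   # inner loop
            scored.append((design, summarize(outcomes)))   # performance, cost, delay
        best, _ = pareto_select(scored, top_k=1)[0]        # outer loop: selection
        population = [best] + [diagnose_and_design(best, scored)
                               for _ in range(n_children)]            # outer loop: redesign
    return best
```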
Step-by-step recipe (with examples):
- Input preparation
- What happens: Collect a batch of tasks (e.g., 60 per iteration: 40 new, 20 repeated for stability). Initialize each candidate memory with an empty store so comparisons are fair.
- Why it exists: We need consistent, comparable evidence across different memory designs.
- Example: On TaskCraft, sample a mix of web-lookup and reasoning tasks so we test both tool-use and logic.
- Inner loop: Experience evolution
- What happens:
- The agent runs with a chosen memory system.
- After each task, it produces experiences (e.g., distilled tips, reusable tools, or failure modes) via Encode, and then Stores them.
- Retrieval provides context for the next turns (e.g., hybrid search + LLM grooming).
- Why it exists: Without this loop, we wouldn't learn from experience or generate the data that drives design.
- Example (GAIA query): "Find the BAFTA 2019 winner's Wikipedia page and count revisions up to the release month." The evolving memory suggests using MediaWiki's revision API with a date cutoff and reminds the agent to store the exact oldids for auditability.
- Score it: Summaries for each candidate
- What happens: For every candidate memory system, aggregate task-level vectors into an overall summary: higher is better for success, lower is better for cost and delay (but they're re-signed so higher means better in selection).
- Why it exists: We need a multi-objective picture; accuracy alone isn't enough.
- Example: Candidate A hits strong accuracy but high latency; Candidate B is slightly less accurate but faster and cheaper.
- Outer loop: Architectural selection (Pareto ranking)
- What happens: Rank candidates by non-dominated sorting over (performance, -cost, -delay). Prefer those that balance gains without bloating time or money (a small Pareto sketch follows after this recipe).
- Why it exists: Avoid overfitting to accuracy in a way that would make the system impractically slow or expensive.
- Example: If two candidates tie on accuracy, we pick the one that's cheaper and/or faster.
- Diagnose-and-Design evolution (the secret sauce)
- What happens:
- Diagnosis: Inspect trajectories and logs to locate bottlenecks: noisy summaries (Encode), bloated stores (Store), off-target retrieval (Retrieve), or stale clutter (Manage). A toy diagnosis heuristic is sketched after this recipe.
- Design: Propose concrete, modular fixes, e.g., switch Store from JSON to a graph, add an LLM guard to Retrieve, deepen Encode from 3 to 5 abstraction levels, or enable periodic pruning in Manage.
- Why it exists: Targeted upgrades beat random tweaks; modular constraints keep designs valid and runnable.
- Example (xBench Chinese query about ē·å®«): Diagnosis sees that direct text search often misses ticket inscriptions. Design adds a retrieval probe for image captions and travel-booking sites; Encode adds a "source-type tag" so those items are prioritized for similar tasks.
- Generate descendants and repeat
- What happens: Keep the top-K parent (K=1 in the paper), generate S=3 children with different, valid module settings, and run the next iteration with new task batches.
- Why it exists: Controlled exploration finds better memory designs without chaotic jumps.
- Example: Parent uses hybrid retrieval; children vary the guardrail strictness, change Store to a light graph, or add Manage-based deduplication.
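As promised in the selection step above, here is a small, generic non-dominated filter over (performance, cost, delay). It illustrates the idea of Pareto-style selection; the actual ranking code in the paper may differ.

```python
from dataclasses import dataclass


@dataclass
class Score:
    performance: float   # higher is better
    cost: float          # lower is better
    delay: float         # lower is better

    def dominates(self, other: "Score") -> bool:
        """True if self is at least as good on every objective and strictly better on one."""
        at_least_as_good = (self.performance >= other.performance
                            and self.cost <= other.cost
                            and self.delay <= other.delay)
        strictly_better = (self.performance > other.performance
                           or self.cost < other.cost
                           or self.delay < other.delay)
        return at_least_as_good and strictly_better


def pareto_front(scored):
    """Return the non-dominated subset of a list of (design, Score) pairs."""
    return [(d, s) for d, s in scored
            if not any(other.dominates(s) for _, other in scored if other is not s)]


# Toy example: B dominates C (better performance, equal cost, lower delay), so C drops out.
candidates = [("A", Score(0.62, 0.09, 110)),
              ("B", Score(0.60, 0.07, 95)),
              ("C", Score(0.58, 0.07, 120))]
print([name for name, _ in pareto_front(candidates)])   # ['A', 'B']
```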
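And here is a deliberately simple stand-in for the "Diagnose" half of Diagnose-and-Design: a rule-based pass over aggregated run statistics that points at the module most likely to be the bottleneck. The real framework analyzes full trajectories with an LLM; the field names and thresholds below are invented for illustration.

```python
def diagnose_bottleneck(stats: dict) -> tuple[str, str]:
    """Map coarse run statistics to a (module, suggested fix) pair.

    The fields (retrieval_precision, context_tokens, store_size, retrieval_latency_s)
    and thresholds are hypothetical aggregates, not the paper's logging schema.
    """
    if stats.get("retrieval_precision", 1.0) < 0.5:
        return ("Retrieve", "add a stricter filter or LLM guardrail before injecting context")
    if stats.get("context_tokens", 0) > 8_000:
        return ("Encode", "summarize at more abstraction levels so less raw text is replayed")
    if stats.get("store_size", 0) > 5_000 and stats.get("retrieval_latency_s", 0.0) > 2.0:
        return ("Manage", "enable periodic pruning and deduplication")
    return ("Store", "consider a graph layout so related items are co-retrieved")


# Example: low retrieval precision points the redesign step at the Retrieve module.
print(diagnose_bottleneck({"retrieval_precision": 0.3, "store_size": 1200}))
```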
Concrete module behaviors:
Hook: Think of a factory line with four stations: shaping, shelving, picking, and cleaning.
The Concept (Encode):
- What it is: Turn raw runs into structured bits: tips, templates, multi-level summaries, or even Python tools.
- How it works: Extract key steps, checks, failure modes, reusable functions; tag with task/domain.
- Why it matters: Bad encoding = hard-to-reuse junk.
Anchor: After solving a Wikipedia revision task, Encode saves a reusable MediaWiki-API snippet and a checklist: locate the canonical page → revisions API → cutoff at release month → count.
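To give a flavor of the kind of reusable snippet Encode might save, here is a small Python helper that counts a page's revisions up to a cutoff date through the public MediaWiki API (action=query, prop=revisions). The endpoint and parameters are the standard ones, but this particular helper is our illustration, not code taken from the paper.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"


def count_revisions(title: str, cutoff_iso: str) -> int:
    """Count revisions of `title` made up to `cutoff_iso` (e.g. "2019-05-31T23:59:59Z")."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids",
        "rvlimit": "max",
        "rvdir": "newer",      # enumerate oldest -> newest ...
        "rvend": cutoff_iso,   # ... and stop at the cutoff timestamp
        "format": "json",
        "formatversion": "2",
    }
    total = 0
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        page = data["query"]["pages"][0]
        total += len(page.get("revisions", []))
        if "continue" not in data:
            return total
        params.update(data["continue"])   # follow the continuation token


# Hypothetical usage: count_revisions("Some Film (2019 film)", "2019-05-31T23:59:59Z")
```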
Hook: Shelves matter: books dumped in a pile get lost.
The Concept (Store):
- What it is: Where and how you keep the items: vector DBs, JSON, graphs, or code libraries.
- How it works: Index by tags, semantics, or relationships; support fast and flexible queries.
- Why it matters: Poor storage slows retrieval and loses structure.
Anchor: A knowledge-graph Store links tasks to required tools and outcomes, so related tricks are discovered together.
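A toy version of that shelving idea: a tagged store that also keeps lightweight links between items, approximating the graph flavor described above. Real Store modules range from vector databases to code libraries; this dictionary version only shows why a bit of structure helps lookup.

```python
from collections import defaultdict


class TaggedGraphStore:
    """Toy store: items indexed by tag, plus undirected links between item ids (illustrative only)."""

    def __init__(self):
        self.items: dict[str, dict] = {}                      # id -> payload
        self.by_tag: dict[str, set[str]] = defaultdict(set)   # tag -> ids
        self.links: dict[str, set[str]] = defaultdict(set)    # id -> related ids

    def add(self, item_id: str, payload: dict, tags: list[str]) -> None:
        self.items[item_id] = payload
        for tag in tags:
            self.by_tag[tag].add(item_id)

    def link(self, a: str, b: str) -> None:
        self.links[a].add(b)
        self.links[b].add(a)

    def neighbors(self, item_id: str) -> list[dict]:
        """Related tricks are discovered together by walking one hop of links."""
        return [self.items[i] for i in self.links[item_id]]


store = TaggedGraphStore()
store.add("rev-count", {"tip": "use the revisions API with a date cutoff"}, ["wikipedia", "tool"])
store.add("oldid-audit", {"tip": "store exact oldids for auditability"}, ["wikipedia"])
store.link("rev-count", "oldid-audit")
print(store.neighbors("rev-count"))   # the linked auditability tip comes along
```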
Hook: Grabbing the right book fast beats skimming the whole library.
The Concept (Retrieve):
- What it is: How memory is fetched for a task: semantic search, hybrid filters, LLM guardrails, or skill-based probes.
- How it works: Match by meaning, verify with light reasoning, and adapt to the current stage (plan vs. execute).
- Why it matters: Off-target retrieval confuses the agent and wastes tokens.
Anchor: For the "ticket text on ē·å®«" query, Retrieve prefers prior memories tagged "image captions / travel booking", plus a short LLM refinement to clean noise.
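A minimal sketch of hybrid retrieval with a guardrail: score memories by keyword overlap and by embedding similarity, blend the two, and let an optional LLM "guard" callback veto off-topic items. The blend weight and the guard interface are illustrative assumptions, not the paper's settings.

```python
import math


def keyword_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the memory item (crude lexical match)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_retrieve(query, query_vec, memories, k=3, alpha=0.5, llm_guard=None):
    """Blend lexical and semantic scores; optionally drop items an LLM guard rejects.

    `memories` holds dicts with "text" and "vec"; `llm_guard(query, text) -> bool`
    is a placeholder for a cheap relevance check by a language model.
    """
    scored = sorted(
        ((alpha * keyword_score(query, m["text"])
          + (1 - alpha) * cosine(query_vec, m["vec"]), m) for m in memories),
        key=lambda pair: -pair[0],
    )
    ranked = [m for _, m in scored]
    if llm_guard is not None:
        ranked = [m for m in ranked if llm_guard(query, m["text"])]
    return ranked[:k]
```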
Hook: Closets get messy; cleaning days keep things useful.
The Concept (Manage):
- What it is: Offline maintenance: consolidate, prune, or forget.
- How it works: Merge duplicates, remove stale items, compress long trails, rebalance indices.
- Why it matters: Without it, the memory bloats and slows down.
Anchor: After many runs, Manage prunes low-confidence tips and merges overlapping tools, keeping only the clearest versions.
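A small "cleaning day" routine in the same spirit: drop low-confidence or stale items, then merge near-duplicates, keeping the clearest version. The confidence and timestamp fields are hypothetical bookkeeping a memory item might carry, not the paper's schema.

```python
import time
from difflib import SequenceMatcher


def manage(items, min_confidence=0.3, max_age_days=90, dedup_threshold=0.9):
    """Prune weak or stale memory items, then merge near-duplicate texts (toy version).

    Each item is assumed to carry "text", "confidence", and a "created" epoch timestamp.
    """
    now = time.time()
    kept = [it for it in items
            if it["confidence"] >= min_confidence
            and (now - it["created"]) < max_age_days * 86_400]

    merged: list[dict] = []
    for it in sorted(kept, key=lambda x: -x["confidence"]):   # clearest version first
        is_duplicate = any(
            SequenceMatcher(None, it["text"], m["text"]).ratio() >= dedup_threshold
            for m in merged
        )
        if not is_duplicate:
            merged.append(it)
    return merged
```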
Implementation and setup details:
- EvolveLab provides a single BaseMemoryProvider that all 12 re-implemented systems inherit from, enforcing the Encode/Store/Retrieve/Manage interface.
- Benchmarks include GAIA, WebWalkerQA, xBench-DS, and TaskCraft.
- Inner loop batch per candidate: 60 tasks (40 new, 20 repeated).
- Outer loop per iteration: select top-1 parent, generate 3 children; run for 3 iterations.
- Tested with frameworks like SmolAgent (lightweight) and Flash-Searcher (high-performance research), and transfers to CK-Pro and OWL.
Secret Sauce summary:
- Tightly coupled diagnosis (from real trajectories) + modular, constrained redesign.
- Pareto-aware selection balances accuracy, cost, and speed.
- Progressive, stage-aware retrieval and multi-level encoding unlock reliable gains without runaway cost.
04 Experiments & Results
Hook: Imagine a school tournament where teams are judged not only by points scored, but also by how quickly and cheaply they earned them.
The Concept (The Test):
- What it is: Evaluate memory systems across four tough benchmarksāGAIA, WebWalkerQA, xBench-DS, and TaskCraft.
- How it works:
- Measure task success (e.g., pass@1 through pass@3),
- Track API cost per task,
- Track execution latency and steps.
- Why it matters: Real agents must be accurate, affordable, and fast enough.
Anchor: It's like saying, "Great score, but did you drain your whole budget or take all afternoon?"
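On the metrics side, pass@k is easiest to read as "at least one of k attempts succeeds." The snippet below computes that empirical version from logged attempts; some benchmarks use a fancier unbiased estimator, so treat this as the plain reading of pass@1 through pass@3, not necessarily the paper's exact scoring script.

```python
def pass_at_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k logged attempts."""
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)


runs = [[False, True, True], [True, False, False], [False, False, False]]
print(round(pass_at_k(runs, 1), 2), round(pass_at_k(runs, 3), 2))   # 0.33 0.67
```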
Hook: Who are they playing against?
The Concept (The Competition):
- What it is: Compare MemEvolve to both memory-free agents and popular self-improving memory baselines (e.g., Voyager, ExpeL, Dynamic Cheatsheet), and integrate it into frameworks like SmolAgent and Flash-Searcher.
- How it works: Keep the rest of the agent the same; only switch memory systems so the comparison is fair.
- Why it matters: We want to know if evolving memory beats fixed, human-designed memories.
Anchor: In Flash-Searcher, swapping in different memory modules shows which approach truly helps on GAIA, xBench-DS, and WebWalkerQA.
Hook: Numbers are fun, but only if they mean something.
The Concept (Scoreboard with context):
- Highlights:
- Up to 17.06% improvement in some cross-model settings (e.g., Kimi K2 on WebWalkerQA).
- On GAIA, MemEvolve achieves pass@3 of about 80.61% in one setup, competitive with strong multi-agent systems, without building huge offline knowledge bases.
- On xBench-DS with GPT-5-mini, MemEvolve lifts pass@1 by roughly 6% and reaches strong pass@3, showing clear benefits over no-memory baselines.
- Costs remain similar to no-memory: for Flash-Searcher on GAIA, MemEvolve's per-task cost is about 0.086, in line with the no-memory baseline; delays are on par with other self-improving baselines.
- What it means: It's like getting an A while spending the same allowance and about the same time as your classmates.
Anchor: On WebWalkerQA and xBench-DS, the evolved memory from TaskCraft still boosts results without any extra tuning, like borrowing a study method from English class that also helps in history.
Hook: Will it work on other teams and textbooks?
The Concept (Generalization):
- What it is: Transfer the evolved memory to different benchmarks, LLMs (e.g., Kimi K2, DeepSeek V3.2), and agent frameworks (CK-Pro, OWL) without re-evolving.
- What happened: Gains persist across these transfers, evidence that MemEvolve learns broadly useful memory principles within the task family.
- Why it matters: Lower maintenance; no need to always re-design from scratch.
Anchor: The same retrieval upgrade that helped TaskCraft also helped WebWalkerQA and xBench-DS, much like a good outline style helps across essays.
Hook: Any surprises?
The Concept (Surprising Findings):
- Some famous fixed memories didn't boost performance consistently. For example, ExpeL underperformed across the board in this deep-research setting; it was just built for different types of problems.
- Evolved memories showed steady, compounding gains as more tasks accumulated, indicating they discovered sound design patterns rather than one-off tricks.
- Stage-aware retrieval and multi-level encoding appeared again and again in the top designs, suggesting these are robust features of effective agent memory.
Anchor: Think of the champion strategy: small but smart changes (better filters, clearer summaries, occasional pruning) winning again and again across varied questions.
05 Discussion & Limitations
Hook: Even superheroes have limits; knowing them keeps the team safe and strong.
The Concept (Limitations):
- What it is: Where MemEvolve might struggle.
- How it works:
- Transfers best within a related task family (e.g., research/planning). Jumping to radically different worlds (like embodied robotics) may require fresh evolution.
- Quality of evolution depends on the agent's signals; noisy feedback can mislead the outer loop.
- Iterative evolution adds overhead; though costs stayed near baseline in tests, extreme settings could increase latency.
- Why it matters: Knowing boundaries helps decide when to re-evolve or adjust objectives.
Anchor: A note-taking system that shines in reading-heavy classes may not directly suit a woodworking shop; you'd re-tune it.
Hook: What do you need in your backpack to use this?
The Concept (Required Resources):
- What it is: Practical needs to run MemEvolve.
- How it works:
- Access to LLM APIs or local models and tools.
- Compute to run batches of tasks per iteration.
- Storage for memory items (DBs/graphs/code repos) and logs for diagnosis.
- Why it matters: Evolution needs data volume and reliable metrics to improve.
Anchor: Like a sports team needs a field, balls, and scoreboards before training plans can be improved.
Hook: When should you not flip the switch?
The Concept (When NOT to Use):
- What it is: Cases where MemEvolve may not be worth it.
- How it works:
- Very small, fixed task sets: hand-tuned prompts might be simpler.
- Severe latency/cost limits: no room for iterative search.
- Domains with no stable signal for success: too noisy to guide design.
- Why it matters: Sometimes the simplest tool is best.
Anchor: If you only have five trivial homework questions, you probably don't need a full study-system overhaul.
Hook: What's next on the roadmap?
The Concept (Open Questions):
- What it is: Future puzzles to solve.
- How it works:
- Can we speed up evolution with better surrogate rewards or smaller pilot tasks?
- How to auto-detect domain shifts and trigger re-evolution only when needed?
- Can Manage learn optimal forgetting schedules automatically?
- How to co-evolve tools and workflows alongside memory for even bigger gains?
- Why it matters: Sharper, faster, more autonomous evolution expands where agents can shine.
Anchor: Imagine the system noticing a new domain (like chemistry) and auto-switching to a tool-centric memory with safe lab checklists, no human in the loop.
06 Conclusion & Future Work
Hook: Picture a student who doesn't just learn facts, but also keeps improving how they study for each subject.
The Concept (3-sentence summary): MemEvolve is a meta-evolution framework where agents evolve both their experiences and the memory architectures that learn from those experiences. It uses a dual-loop process: an inner loop that gathers and applies experience, and an outer loop that diagnoses and redesigns memory modules (Encode, Store, Retrieve, Manage). Built on the unified EvolveLab codebase, MemEvolve delivers consistent gains, strong transfer to new tasks, models, and frameworks, and keeps costs and speed in check with Pareto-aware selection.
Anchor: In practice, this meant better success rates on GAIA, WebWalkerQA, xBench-DS, and TaskCraft, often without extra tuning when switching models or frameworks.
Main achievement: Turning memory from a fixed, hand-crafted component into an adaptive, data-driven architecture that reliably boosts agent performance.
Future directions:
- Faster, lighter evolution using smarter proxies and auto domain-shift detectors.
- Joint evolution of memory with tools and workflows for even larger gains.
- Learned maintenance (Manage) that balances freshness and stability over very long horizons.
Why remember this: It marks a shift from "agents that learn" to "agents that also improve how they learn," much like the step from a diligent student to a truly adaptive learner who keeps upgrading their study system.
Practical Applications
- Build adaptive research assistants that learn better web-search routines and source-checking steps over time.
- Create coding copilots that evolve from storing snippets to auto-distilling reusable utilities and test templates.
- Deploy customer-support bots that refine retrieval filters and summary styles based on actual ticket outcomes.
- Run internal knowledge bases that auto-prune stale content and promote high-confidence, frequently used tips.
- Design classroom tutors that change their hint strategies for math vs. reading based on student performance.
- Operate enterprise agents that evolve domain-specific workflows, like compliance checks or audit trails.
- Enable lab assistants that learn tool-centric procedures (APIs, scripts) and keep safety checklists up to date.
- Support multilingual search agents that adapt storage and retrieval to different languages and data sources.
- Build project planners that evolve from generic templates to stage-aware guidance (plan, execute, verify).
- Automate competitive analysis agents that refine data collection and summarization methods across markets.