MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Key Summary
- MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-made rules.
- A small controller picks a few relevant skills, and an LLM executor uses those skills in one go to write and edit memories.
- A designer watches where the system struggles, then refines old skills and invents new ones to fix those problems.
- This creates a closed loop: use skills, see what failed, evolve skills, and repeat.
- MemSkill works at span level (chunks of text) so it scales better to long histories than turn-by-turn methods.
- On LoCoMo, LongMemEval, HotpotQA, and ALFWorld, MemSkill beats strong baselines across different tasks.
- Skills learned with one base model (LLaMA) transfer well to another (Qwen) without retraining.
- Ablations show both parts matter: the controller’s smart skill selection and the designer’s skill evolution.
- Evolved skills become domain-aware (e.g., ‘temporal context’ for chat; ‘object location’ for embodied tasks).
- This pushes AI agents toward self-improving memory that better fits real, changing situations.
Why This Research Matters
Real people talk to assistants over weeks or months, and projects span many steps—so memory must adapt as things change. MemSkill lets AI agents build and refine memories with skills that grow from real mistakes, not just fixed rules written upfront. That makes answers more consistent, plans more reliable, and long conversations less confusing. It also transfers across models and datasets, reducing rework when you switch tools. By revealing and evolving domain-specific skills (like tracking timelines or object locations), MemSkill makes memory more interpretable. Ultimately, it nudges AI toward being a thoughtful long-term partner, not just a short-term responder.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class notebook getting longer and longer all year. If you don’t organize it, finding last month’s lesson becomes a treasure hunt.
🥬 The Concept (Agent Memory): An AI agent’s memory is the place it saves helpful bits from past conversations or actions so it can stay consistent and smart later.
- How it works: (1) Read new interactions; (2) Decide what to store, update, or forget; (3) Retrieve the right memories to answer new questions.
- Why it matters: Without memory, the agent repeats itself, forgets promises, and gets confused on long tasks. 🍞 Anchor: A tutor-bot that remembers you love dinosaurs won’t suggest volcano projects next time unless you changed your mind.
🍞 Hook: You know how some rules like “always highlight the first sentence” don’t work for every page?
🥬 The Concept (Static, Handcrafted Operations): Many AI systems used fixed rules like ADD/UPDATE/DELETE with human-written instructions.
- How it works: (1) A pipeline applies the same steps each turn; (2) An LLM fills details; (3) Heuristics decide when to revise or prune.
- Why it matters: When conversations get weird or super long, fixed rules break—either they miss important info or keep too much junk. 🍞 Anchor: If your rule is “always note the first fact,” you’ll miss the important twist hidden in paragraph four.
🍞 Hook: When your binder gets too thick, you stop filing page-by-page and switch to summarizing whole weeks at a time.
🥬 The Concept (Granularity Problem): Many systems processed memory turn-by-turn, which is too tiny for long histories.
- How it works: They extract per message, revising bit by bit.
- Why it matters: This is slow, repetitive, and weak at capturing bigger patterns over long spans. 🍞 Anchor: Summarizing a whole chapter is better than copying each sentence line-by-line.
🍞 Hook: Imagine having a set of study tricks you can mix and match: “make a timeline,” “track characters,” “summarize steps.”
🥬 The Concept (Learnable Memory Skills): Instead of fixed rules, treat memory behaviors as skills the system can learn, reuse, and evolve.
- How it works: (1) Keep a skill bank (like a toolbox); (2) Pick a few relevant skills for the current chunk; (3) Apply them in one go to produce memory updates.
- Why it matters: This adapts to different kinds of info and different tasks as histories grow. 🍞 Anchor: For a history lesson, you pick “timeline” and “who did what” skills; for a science lab, you pick “steps” and “conditions.”
🍞 Hook: After quizzes, good students check mistakes and upgrade how they study next time.
🥬 The Concept (Closed-Loop Improvement): The system should learn which skills to use and also evolve the skills themselves from tough cases.
- How it works: (1) Use skills; (2) See what failed; (3) Update which skills to pick; (4) Improve or add skills.
- Why it matters: Without the loop, the agent keeps repeating the same errors in new forms. 🍞 Anchor: If you keep missing date questions, you add a new study trick: always build a timeline.
The World Before: AI agents had long conversations and complex tasks, but memory systems often relied on human-written routines and step-by-step turn processing. These were brittle when topics shifted, slow on long histories, and hard to tune.
The Problem: How can an agent flexibly decide what to store, how to refine it later, and when to prune, across many different styles of interactions—without hand-writing all the logic?
Failed Attempts:
- Fixed primitives (add/update/delete) with human heuristics: brittle under distribution shifts; often over- or under-store.
- Per-turn extraction: too small a lens, causing repetition and missing cross-span patterns.
- One-shot summaries: may become vague, losing key structure like timelines or relationships.
The Gap: A method was missing that (1) learns what memory behaviors to apply, (2) composes them at span-level for scale, and (3) evolves those behaviors over time from real failures.
Real Stakes: Better memory makes assistants consistent across months, customer support accurate over long tickets, educational tutors personalized, research copilots less redundant, and robots able to follow multi-step plans without losing track of objects or constraints.
02 Core Idea
🍞 Hook: Picture a travel kit with tools: a map-maker, a checklist-writer, and a timeline-drawer. Before each trip, you pick a few tools to pack.
🥬 The Concept (MemSkill): The key idea is to turn memory operations into learnable, reusable, and evolving skills—and to pick a small, relevant set for each chunk of context, then apply them in one pass.
- How it works: (1) Keep a shared skill bank (like a toolbox); (2) A controller chooses Top-K skills for the current span; (3) An LLM executor uses only those skills to write or edit memories; (4) A designer studies hard failures and refines or invents new skills; (5) Repeat.
- Why it matters: Without skill-conditioned, span-level memory, agents stay slow, brittle, and overfit to hand-written rules. 🍞 Anchor: For a long chat, the controller might pick “capture temporal context” + “handle entity relationships” + “update details,” helping the executor produce crisp, helpful memory entries.
The “Aha!” Moment (one sentence): Instead of hard-coding what memory to build, teach the agent a set of memory skills and let it learn which ones to use—and even how to evolve those skills over time.
Multiple Analogies:
- Chef’s Menu: The skill bank is a cookbook. The controller picks recipes (skills) that fit today’s ingredients (text span), and the executor cooks the dish (memory entry). The designer writes new recipes after tasting failures.
- Sports Team: Skills are plays. The controller (coach) picks plays for the situation. The executor (players) runs them. The designer (analyst) reviews game tapes and updates the playbook.
- School Study Kit: Skills are study tricks (timelines, character maps, step lists). The controller picks the right tricks for a chapter; the executor writes the notes; the designer adds new tricks when tests show a weakness.
🍞 Hook: You know how switching from sentence-by-sentence notes to chapter summaries makes you faster and clearer?
🥬 The Concept (Span-Level, Skill-Conditioned Generation): MemSkill operates over chunks (spans), not just individual turns, and composes a handful of skills in one LLM call to produce memory updates.
- How it works: (1) Segment long history into spans; (2) Retrieve related memories; (3) Pick skills; (4) Generate structured updates once per span.
- Why it matters: Reduces repeated work, captures cross-sentence patterns, and scales to long histories. 🍞 Anchor: For 10 pages of chat, pick “timeline + activities + entity links,” generate 5 clean memory items at once instead of 50 tiny per-turn edits.
🍞 Hook: After a test, you learn which study tricks actually helped.
🥬 The Concept (Reinforcement Learning for Skill Selection): The controller learns to pick skills by getting a reward from downstream task performance.
- How it works: (1) Build memory using chosen skills; (2) Answer benchmark questions or complete tasks; (3) Use the score as feedback to improve future picks.
- Why it matters: Without this learning, the controller might pick random or redundant skills. 🍞 Anchor: If choosing “timeline” increases your quiz score, you’ll pick it more often.
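Before the full method, here is the feedback idea in miniature: skills that were picked on high-reward traces should become more likely next time. The toy bandit-style table below is only an illustration (the paper trains a neural controller with PPO, covered in the Methodology section), and the skill names are made up for this sketch.

```python
class SkillPreferences:
    """Toy running average of how much each skill helped downstream tasks."""
    def __init__(self, skill_names):
        self.value = {name: 0.0 for name in skill_names}
        self.count = {name: 0 for name in skill_names}

    def update(self, picked_skills, reward):
        for name in picked_skills:
            self.count[name] += 1
            # Incremental mean: pull the estimate toward the observed reward.
            self.value[name] += (reward - self.value[name]) / self.count[name]

prefs = SkillPreferences(["timeline", "entity_links", "summarize_steps"])
prefs.update(["timeline", "entity_links"], reward=0.8)  # these picks helped on the quiz
prefs.update(["summarize_steps"], reward=0.2)           # this pick helped less
print(prefs.value)  # 'timeline' and 'entity_links' now look more promising
```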
🍞 Hook: When the same kind of mistake keeps happening, you invent a new trick.
🥬 The Concept (Designer-Led Skill Evolution): A designer reviews clusters of hard failures and refines or adds skills that would have prevented them.
- How it works: (1) Keep a buffer of tough cases; (2) Cluster them; (3) Analyze root causes; (4) Edit templates or create new skills; (5) Roll back if updates hurt.
- Why it matters: Prevents stagnation and lets the system grow abilities the initial set didn’t cover. 🍞 Anchor: Missing time clues? Add a “capture temporal context” skill so dates and sequences never slip by again.
Before vs After:
- Before: Fixed, turn-level rules; brittle, slow, often vague or redundant.
- After: Learnable, span-level skills; adaptive, efficient, and structured.
Why It Works (intuition): Separating “how to remember” (skills) from “what to remember right now” (controller’s picks) lets the system reuse good patterns while staying flexible. Feedback from tasks teaches better picks, and designer evolution fills ability gaps.
Building Blocks (brief sandwiches):
- 🍞 Skill Bank: A shared toolbox of structured memory behaviors. Steps: store templates (purpose, when, how, constraints); keep short descriptions for selection; expand/refine over time. Why: A central place to grow reusable memory know-how. 🍞 Example: “Capture Activity Details,” “Track Object Location.” (See the code sketch after this list.)
- 🍞 Controller: A picker that matches the current span to a few relevant skills by embedding both and selecting Top-K. Why: Too many skills at once confuses the executor; a small, right set focuses it. 🍞 Example: For a plan update, pick “Update” + “Refine Temporal Details.”
- 🍞 Executor (LLM): The writer that applies selected skills in one pass to produce INSERT/UPDATE/DELETE. Why: One coherent generation scales better and reduces contradictions. 🍞 Example: Generates three crisp memory items after reading a 512-token span.
- 🍞 Designer: The improver that studies failures, clusters them, and edits or creates skills; rolls back bad changes. Why: Keeps the toolbox evolving with real data. 🍞 Example: Adds “Handle Entity Relationships” after repeated linking errors.
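To make the Skill Bank concrete, here is a minimal Python sketch of a skill template and the bank that holds it. The field names (purpose, when, how, constraints, description) follow the list above, but the exact schema and wording are assumptions of this sketch, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """One reusable memory behavior, stored as a structured template."""
    name: str          # e.g. "Capture Temporal Context"
    purpose: str       # what kind of information the skill captures
    when: str          # conditions under which the skill applies
    how: str           # step-by-step instructions for the executor
    constraints: str   # formatting and content rules for the output
    description: str   # short summary the controller embeds for selection

@dataclass
class SkillBank:
    """Shared, growing toolbox; the designer refines or adds entries over time."""
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

bank = SkillBank()
bank.add(Skill(
    name="Capture Temporal Context",
    purpose="Record dates, times, and the order of events.",
    when="The span mentions schedules, deadlines, or sequences.",
    how="Extract each time reference and link it to its event.",
    constraints="One memory item per dated event; prefer absolute dates.",
    description="Extracts dates, times, and event ordering from the span.",
))
```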
03 Methodology
High-Level Recipe: Input → Segment into spans → Retrieve related memories → Controller selects Top-K skills → LLM executor applies skills once → Update memory bank → Evaluate tasks to get reward → Log hard cases → Designer evolves skills → Repeat.
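Read as code, the recipe is a single loop. Below is a structural sketch in Python: every component (segment, retrieve, controller, executor, designer, evaluate) is passed in as a placeholder for the pieces described in the rest of this section, so this shows the data flow under those assumptions rather than a full implementation.

```python
def memskill_cycle(history, tasks, memory_bank, skill_bank,
                   segment, retrieve, controller, executor, designer, evaluate):
    """One MemSkill cycle (structural sketch with duck-typed components)."""
    for span in segment(history, max_tokens=512):                     # segment into spans
        related = retrieve(memory_bank, span, top_r=20)               # retrieve related memories
        skills = controller.select_top_k(span, related, skill_bank)   # pick Top-K skills
        actions = executor.generate(span, related, skills)            # one skill-conditioned LLM call
        memory_bank.apply(actions)                                    # apply INSERT / UPDATE / DELETE

    reward, failures = evaluate(tasks, memory_bank)                   # downstream task score
    controller.update(reward)                                         # RL update for skill selection
    designer.evolve(skill_bank, failures)                             # refine or add skills (with rollback)
    return reward
```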
🍞 Hook: Think of turning a huge book into tidy chapter notes using just the right note-taking tricks.
🥬 The Concept (Span Segmentation): Split long interaction histories into manageable spans (e.g., ~512 tokens).
- How it works: (1) Chunk the text; (2) Process spans in order.
- Why it matters: Prevents overload and lets skills capture patterns across multiple sentences at once. 🍞 Anchor: Instead of 1,000 tiny sticky notes, you make 10 clear chapter summaries.
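A minimal chunking sketch, assuming whitespace tokens for illustration (the paper's tokenizer and span size may differ):

```python
def segment(text: str, max_tokens: int = 512) -> list[str]:
    """Split a long history into ordered spans of roughly max_tokens words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

# Example: a 1,200-word transcript becomes three spans.
print(len(segment("word " * 1200)))  # -> 3
```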
🍞 Hook: Before writing new notes, you glance at your old notes to avoid repeating yourself.
🥬 The Concept (Memory Retrieval): For each span, pull up to R relevant memory items (e.g., 20) from the current trace’s memory bank.
- How it works: (1) Use an embedding retriever (e.g., Contriever); (2) Fetch top matches.
- Why it matters: Context about what you already know prevents duplicates and guides updates. 🍞 Anchor: You see you already saved “Lily moved to Boston,” so you update it with “in March” instead of adding a duplicate.
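A hedged sketch of top-R retrieval by cosine similarity. The embed argument stands in for an encoder such as Contriever, and the (text, vector) memory format is an assumption of this sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory_items, span, embed, top_r=20):
    """Return the top_r stored memories most similar to the current span."""
    query = embed(span)
    ranked = sorted(memory_items, key=lambda item: cosine(item[1], query), reverse=True)
    return [text for text, _ in ranked[:top_r]]
```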
🍞 Hook: When cooking, you don’t use every spice—just a few that fit the dish.
🥬 The Concept (Controller’s Top-K Skill Selection): Pick a small set of relevant skills from a growing skill bank.
- What it is: A learned policy that scores skills based on how well their descriptions match the current span + retrieved memories.
- How it works: (1) Embed the span+memories as state; (2) Embed each skill’s description; (3) Score via dot products; (4) Sample an ordered Top-K without replacement; (5) Train with PPO using downstream task rewards.
- Why it matters: Keeps the executor focused and adaptable as new skills are added. 🍞 Anchor: For a schedule update, the controller picks “Capture Temporal Context,” “Update,” and “Refine Temporal Details.”
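A simplified sketch of the selection step: score each skill by a dot product between the state embedding and the skill-description embedding, then sample an ordered Top-K without replacement from a softmax over the remaining skills. The PPO training loop that shapes these choices is omitted, and the toy embeddings are made up.

```python
import math, random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(state_vec, skill_vecs, k=3, rng=random):
    """Sample an ordered Top-K set of skill indices without replacement."""
    remaining = list(range(len(skill_vecs)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        scores = [sum(s * v for s, v in zip(state_vec, skill_vecs[i])) for i in remaining]
        probs = softmax(scores)
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Toy example: a 2-D state embedding and four skill-description embeddings.
state = [0.9, 0.1]
skills = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3], [0.1, 0.9]]
print(select_top_k(state, skills, k=2))
```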
🍞 Hook: Once the tools are chosen, the builder gets to work.
🥬 The Concept (LLM Executor with Skill Conditioning): The LLM receives the span, retrieved memories, and selected skills, and outputs structured actions.
- How it works: (1) Follow each skill’s purpose/when/how/constraints; (2) Emit INSERT/UPDATE/DELETE blocks; (3) Parser applies them to the memory bank.
- Why it matters: One-pass, skill-guided generation is more consistent and scalable than many tiny calls. 🍞 Anchor: After reading a dialog span, the executor outputs: INSERT: “Project kickoff on June 3 (Alice, remote).” UPDATE: (add room number to an earlier meeting memory).
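A hedged sketch of how the executor's output could be parsed into memory operations. The INSERT:/UPDATE:/DELETE: line format is an illustrative convention, not necessarily the paper's exact output schema.

```python
def parse_actions(llm_output: str):
    """Parse one skill-conditioned generation into (operation, payload) pairs."""
    actions = []
    for line in llm_output.splitlines():
        line = line.strip()
        for op in ("INSERT:", "UPDATE:", "DELETE:"):
            if line.startswith(op):
                actions.append((op.rstrip(":"), line[len(op):].strip()))
    return actions

example_output = """
INSERT: Project kickoff on June 3 (Alice, remote).
UPDATE: Demo on May 28 -> Demo on June 3 (remote).
"""
print(parse_actions(example_output))
# [('INSERT', 'Project kickoff on June 3 (Alice, remote).'),
#  ('UPDATE', 'Demo on May 28 -> Demo on June 3 (remote).')]
```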
🍞 Hook: You check your notes by trying homework that depends on them.
🥬 The Concept (Task-Centered Evaluation & Reward): After updating the memory bank for a trace, answer associated questions or run tasks to get a score.
- How it works: (1) Use the built memory; (2) Compute F1, success rate, or a judge score; (3) Treat it as reward for the controller’s choices.
- Why it matters: Teaches the controller which skill subsets really help downstream performance. 🍞 Anchor: If the agent now answers more timeline questions correctly, the controller learns to pick timing-related skills more often.
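For QA-style traces, one concrete reward is token-level F1 between the predicted and gold answers (the paper also uses success rates and LLM-judge scores); a minimal sketch:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1, usable as a scalar reward for the controller."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("demo on June 3 led by Alice", "Alice leads the demo on June 3"))  # ~0.71
```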
🍞 Hook: When mistakes repeat, you keep a list to fix patterns, not just single slips.
🥬 The Concept (Hard-Case Buffer): Keep a rolling set of tough failures with details like the question, ground truth, model answer, used memories, and failure count.
- How it works: (1) Sliding window; (2) Cluster by similarity; (3) Rank by difficulty (low reward × repeated fails).
- Why it matters: Focuses improvement on impactful, recurring errors. 🍞 Anchor: Many misses involve who-did-what-when; cluster them to prompt a better activity/temporal skill.
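A minimal sketch of the buffer bookkeeping. The field names and the difficulty heuristic (1 - reward) * fail_count are illustrative choices, and clustering by similarity is left out for brevity.

```python
from collections import deque

class HardCaseBuffer:
    """Rolling buffer of failures; recurring, low-reward cases surface first."""
    def __init__(self, max_size: int = 200):
        self.cases = deque(maxlen=max_size)  # sliding window

    def add(self, question, gold, answer, used_memories, reward, fail_count):
        self.cases.append({
            "question": question, "gold": gold, "answer": answer,
            "memories": used_memories, "reward": reward, "fail_count": fail_count,
        })

    def hardest(self, n: int = 10):
        # Difficulty grows when reward is low and the same case keeps failing.
        return sorted(self.cases,
                      key=lambda c: (1 - c["reward"]) * c["fail_count"],
                      reverse=True)[:n]
```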
🍞 Hook: A coach reviews game tapes and updates the playbook.
🥬 The Concept (Two-Stage Skill Evolution by Designer): Analyze patterns then propose changes.
- How it works: (1) LLM-based analysis suggests missing or misused behaviors; (2) Edit templates of existing skills or add new ones; (3) Keep best-performing snapshots; roll back bad updates.
- Why it matters: Lets the toolbox grow beyond the starter set and stay aligned with data needs. 🍞 Anchor: Add “Handle Entity Relationships” to capture roles like mentor/student or teammate/opponent.
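A structural sketch of the two-stage loop with rollback. Here analyze and propose stand for LLM prompts, evaluate stands for a validation run, and the bank is assumed to support copy(); all of these names are placeholders.

```python
def evolve_skill_bank(skill_bank, hard_cases, analyze, propose, evaluate):
    """Two-stage evolution: diagnose failure clusters, then edit or add skills."""
    snapshot = skill_bank.copy()                 # keep the best-performing version
    baseline = evaluate(skill_bank)

    diagnosis = analyze(hard_cases)              # stage 1: find missing or misused behaviors
    candidate = propose(skill_bank, diagnosis)   # stage 2: edit templates or add new skills

    if evaluate(candidate) >= baseline:          # keep changes only if they help
        return candidate
    return snapshot                              # roll back a harmful update
```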
🍞 Hook: When you get a new study trick, you try it more at first to see if it helps.
🥬 The Concept (Exploration Bias for New Skills): After evolution, temporarily boost the chance of picking new skills so the controller can evaluate them.
- How it works: (1) Increase logits for new skills to meet a target probability; (2) Decay this encouragement over ~50 steps.
- Why it matters: Prevents ignoring promising skills just because they’re new. 🍞 Anchor: You promise yourself to always draw a timeline for the next few readings to test its value.
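A minimal sketch of the temporary boost: raise a new skill's logit toward the value that would give it a target selection probability, then fade the boost to zero over about 50 steps. The exact target and schedule are assumptions of this sketch.

```python
import math

def boosted_logits(logits, new_skill_ids, step, target_prob=0.3, decay_steps=50):
    """Temporarily raise logits of newly added skills, fading the boost over time."""
    fade = max(0.0, 1.0 - step / decay_steps)  # 1.0 at step 0, 0.0 after decay_steps
    boosted = list(logits)
    for i in new_skill_ids:
        others = [l for j, l in enumerate(logits) if j != i]
        # Logit that would give target_prob under a softmax against the other skills.
        target_logit = (math.log(sum(math.exp(l) for l in others))
                        + math.log(target_prob / (1.0 - target_prob)))
        boosted[i] += fade * max(0.0, target_logit - boosted[i])
    return boosted

print(boosted_logits([0.2, 0.1, -0.3], new_skill_ids=[2], step=0))
```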
Example Walkthrough:
- Input: A 512-token dialog span: “We moved the demo from May 28 to June 3. Alice will lead, remote. Bob will draft slides by Friday.”
- Retrieved memories include: “Demo on May 28,” “Alice is PM,” “Bob writes slides.”
- Controller picks K=3: Capture Temporal Context, Update, Capture Activity Details.
- Executor actions: (1) UPDATE memory[‘Demo on May 28’] → ‘Demo on June 3 (remote).’ (2) INSERT ‘Alice leads demo on June 3 (remote).’ (3) INSERT ‘Bob drafts slides due Friday.’
- Evaluate: Answer “When is the demo and who leads it?” → Now easy; reward increases.
- Hard-Case Buffer: Later, repeated misses on relationships trigger adding “Handle Entity Relationships.”
Secret Sauce:
- Skill-conditioned, one-pass generation at span level (scales and stays coherent).
- A controller that learns which small subset of skills matters right now.
- A designer that grows and sharpens the skill bank using real failure patterns and rollback safeguards.
04 Experiments & Results
🍞 Hook: Think of a tournament where your new study method goes against champion note-taking strategies across different subjects.
🥬 The Concept (What They Tested): The team measured how good the built memories were for answering questions and for completing multi-step tasks.
- How it works: Evaluate on datasets with long conversations (LoCoMo, LongMemEval), multi-hop QA (HotpotQA), and interactive embodied tasks (ALFWorld).
- Why it matters: Shows whether skills help in both reading-long-text worlds and acting-in-environments worlds. 🍞 Anchor: It’s like testing your notes on history questions, science readings, and a lab activity.
The Competition (Baselines): No-Memory, Chain-of-Notes, ReadAgent, MemoryBank, A-MEM, Mem0, LangMem, and MemoryOS—strong, diverse systems using summaries, heuristics, or other memory modules.
Scoreboard with Context:
- Conversational Benchmarks (LoCoMo, LongMemEval): Using LLaMA, MemSkill achieved top LLM-judge scores (e.g., LoCoMo L-J ≈ 50.96), beating A-MEM and MemoryOS. That’s like scoring an A when many peers score a B.
- Embodied Tasks (ALFWorld): MemSkill reached the highest success rates on both seen and unseen splits (e.g., average SR ≈ 47.86% with LLaMA), finishing tasks more reliably than others. Think: more robots completing the mission on time.
- Cross-Model Transfer: Skills learned with LLaMA transferred strongly to Qwen without retraining, still topping or matching baselines—like borrowing a friend’s calculator and still acing the test.
- Cross-Dataset Transfer: LongMemEval was tested by directly using skills learned on LoCoMo; MemSkill stayed best among methods, showing skills weren’t overfitted to one dataset.
- Distribution Shift (HotpotQA): With 50/100/200 concatenated documents, MemSkill beat MemoryOS and A-MEM, especially at 200 docs where long-context noise is highest—like staying calm and accurate in the noisiest library.
Surprising Findings:
- More Skills Help (to a point): Increasing Top-K (e.g., to 7) often improved results on longer contexts, suggesting that complex spans benefit from composing multiple skills.
- Evolution Matters: Ablations showed removing the controller (random skill picks) or the designer (no evolution) both hurt performance; the designer’s ability to add new skills gave extra gains beyond simple refinements.
- Domain-Specific Skills Emerged: On LoCoMo, skills like “Capture Temporal Context” and “Capture Activity Details” naturally appeared; on ALFWorld, “Track Object Location” and “Capture Action Constraints” dominated—evidence that the system discovers what each domain truly needs.
Concrete Numbers (selected):
- LoCoMo (LLaMA): MemSkill L-J ≈ 50.96 vs A-MEM ≈ 46.34 and MemoryOS ≈ 44.59.
- ALFWorld averages (LLaMA): MemSkill ≈ 47.86% SR vs MemoryBank ≈ 25.00% and Chain-of-Notes (CoN) ≈ 40.71%.
- Qwen transfer kept MemSkill competitive or best on multiple metrics, highlighting portability.
Takeaway: Across text-heavy and action-heavy tasks, MemSkill’s learnable, evolving skills consistently build better memories that lead to better answers and higher task completion.
05 Discussion & Limitations
🍞 Hook: Even great study methods have limits—like making timelines can’t help if the problem is actually bad retrieval.
🥬 The Concept (Limitations): What MemSkill can’t do (yet).
- What it is: A candid look at boundaries and trade-offs.
- How it works: Identify where assumptions may not hold or resources may be heavy.
- Why it matters: Knowing boundaries helps you use the method wisely. 🍞 Anchor: If your notes are perfect but your search is broken, you’ll still miss the page you need.
Limitations:
- Designer Dependency on LLM Quality: The designer’s edits and new skills rely on the base LLM’s analysis quality; weak judges/designers could propose poor skills.
- Reward Signal Granularity: Rewards come after whole spans or traces; fine-grained credit assignment remains tricky.
- Retrieval Still Matters: If embeddings can’t fetch the right memories, perfect skills won’t help at answer time.
- Initial Skill Warm-Start: Starting from minimal primitives is stable but may delay discovery of niche domain skills.
- Cost and Latency: Span-level LLM calls plus periodic designer evolution add compute overhead.
Required Resources:
- Base LLM with reliable instruction following (executor and designer roles).
- Embedding model for state/skill encoding and a retriever (e.g., Contriever).
- RL infrastructure (e.g., PPO) for the controller.
- Compute budget for training cycles and evolution rounds.
When Not to Use:
- Very short, simple tasks where plain prompts suffice (the overhead may not pay off).
- Domains with strict schemas where handcrafted extraction is already optimal.
- Settings with no reliable reward or evaluation signal to guide learning.
Open Questions:
- Can we learn better, denser rewards to credit individual skill picks within a span?
- How to co-evolve retrieval (embeddings) with the skill bank for tighter loops?
- Can skills transfer zero-shot across radically different modalities (e.g., audio/video)?
- What governance tools best keep evolved skills safe, private, and auditable?
- How to auto-discover span sizes or adaptive chunking strategies per domain?
06 Conclusion & Future Work
Three-Sentence Summary: MemSkill turns memory operations into learnable, reusable, and evolving skills. A controller selects a small, relevant set for each span, an LLM executor applies them in one pass, and a designer refines or adds skills from hard cases, closing the loop. This delivers stronger, more scalable memory for diverse tasks, beating strong baselines and transferring across models and datasets.
Main Achievement: Showing that skill-conditioned, span-level memory plus closed-loop skill evolution can reliably outperform static, hand-designed pipelines in both conversational and embodied settings.
Future Directions:
- Jointly evolve retrieval and skills for tighter recall–storage synergy.
- Learn finer-grained rewards for better credit assignment to individual skill picks.
- Explore multi-modal skills (text + vision + action) and automated chunk sizing.
- Build governance tooling for safe, interpretable skill evolution at scale.
Why Remember This: MemSkill reframes “how agents remember” from fixed steps to a living skill set that learns from experience—an approach that’s more natural, more scalable, and better aligned with the messy, changing nature of real-world tasks.
Practical Applications
- Long-term personal assistants that remember preferences, timelines, and evolving plans across months.
- Customer support bots that track ticket histories accurately and surface the right past fixes.
- Educational tutors that build student-specific skill maps and remember progress and misconceptions.
- Project copilots that maintain clean milestones, owners, and dates as specs change.
- Research assistants that consolidate findings, sources, and cross-paper links without duplications.
- Enterprise knowledge bases that evolve extraction skills as documentation formats change.
- Healthcare triage chatbots that capture symptom timelines and medication changes reliably.
- Developer copilots that track design decisions and dependencies across long threads.
- Robotic/embodied agents that remember object locations, states, and action constraints for tasks.
- Call-center analytics that evolve skills to capture new product terms and policy updates.