RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Beginner
Haonan Bian, Zhiyuan Yao, Sen Hu et al. · 1/11/2026
arXiv · PDF

Key Summary

  • RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.
  • Instead of short chats or single tasks, RealMem uses 2,000+ multi-session dialogues across 11 real-life scenarios like fitness, travel, writing, and project management.
  • It checks four hard skills: recalling past facts, updating plans when things change, reasoning about time and schedules, and proactively suggesting next steps using memory.
  • The dataset is built with a three-stage pipeline: plan the project (blueprints), simulate realistic talks (multi-agent role-play), and keep memory and schedules clean and up to date.
  • Modern memory systems still struggle: even strong baselines fall far below the Oracle that has perfect memory access.
  • For high-quality answers, the precision of what you retrieve (NDCG) matters more than pulling in lots of possibly noisy context (Recall).
  • MemoryOS performs best on answer quality overall, while Graph Memory is strongest when using full session context and for complex relationships.
  • Technical, rigid tasks (like code architecture) remain especially challenging, while consultative or creative domains (like health advice and literature) fare better.
  • Adding new memories is slower than retrieving them in all systems, highlighting a key deployment bottleneck.
  • Human judges agree with the benchmark’s automatic scores, showing the evaluation matches real user expectations.

Why This Research Matters

RealMem pushes AI assistants closer to real-life usefulness by testing whether they can carry long projects from idea to completion without forgetting, contradicting, or double-booking. It highlights the need for precise retrieval and reliable updates rather than just big-context reading. This matters for planning trips, managing health routines, coaching learning paths, organizing work projects, and more—places where people expect continuity and smart next steps. By revealing gaps between today’s systems and an ideal Oracle, RealMem guides research toward better memory architectures and evaluation. It also shows where small models fail and when larger models are worth the cost. In short, RealMem points the way to AI that acts like a dependable teammate over weeks and months, not just a one-off helper.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine planning a six-month science fair project with a helper who remembers every idea, every change, and your busy calendar—so you never repeat work or double-book.

🥬 The Concept (AI & LLMs):

  • What it is: An AI powered by a Large Language Model (LLM) is like a super-smart assistant that reads and writes text.
  • How it works: 1) It reads your messages, 2) predicts the most helpful next words, 3) learns patterns from lots of examples, and 4) uses extra notes (memory) to stay consistent.
  • Why it matters: Without memory, the AI forgets what you agreed on yesterday. Plans fall apart, and you have to repeat yourself.

🍞 Anchor: You tell an AI, “We decided on Paris from May 1–10 near the Eiffel Tower.” Next week, it should remember that when booking hotels.

🍞 Hook: You know how you keep a notebook for a long project—plans, changes, dates? That notebook is your memory.

🥬 The Concept (LLM Memory Systems):

  • What it is: A memory system stores important facts from past chats so the AI can stay on track over time.
  • How it works: 1) Extract key facts, 2) index them for search, 3) retrieve the right bits later, 4) update or delete when something changes.
  • Why it matters: Without it, the AI treats every chat like day one and gives inconsistent advice.

🍞 Anchor: After you injure your knee, the AI should stop recommending hikes and switch to gentle yoga automatically.
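
To make the extract → index → retrieve → update loop concrete, here is a minimal, illustrative sketch in Python. The `MemoryPoint`/`SimpleMemory` names and the word-overlap scoring are assumptions for readability, not how the memory systems in this paper are actually built.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryPoint:
    key: str          # what the fact is about, e.g. "training_plan"
    value: str        # the current state, e.g. "gentle yoga"
    session_id: int   # which session taught us this


@dataclass
class SimpleMemory:
    points: list[MemoryPoint] = field(default_factory=list)

    def add(self, key: str, value: str, session_id: int) -> None:
        """Update an existing fact in place, or store a new one (steps 1 and 4)."""
        for p in self.points:
            if p.key == key:
                p.value, p.session_id = value, session_id   # overwrite the stale state
                return
        self.points.append(MemoryPoint(key, value, session_id))

    def retrieve(self, query: str, k: int = 3) -> list[MemoryPoint]:
        """Rank facts by naive word overlap with the query (a stand-in for embedding search)."""
        words = set(query.lower().split())
        ranked = sorted(
            self.points,
            key=lambda p: len(words & set(f"{p.key} {p.value}".lower().split())),
            reverse=True,
        )
        return ranked[:k]


memory = SimpleMemory()
memory.add("training_plan", "squats three times a week", session_id=1)
memory.add("training_plan", "gentle yoga after knee injury", session_id=5)   # update, not duplicate
print(memory.retrieve("what training gear should I pack?", k=1))
```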

🍞 Hook: School tests usually check what you just learned, but big projects need you to remember and use knowledge over months.

🥬 The Concept (Benchmarks):

  • What it is: A benchmark is a fair test to compare how well AIs do at a specific skill.
  • How it works: 1) Build a dataset, 2) define tasks, 3) set scoring rules, 4) compare systems.
  • Why it matters: Without the right test, we might think an AI is good at long projects when it’s only good at quick answers.

🍞 Anchor: A spelling quiz won’t tell you who can write a full research paper; you need a different test.

The world before: Most AI memory tests focused on casual chit-chat or short tasks. They checked if the AI could recall a fact from a long transcript or keep a persona consistent. These were useful first steps, but they missed the messy reality of projects that evolve over weeks: goals shift, schedules conflict, and multiple projects interleave.

The problem: Real projects produce interwoven, natural questions in broken-up sessions. Plans evolve (dynamic state), calendars fill (temporal constraints), and users rarely repeat everything they said. Existing tests didn’t capture this. They often asked clean, after-the-fact questions, like “What’s the user’s favorite color?” rather than “Given the changed travel dates and my new meeting, what should we adjust next?”

Failed attempts: Benchmarks like LoCoMo, LongMemEval, and HaluMem measured long-context recall, consistency over time, or hallucinations—but mostly with isolated QA after sessions, not during. They didn’t force the AI to juggle multiple projects, track changing states, or proactively push the plan forward mid-conversation.

The gap: We needed a test where memory is used live, inside ongoing projects, to make the next step better. That means: queries that arise naturally, sessions that are interleaved, states that evolve, and assistants that proactively align with the user’s goals.

🍞 Hook: It’s like coaching a soccer team for a whole season, not just explaining the rules before game one.

🥬 The Concept (Long-term Project-oriented Interactions):

  • What it is: Conversations that continue over many sessions to move a project from start to finish while handling changes and schedules.
  • How it works: 1) Start with goals, 2) break them into milestones, 3) update plans as surprises happen, 4) check calendars, 5) keep going across weeks.
  • Why it matters: Without steady memory and updates, the assistant forgets progress, suggests conflicts, and derails the project.

🍞 Anchor: Planning a 12-day New Zealand trip, you later add two West Coast days without extending the trip; the assistant must reshuffle the itinerary without causing chaos.

Real stakes: In daily life, people want AI coaches for fitness, travel, finances, health, learning, and more. If the AI can’t remember evolving plans or check time conflicts, it wastes your time or gives bad advice (like double-booking a medical exam on top of a meeting). For companies, poor memory means broken workflows, missed deadlines, and unhappy users.

Enter RealMem: It’s the first benchmark to mirror real, project-style conversations. It creates multi-session, interleaved dialogues, grounded in realistic goals, and scores AIs on using memory to keep projects coherent. It focuses on four must-haves: 1) questions that arise naturally from the project’s flow, 2) interleaved sessions across projects, 3) dynamically evolving states, and 4) proactive alignment—nudging the project forward even when the user’s message is vague.

02 Core Idea

🍞 Hook: Think of building a LEGO city—streets, buildings, people—over many weekends. You need a plan, conversations with helpers, and a tidy box to keep the right pieces handy each time.

🥬 The Concept (RealMem’s Key Insight):

  • What it is: The aha moment is that to test real memory, you must simulate real projects where memory is used live to advance the plan—not just retrieved after the fact.
  • How it works: 1) Build a project blueprint (what we’re making), 2) simulate many realistic chats that follow and change that plan, 3) continuously extract, clean, and retrieve memory and schedules to keep everything consistent.
  • Why it matters: Without this, we only test if the AI can parrot facts, not whether it can keep a long project on the rails.

🍞 Anchor: If your assistant remembers you chose “book flights and RV next,” it should bring that up when you say, “This looks great!” instead of waiting for a perfect command.

Multiple analogies:

  • Sports coach: The coach (assistant) remembers past games (sessions), updates plays (project state), checks game times (schedule), and suggests next drills (proactive alignment).
  • Orchestra: The conductor (assistant) keeps track of who plays when (schedule), adapts to tempo changes (state updates), recalls themes (static retrieval), and cues the next section (proactive alignment).
  • Kitchen: For a week-long meal prep, remember allergies (persona), rotate recipes (project state), avoid oven overlaps (temporal reasoning), and suggest tomorrow’s prep when you say “Dinner was great!” (proactive alignment).

Before vs After:

  • Before: Tests measured if AIs could remember a detail after reading a long log.
  • After: RealMem measures if AIs can use memory mid-conversation to maintain evolving plans, resolve conflicts, and push the project forward.

Why it works (intuition): Real projects are moving targets. So the benchmark must: (1) plan the big picture so sessions make sense over time, (2) simulate realistic back-and-forth so memory arises naturally, and (3) maintain a clean memory base and schedule so retrieval helps, not hurts. This closed loop pressures the AI to retrieve precisely relevant context (not just a lot) and to update state correctly.

Building blocks (with mini Sandwich explanations):

🍞 Hook: You don’t start a treehouse without a sketch. 🥬 Project Foundation Construction:

  • What: A blueprint that sets persona, goals, attributes, milestones, events, and session summaries.
  • How: 1) Define who the user is (persona), 2) set clear goals (e.g., lose 15 kg in 6 months), 3) list attributes to track (diet plan, body metrics), 4) map milestones and events, 5) create session summaries, 6) interleave multiple projects into one queue.
  • Why: Without a plan, later sessions feel random and contradict each other. 🍞 Anchor: For “Travel + Fitness” running in parallel, sessions interleave like: Travel Day 1 hotels, then Fitness plan tweak, then Travel flights.
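
As a tiny illustration of the interleaving step, here is a sketch that merges two projects' session queues into one timeline. The round-robin policy is an assumption for illustration; the paper does not commit to a specific merge order.

```python
from itertools import zip_longest

travel = ["Travel: pick dates", "Travel: choose hotels", "Travel: book flights"]
fitness = ["Fitness: set 6-month goal", "Fitness: adjust plan after knee injury"]

# Round-robin merge: alternate sessions between projects, skipping exhausted queues.
interleaved = [
    session
    for pair in zip_longest(travel, fitness)
    for session in pair
    if session is not None
]
print(interleaved)
# ['Travel: pick dates', 'Fitness: set 6-month goal', 'Travel: choose hotels',
#  'Fitness: adjust plan after knee injury', 'Travel: book flights']
```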

🍞 Hook: Actors need a script and their previous scene notes. 🥬 Multi-Agent Dialogue Generation:

  • What: A User Agent and an Assistant Agent role-play realistic sessions.
  • How: 1) User Agent sees the current event + last summary, 2) Assistant sees relevant memory points + global schedule, 3) sessions are drawn from an interleaved queue, 4) they talk, 5) outputs feed the memory system.
  • Why: Without structured roles and context, you get time mix-ups and shallow replies. 🍞 Anchor: The Assistant checks the schedule before accepting “book a physical exam at 10:30” because there’s a 10:00–11:00 meeting.

🍞 Hook: A messy desk wastes time; a tidy folder speeds you up. 🥬 Memory and Schedule Management:

  • What: Extract, deduplicate, and retrieve memory points; update a global schedule.
  • How: 1) Memory Extraction Agent pulls key facts from chats, 2) Schedule Agent adds time-bound tasks, 3) Dedup Agent cleans overlaps, 4) Retrieval uses the next session summary as a query to fetch the right context.
  • Why: Without clean, up-to-date memory, retrieval returns noise and the assistant goes off-track. 🍞 Anchor: After you switch from squats to gentle yoga, the memory state flips; the next day’s gear list omits squat shoes and adds a mat.
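
Here is a minimal sketch of the deduplication idea: keep a memory point only if it is not too similar to one already stored. The cosine threshold of 0.9 and the fake embedding function are placeholders, not the paper's settings.

```python
import numpy as np


def dedupe(points: list[str], embed, threshold: float = 0.9) -> list[str]:
    """Keep a memory point only if its embedding is not too close to an already-kept one."""
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for text in points:
        v = embed(text)
        v = v / np.linalg.norm(v)                        # unit-normalize so dot product = cosine
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept


def fake_embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model; a real model maps paraphrases to nearby vectors."""
    return np.random.default_rng(abs(hash(text)) % 2**32).normal(size=16)


print(dedupe(["Training is now gentle yoga", "Trip is May 1-10", "Training switched to gentle yoga"],
             fake_embed))
```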

🍞 Hook: Different questions need different moves—like tools in a toolbox. 🥬 Query Types (set-level):

  • What: Four kinds—Static Retrieval, Dynamic Updating, Temporal Reasoning, Proactive Alignment.
  • How: Each type targets a unique memory use: 1) recall facts, 2) change states, 3) reason over time, 4) anticipate next steps.
  • Why: Without covering all four, the benchmark misses crucial real-world skills. 🍞 Anchor: “Let’s continue our travel plan” (Static Retrieval), “Add West Coast but keep 12 days” (Dynamic Updating), “Does 10:30 clash with my meeting?” (Temporal Reasoning), “This is amazing!” → “Shall we book flights and RV?” (Proactive Alignment).

03 Methodology

High-level recipe: Input (personas, goals, blueprints) → Stage 1: Project Foundation Construction → Stage 2: Multi-Agent Dialogue Generation → Stage 3: Memory and Schedule Management → Output (multi-session dialogues, evolving memory, and schedules).

Stage 1: Project Foundation Construction

  • What happens: Build the scaffolding so long stories stay coherent.
  • Why this step exists: Without a blueprint, sessions drift, contradict, or forget goals.
  • How (step-by-step):
    1. Persona: Define who the user is (e.g., Lucy, 35; preferences, constraints).
    2. Goal: Set a measurable target (e.g., lose 15 kg in 6 months; France trip May 1–10).
    3. Attributes: List dynamic states to track (diet plan, body metrics; destination, dates, hotel area).
    4. Blueprint: Outline milestones (e.g., itinerary draft → bookings → packing list).
    5. Event list: Encode causal links (can’t book hotels before dates are chosen).
    6. Session summaries: Short guides for each conversation turn.
    7. Interleaving: Merge session queues across multiple projects to simulate real life.
  • Example: Travel Planning + Fitness: sessions alternate—(Travel: dates), (Fitness: plan), (Travel: hotels), (Fitness: injury update).
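
A minimal sketch of what a Stage 1 blueprint could look like as structured data. The field names and types are assumptions for illustration; the paper specifies the components (persona, goal, attributes, milestones, events, session summaries) but not a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    summary: str
    depends_on: list[str] = field(default_factory=list)   # causal links: dates before hotels


@dataclass
class ProjectBlueprint:
    persona: str                   # who the user is
    goal: str                      # the measurable target
    attributes: dict[str, str]     # dynamic states to track across sessions
    milestones: list[str]          # itinerary draft -> bookings -> packing list
    events: list[Event]            # causally ordered events
    session_summaries: list[str]   # one short guide per planned session


travel = ProjectBlueprint(
    persona="Lucy, 35, prefers budget hotels near landmarks",
    goal="France trip, May 1-10",
    attributes={"destination": "Paris", "dates": "May 1-10", "hotel_area": "Eiffel Tower"},
    milestones=["itinerary draft", "bookings", "packing list"],
    events=[Event("choose travel dates"), Event("book hotels", depends_on=["choose travel dates"])],
    session_summaries=["agree on dates and area", "shortlist hotels near the Eiffel Tower"],
)
```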

Stage 2: Multi-Agent Dialogue Generation

  • What happens: A User Agent and an Assistant Agent role-play each session using the right context.
  • Why this step exists: Real memory must emerge naturally from conversation, not be planted artificially.
  • How (step-by-step):
    1. Draw the next session from the interleaved queue.
    2. Context for User Agent: current event + prior event summary (prevents peeking into the future).
    3. Context for Assistant Agent: relevant memory points + global schedule (reduces time conflicts).
    4. Generate the dialogue turns.
    5. Log outputs for memory extraction.
  • Example (Temporal Reasoning): User asks for a 10:30 physical exam; Assistant checks schedule (10:00–11:00 meeting, 11:30 1:1) and proposes 12:15 or 9:00 instead.
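
Under stated assumptions, here is a sketch of the Stage 2 role-play loop: `call_llm` is a placeholder for any chat-completion call, `event` is a dict with a "summary" field, and `memory` is any store with a `retrieve` method (like the earlier sketch). The context structure is illustrative, not the paper's actual prompt templates.

```python
def generate_session(event, prev_summary, memory, schedule, call_llm, turns=4):
    """Role-play one session between a User Agent and an Assistant Agent."""
    relevant = memory.retrieve(event["summary"], k=20)   # Assistant sees retrieved memory points
    dialogue = []
    for _ in range(turns):
        # User Agent only sees the current event and the previous summary (no peeking ahead).
        user_msg = call_llm(
            role="user_agent",
            context={"current_event": event, "previous_summary": prev_summary},
            history=dialogue,
        )
        # Assistant Agent sees relevant memory plus the global schedule (to avoid time conflicts).
        assistant_msg = call_llm(
            role="assistant_agent",
            context={"memory_points": relevant, "global_schedule": schedule},
            history=dialogue + [user_msg],
        )
        dialogue += [user_msg, assistant_msg]
    return dialogue   # handed to Stage 3 for memory extraction
```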

Stage 3: Memory and Schedule Management

  • What happens: Convert raw chats into structured memory; keep time-bound data in a global schedule.
  • Why this step exists: Raw text is too messy; structured memory makes retrieval precise and updates safe.
  • How (step-by-step):
    1. Memory Extraction Agent: Identify salient facts (e.g., “Training changed to gentle yoga”).
    2. Schedule Agent: Add/modify events (e.g., “Team meeting 10:00–11:00 Jan 5”).
    3. Deduplication Agent: Merge near-duplicates; remove stale entries.
    4. Retrieval: Use the next session summary as a query; fetch Top-K memory points or linked sessions.
    5. Feed these back to the Assistant Agent; repeat in a closed loop.
  • Example (Dynamic Updating): “Add West Coast but keep 12 days” triggers a memory update to redistribute itinerary days.
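
A sketch of the Stage 3 closed loop, reusing the memory-store and deduplication ideas sketched earlier. `extract_facts` and `extract_events` are hypothetical stand-ins for the LLM-based extraction and schedule agents, not the paper's components.

```python
def manage_memory(dialogue, next_summary, session_id, memory, schedule,
                  extract_facts, extract_events, dedupe):
    """Turn one raw session into structured memory, then fetch context for the next session."""
    # 1) Memory Extraction Agent: pull salient "key: value" facts from the dialogue.
    facts = extract_facts(dialogue)             # e.g. ["training_plan: gentle yoga"]
    # 2) Schedule Agent: add or modify time-bound events in the global schedule.
    schedule.extend(extract_events(dialogue))   # e.g. [("2026-01-05", "10:00-11:00", "Team meeting")]
    # 3) Deduplication Agent: drop near-duplicate facts before storing them.
    for fact in dedupe(facts):
        key, _, value = fact.partition(": ")
        memory.add(key, value, session_id)      # overwrites stale state for the same key
    # 4) Retrieval: the next session's summary is the query; return the Top-K memory points.
    return memory.retrieve(next_summary, k=20)
```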

The four query types (each with a Sandwich):

  • 🍞 Static Retrieval (Hook): Like opening yesterday’s class notes to continue your homework. 🥬 What: Fetch the latest settled facts to move forward. How: 1) Query memory for current project state, 2) confirm choices, 3) proceed to next step. Why: Without it, the assistant re-asks basics or contradicts past choices. 🍞 Anchor: “Continue the travel plan” → recalls May 1–10 and Eiffel Tower area.

  • 🍞 Dynamic Updating (Hook): Like changing a recipe mid-cook when you run out of eggs. 🥬 What: Modify a plan when new constraints arrive. How: 1) Retrieve constraints and current plan, 2) find conflicts, 3) propose trade-offs, 4) update memory state. Why: Without it, the assistant ignores new info or breaks previous decisions. 🍞 Anchor: Keep trip at 12 days but insert two West Coast days—reallocate time from Queenstown.

  • 🍞 Temporal Reasoning (Hook): Like checking your calendar before saying yes to a party. 🥬 What: Ensure plans fit the schedule without overlap. How: 1) Retrieve global schedule, 2) compare time windows, 3) detect conflicts, 4) suggest safe slots. Why: Without it, double-bookings and late starts happen. 🍞 Anchor: 10:30 exam conflicts; suggest 12:15 or 9:00 options.

  • 🍞 Proactive Alignment (Hook): When a friend says “That’s awesome!”, you remember what they wanted next and help them do it. 🥬 What: Use memory to infer implicit next steps and keep momentum. How: 1) Retrieve user priorities, 2) match to current project point, 3) suggest the next logical action. Why: Without it, conversations stall and users must micromanage the assistant. 🍞 Anchor: After praise, the assistant proposes booking flights and RV because that was the stated priority.
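
To make the Temporal Reasoning case above concrete, here is a minimal sketch of a schedule conflict check. The schedule layout (a list of (start, end, title) tuples) is an assumption for illustration.

```python
from datetime import time


def minutes(t: time) -> int:
    return t.hour * 60 + t.minute


def conflicts(schedule, start: time, duration_min: int) -> list[str]:
    """Return the booked events that overlap a proposed [start, start + duration) slot."""
    s, e = minutes(start), minutes(start) + duration_min
    return [
        title
        for booked_start, booked_end, title in schedule
        if s < minutes(booked_end) and minutes(booked_start) < e   # interval-overlap test
    ]


day = [(time(10, 0), time(11, 0), "Team meeting"), (time(11, 30), time(12, 0), "1:1")]
print(conflicts(day, time(10, 30), 45))   # ['Team meeting'] -> suggest 9:00 or 12:15 instead
print(conflicts(day, time(12, 15), 45))   # [] -> free slot
```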

Two context settings for evaluation:

  • Memory setting (value = memory): provide the Top-20 retrieved memory points as context.
  • Session setting (value = session): provide the Top-5 linked original sessions as context (only for systems that support session linking).

The secret sauce:

  • Proactive alignment baked into evaluation—tests whether assistants move the project forward using memory, not just answer trivia.
  • Project state memory + schedule memory—separate storage makes both updates and time checks reliable.
  • Interleaved multi-project sessions—pressure-tests retrieval precision in noisy, realistic timelines.
  • Closed-loop generation—dialogues create memory, memory shapes future dialogues.

04 Experiments & Results

The test: RealMem measures whether memory helps the AI do project work inside the conversation. It scores:

  • Retrieval: Recall@K and NDCG@K (how well the right memories are ranked), plus LLM-judged Mem Recall (semantic coverage) and Mem Helpful (usefulness).
  • Generation: QA Score (0–3) focusing on whether answers correctly use the user’s current state.
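
For reference, here is a minimal sketch of the two retrieval metrics with binary relevance. It follows the standard definitions; the paper's exact gain and cutoff conventions may differ slightly.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant memory points that show up in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain rewards ranking relevant items near the top, normalized by the ideal ranking."""
    dcg = sum(1 / math.log2(i + 2) for i, item in enumerate(retrieved[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


retrieved = ["dates", "hotel_area", "old_training_plan", "knee_injury"]
relevant = {"dates", "knee_injury"}
print(recall_at_k(retrieved, relevant, k=4))   # 1.0   -- every relevant point was fetched
print(ndcg_at_k(retrieved, relevant, k=4))     # ~0.88 -- but one relevant point is ranked low
```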

The competition: Four representative memory systems—Mem0, A-mem, MemoryOS, Graph Memory—plus an Oracle that has perfect access to the ground-truth relevant memory. Answers are generated with GPT-4o-mini and GPT-4o; GPT-4o is also used as the judge.

Scoreboard with context:

  • Memory-only setting: MemoryOS leads QA quality. This suggests hierarchical memory designs (short-term, mid-term, long-term) help compress the right facts for accurate answers without needing full session logs—like getting an A when others get Bs.
  • Session-based setting: Graph Memory shines. Graphs capture who/what/when relationships, which fit RealMem’s complex entity webs, yielding strong NDCG and better downstream QA—like picking the top clues first in a detective case.
  • Oracle gap: Even with GPT-4o, no system reaches Oracle-level recall (~0.993). This means poor retrieval, not just weak language modeling, is the main bottleneck for long projects.
  • Model size matters: GPT-4o consistently outperforms GPT-4o-mini, showing that RealMem’s tasks require nontrivial reasoning, not only memory lookup.

Surprising findings:

  • Precision beats breadth: A-mem attains the highest Recall@20 in some settings but lower NDCG; broader retrieval with more noise lowers answer quality. Graph Memory’s higher NDCG lines up with stronger QA, showing that ranking the best bits first is crucial.
  • Proactive Alignment is hard: Systems that do well here (MemoryOS) better anticipate next steps, but many assistants still wait for exact commands.
  • Temporal Reasoning favors graph-like structures: Graph Memory handles time-linked entities well, aligning with its strengths in relational representation.

Topic-by-topic performance (MemoryOS shown; others follow similar trends):

  • Strong in human-centric, consultative, and creative domains: Mental Health Support, Health Consultation, Literary Creation—Helpful scores are often in the 0.6–0.8 range. The assistant can be useful even with imperfect recall, because these conversations allow some flexibility.
  • Weak in rigid, technical domains: Code Architecture shows sharp drops (often <0.4). These require strict logical consistency and dependency tracking; fuzzy matches break designs.

Efficiency and cost:

  • Adding memories is slower than retrieving across all systems, a real-world bottleneck for scaling assistants.
  • MemoryOS: competitive retrieval speed but the highest token cost—performance comes with context overhead.
  • A-mem: fastest and cheapest but trades off accuracy.
  • Graph Memory: near MemoryOS retrieval speed with lower token costs but slower memory addition.

Human validation:

  • Human rankings match automated QA scores exactly (MemoryOS > Graph Memory > A-mem > Mem0). This strengthens trust that RealMem’s scoring reflects real user preferences.

Bottom line: RealMem reveals that doing long projects well requires precise retrieval, reliable updates, time-aware planning, and proactive nudging. Today’s systems make progress but are far from the Oracle ceiling, especially under interleaved, evolving contexts.

05 Discussion & Limitations

Limitations:

  • Tool use not evaluated (yet): RealMem focuses on memory use in dialogue. Real projects often require tools like calendars, browsers, or booking APIs; adding this would test end-to-end execution.
  • Data generation dependence: The pipeline leans on Gemini 2.5 for controlled, realistic dialogues, plus human checks. This may affect reproducibility and cost, though it improved adherence and realism in practice.
  • Domain difficulty gaps: Technical, hard-constraint domains (e.g., code architecture) still expose weaknesses in current memory designs.
  • Session linking constraints: Some systems (e.g., page-based memory) can’t align to original sessions, limiting certain evaluations.

Required resources:

  • A capable backbone LLM (GPT-4o-level) for generation and judging.
  • Embedding models and vector/graph stores for retrieval.
  • Orchestration to run extraction, schedule updates, and dedup.
  • Compute and tokens for long dialogues; MemoryOS-level performance implies higher token budgets.

When not to use RealMem (as-is):

  • If you mainly need tool-augmented execution benchmarking (e.g., booking APIs), since current focus is dialogue memory.
  • If your application is purely factual Q&A or short, single-session tasks; simpler benchmarks suffice.
  • If your system cannot store structured schedules or project states; RealMem’s tasks will unfairly penalize you.

Open questions:

  • How to couple memory with tools safely (e.g., calendar APIs) and measure end-to-end success?
  • Can we learn retrieval policies that optimize NDCG directly, not just Recall?
  • What representations best track evolving states—graphs, state machines, or hybrid OS-like layers?
  • How can we speed up memory ingestion without losing precision or context richness?
  • Can smaller models match large-model QA given near-Oracle retrieval?

06 Conclusion & Future Work

Three-sentence summary:

  • RealMem is a benchmark that tests whether AI assistants can remember, update, and use project memories during real, multi-session conversations.
  • It builds realistic dialogues through a blueprint → multi-agent role-play → memory-and-schedule loop, then scores retrieval precision and answer quality.
  • Results show a large gap to the Oracle, proving that precise retrieval, dependable updates, time reasoning, and proactive alignment are still open challenges.

Main achievement:

  • Shifting evaluation from post-hoc fact recall to live, project-centric memory use, with proactive alignment and project-state tracking at its core.

Future directions:

  • Add tool-use assessments and end-to-end task completion; optimize ingestion latency; develop representations that better handle strict logical dependencies; train retrieval to maximize ranking quality (NDCG); and study small-model performance under near-perfect retrieval.

Why remember this:

  • Long-term AI helpers won’t be judged by how much they’ve read, but by how well they carry a project from start to finish—adapting to changes, avoiding conflicts, and nudging you forward. RealMem is the first rigorous test built for that reality, turning memory from a scrapbook into a working brain for ongoing collaboration.

Practical Applications

  • Build AI planners that remember evolving travel, fitness, or study plans and keep them conflict-free with your calendar.
  • Design assistants that proactively suggest next steps (e.g., booking items) when users give positive but vague feedback.
  • Use graph-structured memories for domains with complex relationships (e.g., project dependencies, medical entities).
  • Adopt hierarchical memory (short/mid/long-term) to compress and retrieve the most helpful facts quickly.
  • Tune retrieval for ranking quality (high NDCG), not just coverage (Recall), to improve answer accuracy.
  • Implement schedule-aware reasoning to prevent double-bookings and to propose safe time windows.
  • Add deduplication and state-versioning so updates replace outdated facts without losing history.
  • Stress-test agents with interleaved, multi-project session queues to ensure robustness under realistic workloads.
  • Measure with both automated QA scores and human ratings to align system behavior with user expectations.
#RealMem #long-term memory #project-oriented interactions #multi-agent simulation #memory extraction #schedule management #retrieval evaluation #NDCG #Recall@K #proactive alignment #temporal reasoning #dynamic updating #graph memory #hierarchical memory #LLM benchmark
Version: 1