LMEB: Long-horizon Memory Embedding Benchmark
Key Summary
- LMEB is a new test that checks whether text-embedding models can remember and find information across long stretches of time, not just in short, neat passages.
- It covers four kinds of memory (episodic, dialogue, semantic, and procedural) across 22 datasets and 193 zero-shot retrieval tasks.
- The benchmark is zero-shot, so models are tested on their general ability without being trained specifically for these tasks.
- LMEB shows that bigger models are not always better; some smaller models beat larger ones on long-horizon memory tasks.
- Instructions sometimes help models retrieve better, but not always: some models improve, some stay the same, and some get worse.
- Results on LMEB and the popular MTEB benchmark barely correlate, meaning success on standard passage retrieval does not guarantee success on long-term memory retrieval.
- LMEB uses a standardized format and common IR metrics (like NDCG@10) to make evaluations fair, reproducible, and easy to extend.
- This benchmark fills a missing piece for building reliable memory-augmented AI agents that need to remember, update, and use information over weeks or months.
Why This Research Matters
AI helpers in the real world must remember what you like, what changed recently, and how to get multi-step jobs done. LMEB directly measures whether embedding models can retrieve the right memories over long periods, not just short, tidy passages. This benchmark pushes the field toward agents that don't forget important updates, like a changed address or a new prescription. It also highlights that standard passage-retrieval success doesn't guarantee long-term memory skills, steering research to where it truly counts. With its open tools and leaderboard, LMEB lets everyone test fairly and improve together. In short, it brings us closer to trustworthy AI assistants for day-to-day life, work, and learning.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how your brain can remember last summer's trip (episodic), facts like 'Paris is the capital of France' (semantic), how to ride a bike (procedural), and things friends said in conversations (dialogue)? Computers need their own versions of these memories to be helpful over time.
Filling (The Actual Concept)
- What it is: Memory embeddings are compact number-versions of text that help computers store and later find the right memories quickly.
- How it works:
- Read a piece of text (a chat turn, a fact, a step-by-step guide).
- Turn it into a vector (a list of numbers) using an embedding model.
- Save those vectors in a searchable library.
- When there's a new question, turn it into a vector and find the closest matches.
- Why it matters: Without this, an AI would drown in past messages and documents, and it wouldn't know which parts are relevant now.
Bottom Bread (Anchor) Imagine a helper app that remembers you love 'pepperoni pizza' from a chat last month. With memory embeddings, it can quickly fetch that detail when planning a dinner.
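The store-and-search loop described above can be sketched in a few lines. This is a minimal illustration: the `embed` function here is a toy bag-of-words counter standing in for a real embedding model, and the memory strings are made up for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# 1-3. Read each memory, turn it into a vector, save it in a searchable library.
memories = [
    "I love pepperoni pizza, we should order it again sometime",
    "My dentist appointment moved to Friday",
    "Paris is the capital of France",
]
index = [(m, embed(m)) for m in memories]

# 4. Turn a new question into a vector and return the closest matches.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[1]), reverse=True)
    return [m for m, _ in ranked[:k]]

print(retrieve("favorite pizza for dinner"))  # the pizza memory ranks first
```

A production system would swap in a neural embedding model and an approximate nearest-neighbor index, but the store-embed-search shape stays the same.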
Top Bread (Hook) Remembering last year's birthday party is different from knowing that bees pollinate flowers. That's because some memories are tied to time and place.
Episodic Memory
- What it is: Episodic memory is about recalling specific events, linked to time, people, and places.
- How it works:
- Tag events with who, what, when, and where.
- Store these event memories.
- Use time or event clues in a question to fetch the exact event(s) from the past.
- Why it matters: Without it, an AI can't answer questions like 'What did we do last Tuesday?' or 'Who gave me a gift last Saturday?'
Anchor Question: 'What did we chat about on May 8th?' Answer: The system pulls turns from that date, like 'I'm off to go swimming with the kids.'
Top Bread (Hook) Think of a long text chain with a friend. To keep a smart conversation going, you need to remember earlier messages.
Dialogue Memory
- What it is: Dialogue memory remembers information across many turns or sessions of conversation.
- How it works:
- Save each message turn with timestamps.
- When asked a question, search only within the relevant conversation history (not the whole world).
- Prefer the most recent updates if facts have changed.
- Why it matters: Without it, the AI loses track of context and gives confused or outdated answers.
Anchor User: 'When is Elise's first yoga class?' The system fetches 'My first class is today at 5:00pm!' from earlier in the chat.
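The "prefer the most recent updates" rule can be sketched as below. Keyword matching stands in for embedding similarity to keep the example small, and the turns and timestamps are hypothetical.

```python
from datetime import datetime

# Hypothetical conversation history: each turn is saved with a timestamp.
turns = [
    (datetime(2023, 3, 1), "My favorite team is the Green Bay Packers."),
    (datetime(2023, 9, 14), "The Seattle Seahawks are my favorite team now."),
    (datetime(2023, 9, 15), "My first yoga class is today at 5:00pm!"),
]

def retrieve_latest(keyword):
    """Search only this conversation's history; if several turns match,
    prefer the most recent one, since facts may have changed."""
    matches = [(ts, text) for ts, text in turns if keyword.lower() in text.lower()]
    matches.sort(key=lambda m: m[0], reverse=True)  # newest first
    return matches[0][1] if matches else None

print(retrieve_latest("favorite team"))  # returns the September update, not March's
```

The key design choice is the recency sort: when a fact appears twice in the history, the later statement wins.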
Top Bread (Hook) You don't remember where you learned 'Paris is the capital of France'; you just know it.
Semantic Memory
- What it is: Semantic memory holds general facts not tied to a particular time or place.
- How it works:
- Store factual passages from documents.
- For a question, retrieve the passage that states the fact.
- Use that passage as evidence.
- Why it matters: Without it, the AI can't answer basic knowledge questions reliably.
Anchor Question: 'What does the NTCP protein mediate?' The system retrieves a biomedical passage saying it mediates bile acid transport.
Top Bread (Hook) When you tie your shoes or follow a recipe, you're using 'how-to' memory.
Procedural Memory
- What it is: Procedural memory is about skills, steps, and action sequences.
- How it works:
- Store step-by-step guides, code/API documentation, or experience cards.
- Match a new task description to the right procedure.
- Retrieve complete or partial sequences that fit the task.
- Why it matters: Without it, an AI agent can't reliably carry out multi-step tasks or reuse past solutions.
Anchor Task: 'Find the API to get sales tax by ZIP.' The system retrieves the matching sales-tax API endpoint with its parameters.
Top Bread (Hook) Imagine looking for a photo taken years ago in a huge album with lots of similar pictures; you need strong clues and a good filing system.
Long-horizon Memory Retrieval
- What it is: Finding relevant memories that are fragmented, context-dependent, and spread out over long periods.
- How it works:
- Understand the query (including time clues like 'last Sunday').
- Search across a big timeline or a long document.
- Combine scattered pieces (if needed) and rank the best matches.
- Why it matters: Without it, agents forget important history and fail at long-term tasks.
Anchor A home-assistant remembers you changed favorite sports teams mid-year and retrieves the latest preference when making plans.
Top Bread (Hook) Pop quiz time! You're tested on something you weren't coached on; that shows what you really understand.
Zero-shot Retrieval Tasks
- What it is: Testing models on new tasks without task-specific training.
- How it works:
- Use the model as-is (pretrained) to embed queries and documents.
- Retrieve relevant items across many datasets.
- Measure performance consistently.
- Why it matters: Without zero-shot testing, we can't tell if a model generalizes beyond its training.
Anchor A model that never saw REALTALK before can still retrieve the correct chat snippet about 'first yoga class.'
Top Bread (Hook) When you grade a stack of answers, getting the right ones near the top counts more.
NDCG (Normalized Discounted Cumulative Gain)
- What it is: A score that rewards putting the most relevant results near the top of the list.
- How it works:
- Compute discounted gain: higher-ranked relevant items get bigger credit.
- Compare against the best possible ordering to normalize the score between 0 and 1.
- Report to focus on the top results.
- Why it matters: Without NDCG, a model that hides the right answer at rank 50 could look 'ok', but that's not useful.
Anchor If the correct memory shows up at rank 1, NDCG@10 is close to 1; if it's at rank 10, it's lower, encouraging better top-ranked retrieval.
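The scoring steps above can be sketched directly, using the standard discounted-gain formula rel_i / log2(i + 1); the relevance lists below are illustrative toy data.

```python
import math

def dcg(relevances):
    # Higher-ranked relevant items get bigger credit: rel_i / log2(i + 1),
    # where i is the 1-based rank (enumerate is 0-based, hence i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: the ranking's DCG divided by the DCG of the best possible
    ordering, so scores fall between 0 and 1."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0

# One relevant item among ten candidates: rank 1 vs. rank 10.
print(ndcg_at_k([1] + [0] * 9))  # 1.0
print(ndcg_at_k([0] * 9 + [1]))  # about 0.29: burying the answer is penalized
```

This makes the incentive concrete: the same single correct memory scores 1.0 at rank 1 but only about 0.29 at rank 10.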
The World Before Before LMEB, most embedding tests focused on neat, short 'find the passage' problems (like MTEB and BEIR). But real agents, like personal assistants or coding helpers, need to remember things over weeks, pick the latest updates, and connect pieces scattered across many sessions or documents.
The Problem Benchmarks didn't really check whether embeddings could handle long-horizon, messy, time-aware memory. Without a proper test, researchers couldn't tell which models truly supported memory-augmented agents.
Failed Attempts People tried evaluating on standard retrieval, or built single-dataset tests for dialogue/episodic memory. These were narrow, didn't cover all memory types, and weren't standardized, so results didn't generalize.
The Gap We needed a unified, reproducible, zero-shot benchmark with multiple memory types, realistic constraints (like searching only within a conversation), and consistent metrics.
Real Stakes In daily life, assistants must remember your preferences, the last advice they gave, and how to carry out multi-step plans. Without reliable memory retrieval, agents double-ask questions, use outdated info, and fail at long projects. LMEB makes it possible to identify and improve models that handle these real needs.
02 Core Idea
The 'Aha!' Moment One sentence: If we want AI agents to remember like helpful partners, we must test embeddings on long-horizon, context-rich memory, across episodic, dialogue, semantic, and procedural tasks, using a single, fair, zero-shot benchmark.
Multiple Analogies
- Library analogy: Imagine a library with four sectionsādiaries (episodic), chat logs (dialogue), encyclopedias (semantic), and how-to manuals (procedural). LMEB checks if the librarian (the embedding model) can quickly find the right page in each section, even when the clues are fuzzy or time-based.
- Detective analogy: A detective sifts through old case files (episodic), phone transcripts (dialogue), reference books (semantic), and toolkit manuals (procedural). LMEB tests whether the detective can connect the dots over time.
- Backpack analogy: An agent carries a backpack with memories. LMEB makes sure the agent can reach into the right pocket (memory type), pull out the most recent update, and use it correctly, without special practice beforehand.
Before vs After
- Before: We mostly knew how models did on short, clean passage retrieval. Big models often looked best.
- After: We now see that long-horizon memory retrieval is different. Some smaller models beat bigger ones, instruction prompts help some models but not others, and standard benchmarks (like MTEB) don't predict success here.
Why It Works (Intuition)
- Varied memory types: By testing episodic, dialogue, semantic, and procedural memory, the benchmark reflects real agent needs.
- Realistic constraints: Sometimes you can only search within the current conversation or session; LMEB enforces that with candidate pools.
- Zero-shot: Models aren't tuned per task, so we observe true generalization.
- Balanced difficulty and diversity: 22 datasets and 193 tasks make gaming the benchmark hard and progress meaningful.
Building Blocks
- Taxonomy: Four memory types organized by abstraction and time-dependence (episodic/dialogue are time-heavy; semantic/procedural are more stable/skill-oriented).
- Standardized data format: queries, corpus, qrels, optional candidates, so models and datasets plug in easily.
- Metrics: Common IR metrics with NDCG@10 as the main score to reward top-ranked correctness.
- Two query modes: With or without instructions, revealing how instruction sensitivity affects models.
- Leaderboard and toolkit: Reproducible runs across frameworks (Transformers, Sentence-Transformers, vLLM) and easy extension.
- Diversity check: Weighted Jaccard similarity ensures datasets arenāt all the same, keeping the test broad.
Putting it together: LMEB doesn't just push a single needle; it checks the whole dashboard of long-term memory skills, so models that do well are more likely to power reliable, long-running AI agents.
03 Methodology
High-level Recipe Input → (A) Prepare data (queries/corpus with time/metadata) → (B) Encode with an embedding model (optionally add an instruction) → (C) Retrieve from the right memory pool (full corpus or bounded candidates) → (D) Rank results → (E) Score with IR metrics (mainly NDCG@10) → Output: Performance per dataset/type and overall.
Step-by-Step
- Prepare Data
- What happens: Each dataset is converted to a unified format with queries (what's being asked), a corpus (memories to search), qrels (which memories are correct), and sometimes a candidates list (which restricts the searchable pool for realism, like 'only this user's chat history').
- Why this step exists: Without a common format, comparisons become unfair and messy.
- Example: Dialogue question 'What did we chat about on May 8th?' is only searched against that conversation's turns, not the entire internet.
- Make Time Clues Explicit (when needed)
- What happens: For queries like '2 days ago', LMEB appends a 'current time' anchor so the model (and evaluator) know exactly which date that means.
- Why this step exists: Without a clear anchor, 'last Sunday' is ambiguous.
- Example: 'What did we discuss 2 days ago? [Current time: 11:17 AM on Sunday 22 October, 2023]' makes '2 days ago' precise.
- Encode Queries and Corpus
- What happens: The embedding model turns text into vectors. There are two modes:
- w/o inst.: encode the query as-is.
- w/ inst.: prepend a brief task instruction to the query (e.g., 'Given a temporal query, retrieve relevant passages...').
- Why this step exists: Instructions can guide models to focus on what matters (like time cues or multi-hop needs). But some models don't benefit, so LMEB reports both.
- Example: Procedural task 'Compare sales tax rates' might be encoded alongside an instruction 'Given a tool-use query, retrieve API docs aligned with the query.'
- Retrieve from the Correct Memory Pool
- What happens: The query vector is compared to vectors in the allowed memory pool. If a candidates file is provided, search only within that set (e.g., a single sessionās chat turns or items in one shopping scenario).
- Why this step exists: Real agents rarely search 'everything'; they search the right notebook, so to speak.
- Example: In LongMemEval, a 'knowledge update' query searches the user's history to find the latest stated preference.
- Rank Results
- What happens: The system orders the top matches by similarity.
- Why this step exists: Having the right memory at rank 1 is much more useful than at rank 50.
- Example: For 'Who is my favorite sports team?', the best hit should be the most recent message stating 'The Seattle Seahawks are my favorite team now.'
- Score with IR Metrics
- What happens: Compute metrics like NDCG@10 and Recall@10 that grade how good the top of the list is.
- Why this step exists: Numbers make comparisons fair and progress measurable across models and datasets.
- Examples with friendly intuition:
- NDCG@10: Rewards putting the correct answers near the top.
- Capped Recall@10: Measures how many of the truly relevant items appear in the top 10, even when there are many correct items.
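The encode-retrieve-rank steps above can be sketched end to end. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the corpus, candidate pool, and instruction handling are made-up examples of the two query modes and pool restriction.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real evaluation would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

corpus = {
    "d1": "we chatted about swimming with the kids on may 8th",
    "d2": "ntcp mediates bile acid transport in the liver",
    "d3": "my first yoga class is today at 5:00pm",
}
candidates = ["d1", "d3"]  # e.g. only this conversation's turns, not everything

def rank(query, instruction=None, pool=None):
    # 'w/ inst.' mode prepends a task instruction; 'w/o inst.' uses the query as-is.
    text = f"{instruction} {query}" if instruction else query
    q = embed(text)
    # Retrieve only from the allowed memory pool, then rank by similarity.
    ids = pool if pool is not None else list(corpus)
    return sorted(ids, key=lambda d: cosine(q, embed(corpus[d])), reverse=True)

print(rank("when is the first yoga class", pool=candidates))
```

Note how the candidate pool changes the task: with `pool=candidates`, the semantic passage `d2` can never be returned, mirroring how an assistant searches one user's history rather than the whole corpus.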
Friendly Math (with tiny examples)
- NDCG@k = DCG@k / IDCG@k, where DCG@k = sum over i = 1..k of rel_i / log2(i + 1). For example, if the top 3 ranks have relevance (1, 0, 1), then DCG@3 = 1/log2(2) + 0/log2(3) + 1/log2(4) = 1 + 0 + 0.5 = 1.5. If the best possible ordering has both relevant items at ranks 1 and 2, then IDCG@3 = 1/log2(2) + 1/log2(3) ≈ 1.63, so NDCG@3 ≈ 1.5 / 1.63 ≈ 0.92.
- Capped Recall@k = (# relevant items in the top k) / min(k, total # relevant items). Example: There are 12 relevant items and your top-10 contains 8 of them. The denominator is min(10, 12) = 10, so Capped Recall@10 = 8/10 = 0.8.
- Weighted Jaccard Similarity (dataset diversity) = sum over words w of min(S_w, T_w) / sum over words w of max(S_w, T_w), where S_w and T_w are word weights in the two datasets. Example: Suppose words overlap on just 'data' and 'model'. In S: 'data' = 0.6, 'model' = 0.2; in T: 'data' = 0.4, 'model' = 0.3. Then the numerator is min(0.6, 0.4) + min(0.2, 0.3) = 0.6. The denominator is max(0.6, 0.4) + max(0.2, 0.3) = 0.9. So the similarity is 0.6/0.9 ≈ 0.67.
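The capped-recall and weighted-Jaccard formulas can be checked directly in code; the item IDs and weight dictionaries below are toy data mirroring the worked examples.

```python
def capped_recall_at_k(retrieved, relevant, k=10):
    """Capped Recall@k: hits in the top k over min(k, number of relevant items),
    so a perfect score stays reachable even when there are more than k answers."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / min(k, len(relevant))

# 12 relevant items, 8 of them in the top 10 -> 8 / min(10, 12) = 0.8
relevant = {f"r{i}" for i in range(12)}
top10 = [f"r{i}" for i in range(8)] + ["x1", "x2"]
print(capped_recall_at_k(top10, relevant))  # 0.8

def weighted_jaccard(s, t):
    """Weighted Jaccard over term-weight dicts: sum of mins over sum of maxes."""
    words = set(s) | set(t)
    num = sum(min(s.get(w, 0.0), t.get(w, 0.0)) for w in words)
    den = sum(max(s.get(w, 0.0), t.get(w, 0.0)) for w in words)
    return num / den if den else 0.0

# The worked example from the text: weights on 'data' and 'model' only.
S = {"data": 0.6, "model": 0.2}
T = {"data": 0.4, "model": 0.3}
print(round(weighted_jaccard(S, T), 2))  # 0.67
```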
The Secret Sauce
- Four memory types ensure broad skill coverage.
- Realistic candidate pools and explicit time anchors mimic how real assistants search.
- Zero-shot setup tests generalization, not just memorized tricks.
- Two query modes (with/without instructions) reveal whether models can follow task hints.
- A shared format and toolkit make it easy to add new models or datasets and compare apples-to-apples.
Concrete Mini-Walkthrough
- Query: 'When is Elise's first yoga class?'
- Pool: Only turns from that chat session or user's history.
- Encode: Add a short instruction about temporal reasoning (if using w/ inst.), then embed the query.
- Retrieve: Rank dialogue turns; the top hit says 'My first class is today at 5:00pm!'
- Score: This high-rank correct hit boosts NDCG@10 and Recall@10.
04 Experiments & Results
The Test
- What they measured: Retrieval quality across 22 datasets and 193 tasks, focusing on the top of the ranked list using NDCG@10 (main), plus Recall@10, MAP, and MRR.
- Why this matters: In real use, the best memory needs to be near the top. NDCG@10 rewards that.
The Competition
- 15 widely used embedding models were evaluated, from a few hundred million parameters up to about 10B.
- Two modes per model: with and without instructions.
- Baseline comparison: Correlation with MTEB (a popular passage-retrieval benchmark) to test whether old skills transfer to long-horizon memory.
The Scoreboard (with context)
- Reasonable difficulty: The top Mean (Dataset) score was about 61.41 NDCG@10 (roughly like getting a solid A- when the test is challenging and varied). That means it's neither trivial nor impossible.
- Bigger isn't always better: Some smaller models (like EmbeddingGemma-300M or bge-m3 (Dense), depending on setting) were competitive or even better than larger ones, showing that architecture and training matter as much as raw size.
- Instructions help... sometimes: For certain models, adding a task instruction boosted performance; others barely changed; a few even dropped. So, instruction sensitivity is model-dependent.
- Orthogonality with MTEB: Scores on LMEB and MTEB had very low correlation. Translation: being great at standard passage retrieval doesn't guarantee good long-term memory retrieval. These are different skills.
By Memory Type (Intuition Only)
- Episodic and Dialogue: Harder for models that rely on tidy passages, because clues are fragmented and time-sensitive. Many models struggled to transfer their MTEB strengths here.
- Semantic: Some transfer from MTEB exists, but still modest, since LMEB often operates within bounded contexts.
- Procedural: Moderate alignment with MTEB for models trained on tools/code, but still distinct due to step-by-step reasoning needs.
Surprising Findings
- Instruction prompts aren't a free lunch: they can help or hurt depending on how the model was trained.
- Candidate pools matter: Restricting retrieval to the correct memory space simulates real life and changes the difficulty profile.
- Diversity confirmed: Weighted Jaccard analyses showed dialogue datasets cluster, while procedural ones are very different, preventing overfitting to a single style of text.
Human Meaning of the Numbers
- A model hitting around 60 NDCG@10 across this many varied tasks is quite capable but still leaves room for improvement, especially on dialogue and episodic time-tracking where humans are naturally strong.
05 Discussion & Limitations
Limitations
- Coverage: Even with 22 datasets, not every real-world memory situation is included (e.g., multimodal memories with images/audio are out of scope here).
- English-only: LMEB focuses on English; long-horizon memory in multilingual settings remains open.
- Retrieval-only lens: LMEB evaluates retrieval quality, not the full loop of 'retrieve → reason → act → update memory' (though it's a key foundation).
- Synthetic vs human data: Some datasets use AI-generated content, which may differ from messy, fully organic human data.
Required Resources
- Compute: Running 15 models over 22 datasets requires decent compute and storage, though the toolkit streamlines it.
- Engineering: Integration via standard wrappers is simple, but teams should budget time to test both w/ and w/o instructions.
When NOT to Use
- If your task is pure short-passage retrieval with no long-term or temporal needs, simpler benchmarks (like MTEB) may suffice.
- If your system's memory is multimodal (images/audio/video), you'll need an expanded benchmark.
- If you must evaluate full agent pipelines (planning, acting, tool-calling), LMEB should be paired with agentic benchmarks.
Open Questions
- Training signals: What mix of data and objectives best teach models episodic and dialogue time-tracking?
- Memory updates: How should embeddings reflect 'the latest' when facts change in conversation?
- Instructions: Can we design universal, robust instructions that help most models instead of a few?
- Beyond text: How to extend LMEB to multimodal memory while keeping it fair and reproducible?
- Reranking and hybrid pipelines: How much do multi-stage retrieval or rerankers change the landscape on LMEB tasks?
06 Conclusion & Future Work
3-Sentence Summary LMEB is a unified, zero-shot benchmark that tests how well embedding models handle long-horizon, context-rich memory retrieval across episodic, dialogue, semantic, and procedural tasks. Experiments over 22 datasets and 193 tasks reveal that bigger models don't always win, instructions help some models but not others, and performance on standard passage retrieval (MTEB) doesn't predict success on LMEB. By standardizing formats, metrics, and tools, LMEB gives researchers a clear way to build and compare models that actually remember what matters over time.
Main Achievement LMEB fills the evaluation gap for long-term, fragmented, time-aware memory retrieval with a principled, extensible framework and public leaderboard, finally giving the community a target that matches real agent needs.
Future Directions
- Enrich dialogue/episodic tasks with more human, real-world long-duration data.
- Add multilingual and multimodal memory retrieval tracks.
- Explore instruction design, reranking, and hybrid retrieval to push performance further.
- Tie retrieval to downstream agent success (planning, tool use, code fixes) for end-to-end impact.
Why Remember This If we want AI helpers that don't forget who you are, what just changed, and how to do complex tasks, then we must measure those skills directly. LMEB is that honest measurement, showing which embedding models truly support long-horizon memory in the wild.
Practical Applications
- Personal assistants that remember evolving preferences (e.g., dietary changes) over months.
- Customer support bots that recall prior tickets and the latest resolutions without re-asking.
- Coding assistants that retrieve relevant bug-fix experience cards from past issues (e.g., MemGovern).
- Tool-using agents that find the correct API docs and endpoints for a user's goal (e.g., ToolBench, Gorilla).
- Educational tutors that track a student's progress and misconceptions across sessions.
- Project managers that surface the latest decisions and action items from long meeting histories.
- Healthcare triage helpers that recall longitudinal patient notes to avoid contradictions.
- Shopping planners that pick items satisfying local and global constraints (e.g., DeepPlanning).
- RAG systems that prefer the most recent updates when facts change in an ongoing conversation.
- Research assistants that retrieve precise evidence spans in long papers (e.g., QASPER, SciFact).