Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning
Key Summary
- This paper turns an AI agent's memory from a flat list of notes into a logic map of events connected by cause-and-time links.
- It's inspired by how people remember life as scenes (events) and navigate those scenes when they need to recall something.
- The new system, called CompassMem, breaks incoming text into meaningful events and links them with relations like causal and temporal.
- Agents then actively explore this Event Graph using planners and explorers instead of just grabbing the closest-matching text.
- A topic layer groups related events so the search can start from multiple angles and avoid tunnel vision.
- Compared with strong baselines on LoCoMo (conversations) and NarrativeQA (stories), CompassMem retrieves better evidence and reasons more accurately.
- It shines especially on multi-hop and temporal questions, where logic and order matter a lot.
- Even when search runs on a smaller model, a high-quality event graph built offline still boosts reasoning.
- Ablations show each piece (events, relations, subgoals, refinement, clustering) contributes to the final gains.
- Bottom line: memory doesn't just store facts; its structure can guide smarter, longer reasoning.
Why This Research Matters
Many real-world questions are not about finding one sentence; they are about following a chain of events. By turning memory into a logic map, CompassMem helps AI assistants answer multi-step, time-sensitive questions more accurately. This reduces wasted reading, speeds up support cases, and makes personal assistants more reliable over long conversations. It also helps smaller models perform better when paired with a well-built event memory, lowering costs. In classrooms and research, it supports deeper understanding of long stories and documents. Overall, it makes AI reasoning more like human recall: eventful, logical, and focused.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine trying to remember a whole school year as one giant blur. That would be hard, right? Instead, your brain remembers it as scenes: the science fair, the field trip, and the big test week. You jump between these scenes to find what you need.
🥬 Filling (The Situation): Before this work, many AI agents stored what they saw as a long, flat list of text chunks, and found things by measuring word similarity. In one sentence: AI memories were mostly flat and searched by “what sounds similar,” not by “what logically connects.” How it worked before:
- Split long text into pieces (chunks).
- Save chunks in a big list or a simple structure.
- When asked a question, fetch the top-k chunks with the closest embeddings (word meaning vectors).
Why that was a problem: Without real connections like cause-and-effect or proper time order, the agent couldn't follow complex trails, combine clues over time, or avoid getting distracted by words that merely sounded similar.
🍞 Bottom Bread (Anchor): If you ask, “Why did the character leave the city, and what happened next?” a flat memory might pull a random paragraph with the word “city,” but miss the actual chain of events explaining the reason and the next step.
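To make the "before" setup concrete, here is a minimal sketch of flat top-k retrieval. The hashed embed function is a toy stand-in for a real sentence-embedding model, and none of the names here come from the paper.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash words into a small vector space.
    A real system would call a sentence-embedding model here."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def flat_top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Flat RAG-style retrieval: rank chunks by cosine similarity alone,
    with no notion of cause, time, or event structure."""
    q = embed(question)
    return sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)[:k]

chunks = [
    "The character left the city after losing her job.",
    "The city market sells fresh fruit every morning.",
    "She then moved to the coast and opened a studio.",
]
print(flat_top_k("Why did the character leave the city?", chunks, k=2))
```

Whatever comes back is whatever sounds most similar; nothing in this loop knows that the second and third chunks form a cause-then-consequence chain.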
🍞 Top Bread (Hook): You know how a good story is divided into chapters and scenes, with one thing leading to another? That flow helps you remember and retell it.
🥬 Filling (The Problem): The challenge researchers faced was: How can an AI store and search memories in a way that follows the logic of events—like scenes and their links—so it can answer tough, long questions that hop across time and cause? What they tried before:
- Flat retrieval (RAG): fast but shallow; good for keyword-like matches, weak for chained reasoning.
- Structured trees/graphs: better organization, but often missed explicit logical relations (e.g., A caused B), or still fell back to similarity search instead of really navigating the structure.
Why it mattered: If memory can't represent logic, the agent struggles with multi-step, time-aware questions and ends up guessing or overfetching.
🍞 Bottom Bread (Anchor): Think of solving a mystery: you need to know who did what, when, and why. A list of random notes is not enough; you need the map of how clues connect.
🍞 Top Bread (Hook): Picture a museum map. The rooms (events) are connected by hallways (relations), so you can tour in the right order to understand the exhibit.
🥬 Filling (The Gap): What was missing was a logic-first memory that: (1) stores experiences as coherent events, (2) links them with explicit relations (like causes and time), and (3) lets the agent actively navigate these links when answering. Why it matters: Without this, the agent wastes time scanning unrelated text or misses the key chain of reasoning.
🍞 Bottom Bread (Anchor): If you ask, “After the storm knocked out power, how did the team fix the lab?” you want the system to jump from “storm” → “power outage” → “backup generator setup” → “lab reboot,” not just pull random sentences with “lab.”
🍞 Top Bread (Hook): Imagine you and a friend read a long novel. Your friend remembers it as events and can quickly hop between scenes to answer questions. That friend is much faster and clearer.
🥬 Filling (The Stakes): In real life, AI assistants need to remember days of chats, long documents, and plans with steps that depend on each other. The stakes include:
- Customer support tracking the root cause across many tickets.
- Personal assistants recalling long conversations and schedules.
- Tutors remembering what a student tried before and why it worked.
- Researchers summarizing book-length materials.
Why it matters: Getting the logic right saves time, avoids mistakes, and builds trust.
🍞 Bottom Bread (Anchor): If you ask your assistant, “Which study methods helped me last month before my math test, and what did I do next?” you need event links (what helped → then what happened) to get a useful answer, not just a pile of notes with the word “math.”
🍞 Top Bread (Hook): You know how your brain spots “scene changes” in a movie—new place, new goal, or a big surprise?
🥬 Filling (Event Segmentation Theory): Event Segmentation Theory says people naturally break continuous life into meaningful events. One sentence: it’s how our minds chop experience into scenes that are easier to store and recall. How it works:
- We notice boundaries (new goal, new setting, or big change).
- We package what just happened into an event.
- We link that event to others by time and cause.
Why it matters: This makes long stories easy to navigate and helps us answer "what led to what" questions.
🍞 Bottom Bread (Anchor): When recalling last weekend, you don’t replay every second; you jump to “the soccer game,” then “the pizza place,” then “the homework session,” and connect them by what happened and why.
02 Core Idea
🍞 Top Bread (Hook): Think of a treasure map where each X marks a scene from the story, and the paths show which scene leads to which.
🥬 Filling (Aha! in One Sentence): The key insight is: organize memory as an Event Graph—a logic map of events connected by explicit relations—and actively navigate it to gather just the right evidence for tough questions. How it works (at a glance):
- Turn incoming text into coherent events (like scenes).
- Link events with relations (causes, time order, part-of, motivation).
- Add a topic layer to group related events.
- When asked a question, plan subgoals and explore the graph along logical paths, not just by word similarity.
Why it matters: Without event-level structure and logic-aware navigation, agents miss chains of reasoning, especially across time and multiple hops.
🍞 Bottom Bread (Anchor): Asking, “Why did the team change strategy, and what was the result?” becomes a guided walk: cause event → decision event → outcome event, rather than scanning random text snippets.
🍞 Top Bread (Hook, Analogy 1): Like a detective’s corkboard with photos (events) pinned and red strings (relations) showing who connects to what.
🥬 Filling (Event Graph): An Event Graph is a network where nodes are events and edges are their logical links. How it works:
- Build nodes from coherent scenes.
- Draw edges for relations: causal, temporal, part-of, etc.
- Use this map to steer search.
Why it matters: The map carries the logic, so following edges mirrors reasoning steps.
🍞 Bottom Bread (Anchor): To answer “Who helped fix the error after the first failure?”, you follow edges from “first failure” → “request for help” → “helper arrives” → “fix applied.”
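The corkboard analogy maps onto a very small data structure. Here is a minimal sketch of one way to represent an Event Graph; the field names (summary, time, participants) and relation labels follow the description above but are illustrative, not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    id: str
    summary: str                 # one-line description of the scene
    time: str | None = None      # coarse time info, if known
    participants: list[str] = field(default_factory=list)

@dataclass
class EventGraph:
    events: dict[str, Event] = field(default_factory=dict)
    # edges[src_id] -> list of (relation_label, dst_id), e.g. ("causal", "e2")
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add_event(self, event: Event) -> None:
        self.events[event.id] = event
        self.edges.setdefault(event.id, [])

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def neighbors(self, event_id: str, relation: str | None = None):
        """Follow outgoing edges, optionally filtered by relation type."""
        return [(r, self.events[d]) for r, d in self.edges.get(event_id, [])
                if relation is None or r == relation]

g = EventGraph()
g.add_event(Event("e1", "First failure in the build pipeline"))
g.add_event(Event("e2", "Team requests help from the on-call engineer"))
g.add_relation("e1", "causal", "e2")
print(g.neighbors("e1", relation="causal"))
```

The point of the structure is that answering becomes edge-following: asking "what did the failure cause?" is just a filtered neighbor lookup rather than another similarity search.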
🍞 Top Bread (Hook, Analogy 2): Think of a subway map: stations are events; colored lines are relation types; transfers help you change routes to reach the goal fast.
🥬 Filling (Before vs After): Before: flat notes + nearest-neighbor search; agents over-retrieve or miss links. After: event graph + guided traversal; agents collect minimal, on-point evidence through logical paths. Why it matters: It cuts confusion, reduces redundancy, and improves accuracy for multi-step questions.
🍞 Bottom Bread (Anchor): Instead of reading five chapters to answer “What happened after the storm and before the repair?”, you ride the “temporal line” from “storm” to “repair,” stopping only at the key stations.
🍞 Top Bread (Hook, Analogy 3): Imagine hiking with a compass: the terrain is memory, the compass is logic, and trails are relations that tell you which way is promising.
🥬 Filling (Why It Works—Intuition): The trick is that relations encode constraints—what can logically come before, after, or cause something else. If you only use similarity, unrelated but similar-sounding text can distract you. But if you follow relations, each step stays meaningful. Building blocks:
- Event Segmentation (make clean scenes)
- Relation Extraction (add cause/time/part-of links)
- Incremental Graph Update (merge, connect, and grow without drift)
- Topic Evolution (group events to start from multiple angles)
- Active Multi-Path Search (Planner, Explorers, Responder loop)
Why it matters: Each block prevents a specific failure—like fragmentation, confusion, or tunnel vision.
🍞 Bottom Bread (Anchor): If a question needs two clues from different parts of the story, the graph helps you start in two topics, walk along relations, and meet in the middle with the exact pair of events you need.
03 Methodology
🍞 Top Bread (Hook): Imagine building a LEGO city one scene at a time, adding roads to show how places connect, and then using a smart guide to tour the city efficiently.
🥬 Filling (High-Level Recipe): At a high level: Input text → Event Segmentation → Relation Extraction → Topic Layer + Incremental Graph Update → Active Multi-Path Search (Planner + Explorers) → Distilled Evidence → Final Answer. Why it matters: Each stage keeps the memory clean, logical, and easy to navigate so the agent can reason step by step.
🍞 Bottom Bread (Anchor): Given a long chat, the system turns it into scenes, links them, picks starting points across topics, walks the links to collect just a few perfect quotes, and answers.
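To see the recipe as one flow, here is a schematic sketch of the stages as a single function. Every helper passed in (segment_events, extract_relations, update_graph, update_topics, plan_subgoals, explore, respond) is a hypothetical placeholder for an LLM- or embedding-backed component, not the paper's actual interface.

```python
def answer_with_event_memory(text_stream, question, *, segment_events,
                             extract_relations, update_graph, update_topics,
                             plan_subgoals, explore, respond):
    graph, topics = {}, {}
    # Offline / incremental phase: keep the memory clean, logical, and tidy.
    for text in text_stream:
        events = segment_events(text)                    # scenes with time, summary, participants
        relations = extract_relations(events, graph)     # causal, temporal, part-of, motivation
        graph = update_graph(graph, events, relations)   # merge, connect, or add nodes
        topics = update_topics(topics, events)           # maintain the topic layer
    # Online phase: plan, navigate, and answer from distilled evidence.
    subgoals = plan_subgoals(question)
    evidence = explore(graph, topics, question, subgoals)
    return respond(question, evidence)
```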
🍞 Top Bread (Hook): You know how you mark chapters and scene breaks in a book to quickly find parts later?
🥬 Filling (Incremental Hierarchical Memory Construction): This is the process of building the Event Graph step by step as new text arrives. One sentence: we keep adding events and edges carefully so old and new knowledge fit together. How it works:
- Event Segmentation: The model identifies events (scenes) with attributes: text span, time info, summary, and participants.
- Relation Extraction: It finds links like causal, temporal, motivation, and part-of between events.
- Incremental Update: For each new event, either merge with an equivalent one, connect with an edge, or add as a fresh node.
Why it matters: Without careful updates, memory gets messy—duplicates, contradictions, and lost links.
🍞 Bottom Bread (Anchor): If two chats describe the same meeting, we merge them; if one caused another, we add a cause edge; if it’s brand new, we add a new node.
🍞 Top Bread (Hook): Think of grouping your photos into albums so you can start searching from different themes, not just one.
🥬 Filling (Topic Evolution): Topic Evolution creates a layer that clusters events by theme to diversify starting points. One sentence: it groups related events and keeps groups tidy over time. How it works:
- Cluster early events into topics (e.g., k-means).
- As new events arrive, attach them to the closest topic or create a new one.
- Periodically re-cluster to prevent drift.
Why it matters: Without topics, the search may over-focus on a single angle and miss other crucial paths.
🍞 Bottom Bread (Anchor): When answering “What did Jon and Gina both have in common?”, starting from multiple topics (work, hobbies, travel) avoids missing the shared hobby thread.
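Here is a minimal sketch of the attach-or-create step, assuming event embeddings are unit-normalized. The similarity threshold and the running-mean centroid update are illustrative choices, not the paper's settings; periodic re-clustering (e.g., k-means over all event vectors) would run separately.

```python
import numpy as np

def assign_topic(event_vec, centroids, new_topic_threshold=0.5):
    """Attach an event to its closest topic centroid, or open a new topic
    if nothing is close enough. Vectors are assumed unit-normalized."""
    if centroids:
        sims = [float(event_vec @ c) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= new_topic_threshold:
            # Nudge the winning centroid toward the new event (simple running mean).
            centroids[best] = centroids[best] + 0.1 * (event_vec - centroids[best])
            return best, centroids
    centroids.append(event_vec.copy())   # nothing close enough: start a fresh topic
    return len(centroids) - 1, centroids

centroids = []
for v in [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]:
    v = v / np.linalg.norm(v)
    topic_id, centroids = assign_topic(v, centroids)
    print("event assigned to topic", topic_id)
```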
🍞 Top Bread (Hook): Imagine tidying your notes: if two notes say the same thing, you staple them; if one leads to another, you draw an arrow.
🥬 Filling (Node Fusion & Expansion): This step avoids duplicates and adds the right edges. One sentence: new events are compared to existing ones for merging or connecting. How it works:
- Compare new event to similar old events.
- If they are equivalent, merge; if related, add an edge; else add a new node.
Why it matters: Without it, the graph bloats and links get lost.
🍞 Bottom Bread (Anchor): Two entries about “the April outage at 3 PM” become one merged event; an entry “engineer dispatched” gets a causal edge from “outage reported.”
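A minimal sketch of this merge-or-connect-or-add decision, again assuming unit-normalized event embeddings. The two thresholds are illustrative, and a real system would let the relation extractor choose the edge label (causal, temporal, and so on) instead of the generic "related_to" used here.

```python
MERGE_THRESHOLD = 0.9    # illustrative values, not the paper's settings
RELATE_THRESHOLD = 0.6

def integrate_event(new_vec, new_summary, memory):
    """memory: list of {"vec": array, "summary": str, "edges": [(label, idx)]}.
    Merge with a near-duplicate, connect to a related event, or add a new node."""
    best_i, best_sim = None, -1.0
    for i, old in enumerate(memory):
        sim = float(new_vec @ old["vec"])      # vectors assumed unit-normalized
        if sim > best_sim:
            best_i, best_sim = i, sim
    if best_i is not None and best_sim >= MERGE_THRESHOLD:
        memory[best_i]["summary"] += " / " + new_summary            # merge duplicates
    else:
        memory.append({"vec": new_vec, "summary": new_summary, "edges": []})
        if best_i is not None and best_sim >= RELATE_THRESHOLD:
            memory[best_i]["edges"].append(("related_to", len(memory) - 1))  # connect
    return memory
```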
🍞 Top Bread (Hook): When solving a riddle, you break it into mini-tasks and check them off.
🥬 Filling (Planner & Subgoals): The Planner reads the question and breaks it into 2–5 subgoals, then tracks which are satisfied. One sentence: it turns a big question into a small checklist that steers the search. How it works:
- Decompose the query into subgoals (e.g., find cause, find effect, find time).
- Keep a binary satisfaction vector (checked/unchecked).
- If stuck, refine the query to target missing subgoals.
Why it matters: Without subgoals and refinement, the search can wander or stop too early.
🍞 Bottom Bread (Anchor): For “What did the speaker create after moving?”, the subgoals become: find move event; find creative events after; extract specific types.
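A minimal sketch of the checklist idea follows; the hard-coded planner and the subgoal strings are purely illustrative stand-ins for the LLM calls described above.

```python
def plan_subgoals(question: str) -> list[str]:
    """Stand-in for an LLM planner that decomposes a question into 2-5 subgoals.
    Here it simply returns a fixed checklist for the running example."""
    return ["find the move event",
            "find creative events after the move",
            "extract the specific types created"]

def unsatisfied(subgoals: list[str], satisfied: list[bool]) -> list[str]:
    """Binary satisfaction vector: which checklist items are still open?"""
    return [g for g, done in zip(subgoals, satisfied) if not done]

subgoals = plan_subgoals("What did the speaker create after moving?")
satisfied = [True, False, False]          # updated as evidence is collected
print("still missing:", unsatisfied(subgoals, satisfied))
# If progress stalls, the query would be refined to target the missing items,
# e.g. "creative activities mentioned after the move".
```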
🍞 Top Bread (Hook): Think of sending out a few scouts who start in different neighborhoods and report back with the best clues.
🥬 Filling (Explorers: Localization → Navigation → Coordination): Explorers actively traverse the Event Graph to gather evidence. How it works:
- Localization: Retrieve top-k similar events, then pick starters across top-p different topics to avoid tunnel vision.
- Navigation: At each node, choose SKIP (discard), EXPAND (keep and continue), or ANSWER (stop path) based on subgoals and neighbors.
- Coordination: Multiple Explorers run in parallel, share a global queue, and prioritize next nodes that best match unsatisfied subgoals.
Why it matters: Without logic-guided traversal and coordination, you either over-collect junk or miss key links.
🍞 Bottom Bread (Anchor): If the unsatisfied subgoal is “find what happened next,” the priority queue favors neighbors with strong temporal_after links.
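A minimal sketch of the shared priority-queue idea, using a toy scoring rule in place of the model's SKIP/EXPAND/ANSWER judgment; the node names, relation labels, and scores are illustrative, not from the paper.

```python
import heapq

def explore(graph, start_ids, score, max_steps: int = 10) -> list[str]:
    """Greedy multi-start traversal sketch.
    graph: {node_id: {"text": str, "edges": [(relation, dst_id), ...]}}
    score: callable(node, relation) -> float; higher = better match to the
           still-unsatisfied subgoals (stand-in for the SKIP/EXPAND/ANSWER call)."""
    queue = [(-score(graph[s], None), s) for s in start_ids]   # shared global queue
    heapq.heapify(queue)
    kept, visited = [], set()
    for _ in range(max_steps):
        if not queue:
            break
        neg_priority, node_id = heapq.heappop(queue)
        if node_id in visited:
            continue
        visited.add(node_id)
        if -neg_priority <= 0:                 # SKIP: discard unhelpful nodes
            continue
        kept.append(node_id)                   # EXPAND: keep and continue
        for relation, dst in graph[node_id]["edges"]:
            if dst not in visited:
                heapq.heappush(queue, (-score(graph[dst], relation), dst))
    return kept                                # ANSWER happens once subgoals are met

graph = {
    "storm":  {"text": "storm hits",   "edges": [("temporal_after", "outage")]},
    "outage": {"text": "power outage", "edges": [("temporal_after", "repair")]},
    "repair": {"text": "lab repaired", "edges": []},
}

def prefer_next(node, rel):
    # Bias toward temporal_after edges when the open subgoal is "what happened next".
    return 2.0 if rel == "temporal_after" else 1.0

print(explore(graph, ["storm"], prefer_next))   # ['storm', 'outage', 'repair']
```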
🍞 Top Bread (Hook): After gathering clues, you don’t rewrite the whole book—you quote the few lines that prove your answer.
🥬 Filling (Responder with Distilled Evidence): When subgoals are satisfied or no more candidates remain, the Responder answers using only the selected evidence. One sentence: it keeps generation short and precise, grounded in the collected events. How it works:
- If all subgoals are covered, answer succinctly from kept events.
- If evidence is empty, fall back to the initial top-k.
Why it matters: Without distillation, answers can be long, unfocused, or rely on irrelevant text.
🍞 Bottom Bread (Anchor): The final answer might be a short phrase, like “paintings and stained glass,” backed by exactly two event nodes that mention them, in order.
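A tiny sketch of that fallback logic; llm_answer is a hypothetical callable wrapping whatever language model is in use, and the prompt wording is illustrative only.

```python
def respond(question: str, kept_events: list[str], initial_top_k: list[str],
            llm_answer) -> str:
    """Answer only from the distilled evidence; fall back to the initial
    top-k retrieval if exploration kept nothing."""
    evidence = kept_events if kept_events else initial_top_k
    prompt = ("Answer the question briefly, using only the evidence below.\n"
              + "\n".join(f"- {e}" for e in evidence)
              + f"\nQuestion: {question}")
    return llm_answer(prompt)
```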
04 Experiments & Results
🍞 Top Bread (Hook): Imagine a quiz where some questions need a single fact, but others need you to track what happened, why it happened, and in what order.
🥬 Filling (The Test): The team tested CompassMem on two tough benchmarks:
- LoCoMo: very long multi-session conversations, with question types like single-hop, multi-hop, open-domain, and temporal.
- NarrativeQA: long stories where answers often require combining clues spread across the narrative.
They measured F1 and BLEU-1—think of them like accuracy grades on both the content and the wording.
Why it matters: These tasks stress exactly what event graphs help with: following chains across time and logic.
🍞 Bottom Bread (Anchor): It’s like asking, “Who started the plan, what did they do later, and when?”—you need both the who-what and the when.
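For readers who want the "accuracy grade" spelled out, here is the usual token-overlap F1 used for this kind of QA scoring; the benchmarks' own evaluation scripts may differ in normalization details.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1: precision and recall over shared words,
    combined as 2PR / (P + R)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("paintings and stained glass", "stained glass and paintings"))  # 1.0
```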
🍞 Top Bread (Hook): If most classmates get a B- on a hard test, and one student gets an A+, you notice.
🥬 Filling (The Competition and Scoreboard): CompassMem was compared to strong systems, including RAG, Mem0, MemoryOS, and graph-based methods like HippoRAG, A-Mem, and CAM.
- On LoCoMo with GPT-4o-mini, CompassMem lifted average F1 from a strong baseline (~47.9%) to about 52.2%, with especially big gains on temporal questions (about 58% vs ~49%).
- With Qwen2.5-14B, it reached roughly 52.5% F1 overall and led in every subset.
- On NarrativeQA, CompassMem topped CAM by more than 5 F1 points with GPT-4o-mini and by over 8 points with Qwen2.5-14B.
Why it matters: The biggest jumps appear where logic and time matter most—exactly where an Event Graph should help.
🍞 Bottom Bread (Anchor): For multi-hop and temporal questions, performance moved from “good sophomore” to “top-of-class,” especially on tasks like “what caused what, then what happened next.”
🍞 Top Bread (Hook): Faster is nice, but smarter is better—unless it becomes too slow to use.
🥬 Filling (Efficiency–Performance Trade-off): The team also checked costs: construction time, total processing time, per-question latency, and token usage. Findings:
- Construction time was much lower than some heavy systems and reasonable compared to others.
- Total time and per-question latency were in a practical range—close to lighter baselines and far better than very heavy ones.
- Token use was higher than some, but the extra cost came with clear accuracy gains.
Why it matters: A better brain that still runs fast enough is what you want in real applications.
🍞 Bottom Bread (Anchor): It’s like studying a bit longer but getting a much better grade—and still finishing the test on time.
🍞 Top Bread (Hook): What happens if you remove a car’s steering or brakes? You see which parts really matter.
🥬 Filling (Ablations and Surprises): Removing any major piece—topics, events (using fixed chunks instead), edges (relations), query refinement, or subgoals—reduced performance, most noticeably on multi-hop and temporal questions. This confirms each module’s role in the whole system. Other insights:
- Scaling: Building the Event Graph with a bigger model and letting a smaller model do the search still gave strong gains, showing the value of high-quality offline memory.
- Hyperparameters: Increasing the number of starting topics and initial retrieved events steadily helped, indicating the value of diverse starting points and a rich candidate pool.
- Thinking models: Even with a model that's better at reasoning internally, CompassMem still added gains, meaning structure and navigation help beyond raw model smarts.
Why it matters: The improvements come from the design (events + relations + guided search), not just from bigger models.
🍞 Bottom Bread (Anchor): Even a strong student does better with a clean outline and a good plan—CompassMem gives that outline and plan to the model.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even great maps can be wrong if the roads are drawn poorly, or if you never update them when the city changes.
🥬 Filling (Honest Assessment): Limitations:
- Quality depends on event segmentation and relation extraction; if those go wrong, the graph misleads search.
- Evaluations, while broad, still cover only certain domains; more tasks would strengthen the case.
- Token usage can be higher due to planning, exploration, and evidence distillation, though balanced by accuracy.
- Latency may rise if the memory or graph is massive; careful tuning and caching matter.
Required Resources:
- An embedding model for similarity, an LLM for segmentation/relations, storage for the Event Graph, and moderate compute for online updates and search.
- Optional: stronger model for offline construction, smaller model for online search to save cost.
When NOT to Use:
- Very short tasks where a flat context window works fine.
- Ultra low-latency scenarios with tiny budgets (e.g., millisecond chatbots) unless you prebuild memory.
- Extremely noisy streams with few coherent events; the graph could get cluttered.
Open Questions:
- Can we learn event segmentation and relation extraction more robustly, perhaps end-to-end or with feedback loops?
- How should the system represent uncertainty in relations—and correct itself over time?
- What’s the best way to forget or compress old events without losing logic?
- How to integrate external knowledge graphs or temporal commonsense effectively?
- What additional safeguards are needed for privacy when memories persist across long horizons?
🍞 Bottom Bread (Anchor): Think of a library: if books (events) are mis-shelved (relations wrong), even a perfect map won’t help. Future work aims to shelve better, flag uncertainty, and tidy up regularly.
06 Conclusion & Future Work
🍞 Top Bread (Hook): Picture memory not as a junk drawer of facts, but as a well-marked trail through scenes that lead you to the answer.
🥬 Filling (Takeaway):
3-Sentence Summary: CompassMem organizes an agent's memory as an Event Graph where events are linked by explicit logical relations. A planner–explorer–responder loop actively navigates this logic map, collecting only the evidence needed to answer complex questions. This shifts memory from passive storage to an active guide for long-horizon reasoning, improving retrieval and accuracy on challenging benchmarks.
Main Achievement: Turning memory into a logic-aware map—so topology carries reasoning—by combining event segmentation, relation extraction, topic evolution, and active multi-path search.
Future Directions: Stronger, more reliable event and relation extraction; smarter forgetting and compression; tighter integration with external knowledge; better uncertainty handling; and broader tests across domains like planning, multi-agent coordination, and tools.
Why Remember This: It shows that how you store experience shapes how well you can think; with events and explicit links, agents don't just recall—they reason.
🍞 Bottom Bread (Anchor): Next time an AI must answer “what led to what, and then what?”, CompassMem hands it a compass and a map, not just a pile of notes.
Practical Applications
- Customer support triage: trace root causes across many tickets and sessions to propose the next best fix.
- Personal assistants: recall multi-week plans and explain why certain steps were taken and what followed.
- Enterprise incident analysis: reconstruct timelines with causes and effects for outages or compliance reviews.
- Healthcare case summaries: link visits, treatments, and outcomes over time for clearer clinical reasoning.
- Legal and policy review: navigate long documents by events (filings, rulings) and their temporal/causal links.
- Education and tutoring: track a student's attempts, strategies, and results to recommend the next study step.
- Research curation: synthesize long papers or books by hopping across event chains instead of scanning linearly.
- Creative writing aids: map plot events and character motivations to check consistency and find plot holes.
- Project management: connect tasks by dependencies and outcomes to explain delays and optimize planning.
- Agent teams: share a common event graph so multiple agents can coordinate reasoning over the same memory.