
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Intermediate
Tingyu Wu, Zhisheng Chen, Ziyan Weng et al. · 1/8/2026
arXiv · PDF

Key Summary

  ‱ KnowMe-Bench is a new test that checks if AI helpers truly understand a person, not just remember facts.
  ‱ Instead of short chat logs, it uses rich life stories full of actions, places, and inner thoughts to give better clues about who someone is.
  ‱ It rebuilds those stories into a clear, time-ordered “cognitive stream” that separates what happened now from flashbacks to the past.
  ‱ The benchmark asks three kinds of questions: exact facts, logical connections, and deep insights about motives and principles—always tied to specific evidence.
  ‱ Retrieval tools (like RAG) make fact answers better but still struggle with time-sensitive explanations and deeper motivations.
  ‱ Entity-graph memories (like Mem0) are great for keeping names straight but can get confused by flashbacks and overwrite the present with the past.
  ‱ Stream-based memories (like MemOS) handle timelines and flashbacks much better and win on temporal and reasoning tasks.
  ‱ Across all systems, scores on the hardest “insight” questions remain low, showing we need memory beyond retrieval to model people well.
  ‱ The dataset is carefully built by multi-agent steps with strict “verify-and-revise,” plus human experts, to stay faithful to the source text.
  ‱ This matters for building safe, empathetic digital companions that can explain, anticipate, and align with users over the long haul.

Why This Research Matters

Digital companions should do more than remember names—they should understand why we choose things and what we want to avoid. KnowMe-Bench pushes AI to use real, rich life stories and prove answers with exact evidence, which reduces guessing and hallucinations. This can lead to healthier coaching (that respects triggers), better study helpers (that fit your habits), and fairer recommendations (that align with your values). By handling flashbacks and time correctly, AI can explain choices instead of mixing up past and present. Safer answers through abstention tests mean fewer confident mistakes. Overall, this shifts AI from being a fact-fetcher to a thoughtful partner.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re telling a friend your life story. You don’t just list facts like dates and names—you explain how moments felt, why you chose things, and how old memories shaped today’s choices.

đŸ„Ź The Concept (Autobiographical Narratives): These are first-person life stories that mix what happened with how it felt and why it mattered.

  • What it is: Long, rich stories people tell about their own lives, including events, places, and inner thoughts.
  • How it works: 1) A person recalls events, 2) adds feeling and meaning, 3) links present moments to past flashbacks, 4) builds a sense of self across time.
  • Why it matters: Without them, an AI only sees scattered facts and misses the deeper reasons behind choices. 🍞 Anchor: A diary entry like “I smelled the ocean and remembered my grandmother’s house” ties a present smell (now) to a meaningful memory (then).

🍞 Hook: You know how a good friend doesn’t just remember your favorite snack—they also sense when you’re stressed and why you avoid certain games.

đŸ„Ź The Concept (Person Understanding): It’s knowing someone well enough to explain past choices, predict preferences, and respect values.

  • What it is: A usable inner model of a person’s motivations, triggers, and principles.
  • How it works: 1) Gather experiences, 2) notice patterns, 3) connect feelings with actions, 4) form stable rules about the person, 5) update gently over time.
  • Why it matters: Without this, AI can parrot facts but can’t explain or anticipate behavior. 🍞 Anchor: If you hate scary rides because a past one made you sick, a companion that truly “knows you” won’t suggest the haunted roller coaster.

🍞 Hook: Think of a photo album shuffled out of order—vacation photos mixed with baby pictures—hard to tell which came first or why it matters.

đŸ„Ź The Concept (Narrative Identity Theory): Our identity grows from how we stitch life events into a meaningful story across time.

  • What it is: A psychology idea that who we are comes from how we organize and interpret our experiences.
  • How it works: 1) We order experiences over time, 2) attach meanings, 3) keep themes (like loyalty or curiosity), 4) use these to guide future choices.
  • Why it matters: If an AI ignores story structure, it misreads who we are and why we act. 🍞 Anchor: Seeing a pattern like “I keep speaking up for friends,” you conclude “I value fairness,” which predicts future choices.

Before this paper, many AI memory tests used short chats or synthetic logs. That’s like judging a friendship by a few texts. These traces are sparse, so retrieval became the main skill: can the AI pull the right fact from a big bag? Useful, yes—but retrieval isn’t understanding. Two big problems showed up:

  • Evaluation misalignment: Benchmarks scored how well models retrieved facts, not how well they inferred motives or principles. Deep questions without evidence checks often led to guesswork.
  • Data substrate misalignment: Most datasets were low-density (not much inner thought) and flattened (everything treated as plain text without separating senses, context, and mind). This drained away the clues needed to connect actions to values.

Earlier attempts reached for bigger context windows, more retrieval, or static personality labels (fixed profiles). But bigger windows still mix past and present, retrieval fetches related text without prioritizing true causes, and static personas can’t reflect growth or context. The missing ingredient was a data and evaluation design that respects how real autobiographical memory works: it needs rich signals, careful handling of flashbacks, and questions that demand evidence-backed reasoning.

KnowMe-Bench fills this gap by: 1) using dense autobiographical narratives, 2) rebuilding them into a clear, time-anchored “cognitive stream” that separates present triggers from past events, and 3) grading models with a three-level suite—from facts to logic to deep insight—where each answer must cite exact evidence. This change matters because lifelong companions must do more than recall birthdays; they must explain choices, anticipate needs, and align with evolving values. In daily life, that means better reminders (that fit your habits), safer suggestions (that avoid your triggers), and fairer help (that honors your principles).

02 Core Idea

🍞 Hook: You know how tidying a messy desk makes homework easier? When papers are sorted and labeled, you find what you need and remember why it matters.

đŸ„Ź The Concept (KnowMe-Bench): A benchmark that tests whether AI can build and use a truthful model of a person from rich life stories, not just fetch facts.

  • What it is: A dataset and test suite that turns autobiographical narratives into a time-ordered cognitive stream and asks evidence-linked questions from facts to deep motives.
  • How it works: 1) Use real, dense narratives; 2) split them into small, auditable units; 3) realign flashbacks to their true past times; 4) represent five textual modalities (see/hear/context/background/mind); 5) ask tiered questions that require citing exact evidence.
  • Why it matters: Without it, we mistake “good retrieval” for “true understanding,” and agents can’t explain or anticipate people reliably. 🍞 Anchor: Instead of quizzing names from a chat log, KnowMe-Bench asks, “Why did the narrator avoid the reunion?” and demands you point to the exact memories that explain it.

🍞 Hook: Picture a movie that jumps between now and childhood. If you re-edit it into the real-time order and label what’s seen, heard, and felt, the story gets clearer.

đŸ„Ź The Concept (Cognitive-Stream Reconstruction with Mnestic Realignment): A way to rebuild stories into a precise timeline that keeps present triggers in the now while moving recalled events back to their original time.

  • What it is: A flashback-aware, time-anchored stream with five fields: visual, auditory, situational context, background knowledge, and inner monologue.
  • How it works: 1) Cut text at natural boundaries; 2) extract Atomic Narrative Units (ANUs) with time and location; 3) separate triggers (now) from recalled content (then); 4) use a push/pop stack to handle nested flashbacks; 5) output a first-person record without adding new facts.
  • Why it matters: Without this, models confuse “I remember loving apples as a kid” with “I love apples now,” causing wrong updates and bad advice. 🍞 Anchor: “I smell rain (trigger now) and recall a 2005 bike ride (event then).” The stream keeps the smell at today’s date and moves the bike memory to 2005.

🍞 Hook: Think of a science fair where judges ask, “Show me your experiment and your data.” You can’t just guess—you must point to proof.

đŸ„Ź The Concept (Evidence-Grounded Hierarchical Evaluation): A three-level test where each answer must cite specific timeline evidence.

  • What it is: Level I (facts and entities), Level II (temporal and logical relations), Level III (deep insight into motives and principles), all scored with strict rubrics.
  • How it works: 1) Ask precise questions; 2) require minimal supporting evidence IDs; 3) judge with task-specific rubrics; 4) validate consistency with human-aligned checks.
  • Why it matters: Without evidence links, models can speculate and still appear smart, hiding misunderstandings. 🍞 Anchor: A question like “What triggered the memory of the grandmother?” expects the answer plus exact event IDs (e.g., “the taste of madeleine; evidence: E142, E209”).
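
To make the evidence requirement concrete, here is a minimal Python sketch of an evidence-linked question item and its check. The field names and helper are illustrative assumptions, not the paper’s exact schema.

```python
# Illustrative evidence-linked item; field names are assumptions.
item = {
    "level": "I",                           # I = facts, II = relations, III = insight
    "question": "What triggered the memory of the grandmother?",
    "gold_answer": "the taste of the madeleine",
    "required_evidence": ["E142", "E209"],  # timeline event IDs the answer must cite
}

def evidence_ok(cited_ids: list[str], required: list[str]) -> bool:
    """Minimal-evidence check: the answer counts only if it cites every required ID."""
    return set(required).issubset(cited_ids)

assert evidence_ok(["E142", "E209", "E300"], item["required_evidence"])
```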

Aha! Moment in one sentence: Treat person understanding as evidence-backed inference over a flashback-aware cognitive stream, not as plain fact retrieval.

Three analogies:

  1. Librarian vs. Biographer: Retrieval is a fast librarian fetching books; understanding is a biographer connecting life chapters with motives.
  2. Map vs. Journey: A list of landmarks (facts) isn’t the lived journey (causes and turning points); the stream reconstructs the real path.
  3. Highlights vs. Movie: Chat snippets are highlights; autobiographical streams are the full movie with scene cuts, sounds, and voice-over.

Before vs. After:

  • Before: Focus on context length, vector search, and static persona tags; success = “found the right sentence.”
  • After: Focus on flashback alignment, evidence links, and layered reasoning; success = “showed why, with proof, in time order.”

Why it works (intuition): People’s stable values hide in patterns across time. Dense narratives contain those patterns, but only if you separate now from then and track senses, setting, knowledge, and mind. Evidence-linked questions force the model to surface those exact patterns instead of guessing. Building blocks: (1) ANUs with time/place; (2) five-field cognitive records; (3) mnestic push/pop stack; (4) first-person instantiation without embellishment; (5) three-tier questions with rubrics and evidence IDs.

03 Methodology

High-level recipe: Raw narratives → (A) Smart segmentation → (B) Atomic Narrative Units (ANUs) → (C) Flashback-aware realignment (cognitive stream) → (D) First-person instantiation → Evaluation tasks (Levels I–III) with LLM-as-a-Judge.

🍞 Hook: Imagine turning a messy scrapbook into a neat timeline where each page has labels: what you saw, heard, felt, and knew.

đŸ„Ź The Concept (Pipeline Stages A–D): A careful, step-by-step build that preserves truth and order.

  • What it is: Four modules that extract, align, and present life stories for precise, auditable reasoning.
  • How it works: A) cut at natural scene shifts; B) pack each small piece into an ANU with time/place and five fields; C) realign flashbacks with a stack (PUSH/POP) so past events land at their true dates; D) rewrite into first-person without adding new details, then validate.
  • Why it matters: Skipping steps causes time tangles, lost clues, or made-up details. 🍞 Anchor: Text: “I set my cup on the windowsill. Rain hits the glass.” → ANU: time=morning, place=windowsill, action=set cup, environment=rain, mind=none.
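
As a rough sketch of how the four stages chain together, the pipeline below injects each stage as a callable, since in the paper each module is an LLM-based agent rather than fixed code; all names here are hypothetical.

```python
from typing import Callable, Dict, List

def build_cognitive_stream(
    raw_text: str,
    segment: Callable[[str], List[str]],          # Stage A: cut at scene/time/place shifts
    extract_anu: Callable[[str], Dict],           # Stage B: Atomic Narrative Units
    realign: Callable[[List[Dict]], List[Dict]],  # Stage C: mnestic realignment
    instantiate: Callable[[Dict], str],           # Stage D: faithful first-person text
) -> List[str]:
    """Chain Stages A-D into a time-ordered, first-person cognitive stream."""
    segments = segment(raw_text)
    anus = [extract_anu(s) for s in segments]
    return [instantiate(u) for u in realign(anus)]
```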

Stage A: Context-Aware Segmentation

  • What happens: The system slices raw text only at natural semantic boundaries (scene, time, place), keeping exact wording via index-based cuts.
  • Why it exists: If we chunk by size alone, we split thoughts or mix scenes, breaking causal threads.
  • Example: Don’t split a paragraph mid-dialogue; do split when the narrator moves from kitchen to street.
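
A tiny sketch of the index-based cutting; the boundary detector itself (an LLM or classifier proposing character offsets) is assumed away here.

```python
def segment_at_indices(text: str, boundaries: list[int]) -> list[str]:
    """Slice at detector-proposed character offsets; slicing keeps exact source wording."""
    cuts = [0, *sorted(boundaries), len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

story = "I rinse my cup in the kitchen. Outside, the street is already loud."
print(segment_at_indices(story, [30]))  # cut where the narrator moves kitchen -> street
```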

Stage B: ANU Extraction (Atomic Narrative Units)

  • What happens: Each segment becomes small units U=(id, t_anch, location, C), where C holds five primitives: Action, Dialogue, Environment, Background, Mind. Constraints enforce granularity (≀3 actions or dialogues) and push abstract states to observable micro-behaviors.
  • Why it exists: Fine-grained, auditable units make evidence precise and prevent hand-wavy summaries.
  • Example: “I clenched the cup” → Action=clench cup; Mind (only if stated), not guessed.
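
A hedged sketch of the unit structure in Python; the five primitive names follow the text, while the dataclass layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ANU:
    """Atomic Narrative Unit U = (id, t_anch, location, C), where C holds the
    five primitives: Action, Dialogue, Environment, Background, Mind."""
    id: str
    t_anch: str                  # time anchor, e.g. "morning, before rain stops"
    location: str
    actions: List[str] = field(default_factory=list)    # observable micro-behaviors
    dialogues: List[str] = field(default_factory=list)
    environment: List[str] = field(default_factory=list)
    background: List[str] = field(default_factory=list)
    mind: Optional[str] = None   # inner state, recorded only if explicitly stated

    def __post_init__(self) -> None:
        # Granularity constraint from Stage B: at most 3 actions or dialogues per unit.
        if len(self.actions) > 3 or len(self.dialogues) > 3:
            raise ValueError("ANU too coarse: split into smaller units")
```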

Faithfulness-First and Verify-and-Revise

  ‱ What happens: A Validator Agent computes a semantic divergence score ÎŽ from fact overlap, penalizing hallucinations more heavily than omissions. If ÎŽ > Δ (e.g., 0.03), the unit is regenerated up to 3 times; if it still fails, it is flagged for human review.
  • Why it exists: Generative steps can invent details; strict checks protect truth.
  • Example: If “angry” wasn’t in the source, the checker marks hallucination and forces a fix.
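
A minimal sketch of the check, assuming `generate` is a callable that returns the regenerated text plus its extracted fact set; the exact weights are assumptions, since the paper only states that hallucinations are penalized more than omissions.

```python
def divergence(source_facts: set, generated_facts: set,
               w_halluc: float = 2.0, w_omit: float = 1.0) -> float:
    """Fact-overlap divergence delta; the 2:1 weighting is an assumption."""
    hallucinated = generated_facts - source_facts   # invented details
    omitted = source_facts - generated_facts        # dropped details
    denom = max(len(source_facts | generated_facts), 1)
    return (w_halluc * len(hallucinated) + w_omit * len(omitted)) / denom

def verify_and_revise(source_facts, generate, epsilon=0.03, max_tries=3):
    """Regenerate until delta <= epsilon; after max_tries, escalate to a human."""
    for _ in range(max_tries):
        text, facts = generate()
        if divergence(source_facts, facts) <= epsilon:
            return text
    return None  # flag for human review
```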

Stage C: Flashback-Aware Temporal Realignment (Mnestic Realignment)

  • What happens: Separate mnemonic triggers (stay at present time) from recalled event content (moved to past). A stack machine manages nested flashbacks with actions: MAINTAIN, PUSH(t_new), POP, TRANSIENT.
  • Why it exists: Narratives often jump in time; without alignment, models overwrite current states with old memories (the paper’s “Update Paradox”).
  • Example: “The smell of rain (trigger) takes me back to 2005 (event)” → keep trigger now; push 2005 on the stack for the recalled scene; pop back when done.
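
The four operation names come from the paper; the stack implementation below is a simplified sketch of how they could assign each ANU its true time anchor.

```python
from typing import List, Optional

class MnesticStack:
    """Flashback-aware time tracking: the bottom of the stack is 'now',
    and each nested flashback pushes its own anchor."""

    def __init__(self, present_time: str) -> None:
        self.times: List[str] = [present_time]

    def current(self) -> str:
        return self.times[-1]

    def apply(self, op: str, t_new: Optional[str] = None) -> str:
        """Return the time anchor to assign to the current ANU."""
        if op == "MAINTAIN":      # continue in the active timeline
            return self.current()
        if op == "PUSH":          # enter a flashback anchored at t_new
            self.times.append(t_new)
            return t_new
        if op == "POP":           # flashback ends; resume the outer timeline
            self.times.pop()
            return self.current()
        if op == "TRANSIENT":     # brief recall: anchor this unit at t_new
            return t_new          # ...without changing the active timeline
        raise ValueError(f"unknown op: {op}")

stack = MnesticStack("2024-06-01")
stack.apply("MAINTAIN")           # "the smell of rain" stays at the present (trigger)
stack.apply("PUSH", "2005")       # the recalled bike ride is anchored at 2005
stack.apply("POP")                # the memory ends; back to the present
```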

Stage D: Narrative Instantiation and Validation

  ‱ What happens: Turn each aligned ANU into a first-person paragraph that covers all present fields. A ÎŽ-check rejects any added adjectives or emotions not in the source (Δ≈0.03 tolerance for grammar only).
  • Why it exists: First-person flow helps querying and evaluation while preserving evidence granularity.
  • Example: “I see rain hitting the glass. I place the cup on the windowsill.” No extra feelings unless stated.

Evaluation Suite (Levels I–III)

  • Level I: Precision & Factuality
    • T1 Extraction: Find exact entities under time/place constraints.
    • T2 Adversarial Abstention: Answer ABSTAIN when pieces are real but relations are wrong.
    • T3 Temporal Reasoning: Compute durations and real-world order vs. narrative order.
  • Level II: Narrative Logic & Causality
    • T4 Logical Event Ordering: Sort by semantic scales (e.g., danger) beyond timestamps.
    • T5 Mnestic Trigger Analysis: Identify the correct sensory/associative trigger of a memory.
  • Level III: Psychoanalytic Depth
    • T6 Mind-Body Interaction: Map contradictions between actions and inner states.
    • T7 Expert Psychoanalysis: Open-ended motives/identity questions with core-metaphor targets.

Scoring Protocol: LLM-as-a-Judge with Rubrics

  • What happens: A capable judge model grades answers on task-specific criteria (entity accuracy, timeline order, trigger validity, or capturing specified metaphors) on a 0–5 scale. Human alignment study shows Îș>0.75 agreement.
  • Why it exists: String-match fails on reasoning; rubrics reward right logic and correct evidence.
  • Example: Full marks only if the answer cites the exact evidence IDs and nails the defined metaphor (not generic emotion words).
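
A minimal sketch of rubric-based judging, where `llm` stands for any prompt-to-completion callable; the prompt wording and parsing are assumptions, not the paper’s protocol.

```python
JUDGE_TEMPLATE = """You are grading an answer about a life narrative.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Cited evidence IDs: {evidence}
Reply with a single integer from 0 to 5."""

def judge_answer(llm, question, answer, evidence, rubric) -> int:
    """Score one answer against a task-specific rubric on the 0-5 scale."""
    reply = llm(JUDGE_TEMPLATE.format(rubric=rubric, question=question,
                                      answer=answer, evidence=evidence))
    return max(0, min(5, int(reply.strip())))   # clamp to the rubric's range
```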

Secret Sauce

  • Five-field cognitive records expose sensory and mental evidence.
  • Mnestic stack keeps present and past cleanly separated.
  • Evidence-linked questions make speculation visible (and docked).
  • Verify-and-Revise prevents subtle drift from the source.

Concrete data example. Source: “I put the coffee cup on the windowsill. Rain is still hitting the glass.”

  • ANU: id=ANU-001; t_anchor=“morning, before rain stops”; location=“windowsill”; action=“place cup”; environment=“rain hits glass”; mind=none.
  • Realignment: MAINTAIN (no flashback); time_value e.g., 1966-04-25 19:00:00 (if provided by context).
  • Instantiation: “I place the coffee cup on the windowsill. I see rain hitting the glass.”
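
Using the hypothetical ANU dataclass from the Stage B sketch above, the same record could be written as:

```python
anu_001 = ANU(
    id="ANU-001",
    t_anch="morning, before rain stops",
    location="windowsill",
    actions=["place cup"],
    environment=["rain hits glass"],
    # mind stays None: no inner state is stated in the source.
)
```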

04 Experiments & Results

🍞 Hook: Think of a triathlon. One athlete is great at swimming (facts), another at biking (timelines), and a third at running (insight). A true champion must do well at all three.

đŸ„Ź The Concept (Benchmarking Across Tasks and Datasets): Test different memory systems on fact recall, time/logic, and deep insight using varied narrative styles.

  • What it is: A large suite (2,580 queries) over three narrative modalities: flashback-heavy (KnausgĂ„rd), event-dense (Ferrante), and psychology-rich (Proust).
  • How it works: Multiple systems compete—Base models, Naive RAG (k=50), Mem0 (entity graph), MemOS (stream/log). Scores are reported by levels and tasks; judges use strict rubrics and require evidence IDs.
  • Why it matters: Different stories stress different skills; we learn which memory design helps actual understanding. 🍞 Anchor: Like testing bikes on hills, flats, and rough trails to see what really holds up.

The Test

  • Datasets:
    1. Flashback-Intensive (KnausgÄrd, 1.15M tokens): stresses mnestic triggers and non-linear time.
    2. Event-Driven (Neapolitan Novels, 1.76M): stresses fast entity updates and linear causality.
    3. Psychological Depth (Proust, 1.30M): stresses inner monologue and subtext.
  • Models:
    • Inference: Qwen3-32B (Long-Context), GPT-5-mini.
    • Memory/Tools: Naive RAG; Mem0 (entity memory); MemOS (stream/log-based).
  • Metrics: T1–T7 across Level I (facts), Level II (temporal/logic), Level III (insight). Safety via T2 (abstention).

The Competition

  • Baseline models set a floor.
  • Naive RAG boosts retrieval but risks “context pollution” (retrieving related but misleading text).
  • Mem0 builds a dynamic entity graph for strong name/coreference tracking.
  • MemOS preserves a chronological stream for non-linear narratives and reasoning.

The Scoreboard (with context)

  • Flashbacks (Dataset 1):
    ‱ MemOS posts big wins on time/logic: +10.4% (T3) and +10.8% (T4) on Qwen3-32B—like jumping from a C to a solid B+ on timeline tests while others stumble.
    • Mem0 regresses on T3 (about −3.5%): it overwrites present with past (the Update Paradox), like mixing yesterday’s weather with today’s forecast.
  • Event-Dense (Dataset 2):
    • Mem0 shines on entities: up to +11.8% on T2—like an A in name/relationship tracking where RAG gets a B.
    • MemOS still helps time/logic (T3, T4), though not as dramatically as in flashbacks.
  • Psychological Depth (Dataset 3):
    • RAG boosts T1 facts (+~9%) but can hurt T6 insight (−~0.5%) due to context pollution.
    • MemOS improves Level II and III (time/logic and insight), but overall Level III scores stay low.

Surprising Findings

  ‱ The Update Paradox: Entity graphs that keep a “current state” can misread flashbacks as present updates (e.g., “I liked apples as a child” → wrongly becomes “likes apples now”). Stream logs avoid this by anchoring the past properly; the sketch after this list makes the contrast concrete.
  • Precision vs. Insight Trade-off: Systems ace facts in event-heavy texts but struggle to read subtext in introspective narratives. Retrieval ≠ understanding.
  • Backbone Sensitivity: Stronger base models shrink gains from memory add-ons for easy tasks, but even with the best combo, Level III insights cap around ~22%—still a low ceiling.
  • Safety via Abstention: Structured memories (Mem0/MemOS) are better at saying “ABSTAIN” when the question stitches true pieces into a false relation. Naive RAG is likelier to “force an answer.”
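
The toy contrast below makes the Update Paradox concrete; both memory designs are deliberately simplified assumptions, not the actual Mem0 or MemOS implementations.

```python
entity_state = {}                      # entity-graph style: one mutable "current state"
def graph_update(attr, value):
    entity_state[attr] = value         # any statement overwrites "now"

stream_log = []                        # stream style: append-only, time-anchored log
def stream_update(attr, value, event_time):
    stream_log.append((event_time, attr, value))

def stream_current(attr):
    entries = [(t, v) for t, a, v in stream_log if a == attr]
    return max(entries)[1] if entries else None   # latest timestamp wins

# Narrative order: present statement first, then a flashback later in the text.
graph_update("likes_apples", False)           # "I avoid apples now"
graph_update("likes_apples", True)            # flashback: "I liked apples as a child"
print(entity_state["likes_apples"])           # True -> the past overwrote the present

stream_update("likes_apples", False, "2024")  # present, anchored to now
stream_update("likes_apples", True, "2005")   # flashback, realigned to 2005
print(stream_current("likes_apples"))         # False -> the present is preserved
```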

Plain-language takeaways

  • If the story jumps in time, stream-based memory (MemOS) is your best friend.
  • If the story has lots of characters and items, entity graphs (Mem0) help keep names straight.
  • If you need deep motives, none of today’s methods are good enough yet; we need new memory+reasoning designs.
  • For safety, structured memories reduce hallucinated answers.

🍞 Anchor: Imagine three quizzes: “What color was the bike?” (fact), “Which happened first in real life?” (time), and “Why did they quit the race?” (insight). Today’s systems do fine on color, better with a timeline log, but still guess too much on the “why.”

05 Discussion & Limitations

🍞 Hook: Think of building a treehouse. Good wood (data), a clear blueprint (timeline), and careful measuring (scoring) matter—but you still need stronger beams for storms (deep insight).

đŸ„Ź The Concept (Honest Assessment of Limits): KnowMe-Bench is a strong start, but there’s room to grow.

  • Limitations: 1) Literary insight is partly subjective; the LLM-as-a-Judge rubric helps and aligns well with humans (Îș>0.75) but isn’t perfect. 2) The multi-agent pipeline with strict audits is complex and costly. 3) Autobiographical narratives require careful de-identification and ethics review. 4) Level III remains hard—today’s systems rarely capture defined core metaphors and deeper motives.
  • Required Resources: Long-context capable models, retrieval/backends, and validation compute; human experts for final checks; privacy tooling.
  • When Not to Use: If you only need short, transactional memory (e.g., “What’s my delivery number?”), this heavyweight pipeline is overkill. Also, if data has no inner monologue or time cues, simpler tests may suffice.
  • Open Questions: How to fuse stream and graph memories without update paradoxes? Can models learn stable principles without overfitting to phrasing? How to incorporate multimodal (images/sounds) signals reliably? Can we design training objectives that reward evidence-linked causal attributions and penalize context pollution? 🍞 Anchor: Like a coach’s report: “Great at drills (facts), improving at plays (timelines), but needs new strategies for reading opponents’ minds (insight).”

06 Conclusion & Future Work

Three-sentence summary: KnowMe-Bench reframes person understanding as evidence-backed inference over rich autobiographical narratives, not just fact retrieval. It rebuilds stories into a flashback-aware cognitive stream and evaluates models with a three-level, evidence-linked suite from facts to deep insight. Experiments show retrieval and entity graphs help on facts, stream logs help on time/logic, but all systems still struggle with true motives and principles.

Main achievement: A faithful, auditable benchmark and dataset (4.7M tokens) that isolates the gap between remembering and truly understanding a person.

Future directions: Blend stream-based and graph-based memories; invent training signals for evidence-grounded causal attributions; integrate multimodal cues; improve abstention safety; and push beyond retrieval to principled person models.

Why remember this: It marks a shift from “bigger context and better search” to “truer timelines and tested insight,” a foundation for digital companions that can explain, anticipate, and align with us over a lifetime.

Practical Applications

  ‱ Personal assistants that remember your preferences without confusing childhood tastes with current ones.
  ‱ Wellness coaches that recognize stress triggers from past experiences and suggest safer alternatives.
  ‱ Education tutors that adapt lessons using evidence from your long-term study patterns and reflections.
  ‱ Therapeutic journaling tools that organize entries into timelines and highlight recurring motives with citations.
  ‱ Family history apps that align flashbacks to real dates, keeping stories accurate and meaningful.
  ‱ Recommendation systems that avoid context pollution and cite why a suggestion fits your values.
  ‱ Safety layers for chatbots that abstain on tricky, mismatched questions instead of guessing.
  ‱ Authoring tools for memoirs that separate present triggers from past events to maintain narrative clarity.
  ‱ Agent memory modules that combine entity graphs and streams for robust long-horizon reasoning.
  ‱ Evaluation suites for AI labs to diagnose whether upgrades help facts, time/logic, or deep insight.
Tags: person understanding, autobiographical narratives, cognitive stream, mnestic realignment, atomic narrative unit, evidence-grounded evaluation, retrieval-augmented generation, entity memory, stream-based memory, temporal reasoning, adversarial abstention, narrative identity, LLM-as-a-Judge, update paradox, context pollution