Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
Key Summary
- Memory-T1 teaches conversational AI agents to keep track of when things happened across many conversations.
- It first narrows the search to likely times, then picks the exact sessions to read, like using a map and then a magnifying glass.
- A reinforcement learning policy is trained with three rewards: getting the answer right, citing the right sessions, and staying true to the timeline.
- The new temporal consistency reward checks both how close a session is to the question’s time and whether the utterances inside it match the time window.
- On the Time-Dialog benchmark, a 7B model with Memory-T1 scores 67.0%, beating a 14B baseline by 10.2 points.
- Ablations show the temporal and grounding rewards together add about 15% improvement.
- The method stays strong up to 128k-token histories where other models collapse, showing robustness to long, noisy dialogue.
- The approach generalizes to a different dataset (LoCoMo), improving over the base model without relying on retrieval.
- Training uses GRPO, a stable RL method, and keeps inference latency almost unchanged.
- This makes long-term assistants more reliable for date-sensitive tasks like planning, reminders, and factual Q&A.
Why This Research Matters
Real conversations stretch across weeks and months, and people often ask questions whose answers depend on when things happened. Memory-T1 makes AI assistants better at finding the right slice of time and the right evidence before answering. This leads to fewer mix-ups, like confusing last week with last month, and builds trust in long-term helpers. It also means a smaller model can act smarter by learning when to look, not just how much to read. With strong performance on very long histories, assistants can help with planning, tracking habits, and recalling important dates without getting lost. In short, it turns time from a troublemaker into a helpful guide for reliable AI.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you and a friend text each other all year. If someone asks, “When did you plan the birthday party?” you’d need to remember not just what was said, but when it was said.
🥬 The Concept (Temporal Reasoning): It means understanding the order and timing of events. How it works: (1) Find the right clues in the history, (2) read any time hints like “last night,” (3) line them up with actual dates, and (4) answer the question. Why it matters: Without it, an agent mixes up what happened first or last and gives wrong answers.
🍞 Anchor: If your friend texted, “We met last night at 7,” and today is Jan 10, then “last night” means Jan 9 at 7.
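To make that date arithmetic concrete, here is a minimal sketch of resolving a relative phrase against a session's date. It is not from the paper; the phrase-to-offset table is a simplified assumption (a real system would rely on the LLM or a time parser).

```python
from datetime import date, timedelta

# Hypothetical mapping from relative phrases to day offsets (simplified assumption).
RELATIVE_OFFSETS = {
    "last night": -1,
    "yesterday": -1,
    "two weeks earlier": -14,
}

def resolve_relative_date(phrase: str, session_date: date) -> date:
    """Anchor a relative time phrase to the calendar date of the session it appears in."""
    return session_date + timedelta(days=RELATIVE_OFFSETS[phrase.lower()])

# "We met last night at 7" said in a session dated Jan 10 resolves to Jan 9.
print(resolve_relative_date("last night", date(2024, 1, 10)))  # 2024-01-09
```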
🍞 Hook: You know how a school year has many classes, days, and notes? Conversations do too.
🥬 The Concept (Multi-session Dialogues): These are many separate chats over time between the same people. How it works: (1) Each session has a date, (2) each message can mention events that happened at different times, (3) later questions might depend on earlier sessions. Why it matters: If you can’t link a question to the right session, you’ll pull the wrong fact.
🍞 Anchor: If you planned a trip in February but packed in March, a question about “when did we book the flights?” should go to February, not March.
🍞 Hook: Think of reading a whole year’s worth of texts at once—it’s like trying to find a tiny sticker in a giant scrapbook.
🥬 The Concept (Long-context Models and Noise): Long-context models read huge amounts of text, but lots of extra, off-topic details (noise) distract them. How it works: (1) They try to hold many tokens in memory, (2) important bits can get “lost in the middle,” (3) time words like “the week before that” are easy to misinterpret. Why it matters: As histories grow, answers get less reliable.
🍞 Anchor: If the key sentence is buried on page 200 of 300, a model may focus on page 50 instead.
🍞 Hook: Imagine you’re sorting old photos. First, you pick the year, then the month, then the exact day—you don’t start by checking every single picture.
🥬 The Concept (Coarse-to-Fine Strategy): Start broad, then zoom in. How it works: (1) Roughly filter by time, (2) rank what’s left by textual relevance, (3) then choose exact evidence. Why it matters: Without it, the agent wastes attention everywhere and misses the best clues.
🍞 Anchor: First you grab the 2024 album (coarse), then the January folder (finer), then the Golden Globes selfie (finest).
🍞 Hook: Think of teaching a puppy tricks. You give treats for good moves so it learns faster.
🥬 The Concept (Reinforcement Learning, RL): The model tries actions and gets rewards (good) or penalties (bad) to learn a policy (what to do). How it works: (1) Propose which sessions to cite and what answer to give, (2) get rewards for correctness and good evidence, (3) repeat to improve. Why it matters: Without rewards, the model doesn’t learn which moves lead to better time reasoning.
🍞 Anchor: If citing the right session earns a treat, the model repeats that behavior.
The World Before: Agents could summarize or search across long histories, but they often treated everything as flat text. They didn’t deeply align “last night,” “two weeks earlier,” or “the week before that” to the right calendar dates. With longer dialogue, more chatter (noise) piled up, and their answers drifted.
The Problem: Find the few time-correct sessions and utterances that support a question, even when the wording is fuzzy (e.g., “last night”). This means mapping relative time expressions to the right dates and keeping event order straight across many sessions.
Failed Attempts: (1) Just make the model bigger—still gets lost in long, noisy text. (2) Pure retrieval (RAG)—helps recall but not always time alignment. (3) Heuristic timelines—work for explicit dates but struggle with ambiguous phrases. (4) RL on outcomes only—reward is too sparse; the model doesn’t learn which sessions matter.
The Gap: Models needed a way to learn a memory policy that is time-aware: first narrow to the right era, then pick the most time-faithful sessions, and get frequent, detailed feedback about temporal correctness.
Real Stakes: People ask date-sensitive questions all the time: “When did we last change the password?”, “Which doctor visit came first?”, “Did we schedule the trip before school started?” Reliable temporal reasoning makes assistants dependable planners, historians, and fact checkers over months of chats.
02 Core Idea
🍞 Hook: You know how when you clean your room, you first toss out what obviously doesn’t belong, then carefully sort what’s left?
🥬 The Concept (Aha!): Train an agent to pick time-correct memories using a two-step (coarse-to-fine) search and a reinforcement learning policy that’s rewarded for correct answers, citing the right sessions, and staying consistent with the timeline.
How it works (big picture):
- Predict the question’s time window to throw out sessions far from that period.
- Rank the remaining sessions by relevance to the question.
- Use an RL policy to select the precise evidence sessions and produce the answer.
- Give three kinds of reward: answer accuracy, evidence grounding, and temporal consistency (session closeness + utterance fidelity).
Why it matters: Without this, the agent treats all history equally, misreads “last night,” and confuses event order. With it, the agent homes in on when and where to look.
🍞 Anchor: Asking “When did we mention the Golden Globes?” The agent keeps January chats, boosts the one saying “last night,” and maps that to Jan 9.
Three Analogies:
- Library detective: First go to the right shelf (year/month), then the right book (session), then the exact page (utterance), and write down the citation.
- Treasure map: Start with a big X (time filter), use a compass to get closer (relevance), then a metal detector to pinpoint the spot (RL selection) and prove it with clues (rewards).
- Sports replay: Filter clips by game date, pick clips about the right play, then confirm the timestamp and players before calling the final score.
Before vs. After:
- Before: Big models tried to read everything; time phrases were often misaligned. Longer chats meant worse answers.
- After: Memory-T1 narrows to the right time, picks sessions that match the time window inside and out, and cites them, keeping answers stable even with 128k-token histories.
Why It Works (intuition):
- Coarse-to-fine reduces distraction so the agent spends attention on the most promising slices of history.
- Multi-level rewards are dense: the agent learns not just if the final answer was right, but whether the cited sessions were both relevant and time-faithful.
- Temporal consistency splits into two checks: session-level “How close in time?” and utterance-level “Does the content itself line up with the time window?” This resolves tricky phrases like “the week before that.”
Building Blocks (with mini sandwiches):
- 🍞 Hook: Imagine a coach scoring players for shooting, passing, and defense, not just wins. 🥬 The Concept (Multi-Level Reward Function): A scoring system with multiple parts. How it works: (1) Answer accuracy, (2) evidence grounding, (3) temporal consistency. Why it matters: Only rewarding the final answer is too sparse; the model won’t learn what to read. 🍞 Anchor: Even if you guessed the score, you’d still get graded on how you played.
- 🍞 Hook: You first check the right month before picking the right day. 🥬 The Concept (Temporal Consistency Reward): Encourages choosing sessions and utterances that match the question’s time window. How it works: (1) Session closeness (chronological proximity), (2) utterance alignment (chronological fidelity). Why it matters: Keeps the agent from citing the right topic from the wrong time. 🍞 Anchor: Don’t use a March diary entry to answer a February question.
- 🍞 Hook: When studying, you first skim (coarse), then deep-dive (fine). 🥬 The Concept (Coarse-to-Fine Retrieval): Broad time filter, then relevance ranking, then precise RL selection. Why it matters: Saves attention and boosts accuracy. 🍞 Anchor: From “sometime in January” to “Jan 10 session” to “this exact quote.”
- 🍞 Hook: A student learns faster when feedback is immediate and specific. 🥬 The Concept (Reinforcement Learning Policy): A learned rule for which sessions to cite and what answer to give. How it works: Try, get graded by rewards, adjust. Why it matters: Hard problems need practice with good feedback. 🍞 Anchor: If citing Session 20 plus saying “Jan 9” gets high reward, the policy repeats that move.
03 Methodology
High-level recipe: Input (question + memory bank) → Phase 1: Candidate Generation (Time Filter → Relevance Filter) → Phase 2: Fine-grained Selection via RL (Select sessions + Answer) → Rewards (Accuracy, Grounding, Temporal) → Update policy.
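As a minimal sketch of that recipe, the two phases can be wired together as below. The interfaces (`predict_window`, `rank`, `select_and_answer`) are illustrative stand-ins for the time-window predictor, the lexical retriever, and the RL-trained policy, not the authors' API.

```python
from datetime import date
from typing import Callable, NamedTuple

class Session(NamedTuple):
    sid: int    # session ID
    day: date   # session timestamp
    text: str   # concatenated utterances

def answer_question(question: str, memory_bank: list[Session],
                    predict_window: Callable, rank: Callable,
                    select_and_answer: Callable, top_k: int = 10):
    """Coarse-to-fine pipeline: time filter -> relevance filter -> RL selection."""
    # Phase 1a: predict the question's time window and drop far-off sessions.
    start, end = predict_window(question)
    in_window = [s for s in memory_bank if start <= s.day <= end]

    # Phase 1b: rank the survivors by textual relevance and keep the top-k.
    candidates = sorted(in_window, key=lambda s: rank(question, s.text),
                        reverse=True)[:top_k]

    # Phase 2: the RL-trained policy jointly cites sessions and produces the answer.
    return select_and_answer(question, candidates)
```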
Phase 1: Candidate Generation
🍞 Hook: You don’t search the whole city for a lost glove—you start with where you were yesterday.
🥬 The Concept (Time Filter): Predict the question’s target time window and keep only overlapping sessions. How it works: (1) Use an LLM to predict [start, end] for the question, (2) drop sessions outside that window, (3) pass a smaller set forward. Why it matters: Removes obviously wrong-time sessions so noise shrinks.
🍞 Anchor: If the question is about “last night,” keep yesterday’s sessions, drop June sessions.
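Here is a minimal sketch of the time filter, assuming each session carries a calendar date and the predicted window arrives as a [start, end] pair. The small slack for imperfect window predictions is an added assumption, not a detail from the paper.

```python
from datetime import date, timedelta

def time_filter(sessions, window_start: date, window_end: date, slack_days: int = 2):
    """Keep only sessions dated inside (or just outside) the predicted time window.

    `sessions` is assumed to be a list of (session_id, session_date) pairs.
    """
    lo = window_start - timedelta(days=slack_days)
    hi = window_end + timedelta(days=slack_days)
    return [(sid, d) for sid, d in sessions if lo <= d <= hi]

# A question about "last night" asked on Jan 10 yields a Jan 9-10 window:
sessions = [(20, date(2024, 1, 10)), (3, date(2023, 6, 2))]
print(time_filter(sessions, date(2024, 1, 9), date(2024, 1, 10)))  # keeps session 20 only
```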
🍞 Hook: After you know the day, you read only the most relevant pages.
🥬 The Concept (Relevance Filter with a Retriever): Rank remaining sessions by how textually related they are to the question. How it works: (1) Compute text similarity (e.g., BM25), (2) take top-k sessions, (3) build a candidate pool that fits context. Why it matters: Keeps high-recall yet manageable candidates.
🍞 Anchor: For “Golden Globes last night,” sessions that mention “Golden Globes” rise to the top.
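A small sketch of the relevance filter using BM25, here via the rank_bm25 package; the whitespace tokenizer and top-k interface are simplifications of whatever retriever the authors actually used.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def relevance_filter(question: str, sessions: list[str], k: int = 5) -> list[int]:
    """Return indices of the top-k sessions by BM25 score against the question."""
    tokenized = [s.lower().split() for s in sessions]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(sessions)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

sessions = ["We watched the Golden Globes together last night.",
            "Remember to water the plants this weekend."]
print(relevance_filter("When did we mention the Golden Globes?", sessions, k=1))  # [0]
```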
Phase 2: Fine-grained Selection via RL
🍞 Hook: Picking the right page is good; picking the exact quote is better.
🥬 The Concept (RL-based Evidence Selection + Answering): The agent outputs both which sessions it cites and the final answer in one shot. How it works: (1) Input: question + candidate pool, (2) Output: {selected_memory: [...], answer: ...}, (3) Parse the IDs and answer, (4) Give rewards, (5) Update the policy with GRPO. Why it matters: The agent learns a direct link between what it cites and what it says.
🍞 Anchor: “{selected_memory: [session_20], answer: January 9, 2024}” ties proof to answer.
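A hedged sketch of how such a structured output might be parsed before scoring. The JSON shape below is an assumption based on the paraphrased format above, not the paper's exact schema.

```python
import json
import re

def parse_policy_output(raw: str):
    """Extract cited session IDs and the answer from the model's raw output.

    Assumes the policy was prompted to emit JSON like
    {"selected_memory": ["session_20"], "answer": "January 9, 2024"}.
    Falls back to an empty citation list if parsing fails, which would then
    earn a low grounding reward during training.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return [], ""
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], ""
    return obj.get("selected_memory", []), obj.get("answer", "")

raw = '{"selected_memory": ["session_20"], "answer": "January 9, 2024"}'
print(parse_policy_output(raw))  # (['session_20'], 'January 9, 2024')
```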
Rewards (the secret sauce)
🍞 Hook: A good report card shows more than one grade.
🥬 The Concept (Multi-Level Rewards): Three parts work together.
- What it is: A combined score that teaches the model not just to be right, but to be right for the right reasons at the right time.
- How it works (step-by-step):
- Accuracy Reward (Ra): Did the answer match? Uses exact match for choices, unit-aware checks for dates, tolerant checks for durations, and partial credit for sequences.
- Evidence Grounding Reward (Rg): Did the cited session IDs match the gold evidence? Uses a similarity score (like Jaccard) scaled to [-1, 1].
- Temporal Consistency Reward (Rt): Are the chosen sessions time-appropriate? Two subrewards:
- Chronological Proximity (Rs, session-level): Reward nearby sessions; penalize far ones smoothly (so small timing wobbles aren’t punished too hard).
- Chronological Fidelity (Rf, utterance-level): Reward utterances whose event times sit inside (or overlap) the question’s time window; penalize those outside.
- Why it matters: Without Rg and Rt, the agent might guess right without citing the right proof or might use the correct topic from the wrong day.
🍞 Anchor: If you say “Jan 9” but cite a Jan 15 session, your score drops.
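Putting the three parts together, here is a simplified sketch of how the combined reward could be computed. The exact formulas, weights, and decay rate in the paper differ; the strict exact-match check for Ra and the 50/50 split inside Rt are assumptions for illustration.

```python
import math

def accuracy_reward(pred_answer: str, gold_answer: str) -> float:
    """Ra: 1 for a match, 0 otherwise (the paper also uses softer, type-aware checks)."""
    return 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def grounding_reward(cited_ids: set, gold_ids: set) -> float:
    """Rg: Jaccard overlap between cited and gold session IDs, rescaled to [-1, 1]."""
    if not cited_ids and not gold_ids:
        return 1.0
    jaccard = len(cited_ids & gold_ids) / len(cited_ids | gold_ids)
    return 2.0 * jaccard - 1.0

def temporal_reward(cited_dates, window_start, window_end, utterance_spans) -> float:
    """Rt: average of session-level proximity (Rs) and utterance-level fidelity (Rf)."""
    # Rs: smooth decay with distance (in days) from the question's time window.
    def days_outside(d):
        if d < window_start:
            return (window_start - d).days
        if d > window_end:
            return (d - window_end).days
        return 0
    rs = sum(math.exp(-0.5 * days_outside(d)) for d in cited_dates) / max(len(cited_dates), 1)

    # Rf: +1 per cited utterance whose event span overlaps the window, -1 otherwise.
    def overlaps(span):
        return span[0] <= window_end and span[1] >= window_start
    rf = sum(1.0 if overlaps(s) else -1.0 for s in utterance_spans) / max(len(utterance_spans), 1)
    return 0.5 * (rs + rf)

def total_reward(ra: float, rg: float, rt: float, w=(1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the three levels; the weights here are illustrative."""
    return w[0] * ra + w[1] * rg + w[2] * rt
```

The smooth exponential decay in Rs mirrors the intuition above: small timing wobbles cost a little, while citing sessions from the wrong month costs a lot.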
🍞 Hook: Practice is safer with a spotting partner.
🥬 The Concept (GRPO Training): A stable RL update method for language models. What it is: Like PPO but uses group-relative baselines to reduce variance. How it works: (1) Sample multiple outputs, (2) compute rewards per sample, (3) subtract the group average reward (advantage), (4) update the policy while staying close to a reference model. Why it matters: Keeps training steady and prevents wild behavior.
🍞 Anchor: Think of averaging class scores before grading each student’s improvement.
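A tiny sketch of the group-relative part of GRPO. The full update also includes a clipped policy-gradient objective and a KL penalty toward the reference model, which are omitted here.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled output's reward against its group's statistics.

    For one question, the policy samples several outputs; each advantage is
    (reward - group mean) / group std, which GRPO uses as a baseline instead
    of a learned value function.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the same question, scored by the multi-level reward:
print(group_relative_advantages([1.8, 0.2, 1.1, -0.5]))
```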
Concrete Example Walkthrough (Golden Globes, “last night”):
- Time Filter: If today’s session is Jan 10, “last night” maps to Jan 9–10. Keep sessions around Jan 9–10.
- Relevance Filter: Rank by mentions of “Golden Globes,” “Suits,” “together.” Keep top-k sessions.
- RL Selection: The agent cites Session 20 (Jan 10, with “last night”), infers Jan 9 as the date, and outputs both.
- Rewards:
- Ra: Date matches gold → good.
- Rg: Cited Session 20 matches gold evidence → good.
- Rt: Session is near the time window (Rs high) and the utterance’s event span aligns with “last night” (Rf high) → good.
- Policy Update: GRPO nudges the model to repeat this behavior.
What breaks without each step:
- No Time Filter: Too many off-date sessions—model gets distracted.
- No Relevance Filter: Candidate pool is bloated—signals get muddy.
- No Rg: Model may not learn to cite the true source.
- No Rt: Model might pull the right topic but wrong date.
- No GRPO: Training becomes unstable, learnings don’t stick.
Secret Sauce: The temporal consistency reward is the key: it teaches the model to respect both the calendar (session closeness) and the clock inside the sentences (utterance fidelity), which fixes subtle time ambiguities.
04 Experiments & Results
The Test: Researchers used Time-Dialog (4,716 QA items across many temporal subtasks) and LoCoMo (a different dataset with a temporal subtask) to see if Memory-T1 truly understands time in multi-session chats. They measured overall F1, plus category subtasks like event order, ranges, and co-temporality.
The Competition:
- Full-context LLMs (Qwen 3B/7B/14B, Llama 8B, Gemma 4B) and GPT-4,
- Retrieval-augmented setups (RAG),
- Specialized temporal or memory agents (Time-R1, MemAgent),
- An SFT baseline,
- An RL ablation that used only answer accuracy (Ra-only) to show the value of the new rewards.
The Scoreboard (with context):
- Memory-T1 (7B) scored 67.0% on Time-Dialog overall—like getting an A when others are around B. It beat a larger 14B baseline by 10.2 points.
- Even Memory-T1 (3B) reached 66.9%, outperforming bigger models like Llama-3-8B and Qwen-14B. This shows the method (policy + rewards) mattered more than just model size.
- Against GPT-4 with a full prompt or ReAct, Memory-T1 still came out ahead (though an oracle GPT-4 using gold evidence remains higher, as expected).
- The biggest gains were on complex temporal reasoning (e.g., order and range). That’s exactly where time alignment and grounding matter most.
Surprising/Interesting Findings:
- Dense rewards matter: Removing temporal consistency (Rt) and evidence grounding (Rg) led to a combined drop around 15%. Training with Ra only caused a dramatic overall fall (~22%). Translation: final-answer-only feedback isn’t enough to learn time.
- A quirky ablation: Removing just the session-level proximity (Rs) sometimes helped simple tasks but badly hurt complex ones, revealing that Rs and Rf (utterance fidelity) complement each other.
- Long-context robustness: While baselines sagged as histories stretched to 128k tokens (classic “lost in the middle”), Memory-T1 held steady. The coarse-to-fine funnel protected the agent from noise.
- Out-of-domain: On LoCoMo, Memory-T1 improved over the base model, especially without retrieval, hinting it learned a general internal skill for time-aware memory selection.
- Speed: Nearly the same inference latency as baselines; retrieval added negligible time.
What changed because of Memory-T1:
- Before: Answers drifted in long, noisy logs; relative time phrases were often misgrounded.
- After: The agent reliably zooms into the right time, cites supporting sessions, and keeps event order and time windows straight—even when the log is huge.
Takeaway in numbers:
- State-of-the-art open-source performance at 67.0% overall.
- Outperforms a 14B baseline by 10.2 points with a 7B model.
- Temporal and grounding rewards together: about +15% lift.
- Robust to 128k-token contexts with little slowdown.
- OOD gains on LoCoMo, especially in non-RAG settings.
05 Discussion & Limitations
Limitations:
- Deep composition limits: Some tasks like complex timeline building or multi-hop comparisons still trip the model, because they demand long chains of logic beyond selecting good sessions.
- Training-time annotations: The utterance-level event times and gold evidence are used to compute rewards during training. Real-world data won’t have these labels, so there’s a train/test mismatch (although the model doesn’t use labels at inference).
- Time-window prediction errors: If the initial time filter guesses the window poorly, the correct session could be filtered out early.
- Hyperparameter sensitivity: Top-k size and reward weights affect performance; the best settings may vary by domain.
- RL compute: GRPO training needs multiple samples per query and a reference model; this is heavier than simple fine-tuning.
Required Resources:
- An LLM backbone (e.g., 3B–7B) with a ~16k context window for the training setups shown.
- A fast retriever (BM25 or similar) for the relevance filter.
- RL framework supporting GRPO and KL regularization.
- Training data with session timestamps and (for reward computation) gold evidence and time spans.
When NOT to Use:
- Very short chats where simple search works fine (overkill).
- Fully structured databases with exact timestamps (a rules engine may be simpler and exact).
- Pure open-web QA without conversation context (a strong RAG alone may suffice).
- Real-time streaming with no timestamps at all (you need at least rough time anchors or reliable relative-time inference).
Open Questions:
- Can we learn temporal consistency without any event-time annotations via self-supervision?
- How to better handle ambiguous phrases across time zones and daylight saving time automatically?
- Can we fuse the time filter and the policy into a single end-to-end model?
- How to quantify and communicate uncertainty about time (e.g., say “around Jan 9” with confidence)?
- How well does this extend to multilingual, multi-calendar settings and cross-cultural date phrases?
06 Conclusion & Future Work
Three-sentence summary: Memory-T1 is a reinforcement-learning framework that helps dialogue agents pick the right time-consistent memories from long, noisy multi-session histories. It uses a coarse-to-fine search—time filter, relevance filter, and an RL policy—with multi-level rewards for answer accuracy, evidence grounding, and temporal consistency. This delivers state-of-the-art temporal reasoning, stays reliable up to 128k tokens, and lets smaller models beat bigger baselines.
Main achievement: Designing and proving the value of a temporal consistency reward that checks both session closeness and utterance-level fidelity, turning vague time phrases into well-grounded answers.
Future directions: Learn time alignment with weaker or no annotations, integrate the time filter into the policy end-to-end, handle multiple calendars and languages, and add uncertainty-aware outputs for fuzzy time.
Why remember this: It shows that being time-smart isn’t about reading more—it’s about reading the right things at the right time, with feedback that teaches the model how to respect the calendar inside conversations.
Practical Applications
- Personal timeline Q&A: “When did I say I’d start the gym routine?” with correct session citation.
- Family planning: Track which weekend was chosen for trips or birthdays across long chats.
- Health journaling: Align symptoms or medications with dates to answer doctor-prep questions.
- Project logs: Identify which meeting decided a feature and when, even months later.
- Customer support history: Pinpoint when a promise or policy was discussed in multi-session tickets.
- Education coaching: Recall when study goals were set and measure progress over time.
- Team retrospectives: Order events (deploys, incidents) accurately for postmortems.
- Legal/compliance review: Cite the exact session where a policy was approved and on what date.
- Financial planning: Track when budget changes were agreed upon across several sessions.
- Event planning: Confirm what was finalized “last night” versus “two weeks earlier.”