
Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction

Beginner
Muzhao Tian, Zisu Huang, Xiaohua Wang et al. · 1/8/2026
arXiv · PDF

Key Summary

  • Long-term AI helpers remember past chats, but using all memories can trap them in old ideas (Memory Anchoring).
  • This paper treats “how much to use memory” like a volume knob the user can turn from fresh-start to history-heavy.
  • They build a 1–5 memory-dependence score (MD-Score) using a rubric judge to measure how reliant an answer is on history.
  • SteeM (Steerable Memory Agent) lets users set a target level and trains the model to match it via SFT and GRPO (a kind of RL).
  • A synthetic but realistic dataset simulates long projects (Research, Tutoring) with events, artifacts, and query-specific memories.
  • Compared to plain prompting or masking memories, SteeM hits the user’s requested level more accurately while keeping answer quality.
  • SteeM generalizes to new subjects and keeps overall usefulness competitive on external reward benchmarks.
  • Natural language preference cues (not just tags) work well and preserve general ability better than rigid tag training.
  • Masking memory changes what the model sees, but SteeM changes how strongly it leans on it—this proves more reliable.
  • Bottom line: users get real-time control over consistency vs. creativity in long-term human–agent teamwork.

Why This Research Matters

People want AI that remembers them but doesn’t get stuck in the past. This work gives users a real dial to balance consistency (follow our history) and creativity (think fresh) in each turn. That means a research assistant can either continue a protocol exactly or propose bold alternatives, depending on the day. A tutor can grade by your established rubric or challenge you with novel tasks on command. Teams gain smoother collaboration, fewer frustrating “stuck in old drafts” moments, and safer control over when sensitive history should steer decisions. In short, agents become better long-term partners—reliable when you need continuity, inventive when you want change.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how when you work on a big school project for weeks, your notes help you remember what you decided before—but sometimes you want to try a brand-new idea without being stuck to old plans?

🥬 Filling (The Actual Concept)

  • What it is: Before this paper, AI helpers in long conversations stored lots of memories and usually used all of them whenever you asked a new question.
  • How it works (what the world looked like):
    1. The agent keeps a memory of your profile, past steps, drafts, and feedback.
    2. When you ask something, it retrieves “relevant” history and stuffs it into the prompt.
    3. The model then answers, strongly guided by the injected history.
  • Why it matters: This “always-use-history” strategy can make the agent over-attached to past choices, which blocks fresh thinking when you actually want new ideas.

🍞 Bottom Bread (Anchor) Imagine asking your tutor-bot to grade your last essay exactly like before (use history a lot), and later asking it for wild new topics (use little history). Old systems struggle to switch gears smoothly.

🍞 Top Bread (Hook) Imagine a boat tied to the dock by a rope. The rope keeps it safe, but if it’s too tight, the boat can’t sail anywhere new.

🥬 Filling (The Actual Concept)

  • What it is: Memory Anchoring is when the agent relies too much on past interactions and can’t easily try new directions.
  • How it works:
    1. The system retrieves detailed history.
    2. The model’s attention locks onto those details.
    3. Even if you say “be creative,” old styles and decisions keep leaking in.
  • Why it matters: Without fixing anchoring, users can’t get either a faithful continuation when they want consistency or a clean slate when they want innovation.

🍞 Bottom Bread (Anchor) You say, “Ignore my old recipe—surprise me!” but the chef-bot keeps cooking the same dish because it’s clinging to last week’s notes.

The Problem

  • People’s needs change turn by turn: sometimes they want history-heavy answers (consistency), other times history-light answers (creativity).
  • Existing tools are too crude: either turn memory on/off or try to prompt “please be creative.” Both often fail—models still slip into old patterns.
  • Users, who know best how much memory is right for the moment, actually have the least control.

Failed Attempts

  • Prompt-only control (“be creative” vs. “follow history closely”) barely nudges the model—tests show answers cluster at high memory use anyway.
  • Binary memory masking (dropping some items) can hide useful facts and still doesn’t regulate how intensely the model leans on what remains.

The Gap

  • What’s missing is a real dial that lets users steer how much the model depends on memory, with measurable feedback to ensure the agent hits the requested level.

Real Stakes

  • Daily life examples: a project assistant that sometimes continues a plan precisely (e.g., lab protocol) and other times brainstorms new angles; a tutor that can either grade by your past rubric or challenge you with fresh tasks.
  • Without control, agents either become boring copycats (too anchored) or forgetful strangers (too detached). With control, users get the best of both—reliable consistency when needed and true creativity on demand.

02Core Idea

🍞 Top Bread (Hook) Imagine a radio with a big dial: turn it left for calm music, right for rock. Easy, adjustable, instant.

🥬 Filling (The Actual Concept)

  • What it is: The paper’s “Aha!” is to treat memory reliance as a user-controlled dial—an explicit behavior dimension you can set from “fresh start” to “history-heavy.”
  • How it works:
    1. Define a 1–5 Memory-Dependence score that measures how history-driven an answer is.
    2. Let users set a target score per question (their preference).
    3. Train an agent (SteeM) to produce answers that match that target level while staying helpful.
  • Why it matters: Now users can balance creativity and consistency in real time, instead of being stuck with all-or-nothing memory.

🍞 Bottom Bread (Anchor) Asking “Continue my plan exactly” sets the dial high; asking “Pitch bold alternatives, not tied to our draft” sets it low—and the agent actually follows it.

🍞 Top Bread (Hook) You know how a teacher grades with a rubric so everyone knows what “good” means?

🥬 Filling (The Actual Concept)

  • What it is: Memory Dependence Metric (MD-Score) is a 1–5 rubric-judged score showing how much an answer leans on history.
  • How it works:
    1. Use a judge to read the question, the memory, and the answer.
    2. Check content, pattern, and style for signs of reliance (e.g., reusing internal results, following the exact prior workflow).
    3. Output an integer 1–5 (1 = generic, 5 = direct continuation).
  • Why it matters: If you can measure it, you can control it—this score makes the dial real, not just a wish.

🍞 Bottom Bread (Anchor) If the answer copies last session’s blueprint step-by-step, it scores near 5; if it gives a fresh, general plan, it scores near 1.
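To make the rubric concrete, here is a minimal sketch of how an LLM-as-a-judge MD-Score could be wired up. The rubric wording, the `call_llm` helper, and the fallback behavior are illustrative assumptions, not the paper’s exact prompt or code; the real judge weighs content, pattern, and style cues in more detail.

```python
import re

MD_RUBRIC = """You are grading how much the ANSWER relies on the provided MEMORY.
Score 1-5: 1 = generic answer that ignores the history, 3 = partial reuse of past
content or workflow, 5 = a direct continuation of prior results, workflow, and style.
Reply with a single integer."""

def md_score(question: str, memory: str, answer: str, call_llm) -> int:
    """Return a 1-5 memory-dependence score from an LLM judge.

    `call_llm` is a hypothetical helper that sends a prompt to a chat model and
    returns its text reply; the rubric above is a compressed illustration of the
    paper's content/pattern/style criteria, not its exact wording.
    """
    prompt = (
        f"{MD_RUBRIC}\n\nQUESTION:\n{question}\n\n"
        f"MEMORY:\n{memory}\n\nANSWER:\n{answer}\n\nScore:"
    )
    reply = call_llm(prompt)
    match = re.search(r"[1-5]", reply)          # take the first valid digit in the reply
    return int(match.group()) if match else 3   # fall back to the midpoint if parsing fails
```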

🍞 Top Bread (Hook) Imagine ordering pizza and also telling the chef exactly how spicy you want it—from mild 1 to fiery 5.

🥬 Filling (The Actual Concept)

  • What it is: Memory-Dependence Preference (MD-Pref) is the user’s chosen level (1–5) for how much the agent should rely on memory for this specific question.
  • How it works:
    1. The user (or their wording) signals the desired level.
    2. The system treats that as the target.
    3. The agent tries to produce an answer whose MD-Score matches the target.
  • Why it matters: Preferences change turn by turn; this lets the same agent cleanly switch modes.

🍞 Bottom Bread (Anchor) “Stick to our last plan” implies 5; “Ignore old drafts; propose new angles” implies 1–2.

🍞 Top Bread (Hook) Think of a car with power steering—you choose the direction, and the car smoothly follows.

🥬 Filling (The Actual Concept)

  • What it is: Steerable Memory Agent (SteeM) is a framework that learns to match the user’s target memory dependence while keeping answers high quality.
  • How it works:
    1. Build a realistic long-horizon dataset (projects, events, artifacts, memories).
    2. Generate training triples where the realized MD-Score is known.
    3. Train with supervised fine-tuning (SFT) and then with reinforcement learning (GRPO) to reward good alignment and usefulness.
  • Why it matters: SteeM turns the abstract dial into dependable behavior across many tasks.

🍞 Bottom Bread (Anchor) When you ask for “fresh-eyed critique,” SteeM reduces reliance on past drafts; when you ask for “faithful continuation,” it follows history closely.

Before vs. After

  • Before: Memory was a black box; prompting barely shifted reliance; masking could hide key facts.
  • After: Memory becomes a controllable axis with measurable alignment; users get precise, reliable steering.

Why It Works (intuition)

  • The rubric makes reliance visible; SFT teaches examples of each level; RL (with alignment reward) fine-tunes the behavior so the chosen level becomes the easiest, most rewarded choice.

Building Blocks (preview)

  • MD-Score rubric judge
  • User preference (target level)
  • Preference-aligned data generation
  • SFT for a solid start
  • GRPO (RL) with rewards for alignment and task quality

03Methodology

High-level recipe: Input (user query + constructed memory) → Choose a target dependence level (1–5) → SteeM generates answer → Judge scores realized level → Train to reduce the gap while preserving quality.
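Read as code, one steerable turn of that recipe might look like the sketch below. The `MemoryBundle` fields mirror the paper’s m_prof / m_inter / m_intra split, while `generate_answer` and `md_score` are placeholder callables standing in for the agent policy and the rubric judge; none of the names are the paper’s API.

```python
from dataclasses import dataclass

@dataclass
class MemoryBundle:
    """Query-specific memory M(q): profile, cross-session, and current-session parts."""
    m_prof: str    # long-term user profile and preferences
    m_inter: str   # summary of earlier sessions
    m_intra: str   # summary of the current session

def steerable_turn(query: str, memory: MemoryBundle, target_level: int,
                   generate_answer, md_score) -> dict:
    """One turn of the recipe: generate at a requested level, then measure the gap.

    `generate_answer(query, memory_text, target_level)` and
    `md_score(query, memory_text, answer)` are assumed helpers, not the paper's code.
    """
    memory_text = "\n\n".join([memory.m_prof, memory.m_inter, memory.m_intra])
    answer = generate_answer(query, memory_text, target_level)
    realized = md_score(query, memory_text, answer)
    return {
        "answer": answer,
        "target": target_level,
        "realized": realized,
        "gap": abs(realized - target_level),  # training pushes this toward zero
    }
```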

🍞 Top Bread (Hook) Imagine building a custom bike: you pick the frame (data), tune the gears (SFT), and then road-test and tweak it (RL) until it rides exactly how you like.

🥬 Filling (The Actual Concept)

  • What it is: A training pipeline that makes the memory dial work reliably.

  • How it works, step by step:

    1. Simulate realistic long projects

      • Create two scenarios (Research, Tutoring) with timelines of events (e.g., experiments, lessons) and evolving artifacts (plans, notes, drafts).
      • Why it matters: Gives controlled, repeatable, varied contexts where both history-light and history-heavy answers make sense.
      • Example: A research topic spawns events like method design → pilot study → analysis, producing artifacts like a method doc and results table.
    2. Build query-specific memory M(q)

      • What: For each user query q, construct memory M(q) with three parts: user profile (m_prof), cross-session summary (m_inter), recent session summary (m_intra).
      • Why: Separates long-term preferences from recent steps, so the agent can be steered cleanly.
      • Example: “User prefers concise, visual summaries” (profile), “last week’s rubric” (inter-session), “today’s draft changes” (intra-session).
    3. Measure reliance with a rubric judge (MD-Score)

      • What: An LLM-as-a-judge applies the rubric to output 1–5 for how history-driven an answer is.
      • Why: Without a meter, you can’t steer.
      • Example: If the answer strictly reuses last draft’s structure and terms, the judge returns 4–5.
    4. Preference-aligned data generation

      • What: Create training triples (q_align, M(q), y) where the answer’s realized MD-Score is known and the query wording naturally hints at that level.
      • How:
        a. A user simulator rewrites q to add a coarse cue for a target level (e.g., “fresh perspective”).
        b. Sample multiple candidate answers from a model pool.
        c. Score each answer’s realized MD-Score with the rubric judge.
        d. Rewrite the original q into q_align whose implicit cue matches the realized score (so the triple is consistent).
        e. Filter by general/task quality with rubrics + a reward model.
      • Why: Real models tend to over-use memory; this pipeline manufactures balanced, high-quality examples across all levels.
      • Example: For a ‘level 2’ target, keep an answer that references history lightly but mainly offers a general approach.
    5. Supervised fine-tuning (SFT)

      • What: Train the base model on the aligned triples using standard next-token prediction.
      • Why: Gives the model a strong, level-aware prior before RL.
      • Example: The model learns that “fresh-eyed critique” phrasing often pairs with low MD-Score answers.
    6. GRPO reinforcement learning (RL)

      • What: Further optimize with a combined reward R = R_align + R_task + R_general (a code sketch of this reward follows after this section).
        • R_align: negative of the absolute gap between the realized and target MD-Score.
        • R_task: rubric-based score for task success.
        • R_general: general quality from a reward model.
      • Why: Teaches the model to trade off reliance and usefulness, hitting the target level without losing quality.
      • Example: If the user’s target is 1 but the model answers like a 4, it gets penalized; if it answers like a 1–2 and stays helpful, it gets rewarded.
    7. Evaluation and iteration

      • Compare to baselines: None (no special instruction) and Rubric Instruct (explicit prompt for LOW/MID/HIGH), plus memory masking.
      • Analyze confusion matrices (target vs. realized level), overall alignment error (δ_align), and quality scores.
      • Tune until the mass concentrates near the diagonal (target ≈ realized) while quality holds.
  • Why it matters: This pipeline doesn’t just hide memory; it learns how strongly to lean on it, on command.

🍞 Bottom Bread (Anchor) The result is like a volume knob for memory: turn to 5 and the agent continues your plan; turn to 1 and it proposes fresh angles—not just because you asked, but because it was trained and rewarded to do so.
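The combined reward from step 6 can be written down directly. A minimal sketch: the paper states R = R_align + R_task + R_general, while the explicit weight parameters and the example numbers below are illustrative assumptions.

```python
def total_reward(realized_level: int, target_level: int,
                 task_score: float, general_score: float,
                 w_align: float = 1.0, w_task: float = 1.0, w_general: float = 1.0) -> float:
    """Combined training reward: alignment + task success + general quality.

    R_align is the negative absolute gap between the realized and requested MD
    levels; task_score and general_score would come from a rubric grader and a
    reward model. The weights are an assumption added for flexibility.
    """
    r_align = -abs(realized_level - target_level)
    return w_align * r_align + w_task * task_score + w_general * general_score

# Example: the user asked for level 1 (fresh start) but the answer scored 4.
print(total_reward(realized_level=4, target_level=1, task_score=0.8, general_score=0.7))  # -1.5
```

The alignment term dominates when the model misses the requested level badly, which is exactly the pressure that discourages Memory Anchoring.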

The Secret Sauce

  • The MD-Score judge turns a fuzzy behavior into a teachable target.
  • Preference-aligned data ensures all levels have good examples.
  • RL with an alignment reward makes hitting the exact level the path of least resistance.

Extra Sandwiches for Key Pieces

🍞 Top Bread (Hook) You know how chefs practice with sample dishes before a cooking contest?

🥬 Filling (The Actual Concept)

  • What it is: Supervised Fine-Tuning (SFT) teaches the model from curated examples how each memory level should look.
  • How it works:
    1. Feed aligned triples (q_align, M(q), y) into training.
    2. Learn to predict y given q_align and M(q).
    3. Internalize patterns for levels 1–5.
  • Why it matters: SFT builds a reliable foundation so RL can fine-tune behavior.

🍞 Bottom Bread (Anchor) After SFT, when a query hints “fresh take,” the model already knows what that style of answer looks like.
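As a rough idea of the SFT data format, here is a minimal sketch of how one aligned triple (q_align, M(q), y) might be serialized into a chat-style training record. The message template and system prompt are assumptions, not the paper’s format; in typical SFT setups the loss is computed only on the assistant turn.

```python
def to_sft_example(q_align: str, memory_text: str, answer: str) -> dict:
    """Turn one aligned triple into a chat-style training record (illustrative format)."""
    return {
        "messages": [
            {"role": "system", "content": "You are a long-term project assistant."},
            {"role": "user", "content": f"Relevant memory:\n{memory_text}\n\nRequest:\n{q_align}"},
            {"role": "assistant", "content": answer},  # supervision target for next-token prediction
        ]
    }
```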

🍞 Top Bread (Hook) Think of a video game where you earn points for playing the way the coach wants.

🥬 Filling (The Actual Concept)

  • What it is: GRPO (a kind of RL) gives rewards for matching the target level and staying useful.
  • How it works:
    1. Generate multiple answers for the same prompt.
    2. Score each with R_align, R_task, R_general.
    3. Update the policy to prefer higher-scoring answers.
  • Why it matters: It nudges the model so the easiest way to score high is to match the user’s requested reliance.

🍞 Bottom Bread (Anchor) If you ask for a clean slate (level 1) and the answer copies old drafts, it loses points—soon the model learns to keep things fresh when asked.
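The group-relative part of GRPO can be sketched in a few lines: sample several answers for the same prompt, score each with the combined reward, and rank each answer against its own group’s average. The clipped policy-gradient update itself is omitted here, and the reward values in the example are made up for illustration.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: how much better each sampled answer is than its
    siblings for the same prompt (the normalization at the heart of GRPO)."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in group_rewards]

# Four sampled answers for one prompt, scored with the combined reward above.
print(grpo_advantages([1.5, -1.5, 0.9, -0.3]))  # answers above the group mean get positive advantage
```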

04Experiments & Results

🍞 Top Bread (Hook) Imagine testing a smart helper on lots of school projects: Can it follow your “use history a lot” or “start fresh” instructions exactly, and still do good work?

🥬 Filling (The Actual Concept)

  • What it is: A broad evaluation of alignment (does the model hit the target level?) and quality (is the answer still good?).
  • How it works:
    1. Data: Over 10,000 query–memory pairs from simulated Research and Tutoring timelines with 7,000+ events and artifacts.
    2. Targets: For each test query, set a desired memory-dependence level (1–5) using natural preference cues.
    3. Models: Compare SteeM (SFT and SFT+RL) against baselines—None and Rubric Instruct—and also against memory masking.
    4. Metrics:
      • δ_align: the absolute gap between realized MD-Score and the target (lower is better).
      • Quality: reward-model scores (e.g., Skywork Reward) and external benchmarks (e.g., AlpacaEval).
  • Why it matters: We want a model that hits the level you ask for without getting worse at the task.

🍞 Bottom Bread (Anchor) Think of a scoreboard where winning means “closest to the requested memory level” while also “keeping a high grade on usefulness.”
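δ_align itself is easy to compute once target and realized levels are known. A plausible reading, assuming it is the mean absolute target-versus-realized gap over a test set:

```python
def delta_align(targets: list[int], realized: list[int]) -> float:
    """Mean absolute gap between requested and realized memory-dependence levels
    (lower is better); one straightforward reading of the δ_align metric."""
    assert len(targets) == len(realized) and targets, "need matched, non-empty lists"
    return sum(abs(t - r) for t, r in zip(targets, realized)) / len(targets)

# Three test queries where the model hits or nearly hits the requested level.
print(delta_align([1, 3, 5], [2, 3, 5]))  # -> 0.3333...
```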

The Test: Alignment to Target Levels

  • Finding: Baseline models showed Memory Anchoring—answers clustered near high reliance (4–5) even when asked for low reliance.
  • SteeM Result: Confusion matrices concentrate along the diagonal, meaning realized levels match targets much more often.
  • Across scenarios and tasks (Plan & Design, Revise, Analyze & Critique, Concept Explanation), SteeM consistently reduces δ_align compared to baselines; adding RL to SFT gives the best alignment.

The Competition: Prompting and Masking

  • Prompt-only (Rubric Instruct): Slight shifts but still biased toward high reliance.
  • Memory Masking: Some control by hiding context, but it can drop crucial info and does not ensure the model changes how strongly it leans on what remains.
  • SteeM vs. Masking: Pairwise judging shows SteeM wins more often at both matching the target level and completing the task.

The Scoreboard: Quality with Context

  • Overall reward scores: SteeM maintains or slightly improves utility compared to the baseline in several settings.
  • External benchmark (e.g., AlpacaEval): SteeM stays competitive; the tag-only training variant can slightly improve alignment but hurts general ability more than natural-language cues.
  • Takeaway: You don’t have to trade quality for control—SteeM keeps the answers useful while letting you steer reliance.

Surprising Findings

  • Prompting alone is weak once memory is present—models default to following history.
  • Natural language preference cues (subtle wording) are powerful for preserving general ability, better than rigid tags.
  • SteeM generalizes to new subjects (like Medical, Humanities) even when those topics weren’t in training, especially with RL.

Why These Results Matter

  • They show that memory reliance can be a controllable behavior, not a fixed quirk.
  • They prove users can set per-turn preferences and get dependable behavior across varied, realistic long-horizon tasks.

05Discussion & Limitations

Limitations

  • Simulated Data: Although the timelines and artifacts are carefully built, they may not capture all the messiness of real human work and conversation.
  • Two Scenarios: Only Research and Tutoring are covered; other domains (customer support, software debugging, legal drafting) could behave differently.
  • Coarse 1–5 Scale: Real preferences can be more nuanced (e.g., “mostly fresh, but keep safety constraints”), suggesting a need for finer or continuous control.
  • Judge Reliability: The rubric judge aligns well with humans on pairs, but it’s still an automated proxy; future work could integrate more human feedback or multi-judge ensembles.

Required Resources

  • Data pipeline to synthesize timelines, artifacts, and query-specific memory.
  • Access to a capable model pool for candidate generation and a reward model for quality checks.
  • Compute for SFT and RL (GRPO) training.

When NOT to Use

  • High-stakes compliance where all critical constraints must always be followed (e.g., medical protocols): use high-reliance defaults and restrict low levels.
  • Extremely short tasks with no useful history: the dial offers little benefit.
  • Settings where provenance must be explicit: consider pairing SteeM with citation tools to show exactly which memories were used.

Open Questions

  • Finer Control: How to move beyond integers (1–5) to continuous levels or multi-axis controls (e.g., content vs. style vs. workflow reliance separately)?
  • Human-in-the-Loop: Can users adjust the dial mid-answer or per-section, and can the model adapt interactively?
  • Robustness: How to make the judge and alignment more robust to tricky prompts or noisy histories?
  • Privacy & Consent: How should agents signal what memories they plan to use, and let users veto sensitive items?

06Conclusion & Future Work

Three-Sentence Summary

  • The paper introduces SteeM, a Steerable Memory Agent that lets users set how much the model should rely on past interactions for each query.
  • It builds a rubric-judged MD-Score to measure reliance and trains the model with SFT and GRPO so realized dependence matches the target while maintaining answer quality.
  • Across realistic long-horizon scenarios, SteeM outperforms prompting and masking, reduces Memory Anchoring, and generalizes to new subjects.

Main Achievement

  • Turning memory reliance into a controllable, measurable behavior axis—giving users a reliable dial between consistency and innovation.

Future Directions

  • Move from a 1–5 scale to continuous control and multi-axis steering (content, pattern, style separately).
  • Expand to more scenarios and integrate human feedback loops for on-the-fly adjustment.
  • Add transparent provenance so users can see and manage exactly which memories influenced the answer.

Why Remember This

  • Because it shows a practical way to balance “remembering who I am” with “thinking in new ways,” which is the heart of great long-term collaboration between humans and AI.

Practical Applications

  • Project assistants that can switch between strict continuation of plans and blue-sky brainstorming on demand.
  • Tutoring systems that grade using prior rubrics or design fresh exercises unrelated to past lessons when requested.
  • Customer support bots that follow account history closely for billing issues but generate creative troubleshooting for new problems.
  • Writing aides that either maintain a document’s established tone or intentionally explore new voices and structures.
  • Product design agents that respect legacy constraints during implementation reviews but ideate freely during early concepting.
  • Code assistants that follow existing architecture during refactoring but propose clean-slate patterns during redesign spikes.
  • Healthcare triage helpers that default to high reliance on patient history but can present general education when privacy-sensitive.
  • Sales assistants that adhere to account strategies for renewals but suggest unconventional pitches for new markets.
  • Education planners that preserve long-term goals yet introduce novel methods when progress stalls.
  • Research aides that replicate prior analysis pipelines or pilot fresh methodologies under a low-reliance setting.
Tags: steerable memory, memory anchoring, long-term agents, preference alignment, rubric judge, memory dependence metric, supervised fine-tuning, reinforcement learning, GRPO, personalization, retrieval-augmented generation, user-controlled behavior, alignment error, memory masking, human-agent interaction