Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction
Key Summary
- Long-term AI helpers remember past chats, but using all memories can trap them in old ideas (Memory Anchoring).
- This paper treats "how much to use memory" like a volume knob the user can turn from fresh-start to history-heavy.
- They build a 1–5 memory-dependence score (MD-Score) using a rubric judge to measure how reliant an answer is on history.
- SteeM (Steerable Memory Agent) lets users set a target level and trains the model to match it via SFT and GRPO (a kind of RL).
- A synthetic but realistic dataset simulates long projects (Research, Tutoring) with events, artifacts, and query-specific memories.
- Compared to plain prompting or masking memories, SteeM hits the user's requested level more accurately while keeping answer quality.
- SteeM generalizes to new subjects and keeps overall usefulness competitive on external reward benchmarks.
- Natural language preference cues (not just tags) work well and preserve general ability better than rigid tag training.
- Masking memory changes what the model sees, but SteeM changes how strongly it leans on it; this proves more reliable.
- Bottom line: users get real-time control over consistency vs. creativity in long-term human-agent teamwork.
Why This Research Matters
People want AI that remembers them but doesn't get stuck in the past. This work gives users a real dial to balance consistency (follow our history) and creativity (think fresh) in each turn. That means a research assistant can either continue a protocol exactly or propose bold alternatives, depending on the day. A tutor can grade by your established rubric or challenge you with novel tasks on command. Teams gain smoother collaboration, fewer frustrating "stuck in old drafts" moments, and safer control over when sensitive history should steer decisions. In short, agents become better long-term partners: reliable when you need continuity, inventive when you want change.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how when you work on a big school project for weeks, your notes help you remember what you decided before, but sometimes you want to try a brand-new idea without being stuck to old plans?
Filling (The Actual Concept)
- What it is: Before this paper, AI helpers in long conversations stored lots of memories and usually used all of them whenever you asked a new question.
- How it works (what the world looked like):
- The agent keeps a memory of your profile, past steps, drafts, and feedback.
- When you ask something, it retrieves "relevant" history and stuffs it into the prompt.
- The model then answers, strongly guided by the injected history.
- Why it matters: This "always-use-history" strategy can make the agent over-attached to past choices, which blocks fresh thinking when you actually want new ideas.
Bottom Bread (Anchor) Imagine asking your tutor-bot to grade your last essay exactly like before (use history a lot), and later asking it for wild new topics (use little history). Old systems struggle to switch gears smoothly.
Top Bread (Hook) Imagine a boat tied to the dock by a rope. The rope keeps it safe, but if it is tied too tight, the boat can't sail anywhere new.
Filling (The Actual Concept)
- What it is: Memory Anchoring is when the agent relies too much on past interactions and can't easily try new directions.
- How it works:
- The system retrieves detailed history.
- The model's attention locks onto those details.
- Even if you say "be creative," old styles and decisions keep leaking in.
- Why it matters: Without fixing anchoring, users can't get either a faithful continuation when they want consistency or a clean slate when they want innovation.
Bottom Bread (Anchor) You say, "Ignore my old recipe; surprise me!" but the chef-bot keeps cooking the same dish because it's clinging to last week's notes.
The Problem
- People's needs change turn by turn: sometimes they want history-heavy answers (consistency), other times history-light answers (creativity).
- Existing tools are too crude: either turn memory on/off or try to prompt "please be creative." Both often fail; models still slip into old patterns.
- Users, who know best how much memory is right for the moment, actually have the least control.
Failed Attempts
- Prompt-only control ("be creative" vs. "follow history closely") barely nudges the model; tests show answers cluster at high memory use anyway.
- Binary memory masking (dropping some items) can hide useful facts and still doesnât regulate how intensely the model leans on what remains.
The Gap
- What's missing is a real dial that lets users steer how much the model depends on memory, with measurable feedback to ensure the agent hits the requested level.
Real Stakes
- Daily life examples: a project assistant that sometimes continues a plan precisely (e.g., lab protocol) and other times brainstorms new angles; a tutor that can either grade by your past rubric or challenge you with fresh tasks.
- Without control, agents either become boring copycats (too anchored) or forgetful strangers (too detached). With control, users get the best of both: reliable consistency when needed and true creativity on demand.
02 Core Idea
Top Bread (Hook) Imagine a radio with a big dial: turn it left for calm music, right for rock. Easy, adjustable, instant.
Filling (The Actual Concept)
- What it is: The paper's "Aha!" is to treat memory reliance as a user-controlled dial: an explicit behavior dimension you can set from "fresh start" to "history-heavy."
- How it works:
- Define a 1–5 Memory-Dependence score that measures how history-driven an answer is.
- Let users set a target score per question (their preference).
- Train an agent (SteeM) to produce answers that match that target level while staying helpful.
- Why it matters: Now users can balance creativity and consistency in real time, instead of being stuck with all-or-nothing memory.
Bottom Bread (Anchor) Asking "Continue my plan exactly" sets the dial high; asking "Pitch bold alternatives, not tied to our draft" sets it low, and the agent actually follows it.
Top Bread (Hook) You know how a teacher grades with a rubric so everyone knows what "good" means?
Filling (The Actual Concept)
- What it is: Memory Dependence Metric (MD-Score) is a 1–5 rubric-judged score showing how much an answer leans on history.
- How it works:
- Use a judge to read the question, the memory, and the answer.
- Check content, pattern, and style for signs of reliance (e.g., reusing internal results, following the exact prior workflow).
- Output an integer 1–5 (1 = generic, 5 = direct continuation).
- Why it matters: If you can measure it, you can control it; this score makes the dial real, not just a wish.
Bottom Bread (Anchor) If the answer copies last session's blueprint step-by-step, it scores near 5; if it gives a fresh, general plan, it scores near 1.
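To make the rubric-judge idea concrete, here is a minimal sketch of how an MD-Score judge call could be wired up; the rubric wording and the call_llm helper are illustrative assumptions, not the paper's actual rubric or judge implementation.

```python
import re

MD_RUBRIC = """Rate how much the ANSWER relies on the MEMORY, on a 1-5 scale:
1 = generic answer, no meaningful use of the history
3 = moderate use: references the history but adds substantial new direction
5 = direct continuation: reuses prior results, workflow, and style
Consider content, pattern, and style. Reply with a single integer."""

def md_score(question, memory_text, answer, call_llm):
    """Return an integer 1-5 MD-Score; call_llm is a stand-in for any chat-model client."""
    judge_prompt = (
        f"{MD_RUBRIC}\n\nQUESTION:\n{question}\n\n"
        f"MEMORY:\n{memory_text}\n\nANSWER:\n{answer}\n\nScore:"
    )
    reply = call_llm(judge_prompt)
    match = re.search(r"[1-5]", reply)         # take the first digit 1-5 the judge emits
    return int(match.group()) if match else 3  # fall back to the midpoint if unparsable
```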
Top Bread (Hook) Imagine ordering pizza and also telling the chef exactly how spicy you want it, from mild 1 to fiery 5.
Filling (The Actual Concept)
- What it is: Memory-Dependence Preference (MD-Pref) is the user's chosen level (1–5) for how much the agent should rely on memory for this specific question.
- How it works:
- The user (or their wording) signals the desired level.
- The system treats that as the target.
- The agent tries to produce an answer whose MD-Score matches the target.
- Why it matters: Preferences change turn by turn; this lets the same agent cleanly switch modes.
Bottom Bread (Anchor) "Stick to our last plan" implies 5; "Ignore old drafts; propose new angles" implies 1–2.
Top Bread (Hook) Think of a car with power steering: you choose the direction, and the car smoothly follows.
Filling (The Actual Concept)
- What it is: Steerable Memory Agent (SteeM) is a framework that learns to match the userâs target memory dependence while keeping answers high quality.
- How it works:
- Build a realistic long-horizon dataset (projects, events, artifacts, memories).
- Generate training triples where the realized MD-Score is known.
- Train with supervised fine-tuning (SFT) and then with reinforcement learning (GRPO) to reward good alignment and usefulness.
- Why it matters: SteeM turns the abstract dial into dependable behavior across many tasks.
Bottom Bread (Anchor) When you ask for a "fresh-eyed critique," SteeM reduces reliance on past drafts; when you ask for a "faithful continuation," it follows history closely.
Before vs. After
- Before: Memory was a black box; prompting barely shifted reliance; masking could hide key facts.
- After: Memory becomes a controllable axis with measurable alignment; users get precise, reliable steering.
Why It Works (intuition)
- The rubric makes reliance visible; SFT teaches examples of each level; RL (with alignment reward) fine-tunes the behavior so the chosen level becomes the easiest, most rewarded choice.
Building Blocks (preview)
- MD-Score rubric judge
- User preference (target level)
- Preference-aligned data generation
- SFT for a solid start
- GRPO (RL) with rewards for alignment and task quality
03 Methodology
High-level recipe: Input (user query + constructed memory) → Choose a target dependence level (1–5) → SteeM generates answer → Judge scores realized level → Train to reduce the gap while preserving quality.
Top Bread (Hook) Imagine building a custom bike: you pick the frame (data), tune the gears (SFT), and then road-test and tweak it (RL) until it rides exactly how you like.
Filling (The Actual Concept)
- What it is: A training pipeline that makes the memory dial work reliably.
- How it works, step by step:
Step 1: Simulate realistic long projects
- Create two scenarios (Research, Tutoring) with timelines of events (e.g., experiments, lessons) and evolving artifacts (plans, notes, drafts).
- Why it matters: Gives controlled, repeatable, varied contexts where both history-light and history-heavy answers make sense.
- Example: A research topic spawns events like method design → pilot study → analysis, producing artifacts like a method doc and results table.
Step 2: Build query-specific memory M(q)
- What: For each user query q, construct memory M(q) with three parts: user profile (m_prof), cross-session summary (m_inter), recent session summary (m_intra).
- Why: Separates long-term preferences from recent steps, so the agent can be steered cleanly.
- Example: "User prefers concise, visual summaries" (profile), "last week's rubric" (inter-session), "today's draft changes" (intra-session).
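To make the three-part memory concrete, here is a minimal sketch of how M(q) could be represented; the class and field names (QueryMemory, profile, inter_session, intra_session) are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class QueryMemory:
    """Query-specific memory M(q); field names are illustrative."""
    profile: str        # m_prof: long-term user preferences and traits
    inter_session: str  # m_inter: summary of earlier sessions in the project
    intra_session: str  # m_intra: summary of the current session so far

    def render(self) -> str:
        # Serialize the three parts into a prompt section the agent can read.
        return (
            f"[User profile]\n{self.profile}\n\n"
            f"[Cross-session summary]\n{self.inter_session}\n\n"
            f"[Current session summary]\n{self.intra_session}"
        )

memory = QueryMemory(
    profile="Prefers concise, visual summaries.",
    inter_session="Last week: agreed on a grading rubric for essays.",
    intra_session="Today: revised the intro section of the draft.",
)
print(memory.render())
```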
Step 3: Measure reliance with a rubric judge (MD-Score)
- What: An LLM-as-a-judge applies the rubric to output 1–5 for how history-driven an answer is.
- Why: Without a meter, you can't steer.
- Example: If the answer strictly reuses the last draft's structure and terms, the judge returns 4–5.
Step 4: Preference-aligned data generation
- What: Create training triples (q_align, M(q), y) where the answer's realized MD-Score is known and the query wording naturally hints at that level.
- How: (a) A user simulator rewrites q to add a coarse cue for a target level (e.g., "fresh perspective"). (b) Sample multiple candidate answers from a model pool. (c) Score each answer's realized MD-Score with the rubric judge. (d) Rewrite the original q into q_align whose implicit cue matches the realized score, so the triple is consistent. (e) Filter by general/task quality with rubrics plus a reward model.
- Why: Real models tend to over-use memory; this pipeline manufactures balanced, high-quality examples across all levels.
- Example: For a "level 2" target, keep an answer that references history lightly but mainly offers a general approach.
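To show how sub-steps (a) through (e) could chain together, here is a compressed sketch; helper names such as rewrite_with_cue, generate, judge_md_score, and quality_ok stand in for the paper's user simulator, model pool, rubric judge, and reward-model filter, and are assumptions rather than the released implementation.

```python
def build_aligned_triples(query, memory, target_levels, candidate_models,
                          rewrite_with_cue, generate, judge_md_score, quality_ok):
    """Sketch of preference-aligned data generation; all helpers are illustrative stand-ins."""
    triples = []
    for target in target_levels:                        # e.g., [1, 2, 3, 4, 5]
        q_cued = rewrite_with_cue(query, target)        # (a) user simulator adds a coarse cue
        candidates = [generate(m, q_cued, memory)       # (b) sample answers from a model pool
                      for m in candidate_models]
        for answer in candidates:
            realized = judge_md_score(q_cued, memory, answer)  # (c) rubric judge returns 1-5
            q_align = rewrite_with_cue(query, realized)        # (d) cue rewritten to match realized score
            if quality_ok(q_align, memory, answer):            # (e) rubric + reward-model quality filter
                triples.append((q_align, memory, answer, realized))
    return triples
```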
Step 5: Supervised fine-tuning (SFT)
- What: Train the base model on the aligned triples using standard next-token prediction.
- Why: Gives the model a strong, level-aware prior before RL.
- Example: The model learns that "fresh-eyed critique" phrasing often pairs with low MD-Score answers.
Step 6: GRPO reinforcement learning (RL)
- What: Further optimize with a reward R = R_align + R_task + R_general, where R_align is the negative absolute gap between the realized and target MD-Score, R_task is a rubric-based score for task success, and R_general is general quality from a reward model.
- Why: Teaches the model to trade off reliance and usefulness, hitting the target level without losing quality.
- Example: If the user's target is 1 but the model answers like a 4, it gets penalized; if it answers like a 1–2 and stays helpful, it gets rewarded.
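A minimal sketch of how the stated reward could be combined per sample is shown below; the equal weighting between the three terms is an assumption, since the exact coefficients are not given here.

```python
def steem_reward(target_level, realized_level, task_score, general_score,
                 w_align=1.0, w_task=1.0, w_general=1.0):
    """R = R_align + R_task + R_general (the weights are illustrative assumptions)."""
    r_align = -abs(realized_level - target_level)  # 0 when the target level is hit exactly
    return w_align * r_align + w_task * task_score + w_general * general_score

# Example: the user asked for level 1 (fresh start) but the answer behaves like a 4.
print(steem_reward(target_level=1, realized_level=4, task_score=0.8, general_score=0.7))
# The alignment term contributes -3, dragging the total down despite decent quality scores.
```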
Step 7: Evaluation and iteration
- Compare to baselines: None (no special instruction) and Rubric Instruct (explicit prompt for LOW/MID/HIGH), plus memory masking.
- Analyze confusion matrices (target vs. realized level), overall alignment error (δ_align), and quality scores.
- Tune until the mass concentrates near the diagonal (target ≈ realized) while quality holds.
- Why it matters: This pipeline doesn't just hide memory; it learns how strongly to lean on it, on command.
Bottom Bread (Anchor) The result is like a volume knob for memory: turn to 5 and the agent continues your plan; turn to 1 and it proposes fresh angles, not just because you asked, but because it was trained and rewarded to do so.
The Secret Sauce
- The MD-Score judge turns a fuzzy behavior into a teachable target.
- Preference-aligned data ensures all levels have good examples.
- RL with an alignment reward makes hitting the exact level the path of least resistance.
Extra Sandwiches for Key Pieces
Top Bread (Hook) You know how chefs practice with sample dishes before a cooking contest?
Filling (The Actual Concept)
- What it is: Supervised Fine-Tuning (SFT) teaches the model from curated examples how each memory level should look.
- How it works:
- Feed aligned triples (q_align, M(q), y) into training.
- Learn to predict y given q_align and M(q).
- Internalize patterns for levels 1–5.
- Why it matters: SFT builds a reliable foundation so RL can fine-tune behavior.
Bottom Bread (Anchor) After SFT, when a query hints at a "fresh take," the model already knows what that style of answer looks like.
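One way the aligned triples might be serialized into (prompt, target) pairs for next-token prediction is sketched below; the template and section markers are illustrative assumptions, not the paper's exact training format.

```python
def to_sft_example(q_align, memory_text, answer):
    """Serialize an aligned triple (q_align, M(q), y) into a (prompt, target) pair.

    The template below is an illustrative assumption, not the paper's exact format.
    """
    prompt = (
        "You are a long-term assistant with access to the memory below.\n"
        f"--- Memory ---\n{memory_text}\n"
        f"--- User query ---\n{q_align}\n"
        "--- Answer ---\n"
    )
    return {"prompt": prompt, "target": answer}  # loss is applied to the target tokens

example = to_sft_example(
    q_align="I'd like a fresh-eyed critique of this outline, not tied to our earlier drafts.",
    memory_text="[Cross-session summary] Last month we settled on a three-part outline.",
    answer="Setting the earlier structure aside, here are three general weaknesses to check.",
)
print(example["prompt"])
```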
Top Bread (Hook) Think of a video game where you earn points for playing the way the coach wants.
Filling (The Actual Concept)
- What it is: GRPO (a kind of RL) gives rewards for matching the target level and staying useful.
- How it works:
- Generate multiple answers for the same prompt.
- Score each with R_align, R_task, R_general.
- Update the policy to prefer higher-scoring answers.
- Why it matters: It nudges the model so the easiest way to score high is to match the user's requested reliance.
Bottom Bread (Anchor) If you ask for a clean slate (level 1) and the answer copies old drafts, it loses points; soon the model learns to keep things fresh when asked.
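To make the "sample a group, score it, prefer the better answers" loop concrete, here is a minimal sketch of the group-relative advantage computation used by GRPO-style methods; it is a simplification that omits the clipped objective and KL penalty, and the example rewards are made up.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: score each sampled answer relative to its own group.

    Simplified sketch: real GRPO also uses a clipped policy-gradient objective and a
    KL penalty toward a reference model, both omitted here.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against dividing by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the same prompt, each scored with R_align + R_task + R_general.
rewards = [-2.1, 0.6, 1.4, 0.2]
print(group_relative_advantages(rewards))
# Answers above the group mean get positive advantages and are reinforced; those below are discouraged.
```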
04 Experiments & Results
Top Bread (Hook) Imagine testing a smart helper on lots of school projects: Can it follow your "use history a lot" or "start fresh" instructions exactly, and still do good work?
Filling (The Actual Concept)
- What it is: A broad evaluation of alignment (does the model hit the target level?) and quality (is the answer still good?).
- How it works:
- Data: Over 10,000 query-memory pairs from simulated Research and Tutoring timelines with 7,000+ events and artifacts.
- Targets: For each test query, set a desired memory-dependence level (1–5) using natural preference cues.
- Models: Compare SteeM (SFT and SFT+RL) against the baselines (None and Rubric Instruct) and also against memory masking.
- Metrics:
- δ_align: the absolute gap between the realized MD-Score and the target (lower is better); a small computation sketch follows below.
- Quality: reward-model scores (e.g., Skywork Reward) and external benchmarks (e.g., AlpacaEval).
- Why it matters: We want a model that hits the level you ask for without getting worse at the task.
Bottom Bread (Anchor) Think of a scoreboard where winning means "closest to the requested memory level" while also "keeping a high grade on usefulness."
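As a small illustration (an assumption about the aggregation, not the paper's exact evaluation code), δ_align and the target-vs-realized confusion matrix could be computed over a test set like this:

```python
from collections import Counter

def alignment_report(targets, realized):
    """delta_align as the mean absolute gap, plus a target-vs-realized confusion count."""
    gaps = [abs(t - r) for t, r in zip(targets, realized)]
    delta_align = sum(gaps) / len(gaps)          # lower is better; 0 means perfect steering
    confusion = Counter(zip(targets, realized))  # mass on the diagonal means good alignment
    return delta_align, confusion

targets  = [1, 2, 3, 4, 5, 1, 5]
realized = [1, 3, 3, 4, 5, 2, 5]
delta, conf = alignment_report(targets, realized)
print(delta)         # about 0.286: average gap between requested and realized level
print(conf[(1, 1)])  # how often a requested level 1 was realized as level 1
```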
The Test: Alignment to Target Levels
- Finding: Baseline models showed Memory Anchoring: answers clustered near high reliance (4–5) even when asked for low reliance.
- SteeM Result: Confusion matrices concentrate along the diagonal, meaning realized levels match targets much more often.
- Across scenarios and tasks (Plan & Design, Revise, Analyze & Critique, Concept Explanation), SteeM consistently reduces δ_align compared to baselines; adding RL to SFT gives the best alignment.
The Competition: Prompting and Masking
- Prompt-only (Rubric Instruct): Slight shifts but still biased toward high reliance.
- Memory Masking: Some control by hiding context, but it can drop crucial info and does not ensure the model changes how strongly it leans on what remains.
- SteeM vs. Masking: Pairwise judging shows SteeM wins more often at both matching the target level and completing the task.
The Scoreboard: Quality with Context
- Overall reward scores: SteeM maintains or slightly improves utility compared to the baseline in several settings.
- External benchmark (e.g., AlpacaEval): SteeM stays competitive; the tag-only training variant can slightly improve alignment but hurts general ability more than natural-language cues.
- Takeaway: You don't have to trade quality for control; SteeM keeps the answers useful while letting you steer reliance.
Surprising Findings
- Prompting alone is weak once memory is present; models default to following history.
- Natural language preference cues (subtle wording) are powerful for preserving general ability, better than rigid tags.
- SteeM generalizes to new subjects (like Medical, Humanities) even when those topics weren't in training, especially with RL.
Why These Results Matter
- They show that memory reliance can be a controllable behavior, not a fixed quirk.
- They prove users can set per-turn preferences and get dependable behavior across varied, realistic long-horizon tasks.
05 Discussion & Limitations
Limitations
- Simulated Data: Although the timelines and artifacts are carefully built, they may not capture all the messiness of real human work and conversation.
- Two Scenarios: Only Research and Tutoring are covered; other domains (customer support, software debugging, legal drafting) could behave differently.
- Coarse 1–5 Scale: Real preferences can be more nuanced (e.g., "mostly fresh, but keep safety constraints"), suggesting a need for finer or continuous control.
- Judge Reliability: The rubric judge aligns well with humans on pairs, but it's still an automated proxy; future work could integrate more human feedback or multi-judge ensembles.
Required Resources
- Data pipeline to synthesize timelines, artifacts, and query-specific memory.
- Access to a capable model pool for candidate generation and a reward model for quality checks.
- Compute for SFT and RL (GRPO) training.
When NOT to Use
- High-stakes compliance where all critical constraints must always be followed (e.g., medical protocols): use high-reliance defaults and restrict low levels.
- Extremely short tasks with no useful history: the dial offers little benefit.
- Settings where provenance must be explicit: consider pairing SteeM with citation tools to show exactly which memories were used.
Open Questions
- Finer Control: How to move beyond integers (1–5) to continuous levels or multi-axis controls (e.g., content vs. style vs. workflow reliance separately)?
- Human-in-the-Loop: Can users adjust the dial mid-answer or per-section, and can the model adapt interactively?
- Robustness: How to make the judge and alignment more robust to tricky prompts or noisy histories?
- Privacy & Consent: How should agents signal what memories they plan to use, and let users veto sensitive items?
06 Conclusion & Future Work
Three-Sentence Summary
- The paper introduces SteeM, a Steerable Memory Agent that lets users set how much the model should rely on past interactions for each query.
- It builds a rubric-judged MD-Score to measure reliance and trains the model with SFT and GRPO so realized dependence matches the target while maintaining answer quality.
- Across realistic long-horizon scenarios, SteeM outperforms prompting and masking, reduces Memory Anchoring, and generalizes to new subjects.
Main Achievement
- Turning memory reliance into a controllable, measurable behavior axis, giving users a reliable dial between consistency and innovation.
Future Directions
- Move from a 1–5 scale to continuous control and multi-axis steering (content, pattern, style separately).
- Expand to more scenarios and integrate human feedback loops for on-the-fly adjustment.
- Add transparent provenance so users can see and manage exactly which memories influenced the answer.
Why Remember This
- Because it shows a practical way to balance "remembering who I am" with "thinking in new ways," which is the heart of great long-term collaboration between humans and AI.
Practical Applications
- Project assistants that can switch between strict continuation of plans and blue-sky brainstorming on demand.
- Tutoring systems that grade using prior rubrics or design fresh exercises unrelated to past lessons when requested.
- Customer support bots that follow account history closely for billing issues but generate creative troubleshooting for new problems.
- Writing aides that either maintain a document's established tone or intentionally explore new voices and structures.
- Product design agents that respect legacy constraints during implementation reviews but ideate freely during early concepting.
- Code assistants that follow existing architecture during refactoring but propose clean-slate patterns during redesign spikes.
- Healthcare triage helpers that default to high reliance on patient history but can present general education when privacy-sensitive.
- Sales assistants that adhere to account strategies for renewals but suggest unconventional pitches for new markets.
- Education planners that preserve long-term goals yet introduce novel methods when progress stalls.
- Research aides that replicate prior analysis pipelines or pilot fresh methodologies under a low-reliance setting.