Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Key Summary
- BudgetMem is a way for AI helpers to build and use memory on the fly, picking how much thinking to spend so answers are both good and affordable.
- Instead of pre-making one giant, fixed memory, it grabs only what the current question needs and processes it at runtime.
- Each memory step (like filtering, finding entities, tracking time, and summarizing) offers three budgets: LOW, MID, and HIGH.
- A tiny learned router chooses the right budget per step using reinforcement learning, balancing answer quality against token cost.
- Three different ways to make tiers are compared: simpler vs. fancier methods (implementation), short vs. long thinking (reasoning), and small vs. large models (capacity).
- Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem beats strong baselines when quality is the priority and traces better accuracy–cost trade-offs as budgets tighten.
- Capacity and implementation tiering cover a wider budget range, while reasoning tiering acts as a fine-grained quality dial at similar costs.
- A reward-scale alignment trick keeps training stable so the router doesn't just pick the cheapest path and ruin quality.
- The approach transfers to a different base LLM without retraining the router, showing robustness.
Why This Research Matters
BudgetMem makes AI assistants more practical by turning memory into a pay-what-you-need feature instead of a one-size-fits-all cost. Teams can get strong accuracy when it matters while keeping daily bills predictable. It helps long-term chat, research, and support agents that must remember across many sessions without drowning in token usage. The router’s learned choices save money by avoiding overthinking on easy questions and spending wisely on hard ones. Stable training tricks ensure the system doesn’t collapse to cheap-but-bad behavior, giving reliable quality. Because tiers are standardized, the approach is portable across different models and can adapt as prices and hardware evolve.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you have a giant binder of class notes from all year. If you try to carry it to every test, your backpack will explode. If you summarize everything too early, you might throw away the exact page you need later. What should you do?
🥬 The Concept: Runtime memory extraction
- What it is: It means building just the memory you need right now, after you see the question, not long before it.
- How it works:
- Keep the past records intact but split into chunks for easy lookup.
- When a question arrives, retrieve only the chunks that look relevant.
- Process those chunks to build a compact memory that helps answer this question.
- Why it matters: If you pre-summarize everything (offline), you may pay compute up front and accidentally throw away crucial bits for a specific question. Runtime building keeps options open and avoids wasting compute on unneeded stuff.
🍞 Anchor: Like grabbing only the math pages you need for tonight’s homework instead of carrying every notebook to the table.
The world before: Many LLM agents tried to keep long-term memory by building it ahead of time (offline). They summarized, compressed, and indexed entire histories the same way for all future questions, whether or not a future question would need those parts. This was simpler to plan but risky: it could be wasteful (spending compute on things no query will use) and brittle (throwing away the exact detail a future query needs).
The problem: Real systems care about bills and speed. If we push memory work to runtime—only when we get a user query—then we must control the trade-off between answer quality and cost/latency. Yet past work offered few explicit knobs for dialing memory computation up or down at runtime. Two questions remained: where should the budget be applied (which pieces of the pipeline), and how should budgets be realized in a principled, rather than ad hoc, way?
Failed attempts: Prior methods often (a) locked in a single, fixed pipeline with little ability to dial compute up/down per query, (b) tried ad hoc tricks to save cost that didn’t generalize, or (c) just spent more compute (e.g., bigger models or longer chains-of-thought) without clear control.
The gap: We needed a runtime memory system that is modular (so budgets can be targeted at the right parts), tiered (so every part offers low/mid/high options), and learnable (so the system can choose smartly per question).
🍞 Hook: You know how streaming apps let you pick video quality—Low, Medium, High—depending on your data plan? Wouldn’t it be nice if AI memory could do that too, per step?
🥬 The Concept: Budget tiers (LOW/MID/HIGH)
- What it is: Each memory step offers three increasing levels of compute and quality.
- How it works:
- Every module guarantees the same input/output format.
- But internally it can run a cheaper or more powerful version.
- The system can mix-and-match tiers across modules for each query.
- Why it matters: Without tiers, you’re stuck with one-size-fits-all. With tiers, you can save cost where it’s safe and spend more where it counts.
🍞 Anchor: Like choosing SD for a slow Wi‑Fi video but switching to HD just for the key scene.
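To make "same input/output format, different internals" concrete, here is a toy sketch (not the paper's implementation): a filter module exposes one `run` interface, and the tier only changes what happens inside. The MID/HIGH branch below is a placeholder; in the real system those tiers would call a small learned model or an LLM.

```python
from enum import Enum

class Tier(Enum):
    LOW, MID, HIGH = range(3)

class FilterModule:
    """Toy filter module: identical input/output contract at every tier."""
    def run(self, query: str, chunks: list[str], tier: Tier) -> list[str]:
        if tier is Tier.LOW:
            # cheap heuristic: count query words appearing in each chunk
            scored = [(sum(w in c.lower() for w in query.lower().split()), c)
                      for c in chunks]
        else:
            # MID/HIGH would score with a small model or an LLM;
            # stand-in so the sketch runs end to end
            scored = [(len(c), c) for c in chunks]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [c for _, c in scored[:3]]  # keep top-K (K=3 here)
```

Because every tier returns the same shape of output, the router can swap tiers freely without breaking the pipeline downstream.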
Real stakes: This matters for chat assistants with long histories, research tools, customer support bots, and any agent that must remember across days or documents. Costs add up quickly. People need predictable bills and fast answers—but also correctness when it counts. A controllable, query-aware memory system lets teams decide exactly how to spend compute for each question.
🍞 Hook: Imagine an assembly line where each station can work fast and cheap or slow and super-precise, and a manager decides, in real time, which setting to use so the final product is great without blowing the budget.
🥬 The Concept: BudgetMem (high-level preview)
- What it is: A runtime memory framework that splits memory-making into modules, gives each module three budget tiers, and uses a small learned router to choose tiers per query under a cost–quality objective.
- How it works:
- Retrieve raw chunks relevant to the query.
- Run them through a fixed pipeline of modules (filter → entity/temporal/topic → summary).
- At each module, pick LOW/MID/HIGH.
- Optimize the router with reinforcement learning (RL) to balance accuracy vs. cost.
- Why it matters: It makes performance–cost trade-offs explicit, controllable, and learned from experience—no more blind guessing.
🍞 Anchor: Like a smart chef who decides when to use the quick pan vs. the slow pressure cooker for each part of a dish to serve a tasty meal on time.
02 Core Idea
🍞 Hook: You know how you choose between taking the bus (cheap), a carpool (mid), or a taxi (fast but pricey) depending on how late you are? Imagine your AI’s memory does the same for every step it takes.
🥬 The Concept: BudgetMem’s key insight
- What it is: Treat runtime memory as a fixed set of modules, give each module LOW/MID/HIGH options, and learn a tiny router that picks the tier per module to hit the best accuracy–cost trade-off for the current query.
- How it works:
- Standardize module inputs/outputs.
- Offer three quality–cost tiers per module.
- Use a compact neural router to pick tiers as the query flows through.
- Train the router with RL to maximize task score minus a (scaled) cost.
- Why it matters: Without this, you either overspend or underthink. With it, you spend compute where it helps most.
🍞 Anchor: Like a GPS that knows when to take the toll road (expensive, fast) or local streets (cheap, slower) to arrive on time without overpaying.
Multiple analogies for the same idea:
- Kitchen analogy: Each kitchen station (prep, sauté, bake, plate) can work in speed mode, normal mode, or gourmet mode. A sous-chef (router) decides per dish and step which mode to use so dinner tastes great and is on time.
- School project analogy: For a science fair, you can do quick notes (LOW), a structured report (MID), or a polished poster with experiments (HIGH). A team lead (router) assigns the level for research, measurements, and presentation based on deadline and goals.
- Travel analogy: For each leg of a trip, choose bus (LOW), train (MID), or plane (HIGH). A planner (router) balances arrival time (quality) against ticket price (cost).
Before vs. After:
- Before: Memory systems pre-built summaries and indexes without knowing the next question, risking waste and missing details. Runtime methods often had one fixed compute level.
- After: BudgetMem mixes and matches compute per module, per query. It dials up where it matters (e.g., summarization) and dials down elsewhere (e.g., simple filtering), learning the best pattern over time.
Why it works (intuition, no equations):
- Not all steps are equally important for every question. If the retriever already did a great job, the filter can be LOW; if the question hinges on time order, the temporal module might go HIGH. By standardizing interfaces, the router compares apples-to-apples options and picks tiers that maximize expected score while respecting budget. Reinforcement learning closes the loop by rewarding choices that helped answer correctly with reasonable cost.
🍞 Hook: Imagine the pipeline itself.
🥬 The Concept: Modular runtime memory pipeline
- What it is: A fixed recipe: Filter → Parallel extraction (Entity, Temporal, Topic) → Summary.
- How it works:
- Filter tightens the candidate chunks.
- Three parallel modules pull out different clues (who/what, when, and what’s-the-topic).
- Summary fuses them into a compact memory for answering.
- Why it matters: This structure is simple, interpretable, and lets the router target compute exactly where needed.
🍞 Anchor: Like cleaning ingredients (filter), prepping meats/spices/timing (entity/temporal/topic), then plating a final dish (summary).
Building blocks (each introduced with the Sandwich pattern):
- 🍞 Hook: Sorting your backpack before class. 🥬 The Concept: Filter module
- What it is: Ranks and selects the most relevant chunks.
- How it works: Scores each retrieved chunk vs. the query and keeps top-K.
- Why it matters: Prevents downstream modules from drowning in noise. 🍞 Anchor: Keeping only the math pages for a math quiz.
- 🍞 Hook: Listing who did what. 🥬 The Concept: Entity module
- What it is: Extracts key people/objects and their relations.
- How it works: Finds subject–relation–object facts tied to the query.
- Why it matters: Answers often hinge on who/what relationships. 🍞 Anchor: "Ada Lovelace – wrote – first algorithm."
- 🍞 Hook: Putting events on a timeline. 🥬 The Concept: Temporal module
- What it is: Captures dates, durations, and order of events.
- How it works: Normalizes time clues and aligns event order.
- Why it matters: Many questions depend on what happened before/after. 🍞 Anchor: "Race started at 9:00 and ended at 10:30; awards after."
- 🍞 Hook: Naming chapters in a book. 🥬 The Concept: Topic module
- What it is: Summarizes themes and topic shifts.
- How it works: Extracts topic cues and transitions across chunks.
- Why it matters: Keeps the memory on the right subject. 🍞 Anchor: "This section is about volcano safety tips."
- 🍞 Hook: Turning notes into a study card. 🥬 The Concept: Summary module
- What it is: Fuses all clues into a compact, ready-to-use memory.
- How it works: Organizes entity, temporal, and topic info into a crisp memory.
- Why it matters: The final answerer LLM needs a short, accurate cheat-sheet. 🍞 Anchor: A neat one-page review sheet that actually helps you ace the test.
Tiering strategies (three ways to realize LOW/MID/HIGH):
- 🍞 Hook: Using a ruler/eyes/laser to measure. 🥬 The Concept: Implementation tiering
- What it is: Swap the internal method: heuristics (LOW), small learned models (MID), or LLM-based processing (HIGH).
- How it works: Same inputs/outputs, but increasingly capable internals.
- Why it matters: Covers a wide budget range with clear quality jumps. 🍞 Anchor: Spell-check by rules, then by a small model, then by a pro editor.
- 🍞 Hook: Quick answer vs. show-your-work vs. revise. 🥬 The Concept: Reasoning tiering
- What it is: Keep the same model but change how deeply it thinks: direct (LOW), chain-of-thought (MID), multi-step/reflection (HIGH).
- How it works: Add more deliberate steps as the tier rises.
- Why it matters: Great for fine-tuning quality at similar cost bands. 🍞 Anchor: Doing a math problem mentally, then on paper, then checking your steps.
- 🍞 Hook: Picking a bike, scooter, or car. 🥬 The Concept: Capacity tiering
- What it is: Same task and prompts, but use small (LOW), medium (MID), or large (HIGH) models inside a module.
- How it works: Bigger models usually understand more, at higher token cost.
- Why it matters: Achieves top quality in high-budget regimes. 🍞 Anchor: Delivering a package by bike (cheap), scooter (middle), or car (fastest, pricey).
03 Methodology
At a high level: Input (query + chunked history) → Retrieval → Modular pipeline with per-module tier routing (Filter → Entity/Temporal/Topic → Summary) → Extracted memory → Answer LLM.
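The high-level flow can be sketched in a few lines of toy Python. Every helper below is a stand-in (real modules are prompt- or model-based); the point is the shape of the pipeline: a router callback picks a tier for each module call, and the outputs fuse into one memory string.

```python
def filter_chunks(query, chunks, tier):
    # toy relevance: keep chunks sharing any word with the query
    qwords = set(query.lower().split())
    return [c for c in chunks if qwords & set(c.lower().split())] or chunks

def extract_entities(chunks, tier):
    return [w for c in chunks for w in c.split() if w.istitle()]

def extract_temporal(chunks, tier):
    return [w for c in chunks for w in c.split() if any(ch.isdigit() for ch in w)]

def extract_topics(chunks, tier):
    return [c.split()[0] for c in chunks]

def summarize(entities, times, topics, tier):
    return f"who: {entities}; when: {times}; about: {topics}"

def run_pipeline(query, chunks, choose_tier):
    """choose_tier(query, module_name) -> 'LOW' | 'MID' | 'HIGH' (the router)."""
    kept = filter_chunks(query, chunks, choose_tier(query, "filter"))
    # the three extraction modules run in parallel conceptually
    ents = extract_entities(kept, choose_tier(query, "entity"))
    tims = extract_temporal(kept, choose_tier(query, "temporal"))
    tops = extract_topics(kept, choose_tier(query, "topic"))
    return summarize(ents, tims, tops, choose_tier(query, "summary"))
```

A fixed-tier baseline is just `run_pipeline(q, chunks, lambda q, m: "MID")`; BudgetMem's contribution is learning that callback instead of hard-coding it.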
Step 0: Retrieval and chunking
- What happens: The history is split into small chunks and a retriever pulls a candidate set for the query.
- Why it exists: We need to narrow down the massive history before deeper processing.
- Example: From 100,000 tokens, retrieve the top-5 chunks most related to “When did the policy change and who announced it?”
🍞 Hook: Only sharpen the pencils you’ll actually use. 🥬 The Concept: Budget-tier routing
- What it is: A tiny neural policy that picks LOW/MID/HIGH for each module call, given the query and current intermediate signals.
- How it works:
- Observe a compact state (query, current module input, module ID).
- Choose one action: LOW, MID, or HIGH.
- Run the module at that tier and move to the next step.
- At the end, compute accuracy and cost to learn better decisions via RL.
- Why it matters: Without routing, you can’t customize compute to each query’s needs. 🍞 Anchor: A team captain decides who plays hard vs. conserves energy each quarter.
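A minimal router sketch, under stated assumptions: the paper describes a compact neural policy over richer state features; here a linear softmax over a few hand-made features (query length, module ID, bias) shows the decision mechanics.

```python
import math, random

class TierRouter:
    """Tiny softmax policy over {LOW, MID, HIGH} per module call (sketch)."""
    TIERS = ("LOW", "MID", "HIGH")

    def __init__(self, n_features=4, seed=0):
        rng = random.Random(seed)
        # one weight row per tier; a real router would be a small neural net
        self.w = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                  for _ in self.TIERS]

    def _features(self, query, module_id):
        # toy state: query length, word count, module index, bias term
        return [len(query) / 100, len(query.split()) / 20,
                (hash(module_id) % 5) / 5, 1.0]

    def probs(self, query, module_id):
        x = self._features(query, module_id)
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        m = max(logits)                       # stabilized softmax
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def choose(self, query, module_id):
        p = self.probs(query, module_id)
        return self.TIERS[max(range(len(p)), key=p.__getitem__)]
```

During training the router samples from `probs` for exploration; at inference it can pick the argmax as `choose` does here.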
Step 1: Filter module (tiered)
- What happens: Scores candidate chunks for relevance; keeps the top-K.
- Why it exists: Too many irrelevant chunks will confuse later modules and the final answer.
- Example with data: Among five chunks, keep the three that mention the policy change and announcement.
Step 2: Parallel extraction modules (tiered)
- Entity module
- What happens: Extracts who/what facts like “Minister X – announced – policy Y.”
- Why it exists: Questions often hinge on relationships.
- Example: “CEO Lin – confirmed – merger date.”
- Temporal module
- What happens: Normalizes times and event order (before/after/during).
- Why it exists: Multi-hop questions often depend on when things happened.
- Example: “Announcement on March 2; change took effect on April 1.”
- Topic module
- What happens: Identifies central themes and transitions to keep focus.
- Why it exists: Prevents drifting to nearby but irrelevant subjects.
- Example: “This part is about policy announcements, not budget debates.”
Step 3: Summary module (tiered)
- What happens: Fuses entity, temporal, and topic outputs into a compact memory m.
- Why it exists: The answer LLM needs a short, accurate summary to condition on.
- Example: “On March 2, Minister X announced Policy Y; it took effect on April 1.”
Step 4: Answer generation
- What happens: A fixed LLM answers using (query, m).
- Why it exists: Separates memory building from final answering.
- Example: Q: “When did it start, and who announced it?” A: “April 1; Minister X.”
🍞 Hook: Training a puppy with treats for good behavior. 🥬 The Concept: Reinforcement learning for the router
- What it is: The router learns tier choices by receiving rewards: higher for accurate answers, lower for expensive runs.
- How it works:
- Run the full pipeline with selected tiers to get an answer.
- Compute task reward (e.g., F1/Judge) and a cost-based reward.
- Combine them with a knob λ that sets how much we care about cost.
- Update the router policy (e.g., PPO) to choose better tiers next time.
- Why it matters: Hand-made rules are brittle. RL lets the system discover smart spending patterns per module. 🍞 Anchor: The router is like a coach learning which plays win games without exhausting the team.
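The paper trains with PPO; as a simpler illustration of how reward shapes tier choices, here is a REINFORCE-style update for the linear softmax policy above. It uses the standard softmax gradient identity (the gradient of log pi(a|s) with respect to tier i's weights is (1[a=i] - p_i) * x); treat it as a pedagogical sketch, not the paper's optimizer.

```python
def reinforce_step(policy_probs, chosen_idx, reward, weights, features, lr=0.01):
    """One REINFORCE update for a linear softmax policy.

    reward here is the combined signal, e.g. task_reward - lam * norm_cost,
    where lam is the knob that sets how much we care about cost."""
    for i, wrow in enumerate(weights):
        coef = (1.0 if i == chosen_idx else 0.0) - policy_probs[i]
        for j, xj in enumerate(features):
            wrow[j] += lr * reward * coef * xj  # push toward rewarded tiers
    return weights
```

Positive combined reward raises the probability of the chosen tier in that state; negative reward (accurate but too expensive, or cheap but wrong) lowers it.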
🍞 Hook: Comparing prices across different stores. 🥬 The Concept: Cost modeling and normalization
- What it is: Sum per-module costs (mainly token usage and API prices) and normalize them to a 0–1 scale so they balance with task reward.
- How it works:
- Add costs across routed modules.
- Apply a sliding-window, quantile-based normalization (with a sqrt) to bound values.
- Turn it into a reward that’s high when cost is low.
- Why it matters: Without fair scaling, cost could dominate or vanish in training, breaking the trade-off. 🍞 Anchor: Turning different currencies into dollars before adding them up.
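A runnable sketch of the normalization idea: the window size and the 5%/95% quantiles below are assumptions, but the structure matches the description, clip raw cost into a sliding-window quantile range, apply a square root, and flip it so low cost means high reward.

```python
from collections import deque
import math

class CostNormalizer:
    """Sliding-window, quantile-based cost normalization with a sqrt (sketch)."""
    def __init__(self, window=256, lo_q=0.05, hi_q=0.95):
        self.window = deque(maxlen=window)  # recent raw costs
        self.lo_q, self.hi_q = lo_q, hi_q

    def cost_reward(self, raw_cost):
        self.window.append(raw_cost)
        xs = sorted(self.window)
        lo = xs[int(self.lo_q * (len(xs) - 1))]
        hi = xs[int(self.hi_q * (len(xs) - 1))]
        if hi == lo:
            norm = 0.0  # not enough spread yet to normalize
        else:
            norm = min(max((raw_cost - lo) / (hi - lo), 0.0), 1.0)
        # sqrt spreads out the low-cost region; reward is high when cost is low
        return 1.0 - math.sqrt(norm)
```

The sliding window keeps the scale current as prices or query mixes drift, which a fixed global min/max would not.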
🍞 Hook: Balancing two classroom graders—one very strict, one lenient—so neither overpowers the final grade. 🥬 The Concept: Reward-scale alignment
- What it is: A variance-based factor that rebalances the task reward and cost reward so one doesn’t swamp the other during learning.
- How it works:
- Measure the recent variability of task vs. cost rewards.
- Scale the cost term so both signals contribute stably.
- Learn smoother, non-degenerate policies.
- Why it matters: Without it, the router may collapse to always-LOW (cheap but wrong) or always-HIGH (good but pricey). 🍞 Anchor: Adjusting microphone levels so both singers are heard clearly.
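One way to realize a variance-based rebalancing (a sketch; the paper's exact statistic may differ): scale the cost weight by the ratio of the two rewards' recent spreads, so neither signal drowns the other.

```python
import statistics

def aligned_cost_weight(task_rewards, cost_rewards, lam, eps=1e-8):
    """Rescale the cost weight lam so the cost term's spread matches the
    task reward's spread over a recent batch of episodes."""
    s_task = statistics.pstdev(task_rewards)   # population std dev
    s_cost = statistics.pstdev(cost_rewards)
    return lam * s_task / (s_cost + eps)
```

If cost rewards barely vary while task rewards swing widely, the factor boosts the cost term (and vice versa), keeping both gradients alive and steering the router away from degenerate always-LOW or always-HIGH policies.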
The secret sauce:
- A fixed, interpretable pipeline with standardized module interfaces.
- Three orthogonal tiering strategies (implementation, reasoning, capacity) inside the same framework.
- A small shared router trained end-to-end with cost-aware rewards and reward-scale alignment for stability.
- Practical budgeting knobs (λ and retrieval size) that trace smooth accuracy–cost frontiers.
04 Experiments & Results
The test: The team evaluated BudgetMem on three demanding benchmarks where long-term memory matters: LoCoMo, LongMemEval, and HotpotQA. They measured (a) task performance using F1 and an LLM-as-a-judge (a 0–1 scoring of answer correctness) and (b) memory extraction cost by adding up API token usage converted to dollars.
The competition: Strong baselines included ReadAgent, MemoryBank, A-MEM, LangMem, Mem0, MemoryOS, and LightMem—covering many popular memory designs.
The scoreboard (with context):
- On LongMemEval with LLaMA-3.3-70B as backbone, BudgetMem-CAP hit a Judge score of 60.50, while the strong baseline LightMem had 48.51. That’s like raising your test grade from a C+ to a solid B+.
- On HotpotQA with Qwen3-Next-80B, BudgetMem-REA reached a Judge score of 70.83 at just $0.17 cost. That’s like getting an A- while paying less than many B students.
- In performance-first settings (very low cost pressure), BudgetMem variants generally topped F1 and Judge across datasets while keeping costs contained compared to heavyweight pipelines.
Trade-off curves (why this matters): By adjusting the cost weight λ, BudgetMem traces smooth frontiers: as you allow more budget, quality rises predictably; as you tighten budget, quality decreases gracefully instead of crashing. These curves typically envelop baselines—meaning at the same cost, BudgetMem is more accurate, or at the same accuracy, it’s cheaper.
Surprising findings and insights:
- Capacity vs. Implementation vs. Reasoning
- Capacity tiering (small/medium/large models) and Implementation tiering (heuristics → small models → LLMs) cover a broader cost spectrum and achieve top-end quality when budgets are high.
- Reasoning tiering (direct → CoT → reflection) tends to operate in a narrower cost band but is excellent for fine-grained quality improvements at roughly similar costs—like a quality dial rather than a budget expander.
- Reward-scale alignment ablation: Removing it made the router collapse to mostly LOW tiers under cost pressure (cheap but poor answers). With alignment, the router used a balanced mix and achieved better, smoother frontiers.
- Retrieval-size sensitivity: More retrieved chunks aren’t always better. Performance peaked around top-5; beyond that, extra noise can hurt accuracy while raising cost. It’s like bringing just enough evidence to court but not drowning the judge in paperwork.
Bottom line: Across datasets and base models, BudgetMem variants consistently rank at the top in performance-first settings and provide cleaner accuracy–cost control than prior systems. They don’t just win; they let you decide how to win within your budget.
05 Discussion & Limitations
Limitations:
- Ultra-tight budgets: When you must spend almost nothing, even smart routing can’t manufacture quality without compute. Expect diminishing returns at the extreme low end.
- Dependency on components: Quality depends on the underlying retriever, LLMs, and module prompts. Weak retrieval or poorly tuned prompts can cap gains.
- Latency variance: Dynamic routing can introduce variable latency per query. Some applications need hard real-time guarantees.
- Modular bias: The chosen pipeline (filter → entity/temporal/topic → summary) is simple and interpretable, but other tasks might need different modules.
Required resources:
- Access to one or more LLM backbones (for capacity and implementation tiers), plus a retriever.
- Token budget to train the router with RL (though the router itself is lightweight and shared across modules).
- Basic logging to track per-module costs for the reward function.
When not to use:
- Tiny contexts where a single pass with the base LLM fits easily and cheaply into the context window.
- Tasks that don’t benefit from structured memory (e.g., trivial single-sentence QA).
- Strict real-time systems where variable routing decisions are unacceptable.
Open questions:
- Adaptive module sets: Can the router decide not just the tier, but whether to skip or add modules dynamically while keeping stability?
- Richer cost models: Beyond tokens and prices, can we incorporate latency SLOs, GPU/CPU constraints, or carbon budgets directly?
- Transfer and generalization: How far can a router trained on one backbone or domain generalize to many others without retraining?
- Safety/Privacy: How should routing react to sensitive memories or access controls while still optimizing utility?
- Better judges: Can training with more reliable automatic judges (or reward models) further improve routing choices?
06 Conclusion & Future Work
Three-sentence summary: BudgetMem turns runtime agent memory into a controllable, learnable process by giving each memory module LOW/MID/HIGH budget tiers and training a lightweight router to pick tiers per query. It compares three complementary ways to realize tiers—implementation, reasoning, and capacity—and shows that, across long-memory benchmarks, BudgetMem outperforms strong baselines and delivers smoother accuracy–cost frontiers. Stability tricks like reward-scale alignment help the router avoid trivial cheap policies and achieve dependable trade-offs.
Main achievement: Making performance–cost control for runtime memory explicit, modular, and learnable, so compute is spent where it most improves the final answer.
Future directions: Expand to adaptive module selection, integrate richer real-world cost/latency constraints, improve transfer across backbones and domains, and explore tighter synergy with retrieval strategies. Also, stronger or learned judges could sharpen the reward signal for even better routing.
Why remember this: It reframes memory as a per-query budgeting problem you can actually steer—like choosing HD only when the scene is crucial—so long-context agents become both smarter and more affordable in practice.
Practical Applications
- Customer support bots that spend more memory compute on tricky cases but keep it low for simple FAQs.
- Personal assistants that remember multi-week plans while staying within a monthly token budget.
- Research copilots that boost temporal reasoning tiers for timeline-heavy questions and save cost elsewhere.
- Healthcare triage assistants that prioritize higher tiers for safety-critical questions and lower tiers for routine follow-ups.
- Education tutors that use deeper summarization only for multi-step problems, keeping costs low on drills.
- Enterprise knowledge search that scales across departments by routing heavy tiers only when evidence is ambiguous.
- Legal or policy analysts that increase capacity tiers when cross-document consistency is required.
- Data labeling/review workflows that apply reflection-tier summaries for disputed items but not for straightforward ones.
- Call-center QA that balances latency and accuracy by adjusting λ during peak vs. off-peak hours.