Toward Efficient Agents: Memory, Tool Learning, and Planning
Key Summary
- This survey explains how to make AI agents not just smart, but also efficient with their time, memory, and tool use.
- It focuses on three big parts of an agent: memory (what to keep), tool learning (what to use), and planning (what to do next).
- The paper shows that saving tokens, cutting latency, and reducing extra steps can be as important as getting the right answer.
- It organizes many recent methods that compress context, pick the right tools at the right time, and search plans more carefully.
- Efficiency is measured by either doing the same job with less cost or doing a better job with the same cost.
- The authors suggest thinking in terms of a Pareto frontier: the best trade-offs between quality and cost.
- Benchmarks and metrics for memory, tool learning, and planning help compare approaches fairly.
- Key tricks include context compression, hierarchical retrieval, cost-aware rewards, and adaptive policy optimization.
- The survey highlights open challenges like balancing compression with accuracy and choosing between online and offline memory updates.
Why This Research Matters
Efficient agents make powerful AI usable in real life by cutting costs and response times without sacrificing correctness. This means customer support bots can help more people faster, coding assistants can iterate quickly, and research tools can stay within tight budgets. Schools, startups, and nonprofits benefit because they no longer need giant servers to run capable agents. Lower costs also reduce environmental impact by saving compute. Finally, clear efficiency metrics help the community build fair comparisons and push the Pareto frontier forward.
Detailed Explanation
01 Background & Problem Definition
You know how when you do a big school project, you keep notes, use tools like rulers or calculators, and make a plan so you finish on time? AI agents have to do the same three things—remember, use tools, and plan—but inside a computer. As agents took on longer and trickier tasks than plain chatbots, people noticed they often took too long, read too much, and called too many tools. That made them expensive and slow.
🍞 Top Bread (Hook): Imagine packing a suitcase for a long trip—you can’t bring everything, so you choose what matters most. 🥬 The Concept (Memory): Memory is where an agent keeps useful information to reuse later. How it works:
- Collect important pieces from the conversation or documents.
- Compress or organize them (like summaries, graphs, or short notes).
- Retrieve only the parts needed for the next step. Why it matters: Without memory, the agent rereads mountains of text every time, wasting tokens and time. 🍞 Bottom Bread (Anchor): Like keeping a neat travel checklist so you don’t repack your whole closet each morning.
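To make the collect → compress → retrieve loop above concrete, here is a minimal Python sketch. The `MemoryStore` class, the word-count gisting, and the keyword-overlap scoring are illustrative assumptions, not the survey's implementation; a real agent would use an LLM or embedding model for both steps.

```python
# Minimal memory sketch (assumed names): collect a compressed gist of each
# document, then retrieve only the few notes that best match the query.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    notes: list[str] = field(default_factory=list)

    def collect(self, text: str, max_words: int = 30) -> None:
        """Store a short gist instead of the full text (stand-in for an LLM summary)."""
        self.notes.append(" ".join(text.split()[:max_words]))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return only the k notes with the most word overlap with the query."""
        q = set(query.lower().split())
        return sorted(self.notes,
                      key=lambda n: len(q & set(n.lower().split())),
                      reverse=True)[:k]

memory = MemoryStore()
memory.collect("The customer reported a billing error on invoice 1042 last week.")
memory.collect("Shipping to Canada usually takes five to seven business days.")
print(memory.retrieve("What was the billing issue?"))  # the billing note ranks first
```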
🍞 Top Bread (Hook): Think of a Swiss Army knife—you pull out the right tool for the job. 🥬 The Concept (Tool Learning): Tool learning is how an agent picks and uses outside helpers—like search, calculators, or APIs. How it works:
- Select candidate tools related to the question.
- Fill in the right parameters (what to search, what to calculate).
- Execute and read the results. Why it matters: Without smart tool use, agents guess or over-call tools, making everything slower and pricier. 🍞 Bottom Bread (Anchor): Like choosing a calculator instead of doing huge multiplication in your head.
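Below is a toy sketch of that select → fill parameters → execute loop. The two-tool registry, the keyword routing, and the regex parameter filling are assumptions made for illustration; real agents generate tool calls with the LLM against real APIs.

```python
# Toy tool-use loop (assumed tools and routing rules, for illustration only).
import re

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy arithmetic; never eval untrusted input
    "search": lambda query: f"[top results for: {query}]",             # stand-in for a real search API
}

def select_tool(question: str) -> str:
    """Step 1: pick a candidate tool from surface features of the question."""
    return "calculator" if re.search(r"\d", question) else "search"

def fill_parameters(question: str, tool_name: str) -> str:
    """Step 2: turn the question into the tool's argument."""
    if tool_name == "calculator":
        return "".join(re.findall(r"[\d+\-*/(). ]", question)).strip()
    return question.rstrip("?")

def call_tool(question: str) -> str:
    """Step 3: execute the chosen tool and return its observation."""
    name = select_tool(question)
    return TOOLS[name](fill_parameters(question, name))

print(call_tool("What is 12 * 37?"))   # routes to the calculator -> "444"
print(call_tool("Who won the race?"))  # routes to search
```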
🍞 Top Bread (Hook): Before you bake cookies, you follow a recipe so you don’t waste ingredients. 🥬 The Concept (Planning): Planning is making a step-by-step path to the goal. How it works:
- Break the task into smaller steps.
- Decide the order and who or what does each part.
- Adjust if new info appears. Why it matters: Without planning, an agent retries randomly, uses more tokens, and takes longer. 🍞 Bottom Bread (Anchor): Like listing ingredients, preheating the oven, then mixing and baking in order.
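A stripped-down version of that decompose → order → adjust loop might look like the sketch below; the fixed three-step plan and the single-retry rule are invented placeholders for what an LLM planner would actually decide.

```python
# Plan -> execute -> adjust, with placeholder step logic (assumed, not from the survey).
def make_plan(task: str) -> list[str]:
    """Decompose the task into a short, ordered list of sub-steps."""
    return [f"gather facts for: {task}",
            f"draft an answer for: {task}",
            f"verify the answer for: {task}"]

def execute(step: str) -> bool:
    print("running:", step)
    return True  # stand-in for calling the model or a tool

def run(task: str, max_retries: int = 1) -> None:
    for step in make_plan(task):
        tries = 0
        while not execute(step) and tries < max_retries:
            tries += 1  # adjust: retry a failed step once before moving on

run("compare two laptops under 900 dollars")
```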
🍞 Top Bread (Hook): If you have a time limit on a test, you must work both smart and fast. 🥬 The Concept (Efficiency): Efficiency means getting good results while using as little cost as possible. How it works:
- Measure costs (tokens, latency, money).
- Cut waste (compress, retrieve less, call fewer tools).
- Keep accuracy high while shrinking expense. Why it matters: Without efficiency, great answers arrive too late or cost too much to be practical. 🍞 Bottom Bread (Anchor): Like finishing homework correctly before dinner, not at midnight.
🍞 Top Bread (Hook): When shopping with limited pocket money, you compare prices. 🥬 The Concept (Cost): Cost is the resources used—tokens, time (latency), and compute dollars. How it works:
- Count how many tokens you read/write.
- Track waiting time for tools.
- Add compute and API bills. Why it matters: Ignoring cost makes systems slow and unaffordable. 🍞 Bottom Bread (Anchor): Like noticing that express shipping might not be worth it for a small item.
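The three costs above are easy to track with a small ledger like the sketch below; the per-token price and the example numbers are made up for illustration.

```python
# A tiny cost ledger for one agent run (illustrative prices and numbers).
from dataclasses import dataclass

@dataclass
class Cost:
    tokens: int = 0
    latency_s: float = 0.0
    dollars: float = 0.0

    def add_llm_call(self, prompt_tokens: int, output_tokens: int,
                     seconds: float, usd_per_1k_tokens: float = 0.002) -> None:
        total = prompt_tokens + output_tokens
        self.tokens += total
        self.latency_s += seconds
        self.dollars += total / 1000 * usd_per_1k_tokens

run_cost = Cost()
run_cost.add_llm_call(prompt_tokens=1200, output_tokens=300, seconds=2.4)
run_cost.add_llm_call(prompt_tokens=400, output_tokens=150, seconds=1.1)
print(run_cost)  # roughly: tokens=2050, latency_s=3.5, dollars≈0.004
```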
🍞 Top Bread (Hook): A trophy is great, but only if you actually won the game. 🥬 The Concept (Effectiveness): Effectiveness is how well the agent solves the task. How it works:
- Check if the answer is correct.
- See if the steps are valid.
- Compare to baselines. Why it matters: A cheap system that fails isn’t truly efficient. 🍞 Bottom Bread (Anchor): Like a fast quiz taker who still gets the answers right.
The world before this survey mostly chased higher accuracy by adding more steps, longer context, and many tool calls. That helped correctness but blew up costs. People tried just shrinking models or speeding up inference, but agents still looped through memory, tools, and plans over and over. The missing piece was a system-level view: treat memory, tools, and planning as levers to control cost while keeping quality. The real stakes are big: agents that help with coding, research, shopping, or scheduling need to be affordable and responsive so everyone can use them, not just big labs with giant budgets.
02 Core Idea
The "Aha!" in one sentence: Optimize the whole agent loop—memory, tool learning, and planning—so it solves tasks with fewer tokens, fewer tool calls, and fewer steps, without losing accuracy.
Three analogies:
- Backpacking: Pack only essentials (compressed memory), pick the right gadgets (tool selection), and map an efficient route (planning) to finish the hike faster with less weight.
- Restaurant kitchen: Prep stations (memory) keep only needed ingredients, chefs choose the right utensil (tools), and the head chef coordinates timing (planning) to serve perfect dishes on time.
- Sports playbook: Short cues (memory summaries), the right drills (tools), and a smart play sequence (planning) score more with less energy.
Before vs After:
- Before: Agents stuffed giant histories into prompts, called tools repeatedly, and planned by trial-and-error.
- After: Agents store compressed summaries, call just the necessary tools (often in parallel), and use budget-aware planning and search to avoid waste.
Why it works (intuition without equations):
- Most of what we read isn’t needed every step; compressing and selectively retrieving memory cuts repeated reading.
- Not every question needs every tool; confidence and cost signals discourage extra calls.
- Trees of actions explode in size; guided search prunes dead branches early, saving steps.
- Training with rewards that include cost teaches the agent to solve the same tasks with less.
🍞 Top Bread (Hook): When you compare bikes, you check both speed and price to choose the best deal. 🥬 The Concept (Efficiency Metric): An efficiency metric measures how well an agent balances performance with cost. How it works:
- Pick quality scores (accuracy) and costs (tokens, latency, dollars).
- Compare methods under a fixed budget or at equal quality.
- Prefer ones that deliver more for less. Why it matters: Without shared metrics, we can’t tell if a method is truly efficient. 🍞 Bottom Bread (Anchor): Like choosing sneakers that last longer and cost less per mile.
🍞 Top Bread (Hook): Finding the best seat in a theater means trading off view and price. 🥬 The Concept (Pareto Frontier): The Pareto frontier is the set of best trade-offs between effectiveness and cost—improving one would worsen the other. How it works:
- Plot methods by quality and cost.
- Keep only those that aren’t beaten on both at once.
- Move the frontier outward with better designs. Why it matters: It shows which methods are truly state-of-the-art in bang-for-buck. 🍞 Bottom Bread (Anchor): Like picking the seat that offers the best view for the money among all options.
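That "keep only those that aren't beaten on both at once" rule is simple enough to compute directly, as the sketch below shows; the method names and accuracy/cost numbers are invented for illustration.

```python
# Keep only Pareto-optimal methods: nothing else is at least as accurate AND
# at least as cheap (and strictly better on one axis). Numbers are invented.
def pareto_frontier(methods: dict[str, tuple[float, float]]) -> list[str]:
    """methods maps name -> (accuracy, cost); higher accuracy and lower cost are better."""
    frontier = []
    for name, (acc, cost) in methods.items():
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for other, (a, c) in methods.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

methods = {
    "full-context baseline": (0.82, 120.0),  # (accuracy, cost in k tokens)
    "compressed memory":     (0.81, 55.0),
    "no-memory":             (0.60, 20.0),
    "over-compressed":       (0.55, 25.0),   # beaten by "no-memory" on both axes
}
print(pareto_frontier(methods))
# -> ['full-context baseline', 'compressed memory', 'no-memory']
```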
Building blocks (the simple pieces):
- Memory that shrinks text into summaries or latent states and retrieves only what’s needed.
- Tool learning that first narrows candidates, then fills parameters, and uses cost-aware rules to avoid extra calls.
- Planning that decomposes tasks, applies guided search, and spends more compute only where it pays off.
- Training signals and benchmarks that reward both correct answers and resource savings.
03 Methodology
At a high level: Input → Memory (select and compress context) → Planning (decide steps under a budget) → Tool Learning (pick and call minimal tools) → Observation (read results) → Output.
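The sketch below gives that pipeline a runnable, if toy, shape; every function body is a hypothetical stand-in for the real memory, planner, and tool components.

```python
# Toy Input -> Memory -> Planning -> Tools -> Output loop (all components are stand-ins).
notes: list[str] = []                               # memory store

def retrieve(query: str, k: int = 2) -> list[str]:  # Memory: a small, relevant slice
    return notes[-k:]

def decompose(task: str) -> list[str]:              # Planning: a short, ordered plan
    return [f"look up: {task}", f"answer: {task}"]

def call_tool(step: str) -> str:                    # Tool call: stand-in for an API
    return f"observation for ({step})"

def run_agent(task: str, max_steps: int = 4) -> str:
    for step in decompose(task)[:max_steps]:        # budget on the number of steps
        if step.startswith("look up"):              # call a tool only when needed
            notes.append(call_tool(step))           # fold observations back into memory
    return f"final answer using {retrieve(task)}"

print(run_agent("capital of France"))
```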
Step A: Memory (construct, manage, access)
- What happens: The agent turns long histories into short, organized notes or latent states; it updates or forgets items by rules or small LLM edits; it retrieves only a tiny, relevant slice per query.
- Why this step exists: Without it, every turn reprocesses the full history, causing token and latency explosions.
- Example: From a 50-page doc, it keeps page gists and a few key facts; when asked a question, it fetches just 2–3 summaries.
🍞 Top Bread (Hook): Like turning a long book into a study guide with highlights. 🥬 The Concept (Context Compression): Context compression shortens long text into compact summaries or neural states. How it works:
- Extract key facts or make a gist.
- Replace long spans with short notes or latent memory tokens.
- Update notes when new info appears. Why it matters: Without compression, prompts overflow and costs soar. 🍞 Bottom Bread (Anchor): A one-page cheat sheet instead of rereading the whole textbook.
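A minimal rolling-compression sketch is shown below. The sentence-clipping `gist` function is a cheap stand-in for an LLM summarizer, and the five-note budget is an arbitrary assumption.

```python
# Replace long incoming text with short gists and keep only a bounded set of notes.
def gist(text: str, max_sentences: int = 2) -> str:
    """Keep the first few sentences (stand-in for a real summarizer)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

class CompressedContext:
    def __init__(self, max_notes: int = 5) -> None:
        self.gists: list[str] = []
        self.max_notes = max_notes

    def update(self, new_text: str) -> None:
        """Store a gist of the new text; drop the oldest notes if over budget."""
        self.gists.append(gist(new_text))
        self.gists = self.gists[-self.max_notes:]

    def prompt_block(self) -> str:
        return "\n".join(self.gists)

ctx = CompressedContext()
ctx.update("The user wants a laptop under 900 dollars. Battery life matters most. They travel weekly.")
ctx.update("They prefer a 14-inch screen. Color does not matter to them.")
print(ctx.prompt_block())  # two short gists instead of the full dialogue
```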
🍞 Top Bread (Hook): In a library, you first choose a shelf, then a book, then a page. 🥬 The Concept (Hierarchical Retrieval): Hierarchical retrieval finds info through layered indexes—from coarse summaries down to details. How it works:
- Search high-level summaries.
- Drill down into subtopics.
- Finally open the exact passage. Why it matters: Jumping straight to details is slow; layered lookup is faster and cheaper. 🍞 Bottom Bread (Anchor): Using table-of-contents → chapter → section to find a specific formula.
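Here is a tiny two-level version of that table-of-contents → section → passage lookup. The nested corpus and the word-overlap scoring are illustrative assumptions; production systems would use embeddings or summaries at each level.

```python
# Two-level retrieval: pick the best topic first, then search only inside it.
import re

CORPUS = {
    "billing": {
        "refunds":  "Refunds are issued within 5 business days to the original card.",
        "invoices": "Invoices are emailed monthly and can be downloaded as PDF.",
    },
    "shipping": {
        "international": "International orders ship in 7-10 days with tracking.",
        "domestic":      "Domestic orders ship in 2-3 days.",
    },
}

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a: str, b: str) -> int:
    return len(words(a) & words(b))

def retrieve(query: str) -> str:
    topic = max(CORPUS, key=lambda t: overlap(query, t + " " + " ".join(CORPUS[t].values())))
    passage = max(CORPUS[topic], key=lambda p: overlap(query, p + " " + CORPUS[topic][p]))
    return CORPUS[topic][passage]

print(retrieve("How long does an international order take to ship?"))
```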
Step B: Planning (adaptive control, search, decomposition)
- What happens: The agent decides when to think fast vs. slow, prunes action trees with cost-aware search, and splits big tasks into smaller ones it can route or parallelize.
- Why this step exists: Random trial-and-error wastes steps; structured planning avoids dead ends earlier.
- Example: To solve a data question, it first drafts a plan, then explores only promising branches, skipping obviously costly ones.
🍞 Top Bread (Hook): When practicing piano, you focus on the tricky bars instead of replaying the whole piece every time. 🥬 The Concept (Adaptive Policy Optimization): Adaptive policy optimization improves decisions by learning from what worked best, especially under budgets. How it works:
- Try different strategies.
- Score them by correctness and cost.
- Favor strategies that get correct answers with fewer resources. Why it matters: Without adapting, the agent keeps overspending on the same mistakes. 🍞 Bottom Bread (Anchor): Like adjusting your study plan to spend time on the hardest questions only.
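The try → score → favor loop can be mimicked with a tiny simulation like the one below. The three strategies, their success rates, token costs, and the reward weight are all invented numbers; a real system would learn this with RL rather than enumerate it.

```python
# Score candidate strategies by correctness minus a cost charge, then keep the best.
import random

STRATEGIES = {                          # name -> (success probability, tokens used)
    "always_call_tool": (0.92, 1200),
    "confidence_gated": (0.90, 500),
    "never_call_tool":  (0.70, 200),
}

def reward(correct: bool, tokens: int, cost_weight: float = 0.0005) -> float:
    """Full credit for a correct answer, minus a small charge per token."""
    return (1.0 if correct else 0.0) - cost_weight * tokens

def evaluate(trials: int = 1000) -> dict[str, float]:
    random.seed(0)
    return {
        name: sum(reward(random.random() < p, tokens) for _ in range(trials)) / trials
        for name, (p, tokens) in STRATEGIES.items()
    }

scores = evaluate()
print(max(scores, key=scores.get))  # the gated strategy wins once cost counts
```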
Step C: Tool Learning (selection, calling, tool-integrated reasoning)
- What happens: The agent picks a few candidate tools (retriever, classifier, or special “tool tokens”), fills in parameters, may call some in parallel, and only when needed.
- Why this step exists: Tools can be slow; minimizing and timing calls well keeps the whole system responsive.
- Example: Instead of calling search 20 times, it checks confidence and calls it twice, then runs code once to verify.
🍞 Top Bread (Hook): Like only asking a teacher for help when you’re truly stuck. 🥬 The Concept (RL-based Rewards): RL-based rewards give extra credit for solving tasks correctly with fewer tool calls and tokens. How it works:
- Define rewards for right answers.
- Subtract points for wasted calls or long prompts.
- Train the agent to pick cheaper-but-correct paths. Why it matters: Without cost in the reward, the agent learns to overuse tools. 🍞 Bottom Bread (Anchor): Getting bonus points for finishing a test accurately with time to spare.
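One plausible shape for such a reward is sketched below: full credit for a correct answer, minus small charges per tool call and per thousand prompt tokens. The penalty weights are assumptions, not values reported in the survey.

```python
# Cost-aware reward sketch (illustrative penalty weights).
def cost_aware_reward(correct: bool, tool_calls: int, prompt_tokens: int,
                      tool_penalty: float = 0.05, token_penalty: float = 0.02) -> float:
    base = 1.0 if correct else 0.0
    return base - tool_penalty * tool_calls - token_penalty * (prompt_tokens / 1000)

# Same correctness, very different grades once cost is part of the score:
print(cost_aware_reward(correct=True, tool_calls=2,  prompt_tokens=1500))   # ≈ 0.87
print(cost_aware_reward(correct=True, tool_calls=20, prompt_tokens=12000))  # ≈ -0.24
```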
Putting it together (the secret sauce):
- Shared budgets: Planning asks, “Do I really need memory expansion or a tool call now?” Memory and tools answer with compressed notes and minimal invocations.
- Parallelism when safe: Independent subtasks run at once (e.g., fetching the weather for several cities), shrinking latency.
- Caching and reuse: Successful plans and distilled insights are stored so next time is faster.
- Confidence gating: If the model is sure, it skips the tool; if unsure, it calls once and verifies.
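Two of these ingredients, confidence gating and safe parallelism, fit in a few lines, as the sketch below shows. The `fetch_weather` stub, the 0.8 threshold, and the confidence values are hypothetical.

```python
# Confidence gating: skip the tool when sure; parallel calls only for independent subtasks.
from concurrent.futures import ThreadPoolExecutor

def fetch_weather(city: str) -> str:
    return f"{city}: 21°C"             # stand-in for a real API call

def answer(question: str, confidence: float, threshold: float = 0.8) -> str:
    if confidence >= threshold:        # confident: skip the tool entirely
        return "answered from memory"
    return fetch_weather(question)     # unsure: pay for exactly one call

print(answer("weather in a city the model knows well", confidence=0.95))
print(answer("Lima", confidence=0.3))

cities = ["Paris", "Tokyo", "Lima"]    # independent subtasks -> safe to run at once
with ThreadPoolExecutor() as pool:
    print(list(pool.map(fetch_weather, cities)))
```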
Concrete micro-examples:
- Memory: A customer-support agent stores a compact user profile and last 3 issues; it avoids re-reading month-old chats.
- Planning: A shopping agent splits “compare laptops” into specs, price, and reviews, searches once per subtask, then merges.
- Tools: A math agent first tries mental math; if uncertain, it calls the calculator once, not repeatedly.
04 Experiments & Results
What did they test? Because this is a survey, the “tests” are the community’s benchmarks and metrics, focused on two complementary views of efficiency: (1) same quality for less cost, and (2) better quality at the same cost. Typical costs are tokens, latency, tool calls, and steps. Typical quality is accuracy or task success.
What was it compared against? Methods are compared to baselines that either (a) stuff long context without compression, (b) call tools freely without budgets, or (c) plan via unguided trial-and-error. Stronger systems combine compressed memory, selective tools, and budget-aware planning.
The scoreboard (with context):
- Memory: Hierarchical and compressed memory often reaches similar accuracy to full-context prompts but with far fewer tokens—like getting the same grade using half the study time.
- Tool learning: Confidence gating and cost-aware rewards reduce tool calls while keeping answers correct—like finishing the worksheet with fewer hints.
- Planning: Guided search (A*/MCTS variants) and task decomposition reach solutions with fewer explored branches—like taking the shortcut you know works instead of wandering.
- Multi-agent: Sparse topologies and protocol compression reduce chatter while maintaining or improving final answers—like a team using short hand signals instead of long speeches.
Surprising findings:
- Over-compression can hurt: Go too far and you drop key details; moderate compression tends to preserve accuracy better.
- Online vs. offline memory updates: Synchronous (online) updates adapt instantly but add latency; asynchronous (offline) consolidation lowers response time but adapts more slowly.
- Parallel tool calls help when tasks are independent; if dependencies exist, naive parallelism causes rework.
- Cost-aware rewards change behavior quickly: When cost is part of the grade, agents learn to skip unnecessary steps.
What success looks like in plain terms:
- A customer agent answers as accurately as before but loads a much shorter prompt and calls the database fewer times—like winning the same race with less fuel.
- A coding agent solves tasks with fewer interpreter runs because it plans and verifies smarter—like checking your work once instead of five times.
- A research agent retrieves only key documents, then summarizes—like citing the exact right sources instead of quoting everything.
How to read the results fairly:
- Always check both sides: the quality (accuracy/success) and the cost (tokens/latency/calls).
- Prefer methods on the Pareto frontier—better quality for equal cost or equal quality for less cost.
- Beware “cheap but wrong”: low cost without reliable answers isn’t true efficiency.
05 Discussion & Limitations
Limitations:
- Over-compression: If summaries get too short, accuracy drops.
- Retrieval noise: Bigger memories can surface irrelevant items unless retrieval is precise.
- Planning overhead: Search and decomposition add their own cost if not budgeted.
- Tool fragility: External APIs change or fail, and parallel calls need careful dependency checks.
Required resources:
- A capable base LLM plus memory storage (vector DB or graph store).
- Tool connectors (search, code sandbox, APIs) and, ideally, a scheduler for parallel calls.
- Optional RL or preference-training setup if using cost-aware rewards.
When NOT to use:
- Tiny, one-shot tasks that fit in a short prompt: overhead from memory, planning, or tools may outweigh benefits.
- Highly unstable tool ecosystems where API latency dwarfs any planning gains.
- Strict real-time constraints with no room for retrieval or search—choose a small, direct model instead.
Open questions:
- Optimal compression: How to auto-tune the compression level per task to avoid information loss?
- Unified budgeting: How to share a single budget across memory, tools, and planning so each knows what the others spend?
- Lifelong learning: How to keep memories fresh without growing noisy or huge?
- Robust tool use: How to adapt quickly when tools change, fail, or return inconsistent outputs?
- Multimodal efficiency: How to extend these ideas cleanly to images, audio, and video tools without ballooning costs?
06 Conclusion & Future Work
In three sentences: This survey reframes AI agents as budget-aware problem-solvers, not just correct ones, by optimizing memory, tool learning, and planning together. It catalogs practical techniques—compression, hierarchical retrieval, cost-aware rewards, guided search, and parallelization—that push the Pareto frontier of quality versus cost. It also standardizes how to measure efficiency so progress is comparable and meaningful.
Main achievement: A clear, system-level blueprint for building agents that keep accuracy high while shrinking tokens, latency, and tool calls across the entire loop.
Future directions:
- Smarter auto-tuning of compression and retrieval depth per query.
- Shared budgets that coordinate memory, tools, and planning in real time.
- Robust, generalizable tool use with minimal fine-tuning.
- Distilling multi-agent strengths into single agents for low-cost deployment.
Why remember this: Efficient agents make advanced AI practical—faster, cheaper, and fairer—so more people and products can benefit without needing giant compute bills.
Practical Applications
- Build chat assistants that summarize and cache user history so each reply uses fewer tokens.
- Deploy coding agents that gate calculator/tool use by confidence, calling tools only when uncertain.
- Add hierarchical retrieval to enterprise RAG so agents open summaries first and drill down only when needed.
- Introduce parallel tool calls for independent subtasks (e.g., fetching multiple APIs at once) to cut latency.
- Use cost-aware rewards in RL fine-tuning so agents learn to minimize extra tool calls and long prompts.
- Cache and reuse successful plans (plan templates) to avoid re-planning similar tasks.
- Adopt budget-aware planning that limits search depth and step counts based on task complexity.
- Switch to hybrid memory management: do small online updates and heavier consolidation offline to reduce latency.
- Instrument token, latency, and tool-call counters and track the Pareto frontier when comparing versions.
- Prune multi-agent communication graphs and use concise protocols (pseudocode) to curb context bloat.