Adaptation of Agentic AI
Key Summary
- This paper organizes how AI agents learn and improve into one simple map with four roads: A1, A2, T1, and T2.
- A1 means the agent learns from tool results (like code passing tests), while A2 means the agent learns from its own final answers.
- T1 means we improve the tools without touching the agent, and T2 means we improve the tools using feedback from a fixed agent.
- The framework helps builders pick the right strategy based on cost, flexibility, generalization, and modularity.
- Real systems often mix these roads (for example, better retrievers from T1, adaptive search tools from T2, and reasoning agents trained with A1).
- Reinforcement learning with verifiable rewards (RLVR) is a strong way to train agents using real, checkable feedback from the world.
- The survey shows examples across search, coding, theorem proving, memory tools, and multi-tool planning, and explains trade-offs.
- A key message is co-adaptation: the best systems grow by tuning both the agent and its tools over time.
- It also highlights open challenges like continual learning, safety, efficiency, and standardized evaluations.
- The paper offers a practical roadmap for building more reliable, capable, and general-purpose agentic AI.
Why This Research Matters
Better-adapted agentic AI means assistants that can actually complete complex tasks, not just talk about them. With a clear framework, teams can choose cheaper, faster strategies (like T1/T2) before touching expensive foundation models. Verifiable signals (tests, retrieval metrics, proof checkers) make training safer and more reliable, reducing hallucinations and brittle behaviors. This helps in daily life—stronger search, safer code help, clearer homework explanations—and in critical domains like healthcare, law, and science. The framework also supports long-term growth: you can keep upgrading tools and policies without breaking the system. In short, this roadmap makes building trustworthy, capable AI more practical and sustainable.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine a team project at school. One kid (the planner) decides what to do, some kids use special classroom tools (microscopes, calculators, browsers), and the whole team learns from what worked and what didn’t. That’s what modern AI teams (called agents and tools) are like.
🥬 The Concept 1 — Artificial Intelligence (AI)
- What it is: AI is software that tries to do smart tasks like understanding language, solving problems, or making plans.
- How it works: 1) It sees or reads input, 2) It thinks using patterns it learned, 3) It outputs an answer or action.
- Why it matters: Without AI, computers can’t flexibly help in messy, real-world tasks. 🍞 Anchor: When you ask a chatbot a question and it replies sensibly, that’s AI in action.
🥬 The Concept 2 — Machine Learning (ML)
- What it is: ML is how AI improves by learning from data instead of only following hand-written rules.
- How it works: 1) Show examples, 2) Compare predictions with the right answers, 3) Adjust to do better next time.
- Why it matters: Without ML, AI wouldn’t improve with experience. 🍞 Anchor: Like practicing math problems and getting better after checking your mistakes.
🥬 The Concept 3 — Basics of Reinforcement Learning (RL)
- What it is: RL teaches an AI to make sequences of decisions by rewarding good choices.
- How it works: 1) Try an action, 2) See the result, 3) Get a score (reward), 4) Do more of what earns higher scores.
- Why it matters: Without RL, long multi-step tasks (like searching, coding, or planning) are hard to master. 🍞 Anchor: Like a video game player who learns levels by trying, failing, and improving based on points.
🥬 The Concept 4 — Adaptive Systems
- What it is: Adaptive systems adjust their behavior to new tasks or environments.
- How it works: 1) Measure how well things went, 2) Update strategy or parts, 3) Try again and compare.
- Why it matters: Without adaptation, systems get stuck when the world changes. 🍞 Anchor: Changing study strategies after a quiz to score higher next time.
🥬 The Concept 5 — Agentic AI
- What it is: Agentic AI is an AI “worker” that can plan, think in steps, use tools (like search or code), and remember.
- How it works: 1) Understand the goal, 2) Plan steps, 3) Call tools, 4) Check results, 5) Continue until done.
- Why it matters: Without agents, LLMs would just answer once and stop, missing multi-step real-world tasks. 🍞 Anchor: A smart assistant that searches the web, runs code, and summarizes findings to finish a science report.
🥬 The Concept 6 — Adaptation in Agentic AI
- What it is: Adaptation changes either the agent or its tools so the whole system works better, more reliably, and in more places.
- How it works: 1) Choose what to improve (agent or tool), 2) Pick a learning signal (tool result or final answer), 3) Update using prompting, fine-tuning, or RL.
- Why it matters: Without adaptation, agents struggle with long plans, tool reliability, and new domains. 🍞 Anchor: Teaching the assistant to write better search queries, or upgrading the search tool so the assistant gets better documents.
🥬 The Concept 7 — Prompt Engineering
- What it is: A lightweight way to steer the agent by changing instructions and examples without touching its weights.
- How it works: 1) Write clearer goals and steps, 2) Give examples, 3) Add rules or checklists, 4) Test and tweak.
- Why it matters: Without good prompts, even strong models can behave inconsistently. 🍞 Anchor: Rewriting a question so a chatbot knows to use a calculator first, then explain.
🥬 The Concept 8 — Fine-Tuning
- What it is: Training the model’s weights on new data to specialize its behavior.
- How it works: 1) Collect examples, 2) Train on them (full or parameter-efficient like LoRA), 3) Validate and iterate.
- Why it matters: Without fine-tuning, agents often stay too general and miss domain tricks. 🍞 Anchor: Fine-tuning a model on medical Q&A so it handles clinical language correctly.
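To make the "parameter-efficient like LoRA" option concrete, here is a minimal, hypothetical sketch using the Hugging Face transformers and peft libraries; "gpt2" is only a small stand-in checkpoint, and the rank, target modules, and dropout are illustrative choices, not settings from the paper.

```python
# Minimal LoRA fine-tuning sketch with `transformers` and `peft`.
# "gpt2" is a small stand-in base model; hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")     # stand-in base model

lora_cfg = LoraConfig(
    r=16,                        # rank of the small adapter matrices that get trained
    lora_alpha=32,
    target_modules=["c_attn"],   # GPT-2's fused attention projection; model-dependent
    fan_in_fan_out=True,         # needed because GPT-2 uses Conv1D layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # typically well under 1% of all weights are trainable
# From here, train `model` on domain examples (e.g., medical Q&A) with a standard trainer,
# then validate that the specialization didn't break general behavior.
```

The design point: only the small adapter matrices are trained, so the base model keeps most of its general skills while it picks up the new domain.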
The world before: Foundation models were great at one-shot answers but not as good at long, tool-using missions. They could write, summarize, and chat, but they stumbled when they had to plan, search, run code, or remember across steps.
The problem: As tasks got harder (software help, research, scientific workflows), systems needed to both think and act. They needed to pick tools, format calls correctly, judge results, and adjust plans. Off-the-shelf models struggled with unreliable tool use, shallow planning, and weak generalization.
Failed attempts: Just prompting often wasn’t enough for reliable multi-step tool use. Blind fine-tuning could help one task but harm another (catastrophic forgetting). Training only the agent was expensive; training only the tools sometimes clashed with the agent’s style.
The gap: We lacked a clear map of adaptation choices—who to train (agent vs. tools), and which signal to trust (tool outcomes vs. final answers). Builders needed guidance to mix strategies without guesswork.
What this paper adds: A simple, unified framework: A1 and A2 (train the agent using tool signals or its own answer signals), and T1 and T2 (train tools on their own or with agent supervision). This map clarifies trade-offs and shows how real systems combine paths.
Real stakes: This matters for everyday things—better search assistants, safer code helpers, clearer homework explainers, and faster scientific discovery. With the right adaptation, agents become more helpful, reliable, and ready for the real world.
02 Core Idea
🍞 Hook: You know how a great team improves two ways—players train themselves, and coaches upgrade the equipment? The magic happens when both get better in sync.
🥬 The Concept 1 — The Aha! Moment (in one sentence)
- What it is: The key insight is that all the ways we improve agentic AI fit into four clear boxes—A1, A2, T1, T2—based on who changes (agent or tool) and what signal teaches them (tool results or final answers).
- How it works: 1) Decide if you’ll update the agent or the tools, 2) Pick the teaching signal (from tool execution or final outputs), 3) Train with SFT, preferences, or RLVR, 4) Mix boxes when needed.
- Why it matters: Without this map, we waste effort, overfit, or miss cheap wins. 🍞 Anchor: Like choosing between practicing your swings (agent) or sharpening the bat (tool), and deciding if you’ll learn from the hit’s distance (tool result) or the final score (answer correctness).
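As a tiny illustration (our own sketch, not code from the survey), the two axes can be written down as a lookup from design choices to box labels:

```python
# Our own toy sketch of the four-box map: who gets updated x what signal teaches.

def adaptation_box(who_learns: str, what_teaches: str) -> str:
    """Map (who gets updated, what signal teaches) to A1/A2/T1/T2."""
    table = {
        ("agent", "tool_result"):  "A1",  # agent learns from verifiable tool outcomes
        ("agent", "final_answer"): "A2",  # agent learns from end-to-end answer quality
        ("tool",  "tool_result"):  "T1",  # tool trained on its own signal; agent untouched
        ("tool",  "final_answer"): "T2",  # tool tuned using a fixed agent's output quality
    }
    return table[(who_learns, what_teaches)]

print(adaptation_box("agent", "tool_result"))   # A1: e.g., reward the agent on unit tests
print(adaptation_box("tool", "final_answer"))   # T2: e.g., tune a retriever on agent answers
```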
Multiple Analogies (3 ways)
- School project: A1/A2 is the student improving study habits from quizzes (A2) or from how well the calculator/spreadsheet worked (A1). T1/T2 is the teacher upgrading classroom tools for everyone (T1) or customizing tools to how this student learns (T2).
- Kitchen: A1/A2 is the chef improving cooking steps based on taste of the dish (A2) or how well the oven performed (A1). T1/T2 is buying better pans for any chef (T1) vs. picking pans exactly for this chef’s style (T2).
- Video game: A1/A2 is the player leveling up skill by looking at final mission grades (A2) or weapon test damage (A1). T1/T2 is unlocking general gear for anyone (T1) or gear tuned to one player’s strategy (T2).
Before vs. After
- Before: Adaptation was a blur—lots of tricks with prompts, fine-tunes, and tools, but no clean way to choose.
- After: Four boxes make the design space obvious. You can pick fast, cheap tool upgrades (T1), agent-focused gains (A1/A2), or agent-guided tool tuning (T2). Teams can now mix and match with intention.
Why it works (intuition, no equations)
- Separation of concerns: Decide who learns (agent vs. tool) and who judges (tool results vs. final answer). This keeps goals clear and avoids mixed signals.
- Verifiable signals: Tool execution gives checkable feedback (tests pass, query recall goes up). Final answers give end-to-end guidance (did the solution work?).
- Modularity: Tools can improve independently and be swapped (T1/T2), avoiding costly agent retraining.
- Stabilization: Using the right signal reduces confusion—e.g., if the problem is bad search queries, train on retrieval scores (A1/T2) instead of only end answers (A2).
🥬 The Concept 2 — Agent Adaptation
- What it is: Changing the agent’s policy so it plans, reasons, or calls tools better.
- How it works: 1) Choose A1 (learn from tool outcomes) or A2 (learn from final answers), 2) Train with SFT, preferences, or RLVR, 3) Validate on tasks.
- Why it matters: Without agent adaptation, even good tools get used poorly. 🍞 Anchor: Teaching a student to write better search queries and to check code outputs before answering.
🥬 The Concept 3 — Tool Adaptation
- What it is: Changing the tools so they give the agent exactly what it needs.
- How it works: 1) Choose T1 (train tools independently) or T2 (train tools using the agent’s feedback), 2) Evaluate if the agent performs better with the updated tool, 3) Iterate.
- Why it matters: Without tool adaptation, the agent’s environment stays weak or mismatched. 🍞 Anchor: Upgrading a search engine so it returns better sources that your assistant can actually use.
🥬 The Concept 4 — A1: Tool-Execution–Signaled Agent Adaptation
- What it is: The agent learns from verifiable tool results (code tests, retrieval scores, API outcomes).
- How it works: 1) Agent calls a tool, 2) Tool runs and returns a result, 3) Score is computed from that result, 4) The agent updates to get higher scores.
- Why it matters: Without direct tool feedback, agents may look good on paper but fail when tools run. 🍞 Anchor: If your program passes more tests after training, you learned the right coding moves.
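Here is a toy sketch of what an A1 signal can look like for coding: the reward is simply the fraction of unit tests the agent's code passes. The `solve` function name and the bare `exec` call are illustrative; a real pipeline would run code inside a locked-down sandbox with time and memory limits.

```python
# Toy A1 reward: score the agent's code by the fraction of hidden unit tests it passes.

def a1_reward(agent_code: str, tests: list[tuple[object, object]]) -> float:
    """Fraction of (input, expected_output) test cases the agent-written `solve` passes."""
    namespace: dict = {}
    try:
        exec(agent_code, namespace)      # the agent is expected to define solve(x)
        solve = namespace["solve"]
    except Exception:
        return 0.0                       # unrunnable code earns zero reward
    passed = 0
    for arg, expected in tests:
        try:
            passed += int(solve(arg) == expected)
        except Exception:
            pass                         # a crashing test case simply doesn't count
    return passed / len(tests)

print(a1_reward("def solve(x):\n    return x * x", [(2, 4), (3, 9)]))   # 1.0: both tests pass
print(a1_reward("def solve(x):\n    return x + x", [(2, 4), (3, 9)]))   # 0.5: only one passes
```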
🥬 The Concept 5 — A2: Agent-Output–Signaled Agent Adaptation
- What it is: The agent learns from the quality of its own final answers (with or without tools).
- How it works: 1) Produce a final answer, 2) Check correctness or preference, 3) Update policy to produce better end results.
- Why it matters: Without end-to-end signals, agents can optimize the middle steps but still miss the goal. 🍞 Anchor: Getting graded on the full report, not just the quality of a single chart.
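And a matching toy sketch of an A2 signal: only the final answer is scored, here with a simple normalized exact-match check standing in for an LLM judge or preference model.

```python
# Toy A2 reward: the whole trajectory is judged only by the final answer.

import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def a2_reward(final_answer: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference after light normalization, else 0.0."""
    return 1.0 if normalize(final_answer) == normalize(gold_answer) else 0.0

print(a2_reward("  Paris ", "paris"))   # 1.0: the end result is right, however it got there
print(a2_reward("Lyon", "paris"))       # 0.0: good intermediate steps can't rescue a wrong answer
```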
🥬 The Concept 6 — T1: Agent-Agnostic Tool Adaptation
- What it is: Train tools independently so any (even closed) agent can use them.
- How it works: 1) Train a retriever, planner, or model on broad data, 2) Plug into the agent, 3) No agent changes needed.
- Why it matters: Without T1, teams must always retrain the big agent—slow and costly. 🍞 Anchor: A universal remote that works out-of-the-box with many TVs.
🥬 The Concept 7 — T2: Agent-Supervised Tool Adaptation
- What it is: Tune tools using signals from a fixed agent’s outputs so tools match the agent’s style.
- How it works: 1) Agent tries a task with the tool, 2) Judge final answer quality, 3) Update the tool to better support that agent.
- Why it matters: Without T2, tools might be strong in general but not helpful to this specific agent. 🍞 Anchor: A coach adjusts training drills after watching how a particular player performs.
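A minimal toy of the T2 idea (assumed interfaces, not the survey's code): the frozen agent never changes, while a tiny reranking tool adjusts its weights whenever a change helps that agent answer correctly.

```python
# Toy T2 loop: the agent stays frozen; only the tool (a reranker with one trainable
# weight per document) changes, and its teaching signal is whether the fixed agent
# answers correctly when given the tool's top result. All names are illustrative.

import random

corpus = {"d1": "Bananas are yellow.", "d2": "The capital of France is Paris."}
tasks = [("What is the capital of France?", "Paris")]
weights = {doc_id: 0.0 for doc_id in corpus}      # the tool's trainable parameters

def frozen_agent(question: str, doc: str) -> str:
    """Stand-in for a fixed LLM: it can only answer if the needed fact is in the document."""
    return "Paris" if "Paris" in doc else "unknown"

def agent_success(w) -> float:
    top_doc = corpus[max(w, key=w.get)]           # the tool decides what the agent reads
    return sum(frozen_agent(q, top_doc) == gold for q, gold in tasks) / len(tasks)

best = agent_success(weights)
for _ in range(100):                              # simple hill climbing on the tool only
    doc_id = random.choice(list(weights))
    old = weights[doc_id]
    weights[doc_id] = old + random.uniform(-1, 1) # propose a tweak to the tool
    score = agent_success(weights)
    if score >= best:
        best = score                              # keep changes the frozen agent benefits from
    else:
        weights[doc_id] = old                     # revert changes that hurt the agent

print(weights, best)                              # the document the agent needs rises to the top
```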
Building blocks
- Clear goals (what to optimize), clean signals (what to trust), stable training (SFT/Preferences/RLVR), and modular designs (swap tools without breaking the agent). Put together, they make complex agent systems simpler, safer, and stronger.
03 Methodology
At a high level: Input → Choose Target (Agent or Tool) → Choose Signal (Tool Result or Final Answer) → Pick a Training Method (Prompting/SFT/Preferences/RLVR) → Train/Validate → Deploy and Iterate.
Step-by-step (like a recipe)
- Define the task and inputs
  - What happens: Write down what the agent must do (e.g., answer multi-hop questions, fix code, plan a research session) and what tools are available (search, code, SQL, browser, memory).
  - Why it exists: Without a clear task-tool map, you can’t pick the right adaptation box.
  - Example: A deep-research assistant must search the web, open pages, summarize, and cite.
- Decide what to adapt: agent or tool
  - What happens: If the agent’s reasoning and tool-calling are weak, choose Agent Adaptation (A1 or A2). If tools are misaligned or low-quality, choose Tool Adaptation (T1 or T2).
  - Why it exists: Updating everything wastes compute; targeting the bottleneck is efficient.
  - Example: If search results are irrelevant, first try T1 (better retriever) or T2 (tune retriever with agent feedback) before retraining the big agent.
- Choose the learning signal: tool result (A1) or final answer (A2)
  - What happens: A1 trusts verifiable tool outcomes (tests passed, recall@K). A2 trusts end-to-end correctness or preferences for the final answer.
  - Why it exists: The right signal stabilizes learning and reduces confusion.
  - Example: For coding, use A1 (unit tests); for open QA, use A2 (exact-answer match or judge scores).
- Pick a training method
  - What happens: Select one: Prompt Engineering (fast steering), SFT (imitate demos), Preference Learning (DPO-like), or RLVR (explore using real rewards).
  - Why it exists: Methods differ in cost, data needs, and generalization.
  - Example: Start with SFT on expert traces; upgrade to RLVR when you can run tools safely and repeatedly (a toy sketch of the RLVR idea follows this step).
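As a cartoon of the "explore using real rewards" option above (not an LLM trainer), the sketch below runs a plain REINFORCE update on a softmax policy over three query-writing strategies; the strategies and their success rates are made-up assumptions, and the reward is a verifiable outcome (did retrieval succeed or not).

```python
# Toy policy-gradient loop with a verifiable reward. A real RLVR setup would update
# an LLM policy; here a 3-way softmax stands in so the mechanics are visible.

import math, random

strategies = ["copy question verbatim", "add key entities", "add entities + synonyms"]
true_success = {0: 0.2, 1: 0.6, 2: 0.9}    # assumed chance each strategy retrieves the answer
logits = [0.0, 0.0, 0.0]
lr, baseline = 0.5, 0.0

def sample(logits):
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

for step in range(2000):
    action, probs = sample(logits)
    reward = 1.0 if random.random() < true_success[action] else 0.0   # verifiable outcome
    baseline = 0.95 * baseline + 0.05 * reward                        # running-average baseline
    for i in range(len(logits)):                                      # REINFORCE update
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * (reward - baseline) * grad

print("learned preferences:", [round(l, 2) for l in logits])  # highest logit = best strategy
```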
- Structure the pipeline
  - What happens: Design how the agent interacts with tools: single call, multi-turn ReAct, or plan-execute-check cycles; decide how memory is written/read.
  - Why it exists: Without a sensible loop, signals get noisy and credit assignment breaks.
  - Example: ReAct for search: think → search → read → think → answer (see the loop sketch after this step).
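A minimal ReAct-style loop might look like the sketch below; `llm_think` and `run_tool` are hypothetical stand-ins for the policy model and the tool layer, stubbed here so the example runs end to end.

```python
# Minimal ReAct-style control loop: think -> act with a tool -> observe -> repeat.
# The "LLM" and the "search tool" are toy stubs; only the loop shape matters.

def llm_think(transcript: str):
    # Stub policy: search once, then finish with whatever the observation said.
    if "Observation:" in transcript:
        answer = transcript.rsplit("Observation:", 1)[1].strip()
        return "I have what I need", "finish", answer
    return "I should look this up", "search", "capital of France"

def run_tool(action: str, arg: str) -> str:
    return "Paris" if action == "search" else ""            # toy search tool

def react_loop(question: str, max_turns: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        thought, action, arg = llm_think(transcript)         # think
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "finish":
            return arg                                       # done: return the final answer
        observation = run_tool(action, arg)                  # act
        transcript += f"Observation: {observation}\n"        # observe, then think again
    return "No answer within the turn budget."

print(react_loop("What is the capital of France?"))          # -> Paris
```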
- Train with guardrails
  - What happens: Add format checkers, sandboxes, rate limits, and retry logic. Regularize with KL or reference policies to avoid drifting.
  - Why it exists: Prevents unsafe calls, prompt injection, and reward hacking.
  - Example: Code RL uses a sandbox and verified test suites; search RL filters unsafe URLs (a guardrail wrapper is sketched below).
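One possible guardrail wrapper (the tool names, limits, and allow-list are illustrative assumptions): every tool call gets an allow-list check, a timeout, and capped retries before its result reaches the training loop.

```python
# Sketch of a guarded tool call: allow-list, timeout, and bounded retries.

import concurrent.futures

ALLOWED_TOOLS = {"search", "run_tests", "read_page"}
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def guarded_call(tool_name, tool_fn, *args, timeout_s=10.0, max_retries=2):
    """Run a tool with an allow-list check, a timeout, and capped retries."""
    if tool_name not in ALLOWED_TOOLS:
        return {"ok": False, "error": f"tool '{tool_name}' is not permitted"}
    last_error = "no attempts made"
    for attempt in range(max_retries + 1):
        future = _pool.submit(tool_fn, *args)
        try:
            return {"ok": True, "result": future.result(timeout=timeout_s)}
        except concurrent.futures.TimeoutError:
            last_error = f"timed out after {timeout_s}s (attempt {attempt + 1})"
            # A real system would run tools in a killable, sandboxed process.
        except Exception as exc:           # the tool crashed: record the error and maybe retry
            last_error = repr(exc)
    return {"ok": False, "error": last_error}

print(guarded_call("search", lambda q: f"results for {q!r}", "agentic AI"))
print(guarded_call("delete_files", lambda: None))    # blocked by the allow-list
```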
- Evaluate with the right metrics
  - What happens: Match metrics to the signal: tool metrics for A1 (pass rate, recall@K), answer metrics for A2 (EM/F1, judge scores). Also track latency, cost, robustness, and safety.
  - Why it exists: You get what you measure; wrong metrics drive wrong behavior.
  - Example: DeepRetrieval tracks retrieval recall and downstream QA accuracy; ReTool tracks final math correctness after code execution (two such metrics are sketched below).
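Two of the metrics named above, sketched in code: recall@K as a tool-side (A1-style) metric and token-level F1 as an answer-side (A2-style) metric. These are standard definitions written out for illustration, not the evaluation code of any cited system.

```python
# Tool-side metric (recall@K) and answer-side metric (token F1), side by side.

def recall_at_k(retrieved_ids, relevant_ids, k=5) -> float:
    """Share of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common end-answer metric for open QA."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))    # 0.5: found one of two relevant docs
print(round(token_f1("the capital is Paris", "Paris"), 2))   # 0.4: partial credit for overlap
```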
- Iterate or switch boxes
  - What happens: If A2 plateaus, try T2 to make tools feed the agent better; if T1 stalls, try A1 so the agent learns to call tools properly.
  - Why it exists: Real systems need co-adaptation across components.
  - Example: A deep-research stack might use T1 for a general retriever, T2 for a search subagent tuned by the fixed LLM, and A1 to train the agent to write better queries.
Concrete mini-walkthroughs
- RAG (search): Input → Agent (writes query) → Retriever (returns docs) → Agent (answers)
  - A1: Reward the agent on retrieval metrics (better queries). Example: learning to boost recall@K.
  - A2: Reward the agent on final answer correctness (better synthesis and when-to-search).
  - T1: Train the retriever broadly (plug-and-play for any agent).
  - T2: Tune the retriever using the fixed agent’s answer quality as feedback (make results more agent-usable).
- Coding with execution: Input → Agent (writes code) → Sandbox (runs tests) → Agent (optional summary)
  - A1: Reward on pass rate; the agent learns to produce runnable, correct code.
  - A2: Reward on final correctness after seeing outputs (plan, run, then answer).
  - T1: Improve static code tools or linters independently.
  - T2: Train a code helper (subagent) using a fixed agent’s success as the teaching signal.
- Memory as a tool (T2): Input → Agent reads memory → Acts → Final output → Memory gets updated using the agent’s output. The agent stays fixed; the memory learns what to save and how to retrieve to help future tasks (a toy sketch follows below).
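A toy sketch of memory as a T2 tool (all names are illustrative): the agent is never updated; only the memory's usefulness scores change, driven by how well the frozen agent did when those notes were in its context.

```python
# Toy "learned memory" tool: the trainable part is a usefulness score per note,
# updated from the quality of the frozen agent's final output.

class LearnedMemory:
    def __init__(self):
        self.notes = {}              # note text -> usefulness score (the trainable part)

    def write(self, note: str):
        self.notes.setdefault(note, 0.0)

    def retrieve(self, query: str, k: int = 2):
        """Return the k highest-scoring notes that share a word with the query."""
        hits = [n for n in self.notes
                if set(n.lower().split()) & set(query.lower().split())]
        return sorted(hits, key=lambda n: -self.notes[n])[:k]

    def update_from_outcome(self, used_notes, reward: float, lr: float = 0.5):
        """T2 step: credit the notes that were in context when the frozen agent succeeded."""
        for note in used_notes:
            self.notes[note] += lr * (reward - 0.5)    # reinforce helpful notes, decay others

memory = LearnedMemory()
memory.write("User prefers metric units")
memory.write("User's cat is named Bob")
used = memory.retrieve("convert the recipe to metric units")
reward = 1.0                                           # e.g., the frozen agent's answer was accepted
memory.update_from_outcome(used, reward)
print(memory.notes)                                    # the helpful note's score went up
```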
The secret sauce
- Two clean axes—who learns (agent vs. tool) and what teaches (tool result vs. final answer)—turn a messy design space into four simple boxes you can mix and match. This separation makes decisions clearer, reduces wasted compute, and encourages modular, swappable components. It’s like labeling the shelves in a workshop so everyone knows where to improve first.
04 Experiments & Results
The paper is a survey, so it doesn’t run new experiments; instead, it summarizes how the four boxes (A1, A2, T1, T2) perform across tasks and reports standout results from cited works. Here’s how to read the “scoreboard” with meaning.
- What was tested and why
  - Tool-grounded tasks (great for A1/RLVR): coding with unit tests, retrieval with recall/nDCG, theorem proving with verifiers. These have clear, verifiable signals.
  - End-to-end reasoning tasks (great for A2): QA with exact match or judge preference, math proofs by final correctness, long research pipelines.
  - Tool ecosystems (T1/T2): retrievers, planners, memory modules measured by how much they raise a fixed agent’s downstream performance.
- Competition and baselines
  - Prompt-only systems vs. SFT vs. preference learning vs. RLVR.
  - Frozen strong agents with generic tools (T1) vs. the same agents with tools tuned by the agent’s own feedback (T2).
  - One-step agents vs. multi-turn planner-executor agents.
- The scoreboard (with context)
  - A1 with RLVR on retrieval: DeepRetrieval reported roughly a threefold recall jump (about 65% vs. ~25%) on literature search—like going from a C to an A in finding the right papers—while also boosting downstream QA.
  - A1 with RLVR on coding: Works like Code-R1/R1-Code-Interpreter emphasized that clean, sandboxed, verifiable rewards beat larger but noisier datasets—quality over quantity—resulting in stronger pass rates and fewer reward hacks.
  - A1 on theorem proving: Systems such as AlphaProof and DeepSeek-Prover-V2 used proof checkers as step-by-step or final rewards, yielding steadier progress on long proofs (dense, trusty signals = easier credit assignment).
  - A2 for reasoning: R1-style approaches (e.g., DeepSeek-R1) showed that optimizing final answer quality can lift math/coding correctness broadly; TextGrad reported notable jumps like GPT-4o code accuracy from 26% to 36% on LEETCODE-HARD and +3.9 pp on MMLU-Physics, showing that language feedback itself can be a powerful learning signal.
  - T1/T2 for RAG: Agent-agnostic retrievers (T1) generalize across tasks, but T2 (agent-supervised tools like S3/AgentFlow) often deliver bigger downstream gains for that specific agent—like fitting a glove.
- Surprising findings
  - RLVR’s reliability: Verifiable rewards (tests, retrieval, verifiers) stabilize exploration—even small models learn solid multi-step strategies with the right signals.
  - Reward quality > dataset size: In code RL, filtered, trusted tests outperform massive but noisy data.
  - Tool-only wins: Upgrading tools (T1/T2) can meaningfully lift a frozen closed-source agent—no need to fine-tune the giant model.
  - Co-adaptation shines: Best systems combine boxes (e.g., T1 retrievers + T2 search subagents + A1 agent query training), showing additive gains.
- Big picture
  - A1 is best when tools offer crisp, checkable signals.
  - A2 is best for holistic, end-to-end outcomes.
  - T1 gives broad, reusable tools; T2 gives custom-fit tools for a specific agent.
  - Mixing boxes often beats any single strategy.
05 Discussion & Limitations
Limitations
- Coverage lag: The field moves fast; a static taxonomy can miss the latest tricks or hybrid methods.
- Signal brittleness: Poor rewards (flaky tests, biased judges) mislead training in both A1 and A2.
- Credit assignment: Long, multi-step tasks still make it hard to know which step deserves credit or blame.
- Safety gaps: Without careful design, agents can learn unsafe tool habits (e.g., insecure code, risky browsing).
Required resources
- Compute to run tools (search, sandboxes, provers) at scale for RLVR.
- High-quality datasets for SFT and preferences.
- Infrastructure for tool orchestration (APIs, MCP, sandboxes, logs) and evaluation harnesses.
- Monitoring for cost, latency, and failure modes (timeouts, tool drift).
When NOT to use certain boxes
- Avoid A1 if the tool signal is noisy or non-verifiable (you’ll chase the wrong objective).
- Avoid A2-only if intermediate steps really matter (you might get good answers for the wrong reasons and fail to generalize).
- Avoid T1-only if the agent is quirky; you may need T2 to align tools to the agent’s style.
- Avoid heavy agent fine-tuning if you’re resource-limited or fear catastrophic forgetting; start with T1/T2.
Open questions
- Co-adaptation: How do we schedule and coordinate A1/A2 with T1/T2 over time without instability?
- Continual learning: How can agents and tools adapt safely as tasks, websites, APIs, and proof libraries evolve?
- Safe adaptation: How do we prevent reward hacking, prompt injection, over-permissioned tools, and privacy leaks?
- Efficient adaptation: Can we get the same gains with fewer tool calls, cheaper rewards, and smaller models?
- Evaluation standards: Can the community agree on multi-task, multi-tool benchmarks that reflect real deployment?
- Theory: What guarantees can we make about stability, convergence, or generalization when rewards come from complex tool chains?
06 Conclusion & Future Work
Three-sentence summary
- This paper introduces a clear, four-box map for improving agentic AI: A1 (agent learns from tool results), A2 (agent learns from final answers), T1 (tool learns on its own), and T2 (tool learns from the agent’s feedback).
- The framework clarifies trade-offs in cost, flexibility, generalization, and modularity, and shows how real systems combine boxes for stronger results.
- It highlights trends like RLVR, agent-guided tools, and memory-as-a-T2-tool, and lays out open challenges in co-adaptation, continual learning, safety, and efficiency.
Main achievement
- A unified, practical taxonomy that turns a messy design space into four simple, composable choices, helping researchers and builders pick, mix, and switch strategies with confidence.
Future directions
- Co-adaptation schedules that coordinate A1/A2/T1/T2 over time.
- Continual adaptation for changing APIs, data, and domains.
- Safer reward designs and robust tool sandboxes.
- More efficient training through better signals, caching, and smaller-but-smarter subagents.
- Standardized, multi-tool, multi-turn benchmarks.
Why remember this
- Because when agents must plan, search, code, and remember, choosing how to adapt is half the battle. These four boxes are the map you can carry into any new project: decide who learns, decide what teaches, and build upward—modularly, safely, and efficiently.
Practical Applications
- Build a research assistant by combining T1 (strong retriever), T2 (agent-tuned search subagent), and A1 (agent learns better query writing).
- Speed up coding help: start with A1 (unit-test rewards), then add T2 (a code-repair subagent trained by the fixed LLM’s feedback).
- Harden RAG pipelines: train a general retriever (T1), then align it to your agent’s preferences (T2), and fine-tune the agent on tool formats (A1).
- Create safe web-browsing agents using RLVR with strict sandboxes and reward filters (A1/A2) to avoid unsafe actions.
- Upgrade helpdesk bots: use T2 to tune a reranker on the bot’s success rate before trying costly agent fine-tuning.
- Stabilize math reasoning with A2 (final-answer rewards) and optional code execution to verify intermediate steps.
- Improve memory use by treating memory as a T2 tool—learn what to store and how to retrieve using the agent’s output quality.
- Design co-adaptation schedules: alternate T2 tool updates and A1 agent updates for compounding gains.
- Reduce costs by preferring T1/T2 first (tool-side training), and only do A1/A2 fine-tuning if still necessary.
- Standardize evaluations: for each project, pair the right metrics to the chosen box (A1 tool metrics, A2 end metrics) and monitor robustness.