
Position: Agentic Evolution is the Path to Evolving LLMs

Intermediate
Minhua Lin, Hanqing Lu, Zhan Shi et al. · 1/30/2026
arXiv · PDF

Key Summary

  • Big AI models do great in the lab but stumble in the real world because the world keeps changing.
  • This paper says the fix is agentic evolution: let AI improve itself while it’s being used, not just when it’s trained.
  • They introduce a framework called A-Evolve that treats learning-after-deployment like a careful, step-by-step repair job with tests.
  • Instead of only rewiring the model’s brain (weights), the system also builds and updates tools, checklists, and tests it can reuse.
  • A safety gate (validation) checks every proposed change before keeping it, so the AI doesn’t break old skills.
  • The paper’s big bet: the more compute you spend on evolution (analysis, planning, testing), the better and faster the AI adapts over time.
  • On a tough benchmark (AppWorld), agentic evolution beat popular baselines by large margins while using the same per-task budget.
  • Small models plus agentic evolution sometimes outperformed bigger models without it, showing evolution can narrow the size gap.
  • Breaking the evolution process into diagnose → plan → update → verify was crucial; removing verification hurt the most.
  • Agentic evolution helps privacy because improvements can stay local as tools and tests, without sending raw user data back for retraining.

Why This Research Matters

Software, rules, and data formats change constantly, so AIs that can only rely on past training will slowly fall behind. Agentic evolution turns everyday failures into lessons by adding small, reusable tools and tests that make tomorrow’s tasks easier. This lowers costs over time, because the system doesn’t need to “think forever” on the same problem; it learns once and reuses. It’s safer, too, because a validation gate checks fixes before they stick, reducing the risk of silent breakage. Privacy can improve when adaptation happens locally via tools and tests instead of shipping user data back for global retraining. Smaller models can also stay competitive by accumulating smart artifacts, narrowing the gap with larger systems. Overall, this is a practical path to keep AI reliable in open-ended, fast-changing environments.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how you learn new shortcuts for homework as your assignments get trickier during the school year? You don’t go back to kindergarten each time—you adapt as you go.

🥬 Filling (The Actual Concept)

  • What it is: Large Language Models (LLMs) are super text-predicting computers that answer questions, write code, and use tools.
  • How it works: They’re trained on huge piles of text, then fine-tuned to be helpful and safe, and finally used to solve tasks step by step.
  • Why it matters: Training happens once, but the world keeps changing. If models don’t keep learning during use, they fall behind when apps, rules, or data formats change.

🍞 Bottom Bread (Anchor) Imagine a model that books flights. If the airline changes its website, the model keeps clicking the old buttons and fails—unless it learns the new layout after deployment.

🍞 Top Bread (Hook) Think of a video game where levels change every day. If your character can’t learn new moves while playing, you’ll keep losing.

🥬 The Concept: Train–Deploy Gap

  • What it is: The train–deploy gap is the mismatch between what the AI learned before launch and what it faces in the real world after launch.
  • How it works:
    1. Train: The model sees examples from the past.
    2. Deploy: The world shifts—APIs rename fields, formats change, rules update.
    3. Result: The model’s old habits stop fitting the new world.
  • Why it matters: Without closing this gap, performance drops over time, even for strong models.

🍞 Anchor A log reader that expects a field called "status" breaks when the system renames it to "http_status". Same task, different label, instant confusion.

🍞 Top Bread (Hook) Imagine doing a math worksheet. If you spot a repeated mistake, you change your strategy right then, not next semester.

🥬 The Concept: Deployment-time Adaptation

  • What it is: Deployment-time adaptation means the AI improves while being used, not just during pre-release training.
  • How it works:
    1. Notice where it failed.
    2. Figure out why.
    3. Update something persistent (like a tool or rule) so it won’t fail the same way again.
  • Why it matters: This turns one-time smart thinking into a reusable habit, saving time and avoiding repeat mistakes.

🍞 Anchor If an agent keeps forgetting to attach a token to API calls, it can add a small "token helper" tool once, then reuse it forever.
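
To make the "token helper" idea concrete, here is a minimal Python sketch of what such a persistent tool could look like. The cache layout and the `login_fn` callback are assumptions for illustration; the paper's example only specifies that the tool takes an app name, returns a token, and handles cache hits and misses.

```python
import time

# Hypothetical illustration of a small, persistent artifact the agent adds once.
_token_cache = {}  # app_name -> (token, expiry_timestamp)

def manage_auth_token(app_name, login_fn, ttl_seconds=3600):
    """Return a cached auth token for app_name, refreshing it when expired.

    login_fn is a caller-supplied callable that performs the actual login and
    returns a fresh token string (an assumption made for this sketch).
    """
    token, expiry = _token_cache.get(app_name, (None, 0.0))
    if token is None or time.time() >= expiry:
        token = login_fn()  # fetch a fresh token only when needed
        _token_cache[app_name] = (token, time.time() + ttl_seconds)
    return token

# Usage: every later API call reuses the helper instead of re-deriving login logic.
# headers = {"Authorization": f"Bearer {manage_auth_token('gmail', do_gmail_login)}"}
```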

🍞 Top Bread (Hook) Picture a backpack with three pouches: notes (what to remember), gadgets (what to use), and a checklist (how to check work).

🥬 The Concept: What People Tried Before (and Why It Fell Short)

  • Parametric updates (changing the model’s weights): Powerful but risky—hard to audit, can forget old skills, and may break good behaviors.
  • Heuristic memories (just appending notes/prompts): Easy but messy—notes pile up, retrieval gets noisy, and fixes don’t last.
  • Why it matters: These approaches often act blindly. They don’t reason about failures or verify fixes.

🍞 Anchor It’s like either rewriting your whole brain to fix a tiny typo (too risky) or scribbling sticky notes all over your desk (too messy). You need a small, reliable tool you can trust.

🍞 Top Bread (Hook) Imagine a pit crew that doesn’t just fuel the car but also decides what to fix, when, and how—then tests the car before it returns to the race.

🥬 The Gap This Paper Fills

  • What it is: A governed, step-by-step process where an evolver agent decides what to change, when to change it, and how to verify it.
  • How it works: The system keeps a persistent state of tools, knowledge, and tests. The evolver proposes edits and only commits them if they pass validation.
  • Why it matters: You get durable improvements that are understandable, auditable, and safe.

🍞 Anchor A code-parsing agent hits new nested JSON. Instead of guessing forever, it creates a versioned adapter function and a test. Once tests pass, the fix sticks.
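
As a hedged illustration of what "a versioned adapter function and a test" could look like, here is a tiny Python sketch; the function name, field layout, and test values are invented for this example.

```python
import json

def parse_status_v2(payload: str) -> str:
    """Adapter v2: handles the old flat layout and the new nested layout.

    Hypothetical example — the field names and formats are illustrative.
    """
    record = json.loads(payload)
    if "status" in record:                    # old flat format
        return str(record["status"])
    return str(record["response"]["status"])  # new nested format

def test_parse_status_v2():
    assert parse_status_v2('{"status": 200}') == "200"
    assert parse_status_v2('{"response": {"status": 404}}') == "404"

test_parse_status_v2()  # the fix only "sticks" once these checks pass
```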

02Core Idea

🍞 Top Bread (Hook) You know how good coaches don’t just shout random tips? They watch the game, pick the real problem, design a drill, and then test if it worked.

🥬 The Concept: Agentic Evolution

  • What it is: Agentic evolution is letting an AI act like its own careful coach—diagnosing failures, proposing fixes, and verifying them—while it’s deployed.
  • How it works:
    1. Collect evidence from recent tasks.
    2. Diagnose the root cause of failures.
    3. Plan targeted edits to persistent parts of the system (tools, knowledge, tests—and sometimes parameters).
    4. Verify with a validation gate.
    5. Commit only if it passes; otherwise, try again or do nothing.
  • Why it matters: Without this, models either forget old skills or drown in messy memories. With it, improvements are reliable and stick.

🍞 Anchor A support bot keeps misreading a field. It doesn’t just "think harder" next time; it adds a tiny parser tool plus a unit test. Now the fix is permanent.

🍞 Top Bread (Hook) Imagine a recipe that uses both your cooking skills and a drawer full of trusted gadgets.

🥬 The Concept: Composite Policy

  • What it is: A composite policy is the AI’s overall strategy that mixes two parts: the model’s weights (the brain) and the persistent artifact state (the gadgets and notes it can reuse).
  • How it works:
    1. The brain decides what to do.
    2. The artifact state supplies tools, schemas, and workflows.
    3. Together, they produce actions.
  • Why it matters: If you only change the brain, you risk breaking things. If you only add notes, you get clutter. Combining both gives safe, reusable power.

🍞 Anchor A travel agent model uses a reusable "compare_flight_prices" tool it previously made, instead of re-deriving the logic from scratch each time.
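
A minimal sketch of a composite policy in Python, assuming a stand-in LLM callable and a plain dict as the tool registry; the routing rule is deliberately simplistic and only meant to show the brain-plus-gadgets split.

```python
# Minimal sketch of a composite policy: the "brain" (an LLM call) plus the
# persistent artifact state (reusable tools). All names here are hypothetical.

class CompositePolicy:
    def __init__(self, llm, tool_registry):
        self.llm = llm              # pi_theta: the model's weights, behind a callable
        self.tools = tool_registry  # pi_S: persistent, reusable artifacts

    def act(self, task):
        # The brain decides what to do; reusable tools supply the heavy lifting.
        if "flight prices" in task.lower() and "compare_flight_prices" in self.tools:
            return self.tools["compare_flight_prices"](task)
        return self.llm(f"Solve step by step: {task}")

# Usage with stand-in components:
policy = CompositePolicy(
    llm=lambda prompt: f"<model reasoning for: {prompt}>",
    tool_registry={"compare_flight_prices": lambda task: "cheapest: $212 on 3/14"},
)
print(policy.act("Compare flight prices for NYC to SFO"))
```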

🍞 Top Bread (Hook) Think of a tidy toolbox with labels, not a random junk drawer.

🥬 The Concept: Persistent Artifact State

  • What it is: A neat, versioned collection of reusable stuff—knowledge entries (K), tools (T), and validation tests (V)—that the AI can edit and reuse across tasks.
  • How it works:
    1. Store structured knowledge (schemas, steps, examples).
    2. Keep executable tools with clear inputs/outputs.
    3. Maintain tests that must pass before changes stick.
  • Why it matters: It turns one-time reasoning into durable, auditable capability.

🍞 Anchor The agent adds a "manage_auth_token" tool once and reuses it in any app that needs login, with tests to prevent regressions.
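
Here is one way the persistent artifact state π_S = {K, T, V} could be represented in Python; the dataclass fields and the versioning scheme are assumptions, not the paper's data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ArtifactState:
    """Sketch of pi_S = {K, T, V}; fields and versioning are illustrative."""
    knowledge: Dict[str, str] = field(default_factory=dict)             # K: schemas, workflows, examples
    tools: Dict[str, Callable] = field(default_factory=dict)            # T: executable functions
    tests: Dict[str, Callable[[], bool]] = field(default_factory=dict)  # V: validation checks
    version: int = 0
    history: List[str] = field(default_factory=list)                    # provenance for audit/rollback

    def commit(self, description: str) -> None:
        """Record a validated change so it can be audited or rolled back."""
        self.version += 1
        self.history.append(f"v{self.version}: {description}")

state = ArtifactState()
state.tools["manage_auth_token"] = lambda app: "tok_123"  # placeholder tool body
state.tests["auth_token_is_str"] = lambda: isinstance(state.tools["manage_auth_token"]("gmail"), str)
state.commit("add manage_auth_token tool with a basic test")
```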

🍞 Top Bread (Hook) Like a doctor asking, “Where exactly does it hurt?”

🥬 The Concept: Diagnostic Mechanism

  • What it is: A process to find the root cause of failures instead of guessing.
  • How it works:
    1. Aggregate traces and errors.
    2. Spot patterns (e.g., repeated 401 or wrong field names).
    3. Localize the cause (missing schema, brittle tool).
  • Why it matters: Fixing symptoms wastes time; fixing causes creates lasting gains.

🍞 Anchor Seeing many 422 errors tied to wrong parameter names leads to a plan: add a schema-checking skill and update the tool signature.
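
A small sketch of what such a diagnostic pass might look like, assuming traces are dicts with an "errors" list; the error codes, threshold, and suggested objectives are illustrative.

```python
from collections import Counter

def diagnose(traces, min_count=3):
    """Group recent failures by error signature and flag recurring root causes."""
    signatures = Counter()
    for trace in traces:
        for event in trace.get("errors", []):
            signatures[(event["code"], event.get("field"))] += 1

    objectives = []
    for (code, fld), count in signatures.most_common():
        if count >= min_count and code == 422:
            objectives.append(f"Add schema lookup before calls that send field '{fld}'")
        elif count >= min_count and code == 401:
            objectives.append("Add a reusable auth-token helper with expiry handling")
    return objectives

traces = [{"errors": [{"code": 422, "field": "user_id"}]}] * 4
print(diagnose(traces))  # -> ["Add schema lookup before calls that send field 'user_id'"]
```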

🍞 Top Bread (Hook) Before a roller coaster opens, safety inspectors run tests.

🥬 The Concept: Validation Gate

  • What it is: A commit checkpoint that only accepts updates that pass tests (or review).
  • How it works:
    1. Run unit and regression tests.
    2. If fail → reject and revise; if pass → commit.
    3. Keep provenance and allow rollback.
  • Why it matters: It prevents fixes that secretly break old abilities.

🍞 Anchor A new parser tool must pass old and new JSON test cases. If even one breaks, the change isn’t kept.
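
A sketch of the gate's commit logic, assuming an artifact state like the earlier sketch (with a `tests` registry and a `commit` method); staging by deep copy is one possible implementation, not the paper's.

```python
import copy

def validated_commit(state, apply_change, description):
    """Stage a proposed change, run every test, and commit only if all pass.

    apply_change(state) performs the edit (e.g., add/patch a tool plus its tests);
    the live state is untouched unless the staged copy goes green.
    """
    staged = copy.deepcopy(state)
    apply_change(staged)
    results = {name: test() for name, test in staged.tests.items()}
    if all(results.values()):
        apply_change(state)          # promote the verified change
        state.commit(description)    # keep provenance for rollback/audit
        return True, results
    return False, results            # reject and revise; no silent breakage
```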

🍞 Top Bread (Hook) Bigger homework time usually means better projects—if you spend it wisely.

🥬 The Concept: Evolution-Scaling Hypothesis

  • What it is: The idea that the more compute you invest in evolution (analysis, planning, tests), the better and faster your AI adapts over time.
  • How it works:
    1. Extra compute lets the evolver consider more candidate fixes.
    2. It can run stronger verification.
    3. Validated updates accumulate, compounding gains.
  • Why it matters: Instead of thinking longer every time, the system learns once and reuses—so progress scales predictably with evolution compute.

🍞 Anchor Giving the evolver more steps per cycle leads to more accurate diagnoses and sturdier tools, which then boost task success across many future episodes.
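
To show the intuition in code form, here is a toy sketch where a larger evolution budget buys more candidate fixes and more verification passes; `propose_fix` and `verify` are placeholder callables, and the 50/50 budget split is an arbitrary assumption.

```python
def evolve_with_budget(evidence, propose_fix, verify, budget_steps):
    """Toy illustration: more evolution compute -> more candidates, stronger checks."""
    proposal_steps = budget_steps // 2
    verify_steps = budget_steps - proposal_steps

    candidates = [propose_fix(evidence) for _ in range(proposal_steps)]

    best, best_passed = None, -1
    per_candidate = max(1, verify_steps // max(1, len(candidates)))
    for fix in candidates:
        passed = sum(verify(fix) for _ in range(per_candidate))  # repeated verification
        if passed > best_passed:
            best, best_passed = fix, passed
    return best if best_passed > 0 else None  # commit only a verified winner
```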

03Methodology

🍞 Top Bread (Hook) Imagine a two-part school day: first you solve today’s worksheet, then you spend club time improving your toolkit so tomorrow’s worksheet is easier.

🥬 The Concept: The Solve–Evolve Loop (High-Level Recipe)

  • What it is: A repeating cycle with two phases: solve now, then evolve for later.
  • How it works (At a high level: Input → Solve → Log Evidence → Evolve → Verified Update → Output Next Time):
    1. Solve: Use current skills and tools to finish the task.
    2. Log: Save the full trace (what happened, errors, outputs).
    3. Evolve: Analyze traces, propose a targeted update, verify it, and only then commit.
  • Why it matters: It separates doing the task from improving the system, so each phase can be optimized and governed properly.

🍞 Anchor A chatbot fails to parse a nested field today, logs the evidence, then creates and verifies a new adapter so tomorrow it breezes through similar tasks.
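
The whole cycle can be summarized in a few lines of Python; `policy.solve`, `policy.diagnose_plan_update`, `validator`, and `state.apply` are placeholders for the components detailed in the recipe below.

```python
def solve_evolve_loop(tasks, policy, state, validator, evolve_every=10):
    """Sketch of the two-phase cycle; every named component is a placeholder."""
    evidence = []
    for i, task in enumerate(tasks, start=1):
        _result, trace = policy.solve(task, state)  # Solve: read-only execution
        evidence.append(trace)                      # Log: keep the full trajectory
        if i % evolve_every == 0:                   # Evolve: periodic, evidence-driven
            update = policy.diagnose_plan_update(evidence, state)
            if update is not None and validator(update, state):
                state.apply(update)                 # commit only verified updates
            evidence.clear()                        # start a fresh evidence batch
```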

Step-by-step Method (like a recipe)

  1. Persistent Artifact State π_S = {K, T, V}
  • What happens: The system maintains three registries.
    • K (Knowledge): schemas, workflows, playbooks, examples—organized, versioned, and searchable.
    • T (Tools): executable functions or API wrappers with clear signatures and linked tests.
    • V (Validation): unit tests, regression suites, and optional human-review hooks.
  • Why it exists: To turn messy memories into tidy, reusable, auditable assets.
  • Example with data: After repeated 401 errors, add to K an "authentication workflow" skill, to T a manage_auth_token tool, and to V tests that simulate expired tokens.
  2. Solve Phase: Read-only execution
  • What happens: The solver handles a task using π_θ (the model’s weights) and π_S (K and T), without changing anything.
  • Why it exists: Keeps runs reproducible and prevents accidental drift during task execution.
  • Example: The agent retrieves the "authentication workflow" skill, calls manage_auth_token(), then proceeds with API calls.
  3. Evidence Collection: Obs buffer
  • What happens: The full trajectory (inputs, tool calls, errors, outputs) is appended to an evidence buffer.
  • Why it exists: Gives the evolver rich, cross-episode context to spot patterns instead of one-off guesses.
  • Example: Logs show 422 errors mostly come from incorrect parameter names when calling create_message(user_id, text).
  4. Evolver Module: Diagnose → Plan → Update → Verify (see the code sketch after this list)
  • Diagnose
    • What happens: Identify actionable failures and propose an explicit update objective (g): what must improve and why.
    • Why it exists: Fixing the wrong thing wastes compute and can hurt performance.
    • Example: Recognizes that misnamed fields cause 422 and recommends: “Add schema lookup step before first API call.”
  • Plan
    • What happens: Produce a structured edit plan with operators: add, patch, refactor, prune; specify dependencies.
    • Why it exists: Complex fixes require coordinated changes; planning avoids incoherent edits.
    • Example: First create discover_api_spec tool, then add systematic_api_exploration skill that calls it, then patch old examples.
  • Update
    • What happens: Synthesize concrete artifacts (tools/skills/tests) with provenance; write to a staging area.
    • Why it exists: Turns ideas into precise, testable changes.
    • Example: Implement manage_auth_token() with input app_name, returns token; add tests covering cache hits/misses.
  • Verify (Validation Gate)
    • What happens: Run unit/regression tests; optionally trigger human review; decide commit or reject.
    • Why it exists: Prevents regressions and brittle patches from entering the main state.
    • Example: Ensure the new parser handles old flat JSON and new nested JSON; if any fail, reject and revise.
  5. Commit or No-op
  • What happens: If all checks pass, apply the update to π_S (and, rarely, to parameters π_θ under strict governance). Otherwise, do nothing and try a safer alternative.
  • Why it exists: Makes evolution conditional and governed, not automatic.
  • Example: Commit versioned adapter v1.2 after all tests pass; keep rollback history.
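
Pulling the four evolver stages together, here is a compact Python sketch; the `EditStep`/`Update` shapes and the stage callables are assumptions meant to mirror the recipe above, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class EditStep:
    operator: str    # "add" | "patch" | "refactor" | "prune"
    target: str      # artifact name, e.g. "manage_auth_token"
    payload: object  # new tool body, patched signature, test code, ...

@dataclass
class Update:
    objective: str
    steps: List[EditStep]
    tests: List[Callable[[], bool]]

def run_evolver(evidence, diagnose, plan, synthesize) -> Optional[Update]:
    """Chain the four stages; each stage callable is a placeholder component."""
    objective = diagnose(evidence)             # what must improve, and why
    if objective is None:
        return None                            # no actionable failure -> no-op
    edit_plan = plan(objective)                # ordered add/patch/refactor/prune steps
    update = synthesize(objective, edit_plan)  # concrete artifacts, written to staging
    if all(test() for test in update.tests):   # validation gate
        return update                          # caller commits it to pi_S
    return None                                # reject; try a safer alternative later
```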

Edit Operators and Auditing (the tools of the mechanic)

  • Add: introduce a new tool/skill/test.
  • Patch: modify a signature or logic.
  • Refactor: reorganize for clarity and reuse without changing behavior.
  • Prune: remove obsolete or harmful artifacts.
  • Why it matters: A small, clean set of actions makes changes understandable and reversible.
  • Example: Patch tool signature from (email, pwd) to (username, password); update skills and tests accordingly.

Autonomy: When to Evolve

  • Trigger conditions: after enough evidence accumulates or after repeated, high-confidence failures.
  • Budgeting: evolution-time compute is capped (e.g., number of analysis steps), ensuring the system doesn’t over-spend.
  • Why it matters: Spend compute when it helps most; skip when noise is high.
  • Example: Batch 10 episodes, then run diagnosis; if patterns are weak, postpone updates.
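
A sketch of one possible trigger rule, assuming traces carry an "errors" list; the batch size and repeat threshold are invented numbers.

```python
def should_evolve(episode_traces, batch_size=10, min_repeat=3):
    """Evolve only after a full batch, and only if some failure repeats clearly."""
    if len(episode_traces) < batch_size:
        return False
    failure_counts = {}
    for trace in episode_traces:
        for err in trace.get("errors", []):
            failure_counts[err["code"]] = failure_counts.get(err["code"], 0) + 1
    return max(failure_counts.values(), default=0) >= min_repeat

# Example: 10 episodes, four of which hit the same 422 -> worth spending evolution compute.
traces = [{"errors": [{"code": 422}]}] * 4 + [{"errors": []}] * 6
print(should_evolve(traces))  # True
```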

Secret Sauce (what makes it clever)

  • Goal-oriented: Fix causes, not symptoms; target the exact artifact that unlocks durable gains.
  • Compositional: Build modular tools/skills/tests that snap together, turning long chains of reasoning into one-click capabilities.
  • Governed: A strict validation gate keeps the system stable over months, not just minutes.

Concrete Mini-Walkthrough

  • Input: "Find the most-liked song across my playlists and text me the title."
  • Solve: Fails due to confusing like_count vs popularity.
  • Evidence: Trace shows partial progress but wrong field interpretation.
  • Diagnose: "Semantic mismatch: task asks for like_count, agent used popularity."
  • Plan: Add presubmission_verification skill to check answer form; evolve analyze_task_requirements tool to extract key fields.
  • Update: Implement both, write tests comparing fields on sample playlists.
  • Verify: Tests pass on multiple mock playlists; regression checks keep other tasks green.
  • Commit: Next time, the agent asks the right question, gets like_count, and sends the correct title on the first try.

04Experiments & Results

🍞 Top Bread (Hook) When two soccer teams tie in skill, the one with a better coach’s halftime adjustments usually wins.

🥬 The Concept: The Test

  • What it is: The team tested agentic evolution on AppWorld, a tough playground where agents must use many apps/APIs with unit tests checking if goals are met.
  • How it works:
    1. Train/evolve on 50 tasks; evaluate on a separate 50 tasks.
    2. Fix both solve-time compute (per-task budget) and evolution-time compute (per-episode budget) for fair comparison.
    3. Measure Task Goal Completion (TGC: perfect passes) and Average Passed Tests (APT: partial credit).
  • Why it matters: It shows whether improvements are real, reusable, and not just lucky one-offs.

🍞 Anchor It’s like grading not only if you answered every question (TGC) but also how many you got right overall (APT).
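
The two scores can be read as simple aggregations over per-task unit-test results; this sketch reflects that reading of the definitions above, not the benchmark's reference implementation.

```python
# TGC gives credit only for tasks where every unit test passes;
# APT averages the fraction of tests passed per task.

def tgc(results):
    """results: list of per-task lists of booleans (one bool per unit test)."""
    return sum(all(tests) for tests in results) / len(results)

def apt(results):
    return sum(sum(tests) / len(tests) for tests in results) / len(results)

results = [[True, True], [True, False, False], [True, True, True]]
print(f"TGC = {tgc(results):.2f}")  # 0.67: two of three tasks fully pass
print(f"APT = {apt(results):.2f}")  # 0.78: average per-task pass rate
```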

The Competition

  • Vanilla: just the solver, no persistent updates.
  • APE: search-based prompt evolution.
  • AWM: workflow memory that reuses past trajectories.
  • Our method: A-Evolve (agentic evolution with diagnose → plan → update → verify over K/T/V).

Scoreboard with Context

  • On Claude Haiku 4.5 (a smaller solver):
    • Vanilla: 32% TGC
    • AWM: 46% TGC
    • A-Evolve: 64% TGC
    • Interpretation: Moving from no evolution to heuristic memory is like going from a C to a B-, but agentic evolution jumps to a solid A-.
  • On Gemini 3 Flash:
    • Vanilla: 52% TGC
    • AWM: 56% TGC
    • A-Evolve: 82% TGC
    • Interpretation: That 82% is like scoring an A+ when others are at B/B-.
  • APT also rose consistently, meaning even when tasks weren’t fully solved, partial correctness improved a lot—evidence of real capability gains, not flukes.

Surprising/Notable Findings

  1. Small + Smart beats Big + Static: A smaller solver with A-Evolve (Haiku 4.5 at 64% TGC) outperformed a larger one without evolution (Sonnet 4 vanilla at 42%). Evolution narrows the size gap by turning one-shot reasoning into reusable tools.
  2. Verification is the keystone: Removing the validation gate hurt the most in ablations. Without it, broken tools sneaked in and created regressions.
  3. Each module matters: Diagnosis, planning, and analysis tools all contributed. Dropping any one reduced scores, showing the full loop is more than the sum of parts.
  4. Scaling evolution compute helps steadily: Increasing the number of evolution steps (more analysis and testing) improved results monotonically for A-Evolve, while heuristic memory plateaued early. More evolver capacity (bigger model for the evolver) also reduced proposal errors and verification failures.

Concrete Case Lessons

  • Authentication pains: Heuristic memory repeatedly rediscovered logins; A-Evolve created manage_auth_token plus tests and solved the entire class of problems thereafter.
  • Schema drift: Instead of endlessly re-deriving parsing logic, A-Evolve synthesized versioned adapters with regression checks; failures turned into steady capability.

What This Means in Plain Terms

  • Agentic evolution doesn’t just cram more notes into context; it builds and tests small, sturdy tools and rules that stick around.
  • With the same per-task budget, spending some compute on evolution gives lasting upgrades that make tomorrow’s tasks easier.
  • The gains aren’t fragile: they survive across different solvers and keep improving when you budget more for evolution.

🍞 Bottom Bread (Anchor) It’s like a study group that, after each test, writes a short, tested cheat-sheet tool for the hardest topic. Next tests feel easier not because you think longer, but because you already built the right calculator.

05Discussion & Limitations

🍞 Top Bread (Hook) If you build a treehouse that can upgrade itself, you still need good blueprints and safety checks—or it might wobble.

🥬 Limitations (What this can’t do yet)

  • Real-time ultra-fast shifts: The verify-first rule adds delay, so it may not be ideal when instant adaptation without tests is required.
  • Evolver quality matters: A weak evolver can propose brittle fixes that fail verification; the gate helps, but can’t fix low-quality ideas.
  • Hard environments without tests: If you can’t write good unit/regression tests or get reliable feedback, the validation gate is weaker.
  • Param-only changes remain tricky: Safely editing model weights online is still governance-heavy; A-Evolve focuses on artifacts first.

Required Resources

  • Compute for evolution: Budget for diagnosis, planning, synthesis, and verification.
  • Tooling: Sandboxed execution, test harnesses, and versioned artifact storage.
  • Governance: Policies for commit/review/rollback, plus provenance tracking.

When NOT to Use

  • One-off tasks with no repetition: If you’ll never see the problem again, evolution overhead may not pay off.
  • Locked-down systems with no tests: Without validation, safe commits are hard.
  • Extreme latency constraints: If milliseconds matter more than long-term reliability, defer evolution to batch windows.

Open Questions

  • Theory: Can we formalize regret bounds and prove that agentic evolution beats heuristic memory under broad conditions?
  • Design: What’s the best way to balance artifact updates vs selective param updates?
  • Verification: How do we scale trustworthy automatic tests for fuzzy tasks (e.g., creative writing) without humans-in-the-loop?
  • Safety: How do we detect and prevent subtle capability drift over long horizons?

🍞 Bottom Bread (Anchor) Like upgrading a bike while riding, you need pit stops (verification), a good mechanic (evolver), and spare parts (artifacts). Otherwise, you risk falling over.

06Conclusion & Future Work

🍞 Top Bread (Hook) You don’t get better at piano only by practicing once; you improve a little after each performance, especially if you record, review, and fix mistakes with care.

🥬 3-Sentence Summary

  • This paper argues that the biggest barrier for deployed AI is the train–deploy gap, and the solution is agentic evolution: improving during use with diagnosis, planning, verified updates, and careful governance.
  • The A-Evolve framework operationalizes this by separating solve-time from evolve-time and by editing a persistent artifact state (knowledge, tools, tests) under a validation gate.
  • They further propose the evolution-scaling hypothesis: adaptation improves predictably as you allocate more compute to evolution, not just to per-task thinking.

Main Achievement

  • Turning evolution from a static, heuristic pipeline into an autonomous, goal-directed, verifiable decision process that consistently converts experience into durable capability.

Future Directions

  • Better evolvers (smarter diagnosis/planning), richer and safer verification for fuzzy tasks, and theory that quantifies how close we are to the compute-optimal frontier.

Why Remember This

  • Because it reframes AI improvement as a governed, continual craft: don’t just think longer each time—learn once, verify, and reuse. That’s how deployed AI stays reliable in a world that never stops changing.

🍞 Bottom Bread (Anchor) Like building a library of trusty tools after each science fair, agentic evolution makes tomorrow’s experiments easier, safer, and more successful.

Practical Applications

  • Customer support bots that add verified parsers and response checkers when ticket formats change.
  • Data engineering agents that create and test adapters when upstream schemas drift.
  • DevOps copilots that synthesize safe CLI tools and regression suites after incident postmortems.
  • Finance assistants that update rule checkers and reconciliation tools as compliance policies evolve.
  • Healthcare schedulers that verify new insurance code mappings with tests before using them.
  • Education tutors that build graded rubrics and solution checkers to prevent recurring grading mistakes.
  • Sales ops agents that patch CRM integration tools when APIs rename fields, with rollback if tests fail.
  • Research assistants that refactor and cache complex retrieval workflows into reusable pipelines.
  • Robotic process automation (RPA) agents that synthesize UI-adapters when layouts change and validate them in sandboxes.
  • Security triage bots that add and test new signature detectors when novel alert patterns appear.
#agentic evolution #A-Evolve #deployment-time adaptation #train–deploy gap #persistent artifact state #validation gate #diagnosis and planning #evolution-scaling hypothesis #tool synthesis #regression testing #composite policy #continual learning #governed updates #AppWorld benchmark #compute scaling