EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines
Key Summary
- EvoFSM is a way for AI agents to improve themselves safely by editing a clear flowchart (an FSM) instead of rewriting everything blindly.
- It separates changes into two kinds: Flow (how steps connect) and Skill (what each step does), so fixes are targeted and stable.
- A built-in Critic checks answers and decides whether the agent should tweak the flow, refine a skill, or both.
- Changes are made only with a few small, reversible "atomic operations" like ADD_STATE or REVISE_INSTRUCTION, which keeps evolution under control.
- A self-evolving memory saves what worked (priors) and what failed (warnings) to jump-start future tasks.
- Across five tough multi-hop QA benchmarks, EvoFSM beats strong baselines and reaches 58% accuracy on DeepSearch with Claude-4.
- Ablations show big drops without structure or without evolution, proving both are necessary.
- It generalizes beyond research to interactive decision-making (ALFWorld, WebShop), raising success rates while keeping step counts reasonable.
- Main limitations: reliance on general LLMs, the Critic's fallibility, and a memory that needs smarter long-term management.
Why This Research Matters
Real research questions rarely follow a straight line—they jump between sources, formats, and verification needs. EvoFSM gives AI agents a safe way to adapt their plans on the fly without breaking trusted behaviors. This means more precise answers with citations, better handling of tricky multi-document cases, and fewer wild guesses. Teams can reuse proven workflows across similar tasks and steadily improve over time. As a result, analysts, students, and professionals can get deeper, more reliable help on hard problems—faster and with clearer reasoning trails.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine building a LEGO city. If you have to follow one fixed instruction booklet for every city, you’ll struggle when you need a bridge, a hospital, or a park the booklet never planned for.
🥬 The Concept (The World Before): Many AI research agents followed fixed, one-size-fits-all routines: search once, read, answer. Newer ones used loops (search → reason → repeat), but still with a mostly rigid plan. These plans worked for simple questions, but real-world research is messy: sources vary, data formats differ, and you often need to verify numbers and cite exact passages.
- How it worked: The agent tried a standard recipe for every task, sometimes looping a bit.
- Why it mattered: Without flexible planning, the agent missed the exact evidence or got stuck.
🍞 Anchor: Asking, “Which laptops in 2024 under $1,000 have OLED screens and 10+ hour battery?” needs multiple searches, specs checking, and verification—not a single, fixed step.
🍞 Hook: You know how a teacher may let a student rewrite an essay to improve it? If the student is told “Change whatever you want,” they might delete the thesis or forget the main topic.
🥬 The Concept (The Problem): Some agents tried “self-evolution” by freely rewriting their own prompts, tools, or code. This unconstrained approach often broke what already worked.
- What it is: Unconstrained self-evolution lets an AI modify itself without clear rules or boundaries.
- How it works: A meta-agent reviews performance, then edits prompts/tools anywhere.
- Why it matters: Without guardrails, the agent can drift from instructions, hallucinate, or corrupt reliable parts.
🍞 Anchor: That’s like fixing a bike by randomly swapping parts—you might end up with a bike that can’t steer.
🍞 Hook: Think of a traffic light that keeps cars moving in order. The structure keeps things safe even when roads get busy.
🥬 The Concept (The Gap): We needed a way to let agents adapt while staying inside a safe, understandable structure.
- What it is: A controllable framework that allows change only in specific, visible ways.
- How it works: Split changes into the route (workflow) and the driving skills (how each step acts), then adjust only with small, approved edits.
- Why it matters: This preserves stability while enabling real adaptability for open-ended research.
🍞 Anchor: It’s like letting a city planner add a crosswalk or retime a light, instead of bulldozing the whole intersection every time traffic patterns shift.
— New Concepts (introduced with Sandwich) —
🍞 Hook: You know how a board game has spaces (states) and rules for when you can move from one space to another?
🥬 Finite State Machine (FSM): An FSM is a map of states (like Search, Read, Verify) and clear rules for moving between them.
- How it works: The agent is in one state at a time; based on context, it transitions to the next allowed state.
- Why it matters: The FSM sets safe boundaries, so evolution can’t break the whole system.
🍞 Anchor: Like a traffic light changing from green to yellow to red—predictable steps that keep order.
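The state-and-transition idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the state names and conditions are assumptions chosen to match the examples above.

```python
# Minimal FSM sketch: transitions map (state, condition) -> next state.
# State and condition names are illustrative.
TRANSITIONS = {
    ("Search", "evidence_found"): "Read",
    ("Search", "no_evidence"): "Search",       # retry with a new query
    ("Read", "needs_check"): "Verify",
    ("Verify", "weak_evidence"): "Search",     # loop back for more sources
    ("Verify", "strong_evidence"): "Synthesize",
}

def next_state(state: str, condition: str) -> str:
    """Return the next allowed state; stay put if no rule matches."""
    return TRANSITIONS.get((state, condition), state)
```

Because only listed transitions are allowed, the agent cannot wander into undefined behavior: `next_state("Verify", "weak_evidence")` routes back to `"Search"`, and anything unrecognized leaves the state unchanged.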
🍞 Hook: Picture a soccer coach who updates plays based on last match results—just enough to improve next time.
🥬 Self-Evolution: The agent improves its own process during use.
- How it works: After finishing a task, it reviews what failed, tweaks its plan, and tries again.
- Why it matters: Real-world tasks vary a lot; learning from each one boosts future performance.
🍞 Anchor: A student adjusting their study plan after a tough quiz.
🍞 Hook: Imagine pruning a tree carefully instead of chopping branches randomly.
🥬 Structured Self-Evolution: A careful way to self-improve within strict rules.
- How it works: Only certain, local edits are allowed (like adding a verification step or tightening one instruction).
- Why it matters: Prevents chaos—keeps the core strong while fixing what’s weak.
🍞 Anchor: A gardener shapes the tree to grow healthier, not wilder.
🍞 Hook: Think LEGO bricks—you make big builds by snapping small, simple pieces.
🥬 Atomic Operations: Tiny, safe edits the agent can make (e.g., ADD_STATE, DELETE_STATE, MODIFY_TRANSITION, REVISE_INSTRUCTION).
- How it works: Each operation is local, interpretable, and reversible.
- Why it matters: Keeps evolution understandable and controllable.
🍞 Anchor: Adding a “Read PDF” step between Search and Summarize when web pages aren’t enough.
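A rough sketch of how such operations might look in code (names follow the paper's operation list; the FSM representation as a dict is an assumption for illustration):

```python
from copy import deepcopy

def add_state(fsm: dict, name: str, instruction: str) -> dict:
    """ADD_STATE: insert one new state without touching the rest."""
    new = deepcopy(fsm)
    new["states"][name] = instruction
    return new

def modify_transition(fsm: dict, src: str, cond: str, dst: str) -> dict:
    """MODIFY_TRANSITION: reroute a single edge."""
    new = deepcopy(fsm)
    new["transitions"][(src, cond)] = dst
    return new

def revise_instruction(fsm: dict, name: str, instruction: str) -> dict:
    """REVISE_INSTRUCTION: sharpen one state's prompt."""
    new = deepcopy(fsm)
    new["states"][name] = instruction
    return new
```

Each operation returns a fresh copy, so every edit is reversible simply by keeping the previous version of the FSM.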
🍞 Hook: After a spelling test, a teacher marks mistakes so you know what to fix.
🥬 Critic Mechanism: A checker that compares the answer to the question and spots failures (like missing citations or numbers).
- How it works: Reviews results; if flawed, triggers targeted evolution.
- Why it matters: Without a good checker, the agent can’t improve correctly.
🍞 Anchor: A referee calling out rule violations so the team practices the right drills.
🍞 Hook: Think of a notebook where you write what study tips worked and what didn’t, so you don’t repeat mistakes.
🥬 Self-Evolving Memory: A library of successful strategies (priors) and failure patterns (warnings) retrieved to start new tasks smarter.
- How it works: Saves optimized FSMs and reasons for changes; fetches the most similar ones for new queries.
- Why it matters: Builds compounding knowledge over time.
🍞 Anchor: For cloud price comparisons, a prior might say “Convert monthly to hourly,” preventing repeat errors.
02 Core Idea
🍞 Hook: You know how a cooking show has stations—chop, sauté, taste, plate—and the chef tweaks the order or instructions to perfect a dish?
🥬 The Aha! Moment: Make the agent’s plan an explicit Finite State Machine (FSM) and let it self-evolve only through small, safe, visible edits to the Flow (the order and transitions) and the Skills (what each state does), guided by a Critic and powered by a memory of past wins and fails.
Multiple Analogies:
- City Traffic: The FSM is the city’s traffic map. Flow edits are retiming lights or adding a crosswalk; Skill edits are training crossing guards to check IDs. The Critic is traffic monitoring; memory stores past fixes that eased jams.
- School Project: States are roles—Researcher, Fact-Checker, Writer. Flow edits rearrange who goes when; Skill edits refine the Researcher’s search prompts. The Critic is the teacher’s rubric; memory is a binder of great past projects.
- LEGO Factory: States are assembly steps. Flow edits add a quality check; Skill edits tighten the instruction for precise fits. The Critic is QA; memory keeps the best assembly blueprints.
Before vs After:
- Before: Agents either stuck to rigid loops or rewrote themselves freely, risking instruction drift and hallucinations.
- After: Agents change in precise ways: add a verifier, refine a browsing rule, or adjust a transition—without breaking everything else.
Why It Works (Intuition):
- Decoupling fixes: Many failures are either about doing steps in the wrong order (Flow) or doing a step poorly (Skill). Separating them makes diagnosis and repair fast and accurate.
- Guardrails: Atomic operations ensure edits are local and reversible, preventing spiral failures.
- Learning over time: Memory seeds new problems with proven playbooks while warning against known traps.
- Reliable oversight: The Critic focuses change where it helps most (e.g., “missing numbers,” “source too weak”).
Building Blocks (each as Sandwich):
🍞 Hook: Think of a treasure map with checkpoints. 🥬 Flow: The macroscopic transition logic that decides which state comes next.
- How it works: a transition function T: S × Context → S routes the path: Search → Read → Verify → Synthesize.
- Why it matters: Without it, the agent loops or skips vital checks. 🍞 Anchor: If evidence is thin, Flow can loop back to Search; if strong, move to Synthesize.
🍞 Hook: Like a musician practicing a tricky bar, not the whole song. 🥬 Skill: The microscopic instructions for what to do inside a state.
- How it works: A state’s prompt defines behaviors (e.g., “Extract exact numbers with units”).
- Why it matters: Weak skills cause vague answers even with a great plan. 🍞 Anchor: Changing Browse from “summarize” to “quote exact Wh/kg” turns fluff into facts.
🍞 Hook: A sports replay tells you what went wrong. 🥬 Critic Mechanism: The feedback loop that detects failure modes and triggers targeted evolution.
- How it works: Checks for logic, citations, quantity completeness; then suggests Flow/Skill fixes.
- Why it matters: Guides improvements to the right place. 🍞 Anchor: “No 2023 sources? Add Verifier and re-search with year filters.”
🍞 Hook: Your recipe book grows with each great meal. 🥬 Self-Evolving Memory: Stores optimized FSMs and their rationales.
- How it works: Retrieve top-k similar strategies; use successes as priors and failures as constraints.
- Why it matters: Faster starts, fewer repeated mistakes. 🍞 Anchor: For clinical queries, initialize with a Search ↔ Verify loop and PDF reading.
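Top-k retrieval over stored strategies might look like the following sketch (cosine similarity over query embeddings is a standard choice; the record fields are assumptions based on the description above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_emb, records, k=3):
    """Return the k stored strategies most similar to the new query.
    Each record: {"emb": [...], "fsm": ..., "outcome": "success" | "failure"}."""
    ranked = sorted(records, key=lambda r: cosine(query_emb, r["emb"]), reverse=True)
    return ranked[:k]
```

Retrieved successes seed the initial FSM as priors, while retrieved failures act as constraints ("don't repeat this edit").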
🍞 Hook: Small screwdriver, big fix. 🥬 Atomic Operations: ADD_STATE, DELETE_STATE, MODIFY_TRANSITION (Flow) and REVISE_INSTRUCTION (Skill).
- How it works: Local, interpretable, reversible edits.
- Why it matters: Keeps evolution safe and explainable. 🍞 Anchor: Insert Read PDF between Search and Summarize when HTML pages are low quality.
03 Methodology
High-Level Overview: Input → Retrieve top-k priors from memory → Initialize FSM → Execute states with tools → Critic evaluates → Apply atomic operations (Flow/Skill) → Update memory → Output answer
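The overview above maps to a simple control loop. This is a hedged skeleton, not the paper's code: the callables in `ops` (`retrieve`, `init_fsm`, `execute`, `critic`, `evolve`, `store`) are placeholders for the LLM- and tool-backed components described in the steps below.

```python
def run_evofsm(query, memory, ops, max_iters=3):
    """Skeleton of the EvoFSM loop. `ops` bundles the LLM/tool
    components as callables (names are illustrative placeholders)."""
    priors = ops["retrieve"](memory, query)         # warm-start from similar tasks
    fsm = ops["init_fsm"](query, priors)            # seed the initial state graph
    answer = None
    for _ in range(max_iters):                      # small iteration cap (e.g., 3)
        answer, trace = ops["execute"](fsm, query)  # run states with tools
        verdict = ops["critic"](query, answer, trace)
        if verdict["pass"]:
            ops["store"](memory, fsm, success=True) # save the optimized FSM as a prior
            return answer
        fsm = ops["evolve"](fsm, verdict)           # atomic Flow/Skill edits only
    ops["store"](memory, fsm, success=False)        # keep the failure as a warning
    return answer
```

The iteration cap and the restriction of `evolve` to atomic operations are what keep the loop from degenerating into unconstrained self-rewriting.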
Step-by-Step (with Sandwich for key parts):
- Input and Memory Warm-Start 🍞 Hook: Before a game, you review play highlights to start strong. 🥬 What it is: Retrieve top-k similar past strategies (both successes and failures) from the experience pool.
- How it works: Embed the new query; fetch records with optimized FSMs and rationales.
- Why it matters: Jump-starts with proven moves; avoids known traps. 🍞 Anchor: Cloud VM pricing? Recall: “Always convert monthly to hourly; prefer vendor docs.”
- FSM Initialization (M_init) 🍞 Hook: Set the board before playing. 🥬 What it is: Build an initial state graph (e.g., Problem Decompose → Search → Browse/Read → Verify → Synthesize) and state prompts.
- How it works: Use retrieved priors to seed states, transitions, and instructions; cap states (≤10) to keep it lean.
- Why it matters: Provides a stable backbone for controlled evolution. 🍞 Anchor: For medical evidence: add Read PDF and Verify early, using ClinicalTrials.gov/FDA constraints.
- Execution within the FSM 🍞 Hook: Follow the map, one checkpoint at a time. 🥬 What it is: Run the active state’s skill, collect outputs, and use Flow logic to pick the next state.
- How it works: Search issues queries (Serper API); Read uses Jina Reader/PDF parser; Verify checks date/source; Synthesize compiles a cited answer.
- Why it matters: Structured steps prevent chaos and reduce wasted loops. 🍞 Anchor: If Verify says “no 2023 data,” go back to Search with a refined query (‘annual report 2023 pdf’).
- Critic Mechanism (C) Evaluation 🍞 Hook: Teacher grading your project. 🥬 What it is: Judge whether the output meets the query’s needs (numbers, citations, logic).
- How it works: The Critic (LLM) inspects the final trajectory and output; flags failure modes like “missing quantitative evidence,” “weak sources,” or “logical gaps.”
- Why it matters: Without this, the agent can’t learn what to fix. 🍞 Anchor: Verdict ❌ FAIL: “Missing statistical rates (%). Flow Gap: need PDF tool; Skill Gap: search too generic.”
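In the paper the Critic is itself an LLM; the toy heuristic below only illustrates the interface, flagging two of the failure modes mentioned above and mapping each to a fix type. The specific checks and the "skill"/"flow" labels are assumptions.

```python
import re

def toy_critic(query: str, answer: str) -> dict:
    """Illustrative stand-in for the LLM Critic: returns a verdict
    plus (failure_mode, fix_type) flags for targeted evolution."""
    flags = []
    # Quantitative question but no number in the answer -> Skill gap.
    if "how many" in query.lower() and not re.search(r"\d", answer):
        flags.append(("missing quantitative evidence", "skill"))
    # Crude citation check (e.g., "[1]") -> Flow gap: add a Verify/Cite state.
    if "[" not in answer:
        flags.append(("missing citations", "flow"))
    return {"pass": not flags, "flags": flags}
```

A "skill" flag would trigger REVISE_INSTRUCTION on the offending state, while a "flow" flag would trigger ADD_STATE or MODIFY_TRANSITION.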
- Structured Self-Evolution via Atomic Operations 🍞 Hook: Tighten a loose screw or add a safety rail—don’t rebuild the house. 🥬 What it is: Apply small, safe edits to Flow or Skill.
- Flow Operators (O_flow): • ADD_STATE: Insert Verifier/Read PDF to bridge gaps. • DELETE_STATE: Remove redundant steps. • MODIFY_TRANSITION: Adjust when to loop back or move forward.
- Skill Operators (O_skill): • REVISE_INSTRUCTION: Sharpen a state’s prompt (e.g., prefer official PDFs; extract exact values with units).
- Why it matters: Local, interpretable, reversible fixes keep the system stable while adapting. 🍞 Anchor: Insert Read PDF (Flow) after Search; update Search’s instruction to target FDA/ClinicalTrials.gov (Skill).
- Re-Execution and Convergence 🍞 Hook: Try the improved play. 🥬 What it is: Run the updated FSM; if still failing, repeat Critic → Evolution up to a small iteration cap (e.g., 3).
- How it works: Each iteration updates the state graph or instructions; accuracy often rises with a few passes.
- Why it matters: Deep problems may need multiple, compounding adjustments. 🍞 Anchor: DeepSearch accuracy climbs notably with more iterations (up to +16% in experiments).
- Self-Evolving Memory Update (E = E+ ∪ E−) 🍞 Hook: Add notes to your playbook. 🥬 What it is: Store the optimized FSM, operations sequence, and rationale if successful; record failure patterns as warnings.
- How it works: Create a strategy record r with embeddings, M_optimized, and change reasons.
- Why it matters: Future tasks start smarter; repeated mistakes shrink over time. 🍞 Anchor: For legal analysis, memory records “Add Legal_Verifier; search by ‘Article/Recital’ terms.”
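A strategy record might be modeled as below. The field names are illustrative, following the description above (embedding, optimized FSM, operation sequence, rationale, and a success flag separating E+ from E−):

```python
from dataclasses import dataclass

@dataclass
class StrategyRecord:
    """One entry in the experience pool E = E+ ∪ E−. Field names
    are assumptions based on the description in the text."""
    query_embedding: list
    optimized_fsm: dict
    operations: list       # e.g., ["ADD_STATE Legal_Verifier", "REVISE_INSTRUCTION Search"]
    rationale: str
    success: bool          # True -> prior (E+), False -> warning (E-)

def update_memory(pool: list, record: StrategyRecord) -> list:
    """Append a finished task's record to the experience pool."""
    pool.append(record)
    return pool
```

Storing the operation sequence and rationale, not just the final FSM, is what lets future retrievals explain *why* a strategy worked or failed.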
- Output 🍞 Hook: Present your science fair poster with clear sources and numbers. 🥬 What it is: Final answer citing sources and showing the reasoning path.
- How it works: Synthesize from verified evidence; include key numbers and quotes.
- Why it matters: Trustworthy answers depend on transparent steps and strong citations. 🍞 Anchor: A table comparing Tesla/BYD/NIO battery energy density (Wh/kg) with references.
The Secret Sauce:
- Decouple Flow vs Skill so you can fix order vs quality separately.
- Use tiny, named operations to keep changes safe and explainable.
- Let a Critic target fixes precisely.
- Grow a memory of what works to accelerate new tasks. Together, these turn vague self-rewrites into precise, reliable self-evolution.
04 Experiments & Results
🍞 Hook: Think of a school tournament where teams solve multi-step puzzles by gathering clues from different rooms.
🥬 The Test: Researchers evaluated EvoFSM on five multi-hop QA benchmarks—HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle, and xbench-DeepSearch—where answering requires collecting and connecting evidence across multiple documents.
- What they measured: Accuracy (did the agent get the final answer right?)
- Why it matters: Multi-hop QA stresses both planning (Flow) and precision (Skill), perfect for testing structured evolution.
🍞 Anchor: Like needing to find a map in Room A, a key in Room B, and instructions in Room C before you can unlock the final chest.
The Competition:
- Standard RAG: One-shot retrieval (search once, then answer).
- Agentic RAG: Iterative retrieve-and-reason loop.
- Search-o1: Adds Reason-in-Documents to distill better evidence during loops.
Scoreboard with Context:
- EvoFSM consistently outperforms baselines across different backbone LLMs (GPT-4o, Claude-4, Llama-3-70B, DeepSeek-v3, Qwen3-32B).
- On DeepSearch (the toughest), EvoFSM hits 58.0% with Claude-4 and 51.0% with DeepSeek-v3, beating Search-o1 by notable margins (e.g., +11% with Claude-4, +10% with GPT-4o). Think of this as moving from a class average of B- to a solid A- on the hardest exam.
- Iterative methods already improve over Standard RAG; EvoFSM’s structure makes the loops smarter, cutting dead-ends and stabilizing evidence gathering.
Ablation Insights (Why each part matters):
- w/o Structured Evolution (Static FSM): Performance drops, especially on DeepSearch (−15 pts), showing that adaptation is crucial when problems change.
- w/o FSM Topology (Unstructured Evolution): Also drops (−9 pts on DeepSearch); free-form rewrites lack guardrails and can loop or drift.
- w/o All (Standard ReAct): Lowest scores overall (−17 pts on DeepSearch), confirming the synergy: structure + evolution beats either alone.
Surprising Findings:
- More evolution iterations help complex tasks much more than simple ones. DeepSearch gains up to +16% with added iterations, while Bamboogle saturates after ~3—evidence that EvoFSM scales its effort to task complexity.
- Generalization: Beyond research QA, EvoFSM improves success on interactive decision-making tasks (ALFWorld, WebShop), suggesting that structured flows with verifiers help avoid degenerate looping and surface higher-quality actions.
Why It Matters:
- The wins are consistent across models, meaning the benefits come from the framework design, not a lucky pairing with one LLM.
- Better accuracy with controlled steps means more trustworthy answers, clearer citations, and fewer wild guesses.
🍞 Anchor: When asked to compare GPUs or laws, EvoFSM is like a student who shows work, cites pages, and corrects strategy mid-test if a step was weak—earning higher, steadier grades than classmates who either never adapt or rewrite the whole plan mid-exam.
05 Discussion & Limitations
Limitations (Honest Talk):
- Reliance on General LLMs: EvoFSM uses off-the-shelf models via prompting, not fine-tuned weights. This keeps it accessible but may limit efficiency and how deeply the LLM internalizes FSM logic.
- Critic Fragility: The Critic (an LLM) can miss subtle errors or falsely approve weak answers. A mistaken Critic can nudge evolution the wrong way.
- Memory Scalability: The experience pool grows without consolidation. Over time, retrieval can get slower and bring back redundant or outdated strategies.
Required Resources:
- A capable LLM (proprietary or open-weight), multi-agent orchestration (e.g., AutoGen), web tools (Serper for search, Jina Reader/PDF parsers), and guardrails (state cap ≈10; iteration cap ≈3). Sane tool budgets help avoid flailing.
When NOT to Use:
- Purely creative writing or opinion tasks where strict flows add little value.
- Domains without accessible, verifiable sources (the Verify state loses power).
- Low-latency micro-tasks where the overhead of states/verification outweighs benefits.
- Environments without web access, when tasks require live retrieval.
Open Questions:
- Can we train or distill the Critic into a more reliable, verification-guided judge?
- How to compress, merge, and forget in memory so it stays fast and fresh?
- What new atomic operations are both safe and powerful (e.g., SPLIT_STATE, MERGE_STATES)?
- How to fine-tune smaller, specialized agents to internalize common flows and reduce prompting cost?
- Safety: How to formalize guarantees that evolution cannot bypass core constraints (e.g., citation requirements)?
06 Conclusion & Future Work
Three-Sentence Summary:
- EvoFSM turns self-improvement from risky free-form rewrites into safe, targeted edits by evolving an explicit FSM with small atomic operations.
- It separates changes into Flow (the route) and Skill (how each step acts), guided by a Critic and accelerated by a self-evolving memory of past wins and fails.
- Across hard multi-hop tasks and interactive settings, EvoFSM delivers higher accuracy and steadier behavior than strong baselines.
Main Achievement:
- A practical, controllable blueprint for agent self-evolution that preserves stability while adapting effectively to open-ended research problems.
Future Directions:
- Train sturdier Critics; distill flows into smaller specialized agents; add memory consolidation (merge/prune/abstract); expand safe atomic operations; formalize safety guarantees.
Why Remember This:
- EvoFSM shifts the narrative: self-evolving agents don’t have to be chaotic. With a clear map (FSM), tiny tools (atomic ops), a careful coach (Critic), and a growing playbook (memory), agents can learn reliably in the wild—and show their work along the way.
Practical Applications
- Competitive analysis: Compare product specs and pricing across vendors with source-verified tables.
- Healthcare literature review: Pull exact trial outcomes from official PDFs (e.g., ClinicalTrials.gov, FDA).
- Legal due diligence: Locate and cite specific Articles/Recitals from official Acts and directives.
- Finance research: Extract metrics from earnings reports and convert units consistently (e.g., quarterly to annualized).
- Policy analysis: Contrast clauses across versions of a bill with date-verified sources.
- Academic study assistant: Build multi-source summaries that quote key numbers and cite pages.
- E-commerce decision aid: Search, filter, verify, and recommend products from large catalogs (WebShop-like).
- Software repo triage: Navigate repositories with stateful exploration and verification steps (e.g., code owners, tests).
- Market monitoring: Track changes by year/version using Verifier states to prevent outdated citations.
- Enterprise knowledge search: Enforce evidence quality and structured synthesis for internal documents.