Yunjue Agent Tech Report: A Fully Reproducible, Zero-Start In-Situ Self-Evolving Agent System for Open-Ended Tasks
Key Summary
- This paper builds an AI agent that learns new skills while working, like a kid who learns new tricks during recess without a teacher telling them what to do.
- Instead of relying on a fixed toolbox, the agent creates small, reusable Python tools on the fly and keeps the good ones for next time.
- A special parallel batch evolution process lets many tasks run at once, then merges similar tools so the toolbox stays neat and not cluttered.
- Because code either runs or errors, tools give a clear 'yes/no' signal, so the agent can improve without human labels or grades.
- Across tough benchmarks (HLE, DeepSearchQA, FinSearchComp, xBench), the system starts from zero tools and still beats strong baselines.
- When warmed up with tools learned before, it transfers knowledge to new domains and needs far fewer new tools to do well.
- A new metric, Evolutionary Generality Loss (EGL), tracks if the agent is reusing tools (good, low EGL) or constantly inventing new ones (bad, high EGL).
- The system uses a simple multi-agent team (Manager, Tool Developer, Executor, Integrator) plus an Aggregator and Merger to keep tools clean and converged.
- Results show rapid early gains and later stability, like practice leading to mastery, with fewer mistakes and fewer tokens spent per tool use.
- All code, traces, and evolved tools are open-sourced to help others build reliable, self-improving agents.
Why This Research Matters
Real-world tasks change constantly: new sources, formats, and rules appear overnight. Agents that can only do what they were born with fall behind; agents that grow their toolboxes during use keep up. A crisp success/failure signal from executing tools enables safe, label-free improvement, saving both time and tokens. Because tools are small and reusable, progress compounds: each new tool unlocks many future solutions. The system's merging and convergence checks keep everything tidy and reliable, so users get consistent, verifiable answers. Open-sourcing code, traces, and tools means the community can build trustworthy, evolving AI together.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're starting a new school where every day brings totally different homework: biology one day, art the next, and then a surprise math puzzle. There's no single workbook that covers it all, so you need to pick up new skills as you go.
The World Before: For years, AI agents mostly came in two flavors. Some were powerful but closed, like mystery boxes you couldn't peek into. Others were open and inspectable but relied on fixed, hand-made toolkits and scripts. These open agents could do a lot (search the web, parse files, use calculators), but they didn't grow much beyond what their creators pre-installed. When the world changed (new file formats, new websites, new task styles), the agent's abilities didn't keep up unless engineers updated them offline. It was like carrying a pre-packed backpack: useful, but it never refilled itself.
The Problem: Real life is open-ended. Tasks drift and surprise us. In these wild settings, you rarely get neat teacher labels saying "Correct!" or "Wrong!" Agents that depend on training with clear rewards struggle. Changing a workflow or pleasing a user can be fuzzy and delayed ("Did they like my summary?"), so learning is slow. But there is one part of agents that gets crisp, instant feedback: tools. Code runs or it errors. That binary signal is gold for learning on the fly.
Failed Attempts: People tried three main ideas. (1) Train better workflows so the agent plans smarter. It helped, but still needed labels or domain-specific setups. (2) Grow better memories to reuse experience. Useful, but if you're missing the right tool (like a PDF extractor), memory alone can't do the job. (3) Create tools on the fly in special areas (like programming or biomedicine). Good progress, but often narrow and not reused across tasks.
The Gap: We needed a way for agents to upgrade themselves during use, right where the action happens, without handholding or labels, and to do it in a general, reusable way. That means treating each task as a learning chance, collecting the sharp signals from execution (run vs. error), and folding the wins back into a shared toolbox that steadily gets broader (more coverage) and deeper (more robust).
Real Stakes: Why should anyone care? Because your search assistant, research helper, or finance analyst bot faces shifting websites, new papers, surprise data formats, and evolving rules every day. If your agent can only do what it was born with, it stalls. But if it can make new tools safely, test them, and keep the best, then tomorrow's problem becomes easier. That saves time (fewer do-overs), cuts cost (fewer tokens per step), and boosts trust (more consistent, verified steps). Think of it like building a library of Lego blocks. The more solid blocks you have, the faster you snap together solutions, no matter how weird the assignment.
Anchor: Picture a student agent asked, "Find the 2019-2021 revenue trend for Company X from official filings." If it lacks a PDF parser, it builds one, tests it, fixes errors, and saves it. Next time any task needs PDFs, that solid parser is ready. Over weeks, it collects web search, table readers, math solvers, and more, so future homework feels easy.
02 Core Idea
Hook: You know how a chef learns new mini-techniques during busy dinner service, like a faster way to julienne carrots, and then uses them forever after? The kitchen gets better while cooking.
The Aha! Moment (one sentence): Treat each task as practice time to invent, test, fix, and keep small tools, so the agent steadily upgrades itself without needing labels.
Multiple Analogies:
- Toolbox analogy: Start with an empty toolbox. Each tricky job forces you to craft a new screwdriver head or wrench adapter. If it works well, you keep it. Soon, you can fix almost anything fast.
- Video game analogy: As you play, you unlock abilities (double jump, wall climb). New levels become doable by combining old powersâno new tutorial needed.
- Lego analogy: Each solved task gives you a new Lego brick. Future builds snap together faster from your growing brick set.
Before vs. After:
- Before: Agents copied fixed playbooks, struggled with new formats, and needed offline retraining or human labels.
- After: Agents create and refine tools mid-task, reuse them across domains, and converge to a stable, compact library that handles most surprises.
Why It Works (intuition, no equations):
- Tools give clean feedback: run success vs. exception. That clarity makes self-correction easy.
- Small, atomic tools are reusable like universal adapters. The same web-search or CSV-preview tool helps across countless tasks.
- Parallel batches are like study groups: multiple attempts in a round, then merge the best ideas so the library doesn't bloat.
- As reuse beats invention, you know you're generalizing. That's what the EGL metric tracks.
Building Blocks (with sandwich explanations for new concepts):
You know how a plant grows right in the garden, not in a lab, adapting to sun and soil on the spot? In-Situ Self-Evolution: It means the agent improves while it's actively solving tasks, not just during training.
- How it works: (1) Tackle a task; (2) If missing a skill, make a tiny tool; (3) Run it and read errors; (4) Fix and harden; (5) Save it for reuse.
- Why it matters: Without it, agents freeze at their birth skill level and can't keep up with real-world change. Anchor: The agent needs to parse a new spreadsheet type today; it builds a small reader now and uses it forever after.
Imagine your phone adding new apps automatically when you need them. Tool Evolution: The agent steadily creates, improves, and reuses small Python tools.
- How it works: Synthesize tool → test on the task → refine from error logs → keep it if robust (see the loop sketch after this block).
- Why it matters: No tool, no solution. Tools unlock actions that words alone can't do (like reading PDFs or computing integrals). Anchor: A math tool for symbolic equations made for one problem later helps solve physics homework too.
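A minimal sketch of this synthesize-test-refine-keep loop is below. The helper names (llm_write_tool, llm_fix_tool, run_in_sandbox) are hypothetical placeholders for illustration, not the report's actual API; the key point is that the only feedback needed is whether execution raises an exception.

```python
# Hedged sketch of tool evolution: draft a tool, test it on the task,
# refine from the error trace, and keep it only if it runs cleanly.
# llm_write_tool, llm_fix_tool, and run_in_sandbox are hypothetical helpers.
def evolve_tool(request: str, sample_input: dict, max_fixes: int = 3):
    source = llm_write_tool(request)              # first draft of the Python tool
    for _ in range(max_fixes):
        try:
            run_in_sandbox(source, sample_input)  # binary signal: runs or raises
            return source                         # robust enough to keep and reuse
        except Exception as exc:
            source = llm_fix_tool(source, str(exc))  # repair using the error log
    return None                                   # discard drafts that never stabilize
```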
Picture a relay team where each runner has a job: planning, coding, running, and summarizing. Multi-Agent System: A few specialized roles team up (Manager, Tool Developer, Executor, Integrator) with Aggregator/Merger for batch clean-up.
- How it works: Manager picks/requests tools, Tool Developer codes them, Executor solves the task using ReAct-style steps, Integrator forms the final answer, Aggregator/Merger organizes tools across a batch.
- Why it matters: Division of labor makes complex tasks tractable and keeps the toolbox disciplined. Anchor: The Manager says, "We need a PDF tool," the Tool Developer builds it, the Executor uses it, the Integrator writes the report.
Think of many classmates doing similar worksheets, then the teacher collecting the best solution steps into one guide. Parallel Batch Evolution: Handle many tasks at once, then cluster and merge similar tools.
- How it works: (1) Parallel tasks create local tools; (2) Cluster by function; (3) Merge into one canonical tool; (4) Update the global set.
- Why it matters: You get speed and cleanliness: fast progress without a messy, bloated toolbox. Anchor: Ten slightly different web fetchers become one reliable fetch_web_text.
Imagine a closet that stays tidy only if duplicates get folded into one neat stack. Tool Absorbing Mechanism: After a batch, functionally similar tools are grouped and merged.
- How it works: LLM-based clustering → pick/merge best implementations → keep one canonical version.
- Why it matters: Prevents confusion during retrieval and speeds future tasks. Anchor: "fetch_url_text", "get_page_text", "webpage_to_text" become one clean fetch_web_text.
You know how a car dashboard shows if you're being efficient? Evolutionary Generality Loss (EGL): A meter that rises when you're inventing many new tools per use, and falls when you're mostly reusing.
- How it works: Track new tools vs. total tool uses over time; lower is better generalization (a rough formula sketch follows this block).
- Why it matters: Without a gauge, you don't know if you're truly learning or just patching. Anchor: Early days: high EGL (lots of invention). Later: low EGL (mostly reuse), just like moving from clumsy practice to smooth mastery.
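One plausible reading of EGL, and only a reading (the report's exact formula may differ), is the share of tool activity that is invention rather than reuse over a window of queries:

```python
# Hedged sketch: one plausible way to compute an EGL-style gauge.
# The report's exact definition may differ; treat this as an illustration only.
def egl(new_tools_created: int, total_tool_calls: int) -> float:
    """Fraction of tool activity that is invention rather than reuse (lower is better)."""
    if total_tool_calls == 0:
        return 0.0
    return new_tools_created / total_tool_calls

print(egl(new_tools_created=12, total_tool_calls=40))   # early phase: 0.30
print(egl(new_tools_created=1, total_tool_calls=55))    # later phase: ~0.018
```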
Finally, think of a house plan that stays the same while you upgrade appliances. Agent Architecture: Keep workflow and prompts steady, evolve the tools.
- How it works: Fixed team roles + stable prompts; only the toolbox grows.
- Why it matters: Controls complexity and makes improvements reproducible. Anchor: The kitchen layout stays; you just keep buying better appliances.
03 Methodology
At a high level: Input (a batch of user queries) → Retrieve needed tools or synthesize new ones → Execute tasks with ReAct-style steps → Reflect, refine, and validate tools → Batch-cluster and merge tools → Output final answers + updated global toolbox. A minimal end-to-end sketch of this pipeline appears below.
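The following sketch wires the roles together for one batch. Every function name here (manager_select_or_request, executor_solve, integrator_compose, absorb_batch) is a hypothetical stand-in for a Yunjue Agent component, shown only to make the data flow concrete.

```python
# Hedged sketch of one batch of the pipeline; every helper is a hypothetical
# placeholder for a Yunjue Agent role, not its real API.
def run_batch(queries: list, global_tools: dict):
    answers, local_tool_sets = [], []
    for query in queries:                                  # in the paper this runs in parallel
        tools, new_tools = manager_select_or_request(query, global_tools)
        history = executor_solve(query, {**tools, **new_tools})  # ReAct-style loop
        answers.append(integrator_compose(query, history))       # final, cited answer
        local_tool_sets.append(new_tools)                        # P_{t,i} for this query
    global_tools = absorb_batch(global_tools, local_tool_sets)   # cluster + merge -> T_t
    return answers, global_tools
```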
Step-by-step (what, why, example):
- Manager analyzes the task and picks tools
- What happens: For each query, the Manager checks the global repository T and selects T_sub. If there's a gap, it prepares a request for new tools with a clear name, description, and input/output schemas (a hypothetical request is shown after this step).
- Why it exists: The agent must choose the smallest, single-purpose tools; otherwise, the toolbox becomes a tangle of overlapping functions.
- Example: Task: "Find Tesla's 2021 revenue from official sources and compute 2019–2021 growth." Manager picks web_search, fetch_web_text, download_file, extract_pdf_text, evaluate_expression_math. If extract_pdf_text is missing, it requests it.
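A tool request might look roughly like the dictionary below. The field names are assumptions for illustration; the report only states that a request carries a name, description, and input/output schemas.

```python
# Hedged example of a TOOL_REQUEST payload; field names are assumptions.
TOOL_REQUEST = {
    "name": "extract_pdf_text",
    "description": "Extract plain text from a local PDF file for downstream reasoning.",
    "input_schema": {"path": "str, local filesystem path to the PDF"},
    "output_schema": {"text": "str, plain text content (LLM-friendly, no raw bytes)"},
}
```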
- Tool Developer turns a request into a robust Python tool
- What happens: Given the TOOL_REQUEST, it writes a fully formed Python module with TOOL_META, Pydantic Input/Output models, and run(input) → output. It avoids returning raw HTML or big binaries, uses retries/timeouts, and produces LLM-friendly outputs (a minimal module sketch follows this step).
- Why it exists: Strong I/O contracts and clean outputs prevent context pollution and make tools dependable and reusable.
- Example: For extract_pdf_text, it takes the PDF path as input, returns plain text, and throws a clear error if the file is missing or zero-size.
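A generated module might look roughly like this sketch for extract_pdf_text; the exact TOOL_META fields, error conventions, and the choice of pypdf as the PDF library are assumptions, not the report's actual code.

```python
# Hedged sketch of a generated tool module; TOOL_META fields and the pypdf
# dependency are assumptions, not the report's exact conventions.
import os
from pydantic import BaseModel, Field
from pypdf import PdfReader  # assumed third-party dependency

TOOL_META = {
    "name": "extract_pdf_text",
    "description": "Extract plain text from a local PDF file.",
}

class Input(BaseModel):
    path: str = Field(..., description="Local filesystem path to the PDF")

class Output(BaseModel):
    text: str = Field(..., description="Plain, LLM-friendly text content")

def run(inp: Input) -> Output:
    # Fail loudly and clearly when the file is missing or empty, as the text describes.
    if not os.path.exists(inp.path) or os.path.getsize(inp.path) == 0:
        raise ValueError(f"PDF missing or empty: {inp.path}")
    reader = PdfReader(inp.path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return Output(text=text)
```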
- Executor solves the task using ReAct
- What happens: The Executor plans, calls tools, reads results, retries if needed, and continues. It can pause to ask the Manager for a missing capability.
- Why it exists: Real tasks need reasoning plus action. The ReAct loop allows careful, stepwise tool use guided by evidence.
- Example data path (a simplified executor loop is sketched after this list):
- web_search("Tesla 2021 10-K") → URLs
- fetch_web_text(URL of SEC page) → link to PDF
- download_file(PDF URL) → local path
- extract_pdf_text(path) → text with revenue figures
- evaluate_expression_math("growth = (rev2021 - rev2019)/rev2019") → final number
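A bare-bones version of the ReAct loop could look like the sketch below; llm_next_step is a hypothetical planning helper, and the real Executor is certainly richer than this.

```python
# Hedged sketch of a ReAct-style executor loop; llm_next_step is a
# hypothetical helper, not the paper's implementation.
def executor_solve(task: str, tools: dict, max_steps: int = 12) -> list:
    history = []
    for _ in range(max_steps):
        step = llm_next_step(task, history, list(tools))   # returns thought + action + args
        if step["action"] == "final_answer":
            break
        if step["action"] not in tools:
            history.append({"observation": f"missing tool: {step['action']}"})
            continue                                        # Manager would be asked to add it
        try:
            observation = tools[step["action"]](**step["args"])
        except Exception as exc:                            # run/exception is the feedback signal
            observation = f"ERROR: {exc}"
        history.append({"thought": step["thought"],
                        "action": step["action"],
                        "observation": observation})
    return history                                          # handed to the Integrator
```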
- Integrator composes the final answer
- What happens: Integrates execution history into a clean response, citing sources. Ensures the answer fits requested format (units, date style, etc.).
- Why it exists: Users need one crisp, verified answer, not raw logs.
- Example: "Tesla revenue in 2021 was $53.8B; 2019–2021 growth ≈ X%. Sources: SEC filing."
- Post-execution reflection and tool refinement
- What happens: Read error traces and unusual edge cases; harden the tool (better validation, saner defaults). Promote it to the global set T if robust.
- Why it exists: Early tools may be brittle. Tightening them now saves future time and tokens.
- Example: If extract_pdf_text choked on scanned PDFs, add OCR fallback or improve error messages (a small hardening sketch follows this step).
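As an illustration of the kind of hardening this step might produce, the snippet below extends the earlier hypothetical extract_pdf_text module (same Input/Output models and imports) with an empty-text check and a clearer message for scanned PDFs; the OCR fallback itself is only stubbed.

```python
# Hedged hardening sketch for the hypothetical extract_pdf_text tool from the
# earlier module sketch: detect scanned/empty PDFs and fail with an actionable error.
import os
from pypdf import PdfReader  # assumed dependency, as in the earlier sketch

def run(inp: Input) -> Output:
    if not os.path.exists(inp.path) or os.path.getsize(inp.path) == 0:
        raise ValueError(f"PDF missing or empty: {inp.path}")
    text = "\n".join(page.extract_text() or "" for page in PdfReader(inp.path).pages)
    if not text.strip():
        # Likely a scanned PDF without a text layer; a real fix would add an OCR fallback here.
        raise ValueError(f"No extractable text in {inp.path}; it may be a scanned image PDF")
    return Output(text=text)
```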
- Parallel Batch Evolution with tool absorbing
- What happens: For a batch Q_t of B queries, each creates local tools P_{t,i}. After the batch, aggregate {T_{t-1}, P_{t,1}, …, P_{t,B}}, cluster by functional semantics, and merge each cluster to produce T_t (a small absorbing sketch follows this step).
- Why it exists: Parallelism speeds learning, but would create duplicates. Absorbing prevents bloat and confusion.
- Example: Ten variants of page-text fetchers get merged into one canonical fetch_web_text with best headers/retries.
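The absorbing step might be structured roughly as below; llm_cluster_by_function and llm_merge_cluster are hypothetical stand-ins for the LLM-based clustering and merging the report describes.

```python
# Hedged sketch of batch-time tool absorbing: aggregate, cluster, merge.
# llm_cluster_by_function and llm_merge_cluster are hypothetical helpers.
def absorb_batch(previous_tools: dict, local_tool_sets: list) -> dict:
    candidates = dict(previous_tools)                 # T_{t-1}: tool name -> tool object
    for local in local_tool_sets:                     # P_{t,1} ... P_{t,B}
        candidates.update(local)
    clusters = llm_cluster_by_function(candidates)    # groups of functionally similar tool names
    merged = {}
    for group in clusters:
        name, tool = llm_merge_cluster([candidates[n] for n in group])  # pick/merge best version
        merged[name] = tool                           # keep one canonical tool per cluster
    return merged                                     # updated global set T_t
```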
- Convergence monitoring with EGL
- What happens: Continuously compute EGL from new-tool count vs. tool-usage count. Expect a downward trend as the toolbox generalizes.
- Why it exists: A simple, label-free gauge of learning progress and stability, akin to a training loss for agents at inference.
- Example: On a new domain, EGL spikes briefly (new needs), then falls as reuse dominates.
Secret Sauce (what makes it clever):
- Binary, verifiable feedback: Tools run or fail; no human label needed.
- Atomic design: Small, single-purpose tools maximize cross-task reuse.
- Best-of-N via batching: Multiple parallel tool drafts, then pick/merge the winner.
- Convergence control: Clustering + merging + EGL stops tool explosion and keeps retrieval unambiguous.
Mini walk-through with actual-style data:
- Query: "Find the definition of 'bond duration' from a reliable source and compute the duration for a simple cash flow."
- Manager picks web_search, fetch_web_text, evaluate_expression_math. If a "parse_cashflows_csv" tool is missing for a provided file, it requests it.
- Tool Developer creates parse_cashflows_csv(Input: path, Output: list of {t, cf}).
- Executor: web_search("bond duration definition site:investopedia.com") → URL; fetch_web_text(URL) → definition text; parse_cashflows_csv("flows.csv") → structured list; evaluate_expression_math("duration formula with given yield and flows") → numeric duration (a toy version of this final computation is sketched after the walk-through).
- Integrator: Returns definition in words + computed duration, citing the Investopedia URL and file-based computation steps.
- Reflection: If math tool threw precision errors, refine input validation and numeric stability; commit improved tool.
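To make the last step concrete, here is a toy Macaulay-duration computation over the kind of {t, cf} list the hypothetical parse_cashflows_csv would return; the tool names, CSV layout, and yield value are assumptions for this example.

```python
# Toy illustration of the walk-through's final computation; tool names and data
# layout are assumptions. Macaulay duration = sum(t * PV_t) / sum(PV_t).
def macaulay_duration(cashflows: list, yield_rate: float) -> float:
    pv = [c["cf"] / (1 + yield_rate) ** c["t"] for c in cashflows]  # present value of each flow
    return sum(c["t"] * v for c, v in zip(cashflows, pv)) / sum(pv)

flows = [{"t": 1, "cf": 50.0}, {"t": 2, "cf": 50.0}, {"t": 3, "cf": 1050.0}]
print(round(macaulay_duration(flows, 0.05), 2))  # ~2.86 years for this toy 5% bond
```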
04 Experiments & Results
The Test: The team evaluated whether a zero-start agent (empty toolbox at the beginning) can (1) quickly grow a strong toolset, (2) outperform strong baselines on varied tasks, (3) reuse tools across domains (warm-start), and (4) converge (as shown by low EGL and stable accuracy).
Benchmarks and Why:
- HLE (Humanity's Last Exam): expert-level, multi-topic reasoning; tests breadth.
- DeepSearchQA (DSQA): deep web research; tests multi-step retrieval and synthesis.
- FinSearchComp (FSC): finance tasks; tests time-sensitive lookup and quantitative reasoning.
- xBench ScienceQA (xSciQA) and xBench DeepSearch (xDS): Chinese-language science and deep research; tests cross-lingual adaptability and real-world productivity.
The Competition: They compared against top proprietary and open systems (e.g., GPT-5 series, Gemini 3 Pro, Claude, etc.). Where allowed, baselines had web and Python; on xSciQA, rules restricted tools but they ensured parity.
Scoreboard with context:
- HLE: Yunjue Agent 48.0% vs. backend 45.8%, like bumping from a solid B- to B despite starting with no tools, ranking behind only the very top proprietary model listed.
- DSQA: 73.5%, a jump of +17.4 points over the Gemini 3 Pro baseline (56.6%). That's like moving from a C+ to a strong A- on a tough research exam.
- FSC: 65.0% vs. 49.9% for a key baseline, about +15 points; a big leap in finance where precision matters.
- xSciQA: 76.5%, topping published baselines; a record-setting score for this test.
- xDS: 59.7%, second only to one top model, beating others by clear margins.
Surprising Findings:
- Tool emergence pattern: Across tasks, a few fundamental tools were used a lot (web_search, fetch_web_text, evaluate_expression_math), proving that well-chosen atomic tools generalize widely.
- Warm-start power: Starting DSQA and xSciQA with tools learned on HLE barely needed new tools (down by 32% on DSQA and 100% on xSciQA), yet performance held or improved (e.g., xSciQA rose from 76.5 to 80.2). It's like transferring chess tactics from one tournament to another.
- Convergence: Only 97 tools were created across 2,500 HLE queries, despite topic shifts; later domains showed near-flat growth in tool count. Thatâs strong evidence the toolbox stabilized.
- Efficiency gains: Against a Python-only baseline (no tool reuse), Yunjue achieved far higher accuracy (up to +11.5 points) with >99% tool success and dramatically fewer tokens per call (about 100–190 vs. ~518). The baseline's raw, verbose code attempts polluted context and hurt reasoning.
EGL as a progress meter:
- On HLE, EGL dropped sharply and stabilized after roughly 1,000 queries, matching accuracy gains measured at checkpoints (10%, 40%, 70%, 100%). This pattern of big early gains and small later gains looks like moving from learning to mastery.
Batch size effects:
- Larger batches sped early tool creation (steeper initial growth) but still converged to similar final library sizes. Average tokens per tool use fell over time in all settings, confirming transition from costly invention to cheap reuse.
Takeaway: Starting from zero, the agent not only matches or beats strong baselines across very different domains, but it also gets faster and tidier as tools converge. Warm-start transfer further cuts tool creation needs while nudging scores upward.
05 Discussion & Limitations
Limitations (honest view):
- Early cold-start drag: With an empty toolbox, the first stretch needs invention and may be slower or less accurate until core tools appear.
- Stochastic variance: Different runs (or batch sizes) can yield slightly different tool paths and token costs due to LLM randomness.
- Fixed workflow and prompts: By design, the study holds workflow and context steady; some tasks might benefit from co-evolving plans and memories (e.g., heavy personalization).
- Narrow feedback scope: Binary run/exception is great for tools but doesn't capture subjective goals (style, preference) without extra signals.
- Security/sandboxing: Executing tools safely is critical; while the paper uses guardrails (e.g., LLM-friendly outputs), real deployments need strict isolation and policy checks.
Required Resources:
- A capable LLM backend and a code-capable assistant for tool generation.
- Stable execution sandbox with network/file I/O and dependency management.
- Storage for the global tool registry and batch-time clustering/merging passes.
When NOT to Use:
- Purely conversational, preference-heavy tasks with no objective execution steps; tool evolution won't capture taste or tone without feedback.
- Extremely specialized domains requiring certified compliance (medical/aviation) unless tools pass audits and strict validation.
- Scenarios where code execution is unsafe or disallowed (no sandbox, no network permissions).
Open Questions:
- Co-evolution: How to evolve memory (preferences, long-term context) and workflow (planning policies) alongside tools safely and label-free?
- Regularization: How to make convergence more deterministic across runs and batch sizes?
- Scheduling: What's the best curriculum and adaptive batch strategy to maximize learning speed and stability?
- Evaluation: Beyond EGL, what additional metrics capture subjective quality, reliability under drift, and long-horizon reasoning?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Yunjue Agent, a system that learns while working by creating, testing, refining, and reusing small tools, with no labels needed. A parallel batch evolution with an absorbing mechanism keeps the toolbox compact and robust, and a new EGL metric tracks convergence like a training loss. Across multiple hard benchmarks, the agent starts from zero tools yet reaches or beats top baselines and transfers its learned tools to new domains with minimal extra effort.
Main Achievement: Showing that tool-first, in-situ self-evolution can deliver state-of-the-art results in open-ended environments while remaining transparent, reproducible, and efficient.
Future Directions: Pre-train agentic systems at the tool level (a "foundation toolset"), then adapt lightly in the wild; co-evolve memory and workflow with tools; add regularization for stable convergence; and develop adaptive batching curricula guided by convergence signals.
Why Remember This: It reframes agent improvement as everyday practice, turning each task into a lesson, so the agent's toolbox becomes a growing set of reliable, general-purpose skills that make tomorrow's unknowns feel familiar.
Practical Applications
- Research assistant that learns new parsers for unfamiliar PDFs, spreadsheets, and websites as they appear.
- Financial analyst bot that builds custom calculators (ratios, durations) and reuses them across tickers and quarters.
- Customer support triage agent that crafts reusable log filters or error pattern detectors from one-off incidents.
- Data journalism helper that evolves scrapers for changing portals and keeps a clean library of extractors.
- Academic helper that builds citation parsers and math solvers, then reuses them in multi-paper literature reviews.
- Enterprise operations bot that learns file-inspection and bounded-read tools to safely handle large datasets.
- Legal discovery assistant that creates OCR+text extract tools for varied scanned documents and archives.
- Healthcare admin tool (non-clinical) that adapts to hospital scheduling formats and insurance forms (with sandboxing).
- Product analytics agent that evolves connectors to new APIs and normalizes outputs into LLM-friendly summaries.
- Education tutor that accumulates math and science utilities (equation solvers, unit converters) to guide students.