Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Key Summary
- Youtu-Agent is a build-and-grow factory for AI agents that cuts manual setup and keeps agents improving over time.
- It separates where the agent runs, what tools it uses, and how it thinks, so parts can be mixed and matched easily.
- Two ways to auto-build agents are offered: a fixed Workflow for routine jobs and a flexible Meta-Agent for fuzzy, complex jobs.
- It can auto-write missing Python tools (with tests) and assemble full YAML configs, reaching over 81% tool synthesis success.
- A low-cost Practice module lets agents learn from their own attempts by storing lessons in context, no fine-tuning needed.
- A full Agent RL module makes large-scale reinforcement learning faster (about 40% speedup) and more stable for long tasks.
- On open benchmarks, it achieved state-of-the-art results using open-weight models: 71.47% on WebWalkerQA and 72.8% on GAIA (text-only).
- Training-free GRPO in Practice boosted math benchmark AIME 2024/2025 by +2.7% and +5.4% with only about $18 of compute.
- Agent RL raised a 7B model's math accuracy from 10% to 45% and improved search QA by up to 21% across multiple datasets.
- This framework lowers the barrier to create capable, adaptable agents for real-world tasks without relying on closed APIs.
Why This Research Matters
This work makes building powerful AI helpers accessible to small teams by automating the hardest setup steps. It also gives those helpers a way to keep getting better, first cheaply (Practice) and then deeply (RL) when needed. Because it uses open-weight models and open tools, organizations avoid lock-in to closed APIs. The framework's stability and speed improvements make long, tool-heavy tasks (browsing, coding, research) practical at scale. In real life, that means faster research, cleaner data workflows, and smarter assistants that adapt to changing websites and tasks. The auto-generation of tools even writes unit-tested code, reducing breakage and maintenance. Overall, this shifts agents from one-off demos to dependable, evolving workers.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you want to build a team of robot helpers. One robot browses the web, another runs code, and a third keeps notes. But assembling them takes ages, and once they're built, they don't get better on their own. That's where many AI agents were stuck.
Before: AI agents could plan, use tools like browsers or code interpreters, and follow instructions. But putting one together was like building a custom treehouse from scratch every time: lots of measuring, sawing, and fixing. Engineers had to manually pick tools, glue them to the environment, write prompts, and keep everything stable. And after deployment, these agents often froze in time: if the world changed or tasks got trickier, the agent didn't improve unless someone did expensive fine-tuning or spent hours rewriting prompts.
The Problem: Two big headaches slowed everyone down. First, high configuration costs: choosing, connecting, and debugging tools, plus writing the perfect prompts, took too much expert time. Second, static capabilities: agents couldn't smoothly adapt to new tasks or longer, messier jobs without retraining.
Hook: You know how practice and coaching make you better at basketball without buying a new pair of legs? The Concept (Reinforcement Learning): Reinforcement Learning (RL) is a way to teach AI by rewarding good actions and discouraging bad ones. How it works: 1) Try an action. 2) See what happens. 3) Score it with a reward. 4) Update the strategy to favor higher-scoring actions next time. Why it matters: Without RL, agents can't steadily improve at long, step-by-step tasks where success comes at the end. Anchor: A robot learns to fetch a snack: it tries routes, gets points for reaching the kitchen, and updates its path for next time.
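The loop above can be made concrete with a toy sketch: tabular Q-learning on a five-step hallway, matching the snack-fetching anchor. This is a generic illustration, not code from the paper, and the hallway layout and learning rates are arbitrary assumptions.

```python
import random

# Toy RL loop (try, observe, score, update): a robot on a 5-cell hallway
# learns to walk toward the "kitchen" at cell 4. Purely illustrative.
q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}  # value of each (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = 0
    for _ in range(20):
        # 1) Try an action: usually the best-known one, sometimes explore.
        if random.random() < epsilon:
            action = random.choice((-1, +1))
        else:
            action = max((-1, +1), key=lambda a: q[(state, a)])
        # 2) See what happens.
        next_state = min(max(state + action, 0), 4)
        # 3) Score it: reward only when the kitchen (cell 4) is reached.
        reward = 1.0 if next_state == 4 else 0.0
        # 4) Update the strategy to favor higher-scoring actions next time.
        best_next = max(q[(next_state, a)] for a in (-1, +1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
        if reward:
            break
```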
Hook: Imagine studying with sticky notes that remind you of the tricks that worked last week. The Concept (In-context Optimization): In-context optimization improves an AI's behavior by feeding it helpful notes and examples directly in its prompt, with no weight updates. How it works: 1) Collect attempts and outcomes. 2) Summarize the patterns that worked. 3) Insert those as guidance into the next prompt. Why it matters: Without it, agents need expensive fine-tuning to adapt, or they repeat the same mistakes. Anchor: Before a test, you review your own mini cheat-sheet of what worked on practice problems.
Failed Attempts: People tried giant prompt libraries (hard to maintain), rigid pipelines (broke on unusual tasks), and full fine-tuning (costly, data-hungry, and unstable for long workflows). RL for agents also hit snags like "entropy explosion", where the agent's choices got noisy and chaotic over long horizons.
The Gap: We needed a system that (1) makes building agents almost automatic, (2) keeps agents evolving with low cost when possible, and (3) safely scales up RL when big, lasting gains are needed.
Hook: Think of LEGO kits where each piece snaps in, and you can rebuild without starting from zero. The Concept (Modular Architecture): A modular architecture organizes an agent system into snap-together parts that work cleanly together. How it works: Separate the environment (where actions run), tools (what actions do), and the agent (the planner/thinker). Reuse parts across different builds. Why it matters: Without modularity, every agent is a fragile, one-off project that's hard to fix or grow. Anchor: You can move a "browser tool" from a research agent to a shopping agent without rewiring the whole system.
Hook: Recipes help different cooks make the same dish the same way. The Concept (YAML Configuration System): YAML is a simple, human-readable recipe that declares an agent's environment, tools, instructions, and context rules. How it works: You write a structured file that lists what the agent needs; the system reads it and builds the agent. Why it matters: Without a clean config, agents become messy to reproduce, share, or auto-generate. Anchor: One YAML file can spin up a "research bot" today and a "file-cleanup bot" tomorrow with a few lines changed.
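To make the recipe idea concrete, here is a minimal sketch of what such a declarative config might look like when loaded from Python. The field names (environment, tools, instructions, context) are illustrative assumptions, not Youtu-Agent's actual schema.

```python
import yaml  # PyYAML

# Hypothetical agent "recipe"; every field name is an assumption for illustration.
RESEARCH_BOT = """
agent:
  name: research_bot
  environment: browser                     # where actions run
  tools: [web_search, page_reader, notes]  # what actions do
  instructions: |
    Search the web, open promising pages, extract facts, and cite sources.
  context:
    max_history_turns: 10                  # keep working memory small
"""

config = yaml.safe_load(RESEARCH_BOT)
print(config["agent"]["tools"])  # ['web_search', 'page_reader', 'notes']
```

Swapping the environment for a local shell and editing the tool list would turn this same skeleton into the file-cleanup bot from the anchor.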
Real Stakes: Faster, cheaper agent setup means small teams can automate research, data cleanup, basic coding, and web tasks without months of engineering. Continuous learning means agents won't go stale as websites, APIs, or tasks evolve. This isn't just convenience; it's the difference between tooling that scales to real businesses and tools that stay stuck in demos.
02 Core Idea
Aha! Moment (one sentence): Treat agents like modular kits that can be auto-assembled from a clean recipe and then improved both cheaply (by practice in context) and deeply (by scalable, stable RL) when needed.
Three Analogies for the Same Idea:
- Factory Line: Parts (environment, tools, planner) are standardized. A builder bot (Workflow/Meta-Agent) assembles them into a working robot. Then the robot learns tricks from a playbook (Practice) or goes to boot camp (RL) for major upgrades.
- Sports Team: The coach drafts players (tools), sets the playbook (YAML), and picks starters (agent config). The team reviews film to get better fast (Practice), and when championships loom, they do full training camps (RL).
- Kitchen: Ingredients (tools) and appliances (environment) are stocked. A sous-chef builds a recipe (config) for today's dish. Notes from yesterday's cook improve flavor now (Practice), and formal culinary school builds long-term skill (RL).
Hook: You know how a good toolbox, a clear plan, and steady practice turn a messy project into a smooth one? The Concept (Youtu-Agent Framework): Youtu-Agent is a modular system that auto-builds agents and keeps them improving. How it works: 1) Split the system into three layers (Environment, Tools, Agent) and manage them with YAML. 2) Auto-generate full agents via two modes: Workflow for routine tasks, Meta-Agent for complex ones. 3) Continuously improve with Practice (in-context) or RL (parameter training) depending on goals and budget. Why it matters: Without this framework, teams pay high setup costs and end up with agents that don't grow. Anchor: A single YAML swaps a local shell for a browser; the same agent brain then uses browsing tools to solve web tasks.
Before vs After:
- Before: Building agents was custom carpentry: slow, brittle, and hard to repeat. Improving them meant rewriting prompts or paying for fine-tuning.
- After: Building agents is closer to assembling LEGO: auto-generation picks and even codes tools; YAML keeps things tidy; agents can get quick wins via Practice or big gains via RL.
Why It Works (intuition, no math):
- Decoupling turns a tangled ball of yarn into neatly labeled strings, so you find and swap what you need fast.
- Auto-generation leverages LLMs to write tools and instructions on demand, shrinking human effort.
- In-context practice captures patterns from many tries and feeds them back as instant guidance.
- RL, when made stable and scalable, hardens skills into the model for lasting, larger improvements.
Building Blocks (with mini "sandwiches"):
Hook: Some jobs are like filling in a form: clear and repeatable. The Concept (Workflow Mode): A deterministic, four-stage pipeline that builds agents for standard tasks. How it works: 1) Clarify task, 2) Retrieve/synthesize tools, 3) Generate prompts, 4) Assemble YAML config. Why it matters: Without a clear pipeline, repeatable tasks waste time every build. Anchor: "Summarize PDFs in a folder" is auto-built with a file parser tool and a summary prompt.
Hook: Other jobs feel like detective work: unclear clues, moving targets. The Concept (Meta-Agent Mode): An architect agent that asks questions, searches for tools, writes missing tools, and assembles the final config. How it works: 1) Ask user to clarify, 2) Search the library, 3) Write new Python tools (with tests) when needed, 4) Build YAML. Why it matters: Without it, complex specs stall or demand heavy expert time. Anchor: "Track today's trending multi-agent papers and download PDFs": the meta-agent finds arXiv tools, writes a "fetch_daily_papers" tool, and assembles the agent.
Hook: Practicing with a tip sheet can lift your score before you ever hire a tutor. The Concept (Agent Practice Module): Low-cost improvement that stores lessons in context, not weights. How it works: 1) Run several attempts per task, 2) Compare good vs. bad tries, 3) Distill the "what worked" into text, 4) Insert that text next time. Why it matters: Without this, adaptation means expensive fine-tuning or staying stuck. Anchor: On math puzzles, the agent learns "verify with code, then simplify" and repeats that winning pattern.
Hook: Boot camp builds muscle that sticks. The Concept (Agent RL Module): A full RL pipeline that's fast and stable for long tasks. How it works: 1) Infrastructure: REST APIs, parallel rollouts, layered timeouts. 2) Algorithms: filter bad tool calls, reduce off-policy drift, fix advantage bias. Why it matters: Without stability and scale, agent RL is too slow or collapses. Anchor: A 7B model's math accuracy jumps from 10% to 45% after RL.
Hook: Sharing a shopping list keeps everyone aligned at the store. The Concept (YAML Configuration System): A shared, readable plan for the whole agent. How it works: declare environment, tools, agent instructions, and context rules in one file. Why it matters: Without YAML, auto-generation can't target a clean blueprint, and reproducibility suffers. Anchor: One YAML spins up a "research assistant" today and a "desktop cleaner" tomorrow.
03 Methodology
High-level Recipe: Input (task idea) → Generation (Workflow or Meta-Agent) → Executable Config (YAML + tools + prompts) → Continuous Improvement (Practice or RL).
Step-by-step Details
- Automated Agent Generation
Hook: You know how a factory prints a label, grabs parts, and assembles a gadget? The Concept (Automated Agent Generation): The system builds full agents (code, prompts, config) automatically from a task description. How it works:
- Workflow Mode (deterministic):
- Intent Clarification: parse the request into goals and constraints.
- Tool Retrieval/Synthesis: fetch tools from a library or auto-generate Python tools (with docstrings, signatures, tests).
- Prompt Engineering: craft system instructions that explain when and how to use tools.
- Config Assembly: write a YAML that stitches environment, tools, and prompts together.
- Meta-Agent Mode (flexible): a planning agent can ask the user questions, search for tools, create missing tools, and emit the final YAML. Why it matters: Without auto-generation, each agent requires hand-wiring; scaling to many use cases becomes impractical. Anchor: "Summarize and save the top 10 articles about battery breakthroughs this week": the system finds a search tool, writes a scraper if missing, adds a summarizer, and outputs a ready-to-run YAML.
Workflow Mode, with concrete data:
- Example input: "Read all CSV files in a folder, plot sales by month, and email a report."
- Stage 1: Intent → needs file I/O, plotting, email.
- Stage 2: Retrieve file and plotting tools; synthesize "send_email_smtp" if absent.
- Stage 3: Prompt: "Use file_reader to load CSVs; use plotter for sales by month; compose and send a summary."
- Stage 4: YAML ties it all to a local OS shell environment and a Python executor. If any step is skipped, the agent might lack a critical tool, misuse a tool, or be impossible to reproduce across machines.
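A sketch of how those four stages might chain together in code is shown below. Every function and field name is a hypothetical illustration of the pipeline described above, not the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    goals: list[str]
    tools: dict[str, str] = field(default_factory=dict)  # tool name -> source code
    prompt: str = ""
    yaml_config: str = ""

def synthesize_tool(name: str) -> str:
    # Placeholder: in the described system an LLM writes the tool plus unit tests.
    return f"def {name}(*args, **kwargs):\n    raise NotImplementedError\n"

def assemble_yaml(environment: str, spec: AgentSpec) -> str:
    tool_list = ", ".join(spec.tools)
    return (f"environment: {environment}\n"
            f"tools: [{tool_list}]\n"
            f"instructions: {spec.prompt!r}\n")

def build_agent(task: str, tool_library: dict[str, str]) -> AgentSpec:
    # Stage 1: intent clarification -> goals and constraints (hard-coded here).
    spec = AgentSpec(goals=["read CSVs", "plot sales by month", "email report"])
    # Stage 2: tool retrieval, falling back to synthesis for anything missing.
    for need in ("file_reader", "plotter", "send_email_smtp"):
        spec.tools[need] = tool_library.get(need) or synthesize_tool(need)
    # Stage 3: prompt engineering: explain when and how to use each tool.
    spec.prompt = ("Use file_reader to load CSVs; use plotter for sales by month; "
                   "compose and send a summary with send_email_smtp.")
    # Stage 4: config assembly into one declarative file.
    spec.yaml_config = assemble_yaml("local_shell", spec)
    return spec

agent = build_agent("Read all CSV files, plot sales by month, email a report", {})
print(agent.yaml_config)
```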
Meta-Agent Mode, with concrete data:
- Example input: "Track daily trending multi-agent papers and download PDFs."
- The meta-agent asks: "Which sources? ArXiv? Time range?" Then it searches the tool library, generates "fetch_daily_papers" if missing, tests it, and writes the final YAML. Failing to clarify intent or to test tools can cause silent failures at run time.
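Below is a hypothetical example of what a synthesized tool might look like: a typed signature, a docstring, and a small self-test. The function name follows the example above; the query format and the injected search function are assumptions, not actual generated output.

```python
from datetime import date

def fetch_daily_papers(topic: str, when: date, search_fn) -> list[dict]:
    """Return papers about `topic` submitted on `when`.

    `search_fn(query: str) -> list[dict]` is injected so the tool can be
    unit tested without network access.
    """
    query = f'all:"{topic}" AND submittedDate:{when.isoformat()}'
    return [p for p in search_fn(query) if p.get("pdf_url")]

def test_fetch_daily_papers():
    fake_results = [
        {"title": "Multi-agent RL survey", "pdf_url": "https://example.org/a.pdf"},
        {"title": "No PDF yet", "pdf_url": None},
    ]
    papers = fetch_daily_papers("multi-agent", date(2025, 1, 1), lambda q: fake_results)
    assert len(papers) == 1 and papers[0]["title"] == "Multi-agent RL survey"

test_fetch_daily_papers()
```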
- Context Management and Execution
Hook: Your backpack only fits what you need today; old papers get tossed. The Concept (Context Manager): A module that keeps the agent's working memory small and relevant. How it works: prunes stale logs, old HTML, and redundant steps while preserving key state. Why it matters: Without pruning, the context overflows, costs spike, and reasoning derails. Anchor: While browsing 10 pages, the agent keeps only the current DOM and essential notes, not the whole web history.
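A minimal sketch of that pruning policy follows, assuming the context is a simple list of messages where each entry has a role and an optional kind; this record structure is an assumption, not the framework's real context format.

```python
def prune_context(history: list[dict], keep_last_steps: int = 5) -> list[dict]:
    """Keep pinned system/task messages, the most recent steps, and only the
    newest raw page snapshot; drop stale HTML and old tool logs."""
    pinned = [m for m in history if m["role"] in ("system", "task")]
    steps = [m for m in history if m["role"] not in ("system", "task")]
    recent = steps[-keep_last_steps:]          # everything older is discarded

    kept, latest_page_seen = [], False
    for m in reversed(recent):                 # walk newest-first
        if m.get("kind") == "page_snapshot":
            if latest_page_seen:
                continue                       # older snapshots are stale
            latest_page_seen = True
        kept.append(m)
    return pinned + list(reversed(kept))       # restore chronological order
```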
Execution Flow: perceive → reason → act. The agent reads state (e.g., browser page), thinks about the next step, chooses a tool (click, search, run Python), and repeats until done. For long tasks, Plan-and-Execute helps.
Hook: In a group project, one planner assigns tasks and specialists do the work. The Concept (Plan-and-Execute): One planner breaks jobs into steps and dispatches them to specialized executors with the right tools. How it works: planner drafts a plan; executors run tools and return results; planner refines or finishes. Why it matters: Without it, big tasks get messy or loop forever. Anchor: Planner: "Find the author's email." Executor A: "Search web." Executor B: "Open page and parse." Planner: "Found it, email them."
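The planner/executor split can be sketched in a few lines. Here `llm` stands for any text-completion callable and `executors` maps a skill name to a tool-running function; both, along with the reply format, are assumptions made for illustration.

```python
def plan_and_execute(task: str, llm, executors: dict, max_rounds: int = 5) -> str:
    notes = []
    for _ in range(max_rounds):
        # Planner drafts (or refines) the next step given the results so far.
        step = llm(f"Task: {task}\nResults so far: {notes}\n"
                   "Reply as '<skill>: <instruction>' or 'DONE: <answer>'.")
        if step.startswith("DONE:"):
            return step[len("DONE:"):].strip()      # planner decides we're finished
        skill, _, instruction = step.partition(":")
        observation = executors[skill.strip()](instruction.strip())  # specialist acts
        notes.append({"step": step, "observation": observation})
    return "No answer within the round budget"
```

For the anchor above, `executors` might map "search_web" and "open_page" to the corresponding tools, while `llm` plays the planner.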
- Practice: Training-free Group Relative Policy Optimization (GRPO)
Hook: After trying multiple ways to solve a puzzle, you compare what worked best and write a quick tip to yourself. The Concept (Training-free GRPO): A method that compares a group of attempts on the same task and writes a short "what worked" note (a semantic advantage) to guide future tries, with no weight updates. How it works: 1) For each task, run multiple rollouts. 2) An LLM evaluator judges which trajectories are better. 3) It distills a textual learning direction by contrasting successes and failures. 4) During testing, insert these notes into the prompt (a "textual LoRA"). Why it matters: Without GRPO, practice is noisy; with it, the agent gains focused, reusable lessons cheaply. Anchor: The agent learns: "For AIME problems, define variables, try algebra first, then verify with code."
Hook: Practicing with a smart checklist boosts you today without changing your brain. The Concept (Agent Practice Module): The component that runs rollouts, evaluates them relatively, distills guidance, and stores it for later prompts. How it works: small training sets, no gradients, compatible with API-only models. Why it matters: It's the budget way to adapt to niches and new patterns. Anchor: With only 100 math problems, the agent improves by several points on AIME benchmarks.
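A compact sketch of this practice loop follows. The `agent`, `judge`, and `distill` callables stand in for LLM calls that the section describes but does not spell out at code level; they are assumptions, and no model weights are updated anywhere.

```python
def practice(tasks, agent, judge, distill, rollouts_per_task: int = 4) -> list[str]:
    lessons: list[str] = []                        # the accumulated "textual LoRA"
    for task in tasks:
        # 1) Several attempts on the same task, each seeing the current lessons.
        group = [agent(task, guidance=lessons) for _ in range(rollouts_per_task)]
        # 2) Relative evaluation within the group (better vs. worse attempts).
        ranked = judge(task, group)                # e.g., trajectories, best first
        # 3) Distill a short "what worked" note by contrasting the extremes.
        lesson = distill(task, best=ranked[0], worst=ranked[-1])
        if lesson:
            lessons.append(lesson)
    return lessons

# At test time the notes simply ride along in the prompt:
#   answer = agent(new_task, guidance=lessons)
```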
- RL: Scalable and Stable Reinforcement Learning
Hook: To win a championship, you need real training, good facilities, and safety rules. The Concept (Agent RL Module): A full RL pipeline that connects Youtu-Agent to distributed trainers for end-to-end learning. How it works:
- Scalability (infrastructure): REST APIs wrap environments; Ray-based concurrency runs many rollouts; layered timeouts prevent stalls.
- Stability (algorithms): filter invalid tool calls; reduce off-policy drift (no batch shuffling, fewer stale updates); fix advantage bias for turn-level GRPO. Why it matters: Without scale, RL is too slow; without stability, it collapses on long, tool-heavy tasks. Anchor: After RL, coding/reasoning accuracy jumps on AIME; search QA improves across seven datasets.
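A simplified sketch of the scalability and stability ideas above: rollouts run in parallel, a per-rollout timeout keeps one stuck environment from stalling the batch, and trajectories with invalid tool calls are filtered out before they reach the trainer. The rollout function and trajectory fields are assumptions; the paper's pipeline uses REST-wrapped environments and Ray-based concurrency rather than this standard-library stand-in.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as RolloutTimeout

def collect_batch(tasks, run_rollout, rollout_timeout_s: int = 120, max_workers: int = 32):
    """Run one rollout per task in parallel; return only clean trajectories."""
    trajectories = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_rollout, task) for task in tasks]
        for future in futures:
            try:
                traj = future.result(timeout=rollout_timeout_s)  # layered timeout
            except RolloutTimeout:
                continue   # a stuck environment must not stall the whole batch
            except Exception:
                continue   # crashed rollout: skip it rather than poison the batch
            # Stability: drop trajectories containing malformed tool calls so
            # they never contribute noisy updates to the trainer.
            if all(step.get("tool_call_valid", True) for step in traj["steps"]):
                trajectories.append(traj)
    return trajectories
```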
Secret Sauce
- The clean YAML schema is a bullseye for auto-generation.
- Tool synthesis writes not just code but docstrings, signatures, and tests to reduce breakage.
- Practice captures immediate, low-cost gains; RL cements deep skills when itās worth the compute.
- Infrastructure and algorithm fixes tame long-horizon RL pathologies, turning research ideas into a production-ready recipe.
04 Experiments & Results
The Tests and Why
- Can the framework, using only open models, solve real web and reasoning tasks? (WebWalkerQA, GAIA)
- Can it auto-generate usable tools and configs? (AgentGen-80)
- Can Practice (training-free GRPO) deliver cheap, steady gains? (AIME 2024/2025)
- Can RL run faster and more stably at scale while lifting accuracy? (Math/code and search QA suites)
The Competition
- Against strong prompting and training baselines: ReAct (no training), ZeroTIR, SimpleTIR, ReTool, AFM (RL-based). These represent today's go-to methods for tool-using agents and math reasoning.
Scoreboard with Context
- WebWalkerQA (deep web navigation, 680 Qs): 71.47% pass@1. That's like scoring an A when many open baselines hover around lower grades, especially without closed APIs.
- GAIA (text-only subset, 466 Qs): 72.8% pass@1. That's strong real-world QA with tool use, achieved using only open-weight models.
- AgentGen-80 (auto-generation):
- Configuration Validity: Workflow 100%, Meta-Agent 98.75% (rare formatting misses).
- Tool Executability: ~81–83% for both (synthesized tools compile and run).
- Task Completion: Workflow 65.00%, Meta-Agent 68.75% (Meta-Agent's flexibility helps on fuzzier tasks).
- Practice on AIME (Mean@32, with DeepSeek-V3.1-Terminus):
- ReAct baseline: 80.0% (AIME24), 67.9% (AIME25).
- Training-free GRPO (w/ ground truth): 82.7% (+2.7) and 73.3% (+5.4).
- Even without ground truth labels, gains persist, showing robustness when labels are scarce.
- Tool calls decrease over epochs, like solving the same puzzles with fewer steps: a sign of smarter strategies.
- RL Efficiency and Effectiveness (Qwen2.5-7B-Instruct):
- ~40% faster iteration vs. Agent-Lightning v0.2.2 due to concurrency + timeout engineering.
- Math/code RL: AIME24 improves 0.10 → 0.45 (+0.35), AIME25 improves 0.09 → 0.31 (+0.22).
- Search QA RL: gains of +0.17 (TriviaQA), +0.19 (PopQA), +0.21 (NQ), +0.08 (MuSiQue), +0.17 (HotpotQA), +0.13 (Bamboogle), +0.10 (2WikiMultiHop).
- Training dynamics: KL and gradients remain stable; critic score and validation accuracy rise together, hallmarks of healthy RL.
Surprising Findings
- Automated tool synthesis worked over 81% of the time even with unit tests in the loop, which is strong for code generation in a constrained interface.
- Practice improved not just accuracy but efficiency (fewer tool calls), suggesting it teaches better strategies, not just lucky guesses.
- RL's stability fixes (filtering invalid calls, reducing off-policy drift) paid off more than expected on long-horizon, tool-rich tasks.
Takeaway: Across all tests, the same pattern repeats: clean architecture enables automation; automation seeds capable agents; Practice adds cheap, immediate skill; RL scales and cements deeper improvements.
05 Discussion & Limitations
Limitations
- Task Coverage: While Meta-Agent mode helps, extremely open-ended or domain-expert tasks may still require human guidance or custom tools beyond auto-synthesis.
- Tool Synthesis Gaps: ~19% of generated tools may fail initially; unusual APIs, auth flows, or GUI timing can still trip up synthesis.
- Context Budget: Even with a context manager, very long projects can overrun context windows or require careful summarization.
- RL Cost/Complexity: RL still needs infrastructure, data pipelines, and careful reward shaping; not every team will want to run 128-GPU-scale experiments.
Required Resources
- For Generation/Practice: An open-weight LLM, a sandbox (e.g., Python runner/browser), and modest compute.
- For RL: Distributed compute (GPUs), rollout orchestration (Ray/REST), datasets, and evaluation harnesses.
When NOT to Use
- One-off, trivial automations that are faster to script by hand.
- Ultra-sensitive tasks where auto-generated code or browsing is disallowed by policy.
- Hard real-time systems with strict latency where LLM planning loops are too slow.
Open Questions
- How to further raise tool-synthesis reliability on tricky auth/GUI workflows?
- Can practice-style "textual LoRA" be made more compact, structured, or automatically pruned over time?
- What are the best reward designs for multi-objective tasks (speed, accuracy, safety) in agent RL?
- How can multi-agent collaboration (planner-specialist teams) be auto-generated and jointly trained end-to-end?
- Can we blend Practice and RL signals in a single loop that chooses the cheapest next improvement step automatically?
06 Conclusion & Future Work
Three-sentence Summary: Youtu-Agent turns agent building into assembly and agent growth into a steady habit by combining modular design, automated generation, and two upgrade paths: low-cost Practice and full RL. It shows strong open-source performance on web and reasoning tasks while auto-writing tools and configs, then improves further with training-free GRPO and scalable, stable RL. The result is a practical path from idea to evolving agent without relying on closed APIs.
Main Achievement: A unified, production-ready framework where a clean YAML schema enables auto-generation of tools and agents, and a hybrid optimization stack (Practice + RL) keeps those agents improving from day one to deployment scale.
Future Directions: Add more environments (mobile, rich GUIs), richer multi-agent patterns with shared memory, smarter experience curation, and broader safety/verification layers for generated tools. Explore automatic decision-making about when to use Practice versus RL for the best cost/performance trade-off.
Why Remember This: It shows that agent creation doesn't have to be artisanal or static: you can press "generate," get a workable agent, and then choose the cheapest path to make it better, again and again.
Practical Applications
- Spin up a web research agent that searches, opens pages, extracts facts, and writes citations automatically.
- Create a data-cleaning agent that reads CSVs, fixes formats, plots trends, and emails reports.
- Build a document assistant that parses PDFs, summarizes sections, and answers questions with citations.
- Deploy a coding helper that writes and tests small Python tools on demand for internal workflows.
- Set up a customer support agent that searches a knowledge base and web sources to draft accurate responses.
- Use Practice to quickly adapt an agent to a new domain (e.g., company-specific math or policy rules) without fine-tuning.
- Run RL to boost a 7B model's long-horizon reasoning for math, coding, or multi-hop search tasks.
- Automate desktop routines like file organization, GUI form filling, and screenshot-based assistance, locally.
- Prototype and A/B test multiple agent variants by editing YAML configs instead of rewriting code.
- Continuously monitor and improve agents with Eval + Practice loops, then schedule periodic RL for lasting upgrades.