EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Intermediate
Xiaoshuai Song, Haofei Chang, Guanting Dong et al. · 1/9/2026
arXiv · PDF

Key Summary

  • EnvScaler is an automatic factory that builds many safe, rule-following practice worlds where AI agents can talk to users and call tools, just like real apps.
  • It has two big parts: SkelBuilder makes the environment’s skeleton (data, tools, and rules), and ScenGenerator fills it with realistic starting data, tasks, and a way to check if the task was truly done.
  • Instead of trusting imagination-only simulations, EnvScaler uses program code to run the environments, so they are consistent, controllable, and repeatable.
  • A dual-agent tester checks every tool method by making calls and verifying the state changes, filtering out broken environments.
  • Tasks are graded by rule-based state-check functions, so different valid solution paths still earn credit when the final state is correct.
  • EnvScaler created 191 environments and ~7,000 scenarios, then used them to train Qwen3 models with supervised fine-tuning and reinforcement learning.
  • Across three benchmarks (BFCL-MT, Tau-Bench, ACEBench-Agent), models trained with EnvScaler improved by large margins, especially on multi-turn, multi-tool challenges.
  • Performance kept rising as more environments were added, showing that scaling variety teaches general problem-solving patterns, not just memorization.
  • Conversation and non-conversation training each help in different ways; combining both gave the best overall scores.
  • Limitations include relying on LLMs to synthesize code, focusing on text tools (not images/audio), and not modeling real-world latencies or network errors.

Why This Research Matters

Real apps have rules, states, and consequences; EnvScaler lets agents practice in code-run worlds that mirror those realities. By grading the final state, it rewards what truly matters—did the task get done correctly—rather than forcing one narrow action script. This leads to agents that generalize better to new environments, tools, and business rules. Developers can train safely without touching production systems, reducing risk and speeding iteration. Companies can cover more domains quickly, helping agents become broadly useful assistants. As tasks get longer and more complex, EnvScaler’s scalable factories keep up, continuously sharpening agent skills.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re learning to be a shop helper. You need a safe practice store with real shelves, real buttons on the cash register, and real rules like ā€œYou can’t refund an item after 30 days.ā€ If the practice store is pretend and changes its mind, you’ll learn bad habits.

🄬 The World Before: Large language models (LLMs) are getting good at chatting, but acting as helpful agents in real systems is harder. Real tasks—like canceling an order, rescheduling a flight, or updating a document—require reading the current system state and then using tools (APIs) to change it. Training for this needs reliable practice environments (sandboxes) with clear data, tools, and rules. People tried three paths: (1) real systems, (2) LLM-simulated worlds, and (3) small, hand-coded sandboxes. Real systems often block access and are risky to learn on. LLM-simulated worlds can hallucinate—today they say a package is delivered, tomorrow they forget. Hand-built sandboxes are steady but too few and time-consuming to scale.

🄬 The Problem: How can we create lots of trustworthy, varied practice environments—each with states, tools, and business rules—fast enough to train agents that can handle long, multi-step, multi-tool tasks? Also, how do we automatically check whether a task was truly finished (not just that the AI called the right tools)?

🄬 Failed Attempts:

  • Using only real systems: limited access, fragile to change, and risky for training.
  • Pure LLM simulation: cheap and easy to spin up, but inconsistent and not transparently rule-based.
  • Manually coded sandboxes: reliable but scarce; they’re typically built just for evaluation and don’t scale for training.
  • Tool-only mockups: some works model APIs without a real state or rules, so actions don’t truly affect a world.
  • Reconstruction from logs: requires access to real logs or preexisting tools, which may not be available.

🄬 The Gap: We need a way to automatically synthesize many executable, stateful, rule-following environments (not just text), plus realistic tasks and automatic, fair grading. And we need a way to test and filter out broken environments.

🄬 Real Stakes: Why should you care? Because agents that can safely practice in diverse, consistent worlds can better help in daily life: canceling the correct order instead of the wrong one; changing a flight without breaking airline rules; or cleaning up a messy message thread without losing data. Without solid practice worlds, agents learn shortcuts that fail in the real apps you use.

šŸž Anchor: Think of a driving simulator. If the red light sometimes means ā€œgo,ā€ students will crash when they drive for real. EnvScaler builds many correct, code-run traffic systems so agent ā€œdriversā€ learn the real rules and signals before touching an actual road.

02 Core Idea

šŸž Hook: You know how LEGO sets come with a base (the board), pieces, and instructions? If you have a machine that can keep making new boards, pieces, and challenges, you can practice building anything.

🄬 The ā€œAha!ā€ in one sentence: Use programmatic synthesis to automatically build many executable, rule-following tool environments—and auto-generate tasks plus state-based checkers—so LLM agents can practice real multi-turn tool use at scale.

🄬 Multiple Analogies:

  1. Playground builder: A robot builds many playgrounds (environments) with safe, consistent equipment (tools and rules), then sets up obstacle courses (tasks) and referees (checkers) that judge if kids finish correctly.
  2. Video game level factory: It creates new game worlds (states), gives the player a quest (task), and has the game engine, not a storyteller, decide if the quest is completed by checking the actual game state.
  3. Cooking school: Kitchens (environments) have ingredients (state), appliances (tools), and health rules. The teacher sets recipes (tasks) and inspects the finished dish (state) rather than the exact order a student stirred.

🄬 Before vs After:

  • Before: Few, hand-made sandboxes; tests judged by matching a single tool-call script; LLM-simulated worlds were unstable.
  • After: Hundreds of code-run environments; tasks graded by final state, allowing multiple correct solution paths; automated quality checks keep only reliable environments.

🄬 Why It Works (intuition, no equations):

  • Code-backed environments are consistent and transparent: same input yields the same state change.
  • State-based grading rewards the outcome, not a single path, so agents can discover better strategies.
  • Automated discovery and testing remove the bottleneck of humans hand-coding every domain.

🄬 Building Blocks (each gets a Sandwich when first introduced):

  • Tool-interactive environment: A rule-governed world with data (state) and tools (APIs) that agents can call to read/change the state.
  • Programmatic synthesis: Use LLMs as programmers to write Python classes implementing state, tools, and rules.
  • EnvScaler: The full factory. Inside it:
    • SkelBuilder: Mines environment themes from tasks, plans state/tools/rules, writes code, and dual-tests every tool.
    • ScenGenerator: Fills each environment with initial data, crafts realistic tasks, and builds state-check functions that grade trajectories.
  • Training setup: Supervised Fine-Tuning (learn from teacher-made trajectories) and Reinforcement Learning (self-explore with rewards from check functions).

šŸž Anchor: In the Mobile Messaging Application example, EnvScaler makes users, contacts, messages, and tools like send_message or mark_message_as_read. A task might ask to add a contact, send a fix-it message, update delivery status, link it to a conversation, and clean up old messages. The checker looks at the final database to confirm each requirement really happened.

03 Methodology

šŸž Hook: Imagine a board game factory line: First design the board and rules, then print the cards and pieces, then playtest with robots, and finally ship only the games that pass testing.

🄬 High-level recipe: Input → SkelBuilder (discover theme → plan logic → code tools → dual-agent testing) → ScenGenerator (make initial data → design tasks → build checkers) → Agent training (SFT → RL) → Output: stronger multi-tool, multi-turn agents.

— Concept 1 — šŸž Top Bread (Hook): You know how a good classroom has rules, a whiteboard, and markers? That’s what a tool-interactive environment is for an AI. 🄬 Filling: What it is: A computer program that keeps track of a world’s data (state) and exposes tools (APIs) to let an agent read or change that state under rules. How it works:

  1. The environment stores entities (like users, orders, messages).
  2. It lists tool methods (like get_user, cancel_order, send_message).
  3. Each tool follows rules (like ā€œonly cancel if status is Pendingā€).
Why it matters: Without real state and rules, the agent’s actions don’t truly affect anything, and learning won’t transfer to real systems.
šŸž Bottom Bread (Anchor): In the messaging app, messages have delivery_status and read_status; tools like send_message and mark_message_as_read change those fields in the stored data.
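To make this concrete, here is a minimal sketch of what such a code-run environment could look like in Python. It loosely follows the Mobile Messaging Application example; the class layout, field names, and rules below are illustrative assumptions, not the paper’s actual generated code.

```python
# A minimal sketch of a tool-interactive environment, loosely following the
# paper's Mobile Messaging Application example. The field names and rules
# here are illustrative assumptions, not the authors' generated code.

class MobileMessagingApplication:
    def __init__(self, initial_state: dict):
        # State: plain dictionaries that the tools read and mutate.
        self.users = initial_state.get("users", {})
        self.contacts = initial_state.get("contacts", {})
        self.messages = initial_state.get("messages", {})

    # Tool: send a message, enforcing a couple of simple business rules.
    def send_message(self, sender_id: str, recipient_phone: str, text: str) -> dict:
        if sender_id not in self.users:
            return {"success": False, "error": "Unknown sender."}
        if not (recipient_phone.isdigit() and len(recipient_phone) == 10):
            return {"success": False, "error": "Invalid phone number."}
        msg_id = f"MSG{len(self.messages) + 1:03d}"
        self.messages[msg_id] = {
            "sender_id": sender_id,
            "recipient_phone": recipient_phone,
            "text": text,
            "delivery_status": "pending",
            "read_status": False,
        }
        return {"success": True, "message_id": msg_id}

    # Tool: mark a message as read; calling it twice is a harmless no-op.
    def mark_message_as_read(self, message_id: str) -> dict:
        msg = self.messages.get(message_id)
        if msg is None:
            return {"success": False, "error": "Message not found."}
        msg["read_status"] = True
        return {"success": True}
```

Because every tool call runs against the same stored dictionaries, the same call always produces the same state change, which is exactly the consistency the paper argues free-form text simulation lacks.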

— Concept 2 — šŸž Top Bread (Hook): Picture telling a robot to build many mini-worlds from instructions. 🄬 Filling: What it is: Programmatic synthesis means using LLMs to write executable code for environments—data schemas, rules, and tool methods. How it works:

  1. Infer state, tools, and rules from a text description.
  2. Generate Python classes: attributes for state, methods for tools.
  3. Verify syntax and extract tool interfaces.
Why it matters: Code-run worlds are consistent, controllable, and explainable compared to free-form text simulations.
šŸž Bottom Bread (Anchor): The robot coder writes a class MobileMessagingApplication with dictionaries for users, contacts, messages, and functions like validate_phone_number or send_message.

— SkelBuilder (Stage A: Discover Themes)
What happens: It mines environment topics from large task sets (e.g., ā€œI need to cancel order #123ā€ → E-commerce Order System). It filters tasks that clearly require reading/changing persistent state, then clusters and deduplicates environment descriptions.
Why it exists: To cover many real-world domains without hand-picking them.
Example: From ā€œNotify Bob at 503...,ā€ it infers a Mobile Messaging Application.

— SkelBuilder (Stage B: Plan and Code)
What happens:

  1. Logic planning: LLM plans the state schema (entities/attributes), rules, and tool list.
  2. Program modeling: Generate class attributes and each tool’s method with rule checks and safe returns.
  3. Assembly and interface extraction: Merge code, AST-validate, and list tool signatures.
What breaks without it: Tools would not match rules, state updates could be inconsistent, or the code would be unexecutable.
Example: It defines MessageInfo fields (delivery_status, read_status) and enforces valid updates.
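As a rough illustration of the assembly-and-extraction step, the sketch below syntax-checks assembled environment code and lists its public tool signatures with Python’s standard ast module. The helper name and output format are assumptions for illustration, not the paper’s implementation.

```python
# Sketch: validate generated environment code and extract tool interfaces.
# ast.parse raises SyntaxError if the assembled code is not valid Python,
# which is one cheap way to catch broken synthesis early.
import ast

def extract_tool_interfaces(source_code: str) -> list[dict]:
    """List public method signatures (candidate tools) found in the code."""
    tree = ast.parse(source_code)  # fails loudly on invalid syntax
    tools = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and not item.name.startswith("_"):
                    params = [arg.arg for arg in item.args.args if arg.arg != "self"]
                    tools.append({"class": node.name, "tool": item.name, "params": params})
    return tools
```

Run on the messaging class above, this would list entries like {"tool": "send_message", "params": ["sender_id", "recipient_phone", "text"]}—the kind of interface summary an agent needs in its prompt.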

— SkelBuilder (Stage C: Dual-Agent Assessment)
šŸž Top Bread (Hook): Like two referees: one plays moves, the other checks if the rules were followed.
🄬 Filling: What it is: A testing agent generates positive/negative tool calls; a checking agent examines source code, results, and state diffs to judge Pass/Warning/Fail.
How it works:

  1. Testing agent crafts a call (e.g., try deleting a non-existent message).
  2. Environment executes and returns result.
  3. Checking agent inspects code, result, and state change to label Pass/Warning/Fail.
  4. Repeat many rounds and compute a pass rate; keep only high-scoring environments.
Why it matters: It automatically filters broken logic and weak validations.
šŸž Bottom Bread (Anchor): If mark_message_as_read is called twice, the check expects idempotent success, not an error.
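Below is a hedged sketch of the dual-agent loop. In EnvScaler both the testing and checking agents are LLMs; here they are replaced by tiny hand-written stand-ins (propose_test_call and judge_outcome are hypothetical names) so the execute-and-diff control flow and the pass-rate filter stay visible.

```python
# Sketch of dual-agent assessment over an environment instance such as the
# MobileMessagingApplication above. The two stand-in functions below replace
# the paper's LLM testing/checking agents and are illustrative only.
import copy

def propose_test_call(env):
    # Testing-agent stand-in: a real agent would vary positive/negative calls;
    # here we always probe one edge case (reading a non-existent message).
    return "mark_message_as_read", {"message_id": "MSG999"}

def judge_outcome(tool_name, kwargs, result, state_before, state_after):
    # Checking-agent stand-in: a negative call should fail gracefully
    # without mutating the state.
    if not result.get("success") and state_before == state_after:
        return "Pass"
    return "Fail"

def assess_environment(env, n_rounds: int = 20, threshold: float = 0.9) -> bool:
    """Keep the environment only if enough test rounds are judged Pass."""
    passes = 0
    for _ in range(n_rounds):
        state_before = copy.deepcopy(env.__dict__)
        tool_name, kwargs = propose_test_call(env)
        result = getattr(env, tool_name)(**kwargs)  # execute the tool call
        state_after = copy.deepcopy(env.__dict__)
        if judge_outcome(tool_name, kwargs, result, state_before, state_after) == "Pass":
            passes += 1
    return passes / n_rounds >= threshold
```

Environments whose pass rate falls below the threshold are discarded, which is how broken tool logic gets filtered out before any training data is generated.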

— ScenGenerator (Initial State)
What happens: For each environment, generate a realistic initial database consistent with the schema and constraints (e.g., users, contacts, mixed message statuses).
Why it exists: Tasks must be solvable given the actual data (you can’t cancel an order that doesn’t exist).
Example: Preload messages MSG001–MSG007 with varied statuses and timestamps.
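For instance, a seeded initial database for the messaging sketch above might look like the following; the specific users, IDs, and statuses are invented for illustration.

```python
# An illustrative initial state, consistent with the schema of the
# MobileMessagingApplication sketch above. All entries are made up.
initial_state = {
    "users": {
        "U001": {"name": "Alice", "phone": "5035550101"},
        "U002": {"name": "Bob", "phone": "5035550102"},
    },
    "contacts": {"U001": ["U002"]},
    "messages": {
        "MSG001": {"sender_id": "U002", "recipient_phone": "5035550101",
                   "text": "Meeting at 3?", "delivery_status": "delivered",
                   "read_status": False},
        "MSG002": {"sender_id": "U001", "recipient_phone": "5035550102",
                   "text": "On my way", "delivery_status": "failed",
                   "read_status": False},
    },
}
env = MobileMessagingApplication(initial_state)  # tasks run against this data
```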

— ScenGenerator (Task Design)
What happens: From the initial state and rules, generate a realistic, multi-step, state-changing task (not just a query). Avoid requiring unsupported side-effects (e.g., editing auto-generated timestamps).
Why it exists: To push agents into multi-turn, multi-tool planning.
Example: ā€œAdd Gabby as a contact, resend with updated text, set delivered, link to conversation, archive it, delete the old failed message, and send status to Alice, then mark her old message read.ā€

— ScenGenerator (Validation Functions)
šŸž Top Bread (Hook): Like a checklist a teacher uses to grade a project by looking at the finished product.
🄬 Filling: What it is: For each task, break it into check items and generate Python functions that inspect the final environment state to verify each item.
How it works:

  1. Make a checklist: Has contact X been added? Has message Y status become delivered?
  2. For each item, write check_func(final_state) → True/False.
  3. Reward = fraction of checks passed, allowing partial credit and multiple solution paths.
Why it matters: It judges success by outcomes, not by following one rigid script of tool calls.
šŸž Bottom Bread (Anchor): Even if the agent updates delivery status before linking the message to the conversation, it still passes those checks if the final state is correct.
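The sketch below shows the shape of this grading for the running messaging example: each check item is a small function over the final state, and the reward is the fraction that pass. The specific checks and IDs (U003, MSG008, MSG001) are hypothetical, not the paper’s generated validators.

```python
# Sketch of state-check grading with partial credit. The check items echo the
# example task (add Gabby, deliver the resent message, mark Alice's old
# message read); the entity IDs are hypothetical.

def check_contact_added(state) -> bool:
    user = state["users"].get("U003")
    return user is not None and user.get("name") == "Gabby"

def check_message_delivered(state) -> bool:
    msg = state["messages"].get("MSG008")
    return msg is not None and msg["delivery_status"] == "delivered"

def check_old_message_read(state) -> bool:
    msg = state["messages"].get("MSG001")
    return msg is not None and msg["read_status"] is True

def grade_trajectory(env) -> float:
    """Return the fraction of checklist items satisfied by the final state."""
    final_state = {"users": env.users, "contacts": env.contacts, "messages": env.messages}
    checks = [check_contact_added, check_message_delivered, check_old_message_read]
    return sum(check(final_state) for check in checks) / len(checks)
```

Because the grade only inspects the final state, an agent that reaches the same end result through a different order of tool calls earns the same score, and that same number can serve directly as the RL reward described in the training step below.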

— Training (SFT then RL)
šŸž Top Bread (Hook): First learn from a teacher’s examples, then practice on your own and earn points.
🄬 Filling: What it is: Supervised Fine-Tuning (SFT) teaches by imitation; Reinforcement Learning (RL) lets agents explore and get rewards from the checkers.
How it works:

  1. SFT: Collect teacher trajectories under conversation and non-conversation settings; train the student to follow them.
  2. RL: Let the student try multiple strategies; use checkers’ pass rate as reward to improve the policy.
Why it matters: SFT gives a strong start; RL refines and tailors strategies to the environments.
šŸž Bottom Bread (Anchor): The model learns to verify phone numbers first (SFT), then RL helps it discover faster paths to finish the messaging task.

— Secret Sauce

  • Dual-agent assessment keeps environments trustworthy.
  • State-based grading supports creativity and multiple correct paths.
  • Automated theme mining and coding scale to many domains.
  • Combining conversation and non-conversation training covers both information-gathering and execution-focused skills.

04 Experiments & Results

šŸž Hook: Think of a sports league for agents. We built many stadiums (environments), ran practices (SFT + RL), and then played official matches (benchmarks) to see who improved.

🄬 The Test: They measured how well different Qwen3 models solved multi-turn, multi-tool tasks on three public benchmarks: BFCL-v3 Multi-Turn (tests tool use across varied settings like missing parameters/functions and long contexts), Tau-Bench (retail and airline customer service with strict business rules), and ACEBench-Agent (mobile apps, food delivery, finance, travel; both multi-step and multi-turn).

🄬 The Competition: Baselines were the same Qwen3 models before training. Trained models used EnvScaler data for SFT, and then some also used RL rewards from the state-check functions. They also compared conversation vs non-conversation training subsets.

🄬 The Scoreboard (with context):

  • Scale of training data: 191 environments and about 7,000 scenarios were synthesized. Teacher-led SFT produced ~9,000 trajectories. RL used state-check rewards.
  • Overall gains from SFT: Averaged across models, BFCL-MT improved by around 8–9 points, ACEBench-Agent by about 11–12 points, and Tau-Bench by about 4 points. In classroom terms, that’s moving from a solid B- to a strong B+/A- on complex, rule-heavy tests.
  • Adding RL: Qwen3-8B improved further on BFCL-MT (+4.88) and Tau-Bench (+3.46), showing RL reliably squeezes extra performance when exploration capacity is strong. Qwen3-4B and Qwen3-1.7B also improved, but smaller models showed more variance.
  • Conversation vs Non-conversation: Non-conversation SFT helped on subsets where all info is given (Base, Long-Context), while conversation SFT helped more when tool names/parameters were missing (Miss-Parm). Combining both gave the best overall BFCL-MT score.
  • Scaling effect: Performance steadily rose as the number of SFT environments increased, with a big jump from 0→20 environments and continued gains beyond—evidence that variety teaches transferable patterns, not just surface tricks.
  • Similarity analysis: Training on the 50% most-similar or least-similar environments to BFCL-MT both beat the baseline by a lot, and the difference between the two subsets was small. This suggests EnvScaler trains general skills rather than overfitting to lookalikes.

🄬 Surprising/Notable Findings:

  • Multiple sampling and keeping the best trajectory significantly improved scenario scores, meaning there’s rich room for self-exploration.
  • State-check grading clearly separated stronger LLMs (higher win rates), validating that the checkers capture real success, not just lucky tool name matches.
  • Direct RL without an SFT warm start still helped, especially for larger models, but the best results came from SFT + RL. That mirrors real life: lessons first, then practice.

šŸž Anchor: On a random set of 50 EnvScaler scenarios, average trajectories were long (about 15 steps non-conversation and 25+ with conversation), confirming the tasks are meaty. Larger models consistently won more of these mini-matches, and adding RL made them even better at finishing the checklists.

05 Discussion & Limitations

šŸž Hook: Even the best playground needs maintenance signs. Let’s talk about what EnvScaler doesn’t do yet and when it’s not the right tool.

🄬 Limitations:

  1. LLM-made code bias: While environments are executable and rule-based, they’re still synthesized by LLMs, so business logic might drift from real systems.
  2. Domain focus: It targets domain-specific, stateful systems more than open, messy spaces like general web browsing.
  3. Missing realism knobs: It doesn’t yet simulate network delays, API rate limits, flaky errors, or partial outages that real systems experience.
  4. Text-only tools: No built-in support for images, audio, or sensors, so multimodal agent skills aren’t trained here.

🄬 Required Resources:

  • Access to capable LLMs for coding, assessing, and generating scenarios.
  • Compute for SFT/RL and for running many environment instances during training.
  • Storage for the generated code, states, and trajectories.

🄬 When NOT to Use:

  • If your target is open web research or pure retrieval with no persistent local state.
  • If you need high-fidelity simulations of latency, pricing, or concurrency.
  • If your agent must reason over images or voice commands.

🄬 Open Questions:

  • How to better align synthesized rules with real, messy business policies?
  • Can we synthesize realistic failure modes (timeouts, 500s) and teach robust recovery?
  • How to extend to multimodal tools and physical-world simulators?
  • What’s the best curriculum for mixing conversation vs non-conversation tasks across domains?
  • Can environment synthesis itself be guided by performance gaps (automatically generate what the model most needs next)?

šŸž Anchor: Think of EnvScaler as a great driving course on quiet roads. Next steps are adding traffic jams, construction zones, and fog—so learners can handle anything.

06 Conclusion & Future Work

šŸž Hook: Picture a training city where traffic lights always work, roads follow rules, and every neighborhood teaches a new skill. That’s what EnvScaler builds for AI agents.

🄬 3-Sentence Summary: EnvScaler automatically synthesizes many executable, tool-interactive environments, then fills them with realistic starting data, multi-step tasks, and rule-based state checkers. Its two engines—SkelBuilder and ScenGenerator—code up states, tools, and rules, test them with dual agents, and generate tasks plus final-state validators. Training on these worlds with SFT and RL significantly boosts multi-turn, multi-tool performance on public benchmarks.

🄬 Main Achievement: Turning environment creation, task design, and fair grading into an automated, scalable pipeline so agents can learn robust tool-use skills from outcomes, not just scripts.

🄬 Future Directions: Add realistic system noise (latency, partial failures), support multimodal tools, extend to open-web scenarios, and make synthesis adaptive—creating new worlds to target each model’s weak spots.

🄬 Why Remember This: When agents learn in many trustworthy, code-run environments and get graded by the final state, they discover general strategies that carry over to new apps and rules. EnvScaler shows how to build that training city at scale.

Practical Applications

  • Internal agent training gym: Spin up many realistic practice environments to teach agents safe tool use before deployment.
  • Quality gate for tools: Use dual-agent assessment to find brittle APIs or missing validations in synthetic sandboxes.
  • Curriculum design: Mix conversation and non-conversation scenarios to strengthen both information-gathering and execution.
  • Evaluation at scale: Benchmark agents by final-state checkers across diverse domains without hand-labeling trajectories.
  • Reward shaping for RL: Plug in checklist-derived rewards to encourage robust, outcome-focused strategies.
  • Rapid domain prototyping: Generate a draft environment for a new business process and iterate rules/tools quickly.
  • Regression testing: Re-run the same code-run environments to catch behavior drifts after model updates.
  • Data augmentation: Create fresh tasks and states to reduce overfitting and improve generalization.
  • Safety drills: Script edge cases (e.g., invalid IDs, permission checks) to ensure graceful failure handling.
  • Ops readiness: Simulate high-variance workloads (many small tasks) to stress-test agent planning and tool calling.
#EnvScaler · #tool-interactive environments · #programmatic synthesis · #SkelBuilder · #ScenGenerator · #state-based evaluation · #dual-agent assessment · #multi-turn tool use · #supervised fine-tuning · #reinforcement learning · #agent training · #environment synthesis · #final-state validation · #LLM agents · #scenario generation