Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
Key Summary
- The paper shows a new way to teach AI assistants how to use tools in many-step conversations by mining ordinary text on the internet for step-by-step “how-to” knowledge.
- Instead of starting from a fixed box of APIs, the method (called GEM) turns instructions found in text into tools, workflows, and full user–assistant dialogues with tool calls.
- GEM runs in four stages: filter texts that have steps, extract workflows and tool definitions, generate complete conversations, then refine them to be more realistic and complex.
- A smaller model called the Trajectory Synthesizer learns to do the whole GEM process end-to-end, making data generation faster and cheaper while keeping quality.
- On the BFCL V3 Multi-Turn benchmark, a 32B model trained with GEM data scores 44.88% and beats GPT-4.1 and other strong baselines in multi-turn tool use.
- On τ-bench (Airline and Retail), models trained on GEM’s out-of-domain data match or beat models trained on in-domain data, showing strong generalization.
- Careful validation (rules plus an LLM judge) reduces hallucinations by ensuring every tool call is grounded in the dialogue and tool schemas are correct.
- Refinement is crucial: adding ambiguity, longer context, and non-trivial tool chains significantly boosts performance.
- This approach unlocks a massive, diverse, and realistic source of training data by converting text into multi-turn, tool-using trajectories.
- The result is AI assistants that clarify, recover from errors, follow rules, and chain multiple tools more reliably in long conversations.
Why This Research Matters
This work turns the everyday “how-to” knowledge spread across the web into training fuel for AI helpers, so they can follow rules, ask clarifying questions, and recover from errors in realistic, multi-step tasks. That makes assistants more reliable for chores like fixing billing issues, planning travel, or editing media—without needing hand-built API libraries for every domain. Because the approach is text-first and out-of-domain, it scales naturally and generalizes to new problem spaces. The careful validation steps reduce hallucinations, improving safety and trust. Faster, cheaper generation via the Trajectory Synthesizer means organizations can keep data fresh as tools and policies change. Overall, this unlocks smarter, steadier AI teammates for real work and everyday life.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a friend can follow a recipe to bake cookies, but they can also fix little problems along the way—like swapping butter if it melts or checking the oven twice? We want AI helpers that can do that with digital tools—step by step, even when things get messy.
🥬 Filling (The Actual Concept): Multi-turn interactions
- What it is: Multi-turn interactions are back-and-forth conversations where an AI and a user work through a task across many steps.
- How it works:
- The user asks for something.
- The AI plans steps and may ask clarifying questions.
- The AI calls tools, reads results, and adjusts.
- They repeat until the task is done.
- Why it matters: Without multi-turn skills, the AI gets stuck on incomplete info, long contexts, or small mistakes. 🍞 Bottom Bread (Anchor): Like ordering a flight: the AI must check dates, search flights, ask about luggage, handle errors, and confirm payment—several turns, not just one.
🍞 Top Bread (Hook): Imagine building a LEGO set with instructions. If you only had random bricks (tools) but no booklet (steps), you’d struggle. AI faces this too.
🥬 Filling: Tool-use data synthesis
- What it is: It’s creating training examples that show how to use tools correctly across steps.
- How it works:
- Define tools (like search_flight, book_flight).
- Create tasks and conversations that call these tools.
- Check that tool names and parameters are correct.
- Why it matters: Without good examples, AIs don’t learn reliable tool calling. 🍞 Bottom Bread: Teaching an AI to change a shipping address needs examples with the right order ID, address fields, and confirmation messages.
The World Before: Most systems generated training data by starting with predefined API sets. That’s like trying to learn every possible kitchen tool before you’ve even decided what recipe you’re cooking. Researchers simulated users and agents around those APIs, which limited variety and realism. It also meant lots of effort to collect or design giant API libraries.
The Problem: Real multi-turn tool-use data is scarce, diverse domains are hard to cover, and predefined tools cap what models can learn. Agents often fail with ambiguous questions, long histories, or errors.
🍞 Top Bread: Think of the internet like a giant library of step-by-step stories: tutorials, manuals, and guides.
🥬 Filling: Text-based trajectory generation (the paper’s main shift)
- What it is: Turning ordinary text that describes procedures into full training journeys with tools and conversations.
- How it works:
- Find texts that contain multi-step procedures.
- Extract the steps and design matching tool definitions.
- Generate a realistic conversation where an assistant uses those tools.
- Refine it to include clarifications, errors, and longer contexts.
- Why it matters: This unlocks huge, diverse, realistic training data without hand-curated APIs. 🍞 Bottom Bread: A blog about “how to file an insurance claim” can become tools (create_claim, upload_document) plus a dialogue where the assistant guides a user through it.
The Gap: Everyone had bricks (APIs) but not enough storybooks (real processes). This paper flips it: start with the storybooks and invent the right bricks from them.
🍞 Top Bread: Imagine a skilled librarian who reads any guidebook and instantly drafts a clear workflow and the tools you’d need.
🥬 Filling: The GEM pipeline
- What it is: A four-stage recipe to turn text into multi-turn tool-use trajectories.
- How it works:
- Relevance filtering: keep only texts with multi-step operations.
- Workflow & tool extraction: convert steps into structured flows and tool schemas.
- Trajectory generation: create realistic dialogues and tool calls in one pass.
- Complexity refinement: add ambiguity, longer context, and error recovery.
- Why it matters: Each stage ensures quality and realism; without them, data would be noisy or too simple. 🍞 Bottom Bread: From “creating a music visualizer” tutorial to tools (import_audio, make_composition, render_video) and a dialogue that handles missing parameters and export errors.
Failed Attempts: Prior datasets simulated interactions only around fixed APIs. These missed many real-world twists—like wrong inputs, conditional steps, or business rules hidden in text. They also struggled to generalize when tools changed.
The Stakes: Better training data means agents that can book travel, manage orders, analyze data, or support customers more safely and flexibly. That affects everyday tasks—from fixing a billing issue to planning a budget trip—because the assistant won’t fall apart when the user is vague or when an error pops up.
🍞 Top Bread: What if this pipeline could fit into a single fast model that learned the whole trick?
🥬 Filling: Trajectory Synthesizer
- What it is: A smaller model trained to do the end-to-end text-to-trajectory mapping.
- How it works:
- Show it many examples produced by GEM.
- It learns to output tool schemas and full dialogues directly from text.
- Use it to generate lots of data cheaply.
- Why it matters: Running giant multi-stage generation is costly; this makes it practical at scale. 🍞 Bottom Bread: Instead of hiring a big studio team for each movie (pipeline), you train a director who can quickly film from any script (text) with similar quality.
02 Core Idea
🍞 Top Bread (Hook): Imagine you find a great “how to” article—like fixing a bike—and wish an AI could turn it into a step-by-step helper that asks smart questions, uses the right tools, and handles mistakes.
🥬 Filling: The “Aha!” Moment
- What it is: The key insight is that ordinary text already contains hidden, multi-step problem-solving know-how that can be turned into tool-using conversations for training agents.
- How it works:
- Mine text for procedures.
- Extract workflows and design tools that match the steps.
- Generate a full user–assistant dialogue with tool calls and results.
- Refine for realism: ambiguity, long context, errors, and rule checks.
- Why it matters: This removes the bottleneck of collecting predefined tools and creates broader, more authentic training data. 🍞 Bottom Bread: A hospital billing guide becomes tools (submit_claim, verify_insurance), a workflow (verify → submit → track), and a dialogue that handles missing IDs and rejection errors.
Multiple Analogies:
- Cookbook analogy: The web is a giant cookbook. GEM reads recipes (texts), lists ingredients and actions (tools and steps), cooks the dish (trajectory), then perfects the flavor (refinement).
- Field trip analogy: Text is the museum map. GEM marks the exhibits (steps/tools), plans the route (workflow), walks through with Q&A (dialogue), and adds challenge stations (errors/clarifications).
- Movie analogy: Text is the script. GEM casts roles (tools), shoots scenes (dialogue with tool calls), and edits for pacing and tension (refinement and validation).
Before vs After:
- Before: Data came from fixed API sets—narrow, expensive to build, and often unrealistic.
- After: Data comes from text—broad, cheap to scale, and grounded in real procedures. Models trained this way generalize better to new domains and tools.
🍞 Top Bread (Hook): You know how your teacher wants you to explain your thinking, not just the answer? That’s what GEM forces models to learn.
🥬 Filling: Why It Works (the intuition)
- What it is: A method that matches the structure of real tasks—goals, steps, dependencies, rules, and recovery from mistakes—so models practice exactly what they must do in the wild.
- How it works:
- Start from human-authored procedures (authentic logic).
- Turn them into structured tools and workflows (clear affordances).
- Create dialogues that require clarifying, planning, and chaining calls (reasoning under constraints).
- Validate structure and grounding (no hallucinated IDs or tools).
- Why it matters: Training on realistic constraints and multi-step logic teaches robust behavior, not just syntax. 🍞 Bottom Bread: In retail support, the assistant must check order status before canceling, ask for missing info, and recover if the cancel tool fails—exactly what GEM’s data enforces.
Building Blocks (explained with Sandwich):
- 🍞 Hook: Ever highlight important parts in a textbook before studying? 🥬 Concept: Relevance filtering
- What it is: Keep only texts with real multi-step operations.
- How it works: A classifier flags procedural content; non-procedural text is dropped.
- Why it matters: Without this, you waste time trying to teach from stories without steps. 🍞 Anchor: It keeps “movie reviews” out, and keeps “how to request a refund” in.
- 🍞 Hook: Turning a to-do list into an actual plan. 🥬 Concept: Workflow & tool extraction
- What it is: Convert steps into a structured workflow and JSON tool definitions.
- How it works: Parse steps, note dependencies/conditions, design tools with clear parameters and types.
- Why it matters: Without proper tools and structure, the dialogue can’t make valid calls. 🍞 Anchor: “Search item → add to cart → pay” becomes tools with fields like item_id and payment_method.
- 🍞 Hook: Practicing a role-play before the school play. 🥬 Concept: Trajectory generation
- What it is: Produce the full user–assistant conversation in one pass.
- How it works: A strong model drafts system rules, user turns, assistant responses, tool calls, and tool outputs.
- Why it matters: Simulating the entire flow captures context and long-range dependencies. 🍞 Anchor: From “music visualizer” text to a dialogue that imports audio, sets FPS, and fixes export errors.
- 🍞 Hook: Adding challenges to a game after you master level 1. 🥬 Concept: Complexity refinement
- What it is: Make the dialogue tougher and more realistic.
- How it works: Add ambiguity, longer context, error recovery, and multi-tool chains; re-verify quality.
- Why it matters: Without this, data is too easy and models don’t learn robustness. 🍞 Anchor: Asking to “rush-order a gift but only if in budget and same-day delivery exists” forces multi-step reasoning.
- 🍞 Hook: Having a referee check the rules. 🥬 Concept: Validation (rule-based + LLM judge)
- What it is: Ensure structure is correct and no hallucinated parameters are used.
- How it works: Rule checks schema/format; LLM judge confirms arguments are grounded in context.
- Why it matters: Bad samples teach bad habits. 🍞 Anchor: If an assistant invents order_id=123, the judge rejects the sample.
- 🍞 Hook: Learning to draw by tracing many pictures, then sketching from memory. 🥬 Concept: Trajectory Synthesizer
- What it is: A model fine-tuned to do the whole text-to-trajectory job directly.
- How it works: Train on GEM outputs so it learns to produce tools and dialogues efficiently.
- Why it matters: Greatly reduces cost and latency vs. always running the full pipeline. 🍞 Anchor: The synthesizer turns a “photo framing” article into a robust multi-turn chat with correct tool calls.
03 Methodology
At a high level: Raw text → Stage 1: Relevance filtering → Stage 2: Workflow & tool extraction → Stage 3: Trajectory generation → Stage 4: Refinement → Validation → Final training data or a trained Trajectory Synthesizer.
Stage 1: Relevance Filtering 🍞 Hook: Like sorting your backpack so you only keep what you need for class. 🥬 Concept
- What it is: Automatically keep texts that truly describe multi-step operations.
- How it works:
- Sample text segments from a large corpus (e.g., Ultra-FineWeb).
- Use a classifier/LLM prompt to label whether a segment contains multi-step procedures.
- Keep only the positives and add metadata (domain, platform, category) to understand diversity.
- Why it matters: Without filtering, later steps waste effort on non-procedural text and create poor training data. 🍞 Anchor: A tutorial on “creating a music visualizer” is kept; a poem about summer is filtered out.
Example with data: About 14% of sampled segments contained explicit multi-step workflows—enough to form a sizable training pool spanning many domains (e.g., customer support, research & data, education & e-learning).
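To make the filtering idea concrete, here is a tiny, runnable stand-in for the relevance step. The real pipeline uses a classifier/LLM prompt over corpus segments; the cue list and threshold below are invented purely for illustration.

```python
def looks_procedural(text: str) -> bool:
    """Crude heuristic stand-in for GEM's relevance classifier.

    The actual pipeline prompts an LLM to label whether a segment
    contains multi-step procedures; here we just count sequencing cues.
    """
    cues = ("step", "first,", "then", "next,", "finally")
    hits = sum(cue in text.lower() for cue in cues)
    return hits >= 2  # threshold is an arbitrary illustrative choice

# A how-to tutorial passes the filter; a poem does not.
tutorial = "First, import your audio file. Then create a composition. Finally, render the video."
poem = "The summer sun warms the quiet lake at dawn."
```

In the real pipeline the kept segments also receive metadata (domain, platform, category) so the diversity of the resulting pool can be measured.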
Stage 2: Workflow & Tool Extraction 🍞 Hook: Turning a recipe’s steps into a shopping list and cooking plan. 🥬 Concept
- What it is: Convert natural-language steps into structured workflows and tool definitions that an AI can call.
- How it works:
- Identify all steps and sub-steps; capture dependencies (X before Y), conditionals (if-then), and uniqueness rules.
- Design JSON-schema tools that each do a single coherent function; give clear names and types; specify required parameters.
- Output execution graphs and example action sequences to guide generation.
- Why it matters: Without accurate tools and structure, later dialogues would call missing functions or pass wrong parameters. 🍞 Anchor: “Import audio → set frame rate → render” becomes tools like import_audio(file_path), set_composition_fps(fps), render_video(preset).
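Since the validation stage checks against the OpenAI tool-schema format, the extracted tools presumably look like standard JSON-schema function definitions. The schemas below are an illustrative sketch built from the music-visualizer example; the exact names, descriptions, and fields are assumptions.

```python
# Illustrative JSON-schema tool definitions for the "music visualizer"
# example, plus a simple ordering constraint from the extracted workflow.
import_audio = {
    "name": "import_audio",
    "description": "Import an audio file into the project.",
    "parameters": {
        "type": "object",
        "properties": {
            "file_path": {"type": "string", "description": "Path to the audio file."},
        },
        "required": ["file_path"],
    },
}

set_composition_fps = {
    "name": "set_composition_fps",
    "description": "Set the frame rate of the current composition.",
    "parameters": {
        "type": "object",
        "properties": {"fps": {"type": "integer", "description": "Frames per second."}},
        "required": ["fps"],
    },
}

# Dependency captured from the workflow: audio must be imported before the
# frame rate is set, and rendering comes last.
workflow_order = ["import_audio", "set_composition_fps", "render_video"]
```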
Stage 3: Trajectory Generation 🍞 Hook: Rehearsing the whole play from start to curtain call. 🥬 Concept
- What it is: Generate a complete, multi-turn dialogue in one shot that includes system rules, user turns, assistant reasoning, tool calls, and tool outputs.
- How it works:
- Provide the source text + extracted workflow + toolset to a strong model (e.g., GLM-4.6).
- Produce system prompts that restate domain rules (e.g., “cancel only if status=pending”).
- Create user requests that are natural and sometimes ambiguous; require clarifications.
- Include assistant responses with correct tool calls and realistic tool results.
- Why it matters: One-pass generation preserves context and allows long, coherent sequences with proper dependencies. 🍞 Anchor: In retail, a user asks to modify an order with missing details; the assistant asks clarifying questions, checks constraints via tools, then proceeds.
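Concretely, a one-pass trajectory can be stored as a single message list covering system rules, user turns, tool calls, and tool outputs. The dialogue below is a hand-written illustration of the retail example; the tool names, arguments, and serialization are assumptions loosely following common chat-format conventions.

```python
# One synthesized trajectory as a flat message list (illustrative only).
trajectory = [
    {"role": "system", "content": "You are a retail assistant. Cancel orders only if status=pending."},
    {"role": "user", "content": "I want to change my order."},
    # Ambiguous request: the assistant must ask a clarifying question first.
    {"role": "assistant", "content": "Sure, could you share the order ID and what you'd like to change?"},
    {"role": "user", "content": "Order A1001, ship it to my new address."},
    # Check constraints via a tool before acting.
    {"role": "assistant", "tool_calls": [{"name": "get_order_status", "arguments": {"order_id": "A1001"}}]},
    {"role": "tool", "content": '{"order_id": "A1001", "status": "pending"}'},
    {"role": "assistant", "tool_calls": [{"name": "update_shipping_address", "arguments": {"order_id": "A1001"}}]},
    {"role": "tool", "content": '{"ok": true}'},
    {"role": "assistant", "content": "Done! Your order will ship to the new address."},
]
```

Note the pattern the data is meant to teach: clarify first, check preconditions via a tool, then act.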
Stage 4: Complexity Refinement 🍞 Hook: Leveling up a video game by adding tricky puzzles. 🥬 Concept
- What it is: Make conversations more challenging and varied so models learn robust behavior.
- How it works:
- Increase ambiguity and complexity of user requests.
- Expand tool usage diversity and chain multi-step calls.
- Improve realism of tool responses; add non-trivial error cases.
- Encourage patterns like clarification, rule enforcement, and error recovery.
- Why it matters: Without refinement, data is too easy and won’t teach long-context reasoning or recovery. 🍞 Anchor: The assistant must handle a failing print_image() call and suggest alternatives while keeping within rules (e.g., font size limits).
Validation (Dual Checks) 🍞 Hook: Having a hall monitor and a teacher both check homework. 🥬 Concept
- What it is: Combine rule-based checks with an LLM judge to ensure structural correctness and no hallucination.
- How it works:
- Rule-based: Verify OpenAI tool schema, exact tool/parameter matches, turn ordering, and message tags.
- LLM-based judge: Ensure every argument value is grounded in the dialogue; reject fabricated IDs or names.
- Why it matters: Bad samples poison training. Validation keeps quality high. 🍞 Anchor: If the assistant calls update_order(order_id="999") without any prior tool output providing "999", it’s rejected.
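The grounding rule can be sketched as code. This is a simplified rule-based stand-in for one of GEM's checks (every tool-call argument value must have appeared earlier in the dialogue); the real pipeline also verifies schemas and uses an LLM judge for harder cases.

```python
def grounded_tool_calls(messages):
    """Reject trajectories whose tool-call argument values never appeared
    earlier in the dialogue -- a rule-based sketch of the hallucination check."""
    seen = ""
    for msg in messages:
        # Check arguments against context accumulated *before* this message.
        for call in msg.get("tool_calls", []):
            for value in call["arguments"].values():
                if str(value) not in seen:
                    return False  # fabricated ID/name: reject the sample
        seen += msg.get("content") or ""
    return True

good = [
    {"role": "user", "content": "Cancel order 555, please."},
    {"role": "assistant", "tool_calls": [{"name": "cancel_order", "arguments": {"order_id": "555"}}]},
]
bad = [
    {"role": "user", "content": "Cancel my order, please."},
    {"role": "assistant", "tool_calls": [{"name": "cancel_order", "arguments": {"order_id": "999"}}]},
]
```

Here `good` passes because "555" appears in the prior user turn, while `bad` is rejected: "999" was never mentioned anywhere.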
Secret Sauce: Text-first and All-in-One
- Text-first: Starting from unstructured text unlocks scale and diversity across domains without needing huge predefined API pools.
- One-pass generation + refinement: Generate full dialogues efficiently, then challenge them to teach robustness.
- Distillation: Train a Trajectory Synthesizer to copy the pipeline’s behavior cheaply.
Trajectory Synthesizer Training 🍞 Hook: Learning to ride a bike by watching and then practicing until it feels natural. 🥬 Concept
- What it is: A model (e.g., Qwen3-8B-based) fine-tuned on pipeline outputs to produce tools and trajectories end-to-end from text.
- How it works:
- Input: instruction + source text segment.
- Output: tool definitions + multi-turn dialogue with tool calls.
- Use SFT (e.g., LR 5e-6, 2 epochs) to teach consistent formatting and grounded calls.
- Why it matters: It matches pipeline quality with much lower cost and latency, enabling large-scale data generation. 🍞 Anchor: Feed a Wikihow article; the synthesizer outputs tools, system rules, and a realistic, validated conversation.
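A single SFT pair for the synthesizer maps "instruction + source text" to "tools + dialogue". The serialization below is an assumed template (not the paper's exact prompt format), showing the shape of the training data.

```python
import json

def make_sft_example(source_text, tools, messages):
    """Assumed serialization of one SFT pair for the Trajectory Synthesizer:
    input = instruction + source text; target = tool definitions + dialogue."""
    prompt = (
        "Convert the following procedural text into tool definitions and a "
        "multi-turn tool-use dialogue.\n\n" + source_text
    )
    target = json.dumps({"tools": tools, "messages": messages})
    return {"prompt": prompt, "completion": target}
```

Fine-tuning on many such pairs (the paper reports full-parameter SFT, e.g. LR 5e-6 for 2 epochs) teaches the model to emit well-formed schemas and grounded calls directly.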
04 Experiments & Results
🍞 Top Bread (Hook): Imagine a school competition where teams solve real-life puzzles using the right gadgets, ask for clues, and fix mistakes fast. We want to see which team is best.
🥬 Concept: The Test (Benchmarks)
- What it is: Two tough tests for multi-turn tool use—BFCL V3 Multi-Turn and τ-bench (Airline, Retail).
- How it works:
- BFCL V3 checks function-calling accuracy across categories like Base, Missing Function, Missing Parameter, and Long Context.
- τ-bench simulates realistic domain conversations with tool use; metrics include Avg@4 and Pass@4 (success rates across 4 attempts).
- Why it matters: These tests measure not just calling a tool, but making it work across turns, with rules and long histories. 🍞 Anchor: It’s like grading not only your final answer but also how well you used the calculator, asked for missing numbers, and remembered earlier steps.
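A sketch of how the two τ-bench metrics are typically computed; this is our reading of Avg@4 and Pass@4, so treat the exact definitions as an assumption.

```python
def avg_at_k(attempts):
    """Mean success rate over all attempts; attempts is a list of per-task
    lists of booleans (k attempts per task)."""
    flat = [a for task in attempts for a in task]
    return sum(flat) / len(flat)

def pass_at_k(attempts):
    """Fraction of tasks where at least one of the k attempts succeeded."""
    return sum(any(task) for task in attempts) / len(attempts)

# Two tasks, four attempts each: task 1 succeeds twice, task 2 never.
runs = [[True, False, True, False], [False, False, False, False]]
# avg_at_k(runs) -> 0.25, pass_at_k(runs) -> 0.5
```

Pass@4 is always at least Avg@4, which is why the Retail Pass@4 numbers cited below sit above typical per-attempt accuracy.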
The Competition (Baselines): Models trained on well-known synthetic datasets (APIGEN-MT, Simia-Tau, MUA, TOUCAN), plus strong proprietary or large models (e.g., GPT-4.1, DeepSeek-V3.2-Exp). Note that APIGEN-MT and Simia are in-domain for τ-bench (they practice exactly on Airline and Retail), while GEM trains on out-of-domain text.
Scoreboard with Context:
- BFCL V3 Multi-Turn:
- Qwen3-8B-GEM: 30.25% overall accuracy (beats Qwen3-8B base: 18.00%, and other 8B baselines like APIGEN-MT and TOUCAN).
- Qwen3-32B-GEM: 44.88% overall accuracy (like getting an A when others got Bs), surpassing GPT-4.1 (38.88%) and DeepSeek-V3.2-Exp (37.38%).
- τ-bench (Airline, Retail):
- 8B scale: Qwen3-8B-GEM is competitive with in-domain models and beats APIGEN-MT on Retail Pass@4 (75.44% vs 69.30%).
- 32B scale: Qwen3-32B-GEM achieves Retail Pass@4 of 86.84%, outperforming Simia and MUA in Retail, and is competitive in Airline.
🍞 Top Bread (Hook): You know how practicing many different sports can make you a better athlete overall?
🥬 Concept: Generalization from Text
- What it is: Training on out-of-domain, text-mined trajectories teaches broad tool-use reasoning.
- How it works: Because procedures come from many domains, the model learns patterns (clarify → check → act → verify) that transfer.
- Why it matters: It matches or beats models trained right in the target domain. 🍞 Anchor: Even without practicing only “Airline,” the model still performs great on Airline tasks by mastering universal workflows.
Surprising Findings:
- Refinement matters a lot: Removing refinement drops accuracy notably (e.g., 32B overall from 44.88% to 32.50% on BFCL), showing that harder, more realistic dialogues teach robustness.
- LLM-based hallucination check helps: On 8B, accuracy improves from 27.38% to 30.25% with the check.
- Trajectory Synthesizer nearly matches the pipeline: On BFCL 8B, 28.38% with synthesizer vs 30.25% with full pipeline data; on τ-bench Retail Pass@4, 73.68% with synthesizer vs 75.44% with pipeline.
🍞 Top Bread (Hook): Think of tracking how many tools you actually used in a big project to see if it was challenging.
🥬 Concept: Data Complexity Stats
- What it is: Measuring how rich each trajectory is.
- How it works: Count tools per dialogue (mean 8.6), total messages (mean 46.1), and tool calls (mean 16.3).
- Why it matters: Deeper, longer interactions train long-context reasoning and multi-step planning. 🍞 Anchor: Compared to some datasets with ~6–18 turns, GEM’s 46-message dialogues force the model to remember and plan across a much longer story.
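These statistics are simple per-trajectory averages. A sketch over trajectories stored as tool lists plus message lists (the exact counting rules are assumptions about the paper's definitions):

```python
def complexity_stats(trajectories):
    """Per-dataset means: distinct tools per dialogue, messages, tool calls.

    Each trajectory is assumed to be {"tools": [...], "messages": [...]},
    where assistant messages may carry a "tool_calls" list.
    """
    n = len(trajectories)
    mean_tools = sum(len(t["tools"]) for t in trajectories) / n
    mean_messages = sum(len(t["messages"]) for t in trajectories) / n
    mean_tool_calls = sum(
        sum(len(m.get("tool_calls", [])) for m in t["messages"])
        for t in trajectories
    ) / n
    return {
        "mean_tools": mean_tools,
        "mean_messages": mean_messages,
        "mean_tool_calls": mean_tool_calls,
    }
```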
05 Discussion & Limitations
🍞 Top Bread (Hook): Even great coaches admit what their team still needs to work on. Let’s do that here.
🥬 Concept: Limitations
- What it is: The edges where this method may struggle.
- How it works:
- Source quality: If the text is unclear or outdated, extracted tools and rules can be off.
- Missing real execution: Tools are simulated, so some real-world quirks may be underrepresented.
- Domain gaps: Unusual or highly specialized domains may still be hard to cover.
- Why it matters: Knowing limits guides future improvements and careful use. 🍞 Anchor: A vague blog post might mislead tool design, causing a brittle dialogue.
🍞 Top Bread (Hook): Baking lots of cakes needs ingredients and an oven; training lots of data needs compute and models.
🥬 Concept: Required Resources
- What it is: What you need to run GEM or train the synthesizer.
- How it works:
- Access to large text corpora (e.g., Ultra-FineWeb, Wikihow).
- Strong teacher model for generation (e.g., GLM-4.6) and a capable LLM judge.
- Compute for supervised fine-tuning (full-parameter SFT was used here).
- Why it matters: Without these, you can’t produce or verify high-quality trajectories at scale. 🍞 Anchor: Like needing both a camera crew and an editor to create a good movie.
🍞 Top Bread (Hook): Some tools don’t fit some jobs.
🥬 Concept: When NOT to Use
- What it is: Situations where this approach may be suboptimal.
- How it works:
- If you already have rich, executable APIs and logs from the target domain, direct logging/replay may be better.
- If legal or privacy constraints block using large text corpora.
- If you need physical-world interactions that text can’t capture (robotic kinematics, sensor noise).
- Why it matters: Matching method to problem saves time and improves fidelity. 🍞 Anchor: For high-stakes medical device control, real simulator logs may be preferred over text-derived tools.
🍞 Top Bread (Hook): Questions are the seeds for the next harvest of ideas.
🥬 Concept: Open Questions
- What it is: What we still don’t know.
- How it works:
- Best strategies to align synthesized tools with real APIs for seamless transfer.
- Automatic difficulty tuning to match a model’s current skill.
- Stronger grounding checks that combine LLM judges with lightweight execution.
- Mixing text-derived and executable traces without overfitting.
- Why it matters: Solving these will push generalization, safety, and efficiency further. 🍞 Anchor: Can a model read a company wiki and then safely act in the company’s live API environment with minimal hand-tuning?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper introduces GEM, a pipeline that transforms ordinary text into realistic, multi-turn, tool-using training trajectories by extracting workflows, designing tools, generating full dialogues, and refining complexity.
- A Trajectory Synthesizer distills the pipeline into an efficient, end-to-end generator, drastically reducing cost while maintaining quality.
- Models trained on GEM data achieve strong results on BFCL V3 and τ-bench, even when trained out-of-domain, demonstrating robust generalization.
Main Achievement:
- Unlocking implicit human problem-solving experience in text as a scalable, diverse, and authentic source for training tool-using agents—bypassing the need for massive predefined API pools.
Future Directions:
- Align synthesized tools with real executable APIs; blend text-derived trajectories with real logs.
- Enhance validation by coupling LLM judges with partial execution or lightweight simulators.
- Develop curriculum-style refinement that adapts difficulty to the learner.
- Expand domains (e.g., enterprise workflows) and study safety guardrails for high-stakes settings.
Why Remember This:
- It flips the usual script: instead of asking “Which tools do we have?”, it asks “What do humans already know how to do?” and builds the tools from there.
- That shift turns the web’s how-to knowledge into powerful training fuel, creating assistants that clarify, follow rules, chain tools, and recover from errors in long, realistic conversations.
Practical Applications
- Auto-generate training data for customer support agents that must follow business rules and handle missing information.
- Build domain-agnostic assistant training sets from company wikis or manuals without first implementing full APIs.
- Rapidly adapt assistants to new product lines by mining updated documentation and regenerating trajectories.
- Create realistic, long-context tutoring dialogues (education/e-learning) from curriculum guides and lab manuals.
- Synthesize complex analytics workflows (research & data) from procedural write-ups, including error recovery.
- Improve retail assistants by extracting order, return, and exchange procedures directly from policy pages.
- Generate developer-tool assistants by converting CLI or API docs into conversations that check preconditions and handle failures.
- Train travel-planning agents from destination guides and airline policy pages, learning to clarify and chain multiple tools.
- Prototype enterprise workflows (HR, finance, IT) by mining internal SOPs and automatically producing tool-using dialogues.
- Reduce labeling costs by distilling the multi-stage pipeline into a Trajectory Synthesizer for continuous data refresh.