User-Oriented Multi-Turn Dialogue Generation with Tool Use at Scale
Key Summary
- The paper introduces a new way to create realistic, long conversations between people and AI agents that use tools like databases.
- Instead of a perfect robot that finishes a task in one shot, it simulates a human user who asks step by step and gives feedback each turn.
- It dynamically generates the tools and tasks the AI needs, so the system isn’t stuck with a small, fixed set of APIs.
- A plug-and-play design lets generation start from any point in a conversation and even include multiple tasks in one chat.
- They switch from simulated tool outputs to real, executable SQL tools so results can be verified and stay consistent across turns.
- The new data improves model performance on multi-turn, tool-use benchmarks like BFCL and τ2, especially for long and stateful tasks.
- Conversations become denser: more turns, more steps, and multiple goals in a single session, like real life.
- There’s a trade-off: more realism costs more time and compute, but it makes agents more reliable over repeated tries.
- The framework shows that modeling realistic user behavior and grounding tools in execution are key to training strong agentic models.
Why This Research Matters
Real people solve problems through back-and-forth conversations, not one-shot commands. This work teaches AI to do the same by simulating a realistic user who asks for steps, checks results, and gives feedback. Grounding tool use in real execution (like SQL) makes answers trustworthy and consistent across turns. The result is an AI that clarifies, adapts, and recovers from mistakes—more like a helpful teammate than an answer machine. That means better customer support, smarter office automation, and safer decision-making in data-heavy tasks. It also reduces hallucinations by rooting tool results in actual computations. In short, it upgrades AI from quick responders to reliable collaborators.
Detailed Explanation
01 Background & Problem Definition
You know how when you work on a big school project, you don’t finish it in one move—you ask questions, try something, check the result, and then adjust? That’s how people really get things done. But many AIs were trained on super tidy, one-shot tasks with fixed tools, which is not how real conversations and problem solving work.
🍞 Top Bread (Hook): Imagine texting a helpful friend to plan a trip. You don’t send one giant message; you go back and forth: dates, budget, flights, hotels, and more. 🥬 The Concept (Multi-Turn Dialogue Generation): It means creating conversations with many turns—ask, answer, check, follow up—until the goal is reached.
- What it is: A way to train AIs to handle back-and-forth chats that span multiple steps.
- How it works: The system keeps context, breaks goals into subtasks, calls tools, reads results, and continues.
- Why it matters: Without it, the AI tries to do everything in one go, misses details, and feels unrealistic. 🍞 Bottom Bread (Anchor): Planning a birthday party might take: invite list → date poll → venue search → budget check → final plan. That’s multi-turn.
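To make that loop concrete, here is a minimal Python sketch of one multi-turn step. The `call_model` and `run_tool` helpers are hypothetical placeholders standing in for an LLM API and a tool executor; they are not code from the paper.

```python
# Minimal sketch of one multi-turn step: the agent may call tools several
# times before answering. `call_model` and `run_tool` are hypothetical
# placeholders (an LLM API and a tool executor), not code from the paper.

def call_model(messages):
    """Placeholder: return {"type": "tool_call", ...} or {"type": "reply", ...}."""
    raise NotImplementedError

def run_tool(name, args):
    """Placeholder: execute the named tool and return its result."""
    raise NotImplementedError

def dialogue_turn(messages, user_message):
    """Handle one user turn while keeping the full conversation as context."""
    messages.append({"role": "user", "content": user_message})
    while True:
        step = call_model(messages)           # agent plans its next move
        if step["type"] == "tool_call":       # agent needs more information
            result = run_tool(step["name"], step["args"])
            messages.append({"role": "tool", "name": step["name"], "content": result})
        else:                                 # agent is ready to answer this turn
            messages.append({"role": "assistant", "content": step["content"]})
            return messages
```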
🍞 Top Bread (Hook): Think of a super-smart librarian who knows a lot, can plan, and can use the computer to look things up. 🥬 The Concept (Large Language Models): They are AI systems that read, write, and reason with language.
- What it is: A powerful text-based brain that can plan and explain.
- How it works: It predicts the next words using patterns learned from tons of examples, and can now call tools.
- Why it matters: Without LLMs, you don’t have the flexible reasoning needed for long, helpful chats. 🍞 Bottom Bread (Anchor): When you ask “What’s the capital of France?”, it answers “Paris” and can also plan how to find flight prices using tools.
Before this paper, many datasets used a small, fixed set of tools (like having only a hammer and a screwdriver). That made training simpler, but it didn’t match real life where new tools, APIs, and data sources keep appearing.
🍞 Top Bread (Hook): Imagine doing science experiments but only having two supplies forever. 🥬 The Concept (Static Toolsets): Fixed lists of tools or APIs that never change.
- What it is: Predefined tools that every task must use, even if they’re not a perfect fit.
- How it works: The agent selects from a static menu to solve tasks.
- Why it matters: Without new tools, agents can’t adapt to fresh problems. 🍞 Bottom Bread (Anchor): Trying to summarize a video using only a calculator—wrong tool, wrong job.
Researchers tried generating synthetic dialogues where a simulator (an AI) solves tasks quickly. That scaled the data, but it caused an “efficiency trap”: the agent solved tasks in the fewest possible turns.
🍞 Top Bread (Hook): Imagine a puzzle expert who solves everything silently in seconds. Impressive, but you don’t learn the steps. 🥬 The Concept (Single-Shot Trajectories): One-turn or minimal-turn solutions that skip realistic back-and-forth.
- What it is: Data where tasks end fast with minimal interaction.
- How it works: A perfect solver emits the final answer with little dialogue.
- Why it matters: Without the steps, models don’t learn clarifications, checks, or revisions. 🍞 Bottom Bread (Anchor): A math worksheet with only final answers teaches less than one with all the work shown.
What was missing? Modeling the user’s messy, incremental behavior and grounding tools in real execution so results could be trusted across turns. This paper fills that gap with a user simulator that acts like a human and with executable SQL tools that keep state consistent.
🍞 Top Bread (Hook): When you ask a database for “top 5 customers this month,” you want exact, checkable numbers. 🥬 The Concept (SQL-Driven Tools): Tools that run live SQL queries on real database schemas.
- What it is: A way to connect AI calls to a real database so outputs are factual and consistent.
- How it works: The system maps tool names to parameterized SQL, runs them, and returns exact results.
- Why it matters: Without real execution, you risk made-up results and broken state. 🍞 Bottom Bread (Anchor): “Show me all flights from NYC to LA tomorrow, under $300.” The SQL tool returns real rows that other steps can trust.
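As a rough illustration, a tool call can be mapped to a parameterized SQL query with Python's built-in sqlite3 module. The flights table and its columns below are invented for this example, not taken from the paper.

```python
import sqlite3

# Hypothetical mapping from a tool name to a parameterized SQL query.
# The flights table and its columns are invented for this illustration.
TOOL_TO_SQL = {
    "find_flights": (
        "SELECT flight_no, price FROM flights "
        "WHERE origin = :origin AND destination = :destination "
        "AND depart_date = :date AND price < :max_price"
    ),
}

def run_sql_tool(conn, tool_name, args):
    """Run the tool's parameterized SQL and return real rows as dicts."""
    cursor = conn.execute(TOOL_TO_SQL[tool_name], args)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_no TEXT, origin TEXT, "
             "destination TEXT, depart_date TEXT, price REAL)")
conn.execute("INSERT INTO flights VALUES ('UA100', 'NYC', 'LA', '2024-06-01', 250.0)")

print(run_sql_tool(conn, "find_flights",
                   {"origin": "NYC", "destination": "LA",
                    "date": "2024-06-01", "max_price": 300}))
# -> [{'flight_no': 'UA100', 'price': 250.0}]
```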
Real stakes: Customer support, scheduling, shopping, travel, and office work are all multi-turn, tool-using chores. Better training data makes agents that clarify, adapt, and recover from errors—more like helpful teammates than answer machines.
02 Core Idea
The aha moment in one sentence: Decouple the goal (Task) from the behavior (User), then simulate a realistic user who asks incrementally and reacts to tool results, while grounding tools in real execution to produce long, verifiable, multi-turn training data.
Three analogies:
- Theater: The script (Task) is separate from the actor’s performance (User). The actor delivers lines with pauses, questions, and reactions, making it feel real.
- Cooking class: The recipe (Task) stays the same, but the student (User) cooks step by step, tastes, adjusts, and asks for help.
- Sports practice: The drill (Task) is fixed, but the player (User) does reps, gets feedback, and improves over multiple tries.
🍞 Top Bread (Hook): You know how good coaches don’t just give you the answer; they guide you through steps. 🥬 The Concept (User-Oriented Simulation Paradigm): Simulate a human-like user who asks one or two subtasks per turn, reacts to tool outputs, and gives feedback.
- What it is: A user simulator with simple human rules to stretch conversations naturally.
- How it works: It reads the goal, issues small requests, checks results, asks follow-ups, and only ends when done.
- Why it matters: Without this, agents learn to skip clarifications and rush to final answers. 🍞 Bottom Bread (Anchor): “First, list last month’s sales by store. Okay, now filter top 3. Great—now summarize.”
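Here is a minimal sketch of such a user simulator. It assumes the descriptive task has already been split into subtasks and that a hypothetical `judge_subtask_done` helper (for example, an LLM judge) decides whether a subtask was satisfied. The one-or-two-subtasks-per-turn rule follows the paper's description; the rest is illustrative.

```python
import random

# Minimal sketch of a rule-based user simulator in the spirit of the paper:
# reveal one or two subtasks per turn, react to the agent's reply, and end
# only when everything is done. `judge_subtask_done` is a hypothetical helper
# (for example, an LLM judge); it is not part of the paper's released code.

def judge_subtask_done(subtask, agent_reply):
    """Placeholder: decide whether the agent's reply completes the subtask."""
    raise NotImplementedError

class UserSimulator:
    def __init__(self, subtasks):
        self.pending = list(subtasks)   # the descriptive task, split into steps
        self.asked = []                 # subtasks requested on the last turn

    def next_message(self, agent_reply=None):
        # React first: re-ask anything the agent missed on the previous turn.
        if agent_reply is not None:
            missed = [s for s in self.asked if not judge_subtask_done(s, agent_reply)]
            if missed:
                return "That's not quite what I asked. " + " ".join(missed)
        if not self.pending:
            return None                 # goal reached: end the conversation
        k = min(len(self.pending), random.choice([1, 2]))
        self.asked, self.pending = self.pending[:k], self.pending[k:]
        return "Next, please " + " and then ".join(self.asked)
```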
🍞 Top Bread (Hook): Imagine a Lego set where you can add new pieces anytime. 🥬 The Concept (Dynamic Tool & Task Synthesis): The system generates domain-specific tools, tasks, and rubrics on the fly.
- What it is: An LRM-based generator that designs tools and structured tasks with success checks.
- How it works: It studies seed tools/schemas, creates gaps to fill, proposes new tools, and builds tasks with expected tool patterns.
- Why it matters: Without fresh tools and structured goals, the agent can’t generalize. 🍞 Bottom Bread (Anchor): From a simple “get_customer_by_id,” it can add “list_overdue_invoices,” “apply_discount,” and “create_monthly_summary.”
🍞 Top Bread (Hook): Think of a backpack where you can swap notebooks without repacking everything. 🥬 The Concept (Plug-and-Play Scalability): A modular pipeline that can start generating from any conversation state.
- What it is: A design where you can inject tools or tasks midway and keep going.
- How it works: Tools → preprocessing (schemas) → tasks/rubrics → simulation → validation; any stage is replaceable.
- Why it matters: Without modularity, scaling and diversifying data is slow and brittle. 🍞 Bottom Bread (Anchor): Start from an ongoing chat and add a new tool for returns processing without restarting the whole simulation.
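A minimal sketch of the plug-and-play idea: each stage is a swappable callable, and generation can resume from any saved conversation state. The stage names mirror the list above; the bodies are placeholders, not the paper's implementation.

```python
# Minimal sketch of plug-and-play staging: each stage is a swappable callable
# and generation can resume from any saved conversation state. The stage names
# mirror the pipeline described above; the bodies are placeholders.

def prepare_tools(state):      return state   # synthesize new domain tools
def preprocess_schemas(state): return state   # predict JSON return schemas
def generate_tasks(state):     return state   # build descriptive tasks + rubrics
def simulate_dialogue(state):  return state   # user/agent simulation with tools
def validate(state):           return state   # rubric-based filtering

DEFAULT_STAGES = [prepare_tools, preprocess_schemas,
                  generate_tasks, simulate_dialogue, validate]

def run_pipeline(state, stages=DEFAULT_STAGES, start_at=0):
    """Run, or resume, the pipeline from any stage on any conversation state."""
    for stage in stages[start_at:]:
        state = stage(state)
    return state

# Resume an ongoing chat after injecting a hypothetical returns-processing tool.
state = {"messages": [], "tools": [{"name": "get_customer_by_id"}]}
state["tools"].append({"name": "process_return"})
state = run_pipeline(state, start_at=3)   # re-enter at the simulation stage
```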
Before vs After:
- Before: Fixed tools, short chats, simulated outputs—agents got good at final answers but weak at back-and-forth.
- After: Generated tools, long chats, executable SQL—agents practice clarifying, tracking state, and recovering from errors.
Why it works (intuition):
- The user simulator increases turns naturally, forcing the model to practice clarifications and stepwise reasoning.
- Rubrics anchor each step so the system can check correctness early and often.
- Executable SQL makes outputs true and persistent, so later turns must respect earlier changes.
Building blocks (mini-sandwich cards):
- 🍞 Hook: You bake cookies by following steps, not jumping to the last one.
🥬 Concept (Rubric-Based Tasks): Tasks with clear success criteria and expected tool patterns.
- What: A checklist for each step.
- How: Define difficulty, expected calls, and checkpoints.
- Why: Without rubrics, you can’t verify progress. 🍞 Anchor: “Sift → mix → bake → taste” with checks at each step.
- 🍞 Hook: When you label your folders, it’s easier to find files.
🥬 Concept (Tool Preprocessing with JSON Schemas): Predict return schemas to keep tools consistent.
- What: Standard output shapes for each tool.
- How: Multi-turn schema design ensures shared fields (like user_id) match everywhere.
- Why: Without consistency, conversations break. 🍞 Anchor: Every contact card has name, email, phone in the same format.
- 🍞 Hook: Using a calculator gives exact numbers, not guesses.
🥬 Concept (Execution-Grounded Tools): Tools run for real (e.g., SQL), returning verifiable results.
- What: Real execution instead of imagined outputs.
- How: Map tool specs to queries; run them on real schemas.
- Why: Without grounding, models can hallucinate. 🍞 Anchor: “Total sales = 12,345” comes from an actual query, not a guess.
03 Methodology
At a high level: Inputs (seed tools/schemas or a blank slate) → Tool preparation → Tool preprocessing (schemas) → Task generation (with rubrics) → Conversation simulation (task-oriented or user-oriented) → Validation and filtering → High-density, multi-turn dataset.
Step A: Tool Preparation
- What happens: Starting from a small seed (e.g., a couple of APIs or a database), the system imagines realistic user questions, then generates new, domain-specific tools with names, descriptions, and parameters.
- Why this step exists: Without enough tools, conversations are shallow and repetitive.
- Example: From a sports database with club and player tables, it adds tools like get_top_players_by_earnings or get_club_players_summary.
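For illustration, a generated tool specification might look like the snippet below. The exact field layout the authors use is not shown in this summary, so treat this shape as an assumption.

```python
# Illustrative specification for one generated tool; the exact field layout
# used by the authors may differ, so treat this shape as an assumption.
generated_tool = {
    "name": "get_top_players_by_earnings",
    "description": "Return the top-N players ranked by career earnings.",
    "parameters": {
        "type": "object",
        "properties": {
            "limit": {"type": "integer",
                      "description": "How many players to return."},
        },
        "required": ["limit"],
    },
}
```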
Step B: Tool Preprocessing (JSON Schemas)
- What happens: For each tool, the system predicts a JSON schema describing its return value—field names, types, and meanings.
- Why this step exists: Without consistent outputs, later steps can’t rely on earlier results; state tracking breaks.
- Example: Every tool that returns player_id uses the same type (integer) everywhere.
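For example, a predicted return schema might look like this, together with a small consistency check over shared fields. The structure is an assumption; the point is that a field like player_id keeps one type everywhere.

```python
# Illustrative return schemas predicted during preprocessing, plus a small
# check that a shared field keeps one type across all tools. The structure
# is an assumption, not the paper's exact format.
RETURN_SCHEMAS = {
    "get_top_players_by_earnings": {
        "type": "array",
        "items": {"type": "object",
                  "properties": {"player_id": {"type": "integer"},
                                 "name": {"type": "string"},
                                 "earnings": {"type": "number"}}},
    },
    "get_club_players_summary": {
        "type": "object",
        "properties": {"club_id": {"type": "integer"},
                       "player_id": {"type": "integer"},
                       "avg_earnings": {"type": "number"}},
    },
}

def shared_field_is_consistent(schemas, field="player_id"):
    """True if every tool that returns `field` declares the same type for it."""
    types = set()
    for schema in schemas.values():
        props = schema.get("items", schema).get("properties", {})
        if field in props:
            types.add(props[field]["type"])
    return len(types) <= 1

assert shared_field_is_consistent(RETURN_SCHEMAS)
```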
Step C: Task Generation with Rubrics (Task-Oriented Mode)
- What happens: The system creates tasks labeled easy/medium/hard, with step-level success criteria, expected tool-use patterns (with placeholders), and checkpoints.
- Why this step exists: Rubrics allow objective verification and encourage multi-step reasoning.
- Example: “Find top-3 players by earnings, then compute average earnings of their club; verify each step with expected calls.”
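An illustrative rubric for that example task might look like the following; the field names here are assumptions rather than the paper's exact format.

```python
# Illustrative rubric for the example task above; field names are assumptions.
rubric = {
    "difficulty": "medium",
    "goal": ("Find the top-3 players by earnings, then compute the average "
             "earnings of each player's club."),
    "expected_tool_pattern": [
        {"tool": "get_top_players_by_earnings", "args": {"limit": 3}},
        {"tool": "get_club_players_summary", "args": {"club_id": "<club_id>"}},
    ],
    "checkpoints": [
        "Exactly three players are returned, ranked by earnings.",
        "A club summary is requested for each of those players' clubs.",
        "The final answer reports one average per club.",
    ],
}
```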
Step D: Response Generation with a Simulator (Task-Oriented)
- What happens: A strong LRM (e.g., GPT-OSS-120b) generates the agent’s messages and tool arguments; a simulator produces tool outputs for synthetic tools; time is randomized but respects user-provided dates.
- Why this step exists: Ensures conversations are complete and plausible when using synthetic tools.
- Example: The agent calls get_top_players_by_earnings(limit=3), reads results, then calls get_club_players_summary for each club.
Step E: Validation and Filtering
- What happens: A validator checks if the conversation meets the rubric, uses required tools, and is semantically correct.
- Why this step exists: To keep only high-quality, high-density trajectories.
- Example: If a required call is missing or fields don’t match schemas, the dialogue is filtered out.
Secret issue discovered: Efficiency Trap
- In task-oriented mode, the simulator often solves the goal in very few turns—great for speed, bad for realism.
Step F: Shift to User-Oriented Multi-Turn Generation
- What changes: Decouple the Task (goal) from the User (behavior). Replace direct questions with descriptive tasks (the end goal written as a statement). Add a user simulator that behaves like a human: asks for one or two subtasks per turn, checks results, and provides feedback.
- Why this step exists: To create longer, more realistic interactions that teach clarification, verification, and correction.
- Example: Descriptive task: “Produce a monthly sales summary and highlight top-performing stores; confirm missing data through queries.” The user simulator begins: “Could you list total sales by store for June first?”
Step G: Execution-Grounded Tools (SQL-Driven)
- What happens: Tools map to real SQL queries over actual schemas (e.g., from Spider). During generation, the system runs queries and returns real rows.
- Why this step exists: To ensure outputs are accurate, persistent across turns, and verifiable.
- Example: After an UPDATE, a later SELECT sees the new value—so the conversation’s state is consistent.
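A tiny sqlite3 sketch of why execution grounding matters for state: a later SELECT must observe an earlier UPDATE. The products table below is invented for this illustration.

```python
import sqlite3

# Tiny sketch of execution-grounded state: a later SELECT observes an earlier
# UPDATE, so the conversation's state stays consistent across turns. The
# products table is invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 19.99)")

# Earlier turn: the agent updates a price through a tool call.
conn.execute("UPDATE products SET price = :p WHERE id = :id", {"p": 17.99, "id": 1})

# Later turn: a read tool must see the new value, not the original one.
price = conn.execute("SELECT price FROM products WHERE id = 1").fetchone()[0]
assert price == 17.99
```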
Step H: High-Density Trajectories and Plug-and-Play
- What happens: Multiple tasks can be completed in one conversation; generation can start from any state (new tools mid-chat, new goals after partial progress).
- Why this step exists: Real sessions often involve many related requests and state changes.
- Example: A single thread: query top products → update a price → generate a markdown summary → ask for a trend chart spec.
The Secret Sauce
- A user simulator with simple human-like rules reliably stretches conversations without artificial padding.
- Rubric-checked steps keep quality high.
- Executable tools prevent hallucinations and let state persist, which is crucial for long-horizon reasoning.
Mini-sandwiches for critical components:
- 🍞 Hook: Like reading a treasure map with checkpoints.
🥬 Concept (Rubric-Based Verification): Step-by-step checks ensure each tool call and result is correct.
- Why it matters: Without checks, wrong steps sneak through and teach bad habits. 🍞 Anchor: “Find the big oak → cross the stream → dig by the X.”
- 🍞 Hook: Like asking, “First, show me the ingredients.”
🥬 Concept (Descriptive Tasks): Goals stated as end results, not as a single question, guiding multi-turn progress.
- Why it matters: Encourages piecemeal requests and natural follow-ups. 🍞 Anchor: “By the end, I want a birthday plan with guests, budget, and venue.”
- 🍞 Hook: Like keeping your notebook neat so every page matches.
🥬 Concept (Schema Consistency): Shared fields use the same types across tools and turns.
- Why it matters: Prevents type mismatches and broken pipelines. 🍞 Anchor: Student IDs are always integers everywhere.
End-to-End Example (simplified): Input: Retail database schemas + seed tools. → Prepare tools (add list_top_products, update_price, summarize_sales_by_store). → Preprocess schemas (define outputs for each tool). → Generate descriptive task ("Produce June sales summary, call out top 3 stores, and flag stores with missing data; confirm if any data is unknown."). → User simulator starts: “First, show total sales by store for June.” → Agent calls summarize_sales_by_store(month=6), gets rows. → User: “Now, highlight top 3; then check if any stores have null inventory.” → Agent filters, calls check_missing_inventory, etc. → Validator confirms steps match rubric and results are consistent. → Store the whole chat as training data.
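Stored as training data, that conversation might look roughly like the message list below; the exact format of the generated data may differ from this sketch.

```python
# Roughly what one stored trajectory could look like as a message list; the
# exact format of the generated data may differ from this sketch.
trajectory = [
    {"role": "user",
     "content": "First, show total sales by store for June."},
    {"role": "assistant",
     "tool_calls": [{"name": "summarize_sales_by_store", "args": {"month": 6}}]},
    {"role": "tool", "name": "summarize_sales_by_store",
     "content": [{"store": "Downtown", "total": 120345.50},
                 {"store": "Airport", "total": 98210.00}]},
    {"role": "assistant",
     "content": "Here are the June totals by store: ..."},
    {"role": "user",
     "content": "Now highlight the top 3 and check for stores with null inventory."},
    # ...further turns, ending only when the descriptive task is fully satisfied.
]
```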
04 Experiments & Results
The Test: The team measured whether models trained on this data handled multi-turn, tool-using conversations better. They used two public benchmarks: BFCL (Berkeley Function Calling Leaderboard) and τ2, which stress multi-turn reasoning, state tracking, and user noise.
🍞 Top Bread (Hook): It’s like grading students not just on the final answer, but on how they ask questions, use calculators, and check their work. 🥬 The Concept (What was measured): Multi-turn robustness, tool-call accuracy, and consistency across repeats.
- What it is: Tests that require planning, calling the right tools, handling feedback, and keeping state.
- How it works: Compare models fine-tuned on different datasets, then score them.
- Why it matters: Without strong multi-turn skills, agents stumble in real workflows. 🍞 Bottom Bread (Anchor): A travel assistant that must book flights, then hotels, then adjust when the user changes dates.
The Competition: Baselines included APIGEN and NEMOTRON, popular synthetic data approaches. The authors fine-tuned Qwen-family reasoning models (4B and 30B) with different generation pipelines (task-oriented, user-oriented, and user-oriented + tool-execution) and compared results.
Scoreboard with context:
- User-oriented data helped models do better on τ2, which really pressures agents to handle incremental, messy user requests. That’s like moving from pop quizzes to full projects.
- Adding tool execution (SQL-backed) delivered the strongest overall results in challenging domains (e.g., Telecom), showing that verifiable outputs and persistent state improve reliability.
- Importantly, BFCL function-calling accuracy stayed stable or got slightly better, so the extra complexity didn’t break low-level calling skills.
Data shape (what changed in the conversations):
- In task-oriented generation from Nemotron seeds, average turns were around 12.84.
- In the user-oriented setting, average turns jumped to around 21.79 (and similar increases across other sources), meaning richer back-and-forth.
- Conversations often included multiple tasks in one thread, creating high-density trajectories that mirror real user sessions.
Surprising and important findings:
- Consistency over repeats (Pass^k): Models trained with this data more often solved the same task correctly across multiple tries, especially in complex, stateful domains like Telecom. That means fewer lucky one-offs and more steady, trustworthy behavior (a sketch of the Pass^k estimator follows this list).
- Efficiency trade-off: The user-oriented pipeline took longer and produced fewer tokens per second during generation (higher latency, lower throughput), but the resulting data led to better long-horizon skills.
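The Pass^k idea can be made precise. A commonly used unbiased estimator (for example, in τ-bench-style evaluations) is C(c, k) / C(n, k): the probability that k fresh attempts at a task all succeed, given c successes observed in n trials. A minimal sketch, with illustrative numbers:

```python
from math import comb

def pass_hat_k(n, c, k):
    """Estimate the probability that k fresh attempts at a task all succeed,
    given c successes observed in n trials (a standard Pass^k estimator)."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    return comb(c, k) / comb(n, k)

# Example: 6 successes out of 8 trials -> chance that 4 attempts all succeed.
print(round(pass_hat_k(n=8, c=6, k=4), 3))   # 0.214
```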
Interpretation for everyday life: If you want an AI that acts like a steady teammate—asking clarifying questions, double-checking results, and fixing mistakes—train it with conversations that actually look and feel that way, and tie its tools to real data so it can’t make things up.
05 Discussion & Limitations
Limitations and honest look:
- More compute and time: Because the user simulator talks in many turns and tools actually run (e.g., SQL), generation is slower and more expensive than quick one-shot synthesis.
- Environment coupling: Grounding tools in real databases means the schemas, state, and tool specs must match perfectly. If they drift, errors can cascade through long conversations.
- Partial views are tricky: When only part of a database is visible, the model can get brittle—unsure how to proceed or tempted to guess. Better strategies for uncertainty and recovery are needed.
Required resources:
- Strong base models (reasoning LLMs), GPU clusters for training and generation, and a database engine to execute SQL tools at scale.
- Engineering to maintain schemas, tool specs, and validation rubrics.
When not to use:
- If you only need single-turn Q&A without tool use (e.g., trivia), this pipeline is overkill.
- If you cannot afford execution (no database, no APIs) or strict schema alignment, the user-oriented + execution setup may be too heavy.
- If latency and cost are the top priority, task-oriented synthetic data might be sufficient.
Open questions:
- How well do execution-grounded models transfer to totally new domains and unseen tool ecosystems?
- Can we make the user simulator adaptively harder (or easier) based on the model’s skill, like a good tutor?
- How to add robust error recovery and state-repair when the environment has inconsistencies or partial data?
- What’s the best mix of task-oriented and user-oriented data for a given application and budget?
06 Conclusion & Future Work
Three-sentence summary: This paper presents a user-oriented generation framework that simulates human-like, incremental conversations and grounds tool calls in real execution (e.g., SQL). The modular, plug-and-play pipeline produces long, dense, verifiable multi-turn dialogues that better train agentic reasoning models. Experiments show stronger performance on benchmarks that value multi-turn coherence and statefulness, with improved consistency across repeated trials.
Main achievement: Decoupling tasks from user behavior and adding execution-grounded tools to scale realistic, high-fidelity training data for tool-using agents.
Future directions: Smarter user simulators that adjust difficulty, broader execution environments beyond SQL (APIs, files, web), stronger error-recovery mechanisms, and research on cross-domain generalization. Also, exploring optimal blends of task-oriented and user-oriented data to balance cost and performance.
Why remember this: Teaching AIs to be great teammates means training them in realistic conversations with real tools and real consequences—step by step, just like people work. This framework is a blueprint for building that kind of data at scale.
Practical Applications
- Customer support agents that ask clarifying questions, check account data via tools, and follow up until the issue is resolved.
- Sales and retail assistants that query databases, update prices, and produce summaries within the same conversation.
- Travel planners that iteratively search flights, adjust to date/budget changes, and confirm bookings with verifiable steps.
- Operations dashboards where the AI investigates anomalies step by step and generates action items grounded in live data.
- Internal helpdesk bots that triage tickets, fetch logs, run diagnostic tools, and report findings across multiple turns.
- Education tutors that guide students through multi-step solutions, verifying each step with rubric-like checks.
- Healthcare admin assistants that schedule, verify insurance eligibility, and update records using compliant, auditable tools.
- Data analysts’ copilots that run SQL queries, refine filters on feedback, and create final reports with provenance.
- Financial back-office agents that reconcile transactions, flag edge cases, and document decisions in one conversation.
- IT automation assistants that read configs, execute safe updates, and validate changes through follow-up checks.