
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Intermediate
Changle Qu, Sunhao Dai, Hengyi Cai et al. Ā· 1/15/2026
arXiv Ā· PDF

Key Summary

  • MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.
  • It matches the agent’s tool calls to expert tool calls using a bipartite matching trick, then gives a fair score to each turn.
  • There are two ways to match: hard (one-to-one, Hungarian algorithm) and soft (probabilistic, Optimal Transport).
  • It blends turn-level rewards with the final answer reward so the agent learns both precise steps and overall success.
  • A dual-level advantage signal (turn-level + trajectory-level) replaces the usual one-size-fits-all advantage in RL.
  • On three benchmarks (FTRL, BFCL, ToolHop), MatchTIR beats strong baselines and even lets a 4B model top many 8B models.
  • It especially shines on long, multi-turn tasks by rewarding useful calls and discouraging redundant or wrong ones.
  • Ablations show turn-level rewards and dual-level advantages both matter; hard matching usually works best.
  • The method needs ground-truth traces of tool use and was tested on smaller models due to compute limits.

Why This Research Matters

Better per-step feedback makes AI agents pick the right tools at the right times, cutting mistakes and wasted calls. This means faster customer support workflows, safer data handling, and clearer audit trails, because every step is checkable. Companies can save compute costs since the agent learns to avoid redundant calls. Users get more reliable results in long tasks like trip planning, report generation, and code debugging. The approach scales to many structured tools where correctness is verifiable. It also shows that smarter training beats simply making models bigger. Over time, this can raise the bar for practical, trustworthy AI assistants in real products.

Detailed Explanation

01 Background & Problem Definition

šŸž Top Bread (Hook): You know how a kid solving a big puzzle might use different tools—like a magnifying glass for tiny pieces or a sorter tray for colors—and has to decide which tool to use at each moment? If they get praised only when the whole puzzle is done, they won’t know which tool choices helped and which didn’t.

🄬 Filling (The Actual Concept)

  • What it is: This paper studies Tool-Integrated Reasoning (TIR), where an AI uses outside tools during multi-step thinking, and proposes MatchTIR, a way to reward each tool-use step fairly.
  • How it works (story of the world before): Before methods like this, AI agents got rewards mainly at the end (outcome reward) or once per whole journey (trajectory reward). That meant every step inside the journey was treated the same, even if some steps were brilliant and others were mistakes. People tried to add more signals with special reward models or lots of random trials (Monte Carlo), but those were often biased, costly, and noisy—especially for long, many-turn problems.
  • Why it matters: If you don’t tell the agent exactly which steps helped, it can’t learn which tools to pick, how to fill parameters, or when to stop. The agent becomes slower, makes more wrong calls, and wastes computer time.

šŸž Bottom Bread (Anchor): Imagine you ask an AI to book a trip. It needs to search flights, check baggage rules, and compare hotels. If it gets only a final ā€˜good job’ or ā€˜bad job’ score, it won’t know that the third search was the one that really nailed your dates, while the second search was off by a week. It keeps repeating the same mistakes.

šŸž Top Bread (Hook): Think of a math student using a calculator, a graphing tool, and a formula table. If the teacher only grades the final answer, the student won’t know which tool use was right or wrong along the way.

🄬 Filling

  • The problem: In TIR, agents interleave thinking with tool calls (names + parameter names + parameter values). Prior RL methods assign the same advantage to all tokens in a trajectory, creating ā€˜one-size-fits-all’ credit. In long tasks, that hides which turns mattered.
  • What failed before: External reward models can hallucinate, sampling many futures is expensive and noisy, and some tools (like web search) have many equally valid queries, making unique step labels tricky.
  • The missing piece: In many TIR tasks, tool names and parameters are structured and checkable. That means we can verify if a turn used the right tool and correct parameters. We just need a fair way to line up the agent’s tool calls with the expert’s.

šŸž Bottom Bread (Anchor): It’s like grading a science lab by checking if the student used the right instrument (tool name), measured the right thing (parameter name), and got the correct setting (parameter value). Now you can give feedback per step, not just at the end.

šŸž Top Bread (Hook): Imagine sorting two decks of recipe cards: your cooking steps and the chef’s gold-standard steps. You want to match the steps that truly correspond.

🄬 Filling

  • The gap this paper fills: MatchTIR aligns the agent’s tool calls with the ground-truth calls using bipartite matching to score each turn, then blends that with the final answer score. It also creates a dual-level advantage so each turn gets the right push during training, rather than the same push for all.
  • Real stakes: Better tool use helps with customer support workflows, data analysis, code debugging with compilers, medical triage checklists, and research assistants using search, spreadsheets, and databases. It can save time, money, and reduce errors.

šŸž Bottom Bread (Anchor): If an AI is helping schedule a delivery, it must call the right route planner with the right addresses and time windows. Grading each call keeps it from trying the wrong tool three times in a row and helps it learn the winning path faster.

02 Core Idea

šŸž Top Bread (Hook): Imagine a relay race where each runner hands off a baton. If the team only gets a medal at the end, you can’t tell which handoff was smooth and which one fumbled. You need to score each handoff too.

🄬 Filling (The Actual Concept)

  • One-sentence ā€œAha!ā€: MatchTIR aligns each predicted tool call with the best-matching expert call (via bipartite matching) to give turn-level rewards, then combines those with final-outcome signals using a dual-level advantage so every turn gets the right training nudge.

  • Multiple analogies (3 ways):

    1. Recipe cards: MatchTIR lines up your cooking steps with the chef’s steps; each matched step gets a score, and the whole dish also gets a taste test score.
    2. Soccer replay: Each pass (turn) is graded for accuracy and timing, and the team’s final win/loss (outcome) is also counted.
    3. School project: Each section of the report (turn) is graded for content and format, and the final presentation score (outcome) is added on top.
  • Before vs After:
    • Before: Every step in a multi-turn run shared the same advantage, masking which steps actually helped or hurt.
    • After: Each turn gets its own reward and advantage based on how well it matches expert tool use and how much it contributes to the final success, leading to cleaner, faster learning.

  • Why it works (intuition, no equations):
    • Structured tools expose checkable pieces: tool name, parameter names, and parameter values.
    • When you match predicted calls to gold calls, you can precisely tell which turns did the right thing.
    • Combining per-turn signals with the final answer score keeps the agent from over-optimizing on tiny steps and forgetting the big goal.
    • Dual-level advantages ensure both local correctness (this turn) and global success (the whole journey) shape learning.

  • Building Blocks:

    1. Similarity scoring: Compare tool names, parameter names (overlap), and parameter values (exact matches) to build a similarity matrix between predicted and gold calls.
    2. Hard vs Soft assignment: Hard uses the Hungarian algorithm (one-to-one strict matching), soft uses Optimal Transport (probabilistic many-to-one), both turning similarities into fair turn-level rewards (a code sketch of the similarity scoring and both assignment schemes follows this list).
    3. Turn aggregation: If a turn has multiple calls, average them to avoid rewarding spammy extra calls.
    4. Outcome reward: Score the final answer (like F1) to keep eyes on the prize.
    5. Dual-level advantage: Combine a trajectory-level signal (how good the whole run was vs others) with a turn-level signal (discounted future impact from this turn on) for sharper credit assignment.
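
To make the first two building blocks concrete, here is a minimal sketch of the call-level similarity score plus the hard (Hungarian) and soft (Sinkhorn-style Optimal Transport) assignment schemes. The ToolCall structure, the equal weighting of the three similarity components, the unmatched-call penalty, and the Sinkhorn settings are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass, field
from typing import Dict

import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian / Kuhn-Munkres


@dataclass
class ToolCall:
    name: str
    params: Dict[str, str] = field(default_factory=dict)


def call_similarity(pred: ToolCall, gold: ToolCall) -> float:
    """Score in [0, 1] from tool name, parameter-name overlap, and value matches."""
    name_match = 1.0 if pred.name == gold.name else 0.0
    pred_keys, gold_keys = set(pred.params), set(gold.params)
    union = pred_keys | gold_keys
    name_overlap = len(pred_keys & gold_keys) / len(union) if union else 1.0
    value_hits = sum(pred.params.get(k) == v for k, v in gold.params.items())
    value_score = value_hits / len(gold.params) if gold.params else 1.0
    return (name_match + name_overlap + value_score) / 3.0  # equal weights (assumed)


def similarity_matrix(preds, golds) -> np.ndarray:
    """S[i, j] = similarity between predicted call i and gold call j."""
    return np.array([[call_similarity(p, g) for g in golds] for p in preds])


def hard_rewards(S: np.ndarray, unmatched_penalty: float = 0.0) -> np.ndarray:
    """Hard assignment: best one-to-one pairing; unmatched predicted calls
    (e.g., duplicates) receive a penalty instead of extra credit."""
    rewards = np.full(S.shape[0], unmatched_penalty)
    rows, cols = linear_sum_assignment(-S)  # negate to maximize total similarity
    rewards[rows] = S[rows, cols]
    return rewards


def soft_rewards(S: np.ndarray, eps: float = 0.1, n_iters: int = 50) -> np.ndarray:
    """Soft assignment: entropy-regularized OT (Sinkhorn); each predicted call
    shares credit across gold calls in proportion to the transport plan."""
    K = np.exp(S / eps)                               # similarity, so no minus sign
    a = np.full(S.shape[0], 1.0 / S.shape[0])         # uniform marginals
    b = np.full(S.shape[1], 1.0 / S.shape[1])
    u = np.ones(S.shape[0])
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                # transport plan
    weights = plan / plan.sum(axis=1, keepdims=True)  # per-call credit weights
    return (weights * S).sum(axis=1)


# Tiny usage example with hypothetical calls:
pred = [ToolCall("book_hotel", {"city": "Paris", "nights": "3"}),
        ToolCall("book_hotel", {"city": "Paris", "nights": "3"})]   # duplicate
gold = [ToolCall("book_hotel", {"city": "Paris", "nights": "2"})]
S = similarity_matrix(pred, gold)
print(hard_rewards(S, unmatched_penalty=-0.2))  # only one copy earns credit
print(soft_rewards(S))                          # soft assignment is smoother: both near-duplicates still score
```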

šŸž Bottom Bread (Anchor): Picture a coding helper that must call a compiler, a linter, and a test runner. MatchTIR lines up each of its calls with a teacher’s reference calls, scores them, and also checks whether the final program works. The helper learns which step to fix next time, not just that the overall try failed.

03 Methodology

šŸž Top Bread (Hook): Imagine you’re building a LEGO model from instructions. You compare your steps to the instruction booklet, give yourself a score for each step, and also check if the final model looks right. Then you practice again, improving the steps that got low scores.

🄬 Filling (The Actual Concept)

  • High-level flow: Input question and tools → Agent runs multi-turn (think, call tool, get response) → Build similarity between predicted vs gold tool calls → Hard or soft match to assign per-call rewards → Average per-call rewards into per-turn reward → Add final outcome reward → Compute turn- and trajectory-level advantages → Train with GRPO using the integrated advantage.

Step-by-step details:

  1. Multi-turn trajectory generation

    • What happens: The policy LLM takes a question, reasons, and at each turn may call one or more tools with parameter names and values, receives tool outputs, and continues until it answers or hits a limit.
    • Why it exists: We need a real sequence of actions to evaluate, not just a single guess.
    • Example: Turn 1 calls weather_api(city="Boston"), Turn 2 calls flight_api(date="Fri"), Turn 3 outputs the final itinerary.
  2. Build the similarity matrix S (predicted vs gold)

    • What happens: For every predicted call and every gold call, compute a similarity score using:
      • Tool name match (1 or 0)
      • Parameter name overlap (Jaccard)
      • Parameter value correctness (count of exact matches)
      Then normalize to 0–1.
    • Why it exists: This gives a fair, structured way to say ā€œhow close was this predicted call to a correct call?ā€
    • Example: Predicted name=ā€œbook_hotelā€ vs gold name=ā€œbook_hotelā€ (name=1), parameter names overlap 2/3, values match 1/2 → similarity ā‰ˆ 0.65.
  3. Convert similarities into rewards with assignment

    • Hard assignment (Hungarian/Kuhn–Munkres)
      • What: Find the best one-to-one pairing of predicted and gold calls to maximize total similarity.
      • Why: Prevents gaming by repeating similar calls to harvest extra credit.
      • Example: If two predicted calls both resemble one gold call, only the best-matching one gets that credit; the other gets a penalty or zero.
    • Soft assignment (Optimal Transport/Sinkhorn)
      • What: Treat calls as distributions and compute a probabilistic mapping; each predicted call shares credit across multiple gold calls by weight.
      • Why: Gives smoother feedback when multiple predicted steps partially help.
      • Example: A predicted call 70% matches gold step A and 30% matches gold step B → reward = 0.7·sim(A) + 0.3·sim(B).
  4. Aggregate to turn-level rewards

    • What happens: If a turn has k tool calls, average their call-level rewards.
    • Why it exists: Normalizes across turns and discourages spammy extra calls.
    • Example: A turn with three calls receiving 1.0, 0.4, and 0.0 yields a turn reward of (1.0 + 0.4 + 0.0)/3 ā‰ˆ 0.47 (see the sketch after this step list).
  5. Add an outcome-level reward

    • What happens: Score the final answer (e.g., with F1) to reflect global success.
    • Why it exists: Ensures the agent doesn’t obsess over perfect micro-steps but forget the overall goal.
    • Example: If the gold answer is ā€œParisā€ and the agent says ā€œParis,ā€ outcome reward = 1.0.
  6. Dual-level advantage estimation

    • Trajectory-level advantage
      • What: Sum all turn rewards (including the final outcome) and compare this rollout to other rollouts from the same prompt (group normalization).
      • Why: Encourages globally better reasoning paths.
      • Example: Among 16 rollouts, if yours scores well above the group mean, you get a positive global advantage.
    • Turn-level advantage (discounted)
      • What: For each turn t, compute the discounted sum of rewards from t onward, then compare across rollouts at the same turn index.
      • Why: Captures long-term impact of a turn on future success (early good decisions matter).
      • Example: With discount γ = 0.9, Turn 2 sees rewards from Turns 2..T weighted as 1.0, 0.9, 0.81, ... (a code sketch of both signals follows the step list).
  7. Integrate advantages and optimize with GRPO

    • What happens: For each token belonging to a given turn, add the trajectory-level and turn-level advantages to form the training signal, then apply GRPO updates with clipping and a small KL term to stay near a reference model. Mask tool responses (since the environment generated them).
    • Why it exists: Combining both views (global + local) stabilizes and focuses learning where it counts.
    • Example: If A_trajectory=+0.8 and A_turn=+0.4 for a token, the integrated advantage is +1.2.
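
Below is a minimal sketch of steps 4 and 5 (turn-level aggregation and the outcome reward), assuming call-level rewards come from a matcher like the one sketched in the Core Idea section. The token-overlap F1 shown here is a common choice; the paper's exact outcome metric and any weighting between turn and outcome rewards may differ.

```python
from collections import Counter
from typing import List


def turn_reward(call_rewards: List[float]) -> float:
    """Average the call-level rewards within one turn, so spamming extra calls
    dilutes credit rather than harvesting more of it."""
    return sum(call_rewards) / len(call_rewards) if call_rewards else 0.0


def outcome_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between the final answer and the gold answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Worked examples matching the text above:
print(turn_reward([1.0, 0.4, 0.0]))   # -> 0.466..., i.e. roughly 0.47
print(outcome_f1("Paris", "Paris"))   # -> 1.0
```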

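And a sketch of steps 6 and 7, the dual-level advantage: a group-normalized trajectory-level signal plus a discounted turn-level signal normalized across rollouts at the same turn index, added together per turn. The handling of rollouts with different lengths and the equal weighting of the two signals are assumptions; the paper may normalize or combine them differently.

```python
from typing import List

import numpy as np


def trajectory_advantages(turn_rewards_group: List[List[float]]) -> np.ndarray:
    """Trajectory level: total reward per rollout, normalized within the group."""
    totals = np.array([sum(r) for r in turn_rewards_group])
    return (totals - totals.mean()) / (totals.std() + 1e-6)


def turn_advantages(turn_rewards_group: List[List[float]],
                    gamma: float = 0.9) -> List[np.ndarray]:
    """Turn level: discounted return from each turn onward, normalized across
    rollouts that share the same turn index."""
    returns = []
    for rewards in turn_rewards_group:            # rollouts can differ in length
        acc, rev = 0.0, []
        for r in reversed(rewards):
            acc = r + gamma * acc
            rev.append(acc)
        returns.append(np.array(rev[::-1], dtype=float))

    advantages = [np.zeros_like(g) for g in returns]
    for t in range(max(len(g) for g in returns)):
        vals = np.array([g[t] for g in returns if len(g) > t])
        mean, std = vals.mean(), vals.std() + 1e-6
        for g, adv in zip(returns, advantages):
            if len(g) > t:
                adv[t] = (g[t] - mean) / std
    return advantages


def integrated_advantages(turn_rewards_group: List[List[float]],
                          gamma: float = 0.9) -> List[np.ndarray]:
    """Per-turn training signal: trajectory-level + turn-level. During the GRPO
    update this value is applied to every policy token of that turn, with
    tool-response tokens masked out."""
    traj = trajectory_advantages(turn_rewards_group)
    turn = turn_advantages(turn_rewards_group, gamma)
    return [traj[i] + turn[i] for i in range(len(turn))]


# Three rollouts from one prompt; the outcome reward is folded into the last turn.
group = [[0.8, 0.5, 1.0], [0.2, 0.0], [0.6, 0.9, 0.4, 1.0]]
for i, adv in enumerate(integrated_advantages(group)):
    print(f"rollout {i}: per-turn advantages {np.round(adv, 2)}")
```
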
The Secret Sauce (why this is clever):

  • Matching guards against reward hacking by rewarding the best-aligned calls and not duplicates (illustrated in the small example below).
  • Turn-level rewards plus final outcome avoid tunnel vision on either micro-steps or only the end.
  • Dual-level advantages fix the classic ā€œuniform advantageā€ problem and make long-horizon learning sharper and faster.
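
A tiny, self-contained check of the first point: under hard (Hungarian) matching, duplicating a call cannot harvest its credit twice. The similarity values and the -0.2 penalty for unmatched calls are illustrative numbers, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = predicted calls, columns = gold calls. Predicted calls 0 and 1 both
# closely resemble gold call 0 (a repeated, near-duplicate request).
S = np.array([
    [0.9, 0.1],   # predicted call 0
    [0.8, 0.2],   # predicted call 1 (duplicate of call 0's intent)
    [0.0, 0.7],   # predicted call 2
])

rows, cols = linear_sum_assignment(-S)   # negate to maximize total similarity
rewards = np.full(S.shape[0], -0.2)      # unmatched calls take a penalty
rewards[rows] = S[rows, cols]
print(rewards)  # -> [ 0.9 -0.2  0.7]: the duplicate is penalized, not paid twice
```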

šŸž Bottom Bread (Anchor): Think of grading a group science project. You grade each student’s part (turn-level), and also the final presentation (outcome). You compare this group’s total to other groups (trajectory-level), and you pay special attention to early steps that set up later success (discounted turn-level). That’s MatchTIR’s training loop.

04 Experiments & Results

šŸž Top Bread (Hook): Imagine a tournament where teams must solve multi-step puzzles using calculators, databases, and web search. We don’t just count wins; we also track how carefully each step was done and how many steps were wasted.

🄬 Filling (The Actual Concept)

  • The test (what and why): The authors tested whether fine-grained, per-turn rewards plus dual-level advantages help agents use tools more accurately and efficiently, especially in long, multi-turn problems.
  • Benchmarks:
    • FTRL (in-domain): Tool tasks with precise, checkable feedback, measured by Solve-P (precision of valid tool calls), Solve-R (recall of solved sub-questions), and Solve-F1 (their balance).
    • BFCL (out-of-domain): Function calling under multi-turn and agentic conditions (e.g., long context, missing info, web search, memory).
    • ToolHop (out-of-domain): Multi-hop tool use judged by Answer Correctness.
  • Competition (baselines): Vanilla (no RL), GRPO with outcome-only rewards, ToolRL and FTRL scoring methods in single- and multi-turn variants.
  • Scoreboard (results with context):
    • Across all three benchmarks, both MatchTIR variants (hard = KM, soft = OT) beat strong baselines.
    • On FTRL with Qwen3-8B, MatchTIR (KM) reaches about 39.28 average, which is like getting an A while others get B’s or C’s.
    • Even a 4B MatchTIR model outperforms many 8B baselines, showing that better training beats just making the model bigger.
    • Hard matching (KM) usually edges out soft (OT), suggesting strict one-to-one signals help avoid giving credit to near-misses that would fail in execution.
  • Surprising findings:
    • The gains grow with task difficulty. On problems needing many tool calls, MatchTIR shines even more, because per-turn credit assignment matters most in long journeys.
    • Tool use becomes leaner and cleaner: fewer total tool calls, higher success rates per call, and fewer failed calls—like a student doing fewer, better experiments instead of many messy ones.
    • Hyperparameters matter: a stronger penalty on unmatched calls raises precision but can slightly trim recall; a higher discount factor γ improves performance, reflecting that early choices shape the whole plan.

šŸž Bottom Bread (Anchor): It’s like a cooking contest where teams using MatchTIR not only plate tastier dishes (better final answers) but also follow cleaner prep steps (fewer, more accurate tool calls) and waste fewer ingredients.

05 Discussion & Limitations

šŸž Top Bread (Hook): Think about teaching with a perfect answer key. It works great in math class, but what about open-ended art projects? Some tasks don’t have a single, checkable path.

🄬 Filling (The Actual Concept)

  • Limitations (what it can’t do):
    • Needs ground-truth traces for turn-level supervision; in very open tasks (like free-form research), gold parameters and steps are hard to define or verify.
    • Evaluations used smaller backbones (4B, 8B) due to compute; scaling to larger models is untested here.
    • Hard matching can be strict—great for exact tools but harsh when multiple different steps are equally acceptable.
    • Relies on structured, verifiable tools; in messy environments (ambiguous web pages), per-step truth may be fuzzy.
  • Required resources:
    • A dataset with executable, verifiable tools (names, parameter schemas, and testable outcomes).
    • An RL setup supporting GRPO-style grouped rollouts and environment tool calls.
    • GPUs with enough memory for batched multi-turn rollouts (the paper used 8ƗA800-80G).
  • When NOT to use:
    • Highly creative or ambiguous tasks without objective ground truth per step.
    • Pure free-form browsing where many different step sequences are valid and not easily comparable.
    • Contexts with unreliable tool feedback (noisy APIs) that break verifiability.
  • Open questions:
    • How to generate or approximate ground-truth traces in open domains.
    • Blending precise matching with learned reward models to cover less-structured steps.
    • Adapting credit assignment online without gold references.
    • Scaling to bigger models and richer tool ecosystems while keeping training stable and cost-effective.

šŸž Bottom Bread (Anchor): It’s like grading piano practice with a sheet of music. When the notes are known, feedback is sharp and fast. For jazz improvisation, you might need new ways to judge progress than exact note matching.

06 Conclusion & Future Work

šŸž Top Bread (Hook): Picture a coach who grades every pass in a game and also the final score. Players quickly learn which moves help the team win.

🄬 Filling (The Actual Concept)

  • Three-sentence summary: MatchTIR assigns fair, per-turn rewards by matching an agent’s tool calls to expert calls, and also considers the final answer reward. It then creates a dual-level advantage—local (turn) and global (trajectory)—so each step gets the right learning push. This fixes the classic problem of uniform credit and makes long, multi-turn tool use both sharper and more efficient.
  • Main achievement: Turning turn-level credit assignment into a bipartite matching problem and unifying it with dual-level advantages that significantly improve tool-integrated reasoning.
  • Future directions: Reduce reliance on ground-truth traces (e.g., semi/self-supervised signals), combine with robust reward models for open-ended steps, scale to larger backbones, and expand to noisier, real-world tools.
  • Why remember this: Precise feedback shapes precise behavior. By rewarding each tool call fairly—and still keeping eyes on the final goal—MatchTIR shows how smarter training can beat simply making models bigger.

šŸž Bottom Bread (Anchor): Next time an AI books travel or debugs code, MatchTIR’s ideas help it choose the right tool, fill in the right details, and finish the job well—step by step and all the way to the end.

Practical Applications

  • Customer support agents that call databases, ticket systems, and knowledge bases with fewer, more accurate requests.
  • Travel planners that sequence flight, hotel, and car-rental APIs correctly and avoid redundant searches.
  • Coding assistants that compile, lint, and run tests with the right parameters, fixing the exact failing step.
  • Data analysts that query spreadsheets and dashboards precisely, reducing bad queries and false alarms.
  • Medical triage assistants that follow structured checklists and call the right tool with the right patient parameters.
  • Finance bots that retrieve statements, categorize expenses, and reconcile ledgers with accurate parameter filling.
  • Education tutors that use calculators, solvers, and plotters in the right order to teach multi-step reasoning.
  • Research aides that consult libraries, citation tools, and summarizers with verifiable, step-level accuracy.
  • Robotic process automation (RPA) that triggers enterprise tools (CRM, ERP) with fewer failed actions.
  • Compliance assistants that produce auditable logs of each tool call and why it was made.
#Tool-Integrated Reasoning Ā· #Credit Assignment Ā· #Bipartite Matching Ā· #Hungarian Algorithm Ā· #Optimal Transport Ā· #Turn-Level Reward Ā· #Dual-Level Advantage Ā· #GRPO Ā· #Reinforcement Learning Ā· #Verifiable Rewards Ā· #Long-Horizon Tasks Ā· #Tool Use Ā· #Multi-Turn Reasoning Ā· #Policy Optimization Ā· #Agentic RL