D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Key Summary
- This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
- The authors introduce D-CORE, a two-stage training recipe that first teaches the model to break big tasks into small steps and then keeps its thinking flexible with a special kind of reinforcement learning.
- Stage 1 (self-distillation) has the model teach itself how to decompose tasks and string the steps together into clear, checkable solution paths.
- Stage 2 (DA-GRPO) adds a diversity-aware reward so the model keeps exploring different, reflective thought patterns instead of becoming rigid.
- On tough tool-use benchmarks (BFCLv3 and τ-bench), D-CORE gets big jumps, especially on multi-turn tasks that require planning across steps.
- D-CORE-8B reaches 77.7% accuracy (best among 8B models) and D-CORE-14B hits 79.3%, even beating some 70B models while being 5× smaller.
- The method reduces wasted tokens and repetitive reflection, turning long, unhelpful thinking into short, effective step-by-step plans.
- It works across different domains and unseen tasks, showing strong generalization rather than overfitting to a single dataset.
- The key idea is simple: first learn to make a good plan, then learn to keep trying smart alternatives so you don't get stuck.
- This approach can make everyday assistant agents more reliable at complex jobs like travel changes, returns, searches, and multi-tool workflows.
Why This Research Matters
Real assistants often need to use multiple tools across several steps, just like people do at work or at home. D-CORE helps AI plan those steps cleanly and avoid wasting time on rambling thoughts. That makes agents more reliable for tasks like refunds, travel changes, research, and file operations. It also means smaller models can reach or beat the performance of much bigger ones, which saves money and energy. By training decomposition first and then restoring healthy reflection, D-CORE turns "think more" into "think better," unlocking practical, trustworthy tool use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're packing for a trip. If you just start throwing things in your bag without a plan, you'll waste time, forget socks, and maybe pack three toothbrushes. Planning helps you pack fast and right.
🥬 The Concept (Large Reasoning Models and Tool Use): Large Reasoning Models (LRMs) are AIs that think step-by-step and can use tools (like calendars, web browsers, or file systems) to get things done. How it works: 1) read the user's request, 2) decide which tools to use, 3) call the tools with the right inputs, 4) combine results into a final answer. Why it matters: without the ability to plan and coordinate tools, the AI guesses and fumbles, especially in multi-step, real-world tasks.
🍞 Anchor: If you ask an AI to "book a flight tomorrow and then email me the receipt," it must plan: check flights → pick one → pay → fetch receipt → email. That's tool use.
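To make the read-decide-call-combine cycle concrete, here is a minimal sketch in Python. The tool functions, the stubbed fake_model, and the prompt strings are hypothetical stand-ins, not D-CORE's or any real API; they only show the shape of a single tool-use turn.

```python
# Minimal sketch of the read -> decide -> call -> combine loop described above.
def search_flights(date: str) -> list:
    return [{"id": "F123", "date": date, "price": 310}]      # toy tool

def email_receipt(booking_id: str) -> str:
    return f"receipt for {booking_id} emailed"               # toy tool

TOOLS = {"search_flights": search_flights, "email_receipt": email_receipt}

def fake_model(prompt: str):
    """Stand-in for the LRM: first proposes tool calls, then writes the final answer."""
    if "Which tools" in prompt:
        return [{"tool": "search_flights", "args": {"date": "tomorrow"}},
                {"tool": "email_receipt", "args": {"booking_id": "F123"}}]
    return "Booked F123 for tomorrow; the receipt has been emailed."

def run_turn(user_request: str) -> str:
    plan = fake_model(f"Request: {user_request}\nWhich tools, with which arguments?")
    results = [TOOLS[step["tool"]](**step["args"]) for step in plan]   # execute each call
    return fake_model(f"Request: {user_request}\nTool results: {results}\nWrite the final answer.")

print(run_turn("Book a flight tomorrow and then email me the receipt."))
```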
The world before: AIs were getting decent at single-step tool calls (like "convert 50,000 RMB to USD"). But real-life problems are often a chain of steps (find all .txt files, create a folder, copy the files there; or look up orders, check statuses, return the right items). Benchmarks like BFCL show three tricky patterns: sequential steps, parallel steps, and "no tool needed" cases. Even tougher, multi-turn conversations add history and hidden intentions to track over time. In these settings, many LRMs would write long "thinking" text but still act poorly.
🍞 Hook: You know how some students write a lot on a test but still miss the point? More words don't always mean better answers.
🥬 The Concept (Lazy Reasoning): Lazy Reasoning is when the AI fills its thought space with repetitive "hmm… wait… maybe…" instead of making a solid plan and executing steps. How it works: 1) the model avoids breaking the task into parts, 2) it loops through trial-and-error reflections, 3) it wastes tokens and time with little progress. Why it matters: without true decomposition, adding more "think" doesn't help; performance stalls on multi-turn tasks.
🍞 Anchor: On a flight-change task, instead of listing the subtasks (find reservation → read date → search next-day flights → pick cheapest → confirm), the model keeps second-guessing itself and ends up asking for a human.
Researchers tried two common routes. First, supervised fine-tuning (SFT) on rule-based tool-use examples: it helps on easy tasks but generalizes poorly to complex, multi-intent or multi-turn situations. Second, reinforcement learning (RL) that rewards correct final outcomes: it works in math or single-turn cases but often gives diminishing returns on complex tool workflows (more tokens, not much better results).
🍞 Hook: Think of building a Lego spaceship. If you never separate the big ship into smaller chunks, you'll keep poking pieces around and get frustrated.
🥬 The Concept (Task Decomposition): Task decomposition means breaking a big job into small, ordered subtasks. How it works: 1) identify the needed steps, 2) label which must come first (sequential), which can happen together (parallel), and when no tool is needed, 3) solve each subtask, 4) compose the results. Why it matters: without decomposition, the model's "thinking" becomes a messy blob, and even RL can't find a clean path to reward.
🍞 Anchor: To "copy all .txt files into a new folder," you: 1) find .txt files, 2) make the folder, 3) copy files in. Clear steps, clear success.
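A decomposition like this can be written down as plain data. The sketch below is one illustrative way to represent subtasks and their pattern labels; the Subtask class and its fields are invented for this example, not the paper's format. Note that the first two steps do not depend on each other, so they could also be labeled parallel.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    step: int
    description: str
    pattern: str                      # "sequential", "parallel", or "irrelevant"
    depends_on: list = field(default_factory=list)

# The "copy all .txt files into a new folder" job from the anchor above:
plan = [
    Subtask(1, "find all .txt files in the source directory", "parallel"),
    Subtask(2, "create the destination folder", "parallel"),
    Subtask(3, "copy the found files into the new folder", "sequential", depends_on=[1, 2]),
]
```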
The problem: The team found LRMs did very little decomposition in multi-turn tool use but showed a lot of repetitive reflection: classic Lazy Reasoning. They confirmed that when they forced decomposition by prompting, accuracy jumped. So the missing piece was the model's built-in ability to decompose and then compose reasoning.
What this paper adds: D-CORE, a two-stage training framework that first teaches decomposition through self-distillation (the model learns from its own structured outputs), and then restores healthy, diverse reflection using a new RL trick called Diversity-Aware GRPO (DA-GRPO).
🍞 Hook: You know how you can study by making your own practice problems and then grading yourself? That builds good habits.
🥬 The Concept (Self-Distillation): Self-distillation is when a model creates well-structured solutions for itself and then learns to imitate them. How it works: 1) prompt the model to decompose tasks, 2) have it solve subtasks and stitch the results into a full "trajectory," 3) fine-tune the model on these trajectories so it internalizes decomposition and execution. Why it matters: without it, decomposition stays fragile and rare; with it, decomposition becomes the default habit.
🍞 Anchor: The model first writes "Step 1: compute exchange rate; Step 2: set budget," then trains on that example so next time it naturally plans those steps.
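In code terms, self-distillation boils down to packaging the model's own decomposed solution as a supervised training pair. A minimal sketch, assuming hypothetical field names and call syntax rather than the paper's exact schema:

```python
# Hypothetical field names: the point is that the model's own structured
# solution becomes the supervised target it is later fine-tuned on.
def make_sft_example(context: str, query: str, decomposed_solution: str) -> dict:
    return {
        "prompt": f"{context}\n\nUser: {query}",
        "target": decomposed_solution,
    }

example = make_sft_example(
    context="Tools: compute_exchange_rate, set_budget_limit",
    query="Set a budget equal to 50,000 RMB in USD.",
    decomposed_solution=(
        "Step 1: compute_exchange_rate(from='RMB', to='USD', amount=50000)\n"
        "Step 2: set_budget_limit(token='X', amount=converted_value)"
    ),
)
```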
But self-distillation can make the model too same-y, with less exploration and reflection. So the second stage uses RL to bring back smart variety.
🍞 Hook: Imagine a coach who not only scores you for the final goal but also gives extra credit for trying creative, promising moves.
🥬 The Concept (Reinforcement Learning and GRPO): Reinforcement Learning (RL) rewards actions that lead to good outcomes. GRPO is a specific RL method that compares outcomes across multiple tries and pushes the model toward better ones. How it works: 1) sample several solutions, 2) score them, 3) push up the probability of tokens from better solutions, 4) keep the model close to a reference to avoid going wild. Why it matters: without RL, the model might copy patterns but won't learn to choose better actions on its own.
🍞 Anchor: The AI tries a few tool-call plans, gets points for good structure and correct calls, and slowly prefers the better plan.
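The group comparison at the heart of GRPO can be sketched in a few lines. This is a simplified illustration of "better than the group average" advantages; the real objective also includes clipping and a reference-model constraint, which are omitted here.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage (simplified): how much better each rollout scored
    than its group, normalized by the group's spread."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0      # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled tool-call plans for the same query, scored by a reward function:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))   # the best plan gets the largest push
```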
🍞 Hook: When you're brainstorming, a few high-uncertainty words, like "but," "maybe," or "because," often signal real thinking.
🥬 The Concept (Diversity-Aware GRPO, DA-GRPO): DA-GRPO tweaks the RL advantage with an "entropy-aware" term that gently rewards thoughtful, higher-uncertainty tokens when normal learning stalls. How it works: 1) if the usual score difference is near zero (everyone looks the same), add a small bonus linked to token entropy, 2) this nudges the model to explore reflective moves, 3) but keep a cap so exploration doesn't drown out correctness. Why it matters: without DA-GRPO, self-distilled models can stagnate; with it, they recover healthy reflection while keeping strong decomposition.
🍞 Anchor: The model stops looping "wait, wait, wait" and instead tries a new, sensible step it was unsure about, like "check reservation details first," which unlocks the path to the right answer.
02 Core Idea
Aha! Moment in one sentence: If we first teach the model to split problems into the right subtasks and then reward it for exploring smartly diverse reasoning, we turn rambling "Lazy Reasoning" into crisp, stepwise tool use.
Three analogies:
- Cooking: Write the recipe (decompose), then taste-test variations to improve flavor (diversity-aware RL).
- Sports: Practice set plays (decompose), then scrimmage with creative moves that still respect the playbook (DA-GRPO).
- Lego: Sort pieces by type (decompose), then try a few builds that follow the plan but allow neat add-ons (diversity-aware exploration).
Before vs After:
- Before: Models write long thoughts, circle back, and miss key steps. RL alone struggles because the model's "thinking" is noisy and not structured.
- After: Models plan Step 1, Step 2, Step 3, calling the right tools at the right times. RL now has a clean backbone to optimize, and DA-GRPO keeps the model reflective without getting stuck.
Why it works (intuition over math):
- RL needs a good action space. Without decomposition, the action space is tangled; rewards get fuzzy. Self-distillation reorganizes the space: the model habitually plans subtasks, so steps map cleanly to rewards (e.g., correct tool, correct params). Then, after SFT makes the model more uniform (less variance), DA-GRPO injects gentle entropy-aware bonuses where the usual advantage is near zero. This prevents gradients from disappearing and nudges the model to try reflective, promising tokens. In short: plan first, then explore cleverly.
Building blocks (each with a simple sandwich explanation):
🍞 Hook: You know how teachers grade not just the final answer but also the work you show? That shows how you got there. 🥬 The Concept (Reasoning Trajectory): A reasoning trajectory is the full step-by-step path the model writes and the tools it calls. How it works: 1) decompose the task, 2) generate thoughts and tool calls for each subtask, 3) compose a final solution. Why it matters: without a clear path, it's hard to learn from success or fix mistakes. 🍞 Anchor: "Find flights → get reservation details → change to next day's cheapest economy → confirm" is a trajectory.
🍞 Hook: Imagine you're sorting chores among siblings: some must be done in order, some can happen at the same time, and some don't need tools. 🥬 The Concept (Sequential/Parallel/Irrelevant): Subtasks can depend on each other (sequential), run together (parallel), or be handled by explanation only (irrelevant). How it works: identify the pattern, then schedule steps accordingly. Why it matters: mixing them up breaks the workflow; copying before creating a folder will fail. 🍞 Anchor: "Find" files comes sequentially before "copy" files; "translate two paragraphs" can be parallel; "explain policy" may need no tool.
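The three patterns map naturally onto three execution strategies. The sketch below is a toy scheduler, assuming a hypothetical execute callable that runs one subtask; it is meant to show the control flow, not a production agent runtime.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subtasks(subtasks, pattern, execute):
    """Toy scheduler for the three patterns; `execute(subtask, prior)` is a
    hypothetical callable that runs one subtask and returns its output."""
    if pattern == "sequential":
        outputs, prior = [], None
        for s in subtasks:                        # later steps see earlier results
            prior = execute(s, prior)
            outputs.append(prior)
        return outputs
    if pattern == "parallel":                     # independent steps, merged at the end
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda s: execute(s, None), subtasks))
    return ["No tool needed; answer directly from the policy and context."]
```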
🍞 Hook: Sometimes your practice tests all look the same, and your score stops improving. 🥬 The Concept (Advantage and Entropy Bonus in DA-GRPO): Advantage says "how much better than average" a rollout is; the entropy bonus adds exploration when everyone looks the same. How it works: 1) compute the normal advantage; if it's near zero, 2) add a capped bonus based on token entropy; 3) update the model accordingly. Why it matters: without the bonus, learning stalls; with it, the model keeps trying thoughtful, alternative moves. 🍞 Anchor: When all plays score about the same, the coach says, "try a new setup this time," but not so wild that you forget the ball.
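Putting that recipe into a sketch: the function below keeps the ordinary group advantage when it carries signal and otherwise adds a small, capped entropy bonus. The threshold eps, scale alpha, and cap delta are illustrative hyperparameters, not the paper's values, and the real method applies this per token inside a full GRPO update.

```python
import math

def token_entropy(probs):
    """Shannon entropy of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def da_grpo_advantage(group_advantage, probs, eps=1e-3, alpha=0.1, delta=0.2):
    """Diversity-aware advantage (sketch): keep the normal GRPO advantage when it
    carries signal; when it is near zero, add a small, capped entropy bonus so
    reflective, higher-uncertainty tokens still get a gentle push."""
    if abs(group_advantage) > eps:
        return group_advantage
    return group_advantage + min(alpha * token_entropy(probs), delta)
```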
Summing up the core idea: D-CORE aligns the model's inner thoughts with the structure of real tasks (via self-distillation), then preserves and guides curiosity (via DA-GRPO). This pairing changes "think more" from wasting tokens into building the right staircase of steps.
03 Methodology
High-level pipeline: Input (policy, tools, chat history, current query) → Stage 1: Self-Distillation (Decompose → Generate → Compose → Distill) → Stage 2: Diversity-Aware RL (DA-GRPO) → Output (accurate tool calls + concise, stepwise reasoning).
Stage 1: Self-Distillation (teaching decomposition and execution)
- What happens: The model is prompted to decompose the user's query into clear subtasks, then to solve each subtask with thoughts and tool calls, and finally to compose a full reasoning trajectory. These composed trajectories become training data for supervised fine-tuning (SFT), so the model learns to do this planning by default.
- Why this step exists: RL alone struggles when the model's thinking is messy; SFT on structured trajectories makes decomposition habitual and stabilizes learning. Without it, the model keeps looping in Lazy Reasoning, and rewards can't steer it reliably.
- Mini example (budget setting): Input: "Set a budget equivalent to 50,000 RMB in USD using token X." Decompose: (1) compute_exchange_rate (RMB→USD, 50,000), (2) set_budget_limit (token X, converted value). Generate: the model writes a short thought for each subtask and calls the proper tools. Compose: stitch the steps + tool outputs into a single, readable trajectory. Distill: SFT trains the model on this final example.
Detailed steps like a recipe:
- Decompose(C, Q, Y*):
- Input: context C (policy P, tool list T, conversation history H), query Q, and a small set of reference/few-shot examples Y*.
- Action: ask the model to output a subtask list with step numbers and concise descriptions.
- Safety check: verify reasonable length and relevance (e.g., no invented tools).
- Why needed: without a subtask list, later steps collapse into unfocused thinking.
- Generate per-subtask:
- For each subtask si: the model writes a brief thought, proposes the tool call τi, and you execute it to get output oi.
- Sequential case: pass each oi forward so later steps know what happened before.
- Parallel case: run subtasks independently and merge results.
- Irrelevant case: explain why no tool is needed.
- Why needed: verifies that each subtask is executable and grounded in real tool responses.
- Compose trajectories:
- Combine the subtasks, thoughts Ri, tool calls τi, and tool outputs oi into a single, coherent trajectory ŷ.
- Include light reflection (e.g., quick checks) where helpful, especially for parallel/irrelevant cases.
- Why needed: creates a complete "study guide" that shows good habits from start to finish.
- Distill via SFT:
- Train the model to maximize the likelihood of ŷ given (C, Q).
- Why needed: cements decomposition-first behavior so it becomes the model's default.
The secret sauce of Stage 1: It uses the model itself, guided by prompts and a small number of examples, to mass-produce high-quality trajectories, with no expensive teacher model required.
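As a rough sketch, Stage 1 can be read as one loop over tasks: decompose, solve each subtask against real tool outputs, compose a trajectory, and keep it as an SFT example. The helpers call_model and execute_tool are hypothetical stand-ins for prompted model calls and a tool sandbox, and the trajectory text format here is invented for illustration.

```python
def build_distillation_set(tasks, call_model, execute_tool):
    """Stage 1 sketch: decompose, solve each subtask with a real tool call,
    compose one trajectory, and keep it as an SFT example."""
    dataset = []
    for context, query in tasks:
        subtasks = call_model(f"{context}\nQuery: {query}\nList the subtasks, one per line.")
        steps, prior_output = [], None
        for s in subtasks:
            thought, tool_call = call_model(
                f"Subtask: {s}\nPrevious result: {prior_output}\nGive a short thought and a tool call.")
            prior_output = execute_tool(tool_call)            # ground the step in a real tool response
            steps.append((s, thought, tool_call, prior_output))
        trajectory = "\n".join(
            f"Step {i + 1}: {s}\nThought: {t}\nCall: {c}\nResult: {o}"
            for i, (s, t, c, o) in enumerate(steps))
        dataset.append({"prompt": f"{context}\nUser: {query}", "target": trajectory})
    return dataset                                            # fine-tune (SFT) on these pairs
```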
Stage 2: Diversity-Aware GRPO (keeping reflection smart and alive)
- What happens: Run RL using a GRPO-style objective with a twist. If the advantages for a group of rollouts are nearly identical (a sign of stagnation), add a small, capped "entropy-aware" offset so tokens with higher uncertainty (often reflective words) get a nudge. This prevents gradient collapse and brings back healthy exploration.
- Why this step exists: Self-distillation can make the model's behavior too uniform: great for stability, bad for discovery. DA-GRPO reintroduces exploration where it helps most but doesn't let entropy overwhelm correctness.
- Mini example (returns on τ-bench retail): The base RL plan might repeatedly forget one item in a multi-item return. DA-GRPO's gentle entropy bonus helps the model try a slightly different thought like "check if there are multiple items in this order," which unlocks the correct two-item call.
Reward design (aligned with ToolRL):
- Format: did the model follow the required <think>…</think> and output style?
- Structure: does the number of tool calls match what's expected?
- Keys and values: did it pick the right tool names, parameter names, and parameter values?
- Why needed: tool use must be verifiable; these rewards make correctness checkable and consistent.
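A hedged sketch of what such a verifiable reward could look like, with illustrative weights and a simplified call representation; the paper's exact scoring follows ToolRL and may differ in its details.

```python
import re

def tool_use_reward(response, expected_calls, predicted_calls):
    """Verifiable-reward sketch: format + structure + key/value matching.
    Weights and parsing are illustrative, not the paper's exact scheme."""
    reward = 0.0
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.2                                         # format: thinking block present
    if len(predicted_calls) == len(expected_calls):
        reward += 0.2                                         # structure: right number of calls
    per_call = 0.6 / max(len(expected_calls), 1)
    for exp, pred in zip(expected_calls, predicted_calls):
        if pred.get("name") == exp["name"]:
            reward += per_call / 2                            # correct tool name
        if pred.get("args") == exp["args"]:
            reward += per_call / 2                            # correct parameter names and values
    return reward
```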
Putting it all together (Input → Steps → Output):
- Input: P, T, H, Q
- Stage 1: Decompose → per-subtask Generate (Ri, τi) → Execute → Compose ŷ → Distill
- Stage 2: Sample multiple trajectories → Score with rewards → Apply DA-GRPO updates (normal advantage when present, entropy-aware term when not) → New policy
- Output: A model that plans first, executes cleanly, and reflects just enough to solve tricky multi-turn workflows.
The clever bits:
- Self-supplied trajectories: avoid big teacher costs while still getting structured reasoning data.
- Entropy-aware advantage: a simple, capped offset that prevents stalled learning and encourages meaningful reflection, not endless rambling.
- Balance: decomposition locks in structure; DA-GRPO unlocks flexible, thoughtful exploration.
04 Experiments & Results
The tests: The team measured accuracy on realistic tool-use benchmarks that demand planning:
- BFCLv3: includes Live/Non-Live settings and tests Relevance (use a tool or not), Irrelevance (no tool needed), and especially Multi-Turn (several steps across a conversation).
- τ-bench: airline and retail tasks where agents must follow multi-step procedures (e.g., change a flight to the day after, return multiple items, compute refunds) with hidden constraints.
The competition: They compared D-CORE on Qwen3-8B and 14B backbones against:
- Vanilla Qwen baselines (no-think, think),
- ToolRL (GRPO-only RL),
- Task-specialized models like xLAM2 (8B/32B/70B),
- Big proprietary systems (Claude 3.7 Sonnet, GPT-4o, o1, DeepSeek-R1).
Scoreboard with context:
- D-CORE-8B: 77.7% overall on BFCLv3, like jumping from a mid B to a strong A while being the smallest in the pack. It outperforms the best 8B baseline by 5.7 points.
- D-CORE-14B: 79.3%, a new state of the art among the open models reported in this study, beating some 70B models despite being 5× smaller. That's like a varsity player outperforming a pro on the same drill.
- Multi-turn gains: about +30.8 points over the base models; this is the clearest evidence that decomposition + diversity-aware reflection tackles Lazy Reasoning head-on.
- τ-bench: D-CORE improves Qwen3-8B by +18.6 points and Qwen3-14B by +17.7, with standout performance in airline scenarios that need 4–5 subtasks when user intent is fuzzy.
Surprising findings:
- RL alone (GRPO) often underperforms or gives small gains on complex multi-turn tasks (sometimes even negative on multi-turn!). But once self-distillation is added first, RL suddenly becomes much more effective. This supports the paper's thesis: plan first, then refine.
- After self-distillation, reward variance shrinks (everyone performs similarly), which would normally stall RL. DA-GRPO's entropy-aware boost revives learning without making the model chaotic.
- Behavioral analysis shows a major shift: self-distillation increases decomposition and lowers empty reflection; DA-GRPO brings back a healthy slice of reflection, but not the wasteful kind. In the 8B model, errors due to Lazy Reasoning on multi-turn drop from 45% to 6%.
Generalization (out-of-distribution): On ACEBench, τ-Bench variants, and BFCLv4-agentic (with web and memory tasks), D-CORE-trained models remain competitive or superior to baselines. This suggests the method teaches general habits (decompose, then reflect) rather than memorizing datasets.
Efficiency notes:
- 40k self-distillation samples generated in roughly a day per model size on a single A100.
- Training (SFT + DA-GRPO) runs in under two days on 8×A100 for each size.
- This is practical for many labs and teams.
Takeaway: D-CORE doesn't just add more "think"; it changes how the model thinks. The result is fewer wasted tokens, stronger plans, and better real-world tool use.
05 Discussion & Limitations
Limitations:
- Backbone dependence: Results shown on Qwen3 backbones; while the ideas are general, plug-and-play performance on every LRM is not guaranteed.
- Prompt reliance in Stage 1: Good decomposition prompts and a few reference examples matter. Poor prompts can yield weaker trajectories.
- Tuning DA-GRPO: The entropy bonus must be balanced. Too low: exploration stalls. Too high: reflection overwhelms accuracy.
- Reward coverage: The reward checks format, structure, and key/value matches, but may not capture all nuances (e.g., latency trade-offs, user preferences).
Required resources:
- Hardware: 8×A100 (80GB) class for the training stages; 1×A100 for data generation is enough.
- Data: A seed of tool-use tasks covering sequential, parallel, and irrelevance cases; 40k self-distill samples worked well here.
- Engineering: A tool-execution sandbox, logging to capture tool outputs, and verification scripts.
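For the engineering piece, a toy version of a tool-execution sandbox with logging might look like the following; the class and its behavior are illustrative, not a specific framework.

```python
import json, logging

logging.basicConfig(level=logging.INFO)

class ToolSandbox:
    """Toy tool-execution sandbox: only registered tools can run, and every call
    plus its output is logged so trajectories can be verified afterwards."""
    def __init__(self):
        self.tools, self.log = {}, []

    def register(self, name, fn):
        self.tools[name] = fn

    def call(self, name, **kwargs):
        if name not in self.tools:
            raise ValueError(f"unknown tool: {name}")         # reject invented tools
        output = self.tools[name](**kwargs)
        record = {"tool": name, "args": kwargs, "output": output}
        self.log.append(record)
        logging.info(json.dumps(record, default=str))
        return output
```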
When NOT to use:
- Purely generative tasks with no tools (e.g., storytelling) where decomposition of tool calls isn't central.
- Tiny models or extremely low-resource settings where even light RL is infeasible.
- Domains where tool outputs can't be verified automatically; the reward then becomes noisy.
Open questions:
- Can we auto-tune the entropy bonus per task or per token type to further reduce sensitivity to α/δ?
- How well does D-CORE extend to multimodal agents (e.g., reading screens, seeing images) where decomposition spans vision and language?
- Can rewards incorporate longer-term success (like fewer total calls or lower latency) without harming accuracy?
- Can we learn to predict the best decomposition pattern (sequential vs parallel) before execution to save tokens?
- How far can smaller models go with better decomposition before needing more parameters?
Honest assessment: D-CORE is a practical, modular recipe. It meaningfully lifts multi-turn tool use by fixing the plan-first habit and then preserving thoughtful exploration. It does not solve all reasoning, but it reliably turns "more thinking" into "better thinking" in agentic workflows.
06 Conclusion & Future Work
Three-sentence summary: D-CORE teaches models to plan first and then reflect smartly, transforming Lazy Reasoning into effective, step-by-step tool use. It does this by self-distilling decomposition habits and then applying diversity-aware RL (DA-GRPO) that maintains exploration without derailing accuracy. The result is large jumps on complex, multi-turn benchmarks, with smaller models outperforming much bigger ones.
Main achievement: Showing that aligning a model's inner thoughts with explicit task decomposition, then carefully reviving reflection with an entropy-aware advantage, dramatically improves complex tool workflows.
Future directions:
- Extend to multimodal agents (screen reading, vision + language) where decomposition includes perception steps.
- Smarter, auto-tuned entropy bonuses that adapt per token or per stage.
- Broader rewards that also score efficiency and user satisfaction.
- Plug-and-play trajectories to bootstrap other backbones with minimal engineering.
Why remember this: D-CORE reframes "reasoning tokens" from wordy filler into a clean plan plus purposeful reflection. It's a simple, teachable blueprint (decompose, then diversify) that helps agents reliably get real work done.
Practical Applications
- Customer support agents that correctly process multi-item returns, exchanges, and refunds with fewer errors.
- Travel assistants that change flights, compute budgets, and handle compensation logic across several steps.
- Research copilots that plan web searches, read results, and assemble summaries without losing track.
- IT automations that reliably chain file system and system tools (find, mkdir, copy, backup).
- Back-office workflows that check records, validate constraints, and update multiple systems in order.
- Shopping assistants that compare products in parallel and then assemble a final recommendation.
- Personal finance bots that decompose goals (convert currencies, set limits, schedule payments).
- Sales operations that gather CRM data, verify fields, and generate correct tool calls for updates.
- Healthcare admin helpers that schedule appointments, verify insurance, and send follow-ups in sequence.
- Education tutors that plan multi-step problem solving (retrieve definitions, apply rules, verify answers).