D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use
Key Summary
- This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
- The authors introduce D-CORE, a two-stage training recipe that first teaches the model to break big tasks into small steps and then keeps its thinking flexible with a special kind of reinforcement learning.
- Stage 1 (self-distillation) has the model teach itself how to decompose tasks and string the steps together into clear, checkable solution paths.
- Stage 2 (DA-GRPO) adds a diversity-aware reward so the model keeps exploring different, reflective thought patterns instead of becoming rigid.
- On tough tool-use benchmarks (BFCLv3 and τ-bench), D-CORE gets big jumps, especially on multi-turn tasks that require planning across steps.
- D-CORE-8B reaches 77.7% accuracy (best among 8B models) and D-CORE-14B hits 79.3%, even beating some 70B models while being 5× smaller.
- The method reduces wasted tokens and repetitive reflection, turning long, unhelpful thinking into short, effective step-by-step plans.
- It works across different domains and unseen tasks, showing strong generalization rather than overfitting to a single dataset.
- The key idea is simple: first learn to make a good plan, then learn to keep trying smart alternatives so you don't get stuck.
- This approach can make everyday assistant agents more reliable at complex jobs like travel changes, returns, searches, and multi-tool workflows.
Why This Research Matters
Real assistants often need to use multiple tools across several steps, just like people do at work or at home. D-CORE helps AI plan those steps cleanly and avoid wasting time on rambling thoughts. That makes agents more reliable for tasks like refunds, travel changes, research, and file operations. It also means smaller models can reach or beat the performance of much bigger ones, which saves money and energy. By training decomposition first and then restoring healthy reflection, D-CORE turns "think more" into "think better," unlocking practical, trustworthy tool use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're packing for a trip. If you just start throwing things in your bag without a plan, you'll waste time, forget socks, and maybe pack three toothbrushes. Planning helps you pack fast and right.
🥬 The Concept (Large Reasoning Models and Tool Use): Large Reasoning Models (LRMs) are AIs that think step-by-step and can use tools (like calendars, web browsers, or file systems) to get things done. How it works: 1) read the user's request, 2) decide which tools to use, 3) call the tools with the right inputs, 4) combine results into a final answer. Why it matters: without the ability to plan and coordinate tools, the AI guesses and fumbles, especially in multi-step, real-world tasks.
🍞 Anchor: If you ask an AI to "book a flight tomorrow and then email me the receipt," it must plan: check flights → pick one → pay → fetch receipt → email. That's tool use.
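To make the read-decide-call-combine cycle concrete, here is a minimal sketch in Python. The tool functions, the stubbed fake_model, and the prompt strings are hypothetical stand-ins, not D-CORE's or any real API; they only show the shape of a single tool-use turn.

```python
# Minimal sketch of the read -> decide -> call -> combine loop described above.
def search_flights(date: str) -> list:
    return [{"id": "F123", "date": date, "price": 310}]      # toy tool

def email_receipt(booking_id: str) -> str:
    return f"receipt for {booking_id} emailed"               # toy tool

TOOLS = {"search_flights": search_flights, "email_receipt": email_receipt}

def fake_model(prompt: str):
    """Stand-in for the LRM: first proposes tool calls, then writes the final answer."""
    if "Which tools" in prompt:
        return [{"tool": "search_flights", "args": {"date": "tomorrow"}},
                {"tool": "email_receipt", "args": {"booking_id": "F123"}}]
    return "Booked F123 for tomorrow; the receipt has been emailed."

def run_turn(user_request: str) -> str:
    plan = fake_model(f"Request: {user_request}\nWhich tools, with which arguments?")
    results = [TOOLS[step["tool"]](**step["args"]) for step in plan]   # execute each call
    return fake_model(f"Request: {user_request}\nTool results: {results}\nWrite the final answer.")

print(run_turn("Book a flight tomorrow and then email me the receipt."))
```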
The world before: AIs were getting decent at single-step tool calls (like "convert 50,000 RMB to USD"). But real-life problems are often a chain of steps (find all .txt files, create a folder, copy the files there; or look up orders, check statuses, return the right items). Benchmarks like BFCL show three tricky patterns: sequential steps, parallel steps, and "no tool needed" cases. Even tougher, multi-turn conversations add history and hidden intentions to track over time. In these settings, many LRMs would write long "thinking" text but still act poorly.
🍞 Hook: You know how some students write a lot on a test but still miss the point? More words don't always mean better answers.
🥬 The Concept (Lazy Reasoning): Lazy Reasoning is when the AI fills its thought space with repetitive "hmm… wait… maybe…" instead of making a solid plan and executing steps. How it works: 1) the model avoids breaking the task into parts, 2) it loops through trial-and-error reflections, 3) it wastes tokens and time with little progress. Why it matters: without true decomposition, adding more "think" doesn't help; performance stalls on multi-turn tasks.
🍞 Anchor: On a flight-change task, instead of listing the subtasks (find reservation → read date → search next-day flights → pick cheapest → confirm), the model keeps second-guessing itself and ends up asking for a human.
Researchers tried two common routes. First, supervised fine-tuning (SFT) on rule-based tool-use examples: it helps on easy tasks but generalizes poorly to complex, multi-intent or multi-turn situations. Second, reinforcement learning (RL) that rewards correct final outcomes: it works in math or single-turn cases but often gives diminishing returns on complex tool workflows (more tokens, not much better results).
🍞 Hook: Think of building a Lego spaceship. If you never separate the big ship into smaller chunks, you'll keep poking pieces around and get frustrated.
🥬 The Concept (Task Decomposition): Task decomposition means breaking a big job into small, ordered subtasks. How it works: 1) identify the needed steps, 2) label which must come first (sequential), which can happen together (parallel), and when no tool is needed, 3) solve each subtask, 4) compose the results. Why it matters: without decomposition, the model's "thinking" becomes a messy blob, and even RL can't find a clean path to reward.
🍞 Anchor: To "copy all .txt files into a new folder," you: 1) find .txt files, 2) make the folder, 3) copy files in. Clear steps, clear success.
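A decomposition like this can be written down as plain data. The sketch below is one illustrative way to represent subtasks and their pattern labels; the Subtask class and its fields are invented for this example, not the paper's format. Note that the first two steps do not depend on each other, so they could also be labeled parallel.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    step: int
    description: str
    pattern: str                      # "sequential", "parallel", or "irrelevant"
    depends_on: list = field(default_factory=list)

# The "copy all .txt files into a new folder" job from the anchor above:
plan = [
    Subtask(1, "find all .txt files in the source directory", "parallel"),
    Subtask(2, "create the destination folder", "parallel"),
    Subtask(3, "copy the found files into the new folder", "sequential", depends_on=[1, 2]),
]
```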
The problem: The team found LRMs did very little decomposition in multi-turn tool use but showed a lot of repetitive reflection: classic Lazy Reasoning. They confirmed that when they forced decomposition by prompting, accuracy jumped. So the missing piece was the model's built-in ability to decompose and then compose reasoning.
What this paper adds: D-CORE, a two-stage training framework that first teaches decomposition through self-distillation (the model learns from its own structured outputs), and then restores healthy, diverse reflection using a new RL trick called Diversity-Aware GRPO (DA-GRPO).
🍞 Hook: You know how you can study by making your own practice problems and then grading yourself? That builds good habits.
🥬 The Concept (Self-Distillation): Self-distillation is when a model creates well-structured solutions for itself and then learns to imitate them. How it works: 1) prompt the model to decompose tasks, 2) have it solve subtasks and stitch the results into a full "trajectory," 3) fine-tune the model on these trajectories so it internalizes decomposition and execution. Why it matters: without it, decomposition stays fragile and rare; with it, decomposition becomes the default habit.
🍞 Anchor: The model first writes "Step 1: compute exchange rate; Step 2: set budget," then trains on that example so next time it naturally plans those steps.
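In code terms, self-distillation boils down to packaging the model's own decomposed solution as a supervised training pair. A minimal sketch, assuming hypothetical field names and call syntax rather than the paper's exact schema:

```python
# Hypothetical field names: the point is that the model's own structured
# solution becomes the supervised target it is later fine-tuned on.
def make_sft_example(context: str, query: str, decomposed_solution: str) -> dict:
    return {
        "prompt": f"{context}\n\nUser: {query}",
        "target": decomposed_solution,
    }

example = make_sft_example(
    context="Tools: compute_exchange_rate, set_budget_limit",
    query="Set a budget equal to 50,000 RMB in USD.",
    decomposed_solution=(
        "Step 1: compute_exchange_rate(from='RMB', to='USD', amount=50000)\n"
        "Step 2: set_budget_limit(token='X', amount=converted_value)"
    ),
)
```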
But self-distillation can make the model too same-y, with less exploration and reflection. So the second stage uses RL to bring back smart variety.
🍞 Hook: Imagine a coach who not only scores you for the final goal but also gives extra credit for trying creative, promising moves.
🥬 The Concept (Reinforcement Learning and GRPO): Reinforcement Learning (RL) rewards actions that lead to good outcomes. GRPO is a specific RL method that compares outcomes across multiple tries and pushes the model toward better ones. How it works: 1) sample several solutions, 2) score them, 3) push up the probability of tokens from better solutions, 4) keep the model close to a reference to avoid going wild. Why it matters: without RL, the model might copy patterns but won't learn to choose better actions on its own.
🍞 Anchor: The AI tries a few tool-call plans, gets points for good structure and correct calls, and slowly prefers the better plan.
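The group comparison at the heart of GRPO can be sketched in a few lines. This is a simplified illustration of "better than the group average" advantages; the real objective also includes clipping and a reference-model constraint, which are omitted here.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage (simplified): how much better each rollout scored
    than its group, normalized by the group's spread."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0      # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled tool-call plans for the same query, scored by a reward function:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))   # the best plan gets the largest push
```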
🍞 Hook: When you're brainstorming, a few high-uncertainty words, like "but," "maybe," or "because," often signal real thinking.
🥬 The Concept (Diversity-Aware GRPO, DA-GRPO): DA-GRPO tweaks the RL advantage with an "entropy-aware" term that gently rewards thoughtful, higher-uncertainty tokens when normal learning stalls. How it works: 1) if the usual score difference is near zero (everyone looks the same), add a small bonus linked to token entropy, 2) this nudges the model to explore reflective moves, 3) but keep a cap so exploration doesn't drown out correctness. Why it matters: without DA-GRPO, self-distilled models can stagnate; with it, they recover healthy reflection while keeping strong decomposition.
🍞 Anchor: The model stops looping "wait, wait, wait" and instead tries a new, sensible step it was unsure about, like "check reservation details first," which unlocks the path to the right answer.
02 Core Idea
Aha! Moment in one sentence: If we first teach the model to split problems into the right subtasks and then reward it for exploring smartly diverse reasoning, we turn rambling "Lazy Reasoning" into crisp, stepwise tool use.
Three analogies:
- Cooking: Write the recipe (decompose), then taste-test variations to improve flavor (diversity-aware RL).
- Sports: Practice set plays (decompose), then scrimmage with creative moves that still respect the playbook (DA-GRPO).
- Lego: Sort pieces by type (decompose), then try a few builds that follow the plan but allow neat add-ons (diversity-aware exploration).
Before vs After:
- Before: Models write long thoughts, circle back, and miss key steps. RL alone struggles because the model's "thinking" is noisy and not structured.
- After: Models plan Step 1, Step 2, Step 3, calling the right tools at the right times. RL now has a clean backbone to optimize, and DA-GRPO keeps the model reflective without getting stuck.
Why it works (intuition over math):
- RL needs a good action space. Without decomposition, the action space is tangled; rewards get fuzzy. Self-distillation reorganizes the space: the model habitually plans subtasks, so steps map cleanly to rewards (e.g., correct tool, correct params). Then, after SFT makes the model more uniform (less variance), DA-GRPO injects gentle entropy-aware bonuses where the usual advantage is near zero. This prevents gradients from disappearing and nudges the model to try reflective, promising tokens. In short: plan first, then explore cleverly.
Building blocks (each with a simple sandwich explanation):
🍞 Hook: You know how teachers grade not just the final answer but also the work you show? That shows how you got there. 🥬 The Concept (Reasoning Trajectory): A reasoning trajectory is the full step-by-step path the model writes and the tools it calls. How it works: 1) decompose the task, 2) generate thoughts and tool calls for each subtask, 3) compose a final solution. Why it matters: without a clear path, it's hard to learn from success or fix mistakes. 🍞 Anchor: "Find flights → get reservation details → change to next day's cheapest economy → confirm" is a trajectory.
🍞 Hook: Imagine you're sorting chores among siblings: some must be done in order, some can happen at the same time, and some don't need tools. 🥬 The Concept (Sequential/Parallel/Irrelevant): Subtasks can depend on each other (sequential), run together (parallel), or be handled by explanation only (irrelevant). How it works: identify the pattern, then schedule steps accordingly. Why it matters: mixing them up breaks the workflow; copying before creating a folder will fail. 🍞 Anchor: "Find" files comes sequentially before "copy" files; "translate two paragraphs" can be parallel; "explain policy" may need no tool.
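The three patterns map naturally onto three execution strategies. The sketch below is a toy scheduler, assuming a hypothetical execute callable that runs one subtask; it is meant to show the control flow, not a production agent runtime.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subtasks(subtasks, pattern, execute):
    """Toy scheduler for the three patterns; `execute(subtask, prior)` is a
    hypothetical callable that runs one subtask and returns its output."""
    if pattern == "sequential":
        outputs, prior = [], None
        for s in subtasks:                        # later steps see earlier results
            prior = execute(s, prior)
            outputs.append(prior)
        return outputs
    if pattern == "parallel":                     # independent steps, merged at the end
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda s: execute(s, None), subtasks))
    return ["No tool needed; answer directly from the policy and context."]
```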
🍞 Hook: Sometimes your practice tests all look the same, and your score stops improving. 🥬 The Concept (Advantage and Entropy Bonus in DA-GRPO): Advantage says "how much better than average" a rollout is; the entropy bonus adds exploration when everyone looks the same. How it works: 1) compute the normal advantage; if it's near zero, 2) add a capped bonus based on token entropy; 3) update the model accordingly. Why it matters: without the bonus, learning stalls; with it, the model keeps trying thoughtful, alternative moves. 🍞 Anchor: When all plays score about the same, the coach says, "try a new setup this time," but not so wild that you forget the ball.
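Putting that recipe into a sketch: the function below keeps the ordinary group advantage when it carries signal and otherwise adds a small, capped entropy bonus. The threshold eps, scale alpha, and cap delta are illustrative hyperparameters, not the paper's values, and the real method applies this per token inside a full GRPO update.

```python
import math

def token_entropy(probs):
    """Shannon entropy of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def da_grpo_advantage(group_advantage, probs, eps=1e-3, alpha=0.1, delta=0.2):
    """Diversity-aware advantage (sketch): keep the normal GRPO advantage when it
    carries signal; when it is near zero, add a small, capped entropy bonus so
    reflective, higher-uncertainty tokens still get a gentle push."""
    if abs(group_advantage) > eps:
        return group_advantage
    return group_advantage + min(alpha * token_entropy(probs), delta)
```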
Summing up the core idea: D-CORE aligns the model's inner thoughts with the structure of real tasks (via self-distillation), then preserves and guides curiosity (via DA-GRPO). This pairing changes "think more" from wasting tokens into building the right staircase of steps.
03 Methodology
High-level pipeline: Input (policy, tools, chat history, current query) → Stage 1: Self-Distillation (Decompose → Generate → Compose → Distill) → Stage 2: Diversity-Aware RL (DA-GRPO) → Output (accurate tool calls + concise, stepwise reasoning).
Stage 1: Self-Distillation (teaching decomposition and execution)
- What happens: The model is prompted to decompose the user's query into clear subtasks, then to solve each subtask with thoughts and tool calls, and finally to compose a full reasoning trajectory. These composed trajectories become training data for supervised fine-tuning (SFT), so the model learns to do this planning by default.
- Why this step exists: RL alone struggles when the model's thinking is messy; SFT on structured trajectories makes decomposition habitual and stabilizes learning. Without it, the model keeps looping in Lazy Reasoning, and rewards can't steer it reliably.
- Mini example (budget setting): Input: "Set a budget equivalent to 50,000 RMB in USD using token X." Decompose: (1) compute_exchange_rate (RMB→USD, 50,000), (2) set_budget_limit (token X, converted value). Generate: the model writes a short thought for each subtask and calls the proper tools. Compose: stitch the steps + tool outputs into a single, readable trajectory. Distill: SFT trains the model on this final example.
Detailed steps like a recipe:
- Decompose(C, Q, Y*):
- Input: context C (policy P, tool list T, conversation history H), query Q, and a small set of reference/few-shot examples Y*.
- Action: ask the model to output a subtask list with step numbers and concise descriptions.
- Safety check: verify reasonable length and relevance (e.g., no invented tools).
- Why needed: without a subtask list, later steps collapse into unfocused thinking.
- Generate per-subtask:
- For each subtask si: the model writes a brief thought, proposes the tool call τi, and you execute it to get output oi.
- Sequential case: pass each oi forward so later steps know what happened before.
- Parallel case: run subtasks independently and merge results.
- Irrelevant case: explain why no tool is needed.
- Why needed: verifies that each subtask is executable and grounded in real tool responses.
- Compose trajectories:
- Combine the subtasks, thoughts Ri, tool calls τi, and tool outputs oi into a single, coherent trajectory ŷ.
- Include light reflection (e.g., quick checks) where helpful, especially for parallel/irrelevant cases.
- Why needed: creates a complete "study guide" that shows good habits from start to finish.
- Distill via SFT:
- Train the model to maximize the likelihood of ŷ given (C, Q).
- Why needed: cements decomposition-first behavior so it becomes the model's default.
The secret sauce of Stage 1: It uses the model itself, guided by prompts and a small number of examples, to mass-produce high-quality trajectories, with no expensive teacher model required.
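As a rough sketch, Stage 1 can be read as one loop over tasks: decompose, solve each subtask against real tool outputs, compose a trajectory, and keep it as an SFT example. The helpers call_model and execute_tool are hypothetical stand-ins for prompted model calls and a tool sandbox, and the trajectory text format here is invented for illustration.

```python
def build_distillation_set(tasks, call_model, execute_tool):
    """Stage 1 sketch: decompose, solve each subtask with a real tool call,
    compose one trajectory, and keep it as an SFT example."""
    dataset = []
    for context, query in tasks:
        subtasks = call_model(f"{context}\nQuery: {query}\nList the subtasks, one per line.")
        steps, prior_output = [], None
        for s in subtasks:
            thought, tool_call = call_model(
                f"Subtask: {s}\nPrevious result: {prior_output}\nGive a short thought and a tool call.")
            prior_output = execute_tool(tool_call)            # ground the step in a real tool response
            steps.append((s, thought, tool_call, prior_output))
        trajectory = "\n".join(
            f"Step {i + 1}: {s}\nThought: {t}\nCall: {c}\nResult: {o}"
            for i, (s, t, c, o) in enumerate(steps))
        dataset.append({"prompt": f"{context}\nUser: {query}", "target": trajectory})
    return dataset                                            # fine-tune (SFT) on these pairs
```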
Stage 2: Diversity-Aware GRPO (keeping reflection smart and alive)
- What happens: Run RL using a GRPO-style objective with a twist. If the advantages for a group of rollouts are nearly identical (a sign of stagnation), add a small, capped "entropy-aware" offset so tokens with higher uncertainty (often reflective words) get a nudge. This prevents gradient collapse and brings back healthy exploration.
- Why this step exists: Self-distillation can make the model's behavior too uniform: great for stability, bad for discovery. DA-GRPO reintroduces exploration where it helps most but doesn't let entropy overwhelm correctness.
- Mini example (returns on τ-bench retail): The base RL plan might repeatedly forget one item in a multi-item return. DA-GRPO's gentle entropy bonus helps the model try a slightly different thought like "check if there are multiple items in this order," which unlocks the correct two-item call.
Reward design (aligned with ToolRL):
- Format: did the model follow the required <think>…</think> and output style?
- Structure: does the number of tool calls match what's expected?
- Keys and values: did it pick the right tool names, parameter names, and parameter values?
- Why needed: tool use must be verifiable; these rewards make correctness checkable and consistent.
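A hedged sketch of what such a verifiable reward could look like, with illustrative weights and a simplified call representation; the paper's exact scoring follows ToolRL and may differ in its details.

```python
import re

def tool_use_reward(response, expected_calls, predicted_calls):
    """Verifiable-reward sketch: format + structure + key/value matching.
    Weights and parsing are illustrative, not the paper's exact scheme."""
    reward = 0.0
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.2                                         # format: thinking block present
    if len(predicted_calls) == len(expected_calls):
        reward += 0.2                                         # structure: right number of calls
    per_call = 0.6 / max(len(expected_calls), 1)
    for exp, pred in zip(expected_calls, predicted_calls):
        if pred.get("name") == exp["name"]:
            reward += per_call / 2                            # correct tool name
        if pred.get("args") == exp["args"]:
            reward += per_call / 2                            # correct parameter names and values
    return reward
```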
Putting it all together (Input → Steps → Output):
- Input: P, T, H, Q
- Stage 1: Decompose → per-subtask Generate (Ri, τi) → Execute → Compose ŷ → Distill
- Stage 2: Sample multiple trajectories → Score with rewards → Apply DA-GRPO updates (normal advantage when present, entropy-aware term when not) → New policy
- Output: A model that plans first, executes cleanly, and reflects just enough to solve tricky multi-turn workflows.
The clever bits:
- Self-supplied trajectories: avoid big teacher costs while still getting structured reasoning data.
- Entropy-aware advantage: a simple, capped offset that prevents stalled learning and encourages meaningful reflection, not endless rambling.
- Balance: decomposition locks in structure; DA-GRPO unlocks flexible, thoughtful exploration.
04 Experiments & Results
The tests: The team measured accuracy on realistic tool-use benchmarks that demand planning:
- BFCLv3: includes Live/Non-Live settings and tests Relevance (use a tool or not), Irrelevance (no tool needed), and especially Multi-Turn (several steps across a conversation).
- τ-bench: airline and retail tasks where agents must follow multi-step procedures (e.g., change a flight to the day after, return multiple items, compute refunds) with hidden constraints.
The competition: They compared D-CORE on Qwen3-8B and 14B backbones against:
- Vanilla Qwen baselines (no-think, think),
- ToolRL (GRPO-only RL),
- Task-specialized models like xLAM2 (8B/32B/70B),
- Big proprietary systems (Claude 3.7 Sonnet, GPT-4o, o1, DeepSeek-R1).
Scoreboard with context:
- D-CORE-8B: 77.7% overall on BFCLv3, like jumping from a mid B to a strong A while being the smallest in the pack. It outperforms the best 8B baseline by 5.7 points.
- D-CORE-14B: 79.3%, a new state of the art among the open models reported in this study, beating some 70B models despite being 5× smaller. That's like a varsity player outperforming a pro on the same drill.
- Multi-turn gains: about +30.8 points over the base models; this is the clearest evidence that decomposition + diversity-aware reflection tackles Lazy Reasoning head-on.
- τ-bench: D-CORE improves Qwen3-8B by +18.6 points and Qwen3-14B by +17.7, with standout performance in airline scenarios that need 4–5 subtasks when user intent is fuzzy.
Surprising findings:
- RL alone (GRPO) often underperforms or gives small gains on complex multi-turn tasks (sometimes even negative on multi-turn!). But once self-distillation is added first, RL suddenly becomes much more effective. This supports the paper's thesis: plan first, then refine.
- After self-distillation, reward variance shrinks (everyone performs similarly), which would normally stall RL. DA-GRPO's entropy-aware boost revives learning without making the model chaotic.
- Behavioral analysis shows a major shift: self-distillation increases decomposition and lowers empty reflection; DA-GRPO brings back a healthy slice of reflection, but not the wasteful kind. In the 8B model, errors due to Lazy Reasoning on multi-turn drop from 45% to 6%.
Generalization (out-of-distribution): On ACEBench, τ-Bench variants, and BFCLv4-agentic (with web and memory tasks), D-CORE-trained models remain competitive or superior to baselines. This suggests the method teaches general habits (decompose, then reflect) rather than memorizing datasets.
Efficiency notes:
- 40k self-distillation samples generated in roughly a day per model size on a single A100.
- Training (SFT + DA-GRPO) runs in under two days on 8×A100 for each size.
- This is practical for many labs and teams.
Takeaway: D-CORE doesn't just add more "think"; it changes how the model thinks. The result is fewer wasted tokens, stronger plans, and better real-world tool use.
05 Discussion & Limitations
Limitations:
- Backbone dependence: Results shown on Qwen3 backbones; while the ideas are general, plug-and-play performance on every LRM is not guaranteed.
- Prompt reliance in Stage 1: Good decomposition prompts and a few reference examples matter. Poor prompts can yield weaker trajectories.
- Tuning DA-GRPO: The entropy bonus must be balanced. Too low: exploration stalls. Too high: reflection overwhelms accuracy.
- Reward coverage: The reward checks format, structure, and key/value matches, but may not capture all nuances (e.g., latency trade-offs, user preferences).
Required resources:
- Hardware: 8×A100 (80GB) class for the training stages; 1×A100 for data generation is enough.
- Data: A seed of tool-use tasks covering sequential, parallel, and irrelevance cases; 40k self-distill samples worked well here.
- Engineering: A tool-execution sandbox, logging to capture tool outputs, and verification scripts.
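For the engineering piece, a toy version of a tool-execution sandbox with logging might look like the following; the class and its behavior are illustrative, not a specific framework.

```python
import json, logging

logging.basicConfig(level=logging.INFO)

class ToolSandbox:
    """Toy tool-execution sandbox: only registered tools can run, and every call
    plus its output is logged so trajectories can be verified afterwards."""
    def __init__(self):
        self.tools, self.log = {}, []

    def register(self, name, fn):
        self.tools[name] = fn

    def call(self, name, **kwargs):
        if name not in self.tools:
            raise ValueError(f"unknown tool: {name}")         # reject invented tools
        output = self.tools[name](**kwargs)
        record = {"tool": name, "args": kwargs, "output": output}
        self.log.append(record)
        logging.info(json.dumps(record, default=str))
        return output
```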
When NOT to use:
- Purely generative tasks with no tools (e.g., storytelling) where decomposition of tool calls isn't central.
- Tiny models or extremely low-resource settings where even light RL is infeasible.
- Domains where tool outputs can't be verified automatically; the reward then becomes noisy.
Open questions:
- Can we auto-tune the entropy bonus per task or per token type to further reduce sensitivity to α/δ?
- How well does D-CORE extend to multimodal agents (e.g., reading screens, seeing images) where decomposition spans vision and language?
- Can rewards incorporate longer-term success (like fewer total calls or lower latency) without harming accuracy?
- Can we learn to predict the best decomposition pattern (sequential vs parallel) before execution to save tokens?
- How far can smaller models go with better decomposition before needing more parameters?
Honest assessment: D-CORE is a practical, modular recipe. It meaningfully lifts multi-turn tool use by fixing the plan-first habit and then preserving thoughtful exploration. It does not solve all reasoning, but it reliably turns "more thinking" into "better thinking" in agentic workflows.
06 Conclusion & Future Work
Three-sentence summary: D-CORE teaches models to plan first and then reflect smartly, transforming Lazy Reasoning into effective, step-by-step tool use. It does this by self-distilling decomposition habits and then applying diversity-aware RL (DA-GRPO) that maintains exploration without derailing accuracy. The result is large jumps on complex, multi-turn benchmarks, with smaller models outperforming much bigger ones.
Main achievement: Showing that aligning a model's inner thoughts with explicit task decomposition, then carefully reviving reflection with an entropy-aware advantage, dramatically improves complex tool workflows.
Future directions:
- Extend to multimodal agents (screen reading, vision + language) where decomposition includes perception steps.
- Smarter, auto-tuned entropy bonuses that adapt per token or per stage.
- Broader rewards that also score efficiency and user satisfaction.
- Plug-and-play trajectories to bootstrap other backbones with minimal engineering.
Why remember this: D-CORE reframes "reasoning tokens" from wordy filler into a clean plan plus purposeful reflection. It's a simple, teachable blueprint (decompose, then diversify) that helps agents reliably get real work done.
Practical Applications
- Customer support agents that correctly process multi-item returns, exchanges, and refunds with fewer errors.
- Travel assistants that change flights, compute budgets, and handle compensation logic across several steps.
- Research copilots that plan web searches, read results, and assemble summaries without losing track.
- IT automations that reliably chain file system and system tools (find, mkdir, copy, backup).
- Back-office workflows that check records, validate constraints, and update multiple systems in order.
- Shopping assistants that compare products in parallel and then assemble a final recommendation.
- Personal finance bots that decompose goals (convert currencies, set limits, schedule payments).
- Sales operations that gather CRM data, verify fields, and generate correct tool calls for updates.
- Healthcare admin helpers that schedule appointments, verify insurance, and send follow-ups in sequence.
- Education tutors that plan multi-step problem solving (retrieve definitions, apply rules, verify answers).