Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Intermediate
Xiao Liang, Zhong-Zhi Li, Zhenghao Lin et al. · 2/2/2026
arXiv · PDF

Key Summary

  • The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.
  • This divide-and-conquer style is learned with reinforcement learning so the model practices both making good subproblems and using them to answer the original question.
  • A special reward design checks that subproblems are well-formed, there are enough of them, and at least one group of them actually helps reach a correct final answer.
  • Using only the correctness of the final answer to score the full solution turns out to be a strong teaching signal for solving subproblems too.
  • Compared to standard chain-of-thought training, this method raises the model’s performance ceiling and scales better when you give it more attempts at test time.
  • On competition-level math benchmarks, the method improves Pass@1 by 8.6% and Pass@32 by 6.3% over strong chain-of-thought baselines.
  • Training on especially hard questions (Deep DAC) further increases gains and keeps solutions shorter and less repetitive while still exploring many ideas.
  • Even when mixed with chain-of-thought training, the divide-and-conquer training boosts the model’s step-by-step reasoning ability too.
  • At test time, spreading attempts across many different subproblem decompositions works better than just trying the same full solution many times.

Why This Research Matters

Reasoning-heavy AI systems will increasingly help us with math, coding, planning, and scientific discovery, so their ability to explore many good solution paths really matters. This paper shows that teaching models to plan (divide) and execute (conquer) together raises their performance ceiling beyond classic step-by-step chains. As a result, giving the model more attempts at test time genuinely helps, which means we can trade extra compute for better answers in a predictable way. The approach also tends to produce clearer, shorter solutions, making outputs easier to audit and trust. These benefits appear even when starting from small models or when mixing with ordinary chain-of-thought training.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how cleaning a super messy room is easier if you first sort toys, books, and clothes into piles, then clean one pile at a time? That’s because smaller jobs are simpler.

🥬 The Concept (Chain-of-Thought, CoT):

  • What it is: Chain-of-Thought is when an AI solves a problem by writing one long, careful line of steps from start to finish.
  • How it works:
    1. Read the whole problem.
    2. Think step-by-step in a single stream.
    3. Produce a final answer.
  • Why it matters: Without CoT, the model often skips steps and makes more mistakes. 🍞 Anchor: When asked “What’s 25% of 80?”, CoT helps the model write: 25% = 1/4, 1/4 of 80 is 20, so the answer is 20.

🍞 Hook: Imagine building a big LEGO castle. You don’t build it all at once—you make the tower, the wall, and the gate separately, then snap them together.

🥬 The Concept (Divide-and-Conquer, DAC):

  • What it is: Divide-and-Conquer breaks a hard problem into smaller subproblems, solves them, then combines their answers.
  • How it works:
    1. Split the big problem into clear sub-tasks.
    2. Solve each sub-task.
    3. Use the sub-answers to solve the original problem.
  • Why it matters: Without splitting, the model may get lost in a long, twisty solution and can’t scale well to harder tasks. 🍞 Anchor: For a long fraction puzzle, DAC might first simplify part A, then part B, then plug both into the final step.

🍞 Hook: Think of running a race alone versus having a relay team. Doing everything in a single stretch is slower and tiring; splitting the race can use energy better.

🥬 The Concept (Test-time scalability):

  • What it is: Test-time scalability means how well a model improves when you give it more chances or longer time at test time.
  • How it works:
    1. Give the model multiple tries or longer answers.
    2. See if performance keeps improving.
  • Why it matters: If extra tries stop helping, the model has hit a ceiling; we want methods that keep getting better with more attempts. 🍞 Anchor: Trying 32 different decompositions of a problem often beats trying the same single chain 32 times.
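
To make “more chances at test time” concrete, here is a tiny Python sketch (an illustration, not the paper’s evaluation code) of the Pass@k idea: a problem counts as solved if any of k sampled answers matches the ground truth.

```python
def solved_within_k_tries(sampled_answers, ground_truth, k):
    """True if any of the first k sampled answers matches the ground truth."""
    return any(ans == ground_truth for ans in sampled_answers[:k])

# Toy usage: 32 hypothetical attempts at the same question.
attempts = ["18", "22", "20"] + ["17"] * 29
print(solved_within_k_tries(attempts, "20", k=1))   # False: the first try missed
print(solved_within_k_tries(attempts, "20", k=32))  # True: one of 32 tries was right
```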

🍞 Hook: Have you ever learned to add by practicing one way, then were tested using a totally different trick? That mismatch can be confusing.

🥬 The Concept (Misalignment between training and inference):

  • What it is: Misalignment is when a model is trained one way (like CoT) but asked to think a different way later (like DAC).
  • How it works:
    1. Train mostly on single long chains.
    2. At test time, ask for subproblems and trees.
    3. The model struggles because it never practiced that style.
  • Why it matters: Without alignment, DAC prompting at test time underperforms even when DAC could help. 🍞 Anchor: Models trained on step-by-step often do worse when suddenly asked to branch into subproblems with complex prompting.

🍞 Hook: Picture learning a video game by trying moves and getting points when you win. You keep what works and drop what doesn’t.

🥬 The Concept (Reinforcement Learning, RL):

  • What it is: RL teaches models by rewarding good answers and not rewarding bad ones.
  • How it works:
    1. The model tries to solve problems.
    2. It gets a reward if the final answer is correct (and sometimes for good structure).
    3. It updates its strategy to get more rewards in the future.
  • Why it matters: When correct solutions are rare and step labels are expensive, RL lets models learn by exploring. 🍞 Anchor: The model tries different solution paths to a math problem; only correct final answers get a “gold star,” and the model learns which paths are safer.

The world before this paper looked like this: CoT had become the standard way to push LLM reasoning using long, reflective steps. It works well on many tasks, but for the very hardest ones, the model’s single-file march hits a limit. People tried DAC-style prompting at test time (like Tree-of-Thoughts), which helped sometimes, but it required fancy prompts and didn’t match what the model had practiced during training. That mismatch (misalignment) made DAC flaky: even simpler questions could become harder when forced into a style the model hadn’t learned.

The problem researchers faced: How do we make DAC not just a test-time trick but a skill the model truly learns—so it scales better, explores smarter, and reaches a higher ceiling on tough problems?

Failed attempts mostly focused on inference-only solutions—clever prompts and search schemes—without teaching the model to actually generate helpful subproblems or use them well. Worse, rewarding division by directly measuring conquer accuracy caused a failure mode: the model tried to “solve the whole thing early” in the division text instead of producing real subproblems.

The missing piece (the gap) was a training recipe that jointly teaches both halves of DAC: making good subproblems and using them to solve the original question, with rewards that encourage real decomposition rather than reward-hacking.

Real stakes: Better reasoning affects tutors that help with math homework, planning tools that break big goals into steps, coding assistants that decompose bugs, and scientists who need models to explore many paths without getting stuck. When models scale well at test time, giving them more attempts actually helps—like having a bigger, more energetic team for that relay race.

02Core Idea

🍞 Hook: Imagine a chef who not only knows how to slice ingredients into perfect bite-sized pieces but also knows exactly how to cook each piece and plate a beautiful dish. Training both the slicing and cooking together makes dinner reliably great.

🥬 The Concept (Aha! DAC-RL in one sentence):

  • What it is: Train the model end-to-end with reinforcement learning to both divide problems into helpful subproblems and conquer them, using final-answer rewards plus gentle division rewards.
  • How it works:
    1. Division: The model proposes multiple groups of subproblems.
    2. Conquering: For each group, the model solves subproblems in sequence and then answers the original problem.
    3. Rewards: The final correctness rewards the whole conquer path; division gets extra checks (right format, enough subproblems, and at least one group that helps lead to a correct final answer).
    4. Update: The policy is improved so future divisions and solutions work better.
  • Why it matters: Without training both halves together, DAC at test time is wobbly; with DAC-RL, it becomes a native skill that scales. 🍞 Anchor: The model learns to write a plan (subproblems) and then follow that plan to solve a tough math question.
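
A minimal sketch of that four-step loop, with placeholder callables standing in for the actual LLM calls and optimizer (nothing here is the paper’s code):

```python
import random

def dac_rl_step(divide, conquer, check, update, problem, truth,
                n_groups=4, n_attempts=8):
    """One DAC-RL step: divide, conquer, score, update.
    `divide`, `conquer`, `check`, and `update` are caller-supplied callables;
    this is a structural sketch, not the paper's implementation."""
    experiences = []
    for _ in range(n_groups):                       # 1. propose subproblem groups
        group = divide(problem)
        rewards = []
        for _ in range(n_attempts):                 # 2. solve subs, then original
            answer = conquer(problem, group)
            rewards.append(1.0 if check(answer, truth) else 0.0)  # 3. final reward
        helpful = 1.0 if max(rewards) > 0 else 0.0  # 3. division "helpfulness" bonus
        experiences.append((group, rewards, helpful))
    update(experiences)                             # 4. policy update (e.g., GRPO)

# Toy usage with stand-in callables (a real system would call an LLM here).
dac_rl_step(
    divide=lambda p: ["find base", "find height", "compute area"],
    conquer=lambda p, g: random.choice(["20", "24"]),
    check=lambda a, t: a == t,
    update=lambda exps: print(f"collected {len(exps)} groups"),
    problem="combined area of a 4x5 and a 2x2 rectangle?", truth="24",
)
```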

Three analogies for the same idea:

  1. School project teams: One teammate breaks the project into tasks; the rest complete them. Grading both the plan and the result teaches everyone to plan wisely and execute well.
  2. Treasure map: First you draw a helpful map (division), then you follow it step-by-step (conquering). You get the treasure only if the map makes sense and you follow it to the end.
  3. Puzzle sorting: You sort pieces by color/edge (division), then assemble each cluster (conquering). Scoring on the finished picture encourages sorting that truly helps assembly.

Before vs After:

  • Before: CoT training made models good at long single-file reasoning but hit a ceiling; DAC prompts at test time often underperformed because the model hadn’t practiced them.
  • After: DAC-RL raises the ceiling. With more attempts, performance keeps improving. The model’s plans are more structured, solutions are shorter and less repetitive, and even CoT skills get a lift.

🍞 Hook: You know how a teacher can grade just the final poster of a science fair project and still reward teams that had solid steps because those steps led to a great result?

🥬 The Concept (Final-answer reward as surrogate for subproblems):

  • What it is: Using the correctness of the final answer to reward the whole conquer process, which indirectly encourages correct subproblem solving.
  • How it works:
    1. Solve subproblems and then the original problem.
    2. If the final answer is correct, the whole path gets a reward.
    3. Over time, paths with more correct substeps are chosen more often.
  • Why it matters: We don’t have ground-truth answers for every subproblem, so a single final reward is a practical and consistent teacher. 🍞 Anchor: If solving the area after finding base and height is correct, the model learns that getting base and height right pays off.

🍞 Hook: When you grade planners, you don’t just want “lots of steps,” you want steps that are neat, enough in number, and actually useful.

🥬 The Concept (Division reward with format, quantity, and helpfulness):

  • What it is: A light reward checks that subproblems are parseable, that there are at least Ns of them, and that at least one group actually contributes to a correct final answer.
  • How it works:
    1. Format validity: Subproblems must follow the required tags so they can be read.
    2. Quantity: There must be at least Ns subproblems so division isn’t skipped.
    3. Helpfulness lower bound: Reward division groups when, among all the groups tried, at least one conquering attempt reaches a correct final answer.
  • Why it matters: Without these, the model might cheat by solving the whole thing early in the “division” text or by outputting zero subproblems. 🍞 Anchor: The model lists 4 clear subproblems for a geometry proof, and one grouping leads to the correct final angle; that group earns credit.

🍞 Hook: Think of exploring a maze with many branches. If you only ever try one path, you might miss the exit; if you explore diverse branches, you’re more likely to find the way out.

🥬 The Concept (Why it works—intuition without equations):

  • What it is: Training the plan-making and plan-following together turns diverse exploration into better solutions, while simple rewards prevent reward-hacking.
  • How it works:
    1. Multiple division groups create diverse starting points.
    2. Several conquering attempts per group explore varied solution paths.
    3. Final-answer rewards steer towards reliable substeps; division checks keep decomposition real.
  • Why it matters: This combination fights entropy collapse (getting stuck repeating the same thing) and lifts the performance ceiling. 🍞 Anchor: On tough math questions, trying many different decompositions beats repeating the same long chain, and training makes those decompositions steadily smarter.

Building blocks of the idea:

  • Policy that outputs subproblem groups (division).
  • Conquering that solves subproblems then the original.
  • Final-answer reward for conquering; lightweight, aligned division reward.
  • RL optimization (GRPO) with stability tricks so learning is steady.
  • Test-time scaling that prefers breadth (many groups) over depth (many tries of the same group).

03Methodology

High-level recipe: Input problem → Division (make several subproblem groups) → Conquering (solve subs, then final) → Rewards (final correctness + division checks) → Policy update (make future division and conquering better).

🍞 Hook: Imagine you’re a coach. First, you split the class into small teams with different game plans, then each team plays, and finally you score them and adjust training for next time.

🥬 The Concept (Subproblem division):

  • What it is: The model proposes Gd groups, each with at least Ns subproblems, written in a strict, parseable format.
  • How it works:
    1. Prompt: “Don’t solve—just list more than Ns subproblems using tags.”
    2. The model creates several groups, each a different take on how to break the problem.
    3. A parser checks tags and counts subproblems.
  • Why it matters: Without this, the model can skip decomposing and just try to solve directly, defeating DAC. 🍞 Anchor: For an algebra word problem, the model lists 4–5 mini-tasks like “isolate x,” “find common denominator,” “simplify fraction,” in neat tags.

Details and safeguards:

  • Format validity: Responses must match tag patterns so we can reliably extract subproblems.
  • Quantity validity: Enforce at least Ns (e.g., Ns = 3) to prevent degeneration into zero-division.
  • Helpfulness lower bound: Reward a division group if, among all groups, at least one conquering attempt with that group leads to a correct final answer.
  • Why the lower bound? Stronger rewards (like average accuracy per group) made the model “solve early” inside the division text. The gentler rule keeps division honest.
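
A minimal sketch of the format and quantity checks, assuming a hypothetical <subproblem>…</subproblem> tag scheme (the paper’s exact tags and parser may differ):

```python
import re

# Hypothetical tag scheme; the paper's exact tags may differ.
SUB_TAG = re.compile(r"<subproblem>(.*?)</subproblem>", re.DOTALL)

def parse_division(text, min_subproblems=3):
    """Extract tagged subproblems and apply the format/quantity checks."""
    subs = [s.strip() for s in SUB_TAG.findall(text) if s.strip()]
    format_ok = len(subs) > 0                 # tags present and parseable
    quantity_ok = len(subs) >= min_subproblems
    return subs, format_ok and quantity_ok

division_text = """
<subproblem>Isolate x in the first equation.</subproblem>
<subproblem>Find a common denominator.</subproblem>
<subproblem>Simplify the resulting fraction.</subproblem>
<subproblem>Substitute back to get the final value.</subproblem>
"""
subs, valid = parse_division(division_text)
print(len(subs), valid)  # 4 True
```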

🍞 Hook: Picture following a checklist: do Step 1, then Step 2, then finish. You don’t jump to the end without doing the steps.

🥬 The Concept (Conquering: solve subs then original):

  • What it is: For each division group, the model gets a conquering prompt that asks it to solve subproblems in order and then answer the original.
  • How it works:
    1. Build a conquering prompt: original problem + tagged subproblems.
    2. Generate Gc solutions per group (several independent tries).
    3. Extract the final answer and check correctness.
  • Why it matters: This staged solving lets partial results guide the final step. 🍞 Anchor: With subproblems like “compute slope,” “compute intercept,” the model then combines them to find the line equation correctly.
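
A rough sketch of how a conquering prompt could be assembled and a final answer extracted; the wording and the \boxed{} convention are assumptions for illustration, not the paper’s exact template:

```python
import re

def build_conquer_prompt(problem, subproblems):
    """Assemble a conquering prompt: original problem plus ordered subproblems.
    The wording here is illustrative, not the paper's exact template."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(subproblems))
    return (
        f"Problem: {problem}\n\n"
        "Solve the following subproblems in order, then answer the original "
        "problem. Put the final answer in \\boxed{}.\n"
        f"{steps}"
    )

def extract_final_answer(completion):
    """Pull the last \\boxed{...} value out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

prompt = build_conquer_prompt(
    "Find the equation of the line through (0, 1) and (2, 5).",
    ["Compute the slope.", "Compute the intercept.", "Combine into y = mx + b."],
)
print(prompt)
print(extract_final_answer("Slope 2, intercept 1, so \\boxed{y = 2x + 1}."))
```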

🍞 Hook: Imagine a teacher who gives a gold star only if the science project works at the end. Teams that really did the steps right get the star more often.

🥬 The Concept (Final-answer reward):

  • What it is: A simple 1/0 reward based on whether the final answer is correct.
  • How it works:
    1. If final extracted answer equals the ground truth, reward = 1; else 0.
    2. This single signal nudges the model toward substeps that truly help the end.
  • Why it matters: We don’t need subproblem answer keys; the final correctness already correlates with getting more substeps right. 🍞 Anchor: If the model got the fraction simplified wrong, the final answer tends to be wrong too, so that path gets no star.
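
That 1/0 rule fits in a few lines; the normalization below (strip and string-compare) is a simplification of what a real contest-math checker would do:

```python
def final_answer_reward(predicted, ground_truth):
    """1.0 if the extracted final answer matches the ground truth, else 0.0.
    The comparison here is a simplification; real checkers for contest math
    typically parse and compare integers or expressions."""
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip() == str(ground_truth).strip() else 0.0

print(final_answer_reward("204", 204))  # 1.0
print(final_answer_reward("12", 204))   # 0.0
```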

🍞 Hook: When you grade planners, you still want the plan to be neat and complete, even if you mostly grade the final project.

🥬 The Concept (Division reward):

  • What it is: A small bonus that checks (a) format, (b) enough subproblems, (c) at least one helpful group.
  • How it works:
    1. If tags are broken or too few subproblems, assign a negative/zero reward.
    2. If among groups, this group helped yield at least one correct final answer, it’s rewarded.
  • Why it matters: Prevents reward-hacking (like solving everything during division) and ensures the division stage remains a planner, not a solver. 🍞 Anchor: The model earns credit for providing 4 cleanly-tagged steps where one grouping later leads to the correct result.
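
Putting the three checks together, a hedged sketch of the division reward might look like this; the numeric penalty/bonus values, and the exact scoping of the helpfulness check (per group versus across all groups from one division pass), are simplifications:

```python
def division_reward(subproblems, format_ok, attempt_rewards,
                    min_subproblems=3, bonus=1.0, penalty=-1.0):
    """Sketch of the division reward: (a) format, (b) quantity,
    (c) helpfulness lower bound. Numeric values are placeholders."""
    # (a) + (b): broken tags or too few subproblems are penalized.
    if not format_ok or len(subproblems) < min_subproblems:
        return penalty
    # (c): credit the division if at least one conquering attempt reached a
    # correct final answer. A lower bound like this, rather than average
    # per-group accuracy, avoids the "solve early in the division" hack.
    return bonus if max(attempt_rewards, default=0.0) > 0 else 0.0

print(division_reward(["s1", "s2", "s3", "s4"], True, [0, 0, 1, 0]))  # 1.0
print(division_reward(["s1"], True, [1]))                             # -1.0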

🍞 Hook: Think of learning to throw darts better by trying a bunch of throws, seeing scores, and adjusting your aim each round.

🥬 The Concept (Policy update with GRPO):

  • What it is: An RL method that compares rewards of multiple tries and nudges the policy toward better ones while keeping changes stable.
  • How it works:
    1. For each problem, generate groups and multiple conquering attempts.
    2. Compute rewards and store experiences.
    3. Update the model to increase probability of higher-reward tokens (with clipping and KL to stay stable).
  • Why it matters: Without stable updates, training can wobble or collapse into repeating the same thing. 🍞 Anchor: The model tries eight solutions per plan, sees which scored 1, and slightly increases the chance of writing similar good moves next time.
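
At the core of GRPO is a group-relative advantage: each attempt’s reward is compared against the mean (and spread) of its group, so better-than-average attempts are reinforced. A minimal sketch, omitting the clipped ratio and KL terms mentioned above:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / std within one group of attempts.
    The clipped policy-ratio loss and KL regularization are omitted for brevity."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight conquering attempts for one division group: two were correct.
rewards = [0, 0, 1, 0, 0, 1, 0, 0]
print([round(a, 2) for a in grpo_advantages(rewards)])
# Correct attempts get positive advantages; incorrect ones get negative.
```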

Concrete settings (typical):

  • Gd = 4 division groups per problem; Gc = 8 conquering attempts per group.
  • Ns = 3 minimum subproblems.
  • Batch size around 256; rollout lengths up to 8k tokens; temperature ~1.0.
  • Use stability tricks (e.g., Clip-Higher, token-level loss) from prior RL work.
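
Collected into a small config sketch (values taken from the list above; the field names are illustrative, not the paper’s code):

```python
from dataclasses import dataclass

@dataclass
class DACRLConfig:
    """Typical settings listed above; field names are illustrative."""
    n_division_groups: int = 4     # Gd: division groups per problem
    n_conquer_attempts: int = 8    # Gc: conquering attempts per group
    min_subproblems: int = 3       # Ns: minimum subproblems per group
    batch_size: int = 256
    max_rollout_tokens: int = 8192
    temperature: float = 1.0

print(DACRLConfig())
```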

Extra design choices that help:

  • Deep DAC: Train more on the hardest problems and allow longer contexts; this strengthens test-time scalability.
  • Mix-RL: Use DAC for tough items and CoT for easy ones; surprisingly, this can also boost plain CoT performance.
  • Test-time allocation: With a fixed budget k attempts, spreading attempts across many different division groups (breadth) outperforms hammering one group with many tries (depth).
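
A toy Monte-Carlo sketch of that breadth-versus-depth trade-off, using made-up per-decomposition success rates (not paper data) and a fixed pool of candidate decompositions:

```python
import random

def solve_rate(success_probs, n_groups, tries_per_group, trials=5000):
    """Estimate the chance of solving a problem when a fixed test-time budget
    is spread over n_groups decompositions with tries_per_group attempts each.
    Per-decomposition success probabilities are toy values."""
    solved = 0
    for _ in range(trials):
        groups = random.sample(success_probs, n_groups)
        if any(random.random() < p for p in groups for _ in range(tries_per_group)):
            solved += 1
    return solved / trials

# 16 candidate decompositions: most are dead ends, a few are promising.
probs = [0.0] * 12 + [0.3] * 4
print(solve_rate(probs, n_groups=16, tries_per_group=4))   # breadth: 64 attempts
print(solve_rate(probs, n_groups=2,  tries_per_group=32))  # depth:   64 attempts
```

With the same total budget, the breadth-first split wins in this toy model because it is far more likely to include at least one useful decomposition.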

Secret sauce:

  • Joint training of planning (division) and execution (conquering).
  • A gentle but aligned division reward to prevent cheating and encourage real decomposition.
  • A single final-answer reward that implicitly teaches solving subproblems.
  • Breadth-first exploration (many decompositions) that raises the ceiling without exploding length, often producing shorter, cleaner reasoning.

04Experiments & Results

🍞 Hook: Imagine two soccer strategies. Team A passes in a straight line every time (CoT). Team B first splits the field into zones and plans different plays for each zone (DAC). Which one improves more when given extra practice and more plays to try?

🥬 The Concept (The test and why):

  • What it is: Evaluate on tough math contests (AIME 2024/2025, Beyond-AIME, HMMT-2025) using Pass@1 and Pass@32.
  • How it works:
    1. Train with DAC-RL vs CoT-RL on DAPO-Math-17k.
    2. Measure how often the best of k tries is correct (Pass@k), and how performance grows during training.
  • Why it matters: These are competitive, uncontaminated benchmarks where simple tricks fail; gains here mean real reasoning improvements. 🍞 Anchor: Think of Pass@1 as getting the question right on the first try (like a pop quiz), and Pass@32 as getting 32 tries (like a practice session with retries).
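
One common way Pass@k is estimated is the unbiased estimator popularized by the Codex paper: with n sampled solutions of which c are correct, Pass@k = 1 - C(n-c, k)/C(n, k). A small sketch (the paper may instead compute Pass@k directly from repeated sampling):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: chance that at least one of k samples is correct,
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One problem, 64 sampled solutions, 3 of them correct.
print(round(pass_at_k(n=64, c=3, k=1), 3))   # 0.047
print(round(pass_at_k(n=64, c=3, k=32), 3))  # 0.881
```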

The competition (baselines):

  • CoT-style RL on the same data and budgets.
  • Inference-only DAC prompting without specialized training.
  • Strong instruction-tuned models (e.g., Qwen families) as starting points.

Scoreboard with context:

  • Main headline: With Qwen3-4B-Instruct-2507, DAC-RL beats CoT by +8.6% Pass@1 and +6.3% Pass@32 on average across contest benchmarks. That’s like going from a solid B to an A, even when the class is graded on a curve.
  • For Qwen2.5-7B-Instruct, DAC-RL also improves over CoT-RL, including +3.4% on Pass@32 across benchmarks, despite DAC starting weaker than CoT before training.
  • Deep DAC (training longer on the hardest 3.7k items with longer token budgets) adds further gains, and CoT with the same increased rollout budget does not catch up.

🍞 Hook: When you have 1,024 total tries, is it better to try 1 plan 1,024 times, or to try 128 different plans and give each a few attempts?

🥬 The Concept (Test-time budget allocation, Pass@k):

  • What it is: Distribute a fixed number of attempts between many division groups (breadth) versus many tries per group (depth).
  • How it works:
    1. Keep k = 1024 attempts fixed.
    2. Vary the number of groups n and attempts per group m so n × m = 1024.
    3. Measure Pass@1024.
  • Why it matters: Shows whether exploring more decompositions helps more than repeating the same plan. 🍞 Anchor: Results show breadth (more groups, fewer tries each) consistently beats the CoT baseline and often outperforms depth-heavy allocations.

Surprising findings:

  • DAC training also boosts CoT inference performance in the mixed setting (Mix-RL), even though you reduce CoT training on hard items. Learning to plan better helps you think better, even when you later use a plain chain.
  • DAC-trained solutions are often shorter and less redundant, yet the model explores more (higher entropy), avoiding the trap of repeating the same thought trails.
  • Enforcing a very strict “you must show every subproblem answer in a fixed format” during RL raises format-following rates but hurts final performance—an alignment tax: being too rigid can reduce problem-solving ability.

Cold-start experiments:

  • Distilling a stronger model’s DAC and CoT outputs into a smaller model (before RL) improves both styles; after RL, DAC gains more than CoT under the same budget, suggesting DAC encourages richer exploration.

Bottom line: Across models, datasets, and settings, training DAC as a first-class skill lifts the ceiling, scales better with more attempts, and even makes step-by-step chains cleaner.

05Discussion & Limitations

🍞 Hook: Think of a Swiss Army knife: super useful, but it still has limits—you wouldn’t cut a tree with its tiny saw.

Limitations:

  • Domain coverage: Experiments focus on math with integer answers and contest-style problems. Other domains (law, science, open-ended writing) need adaptation and robust verifiers.
  • Reward sparsity: Final-answer rewards are sparse. While effective here, tasks without easy correctness checks may require verifiable intermediate targets.
  • Prompt sensitivity: Division relies on parseable tags; if formatting breaks, rewards break. Robust parsing and guardrails are needed.
  • Compute and memory: Training uses many rollouts per item and long contexts; smaller labs need careful budgeting.
  • Subproblem quality: The helpfulness lower bound prevents cheating but doesn’t directly optimize the very best decompositions; future work might score decomposition quality more precisely without inviting reward hacks.

Required resources:

  • RL infra that can sample multiple groups and attempts per problem (e.g., Gd × Gc rollouts) and handle long contexts (8k–32k+ tokens).
  • Reliable answer checkers and extraction for the final answer.
  • Datasets of sufficiently hard problems to reveal the benefits of DAC.

When not to use:

  • Tasks with no clear correctness signal (no ground truth, or extremely fuzzy outputs) where final-answer rewards can’t be computed.
  • Very small models that can’t reliably follow division prompts or hold long contexts.
  • Ultra-low-budget settings where multiple rollouts per item are infeasible.

Open questions:

  • Can we design richer yet safe division rewards that score decomposition quality without causing early-solve cheating?
  • How to extend DAC-RL to multi-hop QA, coding with tests, and science reasoning with tool calls and verifiable checks?
  • Can we learn to generate and reuse libraries of subproblems or subroutines across tasks (meta-DAC)?
  • What are the best breadth–depth allocations for different domains and budgets, and can the model learn to allocate on the fly?
  • How to combine DAC with process-level verifiable rewards and programmatic checkers to reduce sparsity even further?

06Conclusion & Future Work

Three-sentence summary: This paper trains language models to use divide-and-conquer reasoning as a native skill by jointly learning to create subproblems and to solve them with reinforcement learning. A simple final-answer reward steers correct subproblem solving, while light division rewards keep decompositions real and useful, raising the performance ceiling and improving test-time scalability. On tough math benchmarks, this approach beats strong chain-of-thought baselines and even produces shorter, cleaner reasoning traces.

Main achievement: Turning DAC from a test-time trick into a trained competency (DAC-RL) that consistently outperforms CoT and scales better with more attempts.

Future directions:

  • Extend DAC-RL to coding with unit tests, multi-document QA with retrieval, and theorem proving with formal verifiers.
  • Learn adaptive test-time allocation: decide how many decompositions and tries each problem needs.
  • Design richer, safe division rewards and process-level checks that keep planning honest without hurting performance.

Why remember this: When problems are very hard, how you explore matters as much as how you step. Teaching models to plan (divide) and execute (conquer) together—then rewarding the real finish line—unlocks better exploration, higher ceilings, and solutions that stay clear even as tasks get tougher.

Practical Applications

  • Math tutoring systems that automatically break hard problems into learnable steps for students.
  • Coding assistants that decompose a complex bug into smaller diagnostics and fixes with unit tests.
  • Study planners that split big projects into well-scaffolded tasks and then track completion logically.
  • Scientific assistants that design sub-experiments and combine their results into a final conclusion.
  • Legal or policy analysis that separates a case into precedents, statutes, and fact checks before synthesis.
  • Data analysis pipelines that plan sub-queries and aggregations, then assemble a final summary.
  • Operations planning (logistics/scheduling) that decomposes objectives into route, capacity, and timing subproblems.
  • Game/agent planning where the model generates subgoals and then executes them sequentially.
  • Customer support bots that isolate a multi-issue ticket into sub-issues and resolve them in order.
  • Education content generation that creates stepwise problem sets and solutions aligned to learning goals.
#divide-and-conquer reasoning#chain-of-thought#reinforcement learning#GRPO#test-time scalability#Pass@k#subproblem decomposition#policy optimization#alignment tax#exploration diversity#entropy collapse#math reasoning benchmarks#AIME#broad vs deep allocation#final-answer reward