Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning
Key Summary
- This paper teaches a model to make its own helpful hints (sub-questions) and then use those hints to learn better with reinforcement learning that checks answers automatically.
- First, a "decomposer" model is trained (without any teacher model) to break a hard question into a few smaller sub-questions using rewards that check both quality and format.
- Those sub-questions are attached to each training question, giving extra guidance that helps the main "reasoner" explore smarter instead of guessing blindly.
- A special in-context distillation loss (IDL) lets the reasoner learn from hint-guided successes but practice answering without hints, so the skill becomes internal.
- The method is plug-and-play: it works with different RL-with-verifiable-reward (RLVR) algorithms like GRPO, RLOO, and REINFORCE++.
- Across eight math benchmarks, the approach improves accuracy and Pass@k, especially on harder tests where plain RLVR plateaus.
- Coarse, high-level hints boost exploration and Pass@k; fine, step-by-step hints boost exploitation and Pass@1, so hint granularity matters.
- Unlike teacher-distilled methods, this approach avoids the cost and ceiling of copying a stronger model, enabling self-improvement from a single base model.
- Ablations show both the format reward and the Pass@k-style quality reward are important for training a reliable decomposer.
- The system encourages smarter searching without trapping the model into over-relying on hints, thanks to an adaptive trigger and careful selection of training examples.
Why This Research Matters
This approach helps models learn to think better without relying on expensive teacher models, lowering costs and removing ceilings on performance. By nudging exploration with short, high-level hints, the model finds good ideas faster and keeps improving on tough, multi-step problems. The adaptive loss makes sure the model doesn't become dependent on hints, so it stays strong at test time when no hints are provided. Better exploration means better tools for math learning, coding help, scientific analysis, and complex decision support. Because it is plug-and-play with popular RL algorithms, teams can adopt it quickly in existing pipelines. The clear evidence across diverse benchmarks suggests the gains are robust, especially where plain RL stalls.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're faced with a giant jigsaw puzzle. If you just grab random pieces and hope they fit, you'll waste time and get frustrated. But if someone first sorts the pieces into edges, corners, and colors, you finish much faster.
🥬 Filling (The Actual Concept)
- What it is: This paper tackles how to help large language models (LLMs) get better at step-by-step reasoning by giving them just enough guidance to explore smartly instead of guessing.
- How it works (story of the field):
- You know how: Reinforcement Learning with Verifiable Rewards (RLVR) is like a quiz game where the model tries answers, and a checker says "right" or "wrong." That makes models practice reasoning and learn from experience.
- The problem: The checker only says correct/incorrect. That's very little information. On hard problems, the model explores almost blindly, often fails, and then becomes cautious, which stalls progress.
- Failed attempts: People tried clever reward tweaks (shaping, advantage tricks) and also tried borrowing hints or full solutions from big "teacher" models. These either still left exploration too blind or made training depend on expensive teachers, which also caps student performance at the teacher's level.
- The missing piece: Give the model more information while training, without any external teacher, and do it in a way that boosts exploration instead of spoon-feeding full solutions.
- Why it matters: Smarter exploration means models improve longer, handle tougher multi-step tasks, and stop plateauing early. That affects tools we use daily: math helpers, study assistants, coding copilots, and more.
🍞 Bottom Bread (Anchor) Think of a student doing word problems. If you give them a few guiding questions like "What's being asked?" and "Which formula helps?", they find the path. They still solve the problem, but the hints stop them from wandering aimlessly. That's what this paper designs for LLMs.
New Concept Sandwiches (introduced in order)
- Reinforcement Learning with Verifiable Rewards (RLVR) 🍞 You know how in a quiz app, you get points only if your final answer is right? 🥬 RLVR is training where a model tries to solve questions and gets a reward if the final answer matches the ground truth. How it works: (1) Model generates an answer; (2) A verifier checks if it's correct; (3) The model updates itself to make correct answers more likely next time. Why it matters: Without RLVR, the model won't practice multi-step reasoning with real feedback. 🍞 Anchor: When a model solves "What is 12×13?" and says "156," it gets a thumbs-up; if not, thumbs-down.
- Large Reasoning Model (LRM) 🍞 Imagine a student who not only answers but also double-checks and fixes mistakes. 🥬 An LRM is an LLM trained (often by RLVR) to do complex reasoning behaviors like reflecting, verifying, and correcting. How: Practice many problems with feedback until behaviors like "check steps" become natural. Why: Without reasoning skills, the model guesses or gives shallow answers. 🍞 Anchor: When asked a geometry problem, an LRM plans steps, checks results, and corrects if needed.
- Compositional Generalization 🍞 Picture building a castle from familiar Lego pieces in a new way. 🥬 It's the ability to combine simple known skills to solve new, harder problems. How: Recall small facts, link them, and create a multi-step solution. Why: Without it, models know pieces but can't assemble them for complex tasks. 🍞 Anchor: Knowing the area of a rectangle and a triangle, then combining both to find the area of a house-shaped figure.
- Decomposer 🍞 Think of a teacher who turns one hard question into a few bite-sized questions. 🥬 The decomposer is a model trained to break a complex question into simpler sub-questions. How: It learns, with rewards, to output a short list of helpful sub-questions in the right format. Why: Without good sub-questions, the reasoner keeps wandering in the dark. 🍞 Anchor: For a set problem, the decomposer asks "How many do at least one activity?" and "Apply inclusion-exclusion," guiding the path.
- Sub-question Guidance 🍞 Imagine following a treasure map with checkpoints. 🥬 Sub-question guidance is adding those decomposed sub-questions to a prompt to steer reasoning. How: Attach the list to the original question; the reasoner follows these hints while exploring. Why: Without guidance, exploration is random and slow; with too much detail, exploration stops. 🍞 Anchor: "First find the total who do at least one; then subtract overlaps" helps reach the correct count.
- In-Context Distillation Loss (IDL) 🍞 Think of practicing with training wheels, then riding without them, while your balance memory stays. 🥬 IDL teaches from hint-guided successes while training the model to answer without hints. How: (1) Use hints to find correct solutions; (2) Select the good ones; (3) Train the model to produce those solutions when only given the question. Why: Without IDL, the model may depend on hints and fail at test time without them. 🍞 Anchor: A student solves with a checklist during practice but takes the test confidently without it.
- Pass@k 🍞 You know how trying multiple times increases your chance of success? 🥬 Pass@k is the chance that at least one of k tries is correct. How: Generate k solutions per question and see if any are right. Why: It measures exploration: can the model find a correct path among several? 🍞 Anchor: If you try 5 different solution paths and one hits the right answer, that's a Pass@5 success. (A small estimator sketch appears right after this list.)
- Exploration vs. Exploitation 🍞 Imagine trying new routes to school (exploration) vs. taking your fastest known route (exploitation). 🥬 Exploration looks for new, better strategies; exploitation uses the best one you already have. How: Balance both so you learn new paths but still arrive on time. Why: All-exploration wastes time; all-exploitation gets stuck. 🍞 Anchor: Sometimes you try a side street; other times you stick to the highway.
- Plug-and-Play Module 🍞 Think of a USB device that works with many computers. 🥬 A plug-and-play method fits different RLVR algorithms without changes. How: It adds sub-question guidance and IDL alongside GRPO, RLOO, or REINFORCE++. Why: Flexibility means broader, easier adoption. 🍞 Anchor: The same charger works for multiple outlets with an adapter.
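To make the Pass@k metric above concrete, here is a minimal sketch in Python (as promised in the Pass@k entry). It uses the standard unbiased estimator from the code-generation evaluation literature: sample n attempts, count the c correct ones, and estimate the probability that at least one of k draws is correct. The rollout counts in the usage lines are illustrative, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k attempts,
    drawn from n sampled attempts of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 32 rollouts per question, 4 verified correct.
print(pass_at_k(n=32, c=4, k=1))  # 0.125 -> single-try (exploitation) view
print(pass_at_k(n=32, c=4, k=8))  # ~0.70 -> multi-try (exploration) view
```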
02 Core Idea
🍞 Top Bread (Hook) You know how coaches don't always give you the answer; they give just enough clues so you can discover it yourself and truly learn? That's the magic trick here.
🥬 Filling (The Actual Concept)
- One-sentence "Aha!": Train a model to create its own helpful sub-questions (hints) and then use those hints, carefully and sparingly, to make reinforcement learning exploration effective, with no teacher model needed.
- Multiple analogies:
- Hiking guideposts: Instead of a full tour guide (teacher), you place signposts (sub-questions) on the trail so hikers (the reasoner) can explore safely and still find the summit.
- Recipe outline: Instead of giving the whole recipe, you list the key steps ("mix," "bake," "cool"). The chef still cooks but avoids common mistakes.
- Lego blueprint: You don't hand over a finished castle; you show a few main modular chunks to build first. The builder completes the rest.
- Before vs. After: • Before: RLVR often plateaued on hard tasks because the model got only "right/wrong" and explored blindly. Many teams imported teacher-written plans, which were costly and limiting. • After: The model first learns to write decent sub-questions itself (the decomposer). Those are used to nudge the main model (the reasoner) during RLVR. A special training loss (IDL) turns hint-guided wins into independent skill. Result: better accuracy and stronger, longer-lasting improvement.
- Why it works (intuition):
- Extra information raises the "signal" in training, so the model wastes fewer tries.
- Hints are high-level and format-checked, which encourages exploration without overfitting to one path.
- IDL ensures the reasoner doesn't grow dependent on hints; it practices answering from the question alone.
- An adaptive trigger only applies IDL when the model is struggling, so easy problems remain a playground for self-exploration.
- Building blocks (mini-sandwiches for new terms): • Decomposer (already explained): makes short, helpful sub-questions. • Quality Reward and Format Reward: The decomposer is rewarded if its hints (a) follow the required tags and (b) help a proxy model get a correct answer at least once across several tries (Pass@k style). Without this, the hints might be messy or unhelpful. • Proxy Reasoner: a stand-in solver used only to test whether the hints increase the chance of success. Without it, you can't score hint usefulness. • IDL (already explained): converts success with hints into skill without hints. • Adaptive trigger and selection: IDL turns on only when the model does poorly on a question; positive examples are capped and diversified to avoid overfitting.
🍞 Bottom Bread (Anchor) Example: For a contest math problem, the decomposer writes two tips: "Find how many do at least one activity" and "Apply inclusion-exclusion." The reasoner, guided by these, reaches a correct solution once among multiple tries. Then IDL teaches the reasoner to solve the same kind of question next time without tips. Over time, the model stops guessing and starts planning.
03 Methodology
🍞 Top Bread (Hook) Imagine building a treehouse. First, you sketch a simple plan (decomposer). Next, you label where the beams go (annotated sub-questions). Finally, you practice building until you can do it from memory (reasoner + IDL).
🥬 Filling (The Actual Concept)
- High-level pipeline: Input question → (A) Train a Decomposer via RLVR → (B) Use it to add sub-questions to each training item → (C) Train a Reasoner with RLVR + adaptive IDL → Output: a stronger reasoner that explores smarter and answers without hints.
Step A: Train the Decomposer via RLVR
- What happens: The decomposer is asked to output a short, properly tagged list of sub-questions. Two rewards judge it: (1) Format Reward (did it use the right tags and non-empty content?) and (2) Quality Reward (did the hints help a proxy reasoner get at least one correct answer across multiple rollouts, like Pass@k?). The final reward is the product: it only pays if both are good (a small code sketch of this product reward follows the example below).
- Why this step exists: Without format checks, hints become messy to parse. Without quality checks, hints may be irrelevant. Together, they make hints both usable and useful.
- Example: Question about clubs at school. A good decomposer outputs: "<subquestion>How many students do at least one activity?</subquestion> <subquestion>Apply inclusion-exclusion to find both.</subquestion>" The proxy reasoner uses these tips and gets one correct answer out of 32 tries, so the decomposer is rewarded.
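Below is a minimal sketch, in Python, of how the two-part decomposer reward described in this step could be computed. The `<subquestion>` tags and the 32-rollout budget come from the example above; the `proxy_solve` and `check_answer` callables, the prompt wording, and the early-exit shortcut are illustrative assumptions, not the paper's implementation.

```python
import re
from typing import Callable, List

TAG = re.compile(r"<subquestion>(.*?)</subquestion>", re.DOTALL)

def format_reward(decomposer_output: str) -> float:
    """1.0 if the output is a non-empty list of well-tagged sub-questions."""
    subs = TAG.findall(decomposer_output)
    return 1.0 if subs and all(s.strip() for s in subs) else 0.0

def quality_reward(question: str, subs: List[str], gold: str,
                   proxy_solve: Callable[[str], str],
                   check_answer: Callable[[str, str], bool],
                   n_rollouts: int = 32) -> float:
    """Pass@k-style check: 1.0 if the proxy reasoner, prompted with the
    hints, answers correctly at least once across n_rollouts samples."""
    prompt = question + "\nGuiding sub-questions:\n" + "\n".join(subs)
    return 1.0 if any(check_answer(proxy_solve(prompt), gold)
                      for _ in range(n_rollouts)) else 0.0

def decomposer_reward(question: str, output: str, gold: str,
                      proxy_solve, check_answer) -> float:
    """Final reward is the product: hints must be well-formed AND useful."""
    fmt = format_reward(output)
    if fmt == 0.0:
        return 0.0  # skip the expensive rollouts when the format is broken
    return fmt * quality_reward(question, TAG.findall(output), gold,
                                proxy_solve, check_answer)
```

Because the reward is a product, a perfectly formatted but unhelpful hint list earns nothing, and vice versa, which is what blocks the reward hacking mentioned later under "The Secret Sauce."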
Step B: Annotate the Training Set with Sub-questions
- What happens: For each training question, the trained decomposer generates a brief list of sub-questions, which we store alongside the question and the true answer.
- Why this step exists: We need consistent, reusable guidance so the reasoner can learn from hint-guided exploration on the same dataset it trains on.
- Example: For geometry, sub-questions might be "List known angles" and "Use triangle sum." They're short (~2 sub-questions, ~60 tokens on average) and don't leak answers. (A small sketch of this annotation pass follows this step.)
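Here is a minimal sketch of the annotation pass, assuming the trained decomposer is wrapped as a simple question-to-sub-questions function; the `TrainingItem` fields and the hard-coded stand-in decomposer are illustrative, not the paper's data schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TrainingItem:
    question: str
    answer: str
    sub_questions: List[str] = field(default_factory=list)

def annotate_dataset(items: List[TrainingItem],
                     decompose: Callable[[str], List[str]]) -> List[TrainingItem]:
    """Attach decomposer-generated sub-questions to every training item so the
    reasoner's hint-guided rollouts can reuse them during RLVR training."""
    for item in items:
        item.sub_questions = decompose(item.question)
    return items

# Toy usage with a hard-coded "decomposer" standing in for the trained model.
data = [TrainingItem("A triangle has angles x, 2x, and 3x. Find x.", "30")]
annotate_dataset(data, lambda q: ["List known angles", "Use the triangle angle sum"])
print(data[0].sub_questions)
```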
Step C: Train the Reasoner with RLVR + Adaptive IDL
- What happens (two tracks run per question):
- Vanilla RLVR track: The reasoner generates several answers from the question alone; a verifier scores them. This keeps exploration alive.
- Hint-guided track: The reasoner also generates several answers using the sub-questions. Some of these are correct. From the correct ones, we select a small, diverse set. Then we apply IDL to train the model to produce these good solutions from the plain question (no hints in the input for IDL). Importantly, we turn IDL on only if the average reward from the vanilla track is below a threshold, meaning the model struggled (see the sketch after the worked example below).
- Why this step exists: Without IDL, the model might over-rely on hints. Without the adaptive trigger, it might use hints even when it doesnāt need them, which weakens exploration. Without selection and diversity prompts, it could overfit to near-duplicate traces.
- Example with data: Suppose out of 32 vanilla tries, average success is very low. We trigger IDL. Among 32 hint-guided tries, 4 are correct. We keep at most a capped portion (say up to 30% of rollouts), ensuring variety. We then train the reasoner to recreate those 4 successes when only given the question. Next time, it solves similar problems without hints.
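Here is a minimal sketch of the trigger-and-select logic just described: IDL fires only when the hint-free track is struggling, keeps a capped, deduplicated subset of the hint-guided successes, and pairs each kept solution with the plain question so the model learns to reproduce it without hints. The threshold, cap, and deduplication shortcut are assumptions for illustration; the paper's exact hyperparameters and diversity criterion may differ.

```python
from typing import Callable, List, Tuple

def select_idl_targets(question: str,
                       vanilla_rewards: List[float],
                       hint_rollouts: List[str],
                       verify: Callable[[str, str], bool],
                       threshold: float = 0.1,
                       cap_frac: float = 0.3) -> List[Tuple[str, str]]:
    """Return (plain_question, solution) pairs to distil via IDL,
    or an empty list if the adaptive trigger does not fire."""
    # Adaptive trigger: only distil when the hint-free rollouts did poorly.
    if sum(vanilla_rewards) / len(vanilla_rewards) >= threshold:
        return []  # the model is doing fine on its own; keep exploring

    # Keep only verified-correct hint-guided rollouts, deduplicated as a
    # crude stand-in for the paper's diversity-aware selection.
    correct = list(dict.fromkeys(y for y in hint_rollouts if verify(question, y)))

    cap = max(1, int(cap_frac * len(hint_rollouts)))
    selected = correct[:cap]

    # Each pair becomes a supervised target: the IDL term raises the
    # log-probability of `solution` given the question ALONE (no hints).
    return [(question, solution) for solution in selected]
```

Note that the returned pairs condition only on the question; that is what keeps the distilled skill usable at test time, when no hints are available.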
The Secret Sauce
- Two-part reward for the decomposer prevents reward hacking: hints must be both well-formed and demonstrably helpful.
- Coarse-grained hints encourage exploration: they nudge the model toward useful regions without dictating a single path.
- Adaptive IDL turns successes into internal skills only when needed, so the model won't become dependent on hints (a hedged sketch of the combined objective follows this list).
- Plug-and-play design means you can pair this with GRPO, RLOO, or REINFORCE++.
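One plausible way to write the combined objective this modular design implies (as referenced in the adaptive-IDL bullet above), with λ an assumed weighting coefficient and τ the struggle threshold; the paper's exact formulation may differ:

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{RLVR}}(\theta)
  + \lambda \,\mathbb{1}\!\left[\bar{r}_{\text{vanilla}} < \tau\right]
    \mathcal{L}_{\text{IDL}}(\theta),
\qquad
\mathcal{L}_{\text{IDL}}(\theta)
  = -\frac{1}{|\mathcal{S}|} \sum_{(x,\,y) \in \mathcal{S}} \log \pi_\theta(y \mid x)
```

Here the RLVR term is whichever base algorithm is plugged in (GRPO, RLOO, or REINFORCE++), the set S holds the selected hint-guided correct solutions y paired with the plain question x, and r̄_vanilla is the average verifier reward of the hint-free rollouts.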
🍞 Bottom Bread (Anchor) Think of a soccer coach: (A) teach players how to set up plays (decomposer); (B) write those plays on a whiteboard during practice (annotated hints); (C) in scrimmage, use plays only if the team is stuck; after they score, run the play again without the whiteboard until they can do it from memory (adaptive IDL).
04 Experiments & Results
🍞 Top Bread (Hook) Picture a class taking eight different math quizzes. Some students get a tiny nudge, just a few guiding questions, and suddenly more of them get A's, especially on the toughest quizzes.
🥬 Filling (The Actual Concept)
- The Test: The authors evaluated on eight math benchmarks (AIME24, AIME25, AMC23, BeyondAIME, MATH500, Minerva, OlymMATH-Easy, OlymMATH-Hard). They measured both single-try accuracy (Pass@1) and multi-try success (Pass@k), which shows how well the model explores.
- The Competition: • SFT baselines: supervised fine-tuning with chain-of-thought (CoT), with or without sub-questions. • RLVR baselines: vanilla GRPO and teacher-guided methods like LUFFY and Scaf-GRPO.
- The Scoreboard (contextualized): • On Qwen2.5-7B-Instruct, adding the paper's Adaptive Ability Decomposing approach improved average accuracy beyond vanilla GRPO and also beyond teacher-based methods. For example, average accuracy rose to about 26.5%, which is like moving from a solid C to a B on challenging math tests, while most others stayed in the lower 20s. • On smaller or different families (e.g., LLaMA3.2-3B, LLaMA3.1-8B, Qwen2.5-Math-7B), the method still gave consistent gains over their respective baselines. This shows robustness across sizes and families. • Plug-and-play: Combining the method with RLOO and REINFORCE++ also improved results compared to those algorithms alone, especially on the harder benchmarks like AIME25 and BeyondAIME, where plain RLVR often stagnated.
- Surprising Findings:
- Simply adding decomposer sub-questions to SFT data did not help and sometimes hurt. Why? Decomposing is a different skill from solving; also, hints may miss edge cases. Blindly learning them can mislead.
- Adding sub-questions to prompts at training and test time gave a fast early boost but later reduced exploration, slowing further gains. In contrast, the method's adaptive use of hints preserved exploration and improved long-term performance.
- Coarse hints vs. fine hints: Coarse, high-level hints increased Pass@k (better exploration), while fine, step-like hints improved Pass@1 (better exploitation). This confirms the intuition about hint granularity.
- Decomposer Analysis (Ablations): • Removing the format reward or switching to a Pass@1-only quality reward degraded the decomposer's usefulness. • With both the Format and Pass@k-style Quality rewards, the decomposer produced short, clean, answer-free sub-questions (~2 per problem, ~60 tokens) that helped the reasoner search.
🍞 Bottom Bread (Anchor) Think of an orchestra: with a few well-placed cues from the conductor (the decomposer's sub-questions), the musicians (the reasoner) stay together on difficult pieces and perform better over time, especially in the most demanding concerts.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even great maps can mislead if trails wash out or if hikers follow the signs too literally.
🥬 Filling (The Actual Concept)
- Limitations (be specific):
- Dependence on hint quality: If the decomposer makes vague or misdirected sub-questions, learning may slow or drift.
- Not all tasks decompose cleanly: Some problems (e.g., ones hinging on a single hidden trick) may benefit less from high-level hints.
- Compute cost: Training needs multiple rollouts for both vanilla and hint-guided tracks plus a proxy reasoner to score hint quality.
- Over-reliance risk: Although adaptive IDL reduces it, very frequent hint triggering or poor selection could still bias the model.
- Required Resources: A verifier for rewards, enough compute for multi-rollout RLVR, and storage for annotated sub-questions. Compatible RLVR frameworks (e.g., GRPO, RLOO, REINFORCE++).
- When NOT to Use: • If you already have excellent teacher data and must match the teacher exactly. • If tasks are extremely simple (hints add overhead) or extremely non-decomposable. • In ultra-low-latency settings where extra rollouts are unacceptable.
- Open Questions:
- Co-evolution: How should hint style and granularity adapt as the reasoner improves?
- Online learning: Can the decomposer-reasoner pair update from live user feedback safely?
- Beyond math: How well does sub-question guidance transfer to code, science QA, or long-horizon agents?
- Automatic granularity control: Can we detect when to use coarse vs. fine hints to target Pass@k vs. Pass@1 gains?
🍞 Bottom Bread (Anchor) Like training wheels that you raise slowly as the rider gains balance, the best systems might automatically adjust hint granularity and frequency as the model grows more skilled.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Think of learning to solve tough puzzles with just the right nudge: enough to keep you moving, not so much that you stop thinking.
🥬 Filling (The Actual Concept)
- 3-Sentence Summary: This paper introduces Adaptive Ability Decomposing, a way to train a decomposer that writes sub-questions and then to use those hints, adaptively, to make RL-with-verifiable-reward training far more effective. A special in-context distillation loss turns hint-guided wins into independent skills, so the model performs without hints at test time. The method is plug-and-play across RLVR algorithms and delivers consistent gains on hard math benchmarks.
- Main Achievement: Showing that teacher-free, coarse sub-question guidance, paired with adaptive IDL, breaks past RLVR's blind exploration, improving both Pass@1 and Pass@k on challenging tasks.
- Future Directions: Co-evolving hint style with reasoner skill, online learning from real feedback, smarter granularity control (coarse vs. fine hints), and extending beyond math to coding and multi-hop knowledge tasks.
- Why Remember This: It's a practical recipe for giving models just enough guidance to keep exploring and improving, without the cost or ceiling of a teacher model.
🍞 Bottom Bread (Anchor) In short: teach the model to write its own helpful checklists, use them wisely during practice, and learn to perform confidently without the checklist on test day.
Practical Applications
- Math tutoring systems that guide students with short, high-level questions instead of full solutions.
- Coding assistants that propose helpful checkpoints (e.g., write tests, isolate bug scope) to debug complex issues.
- Scientific assistants that outline investigation steps (e.g., define variables, choose methods) to analyze experiments.
- Data analysis copilots that decompose multi-hop queries (e.g., filter, group, compare) for business intelligence.
- Legal or policy assistants that map a complex question into sub-issues (facts, precedent, constraints) before drafting.
- Customer support bots that break down a user's problem (environment, steps tried, error type) to resolve faster.
- Planning agents that set milestone sub-goals for long-horizon tasks (research plan, resource list, timeline).
- Education platforms that adapt hint granularity to the learner's level, boosting independence over time.
- Autograding tools that use sub-questions to verify reasoning paths, not just final answers.
- Research workflows where models generate structured exploration plans before running costly simulations.