Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Intermediate
Xin Xu, Clive Bai, Kai Yang et al. Ā· 2/12/2026
arXiv

Key Summary

  • This paper shows a simple way to turn many 'too-easy' questions into harder, still-checkable ones so that AI keeps learning instead of stalling.
  • The trick, called Composition-RL, stitches together several normal problems into one bigger, verifiable problem that needs multiple steps.
  • As training goes on, many prompts become 'solveall' (the model always gets them right), which gives no learning signal; composition brings those prompts back to life.
  • A curriculum that gradually increases how many problems you compose (depth) makes models even stronger.
  • Across models from 4B to 30B, Composition-RL beats regular RL on both math and general tests, sometimes by a large margin.
  • Composing across domains (like physics + math) improves both subjects more than just mixing datasets.
  • The method works because it teaches the model to combine skills (compositional generalization) and nudges it to do correct intermediate steps (implicit process supervision).
  • It’s cheap to scale: you reuse your old verified prompts instead of paying for new ones.
  • Dynamic sampling still filters uninformative prompts, but composition lowers the fraction of 'too-easy' ones, so more of the batch becomes useful.
  • Code, datasets, and models are released so others can try Composition-RL.

Why This Research Matters

As models get smarter, many training questions become too easy and stop teaching anything new. Composition-RL breathes new life into those old prompts by combining them into harder, still-checkable tasks, so training stays productive. This saves money and time because you reuse your existing data instead of constantly collecting more. It also builds true reasoning: the model must combine skills and reach correct intermediate steps on the way to the answer. The curriculum lets teams dial up difficulty gradually, preventing plateaus. And by composing across subjects (like physics and math), the method strengthens general problem-solving, not just one narrow area.

Detailed Explanation

01Background & Problem Definition

šŸž Hook: You know how your math homework gets easier the more you practice, and after a while, you stop learning from the super-easy problems? If you only do problems you can already solve in your sleep, you don’t grow.

🄬 The Concept: Reinforcement Learning (RL)

  • What it is: RL is a way to train computers by giving them feedback (rewards) when they do something right, like training a puppy with treats.
  • How it works:
    1. The model tries an answer.
    2. It gets a reward if the answer is good.
    3. It updates itself to do better next time.
  • Why it matters: Without RL, models don’t get practice adjusting their behavior based on success or failure. šŸž Anchor: Think of a kid shooting basketballs: each make is a reward, and they adjust their aim over time.

šŸž Hook: Imagine a quiz where every question has a clear, checkable answer—like a multiple-choice test you can grade with a key.

🄬 The Concept: Reinforcement Learning with Verifiable Rewards (RLVR)

  • What it is: RLVR is RL where the computer gets a reward only if its answer matches a ground-truth answer that can be automatically checked.
  • How it works:
    1. Show the model a prompt (question) with a known correct answer.
    2. Let it try several solutions (rollouts).
    3. Use a verifier to mark right/wrong and give reward.
    4. Update the model using those rewards.
  • Why it matters: Without verifiable rewards, feedback can be noisy or slow; RLVR is fast and scalable because the computer can self-grade. šŸž Anchor: Like an online math platform that instantly tells you if your answer equals 42 and gives you a point if yes.
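
The steps above can be sketched in a few lines. This is a minimal illustration, not the paper's actual verifier; the names `verify` and `rlvr_reward` are hypothetical, and a real system would use a math-aware checker (the paper uses Math-Verify) rather than string comparison.

```python
def verify(model_answer: str, ground_truth: str) -> bool:
    """Toy verifier: normalize whitespace and compare exactly.
    A real RLVR verifier would parse and compare math expressions."""
    return model_answer.strip() == ground_truth.strip()

def rlvr_reward(rollouts: list[str], ground_truth: str) -> list[float]:
    """Reward 1.0 for each verified-correct rollout, 0.0 otherwise."""
    return [1.0 if verify(r, ground_truth) else 0.0 for r in rollouts]

rewards = rlvr_reward(["42", "41", "42"], "42")
# rewards == [1.0, 0.0, 1.0]
```

Because grading is fully automatic, the same loop can score thousands of rollouts per training step.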

šŸž Hook: Picture a classroom where some questions are so easy everyone gets them right, and some are so hard everyone gets them wrong. Neither helps the teacher see who needs help.

🄬 The Concept: Zero-variance prompts (solveall/solvenone)

  • What it is: Prompts where all sampled answers are right (solveall) or all wrong (solvenone), so there’s no learning signal.
  • How it works:
    1. Sample several answers to the same prompt.
    2. If they’re all correct or all incorrect, the ā€œadvantageā€ is zero.
    3. The gradient update becomes zero; nothing changes.
  • Why it matters: Too many zero-variance prompts waste training time because the model cannot learn from them. šŸž Anchor: If every student gets a question right or wrong, the teacher can’t tell what to teach next.
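
The zero-gradient effect is easy to see with a group-baseline advantage (as in GRPO-style training): when every rollout earns the same reward, every advantage is exactly zero. The function name below is illustrative.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Group-baseline advantage: each reward minus the group mean.
    If all rollouts score identically, every advantage is zero,
    so the policy-gradient update for this prompt vanishes."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

assert group_advantages([1.0, 1.0, 1.0]) == [0.0, 0.0, 0.0]  # solveall
assert group_advantages([0.0, 0.0, 0.0]) == [0.0, 0.0, 0.0]  # solvenone
mixed = group_advantages([1.0, 0.0, 1.0])  # mixed results: nonzero signal
```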

šŸž Hook: Think about picking which practice problems to do next based on what still challenges you, not what you’ve already aced.

🄬 The Concept: Dynamic Sampling

  • What it is: A training trick that filters out the zero-variance prompts and keeps the ones that still provide learning signal.
  • How it works:
    1. Over-sample candidate prompts for a batch.
    2. Test them with multiple attempts.
    3. Keep only those with mixed results (some right, some wrong).
  • Why it matters: Without dynamic sampling, most of your training time might be spent on prompts that teach nothing. šŸž Anchor: Like skipping worksheets you always ace or always fail, and focusing on the ones where you get half right.
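
A sketch of that filter, under the simplifying assumption that rewards are 0/1 per rollout (`keep_informative` is an invented name, not the paper's code):

```python
def keep_informative(prompt_rewards: dict[str, list[float]]) -> list[str]:
    """Keep only prompts whose rollouts show mixed results:
    neither all-correct (solveall) nor all-wrong (solvenone)."""
    kept = []
    for prompt, rewards in prompt_rewards.items():
        if 0.0 < sum(rewards) < len(rewards):
            kept.append(prompt)
    return kept

batch = {
    "too_easy": [1.0, 1.0, 1.0],   # filtered: no signal
    "too_hard": [0.0, 0.0, 0.0],   # filtered: no signal
    "useful":   [1.0, 0.0, 1.0],   # kept: mixed results
}
# keep_informative(batch) == ["useful"]
```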

šŸž Hook: When you explain your thinking step-by-step, you usually catch your mistakes earlier.

🄬 The Concept: Chain-of-Thought (CoT)

  • What it is: A method where the model writes out its reasoning steps.
  • How it works:
    1. Encourage step-by-step explanations.
    2. Use those steps to arrive at the final answer.
    3. Longer, clearer chains often mean better reasoning.
  • Why it matters: Without CoT, models may guess; with CoT, they reason more reliably. šŸž Anchor: Like showing your long division, not just the answer.

šŸž Hook: Suppose you can’t buy new puzzle books, but you can combine pages from the ones you have to make brand-new challenges.

🄬 The Concept: Compositional Prompts

  • What it is: New questions built by combining multiple existing questions into one bigger, verifiable question.
  • How it works:
    1. Pick two or more existing prompts with known answers.
    2. Tie them together so solving one helps solve the next.
    3. Keep a final answer that can still be auto-checked.
  • Why it matters: Without composition, your dataset runs out of useful (non-easy) practice as the model improves. šŸž Anchor: Like turning two short math problems into one story problem that needs both parts to finish.

šŸž Hook: You don’t jump from 1st-grade math straight to calculus—you ramp up.

🄬 The Concept: Curriculum Learning

  • What it is: Start training with easier data and gradually increase the difficulty.
  • How it works:
    1. Train on normal prompts (Depth 1).
    2. Switch to composed prompts of Depth 2.
    3. Later, move to Depth 3, and so on.
  • Why it matters: Without a curriculum, training can stall when prompts become too easy. šŸž Anchor: Leveling up in a video game after you master the current level.

The world before this paper: RLVR made training faster by auto-checking answers, but as models got better, many prompts turned into solveall—leaving little to learn from. People tried focusing on super-hard prompts (solvenone), adding hints, or rolling out more samples. But a big hole remained: what about all those easy prompts that used to teach something and now teach nothing? This paper fills that gap by composing multiple solved/easy prompts into new, harder, still-verifiable ones—so your old data keeps paying dividends.

02Core Idea

šŸž Hook: Imagine taking two Lego builds you’ve already finished and snapping them together into a new, cooler machine that actually works.

🄬 The Concept: Composition-RL (the Aha!)

  • What it is: A way to train models by automatically composing existing verifiable prompts into new, multi-step prompts, then running RL on those.
  • How it works:
    1. Pick K existing prompts with known answers.
    2. Use Sequential Prompt Composition (SPC) to link them so solving earlier parts helps later parts.
    3. Keep the final answer verifiable, so RLVR still works.
  • Why it matters: Without Composition-RL, easy prompts pile up and stop teaching; with it, you recycle them into fresh, challenging lessons. šŸž Anchor: Two short algebra tasks become one story problem that needs both answers in sequence.

Three analogies for the same idea:

  • Jigsaw analogy: Combine small puzzles into a bigger picture, where each piece still fits and the final image is checkable.
  • Cooking analogy: Take two simple recipes, define a shared ingredient, and create a new dish that needs both parts done right.
  • Obstacle-course analogy: Finish checkpoint A to unlock a code for checkpoint B; the final buzzer confirms success.

Before vs After:

  • Before: Training on fixed prompts; as the model improves, many prompts become solveall and stop teaching.
  • After: The same pool of prompts, but composed into harder ones; the fraction of useful prompts goes up, so training keeps moving forward.

šŸž Hook: You know how math problems sometimes say ā€œLet X be ā€¦ā€ and later use X to finish the solution?

🄬 The Concept: Sequential Prompt Composition (SPC)

  • What it is: A step-by-step way to stitch prompts together so the result is still verifiable.
  • How it works:
    1. Modify Prompt 1 with its ground-truth answer by defining a variable X (e.g., ā€œLet X be the sum you foundā€).
    2. Modify Prompt 2 by replacing one number with a variable Y.
    3. Connect them with a relation (e.g., ā€œY is 6 less than Xā€).
    4. The final composed prompt keeps a clear, checkable answer.
  • Why it matters: Without SPC, composition could break verifiability; SPC preserves a clean final check. šŸž Anchor: First solve an equation to get X = 7; then solve a second expression using Y = X āˆ’ 6 = 1; the final numeric answer is still easy to verify.

šŸž Hook: Think of adding more links to a chain as you get stronger.

🄬 The Concept: Compositional Depth (K)

  • What it is: How many prompts you combine in sequence.
  • How it works:
    1. Depth 1: normal (original) prompt.
    2. Depth 2: combine two prompts.
    3. Depth 3: combine three, and so on.
  • Why it matters: Without depth, you can’t steadily raise difficulty in a controlled way. šŸž Anchor: Depth 3 means solve A to get X, use X to solve B to get Y, then use Y to solve C.
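
Depth-K chaining can be modeled as function composition: each sub-problem consumes the previous answer, so a wrong intermediate step breaks everything downstream. This is a toy sketch using the article's running numbers, not the paper's implementation.

```python
def solve_chain(steps, seed=None):
    """Apply K solver steps in sequence; each step takes the
    previous step's value. Depth K = len(steps)."""
    value = seed
    for step in steps:
        value = step(value)
    return value

# Depth 3: A yields X = 7; B uses X to get Y = X - 6 = 1;
# C uses Y for the final answer. Only the last value is graded.
a = lambda _: 7        # solve prompt A -> X
b = lambda x: x - 6    # solve prompt B using X -> Y
c = lambda y: 3 * y    # solve prompt C using Y -> final answer
assert solve_chain([a, b, c]) == 3
```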

Why it works (intuition):

  • Compositional generalization: The model must recombine known skills in new ways, which builds real reasoning ability.
  • Implicit process supervision: Even though rewards check only the final answer, the structure nudges the model to produce the right intermediate values (like X first), guiding the thinking process.
  • Data efficiency: Instead of buying new prompts, you transform old ones into many fresh, informative examples.

Building blocks:

  • Verified seeds: Use prompts with known answers.
  • SPC rules: Define variables, swap constants, connect relations.
  • Filtering: Auto-check composed prompts to weed out mistakes.
  • RLVR training: Same training recipe, richer data.
  • Curriculum: Start with Depth 1, then Depth 2, then Depth 3 to keep learning going.

03Methodology

At a high level: Original Prompts → Compose with SPC (Depth K) → Verify/Filter → RLVR Training (with dynamic sampling) → Stronger Model.

Step A: Gather inputs (verifiable seeds)

  • What happens: Collect a set of prompts with ground-truth answers (e.g., MATH12K). Ensure there’s a verifier that can auto-check answers.
  • Why this step exists: RLVR needs a clear reward signal; no ground truth means no automatic grading.
  • Example: A simple algebra equation with answer 7 and a simplification problem with answer 13p āˆ’ 30.

šŸž Hook: Like defining a nickname for a number you already found so you can use it later. 🄬 The Concept: Define a variable from Prompt 1 (SPC Step 1)

  • What it is: Turn Prompt 1’s true answer into a named variable (e.g., ā€œLet X be ā€¦ā€).
  • How it works:
    1. Solve Prompt 1 using its ground truth.
    2. Extract a numeric value (e.g., 7) and define X to equal that value in words tied to the problem.
    3. Append that definition to Prompt 1’s text.
  • Why it matters: Without a named X, you can’t link Prompt 1 to later prompts cleanly. šŸž Anchor: ā€œLet X be the sum of the values of n satisfying |2n āˆ’ 7| = 3.ā€ If the sum is 7, then X = 7.

šŸž Hook: Imagine replacing a specific ingredient in a recipe with a to-be-announced substitute. 🄬 The Concept: Replace a constant with a variable in Prompt 2 (SPC Step 2)

  • What it is: Choose a number in Prompt 2 and swap it for a new variable (Y).
  • How it works:
    1. Identify a constant (like the 1 in 5p + 1).
    2. Replace it by Y to make the result depend on Y.
    3. Keep the final expression’s answer still numeric once Y is set.
  • Why it matters: Without this swap, Prompts 1 and 2 don’t interact. šŸž Anchor: Change 3(5p + 1 āˆ’ 2pĀ·4) into 3(5p + Y āˆ’ 2pĀ·4), so the final simplification depends on Y.

šŸž Hook: To make a relay race, you need to pass the baton from Runner 1 to Runner 2. 🄬 The Concept: Connect the prompts (SPC Step 3)

  • What it is: Add a sentence linking X and Y (e.g., ā€œY is 6 less than Xā€).
  • How it works:
    1. Compute the relation between the chosen values (like X āˆ’ Y = 6).
    2. Write a clean natural-language constraint.
    3. Concatenate: [Prompt 1 + definition of X] + [relation] + [Prompt 2 with Y].
  • Why it matters: Without a link, the composition isn’t a real multi-step task. šŸž Anchor: If X = 7, then Y = 1; now Prompt 2 simplifies to the same final numeric answer as before, still verifiable.
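
Putting the three SPC steps together, the composed prompt is just the concatenation described above. The helper name `compose_spc` and the exact prompt wording are illustrative; the paper automates these edits with an LLM rather than templates.

```python
def compose_spc(prompt1: str, x_def: str, relation: str, prompt2_with_y: str) -> str:
    """[Prompt 1 + definition of X] + [relation linking X and Y]
    + [Prompt 2 with a constant replaced by Y]."""
    return " ".join([prompt1, x_def, relation, prompt2_with_y])

composed = compose_spc(
    "Find all n with |2n - 7| = 3.",
    "Let X be the sum of those values of n.",          # Step 1: define X
    "Y is 6 less than X.",                             # Step 3: link X and Y
    "Simplify 3(5p + Y - 2p*4) and give the result.",  # Step 2: constant -> Y
)
# With X = 7, the relation fixes Y = 1, so the composed prompt
# still has a single auto-checkable final answer.
```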

Quality Gate: Verification and filtering

  • What happens: Automatically check each composition stage using an LLM and simple rules; discard broken ones (e.g., inconsistent variable names, mismatched values).
  • Why this step exists: Automation can introduce small errors; filtering keeps the data clean.
  • Example: Verify that when you plug X’s definition back into Prompt 1, you really recover X = 7; verify that swapping 1 for Y in Prompt 2 still makes sense.
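
One cheap rule-based check in that gate is to confirm the stated relation actually reconciles the two ground truths. A hedged sketch (the function name and tolerance are invented):

```python
def consistent(x_true: float, y_true: float, offset: float) -> bool:
    """Check that the relation 'Y is offset less than X' holds
    for the composed prompt's ground-truth values."""
    return abs((x_true - offset) - y_true) < 1e-9

assert consistent(7, 1, 6)      # X = 7, Y = 1, "Y is 6 less than X": keep
assert not consistent(7, 2, 6)  # values don't reconcile: discard
```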

šŸž Hook: Don’t practice the spelling words you already mastered every time; pick the ones that still challenge you. 🄬 The Concept: Dynamic Sampling during RL

  • What it is: Before each update, gather more candidate prompts than needed, test them, and keep only the ones with mixed results.
  • How it works:
    1. Over-sample prompts for the batch.
    2. Run multiple rollouts per prompt.
    3. Keep prompts that aren’t all-right or all-wrong.
  • Why it matters: Without it, too many zero-variance prompts clog training. šŸž Anchor: A training step keeps prompts where the model got, say, 5/8 right—not 8/8 or 0/8.

Training Recipe (unchanged core, richer data)

  • Inputs: The composed dataset (e.g., MATH-Composition-199K) plus the verifier (Math-Verify).
  • Rollouts: Multiple attempts per prompt, long output budget for reasoned solutions.
  • Objective: Standard RLVR (policy gradient form) with group-based normalization; use the same optimizer and learning rate as baseline.
  • Why it matters: You get better learning without inventing a new trainer; the data does the heavy lifting.

šŸž Hook: Coaches turn up the difficulty as the team improves so practice never gets stale. 🄬 The Concept: Curriculum over Depth K

  • What it is: Train on Depth 1 (original prompts), then continue on Depth 2, then Depth 3.
  • How it works:
    1. Train on original MATH12K until progress slows and solveall rises.
    2. Switch to composed Depth 2 prompts to drop solveall and unlock new gains.
    3. Optionally, move to Depth 3 for more improvements.
  • Why it matters: Without this ramp, training may plateau early. šŸž Anchor: The 4B model reached 37.9% on AIME24 after going Depth 1 → 2 → 3, beating some 8B baselines trained on more data.
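
The curriculum logic reduces to a simple trigger: advance to the next depth whenever the solveall ratio crosses a threshold. The threshold value and function name below are illustrative, not taken from the paper.

```python
def next_depth(depth: int, solveall_ratio: float, threshold: float = 0.5) -> int:
    """Advance compositional depth K once too many prompts
    in the current pool have become solveall."""
    return depth + 1 if solveall_ratio > threshold else depth

depth = 1
for ratio in [0.2, 0.4, 0.6, 0.3, 0.55]:  # solveall ratio over training stages
    depth = next_depth(depth, ratio)
assert depth == 3  # two threshold crossings: Depth 1 -> 2 -> 3
```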

Secret Sauce

  • Recycle wins: Turn easy prompts back into useful training via composition.
  • Built-in scaffolding: The X→Y structure encourages correct intermediate reasoning without extra labels.
  • Scale for free: K = 2 alone can nearly double usable combinations; cross-domain pairs (physics + math) broaden skills.
  • Plays nice with RLVR: The final answer stays verifiable, so rewards remain simple and fast.

04Experiments & Results

The Test: What they measured and why

  • pass@1 (and averages over multiple generations): How often the model gets the right answer when sampling a few tries—this captures practical accuracy.
  • solveall ratio: The share of prompts the model gets entirely right in rollouts—too high means the batch gives no learning signal.
  • In-domain vs out-of-domain: Math-focused tests (AIME24/25, BeyondAIME, IMOBench) and general reasoning (GPQA, MMLU-Pro) check both specialized and broad gains.
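
For concreteness, pass@1 averaged over multiple generations is just mean per-sample correctness, averaged across the benchmark. A minimal sketch (names are illustrative):

```python
def pass_at_1(per_question_rewards: list[list[float]]) -> float:
    """Average 0/1 correctness over each question's sampled
    generations, then average across all questions."""
    per_q = [sum(r) / len(r) for r in per_question_rewards]
    return sum(per_q) / len(per_q)

# Question 1: right on 2 of 4 samples (0.5); question 2: 4 of 4 (1.0).
score = pass_at_1([[1, 0, 1, 0], [1, 1, 1, 1]])
# score == 0.75
```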

The Competition: Baselines

  • Regular RLVR on original MATH12K.
  • RL-zero style methods (for context on the curriculum results): Beyond-80/20, AlphaRL, RL-ZVP (8B models on larger data).
  • Cross-domain: Mix Training (math + physics together) and Math-then-Physics (sequential fine-tuning).

The Scoreboard (with context)

  • Composition-RL vs original RL (all model sizes):
    • Math overall gains: +3.6% (4B), +4.8% (8B), +6.1% (14B), up to +14.3% (30B MoE). Think of moving from a steady B to an A–, and for 30B, from a B– to a solid A.
    • Hard math benchmarks: AIME24 up to +21.4%, AIME25 up to +14.1%, BeyondAIME up to +12.0%, IMOBench up to +9.6%.
    • Out-of-domain gains: +0.7% to +2.9% depending on size—small but consistent, like extra credit adding to your overall grade.
  • Curriculum (Depth 1 → 2 → 3, 4B model):
    • Each switch lowered solveall and unlocked more learning.
    • Final AIME24 = 37.9%, beating several 8B baselines trained on more data—like the underdog team winning with smart drills.
  • Cross-domain (physics + math composition) vs mixing:
    • Composed Physics–Math beats both Mix Training and Math-then-Physics by clear margins on math and general tasks (e.g., +4.3% on MMLU-Pro vs MATH-only; +7.1% on AIME24 vs Math-then-Physics).
    • Mixing helps a bit, sequential helps more, but composition helps most—like combining two sports into one obstacle course that makes you fitter for both.

Surprising/Notable Findings

  • Solveall rises fast during normal RL (to ~75%), shrinking the effective batch to roughly a quarter—training stalls unless you change the data.
  • Even K = 2 composition nearly doubles potential combinations and drops solveall dramatically (e.g., from 81.5% to 41.4%).
  • Bigger models benefit more: improvements climbed with model size, peaking at the 30B MoE—suggesting composition scales well.
  • Composition seems to teach intermediate steps implicitly: accuracy at recovering the X variable rose alongside final-answer accuracy.

Takeaway: The method is simple—compose, filter, and train—but the effect is strong: more useful batches, better math scores, and broader generalization, especially when depth is increased over time.

05Discussion & Limitations

Limitations

  • Domain coverage: The verifier works great for math; in physics, they had to filter out items where rule-based verification failed. Broader domains may need better verifiers.
  • Composition quality: The LLM-based composition pipeline is robust but not perfect; a small error rate (<2% after filtering) remains.
  • Answer diversity: If you only vary the first prompt’s answer a little, you might overexpose the model to a narrow set of final answers unless you sample widely.
  • Structure bias: SPC’s linear chain (X → Y) is powerful but simple; some tasks need trees or graphs of dependencies.

Required Resources

  • A base model (4B–30B in the paper) and standard RLVR infrastructure (rollouts, verifier, dynamic sampling).
  • Computation to generate and filter compositional prompts (tens to hundreds of thousands) and to run RL for several stages.
  • A reliable, preferably fast, verifier per domain.

When NOT to Use

  • If your domain lacks a workable automatic verifier, RLVR (and thus Composition-RL) becomes hard to apply.
  • When you already have abundant fresh, challenging prompts: composition gives less marginal benefit.
  • If prompts are tightly coupled to fragile formats (e.g., strict templates) that break under text edits.

Open Questions

  • Can we extend beyond chains to DAG-style compositions that better mirror real multi-skill reasoning?
  • How to design cross-domain relations (not just numeric offsets) that still allow easy final verification?
  • Can smarter samplers pick which prompts to compose to maximize learning signal per batch?
  • How far does the curriculum scale—Depth 4 or 5—before returns diminish?
  • Can composition be combined with process-level rewards for even stronger guidance?

06Conclusion & Future Work

Three-Sentence Summary

  • Composition-RL turns old, easy, verifiable prompts into new, harder, still-verifiable ones by composing them, so RL training keeps learning instead of stalling.
  • A simple SPC recipe (define X from one prompt, swap Y into another, connect them) plus a depth curriculum boosts math and general reasoning across model sizes.
  • It scales cheaply, cuts solveall rates, improves pass@1 on tough benchmarks, and even beats larger baselines in some cases.

Main Achievement

  • Showing that composing prompts—rather than collecting new ones—consistently strengthens RLVR training and enables effective cross-domain learning.

Future Directions

  • Compose from larger and more varied seeds (e.g., harder math sets) and more domains.
  • Explore richer graph-shaped compositions and better verifiers for science and real-world tasks.
  • Combine composition with on-policy distillation or process supervision for extra gains.

Why Remember This

  • It’s an elegant, low-cost way to recycle your data for sustained learning.
  • It teaches models to combine skills and nudges them toward correct intermediate steps without extra labels.
  • As models get better and easy prompts pile up, composition keeps the training signal alive.

Practical Applications

  • Extend the lifespan of an RLVR dataset by auto-composing Depth-2 prompts whenever solveall rises.
  • Use a depth curriculum (1 → 2 → 3) to keep batches informative and avoid early training plateaus.
  • Build cross-domain compositions (e.g., physics + math) to improve transfer and multi-task performance.
  • Pair composition with dynamic sampling so that each batch contains fewer zero-variance prompts.
  • Filter composed prompts with automated self-checks to maintain high data quality at scale.
  • Retrofit existing RL pipelines (GRPO, verifiers) without changing the core trainer—just swap in composed data.
  • Boost smaller models with smart data: apply Composition-RL to beat larger baselines trained on more raw prompts.
  • Generate diverse final answers by sampling the second prompt from the full dataset to widen coverage.
  • Track solveall over time and trigger composition when it exceeds a threshold (e.g., >50%).
  • Compose prompts across increasing depths during fine-tuning sprints to harvest quick performance gains.
#Reinforcement Learning with Verifiable Rewards#Compositional prompts#Sequential Prompt Composition#Curriculum learning#Dynamic sampling#Zero-variance prompts#Math reasoning#Cross-domain composition#Implicit process supervision#Compositional generalization#AIME#MMLU-Pro#GPQA#RLVR datasets#Verifiers