Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
Key Summary
- Large language models are usually trained to get good at one kind of reasoning, but real life needs them to be good at many things at once.
- Standard GRPO training across multiple tasks lets easy tasks hog the progress while hard tasks are left behind.
- This paper introduces MT-GRPO, which gives more practice to weaker tasks and checks that the practice really follows that plan.
- It uses improvement-aware task weights that rise for tasks that are both low-scoring and not improving much.
- A ratio-preserving sampler balances training batches even when some prompts give no learning signal (zero-gradient).
- On 3 tasks, MT-GRPO improves worst-task accuracy by 16–28% over standard GRPO and by 6% over DAPO, while keeping average accuracy competitive.
- It also reaches 50% worst-task accuracy with 50% fewer steps in the 3-task setting, showing faster reliability gains.
- On 9 tasks, a single knob (lambda) lets you trade a bit of average score for much better worst-case reliability.
- The method is simple to plug into GRPO pipelines and makes multi-task reasoning more balanced and dependable.
Why This Research Matters
Real assistants need balanced skills, not just a high average that hides weak spots. MT-GRPO raises the weakest skill without dragging down the rest, making models more trustworthy in everyday use. This helps in settings like tutoring (logic plus math), coding (reasoning plus testing), and safety checks (no blind spots). It also speeds up reaching reliability thresholds, saving time and compute. Because it plugs into common GRPO pipelines with a single trade-off knob, it's practical for many teams. Over time, this approach can make AI systems more dependable across a wide range of tasks and domains.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're studying math, science, and writing. If you only practice what you're already good at, your weak subjects never catch up. That might look okay on average, but it fails when you actually need all of them.
The Concept (Policy Gradient):
- What it is: A way for an AI to learn by nudging itself toward actions that brought better rewards.
- How it works:
- Try something (like writing a solution).
- Get a reward (right/wrong, well-formatted or not).
- Push the model's choices slightly toward the ones that did better.
- Why it matters: Without this feedback loop, the model doesn't know which choices to make more often. Anchor: Like trying different strategies on a puzzle and favoring the ones that got you closer to the answer.
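To make the feedback loop concrete, here is a tiny, self-contained Python sketch of a REINFORCE-style policy gradient on a toy two-action problem. It illustrates the idea only; the learning rate, reward probabilities, and setup are made up and are not the paper's GRPO objective.

```python
# REINFORCE-style policy gradient on a toy two-action problem (illustration of
# the concept only; hyperparameters and rewards are made up).
import math
import random

logits = [0.0, 0.0]          # the policy's preference for action 0 and action 1
success_prob = [0.2, 0.8]    # hidden chance each action earns a reward of 1
lr = 0.1

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(2000):
    probs = softmax(logits)
    action = 0 if random.random() < probs[0] else 1                   # try something
    reward = 1.0 if random.random() < success_prob[action] else 0.0   # get a reward
    for i in range(len(logits)):                                      # nudge toward what worked
        grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * reward * grad_log_prob

print(softmax(logits))  # probability mass shifts toward the better action (action 1)
```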
Hook: You know how teachers give stars for correct answers and neat work?
The Concept (Task-Level Rewards):
- What it is: Points the AI earns per task (e.g., math or logic) for correct and well-formatted answers.
- How it works:
- For each task, compare outputs to ground-truth answers.
- Add bonus for correct formatting.
- Average these scores to track task progress.
- Why it matters: Without clear points per task, we can't tell which skills are strong or weak. Anchor: A math sheet where each correct problem is 1 point and neat handwriting gets a small bonus.
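As a rough Python sketch of this scoring (the exact rules are dataset-specific; the 0.1 format bonus here is just an assumed example value):

```python
# Per-task reward: 1 point for a correct answer plus a small format bonus,
# averaged over a batch. Scoring details vary by dataset; this is illustrative.
def task_reward(samples, format_bonus=0.1):
    """samples: list of (is_correct, is_well_formatted) pairs for one task."""
    scores = [float(correct) + format_bonus * float(formatted)
              for correct, formatted in samples]
    return sum(scores) / len(scores)

math_batch = [(True, True), (False, True), (True, False)]
print(round(task_reward(math_batch), 2))  # (1.1 + 0.1 + 1.0) / 3 = 0.73
```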
Hook: Think of grading a group project by comparing all drafts and picking which ones seem better within that group.
The Concept (GRPO):
- What it is: A training method (Group-Relative Policy Optimization) that compares multiple answers to the same prompt and prefers the relatively better ones.
- How it works:
- For a prompt, generate several candidate answers.
- Score them and see which are above the group's average.
- Push the model to prefer the better-than-average ones.
- Why it matters: It removes the need for a value network and uses fair relative comparisons. Anchor: Like a mini contest where all students' answers to the same question are compared, and the best ones guide the class.
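The "better than the group's average" idea boils down to normalizing rewards within each group of answers to the same prompt. A minimal sketch (real GRPO implementations typically also clip policy ratios during the update, which is omitted here):

```python
# Group-relative advantages: score several answers to the same prompt, then
# normalize within the group so above-average answers get positive advantages.
import statistics

def group_advantages(rewards, eps=1e-6):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [+1, -1, +1, -1]
```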
Hook: Imagine some quiz questions are so easy (everyone always gets them right) or so hard (everyone gets them wrong) that you learn nothing new from them.
The Concept (Zero-Gradient Prompts):
- What it is: Prompts where all sampled answers score the same, so the model gets no signal to improve.
- How it works:
- Generate several answers.
- If they all get identical scores, the relative advantage is zero.
- No gradient means no learning from that prompt.
- Why it matters: If one task has lots of these, it silently contributes less to learning, even if we wanted to prioritize it. Anchor: A practice question that's either a guaranteed A or a guaranteed F: no hint about how to get better.
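Spotting such prompts is simple in code (a sketch; pipelines check this on each sampled group's rewards before the update):

```python
# A prompt is "zero-gradient" when every sampled answer gets the same score:
# the group-relative advantages are all zero, so nothing is learned from it.
def is_zero_gradient(group_rewards):
    return max(group_rewards) == min(group_rewards)

print(is_zero_gradient([1.0, 1.0, 1.0]))  # True: everyone right, no signal
print(is_zero_gradient([0.0, 0.0, 0.0]))  # True: everyone wrong, no signal
print(is_zero_gradient([1.0, 0.0, 1.0]))  # False: mixed results, useful signal
```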
The world before: RL post-training (like GRPO) made LLMs much better at single tasks such as math or coding. But real-world assistants must juggle many skills: planning (Countdown), logic (Zebra puzzles), and inductive reasoning (ARC). When people tried to train on all tasks at once using average performance, easy tasks dominated, and hard tasks stagnated. Worse, some tasks had many zero-gradient prompts, so even if you gave them higher sampling weight, they still didn't "speak up" in the gradients.
The problem: How do we train one model across diverse tasks so that the weakest task is good enough, not just the average?
Failed attempts:
- Uniform sampling (plain multi-task GRPO): easy tasks hog progress; hard tasks fall behind.
- Curriculum sampling (e.g., SEC): it helps average performance but can still misallocate effort, and doesn't fix zero-gradient imbalance.
- Classic robust weighting from other areas: GRPO's loss can look the same when everything is perfect or when everything fails, so it's not a reliable signal for reweighting.
The gap: We need a training rule that (1) explicitly prioritizes the worst or slowest-improving tasks and (2) makes sure the batch actually reflects those priorities despite zero-gradient prompts.
Real stakes: In daily life, a helper bot shouldn't be able to solve tricky math yet fail at basic logic instructions. In code assistants, great algorithms but bad reasoning about test cases is risky. For safety checks, you want the weakest check to be solid. Reliability across skills is what makes these systems trustworthy.
02 Core Idea
Hook: Think of a coach who looks at each player's score and also how quickly they're improving, then gives more practice time to the ones who need it most, and makes sure the practice plan is actually followed during drills.
The Concept (Task Reweighting):
- What it is: Dynamically changing how often each task is practiced.
- How it works:
- Measure each task's reward (how good it is now).
- Measure improvement (how much it just got better).
- Increase weight for tasks that are weak and not improving much; decrease for strong or fast-improving ones.
- Why it matters: Without reweighting, easy tasks can hog training and hard tasks stay weak. Anchor: Studying more spelling if your last few quizzes didn't improve, and less if you're already acing them.
Hook: Imagine progress charts for every subject; even if a subject's score is low, if it's shooting up, maybe you can focus on another subject that's stuck.
The Concept (Improvement Signals):
- What it is: A per-task "how much did we improve this step?" score.
- How it works:
- Compute task reward before and after an update.
- Subtract to get the improvement.
- Use it with reward to decide future task weights.
- Why it matters: Reward-only weighting can tunnel on the same worst task and ignore others; improvement-aware weighting prevents collapse. Anchor: If your logic puzzle score rose a lot this week, your tutor shifts time to math, which has been flat.
Hook: A recipe only tastes right if you keep the right ingredient ratios; sampling tasks for training is the same.
The Concept (Ratio-Preserving Sampler):
- What it is: A sampler that ensures the training batch really has the target mix of tasks after filtering out unhelpful prompts.
- How it works:
- Decide desired post-filtered counts per task from the learned weights.
- Oversample tasks likely to be filtered (acceptance-aware).
- Resample until post-filtered batch matches target ratios.
- Why it matters: Without this, tasks with many zero-gradient prompts get underrepresented, breaking the plan. Anchor: If strawberries bruise easily and some get thrown out, buy extra so you still have the right amount for your fruit salad.
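A minimal Python sketch of the acceptance-aware oversampling step (the function name, inflation cap, and counts are illustrative assumptions, not the paper's exact procedure):

```python
# Acceptance-aware oversampling: if a task's prompts are often filtered out as
# zero-gradient, request proportionally more so the post-filter batch still
# matches the target mix. The inflation cap keeps compute bounded.
import math

def prompts_to_request(target_counts, acceptance_rates, inflation_cap=4.0):
    requests = {}
    for task, target in target_counts.items():
        accept = max(acceptance_rates[task], 1e-3)   # avoid divide-by-zero
        inflation = min(1.0 / accept, inflation_cap)
        requests[task] = math.ceil(target * inflation)
    return requests

targets = {"countdown": 32, "zebra": 32, "arc": 64}        # desired post-filter counts
acceptance = {"countdown": 0.9, "zebra": 0.8, "arc": 0.4}  # 1 - filter rate, per task
print(prompts_to_request(targets, acceptance))  # ARC gets oversampled the most
```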
Hook: Now put it all together: a coach who schedules practice based on current scores and progress, and a team manager who ensures the right players actually show up on the field.
The Concept (MT-GRPO):
- What it is: A new training loop that blends improvement-aware task weighting with ratio-preserving sampling on top of GRPO.
- How it works:
- Update the model using GRPO on a batch built to match task weights.
- Measure task rewards and improvements.
- Adjust weights to help the weakest and slowest-improving tasks.
- Repeat.
- Why it matters: It raises the floor (worst-task score) without tanking the overall average. Anchor: The team's weakest player improves fast, and the whole team still wins games.
Aha! moment in one sentence: If you want balanced multi-task reasoning, prioritize tasks that are both weak and not improving, and enforce that priority in the actual training batches.
Three analogies:
- School schedule: Spend more study time on subjects that are both hard and stuck; also make sure those subjects truly get time on your daily planner (even if some worksheets end up unusable).
- Cooking: Adjust ingredient amounts to fix bland parts of a dish and account for waste so the final plate keeps perfect ratios.
- Sports coaching: Give drills to the athletes who most need them, and double-check attendance so the plan matches practice.
Before vs. After:
- Before: Average-focused training lets successes on easy tasks hide failures on hard ones; zero-gradient tasks shrink their voice.
- After: Weights shift toward the struggling tasks, and the sampler guarantees their voice is heard, lifting worst-task accuracy while keeping averages strong.
Why it works (intuition): Rewards show where we stand, improvements show where we're moving. Combining them avoids over-focusing on a single worst task and instead balances progress. Then, by preserving ratios after filtering, the gradients truly reflect the plan.
Building blocks:
- Task-Level Rewards (where we stand)
- Improvement Signals (how weāre moving)
- Improvement-Aware Weight Updater (who needs time next)
- Ratio-Preserving Sampler with acceptance-aware oversampling (make the batch match the plan)
- GRPO core (stable, relative scoring per prompt)
- A trade-off knob (lambda) to balance worst-case robustness vs. average performance
03 Methodology
High-level pipeline: Inputs (multi-task datasets, current model) → Improvement-aware task weights → Ratio-preserving batch sampler → GRPO policy update → Measure per-task reward and improvement → Update weights → Repeat.
Step-by-step (with Sandwich explanations at first use):
- Measure where we are (Task-Level Rewards) Hook: Like checking your grades in each subject before planning next week's study. The Concept:
- What: Per-task scores that reflect correctness and formatting.
- How:
- For each task, evaluate a batch of prompts.
- Score answers (1 for correct, small bonus for correct format, else 0 as per dataset rules).
- Average to get task reward.
- Why: We need a clear scoreboard to know who needs help. Anchor: Math = 60%, Logic = 40%, Patterns = 30% this week.
- Measure how we're moving (Improvement Signals) Hook: You don't just look at your grade; you check whether it went up or down since last week. The Concept:
- What: The change in each task's reward after an update.
- How:
- Save last step's reward per task.
- Update the model.
- Recompute rewards and subtract.
- Why: A low score that's rising fast may need less urgent help than a low score that's flat. Anchor: Logic jumped from 40% to 50% (good trend), while Patterns stayed at 30% (needs attention).
- Decide who gets more practice (Improvement-Aware Task Reweighting) Hook: If two subjects are both hard, focus on the one that's stuck, not the one already taking off. The Concept:
- What: A rule that raises weights for tasks that are weak and not improving, and lowers them for strong or fast-improving tasks.
- How:
- Combine improvement (I) and reward (J) into a signal: s = I + λ·J.
- Compare each task's s to the weighted average.
- Increase weights for below-average s; decrease for above-average s.
- Why: Prevents training from collapsing onto one worst task forever; encourages balanced progress (a code sketch of this update follows the step list below). Anchor: If Patterns is low and flat, it gets more slots in the next study schedule.
- Make the batch match the plan (Ratio-Preserving Sampler) Hook: If some worksheets end up useless, bring extra so your study time still matches the schedule. The Concept (Zero-Gradient Prompts + RP Sampler):
- What: Some prompts give no learning signal; the sampler accounts for this so post-filtered batches keep the target task ratios.
- How:
- Predict acceptance rates (1 - filter rate) per task.
- Oversample tasks with high filter rates (acceptance-aware).
- Resample as needed so accepted samples hit the target counts per task.
- Why: Without this, tasks with many zero-gradient prompts would be underrepresented even if we planned to prioritize them. Anchor: If ARC often yields no signal, oversample it so the final batch still has the intended amount of ARC.
- Learn from the batch (GRPO Update) Hook: In a mini contest, pick the best answers among a group and learn to prefer those. The Concept:
- What: Use GRPO on the constructed batch to push the model toward better-than-average answers per prompt.
- How:
- For each prompt, generate several answers.
- Score and normalize within the group.
- Nudge the model toward answers that beat the group average.
- Why: Stable, relative comparisons make improvements without needing a value model. Anchor: For a Zebra puzzle prompt, favor the candidate reasoning chain that most often leads to the correct solution.
- Close the loop
- After the GRPO step, recompute rewards and improvements and update task weights again.
- The loop repeats, steadily lifting the weakest skills while keeping overall performance strong.
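To make the reweighting rule (step 3 above) concrete, here is a minimal sketch assuming a normalized multiplicative update with a step size eta. The exact update rule, step size, and normalization in the paper may differ, and the function name is ours.

```python
# Improvement-aware task reweighting (sketch): combine improvement I and reward J
# into s = I + lambda * J, then move weight toward tasks whose s is below the
# weighted average. The multiplicative update and eta are illustrative assumptions.
import math

def update_task_weights(weights, rewards, prev_rewards, lam=1.0, eta=0.5):
    tasks = list(weights)
    improvement = {t: rewards[t] - prev_rewards[t] for t in tasks}    # I
    signal = {t: improvement[t] + lam * rewards[t] for t in tasks}    # s = I + lambda*J
    weighted_avg = sum(weights[t] * signal[t] for t in tasks)
    # Below-average signal -> weight rises; above-average -> weight falls.
    raw = {t: weights[t] * math.exp(-eta * (signal[t] - weighted_avg)) for t in tasks}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}
```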
Concrete mini example:
- Suppose Countdown=80% (improving), Zebra=55% (flat), ARC=30% (flat, many zero-gradient prompts).
- Weight updater increases ARC and Zebra weights more than Countdown.
- RP sampler oversamples ARC to counter high filtering so post-filtered batch has the planned share of ARC.
- GRPO update uses this balanced batch; next step shows ARC rising to 36% and Zebra to 58%.
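Plugging the mini example into the update_task_weights sketch above (the previous-step rewards here are made-up values chosen to match "improving" vs. "flat"):

```python
# Continuing the reweighting sketch above with the mini example's numbers.
weights = {"countdown": 1 / 3, "zebra": 1 / 3, "arc": 1 / 3}
rewards = {"countdown": 0.80, "zebra": 0.55, "arc": 0.30}   # current per-task rewards
prev = {"countdown": 0.72, "zebra": 0.55, "arc": 0.30}      # assumed previous step

print(update_task_weights(weights, rewards, prev, lam=1.0))
# Countdown's weight drops (high and improving); ARC's rises the most (low and
# flat), so the next batch gives ARC and Zebra more slots, as described above.
```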
Secret sauce:
- The combo: (a) improvement-aware weighting keeps focus where progress is lacking, and (b) ratio-preserving sampling guarantees that focus shows up in gradients even with zero-gradient prompts. Together, they raise the worst-case task without tanking the average.
Implementation notes (plain-English):
- One knob (lambda) controls how hard you push worst-task robustness vs. average performance.
- Track per-task filter rates to guide oversampling.
- Use modest oversampling and capped inflation to keep compute reasonable.
- Update weights smoothly to avoid wild swings; the improvement term helps stabilize.
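One simple way to track per-task filter rates smoothly is an exponential moving average of each task's accepted fraction per batch (a sketch; the smoothing factor beta is an assumed hyperparameter, not taken from the paper):

```python
# Smoothed tracking of per-task acceptance rates (1 - filter rate) to feed the
# acceptance-aware oversampling step; beta controls how quickly estimates move.
def update_acceptance(ema, observed, beta=0.9):
    return {t: beta * ema[t] + (1 - beta) * observed[t] for t in ema}

ema = {"countdown": 0.90, "zebra": 0.80, "arc": 0.50}        # running estimates
observed = {"countdown": 0.95, "zebra": 0.75, "arc": 0.30}   # this batch's accepted fraction
print(update_acceptance(ema, observed))
```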
04 Experiments & Results
The test: Can MT-GRPO boost the weakest task while keeping the overall average solid, and can it do so efficiently?
Setups:
- Tasks: Countdown (planning), Zebra (logic), ARC (inductive reasoning), each with easy/medium/hard variants.
- Models: Qwen-2.5-3B base.
- Baselines: GRPO, SEC-GRPO (curriculum), DAPO (strong RL baseline), SEC-DAPO.
- Metrics: Worst-task accuracy (the floor), average accuracy (overall), and average per-task relative change (fairness-weighted improvement).
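One plausible way to compute these three metrics in code (a sketch; the paper's exact definition of the per-task relative change may differ, and the accuracies below are placeholders):

```python
# Worst-task accuracy (the floor), average accuracy, and mean per-task relative
# change versus a baseline. Numbers are placeholders for illustration.
def worst_task(acc):
    return min(acc.values())

def average(acc):
    return sum(acc.values()) / len(acc)

def avg_relative_change(acc, baseline):
    return sum((acc[t] - baseline[t]) / baseline[t] for t in acc) / len(acc)

model = {"countdown": 0.82, "zebra": 0.58, "arc": 0.36}
baseline = {"countdown": 0.80, "zebra": 0.55, "arc": 0.30}
print(worst_task(model), round(average(model), 3), round(avg_relative_change(model, baseline), 3))
```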
Experiment 1: Controlled 3-task setting (Countdown, Zebra, ARC; medium difficulty)
- Scoreboard: MT-GRPO improves worst-task accuracy by 16–28% over standard GRPO and beats DAPO by 6%, while keeping average accuracy competitive.
- Meaning: That's like raising the class's lowest grade from a D to a solid C/B without lowering the class average.
- Efficiency: MT-GRPO reaches 50% worst-task accuracy with 50% fewer steps than baselines, yielding faster reliability gains.
- Dynamics: As Countdown gets strong, MT-GRPO shifts weight to Zebra and ARC; baselines often keep feeding Countdown, yielding smaller gains where help is needed most.
- Ratio preservation: ARC had many zero-gradient prompts. Without RP sampling, ARC would be underrepresented despite higher weight. With RP sampling, actual batch shares match the plan, unlocking ARC improvements.
Experiment 2: 9-task setting (easy/medium/hard for each of Countdown, Zebra, ARC)
- Trade-off knob (lambda): Higher lambda consistently lifts worst-task accuracy (e.g., +16% over GRPO, +6% over DAPO at lambda=1.2) but trims a bit of average accuracy, an explicit, tunable trade.
- Difficulty trends: Smaller lambda favors balanced improvement (more gains on hard tasks, sometimes smaller or negative changes on easy ones), raising fairness-style metrics. Larger lambda concentrates on the single weakest (often Zebra-hard), maximizing the floor.
- Takeaway: You can steer between "highest floor" and "best overall average" depending on your deployment goals.
Surprises and insights:
- Zero-gradient imbalance is big: ARC frequently produced zero-gradient prompts; without acceptance-aware oversampling, progress stalled no matter the weights.
- Weight collapse avoided: Reward-only reweighting tends to fixate on the current worst task; the improvement-aware term avoids this, distributing help where it's most needed.
- Faster robustness: Beyond better final scores, MT-GRPO reaches reliability thresholds earlier, which is useful for limited training budgets.
Overall: Across both small and large settings, MT-GRPO reliably boosts the weakest task and keeps the average competitive. The method's two core ideas, improvement-aware weights and ratio-preserving sampling, are both crucial to these gains.
05 Discussion & Limitations
Limitations:
- Rewarded tasks only: The approach needs clear, verifiable rewards per task (correctness/format). It's not designed for fuzzy goals without a good reward signal.
- Compute and sampling overhead: Ratio-preserving, acceptance-aware resampling adds generation and filtering work, especially for tasks with many zero-gradient prompts.
- Hyperparameter tuning: The lambda trade-off needs tuning per use case; too high can over-focus on the worst task, too low can favor average performance.
- Extreme zero-gradient regimes: If a task yields almost no informative prompts despite oversampling, progress will still be slow.
- Interference remains possible: While weighting helps, conflicting gradients across very different tasks can still cause some negative transfer.
Required resources:
- A base LLM, multi-task datasets with automatic grading, and RL post-training infrastructure (GRPO-style implementation, rollout generation, filtering statistics).
- Enough compute to support oversampling and a few resampling rounds per batch.
When not to use:
- Single-task specialization: If you only care about one task, standard GRPO/DAPO may be simpler and faster.
- Noisy or unverifiable rewards: If correctness can't be judged reliably, the signals driving weights and sampling won't be stable.
- Tiny data regimes without diversity: If each task has too few prompts, the weight and ratio estimates will be unstable.
Open questions:
- Smarter acceptance prediction: Can we learn to predict zero-gradient likelihood per prompt to cut resampling cost further?
- Cross-task transfer: How to better exploit positive transfer while guarding against interference in this RL setting?
- Beyond accuracy: Can we extend robustness to other criteria (latency, safety, or verbosity) with multi-objective reward shaping?
- Theoretical guarantees: Stronger convergence and robustness guarantees with on-policy, clipped GRPO in multi-task, non-stationary settings.
- Generalization: How well do balanced gains transfer to unseen but related tasks or domains?
06 Conclusion & Future Work
Three-sentence summary: MT-GRPO trains one model to handle many reasoning tasks by giving more attention to tasks that are both weak and not improving, and then ensuring the training batch truly reflects that plan. It couples improvement-aware task reweighting with a ratio-preserving, acceptance-aware sampler on top of GRPO. This raises the worst-task performance while keeping the average strong and speeds up reaching reliability thresholds.
Main achievement: Making multi-task RL post-training reliably lift the weakest task by aligning planned emphasis (weights) with realized gradients (ratio-preserving sampling) and stabilizing weights with improvement signals.
Future directions: Reduce sampling overhead with better acceptance prediction, explore multi-objective robustness (e.g., safety plus accuracy), enhance positive transfer across tasks, and extend to larger, more diverse benchmarks and modalities.
Why remember this: Real assistants must be good at many things; MT-GRPO shows a practical, plug-in way to build balanced competence, turning an average "A-minus" that hides a "D" into a report card where every subject is solid.
Practical Applications
- Build AI tutors that improve both math and logic evenly so students don't develop hidden gaps.
- Train coding assistants to balance algorithmic skill with test-case reasoning and error handling.
- Develop customer support bots that are reliable across intents: troubleshooting, returns, and policy explanations.
- Enhance safety pipelines by lifting the weakest check (e.g., privacy or toxicity) without harming others.
- Prepare general-purpose agents that handle planning, deduction, and pattern recognition in balanced ways.
- Create fair multi-domain chatbots by prioritizing underperforming domains while keeping overall quality high.
- Speed up reaching minimum reliability bars during RL post-training for constrained training budgets.
- Improve benchmark suites where a single failing task blocks deployment (raise the floor quickly).
- Stabilize multi-task fine-tuning for small models by enforcing practice ratios and balanced gains.
- Apply to multimodal settings (text+vision) to prevent one modality from dominating training.