
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Intermediate
Shih-Yang Liu, Xin Dong, Ximing Lu et al. · 1/8/2026
arXiv · PDF

Key Summary

  • When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.
  • GDPO fixes this by normalizing each reward separately first, then adding them together, so the model can still tell small but important differences between answers.
  • After summing the per-reward advantages, GDPO adds a final batch-wise normalization step to keep updates stable no matter how many rewards you use.
  • Across tool-calling, math reasoning with length limits, and coding with bug checks, GDPO trains more stably and reaches better scores than GRPO.
  • In tool-calling, GDPO improved both accuracy and the percentage of outputs in the correct format compared with GRPO.
  • In math, GDPO reduced overlong responses by up to about 80% on AIME while also improving accuracy on multiple benchmarks.
  • In coding, GDPO balanced three goals at once—passing tests, staying within length limits, and avoiding bugs—better than GRPO.
  • Just lowering a reward’s weight often doesn’t change the model’s priorities when one reward is much easier; conditioning easier rewards on harder ones works better.
  • GDPO is simple to drop into existing GRPO-style pipelines and makes multi-reward alignment more precise and stable.
  • This matters for real products: one model can be taught to be accurate, safe, concise, and well-formatted at the same time without training crashes.

Why This Research Matters

Real products need AI that is not just accurate but also safe, concise, and well-formatted—and often all at once. If training squashes these goals into a blurry signal, models learn the wrong lessons and users get unreliable behavior. GDPO keeps each goal’s voice clear during training, so the model can balance them properly. That means better tool use (think function calling that actually follows the template), math solvers that stay within length limits without losing accuracy, and coding assistants that pass tests without crashing. It also reduces training instability, saving time and compute. In short, GDPO helps build trustworthy assistants that meet real-world expectations.

Detailed Explanation


01 Background & Problem Definition

You know how when you learn a new sport, your coach might care about many things at once—speed, teamwork, and safety? If the coach only looked at a single, mixed-up score, you wouldn’t know what to improve. That’s what was happening with many AI training setups.

🍞 Top Bread (Hook) Imagine you’re grading a school project with different parts: writing quality, neatness, and following instructions. If you only gave one averaged grade, you’d lose which part needs work.

🥬 Filling (The Actual Concept: Reinforcement Learning)

  • What it is: Reinforcement Learning (RL) is a way to teach models by trying actions and getting rewards, like points.
  • How it works:
    1. The model tries something (an answer or action).
    2. It gets rewards based on how good that try was.
    3. It uses those rewards to make better tries next time.
  • Why it matters: Without rewards, the model can’t tell what helped it improve.

🍞 Bottom Bread (Anchor) A chatbot answers a math problem. If it’s correct, it gets a point; if it rambles too long, it might lose a point. Over time, it learns to be correct and concise.
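
To make this try-score-update loop concrete, here is a tiny, self-contained sketch (my illustration, not code from the paper). The toy "policy" is just the probability of answering concisely, the reward function is a hypothetical stand-in that pays 1 point for a concise answer and 0 for a rambling one, and the update rule is a simplified nudge toward above-baseline choices.

```python
import random

# Toy "policy": the probability of producing a concise answer vs. a rambling one.
# Purely illustrative; a real policy is a language model, not a single number.
p_concise = 0.5
learning_rate = 0.05

def reward(choice: str) -> float:
    """Hypothetical reward: 1 point for a concise answer, 0 for a rambling one."""
    return 1.0 if choice == "concise" else 0.0

for step in range(200):
    choice = "concise" if random.random() < p_concise else "rambling"  # 1. try something
    r = reward(choice)                       # 2. get a reward for the try
    baseline = p_concise                     # expected reward under the current policy
    advantage = r - baseline                 # how much better than expected this try was
    # 3. nudge the policy toward choices that beat the baseline
    if choice == "concise":
        p_concise += learning_rate * advantage
    else:
        p_concise -= learning_rate * advantage
    p_concise = min(max(p_concise, 0.01), 0.99)

print(f"P(concise) after training: {p_concise:.2f}")  # climbs toward the 0.99 cap
```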

The world before this paper: Big language models got good at answering questions, but people also wanted them to be safe, short when needed, well-formatted, unbiased, and accurate. That means multiple rewards at the same time—like a teacher grading different subjects.

The problem: A common training method called GRPO took all those rewards, added them up, and then normalized (scaled) them inside each question’s group of sampled answers. This sounds fine, but it often crushed different reward combinations into the same “advantage” (the signal that says how much to push the model’s probabilities). If two answers differ in important ways—like one is both correct and well-formatted (2 points) and another is only correct (1 point)—GRPO could still give them the same push compared to their peers. That blurs what really matters.

🍞 Top Bread (Hook) You know how if two students score 92 and 99, and you round both to “A,” you can’t tell who did better or where to improve?

🥬 Filling (The Actual Concept: Advantage Estimation)

  • What it is: Advantage estimation measures how much better an answer is than average for that question.
  • How it works:
    1. Score each answer with rewards.
    2. Compare each answer’s score to the group’s average.
    3. Use that difference as the “push” to make similar answers more likely next time.
  • Why it matters: Without it, the model can’t tell which answer in a set deserves the biggest boost.

🍞 Bottom Bread (Anchor) If three answers get scores 0, 1, and 2 on a question, advantage estimation should push the “2” answer up the most, the “1” a bit, and the “0” down.
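
A minimal sketch of that computation (my illustration; the exact normalization details are an assumption based on common GRPO-style practice): each answer's reward is compared with the group mean and scaled by the group standard deviation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Three sampled answers for one question, scored 0, 1, and 2.
print(group_advantages([0.0, 1.0, 2.0]))
# -> approximately [-1.22, 0.00, 1.22]: the "2" answer gets the biggest positive
#    push, the "1" sits at the group average, and the "0" is pushed down.
```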

Failed attempts: Some works removed the standard deviation in GRPO’s normalization to create more distinct signals. It helped a tiny bit but still merged many different reward mixtures into too few buckets. Training could still wobble or fail.

The gap: We needed a way to keep the details from each reward—format, correctness, length, bugs—from getting mixed into one blurry number.

🍞 Top Bread (Hook) Imagine separate report cards for math, science, and art, and only then making an overall average.

🥬 Filling (The Actual Concept: Multi-reward Learning)

  • What it is: Multi-reward learning teaches a model using several reward signals at once (e.g., accuracy, safety, format, length).
  • How it works:
    1. Design a reward for each goal.
    2. Score each answer with all those rewards.
    3. Combine them to guide learning.
  • Why it matters: Without separating goals, the model can’t balance them well.

🍞 Bottom Bread (Anchor) A coding model gets one reward for passing tests, one for short solutions, and one for having no runtime errors. It learns to pass tests without writing messy or too-long code.
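
As a concrete illustration of scoring one answer against several goals, the sketch below defines three toy reward functions for a coding answer and keeps their scores separate. The checks are hypothetical placeholders, not the paper's actual reward implementations.

```python
def passes_tests(code: str) -> float:
    # Placeholder: a real reward would execute the code against unit tests.
    return 1.0 if "return" in code else 0.0

def within_length(code: str, max_chars: int = 200) -> float:
    # Reward staying under a length budget.
    return 1.0 if len(code) <= max_chars else 0.0

def no_runtime_error(code: str) -> float:
    # Placeholder: a real reward would run the code and catch exceptions.
    return 1.0 if "raise" not in code else 0.0

REWARD_FUNCTIONS = {"tests": passes_tests, "length": within_length, "bugs": no_runtime_error}

answer = "def add(a, b):\n    return a + b"
scores = {name: fn(answer) for name, fn in REWARD_FUNCTIONS.items()}
print(scores)  # {'tests': 1.0, 'length': 1.0, 'bugs': 1.0}
# Keeping the per-goal scores separate, rather than summing them right away,
# is exactly the detail GDPO preserves during normalization.
```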

Real stakes: In everyday tools—chatbots, coding assistants, math solvers—we want answers that are right, safe, clearly formatted, and not too long. If the learning signal is blurry, models can become chatty but wrong, or right but off-format, or even crash training. This paper shows how to keep the learning signal crisp so models behave the way people actually want.

02 Core Idea

The “Aha!” in one sentence: Normalize each reward separately first, then add them up and stabilize the result—so the model keeps the fine details from every goal instead of squashing them.

We’ll build up the idea with key concepts in order.

🍞 Top Bread (Hook) You know how a teacher posts separate scores for homework, tests, and projects before averaging the final grade? That keeps each part meaningful.

🥬 Filling (The Actual Concept: Group Relative Policy Optimization, GRPO)

  • What it is: GRPO is a way to update the model by comparing a group of sampled answers to each other and pushing the better-than-average ones up.
  • How it works:
    1. For a question, sample several answers from the current model.
    2. Score them with a reward (often just one reward like correctness).
    3. Normalize those scores within the group to get advantages.
    4. Use those advantages to nudge the model toward better answers.
  • Why it matters: It removes the need for a separate value model, making training simpler and more efficient.

🍞 Bottom Bread (Anchor) For one math problem, the model tries 4 answers. If one is clearly the best, GRPO boosts it the most and trims the worst ones.

But when there are many rewards, GRPO often mixes them into one sum and normalizes that. Different mixes (like “correct + well-formatted” vs. “correct only”) can end up with the same advantage. That’s the squash.
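
Here is a small numeric illustration of that squash (my own example, using the standard mean/std group normalization as an assumption). Two prompts each have four sampled answers; in one group the best answer earns both the correctness and format rewards (total 2), in the other it earns only correctness (total 1). Summing first and normalizing afterward gives both best answers the identical advantage.

```python
import numpy as np

def normalize(x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

# Summed rewards (correctness + format) for four answers per prompt.
sums_prompt1 = [2.0, 0.0, 0.0, 0.0]   # best answer: correct AND well-formatted
sums_prompt2 = [1.0, 0.0, 0.0, 0.0]   # best answer: correct only

# GRPO-style: sum rewards first, then normalize within each group.
print(normalize(sums_prompt1)[0])   # ~1.73
print(normalize(sums_prompt2)[0])   # ~1.73 -> the same push, even though the first
                                    # answer satisfied one more goal than the second
```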

🍞 Top Bread (Hook) Imagine two smoothies with different fruits—strawberry+banana vs. strawberry alone. If you blend them and then force them to taste the same, you lose the difference.

🥬 Filling (The Actual Concept: GDPO)

  • What it is: GDPO (Group reward-Decoupled Normalization Policy Optimization) is a GRPO-style method that normalizes each reward on its own first, then sums, then batch-normalizes to keep scales stable.
  • How it works:
    1. For each reward (accuracy, format, length, bugs), compute group-relative advantages separately.
    2. Sum these per-reward advantages to form a combined advantage.
    3. Apply batch-wise normalization so the final signal stays in a steady range even if you add more rewards.
  • Why it matters: It preserves the unique contribution of each reward so the model can learn precise trade-offs.

🍞 Bottom Bread (Anchor) Two answers: (a) correct and well-formatted; (b) correct but messy. With GDPO, (a) gets a bigger push because it scores high on two separate normalized tracks, not just a blurred sum.
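
Continuing the same toy example (again my illustration, not the authors' released code), GDPO normalizes each reward channel separately before summing, so the answer that satisfies two goals now earns a visibly larger combined advantage than the one that satisfies only one.

```python
import numpy as np

def normalize(x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

# Columns: [correctness, format]; rows: four sampled answers per prompt.
prompt1 = np.array([[1, 1], [0, 0], [0, 0], [0, 0]], dtype=float)  # best: correct + formatted
prompt2 = np.array([[1, 0], [0, 0], [0, 0], [0, 0]], dtype=float)  # best: correct only

def gdpo_combined(group):
    # Normalize each reward column on its own, then sum the channels per answer.
    per_reward = np.stack([normalize(group[:, j]) for j in range(group.shape[1])], axis=1)
    return per_reward.sum(axis=1)

print(gdpo_combined(prompt1)[0])   # ~3.46: credit from both reward channels
print(gdpo_combined(prompt2)[0])   # ~1.73: credit from the correctness channel only
# A final batch-wise normalization over all combined values then rescales them
# into a stable range before the policy update.
```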

Multiple analogies for the same idea:

  1. Report cards: Grade math, science, and art separately before averaging—so students see where to improve.
  2. Cooking: Season salt, pepper, and herbs to taste individually; then combine. If you dump them together first, you can’t adjust each flavor.
  3. Band mixer: Adjust each instrument’s volume on its own channel; then set the master volume. If you only had one combined knob, the guitar might drown out the piano.

Before vs. after:

  • Before (GRPO on summed rewards): Different reward mixes often collapse into the same advantage bucket; learning signals blur; training can destabilize.
  • After (GDPO decoupled normalization): Reward mixes keep their identities; signals are sharper; training converges more consistently across tasks.

Why it works (intuition, no math): Normalizing each reward separately keeps their information alive. Summing those clean signals then reflects the true multi-goal quality of an answer. The final batch-wise normalization prevents the combined signal from exploding as you add more rewards, which stabilizes learning.

Building blocks:

  • Separate per-reward group normalization (keeps detail for each goal)
  • Summation of per-reward advantages (makes one actionable signal)
  • Batch-wise normalization of the sum (keeps training stable regardless of reward count)

03 Methodology

High-level pipeline: Prompt + Multiple Reward Functions → Per-reward group normalization → Sum per-reward advantages (with optional weights) → Batch-wise normalization of the sum → Policy update.

Step-by-step, like a recipe:

  1. Define rewards for each goal.

    • What happens: You create clear reward functions: e.g., correctness (right answer?), format (structured tags?), length (≤ target?), bugs (runtime errors?).
    • Why it exists: Each goal needs its own “thermometer”—mixing too early makes it impossible to know which part is hot or cold.
    • Example: Tool-calling has a 0/1 format reward and a graded correctness reward based on tool name/parameters.
  2. Sample groups of answers per prompt.

    • What happens: For each question, the model produces G answers (rollouts).
    • Why it exists: Group-wise comparison is the heart of GRPO-style learning—answers compete against their siblings.
    • Example: For a math question, generate 16 solutions with different reasoning paths.
  3. Compute per-reward group-relative advantages.

    • What happens: For each reward separately (e.g., only the format reward across the G answers), compute how much better/worse each answer is than its group average, scaled to a stable range.
    • Why it exists: This preserves each reward’s structure. Without it, a 2-reward answer (correct+formatted) and a 1-reward answer (only correct) might look the same.
    • Example: Group format scores [1, 0, 1, 0] become per-answer format advantages like [+a, −a, +a, −a]; correctness gets its own set.
  4. Optionally apply reward weights to per-reward advantages.

    • What happens: Multiply each per-reward advantage by a weight (e.g., correctness 1.0, length 0.5) before summing.
    • Why it exists: Lets you encode priority. But beware: if one reward is much easier, small weight changes may not shift priorities.
    • Example: In math, correctness might be weighted 1.0; length 0.75.
  5. Sum per-reward advantages into a single advantage per answer.

    • What happens: Add the per-reward advantages (after weights) to get one combined advantage for each answer.
    • Why it exists: The policy update needs one steering signal; we just made sure it’s an information-rich one.
    • Example: If an answer is slightly above average on correctness and far above on format, its total advantage is large and positive.
  6. Batch-wise normalization of the summed advantages.

    • What happens: Normalize all summed advantages across the whole batch to a stable numerical range.
    • Why it exists: Keeps training steady even as you add more rewards or face different datasets; prevents advantage scale from drifting.
    • Example: Without this, training sometimes collapses; with it, runs converge more reliably.
  7. Policy update with clipping and KL control.

    • What happens: Use standard GRPO-style updates (e.g., token-level ratios, clipping) plus a small KL penalty to stay near the reference model.
    • Why it exists: Prevents overly large steps that can ruin the model’s language quality; keeps progress smooth.
    • Example: The best answers in the group get pushed up within safe limits; bad ones get pushed down a bit.

What breaks without each step:

  • Skip per-reward normalization: Different reward mixes collapse; model can’t learn nuanced trade-offs.
  • Skip weights: Can’t reflect user priorities (e.g., correctness over length).
  • Skip batch normalization: Training may become unstable, especially with many rewards.
  • Skip KL/clipping: The policy can take wild steps and degrade fluency.

Concrete mini-data example:

  • Two rewards (format 0/1, correctness 0/2); four answers have these raw rewards:
    • A: (format=1, correctness=2)
    • B: (format=1, correctness=1)
    • C: (format=0, correctness=2)
    • D: (format=0, correctness=0)
  • Per-reward normalization within the group:
    • The format channel sees [1, 1, 0, 0] and produces per-answer advantages like [+f, +f, −f, −f].
    • The correctness channel sees [2, 1, 2, 0] and produces [+c, 0, +c, −2c] (illustrative).
  • Sum channels: A gets (+f + c), B gets (+f + 0), C gets (−f + c), D gets (−f − 2c).
  • Batch-normalize: Scale the final numbers across the batch so updates are stable.
  • Update: A gets the biggest positive push; D gets pushed down; B and C adjust appropriately. (The sketch below walks through this same computation in code.)
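
Putting the whole recipe together, here is a compact sketch of the GDPO-style advantage computation under the assumptions spelled out above (mean/std group normalization, optional per-reward weights, and a final batch-wise normalization). It is an illustrative reimplementation, not the authors' code, and names like `gdpo_advantages` are my own.

```python
import numpy as np

def group_normalize(x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def gdpo_advantages(reward_matrices, weights=None, eps=1e-6):
    """Illustrative GDPO-style advantages.

    reward_matrices: one array per prompt, shape (group_size, num_rewards).
    weights: optional per-reward weights applied before summing.
    Returns one batch-normalized advantage per sampled answer.
    """
    combined = []
    for group in reward_matrices:
        group = np.asarray(group, dtype=float)
        w = np.ones(group.shape[1]) if weights is None else np.asarray(weights, dtype=float)
        # 1) Decoupled, per-reward group normalization.
        per_reward = np.stack(
            [group_normalize(group[:, j]) for j in range(group.shape[1])], axis=1
        )
        # 2) Optional weights, then sum the channels into one advantage per answer.
        combined.append((per_reward * w).sum(axis=1))
    combined = np.concatenate(combined)
    # 3) Batch-wise normalization keeps the scale steady as rewards are added.
    return (combined - combined.mean()) / (combined.std() + eps)

# The mini-data example: columns are [format (0/1), correctness (0/2)] for A-D.
group = [[1, 2], [1, 1], [0, 2], [0, 0]]
print(gdpo_advantages([group]))
# A receives the largest positive advantage and D the most negative;
# B and C land in between, in that order.
```

In a full pipeline, these advantages would then feed the usual GRPO-style clipped update with a KL penalty, as described in step 7.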

The secret sauce:

  • Decoupled per-reward normalization preserves signal granularity.
  • The final batch-wise normalization keeps learning stable as reward count grows.
  • Together, they prevent the “same-advantage” collapse that plagues summed-reward GRPO and yield steadier, better convergence.

04 Experiments & Results

The test: Does GDPO really keep more reward detail and train more stably than GRPO across real tasks? The authors tested on three fronts: tool-calling (accuracy + format), math reasoning (accuracy + length limit), and coding (pass rate + length limit + bug rate).

The competition: Baselines were standard GRPO and a GRPO variant without standard deviation normalization (GRPO w/o std). Models included Qwen2.5 1.5B/3B, DeepSeek-R1 1.5B/7B, and Qwen3-4B.

Scoreboard with context:

  • Tool-calling (BFCL-v3):
    • GDPO-trained Qwen2.5-1.5B improved both average accuracy and format correctness over GRPO (about +2.7% average accuracy and +4% correct format). Think of it as moving from a solid B to a higher B+ on both being right and turning work in the right template.
    • On Qwen2.5-3B, GDPO again edged out GRPO in both accuracy and formatting, suggesting the method scales beyond one model size.
    • GRPO w/o std sometimes matched correctness but crashed on format (0% correct format), like getting answers right but never following the required structure.

  • Math reasoning with a 4000-token target (MATH, AIME, AMC, Minerva, Olympiad):
    • GDPO reduced length violations massively (e.g., AIME overlength dropped from ~85% to near 0–0.2% in some cases) while also improving or maintaining accuracy. That’s like trimming an essay to fit the page limit while still getting a higher score.
    • On DeepSeek-R1-1.5B, GDPO improved Pass@1 on MATH (+2.6%), AIME (+6.3%), and Olympiad (+2.3%) compared to GRPO, while slashing overlength answers by huge margins.
    • Training stability: GRPO’s correctness started to dip after ~400 steps, and lengths crept up; GDPO kept improving correctness and tightened length control.

  • Coding with three objectives (Apps, CodeContests, Codeforces, Taco):
    • Two-reward setting (pass rate + conditioned length): GDPO boosted pass rates across all datasets versus GRPO while keeping length violations similar or lower.
    • Three-reward setting (adding bug reduction): GDPO matched or slightly improved pass rates and also reduced both overlength and bug rates more than GRPO, like writing code that’s not only correct but also compact and less crash-prone.

Surprising findings:

  • Simply removing standard deviation in GRPO (GRPO w/o std) did not fix the collapse problem and could hurt stability (e.g., tool format learning failed entirely).
  • Just lowering the weight of an easy reward (like length) often didn’t change the model’s behavior; conditioning the easier reward on the harder one (e.g., only granting the length reward when the answer is correct) worked far better (see the sketch below). Under this setup, GDPO especially shone by translating the relaxed constraint into real accuracy gains with modest length trade-offs.
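
A minimal sketch of that conditioning idea (my own illustration with hypothetical checkers, not the paper's reward code): the easier length reward only pays out once the harder correctness reward is satisfied.

```python
def correctness_reward(answer: str, reference: str) -> float:
    # Hypothetical checker: full credit only if the final answer matches the reference.
    return 1.0 if answer.strip().endswith(reference) else 0.0

def conditioned_length_reward(answer: str, reference: str, max_tokens: int = 4000) -> float:
    """Grant the (easy) length reward only when the (hard) correctness reward is met."""
    if correctness_reward(answer, reference) == 0.0:
        return 0.0  # no credit for being short but wrong
    return 1.0 if len(answer.split()) <= max_tokens else 0.0

# A wrong-but-short answer earns nothing from the length channel,
# so brevity can never outrank correctness during training.
print(conditioned_length_reward("The answer is 41", reference="42"))  # 0.0
print(conditioned_length_reward("The answer is 42", reference="42"))  # 1.0
```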

Bottom line: Across all tasks and models, GDPO gave a more expressive learning signal, smoother training, and better real-world metrics than GRPO.

05 Discussion & Limitations

Limitations:

  • If rewards are very sparse or nearly always identical within a group (e.g., all zeros), per-reward normalization can be less informative, and training might need additional shaping or curriculum.
  • GDPO adds a small computational overhead for normalizing each reward channel and then batch-normalizing the sum, though in practice it remains lightweight compared to total RL costs.
  • When rewards are extremely noisy or mis-specified, preserving their details won’t help; garbage in still means garbage out.
  • In tiny-group regimes (few rollouts), per-reward statistics can be unstable; careful choice of group size and light smoothing may be needed.

Required resources:

  • A standard GRPO-style RLHF stack (e.g., verl/HF-TRL/NeMo-RL), support for multi-sample rollouts per prompt, and room to compute per-reward advantages.
  • Reward functions or models for each objective (format checkers, answer checkers, bug detectors), plus infrastructure to run them efficiently.

When not to use:

  • Single-reward scenarios where GRPO is already stable and strong; GDPO’s benefit is minimal.
  • Settings where you cannot define or evaluate separate rewards reliably (e.g., unavailable format checker or buggy test harness).
  • Ultra-low-latency training loops where even small overheads are unacceptable.

Open questions:

  • How best to smooth or regularize per-reward normalization when rewards are extremely sparse?
  • Can adaptive weighting strategies, informed by difficulty, further improve priority handling without manual tuning?
  • What are the best practices for combining learned reward models with rule-based rewards under GDPO?
  • How does GDPO interact with advanced sampling schemes (e.g., dynamic group sizes, pruning) at very large scales?
  • Can these ideas generalize to vision/multimodal RL settings with many heterogeneous rewards?

06 Conclusion & Future Work

Three-sentence summary:

  • GRPO often collapses different multi-reward mixtures into the same learning signal, making models miss important differences and destabilizing training.
  • GDPO fixes this by normalizing each reward channel separately, summing those advantages, and batch-normalizing the result to keep updates stable as you add rewards.
  • Across tool-calling, math with length limits, and coding with bug checks, GDPO consistently trains more stably and achieves better metrics than GRPO.

Main achievement:

  • A simple, drop-in redesign of the advantage computation that preserves per-reward detail and yields more accurate, stable multi-reward optimization.

Future directions:

  • Smarter priority handling (adaptive weights/conditioning) that automatically reacts to reward difficulty.
  • Combining GDPO with improved sampling, filtering, or curriculum to further stabilize long trainings.
  • Extending GDPO to multimodal tasks and environments with complex, learned reward models.

Why remember this:

  • As we ask one model to satisfy many human preferences at once—be right, safe, concise, and well-formatted—the training signal must stay crisp. GDPO keeps those signals clear, so models learn the right lesson from each reward and deliver better, more reliable behavior in the real world.

Practical Applications

  • Train a function-calling assistant to maximize both tool-call accuracy and strict output formatting.
  • Tune math solvers to improve accuracy while enforcing hard response-length limits for efficiency.
  • Improve coding assistants by jointly optimizing pass rate, response length, and bug avoidance.
  • Build safety-aligned chatbots by separating and balancing rewards for helpfulness, harmlessness, and adherence to policies.
  • Personalize assistants by weighting style, tone, and brevity differently while preserving each signal’s distinct influence.
  • Scale to more objectives (e.g., fairness, coherence) without destabilizing updates by using batch-wise advantage normalization.
  • Use conditioned rewards to ensure tough goals (e.g., correctness) are met before easy ones (e.g., brevity) can influence training.
  • Retrofit existing GRPO pipelines by swapping in per-reward normalization and a final batch normalization step.
  • Run ablations to set practical reward weights after first resolving difficulty imbalances with conditioned rewards.
  • Apply to multimodal agents (e.g., tool use plus vision) by giving each modality its own normalized reward channel.
#GDPO · #GRPO · #multi-reward reinforcement learning · #advantage estimation · #reward normalization · #RLHF · #tool calling · #reasoning efficiency · #length constraints · #coding pass rate · #bug ratio · #batch-wise normalization · #conditioned rewards · #policy optimization