GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Intermediate · Shih-Yang Liu, Xin Dong et al. · Jan 8 · arXiv
When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.
#GDPO #GRPO #multi-reward reinforcement learning
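To make the summary concrete, here is a minimal NumPy sketch (not taken from the paper) contrasting a GRPO-style group normalization of the summed reward with a per-reward "decoupled" normalization in the spirit of the title. The function names, aggregation choice, and toy numbers are illustrative assumptions, not the authors' implementation.

```python
# Sketch: why normalizing the *sum* of rewards can hide the reward mix,
# and how normalizing each reward separately keeps the mixes apart.
# Assumed/illustrative code; the paper's exact aggregation may differ.
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: sum reward components per response, then normalize
    the totals within the sampled group (subtract mean, divide by std)."""
    totals = rewards.sum(axis=1)                      # shape: (group_size,)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def decoupled_advantages(rewards):
    """Decoupled sketch: normalize each reward component across the group
    separately, then combine the normalized per-reward signals."""
    mean = rewards.mean(axis=0, keepdims=True)        # per-reward mean
    std = rewards.std(axis=0, keepdims=True) + 1e-8   # per-reward std
    return ((rewards - mean) / std).sum(axis=1)

# Toy group of 4 sampled responses, scored by two rewards
# (say, correctness and formatting); rows are responses.
rewards = np.array([
    [1.0, 0.0],   # correct, unformatted
    [0.0, 1.0],   # wrong, nicely formatted  -> same total as the row above
    [1.0, 0.8],
    [0.0, 0.3],
])

print(grpo_advantages(rewards))       # rows 0 and 1 get identical advantages
print(decoupled_advantages(rewards))  # rows 0 and 1 are now distinguished
```

Because GRPO normalizes only the summed reward, the first two responses above receive the same advantage even though one earned it from correctness and the other from formatting; normalizing each reward over the group before combining keeps those two mixes distinguishable, which is the failure mode the summary describes.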