Your Group-Relative Advantage Is Biased
Intermediate · Fengkai Yang, Zherui Chen et al. · Jan 13 · arXiv
Group-based reinforcement learning methods for reasoning, such as GRPO, use the group's mean reward as the baseline, and that choice makes their advantage estimates biased.
#Reinforcement Learning from Verifier Rewards #GRPO #GSPO
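
For readers unfamiliar with the baseline the summary refers to, here is a minimal sketch of a group-relative advantage as it is commonly written for GRPO: each sampled response's reward minus the mean reward of its group, optionally divided by the group's standard deviation. The function name, the `eps` constant, and the `normalize` flag are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's bias analysis.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8, normalize=True):
    """Hypothetical helper: advantages for one group of sampled responses.

    rewards: scalar rewards for the responses sampled for the same prompt.
    The baseline is the group's mean reward, so each response's own reward
    contributes to the baseline it is compared against.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Assumption: population standard deviation plus a small eps to avoid
    # division by zero when every reward in the group is identical.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Example: four responses to one prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
centered = group_relative_advantages(rewards, normalize=False)
normalized = group_relative_advantages(rewards)
```

Because the baseline is computed from the same small group of samples it is applied to, the centered advantages always sum to zero within the group, regardless of how hard the prompt is.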