Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
Intermediate | Zhiqi Yu, Zhangquan Chen et al. | Feb 5 | arXiv
The paper finds a hidden symmetry inside GRPO's advantage calculation that accidentally stops models from exploring new good answers and from paying the right attention to easy versus hard problems at the right times.
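For context, here is a minimal sketch of the standard group-relative advantage used in GRPO, where each sampled answer's reward is centered on the group mean and divided by the group standard deviation. The summary above does not spell out which symmetry the paper means, so treat this only as an illustration of one symmetry that this formula is known to have: the advantages within a group always sum to zero, so the total positive weight equals the total negative weight. The function name `grpo_advantages`, the `eps` term, the use of the population standard deviation, and the example rewards are assumptions made for illustration, not details taken from the paper.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: 8 sampled answers to one prompt, 2 of them correct.
rewards = [1, 1, 0, 0, 0, 0, 0, 0]
adv = grpo_advantages(rewards)

# Total weight pushing toward correct answers vs. away from incorrect ones.
pos = sum(a for a in adv if a > 0)
neg = sum(a for a in adv if a < 0)
print("advantages:", [round(a, 3) for a in adv])
print("positive mass:", round(pos, 3), "negative mass:", round(neg, 3))
```

In this sketch the two masses cancel regardless of how many answers in the group are correct, which is the kind of built-in balance the summary calls a hidden symmetry.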