Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
Intermediate · Yunze Tong, Mushui Liu et al. · Feb 6 · arXiv
Text-to-image models trained with GRPO have typically assigned the same final reward to every sampling step, which is like giving a whole team the same grade no matter who did what.
#TurningPoint-GRPO #GRPO #Flow Matching