Papers (2)

#Preference Alignment

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Earlier GRPO training for text-to-image models gave the same final reward to every sampling step, which is like giving the whole team the same grade no matter who did what.

#TurningPoint-GRPO #GRPO #Flow Matching
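
To make the "same grade for the whole team" picture concrete, here is a minimal sketch of the uniform credit assignment that the summary above describes as the old behavior. It is not the paper's code. The function names, the reward values, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize final rewards within one group of samples (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(advantages, num_steps):
    """Copy each sample's single advantage onto every one of its sampling steps."""
    return [[a] * num_steps for a in advantages]


# Hypothetical final rewards for four images generated from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Every step of a given sample receives an identical learning signal.
per_step = broadcast_to_steps(advantages, num_steps=5)
```

Each row of `per_step` holds one repeated value, so nothing in the signal distinguishes a step that mattered from one that did not. That is the sparse-reward issue this paper sets out to alleviate.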

E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Intermediate

Shengjun Zhang, Zhang Zhang et al. · Jan 1 · arXiv

This paper shows that when teaching image generators with reinforcement learning, only a few early, very noisy steps actually help the model learn what people like.

#E-GRPO #Group Relative Policy Optimization #Flow Matching