Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Intermediate · Zhenpeng Su, Leiyu Pan et al. · Dec 5 · arXiv
Reinforcement learning (RL) can improve the capabilities of large language models, but off-policy training often pushes policy updates too far from the “safe zone,” causing unstable learning.
#reinforcement learning · #PPO-clip · #KL penalty