Rethinking the Trust Region in LLM Reinforcement Learning
Intermediate · Penghui Qi, Xiangxin Zhou et al. · Feb 4 · arXiv
The paper shows that PPO, the popular method for training language models with reinforcement learning, constrains rare tokens too harshly and very common tokens too loosely, which makes learning slow and unstable.
#Reinforcement Learning #Proximal Policy Optimization #Trust Region
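
One way to see the claim in the summary is to look at how standard PPO ratio clipping scales with a token's old probability. The sketch below is illustrative only and is not taken from the paper: the helper `clip_bounds` and the value `eps = 0.2` are assumptions (0.2 is a commonly used PPO clip range). It computes the range of new probabilities for which the ratio `p_new / p_old` stays inside `[1 - eps, 1 + eps]`.

```python
from typing import Tuple


def clip_bounds(p_old: float, eps: float = 0.2) -> Tuple[float, float]:
    """Return the new-probability range whose ratio to p_old lies in [1 - eps, 1 + eps].

    Hypothetical helper for illustration; eps = 0.2 is an assumed clip range.
    """
    lo = p_old * (1.0 - eps)
    hi = min(1.0, p_old * (1.0 + eps))  # a probability cannot exceed 1
    return lo, hi


# Compare a rare token with a very common one under the same clip range.
for p_old in (0.001, 0.9):
    lo, hi = clip_bounds(p_old)
    print(f"p_old={p_old}: unclipped range [{lo:.4f}, {hi:.4f}], width {hi - lo:.4f}")
```

Under these assumptions, the rare token can move by only a few ten-thousandths of probability before clipping takes effect, while the common token can move by more than a quarter of the probability mass. That asymmetry is one reading of why a fixed ratio bound treats rare and common tokens so differently.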