Efficient RLVR Training via Weighted Mutual Information Data Selection
Intermediate · Xinyu Zhou, Boyu Zhu et al. · Mar 2 · arXiv
Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.
#Reinforcement Learning #RLVR #Data Selection