Self-Hinting Language Models Enhance Reinforcement Learning
Intermediate · Baohao Liao, Hanze Dong et al. · Feb 3 · arXiv
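When rewards are sparse, GRPO (Group Relative Policy Optimization), a popular reinforcement learning method for language models, often stalls: if every sampled response in a group receives the same reward, the group-relative advantages are all zero and there is no learning signal.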
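To make the stall concrete, here is a minimal sketch of the group-relative advantage as GRPO is commonly described: each reward is normalized by its own group's mean and standard deviation. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to its own group (hypothetical helper).

    eps is an assumed stabilizer that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse-reward case: all 8 attempts fail, so every reward is 0.
all_fail = group_relative_advantages([0.0] * 8)

# Mixed case: exactly one of the 8 attempts succeeds.
mixed = group_relative_advantages([1.0] + [0.0] * 7)

print(all_fail)  # every advantage is zero, so the update carries no signal
print(mixed)     # the success is positive, the failures are negative
```

In the all-fail group the numerator is zero for every response, so the policy gradient vanishes regardless of how the denominator is handled. A single success in the group is enough to restore a nonzero signal.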
#reinforcement learning #GRPO #self-hinting