Large language models learn better when we spend more practice time on the right questions at the right moments.
The paper shows that a model that looks great after supervised fine-tuning (SFT) can actually do worse after the same reinforcement learning (RL) than a model that looked weaker at SFT time.
Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.