The paper trains language models to explore a wider range of ideas while thinking, so they can solve harder problems.
It also shows that a model that looks strong after supervised fine-tuning (SFT) can end up performing worse after identical reinforcement learning (RL) training than a model that looked weaker after SFT.