The paper shows that a model that looks great after supervised fine-tuning (SFT) can end up worse after the same reinforcement learning (RL) training than a model that looked weaker at SFT time.
Robots often learn a bad habit called the vision shortcut: they guess the task from the scene alone and ignore the language instruction you give them.