The paper shows that a model that performs well after supervised fine-tuning (SFT) can end up worse after the same reinforcement learning (RL) procedure than a model that looked weaker at the SFT stage.
AT2PO is a new method for training AI agents that act over multiple turns, for example issuing a web query, reading the result, and trying again with a refined query.
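To make the multi-turn setting concrete, here is a minimal sketch of one such episode loop. All names (`query_model`, `search_web`, `run_episode`) are hypothetical stand-ins for illustration, not part of AT2PO itself; the point is only the act-observe-act-again structure.

```python
def query_model(history: list[str]) -> str:
    """Stub for the policy: given the dialogue so far, emit the next action."""
    # In a real system this would call the trained language model.
    return "search: example question"


def search_web(query: str) -> str:
    """Stub tool: return a text observation for a web query."""
    return f"results for {query!r}"


def run_episode(max_turns: int = 4) -> list[str]:
    """Roll out one multi-turn episode: act, observe, act again."""
    history: list[str] = []
    for _ in range(max_turns):
        action = query_model(history)
        history.append(action)
        if not action.startswith("search:"):
            break  # the agent committed to a final answer
        # Feed the tool's output back as the next observation.
        observation = search_web(action.removeprefix("search: "))
        history.append(observation)
    return history


if __name__ == "__main__":
    for step in run_episode():
        print(step)
```

In an RL setup like the one the paper studies, each completed episode would be scored by a reward signal and used to update the policy; this sketch only shows the rollout side.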