The paper shows that the popular PPO method for training language models treats words unevenly: it pushes too hard on rare words and barely nudges very common ones, which makes learning slow and unstable.
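To see where this per-word (per-token) treatment comes from, here is a minimal sketch of the standard PPO clipped objective computed token by token. The function name and toy numbers are illustrative, and this is the generic textbook objective, not the paper's own analysis.

```python
import torch

def ppo_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Per-token importance ratio between the updated and old policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate objective, computed independently for every token,
    # which is where any imbalance between rare and common words enters.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO maximizes the element-wise minimum; the training loss is its negation.
    return -torch.min(unclipped, clipped).mean()

# Toy example: three tokens that were common, uncommon, and rare under the old policy.
logp_old = torch.log(torch.tensor([0.50, 0.05, 0.001]))
logp_new = logp_old + 0.1          # a small shift in log-probability for each token
advantages = torch.tensor([1.0, 1.0, 1.0])
print(ppo_token_loss(logp_new, logp_old, advantages))
```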
AT2PO is a new way to train AI agents that act over several turns, for example asking the web a question, reading the result, and trying again.
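The summary does not spell out AT2PO's training procedure, so the sketch below only illustrates the multi-turn loop itself: a generic ask-read-retry episode, with `policy` and `search_web` as hypothetical stand-ins rather than AT2PO's actual API.

```python
def run_episode(policy, search_web, question, max_turns=4):
    """One multi-turn episode: the agent asks the web a question, reads the
    result, and retries until it commits to an answer or runs out of turns.
    `policy` and `search_web` are hypothetical stand-ins, not AT2PO's API."""
    history = []  # list of (query, observation) pairs seen so far
    for _ in range(max_turns):
        query = policy.next_query(question, history)    # decide what to search for
        observation = search_web(query)                 # read the result
        history.append((query, observation))
        answer = policy.maybe_answer(question, history) # answer now, or keep trying
        if answer is not None:
            return history, answer
    return history, None  # turn budget exhausted without an answer
```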