During training, large language models usually get only a single thumbs-up or thumbs-down after the whole answer is finished, which is too late to pinpoint or fix mistakes made in the middle.
AT2PO is a new way to train AI agents that work over several turns, such as searching the web, reading the result, and trying again.
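
To make the problem concrete, here is a minimal sketch of what a final-only signal looks like for a multi-turn agent. It is a toy illustration, not AT2PO itself: the `Turn` class, the `outcome_only_credit` function, and the example episode are all hypothetical names and data invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One step of a hypothetical multi-turn episode."""
    action: str       # what the agent did, e.g. a search query or a final answer
    observation: str  # what came back from the environment (empty if nothing)


def outcome_only_credit(turns, final_reward):
    """Copy the single end-of-episode reward onto every turn.

    This is the baseline behaviour described above: each turn receives
    the same score, so a good search and a bad answer look identical.
    """
    return [final_reward for _ in turns]


# Toy episode (assumed data): the search was reasonable, the final answer was wrong.
episode = [
    Turn("search: capital of Australia", "Sydney is the largest city in Australia."),
    Turn("answer: Sydney", ""),
]

# The grader only sees the final answer and gives a thumbs-down (0.0).
credits = outcome_only_credit(episode, final_reward=0.0)

for turn, credit in zip(episode, credits):
    print(f"{credit:.1f}  {turn.action}")
```

Both turns end up with the same score, so nothing in the signal says whether the search or the answer was at fault. That gap is what turn-level training methods aim to close.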