Large language models are usually trained to get good at one kind of reasoning, but real-world use requires them to handle many kinds of reasoning at once.
AT2PO is a new method for training AI agents that work over several turns, for example by running a web search, reading the result, and trying again.
This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.
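To make the "explore first, then use what was learned" idea concrete, here is a minimal toy sketch. It is not LAMER's training procedure and involves no language model; it only illustrates the general explore-then-exploit pattern across repeated episodes. The function `explore_then_exploit`, the `pull` callback, and the `HIDDEN_PAYOUTS` values are all hypothetical names and numbers chosen for illustration.

```python
from typing import Callable, Dict, List


def explore_then_exploit(
    pull: Callable[[int], float], n_arms: int, n_episodes: int
) -> List[float]:
    """Toy illustration (an assumption, not LAMER's algorithm).

    Early episodes try each option once; later episodes reuse the
    best option seen so far. Returns the reward of each episode.
    """
    observed: Dict[int, float] = {}  # memory carried across episodes
    rewards: List[float] = []
    for episode in range(n_episodes):
        if episode < n_arms:
            arm = episode  # explore: try an option not seen yet
        else:
            arm = max(observed, key=observed.get)  # exploit: best so far
        reward = pull(arm)
        observed[arm] = reward
        rewards.append(reward)
    return rewards


# Hypothetical environment: three options with fixed, hidden payouts.
HIDDEN_PAYOUTS = [0.2, 0.9, 0.5]
rewards = explore_then_exploit(
    lambda arm: HIDDEN_PAYOUTS[arm], n_arms=3, n_episodes=6
)
print(rewards)  # [0.2, 0.9, 0.5, 0.9, 0.9, 0.9]
```

The first three episodes pay for information with lower rewards, and every later episode benefits from it. A Meta-RL framework aims to have the agent learn this kind of cross-episode behavior rather than having it hard-coded as it is here.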