This paper tackles why training AI agents that act over many steps (like browsing the web or navigating a house) often becomes unstable and collapses.
Large language models are usually trained to get good at one kind of reasoning, but real life needs them to be good at many things at once.
AT2PO is a new way to train AI agents that act over several turns, such as asking the web a question, reading the result, and trying again.
This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.