ProAct teaches AI agents to think ahead accurately without needing expensive search every time they act.
Large language models are great at words but waste a lot of time and compute when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.
LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.
Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.
This paper introduces LAMER, a meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.
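
To make the Turn-PPO idea above concrete, here is a minimal sketch of what "judging each turn as one whole action" could look like in a clipped PPO loss. It is an illustration under stated assumptions, not the paper's implementation: it assumes a turn's log-probability is the sum of its token log-probabilities and that each turn gets a single advantage value. The function name `turn_level_ppo_loss` and its parameters are hypothetical.

```python
import math


def turn_level_ppo_loss(new_logps, old_logps, turn_advantages, clip_eps=0.2):
    """Clipped PPO surrogate computed once per turn (hypothetical sketch).

    new_logps / old_logps: list of turns, each a list of per-token log-probs
    under the current and the behavior policy, respectively.
    turn_advantages: one advantage value per turn (assumed given).
    """
    losses = []
    for new_turn, old_turn, adv in zip(new_logps, old_logps, turn_advantages):
        # Assumption: the whole turn is one action, so its log-prob is the
        # sum of its token log-probs and there is one ratio per turn.
        ratio = math.exp(sum(new_turn) - sum(old_turn))
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # Standard PPO clipped objective, negated so lower is better.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# Usage: two turns, policy unchanged, so both ratios are exactly 1.
logps = [[-0.1, -0.2], [-0.3]]
loss = turn_level_ppo_loss(logps, logps, [1.0, -0.5])
```

A token-level variant would instead compute one ratio and one clipped term per token; the sketch differs only in where the sum over tokens happens.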