ProAct teaches AI agents to think ahead accurately without needing expensive search every time they act.
Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.
Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.
This paper introduces LAMER, a Meta-RL training framework that teaches language agents to explore first and then use what they learned to solve tasks faster.