Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Intermediate · Junbo Li, Peng Zhou et al. · Dec 18 · arXiv
Turn-PPO is a method for training LLM agents that act over many conversation turns: it treats each turn as a single action and estimates advantages per turn with PPO, rather than per token.
#Turn-PPO #multi-turn reinforcement learning #agentic LLMs