Papers2

#advantage estimation

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong et al.Jan 8arXiv

When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.

#GDPO#GRPO#multi-reward reinforcement learning

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Intermediate

Junbo Li, Peng Zhou et al.Dec 18arXiv

Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.

#Turn-PPO#multi-turn reinforcement learning#agentic LLMs