BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
Intermediate · Yuan Li, Bo Wang et al. · Mar 5 · arXiv
BandPO is a reinforcement-learning training method for large language models that uses probability-aware bounds to keep policy updates stable while still letting the model explore promising low-probability outputs.
#BandPO · #PPO clipping · #trust region