Papers2

#ratio clipping

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

BandPO is a new training method for large language models that keeps updates safe while letting the model freely explore smart, low-probability ideas.

#BandPO#PPO clipping#trust region

Not triaged yet

Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Intermediate

Peter Chen, Xiaopeng Li et al.Dec 18arXiv

The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.

#RLVR#Group Relative Policy Optimization#ratio clipping

Not triaged yet