Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Intermediate · Peter Chen, Xiaopeng Li et al. · Dec 18 · arXiv
The paper studies why two seemingly opposite tricks in RL with verifiable rewards (RLVR) for reasoning, adding random (spurious) rewards and reducing the policy's randomness (entropy), can both appear to improve the reasoning ability of large language models.
#RLVR #Group Relative Policy Optimization #ratio clipping
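For context on the clipping mechanism the tags refer to, below is a minimal sketch of a GRPO-style clipped objective with group-relative advantages, written as it is commonly implemented in PPO-family trainers. This is a generic formulation, not necessarily the paper's exact variant, and the function names (`grpo_clipped_loss`, `group_relative_advantages`) are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of G responses to the same prompt:
    advantage = (reward - group mean) / (group std + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_clipped_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss over sampled tokens.

    logp_new:   token log-probs under the current policy
    logp_old:   token log-probs under the policy that generated the samples
    advantages: group-relative advantages, broadcast over each response's tokens
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # The min() makes the update pessimistic: once a token's ratio leaves
    # the [1 - eps, 1 + eps] band in the favorable direction, its gradient
    # is zeroed, which bounds how far a single batch can move the policy.
    return -torch.min(unclipped, clipped).mean()
```

The clip band is the lever the paper's title points at: it asymmetrically limits how strongly the policy can reinforce (exploit) or suppress (explore away from) sampled tokens, which is one way clipping interacts with entropy and with noisy reward signals.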