VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
The paper introduces CoPE, a simple change to how models track word positions that makes long documents much easier for them to understand.