VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Intermediate · Guobin Shen, Chenxiao Zhao et al. · Feb 11 · arXiv
VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
#VESPO #off-policy reinforcement learning #importance sampling