Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
Intermediate · Fanfan Liu, Youyang Yin et al. · Feb 5 · arXiv
The paper finds that popular RLVR (reinforcement learning with verifiable rewards) methods for training language and vision-language models carry an implicit bias toward certain response lengths, which can hurt learning.
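
To make the idea of a hidden length preference concrete, here is a minimal sketch of one way such a bias can arise. It assumes a loss that is averaged over tokens within each response and then averaged across responses. That normalization scheme and the helper name `per_token_weights` are illustrative assumptions, not details taken from the paper.

```python
def per_token_weights(lengths):
    """Return the weight each token gets in the total loss, per response.

    Assumed (hypothetical) scheme: the loss is the mean over tokens within
    a response, then the mean over responses. A token in a response of
    length L therefore carries weight 1 / (num_responses * L).
    """
    num_responses = len(lengths)
    return [1.0 / (num_responses * length) for length in lengths]


# Two sampled responses to the same prompt: one short, one long.
lengths = [10, 100]
weights = per_token_weights(lengths)

# Each token of the short response counts ten times as much as each token
# of the long response, even though both responses get equal total weight.
ratio = weights[0] / weights[1]
```

Under this assumed scheme the optimizer treats tokens differently depending only on how long their response is, which is the kind of effect a length-unbiased objective would aim to remove.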