๐ŸŽ“How I Study AIHISA
๐Ÿ“–Read
๐Ÿ“„Papers๐Ÿ“ฐBlogs๐ŸŽฌCourses
๐Ÿ’กLearn
๐Ÿ›ค๏ธPaths๐Ÿ“šTopics๐Ÿ’กConcepts๐ŸŽดShorts
๐ŸŽฏPractice
โฑ๏ธCoach๐ŸงฉProblems๐Ÿง Thinking๐ŸŽฏPrompts๐Ÿง Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers1

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#length normalization bias

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Intermediate
Guobin Shen, Chenxiao Zhao et al.Feb 11arXiv

VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.

#VESPO#off-policy reinforcement learning#importance sampling