Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Intermediate · Dylan Zhang, Yufeng Xu et al. · Feb 1 · arXiv
The paper shows that a model that looks strong after supervised fine-tuning (SFT) can end up performing worse after identical reinforcement learning (RL) training than a model that looked weaker at the SFT stage.
#Supervised Fine-Tuning #Reinforcement Learning #Distribution Mismatch