Diversity or Precision? A Deep Dive into Next Token Prediction
Intermediate · Haoyuan Wu, Hai Wang et al. · Dec 28 · arXiv
The paper shows that training a language model with a “reward-shaped” next-token objective can make subsequent reinforcement learning (RL) considerably more effective.
#next-token prediction #cross-entropy as policy gradient #reward shaping