Self-Improving Pretraining: using post-trained models to pretrain better models
Intermediate · Ellen Xiaoqing Tan, Shehzaad Dhuliawala et al. · Jan 29 · arXiv
This paper trains language models to be safer, more factual, and higher quality during pretraining rather than only afterward, using reinforcement learning guided by a stronger post-trained model.
#self-improving pretraining #reinforcement learning #online DPO