This paper teaches a language model to improve its own math answers by first writing several drafts and then learning to beat its best draft.
The paper shows that the popular PPO method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.
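The summary does not say where this imbalance comes from. One plausible reading, offered here as an assumption rather than as the paper's own analysis, is that it stems from PPO's standard clipping rule, which keeps the ratio of new to old word probability inside [1 - eps, 1 + eps]. The sketch below shows how far a word's probability can move inside that window. The function name `clip_window` and the value `eps = 0.2` are illustrative choices, not taken from the paper.

```python
def clip_window(p_old: float, eps: float = 0.2) -> tuple[float, float]:
    """Range of new probabilities whose ratio p_new / p_old stays in [1 - eps, 1 + eps].

    Illustrative only: eps = 0.2 is a commonly used PPO clip value, assumed here.
    """
    lo = p_old * (1.0 - eps)
    hi = min(1.0, p_old * (1.0 + eps))  # a probability cannot exceed 1
    return lo, hi


# Compare a rare word, a mid-frequency word, and a very common word.
for p_old in (0.01, 0.5, 0.9):
    lo, hi = clip_window(p_old)
    print(f"p_old={p_old:.2f}  window=[{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}")
```

Under this reading, a rare word can shift by only a tiny absolute amount per update, while a very common word is allowed a much larger shift, which would match the "unfair to rare words, too gentle with common words" description.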