The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
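A minimal sketch of the two selection protocols contrasted above, assuming the model can be asked either to score one answer in isolation or to choose between two answers. The names `pick_pointwise`, `pick_pairwise`, `toy_score`, and `toy_prefer` are hypothetical. The toy judges stand in for real model calls, and none of this is taken from the paper's implementation.

```python
from typing import Callable, Sequence


def pick_pointwise(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Judge each answer alone and return the highest-scoring one."""
    return max(answers, key=score)


def pick_pairwise(answers: Sequence[str], prefer: Callable[[str, str], str]) -> str:
    """Compare answers head-to-head, keeping the running winner."""
    best = answers[0]
    for challenger in answers[1:]:
        best = prefer(best, challenger)
    return best


# Hypothetical stand-ins for model calls, for the question "What is 2 + 2?".
def toy_score(answer: str) -> float:
    # Assumption: a pointwise judge returns a scalar score per answer.
    return 1.0 if answer == "4" else 0.0


def toy_prefer(a: str, b: str) -> str:
    # Assumption: a pairwise judge returns whichever of the two it prefers.
    return a if a == "4" else b


candidates = ["5", "4", "22"]
pointwise_choice = pick_pointwise(candidates, toy_score)
pairwise_choice = pick_pairwise(candidates, toy_prefer)
```

The pairwise loop here uses one comparison per additional candidate; a round-robin or bracketed tournament would be an equally valid way to aggregate head-to-head judgments.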
This paper teaches a language model to reason along several paths in parallel instead of following a single path one step at a time.