The paper shows that when a model compares two of its own answers head-to-head, it picks the correct one more often than when it judges each answer in isolation.
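
To make the two set-ups concrete, here is a minimal sketch of pointwise vs. pairwise self-judging. The `llm_generate` helper and both prompts are hypothetical illustrations, not taken from the paper.

```python
def llm_generate(prompt: str) -> str:
    """Hypothetical text-in/text-out helper; plug in your model API here."""
    raise NotImplementedError

def pointwise_judge(question: str, answer: str) -> bool:
    """Judge a single answer in isolation."""
    verdict = llm_generate(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct? Reply with exactly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Compare two answers head-to-head and return the preferred one."""
    verdict = llm_generate(
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is correct? Reply with exactly A or B."
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```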
The paper shows a fast, training-free way to boost an LLM’s step-by-step reasoning by reusing the probabilities the model already computes as it generates.
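
One common way to reuse those probabilities is confidence-based selection over sampled reasoning chains; the sketch below shows that generic idea, not necessarily the paper’s exact method. The `sample_with_logprobs` helper is a hypothetical stand-in for whatever sampling API you use.

```python
import math

def sample_with_logprobs(prompt: str) -> tuple[str, list[float]]:
    """Hypothetical helper: return (generated_text, per-token log-probs)."""
    raise NotImplementedError

def most_confident_answer(prompt: str, n_samples: int = 8) -> str:
    """Sample several chains; keep the one the model itself was surest of."""
    best_text, best_score = "", -math.inf
    for _ in range(n_samples):
        text, logprobs = sample_with_logprobs(prompt)
        score = sum(logprobs) / max(len(logprobs), 1)  # mean token log-prob
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```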
This paper shows how a model can act as its own teacher, letting it climb past a learning plateau on very hard math problems.
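
A generic self-training round in this spirit looks like the sketch below: the model generates attempts, a verifier keeps the correct ones, and the same model is fine-tuned on them. `Problem`, `generate`, and `fine_tune` are hypothetical placeholders under those assumptions, not the paper’s actual recipe.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    prompt: str
    check: Callable[[str], bool]  # verifier, e.g. compares the final answer

def self_training_round(
    generate: Callable[[str], str],                      # the current model, sampling
    fine_tune: Callable[[list[tuple[str, str]]], None],  # updates the same model
    problems: list[Problem],
    n_attempts: int = 16,
) -> None:
    """One round: generate attempts, keep verified ones, fine-tune on them."""
    kept: list[tuple[str, str]] = []
    for p in problems:
        for _ in range(n_attempts):
            solution = generate(p.prompt)
            if p.check(solution):          # only verified solutions survive
                kept.append((p.prompt, solution))
                break
    fine_tune(kept)                        # the model learns from its own output
```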