The paper shows a fast, training-free way to boost an LLM’s step-by-step reasoning by reusing the token probabilities the model already computes as it generates.
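One way to picture "reusing the model's own probabilities" without any training is confidence-weighted voting over sampled reasoning chains. The sketch below is only an illustration of that general idea, not necessarily the paper's method; `sample_chain` is a hypothetical helper standing in for any API that returns a final answer plus per-token log-probabilities.

```python
import math
from collections import defaultdict

def sample_chain(model, prompt):
    """Hypothetical helper: return (final_answer, token_logprobs) for one
    sampled chain of thought; any API exposing per-token log-probs works."""
    raise NotImplementedError

def confidence_weighted_answer(model, prompt, n_samples=8):
    """Training-free aggregation: sample several reasoning chains and weight
    each candidate answer by the model's own confidence in the chain
    (exponential of the mean token log-probability)."""
    scores = defaultdict(float)
    for _ in range(n_samples):
        answer, logprobs = sample_chain(model, prompt)
        confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
        scores[answer] += confidence  # higher-confidence chains count more
    return max(scores, key=scores.get)
```

Weighting by the mean log-probability rather than the raw sequence probability keeps longer chains from being penalized just for their length.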
LLMs are usually trained by treating every question the same and giving each one the same number of tries, which wastes compute on easy problems and neglects hard ones.
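For contrast with that uniform budget, here is a minimal sketch of what a difficulty-aware allocation could look like; it is an illustration of the general idea, not necessarily what the paper proposes, and `solve_once` is a hypothetical single-attempt helper.

```python
def solve_once(model, question):
    """Hypothetical helper: run one attempt and return True if it is correct."""
    raise NotImplementedError

def allocate_attempts(model, questions, probe=4, extra_budget=64):
    """Spend a small probe budget on every question, then hand the remaining
    attempts to questions with intermediate pass rates, where extra tries are
    most informative (easy ones are already solved, hopeless ones rarely pay off)."""
    pass_rate = {q: sum(solve_once(model, q) for _ in range(probe)) / probe
                 for q in questions}
    # p * (1 - p) peaks at p = 0.5: neither trivially easy nor near-impossible.
    weight = {q: p * (1.0 - p) for q, p in pass_rate.items()}
    total = sum(weight.values()) or 1.0
    return {q: probe + round(extra_budget * w / total) for q, w in weight.items()}
```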
This paper teaches a model to be its own teacher so it can climb out of a learning plateau on very hard math problems.
The paper introduces Intervention Training (InT), a simple way for a language model to find and fix the first wrong step in its own reasoning using a short, targeted correction.
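Taken at face value, the description suggests a locate-and-patch loop. The sketch below shows one way such a loop could be wired up; the `first_incorrect_step`, `short_correction`, and `continue_from` helpers are hypothetical stand-ins, not the paper's actual components.

```python
def first_incorrect_step(steps):
    """Hypothetical verifier: index of the first wrong step, or None if all pass."""
    raise NotImplementedError

def short_correction(problem, prefix, wrong_step):
    """Hypothetical helper: write a brief, targeted correction for the wrong step."""
    raise NotImplementedError

def continue_from(model, problem, prefix, correction):
    """Hypothetical helper: regenerate the remaining steps after the correction."""
    raise NotImplementedError

def intervene(model, problem, steps, max_rounds=3):
    """Find the first wrong step, splice in a short correction, and let the
    model redo the reasoning from that point; repeat until the chain verifies."""
    for _ in range(max_rounds):
        bad = first_incorrect_step(steps)
        if bad is None:
            return steps  # every step now checks out
        prefix = steps[:bad]
        fix = short_correction(problem, prefix, steps[bad])
        steps = prefix + [fix] + continue_from(model, problem, prefix, fix)
    return steps
```

The loop keeps the verified prefix intact, so each round only pays for regenerating the part after the first mistake.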
DARC teaches big language models to get smarter by splitting training into two separate, more stable stages instead of one chaotic loop.
The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.
The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.
This paper teaches large language models (LLMs) to explore smarter by listening to their own gradients (the directions in which their parameters would update) rather than chasing random variety.
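One reading of "listening to their own gradients" is to rank candidate rollouts by how strongly they would update the policy, rather than by an entropy bonus. The PyTorch sketch below scores each candidate by the norm of its policy-gradient contribution and keeps the most informative ones; this is an illustrative interpretation, not the paper's algorithm, and `logprob_fn` is an assumed function returning a differentiable log-probability.

```python
import torch

def gradient_signal(policy, logprob_fn, candidate, advantage):
    """Size of the update this candidate would cause: the norm of the gradient
    of -(advantage * log pi(candidate)) with respect to the policy parameters."""
    policy.zero_grad()
    loss = -advantage * logprob_fn(policy, candidate)
    loss.backward()
    total = 0.0
    for p in policy.parameters():
        if p.grad is not None:
            total += float((p.grad ** 2).sum())
    return total ** 0.5

def pick_informative(policy, logprob_fn, candidates, advantages, k=4):
    """Keep the k candidates whose gradients would move the policy the most,
    instead of adding variety through an entropy bonus."""
    scored = [(gradient_signal(policy, logprob_fn, c, a), i)
              for i, (c, a) in enumerate(zip(candidates, advantages))]
    scored.sort(reverse=True)
    return [candidates[i] for _, i in scored[:k]]
```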
Reasoning tokens (the words a model writes before its final answer) help the model think better, but they are not a trustworthy diary of how it really thought.