Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
Large language models (LLMs) don't act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.
The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
This paper teaches a language model to think along several paths at the same time instead of one step after another.