Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
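To see why a binary reward is sparse, here is a minimal sketch (not the paper's actual setup; the helpers `binary_reward` and `shaped_reward` are hypothetical): an exact-match 0/1 score gives a nearly-correct solution the same zero signal as a blank response, while a partial-credit variant gives denser feedback.

```python
def binary_reward(model_answer: str, gold_answer: str) -> float:
    """Verifiable 0/1 reward: 1 only if the final answer matches exactly."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def shaped_reward(model_steps: list[str], gold_steps: list[str]) -> float:
    """Illustrative denser alternative: fraction of reference steps reproduced."""
    hits = sum(1 for step in gold_steps if step in model_steps)
    return hits / max(len(gold_steps), 1)

# With the binary reward, a long solution that goes wrong only at the very
# last step earns exactly the same 0.0 as an empty answer, which is what
# makes the signal sparse to learn from.
```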
Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.
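The contrast can be made concrete with a generic reinforcement-learning sketch (not this paper's algorithm): judging each message in isolation ignores how an early turn sets up a later one, whereas a whole-conversation return credits it.

```python
def per_turn_scores(turn_rewards: list[float]) -> list[float]:
    """Judge each message in isolation: the score for turn t ignores later turns."""
    return list(turn_rewards)

def conversation_returns(turn_rewards: list[float], gamma: float = 0.95) -> list[float]:
    """Discounted return-to-go: turn t also gets credit for what follows it."""
    returns, running = [0.0] * len(turn_rewards), 0.0
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns

# A clarifying question (reward 0 now) that enables a correct final answer
# (reward 1 later) looks worthless per-turn but valuable as a return.
print(per_turn_scores([0.0, 0.0, 1.0]))       # [0.0, 0.0, 1.0]
print(conversation_returns([0.0, 0.0, 1.0]))  # [~0.90, 0.95, 1.0]
```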
The paper shows that teaching a language model with a special “reward-shaped” next-token objective can make later reinforcement learning (RL) work much better.
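One way to picture a "reward-shaped" next-token objective, as a rough sketch rather than the paper's exact loss (the function `reward_weighted_nll` is hypothetical): weight each token's cross-entropy term by a reward for its sequence, so that supervised training already pushes probability toward high-reward continuations before RL starts.

```python
import torch
import torch.nn.functional as F

def reward_weighted_nll(logits: torch.Tensor,
                        targets: torch.Tensor,
                        rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative reward-shaped next-token loss.

    logits:  (batch, seq_len, vocab) model predictions
    targets: (batch, seq_len) next-token ids
    rewards: (batch,) scalar reward per sequence, e.g. from a verifier
    """
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)                     # per-token loss, (batch, seq_len)
    # Up-weight tokens from high-reward sequences and down-weight the rest,
    # so the next-token stage already resembles the later RL objective.
    return (rewards.unsqueeze(1) * nll).mean()
```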
Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.
The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
This paper teaches a language model to think along several reasoning paths at the same time instead of following a single chain of steps one after another.
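The difference from ordinary step-by-step decoding can be sketched with generic parallel sampling (the `generate` and `score` helpers are hypothetical, and the paper's actual mechanism may live inside the model rather than at the sampling level):

```python
def sequential_reasoning(prompt: str, generate, steps: int = 5) -> str:
    """Ordinary decoding: extend one chain of thought, one step at a time."""
    chain = prompt
    for _ in range(steps):
        chain += generate(chain)          # a single path, advanced step by step
    return chain

def parallel_reasoning(prompt: str, generate, score,
                       width: int = 4, steps: int = 5) -> str:
    """Keep several candidate paths alive at once and return the best-scoring one."""
    paths = [prompt] * width              # generate is assumed stochastic, so paths diverge
    for _ in range(steps):
        paths = [p + generate(p) for p in paths]   # all paths advance together
    return max(paths, key=score)
```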