Large language models are usually evaluated one response at a time, but many real tasks require planning across an entire multi-turn conversation.
The paper shows that first training the model with a "reward-shaped" next-token objective makes subsequent reinforcement learning (RL) substantially more effective.
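The summary above doesn't spell out the objective, but one natural reading of "reward-shaped next-token objective" is a per-token cross-entropy loss weighted by a shaped reward credited back through the conversation. Below is a minimal sketch under that assumption; the function name `reward_shaped_nll` and the `shaped_rewards` tensor are illustrative, not the paper's actual API or formulation.

```python
import torch
import torch.nn.functional as F

def reward_shaped_nll(logits, targets, shaped_rewards):
    """Per-token next-token loss, scaled by a shaped reward per position.

    logits:         (batch, seq_len, vocab) model outputs
    targets:        (batch, seq_len) next-token ids
    shaped_rewards: (batch, seq_len) non-negative weights, e.g. a
                    conversation-level return credited back to each token
    """
    # Standard next-token negative log-likelihood, kept per token
    # rather than averaged, so each position can be reweighted.
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # Reward shaping: tokens associated with better conversation
    # outcomes contribute more to the supervised objective.
    return (shaped_rewards * nll).mean()

# Toy usage with random tensors standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (batch, seq_len))
shaped_rewards = torch.rand(batch, seq_len)  # placeholder shaped credit
loss = reward_shaped_nll(logits, targets, shaped_rewards)
loss.backward()
```

The appeal of this kind of objective is that it stays a cheap, supervised next-token loss while still injecting a conversation-level reward signal, so the model starts RL with credit assignment partially learned rather than from scratch.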