This paper teaches AI models to learn like good students: try, think about what went wrong, fix it, and remember the fix.
Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.
The paper shows that first training a language model with a special "reward-shaped" next-token objective can make later reinforcement learning (RL) work much better.
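The summary does not spell out the objective, but one common way to "reward-shape" a next-token loss is to weight each token's cross-entropy by a reward signal. The sketch below is a generic illustration of that idea under that assumption, not the paper's exact formula; the function name and shapes are hypothetical.

```python
import numpy as np

def reward_shaped_nll(logits, targets, rewards):
    """Per-token cross-entropy, weighted by a (possibly shaped) reward.

    logits:  (T, V) unnormalized scores over a vocabulary of size V
    targets: (T,)   index of the correct next token at each step
    rewards: (T,)   per-token reward weights (1.0 everywhere = plain NLL)
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -log_probs[np.arange(len(targets)), targets]
    # tokens with higher reward contribute more to the training signal
    return float((rewards * nll).mean())
```

With all rewards set to 1 this reduces to the ordinary next-token loss, so reward shaping here means up- or down-weighting tokens before RL fine-tuning begins.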