This paper teaches AI models to learn like good students: try, think about what went wrong, fix it, and remember the fix.
When rewards are rare, a popular training method for language models (GRPO) often stops learning because every try in a group gets the same score, so there is nothing to compare.
The paper shows that big sequence models (like transformers) quietly learn longer goals inside their hidden activations, even though they are trained one step at a time.