MatchTIR trains tool-using AI agents by scoring each tool call individually, instead of giving every step in a trajectory the same reward.
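The contrast between a single shared reward and per-step scoring can be sketched as below. This is an illustrative toy, not MatchTIR's actual algorithm; the names `trajectory`, `judge`, and the 0/1 scoring rule are all assumptions made up for the example.

```python
# Toy illustration of trajectory-level vs. step-level credit assignment.
# All names and the scoring rule are hypothetical, not from the paper.

def uniform_credit(trajectory, final_reward):
    """Baseline: every step inherits the same trajectory-level reward."""
    return [final_reward] * len(trajectory)

def stepwise_credit(trajectory, judge):
    """Step-wise: each tool call is scored on its own merits."""
    return [judge(step) for step in trajectory]

trajectory = [
    {"tool": "search", "query": "population of France", "ok": True},
    {"tool": "search", "query": "populatoin of Frnace", "ok": False},  # bad call
    {"tool": "calculator", "expr": "67e6 * 1.002", "ok": True},
]

# Hypothetical judge: 1.0 for a successful call, 0.0 for a failed one.
judge = lambda step: 1.0 if step["ok"] else 0.0

print(uniform_credit(trajectory, 1.0))    # the failed call is rewarded anyway
print(stepwise_credit(trajectory, judge)) # the failed call gets zero credit
```

With a uniform reward, the garbled second search is reinforced just as strongly as the good calls; step-wise scoring is what lets training penalize it.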
AT2PO is a new method for training AI agents that act over several turns, such as asking the web a question, reading the result, and trying again.
Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
This paper shows that when teaching image generators with reinforcement learning, only a few early, high-noise generation steps actually carry the signal for learning what people like.
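The idea of restricting the RL update to the early, noisy steps can be sketched as follows. This is a minimal assumed sketch, not the paper's method: the function name, the fixed `cutoff` threshold, and the REINFORCE-style loss are all illustrative choices.

```python
# Hypothetical sketch: apply the policy-gradient update only to early,
# high-noise steps, masking out the later steps that (per the finding
# summarized above) contribute little preference-learning signal.
# The cutoff value and loss form are illustrative assumptions.

def masked_policy_loss(log_probs, advantages, step_indices, cutoff=5):
    """Average REINFORCE-style loss over steps before `cutoff` only."""
    total, count = 0.0, 0
    for lp, adv, t in zip(log_probs, advantages, step_indices):
        if t < cutoff:           # keep only the early (noisy) steps
            total += -lp * adv   # standard policy-gradient term
            count += 1
    return total / max(count, 1)

# Ten generation steps, identical per-step values, so masking is visible.
log_probs = [-1.0] * 10
advantages = [1.0] * 10
loss = masked_policy_loss(log_probs, advantages, list(range(10)))
print(loss)
```

Only steps 0 through 4 contribute here; the later, low-noise steps are excluded from the gradient entirely.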