The paper fixes a common mistake in training language models for multi-part tasks: giving the same reward signal to every token, even when different parts of the text serve different goals.
MatchTIR trains tool-using AI agents by judging each tool call step by step instead of giving the same reward to every step.
AT2PO is a new way to train AI agents that work in several turns, like asking the web a question, reading the result, and trying again.
Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
This paper shows that when image generators are trained with reinforcement learning, only a few early, very noisy steps actually help the model learn what people like.