The paper fixes a common mistake in training language models for multi-part tasks: giving every token the same reward signal, even when different parts of the text serve different goals.
This paper shows that when fine-tuning image generators with reinforcement learning, only a few early, very noisy generation steps actually help the model learn what people prefer.
The paper studies why two opposite-sounding tricks in RL for reasoning, adding random (spurious) rewards and reducing randomness (entropy), can both seem to help large language models reason better.