Binary right/wrong rewards for training large language models to reason are hard to design and often too sparse to learn from.
The paper studies why two seemingly opposite tricks in reinforcement learning (RL) for reasoning, adding random (spurious) rewards and reducing randomness (entropy), can both appear to help large language models reason better.