This paper tackles a common problem in reasoning models called Lazy Reasoning, where the model rambles through its chain of thought instead of committing to a clear plan.
When a model is trained on several rewards at once, the popular GRPO method can accidentally collapse different mixes of reward components into one and the same learning signal, which muddles training.
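To make that collapse concrete, here is a minimal numerical sketch. It assumes the usual GRPO recipe of summing reward components into a single scalar and then normalizing within a sampled group; the component names (correctness, plan quality) and the numbers are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical per-completion reward components for one GRPO group of 4 samples.
# Column 0 = correctness reward, column 1 = plan-quality reward (illustrative only).
rewards = np.array([
    [1.0, 0.0],   # correct answer, no real plan
    [0.0, 1.0],   # wrong answer, but a well-structured plan
    [0.5, 0.5],   # partially correct, partial plan
    [0.0, 0.0],   # neither
])

# Scalarize by summing the components, then normalize within the group
# (the group-relative advantage used by GRPO).
total = rewards.sum(axis=1)                        # -> [1.0, 1.0, 1.0, 0.0]
advantage = (total - total.mean()) / (total.std() + 1e-8)

print(total)      # the first three completions collapse to the same total reward...
print(advantage)  # ...so they all receive the identical advantage, i.e. the same learning signal
```

In this toy group, three completions with very different behavior end up indistinguishable once the rewards are summed and normalized, which is the kind of squashing the sentence above describes.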