This paper teaches a model to write its own helpful hints (sub-questions) and then use those hints to learn better through reinforcement learning in which answers are checked automatically.
Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.
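As a rough illustration of the group-average baseline mentioned above, here is a minimal sketch of how a group-relative advantage is commonly computed. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper; implementations differ on these points.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8, normalize=True):
    """Score each sampled answer relative to the mean reward of its own group.

    Hypothetical helper for illustration only. `rewards` holds one scalar
    reward per answer sampled for the same prompt.
    """
    baseline = mean(rewards)  # the group's average reward is the baseline
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Assumption: divide by the population standard deviation of the group;
    # eps avoids division by zero when every reward in the group is equal.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Four answers to one prompt: two judged correct (1.0), two judged wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Note that the baseline is computed from the same samples it is then subtracted from, so each answer's own reward contributes to its baseline.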
Large AI models often write very long step-by-step solutions, but typical checkers either check only the final answer or get lost in the long steps.