ReGFT is a simple pre-RL step that shows the model partial human hints and then has it solve problems in its own words, producing correct, model-style solutions for hard questions.
This paper teaches a language model to improve its own math answers by first writing several drafts and then learning to beat its best draft.
This paper teaches a model to make its own helpful hints (sub-questions) and then use those hints to learn better with reinforcement learning that checks answers automatically.
Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that choice makes its advantage estimates biased.
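A minimal sketch of the baseline issue mentioned above (function names are illustrative, not from any of these papers): when a sample's own reward is part of the group mean it is compared against, its advantage is shrunk by a factor of (1 - 1/n); a leave-one-out baseline, assumed here as the standard unbiased alternative, avoids this.

```python
def group_advantages(rewards):
    """GRPO-style advantage: reward minus the full group's mean reward.

    Each sample's own reward is included in the mean, which biases
    the estimate toward zero.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def loo_advantages(rewards):
    """Leave-one-out baseline: compare each sample only to the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# One correct answer (reward 1.0) among four samples:
rewards = [1.0, 0.0, 0.0, 0.0]
print(group_advantages(rewards))  # [0.75, -0.25, -0.25, -0.25]
print(loo_advantages(rewards))    # [1.0, -1/3, -1/3, -1/3]
```

With the group-mean baseline, the correct sample's advantage is only 0.75 because its own reward inflates the baseline; the leave-one-out version restores the full 1.0 signal.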
Large AI models often write very long step-by-step solutions, but typical checkers either verify only the final answer or get lost in the long chain of steps.