The paper finds a hidden symmetry inside GRPO's advantage calculation that accidentally stops models from exploring new good answers and from paying the right attention to easy versus hard problems at the right times.
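For readers unfamiliar with the advantage calculation mentioned above, here is a minimal sketch of the commonly described group-relative form: each sampled answer's reward is standardized against the other answers to the same prompt. The function name, the `eps` term, the use of population standard deviation, and the toy 0/1 rewards are all assumptions made for illustration. Whether the mirror-image pattern shown here is exactly the symmetry the paper analyzes is also an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of answers sampled for the same prompt.

    Hypothetical helper for illustration; real implementations may differ
    (for example, sample vs. population standard deviation).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy groups with binary rewards (1.0 = correct, 0.0 = incorrect).
hard_prompt = [1.0, 0.0, 0.0, 0.0]  # one correct answer out of four
easy_prompt = [1.0, 1.0, 1.0, 0.0]  # three correct answers out of four

adv_hard = group_relative_advantages(hard_prompt)
adv_easy = group_relative_advantages(easy_prompt)

# Within a group, positive and negative advantages cancel out.
print(sum(adv_hard), sum(adv_easy))

# The hard and easy groups receive mirror-image advantages of equal size.
print(adv_hard)
print(adv_easy)
```

In this toy case the advantages in each group sum to zero, and a prompt solved one time in four gets the same-sized signal as a prompt solved three times in four, only with the signs flipped.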
The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.
GTR-Turbo teaches a vision-language agent using a 'free teacher' made by merging its own past checkpoints, so no costly external model is needed.