The paper finds a hidden symmetry inside GRPOโs advantage calculation that accidentally stops models from exploring new good answers and from paying the right attention to easy versus hard problems at the right times.
SAGE is a two-agent system that automatically writes tough, multi-step search questions and checks them by actually trying to solve them.