FIRE-Bench is a new benchmark that checks whether AI agents can reproduce real scientific discoveries in full, step by step, rather than just guessing the answers.
The paper shows that when an LLM is trained with spurious (misleading) rewards in RLVR (reinforcement learning with verifiable rewards), it can score higher by memorizing answers instead of reasoning.
X-Coder shows that models can learn expert-level competitive programming from data that is 100% synthetic, with no real contest problems needed.