The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.
This paper says that to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.
MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.
Falcon-H1R is a small (7B) AI model that thinks really well without needing giant computers.
The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.