The paper shows that scaling up a language model's embedding layer (especially with n-gram embeddings) can beat adding more MoE experts once sparsity passes a certain 'sweet spot.'
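The intuition is that an embedding table is a sparse lookup: you can make it enormous (e.g. extra rows keyed by hashed n-grams) while each token still fetches only a row or two, so parameters grow without the per-token compute cost of extra experts. Below is a minimal, hypothetical PyTorch sketch of that idea; the class name, rolling-hash scheme, and sizes are my own illustrative assumptions, not the paper's implementation.

```python
# Sketch: augment each token's embedding with an embedding for the n-gram ending
# at that token. The n-gram table can be made very large, but lookups stay sparse,
# so per-token compute barely changes. Toy hashing; the real method would differ.
import torch
import torch.nn as nn


class NGramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, ngram_table_size: int, d_model: int, n: int = 2):
        super().__init__()
        self.n = n
        self.table_size = ngram_table_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)          # usual token embeddings
        self.ngram_emb = nn.Embedding(ngram_table_size, d_model)  # large hashed n-gram table

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids
        x = self.tok_emb(ids)
        # Rolling hash of the n-gram ending at each position, mapped into the big table.
        h = ids.clone()
        for k in range(1, self.n):
            shifted = torch.roll(ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # positions without a full n-gram get a padding id
            h = h * 1000003 + shifted
        return x + self.ngram_emb(h % self.table_size)


if __name__ == "__main__":
    emb = NGramAugmentedEmbedding(vocab_size=32000, ngram_table_size=1_000_000, d_model=64)
    out = emb(torch.randint(0, 32000, (2, 16)))
    print(out.shape)  # torch.Size([2, 16, 64])
```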
MiMo-V2-Flash is a large but efficient language model that uses a mixture-of-experts (MoE) design to think well while staying fast.
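The reason an MoE layer can be both big and fast is that a router sends each token to only a few small expert networks, so total parameters are huge while per-token compute stays modest. The sketch below shows generic top-k routing in PyTorch; it is an assumption-laden toy of the standard technique, not MiMo-V2-Flash's actual layer.

```python
# Generic top-k MoE feed-forward layer: many experts exist, but each token only
# runs through k of them, keeping per-token compute low.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                      # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    moe = TopKMoE(d_model=32, d_hidden=64, num_experts=8, k=2)
    y = moe(torch.randn(10, 32))
    print(y.shape)  # torch.Size([10, 32])
```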
Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and more cheaply.