The paper shows that scaling up a language model's embedding layer (especially with n-gram embeddings) can beat adding more MoE experts once sparsity passes a certain 'sweet spot.'
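The intuition is that an embedding table is a sparse lookup: you can make it enormous (e.g. extra rows keyed by hashed n-grams) while each token still fetches only a row or two, so parameters grow without the per-token compute cost of extra experts. Below is a minimal, hypothetical PyTorch sketch of that idea; the class name, rolling-hash scheme, and sizes are my own illustrative assumptions, not the paper's implementation.

```python
# Sketch: augment each token's embedding with an embedding for the n-gram ending
# at that token. The n-gram table can be made very large, but lookups stay sparse,
# so per-token compute barely changes. Toy hashing; the real method would differ.
import torch
import torch.nn as nn


class NGramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, ngram_table_size: int, d_model: int, n: int = 2):
        super().__init__()
        self.n = n
        self.table_size = ngram_table_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)          # usual token embeddings
        self.ngram_emb = nn.Embedding(ngram_table_size, d_model)  # large hashed n-gram table

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids
        x = self.tok_emb(ids)
        # Rolling hash of the n-gram ending at each position, mapped into the big table.
        h = ids.clone()
        for k in range(1, self.n):
            shifted = torch.roll(ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # positions without a full n-gram get a padding id
            h = h * 1000003 + shifted
        return x + self.ngram_emb(h % self.table_size)


if __name__ == "__main__":
    emb = NGramAugmentedEmbedding(vocab_size=32000, ngram_table_size=1_000_000, d_model=64)
    out = emb(torch.randint(0, 32000, (2, 16)))
    print(out.shape)  # torch.Size([2, 16, 64])
```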
MiMo-V2-Flash is a large but efficient language model that uses a mixture-of-experts (MoE) design to think well while staying fast.
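The reason an MoE layer can be both big and fast is that a router sends each token to only a few small expert networks, so total parameters are huge while per-token compute stays modest. The sketch below shows generic top-k routing in PyTorch; it is an assumption-laden toy of the standard technique, not MiMo-V2-Flash's actual layer.

```python
# Generic top-k MoE feed-forward layer: many experts exist, but each token only
# runs through k of them, keeping per-token compute low.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                      # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    moe = TopKMoE(d_model=32, d_hidden=64, num_experts=8, k=2)
    y = moe(torch.randn(10, 32))
    print(y.shape)  # torch.Size([10, 32])
```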
Nemotron 3 is a new family of open AI models (Nano, Super, Ultra) built to think better while running faster and more cheaply.