Scaling Embeddings Outperforms Scaling Experts in Language Models
Intermediate · Hong Liu, Jiaqi Zhang et al. · Jan 29 · arXiv
The paper shows that scaling a language model's embedding layers (especially with n-gram embeddings) can outperform adding more MoE experts once sparsity passes a certain "sweet spot" (a minimal sketch of the n-gram-embedding idea follows the tags).
#N-gram Embedding  #Mixture-of-Experts (MoE)  #Embedding Scaling
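To make "growing the embedding part with n-grams" concrete, here is a minimal sketch, not the paper's implementation: each token looks up its usual embedding plus a hashed embedding for the bigram ending at it. The module name, bucket count, and hashing scheme are illustrative assumptions.

```python
# Minimal sketch (PyTorch): token embeddings augmented with hashed bigram
# embeddings. Illustrative assumptions only -- not the paper's method.
import torch
import torch.nn as nn


class NGramAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_buckets: int = 1_000_000):
        super().__init__()
        self.num_buckets = num_buckets
        self.tok_emb = nn.Embedding(vocab_size, d_model)     # standard token table
        self.ngram_emb = nn.Embedding(num_buckets, d_model)  # large hashed n-gram table

    def _bigram_ids(self, ids: torch.Tensor) -> torch.Tensor:
        # Hash the (previous token, current token) pair into a bucket id.
        prev = torch.roll(ids, shifts=1, dims=-1)
        prev[..., 0] = ids[..., 0]  # first position has no predecessor
        # Simple mixing hash; any hash function works for the sketch.
        return (prev * 1_000_003 + ids) % self.num_buckets

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # "Scaling embeddings" here means growing num_buckets: more parameters
        # in the lookup tables, but no extra FLOPs inside the transformer blocks.
        return self.tok_emb(ids) + self.ngram_emb(self._bigram_ids(ids))


if __name__ == "__main__":
    emb = NGramAugmentedEmbedding(vocab_size=32_000, d_model=256)
    tokens = torch.randint(0, 32_000, (2, 16))  # (batch, seq_len)
    print(emb(tokens).shape)                    # torch.Size([2, 16, 256])
```

The design point this is meant to illustrate: an n-gram table grows model capacity through cheap O(1) lookups, which is the kind of embedding-side scaling the paper contrasts with adding more MoE experts.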