Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts
Intermediate · Xuan-Phi Nguyen, Shrey Pandit et al. · Jan 23 · arXiv
Mixture-of-Experts (MoE) models often send far more tokens to a few “favorite” experts, which overloads some GPUs while others sit idle.
#Mixture-of-Experts #Expert Parallelism #Least-Loaded Expert Parallelism