Mixture-of-Experts (MoE) models often send far more tokens to a few "favorite" experts, which overloads some GPUs while others sit idle.
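Below is a minimal sketch, not any specific library's implementation, of top-1 token-to-expert routing. The gating matrix, token batch, and the skewed bias (used to simulate "favorite" experts) are all hypothetical placeholders; the point is only to show how the per-expert token counts can end up far from uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 1024, 64, 8

tokens = rng.normal(size=(num_tokens, d_model))
# Hypothetical gating weights; the skewed bias makes experts 0-1 win more often.
w_gate = rng.normal(size=(d_model, num_experts))
bias = np.linspace(2.0, 0.0, num_experts)

logits = tokens @ w_gate + bias
expert_ids = logits.argmax(axis=-1)                 # top-1 routing decision per token

counts = np.bincount(expert_ids, minlength=num_experts)
print("tokens per expert:", counts)                 # heavily skewed toward the early experts
print("ideal balanced load:", num_tokens // num_experts)
```

If each expert lives on its own GPU, the devices holding the overloaded experts become the bottleneck while the rest wait, which is why MoE systems typically add load-balancing losses or capacity limits on top of routing like this.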
Transformers are powerful but slow because standard self-attention compares every token with every other token, so compute and memory grow quadratically with sequence length.
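The following is a minimal NumPy sketch of standard full self-attention (single head, no masking), written to make the quadratic term visible: the score matrix has shape (n, n), so doubling the sequence length quadruples its size. All shapes and weights here are illustrative assumptions, not a particular model's configuration.

```python
import numpy as np

def full_self_attention(x, w_q, w_k, w_v):
    """Single-head full attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # (n, n): every token vs. every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 512, 64                                           # doubling n quadruples the score matrix
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = full_self_attention(x, w_q, w_k, w_v)
print(out.shape)                                         # (512, 64), via a (512, 512) score matrix
```

This O(n^2) score matrix is the cost that efficient-attention variants (sparse, linear, or chunked attention) try to avoid for long sequences.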