Mixture-of-Experts (MoE) models often send far more tokens to a few "favorite" experts, which overloads some GPUs while others sit idle.
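Below is a minimal sketch, not any specific library's implementation, of top-1 token-to-expert routing. The gating matrix, token batch, and the skewed bias (used to simulate "favorite" experts) are all hypothetical placeholders; the point is only to show how the per-expert token counts can end up far from uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 1024, 64, 8

tokens = rng.normal(size=(num_tokens, d_model))
# Hypothetical gating weights; the skewed bias makes experts 0-1 win more often.
w_gate = rng.normal(size=(d_model, num_experts))
bias = np.linspace(2.0, 0.0, num_experts)

logits = tokens @ w_gate + bias
expert_ids = logits.argmax(axis=-1)                 # top-1 routing decision per token

counts = np.bincount(expert_ids, minlength=num_experts)
print("tokens per expert:", counts)                 # heavily skewed toward the early experts
print("ideal balanced load:", num_tokens // num_experts)
```

If each expert lives on its own GPU, the devices holding the overloaded experts become the bottleneck while the rest wait, which is why MoE systems typically add load-balancing losses or capacity limits on top of routing like this.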
Transformers are powerful but slow because standard self-attention compares every token with every other token, so compute and memory grow quadratically with sequence length.
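The following is a minimal NumPy sketch of standard full self-attention (single head, no masking), written to make the quadratic term visible: the score matrix has shape (n, n), so doubling the sequence length quadruples its size. All shapes and weights here are illustrative assumptions, not a particular model's configuration.

```python
import numpy as np

def full_self_attention(x, w_q, w_k, w_v):
    """Single-head full attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # (n, n): every token vs. every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 512, 64                                           # doubling n quadruples the score matrix
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = full_self_attention(x, w_q, w_k, w_v)
print(out.shape)                                         # (512, 64), via a (512, 512) score matrix
```

This O(n^2) score matrix is the cost that efficient-attention variants (sparse, linear, or chunked attention) try to avoid for long sequences.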