Janus: Disaggregating Attention and Experts for Scalable MoE Inference
Intermediate · Zhexiang Zhang, Ye Wang et al. · Dec 15 · arXiv
Janus splits a Mixture-of-Experts (MoE) model into two parts, attention and experts, so that each part can be given the number of GPUs it needs.
#Mixture-of-Experts inference  #disaggregated serving  #activation load balancing