Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Beginner · Jinrui Zhang, Chaodong Xiao et al. · Feb 12 · arXiv
Training large language models typically requires expensive, tightly interconnected GPU clusters that most researchers and organizations cannot access.
#decentralized LLM pretraining #mixture-of-experts (MoE) #sparse expert synchronization