The paper finds that a simple trick, randomly skipping some parameter updates, can train large language models better than fancy optimizers.
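To make "randomly skipping some parameter updates" concrete, here is a minimal sketch. It is not the paper's algorithm: the summary does not say how updates are skipped, so the per-parameter coin flip, the plain SGD rule, and the names `masked_sgd_step` and `skip_prob` are all assumptions made for illustration.

```python
import random


def masked_sgd_step(params, grads, lr=0.1, skip_prob=0.5, rng=None):
    """One SGD step in which each parameter's update is skipped at random.

    Hypothetical helper: the per-parameter granularity and the plain SGD
    update are assumptions, not details taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    new_params = []
    for p, g in zip(params, grads):
        if rng.random() < skip_prob:
            # Update skipped: the parameter keeps its current value.
            new_params.append(p)
        else:
            # Ordinary gradient step for the parameters that are not skipped.
            new_params.append(p - lr * g)
    return new_params
```

With `skip_prob=0.0` this reduces to ordinary SGD, and with `skip_prob=1.0` no parameter moves at all; the trick described above sits somewhere between those two extremes.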
Mixture-of-Experts (MoE) models use many small specialist networks (experts) and a router to pick which experts handle each token, but the router isn’t explicitly taught what each expert is good at.
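The routing step can be sketched in a few lines. This is a generic top-k router written under stated assumptions, not the specific model discussed here: the linear scoring, the softmax gate, the renormalization over the chosen experts, and the names `route_token`, `router_weights`, and `top_k` are all illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, experts, top_k=2):
    """Send one token to its top_k experts and mix their outputs.

    Hypothetical sketch: `router_weights` holds one weight row per expert,
    and each expert is a callable mapping a vector to a vector of equal size.
    """
    # Router: one linear score per expert, turned into probabilities.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    # Keep the top_k experts with the highest routing probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm  # gates renormalized over the chosen experts
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return chosen, output
```

Nothing in this sketch tells the router what any expert is good at. In a typical setup the router weights are learned only indirectly, through the loss on the mixed output, which is the gap the sentence above points to.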