Papers (2)

#LLM pretraining

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo, Wenhan Xia et al. · Feb 17 · arXiv

The paper finds that a simple trick, randomly skipping some parameter updates in an adaptive optimizer, can train large language models better than more elaborate optimizers.

#Magma #random masking #adaptive optimizers

Not triaged yet
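
To make the idea of "randomly skipping some parameter updates" concrete, here is a minimal sketch of an Adam-style step with a random per-coordinate mask. This is a generic illustration, not the paper's algorithm: the function name `masked_adam_step`, the `keep_prob` parameter, and the choice to keep updating the moment estimates on masked coordinates are all assumptions made for this example.

```python
import math
import random


def masked_adam_step(params, grads, m, v, t, rng, lr=1e-3, beta1=0.9,
                     beta2=0.999, eps=1e-8, keep_prob=0.5):
    """One Adam-style step; each coordinate's update is applied with probability keep_prob.

    Hypothetical helper for illustration only. m and v are updated in place.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Assumption: moment estimates are refreshed even when the update is skipped.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Standard Adam bias correction.
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
        # Random mask: skip this coordinate's update with probability 1 - keep_prob.
        keep = rng.random() < keep_prob
        new_params.append(p - update if keep else p)
    return new_params


if __name__ == "__main__":
    rng = random.Random(0)
    params = [1.0, -2.0, 0.5]
    m = [0.0] * len(params)
    v = [0.0] * len(params)
    for t in range(1, 4):
        grads = [2 * p for p in params]  # gradient of sum(p^2)
        params = masked_adam_step(params, grads, m, v, t, rng)
    print(params)
```

Setting `keep_prob=1.0` recovers an ordinary Adam-style step, which makes it easy to compare masked and unmasked runs on the same toy problem.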

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Intermediate

Ang Lv, Jin Ma et al. · Dec 29 · arXiv

Mixture-of-Experts (MoE) models use many small specialist networks (experts) and a router to pick which experts handle each token, but the router isn’t explicitly taught what each expert is good at.

#Mixture-of-Experts #expert-router coupling #auxiliary loss

Not triaged yet
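
The routing step described in the summary above can be sketched as follows. This shows only a common top-k softmax router, assumed here for illustration; it does not implement the auxiliary loss the paper proposes, and the names `route_token` and `router_weights` are hypothetical.

```python
import math


def route_token(token, router_weights, k=2):
    """Pick the top-k experts for one token and return (expert_index, gate_weight) pairs.

    Hypothetical sketch of a softmax top-k router; router_weights has one row per expert.
    """
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Numerically stable softmax over experts.
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their gate weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


if __name__ == "__main__":
    token = [1.0, 0.0]
    router_weights = [[2.0, 0.0], [0.0, 5.0], [1.0, 0.0]]  # three experts
    print(route_token(token, router_weights, k=2))
```

Note that nothing in this router looks at what the experts actually compute: the gate weights depend only on the token and the router's own parameters, which is the gap the summary points to.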