On the Surprising Effectiveness of Masking Updates in Adaptive Optimizers
Intermediate · Taejong Joo, Wenhan Xia et al. · Feb 17 · arXiv
The paper finds that a simple trick, randomly skipping some parameter updates, can train large language models better than more sophisticated adaptive optimizers.
#Magma #random masking #adaptive optimizers
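A minimal sketch of the masking idea described in the summary, assuming an Adam-style update; the function name, the `mask_prob` parameter, and the coordinate-wise Bernoulli mask are illustrative assumptions, not the paper's exact method:

```python
import torch

def masked_adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, mask_prob=0.5):
    """One Adam-style step where a random mask skips a fraction of
    per-coordinate updates (all names and defaults are illustrative)."""
    m, v = state["m"], state["v"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    update = lr * m / (v.sqrt() + eps)
    # Randomly zero out (i.e., skip) a subset of the coordinate-wise updates.
    mask = (torch.rand_like(param) > mask_prob).to(param.dtype)
    param.sub_(update * mask)
```

In this sketch, masked coordinates keep their current values for that step while the optimizer state still accumulates gradient statistics; how the paper's Magma method actually defines and applies the mask should be checked against the paper itself.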