Progressive Residual Warmup for Language Model Pretraining
Intermediate · Tianhao Chen, Xin Xu et al. · Mar 5 · arXiv
Training large Transformers can be unstable in the early steps because every layer starts learning at the same time.
#Progressive Residual Warmup #ProRes #Transformer training stability