Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
Intermediate · Maksim Velikanov, Ilyas Chahed et al. · Jan 8 · arXiv
The paper shows that in large language models the scale of the weight matrices is often pinned by training hyperparameters, settling at the equilibrium between gradient noise and weight decay rather than at the scale the data prefers, which quietly hurts performance.
#learnable multipliers · #weight decay · #noise–WD equilibrium
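
The idea in the title lends itself to a quick illustration. Below is a minimal sketch, not the paper's implementation: a PyTorch linear layer with a learnable scalar multiplier that is excluded from weight decay, so the layer's effective scale can follow the data instead of the noise–weight-decay equilibrium. Names such as `ScaledLinear` and `init_scale` are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): a linear layer whose
# output is rescaled by a learnable scalar g, with weight decay applied only
# to the matrix weights so the multiplier is free to set the effective scale.

import torch
import torch.nn as nn


class ScaledLinear(nn.Module):
    """Linear layer with a learnable scalar multiplier on its output."""

    def __init__(self, in_features: int, out_features: int, init_scale: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        # Learnable multiplier; kept out of weight decay (an assumed convention).
        self.g = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.g * self.linear(x)


if __name__ == "__main__":
    model = ScaledLinear(512, 512)

    # Decay the matrix weights only; the multiplier gets zero weight decay.
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        (no_decay if name.endswith("g") else decay).append(p)
    opt = torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.1},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=1e-3,
    )

    y = model(torch.randn(8, 512))
    print(y.shape)  # torch.Size([8, 512])
```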