Stronger Normalization-Free Transformers
Intermediate | Mingzhi Chen, Taiming Lu et al. | Dec 11 | arXiv
This paper shows that Transformers can still be trained well without normalization layers when those layers are replaced by a simple point‑wise function called Derf.
#Normalization‑free Transformers #LayerNorm replacement #Point‑wise activation
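
To make "point‑wise" concrete, here is a minimal sketch of the difference between a normalization layer and an element-wise replacement. The summary above does not give the formula for Derf, so the erf-based form below, the function name `derf_like`, and the parameters `alpha`, `shift`, `gamma`, and `beta` are illustrative assumptions, not the authors' definition.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm over one token vector (no affine parameters).
    Each output depends on the mean and variance of the whole vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def derf_like(x, alpha=0.5, shift=0.0, gamma=None, beta=None):
    """Hypothetical erf-based point-wise map (an assumption, not the paper's
    stated formula): y_i = gamma_i * erf(alpha * x_i + shift) + beta_i.
    Each output depends only on its own input element."""
    n = len(x)
    gamma = [1.0] * n if gamma is None else gamma
    beta = [0.0] * n if beta is None else beta
    return [g * math.erf(alpha * v + shift) + b
            for v, g, b in zip(x, gamma, beta)]

# Changing one element changes every LayerNorm output,
# but leaves the other point-wise outputs untouched.
a = [0.2, -1.5, 3.0, 1.0]
b = [0.2, -1.5, 3.0, 40.0]  # only the last element differs

ln_a, ln_b = layer_norm(a), layer_norm(b)
pw_a, pw_b = derf_like(a), derf_like(b)

print("LayerNorm first output:", ln_a[0], "vs", ln_b[0])
print("Point-wise first output:", pw_a[0], "vs", pw_b[0])
```

Because the point-wise map needs no statistics over the vector, it squashes large activations into a bounded range without computing a mean or variance, which is the property a normalization-free design relies on.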