This paper shows that Transformers can be trained well without normalization layers by replacing them with a simple element-wise (pointwise) function called Derf.
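The idea can be pictured as a learnable element-wise squashing function applied where a normalization layer would sit. A minimal sketch, assuming (by analogy with similar dynamic activation functions such as Dynamic Tanh) that Derf computes gamma * erf(alpha * x) + beta with learnable scalars; the exact parameterization here is an assumption, not taken from the paper:

```python
import math

def derf(xs, alpha=1.0, gamma=1.0, beta=0.0):
    """Hypothetical element-wise Derf: gamma * erf(alpha * x) + beta.

    alpha, gamma, beta stand in for learnable scalars; in a real
    Transformer they would be trained parameters, and the true form
    used in the paper may differ from this sketch.
    """
    return [gamma * math.erf(alpha * x) + beta for x in xs]

# Like a normalization layer, this bounds activation scale:
# outputs lie in (beta - gamma, beta + gamma) regardless of input magnitude.
print(derf([-2.0, 0.0, 2.0]))
```

Because erf saturates smoothly, large activations are squashed into a fixed range per element, which is one intuition for why such a pointwise function can stand in for normalization during training.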
This paper asks whether generative training benefits more from an encoder's global semantics (the big-picture meaning of the input) or from its spatial structure (how features are arranged across space).