Stronger Normalization-Free Transformers
Key Summary
- This paper shows that we can remove normalization layers from Transformers and still train them well by using a simple point-wise function called Derf.
- Derf applies the error function (erf), with two tiny learned knobs (a scale and a shift), to every activation independently, instead of using batch or token statistics.
- The authors studied what makes such functions work and found four must-have traits: keep outputs near zero, keep values bounded, stay sensitive around zero, and be monotonic.
- Guided by these traits, they tested many candidate functions and discovered that Derf (erf(αx + s)) consistently beats LayerNorm, RMSNorm, and the prior best, DyT (Dynamic Tanh).
- On ImageNet with ViT, Derf improves top-1 accuracy (e.g., 82.8% vs. 82.3% with LayerNorm), and on diffusion models it lowers FID (better images).
- On speech (wav2vec 2.0) and DNA sequence tasks, Derf also outperforms normalization layers and DyT.
- Interestingly, Derf has higher training loss than normalization, suggesting its gains come from better generalization rather than from overfitting the training data.
- Derf is simple, fast, and avoids the memory and synchronization costs of normalization, making it practical for many architectures and hardware setups.
- The work provides clear design rules for building strong normalization-free Transformers using point-wise functions.
Why This Research Matters
Derf makes Transformers simpler and cheaper to run by removing normalization's need for batch statistics, which saves memory and avoids synchronization overhead. It boosts accuracy in vision tasks and improves image quality in diffusion models, directly benefiting applications like photo search and generative art. Because Derf seems to generalize better, it can help models be more reliable on new data, not just the training set. It also works well when you have small batches or limited hardware, making strong models more accessible to smaller labs and edge devices. Finally, the paper gives clear, easy-to-follow design rules so others can build even better normalization-free models.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a music teacher helps a school band stay in rhythm so everyone plays together? In deep learning, we used special helpers called normalization layers to keep the signals inside a model steady so training didn't go off-beat.
🥬 Filling (The Actual Concept): What it is: Normalization layers (like LayerNorm and RMSNorm) adjust the scale and center of activations so the network learns smoothly. How it works (recipe):
- Look at a group of numbers (an activation vector).
- Compute their average (mean) and spread (standard deviation).
- Subtract the mean to recenter around zero.
- Divide by the standard deviation to keep a steady scale.
- Optionally learn tiny per-channel scale/shift parameters (gamma, beta) to fine-tune. Why it matters: Without normalization, some layers can blow up or shrink signals, making training unstable and slow.
🍞 Bottom Bread (Anchor): Imagine you're practicing piano and your teacher keeps your tempo steady; normalization keeps the neural network "tempo" steady so learning doesn't rush or drag.
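To make the recipe above concrete, here is a minimal PyTorch sketch of a LayerNorm-style computation. It is an illustration, not the exact library implementation; the toy tensor shapes and epsilon value are assumptions.

```python
import torch

def layer_norm_like(x, gamma, beta, eps=1e-5):
    """Normalize each activation vector over its last dimension, LayerNorm-style."""
    mean = x.mean(dim=-1, keepdim=True)                  # 1) average of the group
    var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # 2) spread (variance)
    x_hat = (x - mean) / torch.sqrt(var + eps)           # 3) recenter and 4) rescale
    return gamma * x_hat + beta                          # 5) learned per-channel scale/shift

x = torch.randn(2, 4)                                    # two tokens, four channels (toy sizes)
y = layer_norm_like(x, torch.ones(4), torch.zeros(4))
```

Note that every output value depends on the whole activation vector through the mean and variance; that dependence is exactly what the point-wise approach below removes.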
The world before: For years, most strong models, especially Transformers, leaned on normalization to train reliably. It worked great, but had downsides: it needs to compute statistics on the fly, which means extra memory access and sometimes communication between devices; it can depend on batch size (tiny batches can hurt); and it adds latency on certain hardware.
The problem: Could we remove normalization and still get the same (or better) results? Doing so would simplify models, reduce memory traffic, and make training more robust to batch-size choices, but we risk losing the steady "tempo" normalization provides.
Failed attempts: People tried many paths. Some used clever initializations (so signals start well-behaved), some used weight normalization, some clipped gradients, and some simplified Transformer blocks. Helpful, but often they still trailed standard normalization in performance or reliability. A recent idea, Dynamic Tanh (DyT), replaced normalization with a simple point-wise function: it almost matched normalization but didn't clearly beat it across the board.
🍞 Top Bread (Hook): Imagine if, instead of a teacher watching the whole class, each student had a tiny, personal metronome that keeps their own pace.
🥬 Filling (The Actual Concept): What it is: Point-wise functions are tiny "personal metronomes": the same learned curve applied to each activation independently, without using batch or token statistics. How it works:
- Take one activation value x.
- Pass it through a learned S-shaped curve f(x; θ).
- Repeat for every activation separately (same curve everywhere).
- Use small learned knobs (like scale and shift) to fine-tune. Why it matters: No statistics means less memory/communication overhead and no batch-size headaches. The challenge is keeping training stable and accurate.
🍞 Bottom Bread (Anchor): Like giving each musician a personal metronome instead of relying on the conductor's signals.
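For contrast with the normalization sketch, a point-wise "personal metronome" needs no statistics at all. The sketch below uses tanh as a stand-in bounded S-curve (in the spirit of DyT); the knob names are illustrative, and the paper's preferred curve (erf) is defined later.

```python
import torch

def pointwise_s_curve(x, alpha, shift):
    """Apply the same learned S-shaped curve to every activation independently.
    No mean or standard deviation is computed anywhere."""
    return torch.tanh(alpha * x + shift)

x = torch.randn(2, 4)                                             # same toy activations as before
y = pointwise_s_curve(x, alpha=torch.tensor(0.8), shift=torch.tensor(0.0))
```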
The gap: DyT showed we could match normalization with a good S-shaped function, but it didn't consistently surpass normalization. What exact shape should the function take to not just match but beat normalization?
The real stakes: This matters for everyone. Faster, simpler models train cheaper (good for the planet and your wallet), work better on small devices or with small batches (like fine-tuning at home), and can even improve accuracy in real tasks like recognizing images, understanding speech, or analyzing DNA sequences. If we can remove a heavy "helper" (normalization) and replace it with a tiny function that's easier and stronger, we make AI both smarter and more accessible.
02 Core Idea
🍞 Top Bread (Hook): Imagine sorting a pile of homework: you first flatten messy stacks, then slide each paper through a smooth roller that straightens creases without needing to measure the whole pile each time.
🥬 Filling (The Actual Concept): What it is: The key insight is that a carefully shaped, statistics-free, point-wise function can replace normalization layers and even outperform them. The winning shape is Derf: erf(αx + s) with small learned parameters α (scale) and s (shift), plus per-channel γ and β. How it works:
- Study which shapes make training stable and accurate: near zero-centered, bounded, sensitive around zero, and monotonic.
- Build a library of candidate S-shaped curves that follow these rules.
- Test them across tasks; discover that the error function (erf) with a learned shift and scale works best.
- Replace each normalization layer in Transformers with Derf and train as usual. Why it matters: It removes statistic lookups and synchronization, reduces sensitivity to batch size, simplifies the block, and (surprisingly) often improves final accuracy or generation quality.
🍞 Bottom Bread (Anchor): It's like swapping a group-sensing conductor for an individual smart roller that perfectly smooths each sheet on its own: fewer moving parts, better results.
Three analogies for the same idea:
- Traffic analog: Instead of adjusting traffic lights based on city-wide sensor data (normalization), every car uses adaptive cruise control (Derf) that keeps speeds safe and smooth without a city control center.
- Cooking analog: Rather than tasting and rebalancing an entire soup pot (normalization), you lightly season each spoonful consistently with a perfect spice curve (Derf) that prevents any bite from being too salty or too bland.
- School analog: Instead of grading on a curve after seeing the whole class's scores (normalization), each test question is designed with a fairness curve (Derf) that turns raw effort into a balanced score without checking the class average.
Before vs. after:
- Before: Normalization was the stabilizer; DyT showed you could nearly match it with a tanh-based function but not reliably beat it.
- After: With Derf, we have a function that is simple, uses no batch/token statistics, and consistently beats LayerNorm/RMSNorm/DyT across vision, generation, speech, and DNA tasks.
🍞 Top Bread (Hook): You know how a good ruler needs these traits: it starts at zero, has markings close together near zero for precision, doesn't stretch forever, and its numbers always increase from left to right.
🥬 Filling (The Actual Concept): The must-have properties for the function:
- Zero-centeredness. What it is: Outputs balance around zero so positives and negatives are symmetric. How it works: Keep the curve's middle at (0, 0) and avoid big vertical/horizontal shifts. Why it matters: Keeps gradients steady; too much shift harms training.
- Boundedness. What it is: Outputs stay within a finite range. How it works: Use S-shapes that saturate (like erf/tanh) or clip unbounded ones. Why it matters: Prevents activations/gradients from exploding layer by layer.
- Center sensitivity. What it is: The curve reacts to small inputs near zero (no flat dead zone). How it works: Ensure the slope near zero isn't tiny. Why it matters: Small, important signals don't get lost.
- Monotonicity. What it is: As input increases, output steadily increases (or decreases) without wobbling. How it works: Avoid humps/oscillations; keep the derivative's sign consistent. Why it matters: Preserves order and keeps gradient signals from flipping.
🍞 Bottom Bread (Anchor): Like a perfect measuring tool: a centered zero mark, clear ticks near zero, ends that don't run to infinity, and numbers that never go backward.
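As a quick, informal sanity check (not an experiment from the paper), the four properties can be probed numerically for erf:

```python
import torch

x = torch.linspace(-6.0, 6.0, steps=1001, requires_grad=True)
y = torch.erf(x)
(slope,) = torch.autograd.grad(y.sum(), x)      # elementwise derivative of erf

print(torch.erf(torch.tensor(0.0)).item())      # zero-centered: erf(0) = 0
print(y.min().item(), y.max().item())           # bounded: values stay inside (-1, 1)
print(slope[500].item())                        # center sensitivity: slope at 0 is about 1.13
print(bool((slope >= 0).all()))                 # monotonic: the derivative never goes negative
```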
Building blocks (putting it together):
- Start with well-known S-shapes (tanh, erf, arctan) and CDF-like curves.
- Transform them (shift/scale/clip) to satisfy the four properties.
- Empirically test on vision (ImageNet) and diffusion models (FID).
- Pick the winner: erf with learnable α and s, i.e., Derf.
🍞 Bottom Bread (Anchor): On ImageNet, this "smart ruler" (Derf) nudges ViT accuracy from 82.3% (LayerNorm) to 82.8%, and on diffusion image generation it lowers FID across model sizes, meaning cleaner, more realistic images.
03 Methodology
High-level overview: Input → Replace each normalization layer with Derf → Train as usual → Output predictions.
🍞 Top Bread (Hook): Imagine rebuilding a bicycle so that every wheel has a built-in self-balancer. You remove the bulky stabilizer bars (normalization) and snap in tiny smart rings (Derf) on each axle.
🥬 Filling (The Actual Concept): What it is: A recipe to remove normalization from Transformers and insert Derf everywhere those layers used to be, chosen and tuned by first discovering which curve shapes make training stable. How it works (recipe): Step A: Study which shapes work
- Define four properties: zero-centeredness, boundedness, center sensitivity, monotonicity.
- Start with base functions (tanh, erf, arctan) and precisely adjust them (shift, clip, scale) to isolate each property's effect.
- Run controlled ViT-Base experiments on ImageNet to see how changes affect top-1 accuracy and stability.
- Findings:
- Zero-centered: small shifts are okay; big shifts break training.
- Bounded: bounded beats unbounded; too-fast growth diverges.
- Center sensitivity: bigger flat zones near zero hurt performance and can cause failure.
- Monotonic: monotonic functions outperform hump/oscillation shapes. What breaks without this step: You might pick a curve that seems fine but subtly destabilizes training or kills accuracy.
Step B: Build a candidate set
- Collect many S-shapes: natural (erf/tanh/arctan), transformed basics, clipped unbounded, and ratio-style functions.
- Enforce the four properties via transformations (e.g., clip arcsinh to [-1, 1]).
- Normalize ranges and symmetry so comparisons are fair.
- Evaluate candidates on ViT and Diffusion Transformers (DiT) using standard training recipes and metrics (top-1 accuracy, FID). What breaks without this step: You might overfit to one task or miss a stronger function hidden in plain sight.
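A small sketch in the spirit of Step B is shown below; the exact candidate set and normalizations used by the authors may differ, so treat these particular transforms as illustrative.

```python
import math
import torch

# Illustrative candidates, each transformed to be zero-centered, bounded in [-1, 1],
# reasonably steep at zero, and monotonic.
CANDIDATES = {
    "tanh": torch.tanh,
    "erf": torch.erf,
    "arctan": lambda x: torch.atan(x) * (2.0 / math.pi),                  # rescale range to (-1, 1)
    "clipped_arcsinh": lambda x: torch.clamp(torch.asinh(x), -1.0, 1.0),  # bound an unbounded curve
}

x = torch.linspace(-4.0, 4.0, steps=9)
for name, fn in CANDIDATES.items():
    y = fn(x)
    assert abs(fn(torch.tensor(0.0)).item()) < 1e-6    # zero-centered
    assert y.abs().max().item() <= 1.0 + 1e-6          # bounded
    assert bool((y[1:] >= y[:-1]).all())               # monotonic (non-decreasing)
```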
Step C: Pick the winner and formalize Derf
- Winner: erf(x) variants consistently lead.
- Define Derf layer:
- Formula: y = γ · erf(αx + s) + β
- α and s are learned scalars (shared across channels); γ and β are per-channel, as in normalization layers.
- Initialization: γ = 1, β = 0, α = 0.5, s = 0.
- Integration: Replace every normalization site in Transformers (pre-attention, pre-FFN, final norm) with a one-to-one Derf layer.
- Efficiency: No per-batch or per-token statistics; purely point-wise; easy on memory and parallelism. What breaks without this step: Inconsistent replacements hurt stability; poor initialization can slow or break training.
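A minimal PyTorch sketch of a Derf layer and a one-to-one replacement helper is given below, following the formula and initialization above. It is a best-effort reading of the description, not the authors' reference implementation (for example, treating α and s as single scalars per layer is assumed from the text).

```python
import torch
import torch.nn as nn

class Derf(nn.Module):
    """y = gamma * erf(alpha * x + s) + beta, applied point-wise (no statistics)."""

    def __init__(self, num_channels, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learned scalar scale
        self.s = nn.Parameter(torch.tensor(0.0))              # learned scalar shift
        self.gamma = nn.Parameter(torch.ones(num_channels))   # per-channel scale, as in LN
        self.beta = nn.Parameter(torch.zeros(num_channels))   # per-channel shift, as in LN

    def forward(self, x):
        return self.gamma * torch.erf(self.alpha * x + self.s) + self.beta

def replace_layernorm_with_derf(module):
    """Recursively swap every nn.LayerNorm for a Derf layer of matching width."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, Derf(child.normalized_shape[-1]))
        else:
            replace_layernorm_with_derf(child)
    return module
```

In a ViT-style block, the swapped-in layers would sit exactly where LayerNorm sat: before attention, before the FFN, and at the final norm.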
Step D: Verify across domains
- Vision (ViT-B, ViT-L; ImageNet): report top-1 accuracy.
- Generation (DiT-B/L/XL; ImageNet): report FID.
- Speech (wav2vec 2.0 Base/Large; LibriSpeech): report validation loss.
- DNA (HyenaDNA, Caduceus; GenomicBenchmarks): report accuracy.
- Language (GPT-2 124M; OpenWebText): report validation loss. What breaks without this step: The function might only shine in one domain but fail elsewhere.
Concrete data example (toy):
- Suppose a tiny vector x = [-3, -1, 0, 1, 3]. With α = 0.5 and s = 0, αx = [-1.5, -0.5, 0, 0.5, 1.5].
- erf(·) roughly maps this to [-0.966, -0.520, 0, 0.520, 0.966].
- With γ = 1 and β = 0, the output is softly squashed into [-1, 1], balanced around zero, and sensitive near the center: exactly the behavior we want.
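These numbers can be reproduced with Python's standard library (a small check of the toy example, nothing more):

```python
import math

x = [-3, -1, 0, 1, 3]
alpha, s = 0.5, 0.0
y = [math.erf(alpha * v + s) for v in x]     # gamma = 1, beta = 0 leave this unchanged
print([round(v, 3) for v in y])              # [-0.966, -0.52, 0.0, 0.52, 0.966]
```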
🍞 Bottom Bread (Anchor): Like turning raw, bouncy springs into smooth, well-damped springs so each layer passes along a steady signal without needing to peek at everyone else's behavior.
The secret sauce:
- erf's S-shape naturally satisfies all four properties and, with a learnable shift s, can finely center the sweet spot without depending on batch statistics.
- Derf keeps enough slope near zero to pass small signals, saturates extremes to prevent blow-ups, and preserves order to keep gradients consistent.
- This combination seems to act as an implicit regularizer: it doesn't fit the training set as tightly as normalization does, yet it generalizes better on test data.
04 Experiments & Results
🍞 Top Bread (Hook): Think of a school tournament where students compete in reading (vision), art (generation), music (speech), science (DNA), and writing (language). We want a single new practice trick (Derf) that helps in all events, not just one.
🥬 Filling (The Actual Concept): What it is: A broad evaluation of Derf versus standard normalization layers and DyT across many tasks and models. How it works:
- Remove normalization layers and insert Derf in ViT, DiT, wav2vec 2.0, HyenaDNA/Caduceus, and GPT-2.
- Train with each model's standard recipe; compare top-1 accuracy (ImageNet), FID (image generation), validation loss (speech/language), and task accuracy (DNA).
- Also compute training loss in evaluation mode after training to study fitting versus generalization (sketched below). Why it matters: Beating normalization across domains means Derf is a simple, widely useful drop-in.
🍞 Bottom Bread (Anchor): Instead of a trick that only helps in math class, this one improves performance in reading, art, music, science, and writing.
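The fitting-versus-generalization measurement mentioned above can be sketched as follows; this is a hedged illustration, and the paper's exact protocol, data loaders, and loss functions are not specified here.

```python
import torch

@torch.no_grad()
def training_loss_in_eval_mode(model, train_loader, loss_fn, device="cpu"):
    """After training, measure loss on the training set with the model in eval mode
    (dropout off, no gradient updates), to see how tightly the model fits its own data."""
    model.eval()
    total, count = 0.0, 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        loss = loss_fn(model(inputs), targets)
        total += loss.item() * inputs.size(0)
        count += inputs.size(0)
    return total / count
```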
The test and scoreboard (with context):
- Vision (ImageNet; higher is better):
- ViT-Base: LayerNorm 82.3%, DyT 82.5%, Derf 82.8% (Derf adds about 0.5 percentage points over LN, like nudging a solid B to a strong B+).
- ViT-Large: LayerNorm 83.1%, DyT 83.6%, Derf 83.8% (Derf wins again).
- Image Generation (FID; lower is better):
- DiT-B/4: LN 64.93, DyT 63.94, Derf 63.23 (cleaner images).
- DiT-L/4: LN 45.91, DyT 45.66, Derf 43.94 (a notable gain, like cutting smudges from the final prints).
- DiT-XL/2: LN 19.94, DyT 20.83, Derf 18.92 (best quality among all three).
- Speech (LibriSpeech; lower validation loss is better):
- wav2vec 2.0 Base: LN 1.95, DyT 1.95, Derf 1.93.
- wav2vec 2.0 Large: LN 1.92, DyT 1.91, Derf 1.90.
- DNA (GenomicBenchmarks; higher is better):
- HyenaDNA: Norm 85.2%, DyT 85.2%, Derf 85.7%.
- Caduceus: Norm 86.9%, DyT 86.9%, Derf 87.3%.
- Language (OpenWebText; lower is better):
- GPT-2 (124M): LN 2.94, DyT 2.97, Derf 2.94 (matches LN, beats DyT).
Surprising finding: Fitting vs. generalization
- Evaluation-mode training loss (lower means a better fit to the training data) consistently ranks: Normalization < Derf < DyT.
- But test-time results often rank: Derf > Normalization ≥ DyT.
- This suggests Derf's gains come from better generalization, not from overfitting the training set. In plain words: it may memorize a bit less but understands better.
Other insights:
- There's a "too-fast growth" danger for unbounded functions: if outputs grow too quickly, training can diverge early. Bounded S-shapes like erf avoid this.
- A learnable shift s helps many functions, but erf still beats scaled-tanh approximations, so its shape (not just tuning) matters.
Bottom line: Across image recognition, image generation, speech, DNA, and language, Derf is a consistent top performer or tied for best, with simpler mechanics and lower overhead than normalization.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best tool has places it shines and places it struggles, like a bike being great on roads but not ideal for deep sand.
🥬 Filling (The Actual Concept): What it is: An honest look at when Derf works best, what it needs, and what we still don't know. How it works (considerations):
- Limitations:
- Task dependence: While Derf wins or ties broadly, some settings (like small language models) show parity with LayerNorm rather than a clear win.
- Initialization sensitivity: Poor α/s initialization or too-aggressive learning rates can hurt stability.
- Extreme distributions: If activations are very skewed or the task benefits from data-dependent shifts, normalization's adaptivity might still help.
- Theory gap: We have strong empirical rules (the four properties), but a complete theory explaining why erf beats tanh everywhere is still evolving.
- Required resources:
- Standard training compute (no heavy statistic ops per step).
- Usual deep-learning stack (AdamW, schedulers) works out of the box.
- Basic hyperparameter sweeps (e.g., learning rate for DiT) remain helpful.
- When NOT to use:
- If your pipeline critically relies on per-batch or per-token adaptive behavior (e.g., specific normalization-based conditioning tricks).
- If your training script assumes normalization statistics for other components (e.g., certain regularizers).
- If you have extremely tiny models where LN already saturates performance and Derf adds complexity to your validation process.
- Open questions:
- Can we mathematically prove why erf's particular curvature yields better generalization than tanh under common training regimes?
- Are there even better point-wise curves beyond erf within the same four-property space?
- How does Derf interact with very large-scale pretraining (billion-parameter LMs) and mixed-precision or quantized training?
- Can we co-design optimizers or learning-rate schedules tailored to Derf for further gains?
🍞 Bottom Bread (Anchor): Think of Derf as an all-terrain tire: it handles city streets and country roads very well, but we should still map out where it slips, and figure out if a next-gen tire could grip even better.
06 Conclusion & Future Work
Three-sentence summary: The paper shows that Transformers don't need normalization layers if you replace them with the right point-wise function. By studying four key properties (zero-centeredness, boundedness, center sensitivity, and monotonicity) and testing many candidates, the authors identify Derf (erf(αx + s)) as the strongest choice. Derf consistently matches or exceeds normalization across vision, generation, speech, DNA, and language tasks, likely thanks to stronger generalization rather than harder overfitting.
Main achievement: A simple, statistics-free, drop-in function that not only replaces normalization in Transformers but also often makes them better.
Future directions:
- Tighten the theory explaining why Derf generalizes better and when.
- Explore even richer function families that satisfy the four rules.
- Co-design optimizers, schedules, and inits around Derf for large-scale models.
- Test on massive LLMs, reinforcement learning, multi-modal, and low-precision/edge scenarios.
Why remember this: Derf changes the default assumption that "you must normalize" by offering a tiny, fast, and effective alternative, with clear rules for what makes such functions work. It's a rare case where simpler is not just good enough but better.
Practical Applications
- Speed up and simplify training pipelines by replacing LayerNorm/RMSNorm with Derf in Transformers.
- Train reliably with small batch sizes (e.g., on single-GPU or edge setups) without suffering normalization instability.
- Improve diffusion image generation quality (lower FID) for creative tools and content pipelines.
- Enhance image classification accuracy in ViT models for vision applications like retail product recognition.
- Stabilize speech representation learning (wav2vec 2.0) for better downstream ASR and audio tasks.
- Boost DNA sequence modeling performance for genomics research and diagnostics triage.
- Reduce memory bandwidth and synchronization costs in multi-device training (e.g., TPUs/GPUs).
- Simplify inference and deployment on hardware where normalization is costly or poorly optimized.
- Use Derf as a strong default when building normalization-free Transformers in new domains.
- Combine Derf with quantization or mixed precision to further cut compute and energy costs.