Different transformers may have very different weights, but they often contain the same tiny "engine" that actually does the task.
This paper explains, in detail, how a simple two-layer neural network learns to add numbers on a clock (modular addition) by building and combining wave-like patterns called Fourier features.
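To make the clock-addition idea concrete, here is a minimal hand-built sketch of adding two numbers modulo p using Fourier features. It is not the trained network from the paper: the function names (`fourier_features`, `add_mod_p`) and the choice of frequencies are assumptions made for illustration. Each number is turned into cosine and sine waves, the waves for the two inputs are combined with the angle-addition identities, and every candidate answer is scored by how well its own waves line up with the combined ones.

```python
import numpy as np

def fourier_features(x, p, freqs):
    """Return (cos, sin) of 2*pi*k*x/p for every frequency k in freqs."""
    angles = 2 * np.pi * np.asarray(freqs) * x / p
    return np.cos(angles), np.sin(angles)

def add_mod_p(a, b, p, freqs=None):
    """Compute (a + b) mod p using only wave-like features (illustrative)."""
    if freqs is None:
        # Assumption: use every frequency from 1 up to p // 2.
        freqs = np.arange(1, p // 2 + 1)
    freqs = np.asarray(freqs)

    cos_a, sin_a = fourier_features(a, p, freqs)
    cos_b, sin_b = fourier_features(b, p, freqs)

    # Angle-addition identities give the waves for a + b
    # without ever adding a and b directly.
    cos_sum = cos_a * cos_b - sin_a * sin_b
    sin_sum = sin_a * cos_b + cos_a * sin_b

    # Score each candidate answer c: the sum over k of
    # cos(2*pi*k*(a + b - c)/p), which is largest when c == (a + b) mod p.
    candidates = np.arange(p)
    angles_c = 2 * np.pi * np.outer(candidates, freqs) / p
    scores = (np.cos(angles_c) * cos_sum + np.sin(angles_c) * sin_sum).sum(axis=1)
    return int(np.argmax(scores))
```

For example, `add_mod_p(7, 9, 12)` returns 4, matching 7 + 9 on a 12-hour clock. Using several frequencies at once makes the correct answer stand out sharply, because the waves only all line up at the right candidate.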
The paper shows that big language models often get stuck with weight sizes set by training hyperparameters instead of by the data, which quietly hurts performance.
Training a neural network is like finding the lowest spot in a giant, bumpy landscape called the loss landscape.