Papers (4)

#weight decay

Transformers converge to invariant algorithmic cores

Different transformers may have very different weights, but they often hide the same tiny "engine" inside that actually does the task.

#algorithmic cores #mechanistic interpretability #transformers

On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Intermediate

Jianliang He, Leda Wang et al., Feb 18, arXiv

This paper explains, in detail, how a simple two-layer neural network learns to add numbers on a clock (modular addition) by building and combining wave-like patterns called Fourier features.
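As a rough illustration of the idea in this summary, the sketch below computes (a + b) mod p purely from cosine and sine waves. It is not the paper's construction: the function names `fourier_features` and `mod_add` are hypothetical, and it assumes every nonzero frequency is used, whereas a trained network would typically rely on only a few. Each input is encoded as waves, the waves for a and b are combined with the angle-addition identities, and each candidate answer c is scored by how well its own waves line up with the combined ones.

```python
import numpy as np

def fourier_features(x: int, p: int):
    """Cosine and sine of x at every nonzero frequency k = 1..p-1 (assumed choice)."""
    k = np.arange(1, p)
    angle = 2 * np.pi * k * x / p
    return np.cos(angle), np.sin(angle)

def mod_add(a: int, b: int, p: int) -> int:
    """Recover (a + b) mod p by combining the Fourier features of a and b."""
    cos_a, sin_a = fourier_features(a, p)
    cos_b, sin_b = fourier_features(b, p)

    # Angle-addition identities give the waves of (a + b) from those of a and b.
    cos_ab = cos_a * cos_b - sin_a * sin_b
    sin_ab = sin_a * cos_b + cos_a * sin_b

    # Waves of every candidate answer c = 0..p-1, shape (p, p-1).
    k = np.arange(1, p)
    angle_c = 2 * np.pi * np.outer(np.arange(p), k) / p

    # logit[c] = sum_k cos(2*pi*k*(a + b - c) / p), which equals p - 1 when
    # c == (a + b) mod p and -1 for every other candidate.
    logits = np.cos(angle_c) @ cos_ab + np.sin(angle_c) @ sin_ab
    return int(np.argmax(logits))
```

For example, `mod_add(5, 4, 7)` picks out 2, matching 9 mod 7. The gap between the winning logit (p - 1) and all others (-1) is what makes the argmax unambiguous.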

#modular addition #Fourier features #discrete Fourier transform

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Intermediate

Maksim Velikanov, Ilyas Chahed et al., Jan 8, arXiv

The paper shows that big language models often get stuck with weight sizes set by training hyperparameters instead of by the data, which quietly hurts performance.

#learnable multipliers #weight decay #noise–WD equilibrium

Visualizing the Loss Landscape of Neural Nets

Intermediate

Hao Li, Zheng Xu et al., Dec 28, arXiv

Training a neural network is like finding the lowest spot in a giant, bumpy landscape called the loss landscape.

#loss landscape visualization #filter normalization #sharpness flatness
