On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Intermediate
Jianliang He, Leda Wang, Siyu Chen et al. (2/18/2026)
arXiv

Key Summary

  • This paper explains, in detail, how a simple two-layer neural network learns to add numbers on a clock (modular addition) by building and combining wave-like patterns called Fourier features.
  • Each neuron picks one favorite wave (a single frequency) and lines up its timing across layers (phase alignment), so the whole network works like a coordinated choir.
  • Across many neurons, the network spreads out across all needed waves (frequency diversification) and balances their timings (phase symmetry), so random noise cancels and the right answer stands out.
  • A neuron’s winning wave is chosen early by a lottery ticket mechanism: the wave that starts with a slightly bigger magnitude and better timing alignment grows fastest and wins inside that neuron.
  • Mathematically, the authors track training as a smooth movie (gradient flow) in Fourier space and prove how phases align and why single-frequency structure sticks around.
  • They show the final network is basically a majority vote over many imperfect voters; with enough diversity and symmetry, the correct sum wins strongly even if each voter is a bit noisy.
  • They reveal why swapping ReLU for any activation with strong even parts (like |x| or x^2) still works at test time: the model’s symmetry cancels odd parts and keeps even parts.
  • They unpack grokking as three stages: fast memorization, a first clean-up where weight decay prunes extra waves, and a slow final polish that yields sudden generalization.
  • Ablations confirm both frequency coverage and balanced phases are crucial; without them, confidence drops and mistakes increase.
  • This gives a concrete, neuron-level picture of how feature learning, competition, and regularization interact to produce reliable generalization.

Why This Research Matters

This work turns a mysterious learning behavior—sudden generalization—into a concrete, testable recipe. By showing how clean, single-frequency features form inside neurons and how diversity plus symmetry make majority voting robust, it offers a blueprint for building more reliable AI systems. The even-component robustness explains why certain activation swaps don’t harm performance, informing practical choices in model design. The lottery ticket dynamics provide a predictive lens on which features will win, suggesting smarter initializations or monitoring tools. Finally, understanding grokking as a two-force tug-of-war (loss minimization vs. weight decay) helps us schedule training and regularization to reach generalization faster in real applications.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how a clock goes from 11 to 12 and then wraps back to 1? Adding times on a clock is different from normal adding because you loop around.

🥬 Filling (The Actual Concept)

  • What it is: Modular addition is clock-style adding: you add two numbers and, if you pass the top number p, you wrap around to zero.
  • How it works: 1) Pick a modulus p (like 12 on a clock). 2) Add x + y. 3) If the sum is at least p, subtract p until it fits between 0 and p−1. 4) That result is (x + y) mod p.
  • Why it matters: This simple “clock add” task is perfect for peeking inside how a neural network learns patterns, because it’s small, exact, and has clear structure.

🍞 Bottom Bread (Anchor) Example: On a 7-hour clock, 5 + 4 = 9 wraps to 2, so (5 + 4) mod 7 = 2.
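The wrap-around steps above can be sketched in a few lines of Python (the helper name `mod_add` is just for illustration):

```python
def mod_add(x: int, y: int, p: int) -> int:
    """Clock-style addition: wrap around once the sum reaches the modulus p."""
    s = x + y
    while s >= p:  # repeated subtraction, exactly as in the steps above
        s -= p
    return s

# On a 7-hour clock, 5 + 4 = 9 wraps to 2.
assert mod_add(5, 4, 7) == 2
assert mod_add(5, 4, 7) == (5 + 4) % 7  # matches Python's built-in mod operator
```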

🍞 Top Bread (Hook) Imagine you’re teaching a class. At first you just memorize the answers for your homework problems. But later, suddenly, you understand the rule and can solve any similar problem. That surprise jump is exciting.

🥬 Filling (The Actual Concept)

  • What it is: Grokking is when a neural network trains for a long time, first memorizes, and only later suddenly generalizes, nailing new, unseen inputs.
  • How it works: 1) The model fits training data exactly. 2) Extra training with regularization (like weight decay) quietly cleans and sharpens features. 3) A clean, general rule pops out; test accuracy jumps.
  • Why it matters: It reveals that learning the rule can be delayed; good generalization isn’t just about fast fitting.

🍞 Bottom Bread (Anchor) A student memorizes multiplication tables but later truly understands multiplication and can solve story problems.

🍞 Top Bread (Hook) You know how music is made of notes at different pitches? If you can separate a song into its notes, you can understand and remix it better.

🥬 Filling (The Actual Concept)

  • What it is: Fourier features represent data using waves (sines and cosines) at different frequencies.
  • How it works: 1) Take a signal (or vector). 2) Decompose it into a sum of waves. 3) Each wave has a frequency (how fast it wiggles), a magnitude (how strong), and a phase (where it starts). 4) Add them back to reconstruct the original.
  • Why it matters: Many patterns—even “clock-add”—look simple in wave-land; addition turns into rotation around a circle.

🍞 Bottom Bread (Anchor) Just like a song equals bass + mid + treble, a network’s weights can be seen as a mix of waves—making the hidden structure visible.
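As a concrete sketch, NumPy's FFT can split a small signal into its waves and rebuild it exactly. The two planted frequencies, magnitudes, and phases below are arbitrary illustrations, not values from the paper:

```python
import numpy as np

p = 23
n = np.arange(p)
# A toy "signal": a mix of two waves with known frequency, magnitude, and phase.
signal = (2.0 * np.cos(2 * np.pi * 3 * n / p + 0.5)
          + 0.7 * np.cos(2 * np.pi * 7 * n / p - 1.2))

coeffs = np.fft.rfft(signal)           # project onto the Fourier basis
magnitudes = 2 * np.abs(coeffs) / p    # per-frequency strength
phases = np.angle(coeffs)              # per-frequency start position

# Energy concentrates at the two planted frequencies, with the planted phases.
top_two = set(np.argsort(magnitudes)[-2:])
assert top_two == {3, 7}
assert abs(phases[3] - 0.5) < 1e-6 and abs(phases[7] + 1.2) < 1e-6

# Adding the waves back reconstructs the original exactly.
recon = np.fft.irfft(coeffs, n=p)
assert np.allclose(recon, signal)
```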

🍞 Top Bread (Hook) Think of lining up dancers so they all kick at just the right moment. If a few are off-beat, the show looks messy; when they align, it’s magic.

🥬 Filling (The Actual Concept)

  • What it is: Phase alignment means the “start positions” (phases) of waves in different layers match in a special way.
  • How it works: 1) Each neuron picks a favorite frequency. 2) The input-layer’s phase and output-layer’s phase move during training. 3) Training pushes them toward a doubled-phase relation: output phase ≈ 2 × input phase. 4) Once aligned, features grow fast and stay stable.
  • Why it matters: Without alignment, signals fight each other; with alignment, they reinforce and become reliable.

🍞 Bottom Bread (Anchor) Like two gears meshing: when their teeth line up, power transfers smoothly; when misaligned, they grind.

🍞 Top Bread (Hook) Picture a classroom vote on the right answer. If students come from many backgrounds and balance each other’s biases, their majority choice is often correct.

🥬 Filling (The Actual Concept)

  • What it is: Majority voting (with diversification) is many neurons each giving a noisy vote, but their balanced diversity cancels errors.
  • How it works: 1) Spread neurons across all needed frequencies (frequency diversification). 2) Balance their phases across the circle (phase symmetry). 3) Sum all votes; shared noise vanishes; the true answer stands out. 4) Softmax then picks the biggest logit confidently.
  • Why it matters: A single neuron is imperfect; many diverse, symmetric neurons together are robust.

🍞 Bottom Bread (Anchor) If 100 kids each guess a number and their guesses aren’t all biased the same way, the average gets close to the truth.
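A tiny simulation of that intuition (the noise level, voter count, and seed are arbitrary): many guesses with balanced, zero-mean errors average out close to the truth, even though any single guess may be far off.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 50.0
# 100 noisy voters whose errors are zero-mean (balanced, not all biased one way)
guesses = truth + rng.normal(0.0, 10.0, size=100)

single_error = abs(guesses[0] - truth)          # one voter: can miss badly
ensemble_error = abs(guesses.mean() - truth)    # the average: close to the truth

# With 100 voters and noise std 10, the mean's std is ~1, so the error is small.
assert ensemble_error < 5.0
```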

The world before: Researchers knew networks could learn modular addition and that grokking happens, but we didn’t have a complete, neuron-level explanation of how the learned waves combine into a working algorithm—or why the right features reliably emerge from random starts.

The problem: Two big puzzles stood out: (1) Mechanism—how exactly do individual neurons’ waves assemble into the correct modular addition rule? (2) Dynamics—why do those specific waves appear and line up during training with standard losses and random initialization?

Failed attempts: Earlier theories often used mean-field approximations or special losses, which missed the fine, neuron-by-neuron competition and alignment seen in real networks with finite width.

The gap: We lacked a proof that normal training picks single-frequency features per neuron, aligns phases across layers, spreads frequencies across neurons, and uses symmetry to cancel noise—plus a clear link to grokking.

Real stakes: Understanding this toy task clarifies deep ideas—how features grow, how randomness gets organized, and why generalization can suddenly appear. These lessons help us design better, more interpretable systems for real problems like signal processing, robotics, and code or math reasoning.

02Core Idea

🍞 Top Bread (Hook) Imagine building a perfect choir: each singer holds just one note, different sections cover all notes, and everyone breathes in sync. The result is crisp and powerful.

🥬 Filling (The Actual Concept)

  • What it is: The key insight is that two things together solve modular addition: (1) neuron-level phase alignment that creates clean single-frequency features, and (2) network-level diversification (cover all frequencies and symmetrically spread phases) so that majority voting cancels noise and highlights the correct sum.
  • How it works: 1) Inside each neuron, multiple frequencies compete; one wins (single frequency). 2) The input and output phases align in a doubled relation (output ≈ 2 × input). 3) Across neurons, all needed frequencies are represented (diversification). 4) Within each frequency, phases are spread uniformly around the circle (symmetry). 5) Summing neurons yields a strong peak at the correct answer and small, cancelable peaks elsewhere.
  • Why it matters: Without per-neuron alignment, features wobble; without diversification and symmetry, noise doesn’t cancel; without both, you can’t robustly select the right sum.

🍞 Bottom Bread (Anchor) Like a well-orchestrated band: one player per note, all notes covered, timing synced—so the melody (the correct modular sum) is loud and clear.

Three analogies for the same idea:

  • Music: Each neuron is a single note; the orchestra covers the full scale; synchronized timing makes the song clean.
  • Voting: Each neuron is a voter with a small bias; by balancing voters across viewpoints and keeping timing consistent, the majority finds the truth.
  • Gears: Each neuron’s input and output phases are two gears; when their teeth (phases) mesh just right and all gears cover different ratios (frequencies), the machine computes reliably.

🍞 Sandwich: Frequency Diversification

  • Hook: You know how a rainbow needs every color to look complete?
  • Concept: Frequency diversification means different neurons pick different wave speeds so all needed frequencies are represented.
    • How: 1) Random initialization sprinkles frequencies. 2) Training amplifies a single winner per neuron. 3) With many neurons, you cover the full set. 4) This ensures the signal term for the true answer is present.
    • Why: If some frequencies are missing, certain inputs won’t be handled well; gaps cause blind spots.
  • Anchor: If your paint set lacks blue, you can’t paint the sky; missing frequencies mean missing pieces of the rule.

🍞 Sandwich: Phase Symmetry

  • Hook: Imagine seating kids evenly around a circular table; any loud chatter from one side gets balanced by quiet from the opposite side.
  • Concept: Phase symmetry spreads neuron phases evenly around the circle within each frequency.
    • How: 1) Random start makes phases scattered. 2) Training aligns layer phases but keeps their circular spread. 3) Even higher-order symmetry (in multiples) holds approximately. 4) Summing cancels noise.
    • Why: Without symmetry, certain wrong answers get systematic boosts instead of being canceled.
  • Anchor: Balanced see-saws on a playground cancel tilts; symmetric phases cancel spurious peaks.

🍞 Sandwich: Phase Alignment (Double Phase)

  • Hook: Two metronomes syncing up on a piano will eventually tick together.
  • Concept: The output phase becomes roughly twice the input phase in each neuron.
    • How: 1) Track the phase difference D = 2ϕ − ψ. 2) Training pushes D toward 0. 3) As D shrinks, magnitudes grow faster, reinforcing alignment. 4) Alignment becomes stable.
    • Why: Without alignment, the neuron’s feature is weak and noisy; with it, the feature is strong and steady.
  • Anchor: Two dancers spinning at linked speeds look smooth only when their spins lock in the right ratio.

Before vs After:

  • Before: We saw single-frequency and alignment empirically but didn’t know how they assemble into a precise solution.
  • After: We have a full recipe: per-neuron competition selects one frequency and aligns phases; across neurons, frequencies diversify and phases symmetrize; summing yields a near-indicator for the correct sum; softmax chooses it.

Why it works (intuition):

  • In Fourier space, modular addition translates to a structure where the true answer aligns across frequencies while off-targets misalign differently. With enough balanced, phase-aligned voters, shared signal stacks and idiosyncratic noise cancels.

Building blocks:

  • Single-frequency per neuron (sparse features)
  • Layer-wise phase alignment (double phase)
  • Frequency diversification across neurons
  • Phase symmetry within frequency groups
  • Majority vote aggregation (signal > noise)
  • Softmax to pick the winner
  • Weight decay to prune leftovers and polish features (key for grokking)

🍞 Sandwich: Majority Vote Mechanism

  • Hook: When lots of blurry photos of the same scene are averaged, the final image looks sharper.
  • Concept: Summing many biased but diverse neurons boosts the true logit most, leaving others smaller.
    • How: 1) Each neuron produces a main peak near the true sum plus smaller peaks elsewhere. 2) Diversification ensures main peaks add; symmetry makes side peaks cancel. 3) The correct index wins by a clear margin. 4) Softmax turns that margin into high confidence.
    • Why: Single neurons are too noisy; the ensemble makes the signal robust.
  • Anchor: Noise-canceling headphones remove random noise by combining opposite-phase sounds, letting the song shine.

03Methodology

At a high level: Inputs (x, y) → Fourier lens on weights → Inside each neuron: frequency competition + phase alignment → Across neurons: frequency diversification + phase symmetry → Sum logits (majority vote) → Softmax picks (x + y) mod p.

Step-by-step with sandwiches and examples:

  1. Representing the Task and Model
  • What happens: We use a two-layer network without biases. Inputs are one-hot vectors for x and y in {0,…,p−1}. The network outputs p logits, one for each possible sum.
  • Why this step exists: Clear, discrete structure makes Fourier analysis exact and interpretable.
  • Example: For p = 23, the input vectors have length 23; the output is a 23-dimensional score where the correct index is (x + y) mod 23.
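A minimal sketch of that input/output encoding (the helper name `one_hot` is just for illustration):

```python
import numpy as np

p = 23

def one_hot(i: int, p: int) -> np.ndarray:
    """Length-p vector with a single 1 at position i."""
    v = np.zeros(p)
    v[i] = 1.0
    return v

x, y = 5, 20
inp = np.concatenate([one_hot(x, p), one_hot(y, p)])  # length-2p network input
target = (x + y) % p                                  # index of the correct logit

assert inp.shape == (2 * p,) and inp.sum() == 2.0
assert target == 2  # 25 wraps past 23 to 2
```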

🍞 Sandwich: Discrete Fourier Transform (DFT)

  • Hook: Like putting on a magic pair of glasses that reveals hidden stripes and waves in what you see.
  • Concept: DFT converts weight vectors into waves with magnitudes and phases per frequency.
    • How: 1) Choose the Fourier basis (sines and cosines). 2) Project weights onto this basis. 3) Group sine+cosine into magnitude α and phase ϕ (or β, ψ for output layer). 4) Track these over training.
    • Why: In wave space, the math of alignment, growth, and cancellation becomes simple and separable.
  • Anchor: Like splitting a song into its notes so you can raise or lower each note independently.
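As a sketch of step 3, here is how one might read a neuron's per-frequency magnitude α and phase ϕ off an input-weight vector; the weight vector below is a hypothetical single-frequency neuron, planted by hand rather than trained:

```python
import numpy as np

p = 23
n = np.arange(p)
# Hypothetical input weights of one neuron: a single planted frequency k* = 3.
alpha_true, phi_true, k_star = 1.5, 0.8, 3
w = alpha_true * np.cos(2 * np.pi * k_star * n / p + phi_true)

c = np.fft.rfft(w)
alpha = 2 * np.abs(c) / p   # magnitude per frequency
phi = np.angle(c)           # phase per frequency

# The dominant bin recovers the planted frequency, magnitude, and phase.
k_hat = int(np.argmax(alpha[1:]) + 1)  # skip the constant (frequency-0) bin
assert k_hat == k_star
assert abs(alpha[k_hat] - alpha_true) < 1e-6
assert abs(phi[k_hat] - phi_true) < 1e-6
```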
  2. Early Training: Decoupled, Single-Frequency Emergence
  • What happens: With small random starts, neurons behave independently. Within a neuron, several frequencies exist at tiny scales; they compete. One frequency wins and dominates: single-frequency per neuron.
  • Why this step exists: Competition plus alignment makes each neuron specialize, which simplifies the ensemble.
  • Example: A neuron that began with tiny hints of frequencies 1, 3, and 7 ends up locking onto frequency 3 only.

🍞 Sandwich: Lottery Ticket Mechanism

  • Hook: Think of a set of race cars starting very close. A slight edge in fuel and alignment makes one pull ahead fast.
  • Concept: The winning frequency inside a neuron is set by tiny initialization advantages: larger initial magnitude and better phase alignment win.
    • How: 1) Measure phase misalignment D = 2ϕ − ψ. 2) Frequencies with smaller |D| grow faster. 3) As magnitudes grow, alignment improves even more (positive loop). 4) One frequency outpaces the rest and dominates.
    • Why: Without a winner, the neuron stays messy and multi-frequency, hurting ensemble cancellation later.
  • Anchor: Like musical chairs: tiny timing advantages decide who gets the chair when the music stops.

🍞 Sandwich: Gradient Flow

  • Hook: Imagine slowing down a movie of training to see each little nudge to the weights.
  • Concept: Gradient flow is the continuous-time version of gradient descent that tracks exact training dynamics.
    • How: 1) Write an ODE for how weights change. 2) Project into Fourier space to get ODEs for α, β, and D. 3) Analyze signs: cos(D) controls growth; sin(D) controls rotation. 4) Show D is driven toward 0.
    • Why: It proves rigorously why features emerge and align, not just that they do.
  • Anchor: Like watching a plant grow frame-by-frame to learn which parts grow first and why.
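A caricature of these dynamics (not the paper's exact ODEs; the rates, step size, and starting values are invented for illustration) shows both the sign roles above and the lottery-ticket race: cos(D) gates growth, sin(D) drives D toward 0, and a slight initial edge in magnitude and alignment compounds into dominance.

```python
import numpy as np

dt, steps = 0.01, 2000
# Two candidate frequencies inside one neuron.
# Candidate 0: slightly bigger start and better alignment; candidate 1: worse on both.
mag = np.array([0.11, 0.10])  # magnitudes
D = np.array([0.3, 1.2])      # phase misalignment D = 2*phi - psi

for _ in range(steps):
    mag = mag + dt * mag * np.cos(D)  # growth gated by alignment (cos(D) term)
    D = D - dt * np.sin(D)            # rotation drives D toward 0 (sin(D) term)

assert D[0] < 1e-3 and D[1] < 1e-3   # both candidates eventually align
assert mag[0] / mag[1] > 1.3         # the 1.1x head start compounds into a larger lead
```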
  3. Layer-wise Phase Alignment (Double Phase)
  • What happens: The network pushes the output phase to align with twice the input phase in each neuron. When D → 0, magnitudes α and β grow quickly and reinforce alignment.
  • Why this step exists: Alignment stabilizes features; otherwise growth wobbles or shrinks.
  • Example: Starting misaligned, phases rotate in opposite directions until they meet; then magnitudes take off.
  4. Network-Level Diversification and Symmetry
  • What happens: With many neurons, every needed frequency appears across neurons (diversification), and their phases spread evenly around the circle (symmetry). Magnitudes are similar, avoiding one-neuron dominance.
  • Why this step exists: Coverage guarantees the signal term is present for every input; symmetry cancels noise terms.
  • Example: For p = 23, across 512 neurons, all 11 nontrivial frequencies are covered multiple times with phases scattered evenly.
  5. Majority Vote and Indicator Approximation
  • What happens: Each neuron contributes a main peak at the true sum and smaller peaks elsewhere. Summing over all neurons, symmetrical small peaks cancel, while the true peak stacks and becomes much taller than the rest. Softmax then picks the correct index with high probability.
  • Why this step exists: This is the final readout; the whole construction was to make this vote decisive.
  • Example: The correct logit exceeds spurious ones by a margin that grows with the number of diversified neurons.
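This vote can be checked directly in an idealized setting: give the ensemble one perfectly clean cosine voter per nontrivial frequency (a hand-built simplification, not a trained network) and the summed logits form exactly the pattern described above, with a tall peak at the true sum and a flat floor elsewhere.

```python
import numpy as np

p = 23
x, y = 9, 17
t = np.arange(p)  # candidate answers 0..p-1

# One idealized voter per nontrivial frequency k = 1..(p-1)/2.
# Every voter peaks at t = x + y (mod p); elsewhere their phases disagree.
logits = np.zeros(p)
for k in range(1, (p - 1) // 2 + 1):
    logits += np.cos(2 * np.pi * k * (t - x - y) / p)

pred = int(np.argmax(logits))
assert pred == (x + y) % p                      # 26 mod 23 = 3
assert np.isclose(logits.max(), (p - 1) / 2)    # peaks stack: 11 voters, height 11
assert np.isclose(np.sort(logits)[-2], -0.5)    # all wrong answers cancel to -1/2
```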

🍞 Sandwich: Even-Component Robustness (Activation Swapping)

  • Hook: If you sing a melody on different instruments, it still sounds right if the core notes are the same.
  • Concept: The solution mainly uses the even part of the activation (like |x| or x^2), so swapping ReLU for another even-strong activation at test time keeps accuracy high.
    • How: 1) ReLU(x) = (x + |x|)/2 has both odd and even parts. 2) Symmetry cancels odd parts across neurons. 3) The even part remains and carries the signal. 4) So |x| or x^2 also work.
    • Why: If the even part were missing, the ensemble’s cancellation trick would break.
  • Anchor: Changing from piano to violin keeps the tune if the notes (even components) stay the same.
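The decomposition in step 1 is easy to verify numerically: ReLU splits into an even part |x|/2 and an odd part x/2, and only the even part survives the ensemble's symmetry.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 101)
relu = np.maximum(x, 0.0)

# Standard even/odd split: f(x) = [f(x)+f(-x)]/2 + [f(x)-f(-x)]/2.
even = 0.5 * (relu + np.maximum(-x, 0.0))
odd = 0.5 * (relu - np.maximum(-x, 0.0))

assert np.allclose(relu, even + odd)       # the two parts recombine to ReLU
assert np.allclose(even, np.abs(x) / 2)    # even part is |x|/2: carries the signal
assert np.allclose(odd, x / 2)             # odd part is x/2: cancelled by symmetry
```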
  6. Grokking as Three Stages
  • What happens: (I) Memorization: fast fit, messy extra frequencies. (II) Generalization I: ongoing loss minimization plus weight decay prunes non-feature frequencies; test loss drops sharply. (III) Generalization II: slow, weight-decay-only polishing; test accuracy creeps to perfect.
  • Why this step exists: Shows how feature cleaning, not just scaling, produces sudden generalization.
  • Example: After training accuracy hits 100%, test accuracy is still poor; later, it suddenly improves as pruning clarifies features.

Secret sauce: Viewing everything in Fourier space turns a mysterious black box into a clean recipe: per-neuron ODEs for magnitudes and phases, a winner-takes-all dynamic that explains sparsity, and an ensemble symmetry that explains noise cancellation and robustness.

04Experiments & Results

🍞 Sandwich: Inverse Participation Ratio (IPR)

  • Hook: If most of your allowance is saved in one piggy bank and only a little in others, your money is concentrated.
  • Concept: IPR measures how concentrated (sparse) the Fourier energy is at one frequency; higher IPR means closer to single-frequency.
    • How: 1) Compute norms of Fourier coefficients. 2) Form a ratio that grows when energy is packed into fewer bins. 3) Track it during training. 4) See it rise as neurons become single-frequency.
    • Why: It shows when features become clean and concentrated—key for robust voting.
  • Anchor: A flashlight beam getting narrower and brighter means more concentration.
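A minimal sketch of such a concentration measure, using one common definition of IPR (the paper's exact normalization may differ): it equals 1 when all energy sits in one frequency bin and 1/K when energy is spread evenly over K bins.

```python
import numpy as np

def ipr(c: np.ndarray) -> float:
    """One common IPR: sum(|c_k|^4) / (sum(|c_k|^2))^2. Higher = more concentrated."""
    e = np.abs(c) ** 2
    return float((e ** 2).sum() / e.sum() ** 2)

concentrated = np.array([0.0, 5.0, 0.0, 0.0])  # single-frequency neuron
spread = np.array([2.0, 2.0, 2.0, 2.0])        # energy shared across 4 bins

assert np.isclose(ipr(concentrated), 1.0)
assert np.isclose(ipr(spread), 0.25)
assert ipr(concentrated) > ipr(spread)  # rising IPR signals single-frequency features
```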

The tests and why they matter:

  • Train on full data (p = 23) or with a 75% train split for grokking studies; measure loss, accuracy, phase alignment, and IPR.
  • Compare activations swapped at test time to check even-component robustness.
  • Run ablations that remove frequencies or shrink phase ranges to test how much diversification and symmetry matter.

The competition:

  • Baseline understanding: prior reports of Fourier features and grokking. This paper adds precise neuron-level dynamics, diversification conditions, and proofs.

The scoreboard with context:

  • Single-frequency emergence: Heatmaps after DFT show nearly all neurons pick exactly one frequency; alignment satisfies output phase ≈ 2 × input phase. That’s like getting a perfect solo note from each singer.
  • Diversification and symmetry: With M = 512, all 11 nontrivial frequencies appear across neurons, and phases within each frequency look uniform; magnitudes are tightly clustered. That’s like an orchestra covering the full scale with balanced sections—no one blasting too loud.
  • Majority-vote success: The summed logits form a sharp peak at the correct sum with only tiny side peaks; softmax then chooses correctly with near certainty—an A+ when others hovered at B–.
  • Activation swapping: ReLU-trained models keep 100% accuracy when swapped to |x|, x^2, x^4, or exp(x). They fail for purely odd activations like x or x^3. This is strong evidence that even parts carry the solution.
  • Grokking timeline: Training loss hits zero early, but test loss remains high; later, test loss drops sharply, then slowly polishes to perfect accuracy. This matches the three-stage theory: memorize → prune via weight decay → polish.
  • Ablations (frequency/phase): Reducing covered frequencies or squeezing phase ranges makes loss spike. Full diversification yields the lowest loss by far—like covering all notes vs. playing with half a piano.
  • ReLU leakage: Small energy appears at harmonics (e.g., 3k*, 5k*). Theory predicts leakage decays roughly like 1/r^2 with harmonic order, matching observations. So the main note still dominates cleanly.

Surprising findings:

  • Even-component robustness: The model’s success doesn’t depend on the exact shape of ReLU; any activation with strong even parts works after training.
  • Lottery ticket clarity: Within a single neuron, the winner is highly predictable from tiny initial differences—an unexpectedly clean race dynamic.

Quantitative flavor:

  • With p = 23 and M = 512, all nontrivial frequencies appear; phases within each group behave like uniform draws on the circle. The ensemble produces a large logit gap favoring the true sum; IPR rises steadily, signaling single-frequency dominance in each neuron. Under train/test split with weight decay, test accuracy transitions from near 0% after memorization to near 100% post-pruning.

05Discussion & Limitations

Limitations (be specific):

  • Two-layer focus: Results are for shallow networks; deeper architectures may add interactions not captured here.
  • Toy domain: Modular addition is wonderfully clean but simpler than natural images or language; some behaviors may not translate directly.
  • Small-initialization regime: The cleanest proofs rely on tiny starts (for decoupling); large or exotic initializations might change early dynamics.
  • Overparameterization: The clearest diversification and symmetry arise when width is large; small models may not exhibit the same cancellation power.
  • Exact DFT structure: The discrete circular nature (mod p) makes Fourier analysis ideal; tasks without such group structure may need different tools.

Resources needed:

  • Sufficient width (hundreds of neurons) to realize full frequency coverage and phase symmetry.
  • Enough training steps for phase alignment and pruning to complete (especially the slow final stage in grokking).
  • Weight decay to encourage sparsity in frequency space and enable the test-time jump.

When NOT to use this approach:

  • Non-cyclic or noisy-label tasks where Fourier structure is weak or symmetry assumptions break.
  • Extremely small models that cannot achieve diversification; majority voting won’t be decisive.
  • Training without any regularization where leftover frequencies persist and harm generalization.

Open questions:

  • Deeper nets: How do multi-layer interactions modify or amplify phase alignment and diversification?
  • Other groups: Can similar majority-vote and alignment mechanisms solve tasks on different algebraic structures?
  • Optimizers: AdamW vs. SGD—how do momentum and normalization affect the race dynamics and symmetry formation?
  • Data regimes: How do partial or corrupted datasets reshape the lottery and the grokking timeline?
  • Monitoring tools: Can real-time DFT dashboards predict the grokking jump and guide training interventions?

06Conclusion & Future Work

Three-sentence summary:

  • This paper shows that two-layer networks learn modular addition by turning weights into single-frequency Fourier features whose phases align across layers, while the whole network spreads across frequencies and balances phases to cancel noise.
  • A neuron-level lottery ticket mechanism explains which frequency wins inside each neuron, and the ensemble acts as a majority vote that approximates an indicator for the correct sum.
  • Grokking emerges as memorization followed by two generalization stages driven by loss minimization and weight decay, which prune extra frequencies and polish features until sudden generalization appears.

Main achievement:

  • A complete, mechanistic, and dynamical account—from per-neuron ODEs to network-level diversification—explaining both how the solution is represented (majority voting in Fourier space) and how it reliably emerges from random initialization.

Future directions:

  • Extend to deeper networks and other algebraic tasks; study different regularizers and optimizers; build training monitors that watch phase alignment, frequency coverage, and IPR to predict and accelerate grokking.

Why remember this:

  • It transforms a mystery (sudden generalization) into a clear recipe: per-neuron alignment and sparsity, network-level diversity and symmetry, and careful pruning. This blueprint can guide principled design and diagnosis of learning in broader AI systems.

Practical Applications

  • Monitor Fourier features of weights during training to detect phase alignment and single-frequency emergence early.
  • Ensure sufficient network width so all needed frequencies can be represented and phase symmetry can form.
  • Use weight decay schedules to gently prune non-feature frequencies and accelerate grokking.
  • Choose activations with strong even components (e.g., ReLU, |x|, x^2) to maintain performance and robustness.
  • Design initializations that mildly bias useful frequencies or better alignment to shorten the lottery race.
  • Build dashboards that track phase misalignment D, IPR, and frequency coverage to predict the grokking jump.
  • Apply the majority-vote diversification idea to other cyclic or group-structured tasks (e.g., rotations).
  • Use ablation checks (limit frequencies or phase ranges) during debugging to verify the need for diversification.
  • Adopt training regimes (e.g., normalized or spherical GD) if quadratic-like instabilities appear.
  • Leverage symmetry-aware data splits and augmentations to encourage balanced phase distributions.
#modular addition#Fourier features#discrete Fourier transform#phase alignment#frequency diversification#phase symmetry#lottery ticket mechanism#gradient flow#grokking#weight decay#single-frequency neurons#majority voting#ReLU leakage#sparsity#softmax dynamics