
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

Intermediate
Zeyuan Allen-Zhu · 12/19/2025
arXiv · PDF

Key Summary

  • The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.
  • Using controlled “synthetic” practice tasks, the authors fairly compare different model designs without the usual real-world noise.
  • Canon layers boost Transformers’ reasoning depth by 2–4×, reasoning breadth by about 30%, and knowledge skills, all at minimal extra cost.
  • They rescue NoPE (no positional embeddings) models, making them perform as well as or better than RoPE models while improving long-context behavior.
  • Linear models (GLA, Mamba2, GDN) also improve with Canon, with GLA+Canon rivaling or beating Mamba2 in several skills.
  • Mamba2’s built-in conv1d acts like a partial Canon; removing it drops performance to basic GLA, showing that horizontal mixing is key.
  • Even after adding Canon everywhere, Transformers still reason deeper than linear models, while linear models store more facts per parameter.
  • At academic scale (1.3B params, 100B tokens), real-world training is noisy; Canon trends persist, but all models still fail simple 2-hop reasoning.
  • The bottleneck for linear models is not memory size but small errors that stack up when compressing and retrieving information repeatedly.
  • Synthetic tasks offer a clean, low-cost way to predict which architectures will scale well as data and training methods improve.

Why This Research Matters

Better AI design choices should come from clear science, not guesswork. This work builds a clean “physics lab” of synthetic tasks that isolates core skills so we can compare architectures fairly. Canon layers are a tiny, practical addition that make models reason deeper and learn faster with almost no extra cost. They also reduce reliance on fragile positional tricks and help weaker models catch up, making strong performance more accessible. For industry, this means cheaper experiments, faster iteration, and more reliable predictions about what will scale. For users, it points toward future assistants that follow longer chains of thought, manage multiple constraints at once, and stay accurate in longer documents.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re on a soccer team where everyone is great at kicking, but hardly anyone talks during the game. You might win easy matches, but in tough games, those missed shout-outs and quick passes cost you goals.

🥬 The Situation Before: Language models (LMs) were scoring many points—answering questions, summarizing, and coding—but we didn’t really know which team formations (architectures) worked best. We often judged models by a number called perplexity (how “surprised” they are by text). That seems helpful, but it’s like judging a soccer team only by how loudly they cheer, not by their passes and goals. Also, at the “junior league” training size (about 1.3B parameters and 100B tokens), random luck changes scores by a few percent, hiding real differences between designs. Worse, these small models consistently failed even super-simple 2-hop reasoning (think: A was born in the same year as B, and B was born in 1970, so A was born in 1970)—they guessed randomly.

🍞 Hook: You know how some homework makes you smarter at one thing (like fractions) and other homework mixes too many topics at once so it’s hard to see what you’re learning?

🥬 The Problem: Real-world text is a big soup of skills (facts, logic, style, trivia). When models train on this soup, it’s hard to tell whether better results come from the data, the training tricks, or the model’s design. We also see “grokking,” where models suddenly understand a concept after a long delay—unpredictable and messy for comparing designs.

🍞 Hook: Imagine a science lab with a vacuum chamber where you remove all the messy air so you can study pure motion.

🥬 The Gap: We needed a clean playground that isolates single skills—like depth of reasoning (how many steps in a chain), breadth of reasoning (handling many branches at once), storing facts in weights (capacity), using facts in mental math (manipulation), and understanding nested structures (like grammar). With infinite high-quality synthetic data, we can remove noise, run fair races, and see mini scaling-laws (how performance grows with size and difficulty).

🍞 Hook: Think of assembly lines—parts move forward, but workers also need to pass tools sideways to neighbors for quick fixes.

🥬 What’s Missing in Today’s Models: Transformers are great at global attention (looking everywhere), but they oddly lack a simple sideways whisper inside a layer. Even for easy tasks like recalling a matching pair, they often need two layers: the first layer copies local info forward; the second layer uses it. That’s like using a crane to move a pencil to your neighbor. Linear models compress the past into a tiny memory; that’s fast but can blur nearby details.

🍞 Hook: Imagine adding small conveyor belts between neighbors on each workstation.

🥬 The Paper’s Answer: Add Canon layers—tiny, learnable, local mixers that let each token directly blend with its recent neighbors (like the past 3). They’re named after a musical “canon,” where a melody repeats with a delay. These layers slot in before/inside attention and before/inside MLPs (four spots called A, B, C, D). They cost little, work with any architecture (Transformer, linear attention, state-space), and massively improve reasoning in the synthetic playground. They also match trends in real-world training at academic scale.

🍞 Anchor: After installing these “sideways conveyors,” weak teams (NoPE, plain GLA) suddenly play like stars (matching RoPE or Mamba2/GDN). Transformers reason 2–4× deeper; linear models handle more branches and math; and everyone needs fewer fancy position tricks. It’s a small part with a big payoff.

02 Core Idea

🍞 Hook: You know how in a group project, quick whispers to the classmate next to you save time compared to sending a formal email to the whole class?

🥬 Aha in One Sentence: Give models a light, local, sideways channel—Canon layers—so nearby tokens can quickly share information inside the same layer.

How it works (recipe):

  1. At a token, collect its own hidden vector plus the last few neighbors (like t, t−1, t−2, t−3).
  2. Mix them with a tiny 1D causal convolution (trainable weights), then add back the original (residual) vector.
  3. Do this at multiple spots in the block: A (before attention), B (inside attention projections), C (before MLP), D (inside MLP).
  4. Keep it cheap: no extra global attention needed, just a small local mixer.
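To make the recipe concrete, here is a minimal sketch of such a layer in PyTorch, assuming a depthwise causal conv1d with kernel size 4 plus a residual add; the class name CanonLayer and the tensor layout are illustrative choices, not the paper's reference implementation.

```python
# Minimal Canon-layer sketch (assumption: depthwise causal conv1d, kernel 4, residual add).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise conv: each channel mixes only with its own recent history.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        x = h.transpose(1, 2)                      # (batch, d_model, seq_len) for Conv1d
        x = F.pad(x, (self.kernel_size - 1, 0))    # left-pad only, so token t sees t, t-1, t-2, t-3
        mixed = self.conv(x).transpose(1, 2)       # back to (batch, seq_len, d_model)
        return h + mixed                           # residual: the model can ignore Canon if unhelpful
```

Because the padding is only on the left, the mixing stays strictly causal, and the residual path means the layer starts out as a small perturbation rather than a disruptive rewrite of the hidden states.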

Why it matters: Without a sideways lane, attention spends layers just to pass local info along. With Canon, models master 1- and 2-hop steps faster and can climb to 4, 8, 16 hops with the same training budget.

🍞 Anchor: Like adding short hallways between neighboring classrooms so kids can hand off worksheets without walking through the main lobby each time.

Multiple analogies:

  1. Neighborhood plumbing: Instead of pumping water citywide (global attention) to reach your next-door house, install a short pipe between neighbors (Canon) so small deliveries are instant.
  2. Orchestra canon: One violin starts a melody, the next joins a bar later, and so on—overlapping patterns help the whole group stay in sync.
  3. Post-it relay: Sticky notes passed seat-to-seat beat broadcasting to the whole auditorium when only your row needs it.

Before vs After:

  • Before: Transformers often need an extra layer for trivial local passing; NoPE struggles on structured reasoning; linear models blur nearby details under compression.
  • After: One small Canon boosts reasoning depth 2–4×, fixes NoPE to RoPE-level, and turns GLA into a strong contender; even Mamba2’s secret sauce is a partial Canon-like conv.

Why it works (intuition): Deep reasoning is learned step-by-step (1-hop → 2-hop → 4-hop ...). Faster early steps snowball. Canon accelerates those early wins by improving local feature quality that attention or recurrence then chains together.

Building blocks (each with Sandwich):

  • 🍞 You know how you chat with kids sitting left and right? 🥬 Horizontal Information Flow: It’s the sideways sharing between neighboring tokens. How: blend each token with a few recent ones. Why: without it, models waste power to deliver tiny local facts. 🍞 Example: Finding who “he” refers to usually needs just a few nearby words.
  • 🍞 Imagine plug-in sockets on a power strip. 🥬 Canon-A/B/C/D: Four spots in a block where Canon can plug in. How: A before attention, B inside attention projections, C before MLP, D inside MLP. Why: different spots help different computations; stacking gives cumulative gains. 🍞 Example: Canon-ACD often works great even if B is untouched.
  • 🍞 Think of safety rails on stairs. 🥬 Residual Connections: Always add the original vector back after mixing. How: h' = h + conv1d(neighbors). Why: stabilizes training and lets the model ignore Canon if unhelpful. 🍞 Example: Canon with no residual wobbles; with residual it’s steady.
  • 🍞 A 4-key piano chord. 🥬 Tiny Causal Conv1d: A short, learnable kernel (size ~4) looks only backward in time. How: multiplies and sums small local windows. Why: captures just-enough neighborhood context at tiny cost. 🍞 Example: Even fixed random kernels already help; learned ones help more.
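In symbols, and under the same depthwise assumption as the sketch above, the residual Canon update for the hidden state at position t can be written as:

```latex
h'_t \;=\; h_t \;+\; \sum_{i=0}^{K-1} w_i \odot h_{t-i}, \qquad K \approx 4
```

where the w_i are learned per-channel kernel weights, ⊙ is elementwise multiplication, and terms with t − i < 0 are simply dropped (causal padding). Whether the kernel mixes channels or stays depthwise is an implementation choice; the essential ingredients are that the mixing is local, causal, and wrapped in a residual.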

03 Methodology

🍞 Hook: Imagine designing a video game tutorial with levels that each teach exactly one move—jump, duck, dash—so you can measure each skill cleanly.

🥬 High-Level Recipe: Input (sequence) → [Canon layers at A/B/C/D to mix local neighbors] → [Attention/Linear/SSM to do global or compressed reasoning] → [MLP to transform features] → Output (next tokens/answers). We pretrain and test on five synthetic tasks, each isolating one “atomic” skill. We compare base architectures (Transformers with RoPE or NoPE, linear attention GLA, state-space Mamba2, and gated-delta GDN) with and without Canon.

🍞 Anchor: Like five mini-games: one for long chains (depth), one for many branches (breadth), one for memory (capacity), one for mental math (manipulation), and one for nested patterns (structure).

Now each step and each task (Sandwich style):

  1. Data Playground Setup
  • 🍞 You know how a math workbook has sections: add, subtract, multiply—no mixing—so you can see what you’ve mastered?
  • 🥬 Synthetic Tasks: We create infinite clean practice so each dataset trains just one capability. How: we control size and difficulty (N, K, L), fix lengths to typical windows (like 2048–4096), and ensure enough mid-difficulty cases to avoid grokking surprises. Why: Real data mixes skills and adds noise; here, we run fair, repeatable races.
  • 🍞 Example: We can precisely ask, “How deep a chain can you do at 95% accuracy within this training budget?”
  2. Canon Layer Integration
  • 🍞 Imagine clipping tiny fans onto a laptop at four vents where heat builds up.
  • 🥬 Canon-A/B/C/D: Add small causal conv1d (kernel ≈ 4) with residual around it. How: A before attention, B after Q/K/V projections, C before MLP, D inside MLP (before activation). Why: Each spot boosts a different part of the computation; stacking helps the most (a placement sketch follows this list).
  • 🍞 Example: Canon-ACD often beats changing attention itself, proving it’s general-purpose sideways mixing.
  3. Training Protocol
  • 🍞 Think of cooking two identical cakes to be sure the oven (randomness) didn’t trick you.
  • 🥬 Controlled Pretraining: same batch size, step count, and learning-rate grid; same data order across seeds; multiple model sizes and difficulty levels (“3×4 mini scaling-laws”). Why: this stabilizes comparisons and exposes where design, not luck, drives gains.
  • 🍞 Example: If Model X beats Y across sizes and difficulties, it’s a true architectural win.
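As noted in step 2, here is a rough sketch of how Canon-A, -C, and -D could slot into a pre-norm Transformer block (Canon-B, inside the attention projections, is omitted for brevity). CanonLayer is the toy module sketched earlier; the attention module and the exact interaction with normalization are stand-ins, not the paper's code.

```python
# Hedged sketch of Canon-ACD placement in a pre-norm Transformer block.
# Assumes the CanonLayer module from the earlier sketch is in scope.
import torch.nn as nn

class CanonMLP(nn.Module):
    """Feed-forward block with Canon-D applied after the up-projection, before the activation."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.canon_d = CanonLayer(d_hidden)     # D: inside the MLP, before the activation
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(self.act(self.canon_d(self.up(x))))

class BlockWithCanonACD(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, attn: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.canon_a = CanonLayer(d_model)      # A: local mixing before attention
        self.canon_c = CanonLayer(d_model)      # C: local mixing before the MLP
        self.attn = attn                        # any causal self-attention module
        self.mlp = CanonMLP(d_model, d_hidden)

    def forward(self, h):
        h = h + self.attn(self.canon_a(self.norm1(h)))   # attention sub-block with Canon-A
        h = h + self.mlp(self.canon_c(self.norm2(h)))    # MLP sub-block with Canon-C and -D
        return h
```

The point of the sketch is how little machinery is involved: each placement is a few lines wrapping an existing sub-block, which is why Canon can be tested uniformly across Transformers, linear attention, and state-space models.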

The Five Atomic Tasks:

  • Depo (Reasoning Depth)

    • 🍞 Imagine jumping along stepping stones exactly k steps ahead.
    • 🥬 What: Given a scrambled list of directed edges forming a big cycle, answer the k-th successor for many queries. How: Train on a spread of k (like up to 8 or 16), then test the hardest k and biggest graphs. Why: Deep chains reveal whether models can reliably stack steps.
    • 🍞 Example: With Canon, Transformers go from failing k=4 to nailing k=8 or 16 (a toy generator sketch for this task appears after the list).
  • Brevo (Reasoning Breadth)

    • 🍞 Picture a family tree question: “List all of Alice’s nephews,” requiring many branches at once.
    • 🥬 What: Given a DAG (dependencies), output all nodes that feed into a query, in topological order. How: The correct answer requires planning the whole ordering before typing the first token. Why: Tests parallel, bottom-up reasoning over multiple branches.
    • 🍞 Example: Canon gives roughly 30% more breadth capacity (bigger DAGs at similar accuracy).
  • Capo (Knowledge Capacity)

    • 🍞 Think of how many flashcards your brain can store reliably after 100 practice views.
    • 🥬 What: Synthetic biographies encode facts; measure bits-per-parameter learned. How: Undertrain (100 exposures) to magnify architecture differences in learning speed/stability. Why: Efficient, stable training translates to higher reliable storage.
    • 🍞 Example: Canon recovers 10–15% capacity for gated MLP or MoE by accelerating learning.
  • Mano (Knowledge Manipulation)

    • 🍞 Like mental math: you know 7×8=56 and 56+9=65; do it in your head without writing steps.
    • 🥬 What: Evaluate nested modular arithmetic (e.g., ((a×b)+(c−d)) mod 23) without chain-of-thought. How: Mix retrieval from internal tables (23×23) plus composition rules. Why: Tests the combo of “recall facts” + “compute over them” purely mentally.
    • 🍞 Example: Canon extends the length of expressions models can handle by ≈30%.
  • Lano (Hierarchical Language Structure)

    • 🍞 Imagine building a Lego castle where pieces must fit a grammar of rules; you must plan globally.
    • 🥬 What: Generate sequences from a context-free grammar with local ambiguity; only the whole-sequence parse resolves meaning. How: Requires learning a dynamic-programming-like strategy over long inputs. Why: Tests structural reasoning and long-range dependencies.
    • 🍞 Example: Canon improves scores on harder grammars, though very deep nesting still stretches compute.
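To give a feel for how cheap and controllable these tasks are, here is a toy generator in the spirit of Depo (referenced in the Depo item above), under the simplifying assumptions that the graph is a single cycle over N named nodes and the query asks for the k-th successor of one node; node names, token format, and difficulty schedule are illustrative, not the paper's exact specification.

```python
# Toy Depo-style generator: scrambled edges of one big cycle + a "k-th successor" query.
import random

def make_depo_example(n_nodes=16, k=4, seed=None):
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(n_nodes)]
    rng.shuffle(nodes)                                        # random cycle order
    edges = [(nodes[i], nodes[(i + 1) % n_nodes]) for i in range(n_nodes)]
    rng.shuffle(edges)                                        # scramble so edge order gives nothing away
    start = rng.choice(nodes)
    answer = nodes[(nodes.index(start) + k) % n_nodes]        # ground truth is known exactly
    prompt = " ".join(f"{a}->{b}" for a, b in edges) + f" | {k}-th successor of {start} ?"
    return prompt, answer

prompt, answer = make_depo_example(n_nodes=8, k=3, seed=0)
print(prompt)
print("answer:", answer)
```

Dialing n_nodes and k up or down is the whole difficulty knob, the supply of fresh examples is effectively infinite, and accuracy is unambiguous because the correct answer is computed alongside the prompt; the other four tasks follow the same pattern with different underlying structures.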
  4. Secret Sauce
  • 🍞 Like adding bike lanes to reduce traffic jams.
  • 🥬 Why Canon Works: It boosts the quality of local features so global mechanisms (attention or compressed memory) can chain them more reliably. Without it, early hop-errors cascade. Why: Faster early-hop mastery lets models climb to deeper hops within the same budget.
  • 🍞 Example: Transformers with Canon jump from shallow to deep multi-hop quickly; linear models double breadth and math length.

04 Experiments & Results

🍞 Hook: Imagine a science fair where everyone solves the same five mini-games under a timer, and we grade both how many levels they beat and how fast they learned.

🥬 The Tests and Why:

  • Depth (Depo): Can you follow longer chains? This shows if the model can stack steps without slipping.
  • Breadth (Brevo): Can you handle many branches? This reveals parallel planning.
  • Capacity (Capo): How many bits of fact can you store reliably per parameter under tight practice?
  • Manipulation (Mano): Can you do mental math with your stored facts and rules?
  • Structure (Lano): Can you resolve globally ambiguous sequences with nested rules?

We compare base models (RoPE/NoPE Transformers, GLA, Mamba2, GDN) with and without Canon, over multiple sizes and difficulties, and then test at academic real-world pretraining scale (1.3B/100B tokens).

🍞 Anchor: Think of color-coded scoreboards where Canon bars jump much higher across most games.

Scoreboard with Context:

  • Transformers + Canon: Reasoning depth jumped 2–4× (like going from barely finishing level 4 to confidently beating level 8 or 16). Breadth rose ≈30% (tackle bigger DAGs). Capacity up 10–15% for slower-to-train MLPs, and manipulation length up ≈30%. On structure (Lano), Canon helps, though very deep nesting remains challenging—expected since parsing cost grows quickly with sequence length.
  • NoPE + Canon: Before, NoPE was like a car with no GPS—good at not overfitting to short roads but lost in the city. Canon gives it excellent local directions: it matches or even beats RoPE+Canon on several synthetic tasks and tends to generalize better to long contexts. It also outperforms RoPE-fixes like ALiBi/H-ALiBi.
  • Linear Models + Canon: GLA gets a universal boost—depth from 1 to 4 hops, double breadth and manipulation length—often surpassing Mamba2. Mamba2’s internal conv1d is a partial Canon; removing it drops Mamba2 to GLA-level. Replacing it with full Canon wins further. GDN benefits less (its gating already mimics some Canon behavior) but still inches up.
  • Final fair match (everyone gets full Canon): Transformers still win in deep reasoning (2–4× more depth), while linear models win in bits-per-parameter factual capacity (~40% more). Linear depth limits stem from small, compounding encode/retrieve errors in compressed memory, not from too-small memory.

Real-World Academic-Scale Findings (1.3B/100B):

  • Noise is high: scores swing by 1–4% across random seeds; many differences are statistically mushy.
  • Canon trends persist: NoPE+Canon ≈ RoPE+Canon; GLA+Canon ≈ or > Mamba2, ≈ GDN; removing Mamba2’s conv1d hurts a lot.
  • All models fail 2-hop reasoning even in short contexts (~100 tokens)—a wake-up call that today’s academic-scale pretraining pipelines underteach deep reasoning.
  • Cutting back RoPE (or using NoPE) helps length generalization once Canon is present.

Surprises:

  • Tiny local mixers (Canon) change the game far more than many heavier tricks.
  • Mamba2’s edge largely comes from a Canon-like conv1d; the SSM part isn’t the main hero for these tasks.
  • Even equalized with Canon, linear models still lag in deep chains because tiny errors stack during compression/retrieval.

🍞 Anchor: Picture two athletes: one lifts heavier weights (capacity, linear models), the other runs deeper obstacle courses (reasoning depth, Transformers). Canon is like better shoes: both improve, but their specialties remain.

05 Discussion & Limitations

🍞 Hook: If you add training wheels to two bikes, both ride safer, but one still might be faster uphill and the other stronger downhill.

🥬 Honest Assessment:

  • Limitations:
    1. Academic-scale real-world training is noisy. Many 1–3% gains may be luck; only bigger moves are trustworthy.
    2. Synthetic tasks are clean by design; they can’t capture every real-world quirk (messy language, rare edge cases, tool use).
    3. Very deep structural tasks (like our hardest grammar) still strain compute; Canon helps but doesn’t rewrite complexity laws.
    4. Results beyond 1.3B/100B need larger-scale confirmation, though early signs at 1–8B/1–2T look aligned.
  • Required Resources: commodity GPUs (A100/H100 class), bf16 training, and access to the simple conv1d kernels (H3 library). Synthetic data is cheap and infinite; that’s the point.
  • When Not to Use (or Not Alone):
    • If your main goal is extreme long-context retrieval (like 1M tokens), you’ll still want specialized retrieval/compression modules. Canon doesn’t replace those; it complements them inside the standard 4k window where deep reasoning happens.
    • If your pipeline is entirely judged by perplexity at early training, Canon’s real strengths (reasoning accuracy) might be underappreciated.
  • Open Questions:
    1. Can dynamic or gated Canons (input-conditioned weights) beat static conv1d enough to justify added cost?
    2. What’s the best Canon placement schedule across layers for maximum win per FLOP?
    3. Can we design linear memories with lower encode/retrieve error so depth catches up to Transformers?
    4. How do Canon layers interact with RL-based post-training that supplies tailored curricula?

🍞 Anchor: Think of Canon as a trusty Swiss Army knife: lightweight, widely useful, not a magic wand—but it makes many jobs easier and hints where to invent the next big tool.

06 Conclusion & Future Work

🍞 Hook: Small hinges swing big doors; sometimes a tiny part changes how the whole machine feels.

🥬 Three-Sentence Summary: This paper adds Canon layers—tiny local mixers that let neighboring tokens share information—to a clean synthetic playground that isolates core skills like depth, breadth, capacity, manipulation, and structure. Canon consistently boosts reasoning (2–4× deeper) and stabilizes training across Transformers, linear attention, and SSMs, rescuing weak setups (like NoPE, plain GLA) and clarifying that linear models’ depth limits come from compounding compression/retrieval errors, not memory size. Real-world academic-scale runs echo these trends but also show today’s pipelines still don’t teach even simple 2-hop reasoning well.

Main Achievement: Identifying “horizontal information flow” as a missing primitive and delivering Canon layers as a minimal, architecture-agnostic fix that’s easy to implement and yields outsized gains.

Future Directions: Explore dynamic/gated Canons, optimize where and how often to place them, invent lower-error linear memories, enrich the synthetic task zoo, and validate at larger scales. Canon can also pair with long-context retrieval systems to form practical hybrids: linear efficiency for storage, Transformer precision for deep chains.

Why Remember This: Like residual connections or LayerNorm, Canon layers are a small idea with a big practical footprint—they make models learn deeper reasoning faster, with minimal cost, and give researchers a clean lab to study what really matters in architecture design.

Practical Applications

  • Add Canon-ACD (causal conv1d, kernel ≈ 4, with residual) to your Transformer blocks to boost multi-hop reasoning without heavy code changes.
  • If you use NoPE for long-context robustness, pair it with Canon to match or beat RoPE-level performance on many reasoning tasks.
  • For linear-attention systems (GLA), integrate Canon-AbCD to double reasoning breadth and manipulation length, rivaling Mamba2/GDN.
  • In Mamba2-like models, ensure the conv1d (partial Canon) is present; consider upgrading to full Canon-AbCD for further gains.
  • When optimizing knowledge storage under limited exposures (e.g., domain-specific facts), add Canon to recover 10–15% bits-per-parameter in gated MLP or MoE models.
  • Evaluate architectures with synthetic tasks (Depo/Brevo/Capo/Mano/Lano) to de-noise comparisons and choose designs based on targeted skills.
  • Reduce RoPE usage (e.g., apply RoPE to only a subset of dimensions) when Canon is enabled to improve length generalization without sacrificing reasoning.
  • When benchmarking small models, avoid perplexity as the main metric; test multi-hop tasks directly (e.g., 1-hop-L/2-hop-L) to reveal real reasoning ability.
  • Combine Canon-equipped Transformers (for deep reasoning) with linear/SSM components (for long-context compression) in hybrid systems.
  • Adopt a mini scaling-law grid (multiple model sizes × multiple data difficulties) to stabilize findings and avoid drawing conclusions from lucky runs.
#Canon layers#horizontal information flow#transformer architecture#linear attention#Mamba2#GDN#NoPE#RoPE#synthetic pretraining#reasoning depth#reasoning breadth#knowledge capacity#mixture-of-experts#state-space models#causal conv1d