
Transformer Theory

Key Points

  • Transformers map sequences to sequences using layers of self-attention and feed-forward networks wrapped with residual connections and LayerNorm.
  • Self-attention lets every token look at every other token, making long-range dependencies easy and efficient to capture in parallel.
  • Positional encoding injects order information into token embeddings so attention knows where things are in the sequence.
  • Residuals and LayerNorm stabilize optimization so very deep stacks of layers can be trained effectively.
  • The architecture is expressive—under mild assumptions, transformers can simulate any Turing machine, showing computational universality.
  • In-context learning emerges from pretraining: the model can learn to perform tasks at inference time by reading examples in the prompt.
  • Scaling laws show predictable improvements in loss as parameters, data, and compute grow, with power-law exponents.
  • Attention’s cost scales quadratically with sequence length; practical systems rely on memory-efficient kernels (e.g., flash attention), key/value caching, and sparse or linear attention variants to reach long contexts.

Prerequisites

  • Linear Algebra (vectors, matrices, dot products): Attention and feed-forward computations are matrix multiplications and projections.
  • Probability and Softmax: Attention weights are probabilities derived via softmax; training uses probabilistic losses.
  • Neural Networks and Backpropagation: Transformers are deep nets optimized with gradient descent and residual connections.
  • Layer Normalization: Stabilizes the training dynamics in deep architectures; essential for understanding Pre-LN vs Post-LN.
  • Optimization (SGD/Adam): Explains how models are trained to minimize cross-entropy and why normalization/residuals help.
  • Autoregressive Modeling: Decoder-only transformers predict the next token; understanding causal masking is key.
  • Computational Complexity: Attention’s O(T^2) cost motivates efficient variants for long sequences.
  • Numerical Stability: Softmax stability (log-sum-exp) and mixed precision are crucial in large models.
  • Distributed/Parallel Computing (basics): Transformers scale across GPUs/TPUs using data/model parallelism.
  • Programming in C++ (vectors, loops): To implement and experiment with forward passes and components efficiently.

Detailed Explanation


01 Overview

Transformers are a neural network architecture designed to process sequences (like text, code, audio, and DNA) using attention instead of recurrence. Each input token is first turned into a continuous vector (an embedding), then enriched with position information via positional encoding. The core building block—repeated N times—is a combination of multi-head self-attention and a position-wise feed-forward network (FFN), each wrapped with residual connections and Layer Normalization. Attention allows every token to selectively focus on other tokens, capturing dependencies regardless of distance in the sequence. Because attention is highly parallelizable, transformers train efficiently on modern hardware, which helped them displace RNNs and CNNs in many sequence tasks. The architecture’s effectiveness scales with data, parameters, and compute, exhibiting smooth power-law trends known as scaling laws. Remarkably, theoretical work shows that transformers are computationally universal under certain assumptions, meaning they can simulate any algorithm a Turing machine can. In practice, large pretrained transformers (LLMs) show emergent abilities and can learn new tasks from context alone (in-context learning), without changing their weights. Understanding how embeddings, attention, residuals, and normalization interact is essential to grasp why transformers power today’s state-of-the-art AI systems.
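
To make the data flow concrete, the following minimal C++ sketch only tracks tensor shapes through one decoder-only forward pass; the dimensions (d, h, V, N) are assumed values for illustration, and working implementations of attention, positional encoding, and a full block appear in the Code Examples section below.

#include <iostream>
#include <string>

// Shape-flow sketch for one decoder-only transformer forward pass.
// No real math is performed; each stage only reports the shape it produces.
// Dimensions below are illustrative assumptions.
static void report(const std::string &stage, int rows, int cols) {
    std::cout << stage << ": " << rows << " x " << cols << "\n";
}

int main() {
    int T = 16;     // sequence length
    int d = 512;    // model dimension (assumed)
    int h = 8;      // attention heads (assumed)
    int V = 32000;  // vocabulary size (assumed)
    int N = 2;      // layers (assumed, kept small for readable output)
    int d_k = d / h;

    report("token embeddings + positional encoding", T, d);
    for (int layer = 0; layer < N; ++layer) {
        report("  Q/K/V per head", T, d_k);
        report("  attention scores per head (quadratic in T)", T, T);
        report("  after self-attention sublayer (residual + LayerNorm)", T, d);
        report("  after FFN sublayer (residual + LayerNorm)", T, d);
    }
    report("logits over vocabulary", T, V);
    return 0;
}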

02 Intuition & Analogies

Imagine a classroom discussion. Each student (token) hears the entire conversation and decides whose words matter most to respond to next. That decision—who to pay attention to—depends on the question a student asks (query) and what others said (keys and values). If a student is answering a math question, they might mostly attend to the person who shared the formula; for a history question, they focus on the person who mentioned the relevant date. This is self-attention: every token scores and mixes information from all other tokens, guided by its own needs. Now, how do they keep track of order? Think of seat numbers or timestamps: positional encoding gives every utterance a sense of “when” it occurred, so attention can tell “Alice spoke before Bob” even if both are equally relevant. Residual connections act like keeping a copy of your notes before integrating new insights—if the new idea is confusing, you still have your original notes. LayerNorm is like repeatedly straightening and standardizing everyone’s notes so small differences don’t explode or vanish across long study sessions. The FFN is the student’s private scratchpad—after gathering insights from others, each student applies a non-linear transformation to refine their interpretation. Stack many such discussions (layers), and the class becomes adept at solving complex problems: some layers specialize in syntax (who is subject/object), others in semantics (who did what to whom), and higher ones in reasoning. With lots of practice (pretraining), the class develops surprising, emergent talents, like solving problems they’ve never seen directly, just by recognizing patterns in how examples are presented in the prompt (in-context learning).

03 Formal Definition

A transformer processes a length-T sequence of token indices x_1, …, x_T with x_t ∈ {1, …, V}. Tokens are mapped to vectors via an embedding matrix E ∈ R^{V × d_model}, then combined with positional encodings P ∈ R^{T × d_model}. Each layer applies multi-head self-attention (MHA) and a position-wise feed-forward network (FFN) with residual connections and LayerNorm. For head h with projections W_h^Q, W_h^K, W_h^V ∈ R^{d_model × d_k}, attention computes Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V. Multi-head attention concatenates heads and applies an output projection W^O. The FFN is typically FFN(x) = σ(x W_1 + b_1) W_2 + b_2 with a nonlinearity σ (e.g., ReLU or GeLU). Residual connections add inputs to sublayer outputs; LayerNorm normalizes features per token. Training minimizes next-token cross-entropy over a vocabulary using teacher forcing. The decoder-only causal variant masks future positions to ensure autoregressive generation. Theoretically, under assumptions like sufficient precision and sequence length, transformers are computationally universal: for any Turing machine M there exists a transformer T_M such that T_M(x) = M(x) for every input x. Empirically, loss scales predictably with parameters N, compute C, and data D by approximate power laws, and models demonstrate emergent in-context learning behaviors.
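
Collecting these pieces, one common Pre-LN layer update can be written compactly; the block below is a sketch of that convention, consistent with the definitions above (the Post-LN variant instead applies LayerNorm after each residual sum):

\begin{aligned}
X^{(0)} &= \mathrm{Embed}(x_{1:T}) + P \\
\mathrm{head}_i &= \mathrm{softmax}\!\Big(\tfrac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}} + M\Big)\, X W_i^V
  \qquad (M:\ \text{causal mask of } 0 \text{ or } -\infty) \\
\mathrm{MHA}(X) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \\
Z^{(\ell)} &= X^{(\ell-1)} + \mathrm{MHA}\big(\mathrm{LN}(X^{(\ell-1)})\big) \\
X^{(\ell)} &= Z^{(\ell)} + \mathrm{FFN}\big(\mathrm{LN}(Z^{(\ell)})\big),
  \qquad \mathrm{FFN}(u) = \sigma(u W_1 + b_1) W_2 + b_2 \\
\mathcal{L}(\theta) &= -\textstyle\sum_{t} \log p_\theta(x_t \mid x_{<t})
\end{aligned}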

04 When to Use

Use transformers when you need to model sequences with long-range dependencies and want efficient parallel training: language modeling, machine translation, code completion, speech recognition, protein folding, and time-series forecasting. Choose decoder-only (causal) transformers for autoregressive generation (LLMs), encoder-only (BERT-like) for understanding tasks (classification, retrieval, tagging), and encoder–decoder for sequence-to-sequence tasks (translation, summarization). If order matters but you lack explicit structure, positional encoding provides a flexible, learnable or fixed mechanism to inject sequence position. When datasets are large and compute allows, transformers benefit from scaling: bigger models on more data tend to yield better performance following power-law trends. If latency or memory is constrained and sequences are long, consider efficient attention variants (sparse, linear, local) or retrieval-augmented methods. For small datasets or strongly structured domains (e.g., small tabular data), simpler models might be preferable; however, transformers can still excel with transfer learning via pretrained checkpoints and task-specific fine-tuning or prompting.

⚠️ Common Mistakes

  • Forgetting the 1/\sqrt{d_k} scaling in attention leads to saturated softmax and poor gradients; always divide QK^T by \sqrt{d_k}.
  • Omitting causal masks in decoder attention leaks future information, causing train–test mismatch; ensure upper-triangular masking during training and inference (see the masked-attention sketch after this list).
  • Mixing up Pre-LN vs Post-LN: Post-LN stacks can be unstable at depth; many modern implementations use Pre-LN (LayerNorm before sublayers) for better optimization.
  • Ignoring numerical stability in softmax causes overflow; subtract the row-wise max before exponentiation.
  • Mismanaging shapes: ensure Q, K, V dimensions align (Q: T\times d_k, K: T\times d_k, V: T\times d_v) and multi-head concatenation matches W^O.
  • Not adding positional information makes the model permutation-invariant over tokens; always add or learn positions for sequence tasks.
  • Overlooking memory cost: attention uses O(T^2) memory for scores; for long contexts, use chunking, caching, or efficient attention.
  • Misinterpreting in-context learning as weight updates: it is behavior that arises from fixed weights processing the prompt; the model infers the task from the examples in context rather than taking gradient steps at inference time.
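
To make the masking and stability points concrete, here is a small illustrative sketch of single-head attention with a causal mask (future positions blocked) and a max-subtracted softmax, in the same style as the Code Examples below; it is educational code, not a production kernel.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

// Single-head causal attention: position i may only attend to positions j <= i.
// Masked logits are set to -infinity before the numerically stable softmax.
Matrix causal_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t T = Q.size(), d_k = Q[0].size(), d_v = V[0].size();
    const double neg_inf = -std::numeric_limits<double>::infinity();
    Matrix scores(T, std::vector<double>(T, neg_inf));
    double scale = 1.0 / std::sqrt((double)d_k);
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j <= i; ++j) {          // causal mask: only past and current tokens
            double s = 0.0;
            for (size_t k = 0; k < d_k; ++k) s += Q[i][k] * K[j][k];
            scores[i][j] = s * scale;              // 1/sqrt(d_k) scaling
        }
    for (auto &row : scores) {                     // stable softmax: subtract the row max
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }  // exp(-inf) = 0
        for (double &x : row) x /= sum;
    }
    Matrix out(T, std::vector<double>(d_v, 0.0));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < T; ++j)
            for (size_t k = 0; k < d_v; ++k)
                out[i][k] += scores[i][j] * V[j][k];
    return out;
}

int main() {
    Matrix Q = { {0.1, 0.2}, {0.3, -0.1}, {0.0, 0.4} };
    Matrix K = { {0.2, 0.1}, {-0.1, 0.3}, {0.4, 0.0} };
    Matrix V = { {1.0, 0.0}, {0.0, 1.0}, {0.5, 0.5} };
    Matrix O = causal_attention(Q, K, V);          // first row equals V[0]: token 0 sees only itself
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : O) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
    return 0;
}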

Key Formulas

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V

Explanation: Attention weights are computed by comparing queries to keys, scaled by the key dimension to prevent saturation, then applied to values. Use this as the core operation that lets tokens mix information.

Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Explanation: Each head has separate projections and captures different relations. Heads are concatenated and projected back to the model dimension.

Position-wise Feed-Forward

FFN(x) = σ(x W_1 + b_1) W_2 + b_2

Explanation: An MLP applied to each token independently adds nonlinearity and increases expressivity. Common choices for the activation σ are ReLU or GeLU, and d_ff is usually larger than d_model.

Layer Normalization

LayerNorm(x) = γ ⊙ (x − μ) / \sqrt{σ^2 + ε} + β, where μ and σ^2 are the mean and variance of each token’s features

Explanation: LayerNorm standardizes features within each token and then learns a scale (γ) and shift (β). It stabilizes gradients in deep networks.

Residual Connection

y = x + Sublayer(x)

Explanation: Residuals add a layer’s input to its output, preserving information and easing optimization. This helps train very deep models without vanishing gradients.

Sinusoidal Positional Encoding

PE(pos, 2i) = sin(pos / 10000^{2i / d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i / d_model})

Explanation: Fixed sin/cos functions at different frequencies inject order information so the model can reason about relative and absolute positions.

Softmax

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Explanation: Softmax turns raw scores into probabilities that sum to 1. It is applied row-wise to attention logits.

Autoregressive Cross-Entropy Loss

L(θ) = −Σ_{t=1}^{T} log p_θ(x_t | x_{<t})

Explanation: Training objective for decoder-only transformers: maximize the probability of the next token given the past. Lower loss means better next-token prediction.
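
The same loss can be computed stably from raw logits with the log-sum-exp trick noted in the Prerequisites; the snippet below is a small illustrative sketch using toy logits (not the output of a trained model).

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Stable next-token cross-entropy for one position:
// -log softmax(z)[target] = log(sum_j exp(z_j - m)) + m - z[target], with m = max_j z_j.
double cross_entropy(const std::vector<double> &logits, int target) {
    double m = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (double z : logits) sum += std::exp(z - m);
    return std::log(sum) + m - logits[target];
}

int main() {
    // Toy logits over a 5-token vocabulary at 3 positions, with the true next tokens.
    std::vector<std::vector<double>> logits = {
        {2.0, 0.5, -1.0, 0.0, 0.3},
        {0.1, 1.5, 0.2, -0.5, 0.0},
        {-0.2, 0.0, 0.4, 2.2, 0.1}
    };
    std::vector<int> targets = {0, 1, 3};

    double total = 0.0;
    for (size_t t = 0; t < targets.size(); ++t)
        total += cross_entropy(logits[t], targets[t]);
    std::cout << "mean next-token cross-entropy: " << total / targets.size() << "\n";
    return 0;
}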

Empirical Scaling Law

L(X) ≈ (X_c / X)^{α_X} for X ∈ {N, C, D}

Explanation: Validation loss tends to decrease approximately as a power law in parameters (N), compute (C), and data (D). Useful for planning budgets and target performance.
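
As an illustration of how such a law is used for planning, the sketch below evaluates an assumed parameter-only form L(N) = (N_c / N)^{α_N}; the constants N_c and α_N are placeholders chosen for demonstration, not fitted values from any particular study.

#include <iostream>
#include <cmath>

// Evaluate an assumed scaling law L(N) = (N_c / N)^alpha_N.
// N_c and alpha_N are hypothetical placeholders, not measured constants.
double predicted_loss(double n_params, double n_c, double alpha_n) {
    return std::pow(n_c / n_params, alpha_n);
}

int main() {
    const double n_c = 1e13;      // placeholder scale constant
    const double alpha_n = 0.08;  // placeholder exponent
    const double sizes[] = {1e8, 1e9, 1e10, 1e11};
    for (double n : sizes) {
        std::cout << "N = " << n << "  predicted loss ~ "
                  << predicted_loss(n, n_c, alpha_n) << "\n";
    }
    return 0;
}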

Computational Universality (Informal)

For every Turing machine M there exists a transformer T_M such that T_M(x) = M(x) for all inputs x

Explanation: Under certain assumptions (e.g., sufficient precision and sequence length), a transformer can emulate any Turing machine M. This expresses theoretical expressivity rather than practical design guidance.

Layer Compute Costs

Attention: O(T^2 d) time per layer; FFN: O(T d d_ff) time per layer

Explanation: Per layer, attention cost grows quadratically with sequence length T, while the FFN cost grows linearly with T but quadratically with hidden sizes. These dominate training and inference time.

Complexity Analysis

Let T be sequence length, d the model dimension, h the number of heads, and d_k = d/h the per-head key/query dimension. In a standard transformer layer, the dominant compute comes from attention and the FFN. Multi-head self-attention forms Q, K, V via three projections in O(T d^2), then computes QK^T per head in O(T^2 d_k), applies the softmax in O(T^2), and multiplies by V in O(T^2 d_v) per head. Summed over h heads, this is O(T d^2 + h T^2 d_k) ≈ O(T d^2 + T^2 d). The FFN applies two linear layers per token: O(T d d_ff + T d_ff d) ≈ O(T d d_ff). With typical d_ff ≈ 4d, FFN can be comparable to or larger than attention for moderate T, but for long T the quadratic attention term dominates. Memory-wise, caching the attention score matrix requires O(T^2) per head, plus activations O(T d) for residual paths and O(T d_ff) within the FFN. During training, activation memory often dominates; techniques like activation checkpointing, flash attention (reducing memory traffic), and mixed precision help. In decoder-only autoregressive inference with key/value caching, each new token costs O(h L d_k) ≈ O(L d), where L is the past context length (linear in L per step), and avoids recomputing K/V for the history. End-to-end, an N-layer stack multiplies these costs by N. Therefore, overall time is roughly O(N(T d^2 + T^2 d + T d d_ff)) and space O(N(T d + T^2) + parameters). Efficient or sparse attention variants reduce the quadratic T^2 factor to near-linear or log-linear in T under structural assumptions, trading exactness for scalability.
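
To ground these expressions, the sketch below plugs assumed values of T, d, d_ff, and N into the dominant-term formulas above; the outputs are order-of-magnitude operation and memory counts for comparison, not measured FLOPs.

#include <iostream>
#include <iomanip>

// Rough per-layer cost model following the dominant terms above (constant factors dropped).
// All dimensions are assumed for illustration; outputs are order-of-magnitude estimates.
int main() {
    const double T = 4096;          // sequence length (assumed)
    const double d = 4096;          // model dimension (assumed)
    const double d_ff = 4.0 * d;    // FFN hidden size (typical 4d)
    const double N = 32;            // layers (assumed)

    double proj   = 4.0 * T * d * d;    // Q, K, V, O projections: ~4 T d^2
    double scores = 2.0 * T * T * d;    // QK^T and scores*V across heads: ~2 T^2 d
    double ffn    = 2.0 * T * d * d_ff; // two linear layers: ~2 T d d_ff
    double layer  = proj + scores + ffn;

    double score_mem = T * T;           // attention score matrix entries per head
    double kv_cache  = 2.0 * N * T * d; // cached K and V entries across all layers

    std::cout << std::scientific << std::setprecision(2);
    std::cout << "projections per layer  : " << proj << "\n";
    std::cout << "attention scores/layer : " << scores << "\n";
    std::cout << "FFN per layer          : " << ffn << "\n";
    std::cout << "total per layer        : " << layer << "\n";
    std::cout << "total for all layers   : " << N * layer << "\n";
    std::cout << "score entries per head : " << score_mem << "\n";
    std::cout << "KV-cache entries       : " << kv_cache << "\n";
    return 0;
}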

Code Examples

Scaled Dot-Product Attention (single head) with numerically stable softmax
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix scale(const Matrix &A, double s) {
    Matrix B = A;
    for (auto &row : B) for (double &x : row) x *= s;
    return B;
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: T x d_k, K: T x d_k, V: T x d_v
    size_t d_k = Q[0].size();
    Matrix KT = transpose(K);
    Matrix scores = matmul(Q, KT);                    // T x T
    double scale_factor = 1.0 / std::sqrt((double)d_k);
    scores = scale(scores, scale_factor);
    softmax_rows(scores);                             // row-wise softmax
    Matrix out = matmul(scores, V);                   // T x d_v
    return out;
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Toy example: 3 tokens, d_k = d_v = 4
    Matrix Q = { {0.1, 0.2, 0.3, 0.4},
                 {0.5, 0.4, 0.3, 0.2},
                 {0.2, 0.1, 0.0, -0.1} };
    Matrix K = { {0.3, 0.1, 0.4, 0.2},
                 {0.0, -0.1, 0.2, 0.1},
                 {0.5, 0.3, 0.1, 0.2} };
    Matrix V = { {0.1, 0.0, 0.2, 0.3},
                 {0.2, 0.1, 0.0, 0.1},
                 {0.0, 0.3, 0.4, 0.1} };

    Matrix O = attention(Q, K, V);
    print_matrix(O, "Attention Output");
    return 0;
}

Computes scaled dot-product attention for a single head. Queries compare to keys to produce attention scores, scaled by 1/sqrt(d_k) for stability. Row-wise softmax produces weights that are used to average values, returning a contextualized representation per token.

Time: O(T^2 d_k + T d_k d_v) ≈ O(T^2 d). Space: O(T^2 + T d).

Sinusoidal Positional Encoding generator and application to embeddings
#include <iostream>
#include <vector>
#include <cmath>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix positional_encoding(int max_len, int d_model) {
    Matrix PE(max_len, std::vector<double>(d_model, 0.0));
    for (int pos = 0; pos < max_len; ++pos) {
        for (int i = 0; i < d_model; i += 2) {
            double angle = pos / std::pow(10000.0, (double)i / d_model);
            PE[pos][i] = std::sin(angle);
            if (i + 1 < d_model) PE[pos][i + 1] = std::cos(angle);
        }
    }
    return PE;
}

void add_inplace(Matrix &A, const Matrix &B) {
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            A[i][j] += B[i][j];
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    int T = 4; // sequence length
    int d = 8; // model dimension

    // Toy token embeddings (e.g., from a learned embedding table)
    Matrix X = {
        {0.1, 0.2, 0.1, 0.0, -0.1, 0.05, 0.2, -0.05},
        {0.0, -0.1, 0.2, 0.3, 0.1, -0.2, 0.0, 0.05},
        {0.2, 0.1, -0.1, 0.1, 0.0, 0.2, -0.2, 0.1},
        {-0.1, 0.3, 0.2, -0.2, 0.1, -0.1, 0.05, 0.0}
    };

    Matrix PE = positional_encoding(T, d);

    add_inplace(X, PE); // Add positions to embeddings

    print_matrix(PE, "Positional Encoding");
    print_matrix(X, "Embeddings + PE");
    return 0;
}

Generates fixed sinusoidal positional encodings and adds them to token embeddings. These provide absolute and relative position information so attention can reason about order.

Time: O(T · d_model). Space: O(T · d_model).

Minimal Transformer Block (Pre-LN, single-head) forward pass
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

Matrix add(const Matrix &A, const Matrix &B) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix add_bias(const Matrix &A, const Vector &b) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += b[j];
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t d_k = Q[0].size();
    Matrix scores = matmul(Q, transpose(K)); // T x T
    double scale = 1.0 / std::sqrt((double)d_k);
    for (auto &row : scores) for (double &x : row) x *= scale;
    softmax_rows(scores);
    return matmul(scores, V); // T x d_v
}

Matrix layernorm(const Matrix &X, const Vector &gamma, const Vector &beta, double eps = 1e-5) {
    size_t T = X.size(), d = X[0].size();
    Matrix Y(T, std::vector<double>(d));
    for (size_t i = 0; i < T; ++i) {
        double mean = 0.0;
        for (size_t j = 0; j < d; ++j) mean += X[i][j];
        mean /= d;
        double var = 0.0;
        for (size_t j = 0; j < d; ++j) { double v = X[i][j] - mean; var += v * v; }
        var /= d;
        double inv = 1.0 / std::sqrt(var + eps);
        for (size_t j = 0; j < d; ++j) {
            double normed = (X[i][j] - mean) * inv;
            Y[i][j] = normed * gamma[j] + beta[j];
        }
    }
    return Y;
}

Matrix gelu(const Matrix &X) {
    Matrix Y = X;
    const double k = std::sqrt(2.0 / 3.14159265358979323846); // sqrt(2/pi); avoids non-standard M_PI
    for (auto &row : Y) {
        for (double &x : row) {
            double t = x + 0.044715 * x * x * x;
            x = 0.5 * x * (1.0 + std::tanh(k * t));
        }
    }
    return Y;
}

struct Linear {
    Matrix W; // in x out
    Vector b; // out
    Linear(size_t in, size_t out, std::mt19937 &rng) {
        std::normal_distribution<double> nd(0.0, 1.0 / std::sqrt((double)in));
        W.assign(in, Vector(out));
        b.assign(out, 0.0);
        for (size_t i = 0; i < in; ++i)
            for (size_t j = 0; j < out; ++j)
                W[i][j] = nd(rng);
    }
    Matrix forward(const Matrix &X) const { // X: T x in
        Matrix Y = matmul(X, W); // T x out
        Y = add_bias(Y, b);
        return Y;
    }
};

struct TransformerBlock {
    size_t d_model;
    size_t d_ff;
    // Single-head for simplicity: d_k = d_model
    Linear WQ, WK, WV, WO; // projections
    Linear W1, W2;         // FFN
    Vector ln1_g, ln1_b, ln2_g, ln2_b; // LayerNorm params

    TransformerBlock(size_t d_model_, size_t d_ff_, std::mt19937 &rng)
        : d_model(d_model_), d_ff(d_ff_),
          WQ(d_model_, d_model_, rng),
          WK(d_model_, d_model_, rng),
          WV(d_model_, d_model_, rng),
          WO(d_model_, d_model_, rng),
          W1(d_model_, d_ff_, rng),
          W2(d_ff_, d_model_, rng) {
        ln1_g.assign(d_model, 1.0); ln1_b.assign(d_model, 0.0);
        ln2_g.assign(d_model, 1.0); ln2_b.assign(d_model, 0.0);
    }

    Matrix forward(const Matrix &X_in) {
        // Pre-LN Transformer block
        Matrix x = X_in;
        // Sublayer 1: Self-Attention
        Matrix n1 = layernorm(x, ln1_g, ln1_b);
        Matrix Q = WQ.forward(n1);
        Matrix K = WK.forward(n1);
        Matrix V = WV.forward(n1);
        Matrix attn = attention(Q, K, V);
        Matrix attn_out = WO.forward(attn);
        Matrix x1 = add(x, attn_out); // residual
        // Sublayer 2: FFN
        Matrix n2 = layernorm(x1, ln2_g, ln2_b);
        Matrix h = gelu(W1.forward(n2));
        Matrix ffn_out = W2.forward(h);
        Matrix y = add(x1, ffn_out); // residual
        return y;
    }
};

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    std::mt19937 rng(42);
    size_t T = 4;      // sequence length
    size_t d = 8;      // model dimension
    size_t d_ff = 32;  // FFN hidden size

    // Toy input: embeddings + positions (random here for demo)
    std::normal_distribution<double> nd(0.0, 1.0);
    Matrix X(T, Vector(d));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < d; ++j)
            X[i][j] = nd(rng) * 0.1;

    TransformerBlock block(d, d_ff, rng);
    Matrix Y = block.forward(X);

    print_matrix(Y, "Transformer Block Output");
    return 0;
}

Implements a minimal Pre-LN transformer block with a single self-attention head and a two-layer FFN with GeLU. It shows residual connections around both sublayers and LayerNorm before each, mirroring modern stable designs. This is a forward-only educational skeleton (no training).

Time: O(T d^2 + T^2 d + T d d_ff) for a single block (single head). Space: O(T^2 + T d + d d_ff) for activations and parameters.