Transformer Theory
Key Points
- Transformers map sequences to sequences using layers of self-attention and feed-forward networks wrapped with residual connections and LayerNorm.
- Self-attention lets every token look at every other token, making long-range dependencies easy and efficient to capture in parallel.
- Positional encoding injects order information into token embeddings so attention knows where things are in the sequence.
- Residuals and LayerNorm stabilize optimization so very deep stacks of layers can be trained effectively.
- The architecture is expressive: under mild assumptions, transformers can simulate any Turing machine, showing computational universality.
- In-context learning emerges from pretraining: the model can learn to perform tasks at inference time by reading examples in the prompt.
- Scaling laws show predictable improvements in loss as parameters, data, and compute grow, with power-law exponents.
- Attention’s cost scales quadratically in sequence length; practical systems use optimizations and memory-saving tricks to push to long contexts.
Prerequisites
- Linear Algebra (vectors, matrices, dot products) — Attention and feed-forward computations are matrix multiplications and projections.
- Probability and Softmax — Attention weights are probabilities derived via softmax; training uses probabilistic losses.
- Neural Networks and Backpropagation — Transformers are deep nets optimized with gradient descent and residual connections.
- Layer Normalization — Stabilizes the training dynamics in deep architectures; essential for understanding Pre-LN vs Post-LN.
- Optimization (SGD/Adam) — Explains how models are trained to minimize cross-entropy and why normalization/residuals help.
- Autoregressive Modeling — Decoder-only transformers predict the next token; understanding causal masking is key.
- Computational Complexity — Attention’s O(T^2) cost motivates efficient variants for long sequences.
- Numerical Stability — Softmax stability (log-sum-exp) and mixed precision are crucial in large models.
- Distributed/Parallel Computing (basics) — Transformers scale across GPUs/TPUs using data/model parallelism.
- Programming in C++ (vectors, loops) — To implement and experiment with forward passes and components efficiently.
Detailed Explanation
01 Overview
Transformers are a neural network architecture designed to process sequences (like text, code, audio, and DNA) using attention instead of recurrence. Each input token is first turned into a continuous vector (an embedding), then enriched with position information via positional encoding. The core building block—repeated N times—is a combination of multi-head self-attention and a position-wise feed-forward network (FFN), each wrapped with residual connections and Layer Normalization. Attention allows every token to selectively focus on other tokens, capturing dependencies regardless of distance in the sequence. Because attention is highly parallelizable, transformers train efficiently on modern hardware, which helped them displace RNNs and CNNs in many sequence tasks. The architecture’s effectiveness scales with data, parameters, and compute, exhibiting smooth power-law trends known as scaling laws. Remarkably, theoretical work shows that transformers are computationally universal under certain assumptions, meaning they can simulate any algorithm a Turing machine can. In practice, large pretrained transformers (LLMs) show emergent abilities and can learn new tasks from context alone (in-context learning), without changing their weights. Understanding how embeddings, attention, residuals, and normalization interact is essential to grasp why transformers power today’s state-of-the-art AI systems.
02 Intuition & Analogies
Imagine a classroom discussion. Each student (token) hears the entire conversation and decides whose words matter most to respond to next. That decision—who to pay attention to—depends on the question a student asks (query) and what others said (keys and values). If a student is answering a math question, they might mostly attend to the person who shared the formula; for a history question, they focus on the person who mentioned the relevant date. This is self-attention: every token scores and mixes information from all other tokens, guided by its own needs. Now, how do they keep track of order? Think of seat numbers or timestamps: positional encoding gives every utterance a sense of “when” it occurred, so attention can tell “Alice spoke before Bob” even if both are equally relevant. Residual connections act like keeping a copy of your notes before integrating new insights—if the new idea is confusing, you still have your original notes. LayerNorm is like repeatedly straightening and standardizing everyone’s notes so small differences don’t explode or vanish across long study sessions. The FFN is the student’s private scratchpad—after gathering insights from others, each student applies a non-linear transformation to refine their interpretation. Stack many such discussions (layers), and the class becomes adept at solving complex problems: some layers specialize in syntax (who is subject/object), others in semantics (who did what to whom), and higher ones in reasoning. With lots of practice (pretraining), the class develops surprising, emergent talents, like solving problems they’ve never seen directly, just by recognizing patterns in how examples are presented in the prompt (in-context learning).
03 Formal Definition
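A compact way to state the computation formally is as a stack of Pre-LN blocks, matching the block implemented in the Code Examples below. This is a sketch for a decoder-style stack; the output projection W_{\mathrm{vocab}} producing vocabulary logits is our notation, introduced here only for the sketch.

h^{(0)} = \mathrm{Embed}(x_{1:T}) + \mathrm{PE}_{1:T}
\tilde{h}^{(\ell)} = h^{(\ell-1)} + \mathrm{MultiHead}\big(\mathrm{LN}(h^{(\ell-1)})\big), \qquad \ell = 1, \ldots, N
h^{(\ell)} = \tilde{h}^{(\ell)} + \mathrm{FFN}\big(\mathrm{LN}(\tilde{h}^{(\ell)})\big)
p_\theta(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}\big(\mathrm{LN}(h^{(N)}_t)\, W_{\mathrm{vocab}}\big)

Each component (attention, FFN, LayerNorm, residuals, positional encoding) is defined precisely in the Key Formulas section.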
04 When to Use
Use transformers when you need to model sequences with long-range dependencies and want efficient parallel training: language modeling, machine translation, code completion, speech recognition, protein folding, and time-series forecasting. Choose decoder-only (causal) transformers for autoregressive generation (LLMs), encoder-only (BERT-like) for understanding tasks (classification, retrieval, tagging), and encoder–decoder for sequence-to-sequence tasks (translation, summarization). If order matters but you lack explicit structure, positional encoding provides a flexible, learnable or fixed mechanism to inject sequence position. When datasets are large and compute allows, transformers benefit from scaling: bigger models on more data tend to yield better performance following power-law trends. If latency or memory is constrained and sequences are long, consider efficient attention variants (sparse, linear, local) or retrieval-augmented methods. For small datasets or strongly structured domains (e.g., small tabular data), simpler models might be preferable; however, transformers can still excel with transfer learning via pretrained checkpoints and task-specific fine-tuning or prompting.
⚠️ Common Mistakes
- Forgetting the 1/\sqrt{d_k} scaling in attention leads to saturated softmax and poor gradients; always divide QK^T by \sqrt{d_k}.
- Omitting causal masks in decoder attention leaks future information, causing train–test mismatch; ensure upper-triangular masking during training and inference (a minimal masked-attention sketch follows this list).
- Mixing up Pre-LN vs Post-LN: Post-LN stacks can be unstable at depth; many modern implementations use Pre-LN (LayerNorm before sublayers) for better optimization.
- Ignoring numerical stability in softmax causes overflow; subtract the row-wise max before exponentiation.
- Mismanaging shapes: ensure Q, K, V dimensions align (Q: T\times d_k, K: T\times d_k, V: T\times d_v) and multi-head concatenation matches W^O.
- Not adding positional information makes the model permutation-invariant over tokens; always add or learn positions for sequence tasks.
- Overlooking memory cost: attention uses O(T^2) memory for scores; for long contexts, use chunking, caching, or efficient attention.
- Misinterpreting in-context learning as weight updates: it is behavior that arises from fixed weights processing the prompt; the model infers algorithms from context, not gradient steps during inference.
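Since the causal-mask pitfall is not covered by the code examples later on this page, here is a minimal, self-contained C++ sketch of masked (causal) scaled dot-product attention. The toy matrices and helper names are illustrative only, not taken from any particular library.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <iomanip>
#include <limits>

using Matrix = std::vector<std::vector<double>>;

// Row-wise softmax with max subtraction for numerical stability.
void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= sum;
    }
}

// Causal (masked) scaled dot-product attention: token t may only attend to positions <= t.
Matrix causal_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t T = Q.size(), d_k = Q[0].size(), d_v = V[0].size();
    const double neg_inf = -std::numeric_limits<double>::infinity();
    Matrix scores(T, std::vector<double>(T, 0.0));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < T; ++j) {
            if (j > i) { scores[i][j] = neg_inf; continue; } // mask out future positions
            double dot = 0.0;
            for (size_t k = 0; k < d_k; ++k) dot += Q[i][k] * K[j][k];
            scores[i][j] = dot / std::sqrt((double)d_k);     // 1/sqrt(d_k) scaling
        }
    softmax_rows(scores); // masked entries get exp(-inf) = 0 weight
    Matrix out(T, std::vector<double>(d_v, 0.0));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j <= i; ++j)
            for (size_t k = 0; k < d_v; ++k)
                out[i][k] += scores[i][j] * V[j][k];
    return out;
}

int main() {
    // Toy example: 3 tokens, d_k = d_v = 2.
    Matrix Q = { {0.1, 0.3}, {0.4, 0.2}, {0.2, 0.5} };
    Matrix K = { {0.2, 0.1}, {0.3, 0.4}, {0.1, 0.2} };
    Matrix V = { {1.0, 0.0}, {0.0, 1.0}, {0.5, 0.5} };

    Matrix O = causal_attention(Q, K, V);
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : O) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
    return 0;
}

A quick sanity check: the first output row equals the first row of V exactly, because token 0 is only allowed to attend to itself.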
Key Formulas
Scaled Dot-Product Attention
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
Explanation: Attention weights are computed by comparing queries to keys, scaled by the key dimension to prevent saturation, then applied to values. Use this as the core operation that lets tokens mix information.
Multi-Head Attention
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
Explanation: Each head has separate projections and captures different relations. Heads are concatenated and projected back to the model dimension.
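The attention code in the Code Examples section is single-head; the sketch below shows the head split described above. The random weights and helper names are ours, purely for illustration: each head attends over its own d_k-wide slice of the projected Q, K, V, and the concatenated head outputs are projected with W^O.

#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k)
            for (size_t j = 0; j < p; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Matrix random_matrix(size_t rows, size_t cols, std::mt19937 &rng) {
    std::normal_distribution<double> nd(0.0, 1.0 / std::sqrt((double)rows));
    Matrix M(rows, std::vector<double>(cols));
    for (auto &r : M) for (double &x : r) x = nd(rng);
    return M;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= sum;
    }
}

// Scaled dot-product attention on one head's slice (Q, K, V: T x d_k).
Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t T = Q.size(), d_k = Q[0].size();
    Matrix scores(T, std::vector<double>(T, 0.0));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < T; ++j) {
            double dot = 0.0;
            for (size_t k = 0; k < d_k; ++k) dot += Q[i][k] * K[j][k];
            scores[i][j] = dot / std::sqrt((double)d_k);
        }
    softmax_rows(scores);
    return matmul(scores, V);
}

// Take columns [h*d_k, (h+1)*d_k) of A -- one head's slice of the projected activations.
Matrix head_slice(const Matrix &A, size_t h, size_t d_k) {
    Matrix S(A.size(), std::vector<double>(d_k));
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < d_k; ++j)
            S[i][j] = A[i][h * d_k + j];
    return S;
}

int main() {
    std::mt19937 rng(0);
    size_t T = 4, d_model = 8, n_heads = 2, d_k = d_model / n_heads;

    // Toy input activations and learned projections (random for the demo).
    Matrix X = random_matrix(T, d_model, rng);
    Matrix WQ = random_matrix(d_model, d_model, rng);
    Matrix WK = random_matrix(d_model, d_model, rng);
    Matrix WV = random_matrix(d_model, d_model, rng);
    Matrix WO = random_matrix(d_model, d_model, rng);

    Matrix Q = matmul(X, WQ), K = matmul(X, WK), V = matmul(X, WV);

    // Run each head on its own d_k-wide slice, then concatenate head outputs.
    Matrix concat(T, std::vector<double>(d_model, 0.0));
    for (size_t h = 0; h < n_heads; ++h) {
        Matrix Oh = attention(head_slice(Q, h, d_k), head_slice(K, h, d_k), head_slice(V, h, d_k));
        for (size_t i = 0; i < T; ++i)
            for (size_t j = 0; j < d_k; ++j)
                concat[i][h * d_k + j] = Oh[i][j];
    }
    Matrix out = matmul(concat, WO); // final output projection back to d_model

    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : out) {
        for (double x : row) std::cout << std::setw(9) << x << ' ';
        std::cout << '\n';
    }
    return 0;
}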
Position-wise Feed-Forward
\mathrm{FFN}(x) = \phi(x W_1 + b_1)\, W_2 + b_2, \qquad W_1 \in \mathbb{R}^{d_{model} \times d_{ff}},\; W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}
Explanation: An MLP applied to each token independently adds nonlinearity and increases expressivity. Common choices for the activation \phi are ReLU or GeLU, and d_{ff} is usually larger than d_{model} (often around 4\times).
Layer Normalization
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{j=1}^{d} x_j,\; \sigma^2 = \frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2
Explanation: LayerNorm standardizes the features within each token and then applies a learned scale (\gamma) and shift (\beta). It stabilizes gradients in deep networks.
Residual Connection
y = x + \mathrm{Sublayer}(x)
Explanation: Residuals add a layer’s input to its output, preserving information and easing optimization. This helps train very deep models without vanishing gradients.
Sinusoidal Positional Encoding
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Explanation: Fixed sin/cos functions at different frequencies inject order information so the model can reason about relative and absolute positions.
Softmax
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
Explanation: Softmax turns raw scores into probabilities that sum to 1. It is applied row-wise to attention logits.
Autoregressive Cross-Entropy Loss
\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
Explanation: Training objective for decoder-only transformers: maximize the probability of the next token given the past. Lower loss means better next-token prediction.
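As an illustrative (not library-specific) sketch, the loss can be computed from per-position logits with a numerically stable log-sum-exp; the logits and target token ids below are toy values.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Mean autoregressive cross-entropy: -1/T * sum_t log p(x_t | x_<t),
// where logits[t] holds the model's scores over the vocabulary at step t
// and targets[t] is the index of the true next token.
double cross_entropy(const std::vector<std::vector<double>> &logits,
                     const std::vector<int> &targets) {
    double total = 0.0;
    for (size_t t = 0; t < targets.size(); ++t) {
        const auto &row = logits[t];
        double maxv = *std::max_element(row.begin(), row.end()); // log-sum-exp stabilization
        double lse = 0.0;
        for (double z : row) lse += std::exp(z - maxv);
        lse = maxv + std::log(lse);
        total += lse - row[targets[t]]; // -log softmax(row)[target]
    }
    return total / targets.size();
}

int main() {
    // Toy example: 3 positions, vocabulary of size 4.
    std::vector<std::vector<double>> logits = {
        {2.0, 0.1, -1.0, 0.3},
        {0.2, 1.5, 0.0, -0.5},
        {-0.3, 0.0, 0.8, 2.2}
    };
    std::vector<int> targets = {0, 1, 3}; // true next-token ids at each position
    std::cout << "mean cross-entropy: " << cross_entropy(logits, targets) << "\n";
    return 0;
}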
Empirical Scaling Law
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
Explanation: Validation loss tends to decrease approximately as a power law in parameters (N), compute (C), and data (D). Useful for planning budgets and target performance.
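A tiny sketch of what this predicts in practice: plugging illustrative constants into L(N) = (N_c/N)^{\alpha_N} shows steady but diminishing loss reductions as parameters grow by factors of 10. The constants below are chosen only to show the shape of the curve, not fitted to any real model.

#include <iostream>
#include <cmath>
#include <iomanip>

int main() {
    // Illustrative constants (roughly the order of magnitude seen in published
    // LM scaling studies, but treat them as placeholders, not fitted values).
    const double N_c = 8.8e13;
    const double alpha_N = 0.076;

    std::cout << std::setprecision(4);
    for (double N = 1e7; N <= 1e11; N *= 10.0) {
        double loss = std::pow(N_c / N, alpha_N); // L(N) = (N_c / N)^alpha_N
        std::cout << "N = " << N << "  ->  predicted loss ~ " << loss << "\n";
    }
    return 0;
}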
Computational Universality (Informal)
For any Turing machine M there exists a transformer (given sufficient numerical precision, depth/width, and sequence length) whose computation simulates M.
Explanation: Under certain assumptions (e.g., sufficient precision and sequence length), a transformer can emulate any Turing machine M. This expresses theoretical expressivity rather than practical design guidance.
Layer Compute Costs
\text{Attention: } O(T^2\, d_{model}), \qquad \text{FFN: } O(T\, d_{model}\, d_{ff})
Explanation: Per layer, attention cost grows quadratically with sequence length T, while the FFN cost grows linearly with T but quadratically with hidden sizes. These dominate training and inference time.
Complexity Analysis
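A back-of-the-envelope sketch of these costs (the model sizes below are illustrative, not taken from any specific system): counting only the dominant matrix multiplications and ignoring the QKV/output projections, attention costs roughly 2·T²·d_model multiply-adds per layer and the FFN roughly 2·T·d_model·d_ff, so the two terms cross at T = d_ff and attention dominates beyond that point.

#include <iostream>
#include <iomanip>

int main() {
    // Illustrative per-layer sizes.
    const double d_model = 1024.0;
    const double d_ff    = 4096.0;

    std::cout << std::setw(8) << "T"
              << std::setw(18) << "attention MACs"
              << std::setw(18) << "FFN MACs" << "\n";
    std::cout << std::scientific << std::setprecision(3);
    for (double T = 512; T <= 65536; T *= 2) {
        // Attention: QK^T scores plus the weighted sum over V.
        double attn = 2.0 * T * T * d_model;
        // FFN: two linear layers (d_model -> d_ff -> d_model).
        double ffn  = 2.0 * T * d_model * d_ff;
        std::cout << std::setw(8) << (long long)T
                  << std::setw(18) << attn
                  << std::setw(18) << ffn << "\n";
    }
    return 0;
}

For these sizes the FFN term dominates at T = 512, the two terms are equal at T = 4096, and by T = 65536 the quadratic attention term is about 16× larger, which is what motivates the efficient-attention variants mentioned above.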
Code Examples
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix scale(const Matrix &A, double s) {
    Matrix B = A;
    for (auto &row : B) for (double &x : row) x *= s;
    return B;
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: T x d_k, K: T x d_k, V: T x d_v
    size_t d_k = Q[0].size();
    Matrix KT = transpose(K);
    Matrix scores = matmul(Q, KT);  // T x T
    double scale_factor = 1.0 / std::sqrt((double)d_k);
    scores = scale(scores, scale_factor);
    softmax_rows(scores);           // row-wise softmax
    Matrix out = matmul(scores, V); // T x d_v
    return out;
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Toy example: 3 tokens, d_k = d_v = 4
    Matrix Q = { {0.1, 0.2, 0.3, 0.4},
                 {0.5, 0.4, 0.3, 0.2},
                 {0.2, 0.1, 0.0, -0.1} };
    Matrix K = { {0.3, 0.1, 0.4, 0.2},
                 {0.0, -0.1, 0.2, 0.1},
                 {0.5, 0.3, 0.1, 0.2} };
    Matrix V = { {0.1, 0.0, 0.2, 0.3},
                 {0.2, 0.1, 0.0, 0.1},
                 {0.0, 0.3, 0.4, 0.1} };

    Matrix O = attention(Q, K, V);
    print_matrix(O, "Attention Output");
    return 0;
}
Computes scaled dot-product attention for a single head. Queries compare to keys to produce attention scores, scaled by 1/sqrt(d_k) for stability. Row-wise softmax produces weights that are used to average values, returning a contextualized representation per token.
#include <iostream>
#include <vector>
#include <cmath>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix positional_encoding(int max_len, int d_model) {
    Matrix PE(max_len, std::vector<double>(d_model, 0.0));
    for (int pos = 0; pos < max_len; ++pos) {
        for (int i = 0; i < d_model; i += 2) {
            double angle = pos / std::pow(10000.0, (double)i / d_model);
            PE[pos][i] = std::sin(angle);
            if (i + 1 < d_model) PE[pos][i + 1] = std::cos(angle);
        }
    }
    return PE;
}

void add_inplace(Matrix &A, const Matrix &B) {
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            A[i][j] += B[i][j];
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    int T = 4; // sequence length
    int d = 8; // model dimension

    // Toy token embeddings (e.g., from a learned embedding table)
    Matrix X = {
        {0.1, 0.2, 0.1, 0.0, -0.1, 0.05, 0.2, -0.05},
        {0.0, -0.1, 0.2, 0.3, 0.1, -0.2, 0.0, 0.05},
        {0.2, 0.1, -0.1, 0.1, 0.0, 0.2, -0.2, 0.1},
        {-0.1, 0.3, 0.2, -0.2, 0.1, -0.1, 0.05, 0.0}
    };

    Matrix PE = positional_encoding(T, d);

    add_inplace(X, PE); // Add positions to embeddings

    print_matrix(PE, "Positional Encoding");
    print_matrix(X, "Embeddings + PE");
    return 0;
}
Generates fixed sinusoidal positional encodings and adds them to token embeddings. These provide absolute and relative position information so attention can reason about order.
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

Matrix add(const Matrix &A, const Matrix &B) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix add_bias(const Matrix &A, const Vector &b) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += b[j];
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t d_k = Q[0].size();
    Matrix scores = matmul(Q, transpose(K)); // T x T
    double scale = 1.0 / std::sqrt((double)d_k);
    for (auto &row : scores) for (double &x : row) x *= scale;
    softmax_rows(scores);
    return matmul(scores, V); // T x d_v
}

Matrix layernorm(const Matrix &X, const Vector &gamma, const Vector &beta, double eps = 1e-5) {
    size_t T = X.size(), d = X[0].size();
    Matrix Y(T, std::vector<double>(d));
    for (size_t i = 0; i < T; ++i) {
        double mean = 0.0;
        for (size_t j = 0; j < d; ++j) mean += X[i][j];
        mean /= d;
        double var = 0.0;
        for (size_t j = 0; j < d; ++j) { double v = X[i][j] - mean; var += v * v; }
        var /= d;
        double inv = 1.0 / std::sqrt(var + eps);
        for (size_t j = 0; j < d; ++j) {
            double normed = (X[i][j] - mean) * inv;
            Y[i][j] = normed * gamma[j] + beta[j];
        }
    }
    return Y;
}

Matrix gelu(const Matrix &X) {
    Matrix Y = X;
    const double pi = 3.14159265358979323846; // avoid relying on the non-standard M_PI macro
    const double k = std::sqrt(2.0 / pi);
    for (auto &row : Y) {
        for (double &x : row) {
            double t = x + 0.044715 * x * x * x;
            x = 0.5 * x * (1.0 + std::tanh(k * t));
        }
    }
    return Y;
}

struct Linear {
    Matrix W; // in x out
    Vector b; // out
    Linear(size_t in, size_t out, std::mt19937 &rng) {
        std::normal_distribution<double> nd(0.0, 1.0 / std::sqrt((double)in));
        W.assign(in, Vector(out));
        b.assign(out, 0.0);
        for (size_t i = 0; i < in; ++i)
            for (size_t j = 0; j < out; ++j)
                W[i][j] = nd(rng);
    }
    Matrix forward(const Matrix &X) const { // X: T x in
        Matrix Y = matmul(X, W); // T x out
        Y = add_bias(Y, b);
        return Y;
    }
};

struct TransformerBlock {
    size_t d_model;
    size_t d_ff;
    // Single-head for simplicity: d_k = d_model
    Linear WQ, WK, WV, WO; // projections
    Linear W1, W2;         // FFN
    Vector ln1_g, ln1_b, ln2_g, ln2_b; // LayerNorm params

    TransformerBlock(size_t d_model_, size_t d_ff_, std::mt19937 &rng)
        : d_model(d_model_), d_ff(d_ff_),
          WQ(d_model_, d_model_, rng),
          WK(d_model_, d_model_, rng),
          WV(d_model_, d_model_, rng),
          WO(d_model_, d_model_, rng),
          W1(d_model_, d_ff_, rng),
          W2(d_ff_, d_model_, rng) {
        ln1_g.assign(d_model, 1.0); ln1_b.assign(d_model, 0.0);
        ln2_g.assign(d_model, 1.0); ln2_b.assign(d_model, 0.0);
    }

    Matrix forward(const Matrix &X_in) {
        // Pre-LN Transformer block
        Matrix x = X_in;
        // Sublayer 1: Self-Attention
        Matrix n1 = layernorm(x, ln1_g, ln1_b);
        Matrix Q = WQ.forward(n1);
        Matrix K = WK.forward(n1);
        Matrix V = WV.forward(n1);
        Matrix attn = attention(Q, K, V);
        Matrix attn_out = WO.forward(attn);
        Matrix x1 = add(x, attn_out); // residual
        // Sublayer 2: FFN
        Matrix n2 = layernorm(x1, ln2_g, ln2_b);
        Matrix h = gelu(W1.forward(n2));
        Matrix ffn_out = W2.forward(h);
        Matrix y = add(x1, ffn_out); // residual
        return y;
    }
};

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    std::mt19937 rng(42);
    size_t T = 4;     // sequence length
    size_t d = 8;     // model dimension
    size_t d_ff = 32; // FFN hidden size

    // Toy input: embeddings + positions (random here for demo)
    std::normal_distribution<double> nd(0.0, 1.0);
    Matrix X(T, Vector(d));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < d; ++j)
            X[i][j] = nd(rng) * 0.1;

    TransformerBlock block(d, d_ff, rng);
    Matrix Y = block.forward(X);

    print_matrix(Y, "Transformer Block Output");
    return 0;
}
Implements a minimal Pre-LN transformer block with a single self-attention head and a two-layer FFN with GeLU. It shows residual connections around both sublayers and LayerNorm before each, mirroring modern stable designs. This is a forward-only educational skeleton (no training).