
Transformer Theory

Key Points

  • Transformers map sequences to sequences using layers of self-attention and feed-forward networks wrapped with residual connections and LayerNorm.
  • Self-attention lets every token look at every other token, making long-range dependencies easy and efficient to capture in parallel.
  • Positional encoding injects order information into token embeddings so attention knows where things are in the sequence.
  • Residuals and LayerNorm stabilize optimization so very deep stacks of layers can be trained effectively.
  • The architecture is expressive—under mild assumptions, transformers can simulate any Turing machine, showing computational universality.
  • In-context learning emerges from pretraining: the model can learn to perform tasks at inference time by reading examples in the prompt.
  • Scaling laws show predictable improvements in loss as parameters, data, and compute grow, with power-law exponents.
  • Attention’s cost scales quadratically with sequence length; practical systems rely on memory-efficient kernels (e.g., flash attention), key/value caching, and sparse or linear attention variants to reach long contexts.

Prerequisites

  • Linear Algebra (vectors, matrices, dot products): Attention and feed-forward computations are matrix multiplications and projections.
  • Probability and Softmax: Attention weights are probabilities derived via softmax; training uses probabilistic losses.
  • Neural Networks and Backpropagation: Transformers are deep nets optimized with gradient descent and residual connections.
  • Layer Normalization: Stabilizes the training dynamics in deep architectures; essential for understanding Pre-LN vs Post-LN.
  • Optimization (SGD/Adam): Explains how models are trained to minimize cross-entropy and why normalization/residuals help.
  • Autoregressive Modeling: Decoder-only transformers predict the next token; understanding causal masking is key.
  • Computational Complexity: Attention’s O(T^2) cost motivates efficient variants for long sequences.
  • Numerical Stability: Softmax stability (log-sum-exp) and mixed precision are crucial in large models.
  • Distributed/Parallel Computing (basics): Transformers scale across GPUs/TPUs using data/model parallelism.
  • Programming in C++ (vectors, loops): To implement and experiment with forward passes and components efficiently.

Detailed Explanation


01 Overview

Transformers are a neural network architecture designed to process sequences (like text, code, audio, and DNA) using attention instead of recurrence. Each input token is first turned into a continuous vector (an embedding), then enriched with position information via positional encoding. The core building block—repeated N times—is a combination of multi-head self-attention and a position-wise feed-forward network (FFN), each wrapped with residual connections and Layer Normalization. Attention allows every token to selectively focus on other tokens, capturing dependencies regardless of distance in the sequence. Because attention is highly parallelizable, transformers train efficiently on modern hardware, which helped them displace RNNs and CNNs in many sequence tasks. The architecture’s effectiveness scales with data, parameters, and compute, exhibiting smooth power-law trends known as scaling laws. Remarkably, theoretical work shows that transformers are computationally universal under certain assumptions, meaning they can simulate any algorithm a Turing machine can. In practice, large pretrained transformers (LLMs) show emergent abilities and can learn new tasks from context alone (in-context learning), without changing their weights. Understanding how embeddings, attention, residuals, and normalization interact is essential to grasp why transformers power today’s state-of-the-art AI systems.
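
To make the data flow concrete, the following minimal C++ sketch only tracks tensor shapes through one decoder-only forward pass; the dimensions (d, h, V, N) are assumed values for illustration, and working implementations of attention, positional encoding, and a full block appear in the Code Examples section below.

#include <iostream>
#include <string>

// Shape-flow sketch for one decoder-only transformer forward pass.
// No real math is performed; each stage only reports the shape it produces.
// Dimensions below are illustrative assumptions.
static void report(const std::string &stage, int rows, int cols) {
    std::cout << stage << ": " << rows << " x " << cols << "\n";
}

int main() {
    int T = 16;     // sequence length
    int d = 512;    // model dimension (assumed)
    int h = 8;      // attention heads (assumed)
    int V = 32000;  // vocabulary size (assumed)
    int N = 2;      // layers (assumed, kept small for readable output)
    int d_k = d / h;

    report("token embeddings + positional encoding", T, d);
    for (int layer = 0; layer < N; ++layer) {
        report("  Q/K/V per head", T, d_k);
        report("  attention scores per head (quadratic in T)", T, T);
        report("  after self-attention sublayer (residual + LayerNorm)", T, d);
        report("  after FFN sublayer (residual + LayerNorm)", T, d);
    }
    report("logits over vocabulary", T, V);
    return 0;
}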

02 Intuition & Analogies

Imagine a classroom discussion. Each student (token) hears the entire conversation and decides whose words matter most to respond to next. That decision—who to pay attention to—depends on the question a student asks (query) and what others said (keys and values). If a student is answering a math question, they might mostly attend to the person who shared the formula; for a history question, they focus on the person who mentioned the relevant date. This is self-attention: every token scores and mixes information from all other tokens, guided by its own needs. Now, how do they keep track of order? Think of seat numbers or timestamps: positional encoding gives every utterance a sense of “when” it occurred, so attention can tell “Alice spoke before Bob” even if both are equally relevant. Residual connections act like keeping a copy of your notes before integrating new insights—if the new idea is confusing, you still have your original notes. LayerNorm is like repeatedly straightening and standardizing everyone’s notes so small differences don’t explode or vanish across long study sessions. The FFN is the student’s private scratchpad—after gathering insights from others, each student applies a non-linear transformation to refine their interpretation. Stack many such discussions (layers), and the class becomes adept at solving complex problems: some layers specialize in syntax (who is subject/object), others in semantics (who did what to whom), and higher ones in reasoning. With lots of practice (pretraining), the class develops surprising, emergent talents, like solving problems they’ve never seen directly, just by recognizing patterns in how examples are presented in the prompt (in-context learning).

03 Formal Definition

A transformer processes a length-T sequence of token indices x_1, …, x_T with x_t ∈ {1, …, V}. Tokens are mapped to vectors via an embedding matrix E ∈ R^{V × d_model}, then combined with positional encodings P ∈ R^{T × d_model}. Each layer applies multi-head self-attention (MHA) and a position-wise feed-forward network (FFN) with residual connections and LayerNorm. For head h with projections W_h^Q, W_h^K, W_h^V ∈ R^{d_model × d_k}, attention computes Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V. Multi-head attention concatenates heads and applies an output projection W^O. The FFN is typically FFN(x) = σ(x W_1 + b_1) W_2 + b_2 with a nonlinearity σ (e.g., ReLU or GeLU). Residual connections add inputs to sublayer outputs; LayerNorm normalizes features per token. Training minimizes next-token cross-entropy over a vocabulary using teacher forcing. The decoder-only causal variant masks future positions to ensure autoregressive generation. Theoretically, under assumptions like sufficient precision and sequence length, transformers are computationally universal: for any Turing machine M there exists a transformer T_M such that T_M(x) = M(x) for every input x. Empirically, loss scales predictably with parameters N, compute C, and data D by approximate power laws, and models demonstrate emergent in-context learning behaviors.
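
Collecting these pieces, one common Pre-LN layer update can be written compactly; the block below is a sketch of that convention, consistent with the definitions above (the Post-LN variant instead applies LayerNorm after each residual sum):

\begin{aligned}
X^{(0)} &= \mathrm{Embed}(x_{1:T}) + P \\
\mathrm{head}_i &= \mathrm{softmax}\!\Big(\tfrac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}} + M\Big)\, X W_i^V
  \qquad (M:\ \text{causal mask of } 0 \text{ or } -\infty) \\
\mathrm{MHA}(X) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \\
Z^{(\ell)} &= X^{(\ell-1)} + \mathrm{MHA}\big(\mathrm{LN}(X^{(\ell-1)})\big) \\
X^{(\ell)} &= Z^{(\ell)} + \mathrm{FFN}\big(\mathrm{LN}(Z^{(\ell)})\big),
  \qquad \mathrm{FFN}(u) = \sigma(u W_1 + b_1) W_2 + b_2 \\
\mathcal{L}(\theta) &= -\textstyle\sum_{t} \log p_\theta(x_t \mid x_{<t})
\end{aligned}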

04 When to Use

Use transformers when you need to model sequences with long-range dependencies and want efficient parallel training: language modeling, machine translation, code completion, speech recognition, protein folding, and time-series forecasting. Choose decoder-only (causal) transformers for autoregressive generation (LLMs), encoder-only (BERT-like) for understanding tasks (classification, retrieval, tagging), and encoder–decoder for sequence-to-sequence tasks (translation, summarization). If order matters but you lack explicit structure, positional encoding provides a flexible, learnable or fixed mechanism to inject sequence position. When datasets are large and compute allows, transformers benefit from scaling: bigger models on more data tend to yield better performance following power-law trends. If latency or memory is constrained and sequences are long, consider efficient attention variants (sparse, linear, local) or retrieval-augmented methods. For small datasets or strongly structured domains (e.g., small tabular data), simpler models might be preferable; however, transformers can still excel with transfer learning via pretrained checkpoints and task-specific fine-tuning or prompting.

⚠️ Common Mistakes

  • Forgetting the 1/\sqrt{d_k} scaling in attention leads to saturated softmax and poor gradients; always divide QK^T by \sqrt{d_k}.
  • Omitting causal masks in decoder attention leaks future information, causing train–test mismatch; ensure upper-triangular masking during training and inference (see the masked-attention sketch after this list).
  • Mixing up Pre-LN vs Post-LN: Post-LN stacks can be unstable at depth; many modern implementations use Pre-LN (LayerNorm before sublayers) for better optimization.
  • Ignoring numerical stability in softmax causes overflow; subtract the row-wise max before exponentiation.
  • Mismanaging shapes: ensure Q, K, V dimensions align (Q: T\times d_k, K: T\times d_k, V: T\times d_v) and multi-head concatenation matches W^O.
  • Not adding positional information makes the model permutation-invariant over tokens; always add or learn positions for sequence tasks.
  • Overlooking memory cost: attention uses O(T^2) memory for scores; for long contexts, use chunking, caching, or efficient attention.
  • Misinterpreting in-context learning as weight updates: it is behavior that arises from fixed weights processing the prompt; the model infers the task from the examples in context rather than taking gradient steps at inference time.
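
To make the masking and stability points concrete, here is a small illustrative sketch of single-head attention with a causal mask (future positions blocked) and a max-subtracted softmax, in the same style as the Code Examples below; it is educational code, not a production kernel.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

// Single-head causal attention: position i may only attend to positions j <= i.
// Masked logits are set to -infinity before the numerically stable softmax.
Matrix causal_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t T = Q.size(), d_k = Q[0].size(), d_v = V[0].size();
    const double neg_inf = -std::numeric_limits<double>::infinity();
    Matrix scores(T, std::vector<double>(T, neg_inf));
    double scale = 1.0 / std::sqrt((double)d_k);
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j <= i; ++j) {          // causal mask: only past and current tokens
            double s = 0.0;
            for (size_t k = 0; k < d_k; ++k) s += Q[i][k] * K[j][k];
            scores[i][j] = s * scale;              // 1/sqrt(d_k) scaling
        }
    for (auto &row : scores) {                     // stable softmax: subtract the row max
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }  // exp(-inf) = 0
        for (double &x : row) x /= sum;
    }
    Matrix out(T, std::vector<double>(d_v, 0.0));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < T; ++j)
            for (size_t k = 0; k < d_v; ++k)
                out[i][k] += scores[i][j] * V[j][k];
    return out;
}

int main() {
    Matrix Q = { {0.1, 0.2}, {0.3, -0.1}, {0.0, 0.4} };
    Matrix K = { {0.2, 0.1}, {-0.1, 0.3}, {0.4, 0.0} };
    Matrix V = { {1.0, 0.0}, {0.0, 1.0}, {0.5, 0.5} };
    Matrix O = causal_attention(Q, K, V);          // first row equals V[0]: token 0 sees only itself
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : O) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
    return 0;
}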

Key Formulas

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V

Explanation: Attention weights are computed by comparing queries to keys, scaled by the key dimension to prevent saturation, then applied to values. Use this as the core operation that lets tokens mix information.

Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Explanation: Each head has separate projections and captures different relations. Heads are concatenated and projected back to the model dimension.

Position-wise Feed-Forward

FFN(x) = σ(x W_1 + b_1) W_2 + b_2

Explanation: An MLP applied to each token independently adds nonlinearity and increases expressivity. Common choices for the activation σ are ReLU or GeLU, and d_ff is usually larger than d_model.

Layer Normalization

LayerNorm(x) = γ ⊙ (x − μ) / \sqrt{σ^2 + ε} + β, where μ and σ^2 are the mean and variance of each token’s features

Explanation: LayerNorm standardizes features within each token and then learns a scale (γ) and shift (β). It stabilizes gradients in deep networks.

Residual Connection

y = x + Sublayer(x)

Explanation: Residuals add a layer’s input to its output, preserving information and easing optimization. This helps train very deep models without vanishing gradients.

Sinusoidal Positional Encoding

PE(pos, 2i) = sin(pos / 10000^{2i / d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i / d_model})

Explanation: Fixed sin/cos functions at different frequencies inject order information so the model can reason about relative and absolute positions.

Softmax

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Explanation: Softmax turns raw scores into probabilities that sum to 1. It is applied row-wise to attention logits.

Autoregressive Cross-Entropy Loss

L(θ) = −Σ_{t=1}^{T} log p_θ(x_t | x_{<t})

Explanation: Training objective for decoder-only transformers: maximize the probability of the next token given the past. Lower loss means better next-token prediction.
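
The same loss can be computed stably from raw logits with the log-sum-exp trick noted in the Prerequisites; the snippet below is a small illustrative sketch using toy logits (not the output of a trained model).

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Stable next-token cross-entropy for one position:
// -log softmax(z)[target] = log(sum_j exp(z_j - m)) + m - z[target], with m = max_j z_j.
double cross_entropy(const std::vector<double> &logits, int target) {
    double m = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (double z : logits) sum += std::exp(z - m);
    return std::log(sum) + m - logits[target];
}

int main() {
    // Toy logits over a 5-token vocabulary at 3 positions, with the true next tokens.
    std::vector<std::vector<double>> logits = {
        {2.0, 0.5, -1.0, 0.0, 0.3},
        {0.1, 1.5, 0.2, -0.5, 0.0},
        {-0.2, 0.0, 0.4, 2.2, 0.1}
    };
    std::vector<int> targets = {0, 1, 3};

    double total = 0.0;
    for (size_t t = 0; t < targets.size(); ++t)
        total += cross_entropy(logits[t], targets[t]);
    std::cout << "mean next-token cross-entropy: " << total / targets.size() << "\n";
    return 0;
}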

Empirical Scaling Law

L(X) ≈ (X_c / X)^{α_X} for X ∈ {N, C, D}

Explanation: Validation loss tends to decrease approximately as a power law in parameters (N), compute (C), and data (D). Useful for planning budgets and target performance.
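
As an illustration of how such a law is used for planning, the sketch below evaluates an assumed parameter-only form L(N) = (N_c / N)^{α_N}; the constants N_c and α_N are placeholders chosen for demonstration, not fitted values from any particular study.

#include <iostream>
#include <cmath>

// Evaluate an assumed scaling law L(N) = (N_c / N)^alpha_N.
// N_c and alpha_N are hypothetical placeholders, not measured constants.
double predicted_loss(double n_params, double n_c, double alpha_n) {
    return std::pow(n_c / n_params, alpha_n);
}

int main() {
    const double n_c = 1e13;      // placeholder scale constant
    const double alpha_n = 0.08;  // placeholder exponent
    const double sizes[] = {1e8, 1e9, 1e10, 1e11};
    for (double n : sizes) {
        std::cout << "N = " << n << "  predicted loss ~ "
                  << predicted_loss(n, n_c, alpha_n) << "\n";
    }
    return 0;
}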

Computational Universality (Informal)

For every Turing machine M there exists a transformer T_M such that T_M(x) = M(x) for all inputs x

Explanation: Under certain assumptions (e.g., sufficient precision and sequence length), a transformer can emulate any Turing machine M. This expresses theoretical expressivity rather than practical design guidance.

Layer Compute Costs

Attention: O(T^2 d) time per layer; FFN: O(T d d_ff) time per layer

Explanation: Per layer, attention cost grows quadratically with sequence length T, while the FFN cost grows linearly with T but quadratically with hidden sizes. These dominate training and inference time.

Complexity Analysis

Let T be sequence length, d the model dimension, h the number of heads, and d_k = d/h the per-head key/query dimension. In a standard transformer layer, the dominant compute comes from attention and the FFN. Multi-head self-attention forms Q, K, V via three projections in O(T d^2), then computes QK^T per head in O(T^2 d_k), applies the softmax in O(T^2), and multiplies by V in O(T^2 d_v) per head. Summed over h heads, this is O(T d^2 + h T^2 d_k) ≈ O(T d^2 + T^2 d). The FFN applies two linear layers per token: O(T d d_ff + T d_ff d) ≈ O(T d d_ff). With typical d_ff ≈ 4d, FFN can be comparable to or larger than attention for moderate T, but for long T the quadratic attention term dominates. Memory-wise, caching the attention score matrix requires O(T^2) per head, plus activations O(T d) for residual paths and O(T d_ff) within the FFN. During training, activation memory often dominates; techniques like activation checkpointing, flash attention (reducing memory traffic), and mixed precision help. In decoder-only autoregressive inference with key/value caching, each new token costs O(h L d_k) ≈ O(L d), where L is the past context length (linear in L per step), and avoids recomputing K/V for the history. End-to-end, an N-layer stack multiplies these costs by N. Therefore, overall time is roughly O(N(T d^2 + T^2 d + T d d_ff)) and space O(N(T d + T^2) + parameters). Efficient or sparse attention variants reduce the quadratic T^2 factor to near-linear or log-linear in T under structural assumptions, trading exactness for scalability.
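
To ground these expressions, the sketch below plugs assumed values of T, d, d_ff, and N into the dominant-term formulas above; the outputs are order-of-magnitude operation and memory counts for comparison, not measured FLOPs.

#include <iostream>
#include <iomanip>

// Rough per-layer cost model following the dominant terms above (constant factors dropped).
// All dimensions are assumed for illustration; outputs are order-of-magnitude estimates.
int main() {
    const double T = 4096;          // sequence length (assumed)
    const double d = 4096;          // model dimension (assumed)
    const double d_ff = 4.0 * d;    // FFN hidden size (typical 4d)
    const double N = 32;            // layers (assumed)

    double proj   = 4.0 * T * d * d;    // Q, K, V, O projections: ~4 T d^2
    double scores = 2.0 * T * T * d;    // QK^T and scores*V across heads: ~2 T^2 d
    double ffn    = 2.0 * T * d * d_ff; // two linear layers: ~2 T d d_ff
    double layer  = proj + scores + ffn;

    double score_mem = T * T;           // attention score matrix entries per head
    double kv_cache  = 2.0 * N * T * d; // cached K and V entries across all layers

    std::cout << std::scientific << std::setprecision(2);
    std::cout << "projections per layer  : " << proj << "\n";
    std::cout << "attention scores/layer : " << scores << "\n";
    std::cout << "FFN per layer          : " << ffn << "\n";
    std::cout << "total per layer        : " << layer << "\n";
    std::cout << "total for all layers   : " << N * layer << "\n";
    std::cout << "score entries per head : " << score_mem << "\n";
    std::cout << "KV-cache entries       : " << kv_cache << "\n";
    return 0;
}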

Code Examples

Scaled Dot-Product Attention (single head) with numerically stable softmax
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix scale(const Matrix &A, double s) {
    Matrix B = A;
    for (auto &row : B) for (double &x : row) x *= s;
    return B;
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: T x d_k, K: T x d_k, V: T x d_v
    size_t d_k = Q[0].size();
    Matrix KT = transpose(K);
    Matrix scores = matmul(Q, KT);                    // T x T
    double scale_factor = 1.0 / std::sqrt((double)d_k);
    scores = scale(scores, scale_factor);
    softmax_rows(scores);                             // row-wise softmax
    Matrix out = matmul(scores, V);                   // T x d_v
    return out;
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    // Toy example: 3 tokens, d_k = d_v = 4
    Matrix Q = { {0.1, 0.2, 0.3, 0.4},
                 {0.5, 0.4, 0.3, 0.2},
                 {0.2, 0.1, 0.0, -0.1} };
    Matrix K = { {0.3, 0.1, 0.4, 0.2},
                 {0.0, -0.1, 0.2, 0.1},
                 {0.5, 0.3, 0.1, 0.2} };
    Matrix V = { {0.1, 0.0, 0.2, 0.3},
                 {0.2, 0.1, 0.0, 0.1},
                 {0.0, 0.3, 0.4, 0.1} };

    Matrix O = attention(Q, K, V);
    print_matrix(O, "Attention Output");
    return 0;
}

Computes scaled dot-product attention for a single head. Queries compare to keys to produce attention scores, scaled by 1/sqrt(d_k) for stability. Row-wise softmax produces weights that are used to average values, returning a contextualized representation per token.

Time: O(T^2 d_k + T d_k d_v) ≈ O(T^2 d). Space: O(T^2 + T d).

Sinusoidal Positional Encoding generator and application to embeddings
#include <iostream>
#include <vector>
#include <cmath>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;

Matrix positional_encoding(int max_len, int d_model) {
    Matrix PE(max_len, std::vector<double>(d_model, 0.0));
    for (int pos = 0; pos < max_len; ++pos) {
        for (int i = 0; i < d_model; i += 2) {
            double angle = pos / std::pow(10000.0, (double)i / d_model);
            PE[pos][i] = std::sin(angle);
            if (i + 1 < d_model) PE[pos][i + 1] = std::cos(angle);
        }
    }
    return PE;
}

void add_inplace(Matrix &A, const Matrix &B) {
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            A[i][j] += B[i][j];
}

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    int T = 4; // sequence length
    int d = 8; // model dimension

    // Toy token embeddings (e.g., from a learned embedding table)
    Matrix X = {
        {0.1, 0.2, 0.1, 0.0, -0.1, 0.05, 0.2, -0.05},
        {0.0, -0.1, 0.2, 0.3, 0.1, -0.2, 0.0, 0.05},
        {0.2, 0.1, -0.1, 0.1, 0.0, 0.2, -0.2, 0.1},
        {-0.1, 0.3, 0.2, -0.2, 0.1, -0.1, 0.05, 0.0}
    };

    Matrix PE = positional_encoding(T, d);

    add_inplace(X, PE); // Add positions to embeddings

    print_matrix(PE, "Positional Encoding");
    print_matrix(X, "Embeddings + PE");
    return 0;
}

Generates fixed sinusoidal positional encodings and adds them to token embeddings. These provide absolute and relative position information so attention can reason about order.

Time: O(T · d_model). Space: O(T · d_model).

Minimal Transformer Block (Pre-LN, single-head) forward pass
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <string>
#include <iomanip>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

Matrix transpose(const Matrix &A) {
    size_t n = A.size(), m = A[0].size();
    Matrix AT(m, std::vector<double>(n));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = A[0].size(), p = B[0].size();
    Matrix C(n, std::vector<double>(p, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < p; ++j)
                C[i][j] += aik * B[k][j];
        }
    return C;
}

Matrix add(const Matrix &A, const Matrix &B) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += B[i][j];
    return C;
}

Matrix add_bias(const Matrix &A, const Vector &b) {
    Matrix C = A;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j)
            C[i][j] += b[j];
    return C;
}

void softmax_rows(Matrix &A) {
    for (auto &row : A) {
        double maxv = *std::max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double &x : row) { x = std::exp(x - maxv); sum += x; }
        for (double &x : row) x /= (sum + 1e-12);
    }
}

Matrix attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    size_t d_k = Q[0].size();
    Matrix scores = matmul(Q, transpose(K)); // T x T
    double scale = 1.0 / std::sqrt((double)d_k);
    for (auto &row : scores) for (double &x : row) x *= scale;
    softmax_rows(scores);
    return matmul(scores, V); // T x d_v
}

Matrix layernorm(const Matrix &X, const Vector &gamma, const Vector &beta, double eps = 1e-5) {
    size_t T = X.size(), d = X[0].size();
    Matrix Y(T, std::vector<double>(d));
    for (size_t i = 0; i < T; ++i) {
        double mean = 0.0;
        for (size_t j = 0; j < d; ++j) mean += X[i][j];
        mean /= d;
        double var = 0.0;
        for (size_t j = 0; j < d; ++j) { double v = X[i][j] - mean; var += v * v; }
        var /= d;
        double inv = 1.0 / std::sqrt(var + eps);
        for (size_t j = 0; j < d; ++j) {
            double normed = (X[i][j] - mean) * inv;
            Y[i][j] = normed * gamma[j] + beta[j];
        }
    }
    return Y;
}

Matrix gelu(const Matrix &X) {
    Matrix Y = X;
    const double k = std::sqrt(2.0 / 3.14159265358979323846); // sqrt(2/pi); avoids non-standard M_PI
    for (auto &row : Y) {
        for (double &x : row) {
            double t = x + 0.044715 * x * x * x;
            x = 0.5 * x * (1.0 + std::tanh(k * t));
        }
    }
    return Y;
}

struct Linear {
    Matrix W; // in x out
    Vector b; // out
    Linear(size_t in, size_t out, std::mt19937 &rng) {
        std::normal_distribution<double> nd(0.0, 1.0 / std::sqrt((double)in));
        W.assign(in, Vector(out));
        b.assign(out, 0.0);
        for (size_t i = 0; i < in; ++i)
            for (size_t j = 0; j < out; ++j)
                W[i][j] = nd(rng);
    }
    Matrix forward(const Matrix &X) const { // X: T x in
        Matrix Y = matmul(X, W); // T x out
        Y = add_bias(Y, b);
        return Y;
    }
};

struct TransformerBlock {
    size_t d_model;
    size_t d_ff;
    // Single-head for simplicity: d_k = d_model
    Linear WQ, WK, WV, WO; // projections
    Linear W1, W2;         // FFN
    Vector ln1_g, ln1_b, ln2_g, ln2_b; // LayerNorm params

    TransformerBlock(size_t d_model_, size_t d_ff_, std::mt19937 &rng)
        : d_model(d_model_), d_ff(d_ff_),
          WQ(d_model_, d_model_, rng),
          WK(d_model_, d_model_, rng),
          WV(d_model_, d_model_, rng),
          WO(d_model_, d_model_, rng),
          W1(d_model_, d_ff_, rng),
          W2(d_ff_, d_model_, rng) {
        ln1_g.assign(d_model, 1.0); ln1_b.assign(d_model, 0.0);
        ln2_g.assign(d_model, 1.0); ln2_b.assign(d_model, 0.0);
    }

    Matrix forward(const Matrix &X_in) {
        // Pre-LN Transformer block
        Matrix x = X_in;
        // Sublayer 1: Self-Attention
        Matrix n1 = layernorm(x, ln1_g, ln1_b);
        Matrix Q = WQ.forward(n1);
        Matrix K = WK.forward(n1);
        Matrix V = WV.forward(n1);
        Matrix attn = attention(Q, K, V);
        Matrix attn_out = WO.forward(attn);
        Matrix x1 = add(x, attn_out); // residual
        // Sublayer 2: FFN
        Matrix n2 = layernorm(x1, ln2_g, ln2_b);
        Matrix h = gelu(W1.forward(n2));
        Matrix ffn_out = W2.forward(h);
        Matrix y = add(x1, ffn_out); // residual
        return y;
    }
};

void print_matrix(const Matrix &A, const std::string &name) {
    std::cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    std::cout << std::fixed << std::setprecision(4);
    for (const auto &row : A) {
        for (double x : row) std::cout << std::setw(8) << x << ' ';
        std::cout << '\n';
    }
}

int main() {
    std::mt19937 rng(42);
    size_t T = 4;      // sequence length
    size_t d = 8;      // model dimension
    size_t d_ff = 32;  // FFN hidden size

    // Toy input: embeddings + positions (random here for demo)
    std::normal_distribution<double> nd(0.0, 1.0);
    Matrix X(T, Vector(d));
    for (size_t i = 0; i < T; ++i)
        for (size_t j = 0; j < d; ++j)
            X[i][j] = nd(rng) * 0.1;

    TransformerBlock block(d, d_ff, rng);
    Matrix Y = block.forward(X);

    print_matrix(Y, "Transformer Block Output");
    return 0;
}

Implements a minimal Pre-LN transformer block with a single self-attention head and a two-layer FFN with GeLU. It shows residual connections around both sublayers and LayerNorm before each, mirroring modern stable designs. This is a forward-only educational skeleton (no training).

Time: O(T d^2 + T^2 d + T d d_ff) for a single block (single head). Space: O(T^2 + T d + d d_ff) for activations and parameters.