How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Scaled Dot-Product Attention

Key Points

  • Scaled dot-product attention scores how much each value V should contribute to a query by taking dot products with keys K, scaling by \(\sqrt{d_k}\), applying softmax, and forming a weighted sum.
  • The scale \(\sqrt{d_k}\) keeps gradients well-behaved by preventing overly peaky softmax distributions as the feature dimension grows.
  • Softmax turns similarity scores into a probability distribution so weights are non-negative and sum to 1 across keys.
  • Masks (like causal or padding masks) are added before softmax as large negative numbers, blocking attention to disallowed positions.
  • In matrix form, attention can process many queries at once: \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\).
  • Multi-head attention applies this operation multiple times in parallel on projected subspaces and concatenates results.
  • A numerically stable softmax subtracts the row-wise maximum before exponentials to avoid overflow.
  • The naive implementation runs in \(O(mnd_k + mnd_v + mn)\) time for an \(m \times d_k\) query matrix and \(n\) keys/values, and uses \(O(mn)\) extra memory for the score matrix.

Prerequisites

  • →Matrix multiplication and transposition — Attention is expressed and implemented via matrix products QK^T and AV.
  • →Dot product and cosine similarity — Similarity between queries and keys is computed via dot products.
  • →Softmax and numerical stability — Turning scores into probabilities and keeping computations stable is essential.
  • →Basic probability distributions — Attention weights are probabilities that sum to 1 across keys.
  • →Time and space complexity — Understanding the O(mn) memory and O(mnd) time costs guides design decisions.
  • →Linear transformations (affine layers) — Multi-head attention uses learned linear projections of inputs into Q, K, and V.
  • →C++ vectors and loops — Implementations rely on nested loops over std::vector for clarity.

Detailed Explanation


01Overview

Scaled dot-product attention is a core operation in modern sequence models like Transformers. Given a set of queries Q, keys K, and values V, it computes attention weights that represent how relevant each value is to each query. It does this by first measuring similarity between queries and keys using dot products, then scaling these scores by the square root of the key dimension to keep values in a well-behaved numerical range. The scores are turned into probabilities with a softmax, and the final output is a weighted sum of values using these probabilities as weights. In compact matrix form, the operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Because it is differentiable end-to-end, attention can be learned via gradient descent, allowing models to dynamically focus on different parts of the input. Scaled dot-product attention is efficient, parallelizable, and flexible: it can be applied over time steps in text, patches in images, or nodes in graphs. Variants include masking to enforce causality (no peeking into the future) and multi-head attention, which runs several attention operations in parallel on different learned subspaces, capturing diverse relationships.

02Intuition & Analogies

Imagine you are searching a bookshelf for a topic (the query). Each book has an index describing its content (the key) and its actual content (the value). To answer your question, you compare your question (query) to each book’s index (key): the more similar they are, the more that book should influence your answer. You convert these similarities into a set of percentages that sum to 100% (softmax), then take a weighted average of the books’ content (values). That final averaged content is your attended answer. The dot product is a fast way to measure how aligned two descriptions are: bigger dot product means they point in similar directions. However, if your description vectors are long (high dimensional), raw dot products tend to get large just because there are many components adding up. When scores get too large, softmax becomes overly confident about just one book, which harms learning. Dividing by \(\sqrt{d_k}\) acts like a temperature knob, cooling the scores so the softmax doesn’t collapse too quickly. Masks are like a “do not look here” sticker: before computing the percentages, you subtract a huge number from the forbidden books’ scores so they effectively get weight zero after softmax. Multi-head attention is like asking several librarians with different specialties to each give a recommendation and then combining all their suggestions.

03Formal Definition

Let \(Q \in \mathbb{R}^{m \times d_k}\), \(K \in \mathbb{R}^{n \times d_k}\), and \(V \in \mathbb{R}^{n \times d_v}\). Scaled dot-product attention computes \(S = QK^\top / \sqrt{d_k}\), where \(S \in \mathbb{R}^{m \times n}\) contains pairwise similarities between queries and keys. The attention weights are \(A = \mathrm{softmax}(S)\), applied row-wise so that for each query \(i\), \(A_{i,:}\) is a probability distribution over the \(n\) keys. The output is \(O = AV \in \mathbb{R}^{m \times d_v}\), a weighted combination of values for each query. Optionally, an additive mask \(M \in \mathbb{R}^{m \times n}\) with entries 0 for allowed and \(-\infty\) for disallowed positions is added before softmax: \(A = \mathrm{softmax}(S + M)\). In multi-head attention, with \(h\) heads and projection matrices \(W^Q_i\), \(W^K_i\), \(W^V_i\), the \(i\)-th head is \(\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)\), and the final output is \(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O\). Training typically uses automatic differentiation; the softmax Jacobian \(\partial \sigma_i / \partial x_j = \sigma_i(\delta_{ij} - \sigma_j)\) governs gradient flow within each row.

04When to Use

Use scaled dot-product attention whenever you need a model to flexibly relate elements across a set or sequence. Typical cases include natural language processing (e.g., a word’s representation should attend to other relevant words), computer vision (e.g., a patch attends to other patches), and recommendation systems (e.g., a user’s current action attends to historical actions). Employ causal masking for autoregressive generation to prevent information leakage from future positions. Use padding masks to ignore padded tokens in batches of variable-length sequences. Choose multi-head attention when a single similarity notion is insufficient; multiple heads allow the model to capture different relationships (syntax vs. semantics, local vs. global, etc.). For small inputs or prototyping, a direct O(mn) score matrix is fine. For very long sequences (tens of thousands), consider approximate or sparse variants that reduce the quadratic cost, but be mindful of their trade-offs in accuracy and implementation complexity.

⚠️Common Mistakes

  • Forgetting the scaling by \(\sqrt{d_k}\) leads to overly sharp softmax distributions and unstable training, especially as \(d_k\) grows. Fix by explicitly dividing scores before softmax or by using a temperature parameter \(\tau = \sqrt{d_k}\).
  • Applying masks after softmax instead of before. Correct masking is additive before softmax, using large negative numbers to ensure masked positions receive near-zero probability.
  • Numerical instability in softmax due to large scores; always subtract the row-wise maximum prior to exponentials (log-sum-exp trick) and prefer double precision for teaching/demos or float32 with care in production.
  • Mixing up keys and values: keys determine where to look; values are the content to aggregate. Ensure K and V are aligned by position.
  • Dimension mismatches when implementing multi-head attention: verify that d_model is divisible by the number of heads and that reshapes/splits are along the correct axes.
  • Ignoring padding tokens. Without a padding mask, attention can latch onto meaningless pads, degrading performance.
  • Using zero to mask before softmax. Zeros do not nullify probabilities; you must add \(-\infty\) (or a large negative constant) so softmax returns (near) zero probability for masked entries.

Key Formulas

Scaled Dot-Product Attention

\(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\)

Explanation: Computes similarity between queries and keys, scales by the square root of key dimension, converts to probabilities via softmax, and uses them to form a weighted sum of values.

Softmax

\(\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\)

Explanation: Turns a vector of scores into non-negative weights that sum to 1, making them interpretable as probabilities.

Numerically Stable Softmax

\(\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_{j=1}^{n} e^{x_j - m}}, \quad m = \max_j x_j\)

Explanation: Subtracting the maximum score m prevents overflow in the exponentials while leaving the probabilities unchanged.

Multi-Head Attention

\(\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i), \quad \mathrm{Out} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O\)

Explanation: Each head attends in a different learned subspace; concatenating and projecting combines their information.

Masked Attention

\(A = \mathrm{softmax}(S + M), \quad M_{ij} = \begin{cases} 0 & \text{allowed} \\ -\infty & \text{masked} \end{cases}\)

Explanation: Masks are added to scores before softmax so masked positions receive (near) zero probability.

Motivation for Scaling

\(\mathrm{Var}(q^\top k) = d_k \, \mathrm{Var}(q_i k_i)\)

Explanation: With independent components, the variance of dot products grows with dimension. Dividing by \(\sqrt{d_k}\) keeps score magnitudes stable as dimensionality increases.

Softmax Jacobian

\(\frac{\partial \sigma_i}{\partial x_j} = \sigma_i(\delta_{ij} - \sigma_j)\)

Explanation: Describes how changes in a score affect each output probability; important for understanding gradient flow in attention.

Time Complexity (Naive)

\(T(m, n, d_k, d_v) = \Theta(mnd_k + mn + mnd_v)\)

Explanation: Computing the scores \(QK^\top\) takes \(\Theta(mnd_k)\), softmax is \(\Theta(mn)\), and multiplying by V costs \(\Theta(mnd_v)\).

Space for Score Matrix

\(S(m, n) = \Theta(mn)\)

Explanation: Storing the attention scores (and optionally weights) requires memory proportional to the number of query-key pairs.

Complexity Analysis

For m queries of dimension \(d_k\) attending over n keys/values (with values of dimension \(d_v\)), the naive implementation forms the score matrix \(S = QK^\top/\sqrt{d_k}\). Computing S costs \(\Theta(mnd_k)\) arithmetic operations: each of the mn entries is a dot product of length \(d_k\). Applying a row-wise softmax requires \(\Theta(mn)\): subtracting the max, exponentiating, summing, and normalizing for each of the m rows over n columns. Finally, multiplying the attention weights A (m×n) by V (n×\(d_v\)) costs \(\Theta(mnd_v)\). Thus the total time is \(T(m, n, d_k, d_v) = \Theta(mnd_k + mn + mnd_v)\), commonly dominated by the two matrix multiplications \(\Theta(mnd_k + mnd_v)\). Space-wise, beyond inputs and outputs, the main auxiliary memory is the score matrix S (and possibly A), each \(\Theta(mn)\). Masks also occupy \(\Theta(mn)\) if dense. In multi-head attention with h heads where \(d_k = d_v = d_\text{model}/h\), time scales by h (one attention per head) but each head’s dimensions are smaller; overall complexity remains \(\Theta(mn\,d_\text{model})\) per attention block, plus the cheap linear projections. When sequences are long (large n and m), the \(\Theta(mn)\) memory for S can be the primary bottleneck, motivating kernel, block-sparse, or low-rank approximations. Numerically, care must be taken to implement softmax stably (subtracting row-wise maxima) and to add masks prior to softmax via large negative numbers to ensure masked probabilities are effectively zero.

Code Examples

Single-Query Scaled Dot-Product Attention (numerically stable)
#include <bits/stdc++.h>
using namespace std;

// Compute a numerically stable softmax over a vector
vector<double> softmax_stable(const vector<double>& x) {
    double mx = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> exps(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        exps[i] = exp(x[i] - mx); // subtract max for numerical stability
        sum += exps[i];
    }
    for (size_t i = 0; i < x.size(); ++i) exps[i] /= (sum + 1e-12);
    return exps; // probabilities sum to ~1
}

// Dot product between two vectors of equal length
double dot(const vector<double>& a, const vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Single-query attention: q is 1 x d_k, K is n x d_k, V is n x d_v
vector<double> attention_single(const vector<double>& q,
                                const vector<vector<double>>& K,
                                const vector<vector<double>>& V) {
    size_t n = K.size();
    size_t d_k = q.size();
    assert(n == V.size());
    // 1) Scores: s_i = (q · K_i) / sqrt(d_k)
    vector<double> scores(n);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < n; ++i) {
        assert(K[i].size() == d_k);
        scores[i] = dot(q, K[i]) * scale;
    }
    // 2) Softmax to get attention weights
    vector<double> w = softmax_stable(scores);
    // 3) Weighted sum of values
    size_t d_v = V[0].size();
    vector<double> out(d_v, 0.0);
    for (size_t i = 0; i < n; ++i) {
        assert(V[i].size() == d_v);
        for (size_t j = 0; j < d_v; ++j) out[j] += w[i] * V[i][j];
    }
    return out;
}

int main() {
    // Example with small, readable numbers
    // d_k = 4, n = 3 values, d_v = 2
    vector<double> q = {0.1, 0.2, 0.3, 0.4};
    vector<vector<double>> K = {
        {0.0, 0.1, 0.0, 0.1},
        {0.2, 0.1, 0.0, 0.0},
        {0.1, 0.0, 0.3, 0.1}
    };
    vector<vector<double>> V = {
        {1.0, 0.0},
        {0.0, 1.0},
        {1.0, 1.0}
    };

    vector<double> out = attention_single(q, K, V);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output: ";
    for (double x : out) cout << x << ' ';
    cout << "\n";

    // Extra: verify weights sum to ~1
    double scale = 1.0 / sqrt((double)q.size());
    vector<double> scores;
    for (auto& k : K) scores.push_back(dot(q, k) * scale);
    vector<double> w = softmax_stable(scores);
    double sumw = accumulate(w.begin(), w.end(), 0.0);
    cout << "Sum of attention weights: " << sumw << "\n";
    return 0;
}

This program computes scaled dot-product attention for a single query vector. It forms similarity scores to each key via dot products, scales by \(1/\sqrt{d_k}\), applies a numerically stable softmax, and returns the weighted sum of values. The example prints the output vector and confirms that attention weights sum to 1.

Time: O(nd_k + nd_v) · Space: O(n + d_v)
Batched Scaled Dot-Product Attention with Causal Mask
#include <bits/stdc++.h>
using namespace std;

using Mat = vector<vector<double>>; // simple dense matrix helper

Mat zeros(size_t r, size_t c) { return Mat(r, vector<double>(c, 0.0)); }

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Build a lower-triangular (causal) mask: allow j <= i, disallow j > i
Mat causal_mask(size_t m, size_t n) {
    Mat M(m, vector<double>(n, 0.0));
    const double NEG_INF = -1e9; // practical -inf for softmax
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            if (j > i) M[i][j] = NEG_INF;
    return M;
}

Mat add(const Mat& A, const Mat& B) {
    size_t r = A.size(), c = A[0].size();
    assert(B.size() == r && B[0].size() == c);
    Mat C = A;
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            C[i][j] += B[i][j];
    return C;
}

// Batched attention: Q (m x d_k), K (n x d_k), V (n x d_v), optional additive mask M (m x n)
Mat attention_batched(const Mat& Q, const Mat& K, const Mat& V, const Mat* M = nullptr) {
    size_t m = Q.size(), d_k = Q[0].size();
    size_t n = K.size();
    size_t d_v = V[0].size();
    assert(K[0].size() == d_k);
    assert(V.size() == n);

    // 1) S = Q K^T / sqrt(d_k)
    Mat KT = transpose(K);
    Mat S = matmul(Q, KT);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            S[i][j] *= scale;

    // 2) Add mask before softmax, if provided
    if (M) {
        assert((*M).size() == m && (*M)[0].size() == n);
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                S[i][j] += (*M)[i][j];
    }

    // 3) Row-wise softmax to get attention weights A
    softmax_rows_inplace(S); // now S holds A

    // 4) Output O = A V
    Mat O = matmul(S, V);
    return O;
}

int main() {
    // Small demo: m=n=4, d_k=3, d_v=2, with causal mask
    size_t m = 4, n = 4, d_k = 3, d_v = 2;
    Mat Q(m, vector<double>(d_k)), K(n, vector<double>(d_k)), V(n, vector<double>(d_v));

    // Initialize with deterministic values for reproducibility
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < d_k; ++j)
            Q[i][j] = (i + 1) * 0.1 + (j + 1) * 0.01;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_k; ++j)
            K[i][j] = (i + 1) * 0.05 + (j + 1) * 0.02;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_v; ++j)
            V[i][j] = (i + 1) * 0.2 + j * 0.1;

    Mat M = causal_mask(m, n);
    Mat O = attention_batched(Q, K, V, &M);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output (each row corresponds to a query):\n";
    for (auto& row : O) {
        for (double x : row) cout << x << ' ';
        cout << '\n';
    }
    return 0;
}

This program implements the full matrix form of scaled dot-product attention with an additive causal mask. It computes S = QK^T/√d_k, adds a lower-triangular mask (−1e9 above the diagonal), applies stable row-wise softmax to get attention weights, and multiplies by V to produce outputs. The example constructs small deterministic matrices for clarity and prints the resulting attended representations.

Time: O(mnd_k + mn + mnd_v) · Space: O(mn + m d_v)
Minimal Multi-Head Self-Attention (single sequence, no bias)
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Slice columns [c0, c1) from A
Mat slice_cols(const Mat& A, size_t c0, size_t c1) {
    size_t r = A.size();
    Mat S(r, vector<double>(c1 - c0));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = c0; j < c1; ++j)
            S[i][j - c0] = A[i][j];
    return S;
}

// Concatenate matrices horizontally (same row count)
Mat hconcat(const vector<Mat>& parts) {
    size_t r = parts[0].size();
    size_t totalc = 0;
    for (auto& P : parts) { assert(P.size() == r); totalc += P[0].size(); }
    Mat C(r, vector<double>(totalc));
    size_t off = 0;
    for (auto& P : parts) {
        size_t c = P[0].size();
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < c; ++j)
                C[i][j + off] = P[i][j];
        off += c;
    }
    return C;
}

Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    size_t d_k = Q[0].size();
    Mat S = matmul(Q, transpose(K));
    double scale = 1.0 / sqrt((double)d_k);
    for (auto& row : S) for (double& x : row) x *= scale;
    softmax_rows_inplace(S);
    return matmul(S, V);
}

int main() {
    // Single sequence self-attention: X (n x d_model)
    size_t n = 5, d_model = 8, h = 2; // 2 heads => d_k = d_v = 4
    assert(d_model % h == 0);
    size_t d_k = d_model / h, d_v = d_model / h;

    // Input tokens (deterministic pattern for demo)
    Mat X(n, vector<double>(d_model));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_model; ++j)
            X[i][j] = 0.01 * (i + 1) + 0.02 * (j + 1);

    // Projection matrices: W_Q, W_K, W_V, W_O (d_model x d_model)
    auto init = [&](double base) {
        Mat W(d_model, vector<double>(d_model));
        for (size_t i = 0; i < d_model; ++i)
            for (size_t j = 0; j < d_model; ++j)
                W[i][j] = base + 0.001 * (i + 1) + 0.0001 * (j + 1);
        return W;
    };
    Mat WQ = init(0.05), WK = init(0.04), WV = init(0.03), WO = init(0.02);

    // Linear projections
    Mat Q = matmul(X, WQ), K = matmul(X, WK), V = matmul(X, WV);

    // Split into heads (concatenation of per-head outputs later)
    vector<Mat> head_outputs;
    for (size_t head = 0; head < h; ++head) {
        size_t c0 = head * d_k, c1 = c0 + d_k;
        Mat Qh = slice_cols(Q, c0, c1);
        Mat Kh = slice_cols(K, c0, c1);
        Mat Vh = slice_cols(V, c0, c1); // d_v == d_k in this minimal demo
        Mat Oh = attention(Qh, Kh, Vh);
        head_outputs.push_back(Oh);
    }

    // Concat heads and apply output projection
    Mat O_concat = hconcat(head_outputs); // n x d_model
    Mat Out = matmul(O_concat, WO);       // n x d_model

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Multi-head self-attention output (first 3 tokens):\n";
    for (size_t i = 0; i < min((size_t)3, n); ++i) {
        for (size_t j = 0; j < d_model; ++j) cout << Out[i][j] << ' ';
        cout << '\n';
    }
    return 0;
}

This program assembles a minimal multi-head self-attention layer from the scaled dot-product primitive. It projects inputs into Q, K, V, splits along the feature dimension into h heads, applies attention per head, concatenates outputs, and applies a final output projection. This mirrors the Transformer’s core computation while keeping the math simple and explicit.

Time: O(h n^2 d_k + n d_model^2) for naive matmuls · Space: O(n^2 + n d_model)
Tags: scaled dot-product attention, softmax, transformer, self-attention, causal mask, multi-head attention, query key value, numerical stability, temperature, matrix multiplication, time complexity, log-sum-exp, probability distribution