How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Scaled Dot-Product Attention

Key Points

  • Scaled dot-product attention scores how much each value V should contribute to a query by taking dot products with keys K, scaling by \(\sqrt{d_k}\), applying softmax, and forming a weighted sum.
  • The scale \(\sqrt{d_k}\) keeps gradients well-behaved by preventing overly peaky softmax distributions as the feature dimension grows.
  • Softmax turns similarity scores into a probability distribution so weights are non-negative and sum to 1 across keys.
  • Masks (like causal or padding masks) are added before softmax as large negative numbers, blocking attention to disallowed positions.
  • In matrix form, attention can process many queries at once: \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\).
  • Multi-head attention applies this operation multiple times in parallel on projected subspaces and concatenates results.
  • A numerically stable softmax subtracts the row-wise maximum before exponentials to avoid overflow.
  • The naive implementation runs in \(O(mnd_k + mnd_v + mn)\) time for an \(m \times d_k\) query matrix and \(n\) keys/values, and uses \(O(mn)\) extra memory for the score matrix.

Prerequisites

  • →Matrix multiplication and transposition — Attention is expressed and implemented via matrix products QK^T and AV.
  • →Dot product and cosine similarity — Similarity between queries and keys is computed via dot products.
  • →Softmax and numerical stability — Turning scores into probabilities and keeping computations stable is essential.
  • →Basic probability distributions — Attention weights are probabilities that sum to 1 across keys.
  • →Time and space complexity — Understanding the O(mn) memory and O(mnd) time costs guides design decisions.
  • →Linear transformations (affine layers) — Multi-head attention uses learned linear projections of inputs into Q, K, and V.
  • →C++ vectors and loops — Implementations rely on nested loops over std::vector for clarity.

Detailed Explanation


01Overview

Scaled dot-product attention is a core operation in modern sequence models like Transformers. Given a set of queries Q, keys K, and values V, it computes attention weights that represent how relevant each value is to each query. It does this by first measuring similarity between queries and keys using dot products, then scaling these scores by the square root of the key dimension to keep values in a well-behaved numerical range. The scores are turned into probabilities with a softmax, and the final output is a weighted sum of values using these probabilities as weights. In compact matrix form, the operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Because it is differentiable end-to-end, attention can be learned via gradient descent, allowing models to dynamically focus on different parts of the input. Scaled dot-product attention is efficient, parallelizable, and flexible: it can be applied over time steps in text, patches in images, or nodes in graphs. Variants include masking to enforce causality (no peeking into the future) and multi-head attention, which runs several attention operations in parallel on different learned subspaces, capturing diverse relationships.

02Intuition & Analogies

Imagine you are searching a bookshelf for a topic (the query). Each book has an index describing its content (the key) and its actual content (the value). To answer your question, you compare your question (query) to each book’s index (key): the more similar they are, the more that book should influence your answer. You convert these similarities into a set of percentages that sum to 100% (softmax), then take a weighted average of the books’ content (values). That final averaged content is your attended answer. The dot product is a fast way to measure how aligned two descriptions are: bigger dot product means they point in similar directions. However, if your description vectors are long (high dimensional), raw dot products tend to get large just because there are many components adding up. When scores get too large, softmax becomes overly confident about just one book, which harms learning. Dividing by \(\sqrt{d_k}\) acts like a temperature knob, cooling the scores so the softmax doesn’t collapse too quickly. Masks are like a “do not look here” sticker: before computing the percentages, you subtract a huge number from the forbidden books’ scores so they effectively get weight zero after softmax. Multi-head attention is like asking several librarians with different specialties to each give a recommendation and then combining all their suggestions.

03Formal Definition

Let \(Q \in \mathbb{R}^{m \times d_k}\), \(K \in \mathbb{R}^{n \times d_k}\), and \(V \in \mathbb{R}^{n \times d_v}\). Scaled dot-product attention computes \(S = QK^\top / \sqrt{d_k}\), where \(S \in \mathbb{R}^{m \times n}\) contains pairwise similarities between queries and keys. The attention weights are \(A = \mathrm{softmax}(S)\), applied row-wise so that for each query \(i\), \(A_{i,:}\) is a probability distribution over the \(n\) keys. The output is \(O = AV \in \mathbb{R}^{m \times d_v}\), a weighted combination of values for each query. Optionally, an additive mask \(M \in \mathbb{R}^{m \times n}\) with entries 0 for allowed and \(-\infty\) for disallowed positions is added before softmax: \(A = \mathrm{softmax}(S + M)\). In multi-head attention, with \(h\) heads and projection matrices \(W^Q_i\), \(W^K_i\), \(W^V_i\), the \(i\)-th head is \(\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)\), and the final output is \(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O\). Training typically uses automatic differentiation; the softmax Jacobian \(\partial \sigma_i / \partial x_j = \sigma_i(\delta_{ij} - \sigma_j)\) governs gradient flow within each row.

04When to Use

Use scaled dot-product attention whenever you need a model to flexibly relate elements across a set or sequence. Typical cases include natural language processing (e.g., a word’s representation should attend to other relevant words), computer vision (e.g., a patch attends to other patches), and recommendation systems (e.g., a user’s current action attends to historical actions). Employ causal masking for autoregressive generation to prevent information leakage from future positions. Use padding masks to ignore padded tokens in batches of variable-length sequences. Choose multi-head attention when a single similarity notion is insufficient; multiple heads allow the model to capture different relationships (syntax vs. semantics, local vs. global, etc.). For small inputs or prototyping, a direct O(mn) score matrix is fine. For very long sequences (tens of thousands), consider approximate or sparse variants that reduce the quadratic cost, but be mindful of their trade-offs in accuracy and implementation complexity.

⚠️Common Mistakes

  • Forgetting the scaling by \(\sqrt{d_k}\) leads to overly sharp softmax distributions and unstable training, especially as \(d_k\) grows. Fix by explicitly dividing scores before softmax or by using a temperature parameter \(\tau = \sqrt{d_k}\).
  • Applying masks after softmax instead of before. Correct masking is additive before softmax, using large negative numbers to ensure masked positions receive near-zero probability.
  • Numerical instability in softmax due to large scores; always subtract the row-wise maximum prior to exponentials (log-sum-exp trick) and prefer double precision for teaching/demos or float32 with care in production.
  • Mixing up keys and values: keys determine where to look; values are the content to aggregate. Ensure K and V are aligned by position.
  • Dimension mismatches when implementing multi-head attention: verify that d_model is divisible by the number of heads and that reshapes/splits are along the correct axes.
  • Ignoring padding tokens. Without a padding mask, attention can latch onto meaningless pads, degrading performance.
  • Using zero to mask before softmax. Zeros do not nullify probabilities; you must add \(-\infty\) (or a large negative constant) so softmax returns (near) zero probability for masked entries.

Key Formulas

Scaled Dot-Product Attention

\(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\)

Explanation: Computes similarity between queries and keys, scales by the square root of key dimension, converts to probabilities via softmax, and uses them to form a weighted sum of values.

Softmax

\(\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\)

Explanation: Turns a vector of scores into non-negative weights that sum to 1, making them interpretable as probabilities.

Numerically Stable Softmax

\(\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_{j=1}^{n} e^{x_j - m}}, \quad m = \max_j x_j\)

Explanation: Subtracting the maximum score m prevents overflow in the exponentials while leaving the probabilities unchanged.

Multi-Head Attention

\(\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i), \quad \mathrm{Out} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O\)

Explanation: Each head attends in a different learned subspace; concatenating and projecting combines their information.

Masked Attention

\(A = \mathrm{softmax}(S + M), \quad M_{ij} = \begin{cases} 0 & \text{allowed} \\ -\infty & \text{masked} \end{cases}\)

Explanation: Masks are added to scores before softmax so masked positions receive (near) zero probability.

Motivation for Scaling

\(\mathrm{Var}(q^\top k) = d_k \, \mathrm{Var}(q_i k_i)\)

Explanation: With independent components, the variance of dot products grows with dimension. Dividing by \(\sqrt{d_k}\) keeps score magnitudes stable as dimensionality increases.

Softmax Jacobian

\(\frac{\partial \sigma_i}{\partial x_j} = \sigma_i(\delta_{ij} - \sigma_j)\)

Explanation: Describes how changes in a score affect each output probability; important for understanding gradient flow in attention.

Time Complexity (Naive)

\(T(m, n, d_k, d_v) = \Theta(mnd_k + mn + mnd_v)\)

Explanation: Computing the scores \(QK^\top\) takes \(\Theta(mnd_k)\), softmax is \(\Theta(mn)\), and multiplying by V costs \(\Theta(mnd_v)\).

Space for Score Matrix

\(S(m, n) = \Theta(mn)\)

Explanation: Storing the attention scores (and optionally weights) requires memory proportional to the number of query-key pairs.

Complexity Analysis

For m queries of dimension \(d_k\) attending over n keys/values (with values of dimension \(d_v\)), the naive implementation forms the score matrix \(S = QK^\top/\sqrt{d_k}\). Computing S costs \(\Theta(mnd_k)\) arithmetic operations: each of the mn entries is a dot product of length \(d_k\). Applying a row-wise softmax requires \(\Theta(mn)\): subtracting the max, exponentiating, summing, and normalizing for each of the m rows over n columns. Finally, multiplying the attention weights A (m×n) by V (n×\(d_v\)) costs \(\Theta(mnd_v)\). Thus the total time is \(T(m, n, d_k, d_v) = \Theta(mnd_k + mn + mnd_v)\), commonly dominated by the two matrix multiplications \(\Theta(mnd_k + mnd_v)\). Space-wise, beyond inputs and outputs, the main auxiliary memory is the score matrix S (and possibly A), each \(\Theta(mn)\). Masks also occupy \(\Theta(mn)\) if dense. In multi-head attention with h heads where \(d_k = d_v = d_\text{model}/h\), time scales by h (one attention per head) but each head’s dimensions are smaller; overall complexity remains \(\Theta(mn\,d_\text{model})\) per attention block, plus the cheap linear projections. When sequences are long (large n and m), the \(\Theta(mn)\) memory for S can be the primary bottleneck, motivating kernel, block-sparse, or low-rank approximations. Numerically, care must be taken to implement softmax stably (subtracting row-wise maxima) and to add masks prior to softmax via large negative numbers to ensure masked probabilities are effectively zero.

Code Examples

Single-Query Scaled Dot-Product Attention (numerically stable)
#include <bits/stdc++.h>
using namespace std;

// Compute a numerically stable softmax over a vector
vector<double> softmax_stable(const vector<double>& x) {
    double mx = *max_element(x.begin(), x.end());
    double sum = 0.0;
    vector<double> exps(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        exps[i] = exp(x[i] - mx); // subtract max for numerical stability
        sum += exps[i];
    }
    for (size_t i = 0; i < x.size(); ++i) exps[i] /= (sum + 1e-12);
    return exps; // probabilities sum to ~1
}

// Dot product between two vectors of equal length
double dot(const vector<double>& a, const vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Single-query attention: q is 1 x d_k, K is n x d_k, V is n x d_v
vector<double> attention_single(const vector<double>& q,
                                const vector<vector<double>>& K,
                                const vector<vector<double>>& V) {
    size_t n = K.size();
    size_t d_k = q.size();
    assert(n == V.size());
    // 1) Scores: s_i = (q · K_i) / sqrt(d_k)
    vector<double> scores(n);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < n; ++i) {
        assert(K[i].size() == d_k);
        scores[i] = dot(q, K[i]) * scale;
    }
    // 2) Softmax to get attention weights
    vector<double> w = softmax_stable(scores);
    // 3) Weighted sum of values
    size_t d_v = V[0].size();
    vector<double> out(d_v, 0.0);
    for (size_t i = 0; i < n; ++i) {
        assert(V[i].size() == d_v);
        for (size_t j = 0; j < d_v; ++j) out[j] += w[i] * V[i][j];
    }
    return out;
}

int main() {
    // Example with small, readable numbers
    // d_k = 4, n = 3 values, d_v = 2
    vector<double> q = {0.1, 0.2, 0.3, 0.4};
    vector<vector<double>> K = {
        {0.0, 0.1, 0.0, 0.1},
        {0.2, 0.1, 0.0, 0.0},
        {0.1, 0.0, 0.3, 0.1}
    };
    vector<vector<double>> V = {
        {1.0, 0.0},
        {0.0, 1.0},
        {1.0, 1.0}
    };

    vector<double> out = attention_single(q, K, V);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output: ";
    for (double x : out) cout << x << ' ';
    cout << "\n";

    // Extra: verify weights sum to ~1
    double scale = 1.0 / sqrt((double)q.size());
    vector<double> scores;
    for (auto& k : K) scores.push_back(dot(q, k) * scale);
    vector<double> w = softmax_stable(scores);
    double sumw = accumulate(w.begin(), w.end(), 0.0);
    cout << "Sum of attention weights: " << sumw << "\n";
    return 0;
}

This program computes scaled dot-product attention for a single query vector. It forms similarity scores to each key via dot products, scales by \(1/\sqrt{d_k}\), applies a numerically stable softmax, and returns the weighted sum of values. The example prints the output vector and confirms that attention weights sum to 1.

Time: O(nd_k + nd_v) · Space: O(n + d_v)
Batched Scaled Dot-Product Attention with Causal Mask
#include <bits/stdc++.h>
using namespace std;

using Mat = vector<vector<double>>; // simple dense matrix helper

Mat zeros(size_t r, size_t c) { return Mat(r, vector<double>(c, 0.0)); }

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Build a lower-triangular (causal) mask: allow j <= i, disallow j > i
Mat causal_mask(size_t m, size_t n) {
    Mat M(m, vector<double>(n, 0.0));
    const double NEG_INF = -1e9; // practical -inf for softmax
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            if (j > i) M[i][j] = NEG_INF;
    return M;
}

Mat add(const Mat& A, const Mat& B) {
    size_t r = A.size(), c = A[0].size();
    assert(B.size() == r && B[0].size() == c);
    Mat C = A;
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            C[i][j] += B[i][j];
    return C;
}

// Batched attention: Q (m x d_k), K (n x d_k), V (n x d_v), optional additive mask M (m x n)
Mat attention_batched(const Mat& Q, const Mat& K, const Mat& V, const Mat* M = nullptr) {
    size_t m = Q.size(), d_k = Q[0].size();
    size_t n = K.size();
    size_t d_v = V[0].size();
    assert(K[0].size() == d_k);
    assert(V.size() == n);

    // 1) S = Q K^T / sqrt(d_k)
    Mat KT = transpose(K);
    Mat S = matmul(Q, KT);
    double scale = 1.0 / sqrt((double)d_k);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            S[i][j] *= scale;

    // 2) Add mask before softmax, if provided
    if (M) {
        assert((*M).size() == m && (*M)[0].size() == n);
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                S[i][j] += (*M)[i][j];
    }

    // 3) Row-wise softmax to get attention weights A
    softmax_rows_inplace(S); // now S holds A

    // 4) Output O = A V
    Mat O = matmul(S, V);
    return O;
}

int main() {
    // Small demo: m=n=4, d_k=3, d_v=2, with causal mask
    size_t m = 4, n = 4, d_k = 3, d_v = 2;
    Mat Q(m, vector<double>(d_k)), K(n, vector<double>(d_k)), V(n, vector<double>(d_v));

    // Initialize with deterministic values for reproducibility
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < d_k; ++j)
            Q[i][j] = (i + 1) * 0.1 + (j + 1) * 0.01;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_k; ++j)
            K[i][j] = (i + 1) * 0.05 + (j + 1) * 0.02;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_v; ++j)
            V[i][j] = (i + 1) * 0.2 + j * 0.1;

    Mat M = causal_mask(m, n);
    Mat O = attention_batched(Q, K, V, &M);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Output (each row corresponds to a query):\n";
    for (auto& row : O) {
        for (double x : row) cout << x << ' ';
        cout << '\n';
    }
    return 0;
}

This program implements the full matrix form of scaled dot-product attention with an additive causal mask. It computes S = QK^T/√d_k, adds a lower-triangular mask (−1e9 above the diagonal), applies stable row-wise softmax to get attention weights, and multiplies by V to produce outputs. The example constructs small deterministic matrices for clarity and prints the resulting attended representations.

Time: O(mnd_k + mn + mnd_v) · Space: O(mn + m d_v)
Minimal Multi-Head Self-Attention (single sequence, no bias)
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>;

Mat transpose(const Mat& A) {
    size_t r = A.size(), c = A[0].size();
    Mat T(c, vector<double>(r));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            T[j][i] = A[i][j];
    return T;
}

Mat matmul(const Mat& A, const Mat& B) {
    size_t r = A.size(), k = A[0].size(), c = B[0].size();
    assert(B.size() == k);
    Mat C(r, vector<double>(c, 0.0));
    for (size_t i = 0; i < r; ++i)
        for (size_t t = 0; t < k; ++t) {
            double a = A[i][t];
            for (size_t j = 0; j < c; ++j) C[i][j] += a * B[t][j];
        }
    return C;
}

void softmax_rows_inplace(Mat& S) {
    for (auto& row : S) {
        double mx = *max_element(row.begin(), row.end());
        double sum = 0.0;
        for (double& x : row) { x = exp(x - mx); sum += x; }
        double inv = 1.0 / (sum + 1e-12);
        for (double& x : row) x *= inv;
    }
}

// Slice columns [c0, c1) from A
Mat slice_cols(const Mat& A, size_t c0, size_t c1) {
    size_t r = A.size();
    Mat S(r, vector<double>(c1 - c0));
    for (size_t i = 0; i < r; ++i)
        for (size_t j = c0; j < c1; ++j)
            S[i][j - c0] = A[i][j];
    return S;
}

// Concatenate matrices horizontally (same row count)
Mat hconcat(const vector<Mat>& parts) {
    size_t r = parts[0].size();
    size_t totalc = 0;
    for (auto& P : parts) { assert(P.size() == r); totalc += P[0].size(); }
    Mat C(r, vector<double>(totalc));
    size_t off = 0;
    for (auto& P : parts) {
        size_t c = P[0].size();
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < c; ++j)
                C[i][j + off] = P[i][j];
        off += c;
    }
    return C;
}

Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    size_t d_k = Q[0].size();
    Mat S = matmul(Q, transpose(K));
    double scale = 1.0 / sqrt((double)d_k);
    for (auto& row : S) for (double& x : row) x *= scale;
    softmax_rows_inplace(S);
    return matmul(S, V);
}

int main() {
    // Single sequence self-attention: X (n x d_model)
    size_t n = 5, d_model = 8, h = 2; // 2 heads => d_k = d_v = 4
    assert(d_model % h == 0);
    size_t d_k = d_model / h, d_v = d_model / h;

    // Input tokens (deterministic pattern for demo)
    Mat X(n, vector<double>(d_model));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d_model; ++j)
            X[i][j] = 0.01 * (i + 1) + 0.02 * (j + 1);

    // Projection matrices: W_Q, W_K, W_V, W_O (d_model x d_model)
    auto init = [&](double base) {
        Mat W(d_model, vector<double>(d_model));
        for (size_t i = 0; i < d_model; ++i)
            for (size_t j = 0; j < d_model; ++j)
                W[i][j] = base + 0.001 * (i + 1) + 0.0001 * (j + 1);
        return W;
    };
    Mat WQ = init(0.05), WK = init(0.04), WV = init(0.03), WO = init(0.02);

    // Linear projections
    Mat Q = matmul(X, WQ), K = matmul(X, WK), V = matmul(X, WV);

    // Split into heads (concatenation of per-head outputs later)
    vector<Mat> head_outputs;
    for (size_t head = 0; head < h; ++head) {
        size_t c0 = head * d_k, c1 = c0 + d_k;
        Mat Qh = slice_cols(Q, c0, c1);
        Mat Kh = slice_cols(K, c0, c1);
        Mat Vh = slice_cols(V, c0, c1); // d_v == d_k in this minimal demo
        Mat Oh = attention(Qh, Kh, Vh);
        head_outputs.push_back(Oh);
    }

    // Concat heads and apply output projection
    Mat O_concat = hconcat(head_outputs); // n x d_model
    Mat Out = matmul(O_concat, WO);       // n x d_model

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Multi-head self-attention output (first 3 tokens):\n";
    for (size_t i = 0; i < min((size_t)3, n); ++i) {
        for (size_t j = 0; j < d_model; ++j) cout << Out[i][j] << ' ';
        cout << '\n';
    }
    return 0;
}

This program assembles a minimal multi-head self-attention layer from the scaled dot-product primitive. It projects inputs into Q, K, V, splits along the feature dimension into h heads, applies attention per head, concatenates outputs, and applies a final output projection. This mirrors the Transformer’s core computation while keeping the math simple and explicit.

Time: O(h n^2 d_k + n d_model^2) for naive matmuls · Space: O(n^2 + n d_model)
Tags: scaled dot-product attention, softmax, transformer, self-attention, causal mask, multi-head attention, query key value, numerical stability, temperature, matrix multiplication, time complexity, log-sum-exp, probability distribution