Sequence-to-Sequence with Attention
Key Points
- Sequence-to-sequence with attention lets a decoder focus on the most relevant parts of the input at each output step, rather than compressing everything into a single vector.
- Attention computes a set of alignment scores over encoder states and forms a context vector as a weighted average using softmax weights.
- Common attention scoring functions include dot, general (bilinear), additive (Bahdanau), and scaled dot-product (as in Transformers).
- The computational cost typically scales as O(T_s T_t d), where T_s and T_t are the source and target lengths and d is the hidden size.
- Attention greatly improves translation, summarization, and dialogue models by handling long dependencies and variable-length inputs.
- In C++, you can implement attention with plain vectors and matrices: compute scores, apply softmax row-wise, and take weighted sums.
- Training uses teacher forcing and cross-entropy loss, while decoding commonly uses greedy or beam search with attention at each step.
- Numerical stability (e.g., softmax with max subtraction) and correct masking are essential for robust implementations.
Prerequisites
- Linear Algebra (vectors, matrices, dot products) — Attention is implemented with matrix multiplications and vector operations.
- Probability and Softmax — Attention weights and output distributions are probabilities computed via softmax.
- RNNs / GRUs / LSTMs — Classic Seq2Seq encoders/decoders are recurrent; understanding state updates is key.
- Backpropagation and Gradient Descent — Training learns attention and projection parameters by minimizing cross-entropy.
- Numerical Stability — Stable softmax and masking are crucial to avoid overflow and incorrect attention.
- Autoregressive Decoding — The decoder predicts each token conditioned on previous tokens and attention.
- Beam Search (optional) — Improves inference quality over greedy decoding by exploring multiple hypotheses.
- Transformer Basics (optional) — Scaled dot-product attention generalizes the same idea to fully attention-based models.
Detailed Explanation
Overview
Sequence-to-Sequence (Seq2Seq) with Attention is an encoder–decoder architecture for mapping an input sequence (like a sentence) to an output sequence (like its translation). In a vanilla Seq2Seq model, the encoder compresses the entire input into a single fixed-size vector, and the decoder generates outputs from this bottleneck. Attention removes this bottleneck by allowing the decoder to dynamically look back at the encoder’s hidden states at every time step. Concretely, the decoder computes a set of relevance scores over all encoder states and forms a context vector as a weighted average. This context vector, combined with the decoder’s internal state, predicts the next token. This mechanism lets the model handle much longer sequences and learn alignments (which input parts support which outputs). The approach generalizes well beyond text: it’s used in speech recognition, image captioning, and other tasks that benefit from soft alignment. Variants include Bahdanau (additive) attention, Luong (dot/general) attention, and scaled dot-product attention (central to Transformers). The practical result is better accuracy, interpretable alignments, and improved gradient flow during training, especially for long-range dependencies.
Intuition & Analogies
Imagine translating a paragraph with a highlighter in your hand. You don’t first memorize the entire paragraph and then write the translation from memory; instead, at each translated word, you glance back and highlight the most relevant original words. That highlight intensity over the source words is your attention distribution, and the weighted combination of those highlighted words is your context. The decoder is like you choosing the next word to write, guided by what you’re currently focusing on plus your memory of what you’ve already written. Another analogy: think of a searchlight sweeping across a landscape (the encoder states). At each moment (each output token), the beam focuses on certain regions (relevant positions). The brightness at each location is the attention weight, and the combined reflected light is the context vector. Crucially, the beam can move and spread, capturing multiple relevant spots when necessary. Without attention, you’d be forced to summarize the whole landscape into a single snapshot—a lossy compression. With attention, you get to re-consult the original scene repeatedly, choosing different parts as the needs of the translation evolve. This dynamic lookup not only improves accuracy but also yields a human-interpretable alignment map, showing which input pieces influenced each output decision.
Formal Definition
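A compact Bahdanau-style formulation, using the same symbols as the Key Formulas section below:

```latex
\begin{aligned}
h_i &= f(h_{i-1}, x_i), \qquad i = 1,\dots,T_s \\
e_{t,i} &= \operatorname{score}(s_{t-1}, h_i) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_s} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T_s} \alpha_{t,i}\, h_i \\
p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\!\left(W_o\,[s_t; c_t] + b_o\right)
\end{aligned}
```

Here h_i are encoder states, s_{t-1} is the previous decoder state, alpha_{t,i} are attention weights, and c_t is the context vector fed into the next prediction.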
When to Use
Use Seq2Seq with attention whenever you need to map variable-length inputs to variable-length outputs and long-range dependencies matter. Core scenarios include machine translation, summarization (abstractive), question answering with spans or generated answers, image captioning (attention over CNN features), speech recognition (attention over acoustic frames), and code generation. Attention is also valuable when interpretability is required: alignment heatmaps show which input positions influenced each output token, aiding debugging and trust. If your baseline Seq2Seq struggles with long sentences or loses critical details, attention is a natural upgrade that usually boosts performance without changing the overall training recipe. For resource-constrained settings, attention’s extra O(T_s T_t) compute is still often worthwhile thanks to significant accuracy gains. If you need massive scalability or purely parallel computation, consider transformer-style self-attention encoders and decoders, but the cross-attention idea remains the same: compute relevance between decoder queries and encoder keys/values. In structured prediction (like parsing) or program synthesis, attention provides soft alignment, which can serve as a differentiable proxy for hard alignment or selection.
⚠️Common Mistakes
- Forgetting numerical stability in softmax: directly exponentiating large scores can overflow. Fix by subtracting the row-wise maximum before exp. Also clamp extremely small probabilities if you compute logs.
- Ignoring masks: you must mask padding positions in the encoder sequence so they get zero probability mass. Otherwise, the model can attend to meaningless padding and degrade.
- Mismatched dimensions: score functions require consistent shapes (e.g., dot product needs d_s = d_h or a projection W to bridge sizes). Double-check matrix sizes and concatenations.
- Treating context and state interchangeably: the decoder state s_t and context c_t play different roles; concatenation vs. addition changes capacity and should match your chosen variant (Luong vs. Bahdanau).
- Overly small attention hidden size (for additive attention): too small d_a bottlenecks alignment expressiveness; too large increases compute and overfitting risk.
- Overlooking inference-time complexity: beam search multiplies compute by beam size B and vocabulary size V per step; plan memory and time accordingly.
- Not inspecting alignments: bad or peaky/flat attention patterns often reveal bugs (masking, scaling) or learning issues (learning rate, regularization).
Key Formulas
Encoder Recurrence
h_i = f(h_{i-1}, x_i)
Explanation: The encoder processes the input sequence step by step, updating its hidden state from the previous state and the current token. f can be an RNN, GRU, LSTM, CNN block, or a transformer layer (in parallel form).
Attention Score
e_{t,i} = score(s_{t-1}, h_i)
Explanation: Each encoder position i gets a relevance score for producing the t-th output. The score function can be dot, general (bilinear), or additive (Bahdanau).
Attention Weights (Softmax)
alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})
Explanation: Scores are normalized into a probability distribution over positions. Larger scores get higher weights, but the weights sum to 1 across all positions.
Context Vector
c_t = sum_i alpha_{t,i} h_i
Explanation: The context is a weighted average of encoder states, emphasizing relevant positions. It is fed to the decoder along with its hidden state to predict the next token.
Luong (Dot/General) Scores
score(s_{t-1}, h_i) = s_{t-1}^T h_i (dot),  score(s_{t-1}, h_i) = s_{t-1}^T W h_i (general)
Explanation: Dot uses a simple inner product, requiring matched dimensions. General introduces a learned projection W to align different spaces.
Bahdanau (Additive) Score
e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i)
Explanation: A small feed-forward network combines decoder state and encoder state through tanh, then projects with v to a scalar score. This often works well when d_s and d_h differ.
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Explanation: Queries match against keys to produce attention weights, scaled by 1/sqrt(d_k) for stability. The output is a weighted sum over the values.
Decoder Output Distribution
p(y_t | y_{<t}, x) = softmax(W_o [s_t; c_t] + b_o)
Explanation: The probability over the vocabulary for the next token is computed from a linear projection of the decoder state concatenated with the context vector.
Cross-Entropy Loss
L = - sum_t log p(y_t* | y_{<t}*, x)
Explanation: Training minimizes the negative log-likelihood of ground-truth tokens given the input and prior ground-truth outputs. This encourages the model to assign high probability to the correct sequence.
Stable Softmax
softmax(z)_i = exp(z_i - max_j z_j) / sum_k exp(z_k - max_j z_j)
Explanation: Subtracting the maximum score per row prevents numerical overflow in exponentials without changing the resulting probabilities.
Attention Complexity
O(T_s T_t d)
Explanation: At each of the T_t decoding steps, attention compares the decoder state to all T_s encoder states of hidden size d. This dominates the runtime in RNN-based Seq2Seq with attention.
Beam Search Work
O(B V) candidate expansions per decoding step
Explanation: Each decoding step expands B hypotheses over a vocabulary of size V before pruning. This multiplicative factor must be considered in inference time.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;

Matrix transpose(const Matrix &A) {
    int n = A.size(), m = A[0].size();
    Matrix T(m, vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            T[j][i] = A[i][j];
    return T;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    int n = A.size(), d = A[0].size(), m = B[0].size();
    Matrix C(n, vector<double>(m, 0.0));
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < d; ++k)
            for (int j = 0; j < m; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

vector<double> softmax_row(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - mx); // stable softmax
        sum += p[i];
    }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

Matrix scaled_dot_product_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v]
    int n_q = Q.size();
    int d_k = Q[0].size();
    int n_k = K.size();
    int d_v = V[0].size();
    (void)d_v; // unused directly here

    // Compute scores = Q * K^T / sqrt(d_k) => [n_q, n_k]
    Matrix Kt = transpose(K);
    Matrix scores = matmul(Q, Kt);
    double scale = 1.0 / sqrt((double)d_k);
    for (int i = 0; i < n_q; ++i)
        for (int j = 0; j < n_k; ++j)
            scores[i][j] *= scale;

    // Row-wise softmax to get attention weights
    Matrix weights(n_q, vector<double>(n_k));
    for (int i = 0; i < n_q; ++i) {
        weights[i] = softmax_row(scores[i]);
    }

    // Output = weights * V => [n_q, d_v]
    Matrix out = matmul(weights, V);
    return out;
}

int main() {
    // Example with small matrices
    // n_q=2 queries, n_k=4 keys/values, d_k=3, d_v=2
    Matrix Q = {{0.2, 0.1, -0.3},
                {0.0, 0.4, 0.5}};
    Matrix K = {{0.1, -0.2, 0.3},
                {0.0, 0.5, 0.2},
                {0.7, -0.1, 0.0},
                {-0.3, 0.2, 0.1}};
    Matrix V = {{ 0.5, 0.0},
                { 0.1, 0.2},
                {-0.2, 0.3},
                { 0.4, -0.1}};

    Matrix out = scaled_dot_product_attention(Q, K, V);

    cout << fixed << setprecision(4);
    cout << "Attention outputs (rows per query):\n";
    for (auto &row : out) {
        for (double v : row) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}
This program implements scaled dot-product attention. It computes scores by multiplying queries with transposed keys, scales by 1/sqrt(d_k) for stability, applies row-wise softmax to get attention weights, and multiplies by values to get context outputs. The example uses tiny matrices to print numerical outputs. The code includes a stable softmax and basic matrix utilities to remain dependency-free.
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

Vec add(const Vec &a, const Vec &b) {
    Vec c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

Vec tanh_vec(const Vec &a) {
    Vec b(a.size());
    for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]);
    return b;
}

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

vector<double> softmax_stable(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int main() {
    // Dimensions: d_s=3 (decoder state), d_h=4 (encoder state), d_a=5 (attention hidden)
    // One decoder state s, multiple encoder states h_i
    vector<Vec> H = {
        { 0.1, 0.0, 0.3, -0.2},
        { 0.2, 0.5, -0.1, 0.4},
        {-0.3, 0.2, 0.6, 0.1}
    }; // T_s=3
    Vec s = {0.05, -0.1, 0.2};

    // Parameters (random small constants for demo)
    Mat W_s = {
        { 0.1, -0.2, 0.3},
        {-0.1, 0.0, 0.2},
        { 0.05, 0.1, -0.05},
        { 0.2, -0.1, 0.0},
        {-0.15, 0.05, 0.1}
    }; // 5x3
    Mat W_h = {
        { 0.2, 0.1, -0.2, 0.0},
        {-0.1, 0.3, 0.05, -0.05},
        { 0.0, -0.2, 0.1, 0.2},
        { 0.1, 0.0, 0.0, -0.1},
        {-0.05, 0.2, -0.1, 0.1}
    }; // 5x4
    Vec v = {0.1, -0.2, 0.05, 0.15, -0.1}; // 5

    // Precompute W_s s
    Vec Ws = matvec(W_s, s); // size d_a

    // Scores e_i
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh = matvec(W_h, H[i]);
        Vec z = add(Ws, Wh);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }

    // Attention weights
    vector<double> alpha = softmax_stable(e);

    // Context c = sum_i alpha_i * h_i
    Vec c(H[0].size(), 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < H[0].size(); ++j)
            c[j] += alpha[i] * H[i][j];

    cout << fixed << setprecision(4);
    cout << "Attention weights (alpha): ";
    for (double a_i : alpha) cout << a_i << ' ';
    cout << "\n";
    cout << "Context vector (c): ";
    for (double cj : c) cout << cj << ' ';
    cout << "\n";
    return 0;
}
This example computes Bahdanau (additive) attention for a single decoder state over multiple encoder states. It forms scores via a small feed-forward network, applies a stable softmax to obtain weights, and then computes the context vector as a weighted sum. Parameters are fixed for demonstration; in practice they are learned via gradient descent.
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Utility functions
Vec concat(const Vec &a, const Vec &b) { Vec c = a; c.insert(c.end(), b.begin(), b.end()); return c; }
Vec concat3(const Vec &a, const Vec &b, const Vec &c) { Vec ab = concat(a, b); return concat(ab, c); }

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

Vec add(const Vec &a, const Vec &b) { Vec c(a.size()); for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i]; return c; }
Vec tanh_vec(const Vec &a) { Vec b(a.size()); for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]); return b; }

double dot(const Vec &a, const Vec &b) { double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; }

vector<double> softmax(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int argmax(const vector<double> &v) { return (int)(max_element(v.begin(), v.end()) - v.begin()); }

// Additive attention: e_i = v^T tanh(Ws s + Wh h_i)
vector<double> attention_scores(const Vec &s, const vector<Vec> &H, const Mat &Ws, const Mat &Wh, const Vec &v) {
    Vec Ws_s = matvec(Ws, s);
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh_hi = matvec(Wh, H[i]);
        Vec z = add(Ws_s, Wh_hi);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }
    return e;
}

Vec weighted_sum(const vector<Vec> &H, const vector<double> &alpha) {
    size_t d = H[0].size();
    Vec c(d, 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < d; ++j)
            c[j] += alpha[i] * H[i][j];
    return c;
}

int main() {
    // Dimensions
    const int d_h = 4; // encoder hidden size
    const int d_s = 4; // decoder state size
    const int d_a = 3; // attention hidden size
    const int d_e = 3; // embedding size
    const int V = 6;   // vocabulary size

    // Seeded random generator for reproducibility
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-0.2, 0.2);

    // Toy encoder states H (T_s x d_h)
    const int T_s = 5;
    vector<Vec> H(T_s, Vec(d_h));
    for (int i = 0; i < T_s; ++i) for (int j = 0; j < d_h; ++j) H[i][j] = dist(rng);

    // Parameters (random small values for demo)
    Mat Ws(d_a, Vec(d_s)), Wh(d_a, Vec(d_h));
    Vec v_att(d_a);
    for (int i = 0; i < d_a; ++i) {
        for (int j = 0; j < d_s; ++j) Ws[i][j] = dist(rng);
        for (int j = 0; j < d_h; ++j) Wh[i][j] = dist(rng);
        v_att[i] = dist(rng);
    }

    // RNN state update: s_t = tanh(Wr [s_{t-1}; emb(y_{t-1}); c_t] + b_r)
    Mat Wr(d_s, Vec(d_s + d_e + d_h));
    Vec b_r(d_s);
    for (int i = 0; i < d_s; ++i) { b_r[i] = dist(rng); for (int j = 0; j < (int)Wr[0].size(); ++j) Wr[i][j] = dist(rng); }

    // Output projection: logits = Wo [s_t; c_t] + b_o -> size V
    Mat Wo(V, Vec(d_s + d_h));
    Vec b_o(V);
    for (int i = 0; i < V; ++i) { b_o[i] = dist(rng); for (int j = 0; j < (int)Wo[0].size(); ++j) Wo[i][j] = dist(rng); }

    // Embedding matrix (V x d_e)
    Mat Emb(V, Vec(d_e));
    for (int i = 0; i < V; ++i) for (int j = 0; j < d_e; ++j) Emb[i][j] = dist(rng);

    // Special tokens (ids)
    const int SOS = 1, EOS = 2;

    // Initial states
    Vec s_prev(d_s, 0.0); // zero state
    Vec c_prev(d_h, 0.0); // initial context
    int y_prev = SOS;     // start token

    const int max_steps = 10;

    cout << fixed << setprecision(4);
    cout << "Greedy decoding with attention (token ids):\n";
    for (int t = 1; t <= max_steps; ++t) {
        // 1) Compute attention over encoder states using s_{t-1}
        vector<double> e = attention_scores(s_prev, H, Ws, Wh, v_att);
        vector<double> alpha = softmax(e);
        Vec c_t = weighted_sum(H, alpha);

        // 2) Update decoder state using previous state, previous token embedding, and new context
        Vec emb = Emb[y_prev];
        Vec inp = concat3(s_prev, emb, c_t); // [d_s + d_e + d_h]
        Vec s_t = tanh_vec(add(matvec(Wr, inp), b_r));

        // 3) Compute output distribution and choose next token (greedy)
        Vec out_inp = concat(s_t, c_t);             // [d_s + d_h]
        Vec logits = add(matvec(Wo, out_inp), b_o); // size V
        vector<double> probs = softmax(logits);
        int y_t = argmax(probs);

        // Print step info
        cout << "t=" << t << " -> y_t=" << y_t << ", attention=";
        for (double a : alpha) cout << a << ' ';
        cout << "\n";

        if (y_t == EOS) { cout << "<eos> reached, stop.\n"; break; }
        // roll
        s_prev = s_t; c_prev = c_t; y_prev = y_t;
    }

    return 0;
}
This toy program demonstrates one full greedy decoding loop with additive attention. At each step, it (1) computes attention scores over encoder states from the decoder state, (2) forms a context vector, (3) updates the decoder state via a simple tanh RNN cell using the previous state, previous token embedding, and context, and (4) projects to vocabulary logits, applies softmax, and picks the argmax token. The numbers are random but the control flow mirrors a real Seq2Seq with attention.