Sequence-to-Sequence with Attention
Key Points
- Sequence-to-sequence with attention lets a decoder focus on the most relevant parts of the input at each output step, rather than compressing everything into a single vector.
- Attention computes a set of alignment scores over encoder states and forms a context vector as a weighted average using softmax weights.
- Common attention scoring functions include dot, general (bilinear), additive (Bahdanau), and scaled dot-product (as in Transformers).
- The computational cost typically scales as O(T_s T_t d), where T_s and T_t are the source and target lengths and d is the hidden size.
- Attention greatly improves translation, summarization, and dialogue models by handling long dependencies and variable-length inputs.
- In C++, you can implement attention with plain vectors and matrices: compute scores, apply softmax row-wise, and take weighted sums.
- Training uses teacher forcing and cross-entropy loss, while decoding commonly uses greedy or beam search with attention at each step.
- Numerical stability (e.g., softmax with max subtraction) and correct masking are essential for robust implementations.
Prerequisites
- Linear Algebra (vectors, matrices, dot products) — Attention is implemented with matrix multiplications and vector operations.
- Probability and Softmax — Attention weights and output distributions are probabilities computed via softmax.
- RNNs / GRUs / LSTMs — Classic Seq2Seq encoders/decoders are recurrent; understanding state updates is key.
- Backpropagation and Gradient Descent — Training learns attention and projection parameters by minimizing cross-entropy.
- Numerical Stability — Stable softmax and masking are crucial to avoid overflow and incorrect attention.
- Autoregressive Decoding — The decoder predicts each token conditioned on previous tokens and attention.
- Beam Search (optional) — Improves inference quality over greedy decoding by exploring multiple hypotheses.
- Transformer Basics (optional) — Scaled dot-product attention generalizes the same idea to fully attention-based models.
Detailed Explanation
Overview
Sequence-to-Sequence (Seq2Seq) with Attention is an encoder–decoder architecture for mapping an input sequence (like a sentence) to an output sequence (like its translation). In a vanilla Seq2Seq model, the encoder compresses the entire input into a single fixed-size vector, and the decoder generates outputs from this bottleneck. Attention removes this bottleneck by allowing the decoder to dynamically look back at the encoder’s hidden states at every time step. Concretely, the decoder computes a set of relevance scores over all encoder states and forms a context vector as a weighted average. This context vector, combined with the decoder’s internal state, predicts the next token. This mechanism lets the model handle much longer sequences and learn alignments (which input parts support which outputs). The approach generalizes well beyond text: it’s used in speech recognition, image captioning, and other tasks that benefit from soft alignment. Variants include Bahdanau (additive) attention, Luong (dot/general) attention, and scaled dot-product attention (central to Transformers). The practical result is better accuracy, interpretable alignments, and improved gradient flow during training, especially for long-range dependencies.
Intuition & Analogies
Imagine translating a paragraph with a highlighter in your hand. You don’t first memorize the entire paragraph and then write the translation from memory; instead, at each translated word, you glance back and highlight the most relevant original words. That highlight intensity over the source words is your attention distribution, and the weighted combination of those highlighted words is your context. The decoder is like you choosing the next word to write, guided by what you’re currently focusing on plus your memory of what you’ve already written. Another analogy: think of a searchlight sweeping across a landscape (the encoder states). At each moment (each output token), the beam focuses on certain regions (relevant positions). The brightness at each location is the attention weight, and the combined reflected light is the context vector. Crucially, the beam can move and spread, capturing multiple relevant spots when necessary. Without attention, you’d be forced to summarize the whole landscape into a single snapshot—a lossy compression. With attention, you get to re-consult the original scene repeatedly, choosing different parts as the needs of the translation evolve. This dynamic lookup not only improves accuracy but also yields a human-interpretable alignment map, showing which input pieces influenced each output decision.
Formal Definition
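A compact Bahdanau-style formulation, using the same symbols as the Key Formulas section below:

```latex
\begin{aligned}
h_i &= f(h_{i-1}, x_i), \qquad i = 1,\dots,T_s \\
e_{t,i} &= \operatorname{score}(s_{t-1}, h_i) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_s} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T_s} \alpha_{t,i}\, h_i \\
p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\!\left(W_o\,[s_t; c_t] + b_o\right)
\end{aligned}
```

Here h_i are encoder states, s_{t-1} is the previous decoder state, alpha_{t,i} are attention weights, and c_t is the context vector fed into the next prediction.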
When to Use
Use Seq2Seq with attention whenever you need to map variable-length inputs to variable-length outputs and long-range dependencies matter. Core scenarios include machine translation, summarization (abstractive), question answering with spans or generated answers, image captioning (attention over CNN features), speech recognition (attention over acoustic frames), and code generation. Attention is also valuable when interpretability is required: alignment heatmaps show which input positions influenced each output token, aiding debugging and trust. If your baseline Seq2Seq struggles with long sentences or loses critical details, attention is a natural upgrade that usually boosts performance without changing the overall training recipe. For resource-constrained settings, attention’s extra O(T_s T_t) compute is still often worthwhile thanks to significant accuracy gains. If you need massive scalability or purely parallel computation, consider transformer-style self-attention encoders and decoders, but the cross-attention idea remains the same: compute relevance between decoder queries and encoder keys/values. In structured prediction (like parsing) or program synthesis, attention provides soft alignment, which can serve as a differentiable proxy for hard alignment or selection.
⚠️Common Mistakes
- Forgetting numerical stability in softmax: directly exponentiating large scores can overflow. Fix by subtracting the row-wise maximum before exp. Also clamp extremely small probabilities if you compute logs.
- Ignoring masks: you must mask padding positions in the encoder sequence so they get zero probability mass. Otherwise, the model can attend to meaningless padding and degrade.
- Mismatched dimensions: score functions require consistent shapes (e.g., dot product needs d_s = d_h or a projection W to bridge sizes). Double-check matrix sizes and concatenations.
- Treating context and state interchangeably: the decoder state s_t and context c_t play different roles; concatenation vs. addition changes capacity and should match your chosen variant (Luong vs. Bahdanau).
- Overly small attention hidden size (for additive attention): too small d_a bottlenecks alignment expressiveness; too large increases compute and overfitting risk.
- Overlooking inference-time complexity: beam search multiplies compute by beam size B and vocabulary size V per step; plan memory and time accordingly.
- Not inspecting alignments: bad or peaky/flat attention patterns often reveal bugs (masking, scaling) or learning issues (learning rate, regularization).
Key Formulas
Encoder Recurrence
h_i = f(h_{i-1}, x_i)
Explanation: The encoder processes the input sequence step by step, updating its hidden state from the previous state and the current token. f can be an RNN, GRU, LSTM, CNN block, or a transformer layer (in parallel form).
Attention Score
e_{t,i} = score(s_{t-1}, h_i)
Explanation: Each encoder position i gets a relevance score for producing the t-th output. The score function can be dot, general (bilinear), or additive (Bahdanau).
Attention Weights (Softmax)
alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})
Explanation: Scores are normalized into a probability distribution over positions. Larger scores get higher weights, but the weights sum to 1 across all positions.
Context Vector
c_t = sum_i alpha_{t,i} h_i
Explanation: The context is a weighted average of encoder states, emphasizing relevant positions. It is fed to the decoder along with its hidden state to predict the next token.
Luong (Dot/General) Scores
score(s_{t-1}, h_i) = s_{t-1}^T h_i (dot),  score(s_{t-1}, h_i) = s_{t-1}^T W h_i (general)
Explanation: Dot uses a simple inner product, requiring matched dimensions. General introduces a learned projection W to align different spaces.
Bahdanau (Additive) Score
e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i)
Explanation: A small feed-forward network combines decoder state and encoder state through tanh, then projects with v to a scalar score. This often works well when d_s and d_h differ.
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Explanation: Queries match against keys to produce attention weights, scaled by 1/sqrt(d_k) for stability. The output is a weighted sum over the values.
Decoder Output Distribution
p(y_t | y_{<t}, x) = softmax(W_o [s_t; c_t] + b_o)
Explanation: The probability over the vocabulary for the next token is computed from a linear projection of the decoder state concatenated with the context vector.
Cross-Entropy Loss
L = - sum_t log p(y_t* | y_{<t}*, x)
Explanation: Training minimizes the negative log-likelihood of ground-truth tokens given the input and prior ground-truth outputs. This encourages the model to assign high probability to the correct sequence.
Stable Softmax
softmax(z)_i = exp(z_i - max_j z_j) / sum_k exp(z_k - max_j z_j)
Explanation: Subtracting the maximum score per row prevents numerical overflow in exponentials without changing the resulting probabilities.
Attention Complexity
O(T_s T_t d)
Explanation: At each of the T_t decoding steps, attention compares the decoder state to all T_s encoder states of hidden size d. This dominates the runtime in RNN-based Seq2Seq with attention.
Beam Search Work
O(B V) candidate expansions per decoding step
Explanation: Each decoding step expands B hypotheses over a vocabulary of size V before pruning. This multiplicative factor must be considered in inference time.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;

Matrix transpose(const Matrix &A) {
    int n = A.size(), m = A[0].size();
    Matrix T(m, vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            T[j][i] = A[i][j];
    return T;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    int n = A.size(), d = A[0].size(), m = B[0].size();
    Matrix C(n, vector<double>(m, 0.0));
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < d; ++k)
            for (int j = 0; j < m; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

vector<double> softmax_row(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - mx); // stable softmax
        sum += p[i];
    }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

Matrix scaled_dot_product_attention(const Matrix &Q, const Matrix &K, const Matrix &V) {
    // Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v]
    int n_q = Q.size();
    int d_k = Q[0].size();
    int n_k = K.size();
    int d_v = V[0].size();
    (void)d_v; // unused directly here

    // Compute scores = Q * K^T / sqrt(d_k) => [n_q, n_k]
    Matrix Kt = transpose(K);
    Matrix scores = matmul(Q, Kt);
    double scale = 1.0 / sqrt((double)d_k);
    for (int i = 0; i < n_q; ++i)
        for (int j = 0; j < n_k; ++j)
            scores[i][j] *= scale;

    // Row-wise softmax to get attention weights
    Matrix weights(n_q, vector<double>(n_k));
    for (int i = 0; i < n_q; ++i) {
        weights[i] = softmax_row(scores[i]);
    }

    // Output = weights * V => [n_q, d_v]
    Matrix out = matmul(weights, V);
    return out;
}

int main() {
    // Example with small matrices
    // n_q=2 queries, n_k=4 keys/values, d_k=3, d_v=2
    Matrix Q = {{0.2, 0.1, -0.3},
                {0.0, 0.4, 0.5}};
    Matrix K = {{0.1, -0.2, 0.3},
                {0.0, 0.5, 0.2},
                {0.7, -0.1, 0.0},
                {-0.3, 0.2, 0.1}};
    Matrix V = {{ 0.5, 0.0},
                { 0.1, 0.2},
                {-0.2, 0.3},
                { 0.4, -0.1}};

    Matrix out = scaled_dot_product_attention(Q, K, V);

    cout << fixed << setprecision(4);
    cout << "Attention outputs (rows per query):\n";
    for (auto &row : out) {
        for (double v : row) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}
This program implements scaled dot-product attention. It computes scores by multiplying queries with transposed keys, scales by 1/sqrt(d_k) for stability, applies row-wise softmax to get attention weights, and multiplies by values to get context outputs. The example uses tiny matrices to print numerical outputs. The code includes a stable softmax and basic matrix utilities to remain dependency-free.
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

Vec add(const Vec &a, const Vec &b) {
    Vec c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

Vec tanh_vec(const Vec &a) {
    Vec b(a.size());
    for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]);
    return b;
}

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

vector<double> softmax_stable(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int main() {
    // Dimensions: d_s=3 (decoder state), d_h=4 (encoder state), d_a=5 (attention hidden)
    // One decoder state s, multiple encoder states h_i
    vector<Vec> H = {
        { 0.1, 0.0, 0.3, -0.2},
        { 0.2, 0.5, -0.1, 0.4},
        {-0.3, 0.2, 0.6, 0.1}
    }; // T_s=3
    Vec s = {0.05, -0.1, 0.2};

    // Parameters (random small constants for demo)
    Mat W_s = {
        { 0.1, -0.2, 0.3},
        {-0.1, 0.0, 0.2},
        { 0.05, 0.1, -0.05},
        { 0.2, -0.1, 0.0},
        {-0.15, 0.05, 0.1}
    }; // 5x3
    Mat W_h = {
        { 0.2, 0.1, -0.2, 0.0},
        {-0.1, 0.3, 0.05, -0.05},
        { 0.0, -0.2, 0.1, 0.2},
        { 0.1, 0.0, 0.0, -0.1},
        {-0.05, 0.2, -0.1, 0.1}
    }; // 5x4
    Vec v = {0.1, -0.2, 0.05, 0.15, -0.1}; // 5

    // Precompute W_s s
    Vec Ws = matvec(W_s, s); // size d_a

    // Scores e_i
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh = matvec(W_h, H[i]);
        Vec z = add(Ws, Wh);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }

    // Attention weights
    vector<double> alpha = softmax_stable(e);

    // Context c = sum_i alpha_i * h_i
    Vec c(H[0].size(), 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < H[0].size(); ++j)
            c[j] += alpha[i] * H[i][j];

    cout << fixed << setprecision(4);
    cout << "Attention weights (alpha): ";
    for (double a_i : alpha) cout << a_i << ' ';
    cout << "\n";
    cout << "Context vector (c): ";
    for (double cj : c) cout << cj << ' ';
    cout << "\n";
    return 0;
}
This example computes Bahdanau (additive) attention for a single decoder state over multiple encoder states. It forms scores via a small feed-forward network, applies a stable softmax to obtain weights, and then computes the context vector as a weighted sum. Parameters are fixed for demonstration; in practice they are learned via gradient descent.
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Utility functions
Vec concat(const Vec &a, const Vec &b) { Vec c = a; c.insert(c.end(), b.begin(), b.end()); return c; }
Vec concat3(const Vec &a, const Vec &b, const Vec &c) { Vec ab = concat(a, b); return concat(ab, c); }

Vec matvec(const Mat &W, const Vec &x) {
    size_t r = W.size(), c = W[0].size();
    Vec y(r, 0.0);
    for (size_t i = 0; i < r; ++i)
        for (size_t j = 0; j < c; ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

Vec add(const Vec &a, const Vec &b) { Vec c(a.size()); for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i]; return c; }
Vec tanh_vec(const Vec &a) { Vec b(a.size()); for (size_t i = 0; i < a.size(); ++i) b[i] = tanh(a[i]); return b; }

double dot(const Vec &a, const Vec &b) { double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; }

vector<double> softmax(const vector<double> &z) {
    double mx = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) { p[i] = exp(z[i] - mx); sum += p[i]; }
    for (double &v : p) v /= (sum + 1e-12);
    return p;
}

int argmax(const vector<double> &v) { return (int)(max_element(v.begin(), v.end()) - v.begin()); }

// Additive attention: e_i = v^T tanh(Ws s + Wh h_i)
vector<double> attention_scores(const Vec &s, const vector<Vec> &H, const Mat &Ws, const Mat &Wh, const Vec &v) {
    Vec Ws_s = matvec(Ws, s);
    vector<double> e(H.size());
    for (size_t i = 0; i < H.size(); ++i) {
        Vec Wh_hi = matvec(Wh, H[i]);
        Vec z = add(Ws_s, Wh_hi);
        Vec a = tanh_vec(z);
        e[i] = dot(v, a);
    }
    return e;
}

Vec weighted_sum(const vector<Vec> &H, const vector<double> &alpha) {
    size_t d = H[0].size();
    Vec c(d, 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < d; ++j)
            c[j] += alpha[i] * H[i][j];
    return c;
}

int main() {
    // Dimensions
    const int d_h = 4; // encoder hidden size
    const int d_s = 4; // decoder state size
    const int d_a = 3; // attention hidden size
    const int d_e = 3; // embedding size
    const int V = 6;   // vocabulary size

    // Seeded random generator for reproducibility
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-0.2, 0.2);

    // Toy encoder states H (T_s x d_h)
    const int T_s = 5;
    vector<Vec> H(T_s, Vec(d_h));
    for (int i = 0; i < T_s; ++i) for (int j = 0; j < d_h; ++j) H[i][j] = dist(rng);

    // Parameters (random small values for demo)
    Mat Ws(d_a, Vec(d_s)), Wh(d_a, Vec(d_h));
    Vec v_att(d_a);
    for (int i = 0; i < d_a; ++i) {
        for (int j = 0; j < d_s; ++j) Ws[i][j] = dist(rng);
        for (int j = 0; j < d_h; ++j) Wh[i][j] = dist(rng);
        v_att[i] = dist(rng);
    }

    // RNN state update: s_t = tanh(Wr [s_{t-1}; emb(y_{t-1}); c_t] + b_r)
    Mat Wr(d_s, Vec(d_s + d_e + d_h));
    Vec b_r(d_s);
    for (int i = 0; i < d_s; ++i) { b_r[i] = dist(rng); for (int j = 0; j < (int)Wr[0].size(); ++j) Wr[i][j] = dist(rng); }

    // Output projection: logits = Wo [s_t; c_t] + b_o -> size V
    Mat Wo(V, Vec(d_s + d_h));
    Vec b_o(V);
    for (int i = 0; i < V; ++i) { b_o[i] = dist(rng); for (int j = 0; j < (int)Wo[0].size(); ++j) Wo[i][j] = dist(rng); }

    // Embedding matrix (V x d_e)
    Mat Emb(V, Vec(d_e));
    for (int i = 0; i < V; ++i) for (int j = 0; j < d_e; ++j) Emb[i][j] = dist(rng);

    // Special tokens (ids)
    const int SOS = 1, EOS = 2;

    // Initial states
    Vec s_prev(d_s, 0.0); // zero state
    Vec c_prev(d_h, 0.0); // initial context
    int y_prev = SOS;     // start token

    const int max_steps = 10;

    cout << fixed << setprecision(4);
    cout << "Greedy decoding with attention (token ids):\n";
    for (int t = 1; t <= max_steps; ++t) {
        // 1) Compute attention over encoder states using s_{t-1}
        vector<double> e = attention_scores(s_prev, H, Ws, Wh, v_att);
        vector<double> alpha = softmax(e);
        Vec c_t = weighted_sum(H, alpha);

        // 2) Update decoder state using previous state, previous token embedding, and new context
        Vec emb = Emb[y_prev];
        Vec inp = concat3(s_prev, emb, c_t); // [d_s + d_e + d_h]
        Vec s_t = tanh_vec(add(matvec(Wr, inp), b_r));

        // 3) Compute output distribution and choose next token (greedy)
        Vec out_inp = concat(s_t, c_t);             // [d_s + d_h]
        Vec logits = add(matvec(Wo, out_inp), b_o); // size V
        vector<double> probs = softmax(logits);
        int y_t = argmax(probs);

        // Print step info
        cout << "t=" << t << " -> y_t=" << y_t << ", attention=";
        for (double a : alpha) cout << a << ' ';
        cout << "\n";

        if (y_t == EOS) { cout << "<eos> reached, stop.\n"; break; }
        // roll
        s_prev = s_t; c_prev = c_t; y_prev = y_t;
    }

    return 0;
}
This toy program demonstrates one full greedy decoding loop with additive attention. At each step, it (1) computes attention scores over encoder states from the decoder state, (2) forms a context vector, (3) updates the decoder state via a simple tanh RNN cell using the previous state, previous token embedding, and context, and (4) projects to vocabulary logits, applies softmax, and picks the argmax token. The numbers are random but the control flow mirrors a real Seq2Seq with attention.