Recurrent Neural Network Theory
Key Points
- •A Recurrent Neural Network (RNN) processes sequences by carrying a hidden state that is updated at every time step using h_t = f(W_h h_{t-1} + W_x x_t + b).
- •RNNs learn temporal patterns like language, audio, and time-series by reusing the same weights across steps, giving them a form of memory.
- •Training uses Backpropagation Through Time (BPTT), which unrolls the network across T steps and applies the chain rule over the entire sequence.
- •Gradients can vanish or explode due to repeated Jacobian products; careful initialization, gating (as in LSTM/GRU), and gradient clipping help.
- •The per-sequence time complexity is O(T(d_h d_x + d_h^2 + d_h d_y)), and memory is O(T d_h) for storing hidden states during BPTT.
- •Outputs are often computed as y_t = g(W_y h_t + c), with g being softmax (classification) or linear/sigmoid (regression/binary tasks).
- •Truncated BPTT limits backprop to a window of k steps to trade long-range credit assignment for efficiency and stability.
- •In C++, you can implement a compact RNN by writing small matrix/vector utilities and carefully managing memory and numerical stability.
Prerequisites
- →Linear algebra (vectors, matrices, norms) — RNN computations are matrix-vector operations, and stability analysis uses matrix norms and eigenvalues.
- →Calculus and chain rule — Backpropagation Through Time applies the chain rule across many time steps.
- →Optimization (SGD, learning rate, regularization) — Training RNNs requires tuning optimizers and understanding convergence/stability.
- →Probability and information theory — Losses like cross-entropy and concepts like softmax are probabilistic.
- →Numerical stability — Stable softmax/log-sum-exp and gradient clipping prevent overflow/underflow.
- →Basic C++ programming — Implementing an RNN from scratch involves arrays, loops, and memory management.
- →Neural network basics — Familiarity with activations, layers, and loss functions helps contextualize RNNs.
- →Sequence modeling tasks — Understanding inputs/outputs for many-to-one and many-to-many setups guides design.
Detailed Explanation
01 Overview
A Recurrent Neural Network (RNN) is a neural architecture designed to handle sequential data. Unlike feedforward networks that treat inputs independently, RNNs maintain a hidden state h_t that summarizes information from prior time steps. At time t, the hidden state is updated by combining the current input x_t with the previous hidden state h_{t-1} through learnable parameters and a nonlinear activation. Because the same parameters are reused at every step, RNNs can, in principle, model temporal dependencies of arbitrary length. They are widely used for tasks like language modeling, speech recognition, and time-series forecasting. The canonical update looks like h_t = f(W_h h_{t-1} + W_x x_t + b), with f often being tanh or ReLU. An output y_t is optionally produced via y_t = g(W_y h_t + c), where g could be softmax for multiclass classification or identity/sigmoid for regression/binary tasks. RNNs are trained with Backpropagation Through Time (BPTT), which unrolls the sequence, applies the chain rule across time, and accumulates gradients. While simple and elegant, basic RNNs face stability challenges: gradients can vanish or explode due to repeated multiplication by Jacobians. Practical training therefore uses techniques like careful initialization, gradient clipping, truncated BPTT, or gated variants such as LSTM and GRU.
02 Intuition & Analogies
Imagine reading a sentence word by word while taking brief notes on a sticky note. Each time you read a new word (x_t), you update your note (h_t) with the new information while keeping the important parts from the previous note (h_{t-1}). You use a consistent method for updating the note—like a template—which corresponds to the matrices W_x and W_h. The note is small and must summarize the past concisely; this is the RNN’s hidden state. If you want to answer a question at any point, you consult the current note and produce an answer (y_t). The nonlinearity f (like tanh) acts like a rule that prevents the note from growing without bound and helps emphasize salient features. Training the network is like reviewing your notes after finishing the sentence, figuring out which parts helped or hurt your final answers, and then slightly adjusting your note-taking template to do better next time. This review process backtracks through time—exactly what Backpropagation Through Time does—crediting or blaming notes from earlier points based on their influence on later answers. However, just as faint pencil marks get harder to trace back through many pages, gradients can fade (vanish) when you backtrack too far, making it difficult to learn very long-term dependencies. Conversely, repeatedly amplifying strong marks can smear ink everywhere (exploding gradients). Techniques like clipping keep the adjustments neat, while gated RNNs (LSTM/GRU) add controlled doors that decide what to write, erase, or read from the note, improving long-term memory.
03 Formal Definition
A vanilla (Elman) RNN over an input sequence x_1, ..., x_T with x_t in R^{d_x} is defined by a hidden state h_t in R^{d_h} and parameters W_x in R^{d_h x d_x}, W_h in R^{d_h x d_h}, b in R^{d_h}, W_y in R^{d_y x d_h}, c in R^{d_y}. Starting from an initial state h_0 (commonly the zero vector), the state and output evolve as h_t = f(W_h h_{t-1} + W_x x_t + b) and y_t = g(W_y h_t + c), where f is an elementwise nonlinearity (typically tanh or ReLU) and g is task-dependent (softmax, sigmoid, or identity). Crucially, the same parameters are shared across all time steps. Given a per-step loss on y_t, the sequence loss is the sum (or average) over t, and gradients are obtained by unrolling the recurrence and applying the chain rule across time (BPTT).
04 When to Use
Use RNNs when your data is sequential and order matters: natural language (character/word sequences), audio waveforms, sensor streams, and financial time series. They are suitable when you need online processing (streaming), low-latency inference step-by-step, or when sequences are relatively short to moderate so that simple recurrence suffices. Choose the proper I/O mapping: many-to-one (e.g., sentiment classification over a sentence), one-to-many (sequence generation from a context), or many-to-many (tagging each token). If your dependencies span long ranges or you observe training instabilities, consider gated RNNs (LSTM/GRU) or even Transformers. When compute/memory is limited and you value compactness and easy stepwise deployment (e.g., embedded systems), vanilla RNNs can be attractive—especially with truncated BPTT and gradient clipping. For continuous signals and forecasting, RNNs can model autoregressive dynamics; for classification, pair them with softmax and cross-entropy. In practice, combine RNNs with embeddings for discrete tokens, masking/padding for variable-length batches, and regularization (dropout, weight decay) for generalization.
⚠️Common Mistakes
- Ignoring sequence lengths: feeding padded tokens without masks causes the model to learn from padding noise. Always mask losses and states for shorter sequences.
- Shape mismatches: mixing row/column conventions or forgetting that W_h multiplies h_{t-1} and W_x multiplies x_t leads to silent bugs. Verify dimensions and write tests.
- Unstable training: using high learning rates without gradient clipping may cause exploding gradients; conversely, poor initialization can cause vanishing. Use Xavier/He initialization and clip gradients by global norm.
- Wrong activation or output: applying softmax to binary outputs or forgetting a sigmoid/identity where appropriate degrades learning. Match g and the loss to the task.
- Forgetting to detach/segment for truncated BPTT: in manual implementations, you must stop gradients beyond the window; otherwise memory/time explodes.
- Not resetting or mishandling hidden state between sequences: decide between stateless (h_0=0 each sequence) and stateful (carry over h_T) regimes and code accordingly.
- Numerical issues: computing log(softmax) naively can underflow. Use stable softmax/log-sum-exp. Similarly, clamp arguments to log and divisions.
- Evaluation mismatch: training with teacher forcing and evaluating autoregressively without careful handling can cause exposure bias; validate in the intended inference mode.
Key Formulas
RNN Recurrence
h_t = f(W_h h_{t-1} + W_x x_t + b)
Explanation: The hidden state at time t is a nonlinear function of the previous hidden state and current input. This is the core update that gives RNNs memory.
Output Mapping
y_t = g(W_y h_t + c)
Explanation: Given a hidden state, the network produces an output via another linear map and possibly a nonlinearity (softmax, sigmoid, or identity) depending on the task.
Sequence Cross-Entropy
L = − Σ_{t=1}^{T} Σ_k y_{t,k} log ŷ_{t,k}
Explanation: For classification, the loss over a sequence is the sum of cross-entropies between the true labels y_t and predicted probabilities ŷ_t at each time step.
BPTT Error Recurrence
δ_t = (W_y^T e_t + W_h^T δ_{t+1}) ⊙ f'(a_t), with δ_{T+1} = 0
Explanation: The error signal δ_t at time t combines the immediate output error e_t (e.g., ŷ_t − y_t for softmax with cross-entropy) with the backpropagated error from time t+1, modulated by the activation derivative at t. This is central to computing gradients through time.
Parameter Gradients
∂L/∂W_x = Σ_t δ_t x_t^T,  ∂L/∂W_h = Σ_t δ_t h_{t-1}^T,  ∂L/∂b = Σ_t δ_t
Explanation: Once δ_t (the backpropagated error at step t) is known, gradients w.r.t. parameters follow from outer products with the corresponding inputs. These sums accumulate contributions across time.
Gradient Decay/Explosion Bound
|| Π_{i=t-k+1}^{t} ∂h_i/∂h_{i-1} || ≤ (γ ||W_h||_2)^k
Explanation: The norm of the Jacobian product across k steps is bounded by the product of the activation's Lipschitz constant γ and the spectral norm of W_h, raised to the power k, explaining vanishing (γ||W_h||_2 < 1) and exploding (γ||W_h||_2 > 1) gradients.
Spectral Radius Criterion
ρ(W_h) < 1
Explanation: If the largest absolute eigenvalue of W_h is less than one (and f is approximately linear near zero), small perturbations decay, helping stability but risking vanishing gradients.
Softmax
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Explanation: Turns logits into a probability distribution. Use with cross-entropy for multiclass outputs.
RNN Cost per Sequence
Time: O(T(d_h d_x + d_h^2 + d_h d_y)),  Memory: O(T d_h)
Explanation: Each time step performs matrix-vector products with W_x, W_h, and W_y. Storing hidden states over T steps is needed for BPTT, leading to linear memory in T.
Global-Norm Gradient Clipping
g ← g · min(1, τ / ||g||_2)
Explanation: If the global gradient norm ||g||_2 exceeds the threshold τ, rescale all gradients uniformly to stabilize training without changing direction.
Parameter Count
N = d_h d_x + d_h^2 + d_h + d_y d_h + d_y
Explanation: The total number of learnable parameters in a vanilla RNN cell and output layer equals weights plus biases across the input, recurrent, and output mappings.
Complexity Analysis
A forward pass over a length-T sequence costs O(T(d_h d_x + d_h^2 + d_h d_y)) time, dominated by the d_h × d_h recurrent matrix-vector product when the hidden size is large. BPTT adds a backward sweep of the same order, and storing the hidden states it needs takes O(T d_h) memory. Truncated BPTT with window k caps the stored state at O(k d_h) per update, trading away credit assignment beyond k steps.
Code Examples
#include <iostream>
#include <vector>
#include <string>
#include <random>
#include <cmath>
#include <algorithm>
#include <iomanip>

using namespace std;

// Utility: numerically-stable softmax
vector<double> softmax(const vector<double>& z) {
    double max_z = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - max_z);
        sum += p[i];
    }
    for (size_t i = 0; i < z.size(); ++i) p[i] /= sum + 1e-12;
    return p;
}

// Matrix-vector multiply: y = A * x (A: m x n, x: n)
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    size_t m = A.size(), n = A[0].size();
    vector<double> y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += A[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Vector add: a += b
void add_inplace(vector<double>& a, const vector<double>& b) {
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

// Activation tanh elementwise
vector<double> tanh_vec(const vector<double>& v) {
    vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = tanh(v[i]);
    return r;
}

// One-hot encoding for a small vocabulary
vector<double> one_hot(int idx, int vocab) {
    vector<double> v(vocab, 0.0);
    v[idx] = 1.0;
    return v;
}

int main() {
    // Toy vocabulary and mapping
    vector<char> vocab = {' ', 'a', 'b', 'c'}; // size = 4
    auto idx_of = [&](char ch){
        for (int i = 0; i < (int)vocab.size(); ++i) if (vocab[i] == ch) return i;
        return 0; // default to space
    };

    int d_x = (int)vocab.size();
    int d_h = 8;                 // hidden size
    int d_y = (int)vocab.size(); // predict next char distribution

    // Random init (small values)
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(-0.1, 0.1);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<vector<double>> W_y(d_y, vector<double>(d_h));
    vector<double> b_y(d_y, 0.0);

    for (int i = 0; i < d_h; ++i) {
        for (int j = 0; j < d_x; ++j) W_x[i][j] = uni(rng);
        for (int j = 0; j < d_h; ++j) W_h[i][j] = uni(rng);
        b_h[i] = 0.0;
    }
    for (int i = 0; i < d_y; ++i) {
        for (int j = 0; j < d_h; ++j) W_y[i][j] = uni(rng);
        b_y[i] = 0.0;
    }

    // Input sequence (we will predict next char at each step)
    string input = "ab cab";

    // Initial hidden state
    vector<double> h(d_h, 0.0);

    cout << fixed << setprecision(3);
    cout << "Input\tPredicted next char (argmax)\n";

    for (size_t t = 0; t < input.size(); ++t) {
        // Prepare x_t as one-hot
        vector<double> x = one_hot(idx_of(input[t]), d_x);

        // Compute pre-activation a_t = W_h h + W_x x + b
        vector<double> a = matvec(W_h, h);
        vector<double> Wx = matvec(W_x, x);
        add_inplace(a, Wx);
        add_inplace(a, b_h);

        // Update hidden state
        h = tanh_vec(a);

        // Output logits and softmax
        vector<double> logits = matvec(W_y, h);
        add_inplace(logits, b_y);
        vector<double> probs = softmax(logits);

        // Argmax prediction
        int pred = int(max_element(probs.begin(), probs.end()) - probs.begin());
        cout << input[t] << "\t" << vocab[pred] << "\n";
    }

    return 0;
}
This program defines a small vanilla RNN with tanh hidden units and a softmax output head to perform next-character prediction. It runs only the forward pass: for each character in the input, it updates the hidden state and outputs a probability distribution over the next character. Random weights yield arbitrary predictions, but the structure reflects the recurrence h_t = f(W_h h_{t-1} + W_x x_t + b) and y_t = softmax(W_y h_t + c). The example shows how to manage shapes and implement a numerically stable softmax in plain C++.
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <cassert>
using namespace std;

// Sigmoid, and tanh derivative expressed via the output y = tanh(a)
inline double sigmoid(double z){ return 1.0 / (1.0 + exp(-z)); }
inline double dtanh(double y){ return 1.0 - y*y; }

// Matrix-vector multiply: y = A * x (A: m x n, x: n)
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x){
    int m = (int)A.size();
    int n = (int)A[0].size();
    vector<double> y(m, 0.0);
    for(int i=0;i<m;++i){
        double s=0.0; for(int j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

vector<double> tanh_vec(const vector<double>& v){
    vector<double> r(v.size());
    for(size_t i=0;i<v.size();++i) r[i]=tanh(v[i]);
    return r;
}

// y += A^T * x (A: m x n, x: m) => y: n
void add_matTvec_inplace(vector<double>& y, const vector<vector<double>>& A, const vector<double>& x){
    int m=(int)A.size(), n=(int)A[0].size();
    assert((int)y.size()==n && (int)x.size()==m);
    for(int j=0;j<n;++j){
        double s=0.0; for(int i=0;i<m;++i) s += A[i][j]*x[i];
        y[j] += s;
    }
}

// Global-norm gradient clipping
void clip_by_global_norm(vector<vector<double>>& dW_h, vector<vector<double>>& dW_x,
                         vector<double>& db_h, vector<double>& dW_y, double& db_y,
                         double tau){
    double sq=0.0;
    int d_h=(int)dW_h.size(); int d_x=(int)dW_x[0].size();
    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_h;++j) sq += dW_h[i][j]*dW_h[i][j];
        for(int j=0;j<d_x;++j) sq += dW_x[i][j]*dW_x[i][j];
        sq += db_h[i]*db_h[i];
    }
    for(double v: dW_y) sq += v*v;
    sq += db_y*db_y;
    double norm = sqrt(sq) + 1e-12;
    double scale = min(1.0, tau / norm);
    if(scale<1.0){
        for(int i=0;i<d_h;++i){
            for(int j=0;j<d_h;++j) dW_h[i][j]*=scale;
            for(int j=0;j<d_x;++j) dW_x[i][j]*=scale;
            db_h[i]*=scale;
        }
        for(double& v: dW_y) v*=scale;
        db_y*=scale;
    }
}

int main(){
    // Echo task: input x_t in {0,1}, target y_t = x_{t-1} (y_0 = 0)
    int T = 25;        // sequence length
    int epochs = 300;  // small demo
    int d_x = 1;       // scalar input
    int d_h = 8;       // hidden units
    double lr = 0.1;   // scalar output: predicted probability of 1

    // Initialize parameters
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(-0.2, 0.2);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<double> W_y(d_h); // shape 1 x d_h represented as a vector
    double b_y = 0.0;

    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_x;++j) W_x[i][j]=uni(rng);
        for(int j=0;j<d_h;++j) W_h[i][j]=uni(rng) * (i==j? 0.5:1.0); // slightly smaller on diagonal
        W_y[i]=uni(rng);
    }

    std::bernoulli_distribution bern(0.5);

    for(int epoch=1; epoch<=epochs; ++epoch){
        // Generate a random sequence each epoch
        vector<double> x(T, 0.0), y_true(T, 0.0);
        for(int t=0;t<T;++t) x[t] = bern(rng) ? 1.0:0.0;
        y_true[0] = 0.0;
        for(int t=1;t<T;++t) y_true[t] = x[t-1];

        // Forward pass: store a_t, h_t, y_hat_t
        vector<vector<double>> a(T, vector<double>(d_h, 0.0));
        vector<vector<double>> h(T, vector<double>(d_h, 0.0));
        vector<double> y_hat(T, 0.0);

        vector<double> h_prev(d_h, 0.0);
        for(int t=0;t<T;++t){
            // a_t = W_h h_{t-1} + W_x x_t + b_h
            a[t] = matvec(W_h, h_prev);
            for(int i=0;i<d_h;++i) a[t][i] += W_x[i][0] * x[t] + b_h[i];
            // h_t = tanh(a_t)
            h[t] = tanh_vec(a[t]);
            // y_t = sigmoid(W_y h_t + b_y)
            double z = b_y;
            for(int i=0;i<d_h;++i) z += W_y[i]*h[t][i];
            y_hat[t] = sigmoid(z);
            h_prev = h[t];
        }

        // Binary cross-entropy loss, averaged over the sequence
        double loss = 0.0;
        for(int t=0;t<T;++t){
            double yt = y_true[t];
            double yp = min(1.0-1e-7, max(1e-7, y_hat[t]));
            loss += -( yt*log(yp) + (1.0-yt)*log(1.0-yp) );
        }
        loss /= T;

        // Backward pass (BPTT)
        vector<vector<double>> dW_x(d_h, vector<double>(d_x, 0.0));
        vector<vector<double>> dW_h(d_h, vector<double>(d_h, 0.0));
        vector<double> db_h(d_h, 0.0);
        vector<double> dW_y(d_h, 0.0);
        double db_y = 0.0;

        vector<double> dh_next(d_h, 0.0);
        for(int t=T-1; t>=0; --t){
            // Output gradient: dL/dz = y_hat - y_true for sigmoid + BCE
            double dL_dz = (y_hat[t] - y_true[t]) / T; // average over T

            // Gradients w.r.t. W_y, b_y
            for(int i=0;i<d_h;++i) dW_y[i] += dL_dz * h[t][i];
            db_y += dL_dz;

            // Backprop into h_t: W_y^T * dL_dz plus the error flowing
            // back from step t+1 through the recurrence
            vector<double> dh(d_h, 0.0);
            for(int i=0;i<d_h;++i) dh[i] = W_y[i] * dL_dz + dh_next[i];

            // Through tanh: da = dh * (1 - tanh(a)^2), using tanh(a_t) = h[t]
            vector<double> da(d_h, 0.0);
            for(int i=0;i<d_h;++i) da[i] = dh[i] * dtanh(h[t][i]);

            // Accumulate parameter grads: W_x (da * x_t^T, x scalar),
            // W_h (da * h_{t-1}^T, with h_0 = 0), and b_h
            for(int i=0;i<d_h;++i) dW_x[i][0] += da[i] * x[t];
            for(int i=0;i<d_h;++i){
                for(int j=0;j<d_h;++j){
                    double hprevj = (t>0? h[t-1][j] : 0.0);
                    dW_h[i][j] += da[i] * hprevj;
                }
            }
            for(int i=0;i<d_h;++i) db_h[i] += da[i];

            // Propagate to previous hidden state: dh_next = W_h^T * da
            fill(dh_next.begin(), dh_next.end(), 0.0);
            add_matTvec_inplace(dh_next, W_h, da);
        }

        // Clip gradients to avoid exploding
        clip_by_global_norm(dW_h, dW_x, db_h, dW_y, db_y, 5.0);

        // SGD update
        for(int i=0;i<d_h;++i){
            W_y[i] -= lr * dW_y[i];
            for(int j=0;j<d_x;++j) W_x[i][j] -= lr * dW_x[i][j];
            for(int j=0;j<d_h;++j) W_h[i][j] -= lr * dW_h[i][j];
            b_h[i] -= lr * db_h[i];
        }
        b_y -= lr * db_y;

        if(epoch % 50 == 0 || epoch == 1){
            cout << "Epoch " << epoch << ": loss = " << loss << "\n";
        }
    }

    // Quick qualitative check
    cout << "\nQualitative check on a new random sequence:\n";
    vector<double> x(T,0.0);
    std::mt19937 rng2(999);
    std::bernoulli_distribution bern2(0.5);
    for(int t=0;t<T;++t) x[t] = bern2(rng2)?1.0:0.0;

    vector<double> h_prev(d_h, 0.0);
    cout << "t\tx_t\tpred p(x_{t-1}=1)\n";
    for(int t=0;t<T;++t){
        vector<double> a(d_h, 0.0);
        for(int i=0;i<d_h;++i){
            double s = 0.0;
            for(int j=0;j<d_h;++j) s += W_h[i][j]*h_prev[j];
            s += W_x[i][0]*x[t] + b_h[i];
            a[i] = s;
        }
        vector<double> h_t(d_h, 0.0);
        for(int i=0;i<d_h;++i) h_t[i] = tanh(a[i]);
        double z = b_y; for(int i=0;i<d_h;++i) z += W_y[i]*h_t[i];
        double pred = sigmoid(z);
        cout << t << "\t" << x[t] << "\t" << pred << "\n";
        h_prev = h_t;
    }

    return 0;
}
This C++ program trains a minimal RNN end-to-end on the Echo task, where the target at step t is the previous input x_{t-1}. It implements forward propagation, binary cross-entropy loss, and full Backpropagation Through Time (BPTT) with tanh hidden units and a sigmoid output. Gradients are derived analytically and accumulated across time, then clipped by global norm before an SGD update. After a few hundred epochs, the model typically learns to output a probability close to the previous bit. The code demonstrates parameter sharing across time steps, state storage for BPTT, and handling of recurrent gradient flow without external libraries.