
Recurrent Neural Network Theory

Key Points

  • A Recurrent Neural Network (RNN) processes sequences by carrying a hidden state that is updated at every time step using h_t = f(W_h h_{t-1} + W_x x_t + b).
  • RNNs learn temporal patterns in language, audio, and time series by reusing the same weights across steps, giving them a form of memory.
  • Training uses Backpropagation Through Time (BPTT), which unrolls the network across T steps and applies the chain rule over the entire sequence.
  • Gradients can vanish or explode due to repeated Jacobian products; careful initialization, gating (GRU/LSTM), and gradient clipping help.
  • The per-sequence time complexity is O(T(d_x d_h + d_h^2 + d_h d_y)), and memory is O(T d_h) for storing states during BPTT.
  • Outputs are often computed as y_t = g(W_y h_t + c) with softmax (multiclass classification), sigmoid (binary), or identity (regression).
  • Truncated BPTT limits backpropagation to a window of k steps, trading long-range credit assignment for efficiency and stability.
  • In C++, you can implement a compact RNN by writing small matrix/vector utilities and carefully managing memory and numerical stability.

Prerequisites

  • Linear algebra (vectors, matrices, norms) — RNN computations are matrix-vector operations, and stability analysis uses matrix norms and eigenvalues.
  • Calculus and the chain rule — Backpropagation Through Time applies the chain rule across many time steps.
  • Optimization (SGD, learning rate, regularization) — Training RNNs requires tuning optimizers and understanding convergence/stability.
  • Probability and information theory — Losses like cross-entropy and concepts like softmax are probabilistic.
  • Numerical stability — Stable softmax/log-sum-exp and gradient clipping prevent overflow/underflow.
  • Basic C++ programming — Implementing an RNN from scratch involves arrays, loops, and memory management.
  • Neural network basics — Familiarity with activations, layers, and loss functions helps contextualize RNNs.
  • Sequence modeling tasks — Understanding inputs/outputs for many-to-one and many-to-many setups guides design.

Detailed Explanation


01 Overview

A Recurrent Neural Network (RNN) is a neural architecture designed to handle sequential data. Unlike feedforward networks that treat inputs independently, RNNs maintain a hidden state h_t that summarizes information from prior time steps. At time t, the hidden state is updated by combining the current input x_t with the previous hidden state h_{t-1} through learnable parameters and a nonlinear activation. Because the same parameters are reused at every step, RNNs can, in principle, model temporal dependencies of arbitrary length. They are widely used for tasks like language modeling, speech recognition, and time-series forecasting. The canonical update looks like h_t = f(W_h h_{t-1} + W_x x_t + b), with f often being tanh or ReLU. An output y_t is optionally produced via y_t = g(W_y h_t + c), where g could be softmax for multiclass classification or identity/sigmoid for regression/binary tasks. RNNs are trained with Backpropagation Through Time (BPTT), which unrolls the sequence, applies the chain rule across time, and accumulates gradients. While simple and elegant, basic RNNs face stability challenges: gradients can vanish or explode due to repeated multiplication by Jacobians. Practical training therefore uses techniques like careful initialization, gradient clipping, truncated BPTT, or gated variants such as LSTM and GRU.
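The update rule described above can be sketched as a single C++ step; the function name and the dense nested-vector layout below are illustrative choices, not taken from the full examples later on this page.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One RNN step: h_new = tanh(W_h h + W_x x + b).
// W_h is d_h x d_h, W_x is d_h x d_x, both stored row-major as nested vectors.
std::vector<double> rnn_step(const std::vector<std::vector<double>>& W_h,
                             const std::vector<std::vector<double>>& W_x,
                             const std::vector<double>& b,
                             const std::vector<double>& h,
                             const std::vector<double>& x) {
    std::vector<double> h_new(b.size());
    for (size_t i = 0; i < b.size(); ++i) {
        double a = b[i];  // pre-activation a_i = b_i + (W_h h)_i + (W_x x)_i
        for (size_t j = 0; j < h.size(); ++j) a += W_h[i][j] * h[j];
        for (size_t j = 0; j < x.size(); ++j) a += W_x[i][j] * x[j];
        h_new[i] = std::tanh(a);  // squashes each component into (-1, 1)
    }
    return h_new;
}
```

Calling this in a loop over t, feeding each returned state back in as h, unrolls the network exactly as described above, with the same W_h, W_x, and b reused at every step.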

02 Intuition & Analogies

Imagine reading a sentence word by word while taking brief notes on a sticky note. Each time you read a new word (x_t), you update your note (h_t) with the new information while keeping the important parts from the previous note (h_{t-1}). You use a consistent method for updating the note—like a template—which corresponds to the matrices W_x and W_h. The note is small and must summarize the past concisely; this is the RNN’s hidden state. If you want to answer a question at any point, you consult the current note and produce an answer (y_t). The nonlinearity f (like tanh) acts like a rule that prevents the note from growing without bound and helps emphasize salient features. Training the network is like reviewing your notes after finishing the sentence, figuring out which parts helped or hurt your final answers, and then slightly adjusting your note-taking template to do better next time. This review process backtracks through time—exactly what Backpropagation Through Time does—crediting or blaming notes from earlier points based on their influence on later answers. However, just as faint pencil marks get harder to trace back through many pages, gradients can fade (vanish) when you backtrack too far, making it difficult to learn very long-term dependencies. Conversely, repeatedly amplifying strong marks can smear ink everywhere (exploding gradients). Techniques like clipping keep the adjustments neat, while gated RNNs (LSTM/GRU) add controlled doors that decide what to write, erase, or read from the note, improving long-term memory.

03 Formal Definition

Consider a sequence of inputs {x_t}_{t=1}^{T}, where x_t ∈ R^{d_x}. An Elman (vanilla) RNN with hidden size d_h maintains hidden states h_t ∈ R^{d_h} via the recurrence h_t = f(a_t), where a_t = W_h h_{t-1} + W_x x_t + b. Here, W_h ∈ R^{d_h×d_h}, W_x ∈ R^{d_h×d_x}, and b ∈ R^{d_h} are shared across time steps; f is an elementwise activation (e.g., tanh or ReLU). The initial hidden state h_0 may be zeros or a learned parameter. An output can be produced as y_t = g(W_y h_t + c), with W_y ∈ R^{d_y×d_h}, c ∈ R^{d_y}, and g chosen per task (softmax for multiclass, identity for regression, sigmoid for binary). Training minimizes a sequence loss (e.g., the sum of per-step losses) over data, commonly cross-entropy for classification: L = ∑_{t=1}^{T} ℓ(y_t, ŷ_t). Backpropagation Through Time unrolls the network across t = 1..T and applies the chain rule to compute gradients with respect to parameters and hidden states. The gradient from time t influences earlier states through products of Jacobians ∂h_t/∂h_{t-1}, leading to vanishing/exploding behavior depending on the spectral properties of W_h and the activation derivative. Practical implementations may use truncated BPTT, where gradients are propagated only k steps backward to limit compute and stabilize training.
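As a sketch of where the Jacobian products come from, differentiating the recurrence h_t = f(W_h h_{t-1} + W_x x_t + b) with the chain rule gives:

```latex
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\bigl(f'(a_t)\bigr)\, W_h,
\qquad
\frac{\partial h_t}{\partial h_{t-k}}
  = \prod_{j=0}^{k-1} \operatorname{diag}\!\bigl(f'(a_{t-j})\bigr)\, W_h .
```

Taking operator norms of this k-fold product is what yields the vanishing/exploding bound listed under Key Formulas.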

04 When to Use

Use RNNs when your data is sequential and order matters: natural language (character/word sequences), audio waveforms, sensor streams, and financial time series. They are suitable when you need online processing (streaming), low-latency inference step-by-step, or when sequences are relatively short to moderate so that simple recurrence suffices. Choose the proper I/O mapping: many-to-one (e.g., sentiment classification over a sentence), one-to-many (sequence generation from a context), or many-to-many (tagging each token). If your dependencies span long ranges or you observe training instabilities, consider gated RNNs (LSTM/GRU) or even Transformers. When compute/memory is limited and you value compactness and easy stepwise deployment (e.g., embedded systems), vanilla RNNs can be attractive—especially with truncated BPTT and gradient clipping. For continuous signals and forecasting, RNNs can model autoregressive dynamics; for classification, pair them with softmax and cross-entropy. In practice, combine RNNs with embeddings for discrete tokens, masking/padding for variable-length batches, and regularization (dropout, weight decay) for generalization.

⚠️Common Mistakes

  • Ignoring sequence lengths: feeding padded tokens without masks causes the model to learn from padding noise. Always mask losses and states for shorter sequences.
  • Shape mismatches: mixing row/column conventions or forgetting that W_h multiplies h_{t-1} and W_x multiplies x_t leads to silent bugs. Verify dimensions and write tests.
  • Unstable training: using high learning rates without gradient clipping may cause exploding gradients; conversely, poor initialization can cause vanishing. Use Xavier/He initialization and clip gradients by global norm.
  • Wrong activation or output: applying softmax to binary outputs or forgetting a sigmoid/identity where appropriate degrades learning. Match g and the loss to the task.
  • Forgetting to detach/segment for truncated BPTT: in manual implementations, you must stop gradients beyond the window; otherwise memory/time explodes.
  • Not resetting or mishandling hidden state between sequences: decide between stateless (h_0=0 each sequence) and stateful (carry over h_T) regimes and code accordingly.
  • Numerical issues: computing log(softmax) naively can underflow. Use stable softmax/log-sum-exp. Similarly, clamp arguments to log and divisions.
  • Evaluation mismatch: training with teacher forcing and evaluating autoregressively without careful handling can cause exposure bias; validate in the intended inference mode.
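For the numerical-issues point above, a log-sum-exp based log-softmax avoids computing log(softmax) naively; this helper is a sketch and is separate from the softmax used in the examples below.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// log(softmax(z)) computed stably: subtract max(z) before exponentiating,
// so exp never overflows and the final log never sees an exact zero.
std::vector<double> log_softmax(const std::vector<double>& z) {
    double m = *std::max_element(z.begin(), z.end());
    double sum = 0.0;
    for (double v : z) sum += std::exp(v - m);
    double lse = m + std::log(sum);  // log-sum-exp of the logits
    std::vector<double> out(z.size());
    for (size_t i = 0; i < z.size(); ++i) out[i] = z[i] - lse;
    return out;
}
```

With logits like {1000, 0}, a naive exp(1000) overflows to infinity, while the shifted version returns finite log-probabilities.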

Key Formulas

RNN Recurrence

h_t = f(W_h h_{t-1} + W_x x_t + b)

Explanation: The hidden state at time t is a nonlinear function of the previous hidden state and current input. This is the core update that gives RNNs memory.

Output Mapping

y_t = g(W_y h_t + c)

Explanation: Given a hidden state, the network produces an output via another linear map and possibly a nonlinearity (softmax, sigmoid, or identity) depending on the task.

Sequence Cross-Entropy

L = −∑_{t=1}^{T} ∑_{k=1}^{d_y} y_{t,k} log ŷ_{t,k}

Explanation: For classification, the loss over a sequence is the sum of cross-entropies between the true labels and predicted probabilities at each time step.

BPTT Error Recurrence

δ_t = (W_y^T (ŷ_t − y_t) + W_h^T δ_{t+1}) ⊙ f′(a_t)

Explanation: The error signal at time t combines the immediate output error with the backpropagated error from time t+1, modulated by the activation derivative at t. This is central to computing gradients through time.

Parameter Gradients

∂L/∂W_h = ∑_{t=1}^{T} δ_t h_{t-1}^T,   ∂L/∂W_x = ∑_{t=1}^{T} δ_t x_t^T,   ∂L/∂b = ∑_{t=1}^{T} δ_t

Explanation: Once δ_t is known, gradients with respect to the parameters follow from outer products with the corresponding inputs. These sums accumulate contributions across time.

Gradient Decay/Explosion Bound

‖∂h_t / ∂h_{t−k}‖_2 ≤ (L_f ‖W_h‖_2)^k

Explanation: The norm of the Jacobian product across k steps is bounded by the product of the activation’s Lipschitz constant and the spectral norm of W_h, raised to the k-th power, explaining vanishing/exploding gradients.
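The bound can be felt numerically even in the scalar case, where the k-step Jacobian product collapses to a simple power; the factors 0.5 and 1.5 below are illustrative, not from the text.

```cpp
#include <cassert>
#include <cmath>

// Scalar analogue of the k-step Jacobian product: the gradient reaching
// k steps back is scaled by at most (L_f * |w_h|)^k, which decays toward
// zero when the base is below 1 and blows up when it is above 1.
double gradient_scale(double lipschitz_times_norm, int k) {
    return std::pow(lipschitz_times_norm, k);
}
```

For example, gradient_scale(0.5, 30) is on the order of 1e-9 (vanishing), while gradient_scale(1.5, 30) exceeds 1e5 (exploding) — the same contrast the bound predicts for ρ below or above one.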

Spectral Radius Criterion

ρ(W_h) < 1 ⇒ local stability near zero

Explanation: If the largest absolute eigenvalue of W_h is less than one (and f is approximately linear near zero), small perturbations decay, helping stability but risking vanishing gradients.

Softmax

ŷ_{t,k} = exp(z_{t,k}) / ∑_{j=1}^{d_y} exp(z_{t,j}),   z_t = W_y h_t + c

Explanation: Turns logits into a probability distribution. Use with cross-entropy for multiclass outputs.

RNN Cost per Sequence

Time = O(T(d_x d_h + d_h^2 + d_h d_y)),   Space = O(T d_h)

Explanation: Each time step performs matrix-vector products with W_x, W_h, and W_y. Storing hidden states over T steps is needed for BPTT, leading to memory linear in T.

Global-Norm Gradient Clipping

g_clipped = g · min(1, τ / ‖g‖_2)

Explanation: If the gradient norm exceeds threshold τ, rescale all gradients uniformly to stabilize training without changing direction.
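The clipping rule, applied to a flattened gradient vector, can be sketched as follows; this is a simpler standalone version of the in-place routine used in the training example further down the page.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rescale g in place so its L2 norm is at most tau, preserving direction.
void clip_global_norm(std::vector<double>& g, double tau) {
    double sq = 0.0;
    for (double v : g) sq += v * v;
    double norm = std::sqrt(sq);
    if (norm > tau) {                 // only rescale when the norm exceeds tau
        double scale = tau / norm;
        for (double& v : g) v *= scale;
    }
}
```

Because every component is multiplied by the same scalar, the update direction is unchanged; only its magnitude is capped.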

Parameter Count

|θ| = d_h d_x + d_h^2 + d_h d_y + d_h + d_y

Explanation: The total number of learnable parameters in a vanilla RNN cell and output layer equals weights plus biases across input, recurrent, and output mappings.
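Plugging in concrete sizes makes the formula tangible; the dimensions below (d_x = d_y = 4, d_h = 8) come from the next-character example later on this page, while the helper function itself is just a sketch.

```cpp
#include <cassert>

// |theta| = d_h*d_x + d_h^2 + d_h*d_y + d_h + d_y
// (input weights + recurrent weights + output weights + hidden bias + output bias)
long param_count(long d_x, long d_h, long d_y) {
    return d_h * d_x + d_h * d_h + d_h * d_y + d_h + d_y;
}
```

For d_x = d_y = 4 and d_h = 8 this gives 32 + 64 + 32 + 8 + 4 = 140 learnable parameters.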

Complexity Analysis

Let d_x be the input size, d_h the hidden size, d_y the output size, and T the sequence length. Each forward step computes two main matrix-vector products: W_x x_t (cost O(d_x d_h)) and W_h h_{t-1} (cost O(d_h^2)), followed by an activation f (O(d_h)) and possibly an output projection W_y h_t (O(d_h d_y)). Thus, a full forward pass over T steps costs O(T(d_x d_h + d_h^2 + d_h d_y)). The backward pass has the same order of complexity because it repeats similar multiplications while propagating error signals and accumulating parameter gradients; constants are slightly larger due to extra elementwise operations and outer products.

Memory usage during training is dominated by storing per-step activations (e.g., h_t and, for some losses, the pre-activations a_t) to compute gradients. This requires O(T d_h) space, plus O(d_h^2 + d_x d_h + d_h d_y) for parameters, which is independent of T. In practice, batch processing multiplies time by the batch size N, but matrix-matrix operations (BLAS level 3) can utilize hardware more efficiently. Truncated BPTT with window k reduces the effective backprop length and memory to O(k d_h), trading off the ability to learn dependencies longer than k steps. Parallelism across time is limited (sequential dependency), but parallelism across the batch and within matrix operations is available. Numerical-stability measures (e.g., stable softmax, clipping) add negligible overhead. Careful cache-friendly memory layouts and vectorization can substantially improve the constants in these asymptotic costs, especially for large d_h.
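The per-step operation counts above can be tallied explicitly; counting one multiply-accumulate (MAC) per weight touched is a common convention, and the toy sizes in the usage note come from the Echo-task example below.

```cpp
#include <cassert>

// Multiply-accumulate count for one forward pass of length T: per step,
// W_x x_t costs d_x*d_h, W_h h_{t-1} costs d_h^2, and W_y h_t costs d_h*d_y.
long long forward_macs(long long T, long long d_x, long long d_h, long long d_y) {
    return T * (d_x * d_h + d_h * d_h + d_h * d_y);
}
```

For the Echo task sizes used later (T = 25, d_x = d_y = 1, d_h = 8), this gives 25 × (8 + 64 + 8) = 2000 MACs per forward pass, dominated by the d_h^2 recurrent term.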

Code Examples

Forward pass of a vanilla RNN for next-character prediction (inference only)
#include <iostream>
#include <vector>
#include <string>
#include <random>
#include <cmath>
#include <algorithm>
#include <iomanip>

using namespace std;

// Utility: numerically-stable softmax
vector<double> softmax(const vector<double>& z) {
    double max_z = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - max_z);
        sum += p[i];
    }
    for (size_t i = 0; i < z.size(); ++i) p[i] /= sum + 1e-12;
    return p;
}

// Matrix-vector multiply: y = A * x (A: m x n, x: n)
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    size_t m = A.size(), n = A[0].size();
    vector<double> y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += A[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Vector add: a += b
void add_inplace(vector<double>& a, const vector<double>& b) {
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

// Activation tanh elementwise
vector<double> tanh_vec(const vector<double>& v) {
    vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = tanh(v[i]);
    return r;
}

// One-hot encoding for a small vocabulary
vector<double> one_hot(int idx, int vocab) {
    vector<double> v(vocab, 0.0);
    v[idx] = 1.0;
    return v;
}

int main() {
    // Toy vocabulary and mapping
    vector<char> vocab = {' ', 'a', 'b', 'c'}; // size = 4
    auto idx_of = [&](char ch) {
        for (int i = 0; i < (int)vocab.size(); ++i) if (vocab[i] == ch) return i;
        return 0; // default to space
    };

    int d_x = (int)vocab.size();
    int d_h = 8;                 // hidden size
    int d_y = (int)vocab.size(); // predict next char distribution

    // Random init (small values); biases stay zero-initialized
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(-0.1, 0.1);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<vector<double>> W_y(d_y, vector<double>(d_h));
    vector<double> b_y(d_y, 0.0);

    for (int i = 0; i < d_h; ++i) {
        for (int j = 0; j < d_x; ++j) W_x[i][j] = uni(rng);
        for (int j = 0; j < d_h; ++j) W_h[i][j] = uni(rng);
    }
    for (int i = 0; i < d_y; ++i) {
        for (int j = 0; j < d_h; ++j) W_y[i][j] = uni(rng);
    }

    // Input sequence (we will predict the next char at each step)
    string input = "ab cab";

    // Initial hidden state
    vector<double> h(d_h, 0.0);

    cout << fixed << setprecision(3);
    cout << "Input\tPredicted next char (argmax)\n";

    for (size_t t = 0; t < input.size(); ++t) {
        // Prepare x_t as one-hot
        vector<double> x = one_hot(idx_of(input[t]), d_x);

        // Compute pre-activation a_t = W_h h + W_x x + b
        vector<double> a = matvec(W_h, h);
        vector<double> Wx = matvec(W_x, x);
        add_inplace(a, Wx);
        add_inplace(a, b_h);

        // Update hidden state
        h = tanh_vec(a);

        // Output logits and softmax
        vector<double> logits = matvec(W_y, h);
        add_inplace(logits, b_y);
        vector<double> probs = softmax(logits);

        // Argmax prediction
        int pred = int(max_element(probs.begin(), probs.end()) - probs.begin());
        cout << input[t] << "\t" << vocab[pred] << "\n";
    }

    return 0;
}

This program defines a small vanilla RNN with tanh hidden units and a softmax output head to perform next-character prediction. It runs only the forward pass: for each character in the input, it updates the hidden state and outputs a probability distribution over the next character. Random weights yield arbitrary predictions, but the structure reflects the recurrence h_t = f(W_h h_{t-1} + W_x x_t + b) and y_t = softmax(W_y h_t + c). The example shows how to manage shapes and numerically stable softmax in plain C++.

Time: O(T(d_x d_h + d_h^2 + d_h d_y)) for a sequence of length T. Space: O(d_h + d_x + d_y); since there is no training, we do not store all time-step states.
Training a tiny RNN with BPTT on the Echo task (predict x_{t-1})
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <cassert>
using namespace std;

// Sigmoid, and the tanh derivative expressed via the output
inline double sigmoid(double z){ return 1.0 / (1.0 + exp(-z)); }
inline double dtanh(double y){ return 1.0 - y*y; } // derivative using output y = tanh(a)

// Vector helpers
vector<double> zeros(int n){ return vector<double>(n, 0.0); }

vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x){
    int m = (int)A.size();
    int n = (int)A[0].size();
    vector<double> y(m, 0.0);
    for(int i=0;i<m;++i){
        double s=0.0; for(int j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

void add_inplace(vector<double>& a, const vector<double>& b){
    for(size_t i=0;i<a.size();++i) a[i]+=b[i];
}

vector<double> tanh_vec(const vector<double>& v){
    vector<double> r(v.size());
    for(size_t i=0;i<v.size();++i) r[i]=tanh(v[i]);
    return r;
}

// Outer product: A += u * v^T (A: m x n, u: m, v: n)
void outer_add(vector<vector<double>>& A, const vector<double>& u, const vector<double>& v){
    int m=(int)A.size(), n=(int)A[0].size();
    assert((int)u.size()==m && (int)v.size()==n);
    for(int i=0;i<m;++i){
        for(int j=0;j<n;++j) A[i][j] += u[i]*v[j];
    }
}

// y += A^T * x (A: m x n, x: m) => y: n
void add_matTvec_inplace(vector<double>& y, const vector<vector<double>>& A, const vector<double>& x){
    int m=(int)A.size(), n=(int)A[0].size();
    assert((int)y.size()==n && (int)x.size()==m);
    for(int j=0;j<n;++j){
        double s=0.0; for(int i=0;i<m;++i) s += A[i][j]*x[i];
        y[j] += s;
    }
}

// Global-norm gradient clipping
void clip_by_global_norm(vector<vector<double>>& dW_h, vector<vector<double>>& dW_x,
                         vector<double>& db_h, vector<double>& dW_y, double& db_y,
                         double tau){
    double sq=0.0;
    int d_h=(int)dW_h.size(); int d_x=(int)dW_x[0].size();
    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_h;++j) sq += dW_h[i][j]*dW_h[i][j];
        for(int j=0;j<d_x;++j) sq += dW_x[i][j]*dW_x[i][j];
        sq += db_h[i]*db_h[i];
    }
    for(double v: dW_y) sq += v*v;
    sq += db_y*db_y;
    double norm = sqrt(sq) + 1e-12;
    double scale = min(1.0, tau / norm);
    if(scale<1.0){
        for(int i=0;i<d_h;++i){
            for(int j=0;j<d_h;++j) dW_h[i][j]*=scale;
            for(int j=0;j<d_x;++j) dW_x[i][j]*=scale;
            db_h[i]*=scale;
        }
        for(double& v: dW_y) v*=scale;
        db_y*=scale;
    }
}

int main(){
    // Echo task: input x_t in {0,1}, target y_t = x_{t-1} (y_0 = 0)
    int T = 25;        // sequence length
    int epochs = 300;  // small demo
    int d_x = 1;       // scalar input
    int d_h = 8;       // hidden units
    // d_y = 1: scalar output (probability of 1), represented by W_y/b_y below
    double lr = 0.1;

    // Initialize parameters
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(-0.2, 0.2);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<double> W_y(d_h); // shape 1 x d_h represented as a vector
    double b_y = 0.0;

    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_x;++j) W_x[i][j]=uni(rng);
        for(int j=0;j<d_h;++j) W_h[i][j]=uni(rng) * (i==j? 0.5:1.0); // slightly smaller on diagonal
        W_y[i]=uni(rng);
    }

    std::bernoulli_distribution bern(0.5);

    for(int epoch=1; epoch<=epochs; ++epoch){
        // Generate a random sequence each epoch
        vector<double> x(T, 0.0), y_true(T, 0.0);
        for(int t=0;t<T;++t) x[t] = bern(rng) ? 1.0:0.0;
        y_true[0] = 0.0;
        for(int t=1;t<T;++t) y_true[t] = x[t-1];

        // Forward pass: store a_t, h_t, y_hat_t
        vector<vector<double>> a(T, vector<double>(d_h, 0.0));
        vector<vector<double>> h(T, vector<double>(d_h, 0.0));
        vector<double> y_hat(T, 0.0);

        vector<double> h_prev(d_h, 0.0);
        for(int t=0;t<T;++t){
            // a_t = W_h h_{t-1} + W_x x_t + b_h
            a[t] = matvec(W_h, h_prev);
            for(int i=0;i<d_h;++i) a[t][i] += W_x[i][0] * x[t] + b_h[i]; // scalar input
            // h_t = tanh(a_t)
            h[t] = tanh_vec(a[t]);
            // y_t = sigmoid(W_y h_t + b_y)
            double z = b_y;
            for(int i=0;i<d_h;++i) z += W_y[i]*h[t][i];
            y_hat[t] = sigmoid(z);
            h_prev = h[t];
        }

        // Binary cross-entropy loss, averaged over T
        double loss = 0.0;
        for(int t=0;t<T;++t){
            double yt = y_true[t];
            double yp = min(1.0-1e-7, max(1e-7, y_hat[t])); // clamp away from 0 and 1
            loss += -( yt*log(yp) + (1.0-yt)*log(1.0-yp) );
        }
        loss /= T;

        // Backward pass (BPTT)
        vector<vector<double>> dW_x(d_h, vector<double>(d_x, 0.0));
        vector<vector<double>> dW_h(d_h, vector<double>(d_h, 0.0));
        vector<double> db_h(d_h, 0.0);
        vector<double> dW_y(d_h, 0.0);
        double db_y = 0.0;

        vector<double> dh_next(d_h, 0.0);
        for(int t=T-1; t>=0; --t){
            // Output gradient: dL/dz = y_hat - y_true for sigmoid + BCE
            double dL_dz = (y_hat[t] - y_true[t]) / T; // averaged over T
            // Grad w.r.t. W_y, b_y
            for(int i=0;i<d_h;++i) dW_y[i] += dL_dz * h[t][i];
            db_y += dL_dz;

            // Backprop into h_t: W_y^T dL_dz plus the gradient arriving from the future
            vector<double> dh(d_h, 0.0);
            for(int i=0;i<d_h;++i) dh[i] = W_y[i] * dL_dz + dh_next[i];

            // Through tanh: da = dh * (1 - tanh(a)^2), where tanh(a) = h[t][i]
            vector<double> da(d_h, 0.0);
            for(int i=0;i<d_h;++i) da[i] = dh[i] * dtanh(h[t][i]);

            // Accumulate parameter grads: W_x, W_h, b_h
            for(int i=0;i<d_h;++i) dW_x[i][0] += da[i] * x[t]; // da * x_t^T (x scalar)
            for(int i=0;i<d_h;++i){                            // da * h_{t-1}^T
                for(int j=0;j<d_h;++j){
                    double hprevj = (t>0? h[t-1][j]: 0.0);     // h_0 is the zero state
                    dW_h[i][j] += da[i] * hprevj;
                }
            }
            for(int i=0;i<d_h;++i) db_h[i] += da[i];

            // Propagate to previous hidden: dh_next = W_h^T * da
            fill(dh_next.begin(), dh_next.end(), 0.0);
            add_matTvec_inplace(dh_next, W_h, da);
        }

        // Clip gradients to avoid exploding
        clip_by_global_norm(dW_h, dW_x, db_h, dW_y, db_y, 5.0);

        // SGD update
        for(int i=0;i<d_h;++i){
            W_y[i] -= lr * dW_y[i];
            for(int j=0;j<d_x;++j) W_x[i][j] -= lr * dW_x[i][j];
            for(int j=0;j<d_h;++j) W_h[i][j] -= lr * dW_h[i][j];
            b_h[i] -= lr * db_h[i];
        }
        b_y -= lr * db_y;

        if(epoch % 50 == 0 || epoch == 1){
            cout << "Epoch " << epoch << ": loss = " << loss << "\n";
        }
    }

    // Quick qualitative check
    cout << "\nQualitative check on a new random sequence:\n";
    vector<double> x(T,0.0);
    std::mt19937 rng2(999);
    std::bernoulli_distribution bern2(0.5);
    for(int t=0;t<T;++t) x[t] = bern2(rng2)?1.0:0.0;

    vector<double> h_prev(d_h, 0.0);
    cout << "t\tx_t\tpred p(x_{t-1}=1)\n";
    for(int t=0;t<T;++t){
        vector<double> a(d_h, 0.0);
        for(int i=0;i<d_h;++i){
            double s = 0.0;
            for(int j=0;j<d_h;++j) s += W_h[i][j]*h_prev[j];
            s += W_x[i][0]*x[t] + b_h[i];
            a[i] = s;
        }
        vector<double> h_t(d_h, 0.0);
        for(int i=0;i<d_h;++i) h_t[i] = tanh(a[i]);
        double z = b_y; for(int i=0;i<d_h;++i) z += W_y[i]*h_t[i];
        double pred = sigmoid(z);
        cout << t << "\t" << x[t] << "\t" << pred << "\n";
        h_prev = h_t;
    }

    return 0;
}

This C++ program trains a minimal RNN end-to-end on the Echo task, where the target at step t is the previous input x_{t-1}. It implements forward propagation, binary cross-entropy loss, and full Backpropagation Through Time (BPTT) with tanh hidden units and a sigmoid output. Gradients are derived analytically and accumulated across time, then clipped by global norm before an SGD update. After a few hundred epochs, the model typically learns to output a probability close to the previous bit. The code demonstrates parameter sharing across time steps, state storage for BPTT, and handling of recurrent gradient flow without external libraries.

Time: O(T(d_x d_h + d_h^2 + d_h d_y)) per epoch for one sequence (same order for backward). Space: O(T d_h) to store activations (a_t, h_t), plus O(d_h^2 + d_x d_h + d_h d_y) for parameters.
#recurrent neural network#rnn#backpropagation through time#bptt#vanishing gradient#exploding gradient#spectral radius#tanh#softmax#cross-entropy#truncated bptt#gradient clipping#sequence modeling#time series#cpp implementation