Recurrent Neural Network Theory
Key Points
- •A Recurrent Neural Network (RNN) processes sequences by carrying a hidden state that is updated at every time step using h_t = f(W_h h_{t-1} + W_x x_t + b).
- •RNNs learn temporal patterns like language, audio, and time-series by reusing the same weights across steps, giving them a form of memory.
- •Training uses Backpropagation Through Time (BPTT), which unrolls the network across T steps and applies the chain rule over the entire sequence.
- •Gradients can vanish or explode due to repeated Jacobian products; careful initialization, gating (as in LSTM/GRU), and gradient clipping help.
- •The per-sequence time complexity is O(T(d_h d_x + d_h^2 + d_h d_y)), and memory is O(T d_h) for storing hidden states during BPTT.
- •Outputs are often computed as y_t = g(W_y h_t + c), with g being softmax (classification) or linear/sigmoid (regression/binary tasks).
- •Truncated BPTT limits backprop to a window of k steps to trade long-range credit assignment for efficiency and stability.
- •In C++, you can implement a compact RNN by writing small matrix/vector utilities and carefully managing memory and numerical stability.
Prerequisites
- →Linear algebra (vectors, matrices, norms) — RNN computations are matrix-vector operations, and stability analysis uses matrix norms and eigenvalues.
- →Calculus and chain rule — Backpropagation Through Time applies the chain rule across many time steps.
- →Optimization (SGD, learning rate, regularization) — Training RNNs requires tuning optimizers and understanding convergence/stability.
- →Probability and information theory — Losses like cross-entropy and concepts like softmax are probabilistic.
- →Numerical stability — Stable softmax/log-sum-exp and gradient clipping prevent overflow/underflow.
- →Basic C++ programming — Implementing an RNN from scratch involves arrays, loops, and memory management.
- →Neural network basics — Familiarity with activations, layers, and loss functions helps contextualize RNNs.
- →Sequence modeling tasks — Understanding inputs/outputs for many-to-one and many-to-many setups guides design.
Detailed Explanation
01 Overview
A Recurrent Neural Network (RNN) is a neural architecture designed to handle sequential data. Unlike feedforward networks that treat inputs independently, RNNs maintain a hidden state h_t that summarizes information from prior time steps. At time t, the hidden state is updated by combining the current input x_t with the previous hidden state h_{t-1} through learnable parameters and a nonlinear activation. Because the same parameters are reused at every step, RNNs can, in principle, model temporal dependencies of arbitrary length. They are widely used for tasks like language modeling, speech recognition, and time-series forecasting. The canonical update looks like h_t = f(W_h h_{t-1} + W_x x_t + b), with f often being tanh or ReLU. An output y_t is optionally produced via y_t = g(W_y h_t + c), where g could be softmax for multiclass classification or identity/sigmoid for regression/binary tasks. RNNs are trained with Backpropagation Through Time (BPTT), which unrolls the sequence, applies the chain rule across time, and accumulates gradients. While simple and elegant, basic RNNs face stability challenges: gradients can vanish or explode due to repeated multiplication by Jacobians. Practical training therefore uses techniques like careful initialization, gradient clipping, truncated BPTT, or gated variants such as LSTM and GRU.
02 Intuition & Analogies
Imagine reading a sentence word by word while taking brief notes on a sticky note. Each time you read a new word (x_t), you update your note (h_t) with the new information while keeping the important parts from the previous note (h_{t-1}). You use a consistent method for updating the note—like a template—which corresponds to the matrices W_x and W_h. The note is small and must summarize the past concisely; this is the RNN’s hidden state. If you want to answer a question at any point, you consult the current note and produce an answer (y_t). The nonlinearity f (like tanh) acts like a rule that prevents the note from growing without bound and helps emphasize salient features. Training the network is like reviewing your notes after finishing the sentence, figuring out which parts helped or hurt your final answers, and then slightly adjusting your note-taking template to do better next time. This review process backtracks through time—exactly what Backpropagation Through Time does—crediting or blaming notes from earlier points based on their influence on later answers. However, just as faint pencil marks get harder to trace back through many pages, gradients can fade (vanish) when you backtrack too far, making it difficult to learn very long-term dependencies. Conversely, repeatedly amplifying strong marks can smear ink everywhere (exploding gradients). Techniques like clipping keep the adjustments neat, while gated RNNs (LSTM/GRU) add controlled doors that decide what to write, erase, or read from the note, improving long-term memory.
03 Formal Definition
A vanilla (Elman) RNN over an input sequence x_1, ..., x_T with x_t in R^{d_x} is defined by a hidden state h_t in R^{d_h} and parameters W_x in R^{d_h x d_x}, W_h in R^{d_h x d_h}, b in R^{d_h}, W_y in R^{d_y x d_h}, c in R^{d_y}. Starting from an initial state h_0 (commonly the zero vector), the state and output evolve as h_t = f(W_h h_{t-1} + W_x x_t + b) and y_t = g(W_y h_t + c), where f is an elementwise nonlinearity (typically tanh or ReLU) and g is task-dependent (softmax, sigmoid, or identity). Crucially, the same parameters are shared across all time steps. Given a per-step loss on y_t, the sequence loss is the sum (or average) over t, and gradients are obtained by unrolling the recurrence and applying the chain rule across time (BPTT).
04 When to Use
Use RNNs when your data is sequential and order matters: natural language (character/word sequences), audio waveforms, sensor streams, and financial time series. They are suitable when you need online processing (streaming), low-latency inference step-by-step, or when sequences are relatively short to moderate so that simple recurrence suffices. Choose the proper I/O mapping: many-to-one (e.g., sentiment classification over a sentence), one-to-many (sequence generation from a context), or many-to-many (tagging each token). If your dependencies span long ranges or you observe training instabilities, consider gated RNNs (LSTM/GRU) or even Transformers. When compute/memory is limited and you value compactness and easy stepwise deployment (e.g., embedded systems), vanilla RNNs can be attractive—especially with truncated BPTT and gradient clipping. For continuous signals and forecasting, RNNs can model autoregressive dynamics; for classification, pair them with softmax and cross-entropy. In practice, combine RNNs with embeddings for discrete tokens, masking/padding for variable-length batches, and regularization (dropout, weight decay) for generalization.
⚠️Common Mistakes
- Ignoring sequence lengths: feeding padded tokens without masks causes the model to learn from padding noise. Always mask losses and states for shorter sequences.
- Shape mismatches: mixing row/column conventions or forgetting that W_h multiplies h_{t-1} and W_x multiplies x_t leads to silent bugs. Verify dimensions and write tests.
- Unstable training: using high learning rates without gradient clipping may cause exploding gradients; conversely, poor initialization can cause vanishing. Use Xavier/He initialization and clip gradients by global norm.
- Wrong activation or output: applying softmax to binary outputs or forgetting a sigmoid/identity where appropriate degrades learning. Match g and the loss to the task.
- Forgetting to detach/segment for truncated BPTT: in manual implementations, you must stop gradients beyond the window; otherwise memory/time explodes.
- Not resetting or mishandling hidden state between sequences: decide between stateless (h_0=0 each sequence) and stateful (carry over h_T) regimes and code accordingly.
- Numerical issues: computing log(softmax) naively can underflow. Use stable softmax/log-sum-exp. Similarly, clamp arguments to log and divisions.
- Evaluation mismatch: training with teacher forcing and evaluating autoregressively without careful handling can cause exposure bias; validate in the intended inference mode.
Key Formulas
RNN Recurrence
h_t = f(W_h h_{t-1} + W_x x_t + b)
Explanation: The hidden state at time t is a nonlinear function of the previous hidden state and current input. This is the core update that gives RNNs memory.
Output Mapping
y_t = g(W_y h_t + c)
Explanation: Given a hidden state, the network produces an output via another linear map and possibly a nonlinearity (softmax, sigmoid, or identity) depending on the task.
Sequence Cross-Entropy
L = − Σ_{t=1}^{T} Σ_k y_{t,k} log ŷ_{t,k}
Explanation: For classification, the loss over a sequence is the sum of cross-entropies between the true labels y_t and predicted probabilities ŷ_t at each time step.
BPTT Error Recurrence
δ_t = (W_y^T e_t + W_h^T δ_{t+1}) ⊙ f'(a_t), with δ_{T+1} = 0
Explanation: The error signal δ_t at time t combines the immediate output error e_t (e.g., ŷ_t − y_t for softmax with cross-entropy) with the backpropagated error from time t+1, modulated by the activation derivative at t. This is central to computing gradients through time.
Parameter Gradients
∂L/∂W_x = Σ_t δ_t x_t^T,  ∂L/∂W_h = Σ_t δ_t h_{t-1}^T,  ∂L/∂b = Σ_t δ_t
Explanation: Once δ_t (the backpropagated error at step t) is known, gradients w.r.t. parameters follow from outer products with the corresponding inputs. These sums accumulate contributions across time.
Gradient Decay/Explosion Bound
|| Π_{i=t-k+1}^{t} ∂h_i/∂h_{i-1} || ≤ (γ ||W_h||_2)^k
Explanation: The norm of the Jacobian product across k steps is bounded by the product of the activation's Lipschitz constant γ and the spectral norm of W_h, raised to the power k, explaining vanishing (γ||W_h||_2 < 1) and exploding (γ||W_h||_2 > 1) gradients.
Spectral Radius Criterion
ρ(W_h) < 1
Explanation: If the largest absolute eigenvalue of W_h is less than one (and f is approximately linear near zero), small perturbations decay, helping stability but risking vanishing gradients.
Softmax
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Explanation: Turns logits into a probability distribution. Use with cross-entropy for multiclass outputs.
RNN Cost per Sequence
Time: O(T(d_h d_x + d_h^2 + d_h d_y)),  Memory: O(T d_h)
Explanation: Each time step performs matrix-vector products with W_x, W_h, and W_y. Storing hidden states over T steps is needed for BPTT, leading to linear memory in T.
Global-Norm Gradient Clipping
g ← g · min(1, τ / ||g||_2)
Explanation: If the global gradient norm ||g||_2 exceeds the threshold τ, rescale all gradients uniformly to stabilize training without changing direction.
Parameter Count
N = d_h d_x + d_h^2 + d_h + d_y d_h + d_y
Explanation: The total number of learnable parameters in a vanilla RNN cell and output layer equals weights plus biases across the input, recurrent, and output mappings.
Complexity Analysis
A forward pass over a length-T sequence costs O(T(d_h d_x + d_h^2 + d_h d_y)) time, dominated by the d_h × d_h recurrent matrix-vector product when the hidden size is large. BPTT adds a backward sweep of the same order, and storing the hidden states it needs takes O(T d_h) memory. Truncated BPTT with window k caps the stored state at O(k d_h) per update, trading away credit assignment beyond k steps.
Code Examples
#include <iostream>
#include <vector>
#include <string>
#include <random>
#include <cmath>
#include <algorithm>
#include <iomanip>

using namespace std;

// Utility: numerically-stable softmax
vector<double> softmax(const vector<double>& z) {
    double max_z = *max_element(z.begin(), z.end());
    double sum = 0.0;
    vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = exp(z[i] - max_z);
        sum += p[i];
    }
    for (size_t i = 0; i < z.size(); ++i) p[i] /= sum + 1e-12;
    return p;
}

// Matrix-vector multiply: y = A * x (A: m x n, x: n)
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) {
    size_t m = A.size(), n = A[0].size();
    vector<double> y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += A[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Vector add: a += b
void add_inplace(vector<double>& a, const vector<double>& b) {
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

// Activation tanh elementwise
vector<double> tanh_vec(const vector<double>& v) {
    vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = tanh(v[i]);
    return r;
}

// One-hot encoding for a small vocabulary
vector<double> one_hot(int idx, int vocab) {
    vector<double> v(vocab, 0.0);
    v[idx] = 1.0;
    return v;
}

int main() {
    // Toy vocabulary and mapping
    vector<char> vocab = {' ', 'a', 'b', 'c'}; // size = 4
    auto idx_of = [&](char ch){
        for (int i = 0; i < (int)vocab.size(); ++i) if (vocab[i] == ch) return i;
        return 0; // default to space
    };

    int d_x = (int)vocab.size();
    int d_h = 8;                 // hidden size
    int d_y = (int)vocab.size(); // predict next char distribution

    // Random init (small values)
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(-0.1, 0.1);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<vector<double>> W_y(d_y, vector<double>(d_h));
    vector<double> b_y(d_y, 0.0);

    for (int i = 0; i < d_h; ++i) {
        for (int j = 0; j < d_x; ++j) W_x[i][j] = uni(rng);
        for (int j = 0; j < d_h; ++j) W_h[i][j] = uni(rng);
        b_h[i] = 0.0;
    }
    for (int i = 0; i < d_y; ++i) {
        for (int j = 0; j < d_h; ++j) W_y[i][j] = uni(rng);
        b_y[i] = 0.0;
    }

    // Input sequence (we will predict next char at each step)
    string input = "ab cab";

    // Initial hidden state
    vector<double> h(d_h, 0.0);

    cout << fixed << setprecision(3);
    cout << "Input\tPredicted next char (argmax)\n";

    for (size_t t = 0; t < input.size(); ++t) {
        // Prepare x_t as one-hot
        vector<double> x = one_hot(idx_of(input[t]), d_x);

        // Compute pre-activation a_t = W_h h + W_x x + b
        vector<double> a = matvec(W_h, h);
        vector<double> Wx = matvec(W_x, x);
        add_inplace(a, Wx);
        add_inplace(a, b_h);

        // Update hidden state
        h = tanh_vec(a);

        // Output logits and softmax
        vector<double> logits = matvec(W_y, h);
        add_inplace(logits, b_y);
        vector<double> probs = softmax(logits);

        // Argmax prediction
        int pred = int(max_element(probs.begin(), probs.end()) - probs.begin());
        cout << input[t] << "\t" << vocab[pred] << "\n";
    }

    return 0;
}
This program defines a small vanilla RNN with tanh hidden units and a softmax output head to perform next-character prediction. It runs only the forward pass: for each character in the input, it updates the hidden state and outputs a probability distribution over the next character. Random weights yield arbitrary predictions, but the structure reflects the recurrence h_t = f(W_h h_{t-1} + W_x x_t + b) and y_t = softmax(W_y h_t + c). The example shows how to manage shapes and implement a numerically stable softmax in plain C++.
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <cassert>
using namespace std;

// Sigmoid, and tanh derivative expressed via the output y = tanh(a)
inline double sigmoid(double z){ return 1.0 / (1.0 + exp(-z)); }
inline double dtanh(double y){ return 1.0 - y*y; }

// Matrix-vector multiply: y = A * x (A: m x n, x: n)
vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x){
    int m = (int)A.size();
    int n = (int)A[0].size();
    vector<double> y(m, 0.0);
    for(int i=0;i<m;++i){
        double s=0.0; for(int j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

vector<double> tanh_vec(const vector<double>& v){
    vector<double> r(v.size());
    for(size_t i=0;i<v.size();++i) r[i]=tanh(v[i]);
    return r;
}

// y += A^T * x (A: m x n, x: m) => y: n
void add_matTvec_inplace(vector<double>& y, const vector<vector<double>>& A, const vector<double>& x){
    int m=(int)A.size(), n=(int)A[0].size();
    assert((int)y.size()==n && (int)x.size()==m);
    for(int j=0;j<n;++j){
        double s=0.0; for(int i=0;i<m;++i) s += A[i][j]*x[i];
        y[j] += s;
    }
}

// Global-norm gradient clipping
void clip_by_global_norm(vector<vector<double>>& dW_h, vector<vector<double>>& dW_x,
                         vector<double>& db_h, vector<double>& dW_y, double& db_y,
                         double tau){
    double sq=0.0;
    int d_h=(int)dW_h.size(); int d_x=(int)dW_x[0].size();
    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_h;++j) sq += dW_h[i][j]*dW_h[i][j];
        for(int j=0;j<d_x;++j) sq += dW_x[i][j]*dW_x[i][j];
        sq += db_h[i]*db_h[i];
    }
    for(double v: dW_y) sq += v*v;
    sq += db_y*db_y;
    double norm = sqrt(sq) + 1e-12;
    double scale = min(1.0, tau / norm);
    if(scale<1.0){
        for(int i=0;i<d_h;++i){
            for(int j=0;j<d_h;++j) dW_h[i][j]*=scale;
            for(int j=0;j<d_x;++j) dW_x[i][j]*=scale;
            db_h[i]*=scale;
        }
        for(double& v: dW_y) v*=scale;
        db_y*=scale;
    }
}

int main(){
    // Echo task: input x_t in {0,1}, target y_t = x_{t-1} (y_0 = 0)
    int T = 25;        // sequence length
    int epochs = 300;  // small demo
    int d_x = 1;       // scalar input
    int d_h = 8;       // hidden units
    double lr = 0.1;   // scalar output: predicted probability of 1

    // Initialize parameters
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(-0.2, 0.2);

    vector<vector<double>> W_x(d_h, vector<double>(d_x));
    vector<vector<double>> W_h(d_h, vector<double>(d_h));
    vector<double> b_h(d_h, 0.0);
    vector<double> W_y(d_h); // shape 1 x d_h represented as a vector
    double b_y = 0.0;

    for(int i=0;i<d_h;++i){
        for(int j=0;j<d_x;++j) W_x[i][j]=uni(rng);
        for(int j=0;j<d_h;++j) W_h[i][j]=uni(rng) * (i==j? 0.5:1.0); // slightly smaller on diagonal
        W_y[i]=uni(rng);
    }

    std::bernoulli_distribution bern(0.5);

    for(int epoch=1; epoch<=epochs; ++epoch){
        // Generate a random sequence each epoch
        vector<double> x(T, 0.0), y_true(T, 0.0);
        for(int t=0;t<T;++t) x[t] = bern(rng) ? 1.0:0.0;
        y_true[0] = 0.0;
        for(int t=1;t<T;++t) y_true[t] = x[t-1];

        // Forward pass: store a_t, h_t, y_hat_t
        vector<vector<double>> a(T, vector<double>(d_h, 0.0));
        vector<vector<double>> h(T, vector<double>(d_h, 0.0));
        vector<double> y_hat(T, 0.0);

        vector<double> h_prev(d_h, 0.0);
        for(int t=0;t<T;++t){
            // a_t = W_h h_{t-1} + W_x x_t + b_h
            a[t] = matvec(W_h, h_prev);
            for(int i=0;i<d_h;++i) a[t][i] += W_x[i][0] * x[t] + b_h[i];
            // h_t = tanh(a_t)
            h[t] = tanh_vec(a[t]);
            // y_t = sigmoid(W_y h_t + b_y)
            double z = b_y;
            for(int i=0;i<d_h;++i) z += W_y[i]*h[t][i];
            y_hat[t] = sigmoid(z);
            h_prev = h[t];
        }

        // Binary cross-entropy loss, averaged over the sequence
        double loss = 0.0;
        for(int t=0;t<T;++t){
            double yt = y_true[t];
            double yp = min(1.0-1e-7, max(1e-7, y_hat[t]));
            loss += -( yt*log(yp) + (1.0-yt)*log(1.0-yp) );
        }
        loss /= T;

        // Backward pass (BPTT)
        vector<vector<double>> dW_x(d_h, vector<double>(d_x, 0.0));
        vector<vector<double>> dW_h(d_h, vector<double>(d_h, 0.0));
        vector<double> db_h(d_h, 0.0);
        vector<double> dW_y(d_h, 0.0);
        double db_y = 0.0;

        vector<double> dh_next(d_h, 0.0);
        for(int t=T-1; t>=0; --t){
            // Output gradient: dL/dz = y_hat - y_true for sigmoid + BCE
            double dL_dz = (y_hat[t] - y_true[t]) / T; // average over T

            // Gradients w.r.t. W_y, b_y
            for(int i=0;i<d_h;++i) dW_y[i] += dL_dz * h[t][i];
            db_y += dL_dz;

            // Backprop into h_t: W_y^T * dL_dz plus the error flowing
            // back from step t+1 through the recurrence
            vector<double> dh(d_h, 0.0);
            for(int i=0;i<d_h;++i) dh[i] = W_y[i] * dL_dz + dh_next[i];

            // Through tanh: da = dh * (1 - tanh(a)^2), using tanh(a_t) = h[t]
            vector<double> da(d_h, 0.0);
            for(int i=0;i<d_h;++i) da[i] = dh[i] * dtanh(h[t][i]);

            // Accumulate parameter grads: W_x (da * x_t^T, x scalar),
            // W_h (da * h_{t-1}^T, with h_0 = 0), and b_h
            for(int i=0;i<d_h;++i) dW_x[i][0] += da[i] * x[t];
            for(int i=0;i<d_h;++i){
                for(int j=0;j<d_h;++j){
                    double hprevj = (t>0? h[t-1][j] : 0.0);
                    dW_h[i][j] += da[i] * hprevj;
                }
            }
            for(int i=0;i<d_h;++i) db_h[i] += da[i];

            // Propagate to previous hidden state: dh_next = W_h^T * da
            fill(dh_next.begin(), dh_next.end(), 0.0);
            add_matTvec_inplace(dh_next, W_h, da);
        }

        // Clip gradients to avoid exploding
        clip_by_global_norm(dW_h, dW_x, db_h, dW_y, db_y, 5.0);

        // SGD update
        for(int i=0;i<d_h;++i){
            W_y[i] -= lr * dW_y[i];
            for(int j=0;j<d_x;++j) W_x[i][j] -= lr * dW_x[i][j];
            for(int j=0;j<d_h;++j) W_h[i][j] -= lr * dW_h[i][j];
            b_h[i] -= lr * db_h[i];
        }
        b_y -= lr * db_y;

        if(epoch % 50 == 0 || epoch == 1){
            cout << "Epoch " << epoch << ": loss = " << loss << "\n";
        }
    }

    // Quick qualitative check
    cout << "\nQualitative check on a new random sequence:\n";
    vector<double> x(T,0.0);
    std::mt19937 rng2(999);
    std::bernoulli_distribution bern2(0.5);
    for(int t=0;t<T;++t) x[t] = bern2(rng2)?1.0:0.0;

    vector<double> h_prev(d_h, 0.0);
    cout << "t\tx_t\tpred p(x_{t-1}=1)\n";
    for(int t=0;t<T;++t){
        vector<double> a(d_h, 0.0);
        for(int i=0;i<d_h;++i){
            double s = 0.0;
            for(int j=0;j<d_h;++j) s += W_h[i][j]*h_prev[j];
            s += W_x[i][0]*x[t] + b_h[i];
            a[i] = s;
        }
        vector<double> h_t(d_h, 0.0);
        for(int i=0;i<d_h;++i) h_t[i] = tanh(a[i]);
        double z = b_y; for(int i=0;i<d_h;++i) z += W_y[i]*h_t[i];
        double pred = sigmoid(z);
        cout << t << "\t" << x[t] << "\t" << pred << "\n";
        h_prev = h_t;
    }

    return 0;
}
This C++ program trains a minimal RNN end-to-end on the Echo task, where the target at step t is the previous input x_{t-1}. It implements forward propagation, binary cross-entropy loss, and full Backpropagation Through Time (BPTT) with tanh hidden units and a sigmoid output. Gradients are derived analytically and accumulated across time, then clipped by global norm before an SGD update. After a few hundred epochs, the model typically learns to output a probability close to the previous bit. The code demonstrates parameter sharing across time steps, state storage for BPTT, and handling of recurrent gradient flow without external libraries.