LSTM & Gating Mechanisms
Key Points
- Long Short-Term Memory (LSTM) networks use gates (forget, input, and output) to control what information to erase, write, and reveal at each time step.
- The cell state acts like a conveyor belt of memory; gates scale it element-wise to preserve or modify information over long sequences.
- LSTMs mitigate vanishing gradients by providing a nearly linear path for gradients through the cell state when forget gates are near one.
- Each gate is computed with a sigmoid activation, while the candidate content uses tanh; the hidden state combines the output gate and tanh of the cell state.
- Forward computation per time step costs O(n_h (n_x + n_h)), where n_x is the input size and n_h is the hidden size; full sequence cost scales linearly with sequence length T.
- Initialization matters: setting the forget gate bias positive (e.g., 1.0) encourages memory retention early in training.
- Variants like peephole LSTMs add direct connections from the cell state to gates for finer timing control.
- In C++, an LSTM cell can be implemented with straightforward matrix–vector operations and element-wise nonlinearities.
Prerequisites
- Vectors and Matrices — LSTM equations are expressed with matrix–vector multiplications and element-wise operations.
- Activation Functions (sigmoid, tanh) — Gates require sigmoid and candidate/readout use tanh; understanding their ranges and derivatives is essential.
- Chain Rule and Backpropagation — Training LSTMs uses backpropagation through time, which relies on the chain rule.
- Basic RNNs — LSTMs extend RNNs with gates and a separate cell state; knowing RNNs clarifies the motivation.
- Time Series/Sequence Data — Understanding sequential dependencies motivates why memory and gating are needed.
- Gradient Clipping & Optimization — Stabilizes training of recurrent models and prevents exploding gradients.
- Initialization Strategies — Forget gate bias and weight scaling affect memory and convergence.
- Numerical Stability — Sigmoid/tanh saturation and floating-point limits affect gradients and activations.
Detailed Explanation
01 Overview
An LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture designed to learn dependencies over long sequences. Unlike a plain RNN that updates its hidden state with a single transformation, an LSTM maintains two internal vectors: a hidden state h_t and a cell state c_t. The cell state is a dedicated memory lane that can carry information forward with minimal modification, while three gates—forget f_t, input i_t, and output o_t—control how much information flows in and out at each time step. Each gate is a vector of values between 0 and 1 computed with a sigmoid function applied to an affine transformation of the current input and previous hidden state. The input gate decides what new information to write, the forget gate decides what to erase, and the output gate decides what part of the internal memory to expose as the hidden state. A candidate content vector g_t (often called \tilde{c}_t) produced by tanh suggests new information to add when permitted by the input gate. These simple, local, element-wise operations enable stable long-range credit assignment: gradients can pass through time largely unimpeded along the cell state when forget gates are near one. As a result, LSTMs became the go-to model for many sequence tasks—language modeling, speech, and time series—before attention-based Transformers gained prominence.
02 Intuition & Analogies
Imagine you are taking notes during a long lecture. Your notebook (the cell state c_t) holds everything you’ve written so far. At each new point in the lecture, you decide three things: (1) what old notes to cross out (forget gate f_t), (2) what new notes to add (input gate i_t with candidate notes g_t), and (3) which notes to read aloud to the class right now (output gate o_t applied to the current notebook content). If the current topic is still relevant, you keep the old notes by setting the forget gate close to 1. If a topic has changed, you reduce that part of memory with a forget gate near 0. When you encounter new important facts, you open the input gate (close to 1) and write the candidate notes into your notebook; otherwise you keep it closed (near 0). Finally, you don’t always read your entire notebook to the class—you selectively present it by opening or closing the output gate. This selective write–erase–reveal control is like placing three adjustable valves on a pipe of information: one valve discards stale content, one injects fresh content, and one controls the display of the current content. Crucially, the notebook’s pages don’t get rewritten from scratch each time; they can persist for a long time. This persistence keeps gradients (the signals that tell the network what to change) from fading away across many time steps. When the “forget valve” remains open (values near 1), important facts can survive dozens or hundreds of time steps, allowing the model to connect distant events in a sequence.
03 Formal Definition
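The standard per-step equations, written with the same symbols used throughout this page (x_t ∈ R^{n_x} is the input, h_t, c_t ∈ R^{n_h} the hidden and cell states, σ the logistic sigmoid, and ⊙ the element-wise product), are:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate content)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden update)}
\end{aligned}
```

Here W_* ∈ R^{n_h × n_x}, U_* ∈ R^{n_h × n_h}, and b_* ∈ R^{n_h}, matching the weight shapes used in the code examples below.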
04 When to Use
Use LSTMs when you need to model sequences with potentially long-range dependencies and you prefer compact models that work well on moderate data sizes. Classic applications include language modeling and text generation, named entity recognition, speech recognition, audio tagging, sensor data forecasting, and anomaly detection in time series. They are also suitable when streaming or low-latency inference is required because they process inputs online—one step at a time—without needing to attend over the entire past context. LSTMs still excel on smaller datasets where attention models may overfit or require heavy regularization, and they are practical when memory is limited since they do not store full pairwise interactions like self-attention. Use peephole LSTMs if precise timing or duration modeling matters (e.g., certain speech tasks). On the other hand, if your task requires capturing extremely long contexts with complex interactions across many positions, or you benefit from parallel training over tokens, Transformer architectures might be preferable. If dependencies are mostly local or short, simpler RNNs or 1-D CNNs may suffice. In embedded systems, a single-layer LSTM with small hidden size can provide strong accuracy–latency trade-offs.
⚠️ Common Mistakes
- Mixing up gates or symbols: Different texts use g_t vs. \tilde{c}_t for the candidate. Always verify which symbol means what and keep consistent code naming.
- Forgetting to initialize the forget gate bias positively: Setting b_f to about 1.0 encourages remembering early on; without this, models may prematurely erase information.
- Dimension mismatches: Ensure W_* are (n_h \times n_x), U_* are (n_h \times n_h), and biases are length n_h. Test with asserts and small examples.
- Not resetting state between independent sequences: Reusing h_T and c_T from the previous batch can leak information and degrade performance unless you use stateful training by design.
- Ignoring padding masks: When batching variable-length sequences, you must mask timesteps beyond sequence length; otherwise, gradients and states will be corrupted by padding.
- Skipping gradient clipping: Although LSTMs mitigate vanishing, they can still explode. Clip by global norm to stabilize training.
- Using the wrong activation ranges: Gates must use sigmoid and the candidate uses tanh. Replacing them without care can break the gating interpretation and training stability.
- Expecting magic memory without data signals: If inputs never correlate across long spans, no amount of gating will invent long-term memory; the data and loss must reward remembering.
Key Formulas
Input Gate
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
Explanation: Computes how much of the candidate information to write into the cell. Each element is between 0 and 1 due to the sigmoid.
Forget Gate
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
Explanation: Controls how much of the previous cell state is retained. Values near 1 keep information; values near 0 erase it.
Output Gate
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
Explanation: Scales how much of the cell state is revealed in the hidden state. Acts like a soft on/off switch for exposure.
Candidate Content
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
Explanation: Proposes new content to add to the cell state, squashed to (-1, 1) to keep values bounded.
Cell Update
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
Explanation: Updates memory by combining the retained old state with newly written content, both controlled element-wise by gates.
Hidden Update
h_t = o_t ⊙ tanh(c_t)
Explanation: Computes the hidden state from the current memory, modulated by the output gate to control exposure.
Sigmoid Function
σ(x) = 1 / (1 + e^(-x))
Explanation: Maps real inputs to (0, 1). Used for gates so they can act as smooth, differentiable switches.
Tanh Function
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Explanation: Maps real inputs to (-1, 1). Keeps the candidate and readout bounded, aiding stability.
Gradient Through Cell State
∂c_T / ∂c_t = ∏_{k=t+1}^{T} diag(f_k)
Explanation: The gradient along the cell state path is the product of forget gates. When the f_k are near 1, gradients can flow across long spans.
Parameter Count (Standard LSTM)
N = 4 (n_h n_x + n_h^2 + n_h)
Explanation: There are four sets of input, recurrent, and bias parameters—one for each gate/candidate—yielding this total count.
Forward Complexity
O(T n_h (n_x + n_h))
Explanation: Each step performs matrix–vector multiplies whose cost scales with the input and hidden sizes; these costs add up linearly over T steps.
Peephole Input Gate
i_t = σ(W_i x_t + U_i h_{t-1} + p_i ⊙ c_{t-1} + b_i)
Explanation: A peephole variant lets gates directly observe the previous cell state via diagonal (per-element) weights p_i to refine timing.
Sigmoid Derivative
σ'(x) = σ(x) (1 − σ(x))
Explanation: Gradients are largest near x = 0, where σ(x) = 0.5, and vanish as σ(x) approaches 0 or 1, explaining saturation effects in gates.
Complexity Analysis
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <iomanip>
#include <cassert>
#include <string>

// Helper functions for vector/matrix operations

// Row-major matrix indexing
inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

std::vector<double> matvec(const std::vector<double>& M, size_t rows, size_t cols,
                           const std::vector<double>& v) {
    assert(cols == v.size());
    std::vector<double> out(rows, 0.0);
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (size_t c = 0; c < cols; ++c) sum += M[idx(r, c, cols)] * v[c];
        out[r] = sum;
    }
    return out;
}

void vec_add_inplace(std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

std::vector<double> vec_add(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

std::vector<double> hadamard(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] * b[i];
    return r;
}

std::vector<double> sigmoid_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = 1.0 / (1.0 + std::exp(-v[i]));
    return r;
}

std::vector<double> tanh_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = std::tanh(v[i]);
    return r;
}

void print_vec(const std::string& name, const std::vector<double>& v) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(4) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

struct StepResult {
    std::vector<double> i, f, o, g;  // gates and candidate
    std::vector<double> c, h;        // cell and hidden
};

struct LSTMCell {
    size_t n_x, n_h;
    // Weights: W_* (n_h x n_x), U_* (n_h x n_h), biases b_* (n_h)
    std::vector<double> W_i, U_i, b_i;
    std::vector<double> W_f, U_f, b_f;
    std::vector<double> W_o, U_o, b_o;
    std::vector<double> W_g, U_g, b_g;

    LSTMCell(size_t input_dim, size_t hidden_dim) : n_x(input_dim), n_h(hidden_dim) {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> dist(-0.1, 0.1);
        auto init = [&](std::vector<double>& M, size_t rows, size_t cols, double scale = 1.0) {
            M.resize(rows * cols);
            for (auto& x : M) x = dist(rng) * scale;
        };
        auto init_b = [&](std::vector<double>& b, double val = 0.0) {
            b.assign(n_h, val);
        };

        init(W_i, n_h, n_x); init(U_i, n_h, n_h); init_b(b_i, 0.0);
        init(W_f, n_h, n_x); init(U_f, n_h, n_h); init_b(b_f, 1.0);  // forget bias positive
        init(W_o, n_h, n_x); init(U_o, n_h, n_h); init_b(b_o, 0.0);
        init(W_g, n_h, n_x); init(U_g, n_h, n_h); init_b(b_g, 0.0);
    }

    StepResult forward(const std::vector<double>& x,
                       const std::vector<double>& h_prev,
                       const std::vector<double>& c_prev) const {
        assert(x.size() == n_x && h_prev.size() == n_h && c_prev.size() == n_h);
        // Gate preactivations: W_* x + U_* h_prev + b_*
        auto Wi_x = matvec(W_i, n_h, n_x, x);
        auto Ui_h = matvec(U_i, n_h, n_h, h_prev);
        auto Wf_x = matvec(W_f, n_h, n_x, x);
        auto Uf_h = matvec(U_f, n_h, n_h, h_prev);
        auto Wo_x = matvec(W_o, n_h, n_x, x);
        auto Uo_h = matvec(U_o, n_h, n_h, h_prev);
        auto Wg_x = matvec(W_g, n_h, n_x, x);
        auto Ug_h = matvec(U_g, n_h, n_h, h_prev);

        // Sum and add biases
        vec_add_inplace(Wi_x, Ui_h); vec_add_inplace(Wi_x, b_i);
        vec_add_inplace(Wf_x, Uf_h); vec_add_inplace(Wf_x, b_f);
        vec_add_inplace(Wo_x, Uo_h); vec_add_inplace(Wo_x, b_o);
        vec_add_inplace(Wg_x, Ug_h); vec_add_inplace(Wg_x, b_g);

        StepResult r;
        r.i = sigmoid_vec(Wi_x);
        r.f = sigmoid_vec(Wf_x);
        r.o = sigmoid_vec(Wo_x);
        r.g = tanh_vec(Wg_x);

        // c_t = f ⊙ c_prev + i ⊙ g
        auto f_cprev = hadamard(r.f, c_prev);
        auto i_g = hadamard(r.i, r.g);
        r.c = vec_add(f_cprev, i_g);
        // h_t = o ⊙ tanh(c_t)
        auto tanh_c = tanh_vec(r.c);
        r.h = hadamard(r.o, tanh_c);
        return r;
    }
};

int main() {
    // Dimensions
    size_t n_x = 3, n_h = 4;
    LSTMCell cell(n_x, n_h);

    // Example inputs
    std::vector<double> x = {0.5, -0.3, 0.1};
    std::vector<double> h_prev(n_h, 0.0);
    std::vector<double> c_prev(n_h, 0.0);

    StepResult r = cell.forward(x, h_prev, c_prev);

    print_vec("i_t (input gate)", r.i);
    print_vec("f_t (forget gate)", r.f);
    print_vec("o_t (output gate)", r.o);
    print_vec("g_t (candidate)", r.g);
    print_vec("c_t (cell state)", r.c);
    print_vec("h_t (hidden state)", r.h);

    return 0;
}
```
This program builds a minimal LSTM cell and runs a single forward step. It computes gate preactivations with matrix–vector products for input x_t and previous hidden state h_{t-1}, applies sigmoid/tanh, updates the cell state, and produces the hidden state. Forget gate bias is set positive (1.0) to favor remembering initially. The output prints each gate, the updated cell state c_t, and hidden state h_t.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <numeric>
#include <iomanip>
#include <cassert>

// Reuse the minimal LSTM cell from example 1, embedded here for completeness
inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

std::vector<double> matvec(const std::vector<double>& M, size_t rows, size_t cols,
                           const std::vector<double>& v) {
    std::vector<double> out(rows, 0.0);
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (size_t c = 0; c < cols; ++c) sum += M[idx(r, c, cols)] * v[c];
        out[r] = sum;
    }
    return out;
}

void vec_add_inplace(std::vector<double>& a, const std::vector<double>& b) {
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

std::vector<double> vec_add(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

std::vector<double> hadamard(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] * b[i];
    return r;
}

std::vector<double> sigmoid_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = 1.0 / (1.0 + std::exp(-v[i]));
    return r;
}

std::vector<double> tanh_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = std::tanh(v[i]);
    return r;
}

struct StepResult { std::vector<double> i, f, o, g, c, h; };

struct LSTMCell {
    size_t n_x, n_h;
    std::vector<double> W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_g, U_g, b_g;

    LSTMCell(size_t nx, size_t nh) : n_x(nx), n_h(nh) {
        std::mt19937 rng(123);
        std::uniform_real_distribution<double> d(-0.1, 0.1);
        auto init = [&](std::vector<double>& M, size_t r, size_t c) {
            M.resize(r * c);
            for (auto& x : M) x = d(rng);
        };
        auto initb = [&](std::vector<double>& b, double v) { b.assign(n_h, v); };
        init(W_i, n_h, n_x); init(U_i, n_h, n_h); initb(b_i, 0.0);
        init(W_f, n_h, n_x); init(U_f, n_h, n_h); initb(b_f, 1.0);  // forget bias positive
        init(W_o, n_h, n_x); init(U_o, n_h, n_h); initb(b_o, 0.0);
        init(W_g, n_h, n_x); init(U_g, n_h, n_h); initb(b_g, 0.0);
    }

    StepResult forward(const std::vector<double>& x,
                       const std::vector<double>& h_prev,
                       const std::vector<double>& c_prev) const {
        auto Wi = matvec(W_i, n_h, n_x, x); auto Ui = matvec(U_i, n_h, n_h, h_prev);
        vec_add_inplace(Wi, Ui); vec_add_inplace(Wi, b_i);
        auto Wf = matvec(W_f, n_h, n_x, x); auto Uf = matvec(U_f, n_h, n_h, h_prev);
        vec_add_inplace(Wf, Uf); vec_add_inplace(Wf, b_f);
        auto Wo = matvec(W_o, n_h, n_x, x); auto Uo = matvec(U_o, n_h, n_h, h_prev);
        vec_add_inplace(Wo, Uo); vec_add_inplace(Wo, b_o);
        auto Wg = matvec(W_g, n_h, n_x, x); auto Ug = matvec(U_g, n_h, n_h, h_prev);
        vec_add_inplace(Wg, Ug); vec_add_inplace(Wg, b_g);
        StepResult r;
        r.i = sigmoid_vec(Wi); r.f = sigmoid_vec(Wf); r.o = sigmoid_vec(Wo); r.g = tanh_vec(Wg);
        r.c = vec_add(hadamard(r.f, c_prev), hadamard(r.i, r.g));
        r.h = hadamard(r.o, tanh_vec(r.c));
        return r;
    }
};

int main() {
    size_t T = 6, n_x = 3, n_h = 5;
    LSTMCell cell(n_x, n_h);

    // Create a simple input sequence with a "pulse" at t=2 we hope to carry forward
    std::vector<std::vector<double>> X(T, std::vector<double>(n_x, 0.0));
    X[2][0] = 1.0;  // a spike on feature 0 at time 2

    std::vector<double> h(n_h, 0.0), c(n_h, 0.0);
    for (size_t t = 0; t < T; ++t) {
        StepResult r = cell.forward(X[t], h, c);
        h = r.h; c = r.c;
        // Inspect the averages of the forget, input, and output gates (each in 0..1)
        double f_mean = std::accumulate(r.f.begin(), r.f.end(), 0.0) / r.f.size();
        double i_mean = std::accumulate(r.i.begin(), r.i.end(), 0.0) / r.i.size();
        double o_mean = std::accumulate(r.o.begin(), r.o.end(), 0.0) / r.o.size();
        double c_norm = 0.0;
        for (double v : c) c_norm += v * v;
        c_norm = std::sqrt(c_norm);
        std::cout << "t=" << t << " f_mean=" << f_mean << " i_mean=" << i_mean
                  << " o_mean=" << o_mean << " ||c_t||2=" << c_norm << "\n";
    }
    return 0;
}
```
This program unrolls an LSTM over T=6 steps with a simple input "pulse" at t=2. It prints the mean values of the gates at each step and the L2 norm of the cell state, illustrating how forgetting and writing affect memory magnitude across time. With random weights, the behavior is not trained but still demonstrates how gates modulate state dynamics.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>

// Computes the gradient multiplier from c_0 to c_T as the product of forget gates:
// d c_T / d c_0 = prod_{t=1..T} f_t (element-wise; we show scalar intuition)

double product(const std::vector<double>& v) {
    double p = 1.0;
    for (double x : v) p *= x;
    return p;
}

std::vector<double> random_forgets(size_t T, double mean, double spread) {
    std::mt19937 rng(7);
    std::uniform_real_distribution<double> d(-spread, spread);
    std::vector<double> f(T);
    for (size_t t = 0; t < T; ++t) f[t] = std::min(0.999, std::max(0.001, mean + d(rng)));
    return f;
}

int main() {
    size_t T = 30;
    auto f_good = random_forgets(T, 0.95, 0.03);  // tends to preserve memory
    auto f_bad  = random_forgets(T, 0.6, 0.05);   // tends to forget faster

    double grad_good = product(f_good);
    double grad_bad  = product(f_bad);

    std::cout << "T=" << T << "\n";
    std::cout << "Product of forget gates (good memory): " << grad_good << "\n";
    std::cout << "Product of forget gates (poor memory): " << grad_bad << "\n";
    std::cout << "Ratio good/bad: " << (grad_good / (grad_bad + 1e-12)) << "\n";

    // Show log-scale for interpretability
    std::cout << "log grad good: " << std::log(std::max(grad_good, 1e-300)) << "\n";
    std::cout << "log grad bad : " << std::log(std::max(grad_bad, 1e-300)) << "\n";
    return 0;
}
```
This program illustrates why LSTMs reduce vanishing gradients: the gradient flowing through the cell state is the product of forget gates. When forget gates are close to 1.0, the product remains sizable over many steps; when they are closer to 0.6, the product decays rapidly. The output compares these products on a log scale, highlighting the dramatic difference in long-term credit assignment.