LSTM & Gating Mechanisms
Key Points
- Long Short-Term Memory (LSTM) networks use gates (forget, input, and output) to control what information to erase, write, and reveal at each time step.
- The cell state acts like a conveyor belt of memory; gates scale it element-wise to preserve or modify information over long sequences.
- LSTMs mitigate vanishing gradients by providing a nearly linear path for gradients through the cell state when forget gates are near one.
- Each gate is computed with a sigmoid activation, while the candidate content uses tanh; the hidden state combines the output gate and tanh of the cell state.
- Forward computation per time step costs O(n_h (n_x + n_h)), where n_x is the input size and n_h is the hidden size; full sequence cost scales linearly with sequence length T.
- Initialization matters: setting the forget gate bias positive (e.g., 1.0) encourages memory retention early in training.
- Variants like peephole LSTMs add direct connections from the cell state to gates for finer timing control.
- In C++, an LSTM cell can be implemented with straightforward matrix–vector operations and element-wise nonlinearities.
Prerequisites
- Vectors and Matrices — LSTM equations are expressed with matrix–vector multiplications and element-wise operations.
- Activation Functions (sigmoid, tanh) — Gates require sigmoid and candidate/readout use tanh; understanding their ranges and derivatives is essential.
- Chain Rule and Backpropagation — Training LSTMs uses backpropagation through time, which relies on the chain rule.
- Basic RNNs — LSTMs extend RNNs with gates and a separate cell state; knowing RNNs clarifies the motivation.
- Time Series/Sequence Data — Understanding sequential dependencies motivates why memory and gating are needed.
- Gradient Clipping & Optimization — Stabilizes training of recurrent models and prevents exploding gradients.
- Initialization Strategies — Forget gate bias and weight scaling affect memory and convergence.
- Numerical Stability — Sigmoid/tanh saturation and floating-point limits affect gradients and activations.
Detailed Explanation
01 Overview
An LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture designed to learn dependencies over long sequences. Unlike a plain RNN that updates its hidden state with a single transformation, an LSTM maintains two internal vectors: a hidden state h_t and a cell state c_t. The cell state is a dedicated memory lane that can carry information forward with minimal modification, while three gates—forget f_t, input i_t, and output o_t—control how much information flows in and out at each time step. Each gate is a vector of values between 0 and 1 computed with a sigmoid function applied to an affine transformation of the current input and previous hidden state. The input gate decides what new information to write, the forget gate decides what to erase, and the output gate decides what part of the internal memory to expose as the hidden state. A candidate content vector g_t (often called \tilde{c}_t) produced by tanh suggests new information to add when permitted by the input gate. These simple, local, element-wise operations enable stable long-range credit assignment: gradients can pass through time largely unimpeded along the cell state when forget gates are near one. As a result, LSTMs became the go-to model for many sequence tasks—language modeling, speech, and time series—before attention-based Transformers gained prominence.
02 Intuition & Analogies
Imagine you are taking notes during a long lecture. Your notebook (the cell state c_t) holds everything you’ve written so far. At each new point in the lecture, you decide three things: (1) what old notes to cross out (forget gate f_t), (2) what new notes to add (input gate i_t with candidate notes g_t), and (3) which notes to read aloud to the class right now (output gate o_t applied to the current notebook content). If the current topic is still relevant, you keep the old notes by setting the forget gate close to 1. If a topic has changed, you reduce that part of memory with a forget gate near 0. When you encounter new important facts, you open the input gate (close to 1) and write the candidate notes into your notebook; otherwise you keep it closed (near 0). Finally, you don’t always read your entire notebook to the class—you selectively present it by opening or closing the output gate. This selective write–erase–reveal control is like placing three adjustable valves on a pipe of information: one valve discards stale content, one injects fresh content, and one controls the display of the current content. Crucially, the notebook’s pages don’t get rewritten from scratch each time; they can persist for a long time. This persistence keeps gradients (the signals that tell the network what to change) from fading away across many time steps. When the “forget valve” remains open (values near 1), important facts can survive dozens or hundreds of time steps, allowing the model to connect distant events in a sequence.
03 Formal Definition
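The standard per-step equations, written with the same symbols used throughout this page (x_t ∈ R^{n_x} is the input, h_t, c_t ∈ R^{n_h} the hidden and cell states, σ the logistic sigmoid, and ⊙ the element-wise product), are:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate content)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden update)}
\end{aligned}
```

Here W_* ∈ R^{n_h × n_x}, U_* ∈ R^{n_h × n_h}, and b_* ∈ R^{n_h}, matching the weight shapes used in the code examples below.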
04 When to Use
Use LSTMs when you need to model sequences with potentially long-range dependencies and you prefer compact models that work well on moderate data sizes. Classic applications include language modeling and text generation, named entity recognition, speech recognition, audio tagging, sensor data forecasting, and anomaly detection in time series. They are also suitable when streaming or low-latency inference is required because they process inputs online—one step at a time—without needing to attend over the entire past context. LSTMs still excel on smaller datasets where attention models may overfit or require heavy regularization, and they are practical when memory is limited since they do not store full pairwise interactions like self-attention. Use peephole LSTMs if precise timing or duration modeling matters (e.g., certain speech tasks). On the other hand, if your task requires capturing extremely long contexts with complex interactions across many positions, or you benefit from parallel training over tokens, Transformer architectures might be preferable. If dependencies are mostly local or short, simpler RNNs or 1-D CNNs may suffice. In embedded systems, a single-layer LSTM with small hidden size can provide strong accuracy–latency trade-offs.
⚠️ Common Mistakes
- Mixing up gates or symbols: Different texts use g_t vs. \tilde{c}_t for the candidate. Always verify which symbol means what and keep consistent code naming.
- Forgetting to initialize the forget gate bias positively: Setting b_f to about 1.0 encourages remembering early on; without this, models may prematurely erase information.
- Dimension mismatches: Ensure W_* are (n_h \times n_x), U_* are (n_h \times n_h), and biases are length n_h. Test with asserts and small examples.
- Not resetting state between independent sequences: Reusing h_T and c_T from the previous batch can leak information and degrade performance unless you use stateful training by design.
- Ignoring padding masks: When batching variable-length sequences, you must mask timesteps beyond sequence length; otherwise, gradients and states will be corrupted by padding.
- Skipping gradient clipping: Although LSTMs mitigate vanishing, they can still explode. Clip by global norm to stabilize training.
- Using the wrong activation ranges: Gates must use sigmoid and the candidate uses tanh. Replacing them without care can break the gating interpretation and training stability.
- Expecting magic memory without data signals: If inputs never correlate across long spans, no amount of gating will invent long-term memory; the data and loss must reward remembering.
Key Formulas
Input Gate
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
Explanation: Computes how much of the candidate information to write into the cell. Each element is between 0 and 1 due to the sigmoid.
Forget Gate
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
Explanation: Controls how much of the previous cell state is retained. Values near 1 keep information; values near 0 erase it.
Output Gate
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
Explanation: Scales how much of the cell state is revealed in the hidden state. Acts like a soft on/off switch for exposure.
Candidate Content
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
Explanation: Proposes new content to add to the cell state, squashed to (-1, 1) to keep values bounded.
Cell Update
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
Explanation: Updates memory by combining the retained old state with newly written content, both controlled element-wise by gates.
Hidden Update
h_t = o_t ⊙ tanh(c_t)
Explanation: Computes the hidden state from the current memory, modulated by the output gate to control exposure.
Sigmoid Function
σ(x) = 1 / (1 + e^(-x))
Explanation: Maps real inputs to (0, 1). Used for gates so they can act as smooth, differentiable switches.
Tanh Function
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Explanation: Maps real inputs to (-1, 1). Keeps the candidate and readout bounded, aiding stability.
Gradient Through Cell State
∂c_T / ∂c_t = ∏_{k=t+1}^{T} diag(f_k)
Explanation: The gradient along the cell state path is the product of forget gates. When the f_k are near 1, gradients can flow across long spans.
Parameter Count (Standard LSTM)
N = 4 (n_h n_x + n_h^2 + n_h)
Explanation: There are four sets of input, recurrent, and bias parameters—one for each gate/candidate—yielding this total count.
Forward Complexity
O(T n_h (n_x + n_h))
Explanation: Each step performs matrix–vector multiplies whose cost scales with the input and hidden sizes; these costs add up linearly over T steps.
Peephole Input Gate
i_t = σ(W_i x_t + U_i h_{t-1} + p_i ⊙ c_{t-1} + b_i)
Explanation: A peephole variant lets gates directly observe the previous cell state via diagonal (per-element) weights p_i to refine timing.
Sigmoid Derivative
σ'(x) = σ(x) (1 − σ(x))
Explanation: Gradients are largest near x = 0, where σ(x) = 0.5, and vanish as σ(x) approaches 0 or 1, explaining saturation effects in gates.
Complexity Analysis
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <iomanip>
#include <cassert>
#include <string>

// Helper functions for vector/matrix operations

// Row-major matrix indexing
inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

std::vector<double> matvec(const std::vector<double>& M, size_t rows, size_t cols,
                           const std::vector<double>& v) {
    assert(cols == v.size());
    std::vector<double> out(rows, 0.0);
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (size_t c = 0; c < cols; ++c) sum += M[idx(r, c, cols)] * v[c];
        out[r] = sum;
    }
    return out;
}

void vec_add_inplace(std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

std::vector<double> vec_add(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

std::vector<double> hadamard(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] * b[i];
    return r;
}

std::vector<double> sigmoid_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = 1.0 / (1.0 + std::exp(-v[i]));
    return r;
}

std::vector<double> tanh_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = std::tanh(v[i]);
    return r;
}

void print_vec(const std::string& name, const std::vector<double>& v) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(4) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

struct StepResult {
    std::vector<double> i, f, o, g;  // gates and candidate
    std::vector<double> c, h;        // cell and hidden
};

struct LSTMCell {
    size_t n_x, n_h;
    // Weights: W_* (n_h x n_x), U_* (n_h x n_h), biases b_* (n_h)
    std::vector<double> W_i, U_i, b_i;
    std::vector<double> W_f, U_f, b_f;
    std::vector<double> W_o, U_o, b_o;
    std::vector<double> W_g, U_g, b_g;

    LSTMCell(size_t input_dim, size_t hidden_dim) : n_x(input_dim), n_h(hidden_dim) {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> dist(-0.1, 0.1);
        auto init = [&](std::vector<double>& M, size_t rows, size_t cols, double scale = 1.0) {
            M.resize(rows * cols);
            for (auto& x : M) x = dist(rng) * scale;
        };
        auto init_b = [&](std::vector<double>& b, double val = 0.0) {
            b.assign(n_h, val);
        };

        init(W_i, n_h, n_x); init(U_i, n_h, n_h); init_b(b_i, 0.0);
        init(W_f, n_h, n_x); init(U_f, n_h, n_h); init_b(b_f, 1.0);  // forget bias positive
        init(W_o, n_h, n_x); init(U_o, n_h, n_h); init_b(b_o, 0.0);
        init(W_g, n_h, n_x); init(U_g, n_h, n_h); init_b(b_g, 0.0);
    }

    StepResult forward(const std::vector<double>& x,
                       const std::vector<double>& h_prev,
                       const std::vector<double>& c_prev) const {
        assert(x.size() == n_x && h_prev.size() == n_h && c_prev.size() == n_h);
        // Gate preactivations: W_* x + U_* h_prev + b_*
        auto Wi_x = matvec(W_i, n_h, n_x, x);
        auto Ui_h = matvec(U_i, n_h, n_h, h_prev);
        auto Wf_x = matvec(W_f, n_h, n_x, x);
        auto Uf_h = matvec(U_f, n_h, n_h, h_prev);
        auto Wo_x = matvec(W_o, n_h, n_x, x);
        auto Uo_h = matvec(U_o, n_h, n_h, h_prev);
        auto Wg_x = matvec(W_g, n_h, n_x, x);
        auto Ug_h = matvec(U_g, n_h, n_h, h_prev);

        // Sum and add biases
        vec_add_inplace(Wi_x, Ui_h); vec_add_inplace(Wi_x, b_i);
        vec_add_inplace(Wf_x, Uf_h); vec_add_inplace(Wf_x, b_f);
        vec_add_inplace(Wo_x, Uo_h); vec_add_inplace(Wo_x, b_o);
        vec_add_inplace(Wg_x, Ug_h); vec_add_inplace(Wg_x, b_g);

        StepResult r;
        r.i = sigmoid_vec(Wi_x);
        r.f = sigmoid_vec(Wf_x);
        r.o = sigmoid_vec(Wo_x);
        r.g = tanh_vec(Wg_x);

        // c_t = f ⊙ c_prev + i ⊙ g
        auto f_cprev = hadamard(r.f, c_prev);
        auto i_g = hadamard(r.i, r.g);
        r.c = vec_add(f_cprev, i_g);
        // h_t = o ⊙ tanh(c_t)
        auto tanh_c = tanh_vec(r.c);
        r.h = hadamard(r.o, tanh_c);
        return r;
    }
};

int main() {
    // Dimensions
    size_t n_x = 3, n_h = 4;
    LSTMCell cell(n_x, n_h);

    // Example inputs
    std::vector<double> x = {0.5, -0.3, 0.1};
    std::vector<double> h_prev(n_h, 0.0);
    std::vector<double> c_prev(n_h, 0.0);

    StepResult r = cell.forward(x, h_prev, c_prev);

    print_vec("i_t (input gate)", r.i);
    print_vec("f_t (forget gate)", r.f);
    print_vec("o_t (output gate)", r.o);
    print_vec("g_t (candidate)", r.g);
    print_vec("c_t (cell state)", r.c);
    print_vec("h_t (hidden state)", r.h);

    return 0;
}
```
This program builds a minimal LSTM cell and runs a single forward step. It computes gate preactivations with matrix–vector products for input x_t and previous hidden state h_{t-1}, applies sigmoid/tanh, updates the cell state, and produces the hidden state. Forget gate bias is set positive (1.0) to favor remembering initially. The output prints each gate, the updated cell state c_t, and hidden state h_t.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <numeric>
#include <iomanip>
#include <cassert>

// Reuse the minimal LSTM cell from example 1, embedded here for completeness
inline size_t idx(size_t r, size_t c, size_t ncols) { return r * ncols + c; }

std::vector<double> matvec(const std::vector<double>& M, size_t rows, size_t cols,
                           const std::vector<double>& v) {
    std::vector<double> out(rows, 0.0);
    for (size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (size_t c = 0; c < cols; ++c) sum += M[idx(r, c, cols)] * v[c];
        out[r] = sum;
    }
    return out;
}

void vec_add_inplace(std::vector<double>& a, const std::vector<double>& b) {
    for (size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

std::vector<double> vec_add(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}

std::vector<double> hadamard(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) r[i] = a[i] * b[i];
    return r;
}

std::vector<double> sigmoid_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = 1.0 / (1.0 + std::exp(-v[i]));
    return r;
}

std::vector<double> tanh_vec(const std::vector<double>& v) {
    std::vector<double> r(v.size());
    for (size_t i = 0; i < v.size(); ++i) r[i] = std::tanh(v[i]);
    return r;
}

struct StepResult { std::vector<double> i, f, o, g, c, h; };

struct LSTMCell {
    size_t n_x, n_h;
    std::vector<double> W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_g, U_g, b_g;

    LSTMCell(size_t nx, size_t nh) : n_x(nx), n_h(nh) {
        std::mt19937 rng(123);
        std::uniform_real_distribution<double> d(-0.1, 0.1);
        auto init = [&](std::vector<double>& M, size_t r, size_t c) {
            M.resize(r * c);
            for (auto& x : M) x = d(rng);
        };
        auto initb = [&](std::vector<double>& b, double v) { b.assign(n_h, v); };
        init(W_i, n_h, n_x); init(U_i, n_h, n_h); initb(b_i, 0.0);
        init(W_f, n_h, n_x); init(U_f, n_h, n_h); initb(b_f, 1.0);  // forget bias positive
        init(W_o, n_h, n_x); init(U_o, n_h, n_h); initb(b_o, 0.0);
        init(W_g, n_h, n_x); init(U_g, n_h, n_h); initb(b_g, 0.0);
    }

    StepResult forward(const std::vector<double>& x,
                       const std::vector<double>& h_prev,
                       const std::vector<double>& c_prev) const {
        auto Wi = matvec(W_i, n_h, n_x, x); auto Ui = matvec(U_i, n_h, n_h, h_prev);
        vec_add_inplace(Wi, Ui); vec_add_inplace(Wi, b_i);
        auto Wf = matvec(W_f, n_h, n_x, x); auto Uf = matvec(U_f, n_h, n_h, h_prev);
        vec_add_inplace(Wf, Uf); vec_add_inplace(Wf, b_f);
        auto Wo = matvec(W_o, n_h, n_x, x); auto Uo = matvec(U_o, n_h, n_h, h_prev);
        vec_add_inplace(Wo, Uo); vec_add_inplace(Wo, b_o);
        auto Wg = matvec(W_g, n_h, n_x, x); auto Ug = matvec(U_g, n_h, n_h, h_prev);
        vec_add_inplace(Wg, Ug); vec_add_inplace(Wg, b_g);
        StepResult r;
        r.i = sigmoid_vec(Wi); r.f = sigmoid_vec(Wf); r.o = sigmoid_vec(Wo); r.g = tanh_vec(Wg);
        r.c = vec_add(hadamard(r.f, c_prev), hadamard(r.i, r.g));
        r.h = hadamard(r.o, tanh_vec(r.c));
        return r;
    }
};

int main() {
    size_t T = 6, n_x = 3, n_h = 5;
    LSTMCell cell(n_x, n_h);

    // Create a simple input sequence with a "pulse" at t=2 we hope to carry forward
    std::vector<std::vector<double>> X(T, std::vector<double>(n_x, 0.0));
    X[2][0] = 1.0;  // a spike on feature 0 at time 2

    std::vector<double> h(n_h, 0.0), c(n_h, 0.0);
    for (size_t t = 0; t < T; ++t) {
        StepResult r = cell.forward(X[t], h, c);
        h = r.h; c = r.c;
        // Inspect the averages of the forget, input, and output gates (each in 0..1)
        double f_mean = std::accumulate(r.f.begin(), r.f.end(), 0.0) / r.f.size();
        double i_mean = std::accumulate(r.i.begin(), r.i.end(), 0.0) / r.i.size();
        double o_mean = std::accumulate(r.o.begin(), r.o.end(), 0.0) / r.o.size();
        double c_norm = 0.0;
        for (double v : c) c_norm += v * v;
        c_norm = std::sqrt(c_norm);
        std::cout << "t=" << t << " f_mean=" << f_mean << " i_mean=" << i_mean
                  << " o_mean=" << o_mean << " ||c_t||2=" << c_norm << "\n";
    }
    return 0;
}
```
This program unrolls an LSTM over T=6 steps with a simple input "pulse" at t=2. It prints the mean values of the gates at each step and the L2 norm of the cell state, illustrating how forgetting and writing affect memory magnitude across time. With random weights, the behavior is not trained but still demonstrates how gates modulate state dynamics.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>

// Computes the gradient multiplier from c_0 to c_T as the product of forget gates:
// d c_T / d c_0 = prod_{t=1..T} f_t (element-wise; we show scalar intuition)

double product(const std::vector<double>& v) {
    double p = 1.0;
    for (double x : v) p *= x;
    return p;
}

std::vector<double> random_forgets(size_t T, double mean, double spread) {
    std::mt19937 rng(7);
    std::uniform_real_distribution<double> d(-spread, spread);
    std::vector<double> f(T);
    for (size_t t = 0; t < T; ++t) f[t] = std::min(0.999, std::max(0.001, mean + d(rng)));
    return f;
}

int main() {
    size_t T = 30;
    auto f_good = random_forgets(T, 0.95, 0.03);  // tends to preserve memory
    auto f_bad  = random_forgets(T, 0.6, 0.05);   // tends to forget faster

    double grad_good = product(f_good);
    double grad_bad  = product(f_bad);

    std::cout << "T=" << T << "\n";
    std::cout << "Product of forget gates (good memory): " << grad_good << "\n";
    std::cout << "Product of forget gates (poor memory): " << grad_bad << "\n";
    std::cout << "Ratio good/bad: " << (grad_good / (grad_bad + 1e-12)) << "\n";

    // Show log-scale for interpretability
    std::cout << "log grad good: " << std::log(std::max(grad_good, 1e-300)) << "\n";
    std::cout << "log grad bad : " << std::log(std::max(grad_bad, 1e-300)) << "\n";
    return 0;
}
```
This program illustrates why LSTMs reduce vanishing gradients: the gradient flowing through the cell state is the product of forget gates. When forget gates are close to 1.0, the product remains sizable over many steps; when they are closer to 0.6, the product decays rapidly. The output compares these products on a log scale, highlighting the dramatic difference in long-term credit assignment.