Hidden Markov Models
Key Points
- A Hidden Markov Model (HMM) describes sequences where you cannot see the true state directly, but you can observe outputs generated by those hidden states.
- An HMM is defined by three parts: initial state distribution \(\pi\), transition matrix \(A\), and emission matrix \(B\), often written as \(\lambda = (\pi, A, B)\).
- The forward algorithm computes the probability of an observation sequence efficiently in \(O(TN^2)\) time, where \(T\) is the sequence length and \(N\) is the number of states.
- The Viterbi algorithm finds the single most likely hidden state path that produced the observations, also in \(O(TN^2)\) time.
- The Baum–Welch algorithm (EM) learns HMM parameters from unlabeled sequences by alternating expectation and maximization steps.
- Numerical underflow is a common issue; use scaling or log-space arithmetic to keep probabilities stable.
- HMMs work well when the Markov and conditional independence assumptions roughly hold, such as in speech, bioinformatics, and POS tagging.
- Compared to modern neural models, HMMs are interpretable, train fast on small data, and provide clear probabilistic semantics.
Prerequisites
- Basic probability (Bayes rule, independence, conditional probability) — HMMs are conditional probability models over sequences and rely on factoring using independence assumptions.
- Markov chains — HMM hidden states evolve as a Markov chain; understanding transitions and stationary distributions helps.
- Dynamic programming — Forward, backward, and Viterbi are classic DP algorithms over time and states.
- Linear algebra (vectors, matrices) — Parameters π, A, and B are vectors/matrices; operations like normalization and matrix products are common.
- Numerical stability and floating-point — Scaling and log-space tricks are essential to avoid underflow in HMM computations.
Detailed Explanation
01 Overview
A Hidden Markov Model (HMM) is a probabilistic model for time series and sequences where the system evolves through hidden (unobserved) states that emit observable symbols. You never see the state directly; you only see its effects—the observations. The model assumes the hidden state follows a Markov chain: the next state depends only on the current state, not the entire history. Each state then produces an observation according to a fixed probability distribution. An HMM is fully specified by three components: the initial state distribution \(\pi\), the state transition probabilities \(A\), and the emission probabilities \(B\). With these, you can answer three classic questions efficiently using dynamic programming: evaluation (what’s the probability of an observation sequence?), decoding (what’s the most likely hidden state path?), and learning (how do we fit \((\pi, A, B)\) to data?). HMMs are compact, interpretable, and effective in many applications such as speech recognition, part-of-speech tagging, bioinformatics (gene finding, protein domains), user behavior modeling, and activity recognition. They offer a principled balance between expressive power and computational tractability, especially for small to medium state spaces and discrete observations (with extensions for continuous emissions).
02 Intuition & Analogies
Imagine you’re standing outside a windowless room trying to guess the weather inside a virtual world. You can’t see the sky (hidden state), but you do see people’s clothing when they exit (observations). If many people carry umbrellas, you guess it’s rainy; if they wear sunglasses, you guess it’s sunny. Also, weather tends to persist—rainy days are more likely to be followed by rainy days than by sunny ones. That persistence is the Markov chain over hidden weather states, and the clothing distributions are the emission probabilities. Similarly, think of listening to someone speak through a wall. You can’t see their mouth shapes (hidden articulatory states), but you hear sounds (observations). Each articulatory configuration is likely to produce certain sounds with certain probabilities, and the speaker’s mouth positions transition smoothly from one to another. The HMM formalizes exactly this setup: a chain of hidden causes producing noisy effects over time. The power of HMMs comes from two ideas. First, the Markov property drastically simplifies temporal dependence: we only need the current state to predict the next. Second, conditional independence says the observation at time t depends only on the current state, not past observations, once we know that state. With those assumptions, we can process sequences with dynamic programming—folding exponentially many paths into a compact computation. The forward algorithm sums over all possible hidden paths efficiently; Viterbi keeps only the single best path at each time. Together they let us evaluate how likely data is under the model and explain the data with a most plausible hidden story.
03 Formal Definition
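Formally, a discrete HMM can be specified as follows, using the standard notation that matches the symbols in the Key Formulas section:

```latex
\textbf{Definition.} A discrete HMM is a tuple $\lambda = (\pi, A, B)$ over
hidden states $S = \{s_1, \dots, s_N\}$ and observation symbols
$V = \{v_1, \dots, v_M\}$, where
\begin{align*}
\pi_i  &= P(q_1 = s_i), & 1 \le i \le N,\\
a_{ij} &= P(q_{t+1} = s_j \mid q_t = s_i), & 1 \le i, j \le N,\\
b_j(k) &= P(o_t = v_k \mid q_t = s_j), & 1 \le j \le N,\ 1 \le k \le M,
\end{align*}
subject to $\sum_i \pi_i = 1$, $\sum_j a_{ij} = 1$ for every $i$, and
$\sum_k b_j(k) = 1$ for every $j$. The model assumes the Markov property,
$P(q_{t+1} \mid q_1, \dots, q_t) = P(q_{t+1} \mid q_t)$, and conditional
independence of observations, $P(o_t \mid q_1, \dots, q_T, \text{other } o) = P(o_t \mid q_t)$.
```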
04 When to Use
Use HMMs when you have sequential or time-series data with an underlying process that is simpler to describe via a small number of discrete hidden regimes. Typical use cases include: speech recognition (phoneme states emitting acoustic features), part-of-speech tagging (hidden tags emitting words), bioinformatics (hidden genomic regions emitting nucleotides), and user behavior modeling (hidden intents emitting actions). HMMs are ideal when the Markov assumption (next state depends only on current state) and conditional independence (observation depends only on current state) are approximately valid. They shine when data is scarce and interpretability matters: parameters \((\pi, A, B)\) have direct probabilistic meaning and can be inspected. Choose HMMs for fast, stable inference with dynamic programming complexity \(O(TN^2)\), where you can trade off N (states) against accuracy and speed. If you need continuous observations, use Gaussian emissions or mixtures (GMM-HMM). If you require long-range dependencies, rich features, or discriminative training, consider CRFs or neural sequence models; however, HMMs remain competitive baselines and strong tools for small to mid-sized problems and for didactic and diagnostic purposes.
⚠️ Common Mistakes
- Ignoring numerical stability: multiplying many probabilities underflows to zero in floating point. Always use scaling factors at each time step or run algorithms in log-space with log-sum-exp.
- Confusing evaluation and decoding: forward sums over all paths to get sequence likelihood; Viterbi maximizes over paths to find a single best explanation. They answer different questions and yield different numbers.
- Forgetting to normalize rows: transition matrix A rows and discrete emission matrix B rows must sum to 1. After any update or random initialization, renormalize.
- Zero probabilities locking the model: if \(a_{ij}=0\) or \(b_j(o)=0\), paths become impossible. Use smoothing (e.g., add-\(\epsilon\)) to avoid dead states during training.
- Off-by-one and indexing errors: states often indexed 0..N-1 and observations 0..M-1 in code. Keep a consistent convention and map symbols to indices carefully.
- Overfitting with too many states: likelihood on training data increases with more states, but generalization may worsen. Use validation, information criteria, or domain constraints.
- Misinterpreting posteriors: \(\gamma_t(i)\) is a marginal over states at time t, not a path probability. \(\xi_t(i,j)\) is a pairwise marginal between consecutive times.
- Expecting global optimum from EM: Baum–Welch finds a local optimum sensitive to initialization. Use multiple random starts or informed initializations.
Key Formulas
Joint Path Probability
\( P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \)
Explanation: This is the probability of a specific hidden path Q producing observations O. It multiplies the start, transitions, and emissions along the path.
Sequence Likelihood
\( P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) \)
Explanation: The likelihood of the observations is obtained by summing over all hidden paths. The forward algorithm computes this sum efficiently without enumerating paths.
Forward Recursion
\( \alpha_1(j) = \pi_j\, b_j(o_1), \qquad \alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \Big] b_j(o_t), \qquad P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j) \)
Explanation: Forward variables accumulate the total probability of all paths ending in state j at time t that generate the prefix observations. Summing the last column gives the sequence likelihood.
Backward Recursion
\( \beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \)
Explanation: Backward variables represent the probability of the observation suffix given the current state. They are crucial for computing posteriors used in EM.
Viterbi Recursion
\( \delta_1(j) = \pi_j\, b_j(o_1), \qquad \delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(o_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big] \)
Explanation: The Viterbi dynamic program keeps only the single best path to each state at each time. Backpointers recover the most likely state sequence.
State Posterior
\( \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} \)
Explanation: Gamma is the posterior probability of being in a specific state at time t. It uses the product of forward and backward probabilities normalized by the sequence likelihood.
Pairwise Posterior
\( \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \)
Explanation: Xi is the posterior over consecutive state pairs. It is central to the expected count of transitions from i to j used in the M-step.
Baum–Welch Updates (Discrete)
\( \hat\pi_i = \gamma_1(i), \qquad \hat a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \hat b_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \)
Explanation: The EM update sets parameters to normalized expected counts under the current model. Emission updates for symbol k sum posteriors at times where k was observed.
Scaling for Stability
\( c_t = \sum_{j=1}^{N} \tilde\alpha_t(j), \qquad \hat\alpha_t(j) = \frac{\tilde\alpha_t(j)}{c_t}, \qquad \log P(O \mid \lambda) = \sum_{t=1}^{T} \log c_t \)
Explanation: Scaling prevents underflow by normalizing the forward values at each time step; here \(\tilde\alpha_t\) denotes the unscaled values computed from the already-scaled column at \(t-1\). Because the per-step normalizers \(c_t\) multiply together to give \(P(O \mid \lambda)\), the log-likelihood is recovered by accumulating \(\log c_t\).
Log-Sum-Exp Trick
\( \log \sum_{i} e^{x_i} = m + \log \sum_{i} e^{x_i - m}, \qquad m = \max_i x_i \)
Explanation: This identity computes logs of sums stably by factoring out the maximum exponent. It is essential when doing HMM computations in log-space.
Complexity Analysis
For \(N\) hidden states and a sequence of length \(T\): the forward, backward, and Viterbi algorithms each run in \(O(TN^2)\) time. Storing the full \(\alpha\), \(\beta\), or \(\delta\) tables takes \(O(TN)\) space; the forward likelihood alone needs only \(O(N)\), since each step depends only on the previous column. One Baum–Welch iteration also costs \(O(TN^2)\) time, dominated by the \(\xi\) computation, and the discrete emission matrix \(B\) takes \(O(NM)\) space for \(M\) observation symbols.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N;  // number of states
    int M;  // number of observation symbols
    vector<double> pi;         // size N
    vector<vector<double>> A;  // NxN
    vector<vector<double>> B;  // NxM

    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}

    // Forward algorithm with scaling. Returns log-likelihood.
    double forward_scaled(const vector<int>& O, vector<vector<double>>& alpha_scaled, vector<double>& scales) const {
        int T = (int)O.size();
        alpha_scaled.assign(T, vector<double>(N, 0.0));
        scales.assign(T, 0.0);

        // t = 0 initialization
        double c0 = 0.0;
        for (int j = 0; j < N; ++j) {
            alpha_scaled[0][j] = pi[j] * B[j][O[0]];
            c0 += alpha_scaled[0][j];
        }
        // scale
        if (c0 == 0.0) c0 = 1e-300;  // avoid divide-by-zero
        for (int j = 0; j < N; ++j) alpha_scaled[0][j] /= c0;
        scales[0] = c0;

        // induction
        for (int t = 1; t < T; ++t) {
            double ct = 0.0;
            for (int j = 0; j < N; ++j) {
                double sum_prev = 0.0;
                for (int i = 0; i < N; ++i) sum_prev += alpha_scaled[t-1][i] * A[i][j];
                alpha_scaled[t][j] = sum_prev * B[j][O[t]];
                ct += alpha_scaled[t][j];
            }
            if (ct == 0.0) ct = 1e-300;
            for (int j = 0; j < N; ++j) alpha_scaled[t][j] /= ct;
            scales[t] = ct;
        }
        // Each c_t is the per-step normalizer (the sum before scaling),
        // so log-likelihood = sum_t log(c_t)
        double loglik = 0.0;
        for (int t = 0; t < T; ++t) loglik += log(scales[t]);
        return loglik;
    }
};

int main() {
    // Example HMM with 2 states and 3 observation symbols {0,1,2}
    HMM hmm(2, 3);
    hmm.pi = {0.6, 0.4};
    hmm.A = {{0.7, 0.3},
             {0.4, 0.6}};
    hmm.B = {{0.5, 0.4, 0.1},   // state 0 emits 0,1,2
             {0.1, 0.3, 0.6}};  // state 1 emits 0,1,2

    // Observation sequence O = [0, 1, 2, 1]
    vector<int> O = {0, 1, 2, 1};

    vector<vector<double>> alpha_scaled;
    vector<double> scales;
    double loglik = hmm.forward_scaled(O, alpha_scaled, scales);

    cout.setf(ios::fixed);
    cout << setprecision(6);
    cout << "Log-likelihood log P(O|λ): " << loglik << "\n";
    cout << "P(O|λ) = exp(log-likelihood) (safe to exponentiate for short sequences): " << exp(loglik) << "\n";

    // Print final scaled alphas at T-1 (they sum to 1 by construction)
    cout << "Final state distribution (scaled α_T): ";
    for (double v : alpha_scaled.back()) cout << v << ' ';
    cout << "\n";
    return 0;
}
This program builds a small discrete HMM and runs the forward algorithm with per-time-step scaling. It returns the log-likelihood by accumulating the logs of the per-step normalizers, whose product equals the sequence probability. The scaled forward values avoid numerical underflow and represent the filtered state distribution at each time up to normalization.
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N, M;
    vector<double> pi;            // size N
    vector<vector<double>> A, B;  // A: NxN, B: NxM
    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}

    vector<int> viterbi(const vector<int>& O, double& best_log_prob) const {
        int T = (int)O.size();
        const double NEG_INF = -1e300;  // stand-in for -inf

        // Work in log-space for stability: log(a*b) = log a + log b;
        // max in log-space is ordinary max
        auto l = [](double x){ return (x <= 0.0 ? -1e300 : log(x)); };

        vector<vector<double>> delta(T, vector<double>(N, NEG_INF));
        vector<vector<int>> psi(T, vector<int>(N, -1));

        // Initialization
        for (int j = 0; j < N; ++j) {
            delta[0][j] = l(pi[j]) + l(B[j][O[0]]);
            psi[0][j] = -1;
        }

        // Recurrence
        for (int t = 1; t < T; ++t) {
            for (int j = 0; j < N; ++j) {
                double best = NEG_INF; int arg = -1;
                for (int i = 0; i < N; ++i) {
                    double cand = delta[t-1][i] + l(A[i][j]);
                    if (cand > best) { best = cand; arg = i; }
                }
                delta[t][j] = best + l(B[j][O[t]]);
                psi[t][j] = arg;
            }
        }

        // Termination
        best_log_prob = NEG_INF; int last_state = -1;
        for (int j = 0; j < N; ++j)
            if (delta[T-1][j] > best_log_prob) { best_log_prob = delta[T-1][j]; last_state = j; }

        // Path backtracking
        vector<int> path(T);
        path[T-1] = last_state;
        for (int t = T-2; t >= 0; --t) path[t] = psi[t+1][path[t+1]];
        return path;
    }
};

int main() {
    // Same toy HMM as in the forward example
    HMM hmm(2, 3);
    hmm.pi = {0.6, 0.4};
    hmm.A = {{0.7, 0.3}, {0.4, 0.6}};
    hmm.B = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};

    vector<int> O = {0, 1, 2, 1};
    double best_log_prob;
    vector<int> path = hmm.viterbi(O, best_log_prob);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Best log-probability (Viterbi): " << best_log_prob << "\n";
    cout << "Most likely hidden path: ";
    for (int s : path) cout << s << ' ';
    cout << "\n";
    return 0;
}
This code performs Viterbi decoding in log-space to find the most probable hidden state sequence for the given observations. It stores the best score to each state at each time (delta) and backpointers (psi) to reconstruct the optimal path.
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N, M;
    vector<double> pi;            // size N
    vector<vector<double>> A, B;  // A: NxN, B: NxM
    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}
};

struct FBOut {
    vector<vector<double>> alpha, beta;  // scaled alpha and beta
    vector<double> c;                    // scaling factors (pre-scaling sums)
    double loglik;
};

FBOut forward_backward_scaled(const HMM& hmm, const vector<int>& O) {
    int N = hmm.N, T = (int)O.size();
    FBOut out;
    out.alpha.assign(T, vector<double>(N, 0.0));
    out.beta.assign(T, vector<double>(N, 0.0));
    out.c.assign(T, 0.0);

    // Forward with scaling
    double c0 = 0.0;
    for (int j = 0; j < N; ++j) { out.alpha[0][j] = hmm.pi[j] * hmm.B[j][O[0]]; c0 += out.alpha[0][j]; }
    if (c0 == 0.0) c0 = 1e-300;
    for (int j = 0; j < N; ++j) out.alpha[0][j] /= c0;
    out.c[0] = c0;

    for (int t = 1; t < T; ++t) {
        double ct = 0.0;
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int i = 0; i < N; ++i) s += out.alpha[t-1][i] * hmm.A[i][j];
            out.alpha[t][j] = s * hmm.B[j][O[t]];
            ct += out.alpha[t][j];
        }
        if (ct == 0.0) ct = 1e-300;
        for (int j = 0; j < N; ++j) out.alpha[t][j] /= ct;
        out.c[t] = ct;
    }

    // Backward with scaling (match alpha scaling)
    for (int j = 0; j < N; ++j) out.beta[T-1][j] = 1.0;  // already scaled
    for (int t = T-2; t >= 0; --t) {
        for (int i = 0; i < N; ++i) {
            double s = 0.0;
            for (int j = 0; j < N; ++j)
                s += hmm.A[i][j] * hmm.B[j][O[t+1]] * out.beta[t+1][j];
            // scale beta to be consistent: divide by c_{t+1}
            out.beta[t][i] = s / out.c[t+1];
        }
    }

    // c_t is the per-step normalizer, so log-likelihood = sum_t log(c_t)
    double loglik = 0.0;
    for (int t = 0; t < T; ++t) loglik += log(out.c[t]);
    out.loglik = loglik;
    return out;
}

void em_one_iteration(HMM& hmm, const vector<int>& O, double smoothing = 1e-8) {
    int N = hmm.N, M = hmm.M, T = (int)O.size();
    FBOut fb = forward_backward_scaled(hmm, O);

    vector<vector<double>> gamma(T, vector<double>(N, 0.0));
    vector<vector<vector<double>>> xi(T-1, vector<vector<double>>(N, vector<double>(N, 0.0)));

    // E-step: compute gamma and xi using scaled alpha/beta
    for (int t = 0; t < T; ++t) {
        double denom = 0.0;
        for (int i = 0; i < N; ++i) denom += fb.alpha[t][i] * fb.beta[t][i];
        if (denom == 0.0) denom = 1e-300;
        for (int i = 0; i < N; ++i) gamma[t][i] = (fb.alpha[t][i] * fb.beta[t][i]) / denom;
    }
    for (int t = 0; t < T-1; ++t) {
        double denom = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                denom += fb.alpha[t][i] * hmm.A[i][j] * hmm.B[j][O[t+1]] * fb.beta[t+1][j];
        if (denom == 0.0) denom = 1e-300;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                xi[t][i][j] = (fb.alpha[t][i] * hmm.A[i][j] * hmm.B[j][O[t+1]] * fb.beta[t+1][j]) / denom;
    }

    // M-step: re-estimate pi, A, B with smoothing to avoid zeros
    // Update pi
    for (int i = 0; i < N; ++i) hmm.pi[i] = max(gamma[0][i], smoothing);
    double pisum = accumulate(hmm.pi.begin(), hmm.pi.end(), 0.0);
    for (int i = 0; i < N; ++i) hmm.pi[i] /= pisum;

    // Update A
    for (int i = 0; i < N; ++i) {
        double denom = smoothing * N;  // smoothing mass
        for (int t = 0; t < T-1; ++t) denom += gamma[t][i];
        for (int j = 0; j < N; ++j) {
            double numer = smoothing;  // add-ε
            for (int t = 0; t < T-1; ++t) numer += xi[t][i][j];
            hmm.A[i][j] = numer / denom;
        }
    }

    // Update B (discrete)
    vector<double> denomB(N, smoothing * M);
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < N; ++i)
            denomB[i] += gamma[t][i];

    for (int j = 0; j < N; ++j) {
        vector<double> numer(M, smoothing);
        for (int t = 0; t < T; ++t) numer[O[t]] += gamma[t][j];
        for (int k = 0; k < M; ++k) hmm.B[j][k] = numer[k] / denomB[j];
    }

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "EM (one iter) log-likelihood: " << fb.loglik << "\n";
}

int main() {
    // Toy HMM and a single observation sequence
    HMM hmm(2, 3);
    hmm.pi = {0.5, 0.5};
    hmm.A = {{0.6, 0.4}, {0.3, 0.7}};
    hmm.B = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};

    vector<int> O = {0, 1, 2, 1, 0, 2};

    // Run one EM iteration
    em_one_iteration(hmm, O);

    // Print updated parameters (rows sum to ~1)
    cout << "Updated pi: "; for (double x : hmm.pi) cout << x << ' '; cout << "\n";
    cout << "Updated A:\n";
    for (int i = 0; i < hmm.N; ++i) { for (int j = 0; j < hmm.N; ++j) cout << hmm.A[i][j] << ' '; cout << '\n'; }
    cout << "Updated B:\n";
    for (int j = 0; j < hmm.N; ++j) { for (int k = 0; k < hmm.M; ++k) cout << hmm.B[j][k] << ' '; cout << '\n'; }

    return 0;
}
This program performs a single Baum–Welch iteration for a discrete HMM using scaled forward–backward passes. It computes γ and ξ posteriors and updates π, A, and B with small add-ε smoothing to avoid zeros. It reports the current log-likelihood and prints the updated parameters.