Hidden Markov Models
Key Points
- A Hidden Markov Model (HMM) describes sequences where you cannot see the true state directly, but you can observe outputs generated by those hidden states.
- An HMM is defined by three parts: initial state distribution \(\pi\), transition matrix \(A\), and emission matrix \(B\), often written as \(\lambda = (\pi, A, B)\).
- The forward algorithm computes the probability of an observation sequence efficiently in \(O(TN^2)\) time, where \(T\) is the sequence length and \(N\) is the number of states.
- The Viterbi algorithm finds the single most likely hidden state path that produced the observations, also in \(O(TN^2)\) time.
- The Baum–Welch algorithm (EM) learns HMM parameters from unlabeled sequences by alternating expectation and maximization steps.
- Numerical underflow is a common issue; use scaling or log-space arithmetic to keep probabilities stable.
- HMMs work well when the Markov and conditional independence assumptions roughly hold, such as in speech, bioinformatics, and POS tagging.
- Compared to modern neural models, HMMs are interpretable, train fast on small data, and provide clear probabilistic semantics.
Prerequisites
- Basic probability (Bayes rule, independence, conditional probability) — HMMs are conditional probability models over sequences and rely on factoring using independence assumptions.
- Markov chains — HMM hidden states evolve as a Markov chain; understanding transitions and stationary distributions helps.
- Dynamic programming — Forward, backward, and Viterbi are classic DP algorithms over time and states.
- Linear algebra (vectors, matrices) — Parameters π, A, and B are vectors/matrices; operations like normalization and matrix products are common.
- Numerical stability and floating-point — Scaling and log-space tricks are essential to avoid underflow in HMM computations.
Detailed Explanation
01 Overview
A Hidden Markov Model (HMM) is a probabilistic model for time series and sequences where the system evolves through hidden (unobserved) states that emit observable symbols. You never see the state directly; you only see its effects—the observations. The model assumes the hidden state follows a Markov chain: the next state depends only on the current state, not the entire history. Each state then produces an observation according to a fixed probability distribution. An HMM is fully specified by three components: the initial state distribution \(\pi\), the state transition probabilities \(A\), and the emission probabilities \(B\). With these, you can answer three classic questions efficiently using dynamic programming: evaluation (what’s the probability of an observation sequence?), decoding (what’s the most likely hidden state path?), and learning (how do we fit \((\pi, A, B)\) to data?). HMMs are compact, interpretable, and effective in many applications such as speech recognition, part-of-speech tagging, bioinformatics (gene finding, protein domains), user behavior modeling, and activity recognition. They offer a principled balance between expressive power and computational tractability, especially for small to medium state spaces and discrete observations (with extensions for continuous emissions).
02 Intuition & Analogies
Imagine you’re standing outside a windowless room trying to guess the weather inside a virtual world. You can’t see the sky (hidden state), but you do see people’s clothing when they exit (observations). If many people carry umbrellas, you guess it’s rainy; if they wear sunglasses, you guess it’s sunny. Also, weather tends to persist—rainy days are more likely to be followed by rainy days than by sunny ones. That persistence is the Markov chain over hidden weather states, and the clothing distributions are the emission probabilities. Similarly, think of listening to someone speak through a wall. You can’t see their mouth shapes (hidden articulatory states), but you hear sounds (observations). Each articulatory configuration is likely to produce certain sounds with certain probabilities, and the speaker’s mouth positions transition smoothly from one to another. The HMM formalizes exactly this setup: a chain of hidden causes producing noisy effects over time. The power of HMMs comes from two ideas. First, the Markov property drastically simplifies temporal dependence: we only need the current state to predict the next. Second, conditional independence says the observation at time t depends only on the current state, not past observations, once we know that state. With those assumptions, we can process sequences with dynamic programming—folding exponentially many paths into a compact computation. The forward algorithm sums over all possible hidden paths efficiently; Viterbi keeps only the single best path at each time. Together they let us evaluate how likely data is under the model and explain the data with a most plausible hidden story.
03 Formal Definition
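Formally, a discrete HMM can be specified as follows, using the standard notation that matches the symbols in the Key Formulas section:

```latex
\textbf{Definition.} A discrete HMM is a tuple $\lambda = (\pi, A, B)$ over
hidden states $S = \{s_1, \dots, s_N\}$ and observation symbols
$V = \{v_1, \dots, v_M\}$, where
\begin{align*}
\pi_i  &= P(q_1 = s_i), & 1 \le i \le N,\\
a_{ij} &= P(q_{t+1} = s_j \mid q_t = s_i), & 1 \le i, j \le N,\\
b_j(k) &= P(o_t = v_k \mid q_t = s_j), & 1 \le j \le N,\ 1 \le k \le M,
\end{align*}
subject to $\sum_i \pi_i = 1$, $\sum_j a_{ij} = 1$ for every $i$, and
$\sum_k b_j(k) = 1$ for every $j$. The model assumes the Markov property,
$P(q_{t+1} \mid q_1, \dots, q_t) = P(q_{t+1} \mid q_t)$, and conditional
independence of observations, $P(o_t \mid q_1, \dots, q_T, \text{other } o) = P(o_t \mid q_t)$.
```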
04 When to Use
Use HMMs when you have sequential or time-series data with an underlying process that is simpler to describe via a small number of discrete hidden regimes. Typical use cases include: speech recognition (phoneme states emitting acoustic features), part-of-speech tagging (hidden tags emitting words), bioinformatics (hidden genomic regions emitting nucleotides), and user behavior modeling (hidden intents emitting actions). HMMs are ideal when the Markov assumption (next state depends only on current state) and conditional independence (observation depends only on current state) are approximately valid. They shine when data is scarce and interpretability matters: parameters \((\pi, A, B)\) have direct probabilistic meaning and can be inspected. Choose HMMs for fast, stable inference with dynamic programming complexity \(O(TN^2)\), where you can trade off N (states) against accuracy and speed. If you need continuous observations, use Gaussian emissions or mixtures (GMM-HMM). If you require long-range dependencies, rich features, or discriminative training, consider CRFs or neural sequence models; however, HMMs remain competitive baselines and strong tools for small to mid-sized problems and for didactic and diagnostic purposes.
⚠️ Common Mistakes
- Ignoring numerical stability: multiplying many probabilities underflows to zero in floating point. Always use scaling factors at each time step or run algorithms in log-space with log-sum-exp.
- Confusing evaluation and decoding: forward sums over all paths to get sequence likelihood; Viterbi maximizes over paths to find a single best explanation. They answer different questions and yield different numbers.
- Forgetting to normalize rows: transition matrix A rows and discrete emission matrix B rows must sum to 1. After any update or random initialization, renormalize.
- Zero probabilities locking the model: if \(a_{ij}=0\) or \(b_j(o)=0\), paths become impossible. Use smoothing (e.g., add-\(\epsilon\)) to avoid dead states during training.
- Off-by-one and indexing errors: states often indexed 0..N-1 and observations 0..M-1 in code. Keep a consistent convention and map symbols to indices carefully.
- Overfitting with too many states: likelihood on training data increases with more states, but generalization may worsen. Use validation, information criteria, or domain constraints.
- Misinterpreting posteriors: \(\gamma_t(i)\) is a marginal over states at time t, not a path probability. \(\xi_t(i,j)\) is a pairwise marginal between consecutive times.
- Expecting global optimum from EM: Baum–Welch finds a local optimum sensitive to initialization. Use multiple random starts or informed initializations.
Key Formulas
Joint Path Probability
\( P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \)
Explanation: This is the probability of a specific hidden path Q producing observations O. It multiplies the start, transitions, and emissions along the path.
Sequence Likelihood
\( P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) \)
Explanation: The likelihood of the observations is obtained by summing over all hidden paths. The forward algorithm computes this sum efficiently without enumerating paths.
Forward Recursion
\( \alpha_1(j) = \pi_j\, b_j(o_1), \qquad \alpha_t(j) = \Big[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \Big] b_j(o_t), \qquad P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j) \)
Explanation: Forward variables accumulate the total probability of all paths ending in state j at time t that generate the prefix observations. Summing the last column gives the sequence likelihood.
Backward Recursion
\( \beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \)
Explanation: Backward variables represent the probability of the observation suffix given the current state. They are crucial for computing posteriors used in EM.
Viterbi Recursion
\( \delta_1(j) = \pi_j\, b_j(o_1), \qquad \delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(o_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big] \)
Explanation: The Viterbi dynamic program keeps only the single best path to each state at each time. Backpointers recover the most likely state sequence.
State Posterior
\( \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} \)
Explanation: Gamma is the posterior probability of being in a specific state at time t. It uses the product of forward and backward probabilities normalized by the sequence likelihood.
Pairwise Posterior
\( \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \)
Explanation: Xi is the posterior over consecutive state pairs. It is central to the expected count of transitions from i to j used in the M-step.
Baum–Welch Updates (Discrete)
\( \hat\pi_i = \gamma_1(i), \qquad \hat a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \hat b_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \)
Explanation: The EM update sets parameters to normalized expected counts under the current model. Emission updates for symbol k sum posteriors at times where k was observed.
Scaling for Stability
\( c_t = \sum_{j=1}^{N} \tilde\alpha_t(j), \qquad \hat\alpha_t(j) = \frac{\tilde\alpha_t(j)}{c_t}, \qquad \log P(O \mid \lambda) = \sum_{t=1}^{T} \log c_t \)
Explanation: Scaling prevents underflow by normalizing the forward values at each time step; here \(\tilde\alpha_t\) denotes the unscaled values computed from the already-scaled column at \(t-1\). Because the per-step normalizers \(c_t\) multiply together to give \(P(O \mid \lambda)\), the log-likelihood is recovered by accumulating \(\log c_t\).
Log-Sum-Exp Trick
\( \log \sum_{i} e^{x_i} = m + \log \sum_{i} e^{x_i - m}, \qquad m = \max_i x_i \)
Explanation: This identity computes logs of sums stably by factoring out the maximum exponent. It is essential when doing HMM computations in log-space.
Complexity Analysis
For \(N\) hidden states and a sequence of length \(T\): the forward, backward, and Viterbi algorithms each run in \(O(TN^2)\) time. Storing the full \(\alpha\), \(\beta\), or \(\delta\) tables takes \(O(TN)\) space; the forward likelihood alone needs only \(O(N)\), since each step depends only on the previous column. One Baum–Welch iteration also costs \(O(TN^2)\) time, dominated by the \(\xi\) computation, and the discrete emission matrix \(B\) takes \(O(NM)\) space for \(M\) observation symbols.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N;  // number of states
    int M;  // number of observation symbols
    vector<double> pi;         // size N
    vector<vector<double>> A;  // NxN
    vector<vector<double>> B;  // NxM

    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}

    // Forward algorithm with scaling. Returns log-likelihood.
    double forward_scaled(const vector<int>& O, vector<vector<double>>& alpha_scaled, vector<double>& scales) const {
        int T = (int)O.size();
        alpha_scaled.assign(T, vector<double>(N, 0.0));
        scales.assign(T, 0.0);

        // t = 0 initialization
        double c0 = 0.0;
        for (int j = 0; j < N; ++j) {
            alpha_scaled[0][j] = pi[j] * B[j][O[0]];
            c0 += alpha_scaled[0][j];
        }
        // scale
        if (c0 == 0.0) c0 = 1e-300;  // avoid divide-by-zero
        for (int j = 0; j < N; ++j) alpha_scaled[0][j] /= c0;
        scales[0] = c0;

        // induction
        for (int t = 1; t < T; ++t) {
            double ct = 0.0;
            for (int j = 0; j < N; ++j) {
                double sum_prev = 0.0;
                for (int i = 0; i < N; ++i) sum_prev += alpha_scaled[t-1][i] * A[i][j];
                alpha_scaled[t][j] = sum_prev * B[j][O[t]];
                ct += alpha_scaled[t][j];
            }
            if (ct == 0.0) ct = 1e-300;
            for (int j = 0; j < N; ++j) alpha_scaled[t][j] /= ct;
            scales[t] = ct;
        }
        // Each c_t is the per-step normalizer (the sum before scaling),
        // so log-likelihood = sum_t log(c_t)
        double loglik = 0.0;
        for (int t = 0; t < T; ++t) loglik += log(scales[t]);
        return loglik;
    }
};

int main() {
    // Example HMM with 2 states and 3 observation symbols {0,1,2}
    HMM hmm(2, 3);
    hmm.pi = {0.6, 0.4};
    hmm.A = {{0.7, 0.3},
             {0.4, 0.6}};
    hmm.B = {{0.5, 0.4, 0.1},   // state 0 emits 0,1,2
             {0.1, 0.3, 0.6}};  // state 1 emits 0,1,2

    // Observation sequence O = [0, 1, 2, 1]
    vector<int> O = {0, 1, 2, 1};

    vector<vector<double>> alpha_scaled;
    vector<double> scales;
    double loglik = hmm.forward_scaled(O, alpha_scaled, scales);

    cout.setf(ios::fixed);
    cout << setprecision(6);
    cout << "Log-likelihood log P(O|λ): " << loglik << "\n";
    cout << "P(O|λ) = exp(log-likelihood) (safe to exponentiate for short sequences): " << exp(loglik) << "\n";

    // Print final scaled alphas at T-1 (they sum to 1 by construction)
    cout << "Final state distribution (scaled α_T): ";
    for (double v : alpha_scaled.back()) cout << v << ' ';
    cout << "\n";
    return 0;
}
This program builds a small discrete HMM and runs the forward algorithm with per-time-step scaling. It returns the log-likelihood by accumulating the logs of the per-step normalizers, whose product equals the sequence probability. The scaled forward values avoid numerical underflow and represent the filtered state distribution at each time up to normalization.
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N, M;
    vector<double> pi;            // size N
    vector<vector<double>> A, B;  // A: NxN, B: NxM
    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}

    vector<int> viterbi(const vector<int>& O, double& best_log_prob) const {
        int T = (int)O.size();
        const double NEG_INF = -1e300;  // stand-in for -inf

        // Work in log-space for stability: log(a*b) = log a + log b;
        // max in log-space is ordinary max
        auto l = [](double x){ return (x <= 0.0 ? -1e300 : log(x)); };

        vector<vector<double>> delta(T, vector<double>(N, NEG_INF));
        vector<vector<int>> psi(T, vector<int>(N, -1));

        // Initialization
        for (int j = 0; j < N; ++j) {
            delta[0][j] = l(pi[j]) + l(B[j][O[0]]);
            psi[0][j] = -1;
        }

        // Recurrence
        for (int t = 1; t < T; ++t) {
            for (int j = 0; j < N; ++j) {
                double best = NEG_INF; int arg = -1;
                for (int i = 0; i < N; ++i) {
                    double cand = delta[t-1][i] + l(A[i][j]);
                    if (cand > best) { best = cand; arg = i; }
                }
                delta[t][j] = best + l(B[j][O[t]]);
                psi[t][j] = arg;
            }
        }

        // Termination
        best_log_prob = NEG_INF; int last_state = -1;
        for (int j = 0; j < N; ++j)
            if (delta[T-1][j] > best_log_prob) { best_log_prob = delta[T-1][j]; last_state = j; }

        // Path backtracking
        vector<int> path(T);
        path[T-1] = last_state;
        for (int t = T-2; t >= 0; --t) path[t] = psi[t+1][path[t+1]];
        return path;
    }
};

int main() {
    // Same toy HMM as in the forward example
    HMM hmm(2, 3);
    hmm.pi = {0.6, 0.4};
    hmm.A = {{0.7, 0.3}, {0.4, 0.6}};
    hmm.B = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};

    vector<int> O = {0, 1, 2, 1};
    double best_log_prob;
    vector<int> path = hmm.viterbi(O, best_log_prob);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Best log-probability (Viterbi): " << best_log_prob << "\n";
    cout << "Most likely hidden path: ";
    for (int s : path) cout << s << ' ';
    cout << "\n";
    return 0;
}
This code performs Viterbi decoding in log-space to find the most probable hidden state sequence for the given observations. It stores the best score to each state at each time (delta) and backpointers (psi) to reconstruct the optimal path.
#include <bits/stdc++.h>
using namespace std;

struct HMM {
    int N, M;
    vector<double> pi;            // size N
    vector<vector<double>> A, B;  // A: NxN, B: NxM
    HMM(int n, int m) : N(n), M(m), pi(n), A(n, vector<double>(n)), B(n, vector<double>(m)) {}
};

struct FBOut {
    vector<vector<double>> alpha, beta;  // scaled alpha and beta
    vector<double> c;                    // scaling factors (pre-scaling sums)
    double loglik;
};

FBOut forward_backward_scaled(const HMM& hmm, const vector<int>& O) {
    int N = hmm.N, T = (int)O.size();
    FBOut out;
    out.alpha.assign(T, vector<double>(N, 0.0));
    out.beta.assign(T, vector<double>(N, 0.0));
    out.c.assign(T, 0.0);

    // Forward with scaling
    double c0 = 0.0;
    for (int j = 0; j < N; ++j) { out.alpha[0][j] = hmm.pi[j] * hmm.B[j][O[0]]; c0 += out.alpha[0][j]; }
    if (c0 == 0.0) c0 = 1e-300;
    for (int j = 0; j < N; ++j) out.alpha[0][j] /= c0;
    out.c[0] = c0;

    for (int t = 1; t < T; ++t) {
        double ct = 0.0;
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int i = 0; i < N; ++i) s += out.alpha[t-1][i] * hmm.A[i][j];
            out.alpha[t][j] = s * hmm.B[j][O[t]];
            ct += out.alpha[t][j];
        }
        if (ct == 0.0) ct = 1e-300;
        for (int j = 0; j < N; ++j) out.alpha[t][j] /= ct;
        out.c[t] = ct;
    }

    // Backward with scaling (match alpha scaling)
    for (int j = 0; j < N; ++j) out.beta[T-1][j] = 1.0;  // already scaled
    for (int t = T-2; t >= 0; --t) {
        for (int i = 0; i < N; ++i) {
            double s = 0.0;
            for (int j = 0; j < N; ++j)
                s += hmm.A[i][j] * hmm.B[j][O[t+1]] * out.beta[t+1][j];
            // scale beta to be consistent: divide by c_{t+1}
            out.beta[t][i] = s / out.c[t+1];
        }
    }

    // c_t is the per-step normalizer, so log-likelihood = sum_t log(c_t)
    double loglik = 0.0;
    for (int t = 0; t < T; ++t) loglik += log(out.c[t]);
    out.loglik = loglik;
    return out;
}

void em_one_iteration(HMM& hmm, const vector<int>& O, double smoothing = 1e-8) {
    int N = hmm.N, M = hmm.M, T = (int)O.size();
    FBOut fb = forward_backward_scaled(hmm, O);

    vector<vector<double>> gamma(T, vector<double>(N, 0.0));
    vector<vector<vector<double>>> xi(T-1, vector<vector<double>>(N, vector<double>(N, 0.0)));

    // E-step: compute gamma and xi using scaled alpha/beta
    for (int t = 0; t < T; ++t) {
        double denom = 0.0;
        for (int i = 0; i < N; ++i) denom += fb.alpha[t][i] * fb.beta[t][i];
        if (denom == 0.0) denom = 1e-300;
        for (int i = 0; i < N; ++i) gamma[t][i] = (fb.alpha[t][i] * fb.beta[t][i]) / denom;
    }
    for (int t = 0; t < T-1; ++t) {
        double denom = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                denom += fb.alpha[t][i] * hmm.A[i][j] * hmm.B[j][O[t+1]] * fb.beta[t+1][j];
        if (denom == 0.0) denom = 1e-300;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                xi[t][i][j] = (fb.alpha[t][i] * hmm.A[i][j] * hmm.B[j][O[t+1]] * fb.beta[t+1][j]) / denom;
    }

    // M-step: re-estimate pi, A, B with smoothing to avoid zeros
    // Update pi
    for (int i = 0; i < N; ++i) hmm.pi[i] = max(gamma[0][i], smoothing);
    double pisum = accumulate(hmm.pi.begin(), hmm.pi.end(), 0.0);
    for (int i = 0; i < N; ++i) hmm.pi[i] /= pisum;

    // Update A
    for (int i = 0; i < N; ++i) {
        double denom = smoothing * N;  // smoothing mass
        for (int t = 0; t < T-1; ++t) denom += gamma[t][i];
        for (int j = 0; j < N; ++j) {
            double numer = smoothing;  // add-ε
            for (int t = 0; t < T-1; ++t) numer += xi[t][i][j];
            hmm.A[i][j] = numer / denom;
        }
    }

    // Update B (discrete)
    vector<double> denomB(N, smoothing * M);
    for (int t = 0; t < T; ++t)
        for (int i = 0; i < N; ++i)
            denomB[i] += gamma[t][i];

    for (int j = 0; j < N; ++j) {
        vector<double> numer(M, smoothing);
        for (int t = 0; t < T; ++t) numer[O[t]] += gamma[t][j];
        for (int k = 0; k < M; ++k) hmm.B[j][k] = numer[k] / denomB[j];
    }

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "EM (one iter) log-likelihood: " << fb.loglik << "\n";
}

int main() {
    // Toy HMM and a single observation sequence
    HMM hmm(2, 3);
    hmm.pi = {0.5, 0.5};
    hmm.A = {{0.6, 0.4}, {0.3, 0.7}};
    hmm.B = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};

    vector<int> O = {0, 1, 2, 1, 0, 2};

    // Run one EM iteration
    em_one_iteration(hmm, O);

    // Print updated parameters (rows sum to ~1)
    cout << "Updated pi: "; for (double x : hmm.pi) cout << x << ' '; cout << "\n";
    cout << "Updated A:\n";
    for (int i = 0; i < hmm.N; ++i) { for (int j = 0; j < hmm.N; ++j) cout << hmm.A[i][j] << ' '; cout << '\n'; }
    cout << "Updated B:\n";
    for (int j = 0; j < hmm.N; ++j) { for (int k = 0; k < hmm.M; ++k) cout << hmm.B[j][k] << ' '; cout << '\n'; }

    return 0;
}
This program performs a single Baum–Welch iteration for a discrete HMM using scaled forward–backward passes. It computes γ and ξ posteriors and updates π, A, and B with small add-ε smoothing to avoid zeros. It reports the current log-likelihood and prints the updated parameters.