šŸ“š Theory Ā· Intermediate

Markov Chain Theory

Key Points

  • A Markov chain is a random process where the next state depends only on the current state, not the full history.
  • The transition matrix P encodes one-step move probabilities, and a stationary distribution Ļ€ satisfies Ļ€P = Ļ€.
  • If a finite chain is irreducible and aperiodic (ergodic), it converges to a unique stationary distribution regardless of the start state.
  • Mixing time measures how many steps are needed for the chain to get close to stationarity in total variation distance.
  • Reversible chains satisfy detailed balance Ļ€_i P_{ij} = Ļ€_j P_{ji}, which guarantees stationarity and simplifies analysis.
  • MCMC methods (e.g., Metropolis–Hastings and Gibbs sampling) build Markov chains whose stationary distribution is a desired target.
  • Coupling is a powerful technique to bound mixing time by making two copies of the chain meet quickly.
  • In AI/ML, Markov chains underpin HMMs, PageRank, Bayesian inference via MCMC, and many stochastic optimization techniques.

Prerequisites

  • Basic probability theory — Understanding conditional probability, random variables, and distributions is essential for the Markov property and transition probabilities.
  • Linear algebra — Transition matrices, eigenvectors, and power iteration rely on matrix operations and spectral concepts.
  • Calculus and multivariate Gaussian distributions — Needed to derive and understand Gibbs conditionals and to evaluate common target densities.
  • Numerical methods — Power iteration convergence, normalization, and error tolerances are numerical-analysis topics.
  • C++ programming and RNG usage — Implementing simulations, using <random>, and managing vectors/matrices require C++ proficiency.

Detailed Explanation


01 Overview

Markov chain theory studies random processes that evolve in discrete time by jumping between states according to fixed probabilities. The key property is memorylessness: the next step depends only on the current state. This behavior is represented by a transition matrix P, where entry P_{ij} gives the probability of moving from state i to state j in one step. Iterating the process multiplies probabilities by powers of P, so t-step transitions are captured by P^t. A central object is the stationary distribution Ļ€, a probability vector that remains unchanged by the dynamics (Ļ€P = Ļ€). For finite chains that are irreducible (every state can reach every other) and aperiodic (no fixed cycle timing), the chain is ergodic: P^t(x,Ā·) converges to Ļ€ for every start state x. The speed of this convergence is the mixing time, often controlled by spectral properties of P (e.g., the second-largest eigenvalue magnitude) or geometric arguments like conductance. Reversible chains satisfy detailed balance Ļ€_i P_{ij} = Ļ€_j P_{ji}, making analysis and algorithm design (especially MCMC) more tractable. In machine learning, we exploit these ideas to sample from complex distributions via Metropolis–Hastings and Gibbs sampling, to model sequential data with HMMs, and to analyze graph processes such as PageRank. Practical work involves simulating chains, diagnosing convergence, and trading off step size versus acceptance in MCMC.
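For a concrete (illustrative) two-state example: if P = [[1 āˆ’ a, a], [b, 1 āˆ’ b]] with 0 < a, b < 1, solving Ļ€P = Ļ€ together with Ļ€_1 + Ļ€_2 = 1 gives Ļ€ = (b/(a+b), a/(a+b)). Detailed balance Ļ€_1 Ā· a = Ļ€_2 Ā· b also holds, so this chain is reversible. For instance, a = 0.3 and b = 0.1 give Ļ€ = (0.25, 0.75).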

02 Intuition & Analogies

Imagine walking around a city using a simple rule: at each intersection (state), you choose your next street based only on where you currently are, not on the whole path you took. That's a Markov chain. The city map with signs telling you the probabilities of each outgoing street is the transition matrix. If you keep walking, you eventually develop a "routine" of how often you find yourself in each neighborhood; this long-run frequency is the stationary distribution. If the city is well-connected (irreducible) and has no forced rhythm like alternating only between uptown and downtown (aperiodic), then no matter where you start, you'll settle into the same routine. Mixing time answers: how quickly do you forget where you started? Stirring cream into coffee is a physical analogy: after enough stirs (steps), you can't tell where the cream started; the mixture is uniform (stationary). Detailed balance is like two-way traffic being balanced: the flow from A to B equals the flow from B to A in the long run, making the overall pattern stable. MCMC is like designing turn-by-turn rules so that your wandering naturally spends time in places proportional to how much you care about them (your target distribution), even if you don't know the exact map of desirability. Metropolis–Hastings says: suggest a move, then probabilistically accept it so that long-run visiting frequencies match your target. Gibbs sampling says: hold all but one coordinate fixed, then update that coordinate from its local rule; repeat until the whole system equilibrates. Coupling imagines two walkers sharing the same random choices from different starts; when they meet and move together, you've essentially proven the process has forgotten its beginning.

03 Formal Definition

A (time-homogeneous) Markov chain is a sequence of random variables (X_0, X_1, X_2, …) taking values in a countable state space S such that for all n and states x_0, …, x_{n-1}, i, j, we have P(X_{n+1} = j | X_n = i, X_{n-1} = x_{n-1}, …, X_0 = x_0) = P(X_{n+1} = j | X_n = i). For finite S with |S| = n, the chain is specified by an n Ɨ n row-stochastic transition matrix P with entries P_{ij} = P(X_{t+1} = j | X_t = i), satisfying āˆ‘_j P_{ij} = 1 and P_{ij} ≄ 0. The t-step transition probabilities are entries of P^t. A distribution vector Ļ€ over S is stationary if Ļ€P = Ļ€ and āˆ‘_i Ļ€_i = 1. A chain is irreducible if every state communicates with every other (each is reachable with positive probability in some number of steps), and aperiodic if the greatest common divisor of return times to every state is 1. For finite, irreducible, aperiodic chains (ergodic chains), the stationary distribution exists, is unique, and P^t(x,Ā·) → Ļ€ for all x (convergence in total variation). A chain is reversible with respect to Ļ€ if it satisfies detailed balance Ļ€_i P_{ij} = Ļ€_j P_{ji}, which implies stationarity. The mixing time at accuracy ε is t_mix(ε) = min{ t : max_x ||P^t(x,Ā·) āˆ’ Ļ€||_TV ≤ ε }. Spectral properties of P (e.g., the magnitude of the second-largest eigenvalue) control the convergence rate in many settings, and conductance bounds relate the geometry of the state space to mixing time.
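As a minimal sketch (not part of the original examples), the following C++ snippet checks two of these conditions numerically for a small chain: that P is row-stochastic and that a candidate Ļ€ satisfies detailed balance. The example matrix and Ļ€ match the two-state example above.

#include <bits/stdc++.h>
using namespace std;

// Check that every row of P is a valid probability distribution.
bool is_row_stochastic(const vector<vector<double>>& P, double eps = 1e-9) {
    for (const auto& row : P) {
        double s = 0.0;
        for (double p : row) { if (p < -eps) return false; s += p; }
        if (fabs(s - 1.0) > eps) return false;
    }
    return true;
}

// Check detailed balance pi_i * P_ij = pi_j * P_ji for all pairs (i, j).
bool satisfies_detailed_balance(const vector<double>& pi,
                                const vector<vector<double>>& P,
                                double eps = 1e-9) {
    for (size_t i = 0; i < P.size(); ++i)
        for (size_t j = 0; j < P.size(); ++j)
            if (fabs(pi[i] * P[i][j] - pi[j] * P[j][i]) > eps) return false;
    return true;
}

int main() {
    // Two-state chain with a = 0.3, b = 0.1; pi = (0.25, 0.75) is stationary and reversible.
    vector<vector<double>> P = {{0.7, 0.3}, {0.1, 0.9}};
    vector<double> pi = {0.25, 0.75};
    cout << "row-stochastic: " << is_row_stochastic(P) << "\n";
    cout << "detailed balance: " << satisfies_detailed_balance(pi, P) << "\n";
    return 0;
}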

04 When to Use

Use Markov chains whenever you need to model or compute with sequential uncertainty where the present suffices to predict the future. In AI/ML, they underpin Hidden Markov Models for time series, reinforcement learning dynamics (state transitions in MDPs), and random walks on graphs (PageRank, node ranking, community detection). For Bayesian inference, use MCMC (Metropolis–Hastings, Gibbs) to draw approximate samples from complicated posteriors where direct sampling or integration is impossible; detailed balance and stationarity ensure that long-run samples are from the target. Use reversible chains for simpler analysis and tuning; use nonreversible proposals when you seek faster mixing in practice (though proofs are harder). Apply power iteration or eigen-solvers to find stationary distributions in finite chains, especially for ranking problems or to validate MCMC samplers. Use coupling or spectral-gap bounds to reason about mixing time when you need guarantees on convergence speed. Prefer Gibbs sampling when conditional distributions are easy to sample (e.g., conjugate exponential-family models); prefer Metropolis–Hastings when only an unnormalized density is available. In simulation, use burn-in to reduce start-up bias and thin samples if storage is limited or autocorrelation is high.
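The following is a rough sketch of how burn-in and thinning fit into a sampling loop. The transition used here (a random-walk MH update targeting a standard normal) is an assumed stand-in, not from the original examples; any kernel with the right stationary distribution would slot in the same way.

#include <bits/stdc++.h>
using namespace std;

// Hypothetical stand-in for one MCMC transition: a random-walk Metropolis update
// targeting a standard normal N(0, 1).
double draw_next_sample(double x, mt19937& rng) {
    normal_distribution<double> step(0.0, 1.0);
    uniform_real_distribution<double> U(0.0, 1.0);
    double y = x + step(rng);
    double alpha = min(1.0, exp(0.5 * (x * x - y * y))); // N(0,1) density ratio
    return (U(rng) < alpha) ? y : x;
}

int main() {
    mt19937 rng(7);
    int total_steps = 20000, burn_in = 1000, thin = 5;
    vector<double> kept;
    double x = 5.0; // deliberately far from the target's bulk
    for (int t = 0; t < total_steps; ++t) {
        x = draw_next_sample(x, rng);                   // one Markov transition
        if (t >= burn_in && (t - burn_in) % thin == 0)  // discard warm-up, then thin
            kept.push_back(x);
    }
    double m = accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
    cout << "kept " << kept.size() << " samples, mean ~= " << m << " (target 0)\n";
    return 0;
}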

āš ļøCommon Mistakes

• Confusing the Markov property with independence: in a Markov chain, steps are dependent, just not on the full history. Check that P(X_{n+1} | X_n) fully defines the dynamics.
• Building P with rows that don't sum to 1 or have negative entries; always validate row-stochasticity to avoid invalid probabilities.
• Assuming convergence without ergodicity: periodic or reducible chains may never settle to a unique Ļ€. Inspect connectivity and period (e.g., add self-loops to break periodicity).
• Equating stationarity with reversibility: detailed balance implies stationarity, but not all stationary chains are reversible. Don't over-restrict models when reversibility isn't needed.
• Using MCMC without diagnostics: relying on a fixed number of iterations may yield biased estimates if mixing is slow. Monitor trace plots, autocorrelation, R-hat, and effective sample size (a rough ESS estimator sketch follows this list).
• Poor MH tuning: too small a proposal step causes slow exploration (high acceptance but high autocorrelation); too large causes frequent rejections (low acceptance). Target a moderate acceptance rate (e.g., 20–40% in many 1D/low-dimensional problems).
• Ignoring burn-in and thinning: initial samples reflect the start state; discard a warm-up period and consider thinning only if storage is constrained.
• Numerical pitfalls in power iteration: lack of normalization, too loose a tolerance, or nearly defective matrices can stall convergence; re-normalize vectors and cap iterations.
• Miscomputing total variation distance: remember the factor 1/2 and compare distributions, not raw counts; convert empirical counts to probabilities.
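Below is a minimal sketch of an autocorrelation-based effective-sample-size estimate. The truncation rule (stop at the first non-positive autocorrelation) is one simple convention among several, and the AR(1) input is synthetic stand-in data rather than output from the samplers in this article.

#include <bits/stdc++.h>
using namespace std;

// Rough ESS estimate: ESS = N / (1 + 2 * sum of positive-lag autocorrelations),
// truncating the sum at the first non-positive autocorrelation.
double effective_sample_size(const vector<double>& x) {
    size_t n = x.size();
    double mean = accumulate(x.begin(), x.end(), 0.0) / n;
    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= n;
    if (var == 0.0) return (double)n;
    double sum_rho = 0.0;
    for (size_t k = 1; k < n / 2; ++k) {
        double c = 0.0;
        for (size_t t = 0; t + k < n; ++t) c += (x[t] - mean) * (x[t + k] - mean);
        double rho = (c / (n - k)) / var;   // lag-k autocorrelation
        if (rho <= 0.0) break;              // truncate once correlations die out
        sum_rho += rho;
    }
    return n / (1.0 + 2.0 * sum_rho);
}

int main() {
    // AR(1)-style correlated draws as a stand-in for autocorrelated MCMC output
    mt19937 rng(1);
    normal_distribution<double> Z(0.0, 1.0);
    vector<double> x(10000);
    x[0] = 0.0;
    for (size_t t = 1; t < x.size(); ++t) x[t] = 0.9 * x[t - 1] + Z(rng);
    cout << "N = " << x.size() << ", ESS ~= " << effective_sample_size(x) << "\n";
    return 0;
}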

Key Formulas

Markov Property

P(X_{n+1} = j | X_n = i, X_{n-1} = x_{n-1}, …, X_0 = x_0) = P(X_{n+1} = j | X_n = i)

Explanation: This is the memoryless condition: the distribution of the next state depends only on the current state, not the entire history.

Row-Stochastic Matrix

P_{ij} ≄ 0 and āˆ‘_j P_{ij} = 1 for every row i

Explanation: Each row of the transition matrix represents a valid probability distribution over next states.

t-Step Transitions

P(X_t = j | X_0 = i) = (P^t)_{ij}

Explanation: Raising P to the t-th power gives the probability of moving from state i to state j in t steps.

Stationary Distribution

Ļ€P = Ļ€, with Ļ€_i ≄ 0 and āˆ‘_i Ļ€_i = 1

Explanation: A distribution Ļ€ is stationary if, after a transition, the distribution does not change. It is a left eigenvector of P with eigenvalue 1.

Detailed Balance

Ļ€_i P_{ij} = Ļ€_j P_{ji} for all states i, j

Explanation: This condition implies reversibility and guarantees that Ļ€ is stationary. It balances probability flow between any two states.

Total Variation Distance

||μ āˆ’ ν||_TV = (1/2) āˆ‘_x |μ(x) āˆ’ ν(x)|

Explanation: Total variation measures how different two distributions are; it is half the L1 distance and ranges from 0 to 1.

Mixing Time

t_mix(ε) = min{ t : max_x ||P^t(x,Ā·) āˆ’ Ļ€||_TV ≤ ε }

Explanation: Mixing time is the number of steps needed so that, from any start state, the chain is within ε of stationarity in total variation.

Convergence to Stationarity

||P^t(x,Ā·) āˆ’ Ļ€||_TV → 0 as t → āˆž, for every start state x

Explanation: For finite, irreducible, aperiodic chains, the distribution after t steps converges to the unique stationary distribution.

Spectral Convergence (Reversible Case)

||μP^t āˆ’ Ļ€|| ≤ C Ā· (λ*)^t, where λ* = max(|Ī»ā‚‚|, |λ_n|) < 1 and C depends on the initial distribution μ

Explanation: In reversible chains, the second-largest eigenvalue in magnitude governs geometric decay of deviations from stationarity in the Euclidean norm.

Coupling Inequality

||μ āˆ’ ν||_TV ≤ P(X ā‰  Y) for any coupling (X, Y) with X ~ μ and Y ~ ν

Explanation: For any coupling of the two distributions, the probability that the coupled variables differ bounds the total variation distance between them. Applied to P^t(x,Ā·) and P^t(y,Ā·), a fast-meeting coupling yields a bound on mixing time.
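A small simulation sketch (assuming the same 4-state chain used in the code examples below) builds one natural coupling by feeding both copies the same uniform draws each step; once they meet, they move together. The empirical P(X_t ā‰  Y_t) upper-bounds the TV distance between the two time-t distributions.

#include <bits/stdc++.h>
using namespace std;

// Advance from state i using the inverse-CDF of row i of P and a shared uniform u.
int step(const vector<vector<double>>& P, int i, double u) {
    double c = 0.0;
    for (int j = 0; j < (int)P.size(); ++j) { c += P[i][j]; if (u <= c) return j; }
    return (int)P.size() - 1;
}

int main() {
    vector<vector<double>> P = {
        {0.10, 0.60, 0.20, 0.10},
        {0.30, 0.20, 0.40, 0.10},
        {0.25, 0.25, 0.25, 0.25},
        {0.40, 0.10, 0.10, 0.40}
    };
    mt19937 rng(7);
    uniform_real_distribution<double> U(0.0, 1.0);
    int trials = 20000, horizon = 32;
    vector<int> not_coupled(horizon + 1, 0);
    for (int r = 0; r < trials; ++r) {
        int x = 0, y = 3; // two different start states
        for (int t = 1; t <= horizon; ++t) {
            double u = U(rng); // shared randomness: once x == y, they stay together
            x = step(P, x, u);
            y = step(P, y, u);
            if (x != y) not_coupled[t]++;
        }
    }
    for (int t : {1, 2, 4, 8, 16, 32})
        cout << "t=" << t << ": P(X_t != Y_t) ~= " << (double)not_coupled[t] / trials << "\n";
    return 0;
}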

Metropolis–Hastings Acceptance

α(x, y) = min{ 1, [Ļ€(y) q(y, x)] / [Ļ€(x) q(x, y)] }

Explanation: This acceptance probability ensures the Markov chain has stationary distribution Ļ€ even when the proposal density q is asymmetric.
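As a minimal illustration (the helper mh_accept_prob is hypothetical, not part of the examples below), the general acceptance with an asymmetric proposal looks like this; the symmetric proposal used later is the special case q(x, y) = q(y, x).

#include <algorithm>
#include <iostream>

// General MH acceptance: alpha(x, y) = min(1, [pi(y) * q(y, x)] / [pi(x) * q(x, y)]).
// pi_x, pi_y may be unnormalized; q_xy = q(x, y), q_yx = q(y, x).
double mh_accept_prob(double pi_x, double pi_y, double q_xy, double q_yx) {
    return std::min(1.0, (pi_y * q_yx) / (pi_x * q_xy));
}

int main() {
    // Illustrative numbers: a proposal that favors the move x -> y (q_xy > q_yx)
    // is penalized by the Hastings correction q_yx / q_xy.
    std::cout << mh_accept_prob(1.0, 2.0, 0.5, 0.1) << "\n"; // min(1, 2*0.1 / (1*0.5)) = 0.4
    return 0;
}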

Ergodic Theorem for Markov Chains

(1/T) āˆ‘_{t=1}^{T} f(X_t) → E_Ļ€[f(X)] as T → āˆž

Explanation: Time averages along a single long run converge to expectations under the stationary distribution, justifying MCMC estimation.

Complexity Analysis

For a finite-state chain with n states, one-step simulation from a given state is O(log n) if next-state sampling uses binary search over cumulative probabilities, or O(1) with alias tables (after O(n) preprocessing per row). Simulating T steps is O(T) with O(1) memory for the current state, and O(n) additional space if tracking empirical counts. Computing an exact stationary distribution by solving Ļ€(P āˆ’ I) = 0 together with āˆ‘_i Ļ€_i = 1 via Gaussian elimination is O(nĀ³) time and O(nĀ²) space. Power iteration to approximate Ļ€ as a left eigenvector requires repeated vector–matrix multiplies: each multiply is O(nĀ²), so k iterations cost O(k nĀ²) time and O(n) space for the vectors; the convergence rate depends on the spectral gap (1 āˆ’ λ*), where λ* is the second-largest eigenvalue magnitude. To obtain P^t(x,Ā·) explicitly, naive repeated multiplication of a delta row vector by P incurs O(t nĀ²) time and O(n) space per start state. Metropolis–Hastings with simple proposals has O(1) work per step if evaluating the target density is O(1); T iterations cost O(T) time and O(1) space for streaming estimates, or O(T) space if storing all samples. Gibbs sampling typically costs O(d) per sweep across d coordinates if each conditional draw is O(1). Effective sample size (ESS) is less than T due to autocorrelation; obtaining M effective samples may require on the order of O(M Ā· IACT) steps, where IACT is the integrated autocorrelation time. In practice, the dominant costs are (i) matrix–vector multiplies for stationary computations on finite chains and (ii) target-density evaluations for MCMC in continuous spaces.
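For the O(1)-per-step sampling mentioned above, here is a compact sketch of an alias table for a single transition-matrix row (illustrative only; the code examples below use the simpler linear scan instead).

#include <bits/stdc++.h>
using namespace std;

// Alias table for one transition-matrix row: O(n) build, O(1) sample.
struct AliasTable {
    vector<double> prob;
    vector<int> alias;

    explicit AliasTable(const vector<double>& p) {
        int n = (int)p.size();
        prob.assign(n, 0.0);
        alias.assign(n, 0);
        vector<double> scaled(n);
        vector<int> small, large;
        for (int i = 0; i < n; ++i) {
            scaled[i] = p[i] * n;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
            int s = small.back(); small.pop_back();
            int l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1.0; // donate mass to the small bucket
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (int i : small) prob[i] = 1.0; // leftovers are numerically ~1
        for (int i : large) prob[i] = 1.0;
    }

    int sample(mt19937& rng) const {
        uniform_int_distribution<int> pick(0, (int)prob.size() - 1);
        uniform_real_distribution<double> U(0.0, 1.0);
        int i = pick(rng);
        return (U(rng) < prob[i]) ? i : alias[i];
    }
};

int main() {
    mt19937 rng(3);
    AliasTable row({0.10, 0.60, 0.20, 0.10}); // one row of P
    vector<int> counts(4, 0);
    for (int t = 0; t < 100000; ++t) counts[row.sample(rng)]++;
    for (int j = 0; j < 4; ++j) cout << "state " << j << ": " << counts[j] / 100000.0 << "\n";
    return 0;
}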

Code Examples

Simulate a finite Markov chain, compute stationary distribution by power iteration, and estimate mixing
#include <bits/stdc++.h>
using namespace std;

// Multiply a row vector v (size n) by matrix P (n x n): v' = v P
vector<double> rowVecTimesMat(const vector<double>& v, const vector<vector<double>>& P) {
    int n = (int)P.size();
    vector<double> out(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += v[i] * P[i][j];
        out[j] = s;
    }
    return out;
}

// L1 (Manhattan) norm of difference
double l1(const vector<double>& a, const vector<double>& b) {
    double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += fabs(a[i] - b[i]); return s;
}

// Total variation distance = 0.5 * L1 distance
double tv(const vector<double>& a, const vector<double>& b) { return 0.5 * l1(a, b); }

// Power iteration to compute stationary distribution (left eigenvector): v <- v P until convergence
vector<double> stationary_power_iteration(const vector<vector<double>>& P, int max_iter = 100000, double tol = 1e-12) {
    int n = (int)P.size();
    vector<double> v(n, 1.0 / n), vnext(n, 0.0);
    for (int it = 0; it < max_iter; ++it) {
        vnext = rowVecTimesMat(v, P);
        // normalize to sum to 1 (guard against drift due to rounding)
        double s = accumulate(vnext.begin(), vnext.end(), 0.0);
        for (int i = 0; i < n; ++i) vnext[i] = vnext[i] / s;
        if (l1(v, vnext) < tol) return vnext;
        v.swap(vnext);
    }
    return v; // return best effort if not converged
}

// Compute distribution after t steps starting from a specific state 'start'
vector<double> dist_after_t(const vector<vector<double>>& P, int start, int t) {
    int n = (int)P.size();
    vector<double> v(n, 0.0); v[start] = 1.0; // delta at start
    for (int k = 0; k < t; ++k) v = rowVecTimesMat(v, P);
    return v;
}

// Sample next state j given current i using cumulative probabilities
int sample_next_state(const vector<vector<double>>& P, int i, mt19937& rng) {
    uniform_real_distribution<double> U(0.0, 1.0);
    double r = U(rng), c = 0.0;
    for (int j = 0; j < (int)P.size(); ++j) {
        c += P[i][j];
        if (r <= c) return j;
    }
    return (int)P.size() - 1; // fallback for numerical roundoff
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // A small ergodic chain (rows sum to 1, irreducible, aperiodic)
    vector<vector<double>> P = {
        {0.10, 0.60, 0.20, 0.10},
        {0.30, 0.20, 0.40, 0.10},
        {0.25, 0.25, 0.25, 0.25},
        {0.40, 0.10, 0.10, 0.40}
    };

    // 1) Compute stationary distribution via power iteration
    vector<double> pi = stationary_power_iteration(P);
    cout << fixed << setprecision(6);
    cout << "Stationary distribution (power iteration):\n";
    for (double x : pi) cout << x << ' '; cout << "\n\n";

    // 2) Simulate a long trajectory to estimate empirical distribution
    mt19937 rng(42);
    int n = (int)P.size();
    int start = 0, steps = 200000, burn_in = 5000;
    vector<int> counts(n, 0);
    int state = start;
    for (int t = 0; t < steps; ++t) {
        state = sample_next_state(P, state, rng);
        if (t >= burn_in) counts[state]++;
    }
    vector<double> emp(n, 0.0);
    double total = max(1, steps - burn_in);
    for (int i = 0; i < n; ++i) emp[i] = counts[i] / total;
    cout << "Empirical distribution after burn-in:\n";
    for (double x : emp) cout << x << ' '; cout << "\n";
    cout << "TV(empirical, stationary) = " << tv(emp, pi) << "\n\n";

    // 3) Rough mixing assessment: worst-case TV distance across starts after t steps
    vector<int> Ts = {1, 2, 4, 8, 16, 32, 64};
    for (int t : Ts) {
        double worst = 0.0;
        for (int s = 0; s < n; ++s) {
            vector<double> vt = dist_after_t(P, s, t);
            worst = max(worst, tv(vt, pi));
        }
        cout << "t=" << setw(2) << t << ": worst-case TV distance = " << worst << "\n";
    }

    return 0;
}

This program defines a small ergodic Markov chain, computes its stationary distribution by power iteration (repeatedly multiplying a row vector by P), simulates a long run to estimate empirical frequencies, and reports total variation distance between empirical and stationary distributions. It also estimates mixing behavior by computing, for several t, the maximum TV distance between P^t(x,·) and π across start states.

Time: Power iteration: O(k nĀ²) for k iterations. Simulation: O(steps). Mixing distances: for each tested t and each of the n start states, dist_after_t performs t matrix–vector multiplies of cost O(nĀ²), so the total is O(|T| Ā· t_max Ā· nĀ³) over the tested values.
Space: O(nĀ²) to store P; O(n) for vectors and counts.
Metropolis–Hastings sampler targeting a 1D bimodal distribution
#include <bits/stdc++.h>
using namespace std;

// Normal pdf with mean m and standard deviation s
double normal_pdf(double x, double m = 0.0, double s = 1.0) {
    static const double inv_sqrt_2pi = 1.0 / sqrt(2.0 * M_PI);
    double z = (x - m) / s;
    return inv_sqrt_2pi / s * exp(-0.5 * z * z);
}

// Target density: mixture of two unit-variance normals centered at -3 and +3
// pi(x) āˆ 0.5*N(-3,1) + 0.5*N(3,1). (This is actually normalized, but MH only needs ratios.)
double target_density(double x) {
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(123);
    normal_distribution<double> prop_noise(0.0, 1.0); // unit noise; scaled by step_sd below
    uniform_real_distribution<double> U(0.0, 1.0);

    int burn_in = 2000;
    int N = 50000;        // total iterations (including burn-in)
    double step_sd = 1.5; // proposal standard deviation; tune for acceptance

    double x = 0.0; // start between the two modes; any start is allowed
    double fx = target_density(x);

    vector<double> samples;
    samples.reserve(N - burn_in);

    int accepts = 0, tries = 0;
    for (int t = 0; t < N; ++t) {
        double y = x + step_sd * prop_noise(rng); // symmetric random-walk proposal
        double fy = target_density(y);
        double alpha = min(1.0, fy / fx);         // q symmetric => ratio simplifies
        double r = U(rng);
        tries++;
        if (r < alpha) { x = y; fx = fy; accepts++; }
        if (t >= burn_in) samples.push_back(x);
    }

    // Report acceptance and simple moments
    double mean = accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= (samples.size() - 1);

    cout << fixed << setprecision(4);
    cout << "MH acceptance rate: " << (100.0 * accepts / (double)tries) << "%\n";
    cout << "Sample size (post burn-in): " << samples.size() << "\n";
    cout << "Sample mean: " << mean << ", sample variance: " << var << "\n";

    // Print a few samples
    cout << "First 10 samples: ";
    for (int i = 0; i < 10 && i < (int)samples.size(); ++i) cout << samples[i] << ' ';
    cout << "\n";

    return 0;
}

This program implements Metropolis–Hastings with a symmetric Gaussian random-walk proposal to sample from a bimodal target (mixture of two Gaussians). The acceptance probability simplifies to min(1, Ļ€(y)/Ļ€(x)) because the proposal is symmetric. It reports the acceptance rate and basic sample statistics as a quick sanity check.

Time: O(N) for N iterations, assuming O(1) target-density evaluations.
Space: O(N) if all post burn-in samples are stored; O(1) if streaming estimates are used instead.
Gibbs sampling for a correlated bivariate Normal
#include <bits/stdc++.h>
using namespace std;

// Gibbs sampler for (X, Y) ~ N(0, Σ) with Σ = [[1, ρ], [ρ, 1]]
// Conditionals: X | Y=y ~ N(ρy, 1-ρ^2), Y | X=x ~ N(ρx, 1-ρ^2)
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double rho = 0.8; // correlation
    double cond_sd = sqrt(1.0 - rho * rho);

    mt19937 rng(2024);
    normal_distribution<double> Z(0.0, 1.0);

    int burn_in = 2000;
    int iters = 40000;

    double x = 0.0, y = 0.0; // start anywhere
    vector<pair<double,double>> samples;
    samples.reserve(iters - burn_in);

    for (int t = 0; t < iters; ++t) {
        // Sample X | Y=y
        double mean_x = rho * y;
        x = mean_x + cond_sd * Z(rng);
        // Sample Y | X=x
        double mean_y = rho * x;
        y = mean_y + cond_sd * Z(rng);
        if (t >= burn_in) samples.emplace_back(x, y);
    }

    // Empirical moments
    double mx = 0.0, my = 0.0;
    for (auto &p : samples) { mx += p.first; my += p.second; }
    mx /= samples.size(); my /= samples.size();
    double vx = 0.0, vy = 0.0, cxy = 0.0;
    for (auto &p : samples) {
        vx += (p.first - mx) * (p.first - mx);
        vy += (p.second - my) * (p.second - my);
        cxy += (p.first - mx) * (p.second - my);
    }
    vx /= (samples.size() - 1);
    vy /= (samples.size() - 1);
    cxy /= (samples.size() - 1);
    double corr = cxy / sqrt(vx * vy);

    cout << fixed << setprecision(4);
    cout << "Empirical E[X]: " << mx << ", E[Y]: " << my << " (should be ~0)\n";
    cout << "Empirical Var[X]: " << vx << ", Var[Y]: " << vy << " (should be ~1)\n";
    cout << "Empirical Corr[X,Y]: " << corr << " (target rho = " << rho << ")\n";

    // Print a few samples
    cout << "First 5 samples: ";
    for (int i = 0; i < 5 && i < (int)samples.size(); ++i) {
        cout << '(' << samples[i].first << ',' << samples[i].second << ") ";
    }
    cout << "\n";

    return 0;
}

This Gibbs sampler targets a 2D Normal with correlation ρ by alternating conditional draws: X|Y and Y|X are Normal with means ρy and ρx and variance 1 āˆ’ ĻĀ². The code collects samples after burn-in and verifies that empirical means, variances, and correlation match the target.

Time: O(T) for T iterations; each iteration performs O(1) arithmetic and random draws.
Space: O(T) if all samples are stored; O(1) if only streaming moment estimates are kept.