Markov Chain Theory
Key Points
- A Markov chain is a random process where the next state depends only on the current state, not the full history.
- The transition matrix P encodes one-step move probabilities, and a stationary distribution π satisfies πP = π.
- If a finite chain is irreducible and aperiodic (ergodic), it converges to a unique stationary distribution regardless of the start state.
- Mixing time measures how many steps are needed for the chain to get close to stationarity in total variation distance.
- Reversible chains satisfy detailed balance π_i P_{ij} = π_j P_{ji}, which guarantees stationarity and simplifies analysis.
- MCMC methods (e.g., Metropolis–Hastings and Gibbs sampling) build Markov chains whose stationary distribution is a desired target distribution.
- Coupling is a powerful technique to bound mixing time by making two copies of the chain meet quickly.
- In AI/ML, Markov chains underpin HMMs, PageRank, Bayesian inference via MCMC, and many stochastic optimization techniques.
Prerequisites
- Basic probability theory – Understanding conditional probability, random variables, and distributions is essential for the Markov property and transition probabilities.
- Linear algebra – Transition matrices, eigenvectors, and power iteration rely on matrix operations and spectral concepts.
- Calculus and multivariate Gaussian distributions – Needed to derive and understand Gibbs conditionals and to evaluate common target densities.
- Numerical methods – Power iteration convergence, normalization, and error tolerances are numerical-analysis topics.
- C++ programming and RNG usage – Implementing simulations, using <random> (e.g., std::mt19937), and managing vectors/matrices require C++ proficiency.
Detailed Explanation
01 Overview
Markov chain theory studies random processes that evolve in discrete time by jumping between states according to fixed probabilities. The key property is memorylessness: the next step depends only on the current state. This behavior is represented by a transition matrix P, where entry P_{ij} gives the probability of moving from state i to state j in one step. Iterating the process multiplies probabilities by powers of P, so t-step transitions are captured by P^t. A central object is the stationary distribution π, a probability vector that remains unchanged by the dynamics (πP = π). For finite chains that are irreducible (every state can reach every other) and aperiodic (no fixed cycle timing), the chain is ergodic: P^t(x,·) converges to π for every start state x. The speed of this convergence is the mixing time, often controlled by spectral properties of P (e.g., the second-largest eigenvalue magnitude) or geometric arguments like conductance. Reversible chains satisfy detailed balance π_i P_{ij} = π_j P_{ji}, making analysis and algorithm design (especially MCMC) more tractable. In machine learning, we exploit these ideas to sample from complex distributions via Metropolis–Hastings and Gibbs sampling, to model sequential data with HMMs, and to analyze graph processes such as PageRank. Practical work involves simulating chains, diagnosing convergence, and trading off step size versus acceptance in MCMC.
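To make these objects concrete, here is a minimal sketch for a two-state chain; the 0.9/0.1 and 0.5/0.5 transition probabilities are illustrative values (not taken from the examples later on this page), and the iterate v P^t visibly approaches the stationary vector π = (5/6, 1/6):

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Hypothetical two-state "weather" chain: state 0 = sunny, state 1 = rainy.
    // Row-stochastic transition matrix P (each row sums to 1); values are illustrative.
    double P[2][2] = {{0.9, 0.1},
                      {0.5, 0.5}};

    // Start with all probability mass on "sunny" and apply v <- v P repeatedly.
    double v[2] = {1.0, 0.0};
    for (int t = 1; t <= 20; ++t) {
        double next[2] = {v[0] * P[0][0] + v[1] * P[1][0],
                          v[0] * P[0][1] + v[1] * P[1][1]};
        v[0] = next[0]; v[1] = next[1];
    }

    // For this chain the stationary distribution solves pi P = pi: pi = (5/6, 1/6).
    // After 20 steps the iterate is already very close to it.
    cout << fixed << setprecision(6)
         << "P^20 from sunny: (" << v[0] << ", " << v[1] << ")\n"
         << "stationary pi:   (" << 5.0 / 6.0 << ", " << 1.0 / 6.0 << ")\n";
    return 0;
}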
02 Intuition & Analogies
Imagine walking around a city using a simple rule: at each intersection (state), you choose your next street based only on where you currently are, not on the whole path you took. That's a Markov chain. The city map with signs telling you the probabilities of each outgoing street is the transition matrix. If you keep walking, you eventually develop a "routine" of how often you find yourself in each neighborhood; this long-run frequency is the stationary distribution. If the city is well-connected (irreducible) and has no forced rhythm like alternating only between uptown and downtown (aperiodic), then no matter where you start, you'll settle into the same routine. Mixing time answers: how quickly do you forget where you started? Stirring cream into coffee is a physical analogy: after enough stirs (steps), you can't tell where the cream started; the mixture is uniform (stationary). Detailed balance is like two-way traffic being balanced: the flow from A to B equals the flow from B to A in the long run, making the overall pattern stable. MCMC is like designing turn-by-turn rules so that your wandering naturally spends time in places proportional to how much you care about them (your target distribution), even if you don't know the exact map of desirability. Metropolis–Hastings says: suggest a move, then probabilistically accept it so that long-run visiting frequencies match your target. Gibbs sampling says: hold all but one coordinate fixed, then update that coordinate from its local rule; repeat until the whole system equilibrates. Coupling imagines two walkers sharing the same random choices from different starts; when they meet and move together, you've essentially proven the process has forgotten its beginning.
03 Formal Definition
A (discrete-time) Markov chain on a finite state space S is a sequence of random variables X_0, X_1, X_2, … such that P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, …, X_0 = i_0) = P(X_{n+1} = j | X_n = i) = P_{ij} for all states and all histories. The matrix P = (P_{ij}) is row-stochastic: P_{ij} ≥ 0 and Σ_j P_{ij} = 1 for every i. A distribution π on S is stationary if πP = π. The chain is irreducible if every state can reach every other with positive probability, and aperiodic if the greatest common divisor of return times to each state is 1.
04 When to Use
Use Markov chains whenever you need to model or compute with sequential uncertainty where the present suffices to predict the future. In AI/ML, they underpin Hidden Markov Models for time series, reinforcement learning dynamics (state transitions in MDPs), and random walks on graphs (PageRank, node ranking, community detection). For Bayesian inference, use MCMC (Metropolis–Hastings, Gibbs) to draw approximate samples from complicated posteriors where direct sampling or integration is impossible; detailed balance and stationarity ensure that long-run samples are approximately distributed according to the target. Use reversible chains for simpler analysis and tuning; use nonreversible proposals when you seek faster mixing in practice (though proofs are harder). Apply power iteration or eigen-solvers to find stationary distributions in finite chains, especially for ranking problems or to validate MCMC samplers. Use coupling or spectral-gap bounds to reason about mixing time when you need guarantees on convergence speed. Prefer Gibbs sampling when conditional distributions are easy to sample (e.g., conjugate exponential-family models); prefer Metropolis–Hastings when only an unnormalized density is available. In simulation, use burn-in to reduce start-up bias and thin samples if storage is limited or autocorrelation is high.
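The burn-in and thinning advice can be sketched as simple post-processing. In the snippet below the AR(1)-style sequence is only a stand-in for real sampler output (an assumption for illustration); discarding a warm-up period removes the deliberately bad start, and thinning by 10 visibly reduces lag-1 autocorrelation:

#include <bits/stdc++.h>
using namespace std;

// Lag-1 autocorrelation of a sample sequence (simple plug-in estimator).
double lag1_autocorr(const vector<double>& x) {
    size_t n = x.size();
    double mean = accumulate(x.begin(), x.end(), 0.0) / n;
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        den += (x[i] - mean) * (x[i] - mean);
        if (i + 1 < n) num += (x[i] - mean) * (x[i + 1] - mean);
    }
    return num / den;
}

int main() {
    // Stand-in for raw MCMC output: an AR(1)-style correlated sequence.
    // (Illustrative only; in practice `raw` would come from your sampler.)
    mt19937 rng(7);
    normal_distribution<double> Z(0.0, 1.0);
    vector<double> raw(20000);
    raw[0] = 5.0; // deliberately bad start to mimic start-up bias
    for (size_t t = 1; t < raw.size(); ++t) raw[t] = 0.9 * raw[t - 1] + Z(rng);

    int burn_in = 1000, thin = 10;
    vector<double> kept;
    for (size_t t = burn_in; t < raw.size(); t += thin) kept.push_back(raw[t]);

    cout << fixed << setprecision(3);
    cout << "lag-1 autocorrelation, raw:     " << lag1_autocorr(raw) << "\n";
    cout << "lag-1 autocorrelation, thinned: " << lag1_autocorr(kept) << "\n";
    return 0;
}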
⚠️ Common Mistakes
- Confusing the Markov property with independence: in a Markov chain, steps are dependent, just not on the full history. Check that P(X_{n+1}|X_n) fully defines the dynamics.
- Building P with rows that don't sum to 1 or have negative entries; always validate row-stochasticity to avoid invalid probabilities (see the validation sketch after this list).
- Assuming convergence without ergodicity: periodic or reducible chains may never settle to a unique π. Inspect connectivity and period (e.g., add self-loops to break periodicity).
- Equating stationarity with reversibility: detailed balance implies stationarity, but not all stationary chains are reversible. Don't over-restrict models when reversibility isn't needed.
- Using MCMC without diagnostics: relying on a fixed number of iterations may yield biased estimates if mixing is slow. Monitor trace plots, autocorrelation, R-hat, and effective sample size.
- Poor MH tuning: too small a proposal step causes slow exploration (high acceptance but high autocorrelation); too large causes frequent rejections (low acceptance). Target a moderate acceptance rate (e.g., 20–40% in many 1D/low-D problems).
- Ignoring burn-in and thinning: initial samples reflect the start state; discard a warm-up period and consider thinning only if storage is constrained.
- Numerical pitfalls in power iteration: lack of normalization, too loose a tolerance, or nearly defective matrices can stall convergence; re-normalize vectors and cap iterations.
- Miscomputing total variation distance: remember the factor 1/2 and compare distributions, not raw counts; convert empirical counts to probabilities.
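A small sketch of the row-stochasticity check and the counts-to-probabilities TV computation from the list above (the 2x2 matrix, counts, and tolerance are made-up illustrative values):

#include <bits/stdc++.h>
using namespace std;

// Check that every row of P is a probability distribution (non-negative, sums to ~1).
bool is_row_stochastic(const vector<vector<double>>& P, double tol = 1e-9) {
    for (const auto& row : P) {
        double s = 0.0;
        for (double p : row) {
            if (p < -tol) return false; // negative entry
            s += p;
        }
        if (fabs(s - 1.0) > tol) return false; // row does not sum to 1
    }
    return true;
}

// Total variation between an empirical count vector and a distribution:
// convert counts to probabilities first, then take half the L1 distance.
double tv_counts_vs_dist(const vector<long long>& counts, const vector<double>& pi) {
    double total = accumulate(counts.begin(), counts.end(), 0LL);
    double s = 0.0;
    for (size_t i = 0; i < counts.size(); ++i) s += fabs(counts[i] / total - pi[i]);
    return 0.5 * s;
}

int main() {
    vector<vector<double>> P = {{0.7, 0.3}, {0.4, 0.6}};   // illustrative chain
    cout << boolalpha << "row-stochastic: " << is_row_stochastic(P) << "\n";
    vector<long long> counts = {5712, 4288};                // e.g., visit counts from a run
    vector<double> pi = {4.0 / 7.0, 3.0 / 7.0};             // stationary distribution of P
    cout << "TV(empirical, pi) = " << tv_counts_vs_dist(counts, pi) << "\n";
    return 0;
}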
Key Formulas
Markov Property
P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, …, X_0 = i_0) = P(X_{n+1} = j | X_n = i) = P_{ij}
Explanation: This is the memoryless condition: the distribution of the next state depends only on the current state, not the entire history.
Row-Stochastic Matrix
P_{ij} ≥ 0 for all i, j, and Σ_j P_{ij} = 1 for every row i
Explanation: Each row of the transition matrix represents a valid probability distribution over next states.
t-Step Transitions
P(X_t = j | X_0 = i) = (P^t)_{ij}
Explanation: Raising P to the t-th power gives the probability of moving from state i to state j in t steps.
Stationary Distribution
πP = π, with π_i ≥ 0 and Σ_i π_i = 1
Explanation: A distribution π is stationary if, after a transition, the distribution does not change. It is a left eigenvector of P with eigenvalue 1.
Detailed Balance
π_i P_{ij} = π_j P_{ji} for all states i, j
Explanation: This condition implies reversibility and guarantees that π is stationary. It balances probability flow between any two states.
Total Variation Distance
||μ − ν||_TV = (1/2) Σ_x |μ(x) − ν(x)|
Explanation: Total variation measures how different two distributions are; it is half the L1 distance and ranges from 0 to 1.
Mixing Time
t_mix(ε) = min{ t : max_x ||P^t(x,·) − π||_TV ≤ ε }
Explanation: Mixing time is the number of steps needed so that, from any start state, the chain is within ε of stationarity in total variation.
Convergence to Stationarity
||P^t(x,·) − π||_TV → 0 as t → ∞, for every start state x
Explanation: For finite, irreducible, aperiodic chains, the distribution after t steps converges to the unique stationary distribution.
Spectral Convergence (Reversible Case)
||P^t(x,·) − π||_2 ≤ C(x) · λ_*^t, where λ_* = max{ |λ| : λ is an eigenvalue of P, λ ≠ 1 }
Explanation: In reversible chains, the second-largest eigenvalue in magnitude governs geometric decay of deviations from stationarity in the Euclidean norm.
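A tiny sketch of this geometric decay, using a two-state chain whose second eigenvalue 1 − a − b is known in closed form (a and b are illustrative values); the successive TV distances shrink by exactly that factor:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Reversible two-state chain P = [[1-a, a], [b, 1-b]] has eigenvalues 1 and 1-a-b,
    // so deviations from pi shrink by a factor |1-a-b| each step.
    double a = 0.2, b = 0.3;                  // illustrative values
    double lambda2 = 1.0 - a - b;             // second eigenvalue = 0.5 here
    double pi0 = b / (a + b), pi1 = a / (a + b);

    double v0 = 1.0, v1 = 0.0;                // start in state 0
    double prev_tv = -1.0;
    cout << fixed << setprecision(6);
    for (int t = 1; t <= 8; ++t) {
        double n0 = v0 * (1 - a) + v1 * b;
        double n1 = v0 * a + v1 * (1 - b);
        v0 = n0; v1 = n1;
        double tv = 0.5 * (fabs(v0 - pi0) + fabs(v1 - pi1));
        cout << "t=" << t << "  TV=" << tv;
        if (prev_tv > 0) cout << "  ratio=" << tv / prev_tv << "  (lambda2=" << lambda2 << ")";
        cout << "\n";
        prev_tv = tv;
    }
    return 0;
}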
Coupling Inequality
||μ − ν||_TV ≤ P(X ≠ Y) for any coupling (X, Y) with X ~ μ and Y ~ ν
Explanation: For any coupling of μ and ν, the probability that the coupled variables differ bounds the total variation distance.
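A minimal Monte Carlo sketch of this bound (the 3-state matrix, seed, and start states are illustrative): two copies of the chain share the same uniform draw each step, and the fraction of runs in which they have not yet met upper-bounds the TV distance between their time-t distributions.

#include <bits/stdc++.h>
using namespace std;

// Step a chain from state i using a shared uniform draw u (same u => coupled moves).
int step(const vector<vector<double>>& P, int i, double u) {
    double c = 0.0;
    for (int j = 0; j < (int)P.size(); ++j) {
        c += P[i][j];
        if (u <= c) return j;
    }
    return (int)P.size() - 1;
}

int main() {
    // Illustrative ergodic chain (not the one from the code examples below).
    vector<vector<double>> P = {{0.5, 0.3, 0.2},
                                {0.2, 0.5, 0.3},
                                {0.3, 0.2, 0.5}};
    mt19937 rng(99);
    uniform_real_distribution<double> U(0.0, 1.0);

    int trials = 100000;
    vector<int> Ts = {1, 2, 4, 8};
    for (int T : Ts) {
        int not_met = 0;
        for (int k = 0; k < trials; ++k) {
            int x = 0, y = 2;                 // two copies, different starts
            for (int t = 0; t < T; ++t) {
                double u = U(rng);            // shared randomness; once equal, they stay equal
                x = step(P, x, u);
                y = step(P, y, u);
            }
            if (x != y) not_met++;
        }
        // P(X_T != Y_T) upper-bounds ||P^T(0,.) - P^T(2,.)||_TV.
        cout << "T=" << T << "  P(copies differ) ~= "
             << fixed << setprecision(4) << (double)not_met / trials << "\n";
    }
    return 0;
}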
Metropolis–Hastings Acceptance
α(x, y) = min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] }
Explanation: This acceptance probability ensures the Markov chain has stationary distribution π even when the proposal q is asymmetric.
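A hedged sketch of the asymmetric case, where the proposal ratio q(y,x)/q(x,y) cannot be dropped; the Exponential(1) target and multiplicative log-normal proposal are chosen here purely for illustration:

#include <bits/stdc++.h>
using namespace std;

// Unnormalized target: Exponential(1) on x > 0 (illustrative choice).
double target(double x) { return x > 0 ? exp(-x) : 0.0; }

// Asymmetric proposal: y = x * exp(sigma * Z), i.e. LogNormal(ln x, sigma^2).
// q(x, y) is the density of proposing y when the chain is at x.
double q(double x, double y, double sigma) {
    if (x <= 0 || y <= 0) return 0.0;
    double z = (log(y) - log(x)) / sigma;
    return exp(-0.5 * z * z) / (y * sigma * sqrt(2.0 * M_PI));
}

int main() {
    mt19937 rng(5);
    normal_distribution<double> Z(0.0, 1.0);
    uniform_real_distribution<double> U(0.0, 1.0);

    double sigma = 0.8, x = 1.0;
    int N = 100000, burn_in = 1000;
    double sum = 0.0; long long kept = 0;

    for (int t = 0; t < N; ++t) {
        double y = x * exp(sigma * Z(rng));                    // asymmetric move
        double alpha = min(1.0, target(y) * q(y, x, sigma) /   // full Hastings ratio:
                                (target(x) * q(x, y, sigma))); // pi(y) q(y,x) / (pi(x) q(x,y))
        if (U(rng) < alpha) x = y;
        if (t >= burn_in) { sum += x; kept++; }
    }
    // Exponential(1) has mean 1, so the long-run average should be close to 1.
    cout << fixed << setprecision(3) << "estimated E[X] = " << sum / kept << "\n";
    return 0;
}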
Ergodic Theorem for Markov Chains
(1/T) Σ_{t=1}^{T} f(X_t) → E_π[f] almost surely as T → ∞
Explanation: Time averages along a single long run converge to expectations under the stationary distribution, justifying MCMC estimation.
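A short sketch of this estimator: the time average of f along one long run of a two-state chain (transition probabilities and f are illustrative values) is compared against the exact stationary expectation.

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Illustrative two-state chain P = [[0.9, 0.1], [0.5, 0.5]] with pi = (5/6, 1/6).
    double P[2][2] = {{0.9, 0.1}, {0.5, 0.5}};
    double pi[2] = {5.0 / 6.0, 1.0 / 6.0};
    double f[2] = {2.0, -1.0};                    // arbitrary function of the state

    mt19937 rng(11);
    uniform_real_distribution<double> U(0.0, 1.0);

    int state = 1;                                // any start state works (ergodicity)
    long long T = 2000000;
    double time_avg = 0.0;
    for (long long t = 0; t < T; ++t) {
        state = (U(rng) < P[state][0]) ? 0 : 1;   // sample the next state
        time_avg += f[state];
    }
    time_avg /= T;

    double stat_avg = pi[0] * f[0] + pi[1] * f[1]; // E_pi[f] = 2*(5/6) - 1*(1/6) = 1.5
    cout << fixed << setprecision(4)
         << "time average = " << time_avg << "\n"
         << "E_pi[f]      = " << stat_avg << "\n";
    return 0;
}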
Complexity Analysis
For a dense n-state chain, one power-iteration step (v ← vP) costs O(n²) time, and storing P costs O(n²) memory; the number of steps to reach a given tolerance is governed by the spectral gap. Simulating T steps of the chain costs O(nT) with a linear scan of the current row per step (or O(T log n) after precomputing cumulative row sums). Each Metropolis–Hastings or Gibbs iteration in the examples below performs O(1) density or conditional evaluations, so N iterations cost O(N).
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Multiply a row vector v (size n) by matrix P (n x n): v' = v P
vector<double> rowVecTimesMat(const vector<double>& v, const vector<vector<double>>& P) {
    int n = (int)P.size();
    vector<double> out(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += v[i] * P[i][j];
        out[j] = s;
    }
    return out;
}

// L1 (Manhattan) norm of difference
double l1(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += fabs(a[i] - b[i]);
    return s;
}

// Total variation distance = 0.5 * L1 distance
double tv(const vector<double>& a, const vector<double>& b) { return 0.5 * l1(a, b); }

// Power iteration to compute stationary distribution (left eigenvector): v <- v P until convergence
vector<double> stationary_power_iteration(const vector<vector<double>>& P, int max_iter = 100000, double tol = 1e-12) {
    int n = (int)P.size();
    vector<double> v(n, 1.0 / n), vnext(n, 0.0);
    for (int it = 0; it < max_iter; ++it) {
        vnext = rowVecTimesMat(v, P);
        // normalize to sum to 1 (guard against drift due to rounding)
        double s = accumulate(vnext.begin(), vnext.end(), 0.0);
        for (int i = 0; i < n; ++i) vnext[i] = vnext[i] / s;
        if (l1(v, vnext) < tol) return vnext;
        v.swap(vnext);
    }
    return v; // return best effort if not converged
}

// Compute distribution after t steps starting from a specific state 'start'
vector<double> dist_after_t(const vector<vector<double>>& P, int start, int t) {
    int n = (int)P.size();
    vector<double> v(n, 0.0);
    v[start] = 1.0; // delta at start
    for (int k = 0; k < t; ++k) v = rowVecTimesMat(v, P);
    return v;
}

// Sample next state j given current i using cumulative probabilities
int sample_next_state(const vector<vector<double>>& P, int i, mt19937& rng) {
    uniform_real_distribution<double> U(0.0, 1.0);
    double r = U(rng), c = 0.0;
    for (int j = 0; j < (int)P.size(); ++j) {
        c += P[i][j];
        if (r <= c) return j;
    }
    return (int)P.size() - 1; // fallback for numerical roundoff
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // A small ergodic chain (rows sum to 1, irreducible, aperiodic)
    vector<vector<double>> P = {
        {0.10, 0.60, 0.20, 0.10},
        {0.30, 0.20, 0.40, 0.10},
        {0.25, 0.25, 0.25, 0.25},
        {0.40, 0.10, 0.10, 0.40}
    };

    // 1) Compute stationary distribution via power iteration
    vector<double> pi = stationary_power_iteration(P);
    cout << fixed << setprecision(6);
    cout << "Stationary distribution (power iteration):\n";
    for (double x : pi) cout << x << ' ';
    cout << "\n\n";

    // 2) Simulate a long trajectory to estimate empirical distribution
    mt19937 rng(42);
    int n = (int)P.size();
    int start = 0, steps = 200000, burn_in = 5000;
    vector<int> counts(n, 0);
    int state = start;
    for (int t = 0; t < steps; ++t) {
        state = sample_next_state(P, state, rng);
        if (t >= burn_in) counts[state]++;
    }
    vector<double> emp(n, 0.0);
    double total = max(1, steps - burn_in);
    for (int i = 0; i < n; ++i) emp[i] = counts[i] / total;
    cout << "Empirical distribution after burn-in:\n";
    for (double x : emp) cout << x << ' ';
    cout << "\n";
    cout << "TV(empirical, stationary) = " << tv(emp, pi) << "\n\n";

    // 3) Rough mixing assessment: worst-case TV distance across starts after t steps
    vector<int> Ts = {1, 2, 4, 8, 16, 32, 64};
    for (int t : Ts) {
        double worst = 0.0;
        for (int s = 0; s < n; ++s) {
            vector<double> vt = dist_after_t(P, s, t);
            worst = max(worst, tv(vt, pi));
        }
        cout << "t=" << setw(2) << t << ": worst-case TV distance = " << worst << "\n";
    }

    return 0;
}
This program defines a small ergodic Markov chain, computes its stationary distribution by power iteration (repeatedly multiplying a row vector by P), simulates a long run to estimate empirical frequencies, and reports total variation distance between empirical and stationary distributions. It also estimates mixing behavior by computing, for several t, the maximum TV distance between P^t(x,·) and π across start states.
#include <bits/stdc++.h>
using namespace std;

// Normal pdf (mean m, sd s)
double normal_pdf(double x, double m = 0.0, double s = 1.0) {
    static const double inv_sqrt_2pi = 1.0 / sqrt(2.0 * M_PI);
    double z = (x - m) / s;
    return inv_sqrt_2pi / s * exp(-0.5 * z * z);
}

// Target density: mixture of two unit-variance normals centered at -3 and +3
// pi(x) = 0.5*N(-3,1) + 0.5*N(3,1). (This is normalized, but MH only needs ratios.)
double target_density(double x) {
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(123);
    normal_distribution<double> prop_noise(0.0, 1.0); // unit proposal noise; scaled by step_sd below
    uniform_real_distribution<double> U(0.0, 1.0);

    int burn_in = 2000;
    int N = 50000;        // total iterations (including burn-in)
    double step_sd = 1.5; // proposal standard deviation (tune for acceptance)

    double x = 0.0;       // start between the two modes; any start is allowed
    double fx = target_density(x);

    vector<double> samples;
    samples.reserve(N - burn_in);

    int accepts = 0, tries = 0;
    for (int t = 0; t < N; ++t) {
        double y = x + step_sd * prop_noise(rng); // symmetric random-walk proposal
        double fy = target_density(y);
        double alpha = min(1.0, fy / fx);         // q symmetric => ratio simplifies
        double r = U(rng);
        tries++;
        if (r < alpha) { x = y; fx = fy; accepts++; }
        if (t >= burn_in) samples.push_back(x);
    }

    // Report acceptance and simple moments
    double mean = accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= (samples.size() - 1);

    cout << fixed << setprecision(4);
    cout << "MH acceptance rate: " << (100.0 * accepts / (double)tries) << "%\n";
    cout << "Sample size (post burn-in): " << samples.size() << "\n";
    cout << "Sample mean: " << mean << ", sample variance: " << var << "\n";

    // Print a few samples
    cout << "First 10 samples: ";
    for (int i = 0; i < 10 && i < (int)samples.size(); ++i) cout << samples[i] << ' ';
    cout << "\n";

    return 0;
}
This program implements Metropolis–Hastings with a symmetric Gaussian random-walk proposal to sample from a bimodal target (mixture of two Gaussians). The acceptance probability simplifies to min(1, π(y)/π(x)) because the proposal is symmetric. It reports the acceptance rate and basic sample statistics as a quick sanity check.
#include <bits/stdc++.h>
using namespace std;

// Gibbs sampler for (X, Y) ~ N(0, Sigma) with Sigma = [[1, rho], [rho, 1]]
// Conditionals: X | Y=y ~ N(rho*y, 1-rho^2), Y | X=x ~ N(rho*x, 1-rho^2)
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double rho = 0.8; // correlation
    double cond_sd = sqrt(1.0 - rho * rho);

    mt19937 rng(2024);
    normal_distribution<double> Z(0.0, 1.0);

    int burn_in = 2000;
    int iters = 40000;

    double x = 0.0, y = 0.0; // start anywhere
    vector<pair<double,double>> samples;
    samples.reserve(iters - burn_in);

    for (int t = 0; t < iters; ++t) {
        // Sample X | Y=y
        double mean_x = rho * y;
        x = mean_x + cond_sd * Z(rng);
        // Sample Y | X=x
        double mean_y = rho * x;
        y = mean_y + cond_sd * Z(rng);
        if (t >= burn_in) samples.emplace_back(x, y);
    }

    // Empirical moments
    double mx = 0.0, my = 0.0;
    for (auto &p : samples) { mx += p.first; my += p.second; }
    mx /= samples.size(); my /= samples.size();
    double vx = 0.0, vy = 0.0, cxy = 0.0;
    for (auto &p : samples) {
        vx += (p.first - mx) * (p.first - mx);
        vy += (p.second - my) * (p.second - my);
        cxy += (p.first - mx) * (p.second - my);
    }
    vx /= (samples.size() - 1);
    vy /= (samples.size() - 1);
    cxy /= (samples.size() - 1);
    double corr = cxy / sqrt(vx * vy);

    cout << fixed << setprecision(4);
    cout << "Empirical E[X]: " << mx << ", E[Y]: " << my << " (should be ~0)\n";
    cout << "Empirical Var[X]: " << vx << ", Var[Y]: " << vy << " (should be ~1)\n";
    cout << "Empirical Corr[X,Y]: " << corr << " (target rho = " << rho << ")\n";

    // Print a few samples
    cout << "First 5 samples: ";
    for (int i = 0; i < 5 && i < (int)samples.size(); ++i) {
        cout << '(' << samples[i].first << ',' << samples[i].second << ") ";
    }
    cout << "\n";

    return 0;
}
This Gibbs sampler targets a 2D Normal with correlation ρ by alternating conditional draws: X|Y and Y|X are Normal with means ρy and ρx and variance 1−ρ². The code collects samples after burn-in and verifies that empirical means, variances, and correlation match the target.