Markov Chains
Key Points
- A Markov chain models a system that moves between states where the next step depends only on the current state, not the past.
- The transition behavior is captured by a row-stochastic matrix P whose rows each sum to 1.
- k-step transitions are given by matrix powers: a starting distribution \mu yields the distribution \mu P^k after k steps.
- A stationary distribution \pi is a probability vector that remains unchanged by the chain: \pi P = \pi.
- For irreducible and aperiodic chains, distributions converge to a unique stationary distribution regardless of start state.
- You can simulate Markov chains efficiently with random sampling and estimate long-run frequencies.
- Matrix exponentiation and power iteration are practical C++ techniques to compute multi-step probabilities and stationary distributions.
- Beware of row-vs-column conventions, non-stochastic inputs, and non-ergodic chains that do not converge to a single stationary distribution.
Prerequisites
- Basic probability — Understanding conditional probability and distributions is essential for interpreting transition probabilities.
- Matrix operations — Markov chains evolve via matrix multiplication and matrix powers.
- Eigenvalues and eigenvectors — Stationary distributions are left eigenvectors with eigenvalue 1.
- Numerical methods — Power iteration, convergence tolerances, and floating-point stability are practical concerns.
- Graph theory (basics) — Random walks on graphs and connectivity (irreducibility) use graph concepts.
- Random number generation — Simulation requires sampling from discrete distributions.
- Series and convergence — Mixing time and convergence to stationarity rely on limits and geometric decay.
Detailed Explanation
01 Overview
A Markov chain is a mathematical model for systems that evolve step by step among a set of possible states with fixed probabilities. Its defining feature is the Markov property: the future depends only on the present, not on the full history. This makes Markov chains powerful yet tractable tools for modeling randomness in time, from weather changes to web surfing patterns.

In a discrete-time, finite-state Markov chain, the dynamics are summarized by a transition matrix P, where P_{ij} is the probability of moving from state i to state j in one time step. If you know the current distribution over states as a row vector μ, the distribution after one step is μP; after k steps it is μP^k.

Markov chains can exhibit rich long-term behavior. Some get absorbed into special states, some wander forever, and many, under mild connectivity and randomness conditions (irreducibility and aperiodicity), converge to a unique stationary distribution π satisfying πP = π. This π describes the long-run fraction of time spent in each state and underpins algorithms like PageRank.

Practically, we can study Markov chains by simulating paths, by computing powers of the transition matrix, or by iterative methods like the power iteration. In C++, these methods translate into random sampling for simulation, repeated matrix multiplications for multi-step transitions, and iterative vector-matrix multiplications for stationary distributions. The required mathematics blends probability (for interpreting transitions), linear algebra (for matrix powers and eigenvectors), and convergence ideas (for understanding when and how fast limits exist).
02 Intuition & Analogies
Think of a Markov chain like a board game where your next move depends only on the square you occupy right now, not on how you got there. Rolling dice to move in Monopoly, drawing a card in a game of chance, or flipping a coin to decide your next action—each step uses only the current position and some fixed rules. That’s the memoryless idea: yesterday’s journey doesn’t matter beyond where you ended up.

Another familiar analogy is the weather: suppose tomorrow’s weather depends only on today’s weather. If today is sunny, there’s a 70% chance tomorrow is sunny; if it’s rainy, there’s a 40% chance tomorrow is sunny. The percentages form a transition matrix. If you start with some guess of today’s weather distribution, applying those percentages once predicts tomorrow; applying them repeatedly predicts next week, next month, etc. Over time, many such systems settle into a rhythm: the proportion of sunny vs. rainy days stabilizes. That stable mix is the stationary distribution.

Simulation mirrors how you might explore a city by walking: from your current intersection, pick a random adjacent street based on posted signs (probabilities), then repeat. If the city is well-connected and you avoid perfectly periodic patterns, your long-run fraction of time at each intersection converges to a fixed profile that depends on the map, not on where you started.

Computationally, these ideas become matrix operations. The transition matrix is the rulebook; multiplying by it advances time. Raising it to the k-th power compresses k steps into one calculation. Repeating vector-matrix multiplications mimics many steps cheaply and often reveals the long-run profile. The memoryless property keeps the math and code simple: all you need for the next move is the current state or distribution, not the full log of where you’ve been.
03 Formal Definition
A discrete-time Markov chain on a finite state space S = \{1, \ldots, n\} is a sequence of random variables X_0, X_1, X_2, \ldots satisfying the Markov property: P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i) = P_{ij} for all times t and all states. The transition matrix P = (P_{ij}) is row-stochastic: P_{ij} \ge 0 and \sum_j P_{ij} = 1 for every i. A row distribution \mu^{(t)} over states evolves as \mu^{(t+1)} = \mu^{(t)} P, so \mu^{(t)} = \mu^{(0)} P^{t}. The chain is irreducible if every state can reach every other state with positive probability, and aperiodic if the greatest common divisor of return times to each state is 1. An irreducible, aperiodic finite chain is ergodic: it has a unique stationary distribution \pi with \pi P = \pi, and \mu^{(t)} \to \pi from any start.
04 When to Use
Use Markov chains whenever the system’s next step depends only on its current condition and you can assign meaningful transition probabilities. Typical cases include: modeling queues (how many customers in line next minute), weather (today’s weather predicts tomorrow’s), inventory levels (stock replenishment and demand), random walks on graphs (web surfing or network sampling), games of chance (board positions), and reliability models (component states up/down).

For planning short-term forecasts, compute \mu^{(0)} P^{k} to get distributions k steps ahead. For long-run behavior, compute or approximate a stationary distribution \pi to understand steady-state usage or occupancy (e.g., what fraction of time a server is busy). When exact formulas are complex or the state space is large, simulate many trajectories to estimate statistics like hitting probabilities or expected times to reach a target.

In ranking and recommendation (e.g., PageRank), use a modified Markov chain with teleportation to guarantee convergence and extract importance scores. In operations research, absorbing chains help compute expected number of steps until completion. In reinforcement learning, Markov decision processes (MDPs) extend chains by letting an agent choose actions that influence the transition matrix—Markov chains are the passive dynamics part.

Choose matrix exponentiation for exact k-step transitions on moderate n and large k; choose power iteration when you only need the dominant left eigenvector (stationary distribution). Choose simulation to handle complicated, high-dimensional systems where exact computation is infeasible but sampling is cheap.
⚠️ Common Mistakes
- Mixing conventions: Some texts use column vectors that evolve via v_{t+1} = P v_t; others use row vectors with v_{t+1} = v_t P. Mixing these leads to incorrect multiplications and transposed formulas. Be consistent across code and math.
- Ignoring stochasticity: Input matrices must be nonnegative with each row summing to 1 (row-stochastic). Small floating errors are fine, but significant deviations can break convergence and produce invalid probabilities. Normalize rows if necessary.
- Expecting convergence without conditions: Not all chains converge to a unique stationary distribution. Periodic chains can oscillate; reducible chains can have multiple stationary distributions; absorbing chains do not mix over transient states. Check irreducibility and aperiodicity (or add damping/teleportation) before using power iteration.
- Misinterpreting empirical frequencies: Short simulations can give misleading estimates, especially if the chain mixes slowly (small spectral gap). Use burn-in, run sufficiently long, and compute confidence intervals.
- Overusing matrix exponentiation: Computing P^k costs O(n^3 \log k), which is wasteful when you only need \mu^{(t)} for successive t. Iterative multiplication by P is cheaper for sequences of times.
- Numerical instability: Subtracting nearly equal probabilities or repeatedly multiplying can accumulate floating error. Use double precision, normalize vectors periodically, and set sensible convergence tolerances.
- Confusing detailed balance with stationarity: Detailed balance (reversibility) implies stationarity but is a stronger condition; not all stationary chains are reversible.
- Forgetting absorbing structure: For absorbing chains, the long-run distribution concentrates on absorbing states; using standard power iteration without recognizing absorption can hide useful transient metrics like expected time to absorption.
Key Formulas
Markov Property
P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i) = P_{ij}
Explanation: The probability of the next state depends only on the current state. This defines the memoryless nature of the chain.
Chapman–Kolmogorov
P^{(m+n)} = P^{(m)} P^{(n)}, \quad P^{(m+n)}_{ij} = \sum_{k} P^{(m)}_{ik} P^{(n)}_{kj}
Explanation: The (m+n)-step transition matrix equals the product of the m-step and n-step matrices. It allows building long-horizon transitions from shorter ones.
Distribution Evolution
\mu^{(t)} = \mu^{(0)} P^{t}
Explanation: A row distribution vector after t steps is the initial distribution multiplied by P raised to the t-th power. This computes forecasts for any horizon.
Stationary Distribution
\pi P = \pi, \quad \pi_i \ge 0, \quad \sum_i \pi_i = 1
Explanation: A stationary distribution is unchanged by the transition: multiplying by P leaves it the same. It describes long-run state proportions.
Detailed Balance (Reversibility)
\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j
Explanation: If this symmetry holds for all i, j, the chain is reversible with respect to \pi. It guarantees \pi is stationary and simplifies analysis.
Convergence via Spectral Gap
\| \mu^{(t)} - \pi \|_{TV} \le C\, |\lambda_2|^{t}
Explanation: Here \lambda_2 is the second-largest eigenvalue of P in modulus, and the spectral gap is 1 - |\lambda_2|. The distance to stationarity decays geometrically at a rate governed by the spectral gap. Larger gaps imply faster mixing.
Fundamental Matrix (Absorbing Chains)
Z = (I - Q)^{-1} = \sum_{k=0}^{\infty} Q^{k}
Explanation: For transient block Q of an absorbing chain, Z aggregates expected visits among transient states. It underlies absorption probabilities and times.
Expected Time to Absorption
t = Z \mathbf{1}
Explanation: Multiplying the fundamental matrix by the all-ones vector yields expected steps to absorption from each transient state.
Graph Stationary Distribution
\pi_i = \frac{\deg(i)}{2|E|}
Explanation: For a simple random walk on an undirected graph, the stationary probability is proportional to node degree. High-degree nodes are visited more often.
Teleportation Equation (PageRank form)
\pi^{(t+1)} = \alpha\, \pi^{(t)} P + (1 - \alpha)\, u
Explanation: A damped update mixes a step of the chain with a jump to distribution u. This ensures ergodicity and stable convergence of power iteration.
Complexity Analysis
- Simulation: building one discrete distribution per row costs O(n) per row once; a trajectory of T steps then costs O(T) draws (each draw is O(n) or O(log n) depending on the sampling implementation).
- Vector–matrix step (\mu \mapsto \mu P): O(n^2) per step, so t successive steps cost O(t n^2).
- Matrix exponentiation: O(n^3 \log k) to form P^k by repeated squaring; worthwhile for one large k, wasteful for successive horizons.
- Power iteration: O(n^2) per iteration; the iteration count to reach tolerance \epsilon scales roughly like \log(1/\epsilon) / (1 - |\lambda_2|), so small spectral gaps mean slow convergence.
Code Examples
#include <iostream>
#include <vector>
#include <random>
#include <numeric>
#include <iomanip>
#include <cmath>
#include <stdexcept>
using namespace std;

// Ensure each row sums to 1 within tolerance; optionally renormalize small errors
void normalize_rows(vector<vector<double>>& P, double tol = 1e-12) {
    size_t n = P.size();
    for (size_t i = 0; i < n; ++i) {
        double s = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (s <= 0.0) throw runtime_error("Row has non-positive sum; invalid transition matrix");
        for (size_t j = 0; j < n; ++j) {
            if (P[i][j] < -1e-15) throw runtime_error("Negative probability encountered");
            P[i][j] = max(0.0, P[i][j]);
        }
        // Renormalize to handle small floating deviations
        for (size_t j = 0; j < n; ++j) P[i][j] /= s;
        // Final check
        double s2 = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (fabs(s2 - 1.0) > tol) throw runtime_error("Row normalization failed");
    }
}

int main() {
    // Example: Simple weather model with 3 states: 0=Sunny, 1=Cloudy, 2=Rainy
    vector<vector<double>> P = {
        {0.7, 0.2, 0.1}, // from Sunny
        {0.5, 0.3, 0.2}, // from Cloudy
        {0.4, 0.3, 0.3}  // from Rainy
    };

    normalize_rows(P);

    size_t n = P.size();
    // Pre-build one discrete distribution per row for fast sampling
    random_device rd;
    mt19937 gen(rd());
    vector<discrete_distribution<int>> dists;
    dists.reserve(n);
    for (size_t i = 0; i < n; ++i) dists.emplace_back(P[i].begin(), P[i].end());

    // Simulation parameters
    const size_t burn_in = 1000;      // steps to discard to reduce initial bias
    const size_t samples = 1'000'000; // steps to collect statistics

    // Start from Sunny (state 0), could also randomize
    int state = 0;

    // Burn-in
    for (size_t t = 0; t < burn_in; ++t) {
        state = dists[state](gen);
    }

    // Collect empirical counts after burn-in
    vector<long long> counts(n, 0);
    for (size_t t = 0; t < samples; ++t) {
        state = dists[state](gen);
        counts[state]++;
    }

    // Convert counts to empirical distribution
    double total = static_cast<double>(samples);
    cout << fixed << setprecision(6);
    cout << "Empirical long-run distribution (approximate stationary distribution):\n";
    for (size_t i = 0; i < n; ++i) {
        cout << "State " << i << ": " << counts[i] / total << "\n";
    }

    return 0;
}
This program simulates a discrete-time Markov chain with three weather states. It validates and normalizes the transition matrix, constructs sampling distributions for each state, runs a burn-in to reduce dependence on the initial state, and then collects a large number of samples to estimate the long-run fraction of time spent in each state. For an ergodic chain, these empirical frequencies approximate the stationary distribution.
#include <iostream>
#include <vector>
#include <iomanip>
#include <stdexcept>
#include <numeric>
#include <cmath>
using namespace std;

using Matrix = vector<vector<double>>;

void normalize_rows(Matrix& P, double tol = 1e-12) {
    size_t n = P.size();
    for (size_t i = 0; i < n; ++i) {
        double s = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (s <= 0.0) throw runtime_error("Row has non-positive sum");
        for (size_t j = 0; j < n; ++j) P[i][j] = max(0.0, P[i][j]);
        for (size_t j = 0; j < n; ++j) P[i][j] /= s;
        double s2 = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (fabs(s2 - 1.0) > tol) throw runtime_error("Row normalization failed");
    }
}

Matrix multiply(const Matrix& A, const Matrix& B) {
    size_t n = A.size();
    Matrix C(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            double aik = A[i][k];
            if (aik == 0.0) continue; // skip zero entries for speed
            for (size_t j = 0; j < n; ++j) {
                C[i][j] += aik * B[k][j];
            }
        }
    }
    return C;
}

Matrix mat_pow(Matrix base, long long exp) {
    size_t n = base.size();
    // Identity matrix
    Matrix result(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) result[i][i] = 1.0;
    while (exp > 0) {
        if (exp & 1LL) result = multiply(result, base);
        base = multiply(base, base);
        exp >>= 1LL;
    }
    return result;
}

vector<double> rowvec_mul(const vector<double>& v, const Matrix& M) {
    size_t n = v.size();
    vector<double> out(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t k = 0; k < n; ++k) s += v[k] * M[k][j];
        out[j] = s;
    }
    return out;
}

void normalize_dist(vector<double>& v) {
    double s = accumulate(v.begin(), v.end(), 0.0);
    for (double& x : v) x = (s > 0 ? max(0.0, x) / s : 0.0);
}

int main() {
    // Transition matrix P (4-state example)
    Matrix P = {
        {0.6, 0.3, 0.1, 0.0},
        {0.2, 0.5, 0.3, 0.0},
        {0.1, 0.3, 0.5, 0.1},
        {0.0, 0.2, 0.3, 0.5}
    };
    normalize_rows(P);

    // Initial distribution (row vector)
    vector<double> mu0 = {1.0, 0.0, 0.0, 0.0}; // start deterministically in state 0
    normalize_dist(mu0);

    long long k = 20; // number of steps ahead

    Matrix Pk = mat_pow(P, k);
    vector<double> muk = rowvec_mul(mu0, Pk);
    normalize_dist(muk); // guard against minor floating drift

    cout << fixed << setprecision(6);
    cout << k << "-step distribution from state 0:\n";
    for (size_t i = 0; i < muk.size(); ++i) cout << "State " << i << ": " << muk[i] << "\n";
    return 0;
}
This program computes the k-step state distribution exactly (up to floating error) by raising the transition matrix to the k-th power using fast exponentiation, then multiplying the initial distribution from the left. It’s appropriate when you need accurate multi-step forecasts for moderate n and large k.
#include <iostream>
#include <vector>
#include <numeric>
#include <iomanip>
#include <cmath>
#include <stdexcept>
using namespace std;

using Matrix = vector<vector<double>>;

void normalize_rows(Matrix& P) {
    for (auto& row : P) {
        for (double& x : row) x = max(0.0, x);
        double s = accumulate(row.begin(), row.end(), 0.0);
        if (s <= 0.0) throw runtime_error("Invalid row (sum <= 0)");
        for (double& x : row) x /= s;
    }
}

vector<double> multiply_left(const vector<double>& v, const Matrix& P) {
    size_t n = v.size();
    vector<double> out(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += v[i] * P[i][j];
        out[j] = s;
    }
    return out;
}

int main() {
    // Example 4-state chain (can be viewed as a small web graph surfer)
    Matrix P = {
        {0.05, 0.90, 0.05, 0.00},
        {0.10, 0.10, 0.70, 0.10},
        {0.20, 0.10, 0.50, 0.20},
        {0.25, 0.25, 0.25, 0.25} // uniform row helps aperiodicity
    };
    normalize_rows(P);

    size_t n = P.size();

    // Teleportation parameters (set alpha < 1 to guarantee ergodicity)
    const double alpha = 0.85;    // damping factor; set to 1.0 to disable teleportation
    vector<double> u(n, 1.0 / n); // teleportation distribution (uniform)

    // Power iteration setup
    vector<double> pi(n, 1.0 / n); // start from uniform distribution
    vector<double> next(n, 0.0);

    const int max_iters = 10000;
    const double tol = 1e-12; // L1 tolerance

    for (int it = 0; it < max_iters; ++it) {
        // next = alpha * (pi P) + (1 - alpha) * u
        vector<double> piP = multiply_left(pi, P);
        for (size_t j = 0; j < n; ++j) next[j] = alpha * piP[j] + (1.0 - alpha) * u[j];

        // Normalize to counter floating drift
        double s = accumulate(next.begin(), next.end(), 0.0);
        for (double& x : next) x /= s;

        // Check L1 convergence
        double diff = 0.0;
        for (size_t j = 0; j < n; ++j) diff += fabs(next[j] - pi[j]);
        pi.swap(next);
        if (diff < tol) {
            cerr << "Converged in " << it + 1 << " iterations\n";
            break;
        }
        if (it == max_iters - 1) cerr << "Warning: did not converge within max_iters\n";
    }

    cout << fixed << setprecision(10);
    cout << "Approximate stationary distribution (with damping):\n";
    for (size_t i = 0; i < n; ++i) cout << "State " << i << ": " << pi[i] << "\n";

    return 0;
}
This program computes an approximate stationary distribution using power iteration. With α < 1, it implements a damped (teleporting) chain that is always ergodic, ensuring convergence. Setting α = 1 reduces to standard power iteration and requires the chain to be irreducible and aperiodic to converge from any start.