
Markov Chains

Key Points

  • A Markov chain models a system that moves between states where the next step depends only on the current state, not the past.
  • The transition behavior is captured by a row-stochastic matrix P whose rows each sum to 1.
  • k-step transitions are given by matrix powers: the starting distribution times P^k yields the distribution after k steps.
  • A stationary distribution π is a probability vector that remains unchanged by the chain: πP = π.
  • For irreducible and aperiodic chains, distributions converge to a unique stationary distribution regardless of the start state.
  • You can simulate Markov chains efficiently with random sampling and estimate long-run frequencies.
  • Matrix exponentiation and power iteration are practical C++ techniques for computing multi-step probabilities and stationary distributions.
  • Beware of row-vs-column conventions, non-stochastic inputs, and non-ergodic chains that do not converge to a single stationary distribution.

Prerequisites

  • Basic probability — Understanding conditional probability and distributions is essential for interpreting transition probabilities.
  • Matrix operations — Markov chains evolve via matrix multiplication and matrix powers.
  • Eigenvalues and eigenvectors — Stationary distributions are left eigenvectors with eigenvalue 1.
  • Numerical methods — Power iteration, convergence tolerances, and floating-point stability are practical concerns.
  • Graph theory (basics) — Random walks on graphs and connectivity (irreducibility) use graph concepts.
  • Random number generation — Simulation requires sampling from discrete distributions.
  • Series and convergence — Mixing time and convergence to stationarity rely on limits and geometric decay.

Detailed Explanation


01 Overview

A Markov chain is a mathematical model for systems that evolve step by step among a set of possible states with fixed probabilities. Its defining feature is the Markov property: the future depends only on the present, not on the full history. This makes Markov chains powerful yet tractable tools for modeling randomness in time, from weather changes to web-surfing patterns.

In a discrete-time, finite-state Markov chain, the dynamics are summarized by a transition matrix P, where P_{ij} is the probability of moving from state i to state j in one time step. If you know the current distribution over states as a row vector μ, the distribution after one step is μP; after k steps it is μP^k.

Markov chains can exhibit rich long-term behavior. Some get absorbed into special states, some wander forever, and many, under mild connectivity and randomness conditions (irreducibility and aperiodicity), converge to a unique stationary distribution π satisfying πP = π. This π describes the long-run fraction of time spent in each state and underpins algorithms like PageRank.

Practically, we can study Markov chains by simulating paths, by computing powers of the transition matrix, or by iterative methods like power iteration. In C++, these methods translate into random sampling for simulation, repeated matrix multiplications for multi-step transitions, and iterative vector-matrix multiplications for stationary distributions. The required mathematics blends probability (for interpreting transitions), linear algebra (for matrix powers and eigenvectors), and convergence ideas (for understanding when and how fast limits exist).

02 Intuition & Analogies

Think of a Markov chain like a board game where your next move depends only on the square you occupy right now, not on how you got there. Rolling dice to move in Monopoly, drawing a card in a game of chance, or flipping a coin to decide your next action—each step uses only the current position and some fixed rules. That’s the memoryless idea: yesterday’s journey doesn’t matter beyond where you ended up.

Another familiar analogy is the weather: suppose tomorrow’s weather depends only on today’s. If today is sunny, there’s a 70% chance tomorrow is sunny; if it’s rainy, there’s a 40% chance tomorrow is sunny. The percentages form a transition matrix. If you start with some guess of today’s weather distribution, applying those percentages once predicts tomorrow; applying them repeatedly predicts next week, next month, and so on. Over time, many such systems settle into a rhythm: the proportion of sunny vs. rainy days stabilizes. That stable mix is the stationary distribution.

Simulation mirrors how you might explore a city by walking: from your current intersection, pick a random adjacent street based on posted signs (probabilities), then repeat. If the city is well connected and you avoid perfectly periodic patterns, your long-run fraction of time at each intersection converges to a fixed profile that depends on the map, not on where you started.

Computationally, these ideas become matrix operations. The transition matrix is the rulebook; multiplying by it advances time. Raising it to the k-th power compresses k steps into one calculation. Repeating vector-matrix multiplications mimics many steps cheaply and often reveals the long-run profile. The memoryless property keeps the math and code simple: all you need for the next move is the current state or distribution, not the full log of where you’ve been.

03 Formal Definition

A discrete-time, time-homogeneous Markov chain is a stochastic process {X_t}_{t ≥ 0} on a finite (or countable) state space S = {1, …, n} with the Markov property: for all t ≥ 0 and all states i_0, …, i_{t+1} in S,

P(X_{t+1} = i_{t+1} | X_t = i_t, X_{t−1} = i_{t−1}, …, X_0 = i_0) = P(X_{t+1} = i_{t+1} | X_t = i_t).

Time-homogeneity means the one-step transition probabilities do not depend on t. These probabilities are collected in a transition matrix P ∈ R^{n×n} with entries P_{ij} = P(X_{t+1} = j | X_t = i), where P is row-stochastic: P_{ij} ≥ 0 and ∑_{j=1}^{n} P_{ij} = 1 for each i.

Let μ^{(t)} denote the row vector of state probabilities at time t: μ_i^{(t)} = P(X_t = i). Then μ^{(t+1)} = μ^{(t)} P and thus μ^{(t)} = μ^{(0)} P^t. The n-step transition matrix is P^{(n)} = P^n, and its entries satisfy the Chapman–Kolmogorov equations: P^{(m+n)} = P^{(m)} P^{(n)}.

A stationary distribution is a row vector π with nonnegative entries summing to 1 such that πP = π. If the chain is irreducible (any state can be reached from any state) and aperiodic (no deterministic cycling), then a unique stationary distribution exists and μ^{(t)} → π as t → ∞ regardless of μ^{(0)}. Special structures (e.g., absorbing states) yield additional matrices, such as the fundamental matrix, for analyzing transient behavior.

04 When to Use

Use Markov chains whenever the system’s next step depends only on its current condition and you can assign meaningful transition probabilities. Typical cases include: modeling queues (how many customers in line next minute), weather (today’s weather predicts tomorrow’s), inventory levels (stock replenishment and demand), random walks on graphs (web surfing or network sampling), games of chance (board positions), and reliability models (component states up/down).

For short-term forecasts, compute μ^{(0)} P^k to get distributions k steps ahead. For long-run behavior, compute or approximate a stationary distribution π to understand steady-state usage or occupancy (e.g., what fraction of time a server is busy). When exact formulas are complex or the state space is large, simulate many trajectories to estimate statistics like hitting probabilities or expected times to reach a target.

In ranking and recommendation (e.g., PageRank), use a modified Markov chain with teleportation to guarantee convergence and extract importance scores. In operations research, absorbing chains help compute the expected number of steps until completion. In reinforcement learning, Markov decision processes (MDPs) extend chains by letting an agent choose actions that influence the transition matrix—Markov chains are the passive-dynamics part.

Choose matrix exponentiation for exact k-step transitions with moderate n and large k; choose power iteration when you only need the dominant left eigenvector (the stationary distribution); choose simulation for complicated, high-dimensional systems where exact computation is infeasible but sampling is cheap.

⚠️Common Mistakes

• Mixing conventions: Some texts use column vectors that evolve via v_{t+1} = P v_t; others use row vectors with v_{t+1} = v_t P. Mixing the two leads to incorrect multiplications and transposed formulas. Be consistent across code and math.
• Ignoring stochasticity: Input matrices must be nonnegative with each row summing to 1 (row-stochastic). Small floating-point errors are fine, but significant deviations can break convergence and produce invalid probabilities. Normalize rows if necessary.
• Expecting convergence without conditions: Not all chains converge to a unique stationary distribution. Periodic chains can oscillate; reducible chains can have multiple stationary distributions; absorbing chains do not mix over transient states. Check irreducibility and aperiodicity (or add damping/teleportation) before using power iteration.
• Misinterpreting empirical frequencies: Short simulations can give misleading estimates, especially if the chain mixes slowly (small spectral gap). Use burn-in, run sufficiently long, and compute confidence intervals.
• Overusing matrix exponentiation: Computing P^k costs O(n^3 log k), which is wasteful when you only need μ^{(t)} for successive t. Iterative multiplication by P is cheaper for sequences of times.
• Numerical instability: Subtracting nearly equal probabilities or multiplying repeatedly can accumulate floating-point error. Use double precision, normalize vectors periodically, and set sensible convergence tolerances.
• Confusing detailed balance with stationarity: Detailed balance (reversibility) implies stationarity but is a stronger condition; not all stationary chains are reversible.
• Forgetting absorbing structure: In absorbing chains, the long-run distribution concentrates on the absorbing states; running standard power iteration without recognizing absorption can hide useful transient metrics like the expected time to absorption.

Key Formulas

Markov Property

P(X_{t+1} = j | X_t = i, X_{t−1}, …, X_0) = P(X_{t+1} = j | X_t = i) = P_{ij}

Explanation: The probability of the next state depends only on the current state. This defines the memoryless nature of the chain.

Chapman–Kolmogorov

P^{(m+n)} = P^{(m)} P^{(n)}

Explanation: The (m+n)-step transition matrix equals the product of the m-step and n-step matrices. It allows building long-horizon transitions from shorter ones.

Distribution Evolution

μ^{(t)} = μ^{(0)} P^t

Explanation: A row distribution vector after t steps is the initial distribution multiplied by P raised to the t-th power. This computes forecasts for any horizon.

Stationary Distribution

πP = π,  ∑_{i=1}^{n} π_i = 1,  π_i ≥ 0

Explanation: A stationary distribution is unchanged by the transition: multiplying by P leaves it the same. It describes long-run state proportions.

Detailed Balance (Reversibility)

π_i P_{ij} = π_j P_{ji}

Explanation: If this symmetry holds for all i,j, the chain is reversible with respect to π. It guarantees π is stationary and simplifies analysis.

Convergence via Spectral Gap

‖μ^{(t)} − π‖_{TV} ≤ C (1 − γ)^t,  γ = 1 − |λ_2|

Explanation: The distance to stationarity decays geometrically at a rate governed by the spectral gap. Larger gaps imply faster mixing.

Fundamental Matrix (Absorbing Chains)

Z = (I − Q)^{−1}

Explanation: For transient block Q of an absorbing chain, Z aggregates expected visits among transient states. It underlies absorption probabilities and times.

Expected Time to Absorption

t = Z·1

Explanation: Multiplying the fundamental matrix by the all-ones vector yields expected steps to absorption from each transient state.

Graph Stationary Distribution

π_i = d_i / ∑_j d_j  (undirected graph random walk)

Explanation: For a simple random walk on an undirected graph, the stationary probability is proportional to node degree. High-degree nodes are visited more often.

Teleportation Equation (PageRank form)

π = α πP + (1 − α) u,  0 < α < 1

Explanation: A damped update mixes a step of the chain with a jump to distribution u. This ensures ergodicity and stable convergence of power iteration.

Complexity Analysis

Three common computational tasks arise for finite Markov chains.

(1) Simulation: advancing a single trajectory for T steps costs O(T) random draws, plus preprocessing to build one discrete sampling distribution per row. Memory is O(n^2) for storing the dense transition matrix and O(n) for state counts. Simulation scales well even when n is large, because each step touches only the current state's row.

(2) Exact k-step transitions via matrix exponentiation: computing P^k with fast exponentiation requires O(n^3 log k) time using standard dense matrix multiplication and O(n^2) memory. This is efficient when n is moderate (tens to low hundreds) and k is large, or when you need many different initial distributions applied to the same P^k. If you only need successive horizons for one initial vector, repeated multiplication by P (k times) costs O(k n^2) and can be better for small k.

(3) Stationary distribution by power iteration: each iteration computes π ← πP in O(n^2) time and O(n) extra space; the number of iterations depends on the spectral gap (a smaller gap means slower convergence). With teleportation, convergence is linear with rate α|λ_2|.

For sparse chains with m non-zeros, replace n^2 by m: simulation stays O(T), vector-matrix multiplies cost O(m) per step, and storage is O(m). Numerical stability considerations include normalizing probability vectors (to sum to 1), clamping small negatives that arise from rounding to zero, and using tolerances for convergence checks. If you solve absorbing-chain systems (e.g., with (I − Q)^{−1}), dense Gaussian elimination costs O(n^3) time and O(n^2) space; iterative solvers can exploit sparsity for better scalability.

Code Examples

Simulate a Markov Chain and Estimate Long-Run Frequencies
#include <iostream>
#include <vector>
#include <random>
#include <numeric>
#include <iomanip>
#include <stdexcept>
#include <cmath>
#include <algorithm>
using namespace std;

// Ensure each row sums to 1 within tolerance; renormalize small errors
void normalize_rows(vector<vector<double>>& P, double tol = 1e-12) {
    size_t n = P.size();
    for (size_t i = 0; i < n; ++i) {
        double s = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (s <= 0.0) throw runtime_error("Row has non-positive sum; invalid transition matrix");
        for (size_t j = 0; j < n; ++j) {
            if (P[i][j] < -1e-15) throw runtime_error("Negative probability encountered");
            P[i][j] = max(0.0, P[i][j]);
        }
        // Renormalize to handle small floating deviations
        for (size_t j = 0; j < n; ++j) P[i][j] /= s;
        // Final check
        double s2 = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (fabs(s2 - 1.0) > tol) throw runtime_error("Row normalization failed");
    }
}

int main() {
    // Example: simple weather model with 3 states: 0=Sunny, 1=Cloudy, 2=Rainy
    vector<vector<double>> P = {
        {0.7, 0.2, 0.1},  // from Sunny
        {0.5, 0.3, 0.2},  // from Cloudy
        {0.4, 0.3, 0.3}   // from Rainy
    };

    normalize_rows(P);

    size_t n = P.size();
    // Pre-build one discrete distribution per row for fast sampling
    random_device rd;
    mt19937 gen(rd());
    vector<discrete_distribution<int>> dists;
    dists.reserve(n);
    for (size_t i = 0; i < n; ++i) dists.emplace_back(P[i].begin(), P[i].end());

    // Simulation parameters
    const size_t burn_in = 1000;       // steps to discard to reduce initial bias
    const size_t samples = 1'000'000;  // steps to collect statistics

    // Start from Sunny (state 0); could also randomize
    int state = 0;

    // Burn-in
    for (size_t t = 0; t < burn_in; ++t) {
        state = dists[state](gen);
    }

    // Collect empirical counts after burn-in
    vector<long long> counts(n, 0);
    for (size_t t = 0; t < samples; ++t) {
        state = dists[state](gen);
        counts[state]++;
    }

    // Convert counts to an empirical distribution
    double total = static_cast<double>(samples);
    cout << fixed << setprecision(6);
    cout << "Empirical long-run distribution (approximate stationary distribution):\n";
    for (size_t i = 0; i < n; ++i) {
        cout << "State " << i << ": " << counts[i] / total << "\n";
    }

    return 0;
}

This program simulates a discrete-time Markov chain with three weather states. It validates and normalizes the transition matrix, constructs sampling distributions for each state, runs a burn-in to reduce dependence on the initial state, and then collects a large number of samples to estimate the long-run fraction of time spent in each state. For an ergodic chain, these empirical frequencies approximate the stationary distribution.

Time: O(burn_in + samples). Space: O(n^2) for the matrix plus O(n) for counts and distributions.
Compute k-Step Transition Distribution via Fast Matrix Exponentiation
#include <iostream>
#include <vector>
#include <iomanip>
#include <stdexcept>
#include <numeric>
#include <cmath>
#include <algorithm>
using namespace std;

using Matrix = vector<vector<double>>;

void normalize_rows(Matrix& P, double tol = 1e-12) {
    size_t n = P.size();
    for (size_t i = 0; i < n; ++i) {
        double s = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (s <= 0.0) throw runtime_error("Row has non-positive sum");
        for (size_t j = 0; j < n; ++j) P[i][j] = max(0.0, P[i][j]);
        for (size_t j = 0; j < n; ++j) P[i][j] /= s;
        double s2 = accumulate(P[i].begin(), P[i].end(), 0.0);
        if (fabs(s2 - 1.0) > tol) throw runtime_error("Row normalization failed");
    }
}

Matrix multiply(const Matrix& A, const Matrix& B) {
    size_t n = A.size();
    Matrix C(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            double aik = A[i][k];
            if (aik == 0.0) continue;  // skip zero entries
            for (size_t j = 0; j < n; ++j) {
                C[i][j] += aik * B[k][j];
            }
        }
    }
    return C;
}

Matrix mat_pow(Matrix base, long long exp) {
    size_t n = base.size();
    // Identity matrix
    Matrix result(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) result[i][i] = 1.0;
    while (exp > 0) {
        if (exp & 1LL) result = multiply(result, base);
        base = multiply(base, base);
        exp >>= 1LL;
    }
    return result;
}

vector<double> rowvec_mul(const vector<double>& v, const Matrix& M) {
    size_t n = v.size();
    vector<double> out(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t k = 0; k < n; ++k) s += v[k] * M[k][j];
        out[j] = s;
    }
    return out;
}

void normalize_dist(vector<double>& v) {
    double s = accumulate(v.begin(), v.end(), 0.0);
    for (double& x : v) x = (s > 0 ? max(0.0, x) / s : 0.0);
}

int main() {
    // Transition matrix P (4-state example)
    Matrix P = {
        {0.6, 0.3, 0.1, 0.0},
        {0.2, 0.5, 0.3, 0.0},
        {0.1, 0.3, 0.5, 0.1},
        {0.0, 0.2, 0.3, 0.5}
    };
    normalize_rows(P);

    // Initial distribution (row vector)
    vector<double> mu0 = {1.0, 0.0, 0.0, 0.0};  // start deterministically in state 0
    normalize_dist(mu0);

    long long k = 20;  // number of steps ahead

    Matrix Pk = mat_pow(P, k);
    vector<double> muk = rowvec_mul(mu0, Pk);
    normalize_dist(muk);  // guard against minor floating drift

    cout << fixed << setprecision(6);
    cout << k << "-step distribution from state 0:\n";
    for (size_t i = 0; i < muk.size(); ++i) cout << "State " << i << ": " << muk[i] << "\n";
    return 0;
}

This program computes the k-step state distribution exactly (up to floating error) by raising the transition matrix to the k-th power using fast exponentiation, then multiplying the initial distribution from the left. It’s appropriate when you need accurate multi-step forecasts for moderate n and large k.

Time: O(n^3 log k) for exponentiation plus O(n^2) for the final vector multiply. Space: O(n^2).
Stationary Distribution via Power Iteration with Optional Teleportation
#include <iostream>
#include <vector>
#include <numeric>
#include <iomanip>
#include <cmath>
#include <stdexcept>
#include <algorithm>
using namespace std;

using Matrix = vector<vector<double>>;

void normalize_rows(Matrix& P) {
    for (auto& row : P) {
        for (double& x : row) x = max(0.0, x);
        double s = accumulate(row.begin(), row.end(), 0.0);
        if (s <= 0.0) throw runtime_error("Invalid row (sum <= 0)");
        for (double& x : row) x /= s;
    }
}

vector<double> multiply_left(const vector<double>& v, const Matrix& P) {
    size_t n = v.size();
    vector<double> out(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += v[i] * P[i][j];
        out[j] = s;
    }
    return out;
}

int main() {
    // Example 4-state chain (can be viewed as a small web-graph surfer)
    Matrix P = {
        {0.05, 0.90, 0.05, 0.00},
        {0.10, 0.10, 0.70, 0.10},
        {0.20, 0.10, 0.50, 0.20},
        {0.25, 0.25, 0.25, 0.25}  // uniform row helps aperiodicity
    };
    normalize_rows(P);

    size_t n = P.size();

    // Teleportation parameters (set alpha < 1 to guarantee ergodicity)
    const double alpha = 0.85;     // damping factor; set to 1.0 to disable teleportation
    vector<double> u(n, 1.0 / n);  // teleportation distribution (uniform)

    // Power iteration setup
    vector<double> pi(n, 1.0 / n);  // start from the uniform distribution
    vector<double> next(n, 0.0);

    const int max_iters = 10000;
    const double tol = 1e-12;  // L1 tolerance

    for (int it = 0; it < max_iters; ++it) {
        // next = alpha * (pi P) + (1 - alpha) * u
        vector<double> piP = multiply_left(pi, P);
        for (size_t j = 0; j < n; ++j) next[j] = alpha * piP[j] + (1.0 - alpha) * u[j];

        // Normalize to counter floating drift
        double s = accumulate(next.begin(), next.end(), 0.0);
        for (double& x : next) x /= s;

        // Check L1 convergence
        double diff = 0.0;
        for (size_t j = 0; j < n; ++j) diff += fabs(next[j] - pi[j]);
        pi.swap(next);
        if (diff < tol) {
            cerr << "Converged in " << it + 1 << " iterations\n";
            break;
        }
        if (it == max_iters - 1) cerr << "Warning: did not converge within max_iters\n";
    }

    cout << fixed << setprecision(10);
    cout << "Approximate stationary distribution (with damping):\n";
    for (size_t i = 0; i < n; ++i) cout << "State " << i << ": " << pi[i] << "\n";

    return 0;
}

This program computes an approximate stationary distribution using power iteration. With α < 1, it implements a damped (teleporting) chain that is always ergodic, ensuring convergence. Setting α = 1 reduces to standard power iteration and requires the chain to be irreducible and aperiodic to converge from any start.

Time: O(n^2 · iters) for dense P, or O(m · iters) if P is sparse with m nonzeros. Space: O(n^2) for dense P plus O(n) for vectors.