Markov Chain Monte Carlo (MCMC)
Key Points
- MCMC builds a random walk (a Markov chain) whose long-run visiting frequency matches your target distribution, even when the target is only known up to a constant.
- The Metropolis–Hastings (MH) algorithm accepts or rejects proposed moves so that detailed balance holds and the target becomes stationary.
- Gibbs sampling updates one coordinate at a time from its conditional distribution, which can be simpler to draw from than the joint distribution.
- You can compute expectations by averaging function values along the chain; normalization constants cancel out, so unnormalized densities are fine.
- Good mixing requires well-tuned proposals (step sizes/scales); poor tuning leads to high autocorrelation and slow convergence.
- Always work in log space for probabilities to avoid numerical underflow, and include proposal terms for asymmetric proposals.
- Use diagnostics like trace plots, autocorrelation, and effective sample size (ESS) to check convergence and efficiency.
- Per-iteration cost is usually cheap, but the number of iterations needed can grow quickly with dimension and target complexity.
Prerequisites
- Basic probability and random variables – Understanding distributions, expectations, and conditional probability is essential to define targets and conditionals for Gibbs.
- Markov chains – MCMC relies on stationarity, irreducibility, and aperiodicity to guarantee convergence.
- Calculus and multivariate Gaussians – Evaluating log densities and deriving conditionals (e.g., for the bivariate normal) requires these tools.
- Bayes' theorem – Many MCMC applications target Bayesian posteriors known up to a normalizing constant.
- Numerical stability (log-sum-exp) – Working in log space prevents underflow when evaluating products or mixtures.
- C++ random number generation (std::mt19937, distributions) – Implementing proposals and conditional draws requires correct use of RNGs.
- Data transformations and Jacobians – Sampling in transformed spaces requires adjusting densities to maintain correctness.
- Basic statistics and covariance – Interpreting sample summaries, autocorrelation, and ESS depends on these concepts.
Detailed Explanation
01 Overview
Markov Chain Monte Carlo (MCMC) is a family of algorithms for drawing samples from complex probability distributions, especially ones you can evaluate up to a constant but cannot sample from directly. The core idea is to construct a Markov chain whose long-run behavior (its stationary distribution) matches your target distribution. Once the chain has warmed up (burn-in), the states you visit can be treated as dependent samples from the target. By averaging functions over these samples, you approximate expectations like means, variances, and integrals required in Bayesian inference and statistical physics.
Why is this useful? Many real problems involve high-dimensional distributions with intractable normalizing constants. MCMC sidesteps exact normalization by using only ratios of densities. The two most common approaches are Metropolis–Hastings (MH), which proposes a move and accepts it with a carefully chosen probability, and Gibbs sampling, which updates one variable at a time from its conditional distribution. With proper conditions (irreducibility, aperiodicity), these chains converge to the target distribution.
Example at a glance: Suppose your target is a mixture of two Gaussians (a "two-hill" landscape). Metropolis–Hastings proposes small moves left or right; if the new position is in a higher-density region, you likely accept; if not, you might still accept with some probability. Over time, the chain visits each hill in proportion to its probability mass, letting you estimate the mixture's mean or other features.
02 Intuition & Analogies
Hook: Imagine exploring a dark, hilly landscape with a flashlight, trying to map where people tend to hang out. You can't see the whole map at once, but you can tell whether a spot is more or less crowded than where you are now.
Concept: MCMC is like walking through that landscape so that the time you spend in each area matches how crowded it is. You decide your next step using only local comparisons of "crowdedness" (probability density), not the whole map. In Metropolis–Hastings, you suggest a new spot (proposal). If it seems more crowded, you usually move there. If it's less crowded, you might still move there, but less often. This balanced behavior ensures you don't get stuck and that, in the long run, your footsteps represent the true crowd distribution. In Gibbs sampling, instead of jumping anywhere, you adjust one coordinate at a time using the exact rule for that coordinate given the others, like moving strictly north/south, then east/west, according to local road signs.
Example: Suppose your target distribution is two neighborhoods: one downtown and one in the suburbs. Using a random-walk MH, you take small steps. If you step toward downtown when it's more crowded, you likely stay. If you wander into a quiet alley, you might backtrack. Over hours of walking, you'll spend time downtown and in the suburbs in the correct proportions. If you use Gibbs sampling in a grid-like city, you move north–south according to how crowded that street is given your current east–west position, then switch to east–west. After many cycles, your path's heatmap reflects the real population density.
03 Formal Definition
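A compact formal sketch of the standard definitions (notation follows the Key Formulas section below):

```latex
% Target: a distribution \pi on \mathcal{X}, known up to a normalizing constant.
% An MCMC method constructs a Markov kernel K that leaves \pi invariant:
\pi(A) = \int_{\mathcal{X}} K(x, A)\, \pi(dx) \quad \text{for all measurable } A \subseteq \mathcal{X}.
% Metropolis--Hastings: propose y \sim q(\cdot \mid x) and accept with probability
\alpha(x, y) = \min\!\left( 1,\ \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)} \right),
% which enforces detailed balance: \pi(x) K(x, y) = \pi(y) K(y, x).
% If the chain is also irreducible and aperiodic, it is ergodic and time averages converge:
\frac{1}{n} \sum_{t=1}^{n} f(X_t) \xrightarrow{\text{a.s.}} \mathbb{E}_{\pi}[f(X)] \quad (n \to \infty).
```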
04 When to Use
Use MCMC when you need samples or expectations from complex, high-dimensional distributions that you can evaluate up to a constant but cannot sample from directly. Classic scenarios include:
- Bayesian inference: posterior distributions with intractable normalizers (e.g., hierarchical models, mixture models, logistic regression with priors).
- Physics and ML: Boltzmann distributions, energy-based models where only an unnormalized energy E(x) is available.
- Probabilistic integration: computing \mathbb{E}_\pi[f(X)] for expensive black-box likelihoods.
- Models with convenient conditionals: if full conditionals are simple (e.g., conjugate), Gibbs sampling can be very efficient.
Avoid or be cautious when:
- Low dimensions with available direct samplers or quadrature; MCMC overhead is unnecessary.
- Severe multimodality with poor proposals; chains can get stuck in one mode. Consider tempered transitions or multiple chains.
- Strong correlations or ill-scaled parameters; reparameterize, precondition, or use adaptive proposals.
- You need independent samples very quickly; MCMC samples are dependent and may have high autocorrelation.
Practical cues: If you can code the log of the target density and its gradients are hard or unavailable, MH (random-walk or independence) is often a good starting point. If each coordinate's conditional is available, prefer Gibbs or Metropolis-within-Gibbs.
⚠️ Common Mistakes
- Ignoring proposal asymmetry: For non-symmetric proposals, forgetting the q-terms in the MH acceptance ratio biases the chain. Always use \alpha(x,y) = min(1, [\pi(y) q(x\mid y)]/[\pi(x) q(y\mid x)]).
- Not using log space: Multiplying small probabilities underflows to zero. Compute with log densities and use log-sum-exp for mixtures.
- Poor tuning of step size/scale: Too small leads to high acceptance but slow exploration (high autocorrelation). Too large yields very low acceptance and many repeats. Tune to moderate acceptance (about 0.2–0.5 for random-walk MH; in high dimensions, ~0.234 is a common heuristic).
- No diagnostics: Failing to check trace plots, autocorrelation, R-hat, or ESS can hide non-convergence or poor mixing. Run multiple chains from dispersed starts when possible.
- Insufficient burn-in: Early samples may reflect the start, not the target. Discard an initial portion after checking stability of summaries.
- Over-thinning: Thinning rarely increases statistical efficiency; it can reduce storage but throws away information. Prefer keeping all post-burn-in samples and using ESS.
- Ignoring constraints and transformations: If parameters are constrained (e.g., positive), either propose in a transformed space (log, logit) and include Jacobians when needed, or reflect proposals at boundaries.
- Bad random seeding or reusing RNGs incorrectly: Non-random or repeated seeds and shared RNGs across threads can produce artifacts. Initialize RNGs carefully and consider independent streams for parallel chains.
Key Formulas
Stationarity condition
\pi(A) = \int K(x, A)\, \pi(dx) for all measurable sets A
Explanation: A distribution \pi is stationary if, after one Markov transition, the probability of landing in any set A equals its original probability. This invariance ensures the chain's long-run behavior matches the target.
Detailed balance
\pi(x)\, K(x, y) = \pi(y)\, K(y, x)
Explanation: If this symmetry holds for all x and y, then \pi is stationary for K. Many MCMC algorithms enforce detailed balance to guarantee correctness.
MH acceptance probability
\alpha(x, y) = \min(1, [\pi(y) q(x \mid y)] / [\pi(x) q(y \mid x)])
Explanation: This is the acceptance probability for a move from x to y with proposal q. It corrects the proposal to target \pi, and an unnormalized \pi suffices because the unknown constant cancels in the ratio.
Random-walk proposal
y = x + \epsilon, \epsilon \sim N(0, \sigma^2 I), i.e. q(y \mid x) = N(y; x, \sigma^2 I)
Explanation: A common symmetric proposal centered at the current state. Symmetry implies q(y|x) = q(x|y), simplifying the MH ratio to a target-density ratio.
MCMC estimator and LLN
\hat{f}_n = (1/n) \sum_{t=1}^{n} f(X_t) \to \mathbb{E}_\pi[f(X)] almost surely as n \to \infty
Explanation: Average function values along the chain to estimate expectations. Under ergodicity, these time averages converge almost surely to the true expectation.
Integrated autocorrelation time and ESS
\tau_{int} = 1 + 2 \sum_{k=1}^{\infty} \rho_k, \quad \text{ESS} = n / \tau_{int}
Explanation: Autocorrelation inflates variance of MCMC averages. The integrated autocorrelation time summarizes dependence and yields an effective sample size smaller than n.
Log-sum-exp
\log \sum_i \exp(a_i) = m + \log \sum_i \exp(a_i - m), \quad m = \max_i a_i
Explanation: A numerically stable way to sum exponentials in log space, crucial for mixture targets or sums of likelihood terms in MH.
Roberts–Gelman–Gilks heuristic
\sigma_{prop} \approx 2.38\, \sigma / \sqrt{d}
Explanation: For high-dimensional Gaussian targets with random-walk MH, scaling the proposal standard deviation as 2.38/\sqrt{d} often yields near-optimal acceptance around 0.234.
Gibbs full conditional
\pi(x_i \mid x_{-i}) \propto \pi(x_1, \ldots, x_d)
Explanation: In Gibbs sampling you draw from the conditional distribution of one coordinate given the rest, which is proportional to the joint density viewed as a function of that coordinate.
Complexity Analysis
Per iteration, random-walk MH costs one log-target evaluation plus one proposal draw; a full Gibbs sweep costs one conditional draw per coordinate (d draws). Memory is O(1) beyond the stored samples. The dominant cost is usually the number of iterations: effective samples accrue at roughly n / \tau_{int} per n steps, and mixing generally slows as dimension and target correlation grow.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute log N(x; mu, sigma^2) including the exact constant (-0.5*log(2*pi))
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

// Stable log-sum-exp for two terms
inline double logsumexp2(double a, double b) {
    double m = max(a, b);
    return m + log(exp(a - m) + exp(b - m));
}

// Target: mixture 0.3*N(-3,1^2) + 0.7*N(2,0.5^2)
double log_target(double x) {
    double a = log(0.3) + log_normal_pdf(x, -3.0, 1.0);
    double b = log(0.7) + log_normal_pdf(x, 2.0, 0.5);
    return logsumexp2(a, b); // exact log density of the mixture
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG setup (fixed seed for reproducibility)
    std::mt19937_64 rng(42);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif01(0.0, 1.0);

    // MH parameters
    const int total_iters = 50000;     // total iterations including burn-in
    const int burn_in = 5000;          // discard initial steps
    const int thin = 10;               // keep every 'thin'-th sample after burn-in
    const double proposal_sigma = 1.0; // random-walk step size

    // Initialize chain
    double x = 0.0; // starting point
    double logp = log_target(x);

    int accepted = 0;
    vector<double> samples;
    samples.reserve((total_iters - burn_in) / max(1, thin));

    for (int t = 1; t <= total_iters; ++t) {
        // Propose y = x + Normal(0, proposal_sigma^2)
        double y = x + proposal_sigma * std_normal(rng);
        double logp_y = log_target(y);

        // Acceptance probability (symmetric proposal, so q cancels)
        double log_alpha = logp_y - logp; // accept with prob min(1, exp(log_alpha))
        bool accept = false;
        if (log_alpha >= 0.0) {
            accept = true; // always accept upward moves
        } else {
            double u = unif01(rng);
            accept = (log(u) < log_alpha);
        }

        if (accept) {
            x = y;
            logp = logp_y;
            ++accepted;
        }

        // Record sample after burn-in with thinning
        if (t > burn_in && ((t - burn_in) % thin == 0)) {
            samples.push_back(x);
        }
    }

    // Basic summaries: sample mean and variance
    double mean = 0.0;
    for (double v : samples) mean += v;
    mean /= samples.size();

    double var = 0.0;
    for (double v : samples) var += (v - mean) * (v - mean);
    var /= (samples.size() - 1);

    double acc_rate = static_cast<double>(accepted) / total_iters;

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Acceptance rate: " << acc_rate << "\n";
    cout << "Sample mean: " << mean << "\n";
    cout << "Sample variance: " << var << "\n";

    // Print a few samples to inspect multimodality
    cout << "First 10 kept samples:" << "\n";
    for (size_t i = 0; i < min<size_t>(10, samples.size()); ++i) {
        cout << samples[i] << (i + 1 == min<size_t>(10, samples.size()) ? '\n' : ' ');
    }

    return 0;
}
```
This program implements a random-walk Metropolis–Hastings sampler for a 1D mixture of two Gaussians. The target is evaluated in log space, using a log-sum-exp trick to stabilize the mixture density. At each step, we propose a Normal perturbation of the current state and accept with probability min(1, exp(logp_new - logp_old)). After burn-in and thinning, we report the acceptance rate and basic summaries. Because the proposal is symmetric, the proposal terms cancel in the MH ratio.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Target: (X, Y) ~ bivariate normal with mean (0, 0) and Cov = [[1, rho], [rho, 1]]
// Full conditionals: X|Y=y ~ N(rho*y, 1 - rho^2), Y|X=x ~ N(rho*x, 1 - rho^2)

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double rho = 0.8; // correlation
    if (rho <= -1.0 || rho >= 1.0) {
        cerr << "rho must be in (-1,1)" << '\n';
        return 1;
    }

    double var_cond = 1.0 - rho * rho;
    double sd_cond = sqrt(var_cond);

    // RNG
    std::mt19937_64 rng(123);
    std::normal_distribution<double> std_normal(0.0, 1.0);

    // Gibbs parameters
    const int total_iters = 50000;
    const int burn_in = 5000;
    const int thin = 10;

    // Initialize
    double x = 0.0, y = 0.0;

    vector<pair<double, double>> samples;
    samples.reserve((total_iters - burn_in) / max(1, thin));

    for (int t = 1; t <= total_iters; ++t) {
        // Sample X | Y=y
        double mean_x = rho * y;
        x = mean_x + sd_cond * std_normal(rng);

        // Sample Y | X=x
        double mean_y = rho * x;
        y = mean_y + sd_cond * std_normal(rng);

        if (t > burn_in && ((t - burn_in) % thin == 0)) {
            samples.emplace_back(x, y);
        }
    }

    // Compute sample means and covariance
    double mx = 0.0, my = 0.0;
    for (auto &p : samples) { mx += p.first; my += p.second; }
    mx /= samples.size();
    my /= samples.size();

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (auto &p : samples) {
        double dx = p.first - mx;
        double dy = p.second - my;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    sxx /= (samples.size() - 1);
    syy /= (samples.size() - 1);
    sxy /= (samples.size() - 1);

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Sample mean (x, y): (" << mx << ", " << my << ")\n";
    cout << "Sample var(x): " << sxx << ", var(y): " << syy << "\n";
    cout << "Sample cov(x,y): " << sxy << " (target cov: " << rho << ")\n";

    // Print a few samples
    cout << "First 5 kept samples (x, y):\n";
    for (size_t i = 0; i < min<size_t>(5, samples.size()); ++i) {
        cout << samples[i].first << " " << samples[i].second << "\n";
    }

    return 0;
}
```
This program performs Gibbs sampling for a bivariate normal with correlation rho. Because the full conditionals are Normal with known means and variances, each Gibbs step is just drawing from two univariate Normals. After burn-in and thinning, the empirical covariance should be close to rho, demonstrating correct sampling from the target.