MCMC Theory
Key Points
- MCMC simulates a Markov chain whose long-run behavior matches a target distribution, letting us sample from complex posteriors without knowing the normalization constant.
- Metropolis–Hastings proposes a move and accepts it with a probability that corrects for proposal bias, ensuring detailed balance with the target.
- Gibbs sampling updates one coordinate at a time from its conditional distribution, which can mix fast when conditionals are easy.
- Ergodicity guarantees convergence to the target distribution after burn-in, but practical chains may mix slowly, especially for multimodal targets.
- Diagnostics like R-hat, effective sample size, and trace plots are essential to detect non-convergence and strong autocorrelation.
- Engineering choices (step size, proposal scale, parametrization, gradients for HMC) dominate performance and mixing.
- Hamiltonian Monte Carlo uses gradients and simulated dynamics to make distant proposals with high acceptance, dramatically reducing autocorrelation.
- In Bayesian inference and probabilistic programming, MCMC is a workhorse for sampling from intractable posteriors and computing expectations.
Prerequisites
- Probability density and Bayes' rule – MCMC targets are often posteriors π(θ|y) ∝ p(y|θ)p(θ), requiring comfort with densities and conditioning.
- Markov chains and transition kernels – Understanding stationarity, detailed balance, and ergodicity is essential for why MCMC works.
- Multivariate normal distribution – Useful for Gibbs and for test targets; provides closed-form conditionals and gradients.
- Numerical stability and log-space arithmetic – Computing acceptance probabilities safely requires log-densities and log-sum-exp.
- Linear algebra basics – Evaluating Gaussian targets involves matrix operations like inverses and quadratic forms.
- Calculus and gradients – HMC requires computing ∇U(x); differentiability and gradient calculations are central.
- Random number generation – Sampling from proposals and conditionals relies on high-quality RNGs and standard distributions.
- Convergence diagnostics (R-hat, ESS) – To know when chains have mixed and how much information you have.
- Reparameterization and constraints – Transforming parameters to unconstrained spaces improves mixing and correctness (Jacobian terms).
- Basic C++ programming – Implementing samplers, RNG usage, and managing memory for storing samples.
Detailed Explanation
01Overview
Markov chain Monte Carlo (MCMC) is a family of algorithms for sampling from complicated probability distributions, especially those known only up to a normalizing constant. Instead of drawing independent samples, MCMC builds a Markov chain that spends time in regions in proportion to the target density π(x). Over time, the distribution of the chain's state converges to π, so averages over the chain approximate expectations under π. Two core methods are Metropolis–Hastings (MH) and Gibbs sampling. MH proposes a candidate point from a proposal distribution and accepts it with a carefully chosen probability that enforces detailed balance with π. Gibbs sampling cycles through coordinates (or blocks), sampling each from its conditional distribution given the others. When conditionals are easy, Gibbs is simple and efficient. MCMC underpins Bayesian inference, where posteriors are often high-dimensional and lack closed-form normalizing constants. Convergence is not instantaneous: the initial segment (burn-in) is discarded, and dependence between samples reduces the effective sample size. Practical use requires diagnostics (e.g., R-hat, ESS) and mindful tuning (e.g., proposal scales, reparameterization, or gradient-based HMC). Advanced variants like Hamiltonian Monte Carlo (HMC) and NUTS use gradients to move efficiently through parameter space, often improving mixing by orders of magnitude.
02Intuition & Analogies
Imagine exploring a dark landscape representing the probability of different parameter values. High terrain corresponds to high probability. You want to spend more time in high places but still visit low valleys occasionally, because uncertainty might hide there. A naive walker who only climbs uphill (greedy ascent) would get stuck on the nearest hill. MCMC is a smarter hiker: it proposes steps (sometimes short shuffles, sometimes longer jumps) and uses a gatekeeper to decide whether to stay or move. In Metropolis–Hastings, the gatekeeper flips a biased coin favoring moves to higher terrain but still allowing some downhill steps so the hiker does not get trapped. Over time, the fraction of time spent on each hill reflects its probability mass under the landscape: that is sampling from π. Gibbs sampling is like adjusting knobs one at a time: you hold all but one coordinate fixed and set the free knob according to its local instructions (the conditional), then rotate to the next knob. If each knob comes with a clear manual (easy conditionals), you quickly find good configurations. Hamiltonian Monte Carlo is like taking a sled and gliding across the landscape along paths dictated by physics: you add momentum, follow smooth trajectories determined by gradients, and lose little energy, so you can travel far with a high chance of landing in a sensible place. Burn-in is the time it takes your hiker or sled to reach the interesting part of the landscape from a poor starting point. Diagnostics are your GPS and footprints: checking whether multiple hikers explored the same regions and whether their steps are still highly correlated.
03Formal Definition
04When to Use
Use MCMC when you must sample from or compute expectations under distributions that are only known up to a multiplicative constant: classically, Bayesian posteriors with intractable normalizers. It excels when the dimension is moderate to high and exact sampling is unavailable, yet evaluating the (log-)density and, for advanced methods, its gradient is feasible. Examples include hierarchical Bayesian models (e.g., mixed effects), latent-variable models (e.g., topic models, HMMs with data augmentation), and probabilistic programs where conditionals are complex. Gibbs sampling is attractive when full conditionals are standard distributions (Gaussian, Gamma, Dirichlet), enabling simple, fast updates. Random-walk Metropolis can be effective in low to moderate dimensions with careful tuning of step size and reparameterization. Hamiltonian Monte Carlo shines for smooth, differentiable posteriors with informative gradients (GLMs with continuous priors, Gaussian processes with moderate size, Bayesian neural nets on small datasets), often reducing autocorrelation dramatically. Consider alternatives when: exact samplers exist (conjugate models with closed-form posteriors), dimension is tiny (grids/quadrature suffice), the target is severely multimodal (tempering or SMC may be better), or gradients are unavailable and dimension is high (adaptive MH or ensemble methods may be preferable). Always combine MCMC with convergence diagnostics and multiple chains started from dispersed initial points.
⚠️ Common Mistakes
- Ignoring diagnostics: Declaring convergence based on a single chain's apparent stability can be misleading. Always run multiple chains, check that R-hat is close to 1.0, and ensure effective sample sizes are adequate for key parameters.
- Poor tuning of proposals: Step sizes that are too small cause slow exploration (high acceptance but high autocorrelation), while too large cause frequent rejections (low acceptance). Target acceptance rates around 0.2–0.4 for random-walk MH in moderate dimensions; adapt scale during warm-up, not during sampling.
- Not using log-densities: Multiplying tiny probabilities underflows to zero in floating point. Work in log-space and add logs; compute acceptance using log-ratios and exponentiate only at the end.
- Bad parameterization: Sampling on constrained scales (e.g., variances, probabilities) without transformations (log, logit) leads to boundary stickiness and poor mixing. Reparameterize to unconstrained spaces when possible.
- Forgetting burn-in and initial bias: Early samples depend strongly on initialization. Discard a burn-in period and start chains from dispersed points.
- Over-thinning or unnecessary thinning: Thinning discards information; it rarely improves estimates for fixed compute budget. Prefer running longer chains and using ESS to assess precision.
- Multimodality without help: Random-walk MH can get trapped in one mode. Use parallel tempering, overdispersed initializations, or smarter proposals.
- Violating detailed balance inadvertently: In hand-rolled samplers, mishandling asymmetric proposals or Jacobians with transformations breaks correctness. Carefully include proposal densities or Jacobian terms in the acceptance ratio.
Key Formulas
Metropolis–Hastings acceptance
α(x, y) = min{ 1, [π(y) q(x|y)] / [π(x) q(y|x)] }
Explanation: This is the probability of accepting a proposed move y from current state x when using proposal q. It corrects for proposal asymmetry to ensure detailed balance with the target π.
Detailed balance
π(x) P(x, y) = π(y) P(y, x)
Explanation: If the chain's transition kernel P satisfies this symmetry with respect to π, then π is stationary. Many MCMC algorithms are designed to satisfy this condition.
Stationarity
π(y) = ∫ π(x) P(x, y) dx
Explanation: A distribution π is stationary if applying the transition once leaves it unchanged. This is the key property ensuring that long-run samples follow π.
Ergodic theorem for MCMC
(1/N) Σ_{t=1}^{N} f(X_t) → E_π[f]  almost surely as N → ∞
Explanation: Averages of a function along the Markov chain converge almost surely to its expectation under π. This justifies using sample averages for estimation.
MCMC CLT
√N ( f̄_N − E_π[f] ) → N(0, σ²_f),  where σ²_f = Var_π(f) + 2 Σ_{k=1}^{∞} Cov_π(f(X_0), f(X_k))
Explanation: Under regularity conditions, the estimator is asymptotically normal with variance inflated by the sum of autocovariances. Higher autocorrelation means larger variance and smaller effective sample size.
Effective sample size
ESS = N / (1 + 2 Σ_{k=1}^{∞} ρ_k)
Explanation: ESS estimates how many independent samples the correlated MCMC draws are worth, where ρ_k are the lag-k autocorrelations. Lower autocorrelation yields higher ESS.
R-hat (conceptual)
R̂ = √( V̂ / W ),  where V̂ = ((N−1)/N) W + B/N
Explanation: R-hat compares the between-chain variance B and the average within-chain variance W to assess convergence. Values near 1 indicate that the chains are sampling the same distribution.
Hamiltonian for HMC
H(x, p) = U(x) + (1/2) pᵀ M⁻¹ p,  with potential U(x) = −log π(x)
Explanation: HMC augments parameters x with momentum p and simulates Hamiltonian dynamics with potential U and kinetic energy from p. Approximately conserving H helps propose distant states with high acceptance.
Leapfrog updates
p ← p − (ε/2) ∇U(x);  x ← x + ε M⁻¹ p;  p ← p − (ε/2) ∇U(x)
Explanation: The leapfrog integrator advances momentum and position in staggered half/full steps. It is time-reversible and volume-preserving, crucial for valid HMC proposals.
Convergence in total variation
‖Pⁿ(x, ·) − π‖_TV → 0 as n → ∞
Explanation: For an ergodic chain, the distribution of the state Xₙ approaches π in total variation distance from any starting point x. This formalizes convergence.
Log-acceptance for numerical stability
log α = min( 0, log π(y) + log q(x|y) − log π(x) − log q(y|x) )
Explanation: Computing acceptance in log-space avoids underflow with tiny probabilities. Exponentiate only at the final step, when comparing against a uniform random variable.
Complexity Analysis
Per iteration, random-walk MH costs one evaluation of the log target (for the Gaussian examples below, an O(d²) quadratic form in dimension d); a Gibbs sweep costs one draw from each of the d full conditionals; HMC costs L gradient evaluations per proposal, trading more work per iteration for much lower autocorrelation. Storing N kept d-dimensional samples takes O(N·d) memory.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Unnormalized log-density of a 2D Gaussian mixture: 0.5*N(m1, S) + 0.5*N(m2, S)
// This target is multimodal; MH random-walk may mix slowly between modes.

struct Vec2 { double x, y; };

// Compute log N(z; m, S) up to an additive constant using the explicit 2x2 inverse
double log_gaussian_2d(const Vec2& z, const Vec2& m, const array<double,4>& S){
    // S = [s11, s12; s21, s22] (assume symmetric)
    double s11 = S[0], s12 = S[1], s21 = S[2], s22 = S[3];
    double det = s11*s22 - s12*s21;
    double dx = z.x - m.x;
    double dy = z.y - m.y;
    // Mahalanobis distance: (z-m)^T S^{-1} (z-m)
    double inv11 = s22 / det, inv12 = -s12 / det;
    double inv21 = -s21 / det, inv22 = s11 / det;
    double qx = inv11*dx + inv12*dy;
    double qy = inv21*dx + inv22*dy;
    double md2 = dx*qx + dy*qy;
    // ignore constant -0.5*log((2pi)^2 det(S)) since MH uses ratios
    return -0.5 * md2;
}

// Unnormalized log target density log pi(z)
double log_pi(const Vec2& z){
    Vec2 m1{ -3.0, -3.0 };
    Vec2 m2{  3.0,  3.0 };
    array<double,4> S{1.0, 0.6, 0.6, 1.5}; // positive-definite covariance
    // log-sum-exp of two components with equal weights 0.5 each
    double a = log(0.5) + log_gaussian_2d(z, m1, S);
    double b = log(0.5) + log_gaussian_2d(z, m2, S);
    double m = max(a, b);
    return m + log(exp(a-m) + exp(b-m));
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    random_device rd;
    mt19937 gen(rd());

    // MH random-walk proposal: y = x + N(0, sigma^2 I)
    double sigma = 1.0; // tune: smaller -> higher acceptance but slower exploration
    normal_distribution<double> noise(0.0, sigma);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // Initialize between the modes to observe burn-in behavior
    Vec2 x{0.0, 0.0};

    int burn_in = 2000;
    int N = 20000; // number of kept samples after burn-in

    vector<Vec2> samples;
    samples.reserve(N);

    int accepts = 0, proposals = 0;

    for(int t=0; t<burn_in + N; ++t){
        // Propose a move
        Vec2 y{ x.x + noise(gen), x.y + noise(gen) };

        // Log acceptance, numerically stable; symmetric proposal => q cancels
        double log_alpha = log_pi(y) - log_pi(x);
        double a = (log_alpha < 0.0) ? exp(log_alpha) : 1.0;

        // Accept/reject
        if(unif(gen) < a){
            x = y;
            if (t >= burn_in) accepts++;
        }
        if (t >= burn_in){
            samples.push_back(x);
            proposals++;
        }
    }

    // Simple summaries
    double mean_x=0, mean_y=0;
    for(const auto& s : samples){ mean_x += s.x; mean_y += s.y; }
    mean_x /= samples.size();
    mean_y /= samples.size();

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Acceptance rate: " << (double)accepts / max(1, proposals) << "\n";
    cout << "Mean (approx): [" << mean_x << ", " << mean_y << "]\n";

    // Note: For real diagnostics, run multiple chains and compute R-hat and ESS.
    return 0;
}
```
This program implements a Metropolis–Hastings random-walk sampler to draw from a bimodal 2D Gaussian mixture. The target density is evaluated up to a constant via log-sum-exp for numerical stability. Because the proposal is symmetric, the MH acceptance simplifies to min(1, π(y)/π(x)), computed in log-space to avoid underflow. After a burn-in, it collects samples, reports the acceptance rate, and prints the approximate mean. On this multimodal target, a simple random walk may mix slowly between modes, visible as long stretches near one mode in a trace plot.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Gibbs sampling for (X, Y) jointly normal with means mu1, mu2, stds s1, s2 and correlation rho.
// Conditionals are normal: X|Y=y ~ N(mu1 + rho*(s1/s2)*(y-mu2), (1-rho^2)*s1^2), and similarly for Y|X.

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Parameters of the joint normal
    double mu1 = 0.0, mu2 = 0.0;
    double s1 = 2.0, s2 = 1.0;
    double rho = 0.7; // fairly strong correlation; Gibbs mixing slows as |rho| -> 1

    // Derived conditional standard deviations
    double sd_x = sqrt((1.0 - rho*rho) * s1 * s1);
    double sd_y = sqrt((1.0 - rho*rho) * s2 * s2);

    random_device rd; mt19937 gen(rd());
    normal_distribution<double> stdn(0.0, 1.0);

    int burn_in = 2000;
    int N = 20000;

    // Initialize away from the mean
    double x = -5.0, y = 5.0;

    vector<array<double,2>> samples; samples.reserve(N);

    // One Gibbs sweep updates x|y then y|x
    for(int t=0; t<burn_in + N; ++t){
        // Sample X | Y=y
        double mean_x = mu1 + rho * (s1/s2) * (y - mu2);
        x = mean_x + sd_x * stdn(gen);
        // Sample Y | X=x
        double mean_y = mu2 + rho * (s2/s1) * (x - mu1);
        y = mean_y + sd_y * stdn(gen);

        if(t >= burn_in) samples.push_back({x, y});
    }

    // Summaries
    double mx=0, my=0; for(auto &s: samples){ mx+=s[0]; my+=s[1]; }
    mx/=samples.size(); my/=samples.size();

    // Empirical correlation
    double vx=0, vy=0, cxy=0;
    for(auto &s: samples){
        vx += (s[0]-mx)*(s[0]-mx);
        vy += (s[1]-my)*(s[1]-my);
        cxy+= (s[0]-mx)*(s[1]-my);
    }
    vx/=samples.size(); vy/=samples.size(); cxy/=samples.size();
    double corr = cxy / (sqrt(vx)*sqrt(vy));

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Mean ≈ (" << mx << ", " << my << ")\n";
    cout << "Corr ≈ " << corr << " (target rho=" << rho << ")\n";

    return 0;
}
```
This Gibbs sampler exploits the fact that a bivariate normal has normal full conditionals. Each sweep draws X given the current Y, then Y given the new X, using closed-form conditional means and variances. Each update is an exact draw, so no accept/reject step is needed, and at ρ = 0.7 the chain moves efficiently; note, however, that Gibbs mixing degrades as |ρ| approaches 1, because axis-aligned updates take small effective steps along a strongly correlated ridge. The output reports empirical means and correlation as a correctness check.
```cpp
#include <bits/stdc++.h>
using namespace std;

// HMC sampler for a 2D Gaussian target N(mu, Sigma) using an identity mass matrix.
// Demonstrates leapfrog integration and the Metropolis correction.

struct Vec2 { double x, y; };

struct Gaussian2D {
    Vec2 mu; array<double,4> S;          // Sigma = [s11 s12; s21 s22]
    array<double,4> Sinv; double logdet; // precomputed Sigma^{-1} and log|Sigma|
    Gaussian2D(Vec2 mu_, array<double,4> S_) : mu(mu_), S(S_) {
        double det = S[0]*S[3] - S[1]*S[2];
        Sinv = { S[3]/det, -S[1]/det, -S[2]/det, S[0]/det };
        logdet = log(det);
    }
    // U(x) = -log pi(x) up to a constant
    double U(const Vec2& x) const {
        double dx = x.x - mu.x, dy = x.y - mu.y;
        double qx = Sinv[0]*dx + Sinv[1]*dy;
        double qy = Sinv[2]*dx + Sinv[3]*dy;
        return 0.5 * (dx*qx + dy*qy); // constants dropped
    }
    // grad U(x) = Sigma^{-1} (x - mu)
    Vec2 gradU(const Vec2& x) const {
        double dx = x.x - mu.x, dy = x.y - mu.y;
        return { Sinv[0]*dx + Sinv[1]*dy, Sinv[2]*dx + Sinv[3]*dy };
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Gaussian2D target({0.0, 0.0}, {2.0, 1.2, 1.2, 1.5}); // correlated Gaussian

    random_device rd; mt19937 gen(rd());
    normal_distribution<double> stdn(0.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // HMC hyperparameters (tune in warm-up)
    double eps = 0.2; // step size
    int L = 10;       // leapfrog steps

    Vec2 x{ -4.0, 3.0 }; // initial position

    int burn_in = 2000, N = 20000;
    vector<Vec2> samples; samples.reserve(N);
    int accepted = 0;

    for(int it=0; it<burn_in + N; ++it){
        // Sample momentum p ~ N(0, I)
        Vec2 p{ stdn(gen), stdn(gen) };

        // Starting Hamiltonian
        double U0 = target.U(x);
        double K0 = 0.5 * (p.x*p.x + p.y*p.y);

        // Propose (x*, p*) via leapfrog: initial half-step for momentum
        Vec2 xprop = x, pprop = p;
        Vec2 g = target.gradU(xprop);
        pprop.x -= 0.5 * eps * g.x;
        pprop.y -= 0.5 * eps * g.y;
        for(int i=0; i<L; i++){
            // Full-step position
            xprop.x += eps * pprop.x;
            xprop.y += eps * pprop.y;
            g = target.gradU(xprop);
            // Full-step momentum, except before the final half-step
            if(i != L-1){
                pprop.x -= eps * g.x;
                pprop.y -= eps * g.y;
            }
        }
        // Final half-step for momentum
        pprop.x -= 0.5 * eps * g.x;
        pprop.y -= 0.5 * eps * g.y;
        // Negate momentum for reversibility (a no-op for the symmetric kinetic energy)
        pprop.x = -pprop.x; pprop.y = -pprop.y;

        // Metropolis accept/reject based on the Hamiltonian difference
        double U1 = target.U(xprop);
        double K1 = 0.5 * (pprop.x*pprop.x + pprop.y*pprop.y);
        double log_alpha = (U0 + K0) - (U1 + K1);
        if(log(unif(gen)) < log_alpha){ x = xprop; if(it >= burn_in) accepted++; }
        if(it >= burn_in) samples.push_back(x);
    }

    // Summaries
    double mx=0, my=0; for(const auto& s : samples){ mx+=s.x; my+=s.y; }
    mx/=samples.size(); my/=samples.size();
    cout.setf(std::ios::fixed); cout<<setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Mean ≈ (" << mx << ", " << my << ")\n";
    cout << "Acceptance (post burn-in) ≈ " << (double)accepted / samples.size() << "\n";

    return 0;
}
```
This example implements Hamiltonian Monte Carlo for a 2D correlated Gaussian. The potential U and its gradient are known in closed form, so the leapfrog integrator simulates Hamiltonian dynamics with step size ε and L steps. A Metropolis correction accepts or rejects the proposal to counteract integration error. Compared with random-walk MH on the same target, HMC typically achieves higher effective sample size per iteration due to long-distance, high-acceptance moves guided by gradients.