
MCMC Theory

Key Points

  • MCMC simulates a Markov chain whose long-run behavior matches a target distribution, letting us sample from complex posteriors without knowing the normalization constant.
  • Metropolis–Hastings proposes a move and accepts it with a probability that corrects for proposal bias, ensuring detailed balance with the target.
  • Gibbs sampling updates one coordinate at a time from its conditional distribution, which can mix fast when conditionals are easy.
  • Ergodicity guarantees convergence to the target distribution after burn-in, but practical chains may mix slowly, especially for multimodal targets.
  • Diagnostics like R-hat, effective sample size, and trace plots are essential to detect non-convergence and strong autocorrelation.
  • Engineering choices (step size, proposal scale, parametrization, gradients for HMC) dominate performance and mixing.
  • Hamiltonian Monte Carlo uses gradients and simulated dynamics to make distant proposals with high acceptance, dramatically reducing autocorrelation.
  • In Bayesian inference and probabilistic programming, MCMC is a workhorse for sampling from intractable posteriors and computing expectations.

Prerequisites

  • Probability density and Bayes’ rule — MCMC targets are often posteriors Ļ€(Īø|y) āˆ p(y|Īø)p(Īø), requiring comfort with densities and conditioning.
  • Markov chains and transition kernels — Understanding stationarity, detailed balance, and ergodicity is essential for why MCMC works.
  • Multivariate normal distribution — Useful for Gibbs and for test targets; provides closed-form conditionals and gradients.
  • Numerical stability and log-space arithmetic — Computing acceptance probabilities safely requires log-densities and log-sum-exp.
  • Linear algebra basics — Evaluating Gaussian targets involves matrix operations like inverses and quadratic forms.
  • Calculus and gradients — HMC requires computing āˆ‡U(x); differentiability and gradient calculations are central.
  • Random number generation — Sampling from proposals and conditionals relies on high-quality RNGs and standard distributions.
  • Convergence diagnostics (R-hat, ESS) — To know when chains have mixed and how much information you have.
  • Reparameterization and constraints — Transforming parameters to unconstrained spaces improves mixing and correctness (Jacobian terms).
  • Basic C++ programming — Implementing samplers, RNG usage, and managing memory for storing samples.

Detailed Explanation


01 Overview

Markov chain Monte Carlo (MCMC) is a family of algorithms for sampling from complicated probability distributions, especially those known up to a normalizing constant. Instead of drawing independent samples, MCMC builds a Markov chain that spends time in regions proportional to the target density \(\pi(x)\). Over time, the distribution of the chain's state converges to \(\pi\), so averages over the chain approximate expectations under \(\pi\). Two core methods are Metropolis–Hastings (MH) and Gibbs sampling. MH proposes a candidate point from a proposal distribution and accepts it with a carefully chosen probability that enforces detailed balance with \(\pi\). Gibbs sampling cycles through coordinates (or blocks), sampling each from its conditional distribution given the others. When conditionals are easy, Gibbs is simple and efficient. MCMC underpins Bayesian inference, where posteriors are often high-dimensional and lack closed-form normalizing constants. Convergence is not instantaneous: the initial segment (burn-in) is discarded, and dependence between samples reduces the effective sample size. Practical use requires diagnostics (e.g., R-hat, ESS) and mindful tuning (e.g., proposal scales, reparameterization, or gradient-based HMC). Advanced variants like Hamiltonian Monte Carlo (HMC) and NUTS use gradients to move efficiently through parameter space, often improving mixing by orders of magnitude.
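
Before the detailed examples later in this section, a minimal sketch of the generic random-walk MH step may help fix ideas; log_target below is a hypothetical stand-in for any unnormalized log-density (the Code Examples section instantiates this same loop for concrete targets).

#include <cmath>
#include <functional>
#include <random>

// One generic random-walk Metropolis step: propose, compare log-densities, accept or stay.
double mh_step(double x, const std::function<double(double)>& log_target,
               double step_sd, std::mt19937& gen){
    std::normal_distribution<double> noise(0.0, step_sd);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double y = x + noise(gen);                        // symmetric proposal: q cancels
    double log_alpha = log_target(y) - log_target(x); // log pi(y) - log pi(x)
    return (std::log(unif(gen)) < log_alpha) ? y : x;
}

int main(){
    std::mt19937 gen(7);
    auto log_target = [](double x){ return -0.5 * x * x; }; // unnormalized standard normal
    double x = 3.0;                                         // deliberately poor start
    for(int t = 0; t < 10000; ++t) x = mh_step(x, log_target, 1.0, gen);
    return 0;
}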

02 Intuition & Analogies

Imagine exploring a dark landscape representing the probability of different parameter values. High terrain corresponds to high probability. You want to spend more time in high places but still visit low valleys occasionally, because uncertainty might hide there. A naive walker who only climbs uphill (greedy ascent) would get stuck on the nearest hill. MCMC is a smarter hiker: it proposes steps (sometimes short shuffles, sometimes longer jumps) and uses a gatekeeper to decide whether to stay or move. In Metropolis–Hastings, the gatekeeper flips a biased coin favoring moves to higher terrain but still allowing some downhill steps so the hiker does not get trapped. Over time, the fraction of time spent on each hill reflects its area under the landscape—that’s sampling from \(\pi\). Gibbs sampling is like adjusting knobs one at a time: you hold all but one coordinate fixed and set the free knob according to its local instructions (the conditional), then rotate to the next knob. If each knob comes with a clear manual (easy conditionals), you quickly find good configurations. Hamiltonian Monte Carlo is like taking a sled and gliding across the landscape along contours dictated by physics: you add momentum, follow smooth paths determined by gradients, and lose little energy, so you can travel far with a high chance of landing in a sensible place. Burn-in is the time it takes your hiker or sled to reach the interesting part of the landscape from a poor starting point. Diagnostics are your GPS and footprints—checking if multiple hikers explored the same regions and whether their steps are still highly correlated.

03 Formal Definition

Let \((X_t)_{t \ge 0}\) be a Markov chain on state space \(\mathcal{X}\) with transition kernel \(P(x, A)\) (the probability of moving from \(x\) into a measurable set \(A\)). A distribution \(\pi\) on \(\mathcal{X}\) is stationary if \(\pi P = \pi\), i.e., \(\int \pi(dx)\, P(x, A) = \pi(A)\) for all measurable \(A\). Many MCMC algorithms construct \(P\) to satisfy detailed balance (reversibility): \(\pi(dx)\, P(x, dy) = \pi(dy)\, P(y, dx)\), which implies stationarity. In Metropolis–Hastings with proposal density \(q(y \mid x)\), the transition is: propose \(Y \sim q(\cdot \mid X_t)\), accept with probability \(\alpha(X_t, Y) = \min\{1,\; [\pi(Y)\, q(X_t \mid Y)] / [\pi(X_t)\, q(Y \mid X_t)]\}\), else stay at \(X_t\). Gibbs sampling updates components sequentially (or in blocks) by drawing from the full conditional distributions \(\pi(x_i \mid x_{-i})\); each such update satisfies detailed balance with \(\pi\), so the sweep preserves stationarity under mild conditions. If the chain is irreducible, aperiodic, and positive recurrent (ergodic), the distribution of \(X_t\) converges to \(\pi\) as \(t \to \infty\). For integrable test functions \(f\), the MCMC estimator \(\hat{f}_N = \frac{1}{N} \sum_{t=1}^{N} f(X_t)\) converges almost surely to \(\mathbb{E}_\pi[f(X)]\) (ergodic theorem), and under regularity a central limit theorem holds with asymptotic variance inflated by autocorrelation. Practical performance depends on mixing time (how fast dependence decays), which is influenced by proposals, geometry, and dimension.
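
To see why detailed balance implies stationarity, integrate the reversibility identity over the starting point; a one-line derivation:

\[
\int_{\mathcal{X}} \pi(dx)\, P(x, A)
= \int_{\mathcal{X}} \int_{A} \pi(dx)\, P(x, dy)
= \int_{A} \int_{\mathcal{X}} \pi(dy)\, P(y, dx)
= \int_{A} \pi(dy)
= \pi(A).
\]

Since \(\int_{\mathcal{X}} P(y, dx) = 1\) for each \(y\), the inner integral collapses, giving \(\pi P = \pi\).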

04 When to Use

Use MCMC when you must sample from or compute expectations under distributions that are only known up to a multiplicative constant—classically Bayesian posteriors with intractable normalizers. It excels when the dimension is moderate to high and exact sampling is unavailable, yet evaluating the (log-)density and, for advanced methods, its gradient is feasible. Examples include hierarchical Bayesian models (e.g., mixed effects), latent-variable models (e.g., topic models, HMMs with data augmentation), and probabilistic programs where conditionals are complex. Gibbs sampling is attractive when full conditionals are standard distributions (Gaussian, Gamma, Dirichlet), enabling simple, fast updates. Random-walk Metropolis can be effective in low to moderate dimensions with careful tuning of step size and reparameterization. Hamiltonian Monte Carlo shines for smooth, differentiable posteriors with informative gradients (GLMs with continuous priors, Gaussian processes with moderate size, Bayesian neural nets on small datasets), often reducing autocorrelation dramatically. Consider alternatives when: exact samplers exist (conjugate models with closed-form posteriors), dimension is tiny (grids/quadrature suffice), the target is severely multimodal (tempering or SMC may be better), or gradients are unavailable and dimension is high (adaptive MH or ensemble methods may be preferable). Always combine MCMC with convergence diagnostics and multiple chains started from dispersed initial points.

āš ļøCommon Mistakes

  • Ignoring diagnostics: Declaring convergence based on a single chain’s apparent stability can be misleading. Always run multiple chains, check R-hat close to 1.0, and ensure effective sample sizes are adequate for key parameters.
  • Poor tuning of proposals: Step sizes that are too small cause slow exploration (high acceptance but high autocorrelation), while too large cause frequent rejections (low acceptance). Target acceptance rates around 0.2–0.4 for random-walk MH in moderate dimensions; adapt scale during warm-up, not during sampling.
  • Not using log-densities: Multiplying tiny probabilities underflows to zero in floating point. Work in log-space and add logs; compute acceptance using log-ratios and exponentiate only at the end (see the sketch after this list).
  • Bad parameterization: Sampling on constrained scales (e.g., variances, probabilities) without transformations (log, logit) leads to boundary stickiness and poor mixing. Reparameterize to unconstrained spaces when possible.
  • Forgetting burn-in and initial bias: Early samples depend strongly on initialization. Discard a burn-in period and start chains from dispersed points.
  • Over-thinning or unnecessary thinning: Thinning discards information; it rarely improves estimates for fixed compute budget. Prefer running longer chains and using ESS to assess precision.
  • Multimodality without help: Random-walk MH can get trapped in one mode. Use parallel tempering, overdispersed initializations, or smarter proposals.
  • Violating detailed balance inadvertently: In hand-rolled samplers, mishandling asymmetric proposals or Jacobians with transformations breaks correctness. Carefully include proposal densities or Jacobian terms in the acceptance ratio.
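
The log-density and Jacobian pitfalls above combine naturally in one example (referenced from the log-densities item). Below is a minimal sketch, assuming a hypothetical Inverse-Gamma-like target over a positive scale parameter σ²: the sampler runs on the unconstrained scale Īø = log σ², so the log-Jacobian of the transform (equal to Īø) must be added to the log-target, and the acceptance test compares log u with the log-ratio so nothing underflows.

#include <cmath>
#include <random>

// Hypothetical stand-in target: unnormalized Inverse-Gamma(2, 1) log-density over sigma2 > 0.
double log_pi(double sigma2){ return -3.0 * std::log(sigma2) - 1.0 / sigma2; }

// Log-target on the unconstrained scale theta = log(sigma2).
// The log-Jacobian of sigma2 = exp(theta) is log|d sigma2 / d theta| = theta.
double log_pi_theta(double theta){ return log_pi(std::exp(theta)) + theta; }

int main(){
    std::mt19937 gen(42);
    std::normal_distribution<double> step(0.0, 0.5);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double theta = 0.0;
    for(int t = 0; t < 10000; ++t){
        double prop = theta + step(gen);                  // symmetric proposal: q cancels
        double log_alpha = log_pi_theta(prop) - log_pi_theta(theta);
        if(std::log(unif(gen)) < log_alpha) theta = prop; // compare in log-space: no underflow
    }
    // exp(theta) is a draw for sigma2 after mixing; omitting the +theta term above
    // would silently sample from the wrong distribution.
    return 0;
}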

Key Formulas

Metropolis–Hastings acceptance

\( \alpha(x, y) = \min\left\{ 1,\; \dfrac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)} \right\} \)

Explanation: This is the probability of accepting a proposed move y from current state x when using proposal q. It corrects for proposal asymmetry to ensure detailed balance with the target.

Detailed balance

\( \pi(dx)\, P(x, dy) = \pi(dy)\, P(y, dx) \)

Explanation: If the chain satisfies this symmetry with respect to \(\pi\), then \(\pi\) is stationary. Many MCMC algorithms are designed to satisfy this condition.

Stationarity

\( \pi P = \pi \), i.e., \( \int \pi(dx)\, P(x, A) = \pi(A) \) for all measurable \(A\)

Explanation: A distribution Ļ€ is stationary if applying the transition kernel once leaves it unchanged. This is the key property ensuring that long-run samples follow \(\pi\).

Ergodic theorem for MCMC

\( \hat{f}_N = \frac{1}{N} \sum_{t=1}^{N} f(X_t) \xrightarrow{\text{a.s.}} \mathbb{E}_\pi[f(X)] \)

Explanation: Averages of a function along the Markov chain converge almost surely to its expectation under \(\pi\). This justifies using sample averages for estimation.

MCMC CLT

\( \sqrt{N}\,\big(\hat{f}_N - \mathbb{E}_\pi[f]\big) \xrightarrow{d} \mathcal{N}\big(0,\; \sigma_f^2\big), \quad \sigma_f^2 = \operatorname{Var}_\pi(f)\Big(1 + 2 \sum_{k=1}^{\infty} \rho_k\Big) \)

Explanation: Under regularity, the estimator is asymptotically normal with variance inflated by the sum of autocorrelations \(\rho_k\). Higher autocorrelation means larger variance and smaller effective sample size.

Effective sample size

\( \mathrm{ESS} = \dfrac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k} \)

Explanation: ESS estimates how many independent samples the correlated MCMC draws are worth, where \(\rho_k\) are the lag-k autocorrelations. Lower autocorrelation yields higher ESS.
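
A minimal single-chain ESS sketch, truncating the autocorrelation sum at the first non-positive lag (a simple heuristic; production implementations such as Stan's use more careful truncation rules):

#include <cstdio>
#include <random>
#include <vector>

// ESS = N / (1 + 2 * sum of positive-lag autocorrelations), truncated at the
// first non-positive autocorrelation. Cost is O(N*K) for K retained lags.
double effective_sample_size(const std::vector<double>& x){
    const std::size_t n = x.size();
    double mean = 0.0;
    for(double v : x) mean += v;
    mean /= n;
    double var = 0.0;
    for(double v : x) var += (v - mean) * (v - mean);
    var /= n;
    if(var == 0.0) return (double)n;
    double sum_rho = 0.0;
    for(std::size_t k = 1; k < n; ++k){
        double acov = 0.0;
        for(std::size_t t = 0; t + k < n; ++t) acov += (x[t] - mean) * (x[t + k] - mean);
        double rho = (acov / n) / var;
        if(rho <= 0.0) break;          // truncate once correlations die out
        sum_rho += rho;
    }
    return n / (1.0 + 2.0 * sum_rho);
}

int main(){
    std::mt19937 gen(1);
    std::normal_distribution<double> stdn(0.0, 1.0);
    std::vector<double> chain(20000);
    double x = 0.0;
    for(double& v : chain){ x = 0.9 * x + stdn(gen); v = x; } // AR(1): true ESS ā‰ˆ N/19
    std::printf("ESS ā‰ˆ %.1f of %zu draws\n", effective_sample_size(chain), chain.size());
    return 0;
}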

R-hat (conceptual)

\( \hat{R} = \sqrt{ \dfrac{ \frac{N-1}{N}\, W + \frac{1}{N}\, B }{ W } } \), where \(W\) is the average within-chain variance and \(B\) the between-chain variance

Explanation: R-hat compares between-chain and within-chain variances to assess convergence. Values near 1 indicate that chains are sampling the same distribution.
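
A minimal sketch of the classic (non-split) R-hat over M equal-length chains; modern practice splits each chain in half and rank-normalizes first, but the between/within variance comparison below is the core idea:

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Classic Gelman–Rubin R-hat: between-chain variance B, within-chain variance W,
// pooled estimate var+ = (N-1)/N * W + B/N, R-hat = sqrt(var+ / W).
double r_hat(const std::vector<std::vector<double>>& chains){
    const std::size_t M = chains.size(), N = chains[0].size();
    std::vector<double> means(M);
    double grand = 0.0, W = 0.0;
    for(std::size_t m = 0; m < M; ++m){
        double mu = 0.0;
        for(double v : chains[m]) mu += v;
        means[m] = mu / N;
        grand += means[m];
        double s2 = 0.0;
        for(double v : chains[m]) s2 += (v - means[m]) * (v - means[m]);
        W += s2 / (N - 1);             // within-chain sample variance
    }
    grand /= M; W /= M;
    double B = 0.0;                     // between-chain variance (scaled by N)
    for(double mu : means) B += (mu - grand) * (mu - grand);
    B *= (double)N / (M - 1);
    double var_plus = ((N - 1.0) / N) * W + B / N;
    return std::sqrt(var_plus / W);
}

int main(){
    std::mt19937 gen(3);
    std::normal_distribution<double> stdn(0.0, 1.0);
    std::vector<std::vector<double>> chains(4, std::vector<double>(5000));
    for(auto& c : chains) for(double& v : c) v = stdn(gen); // well-mixed chains
    std::printf("R-hat ā‰ˆ %.3f (expect ā‰ˆ 1 for well-mixed chains)\n", r_hat(chains));
    return 0;
}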

Hamiltonian for HMC

\( H(x, p) = U(x) + \tfrac{1}{2}\, p^\top M^{-1} p, \quad U(x) = -\log \pi(x) \)

Explanation: HMC augments parameters x with momentum p and simulates Hamiltonian dynamics with potential U and kinetic energy from p (mass matrix M). Approximately conserving H helps propose distant states with high acceptance.

Leapfrog updates

\( p_{1/2} = p - \tfrac{\epsilon}{2} \nabla U(x), \quad x' = x + \epsilon\, M^{-1} p_{1/2}, \quad p' = p_{1/2} - \tfrac{\epsilon}{2} \nabla U(x') \)

Explanation: The leapfrog integrator advances momentum and position in staggered half/full steps. It is time-reversible and volume-preserving, crucial for valid HMC proposals.

Convergence in total variation

\( \lVert P^t(x, \cdot) - \pi \rVert_{\mathrm{TV}} \to 0 \ \text{ as } \ t \to \infty \)

Explanation: For an ergodic chain, the distribution of \(X_t\) approaches Ļ€ in total variation distance from any starting point x. This formalizes convergence.

Log-acceptance for numerical stability

\( \log \alpha = \min\{ 0, \ \log \pi(y) + \log q(x \mid y) - \log \pi(x) - \log q(y \mid x) \} \)

Explanation: Computing acceptance in log-space avoids underflow with tiny probabilities. Exponentiate only at the end, or equivalently compare \(\log u\) with \(\log \alpha\) for \(u \sim \mathrm{Uniform}(0,1)\).

Complexity Analysis

Cost per MCMC iteration depends on evaluating the target density (and sometimes its gradient) and on the proposal mechanism. For Metropolis–Hastings with a simple random-walk proposal in d dimensions, one step typically requires O(d) work to evaluate the log-density difference (assuming cost linear in d), plus O(d) to generate Gaussian noise, for a total of O(d). Acceptance decisions are O(1). However, overall runtime to achieve a desired accuracy depends heavily on mixing: strong autocorrelation inflates variance and requires more iterations. Memory is O(Nd) if all N samples are stored; streaming estimates reduce memory to O(d). Gibbs sampling often reduces per-step cost: if each coordinate’s conditional can be sampled in O(1), a full sweep over all d coordinates is O(d), while O(d)-per-coordinate conditionals make a sweep O(d²). Its mixing can be excellent when variables are conditionally independent or weakly correlated, yet poor when there is strong cross-correlation not captured by coordinate updates. Hamiltonian Monte Carlo uses L leapfrog steps per iteration; each step requires a gradient evaluation costing O(d), so a proposal costs O(Ld). HMC typically has much lower autocorrelation, so fewer iterations are needed for a given effective sample size, often offsetting the higher per-iteration cost. Diagnostics add overhead: computing autocorrelations up to lag K costs O(NK) per parameter; multi-chain R-hat involves O(MNd) over M chains. In high dimensions, the dominant cost is repeatedly evaluating the model’s log-likelihood and gradient; careful implementation (vectorization, caching sufficient statistics) and tuning (step sizes, mass matrix) are critical. Parallelization across chains provides near-linear speedups for diagnostics and ESS estimation; within-chain parallelism is limited, though gradient computations may be parallelizable.
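
As noted above, storing every draw costs O(Nd) memory, while streaming estimators keep only running moments in O(d). A minimal sketch using Welford's online algorithm per coordinate:

#include <cstddef>
#include <vector>

// Welford's online mean/variance per coordinate: O(d) memory for any chain length.
struct RunningMoments {
    std::size_t n = 0;
    std::vector<double> mean, m2; // m2 accumulates sums of squared deviations
    explicit RunningMoments(std::size_t d) : mean(d, 0.0), m2(d, 0.0) {}
    void update(const std::vector<double>& x){
        ++n;
        for(std::size_t i = 0; i < mean.size(); ++i){
            double delta = x[i] - mean[i];
            mean[i] += delta / n;
            m2[i] += delta * (x[i] - mean[i]);
        }
    }
    double variance(std::size_t i) const { return n > 1 ? m2[i] / (n - 1) : 0.0; }
};

int main(){
    RunningMoments rm(2);
    rm.update({1.0, 2.0});
    rm.update({3.0, 4.0});
    // mean = (2.0, 3.0), sample variance = (2.0, 2.0); feed each MCMC draw here
    // instead of pushing it into a vector when only summaries are needed.
    return rm.variance(0) == 2.0 ? 0 : 1;
}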

Code Examples

Metropolis–Hastings (Random-Walk) for a 2D Gaussian Mixture (multimodal)
#include <bits/stdc++.h>
using namespace std;

// Unnormalized log-density of a 2D Gaussian mixture: 0.5*N(m1, S) + 0.5*N(m2, S)
// This target is multimodal; MH random-walk may mix slowly between modes.

struct Vec2 { double x, y; };

// Compute log N(z; m, S) up to an additive constant using the explicit 2x2 inverse
double log_gaussian_2d(const Vec2& z, const Vec2& m, const array<double,4>& S){
    // S = [s11, s12; s21, s22] (assumed symmetric)
    double s11 = S[0], s12 = S[1], s21 = S[2], s22 = S[3];
    double det = s11*s22 - s12*s21;
    double dx = z.x - m.x;
    double dy = z.y - m.y;
    // Mahalanobis distance: (z-m)^T S^{-1} (z-m)
    double inv11 =  s22 / det;
    double inv12 = -s12 / det;
    double inv21 = -s21 / det;
    double inv22 =  s11 / det;
    double qx = inv11*dx + inv12*dy;
    double qy = inv21*dx + inv22*dy;
    double md2 = dx*qx + dy*qy;
    // ignore constant -0.5*log((2*pi)^2 * det(S)) since MH uses ratios
    return -0.5 * md2;
}

// Unnormalized log target density log pi(z)
double log_pi(const Vec2& z){
    Vec2 m1{ -3.0, -3.0 };
    Vec2 m2{  3.0,  3.0 };
    array<double,4> S{1.0, 0.6, 0.6, 1.5}; // positive-definite covariance
    // log-sum-exp of two components with equal weights 0.5 each
    double a = log(0.5) + log_gaussian_2d(z, m1, S);
    double b = log(0.5) + log_gaussian_2d(z, m2, S);
    double m = max(a, b);
    return m + log(exp(a-m) + exp(b-m));
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Random number generator
    random_device rd;
    mt19937 gen(rd());

    // MH random-walk proposal: y = x + N(0, sigma^2 I)
    double sigma = 1.0; // tune: smaller -> higher acceptance but slower exploration
    normal_distribution<double> noise(0.0, sigma);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // Initialize between the modes (a low-density saddle) to exercise burn-in
    Vec2 x{0.0, 0.0};

    int burn_in = 2000;
    int N = 20000; // number of kept samples after burn-in

    vector<Vec2> samples;
    samples.reserve(N);

    int accepts = 0, proposals = 0;

    // Total iterations = burn-in + N
    for(int t = 0; t < burn_in + N; ++t){
        // Propose a move
        Vec2 y{ x.x + noise(gen), x.y + noise(gen) };

        // Compute log acceptance in a numerically stable way
        double log_alpha = log_pi(y) - log_pi(x); // symmetric proposal => q cancels
        double a = 1.0;
        if (log_alpha < 0.0) a = exp(log_alpha);

        // Accept/reject
        double u = unif(gen);
        if(u < a){
            x = y;
            if (t >= burn_in) accepts++;
        }
        if (t >= burn_in){
            samples.push_back(x);
            proposals++;
        }
    }

    // Simple summaries
    double mean_x = 0, mean_y = 0;
    for(const auto& s : samples){ mean_x += s.x; mean_y += s.y; }
    mean_x /= samples.size();
    mean_y /= samples.size();

    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Acceptance rate: " << (double)accepts / max(1, proposals) << "\n";
    cout << "Posterior mean (approx): [" << mean_x << ", " << mean_y << "]\n";

    // Note: For real diagnostics, run multiple chains and compute R-hat and ESS.
    return 0;
}

This program implements a Metropolis–Hastings random-walk sampler to draw from a bimodal 2D Gaussian mixture. The target density is evaluated up to a constant via log-sum-exp for numerical stability. Because the proposal is symmetric, the MH acceptance simplifies to min(1, Ļ€(y)/Ļ€(x)), computed in log-space to avoid underflow. After a burn-in, it collects samples, reports acceptance rate, and prints the approximate mean. On this multimodal target, a simple random-walk may exhibit slow mixing between modes, reflected in a low acceptance rate or long stretches near one mode.

Time: O(N d) per chain (here d = 2), dominated by log Ļ€ evaluations for N kept samples (plus burn-in).
Space: O(N d) to store samples; O(1) extra working memory.
Gibbs Sampler for a Bivariate Normal
#include <bits/stdc++.h>
using namespace std;

// Gibbs sampling for (X, Y) jointly normal with means mu1, mu2, stds s1, s2 and correlation rho.
// Conditionals are normal: X|Y=y ~ N(mu1 + rho*(s1/s2)*(y-mu2), (1-rho^2)*s1^2), and similarly for Y|X.

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Parameters of the joint normal
    double mu1 = 0.0, mu2 = 0.0;
    double s1 = 2.0, s2 = 1.0;
    double rho = 0.7; // strong correlation to show Gibbs efficiency

    // Derived conditional variances
    double var_x = (1.0 - rho*rho) * s1 * s1;
    double var_y = (1.0 - rho*rho) * s2 * s2;
    double sd_x = sqrt(var_x);
    double sd_y = sqrt(var_y);

    random_device rd; mt19937 gen(rd());
    normal_distribution<double> stdn(0.0, 1.0);

    int burn_in = 2000;
    int N = 20000;

    // Initialize
    double x = -5.0, y = 5.0;

    vector<array<double,2>> samples; samples.reserve(N);

    // One Gibbs sweep updates x|y then y|x
    for(int t = 0; t < burn_in + N; ++t){
        // Sample X | Y=y
        double mean_x = mu1 + rho * (s1/s2) * (y - mu2);
        x = mean_x + sd_x * stdn(gen);
        // Sample Y | X=x
        double mean_y = mu2 + rho * (s2/s1) * (x - mu1);
        y = mean_y + sd_y * stdn(gen);

        if(t >= burn_in) samples.push_back({x, y});
    }

    // Summaries
    double mx = 0, my = 0;
    for(auto& s : samples){ mx += s[0]; my += s[1]; }
    mx /= samples.size(); my /= samples.size();

    // Empirical correlation
    double vx = 0, vy = 0, cxy = 0;
    for(auto& s : samples){
        vx  += (s[0]-mx) * (s[0]-mx);
        vy  += (s[1]-my) * (s[1]-my);
        cxy += (s[0]-mx) * (s[1]-my);
    }
    vx /= samples.size(); vy /= samples.size(); cxy /= samples.size();
    double corr = cxy / (sqrt(vx) * sqrt(vy));

    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Mean ā‰ˆ (" << mx << ", " << my << ")\n";
    cout << "Corr ā‰ˆ " << corr << " (target rho=" << rho << ")\n";

    return 0;
}

This Gibbs sampler exploits the fact that a bivariate normal has normal full conditionals. Each sweep draws X given the current Y, then Y given the new X, using closed-form means and variances. With correlated components, a poorly scaled random-walk MH would struggle, while Gibbs moves by sampling directly from the informative conditionals (though its mixing also degrades as |rho| approaches 1). The output reports empirical means and correlation to check correctness.

Time: O(N) total, since each sweep draws two O(1) conditional samples.
Space: O(N) for saved samples; O(1) additional memory.
Hamiltonian Monte Carlo for a 2D Correlated Gaussian
#include <bits/stdc++.h>
using namespace std;

// HMC sampler for a 2D Gaussian target N(mu, Sigma) using an identity mass matrix.
// Demonstrates leapfrog integration and Metropolis correction.

struct Vec2 { double x, y; };

struct Gaussian2D {
    Vec2 mu; array<double,4> S;          // Sigma = [s11 s12; s21 s22]
    array<double,4> Sinv; double logdet; // precomputed Sigma^{-1} and log|Sigma|
    Gaussian2D(Vec2 mu_, array<double,4> S_) : mu(mu_), S(S_) {
        double det = S[0]*S[3] - S[1]*S[2];
        Sinv = { S[3]/det, -S[1]/det, -S[2]/det, S[0]/det };
        logdet = log(det);
    }
    // U(x) = -log pi(x) up to constant
    double U(const Vec2& x) const {
        double dx = x.x - mu.x, dy = x.y - mu.y;
        double qx = Sinv[0]*dx + Sinv[1]*dy;
        double qy = Sinv[2]*dx + Sinv[3]*dy;
        double md2 = dx*qx + dy*qy;
        return 0.5 * md2; // constants dropped
    }
    // grad U(x) = Sigma^{-1} (x - mu)
    Vec2 gradU(const Vec2& x) const {
        double dx = x.x - mu.x, dy = x.y - mu.y;
        return { Sinv[0]*dx + Sinv[1]*dy, Sinv[2]*dx + Sinv[3]*dy };
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Gaussian2D target({0.0, 0.0}, {2.0, 1.2, 1.2, 1.5}); // correlated Gaussian

    random_device rd; mt19937 gen(rd());
    normal_distribution<double> stdn(0.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // HMC hyperparameters (tune in warm-up)
    double eps = 0.2; // step size
    int L = 10;       // leapfrog steps

    Vec2 x{ -4.0, 3.0 }; // initial position

    int burn_in = 2000, N = 20000;
    vector<Vec2> samples; samples.reserve(N);
    int accepted = 0;

    for(int it = 0; it < burn_in + N; ++it){
        // Sample momentum p ~ N(0, I)
        Vec2 p{ stdn(gen), stdn(gen) };

        // Cache starting Hamiltonian
        double U0 = target.U(x);
        double K0 = 0.5 * (p.x*p.x + p.y*p.y);

        // Make a proposal (x*, p*) via leapfrog
        Vec2 xprop = x; Vec2 pprop = p;
        // Half-step momentum
        Vec2 g = target.gradU(xprop);
        pprop.x -= 0.5 * eps * g.x;
        pprop.y -= 0.5 * eps * g.y;
        // L full position steps interleaved with L-1 full momentum steps
        for(int i = 0; i < L; i++){
            // Full-step position
            xprop.x += eps * pprop.x;
            xprop.y += eps * pprop.y;
            // Gradient at the new position
            g = target.gradU(xprop);
            // Full-step momentum except on the last iteration (a half-step follows)
            if(i != L-1){
                pprop.x -= eps * g.x;
                pprop.y -= eps * g.y;
            }
        }
        // Final half-step momentum
        pprop.x -= 0.5 * eps * g.x;
        pprop.y -= 0.5 * eps * g.y;
        // Negate momentum for reversibility (does not change K, so acceptance is unaffected)
        pprop.x = -pprop.x; pprop.y = -pprop.y;

        // Metropolis accept/reject based on Hamiltonian difference
        double U1 = target.U(xprop);
        double K1 = 0.5 * (pprop.x*pprop.x + pprop.y*pprop.y);
        double log_alpha = -(U1 + K1) + (U0 + K0);
        bool accept = (log(unif(gen)) < log_alpha);
        if(accept){ x = xprop; if(it >= burn_in) accepted++; }
        if(it >= burn_in) samples.push_back(x);
    }

    // Summaries
    double mx = 0, my = 0;
    for(const auto& s : samples){ mx += s.x; my += s.y; }
    mx /= samples.size(); my /= samples.size();
    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Mean ā‰ˆ (" << mx << ", " << my << ")\n";
    cout << "Acceptance (post burn-in) ā‰ˆ " << (double)accepted / samples.size() << "\n";

    return 0;
}

This example implements Hamiltonian Monte Carlo for a 2D correlated Gaussian. The potential U and its gradient are known in closed form, so the leapfrog integrator simulates Hamiltonian dynamics with step size ε and L steps. A Metropolis correction accepts or rejects the proposal to counteract integration error. Compared with random-walk MH on the same target, HMC typically achieves higher effective sample size per iteration due to long-distance, high-acceptance moves guided by gradients.

Time: O(N L d) with d = 2 and L leapfrog steps per iteration; each step costs O(d) for a gradient evaluation.
Space: O(N d) for stored samples; O(1) working memory.