Hamiltonian Monte Carlo (HMC)
Key Points
- Hamiltonian Monte Carlo (HMC) uses gradients of the log-density to propose long-distance moves that still land in high-probability regions.
- It treats sampling as simulating a frictionless physical system with position (parameters) and momentum (auxiliary variables).
- The leapfrog integrator makes proposals that are reversible and volume-preserving, enabling a Metropolis correction with high acceptance rates.
- Per-iteration cost is proportional to the number of leapfrog steps times the cost of computing the gradient of the log-density.
- HMC excels in moderate-to-high dimensions and for strongly correlated posteriors where random-walk MCMC struggles.
- Proper tuning of step size, number of steps, and mass matrix dramatically affects performance and acceptance rate.
- HMC is not suitable for discrete or non-differentiable parameters because it requires gradients.
- NUTS (No-U-Turn Sampler) builds on HMC by automatically choosing the trajectory length, reducing manual tuning.
Prerequisites
- Multivariable calculus (gradients) — HMC requires computing gradients of the log-density and understanding how they guide motion.
- Linear algebra (vectors, matrices) — Leapfrog updates and mass matrices require basic vector/matrix operations.
- Probability and Bayesian inference — Understanding target distributions, posteriors, priors, and likelihoods is essential.
- Markov Chain Monte Carlo (MCMC) basics — HMC is an MCMC method; knowledge of Metropolis-Hastings and detailed balance helps.
- Numerical integration of ODEs — HMC uses symplectic integrators (leapfrog) to approximate Hamiltonian dynamics.
- C++ programming and random number generation — Implementing samplers requires writing efficient, reproducible code.
- Data preprocessing and scaling — Standardization improves the geometry for HMC and allows larger stable step sizes.
Detailed Explanation
Overview
Hamiltonian Monte Carlo (HMC) is a Markov Chain Monte Carlo (MCMC) algorithm that uses gradient information to explore probability distributions efficiently. Instead of taking short, random steps (like a random walk), HMC simulates the motion of a particle moving through a landscape defined by the negative log probability (potential energy). By introducing auxiliary momentum variables and integrating Hamilton’s equations with a symplectic integrator (typically leapfrog), HMC proposes distant states that still have high probability, thereby reducing autocorrelation and improving mixing.
The core idea is to alternate between sampling a fresh momentum from a Gaussian distribution and then following a nearly energy-conserving path through the parameter space for several steps before applying a Metropolis acceptance test. Because the leapfrog method is time-reversible and volume-preserving, the proposals are accepted with relatively high probability if the step size is tuned properly. This makes HMC particularly powerful in higher dimensions and for posteriors with strong correlations where random-walk proposals are inefficient.
HMC requires the gradient of the log target density, which restricts its use to differentiable models but yields large gains when available. Practical performance depends on selecting the step size, number of leapfrog steps, and an appropriate mass matrix that matches the geometry of the target. Variants like the No-U-Turn Sampler (NUTS) automate trajectory length selection and often include adaptation of both step size and mass matrix during warm-up.
Intuition & Analogies
Imagine hiking across a mountainous landscape where altitude equals negative log probability: valleys are high-probability regions, and peaks are low-probability regions. A basic random-walk sampler takes small, jittery steps in random directions; it gets stuck meandering in valleys, repeatedly proposing moves uphill that are frequently rejected. Progress is slow and inefficient.
HMC gives you a map with slope directions (the gradient) and a way to build momentum so you can glide along the valley floor. You start by giving yourself a push (sample momentum), then follow the terrain’s shape using the gradient to steer. Because you move with momentum, you travel longer distances along contours of nearly constant altitude, visiting many plausible points before losing speed. This produces proposals that are far apart but still land in promising areas, leading to faster exploration and lower autocorrelation.
The leapfrog integrator is like a careful sequence of mini-steps that keeps total energy nearly constant as you move. Reversibility (you can exactly retrace your path backward) and volume preservation (you don’t squish or stretch space) ensure that, after your flight, a simple Metropolis check can correct for small numerical errors. If the energy barely changes, your proposal is accepted most of the time. Intuitively, HMC replaces clumsy, uphill-prone stumbling with smooth, physics-informed gliding that respects the landscape’s geometry.
Formal Definition
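The formal statement is missing from this section; it can be reconstructed from the standard setup used throughout this article:

```latex
% Target \pi(x) on \mathbb{R}^d, augmented with momentum p \sim \mathcal{N}(0, M):
U(x) = -\log \pi(x), \qquad
K(p) = \tfrac{1}{2}\, p^\top M^{-1} p, \qquad
H(x, p) = U(x) + K(p).
% The joint density \propto \exp(-H(x,p)) = \pi(x)\,\exp(-K(p)) has \pi as its
% x-marginal, so discarding p from joint samples yields samples of \pi.
% Hamiltonian dynamics:
\frac{dx}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad
\frac{dp}{dt} = -\frac{\partial H}{\partial x} = -\nabla U(x).
% One HMC transition: sample p \sim \mathcal{N}(0, M), integrate the dynamics
% for L leapfrog steps of size \varepsilon to obtain (x', p'), and accept x'
% with probability
\alpha = \min\!\bigl(1,\; \exp\bigl(H(x, p) - H(x', p')\bigr)\bigr).
```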
When to Use
- Continuous, differentiable targets: Use HMC when your log-density is differentiable and gradients are available or cheaply computable (e.g., models with smooth priors/likelihoods).
- Moderate-to-high dimensions: HMC shines as dimension grows, where random-walk proposals degrade severely. It manages correlated, elongated posteriors well by moving along level sets.
- Complex posteriors with strong correlations: Hierarchical Bayesian models or logistic/Poisson regression with informative priors benefit significantly.
- Expensive likelihoods where fewer, higher-quality proposals are better than many cheap, low-quality ones.
- When you can precondition: If you can estimate scaling/rotation (mass matrix) from warm-up, HMC performance improves dramatically.
Avoid or modify HMC when parameters are discrete or piecewise constant (no gradients exist), when the log-density is very rough or non-differentiable, or when you cannot compute or approximate gradients. If hand-tuning the number of leapfrog steps L is difficult, prefer NUTS, which adapts trajectory length automatically. If curvature varies widely across the posterior, consider Riemannian HMC or adapt a dense mass matrix. For extremely multimodal targets with distant, isolated modes, HMC can still struggle to jump between modes; tempered methods may help.
Common Mistakes
- Wrong sign for gradients: the potential is U(x) = −log π(x), so −∇U(x) = ∇ log π(x). The leapfrog momentum update subtracts ε ∇U; mixing up these signs leads to poor acceptance or divergence. Define one convention and stick to it.
- Step size too large: a large ε causes energy drift and rejections; one too small wastes computation. Aim for an average acceptance rate around 60–90%, often near 65–80% in higher dimensions.
- Too few or too many leapfrog steps: too few makes proposals local (like a random walk); too many can loop back (wasting computation) or accumulate numerical error. NUTS mitigates this automatically.
- Ignoring the mass matrix: using M = I for anisotropic targets forces tiny stable step sizes. Estimate a diagonal or dense M from warm-up to match posterior scales and correlations.
- Not centering/scaling parameters: poorly scaled parameters amplify stiffness, forcing a tiny ε. Standardize predictors and reparameterize when possible.
- Using non-symplectic integrators: forward Euler breaks reversibility and volume preservation, biasing the sampler even with a Metropolis correction. Use leapfrog or a higher-order symplectic scheme.
- Handling momentum inconsistently at accept/reject: a reversible proposal formally requires negating the momentum at the end of the trajectory; because the kinetic energy is symmetric in p, the negation can be omitted in code, but the accept/reject step must still compare the correct Hamiltonians H(x, p) and H(x′, p′).
- Insufficient warm-up: without step-size and mass adaptation, chains can diverge or mix poorly. Use dual averaging for ε and estimate M during warm-up.
- Miscomputing the logistic regression gradient: remember ∇U = −∇ log π; the sigmoid data-fit term and the prior term must enter with consistent signs.
Key Formulas
Potential Energy
U(x) = −log π(x)
Explanation: Defines the energy landscape from the target density. Lower U corresponds to higher-probability regions; HMC moves along this landscape.
Kinetic Energy
K(p) = (1/2) pᵀ M⁻¹ p
Explanation: Relates momentum to energy via the mass matrix M. With M = I, it reduces to half the squared Euclidean norm of p.
Hamiltonian
H(x, p) = U(x) + K(p)
Explanation: Total energy used in the Metropolis correction. Leapfrog approximately conserves H, so proposals are often accepted.
Hamilton’s Equations
dx/dt = ∂H/∂p = M⁻¹ p,  dp/dt = −∂H/∂x = −∇U(x)
Explanation: Continuous-time dynamics that HMC discretizes. They define how position and momentum evolve along the trajectory.
Leapfrog Half-Step (Momentum)
p(t + ε/2) = p(t) − (ε/2) ∇U(x(t))
Explanation: First half-step update of momentum using the gradient of the potential. It keeps the integrator time-reversible and stable.
Leapfrog Full-Step (Position)
x(t + ε) = x(t) + ε M⁻¹ p(t + ε/2)
Explanation: Update position using the intermediate momentum. This is the central position update of leapfrog.
Leapfrog Half-Step (Momentum 2)
p(t + ε) = p(t + ε/2) − (ε/2) ∇U(x(t + ε))
Explanation: Second half-step update of momentum using the new position. Together, the three updates form a reversible, volume-preserving step.
Metropolis Acceptance
α = min(1, exp(−ΔH)),  ΔH = H(x′, p′) − H(x, p)
Explanation: Accept or reject the proposal based on the change in total energy. Small integration error (small ΔH) leads to high acceptance.
Logistic Sigmoid
σ(z) = 1 / (1 + e^(−z))
Explanation: Maps real numbers to (0, 1); used in logistic regression likelihoods. Its derivative is σ(z)(1 − σ(z)).
Logistic Regression Potential Gradient
∇U(w) = Σᵢ (σ(xᵢᵀ w) − yᵢ) xᵢ + w / σ₀²
Explanation: For a Gaussian prior w ~ N(0, σ₀² I), the gradient of the potential includes the data-fit term and the prior penalty term.
Per-Iteration Complexity
O(L · C∇), where C∇ is the cost of one gradient evaluation
Explanation: Each leapfrog step requires a gradient. For simple targets C∇ is O(d); for generalized linear models it is O(nd).
Complexity Analysis
One HMC iteration performs L leapfrog steps, each requiring one gradient evaluation, so the per-iteration cost is O(L · C∇). For the correlated Gaussian example below, C∇ = O(d²) (a matrix–vector product with Σ⁻¹); for logistic regression, C∇ = O(nd). Over N iterations the total cost is O(N · L · C∇), with O(d) extra memory for a diagonal mass matrix or O(d²) for a dense one.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct HMC {
    int d;       // dimension
    double eps;  // step size
    int L;       // number of leapfrog steps
    mt19937 rng;
    normal_distribution<double> stdnorm{0.0, 1.0};
    uniform_real_distribution<double> unif{0.0, 1.0};

    vector<double> mu;                // mean
    vector<vector<double>> invSigma;  // inverse covariance (precomputed)

    HMC(int dim, double step, int steps, const vector<double>& mu_,
        const vector<vector<double>>& invS)
        : d(dim), eps(step), L(steps), mu(mu_), invSigma(invS) {
        random_device rd; rng.seed(rd());
    }

    // Vector utilities
    static vector<double> sub(const vector<double>& a, const vector<double>& b) {
        vector<double> c(a.size());
        for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] - b[i];
        return c;
    }

    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x) const {
        vector<double> y(A.size(), 0.0);
        for (size_t i = 0; i < A.size(); ++i)
            for (size_t j = 0; j < A[i].size(); ++j) y[i] += A[i][j] * x[j];
        return y;
    }

    // Potential U(x) = 0.5 * (x - mu)^T invSigma (x - mu) (constant dropped)
    double U(const vector<double>& x) const {
        vector<double> xm = sub(x, mu);
        return 0.5 * dot(xm, matvec(invSigma, xm));
    }

    // Gradient of U: grad U = invSigma * (x - mu)
    vector<double> gradU(const vector<double>& x) const {
        return matvec(invSigma, sub(x, mu));
    }

    // Single HMC transition from state x0
    vector<double> step(const vector<double>& x0, bool& accepted, double& dH) {
        // Sample momentum p ~ N(0, I)
        vector<double> p0(d);
        for (int i = 0; i < d; ++i) p0[i] = stdnorm(rng);

        // Initial energy
        double H0 = U(x0) + 0.5 * dot(p0, p0);

        vector<double> x = x0;
        vector<double> p = p0;

        // Leapfrog integration: half momentum step, L position steps with
        // interleaved full momentum steps, final half momentum step
        vector<double> g = gradU(x);
        for (int i = 0; i < d; ++i) p[i] -= 0.5 * eps * g[i];
        for (int t = 0; t < L; ++t) {
            for (int i = 0; i < d; ++i) x[i] += eps * p[i];  // full position step (M = I)
            g = gradU(x);
            if (t != L - 1)
                for (int i = 0; i < d; ++i) p[i] -= eps * g[i];
        }
        for (int i = 0; i < d; ++i) p[i] -= 0.5 * eps * g[i];
        // (Momentum negation is omitted: the kinetic energy is symmetric in p.)

        // Proposed energy and accept/reject on the energy difference
        double H1 = U(x) + 0.5 * dot(p, p);
        dH = H1 - H0;
        if (log(unif(rng)) < -dH) { accepted = true; return x; }
        accepted = false;
        return x0;
    }
};

int main() {
    // 2D Gaussian with mean [0, 0], sds 1.0 and 0.5, correlation 0.8
    int d = 2;
    vector<double> mu = {0.0, 0.0};
    double rho = 0.8, s1 = 1.0, s2 = 0.5;
    // Sigma = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]; invert analytically
    double s11 = s1 * s1, s22 = s2 * s2, s12 = rho * s1 * s2;
    double det = s11 * s22 - s12 * s12;
    vector<vector<double>> invSigma = {{ s22 / det, -s12 / det },
                                       { -s12 / det, s11 / det }};

    // HMC parameters
    double eps = 0.05;  // step size
    int L = 25;         // leapfrog steps

    HMC hmc(d, eps, L, mu, invSigma);

    vector<double> x = {2.0, -2.0};  // initial state
    int n_iter = 3000, burn = 500, accepted_count = 0;

    cout << fixed << setprecision(4);
    for (int it = 0; it < n_iter; ++it) {
        bool acc; double dH;
        x = hmc.step(x, acc, dH);
        accepted_count += acc ? 1 : 0;
        if (it % 500 == 0) {
            cout << "iter=" << it << ", x=[" << x[0] << ", " << x[1] << "]"
                 << ", accepted=" << acc << ", dH=" << dH << "\n";
        }
        if (it >= burn && it % 200 == 0) {
            // Print a posterior draw occasionally after burn-in
            cout << "post-sample: " << x[0] << ", " << x[1] << "\n";
        }
    }
    cout << "Acceptance rate: " << (double)accepted_count / n_iter << "\n";
    return 0;
}
```
This program samples from a 2D correlated Gaussian using HMC. The potential U and its gradient are analytic and cheap: U(x) = 0.5 (x-μ)^T Σ^{-1} (x-μ), ∇U = Σ^{-1}(x-μ). We perform L leapfrog steps with step size ε, compute the Hamiltonian difference, and accept or reject the proposal. Because M = I and the target is Gaussian, well-tuned ε and L yield high acceptance and long, efficient moves along the correlated directions.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct LogisticHMC {
    int n, d;
    double eps;     // step size
    int L;          // leapfrog steps
    double sigma0;  // prior std for weights

    vector<vector<double>> X;  // n x d design matrix
    vector<int> y;             // n labels in {0,1}

    mt19937 rng;
    normal_distribution<double> stdnorm{0.0, 1.0};
    uniform_real_distribution<double> unif{0.0, 1.0};

    LogisticHMC(const vector<vector<double>>& X_, const vector<int>& y_,
                double eps_, int L_, double sigma0_)
        : n((int)X_.size()), d((int)X_[0].size()), eps(eps_), L(L_),
          sigma0(sigma0_), X(X_), y(y_) {
        random_device rd; rng.seed(rd());
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

    // U(w) = -log posterior
    //      = -sum_i (y_i z_i - log(1 + exp(z_i))) + ||w||^2 / (2 sigma0^2),  z_i = x_i^T w
    double U(const vector<double>& w) const {
        double loss = 0.0;
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += X[i][j] * w[j];
            // log(1 + exp(z)) computed stably via log1p
            double log1pexp = (z > 0) ? z + log1p(exp(-z)) : log1p(exp(z));
            loss += -(y[i] * z - log1pexp);
        }
        double prior = 0.0;
        for (int j = 0; j < d; ++j) prior += 0.5 * (w[j] * w[j]) / (sigma0 * sigma0);
        return loss + prior;
    }

    // grad U(w) = sum_i (sigmoid(x_i^T w) - y_i) x_i + w / sigma0^2
    vector<double> gradU(const vector<double>& w) const {
        vector<double> g(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += X[i][j] * w[j];
            double coeff = sigmoid(z) - (double)y[i];  // derivative of negative log-likelihood
            for (int j = 0; j < d; ++j) g[j] += coeff * X[i][j];
        }
        double invs2 = 1.0 / (sigma0 * sigma0);
        for (int j = 0; j < d; ++j) g[j] += invs2 * w[j];
        return g;
    }

    // One HMC transition
    vector<double> step(const vector<double>& w0, bool& accepted, double& dH) {
        vector<double> p0(d);
        for (int j = 0; j < d; ++j) p0[j] = stdnorm(rng);  // M = I

        double H0 = U(w0) + 0.5 * inner_product(p0.begin(), p0.end(), p0.begin(), 0.0);

        vector<double> w = w0;
        vector<double> p = p0;

        // Leapfrog: half momentum, L position steps, final half momentum
        vector<double> g = gradU(w);
        for (int j = 0; j < d; ++j) p[j] -= 0.5 * eps * g[j];
        for (int t = 0; t < L; ++t) {
            for (int j = 0; j < d; ++j) w[j] += eps * p[j];
            g = gradU(w);
            if (t != L - 1)
                for (int j = 0; j < d; ++j) p[j] -= eps * g[j];
        }
        for (int j = 0; j < d; ++j) p[j] -= 0.5 * eps * g[j];

        double H1 = U(w) + 0.5 * inner_product(p.begin(), p.end(), p.begin(), 0.0);
        dH = H1 - H0;
        if (log(unif(rng)) < -dH) { accepted = true; return w; }
        accepted = false;
        return w0;
    }
};

// Generate a simple synthetic dataset for logistic regression
void make_dataset(int n, int d, vector<vector<double>>& X, vector<int>& y, mt19937& rng) {
    normal_distribution<double> N01(0.0, 1.0);
    vector<double> w_true(d, 0.0);
    for (int j = 0; j < d; ++j) w_true[j] = (j % 2 == 0 ? 1.5 : -0.8);  // true weights

    X.assign(n, vector<double>(d));
    y.assign(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) X[i][j] = N01(rng);
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += X[i][j] * w_true[j];
        bernoulli_distribution Bern(1.0 / (1.0 + exp(-z)));
        y[i] = Bern(rng);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create synthetic data
    mt19937 rng(42);
    int n = 300, d = 3;
    vector<vector<double>> X;
    vector<int> y;
    make_dataset(n, d, X, y, rng);

    // Standardize columns (improves HMC stability)
    for (int j = 0; j < d; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) { double v = X[i][j] - mean; var += v * v; }
        var /= n;
        double sd = sqrt(var) + 1e-9;
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / sd;
    }

    double eps = 0.01;    // step size (tune for your data)
    int L = 50;           // leapfrog steps
    double sigma0 = 5.0;  // prior std

    LogisticHMC hmc(X, y, eps, L, sigma0);

    vector<double> w(d, 0.0);  // initial weights
    int n_iter = 2000, burn = 500;
    int accepted = 0;

    cout << fixed << setprecision(4);
    for (int it = 0; it < n_iter; ++it) {
        bool acc; double dH;
        w = hmc.step(w, acc, dH);
        accepted += acc ? 1 : 0;
        if (it % 200 == 0) {
            cout << "iter=" << it << ", dH=" << dH << ", acc=" << acc << ", w=[";
            for (int j = 0; j < d; ++j) cout << w[j] << (j + 1 < d ? ", " : "");
            cout << "]\n";
        }
        if (it >= burn && it % 200 == 0) {
            cout << "posterior draw: ";
            for (int j = 0; j < d; ++j) cout << w[j] << (j + 1 < d ? ", " : "\n");
        }
    }
    cout << "Acceptance rate: " << (double)accepted / n_iter << "\n";
    return 0;
}
```
This example performs HMC for Bayesian logistic regression with a Gaussian prior on weights. The potential U is the negative log posterior, and gradU is its gradient. We standardize features to improve geometry, use M = I, and run leapfrog steps to propose new weights. The acceptance probability corrects for integration error. This shows how HMC handles data-dependent models where each gradient evaluation costs O(n d).