⚙️ Algorithm · Intermediate

Gibbs Sampling

Key Points

  • Gibbs sampling is an MCMC method that generates samples by repeatedly drawing each variable from its conditional distribution given the others.
  • It avoids computing hard normalization constants and always accepts proposals, making each update simple and efficient.
  • When full conditionals are easy to sample (e.g., conjugate Bayesian models or pairwise Markov random fields), Gibbs sampling shines.
  • Per sweep, the runtime is roughly linear in the number of variables; total runtime also depends on the chain’s mixing time.
  • Highly correlated variables can cause slow mixing and high autocorrelation, so diagnostics like ESS and trace plots are essential.
  • Burn-in and (sometimes) thinning help reduce bias and autocorrelation in estimates from the Markov chain.
  • Blocking and reparameterization can dramatically improve mixing by reducing conditional dependence.
  • Gibbs sampling is widely used in Bayesian inference, image modeling (Potts and Ising models), and latent-variable models.

Prerequisites

  • →Basic probability and random variables — Understanding conditionals, independence, and expectations is required to derive and sample full conditionals.
  • →Conditional distributions and Bayes’ rule — Gibbs sampling relies on computing full conditionals from the joint distribution or from prior × likelihood.
  • →Markov chains — Convergence, stationarity, and ergodic averages all come from Markov chain theory.
  • →Random number generation — Implementing Gibbs needs reliable sampling from standard distributions (normal, uniform, etc.).
  • →Multivariate normal distribution (for Gaussian example) — Deriving the bivariate normal conditionals requires Gaussian identities.
  • →Graphs and Markov random fields (for Ising example) — Local conditional structure in MRFs enables efficient Gibbs updates.
  • →Numerical stability basics — Stable computation of logistic probabilities prevents overflow/underflow in updates.
  • →Monte Carlo error and diagnostics — Autocorrelation and ESS determine how many samples are needed for target accuracy.

Detailed Explanation


01 Overview

Hook: Imagine estimating a city’s opinion on a topic without asking everyone at once. Instead, you repeatedly ask one neighborhood while holding the others’ current estimates fixed, then rotate neighborhoods. Over time, you converge to the citywide distribution. That is the spirit of Gibbs sampling.

Concept: Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm for drawing approximate samples from a complex joint probability distribution without ever computing its often intractable normalization constant. It operates by iteratively sampling each variable (or block of variables) from its conditional distribution given the current values of the others. These conditional distributions, called full conditionals, are often much simpler than the joint distribution and can be sampled directly.

Example: In a bivariate normal distribution with strong correlation, drawing independent samples is hard. Gibbs sampling alternates between sampling X given Y and Y given X. Even though each conditional sample is easy to draw (a 1D normal), the sequence of these draws collectively approximates draws from the 2D joint. With enough iterations, summary statistics computed from the chain (e.g., means, variances, correlations) converge to their true values under the target distribution.

02 Intuition & Analogies

Hook: Think of a group project where each student updates their section while treating others’ sections as temporarily fixed. After everyone takes a turn, the document is more coherent. Repeat this cycle and the document stabilizes into a high-quality final version.

Concept: Gibbs sampling updates one coordinate at a time. Holding all other coordinates fixed converts a high-dimensional problem into a simple one-dimensional or low-dimensional one. This turns a complicated “all-at-once” sampling problem into a sequence of easy conditional draws.

Analogy: Consider repainting a large mural with a limited set of colors. Instead of repainting the whole mural at once, you repaint one panel at a time based on the colors currently chosen for adjacent panels. After many passes, the mural settles into a pleasing, globally consistent palette. In Gibbs sampling, each panel is a variable; its neighbors’ colors are the conditioning variables; the repainting rule is the conditional distribution.

Why it works: Even though each update only “locally” adjusts one variable, the entire collection of updates forms a Markov chain that preserves the target distribution. Over time, the chain explores the space in proportion to the target probability mass. Because each update is accepted with probability 1 (no rejections), progress is steady—though potentially slow if variables are tightly coupled.

Example: In an image denoising Ising model, the label of a pixel depends mostly on its neighbors. Gibbs sampling flips one pixel at a time according to a simple logistic probability based on its neighbors, gradually producing a clean, globally consistent labeling.

03 Formal Definition

Let \(\pi(x)\) be a target distribution over \(x = (x_1, \ldots, x_d)\) on a measurable space. The Gibbs sampler defines a Markov chain \(\{X^{(t)}\}_{t \ge 0}\) via successive conditional updates. In a systematic-scan Gibbs sampler, for iteration \(t \to t+1\): for \(i = 1, \ldots, d\), sample \(X_i^{(t+1)} \sim \pi(x_i \mid X_1^{(t+1)}, \ldots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \ldots, X_d^{(t)})\), so that the most recent values are used for previously updated coordinates (here \(x_{-i}\) denotes all coordinates except \(i\)). This yields a transition kernel \(K\) that leaves \(\pi\) invariant. Under mild regularity conditions (irreducibility, aperiodicity, positive recurrence), the chain is ergodic and converges to \(\pi\) in distribution. Block Gibbs generalizes this by updating blocks \(x_B\) from their full conditional \(\pi(x_B \mid x_{-B})\). For many models (e.g., conjugate exponential-family posteriors, Gaussian Markov random fields), full conditionals have closed forms. Monte Carlo estimates use the ergodic average \(\hat{\mu}_N(f) = \frac{1}{N} \sum_{t=1}^{N} f(X^{(t)})\), which converges to \(E_\pi[f(X)]\). Autocorrelation in the chain reduces the effective sample size relative to \(N\).

04 When to Use

  • Conjugate Bayesian models: When the posterior factorizes into known families with tractable full conditionals (e.g., Gaussian with Normal–Inverse-Gamma priors, multinomial with Dirichlet priors). Each conditional draw is then a simple, closed-form sample.
  • Markov random fields (MRFs): In vision or spatial statistics (Ising/Potts), a node’s full conditional depends only on its neighbors. Gibbs naturally exploits local structure and conditional independence.
  • Latent-variable models: With hidden assignments or indicators (e.g., mixture models), you can alternate between sampling latent variables and parameters.
  • Intractable normalizing constants: When only ratios of probabilities are known (up to a constant), conditionals may still be computable, avoiding the partition function.
  • When proposal design is hard: Unlike Metropolis–Hastings, Gibbs does not require a tuned proposal or acceptance step; if you can sample the conditional, you are done.

Caveats: If variables are strongly coupled or conditionals are expensive to compute, mixing may be slow. In high dimensions with strong correlations, consider blocking, reparameterization, or auxiliary-variable methods to improve performance.

⚠️ Common Mistakes

  • Ignoring dependence and slow mixing: High autocorrelation leads to overconfident estimates. Use diagnostic tools (ESS, autocorrelation plots, Gelman–Rubin when multiple chains) to detect poor mixing.
  • No burn-in: Early samples reflect arbitrary initialization. Discard a reasonable burn-in (e.g., several times the autocorrelation length) before computing estimates.
  • Miscomputed conditionals: A small algebraic mistake in a full conditional can break stationarity. Derive conditionals carefully, check dimensions, and validate on toy cases with known answers.
  • Storing every sample: This inflates memory without benefit when autocorrelation is high. Prefer online averaging or thinning (judiciously) and store only what you need.
  • Deterministic update order issues: Fixed scans can be fine, but in some models random-scan or blocked updates reduce artifacts and improve mixing.
  • Ignoring label switching in mixture models: Posterior modes can be symmetric under permutations of labels; naive averaging of component-specific parameters can be misleading. Use identifiability constraints or post-processing relabeling.

Key Formulas

Full Conditional

\(\pi(x_i \mid x_{-i}) = \dfrac{\pi(x_1, \ldots, x_d)}{\int \pi(x_1, \ldots, x_d)\, dx_i}\)

Explanation: The conditional of variable i given all others is the joint density divided by the joint integrated over that variable. This is the basic building block sampled in Gibbs.

Ergodic Average

\(\hat{\mu}_N(f) = \frac{1}{N} \sum_{t=1}^{N} f(X^{(t)}) \xrightarrow[N \to \infty]{} E_\pi[f(X)]\)

Explanation: Function averages along the Markov chain converge to expectations under the target distribution. This justifies estimating integrals from Gibbs samples.

Effective Sample Size

\(\mathrm{ESS} \approx \dfrac{N}{\tau_{\mathrm{int}}}, \qquad \tau_{\mathrm{int}} = 1 + 2 \sum_{k=1}^{\infty} \rho_k\)

Explanation: Autocorrelation inflates variance of Monte Carlo estimates. The integrated autocorrelation time \(\tau_{\mathrm{int}}\) converts N correlated draws into the equivalent number of iid draws.

Bivariate Normal Conditional

\(X \mid Y = y \sim N\!\left(\mu_X + \rho \dfrac{\sigma_X}{\sigma_Y}(y - \mu_Y),\; (1 - \rho^2)\,\sigma_X^2\right)\)

Explanation: Conditionals of a multivariate normal are themselves normal. This underpins a simple two-variable Gibbs sampler for a correlated Gaussian.

Ising Full Conditional

\(\Pr(S_i = +1 \mid S_{\partial i}) = \dfrac{1}{1 + \exp\!\left(-2\beta\left(h + \sum_{j \in \partial i} J_{ij} S_j\right)\right)}\)

Explanation: For binary spins in an Ising model, the probability of a +1 at site i given neighbors is a logistic function of the local field. Gibbs updates flip spins according to this probability.

Detailed Balance

\(\pi(x)\,K(x, x') = \pi(x')\,K(x', x)\)

Explanation: If a transition kernel K satisfies detailed balance with respect to \(π\), then \(π\) is stationary. Each single-coordinate Gibbs update satisfies detailed balance, and random-scan Gibbs is reversible; the systematic-scan sweep kernel is generally not reversible, but it still leaves \(π\) invariant.

Monte Carlo Error with Autocorrelation

\(\mathrm{MSE}(\hat{\mu}_N) \approx \dfrac{\sigma_f^2}{N}\, \tau_{\mathrm{int}}\)

Explanation: Autocorrelation slows error decay by a factor equal to the integrated autocorrelation time. Larger \(\tau_{\mathrm{int}}\) means fewer effective samples.

Per-Sweep Cost

\(T_{\mathrm{sweep}} = \sum_{i=1}^{d} T(\text{sample } x_i \mid x_{-i})\)

Explanation: One sweep updates all variables; the time is the sum of costs to draw each conditional sample. In sparse models, each update often depends on only a few neighbors.

Complexity Analysis

Let \(d\) be the number of variables (or blocks) and suppose drawing each full conditional takes time \(c_i\). One full sweep that updates all variables costs \(T_{\mathrm{sweep}} = \sum_{i=1}^{d} c_i\). In many common cases \(c_i\) is O(1) or depends only on local neighborhood size (e.g., the Ising model with 4-neighbors has \(c_i = O(1)\)), so \(T_{\mathrm{sweep}} = O(d)\). For dense models or expensive conditionals (e.g., multivariate Gaussian blocks requiring Cholesky decompositions), \(c_i\) can be higher (O(k^3) for a k-dimensional block if a factorization is needed), but blocking can reduce autocorrelation.

The total runtime to achieve an estimation error \(\varepsilon\) depends on mixing. If \(\tau_{\mathrm{mix}}\) denotes the number of iterations (or sweeps) needed for the chain to approximate stationarity within a tolerance, and \(\tau_{\mathrm{int}}\) the integrated autocorrelation time for the function of interest, then the total cost is roughly \(O((\tau_{\mathrm{mix}} + N) \cdot T_{\mathrm{sweep}})\), with N chosen so that \(N/\tau_{\mathrm{int}}\) provides the desired effective sample size. In practice, \(\tau_{\mathrm{mix}}\) and \(\tau_{\mathrm{int}}\) are problem-dependent and can grow quickly with correlation strength or at criticality (e.g., near the Ising phase transition).

Space complexity is small: O(d) to store the current state, plus an optional O(N · m) if storing m-dimensional summaries across N retained samples. Often one uses online averages to keep memory at O(d). Numerical stability considerations (e.g., avoiding exp overflow in logistic updates) affect constants but not asymptotic complexity. Overall, Gibbs sampling is attractive when \(T_{\mathrm{sweep}}\) is linear and \(\tau_{\mathrm{int}}\) is moderate; otherwise, consider blocking, reparameterization, or alternative MCMC methods.

Code Examples

Gibbs sampling for a correlated bivariate normal
#include <bits/stdc++.h>
using namespace std;

struct RunningStats {
    long long n = 0;
    double mean_x = 0.0, mean_y = 0.0;
    double M2_x = 0.0, M2_y = 0.0, C_xy = 0.0; // for variance/covariance (Welford)
    void add(double x, double y) {
        n++;
        double dx = x - mean_x;  // deviation from the pre-update mean
        double dy = y - mean_y;
        mean_x += dx / n;
        mean_y += dy / n;
        M2_x += dx * (x - mean_x);
        M2_y += dy * (y - mean_y);
        C_xy += dx * (y - mean_y); // co-moment: old x-deviation times new y-deviation
    }
    double var_x() const { return n > 1 ? M2_x / (n - 1) : 0.0; }
    double var_y() const { return n > 1 ? M2_y / (n - 1) : 0.0; }
    double cov_xy() const { return n > 1 ? C_xy / (n - 1) : 0.0; }
    double corr() const {
        double vx = var_x(), vy = var_y();
        if (vx <= 0 || vy <= 0) return 0.0;
        return cov_xy() / sqrt(vx * vy);
    }
};

int main() {
    // Target: (X, Y) ~ N(0, 0; 1, 1, corr = rho)
    double rho = 0.8;    // correlation
    int burn_in = 5000;  // discard initial iterations
    int thin = 5;        // keep every 'thin'-th sample
    int kept = 20000;    // number of kept samples after burn-in and thinning

    // Conditionals: X|Y=y ~ N(rho*y, 1 - rho^2); Y|X=x ~ N(rho*x, 1 - rho^2)
    double cond_var = 1.0 - rho * rho;
    double cond_sd = sqrt(max(1e-15, cond_var));

    // RNG setup
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::normal_distribution<double> std_normal(0.0, 1.0);

    // Initialize chain (arbitrary start)
    double x = 0.0, y = 0.0;

    // Burn-in phase
    for (int t = 0; t < burn_in; ++t) {
        x = rho * y + cond_sd * std_normal(gen); // sample X given Y
        y = rho * x + cond_sd * std_normal(gen); // sample Y given X
    }

    // Sampling with thinning; collect running statistics
    RunningStats stats;
    int collected = 0;
    for (int s = 0; collected < kept; ++s) {
        // One Gibbs step
        x = rho * y + cond_sd * std_normal(gen);
        y = rho * x + cond_sd * std_normal(gen);

        if ((s + 1) % thin == 0) {
            stats.add(x, y);
            collected++;
        }
    }

    cout << fixed << setprecision(4);
    cout << "Estimated E[X] = " << stats.mean_x << " (true: 0)\n";
    cout << "Estimated E[Y] = " << stats.mean_y << " (true: 0)\n";
    cout << "Estimated Var[X] = " << stats.var_x() << " (true: 1)\n";
    cout << "Estimated Var[Y] = " << stats.var_y() << " (true: 1)\n";
    cout << "Estimated Corr[X,Y] = " << stats.corr() << " (true: " << rho << ")\n";

    return 0;
}

We target a 2D normal with zero means, unit variances, and correlation rho. The full conditionals are univariate normals with mean rho times the other coordinate and variance 1 - rho^2. The algorithm alternates sampling X|Y and Y|X. We discard a burn-in, thin the chain, and maintain online estimates of means, variances, and correlation using Welford’s algorithm. The final estimates should be close to the true values when the chain mixes well and enough samples are collected.

Time: O(B + T), where B is the number of burn-in iterations and T is the number of kept/thinned iterations (each iteration does O(1) work).
Space: O(1) for the state and online stats (or O(T) if all samples are stored).
Gibbs sampling on a 2D Ising model (binary image)
#include <bits/stdc++.h>
using namespace std;

// 2D Ising model with nearest-neighbor interactions on an N x N grid.
// Spins s(i,j) in {-1, +1}. Gibbs update uses a logistic probability based on neighbors.

struct Ising {
    int N;              // grid size
    double beta;        // inverse temperature (coupling strength)
    double J;           // interaction strength (usually 1.0)
    double h;           // external field (bias)
    vector<int> spins;  // flattened N*N array of -1 or +1

    Ising(int N_, double beta_, double J_ = 1.0, double h_ = 0.0)
        : N(N_), beta(beta_), J(J_), h(h_), spins(N_ * N_, 1) {}

    inline int idx(int i, int j) const { return i * N + j; }

    // Periodic boundary conditions helper
    inline int mod(int a, int m) const { int r = a % m; return r < 0 ? r + m : r; }

    // Sum of 4-neighbor spins around (i,j) with periodic boundaries
    int neighbor_sum(int i, int j) const {
        int up    = spins[idx(mod(i - 1, N), j)];
        int down  = spins[idx(mod(i + 1, N), j)];
        int left  = spins[idx(i, mod(j - 1, N))];
        int right = spins[idx(i, mod(j + 1, N))];
        return up + down + left + right;
    }

    // One Gibbs sweep over the grid (systematic scan)
    void sweep(std::mt19937_64 &gen) {
        std::uniform_real_distribution<double> uni01(0.0, 1.0);
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                int nb = neighbor_sum(i, j);
                // Probability of +1 is logistic(z) with z = 2*beta*(h + J*nb)
                double z = 2.0 * beta * (h + J * static_cast<double>(nb));
                // Stable logistic: p = 1/(1+exp(-z)) without overflowing exp
                double p;
                if (z >= 0) {
                    double ez = exp(-z);
                    p = 1.0 / (1.0 + ez);
                } else {
                    double ez = exp(z);
                    p = ez / (1.0 + ez);
                }
                spins[idx(i, j)] = (uni01(gen) < p) ? +1 : -1;
            }
        }
    }

    // Magnetization: average spin
    double magnetization() const {
        long long sum = 0;
        for (int v : spins) sum += v;
        return static_cast<double>(sum) / static_cast<double>(N * N);
    }
};

int main() {
    int N = 40;         // grid size (40x40)
    double beta = 0.4;  // inverse temperature (< ~0.44 is the disordered phase for J=1, h=0)
    double J = 1.0;
    double h = 0.0;

    std::random_device rd;
    std::mt19937_64 gen(rd());

    Ising model(N, beta, J, h);

    // Initialize spins randomly to start from a neutral state
    std::uniform_real_distribution<double> uni01(0.0, 1.0);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            model.spins[model.idx(i, j)] = (uni01(gen) < 0.5) ? -1 : +1;

    int burn_in_sweeps = 200;
    for (int b = 0; b < burn_in_sweeps; ++b) model.sweep(gen);

    // Collect magnetization samples
    int sweeps = 1000, thin = 5;
    double mean_mag = 0.0;
    int kept = 0;
    for (int t = 0; t < sweeps; ++t) {
        model.sweep(gen);
        if ((t + 1) % thin == 0) {
            mean_mag += model.magnetization();
            kept++;
        }
    }
    mean_mag /= max(1, kept);

    cout << fixed << setprecision(4);
    cout << "Average magnetization (beta=" << beta << ") = " << mean_mag << "\n";
    cout << "(Try increasing beta towards 0.5 to see magnetization increase due to ordering.)\n";
    return 0;
}

This implements Gibbs sampling for the 2D Ising model on an N×N grid with periodic boundaries. Each spin is updated using its full conditional, a logistic function of the sum of its four neighbors and the external field. We perform burn-in, then repeatedly sweep and average magnetization. At low beta (high temperature), spins are disordered and magnetization is near zero; as beta increases past the critical point, long-range order emerges and magnetization grows.

Time: O(N^2) work per sweep, since each of the N^2 spins is updated using O(1) neighbor operations.
Space: O(N^2) to store the grid of spins.
#gibbs-sampling #mcmc #markov-chain #conditional-distribution #full-conditional #ising-model #bayesian-inference #multivariate-normal #detailed-balance #burn-in #autocorrelation #effective-sample-size #blocked-gibbs #random-scan-gibbs #stationary-distribution