Variational Inference
Key Points
- Variational Inference (VI) turns Bayesian inference into an optimization problem by choosing a simple family q(z) to approximate an intractable posterior p(z|x).
- The Evidence Lower Bound (ELBO) is the objective VI maximizes; increasing the ELBO is equivalent to decreasing KL(q||p), tightening the approximation to the true posterior.
- Mean-field VI assumes independent factors in q(z), enabling fast coordinate-ascent updates using expected log-joint terms.
- Black-box VI uses Monte Carlo gradients of the ELBO and works even without conjugacy by leveraging the reparameterization trick.
- The direction of the KL divergence used in VI (KL(q||p)) is mode-seeking: it avoids placing mass where the posterior is small, and can under-estimate uncertainty in multi-modal posteriors.
- Stochastic VI scales to large datasets by estimating gradients from mini-batches while keeping the gradient estimate unbiased.
- Monitoring the ELBO and gradient variance is crucial for stable and correct optimization during VI.
- C++ implementations can demonstrate both closed-form CAVI (e.g., linear models) and reparameterized gradient-based VI (e.g., simple latent Gaussian models).
Prerequisites
- Basic probability and expectation — VI relies on expectations under q(z) and reasoning about densities.
- Bayes’ rule and posterior inference — Understanding p(z|x) and why it can be intractable motivates VI.
- Kullback–Leibler divergence — VI minimizes KL(q||p) equivalently by maximizing the ELBO.
- Multivariate Gaussian distributions — Common variational families and conjugate models use Gaussian forms.
- Jensen’s inequality — It justifies the ELBO as a lower bound on log evidence.
- Optimization and gradient descent — Black-box VI requires iterative gradient-based optimization.
- Monte Carlo estimation — ELBO and gradient estimates use sampling from q.
- Linear algebra (matrix–vector operations) — Efficient CAVI implementations depend on matrix–vector computations.
- Log-likelihoods and numerical stability — Working in log-space prevents underflow and overflow.
- Stochastic optimization (mini-batching) — Essential to scale VI to large datasets.
Detailed Explanation
01 Overview
Variational Inference (VI) is a method to perform approximate Bayesian inference when the true posterior p(z|x) is computationally intractable. The key idea is to replace exact inference with optimization: we pick a set of candidate distributions Q (the variational family) and then find q*(z) in Q that is closest to the true posterior according to Kullback–Leibler (KL) divergence. Rather than minimizing KL directly, VI maximizes the Evidence Lower Bound (ELBO), which is a tractable lower bound on the log evidence log p(x). This bound captures a trade-off between fitting the data (via the expected log-likelihood) and staying close to the prior (via a KL regularizer).

In practice, VI comes in two dominant flavors. First, mean-field VI uses a product-form q(z)=∏ q_i(z_i) and iteratively updates each factor with closed-form formulas when the model is conjugate; this is known as Coordinate Ascent Variational Inference (CAVI). Second, black-box VI uses Monte Carlo estimates of the ELBO gradient and can be applied to virtually any model; the reparameterization trick provides low-variance gradients when q is reparameterizable (e.g., Gaussian).

VI scales naturally to large datasets through stochastic (mini-batch) optimization. It also underlies modern deep generative models such as Variational Autoencoders (VAEs), where neural networks amortize inference by predicting parameters of q(z|x).
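The data-fit/regularization trade-off described above can be summarized in one standard identity (generic notation, not tied to a specific model):

```latex
% Jensen's inequality applied to log p(x) = log E_q[p(x,z)/q(z)] yields the bound;
% rearranging shows the gap is exactly a KL divergence to the true posterior:
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\big[\log p(x \mid z)\big]
      - \mathrm{KL}\big(q(z)\,\|\,p(z)\big)}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  \;\ge\; \mathrm{ELBO}(q).
```

Because log p(x) is fixed for a given dataset, raising the ELBO necessarily lowers the KL gap to the true posterior.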
02 Intuition & Analogies
Imagine trying to describe a complicated, twisty shape (the true posterior) using a simpler clay mold (our variational family q). We can’t reproduce every wrinkle, but we can press, stretch, and shift the mold to cover as much of the important mass as possible. The better the mold fits, the better our approximate inference. The ELBO is like a score telling us how well the mold fits without needing to see the exact shape; we can raise this score by improving how well the model explains the data and by not straying too far from our prior beliefs.

Another analogy: think of locating a radio signal in a noisy environment. The true posterior is the exact signal we’d like to isolate, but we only have noisy measurements. We choose a tunable filter q with knobs (its parameters). Turning the knobs to raise the ELBO is like improving the filter so more true signal passes and less noise remains. In mean-field VI, we make a simplifying assumption that each knob controls a separate band (independence), so we can tune one band at a time using closed-form rules. In black-box VI, we don’t have an instruction manual for the knobs; instead, we jiggle them in smart ways (Monte Carlo gradients via reparameterization) and follow the direction that most increases the signal (the ELBO). In both cases, we aim for a good balance: explain the data while not overfitting, which the ELBO enforces automatically through its built-in regularization.
03 Formal Definition
Given a joint model p(x, z) whose posterior p(z|x) = p(x, z)/p(x) is intractable, fix a variational family Q and solve q* = argmin_{q ∈ Q} KL(q(z) || p(z|x)). Because this KL involves the unknown evidence p(x), VI instead maximizes the equivalent objective ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)], which satisfies log p(x) = ELBO(q) + KL(q(z) || p(z|x)) ≥ ELBO(q); maximizing the ELBO therefore minimizes the KL divergence to the true posterior.
04 When to Use
- Intractable posteriors: Use VI when exact p(z|x) is unavailable due to complex likelihoods or high-dimensional latent variables (e.g., topic models, large hierarchical Bayes).
- Conjugate models with factorization: Use mean-field CAVI for fast, stable inference with closed-form updates (e.g., Bayesian linear regression, Gaussian mixture models with conjugate priors).
- Nonconjugate or deep models: Use black-box VI with reparameterization (e.g., VAEs, Bayesian neural networks, probabilistic matrix factorization) to handle arbitrary likelihoods and priors.
- Large-scale datasets: Use stochastic VI (mini-batches) for scalability; the ELBO gradients can be unbiasedly estimated from subsamples with appropriate weighting.
- Need for speed and differentiability: VI often runs much faster than MCMC and provides a differentiable objective (the ELBO) suitable for integration with automatic differentiation and GPU acceleration.
- Amortized inference: When you must infer z for many x quickly (e.g., at test time in VAEs), learn an inference network that outputs q_\phi(z\mid x) parameters directly.
⚠️ Common Mistakes
- Overly restrictive variational family: A mean-field diagonal Gaussian may under-estimate posterior uncertainty or miss multimodality. Remedy: Enrich q with covariance, mixtures, or normalizing flows.
- Ignoring ELBO monitoring: Not tracking the ELBO can hide divergence or numerical issues. Always report ELBO, its moving average, and gradient norms.
- High-variance gradients in black-box VI: Using the score-function estimator without control variates can stall learning. Prefer reparameterization when possible, increase the number of samples, and use variance reduction (baselines, Rao–Blackwellization).
- KL direction confusion: Minimizing KL(q||p) is mode-seeking; it can avoid low-probability regions between modes. If mode-covering is desired, consider alternative divergences or tempered posteriors.
- Poor initialization and step sizes: Starting with too narrow q or using aggressive learning rates can cause numerical instability. Initialize with reasonable scale and use adaptive optimizers (e.g., Adam) or line searches.
- Broken factor updates in CAVI: Forgetting to exclude the current coordinate when computing residuals leads to biased updates. Maintain and update residuals carefully to get O(nd) per sweep.
- Numerical underflow/overflow: Log-probabilities with tiny variances or large norms can cause -inf or nan. Work in log-space and clip/regularize variances.
Key Formulas
ELBO Definition
ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)] = E_q[log p(x, z)] + H[q]
Explanation: The ELBO measures how well q explains the joint p(x,z) while penalizing complexity via the entropy of q. Maximizing it yields a better posterior approximation.
ELBO–Evidence Decomposition
log p(x) = ELBO(q) + KL(q(z) || p(z|x))
Explanation: The gap between log evidence and the ELBO is the KL divergence from q to the true posterior. As ELBO increases, KL decreases and q approaches p(z|x).
KL Divergence
KL(q || p) = E_q[log q(z) − log p(z|x)] = ∫ q(z) log [ q(z) / p(z|x) ] dz
Explanation: This is the direction minimized by VI. It prefers placing mass where q can safely fit within p’s high-density regions, often resulting in mode-seeking behavior.
Mean-field Coordinate Update
q_i*(z_i) ∝ exp( E_{q_{−i}}[ log p(x, z) ] )
Explanation: Given all other factors, the optimal update for factor i exponentiates the expected log-joint. This yields closed-form updates in conjugate exponential-family models.
Reparameterization
z = g_φ(ε), ε ~ p(ε); for a Gaussian, z = μ + σ ⊙ ε with ε ~ N(0, I)
Explanation: Samples from q are expressed as a deterministic transform of noise independent of parameters. This enables low-variance pathwise gradients for the ELBO.
Pathwise Gradient
∇_φ E_{q_φ(z)}[ f(z) ] = E_{p(ε)}[ ∇_φ f(g_φ(ε)) ]
Explanation: By reparameterizing, we can move the gradient operator inside the expectation. This is central to black-box VI with differentiable models.
Score-Function Estimator
∇_φ E_{q_φ(z)}[ f(z) ] = E_{q_φ(z)}[ f(z) ∇_φ log q_φ(z) ]
Explanation: Also known as REINFORCE, it requires only samples from q and the gradient of log q. It has higher variance than pathwise gradients but works without reparameterization.
Gaussian–Standard Normal KL
KL( N(μ, Σ) || N(0, I) ) = ½ [ tr(Σ) + μᵀμ − k − log |Σ| ]
Explanation: A closed form for the KL from a Gaussian to a standard normal in k dimensions. Useful for ELBOs that include a Gaussian prior on latents.
Gaussian Entropy
H[ N(μ, Σ) ] = ½ log( (2πe)^k |Σ| )
Explanation: The entropy of a multivariate normal depends on its covariance determinant. It appears in the ELBO via the −E_q[log q(z)] term.
Quadratic Expectation under Mean-field
E_q[ wᵀ A w ] = mᵀ A m + Σ_j A_jj s_j²   (for q with mean m and independent coordinate variances s_j²)
Explanation: With independent coordinates, cross-covariances vanish. This identity yields simple ELBO terms for Gaussian linear models in CAVI.
Complexity Analysis
For the CAVI example, maintaining the residual r = y − X m makes each coordinate update O(n), so a full sweep over d coordinates costs O(nd) time, with O(nd) memory for the column-stored design matrix; each ELBO evaluation is also O(nd). For the black-box VI example, each optimization step with S Monte Carlo samples costs O(Sn); with mini-batching, n is replaced by the batch size at the cost of higher gradient variance.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Coordinate Ascent Variational Inference (CAVI)
// Model: y = X w + noise, noise ~ N(0, sigma2 I)
// Prior: w ~ N(0, alpha^{-1} I)
// Variational family: q(w) = \prod_j N(w_j | m_j, s2_j)
// Updates:
//   s2_j = 1 / (alpha + (1/sigma2) * ||X_col_j||^2)
//   m_j  = s2_j * (1/sigma2) * sum_i X_{ij} * ( y_i - sum_{k!=j} X_{ik} m_k )
// We implement O(nd) per sweep using a maintained residual r = y - X m.

struct Dataset {
    int n, d;
    vector<vector<double>> X_cols; // d columns, each of length n
    vector<double> y;              // length n
    vector<double> col_norm2;      // ||X_col_j||^2
};

Dataset make_synthetic(int n, int d, double sigma, double alpha_true, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    Dataset data;
    data.n = n; data.d = d;
    data.X_cols.assign(d, vector<double>(n));
    data.y.assign(n, 0.0);
    vector<double> w_true(d);
    // True weights ~ N(0, alpha_true^{-1} I)
    for (int j = 0; j < d; ++j) w_true[j] = N01(rng) / sqrt(alpha_true);
    // Design matrix X ~ N(0,1)
    for (int j = 0; j < d; ++j) {
        for (int i = 0; i < n; ++i) data.X_cols[j][i] = N01(rng);
    }
    // Generate y = X w_true + noise
    normal_distribution<double> noise(0.0, sigma);
    for (int i = 0; i < n; ++i) {
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += data.X_cols[j][i] * w_true[j];
        yi += noise(rng);
        data.y[i] = yi;
    }
    // Precompute column norms
    data.col_norm2.assign(d, 0.0);
    for (int j = 0; j < d; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += data.X_cols[j][i] * data.X_cols[j][i];
        data.col_norm2[j] = s;
    }
    return data;
}

struct VariationalParams {
    vector<double> m;  // means
    vector<double> s2; // variances (diagonal)
};

// Compute ELBO under mean-field diagonal q(w):
//   E_q[log p(y|w)] + E_q[log p(w)] - E_q[log q(w)]
//   E[||y - X w||^2] = ||y - X m||^2 + sum_j s2_j * ||X_col_j||^2
//   E[||w||^2]       = ||m||^2 + sum_j s2_j
double elbo(const Dataset& D, const VariationalParams& vp, double sigma2, double alpha) {
    int n = D.n, d = D.d;
    // Compute residual r = y - X m
    vector<double> r = D.y;
    for (int j = 0; j < d; ++j) {
        double mj = vp.m[j];
        if (mj == 0.0) continue;
        for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * mj;
    }
    // ||r||^2
    double r2 = 0.0;
    for (double v : r) r2 += v * v;
    // sum_j s2_j * ||X_col_j||^2
    double trace_term = 0.0;
    for (int j = 0; j < d; ++j) trace_term += vp.s2[j] * D.col_norm2[j];
    // E_q[log p(y|w)]
    double ll = -0.5 * n * log(2.0 * M_PI * sigma2) - 0.5 * (r2 + trace_term) / sigma2;
    // E_q[log p(w)] with prior N(0, alpha^{-1} I)
    double m2 = 0.0, s2sum = 0.0;
    for (int j = 0; j < d; ++j) { m2 += vp.m[j] * vp.m[j]; s2sum += vp.s2[j]; }
    double lpw = -0.5 * d * log(2.0 * M_PI / alpha) - 0.5 * alpha * (m2 + s2sum);
    // Entropy of q (sum of 1D Gaussians)
    double H = 0.0;
    for (int j = 0; j < d; ++j) H += 0.5 * log(2.0 * M_E * M_PI * vp.s2[j]);
    return ll + lpw + H;
}

VariationalParams cavi_bayesian_lr(const Dataset& D, double sigma2, double alpha,
                                   int max_iters = 200, double tol = 1e-6) {
    int n = D.n, d = D.d;
    VariationalParams vp;
    vp.m.assign(d, 0.0);
    vp.s2.assign(d, 0.0);
    // Precompute s2_j, which doesn't depend on the other coordinates in this model
    for (int j = 0; j < d; ++j) vp.s2[j] = 1.0 / (alpha + (1.0 / sigma2) * D.col_norm2[j]);
    // Maintain residual r = y - X m
    vector<double> r = D.y; // initially m = 0, so r = y
    double prev_elbo = elbo(D, vp, sigma2, alpha);
    for (int it = 0; it < max_iters; ++it) {
        for (int j = 0; j < d; ++j) {
            double old_mj = vp.m[j];
            // Compute sum_i X_{ij} * (r_i + X_{ij} * m_j) efficiently
            double tmp = 0.0;
            for (int i = 0; i < n; ++i) tmp += D.X_cols[j][i] * (r[i] + D.X_cols[j][i] * old_mj);
            double new_mj = vp.s2[j] * (tmp / sigma2);
            double delta = new_mj - old_mj;
            if (delta != 0.0) {
                for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * delta;
                vp.m[j] = new_mj;
            }
        }
        double cur_elbo = elbo(D, vp, sigma2, alpha);
        if (it % 10 == 0) {
            cerr << "Iter " << it << ": ELBO = " << cur_elbo << "\n";
        }
        if (abs(cur_elbo - prev_elbo) < tol * (1.0 + abs(prev_elbo))) break;
        prev_elbo = cur_elbo;
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500, d = 20;
    double sigma = 0.5;  // noise std
    double sigma2 = sigma * sigma;
    double alpha = 1.0;  // prior precision

    Dataset D = make_synthetic(n, d, sigma, alpha, 123);
    VariationalParams vp = cavi_bayesian_lr(D, sigma2, alpha, 300, 1e-7);

    // Report learned mean and average variance
    double avg_s2 = accumulate(vp.s2.begin(), vp.s2.end(), 0.0) / d;
    cout << fixed << setprecision(6);
    cout << "Learned mean w (first 5): ";
    for (int j = 0; j < min(d, 5); ++j) cout << vp.m[j] << (j + 1 < min(d, 5) ? ", " : "\n");
    cout << "Average posterior variance per weight: " << avg_s2 << "\n";
    cout << "Final ELBO: " << elbo(D, vp, sigma2, alpha) << "\n";
    return 0;
}
```
We approximate the posterior over regression weights w with a factorized Gaussian q(w)=∏ N(w_j|m_j,s_j^2). For Gaussian likelihood and Gaussian prior, the CAVI updates are closed-form. We maintain the residual r=y−X m so each coordinate update is O(n), yielding O(nd) per sweep. We also compute the ELBO exactly under the mean-field q to monitor convergence. The code generates synthetic data, runs CAVI, and prints the learned posterior means and average variances.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Model: z ~ N(0,1); x_i | z ~ N(z, sigma^2) for i=1..n (shared latent z)
// Variational family: q(z) = N(mu, s^2) with parameters (mu, l = log s^2)
// ELBO(mu, l) = E_q[ log p(z) + sum_i log p(x_i|z) - log q(z) ]
// Reparameterization: z = mu + s * eps, eps ~ N(0,1)
// Pathwise gradients:
//   Let A'(z) = d/dz [ log p(z) + sum_i log p(x_i|z) ] = -z + (sum x_i)/sigma^2 - (n/sigma^2) z
//   With s = exp(0.5*l):  dL/dmu = A'(z),  dL/dl = 0.5 * (A'(z) * s * eps + 1)

struct BBVIConfig {
    int iters = 2000;    // optimization steps
    int S = 8;           // MC samples per step
    double lr_mu = 1e-2; // learning rate for mu
    double lr_l = 1e-2;  // learning rate for log-variance l
};

struct Data { vector<double> x; double sigma; };

struct VarParams { double mu, l; }; // l = log s^2

// Monte Carlo estimate of the ELBO and its gradients w.r.t. (mu, l)
struct Est { double elbo; double gmu; double gl; };

Est estimate_elbo_and_grad(const Data& D, const VarParams& vp, int S, mt19937& rng) {
    normal_distribution<double> N01(0.0, 1.0);
    int n = (int)D.x.size();
    double sigma2 = D.sigma * D.sigma;
    double s = exp(0.5 * vp.l);
    double sumx = accumulate(D.x.begin(), D.x.end(), 0.0);
    const double c_log_pz = -0.5 * log(2.0 * M_PI);
    const double c_log_px = -0.5 * log(2.0 * M_PI * sigma2);
    double elbo_acc = 0.0, gmu_acc = 0.0, gl_acc = 0.0;

    for (int t = 0; t < S; ++t) {
        double eps = N01(rng);
        double z = vp.mu + s * eps;
        // log p(z)
        double log_pz = c_log_pz - 0.5 * (z * z);
        // sum log p(x_i | z)
        double quad = 0.0;
        for (double xi : D.x) {
            double diff = xi - z;
            quad += diff * diff;
        }
        double log_px = n * c_log_px - 0.5 * quad / sigma2;
        // log q(z; mu, s)
        double log_q = -0.5 * ((z - vp.mu) * (z - vp.mu) / (s * s) + 2.0 * log(s) + log(2.0 * M_PI));
        double L = log_pz + log_px - log_q;
        elbo_acc += L;
        // Gradients
        double Aprime = -z + (sumx / sigma2) - (n / sigma2) * z; // d/dz of log p(z) + sum log p(x|z)
        double dL_dmu = Aprime;                                  // pathwise derivative
        double dL_dl = 0.5 * (Aprime * s * eps + 1.0);           // chain rule via s = exp(0.5*l)
        gmu_acc += dL_dmu;
        gl_acc += dL_dl;
    }
    Est out;
    out.elbo = elbo_acc / S;
    out.gmu = gmu_acc / S;
    out.gl = gl_acc / S;
    return out;
}

VarParams fit_bbvi(const Data& D, const BBVIConfig& cfg, unsigned seed = 7) {
    mt19937 rng(seed);
    VarParams vp{0.0, 0.0}; // initialize mu = 0, l = 0 => s^2 = 1
    for (int it = 1; it <= cfg.iters; ++it) {
        Est e = estimate_elbo_and_grad(D, vp, cfg.S, rng);
        // Simple SGD updates
        vp.mu += cfg.lr_mu * e.gmu;
        vp.l += cfg.lr_l * e.gl;
        if (it % 200 == 0) {
            double s2 = exp(vp.l);
            cerr << "Iter " << it << ": ELBO = " << e.elbo
                 << ", mu = " << vp.mu << ", s2 = " << s2 << "\n";
        }
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from the model with a true latent z_true
    int n = 200; double sigma = 0.75; double z_true = 1.75;
    mt19937 rng(123);
    normal_distribution<double> Nzt(z_true, sigma);
    Data D; D.sigma = sigma; D.x.resize(n);
    for (int i = 0; i < n; ++i) D.x[i] = Nzt(rng);

    BBVIConfig cfg;
    cfg.iters = 2000; cfg.S = 8; cfg.lr_mu = 5e-3; cfg.lr_l = 5e-3;
    VarParams vp = fit_bbvi(D, cfg, 321);

    cout << fixed << setprecision(6);
    cout << "Learned mu: " << vp.mu << "\n";
    cout << "Learned s2: " << exp(vp.l) << "\n";
    return 0;
}
```
This example performs black-box VI for a simple latent Gaussian model where a single latent z explains all observations. We use a Gaussian q(z)=N(μ,s²) and optimize the ELBO via the reparameterization trick: z=μ+sε. The pathwise gradients have simple closed forms, enabling a concise SGD implementation without external libraries. The code reports ELBO progress and converges toward the true posterior (which is Gaussian for this model).