Variational Inference
Key Points
- Variational Inference (VI) turns Bayesian inference into an optimization problem by choosing a simple family q(z) to approximate an intractable posterior p(z|x).
- The Evidence Lower Bound (ELBO) is the objective VI maximizes; increasing the ELBO is equivalent to decreasing KL(q||p), tightening the approximation to the true posterior.
- Mean-field VI assumes independent factors in q(z), enabling fast coordinate-ascent updates using expected log-joint terms.
- Black-box VI uses Monte Carlo gradients of the ELBO and works even without conjugacy by leveraging the reparameterization trick.
- The direction of the KL divergence used in VI (KL(q||p)) is mode-seeking: it avoids placing mass where the posterior is small, and can under-estimate uncertainty in multi-modal posteriors.
- Stochastic VI scales to large datasets by estimating gradients from mini-batches while keeping the gradient estimate unbiased.
- Monitoring the ELBO and gradient variance is crucial for stable and correct optimization during VI.
- C++ implementations can demonstrate both closed-form CAVI (e.g., linear models) and reparameterized gradient-based VI (e.g., simple latent Gaussian models).
Prerequisites
- Basic probability and expectation — VI relies on expectations under q(z) and reasoning about densities.
- Bayes’ rule and posterior inference — Understanding p(z|x) and why it can be intractable motivates VI.
- Kullback–Leibler divergence — VI minimizes KL(q||p) equivalently by maximizing the ELBO.
- Multivariate Gaussian distributions — Common variational families and conjugate models use Gaussian forms.
- Jensen’s inequality — It justifies the ELBO as a lower bound on log evidence.
- Optimization and gradient descent — Black-box VI requires iterative gradient-based optimization.
- Monte Carlo estimation — ELBO and gradient estimates use sampling from q.
- Linear algebra (matrix–vector operations) — Efficient CAVI implementations depend on matrix–vector computations.
- Log-likelihoods and numerical stability — Working in log-space prevents underflow and overflow.
- Stochastic optimization (mini-batching) — Essential to scale VI to large datasets.
Detailed Explanation
01 Overview
Variational Inference (VI) is a method to perform approximate Bayesian inference when the true posterior p(z|x) is computationally intractable. The key idea is to replace exact inference with optimization: we pick a set of candidate distributions Q (the variational family) and then find q*(z) in Q that is closest to the true posterior according to Kullback–Leibler (KL) divergence. Rather than minimizing KL directly, VI maximizes the Evidence Lower Bound (ELBO), which is a tractable lower bound on the log evidence log p(x). This bound captures a trade-off between fitting the data (via the expected log-likelihood) and staying close to the prior (via a KL regularizer).

In practice, VI comes in two dominant flavors. First, mean-field VI uses a product-form q(z)=∏ q_i(z_i) and iteratively updates each factor with closed-form formulas when the model is conjugate; this is known as Coordinate Ascent Variational Inference (CAVI). Second, black-box VI uses Monte Carlo estimates of the ELBO gradient and can be applied to virtually any model; the reparameterization trick provides low-variance gradients when q is reparameterizable (e.g., Gaussian).

VI scales naturally to large datasets through stochastic (mini-batch) optimization. It also underlies modern deep generative models such as Variational Autoencoders (VAEs), where neural networks amortize inference by predicting parameters of q(z|x).
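The data-fit/regularization trade-off described above can be summarized in one standard identity (generic notation, not tied to a specific model):

```latex
% Jensen's inequality applied to log p(x) = log E_q[p(x,z)/q(z)] yields the bound;
% rearranging shows the gap is exactly a KL divergence to the true posterior:
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\big[\log p(x \mid z)\big]
      - \mathrm{KL}\big(q(z)\,\|\,p(z)\big)}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big)
  \;\ge\; \mathrm{ELBO}(q).
```

Because log p(x) is fixed for a given dataset, raising the ELBO necessarily lowers the KL gap to the true posterior.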
02 Intuition & Analogies
Imagine trying to describe a complicated, twisty shape (the true posterior) using a simpler clay mold (our variational family q). We can’t reproduce every wrinkle, but we can press, stretch, and shift the mold to cover as much of the important mass as possible. The better the mold fits, the better our approximate inference. The ELBO is like a score telling us how well the mold fits without needing to see the exact shape; we can raise this score by improving how well the model explains the data and by not straying too far from our prior beliefs.

Another analogy: think of locating a radio signal in a noisy environment. The true posterior is the exact signal we’d like to isolate, but we only have noisy measurements. We choose a tunable filter q with knobs (its parameters). Turning the knobs to raise the ELBO is like improving the filter so more true signal passes and less noise remains. In mean-field VI, we make a simplifying assumption that each knob controls a separate band (independence), so we can tune one band at a time using closed-form rules. In black-box VI, we don’t have an instruction manual for the knobs; instead, we jiggle them in smart ways (Monte Carlo gradients via reparameterization) and follow the direction that most increases the signal (the ELBO). In both cases, we aim for a good balance: explain the data while not overfitting, which the ELBO enforces automatically through its built-in regularization.
03 Formal Definition
Given a joint model p(x, z) whose posterior p(z|x) = p(x, z)/p(x) is intractable, fix a variational family Q and solve q* = argmin_{q ∈ Q} KL(q(z) || p(z|x)). Because this KL involves the unknown evidence p(x), VI instead maximizes the equivalent objective ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)], which satisfies log p(x) = ELBO(q) + KL(q(z) || p(z|x)) ≥ ELBO(q); maximizing the ELBO therefore minimizes the KL divergence to the true posterior.
04 When to Use
- Intractable posteriors: Use VI when exact p(z|x) is unavailable due to complex likelihoods or high-dimensional latent variables (e.g., topic models, large hierarchical Bayes).
- Conjugate models with factorization: Use mean-field CAVI for fast, stable inference with closed-form updates (e.g., Bayesian linear regression, Gaussian mixture models with conjugate priors).
- Nonconjugate or deep models: Use black-box VI with reparameterization (e.g., VAEs, Bayesian neural networks, probabilistic matrix factorization) to handle arbitrary likelihoods and priors.
- Large-scale datasets: Use stochastic VI (mini-batches) for scalability; the ELBO gradients can be unbiasedly estimated from subsamples with appropriate weighting.
- Need for speed and differentiability: VI often runs much faster than MCMC and provides a differentiable objective (the ELBO) suitable for integration with automatic differentiation and GPU acceleration.
- Amortized inference: When you must infer z for many x quickly (e.g., at test time in VAEs), learn an inference network that outputs q_\phi(z\mid x) parameters directly.
⚠️ Common Mistakes
- Overly restrictive variational family: A mean-field diagonal Gaussian may under-estimate posterior uncertainty or miss multimodality. Remedy: Enrich q with covariance, mixtures, or normalizing flows.
- Ignoring ELBO monitoring: Not tracking the ELBO can hide divergence or numerical issues. Always report ELBO, its moving average, and gradient norms.
- High-variance gradients in black-box VI: Using the score-function estimator without control variates can stall learning. Prefer reparameterization when possible, increase the number of samples, and use variance reduction (baselines, Rao–Blackwellization).
- KL direction confusion: Minimizing KL(q||p) is mode-seeking; it can avoid low-probability regions between modes. If mode-covering is desired, consider alternative divergences or tempered posteriors.
- Poor initialization and step sizes: Starting with too narrow q or using aggressive learning rates can cause numerical instability. Initialize with reasonable scale and use adaptive optimizers (e.g., Adam) or line searches.
- Broken factor updates in CAVI: Forgetting to exclude the current coordinate when computing residuals leads to biased updates. Maintain and update residuals carefully to get O(nd) per sweep.
- Numerical underflow/overflow: Log-probabilities with tiny variances or large norms can cause -inf or nan. Work in log-space and clip/regularize variances.
Key Formulas
ELBO Definition
ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)] = E_q[log p(x, z)] + H[q]
Explanation: The ELBO measures how well q explains the joint p(x,z) while penalizing complexity via the entropy of q. Maximizing it yields a better posterior approximation.
ELBO–Evidence Decomposition
log p(x) = ELBO(q) + KL(q(z) || p(z|x))
Explanation: The gap between log evidence and the ELBO is the KL divergence from q to the true posterior. As ELBO increases, KL decreases and q approaches p(z|x).
KL Divergence
KL(q || p) = E_q[log q(z) − log p(z|x)] = ∫ q(z) log [ q(z) / p(z|x) ] dz
Explanation: This is the direction minimized by VI. It prefers placing mass where q can safely fit within p’s high-density regions, often resulting in mode-seeking behavior.
Mean-field Coordinate Update
q_i*(z_i) ∝ exp( E_{q_{−i}}[ log p(x, z) ] )
Explanation: Given all other factors, the optimal update for factor i exponentiates the expected log-joint. This yields closed-form updates in conjugate exponential-family models.
Reparameterization
z = g_φ(ε), ε ~ p(ε); for a Gaussian, z = μ + σ ⊙ ε with ε ~ N(0, I)
Explanation: Samples from q are expressed as a deterministic transform of noise independent of parameters. This enables low-variance pathwise gradients for the ELBO.
Pathwise Gradient
∇_φ E_{q_φ(z)}[ f(z) ] = E_{p(ε)}[ ∇_φ f(g_φ(ε)) ]
Explanation: By reparameterizing, we can move the gradient operator inside the expectation. This is central to black-box VI with differentiable models.
Score-Function Estimator
∇_φ E_{q_φ(z)}[ f(z) ] = E_{q_φ(z)}[ f(z) ∇_φ log q_φ(z) ]
Explanation: Also known as REINFORCE, it requires only samples from q and the gradient of log q. It has higher variance than pathwise gradients but works without reparameterization.
Gaussian–Standard Normal KL
KL( N(μ, Σ) || N(0, I) ) = ½ [ tr(Σ) + μᵀμ − k − log |Σ| ]
Explanation: A closed form for the KL from a Gaussian to a standard normal in k dimensions. Useful for ELBOs that include a Gaussian prior on latents.
Gaussian Entropy
H[ N(μ, Σ) ] = ½ log( (2πe)^k |Σ| )
Explanation: The entropy of a multivariate normal depends on its covariance determinant. It appears in the ELBO via the −E_q[log q(z)] term.
Quadratic Expectation under Mean-field
E_q[ wᵀ A w ] = mᵀ A m + Σ_j A_jj s_j²   (for q with mean m and independent coordinate variances s_j²)
Explanation: With independent coordinates, cross-covariances vanish. This identity yields simple ELBO terms for Gaussian linear models in CAVI.
Complexity Analysis
For the CAVI example, maintaining the residual r = y − X m makes each coordinate update O(n), so a full sweep over d coordinates costs O(nd) time, with O(nd) memory for the column-stored design matrix; each ELBO evaluation is also O(nd). For the black-box VI example, each optimization step with S Monte Carlo samples costs O(Sn); with mini-batching, n is replaced by the batch size at the cost of higher gradient variance.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Coordinate Ascent Variational Inference (CAVI)
// Model: y = X w + noise, noise ~ N(0, sigma2 I)
// Prior: w ~ N(0, alpha^{-1} I)
// Variational family: q(w) = \prod_j N(w_j | m_j, s2_j)
// Updates:
//   s2_j = 1 / (alpha + (1/sigma2) * ||X_col_j||^2)
//   m_j  = s2_j * (1/sigma2) * sum_i X_{ij} * ( y_i - sum_{k!=j} X_{ik} m_k )
// We implement O(nd) per sweep using a maintained residual r = y - X m.

struct Dataset {
    int n, d;
    vector<vector<double>> X_cols; // d columns, each of length n
    vector<double> y;              // length n
    vector<double> col_norm2;      // ||X_col_j||^2
};

Dataset make_synthetic(int n, int d, double sigma, double alpha_true, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    Dataset data;
    data.n = n; data.d = d;
    data.X_cols.assign(d, vector<double>(n));
    data.y.assign(n, 0.0);
    vector<double> w_true(d);
    // True weights ~ N(0, alpha_true^{-1} I)
    for (int j = 0; j < d; ++j) w_true[j] = N01(rng) / sqrt(alpha_true);
    // Design matrix X ~ N(0,1)
    for (int j = 0; j < d; ++j) {
        for (int i = 0; i < n; ++i) data.X_cols[j][i] = N01(rng);
    }
    // Generate y = X w_true + noise
    normal_distribution<double> noise(0.0, sigma);
    for (int i = 0; i < n; ++i) {
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += data.X_cols[j][i] * w_true[j];
        yi += noise(rng);
        data.y[i] = yi;
    }
    // Precompute column norms
    data.col_norm2.assign(d, 0.0);
    for (int j = 0; j < d; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += data.X_cols[j][i] * data.X_cols[j][i];
        data.col_norm2[j] = s;
    }
    return data;
}

struct VariationalParams {
    vector<double> m;  // means
    vector<double> s2; // variances (diagonal)
};

// Compute ELBO under mean-field diagonal q(w):
//   E_q[log p(y|w)] + E_q[log p(w)] - E_q[log q(w)]
//   E[||y - X w||^2] = ||y - X m||^2 + sum_j s2_j * ||X_col_j||^2
//   E[||w||^2]       = ||m||^2 + sum_j s2_j
double elbo(const Dataset& D, const VariationalParams& vp, double sigma2, double alpha) {
    int n = D.n, d = D.d;
    // Compute residual r = y - X m
    vector<double> r = D.y;
    for (int j = 0; j < d; ++j) {
        double mj = vp.m[j];
        if (mj == 0.0) continue;
        for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * mj;
    }
    // ||r||^2
    double r2 = 0.0;
    for (double v : r) r2 += v * v;
    // sum_j s2_j * ||X_col_j||^2
    double trace_term = 0.0;
    for (int j = 0; j < d; ++j) trace_term += vp.s2[j] * D.col_norm2[j];
    // E_q[log p(y|w)]
    double ll = -0.5 * n * log(2.0 * M_PI * sigma2) - 0.5 * (r2 + trace_term) / sigma2;
    // E_q[log p(w)] with prior N(0, alpha^{-1} I)
    double m2 = 0.0, s2sum = 0.0;
    for (int j = 0; j < d; ++j) { m2 += vp.m[j] * vp.m[j]; s2sum += vp.s2[j]; }
    double lpw = -0.5 * d * log(2.0 * M_PI / alpha) - 0.5 * alpha * (m2 + s2sum);
    // Entropy of q (sum of 1D Gaussians)
    double H = 0.0;
    for (int j = 0; j < d; ++j) H += 0.5 * log(2.0 * M_E * M_PI * vp.s2[j]);
    return ll + lpw + H;
}

VariationalParams cavi_bayesian_lr(const Dataset& D, double sigma2, double alpha,
                                   int max_iters = 200, double tol = 1e-6) {
    int n = D.n, d = D.d;
    VariationalParams vp;
    vp.m.assign(d, 0.0);
    vp.s2.assign(d, 0.0);
    // Precompute s2_j, which doesn't depend on the other coordinates in this model
    for (int j = 0; j < d; ++j) vp.s2[j] = 1.0 / (alpha + (1.0 / sigma2) * D.col_norm2[j]);
    // Maintain residual r = y - X m
    vector<double> r = D.y; // initially m = 0, so r = y
    double prev_elbo = elbo(D, vp, sigma2, alpha);
    for (int it = 0; it < max_iters; ++it) {
        for (int j = 0; j < d; ++j) {
            double old_mj = vp.m[j];
            // Compute sum_i X_{ij} * (r_i + X_{ij} * m_j) efficiently
            double tmp = 0.0;
            for (int i = 0; i < n; ++i) tmp += D.X_cols[j][i] * (r[i] + D.X_cols[j][i] * old_mj);
            double new_mj = vp.s2[j] * (tmp / sigma2);
            double delta = new_mj - old_mj;
            if (delta != 0.0) {
                for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * delta;
                vp.m[j] = new_mj;
            }
        }
        double cur_elbo = elbo(D, vp, sigma2, alpha);
        if (it % 10 == 0) {
            cerr << "Iter " << it << ": ELBO = " << cur_elbo << "\n";
        }
        if (abs(cur_elbo - prev_elbo) < tol * (1.0 + abs(prev_elbo))) break;
        prev_elbo = cur_elbo;
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500, d = 20;
    double sigma = 0.5;  // noise std
    double sigma2 = sigma * sigma;
    double alpha = 1.0;  // prior precision

    Dataset D = make_synthetic(n, d, sigma, alpha, 123);
    VariationalParams vp = cavi_bayesian_lr(D, sigma2, alpha, 300, 1e-7);

    // Report learned mean and average variance
    double avg_s2 = accumulate(vp.s2.begin(), vp.s2.end(), 0.0) / d;
    cout << fixed << setprecision(6);
    cout << "Learned mean w (first 5): ";
    for (int j = 0; j < min(d, 5); ++j) cout << vp.m[j] << (j + 1 < min(d, 5) ? ", " : "\n");
    cout << "Average posterior variance per weight: " << avg_s2 << "\n";
    cout << "Final ELBO: " << elbo(D, vp, sigma2, alpha) << "\n";
    return 0;
}
```
We approximate the posterior over regression weights w with a factorized Gaussian q(w)=∏ N(w_j|m_j,s_j^2). For Gaussian likelihood and Gaussian prior, the CAVI updates are closed-form. We maintain the residual r=y−X m so each coordinate update is O(n), yielding O(nd) per sweep. We also compute the ELBO exactly under the mean-field q to monitor convergence. The code generates synthetic data, runs CAVI, and prints the learned posterior means and average variances.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Model: z ~ N(0,1); x_i | z ~ N(z, sigma^2) for i=1..n (shared latent z)
// Variational family: q(z) = N(mu, s^2) with parameters (mu, l = log s^2)
// ELBO(mu, l) = E_q[ log p(z) + sum_i log p(x_i|z) - log q(z) ]
// Reparameterization: z = mu + s * eps, eps ~ N(0,1)
// Pathwise gradients:
//   Let A'(z) = d/dz [ log p(z) + sum_i log p(x_i|z) ] = -z + (sum x_i)/sigma^2 - (n/sigma^2) z
//   With s = exp(0.5*l):  dL/dmu = A'(z),  dL/dl = 0.5 * (A'(z) * s * eps + 1)

struct BBVIConfig {
    int iters = 2000;    // optimization steps
    int S = 8;           // MC samples per step
    double lr_mu = 1e-2; // learning rate for mu
    double lr_l = 1e-2;  // learning rate for log-variance l
};

struct Data { vector<double> x; double sigma; };

struct VarParams { double mu, l; }; // l = log s^2

// Monte Carlo estimate of the ELBO and its gradients w.r.t. (mu, l)
struct Est { double elbo; double gmu; double gl; };

Est estimate_elbo_and_grad(const Data& D, const VarParams& vp, int S, mt19937& rng) {
    normal_distribution<double> N01(0.0, 1.0);
    int n = (int)D.x.size();
    double sigma2 = D.sigma * D.sigma;
    double s = exp(0.5 * vp.l);
    double sumx = accumulate(D.x.begin(), D.x.end(), 0.0);
    const double c_log_pz = -0.5 * log(2.0 * M_PI);
    const double c_log_px = -0.5 * log(2.0 * M_PI * sigma2);
    double elbo_acc = 0.0, gmu_acc = 0.0, gl_acc = 0.0;

    for (int t = 0; t < S; ++t) {
        double eps = N01(rng);
        double z = vp.mu + s * eps;
        // log p(z)
        double log_pz = c_log_pz - 0.5 * (z * z);
        // sum log p(x_i | z)
        double quad = 0.0;
        for (double xi : D.x) {
            double diff = xi - z;
            quad += diff * diff;
        }
        double log_px = n * c_log_px - 0.5 * quad / sigma2;
        // log q(z; mu, s)
        double log_q = -0.5 * ((z - vp.mu) * (z - vp.mu) / (s * s) + 2.0 * log(s) + log(2.0 * M_PI));
        double L = log_pz + log_px - log_q;
        elbo_acc += L;
        // Gradients
        double Aprime = -z + (sumx / sigma2) - (n / sigma2) * z; // d/dz of log p(z) + sum log p(x|z)
        double dL_dmu = Aprime;                                  // pathwise derivative
        double dL_dl = 0.5 * (Aprime * s * eps + 1.0);           // chain rule via s = exp(0.5*l)
        gmu_acc += dL_dmu;
        gl_acc += dL_dl;
    }
    Est out;
    out.elbo = elbo_acc / S;
    out.gmu = gmu_acc / S;
    out.gl = gl_acc / S;
    return out;
}

VarParams fit_bbvi(const Data& D, const BBVIConfig& cfg, unsigned seed = 7) {
    mt19937 rng(seed);
    VarParams vp{0.0, 0.0}; // initialize mu = 0, l = 0 => s^2 = 1
    for (int it = 1; it <= cfg.iters; ++it) {
        Est e = estimate_elbo_and_grad(D, vp, cfg.S, rng);
        // Simple SGD updates
        vp.mu += cfg.lr_mu * e.gmu;
        vp.l += cfg.lr_l * e.gl;
        if (it % 200 == 0) {
            double s2 = exp(vp.l);
            cerr << "Iter " << it << ": ELBO = " << e.elbo
                 << ", mu = " << vp.mu << ", s2 = " << s2 << "\n";
        }
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from the model with a true latent z_true
    int n = 200; double sigma = 0.75; double z_true = 1.75;
    mt19937 rng(123);
    normal_distribution<double> Nzt(z_true, sigma);
    Data D; D.sigma = sigma; D.x.resize(n);
    for (int i = 0; i < n; ++i) D.x[i] = Nzt(rng);

    BBVIConfig cfg;
    cfg.iters = 2000; cfg.S = 8; cfg.lr_mu = 5e-3; cfg.lr_l = 5e-3;
    VarParams vp = fit_bbvi(D, cfg, 321);

    cout << fixed << setprecision(6);
    cout << "Learned mu: " << vp.mu << "\n";
    cout << "Learned s2: " << exp(vp.l) << "\n";
    return 0;
}
```
This example performs black-box VI for a simple latent Gaussian model where a single latent z explains all observations. We use a Gaussian q(z)=N(μ,s²) and optimize the ELBO via the reparameterization trick: z=μ+sε. The pathwise gradients have simple closed forms, enabling a concise SGD implementation without external libraries. The code reports ELBO progress and converges toward the true posterior (which is Gaussian for this model).