
Evidence Lower Bound (ELBO)

Key Points

  • The Evidence Lower Bound (ELBO) is a tractable lower bound on the log evidence log p(x) used to perform approximate Bayesian inference.
  • The ELBO splits into a reconstruction term E_q[log p(x|z)] that measures data fit and a regularization term −KL(q(z|x) ∥ p(z)) that keeps the posterior close to the prior.
  • Maximizing the ELBO is equivalent to minimizing the KL divergence between the variational posterior q(z|x) and the true posterior p(z|x).
  • Jensen's inequality guarantees ELBO ≤ log p(x), and the gap equals KL(q(z|x) ∥ p(z|x)), which is always nonnegative.
  • For Gaussian models, the KL and many other terms have closed-form expressions, enabling efficient and stable C++ implementations.
  • Monte Carlo estimation with the reparameterization trick provides unbiased, low-variance gradient estimates of the ELBO.
  • Importance sampling with the log-sum-exp trick can estimate log p(x) and empirically verify the ELBO bound.
  • The ELBO underpins VAEs, variational inference in probabilistic models, and modern latent-variable learning.

Prerequisites

  • Basic probability and random variables — Understanding priors, likelihoods, and expectations is necessary to interpret ELBO terms.
  • Bayes' rule — The ELBO arises from approximating the intractable posterior p(z|x) defined via Bayes' rule.
  • KL divergence and information theory — The ELBO uses KL(q ∥ p) as a regularizer, and its properties ensure the lower bound.
  • Monte Carlo estimation — ELBO expectations are typically approximated via sampling.
  • Gaussian distributions — Common closed-form KL and entropy results for Gaussians are heavily used in practice.
  • Calculus and chain rule — Needed to derive gradients for the ELBO, especially with reparameterization.
  • Numerical stability in logs and exponentials — Stable computations (e.g., log-sum-exp) prevent underflow/overflow during ELBO estimation.

Detailed Explanation


01 Overview

The Evidence Lower Bound (ELBO) is a cornerstone of variational inference, a technique for approximating intractable Bayesian posteriors. In latent-variable models, we posit hidden variables z that generate observations x through a likelihood p(x|z) and are themselves drawn from a prior p(z). The true posterior p(z|x) is often hard to compute because it requires the evidence log p(x) = log ∫ p(x, z) dz, an integral that is intractable in most interesting models. The ELBO introduces a family of simpler distributions q(z|x) and optimizes them to approximate p(z|x). The central result is a bound: log p(x) ≥ E_q[log p(x, z)] − E_q[log q(z|x)], which can be rewritten as E_q[log p(x|z)] − KL(q(z|x) ∥ p(z)). This form decomposes the objective into a data reconstruction term and a complexity penalty via the Kullback–Leibler divergence to the prior. Maximizing the ELBO both encourages accurate explanations of the data and prevents overfitting by limiting how much information z carries. Practically, the ELBO can be estimated with Monte Carlo and differentiated via the reparameterization trick, enabling scalable learning in high dimensions. The ELBO is foundational for Variational Autoencoders (VAEs), Bayesian neural networks, topic models (e.g., LDA), and many probabilistic graphical models.

02 Intuition & Analogies

Imagine compressing a photo album onto a USB drive. You want the files to look like the originals when decompressed (good reconstruction), but you also want the USB drive to be small (limited capacity). ELBO captures this exact trade-off for probabilistic models. The latent variable z is like the compressed code; the decoder p(x|z) reconstructs the data; the prior p(z) sets the shape and capacity of the codebook; and q(z|x) is the encoder that creates codes for each example. The first ELBO term, E_q[log p(x|z)], rewards reconstructions that look like the data—akin to having crisp photos after decompression. The second term, −KL(q(z|x) || p(z)), penalizes using too many unusual codes—like forcing us to pack information efficiently so that the codes remain near the prior’s typical set. Why is it a lower bound? Think of log p(x) as the ideal score for compressing and reconstructing images without constraints. Because we restrict ourselves to a simpler family q(z|x), we can’t reach that ideal unless our q matches the true posterior. Jensen’s inequality formalizes that the expected log of a random variable is less than or equal to the log of its expectation—hence a bound. The gap between the ELBO and the true log evidence is exactly how much our encoder q(z|x) deviates from the true posterior p(z|x). When the encoder is perfect, the gap vanishes and the bound becomes tight. In practice, we optimize the encoder and decoder jointly so that reconstructions are good yet the latent codes remain simple and smooth, which improves generalization.

03 Formal Definition

Given a generative model p(x, z) = p(x|z) p(z) and a variational family q(z|x), the ELBO for a single observation x is defined as: ELBO(x) = E_{q(z|x)}[log p(x, z) − log q(z|x)] = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z)). By Jensen's inequality applied to log p(x) = log ∫ q(z|x) · p(x, z)/q(z|x) dz, we obtain log p(x) ≥ ELBO(x). The gap is KL(q(z|x) ∥ p(z|x)) ≥ 0, giving log p(x) = ELBO(x) + KL(q(z|x) ∥ p(z|x)). In learning settings with decoder (likelihood) parameters θ and encoder (variational) parameters φ, we maximize ELBO(θ, φ; x) over θ and φ. Typically, we estimate expectations with Monte Carlo: draw z^(s) ~ q_φ(z|x) and approximate E_q[·] by an average over samples. Gradients with respect to φ can be estimated via the reparameterization trick when z is continuous: write z = g_φ(ε, x) with ε ~ p(ε) independent noise (e.g., standard normal), and move the gradient inside the expectation. For exponential-family choices (e.g., Gaussian prior and posterior), the KL term often has a closed form, improving stability and efficiency.

04 When to Use

Use ELBO whenever you want Bayesian inference but the true posterior p(z|x) is intractable. This includes: (1) Latent variable models such as VAEs, where z represents compressed features and p(x|z) is a neural decoder; (2) Probabilistic topic models like LDA, where documents are mixtures over topics; (3) Hierarchical Bayesian models where conjugacy is absent or partial; (4) Large-scale problems that require amortized inference—learning a global encoder q(z|x) that works for many x efficiently; and (5) Scenarios requiring differentiable objectives for gradient-based optimization, where the reparameterization trick applies. ELBO is also useful as a training objective balancing data fit and model complexity via an interpretable regularizer. When q can be expressive (e.g., normalizing flows), ELBO can be very tight. When q is limited, ELBO still provides a principled target, and diagnostics like the bound gap (via importance sampling) inform whether q needs to be improved. If your latent variables are discrete, ELBO can still be used, but gradients may require REINFORCE-like estimators or continuous relaxations (e.g., Gumbel-Softmax).

⚠️Common Mistakes

  • Ignoring numerical stability in log-probabilities. Computing log p(x|z), log p(z), and log q(z|x) can underflow for extreme values; always work in log space and use tricks like log-sum-exp.
  • Miscomputing the KL divergence. For Gaussians, use the correct closed-form KL with variance (not standard deviation) and ensure positivity using a log-variance parametrization.
  • Forgetting the sign: the ELBO includes −KL(q ∥ p). Accidentally maximizing reconstruction minus negative KL (i.e., adding KL) will over-regularize or destabilize training.
  • Using too few Monte Carlo samples. This increases the variance of gradient estimates and can obscure learning progress; use several samples per data point or average over minibatches.
  • Confusing bound tightness with model quality. A tight ELBO can indicate a good variational family, but poor generative assumptions (bad p(x|z) or p(z)) can still yield poor samples.
  • Mixing parameterizations. If you parametrize q with log-variance, derive gradients accordingly; treating it as variance can lead to exploding or vanishing gradients.
  • Not checking the bound. Use importance sampling estimates of log p(x) to verify ELBO ≤ estimate; large gaps may suggest a more expressive q (e.g., flows) or improved encoder/decoder architectures.

Key Formulas

Evidence (Marginal Likelihood)

log p(x) = log ∫ p(x, z) dz

Explanation: The evidence is the log probability of data after integrating out latent variables. It is often intractable to compute directly.

ELBO (Joint Form)

L(x) = E_{q(z|x)}[log p(x, z) − log q(z|x)]

Explanation: This is the basic definition of ELBO as an expectation under the variational posterior. It trades off data fit and complexity.

ELBO (Reconstruction + KL)

L(x) = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z))

Explanation: Decomposes ELBO into a reconstruction term and a KL regularizer to the prior. This is the standard VAE objective.

Bound Decomposition

log p(x) = L(x) + KL(q(z|x) ∥ p(z|x)) ≥ L(x)

Explanation: The gap between log evidence and ELBO is the KL divergence between q and the true posterior, which is always nonnegative.

KL Divergence

KL(q ∥ p) = ∫ q(z) log( q(z) / p(z) ) dz

Explanation: Definition of KL divergence. It measures how much q differs from p and is zero only when q equals p almost everywhere.

Gaussian-to-Standard Gaussian KL

KL(N(μ, σ²) ∥ N(0, 1)) = ½ (μ² + σ² − 1 − log σ²)

Explanation: Closed-form KL for univariate Gaussian to standard normal. Useful for efficient ELBO computation in VAEs.

Reparameterization Trick

z = μ(x) + σ(x) ⊙ ε,  ε ~ N(0, I)

Explanation: Expresses sampling from q as a deterministic transform of noise, enabling gradients to pass through samples.

Jensen’s Inequality for ELBO

log p(x) = log E_{q(z|x)}[ p(x, z) / q(z|x) ] ≥ E_{q(z|x)}[ log( p(x, z) / q(z|x) ) ]

Explanation: The concavity of log implies the ELBO is a lower bound of the log evidence.

Monte Carlo ELBO

L̂(x) = (1/S) Σ_{s=1}^{S} ( log p(x|z^(s)) + log p(z^(s)) − log q(z^(s)|x) )

Explanation: Approximates the ELBO with S samples from q. Variance decreases with more samples.

Importance Sampling Estimate

log p(x) ≈ log( (1/S) Σ_{s=1}^{S} w_s ),  w_s = p(x, z^(s)) / q(z^(s)|x)

Explanation: Estimates log evidence using weighted samples from q. Using log-sum-exp yields numerical stability.

Gaussian Entropy

H(N(μ, Σ)) = ½ log( (2πe)^d det Σ )

Explanation: Closed-form entropy of a multivariate Gaussian. It appears in the ELBO via the −E_q[log q] term.

Gradients of Gaussian KL

∇_μ KL(N(μ, σ²) ∥ N(0, 1)) = μ,  ∇_{log σ²} KL = ½ (σ² − 1)

Explanation: Convenient gradients when parametrizing with mean and log-variance. Useful for manual gradient checks.

Complexity Analysis

Suppose we estimate the ELBO for N data points, each with d-dimensional latent variables, using S Monte Carlo samples per point. For a diagonal-Gaussian q(z|x), sampling and evaluating log q, log p(z), and log p(x|z) cost O(d) per sample. Therefore, the per-point cost is O(S d), and the total pass over the dataset is O(N S d). When using analytic KL terms (e.g., Gaussian-to-Gaussian), we save one evaluation per sample and reduce variance, but the dominant term remains O(S d) due to the reconstruction expectation. Memory usage is O(d) per sample to store z, or O(S d) if you retain all samples simultaneously (e.g., for importance weighting or diagnostics). In streaming implementations, you can compute running sums to achieve O(d) memory regardless of S. For gradient-based optimization with the reparameterization trick, the cost is similar: computing gradients of the reconstruction term involves backpropagating through z = μ + σ ⊙ ε with O(d) work per sample. If the encoder/decoder are neural networks, their forward/backward costs typically dominate; the variational components add only linear overhead in d and S. Numerical stabilization techniques (e.g., log-sum-exp) introduce negligible overhead and prevent underflow/overflow, which is crucial for high-dimensional or long-tailed distributions. In summary, ELBO-based training scales linearly with N, d, and S under common modeling assumptions, and can be engineered to use O(d) additional memory per data point.

Code Examples

Monte Carlo ELBO for a 1D Gaussian Latent Model (with analytic KL)
#include <iostream>
#include <random>
#include <cmath>
#include <iomanip>

// This example computes the ELBO for a simple model:
//   z ~ N(0, 1)
//   x | z ~ N(z, sigma_x^2)
//   q(z|x) = N(mu, var) with var = exp(logvar)
//   ELBO(x) = E_q[log p(x|z)] - KL(q(z|x) || p(z))

static const double PI = 3.14159265358979323846;

// Log PDF of univariate normal N(mean, var)
double normal_logpdf(double x, double mean, double var) {
    // Clamp the variance for numerical safety
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// KL(N(mu, var) || N(0, 1)) in closed form
double kl_gaussian_to_standard(double mu, double var) {
    // 0.5 * (mu^2 + var - 1 - log var)
    double v = std::max(var, 1e-12);
    return 0.5 * (mu * mu + v - 1.0 - std::log(v));
}

int main() {
    // Fixed generative noise sigma_x
    double sigma_x = 0.5;              // std dev of p(x|z)
    double var_x = sigma_x * sigma_x;

    // A single observation x
    double x = 1.2;

    // Variational parameters q(z|x) = N(mu, var), with var = exp(logvar)
    double mu = 0.8;
    double logvar = -0.5;              // so var ~ 0.60653
    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    // Monte Carlo samples
    int S = 10000;

    std::mt19937 rng(123);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    // Estimate E_q[log p(x|z)] via Monte Carlo
    double sum_loglik = 0.0;
    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;   // reparameterization
        sum_loglik += normal_logpdf(x, z, var_x);
    }
    double recon_term = sum_loglik / static_cast<double>(S);

    // Analytic KL term
    double kl = kl_gaussian_to_standard(mu, var);

    // ELBO = reconstruction - KL
    double elbo = recon_term - kl;

    // The true log evidence log p(x) is available in closed form for this model:
    // z ~ N(0,1), x|z ~ N(z, sigma_x^2)  =>  x ~ N(0, 1 + sigma_x^2)
    double log_p_x = normal_logpdf(x, 0.0, 1.0 + var_x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Reconstruction term E_q[log p(x|z)]: " << recon_term << "\n";
    std::cout << "KL(q||p): " << kl << "\n";
    std::cout << "ELBO: " << elbo << "\n";
    std::cout << "True log p(x): " << log_p_x << "\n";
    std::cout << "ELBO <= log p(x)? " << (elbo <= log_p_x ? "yes" : "no") << "\n";

    return 0;
}

We define a simple 1D latent Gaussian model with Gaussian likelihood and a Gaussian variational posterior. The reconstruction term is estimated by Monte Carlo using the reparameterization trick. The KL(q||p) to a standard normal prior is computed in closed form. Because the marginal p(x) is also Gaussian, we compute log p(x) exactly and verify the ELBO bound.

Time: O(S) · Space: O(1)
Reparameterization Gradients of ELBO for Gaussian q(z|x)
#include <iostream>
#include <random>
#include <cmath>
#include <iomanip>

// Model: z ~ N(0,1), x|z ~ N(z, sigma_x^2)
// q(z|x) = N(mu, var), var = exp(logvar)
// ELBO = E_q[log p(x|z)] - KL(q||p)
// We compute gradients wrt mu and logvar using reparameterization.

static const double PI = 3.14159265358979323846;

double normal_logpdf(double x, double mean, double var) {
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// Derivative of log p(x|z) wrt z for Gaussian likelihood N(z, var_x)
double dloglik_dz(double x, double z, double var_x) {
    // loglik = -0.5*log(2*pi*var_x) - 0.5*(x - z)^2/var_x
    // derivative wrt z is (x - z)/var_x
    return (x - z) / var_x;
}

// Closed-form KL and its gradients w.r.t. mu and logvar (for q to N(0,1))
struct KLGrads {
    double value;
    double d_mu;
    double d_logvar;
};

KLGrads kl_and_grads(double mu, double logvar) {
    double var = std::exp(logvar);
    double kl = 0.5 * (mu * mu + var - 1.0 - std::log(std::max(var, 1e-12)));
    double d_mu = mu;                    // d/dmu KL = mu
    double d_logvar = 0.5 * (var - 1.0); // d/dlogvar KL = 0.5*(var - 1)
    return {kl, d_mu, d_logvar};
}

int main() {
    double x = 1.2;
    double sigma_x = 0.5;
    double var_x = sigma_x * sigma_x;

    // Initialize variational params
    double mu = 0.8;
    double logvar = -0.5;

    int S = 4096;      // MC samples for gradient estimate
    double lr = 0.05;  // learning rate for one gradient step

    std::mt19937 rng(42);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    // Reconstruction gradients via reparameterization
    double grad_mu_recon = 0.0;
    double grad_logvar_recon = 0.0;
    double recon_estimate = 0.0;

    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;
        // Accumulate the reconstruction value (optional, for logging)
        recon_estimate += normal_logpdf(x, z, var_x);
        // Derivative of the log-likelihood wrt z
        double dl_dz = dloglik_dz(x, z, var_x);
        // Chain rule: z = mu + sigma * eps, sigma = exp(0.5*logvar),
        // so dsigma/dlogvar = 0.5 * sigma
        grad_mu_recon += dl_dz * 1.0;                     // dz/dmu = 1
        grad_logvar_recon += dl_dz * eps * (0.5 * sigma); // dz/dlogvar = eps * 0.5 * sigma
    }

    grad_mu_recon /= static_cast<double>(S);
    grad_logvar_recon /= static_cast<double>(S);
    recon_estimate /= static_cast<double>(S);

    // KL and its gradients
    KLGrads k = kl_and_grads(mu, logvar);

    // ELBO gradients: grad = grad_recon - grad_KL
    double grad_mu = grad_mu_recon - k.d_mu;
    double grad_logvar = grad_logvar_recon - k.d_logvar;

    // One gradient ascent step on the ELBO
    mu += lr * grad_mu;
    logvar += lr * grad_logvar;

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Reconstruction term estimate: " << recon_estimate << "\n";
    std::cout << "KL value: " << k.value << "\n";
    std::cout << "Grad mu (recon, KL, total): " << grad_mu_recon << ", " << k.d_mu << ", " << grad_mu << "\n";
    std::cout << "Grad logvar (recon, KL, total): " << grad_logvar_recon << ", " << k.d_logvar << ", " << grad_logvar << "\n";
    std::cout << "Updated mu: " << mu << ", updated logvar: " << logvar << "\n";

    return 0;
}

This program computes Monte Carlo gradients of the ELBO with respect to the mean and log-variance of a Gaussian variational posterior using the reparameterization trick. The reconstruction term gradients are estimated via samples; the KL term and its gradients are analytic. We perform one gradient ascent step to illustrate parameter updates.

Time: O(S) · Space: O(1)
Importance Sampling vs. ELBO with Log-Sum-Exp Stability
#include <iostream>
#include <random>
#include <cmath>
#include <vector>
#include <algorithm>
#include <iomanip>

static const double PI = 3.14159265358979323846;

double normal_logpdf(double x, double mean, double var) {
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// Stable log-sum-exp of a vector of log-weights
double logsumexp(const std::vector<double>& logw) {
    double m = *std::max_element(logw.begin(), logw.end());
    double sum = 0.0;
    for (double lw : logw) sum += std::exp(lw - m);
    return m + std::log(std::max(sum, 1e-300));
}

int main() {
    // Model as before
    double sigma_x = 0.5;   // std dev of p(x|z)
    double var_x = sigma_x * sigma_x;
    double x = 1.2;

    // Variational q(z|x)
    double mu = 0.8;
    double logvar = -0.5;
    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    int S = 20000;          // number of importance samples

    std::mt19937 rng(7);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    std::vector<double> log_w;
    log_w.reserve(S);
    double sum_elbo_terms = 0.0;  // average of log w approximates the single-sample ELBO

    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;  // sample from q via reparameterization

        double log_p_x_given_z = normal_logpdf(x, z, var_x);
        double log_p_z = normal_logpdf(z, 0.0, 1.0);
        double log_q = normal_logpdf(z, mu, var);

        double lw = log_p_x_given_z + log_p_z - log_q;  // log importance weight
        log_w.push_back(lw);
        sum_elbo_terms += lw;
    }

    // Importance sampling estimate of log p(x)
    double lse = logsumexp(log_w);
    double log_p_x_hat = lse - std::log(static_cast<double>(S));

    // Monte Carlo ELBO estimate (using the single-sample form E_q[log w])
    double elbo_mc = sum_elbo_terms / static_cast<double>(S);

    // True log p(x) for verification
    double log_p_x_true = normal_logpdf(x, 0.0, 1.0 + var_x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "MC ELBO (E[log w]): " << elbo_mc << "\n";
    std::cout << "IS log p(x) (log E[w]): " << log_p_x_hat << "\n";
    std::cout << "True log p(x): " << log_p_x_true << "\n";
    std::cout << "Check: ELBO <= IS estimate? " << (elbo_mc <= log_p_x_hat ? "yes" : "no") << "\n";

    return 0;
}

We compare the Monte Carlo ELBO estimate E_q[log w] with an importance sampling estimate log E_q[w], where w = p(x, z)/q(z|x). By Jensen’s inequality, E[log w] ≤ log E[w], so ELBO should be below the IS estimate. We implement a numerically stable log-sum-exp to compute log E[w] reliably.

Time: O(S) · Space: O(S)
#elbo · #variational inference · #vae · #kl divergence · #reparameterization trick · #importance sampling · #jensen inequality · #gaussian · #evidence lower bound · #free energy · #log-sum-exp · #monte carlo · #latent variable model · #posterior approximation · #bayesian