ELBO (Evidence Lower Bound)
Key Points
- The Evidence Lower Bound (ELBO) is a tractable lower bound on the log evidence log p(x) that enables learning and inference in latent variable models like VAEs.
- ELBO splits into a reconstruction term that encourages accurate data modeling and a KL divergence term that regularizes the latent code toward the prior.
- You can derive ELBO with Jensen's inequality by introducing an auxiliary distribution q(z|x).
- The gap between log p(x) and ELBO equals KL(q(z|x) || p(z|x)), so the bound is tight when q equals the true posterior.
- β-VAE scales the KL term by a factor β to trade off reconstruction fidelity against disentangled and simple latent representations.
- IWAE (Importance Weighted Autoencoder) uses multiple importance samples to form a tighter bound that approaches log p(x) as K increases.
- In practice, ELBO is estimated by Monte Carlo and optimized with the reparameterization trick to obtain low-variance gradients.
- Numerical stability (e.g., log-sum-exp) and correct KL direction are essential to implementing ELBO and IWAE safely.
Prerequisites
- Probability distributions and density functions – Understanding p(x), p(z), and joint/conditional distributions is essential for defining ELBO.
- Expectation and Monte Carlo estimation – ELBO involves expectations under q(z|x), typically estimated with samples.
- Kullback–Leibler (KL) divergence – ELBO includes a KL term; knowing properties like nonnegativity and asymmetry is crucial.
- Jensen's inequality – The ELBO derivation applies Jensen's inequality to the log of an expectation.
- Gaussian distributions (diagonal covariance) – Common variational families and decoders are diagonal Gaussians, enabling closed-form KL and log-likelihoods.
- Latent variable models – ELBO is used to learn/infer hidden variables that explain observed data.
- Reparameterization trick – Enables low-variance gradient estimates for stochastic nodes like z ~ q(z|x).
- Numerical stability (log-sum-exp) – Required to compute IWAE and log-averages without overflow/underflow.
- Linear algebra basics – Even simple decoders are linear maps requiring matrix–vector multiplications.
- Stochastic gradient optimization – In practice, ELBO is maximized with optimizers like SGD/Adam over mini-batches.
Detailed Explanation
01 Overview
The Evidence Lower Bound (ELBO) is a central tool in variational inference for probabilistic models with latent variables. Many generative models, such as Variational Autoencoders (VAEs), posit a hidden variable z that explains how observed data x is generated. Directly maximizing the data likelihood log p(x) is typically intractable because it involves integrating over all possible latent configurations z. ELBO addresses this by introducing a simpler, learnable distribution q(z|x) that approximates the true posterior p(z|x). Through Jensen's inequality, ELBO forms a computable lower bound on log p(x). ELBO decomposes into two interpretable parts: a reconstruction term that rewards how well the model can generate the observed data from sampled latent codes, and a Kullback–Leibler (KL) divergence term that pulls the approximate posterior q(z|x) toward the prior p(z). This balance encourages both faithful reconstructions and a well-structured latent space that generalizes. In VAEs, both the encoder (q) and decoder (p(x|z)) are parameterized by neural networks and optimized jointly by maximizing ELBO over a dataset. Beyond VAEs, ELBO formalizes approximate Bayesian inference in a wide range of models and is the backbone of scalable learning with mini-batches and stochastic gradients. Variants like β-VAE adjust the strength of the KL regularization to encourage disentanglement, while IWAE tightens the bound by using multiple importance samples. Together, these methods enable practical training of powerful generative models on large, complex datasets.
02 Intuition & Analogies
Imagine compressing photos into short codes so that a friend can reconstruct the original pictures using only those codes and a shared dictionary. You want two things: (1) the reconstructions should look like the originals (good compression quality), and (2) the codes should follow a simple, consistent format so your friend's decoder can understand them (organized latent space). ELBO measures exactly this trade-off: the reconstruction term scores how well the picture is rebuilt from the code, while the KL term penalizes codes that stray from a standardized template (the prior). Another analogy: suppose you're telling a story (generating x) based on a secret outline (latent z). To judge whether your storytelling process is good, you could try to account for all possible outlines, but that's overwhelming. Instead, you keep a helper's guess of the outline given the story (q(z|x)). With that helper in place, you can score two things: how consistent the story would be if it followed the guessed outline (reconstruction), and how plausible the guessed outline is compared to your prior knowledge about outlines (KL regularization). If your helper's guesses match the true underlying outline distribution, your lower bound becomes exact. Finally, think of ELBO as a budget ledger. Income comes from how well the generated data matches reality (reconstruction rewards), and expenses come from how complex or unusual your latent codes are compared to a simple baseline (KL costs). Tuning β in β-VAE is like deciding how frugal to be: a higher β enforces stricter budgeting (simpler, more disentangled codes) at the cost of some reconstruction fidelity. Taking multiple estimates as in IWAE is like polling more judges to get a more accurate score; the more judges (samples) you use, the closer your evaluation approaches the true log-likelihood.
03 Formal Definition
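Let p(x, z) = p(x|z) p(z) be the latent variable model and q(z|x) a variational distribution over the latents. Writing \mathcal{L}(x) for the bound (a notational choice used here, consistent with the Key Formulas section below), the ELBO is

\mathcal{L}(x) := \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x, z)}{q(z|x)}\right] = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big) \;\le\; \log p(x),

with equality exactly when q(z|x) = p(z|x).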
04 When to Use
Use ELBO whenever you need approximate inference or learning in latent variable models where the exact posterior p(z|x) is intractable. Common scenarios include training Variational Autoencoders for images, text, audio, and other high-dimensional data. ELBO is also effective for semi-supervised learning, where latent variables capture class or style information while labels may be scarce. In missing data problems, ELBO-based methods can impute missing values by reasoning over latent structures. ELBO is appropriate when you can (1) choose a tractable variational family q(z|x) (e.g., diagonal Gaussians), (2) compute or approximate log p(x|z) efficiently (via a decoder), and (3) leverage stochastic optimization over large datasets. Choose β-VAE when you want more structured, potentially disentangled representations by placing greater emphasis on matching the prior. Choose IWAE when you need a tighter bound to better correlate the objective with true log-likelihood, particularly in models where simple q(z|x) underestimates posterior complexity. ELBO is also foundational for amortized inference: the encoder network shares parameters across data points, making inference fast at test time.
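To make the amortized-inference point concrete, here is a minimal sketch (not part of the code examples below) of a linear encoder that maps x to the variational parameters (mu, logvar) of q(z|x) with one set of weights shared across all data points; the matrices W_mu and W_lv, the biases, and all numeric values are illustrative assumptions.

#include <bits/stdc++.h>
using namespace std;

// Amortized encoder sketch: a single linear map, shared across all data points,
// produces the variational parameters (mu, logvar) of q(z|x) from x.
// W_mu, W_lv are d_z x d_x; b_mu, b_lv are d_z. All values are illustrative.
struct LinearEncoder {
    vector<vector<double>> W_mu, W_lv;
    vector<double> b_mu, b_lv;

    pair<vector<double>, vector<double>> encode(const vector<double>& x) const {
        size_t d_z = b_mu.size(), d_x = x.size();
        vector<double> mu(d_z, 0.0), logvar(d_z, 0.0);
        for (size_t i = 0; i < d_z; ++i) {
            double m = b_mu[i], lv = b_lv[i];
            for (size_t j = 0; j < d_x; ++j) { m += W_mu[i][j] * x[j]; lv += W_lv[i][j] * x[j]; }
            mu[i] = m;
            logvar[i] = lv;
        }
        return {mu, logvar};
    }
};

int main() {
    // d_x = 3, d_z = 2, arbitrary parameter values for illustration.
    LinearEncoder enc;
    enc.W_mu = {{0.3, -0.1, 0.2}, {0.0, 0.4, -0.2}};
    enc.W_lv = {{-0.1, 0.0, 0.1}, {0.2, -0.1, 0.0}};
    enc.b_mu = {0.0, 0.0};
    enc.b_lv = {log(0.5), log(1.0)};

    vector<double> x = {0.5, -1.0, 1.2};
    auto [mu, logvar] = enc.encode(x);

    cout << fixed << setprecision(4);
    cout << "mu = [" << mu[0] << ", " << mu[1] << "]\n";
    cout << "logvar = [" << logvar[0] << ", " << logvar[1] << "]\n";
    return 0;
}

Its outputs play the role of the hard-coded mu and logvar ("example encoder outputs") used in the full code examples later on this page.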
⚠️ Common Mistakes
- Confusing the direction of KL: ELBO uses KL(q(z|x) || p(z)), not KL(p(z) || q(z|x)). Using the wrong direction changes optimization behavior and invalidates the bound.
- Forgetting that ELBO ≤ log p(x): Maximizing ELBO does not guarantee maximum likelihood unless q equals the true posterior. Interpreting ELBO as exact likelihood leads to overconfidence.
- Numerical instability: Computing log averages without log-sum-exp causes underflow/overflow, especially in IWAE. Always use the log-sum-exp trick for stable importance-weight aggregation.
- Too few samples: Using K = 1 for noisy decoders can yield high-variance estimates. Increase K moderately or use variance reduction (e.g., control variates) when feasible.
- Mismatch of decoder likelihood: Using a Gaussian decoder for intrinsically binary data (or vice versa) mis-specifies the model and skews the objective. Pick p(x|z) consistent with data support.
- Posterior collapse: Overly strong decoders (or large β) can drive q(z|x) toward the prior, diminishing latent usage. Mitigate via KL annealing, free bits, or capacity constraints (see the sketch after this list).
- Incorrect scaling in minibatches: When using minibatches, ensure proper scaling so that the KL and reconstruction terms reflect the dataset size (or use per-sample objectives consistently).
- Misinterpreting β-VAE: β > 1 is not "better" by default. It trades reconstruction accuracy for simpler latents; tune β based on downstream goals (e.g., disentanglement vs. fidelity).
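As referenced in the posterior-collapse item above, here is a minimal sketch of two common mitigations, KL annealing and free bits; the warm-up length (1000 steps) and free-bits threshold (0.1 nats) are illustrative assumptions, not recommended settings.

#include <bits/stdc++.h>
using namespace std;

// Linear KL warm-up: beta ramps from 0 to 1 over `warmup_steps` training steps.
double kl_anneal_beta(int step, int warmup_steps) {
    if (warmup_steps <= 0) return 1.0;
    return min(1.0, static_cast<double>(step) / warmup_steps);
}

// Free bits: clamp each latent dimension's KL contribution to at least `lambda`
// nats so the optimizer cannot collapse that dimension to zero usage.
double free_bits_kl(const vector<double>& kl_per_dim, double lambda) {
    double total = 0.0;
    for (double kl : kl_per_dim) total += max(kl, lambda);
    return total;
}

int main() {
    // Illustrative per-dimension KL values (e.g., from KL(q(z|x) || N(0, I))).
    vector<double> kl_per_dim = {0.01, 0.50, 1.30};

    cout << fixed << setprecision(4);
    for (int step : {0, 250, 500, 1000}) {
        double beta = kl_anneal_beta(step, 1000);   // warm-up over 1000 steps (assumed)
        double kl = free_bits_kl(kl_per_dim, 0.1);  // free-bits threshold 0.1 nats (assumed)
        cout << "step=" << step << "  beta=" << beta
             << "  KL(with free bits)=" << kl << "\n";
    }
    return 0;
}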
Key Formulas
ELBO (standard form)
\mathcal{L}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
Explanation: This splits the bound into a reconstruction term and a regularization term. Maximizing it balances data fit with latent simplicity.
Jensen's inequality derivation
\log p(x) = \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x, z)}{q(z|x)}\right] \ge \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x, z)}{q(z|x)}\right] = \mathcal{L}(x)
Explanation: Applying Jensen's inequality to the concave log function yields a computable lower bound on log evidence using any q(z|x).
ELBO decomposition identity
\log p(x) = \mathcal{L}(x) + \mathrm{KL}\big(q(z|x) \,\|\, p(z|x)\big)
Explanation: The difference between the true log evidence and ELBO is a KL divergence, which is always nonnegative; the bound is tight when q equals the true posterior. A numerical check of this identity follows the Key Formulas.
KL to standard normal (diagonal)
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{i=1}^{d} \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)
Explanation: Closed-form KL between a diagonal Gaussian and a standard normal. Useful for VAE objectives with Gaussian q and prior.
β-VAE objective
\mathcal{L}_{\beta}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \, \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
Explanation: Scales the KL term by β to encourage disentanglement and simpler latents, at the cost of reconstruction fidelity when β > 1.
IWAE bound
\mathcal{L}_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q(z|x)}\!\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k|x)}\right]
Explanation: A tighter lower bound using K importance samples. For K = 1 it reduces to the standard ELBO; it approaches log p(x) as K increases, but at higher computational cost.
Reparameterization trick
z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
Explanation: Samples from q are expressed as deterministic functions of parameters and noise, enabling gradient backpropagation through Monte Carlo samples.
Monte Carlo estimator
\mathbb{E}_{q(z|x)}[f(z)] \approx \frac{1}{K} \sum_{k=1}^{K} f(z_k), \quad z_k \sim q(z|x)
Explanation: Approximates expectations with sample averages; variance decreases as K increases, improving estimate quality.
Log-sum-exp stabilization
\log \frac{1}{K} \sum_{k=1}^{K} e^{a_k} = m + \log \frac{1}{K} \sum_{k=1}^{K} e^{a_k - m}, \quad m = \max_k a_k
Explanation: Computes the log of averages stably by factoring out the maximum term to avoid overflow or underflow.
Gaussian decoder log-likelihood
\log p(x|z) = -\frac{1}{2} \sum_{i=1}^{d_x} \left(\log 2\pi + \log \sigma_i^2 + \frac{(x_i - f_i(z))^2}{\sigma_i^2}\right)
Explanation: When p(x|z) is Gaussian with mean f(z) and diagonal variance, the log-likelihood decomposes into a constant, a variance term, and a scaled squared error.
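As a sanity check of the decomposition identity above, the following sketch uses a 1-D conjugate Gaussian model (prior z ~ N(0, 1), likelihood x|z ~ N(z, var_x)) in which log p(x), the ELBO, and the exact posterior are all available in closed form; the particular values of x, var_x, and the variational parameters m, s2 are arbitrary assumptions for illustration.

#include <bits/stdc++.h>
using namespace std;

int main() {
    const double log2pi = log(2.0 * M_PI);

    // Model: z ~ N(0,1), x | z ~ N(z, var_x). Then p(x) = N(0, 1 + var_x) and
    // the posterior p(z|x) = N(x / (1 + var_x), var_x / (1 + var_x)).
    double x = 0.7, var_x = 0.5;

    // Arbitrary variational distribution q(z|x) = N(m, s2).
    double m = 0.3, s2 = 0.4;

    // Closed-form ELBO = E_q[log p(x|z)] + E_q[log p(z)] - E_q[log q(z)].
    double e_log_px_given_z = -0.5 * (log2pi + log(var_x) + ((x - m) * (x - m) + s2) / var_x);
    double e_log_pz = -0.5 * (log2pi + m * m + s2);
    double e_log_qz = -0.5 * (log2pi + log(s2) + 1.0);
    double elbo = e_log_px_given_z + e_log_pz - e_log_qz;

    // Exact log evidence log p(x) under the marginal N(0, 1 + var_x).
    double log_px = -0.5 * (log2pi + log(1.0 + var_x) + x * x / (1.0 + var_x));

    // KL(q(z|x) || p(z|x)) between two univariate Gaussians.
    double post_mean = x / (1.0 + var_x);
    double post_var = var_x / (1.0 + var_x);
    double kl_q_post = 0.5 * (log(post_var / s2)
                              + (s2 + (m - post_mean) * (m - post_mean)) / post_var - 1.0);

    cout << fixed << setprecision(6);
    cout << "log p(x)           = " << log_px << "\n";
    cout << "ELBO               = " << elbo << "\n";
    cout << "ELBO + KL(q||post) = " << elbo + kl_q_post << "  (should equal log p(x))\n";
    return 0;
}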
Complexity Analysis
For the code examples below, one ELBO estimate with K Monte Carlo samples, latent dimension d_z, and observed dimension d_x costs O(K · d_x · d_z) time, dominated by the linear decoder, plus O(K · d_z) for reparameterized sampling and O(d_z) for the closed-form KL. The IWAE estimator has the same per-sample cost with an extra O(K) log-mean-exp aggregation; memory is O(d_x · d_z) for the decoder parameters plus O(K) for the log-weights.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utilities for diagonal Gaussians and ELBO estimation
struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed = 42) : rng(seed) {}
    // Standard normal sample
    double std_normal() { return nd(rng); }
};

// Compute KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar) {
    size_t d = mu.size();
    double kl = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double m2 = mu[i] * mu[i];
        double v = exp(logvar[i]);  // sigma^2
        kl += m2 + v - logvar[i] - 1.0;
    }
    return 0.5 * kl;
}

// Sample from q(z|x) = N(mu, diag(exp(logvar))) using reparameterization
vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng) {
    size_t d = mu.size();
    vector<double> z(d);
    for (size_t i = 0; i < d; ++i) {
        double eps = rng.std_normal();
        double sigma = sqrt(exp(logvar[i]));
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

// Linear decoder: f(z) = W z + b, with W (d_x x d_z), b (d_x)
vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    size_t d_x = b.size();
    size_t d_z = z.size();
    vector<double> y(d_x, 0.0);
    for (size_t i = 0; i < d_x; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d_z; ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

// Log-likelihood under Gaussian decoder: p(x|z) = N(f(z), diag(exp(logvar_x)))
double log_likelihood_x_given_z(const vector<double>& x,
                                const vector<double>& fz,
                                const vector<double>& logvar_x) {
    size_t d_x = x.size();
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < d_x; ++i) {
        double v = exp(logvar_x[i]);
        double diff = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (diff * diff) / v);
    }
    return ll;
}

// Monte Carlo ELBO estimator with K samples and beta scaling
double estimate_elbo(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, double beta, RNG& rng) {
    double recon_sum = 0.0;
    for (int k = 0; k < K; ++k) {
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        recon_sum += log_likelihood_x_given_z(x, fz, logvar_x);
    }
    double recon_avg = recon_sum / static_cast<double>(K);
    double kl = kl_to_standard_normal(mu, logvar);
    return recon_avg - beta * kl;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // Dimensions
    const size_t d_z = 2;  // latent
    const size_t d_x = 3;  // observed

    // Example data point x
    vector<double> x = {0.5, -1.0, 1.2};

    // Example encoder outputs for this x: mu(x), logvar(x)
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};  // log variances

    // Linear decoder parameters W (d_x x d_z), b (d_x)
    vector<vector<double>> W = {
        {0.8, -0.3},
        {0.1, 0.5},
        {-0.4, 0.9}
    };
    vector<double> b = {0.0, 0.0, 0.0};

    // Decoder output variance (diagonal), here fixed
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    int K = 10;       // Monte Carlo samples
    double beta = 1;  // VAE default

    double elbo = estimate_elbo(x, mu, logvar, W, b, logvar_x, K, beta, rng);
    cout << fixed << setprecision(6);
    cout << "Estimated ELBO (K=" << K << ", beta=" << beta << ") = " << elbo << "\n";

    return 0;
}
This program computes a Monte Carlo estimate of the ELBO for a simple Gaussian VAE. The encoder q(z|x) is a diagonal Gaussian given by mu and logvar. The decoder p(x|z) is a linear Gaussian with mean f(z)=Wz+b and fixed diagonal variance. We sample K latent vectors using the reparameterization trick, average their reconstruction log-likelihoods, and subtract β times the closed-form KL to the standard normal prior.
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed = 7) : rng(seed) {}
    double std_normal() { return nd(rng); }
};

vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng) {
    size_t d = mu.size();
    vector<double> z(d);
    for (size_t i = 0; i < d; ++i) {
        double eps = rng.std_normal();
        double sigma = sqrt(exp(logvar[i]));
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    size_t d_x = b.size(), d_z = z.size();
    vector<double> y(d_x, 0.0);
    for (size_t i = 0; i < d_x; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d_z; ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

// log p(x|z) under Gaussian decoder with diagonal variance
double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double v = exp(logvar_x[i]);
        double d = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (d * d) / v);
    }
    return ll;
}

// log p(z) for standard normal
double log_pz_std_normal(const vector<double>& z) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (double zi : z) ll += -0.5 * (log2pi + zi * zi);
    return ll;
}

// log q(z|x) for diagonal Gaussian N(mu, diag(exp(logvar)))
double log_qz_diag(const vector<double>& z, const vector<double>& mu, const vector<double>& logvar) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < z.size(); ++i) {
        double v = exp(logvar[i]);
        double d = z[i] - mu[i];
        ll += -0.5 * (log2pi + logvar[i] + (d * d) / v);
    }
    return ll;
}

// Stable log-mean-exp: log( (1/K) * sum exp(a_k) )
double log_mean_exp(const vector<double>& a) {
    if (a.empty()) return -INFINITY;
    double m = *max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += exp(v - m);
    return log(s) + m - log((double)a.size());
}

// IWAE bound estimator using K importance samples
double estimate_iwae(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, RNG& rng) {
    vector<double> logw;
    logw.reserve(K);
    for (int k = 0; k < K; ++k) {
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        double lp_x_given_z = log_p_x_given_z(x, fz, logvar_x);
        double lpz = log_pz_std_normal(z);
        double lqz = log_qz_diag(z, mu, logvar);
        logw.push_back(lp_x_given_z + lpz - lqz);
    }
    return log_mean_exp(logw);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    RNG rng(2024);

    // Dimensions and parameters
    size_t d_z = 2, d_x = 3;
    vector<double> x = {0.5, -1.0, 1.2};
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W = {{0.8, -0.3}, {0.1, 0.5}, {-0.4, 0.9}};
    vector<double> b = {0.0, 0.0, 0.0};
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    cout << fixed << setprecision(6);
    for (int K : {1, 5, 50, 200}) {
        double iwae = estimate_iwae(x, mu, logvar, W, b, logvar_x, K, rng);
        cout << "IWAE bound (K=" << K << ") = " << iwae << "\n";
    }
    return 0;
}
This program computes the IWAE bound using K importance samples. It evaluates log p(x, z) − log q(z|x) for each sample and aggregates via a numerically stable log-mean-exp. As K increases, the bound typically tightens, better approximating log p(x).
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned s = 11) : rng(s) {}
    double n() { return nd(rng); }
};

double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar) {
    double kl = 0.0;
    for (size_t i = 0; i < mu.size(); ++i) {
        double v = exp(logvar[i]);
        kl += mu[i] * mu[i] + v - logvar[i] - 1.0;
    }
    return 0.5 * kl;
}

vector<double> sample_q(const vector<double>& mu, const vector<double>& logvar, RNG& r) {
    vector<double> z(mu.size());
    for (size_t i = 0; i < mu.size(); ++i) z[i] = mu[i] + sqrt(exp(logvar[i])) * r.n();
    return z;
}

vector<double> lin_dec(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    vector<double> y(b.size(), 0.0);
    for (size_t i = 0; i < b.size(); ++i) {
        double s = b[i];
        for (size_t j = 0; j < z.size(); ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double v = exp(logvar_x[i]);
        double d = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (d * d) / v);
    }
    return ll;
}

pair<double, double> elbo_terms(const vector<double>& x, const vector<double>& mu, const vector<double>& logvar,
                                const vector<vector<double>>& W, const vector<double>& b, const vector<double>& logvar_x,
                                int K, RNG& rng) {
    double recon = 0.0;
    for (int k = 0; k < K; ++k) {
        auto z = sample_q(mu, logvar, rng);
        auto fz = lin_dec(W, b, z);
        recon += log_p_x_given_z(x, fz, logvar_x);
    }
    recon /= double(K);
    double kl = kl_to_standard_normal(mu, logvar);
    return {recon, kl};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    RNG rng(99);

    // Example setup
    vector<double> x = {0.5, -1.0, 1.2};
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W = {{0.8, -0.3}, {0.1, 0.5}, {-0.4, 0.9}};
    vector<double> b = {0.0, 0.0, 0.0};
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    cout << fixed << setprecision(6);
    vector<double> betas = {0.0, 0.5, 1.0, 2.0, 4.0};
    int K = 20;
    auto [recon, kl] = elbo_terms(x, mu, logvar, W, b, logvar_x, K, rng);
    cout << "Reconstruction (avg) = " << recon << ", KL = " << kl << "\n";
    for (double beta : betas) {
        double obj = recon - beta * kl;
        cout << "beta=" << beta << ": objective = " << obj << "\n";
    }
    return 0;
}
This program separates the reconstruction and KL terms and evaluates the β-VAE objective for several β values. It illustrates the trade-off: increasing β downweights reconstruction quality to encourage simpler latents (larger KL penalty).