šŸ“š Theory Ā· Intermediate

ELBO (Evidence Lower Bound)

Key Points

  • The Evidence Lower Bound (ELBO) is a tractable lower bound on the log evidence log p(x) that enables learning and inference in latent variable models like VAEs.
  • ELBO splits into a reconstruction term that encourages accurate data modeling and a KL divergence term that regularizes the latent code toward the prior.
  • You can derive ELBO with Jensen’s inequality by introducing an auxiliary distribution q(z|x).
  • The gap between log p(x) and ELBO equals KL(q(z|x) || p(z|x)), so the bound is tight when q equals the true posterior.
  • β-VAE scales the KL term by a factor β to trade off reconstruction fidelity against disentangled and simple latent representations.
  • IWAE (Importance Weighted Autoencoder) uses multiple importance samples to form a tighter bound that approaches log p(x) as K increases.
  • In practice, ELBO is estimated by Monte Carlo and optimized with the reparameterization trick to obtain low-variance gradients.
  • Numerical stability (e.g., log-sum-exp) and correct KL direction are essential to implementing ELBO and IWAE safely.

Prerequisites

  • Probability distributions and density functions — Understanding p(x), p(z), and joint/conditional distributions is essential for defining ELBO.
  • Expectation and Monte Carlo estimation — ELBO involves expectations under q(z|x), typically estimated with samples.
  • Kullback–Leibler (KL) divergence — ELBO includes a KL term; knowing properties like nonnegativity and asymmetry is crucial.
  • Jensen’s inequality — The ELBO derivation applies Jensen’s inequality to the log of an expectation.
  • Gaussian distributions (diagonal covariance) — Common variational families and decoders are diagonal Gaussians, enabling closed-form KL and log-likelihoods.
  • Latent variable models — ELBO is used to learn/infer hidden variables that explain observed data.
  • Reparameterization trick — Enables low-variance gradient estimates for stochastic nodes like z ~ q(z|x).
  • Numerical stability (log-sum-exp) — Required to compute IWAE and log-averages without overflow/underflow.
  • Linear algebra basics — Even simple decoders are linear maps requiring matrix–vector multiplications.
  • Stochastic gradient optimization — In practice, ELBO is maximized with optimizers like SGD/Adam over mini-batches.

Detailed Explanation


01 Overview

The Evidence Lower Bound (ELBO) is a central tool in variational inference for probabilistic models with latent variables. Many generative models, such as Variational Autoencoders (VAEs), posit a hidden variable z that explains how observed data x is generated. Directly maximizing the data likelihood log p(x) is typically intractable because it involves integrating over all possible latent configurations z. ELBO addresses this by introducing a simpler, learnable distribution q(z|x) that approximates the true posterior p(z|x). Through Jensen’s inequality, ELBO forms a computable lower bound on log p(x).

ELBO decomposes into two interpretable parts: a reconstruction term that rewards how well the model can generate the observed data from sampled latent codes, and a Kullback–Leibler (KL) divergence term that pulls the approximate posterior q(z|x) toward the prior p(z). This balance encourages both faithful reconstructions and a well-structured latent space that generalizes. In VAEs, both the encoder (q) and decoder (p(x|z)) are parameterized by neural networks and optimized jointly by maximizing ELBO over a dataset.

Beyond VAEs, ELBO formalizes approximate Bayesian inference in a wide range of models and is the backbone of scalable learning with mini-batches and stochastic gradients. Variants like β-VAE adjust the strength of the KL regularization to encourage disentanglement, while IWAE tightens the bound by using multiple importance samples. Together, these methods enable practical training of powerful generative models on large, complex datasets.

02 Intuition & Analogies

Imagine compressing photos into short codes so that a friend can reconstruct the original pictures using only those codes and a shared dictionary. You want two things: (1) the reconstructions should look like the originals (good compression quality), and (2) the codes should follow a simple, consistent format so your friend’s decoder can understand them (organized latent space). ELBO measures exactly this trade-off: the reconstruction term scores how well the picture is rebuilt from the code, while the KL term penalizes codes that stray from a standardized template (the prior).

Another analogy: suppose you’re telling a story (generating x) based on a secret outline (latent z). To judge whether your storytelling process is good, you could try to account for all possible outlines, but that’s overwhelming. Instead, you keep a helper’s guess of the outline given the story (q(z|x)). With that helper in place, you can score two things: how consistent the story would be if it followed the guessed outline (reconstruction), and how plausible the guessed outline is compared to your prior knowledge about outlines (KL regularization). If your helper’s guesses match the true underlying outline distribution, your lower bound becomes exact.

Finally, think of ELBO as a budget ledger. Income comes from how well the generated data matches reality (reconstruction rewards), and expenses come from how complex or unusual your latent codes are compared to a simple baseline (KL costs). Tuning β in β-VAE is like deciding how frugal to be: a higher β enforces stricter budgeting (simpler, more disentangled codes) at the cost of some reconstruction fidelity. Taking multiple estimates as in IWAE is like polling more judges to get a more accurate score; the more judges (samples) you use, the closer your evaluation approaches the true log-likelihood.

03 Formal Definition

Consider a latent variable model p(x, z) = p(x|z) p(z). Using Jensen’s inequality on the identity log p(x) = log ∫ q(z|x) [p(x, z) / q(z|x)] dz yields the Evidence Lower Bound (ELBO): ELBO(x) = E_{q(z|x)}[log p(x, z) āˆ’ log q(z|x)] = E_{q(z|x)}[log p(x|z)] āˆ’ KL(q(z|x) || p(z)). The gap satisfies log p(x) āˆ’ ELBO(x) = KL(q(z|x) || p(z|x)) ≄ 0, so the bound is tight exactly when q(z|x) equals the true posterior p(z|x). Common variants include β-VAE, which scales the KL term by β to give E_{q(z|x)}[log p(x|z)] āˆ’ β KL(q(z|x) || p(z)), and IWAE, which tightens the bound using multiple importance samples.
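
Written out step by step, the derivation reads as follows (standard notation; no particular variational family is assumed):

log p(x) = log ∫ p(x, z) dz
         = log E_{q(z|x)}[ p(x, z) / q(z|x) ]
         ≄ E_{q(z|x)}[ log p(x, z) āˆ’ log q(z|x) ]            (Jensen’s inequality)
         = E_{q(z|x)}[ log p(x|z) ] āˆ’ KL(q(z|x) || p(z))     (reconstruction āˆ’ regularization)
         = ELBO(x)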

04 When to Use

Use ELBO whenever you need approximate inference or learning in latent variable models where the exact posterior p(z|x) is intractable. Common scenarios include training Variational Autoencoders for images, text, audio, and other high-dimensional data. ELBO is also effective for semi-supervised learning, where latent variables capture class or style information while labels may be scarce. In missing data problems, ELBO-based methods can impute missing values by reasoning over latent structures. ELBO is appropriate when you can (1) choose a tractable variational family q(z|x) (e.g., diagonal Gaussians), (2) compute or approximate log p(x|z) efficiently (via a decoder), and (3) leverage stochastic optimization over large datasets. Choose β-VAE when you want more structured, potentially disentangled representations by placing greater emphasis on matching the prior. Choose IWAE when you need a tighter bound to better correlate the objective with true log-likelihood, particularly in models where simple q(z|x) underestimates posterior complexity. ELBO is also foundational for amortized inference: the encoder network shares parameters across data points, making inference fast at test time.

āš ļøCommon Mistakes

  • Confusing the direction of KL: ELBO uses KL(q(z|x) || p(z)), not KL(p(z) || q(z|x)). Using the wrong direction changes optimization behavior and invalidates the bound.
  • Forgetting that ELBO ≤ log p(x): Maximizing ELBO does not guarantee maximum likelihood unless q equals the true posterior. Interpreting ELBO as exact likelihood leads to overconfidence.
  • Numerical instability: Computing log averages without log-sum-exp causes underflow/overflow, especially in IWAE. Always use the log-sum-exp trick for stable importance-weight aggregation.
  • Too few samples: Using K = 1 for noisy decoders can yield high-variance estimates. Increase K moderately or use variance reduction (e.g., control variates) when feasible.
  • Mismatch of decoder likelihood: Using a Gaussian decoder for intrinsically binary data (or vice versa) mis-specifies the model and skews the objective. Pick p(x|z) consistent with data support.
  • Posterior collapse: Overly strong decoders (or large β) can drive q(z|x) toward the prior, diminishing latent usage. Mitigate via KL annealing, free bits, or capacity constraints (a minimal sketch of these mitigations appears after this list).
  • Incorrect scaling in minibatches: When using minibatches, ensure proper scaling so that the KL and reconstruction terms reflect the dataset size (or use per-sample objectives consistently).
  • Misinterpreting β-VAE: β > 1 is not ā€œbetterā€ by default. It trades reconstruction accuracy for simpler latents; tune β based on downstream goals (e.g., disentanglement vs. fidelity).
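
To make the posterior-collapse mitigations above concrete, here is a minimal C++ sketch of free bits and KL annealing. The helper names (kl_free_bits, kl_anneal_weight), the threshold lambda, and the warm-up schedule are illustrative assumptions, not part of any specific library or of the examples below.

#include <bits/stdc++.h>
using namespace std;

// Free bits: clamp each latent dimension's KL contribution from below at `lambda` nats,
// so the optimizer gains nothing by collapsing a dimension's KL all the way to zero.
double kl_free_bits(const vector<double>& kl_per_dim, double lambda) {
    double total = 0.0;
    for (double kl_i : kl_per_dim) total += max(kl_i, lambda);
    return total;
}

// KL annealing: ramp the KL weight linearly from 0 to 1 over the first `warmup_steps` steps.
double kl_anneal_weight(int step, int warmup_steps) {
    if (warmup_steps <= 0) return 1.0;
    return min(1.0, static_cast<double>(step) / warmup_steps);
}

int main() {
    vector<double> kl_per_dim = {0.01, 0.8, 0.002, 1.5}; // illustrative per-dimension KL values
    cout << "Free-bits KL (lambda=0.1) = " << kl_free_bits(kl_per_dim, 0.1) << "\n";
    cout << "Anneal weight at step 500 of 2000 = " << kl_anneal_weight(500, 2000) << "\n";
    return 0;
}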

Key Formulas

ELBO (standard form)
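
ELBO(x) = E_{q(z|x)}[log p(x|z)] āˆ’ KL(q(z|x) || p(z))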

Explanation: This splits the bound into a reconstruction term and a regularization term. Maximizing it balances data fit with latent simplicity.

Jensen’s inequality derivation
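
log p(x) = log E_{q(z|x)}[p(x, z) / q(z|x)] ≄ E_{q(z|x)}[log p(x, z) āˆ’ log q(z|x)] = ELBO(x)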

Explanation: Applying Jensen’s inequality to the concave log function yields a computable lower bound on log evidence using any q(z|x).

ELBO decomposition identity
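
log p(x) = ELBO(x) + KL(q(z|x) || p(z|x))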

Explanation: The difference between the true log evidence and ELBO is a KL divergence, which is always nonnegative; the bound is tight when q equals the true posterior.

KL to standard normal (diagonal)
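
KL(N(μ, diag(σ²)) || N(0, I)) = ½ Ī£_{i=1}^{d} (μ_i² + σ_i² āˆ’ log σ_i² āˆ’ 1)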

Explanation: Closed-form KL between a diagonal Gaussian and a standard normal. Useful for VAE objectives with Gaussian q and prior.

β-VAE objective
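
L_β(x) = E_{q(z|x)}[log p(x|z)] āˆ’ β Ā· KL(q(z|x) || p(z))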

Explanation: Scales the KL term by β to encourage disentanglement and simpler latents, at the cost of reconstruction fidelity when β > 1.

IWAE bound
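
L_K(x) = E_{z_1,…,z_K ~ q(z|x)}[ log( (1/K) Ī£_{k=1}^{K} p(x, z_k) / q(z_k|x) ) ]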

Explanation: A tighter lower bound using K importance samples. It approaches log p(x) as K increases, but at higher computational cost.

Reparameterization trick
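
z = μ(x) + σ(x) āŠ™ ε,  where ε ~ N(0, I)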

Explanation: Samples from q are expressed as deterministic functions of parameters and noise, enabling gradient backpropagation through Monte Carlo samples.

Monte Carlo estimator
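
E_{q(z|x)}[f(z)] ā‰ˆ (1/K) Ī£_{k=1}^{K} f(z_k),  z_k ~ q(z|x)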

Explanation: Approximates expectations with sample averages; variance decreases as K increases, improving estimate quality.

Log-sum-exp stabilization
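
log( (1/K) Ī£_{k=1}^{K} exp(a_k) ) = m + log( (1/K) Ī£_{k=1}^{K} exp(a_k āˆ’ m) ),  where m = max_k a_k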

Explanation: Computes the log of averages stably by factoring out the maximum term to avoid overflow or underflow.

Gaussian decoder log-likelihood
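
log p(x|z) = āˆ’Ā½ Ī£_{i=1}^{d_x} [ log(2π) + log σ_i² + (x_i āˆ’ f_i(z))² / σ_i² ]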

Explanation: When p(x|z) is Gaussian with mean f(z) and diagonal variance, the log-likelihood decomposes into a constant, a variance term, and a scaled squared error.

Complexity Analysis

Computing ELBO or IWAE involves two main costs: sampling from the approximate posterior q(z|x), which is O(d_z) per sample for a diagonal Gaussian, and evaluating the decoder likelihood p(x|z), which is O(d_xĀ·d_z) for a linear decoder (or one forward pass for a neural decoder). The KL term with a standard normal prior is closed-form and costs O(d_z). Thus, a single-sample ELBO is typically O(d_z + d_xĀ·d_z), while a K-sample ELBO scales to O(KĀ·d_z + KĀ·d_xĀ·d_z).

Memory usage includes storing parameters (decoder weights), temporary latent samples (O(KĀ·d_z)), and possibly intermediate decoder activations. For diagonal Gaussians, space is O(d_z + model parameters) for single-sample ELBO, with O(KĀ·d_z) additional space for multiple samples. IWAE adds the need to compute importance weights and apply a numerically stable log-sum-exp across K samples. This introduces O(K) extra operations and O(K) extra memory for storing log-weights, but the dominant cost remains K decoder evaluations and K latent samples. As K increases, the bound tightens but computation grows linearly in K.

In mini-batch training with batch size B, the per-step cost scales linearly with B, yielding O(BĀ·KĀ·d_xĀ·d_z) time and O(BĀ·KĀ·d_z) space for storing samples and weights if computed jointly. Practical implementations amortize overheads (e.g., vectorized sampling) and rely on stable numerics to maintain accuracy. Overall, ELBO and IWAE are linear in the number of samples and batch size, and linear or superlinear in model size depending on decoder architecture.
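
As a rough illustration with assumed sizes (these numbers are hypothetical, not from the text): with B = 128, K = 5, d_z = 32, and d_x = 784, a linear decoder costs about BĀ·KĀ·(d_z + d_xĀ·d_z) = 128 Ā· 5 Ā· (32 + 25,088) ā‰ˆ 1.6 Ɨ 10^7 multiply–adds per training step, while IWAE’s extra bookkeeping is only BĀ·K = 640 stored log-weights.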

Code Examples

ELBO for a Gaussian VAE with diagonal q(z|x) and linear Gaussian decoder
#include <bits/stdc++.h>
using namespace std;

// Utilities for diagonal Gaussians and ELBO estimation
struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed=42) : rng(seed) {}
    // Standard normal sample
    double std_normal() { return nd(rng); }
};

// Compute KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar) {
    size_t d = mu.size();
    double kl = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double m2 = mu[i] * mu[i];
        double v = exp(logvar[i]); // sigma^2
        kl += m2 + v - logvar[i] - 1.0;
    }
    return 0.5 * kl;
}

// Sample from q(z|x) = N(mu, diag(exp(logvar))) using reparameterization
vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng) {
    size_t d = mu.size();
    vector<double> z(d);
    for (size_t i = 0; i < d; ++i) {
        double eps = rng.std_normal();
        double sigma = sqrt(exp(logvar[i]));
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

// Linear decoder: f(z) = W z + b, with W (d_x x d_z), b (d_x)
vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    size_t d_x = b.size();
    size_t d_z = z.size();
    vector<double> y(d_x, 0.0);
    for (size_t i = 0; i < d_x; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d_z; ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

// Log-likelihood under Gaussian decoder: p(x|z) = N(f(z), diag(exp(logvar_x)))
double log_likelihood_x_given_z(const vector<double>& x,
                                const vector<double>& fz,
                                const vector<double>& logvar_x) {
    size_t d_x = x.size();
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < d_x; ++i) {
        double v = exp(logvar_x[i]);
        double diff = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (diff * diff) / v);
    }
    return ll;
}

// Monte Carlo ELBO estimator with K samples and beta scaling
double estimate_elbo(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, double beta, RNG& rng) {
    double recon_sum = 0.0;
    for (int k = 0; k < K; ++k) {
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        recon_sum += log_likelihood_x_given_z(x, fz, logvar_x);
    }
    double recon_avg = recon_sum / static_cast<double>(K);
    double kl = kl_to_standard_normal(mu, logvar);
    return recon_avg - beta * kl;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // Dimensions
    const size_t d_z = 2; // latent
    const size_t d_x = 3; // observed

    // Example data point x
    vector<double> x = {0.5, -1.0, 1.2};

    // Example encoder outputs for this x: mu(x), logvar(x)
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)}; // log variances

    // Linear decoder parameters W (d_x x d_z), b (d_x)
    vector<vector<double>> W = {
        {0.8, -0.3},
        {0.1, 0.5},
        {-0.4, 0.9}
    };
    vector<double> b = {0.0, 0.0, 0.0};

    // Decoder output variance (diagonal), here fixed
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    int K = 10;      // Monte Carlo samples
    double beta = 1; // VAE default

    double elbo = estimate_elbo(x, mu, logvar, W, b, logvar_x, K, beta, rng);
    cout << fixed << setprecision(6);
    cout << "Estimated ELBO (K=" << K << ", beta=" << beta << ") = " << elbo << "\n";

    return 0;
}

This program computes a Monte Carlo estimate of the ELBO for a simple Gaussian VAE. The encoder q(z|x) is a diagonal Gaussian given by mu and logvar. The decoder p(x|z) is a linear Gaussian with mean f(z)=Wz+b and fixed diagonal variance. We sample K latent vectors using the reparameterization trick, average their reconstruction log-likelihoods, and subtract β times the closed-form KL to the standard normal prior.

Time: O(KĀ·(d_z + d_xĀ·d_z)) for sampling and decoder evaluation; KL costs O(d_z).
Space: O(d_xĀ·d_z) for parameters plus O(d_z) for temporaries; O(KĀ·d_z) if you store all samples.
IWAE bound with numerically stable log-sum-exp aggregation
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed=7): rng(seed) {}
    double std_normal() { return nd(rng); }
};

vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng){
    size_t d = mu.size(); vector<double> z(d);
    for(size_t i=0;i<d;++i){ double eps=rng.std_normal(); double sigma=sqrt(exp(logvar[i])); z[i]=mu[i]+sigma*eps; }
    return z;
}

vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z){
    size_t d_x=b.size(), d_z=z.size(); vector<double> y(d_x,0.0);
    for(size_t i=0;i<d_x;++i){ double s=b[i]; for(size_t j=0;j<d_z;++j) s+=W[i][j]*z[j]; y[i]=s; }
    return y;
}

// log p(x|z) under Gaussian decoder with diagonal variance
double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x){
    const double log2pi = log(2.0*M_PI); double ll=0.0;
    for(size_t i=0;i<x.size();++i){ double v=exp(logvar_x[i]); double d=x[i]-fz[i]; ll += -0.5*(log2pi + logvar_x[i] + (d*d)/v); }
    return ll;
}

// log p(z) for standard normal
double log_pz_std_normal(const vector<double>& z){
    const double log2pi = log(2.0*M_PI); double ll=0.0;
    for(double zi: z) ll += -0.5*(log2pi + zi*zi);
    return ll;
}

// log q(z|x) for diagonal Gaussian N(mu, diag(exp(logvar)))
double log_qz_diag(const vector<double>& z, const vector<double>& mu, const vector<double>& logvar){
    const double log2pi = log(2.0*M_PI); double ll=0.0;
    for(size_t i=0;i<z.size();++i){ double v=exp(logvar[i]); double d=z[i]-mu[i]; ll += -0.5*(log2pi + logvar[i] + (d*d)/v); }
    return ll;
}

// Stable log-mean-exp: log( (1/K) * sum exp(a_k) )
double log_mean_exp(const vector<double>& a){
    if(a.empty()) return -INFINITY;
    double m=*max_element(a.begin(), a.end());
    double s=0.0; for(double v: a) s += exp(v - m);
    return log(s) + m - log((double)a.size());
}

// IWAE bound estimator using K importance samples
double estimate_iwae(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, RNG& rng){
    vector<double> logw; logw.reserve(K);
    for(int k=0;k<K;++k){
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        double lp_x_given_z = log_p_x_given_z(x, fz, logvar_x);
        double lpz = log_pz_std_normal(z);
        double lqz = log_qz_diag(z, mu, logvar);
        logw.push_back(lp_x_given_z + lpz - lqz);
    }
    return log_mean_exp(logw);
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    RNG rng(2024);

    // Dimensions and parameters
    size_t d_z=2, d_x=3;
    vector<double> x = {0.5, -1.0, 1.2};
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W = {{0.8,-0.3},{0.1,0.5},{-0.4,0.9}};
    vector<double> b = {0.0,0.0,0.0};
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    cout << fixed << setprecision(6);
    for(int K: {1,5,50,200}){
        double iwae = estimate_iwae(x, mu, logvar, W, b, logvar_x, K, rng);
        cout << "IWAE bound (K=" << K << ") = " << iwae << "\n";
    }
    return 0;
}

This program computes the IWAE bound using K importance samples. It evaluates log p(x,z) āˆ’ log q(z|x) for each sample and aggregates via a numerically stable log-mean-exp. As K increases, the bound typically tightens, better approximating log p(x).

Time: O(KĀ·(d_z + d_xĀ·d_z)), dominated by K decoder evaluations and sampling.
Space: O(d_xĀ·d_z) for parameters and O(K) to store log-weights (plus O(d_z) for temporaries).
β-VAE: Inspecting the reconstruction–regularization trade-off
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned s=11): rng(s) {}
    double n(){ return nd(rng); }
};

double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar){
    double kl=0.0;
    for(size_t i=0;i<mu.size();++i){ double v=exp(logvar[i]); kl += mu[i]*mu[i] + v - logvar[i] - 1.0; }
    return 0.5*kl;
}

vector<double> sample_q(const vector<double>& mu, const vector<double>& logvar, RNG& r){
    vector<double> z(mu.size());
    for(size_t i=0;i<mu.size();++i){ z[i]=mu[i]+sqrt(exp(logvar[i]))*r.n(); }
    return z;
}

vector<double> lin_dec(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z){
    vector<double> y(b.size(),0.0);
    for(size_t i=0;i<b.size();++i){ double s=b[i]; for(size_t j=0;j<z.size();++j) s+=W[i][j]*z[j]; y[i]=s; }
    return y;
}

double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x){
    const double log2pi=log(2.0*M_PI); double ll=0.0;
    for(size_t i=0;i<x.size();++i){ double v=exp(logvar_x[i]); double d=x[i]-fz[i]; ll += -0.5*(log2pi + logvar_x[i] + (d*d)/v); }
    return ll;
}

// Returns {average reconstruction log-likelihood, KL to the standard normal prior}
pair<double,double> elbo_terms(const vector<double>& x, const vector<double>& mu, const vector<double>& logvar,
                               const vector<vector<double>>& W, const vector<double>& b, const vector<double>& logvar_x,
                               int K, RNG& rng){
    double recon=0.0;
    for(int k=0;k<K;++k){ auto z=sample_q(mu,logvar,rng); auto fz=lin_dec(W,b,z); recon += log_p_x_given_z(x,fz,logvar_x); }
    recon /= double(K);
    double kl=kl_to_standard_normal(mu,logvar);
    return {recon, kl};
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    RNG rng(99);

    // Example setup
    vector<double> x = {0.5,-1.0,1.2};
    vector<double> mu = {0.2,-0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W={{0.8,-0.3},{0.1,0.5},{-0.4,0.9}};
    vector<double> b={0.0,0.0,0.0};
    vector<double> logvar_x={log(0.1),log(0.1),log(0.2)};

    cout<<fixed<<setprecision(6);
    vector<double> betas = {0.0, 0.5, 1.0, 2.0, 4.0};
    int K=20;
    auto [recon, kl] = elbo_terms(x,mu,logvar,W,b,logvar_x,K,rng);
    cout << "Reconstruction (avg) = " << recon << ", KL = " << kl << "\n";
    for(double beta: betas){
        double obj = recon - beta*kl;
        cout << "beta=" << beta << ": objective = " << obj << "\n";
    }
    return 0;
}

This program separates the reconstruction and KL terms and evaluates the β-VAE objective for several β values. It illustrates the trade-off: increasing β downweights reconstruction quality to encourage simpler latents (larger KL penalty).

Time: O(K·(d_z + d_x·d_z)) to estimate the reconstruction term once; evaluating multiple β values is O(1) each afterward.Space: O(d_x·d_z) for parameters and O(d_z) for temporaries.