ELBO (Evidence Lower Bound)
Key Points
- The Evidence Lower Bound (ELBO) is a tractable lower bound on the log evidence log p(x) that enables learning and inference in latent variable models like VAEs.
- ELBO splits into a reconstruction term that encourages accurate data modeling and a KL divergence term that regularizes the latent code toward the prior.
- You can derive ELBO with Jensen's inequality by introducing an auxiliary distribution q(z|x).
- The gap between log p(x) and ELBO equals KL(q(z|x) || p(z|x)), so the bound is tight when q equals the true posterior.
- β-VAE scales the KL term by a factor β to trade off reconstruction fidelity against disentangled and simple latent representations.
- IWAE (Importance Weighted Autoencoder) uses multiple importance samples to form a tighter bound that approaches log p(x) as K increases.
- In practice, ELBO is estimated by Monte Carlo and optimized with the reparameterization trick to obtain low-variance gradients.
- Numerical stability (e.g., log-sum-exp) and correct KL direction are essential to implementing ELBO and IWAE safely.
Prerequisites
- Probability distributions and density functions – Understanding p(x), p(z), and joint/conditional distributions is essential for defining ELBO.
- Expectation and Monte Carlo estimation – ELBO involves expectations under q(z|x), typically estimated with samples.
- Kullback–Leibler (KL) divergence – ELBO includes a KL term; knowing properties like nonnegativity and asymmetry is crucial.
- Jensen's inequality – The ELBO derivation applies Jensen's inequality to the log of an expectation.
- Gaussian distributions (diagonal covariance) – Common variational families and decoders are diagonal Gaussians, enabling closed-form KL and log-likelihoods.
- Latent variable models – ELBO is used to learn/infer hidden variables that explain observed data.
- Reparameterization trick – Enables low-variance gradient estimates for stochastic nodes like z ~ q(z|x).
- Numerical stability (log-sum-exp) – Required to compute IWAE and log-averages without overflow/underflow.
- Linear algebra basics – Even simple decoders are linear maps requiring matrix–vector multiplications.
- Stochastic gradient optimization – In practice, ELBO is maximized with optimizers like SGD/Adam over mini-batches.
Detailed Explanation
01 Overview
The Evidence Lower Bound (ELBO) is a central tool in variational inference for probabilistic models with latent variables. Many generative models, such as Variational Autoencoders (VAEs), posit a hidden variable z that explains how observed data x is generated. Directly maximizing the data likelihood log p(x) is typically intractable because it involves integrating over all possible latent configurations z. ELBO addresses this by introducing a simpler, learnable distribution q(z|x) that approximates the true posterior p(z|x). Through Jensen's inequality, ELBO forms a computable lower bound on log p(x). ELBO decomposes into two interpretable parts: a reconstruction term that rewards how well the model can generate the observed data from sampled latent codes, and a Kullback–Leibler (KL) divergence term that pulls the approximate posterior q(z|x) toward the prior p(z). This balance encourages both faithful reconstructions and a well-structured latent space that generalizes. In VAEs, both the encoder (q) and decoder (p(x|z)) are parameterized by neural networks and optimized jointly by maximizing ELBO over a dataset. Beyond VAEs, ELBO formalizes approximate Bayesian inference in a wide range of models and is the backbone of scalable learning with mini-batches and stochastic gradients. Variants like β-VAE adjust the strength of the KL regularization to encourage disentanglement, while IWAE tightens the bound by using multiple importance samples. Together, these methods enable practical training of powerful generative models on large, complex datasets.
02 Intuition & Analogies
Imagine compressing photos into short codes so that a friend can reconstruct the original pictures using only those codes and a shared dictionary. You want two things: (1) the reconstructions should look like the originals (good compression quality), and (2) the codes should follow a simple, consistent format so your friend's decoder can understand them (organized latent space). ELBO measures exactly this trade-off: the reconstruction term scores how well the picture is rebuilt from the code, while the KL term penalizes codes that stray from a standardized template (the prior). Another analogy: suppose you're telling a story (generating x) based on a secret outline (latent z). To judge whether your storytelling process is good, you could try to account for all possible outlines, but that's overwhelming. Instead, you keep a helper's guess of the outline given the story (q(z|x)). With that helper in place, you can score two things: how consistent the story would be if it followed the guessed outline (reconstruction), and how plausible the guessed outline is compared to your prior knowledge about outlines (KL regularization). If your helper's guesses match the true underlying outline distribution, your lower bound becomes exact. Finally, think of ELBO as a budget ledger. Income comes from how well the generated data matches reality (reconstruction rewards), and expenses come from how complex or unusual your latent codes are compared to a simple baseline (KL costs). Tuning β in β-VAE is like deciding how frugal to be: a higher β enforces stricter budgeting (simpler, more disentangled codes) at the cost of some reconstruction fidelity. Taking multiple estimates as in IWAE is like polling more judges to get a more accurate score; the more judges (samples) you use, the closer your evaluation approaches the true log-likelihood.
03 Formal Definition
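Let p(x, z) = p(x|z) p(z) be the latent variable model and q(z|x) a variational distribution over the latents. Writing \mathcal{L}(x) for the bound (a notational choice used here, consistent with the Key Formulas section below), the ELBO is

\mathcal{L}(x) := \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x, z)}{q(z|x)}\right] = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big) \;\le\; \log p(x),

with equality exactly when q(z|x) = p(z|x).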
04 When to Use
Use ELBO whenever you need approximate inference or learning in latent variable models where the exact posterior p(z|x) is intractable. Common scenarios include training Variational Autoencoders for images, text, audio, and other high-dimensional data. ELBO is also effective for semi-supervised learning, where latent variables capture class or style information while labels may be scarce. In missing data problems, ELBO-based methods can impute missing values by reasoning over latent structures. ELBO is appropriate when you can (1) choose a tractable variational family q(z|x) (e.g., diagonal Gaussians), (2) compute or approximate log p(x|z) efficiently (via a decoder), and (3) leverage stochastic optimization over large datasets. Choose β-VAE when you want more structured, potentially disentangled representations by placing greater emphasis on matching the prior. Choose IWAE when you need a tighter bound to better correlate the objective with true log-likelihood, particularly in models where simple q(z|x) underestimates posterior complexity. ELBO is also foundational for amortized inference: the encoder network shares parameters across data points, making inference fast at test time.
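To make the amortized-inference point concrete, here is a minimal sketch (not part of the code examples below) of a linear encoder that maps x to the variational parameters (mu, logvar) of q(z|x) with one set of weights shared across all data points; the matrices W_mu and W_lv, the biases, and all numeric values are illustrative assumptions.

#include <bits/stdc++.h>
using namespace std;

// Amortized encoder sketch: a single linear map, shared across all data points,
// produces the variational parameters (mu, logvar) of q(z|x) from x.
// W_mu, W_lv are d_z x d_x; b_mu, b_lv are d_z. All values are illustrative.
struct LinearEncoder {
    vector<vector<double>> W_mu, W_lv;
    vector<double> b_mu, b_lv;

    pair<vector<double>, vector<double>> encode(const vector<double>& x) const {
        size_t d_z = b_mu.size(), d_x = x.size();
        vector<double> mu(d_z, 0.0), logvar(d_z, 0.0);
        for (size_t i = 0; i < d_z; ++i) {
            double m = b_mu[i], lv = b_lv[i];
            for (size_t j = 0; j < d_x; ++j) { m += W_mu[i][j] * x[j]; lv += W_lv[i][j] * x[j]; }
            mu[i] = m;
            logvar[i] = lv;
        }
        return {mu, logvar};
    }
};

int main() {
    // d_x = 3, d_z = 2, arbitrary parameter values for illustration.
    LinearEncoder enc;
    enc.W_mu = {{0.3, -0.1, 0.2}, {0.0, 0.4, -0.2}};
    enc.W_lv = {{-0.1, 0.0, 0.1}, {0.2, -0.1, 0.0}};
    enc.b_mu = {0.0, 0.0};
    enc.b_lv = {log(0.5), log(1.0)};

    vector<double> x = {0.5, -1.0, 1.2};
    auto [mu, logvar] = enc.encode(x);

    cout << fixed << setprecision(4);
    cout << "mu = [" << mu[0] << ", " << mu[1] << "]\n";
    cout << "logvar = [" << logvar[0] << ", " << logvar[1] << "]\n";
    return 0;
}

Its outputs play the role of the hard-coded mu and logvar ("example encoder outputs") used in the full code examples later on this page.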
⚠️ Common Mistakes
- Confusing the direction of KL: ELBO uses KL(q(z|x) || p(z)), not KL(p(z) || q(z|x)). Using the wrong direction changes optimization behavior and invalidates the bound.
- Forgetting that ELBO ≤ log p(x): Maximizing ELBO does not guarantee maximum likelihood unless q equals the true posterior. Interpreting ELBO as exact likelihood leads to overconfidence.
- Numerical instability: Computing log averages without log-sum-exp causes underflow/overflow, especially in IWAE. Always use the log-sum-exp trick for stable importance-weight aggregation.
- Too few samples: Using K = 1 for noisy decoders can yield high-variance estimates. Increase K moderately or use variance reduction (e.g., control variates) when feasible.
- Mismatch of decoder likelihood: Using a Gaussian decoder for intrinsically binary data (or vice versa) mis-specifies the model and skews the objective. Pick p(x|z) consistent with data support.
- Posterior collapse: Overly strong decoders (or large β) can drive q(z|x) toward the prior, diminishing latent usage. Mitigate via KL annealing, free bits, or capacity constraints (see the sketch after this list).
- Incorrect scaling in minibatches: When using minibatches, ensure proper scaling so that the KL and reconstruction terms reflect the dataset size (or use per-sample objectives consistently).
- Misinterpreting β-VAE: β > 1 is not "better" by default. It trades reconstruction accuracy for simpler latents; tune β based on downstream goals (e.g., disentanglement vs. fidelity).
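As referenced in the posterior-collapse item above, here is a minimal sketch of two common mitigations, KL annealing and free bits; the warm-up length (1000 steps) and free-bits threshold (0.1 nats) are illustrative assumptions, not recommended settings.

#include <bits/stdc++.h>
using namespace std;

// Linear KL warm-up: beta ramps from 0 to 1 over `warmup_steps` training steps.
double kl_anneal_beta(int step, int warmup_steps) {
    if (warmup_steps <= 0) return 1.0;
    return min(1.0, static_cast<double>(step) / warmup_steps);
}

// Free bits: clamp each latent dimension's KL contribution to at least `lambda`
// nats so the optimizer cannot collapse that dimension to zero usage.
double free_bits_kl(const vector<double>& kl_per_dim, double lambda) {
    double total = 0.0;
    for (double kl : kl_per_dim) total += max(kl, lambda);
    return total;
}

int main() {
    // Illustrative per-dimension KL values (e.g., from KL(q(z|x) || N(0, I))).
    vector<double> kl_per_dim = {0.01, 0.50, 1.30};

    cout << fixed << setprecision(4);
    for (int step : {0, 250, 500, 1000}) {
        double beta = kl_anneal_beta(step, 1000);   // warm-up over 1000 steps (assumed)
        double kl = free_bits_kl(kl_per_dim, 0.1);  // free-bits threshold 0.1 nats (assumed)
        cout << "step=" << step << "  beta=" << beta
             << "  KL(with free bits)=" << kl << "\n";
    }
    return 0;
}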
Key Formulas
ELBO (standard form)
\mathcal{L}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
Explanation: This splits the bound into a reconstruction term and a regularization term. Maximizing it balances data fit with latent simplicity.
Jensen's inequality derivation
\log p(x) = \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x, z)}{q(z|x)}\right] \ge \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x, z)}{q(z|x)}\right] = \mathcal{L}(x)
Explanation: Applying Jensen's inequality to the concave log function yields a computable lower bound on log evidence using any q(z|x).
ELBO decomposition identity
\log p(x) = \mathcal{L}(x) + \mathrm{KL}\big(q(z|x) \,\|\, p(z|x)\big)
Explanation: The difference between the true log evidence and ELBO is a KL divergence, which is always nonnegative; the bound is tight when q equals the true posterior. A numerical check of this identity follows the Key Formulas.
KL to standard normal (diagonal)
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{i=1}^{d} \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)
Explanation: Closed-form KL between a diagonal Gaussian and a standard normal. Useful for VAE objectives with Gaussian q and prior.
β-VAE objective
\mathcal{L}_{\beta}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \, \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
Explanation: Scales the KL term by β to encourage disentanglement and simpler latents, at the cost of reconstruction fidelity when β > 1.
IWAE bound
\mathcal{L}_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q(z|x)}\!\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k|x)}\right]
Explanation: A tighter lower bound using K importance samples. For K = 1 it reduces to the standard ELBO; it approaches log p(x) as K increases, but at higher computational cost.
Reparameterization trick
z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
Explanation: Samples from q are expressed as deterministic functions of parameters and noise, enabling gradient backpropagation through Monte Carlo samples.
Monte Carlo estimator
\mathbb{E}_{q(z|x)}[f(z)] \approx \frac{1}{K} \sum_{k=1}^{K} f(z_k), \quad z_k \sim q(z|x)
Explanation: Approximates expectations with sample averages; variance decreases as K increases, improving estimate quality.
Log-sum-exp stabilization
\log \frac{1}{K} \sum_{k=1}^{K} e^{a_k} = m + \log \frac{1}{K} \sum_{k=1}^{K} e^{a_k - m}, \quad m = \max_k a_k
Explanation: Computes the log of averages stably by factoring out the maximum term to avoid overflow or underflow.
Gaussian decoder log-likelihood
\log p(x|z) = -\frac{1}{2} \sum_{i=1}^{d_x} \left(\log 2\pi + \log \sigma_i^2 + \frac{(x_i - f_i(z))^2}{\sigma_i^2}\right)
Explanation: When p(x|z) is Gaussian with mean f(z) and diagonal variance, the log-likelihood decomposes into a constant, a variance term, and a scaled squared error.
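As a sanity check of the decomposition identity above, the following sketch uses a 1-D conjugate Gaussian model (prior z ~ N(0, 1), likelihood x|z ~ N(z, var_x)) in which log p(x), the ELBO, and the exact posterior are all available in closed form; the particular values of x, var_x, and the variational parameters m, s2 are arbitrary assumptions for illustration.

#include <bits/stdc++.h>
using namespace std;

int main() {
    const double log2pi = log(2.0 * M_PI);

    // Model: z ~ N(0,1), x | z ~ N(z, var_x). Then p(x) = N(0, 1 + var_x) and
    // the posterior p(z|x) = N(x / (1 + var_x), var_x / (1 + var_x)).
    double x = 0.7, var_x = 0.5;

    // Arbitrary variational distribution q(z|x) = N(m, s2).
    double m = 0.3, s2 = 0.4;

    // Closed-form ELBO = E_q[log p(x|z)] + E_q[log p(z)] - E_q[log q(z)].
    double e_log_px_given_z = -0.5 * (log2pi + log(var_x) + ((x - m) * (x - m) + s2) / var_x);
    double e_log_pz = -0.5 * (log2pi + m * m + s2);
    double e_log_qz = -0.5 * (log2pi + log(s2) + 1.0);
    double elbo = e_log_px_given_z + e_log_pz - e_log_qz;

    // Exact log evidence log p(x) under the marginal N(0, 1 + var_x).
    double log_px = -0.5 * (log2pi + log(1.0 + var_x) + x * x / (1.0 + var_x));

    // KL(q(z|x) || p(z|x)) between two univariate Gaussians.
    double post_mean = x / (1.0 + var_x);
    double post_var = var_x / (1.0 + var_x);
    double kl_q_post = 0.5 * (log(post_var / s2)
                              + (s2 + (m - post_mean) * (m - post_mean)) / post_var - 1.0);

    cout << fixed << setprecision(6);
    cout << "log p(x)           = " << log_px << "\n";
    cout << "ELBO               = " << elbo << "\n";
    cout << "ELBO + KL(q||post) = " << elbo + kl_q_post << "  (should equal log p(x))\n";
    return 0;
}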
Complexity Analysis
For the code examples below, one ELBO estimate with K Monte Carlo samples, latent dimension d_z, and observed dimension d_x costs O(K · d_x · d_z) time, dominated by the linear decoder, plus O(K · d_z) for reparameterized sampling and O(d_z) for the closed-form KL. The IWAE estimator has the same per-sample cost with an extra O(K) log-mean-exp aggregation; memory is O(d_x · d_z) for the decoder parameters plus O(K) for the log-weights.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utilities for diagonal Gaussians and ELBO estimation
struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed = 42) : rng(seed) {}
    // Standard normal sample
    double std_normal() { return nd(rng); }
};

// Compute KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar) {
    size_t d = mu.size();
    double kl = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double m2 = mu[i] * mu[i];
        double v = exp(logvar[i]);  // sigma^2
        kl += m2 + v - logvar[i] - 1.0;
    }
    return 0.5 * kl;
}

// Sample from q(z|x) = N(mu, diag(exp(logvar))) using reparameterization
vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng) {
    size_t d = mu.size();
    vector<double> z(d);
    for (size_t i = 0; i < d; ++i) {
        double eps = rng.std_normal();
        double sigma = sqrt(exp(logvar[i]));
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

// Linear decoder: f(z) = W z + b, with W (d_x x d_z), b (d_x)
vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    size_t d_x = b.size();
    size_t d_z = z.size();
    vector<double> y(d_x, 0.0);
    for (size_t i = 0; i < d_x; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d_z; ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

// Log-likelihood under Gaussian decoder: p(x|z) = N(f(z), diag(exp(logvar_x)))
double log_likelihood_x_given_z(const vector<double>& x,
                                const vector<double>& fz,
                                const vector<double>& logvar_x) {
    size_t d_x = x.size();
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < d_x; ++i) {
        double v = exp(logvar_x[i]);
        double diff = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (diff * diff) / v);
    }
    return ll;
}

// Monte Carlo ELBO estimator with K samples and beta scaling
double estimate_elbo(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, double beta, RNG& rng) {
    double recon_sum = 0.0;
    for (int k = 0; k < K; ++k) {
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        recon_sum += log_likelihood_x_given_z(x, fz, logvar_x);
    }
    double recon_avg = recon_sum / static_cast<double>(K);
    double kl = kl_to_standard_normal(mu, logvar);
    return recon_avg - beta * kl;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // Dimensions
    const size_t d_z = 2;  // latent
    const size_t d_x = 3;  // observed

    // Example data point x
    vector<double> x = {0.5, -1.0, 1.2};

    // Example encoder outputs for this x: mu(x), logvar(x)
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};  // log variances

    // Linear decoder parameters W (d_x x d_z), b (d_x)
    vector<vector<double>> W = {
        {0.8, -0.3},
        {0.1, 0.5},
        {-0.4, 0.9}
    };
    vector<double> b = {0.0, 0.0, 0.0};

    // Decoder output variance (diagonal), here fixed
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    int K = 10;       // Monte Carlo samples
    double beta = 1;  // VAE default

    double elbo = estimate_elbo(x, mu, logvar, W, b, logvar_x, K, beta, rng);
    cout << fixed << setprecision(6);
    cout << "Estimated ELBO (K=" << K << ", beta=" << beta << ") = " << elbo << "\n";

    return 0;
}
This program computes a Monte Carlo estimate of the ELBO for a simple Gaussian VAE. The encoder q(z|x) is a diagonal Gaussian given by mu and logvar. The decoder p(x|z) is a linear Gaussian with mean f(z)=Wz+b and fixed diagonal variance. We sample K latent vectors using the reparameterization trick, average their reconstruction log-likelihoods, and subtract β times the closed-form KL to the standard normal prior.
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned seed = 7) : rng(seed) {}
    double std_normal() { return nd(rng); }
};

vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, RNG& rng) {
    size_t d = mu.size();
    vector<double> z(d);
    for (size_t i = 0; i < d; ++i) {
        double eps = rng.std_normal();
        double sigma = sqrt(exp(logvar[i]));
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

vector<double> linear_decoder(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    size_t d_x = b.size(), d_z = z.size();
    vector<double> y(d_x, 0.0);
    for (size_t i = 0; i < d_x; ++i) {
        double s = b[i];
        for (size_t j = 0; j < d_z; ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

// log p(x|z) under Gaussian decoder with diagonal variance
double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double v = exp(logvar_x[i]);
        double d = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (d * d) / v);
    }
    return ll;
}

// log p(z) for standard normal
double log_pz_std_normal(const vector<double>& z) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (double zi : z) ll += -0.5 * (log2pi + zi * zi);
    return ll;
}

// log q(z|x) for diagonal Gaussian N(mu, diag(exp(logvar)))
double log_qz_diag(const vector<double>& z, const vector<double>& mu, const vector<double>& logvar) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < z.size(); ++i) {
        double v = exp(logvar[i]);
        double d = z[i] - mu[i];
        ll += -0.5 * (log2pi + logvar[i] + (d * d) / v);
    }
    return ll;
}

// Stable log-mean-exp: log( (1/K) * sum exp(a_k) )
double log_mean_exp(const vector<double>& a) {
    if (a.empty()) return -INFINITY;
    double m = *max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += exp(v - m);
    return log(s) + m - log((double)a.size());
}

// IWAE bound estimator using K importance samples
double estimate_iwae(const vector<double>& x,
                     const vector<double>& mu,
                     const vector<double>& logvar,
                     const vector<vector<double>>& W,
                     const vector<double>& b,
                     const vector<double>& logvar_x,
                     int K, RNG& rng) {
    vector<double> logw;
    logw.reserve(K);
    for (int k = 0; k < K; ++k) {
        vector<double> z = sample_diag_gaussian(mu, logvar, rng);
        vector<double> fz = linear_decoder(W, b, z);
        double lp_x_given_z = log_p_x_given_z(x, fz, logvar_x);
        double lpz = log_pz_std_normal(z);
        double lqz = log_qz_diag(z, mu, logvar);
        logw.push_back(lp_x_given_z + lpz - lqz);
    }
    return log_mean_exp(logw);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    RNG rng(2024);

    // Dimensions and parameters
    size_t d_z = 2, d_x = 3;
    vector<double> x = {0.5, -1.0, 1.2};
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W = {{0.8, -0.3}, {0.1, 0.5}, {-0.4, 0.9}};
    vector<double> b = {0.0, 0.0, 0.0};
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    cout << fixed << setprecision(6);
    for (int K : {1, 5, 50, 200}) {
        double iwae = estimate_iwae(x, mu, logvar, W, b, logvar_x, K, rng);
        cout << "IWAE bound (K=" << K << ") = " << iwae << "\n";
    }
    return 0;
}
This program computes the IWAE bound using K importance samples. It evaluates log p(x, z) − log q(z|x) for each sample and aggregates via a numerically stable log-mean-exp. As K increases, the bound typically tightens, better approximating log p(x).
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 rng;
    normal_distribution<double> nd{0.0, 1.0};
    RNG(unsigned s = 11) : rng(s) {}
    double n() { return nd(rng); }
};

double kl_to_standard_normal(const vector<double>& mu, const vector<double>& logvar) {
    double kl = 0.0;
    for (size_t i = 0; i < mu.size(); ++i) {
        double v = exp(logvar[i]);
        kl += mu[i] * mu[i] + v - logvar[i] - 1.0;
    }
    return 0.5 * kl;
}

vector<double> sample_q(const vector<double>& mu, const vector<double>& logvar, RNG& r) {
    vector<double> z(mu.size());
    for (size_t i = 0; i < mu.size(); ++i) z[i] = mu[i] + sqrt(exp(logvar[i])) * r.n();
    return z;
}

vector<double> lin_dec(const vector<vector<double>>& W, const vector<double>& b, const vector<double>& z) {
    vector<double> y(b.size(), 0.0);
    for (size_t i = 0; i < b.size(); ++i) {
        double s = b[i];
        for (size_t j = 0; j < z.size(); ++j) s += W[i][j] * z[j];
        y[i] = s;
    }
    return y;
}

double log_p_x_given_z(const vector<double>& x, const vector<double>& fz, const vector<double>& logvar_x) {
    const double log2pi = log(2.0 * M_PI);
    double ll = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double v = exp(logvar_x[i]);
        double d = x[i] - fz[i];
        ll += -0.5 * (log2pi + logvar_x[i] + (d * d) / v);
    }
    return ll;
}

pair<double, double> elbo_terms(const vector<double>& x, const vector<double>& mu, const vector<double>& logvar,
                                const vector<vector<double>>& W, const vector<double>& b, const vector<double>& logvar_x,
                                int K, RNG& rng) {
    double recon = 0.0;
    for (int k = 0; k < K; ++k) {
        auto z = sample_q(mu, logvar, rng);
        auto fz = lin_dec(W, b, z);
        recon += log_p_x_given_z(x, fz, logvar_x);
    }
    recon /= double(K);
    double kl = kl_to_standard_normal(mu, logvar);
    return {recon, kl};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    RNG rng(99);

    // Example setup
    vector<double> x = {0.5, -1.0, 1.2};
    vector<double> mu = {0.2, -0.1};
    vector<double> logvar = {log(0.5), log(1.2)};
    vector<vector<double>> W = {{0.8, -0.3}, {0.1, 0.5}, {-0.4, 0.9}};
    vector<double> b = {0.0, 0.0, 0.0};
    vector<double> logvar_x = {log(0.1), log(0.1), log(0.2)};

    cout << fixed << setprecision(6);
    vector<double> betas = {0.0, 0.5, 1.0, 2.0, 4.0};
    int K = 20;
    auto [recon, kl] = elbo_terms(x, mu, logvar, W, b, logvar_x, K, rng);
    cout << "Reconstruction (avg) = " << recon << ", KL = " << kl << "\n";
    for (double beta : betas) {
        double obj = recon - beta * kl;
        cout << "beta=" << beta << ": objective = " << obj << "\n";
    }
    return 0;
}
This program separates the reconstruction and KL terms and evaluates the β-VAE objective for several β values. It illustrates the trade-off: increasing β downweights reconstruction quality to encourage simpler latents (larger KL penalty).