Variational Autoencoders (VAE) Theory
Key Points
- A Variational Autoencoder (VAE) is a probabilistic autoencoder that learns to generate data by inferring hidden causes (latent variables) and decoding them back to observations.
- Because exact inference is intractable, VAEs optimize the Evidence Lower BOund (ELBO), which balances reconstruction quality and how close the latent distribution is to a simple prior.
- The ELBO is log pθ(x) ≥ E_{qφ(z|x)}[log pθ(x|z)] − KL(qφ(z|x) || p(z)), trading off fidelity and regularization.
- The reparameterization trick z = μ + σ ⊙ ε with ε ∼ N(0, I) allows low-variance, differentiable Monte Carlo estimates of gradients.
- For Gaussian decoders the reconstruction term becomes a scaled mean-squared error, and the KL to a standard normal has a closed form.
- Amortized inference uses a neural encoder qφ(z|x), so inference at test time is fast and shared across data points.
- Common pitfalls include posterior collapse, unstable variance parameterization, and a mis-scaled KL (often mitigated with β-annealing or KL warm-up).
- In C++, you can compute ELBO terms, KL divergences, and pathwise gradients using standard libraries without deep-learning frameworks.
Prerequisites
- Basic probability and random variables - VAEs are probabilistic models; understanding expectations, variances, and independence is essential.
- Gaussian distributions - Common VAE priors and posteriors are Gaussian; closed-form KL and sampling rely on Gaussian identities.
- KL divergence and information theory - The ELBO trades off reconstruction and KL; knowing KL properties clarifies regularization behavior.
- Bayes' rule and latent-variable models - VAEs approximate the intractable posterior pθ(z|x) with qφ(z|x) via variational inference.
- Linear algebra - Neural networks and likelihoods involve vectors, matrices, and norms.
- Calculus and chain rule - Backpropagation and reparameterization gradients use derivatives through composite functions.
- Monte Carlo estimation - ELBO expectations are approximated by sampling; variance-bias trade-offs matter.
- Neural networks (MLP/CNN basics) - Encoders/decoders are neural nets mapping between x and z distributions.
- Optimization and SGD - Training maximizes the ELBO with stochastic gradient-based methods.
- Numerical stability practices - Working with log-variances, clipping, and epsilons prevents NaNs during training.
Detailed Explanation
01 Overview
Hook → Concept → Example: Imagine compressing a photo into a short code that, when expanded, recreates a realistic photo: not necessarily the exact same pixels, but one that looks as if it came from the same world. That's the idea behind Variational Autoencoders (VAEs). A VAE is a probabilistic model that assumes each observation x is generated from a low-dimensional hidden variable z through a decoder pθ(x|z), while z itself comes from a simple prior p(z), often a standard normal. Because computing log pθ(x) = log ∫ pθ(x|z)p(z) dz is usually impossible to compute exactly, VAEs introduce an encoder qφ(z|x) to approximate the true posterior pθ(z|x). The training objective maximizes the Evidence Lower BOund (ELBO), balancing reconstruction fidelity with a regularization that keeps qφ(z|x) close to p(z). Conceptually, this creates a learned, stochastic compression scheme: the encoder maps data to a distribution over codes; the decoder maps codes back to data. Example: With a diagonal Gaussian encoder and an isotropic Gaussian decoder, the ELBO reduces to a mean-squared reconstruction term plus a closed-form KL divergence. Optimizing this objective yields a model that can sample new, realistic data by first sampling z ∼ p(z) and then decoding.
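The generative direction can be sketched in a few lines of C++. The snippet below uses a hypothetical 3x2 linear decoder (made-up weights W and bias b, echoing the full example later in this section) purely to illustrate sampling z ∼ N(0, I) and decoding it into a mean for pθ(x|z); it is not a trained model.

#include <bits/stdc++.h>
using namespace std;

// Toy generative pass: draw z ~ p(z) = N(0, I), then decode to a mean for p(x|z).
// W and b are illustrative numbers standing in for a trained decoder.
int main(){
    mt19937 rng(0);
    normal_distribution<double> stdn(0.0, 1.0);

    vector<vector<double>> W = {{1.0, -0.5}, {0.2, 0.7}, {-0.3, 0.4}}; // 3x2 decoder weights
    vector<double> b = {0.1, -0.2, 0.3};

    // Sample a latent code from the prior
    vector<double> z(2);
    for(double& zi : z) zi = stdn(rng);

    // Decode: mean_x = W z + b (a real VAE would also model observation noise)
    vector<double> mean_x(3, 0.0);
    for(size_t i = 0; i < 3; ++i){
        for(size_t j = 0; j < 2; ++j) mean_x[i] += W[i][j] * z[j];
        mean_x[i] += b[i];
    }

    cout << "z = (" << z[0] << ", " << z[1] << ")\n";
    cout << "decoded mean = (" << mean_x[0] << ", " << mean_x[1] << ", " << mean_x[2] << ")\n";
    return 0;
}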
02 Intuition & Analogies
Hook: Think of a film studio building a digital actor. They don't store every possible pose pixel-by-pixel; they store a small set of controls (sliders for head turn, smile, lighting) and a renderer that turns these sliders into a full image. Concept: VAEs learn exactly this: a set of hidden sliders (z) and a renderer (decoder) that turns them into data. Instead of finding a single exact slider position for each image, VAEs learn a distribution over sliders: some uncertainty is allowed and even encouraged, because many different slider settings can plausibly explain a given image. The encoder learns, for each input, how to set the mean and spread (variance) of these sliders. The decoder then tries to reconstruct the input from a sample of sliders. The training rule encourages two things: (1) reconstructions should match the input (so the renderer is accurate), and (2) slider settings should look like simple, standard noise (so the sliders are easy to sample at test time). Example: Suppose you model handwritten digits. The latent z may capture stroke thickness, slant, and general digit identity. For a specific "3", the encoder might say "z around (0.7, −0.2) with some uncertainty." Sampling z from this distribution and rendering it gives slightly different but plausible "3"s. The regularizer nudges these z's toward standard normal, preventing the model from memorizing each digit with unique, incomparable codes. The reparameterization trick is like saying: instead of rolling dice inside the network in a way that blocks gradients, roll dice (ε) outside and then transform them deterministically (z = μ + σ ⊙ ε), so you can still use calculus to learn μ and σ.
03 Formal Definition
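A VAE defines a latent-variable model pθ(x, z) = pθ(x|z) p(z), typically with prior p(z) = N(0, I) and a decoder pθ(x|z) parameterized by a neural network with parameters θ. The marginal likelihood log pθ(x) = log ∫ pθ(x|z) p(z) dz is intractable for nonlinear decoders, so an amortized variational posterior qφ(z|x), commonly a diagonal Gaussian N(μφ(x), diag(σφ(x)^2)) produced by an encoder network with parameters φ, approximates the true posterior pθ(z|x). Training maximizes the ELBO, E_{qφ(z|x)}[log pθ(x|z)] − KL(qφ(z|x) || p(z)) ≤ log pθ(x), jointly over θ and φ, with the expectation estimated from reparameterized samples z = μφ(x) + σφ(x) ⊙ ε, ε ∼ N(0, I).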
04 When to Use
Use VAEs when you want: (1) Generative modeling with controllable, continuous latent representations (e.g., image, audio, or text embeddings that can be smoothly interpolated). (2) Fast amortized inference: given x, you instantly get a distribution over z without running an expensive per-example optimizer. (3) Principled likelihood-based training, enabling model comparison via ELBO. (4) Semi-supervised learning: combine a supervised term with the ELBO. (5) Anomaly detection: unlikely x under the model (low ELBO) flags anomalies. (6) Missing data imputation: sample latent z and decode missing parts. (7) Compression: store μ and σ for z instead of raw x. Prefer VAEs over GANs when you need likelihoods, uncertainty, or reproducible training with stable gradients. Be cautious if your data are discrete with complex likelihoods (reparameterization for discrete z is harder), or when powerful decoders (e.g., autoregressive transformers) might cause posterior collapse; mitigations include β-VAEs, KL warm-up, free-bits, structured priors, or weaker decoders. VAEs are especially effective when latent structure is roughly continuous and the chosen likelihood fits the data type.
⚠️ Common Mistakes
- Ignoring variance parameterization: Predicting σ instead of log σ^2 can produce negative or numerically unstable variances. Prefer predicting log-variance (logvar) and transforming with σ = exp(0.5·logvar).
- Using too few Monte Carlo samples: One sample works in practice, but for debugging and evaluation the variance of the ELBO estimate can mislead. Average across multiple samples when validating.
- Forgetting analytic KL: For diagonal Gaussians, always use the closed-form KL to standard normal. Estimating it via sampling adds noise without benefit.
- Mismatched decoder likelihood: Using a Gaussian likelihood for inherently binary pixels or counts can distort training signals. Choose Bernoulli for binary, Gaussian for continuous, Poisson/negative-binomial for counts.
- Posterior collapse: An overly expressive decoder can ignore z, driving the KL to zero. Use β < 1 initially (warm-up), free-bits, architectural constraints, or weaker decoders.
- Scaling issues: The reconstruction term scales with data dimension; forgetting this can make the KL seem too small. Monitor per-dimension terms or use β-VAEs to rebalance.
- Numerical stability: Directly computing log σ^2 or norms without eps can cause NaNs. Add small epsilons to logs/variances and clip logvars (see the sketch after this list).
- Misinterpreting ELBO: Higher ELBO means a better model on average, but sample quality can still lag GANs; assess with downstream metrics (FID, log-likelihood estimates) and reconstructions.
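As a concrete illustration of the numerical-stability point above, here is a minimal sketch of one common stabilization: predict a log-variance, clip it to a finite range, then exponentiate. The clip bounds (±10) and epsilon (1e-8) are illustrative choices, not prescribed values.

#include <bits/stdc++.h>
using namespace std;

// Minimal sketch: numerically stable handling of a predicted log-variance.
// The clip range [-10, 10] and EPS below are illustrative, not fixed rules.
double clip(double v, double lo, double hi){ return max(lo, min(hi, v)); }

int main(){
    vector<double> raw_logvar = {-37.2, 0.3, 55.0}; // hypothetical raw network outputs
    const double LOGVAR_MIN = -10.0, LOGVAR_MAX = 10.0;
    const double EPS = 1e-8;

    for(double lv : raw_logvar){
        double lv_clipped = clip(lv, LOGVAR_MIN, LOGVAR_MAX); // avoid exp overflow/underflow
        double sigma = exp(0.5 * lv_clipped);                 // positive by construction
        double var = sigma * sigma + EPS;                     // epsilon guards later divisions/logs
        cout << "logvar=" << lv << " -> sigma=" << sigma << ", var=" << var << "\n";
    }
    return 0;
}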
Key Formulas
ELBO
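ELBO(θ, φ; x) = E_{qφ(z|x)}[log pθ(x|z)] − KL(qφ(z|x) || p(z)), with log pθ(x) ≥ ELBO(θ, φ; x)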
Explanation: The ELBO lower-bounds the log evidence by a reconstruction term minus a KL regularizer. Maximizing it yields both good reconstructions and a posterior close to the prior.
Evidence Decomposition
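log pθ(x) = ELBO(θ, φ; x) + KL(qφ(z|x) || pθ(z|x))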
Explanation: The gap between the true log-likelihood and the ELBO is exactly the KL between the approximate and true posterior. As q approaches the true posterior, the ELBO becomes tight.
KL to Standard Normal (Diagonal)
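KL(N(μ, diag(σ^2)) || N(0, I)) = 0.5 · Σ_{i=1..d} (μ_i^2 + σ_i^2 − 1 − log σ_i^2)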
Explanation: Closed-form KL for a diagonal Gaussian against a standard normal. This is used in nearly all practical VAEs to avoid sampling noise in the KL term.
Reparameterization Trick
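z = μ + σ ⊙ ε, with ε ∼ N(0, I) and σ = exp(0.5 · logvar)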
Explanation: Express the random variable z as a deterministic transform of base noise ε. This allows gradients to pass through z by differentiating the deterministic mapping.
Monte Carlo ELBO
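ELBO ≈ (1/K) Σ_{k=1..K} log pθ(x | z^(k)) − KL(qφ(z|x) || p(z)), with z^(k) = μ + σ ⊙ ε^(k), ε^(k) ∼ N(0, I)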
Explanation: A finite-sample estimate of the ELBO used in practice. Using reparameterized samples reduces estimator variance for gradients.
Gaussian Likelihood
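log pθ(x|z) = −(1 / (2σ_x^2)) · ||x − μθ(z)||^2 − (m/2) · log(2π σ_x^2), for an isotropic Gaussian decoder with mean μθ(z), variance σ_x^2, and data dimension m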
Explanation: For an isotropic Gaussian decoder, the log-likelihood decomposes into a scaled squared error term and a constant. This links VAEs with MSE-style reconstruction losses.
Decoder Gradient
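∇θ ELBO = E_{qφ(z|x)}[∇θ log pθ(x|z)] ≈ (1/K) Σ_{k=1..K} ∇θ log pθ(x | z^(k))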
Explanation: The gradient of the ELBO with respect to decoder parameters depends only on the reconstruction term, since the KL term does not involve θ. It is estimated via reparameterized samples of z.
Pathwise Gradient for Encoder
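∇φ ELBO = E_{ε∼N(0,I)}[(∂z/∂φ)^T ∇z log pθ(x|z)] − ∇φ KL(qφ(z|x) || p(z)), with z = μφ(x) + σφ(x) ⊙ ε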
Explanation: Using reparameterization, encoder gradients flow through z's dependence on φ plus the analytic gradient of the KL term. This yields low-variance updates.
β-VAE Objective
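L_β(θ, φ; x) = E_{qφ(z|x)}[log pθ(x|z)] − β · KL(qφ(z|x) || p(z))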
Explanation: Scaling the KL by β trades off reconstruction fidelity against latent regularization. Larger β encourages more disentangled or compressed representations.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Helper: dot product
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0; for(size_t i=0;i<a.size();++i) s += a[i]*b[i]; return s;
}

// Helper: squared L2 norm of (a - b)
double sq_l2(const vector<double>& a, const vector<double>& b){
    double s = 0.0; for(size_t i=0;i<a.size();++i){ double d = a[i]-b[i]; s += d*d; } return s;
}

// Matrix-vector multiply: y = W z + b
vector<double> matvec_add(const vector<vector<double>>& W, const vector<double>& z, const vector<double>& b){
    size_t m = W.size(); size_t d = z.size();
    vector<double> y(m, 0.0);
    for(size_t i=0;i<m;++i){
        double s = 0.0;
        for(size_t j=0;j<d;++j) s += W[i][j]*z[j];
        y[i] = s + b[i];
    }
    return y;
}

// Log-density of isotropic Gaussian N(mean, sigma2 I) at x
double log_gaussian_isotropic(const vector<double>& x, const vector<double>& mean, double sigma2){
    const double TWO_PI = 2.0 * M_PI;
    size_t m = x.size();
    double quad = sq_l2(x, mean);
    return -0.5 * ( (quad / sigma2) + m * log(TWO_PI * sigma2) );
}

// Sample from diagonal Gaussian N(mu, diag(exp(logvar))) via z = mu + sigma * eps
vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, mt19937& rng){
    normal_distribution<double> stdn(0.0, 1.0);
    size_t d = mu.size();
    vector<double> z(d);
    for(size_t i=0;i<d;++i){
        double eps = stdn(rng);
        double sigma = exp(0.5 * logvar[i]);
        z[i] = mu[i] + sigma * eps;
    }
    return z;
}

// Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_standard_normal(const vector<double>& mu, const vector<double>& logvar){
    double kl = 0.0;
    for(size_t i=0;i<mu.size(); ++i){
        double m2 = mu[i]*mu[i];
        double v = exp(logvar[i]);
        kl += (m2 + v - 1.0 - logvar[i]);
    }
    return 0.5 * kl;
}

int main(){
    // Dimensions
    const size_t m = 3; // observation dim
    const size_t d = 2; // latent dim

    // One observed example x
    vector<double> x = {0.5, -1.0, 2.0};

    // Linear decoder: mean_x(z) = W z + b, with fixed isotropic variance sigma2_x
    vector<vector<double>> W = {{1.0, -0.5}, {0.2, 0.7}, {-0.3, 0.4}}; // m x d
    vector<double> b = {0.1, -0.2, 0.3};
    double sigma2_x = 0.05; // decoder observation noise variance

    // Encoder output for this x: q(z|x) = N(mu_q, diag(exp(logvar_q)))
    vector<double> mu_q = {0.3, -0.1};
    vector<double> logvar_q = {-0.2, 0.1}; // can be any real numbers

    // Monte Carlo estimate of E_q[ log p(x|z) ]
    mt19937 rng(42);
    size_t K = 2000; // number of samples for expectation
    double recon_sum = 0.0;
    for(size_t k=0;k<K;++k){
        vector<double> z = sample_diag_gaussian(mu_q, logvar_q, rng);
        vector<double> mean_x = matvec_add(W, z, b);
        recon_sum += log_gaussian_isotropic(x, mean_x, sigma2_x);
    }
    double recon_term = recon_sum / static_cast<double>(K);

    // Analytic KL to N(0, I)
    double kl = kl_standard_normal(mu_q, logvar_q);

    // ELBO = E_q[ log p(x|z) ] - KL
    double elbo = recon_term - kl;

    cout.setf(ios::fixed); cout<<setprecision(6);
    cout << "Reconstruction term E_q[log p(x|z)] ≈ " << recon_term << "\n";
    cout << "KL(q||p) = " << kl << "\n";
    cout << "ELBO ≈ " << elbo << "\n";
    return 0;
}
This program constructs a tiny linear-Gaussian VAE: a linear decoder with isotropic Gaussian likelihood and a diagonal Gaussian encoder. It estimates the reconstruction term via Monte Carlo by sampling z from q(z|x) using the reparameterization trick internally (μ, logvar → σ, ε). The KL to a standard normal is computed in closed form. Subtracting the KL from the reconstruction term yields an ELBO estimate for a single data point.
#include <bits/stdc++.h>
using namespace std;

// Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_diag_to_standard_normal_analytic(const vector<double>& mu, const vector<double>& logvar){
    double s = 0.0;
    for(size_t i=0;i<mu.size();++i){ double v = exp(logvar[i]); s += (mu[i]*mu[i] + v - 1.0 - logvar[i]); }
    return 0.5*s;
}

// Log-density of diagonal Gaussian q at z
double log_q_diag(const vector<double>& z, const vector<double>& mu, const vector<double>& logvar){
    const double TWO_PI = 2.0 * M_PI;
    size_t d = z.size();
    double s = 0.0;
    for(size_t i=0;i<d;++i){ double v = exp(logvar[i]); double diff = z[i] - mu[i]; s += (diff*diff)/v + log(TWO_PI*v); }
    return -0.5 * s;
}

// Log-density of standard normal p at z
double log_p_standard_normal(const vector<double>& z){
    double s = 0.0; for(double zi: z) s += zi*zi;
    return -0.5 * ( s + z.size()*log(2.0*M_PI) );
}

// Reparameterized sample from q = N(mu, diag(exp(logvar)))
vector<double> sample_q(const vector<double>& mu, const vector<double>& logvar, mt19937& rng){
    normal_distribution<double> stdn(0.0,1.0);
    size_t d = mu.size();
    vector<double> z(d);
    for(size_t i=0;i<d;++i){ double eps = stdn(rng); double sigma = exp(0.5*logvar[i]); z[i] = mu[i] + sigma*eps; }
    return z;
}

int main(){
    vector<double> mu = {0.2, -0.1, 0.3, 0.0, -0.4};
    vector<double> logvar = {-0.3, 0.2, -0.1, 0.0, 0.4};
    double kl_analytic = kl_diag_to_standard_normal_analytic(mu, logvar);

    // Monte Carlo estimate: E_q[ log q(z) - log p(z) ]
    mt19937 rng(123);
    size_t K = 20000; // increase for lower variance
    double sum_vals = 0.0;
    for(size_t k=0;k<K;++k){
        vector<double> z = sample_q(mu, logvar, rng);
        double val = log_q_diag(z, mu, logvar) - log_p_standard_normal(z);
        sum_vals += val;
    }
    double kl_mc = sum_vals / static_cast<double>(K);

    cout.setf(ios::fixed); cout<<setprecision(6);
    cout << "Analytic KL = " << kl_analytic << "\n";
    cout << "Monte Carlo KL ≈ " << kl_mc << "\n";
    cout << "Absolute error ≈ " << fabs(kl_analytic - kl_mc) << "\n";
    return 0;
}
This program verifies the closed-form KL divergence from a diagonal Gaussian to a standard normal using a Monte Carlo estimate E_q[log q(z) − log p(z)]. As K grows, the Monte Carlo estimate concentrates around the analytic value, illustrating variance-accuracy trade-offs in stochastic estimation.
#include <bits/stdc++.h>
using namespace std;

// Computes Monte Carlo estimates of gradients of E_q[f(z)] where
// q = N(mu, diag(exp(logvar))) and f(z) = 0.5 * ||z - a||^2.
// Pathwise (reparameterized) gradients:
//   z_i = mu_i + sigma_i * eps_i,  sigma_i = exp(0.5*logvar_i)
//   df/dz_i = z_i - a_i
//   d/dmu_i     E[f] ≈ E[ z_i - a_i ]
//   d/dlogvar_i E[f] ≈ E[ (z_i - a_i) * (0.5*sigma_i*eps_i) ]
// Exact gradients for comparison:
//   E[f] = 0.5 * sum_i ( (mu_i - a_i)^2 + sigma_i^2 )
//   d/dmu_i     = mu_i - a_i
//   d/dlogvar_i = 0.5 * sigma_i^2

int main(){
    size_t d = 3;
    vector<double> a = {0.5, -1.0, 0.3};        // target vector in the quadratic
    vector<double> mu = {0.2, -0.4, 0.1};       // encoder mean
    vector<double> logvar = {-0.1, 0.2, -0.3};  // encoder log-variance

    // Precompute sigmas
    vector<double> sigma(d), sigma2(d);
    for(size_t i=0;i<d;++i){ sigma[i] = exp(0.5*logvar[i]); sigma2[i] = sigma[i]*sigma[i]; }

    // Monte Carlo estimates
    mt19937 rng(7);
    normal_distribution<double> stdn(0.0,1.0);
    size_t K = 30000; // number of samples
    vector<double> g_mu(d, 0.0), g_logvar(d, 0.0);

    for(size_t k=0;k<K;++k){
        vector<double> eps(d);
        for(size_t i=0;i<d;++i) eps[i] = stdn(rng);
        vector<double> z(d);
        for(size_t i=0;i<d;++i) z[i] = mu[i] + sigma[i]*eps[i];
        for(size_t i=0;i<d;++i){
            double df_dz = z[i] - a[i];
            g_mu[i] += df_dz;                                  // df/dmu_i
            g_logvar[i] += df_dz * (0.5 * sigma[i] * eps[i]);  // df/dlogvar_i
        }
    }
    for(size_t i=0;i<d;++i){ g_mu[i] /= (double)K; g_logvar[i] /= (double)K; }

    // Exact gradients
    vector<double> g_mu_exact(d), g_logvar_exact(d);
    for(size_t i=0;i<d;++i){
        g_mu_exact[i] = mu[i] - a[i];
        g_logvar_exact[i] = 0.5 * sigma2[i];
    }

    cout.setf(ios::fixed); cout<<setprecision(6);
    cout << "Dimension-wise gradients (Monte Carlo vs Exact)\n";
    for(size_t i=0;i<d;++i){
        cout << "i="<<i
             << "  d/dmu: " << g_mu[i] << " vs " << g_mu_exact[i]
             << "  d/dlogvar: " << g_logvar[i] << " vs " << g_logvar_exact[i]
             << "\n";
    }
    return 0;
}
This example demonstrates the reparameterization trick by estimating pathwise gradients of an expectation using samples ε ∼ N(0, I) and the chain rule through z = μ + σ ⊙ ε. Because the function f is quadratic, exact gradients are known; the Monte Carlo estimates closely match them, showcasing low-variance, unbiased gradient estimation central to VAE training.