Normalizing Flow Variational Inference
Key Points
- Normalizing-flow variational inference enriches the variational family by transforming a simple base distribution through a sequence of invertible, differentiable mappings.
- The log density after K flow steps equals the base log density minus the sum of log absolute Jacobian determinants of each step.
- This approach keeps exact likelihood evaluation tractable while increasing expressiveness to better approximate complex posteriors.
- The ELBO is estimated via Monte Carlo by sampling from the flow (using the base plus transformations) and computing log p(x,z) − log q(z|x).
- Planar flows are simple and cheap (O(D) per layer) but may require many layers; coupling flows like RealNVP offer fast triangular Jacobians.
- Reparameterization makes gradient-based optimization low-variance by pushing randomness to the base distribution.
- Care must be taken with numerical stability of log-determinants and invertibility constraints.
- In C++, you can implement flows with standard library random generators and explicit Jacobian terms; training needs gradients from an autodiff library or careful finite-difference approximations.
Prerequisites
- Multivariate Calculus — Understanding Jacobians and change-of-variables requires derivatives, determinants, and the chain rule.
- Probability Density Functions — Flows manipulate continuous densities; you must know how PDFs transform under mappings.
- KL Divergence and ELBO — NFVI optimizes the ELBO, which includes KL and expected log-likelihood terms.
- Linear Algebra — Vectors, matrices, determinants, and norms appear in Jacobians and Gaussian log-densities.
- Stochastic Estimation (Monte Carlo) — The ELBO is typically estimated via sampling from the variational distribution.
- Optimization and Gradients — Training flows needs gradients of the ELBO with respect to flow parameters.
Detailed Explanation
01 Overview
Normalizing Flow Variational Inference (NFVI) is a technique to make variational inference more flexible by transforming a simple, tractable distribution (like a standard Gaussian) through a chain of invertible, differentiable functions, called flows. Each flow reshapes the probability mass while keeping the ability to compute exact log-densities via the change-of-variables formula. The result is a rich family of variational distributions that can capture multimodality, skewness, and complex dependencies—far beyond the limitations of mean-field Gaussians. In practice, we start with a base variable z_0 sampled from a simple distribution p_0, then apply K transformations z_k = f_k(z_{k-1}) to obtain z_K. The log-density of z_K is computable using the base density and the sum of log absolute Jacobian determinants across layers. This property enables unbiased Monte Carlo estimates of the Evidence Lower Bound (ELBO), allowing us to train the flow parameters by maximizing the ELBO. NFVI gracefully blends exact likelihood computation (enabled by invertibility) with deep, learnable transformations, achieving both tractability and expressivity.
02 Intuition & Analogies
Imagine squeezing and stretching a lump of clay to make it fit a complex mold. If you only push in one direction (like a diagonal Gaussian), you’ll miss the finer creases and curves. But if you repeatedly apply gentle, carefully designed deformations—press here, twist there—you can make the clay take on almost any shape. Normalizing flows are those sequences of gentle deformations for probability distributions. You start with a simple brick-shaped blob of probability (a standard normal). Each flow layer nudges and warps this blob slightly but in a way that you can exactly reverse and exactly quantify how volume changed. The “exactly quantify” part is crucial: when you squeeze part of the blob, density increases; when you stretch, density decreases. The Jacobian determinant of the transformation measures precisely how much squeezing or stretching you did. By chaining many such steps, you can sculpt the probability density to match complicated posterior shapes that arise in real-world Bayesian problems—like multiple plausible explanations for an image or correlated latent factors in time series. Because every step is invertible and its volume change is known, you can both sample (push forward) and score (compute log-density) efficiently. This is like having a reversible recipe with exact calorie tracking for every ingredient and step.
03 Formal Definition
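In symbols, consistent with the formulas listed later in this section: given a base density p_0 and invertible, differentiable maps f_1, …, f_K with learnable parameters, the flow-based variational family is defined by pushing a base sample through the composition and tracking the volume change of every step.

```latex
z_0 \sim p_0(z_0), \qquad
z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0),
\qquad
\log q_K(z_K) \;=\; \log p_0(z_0) \;-\; \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
```

Each f_k must be invertible with an efficiently computable Jacobian determinant; this is what keeps both sampling and exact density evaluation tractable.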
04 When to Use
Use NFVI when a simple variational family (e.g., mean-field Gaussian) underfits the posterior, leading to biased inferences or poor predictive performance. Typical use cases include: (1) complex posteriors with strong correlations, skewness, or multimodality; (2) amortized inference in deep generative models (e.g., VAEs) where flows refine the encoder’s approximate posterior; (3) Bayesian neural networks requiring expressive approximate posteriors over weights; (4) latent-variable time-series models with intricate latent dynamics; and (5) probabilistic programs where tractable, flexible posteriors accelerate inference. If exact likelihood evaluation of q is needed (e.g., for ELBO or importance sampling), normalizing flows are attractive because they preserve tractability. Prefer coupling flows (RealNVP, Glow) when dimensionality is high and you need O(D) Jacobian determinants; use planar/radial flows for lightweight, plug-in expressivity boosts in moderate dimensions. If your problem only needs sampling (not density evaluation), you might consider implicit models, but for variational inference, flows’ exact log-density is a key advantage.
⚠️ Common Mistakes
- Ignoring invertibility constraints: Some flows (e.g., planar) need parameter constraints to ensure invertibility. Without them, log-determinants can become undefined or numerically unstable. Use stabilized parameterizations (e.g., u-hat trick) or regularization.
- Numerical instability in log-determinants: Summing many log |det J| terms can suffer from precision loss. Use log1p when appropriate, clamp scales in coupling layers, and avoid extreme activations.
- Mismatch between forward and inverse: For coupling flows, ensure the forward transform and inverse are implemented consistently; errors silently corrupt density calculations.
- Overly complex s/t networks: In RealNVP, large neural nets for scale/shift increase compute and may overfit. Start simple, add capacity gradually, and monitor ELBO generalization.
- Too few Monte Carlo samples: ELBO estimates with very few samples can have high variance. Use mini-batching and average over multiple samples per data point when feasible.
- Forgetting base density contribution: The base log-density \log p_{0}(z_{0}) is essential in \log q; omitting it leads to incorrect likelihoods.
- Treating flows as black boxes: Inspect intermediate z_k and log-dets to debug; many issues show up as exploding/vanishing log-dets or wildly distributed activations.
Key Formulas
ELBO
\mathcal{L}(\phi) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x, z) - \log q_{\phi}(z|x)\right]
Explanation: The ELBO measures how well the variational distribution matches the true posterior while rewarding data fit. Maximizing it brings q closer to p and tightens the lower bound on the model evidence.
Flow Composition
z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0), \quad z_0 \sim p_0
Explanation: A normalizing flow is a composition of invertible functions applied to a base sample. Each step incrementally reshapes the distribution.
Change of Variables across K layers
\log q_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
Explanation: The log-density of the transformed variable equals the base log-density minus the sum of log absolute Jacobian determinants. This is the core tractability property of flows.
KL Divergence
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\left[\log q(z) - \log p(z)\right]
Explanation: KL divergence quantifies how different q is from p. Minimizing the KL to the posterior (equivalently, maximizing the ELBO) drives q toward p.
Monte Carlo ELBO
\hat{\mathcal{L}} = \frac{1}{S} \sum_{s=1}^{S} \left[\log p(x, z^{(s)}) - \log q(z^{(s)}|x)\right], \quad z^{(s)} \sim q
Explanation: We estimate the ELBO by averaging over samples from the variational distribution. As S increases, the estimate concentrates around the true ELBO.
Matrix Determinant Lemma (Planar Flow)
\det\!\left(I + u\,\psi(z)^{\top}\right) = 1 + u^{\top}\psi(z), \quad \psi(z) = h'(w^{\top}z + b)\,w
Explanation: Planar flows have Jacobians of the form I + u\,\psi(z)^{\top}, whose determinant reduces to a scalar. This makes log-determinant computation O(D) instead of O(D^3).
RealNVP Affine Coupling
y_A = x_A, \qquad y_B = x_B \odot \exp(s(x_A)) + t(x_A)
Explanation: A coupling layer leaves part of the vector unchanged and affinely transforms the rest conditioned on it. The Jacobian is triangular, making the log-determinant equal to the sum of the s(\cdot) outputs.
RealNVP Log-Det
\log \left| \det \frac{\partial y}{\partial x} \right| = \sum_{i} s(x_A)_i
Explanation: Because the Jacobian is triangular with diagonal entries \exp(s(x_A)_i), the log-determinant is the sum of the scale terms. This yields O(D) complexity for the log-determinant.
Reparameterization Gradient
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z)}\left[g(z)\right] = \mathbb{E}_{p_0(\epsilon)}\left[\nabla_{\phi}\, g\!\left(z_{\phi}(\epsilon)\right)\right]
Explanation: Expressing z as a deterministic function of base noise \epsilon and parameters \phi enables gradients to flow through the sampling process, reducing variance compared to score-function estimators.
Standard Gaussian Log-Density
\log \mathcal{N}(z; 0, I_D) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\lVert z \rVert^{2}
Explanation: The log-density of a D-dimensional standard normal depends only on the squared norm of z. This term appears in the flow log-density via the change of variables.
Complexity Analysis
Planar flows cost O(D) per layer for both the transform and the log-determinant (via the matrix determinant lemma), but each layer is weak, so many layers may be needed. Affine coupling layers (RealNVP) also give O(D) log-determinants thanks to their triangular Jacobians, plus the cost of evaluating the s and t networks on half the dimensions. By contrast, a general invertible transformation would require an O(D^3) determinant per layer, which is precisely what flow architectures are designed to avoid.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utility: sample from standard normal N(0,1)
double randn(std::mt19937 &rng) {
    static thread_local std::normal_distribution<double> dist(0.0, 1.0);
    return dist(rng);
}

// Dot product of two vectors
double dot(const vector<double> &a, const vector<double> &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Planar flow: f(z) = z + u * h(w^T z + b), with h = tanh
struct PlanarFlow {
    vector<double> u, w; // size D
    double b;            // scalar

    PlanarFlow(size_t D, std::mt19937 &rng) : u(D), w(D) {
        // Initialize small random params for stability
        std::normal_distribution<double> dist(0.0, 0.1);
        for (size_t i = 0; i < D; ++i) {
            u[i] = dist(rng);
            w[i] = dist(rng);
        }
        b = dist(rng);
    }

    // Forward transform: returns z_new and adds log|det J| to log_det_J
    vector<double> forward(const vector<double> &z, double &log_det_J) const {
        double a = dot(w, z) + b;    // pre-activation
        double h = tanh(a);          // nonlinearity
        double hprime = 1.0 - h * h; // derivative of tanh

        // psi(z) = h'(a) * w
        vector<double> psi(w.size());
        for (size_t i = 0; i < w.size(); ++i) psi[i] = hprime * w[i];

        // Matrix determinant lemma: det(I + u psi^T) = 1 + u^T psi
        double det_term = 1.0 + dot(u, psi);
        // Numerical safety: avoid log(<= 0)
        double eps = 1e-8;
        log_det_J += log(fabs(det_term) + eps);

        // z_new = z + u * h
        vector<double> z_new = z;
        for (size_t i = 0; i < z.size(); ++i) z_new[i] += u[i] * h;
        return z_new;
    }
};

// Sample z0 ~ N(0, I)
vector<double> sample_base(size_t D, std::mt19937 &rng) {
    vector<double> z0(D);
    for (size_t i = 0; i < D; ++i) z0[i] = randn(rng);
    return z0;
}

// Log-density of standard Gaussian N(0, I)
double log_std_normal(const vector<double> &z) {
    double quad = 0.0;
    for (double v : z) quad += v * v;
    double D = static_cast<double>(z.size());
    return -0.5 * (D * log(2.0 * M_PI) + quad);
}

// Apply a sequence of planar flows to a base sample, returning (zK, log q(zK)).
// Retaining z0 is essential: log q_K(zK) = log p0(z0) - sum_k log|det J_k|.
pair<vector<double>, double> sample_and_log_q_planar(
    const vector<PlanarFlow> &flows, size_t D, std::mt19937 &rng) {
    vector<double> z0 = sample_base(D, rng);
    vector<double> z = z0;
    double log_det_sum = 0.0;
    for (const auto &f : flows) z = f.forward(z, log_det_sum);
    double logq = log_std_normal(z0) - log_det_sum;
    return {z, logq};
}

int main() {
    std::random_device rd;
    std::mt19937 rng(rd());

    size_t D = 4, K = 5; // 5 planar layers in 4D
    vector<PlanarFlow> flows;
    flows.reserve(K);
    for (size_t k = 0; k < K; ++k) flows.emplace_back(D, rng);

    // Draw a sample and compute its log-density under the flow q
    auto [zK, logq] = sample_and_log_q_planar(flows, D, rng);

    cout << fixed << setprecision(4);
    cout << "zK: ";
    for (double v : zK) cout << v << ' ';
    cout << "\nlog q(zK): " << logq << "\n";
    return 0;
}
This program defines a planar flow layer f(z) = z + u tanh(w^T z + b). Using the matrix determinant lemma, the log-determinant reduces to log|1 + u^T ψ(z)| with ψ(z) = h'(a) w. We sample from a standard Gaussian base, push the sample forward through K planar layers, and compute the exact log-density using the base log-density minus the accumulated log-determinants. The key point is retaining z0 to compute log p0(z0).
#include <bits/stdc++.h>
using namespace std;

// ---------- Random utils ----------
struct RNG {
    std::mt19937 gen;
    RNG() : gen(std::random_device{}()) {}
    double randn() {
        static normal_distribution<double> nd(0.0, 1.0);
        return nd(gen);
    }
};

// ---------- Linear algebra helpers (naive) ----------
vector<double> matvec(const vector<vector<double>>& W, const vector<double>& x) {
    size_t R = W.size(), C = x.size();
    vector<double> y(R, 0.0);
    for (size_t i = 0; i < R; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < C; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

vector<double> add(const vector<double>& a, const vector<double>& b) {
    vector<double> c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

vector<double> tanh_vec(const vector<double>& a) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = tanh(a[i]);
    return y;
}

// Elementwise operations
vector<double> exp_vec(const vector<double>& a) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = exp(a[i]);
    return y;
}

vector<double> hadamard(const vector<double>& a, const vector<double>& b) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = a[i] * b[i];
    return y;
}

// ---------- RealNVP layer ----------
// Partition: first D/2 dims are A, remaining are B (mask alternates across layers)
struct RealNVPLayer {
    size_t D, Dh;   // D total, D/2 per partition
    bool flip;      // if true, swap A and B roles
    // Simple linear functions for s(.) and t(.) for demonstrative purposes
    vector<vector<double>> W_s, W_t; // (Dh x Dh)
    vector<double> b_s, b_t;         // (Dh)

    RealNVPLayer(size_t D_, bool flip_, RNG &rng)
        : D(D_), Dh(D_ / 2), flip(flip_),
          W_s(Dh, vector<double>(Dh)), W_t(Dh, vector<double>(Dh)),
          b_s(Dh), b_t(Dh) {
        normal_distribution<double> nd(0.0, 0.2);
        for (size_t i = 0; i < Dh; ++i) {
            b_s[i] = 0.0; b_t[i] = 0.0;
            for (size_t j = 0; j < Dh; ++j) {
                W_s[i][j] = nd(rng.gen) * 0.1; // keep small for stability
                W_t[i][j] = nd(rng.gen) * 0.1;
            }
        }
    }

    // Forward pass: returns y, and adds log|det J| to log_det_J
    vector<double> forward(const vector<double>& x, double &log_det_J) const {
        vector<double> y(D);
        // Split x into A and B
        vector<double> xA(Dh), xB(Dh);
        if (!flip) {
            for (size_t i = 0; i < Dh; ++i) xA[i] = x[i];
            for (size_t i = 0; i < Dh; ++i) xB[i] = x[Dh + i];
        } else {
            for (size_t i = 0; i < Dh; ++i) xA[i] = x[Dh + i];
            for (size_t i = 0; i < Dh; ++i) xB[i] = x[i];
        }
        // s, t as simple (tanh(Wx + b), Wx + b)
        vector<double> s = tanh_vec(add(matvec(W_s, xA), b_s));
        vector<double> t = add(matvec(W_t, xA), b_t);
        vector<double> exp_s = exp_vec(s);
        vector<double> yB = add(hadamard(xB, exp_s), t);

        // Accumulate log-det: sum(s)
        double sum_s = 0.0;
        for (double si : s) sum_s += si;
        log_det_J += sum_s;

        // Merge back
        if (!flip) {
            for (size_t i = 0; i < Dh; ++i) y[i] = xA[i];
            for (size_t i = 0; i < Dh; ++i) y[Dh + i] = yB[i];
        } else {
            for (size_t i = 0; i < Dh; ++i) y[i] = yB[i];
            for (size_t i = 0; i < Dh; ++i) y[Dh + i] = xA[i];
        }
        return y;
    }
};

// Base sampling and log-density (standard normal)
vector<double> sample_base(size_t D, RNG &rng) {
    vector<double> z(D);
    for (size_t i = 0; i < D; ++i) z[i] = rng.randn();
    return z;
}

double log_std_normal(const vector<double>& z) {
    double quad = 0.0;
    for (double v : z) quad += v * v;
    double D = static_cast<double>(z.size());
    return -0.5 * (D * log(2.0 * M_PI) + quad);
}

// Flow sampling: return (zK, log q(zK))
pair<vector<double>, double> sample_and_log_q_realnvp(
    const vector<RealNVPLayer>& layers, size_t D, RNG &rng) {
    vector<double> z0 = sample_base(D, rng);
    vector<double> z = z0;
    double log_det_sum = 0.0;
    for (const auto &L : layers) z = L.forward(z, log_det_sum);
    double logq = log_std_normal(z0) - log_det_sum;
    return {z, logq};
}

// Toy model: p(z) = N(0, I), p(x|z) = N(W z, sigma^2 I)
// log p(x|z) = -0.5 [ M log(2 pi sigma^2) + ||x - W z||^2 / sigma^2 ]
double log_likelihood_gaussian(const vector<double>& x,
                               const vector<vector<double>>& W,
                               const vector<double>& z, double sigma) {
    vector<double> mu = matvec(W, z);
    double quad = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double d = x[i] - mu[i];
        quad += d * d;
    }
    return -0.5 * (x.size() * log(2.0 * M_PI * sigma * sigma)
                   + quad / (sigma * sigma));
}

int main() {
    RNG rng;
    const size_t D = 4; // latent dim (even)
    const size_t K = 4; // RealNVP layers

    // Build alternating-mask RealNVP stack
    vector<RealNVPLayer> layers;
    layers.reserve(K);
    for (size_t k = 0; k < K; ++k) layers.emplace_back(D, (k % 2 == 1), rng);

    // Toy data and likelihood parameters
    const size_t M = 3; // observation dim
    vector<vector<double>> W(M, vector<double>(D, 0.0));
    // Simple W: first M dims pick from z
    for (size_t i = 0; i < M; ++i) if (i < D) W[i][i] = 1.0;
    vector<double> x = {0.5, -1.2, 0.3};
    double sigma = 0.5;

    // Monte Carlo ELBO estimate with S samples
    size_t S = 1000;
    double elbo_sum = 0.0;
    for (size_t s = 0; s < S; ++s) {
        auto [z, logq] = sample_and_log_q_realnvp(layers, D, rng);
        double logp_z = log_std_normal(z); // prior
        double logp_x_given_z = log_likelihood_gaussian(x, W, z, sigma);
        elbo_sum += (logp_x_given_z + logp_z - logq);
    }
    double elbo_mc = elbo_sum / static_cast<double>(S);

    cout << fixed << setprecision(6);
    cout << "Monte Carlo ELBO estimate: " << elbo_mc << "\n";
    return 0;
}
This example implements a lightweight RealNVP with alternating affine coupling layers. Scale s(.) and shift t(.) are simple linear functions (with tanh for s to keep it bounded). The log-determinant is the sum of s’s outputs, thanks to a triangular Jacobian. We then compute a Monte Carlo ELBO for a toy model with a standard normal prior and Gaussian likelihood p(x|z) = N(Wz, σ²I). Each ELBO term uses exact log-densities: log p(x|z), log p(z), and log q(z) from the flow.