Normalizing Flow Variational Inference
Key Points
- Normalizing-flow variational inference enriches the variational family by transforming a simple base distribution through a sequence of invertible, differentiable mappings.
- The log density after K flow steps equals the base log density minus the sum of log absolute Jacobian determinants of each step.
- This approach keeps exact likelihood evaluation tractable while increasing expressiveness to better approximate complex posteriors.
- The ELBO is estimated via Monte Carlo by sampling from the flow (using the base plus transformations) and computing log p(x,z) − log q(z|x).
- Planar flows are simple and cheap (O(D) per layer) but may require many layers; coupling flows like RealNVP offer fast triangular Jacobians.
- Reparameterization makes gradient-based optimization low-variance by pushing randomness to the base distribution.
- Care must be taken with numerical stability of log-determinants and invertibility constraints.
- In C++, you can implement flows with standard library random generators and explicit Jacobian terms; training needs gradients from an autodiff library or careful finite-difference approximations.
Prerequisites
- Multivariate Calculus — Understanding Jacobians and change-of-variables requires derivatives, determinants, and the chain rule.
- Probability Density Functions — Flows manipulate continuous densities; you must know how PDFs transform under mappings.
- KL Divergence and ELBO — NFVI optimizes the ELBO, which includes KL and expected log-likelihood terms.
- Linear Algebra — Vectors, matrices, determinants, and norms appear in Jacobians and Gaussian log-densities.
- Stochastic Estimation (Monte Carlo) — The ELBO is typically estimated via sampling from the variational distribution.
- Optimization and Gradients — Training flows needs gradients of the ELBO with respect to flow parameters.
Detailed Explanation
01 Overview
Normalizing Flow Variational Inference (NFVI) is a technique to make variational inference more flexible by transforming a simple, tractable distribution (like a standard Gaussian) through a chain of invertible, differentiable functions, called flows. Each flow reshapes the probability mass while keeping the ability to compute exact log-densities via the change-of-variables formula. The result is a rich family of variational distributions that can capture multimodality, skewness, and complex dependencies—far beyond the limitations of mean-field Gaussians. In practice, we start with a base variable z_0 sampled from a simple distribution p_0, then apply K transformations z_k = f_k(z_{k-1}) to obtain z_K. The log-density of z_K is computable using the base density and the sum of log absolute Jacobian determinants across layers. This property enables unbiased Monte Carlo estimates of the Evidence Lower Bound (ELBO), allowing us to train the flow parameters by maximizing the ELBO. NFVI gracefully blends exact likelihood computation (enabled by invertibility) with deep, learnable transformations, achieving both tractability and expressivity.
02 Intuition & Analogies
Imagine squeezing and stretching a lump of clay to make it fit a complex mold. If you only push in one direction (like a diagonal Gaussian), you’ll miss the finer creases and curves. But if you repeatedly apply gentle, carefully designed deformations—press here, twist there—you can make the clay take on almost any shape. Normalizing flows are those sequences of gentle deformations for probability distributions. You start with a simple brick-shaped blob of probability (a standard normal). Each flow layer nudges and warps this blob slightly but in a way that you can exactly reverse and exactly quantify how volume changed. The “exactly quantify” part is crucial: when you squeeze part of the blob, density increases; when you stretch, density decreases. The Jacobian determinant of the transformation measures precisely how much squeezing or stretching you did. By chaining many such steps, you can sculpt the probability density to match complicated posterior shapes that arise in real-world Bayesian problems—like multiple plausible explanations for an image or correlated latent factors in time series. Because every step is invertible and its volume change is known, you can both sample (push forward) and score (compute log-density) efficiently. This is like having a reversible recipe with exact calorie tracking for every ingredient and step.
03 Formal Definition
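In symbols, consistent with the formulas listed later in this section: given a base density p_0 and invertible, differentiable maps f_1, …, f_K with learnable parameters, the flow-based variational family is defined by pushing a base sample through the composition and tracking the volume change of every step.

```latex
z_0 \sim p_0(z_0), \qquad
z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0),
\qquad
\log q_K(z_K) \;=\; \log p_0(z_0) \;-\; \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
```

Each f_k must be invertible with an efficiently computable Jacobian determinant; this is what keeps both sampling and exact density evaluation tractable.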
04 When to Use
Use NFVI when a simple variational family (e.g., mean-field Gaussian) underfits the posterior, leading to biased inferences or poor predictive performance. Typical use cases include: (1) complex posteriors with strong correlations, skewness, or multimodality; (2) amortized inference in deep generative models (e.g., VAEs) where flows refine the encoder’s approximate posterior; (3) Bayesian neural networks requiring expressive approximate posteriors over weights; (4) latent-variable time-series models with intricate latent dynamics; and (5) probabilistic programs where tractable, flexible posteriors accelerate inference. If exact likelihood evaluation of q is needed (e.g., for ELBO or importance sampling), normalizing flows are attractive because they preserve tractability. Prefer coupling flows (RealNVP, Glow) when dimensionality is high and you need O(D) Jacobian determinants; use planar/radial flows for lightweight, plug-in expressivity boosts in moderate dimensions. If your problem only needs sampling (not density evaluation), you might consider implicit models, but for variational inference, flows’ exact log-density is a key advantage.
⚠️ Common Mistakes
- Ignoring invertibility constraints: Some flows (e.g., planar) need parameter constraints to ensure invertibility. Without them, log-determinants can become undefined or numerically unstable. Use stabilized parameterizations (e.g., u-hat trick) or regularization.
- Numerical instability in log-determinants: Summing many log |det J| terms can suffer from precision loss. Use log1p when appropriate, clamp scales in coupling layers, and avoid extreme activations.
- Mismatch between forward and inverse: For coupling flows, ensure the forward transform and inverse are implemented consistently; errors silently corrupt density calculations.
- Overly complex s/t networks: In RealNVP, large neural nets for scale/shift increase compute and may overfit. Start simple, add capacity gradually, and monitor ELBO generalization.
- Too few Monte Carlo samples: ELBO estimates with very few samples can have high variance. Use mini-batching and average over multiple samples per data point when feasible.
- Forgetting base density contribution: The base log-density \log p_{0}(z_{0}) is essential in \log q; omitting it leads to incorrect likelihoods.
- Treating flows as black boxes: Inspect intermediate z_k and log-dets to debug; many issues show up as exploding/vanishing log-dets or wildly distributed activations.
Key Formulas
ELBO
\mathcal{L}(\phi) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x, z) - \log q_{\phi}(z|x)\right]
Explanation: The ELBO measures how well the variational distribution matches the true posterior while rewarding data fit. Maximizing it brings q closer to p and tightens the lower bound on the model evidence.
Flow Composition
z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0), \quad z_0 \sim p_0
Explanation: A normalizing flow is a composition of invertible functions applied to a base sample. Each step incrementally reshapes the distribution.
Change of Variables across K layers
\log q_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
Explanation: The log-density of the transformed variable equals the base log-density minus the sum of log absolute Jacobian determinants. This is the core tractability property of flows.
KL Divergence
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\left[\log q(z) - \log p(z)\right]
Explanation: KL divergence quantifies how different q is from p. Minimizing the KL to the posterior (equivalently, maximizing the ELBO) drives q toward p.
Monte Carlo ELBO
\hat{\mathcal{L}} = \frac{1}{S} \sum_{s=1}^{S} \left[\log p(x, z^{(s)}) - \log q(z^{(s)}|x)\right], \quad z^{(s)} \sim q
Explanation: We estimate the ELBO by averaging over samples from the variational distribution. As S increases, the estimate concentrates around the true ELBO.
Matrix Determinant Lemma (Planar Flow)
\det\!\left(I + u\,\psi(z)^{\top}\right) = 1 + u^{\top}\psi(z), \quad \psi(z) = h'(w^{\top}z + b)\,w
Explanation: Planar flows have Jacobians of the form I + u\,\psi(z)^{\top}, whose determinant reduces to a scalar. This makes log-determinant computation O(D) instead of O(D^3).
RealNVP Affine Coupling
y_A = x_A, \qquad y_B = x_B \odot \exp(s(x_A)) + t(x_A)
Explanation: A coupling layer leaves part of the vector unchanged and affinely transforms the rest conditioned on it. The Jacobian is triangular, making the log-determinant equal to the sum of the s(\cdot) outputs.
RealNVP Log-Det
\log \left| \det \frac{\partial y}{\partial x} \right| = \sum_{i} s(x_A)_i
Explanation: Because the Jacobian is triangular with diagonal entries \exp(s(x_A)_i), the log-determinant is the sum of the scale terms. This yields O(D) complexity for the log-determinant.
Reparameterization Gradient
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z)}\left[g(z)\right] = \mathbb{E}_{p_0(\epsilon)}\left[\nabla_{\phi}\, g\!\left(z_{\phi}(\epsilon)\right)\right]
Explanation: Expressing z as a deterministic function of base noise \epsilon and parameters \phi enables gradients to flow through the sampling process, reducing variance compared to score-function estimators.
Standard Gaussian Log-Density
\log \mathcal{N}(z; 0, I_D) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\lVert z \rVert^{2}
Explanation: The log-density of a D-dimensional standard normal depends only on the squared norm of z. This term appears in the flow log-density via the change of variables.
Complexity Analysis
Planar flows cost O(D) per layer for both the transform and the log-determinant (via the matrix determinant lemma), but each layer is weak, so many layers may be needed. Affine coupling layers (RealNVP) also give O(D) log-determinants thanks to their triangular Jacobians, plus the cost of evaluating the s and t networks on half the dimensions. By contrast, a general invertible transformation would require an O(D^3) determinant per layer, which is precisely what flow architectures are designed to avoid.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utility: sample from standard normal N(0,1)
double randn(std::mt19937 &rng) {
    static thread_local std::normal_distribution<double> dist(0.0, 1.0);
    return dist(rng);
}

// Dot product of two vectors
double dot(const vector<double> &a, const vector<double> &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Planar flow: f(z) = z + u * h(w^T z + b), with h = tanh
struct PlanarFlow {
    vector<double> u, w; // size D
    double b;            // scalar

    PlanarFlow(size_t D, std::mt19937 &rng) : u(D), w(D) {
        // Initialize small random params for stability
        std::normal_distribution<double> dist(0.0, 0.1);
        for (size_t i = 0; i < D; ++i) {
            u[i] = dist(rng);
            w[i] = dist(rng);
        }
        b = dist(rng);
    }

    // Forward transform: returns z_new and adds log|det J| to log_det_J
    vector<double> forward(const vector<double> &z, double &log_det_J) const {
        double a = dot(w, z) + b;    // pre-activation
        double h = tanh(a);          // nonlinearity
        double hprime = 1.0 - h * h; // derivative of tanh

        // psi(z) = h'(a) * w
        vector<double> psi(w.size());
        for (size_t i = 0; i < w.size(); ++i) psi[i] = hprime * w[i];

        // Matrix determinant lemma: det(I + u psi^T) = 1 + u^T psi
        double det_term = 1.0 + dot(u, psi);
        // Numerical safety: avoid log(<= 0)
        double eps = 1e-8;
        log_det_J += log(fabs(det_term) + eps);

        // z_new = z + u * h
        vector<double> z_new = z;
        for (size_t i = 0; i < z.size(); ++i) z_new[i] += u[i] * h;
        return z_new;
    }
};

// Sample z0 ~ N(0, I)
vector<double> sample_base(size_t D, std::mt19937 &rng) {
    vector<double> z0(D);
    for (size_t i = 0; i < D; ++i) z0[i] = randn(rng);
    return z0;
}

// Log-density of standard Gaussian N(0, I)
double log_std_normal(const vector<double> &z) {
    double quad = 0.0;
    for (double v : z) quad += v * v;
    double D = static_cast<double>(z.size());
    return -0.5 * (D * log(2.0 * M_PI) + quad);
}

// Apply a sequence of planar flows to a base sample, returning (zK, log q(zK)).
// Retaining z0 is essential: log q_K(zK) = log p0(z0) - sum_k log|det J_k|.
pair<vector<double>, double> sample_and_log_q_planar(
    const vector<PlanarFlow> &flows, size_t D, std::mt19937 &rng) {
    vector<double> z0 = sample_base(D, rng);
    vector<double> z = z0;
    double log_det_sum = 0.0;
    for (const auto &f : flows) z = f.forward(z, log_det_sum);
    double logq = log_std_normal(z0) - log_det_sum;
    return {z, logq};
}

int main() {
    std::random_device rd;
    std::mt19937 rng(rd());

    size_t D = 4, K = 5; // 5 planar layers in 4D
    vector<PlanarFlow> flows;
    flows.reserve(K);
    for (size_t k = 0; k < K; ++k) flows.emplace_back(D, rng);

    // Draw a sample and compute its log-density under the flow q
    auto [zK, logq] = sample_and_log_q_planar(flows, D, rng);

    cout << fixed << setprecision(4);
    cout << "zK: ";
    for (double v : zK) cout << v << ' ';
    cout << "\nlog q(zK): " << logq << "\n";
    return 0;
}
This program defines a planar flow layer f(z) = z + u tanh(w^T z + b). Using the matrix determinant lemma, the log-determinant reduces to log|1 + u^T ψ(z)| with ψ(z) = h'(a) w. We sample from a standard Gaussian base, push the sample forward through K planar layers, and compute the exact log-density using the base log-density minus the accumulated log-determinants. The key point is retaining z0 to compute log p0(z0).
#include <bits/stdc++.h>
using namespace std;

// ---------- Random utils ----------
struct RNG {
    std::mt19937 gen;
    RNG() : gen(std::random_device{}()) {}
    double randn() {
        static normal_distribution<double> nd(0.0, 1.0);
        return nd(gen);
    }
};

// ---------- Linear algebra helpers (naive) ----------
vector<double> matvec(const vector<vector<double>>& W, const vector<double>& x) {
    size_t R = W.size(), C = x.size();
    vector<double> y(R, 0.0);
    for (size_t i = 0; i < R; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < C; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

vector<double> add(const vector<double>& a, const vector<double>& b) {
    vector<double> c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

vector<double> tanh_vec(const vector<double>& a) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = tanh(a[i]);
    return y;
}

// Elementwise operations
vector<double> exp_vec(const vector<double>& a) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = exp(a[i]);
    return y;
}

vector<double> hadamard(const vector<double>& a, const vector<double>& b) {
    vector<double> y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = a[i] * b[i];
    return y;
}

// ---------- RealNVP layer ----------
// Partition: first D/2 dims are A, remaining are B (mask alternates across layers)
struct RealNVPLayer {
    size_t D, Dh;   // D total, D/2 per partition
    bool flip;      // if true, swap A and B roles
    // Simple linear functions for s(.) and t(.) for demonstrative purposes
    vector<vector<double>> W_s, W_t; // (Dh x Dh)
    vector<double> b_s, b_t;         // (Dh)

    RealNVPLayer(size_t D_, bool flip_, RNG &rng)
        : D(D_), Dh(D_ / 2), flip(flip_),
          W_s(Dh, vector<double>(Dh)), W_t(Dh, vector<double>(Dh)),
          b_s(Dh), b_t(Dh) {
        normal_distribution<double> nd(0.0, 0.2);
        for (size_t i = 0; i < Dh; ++i) {
            b_s[i] = 0.0; b_t[i] = 0.0;
            for (size_t j = 0; j < Dh; ++j) {
                W_s[i][j] = nd(rng.gen) * 0.1; // keep small for stability
                W_t[i][j] = nd(rng.gen) * 0.1;
            }
        }
    }

    // Forward pass: returns y, and adds log|det J| to log_det_J
    vector<double> forward(const vector<double>& x, double &log_det_J) const {
        vector<double> y(D);
        // Split x into A and B
        vector<double> xA(Dh), xB(Dh);
        if (!flip) {
            for (size_t i = 0; i < Dh; ++i) xA[i] = x[i];
            for (size_t i = 0; i < Dh; ++i) xB[i] = x[Dh + i];
        } else {
            for (size_t i = 0; i < Dh; ++i) xA[i] = x[Dh + i];
            for (size_t i = 0; i < Dh; ++i) xB[i] = x[i];
        }
        // s, t as simple (tanh(Wx + b), Wx + b)
        vector<double> s = tanh_vec(add(matvec(W_s, xA), b_s));
        vector<double> t = add(matvec(W_t, xA), b_t);
        vector<double> exp_s = exp_vec(s);
        vector<double> yB = add(hadamard(xB, exp_s), t);

        // Accumulate log-det: sum(s)
        double sum_s = 0.0;
        for (double si : s) sum_s += si;
        log_det_J += sum_s;

        // Merge back
        if (!flip) {
            for (size_t i = 0; i < Dh; ++i) y[i] = xA[i];
            for (size_t i = 0; i < Dh; ++i) y[Dh + i] = yB[i];
        } else {
            for (size_t i = 0; i < Dh; ++i) y[i] = yB[i];
            for (size_t i = 0; i < Dh; ++i) y[Dh + i] = xA[i];
        }
        return y;
    }
};

// Base sampling and log-density (standard normal)
vector<double> sample_base(size_t D, RNG &rng) {
    vector<double> z(D);
    for (size_t i = 0; i < D; ++i) z[i] = rng.randn();
    return z;
}

double log_std_normal(const vector<double>& z) {
    double quad = 0.0;
    for (double v : z) quad += v * v;
    double D = static_cast<double>(z.size());
    return -0.5 * (D * log(2.0 * M_PI) + quad);
}

// Flow sampling: return (zK, log q(zK))
pair<vector<double>, double> sample_and_log_q_realnvp(
    const vector<RealNVPLayer>& layers, size_t D, RNG &rng) {
    vector<double> z0 = sample_base(D, rng);
    vector<double> z = z0;
    double log_det_sum = 0.0;
    for (const auto &L : layers) z = L.forward(z, log_det_sum);
    double logq = log_std_normal(z0) - log_det_sum;
    return {z, logq};
}

// Toy model: p(z) = N(0, I), p(x|z) = N(W z, sigma^2 I)
// log p(x|z) = -0.5 [ M log(2 pi sigma^2) + ||x - W z||^2 / sigma^2 ]
double log_likelihood_gaussian(const vector<double>& x,
                               const vector<vector<double>>& W,
                               const vector<double>& z, double sigma) {
    vector<double> mu = matvec(W, z);
    double quad = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double d = x[i] - mu[i];
        quad += d * d;
    }
    return -0.5 * (x.size() * log(2.0 * M_PI * sigma * sigma)
                   + quad / (sigma * sigma));
}

int main() {
    RNG rng;
    const size_t D = 4; // latent dim (even)
    const size_t K = 4; // RealNVP layers

    // Build alternating-mask RealNVP stack
    vector<RealNVPLayer> layers;
    layers.reserve(K);
    for (size_t k = 0; k < K; ++k) layers.emplace_back(D, (k % 2 == 1), rng);

    // Toy data and likelihood parameters
    const size_t M = 3; // observation dim
    vector<vector<double>> W(M, vector<double>(D, 0.0));
    // Simple W: first M dims pick from z
    for (size_t i = 0; i < M; ++i) if (i < D) W[i][i] = 1.0;
    vector<double> x = {0.5, -1.2, 0.3};
    double sigma = 0.5;

    // Monte Carlo ELBO estimate with S samples
    size_t S = 1000;
    double elbo_sum = 0.0;
    for (size_t s = 0; s < S; ++s) {
        auto [z, logq] = sample_and_log_q_realnvp(layers, D, rng);
        double logp_z = log_std_normal(z); // prior
        double logp_x_given_z = log_likelihood_gaussian(x, W, z, sigma);
        elbo_sum += (logp_x_given_z + logp_z - logq);
    }
    double elbo_mc = elbo_sum / static_cast<double>(S);

    cout << fixed << setprecision(6);
    cout << "Monte Carlo ELBO estimate: " << elbo_mc << "\n";
    return 0;
}
This example implements a lightweight RealNVP with alternating affine coupling layers. Scale s(.) and shift t(.) are simple linear functions (with tanh for s to keep it bounded). The log-determinant is the sum of s’s outputs, thanks to a triangular Jacobian. We then compute a Monte Carlo ELBO for a toy model with a standard normal prior and Gaussian likelihood p(x|z) = N(Wz, σ²I). Each ELBO term uses exact log-densities: log p(x|z), log p(z), and log q(z) from the flow.