Information Bottleneck Theory
Key Points
- Information Bottleneck (IB) studies how to compress an input X into a representation Z that still preserves what is needed to predict Y.
- The core trade-off is to maximize I(Z;Y) while minimizing I(X;Z), typically via the Lagrangian L = I(Z;Y) - \beta I(X;Z).
- Deterministic continuous networks make I(X;Z) ill-defined (often infinite), so stochastic encoders or discretization are needed to reason about mutual information.
- The Variational Information Bottleneck (VIB) replaces intractable information terms with tractable bounds using KL divergence and the reparameterization trick.
- The information plane plots I(X;Z) versus I(Z;Y) over training, often showing a fitting phase followed by a compression phase, though this is debated.
- Compression observed with binning can be an artifact; careful estimators or analytic formulas (e.g., Gaussian channel) should be used when possible.
- Choosing \beta controls how much the model prioritizes compression versus task performance, analogous to rate–distortion trade-offs.
- IB can encourage robustness, generalization, privacy, and compact representations in deep learning systems.
Prerequisites
- Probability distributions and expectations — Mutual information, entropy, and KL divergence are defined using probabilities and expected values.
- Entropy and KL divergence — IB and VIB objectives directly use entropy decompositions and KL regularization.
- Mutual information — The core quantities I(X;Z) and I(Z;Y) define the IB trade-off.
- Gaussian distributions and linear models — Closed-form MI and KL formulas for Gaussian channels enable analytic checks and stable training.
- Logistic regression and cross-entropy — VIB often uses a classifier p(y|z) trained with cross-entropy, linking utility to H(Y|Z).
- Stochastic gradient descent and reparameterization — Training VIB requires sampling z via reparameterization to compute low-variance gradients.
- C++ programming basics — Implementing estimators and toy training loops requires control structures, RNG, and numerical stability.
Detailed Explanation
01 Overview
The Information Bottleneck (IB) principle is a framework for learning useful, compact representations. Given an input X and a target Y, we create an intermediate representation Z that should forget as much about X as possible while keeping what is necessary to predict Y. This is captured by two mutual information terms: I(X;Z) measures how much Z still remembers about X (compression cost), and I(Z;Y) measures how informative Z is for the target (prediction utility). The core idea is to search for representations that optimally balance these objectives.
Formally, IB poses an optimization over stochastic encoders p(z|x): maximize I(Z;Y) subject to a constraint on I(X;Z), or equivalently optimize a Lagrangian L = I(Z;Y) - \beta I(X;Z). In deep learning, Z is typically the activation of a hidden layer, and the IB principle suggests that good layers discard irrelevant details of X while retaining label-relevant information.
In practice, computing mutual information for high-dimensional, continuous neural activations is difficult. The Variational Information Bottleneck (VIB) addresses this by introducing a variational encoder q_\phi(z|x), a decoder p_\theta(y|z), and a simple prior r(z). This yields a tractable surrogate objective using KL divergence and the reparameterization trick, enabling stochastic gradient training. The information plane visualizes the trade-off during training, but interpreting it requires care because naive estimators (like binning) can be misleading, especially for deterministic networks.
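Two standard variational bounds make the surrogate concrete (stated here in the notation above, without derivation):

I(Z;Y) = H(Y) - H(Y|Z) \ge H(Y) - E_{p(x,y)} E_{q_\phi(z|x)}[ -log p_\theta(y|z) ]
I(X;Z) \le E_{p(x)}[ D_KL( q_\phi(z|x) || r(z) ) ]

Because H(Y) is a constant of the data, minimizing the expected cross-entropy tightens the first bound, while the KL term caps the second; weighting the KL by \beta yields the VIB objective listed under Key Formulas.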
IB provides a unifying language connecting compression, generalization, robustness, and even privacy, guiding the design of architectures and regularizers that emphasize task-relevant information.
02 Intuition & Analogies
Imagine summarizing a long article (X) into a short abstract (Z) that still lets a reader answer specific questions (Y). If the abstract copies the article, it's not a summary (high I(X;Z)); if it omits key points, the reader can't answer questions (low I(Z;Y)). A great abstract keeps just the right details to solve the task—this is the bottleneck principle.
Think of a water pipe with a valve (the bottleneck). The source is noisy water with minerals (X), and the goal is to produce drinkable water (Y). The valve controls how much water (information) passes. If you open it fully, impurities flow too (Z keeps too much from X). If you close it too much, you get a trickle that’s too pure but insufficient for demand (Z loses task-relevant content). The optimal setting allows enough flow to satisfy needs while filtering out junk—maximizing usefulness while minimizing redundancy.
In a deep network, each hidden layer acts like a new filter. Early layers pick out edges or simple patterns; deeper layers retain higher-level concepts like object categories. Over training, layers may first memorize details to fit the data (high I(X;Z)); later, optimization and noise push them to forget irrelevant specifics while preserving what predicts the label (higher I(Z;Y) relative to I(X;Z)). But measuring this process precisely is tricky. If you measure information with coarse buckets (binning), it might look like compression even when, mathematically, a deterministic continuous layer can carry infinite information. To study and encourage true compression, we inject stochasticity (noise) and use well-founded approximations, as in the Variational Information Bottleneck.
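A tiny, self-contained illustration of this point, using the Gaussian-channel formula that appears later under Key Formulas (the weight and noise levels are arbitrary choices for the demo):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Z = w*X + noise, with X ~ N(0,1). With noise, I(X;Z) = 0.5*log(1 + w^2/sigma^2) is finite;
    // as sigma -> 0 the mapping becomes deterministic and the mutual information diverges.
    double w = 1.0;
    for (double sigma : {1.0, 0.1, 0.01, 0.001, 1e-6}) {
        double mi = 0.5 * std::log(1.0 + (w * w) / (sigma * sigma));
        std::printf("sigma=%g  I(X;Z)=%.3f nats\n", sigma, mi);
    }
    return 0;
}
```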
03 Formal Definition
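Consistent with the Lagrangian introduced in the Overview, the problem can be stated compactly as follows. Let p(x, y) be the data distribution and require the Markov chain Y \leftrightarrow X \leftrightarrow Z, i.e., Z is generated from X alone by a stochastic encoder p(z|x). The IB problem is

maximize over p(z|x):  I(Z;Y)  subject to  I(X;Z) \le R,

or, in unconstrained form,

maximize over p(z|x):  L = I(Z;Y) - \beta I(X;Z),  with \beta \ge 0.

Sweeping \beta (or the rate budget R) traces the IB curve of optimal trade-offs, the information-theoretic analogue of a rate–distortion curve.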
04 When to Use
Use IB or VIB when you want representations that are compact yet predictive. Concrete settings include:
- Regularization in supervised learning: Encouraging Z to discard spurious correlations can improve generalization and robustness to distribution shifts.
- Privacy and compression: Limiting I(X;Z) helps reduce leakage of sensitive input details and can enable smaller transmitted representations in edge/cloud systems.
- Noisy or high-dimensional data: When inputs contain substantial nuisance variation (e.g., backgrounds in images, sensor noise), IB promotes invariances by keeping only task-relevant signal.
- Multi-task or transfer learning: A well-compressed Z that emphasizes core factors may transfer better across related tasks.
- Interpretable training diagnostics: The information plane (I(X;Z) vs. I(Z;Y)) can visualize how layers evolve during training, highlighting fitting and potential compression phases.
However, prefer VIB or other principled stochastic formulations over naive mutual information estimates for deep, continuous networks. If you must inspect the information plane with discretization, validate findings with alternative estimators or analytic formulas (e.g., for linear-Gaussian layers); the second code example below demonstrates such a cross-check.
⚠️ Common Mistakes
- Treating deterministic continuous layers as having finite I(X;Z): For invertible or nearly invertible mappings with continuous noise-free data, I(X;Z) is infinite or undefined. Remedy: add stochasticity (e.g., Gaussian encoders) or analyze discrete/quantized cases.
- Over-interpreting binning-based compression: Discretizing activations often shows decreasing I(X;Z) during training, but this can be a measurement artifact. Cross-check with analytic formulas (Gaussian channel) or variational bounds.
- Confusing \beta’s role: \beta is a Lagrange multiplier controlling compression strength. Too large a \beta over-compresses and hurts accuracy; too small a \beta collapses to standard ERM (empirical risk minimization) with little compression. Tune \beta and monitor the information plane or validation metrics; see the sweep sketch after this list.
- Ignoring the choice of prior r(z): The KL term depends on r(z). A poor prior can misestimate compression or hinder optimization. Try standard normal or learned priors and monitor performance.
- Numerical instability when estimating MI: High-dimensional histograms, heavy tails, and small bins produce high-variance estimates. Prefer k-NN or Gaussian estimators when applicable, and report units (nats vs. bits).
- Misreading I(Z;Y) from accuracy alone: Accuracy can saturate while I(Z;Y) still increases or vice versa. Use entropy-based measures or calibrated probabilities to approximate H(Y|Z).
- Forgetting the data processing inequality: for any representation Z computed from X, I(Z;Y) ≤ I(X;Y), so adding layers cannot create information about Y; it can only re-distribute or discard it. Don’t expect deeper networks to extract new information about Y from X; they repackage what is already there.
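To make the \beta tuning concrete, here is a minimal sweep harness. It is only a sketch: train_vib and VIBResult are hypothetical placeholders standing in for an actual training routine (for example, the 1D VIB loop in the Code Examples section), and the stub body simply returns zeros so the sketch compiles.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical summary of one VIB training run.
struct VIBResult {
    double val_accuracy; // validation accuracy of the decoder
    double I_xz;         // estimated or analytic I(X;Z) in nats
    double I_zy;         // approximate I(Z;Y) in nats
};

// Placeholder: replace with a real training loop (e.g., the first code example below).
// Here it returns dummy values so the sketch compiles and runs.
VIBResult train_vib(double beta) {
    (void)beta;
    return {0.0, 0.0, 0.0};
}

int main() {
    // Sweep beta on a log grid and record the compression/utility trade-off.
    std::vector<double> betas = {1e-4, 1e-3, 1e-2, 1e-1, 1.0};
    std::printf("#beta val_acc I(X;Z) I(Z;Y)\n");
    for (double beta : betas) {
        VIBResult r = train_vib(beta);
        std::printf("%g %.4f %.4f %.4f\n", beta, r.val_accuracy, r.I_xz, r.I_zy);
    }
    // A common heuristic: pick the largest beta whose validation accuracy is still
    // acceptable, giving the most compressed Z that preserves task performance.
    return 0;
}
```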
Key Formulas
Mutual Information
I(A;B) = H(A) - H(A|B) = H(B) - H(B|A) = E_{p(a,b)}[ log( p(a,b) / (p(a) p(b)) ) ]
Explanation: Mutual information is the reduction in uncertainty about A after observing B (and vice versa). It is symmetric and non-negative.
IB Lagrangian
L = I(Z;Y) - \beta I(X;Z)
Explanation: This objective encodes the trade-off between making Z predictive of Y and compressing information from X. The parameter \beta controls compression strength.
KL Divergence
D_KL(q || p) = E_{q}[ log( q(x) / p(x) ) ]
Explanation: KL measures how one distribution q differs from another p. It is zero only if q equals p almost everywhere and is used as a regularizer in VIB.
VIB Objective
L_VIB = E_{p(x,y)} E_{q_\phi(z|x)}[ -log p_\theta(y|z) ] + \beta E_{p(x)}[ D_KL( q_\phi(z|x) || r(z) ) ]
Explanation: The first term upper-bounds H(Y|Z) and encourages predictive Z. The second term penalizes information Z carries about X via KL to a simple prior r(z).
Bivariate Gaussian MI
I(X;Z) = -(1/2) log( 1 - \rho^2 )
Explanation: For jointly Gaussian X and Z with correlation \rho, mutual information has this closed form. It is useful when encoders are approximately linear-Gaussian.
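As a quick numerical check of this closed form: \rho = 0.9 gives I(X;Z) = -(1/2) log(1 - 0.81) ≈ 0.83 nats, while \rho = 0.99 already gives about 1.96 nats, so shared information grows rapidly as the correlation approaches 1.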
Gaussian Channel MI
Z = wX + \epsilon, \epsilon ~ N(0, \sigma^2):  I(X;Z) = (1/2) log( 1 + w^2 Var(X) / \sigma^2 )
Explanation: For Z = wX + \epsilon with \epsilon ~ N(0, \sigma^2), this gives mutual information in nats. It equals half the log of one plus the signal-to-noise ratio.
KL Between Univariate Gaussians
D_KL( N(\mu, \sigma^2) || N(0, 1) ) = (1/2) ( \mu^2 + \sigma^2 - 1 - log \sigma^2 )
Explanation: Closed-form KL used in VIB when the encoder is Gaussian and the prior is standard normal. Enables efficient gradient computation.
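A minimal helper matching this closed form, handy for unit-testing the KL term used in the VIB training loop below (the function name is illustrative):

```cpp
#include <cmath>
#include <cstdio>

// KL( N(mu, sigma^2) || N(0, 1) ) in nats, per the closed form above.
double kl_gaussian_to_std_normal(double mu, double sigma) {
    double s2 = sigma * sigma;
    return 0.5 * (mu * mu + s2 - 1.0 - std::log(s2));
}

int main() {
    // Sanity checks: KL is zero when the encoder matches the prior,
    // and grows as the mean shifts or the variance departs from 1.
    std::printf("%f\n", kl_gaussian_to_std_normal(0.0, 1.0)); // 0.000000
    std::printf("%f\n", kl_gaussian_to_std_normal(1.0, 1.0)); // 0.500000
    std::printf("%f\n", kl_gaussian_to_std_normal(0.0, 0.5)); // ~0.318147
    return 0;
}
```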
Binned Entropy
H \approx - \sum_{b=1}^{B} p_b log p_b,  with  p_b = n_b / N
Explanation: Histogram-based entropy for B bins with probabilities p_b. Used in simple MI estimators but sensitive to bin count and sample size.
Binned Mutual Information
I \approx \sum_{i,j} p_{ij} log( p_{ij} / (p_i p_j) ),  with  p_{ij} = n_{ij} / N
Explanation: Histogram-based MI from joint bin counts n_{ij}. Useful for visualization but can produce artifacts with continuous, deterministic mappings.
Utility via Entropy
I(Z;Y) = H(Y) - H(Y|Z),  with  H(Y|Z) \approx E[ -log p(y|z) ]
Explanation: For classification, H(Y|Z) can be approximated by the expected cross-entropy under a calibrated decoder p(y|z). This links predictive loss to mutual information.
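For example, with balanced binary labels, H(Y) = log 2 ≈ 0.693 nats; if a calibrated decoder attains an average cross-entropy of 0.50 nats, the representation carries roughly 0.693 - 0.50 ≈ 0.19 nats about the label. This is precisely the approximation implemented by approximate_I_ZY_from_decoder in the first code example.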
Data Processing Inequality
For any Z satisfying the Markov chain Y \leftrightarrow X \leftrightarrow Z:  I(Z;Y) \le I(X;Y)
Explanation: Processing cannot create information about Y beyond what X already contains. Any representation Z can only lose information about Y.
Complexity Analysis
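For the estimators and training loops used in the code examples below: the binned MI estimator fills a B×B joint histogram in O(N) time and sums it in O(B^2) time with O(B^2) memory; the 1D VIB training loop costs O(epochs × N) time and O(N) memory; and the analytic Gaussian-channel MI and Gaussian KL formulas are O(1) per evaluation. The k-NN estimators mentioned under Common Mistakes typically cost O(N log N) per evaluation when a spatial index is used.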
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sigmoid helper
static inline double sigmoid(double x){ return 1.0 / (1.0 + exp(-x)); }

// Binary cross-entropy for y in {0,1}
static inline double bce(double y, double p){
    const double eps = 1e-12;
    p = min(max(p, eps), 1.0 - eps);
    return -(y * log(p) + (1.0 - y) * log(1.0 - p));
}

// Generate synthetic 1D data: X ~ N(0,1), Y = 1{X + noise > 0}
struct Dataset {
    vector<double> x;
    vector<int> y; // 0 or 1
};

Dataset make_data(size_t N, double label_noise_std, uint64_t seed=42){
    Dataset d; d.x.resize(N); d.y.resize(N);
    mt19937_64 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    normal_distribution<double> N0n(0.0, label_noise_std);
    for(size_t i=0;i<N;++i){
        double xi = N01(rng);
        double yi_cont = xi + N0n(rng);
        int yi = yi_cont > 0.0 ? 1 : 0;
        d.x[i] = xi; d.y[i] = yi;
    }
    return d;
}

// Compute entropy of binary labels in nats
double binary_entropy_from_counts(size_t c1, size_t c0){
    double N = double(c1 + c0);
    if(N == 0) return 0.0;
    double p1 = c1 / N;
    double p0 = c0 / N;
    auto Hterm = [](double p){ if(p<=0.0 || p>=1.0) return 0.0; return -p*log(p); };
    return Hterm(p1) + Hterm(p0);
}

// Approximate I(Z;Y) via H(Y) - E[ cross-entropy( y, p(y=1|z) ) ]
// Using the model's predicted probability as a proxy for p(y|z) (calibration assumed)
double approximate_I_ZY_from_decoder(const vector<int>& y, const vector<double>& z, double a, double b){
    size_t N = y.size();
    size_t c1 = accumulate(y.begin(), y.end(), size_t(0));
    size_t c0 = N - c1;
    double H_Y = binary_entropy_from_counts(c1, c0);
    double ce = 0.0;
    for(size_t i=0;i<N;++i){
        double p = sigmoid(a*z[i] + b);
        ce += bce(y[i], p);
    }
    ce /= double(N);
    // I(Z;Y) ≈ H(Y) - H(Y|Z); H(Y|Z) approximated by cross-entropy
    double I_approx = H_Y - ce;
    if(I_approx < 0.0) I_approx = 0.0; // clip due to approximation error
    return I_approx;
}

// Discretized (binned) MI for 1D X and Z
// Note: This estimator is biased and can create artifacts for deterministic mappings.
double binned_MI_1d(const vector<double>& X, const vector<double>& Z, int B){
    size_t N = X.size();
    if(N == 0) return 0.0;
    double xmin = *min_element(X.begin(), X.end());
    double xmax = *max_element(X.begin(), X.end());
    double zmin = *min_element(Z.begin(), Z.end());
    double zmax = *max_element(Z.begin(), Z.end());
    if(xmax == xmin) xmax = xmin + 1e-6;
    if(zmax == zmin) zmax = zmin + 1e-6;
    vector<vector<size_t>> joint(B, vector<size_t>(B, 0));
    vector<size_t> mx(B,0), mz(B,0);

    auto bin_index = [](double v, double vmin, double vmax, int B){
        double t = (v - vmin) / (vmax - vmin);
        int idx = int(floor(t * B));
        if(idx < 0) idx = 0;
        if(idx >= B) idx = B-1;
        return idx;
    };

    for(size_t i=0;i<N;++i){
        int ix = bin_index(X[i], xmin, xmax, B);
        int iz = bin_index(Z[i], zmin, zmax, B);
        joint[ix][iz]++; mx[ix]++; mz[iz]++;
    }
    double NN = double(N);
    double I = 0.0;
    for(int i=0;i<B;++i){
        for(int j=0;j<B;++j){
            if(joint[i][j]==0) continue;
            double pxy = joint[i][j]/NN;
            double px = mx[i]/NN;
            double pz = mz[j]/NN;
            I += pxy * log(pxy/(px*pz));
        }
    }
    return I; // nats
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Create data
    size_t N = 4000;
    Dataset data = make_data(N, /*label_noise_std=*/0.5, /*seed=*/123);

    // 2) Parameters: encoder z = w*x + s*eps, eps ~ N(0,1); decoder p(y=1|z) = sigmoid(a*z + b)
    double w = 0.1;          // encoder weight
    double log_s = log(0.8); // log std to ensure positivity
    double a = 0.1;          // decoder weight
    double b = 0.0;          // decoder bias
    double beta = 1e-2;      // IB trade-off weight

    // 3) Optimizer hyperparams
    double lr = 1e-2;
    int epochs = 200;
    int B_bins = 30; // for binned MI visualization

    // Random source for reparameterization noise
    mt19937_64 rng(2024);
    normal_distribution<double> N01(0.0, 1.0);

    // Precompute Var(X)
    double mean_x = accumulate(data.x.begin(), data.x.end(), 0.0) / double(N);
    double var_x = 0.0;
    for(double xi: data.x){ var_x += (xi - mean_x)*(xi - mean_x); }
    var_x /= double(N);

    vector<double> zbuf(N, 0.0);

    cout << fixed << setprecision(6);
    cout << "#epoch loss acc Ixz_gauss Ixz_binned Izy_approx\n";

    for(int ep=1; ep<=epochs; ++ep){
        double s = exp(log_s);
        // Accumulate gradients
        double gw=0.0, glogs=0.0, ga=0.0, gb=0.0;
        double loss=0.0; size_t correct=0;

        for(size_t i=0;i<N;++i){
            double x = data.x[i];
            int y = data.y[i];
            double eps = N01(rng);
            double z = w * x + s * eps; // reparameterization sample
            double t = a * z + b;
            double p = sigmoid(t);
            double ce = bce(double(y), p);
            // KL(q(z|x)||N(0,1)) for this x (Gaussian closed form)
            double mu = w * x;
            double KL = 0.5 * (mu*mu + s*s - 1.0 - log(s*s));
            double L = ce + beta * KL;
            loss += L;
            // accuracy
            int yhat = (p >= 0.5) ? 1 : 0;
            if(yhat == y) correct++;
            // Gradients
            double dL_dt = (p - double(y)); // dCE/dt
            double dL_dz = dL_dt * a;
            double dL_dw = dL_dz * x + beta * (mu * x); // mu*x = w*x^2
            double dL_ds = dL_dz * eps + beta * (s - 1.0/s);
            double dL_da = dL_dt * z;
            double dL_db = dL_dt;
            // chain for log_s
            glogs += dL_ds * s;
            gw += dL_dw; ga += dL_da; gb += dL_db;
            // buffer z for info estimates later (use noiseless mean to reduce variance)
            zbuf[i] = mu; // store encoder mean w*x for analysis
        }
        loss /= double(N);
        double acc = double(correct) / double(N);

        // Parameter update (SGD)
        double scale = lr / double(N);
        w -= scale * gw;
        log_s -= scale * glogs;
        a -= scale * ga;
        b -= scale * gb;

        double s_now = exp(log_s);
        // Analytic I(X;Z) for Gaussian channel with encoder mean and noise std s
        double Ixz_gauss = 0.5 * log(1.0 + (w*w * var_x) / (s_now*s_now));

        // Binned I(X;Z): use Z = w*X (deterministic), which tends to underestimate true MI in continuous case
        vector<double> Xvec = data.x;
        vector<double> Zdet(N);
        for(size_t i=0;i<N;++i) Zdet[i] = w * data.x[i];
        double Ixz_binned = binned_MI_1d(Xvec, Zdet, B_bins);

        // Approximate I(Z;Y) using decoder probabilities and noiseless mean z=mu (variance ignored here)
        double Izy_approx = approximate_I_ZY_from_decoder(data.y, zbuf, a, b);

        cout << ep << " " << loss << " " << acc << " "
             << Ixz_gauss << " " << Ixz_binned << " " << Izy_approx << "\n";
    }

    // Note: For a proper information plane, you would record these values per epoch and plot them externally.
    return 0;
}
```
This program trains a 1D Variational Information Bottleneck model: a Gaussian encoder z = w x + s ε and a logistic decoder p(y|z) = sigmoid(a z + b). The loss is cross-entropy plus β times the KL divergence between the encoder and a standard normal prior. Training uses SGD with reparameterization. After each epoch, it computes: (1) analytic I(X;Z) for the linear-Gaussian channel, (2) binned I(X;Z) using a histogram over the deterministic mapping z = w x (to illustrate discretization artifacts), and (3) an approximation of I(Z;Y) as H(Y) minus the average cross-entropy under the decoder. The output can be plotted to form an information plane. This minimal C++ example avoids external libraries and highlights how VIB operationalizes the IB trade-off.
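To produce an information-plane plot from this program, one option is to compile it with GCC (the <bits/stdc++.h> convenience header is specific to libstdc++), redirect standard output to a text file, and plot the Ixz_gauss column against the Izy_approx column with any plotting tool; each epoch then contributes one point, tracing the trajectory of the representation during training.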
```cpp
#include <bits/stdc++.h>
using namespace std;

// Histogram-based 1D-1D MI (nats)
double binned_MI_1d(const vector<double>& X, const vector<double>& Z, int B){
    size_t N = X.size();
    if(N == 0) return 0.0;
    double xmin = *min_element(X.begin(), X.end());
    double xmax = *max_element(X.begin(), X.end());
    double zmin = *min_element(Z.begin(), Z.end());
    double zmax = *max_element(Z.begin(), Z.end());
    if(xmax == xmin) xmax = xmin + 1e-6;
    if(zmax == zmin) zmax = zmin + 1e-6;
    vector<vector<size_t>> joint(B, vector<size_t>(B, 0));
    vector<size_t> mx(B,0), mz(B,0);

    auto bin_index = [](double v, double vmin, double vmax, int B){
        double t = (v - vmin) / (vmax - vmin);
        int idx = int(floor(t * B));
        if(idx < 0) idx = 0;
        if(idx >= B) idx = B-1;
        return idx;
    };

    for(size_t i=0;i<N;++i){
        int ix = bin_index(X[i], xmin, xmax, B);
        int iz = bin_index(Z[i], zmin, zmax, B);
        joint[ix][iz]++; mx[ix]++; mz[iz]++;
    }
    double NN = double(N);
    double I = 0.0;
    for(int i=0;i<B;++i){
        for(int j=0;j<B;++j){
            if(joint[i][j]==0) continue;
            double pxy = joint[i][j]/NN;
            double px = mx[i]/NN;
            double pz = mz[j]/NN;
            I += pxy * log(pxy/(px*pz));
        }
    }
    return I;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate X ~ N(0,1)
    size_t N = 20000;
    mt19937_64 rng(7);
    normal_distribution<double> N01(0.0, 1.0);
    vector<double> X(N);
    for(size_t i=0;i<N;++i) X[i] = N01(rng);

    // Deterministic mapping Z = w * X (no noise): true MI for continuous variables is infinite (if w != 0)
    double w = 2.0;
    vector<double> Zdet(N);
    for(size_t i=0;i<N;++i) Zdet[i] = w * X[i];

    // Slightly noisy mapping Z = w * X + eps: finite MI with Gaussian formula
    normal_distribution<double> N0s(0.0, 0.2);
    vector<double> Znoisy(N);
    for(size_t i=0;i<N;++i) Znoisy[i] = w * X[i] + N0s(rng);

    // Evaluate binned MI for increasing bin counts
    cout << fixed << setprecision(6);
    cout << "#B I_binned(X;Zdet) I_binned(X;Znoisy)\n";
    for(int B=4; B<=128; B*=2){
        double I_det = binned_MI_1d(X, Zdet, B);
        double I_noisy = binned_MI_1d(X, Znoisy, B);
        cout << B << " " << I_det << " " << I_noisy << "\n";
    }

    // Also print analytic MI for the noisy linear-Gaussian case: I = 0.5 * log(1 + w^2 Var(X)/Var(noise))
    // Var(X) ≈ 1 for N(0,1), Var(noise) = 0.2^2 = 0.04
    double varX = 0.0;
    double meanX = accumulate(X.begin(), X.end(), 0.0)/double(N);
    for(double xi: X) varX += (xi-meanX)*(xi-meanX);
    varX /= double(N);
    double varNoise = 0.2 * 0.2;
    double I_gauss = 0.5 * log(1.0 + (w*w * varX)/varNoise);
    cerr << "Analytic Gaussian MI (noisy): " << I_gauss << " nats\n";

    // Expected outcome:
    // - I_binned(X;Zdet) increases with B (artifact), even though true MI is infinite for deterministic continuous mapping.
    // - I_binned(X;Znoisy) stabilizes with B and is consistent with the analytic Gaussian MI.

    return 0;
}
```
This utility computes histogram-based mutual information for univariate X and Z and demonstrates the binning artifact. For a deterministic continuous mapping Z = wX, true MI is infinite, but the binned estimate remains finite and tends to increase with the number of bins, misleadingly suggesting changes in information. Adding small Gaussian noise produces a finite MI that can be checked against the analytic Gaussian-channel formula. Use this example to sanity-check your information-plane measurements.