📚 Theory · Advanced

Information Bottleneck Theory

Key Points

  • Information Bottleneck (IB) studies how to compress an input X into a representation Z that still preserves what is needed to predict Y.
  • The core trade-off is to maximize I(Z;Y) while minimizing I(X;Z), typically via the Lagrangian L = I(Z;Y) - \beta I(X;Z).
  • Deterministic continuous networks make I(X;Z) ill-defined (often infinite), so stochastic encoders or discretization are needed to reason about mutual information.
  • The Variational Information Bottleneck (VIB) replaces intractable information terms with tractable bounds using KL divergence and the reparameterization trick.
  • The information plane plots I(X;Z) versus I(Z;Y) over training, often showing a fitting phase followed by a compression phase, though this is debated.
  • Compression observed with binning can be an artifact; careful estimators or analytic formulas (e.g., Gaussian channel) should be used when possible.
  • Choosing \beta controls how much the model prioritizes compression versus task performance, analogous to rate–distortion trade-offs.
  • IB can encourage robustness, generalization, privacy, and compact representations in deep learning systems.

Prerequisites

  • Probability distributions and expectations: Mutual information, entropy, and KL divergence are defined using probabilities and expected values.
  • Entropy and KL divergence: IB and VIB objectives directly use entropy decompositions and KL regularization.
  • Mutual information: The core quantities I(X;Z) and I(Z;Y) define the IB trade-off.
  • Gaussian distributions and linear models: Closed-form MI and KL formulas for Gaussian channels enable analytic checks and stable training.
  • Logistic regression and cross-entropy: VIB often uses a classifier p(y|z) trained with cross-entropy, linking utility to H(Y|Z).
  • Stochastic gradient descent and reparameterization: Training VIB requires sampling z via reparameterization to compute low-variance gradients.
  • C++ programming basics: Implementing estimators and toy training loops requires control structures, RNG, and numerical stability.

Detailed Explanation

01 Overview

The Information Bottleneck (IB) principle is a framework for learning useful, compact representations. Given an input X and a target Y, we create an intermediate representation Z that should forget as much about X as possible while keeping what is necessary to predict Y. This is captured by two mutual information terms: I(X;Z) measures how much Z still remembers about X (compression cost), and I(Z;Y) measures how informative Z is for the target (prediction utility). The core idea is to search for representations that optimally balance these objectives.

Formally, IB poses an optimization over stochastic encoders p(z|x): maximize I(Z;Y) subject to a constraint on I(X;Z), or equivalently optimize a Lagrangian L = I(Z;Y) - \beta I(X;Z). In deep learning, Z is typically the activation of a hidden layer, and the IB principle suggests that good layers discard irrelevant details of X while retaining label-relevant information.
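
Written out, the constrained and Lagrangian forms are (a standard restatement; sweeping \beta traces out the frontier of achievable (I(X;Z), I(Z;Y)) pairs, often called the IB curve):

maximize over p(z|x):  I(Z;Y)  subject to  I(X;Z) ≤ R
equivalently, for a suitable \beta ≥ 0:  maximize  L = I(Z;Y) - \beta I(X;Z)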

In practice, computing mutual information for high-dimensional, continuous neural activations is difficult. The Variational Information Bottleneck (VIB) addresses this by introducing a variational encoder q_\phi(z|x), a decoder p_\theta(y|z), and a simple prior r(z). This yields a tractable surrogate objective using KL divergence and the reparameterization trick, enabling stochastic gradient training. The information plane visualizes the trade-off during training, but interpreting it requires care because naive estimators (like binning) can be misleading, especially for deterministic networks.

IB provides a unifying language connecting compression, generalization, robustness, and even privacy, guiding the design of architectures and regularizers that emphasize task-relevant information.

02 Intuition & Analogies

Imagine summarizing a long article (X) into a short abstract (Z) that still lets a reader answer specific questions (Y). If the abstract copies the article, it's not a summary (high I(X;Z)); if it omits key points, the reader can't answer questions (low I(Z;Y)). A great abstract keeps just the right details to solve the task—this is the bottleneck principle.

Think of a water pipe with a valve (the bottleneck). The source is noisy water with minerals (X), and the goal is to produce drinkable water (Y). The valve controls how much water (information) passes. If you open it fully, impurities flow too (Z keeps too much from X). If you close it too much, you get a trickle that’s too pure but insufficient for demand (Z loses task-relevant content). The optimal setting allows enough flow to satisfy needs while filtering out junk—maximizing usefulness while minimizing redundancy.

In a deep network, each hidden layer acts like a new filter. Early layers pick out edges or simple patterns; deeper layers retain higher-level concepts like object categories. Over training, layers may first memorize details to fit the data (high I(X;Z)); later, optimization and noise push them to forget irrelevant specifics while preserving what predicts the label (higher I(Z;Y) relative to I(X;Z)). But measuring this process precisely is tricky. If you measure information with coarse buckets (binning), it might look like compression even when, mathematically, a deterministic continuous layer can carry infinite information. To study and encourage true compression, we inject stochasticity (noise) and use well-founded approximations, as in the Variational Information Bottleneck.
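
A small calculation makes the measurement problem concrete (a sketch for a single linear unit; the same Gaussian-channel formula reappears under Key Formulas below). For a deterministic map Z = wX with continuous X and w ≠ 0, Z determines X exactly, so I(X;Z) is infinite. Injecting Gaussian noise, Z = wX + ε with ε ~ N(0, σ²) and Var(X) = σ_X², gives

I(X;Z) = 1/2 log(1 + w² σ_X² / σ²) nats,

which is finite for σ > 0 and diverges as σ → 0. Compression only becomes a well-defined, measurable quantity once this stochasticity (or explicit quantization) is part of the model.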

03 Formal Definition

Let random variables X (input), Y (target), and Z (representation) form a Markov chain Y ↔ X ↔ Z via an encoder p(z|x), so that Z depends on the data only through X. Mutual information is I(A;B) = H(A) - H(A|B). The IB problem seeks the encoder p(z|x) that solves: maximize I(Z;Y) subject to I(X;Z) ≤ R, or equivalently maximize the Lagrangian L = I(Z;Y) - \beta I(X;Z) for some \beta ≥ 0. Here, \beta controls the trade-off between prediction utility and compression.

For continuous variables and deterministic encoders, I(X;Z) can be ill-defined (often infinite). Therefore, stochastic encoders (e.g., Gaussian noise) are standard in IB analysis. The Variational Information Bottleneck introduces variational distributions q_\phi(z|x) (encoder), p_\theta(y|z) (decoder), and a simple prior r(z) (e.g., standard normal). Using variational bounds, one obtains a tractable surrogate objective: minimize L(\theta, \phi) = E[-log p_\theta(y|z)] + \beta E[KL(q_\phi(z|x) || r(z))]. The first term upper-bounds H(Y|Z) (so minimizing it tends to increase I(Z;Y)), while the KL term upper-bounds I(X;Z) under the variational prior. For Gaussian encoders with reparameterization z = \mu_\phi(x) + \sigma_\phi(x) ⊙ ε, ε ~ N(0, I), training proceeds via stochastic gradient descent.
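
The two variational bounds behind this surrogate can be sketched explicitly (standard VIB reasoning, stated here for reference):

I(Z;Y) = H(Y) - H(Y|Z) ≥ H(Y) - E[-log p_\theta(y|z)],

because the cross-entropy of any decoder p_\theta(y|z) upper-bounds H(Y|Z); and

I(X;Z) = E_x[ KL( q_\phi(z|x) || q_\phi(z) ) ] ≤ E_x[ KL( q_\phi(z|x) || r(z) ) ],

because replacing the intractable marginal q_\phi(z) with any prior r(z) can only increase the expected KL. Dropping the constant H(Y) and weighting the second bound by \beta yields exactly the surrogate objective minimized above.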

04 When to Use

Use IB or VIB when you want representations that are compact yet predictive. Concrete settings include:

  • Regularization in supervised learning: Encouraging Z to discard spurious correlations can improve generalization and robustness to distribution shifts.
  • Privacy and compression: Limiting I(X;Z) helps reduce leakage of sensitive input details and can enable smaller transmitted representations in edge/cloud systems.
  • Noisy or high-dimensional data: When inputs contain substantial nuisance variation (e.g., backgrounds in images, sensor noise), IB promotes invariances by keeping only task-relevant signal.
  • Multi-task or transfer learning: A well-compressed Z that emphasizes core factors may transfer better across related tasks.
  • Interpretable training diagnostics: The information plane (I(X;Z) vs. I(Z;Y)) can visualize how layers evolve during training, highlighting fitting and potential compression phases.

However, prefer VIB or other principled stochastic formulations over naive mutual information estimates for deep, continuous networks. If you must inspect the information plane with discretization, validate findings with alternative estimators or analytic formulas (e.g., for linear-Gaussian layers).

⚠️ Common Mistakes

  • Treating deterministic continuous layers as having finite I(X;Z): For invertible or nearly invertible mappings with continuous noise-free data, I(X;Z) is infinite or undefined. Remedy: add stochasticity (e.g., Gaussian encoders) or analyze discrete/quantized cases.
  • Over-interpreting binning-based compression: Discretizing activations often shows decreasing I(X;Z) during training, but this can be a measurement artifact. Cross-check with analytic formulas (Gaussian channel) or variational bounds; a small analytic-check sketch follows this list.
  • Confusing \beta’s role: \beta is a Lagrange multiplier controlling compression strength. Too large \beta over-compresses and hurts accuracy; too small \beta collapses to standard ERM with little compression. Tune \beta and monitor the information plane or validation metrics.
  • Ignoring the choice of prior r(z): The KL term depends on r(z). A poor prior can misestimate compression or hinder optimization. Try standard normal or learned priors and monitor performance.
  • Numerical instability when estimating MI: High-dimensional histograms, heavy tails, and small bins produce high-variance estimates. Prefer k-NN or Gaussian estimators when applicable, and report units (nats vs. bits).
  • Misreading I(Z;Y) from accuracy alone: Accuracy can saturate while I(Z;Y) still increases or vice versa. Use entropy-based measures or calibrated probabilities to approximate H(Y|Z).
  • Forgetting the data processing inequality: Adding layers cannot increase the information a representation carries about Y beyond I(X;Y); layers can only re-distribute or lose it. Don’t expect deeper networks to create new information about Y from X; they repackage it.
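
As a concrete aid for the cross-checking advice above, here is a minimal sketch of the two closed forms used throughout this section: the Gaussian-channel MI and the Gaussian-to-standard-normal KL. The constants w, s, and var_x are illustrative values, not taken from the examples below.

#include <bits/stdc++.h>
using namespace std;

// Closed-form KL( N(mu, s^2) || N(0, 1) ) in nats
double kl_gauss_to_std_normal(double mu, double s){
    return 0.5 * (mu*mu + s*s - 1.0 - log(s*s));
}

// Closed-form I(X;Z) for the linear-Gaussian channel Z = w*X + eps,
// eps ~ N(0, s^2), X with variance var_x (result in nats)
double mi_gaussian_channel(double w, double var_x, double s){
    return 0.5 * log(1.0 + (w*w * var_x) / (s*s));
}

int main(){
    // Illustrative encoder: mean weight w, noise std s, unit-variance input
    double w = 1.5, s = 0.5, var_x = 1.0;

    cout << "Analytic I(X;Z): " << mi_gaussian_channel(w, var_x, s) << " nats\n";
    cout << "KL(q(z|x=1) || N(0,1)): " << kl_gauss_to_std_normal(w * 1.0, s) << " nats\n";

    // Any histogram-based estimate of I(X;Z) for this channel should be
    // compared against the analytic value before interpreting "compression".
    return 0;
}

Comparing a histogram estimate against mi_gaussian_channel for a layer that is approximately linear-Gaussian is a quick way to detect binning artifacts.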

Key Formulas

Mutual Information

I(A;B) = H(A) - H(A|B) = H(B) - H(B|A)

Explanation: Mutual information is the reduction in uncertainty about A after observing B (and vice versa). It is symmetric and non-negative.

IB Lagrangian

L = I(Z;Y) - \beta I(X;Z)

Explanation: This objective encodes the trade-off between making Z predictive of Y and compressing information from X. The parameter \beta controls compression strength.

KL Divergence

KL(q || p) = E_q[ log( q(x) / p(x) ) ]

Explanation: KL measures how one distribution q differs from another p. It is zero only if q equals p almost everywhere and is used as a regularizer in VIB.

VIB Objective

L(\theta, \phi) = E_{x,y} E_{z ~ q_\phi(z|x)}[ -log p_\theta(y|z) ] + \beta E_x[ KL( q_\phi(z|x) || r(z) ) ]

Explanation: The first term upper-bounds H(Y|Z) and encourages predictive Z. The second term penalizes information Z carries about X via KL to a simple prior r(z).

Bivariate Gaussian MI

I(X;Z) = -1/2 log(1 - \rho^2)

Explanation: For jointly Gaussian X and Z with correlation \rho, mutual information has this closed form. It is useful when encoders are approximately linear-Gaussian.

Gaussian Channel MI

I(X;Z) = 1/2 log(1 + w^2 Var(X) / \sigma^2)

Explanation: For Z = wX + \epsilon with \epsilon ~ N(0, \sigma^2), this gives mutual information in nats. It equals half the log of one plus the signal-to-noise ratio.

KL Between Univariate Gaussians

KL( N(\mu, \sigma^2) || N(0, 1) ) = 1/2 ( \mu^2 + \sigma^2 - 1 - log \sigma^2 )

Explanation: Closed-form KL used in VIB when the encoder is Gaussian and the prior is standard normal. Enables efficient gradient computation.

Binned Entropy

H ≈ -\sum_{i=1}^{B} p_i log p_i

Explanation: Histogram-based entropy for B bins with probabilities p_i. Used in simple MI estimators but sensitive to bin count and sample size.

Binned Mutual Information

I(X;Z) ≈ \sum_{i,j} p_{ij} log( p_{ij} / (p_i q_j) )

Explanation: Histogram-based MI from joint bin counts normalized to probabilities p_{ij}, with marginals p_i and q_j. Useful for visualization but can produce artifacts with continuous, deterministic mappings.

Utility via Entropy

I(Z;Y) = H(Y) - H(Y|Z) ≈ H(Y) - E[ -log p(y|z) ]

Explanation: For classification, H(Y|Z) can be approximated by the expected cross-entropy under a calibrated decoder p(y|z). This links predictive loss to mutual information.

Data Processing Inequality

For the Markov chain Y ↔ X ↔ Z:  I(Z;Y) ≤ I(X;Y)

Explanation: Processing cannot create information about Y beyond what X already contains. Any representation Z computed from X can only lose information about Y.

Complexity Analysis

Exact mutual information is typically intractable for high-dimensional deep nets. The Variational Information Bottleneck reduces computation to standard training plus a simple KL term. For a model with d-dimensional latent Z and N samples, each epoch of VIB costs O(Nd) to evaluate the encoder/decoder and O(Nd) to form gradients; with mini-batching of size m, each step is O(md). Space is O(d) for parameters (per layer) plus O(md) for activations during a batch.

For histogram (binning) MI estimation with B bins per dimension on univariate variables, forming marginal and joint histograms is O(N + B^2) time and O(B^2) space. In higher dimensions, histogram methods suffer from the curse of dimensionality, requiring O(B^k) bins in k dimensions and quickly becoming impractical. Gaussian closed forms (when applicable) compute MI in O(1) after estimating a few moments, while k-NN estimators typically cost O(N log N) with spatial data structures or O(N^2) with naive neighbor searches, depending on implementation and dimension.

In our C++ examples, the 1D VIB toy model uses scalar operations, so each epoch scales linearly with N; memory holds the dataset (O(N)) plus a few scalars. The binning MI utility runs in O(N + B^2) time for univariate X and Z with modest constant factors. These simplified settings highlight the computational trade-offs: VIB adds negligible overhead to standard training, while naive MI via histograms is cheap in 1D but unreliable and scales poorly in higher dimensions.
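
To illustrate the "O(1) after estimating a few moments" point, the following sketch estimates I(X;Z) from the sample correlation under a joint-Gaussian assumption; it is only as valid as that assumption, and the synthetic data here are illustrative.

#include <bits/stdc++.h>
using namespace std;

// Gaussian MI estimate from the sample correlation coefficient (nats).
// Valid only to the extent that (X, Z) is approximately jointly Gaussian.
double gaussian_mi_from_samples(const vector<double>& X, const vector<double>& Z){
    size_t N = X.size();
    double mx = 0.0, mz = 0.0;
    for(size_t i = 0; i < N; ++i){ mx += X[i]; mz += Z[i]; }
    mx /= double(N); mz /= double(N);
    double sxx = 0.0, szz = 0.0, sxz = 0.0;
    for(size_t i = 0; i < N; ++i){
        sxx += (X[i]-mx)*(X[i]-mx);
        szz += (Z[i]-mz)*(Z[i]-mz);
        sxz += (X[i]-mx)*(Z[i]-mz);
    }
    double rho = sxz / sqrt(sxx * szz);        // sample correlation
    rho = min(max(rho, -0.999999), 0.999999);  // guard against |rho| -> 1
    return -0.5 * log(1.0 - rho*rho);          // I(X;Z) = -1/2 log(1 - rho^2)
}

int main(){
    // Noisy linear relationship Z = 2X + eps, eps ~ N(0, 0.5^2), X ~ N(0,1)
    mt19937_64 rng(1);
    normal_distribution<double> N01(0.0, 1.0), Ns(0.0, 0.5);
    size_t N = 10000;
    vector<double> X(N), Z(N);
    for(size_t i = 0; i < N; ++i){ X[i] = N01(rng); Z[i] = 2.0*X[i] + Ns(rng); }

    // O(N) to form moments, O(1) for the formula itself
    cout << "Gaussian MI estimate: " << gaussian_mi_from_samples(X, Z) << " nats\n";
    // Analytic value for comparison: 0.5 * log(1 + 4/0.25) ~= 1.417 nats
    return 0;
}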

Code Examples

Linear-Gaussian VIB for Binary Classification with Information-Plane Estimates
#include <bits/stdc++.h>
using namespace std;

// Sigmoid helper
static inline double sigmoid(double x){ return 1.0 / (1.0 + exp(-x)); }

// Binary cross-entropy for y in {0,1}
static inline double bce(double y, double p){
    const double eps = 1e-12;
    p = min(max(p, eps), 1.0 - eps);
    return -(y * log(p) + (1.0 - y) * log(1.0 - p));
}

// Generate synthetic 1D data: X ~ N(0,1), Y = 1{X + noise > 0}
struct Dataset {
    vector<double> x;
    vector<int> y; // 0 or 1
};

Dataset make_data(size_t N, double label_noise_std, uint64_t seed=42){
    Dataset d; d.x.resize(N); d.y.resize(N);
    mt19937_64 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    normal_distribution<double> N0n(0.0, label_noise_std);
    for(size_t i=0;i<N;++i){
        double xi = N01(rng);
        double yi_cont = xi + N0n(rng);
        int yi = yi_cont > 0.0 ? 1 : 0;
        d.x[i] = xi; d.y[i] = yi;
    }
    return d;
}

// Compute entropy of binary labels in nats
double binary_entropy_from_counts(size_t c1, size_t c0){
    double N = double(c1 + c0);
    if(N == 0) return 0.0;
    double p1 = c1 / N;
    double p0 = c0 / N;
    auto Hterm = [](double p){ if(p<=0.0 || p>=1.0) return 0.0; return -p*log(p); };
    return Hterm(p1) + Hterm(p0);
}

// Approximate I(Z;Y) via H(Y) - E[ cross-entropy( y, p(y=1|z) ) ]
// Using the model's predicted probability as a proxy for p(y|z) (calibration assumed)
double approximate_I_ZY_from_decoder(const vector<int>& y, const vector<double>& z, double a, double b){
    size_t N = y.size();
    size_t c1 = accumulate(y.begin(), y.end(), size_t(0));
    size_t c0 = N - c1;
    double H_Y = binary_entropy_from_counts(c1, c0);
    double ce = 0.0;
    for(size_t i=0;i<N;++i){
        double p = sigmoid(a*z[i] + b);
        ce += bce(y[i], p);
    }
    ce /= double(N);
    // I(Z;Y) ≈ H(Y) - H(Y|Z); H(Y|Z) approximated by cross-entropy
    double I_approx = H_Y - ce;
    if(I_approx < 0.0) I_approx = 0.0; // clip due to approximation error
    return I_approx;
}

// Discretized (binned) MI for 1D X and Z
// Note: This estimator is biased and can create artifacts for deterministic mappings.
double binned_MI_1d(const vector<double>& X, const vector<double>& Z, int B){
    size_t N = X.size();
    if(N == 0) return 0.0;
    double xmin = *min_element(X.begin(), X.end());
    double xmax = *max_element(X.begin(), X.end());
    double zmin = *min_element(Z.begin(), Z.end());
    double zmax = *max_element(Z.begin(), Z.end());
    if(xmax == xmin) xmax = xmin + 1e-6;
    if(zmax == zmin) zmax = zmin + 1e-6;
    vector<vector<size_t>> joint(B, vector<size_t>(B, 0));
    vector<size_t> mx(B,0), mz(B,0);

    auto bin_index = [](double v, double vmin, double vmax, int B){
        double t = (v - vmin) / (vmax - vmin);
        int idx = int(floor(t * B));
        if(idx < 0) idx = 0;
        if(idx >= B) idx = B-1;
        return idx;
    };

    for(size_t i=0;i<N;++i){
        int ix = bin_index(X[i], xmin, xmax, B);
        int iz = bin_index(Z[i], zmin, zmax, B);
        joint[ix][iz]++; mx[ix]++; mz[iz]++;
    }
    double NN = double(N);
    double I = 0.0;
    for(int i=0;i<B;++i){
        for(int j=0;j<B;++j){
            if(joint[i][j]==0) continue;
            double pxy = joint[i][j]/NN;
            double px = mx[i]/NN;
            double pz = mz[j]/NN;
            I += pxy * log(pxy/(px*pz));
        }
    }
    return I; // nats
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Create data
    size_t N = 4000;
    Dataset data = make_data(N, /*label_noise_std=*/0.5, /*seed=*/123);

    // 2) Parameters: encoder z = w*x + s*eps, eps ~ N(0,1); decoder p(y=1|z) = sigmoid(a*z + b)
    double w = 0.1;          // encoder weight
    double log_s = log(0.8); // log std to ensure positivity
    double a = 0.1;          // decoder weight
    double b = 0.0;          // decoder bias
    double beta = 1e-2;      // IB trade-off weight

    // 3) Optimizer hyperparams
    double lr = 1e-2;
    int epochs = 200;
    int B_bins = 30; // for binned MI visualization

    // Random source for reparameterization noise
    mt19937_64 rng(2024);
    normal_distribution<double> N01(0.0, 1.0);

    // Precompute Var(X)
    double mean_x = accumulate(data.x.begin(), data.x.end(), 0.0) / double(N);
    double var_x = 0.0;
    for(double xi: data.x){ var_x += (xi - mean_x)*(xi - mean_x); }
    var_x /= double(N);

    vector<double> zbuf(N, 0.0);

    cout << fixed << setprecision(6);
    cout << "#epoch loss acc Ixz_gauss Ixz_binned Izy_approx\n";

    for(int ep=1; ep<=epochs; ++ep){
        double s = exp(log_s);
        // Accumulate gradients
        double gw=0.0, glogs=0.0, ga=0.0, gb=0.0;
        double loss=0.0; size_t correct=0;

        for(size_t i=0;i<N;++i){
            double x = data.x[i];
            int y = data.y[i];
            double eps = N01(rng);
            double z = w * x + s * eps; // reparameterization sample
            double t = a * z + b;
            double p = sigmoid(t);
            double ce = bce(double(y), p);
            // KL(q(z|x)||N(0,1)) for this x (Gaussian closed form)
            double mu = w * x;
            double KL = 0.5 * (mu*mu + s*s - 1.0 - log(s*s));
            double L = ce + beta * KL;
            loss += L;
            // accuracy
            int yhat = (p >= 0.5) ? 1 : 0;
            if(yhat == y) correct++;
            // Gradients
            double dL_dt = (p - double(y)); // dCE/dt
            double dL_dz = dL_dt * a;
            double dL_dw = dL_dz * x + beta * (mu * x); // mu*x = w*x^2
            double dL_ds = dL_dz * eps + beta * (s - 1.0/s);
            double dL_da = dL_dt * z;
            double dL_db = dL_dt;
            // chain for log_s
            glogs += dL_ds * s;
            gw += dL_dw; ga += dL_da; gb += dL_db;
            // buffer z for info estimates later (use noiseless mean to reduce variance)
            zbuf[i] = mu; // store encoder mean w*x for analysis
        }
        loss /= double(N);
        double acc = double(correct) / double(N);

        // Parameter update (SGD)
        double scale = lr / double(N);
        w -= scale * gw;
        log_s -= scale * glogs;
        a -= scale * ga;
        b -= scale * gb;

        double s_now = exp(log_s);
        // Analytic I(X;Z) for Gaussian channel with encoder mean and noise std s
        double Ixz_gauss = 0.5 * log(1.0 + (w*w * var_x) / (s_now*s_now));

        // Binned I(X;Z): use Z = w*X (deterministic), which tends to underestimate true MI in continuous case
        vector<double> Xvec = data.x;
        vector<double> Zdet(N);
        for(size_t i=0;i<N;++i) Zdet[i] = w * data.x[i];
        double Ixz_binned = binned_MI_1d(Xvec, Zdet, B_bins);

        // Approximate I(Z;Y) using decoder probabilities and noiseless mean z=mu (variance ignored here)
        double Izy_approx = approximate_I_ZY_from_decoder(data.y, zbuf, a, b);

        cout << ep << " " << loss << " " << acc << " "
             << Ixz_gauss << " " << Ixz_binned << " " << Izy_approx << "\n";
    }

    // Note: For a proper information plane, you would record these values per epoch and plot them externally.
    return 0;
}

This program trains a 1D Variational Information Bottleneck model: a Gaussian encoder z = w x + s ε and a logistic decoder p(y|z) = sigmoid(a z + b). The loss is cross-entropy plus β times the KL divergence between the encoder and a standard normal prior. Training uses SGD with reparameterization. After each epoch, it computes: (1) analytic I(X;Z) for the linear-Gaussian channel, (2) binned I(X;Z) using a histogram over the deterministic mapping z = w x (to illustrate discretization artifacts), and (3) an approximation of I(Z;Y) as H(Y) minus the average cross-entropy under the decoder. The output can be plotted to form an information plane. This minimal C++ example avoids external libraries and highlights how VIB operationalizes the IB trade-off.

Time: O(N · epochs) for this 1D model; O(N · d · epochs) if extended to d-dimensional features.
Space: O(N) to store the dataset and a few O(1) scalars for parameters, plus an O(N) temporary buffer for z.

Histogram-based Mutual Information Estimator and the Binning Artifact
#include <bits/stdc++.h>
using namespace std;

// Histogram-based 1D-1D MI (nats)
double binned_MI_1d(const vector<double>& X, const vector<double>& Z, int B){
    size_t N = X.size();
    if(N == 0) return 0.0;
    double xmin = *min_element(X.begin(), X.end());
    double xmax = *max_element(X.begin(), X.end());
    double zmin = *min_element(Z.begin(), Z.end());
    double zmax = *max_element(Z.begin(), Z.end());
    if(xmax == xmin) xmax = xmin + 1e-6;
    if(zmax == zmin) zmax = zmin + 1e-6;
    vector<vector<size_t>> joint(B, vector<size_t>(B, 0));
    vector<size_t> mx(B,0), mz(B,0);

    auto bin_index = [](double v, double vmin, double vmax, int B){
        double t = (v - vmin) / (vmax - vmin);
        int idx = int(floor(t * B));
        if(idx < 0) idx = 0;
        if(idx >= B) idx = B-1;
        return idx;
    };

    for(size_t i=0;i<N;++i){
        int ix = bin_index(X[i], xmin, xmax, B);
        int iz = bin_index(Z[i], zmin, zmax, B);
        joint[ix][iz]++; mx[ix]++; mz[iz]++;
    }
    double NN = double(N);
    double I = 0.0;
    for(int i=0;i<B;++i){
        for(int j=0;j<B;++j){
            if(joint[i][j]==0) continue;
            double pxy = joint[i][j]/NN;
            double px = mx[i]/NN;
            double pz = mz[j]/NN;
            I += pxy * log(pxy/(px*pz));
        }
    }
    return I;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate X ~ N(0,1)
    size_t N = 20000;
    mt19937_64 rng(7);
    normal_distribution<double> N01(0.0, 1.0);
    vector<double> X(N);
    for(size_t i=0;i<N;++i) X[i] = N01(rng);

    // Deterministic mapping Z = w * X (no noise): true MI for continuous variables is infinite (if w != 0)
    double w = 2.0;
    vector<double> Zdet(N);
    for(size_t i=0;i<N;++i) Zdet[i] = w * X[i];

    // Slightly noisy mapping Z = w * X + eps: finite MI with Gaussian formula
    normal_distribution<double> N0s(0.0, 0.2);
    vector<double> Znoisy(N);
    for(size_t i=0;i<N;++i) Znoisy[i] = w * X[i] + N0s(rng);

    // Evaluate binned MI for increasing bin counts
    cout << fixed << setprecision(6);
    cout << "#B I_binned(X;Zdet) I_binned(X;Znoisy)\n";
    for(int B=4; B<=128; B*=2){
        double I_det = binned_MI_1d(X, Zdet, B);
        double I_noisy = binned_MI_1d(X, Znoisy, B);
        cout << B << " " << I_det << " " << I_noisy << "\n";
    }

    // Also print analytic MI for the noisy linear-Gaussian case: I = 0.5 * log(1 + w^2 Var(X)/Var(noise))
    // Var(X) ≈ 1 for N(0,1), Var(noise) = 0.2^2 = 0.04
    double varX = 0.0;
    double meanX = accumulate(X.begin(), X.end(), 0.0)/double(N);
    for(double xi: X) varX += (xi-meanX)*(xi-meanX);
    varX /= double(N);
    double varNoise = 0.2 * 0.2;
    double I_gauss = 0.5 * log(1.0 + (w*w * varX)/varNoise);
    cerr << "Analytic Gaussian MI (noisy): " << I_gauss << " nats\n";

    // Expected outcome:
    // - I_binned(X;Zdet) increases with B (artifact), even though true MI is infinite for deterministic continuous mapping.
    // - I_binned(X;Znoisy) stabilizes with B and is consistent with the analytic Gaussian MI.

    return 0;
}

This utility computes histogram-based mutual information for univariate X and Z and demonstrates the binning artifact. For a deterministic continuous mapping Z = wX, true MI is infinite, but the binned estimate remains finite and tends to increase with the number of bins, misleadingly suggesting changes in information. Adding small Gaussian noise produces a finite MI that can be checked against the analytic Gaussian-channel formula. Use this example to sanity-check your information-plane measurements.

Time: O(N + B^2) per MI computation for univariate variables; repeated for each B tested.
Space: O(B^2) for joint counts plus O(N) to store input vectors.
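
Both programs are single translation units with no dependencies beyond the standard library (note that <bits/stdc++.h> is a GCC/libstdc++ convenience header, so a GCC-style toolchain is assumed). A typical build-and-run, with illustrative file names, is: g++ -std=c++17 -O2 vib_toy.cpp -o vib_toy && ./vib_toy > info_plane.txt, after which the printed columns can be plotted with any external tool to form the information plane.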