Information Bottleneck in Deep Learning
Key Points
- The Information Bottleneck (IB) principle formalizes learning compact representations T that keep only the information about X that is useful for predicting Y.
- Mathematically it trades off compression I(X;T) against prediction I(T;Y) through the objective min \beta I(X;T) - I(T;Y).
- In deep networks, the data processing inequality implies mutual information with the input cannot increase across layers, motivating progressive compression.
- The Variational Information Bottleneck (VIB) makes the IB objective trainable by replacing mutual information terms with tractable losses using variational bounds.
- Adding stochasticity (e.g., Gaussian noise) in hidden representations is crucial to make I(X;T) finite and learnable for continuous variables.
- A properly tuned \beta controls the simplicity–accuracy trade-off: a large \beta encourages stronger compression and potentially better generalization.
- Estimating mutual information is hard; simple binning estimators can illustrate trends but are biased, while variational bounds are typically preferred for training.
- You can prototype IB ideas in C++ using a simple stochastic encoder, a logistic decoder, reparameterization, and the closed-form KL between Gaussians.
Prerequisites
- Probability and Random Variables — Mutual information, entropy, and KL divergence are defined over probability distributions.
- Entropy and Mutual Information — Understanding the IB objective requires facility with I(X;T), H(X), and related identities.
- Stochastic Gradient Descent — VIB is trained with gradient-based optimization using minibatches.
- Variational Inference and KL Divergence — VIB relies on variational bounds and KL regularization to make IB tractable.
- Neural Networks and Backpropagation — Representations T are hidden layers; training adjusts encoder/decoder parameters.
- Gaussian Distributions — The encoder and prior are often Gaussian, and the KL has a closed-form expression.
Detailed Explanation
01Overview
The Information Bottleneck (IB) is a framework for representation learning that explicitly balances two goals: preserve information about the target Y while discarding irrelevant details from the input X. Instead of just minimizing a prediction loss, IB asks us to find an intermediate random variable T (a representation) that is maximally informative for Y but minimally informative about X. This is formalized through mutual information: we want T that has low I(X;T) (compression) and high I(T;Y) (predictive power). In deep learning, T can be a hidden layer. The IB view provides a principled lens to understand why deeper layers often look simpler or more robust: they retain what matters for the task and filter out nuisance variation.
Operationally, directly optimizing mutual information is hard in high dimensions. The Variational Information Bottleneck (VIB) converts the problem into a tractable stochastic optimization by introducing a parametric encoder q_\phi(T|X) and decoder p_\theta(Y|T), a simple prior p(T), and variational bounds that upper bound I(X;T) and lower bound I(T;Y). The resulting loss resembles a standard prediction loss plus a KL regularizer that pushes T toward a simple prior, encouraging compression. This bridges information theory and practical deep learning, allowing one to train bottlenecked models with standard stochastic gradient methods, and to tune the compression–accuracy trade-off via a single parameter \beta.
02Intuition & Analogies
Imagine you’re summarizing a long article for a friend who only cares about a single question the article answers. You could copy the whole article (no compression), but that’s unnecessary and overwhelming. Instead, you’ll write a short summary that keeps only the parts relevant to the question. The Information Bottleneck treats a neural network’s hidden layer like that summary. The input X is the article, the target Y is the question, and the representation T is your summary: keep what helps answer Y, forget the rest.
Another analogy is taking a photo in fog with a camera that has a tunable aperture. A wide-open aperture captures everything—details plus noise—but the picture may be blurry or overwhelming for the downstream task. Closing the aperture throws away some light (compression) but can increase contrast for the objects that matter, making it easier to detect them. In IB, the aperture is controlled by \beta: a larger \beta enforces a tighter bottleneck, keeping fewer bits about X, which can improve generalization by removing irrelevant or noisy patterns that could cause overfitting.
Finally, think of a factory assembly line: raw materials (X) pass through stations (layers). Each station simplifies and standardizes the product (representation), making later assembly steps easier. The data processing inequality is the rule that no station can recreate lost raw detail without access to the original supplies. So as items move down the line, they typically become simpler yet more suitable for the final product (predicting Y). Injecting controlled randomness (like shaking a sieve) helps separate dust from useful grains; in networks, adding noise to T makes compression measurable and effective for continuous signals.
03Formal Definition
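Given random variables X (input) and Y (target) with joint distribution p(x, y), the IB problem searches over stochastic encoders p(t | x). Stated in symbols, following this article's convention that a larger \beta means stronger compression:

```latex
% Markov chain assumption: T may depend on the data only through X
Y \leftrightarrow X \leftrightarrow T
\qquad\Longleftrightarrow\qquad
p(t \mid x, y) = p(t \mid x)

% IB Lagrangian, minimized over the encoder p(t|x)
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  = \beta\, I(X;T) \;-\; I(T;Y),
\qquad \beta > 0
```

At \beta \to 0 the optimal T keeps everything predictive of Y with no pressure to compress; as \beta grows, the encoder is forced to discard more bits of X.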
04When to Use
Use the Information Bottleneck when you want models that generalize better by discarding task-irrelevant information, produce robust representations under noise or distribution shift, or are compact and potentially easier to compress or prune. Concretely:
- Robust classification: Encourage hidden layers to ignore nuisance factors (e.g., background textures) while keeping class-relevant structure.
- Semi-supervised and transfer learning: Learn representations T that capture predictive structure shared across tasks, reducing overfitting to domain-specific noise.
- Model compression and privacy: The KL term biases T toward a simple prior, which can reduce the effective information leaked about X, aiding compression and privacy-sensitive applications.
- Small-data regimes: Regularizing with IB (via \beta) can prevent memorization by limiting I(X;T), often improving test performance.
- Interpretability: A low-dimensional stochastic T can be probed or visualized, clarifying what information the network retains about X.
Choose VIB or related stochastic bottlenecks when you can afford slight changes to your model and training loop. If you only need post-hoc analysis, mutual information estimators (e.g., binning or kNN) can reveal compression trends across layers, though they may be biased. Increase \beta to strengthen compression, and add explicit noise in T to make the IB signal well-defined for continuous variables.
⚠️Common Mistakes
- Assuming compression always increases during training: In practice, mutual information may fluctuate, depend on estimator bias, and be architecture- or dataset-specific. Validate with multiple estimators and seeds.
- Estimating I(X;T) for deterministic continuous networks without noise: It is infinite in theory. Add stochasticity (e.g., Gaussian noise in T) or use variational upper bounds via KL to a prior.
- Misinterpreting \beta: Very large \beta can overcompress T, hurting accuracy by throwing away task-relevant bits. Tune \beta using validation metrics and monitor both KL and prediction loss.
- Using coarse binning for MI estimation: Too few bins bias MI downward; too many bins cause undersampling and high variance. Calibrate bin counts, apply add-one smoothing, and compare to kNN or variational estimates.
- Ignoring the Markov assumption: The IB objective assumes Y \leftrightarrow X \leftrightarrow T. Violations (e.g., label leakage into T from elsewhere) break the theoretical guarantees.
- Forgetting scaling and stability: Reparameterization with exp(log-variance) can cause numerical issues. Use log-variance, clamp ranges, or softplus to keep standard deviations well-behaved.
- Equating lower cross-entropy with higher I(T;Y) blindly: While related, cross-entropy is a surrogate. Improvements can reflect calibration or overfitting; use held-out data and, when possible, direct MI surrogates.
Key Formulas
Mutual Information
I(X;T) = \sum_{x,t} p(x,t) \log \frac{p(x,t)}{p(x)\,p(t)}
Explanation: Mutual information measures how much knowing T reduces uncertainty about X (and vice versa). It is zero if and only if X and T are independent.
Information Bottleneck Lagrangian
\mathcal{L}_{\mathrm{IB}} = \beta\, I(X;T) - I(T;Y), \qquad \beta > 0 \text{ (minimized over the encoder } p(t \mid x))
Explanation: This objective encourages representations T that are compressed with respect to X while preserving predictive information about Y. The parameter \beta tunes the strength of compression.
Variational Upper Bound on I(X;T)
I(X;T) \le \mathbb{E}_{p(x)}\big[\mathrm{KL}(q_\phi(t \mid x) \,\|\, p(t))\big]
Explanation: Replacing I(X;T) with an expected KL to a simple prior yields a tractable surrogate for compression. The bound is tight exactly when the chosen prior equals the true marginal p(t) = \int q_\phi(t \mid x)\, p(x)\, dx.
Variational Lower Bound on I(T;Y)
I(T;Y) \ge \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(t \mid x)}\big[\log p_\theta(y \mid t)\big] + H(Y)
Explanation: Maximizing the expected log-likelihood of labels under the decoder increases a lower bound on I(T;Y). The constant H(Y) depends only on p(y) and does not affect optimization.
VIB Training Objective
\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(t \mid x)}\big[-\log p_\theta(y \mid t)\big] + \beta\, \mathbb{E}_{p(x)}\big[\mathrm{KL}(q_\phi(t \mid x) \,\|\, p(t))\big]
Explanation: This loss combines a prediction term and a compression term. Minimizing it with SGD yields a representation T that balances accuracy and compactness.
KL of Univariate Gaussians
\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log \sigma^2 - 1\big)
Explanation: When the encoder is Gaussian and the prior is standard normal, the KL term has this closed form. It provides a simple, stable compression penalty.
Reparameterization Trick
z = \mu_\phi(x) + \sigma_\phi(x)\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1)
Explanation: Sampling is rewritten as a differentiable transformation of noise, allowing low-variance gradient estimates of expectations over q(z|x).
Data Processing Along Layers
I(X; T_1) \ge I(X; T_2) \ge \cdots \ge I(X; T_L)
Explanation: For a feedforward chain X \rightarrow T_1 \rightarrow T_2 \rightarrow \cdots \rightarrow T_L, mutual information with the input cannot increase from layer to layer. This formalizes the notion of progressive compression.
Entropy and Conditional Entropy (discrete)
H(X) = -\sum_x p(x) \log p(x), \qquad H(X \mid T) = -\sum_{x,t} p(x,t) \log p(x \mid t)
Explanation: Entropy quantifies average uncertainty. Mutual information can be written as I(X;T) = H(X) - H(X|T), highlighting how T reduces uncertainty about X.
Histogram MI Estimator
\hat{I}(X;T) = \sum_{i=1}^{B_x} \sum_{j=1}^{B_t} \hat{p}_{ij} \log \frac{\hat{p}_{ij}}{\hat{p}_{i\cdot}\, \hat{p}_{\cdot j}}
Explanation: A practical estimator using discrete bins, where \hat{p}_{ij} is the fraction of samples in bin (i, j) and \hat{p}_{i\cdot}, \hat{p}_{\cdot j} are the marginal bin frequencies. It is simple but biased; results depend on bin count and sample size.
Complexity Analysis
The histogram MI estimator below runs in O(n + B_x B_t) time and O(B_x B_t) memory for n samples and a B_x × B_t bin grid: one pass fills the counts, and one pass over the grid accumulates the sum. The VIB trainer performs O(E · N) per-sample updates for E epochs over N samples; each update is O(1) here because the encoder and decoder are scalar linear maps, and it scales with parameter count for general networks.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    uniform_real_distribution<double> uni;
    RNG() : gen(random_device{}()), uni(0.0, 1.0) {}
    double uniform() { return uni(gen); }
    // Box-Muller transform for a standard normal sample
    double normal() {
        double u1 = max(1e-12, uniform());
        double u2 = max(1e-12, uniform());
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }
};

// Estimate MI between scalar vectors x and t using 2D histogram binning.
// Returns MI in bits.
double estimate_mi(const vector<double>& x, const vector<double>& t,
                   int bins_x = 50, int bins_t = 50) {
    int n = (int)x.size();
    if ((int)t.size() != n || n == 0) return 0.0;

    double x_min = *min_element(x.begin(), x.end());
    double x_max = *max_element(x.begin(), x.end());
    double t_min = *min_element(t.begin(), t.end());
    double t_max = *max_element(t.begin(), t.end());
    // Slightly expand ranges to avoid boundary issues
    double eps = 1e-9;
    double x_range = (x_max - x_min) + eps;
    double t_range = (t_max - t_min) + eps;

    vector<double> joint((size_t)bins_x * bins_t, 0.0);
    vector<double> px(bins_x, 0.0), pt(bins_t, 0.0);

    auto bx = [&](double xv) {
        int b = (int)((xv - x_min) / x_range * bins_x);
        return min(max(b, 0), bins_x - 1);
    };
    auto bt = [&](double tv) {
        int b = (int)((tv - t_min) / t_range * bins_t);
        return min(max(b, 0), bins_t - 1);
    };

    for (int i = 0; i < n; ++i) {
        int ix = bx(x[i]);
        int it = bt(t[i]);
        joint[(size_t)ix * bins_t + it] += 1.0;
        px[ix] += 1.0;
        pt[it] += 1.0;
    }

    // Convert to probabilities with add-one smoothing on the joint to avoid
    // log(0). The marginals get the matching pseudo-counts (bins_t per x-bin,
    // bins_x per t-bin) so they remain the exact marginals of the smoothed
    // joint and all three distributions sum to one.
    double n_eff = n + (double)(bins_x * bins_t);
    for (double& v : joint) v = (v + 1.0) / n_eff;
    for (int i = 0; i < bins_x; ++i) px[i] = (px[i] + (double)bins_t) / n_eff;
    for (int j = 0; j < bins_t; ++j) pt[j] = (pt[j] + (double)bins_x) / n_eff;

    double mi_nats = 0.0;
    for (int i = 0; i < bins_x; ++i) {
        for (int j = 0; j < bins_t; ++j) {
            double pxy = joint[(size_t)i * bins_t + j];
            if (pxy <= 0) continue;
            double px_i = max(px[i], 1e-15);
            double pt_j = max(pt[j], 1e-15);
            mi_nats += pxy * log(pxy / (px_i * pt_j));
        }
    }
    // Convert nats to bits
    return mi_nats / log(2.0);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng;
    const int N = 20000; // number of samples

    // Generate a 1D input X from a mixture of two Gaussians (centers -1, +1)
    vector<double> X(N);
    for (int i = 0; i < N; ++i) {
        int comp = (rng.uniform() < 0.5) ? 0 : 1;
        double mean = (comp == 0 ? -1.0 : 1.0);
        X[i] = mean + 0.4 * rng.normal();
    }

    // Two noisy layers:
    //   T1 = tanh(w1 * X + b1 + noise1), T2 = tanh(w2 * T1 + b2 + noise2)
    double w1 = 1.5, b1 = 0.2, noise1_std = 0.6;
    double w2 = -1.2, b2 = 0.1, noise2_std = 0.6;

    vector<double> T1(N), T2(N);
    for (int i = 0; i < N; ++i) {
        double t1 = tanh(w1 * X[i] + b1 + noise1_std * rng.normal());
        T1[i] = t1;
        T2[i] = tanh(w2 * t1 + b2 + noise2_std * rng.normal());
    }

    double I_X_T1 = estimate_mi(X, T1, 60, 60);
    double I_X_T2 = estimate_mi(X, T2, 60, 60);

    cout << fixed << setprecision(4);
    cout << "Estimated MI I(X;T1) [bits]: " << I_X_T1 << "\n";
    cout << "Estimated MI I(X;T2) [bits]: " << I_X_T2 << "\n";
    cout << "Note: By the data processing inequality, I(X;T2) should not exceed I(X;T1).\n";

    // Increase the injected noise to show stronger compression
    double noise_boost = 1.2;
    for (int i = 0; i < N; ++i) {
        double t1 = tanh(w1 * X[i] + b1 + noise_boost * noise1_std * rng.normal());
        T1[i] = t1;
        T2[i] = tanh(w2 * t1 + b2 + noise_boost * noise2_std * rng.normal());
    }

    double I_X_T1_high = estimate_mi(X, T1, 60, 60);
    double I_X_T2_high = estimate_mi(X, T2, 60, 60);
    cout << "With higher noise: I(X;T1)=" << I_X_T1_high
         << ", I(X;T2)=" << I_X_T2_high << " [bits]\n";
    cout << "Higher noise tightens the bottleneck, typically lowering I(X;T).\n";
    return 0;
}
```
This program synthesizes 1D inputs X from a two-Gaussian mixture, passes them through two noisy nonlinear layers, and estimates I(X;T1) and I(X;T2) using a histogram-based mutual information estimator. The data processing inequality predicts that I(X;T2) ≤ I(X;T1). Increasing the injected Gaussian noise tightens the bottleneck, further reducing the estimated mutual information. While the estimator is biased, it qualitatively illustrates compression.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    uniform_real_distribution<double> uni;
    RNG() : gen(random_device{}()), uni(0.0, 1.0) {}
    double uniform() { return uni(gen); }
    // Box-Muller transform for a standard normal sample
    double normal() {
        double u1 = max(1e-12, uniform());
        double u2 = max(1e-12, uniform());
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }
};

struct Sample { double x; int y; };

// Sigmoid with a branch on the sign for numerical stability
inline double sigmoid(double u) {
    if (u >= 0) {
        double z = exp(-u);
        return 1.0 / (1.0 + z);
    } else {
        double z = exp(u);
        return z / (1.0 + z);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng;
    // Binary dataset: x|y=0 ~ N(-1, 0.3^2), x|y=1 ~ N(1, 0.3^2)
    const int N = 2000;
    vector<Sample> data(N);
    for (int i = 0; i < N; ++i) {
        int y = (i < N / 2) ? 0 : 1;
        double mean = (y == 0 ? -1.0 : 1.0);
        data[i] = {mean + 0.3 * rng.normal(), y};
    }
    shuffle(data.begin(), data.end(), rng.gen);

    // Encoder: q(z|x) = N(mu(x), sigma^2(x)) with
    //   mu(x)     = a * x + b
    //   logvar(x) = s * x + t   (sigma = exp(0.5 * logvar))
    double a = 0.5, b = 0.0, s = -0.1, t = -1.0; // encoder parameters
    // Decoder: p(y=1|z) = sigmoid(w * z + c)
    double w = 0.5, c = 0.0;                     // decoder parameters

    double beta = 1.0; // compression strength; try 0.1, 1.0, 5.0 to see the trade-off
    double lr = 1e-2;
    int epochs = 2000;
    int batch = 64;

    auto step = [&](const vector<int>& idxs) {
        // Accumulate gradients over the batch
        double ga = 0, gb = 0, gs = 0, gt = 0, gw = 0, gc = 0;
        double loss_sum = 0.0;
        for (int id : idxs) {
            double x = data[id].x;
            int y = data[id].y;
            // Forward through the encoder
            double mu = a * x + b;
            double logvar = s * x + t;
            logvar = min(5.0, max(-5.0, logvar)); // clamp to avoid extreme sigmas
            double sigma = exp(0.5 * logvar);
            // Reparameterization: z = mu + sigma * eps
            double eps = rng.normal();
            double z = mu + sigma * eps;
            // Decoder prediction
            double u = w * z + c;
            double p = sigmoid(u);
            // Binary cross-entropy
            double nll = -(y * log(max(1e-12, p)) + (1 - y) * log(max(1e-12, 1 - p)));
            // KL(q || N(0,1)) = 0.5 * (mu^2 + sigma^2 - logvar - 1)
            double kl = 0.5 * (mu * mu + sigma * sigma - logvar - 1.0);
            loss_sum += nll + beta * kl;

            // Gradients: dNLL/du = p - y
            double dL_du = p - y;
            gw += dL_du * z; // du/dw = z
            gc += dL_du;     // du/dc = 1
            // Backprop to z, then to mu and logvar via z and the KL term
            double dL_dz = dL_du * w;
            // dL/dmu = dL/dz * dz/dmu + beta * dKL/dmu = dL_dz + beta * mu
            double dL_dmu = dL_dz + beta * mu;
            // dKL/dlogvar = 0.5 * (sigma^2 - 1); dz/dlogvar = eps * 0.5 * sigma
            double dKL_dlogvar = 0.5 * (sigma * sigma - 1.0);
            double dZ_dlogvar = eps * 0.5 * sigma;
            double dL_dlogvar = beta * dKL_dlogvar + dL_dz * dZ_dlogvar;
            ga += dL_dmu * x;     // dmu/da = x
            gb += dL_dmu;         // dmu/db = 1
            gs += dL_dlogvar * x; // dlogvar/ds = x
            gt += dL_dlogvar;     // dlogvar/dt = 1
        }
        double scale = 1.0 / (double)idxs.size();
        a -= lr * ga * scale;
        b -= lr * gb * scale;
        s -= lr * gs * scale;
        t -= lr * gt * scale;
        w -= lr * gw * scale;
        c -= lr * gc * scale;
        return loss_sum * scale;
    };

    // Training loop
    vector<int> order(N);
    iota(order.begin(), order.end(), 0);
    for (int e = 1; e <= epochs; ++e) {
        shuffle(order.begin(), order.end(), rng.gen);
        double epoch_loss = 0.0;
        int batches = 0;
        for (int i = 0; i < N; i += batch) {
            vector<int> idxs;
            for (int j = i; j < min(N, i + batch); ++j) idxs.push_back(order[j]);
            epoch_loss += step(idxs);
            ++batches;
        }
        epoch_loss /= max(1, batches);
        if (e % 200 == 0 || e == 1) {
            cout << "Epoch " << e << ": loss = " << epoch_loss << "\n";
        }
    }

    // Evaluate on the training data: report BCE, average KL, and accuracy
    double bce = 0.0, kl_avg = 0.0;
    int correct = 0;
    for (int i = 0; i < N; ++i) {
        double x = data[i].x;
        int y = data[i].y;
        double mu = a * x + b;
        double logvar = min(5.0, max(-5.0, s * x + t));
        double sigma = exp(0.5 * logvar);
        double z = mu; // use the mean of z for deterministic evaluation
        double p = sigmoid(w * z + c);
        bce += -(y * log(max(1e-12, p)) + (1 - y) * log(max(1e-12, 1 - p)));
        kl_avg += 0.5 * (mu * mu + sigma * sigma - logvar - 1.0);
        int yhat = (p >= 0.5);
        correct += (yhat == y);
    }
    bce /= N;
    kl_avg /= N;

    cout << fixed << setprecision(4);
    cout << "Final params: a=" << a << ", b=" << b << ", s=" << s << ", t=" << t
         << "; w=" << w << ", c=" << c << "\n";
    cout << "Train BCE: " << bce << ", Avg KL: " << kl_avg
         << ", Accuracy: " << (100.0 * correct / N) << "%\n";
    cout << "Tip: Increase beta to tighten the bottleneck (higher KL pressure; "
            "accuracy degrades once task-relevant bits are squeezed out).\n";
    return 0;
}
```
This self-contained program trains a 1D Variational Information Bottleneck classifier. The encoder outputs a Gaussian q(z|x) via linear functions for mean and log-variance; the decoder is logistic regression p(y|z). The loss is the VIB objective: binary cross-entropy plus β times the KL divergence to a standard normal prior. Reparameterization z = μ + σ ε enables unbiased gradient estimates. After training, it reports prediction quality and average KL, which reflects compression. Vary β to observe the accuracy–compression trade-off.