
Information Bottleneck in Deep Learning

Key Points

  • The Information Bottleneck (IB) principle formalizes learning compact representations T that keep only the information about X that is useful for predicting Y.
  • Mathematically, it trades off compression I(X;T) against prediction I(T;Y) through the objective min I(X;T) − β I(T;Y).
  • In deep networks, the data processing inequality implies mutual information with the input cannot increase across layers, motivating progressive compression.
  • The Variational Information Bottleneck (VIB) makes the IB objective trainable by replacing mutual information terms with tractable losses using variational bounds.
  • Adding stochasticity (e.g., Gaussian noise) in hidden representations is crucial to make I(X;T) finite and learnable for continuous variables.
  • A properly tuned β controls the simplicity–accuracy trade-off: large β encourages stronger compression and potentially better generalization.
  • Estimating mutual information is hard; simple binning estimators can illustrate trends but are biased, while variational bounds are typically preferred for training.
  • You can prototype IB ideas in C++ using a simple stochastic encoder, a logistic decoder, reparameterization, and the closed-form KL between Gaussians.

Prerequisites

  • Probability and Random Variables — Mutual information, entropy, and KL divergence are defined over probability distributions.
  • Entropy and Mutual Information — Understanding the IB objective requires facility with I(X;T), H(X), and related identities.
  • Stochastic Gradient Descent — VIB is trained with gradient-based optimization using minibatches.
  • Variational Inference and KL Divergence — VIB relies on variational bounds and KL regularization to make IB tractable.
  • Neural Networks and Backpropagation — Representations T are hidden layers; training adjusts encoder/decoder parameters.
  • Gaussian Distributions — The encoder and prior are often Gaussian, and the KL has a closed-form expression.

Detailed Explanation


01 Overview

The Information Bottleneck (IB) is a framework for representation learning that explicitly balances two goals: preserve information about the target Y while discarding irrelevant details from the input X. Instead of just minimizing a prediction loss, IB asks us to find an intermediate random variable T (a representation) that is maximally informative for Y but minimally informative about X. This is formalized through mutual information: we want T that has low I(X;T) (compression) and high I(T;Y) (predictive power). In deep learning, T can be a hidden layer. The IB view provides a principled lens to understand why deeper layers often look simpler or more robust: they retain what matters for the task and filter out nuisance variation.

Operationally, directly optimizing mutual information is hard in high dimensions. The Variational Information Bottleneck (VIB) converts the problem into a tractable stochastic optimization by introducing a parametric encoder q_ϕ(t|x) and decoder p_θ(y|t), a simple prior p(t), and variational bounds that upper bound I(X;T) and lower bound I(T;Y). The resulting loss resembles a standard prediction loss plus a KL regularizer that pushes T toward a simple prior, encouraging compression. This bridges information theory and practical deep learning, allowing one to train bottlenecked models with standard stochastic gradient methods, and to tune the compression–accuracy trade-off via a single parameter β.

02 Intuition & Analogies

Imagine you’re summarizing a long article for a friend who only cares about a single question the article answers. You could copy the whole article (no compression), but that’s unnecessary and overwhelming. Instead, you’ll write a short summary that keeps only the parts relevant to the question. The Information Bottleneck treats a neural network’s hidden layer like that summary. The input X is the article, the target Y is the question, and the representation T is your summary: keep what helps answer Y, forget the rest.

Another analogy is taking a photo in fog with a camera that has a tunable aperture. A wide-open aperture captures everything—details plus noise—but the picture may be blurry or overwhelming for the downstream task. Closing the aperture throws away some light (compression) but can increase contrast for the objects that matter, making it easier to detect them. In IB, the aperture is controlled by β: a larger β enforces a tighter bottleneck, keeping fewer bits about X, which can improve generalization by removing irrelevant or noisy patterns that could cause overfitting.

Finally, think of a factory assembly line: raw materials (X) pass through stations (layers). Each station simplifies and standardizes the product (representation), making later assembly steps easier. The data processing inequality is the rule that no station can recreate lost raw detail without access to the original supplies. So as items move down the line, they typically become simpler yet more suitable for the final product (predicting Y). Injecting controlled randomness (like shaking a sieve) helps separate dust from useful grains; in networks, adding noise to T makes compression measurable and effective for continuous signals.

03 Formal Definition

Let X and Y be random variables with joint distribution p(x,y). We seek a stochastic representation T with Markov chain Y ↔ X ↔ T, typically induced by a conditional q_ϕ(t|x). The Information Bottleneck objective is

Minimize: I(X;T) − β I(T;Y)

where I(·;·) denotes mutual information and β ≥ 0 trades off compression and prediction. Mutual information is defined as I(A;B) = E_{p(a,b)}[ log ( p(a,b) / (p(a) p(b)) ) ]. In deep networks, a layer T is a measurable function (or stochastic mapping) of X; therefore, by the data processing inequality, I(X;T) ≤ I(X;X) and I(T;Y) ≤ I(X;Y).

Direct optimization is intractable for high-dimensional continuous variables, so the Variational Information Bottleneck (VIB) introduces: (i) an encoder q_ϕ(t|x), (ii) a decoder p_θ(y|t), and (iii) a simple prior p(t) (e.g., standard normal). Using variational bounds, we obtain an upper bound on I(X;T) via E_{p(x)}[ KL(q_ϕ(t|x) ‖ p(t)) ] and relate I(T;Y) to the expected log-likelihood E_{p(x,y)} E_{q_ϕ(t|x)}[ log p_θ(y|t) ]. The trainable surrogate loss becomes

L(ϕ,θ) = E_{p(x,y)} E_{q_ϕ(t|x)}[ −log p_θ(y|t) ] + β E_{p(x)}[ KL(q_ϕ(t|x) ‖ p(t)) ],

which is optimized with stochastic gradients and the reparameterization trick when q_ϕ is Gaussian.

04 When to Use

Use the Information Bottleneck when you want models that generalize better by discarding task-irrelevant information, produce robust representations under noise or distribution shift, or are compact and potentially easier to compress or prune. Concretely:

  • Robust classification: Encourage hidden layers to ignore nuisance factors (e.g., background textures) while keeping class-relevant structure.
  • Semi-supervised and transfer learning: Learn representations T that capture predictive structure shared across tasks, reducing overfitting to domain-specific noise.
  • Model compression and privacy: The KL term biases T toward a simple prior, which can reduce the effective information leaked about X, aiding compression and privacy-sensitive applications.
  • Small-data regimes: Regularizing with IB (via β) can prevent memorization by limiting I(X;T), often improving test performance.
  • Interpretability: A low-dimensional stochastic T can be probed or visualized, clarifying what information the network retains about X.

Choose VIB or related stochastic bottlenecks when you can afford slight changes to your model and training loop. If you only need post-hoc analysis, mutual information estimators (e.g., binning or kNN) can reveal compression trends across layers, though they may be biased. Increase β to strengthen compression, and add explicit noise in T to make the IB signal well-defined for continuous variables.

⚠️Common Mistakes

  • Assuming compression always increases during training: In practice, mutual information may fluctuate, depend on estimator bias, and be architecture- or dataset-specific. Validate with multiple estimators and seeds.
  • Estimating I(X;T) for deterministic continuous networks without noise: It is infinite in theory. Add stochasticity (e.g., Gaussian noise in T) or use variational upper bounds via KL to a prior.
  • Misinterpreting β: Very large β can overcompress T, hurting accuracy by throwing away task-relevant bits. Tune β using validation metrics and monitor both KL and prediction loss.
  • Using coarse binning for MI estimation: Too few bins bias MI downward; too many bins cause undersampling and high variance. Calibrate bin counts, apply add-one smoothing, and compare to kNN or variational estimates.
  • Ignoring the Markov assumption: The IB objective assumes Y ↔ X ↔ T. Violations (e.g., label leakage into T from elsewhere) break the theoretical guarantees.
  • Forgetting scaling and stability: Reparameterization with exp(log-variance) can cause numerical issues. Use log-variance, clamp ranges, or softplus to keep standard deviations well-behaved.
  • Equating lower cross-entropy with higher I(T;Y) blindly: While related, cross-entropy is a surrogate. Improvements can reflect calibration or overfitting; use held-out data and, when possible, direct MI surrogates.

Key Formulas

Mutual Information

I(X;T) = E_{p(x,t)}[ log ( p(x,t) / (p(x) p(t)) ) ]

Explanation: Mutual information measures how much knowing T reduces uncertainty about X (and vice versa). It is zero if and only if X and T are independent.

Information Bottleneck Lagrangian

min_{q(t|x)} [ I(X;T) − β I(T;Y) ]

Explanation: This objective encourages representations T that are compressed with respect to X while preserving predictive information about Y. The parameter β tunes the strength of compression.

Variational Upper Bound on I(X;T)

I(X;T) ≤ E_{p(x)}[ KL(q_ϕ(t|x) ‖ p(t)) ]

Explanation: Replacing I(X;T) with an expected KL to a simple prior yields a tractable surrogate for compression. The bound becomes exact when the prior p(t) equals the true marginal ∫ q_ϕ(t|x) p(x) dx.

Variational Lower Bound on I(T;Y)

I(T;Y) ≥ E_{p(x,y)} E_{q_ϕ(t|x)}[ log p_θ(y|t) ] + const

Explanation: Maximizing the expected log-likelihood of labels under the decoder increases a lower bound on I(T;Y). The constant depends on p(y) and does not affect optimization.

VIB Training Objective

L(ϕ,θ) = E_{p(x,y)} E_{q_ϕ(t|x)}[ −log p_θ(y|t) ] + β E_{p(x)}[ KL(q_ϕ(t|x) ‖ p(t)) ]

Explanation: This loss combines a prediction term and a compression term. Minimizing it with SGD yields a representation T that balances accuracy and compactness.

KL of Univariate Gaussians

KL( N(μ, σ²) ‖ N(0, 1) ) = ½ (μ² + σ² − log σ² − 1)

Explanation: When the encoder is Gaussian and the prior is standard normal, the KL term has this closed form. It provides a simple, stable compression penalty.

Reparameterization Trick

z = μ(x) + σ(x)·ε,  ε ∼ N(0, 1)

Explanation: Sampling is rewritten as a differentiable transformation of noise, allowing low-variance gradient estimates of expectations over q(z|x).

Data Processing Along Layers

I(X;T₂) ≤ I(X;T₁) ≤ I(X;X)

Explanation: For a feedforward chain X → T₁ → T₂, mutual information with the input cannot increase. This formalizes the notion of progressive compression.

Entropy and Conditional Entropy (discrete)

H(X) = −Σ_i p(x_i) log p(x_i),  H(X|T) = −Σ_{i,j} p(x_i, t_j) log p(x_i | t_j)

Explanation: Entropy quantifies average uncertainty. Mutual information can be written as I(X;T)=H(X)-H(X|T), highlighting how T reduces uncertainty about X.

Histogram MI Estimator

Î(X;T) = Σ_{b_x, b_t} p̂(b_x, b_t) log [ p̂(b_x, b_t) / ( p̂(b_x) p̂(b_t) ) ]

Explanation: A practical estimator using discrete bins. It is simple but biased; results depend on bin count and sample size.

Complexity Analysis

Training with the Variational Information Bottleneck typically mirrors the cost of standard stochastic neural training plus an extra KL term per sample. For an encoder that outputs a d_z-dimensional Gaussian (mean and variance) and a decoder with cost O(C) per forward pass, each sample requires O(d_z + C) time to produce z via reparameterization and compute the prediction loss, plus O(d_z) to evaluate the closed-form KL. Backpropagation roughly doubles the forward cost, so for N samples per epoch and batch size B, the per-epoch time is O((N/B) · B · (C + d_z)) = O(N(C + d_z)). Space complexity is dominated by model parameters and per-batch activations; with a small latent and a shallow decoder, this is typically O(P + B·d_z), where P is the parameter count.

For mutual information estimation via histogram binning, building a joint 2D histogram over B_x × B_t bins from N scalar pairs requires O(N) time to accumulate counts and O(B_x B_t) to normalize and sum the MI. Memory is O(B_x B_t) for the joint table plus O(B_x + B_t) for marginals. In practice, the N term dominates; to reduce variance at fixed N, one must keep B_x and B_t modest. More accurate MI estimators (e.g., kNN or neural estimators) trade higher constant factors and more complex data structures for reduced bias and better scalability in higher dimensions. Overall, VIB training scales similarly to standard deep nets with minor overhead, while MI estimation cost depends heavily on the estimator choice and dimensionality.

Code Examples

Estimating I(X;T) by histogram binning to illustrate bottleneck compression across two noisy layers
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    uniform_real_distribution<double> uni;
    RNG() : gen(random_device{}()), uni(0.0, 1.0) {}
    double uniform() { return uni(gen); }
    // Box-Muller transform for a standard normal sample
    double normal() {
        double u1 = max(1e-12, uniform());
        double u2 = max(1e-12, uniform());
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }
};

// Estimate MI between scalar vectors x and t using 2D histogram binning.
// Returns MI in bits.
double estimate_mi(const vector<double>& x, const vector<double>& t,
                   int bins_x = 50, int bins_t = 50) {
    int n = (int)x.size();
    if ((int)t.size() != n || n == 0) return 0.0;

    double x_min = *min_element(x.begin(), x.end());
    double x_max = *max_element(x.begin(), x.end());
    double t_min = *min_element(t.begin(), t.end());
    double t_max = *max_element(t.begin(), t.end());
    // Slightly expand ranges to avoid boundary issues
    double eps = 1e-9;
    double x_range = (x_max - x_min) + eps;
    double t_range = (t_max - t_min) + eps;

    vector<double> joint((size_t)bins_x * bins_t, 0.0);
    vector<double> px(bins_x, 0.0), pt(bins_t, 0.0);

    auto bx = [&](double xv) {
        int b = (int)((xv - x_min) / x_range * bins_x);
        return min(max(b, 0), bins_x - 1);
    };
    auto bt = [&](double tv) {
        int b = (int)((tv - t_min) / t_range * bins_t);
        return min(max(b, 0), bins_t - 1);
    };

    for (int i = 0; i < n; ++i) {
        int ix = bx(x[i]);
        int it = bt(t[i]);
        joint[(size_t)ix * bins_t + it] += 1.0;
        px[ix] += 1.0;
        pt[it] += 1.0;
    }

    // Convert to probabilities with add-one smoothing to avoid log(0).
    // Each joint cell receives +1, so marginal bin i receives +bins_t (its
    // whole row of cells) and marginal bin j receives +bins_x. This keeps the
    // smoothed marginals exactly consistent with the smoothed joint, and all
    // three distributions sum to 1 without further renormalization.
    double n_eff = n + (double)bins_x * bins_t;
    for (double& v : joint) v = (v + 1.0) / n_eff;
    for (int i = 0; i < bins_x; ++i) px[i] = (px[i] + (double)bins_t) / n_eff;
    for (int j = 0; j < bins_t; ++j) pt[j] = (pt[j] + (double)bins_x) / n_eff;

    double mi_nats = 0.0;
    for (int i = 0; i < bins_x; ++i) {
        for (int j = 0; j < bins_t; ++j) {
            double pxy = joint[(size_t)i * bins_t + j];
            mi_nats += pxy * log(pxy / (px[i] * pt[j]));
        }
    }
    // Convert nats to bits
    return mi_nats / log(2.0);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng;
    const int N = 20000; // number of samples

    // Generate a 1D input X from a mixture to have structure
    vector<double> X(N);
    for (int i = 0; i < N; ++i) {
        // Mixture of two Gaussians centered at -1 and +1
        double mean = (rng.uniform() < 0.5) ? -1.0 : 1.0;
        X[i] = mean + 0.4 * rng.normal();
    }

    // Two noisy layers: T1 = tanh(w1*X + b1 + noise1), T2 = tanh(w2*T1 + b2 + noise2)
    double w1 = 1.5, b1 = 0.2, noise1_std = 0.6;
    double w2 = -1.2, b2 = 0.1, noise2_std = 0.6;

    vector<double> T1(N), T2(N);
    for (int i = 0; i < N; ++i) {
        double t1 = tanh(w1 * X[i] + b1 + noise1_std * rng.normal());
        T1[i] = t1;
        T2[i] = tanh(w2 * t1 + b2 + noise2_std * rng.normal());
    }

    double I_X_T1 = estimate_mi(X, T1, 60, 60);
    double I_X_T2 = estimate_mi(X, T2, 60, 60);

    cout << fixed << setprecision(4);
    cout << "Estimated MI I(X;T1) [bits]: " << I_X_T1 << "\n";
    cout << "Estimated MI I(X;T2) [bits]: " << I_X_T2 << "\n";
    cout << "Note: By the data processing inequality, I(X;T2) should not exceed I(X;T1).\n";

    // Try stronger noise to show stronger compression
    double noise_boost = 1.2;
    for (int i = 0; i < N; ++i) {
        double t1 = tanh(w1 * X[i] + b1 + noise_boost * noise1_std * rng.normal());
        T1[i] = t1;
        T2[i] = tanh(w2 * t1 + b2 + noise_boost * noise2_std * rng.normal());
    }

    double I_X_T1_high = estimate_mi(X, T1, 60, 60);
    double I_X_T2_high = estimate_mi(X, T2, 60, 60);
    cout << "With higher noise: I(X;T1)=" << I_X_T1_high << ", I(X;T2)=" << I_X_T2_high << " [bits]\n";
    cout << "Higher noise tightens the bottleneck, typically lowering I(X;T).\n";
    return 0;
}

This program synthesizes 1D inputs X from a two-Gaussian mixture, passes them through two noisy nonlinear layers, and estimates I(X;T1) and I(X;T2) using a histogram-based mutual information estimator. The data processing inequality predicts that I(X;T2) ≤ I(X;T1). Increasing the injected Gaussian noise tightens the bottleneck, further reducing the estimated mutual information. While the estimator is biased, it qualitatively illustrates compression.

Time: O(N + B_x B_t), where N is the number of samples and B_x, B_t are the bin counts.
Space: O(B_x B_t) for the joint histogram plus O(B_x + B_t) for marginals.
Training a 1D Variational Information Bottleneck (VIB) classifier with closed-form Gaussian KL
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    uniform_real_distribution<double> uni;
    RNG() : gen(random_device{}()), uni(0.0, 1.0) {}
    double uniform() { return uni(gen); }
    double normal() {
        double u1 = max(1e-12, uniform());
        double u2 = max(1e-12, uniform());
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }
};

struct Sample { double x; int y; };

// Sigmoid with a numerically stable split for positive/negative inputs
inline double sigmoid(double u) {
    if (u >= 0) {
        double z = exp(-u);
        return 1.0 / (1.0 + z);
    } else {
        double z = exp(u);
        return z / (1.0 + z);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng;
    // Generate a simple binary dataset: x|y=0 ~ N(-1, 0.3^2), x|y=1 ~ N(1, 0.3^2)
    const int N = 2000;
    vector<Sample> data(N);
    for (int i = 0; i < N; ++i) {
        int y = (i < N / 2) ? 0 : 1;
        double mean = (y == 0 ? -1.0 : 1.0);
        data[i] = {mean + 0.3 * rng.normal(), y};
    }
    shuffle(data.begin(), data.end(), rng.gen);

    // Encoder: q(z|x) = N(mu(x), sigma^2(x)) with
    //   mu(x)     = a * x + b
    //   logvar(x) = s * x + t   (sigma = exp(0.5 * logvar))
    double a = 0.5, b = 0.0, s = -0.1, t = -1.0; // initialize encoder params
    // Decoder: p(y=1|z) = sigmoid(w * z + c)
    double w = 0.5, c = 0.0; // initialize decoder params

    double beta = 1.0; // compression strength; try 0.1, 1.0, 5.0 to see the trade-off
    double lr = 1e-2;
    int epochs = 2000;
    int batch = 64;

    auto step = [&](const vector<int>& idxs) {
        // Accumulate gradients over the batch
        double ga = 0, gb = 0, gs = 0, gt = 0, gw = 0, gc = 0;
        double loss_sum = 0.0;
        for (int id : idxs) {
            double x = data[id].x;
            int y = data[id].y;
            // Forward through encoder
            double mu = a * x + b;
            double logvar = s * x + t; // sigma^2 = exp(logvar)
            // Clamp logvar to avoid extreme sigmas
            logvar = min(5.0, max(-5.0, logvar));
            double sigma = exp(0.5 * logvar);
            // Reparameterization: z = mu + sigma * eps
            double eps = rng.normal();
            double z = mu + sigma * eps;
            // Decoder prediction
            double u = w * z + c;
            double p = sigmoid(u);
            // Binary cross-entropy
            double nll = -(y * log(max(1e-12, p)) + (1 - y) * log(max(1e-12, 1 - p)));
            // KL(q||p) to N(0,1): 0.5 * (mu^2 + sigma^2 - logvar - 1)
            double kl = 0.5 * (mu * mu + sigma * sigma - logvar - 1.0);
            loss_sum += nll + beta * kl;

            // Gradients: dNLL/du = p - y
            double dL_du = p - y;
            // Decoder params
            gw += dL_du * z; // du/dw = z
            gc += dL_du;     // du/dc = 1
            // Backprop to z
            double dL_dz = dL_du * w;
            // dL/dmu = dL/dz * dz/dmu + beta * dKL/dmu = dL_dz * 1 + beta * mu
            double dL_dmu = dL_dz + beta * mu;
            // dKL/dlogvar = 0.5 * (sigma^2 - 1);
            // from z: dz/dsigma = eps and dsigma/dlogvar = 0.5 * sigma
            double dL_dlogvar = beta * 0.5 * (sigma * sigma - 1.0)
                              + dL_dz * eps * 0.5 * sigma;
            // Encoder params
            ga += dL_dmu * x;     // dmu/da = x
            gb += dL_dmu;         // dmu/db = 1
            gs += dL_dlogvar * x; // dlogvar/ds = x
            gt += dL_dlogvar;     // dlogvar/dt = 1
        }
        double scale = 1.0 / (double)idxs.size();
        a -= lr * ga * scale;
        b -= lr * gb * scale;
        s -= lr * gs * scale;
        t -= lr * gt * scale;
        w -= lr * gw * scale;
        c -= lr * gc * scale;
        return loss_sum * scale;
    };

    // Training loop
    vector<int> order(N);
    iota(order.begin(), order.end(), 0);
    for (int e = 1; e <= epochs; ++e) {
        shuffle(order.begin(), order.end(), rng.gen);
        double epoch_loss = 0.0;
        int batches = 0;
        for (int i = 0; i < N; i += batch) {
            vector<int> idxs;
            for (int j = i; j < min(N, i + batch); ++j) idxs.push_back(order[j]);
            epoch_loss += step(idxs);
            ++batches;
        }
        epoch_loss /= max(1, batches);
        if (e % 200 == 0 || e == 1) {
            cout << "Epoch " << e << ": loss = " << epoch_loss << "\n";
        }
    }

    // Evaluate on the training data: report BCE and average KL per sample
    double bce = 0.0, kl_avg = 0.0;
    int correct = 0;
    for (int i = 0; i < N; ++i) {
        double x = data[i].x;
        int y = data[i].y;
        double mu = a * x + b;
        double logvar = min(5.0, max(-5.0, s * x + t));
        double sigma = exp(0.5 * logvar);
        // Use the mean of z for prediction (deterministic evaluation)
        double p = sigmoid(w * mu + c);
        bce += -(y * log(max(1e-12, p)) + (1 - y) * log(max(1e-12, 1 - p)));
        kl_avg += 0.5 * (mu * mu + sigma * sigma - logvar - 1.0);
        correct += ((p >= 0.5) == (y == 1));
    }
    bce /= N;
    kl_avg /= N;

    cout << fixed << setprecision(4);
    cout << "Final params: a=" << a << ", b=" << b << ", s=" << s << ", t=" << t
         << "; w=" << w << ", c=" << c << "\n";
    cout << "Train BCE: " << bce << ", Avg KL: " << kl_avg
         << ", Accuracy: " << (100.0 * correct / N) << "%\n";
    cout << "Tip: Increase beta to tighten the bottleneck (the KL term shrinks "
            "under stronger pressure; accuracy holds up to a point, then degrades).\n";
    return 0;
}

This self-contained program trains a 1D Variational Information Bottleneck classifier. The encoder outputs a Gaussian q(z|x) via linear functions for mean and log-variance; the decoder is logistic regression p(y|z). The loss is the VIB objective: binary cross-entropy plus β times the KL divergence to a standard normal prior. Reparameterization z = μ + σ ε enables unbiased gradient estimates. After training, it reports prediction quality and average KL, which reflects compression. Vary β to observe the accuracy–compression trade-off.

Time: O(N(d_z + C)) per epoch, where d_z = 1 here and C is the decoder cost; over E epochs the total is O(E · N(d_z + C)), with small constant factors for a 1D latent.
Space: O(P + B · d_z), where P is the number of parameters (constant here) and B is the batch size.