📚 Theory · Intermediate

Mutual Information

Key Points

  • Mutual Information (MI) measures how much knowing one random variable reduces uncertainty about another.
  • It is symmetric, non‑negative, and equals zero if and only if the variables are independent.
  • MI connects entropy and divergence: I(X;Y) = H(X) + H(Y) - H(X,Y) = D_{KL}(P_{XY} || P_X P_Y).
  • Processing data cannot increase information: the Data Processing Inequality ensures I(X;Y) ≥ I(X;f(Y)).
  • Conditional MI, I(X;Y|Z), measures shared information between X and Y after Z is known.
  • MI decomposes with the chain rule: I(X;Y,Z) = I(X;Z) + I(X;Y|Z).
  • In ML, MI guides feature selection and underlies contrastive learning via the InfoNCE bound.
  • Discrete MI can be computed from counts; continuous MI usually requires estimators (binning, kNN, or contrastive bounds).

Prerequisites

  • →Basic Probability — Understanding random variables, joint and marginal distributions is fundamental to MI.
  • →Logarithms and Exponents — Entropy and MI use logarithms; units (nats vs. bits) depend on log base.
  • →Entropy and KL Divergence — MI is defined via entropy identities and as a KL divergence.
  • →Hash Maps (C++ STL) — Efficient counting of categories for empirical MI relies on unordered_map.
  • →Numerical Stability (Log-Sum-Exp) — Stable evaluation of softmax/InfoNCE requires the log-sum-exp trick.
  • →Linear Algebra Basics — InfoNCE uses vector dot products and batch matrix operations.
  • →Big-O Complexity — To analyze scalability of MI estimators and contrastive objectives.

Detailed Explanation


01 Overview

Hook: Imagine two locked boxes labeled X and Y. When you open box X, how much does it help you guess what’s inside box Y? Mutual Information (MI) is the ruler we use to measure that help.

Concept: Mutual Information quantifies the reduction in uncertainty about one random variable when another is observed. It blends two core ideas from information theory: entropy (uncertainty) and Kullback–Leibler divergence (difference between probability distributions). Mathematically, MI has several equivalent forms, such as I(X;Y) = H(X) + H(Y) - H(X,Y), I(X;Y) = H(X) - H(X|Y), and I(X;Y) = D_{KL}(P_{XY} || P_X P_Y). These identities make MI both interpretable (in terms of uncertainty reduction) and analyzable (as a divergence between the joint distribution and the product of marginals).

Example: If X is the language a document is written in (English, Spanish, etc.) and Y is the frequency of certain letters, then knowing Y should tell you quite a bit about X (e.g., the letter distribution of Spanish vs. English). MI captures how strongly Y’s evidence reduces uncertainty about X. If X and Y are independent (say, language and coin flips done by a separate person), MI is zero because observing one doesn’t help predict the other.

02 Intuition & Analogies

Think of uncertainty as fog and information as a flashlight. Entropy measures the thickness of the fog around a single object (a variable). Mutual Information measures how much the flashlight from Y can pierce the fog around X (and vice versa). If Y’s flashlight points exactly where X is hiding, you learn a lot, so MI is high. If Y shines randomly and doesn’t illuminate X at all, MI is zero.

Consider a detective story: X is the identity of a culprit; Y is a clue like a fingerprint pattern. Before seeing the clue, you have many suspects (high entropy). After seeing Y, the suspect list narrows. The amount of narrowing is I(X;Y). Now imagine transforming the clue Y into a blurry photocopy f(Y). The Data Processing Inequality says the blurred clue cannot make you more certain than the original; information cannot be increased by post‑processing.

For conditional MI, suppose Z is extra context (e.g., the city where the crime occurred). I(X;Y|Z) asks: once you already know Z, how much extra does Y help? If the city already tells you which suspects to consider, the fingerprint may add less. The chain rule I(X;Y,Z) = I(X;Z) + I(X;Y|Z) mirrors the idea of staged investigation: first learn from Z, then from Y given that context.

In machine learning, MI is like a quality score for representations. Good features carry a lot of information about targets (high MI with labels), and useful encoders preserve information about relevant signals while discarding noise (information bottleneck). Contrastive learning even trains models by maximizing a computable lower bound to MI (InfoNCE).

03 Formal Definition

Let (X,Y) be random variables on a common probability space with joint distribution P_{XY} and marginals P_X, P_Y. The mutual information is defined as I(X;Y) = D_{KL}(P_{XY} || P_X P_Y), with the divergence evaluated as a sum or an integral for discrete or continuous cases as appropriate. Equivalently, in terms of entropies, I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y). For discrete variables with mass functions p(x,y), p(x), p(y), we have I(X;Y) = Σ_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ). Key properties include symmetry (I(X;Y) = I(Y;X)), non‑negativity (I(X;Y) ≥ 0, with equality iff independence), and the Data Processing Inequality (for any measurable f, I(X;Y) ≥ I(X;f(Y))). Conditional mutual information extends MI with a conditioning variable Z: I(X;Y|Z) = E_Z[ D_{KL}(P_{XY|Z} || P_{X|Z} P_{Y|Z}) ] = Σ_z p(z) Σ_{x,y} p(x,y|z) log( p(x,y|z) / (p(x|z) p(y|z)) ). The chain rule decomposes MI: I(X;Y,Z) = I(X;Z) + I(X;Y|Z), reflecting staged information gain.
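To make the definition concrete, here is a minimal worked sketch (not part of the original text) that evaluates the discrete formula for a hypothetical 2×2 joint distribution; the probabilities in the table and the expected output are illustrative only.

#include <bits/stdc++.h>
using namespace std;

// Worked example: MI of a 2x2 joint distribution, computed directly from the
// definition I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ).
int main() {
    // Hypothetical joint distribution (rows: x in {0,1}, cols: y in {0,1})
    double p[2][2] = {{0.4, 0.1},
                      {0.1, 0.4}};

    // Marginals p(x) and p(y)
    double px[2] = {0.0, 0.0}, py[2] = {0.0, 0.0};
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y) { px[x] += p[x][y]; py[y] += p[x][y]; }

    // Plug-in evaluation of the definition
    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (p[x][y] > 0) mi += p[x][y] * log(p[x][y] / (px[x] * py[y]));

    cout << fixed << setprecision(6);
    cout << "I(X;Y) = " << mi << " nats = " << mi / log(2.0) << " bits\n";
    // Expected: about 0.1927 nats (about 0.278 bits) for this table.
    return 0;
}

For this table the two variables agree 80% of the time, so observing Y removes roughly 0.28 of the 1 bit of uncertainty in X.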

04 When to Use

  • Feature selection: Choose variables X (features) that carry high MI with Y (labels), even if relationships are nonlinear or non‑monotonic. MI is model‑agnostic, making it robust for pre‑processing and screening (see the screening sketch after this list).
  • Representation learning and contrastive learning: Maximize information between paired views (e.g., two augmentations of the same image). The InfoNCE objective provides a tractable lower bound to MI, enabling self‑supervised training.
  • Information bottleneck: Learn compressed representations Z that retain information about targets Y while discarding irrelevant parts of X. MI terms I(Z;X) and I(Z;Y) formalize the trade‑off.
  • Dependency discovery: Detect general statistical dependence between variables (beyond linear correlation). This is useful for exploratory data analysis, causal discovery pipelines (as a proxy), and clustering.
  • Communication systems: Quantify channel capacity and coding performance, where MI between transmitted and received signals measures achievable reliable rates.
  • Neuroscience and signals: Measure how much a neural spike train or sensor signal reveals about a stimulus.

Use MI when you need a universal dependence measure that is invariant under invertible reparameterizations and not restricted to linear relationships. Prefer entropy/correlation when linear Gaussian assumptions hold and are sufficient; prefer MI when interactions are complex or multimodal.
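The screening sketch below is an illustration added here, not part of the original text: it ranks a few hypothetical integer-coded features by their plug-in MI with a label. The data, the feature ordering, and the compact plugin_mi helper are all made up for demonstration; the full estimator appears later in Code Examples.

#include <bits/stdc++.h>
using namespace std;

// Plug-in MI (in nats) between two discrete, integer-coded variables.
// Compact helper for screening; see the full estimator in Code Examples.
double plugin_mi(const vector<int>& X, const vector<int>& Y) {
    size_t n = X.size();
    map<int, double> cx, cy;
    map<pair<int,int>, double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i], Y[i]}]; }
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n;
        double px  = cx[kv.first.first] / n;
        double py  = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }
    return mi;
}

int main() {
    // Hypothetical label and three hypothetical discrete features.
    vector<int> label = {0,0,0,0,1,1,1,1,1,1};
    vector<vector<int>> features = {
        {0,0,0,1,1,1,1,1,1,1},   // tracks the label closely
        {0,1,0,1,0,1,0,1,0,1},   // alternating, weakly related
        {2,2,2,2,2,2,2,2,2,2}    // constant: carries no information
    };

    // Score each feature by MI with the label and sort descending.
    vector<pair<double,int>> scores;
    for (int f = 0; f < (int)features.size(); ++f)
        scores.push_back({plugin_mi(features[f], label), f});
    sort(scores.rbegin(), scores.rend());

    cout << fixed << setprecision(4);
    for (auto& s : scores)
        cout << "feature " << s.second << ": MI = " << s.first << " nats\n";
    return 0;
}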

⚠️ Common Mistakes

  • Using naive histograms for continuous MI without sensitivity checks: bin size strongly affects estimates. Mitigation: perform bin‑sweep sensitivity analysis, use kNN or kernel estimators, or rely on contrastive lower bounds when appropriate.
  • Forgetting the base of the logarithm: MI in nats (natural log) vs. bits (log base 2). Always report the base or convert by dividing by ln 2.
  • Interpreting MI as causal influence: High MI indicates dependence, not causation. Conditional MI and causal frameworks are needed for causal claims.
  • Violating the Data Processing Inequality in analysis: If you post‑process features, reported MI with targets should not increase. Apparent increases often indicate estimation noise or data leakage.
  • Ignoring finite‑sample bias: Empirical MI from counts is upward biased, especially with many categories and small sample sizes. Apply bias corrections (e.g., Miller–Madow; see the sketch after this list) or use cross‑validation/bootstrap to assess reliability.
  • Mixing conditional and unconditional MI: I(X;Y|Z) can be small even when I(X;Y) is large, if Z already explains the dependence. Always specify conditioning variables.
  • Assuming symmetry implies identical roles: While I(X;Y) = I(Y;X), in learning you often optimize asymmetric objectives (predict Y from X). Keep roles clear when designing models.
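As a hedged sketch of the bias-correction point above: the standard Miller–Madow correction adds (K - 1)/(2n) to each plug-in entropy, where K is the number of observed non-empty cells. Applying it to the identity I = H(X) + H(Y) - H(X,Y) combines into a single correction term for MI; that combination, and the reuse of the toy samples from the first code example, are assumptions made for this illustration.

#include <bits/stdc++.h>
using namespace std;

// Miller–Madow-style bias correction for plug-in MI, applied through the
// identity I = H(X) + H(Y) - H(X,Y). Each plug-in entropy gets the standard
// (K - 1)/(2n) term, where K counts the observed (non-empty) cells; the three
// terms combine into one correction for MI. Illustrative sketch only.
int main() {
    // Toy paired samples (same as the first code example).
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};
    size_t n = X.size();

    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }

    // Plug-in MI in nats.
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n, px = cx[kv.first.first] / n, py = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }

    // Combined Miller–Madow term: (K_x - 1 + K_y - 1 - (K_xy - 1)) / (2n).
    double correction = (static_cast<double>(cx.size()) + cy.size() - cxy.size() - 1.0) / (2.0 * n);
    cout << fixed << setprecision(6);
    cout << "Plug-in MI (nats):      " << mi << "\n";
    cout << "Miller-Madow MI (nats): " << mi + correction << "\n";
    return 0;
}

Because the plug-in estimate is upward biased, the combined correction is typically negative and pulls the estimate down.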

Key Formulas

Entropy Identity

I(X;Y) = H(X) + H(Y) - H(X,Y)

Explanation: Mutual information equals the sum of marginal entropies minus joint entropy. It measures how much the joint uncertainty is less than what it would be if variables were independent.

Uncertainty Reduction

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Explanation: MI is the reduction in uncertainty of one variable after observing the other. Zero reduction means independence.

KL Form

I(X;Y) = D_{KL}(P_{XY} || P_X P_Y)

Explanation: MI is the KL divergence between the joint distribution and the product of marginals. It quantifies how far dependence is from independence.

Discrete MI

I(X;Y) = Σ_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )

Explanation: For discrete variables, MI is a sum over all value pairs weighted by joint probabilities. Each term measures the local departure from independence.

Conditional MI

I(X;Y|Z) = Σ_z p(z) Σ_{x,y} p(x,y|z) log( p(x,y|z) / (p(x|z) p(y|z)) )

Explanation: Conditional MI averages the MI between X and Y across each value of Z. It captures remaining dependence after accounting for Z.

Chain Rule

I(X;Y,Z) = I(X;Z) + I(X;Y|Z)

Explanation: MI with a pair (Y,Z) can be decomposed into information from Z plus additional information from Y given Z. Useful for staged analysis.

Data Processing Inequality

I(X;Y) ≥ I(X;f(Y)) for any measurable function f

Explanation: Applying any function f to Y cannot increase information about X. Post‑processing cannot create new information.
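A small numerical check of the inequality (an added illustration, not from the original text): the sketch below computes plug-in MI before and after a deterministic merge f of Y's categories. Because the plug-in estimate is the exact MI of the empirical distribution, the inequality also holds for the printed numbers; the samples and the merge rule are assumptions chosen for the demo.

#include <bits/stdc++.h>
using namespace std;

// Illustrative check of the Data Processing Inequality on discrete samples:
// merging categories of Y (a deterministic post-processing f) cannot raise MI with X.
double plugin_mi(const vector<int>& X, const vector<int>& Y) {
    size_t n = X.size();
    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n, px = cx[kv.first.first] / n, py = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }
    return mi;
}

int main() {
    // Hypothetical samples where Y tracks X closely.
    vector<int> X = {0,0,0,1,1,1,2,2,2,2};
    vector<int> Y = {0,0,1,1,1,2,2,2,2,0};

    // f merges categories 1 and 2 of Y into a single coarse label.
    vector<int> fY(Y.size());
    for (size_t i = 0; i < Y.size(); ++i) fY[i] = (Y[i] == 0) ? 0 : 1;

    cout << fixed << setprecision(6);
    cout << "I(X;Y)    = " << plugin_mi(X, Y)  << " nats\n";
    cout << "I(X;f(Y)) = " << plugin_mi(X, fY) << " nats (never larger, by the DPI)\n";
    return 0;
}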

Non-Negativity and Independence

I(X;Y) ≥ 0, with I(X;Y) = 0 if and only if X and Y are independent

Explanation: MI is always non‑negative by Gibbs’ inequality. It is zero exactly when X and Y are independent.

Empirical MI (Counts)

Î(X;Y) = Σ_{x,y} (n_{xy}/n) log( (n · n_{xy}) / (n_x n_y) )

Explanation: A plug‑in estimator using empirical frequencies (counts n_{xy}, n_x, n_y out of n samples). It is simple but can be biased with small samples or many categories.

InfoNCE Loss

L_{InfoNCE} = -(1/N) Σ_{i=1}^{N} log( exp(s(q_i,k_i)/τ) / Σ_{j=1}^{N} exp(s(q_i,k_j)/τ) )

Explanation: Contrastive loss over a batch with similarities s(q_i,k_j) and temperature τ. Encourages high similarity for positive pairs relative to negatives.

InfoNCE MI Lower Bound

I(X;Y) ≥ log N - L_{InfoNCE}

Explanation: For a batch of size N, the mutual information is lower‑bounded by log N minus the InfoNCE loss. Larger batches can give tighter bounds.
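A quick worked illustration of the bound, using assumed numbers (not from the original text):

With N = 256 and a measured InfoNCE loss of 1.0 nats:
I(X;Y) ≥ log 256 - 1.0 ≈ 5.545 - 1.0 = 4.545 nats ≈ 6.56 bits.
Even a loss of exactly 0 certifies at most log 256 ≈ 5.545 nats ≈ 8 bits, so when the true MI is large the bound saturates unless the batch grows.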

Entropy Definitions

H(X) = -Σ_x p(x) log p(x),  H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y)

Explanation: Entropy and conditional entropy quantify uncertainty. These definitions are the building blocks for MI identities.
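As a minimal consistency check (added here, not in the original text), the sketch below computes plug-in entropies from the same toy samples used in the first code example and verifies numerically that the entropy identity and the uncertainty-reduction form give the same MI.

#include <bits/stdc++.h>
using namespace std;

// Numerically check the entropy identities on an empirical distribution:
// I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y). Illustrative sketch only.
int main() {
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};
    size_t n = X.size();

    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }

    auto H = [n](auto& counts) {            // plug-in entropy in nats
        double h = 0.0;
        for (auto& kv : counts) { double p = kv.second / n; h -= p * log(p); }
        return h;
    };

    double Hx = H(cx), Hy = H(cy), Hxy = H(cxy);
    double mi_from_entropies = Hx + Hy - Hxy;       // H(X) + H(Y) - H(X,Y)
    double Hx_given_y = Hxy - Hy;                   // H(X|Y) = H(X,Y) - H(Y)
    double mi_from_reduction = Hx - Hx_given_y;     // H(X) - H(X|Y)

    cout << fixed << setprecision(6);
    cout << "H(X)=" << Hx << "  H(Y)=" << Hy << "  H(X,Y)=" << Hxy << "\n";
    cout << "I via entropy identity:      " << mi_from_entropies << " nats\n";
    cout << "I via uncertainty reduction: " << mi_from_reduction << " nats\n";
    return 0;
}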

Complexity Analysis

For discrete mutual information computed from samples, we typically build hash maps for marginal counts and a joint count map. With n samples and up to k_x and k_y unique categories for X and Y, counting runs in expected O(n) time using hash maps (with good hashing), plus O(m) to iterate over the joint map, which is O(n) in the worst case if all observed pairs are unique. Space usage is O(k_x + k_y + m), where m is the number of distinct (x,y) pairs actually observed (m ≤ n and m ≤ k_x k_y). In practice, for sparse data m ≪ k_x k_y.

For conditional mutual information I(X;Y|Z), we maintain counts for (x,y,z), (x,z), (y,z), and z. The pass to accumulate counts is O(n). The summation over distinct observed triples (x,y,z) has cost O(m_xyz), where m_xyz ≤ n. Space is O(m_xyz + m_xz + m_yz + k_z), typically dominated by the triple map.

The InfoNCE loss over a batch of size B with embedding dimension d requires computing all pairwise similarities, which is O(B^2 d) time and O(B^2) space if we materialize the similarity matrix, or O(B) additional space (beyond the O(B d) embeddings) if we stream over columns/rows. The log‑sum‑exp per row is O(B). Therefore, total time per batch is O(B^2 d) and space O(B^2) (or O(B) with streaming). Numerical stability requires the standard log‑sum‑exp trick, which adds negligible overhead.

Overall, discrete MI and CMI scale linearly with samples and unique observed combinations, making them practical for medium‑cardinality data. Contrastive MI bounds scale quadratically in batch size due to all‑pairs similarities; techniques like memory banks or sub‑sampling can reduce cost.

Code Examples

Empirical Mutual Information for Discrete Variables (Counts from Samples)
#include <bits/stdc++.h>
using namespace std;

// Hash for pair<int,int>
struct PairHash {
    size_t operator()(const pair<int,int>& p) const noexcept {
        // 64-bit mix of two 32-bit ints
        uint64_t a = static_cast<uint64_t>(static_cast<uint32_t>(p.first));
        uint64_t b = static_cast<uint64_t>(static_cast<uint32_t>(p.second));
        uint64_t x = (a << 32) ^ b;
        // SplitMix64
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        x = x ^ (x >> 31);
        return static_cast<size_t>(x);
    }
};

// Compute empirical MI in nats from integer-labeled samples
// X[i] and Y[i] are the i-th paired observations
// Returns MI in nats; divide by log(2.0) to convert to bits
double mutual_information_discrete(const vector<int>& X, const vector<int>& Y) {
    if (X.size() != Y.size()) throw runtime_error("X and Y must have the same length");
    size_t n = X.size();
    if (n == 0) return 0.0; // convention

    unordered_map<int, long long> cx, cy;
    unordered_map<pair<int,int>, long long, PairHash> cxy;
    cx.reserve(n * 2); cy.reserve(n * 2); cxy.reserve(n * 2);

    // Count occurrences
    for (size_t i = 0; i < n; ++i) {
        ++cx[X[i]];
        ++cy[Y[i]];
        ++cxy[{X[i], Y[i]}];
    }

    const double inv_n = 1.0 / static_cast<double>(n);
    double mi = 0.0;

    // Sum over observed pairs only (others contribute 0)
    for (const auto& kv : cxy) {
        int x = kv.first.first;
        int y = kv.first.second;
        double p_xy = kv.second * inv_n;
        double p_x = cx[x] * inv_n;
        double p_y = cy[y] * inv_n;
        // Guard against numerical issues, though counts ensure positivity
        if (p_xy > 0 && p_x > 0 && p_y > 0) {
            mi += p_xy * log(p_xy / (p_x * p_y));
        }
    }
    return mi; // nats
}

int main() {
    // Example: two discrete variables with some dependence
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};

    double mi_nats = mutual_information_discrete(X, Y);
    double mi_bits = mi_nats / log(2.0);

    cout << fixed << setprecision(6);
    cout << "Empirical MI (nats): " << mi_nats << "\n";
    cout << "Empirical MI (bits): " << mi_bits << "\n";

    return 0;
}

This program computes empirical mutual information for two discrete variables from sample pairs using hash maps for marginals and joint counts. It sums over observed pairs only, which is efficient for sparse data. The result is in nats (natural log); it also prints bits by dividing by ln 2.

Time: O(n) expected for counting; O(m) for summation over observed pairs (m ≤ n).
Space: O(k_x + k_y + m), where k_x and k_y are unique values in X and Y, and m is unique observed pairs.
Conditional Mutual Information I(X;Y|Z) for Discrete Variables
#include <bits/stdc++.h>
using namespace std;

struct Triple {
    int a, b, c;
    bool operator==(const Triple& o) const noexcept { return a==o.a && b==o.b && c==o.c; }
};

struct TripleHash {
    size_t operator()(const Triple& t) const noexcept {
        uint64_t x = static_cast<uint64_t>(static_cast<uint32_t>(t.a));
        uint64_t y = static_cast<uint64_t>(static_cast<uint32_t>(t.b));
        uint64_t z = static_cast<uint64_t>(static_cast<uint32_t>(t.c));
        uint64_t h = x;
        h = (h * 0x9e3779b97f4a7c15ULL) ^ y;
        h = (h * 0x9e3779b97f4a7c15ULL) ^ z;
        // finalize (SplitMix64)
        h += 0x9e3779b97f4a7c15ULL;
        h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
        h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
        h = h ^ (h >> 31);
        return static_cast<size_t>(h);
    }
};

// Pack two 32-bit ints into one 64-bit key so the default hash<long long> can be used
static long long pack_pair(int a, int b) {
    return (static_cast<long long>(static_cast<uint32_t>(a)) << 32) |
           static_cast<uint32_t>(b);
}

double conditional_mutual_information_discrete(const vector<int>& X,
                                               const vector<int>& Y,
                                               const vector<int>& Z) {
    if (X.size() != Y.size() || X.size() != Z.size()) throw runtime_error("All inputs must have the same length");
    size_t n = X.size();
    if (n == 0) return 0.0;

    unordered_map<Triple, long long, TripleHash> cxyz; cxyz.reserve(n * 2);
    unordered_map<long long, long long> cxz; cxz.reserve(n * 2);
    unordered_map<long long, long long> cyz; cyz.reserve(n * 2);
    unordered_map<int, long long> cz; cz.reserve(n * 2);

    // Count occurrences of (x,y,z), (x,z), (y,z), and z in a single pass
    for (size_t i = 0; i < n; ++i) {
        ++cxyz[Triple{X[i], Y[i], Z[i]}];
        ++cxz[pack_pair(X[i], Z[i])];
        ++cyz[pack_pair(Y[i], Z[i])];
        ++cz[Z[i]];
    }

    const double inv_n = 1.0 / static_cast<double>(n);
    double cmi = 0.0;

    for (const auto& kv : cxyz) {
        int x = kv.first.a;
        int y = kv.first.b;
        int z = kv.first.c;
        double p_xyz = static_cast<double>(kv.second) * inv_n;
        double p_xz = static_cast<double>(cxz[pack_pair(x, z)]) * inv_n;
        double p_yz = static_cast<double>(cyz[pack_pair(y, z)]) * inv_n;
        double p_z  = static_cast<double>(cz[z]) * inv_n;

        // p(x,y|z) = p(x,y,z)/p(z); p(x|z) = p(x,z)/p(z); p(y|z) = p(y,z)/p(z)
        // CMI term: p(x,y,z) * log( p(x,y,z) * p(z) / ( p(x,z) * p(y,z) ) )
        if (p_xyz > 0 && p_xz > 0 && p_yz > 0 && p_z > 0) {
            double ratio = (p_xyz * p_z) / (p_xz * p_yz);
            cmi += p_xyz * log(ratio);
        }
    }

    return cmi; // nats
}

int main() {
    // Toy example: Z partially explains dependence between X and Y
    vector<int> X = {0,0,1,1,0,1,0,1};
    vector<int> Y = {0,1,1,1,0,0,1,0};
    vector<int> Z = {0,0,0,0,1,1,1,1};

    double cmi_nats = conditional_mutual_information_discrete(X, Y, Z);
    cout << fixed << setprecision(6);
    cout << "Conditional MI I(X;Y|Z) (nats): " << cmi_nats << "\n";
    cout << "Conditional MI (bits): " << cmi_nats / log(2.0) << "\n";
    return 0;
}

This program estimates conditional mutual information I(X;Y|Z) from discrete samples. It counts triple (x,y,z) frequencies plus (x,z), (y,z), and z marginals, then applies the plug‑in formula. The result quantifies remaining dependence between X and Y after accounting for Z.

Time: O(n) expected for counting; O(m_xyz) for summing over observed triples (m_xyz ≤ n).
Space: O(m_xyz + m_xz + m_yz + k_z), typically dominated by the triple map.
InfoNCE Loss and MI Lower Bound for Contrastive Learning
#include <bits/stdc++.h>
using namespace std;

// Compute InfoNCE loss for a batch of query and key embeddings.
// Positive pairs are aligned by index: q_i with k_i. Negatives are other pairs in the batch.
// Returns loss in nats. MI lower bound is log(B) - loss.
double infoNCE_loss(const vector<vector<double>>& Q, const vector<vector<double>>& K, double tau) {
    size_t B = Q.size();
    if (B == 0 || K.size() != B) throw runtime_error("Mismatched batch sizes");
    size_t d = Q[0].size();
    for (size_t i = 0; i < B; ++i) if (Q[i].size() != d || K[i].size() != d) throw runtime_error("Inconsistent dimensions");

    // Compute similarity matrix S_{ij} = (q_i dot k_j) / tau
    vector<vector<double>> S(B, vector<double>(B, 0.0));
    for (size_t i = 0; i < B; ++i) {
        for (size_t j = 0; j < B; ++j) {
            double dot = 0.0;
            for (size_t t = 0; t < d; ++t) dot += Q[i][t] * K[j][t];
            S[i][j] = dot / tau;
        }
    }

    // Compute -(1/B) * sum_i log( exp(S_{ii}) / sum_j exp(S_{ij}) ) using log-sum-exp
    double loss = 0.0;
    for (size_t i = 0; i < B; ++i) {
        double m = *max_element(S[i].begin(), S[i].end());
        double denom = 0.0;
        for (size_t j = 0; j < B; ++j) denom += exp(S[i][j] - m);
        double log_denom = m + log(denom);
        loss += -(S[i][i] - log_denom);
    }
    return loss / static_cast<double>(B);
}

int main() {
    // Reproducible random embeddings
    mt19937 rng(42);
    normal_distribution<double> dist(0.0, 1.0);

    size_t B = 8;     // batch size
    size_t d = 16;    // embedding dimension
    double tau = 0.1; // temperature

    vector<vector<double>> Q(B, vector<double>(d));
    vector<vector<double>> K(B, vector<double>(d));

    // Create positive correlation by constructing K = Q + noise
    for (size_t i = 0; i < B; ++i) {
        for (size_t t = 0; t < d; ++t) {
            double q = dist(rng);
            double noise = 0.2 * dist(rng);
            Q[i][t] = q;
            K[i][t] = q + noise; // positive pair is similar
        }
    }

    double loss = infoNCE_loss(Q, K, tau); // nats
    double mi_lower_bound_nats = log(static_cast<double>(B)) - loss;

    cout << fixed << setprecision(6);
    cout << "InfoNCE loss (nats): " << loss << "\n";
    cout << "MI lower bound (nats): " << mi_lower_bound_nats << "\n";
    cout << "MI lower bound (bits): " << mi_lower_bound_nats / log(2.0) << "\n";

    return 0;
}

This code computes the InfoNCE contrastive loss for a batch of embeddings and reports the mutual information lower bound log(B) − loss. It builds a similarity matrix using dot products and applies the log‑sum‑exp trick for numerical stability. The toy data makes positives similar by setting K ≈ Q + noise.

Time: O(B^2 d) to compute all dot products and O(B^2) for the softmax denominators.
Space: O(B^2) if storing the full similarity matrix; can be reduced to O(B) by streaming rows.
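As a follow-up to the streaming remark above, here is a minimal sketch (an added variant, not part of the original text) that computes the same InfoNCE loss one query row at a time so that only O(B) similarity values are held in memory; the function name infoNCE_loss_streaming and the tiny smoke-test inputs are illustrative assumptions.

#include <bits/stdc++.h>
using namespace std;

// Streaming variant of the InfoNCE loss: process one query row at a time so only
// O(B) similarities are alive at once (the B x B matrix is never materialized).
// Same conventions as the example above: positives are index-aligned, similarities
// are dot products scaled by temperature tau, and the loss is in nats.
double infoNCE_loss_streaming(const vector<vector<double>>& Q,
                              const vector<vector<double>>& K, double tau) {
    size_t B = Q.size(), d = Q[0].size();
    double loss = 0.0;
    vector<double> row(B);                          // similarities for the current query only
    for (size_t i = 0; i < B; ++i) {
        for (size_t j = 0; j < B; ++j) {
            double dot = 0.0;
            for (size_t t = 0; t < d; ++t) dot += Q[i][t] * K[j][t];
            row[j] = dot / tau;
        }
        double m = *max_element(row.begin(), row.end());
        double denom = 0.0;
        for (double s : row) denom += exp(s - m);   // log-sum-exp for stability
        loss += -(row[i] - (m + log(denom)));
    }
    return loss / static_cast<double>(B);
}

int main() {
    // Tiny smoke test with index-aligned positive pairs.
    vector<vector<double>> Q = {{1.0, 0.0}, {0.0, 1.0}};
    vector<vector<double>> K = {{0.9, 0.1}, {0.1, 0.9}};
    double loss = infoNCE_loss_streaming(Q, K, 0.1);
    cout << fixed << setprecision(6);
    cout << "Streaming InfoNCE loss (nats): " << loss << "\n";
    cout << "MI lower bound (nats): " << log(2.0) - loss << "\n";
    return 0;
}

The time cost is unchanged (O(B^2 d)); only the extra memory drops from O(B^2) to O(B), which matches the streaming figure quoted in the complexity analysis.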