📚 Theory · Intermediate

Mutual Information

Key Points

  • Mutual Information (MI) measures how much knowing one random variable reduces uncertainty about another.
  • It is symmetric, non‑negative, and equals zero if and only if the variables are independent.
  • MI connects entropy and divergence: I(X;Y) = H(X) + H(Y) - H(X,Y) = D_{KL}(P_{XY} || P_X P_Y).
  • Processing data cannot increase information: the Data Processing Inequality ensures I(X;Y) ≥ I(X;f(Y)).
  • Conditional MI, I(X;Y|Z), measures shared information between X and Y after Z is known.
  • MI decomposes with the chain rule: I(X;Y,Z) = I(X;Z) + I(X;Y|Z).
  • In ML, MI guides feature selection and underlies contrastive learning via the InfoNCE bound.
  • Discrete MI can be computed from counts; continuous MI usually requires estimators (binning, kNN, or contrastive bounds).

Prerequisites

  • →Basic Probability — Understanding random variables, joint and marginal distributions is fundamental to MI.
  • →Logarithms and Exponents — Entropy and MI use logarithms; units (nats vs. bits) depend on log base.
  • →Entropy and KL Divergence — MI is defined via entropy identities and as a KL divergence.
  • →Hash Maps (C++ STL) — Efficient counting of categories for empirical MI relies on unordered_map.
  • →Numerical Stability (Log-Sum-Exp) — Stable evaluation of softmax/InfoNCE requires the log-sum-exp trick.
  • →Linear Algebra Basics — InfoNCE uses vector dot products and batch matrix operations.
  • →Big-O Complexity — To analyze scalability of MI estimators and contrastive objectives.

Detailed Explanation


01 Overview

Hook: Imagine two locked boxes labeled X and Y. When you open box X, how much does it help you guess what’s inside box Y? Mutual Information (MI) is the ruler we use to measure that help.

Concept: Mutual Information quantifies the reduction in uncertainty about one random variable when another is observed. It blends two core ideas from information theory: entropy (uncertainty) and Kullback–Leibler divergence (difference between probability distributions). Mathematically, MI has several equivalent forms, such as I(X;Y) = H(X) + H(Y) - H(X,Y), I(X;Y) = H(X) - H(X|Y), and I(X;Y) = D_{KL}(P_{XY} || P_X P_Y). These identities make MI both interpretable (in terms of uncertainty reduction) and analyzable (as a divergence between the joint distribution and the product of marginals).

Example: If X is the language a document is written in (English, Spanish, etc.) and Y is the frequency of certain letters, then knowing Y should tell you quite a bit about X (e.g., the letter distribution of Spanish vs. English). MI captures how strongly Y’s evidence reduces uncertainty about X. If X and Y are independent (say, language and coin flips done by a separate person), MI is zero because observing one doesn’t help predict the other.

02 Intuition & Analogies

Think of uncertainty as fog and information as a flashlight. Entropy measures the thickness of the fog around a single object (a variable). Mutual Information measures how much the flashlight from Y can pierce the fog around X (and vice versa). If Y’s flashlight points exactly where X is hiding, you learn a lot, so MI is high. If Y shines randomly and doesn’t illuminate X at all, MI is zero.

Consider a detective story: X is the identity of a culprit; Y is a clue like a fingerprint pattern. Before seeing the clue, you have many suspects (high entropy). After seeing Y, the suspect list narrows. The amount of narrowing is I(X;Y). Now imagine transforming the clue Y into a blurry photocopy f(Y). The Data Processing Inequality says the blurred clue cannot make you more certain than the original; information cannot be increased by post‑processing.

For conditional MI, suppose Z is extra context (e.g., the city where the crime occurred). I(X;Y|Z) asks: once you already know Z, how much extra does Y help? If the city already tells you which suspects to consider, the fingerprint may add less. The chain rule I(X;Y,Z) = I(X;Z) + I(X;Y|Z) mirrors the idea of staged investigation: first learn from Z, then from Y given that context.

In machine learning, MI is like a quality score for representations. Good features carry a lot of information about targets (high MI with labels), and useful encoders preserve information about relevant signals while discarding noise (information bottleneck). Contrastive learning even trains models by maximizing a computable lower bound to MI (InfoNCE).

03 Formal Definition

Let (X,Y) be random variables on a common probability space with joint distribution P_{XY} and marginals P_X, P_Y. The mutual information is defined as I(X;Y) = D_{KL}(P_{XY} || P_X P_Y), with the divergence evaluated as a sum or an integral for discrete or continuous cases as appropriate. Equivalently, in terms of entropies, I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y). For discrete variables with mass functions p(x,y), p(x), p(y), we have I(X;Y) = Σ_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ). Key properties include symmetry (I(X;Y) = I(Y;X)), non‑negativity (I(X;Y) ≥ 0, with equality iff independence), and the Data Processing Inequality (for any measurable f, I(X;Y) ≥ I(X;f(Y))). Conditional mutual information extends MI with a conditioning variable Z: I(X;Y|Z) = E_Z[ D_{KL}(P_{XY|Z} || P_{X|Z} P_{Y|Z}) ] = Σ_z p(z) Σ_{x,y} p(x,y|z) log( p(x,y|z) / (p(x|z) p(y|z)) ). The chain rule decomposes MI: I(X;Y,Z) = I(X;Z) + I(X;Y|Z), reflecting staged information gain.
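To make the definition concrete, here is a minimal worked sketch (not part of the original text) that evaluates the discrete formula for a hypothetical 2×2 joint distribution; the probabilities in the table and the expected output are illustrative only.

#include <bits/stdc++.h>
using namespace std;

// Worked example: MI of a 2x2 joint distribution, computed directly from the
// definition I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ).
int main() {
    // Hypothetical joint distribution (rows: x in {0,1}, cols: y in {0,1})
    double p[2][2] = {{0.4, 0.1},
                      {0.1, 0.4}};

    // Marginals p(x) and p(y)
    double px[2] = {0.0, 0.0}, py[2] = {0.0, 0.0};
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y) { px[x] += p[x][y]; py[y] += p[x][y]; }

    // Plug-in evaluation of the definition
    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (p[x][y] > 0) mi += p[x][y] * log(p[x][y] / (px[x] * py[y]));

    cout << fixed << setprecision(6);
    cout << "I(X;Y) = " << mi << " nats = " << mi / log(2.0) << " bits\n";
    // Expected: about 0.1927 nats (about 0.278 bits) for this table.
    return 0;
}

For this table the two variables agree 80% of the time, so observing Y removes roughly 0.28 of the 1 bit of uncertainty in X.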

04 When to Use

  • Feature selection: Choose variables X (features) that carry high MI with Y (labels), even if relationships are nonlinear or non‑monotonic. MI is model‑agnostic, making it robust for pre‑processing and screening (see the screening sketch after this list).
  • Representation learning and contrastive learning: Maximize information between paired views (e.g., two augmentations of the same image). The InfoNCE objective provides a tractable lower bound to MI, enabling self‑supervised training.
  • Information bottleneck: Learn compressed representations Z that retain information about targets Y while discarding irrelevant parts of X. MI terms I(Z;X) and I(Z;Y) formalize the trade‑off.
  • Dependency discovery: Detect general statistical dependence between variables (beyond linear correlation). This is useful for exploratory data analysis, causal discovery pipelines (as a proxy), and clustering.
  • Communication systems: Quantify channel capacity and coding performance, where MI between transmitted and received signals measures achievable reliable rates.
  • Neuroscience and signals: Measure how much a neural spike train or sensor signal reveals about a stimulus.

Use MI when you need a universal dependence measure that is invariant under invertible reparameterizations and not restricted to linear relationships. Prefer entropy/correlation when linear Gaussian assumptions hold and are sufficient; prefer MI when interactions are complex or multimodal.
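The screening sketch below is an illustration added here, not part of the original text: it ranks a few hypothetical integer-coded features by their plug-in MI with a label. The data, the feature ordering, and the compact plugin_mi helper are all made up for demonstration; the full estimator appears later in Code Examples.

#include <bits/stdc++.h>
using namespace std;

// Plug-in MI (in nats) between two discrete, integer-coded variables.
// Compact helper for screening; see the full estimator in Code Examples.
double plugin_mi(const vector<int>& X, const vector<int>& Y) {
    size_t n = X.size();
    map<int, double> cx, cy;
    map<pair<int,int>, double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i], Y[i]}]; }
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n;
        double px  = cx[kv.first.first] / n;
        double py  = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }
    return mi;
}

int main() {
    // Hypothetical label and three hypothetical discrete features.
    vector<int> label = {0,0,0,0,1,1,1,1,1,1};
    vector<vector<int>> features = {
        {0,0,0,1,1,1,1,1,1,1},   // tracks the label closely
        {0,1,0,1,0,1,0,1,0,1},   // alternating, weakly related
        {2,2,2,2,2,2,2,2,2,2}    // constant: carries no information
    };

    // Score each feature by MI with the label and sort descending.
    vector<pair<double,int>> scores;
    for (int f = 0; f < (int)features.size(); ++f)
        scores.push_back({plugin_mi(features[f], label), f});
    sort(scores.rbegin(), scores.rend());

    cout << fixed << setprecision(4);
    for (auto& s : scores)
        cout << "feature " << s.second << ": MI = " << s.first << " nats\n";
    return 0;
}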

⚠️ Common Mistakes

  • Using naive histograms for continuous MI without sensitivity checks: bin size strongly affects estimates. Mitigation: perform bin‑sweep sensitivity analysis, use kNN or kernel estimators, or rely on contrastive lower bounds when appropriate.
  • Forgetting the base of the logarithm: MI in nats (natural log) vs. bits (log base 2). Always report the base or convert by dividing by ln 2.
  • Interpreting MI as causal influence: High MI indicates dependence, not causation. Conditional MI and causal frameworks are needed for causal claims.
  • Violating the Data Processing Inequality in analysis: If you post‑process features, reported MI with targets should not increase. Apparent increases often indicate estimation noise or data leakage.
  • Ignoring finite‑sample bias: Empirical MI from counts is upward biased, especially with many categories and small sample sizes. Apply bias corrections (e.g., Miller–Madow; see the sketch after this list) or use cross‑validation/bootstrap to assess reliability.
  • Mixing conditional and unconditional MI: I(X;Y|Z) can be small even when I(X;Y) is large, if Z already explains the dependence. Always specify conditioning variables.
  • Assuming symmetry implies identical roles: While I(X;Y) = I(Y;X), in learning you often optimize asymmetric objectives (predict Y from X). Keep roles clear when designing models.
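As a hedged sketch of the bias-correction point above: the standard Miller–Madow correction adds (K - 1)/(2n) to each plug-in entropy, where K is the number of observed non-empty cells. Applying it to the identity I = H(X) + H(Y) - H(X,Y) combines into a single correction term for MI; that combination, and the reuse of the toy samples from the first code example, are assumptions made for this illustration.

#include <bits/stdc++.h>
using namespace std;

// Miller–Madow-style bias correction for plug-in MI, applied through the
// identity I = H(X) + H(Y) - H(X,Y). Each plug-in entropy gets the standard
// (K - 1)/(2n) term, where K counts the observed (non-empty) cells; the three
// terms combine into one correction for MI. Illustrative sketch only.
int main() {
    // Toy paired samples (same as the first code example).
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};
    size_t n = X.size();

    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }

    // Plug-in MI in nats.
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n, px = cx[kv.first.first] / n, py = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }

    // Combined Miller–Madow term: (K_x - 1 + K_y - 1 - (K_xy - 1)) / (2n).
    double correction = (static_cast<double>(cx.size()) + cy.size() - cxy.size() - 1.0) / (2.0 * n);
    cout << fixed << setprecision(6);
    cout << "Plug-in MI (nats):      " << mi << "\n";
    cout << "Miller-Madow MI (nats): " << mi + correction << "\n";
    return 0;
}

Because the plug-in estimate is upward biased, the combined correction is typically negative and pulls the estimate down.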

Key Formulas

Entropy Identity

I(X;Y) = H(X) + H(Y) - H(X,Y)

Explanation: Mutual information equals the sum of marginal entropies minus joint entropy. It measures how much the joint uncertainty is less than what it would be if variables were independent.

Uncertainty Reduction

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Explanation: MI is the reduction in uncertainty of one variable after observing the other. Zero reduction means independence.

KL Form

I(X;Y) = D_{KL}(P_{XY} || P_X P_Y)

Explanation: MI is the KL divergence between the joint distribution and the product of marginals. It quantifies how far dependence is from independence.

Discrete MI

I(X;Y) = Σ_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )

Explanation: For discrete variables, MI is a sum over all value pairs weighted by joint probabilities. Each term measures the local departure from independence.

Conditional MI

I(X;Y|Z) = Σ_z p(z) Σ_{x,y} p(x,y|z) log( p(x,y|z) / (p(x|z) p(y|z)) )

Explanation: Conditional MI averages the MI between X and Y across each value of Z. It captures remaining dependence after accounting for Z.

Chain Rule

I(X;Y,Z) = I(X;Z) + I(X;Y|Z)

Explanation: MI with a pair (Y,Z) can be decomposed into information from Z plus additional information from Y given Z. Useful for staged analysis.

Data Processing Inequality

I(X;Y) ≥ I(X;f(Y)) for any measurable function f

Explanation: Applying any function f to Y cannot increase information about X. Post‑processing cannot create new information.
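A small numerical check of the inequality (an added illustration, not from the original text): the sketch below computes plug-in MI before and after a deterministic merge f of Y's categories. Because the plug-in estimate is the exact MI of the empirical distribution, the inequality also holds for the printed numbers; the samples and the merge rule are assumptions chosen for the demo.

#include <bits/stdc++.h>
using namespace std;

// Illustrative check of the Data Processing Inequality on discrete samples:
// merging categories of Y (a deterministic post-processing f) cannot raise MI with X.
double plugin_mi(const vector<int>& X, const vector<int>& Y) {
    size_t n = X.size();
    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }
    double mi = 0.0;
    for (auto& kv : cxy) {
        double pxy = kv.second / n, px = cx[kv.first.first] / n, py = cy[kv.first.second] / n;
        mi += pxy * log(pxy / (px * py));
    }
    return mi;
}

int main() {
    // Hypothetical samples where Y tracks X closely.
    vector<int> X = {0,0,0,1,1,1,2,2,2,2};
    vector<int> Y = {0,0,1,1,1,2,2,2,2,0};

    // f merges categories 1 and 2 of Y into a single coarse label.
    vector<int> fY(Y.size());
    for (size_t i = 0; i < Y.size(); ++i) fY[i] = (Y[i] == 0) ? 0 : 1;

    cout << fixed << setprecision(6);
    cout << "I(X;Y)    = " << plugin_mi(X, Y)  << " nats\n";
    cout << "I(X;f(Y)) = " << plugin_mi(X, fY) << " nats (never larger, by the DPI)\n";
    return 0;
}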

Non-Negativity and Independence

I(X;Y) ≥ 0, with I(X;Y) = 0 if and only if X and Y are independent

Explanation: MI is always non‑negative by Gibbs’ inequality. It is zero exactly when X and Y are independent.

Empirical MI (Counts)

Î(X;Y) = Σ_{x,y} (n_{xy}/n) log( (n · n_{xy}) / (n_x n_y) )

Explanation: A plug‑in estimator using empirical frequencies (counts n_{xy}, n_x, n_y out of n samples). It is simple but can be biased with small samples or many categories.

InfoNCE Loss

L_{InfoNCE} = -(1/N) Σ_{i=1}^{N} log( exp(s(q_i,k_i)/τ) / Σ_{j=1}^{N} exp(s(q_i,k_j)/τ) )

Explanation: Contrastive loss over a batch with similarities s(q_i,k_j) and temperature τ. Encourages high similarity for positive pairs relative to negatives.

InfoNCE MI Lower Bound

I(X;Y) ≥ log N - L_{InfoNCE}

Explanation: For a batch of size N, the mutual information is lower‑bounded by log N minus the InfoNCE loss. Larger batches can give tighter bounds.
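A quick worked illustration of the bound, using assumed numbers (not from the original text):

With N = 256 and a measured InfoNCE loss of 1.0 nats:
I(X;Y) ≥ log 256 - 1.0 ≈ 5.545 - 1.0 = 4.545 nats ≈ 6.56 bits.
Even a loss of exactly 0 certifies at most log 256 ≈ 5.545 nats ≈ 8 bits, so when the true MI is large the bound saturates unless the batch grows.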

Entropy Definitions

H(X) = -Σ_x p(x) log p(x),  H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y)

Explanation: Entropy and conditional entropy quantify uncertainty. These definitions are the building blocks for MI identities.
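As a minimal consistency check (added here, not in the original text), the sketch below computes plug-in entropies from the same toy samples used in the first code example and verifies numerically that the entropy identity and the uncertainty-reduction form give the same MI.

#include <bits/stdc++.h>
using namespace std;

// Numerically check the entropy identities on an empirical distribution:
// I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y). Illustrative sketch only.
int main() {
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};
    size_t n = X.size();

    map<int,double> cx, cy;
    map<pair<int,int>,double> cxy;
    for (size_t i = 0; i < n; ++i) { ++cx[X[i]]; ++cy[Y[i]]; ++cxy[{X[i],Y[i]}]; }

    auto H = [n](auto& counts) {            // plug-in entropy in nats
        double h = 0.0;
        for (auto& kv : counts) { double p = kv.second / n; h -= p * log(p); }
        return h;
    };

    double Hx = H(cx), Hy = H(cy), Hxy = H(cxy);
    double mi_from_entropies = Hx + Hy - Hxy;       // H(X) + H(Y) - H(X,Y)
    double Hx_given_y = Hxy - Hy;                   // H(X|Y) = H(X,Y) - H(Y)
    double mi_from_reduction = Hx - Hx_given_y;     // H(X) - H(X|Y)

    cout << fixed << setprecision(6);
    cout << "H(X)=" << Hx << "  H(Y)=" << Hy << "  H(X,Y)=" << Hxy << "\n";
    cout << "I via entropy identity:      " << mi_from_entropies << " nats\n";
    cout << "I via uncertainty reduction: " << mi_from_reduction << " nats\n";
    return 0;
}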

Complexity Analysis

For discrete mutual information computed from samples, we typically build hash maps for marginal counts and a joint count map. With n samples and up to k_x and k_y unique categories for X and Y, counting runs in expected O(n) time using hash maps (with good hashing), plus O(m) to iterate over the joint map, which is O(n) in the worst case if all observed pairs are unique. Space usage is O(k_x + k_y + m), where m is the number of distinct (x,y) pairs actually observed (m ≤ n and m ≤ k_x k_y). In practice, for sparse data m ≪ k_x k_y.

For conditional mutual information I(X;Y|Z), we maintain counts for (x,y,z), (x,z), (y,z), and z. The pass to accumulate counts is O(n). The summation over distinct observed triples (x,y,z) has cost O(m_xyz), where m_xyz ≤ n. Space is O(m_xyz + m_xz + m_yz + k_z), typically dominated by the triple map.

The InfoNCE loss over a batch of size B with embedding dimension d requires computing all pairwise similarities, which is O(B^2 d) time and O(B^2) space if we materialize the similarity matrix, or O(B) additional space (beyond the O(B d) embeddings) if we stream over columns/rows. The log‑sum‑exp per row is O(B). Therefore, total time per batch is O(B^2 d) and space O(B^2) (or O(B) with streaming). Numerical stability requires the standard log‑sum‑exp trick, which adds negligible overhead.

Overall, discrete MI and CMI scale linearly with samples and unique observed combinations, making them practical for medium‑cardinality data. Contrastive MI bounds scale quadratically in batch size due to all‑pairs similarities; techniques like memory banks or sub‑sampling can reduce cost.

Code Examples

Empirical Mutual Information for Discrete Variables (Counts from Samples)
#include <bits/stdc++.h>
using namespace std;

// Hash for pair<int,int>
struct PairHash {
    size_t operator()(const pair<int,int>& p) const noexcept {
        // 64-bit mix of two 32-bit ints
        uint64_t a = static_cast<uint64_t>(static_cast<uint32_t>(p.first));
        uint64_t b = static_cast<uint64_t>(static_cast<uint32_t>(p.second));
        uint64_t x = (a << 32) ^ b;
        // SplitMix64
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        x = x ^ (x >> 31);
        return static_cast<size_t>(x);
    }
};

// Compute empirical MI in nats from integer-labeled samples
// X[i] and Y[i] are the i-th paired observations
// Returns MI in nats; divide by log(2.0) to convert to bits
double mutual_information_discrete(const vector<int>& X, const vector<int>& Y) {
    if (X.size() != Y.size()) throw runtime_error("X and Y must have the same length");
    size_t n = X.size();
    if (n == 0) return 0.0; // convention

    unordered_map<int, long long> cx, cy;
    unordered_map<pair<int,int>, long long, PairHash> cxy;
    cx.reserve(n * 2); cy.reserve(n * 2); cxy.reserve(n * 2);

    // Count occurrences
    for (size_t i = 0; i < n; ++i) {
        ++cx[X[i]];
        ++cy[Y[i]];
        ++cxy[{X[i], Y[i]}];
    }

    const double inv_n = 1.0 / static_cast<double>(n);
    double mi = 0.0;

    // Sum over observed pairs only (others contribute 0)
    for (const auto& kv : cxy) {
        int x = kv.first.first;
        int y = kv.first.second;
        double p_xy = kv.second * inv_n;
        double p_x = cx[x] * inv_n;
        double p_y = cy[y] * inv_n;
        // Guard against numerical issues, though counts ensure positivity
        if (p_xy > 0 && p_x > 0 && p_y > 0) {
            mi += p_xy * log(p_xy / (p_x * p_y));
        }
    }
    return mi; // nats
}

int main() {
    // Example: two discrete variables with some dependence
    vector<int> X = {0,0,1,1,1,0,2,2,2,2};
    vector<int> Y = {1,0,1,1,0,0,2,2,1,2};

    double mi_nats = mutual_information_discrete(X, Y);
    double mi_bits = mi_nats / log(2.0);

    cout << fixed << setprecision(6);
    cout << "Empirical MI (nats): " << mi_nats << "\n";
    cout << "Empirical MI (bits): " << mi_bits << "\n";

    return 0;
}

This program computes empirical mutual information for two discrete variables from sample pairs using hash maps for marginals and joint counts. It sums over observed pairs only, which is efficient for sparse data. The result is in nats (natural log); it also prints bits by dividing by ln 2.

Time: O(n) expected for counting; O(m) for summation over observed pairs (m ≤ n).
Space: O(k_x + k_y + m), where k_x and k_y are unique values in X and Y, and m is unique observed pairs.
Conditional Mutual Information I(X;Y|Z) for Discrete Variables
#include <bits/stdc++.h>
using namespace std;

struct Triple {
    int a, b, c;
    bool operator==(const Triple& o) const noexcept { return a==o.a && b==o.b && c==o.c; }
};

struct TripleHash {
    size_t operator()(const Triple& t) const noexcept {
        uint64_t x = static_cast<uint64_t>(static_cast<uint32_t>(t.a));
        uint64_t y = static_cast<uint64_t>(static_cast<uint32_t>(t.b));
        uint64_t z = static_cast<uint64_t>(static_cast<uint32_t>(t.c));
        uint64_t h = x;
        h = (h * 0x9e3779b97f4a7c15ULL) ^ y;
        h = (h * 0x9e3779b97f4a7c15ULL) ^ z;
        // finalize (SplitMix64)
        h += 0x9e3779b97f4a7c15ULL;
        h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
        h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
        h = h ^ (h >> 31);
        return static_cast<size_t>(h);
    }
};

// Pack two 32-bit ints into one 64-bit key so the default hash<long long> can be used
static long long pack_pair(int a, int b) {
    return (static_cast<long long>(static_cast<uint32_t>(a)) << 32) |
           static_cast<uint32_t>(b);
}

double conditional_mutual_information_discrete(const vector<int>& X,
                                               const vector<int>& Y,
                                               const vector<int>& Z) {
    if (X.size() != Y.size() || X.size() != Z.size()) throw runtime_error("All inputs must have the same length");
    size_t n = X.size();
    if (n == 0) return 0.0;

    unordered_map<Triple, long long, TripleHash> cxyz; cxyz.reserve(n * 2);
    unordered_map<long long, long long> cxz; cxz.reserve(n * 2);
    unordered_map<long long, long long> cyz; cyz.reserve(n * 2);
    unordered_map<int, long long> cz; cz.reserve(n * 2);

    // Count occurrences of (x,y,z), (x,z), (y,z), and z in a single pass
    for (size_t i = 0; i < n; ++i) {
        ++cxyz[Triple{X[i], Y[i], Z[i]}];
        ++cxz[pack_pair(X[i], Z[i])];
        ++cyz[pack_pair(Y[i], Z[i])];
        ++cz[Z[i]];
    }

    const double inv_n = 1.0 / static_cast<double>(n);
    double cmi = 0.0;

    for (const auto& kv : cxyz) {
        int x = kv.first.a;
        int y = kv.first.b;
        int z = kv.first.c;
        double p_xyz = static_cast<double>(kv.second) * inv_n;
        double p_xz = static_cast<double>(cxz[pack_pair(x, z)]) * inv_n;
        double p_yz = static_cast<double>(cyz[pack_pair(y, z)]) * inv_n;
        double p_z  = static_cast<double>(cz[z]) * inv_n;

        // p(x,y|z) = p(x,y,z)/p(z); p(x|z) = p(x,z)/p(z); p(y|z) = p(y,z)/p(z)
        // CMI term: p(x,y,z) * log( p(x,y,z) * p(z) / ( p(x,z) * p(y,z) ) )
        if (p_xyz > 0 && p_xz > 0 && p_yz > 0 && p_z > 0) {
            double ratio = (p_xyz * p_z) / (p_xz * p_yz);
            cmi += p_xyz * log(ratio);
        }
    }

    return cmi; // nats
}

int main() {
    // Toy example: Z partially explains dependence between X and Y
    vector<int> X = {0,0,1,1,0,1,0,1};
    vector<int> Y = {0,1,1,1,0,0,1,0};
    vector<int> Z = {0,0,0,0,1,1,1,1};

    double cmi_nats = conditional_mutual_information_discrete(X, Y, Z);
    cout << fixed << setprecision(6);
    cout << "Conditional MI I(X;Y|Z) (nats): " << cmi_nats << "\n";
    cout << "Conditional MI (bits): " << cmi_nats / log(2.0) << "\n";
    return 0;
}

This program estimates conditional mutual information I(X;Y|Z) from discrete samples. It counts triple (x,y,z) frequencies plus (x,z), (y,z), and z marginals, then applies the plug‑in formula. The result quantifies remaining dependence between X and Y after accounting for Z.

Time: O(n) expected for counting; O(m_xyz) for summing over observed triples (m_xyz ≤ n).
Space: O(m_xyz + m_xz + m_yz + k_z), typically dominated by the triple map.
InfoNCE Loss and MI Lower Bound for Contrastive Learning
#include <bits/stdc++.h>
using namespace std;

// Compute InfoNCE loss for a batch of query and key embeddings.
// Positive pairs are aligned by index: q_i with k_i. Negatives are other pairs in the batch.
// Returns loss in nats. MI lower bound is log(B) - loss.
double infoNCE_loss(const vector<vector<double>>& Q, const vector<vector<double>>& K, double tau) {
    size_t B = Q.size();
    if (B == 0 || K.size() != B) throw runtime_error("Mismatched batch sizes");
    size_t d = Q[0].size();
    for (size_t i = 0; i < B; ++i) if (Q[i].size() != d || K[i].size() != d) throw runtime_error("Inconsistent dimensions");

    // Compute similarity matrix S_{ij} = (q_i dot k_j) / tau
    vector<vector<double>> S(B, vector<double>(B, 0.0));
    for (size_t i = 0; i < B; ++i) {
        for (size_t j = 0; j < B; ++j) {
            double dot = 0.0;
            for (size_t t = 0; t < d; ++t) dot += Q[i][t] * K[j][t];
            S[i][j] = dot / tau;
        }
    }

    // Compute -(1/B) * sum_i log( exp(S_{ii}) / sum_j exp(S_{ij}) ) using log-sum-exp
    double loss = 0.0;
    for (size_t i = 0; i < B; ++i) {
        double m = *max_element(S[i].begin(), S[i].end());
        double denom = 0.0;
        for (size_t j = 0; j < B; ++j) denom += exp(S[i][j] - m);
        double log_denom = m + log(denom);
        loss += -(S[i][i] - log_denom);
    }
    return loss / static_cast<double>(B);
}

int main() {
    // Reproducible random embeddings
    mt19937 rng(42);
    normal_distribution<double> dist(0.0, 1.0);

    size_t B = 8;     // batch size
    size_t d = 16;    // embedding dimension
    double tau = 0.1; // temperature

    vector<vector<double>> Q(B, vector<double>(d));
    vector<vector<double>> K(B, vector<double>(d));

    // Create positive correlation by constructing K = Q + noise
    for (size_t i = 0; i < B; ++i) {
        for (size_t t = 0; t < d; ++t) {
            double q = dist(rng);
            double noise = 0.2 * dist(rng);
            Q[i][t] = q;
            K[i][t] = q + noise; // positive pair is similar
        }
    }

    double loss = infoNCE_loss(Q, K, tau); // nats
    double mi_lower_bound_nats = log(static_cast<double>(B)) - loss;

    cout << fixed << setprecision(6);
    cout << "InfoNCE loss (nats): " << loss << "\n";
    cout << "MI lower bound (nats): " << mi_lower_bound_nats << "\n";
    cout << "MI lower bound (bits): " << mi_lower_bound_nats / log(2.0) << "\n";

    return 0;
}

This code computes the InfoNCE contrastive loss for a batch of embeddings and reports the mutual information lower bound log(B) − loss. It builds a similarity matrix using dot products and applies the log‑sum‑exp trick for numerical stability. The toy data makes positives similar by setting K ≈ Q + noise.

Time: O(B^2 d) to compute all dot products and O(B^2) for the softmax denominators.
Space: O(B^2) if storing the full similarity matrix; can be reduced to O(B) by streaming rows.
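As a follow-up to the streaming remark above, here is a minimal sketch (an added variant, not part of the original text) that computes the same InfoNCE loss one query row at a time so that only O(B) similarity values are held in memory; the function name infoNCE_loss_streaming and the tiny smoke-test inputs are illustrative assumptions.

#include <bits/stdc++.h>
using namespace std;

// Streaming variant of the InfoNCE loss: process one query row at a time so only
// O(B) similarities are alive at once (the B x B matrix is never materialized).
// Same conventions as the example above: positives are index-aligned, similarities
// are dot products scaled by temperature tau, and the loss is in nats.
double infoNCE_loss_streaming(const vector<vector<double>>& Q,
                              const vector<vector<double>>& K, double tau) {
    size_t B = Q.size(), d = Q[0].size();
    double loss = 0.0;
    vector<double> row(B);                          // similarities for the current query only
    for (size_t i = 0; i < B; ++i) {
        for (size_t j = 0; j < B; ++j) {
            double dot = 0.0;
            for (size_t t = 0; t < d; ++t) dot += Q[i][t] * K[j][t];
            row[j] = dot / tau;
        }
        double m = *max_element(row.begin(), row.end());
        double denom = 0.0;
        for (double s : row) denom += exp(s - m);   // log-sum-exp for stability
        loss += -(row[i] - (m + log(denom)));
    }
    return loss / static_cast<double>(B);
}

int main() {
    // Tiny smoke test with index-aligned positive pairs.
    vector<vector<double>> Q = {{1.0, 0.0}, {0.0, 1.0}};
    vector<vector<double>> K = {{0.9, 0.1}, {0.1, 0.9}};
    double loss = infoNCE_loss_streaming(Q, K, 0.1);
    cout << fixed << setprecision(6);
    cout << "Streaming InfoNCE loss (nats): " << loss << "\n";
    cout << "MI lower bound (nats): " << log(2.0) - loss << "\n";
    return 0;
}

The time cost is unchanged (O(B^2 d)); only the extra memory drops from O(B^2) to O(B), which matches the streaming figure quoted in the complexity analysis.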