📚 Theory · Intermediate

Information Theory

Key Points

  • Information theory quantifies uncertainty and information using measures like entropy, cross-entropy, KL divergence, and mutual information.
  • Entropy H(X) measures the average surprise in outcomes; higher entropy means more unpredictability.
  • Cross-entropy H(p, q) is the expected number of bits required to encode samples from p using a code optimized for q, and it underlies many ML loss functions.
  • KL divergence D_KL(P||Q) measures how much information is lost when Q is used to approximate P and is always nonnegative.
  • Mutual information I(X; Y) quantifies how much knowing Y reduces uncertainty about X, capturing any kind of dependency, not just linear.
  • Channel capacity C is the maximum mutual information between input and output over input distributions and sets a hard limit for reliable communication.
  • In machine learning, cross-entropy is a standard classification loss, KL divergence appears in VAEs via the ELBO, and mutual information guides representation learning.
  • You can compute these quantities efficiently from histograms or sample counts in O(n) time, with careful handling of zeros for numerical stability.

Prerequisites

  • Basic Probability (PMF, expectations, independence): Entropy and mutual information are expectations over probability distributions.
  • Logarithms and Exponentials: All core definitions use logs; understanding bases (2 vs e) is essential.
  • Discrete Random Variables and Histograms: Practical estimation relies on counting and normalizing frequencies.
  • C++ Vectors, Maps, and Floating-Point: Implementations require arrays, hash maps, and careful numeric handling.
  • Asymptotic Analysis (Big-O): To evaluate the efficiency of entropy and MI estimators.
  • Optimization Basics: Capacity involves maximizing mutual information over input distributions.
  • Information-Theoretic Coding (optional): Helps interpret entropy as code length and understand capacity and source coding.

Detailed Explanation


01 Overview

Information theory, founded by Claude Shannon in 1948, studies how to quantify, transmit, and compress information. Its core quantity, entropy H(X), measures the average uncertainty (or surprise) in a random variable’s outcomes. Building on entropy, cross-entropy compares two distributions by asking: if the world follows p but we encode it assuming q, how many bits do we spend? KL divergence D_KL(P||Q) refines this by measuring the extra coding cost incurred by using the wrong distribution. Mutual information I(X; Y) gauges how much two variables share information—how much knowing one reduces uncertainty about the other. These concepts power both communication theory (e.g., determining channel capacity, the limit of reliable data rates) and modern machine learning (e.g., cross-entropy loss for classification, KL regularization in variational autoencoders, and information bottleneck for representation learning). Although the theory is mathematical, it translates into practical, implementable estimators: compute histograms from data, normalize to probabilities, and evaluate sums involving logs. With careful handling of edge cases (like zero probabilities), you can write efficient C++ routines to estimate entropy, cross-entropy, KL divergence, and mutual information from real datasets, enabling you to analyze uncertainty, dependency, and model fit.
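As a minimal sketch of that recipe (the helper name sample_entropy_bits is an illustrative choice, not a library function), the following counts symbols, normalizes the counts to probabilities, and sums the -p log2 p terms:

#include <iostream>
#include <unordered_map>
#include <vector>
#include <cmath>

// Minimal sketch: estimate entropy (in bits) of a discrete sample
// by counting symbols, normalizing to probabilities, and summing -p*log2(p).
double sample_entropy_bits(const std::vector<int>& samples) {
    std::unordered_map<int, size_t> counts;
    for (int s : samples) ++counts[s];
    const double N = static_cast<double>(samples.size());
    double H = 0.0;
    for (const auto& kv : counts) {
        double p = kv.second / N;      // p > 0 by construction, so log2 is safe
        H -= p * std::log2(p);
    }
    return H;
}

int main() {
    std::vector<int> data = {0, 1, 1, 2, 0, 1, 2, 2, 1, 1};
    std::cout << "Estimated entropy: " << sample_entropy_bits(data) << " bits\n";
    return 0;
}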

02 Intuition & Analogies

Imagine guessing a secret word. If the word is equally likely to be any of 1,024 choices, you’d need around 10 yes/no questions (bits) to identify it. This number of questions reflects entropy: more possible and equally likely options mean more uncertainty, thus more questions. Now suppose you design your questions based on the wrong belief—maybe you think some words are more likely than they actually are. You’ll waste questions on the wrong leads. That waste is the cross-entropy between the true world and your belief, and the avoidable part of that waste is the KL divergence—the extra questions you pay for being wrong. Consider clues from a friend about the word. Each clue reduces your uncertainty. The amount your uncertainty drops, on average, is the mutual information between the clue and the word. If the clues are perfect, uncertainty plummets; if they’re noisy or irrelevant, the drop is small. Now, think of a noisy walkie‑talkie channel. You encode messages into bits and transmit them. Noise flips some bits. Channel capacity is the best you can do—by cleverly choosing how often to send 0s vs 1s and how to encode them—so that, despite noise, you can still communicate reliably up to a certain rate. In machine learning, this maps neatly to practice: training with cross-entropy is like designing codes matched to the data; KL regularization in VAEs penalizes beliefs (posteriors) that deviate too far from a simple prior; mutual information helps judge whether features actually tell you about labels. These quantities are just average question counts (bits) and uncertainty reductions made concrete.
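To put rough numbers on the analogy (the four-word belief q below is an invented toy example, not from the text):

H(X) = \log_2 1024 = 10 \text{ bits for a word drawn uniformly from 1,024 candidates.}

If the true distribution over four words is p = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}) but your questions assume q = (\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}), then

H(p, q) = \sum_x p(x) \log_2 \tfrac{1}{q(x)} = \tfrac{1}{4}(1 + 2 + 3 + 3) = 2.25 \text{ bits}, \quad H(p) = 2 \text{ bits}, \quad D_{KL}(p \Vert q) = 0.25 \text{ bits},

so the wrong belief costs a quarter of a question per guess on average.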

03 Formal Definition

Let X be a discrete random variable taking values in a finite alphabet \mathcal{X} with probability mass function p(x). The (Shannon) entropy in base b is H_b(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x), which measures expected information content in units determined by the base (bits for b = 2, nats for b = e). For two distributions p and q on the same support, the cross-entropy is H(p, q) = -\sum_x p(x) \log q(x). The Kullback–Leibler divergence is D_{KL}(p \Vert q) = \sum_x p(x) \log \frac{p(x)}{q(x)}, which satisfies D_{KL}(p \Vert q) \ge 0 with equality iff p = q almost everywhere. For a pair (X, Y) with joint pmf p(x, y), mutual information is I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). Conditional entropy is H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y) = H(X, Y) - H(Y). A discrete memoryless channel is specified by input alphabet \mathcal{X}, output alphabet \mathcal{Y}, and transition probabilities p(y \mid x). Its capacity is C = \max_{p(x)} I(X; Y), the maximum achievable mutual information over input distributions. Source coding theorems relate entropy to optimal compression length, while channel coding theorems relate capacity to reliable communication rates.
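A quick worked instance of these definitions, using the same fair and Bernoulli(0.9) coins that appear in the first code example below:

H_2(\text{fair}) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 \text{ bit}

H_2(\text{Bernoulli}(0.9)) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 \text{ bits}

D_{KL}(\text{fair} \Vert \text{Bernoulli}(0.9)) = 0.5 \log_2 \tfrac{0.5}{0.9} + 0.5 \log_2 \tfrac{0.5}{0.1} \approx 0.737 \text{ bits}

H(\text{fair}, \text{Bernoulli}(0.9)) = H(\text{fair}) + D_{KL} \approx 1.737 \text{ bits}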

04 When to Use

Use entropy when you need to quantify unpredictability or design compression schemes: lower entropy implies better compressibility. Use cross-entropy as a training objective when your model outputs a predictive distribution q and the data follow (unknown) p; minimizing empirical cross-entropy aligns your model with the data and is exactly the standard classification loss. Use KL divergence to measure how different two distributions are when direction matters (D_{KL}(p\Vert q) \ne D_{KL}(q\Vert p)); this appears in regularization (e.g., VAEs), model comparison, and approximate inference. Use mutual information when you want a dependency measure that goes beyond correlation—MI captures nonlinear relationships and is invariant to invertible transformations; it’s useful for feature selection, representation learning, and understanding layers in neural nets. In communication problems, compute channel capacity when assessing the theoretical limit of reliable transmission or deciding input distributions for modems. Practically, when you have samples: build histograms (for discrete variables), normalize to probabilities, and compute the sums. When dealing with continuous variables, estimate densities (e.g., kernel density, k-NN estimators) or discretize carefully. If probabilities can be zero, add smoothing (pseudo-counts) or clip to a small \epsilon to avoid numerical issues. Choose log base 2 for bits in compression/communication and base e (nats) for connections to exponential families and variational inference.
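A minimal sketch of the smoothing and base-conversion steps mentioned above (the pseudo-count value and helper names are illustrative choices, not a fixed API):

#include <vector>
#include <cmath>
#include <numeric>
#include <iostream>

// Additive (Laplace) smoothing: add a pseudo-count alpha to every bin,
// then renormalize, so no probability is exactly zero.
std::vector<double> smooth_counts(const std::vector<double>& counts, double alpha = 1.0) {
    std::vector<double> p(counts.size());
    double total = std::accumulate(counts.begin(), counts.end(), 0.0) + alpha * counts.size();
    for (size_t i = 0; i < counts.size(); ++i) p[i] = (counts[i] + alpha) / total;
    return p;
}

// Convert an information quantity from nats to bits (general base change: divide by log of the target base).
inline double nats_to_bits(double nats) { return nats / std::log(2.0); }

int main() {
    std::vector<double> counts = {12, 0, 3, 5};          // one empty bin
    std::vector<double> p = smooth_counts(counts, 0.5);  // no zeros after smoothing
    double H_nats = 0.0;
    for (double pi : p) H_nats -= pi * std::log(pi);     // natural-log entropy
    std::cout << "H = " << H_nats << " nats = " << nats_to_bits(H_nats) << " bits\n";
    return 0;
}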

⚠️ Common Mistakes

• Ignoring zeros: Computing \log 0 causes -\infty and NaNs. Always clip probabilities to a small \epsilon > 0 or use smoothing with pseudo-counts. For cross-entropy and KL, you must ensure q(x) > 0 whenever p(x) > 0.
• Mixing log bases: Reporting in bits but using natural logs gives wrong magnitudes. Convert via \log_b(x) = \log(x)/\log(b).
• Confusing cross-entropy and KL: H(p, q) = H(p) + D_{KL}(p\Vert q). Minimizing cross-entropy is equivalent to minimizing KL only when H(p) is constant w.r.t. model parameters (true for fixed data).
• Estimating MI with scarce data: Plug-in estimators using histograms are biased upward with small samples or many bins. Use larger datasets, fewer bins, or bias-corrected/k-NN estimators.
• Ignoring support mismatch: KL is undefined if there exists x with p(x) > 0 but q(x) = 0. In ML, this appears when a model assigns zero probability to observed labels. Use label smoothing or logits with softmax to avoid hard zeros (see the label-smoothing sketch after this list).
• Treating MI like correlation: MI detects any dependence, but it does not indicate direction or causality.
• Overfitting in compression claims: Observed cross-entropy on the training set can be overly optimistic; validate on held-out data.
• Misinterpreting capacity: Channel capacity is an asymptotic limit under optimal coding and vanishing error probability; practical codes approach but do not instantly achieve C.
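A minimal sketch of the label-smoothing idea referenced in the support-mismatch point (the smoothing factor eps = 0.1 and function name are illustrative): it replaces a hard one-hot target with a distribution that has no zero entries.

#include <vector>
#include <iostream>

// Label smoothing: put (1 - eps) on the true class and spread eps / K over
// all K classes, so the target distribution never assigns exactly zero mass.
std::vector<double> smooth_one_hot(int true_class, int num_classes, double eps = 0.1) {
    std::vector<double> target(num_classes, eps / num_classes);
    target[true_class] += 1.0 - eps;
    return target;
}

int main() {
    auto t = smooth_one_hot(2, 4, 0.1);
    for (double v : t) std::cout << v << " ";  // 0.025 0.025 0.925 0.025
    std::cout << "\n";
    return 0;
}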

Key Formulas

Entropy

H_b(X) = -\sum_{x} p(x) \log_b p(x)

Explanation: Average uncertainty in X measured in base b units (bits if b=2, nats if b=e). It equals expected code length under an optimal code for p.

Joint Entropy

H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)

Explanation: Uncertainty of the pair (X, Y). It accounts for both variables together and is at least as large as the larger of H(X) and H(Y).

Conditional Entropy

H(X \mid Y) = \sum_{y} p(y)\, H(X \mid Y = y) = H(X, Y) - H(Y)

Explanation: Average remaining uncertainty in X after observing Y. It decreases as Y becomes more informative about X.

Cross-Entropy Decomposition

H(p, q) = H(p) + D_{KL}(p \Vert q)

Explanation: Cross-entropy equals the true entropy plus the KL divergence. Minimizing cross-entropy w.r.t. q is equivalent to minimizing KL when p is fixed.

KL Divergence

D_{KL}(p \Vert q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Explanation: Measures the inefficiency of assuming q when the true distribution is p. It is nonnegative and zero iff p equals q almost everywhere.

Mutual Information

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)

Explanation: Quantifies dependency between X and Y. It is the KL divergence between the joint distribution and the product of marginals.

Entropy Chain Rule

H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})

Explanation: The entropy of a sequence decomposes into a sum of conditional entropies. Useful in sequential models and coding.

Data Processing Inequality

X \to Y \to Z \;\Rightarrow\; I(X; Z) \le I(X; Y)

Explanation: No deterministic or stochastic processing of Y can increase its information about X. It formalizes the intuition that post-processing cannot create information.

Source Coding Theorem (Prefix Codes)

H(X) \le L^* < H(X) + 1

Explanation: The optimal expected code length L* for prefix codes is at least the entropy and within 1 bit above it. Huffman codes achieve this bound.

Channel Capacity

C = \max_{p(x)} I(X; Y)

Explanation: The highest rate (in base b units per symbol) at which information can be sent with vanishing error through a memoryless channel.

Fano's Inequality

H(X \mid Y) \le H(P_e) + P_e \log(|\mathcal{X}| - 1), \quad P_e = \Pr(\hat{X} \ne X)

Explanation: Relates the probability of error in estimating X from Y to the conditional entropy. If H(X|Y) is small, the error probability P_e must be small.

Binary Entropy Function

H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

Explanation: Entropy of a Bernoulli(p) random variable. Central in binary channels and classification tasks.

MI as KL

I(X; Y) = D_{KL}\big(p(x, y) \,\Vert\, p(x)\, p(y)\big)

Explanation: Mutual information equals the KL divergence between the joint distribution and the product of marginals. It is always nonnegative.

ELBO

\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\Vert\, p(z)\big)

Explanation: Variational lower bound on log-likelihood. Maximizing the ELBO balances reconstruction accuracy and regularization toward a prior.
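One line of algebra (a standard identity, stated here for completeness) shows why maximizing the ELBO both tightens the bound and pulls the approximate posterior toward the true one:

\log p(x) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\Vert\, p(z)\big) + D_{KL}\big(q(z \mid x) \,\Vert\, p(z \mid x)\big) = \mathrm{ELBO}(q) + D_{KL}\big(q(z \mid x) \,\Vert\, p(z \mid x)\big)

Since the last KL term is nonnegative, the ELBO is a lower bound on \log p(x), and the gap is exactly the KL divergence to the true posterior.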

Jensen–Shannon Divergence

\mathrm{JSD}(p \Vert q) = \tfrac{1}{2} D_{KL}(p \Vert m) + \tfrac{1}{2} D_{KL}(q \Vert m), \quad m = \tfrac{1}{2}(p + q)

Explanation: A symmetric, smoothed divergence based on KL. Unlike KL, it is always finite and bounded.
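Since JSD is not covered by the code examples below, here is a minimal self-contained sketch (function names like jsd_bits are illustrative):

#include <vector>
#include <cmath>
#include <iostream>

// KL divergence in bits; assumes q[i] > 0 wherever p[i] > 0
// (always true for the midpoint distribution m used below).
double kl_bits(const std::vector<double>& p, const std::vector<double>& q) {
    double d = 0.0;
    for (size_t i = 0; i < p.size(); ++i)
        if (p[i] > 0.0) d += p[i] * std::log2(p[i] / q[i]);
    return d;
}

// Jensen-Shannon divergence: average KL to the midpoint m = (p+q)/2.
// Always finite and bounded by 1 bit for base-2 logs.
double jsd_bits(const std::vector<double>& p, const std::vector<double>& q) {
    std::vector<double> m(p.size());
    for (size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m);
}

int main() {
    std::vector<double> p = {0.5, 0.5, 0.0};
    std::vector<double> q = {0.0, 0.5, 0.5};
    std::cout << "JSD(p, q) = " << jsd_bits(p, q) << " bits\n";  // 0.5 bits here
    return 0;
}

Because the midpoint m is strictly positive wherever p or q is, the inner KL terms never divide by zero, which is why JSD stays finite even when p and q have disjoint support.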

Complexity Analysis

Computing entropy, cross-entropy, and KL divergence from categorical distributions represented as k-length vectors is O(k) time: each requires a single pass to sum p(x) log terms and optionally a second pass for validation/normalization. Space complexity is O(1) beyond the input arrays if computed in place; adding smoothed or normalized copies raises it to O(k). Numerically, constant-time safeguards (epsilon clipping and base conversion) do not change the asymptotics. Estimating mutual information (MI) from N paired samples using hash maps to build joint and marginal histograms is O(N) expected time, assuming average O(1) insert/lookup in std::unordered_map. Worst-case hashing collisions can degrade this, but with good hashing and moderate load factors this is rare. The number m of distinct (x, y) pairs determines the subsequent summation cost O(m). Space usage is O(m + a + b), where a and b are the numbers of distinct symbols of X and Y; this can approach O(N) in the worst case when nearly all observations are unique. For channel capacity by grid search over G candidate input distributions (e.g., Bernoulli(p) with p on a uniform grid), evaluating mutual information per grid point is O(1) for binary channels (since entropies are closed-form), leading to O(G) time and O(1) extra space. For general discrete channels with |\mathcal{X}| = k and |\mathcal{Y}| = l, MI evaluation per input pmf is O(kl), and a grid or iterative optimizer (e.g., Blahut–Arimoto, sketched below) yields O(T kl) for T iterations. Practically, the main costs arise from histogram construction (linear in samples) and numerical stability checks. Memory is dominated by the size of frequency tables; careful choice of binning and smoothing controls both bias and resource usage.
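Since the analysis cites Blahut–Arimoto as the O(T kl) iterative option for general discrete channels, here is a minimal sketch of that iteration (matrix layout and iteration count are illustrative choices); run on the BSC with crossover 0.1 it should approach the same 1 - H_b(0.1) ≈ 0.531 bits/use as the closed form in the third code example.

#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>

// Blahut-Arimoto sketch for a discrete memoryless channel.
// W[x][y] = p(y | x); each row must sum to 1.
// Returns an estimate of capacity in bits after T iterations.
double blahut_arimoto(const std::vector<std::vector<double>>& W, int T = 200) {
    const size_t k = W.size();          // input alphabet size
    const size_t l = W[0].size();       // output alphabet size
    std::vector<double> r(k, 1.0 / k);  // input distribution, start uniform

    for (int t = 0; t < T; ++t) {
        // Output distribution p(y) = sum_x r(x) W(y|x)
        std::vector<double> py(l, 0.0);
        for (size_t x = 0; x < k; ++x)
            for (size_t y = 0; y < l; ++y) py[y] += r[x] * W[x][y];

        // c(x) = exp( sum_y W(y|x) ln( W(y|x) / p(y) ) ), skipping zero entries
        std::vector<double> c(k, 0.0);
        for (size_t x = 0; x < k; ++x) {
            double d = 0.0;
            for (size_t y = 0; y < l; ++y)
                if (W[x][y] > 0.0) d += W[x][y] * std::log(W[x][y] / py[y]);
            c[x] = std::exp(d);
        }

        // Multiplicative update of the input distribution
        double Z = 0.0;
        for (size_t x = 0; x < k; ++x) Z += r[x] * c[x];
        for (size_t x = 0; x < k; ++x) r[x] = r[x] * c[x] / Z;
    }

    // Mutual information I(X;Y) in bits for the final input distribution
    std::vector<double> py(l, 0.0);
    for (size_t x = 0; x < k; ++x)
        for (size_t y = 0; y < l; ++y) py[y] += r[x] * W[x][y];
    double I = 0.0;
    for (size_t x = 0; x < k; ++x)
        for (size_t y = 0; y < l; ++y)
            if (W[x][y] > 0.0) I += r[x] * W[x][y] * std::log2(W[x][y] / py[y]);
    return I;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);
    // Binary symmetric channel with crossover 0.1: capacity should be ~0.531 bits/use
    std::vector<std::vector<double>> bsc = {{0.9, 0.1}, {0.1, 0.9}};
    std::cout << "Blahut-Arimoto capacity estimate: " << blahut_arimoto(bsc) << " bits/use\n";
    return 0;
}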

Code Examples

Entropy, Cross-Entropy, and KL Divergence for Discrete Distributions
#include <iostream>
#include <vector>
#include <cmath>
#include <numeric>
#include <iomanip>
#include <stdexcept>

// Compute log base b using natural log conversion
inline double logb(double x, double base) {
    return std::log(x) / std::log(base);
}

// Clip probability away from 0 to avoid log(0)
inline double clip_prob(double p, double eps = 1e-12) {
    if (p < eps) return eps;
    if (p > 1.0 - eps) return 1.0 - eps;
    return p;
}

// Validate that vectors have same size and non-negative entries
void validate_same_size(const std::vector<double>& p, const std::vector<double>& q) {
    if (p.size() != q.size()) throw std::invalid_argument("Vectors must have same length");
}

void validate_nonnegative(const std::vector<double>& p) {
    for (double v : p) if (v < 0.0) throw std::invalid_argument("Probabilities must be non-negative");
}

// Optionally normalize a vector to sum to 1 (useful if inputs are counts)
std::vector<double> normalize(const std::vector<double>& v) {
    validate_nonnegative(v);
    double s = std::accumulate(v.begin(), v.end(), 0.0);
    if (s <= 0.0) throw std::invalid_argument("Sum must be positive to normalize");
    std::vector<double> out(v.size());
    for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] / s;
    return out;
}

// Shannon entropy in base 'base' (default base=2 -> bits)
double entropy(const std::vector<double>& p, double base = 2.0) {
    validate_nonnegative(p);
    double H = 0.0;
    for (double pi : p) {
        if (pi <= 0.0) continue; // contributes 0 in the limit
        double pci = clip_prob(pi);
        H -= pci * logb(pci, base);
    }
    return H;
}

// Cross-entropy H(p, q) = - sum p log q
double cross_entropy(const std::vector<double>& p, const std::vector<double>& q, double base = 2.0) {
    validate_same_size(p, q);
    validate_nonnegative(p);
    validate_nonnegative(q);
    double Hpq = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] <= 0.0) continue; // contributes 0
        double qi = clip_prob(q[i]); // ensure q>0 where p>0
        Hpq += -p[i] * logb(qi, base);
    }
    return Hpq;
}

// KL divergence D_KL(p || q) = sum p log (p/q)
double kl_divergence(const std::vector<double>& p, const std::vector<double>& q, double base = 2.0) {
    validate_same_size(p, q);
    validate_nonnegative(p);
    validate_nonnegative(q);
    double D = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] <= 0.0) continue; // skip terms with p=0
        double pi = clip_prob(p[i]);
        double qi = clip_prob(q[i]);
        D += pi * (logb(pi, base) - logb(qi, base));
    }
    return D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // Example 1: Fair vs biased coin
    std::vector<double> fair = {0.5, 0.5};
    std::vector<double> bias9 = {0.9, 0.1};

    std::cout << "H(fair) [bits] = " << entropy(fair) << "\n";
    std::cout << "H(bias9) [bits] = " << entropy(bias9) << "\n";

    // Cross-entropy and KL (note: H(p,q) = H(p) + D_KL(p||q))
    double Hpq = cross_entropy(fair, bias9);
    double Dpq = kl_divergence(fair, bias9);
    std::cout << "H(fair, bias9) = " << Hpq << ", H(fair)+D_KL(fair||bias9) = "
              << entropy(fair) + Dpq << "\n";

    // Example 2: From counts (normalize first)
    std::vector<double> counts_p = {30, 10, 10, 50}; // e.g., observed symbol counts
    std::vector<double> counts_q = {25, 15, 10, 50}; // model-implied counts (same total not required)
    auto p = normalize(counts_p);
    auto q = normalize(counts_q);

    std::cout << "H(p) = " << entropy(p) << ", H(p,q) = " << cross_entropy(p, q)
              << ", D_KL(p||q) = " << kl_divergence(p, q) << "\n";

    return 0;
}

This program implements core information-theoretic quantities for discrete distributions: entropy H(X), cross-entropy H(p, q), and KL divergence D_KL(p||q). It includes safe handling of zeros via clipping to avoid log(0), allows base selection (bits by default), and demonstrates both direct probability inputs and normalization from counts. The identity H(p, q) = H(p) + D_KL(p||q) is numerically verified.

Time: O(k) for vectors of length k. Space: O(1) extra (O(k) if normalizing into a new vector).
Estimating Mutual Information from Sampled Pairs (Plug-in Estimator)
#include <iostream>
#include <vector>
#include <unordered_map>
#include <cmath>
#include <utility>
#include <iomanip>
#include <stdexcept>

struct PairHash {
    std::size_t operator()(const std::pair<int,int>& p) const noexcept {
        // Combine hashes of two ints (simple but effective)
        return std::hash<long long>{}((static_cast<long long>(p.first) << 32) ^ (p.second & 0xffffffffLL));
    }
};

inline double logb(double x, double base) { return std::log(x) / std::log(base); }

// Estimate mutual information I(X;Y) from paired samples using histograms
// X and Y must have same length N. Symbols are integers (can be categorical IDs).
double mutual_information(const std::vector<int>& X, const std::vector<int>& Y, double base = 2.0) {
    if (X.size() != Y.size()) throw std::invalid_argument("X and Y must have same length");
    const size_t N = X.size();
    if (N == 0) return 0.0;

    std::unordered_map<int, size_t> countX;
    std::unordered_map<int, size_t> countY;
    std::unordered_map<std::pair<int,int>, size_t, PairHash> countXY;
    countX.reserve(N); countY.reserve(N); countXY.reserve(N);

    // Build histograms
    for (size_t i = 0; i < N; ++i) {
        ++countX[X[i]];
        ++countY[Y[i]];
        ++countXY[{X[i], Y[i]}];
    }

    // Convert to probabilities and sum p(x,y) log p(x,y)/(p(x)p(y))
    double I = 0.0;
    for (const auto& kv : countXY) {
        int x = kv.first.first;
        int y = kv.first.second;
        double pxy = static_cast<double>(kv.second) / static_cast<double>(N);
        double px = static_cast<double>(countX[x]) / static_cast<double>(N);
        double py = static_cast<double>(countY[y]) / static_cast<double>(N);
        // Only nonzero joint probabilities are present in the map
        double ratio = pxy / (px * py);
        I += pxy * logb(ratio, base);
    }
    return I;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // Example 1: Independent variables (MI should be ~0)
    std::vector<int> X1 = {0,0,1,1,0,1,0,1,0,1};
    std::vector<int> Y1 = {1,0,1,0,0,1,1,0,0,1};
    std::cout << "I(X1;Y1) [bits] = " << mutual_information(X1, Y1) << "\n";

    // Example 2: Strong dependence (Y = X with occasional noise)
    std::vector<int> X2 = {0,0,1,1,0,1,0,1,1,0,1,0,1,1,0,0};
    std::vector<int> Y2 = {0,0,1,1,0,1,1,1,1,0,1,0,1,0,0,0}; // two mismatches
    std::cout << "I(X2;Y2) [bits] = " << mutual_information(X2, Y2) << "\n";

    return 0;
}

This code estimates mutual information from discrete paired samples using a plug-in (histogram) estimator. It builds joint and marginal frequency tables with hash maps and computes I(X;Y) = ∑ p(x,y) log (p(x,y)/(p(x)p(y))). Example 1 demonstrates near-independence (MI ~ 0), and Example 2 shows higher MI when Y is almost equal to X.

Time: O(N) expected (hash map operations) plus O(m) to sum over m observed pairs. Space: O(m + a + b), where m is the number of distinct (x, y) pairs and a, b are the numbers of distinct X and Y symbols.
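As noted in Common Mistakes, this plug-in estimator is biased upward for small samples. One standard first-order adjustment (the Miller–Madow entropy correction applied termwise to I = H(X) + H(Y) - H(X, Y); stated here as a known result, not implemented in the code above) is

\hat{I}_{MM}(X; Y) = \hat{I}_{plugin}(X; Y) + \frac{a + b - m - 1}{2N \ln 2} \text{ bits},

where m, a, and b are the numbers of nonzero bins in the joint and marginal histograms the estimator already builds. Under near-independence m \approx ab, so the adjustment subtracts roughly (a-1)(b-1)/(2N \ln 2) bits, shrinking the spurious MI that small samples produce.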
Binary Symmetric Channel (BSC): Mutual Information and Capacity by Grid Search
#include <iostream>
#include <cmath>
#include <iomanip>
#include <algorithm>

inline double logb(double x, double base) { return std::log(x) / std::log(base); }

// Binary entropy in base 'base' (bits if base=2)
double Hb(double p, double base = 2.0) {
    const double eps = 1e-12;
    p = std::max(eps, std::min(1.0 - eps, p));
    return -p * logb(p, base) - (1.0 - p) * logb(1.0 - p, base);
}

// Mutual information for a BSC with input Bernoulli(p1) and crossover e
double MI_BSC(double p1, double e, double base = 2.0) {
    // Output probability of 1: pY1 = p1*(1-e) + (1-p1)*e
    double pY1 = p1 * (1.0 - e) + (1.0 - p1) * e;
    double HY = Hb(pY1, base);
    double HY_given_X = Hb(e, base); // independent of input distribution
    return HY - HY_given_X;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    double e = 0.10; // crossover probability
    double best_p = 0.0, best_I = -1.0;

    // Grid search over input bias p1 in [0,1]
    int G = 10001; // grid points
    for (int i = 0; i < G; ++i) {
        double p1 = static_cast<double>(i) / (G - 1);
        double I = MI_BSC(p1, e, 2.0);
        if (I > best_I) { best_I = I; best_p = p1; }
    }

    double capacity_closed = 1.0 - Hb(e, 2.0); // known formula C = 1 - H_b(e)

    std::cout << "Best input bias p*(1) ≈ " << best_p << "\n";
    std::cout << "Estimated capacity C ≈ " << best_I << " bits/use\n";
    std::cout << "Closed-form capacity 1 - H_b(e) = " << capacity_closed << " bits/use\n";

    return 0;
}

This program computes mutual information for a Binary Symmetric Channel (bit flip with probability e) and searches over input distributions to estimate capacity. It also prints the known closed-form capacity 1 − H_b(e), verifying that the optimum occurs near p(1) = 0.5 and matches the theoretical value.

Time: O(G), where G is the number of grid points. Space: O(1).