📚 Theory · Intermediate

Shannon Entropy

Key Points

  • Shannon entropy quantifies the average uncertainty or information content of a random variable in bits when using base-2 logarithms.
  • Entropy is maximized by the uniform distribution and minimized (zero) when the outcome is certain.
  • For independent variables, joint entropy adds: H(X, Y) = H(X) + H(Y).
  • Conditional entropy H(Y|X) measures remaining uncertainty in Y after knowing X and never exceeds H(Y).
  • Information gain equals the reduction in entropy when a feature is observed and is central to decision tree splits.
  • Cross-entropy and perplexity evaluate predictive models by measuring how well predicted probabilities match actual outcomes.
  • Entropy sets the theoretical lower bound for the average number of bits needed for lossless compression (source coding theorem).
  • In practice, careful handling of zero probabilities, log bases, and small-sample bias is essential when estimating entropy.

Prerequisites

  • Basic probability theory: Entropy is defined over probability distributions and requires understanding events, independence, and conditional probability.
  • Discrete random variables and PMFs: Shannon entropy for discrete variables sums over the PMF; knowing how PMFs work is essential.
  • Logarithms and change of base: Entropy uses logarithms; the base determines the unit (bits or nats), and conversions between bases are common.
  • Summations and series: Entropy involves summing terms over a support; fluency with summation notation helps.
  • Hash maps and counting in C++ (STL): Efficient entropy estimation from data relies on frequency tables built with unordered_map.
  • Numerical stability: Avoiding log(0) and handling floating-point precision is critical in entropy and cross-entropy computations.
  • Algorithmic complexity: Choosing appropriate data structures and understanding O(n) vs O(k) costs improves implementation efficiency.

Detailed Explanation


01 Overview

Shannon entropy is a foundational concept in information theory that measures uncertainty. Imagine trying to guess the next symbol emitted by a source (like letters in a text). If the source is very predictable, you need few yes/no questions (bits) to identify the symbol; if it is highly unpredictable, you need many. Entropy formalizes this idea as the expected number of bits required to encode outcomes from a probability distribution. When we use base-2 logarithms, the unit is bits; base e gives nats.

The core insight is that the more spread out (less concentrated) a probability distribution is, the higher the entropy. A uniform distribution over k outcomes has the maximum possible entropy of \(\log_2 k\) bits, while a deterministic outcome has zero entropy.

Entropy is not just an abstract measure: it dictates the best you can possibly do at data compression and directly connects to how confident or calibrated your probabilistic predictions are. In machine learning, entropy underlies decision tree criteria (information gain), model evaluation (cross-entropy loss and perplexity), and feature selection (mutual information). In compression, entropy tells you the smallest average code length you can hope to achieve, assuming ideal coding and exact knowledge of the source distribution.
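As a quick sanity check on the two extremes, here is the arithmetic (a small worked example with base-2 logarithms):

\[H(X) = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = \log_2 8 = 3 \text{ bits (uniform over 8 outcomes)}\]
\[H(X) = -1 \cdot \log_2 1 = 0 \text{ bits (one outcome has probability 1)}\]

Every other distribution over 8 outcomes falls strictly between these two values.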

02 Intuition & Analogies

Think of a 20-questions game where you must identify a hidden object. Each yes/no question splits the possibilities. If the choices are equally likely, you can plan questions that reliably halve the possibilities, needing about \(\log_2 k\) questions for k options. That average number of questions is the same spirit as entropy.

Now imagine a biased coin that shows heads 99% of the time. Before flipping, you're already quite sure of the result; your uncertainty is low. You would need far fewer questions to communicate the flip's outcome than with a fair coin. Entropy captures this difference: the fair coin's entropy is 1 bit, while the skewed coin has less than 1 bit (you can compress its outcomes better).

Another analogy: packing suitcases efficiently. If you know exactly which items appear most often, you can design smaller boxes for common items and larger boxes for rare ones, minimizing space on average. This is what variable-length coding like Huffman coding does, guided by entropy.

In prediction, think of a weather app. If it predicts a 90% chance of sun on a sunny day and 90% rain on a rainy day and generally matches reality, your surprise per day is low; cross-entropy (closely related to entropy) will be small. But if the app assigns tiny probabilities to what actually happens, your surprise (negative log-probability) explodes, and cross-entropy grows, indicating poor calibration. Entropy thus links guessing games, coding efficiency, and predictive confidence into one measure of uncertainty.
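To make the biased-coin claim concrete, plugging p = 0.99 into the binary entropy function (defined formally in the next section) gives approximately

\[H(0.99) = -0.99 \log_2 0.99 - 0.01 \log_2 0.01 \approx 0.081 \text{ bits},\]

far below the fair coin's 1 bit, which is why its outcomes compress so well.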

03 Formal Definition

For a discrete random variable X with support \(\mathcal{X}\) and probability mass function p(x), the Shannon entropy in base b is defined as \(H_b(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x)\). Using base 2 (b = 2) yields entropy in bits. When p(x) = 0 for some x, the term is treated as 0 because \(\lim_{p \to 0^{+}} p \log p = 0\). The joint entropy of a pair (X, Y) is \(H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)\). Conditional entropy is \(H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x) = \sum_{x} p(x)\, H(Y \mid X = x)\). The chain rule states \(H(X, Y) = H(X) + H(Y \mid X)\). If X and Y are independent, \(H(X, Y) = H(X) + H(Y)\). The binary entropy function for a Bernoulli(p) variable is \(H(p) = -p \log_2 p - (1-p) \log_2 (1-p)\). Shannon's source coding theorem establishes entropy as a lower bound on average code length for any prefix-free code: \(\mathbb{E}[L] \ge H(X)\) (in bits when using base 2). Cross-entropy between true distribution p and model q is \(H(p, q) = -\sum_{x} p(x) \log q(x)\), and KL divergence is \(D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)\).
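The identities above are easy to verify numerically. The following is a minimal sketch (the 2×2 joint distribution is made up for illustration) that computes H(X), H(Y), H(X, Y), and H(Y|X) in bits and confirms the chain rule:

#include <iostream>
#include <cmath>
#include <iomanip>

// log2 with the convention 0 * log2(0) = 0
static double plog2p(double p) { return p > 0.0 ? p * std::log2(p) : 0.0; }

int main() {
    // Hypothetical 2x2 joint distribution p(x, y); rows index X, columns index Y.
    double p[2][2] = {{0.4, 0.1},
                      {0.2, 0.3}};

    double Hxy = 0.0, px[2] = {0.0, 0.0}, py[2] = {0.0, 0.0};
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y) {
            Hxy -= plog2p(p[x][y]);   // joint entropy H(X, Y)
            px[x] += p[x][y];         // marginal p(x)
            py[y] += p[x][y];         // marginal p(y)
        }

    double Hx = -(plog2p(px[0]) + plog2p(px[1]));
    double Hy = -(plog2p(py[0]) + plog2p(py[1]));

    // H(Y|X) = -sum_{x,y} p(x, y) * log2 p(y | x), with p(y|x) = p(x, y) / p(x)
    double Hy_given_x = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (p[x][y] > 0.0) Hy_given_x -= p[x][y] * std::log2(p[x][y] / px[x]);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "H(X)          = " << Hx << "\n";
    std::cout << "H(Y)          = " << Hy << "\n";
    std::cout << "H(X,Y)        = " << Hxy << "\n";
    std::cout << "H(X)+H(Y|X)   = " << Hx + Hy_given_x << "  (chain rule)\n";
    std::cout << "I(X;Y)        = " << Hy - Hy_given_x << "\n";
    return 0;
}

Because X and Y in this toy distribution are not independent, H(X, Y) is strictly less than H(X) + H(Y), and the mutual information printed on the last line is positive.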

04 When to Use

Use Shannon entropy whenever you need to quantify uncertainty, compressibility, or average information. In decision trees (ID3/C4.5/CART variants), choose the split that maximizes information gain, \(\mathrm{IG} = H(Y) - H(Y \mid X)\). In feature selection, mutual information \(I(X; Y)\) identifies features that reduce uncertainty about labels. In language modeling and sequence prediction, evaluate models via cross-entropy and perplexity, \(\mathrm{PP} = 2^{H_2}\), which reflect how surprised the model is by the test data. In compression, entropy estimates the theoretical lower bound on code length; practical schemes like Huffman or arithmetic coding approach this bound when the model is accurate. In anomaly detection, low-probability events contribute high information (\(-\log p\)), so entropy-related scores help flag unusual occurrences. In reinforcement learning and optimization, entropy regularization encourages exploration by preferring higher-uncertainty policies early on. In uncertainty quantification for probabilistic classifiers, monitoring the entropy of predictive distributions helps detect overconfidence or ambiguity. Finally, in communication systems, entropy connects to channel capacity via mutual information; understanding entropy is a prerequisite for reasoning about how much information can be reliably transmitted.
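As a small illustration of the anomaly-detection use case, the surprisal score \(-\log_2 p\) can be computed directly from an event's modeled probability (a minimal sketch; the probabilities below are made up):

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <iomanip>

// Surprisal (self-information) of an event with probability p, in bits.
// Rare events carry many bits of information and stand out as anomalies.
double surprisal_bits(double p) {
    const double eps = 1e-15;                 // clamp to avoid log(0)
    return -std::log2(std::max(p, eps));
}

int main() {
    // Hypothetical modeled probabilities of observed events
    std::vector<double> probs = {0.5, 0.1, 0.01, 0.0001};
    std::cout << std::fixed << std::setprecision(4);
    for (double p : probs)
        std::cout << "p = " << p << "  ->  surprisal = "
                  << surprisal_bits(p) << " bits\n";
    return 0;
}

The average of these surprisal values over a dataset is exactly the cross-entropy computed in the third code example below; thresholding individual scores is one simple way to flag unusual events.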

⚠️ Common Mistakes

  • Mixing log bases and units: using natural logs but reporting bits. Remember \(\log_b x = \frac{\log_a x}{\log_a b}\); if you compute with ln, convert to bits by dividing by ln 2.
  • Mishandling zeros: directly computing \(\log 0\) causes NaNs. Either omit zero-probability terms or use smoothing when estimating from finite samples.
  • Forgetting to normalize counts: entropy requires probabilities that sum to 1. Using raw counts without normalization biases the result.
  • Confusing entropy with cross-entropy or loss: entropy is a property of the true distribution; cross-entropy depends on a model q. Minimizing cross-entropy matches the model to the data but does not necessarily reduce the data's inherent entropy.
  • Assuming additivity without independence: \(H(X, Y) = H(X) + H(Y)\) holds only if X and Y are independent; otherwise use the chain rule with conditional entropy.
  • Small-sample bias: the plug-in estimator (using empirical frequencies) underestimates true entropy. Bias corrections (e.g., Miller–Madow) or Bayesian smoothing (Dirichlet priors) can help; see the sketch after this list.
  • Ignoring rare events: trimming or thresholding tiny probabilities can distort entropy and evaluation metrics like perplexity.
  • Overinterpreting differential entropy: for continuous variables, differential entropy can be negative and is not invariant to transformations; don't compare it directly to discrete entropy.
  • Data leakage in information gain: compute IG using only training data within cross-validation folds; otherwise splits may look better than they generalize.
  • Numerical instability: subtracting nearly equal quantities (e.g., in KL) or taking logs of extremely small probabilities can cause underflow; use clamping and log-sum-exp techniques.
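The small-sample bias point deserves a sketch. Below is a minimal illustration of the Miller–Madow correction, assuming its standard form: add (m − 1)/(2N) nats to the plug-in estimate, where m is the number of observed (nonzero) categories and N the sample size, dividing the correction by ln 2 to report bits.

#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>

// Plug-in entropy (bits) plus the Miller-Madow bias correction.
// Assumed form: the plug-in estimator underestimates entropy, and Miller-Madow
// adds (m - 1) / (2N) nats, with m = observed (nonzero) categories, N = samples.
double entropy_miller_madow_bits(const std::vector<long long>& counts) {
    long long N = 0;
    long long m = 0;                              // categories actually observed
    for (long long c : counts) { N += c; if (c > 0) ++m; }
    if (N == 0) return 0.0;

    double H_plugin = 0.0;                        // plug-in estimate in bits
    for (long long c : counts) {
        if (c <= 0) continue;
        double p = static_cast<double>(c) / static_cast<double>(N);
        H_plugin -= p * std::log2(p);
    }
    double correction_bits =
        static_cast<double>(m - 1) / (2.0 * static_cast<double>(N) * std::log(2.0));
    return H_plugin + correction_bits;
}

int main() {
    // Same toy counts as the first code example below
    std::vector<long long> counts = {10, 0, 5, 5};
    std::cout << std::fixed << std::setprecision(6)
              << "Miller-Madow entropy estimate (bits): "
              << entropy_miller_madow_bits(counts) << "\n";
    return 0;
}

The correction shrinks as N grows, so for large samples the plug-in and corrected estimates agree closely.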

Key Formulas

Shannon Entropy

\(H_b(X) = -\sum_{x} p(x) \log_b p(x)\)

Explanation: Average information content of a discrete random variable in base b units. Use b=2 for bits, b=e for nats.

Joint Entropy

\(H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)\)

Explanation: Measures uncertainty of the pair (X, Y). It is the entropy over the joint distribution p(x, y).

Conditional Entropy

\(H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x, y} p(x, y) \log p(y \mid x)\)

Explanation: Average remaining uncertainty in Y after observing X. Computed by weighting the entropy of Y|X=x by p(x).

Chain Rule

\(H(X, Y) = H(X) + H(Y \mid X)\)

Explanation: Decomposes joint entropy into the entropy of X plus the conditional entropy of Y given X.

Additivity for Independence

\(H(X, Y) = H(X) + H(Y)\) when X and Y are independent

Explanation: If X and Y are independent, their joint entropy equals the sum of individual entropies.

Binary Entropy Function

\(H(p) = -p \log_2 p - (1-p) \log_2 (1-p)\)

Explanation: Entropy of a Bernoulli(p) variable. Maximal at p=0.5 and approaches 0 as p tends to 0 or 1.

Maximum Entropy Bound

\(H(X) \le \log_2 |\mathcal{X}|\)

Explanation: Entropy is maximized by the uniform distribution over its support; equality holds when p is uniform.

Information Gain

\(\mathrm{IG}(Y; X) = H(Y) - H(Y \mid X)\)

Explanation: Reduction in uncertainty about Y after observing feature X. Used for decision tree splits and feature selection.

Mutual Information

\(I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)\)

Explanation: Measures dependence between X and Y. Zero if and only if X and Y are independent.

Cross-Entropy

\(H(p, q) = -\sum_{x} p(x) \log q(x)\)

Explanation: Expected code length if we encode samples from p using codes optimized for q. Used as a loss for probabilistic models.

KL Divergence

\(D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)\)

Explanation: Non-negative measure of how distribution q departs from p. Zero only when p=q almost everywhere.

Perplexity

\(\mathrm{PP} = b^{H_b}\) (e.g., \(2^{H_2}\) when entropy is in bits)

Explanation: Perplexity equals the exponential of (cross-)entropy in the same base. It reflects the effective branching factor.

Source Coding Theorem (Lower Bound)

\(\mathbb{E}[L] \ge H_b(X)\)

Explanation: The average length of any prefix-free code (in base b) cannot be smaller than the entropy of the source.

Huffman Code Bound

\(H_2(X) \le \mathbb{E}[L] < H_2(X) + 1\)

Explanation: For binary Huffman coding, expected code length is at least the entropy and less than one bit above it.

Change of Base

\(\log_b x = \frac{\log_a x}{\log_a b}\)

Explanation: Converts logarithms between bases, useful for switching units (bits vs nats).

Complexity Analysis

Computing entropy from empirical data typically proceeds in two phases: (1) counting frequencies for each category and (2) summing −p(x) log p(x) over categories. If n is the number of samples and k is the number of distinct categories, building a frequency table via a hash map is O(n) time on average and O(k) space, while the entropy summation is O(k) time and O(1) extra space beyond the counts. If the categories are already aggregated as counts, the entire computation is O(k) time. For binary entropy H(p), time is O(1).

When computing conditional entropy H(Y|X), we must count label frequencies for each value of X. Given n samples, v distinct feature values, and c classes, we can build a nested map of size O(v·c) in O(n) time; computing H(Y|X) then takes O(v·c). Therefore, information gain IG = H(Y) − H(Y|X) can be computed in O(n + v·c) time and O(v·c) space. In practice, v and c are often much smaller than n.

Cross-entropy and perplexity for a given sequence of N events require summing −log probabilities over the N positions. If a model directly provides the probability of the observed symbol at each step, the computation is O(N) time and O(1) space. If instead one must derive these probabilities by normalizing scores across a vocabulary of size V (e.g., via softmax), the per-step cost is O(V), yielding O(N·V) time.

Numerical stability is important: avoid log(0) by clamping probabilities to a small epsilon or using smoothed estimates. When converting between bases (bits vs nats), dividing by a log-base factor is O(1) per term and does not change asymptotic complexity. Memory use is dominated by count tables (O(k) or O(v·c)); streaming implementations can process samples in one pass with sublinear memory when k is bounded.
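As a concrete instance of the two-phase pattern described above (a minimal sketch; the sample data is made up), the following builds the frequency table with std::unordered_map in one O(n) pass and then performs the O(k) summation:

#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
#include <cmath>
#include <iomanip>

// One-pass O(n) counting with a hash map, then O(k) entropy summation (bits).
double entropy_from_samples_bits(const std::vector<std::string>& samples) {
    std::unordered_map<std::string, long long> freq;     // O(k) space
    for (const auto& s : samples) ++freq[s];              // O(n) average time

    const double n = static_cast<double>(samples.size());
    if (n == 0.0) return 0.0;

    double H = 0.0;                                        // O(k) summation
    for (const auto& kv : freq) {
        double p = static_cast<double>(kv.second) / n;
        H -= p * std::log2(p);                             // counts are > 0, so log is safe
    }
    return H;
}

int main() {
    std::vector<std::string> samples = {"a", "b", "a", "c", "a", "b", "a", "d"};
    std::cout << std::fixed << std::setprecision(6)
              << "Empirical entropy (bits): "
              << entropy_from_samples_bits(samples) << "\n";
    return 0;
}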

Code Examples

Compute Shannon Entropy from Counts (with optional smoothing and base)
#include <iostream>
#include <vector>
#include <numeric>
#include <cmath>
#include <iomanip>
#include <stdexcept>

// Compute Shannon entropy from category counts.
// Parameters:
//  - counts: non-negative integer counts per category
//  - alpha: additive smoothing (Laplace). alpha=0 means no smoothing.
//  - base: logarithm base (2 for bits, exp(1) for nats)
// Returns entropy in units of log base (e.g., bits when base=2)
double shannon_entropy_from_counts(const std::vector<long long>& counts, double alpha = 0.0, double base = 2.0) {
    if (counts.empty()) return 0.0;
    if (alpha < 0.0) throw std::invalid_argument("alpha must be >= 0");
    const size_t k = counts.size();
    long long raw_total = std::accumulate(counts.begin(), counts.end(), 0LL);
    double total = static_cast<double>(raw_total) + alpha * static_cast<double>(k);
    if (total <= 0.0) return 0.0; // all zeros

    double H = 0.0;
    const double log_base = std::log(base);
    for (size_t i = 0; i < k; ++i) {
        if (counts[i] < 0) throw std::invalid_argument("counts must be non-negative");
        double p = (static_cast<double>(counts[i]) + alpha) / total;
        if (p > 0.0) {
            H -= p * (std::log(p) / log_base);
        }
    }
    return H;
}

// Binary entropy function H(p) for Bernoulli(p)
double binary_entropy(double p, double base = 2.0) {
    if (p <= 0.0 || p >= 1.0) return 0.0; // limit cases; exact 0 and 1 have zero entropy
    const double log_base = std::log(base);
    return -(p * (std::log(p) / log_base) + (1.0 - p) * (std::log(1.0 - p) / log_base));
}

int main() {
    // Example: counts for 4 categories, including a zero-count category
    std::vector<long long> counts = {10, 0, 5, 5};

    double H_bits = shannon_entropy_from_counts(counts, /*alpha=*/0.0, /*base=*/2.0);
    double H_bits_smoothed = shannon_entropy_from_counts(counts, /*alpha=*/1.0, /*base=*/2.0);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Entropy (bits, no smoothing): " << H_bits << "\n";
    std::cout << "Entropy (bits, Laplace alpha=1): " << H_bits_smoothed << "\n";

    // Binary entropy examples
    std::cout << "Binary entropy H(0.5) bits: " << binary_entropy(0.5) << "\n";
    std::cout << "Binary entropy H(0.1) bits: " << binary_entropy(0.1) << "\n";

    // Convert bits to nats (base e) by changing base
    double H_nats = shannon_entropy_from_counts(counts, 0.0, std::exp(1.0));
    std::cout << "Entropy (nats): " << H_nats << "\n";

    return 0;
}

This program computes Shannon entropy from category counts, optionally applying additive (Laplace) smoothing to avoid zero probabilities. It also provides the binary entropy function and demonstrates changing the logarithm base to switch units between bits and nats.

Time: O(k), where k is the number of categories. Space: O(1) extra space beyond the input.
Information Gain for a Discrete Feature and Class Labels
#include <iostream>
#include <vector>
#include <unordered_map>
#include <string>
#include <cmath>
#include <numeric>
#include <iomanip>

// Helper: compute entropy from a label->count map
double entropy_from_map(const std::unordered_map<std::string, long long>& counts, double base = 2.0) {
    long long total = 0;
    for (auto &kv : counts) total += kv.second;
    if (total == 0) return 0.0;
    const double log_base = std::log(base);
    double H = 0.0;
    for (auto &kv : counts) {
        if (kv.second <= 0) continue;
        double p = static_cast<double>(kv.second) / static_cast<double>(total);
        H -= p * (std::log(p) / log_base);
    }
    return H;
}

// Compute Information Gain IG(Y; X) from dataset of (feature_value, label)
double information_gain(const std::vector<std::pair<std::string, std::string>>& data, double base = 2.0) {
    if (data.empty()) return 0.0;

    // Count label frequencies for H(Y)
    std::unordered_map<std::string, long long> label_counts;
    for (auto &p : data) label_counts[p.second]++;
    double H_Y = entropy_from_map(label_counts, base);

    // Group counts by feature value: for each X=x, count labels y
    std::unordered_map<std::string, std::unordered_map<std::string, long long>> by_feat;
    std::unordered_map<std::string, long long> feat_counts;
    for (auto &p : data) {
        by_feat[p.first][p.second]++;
        feat_counts[p.first]++;
    }

    // Compute H(Y|X) = sum_x P(X=x) * H(Y|X=x)
    const double N = static_cast<double>(data.size());
    double H_Y_given_X = 0.0;
    for (auto &fx : by_feat) {
        double Px = static_cast<double>(feat_counts[fx.first]) / N;
        double H_Y_given_x = entropy_from_map(fx.second, base);
        H_Y_given_X += Px * H_Y_given_x;
    }

    return H_Y - H_Y_given_X; // Information Gain
}

int main() {
    // Toy dataset: weather feature -> play tennis label
    std::vector<std::pair<std::string, std::string>> data = {
        {"Sunny", "No"}, {"Sunny", "No"}, {"Overcast", "Yes"}, {"Rain", "Yes"},
        {"Rain", "No"}, {"Rain", "Yes"}, {"Overcast", "Yes"}, {"Sunny", "Yes"},
        {"Sunny", "Yes"}, {"Rain", "Yes"}, {"Sunny", "Yes"}, {"Overcast", "Yes"},
        {"Overcast", "Yes"}, {"Rain", "No"}
    };

    double IG_bits = information_gain(data, 2.0);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Information Gain IG(Y; feature) in bits: " << IG_bits << "\n";

    return 0;
}

This code computes information gain for a discrete feature with respect to class labels. It first computes H(Y), then computes the conditional entropy H(Y|X) by grouping the data by feature value and averaging entropies of the label distribution within each group, finally returning IG = H(Y) - H(Y|X).

Time: O(n + v·c) on average, where n is the number of samples, v the number of distinct feature values, and c the number of classes. Space: O(v·c) for grouped counts.
Cross-Entropy and Perplexity for a Sequence of True-Event Probabilities
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <iomanip>
#include <stdexcept>

// Computes average cross-entropy in base 'base' given probabilities assigned to the true events.
// Inputs are the model's probabilities for the actually observed outcome at each step.
double average_cross_entropy_bits(const std::vector<double>& true_event_probs, double base = 2.0) {
    if (true_event_probs.empty()) return 0.0;
    const double log_base = std::log(base);
    const double eps = 1e-15; // clamp to avoid log(0)
    double sum_neglog = 0.0;
    for (double p : true_event_probs) {
        if (p < 0.0 || p > 1.0) throw std::invalid_argument("probabilities must be in [0,1]");
        double q = std::max(p, eps);
        sum_neglog += -std::log(q) / log_base;
    }
    return sum_neglog / static_cast<double>(true_event_probs.size());
}

// Perplexity is base^cross_entropy in the same base
double perplexity_from_cross_entropy(double cross_entropy, double base = 2.0) {
    return std::pow(base, cross_entropy);
}

int main() {
    // Example: probabilities that a language model assigns to the actual next token at each step
    std::vector<double> assigned = {0.10, 0.50, 0.20, 0.05, 0.30};

    double H_bits = average_cross_entropy_bits(assigned, 2.0);
    double PP = perplexity_from_cross_entropy(H_bits, 2.0);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Average cross-entropy (bits per token): " << H_bits << "\n";
    std::cout << "Perplexity: " << PP << "\n";

    return 0;
}

Given the probabilities a model assigns to the actually observed outcomes, this program computes the average cross-entropy in bits and the corresponding perplexity. It clamps extremely small probabilities to avoid numerical issues with log(0).

Time: O(N), where N is the sequence length. Space: O(1) extra space.