KL Divergence (Kullback-Leibler Divergence)
Key Points
- •Kullback–Leibler (KL) divergence measures how one probability distribution P allocates probability mass differently from a reference distribution Q.
- •It is asymmetric: D_KL(P||Q) ≠ D_KL(Q||P), and it is not a distance metric.
- •KL divergence is always non‑negative and equals zero if and only if P and Q are identical almost everywhere.
- •Forward KL, D_KL(P||Q), is mode‑covering and forces Q to place mass wherever P is high; reverse KL, D_KL(Q||P), is mode‑seeking.
- •It connects to entropy and cross‑entropy via D_KL(P||Q) = H(P,Q) − H(P), which is the extra coding cost when using Q instead of P.
- •In ML, minimizing forward KL is equivalent to maximum likelihood; reverse KL is central in variational inference and VAEs through the ELBO.
- •KL can be computed exactly for some families (e.g., Gaussians) or estimated with Monte Carlo using samples.
- •Numerical stability requires careful handling of zeros and tiny probabilities (clamping, smoothing, and log‑sum‑exp).
Prerequisites
- →Basic probability distributions — Understanding PMFs, PDFs, and support is essential to define KL divergence.
- →Logarithms and change of base — KL uses logarithms; interpreting units (nats vs bits) requires base conversion.
- →Expectation and the law of large numbers — KL is an expectation; Monte Carlo estimators converge by LLN.
- →Entropy and cross-entropy — KL connects directly via D_KL(P||Q) = H(P,Q) - H(P).
- →Gaussian distributions — Important closed-form KL cases appear for Gaussians (used in VAE and VI).
- →Numerical stability in floating-point — Preventing underflow/overflow and handling zeros is critical when computing KL.
- →Optimization basics — Choosing which KL direction to minimize depends on training objectives (MLE vs VI).
Detailed Explanation
01Overview
Kullback–Leibler (KL) divergence is a fundamental tool for comparing probability distributions. Intuitively, it measures how inefficient it is to assume a distribution Q when the true distribution is P. If two distributions are identical, the inefficiency is zero; the more they differ, the larger the divergence. Unlike a distance, KL divergence is asymmetric: D_KL(P||Q) does not equal D_KL(Q||P), and it does not satisfy the triangle inequality. This asymmetry is meaningful: choosing the direction changes what kinds of mistakes are penalized. Forward KL (P||Q) emphasizes covering all regions where P has probability mass, while reverse KL (Q||P) emphasizes concentrating on the most likely regions under P.

In machine learning, KL divergence appears everywhere: in maximum likelihood training (where minimizing the expected negative log-likelihood is equivalent to minimizing forward KL), in variational inference (minimizing reverse KL between an approximate posterior and the true posterior), in regularization of policies in reinforcement learning, and in knowledge distillation between teacher and student models.

KL is also tied to information theory: it quantifies the expected extra number of nats or bits required to encode samples from P using a code optimized for Q. Practically, we compute KL exactly for certain families (e.g., Gaussians) or estimate it numerically via sampling and log-densities, taking care with numerical stability.
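To back up the claim that minimizing forward KL is the same as maximum likelihood, here is the standard two-line derivation, restated for convenience:

D_{\mathrm{KL}}(P \,\|\, Q_\theta) = \mathbb{E}_{x \sim P}\!\left[\log P(x) - \log Q_\theta(x)\right] = -H(P) - \mathbb{E}_{x \sim P}\!\left[\log Q_\theta(x)\right]

Since H(P) does not depend on θ,

\arg\min_\theta D_{\mathrm{KL}}(P \,\|\, Q_\theta) = \arg\max_\theta \mathbb{E}_{x \sim P}\!\left[\log Q_\theta(x)\right],

which is exactly maximum likelihood once the expectation is replaced by an average of log-likelihoods over training samples.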
02Intuition & Analogies
Imagine you pack for a trip based on a weather app. If the app (Q) predicts mostly sunny days but the real weather (P) brings unexpected rain, you will be underprepared. KL divergence is the average penalty for planning by Q when reality follows P. When KL is zero, your packing list perfectly matches the true weather pattern; when KL is large, you made many poor choices consistently.

Two planning mindsets mirror forward and reverse KL. Forward KL (P||Q) is like planning to be safe: make sure you have something for any weather that might actually occur. If there is any chance of rain (P says rain is possible), your packing (Q) must include a raincoat, or you pay a big penalty (infinite, in the discrete case if Q assigns zero to something P says can happen). This makes forward KL “mode-covering”: it pushes Q to spread out and not miss any true possibilities.

Reverse KL (Q||P) is like packing light around what you expect to happen most often. If you believe sunny days dominate (Q concentrates where P is most likely), you may ignore rare storms. Reverse KL penalizes placing mass where P has essentially none, making it “mode-seeking”: it often picks a single representative mode in multimodal P.

From an encoding viewpoint, suppose you design a code optimized for Q but your messages are actually produced by P. The extra average code length over the optimal code for P is exactly D_KL(P||Q). Thus KL measures the cost of using the wrong beliefs. In ML terms, minimizing forward KL aligns the model’s predictions with observed data frequencies (cover everything you see), whereas minimizing reverse KL yields compact approximations that focus on the bulk of probability mass (good for tractable approximate inference).
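To make the packing analogy concrete, here is a small worked computation; the rain/sun probabilities below are made-up numbers for illustration, and results are in nats:

P = (\text{rain: } 0.3,\ \text{sun: } 0.7), \qquad Q = (\text{rain: } 0.05,\ \text{sun: } 0.95)

D_{\mathrm{KL}}(P \,\|\, Q) = 0.3 \ln\tfrac{0.3}{0.05} + 0.7 \ln\tfrac{0.7}{0.95} \approx 0.538 - 0.214 \approx 0.324 \text{ nats}

D_{\mathrm{KL}}(Q \,\|\, P) = 0.05 \ln\tfrac{0.05}{0.3} + 0.95 \ln\tfrac{0.95}{0.7} \approx -0.090 + 0.290 \approx 0.201 \text{ nats}

The two directions disagree (0.324 vs 0.201 nats), and the forward direction punishes the forecast more for underweighting rain that actually occurs.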
03Formal Definition
For discrete distributions P and Q over the same support, D_KL(P||Q) = Σ_x P(x) log(P(x)/Q(x)), with the convention that terms where P(x) = 0 contribute zero. For continuous distributions with densities p and q, D_KL(P||Q) = ∫ p(x) log(p(x)/q(x)) dx, which is finite only if P is absolutely continuous with respect to Q (q(x) > 0 wherever p(x) > 0). The logarithm base sets the units: natural log gives nats, base 2 gives bits. These definitions, together with related identities, are listed under Key Formulas below.
04When to Use
Use forward KL, D_KL(P||Q), when you have data from P and a parametric model Q_θ, and you want to fit Q_θ to cover the data distribution—this is exactly maximum likelihood and cross-entropy minimization in classification and language modeling. Forward KL is also appropriate for evaluation: to quantify how far a learned model Q is from a known target P.

Use reverse KL, D_KL(Q||P), in approximate Bayesian inference, where P is the intractable posterior and Q is a simpler, tractable family. Minimizing reverse KL (e.g., in VAEs via the ELBO) yields approximations that place mass where P is large, often single-mode summaries of multimodal posteriors. Use symmetric variants when you need a similarity score without directionality: Jensen–Shannon divergence (JS) is finite, symmetric, and its square root is a metric.

In reinforcement learning, penalize policy updates with KL to stay close to a previous policy (trust-region methods like TRPO/PPO). In model compression (knowledge distillation), minimize D_KL(P_teacher||Q_student) via soft labels (often with temperature scaling). For distribution shift and monitoring, compute KL between current and baseline prediction distributions to detect drift. If you only have samples and can evaluate log-densities up to normalization constants, consider Monte Carlo estimates of KL differences, or use alternatives (e.g., JS) when supports only partially overlap or densities are implicit.
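As one concrete instance of the distillation use case, here is a minimal C++ sketch of the temperature-scaled soft-label KL. The logits, the temperature values, and the function names are illustrative assumptions, not any particular framework's API; the optional T^2 scaling mentioned in the comment is a common convention when mixing with a hard-label loss.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Temperature-scaled softmax: softmax(logits / T), computed stably by
// subtracting the max logit before exponentiating.
std::vector<double> softmax_T(const std::vector<double>& logits, double T) {
    double mx = logits[0];
    for (double z : logits) mx = std::max(mx, z);
    std::vector<double> p(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) { p[i] = std::exp((logits[i] - mx) / T); sum += p[i]; }
    for (double& x : p) x /= sum;
    return p;
}

// D_KL(P_teacher || Q_student) over softened class probabilities, in nats.
double distillation_kl(const std::vector<double>& teacher_logits,
                       const std::vector<double>& student_logits, double T) {
    auto P = softmax_T(teacher_logits, T);
    auto Q = softmax_T(student_logits, T);
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i)
        if (P[i] > 0.0) D += P[i] * (std::log(P[i]) - std::log(Q[i]));
    return D; // often multiplied by T^2 when combined with a hard-label loss
}

int main() {
    // Hypothetical logits for a 4-class problem
    std::vector<double> teacher{4.0, 1.0, 0.5, -1.0};
    std::vector<double> student{3.0, 1.5, 0.0, -0.5};
    std::cout << "KL(teacher||student), T=1: " << distillation_kl(teacher, student, 1.0) << " nats\n";
    std::cout << "KL(teacher||student), T=4: " << distillation_kl(teacher, student, 4.0) << " nats\n";
    return 0;
}

Raising the temperature flattens both distributions, which exposes more of the teacher's relative preferences among non-top classes to the student.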
⚠️Common Mistakes
- Ignoring zeros: Computing log(P/Q) when Q(x)=0 and P(x)>0 makes forward KL infinite. For discrete tasks, ensure Q has support wherever P does (smoothing) or handle infinities explicitly.
- Mixing log bases: Using base e in training and base 2 in reporting leads to inconsistent scales. Convert via nats = (ln 2) × bits, or vice versa.
- Treating KL as a distance: It is asymmetric and fails the triangle inequality; do not average D_KL(P||Q) and D_KL(Q||P) unless that’s what you intend. If you need a metric-like quantity, consider Jensen–Shannon divergence (see the sketch after this list).
- Comparing unnormalized densities: KL needs normalized distributions. If only unnormalized log-densities are available, reverse KL is still well-defined up to an additive constant in the expectation, but forward KL is problematic unless you can evaluate log p.
- Numerical underflow: Multiplying tiny probabilities or taking logs of nearly zero leads to -inf and NaNs. Use log-sum-exp and clamp variances/probabilities with epsilons; normalize carefully.
- Wrong direction in VI: Minimizing D_KL(P||Q) is not the same as minimizing D_KL(Q||P); they produce qualitatively different approximations (covering vs seeking). Choose direction based on objective.
- Ignoring support mismatch in continuous case: Absolute continuity is required; if q(x)=0 while p(x)>0 on a set with positive measure, the forward KL is infinite.
- Overinterpreting magnitudes across problems: KL values depend on dimension and task scale. Compare KLs only within the same setting and units.
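As referenced in the list above, here is a minimal C++ sketch of Jensen–Shannon divergence as the symmetric, always-finite alternative to KL; the example distributions are arbitrary illustrations.

#include <cmath>
#include <iostream>
#include <vector>

// D_KL(P||Q) in nats for discrete distributions; terms with P[i]==0 contribute 0.
// Assumes Q[i] > 0 wherever P[i] > 0 (true for the mixture M used below).
double kl(const std::vector<double>& P, const std::vector<double>& Q) {
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i)
        if (P[i] > 0.0) D += P[i] * (std::log(P[i]) - std::log(Q[i]));
    return D;
}

// JS(P,Q) = 0.5*KL(P||M) + 0.5*KL(Q||M) with M = (P+Q)/2.
// M is positive wherever P or Q is, so JS stays finite even with support mismatch.
double js(const std::vector<double>& P, const std::vector<double>& Q) {
    std::vector<double> M(P.size());
    for (size_t i = 0; i < P.size(); ++i) M[i] = 0.5 * (P[i] + Q[i]);
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M);
}

int main() {
    // Q assigns zero mass to the first outcome, so KL(P||Q) would be infinite,
    // but JS remains finite and symmetric.
    std::vector<double> P{0.7, 0.2, 0.1};
    std::vector<double> Q{0.0, 0.5, 0.5};
    std::cout << "JS(P,Q) = " << js(P, Q) << " nats\n";
    std::cout << "JS(Q,P) = " << js(Q, P) << " nats (same value)\n";
    return 0;
}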
Key Formulas
Discrete KL divergence
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
Explanation: Average under P of the log ratio P/Q over the support. Infinite if Q assigns zero probability where P is positive.
Continuous KL divergence
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
Explanation: Expectation under the true density p of the log density ratio p/q. Requires absolute continuity for finiteness.
Relation to cross-entropy
D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)
Explanation: KL equals the extra code length when using Q instead of the optimal code for P. H(P) is entropy; H(P,Q) is cross-entropy.
Entropy and cross-entropy
H(P) = -\sum_{x} P(x) \log P(x), \qquad H(P, Q) = -\sum_{x} P(x) \log Q(x)
Explanation: Entropy measures uncertainty of P; cross-entropy measures expected coding cost using Q to encode draws from P.
Gibbs' inequality
D_{\mathrm{KL}}(P \,\|\, Q) \ge 0, \qquad D_{\mathrm{KL}}(P \,\|\, Q) = 0 \iff P = Q \text{ almost everywhere}
Explanation: KL is always non-negative and zero only when the distributions match almost everywhere. Follows from Jensen's inequality.
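The standard one-line derivation via Jensen's inequality (log is concave), restated here as a quick sanity check:

-D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{P}\!\left[\log \frac{Q(x)}{P(x)}\right] \le \log \mathbb{E}_{P}\!\left[\frac{Q(x)}{P(x)}\right] = \log \sum_{x:\,P(x)>0} Q(x) \le \log 1 = 0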
KL chain rule
D_{\mathrm{KL}}\big(P(X, Y) \,\|\, Q(X, Y)\big) = D_{\mathrm{KL}}\big(P(X) \,\|\, Q(X)\big) + \mathbb{E}_{P(X)}\!\left[ D_{\mathrm{KL}}\big(P(Y \mid X) \,\|\, Q(Y \mid X)\big) \right]
Explanation: The KL between joint distributions decomposes into a marginal KL plus expected conditional KL. Useful for structured models.
Jensen–Shannon divergence
\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)
Explanation: A symmetric, finite divergence smoothing both distributions with their average. The square root of JS is a metric.
Pinsker's inequality
\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}
Explanation: Total variation distance is bounded by KL. A small KL guarantees small TV, giving a robustness interpretation.
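A quick numerical check of Pinsker's inequality on a small discrete example; the two distributions below are arbitrary illustrations.

#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Two arbitrary distributions on three outcomes
    std::vector<double> P{0.7, 0.2, 0.1};
    std::vector<double> Q{0.5, 0.3, 0.2};

    // Total variation distance: TV = 0.5 * sum_i |P_i - Q_i|
    double tv = 0.0;
    for (size_t i = 0; i < P.size(); ++i) tv += std::fabs(P[i] - Q[i]);
    tv *= 0.5;

    // Forward KL in nats
    double kl = 0.0;
    for (size_t i = 0; i < P.size(); ++i)
        if (P[i] > 0.0) kl += P[i] * std::log(P[i] / Q[i]);

    // Pinsker: TV <= sqrt(KL / 2)
    std::cout << "TV = " << tv << ", sqrt(KL/2) = " << std::sqrt(kl / 2.0)
              << ", bound holds: " << (tv <= std::sqrt(kl / 2.0) ? "yes" : "no") << "\n";
    return 0;
}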
Mutual information via KL
I(X; Y) = D_{\mathrm{KL}}\big(P(X, Y) \,\|\, P(X)\,P(Y)\big) = \mathbb{E}_{P(X)}\!\left[ D_{\mathrm{KL}}\big(P(Y \mid X) \,\|\, P(Y)\big) \right]
Explanation: Mutual information is the KL between the joint and the product of marginals, or the expected KL between conditionals and marginals.
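A minimal sketch computing mutual information as the KL between a joint and the product of its marginals; the 2x2 joint table is an illustrative assumption.

#include <cmath>
#include <iostream>

int main() {
    // Illustrative 2x2 joint distribution P(X, Y); rows index X, columns index Y.
    double joint[2][2] = {{0.4, 0.1},
                          {0.1, 0.4}};

    // Marginals P(X) and P(Y)
    double px[2] = {0.0, 0.0}, py[2] = {0.0, 0.0};
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y) { px[x] += joint[x][y]; py[y] += joint[x][y]; }

    // I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) ) = sum_{x,y} p(x,y) log( p(x,y) / (p(x)p(y)) )
    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (joint[x][y] > 0.0)
                mi += joint[x][y] * std::log(joint[x][y] / (px[x] * py[y]));

    std::cout << "I(X;Y) = " << mi << " nats\n"; // 0 only if X and Y are independent
    return 0;
}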
ELBO decomposition
\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
Explanation: The evidence lower bound plus the reverse KL to the true posterior equals the log evidence. Maximizing ELBO minimizes that KL.
ELBO definition
\mathrm{ELBO}(q) = \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x, z) - \log q(z \mid x) \right]
Explanation: A tractable objective used in variational inference; encourages q to place mass where the joint p(x,z) is large.
KL between Gaussians
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big) = \tfrac{1}{2}\left[ \operatorname{tr}\!\left(\Sigma_1^{-1} \Sigma_0\right) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \ln \frac{\det \Sigma_1}{\det \Sigma_0} \right]
Explanation: Closed form for multivariate normals in k dimensions. Simplifies to elementwise operations for diagonal covariances.
Monte Carlo KL estimator (forward)
D_{\mathrm{KL}}(P \,\|\, Q) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \log p(x_i) - \log q(x_i) \right], \qquad x_i \sim P
Explanation: Estimate the expectation defining KL by averaging log-density ratios over samples from P. Variance decreases with N.
Base change for logs
D_{\mathrm{KL}}^{(\log_b)}(P \,\|\, Q) = \frac{D_{\mathrm{KL}}^{(\ln)}(P \,\|\, Q)}{\ln b}
Explanation: Changing the log base rescales KL. Divide by ln b to convert from natural logs (nats) to base-b units (e.g., bits with b=2).
Complexity Analysis
For discrete distributions over K outcomes, computing KL, entropy, or cross-entropy is a single pass over the support: O(K) time and O(1) extra space. The closed-form KL between diagonal-covariance Gaussians in d dimensions is O(d); with full covariances it is dominated by the inverse and determinant, O(d^3). A Monte Carlo estimate with N samples costs O(N) log-density evaluations, and its standard error shrinks as O(1/sqrt(N)).
Code Examples
#include <iostream>
#include <vector>
#include <cmath>
#include <numeric>
#include <iomanip>
#include <stdexcept>
#include <string>
#include <algorithm>

// Normalize a nonnegative vector to sum to 1 (probability simplex)
std::vector<double> normalize(const std::vector<double>& v) {
    double sum = std::accumulate(v.begin(), v.end(), 0.0);
    if (sum <= 0.0) throw std::invalid_argument("Vector must have positive sum to normalize.");
    std::vector<double> out(v.size());
    for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] / sum;
    return out;
}

// Additive smoothing: add eps to all entries, then renormalize
std::vector<double> smooth_and_normalize(const std::vector<double>& v, double eps) {
    if (eps < 0.0) throw std::invalid_argument("eps must be nonnegative");
    std::vector<double> w = v;
    for (double &x : w) x = std::max(0.0, x) + eps; // clamp negatives to 0, then add eps
    return normalize(w);
}

// Shannon entropy H(P) in nats
double entropy(const std::vector<double>& P) {
    double H = 0.0;
    for (double p : P) {
        if (p > 0.0) H -= p * std::log(p);
        // terms with p==0 contribute 0
    }
    return H;
}

// Cross-entropy H(P,Q) in nats; returns +inf if any Q_i==0 with P_i>0
double cross_entropy(const std::vector<double>& P, const std::vector<double>& Q) {
    if (P.size() != Q.size()) throw std::invalid_argument("Size mismatch");
    double H = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        if (P[i] > 0.0) {
            if (Q[i] <= 0.0) return INFINITY; // undefined; infinite penalty
            H -= P[i] * std::log(Q[i]);
        }
    }
    return H;
}

// KL divergence D_KL(P||Q) in nats; returns +inf if Q has zero where P>0 (strict)
double kl_divergence(const std::vector<double>& P, const std::vector<double>& Q) {
    if (P.size() != Q.size()) throw std::invalid_argument("Size mismatch");
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        double p = P[i], q = Q[i];
        if (p == 0.0) continue;        // term is 0
        if (q <= 0.0) return INFINITY; // infinite KL
        D += p * (std::log(p) - std::log(q));
    }
    return D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // Example distributions (possibly unnormalized counts)
    std::vector<double> P_raw{0.7, 0.2, 0.1};
    std::vector<double> Q1_raw{0.6, 0.3, 0.1};
    std::vector<double> Q2_raw{0.0, 0.5, 0.5}; // zero where P>0 -> infinite forward KL (strict)

    // Normalize (or smooth+normalize for safety)
    auto P = normalize(P_raw);
    auto Q1 = normalize(Q1_raw);
    auto Q2 = normalize(Q2_raw);

    // Entropy and cross-entropy
    double HP = entropy(P);
    double HPQ1 = cross_entropy(P, Q1);
    double HPQ2 = cross_entropy(P, Q2); // +inf expected

    // KL computations
    double D_P_Q1 = kl_divergence(P, Q1);
    double D_P_Q2 = kl_divergence(P, Q2); // +inf expected
    double D_Q1_P = kl_divergence(Q1, P); // reverse KL

    std::cout << "H(P): " << HP << " nats\n";
    std::cout << "H(P,Q1): " << HPQ1 << " nats, KL(P||Q1): " << D_P_Q1 << " nats\n";
    std::cout << "H(P,Q2): " << (std::isinf(HPQ2) ? std::string("inf") : std::to_string(HPQ2)) << ", KL(P||Q2): "
              << (std::isinf(D_P_Q2) ? std::string("inf") : std::to_string(D_P_Q2)) << "\n";
    std::cout << "KL(Q1||P): " << D_Q1_P << " nats (asymmetry check)\n";

    // Smoothed comparison to avoid infinities (for analysis/monitoring)
    double eps = 1e-12;
    auto P_s = smooth_and_normalize(P_raw, eps);
    auto Q2_s = smooth_and_normalize(Q2_raw, eps);
    double D_smoothed = kl_divergence(P_s, Q2_s);
    std::cout << "Smoothed KL(P||Q2) with eps=" << eps << ": " << D_smoothed << " nats\n";

    return 0;
}
This program computes entropy, cross-entropy, and KL divergence for discrete distributions with careful handling of zeros. It shows that forward KL becomes infinite when Q assigns zero probability where P is positive, demonstrating the mode-covering nature of forward KL. It also illustrates asymmetry by comparing KL(P||Q1) and KL(Q1||P), and shows how small additive smoothing can produce finite, stable numbers for monitoring purposes.
#include <iostream>
#include <vector>
#include <cmath>
#include <stdexcept>
#include <iomanip>
#include <algorithm>

// KL between N(mu0, diag(var0)) and N(mu1, diag(var1)) in d dimensions
// Formula: 0.5 * sum_i [ log(var1_i/var0_i) + (var0_i + (mu0_i - mu1_i)^2)/var1_i - 1 ]
double kl_diag_gaussians(const std::vector<double>& mu0,
                         const std::vector<double>& var0,
                         const std::vector<double>& mu1,
                         const std::vector<double>& var1,
                         double min_var = 1e-12) {
    size_t d = mu0.size();
    if (mu1.size() != d || var0.size() != d || var1.size() != d)
        throw std::invalid_argument("All vectors must have the same dimension");
    double D = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double v0 = std::max(var0[i], min_var);
        double v1 = std::max(var1[i], min_var);
        double diff = mu0[i] - mu1[i];
        D += std::log(v1 / v0) + (v0 + diff * diff) / v1 - 1.0;
    }
    return 0.5 * D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // 2D example
    std::vector<double> mu_p{0.0, 1.0};
    std::vector<double> var_p{1.0, 0.5};
    std::vector<double> mu_q{0.5, 0.5};
    std::vector<double> var_q{2.0, 0.25};

    double D_pq = kl_diag_gaussians(mu_p, var_p, mu_q, var_q);
    double D_qp = kl_diag_gaussians(mu_q, var_q, mu_p, var_p);

    std::cout << "KL(N_p || N_q) = " << D_pq << " nats\n";
    std::cout << "KL(N_q || N_p) = " << D_qp << " nats (asymmetry)\n";

    return 0;
}
This code implements the standard closed-form KL divergence between two diagonal-covariance multivariate Gaussians. It clamps variances to avoid division by zero and demonstrates asymmetry by computing both directions.
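One way to sanity-check a closed form like this is to compare it against a Monte Carlo estimate using samples from the first Gaussian. Below is a minimal univariate sketch; the parameters and sample count are arbitrary choices for illustration.

#include <cmath>
#include <iostream>
#include <random>

// Log-density of a univariate Normal N(mu, sigma^2)
double log_norm_pdf(double x, double mu, double sigma) {
    double z = (x - mu) / sigma;
    return -0.5 * std::log(2.0 * 3.14159265358979323846) - std::log(sigma) - 0.5 * z * z;
}

int main() {
    // P = N(0, 1), Q = N(1, 2^2); closed form for univariate Gaussians:
    // KL(P||Q) = log(s1/s0) + (s0^2 + (m0-m1)^2) / (2*s1^2) - 0.5
    double m0 = 0.0, s0 = 1.0, m1 = 1.0, s1 = 2.0;
    double closed = std::log(s1 / s0) + (s0 * s0 + (m0 - m1) * (m0 - m1)) / (2.0 * s1 * s1) - 0.5;

    // Monte Carlo estimate: average of log p(x) - log q(x) over samples x ~ P
    std::mt19937_64 gen(7);
    std::normal_distribution<double> sampler(m0, s0);
    const int N = 200000;
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = sampler(gen);
        sum += log_norm_pdf(x, m0, s0) - log_norm_pdf(x, m1, s1);
    }

    std::cout << "Closed form KL(P||Q): " << closed << " nats\n";
    std::cout << "Monte Carlo estimate: " << sum / N << " nats (should be close)\n";
    return 0;
}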
#include <iostream>
#include <random>
#include <cmath>
#include <vector>
#include <limits>
#include <iomanip>
#include <algorithm>

const double PI_CONST = 3.14159265358979323846;

// Log-density of univariate Normal N(mu, sigma^2)
double log_norm_pdf(double x, double mu, double sigma) {
    double z = (x - mu) / sigma;
    return -0.5 * std::log(2.0 * PI_CONST) - std::log(sigma) - 0.5 * z * z;
}

// Stable log-sum-exp for two values
inline double logsumexp2(double a, double b) {
    double m = std::max(a, b);
    return m + std::log(std::exp(a - m) + std::exp(b - m));
}

// Define P as a 2-component mixture: 0.5*N(-2, 0.5^2) + 0.5*N(2, 1^2)
double log_p(double x) {
    double l1 = std::log(0.5) + log_norm_pdf(x, -2.0, 0.5);
    double l2 = std::log(0.5) + log_norm_pdf(x, 2.0, 1.0);
    return logsumexp2(l1, l2);
}

// Define Q as a Normal N(m, s^2)
struct QNormal { double m; double s; };

double log_q(double x, const QNormal& q) {
    return log_norm_pdf(x, q.m, q.s);
}

// Seeded random number generator wrapper
struct RNG { std::mt19937_64 gen; RNG(uint64_t seed = 42) : gen(seed) {} };

// Sample from the mixture P
double sample_P(RNG& rng) {
    std::bernoulli_distribution B(0.5);
    if (B(rng.gen)) {
        std::normal_distribution<double> N(-2.0, 0.5);
        return N(rng.gen);
    } else {
        std::normal_distribution<double> N(2.0, 1.0);
        return N(rng.gen);
    }
}

// Sample from Q
double sample_Q(RNG& rng, const QNormal& q) {
    std::normal_distribution<double> N(q.m, q.s);
    return N(rng.gen);
}

// Monte Carlo estimate of forward KL D_KL(P||Q): E_P[log p(x) - log q(x)]
double mc_forward_KL(size_t N, const QNormal& q, RNG& rng) {
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        double x = sample_P(rng);
        sum += (log_p(x) - log_q(x, q));
    }
    return sum / static_cast<double>(N);
}

// Monte Carlo estimate of reverse KL D_KL(Q||P): E_Q[log q(x) - log p(x)]
double mc_reverse_KL(size_t N, const QNormal& q, RNG& rng) {
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        double x = sample_Q(rng, q);
        sum += (log_q(x, q) - log_p(x));
    }
    return sum / static_cast<double>(N);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);
    RNG rng(123);

    // Baseline Q
    QNormal q{0.0, 1.0};
    size_t N = 200000; // samples for MC

    double D_fwd = mc_forward_KL(N, q, rng);
    double D_rev = mc_reverse_KL(N, q, rng);

    std::cout << "Forward KL D_KL(P||Q) with Q=N(0,1): " << D_fwd << " nats\n";
    std::cout << "Reverse KL D_KL(Q||P) with Q=N(0,1): " << D_rev << " nats\n";

    // Simple 1D grid search over mean m to illustrate mode-covering vs mode-seeking
    double best_m_fwd = 0.0, best_val_fwd = std::numeric_limits<double>::infinity();
    double best_m_rev = 0.0, best_val_rev = std::numeric_limits<double>::infinity();

    for (int i = -30; i <= 30; ++i) {
        double m = i * 0.2;       // grid from -6 to 6
        QNormal qtest{m, 1.0};    // fix s=1 for simplicity
        RNG rng_local(1000 + i);  // different seeds to reduce correlation
        double val_fwd = mc_forward_KL(40000, qtest, rng_local);
        RNG rng_local2(2000 + i);
        double val_rev = mc_reverse_KL(40000, qtest, rng_local2);
        if (val_fwd < best_val_fwd) { best_val_fwd = val_fwd; best_m_fwd = m; }
        if (val_rev < best_val_rev) { best_val_rev = val_rev; best_m_rev = m; }
    }

    std::cout << "Minimizer of forward KL over m (s=1): m* = " << best_m_fwd
              << ", KL = " << best_val_fwd << "\n";
    std::cout << "Minimizer of reverse KL over m (s=1): m* = " << best_m_rev
              << ", KL = " << best_val_rev << "\n";

    std::cout << "(Expected: reverse KL tends to sit on one mode near -2 or 2; forward KL tends to sit between modes.)\n";
    return 0;
}
We estimate forward and reverse KL between a bimodal mixture P and a unimodal Gaussian Q via Monte Carlo. The code shows that reverse KL prefers placing Q at a single mode (mode-seeking), while forward KL prefers a compromise between modes (mode-covering). Numerical stability is maintained by evaluating log-densities in log-space using log-sum-exp for the mixture.