📚 Theory · Intermediate

KL Divergence (Kullback-Leibler Divergence)

Key Points

  • Kullback–Leibler (KL) divergence measures how much a probability distribution P differs from a reference distribution Q, i.e., how P allocates probability mass differently from Q.
  • It is asymmetric: D_KL(P||Q) ≠ D_KL(Q||P), and it is not a distance metric.
  • KL divergence is always non‑negative and equals zero if and only if P and Q are identical almost everywhere.
  • Forward KL (D_KL(P||Q)) is mode-covering: it forces Q to place mass wherever P is high. Reverse KL (D_KL(Q||P)) is mode-seeking.
  • It connects to entropy and cross‑entropy via D_KL(P||Q) = H(P,Q) − H(P), which is the extra coding cost when using Q instead of P.
  • In ML, minimizing forward KL is equivalent to maximum likelihood; reverse KL is central in variational inference and VAEs through the ELBO.
  • KL can be computed exactly for some families (e.g., Gaussians) or estimated with Monte Carlo using samples.
  • Numerical stability requires careful handling of zeros and tiny probabilities (clamping, smoothing, and log‑sum‑exp).

Prerequisites

  • Basic probability distributions: understanding PMFs, PDFs, and support is essential to define KL divergence.
  • Logarithms and change of base: KL uses logarithms; interpreting units (nats vs bits) requires base conversion.
  • Expectation and the law of large numbers: KL is an expectation; Monte Carlo estimators converge by the LLN.
  • Entropy and cross-entropy: KL connects directly via D_KL(P||Q) = H(P,Q) - H(P).
  • Gaussian distributions: important closed-form KL cases appear for Gaussians (used in VAEs and VI).
  • Numerical stability in floating point: preventing underflow/overflow and handling zeros is critical when computing KL.
  • Optimization basics: choosing which KL direction to minimize depends on the training objective (MLE vs VI).

Detailed Explanation


01 Overview

Kullback–Leibler (KL) divergence is a fundamental tool for comparing probability distributions. Intuitively, it measures how inefficient it is to assume a distribution Q when the true distribution is P. If two distributions are identical, the inefficiency is zero; the more they differ, the larger the divergence. Unlike a distance, KL divergence is asymmetric: D_KL(P||Q) does not equal D_KL(Q||P), and it does not satisfy the triangle inequality. This asymmetry is meaningful: choosing the direction changes what kinds of mistakes are penalized. Forward KL (P||Q) emphasizes covering all regions where P has probability mass, while reverse KL (Q||P) emphasizes concentrating on the most likely regions under P. In machine learning, KL divergence appears everywhere: in maximum likelihood training (where minimizing the expected negative log-likelihood is equivalent to minimizing forward KL), in variational inference (minimizing reverse KL between an approximate posterior and the true posterior), in regularization of policies in reinforcement learning, and in knowledge distillation between teacher and student models. KL is also tied to information theory: it quantifies the expected extra number of nats or bits required to encode samples from P using a code optimized for Q. Practically, we compute KL exactly for certain families (e.g., Gaussians) or estimate it numerically via sampling and log-densities, taking care with numerical stability.

02 Intuition & Analogies

Imagine you pack for a trip based on a weather app. If the app (Q) predicts mostly sunny days but the real weather (P) brings unexpected rain, you will be underprepared. KL divergence is the average penalty for planning by Q when reality follows P. When KL is zero, your packing list perfectly matches the true weather pattern; when KL is large, you made many poor choices consistently. Two planning mindsets mirror forward and reverse KL. Forward KL (P||Q) is like planning to be safe: make sure you have something for any weather that might actually occur. If there is any chance of rain (P says rain is possible), your packing (Q) must include a raincoat, or you pay a big penalty (infinite, in the discrete case if Q assigns zero to something P says can happen). This makes forward KL “mode-covering”: it pushes Q to spread out and not miss any true possibilities. Reverse KL (Q||P) is like packing light around what you expect to happen most often. If you believe sunny days dominate (Q concentrates where P is most likely), you may ignore rare storms. Reverse KL penalizes placing mass where P has essentially none, making it “mode-seeking”: it often picks a single representative mode in multimodal P. From an encoding viewpoint, suppose you design a code optimized for Q but your messages are actually produced by P. The extra average code length over the optimal code for P is exactly D_KL(P||Q). Thus KL measures the cost of using the wrong beliefs. In ML terms, minimizing forward KL aligns the model’s predictions with observed data frequencies (cover everything you see), whereas minimizing reverse KL yields compact approximations that focus on the bulk of probability mass (good for tractable approximate inference).

03 Formal Definition

For discrete distributions on a common support \(\mathcal{X}\), the KL divergence is defined as \[ D_{\mathrm{KL}}(P\,\|\,Q) = \sum_{x\in\mathcal{X}} P(x) \log\frac{P(x)}{Q(x)}. \] In words: the average, under P, of the log-ratio of probabilities. The base of the logarithm determines the units: natural log gives nats; base 2 gives bits. If there exists an \(x\) with \(P(x)>0\) but \(Q(x)=0\), the divergence is \(+\infty\). For continuous distributions with densities \(p\) and \(q\) with respect to a common measure (typically Lebesgue), assuming absolute continuity of P with respect to Q (so that \(p(x)=0\) whenever \(q(x)=0\)), \[ D_{\mathrm{KL}}(P\,\|\,Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx. \] KL satisfies \(D_{\mathrm{KL}}(P\,\|\,Q)\ge 0\) with equality iff \(P=Q\) almost everywhere (Gibbs' inequality), but it is not symmetric and does not obey the triangle inequality. KL relates to entropy and cross-entropy: entropy \(H(P) = -\sum_x P(x)\log P(x)\) (or \(-\int p\log p\)) measures inherent uncertainty; cross-entropy \(H(P,Q) = -\sum_x P(x)\log Q(x)\) measures the expected code length when encoding P using Q. Their difference is the KL: \(D_{\mathrm{KL}}(P\,\|\,Q) = H(P,Q) - H(P)\).
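
As a quick worked example (the specific numbers here are illustrative, not from the text above), let \(P=\mathrm{Bernoulli}(0.5)\) and \(Q=\mathrm{Bernoulli}(0.75)\). Then \[ D_{\mathrm{KL}}(P\,\|\,Q) = 0.5\log\tfrac{0.5}{0.75} + 0.5\log\tfrac{0.5}{0.25} \approx 0.144 \text{ nats}, \qquad D_{\mathrm{KL}}(Q\,\|\,P) = 0.75\log\tfrac{0.75}{0.5} + 0.25\log\tfrac{0.25}{0.5} \approx 0.131 \text{ nats}, \] which makes the asymmetry concrete: swapping the arguments changes the value.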

04 When to Use

Use forward KL \(D_{\mathrm{KL}}(P\,\|\,Q)\) when you have data from P and a parametric model \(Q_\theta\), and you want to fit \(Q_\theta\) to cover the data distribution; this is exactly maximum likelihood and cross-entropy minimization in classification and language modeling. Forward KL is also appropriate for evaluation: to quantify how far a learned model Q is from a known target P. Use reverse KL \(D_{\mathrm{KL}}(Q\,\|\,P)\) in approximate Bayesian inference, where P is the intractable posterior and Q is a simpler, tractable family. Minimizing reverse KL (e.g., in VAEs via the ELBO) yields approximations that place mass where P is large, often single-mode summaries of multimodal posteriors. Use symmetric variants when you need a similarity score without directionality: Jensen–Shannon divergence (JS) is finite, symmetric, and its square root is a metric. In reinforcement learning, penalize policy updates with KL to stay close to a previous policy (trust-region methods like TRPO/PPO). In model compression (knowledge distillation), minimize \(D_{\mathrm{KL}}(P_{\text{teacher}}\,\|\,Q_{\text{student}})\) via soft labels (often with temperature scaling). For distribution shift and monitoring, compute KL between current and baseline prediction distributions to detect drift. If you only have samples and can evaluate log-densities up to normalization constants, consider Monte Carlo estimates of KL differences, or use alternatives (e.g., JS) when supports only partially overlap or densities are implicit.
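
To make the distillation use case concrete, here is a minimal sketch assuming made-up teacher/student logits and temperatures (none of these values come from the text): it applies a temperature-scaled softmax to both sets of logits and computes the forward KL from teacher to student in nats.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <iomanip>

// Softmax with temperature T: p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably.
std::vector<double> softmax_T(const std::vector<double>& logits, double T) {
    std::vector<double> out(logits.size());
    double mx = *std::max_element(logits.begin(), logits.end());
    double Z = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp((logits[i] - mx) / T);
        Z += out[i];
    }
    for (double& p : out) p /= Z;
    return out;
}

// Forward KL D_KL(P||Q) in nats over discrete class probabilities (softmax outputs are never exactly 0).
double kl(const std::vector<double>& P, const std::vector<double>& Q) {
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i)
        if (P[i] > 0.0) D += P[i] * (std::log(P[i]) - std::log(Q[i]));
    return D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);
    // Hypothetical teacher/student logits for a 4-class problem (illustrative values only).
    std::vector<double> teacher_logits{4.0, 1.5, 0.5, -1.0};
    std::vector<double> student_logits{3.0, 2.5, 0.0, -0.5};
    for (double T : {1.0, 2.0, 4.0}) {
        auto P = softmax_T(teacher_logits, T);  // soft labels
        auto Q = softmax_T(student_logits, T);  // student predictions
        std::cout << "T=" << T << "  KL(teacher||student) = " << kl(P, Q) << " nats\n";
    }
    return 0;
}

Raising T softens both distributions, so the student is also pushed to match the teacher's relative probabilities on the non-top classes.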

⚠️ Common Mistakes

  1. Ignoring zeros: Computing \(\log\frac{P(x)}{Q(x)}\) when Q(x)=0 and P(x)>0 makes forward KL infinite. For discrete tasks, ensure Q has support wherever P does (smoothing) or handle infinities explicitly.
  2. Mixing log bases: Using base \(e\) in training and base 2 in reporting leads to inconsistent scales. Convert via \(\text{nats} = (\ln 2)\times \text{bits}\), or vice versa.
  3. Treating KL as a distance: It is asymmetric and fails the triangle inequality; do not average \(D_{\mathrm{KL}}(P\,\|\,Q)\) and \(D_{\mathrm{KL}}(Q\,\|\,P)\) unless that’s what you intend. If you need a metric-like quantity, consider Jensen–Shannon divergence (see the sketch after this list).
  4. Comparing unnormalized densities: KL needs normalized distributions. If only unnormalized log-densities are available, reverse KL is still well-defined up to an additive constant in the expectation, but forward KL is problematic unless you can evaluate \(\log p\).
  5. Numerical underflow: Multiplying tiny probabilities or taking logs of nearly zero leads to -inf and NaNs. Use log-sum-exp and clamp variances/probabilities with epsilons; normalize carefully.
  6. Wrong direction in VI: Minimizing \(D_{\mathrm{KL}}(P\,\|\,Q)\) is not the same as minimizing \(D_{\mathrm{KL}}(Q\,\|\,P)\); they produce qualitatively different approximations (covering vs seeking). Choose the direction based on the objective.
  7. Ignoring support mismatch in the continuous case: Absolute continuity is required; if q(x)=0 while p(x)>0 on a set of positive measure, the forward KL is infinite.
  8. Overinterpreting magnitudes across problems: KL values depend on dimension and task scale. Compare KLs only within the same setting and units.
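
The following sketch (with illustrative values, not one of the code examples below) addresses mistakes 2 and 3: it computes the Jensen–Shannon divergence, which stays finite and symmetric even when supports only partially overlap, and converts nats to bits by dividing by ln 2.

#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>

// D_KL(P||Q) in nats; assumes P, Q are normalized and Q > 0 wherever P > 0.
double kl(const std::vector<double>& P, const std::vector<double>& Q) {
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i)
        if (P[i] > 0.0) D += P[i] * (std::log(P[i]) - std::log(Q[i]));
    return D;
}

// Jensen-Shannon divergence: 0.5*KL(P||M) + 0.5*KL(Q||M) with M = (P+Q)/2.
// M is positive wherever P or Q is, so JS stays finite even under support mismatch.
double js(const std::vector<double>& P, const std::vector<double>& Q) {
    std::vector<double> M(P.size());
    for (size_t i = 0; i < P.size(); ++i) M[i] = 0.5 * (P[i] + Q[i]);
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);
    std::vector<double> P{0.7, 0.2, 0.1, 0.0};
    std::vector<double> Q{0.0, 0.5, 0.4, 0.1};  // support mismatch: forward KL(P||Q) would be infinite

    double js_nats = js(P, Q);
    double js_bits = js_nats / std::log(2.0);   // base change: divide nats by ln 2 to get bits
    std::cout << "JS(P,Q) = " << js_nats << " nats = " << js_bits << " bits\n";
    std::cout << "JS(Q,P) = " << js(Q, P) << " nats (symmetric)\n";
    return 0;
}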

Key Formulas

Discrete KL divergence

\[ D_{\mathrm{KL}}(P\,\|\,Q) = \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)} \]

Explanation: Average under P of the log ratio P/Q over the support. Infinite if Q assigns zero probability where P is positive.

Continuous KL divergence

\[ D_{\mathrm{KL}}(P\,\|\,Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx \]

Explanation: Expectation under the true density p of the log density ratio p/q. Requires absolute continuity for finiteness.

Relation to cross-entropy

\[ D_{\mathrm{KL}}(P\,\|\,Q) = H(P,Q) - H(P) \]

Explanation: KL equals the extra code length when using Q instead of the optimal code for P. H(P) is entropy; H(P,Q) is cross-entropy.

Entropy and cross-entropy

\[ H(P) = -\sum_{x} P(x)\log P(x), \qquad H(P,Q) = -\sum_{x} P(x)\log Q(x) \]

Explanation: Entropy measures uncertainty of P; cross-entropy measures expected coding cost using Q to encode draws from P.

Gibbs' inequality

\[ D_{\mathrm{KL}}(P\,\|\,Q) \ge 0, \quad \text{with equality iff } P = Q \text{ almost everywhere} \]

Explanation: KL is always non-negative and zero only when the distributions match almost everywhere. Follows from Jensen's inequality.

KL chain rule

\[ D_{\mathrm{KL}}\big(P(X,Y)\,\|\,Q(X,Y)\big) = D_{\mathrm{KL}}\big(P(X)\,\|\,Q(X)\big) + \mathbb{E}_{P(X)}\!\big[ D_{\mathrm{KL}}\big(P(Y\mid X)\,\|\,Q(Y\mid X)\big) \big] \]

Explanation: The KL between joint distributions decomposes into a marginal KL plus expected conditional KL. Useful for structured models.

Jensen–Shannon divergence

\[ \mathrm{JS}(P\,\|\,Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P\,\|\,M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q\,\|\,M), \qquad M = \tfrac{1}{2}(P+Q) \]

Explanation: A symmetric, finite divergence smoothing both distributions with their average. The square root of JS is a metric.

Pinsker's inequality

\[ \mathrm{TV}(P,Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P\,\|\,Q)} \]

Explanation: Total variation distance is bounded by the square root of half the KL. A small KL guarantees small TV, giving a robustness interpretation.
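
As a quick numeric check of the bound (reusing the small discrete distributions from the first code example below), the sketch compares total variation distance with \(\sqrt{D_{\mathrm{KL}}/2}\):

#include <iostream>
#include <vector>
#include <cmath>

int main() {
    // P and Q reuse the discrete example distributions from the code section below.
    std::vector<double> P{0.7, 0.2, 0.1}, Q{0.6, 0.3, 0.1};
    double tv = 0.0, kl = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        tv += 0.5 * std::fabs(P[i] - Q[i]);                  // total variation = 0.5 * L1 distance
        if (P[i] > 0.0) kl += P[i] * std::log(P[i] / Q[i]);  // D_KL(P||Q) in nats
    }
    std::cout << "TV = " << tv << ", sqrt(KL/2) = " << std::sqrt(0.5 * kl) << "\n";
    // Pinsker: TV <= sqrt(KL/2); here 0.1 <= ~0.116.
    return 0;
}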

Mutual information via KL

\[ I(X;Y) = D_{\mathrm{KL}}\big(P(X,Y)\,\|\,P(X)\,P(Y)\big) = \mathbb{E}_{P(X)}\big[ D_{\mathrm{KL}}\big(P(Y\mid X)\,\|\,P(Y)\big) \big] \]

Explanation: Mutual information is the KL between the joint and the product of marginals, or the expected KL between conditionals and marginals.
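
As a concrete illustration, the short sketch below computes mutual information from a made-up 2×2 joint table as the KL between the joint and the product of its marginals (the table values are assumptions for illustration):

#include <iostream>
#include <cmath>

int main() {
    // Hypothetical joint distribution P(X,Y) over two binary variables (rows: X, cols: Y).
    double joint[2][2] = {{0.30, 0.20},
                          {0.10, 0.40}};

    // Marginals P(X) and P(Y)
    double px[2] = {joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]};
    double py[2] = {joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]};

    // I(X;Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x)P(y)) ) = D_KL(joint || product of marginals)
    double mi = 0.0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            if (joint[x][y] > 0.0)
                mi += joint[x][y] * std::log(joint[x][y] / (px[x] * py[y]));

    std::cout << "I(X;Y) = " << mi << " nats\n";  // ~0.086 nats for this table
    return 0;
}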

ELBO decomposition

\[ \log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\big(q(z\mid x)\,\|\,p(z\mid x)\big) \]

Explanation: The evidence lower bound plus the reverse KL to the true posterior equals the log evidence. Maximizing ELBO minimizes that KL.

ELBO definition

\[ \mathrm{ELBO}(q) = \mathbb{E}_{q(z\mid x)}\big[\log p(x,z) - \log q(z\mid x)\big] \]

Explanation: A tractable objective used in variational inference; encourages q to place mass where the joint p(x,z) is large.

KL between Gaussians

\[ D_{\mathrm{KL}}\big(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \tfrac{1}{2}\Big[ \operatorname{tr}\!\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0) - k + \ln\frac{\det\Sigma_1}{\det\Sigma_0} \Big] \]

Explanation: Closed form for multivariate normals in k dimensions. Simplifies to elementwise operations for diagonal covariances.

Monte Carlo KL estimator (forward)

\[ \widehat{D}_{\mathrm{KL}}(P\,\|\,Q) = \frac{1}{N}\sum_{i=1}^{N}\big[\log p(x_i) - \log q(x_i)\big], \qquad x_i \sim P \]

Explanation: Estimate the expectation defining KL by averaging log-density ratios over samples from P. Variance decreases with N.

Base change for logs

\[ D_{\mathrm{KL}}^{(\text{base } b)} = \frac{D_{\mathrm{KL}}^{(\text{nats})}}{\ln b} \]

Explanation: Changing the log base rescales KL. Divide by ln b to convert from natural logs (nats) to base-b units (e.g., bits with b=2).

Complexity Analysis

Computing discrete KL divergence over n outcomes is O(n) time and O(1) extra space beyond storing the distributions. Each term requires a division and a logarithm, so the constant factors depend on floating-point log cost. If you smooth (add epsilon) and renormalize, you still traverse the vectors a constant number of times, maintaining O(n) time. Numerical stability checks (clamping to at least eps) are O(n) as well. For continuous distributions with special structure, complexity depends on the family. For diagonal Gaussians in d dimensions, the closed-form KL requires O(d) time: per dimension, compute a log ratio of variances, a quadratic mean-difference term, and a variance ratio. Space is O(1) extra beyond the parameter vectors. For full-covariance Gaussians, computing the inverse and determinant typically costs O(d^3) time using Cholesky or similar factorizations, and O(d^2) space for the matrices. Monte Carlo estimation scales with the number of samples N: O(N) time to evaluate log-densities and accumulate averages, with O(1) extra space. The variance of the estimator decreases as O(1/N), but heavy-tailed or mismatched densities can increase variance significantly. If evaluating log p(x) or log q(x) is expensive (e.g., neural nets), the dominant cost is the per-sample evaluation, often GPU-bound in ML practice. Using variance reduction (control variates, antithetic sampling) or stratified sampling can improve sample efficiency. Across all methods, the dominant practical concerns are (1) avoiding log(0) by ensuring support coverage or smoothing, (2) performing computations in log-space for mixtures (log-sum-exp), and (3) choosing the KL direction that matches the optimization goal (forward for MLE, reverse for VI).
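
To illustrate the full-covariance case mentioned above, here is a self-contained sketch of the closed-form Gaussian KL using a hand-rolled Cholesky factorization (the 2×2 inputs are illustrative assumptions; a production implementation would typically rely on a linear algebra library such as Eigen):

#include <iostream>
#include <vector>
#include <cmath>
#include <stdexcept>

using Matrix = std::vector<std::vector<double>>;

// Cholesky factorization A = L * L^T for a symmetric positive-definite matrix (O(d^3)).
Matrix cholesky(const Matrix& A) {
    size_t n = A.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                if (s <= 0.0) throw std::runtime_error("Matrix is not positive definite");
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    }
    return L;
}

// Solve L y = b by forward substitution (L lower triangular).
std::vector<double> forward_solve(const Matrix& L, const std::vector<double>& b) {
    size_t n = b.size();
    std::vector<double> y(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    return y;
}

// KL( N(mu0, S0) || N(mu1, S1) ) in nats, using Cholesky factors of S0 and S1.
double kl_full_gaussians(const std::vector<double>& mu0, const Matrix& S0,
                         const std::vector<double>& mu1, const Matrix& S1) {
    size_t k = mu0.size();
    Matrix L0 = cholesky(S0), L1 = cholesky(S1);

    // log det S = 2 * sum_i log(L_ii)
    double logdet0 = 0.0, logdet1 = 0.0;
    for (size_t i = 0; i < k; ++i) { logdet0 += std::log(L0[i][i]); logdet1 += std::log(L1[i][i]); }
    logdet0 *= 2.0; logdet1 *= 2.0;

    // tr(S1^{-1} S0) = || L1^{-1} L0 ||_F^2 : solve L1 x = (column j of L0) for each j
    double trace_term = 0.0;
    for (size_t j = 0; j < k; ++j) {
        std::vector<double> col(k);
        for (size_t i = 0; i < k; ++i) col[i] = L0[i][j];
        std::vector<double> x = forward_solve(L1, col);
        for (double v : x) trace_term += v * v;
    }

    // (mu1 - mu0)^T S1^{-1} (mu1 - mu0) = || L1^{-1} (mu1 - mu0) ||^2
    std::vector<double> diff(k);
    for (size_t i = 0; i < k; ++i) diff[i] = mu1[i] - mu0[i];
    std::vector<double> y = forward_solve(L1, diff);
    double quad = 0.0;
    for (double v : y) quad += v * v;

    return 0.5 * (trace_term + quad - static_cast<double>(k) + logdet1 - logdet0);
}

int main() {
    // Illustrative 2D parameters (not taken from the examples below).
    std::vector<double> mu0{0.0, 0.0}, mu1{1.0, -1.0};
    Matrix S0{{1.0, 0.2}, {0.2, 0.5}};
    Matrix S1{{1.5, -0.3}, {-0.3, 1.0}};
    std::cout << "KL(N0||N1) = " << kl_full_gaussians(mu0, S0, mu1, S1) << " nats\n";
    return 0;
}

The Cholesky factors give the log-determinants from their diagonals and reduce the trace and quadratic terms to triangular solves, which is where the O(d^3) cost comes from.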

Code Examples

Discrete KL, entropy, and cross-entropy with stability checks
#include <iostream>
#include <vector>
#include <cmath>
#include <numeric>
#include <algorithm>
#include <iomanip>
#include <stdexcept>

// Normalize a nonnegative vector to sum to 1 (probability simplex)
std::vector<double> normalize(const std::vector<double>& v) {
    double sum = std::accumulate(v.begin(), v.end(), 0.0);
    if (sum <= 0.0) throw std::invalid_argument("Vector must have positive sum to normalize.");
    std::vector<double> out(v.size());
    for (size_t i = 0; i < v.size(); ++i) out[i] = v[i] / sum;
    return out;
}

// Additive smoothing: add eps to all entries, then renormalize
std::vector<double> smooth_and_normalize(const std::vector<double>& v, double eps) {
    if (eps < 0.0) throw std::invalid_argument("eps must be nonnegative");
    std::vector<double> w = v;
    for (double &x : w) x = std::max(0.0, x) + eps;  // clamp negatives to 0, then add eps
    return normalize(w);
}

// Shannon entropy H(P) in nats
double entropy(const std::vector<double>& P) {
    double H = 0.0;
    for (double p : P) {
        if (p > 0.0) H -= p * std::log(p);
        // terms with p == 0 contribute 0
    }
    return H;
}

// Cross-entropy H(P,Q) in nats; returns +inf if any Q_i == 0 with P_i > 0
double cross_entropy(const std::vector<double>& P, const std::vector<double>& Q) {
    if (P.size() != Q.size()) throw std::invalid_argument("Size mismatch");
    double H = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        if (P[i] > 0.0) {
            if (Q[i] <= 0.0) return INFINITY;  // undefined; infinite penalty
            H -= P[i] * std::log(Q[i]);
        }
    }
    return H;
}

// KL divergence D_KL(P||Q) in nats; returns +inf if Q has zero where P > 0 (strict)
double kl_divergence(const std::vector<double>& P, const std::vector<double>& Q) {
    if (P.size() != Q.size()) throw std::invalid_argument("Size mismatch");
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        double p = P[i], q = Q[i];
        if (p == 0.0) continue;         // term is 0
        if (q <= 0.0) return INFINITY;  // infinite KL
        D += p * (std::log(p) - std::log(q));
    }
    return D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // Example distributions (possibly unnormalized counts)
    std::vector<double> P_raw{0.7, 0.2, 0.1};
    std::vector<double> Q1_raw{0.6, 0.3, 0.1};
    std::vector<double> Q2_raw{0.0, 0.5, 0.5};  // zero where P>0 -> infinite forward KL (strict)

    // Normalize (or smooth+normalize for safety)
    auto P = normalize(P_raw);
    auto Q1 = normalize(Q1_raw);
    auto Q2 = normalize(Q2_raw);

    // Entropy and cross-entropy
    double HP = entropy(P);
    double HPQ1 = cross_entropy(P, Q1);
    double HPQ2 = cross_entropy(P, Q2);  // +inf expected

    // KL computations
    double D_P_Q1 = kl_divergence(P, Q1);
    double D_P_Q2 = kl_divergence(P, Q2);  // +inf expected
    double D_Q1_P = kl_divergence(Q1, P);  // reverse KL

    std::cout << "H(P): " << HP << " nats\n";
    std::cout << "H(P,Q1): " << HPQ1 << " nats, KL(P||Q1): " << D_P_Q1 << " nats\n";
    std::cout << "H(P,Q2): " << (std::isinf(HPQ2) ? std::string("inf") : std::to_string(HPQ2)) << ", KL(P||Q2): "
              << (std::isinf(D_P_Q2) ? std::string("inf") : std::to_string(D_P_Q2)) << "\n";
    std::cout << "KL(Q1||P): " << D_Q1_P << " nats (asymmetry check)\n";

    // Smoothed comparison to avoid infinities (for analysis/monitoring)
    double eps = 1e-12;
    auto P_s = smooth_and_normalize(P_raw, eps);
    auto Q2_s = smooth_and_normalize(Q2_raw, eps);
    double D_smoothed = kl_divergence(P_s, Q2_s);
    std::cout << "Smoothed KL(P||Q2) with eps=" << eps << ": " << D_smoothed << " nats\n";

    return 0;
}

This program computes entropy, cross-entropy, and KL divergence for discrete distributions with careful handling of zeros. It shows that forward KL becomes infinite when Q assigns zero probability where P is positive, demonstrating the mode-covering nature of forward KL. It also illustrates asymmetry by comparing KL(P||Q1) and KL(Q1||P), and shows how small additive smoothing can produce finite, stable numbers for monitoring purposes.

Time: O(n)
Space: O(1) extra (beyond storing the input vectors)
Closed-form KL for diagonal-covariance Gaussians
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <stdexcept>
#include <iomanip>

// KL between N(mu0, diag(var0)) and N(mu1, diag(var1)) in d dimensions
// Formula: 0.5 * sum_i [ log(var1_i/var0_i) + (var0_i + (mu0_i - mu1_i)^2)/var1_i - 1 ]
double kl_diag_gaussians(const std::vector<double>& mu0,
                         const std::vector<double>& var0,
                         const std::vector<double>& mu1,
                         const std::vector<double>& var1,
                         double min_var = 1e-12) {
    size_t d = mu0.size();
    if (mu1.size() != d || var0.size() != d || var1.size() != d)
        throw std::invalid_argument("All vectors must have the same dimension");
    double D = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double v0 = std::max(var0[i], min_var);
        double v1 = std::max(var1[i], min_var);
        double diff = mu0[i] - mu1[i];
        D += std::log(v1 / v0) + (v0 + diff * diff) / v1 - 1.0;
    }
    return 0.5 * D;
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // 2D example
    std::vector<double> mu_p{0.0, 1.0};
    std::vector<double> var_p{1.0, 0.5};
    std::vector<double> mu_q{0.5, 0.5};
    std::vector<double> var_q{2.0, 0.25};

    double D_pq = kl_diag_gaussians(mu_p, var_p, mu_q, var_q);
    double D_qp = kl_diag_gaussians(mu_q, var_q, mu_p, var_p);

    std::cout << "KL(N_p || N_q) = " << D_pq << " nats\n";
    std::cout << "KL(N_q || N_p) = " << D_qp << " nats (asymmetry)\n";

    return 0;
}

This code implements the standard closed-form KL divergence between two diagonal-covariance multivariate Gaussians. It clamps variances to avoid division by zero and demonstrates asymmetry by computing both directions.

Time: O(d)
Space: O(1) extra
Monte Carlo estimation of forward and reverse KL (mixture vs Gaussian)
#include <iostream>
#include <random>
#include <cmath>
#include <algorithm>
#include <cstdint>
#include <limits>
#include <iomanip>

const double PI_CONST = 3.14159265358979323846;

// Log-density of univariate Normal N(mu, sigma^2)
double log_norm_pdf(double x, double mu, double sigma) {
    double z = (x - mu) / sigma;
    return -0.5 * std::log(2.0 * PI_CONST) - std::log(sigma) - 0.5 * z * z;
}

// Stable log-sum-exp for two values
inline double logsumexp2(double a, double b) {
    double m = std::max(a, b);
    return m + std::log(std::exp(a - m) + std::exp(b - m));
}

// Define P as a 2-component mixture: 0.5*N(-2, 0.5^2) + 0.5*N(2, 1^2)
double log_p(double x) {
    double l1 = std::log(0.5) + log_norm_pdf(x, -2.0, 0.5);
    double l2 = std::log(0.5) + log_norm_pdf(x, 2.0, 1.0);
    return logsumexp2(l1, l2);
}

// Define Q as a Normal N(m, s^2)
struct QNormal { double m; double s; };

double log_q(double x, const QNormal& q) {
    return log_norm_pdf(x, q.m, q.s);
}

// Seeded random number generator wrapper
struct RNG { std::mt19937_64 gen; RNG(uint64_t seed = 42) : gen(seed) {} };

// Sample from P (the mixture): pick a component, then draw from it
double sample_P(RNG& rng) {
    std::bernoulli_distribution B(0.5);
    bool pick_first = B(rng.gen);
    if (pick_first) {
        std::normal_distribution<double> N(-2.0, 0.5);
        return N(rng.gen);
    } else {
        std::normal_distribution<double> N(2.0, 1.0);
        return N(rng.gen);
    }
}

// Sample from Q
double sample_Q(RNG& rng, const QNormal& q) {
    std::normal_distribution<double> N(q.m, q.s);
    return N(rng.gen);
}

// Monte Carlo estimate of forward KL D_KL(P||Q): E_P[log p(x) - log q(x)]
double mc_forward_KL(size_t N, const QNormal& q, RNG& rng) {
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        double x = sample_P(rng);
        sum += (log_p(x) - log_q(x, q));
    }
    return sum / static_cast<double>(N);
}

// Monte Carlo estimate of reverse KL D_KL(Q||P): E_Q[log q(x) - log p(x)]
double mc_reverse_KL(size_t N, const QNormal& q, RNG& rng) {
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        double x = sample_Q(rng, q);
        sum += (log_q(x, q) - log_p(x));
    }
    return sum / static_cast<double>(N);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);
    RNG rng(123);

    // Baseline Q
    QNormal q{0.0, 1.0};
    size_t N = 200000;  // samples for MC

    double D_fwd = mc_forward_KL(N, q, rng);
    double D_rev = mc_reverse_KL(N, q, rng);

    std::cout << "Forward KL D_KL(P||Q) with Q=N(0,1): " << D_fwd << " nats\n";
    std::cout << "Reverse KL D_KL(Q||P) with Q=N(0,1): " << D_rev << " nats\n";

    // Simple 1D grid search over mean m to illustrate mode-covering vs mode-seeking
    double best_m_fwd = 0.0, best_val_fwd = std::numeric_limits<double>::infinity();
    double best_m_rev = 0.0, best_val_rev = std::numeric_limits<double>::infinity();

    for (int i = -30; i <= 30; ++i) {
        double m = i * 0.2;           // grid from -6 to 6
        QNormal qtest{m, 1.0};        // fix s=1 for simplicity
        RNG rng_local(1000 + i);      // different seeds to reduce correlation
        double val_fwd = mc_forward_KL(40000, qtest, rng_local);
        RNG rng_local2(2000 + i);
        double val_rev = mc_reverse_KL(40000, qtest, rng_local2);
        if (val_fwd < best_val_fwd) { best_val_fwd = val_fwd; best_m_fwd = m; }
        if (val_rev < best_val_rev) { best_val_rev = val_rev; best_m_rev = m; }
    }

    std::cout << "Minimizer of forward KL over m (s=1): m* = " << best_m_fwd
              << ", KL = " << best_val_fwd << "\n";
    std::cout << "Minimizer of reverse KL over m (s=1): m* = " << best_m_rev
              << ", KL = " << best_val_rev << "\n";

    std::cout << "(Expected: reverse KL tends to sit on one mode near -2 or 2; forward KL tends to sit between modes.)\n";
    return 0;
}

We estimate forward and reverse KL between a bimodal mixture P and a unimodal Gaussian Q via Monte Carlo. The code shows that reverse KL prefers placing Q at a single mode (mode-seeking), while forward KL prefers a compromise between modes (mode-covering). Numerical stability is maintained by evaluating log-densities in log-space using log-sum-exp for the mixture.

Time: O(N) per KL estimate (N samples); O(G * N) for the grid search over G means
Space: O(1) extra