
f-Divergences

Key Points

  • An f-divergence measures how different two probability distributions P and Q are by averaging a convex function f of the density ratio p(x)/q(x) under Q.
  • Intuitively, you compare how much more (or less) likely P says an outcome is than Q, then summarize these distortions with a carefully chosen f.
  • Many popular divergences are special cases: KL (f(t) = t log t), total variation (f(t) = ½|t − 1|), squared Hellinger (f(t) = (√t − 1)²), and Pearson χ² (f(t) = (t − 1)²).
  • Formally, if P is absolutely continuous with respect to Q, D_f(P‖Q) = E_Q[f(dP/dQ)] ≥ 0, with equality iff P = Q.
  • f-divergences satisfy the data processing inequality: applying the same transformation to both P and Q cannot increase the divergence.
  • Discrete distributions make computation simple: D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)).
  • In practice, you can estimate f-divergences from samples by Monte Carlo if you can evaluate the density ratio p/q or log-densities.
  • Be careful with support mismatches (q(x) = 0 while p(x) > 0), which make the divergence infinite, and with numerical stability when ratios are extreme.

Prerequisites

  • →Measure and probability spaces — Understanding absolute continuity and the Radon–Nikodym derivative requires basic measure-theoretic probability.
  • →Convex functions and conjugates — The generator f must be convex, and the variational form uses the Fenchel–Legendre conjugate.
  • →Expectation and change of measure — f-divergences are defined as expectations and often estimated by Monte Carlo or importance sampling.
  • →Probability mass/density functions — Computing ratios p/q and handling discrete vs continuous cases relies on familiarity with pmfs and pdfs.
  • →Numerical stability and log-domain arithmetic — Density ratios can overflow/underflow; stable computation via log-densities is essential in practice.

Detailed Explanation


01Overview

An f-divergence is a broad family of measures that quantify how different two probability distributions P and Q are. The idea is to examine, point by point, how much P up-weights or down-weights outcomes relative to Q through the density ratio r(x)=p(x)/q(x), then to summarize these distortions using a convex function f. Averaging f(r(x)) under Q yields a nonnegative number: zero when P and Q are the same, and larger as they differ more. This family unifies many familiar divergences from statistics and machine learning, including Kullback–Leibler (KL), total variation, Hellinger, and chi-square divergences. Because f-divergences are defined via expectations, they behave well under transformations: coarse-graining or deterministic mappings cannot increase them (data processing inequality).

02Intuition & Analogies

Imagine Q as your current belief about the world and P as an alternative hypothesis. For each possible outcome x, ask: “How many times more likely does P consider x than Q?” That multiplier is the ratio r(x)=p(x)/q(x). If r(x)=1, P and Q agree on x. If r(x)>1, P favors x more; if r(x)<1, Q favors x more. Now, to summarize disagreement, you don’t just average r(x) itself. Instead, you pass r(x) through a convex penalty function f that decides how severely to penalize up- or down-weighting. For KL, f(t)=t log t heavily penalizes large up-weighting (t≫1) and is gentle when t is tiny (because t log t→0 as t→0). For total variation, f(t)=|t−1|/2 symmetrically penalizes deviations from agreement. For Hellinger, f(t)=(√t−1)^2 compresses large ratios via the square root, leading to a bounded, symmetric measure. Averaging f(r(x)) using Q’s probabilities is like asking Q to “score” how distorted P looks. If P and Q are identical, r(x)=1 everywhere, f(1)=0, and the score is zero. If P puts positive mass where Q puts none, r(x) is infinite there and the score shoots to infinity—capturing an irreconcilable disagreement in support.
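To make this concrete, here is a small worked example (the coin biases are chosen purely for illustration):

```latex
P = \mathrm{Bernoulli}(0.6),\qquad Q = \mathrm{Bernoulli}(0.5),\qquad f(t) = t\log t\ \text{(KL)}
D_f(P\|Q) = 0.5\, f\!\left(\tfrac{0.6}{0.5}\right) + 0.5\, f\!\left(\tfrac{0.4}{0.5}\right)
          = 0.5\,(1.2\log 1.2) + 0.5\,(0.8\log 0.8) \approx 0.1094 - 0.0893 = 0.0201\ \text{nats}
```

The mild up-weighting of heads (ratio 1.2) and down-weighting of tails (ratio 0.8) nearly cancel, so the divergence is small but strictly positive.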

03Formal Definition

Let (Ω, F) be a measurable space, and let P and Q be probability measures such that P is absolutely continuous with respect to Q (denoted P ≪ Q). By the Radon–Nikodym theorem, there exists a measurable function r = dP/dQ (the density ratio) such that for all measurable A, P(A) = ∫_A r dQ. Given a convex function f: [0, ∞) → ℝ with f(1) = 0 (called the generator), the f-divergence from P to Q is defined as D_f(P‖Q) = ∫ f(dP/dQ) dQ = E_Q[f((dP/dQ)(X))]. In the discrete case with pmfs p(x), q(x) over a countable support X, and assuming q(x) > 0 whenever p(x) > 0, this becomes D_f(P‖Q) = Σ_{x∈X} q(x) f(p(x)/q(x)). If there exists x with q(x) = 0 but p(x) > 0, then D_f(P‖Q) = +∞. Typical choices of f recover standard divergences: KL with f(t) = t log t, squared Hellinger with f(t) = (√t − 1)², total variation with f(t) = ½|t − 1|, and Pearson χ² with f(t) = (t − 1)².

04When to Use

Use f-divergences when you need a principled, general way to compare probability distributions. In statistics and ML: (1) model fitting and evaluation—KL shows up in maximum likelihood and variational inference, while Hellinger can be preferable when you want bounded, symmetric sensitivity; (2) hypothesis testing and goodness-of-fit—χ² divergences connect to classical χ² tests; (3) robust learning—choosing different f's tunes sensitivity to outliers or tail mismatches; (4) information processing—when mapping data through features, the data processing inequality ensures divergences don't grow, useful in privacy and representation learning; (5) generative modeling—Jensen–Shannon (an f-divergence) is related to GAN training objectives; (6) distribution shift detection—estimating divergences from samples can flag changes between training and deployment distributions. Discrete settings (histograms, categories) often allow exact computation; continuous settings typically rely on density evaluation or Monte Carlo with density ratios.

⚠️Common Mistakes

  • Ignoring support mismatch: If Q assigns zero probability where P does not, D_f(P‖Q) is infinite. Always check that q(x) = 0 implies p(x) = 0 in discrete computations, or enforce absolute continuity in continuous settings.
  • Mixing up direction: D_f(P‖Q) is generally not symmetric; KL(P‖Q) ≠ KL(Q‖P). Make sure the direction matches your application.
  • Numerical instability: Directly computing ratios p/q can overflow or underflow when densities are very peaked. Prefer working with log-densities and stabilizing exponentials.
  • Wrong generator normalization: Ensure f is convex and satisfies f(1) = 0; otherwise the quantity may not be a proper divergence (e.g., it can be negative).
  • Using too few samples: Monte Carlo estimates of divergences with heavy-tailed ratios can have high variance, especially when estimating via change of measure (dividing by ratios). Use variance reduction, larger sample sizes, or bounded f (e.g., Hellinger, JS) when possible.
  • Confusing divergences with metrics: Most f-divergences do not satisfy the triangle inequality. If you need a true metric, use total variation (which is a metric) or the Hellinger distance (the square root of the squared Hellinger divergence).

Key Formulas

Definition of f-divergence

D_f(P‖Q) = ∫ f(dP/dQ) dQ = E_Q[f((dP/dQ)(X))]

Explanation: Average the convex generator f applied to the density ratio dP/dQ under Q. Requires absolute continuity P≪ Q so that dP/dQ exists.

Discrete f-divergence

D_f(P‖Q) = Σ_{x∈X} q(x) f(p(x)/q(x))

Explanation: For finite/countable supports, compute a weighted sum with weights from Q and penalties from the ratio p/q. If any q(x)=0 while p(x)>0, the value is +∞.

Common generators

f_KL(t) = t log t,  f_TV(t) = ½|t − 1|,  f_Hell(t) = (√t − 1)²,  f_χ²(t) = (t − 1)²

Explanation: Choosing different convex f yields well-known divergences. These tune sensitivity to large ratios or impose boundedness and symmetry.

Non-negativity and identity of indiscernibles

D_f(P‖Q) ≥ 0, and D_f(P‖Q) = 0 ⟺ P = Q

Explanation: Convexity and Jensen’s inequality imply non-negativity, and equality holds only when the density ratio equals 1 almost everywhere.
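Both facts follow from a one-line Jensen's inequality argument, sketched here:

```latex
D_f(P\|Q) = \mathbb{E}_Q\!\left[f\!\left(\frac{dP}{dQ}\right)\right]
\;\ge\; f\!\left(\mathbb{E}_Q\!\left[\frac{dP}{dQ}\right]\right)
\;=\; f(1) \;=\; 0,
\qquad \text{since } \mathbb{E}_Q\!\left[\frac{dP}{dQ}\right] = \int dP = 1.
```

When f is strictly convex at t = 1, equality forces dP/dQ = 1 Q-almost everywhere, i.e., P = Q.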

Data Processing Inequality

D_f(P‖Q) ≥ D_f(P∘T⁻¹ ‖ Q∘T⁻¹)

Explanation: Applying the same measurable transformation (e.g., compression or binning) cannot increase divergence. This reflects loss of information under processing.

Variational form

f*(u) = sup_{t≥0} { ut − f(t) },  D_f(P‖Q) = sup_g { E_P[g] − E_Q[f*(g)] }

Explanation: The convex conjugate f^* yields a dual optimization view. Useful for deriving estimators and training objectives (e.g., f-GANs).

Pinsker’s inequality

TV(P, Q) ≤ √(½ KL(P‖Q))

Explanation: Total variation distance is bounded by the square root of half the KL divergence. This provides a way to translate bounds between divergences.

Dual generator relation

D_{f†}(Q‖P) = D_f(P‖Q),  where f†(t) = t f(1/t)

Explanation: Swapping the order of arguments corresponds to using the dual generator f^†. This highlights the asymmetry of most f-divergences.

Monte Carlo estimator

If X₁, …, Xₙ ∼ Q i.i.d.,  D̂_f = (1/n) Σ_{i=1}^n f(p(X_i)/q(X_i))

Explanation: A simple unbiased estimator when you can sample from Q and evaluate the density ratio. Variance depends on how heavy-tailed the ratio is.

Change-of-measure identity

If X ∼ P,  D_f(P‖Q) = E_P[(q(X)/p(X)) f(p(X)/q(X))] = E_P[f(w)/w],  with w = p(X)/q(X)

Explanation: You can estimate expectations under Q by reweighting samples from P. This form can have high variance when ratios are small.

KL between univariate Gaussians

KL(N(μ₁, σ₁²) ‖ N(μ₂, σ₂²)) = log(σ₂/σ₁) + (σ₁² + (μ₁ − μ₂)²)/(2σ₂²) − ½

Explanation: A closed-form benchmark useful to validate Monte Carlo estimators and numerical implementations.

Squared Hellinger between univariate Gaussians

H²(N₁, N₂) = 1 − √(2σ₁σ₂/(σ₁² + σ₂²)) · exp(−(μ₁ − μ₂)²/(4(σ₁² + σ₂²)))

Explanation: Provides a bounded, symmetric divergence with a convenient closed form for Gaussians.

Complexity Analysis

For discrete distributions with finite support size k, computing D_f(P‖Q) = Σ_i q_i f(p_i/q_i) takes O(k) time and O(1) extra space beyond storing the inputs. The dominant cost is a constant-time evaluation of the generator f at each ratio. Numerical checks for support mismatches (q_i = 0 while p_i > 0) and safe handling of p_i = 0 add only O(1) work per entry. For continuous distributions estimated via Monte Carlo with n samples from Q, evaluating the estimator (1/n) Σ_{i=1}^n f(p(X_i)/q(X_i)) is O(n) time and O(1) auxiliary space, provided you can compute the ratio (or log-ratio) in O(1). The runtime constants depend on the cost of log-density evaluations and random number generation. If you only have samples from P and use the change-of-measure form E_P[f(w)/w] with w = p/q, you still spend O(n) time, but variance can be much higher when w is small, potentially requiring larger n to achieve a target mean-squared error. Numerically, working with log-densities avoids overflow/underflow: computing log w and then applying f via stable transformations (e.g., using exp with clipping) adds negligible overhead. In practice, the statistical sample complexity (number of samples needed) dominates computational complexity for accurate estimates, especially for heavy-tailed ratios or unbounded f such as KL's t log t. Memory footprints are small since you can compute running averages online without storing all samples.

Code Examples

Generic discrete f-divergence calculator with common generators
#include <iostream>
#include <vector>
#include <cmath>
#include <functional>
#include <limits>
#include <numeric>
#include <stdexcept>

// Utility: check that a probability vector sums to ~1
bool approx_one(const std::vector<double>& v, double tol = 1e-9) {
    double s = std::accumulate(v.begin(), v.end(), 0.0);
    return std::fabs(s - 1.0) <= tol;
}

// Common f-generators as functions on [0, +inf)
namespace fgen {
    // KL: f(t) = t * log(t), with f(1) = 0 and limit f(0) = 0
    double kl(double t) {
        if (t <= 0.0) return 0.0; // by limit t log t -> 0 as t -> 0+
        return t * std::log(t);
    }
    // Total variation: f(t) = 0.5 * |t - 1|
    double tv(double t) {
        return 0.5 * std::fabs(t - 1.0);
    }
    // Hellinger squared: f(t) = (sqrt(t) - 1)^2, with f(0) = 1
    double hellinger2(double t) {
        if (t <= 0.0) return 1.0; // limit as t -> 0+: (0 - 1)^2 = 1
        double s = std::sqrt(t);
        return (s - 1.0) * (s - 1.0);
    }
    // Pearson chi-square: f(t) = (t - 1)^2
    double chi2(double t) {
        double d = t - 1.0;
        return d * d;
    }
    // Jensen-Shannon (base e): f(t) = 0.5 * (t log t - (t+1) log((t+1)/2)),
    // which satisfies f(1) = 0 and yields JS(P||Q) <= log 2.
    double js(double t) {
        if (t <= 0.0) return 0.5 * std::log(2.0); // limit t -> 0+: 0.5 * log 2
        double a = t * std::log(t);
        double b = (t + 1.0) * std::log((t + 1.0) / 2.0);
        return 0.5 * (a - b);
    }
}

// Compute D_f(P||Q) for discrete distributions (finite support).
// P and Q must be the same size, entries >= 0, sum to 1 (approximately), and P << Q.
// If there exists i with Q[i] == 0 and P[i] > 0, returns +inf.
double f_divergence_discrete(
    const std::vector<double>& P,
    const std::vector<double>& Q,
    const std::function<double(double)>& f
) {
    if (P.size() != Q.size() || P.empty()) {
        throw std::invalid_argument("P and Q must be non-empty and of the same size");
    }
    if (!approx_one(P) || !approx_one(Q)) {
        std::cerr << "Warning: P or Q does not sum to 1 within tolerance.\n";
    }
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        double p = P[i], q = Q[i];
        if (q == 0.0) {
            if (p == 0.0) {
                continue; // contributes 0
            } else {
                return std::numeric_limits<double>::infinity(); // support mismatch
            }
        }
        double t = (p == 0.0) ? 0.0 : (p / q);
        D += q * f(t);
    }
    return D;
}

int main() {
    // Example: two categorical distributions over 4 outcomes
    std::vector<double> P = {0.10, 0.20, 0.30, 0.40};
    std::vector<double> Q = {0.25, 0.25, 0.25, 0.25}; // uniform

    std::cout.setf(std::ios::fixed); std::cout.precision(6);

    double D_KL   = f_divergence_discrete(P, Q, fgen::kl);
    double D_TV   = f_divergence_discrete(P, Q, fgen::tv);
    double D_H2   = f_divergence_discrete(P, Q, fgen::hellinger2);
    double D_CHI2 = f_divergence_discrete(P, Q, fgen::chi2);
    double D_JS   = f_divergence_discrete(P, Q, fgen::js);

    std::cout << "KL(P||Q)         = " << D_KL << "\n";
    std::cout << "TV(P,Q)          = " << D_TV << "\n";
    std::cout << "Hellinger^2(P,Q) = " << D_H2 << "\n";
    std::cout << "Pearson chi^2    = " << D_CHI2 << "\n";
    std::cout << "Jensen-Shannon   = " << D_JS << "\n";

    return 0;
}

This program defines a generic discrete f-divergence calculator and several common generators. It checks for support mismatches (q_i = 0 while p_i > 0) and handles limits like f(0) where needed. It then computes KL, total variation, Hellinger squared, Pearson chi-square, and Jensen–Shannon for two simple categorical distributions.

Time: O(k), where k is the support size. Space: O(1) additional space beyond the input vectors.
Monte Carlo estimation of f-divergences between Gaussians using log-densities
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal {
    double mu;
    double sigma; // > 0
};

double log_pdf_normal(const Normal& dist, double x) {
    const double var = dist.sigma * dist.sigma;
    return -0.5 * std::log(2.0 * M_PI * var) - 0.5 * (x - dist.mu) * (x - dist.mu) / var;
}

// Safe exponential to avoid overflow/underflow extremes
inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

// Monte Carlo estimator using samples from Q: D_f(P||Q) ~ (1/n) sum f(p(x)/q(x))
double estimate_f_divergence_mc(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> q_sampler(Q.mu, Q.sigma);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double x = q_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double t = safe_exp(logw);
        acc += f_from_t(t);
    }
    return acc / static_cast<double>(n);
}

// Closed-form KL between univariate Gaussians
double kl_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    return std::log(s2 / s1)
         + (s1 * s1 + (P.mu - Q.mu) * (P.mu - Q.mu)) / (2.0 * s2 * s2)
         - 0.5;
}

// Closed-form squared Hellinger between univariate Gaussians
double hellinger2_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    double num = 2.0 * s1 * s2;
    double den = s1 * s1 + s2 * s2;
    double expo = std::exp(-(P.mu - Q.mu) * (P.mu - Q.mu) / (4.0 * den));
    return 1.0 - std::sqrt(num / den) * expo;
}

int main() {
    std::mt19937 rng(123);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    size_t n = 200000; // number of MC samples

    // Estimate KL via Monte Carlo using f(t) = t log t
    double Dkl_mc = estimate_f_divergence_mc(
        n, P, Q, [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); }, rng);
    double Dkl_cf = kl_gaussians_closed(P, Q);

    // Estimate Hellinger^2 via Monte Carlo; f(t) = (sqrt(t) - 1)^2 is bounded on [0, 1]
    // for t <= 1 and grows only like t, which keeps the variance moderate.
    auto hell_from_t = [](double t) {
        double s = std::sqrt(t);
        double d = s - 1.0;
        return d * d;
    };
    double Dh2_mc = estimate_f_divergence_mc(n, P, Q, hell_from_t, rng);
    double Dh2_cf = hellinger2_gaussians_closed(P, Q);

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL MC estimate      : " << Dkl_mc << "\n";
    std::cout << "KL closed-form      : " << Dkl_cf << "\n";
    std::cout << "Hellinger^2 MC est. : " << Dh2_mc << "\n";
    std::cout << "Hellinger^2 closed  : " << Dh2_cf << "\n";

    return 0;
}

This program estimates D_f(P||Q) for Gaussian P and Q by sampling from Q and averaging f(p/q). It validates the Monte Carlo estimates against known closed forms for KL and squared Hellinger. It uses log-densities to compute the ratio stably and includes a safe exponential to reduce overflow/underflow risk.

Time: O(n) for n samples; each iteration does O(1) work. Space: O(1) auxiliary space.
Estimating D_f(P||Q) using change of measure with samples from P (importance reweighting)
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal { double mu, sigma; };

double log_pdf_normal(const Normal& d, double x) {
    double v = d.sigma * d.sigma;
    return -0.5 * std::log(2.0 * M_PI * v) - 0.5 * (x - d.mu) * (x - d.mu) / v;
}

inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

// Change-of-measure identity: E_Q[h] = E_P[h / w], where w = p/q.
// Here h = f(w), so D_f(P||Q) = E_P[f(w)/w].
double estimate_via_change_of_measure(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> p_sampler(P.mu, P.sigma);
    double acc = 0.0;
    size_t finite = 0;
    for (size_t i = 0; i < n; ++i) {
        double x = p_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double w = safe_exp(logw);
        if (w == 0.0 || !std::isfinite(w)) continue; // skip pathological contributions
        acc += f_from_t(w) / w; // note: skipping non-finite terms introduces slight bias
        ++finite;
    }
    return (finite == 0) ? std::numeric_limits<double>::quiet_NaN()
                         : acc / static_cast<double>(finite);
}

int main() {
    std::mt19937 rng(42);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    auto f_kl = [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); };

    size_t n = 200000;
    double D_est = estimate_via_change_of_measure(n, P, Q, f_kl, rng);

    // Closed-form KL for comparison
    auto kl_cf = [](const Normal& P, const Normal& Q) {
        return std::log(Q.sigma / P.sigma)
             + (P.sigma * P.sigma + (P.mu - Q.mu) * (P.mu - Q.mu)) / (2.0 * Q.sigma * Q.sigma)
             - 0.5;
    };

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL via change-of-measure (from P): " << D_est << "\n";
    std::cout << "KL closed-form                   : " << kl_cf(P, Q) << "\n";

    std::cout << "Note: This estimator can have high variance when w is small.\n";
    return 0;
}

When you cannot sample from Q but can sample from P, the change-of-measure identity D_f(P||Q)=E_P[f(w)/w] with w=p/q provides an alternative estimator. This can be statistically unstable if w is often small; the example demonstrates it on Gaussians and compares to the closed-form KL.

Time: O(n) for n samples. Space: O(1) auxiliary space.
#f-divergence #csiszar-divergence #kullback–leibler #hellinger-distance #total-variation #chi-square-divergence #jensen–shannon #data-processing-inequality #variational-representation #radon–nikodym-derivative #importance-sampling #monte-carlo-estimation #density-ratio #convex-conjugate