
f-Divergences

Key Points

  • An f-divergence measures how different two probability distributions P and Q are by averaging a convex function f of the density ratio p(x)/q(x) under Q.
  • Intuitively, you compare how much more (or less) likely P says an outcome is than Q, then summarize these distortions with a carefully chosen f.
  • Many popular divergences are special cases: KL (f(t) = t log t), total variation (f(t) = ½|t − 1|), squared Hellinger (f(t) = (√t − 1)²), and Pearson χ² (f(t) = (t − 1)²).
  • Formally, if P is absolutely continuous with respect to Q, D_f(P‖Q) = E_Q[f(dP/dQ)] ≥ 0, with equality iff P = Q.
  • f-divergences satisfy the data processing inequality: applying the same transformation to both P and Q cannot increase the divergence.
  • Discrete distributions make computation simple: D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)).
  • In practice, you can estimate f-divergences from samples by Monte Carlo if you can evaluate the density ratio p/q or log-densities.
  • Be careful with support mismatches (q(x) = 0 while p(x) > 0), which make the divergence infinite, and with numerical stability when ratios are extreme.

Prerequisites

  • →Measure and probability spaces — Understanding absolute continuity and the Radon–Nikodym derivative requires basic measure-theoretic probability.
  • →Convex functions and conjugates — The generator f must be convex, and the variational form uses the Fenchel–Legendre conjugate.
  • →Expectation and change of measure — f-divergences are defined as expectations and often estimated by Monte Carlo or importance sampling.
  • →Probability mass/density functions — Computing ratios p/q and handling discrete vs continuous cases relies on familiarity with pmfs and pdfs.
  • →Numerical stability and log-domain arithmetic — Density ratios can overflow/underflow; stable computation via log-densities is essential in practice.

Detailed Explanation


01Overview

An f-divergence is a broad family of measures that quantify how different two probability distributions P and Q are. The idea is to examine, point by point, how much P up-weights or down-weights outcomes relative to Q through the density ratio r(x)=p(x)/q(x), then to summarize these distortions using a convex function f. Averaging f(r(x)) under Q yields a nonnegative number: zero when P and Q are the same, and larger as they differ more. This family unifies many familiar divergences from statistics and machine learning, including Kullback–Leibler (KL), total variation, Hellinger, and chi-square divergences. Because f-divergences are defined via expectations, they behave well under transformations: coarse-graining or deterministic mappings cannot increase them (data processing inequality).

02Intuition & Analogies

Imagine Q as your current belief about the world and P as an alternative hypothesis. For each possible outcome x, ask: “How many times more likely does P consider x than Q?” That multiplier is the ratio r(x)=p(x)/q(x). If r(x)=1, P and Q agree on x. If r(x)>1, P favors x more; if r(x)<1, Q favors x more. Now, to summarize disagreement, you don’t just average r(x) itself. Instead, you pass r(x) through a convex penalty function f that decides how severely to penalize up- or down-weighting. For KL, f(t)=t log t heavily penalizes large up-weighting (t≫1) and is gentle when t is tiny (because t log t→0 as t→0). For total variation, f(t)=|t−1|/2 symmetrically penalizes deviations from agreement. For Hellinger, f(t)=(√t−1)^2 compresses large ratios via the square root, leading to a bounded, symmetric measure. Averaging f(r(x)) using Q’s probabilities is like asking Q to “score” how distorted P looks. If P and Q are identical, r(x)=1 everywhere, f(1)=0, and the score is zero. If P puts positive mass where Q puts none, r(x) is infinite there and the score shoots to infinity—capturing an irreconcilable disagreement in support.
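To make this concrete, here is a small worked example (the coin biases are chosen purely for illustration):

```latex
P = \mathrm{Bernoulli}(0.6),\qquad Q = \mathrm{Bernoulli}(0.5),\qquad f(t) = t\log t\ \text{(KL)}
D_f(P\|Q) = 0.5\, f\!\left(\tfrac{0.6}{0.5}\right) + 0.5\, f\!\left(\tfrac{0.4}{0.5}\right)
          = 0.5\,(1.2\log 1.2) + 0.5\,(0.8\log 0.8) \approx 0.1094 - 0.0893 = 0.0201\ \text{nats}
```

The mild up-weighting of heads (ratio 1.2) and down-weighting of tails (ratio 0.8) nearly cancel, so the divergence is small but strictly positive.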

03Formal Definition

Let (Ω, F) be a measurable space, and let P and Q be probability measures such that P is absolutely continuous with respect to Q (denoted P ≪ Q). By the Radon–Nikodym theorem, there exists a measurable function r = dP/dQ (the density ratio) such that for all measurable A, P(A) = ∫_A r dQ. Given a convex function f: [0, ∞) → ℝ with f(1) = 0 (called the generator), the f-divergence from P to Q is defined as D_f(P‖Q) = ∫ f(dP/dQ) dQ = E_Q[f((dP/dQ)(X))]. In the discrete case with pmfs p(x), q(x) over a countable support X, and assuming q(x) > 0 whenever p(x) > 0, this becomes D_f(P‖Q) = Σ_{x∈X} q(x) f(p(x)/q(x)). If there exists x with q(x) = 0 but p(x) > 0, then D_f(P‖Q) = +∞. Typical choices of f recover standard divergences: KL with f(t) = t log t, squared Hellinger with f(t) = (√t − 1)², total variation with f(t) = ½|t − 1|, and Pearson χ² with f(t) = (t − 1)².

04When to Use

Use f-divergences when you need a principled, general way to compare probability distributions. In statistics and ML: (1) model fitting and evaluation—KL shows up in maximum likelihood and variational inference, while Hellinger can be preferable when you want bounded, symmetric sensitivity; (2) hypothesis testing and goodness-of-fit—χ² divergences connect to classical χ² tests; (3) robust learning—choosing different f's tunes sensitivity to outliers or tail mismatches; (4) information processing—when mapping data through features, the data processing inequality ensures divergences don't grow, useful in privacy and representation learning; (5) generative modeling—Jensen–Shannon (an f-divergence) is related to GAN training objectives; (6) distribution shift detection—estimating divergences from samples can flag changes between training and deployment distributions. Discrete settings (histograms, categories) often allow exact computation; continuous settings typically rely on density evaluation or Monte Carlo with density ratios.

⚠️Common Mistakes

  • Ignoring support mismatch: If Q assigns zero probability where P does not, D_f(P‖Q) is infinite. Always check that q(x) = 0 implies p(x) = 0 in discrete computations, or enforce absolute continuity in continuous settings.
  • Mixing up direction: D_f(P‖Q) is generally not symmetric; KL(P‖Q) ≠ KL(Q‖P). Make sure the direction matches your application.
  • Numerical instability: Directly computing ratios p/q can overflow or underflow when densities are very peaked. Prefer working with log-densities and stabilizing exponentials.
  • Wrong generator normalization: Ensure f is convex and satisfies f(1) = 0; otherwise the quantity may not be a proper divergence (e.g., it can be negative).
  • Using too few samples: Monte Carlo estimates of divergences with heavy-tailed ratios can have high variance, especially when estimating via change of measure (dividing by ratios). Use variance reduction, larger sample sizes, or bounded f (e.g., Hellinger, JS) when possible.
  • Confusing divergences with metrics: Most f-divergences do not satisfy the triangle inequality. If you need a true metric, use total variation (which is a metric) or the Hellinger distance (the square root of the squared Hellinger divergence).

Key Formulas

Definition of f-divergence

D_f(P‖Q) = ∫ f(dP/dQ) dQ = E_Q[f((dP/dQ)(X))]

Explanation: Average the convex generator f applied to the density ratio dP/dQ under Q. Requires absolute continuity P≪ Q so that dP/dQ exists.

Discrete f-divergence

D_f(P‖Q) = Σ_{x∈X} q(x) f(p(x)/q(x))

Explanation: For finite/countable supports, compute a weighted sum with weights from Q and penalties from the ratio p/q. If any q(x)=0 while p(x)>0, the value is +∞.

Common generators

f_KL(t) = t log t,  f_TV(t) = ½|t − 1|,  f_Hell(t) = (√t − 1)²,  f_χ²(t) = (t − 1)²

Explanation: Choosing different convex f yields well-known divergences. These tune sensitivity to large ratios or impose boundedness and symmetry.

Non-negativity and identity of indiscernibles

D_f(P‖Q) ≥ 0, and D_f(P‖Q) = 0 ⟺ P = Q

Explanation: Convexity and Jensen’s inequality imply non-negativity, and equality holds only when the density ratio equals 1 almost everywhere.
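Both facts follow from a one-line Jensen's inequality argument, sketched here:

```latex
D_f(P\|Q) = \mathbb{E}_Q\!\left[f\!\left(\frac{dP}{dQ}\right)\right]
\;\ge\; f\!\left(\mathbb{E}_Q\!\left[\frac{dP}{dQ}\right]\right)
\;=\; f(1) \;=\; 0,
\qquad \text{since } \mathbb{E}_Q\!\left[\frac{dP}{dQ}\right] = \int dP = 1.
```

When f is strictly convex at t = 1, equality forces dP/dQ = 1 Q-almost everywhere, i.e., P = Q.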

Data Processing Inequality

D_f(P‖Q) ≥ D_f(P∘T⁻¹ ‖ Q∘T⁻¹)

Explanation: Applying the same measurable transformation (e.g., compression or binning) cannot increase divergence. This reflects loss of information under processing.

Variational form

f*(u) = sup_{t≥0} { ut − f(t) },  D_f(P‖Q) = sup_g { E_P[g] − E_Q[f*(g)] }

Explanation: The convex conjugate f^* yields a dual optimization view. Useful for deriving estimators and training objectives (e.g., f-GANs).

Pinsker’s inequality

TV(P, Q) ≤ √(½ KL(P‖Q))

Explanation: Total variation distance is bounded by the square root of half the KL divergence. This provides a way to translate bounds between divergences.

Dual generator relation

D_{f†}(Q‖P) = D_f(P‖Q),  where f†(t) = t f(1/t)

Explanation: Swapping the order of arguments corresponds to using the dual generator f^†. This highlights the asymmetry of most f-divergences.

Monte Carlo estimator

If X₁, …, Xₙ ∼ Q i.i.d.,  D̂_f = (1/n) Σ_{i=1}^n f(p(X_i)/q(X_i))

Explanation: A simple unbiased estimator when you can sample from Q and evaluate the density ratio. Variance depends on how heavy-tailed the ratio is.

Change-of-measure identity

If X ∼ P,  D_f(P‖Q) = E_P[(q(X)/p(X)) f(p(X)/q(X))] = E_P[f(w)/w],  with w = p(X)/q(X)

Explanation: You can estimate expectations under Q by reweighting samples from P. This form can have high variance when ratios are small.

KL between univariate Gaussians

KL(N(μ₁, σ₁²) ‖ N(μ₂, σ₂²)) = log(σ₂/σ₁) + (σ₁² + (μ₁ − μ₂)²)/(2σ₂²) − ½

Explanation: A closed-form benchmark useful to validate Monte Carlo estimators and numerical implementations.

Squared Hellinger between univariate Gaussians

H²(N₁, N₂) = 1 − √(2σ₁σ₂/(σ₁² + σ₂²)) · exp(−(μ₁ − μ₂)²/(4(σ₁² + σ₂²)))

Explanation: Provides a bounded, symmetric divergence with a convenient closed form for Gaussians.

Complexity Analysis

For discrete distributions with finite support size k, computing D_f(P‖Q) = Σ_i q_i f(p_i/q_i) takes O(k) time and O(1) extra space beyond storing the inputs. The dominant cost is a constant-time evaluation of the generator f at each ratio. Numerical checks for support mismatches (q_i = 0 while p_i > 0) and safe handling of p_i = 0 add only O(1) work per entry. For continuous distributions estimated via Monte Carlo with n samples from Q, evaluating the estimator (1/n) Σ_{i=1}^n f(p(X_i)/q(X_i)) is O(n) time and O(1) auxiliary space, provided you can compute the ratio (or log-ratio) in O(1). The runtime constants depend on the cost of log-density evaluations and random number generation. If you only have samples from P and use the change-of-measure form E_P[f(w)/w] with w = p/q, you still spend O(n) time, but variance can be much higher when w is small, potentially requiring larger n to achieve a target mean-squared error. Numerically, working with log-densities avoids overflow/underflow: computing log w and then applying f via stable transformations (e.g., using exp with clipping) adds negligible overhead. In practice, the statistical sample complexity (number of samples needed) dominates computational complexity for accurate estimates, especially for heavy-tailed ratios or unbounded f such as KL's t log t. Memory footprints are small since you can compute running averages online without storing all samples.

Code Examples

Generic discrete f-divergence calculator with common generators
#include <iostream>
#include <vector>
#include <cmath>
#include <functional>
#include <limits>
#include <numeric>
#include <stdexcept>

// Utility: check that a probability vector sums to ~1
bool approx_one(const std::vector<double>& v, double tol = 1e-9) {
    double s = std::accumulate(v.begin(), v.end(), 0.0);
    return std::fabs(s - 1.0) <= tol;
}

// Common f-generators as functions on [0, +inf)
namespace fgen {
    // KL: f(t) = t * log(t), with f(1) = 0 and limit f(0) = 0
    double kl(double t) {
        if (t <= 0.0) return 0.0; // by limit t log t -> 0 as t -> 0+
        return t * std::log(t);
    }
    // Total variation: f(t) = 0.5 * |t - 1|
    double tv(double t) {
        return 0.5 * std::fabs(t - 1.0);
    }
    // Hellinger squared: f(t) = (sqrt(t) - 1)^2, with f(0) = 1
    double hellinger2(double t) {
        if (t <= 0.0) return 1.0; // limit as t -> 0+: (0 - 1)^2 = 1
        double s = std::sqrt(t);
        return (s - 1.0) * (s - 1.0);
    }
    // Pearson chi-square: f(t) = (t - 1)^2
    double chi2(double t) {
        double d = t - 1.0;
        return d * d;
    }
    // Jensen-Shannon (base e): f(t) = 0.5 * (t log t - (t+1) log((t+1)/2)),
    // which satisfies f(1) = 0 and yields JS(P||Q) <= log 2.
    double js(double t) {
        if (t <= 0.0) return 0.5 * std::log(2.0); // limit t -> 0+: 0.5 * log 2
        double a = t * std::log(t);
        double b = (t + 1.0) * std::log((t + 1.0) / 2.0);
        return 0.5 * (a - b);
    }
}

// Compute D_f(P||Q) for discrete distributions (finite support).
// P and Q must be the same size, entries >= 0, sum to 1 (approximately), and P << Q.
// If there exists i with Q[i] == 0 and P[i] > 0, returns +inf.
double f_divergence_discrete(
    const std::vector<double>& P,
    const std::vector<double>& Q,
    const std::function<double(double)>& f
) {
    if (P.size() != Q.size() || P.empty()) {
        throw std::invalid_argument("P and Q must be non-empty and of the same size");
    }
    if (!approx_one(P) || !approx_one(Q)) {
        std::cerr << "Warning: P or Q does not sum to 1 within tolerance.\n";
    }
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        double p = P[i], q = Q[i];
        if (q == 0.0) {
            if (p == 0.0) {
                continue; // contributes 0
            } else {
                return std::numeric_limits<double>::infinity(); // support mismatch
            }
        }
        double t = (p == 0.0) ? 0.0 : (p / q);
        D += q * f(t);
    }
    return D;
}

int main() {
    // Example: two categorical distributions over 4 outcomes
    std::vector<double> P = {0.10, 0.20, 0.30, 0.40};
    std::vector<double> Q = {0.25, 0.25, 0.25, 0.25}; // uniform

    std::cout.setf(std::ios::fixed); std::cout.precision(6);

    double D_KL   = f_divergence_discrete(P, Q, fgen::kl);
    double D_TV   = f_divergence_discrete(P, Q, fgen::tv);
    double D_H2   = f_divergence_discrete(P, Q, fgen::hellinger2);
    double D_CHI2 = f_divergence_discrete(P, Q, fgen::chi2);
    double D_JS   = f_divergence_discrete(P, Q, fgen::js);

    std::cout << "KL(P||Q)         = " << D_KL << "\n";
    std::cout << "TV(P,Q)          = " << D_TV << "\n";
    std::cout << "Hellinger^2(P,Q) = " << D_H2 << "\n";
    std::cout << "Pearson chi^2    = " << D_CHI2 << "\n";
    std::cout << "Jensen-Shannon   = " << D_JS << "\n";

    return 0;
}

This program defines a generic discrete f-divergence calculator and several common generators. It checks for support mismatches (q_i = 0 while p_i > 0) and handles limits like f(0) where needed. It then computes KL, total variation, Hellinger squared, Pearson chi-square, and Jensen–Shannon for two simple categorical distributions.

Time: O(k), where k is the support size. Space: O(1) additional space beyond the input vectors.
Monte Carlo estimation of f-divergences between Gaussians using log-densities
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal {
    double mu;
    double sigma; // > 0
};

double log_pdf_normal(const Normal& dist, double x) {
    const double var = dist.sigma * dist.sigma;
    return -0.5 * std::log(2.0 * M_PI * var) - 0.5 * (x - dist.mu) * (x - dist.mu) / var;
}

// Safe exponential to avoid overflow/underflow extremes
inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

// Monte Carlo estimator using samples from Q: D_f(P||Q) ~ (1/n) sum f(p(x)/q(x))
double estimate_f_divergence_mc(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> q_sampler(Q.mu, Q.sigma);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double x = q_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double t = safe_exp(logw);
        acc += f_from_t(t);
    }
    return acc / static_cast<double>(n);
}

// Closed-form KL between univariate Gaussians
double kl_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    return std::log(s2 / s1)
         + (s1 * s1 + (P.mu - Q.mu) * (P.mu - Q.mu)) / (2.0 * s2 * s2)
         - 0.5;
}

// Closed-form squared Hellinger between univariate Gaussians
double hellinger2_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    double num = 2.0 * s1 * s2;
    double den = s1 * s1 + s2 * s2;
    double expo = std::exp(-(P.mu - Q.mu) * (P.mu - Q.mu) / (4.0 * den));
    return 1.0 - std::sqrt(num / den) * expo;
}

int main() {
    std::mt19937 rng(123);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    size_t n = 200000; // number of MC samples

    // Estimate KL via Monte Carlo using f(t) = t log t
    double Dkl_mc = estimate_f_divergence_mc(
        n, P, Q, [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); }, rng);
    double Dkl_cf = kl_gaussians_closed(P, Q);

    // Estimate Hellinger^2 via Monte Carlo; f(t) = (sqrt(t) - 1)^2 is bounded on [0, 1]
    // for t <= 1 and grows only like t, which keeps the variance moderate.
    auto hell_from_t = [](double t) {
        double s = std::sqrt(t);
        double d = s - 1.0;
        return d * d;
    };
    double Dh2_mc = estimate_f_divergence_mc(n, P, Q, hell_from_t, rng);
    double Dh2_cf = hellinger2_gaussians_closed(P, Q);

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL MC estimate      : " << Dkl_mc << "\n";
    std::cout << "KL closed-form      : " << Dkl_cf << "\n";
    std::cout << "Hellinger^2 MC est. : " << Dh2_mc << "\n";
    std::cout << "Hellinger^2 closed  : " << Dh2_cf << "\n";

    return 0;
}

This program estimates D_f(P||Q) for Gaussian P and Q by sampling from Q and averaging f(p/q). It validates the Monte Carlo estimates against known closed forms for KL and squared Hellinger. It uses log-densities to compute the ratio stably and includes a safe exponential to reduce overflow/underflow risk.

Time: O(n) for n samples; each iteration does O(1) work. Space: O(1) auxiliary space.
Estimating D_f(P||Q) using change of measure with samples from P (importance reweighting)
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal { double mu, sigma; };

double log_pdf_normal(const Normal& d, double x) {
    double v = d.sigma * d.sigma;
    return -0.5 * std::log(2.0 * M_PI * v) - 0.5 * (x - d.mu) * (x - d.mu) / v;
}

inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

// Change-of-measure identity: E_Q[h] = E_P[h / w], where w = p/q.
// Here h = f(w), so D_f(P||Q) = E_P[f(w)/w].
double estimate_via_change_of_measure(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> p_sampler(P.mu, P.sigma);
    double acc = 0.0;
    size_t finite = 0;
    for (size_t i = 0; i < n; ++i) {
        double x = p_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double w = safe_exp(logw);
        if (w == 0.0 || !std::isfinite(w)) continue; // skip pathological contributions
        acc += f_from_t(w) / w; // note: skipping non-finite terms introduces slight bias
        ++finite;
    }
    return (finite == 0) ? std::numeric_limits<double>::quiet_NaN()
                         : acc / static_cast<double>(finite);
}

int main() {
    std::mt19937 rng(42);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    auto f_kl = [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); };

    size_t n = 200000;
    double D_est = estimate_via_change_of_measure(n, P, Q, f_kl, rng);

    // Closed-form KL for comparison
    auto kl_cf = [](const Normal& P, const Normal& Q) {
        return std::log(Q.sigma / P.sigma)
             + (P.sigma * P.sigma + (P.mu - Q.mu) * (P.mu - Q.mu)) / (2.0 * Q.sigma * Q.sigma)
             - 0.5;
    };

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL via change-of-measure (from P): " << D_est << "\n";
    std::cout << "KL closed-form                   : " << kl_cf(P, Q) << "\n";

    std::cout << "Note: This estimator can have high variance when w is small.\n";
    return 0;
}

When you cannot sample from Q but can sample from P, the change-of-measure identity D_f(P||Q)=E_P[f(w)/w] with w=p/q provides an alternative estimator. This can be statistically unstable if w is often small; the example demonstrates it on Gaussians and compares to the closed-form KL.

Time: O(n) for n samples. Space: O(1) auxiliary space.
#f-divergence #csiszar-divergence #kullback–leibler #hellinger-distance #total-variation #chi-square-divergence #jensen–shannon #data-processing-inequality #variational-representation #radon–nikodym-derivative #importance-sampling #monte-carlo-estimation #density-ratio #convex-conjugate