f-Divergences
Key Points
- An f-divergence measures how different two probability distributions P and Q are by averaging a convex function f of the density ratio p(x)/q(x) under Q.
- Intuitively, you compare how much more (or less) likely P says an outcome is than Q, then summarize these distortions with a carefully chosen f.
- Many popular divergences are special cases: KL (f(t)=t log t), total variation (f(t)=|t-1|/2), squared Hellinger (f(t)=(√t-1)^2), and Pearson \chi^2 (f(t)=(t-1)^2).
- Formally, if P is absolutely continuous with respect to Q, D_f(P\Vert Q)=\mathbb{E}_Q\big[f\big(\tfrac{dP}{dQ}\big)\big] \ge 0, and equals 0 iff P=Q (for f strictly convex at 1).
- f-divergences satisfy the data processing inequality: applying the same transformation to both P and Q cannot increase the divergence.
- Discrete distributions make computation simple: D_f(P\Vert Q)=\sum_x q(x)\, f\big(\tfrac{p(x)}{q(x)}\big).
- In practice, you can estimate f-divergences from samples by Monte Carlo if you can evaluate the density ratio p/q or log-densities.
- Be careful with support mismatches (q(x)=0 while p(x)>0), which make the divergence infinite, and with numerical stability when ratios are extreme.
Prerequisites
- Measure and probability spaces – Understanding absolute continuity and the Radon–Nikodym derivative requires basic measure-theoretic probability.
- Convex functions and conjugates – The generator f must be convex, and the variational form uses the Fenchel–Legendre conjugate.
- Expectation and change of measure – f-divergences are defined as expectations and often estimated by Monte Carlo or importance sampling.
- Probability mass/density functions – Computing ratios p/q and handling discrete vs continuous cases relies on familiarity with pmfs and pdfs.
- Numerical stability and log-domain arithmetic – Density ratios can overflow/underflow; stable computation via log-densities is essential in practice.
Detailed Explanation
Overview
An f-divergence is a broad family of measures that quantify how different two probability distributions P and Q are. The idea is to examine, point by point, how much P up-weights or down-weights outcomes relative to Q through the density ratio r(x)=p(x)/q(x), then to summarize these distortions using a convex function f. Averaging f(r(x)) under Q yields a nonnegative number: zero when P and Q are the same, and larger as they differ more. This family unifies many familiar divergences from statistics and machine learning, including Kullback–Leibler (KL), total variation, Hellinger, and chi-square divergences. Because f-divergences are defined via expectations, they behave well under transformations: coarse-graining or deterministic mappings cannot increase them (data processing inequality).
Intuition & Analogies
Imagine Q as your current belief about the world and P as an alternative hypothesis. For each possible outcome x, ask: "How many times more likely does P consider x than Q?" That multiplier is the ratio r(x)=p(x)/q(x). If r(x)=1, P and Q agree on x. If r(x)>1, P favors x more; if r(x)<1, Q favors x more. Now, to summarize disagreement, you don't just average r(x) itself. Instead, you pass r(x) through a convex penalty function f that decides how severely to penalize up- or down-weighting. For KL, f(t)=t log t heavily penalizes large up-weighting (t ≫ 1) and is gentle when t is tiny (because t log t → 0 as t → 0+). For total variation, f(t)=|t-1|/2 symmetrically penalizes deviations from agreement. For Hellinger, f(t)=(√t-1)^2 compresses large ratios via the square root, leading to a bounded, symmetric measure. Averaging f(r(x)) using Q's probabilities is like asking Q to "score" how distorted P looks. If P and Q are identical, r(x)=1 everywhere, f(1)=0, and the score is zero. If P puts positive mass where Q puts none, r(x) is infinite there and the score shoots to infinity, capturing an irreconcilable disagreement in support.
Formal Definition
Let P and Q be probability distributions with P absolutely continuous with respect to Q (written P \ll Q), and let f:(0,\infty)\to\mathbb{R} be convex with f(1)=0. The f-divergence of P from Q is D_f(P\Vert Q) = \mathbb{E}_Q\big[f\big(\tfrac{dP}{dQ}\big)\big], where dP/dQ is the Radon–Nikodym derivative (the density ratio p/q when densities exist). At the boundary one uses the conventions 0\,f(0/0)=0 and, for p(x)>0 with q(x)=0, the contribution p(x)\lim_{t\to\infty} f(t)/t, which makes the divergence infinite for generators such as t\log t.
When to Use
Use f-divergences when you need a principled, general way to compare probability distributions. In statistics and ML: (1) model fitting and evaluation – KL shows up in maximum likelihood and variational inference; Hellinger can be preferable when you want bounded, symmetric sensitivity; (2) hypothesis testing and goodness-of-fit – \chi^2 divergences connect to classical \chi^2 tests; (3) robust learning – choosing different f's tunes sensitivity to outliers or tail mismatches; (4) information processing – when mapping data through features, the data processing inequality ensures divergences don't grow, useful in privacy and representation learning; (5) generative modeling – Jensen–Shannon (an f-divergence) is related to GAN training objectives; (6) distribution shift detection – estimating divergences from samples can flag changes between training and deployment distributions. Discrete settings (histograms, categories) often allow exact computation; continuous settings typically rely on density evaluation or Monte Carlo with density ratios.
⚠️ Common Mistakes
- Ignoring support mismatch: If Q assigns zero probability where P does not, D_f(P\Vert Q) is infinite. Always check q(x)=0 implies p(x)=0 in discrete computations, or enforce absolute continuity in continuous settings.
- Mixing up direction: D_f(P\Vert Q) is generally not symmetric. KL(P\Vert Q) ≠ KL(Q\Vert P). Make sure the direction matches your application.
- Numerical instability: Directly computing ratios p/q can overflow or underflow when densities are very peaked. Prefer working with log-densities and stabilizing exponentials.
- Wrong generator normalization: Ensure f is convex and satisfies f(1)=0; otherwise, the quantity may not be a proper divergence (e.g., can be negative).
- Using too few samples: Monte Carlo estimates of divergences with heavy-tailed ratios can have high variance, especially when estimating via change of measure (dividing by ratios). Use variance reduction, larger sample sizes, or bounded f (e.g., Hellinger, JS) when possible.
- Confusing metrics: Most f-divergences do not satisfy the triangle inequality. If you need a true metric, use total variation (itself a metric) or the Hellinger distance (the square root of the squared Hellinger divergence).
Key Formulas
Definition of f-divergence
D_f(P\Vert Q) = \mathbb{E}_Q\big[f\big(\tfrac{dP}{dQ}\big)\big], where f:(0,\infty)\to\mathbb{R} is convex with f(1)=0.
Explanation: Average the convex generator f applied to the density ratio dP/dQ under Q. Requires absolute continuity P \ll Q so that dP/dQ exists.
Discrete f-divergence
D_f(P\Vert Q) = \sum_x q(x)\, f\big(\tfrac{p(x)}{q(x)}\big)
Explanation: For finite/countable supports, compute a weighted sum with weights from Q and penalties from the ratio p/q. If any q(x)=0 while p(x)>0, the value is +\infty.
Common generators
KL: f(t)=t\log t; total variation: f(t)=\tfrac{1}{2}|t-1|; squared Hellinger: f(t)=(\sqrt{t}-1)^2; Pearson \chi^2: f(t)=(t-1)^2; Jensen–Shannon: f(t)=t\log t-(t+1)\log\tfrac{t+1}{2}+\log 2.
Explanation: Choosing different convex f yields well-known divergences. These tune sensitivity to large ratios or impose boundedness and symmetry.
Non-negativity and identity of indiscernibles
D_f(P\Vert Q) \ge 0, with equality iff P = Q (for f strictly convex at t=1).
Explanation: Convexity and Jensen's inequality imply non-negativity, and equality holds only when the density ratio equals 1 almost everywhere.
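Written out, the argument is a single application of Jensen's inequality under Q, using that \mathbb{E}_Q[dP/dQ] = \int dP = 1 and f(1)=0:

```latex
D_f(P \Vert Q)
  = \mathbb{E}_Q\!\left[ f\!\left( \frac{dP}{dQ} \right) \right]
  \;\ge\; f\!\left( \mathbb{E}_Q\!\left[ \frac{dP}{dQ} \right] \right)
  = f(1)
  = 0.
```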
Data Processing Inequality
D_f(P_Y \Vert Q_Y) \le D_f(P_X \Vert Q_X), where Y is produced from X by the same (possibly stochastic) transformation under both P and Q.
Explanation: Applying the same measurable transformation (e.g., compression or binning) cannot increase divergence. This reflects loss of information under processing.
Variational form
D_f(P\Vert Q) = \sup_T \big\{ \mathbb{E}_P[T(X)] - \mathbb{E}_Q[f^*(T(X))] \big\}, where f^*(s) = \sup_t \{ s t - f(t) \}.
Explanation: The convex conjugate f^* yields a dual optimization view. Useful for deriving estimators and training objectives (e.g., f-GANs).
Pinsker's inequality
\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \Vert Q)}
Explanation: Total variation distance is bounded by the square root of half the KL divergence. This provides a way to translate bounds between divergences.
Dual generator relation
D_f(Q \Vert P) = D_{\hat f}(P \Vert Q), where \hat f(t) = t\, f(1/t).
Explanation: Swapping the order of arguments corresponds to using the dual generator \hat f. This highlights the asymmetry of most f-divergences.
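Concretely, writing \hat f(t) = t\,f(1/t) for the dual generator (distinct from the convex conjugate f^* used in the variational form):

```latex
D_f(Q \Vert P) = D_{\hat f}(P \Vert Q),
\qquad \hat f(t) = t\, f\!\left(\tfrac{1}{t}\right).
% Example: the KL generator f(t) = t \log t gives
% \hat f(t) = t \cdot \tfrac{1}{t} \log \tfrac{1}{t} = -\log t,
% the generator of reverse KL.
```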
Monte Carlo estimator
\hat D_f = \tfrac{1}{n} \sum_{i=1}^{n} f\big(\tfrac{p(X_i)}{q(X_i)}\big), \quad X_i \sim Q \text{ i.i.d.}
Explanation: A simple unbiased estimator when you can sample from Q and evaluate the density ratio. Variance depends on how heavy-tailed the ratio is.
Change-of-measure identity
\mathbb{E}_Q[h(X)] = \mathbb{E}_P\big[\tfrac{h(X)}{w(X)}\big] with w = p/q, hence D_f(P\Vert Q) = \mathbb{E}_P\big[\tfrac{f(w)}{w}\big].
Explanation: You can estimate expectations under Q by reweighting samples from P. This form can have high variance when ratios are small.
KL between univariate Gaussians
\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\Vert\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\tfrac{\sigma_2}{\sigma_1} + \tfrac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \tfrac{1}{2}
Explanation: A closed-form benchmark useful to validate Monte Carlo estimators and numerical implementations.
Squared Hellinger between univariate Gaussians
H^2(P, Q) = 1 - \sqrt{\tfrac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}\, \exp\big(-\tfrac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)}\big)
Explanation: Provides a bounded, symmetric divergence with a convenient closed form for Gaussians.
Complexity Analysis
For discrete distributions with support size k, computing D_f is O(k) time and O(1) extra memory. Monte Carlo estimation with n samples costs O(n) density (or log-density) evaluations; the estimator's standard error decays at the usual O(1/\sqrt{n}) rate, with a constant governed by the variance of f(p/q) under the sampling distribution.
Code Examples
#include <iostream>
#include <vector>
#include <cmath>
#include <functional>
#include <limits>
#include <numeric>
#include <stdexcept>

// Utility: check sums to ~1
bool approx_one(const std::vector<double>& v, double tol = 1e-9) {
    double s = std::accumulate(v.begin(), v.end(), 0.0);
    return std::fabs(s - 1.0) <= tol;
}

// Common f-generators as functions on [0, +inf)
namespace fgen {
    // KL: f(t) = t * log(t), with f(1)=0, limit f(0)=0
    double kl(double t) {
        if (t <= 0.0) return 0.0; // by limit t log t -> 0 as t -> 0+
        return t * std::log(t);
    }
    // Total variation: f(t) = 0.5 * |t - 1|
    double tv(double t) {
        return 0.5 * std::fabs(t - 1.0);
    }
    // Hellinger squared: f(t) = (sqrt(t) - 1)^2, with f(0)=1
    double hellinger2(double t) {
        if (t <= 0.0) return 1.0; // limit as t->0+: (0 - 1)^2 = 1
        double s = std::sqrt(t);
        return (s - 1.0) * (s - 1.0);
    }
    // Pearson chi-square: f(t) = (t - 1)^2
    double chi2(double t) {
        double d = t - 1.0;
        return d * d;
    }
    // Jensen-Shannon (base e): f(t) = t log t - (t+1) log((t+1)/2) + log 2
    double js(double t) {
        if (t <= 0.0) return std::log(2.0); // limit t->0+: 0 - (1)log(1/2) + log 2 = log 2
        double a = t * std::log(t);
        double b = (t + 1.0) * std::log((t + 1.0) / 2.0);
        return a - b + std::log(2.0);
    }
}

// Compute D_f(P||Q) for discrete distributions (finite support).
// P and Q must be the same size, entries >= 0, sum to 1 (approximately), and P << Q.
// If there exists i with Q[i] == 0 and P[i] > 0, returns +inf.
double f_divergence_discrete(
    const std::vector<double>& P,
    const std::vector<double>& Q,
    const std::function<double(double)>& f
) {
    if (P.size() != Q.size() || P.empty()) {
        throw std::invalid_argument("P and Q must be non-empty and of the same size");
    }
    if (!approx_one(P) || !approx_one(Q)) {
        std::cerr << "Warning: P or Q does not sum to 1 within tolerance.\n";
    }
    double D = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        double p = P[i], q = Q[i];
        if (q == 0.0) {
            if (p == 0.0) {
                continue; // contributes 0
            } else {
                return std::numeric_limits<double>::infinity(); // support mismatch
            }
        }
        double t = (p == 0.0) ? 0.0 : (p / q);
        D += q * f(t);
    }
    return D;
}

int main() {
    // Example: two categorical distributions over 4 outcomes
    std::vector<double> P = {0.10, 0.20, 0.30, 0.40};
    std::vector<double> Q = {0.25, 0.25, 0.25, 0.25}; // uniform

    std::cout.setf(std::ios::fixed); std::cout.precision(6);

    double D_KL   = f_divergence_discrete(P, Q, fgen::kl);
    double D_TV   = f_divergence_discrete(P, Q, fgen::tv);
    double D_H2   = f_divergence_discrete(P, Q, fgen::hellinger2);
    double D_CHI2 = f_divergence_discrete(P, Q, fgen::chi2);
    double D_JS   = f_divergence_discrete(P, Q, fgen::js);

    std::cout << "KL(P||Q)          = " << D_KL << "\n";
    std::cout << "TV(P,Q)           = " << D_TV << "\n";
    std::cout << "Hellinger^2(P,Q)  = " << D_H2 << "\n";
    std::cout << "Pearson chi^2     = " << D_CHI2 << "\n";
    std::cout << "Jensen-Shannon    = " << D_JS << "\n";

    return 0;
}
This program defines a generic discrete f-divergence calculator and several common generators. It checks for support mismatches (q_i = 0 while p_i > 0) and handles limits such as f(0) where needed. It then computes KL, total variation, squared Hellinger, Pearson chi-square, and Jensen–Shannon for two simple categorical distributions.
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal {
    double mu;
    double sigma; // > 0
};

double log_pdf_normal(const Normal& dist, double x) {
    const double var = dist.sigma * dist.sigma;
    return -0.5 * std::log(2.0 * M_PI * var) - 0.5 * (x - dist.mu) * (x - dist.mu) / var;
}

// Safe exponential to avoid overflow/underflow extremes
inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

namespace fgen {
    double kl(double t) { return (t <= 0.0) ? 0.0 : t * std::log(t); }
    double hellinger2_from_logw(double logw) {
        // f(t) = (sqrt(t) - 1)^2 with t = exp(logw) => sqrt(t) = exp(0.5*logw)
        double s = safe_exp(0.5 * logw);
        double d = s - 1.0;
        return d * d;
    }
}

// Monte Carlo estimator using samples from Q: D_f(P||Q) ~ 1/n sum f(p(x)/q(x))
double estimate_f_divergence_mc(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> q_sampler(Q.mu, Q.sigma);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double x = q_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double t = safe_exp(logw);
        acc += f_from_t(t);
    }
    return acc / static_cast<double>(n);
}

// Closed-form KL between univariate Gaussians
double kl_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    return std::log(s2 / s1) + (s1*s1 + (P.mu - Q.mu)*(P.mu - Q.mu)) / (2.0 * s2 * s2) - 0.5;
}

// Closed-form squared Hellinger between univariate Gaussians
double hellinger2_gaussians_closed(const Normal& P, const Normal& Q) {
    double s1 = P.sigma, s2 = Q.sigma;
    double num = 2.0 * s1 * s2;
    double den = s1*s1 + s2*s2;
    double expo = std::exp(- (P.mu - Q.mu)*(P.mu - Q.mu) / (4.0 * den));
    return 1.0 - std::sqrt(num / den) * expo;
}

int main() {
    std::mt19937 rng(123);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    size_t n = 200000; // number of MC samples

    // Estimate KL via Monte Carlo using f(t) = t log t
    double Dkl_mc = estimate_f_divergence_mc(
        n, P, Q, [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); }, rng);
    double Dkl_cf = kl_gaussians_closed(P, Q);

    // Estimate Hellinger^2 via Monte Carlo; this is the generic path when only t
    // is available (fgen::hellinger2_from_logw is more stable when log w is at hand)
    auto hell_from_t = [](double t){
        double s = std::sqrt(t);
        double d = s - 1.0;
        return d * d;
    };
    double Dh2_mc = estimate_f_divergence_mc(n, P, Q, hell_from_t, rng);
    double Dh2_cf = hellinger2_gaussians_closed(P, Q);

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL MC estimate      : " << Dkl_mc << "\n";
    std::cout << "KL closed-form      : " << Dkl_cf << "\n";
    std::cout << "Hellinger^2 MC est. : " << Dh2_mc << "\n";
    std::cout << "Hellinger^2 closed  : " << Dh2_cf << "\n";

    return 0;
}
This program estimates D_f(P||Q) for Gaussian P and Q by sampling from Q and averaging f(p/q). It validates the Monte Carlo estimates against known closed forms for KL and squared Hellinger. It uses log-densities to compute the ratio stably and includes a safe exponential to reduce overflow/underflow risk.
#include <iostream>
#include <random>
#include <cmath>
#include <functional>
#include <limits>

struct Normal { double mu, sigma; };

double log_pdf_normal(const Normal& d, double x) {
    double v = d.sigma * d.sigma;
    return -0.5 * std::log(2.0 * M_PI * v) - 0.5 * (x - d.mu) * (x - d.mu) / v;
}

inline double safe_exp(double a) {
    if (a > 700.0) return std::numeric_limits<double>::infinity();
    if (a < -745.0) return 0.0;
    return std::exp(a);
}

// Change-of-measure identity: E_Q[h] = E_P[h / w], where w = p/q.
// Here h = f(w), so D_f(P||Q) = E_P[f(w)/w].
double estimate_via_change_of_measure(
    size_t n,
    const Normal& P,
    const Normal& Q,
    const std::function<double(double)>& f_from_t,
    std::mt19937& rng
) {
    std::normal_distribution<double> p_sampler(P.mu, P.sigma);
    double acc = 0.0;
    size_t finite = 0;
    for (size_t i = 0; i < n; ++i) {
        double x = p_sampler(rng);
        double logw = log_pdf_normal(P, x) - log_pdf_normal(Q, x); // log(p/q)
        double w = safe_exp(logw);
        if (w == 0.0 || !std::isfinite(w)) continue; // skip pathological contributions
        acc += f_from_t(w) / w; // unbiased if the expectation is well-defined
        ++finite;
    }
    return (finite == 0) ? std::numeric_limits<double>::quiet_NaN()
                         : acc / static_cast<double>(finite);
}

int main() {
    std::mt19937 rng(42);
    Normal P{0.0, 1.0};
    Normal Q{1.0, 2.0};

    auto f_kl = [](double t){ return (t <= 0.0) ? 0.0 : t * std::log(t); };

    size_t n = 200000;
    double D_est = estimate_via_change_of_measure(n, P, Q, f_kl, rng);

    // Closed-form KL for comparison
    auto kl_cf = [](Normal P, Normal Q){
        return std::log(Q.sigma / P.sigma)
             + (P.sigma*P.sigma + (P.mu - Q.mu)*(P.mu - Q.mu)) / (2.0 * Q.sigma * Q.sigma)
             - 0.5;
    };

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "KL via change-of-measure (from P): " << D_est << "\n";
    std::cout << "KL closed-form                   : " << kl_cf(P, Q) << "\n";

    std::cout << "Note: This estimator can have high variance when w is small.\n";
    return 0;
}
When you cannot sample from Q but can sample from P, the change-of-measure identity D_f(P||Q)=E_P[f(w)/w] with w=p/q provides an alternative estimator. This can be statistically unstable if w is often small; the example demonstrates it on Gaussians and compares to the closed-form KL.