Sufficient Statistics
Key Points
- A sufficient statistic compresses all information in the sample about a parameter into a lower-dimensional summary without losing inferential power.
- The Fisher–Neyman factorization theorem gives a practical test: a statistic T(X) is sufficient for θ if the likelihood factors as g(T(X), θ) · h(X).
- Many common models have simple sufficient statistics: sum of Bernoullis, sum of Poissons, and (sum, sum of squares) for Normal with unknown mean/variance.
- Using sufficient statistics enables O(1) memory streaming estimation and faster likelihood/MLE computation compared to using the raw data.
- In exponential families, T(X) naturally appears in the canonical form; this guarantees sufficiency and often minimal sufficiency.
- Rao–Blackwellization improves an estimator by conditioning on a sufficient statistic, never worsening variance.
- Confusing minimal sufficiency with sufficiency, or forgetting which parameters are unknown, often leads to wrong summaries.
- In C++, you can implement aggregators that maintain only T(X), supporting merges and distributed processing with exact results.
Prerequisites
- Random variables and distributions — Sufficiency is defined in terms of probability distributions and samples.
- Likelihood and maximum likelihood — Factorization and MLE rely on understanding the likelihood as a function of parameters.
- Conditional probability and independence — The formal definition of sufficiency uses conditional distributions.
- Exponential family of distributions — Many sufficient statistics arise naturally from the exponential family form.
- Basic calculus and logarithms — Manipulating likelihoods requires log properties and derivatives for MLE.
- Combinatorics (binomial coefficients) — Understanding counts and arrangements helps with Binomial/Bernoulli sufficiency.
- C++ programming basics — To implement aggregators and likelihoods efficiently and safely.
- Numerical stability (floating-point) — Accumulating sums and computing logs requires care to avoid overflow/underflow.
Detailed Explanation
01 Overview
Sufficient statistics formalize the idea of data reduction without losing information about a parameter. Suppose you observe a sample X = (X_1, ..., X_n) from a distribution indexed by a parameter θ (for example, Bernoulli(p), Poisson(λ), or Normal(μ, σ^2)). A statistic T(X) is a function that maps the data to a summary, like the sum or mean. T(X) is called sufficient for θ if, once you know T(X), the exact arrangement of the raw data does not add any further information about θ.

The Fisher–Neyman factorization theorem provides a concrete and widely used criterion: T(X) is sufficient for θ if and only if the joint likelihood (or probability density/mass) of the data can be written as a product of a term that depends on X only through T(X) and θ, and another term that does not depend on θ at all. This is powerful because it turns an abstract conditional-independence definition into an algebraic check.

Sufficient statistics appear everywhere in statistical inference: maximum likelihood estimation (MLE), Bayesian conjugacy, hypothesis testing, and efficient streaming or distributed computation. For many exponential family models (Bernoulli, Poisson, Normal, Exponential, Gamma, etc.), sufficiency emerges automatically from the canonical form. Practically, this means we can replace a large dataset with a handful of numbers—often just counts and sums—without losing any inferential precision about θ.
02 Intuition & Analogies
Imagine you are judging the sweetness of a batch of strawberries and you care only about the average sweetness. If you take many bites (data points), the total amount of sugar consumed (the sum) tells you everything you need about the average, regardless of which berries you bit or in what order. The pile of peels (the fine details) no longer matters for estimating the average sweetness. Here, the sum acts like a sufficient statistic for the mean when variability is known.

Likewise, if you are counting how many customers arrive each minute and you model arrivals with a Poisson process, the total number of arrivals over n minutes is all that matters for estimating the arrival rate λ. How those arrivals are distributed across minutes is irrelevant for λ—only the total count carries information. That total is the sufficient statistic.

Think of sufficiency as a perfect compression that preserves all parameter-related information, similar to a lossless ZIP file specialized for inference about θ. If you pass the compressed file (T(X)) to any statistician, they can reconstruct the same likelihood for θ as if they saw the raw data. In exponential family models, this compression is especially neat: the data enter the likelihood through simple aggregates (like sums or counts), ensuring that these aggregates are sufficient. In practice, this lets you do streaming analytics with tiny memory, merge summaries from different machines, and compute MLEs quickly because the heavy lifting reduces to a few numbers.
03 Formal Definition
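A standard statement of the definition (the discrete case is shown for concreteness; the continuous case replaces the pmf with a density):

```latex
% Sufficiency: the conditional law of the sample given T is parameter-free.
\textbf{Definition.} Let $X=(X_1,\dots,X_n)$ have joint pmf $f(x;\theta)$, $\theta\in\Theta$.
A statistic $T(X)$ is \emph{sufficient} for $\theta$ if the conditional distribution
\[
  P_\theta\!\left(X = x \mid T(X) = t\right)
\]
does not depend on $\theta$, for every $t$ with $P_\theta(T(X)=t)>0$.

\textbf{Factorization (Fisher--Neyman).} $T$ is sufficient for $\theta$ if and only if
there exist functions $g$ and $h$ such that
\[
  f(x;\theta) \;=\; g\big(T(x),\theta\big)\,h(x) \qquad \text{for all } x,\ \theta .
\]
```

The factorization form is the one used throughout the examples below: $g$ carries all dependence on $\theta$ through $T(x)$, while $h$ is a $\theta$-free constant.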
04 When to Use
- Parametric inference with familiar models: For Bernoulli(p), use the total number of successes; for Poisson(λ), use the total count; for Normal with known variance, use the sum (or mean); for Normal with both mean and variance unknown, use the pair (sum, sum of squares).
- Streaming or distributed data: Maintain only sufficient statistics to compute MLEs or posteriors later with O(1) memory. You can merge summaries across shards without revisiting raw data.
- Faster likelihood evaluation: Replace O(n) likelihood computations with O(1) using T(X), enabling rapid optimization or grid searches over parameters.
- Bayesian conjugacy: Posterior updates depend only on T(X), e.g., Beta–Binomial uses total successes and trials; Gamma–Poisson uses total count and number of observations.
- Privacy and storage: Share T(X) instead of raw data to reduce exposure while preserving inferential capability about θ.
- Model diagnostics and tests: In some classical tests (e.g., conditional tests), conditioning on a sufficient statistic eliminates nuisance parameters.
⚠️Common Mistakes
- Forgetting which parameters are unknown: For Normal data, the sample mean alone is sufficient for μ only when σ^2 is known. If both μ and σ^2 are unknown, you need (Σ X_i, Σ X_i^2) or equivalently (x̄, S^2).
- Confusing sufficiency with minimal sufficiency: A statistic can be sufficient but not minimal; carrying extra components can be redundant but still sufficient.
- Misapplying factorization: h(x) must not depend on θ. If any factor still has θ hidden in it, sufficiency is not established.
- Ignoring the support: The factorization theorem assumes a common support that does not depend on θ. If the support changes with θ, standard results may fail.
- Over-compression: Using an insufficient summary (e.g., only the sample mean for skewed, heavy-tailed models) loses information and biases inferences.
- Not using one-to-one transformations: If T is sufficient, any one-to-one function of T is also sufficient—sometimes a simpler or numerically stable form exists (e.g., use sums instead of means to avoid floating-point division during accumulation).
Key Formulas
Fisher–Neyman Factorization
f(x; θ) = g(T(x), θ) · h(x)
Explanation: A statistic T is sufficient if the likelihood separates into a part depending on x only via T(x) (together with θ) and a part h(x) independent of θ. This is the main practical test for sufficiency.
Conditional Definition
P(X = x | T(X) = t) does not depend on θ
Explanation: Once T(X) = t is known, the distribution of X does not change with θ. This captures the idea that T(X) contains all information from X.
Exponential Family Form
f(x; θ) = h(x) exp(η(θ) · T(x) − A(θ))
Explanation: In exponential families, the density/mass has this canonical form. The presence of T(x) in the exponent guarantees T is sufficient by factorization.
Bernoulli Likelihood
L(p | x) = p^T (1 − p)^{n−T}, where T = Σ x_i
Explanation: For iid Bernoulli data, the likelihood depends on x only through the total number of successes T = Σ x_i, proving sufficiency of the sum.
Poisson Likelihood
L(λ | x) = e^{−nλ} λ^T · ∏ (1 / x_i!), where T = Σ x_i
Explanation: The likelihood factors into g(T, λ) = e^{−nλ} λ^T and h(x) = ∏ 1/x_i!, showing sufficiency of T = Σ x_i for λ.
Normal (known variance) Log-Likelihood
log L(μ | x) = −(n/2) log(2πσ^2) − (1/(2σ^2)) Σ (x_i − μ)^2
Explanation: Expanding the square shows dependence on x only through Σ x_i (or the sample mean), establishing sufficiency for μ when σ^2 is known.
MLE via Sufficient Statistic
θ̂ = argmax_θ g(T(x), θ)
Explanation: When T is sufficient, maximizing the likelihood can be done using T(X) alone. This reduces computation and memory needs.
Normal Minimal Sufficiency
T(x) = (Σ x_i, Σ x_i^2)
Explanation: With both μ and σ^2 unknown, the pair of sums is (minimal) sufficient. The sample mean alone is not sufficient in this case.
Lehmann–Scheffé Criterion (informal)
T(x) = T(y) exactly when f(x; θ) / f(y; θ) does not depend on θ
Explanation: Two samples are equivalent under a minimal sufficient statistic if their likelihood ratio does not depend on θ. This characterizes minimal sufficiency.
Rao–Blackwellization
θ̃ = E[θ̂ | T(X)], with Var(θ̃) ≤ Var(θ̂)
Explanation: Improving an estimator by conditioning on a sufficient statistic reduces or maintains variance. It uses no extra data, only better summarization.
Complexity Analysis
Evaluating the likelihood from raw data costs O(n) time per candidate parameter value and O(n) memory to store the sample. With a sufficient statistic, each evaluation costs O(1) after a single O(n) accumulation pass, the aggregator itself uses O(1) memory, and merging two shard summaries is O(1), which is what makes the streaming and distributed examples below exact rather than approximate.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute the Bernoulli log-likelihood in two ways:
// (1) From raw data (O(n) per evaluation)
// (2) From sufficient statistic T = sum x_i (O(1) per evaluation)
// Demonstrate they are identical for any p in (0,1).

double loglik_raw(const vector<int>& x, double p) {
    // Guard for p in (0,1)
    if (p <= 0.0 || p >= 1.0) return -INFINITY;
    double ll = 0.0;
    for (int xi : x) {
        // xi must be 0 or 1
        if (xi == 1) ll += log(p);
        else if (xi == 0) ll += log(1.0 - p);
        else return -INFINITY; // invalid Bernoulli observation
    }
    return ll;
}

double loglik_from_T(int T, int n, double p) {
    if (p <= 0.0 || p >= 1.0) return -INFINITY;
    // L(p|x) = p^T (1-p)^{n-T}  =>  log L = T log p + (n-T) log(1-p)
    return T * log(p) + (n - T) * log(1.0 - p);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example data: 10 trials with 6 successes (order arbitrary)
    vector<int> x = {1,0,1,1,0,1,0,1,0,1};
    int n = (int)x.size();
    int T = accumulate(x.begin(), x.end(), 0);

    cout << "n=" << n << ", T=sum x_i=" << T << "\n";

    // Compare log-likelihoods over a grid of p values
    vector<double> grid;
    for (int i = 1; i <= 9; ++i) grid.push_back(i / 10.0); // 0.1..0.9

    cout.setf(std::ios::fixed); cout << setprecision(6);
    for (double p : grid) {
        double ll1 = loglik_raw(x, p);
        double ll2 = loglik_from_T(T, n, p);
        cout << "p=" << p << ": ll_raw=" << ll1 << ", ll_T=" << ll2
             << ", diff=" << fabs(ll1 - ll2) << "\n";
    }

    // MLE from T: p_hat = T/n
    double p_hat = (double)T / n;
    cout << "MLE p_hat = T/n = " << p_hat << "\n";
    return 0;
}
For iid Bernoulli data, the joint likelihood is p^{T}(1-p)^{n-T}, where T = Σ x_i is the total number of successes. The code verifies that the log-likelihood computed from the raw sequence equals the one computed solely from T for various p. This illustrates that T is sufficient for p, and that the MLE p̂ = T/n can be computed with O(1) memory.
#include <bits/stdc++.h>
using namespace std;

// We assume X_i ~ N(mu, sigma2) with known sigma2.
// Sufficient statistic for mu is the sum S = sum X_i (or equivalently the mean).
// This aggregator supports streaming adds and merges for distributed computation.

struct NormalKnownVarAgg {
    long long n = 0;     // number of observations
    long double S = 0.0; // sum of observations

    void add(long double x) {
        ++n; S += x; // For better FP accuracy, one could use Kahan summation.
    }

    void merge(const NormalKnownVarAgg& other) {
        n += other.n; S += other.S;
    }

    long double mean() const { return (n == 0) ? NAN : (S / n); }

    // Log-likelihood using only S and n (plus the theta-free constant R):
    long double loglik(long double mu, long double sigma2, long double sumsq_residual_part) const {
        // log L(mu|x) = -n/2 log(2 pi sigma2) - (1/(2 sigma2)) sum (x_i - mu)^2
        // Expand: sum (x_i - mu)^2 = sum x_i^2 - 2 mu S + n mu^2
        // If we precompute R = sum x_i^2 (theta-free), then we only need S and n for mu dependence.
        const long double PI = acosl(-1.0L);  // portable long double pi (M_PIl is a GNU extension)
        long double R = sumsq_residual_part;  // equals sum x_i^2
        long double term = R - 2.0L * mu * S + (long double)n * mu * mu;
        return -0.5L * n * log(2.0L * PI * sigma2) - (1.0L / (2.0L * sigma2)) * term;
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Simulate or provide data
    vector<long double> x = {2.1L, 1.9L, 2.3L, 2.0L, 1.8L};
    long double sigma2 = 0.04L; // known variance (e.g., sigma=0.2)

    NormalKnownVarAgg agg;
    long double R = 0.0L; // sum of squares
    for (auto v : x) { agg.add(v); R += v * v; }

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "n=" << agg.n << ", S=" << (double)agg.S << ", mean=" << (double)agg.mean() << "\n";

    // Compute MLE for mu: mu_hat = S/n
    long double mu_hat = agg.mean();
    cout << "MLE mu_hat = S/n = " << (double)mu_hat << "\n";

    // Compare log-likelihood at several mu values using only (n,S,R)
    for (long double mu : {1.8L, 1.9L, 2.0L, 2.1L, 2.2L}) {
        long double ll = agg.loglik(mu, sigma2, R);
        cout << "mu=" << (double)mu << ", logL=" << (double)ll << "\n";
    }

    return 0;
}
For Normal data with known variance σ^2, the likelihood in μ depends on the sample only through S = Σ X_i (or the mean). The code maintains n and S in a streaming-friendly aggregator, computes the MLE μ̂ = S/n, and evaluates the log-likelihood using only (n, S) and the θ-free constant R = Σ X_i^2. This demonstrates sufficiency and O(1) memory usage.
#include <bits/stdc++.h>
using namespace std;

// For iid Poisson(lambda), T = sum X_i is sufficient for lambda.
// We implement a streaming aggregator that can be merged across shards.

struct PoissonAgg {
    long long n = 0;      // number of observations
    long double S = 0.0L; // sum of counts

    void add(long long x) {
        if (x < 0) throw runtime_error("Poisson count must be nonnegative");
        ++n; S += (long double)x;
    }

    void merge(const PoissonAgg& other) {
        n += other.n; S += other.S;
    }

    long double mle_lambda() const { return (n == 0) ? NAN : (S / n); }

    long double loglik(long double lambda, const vector<int>& x) const {
        if (lambda <= 0.0L) return -INFINITY;
        // Raw-data log-likelihood: sum( -lambda + x_i log lambda - log x_i! )
        long double ll = 0.0L;
        for (int xi : x) {
            if (xi < 0) return -INFINITY;
            ll += -lambda + xi * log(lambda) - lgammal((long double)xi + 1.0L); // log(x!) via lgamma
        }
        return ll;
    }

    long double loglik_from_T(long double lambda) const {
        if (lambda <= 0.0L) return -INFINITY;
        // Using factorization: log L = -n lambda + S log lambda - sum log x_i!
        // The last term is theta-free; we omit it when comparing across lambda.
        return -n * lambda + S * log(lambda);
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> x = {2, 0, 1, 3, 2, 4, 1, 0};

    PoissonAgg a, b;
    // Simulate two shards
    for (size_t i = 0; i < x.size(); ++i) {
        if (i < x.size()/2) a.add(x[i]); else b.add(x[i]);
    }

    // Merge summaries
    PoissonAgg all = a; all.merge(b);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "n=" << all.n << ", S=" << (double)all.S << ", lambda_hat=" << (double)all.mle_lambda() << "\n";

    // Compare full raw-data log-likelihood vs. the lambda-dependent part from T
    for (long double lambda : {0.5L, 1.0L, 1.5L, 2.0L}) {
        long double ll_raw = all.loglik(lambda, x);
        long double ll_T = all.loglik_from_T(lambda);
        cout << "lambda=" << (double)lambda
             << ": ll_raw (full)=" << (double)ll_raw
             << ", ll_T (no const)=" << (double)ll_T << "\n";
    }

    return 0;
}
For Poisson data, the sum S = Σ X_i together with the count n is sufficient for λ. The aggregator supports streaming adds and merges across shards, enabling the distributed MLE λ̂ = S/n. The code compares the raw-data log-likelihood with the λ-dependent part obtained from T; they differ by an additive constant that does not depend on λ, confirming the factorization.