
Sufficient Statistics

Key Points

  • A sufficient statistic compresses all information in the sample about a parameter into a lower-dimensional summary without losing inferential power.
  • The Fisher–Neyman factorization theorem gives a practical test: a statistic T(X) is sufficient for θ if the likelihood factors as g(T(X), θ) h(X).
  • Many common models have simple sufficient statistics: sum of Bernoullis, sum of Poissons, and (sum, sum of squares) for Normal with unknown mean/variance.
  • Using sufficient statistics enables O(1)-memory streaming estimation and faster likelihood/MLE computation compared to using the raw data.
  • In exponential families, T(X) naturally appears in the canonical form; this guarantees sufficiency and often minimal sufficiency.
  • Rao–Blackwellization improves an estimator by conditioning on a sufficient statistic, never worsening variance.
  • Confusing minimal sufficiency with sufficiency, or forgetting which parameters are unknown, often leads to wrong summaries.
  • In C++, you can implement aggregators that maintain only T(X), supporting merges and distributed processing with exact results.

Prerequisites

  • →Random variables and distributions — Sufficiency is defined in terms of probability distributions and samples.
  • →Likelihood and maximum likelihood — Factorization and MLE rely on understanding the likelihood as a function of parameters.
  • →Conditional probability and independence — The formal definition of sufficiency uses conditional distributions.
  • →Exponential family of distributions — Many sufficient statistics arise naturally from the exponential family form.
  • →Basic calculus and logarithms — Manipulating likelihoods requires log properties and derivatives for MLE.
  • →Combinatorics (binomial coefficients) — Understanding counts and arrangements helps with Binomial/Bernoulli sufficiency.
  • →C++ programming basics — To implement aggregators and likelihoods efficiently and safely.
  • →Numerical stability (floating-point) — Accumulating sums and computing logs requires care to avoid overflow/underflow.

Detailed Explanation


01 Overview

Sufficient statistics formalize the idea of data reduction without losing information about a parameter. Suppose you observe a sample X = (X_1, ..., X_n) from a distribution indexed by a parameter θ (for example, Bernoulli(p), Poisson(λ), or Normal(μ, σ²)). A statistic T(X) is a function that maps the data to a summary, like the sum or mean. T(X) is called sufficient for θ if, once you know T(X), the exact arrangement of the raw data does not add any further information about θ.

The Fisher–Neyman factorization theorem provides a concrete and widely used criterion: T(X) is sufficient for θ if and only if the joint likelihood (or probability density/mass) of the data can be written as a product of a term that depends on X only through T(X) and θ, and another term that does not depend on θ at all. This is powerful because it turns an abstract conditional-independence definition into an algebraic check.

Sufficient statistics appear everywhere in statistical inference: maximum likelihood estimation (MLE), Bayesian conjugacy, hypothesis testing, and efficient streaming or distributed computation. For many exponential family models (Bernoulli, Poisson, Normal, Exponential, Gamma, etc.), sufficiency emerges automatically from the canonical form. Practically, this means we can replace a large dataset with a handful of numbers—often just counts and sums—without losing any inferential precision about θ.

02 Intuition & Analogies

Imagine you are judging the sweetness of a batch of strawberries and you care only about the average sweetness. If you take many bites (data points), the total amount of sugar consumed (the sum) tells you everything you need about the average, regardless of which berries you bit or in what order. The pile of peels (the fine details) no longer matters for estimating the average sweetness. Here, the sum acts like a sufficient statistic for the mean when variability is known.

Likewise, if you are counting how many customers arrive each minute and you model arrivals with a Poisson process, the total number of arrivals over n minutes is all that matters for estimating the arrival rate λ. How those arrivals are distributed across minutes is irrelevant for λ—only the total count carries information. That total is the sufficient statistic.

Think of sufficiency as a perfect compression that preserves all parameter-related information, similar to a lossless ZIP file specialized for inference about θ. If you pass the compressed file (T(X)) to any statistician, they can reconstruct the same likelihood for θ as if they saw the raw data. In exponential family models, this compression is especially neat: the data enter the likelihood through simple aggregates (like sums or counts), ensuring that these aggregates are sufficient. In practice, this lets you do streaming analytics with tiny memory, merge summaries from different machines, and compute MLEs quickly because the heavy lifting reduces to a few numbers.

03 Formal Definition

Let X = (X₁, …, Xₙ) have joint distribution f(x ∣ θ) with respect to a common measure. A statistic T(X) is sufficient for parameter θ if the conditional distribution of X given T(X) = t does not depend on θ. Formally, for all measurable sets A, P_θ(X ∈ A ∣ T(X) = t) is the same function of A and t for all θ.

The Fisher–Neyman factorization theorem states that T(X) is sufficient for θ if and only if there exist nonnegative functions g and h such that f(x ∣ θ) = g(T(x), θ) h(x) for all x and all θ, where h(x) does not depend on θ. This connects sufficiency to a checkable algebraic factorization. For exponential families, densities/masses can be written as f(x ∣ θ) = h(x) exp{η(θ)ᵀ T(x) − A(θ)}, which immediately shows T(X) is sufficient for θ.

Minimal sufficiency refines the idea: a sufficient statistic T is minimal if it is a function of every other sufficient statistic; equivalently, for regular families, two samples x and y are equivalent (and thus map to the same T) if and only if the likelihood ratio f(x ∣ θ)/f(y ∣ θ) does not depend on θ.
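As a worked instance of the factorization theorem, an iid Poisson(λ) sample factors with a nontrivial h(x):

```latex
f(x \mid \lambda)
= \prod_{i=1}^{n} \frac{e^{-\lambda}\,\lambda^{x_i}}{x_i!}
= \underbrace{e^{-n\lambda}\,\lambda^{T(x)}}_{g(T(x),\,\lambda)}
  \;\cdot\;
  \underbrace{\prod_{i=1}^{n}\frac{1}{x_i!}}_{h(x)},
\qquad T(x) = \sum_{i=1}^{n} x_i .
```

Here g depends on the data only through T(x), and h(x) is free of λ, so T(x) = ∑ xᵢ is sufficient for λ.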

04 When to Use

  • Parametric inference with familiar models: For Bernoulli(p), use the total number of successes; for Poisson(λ), use the total count; for Normal with known variance, use the sum (or mean); for Normal with both mean and variance unknown, use the pair (sum, sum of squares).
  • Streaming or distributed data: Maintain only sufficient statistics to compute MLEs or posteriors later with O(1) memory. You can merge summaries across shards without revisiting raw data.
  • Faster likelihood evaluation: Replace O(n) likelihood computations with O(1) using T(X), enabling rapid optimization or grid searches over parameters.
  • Bayesian conjugacy: Posterior updates depend only on T(X), e.g., Beta–Binomial uses total successes and trials; Gamma–Poisson uses total count and number of observations.
  • Privacy and storage: Share T(X) instead of raw data to reduce exposure while preserving inferential capability about θ.
  • Model diagnostics and tests: In some classical tests (e.g., conditional tests), conditioning on a sufficient statistic eliminates nuisance parameters.

⚠️Common Mistakes

  • Forgetting which parameters are unknown: For Normal data, the sample mean alone is sufficient for μ only when σ² is known. If both μ and σ² are unknown, you need (∑ Xᵢ, ∑ Xᵢ²) or equivalently (X̄, S²).
  • Confusing sufficiency with minimal sufficiency: A statistic can be sufficient but not minimal; carrying extra components can be redundant but still sufficient.
  • Misapplying factorization: h(x) must not depend on θ. If any factor still has θ hidden in it, sufficiency is not established.
  • Ignoring the support: The factorization theorem assumes a common support that does not depend on θ. If the support changes with θ, standard results may fail.
  • Over-compression: Using an insufficient summary (e.g., only the sample mean for skewed, heavy-tailed models) loses information and biases inferences.
  • Not using one-to-one transformations: If T is sufficient, any one-to-one function of T is also sufficient—sometimes a simpler or numerically stable form exists (e.g., use sums instead of means to avoid floating-point division during accumulation).

Key Formulas

Fisher–Neyman Factorization

T is sufficient for θ ⟺ f(x ∣ θ) = g(T(x), θ) h(x)

Explanation: A statistic T is sufficient if the likelihood separates into a part depending on x only via T(x) and θ, and a part h(x) independent of θ. This is the main practical test for sufficiency.

Conditional Definition

Sufficiency: P_θ(X ∈ A ∣ T(X) = t) = P(X ∈ A ∣ T(X) = t) for all θ

Explanation: Once T(X) = t is known, the distribution of X does not change with θ. This captures the idea that T(X) contains all θ-related information from X.

Exponential Family Form

f(x ∣ θ) = h(x) exp{η(θ)ᵀ T(x) − A(θ)}

Explanation: In exponential families, the density/mass has this canonical form. The presence of T(x) in the exponent guarantees T is sufficient by factorization.
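As a quick check of this canonical form, the Bernoulli(p) mass function can be rewritten so that η, T, and A are explicit:

```latex
p^{x}(1-p)^{1-x}
  = \exp\!\Big( x \log\tfrac{p}{1-p} + \log(1-p) \Big)
  = h(x)\,\exp\{\eta(p)\,T(x) - A(p)\},
```

with h(x) = 1, η(p) = log(p/(1−p)), T(x) = x, and A(p) = −log(1−p). For an iid sample the statistic adds up across observations, giving the familiar T(x) = ∑ xᵢ.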

Bernoulli Likelihood

L(p ∣ x) = p^(∑ xᵢ) (1 − p)^(n − ∑ xᵢ)

Explanation: For iid Bernoulli data, the likelihood depends on x only through the total number of successes T = ∑ xᵢ, proving sufficiency of the sum.

Poisson Likelihood

L(λ ∣ x) = e^(−nλ) λ^(∑ xᵢ) ∏ᵢ (1/xᵢ!)

Explanation: The likelihood factors into g(T, λ) = e^(−nλ) λ^T and h(x) = ∏ᵢ 1/xᵢ!, showing sufficiency of T = ∑ xᵢ for λ.

Normal (known variance) Log-Likelihood

log L(μ ∣ x) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑ᵢ (xᵢ − μ)²

Explanation: Expanding the square shows dependence on x only through ∑ xᵢ (or the sample mean), establishing sufficiency for μ when σ² is known.

MLE via Sufficient Statistic

θ̂_MLE = argmax_θ L(θ ∣ T(X))

Explanation: When T is sufficient, maximizing the likelihood can be done using T(X) alone. This reduces computation and memory needs.

Normal Minimal Sufficiency

X ∼ N(μ, σ²), σ² unknown ⇒ T(x) = (∑ xᵢ, ∑ xᵢ²)

Explanation: With both μ and σ² unknown, the pair of sums is (minimal) sufficient. The sample mean alone is not sufficient in this case.

Lehmann–Scheffé Criterion (informal)

Minimal sufficiency: f(x ∣ θ) / f(y ∣ θ) independent of θ ⟺ T(x) = T(y)

Explanation: Two samples are equivalent under a minimal sufficient statistic if their likelihood ratio does not depend on θ. This characterizes minimal sufficiency.

Rao–Blackwellization

θ̂_RB = E[θ̂(X) ∣ T(X)]

Explanation: Improving an estimator by conditioning on a sufficient statistic reduces or maintains variance. It uses no extra data, only better summarization.

Complexity Analysis

Computing a sufficient statistic typically requires a single pass over n observations, yielding O(n) time and O(1) additional memory. For example, for Bernoulli or Poisson data, you need only maintain the count n and the sum of observations, which are constant-sized aggregates. In contrast, evaluating the full likelihood from raw data naively costs O(n) per parameter evaluation because each evaluation involves summing n terms. Once you have T(X), each likelihood evaluation becomes O(1), which is a major speedup when optimizing or sampling over parameters.

In streaming or distributed settings, sufficient statistics shine: partial summaries can be merged in O(1) time (e.g., sums and counts add), enabling exact, scalable estimation without storing the raw data. This stands in contrast to non-sufficient summaries that might require additional passes or re-access to raw data for accurate inference. Space complexity is minimal because only a small fixed number of aggregates are stored (e.g., for Normal with unknown mean and variance, two sums suffice).

Numerically, accumulating sums is stable if you use appropriate techniques (e.g., Kahan summation for floating-point), though this is an implementation detail rather than a complexity issue. Overall, sufficiency converts repetitive O(n) computations into upfront O(n) aggregation plus O(1) per-query evaluation, improving both time and space efficiency substantially.

Code Examples

Bernoulli sufficiency: likelihood depends only on total successes
#include <bits/stdc++.h>
using namespace std;

// Compute the Bernoulli log-likelihood in two ways:
// (1) From raw data (O(n) per evaluation)
// (2) From sufficient statistic T = sum x_i (O(1) per evaluation)
// Demonstrate they are identical for any p in (0,1).

double loglik_raw(const vector<int>& x, double p) {
    // Guard for p in (0,1)
    if (p <= 0.0 || p >= 1.0) return -INFINITY;
    double ll = 0.0;
    for (int xi : x) {
        // xi must be 0 or 1
        if (xi == 1) ll += log(p);
        else if (xi == 0) ll += log(1.0 - p);
        else return -INFINITY; // invalid Bernoulli observation
    }
    return ll;
}

double loglik_from_T(int T, int n, double p) {
    if (p <= 0.0 || p >= 1.0) return -INFINITY;
    // L(p|x) = p^T (1-p)^{n-T} => log L = T log p + (n-T) log(1-p)
    return T * log(p) + (n - T) * log(1.0 - p);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example data: 10 trials with 6 successes (order arbitrary)
    vector<int> x = {1,0,1,1,0,1,0,1,0,1};
    int n = (int)x.size();
    int T = accumulate(x.begin(), x.end(), 0);

    cout << "n=" << n << ", T=sum x_i=" << T << "\n";

    // Compare log-likelihoods over a grid of p values
    vector<double> grid;
    for (int i = 1; i <= 9; ++i) grid.push_back(i / 10.0); // 0.1..0.9

    cout.setf(std::ios::fixed); cout << setprecision(6);
    for (double p : grid) {
        double ll1 = loglik_raw(x, p);
        double ll2 = loglik_from_T(T, n, p);
        cout << "p=" << p << ": ll_raw=" << ll1 << ", ll_T=" << ll2
             << ", diff=" << fabs(ll1 - ll2) << "\n";
    }

    // MLE from T: p_hat = T/n
    double p_hat = (double)T / n;
    cout << "MLE p_hat = T/n = " << p_hat << "\n";
    return 0;
}

For iid Bernoulli data, the joint likelihood is p^T (1 − p)^(n−T), where T = ∑ xᵢ is the total number of successes. The code verifies that the log-likelihood computed from the raw sequence equals the one computed solely from T for various p. This illustrates that T is sufficient for p, and that the MLE p̂ = T/n can be computed with O(1) memory.

Time: building T is O(n); each likelihood evaluation from raw data is O(n); each evaluation from T is O(1). Space: O(1) beyond the input; streaming accumulation of T and n needs constant memory.
Normal with known variance: streaming sufficient statistic for μ
#include <bits/stdc++.h>
using namespace std;

// We assume X_i ~ N(mu, sigma2) with known sigma2.
// Sufficient statistic for mu is the sum S = sum X_i (or equivalently the mean).
// This aggregator supports streaming adds and merges for distributed computation.

// Portable long-double pi (M_PIl is a glibc extension, not standard C++).
static const long double PI_L = 3.141592653589793238462643383L;

struct NormalKnownVarAgg {
    long long n = 0;      // number of observations
    long double S = 0.0L; // sum of observations

    void add(long double x) {
        ++n; S += x; // For better FP accuracy, one could use Kahan summation.
    }

    void merge(const NormalKnownVarAgg& other) {
        n += other.n; S += other.S;
    }

    long double mean() const { return (n == 0) ? NAN : (S / n); }

    // Log-likelihood using only S and n:
    long double loglik(long double mu, long double sigma2, long double sumsq) const {
        // log L(mu|x) = -n/2 log(2 pi sigma2) - (1/(2 sigma2)) sum (x_i - mu)^2
        // Expand: sum (x_i - mu)^2 = sum x_i^2 - 2 mu S + n mu^2
        // With R = sum x_i^2 precomputed (theta-free), only S and n carry
        // the mu dependence.
        long double R = sumsq; // equals sum x_i^2
        long double term = R - 2.0L * mu * S + (long double)n * mu * mu;
        return -0.5L * n * log(2.0L * PI_L * sigma2) - (1.0L / (2.0L * sigma2)) * term;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Simulate or provide data
    vector<long double> x = {2.1L, 1.9L, 2.3L, 2.0L, 1.8L};
    long double sigma2 = 0.04L; // known variance (e.g., sigma=0.2)

    NormalKnownVarAgg agg;
    long double R = 0.0L; // sum of squares
    for (auto v : x) { agg.add(v); R += v * v; }

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "n=" << agg.n << ", S=" << (double)agg.S << ", mean=" << (double)agg.mean() << "\n";

    // Compute MLE for mu: mu_hat = S/n
    long double mu_hat = agg.mean();
    cout << "MLE mu_hat = S/n = " << (double)mu_hat << "\n";

    // Compare log-likelihood at several mu values using only (n,S,R)
    for (long double mu : {1.8L, 1.9L, 2.0L, 2.1L, 2.2L}) {
        long double ll = agg.loglik(mu, sigma2, R);
        cout << "mu=" << (double)mu << ", logL=" << (double)ll << "\n";
    }

    return 0;
}

For Normal data with known variance σ², the likelihood in μ depends on the sample only through S = ∑ Xᵢ (or the mean). The code maintains n and S in a streaming-friendly aggregator, computes the MLE μ̂ = S/n, and evaluates the log-likelihood using only (n, S) and the θ-free constant R = ∑ Xᵢ². This demonstrates sufficiency and O(1) memory usage.

Time: building (n, S, R) is O(n); each log-likelihood evaluation in μ is O(1). Space: O(1) for the aggregator (plus O(1) to store R).
Poisson rate λ: mergeable sufficient statistics for streaming and distributed MLE
#include <bits/stdc++.h>
using namespace std;

// For iid Poisson(lambda), T = sum X_i is sufficient for lambda.
// We implement a streaming aggregator that can be merged across shards.

struct PoissonAgg {
    long long n = 0;      // number of observations
    long double S = 0.0L; // sum of counts

    void add(long long x) {
        if (x < 0) throw runtime_error("Poisson count must be nonnegative");
        ++n; S += (long double)x;
    }

    void merge(const PoissonAgg& other) {
        n += other.n; S += other.S;
    }

    long double mle_lambda() const { return (n == 0) ? NAN : (S / n); }

    long double loglik_from_T(long double lambda) const {
        if (lambda <= 0.0L) return -INFINITY;
        // Using factorization: log L = -n lambda + S log lambda - sum log x_i!
        // The last term is theta-free; we omit it when comparing across lambda.
        return -n * lambda + S * log(lambda);
    }
};

// Raw-data log-likelihood; unlike loglik_from_T, this includes the
// theta-free constant -sum log x_i!.
long double loglik_raw(const vector<int>& x, long double lambda) {
    if (lambda <= 0.0L) return -INFINITY;
    long double ll = 0.0L;
    for (int xi : x) {
        if (xi < 0) return -INFINITY;
        ll += -lambda + xi * log(lambda) - lgammal((long double)xi + 1.0L); // log(x!) via lgamma
    }
    return ll;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> x = {2, 0, 1, 3, 2, 4, 1, 0};

    PoissonAgg a, b;
    // Simulate two shards
    for (size_t i = 0; i < x.size(); ++i) {
        if (i < x.size()/2) a.add(x[i]); else b.add(x[i]);
    }

    // Merge summaries
    PoissonAgg all = a; all.merge(b);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "n=" << all.n << ", S=" << (double)all.S << ", lambda_hat=" << (double)all.mle_lambda() << "\n";

    // The two log-likelihoods differ only by an additive constant
    // independent of lambda.
    for (long double lambda : {0.5L, 1.0L, 1.5L, 2.0L}) {
        long double ll_raw = loglik_raw(x, lambda);
        long double ll_T = all.loglik_from_T(lambda);
        cout << "lambda=" << (double)lambda
             << ": ll_raw (incl. const)=" << (double)ll_raw
             << ", ll_T (const omitted)=" << (double)ll_T << "\n";
    }

    return 0;
}

For Poisson data, the sum S = ∑ Xᵢ together with the count n is sufficient for λ. The aggregator supports streaming adds and merges across shards, enabling the distributed MLE λ̂ = S/n. The code compares the raw-data log-likelihood with the θ-dependent part computed from T; they differ by an additive constant that does not depend on λ, confirming the factorization.

Time: building (n, S) is O(n); each log-likelihood evaluation from T is O(1). Space: O(1) for the aggregator, storing only n and S.
#sufficient statistic#fisher neyman factorization#exponential family#rao blackwell#likelihood#mle#bernoulli#poisson#normal distribution#data reduction#minimal sufficient statistic#conjugate prior#streaming statistics#distributed estimation#information sufficiency