# Sufficiency

Author

Will Fithian

Published

August 29, 2023

$\newcommand{\cB}{\mathcal{B}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cX}{\mathcal{X}} \newcommand{\EE}{\mathbb{E}} \newcommand{\PP}{\mathbb{P}} \newcommand{\RR}{\mathbb{R}} \newcommand{\ZZ}{\mathbb{Z}} \newcommand{\td}{\,\textrm{d}} \newcommand{\simiid}{\stackrel{\textrm{i.i.d.}}{\sim}} \DeclareMathOperator*{\minz}{minimize\;}$

## Sufficiency

Sufficiency is a central concept in statistics that allows us to focus on the essential aspects of the data set while ignoring details that are irrelevant to the inference problem. If $$X\sim P_\theta$$ represents the entire data set, drawn from a model $$\cP = \{P_\theta:\; \theta \in \Theta\}$$, then this lecture will concern the idea of a sufficient statistic $$T(X)$$ that carries all of the information in the data that can help us learn about $$\theta$$.

A statistic $$T(X)$$ is any random variable which is a function of the data $$X$$, and which does not depend on the unknown parameter $$\theta$$. We say the statistic $$T(X)$$ is sufficient for the model $$\cP$$ if $$P_\theta(X \mid T)$$ does not depend on $$\theta$$. This lecture will be devoted to interpreting this definition and giving examples.

Example (Independent Bernoulli sequence): We introduced the binomial example from Lecture 2 by telling a story about an investigator who flips a biased coin $$n$$ times and records the total number of heads, which has a binomial distribution. All of the estimators we considered were functions only of the (binomially-distributed) count of heads.

But if the investigator had actually performed this experiment, they would have observed more than just the total number of heads: they would have observed the entire sequence of $$n$$ heads and tails. If we let $$X_i$$ denote a binary indicator of whether the $$i$$th throw is heads, for $$i=1,\ldots,n$$, then we have assumed that these indicators are i.i.d. Bernoulli random variables:

$X_1,\ldots,X_n \simiid \text{Bern}(\theta).$

Let $$T(X) = \sum_i X_i \sim \text{Binom}(n,\theta)$$ denote the summary statistic that we previously used to represent the entire data set. It is undeniable that we have lost some information by only recording $$T(X)$$ instead of the entire sequence $$X = (X_1,\ldots,X_n)$$. As a result, we might wonder whether we could have improved the estimator by considering all functions of $$X$$, not just functions of $$T(X)$$.

The answer is that, no, we did not really lose anything by summarizing the data by $$T(X)$$ because $$T(X)$$ is sufficient. The joint pmf of the data set $$X \in \{0,1\}^n$$ (i.e., the density wrt the counting measure on $$\{0,1\}^n$$) is

$p_\theta(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{n-\sum_i x_i}.$

Note that this pmf depends only on $$T(x)$$: it assigns probability $$\theta^t (1-\theta)^{n-t}$$ to every sequence with $$T(X)=t$$ total heads. As a result, the conditional distribution given $$T(X)=t$$ should be uniform on all of the $$\binom{n}{t}$$ sequences with $$t$$ heads. We can confirm this by calculating the conditional pmf directly:

\begin{aligned} \PP_\theta(X = x \mid T(X) = t) &= \frac{\PP_\theta(X=x, \sum_i X_i = t)}{\PP_\theta(T(X) = t)} \\[7pt] &= \frac{\theta^t (1-\theta)^{n-t}1\{\sum_i x_i = t\}}{\theta^t(1-\theta)^{n-t}\binom{n}{t}}\\[5pt] &= \binom{n}{t}^{-1}1\{T(x) = t\}. \end{aligned}

Since the conditional distribution does not depend on $$\theta$$, $$T(X)$$ is sufficient for the model $$\cP$$.

As we will see next, we didn’t really need to go to the trouble of calculating the conditional distribution. Once we noticed that the density depends on $$x$$ only through $$T(x)$$, we could have concluded that $$T(X)$$ was sufficient.

## Factorization theorem

The easiest way to verify that a statistic is sufficient is to show that the density $$p_\theta$$ factorizes into a part that involves only $$\theta$$ and $$T(x)$$, and a part that involves only $$h(x)$$.

Factorization Theorem: Let $$\cP$$ be a model having densities $$p_\theta(x)$$ with respect to a common dominating measure $$\mu$$. Then $$T(X)$$ is sufficient for $$\cP$$ if and only if there exist non-negative functions $$g_\theta$$ and $$h$$ for which

$p_\theta(x) = g_\theta(T(x)) h(x),$

for almost every $$x$$ under $$\mu$$.

The “almost every $$x$$” qualification means that

$\mu\left(\{x:\; p_\theta(x) \neq g_\theta(T(x))h(x)\}\right) = 0.$

It is needed to avoid counterexamples with continuous distributions where we could arbitrarily change the value of $$p_{\theta_0}(x_0)$$ for a single $$\theta_0$$ and $$x_0$$ without actually changing any of the distributions.

Proof: The proof is easiest in the discrete case, so we don’t have to deal with conditioning on measure-zero events.

First, assume that there exists a factorization $$p_\theta(x) = g_\theta(T(x)) h(x)$$. Then we

\begin{aligned} \PP_\theta(X = x \mid T(X) = t) &= \frac{p_\theta(x)1\{T(x) = t\}}{\displaystyle\sum_{z:\;T(z) = t} p_\theta(z)}\\[7pt] &= \frac{g_\theta(t) h(x) 1\{T(x) = t\}}{g_\theta(t)\displaystyle\sum_{z:\;T(z) = t} h(z)}\\[7pt] &= \frac{h(x) 1\{T(x) = t\}}{\displaystyle\sum_{z:\;T(z) = t} h(z)}, \end{aligned}

which we see does not depend on $$\theta$$.

Next consider the opposite direction. If $$T(X)$$ is sufficient, then we can construct a factorization by writing

. First, define

$g_\theta(t) = \PP_\theta(T(X)=t) = \sum_{x:\;T(x) = t} p_\theta(x).$

Then, let

$h(x) = \PP(X = x \mid T(X) = T(x)),$

which does not depend on $$\theta$$ by sufficiency. Then we have

$\PP_\theta(X = x) = \PP_\theta(T(X) = T(x)) \;\cdot\;\PP_\theta(X = x \mid T(X) = T(x)) = g_\theta(T(x)) h(x)$