\documentclass{article}
\usepackage[margin=1.2in]{geometry}
\usepackage{graphicx}
\usepackage{amsmath,amssymb,amsthm,bm}
\usepackage{latexsym,color,caption,multirow,verbatim}
\usepackage[round]{natbib}
\usepackage{enumerate}
\usepackage{times}
\newcommand{\RR}{\mathbb{R}}
\newcommand{\PP}{\mathbb{P}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\ZZ}{\mathbb{Z}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cC}{\mathcal{C}}
\newcommand{\cX}{\mathcal{X}}
\newcommand{\cT}{\mathcal{T}}
\newcommand{\simiid}{\overset{\textrm{i.i.d.}}{\sim}}
\newcommand{\simind}{\overset{\textrm{ind.}}{\sim}}
\newcommand{\td}{\,\textrm{d}}
\newcommand{\red}{\color{red}}
\definecolor{darkblue}{rgb}{0.2, 0.2, 0.5}
\newcommand{\sol}{~\\\color{darkblue}{\bf Solution:~\\}}
\begin{document}
\title{Stats 210A, Fall 2023\\
Homework 2\\
{\large {\bf Due date}: Wednesday, Sep. 13}}
\date{}
\maketitle
\vspace{-5em}
You may disregard measure-theoretic niceties about conditioning on measure-zero sets, almost-sure equality vs. actual equality, ``all functions'' vs. ``all measurable functions,'' etc. (unless the problem is explicitly asking about such issues).
If you need to write code to answer a question, show your code. If you need to include a plot, make sure the plot is readable, with appropriate axis labels and a legend if necessary. Points will be deducted for very hard-to-read code or plots.
\begin{description}
\item[1. Minimal sufficiency of the likelihood ratio]\hfill\\
Suppose that $\cP = \{p_\theta:\; \theta\in\Theta\}$ is a family of densities defined with respect to a common measure $\mu$ on $\cX$. Assume for simplicity that $p_{\theta}(x) > 0$ for all $\theta\in\Theta$ and $x\in\cX$.
For $\theta_1,\theta_2 \in \Theta$, define the likelihood ratio as
\[
\text{LR}(\theta_1, \theta_2; X) = \frac{p_{\theta_1}(X)}{p_{\theta_2}(X)} \in (0,\infty).
\]
\begin{enumerate}[(a)]
\item Use the factorization theorem directly to prove that the \emph{likelihood ratio process}
\[
R(X) = (\text{LR}(\theta_1,\theta_2; X): \theta_1,\theta_2 \in \Theta)
\]
is minimal sufficient.
The statistic $R(X)$ should be understood as a stochastic process, i.e. a collection of real random variables $R_{\theta_1,\theta_2}(X) = \text{LR}(\theta_1, \theta_2; X)$, indexed by $(\theta_1,\theta_2) \in \Theta^2$.
{\bf Hint:} Don't forget to prove that $R(X)$ is sufficient.
{\bf Hint:} If you find the concept of a stochastic process over a generic index set perplexing and unintuitive, I suggest you warm up by working through the problem assuming that $\Theta = \{1,2,\ldots,d\}$ for some finite integer $d$. Then $R$ is simply a $d\times d$ random matrix with $R_{i,j}=\text{LR}(i,j; X)$.
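To make the finite-$\Theta$ warm-up concrete, here is an optional sketch (an illustration with a made-up family, not part of the solution): take $\Theta = \{1,2\}$ indexing $\text{Binom}(2, p)$ with $p \in \{0.3, 0.6\}$, so $R(x)$ is a $2\times 2$ matrix for each sample point $x$.

```python
# Sketch of the finite-Theta warm-up: Theta = {1, 2} indexes p in {0.3, 0.6}
# (arbitrary choices), and X ~ Binomial(2, p).  The likelihood ratio process
# R(x) is then a 2x2 matrix of ratios of the two densities at x.
from math import comb

ps = [0.3, 0.6]  # hypothetical two-point parameter space

def density(p, x, n=2):
    """Binomial(n, p) pmf at x."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def lr_matrix(x):
    """The 2x2 matrix R_{i,j}(x) = p_i(x) / p_j(x)."""
    return [[density(p1, x) / density(p2, x) for p2 in ps] for p1 in ps]

for x in range(3):
    print(x, lr_matrix(x))
```

Two sample points carry the same value of $R$ exactly when their likelihood functions are proportional, which is the intuition the general proof formalizes.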
{\bf Note:} You could trivialize this problem by starting from the essentially equivalent result from class about the ``likelihood shape.'' I {\em don't} want you to use the likelihood shape because the point of this exercise is for you to work out a more concrete version of what is essentially the same result.
\item Show by counterexample that the {\em likelihood function}, defined as
\[
\text{Lik}(\theta; X) = \left(p_\theta(X)\right)_{\theta\in\Theta}
\]
is {\em not}, in general, minimal sufficient.
{\bf Note:} If you try to construct a counterexample by playing dirty tricks with measure-zero sets, it probably won't be a real counterexample under the rigorous measure-theoretic definition of minimal sufficiency, the rigorous statement of the factorization theorem, and so on. These kinds of shenanigans should not be necessary; once you have understood the essence of the problem it will not be hard to come up with a counterexample for discrete $\cX$.
\item {\bf Optional} (not graded, no extra points). If we want to be more concrete, we can define the ``likelihood shape'' as the equivalence class of all functions on $\Theta$ that are proportional to $\text{Lik}$:
\[
S(X) = (0,\infty) \cdot \text{Lik}(\cdot; X) = \left\{c \cdot \text{Lik}(\cdot; X): c \in (0,\infty)\right\}.
\]
Show that the likelihood shape $S(X)$ is minimal sufficient by appealing to your result from part (a).
\end{enumerate}
{\bf Moral:} The collection of likelihood ratios is minimal sufficient, as is the likelihood shape. However, the likelihood function is not minimal sufficient in general, because its overall scaling can carry information about $x$ that is irrelevant for inference about $\theta$.
\item[2. Bayesian interpretation of sufficiency]\hfill\\
Assume we have a family $\cP$ of densities $p_{\theta}(x)$ with respect to a common measure $\mu$ on $\cX$, for $\theta\in \Theta \subseteq \RR^n$. Additionally, assume the parameter $\theta$ is itself random, following {\em prior density} $q(\theta)$ with respect to the Lebesgue measure on $\Theta$.
Then, we can write the {\em posterior density} (distribution of $\theta$ given $X=x$) as
\[
q_{\text{post}}(\theta \mid x) = \frac{p_\theta(x)q(\theta)}{\int_{\Theta} p_\zeta(x)q(\zeta) \td \zeta}.
\]
({\bf Note:} this manipulation of the densities generally works even though we might worry about conditioning on a measure zero set. Feel free to make similar manipulations yourself in the problem).
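As a numerical illustration of the posterior formula above (not a proof, and with a hypothetical $\text{Beta}(2,2)$ prior chosen arbitrarily): for $X_1,\ldots,X_n \simiid \text{Bernoulli}(\theta)$, the posterior computed on a grid depends on $x$ only through $T(x) = \sum_i x_i$.

```python
# Illustration only: Bernoulli likelihood with a Beta(2, 2) prior (arbitrary
# choice).  Two data vectors with the same sufficient statistic T(x) = sum(x)
# yield the same posterior density on a grid of theta values.
import numpy as np

def posterior_on_grid(x, a=2.0, b=2.0):
    grid = np.linspace(0.001, 0.999, 500)
    prior = grid**(a - 1) * (1 - grid)**(b - 1)
    lik = grid**sum(x) * (1 - grid)**(len(x) - sum(x))
    unnorm = lik * prior
    return unnorm / unnorm.sum()   # normalize numerically on the grid

# two data vectors with the same sufficient statistic T(x) = 2
p1 = posterior_on_grid([1, 1, 0, 0])
p2 = posterior_on_grid([0, 1, 0, 1])
print(np.max(np.abs(p1 - p2)))  # essentially zero
```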
In this setting, prove the following claims:
\begin{enumerate}[(a)]
%%%
\item Suppose a statistic $T(X)$ has the property that, for any prior distribution $q(\theta)$, the posterior distribution $q_{\text{post}}(\theta \mid x)$ depends on $x$ only through $T(x)$. Show that $T(X)$ is sufficient for $\cP$.
%%%
\item Conversely, assume that $T(X)$ is sufficient for $\cP$ and show that, for any prior $q$, the posterior depends on $x$ only through $T(x)$.
\end{enumerate}
{\bf Moral:} If we have a prior opinion about $\theta$ in the form of a distribution, and then we rationally update our opinion after observing $X$, then we will naturally adhere to the sufficiency principle. This gives an alternative epistemological motivation for the principle.
\item[3. Mean parameterization of an exponential family]\hfill\\
Consider the $s$-parameter exponential family $\cP = \{P_\eta:\; \eta \in \Xi\}$ on $\cX$ with densities $p_\eta(x) = e^{\eta'T(x) - A(\eta)}h(x)$ with respect to a common dominating measure $\nu$. Assume $\Xi=\Xi_1^{\circ}$, the interior of the full natural parameter space, and that $\text{Var}_\eta(a'T(X))>0$ for all $a \neq 0$ and $\eta\in \Xi$.
Define the {\em mean parameter}
\[
\mu(\eta) = \EE_\eta[T(X)].
\]
We will show that this is a one-to-one mapping, so $\cP$ can alternatively be parameterized by $\mu(\eta)$ instead of $\eta$. The Bernoulli, Poisson, and exponential distributions are exponential families that are most often parameterized by their means, and parameterizations of other distributions like the normal and binomial are closely related to the mean parameterization.
Throughout this problem, you may use without proof that if the variance of any statistic $S(X)$ is positive under one $P_\eta \in \cP$ then it is positive under all $P_\eta \in \cP$ (as an optional exercise, try to prove this).
\begin{enumerate}[(a)]
\item For $s=1$, show that $\eta \mapsto \EE_\eta[T(X)]$ is a one-to-one mapping; that is, show that if $\eta_1 \neq \eta_2$ then $\EE_{\eta_1}[T(X)] \neq \EE_{\eta_2}[T(X)]$.
{\bf Hint:} You can use the differential identities.
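For reference, the differential identities from class specialize here (with $s=1$) to
\[
A'(\eta) = \EE_\eta[T(X)], \qquad A''(\eta) = \text{Var}_\eta(T(X)),
\]
valid for all $\eta$ in the interior of the natural parameter space.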
\item For $s>1$ and $\eta_1,\eta_2\in\Xi$, consider the subfamily whose parameter space is the line segment between $\eta_1$ and $\eta_2$. For $\theta \in [0,1]$, let
\[
\eta(\theta) = (1-\theta) \eta_1 + \theta \eta_2.
\]
Show that this subfamily is a one-parameter exponential family on $\cX$ with natural parameter $\theta$, and write it in standard exponential family form.
\item Combine (a) and (b) to show that $\eta \mapsto \EE_\eta[T(X)]$ is a one-to-one mapping for $s \geq 1$.
\end{enumerate}
\item[4. Multinomial family]\hfill\\
The multinomial family is a multi-category version of the binomial: it counts the number of times each category comes up when we sample a $d$-category random variable with distribution $\pi$ on $n$ independent trials. Throughout this problem assume $d \geq 3$.
If $X \sim \text{Multinom}(n, \pi)$, with all $\pi_j > 0$ and $\sum_j \pi_j = 1$, then $X$ has density
\[
p_\pi(x) = \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_d^{x_d} \cdot \frac{n!}{x_1! x_2! \cdots x_d!}, \qquad x_j \in \{0,1,\ldots,n\}, \;\; \sum_{j=1}^d x_j = n.
\]
{\bf Note:} The coordinates of $X=(X_1,\ldots,X_d)$ are {\em not} i.i.d. samples; each one corresponds to a different bin and $X_1$ is not independent of $X_2$.
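As an optional sanity check of the density above (an illustration with arbitrary small values of $n$, $d$, and $\pi$, not part of the solution), the pmf sums to 1 over all count vectors:

```python
# Sanity check (illustration only): the multinomial pmf sums to 1 over all
# nonnegative count vectors x with entries summing to n.
from itertools import product
from math import factorial

def multinom_pmf(x, pi):
    coef = factorial(sum(x))
    for xj in x:
        coef //= factorial(xj)          # exact integer division
    p = float(coef)
    for xj, pj in zip(x, pi):
        p *= pj**xj
    return p

n, pi = 4, (0.2, 0.3, 0.5)              # small hypothetical example, d = 3
total = sum(multinom_pmf(x, pi)
            for x in product(range(n + 1), repeat=3) if sum(x) == n)
print(total)  # ~1.0
```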
\begin{enumerate}[(a)]
\item Rewrite the densities as a $(d-1)$-parameter exponential family, giving an explicit form for $T(x)$, $h(x)$, $\eta$, and $A(\eta)$. Is $X=(X_1,\ldots,X_d)$ minimal sufficient?
\item Suppose a certain gene has two alleles {\bf A} and {\bf a}, and $\theta\in (0,1)$ is the unknown prevalence of allele {\bf a} in a well-mixed population. Then the proportion of people in the population with genotypes {\bf aa}, {\bf Aa}, and {\bf AA} is $\theta^2$, $2\theta(1-\theta)$, and $(1-\theta)^2$, respectively.
We can estimate $\theta$ by sampling $n$ independent individuals from the population and counting the number who have each genotype. These counts will have a joint multinomial distribution with probability parameter
\[
\pi(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2).
\]
Hence, scientific considerations might lead us to use the multinomial subfamily indexed by $\theta$:
\[
\cP = \{\text{Multinom}(n,\pi(\theta)):\; \theta \in (0,1)\}.
\]
Can $\cP$ be written as a one-parameter exponential family? Find a minimal sufficient statistic for $\cP$.
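To get a feel for this subfamily (an illustration with arbitrary values $\theta = 0.3$ and $n = 1000$, not part of the solution), one can sample genotype counts directly under the Hardy--Weinberg proportions $\pi(\theta)$:

```python
# Illustration: sample genotype counts (aa, Aa, AA) from the multinomial
# subfamily with probability vector pi(theta).  theta = 0.3 and n = 1000
# are arbitrary choices.
import numpy as np

theta, n = 0.3, 1000
pi = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])
assert abs(pi.sum() - 1.0) < 1e-12      # pi(theta) is a probability vector

rng = np.random.default_rng(0)
counts = rng.multinomial(n, pi)
print(counts, counts.sum())             # three genotype counts summing to n
```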
\end{enumerate}
\item[5. Uniform location-scale family]\hfill\\
Let $X_1, \ldots, X_n\simiid \text{Unif}[\mu-\sigma, \mu + \sigma]$, with $\mu\in\RR$ and $\sigma > 0$ unknown.
%
\begin{enumerate}[(a)]
\item Show that $T(X) = (X_{(1)}, X_{(n)})$ is minimal sufficient.
\item If $B \sim \text{Beta}(\alpha,\beta)$ then its density is proportional to $x^{\alpha-1}(1-x)^{\beta-1}$ on $x\in [0,1]$.
If $U_1,\ldots,U_n \simiid U[0,1]$, show that
\[
U_{(n)} \sim \text{Beta}(n,1), \quad\text{ which has density } p(x) = n x^{n-1},
\]
and
\[
U_{(1)}/U_{(n)} \sim \text{Beta}(1,n-1), \quad\text{ which has density } p(x) = (n-1)(1-x)^{n-2},
\]
independently of $U_{(n)}$.
{\bf Hint:} For the first part, start by writing down the CDF of $U_{(n)}$.
{\bf Hint:} For the second part, you may use without proof the fact that, conditional on $U_{(n)} = u$, the remaining $n-1$ values are i.i.d. $\text{Unif}[0,u]$, then proceed similarly to what you did for the first part.
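A quick Monte Carlo check of these claims can be reassuring before attempting the proof (an illustration with arbitrary choices $n = 5$ and 200{,}000 replications, not a proof):

```python
# Monte Carlo check (not a proof): with n = 5, the empirical mean of U_(n)
# should be near the Beta(n, 1) mean n/(n+1), and the mean of U_(1)/U_(n)
# near the Beta(1, n-1) mean 1/n.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000
U = rng.uniform(size=(reps, n))
u_max = U.max(axis=1)
ratio = U.min(axis=1) / u_max

print(u_max.mean(), n / (n + 1))        # ~0.8333
print(ratio.mean(), 1 / n)              # ~0.2
```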
\item Suppose that we wish to estimate $\mu$ under the squared error loss. The sample mean $\overline{X}$ may appear to be a reasonable estimator of $\mu$, but we might worry about the fact that it is not a function of $T(X)$.
Guided by the sufficiency principle, we could instead consider the estimator
\[\delta(X) = \frac{X_{(1)} + X_{(n)}}{2}.\]
Compute the MSE of each estimator as a function of $n, \mu,$ and $\sigma$, and show that $\delta$ strictly dominates $\overline{X}$ for $n > 2$ (the two estimators coincide for $n=2$). What happens to the ratio of their MSEs as $n\to\infty$?
{\bf Hint:} The results from part (b) should be useful. You may use without proof that $\text{Beta}(\alpha,\beta)$ has mean $\frac{\alpha}{\alpha+\beta}$ and variance $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha + \beta + 1)}$.
\item Simulate the sampling distribution of each estimator for $\mu = 0$, $\sigma = 1$, $n = 1000$. For each estimator, plot a histogram of the simulated estimates.
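A possible starting point for the simulation (a sketch only; the number of replications, the random seed, and the output file name are arbitrary choices, not requirements):

```python
# Starter sketch: simulate both estimators of mu and plot their histograms.
# reps = 2000 and the file name 'estimators.png' are arbitrary choices.
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 1000, 2000
X = rng.uniform(mu - sigma, mu + sigma, size=(reps, n))

xbar = X.mean(axis=1)                    # sample mean
delta = (X.min(axis=1) + X.max(axis=1)) / 2   # midrange estimator

print("MSE of sample mean:", np.mean((xbar - mu)**2))
print("MSE of midrange:   ", np.mean((delta - mu)**2))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(xbar, bins=40)
axes[0].set_title(r"$\bar X$")
axes[1].hist(delta, bins=40)
axes[1].set_title(r"$\delta(X)$")
for ax in axes:
    ax.set_xlabel("estimate")
axes[0].set_ylabel("count")
fig.tight_layout()
fig.savefig("estimators.png")
```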
\end{enumerate}
{\bf Moral:} Understanding and respecting the statistical structure of a model sometimes helps us to come up with estimators that perform dramatically better than the estimator we would have na\"{i}vely thought of. Here is a case where applying the sufficiency principle helped us get a much better estimator than the sample mean.
\end{description}
\end{document}