Central Limit Theorem Simulation
The Central Limit Theorem (CLT) tells us that if we take the
average of n iid observations from a distribution with mean mu
and variance sigma2, the sampling distribution of the average
converges to a Normal distribution with mean mu and
variance sigma2/n. If we sample from a Normal distribution,
then the average is also normal for any n and the CLT is trivially
true. The challenge is to find and characterize a distribution that
defies or ``holds off'' the CLT for as long as possible. Your task
is to explore different distributions/random variables and explore
how quickly, in terms of sample size, the mean of a sample drawn
from these converges to the Normal. In other words, for a given
n, which distributions are least ``like'' or fit the Normal
distribution.
You need to figure out what type of distributions to explore and
then, for different (increasing) values of n, compare the sampling
distributions of the sample mean from these distributions to the
corresponding Normal distribution from the CLT. You should generate
B samples of size n for each distribution and then compare this
approximation to the sampling distribution to the Normal predicted
by the CLT. And then repeat this for different values of n.
You should consider symmetric, asymmetric, possibly multi-modal
distributions and try to find the type of distribution which
``matches'' the Normal for each n. You need to determine how to
measure ``match''. This can be done numerically via a formal
``goodness of fit'' test or, preferably, via a graphical technique
(e.g. Q-Q plots, empirical CDFs), or both. The goal here is to
explore different distributions and ways of comparing the sampling
distribution to the corresponding Normal distribution. There is no
``right'' answer and the process of discovery -- both statistically
and for R -- is the intent.
For this question, you should write a ``short'' report
outlining what you did and discovered, using plots and numerical summaries
to illustrate the essential conclusions. You do not need to show all
the trials you performed, but just a variety that illustrate the
different characteristics.
You may find R functions like rnorm, rpois, rexp, rbinom, rbeta,
rgamma, useful,
along with the corresponding densities dnorm, dpois, dexp, $\ldots$.
I try to get the students to think about how to "measure" how close
the distributions are to the Normal and over the course of
the homework suggest that they think about Q-Q plots and to
then to compute the correlation between the empirical quantiles
and the theoretical quantiles. And then lead them towards
qqnorm to compute these without plotting them (plot.it = FALSE)
and then use cor.
Random Walk
Implement a 2-dimensional random walk
where you go one step North, East, South or West, each with equal probability.
And make it efficient.
Reinforced random walk
Implement a "reinforced" 2-D random walk efficiently. We'll define a
"reinforced 2-D" random walk as one that remembers where it has been.
If it arrives at a point that it has previously visited, the
probability of going to the left/West is .25 - epsilon, to the
right/East is .25 + epsilon, and the other two probabilities remain at
.25.
If we haven't visited that location before, the probabilities are
uniform for the 4 different directions.
Make this efficient!
<2>Timing Algorithms
Timing of algorithms. We usually think of random variables and
associated density functions as being neat mathematical functions
expressed as a function of the value and some parameters. Mixture
distributions are slightly more general versions of random variables
and densities, but require more complex parameters. The idea of a
mixture is relatively simple. To sample from one we have a two step
random process. We have k simpler/regular densities and first
select which one of these k densities to sample from. Having
selected the component of the mixture, we generate a value from that
density in the usual manner for that density, and then we have a
value from the mixture distribution. To specify a mixture we
therefore have to specify the probability of selecting component
1, 2, ... 3 -- (p1, p2, ..., pk) -- and the
form and parameters for each of the k component densities.
In this question, we will focus on a mixture of k Normals. Your
job is to write a function that returns n sample values from a
mixture of Normals. It should accept three parameters: the number
of sample values (n), the probabilities for sampling from each
component (p above), and an object representing the densities of
the components. For the latter, one might choose to use a k x
2 matrix to represent these parameters or a list of length k, each
a named vector giving the mean and SD. Alternatively one might use
a list of functions to represent the densities with each having its
own parameters stored within the function (e.g. using lexical
scoping). This is the most general approach and allows one to have
different types of distributions within the mixture. One can create
a new class to represent these also and provide methods for the
class and to transform from one form to another. You can choose for
yourselves.
You will write two versions of the function
in order to compare two algorithms.
These are as follows.
-
The simplest way is to select which of the k distributions to
use for each of the n points and then iterate over these n ``component
identifiers'' for the distribution from which to sample and take a
sample of size 1 from the corresponding density.
- An alternative approach is to generate the component identifiers
for the n deviates but then to determine how many observations
are to be taken from each of the k components. Then iterate over
the k components and generate the appropriate number of random
values from each of these.
These two approaches produce the same results qualitatively, but
are different in terms of how fast they are.
Generating the identifiers for the components is a sample from
a multinomial distribution with parameters p. You can use
sample to do this relatively easily in R.
Write functions for each approach. Make certain they produce similar
results. You might use Q-Q plots (see qqplot) to compare
the output from the two functions. If you don't see a strong linear
relationship, this suggests there is a bug in one or both functions!
Next, perform multiple runs of each of these 2 functions for
different sample sizes (e.g. 1, 2, 5, 10, 100, 1000, 10000) to find
out how much time each function takes to complete. Use
system.time() to determine how long the commands take to
complete. Use sapply() to loop over different sample sizes
to get a matrix of sample run-times for the different sample sizes.
Then plot these times against sample size and see how the two
approaches perform. (Scatterplot, boxplots, etc.) Which algorithm is
better? Or does it depend on sample size? Is it likely to depend on
the parameters in the component densities? or the number of
components in the mixture, i.e. k?
When measuring times, especially for very short runs,
make certain to think about how to get accurate answers.
Simple statistical theory tells us how to do this!
Duncan Temple Lang
<duncan@wald.ucdavis.edu>
Last modified: Tue Jul 15 12:38:14 PDT 2008