Central Limit Theorem Simulation

The Central Limit Theorem (CLT) tells us that if we take the average of n iid observations from a distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the average converges to a Normal distribution with mean $\mu$ and variance $\sigma^2/n$. If we sample from a Normal distribution, then the average is exactly Normal for any n and the CLT is trivially true. The challenge is to find and characterize a distribution that defies or ``holds off'' the CLT for as long as possible.

Your task is to explore different distributions/random variables and investigate how quickly, in terms of sample size, the mean of a sample drawn from each converges to the Normal. In other words, for a given n, which distributions are least ``like'' the Normal distribution? You need to decide what types of distributions to explore and then, for different (increasing) values of n, compare the sampling distribution of the sample mean from each of these to the corresponding Normal distribution given by the CLT. For each distribution, generate B samples of size n, compare this approximation of the sampling distribution to the Normal predicted by the CLT, and then repeat for different values of n. You should consider symmetric, asymmetric, and possibly multi-modal distributions, and try to find the type of distribution which ``matches'' the Normal for each n.

You need to determine how to measure ``match''. This can be done numerically via a formal ``goodness of fit'' test or, preferably, via a graphical technique (e.g. Q-Q plots, empirical CDFs), or both. The goal here is to explore different distributions and ways of comparing the sampling distribution to the corresponding Normal distribution. There is no ``right'' answer; the process of discovery -- both statistically and for R -- is the intent.

For this question, write a ``short'' report outlining what you did and discovered, using plots and numerical summaries to illustrate the essential conclusions. You do not need to show all the trials you performed, just a variety that illustrates the different characteristics. You may find R functions such as rnorm, rpois, rexp, rbinom, rbeta, and rgamma useful, along with the corresponding densities dnorm, dpois, dexp, $\ldots$.
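As a starting point, here is a minimal sketch (not a required solution) of the simulation step for a single distribution and a single n, using an Exponential(1) as an example choice; its mean and variance are both 1, so the CLT predicts the sample mean is approximately Normal with mean 1 and variance 1/n.

    B <- 1000                                      # number of simulated samples
    n <- 30                                        # sample size
    xbar <- replicate(B, mean(rexp(n, rate = 1)))  # simulated sample means

    ## standardize using the CLT's predicted mean and SD, then compare
    ## to the standard Normal with a Q-Q plot
    z <- (xbar - 1) / (1 / sqrt(n))
    qqnorm(z)
    abline(0, 1, col = "red")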

I try to get the students to think about how to "measure" how close the distributions are to the Normal and, over the course of the homework, suggest that they think about Q-Q plots and then compute the correlation between the empirical quantiles and the theoretical quantiles. I then lead them towards qqnorm to compute these quantiles without plotting them (plot.it = FALSE) and then use cor.
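A sketch of that numeric measure, reusing the simulated means xbar from the sketch above:

    qq <- qqnorm(xbar, plot.it = FALSE)   # theoretical (x) and empirical (y) quantiles
    cor(qq$x, qq$y)                       # values near 1 indicate a close match to the Normal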

Random Walk

Implement a 2-dimensional random walk in which each step moves one unit North, East, South, or West, each with equal probability. And make it efficient.
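One possible vectorized approach (a sketch, not necessarily the only efficient one) draws all the steps at once and uses cumulative sums instead of an explicit loop:

    randomWalk2D <- function(n) {
      dirs <- sample(1:4, n, replace = TRUE)      # 1 = N, 2 = E, 3 = S, 4 = W
      dx <- c(0, 1, 0, -1)[dirs]
      dy <- c(1, 0, -1, 0)[dirs]
      cbind(x = cumsum(dx), y = cumsum(dy))       # position after each step
    }

    walk <- randomWalk2D(10000)
    plot(walk, type = "l")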

Reinforced Random Walk

Implement a "reinforced" 2-D random walk efficiently. We'll define a "reinforced 2-D" random walk as one that remembers where it has been. If it arrives at a point that it has previously visited, the probability of going left/West is .25 - epsilon, the probability of going right/East is .25 + epsilon, and the other two probabilities remain at .25. If it has not visited that location before, the probabilities are uniform over the 4 directions. Make this efficient!
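A straightforward sketch (one possible starting point, not necessarily the efficient version asked for), tracking visited points in an environment used as a hash table:

    reinforcedWalk2D <- function(n, eps = 0.1) {
      visited <- new.env(hash = TRUE)
      pos <- matrix(0, n + 1, 2)                               # start at the origin
      steps <- rbind(c(0, 1), c(1, 0), c(0, -1), c(-1, 0))     # N, E, S, W
      for (i in seq_len(n)) {
        key <- paste(pos[i, ], collapse = ",")
        p <- if (exists(key, envir = visited, inherits = FALSE))
               c(.25, .25 + eps, .25, .25 - eps)               # revisited: East up, West down
             else
               rep(.25, 4)                                     # new location: uniform
        assign(key, TRUE, envir = visited)
        pos[i + 1, ] <- pos[i, ] + steps[sample(4, 1, prob = p), ]
      }
      pos
    }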

Timing Algorithms

We usually think of random variables and their associated density functions as neat mathematical functions of the value and some parameters. Mixture distributions are slightly more general versions of random variables and densities, but require more complex parameters. The idea of a mixture is relatively simple. To sample from one, we have a two-step random process. We have k simpler/regular densities and first select which one of these k densities to sample from. Having selected the component of the mixture, we generate a value from that density in the usual manner, and that gives us a value from the mixture distribution. To specify a mixture we therefore have to specify the probability of selecting component 1, 2, ..., k -- (p1, p2, ..., pk) -- and the form and parameters of each of the k component densities.

In this question, we will focus on a mixture of k Normals. Your job is to write a function that returns n sample values from a mixture of Normals. It should accept three parameters: the number of sample values (n), the probabilities for sampling from each component (p above), and an object representing the densities of the components. For the latter, one might use a k x 2 matrix to hold these parameters, or a list of length k with each element a named vector giving the mean and SD. Alternatively, one might use a list of functions to represent the densities, with each function holding its own parameters (e.g. using lexical scoping). This is the most general approach and allows one to have different types of distributions within the mixture. One could also create a new class to represent these, with methods for the class and for transforming from one form to another. You can choose for yourselves.

You will write two versions of the function in order to compare two algorithms. These are as follows.
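One possible version of such a sampler, sketched under the assumption that the component parameters are supplied as a k x 2 matrix with columns mean and sd; it uses sample to generate the component identifiers, as discussed below:

    ## one possible version; assumes params is a k x 2 matrix (mean, sd)
    rmixnorm <- function(n, p, params) {
      id <- sample(nrow(params), n, replace = TRUE, prob = p)  # which component each value comes from
      rnorm(n, mean = params[id, 1], sd = params[id, 2])       # draw from the selected Normals
    }

    ## example: a 70/30 mixture of N(0, 1) and N(5, 2)
    x <- rmixnorm(1000, p = c(.7, .3), params = rbind(c(0, 1), c(5, 2)))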

These two approaches produce the same results qualitatively, but differ in how fast they are. Generating the identifiers for the components is a sample from a multinomial distribution with parameters p; you can use sample to do this relatively easily in R. Write functions for each approach and make certain they produce similar results. You might use Q-Q plots (see qqplot) to compare the output from the two functions; if you don't see a strong linear relationship, this suggests there is a bug in one or both functions!

Next, perform multiple runs of each of these 2 functions for different sample sizes (e.g. 1, 2, 5, 10, 100, 1000, 10000) to find out how much time each function takes to complete. Use system.time() to determine how long the commands take, and use sapply() to loop over the different sample sizes to get a matrix of run-times. Then plot these times against sample size and see how the two approaches perform (scatterplot, boxplots, etc.).

Which algorithm is better? Or does it depend on sample size? Is it likely to depend on the parameters in the component densities? Or on the number of components in the mixture, i.e. k? When measuring times, especially for very short runs, make certain to think about how to get accurate answers. Simple statistical theory tells us how to do this!
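A sketch of the timing step, assuming the two samplers are named rmixnorm1 and rmixnorm2 (hypothetical names for whatever you call your two versions):

    ns <- c(1, 2, 5, 10, 100, 1000, 10000)
    p <- c(.7, .3)
    params <- rbind(c(0, 1), c(5, 2))

    ## elapsed time for each function at each sample size
    times1 <- sapply(ns, function(n) system.time(rmixnorm1(n, p, params))["elapsed"])
    times2 <- sapply(ns, function(n) system.time(rmixnorm2(n, p, params))["elapsed"])

    ## for very short runs, repeat the call many times (e.g. with replicate())
    ## and average, so timer resolution and noise do not dominate
    plot(ns, times1, type = "b", log = "x", xlab = "sample size", ylab = "elapsed seconds")
    lines(ns, times2, type = "b", col = "red")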
Duncan Temple Lang <duncan@wald.ucdavis.edu>
Last modified: Tue Jul 15 12:38:14 PDT 2008