Statistics 215B: Applied Statistics. Spring 2012

Course format:

3 hours of lecture per week, divided between discussing particular applications and papers, and presenting theory and methodology. There will be written assignments roughly every two weeks, and a term project that includes a written report and an oral class presentation. I hope that term projects will lead to publishable research: Bring your favorite data or favorite scientific problem.

The written assignments will largely be drawn from Freedman's book Statistical Models: Theory and Practice (2009 revised edition). I will not be lecturing on all the chapters from which I assign problems: I expect students to read and digest the material on their own, but I am happy to answer questions in class or in office hours, and if something turns out to be a stumbling block for more than a few students, I will lecture on it. I plan to reserve most of the lecture time to talk about particular applications and case studies.

List of pervasive themes:

List of applications (preliminary, time permitting):

Techniques and tools likely to be discussed:

Reading list (preliminary):

Assignments

  1. Read Freedman, Statistical Models: Theory and Practice (SMTP), Chapters 1–4;
    Freedman, Statistical Models and Causal Inference: A Dialogue with the Social Sciences (SMCI), Chapters 1, 8;
    (Chapter 8 is also available at http://statistics.berkeley.edu/~stark/Preprints/611.pdf.)
    Shearer & Stark, 2011.
    [Due 1/26 in class] Freedman, SMTP, problems 4.B.7, 4.B.8, 4.B.11, 4.5.3, 4.5.5, 4.5.6, 4.5.10, 4.5.11.
  2. [Due 2/2 in class. Relates to the climate change paper we discussed in class on 1/17.]
    1. Consider a random walk with n=137 steps, constructed as follows:
      X(0) = 0.
      [X(i) - X(i-1)], i = 1, …, 136, are IID, and take the value +1 or -1 with probability 1/2 each.
      You will test the hypothesis that a = 0 on the assumption that the data (or subsets of the data) come from the normal linear model X(i) = ai + b + ε_i, where the errors {ε_i} are IID N(0, σ²), with σ² unknown (to be estimated from the data), based on fitting the model by OLS.
      • (a) By simulation, estimate the actual significance level of a nominal 5% test of the hypothesis a = 0. That is, estimate how often the OLS estimate of the slope a is "statistically significant at level 5%" when the significance calculation assumes that the normal linear model is true. Justify your choice of the number of replications in the simulation.
      • (b) By simulation, estimate the chance that the sign of the slope of the line fitted (by OLS) to the last 58 points in the series differs from the slope of the line fitted (by OLS) to the entire series of 137 points. Justify your choice of the number of replications.
      • (c) By simulation, estimate the chance that the sign of the slope of the line fitted (by OLS) to the last 58 points in the series differs from the slope of the line fitted (by OLS) to the entire series of 137 points, and that both estimated slopes are statistically significant at level 5%. Justify your choice of the number of replications.
      • (d) By simulation, estimate the chance that the sign of the slope of the line fitted (by OLS) to some contiguous block of 58 points in the series differs from the slope of the line fitted (by OLS) to the entire series of 137 points, and that both estimated slopes are statistically significant at level 5%. Justify your choice of the number of replications.
      • (e) By simulation, estimate the chance that the sign of the slope of the line fitted (by OLS) to some contiguous block of at least 30 points in the series differs from the slope of the line fitted (by OLS) to the entire series of 137 points, and that both estimated slopes are statistically significant at level 5%. Justify your choice of the number of replications.
    2. Now consider a different generating process:
      X(0) = 0. X(1) = 1.
      P([X(i) - X(i-1)] = [X(i-1) - X(i-2)]) = p, and
      P([X(i) - X(i-1)] = -[X(i-1) - X(i-2)]) = 1-p, i = 2, …, 136.
      By simulation, estimate the probabilities in parts 1(a)–1(e) (above) when this process (rather than the random walk) generates the data, for p = 0.7, 0.8, and 0.9.
    3. What do you conclude about the significance of estimated regression coefficients when the regression model did not generate the data? What do you conclude about the climate change study? Discuss.
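Part 1(a) above can be sketched in a few lines of pure Python. This is an illustrative outline, not a full solution: the seed, the replication count, and the cutoff 1.98 (roughly the two-sided 5% critical value of Student's t with 135 degrees of freedom) are my own choices, whereas the assignment asks you to justify the number of replications yourself.

```python
import math
import random

def ols_slope_t(y):
    """OLS slope and its t-statistic for y regressed on x = 0, 1, ..., n-1."""
    n = len(y)
    x = range(n)
    xbar = (n - 1) / 2
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)   # nominal SE, assuming IID normal errors
    return slope, slope / se

def random_walk(n_steps=136, rng=random):
    """X(0) = 0; increments are +1 or -1 with probability 1/2 each."""
    x = [0]
    for _ in range(n_steps):
        x.append(x[-1] + rng.choice((-1, 1)))
    return x

random.seed(12345)   # arbitrary seed, for reproducibility
reps = 2000          # illustrative; the assignment asks you to justify this choice
hits = sum(abs(ols_slope_t(random_walk())[1]) > 1.98 for _ in range(reps))
print("estimated true level of the nominal 5% test:", hits / reps)
```

The same harness extends to parts (b) through (e) (fit OLS to sub-blocks of the series and compare signs and significance) and, with a different increment rule, to the persistent process in part 2.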
  3. Read Freedman, SMTP, Chapter 7; Freedman, SMCI, Chapters 12, 13; White et al. (2011).
    [Due 2/27].
    1. Freedman, SMTP, problems 7.B.2, 7.B.3, 7.C.5, 7.D.7, 7.E.2, 7.E.3, 7.E.10, 7.5.2, 7.5.3, 7.5.4, 7.5.5
    2. As we discussed in class, the experimental design used by White et al. does not match the way they analyzed the data. Their design was stratified on various things (study center, severity of disease, etc.), but Fisher's exact test and the Kruskal-Wallis test assume simple randomization without stratification. Moreover, the study does not seem to account for multiplicity in the use of Fisher's exact test to compare three pairs of treatments. This assignment looks at the effect on apparent p-values of the mismatch between the design and the analysis, and of the failure to account for multiplicity.
      We have a population of 632 subjects (White et al. had 641, and then some were lost or excluded, and some responses were imputed; we're simplifying slightly). Of these, 158 subjects are assigned at random to each of four treatments. Consider a binary outcome variable, for instance, a variable that is 1 if, at 52 weeks, the subject has improved by either 2 or more points on the Chalder fatigue questionnaire or by 8 or more points on the Short Form-36, and has improved on both; and that is zero otherwise. Let N denote the total number of 1s among the 632 subjects.
      1. Suppose N = 80 for the moment. Allocate those 80 1s at random to the four treatment groups (control and three others). Find the three p-values for pairwise comparisons of control to each of the other three treatments using Fisher's exact test. Repeat the random allocation 1,000 times. What's the estimated chance that at least one p-value is below 0.05? What's the estimated chance that at least two p-values are below 0.05? Plot the empirical CDF of the smallest p-value in each simulation. Repeat this simulation for N = 160 and N = 320 and report the results.
      2. The previous simulation ignored the stratification by centers. Invent a generalization of Fisher's exact test that takes stratification into account: the randomization across treatments does not mix across centers. Think of at least three ways to combine results across strata to get an overall test statistic. Explain what alternatives they should have the most power against.
      3. Code the test in the previous question that you like best. Base the p-value on simulation, since the test statistic no longer has a hypergeometric distribution.
      4. Suppose centers 1 and 2 have 106 subjects and centers 3–6 each have 105 subjects. Suppose that the reported results are as follows, where the numbers in parentheses are the numbers of subjects allocated to the treatment and the numbers not in parentheses are the numbers of 1s in the group.
        Center   Control   Treatment 1   Treatment 2   Treatment 3
        1        10 (27)   15 (27)       20 (26)       20 (26)
        2        10 (27)   15 (27)       20 (26)       20 (26)
        3        10 (27)   15 (26)       20 (26)       20 (26)
        4        20 (27)   15 (26)       15 (26)       10 (26)
        5        20 (27)   15 (26)       15 (26)       10 (26)
        6        20 (27)   15 (26)       15 (26)       10 (26)
        For the three paired comparisons with control, compare simulated p-values that take stratification into account with the p-values for Fisher's exact test (which ignores stratification). Try to find different sets of reported results that would make the two p-values differ as much as possible for some paired comparison. What happens if the centers have different sizes? Can you use Simpson's paradox to construct examples where the sign of the effect is reversed?
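The random-allocation simulation in question 1 above can be sketched as follows. This is an illustrative outline under my own assumptions, not a full solution: Fisher's exact test is implemented from scratch (two-sided, summing the hypergeometric probabilities of all tables no more probable than the observed one), the seed is arbitrary, and the replication count is reduced from the 1,000 the assignment specifies.

```python
import math
import random

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same margins
    whose probability does not exceed that of the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, row1)
    def prob(x):   # chance of x 1s in group 1, given the margins
        return math.comb(col1, x) * math.comb(n - col1, row1 - x) / denom
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-12))   # tolerance for floating-point ties

def simulate(N, reps, rng=random):
    """Allocate N 1s among 632 subjects, 158 per group, uniformly at random;
    estimate the chance that at least one of the three control-vs-treatment
    Fisher p-values falls below 0.05."""
    pop = [1] * N + [0] * (632 - N)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pop)
        ones = [sum(pop[158 * g:158 * (g + 1)]) for g in range(4)]
        pvals = [fisher_exact_p(ones[0], 158 - ones[0], ones[g], 158 - ones[g])
                 for g in (1, 2, 3)]
        hits += min(pvals) < 0.05
    return hits / reps

random.seed(215)   # arbitrary seed, for reproducibility
frac = simulate(80, reps=200)   # the assignment asks for 1,000 replications
print("estimated P(at least one p-value < 0.05), N = 80:", frac)
```

The stratified test of question 3 can reuse the same pattern: permute treatment labels within each center separately, compute per-center statistics, and combine them (for instance, by summing) before comparing to the simulated null distribution.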
  4. Read Golomb et al. 2010; Jönrup and Rennermalm 1976; Kaptchuk et al. 2006; Berk et al. 2009 and 2011.
    [Due 3/19].
    1. Simulate 1,000 IID N(0,1) random variables. Take the subset that are larger than 2. Find a 1-sided (upper) p-value of the z-test of the hypothesis that the subset you selected is a random sample from a N(0,1) population. Repeat this overall simulation 1,000 times. Plot the empirical CDF of the p-values. What fraction are below 0.1? Why is that fraction so much larger than 0.1? Isn't the null hypothesis true? Discuss.
    2. Simulate 1,000 IID N(0,1) random variables, as before, but instead of selecting those that are larger than 2, select the 50 that are largest. Find a 1-sided (upper) p-value of the z-test of the hypothesis that the subset you selected is a random sample from a N(0,1) population. Repeat this overall simulation 1,000 times. Plot the empirical CDF of the p-values. What fraction are below 0.1? Why is that fraction so much larger than 0.1? Isn't the null hypothesis true? Discuss. What's the difference between this situation and the first situation?
    3. Reproduce the simulations described in the "Simulation Results" section of the Berk et al. 2009 paper, that produced figures 3–7. Reproduce the figures. Repeat the simulations, this time constructing 95% confidence intervals for any variables that are selected. What fraction of the confidence intervals constructed cover their corresponding parameter? Is there a notable difference between the coefficient in the model that is actually zero and those that are not? Discuss.
    4. Simulate 600 IID N(0,1) random variables. Divide them into 6 groups of 100. Perform multiple linear regression of the first group onto the following 20 variables: the 5 other groups, the squares of the 5 other groups, the cubes of the other 5 groups, and the reciprocals of the other 5 groups. Select any estimated coefficients that are statistically significant at level 0.05. Construct 95% confidence intervals for just those "significant" coefficients. Note the number of confidence intervals you constructed, and the fraction of them that include zero (the true population value of all the coefficients in this set-up). Repeat the simulation of 600 variables, the regression, the selection, and the construction of confidence intervals, a total of 1,000 times. What fraction of simulations gave one or more confidence intervals? What fraction of simulations gave one or more confidence intervals that did not contain zero? What fraction of the confidence intervals you constructed overall contained zero? Discuss.
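Question 1 above can be sketched in Python as follows. This is an illustrative outline, not a full solution: the z-test uses the known standard deviation of 1, the normal CDF is computed from math.erf, and the seed and the reduced replication count (200 rather than the 1,000 the assignment specifies) are my own choices.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def selected_pvalue(rng=random, n=1000, cutoff=2.0):
    """Draw n IID N(0,1) variables, keep those above the cutoff, and return
    the upper 1-sided z-test p-value for the null hypothesis that the kept
    values are a random sample from N(0,1) (known sd = 1)."""
    kept = [x for x in (rng.gauss(0, 1) for _ in range(n)) if x > cutoff]
    if not kept:   # vanishingly rare: no draw exceeded the cutoff
        return 1.0
    z = sum(kept) / len(kept) * math.sqrt(len(kept))
    return 1 - phi(z)

random.seed(4)   # arbitrary seed, for reproducibility
pvals = [selected_pvalue() for _ in range(200)]
frac = sum(p < 0.1 for p in pvals) / len(pvals)
print("fraction of p-values below 0.1:", frac)
```

For question 2, keep the 50 largest draws instead of those above a fixed cutoff; the rest of the harness is unchanged.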
  5. Read McCormick et al. 2012; Pan et al. 2012; Freedman, SMCI, Chapter 11.
    [Due 4/9].
    Comment critically (not necessarily negatively) on McCormick et al.

P.B. Stark, statistics.berkeley.edu/~stark. http://statistics.berkeley.edu/~stark/Teach/S215B/S12/index.htm. Last modified 9 April 2012.