# Stat 21: Probability and Statistics for Business, Spring 2011

Welcome to Stat 21! Here's a syllabus and course calendar. We meet in 277 Cory from 9:10 AM to 10:00 AM on Mondays, Wednesdays, and Fridays, starting on Wednesday, January 19.

## Lectures

I'll post reading assignments here for each lecture; you should try to read the assignment before the lecture. I'll also put a brief outline of each lecture here after it happens. This is not meant to be a replacement for attending lecture.
Lecture 1 (Wednesday, January 19): Administrative material (see the syllabus here) and outline of course content.
Lecture 2 (Friday, January 21): read Chapter 1, on controlled experiments. In lecture we talked about controlled experiments. Our major examples were: (1) how could we prove that you learned something in this course, or that one professor is better than another? (2) the Salk vaccine field trials for polio; and (3) incorrectly controlled experiments, such as those with non-randomized controls (for example, the SAT coaching statistics) or historical controls.
Lecture 3 (Monday, January 24): read Chapter 2, on observational studies. Why can't we design a study to show smoking is bad? Examples of natural experiments: school admissions, politics, John Snow's cholera map. The clofibrate trial; adherence, our first example of a confounding variable. Other examples: ultrasound and low birth weight, the Samaritans and suicide, cervical cancer and circumcision.
Lecture 4 (Wednesday, January 26): read Chapter 3, on histograms. In lecture: finish Chapter 2, with examples on Simpson's paradox -- batting averages, sex bias in graduate admissions. Start Chapter 3: histograms. How to draw a histogram. Berkeley travel time to work example. Endpoint conventions for real-valued data and integer-valued data; distribution of number of children.
Lecture 5 (Friday, January 28): read Chapter 4, on the average and the standard deviation. Lecture: what can we learn from histograms? Left and right tails. When does it make sense to draw a histogram? Quantitative and qualitative variables. Discrete vs. continuous. What happens to the histogram if we change the data: by adding a constant to everything? by multiplying by a constant? by doing something more complicated? Some examples of unusual histograms: distribution of starting salaries of law school graduates, distribution of heights of French conscripts.
Lecture 6 (Monday, January 31): no new reading. (We'll continue with Chapter 4.) Lecture: what is the average? why is it useful? It's easy to compute. It minimizes a measure of "total distance" from the data; see Philip Stark's SticiGui, Chapter 4, section "measure of location" for more on this. It's the point where the histogram balances. What happens if we add a single outlier to the data set? How does the average behave when we add a constant to all the data, or multiply by a constant? (The same thing happens to the average.) What if we square all the data? (Something more complicated; we'll investigate this in more depth later, when we talk about standard deviations.)
Lecture 7 (Wednesday, February 2): We defined the root-mean-square and used it to calculate the standard deviation. We compared this to computing the average absolute deviation, and saw why the SD is a better measure of spread: the SD is easier to deal with theoretically, has a nice alternative formula (the square root of (the mean of the squares minus the square of the mean)), and it behaves nicely when we add things together. The SD behaves nicely under linear transformations, what our text calls "change of scale". Various distributions can have the same mean and standard deviation; the normal curve will be the one we use the most.
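If you'd like to check the alternative formula on a computer, here's a short Python sketch (the data set is made up for illustration):

```python
import math

data = [1, 3, 4, 5, 7]  # made-up data for illustration
mean = sum(data) / len(data)

# SD as the root-mean-square of the deviations from the mean
sd_rms = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# alternative formula: square root of (mean of the squares minus square of the mean)
sd_alt = math.sqrt(sum(x ** 2 for x in data) / len(data) - mean ** 2)

print(sd_rms, sd_alt)  # both give 2.0 for this data set
```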
Lecture 8 (Friday, February 4): Reading: Chapter 5. The normal curve. (We won't cover chapter 6, on measurement error, but you should read it, as it will help to have some acquaintance with the concepts. Chapter 7 should be review; I recommend that you come back to it if you find yourself having trouble with graphs.)
Lecture 9 (Monday, February 7): Reading: Chapter 8. Scatter plots. How to predict the height of a child from the height of father and the mother; or from just the father. The result is roughly an elliptical cloud of points. The correlation coefficient r measures the shape of this cloud, or the strength of the relationship between the two variables. Distinction between dependent and independent variables. What's a high correlation?
Lecture 10 (Wednesday, February 9): continuing with correlation. The SD line, how to compute r, some properties of correlation.
Lecture 11 (Friday, February 11): Reading: Chapter 9. More properties of correlation. It's a pure number, not affected by changes of scale, always between -1 and 1. It is affected by more complicated transformations of the data (squaring, taking logs, etc.). It is very strongly affected by outliers, so be careful about that! Ecological correlation is something to be careful about; Bush-Kerry election example. If you're curious about what's going on with elections, see Red State Blue State Rich State Poor State.
Lecture 12 (Monday, February 14): Making predictions using regression. The graph of averages: we find it by taking the average of the y's corresponding to some small interval of x's (or vice versa), and the regression line is a smoothed version of it. Using the regression method on individuals. Percentiles.
Lecture 13 (Wednesday, February 16): The regression fallacy, especially as it applies to test-retest situations. Explanations of it, both in the continuous case (the normal curve) and the discrete case (two types of people). Where it occurs in reality: Sports Illustrated cover jinx, sophomore slump, white coat hypertension. The RMS error for regression: measuring the error in a prediction method by the RMS error of predictions made by it. These errors are smaller than those of the "baseline" method of predicting a dependent variable to be just the average of the dependent variable over the entire population. The RMS error for linear regression is √(1 − r²) × SD(y); thus the error is small when |r| is large (i.e. close to 1).
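You can verify the RMS-error formula numerically. This Python sketch (with made-up data) compares the RMS of the regression residuals against √(1 − r²) × SD(y):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]  # made-up data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sdx * sdy)

# regression predictions: the line through the point of averages with slope r*SD(y)/SD(x)
slope = r * sdy / sdx
preds = [my + slope * (x - mx) for x in xs]

rms_error = math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / n)
print(rms_error, math.sqrt(1 - r ** 2) * sdy)  # the two agree
```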
Lecture 14 (Friday, February 18): continue Chapter 11, on the RMS error of regression. Plotting the residuals: if we've done regression correctly, and if the underlying theory actually applies here, then there should be no trend in the residuals. Looking at vertical strips: the RMS error, the normal curve in a vertical strip. Using this to make predictions: given a certain x-value, we can predict the probability of ranges of y-values. (Note that if we have a range of x-values, making these predictions is much harder - we won't talk about this.)
Lecture 15 (Wednesday, February 23): Chapter 12. Deriving the equation of the regression line, both in point-slope and slope-intercept form. Interpretation of the coefficients -- in particular, why is the intercept of the regression line not always informative? Fertilizer example (A2 of text) - experimental design issues, problems with extrapolation. Least-squares: minimizing the sum of squares of errors E(m,b) as a function of m and b.
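Here's a quick numerical check that the regression coefficients really do minimize the sum of squared errors E(m,b); the data and the perturbation size are made up for illustration:

```python
import math

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 5, 4]  # made-up data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sdx * sdy)

# slope-intercept form of the regression line
m = r * sdy / sdx
b = my - m * mx

def E(m, b):
    """Sum of squared errors for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# nudging either coefficient away from the least-squares values increases E
assert E(m, b) < E(m + 0.1, b)
assert E(m, b) < E(m - 0.1, b)
assert E(m, b) < E(m, b + 0.1)
```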
Lecture 16 (Friday, February 25): Finish Chapter 12: when does regression make sense? Hooke's law (this is a bit more complicated than what we did; if you want to know more see Walter Lewin's video) and least-squares (which is basically regression applied to measurement). Some examples of nonlinear things that would be hard to discover by regression: formula for area of a rectangle, or for time taken for an object to fall. Effects of education are not as strong as you'd think, and in particular it doesn't matter so much where you go to college: "Who needs Harvard?" Start Chapter 13 (elements of probability): some history. Frequentists and Bayesians. The Large Hadron Collider: from Overcoming Bias and video from the Daily Show. Prediction markets such as Intrade.
Monday, February 28: exam 1; see below for information.
Lecture 17 (Wednesday, March 2): continuing with probability. Our model of what's more formally called "random variables": drawing tickets from a box. Sampling with and without replacement. Conditional probability -- what's the probability that the second card in a deck is a face card? What's this probability, given that the first card is a face card? Isn't a face card? Conditional probability really reflects what information we have. The multiplication rule and independence.
Lecture 18 (Friday, March 4): More probability (Ch. 13-14). Some explicit examples of independence and dependence. The gambler's fallacy: "hot" and "due" numbers don't exist in lotteries. (But they may exist when you're gambling on the performance of people -- this is less clear, we'd have to look at the data.) Finding probabilities by counting: what's the distribution of the results from rolling two dice? Is 9 or 10 more likely when we roll three dice (Galileo's problem)? Mutually exclusive events.
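Galileo's problem is easy to settle by brute force; this Python snippet counts all 216 equally likely outcomes of rolling three dice:

```python
from itertools import product
from collections import Counter

# count the sums over all 6**3 = 216 equally likely ordered outcomes
counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))

print(counts[9], counts[10])  # 25 vs. 27 ways, so a sum of 10 is more likely
```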
Lecture 19 (Monday, March 7): finish Chapter 14: the Chevalier de Mere's problem and the birthday problem. Chapter 15 (the binomial formula): Pascal's triangle, deriving a formula for the binomial coefficients, how to compute binomial coefficients. The lady tasting tea, which is an example of Fisher's exact test.
Lecture 20 (Wednesday, March 9): more on the binomial theorem, and the law of averages (end Ch. 15, start Ch. 16). The most likely number of successes in n trials with success probability p is np. However, this only occurs with probability around 1/sqrt(n). The "law of averages" tells us that errors in percentages or averages shrink as the number of trials grows, so the probability that the fraction of successes in n independent trials with probability p of success is within a fixed interval around p goes to 1 as n goes to infinity. But the probability of getting exactly np successes (even if np is an integer) is very small for large n. "Extreme" events happen more often with small samples.
Lecture 21 (Friday, March 11): Finish Chapter 16; start Chapter 17: the expected value and standard error. Some examples of the sum of draws.
Lecture 22 (Monday, March 14): continue Chapter 17. Formulas for the expected value and the standard error of the sum of draws. The expected value grows like the number of draws and the standard error grows like its square root. Explicit numerical examples for two draws. Examples: 17B2, 17B5. Using the normal curve for estimating the result of sums of draws. A shortcut for finding the box SD in the two-value case.
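Here's how these formulas look in Python for a hypothetical box of tickets (the box contents and the number of draws are made up):

```python
import math

box = [1, 2, 3]   # tickets in the box (made-up example)
n = 100           # number of draws, with replacement

avg = sum(box) / len(box)
sd = math.sqrt(sum((t - avg) ** 2 for t in box) / len(box))

ev = n * avg            # expected value grows like the number of draws
se = math.sqrt(n) * sd  # standard error grows like its square root

# shortcut for a box with only two values: SD = (big - small) * sqrt(p * (1 - p)),
# where p is the fraction of tickets showing the big value
big, small, p = 1, 0, 0.25
sd_two_value = (big - small) * math.sqrt(p * (1 - p))
```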
Lecture 23 (Wednesday, March 16): finish Chapter 17, start Chapter 18. The expected value and standard error for sums of draws. Using boxes of 0s and 1s for counting. For "rare events", if n of them typically happen then fluctuations of size sqrt(n) are to be expected. Introducing the normal approximation: what does it mean for histograms to approach some curve?
Lecture 24 (Friday, March 18): Chapter 18. The histograms of sums of draws approach the normal curve. Using this to make computations. The continuity correction. The scope of the normal approximation.
SPRING BREAK

Lecture 25 (Monday, March 28): start Chapter 19, on sampling. Bad methods of sampling: convenience samples of various kinds, Internet polls, online reviews, using WEIRD people as research subjects. Good methods: simple random sampling, but we still have to worry about response bias (from the way questions are worded and so on) and non-response bias (from people who don't answer the questions). The Literary Digest, Dewey defeats Truman.
Lecture 26 and 27 (Wednesday, March 30 and Friday, April 1): Ch. 19-20.
Monday, April 4: midterm 2.
Lecture 28 (Wednesday, April 6): Chapter 21, the accuracy of percentages. The beginning of statistical inference -- instead of reasoning from the population to the sample, we're going to reason from the sample to the population. The bootstrap method. Examples: 21A2 (estimating the percentage of people enrolled in college in a certain town). Confidence intervals and what they mean. Attaching confidence intervals to results from elections. 21 review 3: the percentage of people with big-screen TVs, and how the normal approximation breaks down with very small probabilities. The Poisson distribution (not on test).
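The bootstrap computation for the accuracy of a percentage, as a Python sketch (the poll numbers here are hypothetical, not from the text):

```python
import math

# hypothetical poll: 400 people sampled at random, 220 say yes
n, yes = 400, 220
p_hat = yes / n  # sample percentage, as a fraction: 0.55

# bootstrap: estimate the SD of the 0-1 box using the sample percentage itself
se_pct = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n) * 100  # SE in percentage points

# approximate 95% confidence interval: sample percentage plus or minus 2 SEs
low = p_hat * 100 - 2 * se_pct
high = p_hat * 100 + 2 * se_pct
print(f"{low:.1f}% to {high:.1f}%")  # roughly 50% to 60%
```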
Lecture 29 (Friday, April 8): (most of) Chapter 23, the accuracy of averages. The average, or expected value, of a (sample) average. Sample averages follow the normal curve, with EV and SE calculated in the usual way. The finite population correction factor still applies. Example: 23A5.
Lecture 30 (Monday, April 11): finish Chapter 23: Confidence intervals for sample averages, and some issues in their interpretation. 23C6: matching averages with the boxes they come from, and a brief digression into maximum likelihood.
Lecture 31 (Wednesday, April 13): Chapter 24 (measurement error). If we measure something several times we get a better idea of it than if we measure just once. The Gauss model for measurement. Poincare's bread. Electrical load cells used to weigh trucks (24A2). A bit on propagation of uncertainty (link goes to Wikipedia article, which is more advanced mathematically than what we did). I had hoped to talk about Benford's law but didn't get to it.
Lecture 32 (Friday, April 15): start Chapter 26 (significance testing). The basic logic of significance testing. Null and alternative hypotheses. The z-score and P-values. NYC blackout example.
Lecture 33 (Monday, April 18): continue Chapter 26 (zero-one boxes). Working with qualitative data -- the same computations as in the previous lecture, but with 0-1 boxes. ESP example. The t-test, for small samples.
Lectures 34 and 35 (Wednesday, April 20 and Friday, April 22): Chapter 27, more tests for averages. The standard error of the difference of two independent random quantities. How does correlation change this? Examples: bias of editors of journals towards positive articles (27 review 9). Significance of differences in batting averages, and how this depends on sample size. Analysis of experimental data in treatment-control setup; we make two errors in this analysis but they cancel each other out.
Lecture 36 (Monday, April 25): Chapter 28, the chi-squared test for equality of distributions. Testing if a die is loaded. How to compute chi-squared. Why it's a good measure of goodness of fit. How to use it for hypothesis testing. Chi-squared with one degree of freedom is equivalent to the z-test. Alameda county jury age example (28A7).
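The chi-squared computation for the loaded-die test, sketched in Python (the observed counts are for illustration):

```python
# observed counts of each face over 60 rolls (illustrative numbers)
observed = [4, 6, 17, 16, 8, 9]
expected = [sum(observed) / 6] * 6  # 10 of each face if the die is fair

# chi-squared: sum of (observed - expected)^2 / expected over the categories
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)  # 14.2, up to rounding; compare to the chi-squared curve
               # with 6 - 1 = 5 degrees of freedom
```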
Lecture 37 (Wednesday, April 27): Chapter 28 continued, the chi-squared test for independence. How to calculate the expected distribution in this case. Independence of hair color and sex. Independence of sex and voting behavior (28C2).
Lecture 38 (Friday, April 29): Chapter 29, some caveats about hypothesis testing.

## Homework

You should only hand in the required problems, but please do the recommended problems as well to prepare for section meetings.
Homework 1, due Tuesday, February 1, in section: recommended problems: 2A #1, #5, #9, #11. 3A #4, #7, #8. 3B #1. 3C #1, #4. 3E #2. required problems: 2 review 1, 5, 8; 3 review 3, 6, 7, 11.
Homework 2, due Tuesday, February 8, in section: recommended problems: 4A 1, 7, 8, 9. 4B 1, 2. 4C 1, 2, 5. 4D 6, 8. 4E 4-8. 5A 2, 5B 1, 3, 5, 5C 1, 2. 5D 1, 5. 5E 3, 5F 1. required problems: 4 review 6, 8, 9, 11; 5 review 1, 4, 8, 10.
Homework 3, due Tuesday, February 15, in section: recommended problems: ch. 8 A3, 5, 6; B6, 8; C3; D1, 2. ch. 9 A6, 8; B1; C1, 3; D1; E4, 5. required problems: ch. 8 review 3, 8, 9, 11. ch. 9 review 3, 4, 8, 10.
Homework 4, due Tuesday, February 22, in section: recommended problems: Chapter 10: A3, A4, B1, B3, B4, C1, C2, D1, D2, E3. Chapter 11: A3, A8, B1, B2, C2, C3, D3, D7, E1, E2. required problems: Ch. 10 review 2, 4, 7, 9; Ch. 11 review 5, 7, 10, 12
Recommended problems from Chapter 12, which you should do before the exam: Ch. 12 A1, 3, 4; B1, 3; review problems 2, 5, 7, 9.
Homework 5, due Tuesday, March 8, in section: recommended problems: Chapter 13: A1, A5, B1, B2, B3, B4, C1, C7, D1, D4, D7. Chapter 14: A2, A4, B2, B4, B6, C3, C4, D6, D7. Required problems: Chapter 13 review 4, 6, 9, 11; Chapter 14 review 1, 9, 10, 14.
Homework 6, due Tuesday, March 15, in section: ch. 15 A1-A6; ch. 16 A3, 4, 6; B3, 4; C2, 3. Required problems: ch. 15 review 3, 8, 9, 10; ch. 16 review 4, 5, 9, 10.
Homework 7, due Tuesday, March 29, in section: recommended: ch. 17: A1,6; B1,4,6; C2,4; D3,4; E2,3,6. ch. 18: A2,4,5; B3,4; C2,5,8. Required: 17 review: 1, 4, 8, 13; 18 review: 2, 4, 5, 12.
Recommended problems from Chapters 19 and 20, which you should do before the second exam: Chapter 19: all exercises in set A; review 4, 6, 8, 9, 12. Chapter 20: A2, A4, B2, B3, C3, C5; review 3, 6, 11, 12. (Note that this is more problems than usual -- but you don't have to write them up!)
Homework 8, due Tuesday, April 12, in section: recommended problems: Ch. 21: A2-A6, B2, B4, C3, C5, C8, D2, E2. Ch. 23: A2, A3, A6, A10, B5, B6, B7. required problems: Ch. 21 review 5, 8, 11. Ch. 23 review 3, 7, 8, 10.
Homework 9, due Tuesday, April 19, in section: recommended problems: Ch. 24: A3, A4, A5, B4, C3, C5, C7. Ch. 26: A3, B1, B5, C3, C5, D3, D4, D5. required problems: Ch. 24 review 1, 3, 9, 11. Ch. 26 review 1, 7.
Homework 10, due Tuesday, April 26, in section: recommended problems: Ch. 26 E2, E3, E4, E5, F4, F5, F6. Ch. 27 A2, A6, B4, B7, C2, C3, C4, D2, D5, D9. Ch. 28 A4, A5, A6, A9. Required problems (to be turned in): Ch. 26 review 2, 4, 11; Ch. 27 review 3, 6, 8; Ch. 28 review 2, 6.
"Homework 11", not to be collected: from the end-of-section exercises: Ch. 28 B2, C3, C5, C6; Ch. 29 A2, B2, B3, B4, B8, C2, C8, D5. From the end-of-chapter exercises: Ch. 28 review 2, 3, 6, 9; Ch. 29 review 3, 4, 10, 11. In the case of 29 B2-B4: between exercises 2 and 3 some probabilities are given that you should be able to compute; make sure you know how to compute them. Statistics on homework grades:
• Homework 1: 61 students submitted. 25 points available. Mean 23.3 (93%), median 24 (96%), SD 2.5 (10%)
• Homework 2: 62 students submitted. 41 points available. Mean 33.4 (83%), median 35 (85%), SD 4.5 (11%)
• Homework 3: 63 students submitted. 36 points available. Mean 30.0 (83%), median 31 (86%), SD 5.0 (14%)
• Homework 4: 60 students submitted. 31 points available. Mean 28.7 (93%), median 30 (97%), SD 3.0 (10%)
• Homework 5: 64 students submitted. 34 points available. Mean 32.0 (94%), median 32 (94%), standard deviation 2.7 (8%).
• Homework 6: 63 students submitted. 38 points available. Mean 35.0 (92%), median 36 (95%), standard deviation 4.5 (12%)
• Homework 7: 65 students submitted. 30 points available. Mean 27.7 (92%), median 29 (97%), standard deviation 3.5 (12%)
• Homework 8: 62 students submitted. 40 points available. Mean 36.3 (91%), median 38 (95%), standard deviation 5.0 (12%)
• Homework 9: 64 students submitted. 29 points available. Mean 27.7 (96%), median 29 (100%), standard deviation 2.2 (7%)
• Homework 10: 57 students submitted. 48 points available. Mean 42.3 (88%), median 44 (92%), standard deviation 5.5 (11%)
Note that each homework counts the same amount towards your grade, despite the differing numbers of points. You can compute a composite homework score by the following process:
• take your score on each homework and divide it by the total number of points possible. (Use 0 for any homework you did not submit.)
• remove the smallest one of these numbers. (If you did not submit all the homeworks, you'll remove a 0 here.)
• add the remaining nine numbers together.
This will give you a number between 0 and 9. Among those students who have submitted at least one homework assignment, the average of these scores is 7.92 (88%), the median is 8.44 (94%), and the standard deviation is 1.32 (15%).
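The three steps above, written out in Python for a hypothetical student's scores (the score list is made up; the points-available list is from the table above):

```python
points_available = [25, 41, 36, 31, 34, 38, 30, 40, 29, 48]
scores = [23, 33, 0, 28, 32, 35, 27, 36, 27, 42]  # hypothetical; 0 = not submitted

# step 1: divide each score by the points possible
fractions = [s / p for s, p in zip(scores, points_available)]

# step 2: remove the smallest fraction (here the unsubmitted homework's 0)
fractions.remove(min(fractions))

# step 3: add up the remaining nine numbers
composite = sum(fractions)  # a number between 0 and 9
print(round(composite, 2))
```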

## Exams

Exams are the midterms, on Monday, February 28 and Monday, April 4 in class, and the final, on Monday, May 9.

### Exam 1

The first midterm will cover Chapters 1 through 5 and 8 through 12 of the text. You're allowed a calculator and handwritten notes on one side of an 8.5-by-11-inch or A4 sheet of paper that you've prepared yourself. You don't need to bring a blue book.
There is also a study guide and midterm 1 from Stat 20 last semester. Stat 20 and Stat 21 are very similar, so this should be useful. Note that this midterm covered through Chapter 11 (not Chapter 12) and in particular has no questions about the regression line, but the regression line will be on the exam.
There are also solutions to the first Stat 20 exam. When this exam was originally given, the average grade was 81; the median was 84; the standard deviation 14.
The "special review problems" from the text are also useful as preparation: Ch. 6 all except #8, and Ch. 15 #1-17. Answers to these problems are available on bspace.
Exam 1 post-mortem: Here are solutions to the first exam. Grades are posted on bspace. Some statistics: 68 students took the exam. The mean grade was 72.1, the median was 74, and the standard deviation was 15.9. The distribution of grades was as follows:
| Score | 95-100 | 90-94 | 85-89 | 80-84 | 75-79 | 70-74 | 65-69 | 60-64 | 55-59 | 50-54 | 45-49 | 40-44 | 35-39 | 30-34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Students | 5 | 6 | 5 | 7 | 10 | 7 | 8 | 5 | 6 | 2 | 4 | 0 | 2 | 1 |

See the syllabus for the approximate distribution of letter grades.

### Exam 2

The second midterm will cover Chapters 13 through 20 of the text. You're allowed a calculator and handwritten notes on one side of an 8.5-by-11-inch or A4 sheet of paper that you've prepared yourself. You don't need to bring a blue book. These are the same rules as for exam 1.
There is also midterm 2 from Stat 20 last semester. Stat 20 and Stat 21 are very similar, so this should be useful. Note that last semester's exam covered Chapters 12 through 20, while this semester's exam covers Chapters 13 through 20. There are also solutions to the second Stat 20 exam. When this exam was originally given, the average grade was 89; the median was 92; the standard deviation 10; this semester's exam will probably be more difficult.
Special review problems for the exam: Ch. 15 #18-20, Ch. 23 #13-26. Solutions are available on bspace.
There is also a study guide for this exam.
Exam 2 post-mortem: Here are solutions to the second exam. Grades for the exam are posted on bspace. 66 students took the exam. The mean grade was 78.3, the median was 81, the standard deviation 12. The distribution of grades was as follows:
| Score | 95-100 | 90-94 | 85-89 | 80-84 | 75-79 | 70-74 | 65-69 | 60-64 | 55-59 | 50-54 | 45-49 | 40-44 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Students | 2 | 9 | 11 | 15 | 7 | 6 | 6 | 2 | 3 | 1 | 0 | 2 |
The coefficient of correlation, r, between scores on the first exam and scores on the second exam was 0.70.

### Exam 3

The final exam covers all those parts of the text that we covered -- that is, everything except Chapters 6, 7, 22, and 25. You're allowed a calculator and handwritten notes on three sides of an 8.5-by-11-inch or A4 sheet of paper that you've prepared yourself. You don't need to bring a blue book. These are the same rules as for exam 1.
There is a review sheet for the final; in addition there are review sheets for the midterms which you should go back and look at.
The "special review problems" at the end of Chapter 29 of the text are good review problems; solutions are posted
Here is last semester's Stat 20 final. When this exam was originally given, the average was 116 (out of 200), the median was 121, and the standard deviation was 32. There are no written solutions to this exam. Feel free to ask questions about it in office hours or by e-mail.

### Distribution of the sum of exam scores

If you add your two exam scores together, you'll get a number between 0 and 200. The average of these sums of scores is 151, the median 156, the standard deviation 25.5. The distribution is as follows:
| Score | 190-200 | 180-189 | 170-179 | 160-169 | 150-159 | 140-149 | 130-139 | 120-129 | 110-119 | 100-109 | 90-99 | 80-89 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Students | 3 | 4 | 10 | 13 | 8 | 10 | 4 | 6 | 2 | 4 | 1 | 1 |

### Distribution of overall scores

Above you were given directions on how to compute a homework score, which is out of 9 points; let's call this H. You also have the sum of your two exam scores; call this E. Then you can compute an overall score, out of 60, as (20/9) H + (.2) E; this is your score going into the final exam. The distribution of these scores is as follows:
| Score | 58-60 | 56-58 | 54-56 | 52-54 | 50-52 | 48-50 | 46-48 | 44-46 | 42-44 | 40-42 | 38-40 | 36-38 | 34-36 | 30-34 | 20-30 | less than 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Students | 1 | 5 | 3 | 10 | 11 | 10 | 7 | 2 | 6 | 4 | 3 | 1 | 2 | 0 | 3 | 2 |
The 10th, 20th, ..., 90th percentiles are 35.9, 41.5, 43.9, 47.1, 49.6 (median), 50.3, 51.8, 53.2, and 54.5.

## Resources

Philip Stark's online Stat 21 textbook is very useful. In particular it has a large number of interactive demonstrations.