STAT 157: Probability and the Real World

This will be (slightly) edited during the semester. Dates like 8/29 indicate that a project was mentioned in class on that date.

Some possible course projects

Talk with me before starting any project, though it's good to think a little first.

Down-home experiments

These are experiments in the sense that you're going to generate your own, new, data.

Real card shuffles. 1. There's some elegant math theory about "random riffle shuffles", e.g. that it takes about 7 shuffles to make a card deck random (see Bayer-Diaconis or the more elementary Aldous-Diaconis paper). The theory assumes a kind of idealized shuffle which isn't quite what the average person really does. So there are interesting possible experiments where you compare shuffling 2 times, or 3 times, or 4 times, and then gather statistics (e.g. shape of bridge hands) from the resulting deals.

2. One could also do this with a cheap card-shuffling machine.

3. Another shuffling scheme is "smooshing", where a deck of cards is slid about on the table by two hands -- this is standard at Baccarat tables. How much does one need to smoosh? A reasonable test statistic is the number of cards originally together that are still together.
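
A simulation gives a baseline to compare real shuffles against. The sketch below assumes the Gilbert-Shannon-Reeds model for the idealized riffle shuffle, and uses as test statistic the number of cards still immediately followed by their original successor. The function names and the trial counts are illustrative choices, not part of any handout.

```python
import random

def riffle_shuffle(deck, rng):
    """One Gilbert-Shannon-Reeds riffle: binomial cut, then random interleave."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # Binomial(n, 1/2) cut point
    left, right = deck[:cut], deck[cut:]
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        remaining_left = len(left) - i
        remaining_right = len(right) - j
        # Drop the next card from a packet with probability proportional to its size.
        if rng.random() * (remaining_left + remaining_right) < remaining_left:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out

def still_together(deck):
    """Count positions where card k is immediately followed by card k+1."""
    return sum(1 for a, b in zip(deck, deck[1:]) if b == a + 1)

def mean_still_together(num_shuffles, trials=500, n=52, seed=0):
    """Average the statistic over many simulated decks shuffled num_shuffles times."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        deck = list(range(n))
        for _ in range(num_shuffles):
            deck = riffle_shuffle(deck, rng)
        total += still_together(deck)
    return total / trials

if __name__ == "__main__":
    for k in (1, 2, 3, 4, 7, 10):
        print(k, mean_still_together(k))
```

For a uniformly random ordering of n cards the expected value of this statistic is (n-1)/n, just under 1, so the printed averages show how quickly the idealized shuffle approaches that benchmark. Counts from real hand shuffles or smooshing can be set beside the same numbers.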

Is coin tossing really biased? Persi Diaconis suggests that in coin-tossing there is a small bias -- maybe 1/100 -- towards the coin landing the same way as it started. See this news item or the journal article. To test this experimentally one would need say 40,000 tosses. More importantly, one needs some very strict protocol to ensure scientific respectability. Volunteers? Your chance to get invited onto the Late Late Show!
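
A quick calculation shows why something like 40,000 tosses is needed. The sketch below reads "1/100" as P(lands same way as it started) = 0.51, and assumes a one-sided normal-approximation test at the 5% level; both are assumptions made for illustration, as is the function name.

```python
from math import sqrt
from statistics import NormalDist

def power_same_side(n, p_true=0.51, alpha=0.05):
    """Approximate power of a one-sided z-test of H0: p = 1/2 against p > 1/2."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    se0 = sqrt(0.25 / n)                    # standard error of the proportion under H0
    se1 = sqrt(p_true * (1 - p_true) / n)   # standard error under the assumed bias
    threshold = 0.5 + z_crit * se0          # reject H0 if the observed proportion exceeds this
    return 1 - NormalDist(p_true, se1).cdf(threshold)

if __name__ == "__main__":
    for n in (1_000, 10_000, 40_000):
        print(n, round(power_same_side(n), 3))
```

With 40,000 tosses the standard error of the observed proportion is 0.5/sqrt(40000) = 0.0025, so a bias of 0.01 is four standard errors away from 1/2.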

Luck-Skill spectrum. The "state fair" example in handout gives a setting with an adjustable parameter which interpolates between skill and luck. Can you invent another experiment which demonstrates the same point, and get data? The "wine cork" example deals with both "luck and skill" and with the idea of a "learning curve".

Mixing paper tickets. The 1970 draft lottery, intended to pick birthdays in random order, didn't do a very good job of randomization. This would be a nice reading project: see this site or do a Google Scholar search. The conclusion is that to physically mix a large number of objects is much harder than you think. Here's a course project. Imagine you want to run a lottery with say 100 names, and you do this by writing names on pieces of paper (e.g. stiff paper like business cards), putting them in some container (e.g. a cardboard box) and then just shaking, turning over, etc. the box for 5 minutes. Then reach in and draw out tickets. My prediction is that this does a bad job of mixing -- one can do statistical tests on the results that show the order of the 100 draws is non-random. It would be very interesting to do enough experiments to estimate how the number of shakes needed to mix grows with the number of tickets. As with the "biased coin tossing" project, it is important to have a strict protocol describing exactly what you're doing.
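
One possible statistical test, sketched below, is a permutation test on the correlation between the order tickets went into the box and the order they came out. It assumes tickets are labeled 0, 1, ..., n-1 in the order they were put in, and that draw_order lists the labels in the order drawn. The choice of statistic and the function names are illustrative; other statistics may detect other kinds of non-randomness.

```python
import random

def rank_correlation(draw_order):
    """Spearman correlation between ticket label (insertion order) and draw position.

    draw_order must be a permutation of 0..n-1.
    """
    n = len(draw_order)
    mean = (n - 1) / 2
    num = sum((pos - mean) * (label - mean) for pos, label in enumerate(draw_order))
    den = sum((k - mean) ** 2 for k in range(n))
    return num / den

def permutation_p_value(draw_order, reps=5000, seed=0):
    """Two-sided Monte Carlo p-value for the null hypothesis of a uniformly random order."""
    rng = random.Random(seed)
    observed = abs(rank_correlation(draw_order))
    perm = list(draw_order)
    hits = 0
    for _ in range(reps):
        rng.shuffle(perm)
        if abs(rank_correlation(perm)) >= observed:
            hits += 1
    return (hits + 1) / (reps + 1)
```

A small p-value says that tickets put in early tend to come out early (or late) more than chance would explain.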

Sorting physical objects. There is a classical theory of computer algorithms for sorting. A definitive treatment is in D. Knuth The Art of Computer Programming, volume 3 but many introductory textbooks on algorithms will have something. The math question is: how long, on average, does the algorithm take? Here "on average" means we assume the data starts in random order. This is a good reading project. One can make a course project by doing experiments with (say) 100 blue books; try several schemes for sorting into alphabetical order of names and see which is really quickest.

Psychology of probability. The book Cognition and Chance by Nickerson has many references to the original research experiments. It's a good source for reading projects and also for possible course projects repeating an experiment.

Studying real queues. There's an elegant and well-developed math theory of queues, but I suspect it doesn't really apply to most waiting lines we encounter in everyday life.

1. Coffee shop. A (several person) project is to sit in a coffee shop for several periods of time and record (in as much detail as practical) times of arrival to waiting lines and times of service completion. Then compare to theory models.

2. Online game rooms. Go to (say) pogo.com, go into (say) Spades and go into (say) Intermediate. You'll see a list of about 20 game rooms and how many people are in each; usually some are at or near the maximum allowed (125) and most others are nearly empty. Note this is the opposite of supermarket lines, which stay roughly balanced because customers tend to choose short lines. What's going on with the game rooms? Well, if you want to find a game to join it's more sensible to choose an almost-full room. A project is to first gather some data on room occupancies over (say) a 3-hour period, and then formulate and test some conjectures.

3. Waves in long lines (8/27). Imagine joining the end of a long line -- for the first day of a popular movie, or at airport security. The people at the front are being served in a fairly regular way. But at the back, instead of moving forward one space at a time at the same rate as the front people are being served, you move forward several spaces less frequently; a kind of "occasional wave" of movement emerges spontaneously. It's not hard to make a toy model for this situation, but as a project, can you collect some real-world data? As discussed in class, there is a similar "editing long documents" project.

The psychology of luck. The book The Luck Factor describes how people's self-assessment on a "lucky or unlucky" questionnaire correlates to various other attitudes, e.g. positive expectations for the future. So you could try replicating these results on a group of your friends or classmates.

Backtracking in computer games. If you play a human vs computer strategy game where (as in e.g. the Civilization series)
(i) past states can be stored
(ii) there's a current "human score minus computer score" count
then you can try the following. Set to a difficulty level where you usually lose. Play to the end, then backtrack to some position where you were doing relatively well, and re-start from there. Theory suggests that allowing yourself 4 such restarts should convert a 1/50 chance to a 1/2 chance of winning.

Near-misses. In class I gave a Scrabble example. Any other example of near-misses where you can get data?

Finding data to test theories: general

Generating your own new data (as in Experiments above) is a lot of work. It's often easier to find existing data somewhere on the Web (though still some work to put into the particular form needed for a project). Here are some examples, excluding sports and stock market which will be treated separately.

Near-misses and the least significant digit principle and the "local irregularity" statistic S. The theoretical predictions ought to hold in just about any reasonable data set of integer data; it would be valuable to check a large number of data sets.

Accuracy of Benford's law. The project is to get a number of different data sets (say 30) with the "highly variable" property. Recall typical examples: areas of States (in the U.S.A.) or of countries in the world; prices of items in a department store; file sizes in your PC's storage. For each data set calculate

y = an estimate of the difference between the true first-digit distribution and the Benford distribution (after subtracting sampling variation);
x = log(interquartile range) for original data.

My prediction is that, plotting (x,y) for the different data sets, you will see y decreases as x increases. Maybe the curve resembles what you would get from logNormal data?
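
The two quantities can be computed along the following lines. This is a sketch under assumptions: y is taken to be the squared distance between observed first-digit frequencies and the Benford probabilities log10(1 + 1/d), minus an estimate of the part due to sampling variation, and x uses base-10 logs. Other definitions of y and other log bases are equally defensible; the function names are illustrative.

```python
from math import log10
from statistics import quantiles

# Benford probabilities for first digits 1..9.
BENFORD = [log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(x):
    """Leading significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

def benford_discrepancy(data):
    """y: squared distance from Benford, with the expected sampling noise subtracted."""
    values = [v for v in data if v > 0]
    n = len(values)
    counts = [0] * 9
    for v in values:
        counts[first_digit(v) - 1] += 1
    freqs = [c / n for c in counts]
    raw = sum((f - b) ** 2 for f, b in zip(freqs, BENFORD))
    # f(1-f)/(n-1) is an unbiased estimate of the sampling variance of each frequency.
    noise = sum(f * (1 - f) for f in freqs) / (n - 1)
    return raw - noise

def log_iqr(data):
    """x: log10 of the interquartile range of the original data."""
    q1, _, q3 = quantiles(data, n=4)
    return log10(q3 - q1)
```

Because the noise term is subtracted, y can come out slightly negative for a data set that fits Benford well; that is expected and just means the discrepancy is indistinguishable from zero.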

Categorical data to test power laws and descriptive statistics. This data on birth names is a good start - can you find other recent data of this "percents in different categories" kind? Can you make a table showing where people claim power laws hold (analogous to my Normal table)?

Route-lengths in transportation networks. I need the following type of data. Take the 12 [or 20 or 40] largest cities in some State or Country, and find the distances between each pair by road [or rail] and in straight line. See Figure 1 of this paper for an example; but I would like to expand the 2 data-sets there to 10 data-sets. Do this well and get your name on a scientific paper!

Study some new type of social network. A social network consists of
(i) a specified set of individual people
(ii) and a specified relationship which two people may have.

Mathematically, this gives you a graph where vertices are people and edges indicate where the relationship holds. Many notions of "relationship" have been studied, but I'm sure there are some that no-one has yet thought about.

Amazon.com customer book reviews. Sample books with (say) 10-40 reviews. For each review, note date posted and number of favorable votes. There is a strong association between these variables, described in Exploratory data analysis by Robert Huang. But there's scope for much further analysis.

Online data archives. There are various archives online such as JSE Data Archive which you could search to find data to test predictions.

Finding data to test theories: sports

Comparing strategies for betting on horse races. Take say 100 horse races -- look at odds and actual winner. Use the starting odds in each to impute winning probabilities. Determine what would have happened under each of several different strategies for choosing a horse to bet on in each race. Possible strategies: bet on
favorite
2nd favorite
3rd favorite
first name in alphabetical order
cutest name
For each strategy, the data will be (number of wins; overall dollar gain or loss) and you can compare this to the theoretical (calculated from imputed probabilities) expectation of number of wins and expectation of overall loss.
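
The bookkeeping might look like the sketch below. It assumes odds are quoted as "odds against" (4.0 means 4-to-1, so a winning $1 bet returns $4 profit), and imputes probabilities by normalizing 1/(odds + 1) to sum to 1 within each race. That normalization is one common convention, not the only one. The function names and the data layout are hypothetical.

```python
def imputed_probabilities(odds):
    """Turn 'odds against' each horse into probabilities that sum to 1."""
    raw = [1.0 / (o + 1.0) for o in odds]
    total = sum(raw)  # typically exceeds 1 by the track's margin
    return [r / total for r in raw]

def kth_favorite(k):
    """Strategy: bet on the horse with the k-th shortest odds (k=1 is the favorite)."""
    def pick(odds):
        return sorted(range(len(odds)), key=lambda i: odds[i])[k - 1]
    return pick

def evaluate(races, pick, stake=1.0):
    """races: list of (odds_list, winner_index). Returns actual and expected results."""
    wins = 0
    profit = 0.0
    expected_wins = 0.0
    expected_profit = 0.0
    for odds, winner in races:
        choice = pick(odds)
        p = imputed_probabilities(odds)[choice]
        if choice == winner:
            wins += 1
            profit += stake * odds[choice]
        else:
            profit -= stake
        expected_wins += p
        expected_profit += stake * (p * odds[choice] - (1 - p))
    return wins, profit, expected_wins, expected_profit
```

Under this imputation the expected profit per race works out to stake * (1/total - 1) whichever horse is chosen, so the theoretical expected loss is the same for every strategy and only the actual results can differ.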

Timing of events within a sports match. Though the final result of a sports match should be affected by relative skill more than chance, one can ask whether, conditional on the final score, the various events within the match seem random or not. For instance

1. Consider soccer matches where the score is 1-1 at end of regulation time. Look at the times the two goals were scored. Are these uniform random? Or is there a tendency for a "quick equalizer"?

2. Common sense and theory both say that you should take more risks when losing and near the end of the match. For instance in football, classify interceptions by (which quarter? thrown by currently losing/winning team?). One expects the proportion of interceptions which are thrown by the losing team in the fourth quarter to be considerably larger than 1/8 (the proportion if uniform). Does data confirm this prediction? Can you see this effect in other sports?

3. Hot hands. For one player in a basketball game, record the sequence of successes/failures in their shots. Given the total number of successes (say 18 out of 29) do they occur in random order? Almost all sports players believe in some notion ("hot hands") that they are "on top of their form" some of the time but not at other times, so that the pattern of successes is more clumpy than it would be if truly random. But statisticians who have studied this are dubious -- data looks pretty random to them. Project: gather some data, perhaps from another sport (e.g. volleyball: kills by spikers). Then there are standard ways to analyze such data.

I just noticed but haven't investigated a web site devoted to hot hands.
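
One standard way to analyze a success/failure sequence is the Wald-Wolfowitz runs test, sketched below with a normal approximation. Given the numbers of successes and failures, a "clumpy" sequence has fewer runs than a random ordering would. The function name and the 0/1 encoding of shots are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def runs_test(shots):
    """shots: list of 1 (success) and 0 (failure), containing at least one of each.

    Returns (observed runs, expected runs, z-score, one-sided p-value for clumpiness).
    """
    n1 = sum(shots)
    n2 = len(shots) - n1
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(shots, shots[1:]) if a != b)
    mean = 1 + 2 * n1 * n2 / n
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / sqrt(var)
    p_clumpy = NormalDist().cdf(z)  # small when there are too few runs
    return runs, mean, z, p_clumpy
```

For 18 successes in 29 shots the expected number of runs under random order is 1 + 2*18*11/29, about 14.7. For a single game the normal approximation is rough, so it is better to pool many games or compute the exact permutation distribution.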

Timing of wins within a season. Some sports teams have a reputation for doing better or worse at the start or end of a season. For instance the Oakland A's have a reputation for doing better in the second half of the season. Project: find relevant data (all teams, last 10 years say) and see if such effects are seen more often than "just chance" predicts.

Regression analysis of different sports. The variability of team standings at season end reflects both difference in ability and chance. One can estimate the contribution of chance from the correlation between first half season and second half season: how does this compare across different sports? More simply, look at the 3 teams with the best records at mid-season: what is the chance one of these wins the Super Bowl/World Series/Stanley Cup etc?

Martingale behavior of realtime probabilities. Tradesports is a web site where you can bet on a sports match in progress. In particular there is a "chart of the day" showing how "contract prices", i.e. estimated chances of a particular team winning, change during the match. I have copied some of this chart data. One could test data against theory in several ways. If the opening price (scale 0 - 100) is 60, this implicitly predicts P(win) = 60%, but also predicts that P(win but price goes below 20 at some point) = 10%. Other sites like Intrade do similarly for politics etc.
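
The 10% figure follows from the optional stopping theorem. The derivation below is a sketch assuming the price (divided by 100) is a martingale with continuous paths, so that it hits 20 exactly rather than jumping past it; real contract prices move in jumps, so this is an idealization.

```latex
% M_t = price/100, assumed a continuous martingale with M_0 = 0.6,
% ending at 1 (win) or 0 (loss).
% \tau = first time M hits 0.2 or 1, and q = P(M_\tau = 0.2).
\begin{align*}
0.6 = E[M_\tau] &= 0.2\,q + 1\cdot(1-q)
  \quad\Longrightarrow\quad q = \tfrac{1}{2},\\
P(\text{win} \mid \text{price hits } 20) &= 0.2,\\
P(\text{win and price hits } 20) &= q \times 0.2 = 0.5 \times 0.2 = 0.1 .
\end{align*}
```

The same argument with other thresholds gives a whole family of testable predictions from a single opening price.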

Point difference in football. Betting on football is usually done relative to a "point spread". I would like to have data on quantities like the following for NFL games.
(i) The difference between the actual spread and the point spread
(ii) The actual spread
(iii) Point spread versus odds-to-win.

Finding data to test theories: stock market

Malkiel A Random Walk Down Wall Street is a good source for reading projects. Some course projects:

1. Find coherent data on s.d. of stock index changes over (1 day; 1 week; 1 month; 1 year) to see how well the square root law works. Then test more subtle predictions of random walk theory, e.g. the arc sine law.

2. In the context of the Kelly criterion for apportioning between stocks, suppose the annual stock gain X can be decomposed as an independent sum X_1 + X_2 where X_1 could be known at some cost (imagine var(X_1) = 0.1 var(X), say). What is the long-term advantage of knowing X_1? Do a simulation study with various distributions. (Conceptual point: this is the simplest model for studying the value of "fundamental analysis").

3. Look at historical data on annual stock returns and short term interest rates. See how well the Kelly strategy would have worked, based on modeling the next year's return as a random pick from the previous 20 years returns.

4. Find a source (maybe cnn.com) that each day provides a 1 sentence explanation of why the stock market did what it did the day before. Create a table showing, for say 30 consecutive market days, how the index changed and the 1 sentence explanation. (OK, not so intellectually challenging, but useful data!)
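
For the square root law in project 1, the comparison could be organized as in the sketch below. It assumes you have a list of daily log index levels, and uses non-overlapping h-day changes with horizons of 1, 5 and 21 trading days standing in for day, week and month. The simulated random walk is only there to exercise the code; it is not real market data, and all names are illustrative.

```python
import random
from math import sqrt
from statistics import stdev

def sd_of_changes(log_prices, horizon):
    """Standard deviation of non-overlapping changes over `horizon` days."""
    changes = [log_prices[i + horizon] - log_prices[i]
               for i in range(0, len(log_prices) - horizon, horizon)]
    return stdev(changes)

def square_root_law_table(log_prices, horizons=(1, 5, 21)):
    """Rows of (horizon, observed sd ratio to 1-day sd, sqrt(horizon) prediction)."""
    base = sd_of_changes(log_prices, 1)
    return [(h, sd_of_changes(log_prices, h) / base, sqrt(h)) for h in horizons]

def simulated_log_prices(n_days, daily_sd=0.01, seed=0):
    """A Gaussian random walk, used only as a stand-in for real data."""
    rng = random.Random(seed)
    path = [0.0]
    for _ in range(n_days):
        path.append(path[-1] + rng.gauss(0, daily_sd))
    return path

if __name__ == "__main__":
    for row in square_root_law_table(simulated_log_prices(20000)):
        print(row)
```

On simulated data the observed ratios track sqrt(h) closely by construction; the interesting question is how far real index data departs from that.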

Not Quite a Book Report

In elementary school you started "book report" projects, and through Berkeley you've done projects in the style "read some books or papers and write your own synthesis". I want this course to be different so I discourage such work. Or rather, let's try more creative versions such as

A Wikipedia entry. Write a Wikipedia entry (or entries) for a topic in this course that has no entry, or edit one with an unsatisfactory entry: for instance prediction markets.

More risks in everyday life. Look at the Ropeik-Gray book Risk. Choose one kind of risk they don't do (e.g. child abduction by stranger; flu pandemic) and write a section in the same style as the book.

Probabilities of asteroid collision with earth. This topic, mentioned occasionally through the course, is one topic that I would like a literature report on. What are the models, the data and the conclusions in the serious scientific literature?

Another such topic is "agent-based models of epidemics", e.g. this paper which I'll talk about in class.

Miscellaneous research projects

Slightly separate from this course (being less "real-world") are some ongoing research projects where undergraduates can help. You may consider doing one of these as a course project, especially if you are willing to keep going until completion (e.g. as a STAT 199 "Supervised Independent Study and Research" next semester).

Simulation studies

Simulation studies of properties of probability models aren't quite real-world but are fun anyway. Here are some possibilities.

Simulating self-organized criticality. As (may be) described in class, here is a natural 2-dimensional model for epidemics. Take a large L x L square. Individuals arrive one at a time, at uniform random positions. Usually nothing happens; but with chance L^{-3/2} the new individual is infected. In this case the infection spreads to other individuals within distance 1, and an epidemic occurs in which infection continues to spread between further individuals at distance < 1 apart. After the epidemic has run its course, remove all infected individuals. Course project: simulate this process to check theory predictions of power-law tail of distribution of number of infected individuals in an epidemic. See Wikipedia "forest-fire model" for references to related work.
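
A minimal simulation of this process might look like the sketch below. It stores individuals in unit grid cells so that neighbors within distance 1 can be found by checking only the 3 x 3 block of surrounding cells, and it counts the initially infected arrival as part of the epidemic. Both are implementation choices of this sketch, and the function names are illustrative.

```python
import random
from collections import deque

def simulate_epidemics(L, arrivals, seed=0):
    """Run the arrival/epidemic process and return the list of epidemic sizes."""
    rng = random.Random(seed)
    p_infect = L ** -1.5
    cells = {}   # unit grid cell (i, j) -> set of points currently in that cell
    sizes = []
    for _ in range(arrivals):
        point = (rng.uniform(0, L), rng.uniform(0, L))
        if rng.random() >= p_infect:
            # Healthy arrival: just add it to the population.
            cells.setdefault((int(point[0]), int(point[1])), set()).add(point)
            continue
        # Infected arrival: spread by breadth-first search over distance < 1.
        infected = 1  # counts the arrival itself
        queue = deque([point])
        while queue:
            x, y = queue.popleft()
            cx, cy = int(x), int(y)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    bucket = cells.get((cx + dx, cy + dy))
                    if not bucket:
                        continue
                    hit = [q for q in bucket
                           if (q[0] - x) ** 2 + (q[1] - y) ** 2 < 1]
                    for q in hit:
                        bucket.remove(q)  # infected individuals leave the population
                        queue.append(q)
                    infected += len(hit)
        sizes.append(infected)
    return sizes

def tail_fraction(sizes, s):
    """Empirical P(epidemic size >= s)."""
    return sum(1 for x in sizes if x >= s) / len(sizes)

if __name__ == "__main__":
    sizes = simulate_epidemics(L=30, arrivals=200_000)
    for s in (1, 10, 100, 1000):
        print(s, tail_fraction(sizes, s))
```

To look for a power-law tail, plot log tail_fraction against log s for several values of L, discarding an initial burn-in period while the population density builds up.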

Why asteroid impact probability goes up, then down. The Wikipedia explanation doesn't quite address probabilities -- simulating a more detailed model might be interesting.

Sociology models of the kind in Dynamic Models of Segregation or in Chapter 9 of Complex Adaptive Systems would be interesting to simulate.

The basic model for sports. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f.
Within this model there are many things one can study. Given f, how can we estimate the skill levels from win/loss data? For unknown f, how to estimate the function f from data?
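
For the first question, here is a sketch of maximum-likelihood estimation of skill levels when f is assumed to be the logistic function f(d) = 1/(1 + e^{-d}). That choice of f is an assumption (it gives the Bradley-Terry model); the fitting method is plain gradient ascent, and the function names and the (winner, loser) data layout are illustrative.

```python
import math
import random

def f(d):
    """Assumed win-probability function: logistic in the skill difference."""
    return 1.0 / (1.0 + math.exp(-d))

def fit_skills(n_teams, games, steps=500):
    """games: list of (winner, loser) index pairs. Returns skills centered at mean 0."""
    played = [0] * n_teams
    for w, l in games:
        played[w] += 1
        played[l] += 1
    step = 1.0 / max(played)  # conservative step size for gradient ascent
    skills = [0.0] * n_teams
    for _ in range(steps):
        grad = [0.0] * n_teams
        for w, l in games:
            surprise = 1.0 - f(skills[w] - skills[l])
            grad[w] += surprise
            grad[l] -= surprise
        skills = [s + step * g for s, g in zip(skills, grad)]
        mean = sum(skills) / n_teams  # only differences matter, so fix the mean at 0
        skills = [s - mean for s in skills]
    return skills

def simulate_games(true_skills, games_per_pair, seed=0):
    """Generate round-robin results from the model, for checking the fit."""
    rng = random.Random(seed)
    games = []
    n = len(true_skills)
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(games_per_pair):
                if rng.random() < f(true_skills[i] - true_skills[j]):
                    games.append((i, j))
                else:
                    games.append((j, i))
    return games
```

Note that the estimate for a team that wins (or loses) every game is not finite under maximum likelihood; with a fixed number of steps the code just returns a large value, so real data with undefeated teams needs some regularization.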

Bingo. When one player wins, many others have several "4 out of 5" lines. How many?

Bigger projects

Predicting sports results. Continuing on from point difference in football and the basic model for sports above, it would be fascinating to actually make predictions for football results as the season progresses. This requires a lot of work -- definitely a team project.