STAT 157: Probability and the Real World

Some possible course projects

Talk with me before starting any project, though it's good to think a little first.

Down-home experiments

These are experiments in the sense that you're going to generate your own, new, data.

Real card shuffles. There's some elegant math theory about "random shuffles", e.g. that it takes about 7 shuffles to make a card deck random (see Bayer-Diaconis or the more elementary Aldous-Diaconis paper). The theory assumes a kind of idealized shuffle which isn't quite what the average person really does. So there are interesting possible experiments where you compare shuffling 2 times, or 3 times, or 4 times, and then gather statistics (e.g. shape of bridge hands) from the resulting deals.
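For comparison with hand-shuffled data, the idealized shuffle can be simulated. The sketch below assumes the Gilbert-Shannon-Reeds riffle model (the model analyzed in the Bayer-Diaconis paper); the function names and the choice of "rising sequences" as a summary statistic are illustrative, not part of the course material.

```python
import random

def riffle_shuffle(deck, rng=random):
    """One idealized (Gilbert-Shannon-Reeds) riffle shuffle of a list."""
    n = len(deck)
    # Cut point is Binomial(n, 1/2): each card independently goes to the left packet.
    cut = sum(rng.random() < 0.5 for _ in range(n))
    left, right = deck[:cut], deck[cut:]
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        rem_left, rem_right = len(left) - i, len(right) - j
        # Drop the next card from a packet with probability proportional to its size.
        if rng.random() * (rem_left + rem_right) < rem_left:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out

def rising_sequences(deck):
    """Number of rising sequences in a deck whose cards are labeled 0..n-1."""
    pos = {card: k for k, card in enumerate(deck)}
    return 1 + sum(pos[c + 1] < pos[c] for c in range(len(deck) - 1))

if __name__ == "__main__":
    deck = list(range(52))
    for k in range(1, 8):
        deck = riffle_shuffle(deck)
        print(k, "shuffles:", rising_sequences(deck), "rising sequences")
```

A deck in its original order has one rising sequence, and each idealized riffle can at most double the count, so real shuffles that break this pattern are visibly departing from the model.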

Is coin tossing really biased? Persi Diaconis suggests that in coin-tossing there is a small bias -- maybe 1/100 -- towards the coin landing the same way as it started. See this news item or the journal article. To test this experimentally one would need, say, 40,000 tosses. More importantly, one needs some very strict protocol to ensure scientific respectability. Volunteers? Your chance to get invited onto the Late Late Show!
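One way to arrive at a figure of that size is a standard-error calculation. The sketch below assumes the bias means P(same side) = 0.51 rather than 0.50, and that we want the shift to be about 4 standard errors; both the function name and the choice of 4 are assumptions for illustration.

```python
import math

def tosses_needed(bias, z=4.0):
    """Number of tosses n at which a shift of `bias` in P(same side)
    equals z standard errors of the observed proportion."""
    # For p near 1/2 the standard error of a proportion is sqrt(0.25 / n).
    # Solving bias = z * sqrt(0.25 / n) for n gives n = 0.25 * (z / bias)^2.
    return math.ceil(0.25 * (z / bias) ** 2)

def standard_error(n):
    """Standard error of the observed proportion after n fair tosses."""
    return math.sqrt(0.25 / n)

if __name__ == "__main__":
    print(tosses_needed(0.01))      # roughly 40,000
    print(standard_error(40_000))   # 0.0025, i.e. a quarter of the claimed bias
```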

Luck-Skill spectrum. The "state fair" example in the handout gives a setting with an adjustable parameter which interpolates between skill and luck. Can you invent another experiment which demonstrates the same point, and get data? The "wine cork" example deals with both "luck and skill" and with the idea of a "learning curve".

Mixing paper tickets. The 1970 draft lottery, intended to pick birthdays in random order, didn't do a very good job of randomization. This would be a nice reading project: see this site or do a Google Scholar search. The conclusion is that physically mixing a large number of objects is much harder than you think. Here's a course project. Imagine you want to run a lottery with say 100 names, and you do this by writing names on pieces of paper (e.g. stiff paper like business cards), putting them in some container (e.g. a cardboard box) and then just shaking, turning over, etc. the box for 5 minutes. Then reach in and draw out tickets. My prediction is that this does a bad job of mixing -- one can do statistical tests on the results that show the order of the 500 draws is non-random. It would be very interesting to do enough experiments to estimate how the number of shakes needed to mix grows with the number of tickets. As with the "biased coin tossing" project, it is important to have a strict protocol describing exactly what you're doing.
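One simple test of this kind checks whether tickets that went into the box early tend to come out early (or late). The sketch below is only one possible choice: it assumes tickets are labeled 0..n-1 in the order they were put in, uses the Spearman rank correlation as the test statistic, and gets a p-value by permutation. The function names are hypothetical.

```python
import random

def rank_correlation(order):
    """Spearman correlation between draw position and ticket label.
    order[k] is the label (0..n-1) of the k-th ticket drawn."""
    n = len(order)
    d2 = sum((k - label) ** 2 for k, label in enumerate(order))
    return 1 - 6 * d2 / (n * (n * n - 1))

def permutation_p_value(order, trials=2000, rng=random):
    """Two-sided p-value: how often a truly random order gives a
    correlation at least as extreme as the observed one."""
    observed = abs(rank_correlation(order))
    perm = list(order)
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)
        if abs(rank_correlation(perm)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A small p-value says the draw order still remembers the loading order. Other statistics (e.g. how often neighbors in the original stack are drawn consecutively) may be more sensitive to the ways a box actually fails to mix.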

Psychology of probability. The book Cognition and Chance by Nickerson has many references to the original research experiments. It's a good source for reading projects and also for possible course projects repeating an experiment.

Studying a real queue. There's an elegant and well-developed math theory of queues, but I suspect it doesn't really apply to most waiting lines we encounter in everyday life. A (several person) project is to sit in a coffee shop for several periods of time and record (in as much detail as practical) times of arrival to waiting lines and times of service completion. Then compare to theory models.
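As a sketch of the "compare to theory" step, the code below takes recorded arrival and service-completion times and compares the observed mean time in the system with the M/M/1 prediction 1/(mu - lambda). It assumes a single server, first-come-first-served order, and times sorted by arrival; the M/M/1 model is just the simplest textbook choice, and the function name is made up.

```python
def mm1_comparison(arrivals, departures):
    """Compare observed mean time in system with the M/M/1 formula.

    arrivals[i], departures[i]: arrival and service-completion time of
    customer i (single server, first-come-first-served, sorted by arrival).
    Returns (arrival rate, service rate, observed mean, predicted mean).
    """
    n = len(arrivals)
    lam = (n - 1) / (arrivals[-1] - arrivals[0])   # estimated arrival rate
    service_times = []
    prev_departure = float("-inf")
    for a, d in zip(arrivals, departures):
        start = max(a, prev_departure)             # service starts when server frees up
        service_times.append(d - start)
        prev_departure = d
    mu = n / sum(service_times)                    # estimated service rate
    observed = sum(d - a for a, d in zip(arrivals, departures)) / n
    predicted = 1 / (mu - lam) if mu > lam else float("inf")
    return lam, mu, observed, predicted
```

A large gap between the observed and predicted means is a first hint that the Poisson-arrival, exponential-service assumptions do not describe the coffee shop.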

Finding data to test theories

Generating your own new data (as in Experiments above) is a lot of work. It's often easier to find existing data somewhere on the Web (though it still takes some work to put the data into the particular form needed for a project). Here are some examples, excluding sports and the stock market, which will be treated separately.

Accuracy of Benford's law. The project is to get a number of different data sets (say 30) with the "highly variable" property. For each data set calculate

y = an estimate of the difference between the true first-digit distribution and the Benford distribution (after subtracting sampling variation);
x = log(interquartile range) for original data.

My prediction is that, plotting (x,y) for the different data sets, you will see y decreases as x increases. Maybe the curve resembles what you would get from logNormal data?
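Here is one hedged way to compute the (x, y) pair for a single data set. The choice of squared distance for y, and the correction term used to subtract sampling variation, are assumptions of this sketch rather than definitions from the course; the function names are illustrative.

```python
import math
import statistics
from collections import Counter

# Benford probabilities for first digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(x):
    """Leading nonzero decimal digit of a nonzero number."""
    return int(f"{abs(x):.15e}"[0])

def benford_gap(data):
    """y: squared distance between the first-digit frequencies and Benford,
    minus an estimate of the part due to sampling variation alone."""
    digits = [first_digit(x) for x in data if x != 0]
    n = len(digits)
    counts = Counter(digits)
    freqs = [counts[d] / n for d in range(1, 10)]
    raw = sum((f - b) ** 2 for f, b in zip(freqs, BENFORD))
    # E[(f - b)^2] = (p - b)^2 + p(1 - p)/n, and f(1 - f)/(n - 1)
    # is an unbiased estimate of the second term.
    noise = sum(f * (1 - f) for f in freqs) / (n - 1)
    return raw - noise

def log_iqr(data):
    """x: natural log of the interquartile range of the original data."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return math.log(q3 - q1)
```

Because of the correction, y can come out slightly negative for data that follow Benford's law closely; that is expected and just means the gap is indistinguishable from zero.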

Route-lengths in transportation networks. I need the following type of data. Take the 10 [or 20 or 40] largest cities in some State or Country, and find the distances between each pair by road [or rail] and in straight line. See Figure 1 of this paper for an example; but I would like to expand the 2 data-sets there to 10 data-sets. Do this well and get your name on a scientific paper!

Study some new type of social network. A social network consists of
(i) a specified set of individual people
(ii) and a specified relationship which two people may have.

Mathematically, this gives you a graph where vertices are people and edges indicate where the relationship holds. Many notions of "relationship" have been studied, but I'm sure there are some that no-one has yet thought about.

Not Quite a Book Report

In elementary school you started "book report" projects, and through Berkeley you've done projects in the style "read some books or papers and write your own synthesis". I want this course to be different so I discourage such work. Or rather, let's try more creative versions such as

A Wikipedia entry. Write a Wikipedia entry (or entries) for a topic in this course that doesn't have a satisfactory entry.

More risks in everyday life. Look at the Ropeik-Gray book Risk. Choose one kind of risk they don't do (e.g. child abduction by stranger; flu pandemic) and write a section in the same style as the book.

Miscellaneous research projects

Slightly separate from this course (being less "real-world") are some ongoing research projects where undergraduates can help. You may consider doing one of these as a course project, especially if you are willing to keep going until completion (e.g. as a STAT 199 "Supervised Independent Study and Research" next semester).

XXX

1/19. Sorting: algorithms and cards.

There is a classical theory of computer algorithms for sorting. A definitive treatment is in D. Knuth The Art of Computer Programming, volume 3, but many introductory textbooks on algorithms will have something. The math question is: how long, on average, does the algorithm take? Here "on average" means we assume the data starts in random order. This is a good reading project. One can make a course project jointly with the project on mixing lottery-type tickets above. In that project, to repeat the experiment one needs to put the 500 cards back into order. If you try this without planning it may take you several hours. So think of several possible efficient schemes, and see which works fastest in practice.
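To get a feel for the "on average" question, one can count comparisons on random permutations. The sketch below compares two textbook algorithms, insertion sort and merge sort, on 500 items; the choice of these two and the use of comparison counts as a stand-in for time are assumptions, and hand-sorting physical cards has costs (moving piles, reading labels) that this does not capture.

```python
import random

def insertion_sort_comparisons(items):
    """Number of comparisons insertion sort makes on `items`."""
    a, count = list(items), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            count += 1
            if a[j - 1] <= a[j]:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return count

def merge_sort_comparisons(items):
    """Return (sorted list, number of comparisons) for top-down merge sort."""
    a = list(items)
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, c_left = merge_sort_comparisons(a[:mid])
    right, c_right = merge_sort_comparisons(a[mid:])
    merged, count, i, j = [], c_left + c_right, 0, 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count

def average_comparisons(n=500, trials=20, rng=random):
    """Average comparison counts (insertion, merge) over random permutations."""
    total_ins = total_mer = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total_ins += insertion_sort_comparisons(perm)
        total_mer += merge_sort_comparisons(perm)[1]
    return total_ins / trials, total_mer / trials
```

For n = 500 the insertion-sort average is on the order of n^2/4 comparisons while merge sort needs on the order of n log2 n, which is the kind of gap that turns "several hours" into something manageable.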

1/24. Predicting sports results

In any sport, as the season progresses we see the results of games already played: how might we use these to try to predict the results of future games? Consider for instance College Basketball. There will be 64 teams invited to the NCAA Championship in March. Once we know the 64 teams, but before any championship matches are played, can we estimate the chance of each team being the ultimate winner? Here is one statistical approach. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f. Use past results (and a computer) to find the function f and the skill levels for each team which give the "best fit" to past game results. Then it is easy to make tournament predictions.

Project: Carry out some such prediction scheme. This will require some work in data collection, plus programming skills. Definitely a team project.
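A minimal sketch of the fitting step follows. It assumes f is the logistic function f(d) = 1/(1 + e^(-d)) (a Bradley-Terry-type model) instead of estimating f from data, fits the skills by plain gradient ascent on the log-likelihood, and adds a small ridge penalty so that undefeated teams get finite skills. All names and tuning constants are illustrative.

```python
import math

def win_prob(skill, a, b):
    """Model probability that team a beats team b (logistic f is an assumption)."""
    return 1 / (1 + math.exp(-(skill[a] - skill[b])))

def fit_skills(games, teams, steps=2000, lr=0.1, ridge=0.01):
    """Fit one skill level per team from a list of (winner, loser) pairs."""
    skill = {t: 0.0 for t in teams}
    for _ in range(steps):
        # Gradient of the penalized log-likelihood with respect to each skill.
        grad = {t: -ridge * skill[t] for t in teams}
        for winner, loser in games:
            surprise = 1 - win_prob(skill, winner, loser)
            grad[winner] += surprise
            grad[loser] -= surprise
        for t in teams:
            skill[t] += lr * grad[t]
    return skill

if __name__ == "__main__":
    games = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
    skill = fit_skills(games, ["A", "B", "C"])
    print(skill, win_prob(skill, "A", "C"))
```

With fitted skills in hand, a tournament prediction can be obtained by simulating the bracket many times using `win_prob` for each match and counting how often each team wins.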

1/24. Timing of events within a sports match

Though the final result of a sports match should be affected by relative skill more than chance, one can ask whether, conditional on the final score, the various events within the match seem random or not. For instance

1. Consider soccer matches where the score is 1-1 at end of regulation time. Look at the times the two goals were scored. Are these uniform random? Or is there a tendency for a "quick equalizer"?

2. Common sense and theory both say that you should take more risks when losing and near the end of the match. For instance in football, classify interceptions by (which quarter? thrown by currently losing/winning team?). One expects the proportion of interceptions which are thrown by the losing team in the fourth quarter to be considerably larger than 1/8 (the proportion if uniform). Does data confirm this prediction? Can you see this effect in other sports?

3. Hot hands. For one player in a basketball game, record the sequence of successes/failures in their shots. Given the total number of successes (say 18 out of 29) do they occur in random order? Almost all sports players believe in some notion ("hot hands") that they are "on top of their form" some of the time but not at other times, so that the pattern of successes is more clumpy than it would be if truly random. But statisticians who have studied this are dubious -- data looks pretty random to them. Project: gather some data, perhaps from another sport (e.g. volleyball: kills by spikers). Then there are standard ways to analyze such data.
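For the hot-hands question, one standard analysis is a runs test conditional on the total number of successes. The sketch below does it by permutation: shuffle the observed sequence many times and see how often the shuffled version has as few runs as the real one. The function names are hypothetical, and other statistics (e.g. success rate after a success versus after a failure) are equally reasonable.

```python
import random

def count_runs(seq):
    """Number of maximal blocks of equal consecutive outcomes."""
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

def runs_p_value(seq, trials=5000, rng=random):
    """One-sided p-value for clumpiness: the chance that a random ordering
    of the same successes and failures has at most as many runs."""
    observed = count_runs(seq)
    perm = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)
        if count_runs(perm) <= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

Few runs means successes and failures are bunched together, which is what a "hot hand" would produce; a p-value that is not small is consistent with random order.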

1/24. Timing of wins within a season

Some sports teams have a reputation for doing better or worse at the start or end of a season. For instance the Oakland A's have a reputation for doing better in the second half of the season. Project: find relevant data (all teams, last 10 years say) and see if such effects are seen more often than "just chance" predicts.

1/26. Everyday predictions

Here is the Griffiths-Tenenbaum paper discussed in class. Project: contact them and either
convince them their theoretical predictions are wrong;
or ask for their data so you can do the predictions right.

1/26. More data for predicting future duration

Here is the Doomsday argument and here is Carlton Caves' critique of Gott's argument. A reading project is to describe Gott's data. A course project is to find some new data to test such predictions. Another (or perhaps the same) course project is to think of any real-world setting to which the following model (discussed in class) applies:

given a sample T from Uniform[0,t_0] with unknown t_0, find a frequentist C.I. or Bayesian posterior for t_0.
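For reference, here is a short worked version of both answers. The frequentist interval uses only the fact that T/t_0 is Uniform[0,1]; the Bayesian answer assumes the scale-invariant prior proportional to 1/t_0, which is one common choice and not the only defensible one.

```latex
% Frequentist: U = T / t_0 is Uniform[0,1], so for 0 < \alpha < 1
\Pr\!\left( \tfrac{\alpha}{2} \le \frac{T}{t_0} \le 1 - \tfrac{\alpha}{2} \right) = 1 - \alpha
\quad\Longrightarrow\quad
\frac{T}{1 - \alpha/2} \;\le\; t_0 \;\le\; \frac{T}{\alpha/2} .

% Bayesian, assuming the prior \pi(t_0) \propto 1/t_0 :
% likelihood is (1/t_0) \mathbf{1}\{t_0 \ge T\}, so
p(t_0 \mid T) = \frac{T}{t_0^{2}} \quad (t_0 \ge T),
\qquad
\Pr(t_0 > s \mid T) = \frac{T}{s} \quad (s \ge T) .
```

With alpha = 0.05 the frequentist interval is T/0.975 <= t_0 <= 40T, i.e. the remaining duration t_0 - T lies between T/39 and 39T.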

1/26. Swedish Lottery game

A reading project is to find literature on this game (for once, Google doesn't work instantly because something else has the same name). A course project is to experiment with computer simulations of different people using different adaptive strategies, to see which strategies work best and whether strategies converge or behave chaotically.

1/31. Game theory

The Hawk-Dove and Battle-of-the-Sexes examples were from Haigh Taking Chances. There are many possible reading projects within game theory.

2/7. Stock market.

The graphs were from Malkiel A Random Walk Down Wall Street which is a good source for reading projects. Some course projects:

1. Find coherent data on s.d. of stock index changes over (1 day; 1 week; 1 month; 1 year) to see how well the square root law works. Then test more subtle predictions of random walk theory, e.g. the arc sine law.

2. In the context of the Kelly criterion for apportioning between stocks, suppose the annual stock gain X can be decomposed as an independent sum X_1 + X_2 where X_1 could be known at some cost (imagine var(X_1) = 0.1 var(X), say). What is the long-term advantage of knowing X_1? Do a simulation study with various distributions. (Conceptual point: this is the simplest model for studying the value of "fundamental analysis").

3. Look at historical data on annual stock returns and short term interest rates. See how well the Kelly strategy would have worked, based on modeling the next year's return as a random pick from the previous 20 years' returns.
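For the third project, the core computation is choosing the fraction of wealth to hold in stocks each year. The sketch below assumes a two-asset setting (stocks and a risk-free rate), no borrowing or short-selling (fraction between 0 and 1), annual returns greater than -100%, and a simple grid search; the function name and these restrictions are illustrative choices.

```python
import math

def kelly_fraction(past_returns, rate, grid=101):
    """Fraction f in [0, 1] of wealth to hold in stocks that maximizes the
    expected log growth, when next year's stock return is modeled as a
    uniform random pick from `past_returns` and the rest earns `rate`.
    Returns are decimals (0.07 means +7%) and must exceed -1."""
    best_f, best_growth = 0.0, -math.inf
    for k in range(grid):
        f = k / (grid - 1)
        growth = sum(math.log(1 + f * r + (1 - f) * rate)
                     for r in past_returns) / len(past_returns)
        if growth > best_growth:
            best_f, best_growth = f, growth
    return best_f
```

To backtest, step through the historical record one year at a time: call `kelly_fraction` on the previous 20 years' returns and that year's interest rate, apply the resulting split to the year's actual return, and track the accumulated wealth against, say, an all-stock portfolio.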

3/2. Simulating self-organized criticality

As described in class, here is a natural 2-dimensional model for epidemics. Take a large L x L square. Individuals arrive one at a time, at uniform random positions. Usually nothing happens; but with chance L^{-3/2} the new individual is infected. In this case the infection spreads to other individuals within distance 1, and an epidemic occurs in which infection continues to spread between further individuals at distance < 1 apart. After the epidemic has run its course, remove all infected individuals. Course project: simulate this process to check theory predictions of a power-law tail for the distribution of the number of infected individuals in an epidemic. See Wikipedia "forest-fire model" for references to related work.
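A minimal simulation sketch of this model follows. It stores individuals in unit grid cells so that neighbors within distance 1 can be found by checking only the 3 x 3 block of surrounding cells; that data structure, the function names, and the default parameters are implementation choices of this sketch, not part of the model.

```python
import math
import random
from collections import defaultdict

def run_epidemic(cells, seed):
    """Spread infection from `seed` through all individuals linked by
    steps of distance < 1, removing them from `cells`. Returns the size."""
    cells[(int(seed[0]), int(seed[1]))].remove(seed)
    stack, size = [seed], 0
    while stack:
        x, y = stack.pop()
        size += 1
        cx, cy = int(x), int(y)
        for i in range(cx - 1, cx + 2):
            for j in range(cy - 1, cy + 2):
                bucket = cells.get((i, j))
                if not bucket:
                    continue
                near = [q for q in bucket if math.dist((x, y), q) < 1]
                if near:
                    # Newly infected individuals leave the population at once,
                    # so each one is counted exactly once.
                    cells[(i, j)] = [q for q in bucket
                                     if math.dist((x, y), q) >= 1]
                    stack.extend(near)
    return size

def simulate_epidemics(L, arrivals, rng=random):
    """Run the arrival process on an L x L square; return epidemic sizes."""
    p_infect = L ** -1.5
    cells = defaultdict(list)      # unit grid cell -> individuals in it
    sizes = []
    for _ in range(arrivals):
        point = (rng.uniform(0, L), rng.uniform(0, L))
        cells[(int(point[0]), int(point[1]))].append(point)
        if rng.random() < p_infect:
            sizes.append(run_epidemic(cells, point))
    return sizes
```

To look for a power-law tail, run with a large L and many arrivals, discard an initial burn-in period, and plot the empirical tail of the recorded sizes on log-log axes.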

3/9. Phylogenetic tree models

Here is a non-technical paper. Doing further statistical analysis of trees from e.g. TREEBASE would be an interesting course project.

The psychology of luck

The book The Luck Factor describes how people's self-assessment on a "lucky or unlucky" questionnaire correlates with various other attitudes, e.g. positive expectations for the future. So you could try replicating these results on a group of your friends or classmates.