STAT 157: Some possible course projects

Note: these have been collected over many years, and are not so well organized. They are intended to illustrate the broad range of possibilities -- you are not required to do any specific project here.

These are listed in two ways: by topic, linked to lectures, and by "style of project". So the same project often appears twice.

Talk with me before doing substantial work on any project, though it's good for you to think a little first.

Projects, by class topic

Lecture 1: Everyday perception of chance.
(1) I described three ways to get data on the contexts where "ordinary people" think about chance -- blogs, Twitter, and search engine queries. Can you find another way?
(2) Repeat the "searching in blogs" project with different words or phrases. First you need to find a good way to search which gets results from "personal" blogs rather than more professional or commercial ones (which will be more prominent in search results). In particular, finding examples of usage of "likely" and "unlikely" would be interesting.
(3) Can you give a better categorization of these "everyday life" contexts, compared with my own list?
(4) Look at the examples and think of a classification of why the person is interested in the probability of an event. For instance, is it because they intend to make a decision based on the likelihood, or is it just curiosity, or what?
(5) Regarding the list "Which math probability predictions are actually verifiable?", can you find other textbook theory which you can test via some new data?
(6) Look at recent books/articles on Big Data/Data Science which have a lot of examples; make some detailed classification, analogous to
my classification of probability contexts, of the real-world contexts in which (it is asserted that) Big Data will play a large role in the future.

Lecture 2: The Kelly criterion for favorable games: stock market investing for individuals.
(1) Review academic studies of data relevant to the efficient market hypothesis in general, or (in particular) the performance of managed mutual funds relative to index funds.
(2) Do a better analysis (than my graphic in class) of how well the stock market matched the "Kelly" prediction that, from any starting date, with probability q the market will (at some point) go below q times its starting value.
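One can probe this prediction by simulation before going to data. The sketch below is my own stand-in model, not the class analysis: a geometric Brownian motion market with drift mu = sigma^2, the case in which the theory says the chance of ever dipping below q times the starting value is exactly q. The parameter values are purely illustrative, and the finite horizon makes the simulated fractions fall slightly short of q.

```python
import math
import random

def dip_fraction(q, n_paths=800, years=100, dt=0.02, sigma=0.2, seed=1):
    """Fraction of simulated market paths that ever fall below q times the
    starting value.  Log-price increments are Normal with mean
    (sigma**2/2)*dt and s.d. sigma*sqrt(dt): geometric Brownian motion
    with drift mu = sigma**2, for which theory predicts the dip
    probability (over an infinite horizon) is exactly q."""
    rng = random.Random(seed)
    n_steps = int(years / dt)
    mean, sd = 0.5 * sigma ** 2 * dt, sigma * math.sqrt(dt)
    target = math.log(q)
    hits = 0
    for _ in range(n_paths):
        log_price = 0.0
        for _ in range(n_steps):
            log_price += rng.gauss(mean, sd)
            if log_price < target:          # path has dipped below q
                hits += 1
                break
    return hits / n_paths

dip = {q: dip_fraction(q) for q in (0.5, 0.8)}
for q, frac in dip.items():
    print(f"q = {q}: fraction of paths dipping below q * start = {frac:.3f}")
```

To compare against real data, replace the simulated log-price increments by historical log-returns from each starting date.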
(3-99) There are many other possible projects involving finance, for those with their own ideas.

Lecture 3: Risk to individuals: perception and reality.
(1) Write a report, in the style of the Ropeik-Gray Risk book (see this sample section), on some particular risk.
(2) In particular, the "moderate alcohol consumption is beneficial to health" assertion is somewhat controversial -- look at discussion of this issue and read some of the actual papers describing studies.
(3) As one example of recent studies of the health effect of lifestyle factors, see The combined effect on survival of four main behavioural risk factors for non-communicable diseases by Martin-Diener et al. on the list of papers. Read and write a report on papers of this style.
(4) Take some risks that are currently in the news, and see how well they fit the 13 psychological factors listed in Ropeik's How Risky Is It, Really? Why Our Fears Don't Always Match the Facts.
(5) Write a report on academic statistical studies of the relationships between people's perception of the relative size of different risks, and their actual size.
(6) Do your own survey of what Berkeley students perceive as risky, in the style of my 2011 class survey, but using risks for which you can find some objective data on the actual size of the risk.

Lecture 4: Psychology of probability: predictable irrationality.
(1) The book Cognition and Chance by Nickerson has many references to the original research experiments. It's a good source for reading projects and also for possible course projects repeating an experiment.
(2) The brief Wikipedia article Probability matching has the note "You can help Wikipedia by expanding it."

Lecture 5: Game theory.


________________________________________________
Material below not yet edited for 2014
__________________________________________________

8/30: Coding and entropy. Similar in spirit is the "Road route networks linking 4 addresses" project on this page.

9/1: Prediction markets, fair games and martingales. See section 3.5 of the lecture write-up for the kinds of theoretical predictions that one could check against data.

9/13: Mixing and sorting.
(1) Real shuffles. There's some elegant math theory about "random riffle shuffles", e.g. that it takes about 7 shuffles to make a card deck random (see Bayer-Diaconis or the more elementary Aldous-Diaconis paper). The theory assumes a kind of idealized shuffle which isn't quite what the average person really does. So there are interesting possible experiments where you compare shuffling 2 times, or 3 times, or 4 times, and then gather statistics (e.g. shape of bridge hands) from the resulting deals. See the recent paper by Assaf et al. for related theory.
(2) One could also do this with a cheap card-shuffling machine.
(3) Another shuffling scheme is "smooshing", where a deck of cards is slid about on the table with two hands -- this is standard at Baccarat tables. How much does one need to smoosh? A reasonable test statistic is the number of cards originally together that are still together.
(4) Mixing paper tickets. The 1970 draft lottery, intended to pick birthdays in random order, didn't do a very good job of randomization. This would be a nice reading project: see this site or do a Google Scholar search. The conclusion is that physically mixing a large number of objects is much harder than you might think. Here's a course project. Imagine you want to run a lottery with say 100 names, and you do this by writing names on pieces of paper (e.g. stiff paper like business cards), putting them in some container (e.g. a cardboard box) and then just shaking, turning over, etc the box for 5 minutes. Then reach in and draw out tickets. My prediction is that this does a bad job of mixing -- one can do statistical tests on the results that show the order of the 100 draws is non-random. It would be very interesting to do enough experiments to estimate how the number of shakes needed to mix grows with the number of tickets.
(5) Sorting physical objects. There is a classical theory of computer algorithms for sorting. A definitive treatment is in D. Knuth, The Art of Computer Programming, volume 3, but many introductory textbooks on algorithms will have something. The math question is: how long, on average, does the algorithm take? Here "on average" means we assume the data starts in random order. This is a good reading project. One can make a course project by doing experiments with (say) 100 blue books; try several schemes for sorting into alphabetical order of names and see which is really quickest.
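Returning to the shuffling projects (1)-(3): the idealized shuffle that the theory assumes is easy to simulate. It is the Gilbert-Shannon-Reeds model: cut the deck at a Binomial(52, 1/2) position, then drop cards from each packet with probability proportional to the packet's current size. Here is a sketch, using the "cards originally together that are still together" statistic from (3) as the measure of mixing (my choice of statistic here, not the one used in the theory papers):

```python
import random

def gsr_riffle(deck, rng):
    """One Gilbert-Shannon-Reeds riffle shuffle: cut at a Binomial(n, 1/2)
    position, then drop cards from each packet with probability
    proportional to the packet's current size."""
    cut = sum(rng.random() < 0.5 for _ in deck)
    left, right = deck[:cut], deck[cut:]
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        n_left, n_right = len(left) - i, len(right) - j
        if rng.random() < n_left / (n_left + n_right):
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out

def preserved_adjacencies(deck):
    """Number of originally-adjacent pairs (c, c+1) still adjacent, in order."""
    pos = {card: k for k, card in enumerate(deck)}
    return sum(1 for c in range(len(deck) - 1) if pos[c + 1] == pos[c] + 1)

rng = random.Random(0)
trials = 200
avg_adj = {}
for shuffles in (1, 2, 3, 7):
    total = 0
    for _ in range(trials):
        deck = list(range(52))
        for _ in range(shuffles):
            deck = gsr_riffle(deck, rng)
        total += preserved_adjacencies(deck)
    avg_adj[shuffles] = total / trials
    print(f"{shuffles} shuffles: {avg_adj[shuffles]:.1f} adjacent pairs preserved on average")
```

A fully random deck would preserve about one adjacent pair on average; the experiment would be to compare these idealized numbers against the same statistic computed from real human shuffles.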

9/27: class on possible course projects.

(1) I have a short list of textbook predictions that one can verify on new data. Can you think of other such predictions?

(2) Waves in long lines as described in class and in this write-up. Imagine joining the end of a long line -- for the first day of a popular movie, or at airport security. The people at the front are being served in a fairly regular way. But at the back, instead of moving forward one space at a time at the same rate as the people at the front are being served, you move forward several spaces less frequently; a kind of "occasional wave" of movement emerges spontaneously. We have a toy model for this situation, and possible projects are to simulate the model, or to collect some real-world data. As discussed in class, there is a similar "editing long documents" project.
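The toy model sketched below is a stand-in of my own, not necessarily the model in the write-up: the front person steps up to the server when the person ahead is served, while everyone further back moves only once the empty gap ahead of them reaches a trigger size, and then closes it completely. Watching one person near the back shows the wave:

```python
def simulate_line(n_people=60, gap_trigger=4, watch=40):
    """Single queue on a line of integer positions; the server is at 0.
    When the front person is served, the next person steps up to the
    server; everyone further back moves up only once the empty gap ahead
    of them has grown to gap_trigger spaces, then closes it all at once.
    Returns the sizes of the moves made by the person who started at
    position `watch`, up until they reach the front."""
    positions = list(range(n_people))
    people = list(range(n_people))        # labels, front of line first
    move_sizes = []
    while people[0] != watch:
        people.pop(0)                      # front person is served, leaves
        positions.pop(0)
        positions[0] = 0                   # new front steps up to the server
        for i in range(1, len(people)):
            gap = positions[i] - positions[i - 1] - 1
            if gap >= gap_trigger:
                if people[i] == watch:
                    move_sizes.append(gap)
                positions[i] -= gap        # one wave-step forward
    return move_sizes

moves = simulate_line()
print(f"person 40 moved {len(moves)} times before reaching the front, "
      f"in jumps of {sorted(set(moves))} spaces")
```

Here the front of the line advances at every service, but the person who joined 40th moves only every fourth service, four spaces at a time -- the "occasional wave".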

(3) The basic model for sports. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f.
Within this model there are many things one can study. Given f, how can we estimate the skill levels from win/loss data? For unknown f, how to estimate the function f from data?
I would also like to see a "literature survey" of what has been done with models like this.

(4) Lucky vs unlucky teams -- two ways in which gambling odds might be wrong.

(5) Some examples of write-ups.
Exploratory data analysis of amazon.com review data.
When Can One Test an Explanation? Compare and Contrast Benford's Law and the Fuzzy CLT.
The Great Filter, Branching Histories and Unlikely Events. Fun example of a little "math theory" paper.

Also a fascinating "failed" project Coincidences in Wikipedia. Can you think of some similar but more feasible project?

9/29: Tipping points and phase transitions. Getting data on real epidemics, or the metaphorical ones mentioned in class, looks hard. Let us instead think about studying real queues. There's an elegant and well-developed math theory of queues, but it doesn't really apply to most waiting lines we encounter in everyday life.
(1) Coffee shop. A (several person) project is to sit in a coffee shop for several periods of time and record (in as much detail as practical) times of arrival to waiting lines and times of service completion. Then compare to theory models.
(2) Online game rooms. Go to (say) pogo.com, go into (say) Spades and go into (say) Intermediate. You'll see a list of about 20 game rooms and how many people are in each; usually some are at or near the maximum allowed (125) and most others are nearly empty. Note this is the opposite of supermarket lines, which stay roughly balanced because customers tend to choose short lines. What's going on with the game rooms? Well, if you want to find a game to join it's more sensible to choose an almost-full room. A project is to first gather some data on room occupancies over (say) a 3-hour period, and then formulate and test some conjectures.

10/4: The local uniformity principle.

(1) Luck-Skill spectrum. The "state fair" example in the handout gives a setting with an adjustable parameter which interpolates between skill and luck. Can you invent another experiment which demonstrates the same point, and get data? The "wine cork" example deals both with "luck and skill" and with the idea of a "learning curve". Again, can you invent another experiment which demonstrates the same point, and get data?

(2) Near-misses and the least significant digit principle and the "local irregularity" statistic S. The theoretical predictions ought to hold in just about any reasonable data set of integer data; it would be valuable to check a large number of data sets.

(3) Probabilities of asteroid collision with earth. This topic, mentioned occasionally through the course, is one topic that I would like a literature report on. What are the models, the data and the conclusions in the serious scientific literature?

10/11: Branching processes, advantageous mutations and epidemics. For a project you could do a literature report on other realistic models of epidemics.

10/14: From neutral alleles to diversity statistics. More examples of real-world categorical data, and calculation of the summary statistics, would be useful.

10/18: Global economic risks. It would be interesting to compare the "expert" perception of such risks in the 2011 report with media coverage of such risks, e.g. from USA Today.

10/27: Size-biasing etc. (1) I surmise that when colleges state their "average class size" they are using the professor's viewpoint rather than the (more honest) student's viewpoint. Can you find data to check this?
(2) Find stock market data to examine the qualitative "dust-to-dust" property.
(3) Find data on the t-year correlation for sports team winning percentage.

11/1: Luck. A "literature search" project is to look for any published academic study giving an extensive list of instances of events that surveyed people perceive and recall as luck (rather than just haphazard reporting of selected quotes).

11/8: Coincidences. Are there other cases where you can study near-misses? Consider bingo with many players -- when one person wins, how many others will have lines with 4 out of 5 filled?

Projects, by style

Style 1: Down-home experiments

These are experiments in the sense that you're going to generate your own, new, data.

The psychology of luck. The book The Luck Factor describes how people's self-assessment on a "lucky or unlucky" questionnaire correlates to various other attitudes, e.g. positive expectations for the future. So you could try replicating these results on a group of your friends or classmates.

Backtracking in computer games. If you play a human vs computer strategy game where (as in e.g. the Civilization series)
(i) past states can be stored
(ii) there's a current "human score minus computer score" count
then you can try the following. Set it to a difficulty level where you usually lose. Play to the end, then backtrack to some position where you were doing relatively well, and re-start from there. Theory suggests that allowing yourself 4 such restarts should convert a 1/50 chance of winning into a 1/2 chance.

Near-misses. In class I gave a Scrabble example. Any other example of near-misses where you can get data?

Finding data to test theories: general

Generating your own new data (as in Experiments above) is a lot of work. It's often easier to find existing data somewhere on the Web (though still some work to put into the particular form needed for a project). Here are some examples, excluding sports and stock market which will be treated separately.

Categorical data to test power laws and descriptive statistics. (10/10) This data on birth names is a good start -- can you find other recent data of this "percents in different categories" kind? Can you make a table showing where people claim power laws hold (analogous to my Normal table)?

Route-lengths in transportation networks. I need the following type of data. Take the 12 [or 20 or 40] largest cities in some State or Country, and find the distances between each pair by road [or rail] and in straight line. See Figure 1 of this paper for an example; but I would like to expand the 2 data-sets there to 10 data-sets. Do this well and get your name on a scientific paper!

Study some new type of social network. A social network consists of
(i) a specified set of individual people
(ii) and a specified relationship which two people may have.

Mathematically, this gives you a graph where vertices are people and edges indicate where the relationship holds. Many notions of "relationship" have been studied, but I'm sure there are some that no-one has yet thought about.

Amazon.com customer book reviews. Sample books with (say) 10-40 reviews. For each review, note date posted and number of favorable votes. There is a strong association between these variables, described in Exploratory data analysis by Robert Huang. But there's scope for much further analysis.

Online data archives. There are various archives online such as JSE Data Archive which you could search to find data to test predictions.

Finding data to test theories: sports

The book Mathletics is a good place to see the kinds of sports questions that have been studied via statistical analysis of data.

Comparing strategies for betting on horse races. Take, say, 100 horse races -- look at odds and actual winner. Use the starting odds in each to impute winning probabilities. Determine what would have happened under each of several different strategies for choosing a horse to bet on in each race. Possible strategies: bet on
favorite
2nd favorite
3rd favorite
first name in alphabetical order
cutest name
For each strategy, the data will be (number of wins; overall dollar gain or loss) and you can compare this to the theoretical (calculated from imputed probabilities) expectation of number of wins and expectation of overall loss.
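Here is a sketch of the bookkeeping, run on synthetic races; real odds data would slot straight in. The decimal odds below are generated with an assumed 15% bookmaker margin over hypothetical true probabilities, which makes every strategy lose the same expected 13 cents per dollar -- the strategies then differ only in how often they win.

```python
import random

def imputed_probs(odds):
    """Turn decimal odds into win probabilities, renormalising away
    the bookmaker's margin."""
    raw = [1 / o for o in odds]
    total = sum(raw)
    return [p / total for p in raw]

def make_race(n_horses, margin, rng):
    """Synthetic race: random true win probabilities, with decimal odds
    shortened by the given bookmaker margin."""
    weights = [rng.expovariate(1) for _ in range(n_horses)]
    probs = [w / sum(weights) for w in weights]
    return [1 / (p * (1 + margin)) for p in probs]

def run_strategy(races, pick, rng):
    """pick(odds) chooses a horse; bet $1 on it in every race, the winner
    being drawn from the imputed probabilities.  Returns (races won,
    net dollar gain, theoretical expected wins, expected gain)."""
    wins = gain = exp_wins = exp_gain = 0.0
    for odds in races:
        probs = imputed_probs(odds)
        h = pick(odds)
        exp_wins += probs[h]
        exp_gain += probs[h] * odds[h] - 1
        winner = rng.choices(range(len(odds)), weights=probs)[0]
        if winner == h:
            wins += 1
            gain += odds[h] - 1
        else:
            gain -= 1
    return wins, gain, exp_wins, exp_gain

rng = random.Random(5)
races = [make_race(8, 0.15, rng) for _ in range(2000)]
favorite = lambda odds: odds.index(min(odds))
second = lambda odds: sorted(range(len(odds)), key=odds.__getitem__)[1]
results = {}
for name, pick in [("favorite", favorite), ("2nd favorite", second)]:
    results[name] = run_strategy(races, pick, rng)
    wins, gain, ew, eg = results[name]
    print(f"{name}: {wins:.0f} wins (expected {ew:.0f}), "
          f"net ${gain:.0f} (expected ${eg:.0f})")
```

With real races the interesting question is whether any strategy's actual record deviates from its theoretical expectation by more than chance allows.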

Timing of events within a sports match. Though the final result of a sports match should be affected by relative skill more than chance, one can ask whether, conditional on the final score, the various events within the match seem random or not. For instance

1. Consider soccer matches where the score is 1-1 at end of regulation time. Look at the times the two goals were scored. Are these uniform random? Or is there a tendency for a "quick equalizer"?

2. Common sense and theory both say that you should take more risks when losing and near the end of the match. For instance in football, classify interceptions by (which quarter? thrown by currently losing/winning team?). One expects the proportion of interceptions which are thrown by the losing team in the fourth quarter to be considerably larger than 1/8 (the proportion if uniform). Does data confirm this prediction? Can you see this effect in other sports?

3. Hot hands. For one player in a basketball game, record the sequence of successes/failures in their shots. Given the total number of successes (say 18 out of 29), do they occur in random order? Almost all sports players believe in some notion ("hot hands") that they are "on top of their form" some of the time but not others, so that the pattern of successes is more clumpy than it would be if truly random. But statisticians who have studied this are dubious -- the data looks pretty random to them. Project: gather some data, perhaps from another sport (e.g. volleyball: kills by spikers). Then there are standard ways to analyze such data.
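One standard way to analyze such data is a permutation (runs) test: given the total number of successes, is the number of runs (maximal streaks) smaller than chance predicts? A small p-value would mean the sequence is clumpier than random. The shot record below is invented for illustration:

```python
import random

def n_runs(seq):
    """Number of maximal blocks of consecutive equal outcomes."""
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

def permutation_p_value(seq, n_perm=2000, seed=0):
    """Permutation test: given the number of successes in `seq`, how often
    does a random reordering have as few runs as observed?  A small
    p-value means the sequence is clumpier ('hotter') than chance."""
    rng = random.Random(seed)
    observed = n_runs(seq)
    s = list(seq)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(s)
        if n_runs(s) <= observed:
            count += 1
    return count / n_perm

# a hypothetical, fairly streaky record of 29 shots (1 = hit, 0 = miss)
shots = [1,1,1,1,0,1,1,1,0,0,0,0,1,1,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1]
p = permutation_p_value(shots)
print(f"{n_runs(shots)} runs observed; permutation p-value = {p:.3f}")
```

Even this visibly streaky-looking record gives only mild evidence against randomness, which is the usual experience with hot-hands data.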

I just noticed but haven't investigated a web site devoted to hot hands.

Timing of wins within a season. Some sports teams have a reputation for doing better or worse at the start or end of a season. For instance the Oakland A's have a reputation for doing better in the second half of the season. Project: find relevant data (all teams, last 10 years say) and see if such effects are seen more often than "just chance" predicts.

Regression analysis of different sports. The variability of team standings at season's end reflects both differences in ability and chance. One can estimate the contribution of chance from the correlation between first half season and second half season: how does this compare across different sports? More simply, look at the 3 teams with the best records at mid-season: what is the chance one of these wins the Superbowl/World Series/Stanley Cup etc?

Point difference in football. Betting on football is usually done relative to a "point spread". I would like to have data on quantities like the following for NFL games.
(i) The difference between the actual spread and the point spread
(ii) The actual spread
(iii) Point spread versus odds-to-win.

Finding data to test theories: stock market

Malkiel's A Random Walk Down Wall Street is a good source for reading projects. Some course projects:

1. Find coherent data on s.d. of stock index changes over (1 day; 1 week; 1 month; 1 year) to see how well the square root law works. Then test more subtle predictions of random walk theory, e.g. the arc sine law.
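Once the data is in hand, the square-root check itself is short. The sketch below runs it on simulated i.i.d. daily log-returns (illustrative numbers, not real data), for which the last printed column comes out roughly constant; with real index data the interesting question is where this breaks down.

```python
import math
import random
import statistics

def sd_over_horizons(daily_returns, horizons=(1, 5, 21)):
    """s.d. of summed log-returns over non-overlapping blocks of each
    horizon length (1 day; 1 week = 5 trading days; 1 month = 21)."""
    out = {}
    for h in horizons:
        blocks = [sum(daily_returns[i:i + h])
                  for i in range(0, len(daily_returns) - h + 1, h)]
        out[h] = statistics.stdev(blocks)
    return out

rng = random.Random(3)
# 20 synthetic "years" of i.i.d. daily log-returns (illustrative values)
daily = [rng.gauss(0.0003, 0.01) for _ in range(252 * 20)]
sds = sd_over_horizons(daily)
for h in sorted(sds):
    # if the square root law holds, the last column is roughly constant
    print(f"{h:3d}-day s.d. = {sds[h]:.4f}; divided by sqrt({h}): {sds[h] / math.sqrt(h):.4f}")
```

Replacing `daily` by observed index log-returns turns this into the actual project; the arc sine law would need a separate check on the path of the cumulative sums.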

2. In the context of the Kelly criterion for apportioning between stocks, suppose the annual stock gain X can be decomposed as an independent sum X_1 + X_2, where X_1 could be known at some cost (imagine var(X_1) = 0.1 var(X), say). What is the long-term advantage of knowing X_1? Do a simulation study with various distributions. (Conceptual point: this is the simplest model for studying the value of "fundamental analysis".)
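Here is a minimal version of that study, with invented two-point distributions for X_1 and X_2 so the Kelly optimization can be done by grid search rather than calculus. Knowing X_1 lets you choose the betting fraction separately for each X_1 outcome, and the long-run growth rate rises accordingly.

```python
import math

def growth_rate(f, outcomes):
    """Long-run growth rate E[log(1 + f*X)] when a fraction f of the
    bankroll is invested and X is equally likely to be each outcome."""
    return sum(math.log(1 + f * x) for x in outcomes) / len(outcomes)

def best_growth(outcomes, max_f=2.0, steps=2000):
    """Kelly-optimal growth over fractions f in [0, max_f], by grid search
    (f = 0, i.e. growth 0, is always allowed)."""
    best = 0.0
    for k in range(steps + 1):
        f = max_f * k / steps
        if 1 + f * min(outcomes) > 0:      # avoid ruin on the worst outcome
            best = max(best, growth_rate(f, outcomes))
    return best

# invented illustration: X = X1 + X2, each component two-valued,
# with var(X1) roughly 0.1 * var(X)
X1 = [-0.063, 0.063]
X2 = [-0.15, 0.25]

g_blind = best_growth([a + b for a in X1 for b in X2])
# knowing X1 in advance: optimise the fraction separately per X1 value
g_informed = sum(best_growth([a + b for b in X2]) for a in X1) / len(X1)
print(f"growth rate: {g_blind:.4f} blind, {g_informed:.4f} knowing X1")
```

The gap between the two growth rates is the most one should pay (in growth-rate terms) for the information in X_1; the fuller project repeats this across distributions.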

3. Look at historical data on annual stock returns and short-term interest rates. See how well the Kelly strategy would have worked, based on modeling the next year's return as a random pick from the previous 20 years' returns.

4. Find a source (maybe cnn.com) that each day provides a one-sentence explanation of why the stock market did what it did the day before. Create a table showing, for say 30 consecutive market days, how the index changed and the one-sentence explanation. (OK, not so intellectually challenging, but useful data!)

Not Quite a Book Report

In elementary school you started "book report" projects, and by now at Berkeley you've done many projects in the style "read some books or papers and write your own synthesis". I want this course to be different, so I discourage such work. Or rather, let's try more creative versions, such as

A Wikipedia entry. Write a Wikipedia entry (or entries) for a topic in this course that has no entry, or edit one with an unsatisfactory entry: for instance prediction markets.

Quotes about ubiquity of specific distributions. Collect quotes (textbook, popular science, research literature) about the ubiquity of Normal and power law distributions.

Another such topic is "agent-based models of epidemics", e.g. this paper which I'll talk about in class.

Miscellaneous research projects

1. A recent book (see this website) attempts to rank people -- all 800,000 people with Wikipedia entries -- by analysis of their Wikipedia entries. Take some category of people that you are interested in (movie actors, hockey players, etc), find some ranking by human experts, and compare with this algorithmic ranking.

Slightly separate from this course (being less "real-world") are some ongoing research projects where undergraduates can help. You may consider doing one of these as a course project, especially if you are willing to keep going until completion (e.g. as a STAT 199 "Supervised Independent Study and Research" next semester).

Simulation studies

Simulation studies of properties of probability models aren't quite real-world but are fun anyway. Here are some possibilities.

Simulating self-organized criticality. As (maybe) described in class, here is a natural 2-dimensional model for epidemics. Take a large L x L square. Individuals arrive one at a time, at uniform random positions. Usually nothing happens; but with chance L^{-3/2} the new individual is infected. In this case the infection spreads to other individuals within distance 1, and an epidemic occurs in which infection continues to spread between further individuals at distance < 1 apart. After the epidemic has run its course, remove all infected individuals. Course project: simulate this process to check theory predictions of the power-law tail of the distribution of the number of individuals infected in an epidemic. See the Wikipedia "forest-fire model" article for references to related work.
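A rough sketch of the simulation, with a naive breadth-first search for the infected cluster and small parameter values so it runs in seconds. Checking the power-law tail properly needs much longer runs and a log-log plot of the size distribution; this just shows the mechanics.

```python
import random

def soc_epidemics(L=8, n_arrivals=4000, seed=4):
    """Toy self-organised-criticality epidemic model on an L x L square.
    Individuals arrive at uniform random positions; each new arrival is
    infected with probability L**-1.5, in which case the infection spreads
    through everyone reachable by hops of distance < 1, and all infected
    individuals are then removed.  Returns the list of epidemic sizes."""
    rng = random.Random(seed)
    points = []
    sizes = []
    for _ in range(n_arrivals):
        p = (rng.uniform(0, L), rng.uniform(0, L))
        points.append(p)
        if rng.random() < L ** -1.5:
            # breadth-first search for everyone within hopping distance
            infected = {len(points) - 1}
            frontier = [p]
            while frontier:
                q = frontier.pop()
                for k, r in enumerate(points):
                    if k not in infected and (q[0] - r[0]) ** 2 + (q[1] - r[1]) ** 2 < 1:
                        infected.add(k)
                        frontier.append(r)
            sizes.append(len(infected))
            points = [r for k, r in enumerate(points) if k not in infected]
    return sizes

sizes = soc_epidemics()
print(f"{len(sizes)} epidemics; median size {sorted(sizes)[len(sizes) // 2]}, "
      f"largest {max(sizes)}")
```

The qualitative signature of criticality already shows up here: most epidemics are tiny, with occasional very large ones.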

Why asteroid impact probability goes up, then down. The Wikipedia explanation doesn't quite address probabilities -- simulating a more detailed model might be interesting.

Sociology models of the kind in Dynamic Models of Segregation or in Chapter 9 of Complex Adaptive Systems would be interesting to simulate.

The basic model for sports. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f.
Within this model there are many things one can study. Given f, how can we estimate the skill levels from win/loss data? For unknown f, how to estimate the function f from data?
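A sketch of the first question, under one convenient assumed choice f(d) = 1/(1 + e^{-d}), ranking teams by raw win count after a simulated round-robin season. Fancier estimates (e.g. Bradley-Terry maximum likelihood) would be the natural next step.

```python
import math
import random

def simulate_season(skills, f, rounds, rng):
    """Round-robin league: every pair of teams plays `rounds` games; the
    team with skill x beats the team with skill y with probability
    f(x - y).  Returns each team's total win count."""
    n = len(skills)
    wins = [0] * n
    for _ in range(rounds):
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < f(skills[i] - skills[j]):
                    wins[i] += 1
                else:
                    wins[j] += 1
    return wins

rng = random.Random(7)
f = lambda d: 1 / (1 + math.exp(-d))      # one convenient choice of f
skills = [rng.gauss(0, 1) for _ in range(10)]
wins = simulate_season(skills, f, rounds=20, rng=rng)

# crude estimate: rank teams by win count, compare with true skill order
by_wins = sorted(range(10), key=lambda i: -wins[i])
by_skill = sorted(range(10), key=lambda i: -skills[i])
print("teams ranked by wins: ", by_wins)
print("teams ranked by skill:", by_skill)
```

With enough games the win-count ranking recovers the skill ranking quite well; the harder and more interesting estimation problems arise when f itself is unknown.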

Bigger projects

Predicting sports results. Continuing on from point difference in football and the basic model for sports above, it would be fascinating to actually make predictions for football results as the season progresses. This requires a lot of work -- definitely a team project.