STAT 157: Some possible course projects

Note: These are intended to illustrate the broad range of possibilities -- you are not required to do any specific project here. Talk with me before doing substantial work on any project, though it's good for you to think a little first.

Note: below not updated since 2014. More projects can be found in the 2017 lecture slides.

Projects, by class topic (2014).

Lecture 1: Everyday perception of chance.
(1) I described three ways to get data on the contexts where "ordinary people" think about chance -- blogs, twitter and search engine queries. Can you find another way?
(2) Repeat the "searching in blogs" project with different words or phrases. First you need to find a good way to search which gets results from "personal" blogs rather than more professional or commercial ones (which will be more prominent in search results). In particular, finding examples of usage of "likely" and "unlikely" would be interesting.
(3) Can you give a better categorization of these "everyday life" contexts, compared with my own list?
(4) Look at the examples and think of a classification of why the person is interested in the probability of an event. For instance, is it because they intend to make a decision based on the likelihood, or is it just curiosity, or what?
(5) Regarding the list "Which math probability predictions are actually verifiable?", can you find other textbook theory which you can test via some new data?
(6) Look at recent books/articles on Big Data/Data Science which have a lot of examples; make some detailed classification, analogous to my classification of probability contexts, of the real-world contexts in which (it is asserted that) Big Data will play a large role in future.

Lecture 2: The Kelly criterion for favorable games: stock market investing for individuals.
(1) Review academic studies of data relevant to the efficient market hypothesis in general, or (in particular) the performance of managed mutual funds relative to index funds.
(2) Do a better (than my graphic in class) analysis of how well the stock market matched the "Kelly" prediction that, from any starting date, with probability q the market will (at some point) go below q times its starting value.
(3-99) There are many other possible projects involving finance, for those with their own ideas. Or see the section "Finding data to test theories: stock market" below.
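As a starting point for project (2), here is a minimal sketch of the empirical side of the check. It assumes you already have a list of index levels in date order (the name `prices` is hypothetical) and it simply counts the fraction of starting dates from which the series later falls to at most q times its starting value. Note that late starting dates have only a short future in the data, so a careful analysis would need to handle that truncation.

```python
def drawdown_fraction(prices, q):
    """Fraction of starting dates from which the series later falls to
    at most q times its starting value (0 < q < 1)."""
    n = len(prices)
    # future_min[i] = minimum of prices[i+1:], built by one backward scan
    future_min = [float("inf")] * n
    running = float("inf")
    for i in range(n - 1, -1, -1):
        future_min[i] = running
        running = min(running, prices[i])
    # the last date has no future, so it is not used as a starting date
    hits = sum(1 for i in range(n - 1) if future_min[i] <= q * prices[i])
    return hits / (n - 1)


# Toy illustration with made-up numbers (not market data).
toy = [100, 80, 120, 50, 130]
print(drawdown_fraction(toy, 0.5))
```

The prediction would then be examined by comparing `drawdown_fraction(prices, q)` with q itself over a range of q values.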

Lecture 3: Risk to individuals: perception and reality.
(1) Write a report, in the style of the Ropeik-Gray Risk book (see this sample section), on some particular risk.
(2) In particular, the "moderate alcohol consumption is beneficial to health" assertion is somewhat controversial -- look at discussion of this issue and read some of the actual papers describing studies.
(3) As one example of recent studies of the health effect of lifestyle factors see The combined effect on survival of four main behavioural risk factors for non-communicable diseases by Martin-Diener et al. on the list of papers. Read and write a report on papers of this style.
(4) Take some risks that are currently in the news, and see how well they fit the 13 psychological factors listed in Ropeik How Risky Is It, Really? Why our fears don't always match the facts.
(5) Write a report on academic statistical studies of the relationships between people's perception of the relative size of different risks, and their actual size.
(6) Do your own survey of what Berkeley students perceive as risky, in the style of my 2011 class survey, but using risks for which you can find some objective data on the actual size of the risk.

Lecture 4: Psychology of probability: predictable irrationality.
(1) The book Cognition and Chance by Nickerson has many references to the original research experiments. It's a good source for reading projects and also for possible course projects repeating an experiment.
(2) The brief Wikipedia article Probability matching has the note you can help Wikipedia by expanding it.

Lecture 5: Game theory.
(1) Can you find another "observable" online game with a clear "game theory" component?

Lecture 6: Short/Medium term predictions in politics and economics.
(1) Look at some similar source of past forecasts and judge how accurate they were. For instance, Dunnigan -- Bay A Quick and Dirty Guide to War, 4th edition, 2008.
(2) (for a student who reads Chinese): track and analyze online discussion within China regarding the topics of the GJP questions involving China.
(3) (rather vague). Look at the technical statistics part of Tetlock's book, look at critiques of his work by others, look at subsequent academic literature. For instance Mandel - Barnes Accuracy of forecasts in strategic intelligence claim that Canadian experts are better than U.S. experts. Also check the "related publications" link on the IARPA ACE page.
(4) Look at the 2011 Global Risks chart, and give a rough assessment of the economic effects since 2011 of some of the identified risks.
(5) Look at the 2014 Global Risks chart, and think of some risks they did not consider, and analyze in the style of the Global Risks Survey.
(6) Compare the "expert" perception of risks in the 2011 report with extent of media coverage of such risks (at end-2010, when report was written).

"Examples/Suggestions for course projects" class.
(1) Two settings where I want data for my own research:
long waiting lines
small road networks.
Also a fascinating "failed" project Coincidences in Wikipedia. Can you think of some similar but more feasible project?

(2) A theory project: The basic model for sports. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f.
Within this model there are many things one can study. Given f, how can we estimate the skill levels from win/loss data? For unknown f, how to estimate the function f from data?
I would also like to see a "literature survey" of what has been done with models like this.
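To make the first question concrete, here is a minimal sketch of one way to estimate skill levels when f is given. It assumes the logistic choice f(d) = 1/(1 + e^(-d)), which is my assumption for illustration and not something the model requires, and it fits the skills by plain gradient ascent on the log-likelihood. The function name `fit_skills` and the step-size settings are hypothetical. Only differences x - y matter in the model, so the skills are re-centered to have mean zero. If some team wins (or loses) every game the maximum likelihood estimate does not exist, and this simple loop will just keep drifting.

```python
import math


def fit_skills(n_teams, games, steps=2000, lr=0.1):
    """Estimate skill levels from win/loss data under an assumed logistic f.

    games: list of (winner, loser) pairs of team indices.
    Returns a list of skills, normalized to have mean zero.
    """
    x = [0.0] * n_teams
    for _ in range(steps):
        grad = [0.0] * n_teams
        for w, l in games:
            # model probability that the actual winner would win
            p = 1.0 / (1.0 + math.exp(-(x[w] - x[l])))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        x = [xi + lr * g for xi, g in zip(x, grad)]
        mean = sum(x) / n_teams
        x = [xi - mean for xi in x]
    return x


# Toy data: team 0 beats team 1 three times out of four.
skills = fit_skills(2, [(0, 1), (0, 1), (0, 1), (1, 0)])
print(skills[0] - skills[1], math.log(3))
```

In the two-team toy example the fitted skill difference should match log 3, the value at which f gives a win probability of 3/4.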

(3) Some examples of write-ups.
Lucky vs unlucky teams -- two ways in which gambling odds might be wrong.
Exploratory data analysis of amazon.com review data.
When Can One Test an Explanation? Compare and Contrast Benford's Law and the Fuzzy CLT.
The Great Filter, Branching Histories and Unlikely Events. Fun example of a little "math theory" paper.

Lecture 7: Coincidences, near misses and one-in-a-million chances.
(1) Find real-world examples of "time" coincidences for events one might expect to happen at random times.
(2) Are there other cases where you can study near-misses? Consider bingo with many players -- when one person wins, how many others will have lines with 4 out of 5 filled?
(3) The "Cal day poster" project -- find a variety of interesting unlikely events, some of whose chances are around 1 in a million, some which are much more likely, some much less likely.
(4) Is there a useful way to categorize coincidences? Does the Cambridge Coincidences Collection page work well on their examples? Can you do better?

Lecture 8: Ranking and rating.
(1) Which rating model is the best? Compare different models' performance in a sport (or sports) over a larger time frame using historical data.
(2) Can you beat the house? Construct your own rating/prediction model and win (us) lots of money.
(99) Many other ideas for your favorite sport. Or see the section "Finding data to test theories: sports" below.

Lecture 9: Luck.
(1) A "literature search" project is to look for any published academic study giving an extensive list of instances of events that surveyed people perceive and recall as luck (rather than just haphazard reporting of selected quotes).
(2) The book The Luck Factor describes how people's self-assessment on a "lucky or unlucky" questionnaire correlates to various other attitudes, e.g. positive expectations for the future. So you could try replicating these results on a group of your friends or classmates.
(3) The Wikipedia article Luck seems rather disorganized -- can you improve it?

Lecture 10: Prediction markets, fair games and martingales.
(1) Find other data-sets of this type -- probabilities changing with time -- and repeat the previous style of analysis -- the halftime price principle for sports, or the "expected number of down crossings" predictions for any events. One place to find data is the Advanced Football Analytics site.
(2) Also see section 3.5 of the lecture write-up for the kinds of theoretical predictions that one could check against data.

Lecture 11: Coding and entropy.
(1) A (hard) project is to invent algorithms for compressing "data" consisting of a sparse graph with vertex-names; study how well algorithms perform compared to the theoretical entropies, in the models discussed in this paper.

Lecture 12: Science fiction meets science.
(1) Read and write a report on the scientific literature regarding probabilities of asteroid impact. What are the models, the data and the conclusions in the serious scientific literature?
(2) Write a report analyzing different opinions regarding the Technological Singularity.
(3) Read and report on some scientific literature cited in the Preparing for Future Catastrophes article.

Lecture 13: Mixing: physical randomness, the local uniformity principle and card shuffling.
(1) Regarding our "checking predictions of the smooth density idealization" examples, repeat this kind of analysis for other data sets.
(2) I mentioned the elegant math theory about "random riffle shuffles", e.g. that it takes about 7 shuffles to make a card deck random (see Bayer-Diaconis or the more elementary Aldous-Diaconis paper). The theory assumes a kind of idealized shuffle which isn't quite what the average person really does. So there are interesting possible experiments where you compare shuffling 2 times, or 3 times, or 4 times, and then gather statistics (e.g. shape of bridge hands) from the resulting deals. See the recent paper Assaf et al for related theory.
(3) One could also do this with a cheap card-shuffling machine.
(4) Another shuffling scheme is "smooshing" where a deck of cards is slid about on the table by two hands -- this is standard at Baccarat tables. How much does one need to smoosh? A reasonable test statistic is the number of cards originally together that are still together.
(5) The "county fair dart game" example gives a setting with an adjustable parameter which interpolates between skill and luck. Can you invent another experiment which demonstrates the same point, and get data? The "wine cork" example deals with both "luck and skill" and with the idea of a "learning curve". Again, can you invent another experiment which demonstrates the same point, and get data?
(6) There is a classical theory of computer algorithms for sorting. A definitive treatment is in D. Knuth The Art of Computer Programming, volume 3 but many introductory textbooks on algorithms will have something. The math question is: how long, on average, does the algorithm take? Here "on average" means we assume the data starts in random order. This is a good reading project. One can make a course project by doing experiments with (say) 100 blue books; try several schemes for sorting into alphabetical order of names and see which is really quickest.
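For the shuffling projects (2)-(4), it may help to have a simulated baseline to set real shuffles against. The sketch below assumes the standard idealized riffle (cut the deck at a Binomial(n, 1/2) position, then drop cards from the two packets with probability proportional to the packet sizes), and it uses one reading of the "still together" statistic: the number of cards that are still immediately followed by their original successor. The function names are hypothetical.

```python
import random


def riffle(deck, rng):
    """One idealized riffle shuffle of a list, returning a new list."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # Binomial(n, 1/2) cut
    left, right = deck[:cut], deck[cut:]
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        a, b = len(left) - i, len(right) - j  # cards remaining in each packet
        if rng.random() < a / (a + b):
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out


def still_together(deck):
    """Count positions where card c is immediately followed by card c + 1,
    for a deck whose original order was 0, 1, ..., n - 1."""
    return sum(1 for u, v in zip(deck, deck[1:]) if v == u + 1)


def mean_still_together(shuffles, trials=2000, n=52, seed=0):
    """Average of the statistic after a given number of simulated riffles."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        deck = list(range(n))
        for _ in range(shuffles):
            deck = riffle(deck, rng)
        total += still_together(deck)
    return total / trials


for k in (1, 2, 3, 4, 7):
    print(k, mean_still_together(k))
```

For a perfectly random 52-card deck the expected value of this statistic is 51/52, so the printed averages show how quickly the idealized shuffle approaches that level; data from hand shuffles, a machine, or smooshing can be compared with the same statistic.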

Lecture 14: A glimpse at probability research: spatial networks on random points.
(1) Collect more data relevant to this paper.

Lecture 15: Size-biasing, regression effect and dust-to-dust phenomena.
(1) I surmise that when Colleges state their "average class size" they are using the Professor's viewpoint rather than the (more honest) student viewpoint. Can you find data to check this?
(2) Find stock market data to examine the qualitative "dust-to-dust" property.
(3) Find data on the $t$-year correlation for sports team winning percentage.
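For project (1), the two viewpoints on average class size are easy to compute once you have a list of class sizes. A minimal sketch, with hypothetical function names and made-up numbers: the Professor's average weights each class equally, while the student's average weights each class by its size (the size-biased mean), because a class of 170 is experienced by 170 students.

```python
def professor_average(sizes):
    """Plain mean over classes."""
    return sum(sizes) / len(sizes)


def student_average(sizes):
    """Mean class size as experienced by a randomly chosen student-enrollment:
    each class is weighted by its own size."""
    return sum(s * s for s in sizes) / sum(sizes)


# Made-up example: three small seminars and one large lecture.
sizes = [10, 10, 10, 170]
print(professor_average(sizes), student_average(sizes))
```

The student's average is never smaller than the Professor's, and the gap grows with the variability of class sizes.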

Lecture 16: Tipping points and phase transitions.
Getting data on real epidemics, or metaphorical ones mentioned in class, looks hard. Let us instead think about studying real queues. There's an elegant and well-developed math theory of queues, but it doesn't really apply to most waiting lines we encounter in everyday life.
(1) Coffee shop. A (several person) project is to sit in a coffee shop for several periods of time and record (in as much detail as practical) times of arrival to waiting lines and times of service completion. Then compare to theory models.
(2) Online game rooms. Go to (say) pogo.com, go into (say) Spades and go into (say) Intermediate. You'll see a list of about 20 game rooms and how many people are in each; usually some are at or near the maximum allowed (125) and most others are nearly empty. Note this is the opposite of supermarket lines, which stay roughly balanced because customers tend to choose short lines. What's going on with the game rooms? Well, if you want to find a game to join it's more sensible to choose an almost-full room. A project is to first gather some data on room occupancies over (say) a 3-hour period, and then formulate and test some conjectures.
(3) For a project you could do a literature report on other realistic models of epidemics.
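For the coffee shop project (1), one simple comparison is between the waits you actually observed and the waits a textbook single-server, first-come-first-served line would produce from the same arrival times and service durations. The sketch below assumes that idealized model (one server, no one leaves the line), which may well not match a real coffee shop; the function name is hypothetical.

```python
def fifo_waits(arrivals, services):
    """Waiting time before service starts for each customer in an idealized
    single-server first-come-first-served line.

    arrivals: arrival times in increasing order.
    services: service duration of each customer, in the same order.
    """
    if not arrivals:
        return []
    waits = [0.0]  # the first customer finds the server free
    for n in range(1, len(arrivals)):
        gap = arrivals[n] - arrivals[n - 1]
        # previous customer's wait plus service, less the time elapsed since
        waits.append(max(0.0, waits[-1] + services[n - 1] - gap))
    return waits


# Made-up times in minutes.
print(fifo_waits([0, 1, 2], [3, 1, 1]))
```

Large differences between these model waits and the recorded ones would point to the features (several servers, orders prepared in parallel, people giving up) that the simple theory leaves out.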

Miscellaneous Projects

These have been collected over many years, and are not so well organized.

Style 1: Down-home experiments

These are experiments in the sense that you're going to generate your own, new, data.

Backtracking in computer games. If you play a human vs computer strategy game where (as in e.g. the Civilization series)
(i) past states can be stored
(ii) there's a current "human score minus computer score" count
then you can try the following. Set to a difficulty level where you usually lose. Play to the end, then backtrack to some position where you were doing relatively well, and re-start from there. Theory suggests that allowing yourself 4 such restarts should convert a 1/50 chance to a 1/2 chance of winning.

Near-misses. In class I gave a Scrabble example. Any other example of near-misses where you can get data?

Finding data to test theories: general

Generating your own new data (as in Experiments above) is a lot of work. It's often easier to find existing data somewhere on the Web (though still some work to put into the particular form needed for a project). Here are some examples, excluding sports and stock market which will be treated separately.

Categorical data to test power laws and descriptive statistics. (10/10) This data on birth names is a good start - can you find other recent data of this "percents in different categories" kind? Can you make a table showing where people claim power laws hold (analogous to my Normal table)?

Route-lengths in transportation networks. I need the following type of data. Take the 12 [or 20 or 40] largest cities in some State or Country, and find the distances between each pair by road [or rail] and in straight line. See Figure 1 of this paper for an example; but I would like to expand the 2 data-sets there to 10 data-sets. Do this well and get your name on a scientific paper!

Study some new type of social network. A social network consists of
(i) a specified set of individual people
(ii) and a specified relationship which two people may have.

Mathematically, this gives you a graph where vertices are people and edges indicate where the relationship holds. Many notions of "relationship" have been studied, but I'm sure there are some that no-one has yet thought about.

Amazon.com customer book reviews. Sample books with (say) 10-40 reviews. For each review, note date posted and number of favorable votes. There is a strong association between these variables, described in Exploratory data analysis by Robert Huang. But there's scope for much further analysis.

Online data archives. There are various archives online such as JSE Data Archive which you could search to find data to test predictions.

Finding data to test theories: sports

The book Mathletics is a good place to see the kinds of sports questions that have been studied via statistical analysis of data.

Comparing strategies for betting on horse races. Take say 100 horse races -- look at odds and actual winner. Use the starting odds in each to impute winning probabilities. Determine what would have happened under each of several different strategies for choosing a horse to bet on in each race. Possible strategies: bet on
favorite
2nd favorite
3rd favorite
first name in alphabetical order
cutest name
For each strategy, the data will be (number of wins; overall dollar gain or loss) and you can compare this to the theoretical (calculated from imputed probabilities) expectation of number of wins and expectation of overall loss.
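The bookkeeping for this comparison can be sketched as follows. The code assumes odds are recorded in fractional "o-to-1" form (a winning $1 bet gains o dollars), and it imputes probabilities by normalizing 1/(o+1) over the horses in a race so that they sum to 1. That normalization is one common convention and is my assumption here, as are the function names.

```python
def imputed_probs(odds):
    """Winning probabilities imputed from o-to-1 odds, normalized to sum to 1."""
    raw = [1.0 / (o + 1.0) for o in odds]
    total = sum(raw)
    return [r / total for r in raw]


def evaluate(races, pick):
    """Actual and theoretical results of betting $1 per race.

    races: list of (odds, winner_index) pairs, odds being a list per race.
    pick:  function taking a race's odds and returning the index bet on.
    Returns (wins, dollar gain, expected wins, expected dollar gain).
    """
    wins, gain, exp_wins, exp_gain = 0, 0.0, 0.0, 0.0
    for odds, winner in races:
        k = pick(odds)
        p = imputed_probs(odds)[k]
        exp_wins += p
        exp_gain += p * odds[k] - (1.0 - p)
        if k == winner:
            wins += 1
            gain += odds[k]
        else:
            gain -= 1.0
    return wins, gain, exp_wins, exp_gain


def favorite(odds):
    """Strategy: bet on the horse with the shortest odds."""
    return min(range(len(odds)), key=lambda i: odds[i])


# Two made-up races.
races = [([1.0, 3.0, 3.0], 0), ([1.0, 1.0, 4.0], 1)]
print(evaluate(races, favorite))
```

Other strategies (2nd favorite, alphabetical order, and so on) only need their own `pick` function; the alphabetical and "cutest name" strategies would also need the horse names alongside the odds.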

Timing of events within a sports match. Though the final result of a sports match should be affected by relative skill more than chance, one can ask whether, conditional on the final score, the various events within the match seem random or not. For instance

1. Consider soccer matches where the score is 1-1 at end of regulation time. Look at the times the two goals were scored. Are these uniform random? Or is there a tendency for a "quick equalizer"?

2. Common sense and theory both say that you should take more risks when losing and near the end of the match. For instance in football, classify interceptions by (which quarter? thrown by currently losing/winning team?). One expects the proportion of interceptions which are thrown by the losing team in the fourth quarter to be considerably larger than 1/8 (the proportion if uniform). Does data confirm this prediction? Can you see this effect in other sports?

3. Hot hands. For one player in a basketball game, record the sequence of successes/failures in their shots. Given the total number of successes (say 18 out of 29) do they occur in random order? Almost all sports players believe in some notion ("hot hands") that they are "on top of their form" some of the time but not at other times, so that the pattern of successes is more clumpy than it would be if truly random. But statisticians who have studied this are dubious -- data looks pretty random to them. Project: gather some data, perhaps from another sport (e.g. volleyball: kills by spikers). Then there are standard ways to analyze such data.

I just noticed but haven't investigated a web site devoted to hot hands.
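One of the standard ways to analyze a success/failure sequence is a permutation test on the number of runs, sketched below. Given the total number of successes, every ordering is equally likely under the "random order" hypothesis, and a clumpy ("hot hand") sequence shows up as unusually few runs. The function names and the number of trials are hypothetical choices.

```python
import random


def count_runs(seq):
    """Number of maximal blocks of equal consecutive values in a non-empty sequence."""
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def runs_p_value(seq, trials=10000, seed=0):
    """Estimated chance that a random reordering of seq has as few runs as
    (or fewer than) the observed sequence. Small values suggest clumping."""
    rng = random.Random(seed)
    observed = count_runs(seq)
    shuffled = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if count_runs(shuffled) <= observed:
            hits += 1
    return hits / trials


# Made-up record: 1 = made shot, 0 = missed shot (18 successes out of 29).
shots = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
         0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
print(count_runs(shots), runs_p_value(shots))
```

A single game gives very little power, so in practice one would pool evidence over many games or players.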

Timing of wins within a season. Some sports teams have a reputation for doing better or worse at the start or end of a season. For instance the Oakland As have a reputation for doing better in the second half of the season. Project: find relevant data (all teams, last 10 years say) and see if such effects are seen more often than "just chance" predicts.

Regression analysis of different sports. The variability of team standings at season end reflects both difference in ability and chance. One can estimate the contribution of chance from the correlation between first half season and second half season: how does this compare across different sports? More simply, look at the 3 teams with the best records at mid-season: what is the chance one of these wins the Superbowl/World Series/Stanley Cup etc?

Point difference in football. Betting on football is usually done relative to a "point spread". I would like to have data on quantities like the following for NFL games.
(i) The difference between the actual spread and the point spread
(ii) The actual spread
(iii) Point spread versus odds-to-win.

Finding data to test theories: stock market

Malkiel A Random Walk Down Wall Street is a good source for reading projects. Some course projects:

1. Find coherent data on s.d. of stock index changes over (1 day; 1 week; 1 month; 1 year) to see how well the square root law works. Then test more subtle predictions of random walk theory, e.g. the arc sine law.

2. In the context of the Kelly criterion for apportioning between stocks, suppose the annual stock gain X can be decomposed as an independent sum X_1 + X_2 where X_1 could be known at some cost (imagine var(X_1) = 0.1 var(X), say). What is the long-term advantage of knowing X_1? Do a simulation study with various distributions. (Conceptual point: this is the simplest model for studying the value of "fundamental analysis").

3. Look at historical data on annual stock returns and short term interest rates. See how well the Kelly strategy would have worked, based on modeling the next year's return as a random pick from the previous 20 years returns.

4. Find a source (maybe cnn.com) that each day provides a 1 sentence explanation of why the stock market did what it did the day before. Create a table showing, for say 30 consecutive market days, how the index changed and the 1 sentence explanation. (OK, not so intellectually challenging, but useful data!)
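For project 1 in this list, the square root law says that the s.d. of changes over k days should be about sqrt(k) times the s.d. of 1-day changes. A minimal sketch of the comparison follows; it assumes you work with a list of daily log index levels, uses non-overlapping k-day changes, and takes 5 and 21 trading days as stand-ins for a week and a month (all of these are my choices, as are the function names). The illustration runs on a simulated random walk, for which the law holds by construction, and not on market data.

```python
import random
import statistics


def sd_of_changes(log_prices, k):
    """Standard deviation of non-overlapping k-step changes in a series."""
    changes = [log_prices[i + k] - log_prices[i]
               for i in range(0, len(log_prices) - k, k)]
    return statistics.stdev(changes)


def square_root_table(log_prices, horizons=(1, 5, 21)):
    """Rows of (k, observed s.d. of k-step changes, sqrt(k) * 1-step s.d.)."""
    base = sd_of_changes(log_prices, 1)
    return [(k, sd_of_changes(log_prices, k), base * k ** 0.5)
            for k in horizons]


# Simulated random walk standing in for a real series of log index levels.
rng = random.Random(0)
walk = [0.0]
for _ in range(20000):
    walk.append(walk[-1] + rng.gauss(0.0, 0.01))

for k, observed, predicted in square_root_table(walk):
    print(k, round(observed, 4), round(predicted, 4))
```

With real index data, the interesting part is how far the observed column departs from the predicted one at longer horizons.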

Not Quite a Book Report

In elementary school you started "book report" projects, and through Berkeley you've done projects in the style "read some books or papers and write your own synthesis". I want this course to be different so I discourage such work. Or rather, let's try more creative versions such as

A Wikipedia entry. Write a Wikipedia entry (or entries) for a topic in this course that has no entry, or edit one with an unsatisfactory entry: for instance prediction markets.

Quotes about ubiquity of specific distributions. Collect quotes (textbook, popular science, research literature) about the ubiquity of Normal and power law distributions.

Another such topic is "agent-based models of epidemics", e.g. this paper which I'll talk about in class.

Miscellaneous research projects

1. A recent book (see this website) attempts to rank people -- all 800,000 people with Wikipedia entries -- by analysis of their Wikipedia entries. Take some category of people that you are interested in (movie actors, hockey players, etc), find some ranking by human experts, and compare with this algorithmic ranking.

Slightly separate from this course (being less "real-world") are some ongoing research projects where undergraduates can help. You may consider doing one of these as a course project, especially if you are willing to keep going until completion (e.g. as a STAT 199 "Supervised Independent Study and Research" next semester).

Simulation studies

Simulation studies of properties of probability models aren't quite real-world but are fun anyway. Here are some possibilities.

Simulating self-organized criticality. As (may be) described in class, here is a natural 2-dimensional model for epidemics. Take a large L x L square. Individuals arrive one at a time, at uniform random positions. Usually nothing happens; but with chance L^{-3/2} the new individual is infected. In this case the infection spreads to other individuals within distance 1, and an epidemic occurs in which infection continues to spread between further individuals at distance < 1 apart. After the epidemic has run its course, remove all infected individuals. Course project: simulate this process to check theory predictions of power-law tail of distribution of number of infected individuals in an epidemic. See Wikipedia "forest-fire model" for references to related work.
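A minimal sketch of such a simulation is below. It follows the description above, with a few choices of my own that you may want to change: the infected new arrival is counted in the epidemic size and removed along with everyone it reaches, individuals are stored in unit grid cells so that only neighboring cells need to be searched, and the function name and parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def simulate(L, arrivals, seed=0):
    """Run the arrival/epidemic process on an L x L square.

    Returns the list of epidemic sizes (number of individuals infected,
    including the initially infected arrival), in order of occurrence.
    """
    rng = random.Random(seed)
    p_infect = L ** -1.5
    cells = defaultdict(list)  # (floor x, floor y) -> individuals in that unit cell
    sizes = []
    for _ in range(arrivals):
        pt = (rng.uniform(0, L), rng.uniform(0, L))
        if rng.random() >= p_infect:
            cells[(int(pt[0]), int(pt[1]))].append(pt)  # healthy arrival stays
            continue
        # Infected arrival: spread to everyone reachable by hops of length < 1.
        size = 1
        frontier = [pt]
        while frontier:
            x, y = frontier.pop()
            cx, cy = int(x), int(y)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    key = (cx + dx, cy + dy)
                    bucket = cells.get(key)
                    if not bucket:
                        continue
                    keep = []
                    for q in bucket:
                        if math.hypot(q[0] - x, q[1] - y) < 1.0:
                            frontier.append(q)  # newly infected, and removed
                            size += 1
                        else:
                            keep.append(q)
                    cells[key] = keep
        sizes.append(size)
    return sizes


sizes = simulate(20, 20000, seed=1)
print(len(sizes), max(sizes))
```

To look at the tail of the size distribution you would need a much larger L and many more arrivals than in this small run, and you would discard an initial stretch while the density of individuals builds up.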

Why asteroid impact probability goes up, then down. The Wikipedia explanation doesn't quite address probabilities -- simulating a more detailed model might be interesting.

Sociology models of the kind in Dynamic Models of Segregation or in Chapter 9 of Complex Adaptive Systems would be interesting to simulate.

The basic model for sports. Suppose each team has a "skill level" measured by a real number x. Suppose when two teams with skills x and y play, the former wins with probability f(x - y) for some function f.
Within this model there are many things one can study. Given f, how can we estimate the skill levels from win/loss data? For unknown f, how to estimate the function f from data?

Bigger projects

Predicting sports results. Continuing on from point difference in football and the basic model for sports above, it would be fascinating to actually make predictions for football results as the season progresses. This requires a lot of work -- definitely a team project.