Possible Problems and Datasets

  1. Airline Delays
    The flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008 are available in a database for the Data Expo. A subset of these data, e.g. information for 2007 and 2008 for the Sacramento, Oakland, San Francisco, and Los Angeles airports, provides ample opportunity for visual exploration. The data were prepared by Hadley Wickham.
  2. Geo-location from wireless signals
    Inside a building, the Global Positioning System does not provide an effective means of locating people and objects. Instead, systems use the WiFi infrastructure set up for Internet access to locate objects. The University of Mannheim provides training data (via CRAWDAD) for building and testing models that estimate the physical location of an object from the signal strength received by the client from multiple access points. Methods based on nearest-neighbor techniques work well in this situation. A Bayesian approach, e.g. a hierarchical linear model (HLM), is another possibility.
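    A minimal sketch of the nearest-neighbor idea, written in Python for illustration. It assumes the training data have already been reduced to pairs of a signal-strength vector and a known position; the function name knn_locate and all numbers are hypothetical and do not come from the CRAWDAD data.

```python
import math

def knn_locate(train, signal, k=3):
    """Estimate (x, y) as the mean position of the k training records
    whose signal-strength vectors are closest to `signal`."""
    # Sort training records by Euclidean distance in signal space.
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], signal))[:k]
    xs = [pos[0] for _, pos in nearest]
    ys = [pos[1] for _, pos in nearest]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Made-up training data: signal strengths (dBm) from three access points,
# paired with the position (metres) at which they were recorded.
train = [
    ((-40, -70, -80), (0.0, 0.0)),
    ((-45, -65, -78), (1.0, 0.0)),
    ((-70, -40, -75), (10.0, 0.0)),
    ((-72, -42, -70), (10.0, 2.0)),
    ((-80, -75, -40), (5.0, 10.0)),
]

estimate = knn_locate(train, (-42, -68, -79), k=2)
```

    With k=2 the estimate is the midpoint of the two closest training positions. In practice k and the distance measure would be chosen by cross-validation on the training data.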
  3. Precipitation in the Colorado Front Range
    Doug Nychka at the National Center for Atmospheric Research provides access to forty years of observed daily precipitation data for the Colorado Front Range, a region of relatively flat plains with a transition to high mountains. These data can be used to investigate the distribution of precipitation over space and time and to compare actual precipitation to that simulated from a regional climate model.
  4. Simulating a birth and assassination process
    In the birth and assassination process proposed by Aldous and Krebs, the head of the family has children according to a Poisson process, and he is "assassinated" at a random time. The children of the head of the family produce offspring according to the same birth process, their children produce children, and so on. After the head is assassinated, his children become heads of their respective families and vulnerable to assassination, i.e. children are protected while their parent is alive. This process can be studied via simulation and presented via animations.
    A variation considers three types of offspring (strong, medium, and weak) with different life expectancies, so that lifetimes follow a mixture of exponential distributions.
    This process can be used to model, for example, rumours.
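    A minimal simulation sketch in Python. It assumes that the time from becoming vulnerable to being assassinated is exponential, which the description above does not specify; the function name simulate, the rates, and the time horizon max_time are illustrative choices only.

```python
import random

def simulate(birth_rate, kill_rate, max_time, rng):
    """Return (birth, death) times for every individual born before max_time.

    Assumptions for this sketch: births follow a Poisson process with rate
    `birth_rate`, and the time from becoming vulnerable to being
    assassinated is exponential with rate `kill_rate`.
    """
    # The head is born and becomes vulnerable at time 0.
    head = (0.0, rng.expovariate(kill_rate))
    people = [head]
    queue = [head]
    while queue:
        birth, death = queue.pop()
        # Generate this individual's children between birth and death.
        t = birth + rng.expovariate(birth_rate)
        while t < min(death, max_time):
            # A child is protected until its parent dies at `death`,
            # then survives a further exponential waiting time.
            child = (t, death + rng.expovariate(kill_rate))
            people.append(child)
            queue.append(child)
            t += rng.expovariate(birth_rate)
    return people

rng = random.Random(1)
people = simulate(birth_rate=1.0, kill_rate=2.0, max_time=5.0, rng=rng)
# True if every individual born so far is dead by the time horizon.
extinct = all(death < 5.0 for _, death in people)
```

    Repeating the simulation over many seeds gives an estimate of the extinction probability for a given pair of rates, and the (birth, death) pairs are enough to draw the family tree over time.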
  5. Simulating an ad hoc network
    Ad hoc networks for wireless communication have no centralized node and no fixed structure or topology. Instead, devices move over time and dynamically enter and exit the network. A synthetic experiment can study the properties of the network, e.g. the distribution of the minimally connected graph of nodes as a function of the power level of the nodes.
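    One possible reading of such an experiment, sketched in Python. It assumes static nodes placed uniformly at random in the unit square and treats the power level as a common transmission radius; under those assumptions the smallest radius that connects the network is the longest edge of the minimum spanning tree. The function name min_connecting_radius is hypothetical.

```python
import math
import random

def min_connecting_radius(points):
    """Smallest range r such that linking all nodes within distance r
    gives a connected graph: the longest edge of the minimum spanning
    tree (computed here with Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known link from the tree to each node
    best[0] = 0.0
    longest = 0.0
    for _ in range(n):
        # Add the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        longest = max(longest, best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return longest

# Distribution of the connecting radius over 20 random layouts of 50 nodes.
rng = random.Random(0)
radii = [
    min_connecting_radius([(rng.random(), rng.random()) for _ in range(50)])
    for _ in range(20)
]
```

    A histogram of radii for different numbers of nodes shows how much power is needed to keep the network connected as its density changes.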
  6. Email filtering
    SpamAssassin has classified thousands of email messages as spam or regular email and made them available in a public corpus. Regular expressions can be used to examine the text of each message and derive variables, e.g. the number of recipients to whom the mail was sent, the percentage of capitalized words in the body of the text, and whether or not the message is a reply to another message. Once the variables have been derived, their value as predictors can be investigated, e.g. using CART. Alternatively, naive Bayes methods can be used on word counts for the emails.
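    A small Python sketch of deriving such variables with regular expressions. The function name derive_features, the variable names, and the sample message are made up for illustration, and real messages need more care (folded headers, Cc lines, MIME parts).

```python
import re

def derive_features(message):
    """Derive a few illustrative variables from a raw email message."""
    # Headers and body are separated by the first blank line.
    header, _, body = message.partition("\n\n")
    flags = re.MULTILINE | re.IGNORECASE
    to_line = re.search(r"^To:(.*)$", header, flags=flags)
    recipients = re.findall(r"[\w.+-]+@[\w.-]+", to_line.group(1)) if to_line else []
    words = re.findall(r"[A-Za-z]+", body)
    # Count words written entirely in capitals (ignoring single letters).
    capitals = [w for w in words if len(w) > 1 and w.isupper()]
    subject = re.search(r"^Subject:\s*(.*)$", header, flags=flags)
    return {
        "n_recipients": len(recipients),
        "pct_capital_words": 100.0 * len(capitals) / len(words) if words else 0.0,
        "is_reply": bool(subject and re.match(r"re:", subject.group(1), re.IGNORECASE)),
    }

msg = (
    "From: a@example.com\n"
    "To: b@example.com, c@example.org\n"
    "Subject: Re: meeting\n"
    "\n"
    "FREE offer NOW for you\n"
)
features = derive_features(msg)
```

    Applying the function to every message in the corpus yields a data frame-like table with one row per message, ready for a classification tree.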
  7. Text mining the State of the Union speeches
    Project Gutenberg makes available the State of the Union Addresses by United States Presidents from 1790 to 2001. These can be augmented with more recent addresses. After preparing the text in a form suitable for statistical analysis, e.g. computing word counts for each speech, the speeches can be compared (e.g. via multi-dimensional scaling and hierarchical clustering) to see how they differ across time and party.
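    A short Python sketch of the preparation step: word frequencies per speech and a pairwise distance matrix of the kind that multi-dimensional scaling or hierarchical clustering takes as input. Euclidean distance on relative frequencies is only one possible choice, and the texts below are placeholders, not real addresses.

```python
import math
import re
from collections import Counter

def word_freqs(text):
    """Relative frequency of each lower-cased word in a speech."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def distance(f, g):
    """Euclidean distance between two word-frequency profiles."""
    vocab = set(f) | set(g)
    return math.sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in vocab))

speeches = {   # placeholder snippets, not real addresses
    "A": "peace and trade and peace",
    "B": "peace and trade and peace",
    "C": "war and taxes",
}
freqs = {name: word_freqs(text) for name, text in speeches.items()}
dist = {(a, b): distance(freqs[a], freqs[b]) for a in freqs for b in freqs}
```

    Removing very common words and weighting rare ones more heavily usually changes the picture considerably, so the choice of distance is itself worth exploring.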
  8. Presidential election results
    County-level data for the 2008 presidential election can be scraped from news Web sites (e.g. CNN and USA Today). With these data, students can make maps of the election results. Alternatively, results from the Democratic primaries can be scraped and combined with county-level census data, and pro-Obama and pro-Clinton counties can be compared via maps and classification trees.
  9. Seal migration
    To study the migratory patterns of an elephant seal, students can smooth the seal's path, compare it to a random walk on a great circle via simulation, and create an animated "mash-up" of their findings on Google Earth. These data are made available by Brent Stewart, Hubbs-Sea World Research Institute, in Brillinger and Stewart (Canadian Journal of Statistics, 1998, pages 431-443).
  10. Intrusion detection in network traffic
    The Lincoln Labs network intrusion detection experiment, with over 1.3 million connection records, provides rich data for exploration. Accessing these data requires consideration of when to perform computations in the database that contains the data and when to pull data into R. Furthermore, given the volume of the data, plotting requires special consideration because you cannot "see" anything with the standard plotting routines.
  11. Baseball statistics
    Sean Lahman's Baseball Archive is a database that contains information about major league baseball teams, players, managers, and franchises for the years 1871 through 2008. An enormous number of questions and ideas can be explored through visualizations of the results of SQL queries to the database.
  12. California freeway traffic
    These data are provided by the Freeway Performance Measurement System (PeMS). PeMS receives flow and occupancy information from approximately 22,000 loop detectors embedded in the road surface of freeways across California. Every 30 seconds, data are transmitted from the loop detectors to PeMS, yielding two gigabytes of data per day; four terabytes of data are currently stored for public use at PeMS. Several theories of congestion and traffic flow can be studied with these data by looking at relationships between flow and occupancy over time.
  13. Mining the R-help mailing list
    This involves reading the e-mail messages on the R-help mailing list (or R-devel, or any other mailing list) and exploring the growth in the number of messages, the turnover in participants, the nature and topics of questions over time, response times, and the geographical location of participants. We also look at the R functions and packages that are discussed and attempt to infer frequently asked questions (FAQ) and other common topics. The data are available from https://stat.ethz.ch/pipermail/r-help/.
  14. Air Pollution
    Roger Peng and Francesca Dominici ran a two-day exploration in our summer school on mortality, weather, and air pollution, using data from the National Morbidity, Mortality, and Air Pollution Study. The data and some papers on empirical Bayes approaches to the problem are available on Peng's website. They have also written a book on the topic, Statistical Methods for Environmental Epidemiology with R: A Case Study in Air Pollution and Health.
  15. Phase Transition Simulation
    The BML (Biham-Middleton-Levine) Traffic Model (see Biham, Middleton, Levine, Phys Rev A, Vol 46, Issue 10, 1992, R6124-R6127) is a very simple dynamic process that exhibits a phase transition. The problem is simply stated as follows:
    • we have an r x c grid,
    • we randomly select n points in the grid and place cars in each of those cells,
    • each car is either red or blue,
    • at even time steps the red cars move east, and at odd time steps the blue cars move north; a car cannot move if the cell it would move to is occupied,
    • when a car gets to the edge, it wraps around, i.e. the grid is a torus.
    See the work of D'Souza and Holroyd for more information. Simulate this and make it "fast". (We write it in R, then vectorize it, and then rewrite it in C.)
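    The rules above can be sketched as a plain, unoptimized Python simulation (not the R or C versions mentioned). Two details are assumptions of this sketch: row 0 is taken to be the top of the grid, so "north" means a decreasing row index, and all cars of the active colour decide their move from the grid as it stood at the start of the step. The names make_grid and step are illustrative.

```python
import random

EMPTY, RED, BLUE = 0, 1, 2

def make_grid(r, c, n, rng):
    """Place n cars, each red or blue at random, in distinct cells."""
    grid = [[EMPTY] * c for _ in range(r)]
    for k in rng.sample(range(r * c), n):
        grid[k // c][k % c] = rng.choice((RED, BLUE))
    return grid

def step(grid, t):
    """Advance one time step: red moves east when t is even, blue moves
    north (towards row 0) when t is odd. Moves are decided from the old
    grid, so all cars of the active colour move simultaneously."""
    r, c = len(grid), len(grid[0])
    colour, dr, dc = (RED, 0, 1) if t % 2 == 0 else (BLUE, -1, 0)
    new = [row[:] for row in grid]
    for i in range(r):
        for j in range(c):
            if grid[i][j] == colour:
                ni, nj = (i + dr) % r, (j + dc) % c   # wrap around the torus
                if grid[ni][nj] == EMPTY:
                    new[ni][nj] = colour
                    new[i][j] = EMPTY
    return new

# A tiny hand-made example: one red car and one blue car on a 2 x 3 grid.
grid = [[RED, EMPTY, EMPTY],
        [EMPTY, BLUE, EMPTY]]
for t in range(4):
    grid = step(grid, t)
```

    In this example the blue car is blocked by the red car at t = 1 and only moves north at t = 3. The double loop over cells is the natural target for vectorization, since the move test for a whole colour can be expressed as a comparison between the grid and a shifted copy of itself.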
  16. Face Recognition
    Assorted sample data

Last modified: Sat Mar 7 17:31:05 PST 2009