Possible Problems and Datasets
- Airline Delays
- The flight arrival and departure details for all
commercial flights within the USA, from October 1987 to April 2008
are available in a database for the Data Expo. A subset of these
data, e.g. information for 2007 and 2008 for the Sacramento, Oakland,
San Francisco, and Los Angeles airports, provide ample opportunity for
visual exploration. Prepared by Hadley Wickham.
- Geo-location from wireless signals:
- Inside a building, Global
Positioning Systems do not provide an effective means for locating
people and other things. Instead, systems are built to use WiFi set up
for Internet access to locate objects. The University of Mannheim
provides training data (CRAWDAD) for building and testing
models to estimate the physical location of an object based on the
received signal strength of the client from multiple access
points. Methods such as those based on nearest neighbor techniques
work well in this situation. There is also the possibility of using
a Bayesian approach, e.g. a hierarchical linear model (HLM).
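
As a rough illustration of the nearest neighbor idea, the following minimal sketch (in Python rather than the R used in the course) estimates a position as the average location of the k training points whose signal-strength vectors are closest to the observed one. The function name `knn_locate`, the data layout, and the numbers are made up for illustration and are not taken from the CRAWDAD data.

```python
import math

def knn_locate(train, signal, k=3):
    """Estimate (x, y) as the mean position of the k training points whose
    signal-strength vectors are closest (Euclidean distance) to `signal`."""
    # train: list of ((x, y), [signal strength per access point]) pairs
    # (a hypothetical layout chosen for this sketch)
    by_distance = sorted(train, key=lambda rec: math.dist(rec[1], signal))
    nearest = by_distance[:k]
    x = sum(pos[0] for pos, _ in nearest) / len(nearest)
    y = sum(pos[1] for pos, _ in nearest) / len(nearest)
    return x, y

# Made-up training data: positions in metres, signal strengths in dBm
train = [
    ((0.0, 0.0), [-40, -70, -80]),
    ((0.0, 10.0), [-60, -50, -75]),
    ((10.0, 0.0), [-65, -75, -45]),
    ((10.0, 10.0), [-75, -60, -55]),
]
estimate = knn_locate(train, [-42, -68, -79], k=1)
```

In practice k and the distance measure would be chosen by cross-validation on the training data.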
- Precipitation in the Colorado Front Range:
- Doug Nychka at the National Center for
Atmospheric Research provides access to forty years of observed
daily precipitation data for the Colorado Front Range, a combination
of relatively flat plains with a transition to high mountains. These
can be used to investigate the distribution of precipitation over
space and time and to compare actual precipitation to that simulated
from a regional climate model.
- Simulating a birth and assassination process:
- In the birth and assassination
process proposed by Aldous and Krebs, the head of the family has
children according to a Poisson process, and he is "assassinated" at
a random time. The children of the head of the family produce
offspring according to the same birth process, their children produce
children, and so on. After the head is assassinated, his children
become heads of their respective families, and vulnerable to
assassination, i.e. children are protected by their parents. This
process can be studied via simulation and presented via animations.
We also have a variation of this process with three
types of offspring: strong, medium, and weak.
The types have different life expectancies, which are modeled as a
mixture of exponentials.
This process can be used to model, for example, the spread of rumours.
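
A minimal sketch of such a simulation, written in Python rather than R, is given below. It assumes that the time from becoming vulnerable to being assassinated is exponentially distributed; the description above only says "a random time", so that choice, the function name `simulate`, and the `max_individuals` cap (which truncates runs in which the family keeps growing) are all assumptions of this sketch.

```python
import random

def simulate(birth_rate, kill_rate, max_individuals=10_000, seed=None):
    """Return a list of (birth_time, death_time) pairs, one per individual."""
    rng = random.Random(seed)
    # The founder is born and becomes vulnerable at time 0.
    pending = [(0.0, rng.expovariate(kill_rate))]
    family = []
    while pending and len(family) < max_individuals:
        birth, death = pending.pop()
        family.append((birth, death))
        # Children arrive as a Poisson process over the parent's lifetime.
        t = birth + rng.expovariate(birth_rate)
        while t < death:
            # A child is protected until its parent dies; its own
            # assassination clock (assumed exponential) starts then.
            child_death = death + rng.expovariate(kill_rate)
            pending.append((t, child_death))
            t += rng.expovariate(birth_rate)
    return family

family = simulate(birth_rate=1.0, kill_rate=5.0, seed=1)
```

The returned birth and death times are enough to draw the family tree over time or to drive an animation.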
- Simulating an ad hoc network
- Ad hoc networks for wireless
communication have no centralized node or fixed structure or
topology. Instead, devices move over time and dynamically enter and
exit the network. A synthetic experiment can study the properties of
the network, e.g. the distribution of the minimally connected graph of
nodes as a function of the power level of the nodes.
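
One possible reading of this experiment is sketched below in Python (not the course's R). It assumes that every node has the same transmission radius, standing in for the power level, and that two nodes are linked when they are within that radius of each other. Under those assumptions, the smallest radius at which the whole network is connected equals the longest edge of the Euclidean minimum spanning tree. The function name `critical_radius` and the uniform placement of nodes in the unit square are choices made for this sketch only.

```python
import math
import random

def critical_radius(points):
    """Smallest common transmission radius at which all nodes form one
    connected network: the longest edge of the Euclidean minimum spanning tree."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # distance from each node to the growing tree
    best[0] = 0.0
    longest = 0.0
    for _ in range(n):
        # Prim's algorithm: attach the closest node not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        longest = max(longest, best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return longest

# Synthetic experiment: 200 random networks of 50 nodes in the unit square
rng = random.Random(0)
radii = [critical_radius([(rng.random(), rng.random()) for _ in range(50)])
         for _ in range(200)]
```

A histogram of `radii` then approximates the distribution of the power level needed for connectivity.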
- Email filtering
- SpamAssassin has classified thousands of email
messages as spam or regular email and made them available in a public
corpus. Regular expressions can be used to
examine the text in the mail message and derive variables, e.g. the
number of recipients to whom the mail was sent, the percentage of
capital words in the body of the text, whether or not the message is a
reply to another message. Once the data have been derived, predictors
can be investigated, e.g. using CART. Alternatively, naive Bayes
methods can be used on word counts for the emails.
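
The derived variables mentioned above could be computed along the following lines. This is a minimal Python sketch (the course itself works in R); the function name `derive_variables` and the sample message are invented for illustration, and the patterns ignore complications found in real mail such as folded header lines, Cc: recipients, and MIME attachments.

```python
import re

def derive_variables(raw_message):
    """Derive a few predictors from one raw message (header, blank line, body)."""
    header, _, body = raw_message.partition("\n\n")

    # Number of recipients: count address-like strings on the To: line.
    to_line = re.search(r"^To:(.*)$", header, flags=re.MULTILINE | re.IGNORECASE)
    recipients = re.findall(r"[\w.+-]+@[\w.-]+", to_line.group(1)) if to_line else []

    # Percentage of body words written entirely in capital letters.
    words = re.findall(r"[A-Za-z]+", body)
    capitals = [w for w in words if len(w) > 1 and w.isupper()]
    pct_capitals = 100.0 * len(capitals) / len(words) if words else 0.0

    # Reply flag: the subject line starts with "Re:".
    is_reply = re.search(r"^Subject:[ \t]*re:", header,
                         flags=re.MULTILINE | re.IGNORECASE) is not None

    return {"n_recipients": len(recipients),
            "pct_capitals": pct_capitals,
            "is_reply": is_reply}

# Invented sample message
msg = ("From: a@example.com\n"
       "To: b@example.com, c@example.com\n"
       "Subject: Re: FREE offer\n"
       "\n"
       "BUY NOW and save. Reply TODAY")
variables = derive_variables(msg)
```

Applying the function to every message in the corpus yields one row of predictors per message.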
- Text mining the State of the Union speeches
- Project Gutenberg
makes available the State of the Union Addresses by United States
Presidents from 1790 to 2001. These can be augmented to include
more recent addresses. After preparing the text data in a form that is
suitable for statistical analysis, e.g. making word counts for each
speech, the speeches can be compared (e.g. via multi-dimensional
scaling and hierarchical clustering) to see how they differ across
time and party.
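
The preparation step can be sketched as follows, in Python rather than R. Each speech is reduced to word counts, and pairs of speeches are compared with a cosine distance, which is one common choice but an assumption here, as the text above does not name a distance. The helper names and the two short "speeches" are invented for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity of two non-empty word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Invented stand-ins for two speeches
speeches = {
    "A": "The state of the union is strong.",
    "B": "Commerce and agriculture prosper.",
}
counts = {name: word_counts(text) for name, text in speeches.items()}
distances = {(i, j): cosine_distance(counts[i], counts[j])
             for i in counts for j in counts}
```

The resulting pairwise distances are the input that multi-dimensional scaling or hierarchical clustering would need.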
- Presidential election results
- County level data for the 2008
presidential election can be scraped from news Web sites (e.g. CNN and
USAToday). With these data, students can make maps of the election
results. Alternatively, results from the Democratic primaries
can be scraped and combined with county-level census data, and counties
that are pro-Obama vs. pro-Clinton can be compared via maps.
- Seal migration
- To study the migratory patterns of an elephant
seal, students can smooth the seal's path, compare it to a random walk
on a great circle via simulation, and create an animated "mash-up"
of their findings on Google Earth. These data are made available by
Brent Stewart, Hubbs-Sea World Research Institute, in Brillinger and Stewart (Canadian Journal of Statistics, 1998, pages 431-443).
- Intrusion detection in network traffic
- The Lincoln Labs network
intrusion detection experiment with over 1.3 million connection
records provides rich data for exploring. Accessing these data
requires consideration of when to perform computations in the database
that contains the data and when to pull data into R. Furthermore,
given the volume of the data, plotting requires special consideration
because you cannot "see" anything in a standard plot.
- Baseball statistics
- Sean Lahman's Baseball Archive is a database
that contains information about major league baseball teams,
players, managers, and franchises for the years 1871 through 2008. An
enormous number of questions and ideas can be explored through
visualizations of the results of SQL queries to the database.
- California freeway traffic
- These data are provided by the
Freeway Performance Measurement System (PeMS). PeMS receives flow and
occupancy information from approximately 22,000 loop detectors
embedded in the road surface of freeways across California. Every 30
seconds data are transmitted from the loop detectors to PeMS to yield
two gigabytes of data per day; four terabytes of data are currently
stored for public use at PeMS. There
are several theories of congestion and traffic flow that can be
studied with these data that involve looking at relationships between
flow and occupancy over time.
- Mining the R-help mailing list
- This involves reading the e-mail messages on the R-help mailing list
(or R-devel or any other mailing list)
and exploring the growth in the number of messages,
the change in who participates,
the nature and topics of questions over time,
response times, and the geographical location of participants.
We also look at the R functions and packages that are discussed
and attempt to infer frequently asked questions (FAQ) and other
common topics. The data are available from
- Air Pollution
- Roger Peng and Francesca Dominici ran a two-day exploration
in our summer school on mortality, weather, and air pollution, using data from the National
Morbidity, Mortality, and Air Pollution Study.
The data and some papers on empirical Bayes approaches to the problem are available on Peng's website.
They also have a book on the topic:
Statistical Methods for Environmental Epidemiology with R: A Case
Study in Air Pollution and Health.
- Phase Transition Simulation
The BML (Biham-Middleton-Levine) Traffic Model
(see Biham, Middleton, Levine, Phys Rev A, Vol 46, Issue 10, 1992, R6124-R6127)
is a very simple dynamic process that exhibits a phase transition.
The problem is simply stated as follows:
- we have an r x c grid;
- we randomly select n points in the grid and place a car
in each of those cells;
- each car is either red or blue;
- at even time points, red cars move east, and
at odd time points, blue cars move north, but a car cannot
move if the cell to which it would move is occupied;
- when a car gets to the edge, it wraps around, i.e. the grid
is a torus.
See D'Souza's and
for more information.
Simulate this and make it "fast".
(We write it in R, then vectorize it and then write it in C.)
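
The rules of the BML model can be sketched as a plain, unoptimized reference implementation. The version below is in Python rather than the R and C used in the course. It assumes that row 0 is the top of the grid (so "north" means a decreasing row index), that each car's colour is chosen with equal probability, and that all cars of the moving colour are updated simultaneously from the grid as it stood at the start of the step. The names `make_grid` and `step` are chosen for this sketch.

```python
import random

EMPTY, RED, BLUE = 0, 1, 2

def make_grid(rows, cols, n_cars, seed=None):
    """Place n_cars cars, each red or blue with equal probability,
    in distinct randomly chosen cells."""
    rng = random.Random(seed)
    grid = [[EMPTY] * cols for _ in range(rows)]
    for cell in rng.sample(range(rows * cols), n_cars):
        grid[cell // cols][cell % cols] = rng.choice((RED, BLUE))
    return grid

def step(grid, t):
    """Advance one time point: red moves east when t is even,
    blue moves north when t is odd."""
    rows, cols = len(grid), len(grid[0])
    colour, (dr, dc) = (RED, (0, 1)) if t % 2 == 0 else (BLUE, (-1, 0))
    new = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == colour:
                # The modulo makes the grid wrap around (a torus).
                ti, tj = (i + dr) % rows, (j + dc) % cols
                if grid[ti][tj] == EMPTY:   # decided on the old grid
                    new[ti][tj] = colour
                    new[i][j] = EMPTY
    return new

grid = make_grid(5, 5, 10, seed=0)
for t in range(20):
    grid = step(grid, t)
```

A slow version like this is useful mainly as a check on the output of the vectorized and compiled versions.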
- Face Recognition
Assorted sample data
Last modified: Sat Mar 7 17:31:05 PST 2009