Explorations in Statistics Research

June 13-20, 2009
Berkeley


[Photos from the event]

Participants

Students

Amanda Broderick
Grant Brown
Luis Campos
Debra D'Angelo
Stephanie Finck
Christopher Harvey
Sarah Hopson
Alison Jaworski
Marissa Krimsky
Amelia McNamara
Reid Oda
Rachel Park
Andrew Sali
Krista Schaefer
Sydney Shaffer
Justin Shrake
Robert Smith
Michael Spitz
Anna Tassinari
Ken Truong
Christine Wang
Laura Westray
Boris Wong
Angie Zhu

Researchers

Jas Sekhon
Chris Volinsky
Claudia Tebaldi

Organizers

Mark Hansen
Deb Nolan
Duncan Temple Lang

Teaching Assistants

Gabe Becker
Megan Goldman
Tammy Greasby
Brianna Hirst

Visiting Assistants

Michelle Norris
Steven Sain
Carol Ramirez
Cari Kaufman

Panel Presenter

Dr. Colette Patt

Organizing Assistant

Emily Bergen

This program is made possible by a grant from the National Science Foundation.

Overview

The ESR Workshop is designed to introduce undergraduates to the exciting work being done in statistics, the science of data. The Workshop exposes students to how statisticians work on important and interesting scientific problems.

The focus of the workshop is on applying statistics to real problems and on exploring and visualizing data. Statistical concepts guide our investigations, which are ultimately shaped by computing.

The first full day, Sunday, was dedicated to an introduction to R. We also explored the CDC's Behavioral Risk Factor Surveillance System (BRFSS) data and the delays for all US domestic airline flights in 2008, made available by the Bureau of Transportation Statistics, while learning about some basic visualization tools in R.
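
As a rough illustration (not the code used in the Workshop), the following R sketch simulates a small set of flight records shaped loosely like the BTS on-time data and looks at departure delays with base graphics; the column names and delay distribution are made up.

    ## Simulated stand-in for the 2008 on-time data
    set.seed(1)
    flights <- data.frame(
      carrier   = sample(c("AA", "UA", "WN", "DL"), 5000, replace = TRUE),
      dep_hour  = sample(6:22, 5000, replace = TRUE),
      dep_delay = rexp(5000, rate = 1/15) - 5   # minutes; rough stand-in for real delays
    )

    ## Distribution of delays and how they vary over the day
    hist(flights$dep_delay, breaks = 50,
         main = "Departure delays", xlab = "Delay (minutes)")
    boxplot(dep_delay ~ dep_hour, data = flights,
            xlab = "Scheduled departure hour", ylab = "Delay (minutes)")

    ## Mean delay by carrier
    tapply(flights$dep_delay, flights$carrier, mean)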

Over the next six days, we worked intensively for two days on each of three topics, guided by the expert statistician who had done the research. The students worked in groups of three, exploring different questions and using statistical techniques described in short "tutorials" by the researcher/speaker. The speaker, organizers, and graduate students assisted the groups as they explored the data, discussing, suggesting, and clarifying ideas and helping with computational tasks in R. Students regularly reported back to the whole group, presenting a relevant graphic and discussing the implications of their findings with the researcher.

Butterfly Ballots and other Election Irregularities

Jas Sekhon (UC Berkeley) introduced data on anomalies in the 2000 US presidential election. We explored the effects of the butterfly ballot in Palm Beach County, Florida, and estimated how many of the votes cast for Buchanan were likely intended for Gore.

We started by examining vote totals by county across the country and then brought in census data (e.g., ethnicity and race proportions, income and education levels) to make better comparisons between counties. We learned about the difficulties caused by differing county sizes (i.e., total number of votes cast) and differing underlying proportions, and about their effects on the variability of our sums and means.
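
A small simulation makes the county-size point concrete: with the same underlying level of support, the observed vote share bounces around far more in a small county than in a large one. The numbers below are invented for illustration.

    ## Same underlying share p, very different variability by county size
    set.seed(2)
    p <- 0.003                                  # hypothetical Buchanan share
    sizes <- c(small = 2000, large = 200000)    # total votes cast
    shares <- sapply(sizes, function(n) rbinom(1000, n, p) / n)
    apply(shares, 2, sd)                        # spread shrinks as county size grows
    boxplot(shares, ylab = "Observed Buchanan share")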

We then fit a weighted logistic regression model, estimating the expected proportion of Buchanan votes from the Republican vote proportion and the census information. We also compared the results with those from the absentee ballots, which did not use the butterfly ballot.
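
In R, a weighted logistic regression of this kind can be fit with glm() using the county vote counts as binomial totals. The sketch below uses simulated counties and hypothetical covariate names; it is not the model specification from the session.

    ## Simulated counties with hypothetical covariates
    set.seed(3)
    counties <- data.frame(
      total      = round(runif(67, 5e3, 3e5)),   # votes cast per county
      rep_share  = runif(67, 0.3, 0.6),          # Republican vote proportion
      pct_over65 = runif(67, 0.1, 0.3)           # a census covariate
    )
    counties$buchanan <- rbinom(67, counties$total,
                                plogis(-6 + 2 * counties$rep_share))

    ## Proportion response with counts as binomial totals
    fit <- glm(cbind(buchanan, total - buchanan) ~ rep_share + pct_over65,
               family = binomial, data = counties)
    summary(fit)

    ## Expected Buchanan counts; a county far above its prediction
    ## (Palm Beach, in the real data) suggests votes likely meant for Gore
    counties$expected <- fitted(fit) * counties$total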

Finally we used shaded maps to show the abrupt change across county boundaries, illustrating the effect of the butterfly ballot.
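
One way to draw such shaded (choropleth) maps in R is with the maps package, as in the sketch below; the shares here are random stand-ins rather than the real Buchanan proportions.

    library(maps)

    ## County outlines for Florida, one polygon per county region
    fl <- map("county", "florida", plot = FALSE, fill = TRUE)

    set.seed(4)
    share <- runif(length(fl$names))            # stand-in values, one per polygon
    bins  <- cut(share, breaks = 5)
    cols  <- gray(seq(0.9, 0.2, length.out = 5))[as.integer(bins)]

    map("county", "florida", fill = TRUE, col = cols)
    title("Buchanan share by county (illustrative shading)")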

Recommending Television Programs

The next topic was building a recommendation system for television viewers. Chris Volinsky from AT&T gave us detailed data on the television viewing "activities" of 1000 households, including what programs the viewers were watching, when they changed channels, and so on. Different groups of students each explored one of four time periods, each covering a major event such as the Oscars or the Super Bowl. We found these big events in the data and explored which programs were popular and at what times of day different households tended to watch television. The goal was to move towards finding similar types of programs and similar types of viewers.
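
With an event log in hand, this kind of exploration is mostly tabulation and plotting. The sketch below invents a tiny log with hypothetical columns (household, program, hour) just to show the idea.

    ## Hypothetical viewing log: one row per viewing event
    set.seed(5)
    logs <- data.frame(
      household = sample(1:1000, 20000, replace = TRUE),
      program   = sample(c("Oscars", "News", "Sitcom", "Sports", "Movie"),
                         20000, replace = TRUE),
      hour      = sample(0:23, 20000, replace = TRUE,
                         prob = dnorm(0:23, mean = 20, sd = 4))
    )

    ## Most-watched programs and when households tend to watch
    sort(table(logs$program), decreasing = TRUE)
    barplot(table(logs$hour), xlab = "Hour of day", ylab = "Viewing events")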

We moved on to a larger set of households, but with aggregated data giving the amount of time each household spent watching a particular program. With these data, we started computing distance and similarity measures between programs based on the viewers who watched them. We investigated clustering, principal components, singular value decomposition, and multidimensional scaling. Next we built a recommendation system and used a form of cross-validation to see how well it performed.
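
The sketch below shows the general shape of these computations on a made-up program-by-household matrix of viewing minutes: distances between programs, hierarchical clustering, classical multidimensional scaling, and principal components.

    ## Made-up matrix: rows are programs, columns are households, entries are minutes watched
    set.seed(6)
    minutes <- matrix(rpois(50 * 200, lambda = 10), nrow = 50, ncol = 200,
                      dimnames = list(paste0("program", 1:50), NULL))

    d  <- dist(minutes)                  # programs are close if watched by similar households
    hc <- hclust(d, method = "ward.D2")  # hierarchical clustering of programs
    plot(hc)

    mds <- cmdscale(d, k = 2)            # classical multidimensional scaling
    plot(mds, xlab = "Dimension 1", ylab = "Dimension 2")

    pc <- prcomp(minutes, scale. = TRUE) # principal components (computed via the SVD)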

We ended with text mining. The goal was to be able to recommend new shows that nobody had yet watched. To do this, we tried to find similar shows based on their short blurb/description. We used term frequency and inverse document frequency to create variables that we then used to compute distances between programs.
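
Term frequency and inverse document frequency can be computed directly in a few lines of R. The toy descriptions below are invented; they only illustrate the bookkeeping.

    ## Toy program blurbs
    descriptions <- c(
      drama   = "a gripping courtroom drama about a young lawyer",
      cooking = "a celebrity chef cooks quick weeknight dinners",
      sports  = "live coverage of the championship game"
    )
    tokens <- strsplit(tolower(descriptions), "[^a-z]+")
    vocab  <- sort(unique(unlist(tokens)))

    ## Term-frequency matrix: programs in rows, terms in columns
    tf  <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
    idf <- log(nrow(tf) / colSums(tf > 0))   # rare terms get more weight
    tfidf <- sweep(tf, 2, idf, `*`)

    dist(tfidf)                              # small distance = similar blurbs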

Combining Global Climate Models

Claudia Tebaldi (UBC) led the students in exploring global climate change. We looked at 21 international climate models that are used to simulate and study future climate characteristics under different scenarios. Over the two days, we examined different ways of combining these models to get better estimates.

We started by looking at temperature and precipitation over the last 60 years, seeing how they have changed overall, in different parts of the world, and at different times of the year (i.e., by season). Next we looked at the predictions from the 21 different models and compared how well they matched the "observed" weather over these six decades.
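
A first-pass comparison of each model's hindcast to the observed series might look like the sketch below, which uses simulated temperature anomalies in place of the real model output and computes a bias and root mean squared error per model.

    ## Simulated stand-ins: observed anomalies and 21 model hindcasts for one region
    set.seed(8)
    years    <- 1949:2008
    observed <- 0.01 * (years - 1949) + rnorm(length(years), sd = 0.15)
    models   <- sapply(1:21, function(m)
      0.01 * (years - 1949) + rnorm(length(years), sd = 0.2) + rnorm(1, sd = 0.1))

    bias <- colMeans(models - observed)            # systematic offset of each model
    rmse <- sqrt(colMeans((models - observed)^2))  # overall mismatch
    round(cbind(bias, rmse), 3)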

We moved on to an approach for estimating reasonable weights for combining the models. These weights were based both on how well each model predicted the observed climate variable(s) and on how much it agreed with the other models. This involved providing an initial starting point and iterating until the weight estimates converged. We estimated the weights separately for each of 22 regions around the world. We then explored the results to see which models were important and whether there were clear "winners".
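
The sketch below shows the flavor of such an iteration in R, for a single region and invented numbers: a model is down-weighted both for disagreeing with the observation and for disagreeing with the weighted consensus of the other models. It is a simplified stand-in, not the scheme used in the session.

    set.seed(9)
    obs  <- 0.8                              # observed change for one region
    mods <- rnorm(21, mean = 0.8, sd = 0.4)  # 21 model projections (stand-ins)

    w <- rep(1 / 21, 21)                     # start from equal weights
    for (iter in 1:100) {
      consensus <- sum(w * mods)
      ## small constant (0.05) keeps the weights finite when a model is very close
      score <- 1 / ((abs(mods - obs) + 0.05) * (abs(mods - consensus) + 0.05))
      w_new <- score / sum(score)
      converged <- max(abs(w_new - w)) < 1e-8
      w <- w_new
      if (converged) break
    }
    round(w, 3)                              # a few models end up dominating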

The weights we worked with above were somewhat ad hoc. Claudia used a different approach via a hierarchical Bayesian model, which connects the weights to the inverses of the variances of a Normal distribution. We used Markov chain Monte Carlo to estimate the parameters of the model and then looked at these to see how temperature is likely to change, comparing the densities in different regions. We ended up with an informative map.
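
As a much-simplified sketch of the idea (the model used in the session is richer), one can treat each model's projection as Normal with its own precision, put a Gamma prior on the precisions, and draw from the posterior with a small Gibbs sampler; the precisions then play the role of weights. The projections below are simulated.

    set.seed(10)
    x <- rnorm(21, mean = 2.5, sd = 0.6)   # stand-in model projections (deg C)
    a <- b <- 0.01                         # vague Gamma prior on the precisions
    n_iter <- 5000
    mu <- mean(x); lambda <- rep(1, 21)
    mu_draws <- numeric(n_iter)

    for (t in 1:n_iter) {
      ## mu | lambda, x  ~  Normal(precision-weighted mean, 1 / total precision)
      mu <- rnorm(1, sum(lambda * x) / sum(lambda), sqrt(1 / sum(lambda)))
      ## lambda_i | mu, x_i  ~  Gamma(a + 1/2, rate = b + (x_i - mu)^2 / 2)
      lambda <- rgamma(21, shape = a + 0.5, rate = b + (x - mu)^2 / 2)
      mu_draws[t] <- mu
    }

    ## Posterior for the consensus change, discarding burn-in
    plot(density(mu_draws[-(1:1000)]), main = "Posterior of mu")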

Other Activities

Dr. Colette Patt from the Dean's office at Berkeley led an hour-and-a-half discussion about thinking about and applying to graduate school. She gave a marvelous overview, ranging from why one might go to graduate school to very practical and important details about applying to and accepting a particular school. She conveyed a wealth of information to the students, and there were lots of questions and discussions, with comments from the professors providing additional perspectives specifically about graduate programs in statistics.

On Thursday, five graduate students described their experiences in graduate school and answered an array of interesting questions from the students.

On Friday evening, we held our Workshop banquet at the Great China restaurant.


Last modified: Mon May 10 18:40:41 PDT 2010