Overview
The ESR Workshop is designed to introduce undergraduates to the exciting work being done in statistics, the science of data. The Workshop exposes students to how statisticians work on important and interesting scientific problems.
The first full day, Sunday, was dedicated to an introduction to R. We also explored housing data and delays in all US domestic airline flights in 2008, made available by the Bureau of Transportation Statistics, while learning about some basic visualization tools in R.
Over the next six days, we worked intensively for two days on each of three topics, guided by the expert statistician who did the research. The students worked in groups of three, exploring different questions using statistical techniques described in short "tutorials" by the researcher/speaker. The speaker, organizers, and graduate students assisted the groups as they explored the data, discussing, suggesting, and clarifying ideas and helping with computational tasks in R. Students regularly reported to the whole group, presenting a relevant graphic and discussing the implications of their findings with the researcher.
Keeping a Search Engine Fresh
Carrie Grimes from Google Research described how search engines work. A copy of each website is created and indexed by following a trail of links from a group of high-quality websites. This process is called 'crawling the web.' The index is updated by revisiting the sites and determining whether changes have occurred. Carrie introduced important issues that come up when trying to index the web, such as page quality, page rank, frequency of page updates, and differentiating between informative and non-informative page changes. The main issue we looked at was determining how often websites should be revisited by the search engine to ensure that the index is up-to-date.
The first goal was to estimate the rate of change for each website and the time between changes, using real data sampled from the internet by Google. By looking at the distribution of the estimates, we noticed many natural categories of website updating: many websites, such as notes for a course, are created once and never change, while others, such as news sites, are updated frequently.
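As a rough illustration of this first step, a single site's rate can be estimated from the gaps between its observed changes (a toy sketch in Python rather than the R we used; the timestamps are invented, not Google's data):

```python
# Toy sketch: estimate a single site's change rate from the (invented)
# times, in days, at which a crawler detected changes.

def change_rate(change_times):
    """Return (mean gap in days, changes per day) from sorted times."""
    gaps = [t2 - t1 for t1, t2 in zip(change_times, change_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return mean_gap, 1.0 / mean_gap

# A news-like site updating roughly daily:
mean_gap, rate = change_rate([0, 1, 2, 4, 5, 6, 8])
```

Computing these per-site summaries across many sites and plotting their distribution is what revealed the natural categories, from never-updated course notes to daily news.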
Next, we compared two preset groups of URLs to determine how similar they were with respect to frequency and consistency of updates, comparing the distribution and variability of the time between changes.
We then discussed the exponential and Poisson distributions and their relationship to our problem: we are counting the number of changes during a period of time as well as measuring the waiting time until the next website update. We compared a naive estimator of lambda to an improved estimator that accounts for the fact that a website might be updated multiple times before the search engine updates its index. This is an instance of censored data, since a search engine cannot observe the entire internet constantly and must rely on sampling at different points in time; as a result, the data we observe come from a censored Poisson process. We then looked at which URL sampling schemes are most affected by censoring (long vs. short intervals, fast vs. slow rates of change).
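The two estimators can be sketched with a toy simulation (in Python rather than R; the rate `lam`, the interval length, and the sample size are invented illustration values). A crawler visiting every T days sees at most one change per interval, so a naive count of "intervals that changed" is capped at 1/T changes per day; an improved estimator instead inverts the Poisson probability 1 - e^(-lam*T) that an interval contains any change:

```python
import math
import random

def estimate_rates(lam, interval, n_intervals, rng):
    """Simulate crawling a site with true change rate `lam` (changes/day)
    every `interval` days; return (naive, adjusted) estimates of lam."""
    changed = 0
    for _ in range(n_intervals):
        # Draw the Poisson change count via exponential inter-arrival
        # times; the crawler only sees whether the count is nonzero.
        t, k = rng.expovariate(lam), 0
        while t < interval:
            k += 1
            t += rng.expovariate(lam)
        changed += (k > 0)
    naive = changed / (n_intervals * interval)  # changes "seen" per day
    # Invert P(interval has a change) = 1 - exp(-lam * interval):
    adjusted = -math.log(1 - changed / n_intervals) / interval
    return naive, adjusted

rng = random.Random(1)
naive, adjusted = estimate_rates(lam=2.0, interval=1.0,
                                 n_intervals=5000, rng=rng)
```

With two expected changes per one-day interval, the naive estimate is badly biased downward (it can never exceed one change per interval), while the adjusted estimate recovers the true rate; this is the censoring effect at work.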
Lastly, we explored how to choose the optimal time between updates so as to minimize both the number of times the search engine 'crawls' an unchanged website and the amount of time the index holds an out-of-date version of the website. After we simulated data to choose the optimal sampling scheme for a website with a given rate of change, Carrie described a Bayesian method for deciding how often to crawl each URL.
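A minimal sketch of that trade-off (in Python rather than R; the cost weight `crawl_cost` and the rate are made-up illustration values, not the criterion Carrie used):

```python
import math

def cost(T, lam, crawl_cost=0.05):
    """Hypothetical cost per day of crawling every T days a site that
    changes at Poisson rate lam: a charge per crawl plus the expected
    fraction of time the index is out of date."""
    # After a crawl the index stays fresh until the first change; the
    # expected fresh time within an interval of length T is
    # (1 - exp(-lam*T)) / lam, so the stale fraction is:
    stale_frac = 1 - (1 - math.exp(-lam * T)) / (lam * T)
    return crawl_cost / T + stale_frac

# Scan a grid of intervals for a site changing twice a day on average:
grid = [i / 100 for i in range(5, 301)]
best_T = min(grid, key=lambda T: cost(T, lam=2.0))
```

Crawling too often wastes visits on an unchanged page; crawling too rarely leaves the index stale, and the optimum sits between the two extremes.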
Simulations of the Universe
Dave Higdon, head of Statistical Sciences at the Los Alamos National Laboratory, introduced our second research topic for the week, which was in cosmology, specifically simulation-aided inference. Dave started by giving us a brief introduction to the history of the Los Alamos National Laboratory and the current research being conducted there in weapons, extreme physics, hydrology, and traffic simulation, among other topics. He then described his work in cosmology, particularly the structure of the universe. We learned that, contrary to previous belief, the universe is not only expanding but expanding at an accelerating rate. He also described what humans can see along with what they can't: dark matter and the mysterious dark energy.
The first task of the day was to examine spatial data from seven simulated universes, each generated using different values of two parameters - the amount of dark matter and the spatial variance - but the same initial conditions. The task was to determine which of the seven simulations best matched the "truth", another set of simulated data generated from different initial values. We wanted to determine whether we could estimate the parameters of the simulated truth by matching similarities in the second-order effects. That is, we didn't expect the stars and galaxies to line up, but we did expect the clumping and the distances between stars to be similar to the truth. Our explorations of this spatial data included visual comparisons of the simulations to the simulated 'truth', some cluster analysis, and regression analysis of the parameters against different degrees of clumping in the data.
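The idea of matching on second-order structure rather than point-by-point agreement can be sketched with a toy point pattern (Python rather than R; the cluster parameters are invented, and mean nearest-neighbour distance stands in for the richer spatial summaries we actually used):

```python
import random

def mean_nn_dist(points):
    """Mean nearest-neighbour distance: a crude second-order summary
    of how tightly a point pattern clumps."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        d2 = min((x - a) ** 2 + (y - b) ** 2
                 for j, (a, b) in enumerate(points) if j != i)
        total += d2 ** 0.5
    return total / len(points)

def clustered(rng, n_parents=20, kids=10, spread=0.02):
    """Clumpy pattern: children scattered tightly around random parents."""
    pts = []
    for _ in range(n_parents):
        cx, cy = rng.random(), rng.random()
        pts += [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
                for _ in range(kids)]
    return pts

rng = random.Random(7)
truth = clustered(rng)  # stands in for the simulated 'truth'
sims = {"clumpy": clustered(rng),
        "uniform": [(rng.random(), rng.random()) for _ in range(200)]}
t = mean_nn_dist(truth)
# Pick the simulation whose clumping summary sits closest to the truth's:
best = min(sims, key=lambda name: abs(mean_nn_dist(sims[name]) - t))
```

The individual points never coincide with the truth's, yet the clumpy pattern is still identified as the better match, which is exactly the second-order reasoning above.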
In the middle of the day, we took a tour of the supercomputers at NCAR, where we saw several supercomputers and large robot-controlled tape drives, among other machines, and learned how these are maintained and managed. We also visited the visualization lab and saw several animated visualizations of research conducted at the facility. In particular, we saw a graphic tracing a potential pathway of the Gulf of Mexico oil spill, and we even got to wear 3-D glasses for several of the visualizations!
To end the day, Dave described the modeling framework behind drawing inferences from physical observations. In particular, he described the use of Gaussian processes to interpolate data points and a Bayesian framework using Markov chain Monte Carlo (MCMC) to find a posterior distribution for inference. He gave a fun demo using different-sized water bottles to explain the intuition behind the Metropolis algorithm, and then our final task of the day was to examine which initial values and proposal distributions would be appropriate for obtaining an accurate sample.
The next day continued with a more in-depth exploration of the Metropolis algorithm for sampling from multivariate distributions. We worked on determining which proposal distributions are most appropriate for sampling from univariate, bivariate, and twisted normal distributions as well as a mixture distribution. We examined acceptance rates, counted the number of iterations until convergence for bad initial values, and compared simulated distributions to theoretical ones.
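A minimal random-walk Metropolis sampler, here targeting a standard normal (a sketch in Python rather than R; the proposal scale and the deliberately bad starting point are illustrative choices):

```python
import math
import random

def metropolis(log_density, x0, proposal_sd, n_steps, rng):
    """Random-walk Metropolis: propose x' ~ Normal(x, proposal_sd) and
    accept with probability min(1, p(x') / p(x))."""
    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        prop = x + rng.gauss(0, proposal_sd)
        if math.log(rng.random()) < log_density(prop) - log_density(x):
            x, accepted = prop, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps

rng = random.Random(0)
# Target: standard normal (log density up to an additive constant),
# started far from the mode to mimic a bad initial value.
samples, acc_rate = metropolis(lambda x: -0.5 * x * x,
                               x0=10.0, proposal_sd=2.5,
                               n_steps=20000, rng=rng)
```

A too-small `proposal_sd` accepts nearly every step but crawls across the space; a too-large one rejects most proposals. Both mix poorly, which is the behaviour we compared across the twisted and mixture targets.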
During lunch, we had a panel in which the graduate students discussed their experiences in graduate school and how they got there. They come from various backgrounds and offered unique perspectives on coursework and graduate school life, among other things.
Dave then introduced Gaussian processes using a baseball example involving batting averages. We explored how these processes work and examined the effect of correlation on the Gaussian process. Additionally, we investigated the conditional distribution of a Gaussian process given a few known points. We fit Gaussian processes to two-, four-, and six-dimensional data sets as well as a cosmology data set, and we had to think about how to effectively visualize multiple dimensions and assess the accuracy of our predictions.
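Conditioning a Gaussian process on a few known points can be sketched as follows (Python/NumPy rather than R; the squared-exponential kernel and the toy data are assumptions for illustration, not Dave's setup):

```python
import numpy as np

def sq_exp(x1, x2, length=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    d = np.subtract.outer(x1, x2)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, length=1.0, noise=1e-8):
    """Mean and variance of a zero-mean GP conditioned on (x_obs, y_obs)."""
    K = sq_exp(x_obs, x_obs, length) + noise * np.eye(len(x_obs))
    k_star = sq_exp(x_new, x_obs, length)
    mean = k_star @ np.linalg.solve(K, y_obs)
    var = sq_exp(x_new, x_new, length).diagonal() - np.einsum(
        "ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
    return mean, var

x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([0.0, 1.0, 0.0])
mean, var = gp_posterior(x_obs, y_obs, np.array([1.0, 5.0]))
# At an observed point the posterior interpolates (mean near 1, variance
# near 0); far from the data it reverts toward the prior (variance near 1).
```

The shrinking variance near observed points and the widening variance away from them is what makes Gaussian processes useful for interpolating expensive simulation output.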
Regional Climate Modeling
Steve Sain is a scientist at the National Center for Atmospheric Research (NCAR) and one of the organizers of ESR. His work involves developing statistical methods for the geophysical sciences, providing assessments of global climate change, and performing various atmospheric simulation experiments. For ESR, Steve focused on comparing and quantifying uncertainties between different regional and global climate models.
Global climate models (GCMs) describe the large-scale climatology of the planet by solving a system of differential equations that accounts for climate forcings and energy transfers between the atmosphere, ocean, terrestrial and marine biosphere, cryosphere (sea ice, snow), and land surface. The differential equations account for Newton's laws, continuity of mass, thermodynamics, etc. Uncertainty can be introduced through natural climate variability, poorly understood physics, uncertainty in future forcings (e.g., increases in greenhouse gases), and the coarse resolution of the models (high resolution is computationally intensive). Because GCMs cannot predict local events such as droughts, floods, or heat waves, regional climate models (RCMs) were introduced to capture small-scale climate changes.
For ESR, we focused on seasonal averages of surface temperature and precipitation over 30 years for North America. Our first task was to understand the natural variability of various RCMs whose boundary conditions are all driven by the same GCM (NCEP) and to characterize each model's climate. For most models we noticed clear differences across major landmarks (land, water, mountain ranges, etc.) in the mean over 20 years.
Our next goal was to compare the RCMs to high-resolution NCEP outputs. NCEP is a GCM that assimilates real data into its output and also provided the boundary conditions for all the RCMs, so the NCEP outputs can be treated as the truth when characterizing any systematic differences between the RCMs and the NCEP model. This task was difficult because any difference must be measured relative to the natural variability of both the RCM and NCEP, and visualizing these differences across space and time proved to be another challenge. Most groups tackled the question by taking the difference between each RCM and NCEP for the same year and then summarizing that difference. NCEP outputs were systematically higher than some RCMs for the extreme values (top 10%), and the correlation over time was higher in winter than in summer. Steve pointed out that these analyses were interesting but that more effort is required to account for spatial dependence between residuals at neighboring points and for the multiple testing problems in detecting significant differences. Methods such as functional ANOVA and Markov random fields (MRFs) were introduced to tackle both issues and to support valid inference on the map as a whole.
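The same-year comparison most groups used can be sketched at a single grid point (a toy example in Python rather than R; the temperatures and the RCM's cool bias are invented):

```python
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(3)
# Invented 30 years of summer-mean temperature (deg C) at one grid point:
ncep = [20 + rng.gauss(0, 1) for _ in range(30)]
rcm = [t - 0.5 + rng.gauss(0, 0.5) for t in ncep]  # an RCM with a cool bias

bias = sum(r - n for r, n in zip(rcm, ncep)) / 30  # mean same-year difference
corr = pearson(rcm, ncep)
```

Repeating this at every grid point produces the maps of bias and correlation the groups summarized; the spatial dependence between neighbouring points is then what the functional ANOVA and MRF machinery has to handle.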
On the second day, we moved on to the problem of comparing different GCMs whose distributions of outputs are expected to be similar even though the observed yearly outputs are not (the climates should be the same, but the yearly weather may differ). Lastly, we explored differences between the same GCMs over 32-year periods starting in 1970 and in 2040, assuming continued growth of greenhouse gases, to explore potential climate changes. Steve talked about the hierarchical Bayesian models that underlie all the climate models and the computational difficulties they present. The Bayesian framework provides a cohesive way to analyze changes relative to the different sources of variability. We ended on the remark that the role of statistics in the atmospheric sciences is relatively new and many questions are still open. There are various opportunities for collaboration and visits at NCAR, and hopefully some of us will return here again.