Integrating Computing into the Statistics Curricula

A Workshop to Develop Educational Materials

Date: Sunday, July 26 to Wednesday, July 29, 2009

Location: U.C. Berkeley

Goals for the Case Studies in Statistical Computing Workshop

The goal is to create a collection of materials that instructors in various courses can use to teach computing and data analysis/applied statistics. The main focus is computing, but in the context of a real problem or a simulation study. These courses will primarily be at the upper-division undergraduate level, but material for introductory and advanced graduate courses is also welcome.

The materials can be

The materials are not intended to explicitly and formally teach computing material (e.g. the regular expressions language). Rather they are intended to illustrate the use of a technology or discuss different approaches to programming a particular task. Obviously if one is using technologies or statistical methods that are not likely to be taught in a class (e.g. KML or Google Maps or clustering), then an introduction will be necessary.

Our plan is to combine these case studies and solutions as separate chapters in a book. We think this will help instructors teach novel, interesting statistics classes that emphasize computing, applications of statistics, and "modern" statistical methods. It is a non-traditional book in the same vein as Deb and Terry's Stat Labs and Statistics: A Guide to the Unknown. Our focus is different (computing and applications), and we will also have both data and code!

The idea is that instructors will be able to use this book to get detailed discussions of projects, exercises and examples they can use to teach statistical computing in an interesting, applied manner. This will get instructors over the initial hurdles of finding real and interesting problems and working through all the details of the computations.

The book will also serve as a hands-on and quite different text for students who want to work through more complete and real statistical analyses and computations.

The types of courses we think might use this material include

Over the years we have written up solutions to some of our assignments. They are not in ideal form, but they may give a general sense of what we have in mind. See Samples. See our list of data for case studies if you are looking for a starting point.

What we want NOW!

To get things moving along and make the 3-day workshop more efficient, we'd like each participant to send us a short writeup (about half a page) describing the topic she plans to work on. We'll collate these and put them on the web site so everyone can see what others are thinking of, find people to work with, pick topics that are not yet covered, etc.

You might provide a description of

Structure of a chapter

We are thinking that a chapter might be structured somewhat along the following lines

There are 18 participants in this workshop and we would be very happy to have a chapter from each of you. However, we are also entirely happy to have people work in pairs or even larger groups. If we end up with 12 or more separate case studies, that will be terrific. So please chat about teaming up, and feel free to work in more than one group or on more than one topic. For example, you may want to work with someone on a topic where you present an alternative solution to the problem.

Many of the case-studies/chapters will be based on R. However, this is not in any way a requirement. If you want to use MATLAB or SAS, that is fine. However, perhaps we could arrange to have an R implementation of the same computations to accompany the text.

We envisage that each chapter will have a corresponding R package. At the least, the package will contain the data and a description of it. It will also likely have some code in the demo directory which the reader can evaluate to see the different plots, results, etc. that are presented in the chapter. These might be in a "dynamic document" (be it in Sweave, R-Docbook (XML), or R-Word format). The package might also contain functions that students and instructors can use when doing their own analysis.

A chapter may have an additional R package that contains the "instructor" solutions, which we do not make publicly available but rather limit to instructors.

What characteristics make a good activity for the student?

These are just some thoughts and possible guidelines rather than definitive rules or requirements.
  • Rather than just being focused on sexy or topical/popular data sources, the focus should be on extracting information of interest and addressing questions of interest.
  • It should be reasonably clear why we want to study this topic.
  • Statistics or visualization should have a real role to play (rather than just being a "by the way"), i.e. it should lead to some insight or useful summary of the data.
  • Visualization should play a significant role. Ideally the students should have the opportunity to consider novel ways to visualize the data (rather than just using the standard plot() methods with little annotation/customization).
  • Testing and application of common methods (e.g. regression) should not be the dominant feature but rather a step in discovering the answer to the scientific question. In other words, fitting a model is not the end point.
  • The case study might illustrate a particular computational or statistical issue. We can use these to introduce "modern" topics, i.e. topics that are not traditionally taught in regular classes that undergraduates or even Master's students might take. For example, multivariate techniques such as MDS and classification trees, or topics such as boosting and bagging, or the EM algorithm and so on.
  • The exercise might be a simulation of a statistical method or a stochastic process. Ideally the purpose of the simulation would be to answer a question about the appropriateness or characteristics of a method or a simulation in the context of a particular scientific/social question.
  • When the topic has associated data that are available via the Web, it is valuable to include accessing the data as part of the activity. It illustrates the dynamic and current nature of the problem and potentially allows the students to find something new. Even if the data are
  • When possible, allow different students to work on different aspects of the data and problem, e.g. split climate into regions, television viewing into different markets/cities. Alternatively, have them pursue different avenues of exploration, methods of analysis, etc.
  • There is benefit in having the students learn about a package as part of the activity, e.g. the maps package, or the maptools package for reading shapefiles. We want them to learn how to find software (packages and functions) and use it to solve a problem.
  • Describe the problem/material for other instructors as well as students so that we can share the materials with other instructors.
  • When providing solutions, think about how we might keep some back from the students, but available to instructors.
  • Allow the students to take the initiative and direct how to proceed so that they feel they are in charge and discovering things. In other words, they are involved in active learning.
  • Provide sufficient direction/guidance for weaker students on how to get started and where to go, but do not be so prescriptive as to stifle creativity.
  • Projects should ideally integrate several computational tasks that use different technologies/skills so that students have a sense of putting things together to complete a task from soup to nuts.
  • Projects should produce work that the students can add to a portfolio that they might use in a job or graduate program application. These might be interesting plots, creative
  • One might build projects from a series of homeworks, e.g.
    1. write a function to read email messages in,
    2. create derived variables,
    3. build a classifier
  • Consider atypical data formats, i.e. cases where read.table doesn't work:
    • Non-rectangular data
    • Non-numeric data, e.g. text, graphs, etc.
  • A case study might be an involved description of debugging in a real situation, or optimizing code to make it run faster, or designing a class system or implementing a package. These could provide valuable experiential observations rather than merely identifying the functions that one might use.
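
As a rough illustration of the homework series mentioned above (read email messages in, create derived variables, build a classifier), here is a minimal sketch of how the three steps might fit together. It is written in Python purely as a neutral example; an R version would be just as suitable for a chapter. The four toy messages, the function names, the derived variables (is_reply, n_exclaim, frac_caps) and the one-variable threshold rule are all invented for illustration and do not come from any workshop data set or solution.

```python
from email import message_from_string

# Hypothetical toy data: (raw message text, is_spam label). Invented for this sketch.
RAW_MESSAGES = [
    ("Subject: Re: meeting notes\n\nThanks, I will read them tonight.", False),
    ("Subject: WIN NOW\n\nCLICK HERE!!! FREE MONEY!!!", True),
    ("Subject: lunch?\n\nAre you free at noon on Friday?", False),
    ("Subject: URGENT OFFER\n\nACT NOW!! Limited TIME only!", True),
]


def read_message(raw_text):
    """Homework 1: parse one raw message into (subject, body)."""
    msg = message_from_string(raw_text)
    subject = str(msg.get("Subject", ""))
    # Collect the text of every non-container part (handles multipart messages).
    body = "\n".join(
        part.get_payload() for part in msg.walk() if not part.is_multipart()
    )
    return subject, body


def derive_variables(subject, body):
    """Homework 2: turn the text into a few simple derived variables."""
    letters = [ch for ch in body if ch.isalpha()]
    capitals = sum(ch.isupper() for ch in letters)
    return {
        "is_reply": subject.lower().startswith("re:"),
        "n_exclaim": subject.count("!") + body.count("!"),
        "frac_caps": capitals / len(letters) if letters else 0.0,
    }


def fit_stump(values, labels):
    """Homework 3: a deliberately simple one-variable classifier.

    Tries each observed value t as a threshold for the rule
    'value > t means spam' and keeps the one with the best training accuracy.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        acc = sum((v > t) == y for v, y in zip(values, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Put the three steps together on the toy data.
rows = [derive_variables(*read_message(raw)) for raw, _ in RAW_MESSAGES]
labels = [is_spam for _, is_spam in RAW_MESSAGES]
threshold, accuracy = fit_stump([r["frac_caps"] for r in rows], labels)
print("threshold on frac_caps:", threshold, "training accuracy:", accuracy)
```

In a real assignment each step would be a homework in its own right, and students would be expected to question the choices made here, e.g. which variables to derive, and how to assess the classifier on messages that were not used to fit it.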

Last modified: Mon Jul 13 2009