Integrating Computing into the Statistics Curricula
A Workshop to Develop Educational Materials
Date: Sunday, July 26 to Wednesday, July 29, 2009
Location: U.C. Berkeley
Goals for the Case Studies in Statistical Computing Workshop
The goal is to create a collection of materials that instructors for
various courses can use for teaching computing and data
analysis/applied statistics. The main focus is computing, but in the
context of real problems or a simulation study. These
courses will primarily be at the upper-division undergraduate level, but
materials for introductory and advanced graduate-level courses are also welcome.
The materials can be
- case studies/projects with a scientific/policy question and a worked solution,
- a simulation/computer experiment,
- a comparison of different approaches to implementing a
- an explicit visualization problem,
- shorter homeworks/exercises.
The materials are not intended to explicitly and formally teach
computing material (e.g. the regular expressions language).
Rather they are intended to illustrate the use of a technology
or discuss different approaches to programming a particular task.
Obviously if one is using technologies or statistical methods that are
not likely to be taught in a class (e.g. KML or Google Maps or
clustering), then an introduction will be necessary.
Our plan is to combine these case studies and solutions as separate
chapters in a book. We think this will help instructors teach
novel, interesting statistics classes that emphasize
computing and/or applications of statistics and "modern" statistical
methods. It is a non-traditional book in the same vein as Deb and Terry's Stat Labs and
A Guide to the Unknown. Our focus is different (computing
and applications), and we will also provide both data and code!
The idea is that instructors will be able to use this book to get
detailed discussions of projects, exercises and examples they can use
to teach statistical computing in an interesting, applied manner.
This will get instructors over the initial hurdles of having real
and interesting problems and working through all the details of the
The book will also serve as a hands-on and quite different text for
students who want to work through more complete and real statistical
analyses and computations.
The types of courses we think might use this material
- statistical methods classes
- applied statistical computing classes
- survey of modern statistical topics and applications
- introductory "honors" statistics class
- seminar classes
- consulting classes
There are some solutions we have written up over the years to some of
our assignments. They are not in the ideal form, but may be of some
value to get a sense of what we are thinking about generally.
See our list of data for case studies if
you are looking for a starting point.
What we want NOW!
To get things moving along and make the 3-day workshop
more efficient, we'd like each participant to
send us a short writeup (about half a page) describing
the topic she plans to work on. We'll collate these
and put them on the web site so everyone can see what others
are thinking of and see if people want to work together,
pick different topics which are not yet being covered, etc.
You might provide a description of
- the problem and data (if applicable)
- the learning objectives
- the target audience(s) (e.g. undergraduate, masters, PhD)
- what computational techniques are involved
Structure of a chapter
We are thinking that a chapter might
be structured somewhat along the following lines
- introduction to the scientific problem
- a description of the data and any auxiliary data
- a description of an approach to one particular aspect of the
- additional possible exercises or explorations
There are 18 participants in this workshop and we would be very
happy to have a chapter from each of you. However, we are also
entirely happy to have people work in pairs or even larger groups. If
we end up with 12 or more separate case-studies, that will be
terrific. So please chat about different ways to team up, and feel free to
work in more than one group or on more than one topic. For example,
you may want to work with someone on a topic where you present an
alternative solution to the problem.
Many of the case-studies/chapters will be based on R.
However, this is not in any way a requirement. If you want
to use MATLAB or SAS, that is fine. However, perhaps we could
arrange to have an R implementation of the same computations
to accompany the text.
We envisage that each chapter will have a corresponding R package. At
the least, the package will contain the data and a description of it.
It will also likely have some code in the demo directory which the
reader can evaluate to see different plots, get results, etc. that are
presented in the paper. These might be in a "dynamic document" (be it
in Sweave, R-Docbook (XML), or R-Word format). The package might also
contain functions that students and instructors can use when doing
their own analysis.
A chapter may have an additional R package that contains the
"instructor" solutions that we do not make publicly available
but rather limit to instructors.
What characteristics make a good activity for the student?
These are just some thoughts and possible guidelines rather than
definitive rules or requirements.
- Rather than just being focused on sexy or topical/popular data
sources, the focus should be on extracting information of
interest and addressing questions of interest.
- It should be reasonably clear why we want to study this topic.
- Statistics or visualization should have a real role to play
(rather than just being a "by the way"), i.e. it should lead
to some insight or useful summary of the data.
- Visualization should play a significant role. Ideally the
students should have the opportunity to consider novel ways to
visualize the data (rather than just using the standard plot()
methods with little annotation/customization).
- Testing and application of common methods (e.g. regression)
should not be the dominant feature but rather a step in
discovering the answer to the scientific question. In other
words, fitting a model is not the end point.
- The case study might illustrate a particular computational
or statistical issue. We can use these to introduce "modern"
topics, i.e. topics that are not traditionally taught in regular
classes that undergraduates or even Masters students might take.
For example, multivariate techniques such as MDS and classification
trees, or topics such as boosting and bagging, or the EM algorithm
and so on.
- The exercise might be a simulation of a statistical method or
a stochastic process. Ideally the purpose of the simulation
would be to answer a question about the appropriateness or
characteristics of a method or a simulation in the context of a
particular scientific/social question.
- When the topic has associated data that are available via the Web
it is valuable to include accessing the data as part of the
It illustrates the dynamic and current nature of the problem and
potentially allows the students to find something new. Even if the
- When possible, allow different students to work on different
aspects of the data and problem, e.g. split climate into regions,
television viewing into different markets/cities. Alternatively,
have them pursue different avenues of exploration, methods of
- There is benefit in having the students learn about a package
as part of the activity, e.g. the maps package or the maptools
package for reading shape files. We want them to learn how to find
software (packages and functions) and use these to solve a problem.
- Describe the problem/material for other instructors as well as
students so that we can share the materials with other instructors.
- When providing solutions, think about how we might keep some back
from the students, but available to instructors.
- Allow the students to take the initiative and direct how to
proceed so that they feel they are in charge and discovering
In other words, they are involved in active learning.
- Provide sufficient direction/guidance as to how to get started
and where to go for weaker students, but not too prescribed to
- Projects should ideally integrate several computational tasks
that use different technologies/skills so that students have
a sense of putting things together to complete a task from
soup to nuts.
- Projects should allow the students to add this to a portfolio
that they might use in a job or graduate program application.
These might be interesting plots, creative
- One might build projects from a series of homeworks, e.g.
  - write a function to read email messages in,
  - create derived variables,
  - build a classifier.
- Atypical data formats, i.e. when read.table doesn't work.
- Non-rectangular data.
- Non-numeric data, e.g. text, graphs, etc.
- A case study might be an involved description of
debugging in a real situation, or optimizing code
to make it run faster, or designing a class system or
implementing a package. These could provide valuable experiential
observations rather than merely identifying the functions that one
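The homework series mentioned in the list above (read email messages in, create derived variables, build a classifier) might be sketched roughly as follows. This is only an illustration under stated assumptions, written in Python rather than R: the function names (read_message, derive_variables, fit_centroids, classify), the toy messages, the three derived variables and the choice of a nearest-centroid classifier are all hypothetical and are not taken from any workshop material.

```python
import math
from collections import Counter
from email import message_from_string


def read_message(raw):
    """Homework 1: parse one raw email into its subject and body.

    Assumes a simple, non-multipart message, so the payload is a string.
    """
    msg = message_from_string(raw)
    return {"subject": msg.get("Subject", ""), "body": msg.get_payload()}


def derive_variables(message):
    """Homework 2: turn a parsed message into a few numeric variables."""
    text = message["subject"] + " " + message["body"]
    letters = [c for c in text if c.isalpha()]
    n_letters = max(len(letters), 1)  # avoid dividing by zero
    return {
        "frac_upper": sum(c.isupper() for c in letters) / n_letters,
        "n_exclaim": text.count("!"),
        "n_words": len(text.split()),
    }


def fit_centroids(features, labels):
    """Homework 3a: average each derived variable within each class."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(features, labels):
        acc = sums.setdefault(label, dict.fromkeys(row, 0.0))
        for name, value in row.items():
            acc[name] += value
    return {
        label: {name: total / counts[label] for name, total in acc.items()}
        for label, acc in sums.items()
    }


def classify(row, centroids):
    """Homework 3b: predict the class whose centroid is closest.

    Uses unscaled Euclidean distance, which is a deliberate simplification.
    """
    def distance(centroid):
        return math.sqrt(sum((row[k] - centroid[k]) ** 2 for k in row))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Made-up training messages, purely for illustration.
TRAINING = [
    ("Subject: WIN CASH NOW!!!\n\nCLICK HERE TO CLAIM YOUR FREE PRIZE!!!", "spam"),
    ("Subject: FREE OFFER!!\n\nACT NOW!!! LIMITED TIME!!!", "spam"),
    ("Subject: Workshop agenda\n\nHere is the draft agenda for Sunday.", "ham"),
    ("Subject: Data for chapter\n\nI attached the data set we discussed.", "ham"),
]

features = [derive_variables(read_message(raw)) for raw, _ in TRAINING]
labels = [label for _, label in TRAINING]
centroids = fit_centroids(features, labels)

new_message = "Subject: Lunch on Monday\n\nAre you free for lunch after the session?"
prediction = classify(derive_variables(read_message(new_message)), centroids)
print(prediction)
```

Each function corresponds to one homework, so students could complete and test the pieces separately before combining them into a project. A real assignment would need to handle multipart messages and attachments, and would likely compare several classifiers rather than settle on this one.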
Last modified: Mon Jul 13 2009