Stat 28: Statistical Methods for Data Science

STAT 28 is a lower-division course that follows STAT C8/CS C8 (Foundations of Data Science). The course introduces a broad range of statistical methods used to solve data problems. Currently, the course plans to cover the following topics: group comparisons, parametric statistical models, multivariate data visualization, multiple linear regression, and classification. Students will gain hands-on experience implementing a range of commonly used statistical methods on numerous real-world datasets.

The labs in the course will offer an introduction to the widely used R statistical language, as well as methods of reproducible statistical analysis.

See also this blurb about the course.

Who should take this course? What are the prerequisites?

The audience for this class is envisioned to be 1) pre-majors who want more experience with data analysis or 2) majors outside of statistics/computer science who would like to gain greater statistical skills for data analysis.

All students are required to have taken STAT C8/CS C8 (i.e. DATA 8). This is the only required prerequisite; STAT 88 is not required.

The course intentionally does not require calculus or linear algebra. However, mathematical fluency and comfort at the level of precalculus (Math 32) are expected. We will use equations to explain concepts more heavily than DATA 8 does.

DATA 8 is a "real" requirement. We will assume students are familiar with the resampling methods of statistics taught in DATA 8 (bootstrap and permutation tests), which are not commonly taught in intro statistics courses (like STAT 2 or STAT 20/21). Furthermore, even though we will transition students from Python to R, we assume students have had the introduction to programming that is taught in DATA 8, so that we do not need to reteach basic programming ideas (e.g. for-loops).
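As a rough self-check on this background, the sketch below shows the kind of permutation test and for-loop that students are assumed to have seen already. It is written in Python because that is the language of DATA 8; the labs in this course use R instead. The function name, its parameters, and the toy numbers are illustrative assumptions, not course materials.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration only. Returns the observed
    difference in means and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    def mean(values):
        return sum(values) / len(values)

    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    count = 0
    for _ in range(n_permutations):
        # Shuffle the pooled data, which reassigns the group labels at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Count shuffles at least as extreme as the observed difference.
        if abs(diff) >= abs(observed):
            count += 1

    return observed, count / n_permutations


# Made-up toy data: two small groups with clearly different means.
observed, p_value = permutation_test([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
print(observed, p_value)
```

If the shuffling loop and the idea of comparing the observed difference to the shuffled differences both look familiar, the DATA 8 background assumed here is likely in place.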

Comparison to other courses

Here is a brief description of other courses offered in the statistics department and how they compare to STAT 28.

STAT 20

STAT 20 is an introduction to statistics and probability. It covers topics in statistics and probability that are also covered in DATA 8 and STAT 88, but it takes a more traditional mathematical approach and does not include the computational tools. STAT 20 does not satisfy the prerequisite for STAT 28 and has little overlap in topics with STAT 28.

STAT 88

STAT 88 is a 2-unit connector course for DATA 8. It teaches ideas of probability beyond those taught in DATA 8 and requires a calculus background. It can be taken concurrently with DATA 8. It has little overlap with the material in STAT 28.

STAT C100/CS C100/DATA 100

DS 100 is a new upper-division course. This intermediate-level class bridges between Data 8 and upper-division computer science and statistics courses, as well as methods courses in other fields. In this class, students will master the data science life-cycle and learn many of the basic principles and techniques of data science spanning algorithms, statistics, machine learning, visualization, and data systems.

DS 100 will be offered for the first time in Spring of 2017. The prerequisites for this course are Data Science 8, Math 54, and CS 61A. It is intended for advanced sophomores, juniors, and seniors. DS 100 will use Python as its programming language. DS 100 bridges to majors in an area related to data science (e.g., statistics or computer science).

Here is a link to DS 100

STAT 133, STAT 134, STAT 135

These three courses are the core courses required for all majors in statistics. They cover, respectively, statistical computing (including programming in R), probability, and the theory of statistics, and they treat these topics in much greater depth. STAT 133/134/135 also provide the foundation for taking upper-division courses in statistics. The upper-division courses offered in the statistics department cover many of the methods described in STAT 28 in greater detail and with greater emphasis on the mathematical understanding of the methods. STAT 28, on the other hand, is a one-semester survey of some of the most important of these methods.

The main focus of STAT 133 is statistical computing. STAT 133 teaches R, but it also teaches other computational skills, such as the Unix environment, shell scripting, and/or SQL (depending on the instructor), and it goes far beyond the computing instruction in STAT 28. The examples and motivation in STAT 133 come from statistical data analysis, so STAT 133 involves a great deal of data analysis, and the statistical methods needed to do so are also taught. But the goal of STAT 133 is not a wide-ranging survey of statistical methods; STAT 133 is a deep dive into computing tools in the context of data analysis. The focus of STAT 28, on the other hand, is understanding the tools for data analysis; computation is important for being able to do data analysis but, unlike in STAT 133, it is not the main topic of lectures, exams, etc. So both courses involve data analysis and computation in R, but with different focuses and depth.

STAT 134 teaches probability and uses calculus. It is a more traditional math class. STAT 135 is an introduction to the mathematical underpinnings of statistical estimation and hypothesis testing. STAT 135 is focused on teaching these ideas in the context of data and real-life situations, and so includes data analysis to complement the mathematical ideas. But STAT 135 remains much more mathematical than either Data 8 or STAT 28. STAT 134 is a prerequisite for STAT 135.

Syllabus

The following is a tentative syllabus:

Week	Topics
1	Course Logistics; Visualizing distributions; Review Discrete Probability
2	Continuous distributions; Density estimation
3	Review Permutation Tests; t-tests
4	Multiple Testing; Review Bootstrap Confidence Intervals; Parametric Confidence Intervals
5	Linear Regression: Bootstrap and Parametric
6	Curve fitting: polynomial fits, LOESS, smooth density plots
7	Multivariate visualization; Hierarchical clustering and heatmaps
8	PCA
9	Midterm; Multiple regression: Introduction
10	Multiple regression: Testing
11	Multiple regression: Collinearity
12	Multiple regression: Categorical explanatory variables
13	Logistic Regression
14	Classification Trees

Last updated 10/24/2016