Statistics 151B
Modern Statistical Prediction and Machine Learning
University of California, Berkeley
Spring 2012
Instructor
Jon McAuliffe
Office: 449 Evans Hall
Email: jon@stat.berkeley.edu
Office hours: Thursdays, 12:30pm - 1:30pm, and by appointment
Graduate student instructor
Jeff Regier
Office: 323 Evans Hall
Email: jeff@stat.berkeley.edu
Office hours: Mondays, 11:30am - 12:30pm; Tuesdays, 1:00pm - 2:00pm; in 432 Evans Hall
Lectures
Tuesdays and Thursdays, 2:00pm - 3:30pm
Until Thu 26 Jan: 534 Davis Hall
Starting Tue 31 Jan: 102 Moffitt
Discussion section
Mondays, 2:00pm - 4:00pm
Until Mon 23 Jan: 170 Barrows Hall
Starting Mon 30 Jan: 102 Moffitt
Syllabus: pdf
Announcements
[24 Apr] The slides I used in last Thursday's lecture are now posted below in the Materials section.
[12 Apr] Important notes on homework 4 (a short R sketch follows these notes):
You must convert sa.heart$chd to a factor. If you do not, predict() will fail to behave in the necessary way.
Having converted sa.heart$chd to a factor, you must pass the argument family=binomial to gam(). If you do not, gam() will return an error.
The plot of ss.orig.fit will show the response on the logit scale, i.e. log(p/(1-p)), not on the probability scale p. Therefore your bootstrap plot should use the logit scale too. This will happen by default when you call predict(), provided you take care of the two items above.
Also, the plot of ss.orig.fit will not include the fitted intercept. But your bootstrap plot will. This will cause the ss.orig.fit curve to look like it has been shifted upwards relative to your bootstrap plot, by about 0.7. You don't need to do anything about this.
Don't call plot() on any of your bootstrap-fitted gam objects. There's no need to, and such a plot may not come out correctly (for reasons that aren't worth learning).
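Putting these notes together, here is a minimal R sketch of the intended call sequence. The read.csv() call and the smoothing-spline terms in the formula are illustrative assumptions, not the official solution; use the data-loading code and model formula from the assignment.

    library(gam)

    # Load the data. This read.csv() call assumes a comma-separated file
    # with a header row; adjust to match sa-heart.data as distributed.
    sa.heart <- read.csv("sa-heart.data")

    # (1) Convert the response to a factor, or predict() will not behave
    # in the necessary way.
    sa.heart$chd <- factor(sa.heart$chd)

    # (2) With a factor response, gam() requires family=binomial;
    # otherwise it returns an error. The smoothing-spline terms below are
    # placeholders -- use the model formula from the assignment.
    ss.orig.fit <- gam(chd ~ s(sbp) + s(tobacco) + s(ldl),
                       family = binomial, data = sa.heart)

    # (3) By default, predict() returns fitted values on the logit scale,
    # log(p/(1-p)), which is the scale your bootstrap plot should use.
    logit.preds <- predict(ss.orig.fit)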
[03 Apr] A revised syllabus has been posted (the order of the remaining lecture topics has been rearranged).
[15 Mar] Today we will start to cover the paper
Paul Viola and Michael J. Jones (2004). Robust real-time face detection. International Journal of Computer Vision 57(2): 137-154.
The pdf of the article is here: pdf.
[09 Feb] Jeff has moved his Thursday office hours to Monday, 11:30am - 12:30pm. See the top of this page.
[24 Jan] The University scheduling office has found us a larger room. Starting Tuesday, January 31st, we will meet for lecture in 102 Moffitt. Starting Monday, January 30th, discussion section will also meet in 102 Moffitt. On Thursday, January 26th, we will meet one last time in 534 Davis Hall. Everyone previously on the waitlist should now be enrolled.
[20 Jan] Jeff has set up a Piazza course discussion forum. Have a look. I encourage you to discuss clarifications of problem set questions, as you would face to face, but please do not share hints or solutions.
[18 Jan] I have posted the first lecture's slides below.
[17 Jan, 12:53pm PT] I just updated the syllabus.
[17 Jan] Scheduled office hours start next week. You are free to ask me for an appointment anytime.
[17 Jan] If you have a laptop, bring it to the discussion sections. You will be able to try out R in real time with Jeff.
[17 Jan] Welcome to the course.
Midterm
Here is last year's midterm: pdf. Jeff will take questions about it at next Monday's discussion section.
Here are some problems to work on as you prepare for the exam. Some are harder than others, and several are harder than what will appear on the midterm. They are meant to get you thinking about the material; they do not constitute a practice midterm. Problem numbers are from HTF.
A question about trees: suppose you have fitted both an OLS linear regression and a regression tree to a training set, obtaining a prediction rule from each. Consider evaluating these two prediction rules at a test point x0, with (unobserved) response y0. If I change, say, the jth component of x0 by sending it off to +infinity, what happens to the squared error of the linear model's prediction? How about the tree's prediction? What do you conclude from this about when it is appropriate to use a linear model vs. a tree?
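If you want to check your reasoning empirically, here is a hypothetical R sketch on simulated data, using rpart as the tree learner (the simulated data and the package choice are mine, not part of the exam):

    library(rpart)  # one standard tree (CART) implementation

    set.seed(1)
    n <- 200
    x1 <- rnorm(n); x2 <- rnorm(n)
    y <- 2 * x1 - x2 + rnorm(n)
    train <- data.frame(x1, x2, y)

    lin.fit  <- lm(y ~ x1 + x2, data = train)
    tree.fit <- rpart(y ~ x1 + x2, data = train)

    # Send the first coordinate of the test point off toward +infinity.
    x0 <- data.frame(x1 = 10^(0:6), x2 = 0)
    predict(lin.fit, x0)   # grows without bound with x1
    predict(tree.fit, x0)  # constant once x1 passes the largest split

Because a fitted tree is piecewise constant, its prediction stops changing once x1 passes the largest split point, while the linear prediction grows without bound; which squared error behaves better depends on the true regression function, and that is the point of the exercise.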
Final competition
Homeworks
Homework 4, due Monday, April 16th: pdf     [  sa-heart.data  ]
Homework 3, due Monday, March 19th: pdf
Homework 2, due Monday, March 5th: pdf     [  abalone.data  ]     [  hw2-funcs.R  ]
Homework 1, due Tuesday, February 7th, 2pm: pdf     [  babies.data  ]
Materials
Netflix prize slides (by Padhraic Smyth): pdf
Kernel methods slides:
Lecture 1 slides: pdf
Primary course text: The Elements of Statistical Learning, 2nd ed., 5th printing.
Available as a free pdf: direct download link.
Supplemental course text: Introductory Statistics with R, 2nd ed.
Available as a pdf via SpringerLink. To access the entire book on SpringerLink, you need to enable the Berkeley library proxy in your browser; instructions are here.
More references on statistical prediction and machine learning: