Statistics 222: Statistics Master's Capstone, Spring 2013
- Instructors: P.B. Stark (statistics.berkeley.edu/~stark) and D.M. Goldschmidt
- Office Hours: PBS: Tuesdays, 11am–12pm, 403 Evans Hall; DMG: TBA
- Meets: Tuesday, Thursday 9-11am, 1011 Evans Hall
- Restrictions: open only to Statistics Master's students in the 1-year degree program
Course format:
This is a project-based course involving collaborative problem solving, including collaborative software design
and a group presentation.
There is an emphasis on developing good "hygiene" for collaboration, source-control, and
reproducible computational research.
Scenario: You are in a start-up company. Your team of 5 has to come up with a recommendation engine.
You will need to divide up the job.
The team will have to do some web scraping of unstructured data, select and test appropriate machine-learning
and classification algorithms, decide what approach to take, write up the results in a formal report,
and make a presentation to the venture capitalists (a group of faculty) who are considering funding
your company.
The company has chosen the software tools and environment you will use, but you have to do the coding,
data analysis, etc.
You have 14 weeks to develop an approach and prepare a winning presentation.
Software tools we will use
- Git, github (resources: http://try.github.com, http://www.sbf5.com/~cduan/technical/git/, http://git-scm.com/book, http://gitready.com/)
- VirtualBox or VMWare
- Python: Python documentation (use version 2.7.3), Python on the Mac, Python bootcamp (videos), NumPy, SciPy, scikit-learn, IPython, NetworkX, pandas, statsmodels, Orange, PyML, Python ML resources
- A gallery of IPython notebooks:
https://github.com/IPython/IPython/wiki/A-gallery-of-interesting-IPython-Notebooks.
Note especially
https://github.com/jrjohansson/scientific-python-lectures.
- MySQL (for Mac users, MAMP is an easy way to get MySQL going)
- Amazon Web Services (AWS)
- LaTeX, including Beamer for presentations
Data sources
The course project will use
comedy preference data
from the UC Irvine Machine Learning Data Repository.
Your goal is to predict, as accurately as possible, which of two videos people find funnier.
In order to accomplish this goal, you will need to scrape YouTube pages for features that might help
predict which of a pair of videos people will find funnier, to winnow the potential features down to a manageable set,
and to apply an appropriate algorithm (or algorithms).
Grading
The course grade will be based on class participation and a term project, which includes a paper and an oral presentation
to a set of faculty "clients."
The presentation will take place on Wednesday, 8 May, 2-3:30pm.
If you are unable to be on campus then, please let me know as soon as possible; I need to schedule faculty observers.
Schedule
- Week 1: tools for source control, collaborative development, and reproducible environments: git, github, AWS. Guest lecturer: Aaron Culich
- Week 2: introduction to the YouTube comedy data; SQL. Guest lecturers: Aaron Culich, Harrison Dekker
- Week 3: YouTube Python API. SQLite within Python. Python classes. Detecting code "smells."
Directed weighted graphs; adjacency matrices. Guest lecturer: Aaron Culich
- Week 4: more on reproducible and collaborative computational research. Guest lecturers: Aaron Culich, Fernando Perez.
Materials: http://bit.ly/Y2XmHh
- Week 5: IPython. Guest lecturer: Fernando Perez
- Week 6: nonparametric hypothesis tests based on permutations. Constructing the directed graph for "funnier"
- Week 7: …
- 8 May: final presentations
Assignments
- Install git if you don't already have it. If you have a Mac or PC, the easiest way to do this
is to install the GitHub client. Once you have git installed, clone the repository
github/pbstark/S222_S13_git into a fresh directory on your machine.
Edit the file "hello_world.txt" to add your name.
Save the file; commit the edit; and push the file back to github.
- Write SQL queries to extract the following information about the comedy data:
  - number of distinct video IDs
  - counts of each distinct ID, in decreasing order
  - number of distinct video ID pairs
  - counts of each distinct pair, in decreasing order
  - distinct codes in the left-right field (should be only "left" and "right")
  - number of times the left video was found funnier and number of times the right video was found funnier
  - for each pair that occurs more than once, the number and percentage of times each member of the
pair was found funnier
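As a warm-up, here is a minimal sketch of the first and fourth queries, run against a tiny in-memory SQLite table. The schema (columns left_id, right_id, funnier) and the toy rows are assumptions for illustration only; the real comedy data's table and column names will differ.

```python
import sqlite3

# Toy schema assumed for illustration; adapt the names to the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (left_id TEXT, right_id TEXT, funnier TEXT)")
rows = [("a", "b", "left"), ("a", "b", "right"), ("b", "c", "left")]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)

# Number of distinct video IDs: pool both columns, letting UNION deduplicate.
n_ids = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT left_id AS id FROM ratings UNION SELECT right_id FROM ratings)"
).fetchone()[0]

# Counts of each distinct (left, right) pair, in decreasing order.
pair_counts = conn.execute(
    "SELECT left_id, right_id, COUNT(*) AS n FROM ratings "
    "GROUP BY left_id, right_id ORDER BY n DESC"
).fetchall()
print(n_ids)        # 3 distinct IDs: a, b, c
print(pair_counts)  # [('a', 'b', 2), ('b', 'c', 1)]
```

The remaining queries are variations on the same GROUP BY / COUNT pattern.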
- From within Python, find the list of unique video IDs in the data, retrieve the metadata for
each unique ID, and store it in an SQLite database (as a file).
This will be our "snapshot" of the metadata for the project.
Note that the metadata on YouTube will be changing with time as the videos are viewed,
removed, etc.
You will need to figure out how to store the metadata in a format that SQLite can hold.
- Construct a Python set that contains all the unique video IDs.
HINT: look up Python sets in docs.python.org; think about iterating over rows in the table
and adding elements to the set.
- Iterate over the set, for each element retrieving the YouTube metadata,
and storing the result as a row in the SQLite table.
HINT: do you need to do anything to the object that the YouTube API returns to be able to
store it in an SQLite table?
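A minimal sketch of both steps, using toy data: build the set of unique IDs, then store one serialized metadata record per ID. The function fetch_metadata is a hypothetical stand-in for the real YouTube API call, which returns a Python object rather than something SQLite can store directly.

```python
import json
import sqlite3

# Hypothetical stand-in for the YouTube API call; the real call
# returns a richer object, but the storage issue is the same.
def fetch_metadata(video_id):
    return {"id": video_id, "views": 100, "title": "clip " + video_id}

pairs = [("vidA", "vidB"), ("vidA", "vidC")]   # toy (left, right) rows
ids = set()
for left, right in pairs:      # iterate over rows, adding IDs to the set
    ids.update((left, right))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (video_id TEXT PRIMARY KEY, meta TEXT)")
for vid in ids:
    # SQLite cannot store a dict directly: serialize it to a JSON string.
    conn.execute("INSERT INTO metadata VALUES (?, ?)",
                 (vid, json.dumps(fetch_metadata(vid))))

n = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
print(n)  # 3 unique IDs stored
```

Serializing to JSON (and parsing with json.loads on the way back out) is one answer to the hint; pickling is another.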
- Test the hypothesis that there is a humor advantage to being on the left or the right,
using a permutation test.
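One way to set this up (a sketch, with made-up outcomes): under the null, the left/right labels are arbitrary, so flipping each label with probability 1/2 leaves the distribution of the test statistic unchanged. The statistic here, the count of left-side wins, is one reasonable choice, not the prescribed one.

```python
import random

# Toy outcomes: 1 if the left-hand video was found funnier, 0 if the
# right-hand one was.
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
observed = sum(outcomes)
center = len(outcomes) / 2.0

random.seed(0)
reps = 10000
extreme = 0
for _ in range(reps):
    # Flip each left/right label independently with probability 1/2.
    flipped = sum(o if random.random() < 0.5 else 1 - o for o in outcomes)
    # Two-sided: count permuted statistics at least as far from the
    # center as the observed count of left-side wins.
    if abs(flipped - center) >= abs(observed - center):
        extreme += 1
p_value = extreme / float(reps)
print(p_value)
```

With 8 left wins in 10 trials, the simulated p-value should approximate the two-sided binomial tail probability (about 0.11 here).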
- Find all unique unordered pairs of videos that occur in the data
in both orderings.
That is, all {ID1, ID2} pairs that occur in the data both as (ID1, ID2) and
(ID2, ID1).
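A compact way to do this with Python sets (toy data; the real IDs come from the comedy table):

```python
# Toy (left, right) presentations; find unordered pairs seen both ways.
presentations = [("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("c", "b")]
ordered = set(presentations)
# Keep {ID1, ID2} whenever both (ID1, ID2) and (ID2, ID1) occur;
# frozenset makes the pair hashable and order-free.
both_ways = {frozenset(p) for p in ordered if (p[1], p[0]) in ordered}
print(sorted(sorted(p) for p in both_ways))  # [['a', 'b'], ['b', 'c']]
```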
- For each such pair, find the number of times it occurs in each order,
and the number of times each was found to be funnier in each of those orders.
Think of this as a two-by-two contingency table.
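For a single pair, the two-by-two table can be tallied with a Counter, rows indexed by presentation order and columns by which side won. The record format below is an assumption for illustration.

```python
from collections import Counter

# Toy records: ((left, right), winner) with winner in {'left', 'right'},
# all involving the unordered pair {a, b}.
records = [(("a", "b"), "left"), (("a", "b"), "right"),
           (("b", "a"), "left"), (("b", "a"), "left")]

# 2x2 table: rows = presentation order, columns = winning side.
table = Counter()
for (left, right), winner in records:
    order = "a_first" if left == "a" else "b_first"
    table[(order, winner)] += 1
print(table)
```

The row sums are the counts of each presentation order; the column sums are the total left and right wins.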
- Consider the null hypothesis that the order of presentation (left versus right)
doesn't matter—that the labeling is arbitrary, as if at random, without
any connection to which video will be found to be funnier.
By analogy to Fisher's exact test, determine the joint distribution of the counts
in the four cells in each two-by-two contingency table.
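As in Fisher's exact test, once all four margins of a two-by-two table are fixed, a single cell count (say the upper-left) determines the whole table, and under the null it follows a hypergeometric distribution. A sketch with toy margins:

```python
from math import comb

def hypergeom_pmf(k, row1, col1, total):
    # P(upper-left cell = k) given the row-1 total, column-1 total,
    # and grand total, with all margins fixed.
    return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

# Toy margins: 4 presentations of the pair, 2 in each order,
# 2 total left-side wins.
pmf = [hypergeom_pmf(k, 2, 2, 4) for k in range(3)]
print(pmf)  # probabilities for k = 0, 1, 2
```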
- Devise a test statistic applicable to a single such table, with power against
the alternative that there is either a left-side advantage or a right-side advantage.
- Devise a test statistic applicable to the collection of all such two-by-two
tables, with power against the alternative that there is either a left-side advantage
for all pairs, or a right-side advantage for all pairs.
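One candidate (a possibility to consider, not the prescribed answer): sum the left-minus-right win margins over all tables. A consistent left advantage pushes the sum far positive, a consistent right advantage far negative, so its absolute value has power against both alternatives.

```python
# Toy per-pair counts: (left-side wins, right-side wins) for each
# unordered pair of videos.
tables = [(5, 2), (4, 1), (6, 3)]

# Combined directional statistic: total left-minus-right win margin.
T = sum(left - right for left, right in tables)
print(T)  # 3 + 3 + 3 = 9
```

Its null distribution can be built from the per-table hypergeometric distributions, or by simulation.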
- Construct the weighted (directed) adjacency matrix for the comedy training data,
and plot the directed graph.
Hint:
http://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html
Also, http://networkx.github.com/,
http://networkx.github.com/documentation/latest/gallery.html, and
http://nbviewer.ipython.org/5088324.
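The matrix construction can be sketched in pure Python with toy data; a natural convention (an assumption here) is an edge i -> j weighted by the number of times video i was found funnier than video j. NetworkX's DiGraph and drawing routines (see the links above) can then plot the graph; plotting is omitted so the sketch has no graphics dependency.

```python
# Toy (winner, loser) votes: winner was found funnier than loser.
votes = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]

# Map each video ID to a row/column index.
ids = sorted({v for pair in votes for v in pair})
index = {v: i for i, v in enumerate(ids)}

# Weighted directed adjacency matrix: A[i][j] = times i beat j.
A = [[0] * len(ids) for _ in ids]
for winner, loser in votes:
    A[index[winner]][index[loser]] += 1
print(A)  # rows/columns ordered a, b, c: [[0, 2, 0], [0, 0, 1], [1, 0, 0]]
```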
Copyright ©2013, P.B. Stark. All rights reserved.
This page is http://statistics.berkeley.edu/~stark/Teach/S222/S13/index.htm.