Statistics 222: Statistics Master's Capstone, Spring 2013
- Instructors: P.B. Stark (statistics.berkeley.edu/~stark) and D.M. Goldschmidt
- Office Hours: PBS: Tuesdays, 11am–12pm, 403 Evans Hall; DMG: TBA
- Meets: Tuesday, Thursday 9-11am, 1011 Evans Hall
- Restrictions: open only to Statistics Master's students in the 1-year degree program
Course format:
This is a project-based course involving collaborative problem solving, including collaborative software design
and a group presentation.
There is an emphasis on developing good "hygiene" for collaboration, source-control, and
reproducible computational research.
Scenario: You are in a start-up company. Your team of 5 has to come up with a recommendation engine.
You will need to divide up the job.
The team will have to do some web scraping of unstructured data, select and test appropriate machine-learning
and classification algorithms, decide what approach to take, write up the results in a formal report,
and make a presentation to the venture capitalists (a group of faculty) who are considering funding
your company.
The company has chosen the software tools and environment you will use, but you have to do the coding,
data analysis, etc.
You have 14 weeks to develop an approach and prepare a winning presentation.
Software tools we will use
- Git, github (resources: http://try.github.com, http://www.sbf5.com/~cduan/technical/git/, http://git-scm.com/book, http://gitready.com/)
- VirtualBox or VMWare
- Python: Python documentation (use version 2.7.3), Python on the Mac, Python bootcamp (videos), NumPy, SciPy, scikit-learn, IPython, NetworkX, pandas, statsmodels, Orange, PyML, Python ML resources
- A gallery of IPython notebooks:
https://github.com/IPython/IPython/wiki/A-gallery-of-interesting-IPython-Notebooks.
Note especially
https://github.com/jrjohansson/scientific-python-lectures.
- MySQL (for Mac users, MAMP is an easy way to get MySQL going)
- Amazon Web Services (AWS)
- LaTeX, including Beamer for presentations
Data sources
The course project will use
comedy preference data
from the UC Irvine Machine Learning Data Repository.
Your goal is to predict, as accurately as possible, which of two videos people find funnier.
In order to accomplish this goal, you will need to scrape YouTube pages for features that might help
predict which of a pair of videos people will find funnier, to winnow the potential features down to a manageable set,
and to apply an appropriate algorithm (or algorithms).
Grading
The course grade will be based on class participation and a term project, which includes a paper and an oral presentation
to a set of faculty "clients."
The presentation will take place on Wednesday, 8 May, 2-3:30pm.
If you are unable to be on campus then, please let me know as soon as possible; I need to schedule faculty observers.
Schedule
- Week 1: tools for source control, collaborative development, and reproducible environments: git, github, AWS. Guest lecturer: Aaron Culich
- Week 2: introduction to the YouTube comedy data; SQL. Guest lecturers: Aaron Culich, Harrison Dekker
- Week 3: YouTube Python API. SQLite within Python. Python classes. Detecting code "smells."
Directed weighted graphs; adjacency matrices. Guest lecturer: Aaron Culich
- Week 4: more on reproducible and collaborative computational research. Guest lecturers: Aaron Culich, Fernando Perez.
Materials: http://bit.ly/Y2XmHh
- Week 5: IPython. Guest lecturer: Fernando Perez
- Week 6: nonparametric hypothesis tests based on permutations. Constructing the directed graph for "funnier"
- Week 7: …
- 8 May: final presentations
Assignments
- Install git if you don't already have it. If you have a Mac or PC, the easiest way to do this
is to install the GitHub client. Once you have git installed, clone the repository
github/pbstark/S222_S13_git into a fresh directory on your machine.
Edit the file "hello_world.txt" to add your name.
Save the file; commit the edit; and push the file back to github.
- Write SQL queries to extract the following information about the comedy data:
  - number of distinct video IDs
  - counts of each distinct ID, in decreasing order
  - number of distinct video ID pairs
  - counts of each distinct pair, in decreasing order
  - distinct codes in the left-right field (should be only "left" and "right")
  - number of times the left video was found funnier and number of times the right video was found funnier
  - for each pair that occurs more than once, the number and percentage of times each member of the
pair was found funnier
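As a warm-up, here is a minimal sketch of the first and fourth queries, run against a tiny in-memory SQLite table. The schema (columns left_id, right_id, funnier) and the toy rows are assumptions for illustration only; the real comedy data's table and column names will differ.

```python
import sqlite3

# Toy schema assumed for illustration; adapt the names to the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (left_id TEXT, right_id TEXT, funnier TEXT)")
rows = [("a", "b", "left"), ("a", "b", "right"), ("b", "c", "left")]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)

# Number of distinct video IDs: pool both columns, letting UNION deduplicate.
n_ids = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT left_id AS id FROM ratings UNION SELECT right_id FROM ratings)"
).fetchone()[0]

# Counts of each distinct (left, right) pair, in decreasing order.
pair_counts = conn.execute(
    "SELECT left_id, right_id, COUNT(*) AS n FROM ratings "
    "GROUP BY left_id, right_id ORDER BY n DESC"
).fetchall()
print(n_ids)        # 3 distinct IDs: a, b, c
print(pair_counts)  # [('a', 'b', 2), ('b', 'c', 1)]
```

The remaining queries are variations on the same GROUP BY / COUNT pattern.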
- From within Python, find the list of unique video IDs in the data, retrieve the metadata for
each unique ID, and store it in an SQLite database (as a file).
This will be our "snapshot" of the metadata for the project.
Note that the metadata on YouTube will be changing with time as the videos are viewed,
removed, etc.
You will need to figure out how to store the metadata in a format that SQLite can hold.
- Construct a Python set that contains all the unique video IDs.
HINT: look up Python sets in docs.python.org; think about iterating over rows in the table
and adding elements to the set.
- Iterate over the set, for each element retrieving the YouTube metadata,
and storing the result as a row in the SQLite table.
HINT: do you need to do anything to the object that the YouTube API returns to be able to
store it in an SQLite table?
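A minimal sketch of both steps, using toy data: build the set of unique IDs, then store one serialized metadata record per ID. The function fetch_metadata is a hypothetical stand-in for the real YouTube API call, which returns a Python object rather than something SQLite can store directly.

```python
import json
import sqlite3

# Hypothetical stand-in for the YouTube API call; the real call
# returns a richer object, but the storage issue is the same.
def fetch_metadata(video_id):
    return {"id": video_id, "views": 100, "title": "clip " + video_id}

pairs = [("vidA", "vidB"), ("vidA", "vidC")]   # toy (left, right) rows
ids = set()
for left, right in pairs:      # iterate over rows, adding IDs to the set
    ids.update((left, right))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (video_id TEXT PRIMARY KEY, meta TEXT)")
for vid in ids:
    # SQLite cannot store a dict directly: serialize it to a JSON string.
    conn.execute("INSERT INTO metadata VALUES (?, ?)",
                 (vid, json.dumps(fetch_metadata(vid))))

n = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
print(n)  # 3 unique IDs stored
```

Serializing to JSON (and parsing with json.loads on the way back out) is one answer to the hint; pickling is another.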
- Test the hypothesis that there is a humor advantage to being on the left or the right,
using a permutation test.
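One way to set this up (a sketch, with made-up outcomes): under the null, the left/right labels are arbitrary, so flipping each label with probability 1/2 leaves the distribution of the test statistic unchanged. The statistic here, the count of left-side wins, is one reasonable choice, not the prescribed one.

```python
import random

# Toy outcomes: 1 if the left-hand video was found funnier, 0 if the
# right-hand one was.
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
observed = sum(outcomes)
center = len(outcomes) / 2.0

random.seed(0)
reps = 10000
extreme = 0
for _ in range(reps):
    # Flip each left/right label independently with probability 1/2.
    flipped = sum(o if random.random() < 0.5 else 1 - o for o in outcomes)
    # Two-sided: count permuted statistics at least as far from the
    # center as the observed count of left-side wins.
    if abs(flipped - center) >= abs(observed - center):
        extreme += 1
p_value = extreme / float(reps)
print(p_value)
```

With 8 left wins in 10 trials, the simulated p-value should approximate the two-sided binomial tail probability (about 0.11 here).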
- Find all unique unordered pairs of videos that occur in the data
in both orderings.
That is, all {ID1, ID2} pairs that occur in the data both as (ID1, ID2) and
(ID2, ID1).
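A compact way to do this with Python sets (toy data; the real IDs come from the comedy table):

```python
# Toy (left, right) presentations; find unordered pairs seen both ways.
presentations = [("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("c", "b")]
ordered = set(presentations)
# Keep {ID1, ID2} whenever both (ID1, ID2) and (ID2, ID1) occur;
# frozenset makes the pair hashable and order-free.
both_ways = {frozenset(p) for p in ordered if (p[1], p[0]) in ordered}
print(sorted(sorted(p) for p in both_ways))  # [['a', 'b'], ['b', 'c']]
```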
- For each such pair, find the number of times it occurs in each order,
and the number of times each was found to be funnier in each of those orders.
Think of this as a two-by-two contingency table.
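For a single pair, the two-by-two table can be tallied with a Counter, rows indexed by presentation order and columns by which side won. The record format below is an assumption for illustration.

```python
from collections import Counter

# Toy records: ((left, right), winner) with winner in {'left', 'right'},
# all involving the unordered pair {a, b}.
records = [(("a", "b"), "left"), (("a", "b"), "right"),
           (("b", "a"), "left"), (("b", "a"), "left")]

# 2x2 table: rows = presentation order, columns = winning side.
table = Counter()
for (left, right), winner in records:
    order = "a_first" if left == "a" else "b_first"
    table[(order, winner)] += 1
print(table)
```

The row sums are the counts of each presentation order; the column sums are the total left and right wins.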
- Consider the null hypothesis that the order of presentation (left versus right)
doesn't matter—that the labeling is arbitrary, as if at random, without
any connection to which video will be found to be funnier.
By analogy to Fisher's exact test, determine the joint distribution of the counts
in the four cells in each two-by-two contingency table.
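As in Fisher's exact test, once all four margins of a two-by-two table are fixed, a single cell count (say the upper-left) determines the whole table, and under the null it follows a hypergeometric distribution. A sketch with toy margins:

```python
from math import comb

def hypergeom_pmf(k, row1, col1, total):
    # P(upper-left cell = k) given the row-1 total, column-1 total,
    # and grand total, with all margins fixed.
    return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

# Toy margins: 4 presentations of the pair, 2 in each order,
# 2 total left-side wins.
pmf = [hypergeom_pmf(k, 2, 2, 4) for k in range(3)]
print(pmf)  # probabilities for k = 0, 1, 2
```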
- Devise a test statistic applicable to a single such table, with power against
the alternative that there is either a left-side advantage or a right-side advantage.
- Devise a test statistic applicable to the collection of all such two-by-two
tables, with power against the alternative that there is either a left-side advantage
for all pairs, or a right-side advantage for all pairs.
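One candidate (a possibility to consider, not the prescribed answer): sum the left-minus-right win margins over all tables. A consistent left advantage pushes the sum far positive, a consistent right advantage far negative, so its absolute value has power against both alternatives.

```python
# Toy per-pair counts: (left-side wins, right-side wins) for each
# unordered pair of videos.
tables = [(5, 2), (4, 1), (6, 3)]

# Combined directional statistic: total left-minus-right win margin.
T = sum(left - right for left, right in tables)
print(T)  # 3 + 3 + 3 = 9
```

Its null distribution can be built from the per-table hypergeometric distributions, or by simulation.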
- Construct the weighted (directed) adjacency matrix for the comedy training data,
and plot the directed graph.
Hint:
http://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html
Also, http://networkx.github.com/,
http://networkx.github.com/documentation/latest/gallery.html, and
http://nbviewer.ipython.org/5088324.
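The matrix construction can be sketched in pure Python with toy data; a natural convention (an assumption here) is an edge i -> j weighted by the number of times video i was found funnier than video j. NetworkX's DiGraph and drawing routines (see the links above) can then plot the graph; plotting is omitted so the sketch has no graphics dependency.

```python
# Toy (winner, loser) votes: winner was found funnier than loser.
votes = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]

# Map each video ID to a row/column index.
ids = sorted({v for pair in votes for v in pair})
index = {v: i for i, v in enumerate(ids)}

# Weighted directed adjacency matrix: A[i][j] = times i beat j.
A = [[0] * len(ids) for _ in ids]
for winner, loser in votes:
    A[index[winner]][index[loser]] += 1
print(A)  # rows/columns ordered a, b, c: [[0, 2, 0], [0, 0, 1], [1, 0, 0]]
```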
Copyright ©2013, P.B. Stark. All rights reserved.
This page is http://statistics.berkeley.edu/~stark/Teach/S222/S13/index.htm.