Statistics 222: Statistics Master's Capstone, Spring 2013

Course format:

This is a project-based course involving collaborative problem solving, including collaborative software design and a group presentation. There is an emphasis on developing good "hygiene" for collaboration, source-control, and reproducible computational research.

Scenario: You are in a start-up company. Your team of 5 has to come up with a recommendation engine, and you will need to divide up the job. The team will have to do some web scraping of unstructured data, select and test appropriate machine-learning and classification algorithms, decide what approach to take, write up the results in a formal report, and make a presentation to the venture capitalists (a group of faculty) who are considering funding your company. The company has chosen the software tools and environment you will use, but you have to do the coding, data analysis, etc. You have 14 weeks to develop an approach and prepare a winning presentation.

Software tools we will use

Data sources

The course project will use comedy preference data from the UC Irvine Machine Learning Data Repository. Your goal is to predict, as accurately as possible, which of two videos people find funnier. In order to accomplish this goal, you will need to scrape YouTube pages for features that might help predict which of a pair of videos people will find funnier, to winnow the potential features down to a manageable set, and to apply an appropriate algorithm (or algorithms).

Grading

The course grade will be based on class participation and a term project, which includes a paper and an oral presentation to a set of faculty "clients." The presentation will take place on Wednesday, 8 May, 2-3:30pm. If you are unable to be on campus then, please let me know as soon as possible; I need to schedule faculty observers.

Schedule

Assignments

  1. Install git if you don't already have it. If you have a Mac or PC, the easiest way to do this is to install the GitHub client. Once you have git installed, clone the repository github/pbstark/S222_S13_git into a fresh directory on your machine. Edit the file "hello_world.txt" to add your name. Save the file, commit the edit, and push the commit back to GitHub.
  2. Write SQL queries to extract the following information about the comedy data:
    1. number of distinct video IDs
    2. counts of each distinct ID, in decreasing order
    3. number of distinct video ID pairs
    4. counts of each distinct pair, in decreasing order
    5. distinct codes in the left-right field (should be only "left" and "right")
    6. number of times the left video was found funnier and number of times the right video was found funnier
    7. for each pair that occurs more than once, number and percentage of times each member of the pair was found funnier
  3. From within Python, find the list of unique video IDs in the data, retrieve the metadata for each unique ID, and store it in an SQLite database (as a file). This will be our "snapshot" of the metadata for the project. Note that the metadata on YouTube will be changing with time as the videos are viewed, removed, etc. You will need to figure out how to store the metadata in a format that SQLite can hold.
    1. Construct a Python set that contains all the unique video IDs. HINT: look up Python sets in docs.python.org; think about iterating over rows in the table and adding elements to the set.
    2. Iterate over the set, for each element retrieving the YouTube metadata and storing the result as a row in the SQLite table. HINT: do you need to do anything to the object that the YouTube API returns to be able to store it in an SQLite table?
  4. Test the hypothesis that there is a humor advantage to being on the left or the right, using a permutation test.
    1. Find all unique unordered pairs of videos that occur in the data in both orderings. That is, all {ID1, ID2} pairs that occur in the data both as (ID1, ID2) and (ID2, ID1).
    2. For each such pair, find the number of times it occurs in each order, and the number of times each was found to be funnier in each of those orders. Think of this as a two-by-two contingency table.
    3. Consider the null hypothesis that the order of presentation (left versus right) doesn't matter—that the labeling is arbitrary, as if at random, without any connection to which video will be found to be funnier. By analogy to Fisher's exact test, determine the joint distribution of the counts in the four cells in each two-by-two contingency table.
    4. Devise a test statistic applicable to a single such table, with power against the alternative that there is either a left-side advantage or a right-side advantage.
    5. Devise a test statistic applicable to the collection of all such two-by-two tables, with power against the alternative that there is either a left-side advantage for all pairs, or a right-side advantage for all pairs.
  5. Construct the weighted (directed) adjacency matrix for the comedy training data, and plot the directed graph. Hints: http://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html, http://networkx.github.com/, http://networkx.github.com/documentation/latest/gallery.html, and http://nbviewer.ipython.org/5088324.
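
As a starting point for the SQL queries in assignment 2, here is a minimal sketch using Python's built-in sqlite3 module on a toy table. The table name "comedy" and the column names "left_id", "right_id", and "funnier" are assumptions made for illustration and may not match the actual data, and "pair" is taken here to mean an ordered (left, right) pair.

```python
import sqlite3

# Toy data in an in-memory database. The table and column names are
# hypothetical; adapt them to the actual layout of the comedy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comedy (left_id TEXT, right_id TEXT, funnier TEXT)")
rows = [
    ("a", "b", "left"),
    ("b", "a", "left"),
    ("a", "c", "right"),
    ("a", "b", "right"),
]
conn.executemany("INSERT INTO comedy VALUES (?, ?, ?)", rows)

# Number of distinct video IDs: UNION drops duplicates across both columns.
n_ids = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT left_id AS id FROM comedy UNION SELECT right_id FROM comedy)"
).fetchone()[0]

# Counts of each distinct ordered pair, in decreasing order.
pair_counts = conn.execute(
    "SELECT left_id, right_id, COUNT(*) AS n FROM comedy "
    "GROUP BY left_id, right_id ORDER BY n DESC, left_id, right_id"
).fetchall()

# Number of times the left and the right video were found funnier.
side_counts = dict(
    conn.execute("SELECT funnier, COUNT(*) FROM comedy GROUP BY funnier").fetchall()
)
```

The remaining queries (per-ID counts, percentages for repeated pairs) follow the same GROUP BY pattern.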
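
For assignment 3, one possible way to hold nested metadata in SQLite is to serialize it to JSON text first. The sketch below assumes the metadata arrives as a nested dictionary; fetch_metadata is a hypothetical stand-in that returns placeholder values and does not call the real YouTube API, and the table and column names are likewise assumptions.

```python
import json
import sqlite3


def fetch_metadata(video_id):
    """Hypothetical stand-in for a YouTube API call; returns a nested dict."""
    return {"id": video_id, "title": "placeholder", "stats": {"views": 0}}


# Toy comparison rows: (left ID, right ID, which side was funnier).
comparisons = [("a", "b", "left"), ("b", "c", "right")]

# Build the set of unique video IDs by iterating over the rows.
video_ids = set()
for left, right, _ in comparisons:
    video_ids.add(left)
    video_ids.add(right)

# Use a file path instead of ":memory:" to keep a snapshot on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (video_id TEXT PRIMARY KEY, json TEXT)")
for vid in sorted(video_ids):
    # SQLite cannot store a dict directly, so store its JSON serialization.
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?)", (vid, json.dumps(fetch_metadata(vid)))
    )
conn.commit()

# Reading a row back and decoding it recovers the nested structure.
stored = json.loads(
    conn.execute("SELECT json FROM metadata WHERE video_id = ?", ("b",)).fetchone()[0]
)
n_rows = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
```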
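
For assignment 4, the following is a rough sketch of a permutation test for a single pair {A, B}, not a complete solution. It assumes the observations for the pair are given as two parallel lists of booleans (whether A was shown on the left, and whether A was found funnier). Conditional on the margins of the two-by-two table, it shuffles the "A was funnier" labels across presentations and uses the absolute deviation of the count in the (A on left, A funnier) cell from its null expectation as the test statistic. The function name and the choice of statistic are illustrative assumptions; other statistics may have better power.

```python
import random


def left_advantage_pvalue(a_on_left, a_won, n_perm=10000, seed=0):
    """Approximate two-sided permutation p-value for a position effect.

    a_on_left[i] is True if video A was on the left in presentation i;
    a_won[i] is True if video A was found funnier in presentation i.
    """
    rng = random.Random(seed)
    n = len(a_on_left)
    # Null expectation of the (A on left, A funnier) cell given the margins.
    expected = sum(a_on_left) * sum(a_won) / n

    def stat(wins):
        cell = sum(1 for on_left, won in zip(a_on_left, wins) if on_left and won)
        return abs(cell - expected)

    observed = stat(a_won)
    shuffled = list(a_won)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # relabel outcomes, keeping both margins fixed
        if stat(shuffled) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)
```

For example, if A is shown on the left ten times and on the right ten times, and the left video always wins, the returned p-value is very small; if A wins equally often in both positions, it equals 1.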
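
For assignment 5, here is a minimal pure-Python sketch of building a weighted directed adjacency matrix from comparison rows. It assumes each row is a (left ID, right ID, "left" or "right") triple, and it adopts the convention that an edge points from the less funny video to the funnier one, with weight equal to the number of such outcomes. Both the row layout and the edge direction are assumptions; the opposite convention works equally well.

```python
def adjacency_matrix(comparisons):
    """Return (ids, matrix), where matrix[i][j] counts how many times
    video ids[j] was found funnier than video ids[i]."""
    ids = sorted({v for left, right, _ in comparisons for v in (left, right)})
    index = {v: i for i, v in enumerate(ids)}
    matrix = [[0] * len(ids) for _ in ids]
    for left, right, funnier in comparisons:
        if funnier == "left":
            winner, loser = left, right
        else:
            winner, loser = right, left
        # Edge from loser to winner; repeated outcomes add to the weight.
        matrix[index[loser]][index[winner]] += 1
    return ids, matrix


ids, matrix = adjacency_matrix(
    [("a", "b", "left"), ("b", "a", "left"), ("a", "c", "right"), ("a", "b", "left")]
)
```

The resulting list of lists can be converted to an array or sparse matrix and handed to a graph library for plotting.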

Copyright ©2013, P.B. Stark. All rights reserved. This page is http://statistics.berkeley.edu/~stark/Teach/S222/S13/index.htm.