Spring 2012 Undergraduate Research Project Descriptions

All applications are due by 12pm on November 30, 2011.

The deadline has been extended to 12pm on Tuesday, December 6, 2011.


Probability, Data and Simulation

Professor David Aldous

I have a variety of projects, ranging from seeking interesting data over the web to running simulations of stochastic processes.   See http://www.stat.berkeley.edu/~aldous/Research/Ugrad/ugrad_res.html  for explicit suggestions and examples of previous projects, and see http://www.stat.berkeley.edu/~aldous/157/topics.html for implicit suggestions.  I am happy to meet with students and discuss which projects might fit their own interests and expertise.


Evaluation of Similarity Metrics (SimBench)

Professor Jim Pitman

This project aims to build a test collection and to develop a test methodology that can be used to evaluate similarity measures for the purpose of Scientific Recommender Systems.

The student’s tasks would be to support the team by:

1.  Verifying existing similarity metrics and implementing new ones (citation- and text-based)

2.  Helping to improve the evaluation methodology

3.  Running statistical analyses on the results
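To give a feel for the two families of metrics mentioned above, here is a minimal sketch in Python. It is an illustration only, not the project's code: the function names are hypothetical, and Jaccard overlap of reference lists and cosine similarity of raw term counts are simply two common choices of citation-based and text-based measure.

```python
import math
from collections import Counter


def jaccard_citation_similarity(refs_a, refs_b):
    """Citation-based: share of cited works that two papers have in common."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cosine_text_similarity(text_a, text_b):
    """Text-based: cosine of the angle between raw term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only terms that occur in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions return a value between 0 and 1, so their scores for the same pair of papers can be compared against a common set of relevance judgments.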

 

The student is expected to work approximately 10 hours per week. Good statistical knowledge and some programming experience in Java would be beneficial.

 


Evaluation of Metadata Extraction Tools


Professor Jim Pitman


As part of our ongoing research we are building a database for storing bibliographic data of scientific documents. These documents are usually available in PDF format. To add them to the database, we need tools that extract the metadata from PDF documents. As part of this project, a framework for PDF metadata extraction tools was developed that makes these tools interchangeable. A first survey of the available tools showed that each has different strengths and weaknesses.

The student's first task will be to evaluate the available tools on a larger data set. This requires:

    - choosing and importing a test collection
    - building an environment for automatically measuring the tools' performance
    - statistical analysis of the results
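As a rough illustration of what automated measurement could look like, the sketch below scores one tool's output against a hand-checked gold standard, field by field. It is written in Python for brevity, and everything in it is an assumption made for the example: the function names, the dictionary layout (document id mapped to field values), and the choice of exact match after light normalization as the scoring rule.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(str(value).lower().split())


def field_accuracy(extracted, gold, fields):
    """Return, per field, the fraction of gold documents for which the tool matched.

    `gold` and `extracted` map a document id to a dict of field values.
    Every gold record is assumed to contain every field in `fields`.
    """
    correct = {f: 0 for f in fields}
    for doc_id, truth in gold.items():
        guess = extracted.get(doc_id, {})  # a document the tool skipped scores zero
        for f in fields:
            if f in guess and normalize(guess[f]) == normalize(truth[f]):
                correct[f] += 1
    n = len(gold)
    return {f: (correct[f] / n if n else 0.0) for f in fields}


# Hypothetical two-document test collection.
gold = {
    "doc1": {"title": "Random Walks", "year": "2010"},
    "doc2": {"title": "Markov Chains", "year": "2008"},
}
extracted = {
    "doc1": {"title": "random  walks", "year": "2010"},
    "doc2": {"title": "Markov Chain", "year": "2008"},
}
scores = field_accuracy(extracted, gold, ["title", "year"])
```

Running the same function over each tool's output gives per-field scores that can then feed the statistical comparison.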

The results should then guide the further implementation of metadata extraction tools in the project environment.

The student is expected to work approximately 10 hours per week. Some statistical skills and knowledge of the Java programming language are required.

Precinct-Based Voting Systems and Voter Anonymity

Professor Philip B. Stark

Precinct-count optical scan (PCOS) voting systems, which are used in Alameda County, Contra Costa County, Marin County, and many other California counties, scan ballots cast by voters in polling places.  These hand-marked ballots drop through the scanner into a ballot box.

It is widely claimed that as the ballots fall into the ballot box, they are in effect shuffled, which protects voter anonymity: Even if you observed the order in which people voted, you would not be able to tell which ballot was cast by which voter once the ballots "mix" in the ballot box.

Is this true?

This project will involve visiting the offices of the Registrar of Voters of three Bay-Area counties that use PCOS systems made by three different voting equipment vendors.  We will conduct an experiment: feed numbered ballots through PCOS systems and measure how well the systems shuffle ballots, using statistical tests.  For instance, we will estimate the average fraction of ballots that end up in their original order, for each vendor's equipment.
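The sketch below shows, in Python, one way such measurements might be computed. It is only an illustration under stated assumptions: ballots are numbered 0 to n-1 in the order fed, `observed` lists those numbers in the order found in the box, and "original order" is read in two possible ways (a ballot staying in its original position, or two consecutively fed ballots staying adjacent and in order). The function names and the Monte Carlo test are choices made for the example, not the project's actual protocol.

```python
import random


def fraction_in_original_position(observed):
    """Fraction of ballots found at the same position in which they were fed."""
    n = len(observed)
    return sum(1 for pos, ballot in enumerate(observed) if pos == ballot) / n


def fraction_adjacent_pairs_kept(observed):
    """Fraction of neighbouring ballots in the box that were fed consecutively, in order."""
    kept = sum(1 for a, b in zip(observed, observed[1:]) if b == a + 1)
    return kept / (len(observed) - 1)


def permutation_p_value(observed, statistic, trials=2000, seed=0):
    """Monte Carlo p-value: how often a uniformly random order scores at least as high."""
    rng = random.Random(seed)
    target = statistic(observed)
    perm = list(range(len(observed)))
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)
        if statistic(perm) >= target:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# A box that did not shuffle at all: 20 ballots found exactly as fed.
unshuffled = list(range(20))
p = permutation_p_value(unshuffled, fraction_adjacent_pairs_kept)
```

A small p-value says the observed order retains more of the feeding order than a perfect shuffle would be expected to, which is the kind of evidence the experiment is looking for.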

Touchscreen voting systems deployed in California "shuffle" the order of electronic records of voters' selections relative to the paper print-out of the voters' selections, again with the purported goal of protecting voter anonymity.  How well does the software shuffle the electronic records?  Does it use a random permutation? Does it use a pseudo-random number generator?  If so, is it a good one, and how is the seed selected? Is the shuffling deterministic? Can the shuffling be reversed? *If* we can get official access to the systems and software, we will quantify how well these systems shuffle.
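To see why the seed matters, consider the following Python sketch. It says nothing about how any real voting system works; it uses Python's standard `random` module and hypothetical function names purely to illustrate the general point that a shuffle driven by a seeded pseudo-random number generator is deterministic, and that anyone who knows the seed and the algorithm can undo it.

```python
import random


def seeded_permutation(n, seed):
    """Return the order produced by shuffling 0..n-1 with a PRNG started from `seed`."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def shuffle_records(records, seed):
    """Reorder the records using the seeded permutation."""
    order = seeded_permutation(len(records), seed)
    return [records[i] for i in order]


def unshuffle_records(shuffled, seed):
    """Recover the original order by regenerating the same permutation from the seed."""
    order = seeded_permutation(len(shuffled), seed)
    original = [None] * len(shuffled)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = shuffled[new_pos]
    return original


# Hypothetical records, listed in the order the voters cast them.
records = ["voter1", "voter2", "voter3", "voter4", "voter5"]
shuffled = shuffle_records(records, seed=42)
recovered = unshuffle_records(shuffled, seed=42)
```

Here `recovered` equals `records`, so in this toy setting the anonymity provided by the shuffle is only as strong as the secrecy and unpredictability of the seed.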

This project is time-sensitive.  The data will be collected in January and February; we must submit an article describing the experiment and results by April.