CS 267 Application of Parallel Computers

Homework Assignment #0

Curt Hansen

My Bio

I am a PhD student in biostatistics here at Berkeley, and I earned a master's degree in mathematics here in 2010. My relevant coursework has included computational statistics, statistical genetics, time series analysis, ordinary and partial differential equations, and numerical linear algebra. In addition, I have taken the undergraduate courses in artificial intelligence (CS188) and algorithms (CS170).

My particular interest is in bioinformatics, which often involves the processing of very large datasets and intensive computations.

While I have a fairly strong computing background, particularly in Matlab and R, and know the basics of C and C++, I am completely new to parallel computing. Every program I have ever written has been single-threaded.

My goal for this course is to learn what options exist for parallel computing, what drives the choice of one implementation over another, what performance improvements can be expected, and how parallel computing can be applied to typical problems in bioinformatics. My intention is to involve some aspect of parallel computing in my dissertation.

Sample Application

The sample application I chose comes from the article "Evaluating Parallel Computing Systems in Bioinformatics" by Gough and Kane in 2006 [1].

Background
Despite the rather general title of the article, the authors focus on the application area of gene sequence searching. This involves taking a given sequence of interest, which may have been chosen for a variety of reasons, and attempting to locate it in a reference genome for the organism in question. A genome is the complete set of DNA for an organism. DNA is composed of molecules called nucleotides (the familiar C, G, A, and T), so the genome of an organism can be written as a sequence of letters from the set {C,G,A,T}. The length of these sequences varies by organism. In the case of Homo sapiens, the length is approximately 2.8 billion characters. Somewhat surprisingly, many simpler organisms have longer genomes. For example, in October 2010 researchers announced that Paris japonica, a white flower from Japan, has a genome roughly 50 times longer than that of humans [2].

Roughly speaking, the genome of any particular organism can be divided into stretches called "genes" that code for proteins, with the remainder serving no function known at present. In humans, only about 2% of the genome has been found to belong to a gene; similar sparsity is seen in other organisms.

As the genome of an organism is analyzed and researchers determine which regions are genes (and what those genes do) and which are "filler", the findings are documented in one or more databases. This documentation process is termed "annotation". GenBank, administered by the National Center for Biotechnology Information, is one of the preeminent databases and contains annotations for many organisms. As of 2006, GenBank had over 40 million gene sequences and was growing exponentially (it grew 2500-fold from 1987 to 2004). In particular, it was (and is) growing roughly 15 times faster than processor speeds are increasing.

Finally, one of the most frequent reasons for performing a gene sequence search is to match a stretch of the genome of a newly investigated organism against the annotated genome of a previously investigated organism. This helps determine the nature and function of suspected genes in the new organism; hence the need for tools to perform this search efficiently.
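To make the search problem concrete, the sketch below scans a genome, represented as a string over {C,G,A,T}, for exact occurrences of a query sequence. This is purely illustrative: real tools like the ones evaluated in the article use heuristic, score-based alignment rather than exact matching, and an exact scan over billions of characters would be far too slow in practice.

```python
# Illustrative only: naive exact-match scan over a genome string.
# Real gene-search tools (e.g., BLAST-family programs) use heuristic,
# score-based alignment, not exact matching.

def naive_search(genome: str, query: str) -> list[int]:
    """Return every start position where query occurs exactly in genome."""
    hits = []
    for i in range(len(genome) - len(query) + 1):
        if genome[i:i + len(query)] == query:
            hits.append(i)
    return hits

genome = "CGTACGATTACAGATTACA"
print(naive_search(genome, "GATTACA"))  # -> [5, 12]
```

Even this toy version makes the cost visible: the work grows with the product of genome length and query length, which motivates both the pre-processing tricks and the parallelization discussed below.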

Description of Study
The goal of the authors was to evaluate the performance of several computing-cluster configurations using two existing parallel gene-search programs, ParAlign and mpiBlast. Both run in a cluster environment using MPI. While the article does not explain the algorithms in detail, the two differ in that mpiBlast performs a pre-processing step in which all possible 11-letter words are computed from the database up front and used in the subsequent search.
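The word pre-processing idea can be sketched as follows: index every 11-letter word ("11-mer") of the database sequence by position, then look up each 11-mer of the query to find candidate match sites. This is a hedged illustration of the general technique, not mpiBlast's actual data structures or scoring; only the word length of 11 comes from the article.

```python
# Sketch of 11-letter-word pre-processing (word length from the article;
# everything else here is an illustrative assumption, not mpiBlast's code).
from collections import defaultdict

W = 11  # word length

def build_word_index(db_seq: str) -> dict[str, list[int]]:
    """Map each 11-mer of the database sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(db_seq) - W + 1):
        index[db_seq[i:i + W]].append(i)
    return index

def seed_hits(index: dict[str, list[int]], query: str) -> list[tuple[int, int]]:
    """Pairs (query_offset, db_offset) where an 11-mer of the query matches."""
    hits = []
    for q in range(len(query) - W + 1):
        for d in index.get(query[q:q + W], []):
            hits.append((q, d))
    return hits

db = "CCCCAAAAAAAAAAATTTT"
print(seed_hits(build_word_index(db), "GAAAAAAAAAAAG"))  # -> [(1, 4)]
```

The payoff is that the index is built once per database, after which each search touches only the positions sharing an exact 11-mer with the query, rather than scanning the whole database.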

The configurations under consideration were all clusters of workstations and included:

1. 24 Dell 220 workstations running Linux 2.4.20, using Intel Pentium III processors at 633 MHz with 512 MB of memory each, connected via fast Ethernet using MPICH 1.2.5

2. 24 Dell 270 workstations running Linux 2.4.20, using Intel Pentium 4 processors at 2.66 GHz with 512 MB of memory each, connected via fast Ethernet using MPICH 1.2.5

3. 12 Orion workstations running Linux 2.6.6, using Transmeta processors at 1.2 GHz with 1 GB of memory each, connected via gigabit Ethernet using MPICH2

Each algorithm-cluster combination was assessed using a variety of standard searches of different lengths and with databases of different sizes. ParAlign was assessed on the two Dell configurations, and mpiBlast was assessed on all three (with the cluster size of each Dell platform reduced to 12, equal to that of the Orion cluster). Presumably, ParAlign was not assessed on the Orion configuration due to a technical incompatibility.

Performance was measured in terms of time to completion.

Results
Across all combinations involving ParAlign, there was a nearly linear relationship between the size of the query (i.e., length of the search sequence and size of the database) and the time to completion. The Dell 270 cluster performed roughly twice as fast as the Dell 220 cluster, a result the authors do not appear to explain but which is likely due to the faster processor in the 270 configuration, all other parameters being the same. Note that the performance gain is only about two-fold, while the processor clock is roughly four times faster. For a small database, the 220 cluster completed in approximately 60 seconds versus 20 seconds for the 270 cluster; for a large database, approximately 120 versus 60 seconds.
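The gap between the clock-speed ratio and the observed speedup can be checked directly from the figures quoted above. Note that the small-database case actually works out closer to three-fold, while the large-database case is two-fold, consistent with the authors' "roughly twice as fast" summary.

```python
# Arithmetic check using the approximate figures reported in the article.
clock_ratio = 2.66e9 / 633e6      # Pentium 4 (2.66 GHz) vs. Pentium III (633 MHz)
speedup_small = 60 / 20           # small database: Dell 220 time / Dell 270 time
speedup_large = 120 / 60          # large database

print(round(clock_ratio, 1), speedup_small, speedup_large)  # -> 4.2 3.0 2.0
```

The shortfall of the observed speedup relative to the clock ratio suggests that the ParAlign runs were not purely compute-bound on these clusters.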

Across all combinations involving mpiBlast, there was a non-linear relationship between the size of the query and the time to completion on all three platforms. In each case, as query size increased, the time to completion initially rose at a relatively high rate, then hit a "sweet spot" where it increased slowly over a range of query sizes, and then entered a range where it increased dramatically. The two Dell configurations performed nearly the same, despite the faster processor of the 270 cluster. In all cases, the Orion cluster performed roughly twice as fast as either Dell cluster. As an indication, for a search sequence length of 12,500 the Orion and Dell clusters completed in approximately 100 and 200 seconds, respectively. For a length of 25,000, they completed in approximately 180 and 360 seconds, respectively. For a length of 28,000, they completed in approximately 400 and 800 seconds, respectively.

Because the query sizes were much larger for the mpiBlast runs (10,000-30,000) than for the ParAlign runs (150-600), no direct comparison is possible. However, the authors note that mpiBlast is approximately 50 times faster than ParAlign (presumably after controlling for search length).

Conclusion
The following general conclusions can be drawn from the study:

1. Processor speed is clearly not always an important factor in the performance of a cluster; whether it matters depends on the details of the algorithm in question.

2. A critical factor is the efficiency of the algorithm used. Here, mpiBlast outperforms ParAlign.

3. Other factors include network speed, MPI version, and memory per processor.

Note that the study did not test every possible combination of algorithm, processor speed, memory, MPI version, and network speed. The last point above is made in part because the Orion configuration outperformed the Dell configurations even though the latter had a slower network, an earlier MPI version, and less memory per processor.

References

[1] Gough, Erik S. and Kane, Michael D., "Evaluating Parallel Computing Systems in Bioinformatics", Proceedings of the Third International Conference on Information Technology: New Generations, IEEE. 2006.

[2] http://www.physorg.com/news205731281.html