PB HLTH 292, Section 020
Statistics and Genomics Seminar

Spring 2008



Thursday, January 24th

Completion of the Drosophila Gene Collection
Dr. Joe Carlson
Celniker Lab, Life Sciences Division, Lawrence Berkeley National Laboratory

A comprehensive collection of cDNA clones representing the entire transcriptome is an indispensable resource for annotating the genome to determine the transcribed regions, and for conducting experiments to assess the biological role of transcripts. Random EST sequencing is a very efficient procedure for establishing a collection, but completing it requires a more directed approach. We have been working on completing the Drosophila gene collection through large-scale targeted inverse PCR amplification of specific exons. This work has particular relevance for experimental verification of transcript structure predictions and it has resulted in significant modifications of the genome annotation.


Thursday, January 31st

Detecting interactions in association mapping studies and the prospects of using evolutionary inferences to inform such studies
Professor Rasmus Nielsen
Departments of Integrative Biology and Statistics, UC Berkeley

For most common diseases with heritable components, no single single-nucleotide polymorphism (SNP), nor any small set of SNPs, explains most of the variance. Instead, much of the variance may be caused by interactions (epistasis) among multiple SNPs or interactions with environmental conditions. I will discuss a new statistical model for analyzing and interpreting genomic data that influence multifactorial phenotypic traits with a complex and likely polygenic inheritance. The new method is based on Markov chain Monte Carlo (MCMC) and allows for identification of sets of SNPs and environmental factors that, when combined, increase disease risk or change the distribution of a quantitative trait. Using simulations, we show that the MCMC method can detect disease association when multiple, interacting SNPs are present in the data. When applying the method to real large-scale data from a Danish population-based cohort, multiple interactions are identified that severely affect serum triglyceride levels in the study individuals. The method is designed for quantitative traits but can also be applied to qualitative traits. It is computationally feasible even for a large number of possible interactions and differs fundamentally from most previous approaches by entertaining nonlinear interactions and by directly addressing the multiple-testing problem. I will also discuss an approach for using evolutionary information to inform association mapping studies. We have developed a Bayesian approach which combines structural, population genetic and comparative genomic data to quantify the probability that a particular mutation is deleterious. We validate the method on real data and show that it has better frequentist properties than previous methods for predicting the fitness effects of mutations.


Thursday, February 7th

A complete classification of epistatic two-locus models
Dr. Ingileif B. Hallgrimsdottir
Department of Statistics, University of Oxford

We describe a geometric framework to classify interaction in epistatic biallelic two-locus models. We show that there are 387 distinct types of two-locus models, which can be reduced to 69 when symmetry between loci and alleles is accounted for. We study the biological relevance of these models and we discuss the connection between these model classes and the two-locus models that are commonly used. We will also discuss the connection between Li and Reich's classification of two-locus disease models with 0/1 penetrance values and ours. Finally, we consider implications for studying the power of statistical tests for interaction.

Joint work with Debbie Yuster.


Thursday, February 14th

Case-Control Association Testing with Incomplete Genealogy
Dr. Timothy A. Thornton
Department of Statistics, UC Berkeley

We consider the problem of case-control association testing when some sampled individuals are related, with the relationships unknown. Using related individuals in case-control studies has compelling advantages. With related individuals, correlations among relatives must be taken into account to ensure validity of the test. We first give an overview of proposed methods when the genealogy is completely specified. We then consider the case when the genealogy is incomplete and present a new approach.


Tuesday, February 19th

Regulatory genomics of Drosophila and mammalian genomes
Professor Manolis Kellis
MIT Computer Science and Artificial Intelligence Laboratory (Stata Center) and Broad Institute of MIT and Harvard

A systematic understanding of gene regulation in animal genomes requires the ability to determine pre- and post-transcriptional regulatory networks, and their dynamics across development. Our lab is interested in the computational underpinnings of such endeavors, developing algorithms and machine learning techniques to discover regulatory elements and their interconnections in flies and mammals. We have used comparative genomics of 12 Drosophila genomes and of 24 mammalian genomes to discover regulatory motifs associated with promoter and enhancer regions, as defined by their chromatin marks. We have also used comparative methods to discover and characterize microRNA (miRNA) genes and their targets in Drosophila, discovering many novel miRNAs and miRNA families, which reveal a denser miRNA targeting network with increased potential for combinatorial control. We have also demonstrated that both arms of a miRNA hairpin can be functional, and both strands of a miRNA gene can be transcribed and lead to functional miRNA regulators. In the case of the Hox-encoded miRNA miR-iab-4, we have shown that its anti-sense miRNA can lead to homeotic transformation of halteres into wings, establishing it as a new Hox gene, and the first functional anti-sense miRNA. Lastly, we have used comparative genomics to infer regulatory networks based on individual conserved instances of regulatory motifs, which show functional enrichments similar to and sometimes higher than genome-scale experimental methods such as ChIP-chip. As part of the ENCODE and modENCODE projects, we are now studying dynamics of developmental and cell-differentiation networks in Drosophila and human, the tissue-specificity, and the sequence determinants of the establishment and maintenance of chromatin state.


Thursday, February 21st

Methods for mapping disease genes in large families
Professor Mark Abney
Department of Human Genetics, University of Chicago

Finding genomic regions that have been inherited jointly in diseased family members is a key strategy for mapping genes of biomedical interest. Large, extended families will typically provide more information than small families for mapping but are also more challenging computationally. Here I discuss novel methods that allow for rapid computations on families of virtually arbitrary size, opening up the possibility of analyzing data sets that have been intractable. The methods I discuss are useful for analyzing genotype data to find regions of joint inheritance and for more precisely modeling phenotype data to better describe the trait.


Thursday, February 28th

Characterization of Cancer Gene Mutations in Human Cancer Cell lines for Correlation with Drug Activity
Professor Ogechi N. Ikediobi
Department of Clinical Pharmacy, UCSF

The panel of 60 human cancer cell lines (the NCI-60) assembled by the National Cancer Institute for anticancer drug discovery is a widely used resource. The NCI-60 has been characterized pharmacologically and at the molecular level more extensively than any other set of cell lines. There had not, however, been a systematic sequence analysis of the NCI-60 for key genes causally implicated in oncogenesis. We report the sequence analysis of 24 known cancer genes in the NCI-60 and an assessment of four of the 24 genes for homozygous deletions. Using a pharmacogenomic approach, we have identified an association between mutation in BRAF and the anti-proliferative potential of phenothiazine compounds. Phenothiazine compounds have been used as anti-psychotics and as adjunct anti-emetics during cancer chemotherapy, and have more recently been reported to have anti-cancer properties. However, to date the phenothiazine anti-cancer mechanism of action has not been elucidated. We demonstrate that BRAF mutation (V600E) in melanoma is predictive of an increased sensitivity to phenothiazines. We also show that RAS mutant and RAS/BRAF wild type melanoma cell lines are less sensitive to inhibition by phenothiazines than are BRAF mutant melanoma cell lines. This pattern of increased sensitivity to phenothiazines based on the presence of V600E BRAF mutation may be unique to melanomas; we do not observe it in a panel of colorectal cancers. The clinical implications for the use of phenothiazines for the treatment of melanoma, in light of the in vitro differential sensitivity between V600E BRAF mutant and RAS mutant melanomas, are discussed.


Thursday, March 6th

Detecting Alternative Splicing from Exon Array Data
Dr. Elizabeth A. Purdom
Division of Biostatistics and Department of Statistics, UC Berkeley

Recently, Affymetrix released the Human Exon 1.0 ST array, a whole genome array that queries the expression levels of known and putative exonic regions. The exon array allows for detecting subtle differences in the expression patterns of genes, in particular the alternative splicing of exons. In this talk, we will introduce our algorithm for finding alternative splicing events, FIRMA (Finding Isoforms using Robust Multichip Analysis), and discuss its performance on real datasets. We will also highlight some challenges in using the array and the effect these have on finding alternative splicing.

This is joint work with Ken Simpson, Mark Robinson, and Terry Speed.


Thursday, March 13th

Multivariate Analysis of Codon Usage in Vinyl Chloride Reductase Genes: An Example of Decomposing the Chi-square of a Contingency Table Effectively
Professor Susan Holmes
Department of Statistics, Stanford University

Particular types of multivariate analyses use Chi-square distances and inertia instead of variance. I will show the effect of taking such an approach in the context of a codon bias study for a very useful bacterium that likes to eat dry cleaning fluid.

This is joint work with Alfred Spormann and Joey McMurdie from Chemical Engineering at Stanford.


Thursday, April 3rd

Using In-Vitro Experiments to Model the Affinity of DNA Sequences to the Transcription Factor Bicoid
Dr. Juli K. Atherton
Department of Statistics, UC Berkeley

In genomic research, determining locations on the genome to which a transcription factor binds with medium to high affinity might help identify possible transcription factor binding sites. Hence, interest lies in developing models that predict the affinity of a transcription factor based on nucleotide sequence. Often, data from in-vitro experiments are used when building such models.

Two types of in-vitro experiments designed to determine the affinities of a transcription factor to DNA sequences are considered. The first experiment is the systematic evolution of ligands by exponential enrichment (SELEX) experiment. In this experiment, one begins with a large random pool of DNA sequences and, after many rounds, selects for the highest affinity sequence. The second experiment is a multiplex assay experiment. This experiment is done on a select group of DNA sequences and provides much more precise measurements of affinities. The data presented are for Bicoid, a transcription factor in Drosophila melanogaster.

In this talk, I will begin with a simple biochemical explanation of both experiments. I will then discuss our analysis of the data thus far, stressing the statistical methodology, and also addressing issues in the design of these experiments. As this work is ongoing, I will finish by mentioning current and future research regarding these experiments.

This is joint work with Peter Bickel's group and Mark Biggin's lab.


Thursday, April 10th

Gene expression signatures in human disease: an illustration with genomic data on breast cancer
Dr. Darlene R. Goldstein
Institut de mathématiques, École Polytechnique Fédérale de Lausanne (EPFL)

Gene expression profiling is gaining increasing prominence for subtype identification, diagnosis and prognosis in human disease, particularly for cancers. However, genes constituting the signatures vary across studies, with little overlap. In this talk, I will introduce the concept of biologically based coexpression modules - sets of genes with highly correlated expression - and provide a unification of proposed signatures using quantitative measures of module activity. The methods will be illustrated with a comprehensive analysis of publicly available expression data on 2833 breast cancer tumors.

This is joint work with Pratyaksha Wirapati and Mauro Delorenzi of the Swiss Institute of Bioinformatics.


Thursday, April 17th

Estimating and modeling rates of evolution
Dr. Rachel B. Bevan
Division of Biostatistics, UC Berkeley

One of the primary goals of biological research is to understand the causes and mechanisms of molecular evolution. Statistical and computational research in the field focuses on modeling evolution probabilistically, with the goal of better understanding evolutionary relationships between species. I will present work aimed at improving these models and applications to real data.

Firstly, I will present a fast method that allows for quick estimation of relative evolutionary rates of proteins, an important component in accurate phylogenetic estimation. The DistR approach, which estimates gene/protein evolutionary rates based on pairwise distances between taxa derived from gene/protein sequence data, is presented. Simulation studies indicate that this algorithm accurately estimates rates and is robust to missing data. Secondly, I will discuss two different approaches to incorporating gene rates into phylogenetic inference: i) allowing each gene/protein to have a single gene-wide rate of evolution; ii) integrating out over the possible rates of evolution of a gene using the Gamma distribution. Finally, time permitting, I will present current work on using a mixture model to account for the rates of evolution of sites in multi-gene data sets.


Thursday, April 24th

Stochasticity and Networks in Genomic Data
Professor John Quackenbush
Dana-Farber Cancer Institute and the Harvard School of Public Health

Two trends are driving innovation and discovery in biological sciences: technologies that allow holistic surveys of genes, proteins, and metabolites and a realization that biological processes are driven by complex networks of interacting biological molecules. However, there is a gap between the gene lists emerging from genome sequencing projects and the network diagrams that are essential if we are to understand the link between genotype and phenotype. 'Omic technologies were once heralded as providing a window into those networks, but so far their success has been limited. To circumvent these limitations, we developed a method that combines 'omic data with other sources of information. Here we will present an approach that uses literature networks as constraints on a Bayesian Network analysis of microarray data; we show that we are able to recover evidence for a wide range of known networks and pathways, even in experiments not explicitly designed to probe them.

With a putative gene-interaction network, the problem of producing viable models of the cell remains. While systems biology approaches that attempt to develop quantitative, predictive models of cellular processes have received great attention, it is surprising to note that the starting point for all cellular gene expression, the transcription of RNA, has not been described and measured in a population of living cells. To address this problem, we propose a simple model for transcript levels based on Poisson statistics and provide supporting experimental evidence for genes known to be expressed at high, moderate, and low levels. Not only do these data confirm our model, but this general strategy opens up a potential new approach, Mesoscopic Biology, that can be used to assess the natural variability of processes occurring at the cellular level in biological systems.


Thursday, May 1st

Biodefense: Bioinformatics Challenges in a Global Context
Dr. Tom Slezak
Associate Program Leader, Informatics and Assays, Lawrence Livermore National Laboratory

The 21st century is seeing biology establishing itself as the fastest-evolving science. Fueled by rapid advances in both biotechnology and bioinformatics, overwhelming amounts of data are presenting us with many new scientific and technological challenges in all aspects of biology, including biodefense. Increasing human incursion into remote areas is causing the emergence of "novel" natural pathogens that are rapidly spread by modern travel and commerce. Additionally, the dark side of recent biotechnology advances raises the likelihood that genetic engineering and synthetic biology might cause harmful results, whether inadvertently or maliciously.

The field of biodefense exists in an interesting and turbulent intersection of technology, politics, mission-space, economics, and ethics. Far from being isolated from the chaos, bioinformatics all too frequently finds itself right in the middle of the controversy. This talk will draw upon LLNL's experiences over the past 8 years to present one viewpoint of some of the key challenges facing bioinformatics in the biodefense field. From inadequate algorithms and inappropriate computer architectures, to impotent bureaucracy and pork-barrel politics, to the inability to get genomic data from countries with dangerous epidemics (or from colleagues at federal agencies in the US), to the ethical problems raised by trying to defend against malicious genetic engineering; all of these impact researchers working in bioinformatics applied to biodefense. Efforts underway at LLNL to deal with many of these challenges will be discussed.


Thursday, May 8th

Statistical Assignment of DNA Sequences using Bayesian Phylogenetics
Dr. Kasper Munch
Department of Integrative Biology, UC Berkeley

The assignment of DNA from organic material to species or taxonomic groups is integral to a number of scientific disciplines. The identification of unknown specimens based on Cytochrome Oxidase I (COI) has become known as DNA Barcoding. More importantly, identifying sequences from environmental samples in metagenomic approaches may allow environments to be characterized according to their genetic fingerprint. This approach is particularly suitable for viruses and bacterial species, which have relatively small genomes. Even without taking genomic approaches, however, DNA sequencing of selected markers from environmental samples may provide ecological information or identify relevant species such as human pathogens.

A Bayesian approach can be used to calculate a probability of assignment to each taxonomic unit represented in a sequence database. The probability of assignment to each taxon serves as a measure of confidence in the assignment. In this talk I will introduce the assignment problem and a tool that implements the Bayesian approach. At the end I may have time to touch on other current research.

This is joint work with Wouter Boomsma, John Huelsenbeck and Rasmus Nielsen.