PB HLTH 292, Fall 2010

PB HLTH 292, Section 008
Statistics and Genomics Seminar

Fall 2010

Thursday, August 26th

Identifying subtypes of pairs of motifs to elucidate transcription factor subtype-cofactor associations
Dr. Abha S. Bais
Department of Computational and Systems Biology, University of Pittsburgh

Sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream regulatory effects. Typically, experimentally found binding sites of a TF (TFBSs) are similar enough to be summed up by a canonical motif. However, numerous studies have now shown that groups of nucleotide variants of binding sites, i.e., subtypes of BSs, may contribute to distinct modes of downstream regulation by the TF via differential recruitment of its cofactors. For example, a TF A may bind to BSs of subtype a1 or a2 depending on whether it associates with a cofactor B or C, respectively. While approaches for discovery of pairs (or dyads) of motifs abound, none address the problem of identifying variants or subtypes of dyads. Many TFs function as key components of multiple regulatory pathways, thereby targeting different subsets of genes, perhaps with different binding preferences. It is, therefore, crucial to identify discriminating sequence motifs that lead to the different modes of TF-DNA association and their corresponding downstream regulation. I will talk about an integrated approach to discover subtypes of dyads together with the sequence subsets in which they are enriched. Using both simulated datasets and biological examples, I demonstrate how current state-of-the-art motif discovery can be successfully exploited to address this question.

Thursday, September 2nd

Detecting epistasis via Markov bases
Caroline Uhler
Department of Statistics, UC Berkeley

Rapid research progress in genotyping techniques has allowed large genome-wide association studies. Existing methods often focus on determining associations between single loci and a specific phenotype. However, a particular phenotype is usually the result of complex relationships between multiple loci and the environment. We describe a two-stage method for detecting epistasis by combining the traditionally used single-locus search with a search for multiway interactions. Our method is based on an extended version of Fisher's exact test. To perform this test, a Markov chain is constructed on the space of multidimensional contingency tables using the elements of a Markov basis as moves. We test our method on simulated data and compare it to a two-stage logistic regression method and to a fully Bayesian method, showing that we are able to detect the interacting loci when other methods fail to do so.
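The idea of walking over contingency tables with Markov basis moves can be sketched in its simplest setting. The code below is an illustration only, not the speaker's method: it assumes a two-way table and the independence model, for which the basic +1/-1 moves on 2x2 minors form a Markov basis, and it uses the Pearson chi-square statistic as the test statistic. The function names, step counts, and seed are all hypothetical choices; the extended test in the abstract works on multidimensional tables.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, r in enumerate(row_sums):
        for j, c in enumerate(col_sums):
            expected = r * c / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def markov_basis_exact_test(table, steps=20000, burn_in=2000, seed=0):
    """Estimate an exact-test p-value with a Metropolis walk over the
    tables that share the observed row and column sums."""
    rng = random.Random(seed)
    current = [list(row) for row in table]
    n_rows, n_cols = len(current), len(current[0])
    observed = chi_square(table)
    hits = 0
    for step in range(steps + burn_in):
        # Basic move: +1 at (i1, j1) and (i2, j2), -1 at (i1, j2) and (i2, j1).
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        a, b = current[i1][j1], current[i2][j2]
        c, d = current[i1][j2], current[i2][j1]
        if c > 0 and d > 0:  # the move must keep every cell non-negative
            # Ratio of hypergeometric probabilities (each proportional to
            # the product of 1 / cell!) for the proposed vs. current table.
            ratio = (c * d) / ((a + 1) * (b + 1))
            if rng.random() < min(1.0, ratio):
                current[i1][j1] += 1
                current[i2][j2] += 1
                current[i1][j2] -= 1
                current[i2][j1] -= 1
        # Count sampled tables at least as extreme as the observed one.
        if step >= burn_in and chi_square(current) >= observed - 1e-9:
            hits += 1
    return hits / steps
```

Every move leaves the row and column sums unchanged, so the chain stays inside the reference set of the exact test; the returned fraction is a Monte Carlo estimate, not an exact p-value.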

Thursday, September 9th

Independent filtering increases detection power for high-throughput experiments
Dr. Richard Bourgon
Genentech

With high-dimensional data, variable-by-variable statistical testing is often used to select variables whose behavior differs across conditions. Such an approach requires adjustment for multiple testing, which can result in low statistical power. A two-stage approach that first filters variables by a criterion independent of the test statistic, and then only tests variables which pass the filter, can provide higher power. We show that use of some filter/test statistic pairs presented in the literature may, however, lead to loss of type I error control. We describe other pairs which avoid this problem. In an application to microarray data, we found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%. We also show that this particular statistic pair induces a lower bound on fold-change among the set of discoveries. Independent filtering, using filter/test pairs that are independent under the null hypothesis but correlated under the alternative, is a general approach that can substantially increase the efficiency of experiments.
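The two-stage procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: per-variable p-values (for example from t-tests) and filter statistics (for example overall variances) are taken as already computed, the Benjamini-Hochberg adjustment is assumed as the multiple-testing procedure, and the function names, keep_fraction, and alpha are hypothetical choices rather than anything specified in the abstract.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_test(filter_stats, p_values, keep_fraction=0.5, alpha=0.05):
    """Stage 1: keep the variables with the largest filter statistic.
    Stage 2: adjust p-values over the survivors only.
    Returns the sorted indices of the discoveries."""
    m = len(p_values)
    n_keep = max(1, int(round(keep_fraction * m)))
    kept = sorted(range(m), key=lambda i: filter_stats[i], reverse=True)[:n_keep]
    adjusted = benjamini_hochberg([p_values[i] for i in kept])
    return sorted(i for i, q in zip(kept, adjusted) if q <= alpha)


# Toy, made-up numbers purely for illustration.
toy_variances = [5.0, 4.0, 3.0, 0.1, 0.2, 0.1, 0.3, 0.2]
toy_p_values = [0.001, 0.02, 0.03, 0.5, 0.6, 0.7, 0.8, 0.9]
unfiltered = filter_then_test(toy_variances, toy_p_values, keep_fraction=1.0)
filtered = filter_then_test(toy_variances, toy_p_values, keep_fraction=0.5)
```

Because the adjustment is applied to fewer hypotheses after filtering, the same raw p-values can clear the threshold more easily. As the abstract stresses, this is only valid when the filter statistic is independent of the test statistic under the null hypothesis.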

Thursday, September 16th

Reconstructing DNA copy number by penalized estimation and imputation
Professor Chiara Sabatti
Division of Biostatistics and Department of Statistics, Stanford University

Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang (2008). We mount a fresh attack on this difficult optimization problem by: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization-minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
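The imputation-by-dynamic-programming idea can be illustrated with a small sketch. The objective here is an assumption made for illustration, not the authors' criterion: each SNP gets one of a few discrete copy number states, chosen to minimize a squared-error fit plus a fused-lasso-style penalty on jumps between neighboring SNPs. The function name, the state set, the jump_penalty value, and the premise that intensities are already on the copy-number scale are all hypothetical.

```python
def impute_copy_number(intensities, states=(0, 1, 2, 3, 4), jump_penalty=1.0):
    """Viterbi-style dynamic program that minimizes
    sum_i (y_i - c_i)^2 + jump_penalty * sum_i |c_i - c_{i-1}|
    over copy number sequences c with each c_i drawn from `states`."""
    if not intensities:
        return []
    n_states = len(states)
    # cost[k]: best total cost of any path ending in states[k] at this SNP.
    cost = [(intensities[0] - s) ** 2 for s in states]
    back_pointers = []
    for y in intensities[1:]:
        new_cost, pointers = [], []
        for s in states:
            # Best predecessor state for landing in state s.
            best_k = min(
                range(n_states),
                key=lambda k: cost[k] + jump_penalty * abs(s - states[k]),
            )
            transition = cost[best_k] + jump_penalty * abs(s - states[best_k])
            new_cost.append(transition + (y - s) ** 2)
            pointers.append(best_k)
        cost = new_cost
        back_pointers.append(pointers)
    # Trace the optimal path backwards from the cheapest final state.
    k = min(range(n_states), key=lambda j: cost[j])
    path = [k]
    for pointers in reversed(back_pointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [states[k] for k in path]
```

With n SNPs and S states the search costs O(n S^2) operations, which stays cheap precisely because only a handful of copy number states are possible at each SNP.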

This is joint work with Thomas Zhang and Kenneth Lange.

Thursday, September 30th

Statistical applications in the analysis of reverse-phase protein microarray data: Results from a cross-platform evaluation study
Dr. Houston Gilbert
Genentech

Reverse-phase protein microarrays (RPPMA) allow for the simultaneous detection of a single protein in complex analyte mixtures, such as those obtained from cell tissue culture or clinical sample protein lysate. To gain a better understanding of the RPPMA arena, we evaluated three fee-for-service providers of this technology. Practical, statistical and biological results from the evaluation study have informed our own strategies for moving forward with RPPMA technology in research and development programs. The evaluation study has also highlighted areas for each of the companies to improve upon their own platforms.

Joint work with Maureen Wong, Zachary Boyd, Jenny Wu, Sree Ranjani Ramani, Yibing Yan, Mark Lackner, Lisa Belmont, and Lino Gonzalez.

Thursday, October 7th

Beyond the genomes: understanding the molecular functions of genetic variants
Dr. Sean Mooney
Buck Institute

Abstract.

Thursday, October 14th

Two-sample tests of differential expression on gene networks
Dr. Laurent Jacob
Department of Statistics, UC Berkeley

Measuring gene expression to study a biological phenomenon or build prognosis tools is now common practice. When analyzing this type of data, one is very often interested in detecting pre-defined sets of genes that are known to work together and are significantly differentially expressed between two particular conditions. Multivariate statistics allow testing for differential expression directly at the gene set level, which makes them more interpretable than the widely used gene set enrichment approach. However, they are known to lose power quickly with increasing dimension. At the same time, an increasing number of regulation networks are becoming available, specifying, for example, which genes activate or inhibit the expression of other genes. We intend to use these networks to build spaces of lower dimension that still retain most of the expression shift of gene sets. This makes multivariate testing tractable and provably more powerful under the assumption of a (partly) coherent expression shift.

Thursday, October 28th

Whole-genome sequencing of lung cancer samples
Dr. Zemin Zhang
Genentech

Next-generation sequencing technologies have greatly lowered the barrier to whole-genome sequencing, which enables a systematic survey of the entire mutation spectrum of human cancer samples. In collaboration with Complete Genomics, we sequenced and compared the tumor and normal tissue of a 51-year-old Caucasian male with non-small cell lung cancer. The patient's primary lung tumor was sequenced to 60x coverage and adjacent normal tissue to 46x coverage. More than 50,000 single nucleotide variations (SNVs) were discovered in the tumor, which yielded about 17.7 somatic mutations per megabase of DNA. In addition, we observed a distinct pattern of selection against mutations within expressed genes compared to non-expressed genes and in promoter regions up to 5 kb upstream of all protein-coding genes, clearly identifying selection pressures within a tumor environment. We will also discuss the identification of somatic structural and copy number variants, computational prediction of driver mutations, and our latest effort on expanded whole-genome sequencing for additional lung tumors and cell lines.

Thursday, November 4th

Estimation of allele frequency and association mapping using next-generation sequencing data
Dr. Su Yeon Kim
Department of Statistics, UC Berkeley

Estimation of allele frequencies is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost-effective approach is to use medium or low-coverage data (e.g., <15X). However, SNP calling and allele frequency estimation in such studies are associated with substantial statistical uncertainty because of varying coverage, high error rates, etc. We present a new maximum likelihood method for estimating the allele frequencies in low and medium coverage next-generation sequencing data, based on integrating over uncertainty in the data for each individual rather than calling genotypes. This method can be directly applied to detect associations in case/control studies. We compare our method to methods based on genotype calling using simulations, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained using exon-capturing, we show that the patterns observed in the simulations can also be found in real data. In particular, the null distribution of the test statistic computed based on called genotypes shows a significant departure from the chi-square(1) distribution expected using classical asymptotic theory. However, the test statistic calculated using the full likelihood method closely follows the expected distribution. Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling is used, it is better not to filter individuals based on call confidence score.
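Estimating an allele frequency by integrating over genotype uncertainty, rather than calling genotypes first, can be sketched with a short EM iteration. This is an illustration under assumptions that the abstract does not state: a biallelic site, Hardy-Weinberg genotype proportions, and per-individual genotype likelihoods that are already available. The function name, starting value, and tolerance are hypothetical, and this is not claimed to be the speaker's exact estimator.

```python
def estimate_allele_frequency(genotype_likelihoods, tol=1e-8, max_iter=1000):
    """EM estimate of the alternate-allele frequency at one biallelic site.

    genotype_likelihoods: one (L0, L1, L2) tuple per individual, where
    L_g = P(reads | individual carries g copies of the alternate allele).
    Genotype priors are assumed to follow Hardy-Weinberg proportions.
    """
    n = len(genotype_likelihoods)
    p = 0.2  # arbitrary starting value strictly between 0 and 1
    for _ in range(max_iter):
        expected_alt = 0.0
        for l0, l1, l2 in genotype_likelihoods:
            # E-step: unnormalized posterior weight of each genotype.
            w0 = l0 * (1 - p) ** 2
            w1 = l1 * 2 * p * (1 - p)
            w2 = l2 * p ** 2
            expected_alt += (w1 + 2 * w2) / (w0 + w1 + w2)
        # M-step: expected alternate alleles divided by total alleles.
        new_p = expected_alt / (2 * n)
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p
```

When the likelihoods are certain (one entry equal to 1, the others 0), the estimate reduces to simple allele counting. With ambiguous likelihoods, the uncertainty is carried into the estimate instead of being discarded by a hard genotype call.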

Thursday, November 18th

Characterizing microbial diversity from metagenomic data
Dr. Thomas J. Sharpton
J. David Gladstone Institute for Cardiovascular Disease, UCSF

Despite their importance to human and environmental health, we know relatively little about the taxonomic and functional diversity of microorganisms. The recent innovation of shotgun sequencing environmentally acquired DNA, a process known as metagenomic sequencing, provides unique insight into the natural diversity of microbes, but comes at the expense of tremendous data complexity. My collaborators and I have designed a series of bioinformatic tools that circumvent the challenges of metagenomic sequence data and enable the characterization of the taxonomic and functional diversity of microorganisms directly from nature. In particular, I developed PhylOTU, a computational workflow that identifies Operational Taxonomic Units (OTUs) from metagenomic sequence data via the use of phylogenomic principles and probabilistic sequence profiles. Methodological accuracy was verified through tests of simulated metagenomic data. I subsequently applied PhylOTU to marine metagenomic sequence libraries and identified microbial taxa missed by traditional sequence-based investigations. This suggests that PhylOTU, when applied to metagenomic data, can identify novel microbial taxa. In addition to discussing PhylOTU, I will describe preliminary research being conducted through this collaboration that leverages similar probabilistic profile-based methods to explore the functional diversity of microorganisms.

Thursday, December 2nd

Data mining with biomaRt
Dr. Steffen Durinck
Lawrence Berkeley National Laboratory

A comprehensive analysis of high-throughput biological experiments involves integration of a variety of data sources. Much of this (meta)data is stored in publicly available databases, accessible through well-defined web interfaces. One simple example is the annotation of a set of features that are found differentially expressed in a microarray experiment with corresponding gene symbols and genomic locations. BioMart is a generic, query-oriented data management system, capable of integrating distributed data resources. It is developed at the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL). biomaRt is a software package aimed at integrating data from BioMart systems into R, providing efficient access to a wealth of biological data from within a data analysis environment and enabling biological database mining. In this talk I'll discuss resources that are currently available through biomaRt (e.g. Ensembl, Reactome, COSMIC) and how to perform queries to BioMart databases.
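The annotation example above (mapping microarray features to gene symbols and genomic locations) comes down to sending a BioMart server a query that names a dataset, some filters, and the attributes to return. biomaRt itself is an R package; the sketch below is not biomaRt but a Python illustration that only assembles the kind of XML query document BioMart web services accept, without contacting any server. The helper name build_biomart_query is hypothetical, and the dataset, filter, and attribute names in the usage lines are assumptions that should be checked against the target mart.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters=None):
    """Assemble a BioMart-style XML query string (nothing is sent)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Filters restrict which records are returned.
    for name, value in (filters or {}).items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    # Attributes are the columns of the result table.
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Assumed names: an Ensembl human gene dataset, an Affymetrix probe-set
# filter, and gene symbol / location attributes.
example_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    attributes=["hgnc_symbol", "chromosome_name", "start_position", "end_position"],
    filters={"affy_hg_u133_plus_2": "202763_at"},
)
```

The resulting string would typically be submitted as the query parameter of a request to a mart's web service; from R, biomaRt handles that step for the user.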