PB HLTH 292, Section 008
Statistics and Genomics Seminar
Fall 2010
Thursday, August 26th
Identifying subtypes of pairs of motifs to elucidate transcription factor subtype-cofactor associations
Dr. Abha S. Bais
Department of Computational and Systems Biology, University of Pittsburgh
Sequences bound by a transcription factor (TF) are presumed to contain
sequence elements that reflect its DNA binding preferences and its
downstream regulatory effects. Typically, experimentally found
binding sites of a TF (TFBSs) are similar enough to be summed up by a
canonical motif. However, numerous studies have now shown that groups
of nucleotide variants of binding sites, i.e., subtypes of BSs, may
contribute to distinct modes of downstream regulation by the TF via
differential recruitment of its cofactors. A TF A may bind to BSs of
subtype a1 or a2 depending on whether it associates with a cofactor B
or C, respectively. While approaches for discovery of pairs (or
dyads) of motifs abound, none address the problem of identifying
variants or subtypes of dyads. Many TFs function as key components of
multiple regulatory pathways, thereby targeting different subsets of
genes perhaps with different binding preferences. It is, therefore,
crucial to identify discriminating sequence motifs that lead to the
different modes of TF-DNA association and their corresponding
downstream regulation. I will talk about an integrated approach to
discover subtypes of dyads together with the sequence subsets they
are enriched in. Using both simulated datasets and biological
examples, I demonstrate how current state-of-the-art motif discovery
can be successfully exploited to address this question.
Thursday, September 2nd
Detecting epistasis via Markov bases
Caroline Uhler
Department of Statistics, UC Berkeley
Rapid research progress in genotyping techniques has allowed large
genome-wide association studies. Existing methods often focus on
determining associations between single loci and a specific phenotype.
However, a particular phenotype is usually the result of complex
relationships between multiple loci and the environment. We describe a
two-stage method for detecting epistasis by combining the
traditionally used single-locus search with a search for multiway
interactions. Our method is based on an extended version of Fisher's
exact test. To perform this test, a Markov chain is constructed on the
space of multidimensional contingency tables using the elements of a
Markov basis as moves. We test our method on simulated data and
compare it to a two-stage logistic regression method and to a fully
Bayesian method, showing that we are able to detect the interacting
loci when other methods fail to do so.
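The Markov chain idea in the abstract above can be illustrated on the simplest case. The following is a minimal sketch, assuming a two-way table under the independence model, where the Markov basis consists of the basic +1/-1 checkerboard moves on 2x2 minors. The talk's extended test on multidimensional tables is not reproduced here; the function names, the chi-square test statistic, and the step counts are illustrative assumptions, and all row and column totals are assumed to be positive.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            expected = row[i] * col[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def markov_basis_pvalue(observed, steps=20000, burn_in=2000, seed=0):
    """Monte Carlo p-value from a Metropolis walk over tables with fixed margins."""
    rng = random.Random(seed)
    table = [list(r) for r in observed]
    n_rows, n_cols = len(table), len(table[0])
    obs_stat = chi_square(observed)
    hits = 0
    for step in range(steps + burn_in):
        # Basic move: pick an ordered pair of rows and of columns, then add
        # +1 on the diagonal of that 2x2 minor and -1 on the off-diagonal.
        # Row and column sums are unchanged by construction.
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        a, b = table[i1][j1], table[i1][j2]
        c, d = table[i2][j1], table[i2][j2]
        if b > 0 and c > 0:  # the move must not create negative counts
            # Target distribution is proportional to 1 / prod(n_ij!), so the
            # Metropolis ratio simplifies to b*c / ((a+1)*(d+1)).
            ratio = (b * c) / ((a + 1) * (d + 1))
            if rng.random() < min(1.0, ratio):
                table[i1][j1] += 1
                table[i2][j2] += 1
                table[i1][j2] -= 1
                table[i2][j1] -= 1
        if step >= burn_in and chi_square(table) >= obs_stat - 1e-9:
            hits += 1
    return hits / steps


if __name__ == "__main__":
    print(markov_basis_pvalue([[10, 0], [0, 10]]))
    print(markov_basis_pvalue([[5, 5], [5, 5]]))
```

The p-value is the fraction of sampled tables that are at least as extreme as the observed one, so no asymptotic approximation is needed even when cell counts are small.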
Thursday, September 9th
Independent filtering increases detection power for high-throughput experiments
Dr. Richard Bourgon
Genentech
With high-dimensional data, variable-by-variable statistical testing is often used to select variables whose behavior differs across conditions. Such an approach requires adjustment for multiple testing, which can result in low statistical power. A two-stage approach that first filters variables by a criterion independent of the test statistic, and then only tests variables which pass the filter, can provide higher power. We show that the use of some filter/test statistic pairs presented in the literature may, however, lead to loss of type I error control. We describe other pairs which avoid this problem. In an application to microarray data, we found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%. We also show that this particular statistic pair induces a lower bound on fold-change among the set of discoveries. Independent filtering, using filter/test pairs that are independent under the null hypothesis but correlated under the alternative, is a general approach that can substantially increase the efficiency of experiments.
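The two-stage procedure described above can be sketched in a few lines. This is an illustration only: it assumes the per-variable p-values and filter statistics (for example, overall variances) have already been computed, it uses the Benjamini-Hochberg procedure as the multiple-testing adjustment (the abstract does not name one), and the numbers in the example are made up for demonstration, not taken from any experiment.

```python
def benjamini_hochberg(pvalues, alpha=0.1):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def filter_then_test(filter_stats, pvalues, keep_fraction, alpha=0.1):
    """Keep the variables with the largest filter statistic, then apply BH to them."""
    m = len(pvalues)
    n_keep = max(1, round(keep_fraction * m))
    ranked = sorted(range(m), key=lambda i: filter_stats[i], reverse=True)
    kept = ranked[:n_keep]
    rejected = benjamini_hochberg([pvalues[i] for i in kept], alpha)
    # Map positions within the kept subset back to the original indices.
    return sorted(kept[k] for k in rejected)


if __name__ == "__main__":
    # Hypothetical values: ten variables, the first five with high variance.
    variances = [4.0, 3.5, 3.0, 2.5, 2.0, 0.5, 0.4, 0.3, 0.2, 0.1]
    pvalues = [0.004, 0.015, 0.035, 0.045, 0.6, 0.3, 0.5, 0.7, 0.8, 0.9]
    print(benjamini_hochberg(pvalues))                # [0, 1]
    print(filter_then_test(variances, pvalues, 0.5))  # [0, 1, 2, 3]
```

In this toy example, removing the low-variance half before adjustment relaxes the per-rank thresholds, so two additional variables are declared significant. Type I error control holds only when the filter statistic is independent of the test statistic under the null hypothesis, which is the condition the abstract emphasizes.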
Thursday, September 16th
Reconstructing DNA copy number by penalized estimation and imputation
Professor Chiara Sabatti
Division of Biostatistics and Department of Statistics, Stanford University
Recent advances in genomics have underscored the surprising ubiquity
of DNA copy number variation (CNV). Fortunately, modern genotyping
platforms also detect CNVs with fairly high reliability. Hidden
Markov models and algorithms have played a dominant role in the
interpretation of CNV data. Here we explore CNV reconstruction via
estimation with a fused-lasso penalty as suggested by Tibshirani and
Wang (2008). We mount a fresh attack on this difficult optimization
problem by: (a) changing the penalty terms slightly by substituting a
smooth approximation to the absolute value function, (b) designing
and implementing a new MM (majorization-minimization) algorithm, and
(c) applying a fast version of Newton's method to jointly update all
model parameters. Together these changes enable us to minimize the
fused-lasso criterion in a highly effective way.
We also reframe the reconstruction problem in terms of imputation via
discrete optimization. This approach is easier and more accurate than
parameter estimation because it relies on the fact that only a
handful of possible copy number states exist at each SNP. The dynamic
programming framework has the added bonus of exploiting information
that the current fused-lasso approach ignores. The accuracy of our
imputations is comparable to that of hidden Markov models at a
substantially lower computational cost.
This is joint work with Thomas Zhang and Kenneth Lange.
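The imputation-by-dynamic-programming idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes a squared-error loss between each SNP's intensity and a known mean per copy-number state, plus a penalty proportional to the size of each jump between neighboring SNPs. The function name, the state means, and the penalty value are hypothetical.

```python
def impute_copy_number(intensities, state_means, jump_penalty):
    """Viterbi-style dynamic program over a small set of discrete states.

    Minimizes  sum_i (y_i - mean[s_i])**2 + jump_penalty * sum_i |s_i - s_(i-1)|
    over all state paths s, and returns the optimal path as a list of state indices.
    """
    n_states = len(state_means)
    # cost[s] = best total cost of any path that ends in state s at the current SNP.
    cost = [(intensities[0] - m) ** 2 for m in state_means]
    back = []
    for y in intensities[1:]:
        new_cost, pointers = [], []
        for s in range(n_states):
            prev = min(range(n_states),
                       key=lambda p: cost[p] + jump_penalty * abs(s - p))
            pointers.append(prev)
            new_cost.append(cost[prev] + jump_penalty * abs(s - prev)
                            + (y - state_means[s]) ** 2)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final state.
    state = min(range(n_states), key=lambda s: cost[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    means = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical mean intensity per copy number
    signal = [2.1, 1.9, 2.2, 1.1, 0.9, 1.0, 2.0, 2.4, 1.9]
    print(impute_copy_number(signal, means, jump_penalty=0.5))
    # [2, 2, 2, 1, 1, 1, 2, 2, 2]
```

Because each SNP can take only a handful of states, the search costs on the order of (number of SNPs) x (number of states squared) operations, which is why a discrete formulation can be cheaper than continuous parameter estimation.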
Thursday, September 30th
Statistical applications in the analysis of reverse-phase protein microarray data: Results from a cross-platform evaluation study
Dr. Houston Gilbert
Genentech
Reverse-phase protein microarrays (RPPMA) allow for the simultaneous
detection of a single protein in complex analyte mixtures, such as
those obtained from cell tissue culture or clinical sample protein
lysate. To gain a better understanding of the RPPMA arena, we
evaluated three fee-for-service providers of this technology.
Practical, statistical and biological results from the evaluation
study have informed our own strategies for moving forward with RPPMA
technology in research and development programs. The evaluation study
has also highlighted areas for each of the companies to improve upon
their own platforms.
Joint work with Maureen Wong, Zachary Boyd, Jenny Wu, Sree Ranjani
Ramani, Yibing Yan, Mark Lackner, Lisa Belmont, and Lino Gonzalez.
Thursday, October 7th
Beyond the genomes: understanding the molecular functions of genetic variants
Dr. Sean Mooney
Buck Institute
Abstract.
Thursday, October 14th
Two-sample tests of differential expression on gene networks
Dr. Laurent Jacob
Department of Statistics, UC Berkeley
Measuring gene expressions to study a biological phenomenon or build
prognosis tools is now common practice. When analyzing this type of
data, one is very often interested in detecting pre-defined sets of
genes that are known to work together and are significantly
differentially expressed between two particular
conditions. Multivariate statistics allow testing for differential
expression at the gene set level directly, which makes them more
interpretable than the widely used gene set enrichment
approach. However, they are known to lose power quickly with
increasing dimension. At the same time, an increasing number of
regulation networks are becoming available, specifying, for example,
which genes activate or inhibit the expression of other genes. We
intend to use these networks to build spaces of lower dimension, yet
retaining most of the expression shift of gene sets. This makes
multivariate testing feasible and provably more powerful under a
(partly) coherent expression shift assumption.
Thursday, October 28th
Whole-genome sequencing of lung cancer samples
Dr. Zemin Zhang
Genentech
Next-generation sequencing technologies have greatly reduced the
barrier for whole genome sequencing, which enables a systematic survey
of the entire mutation spectrum of human cancer samples. In
collaboration with Complete Genomics, we sequenced and compared the
tumor and normal tissue of a 51-year-old Caucasian male with
non-small cell lung cancer. The patient's primary lung tumor
was sequenced to 60x coverage and adjacent normal tissue to 46x
coverage. More than 50,000 single nucleotide variations (SNVs) were
discovered in the tumor, which yielded about 17.7 somatic mutations
per megabase of DNA. In addition, we observed a distinct pattern of
selection against mutations within expressed genes compared to
non-expressed genes and in promoter regions up to 5 kb upstream of
all protein-coding genes, clearly identifying selection pressures
within a tumor environment. We will also discuss the identification
of somatic structural and copy number variants, computational
prediction of driver mutations, and our latest effort on expanded
whole genome sequencing for additional lung tumors and cell lines.
Thursday, November 4th
Estimation of allele frequency and association mapping using next-generation sequencing data
Dr. Su Yeon Kim
Department of Statistics, UC Berkeley
Estimation of allele frequencies is of fundamental importance in
population genetic analyses and in association mapping. In most
studies using next-generation sequencing, a cost-effective approach
is to use medium or low-coverage data (e.g., <15X). However, SNP
calling and allele frequency estimation in such studies are associated
with substantial statistical uncertainty because of varying coverage,
high error rates, etc. We present a new maximum likelihood method
for estimating the allele frequencies in low and medium coverage
next-generation sequencing data, based on integrating over
uncertainty in the data for each individual rather than calling
genotypes. This method can be directly applied to detect
associations in case/control studies. We compare our method to
methods based on genotype calling using simulations, and show that
the likelihood method outperforms the genotype calling methods in
terms of: (1) accuracy of allele frequency estimation, (2)
distribution of allele frequencies across neutrally evolving sites,
and (3) statistical power in association mapping studies. Using real
re-sequencing data from 200 individuals obtained using
exon-capturing, we show that the patterns observed in the simulations
can in fact also be found in real data. In particular, the null
distribution of the test statistic computed based on called genotypes
shows a significant departure from the chi-square(1) distribution
expected using classical asymptotic theory. However, the test
statistic calculated using the full likelihood method closely follows
the expected distribution. Overall, our results suggest that
association mapping and estimation of allele frequencies should not
be based on genotype calling in low to medium coverage data.
Furthermore, if genotype calling is used, it is better not to filter
individuals based on call confidence score.
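The idea of integrating over genotype uncertainty instead of calling genotypes can be sketched with a small EM routine. This is an illustration under assumptions that the abstract does not state: a single biallelic site, Hardy-Weinberg proportions as the genotype prior, and per-individual genotype likelihoods already computed from the reads. The function name and the example likelihoods are hypothetical.

```python
def estimate_allele_frequency(genotype_likelihoods, tol=1e-10, max_iter=1000):
    """EM estimate of the allele frequency at one biallelic site.

    genotype_likelihoods[i][g] is P(reads of individual i | g copies of the
    allele) for g = 0, 1, 2. The likelihood being maximized is
    prod_i sum_g genotype_likelihoods[i][g] * P(g | freq),
    with P(g | freq) given by Hardy-Weinberg proportions (an assumption here).
    """
    n = len(genotype_likelihoods)
    freq = 0.5
    for _ in range(max_iter):
        expected_copies = 0.0
        prior = [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2]
        for lik in genotype_likelihoods:
            # E-step: posterior genotype weights for this individual.
            post = [l * p for l, p in zip(lik, prior)]
            total = sum(post)
            expected_copies += (post[1] + 2 * post[2]) / total
        # M-step: expected allele count divided by the number of chromosomes.
        new_freq = expected_copies / (2 * n)
        if abs(new_freq - freq) < tol:
            return new_freq
        freq = new_freq
    return freq


if __name__ == "__main__":
    # With perfectly informative data the estimate equals the allele count: 3/8.
    certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
    print(estimate_allele_frequency(certain))
    # Hypothetical low-coverage likelihoods: genotypes are never called.
    uncertain = [(0.9, 0.1, 0.0), (0.5, 0.5, 0.0), (0.1, 0.8, 0.1), (0.0, 0.3, 0.7)]
    print(estimate_allele_frequency(uncertain))
```

Each individual contributes a weighted average over all three genotypes, so a poorly covered individual pulls the estimate less than a confidently genotyped one, instead of contributing a possibly wrong hard call.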
Thursday, November 18th
Characterizing microbial diversity from metagenomic data
Dr. Thomas J. Sharpton
J. David Gladstone Institute for Cardiovascular Disease, UCSF
Despite their importance to human and environmental health, we know
relatively little about the taxonomic and functional diversity of
microorganisms. The recent innovation of shotgun sequencing
environmentally acquired DNA, a process known as metagenomic
sequencing, provides unique insight into the natural diversity of
microbes, but comes at the expense of tremendous data complexity. My
collaborators and I have designed a series of bioinformatic tools
that circumvent the challenges of metagenomic sequence data and
enable the characterization of the taxonomic and functional diversity
of microorganisms directly from nature. In particular, I developed
PhylOTU, a computational workflow that identifies Operational Taxonomic
Units (OTUs) from metagenomic sequence data via the use of
phylogenomic principles and probabilistic sequence
profiles. Methodological accuracy was verified through tests of
simulated metagenomic data. I subsequently applied PhylOTU to marine
metagenomic sequence libraries and identified microbial taxa missed
by traditional sequence-based investigations. This suggests that
PhylOTU, when applied to metagenomic data, can identify novel
microbial taxa. In addition to discussing PhylOTU, I will describe
preliminary research being conducted through this collaboration that
leverages similar probabilistic profile-based methods to explore the
functional diversity of microorganisms.
Thursday, December 2nd
Data mining with biomaRt
Dr. Steffen Durinck
Lawrence Berkeley National Laboratory
A comprehensive analysis of high-throughput biological experiments
involves integration of a variety of data sources. Much of this
(meta) data is stored in publicly available databases, accessible
through well-defined web interfaces. One simple example is the
annotation of a set of features that are found differentially
expressed in a microarray experiment with corresponding gene symbols
and genomic locations. BioMart is a generic, query oriented data
management system, capable of integrating distributed data
resources. It is developed at the European Bioinformatics Institute
(EBI) and Cold Spring Harbor Laboratory (CSHL). biomaRt is a
software package aimed at integrating data from BioMart systems into
R, providing efficient access to a wealth of biological data from
within a data analysis environment and enabling biological database
mining. In this talk I'll discuss resources that are currently
available through biomaRt (e.g. Ensembl, Reactome, COSMIC) and how to
perform queries to BioMart databases.