PB HLTH 292, Section 020
Statistics
and Genomics Seminar
Spring 2006
Thursday, January 19th
Analysis issues of oligonucleotide tiling array data
Professor Ru-Fang Yeh
Department of Epidemiology and Biostatistics, University of California,
San Francisco
The recent development of DNA tiling arrays has made it possible to
experimentally annotate the genome and various protein-DNA interactions
through unbiased interrogation of large genomic regions. Depending on
its utility, data from tiling array experiments pose unique analytic
challenges that are very different from traditional expression array
analysis. In this talk, I will discuss the preprocessing and analysis
issues of tiling array-based experiments for DNA copy-number
alterations (array CGH) and histone modifications (ChIP-chip) using
NimbleGen custom arrays, and offer our preliminary solutions.
Thursday, January 26th
Quantitative trait mapping study designs from an information perspective
Professor Saunak Sen
Department of Epidemiology and Biostatistics, University of California,
San Francisco
Genetic mapping using crosses
between inbred strains (of mice, yeast, Arabidopsis, etc.) is an
important biological tool. It is used to detect and localize the
genetic elements responsible for the variation in a phenotype (trait
such as blood pressure in mice) of interest.
We consider inbred line crosses from an information perspective in
order to examine the efficiency of different genotyping and phenotyping
strategies. Our central result is a simple formula to quantify
the information content of any combined phenotyping and genotyping
design. This is used to derive a number of results including
finding efficient genotyping designs given genotyping cost, number of
phenotyping replications needed, and the effect of multiple loci on
selective genotyping strategies.
This is joint work with Jaya Satagopan of Memorial Sloan Kettering
Cancer Center, New York, NY, and Gary Churchill of the Jackson
Laboratory, Bar Harbor, ME.
Thursday, February 2nd
Hybridization Efficiency Analysis of Probes Targeting 16S rRNA Genes Using the Affymetrix GeneChip Format
Todd Z. DeSantis
Lawrence Berkeley National Laboratory
Background: Detection of diverse 16S rRNA gene types in complex
mixtures can be achieved using arrays of probes targeting specific
sequences in 16S rRNA genes. Whereas probes for expression arrays are
designed to leverage the diversity among various genes in one genome,
16S probes rely upon the diversity of the same gene found in many
genomes. Also, expression arrays are validated by their accurate
estimation of changes in analyte concentration, but 16S arrays are
expected to provide definitive present/absent scoring of each
prokaryotic taxon. The degree of uniqueness of a probe for a particular
target species or other defined operational taxonomic unit will dictate
its reliability but has yet to be quantified for prediction of
hybridization accuracy.
Methods: To obtain these metrics, amplicons of the 16S rRNA gene
from Francisella tularensis were fragmented, labeled and
isothermally hybridized to replicate Affymetrix custom arrays
containing 491,069 unique 25mer probes with various degrees of
probe-target complementarity, melting temperature, and secondary
structure potential. Hybrid abundance at each probe location was
determined by fluorescence intensity.
Results: As expected, probes exactly complementary to the
target but with various sequence composition produced intensities
ranging over 3 orders of magnitude, yet replicate probes on the same
array produced a coefficient of variation under 20%. Although
mismatching probes were able to capture target sequence, a general
decrease in intensity was observed with probes divergent from the
target and the effect could be attenuated by masking probes with high
melting temperatures.
Conclusion: The data collected allows the development of a
probabilistic model that aids in predicting the confidence that a
probe’s response is due to the presence of the corresponding target in
solution.
This is joint work with K. D. Hansen, E. L. Brodie, Y. M. Piceno, J.
Bullard, P. Hu, G. L. Andersen.
Thursday, February 9th
Automatically Detecting and Genotyping Genetic Variants (SNPs) by Sequencing of Diploid Samples
Professor Matthew Stephens
Department of Statistics, University of Washington, Seattle
The detection and genotyping of
sequence variations, particularly Single Nucleotide Polymorphisms
(SNPs), is at the core of all genetic analysis. The principal approach
for detecting variants in a specific gene is to sequence that gene in a
sample of (diploid) individuals. (The term "diploid" refers to the fact
that each individual has two copies of their genome, one inherited from
each parent.) Identification of SNPs from this kind of sequence data
has been greatly aided by the use of computational and statistical
methods. However, existing algorithms are not sufficiently accurate to
be used without potentially costly confirmation, usually by a human
manually checking each call.
This talk will describe the problem, and our work on a new and more
accurate statistical method to detect and genotype SNPs. The new
algorithm improves on existing approaches in two key ways. First, it
takes more detailed account of systematic variation in peak heights due
to read-specific and sequence-context effects. If unaccounted for, these
systematic effects obscure the signal we are aiming to detect. Second,
it computes a formal statistical measure of the evidence for potential
genotypes at each position in each sequence. This enables the
application of standard statistical methods to efficiently combine
evidence across multiple reads for an individual, which results in
exceptional accuracy for data with "double-coverage", where
individuals are sequenced on both the forward and reverse strands. It
also provides a quantitative assessment of the confidence in each SNP
identified, and in each genotype
called. This is particularly useful in identifying a subset of highly
accurate SNP and genotype calls which may be accepted without manual
confirmation.
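The evidence-combining step can be illustrated with a minimal sketch. This is not the method from the talk, only the generic Bayesian idea it builds on: per-read genotype log-likelihoods are summed under an independence assumption (reads independent given the genotype) with a uniform prior, and normalized into a posterior. The genotype labels and numeric values below are invented for illustration.

```python
import math

def genotype_posterior(read_log_likelihoods, genotypes=("AA", "AG", "GG")):
    """Combine per-read log-likelihoods into a posterior over genotypes.

    read_log_likelihoods: list of dicts mapping genotype -> log P(read | genotype).
    Assumes reads are independent given the genotype and a uniform prior
    (both simplifying assumptions of this sketch).
    """
    # Sum log-likelihoods across reads (the independence assumption).
    total = {g: sum(r[g] for r in read_log_likelihoods) for g in genotypes}
    # Normalize via log-sum-exp for numerical stability.
    m = max(total.values())
    z = sum(math.exp(v - m) for v in total.values())
    return {g: math.exp(total[g] - m) / z for g in genotypes}

# Two reads (e.g. forward and reverse strand) that both favor the
# heterozygote "AG": the combined posterior concentrates on it, which is
# the sense in which "double-coverage" sharpens each call.
reads = [
    {"AA": -8.0, "AG": -1.0, "GG": -9.0},
    {"AA": -7.5, "AG": -1.2, "GG": -8.8},
]
post = genotype_posterior(reads)
```

The posterior probability of the winning genotype is exactly the kind of quantitative confidence measure the abstract describes: calls above a threshold could be accepted without manual confirmation.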
Thursday, February 16th
Modified BIC for Change-point Problems with Applications to Array-CGH Data
Dr. Nancy Zhang
Department of Statistics, UC Berkeley
We study the problem of
estimating the number of change-points in a data series that is
hypothesized to have undergone abrupt changes. First, we focus on the
scenario of independent Gaussian data points with changing mean values,
and then generalize to the Poisson process with changing rate parameter
as well as general exponential families. This can be viewed as a
problem in model selection, where the dimension of the model grows with
the number of change-points assumed. However, the classic Bayes
Information Criterion (BIC) cannot be applied because of
irregularities in the likelihood function. By asymptotic approximation
of the Bayes Factor, we derive the Modified BIC that is theoretically
justified for the change-point models that we study.
An example of application as well as a source of inspiration for the
Gaussian model is the analysis of array comparative genomic
hybridization (array-CGH) data. Array-CGH measures the number of
chromosome copies at each genome location of a cell sample, and
is useful for finding the regions of genome deletion and amplification
in tumor cells. The Modified BIC statistic is tested on array-CGH data
sets and compared to existing methods. Variations to the basic
change-point model that are inspired by array-CGH data are also
discussed.
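The abstract does not state the modified penalty itself, so the following is only a sketch of the generic set-up it addresses: fitting a Gaussian mean-shift model by least squares and scoring models with a classic BIC-style penalty (whose use of the standard parameter count for change-point locations is exactly what the talk argues is not justified). The function name and penalty terms are illustrative assumptions.

```python
import math

def gaussian_bic_changepoint(y):
    """Choose 0 vs 1 change-point in a Gaussian mean-shift model using a
    classic BIC-style score n*log(RSS/n) + k*log(n). Illustrative only:
    counting the change-point location as an ordinary parameter (k = 3
    below) is the naive choice the Modified BIC corrects.
    """
    n = len(y)

    def rss(seg):
        mu = sum(seg) / len(seg)
        return sum((v - mu) ** 2 for v in seg)

    # Model with no change-point: one mean parameter.
    best_score = n * math.log(rss(y) / n) + math.log(n)
    best_tau = None
    # Model with one change-point at tau: two means + the location.
    for tau in range(1, n):
        r = rss(y[:tau]) + rss(y[tau:])
        score = n * math.log(r / n) + 3 * math.log(n)
        if score < best_score:
            best_score, best_tau = score, tau
    return best_tau  # None means "no change-point preferred"

# A clear mean shift at index 5, loosely mimicking a copy-number jump.
series = [0.1, -0.2, 0.0, 0.2, -0.1, 3.0, 3.2, 2.9, 3.1, 3.0]
tau = gaussian_bic_changepoint(series)
```

Extending the search to many change-points makes the model-selection difficulty concrete: the dimension grows with the number of change-points assumed, and the likelihood is irregular in the locations.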
Thursday, February 23rd
Novel Algorithms for Investigating New RNA Features
Professor Irmtraud Meyer
University of British Columbia
RNA molecules have remarkable
chemical properties that allow them to play a variety of important
functional roles in the cell. This talk introduces several novel
methods to discover and study new functional features of RNA molecules.
Many biological classes of RNA molecules exert their function by
assuming a distinct structure. Irmtraud Meyer will introduce a new
algorithm by which we can show that these RNA molecules not only encode
information on the final, functional secondary structure, but also on
the folding pathway that guides the formation of the functional
structure.
Due to the degeneracy of the genetic code, a protein-coding RNA
molecule can encode an extra layer of information, for example,
information on RNA secondary structure. Meyer will show how these
overlapping layers of information can be detected and present several
results that show that RNA secondary structure may play an active role
in the regulation of human pre-mRNA and mRNA sequences.
Thursday, March 9th
SIFTER: A statistical graphical model for predicting protein
molecular function
Barbara Engelhardt
Department of Computer Science, UC Berkeley
We present a simple statistical model of molecular function evolution
to predict protein molecular function. The model description encodes
general knowledge of how molecular function evolves within a
phylogenetic tree based on the proteins' sequence. Inputs are a
phylogeny for a set of evolutionarily related protein sequences and
any available molecular function characterizations for those
proteins. Conditional probabilities are a variant of a continuous time
Markov chain, making inference straightforward, and the resulting
posterior probabilities for each protein can be used to predict
protein function. We present results from testing our model on three
protein families, and compare prediction results on these extant
proteins to other available protein function prediction methods. For
the deaminase family, for example, our method achieves 93.9% prediction
accuracy, whereas BLAST achieves 72.7%, GOtcha 87.9%, and
Orthostrapper 72.7%.
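The continuous-time Markov chain building block mentioned above can be sketched minimally. This is not SIFTER's actual model (which handles richer function descriptions); it is the simplest case, a two-state gain/loss chain over "lacks function" / "has function", whose branch transition matrix has a closed form. The rate values are invented for illustration.

```python
import math

def two_state_ctmc(t, gain=0.3, loss=0.1):
    """Transition matrix P(t) for a 2-state continuous-time Markov chain
    over {0: lacks function, 1: has function}, with rate `gain` for
    0 -> 1 and `loss` for 1 -> 0. Closed form for the 2-state case.
    """
    s = gain + loss
    e = math.exp(-s * t)
    p01 = (gain / s) * (1.0 - e)
    p10 = (loss / s) * (1.0 - e)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def propagate(parent_probs, t):
    """Push a distribution over function states along a branch of length t,
    the elementary step of inference on a phylogeny."""
    P = two_state_ctmc(t)
    return [sum(parent_probs[i] * P[i][j] for i in range(2)) for j in range(2)]

# A parent known to have the function: after a short branch the child
# almost certainly retains it; after a long branch the distribution
# drifts toward the chain's stationary value gain/(gain+loss) = 0.75.
short = propagate([0.0, 1.0], t=0.1)
long_ = propagate([0.0, 1.0], t=100.0)
```

Chaining such propagation steps down a phylogeny, and conditioning on observed annotations at some leaves, is what makes posterior function probabilities for the unannotated proteins straightforward to compute.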
Thursday, March 16th
Data-adaptive test statistics for genomics
Dr. Sach Mukherjee
Department of Statistics, UC Berkeley
In recent years, there has been a great deal of interest in hypothesis
testing problems in genomics. In problems of this kind, we are typically
presented with a dataset containing measurements pertaining to a large
number of molecules, from which we would like to select a number of
molecules which are likely to satisfy a hypothesis of biological interest.
The selection of differentially expressed genes from microarray data is a
particularly well-known exemplar of this broad class of problems. In this
talk, I will put forward a data-adaptive approach to genomic hypothesis
testing, in which a test statistic is learned directly from data. This
strategy is made possible by the use of a simple measure called
"reproducibility", which can be computed without any knowledge of the
ground truth but is nonetheless correlated with risk under the true (but
unknown) data-generating distribution. I will discuss the relationship
between reproducibility and risk, and show how reproducibility may be used
as a proxy for risk in the learning of test statistics. Finally, I will
present a case-study in which this data-adaptive approach is used to
select differentially expressed genes from real and simulated microarray
data.
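The talk defines "reproducibility" precisely; the abstract does not, so the following toy version only illustrates the flavor of such a measure: repeatedly split the arrays in half, rank genes by the candidate statistic on each half, and average the overlap of the two top-k gene lists. Everything here (function names, the overlap criterion, the mean statistic) is an assumption of this sketch, not the measure from the talk.

```python
import random

def top_k_overlap(data, statistic, k=10, n_splits=20, seed=0):
    """A toy 'reproducibility' score for a test statistic: split the
    arrays (columns) in half at random, rank genes (rows) by the
    statistic on each half, and average the top-k list overlap.
    A statistic whose top lists agree across splits is more likely to
    generalize, even though no ground truth is used.
    """
    rng = random.Random(seed)
    n_arrays = len(data[0])
    total = 0.0
    for _ in range(n_splits):
        cols = list(range(n_arrays))
        rng.shuffle(cols)
        half = n_arrays // 2
        a, b = cols[:half], cols[half:]

        def top(cols_):
            scores = [(statistic([row[c] for c in cols_]), g)
                      for g, row in enumerate(data)]
            return {g for _, g in sorted(scores, reverse=True)[:k]}

        total += len(top(a) & top(b)) / k
    return total / n_splits

def mean_stat(xs):
    return sum(xs) / len(xs)

# Toy data: 10 clearly 'active' genes among 50; the mean statistic
# recovers the same top-10 list on every split, so the score is 1.
data = [[5.0] * 12 for _ in range(10)] + [[0.0] * 12 for _ in range(40)]
score = top_k_overlap(data, mean_stat)
```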
Thursday, March 23rd
Application of 'omics' to the study of chemically-exposed humans
Professor Martyn T. Smith
Division of Environmental Health Sciences, UC Berkeley
We are using the Illumina, Affymetrix and Ciphergen platforms to study
human populations. We have examined the effects of benzene exposure on
peripheral blood mononuclear cell (PBMC) gene expression in a
population of shoe-factory workers with well-characterized
occupational exposures to benzene using both Affymetrix and Illumina
microarrays. RNA was isolated from the PBMC of exposed workers along
with matched controls. PBMC RNA was amplified and hybridized to
Affymetrix U133 chips and Illumina Sentrix Human-8 beadchips. Data
from the two platforms have been compared. Among the top 200 genes
identified, there was only 16% (32 genes) concordance between the two
platforms, but expression ratios were very similar for the concordant
genes. This and another study of dioxin-exposed individuals
highlight some of the challenges of examining gene expression by
microarray in human occupational exposure settings where the
discrimination of subtle differential expression changes (mostly <
2-fold) against a background of inter-individual variation is
necessary. We have also used array-based proteomics to study people
exposed to benzene, arsenic and dioxin as well as cases of leukemia.
We have developed novel statistical approaches to analyzing this
proteomic spectral data. The Illumina platform is also being used for
high-throughput genotyping of thousands of SNPs and new approaches to
analyzing this type of data are being developed collaboratively.
This is joint work with Cliona McHale, Alan Hubbard, Jingsong Chen, Christine Skibola, Christine Hegedus, Merrill Birkner and Luoping Zhang.
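The reported concordance figure is just the overlap of the two platforms' top-200 gene lists (32 shared genes / 200 = 16%). The computation can be sketched directly; the gene identifiers below are a toy example, not data from the study.

```python
def top_n_concordance(ranked_a, ranked_b, n=200):
    """Fraction of genes shared by the top-n lists from two platforms.
    Inputs are gene identifiers ranked best-first; the abstract's
    '16% (32 genes)' corresponds to a 32-gene overlap of two top-200 lists.
    """
    shared = set(ranked_a[:n]) & set(ranked_b[:n])
    return len(shared) / n, shared

# Toy example with n=5: the two platforms agree on 2 of their top-5 genes.
affy = ["TP53", "MYC", "EGFR", "KRAS", "BRCA1", "PTEN"]
illumina = ["MYC", "GAPDH", "TP53", "ACTB", "CDK4", "EGFR"]
frac, shared = top_n_concordance(affy, illumina, n=5)
```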
Thursday, April 6th
Whole-genome alignments and polytopes for comparative genomics
Colin Dewey
Department of Electrical Engineering and Computer
Sciences, UC Berkeley
Whole-genome sequencing of many species has presented us with the opportunity to deduce the evolutionary relationships between each and every nucleotide. In this talk, I will present algorithms for this problem, which is that of multiple whole-genome alignment. The sensitivity of whole-genome alignments to parameter values can be ascertained through the use of alignment polytopes, which will be explained. I will also show how whole-genome alignments are used in comparative genomics, including the identification of novel genes, the location of micro-RNA targets, and the elucidation of cis-regulatory element and splicing signal evolution.
Thursday, April 13th
Combinatorial Regulation in Yeast Transcription Networks
Professor Hao Li
Department of Biochemistry and Biophysics and California
Institute for Quantitative Biomedical Research, UCSF
Yeast has evolved a complex regulatory network to control its gene
expression in response to changes in environment. It is quite common
that in response to an external stimulus, several transcription factors
are activated and they work in combinations to control different subsets
of genes in the genome. We are interested in how the promoters of genes
are designed to integrate signals from multiple transcription factors
and what are the functional constraints. To answer how, we have
developed a number of computational algorithms to systematically map the
binding sites and target genes of transcription factors using sequence
and gene expression data. To analyze the functional constraints, we have
employed mechanistic models to study the dynamical behavior of genes
regulated by multiple factors. We have also developed experimental tools
to monitor the dynamics of gene expression quantitatively with high
temporal resolution.
Thursday, April 20th
Optical Mapping and its applications to discovering structural variations in genomes
Anton Valouev
Department of Mathematics, University of Southern California
Optical Mapping is a powerful
high-throughput, genome-wide restriction mapping technology in which
restriction maps of single DNA molecules can be acquired using light
microscopy. In a way very similar to sequencing, individual optical
maps must be assembled to yield accurate whole-genome restriction maps.
They can be compared to published sequences to identify structural
variants in the genome in the form of apparent insertions/deletions,
restriction sites (novel or missing), inversions and translocations. In
this talk I will give an overview of Optical Mapping technology and
will explain some statistical aspects associated with calling
structural variants in genomes.
Thursday, April 27th
Some issues in the analysis of high dimensional cancer data
Professor Jane Fridlyand
Department of Epidemiology and Biostatistics and Comprehensive Cancer Center, UCSF
This talk will consist of two parts. In the first part, we will discuss issues arising in the analysis of array CGH data, including its segmentation, comparison across platforms, and meta-analysis issues. In particular, we will discuss some approaches to joint analysis of copy number, expression and methylation arrays. In the second part of the talk, we will introduce Magnetic Resonance Spectroscopy technology and discuss some high-level analysis issues. Much of this presentation will contain work in progress.
Thursday, May 4th
Analysis of Brain Images: Methods and Models
Professor William Jagust, MD
School of Public Health, UC Berkeley
Helen Wills Neuroscience Institute, UC Berkeley
Lawrence Berkeley National Laboratory
The imaging technologies of positron emission tomography (PET), magnetic resonance imaging (MRI) and functional MRI (fMRI) make use of basic physical principles to derive images of the brain. For PET, injected radionuclides are used to image biochemical and molecular properties. MRI and fMRI use magnetic resonance signals to define anatomy or physiology, respectively. All three of these techniques require considerable analysis and data reduction to produce 3D or 4D images from the signals, and to perform hypothesis-testing statistics on the data. This talk will first review the basic methods of deriving biochemical, anatomical, and physiological information from PET and MR signals by describing the first steps in image processing. These approaches often require the use of models to define relationships between dynamic signal change and biochemistry or physiology. Once the biological signal is derived, these images are used to test specific hypotheses about how brain structure, biochemistry, or physiology are related to other variables such as a disease state or cognitive state. Testing these hypotheses involves problems related to large arrays of data, as the images are composed of many 3D volume elements (voxels) obtained in a relatively small number of subjects. This talk will also review approaches to this problem.
Thursday, May 11th
Inferring Transcriptional Subnetworks from Microarray Expression Data using Regression Splines
Dr. Debopriya Das
Life Sciences Division
Lawrence Berkeley National Laboratory
With the availability of genome-wide mRNA profiles, it is now possible to integrate such data with DNA sequence information to globally decipher several key aspects of transcription regulation. However, gene regulation in eukaryotes is complex and is inherently combinatorial in nature. Additionally, in mammals, the transcription factor (TF) binding sites are strongly degenerate, making their computational identification even more elusive. I will present a method called MARSMotif, based on multivariate regression splines, which systematically accounts for these critical features. It allows adaptive determination of transcriptional subnetworks (cis-regulatory motif combinations, associated target genes and regulated pathways) from expression data and is equally applicable to both low eukaryotes and mammals. Using expression profiles from yeast and human as examples, I will discuss how one can achieve a systematic understanding of underlying regulatory subnetworks using this approach. Condition-specific gene activation by a common TF will be addressed and supportive experimental evidence for novel predictions will also be presented.