PB HLTH 292, Section 020

Statistics and Genomics Seminar

Spring 2006


Thursday, January 19th

Analysis issues of oligonucleotide tiling array data
Professor Ru-Fang Yeh
Department of Epidemiology and Biostatistics, University of California, San Francisco

The recent development of DNA tiling arrays has made it possible to experimentally annotate the genome and various protein-DNA interactions through unbiased interrogation of large genomic regions. Depending on the application, data from tiling array experiments pose unique analytic challenges that are very different from those of traditional expression array analysis. In this talk, I will discuss preprocessing and analysis issues for tiling array-based experiments on DNA copy-number alterations (array CGH) and histone modifications (ChIP-chip) using NimbleGen custom arrays, and offer our preliminary solutions.


Thursday, January 26th

Quantitative trait mapping study designs from an information perspective
Professor Saunak Sen
Department of Epidemiology and Biostatistics, University of California, San Francisco

Genetic mapping using crosses between inbred strains (of mice, yeast, Arabidopsis, etc.) is an important biological tool.  It is used to detect and localize the genetic elements responsible for variation in a phenotype of interest (a trait such as blood pressure in mice).

We consider inbred line crosses from an information perspective in order to examine the efficiency of different genotyping and phenotyping strategies.  Our central result is a simple formula to quantify the information content of any combined phenotyping and genotyping design.  This is used to derive a number of results including finding efficient genotyping designs given genotyping cost, number of phenotyping replications needed, and the effect of multiple loci on selective genotyping strategies.

This is joint work with Jaya Satagopan of Memorial Sloan Kettering Cancer Center, New York, NY, and Gary Churchill of the Jackson Laboratory, Bar Harbor, ME.
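To give a flavor of why selective genotyping is efficient, here is a Monte Carlo sketch built on the standard observation that, for a small QTL effect in a backcross, an individual's information contribution scales roughly with its squared standardized phenotype. The designs and numbers below are invented for illustration; this is not the formula presented in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    y = rng.standard_normal(n)  # standardized phenotypes (null model)

    def relative_info(selected):
        # Approximate information of a genotyping design relative to
        # genotyping everyone: sum of squared phenotypes of the genotyped
        # individuals (small-effect approximation).
        return (y[selected] ** 2).sum() / (y ** 2).sum()

    frac = 0.4
    random_idx = rng.choice(n, size=int(frac * n), replace=False)
    extreme_idx = np.argsort(np.abs(y))[-int(frac * n):]

    print(f"random {frac:.0%} genotyping:  {relative_info(random_idx):.2f}")
    print(f"extreme {frac:.0%} genotyping: {relative_info(extreme_idx):.2f}")

Under this approximation, genotyping the 40% most extreme phenotypes recovers most of the information of the full design, while randomly genotyping the same number of individuals recovers only about 40% of it.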


Thursday, February 2nd

Hybridization Efficiency Analysis of Probes Targeting 16S rRNA Genes Using the Affymetrix GeneChip Format
Todd Z. DeSantis
Lawrence Berkeley National Laboratory

Background: Detection of diverse 16S rRNA gene types in complex mixtures can be achieved using arrays of probes targeting specific sequences in 16S rRNA genes. Whereas probes for expression arrays are designed to leverage the diversity among various genes in one genome, 16S probes rely upon the diversity of the same gene found in many genomes. Also, expression arrays are validated by their accurate estimation of changes in analyte concentration, but 16S arrays are expected to provide definitive present/absent scoring of each prokaryotic taxon. The degree of uniqueness of a probe for a particular target species or other defined operational taxonomic unit will dictate its reliability, but this has yet to be quantified for prediction of hybridization accuracy.
Methods: To obtain these metrics, amplicons of the 16S rRNA gene from Francisella tularensis were fragmented, labeled, and isothermally hybridized to replicate Affymetrix custom arrays containing 491,069 unique 25mer probes with various degrees of probe-target complementarity, melting temperature, and secondary-structure potential. Hybrid abundance at each probe location was determined by fluorescence intensity.
Results: As expected, probes exactly complementary to the target but with varying sequence composition produced intensities ranging over three orders of magnitude, yet replicate probes on the same array produced a coefficient of variation under 20%. Although mismatched probes were able to capture target sequence, a general decrease in intensity was observed for probes divergent from the target, and the effect could be attenuated by masking probes with high melting temperatures.
Conclusion: The data collected allow the development of a probabilistic model that aids in predicting the confidence that a probe’s response is due to the presence of the corresponding target in solution.

This is joint work with K. D. Hansen, E. L. Brodie, Y. M. Piceno, J. Bullard, P. Hu, G. L. Andersen.
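As a sketch of what such a probabilistic model might look like, one can regress a probe's observed response on design-time properties such as mismatch count, melting temperature and secondary-structure potential. The features, effect sizes and data below are simulated for illustration; this is not the authors' model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 5000
    mismatches = rng.integers(0, 6, n)   # probe-target mismatches (assumed feature)
    tm = rng.normal(60, 5, n)            # melting temperature, deg C (assumed feature)
    hairpin = rng.random(n)              # secondary-structure score (assumed feature)

    # Simulate "probe responded" labels with the qualitative trends reported
    # above: fewer mismatches and moderate Tm favor a bright probe.
    logit = 3.0 - 1.2 * mismatches - 0.08 * np.abs(tm - 60) - 0.5 * hairpin
    responded = rng.random(n) < 1 / (1 + np.exp(-logit))

    X = np.column_stack([mismatches, tm, hairpin])
    model = LogisticRegression(max_iter=1000).fit(X, responded)

    # Confidence that a perfect-match probe's signal reflects its target:
    print(model.predict_proba([[0, 60.0, 0.1]])[0, 1])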


Thursday, February 9th

Automatically Detecting and Genotyping Genetic Variants (SNPs)
by Sequencing of Diploid Samples

Professor Matthew Stephens
Department of Statistics, University of Washington, Seattle

The detection and genotyping of sequence variations, particularly Single Nucleotide Polymorphisms (SNPs), is at the core of all genetic analysis. The principal approach for detecting variants in a specific gene is to sequence that gene in a sample of (diploid) individuals. (The term "diploid" refers to the fact that each individual has two copies of their genome, one inherited from each parent.) Identification of SNPs from this kind of sequence data has been greatly aided by the use of computational and statistical methods. However, existing algorithms are not sufficiently accurate to be used without potentially costly confirmation, usually by a human manually checking each call.

This talk will describe the problem, and our work on a new and more accurate statistical method to detect and genotype SNPs. The new algorithm improves on existing approaches in two key ways. First, it takes more detailed account of systematic variation in peak heights due to read-specific and sequence-context effects. If unaccounted for, these systematic effects obscure the signal we are aiming to detect. Second, it computes a formal statistical measure of the evidence for potential genotypes at each position in each sequence. This enables the application of standard statistical methods to efficiently combine evidence across multiple reads for an individual, which results in exceptional accuracy for data with "double coverage", where individuals are sequenced on both the forward and reverse strands. It also provides a quantitative assessment of the confidence in each SNP identified, and in each genotype called. This is particularly useful in identifying a subset of highly accurate SNP and genotype calls that may be accepted without manual confirmation.
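A minimal sketch of the evidence-combination step, with invented likelihood numbers standing in for the peak-height model described above: per-read genotype likelihoods from the forward and reverse strands are combined by Bayes' rule.

    import numpy as np

    GENOTYPES = ["AA", "AG", "GG"]
    prior = np.array([0.495, 0.01, 0.495])  # hypothetical prior on genotypes

    # Likelihoods P(read | genotype) from a forward and a reverse read
    # ("double coverage"); in practice these come from the peak-height model.
    read_likelihoods = [
        np.array([0.70, 0.25, 0.05]),  # forward read favors AA but is ambiguous
        np.array([0.30, 0.65, 0.05]),  # reverse read favors the heterozygote
    ]

    posterior = prior.copy()
    for lik in read_likelihoods:
        posterior *= lik               # reads treated as conditionally independent
    posterior /= posterior.sum()

    for g, p in zip(GENOTYPES, posterior):
        print(f"P({g} | reads) = {p:.3f}")

The normalized posterior is exactly the kind of quantitative confidence measure that lets highly accurate calls be accepted without manual review.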



Thursday, February 16th

Modified BIC for Change-point Problems with Applications to Array-CGH Data
Dr. Nancy Zhang
Department of Statistics, UC Berkeley

We study the problem of estimating the number of change-points in a data series that is hypothesized to have undergone abrupt changes. First, we focus on the scenario of independent Gaussian data points with changing mean values, and then generalize to the Poisson process with changing rate parameter as well as general exponential families. This can be viewed as a problem in model selection, where the dimension of the model grows with the number of change-points assumed. However, the classic Bayes Information Criterion (BIC) cannot be applied because of irregularities in the likelihood function. By asymptotic approximation of the Bayes factor, we derive the Modified BIC, which is theoretically justified for the change-point models that we study.

An example of application, as well as a source of inspiration for the Gaussian model, is the analysis of array comparative genomic hybridization (array-CGH) data. Array-CGH measures the number of chromosome copies at each genome location of a cell sample, and is useful for finding regions of genome deletion and amplification in tumor cells. The Modified BIC statistic is tested on array-CGH data sets and compared to existing methods.  Variations on the basic change-point model that are inspired by array-CGH data are also discussed.
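To make the setting concrete, here is a generic dynamic-programming segmentation of a Gaussian mean-shift sequence under a flat per-change-point penalty; the talk's contribution is precisely a better-justified penalty than the flat BIC-style one used in this sketch.

    import numpy as np

    def segment(y, penalty):
        # Optimal mean-shift segmentation by dynamic programming, minimizing
        # residual sum of squares plus a flat penalty per change-point.
        n = len(y)
        s = np.concatenate([[0.0], np.cumsum(y)])
        s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

        def sse(i, j):  # residual SS of segment y[i:j] around its own mean
            m = (s[j] - s[i]) / (j - i)
            return s2[j] - s2[i] - (j - i) * m * m

        best = np.zeros(n + 1)             # best[j]: optimal cost of y[:j]
        back = np.zeros(n + 1, dtype=int)  # start of the last segment
        for j in range(1, n + 1):
            costs = [best[i] + sse(i, j) + (penalty if i > 0 else 0.0)
                     for i in range(j)]
            back[j] = int(np.argmin(costs))
            best[j] = costs[back[j]]
        cps, j = [], n
        while back[j] > 0:                 # recover change-point locations
            cps.append(back[j])
            j = back[j]
        return sorted(cps)

    rng = np.random.default_rng(2)
    y = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 50),
                        rng.normal(-1, 1, 80)])
    print(segment(y, penalty=2 * np.log(len(y))))  # flat BIC-like penalty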



Thursday, February 23rd

Novel Algorithms for Investigating New RNA Features
Professor Irmtraud Meyer
University of British Columbia

RNA molecules have remarkable chemical properties that allow them to play a variety of important functional roles in the cell. This talk introduces several novel methods to discover and study new functional features of RNA molecules.

Many biological classes of RNA molecules exert their function by assuming a distinct structure. Irmtraud Meyer will introduce a new algorithm by which we can show that these RNA molecules encode information not only on the final, functional secondary structure, but also on the folding pathway that guides the formation of the functional structure.

Due to the degeneracy of the genetic code, a protein-coding RNA molecule can encode an extra layer of information, for example, information on RNA secondary structure. Meyer will show how these overlapping layers of information can be detected, and present several results showing that RNA secondary structure may play an active role in the regulation of human pre-mRNA and mRNA sequences.



Thursday, March 9th

SIFTER: A statistical graphical model for predicting protein molecular function
Barbara Engelhardt
Department of Computer Science, UC Berkeley

We present a simple statistical model of molecular function evolution to predict protein molecular function. The model encodes general knowledge of how molecular function evolves within a phylogenetic tree based on the proteins' sequences. Inputs are a phylogeny for a set of evolutionarily related protein sequences and any available molecular function characterizations for those proteins. Conditional probabilities are given by a variant of a continuous-time Markov chain, making inference straightforward, and the resulting posterior probabilities for each protein can be used to predict protein function. We present results from testing our model on three protein families, and compare prediction results on these extant proteins to other available protein function prediction methods. For the deaminase family, for example, our method achieves 93.9% prediction accuracy, whereas BLAST achieves 72.7%, GOtcha 87.9%, and Orthostrapper 72.7%.
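A toy version of the underlying machinery, written from the description above rather than from the SIFTER code: molecular function is a discrete state evolving along the phylogeny under a continuous-time Markov chain, annotated leaves enter as evidence, and the posterior for an unannotated protein follows from standard tree propagation (Felsenstein pruning). The tree, rate matrix and annotations below are all invented.

    import numpy as np
    from scipy.linalg import expm

    K = 2                                # two hypothetical functions, 0 and 1
    Q = np.array([[-0.3,  0.3],          # assumed rate matrix for function change
                  [ 0.2, -0.2]])

    # Toy phylogeny as child lists with branch lengths; P(t) = expm(Q * t).
    tree = {"root": [("leafA", 1.0), ("int", 0.5)],
            "int":  [("leafB", 0.8), ("leafC", 0.8)]}
    evidence = {"leafA": 0, "leafB": 0}  # experimentally annotated proteins

    def prune(node):
        # Felsenstein pruning: P(annotations below node | node's state).
        if node not in tree:             # leaf
            v = np.ones(K)               # unannotated leaf carries no evidence
            if node in evidence:
                v = np.zeros(K)
                v[evidence[node]] = 1.0
            return v
        v = np.ones(K)
        for child, t in tree[node]:
            v = v * (expm(Q * t) @ prune(child))
        return v

    root_prior = np.array([0.5, 0.5])

    # Posterior over leafC's function: clamp each state and renormalize.
    post = []
    for f in range(K):
        evidence["leafC"] = f
        post.append(root_prior @ prune("root"))
    post = np.array(post)
    post /= post.sum()
    print("P(leafC function | annotations):", post.round(3))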


Thursday, March 16th

Data-adaptive test statistics for genomics
Dr. Sach Mukherjee
Department of Statistics, UC Berkeley

In recent years, there has been a great deal of interest in hypothesis testing problems in genomics. In problems of this kind, we are typically presented with a dataset containing measurements pertaining to a large number of molecules, from which we would like to select those molecules likely to satisfy a hypothesis of biological interest. The selection of differentially expressed genes from microarray data is a particularly well-known exemplar of this broad class of problems. In this talk, I will put forward a data-adaptive approach to genomic hypothesis testing, in which a test statistic is learned directly from data. This strategy is made possible by the use of a simple measure called "reproducibility", which can be computed without any knowledge of the ground truth but is nonetheless correlated with risk under the true (but unknown) data-generating distribution. I will discuss the relationship between reproducibility and risk, and show how reproducibility may be used as a proxy for risk in the learning of test statistics. Finally, I will present a case-study in which this data-adaptive approach is used to select differentially expressed genes from real and simulated microarray data.
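The sketch below illustrates one plausible reading of the reproducibility idea, constructed here for illustration and not necessarily the speaker's exact definition: a statistic is reproducible to the extent that it ranks roughly the same genes at the top when recomputed on disjoint halves of the samples, and that score can be compared across candidate statistics without knowing the ground truth.

    import numpy as np

    rng = np.random.default_rng(3)
    n_genes, n_per_group = 1000, 20
    x = rng.normal(0, 1, (n_genes, n_per_group))  # group 1 expression
    y = rng.normal(0, 1, (n_genes, n_per_group))  # group 2 expression
    y[:50] += 1.0                                 # 50 truly shifted genes

    def t_stat(a, b):
        return (a.mean(1) - b.mean(1)) / np.sqrt(
            a.var(1, ddof=1) / a.shape[1] + b.var(1, ddof=1) / b.shape[1])

    def mean_diff(a, b):
        return a.mean(1) - b.mean(1)

    def reproducibility(stat, k=50, n_splits=20):
        # Average overlap of the top-k gene lists across random half-splits.
        scores = []
        for _ in range(n_splits):
            perm = rng.permutation(n_per_group)
            h1, h2 = perm[:n_per_group // 2], perm[n_per_group // 2:]
            top1 = set(np.argsort(-np.abs(stat(x[:, h1], y[:, h1])))[:k])
            top2 = set(np.argsort(-np.abs(stat(x[:, h2], y[:, h2])))[:k])
            scores.append(len(top1 & top2) / k)
        return float(np.mean(scores))

    for name, stat in [("t-statistic", t_stat), ("mean difference", mean_diff)]:
        print(f"{name}: reproducibility = {reproducibility(stat):.2f}")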


Thursday, March 23rd

Application of 'omics' to the study of chemically exposed humans
Professor Martyn T. Smith
Division of Environmental Health Sciences, UC Berkeley

We are using the Illumina, Affymetrix and Ciphergen platforms to study human populations. We have examined the effects of benzene exposure on peripheral blood mononuclear cell (PBMC) gene expression in a population of shoe-factory workers with well-characterized occupational exposures to benzene, using both Affymetrix and Illumina microarrays. RNA was isolated from the PBMC of exposed workers and matched controls. PBMC RNA was amplified and hybridized to Affymetrix U133 chips and Illumina Sentrix Human-8 beadchips. Data from the two platforms have been compared. Among the top 200 genes identified, there was only 16% (32 genes) concordance between the two platforms, but expression ratios were very similar for the concordant genes. This and another study of dioxin-exposed individuals highlight some of the challenges of examining gene expression by microarray in human occupational exposure settings, where the discrimination of subtle differential expression changes (mostly < 2-fold) against a background of inter-individual variation is necessary. We have also used array-based proteomics to study people exposed to benzene, arsenic and dioxin, as well as cases of leukemia. We have developed novel statistical approaches to analyzing these proteomic spectral data. The Illumina platform is also being used for high-throughput genotyping of thousands of SNPs, and new approaches to analyzing this type of data are being developed collaboratively.

This is joint work with Cliona McHale, Alan Hubbard, Jingsong Chen, Christine Skibola, Christine Hegedus, Merrill Birkner and Luoping Zhang.
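Schematically, the concordance figure above comes from comparing ranked gene lists across platforms; the sketch below uses simulated scores standing in for the real Affymetrix and Illumina results.

    import numpy as np

    rng = np.random.default_rng(4)
    n_genes, k = 8000, 200
    true_effect = np.where(np.arange(n_genes) < 300, 1.0, 0.0)
    score_affy = true_effect + rng.normal(0, 1.0, n_genes)      # hypothetical scores
    score_illumina = true_effect + rng.normal(0, 1.0, n_genes)  # hypothetical scores

    def top_k(scores):
        return set(np.argsort(-scores)[:k])  # indices of the top-k genes

    overlap = top_k(score_affy) & top_k(score_illumina)
    print(f"concordance: {len(overlap)}/{k} = {len(overlap) / k:.0%}")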


Thursday, April 6th

Whole-genome alignments and polytopes for comparative genomics
Colin Dewey
Department of Electrical Engineering and Computer Sciences, UC Berkeley

Whole-genome sequencing of many species has presented us with the opportunity to deduce the evolutionary relationships between individual nucleotides. In this talk, I will present algorithms for this problem, which is that of multiple whole-genome alignment. The sensitivity of whole-genome alignments to parameter values can be ascertained through the use of alignment polytopes, which will be explained. I will also show how whole-genome alignments are used in comparative genomics, including the identification of novel genes, the location of microRNA targets, and the elucidation of cis-regulatory element and splicing signal evolution.


Thursday, April 13th

Combinatorial Regulation in Yeast Transcription Networks
Professor Hao Li
Department of Biochemistry and Biophysics and California Institute for Quantitative Biomedical Research, UCSF

Yeast has evolved a complex regulatory network to control its gene expression in response to changes in its environment. It is quite common that, in response to an external stimulus, several transcription factors are activated and work in combination to control different subsets of genes in the genome. We are interested in how the promoters of genes are designed to integrate signals from multiple transcription factors, and what the functional constraints are. To address the question of how, we have developed a number of computational algorithms to systematically map the binding sites and target genes of transcription factors using sequence and gene expression data. To analyze the functional constraints, we have employed mechanistic models to study the dynamical behavior of genes regulated by multiple factors. We have also developed experimental tools to monitor the dynamics of gene expression quantitatively with high temporal resolution.


Thursday, April 20th

Optical Mapping and its applications to discovering structural variations in genomes
Anton Valouev
Department of Mathematics, University of Southern California

Optical Mapping is a powerful high-throughput genome-wide restriction mapping technology in which restriction maps of single DNA molecules are acquired using light microscopy. In a way very similar to sequencing, individual optical maps must be assembled to yield accurate whole-genome restriction maps. These can be compared to published sequences to identify structural variants in the genome in the form of apparent insertions/deletions, restriction sites (novel or missing), inversions and translocations. In this talk I will give an overview of Optical Mapping technology and will explain some statistical aspects of calling structural variants in genomes.
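As a cartoon of the computational core, aligning an observed single-molecule map against an in-silico reference map is a dynamic program over ordered fragment lengths, with sizing error absorbed into the match cost and a missed cut modeled by letting one observed fragment span two reference fragments. The scoring constants below are illustrative, not those of any published aligner.

    def align_maps(obs, ref, sizing_sd=2.0, miss_penalty=5.0):
        # obs, ref: ordered restriction-fragment lengths in kb.
        n, m = len(obs), len(ref)
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0

        def size_cost(a, b):
            return ((a - b) / sizing_sd) ** 2  # squared standardized sizing error

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # one observed fragment matches one reference fragment
                D[i][j] = D[i - 1][j - 1] + size_cost(obs[i - 1], ref[j - 1])
                # missed cut: obs[i-1] spans ref[j-2] + ref[j-1]
                if j >= 2:
                    fused = ref[j - 2] + ref[j - 1]
                    D[i][j] = min(D[i][j],
                                  D[i - 1][j - 2] + size_cost(obs[i - 1], fused)
                                  + miss_penalty)
        return D[n][m]

    reference = [20.0, 35.0, 15.0, 50.0]
    observed = [21.1, 49.3, 51.2]   # the 35 and 15 kb fragments appear fused
    print(f"alignment cost: {align_maps(observed, reference):.2f}")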


Thursday, April 27th

Some issues in the analysis of high dimensional cancer data
Professor Jane Fridlyand
Department of Epidemiology and Biostatistics and Comprehensive Cancer Center, UCSF

This talk will consist of two parts. In the first part, we will discuss issues arising in the analysis of array CGH data, including its segmentation, comparison across platforms, and meta-analysis. In particular, we will discuss some approaches to joint analysis of copy number, expression and methylation arrays. In the second part of the talk, we will introduce Magnetic Resonance Spectroscopy technology and discuss some high-level analysis issues. Much of this presentation will describe work in progress.


Thursday, May 4th

Analysis of Brain Images: Methods and Models
Professor William Jagust, MD
School of Public Health, UC Berkeley
Helen Wills Neuroscience Institute, UC Berkeley
Lawrence Berkeley National Laboratory

The imaging technologies of positron emission tomography (PET), magnetic resonance imaging (MRI) and functional MRI (fMRI) make use of basic physical principles to derive images of the brain. For PET, injected radionuclides are used to image biochemical and molecular properties. MRI and fMRI use magnetic resonance signals to define anatomy or physiology, respectively. All three of these techniques require considerable analysis and data reduction to produce 3D or 4D images from the signals, and to perform hypothesis-testing statistics on the data. This talk will first review the basic methods of deriving biochemical, anatomical, and physiological information from PET and MR signals by describing the first steps in image processing. These approaches often require the use of models to define relationships between dynamic signal change and biochemistry or physiology. Once the biological signal is derived, these images are used to test specific hypotheses about how brain structure, biochemistry, or physiology are related to other variables such as a disease state or cognitive state. Testing these hypotheses involves problems related to large arrays of data, as the images are composed of many 3D volume elements (voxels) obtained in a relatively small number of subjects. This talk will also review approaches to this problem.
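For the many-voxels, few-subjects testing problem mentioned above, a generic sketch is a mass-univariate t-test at every voxel followed by multiple-testing control; the data shapes and effect below are invented for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n_vox, n_subj = 50_000, 12
    patients = rng.normal(0, 1, (n_vox, n_subj))
    controls = rng.normal(0, 1, (n_vox, n_subj))
    patients[:500] += 1.5            # hypothetical affected region

    t, p = stats.ttest_ind(patients, controls, axis=1)  # one test per voxel

    # Benjamini-Hochberg control of the false discovery rate at q = 0.05:
    # reject the k smallest p-values, where k is the largest index with
    # p_(k) <= q * k / n_vox.
    q = 0.05
    p_sorted = np.sort(p)
    passed = p_sorted <= q * np.arange(1, n_vox + 1) / n_vox
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    print(f"voxels declared significant: {k}")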


Thursday, May 11th

Inferring Transcriptional Subnetworks from Microarray Expression Data using Regression Splines
Dr. Debopriya Das
Life Sciences Division
Lawrence Berkeley National Laboratory

With the availability of genome-wide mRNA profiles, it is now possible to integrate such data with DNA sequence information to globally decipher several key aspects of transcription regulation. However, gene regulation in eukaryotes is complex and inherently combinatorial in nature. Additionally, in mammals, transcription factor (TF) binding sites are strongly degenerate, making their computational identification even more challenging. I will present a method called MARSMotif, based on multivariate regression splines, which systematically accounts for these critical features. It allows adaptive determination of transcriptional subnetworks (cis-regulatory motif combinations, associated target genes and regulated pathways) from expression data and is equally applicable to both lower eukaryotes and mammals. Using expression profiles from yeast and human as examples, I will discuss how one can achieve a systematic understanding of the underlying regulatory subnetworks using this approach. Condition-specific gene activation by a common TF will be addressed, and supportive experimental evidence for novel predictions will also be presented.
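A toy illustration of the regression-splines idea, using a fixed hinge basis and simulated motif scores; MARS itself selects the knots and terms adaptively, so this is a sketch of the model class rather than of MARSMotif.

    import numpy as np

    def hinge(x, knot):
        return np.maximum(0.0, x - knot)  # the MARS basis function

    rng = np.random.default_rng(6)
    n_genes = 2000
    motif_a = rng.random(n_genes)         # hypothetical motif-match scores
    motif_b = rng.random(n_genes)

    # Simulated expression: motif A acts above a threshold, and the A-B
    # product term stands for combinatorial (motif-pair) regulation.
    expr = (2.0 * hinge(motif_a, 0.5)
            + 1.5 * hinge(motif_a, 0.5) * hinge(motif_b, 0.3)
            + rng.normal(0, 0.2, n_genes))

    X = np.column_stack([
        np.ones(n_genes),                 # intercept
        hinge(motif_a, 0.5),
        hinge(motif_b, 0.3),
        hinge(motif_a, 0.5) * hinge(motif_b, 0.3),
    ])
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    print("fitted coefficients:", coef.round(2))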