Statistics and Genomics Seminar

PH 296, Section 003

Spring 2005

Thursday, January 27th

Estimating evolutionary pathways by mutagenetic trees
Niko Beerenwinkel
Department of Mathematics, UC Berkeley

Mutagenetic trees constitute a class of graphical models to describe the dependency structure of non-reversible genetic changes. We present efficient methods for estimating mutagenetic trees and mixture models of these. The techniques are applied to estimating evolutionary pathways of HIV under pressure of antiviral therapy, and to chromosome alterations that accumulate during tumor progression.

Thursday, February 3rd

Quality measures for Affymetrix chips
Dr. Julia Brettschneider
Department of Statistics, UC Berkeley

With microarray technology getting more established in many branches of life science research, scientists in both academia and corporate environments raise their expectations for reliability and reproducibility of the measurements. The quality of microarray data has emerged as a new research topic suited to be approached cooperatively by biotechnologists and statisticians. While there is a quality report included in the MAS 5.0 output, the community of Affymetrix users is still far from having established uniformly applied quality standards.

We introduce several new tools for both spatial and numerical quality assessment. Our quality measures are based on probe level and probeset level information obtained as a by-product of RMA (Irizarry et al.). They provide convenient ways to search for individual chips of poor quality, for quality trends over time, and for systematic quality patterns related to experimental conditions or sample properties.

In an attempt to capture a variety of quality problems, we test our quality measures on data sets from very different sources, ranging from a small lab experiment with Drosophila to a multi-site study on human brains.

Thursday, February 10th

Detecting Gene Interaction in Affected Sib-Pair Linkage Analysis
Ingileif B. Hallgrimsdottir
Department of Statistics, UC Berkeley

Linkage analysis has proved to be a valuable approach for identifying disease genes associated with Mendelian disorders; to date, around 1200 genes have been mapped. However, the success stories are scarce when it comes to complex disorders, in which both environmental factors and many, possibly interacting, genes contribute to disease susceptibility.

I will present recent work on detecting (statistical) interaction from affected sib-pair data, i.e. data where the families considered consist of parents and two affected children. I will present a new parametrization of the joint IBD probabilities at two loci that allows us to model interaction. I will then discuss how this new parametrization relates to variance-components and how it can be used to develop tests for two-locus linkage.

Joint work with Terry Speed.

Thursday, February 17th

Localization of protein binding surfaces within families of proteins
Dr. Dmitry Korkin
Andrej Sali Lab, Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, UCSF

Studies of the conservation of protein binding sites have led to valuable insights into the evolutionary and structural nature of protein-protein interactions. In this work, we analyze whether binding sites of homologous proteins are localized, i.e. whether they share similar relative positions on protein surfaces. Binding sites are obtained from PIBASE, a structural database of domain-domain interactions, for each domain family, as classified by the Structural Classification of Proteins (SCOP). The binding sites within each family are then superposed using a structural alignment of the member domain structures. Three distinct measures are then used to quantify the localization of binding sites in each family. The analysis of 1,884 SCOP domain families reveals that ~71% of families have binding sites with localization values greater than expected by chance. We find that 563 families have significantly greater localization, and 144 families have significantly lower localization of binding sites, than expected by chance. Examination of families at both extremes of binding site localization suggests that localization can be a helpful measure for describing the functional diversity of protein-protein interactions, complementing measures of sequence and structural conservation. Knowledge of binding site localization may also provide guidance in the modeling of protein assembly structures.

Thursday, February 24th

The Genome Parsing Suite: a Method to Rapidly Identify and Characterize Retroid Agents in Genomes
Professor Marcie McClure
Montana State University, Bozeman

Retroid agents are genomes that replicate by reverse transcription (RT) of an RNA intermediate. Some Retroid agents are implicated in disease via insertional mutagenesis, while others have been found to encode proteins essential to primate reproduction or provide regulatory sequences for host cell processes. Reports on the number of Retroid agents in the human genome estimate that they make up approximately 17% of the total DNA. Given the evolutionary and developmental importance of Retroid agents, we have developed new software, the Genome Parsing Suite (GPS), to identify and characterize reverse transcriptase signals in any genome database, and to annotate the Retroid agents that encode these potential RTs.
The GPS approach is quite different in concept from RepeatMasker, a program designed to identify and mask out Retroid agents in the human genome with consensus DNA for repetitive elements. The GPS utilizes protein rather than nucleotide sequences to screen for the presence of Retroid agents, thereby providing a deeper query into a genome. Once nucleotide substitution reaches mutational saturation, DNA sequences are no longer useful for homology searches, while the corresponding protein sequences easily retain enough signal to identify distant potential homologues. The prototype GPS provides information about the Retroid Agent, including genes present, condition of genes, agent boundaries, location of the agent in the genome etc.
The GPS was used to analyze all significant WU-tBLASTn hits in the human genome returned for 30 representative RT queries. A total of 128,779 unique RT signals were identified, approximately 2.78% of the sequenced portion of the human genome. This is far fewer than the 500,000 LINEs (including solo UTRs or other gene components, and fragments) found by RepeatMasker. This method can detect Retroid sequences not associated with any RT remnant, as long as mutational saturation has not eroded all DNA identity. The estimated number of LINEs in the human genome using RepeatMasker, however, does not account for concerns regarding small sequence length hits: 1) some may be random, and 2) those in close proximity actually belong to the same highly divergent LINE genome. In addition, the early version of the human genome was highly redundant for repetitive sequences, resulting in an overestimation of their abundance. Only 157 LINEs are without stop-codons or frame-shifts. Interestingly, 7,332 unique sequences are retrieved by RTs not previously reported in the human genome.

Thursday, March 10th

Environmental Exposures and the Molecular Epidemiology of Childhood Leukemia: The Northern California Childhood Leukemia Study
Dr. Catherine Metayer
Buffler Group, School of Public Health, UC Berkeley

The Northern California Childhood Leukemia Study, an approximately population-based case control study of childhood leukemia, was initiated in 1996 in 17 counties of Northern California and expanded to an additional 18 counties in Central California in 1999. This study is designed to identify etiologic associations between environmental exposures and leukemia in children (ages 0 to 14 years). A high percentage of the subjects (about 40%) are children of Hispanic origin, which is unique in a study of childhood leukemia. To date over 800 cases have been enrolled in the study, with case participation rates ranging from 83% (1996-1999) to 88% (2000-2004). Two matched control subjects for each case will be randomly selected from the statewide birth registry. Increased awareness of the molecular and cytogenetic diversity of leukemia within major subtypes (acute lymphoblastic leukemia and acute myeloid leukemia), and the significance of this diversity for clinical and epidemiologic research, underscore the need to uncover the etiologies of molecularly distinct subgroups of leukemia. Buccal cell specimens are collected from cases, controls, and their biological mothers to study genetic susceptibility. Specific polymorphic genes studied to date include GSTT1, GSTM1, CYPs, NQO1, MDR, MTHFR 677 and 1298. Data are collected for a wide spectrum of environmental exposures including parental occupational exposures, parental tobacco smoke, pesticides and other chemicals, maternal and child diet, and child immunological factors. This comprehensive information on environmental exposures and genetic characteristics, in conjunction with improved disease classification and stratification by Hispanic status, provides significant insights into the etiology of childhood leukemia.

Thursday, March 17th

Efficient Computation of Close Upper and Lower Bounds on the Number of Recombinations Needed in Evolutionary History
Professor Daniel Gusfield
Department of Computer Science, UC Davis

Meiotic recombination takes two equal length sequences and produces a third sequence of the same length consisting of some prefix of one of the sequences, followed by a suffix of the other sequence. Meiotic recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species, and other forms of recombination allow the sharing of genetic material between species. Efforts to deduce patterns of historical recombination or to estimate the frequency or the location of recombination are central to modern-day genetics, for example in "association mapping".
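The single-crossover operation described above can be illustrated with a minimal Python sketch. The function name `recombine`, the parameter `cut`, and the use of binary strings are assumptions made for illustration; they do not come from the talk or from any of the programs it mentions.

```python
def recombine(first: str, second: str, cut: int) -> str:
    """Single-crossover recombinant of two equal-length sequences.

    The result is the prefix of `first` (positions before `cut`)
    followed by the suffix of `second` (positions from `cut` onward).
    `cut` is a hypothetical breakpoint index chosen by the caller.
    """
    if len(first) != len(second):
        raise ValueError("sequences must have equal length")
    if not 0 <= cut <= len(first):
        raise ValueError("cut must lie between 0 and the sequence length")
    return first[:cut] + second[cut:]


# Two binary (SNP) sequences and a breakpoint after the second site.
child = recombine("00110", "11001", 2)
print(child)  # 00001
```

The recombinant always has the same length as its parents, and a breakpoint at either end simply returns one of the parental sequences.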

In studying recombination, a common underlying problem is to determine the *minimum* number of recombinations needed to generate a given set of molecular sequences from an ancestral sequence (which may or may not be known), using some specified model of the permitted site mutations. The common assumption for SNP sites is the *infinite sites model* in population genetics, i.e., that any site (in the study) can mutate at most once in the entire history of the sequences, so each site can take on only two states, and the extant sequences are binary sequences.

We define Rmin(M) to be the minimum number of recombinations needed to generate a set of sequences M from any ancestral sequence, allowing only one mutation per site over the entire history of the sequences. No polynomial-time algorithm to compute Rmin(M) is known, and a variation of the problem is known to be NP-hard. Song and Hein developed an algorithm that computes Rmin(M) exactly, but takes super-exponential time. There are also polynomial-time algorithms that compute Rmin(M) in special cases that arise frequently when the recombination rate is "modest" (Gusfield et al.).

Since there is no known efficient method to compute Rmin(M) exactly, several papers have considered efficient computation of *lower bounds* on Rmin. By far, the best method (balancing both time and accuracy) is encoded in a program called RecMin, written by Simon Myers, and based on "the haplotype lower bound". RecMin requires the setting of parameters which affect both the computation time used, and the quality of the result -- the bound and the computation time are both non-decreasing with increasing parameter values. We define the "optimal RecMin bound" as the lower bound that RecMin would produce if the parameters were set to their maximum possible values. In general, it is not feasible to use RecMin to compute the optimal RecMin bound.

In this talk we do several things. First, we introduce an algorithm that uses Integer Linear Programming to compute the Optimal RecMin Bound. Second, with ideas that dramatically speed up the ILP, we show through extensive experimentation using simulated and real data sets, that this approach computes the Optimal RecMin Bound faster than RecMin (when RecMin can compute it) and that it can efficiently compute the Optimal RecMin Bound for problem sizes considered large in current applications (where RecMin cannot compute the optimal bound). Third, we introduce additional ideas that allow the algorithm to find lower bounds even better than the Optimal RecMin Bound, and show through extensive experiments that this approach remains practical on problem sizes considered large today. Thus, we provide a practical method that is superior to all other known practical lower bound methods. Fourth, on the Upper Bound side, we present an efficient algorithm that, given sequences M, constructs a history that generates M using recombinations and one mutation per site. The number of recombinations used in the history provides an upper bound on Rmin(M), but the history itself is of independent interest. Fifth, and most importantly, through extensive experimentation with simulated and real data, we show that the computed upper and lower bounds are frequently very close, and are *equal* with high frequency for a surprisingly large range of data. Thus, with the use of a very effective lower bound and an efficient algorithm for computing upper bounds, this approach allows the efficient *exact* computation of Rmin(M) with high frequency in a large range of data. This is an important empirical result that is expected to have a very significant impact. Programs implementing the new algorithms discussed in this talk are available on the web.

Joint work with Yun Song and Yufeng Wu.

Thursday, March 31st

Statistical methods for constructing transcriptional regulatory networks
Dr. Biao Xing
Genentech

Transcriptional regulatory networks specify regulatory interactions among regulatory genes and between regulatory genes and their target genes. Uncovering transcriptional regulatory networks helps us to better understand the complex cellular processes and responses. We present two statistical methods for constructing transcriptional regulatory networks using gene expression data, promoter sequences, and transcription factor binding sites. Both start from identifying active transcription factors under each individual experiment, using a feature selection approach. The first method employs a naive normal mixture model to classify the transformed gene expression data for each transcription factor and uses the posterior probability of being in the `induced' or `repressed' classes to measure the strength of regulatory interactions. Evidence is averaged across different experiments to infer the overall regulatory network structures. The second method employs a causal inference model to model the causal effect of a transcription factor on its potential target genes. A nonparametric marginal structural model is built for every transcription factor and gene pair, which also allows controlling for potential confounding effects of other transcription factors on the expression level of the gene. The p-value associated with the causal parameter in each of these models is used to measure the regulatory interaction strength. These results are used to infer the overall regulatory interaction matrix and network structures. Simulation studies and analysis of yeast data have shown that both methods are capable of identifying significant transcriptional regulatory interactions and uncovering underlying regulatory network structures and both can be complementary to each other to maximize significant findings.

Joint work with Mark van der Laan.

Thursday, April 7th

Efficient Haplotype Analysis Tools
Dr. Eran Halperin
The International Computer Science Institute (ICSI)

Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases.
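To make the genotype/haplotype distinction concrete, the following Python sketch enumerates every haplotype pair consistent with one genotype. The site encoding (0 and 1 for the two homozygous states, 2 for a heterozygous site) and the function name are assumptions chosen for illustration; this is not the method used by HAP.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs that could produce `genotype`.

    Assumed encoding (hypothetical): 0 or 1 means both chromosomes carry
    that allele; 2 means the site is heterozygous, so the chromosomes differ.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, allele in zip(het_sites, choice):
            h1[site] = allele        # allele on the first chromosome
            h2[site] = 1 - allele    # the other allele on the second
        # Sort within the pair so (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


for pair in compatible_haplotype_pairs([0, 2, 2]):
    print(pair)
```

A genotype with k heterozygous sites (k at least 1) is consistent with 2^(k-1) unordered pairs, which is why the phase cannot be read off the genotype alone and has to be inferred.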

Experimental determination of a person's component haplotypes is an expensive and time-consuming process, and it is more attractive to first determine genotypes experimentally and then use them to compute haplotypes. This computation is not simple and is complicated by the fact that current sequencing technology often gives the DNA sequence with some missing nucleotide bases at some positions.

In this talk I will introduce efficient and accurate maximum likelihood based methods for the reconstruction of haplotype frequencies from noisy haplotype data or from genotype data. I will also give a high level description of HAP (www.icsi.berkeley.edu/~heran/HAP) - a haplotype phase reconstruction tool. Finally, I will mention some consequences of the use of HAP for the phasing of the genome wide dataset released by Perlegen Sciences.

The main part of the talk is joint work with Elad Hazan (Princeton).
HAP is a joint project with Eleazar Eskin (UCSD).

Thursday, April 14th

COMODE - A web application for constrained motif detection
Oliver Bembom
Division of Biostatistics, UC Berkeley

The COMODE algorithm developed by Keles et al. (2003) searches a set of unaligned DNA sequences for a shared transcription factor binding site (motif). It allows the user to incorporate prior biological knowledge about the transcription factor by specifying a set of constraints that the position weight matrix of any discovered motif must satisfy. These constraints can be very general, ranging from palindromicity to requiring a certain parameterized shape of the information content profile across the motif.
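As a rough illustration of the two quantities named above, the sketch below computes an information content profile for a position weight matrix and checks one common reading of palindromicity (the matrix equals its own reverse complement). The data layout, the function names, and the 2-bit maximum (which assumes a uniform background over A, C, G, T) are assumptions for illustration; the code does not reproduce COMODE or its constraint syntax.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def information_content(pwm):
    """Bits of information at each motif position.

    `pwm` is a list of dicts mapping each base to its probability.
    Assumes a uniform background, so each position carries at most 2 bits.
    """
    profile = []
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        profile.append(2.0 - entropy)
    return profile


def is_palindromic(pwm, tol=1e-9):
    """True if the matrix matches its reverse complement within `tol`."""
    width = len(pwm)
    return all(
        abs(pwm[j][b] - pwm[width - 1 - j][COMPLEMENT[b]]) <= tol
        for j in range(width)
        for b in BASES
    )


# Hypothetical 3-position motif: fixed A, uninformative middle, fixed T.
motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]
print(information_content(motif))
print(is_palindromic(motif))
```

In this toy motif the profile is high at both ends and zero in the middle, the kind of shape a user might want to require of a discovered motif.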

I will discuss a few extensions and modifications to this algorithm and then present a working prototype for an implementation of COMODE as a web application.

Joint work with Sunduz Keles, Mark van der Laan, Jack Lin.

Thursday, April 21st

Interpreting HIV mutations to predict response to antiretroviral therapy: The deletion/substitution/addition (DSA) algorithm for the estimation of direct causal effects
Maya Petersen
Division of Biostatistics, UC Berkeley

Our goal is to estimate the causal effect of mutations detected in the HIV strains infecting a patient on clinical virologic response to specific antiretroviral drugs and drug combinations. We consider the following data structure: 1) viral genotype, which we summarize as the presence or absence of each viral mutation considered by the Stanford HIV Database as likely to have some effect on virologic response to antiretroviral therapy; 2) drug regimen initiated following assessment of viral genotype (the regimen may involve changing some or all of the drugs in a patient's previous regimen); and, 3) change in plasma HIV RNA level (viral load) over baseline at twelve and twenty-four weeks after starting this regimen.

The effects of a set of mutations on virologic response are heavily confounded by past treatment. In addition, viral mutation profiles are often used by physicians to make treatment choices; we are interested in the direct causal effect of mutations on virologic outcome, not mediated by choice of other drugs in a patient's regimen. Finally, the need to consider multiple mutations and treatment history variables, as well as multi-way interactions between these variables, results in a high-dimensional modeling problem. This application thus requires data-adaptive estimation of the direct causal effect of a set of mutations on viral load under a particular drug, controlling for confounding and blocking the effect the mutations have on the assignment of other drugs. We developed such an algorithm based on a mix of the direct effect causal inference framework and the data adaptive regression deletion/substitution/addition (DSA) algorithm.

Joint work with Sunduz Keles, Mark van der Laan, Jack Lin.

Thursday, April 28th

Mapping Evolutionary Pathways of HIV-1 Drug Resistance using Conditional Selection Pressure
Professor Christopher Lee
Center for Bioinformatics, UCLA

Can genomics provide a new level of strategic intelligence about rapidly evolving pathogens? We have developed a new approach to measure the rates of all possible evolutionary pathways in a genome, as a conditional Ka/Ks network, and have applied this to several datasets, including clinical sequencing of 50,000 HIV-1 samples. These data reveal specific accessory mutations that greatly accelerate the evolution of multi-drug resistance, and other kinetic trap mutations that block it. Our analysis was highly reproducible in four independent datasets, and can decipher a pathogen's evolutionary pathways to multi-drug resistance even while such mutants are still rare.

Thursday, May 5th

Multiple Testing and Error Control in Graphical Model Selection
Dr. Mathias Drton
Department of Mathematics, UC Berkeley

Graphical models provide a framework for exploring and visualizing patterns of conditional (in-)dependence. Gaussian graphical models, in particular, have been applied to explore gene expression data with the aim of generating scientific hypotheses about the structure of gene regulatory networks. Model selection plays a central role in such exploratory data analysis. In this talk I will consider constraint-based graphical model selection, which proceeds by testing the model-defining conditional independence hypotheses. Viewing the approach as a multiple testing problem, I will show how to exploit recent methodological advances to address this problem.

Thursday, May 19th

Local False Discovery Rates
Professor Brad Efron
Department of Statistics, Stanford University

Large-scale simultaneous inference, with hundreds or thousands of hypothesis tests to consider at the same time, has become a fact of statistical life in our era of high-throughput scientific devices: microarrays, proteomic machines, time of flight spectroscopy, flow cytometry, and a variety of MRI scanners. Recent literature has concentrated on selecting non-null cases from among a large majority of nulls, with the control of Type I error, or size, as the primary objective. Local false discovery rates, a version of Benjamini and Hochberg's tail area FDR procedure, employ empirical Bayes ideas to carry out both size and power calculations. I will discuss a methodology that does this using a minimum of frequentist or Bayesian assumptions.
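For orientation, here is a minimal Python sketch of the standard Benjamini and Hochberg step-up rule, the tail area procedure the abstract refers to. It is not the local false discovery rate methodology of the talk, which adds empirical Bayes modeling. The function name, the level `q`, and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one rejection flag per hypothesis, in the input order.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most q * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten simultaneous tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p, q=0.05)
print(sum(flags))  # 2
```

With these ten values only the two smallest p-values fall under their rank-scaled thresholds, so two hypotheses are rejected at level 0.05.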