PB HLTH 292, Section 020
Statistics and Genomics Seminar

Spring 2007



Thursday, January 18th

Bayesian network analysis of MAPK connectivity in cancer
Dr. Sach Mukherjee
Department of Statistics, UC Berkeley

The Mitogen-Activated Protein Kinase or MAPK pathway is a biochemical pathway which plays a central role in cellular signaling. Aberrant functioning of the pathway is heavily implicated in a number of cancers. Although the MAPK system is one of the most studied cellular signaling modules, cancer-specific pathway features remain poorly characterized, and relatively little is known regarding isoform-specific connectivity. I will present a Bayesian network approach to learning pathway connectivity in this setting, using Markov chain Monte Carlo to infer dependency graph structure from a combination of experimental data and biological prior knowledge. I will discuss some of the methodological aspects of this problem and go on to present results obtained from breast cancer data. I will also discuss some of the (many) caveats of this type of analysis and highlight opportunities and challenges for future research.


Monday, January 29th

Reproducibility & sensitivity of Affymetrix exon arrays
Dr. Derek Chiang
Broad Institute, Cambridge, MA

Affymetrix exon arrays were designed to interrogate each predicted coding region in the human genome with 3 or 4 oligonucleotides. While the expanded coverage of exon arrays enables genome-wide surveys of intragenic expression changes, the lower probe density may reduce the accuracy of probeset expression level estimates. Preliminary data from biological replicates highlight the need to filter non-responsive and cross-hybridizing probes. I will describe a changepoint analysis procedure to detect intragenic differential expression. When calibrating this procedure with characterized cell lines, I detected low signal-to-noise ratios for a set of positive control transcripts. Finally, I will discuss some potential strategies to increase the sensitivity and specificity of detecting intragenic expression changes with exon arrays.


Thursday, February 15th

Reconstructing ancestry blocks in admixed individuals using high density genotype data
Professor Hua Tang
Department of Genetics, Stanford University

A chromosome in an individual of recently admixed ancestry resembles a mosaic of chromosomal segments, or ancestry blocks, each derived from a particular ancestral population. We consider the problem of inferring ancestry along the chromosomes in an admixed individual and thereby delineating the ancestry blocks. Using a simple population model, we infer gene-flow history in each individual. Compared with existing methods, which are based on a hidden Markov model, the Markov-hidden Markov model (MHMM) we propose has the advantage of accounting for the background linkage disequilibrium (LD) that exists in ancestral populations. When there are more than two ancestral groups, we allow each ancestral population to admix at a different time in history. We use simulations to illustrate the accuracy of the inferred ancestry as well as the importance of modeling the background LD; not accounting for background LD between markers may mislead us to false inferences about mixed ancestry in an indigenous population. The MHMM makes it possible to identify genomic blocks of a particular ancestry by use of any high-density single-nucleotide-polymorphism panel. One application of our method is to perform admixture mapping without genotyping special ancestry-informative-marker panels.


Thursday, February 22nd

Mathematical and Computational Aspects of Meiotic Recombination
Dr. Yun S. Song
Department of Computer Science, UC Davis

Meiotic recombination creates a mosaic genome from the two homologous genomes of an individual. In addition to being a major mechanism that can create new genetic types in a population, recombination has far-reaching consequences on the genealogy of chromosomes; as a result of recombination, different regions in a chromosome can have different evolutionary histories. In this talk, I will use algebraic techniques based on diffusion processes to address a couple of statistical problems related to meiotic recombination. I will then sketch some algorithmic work on reconstructing parsimonious evolutionary histories with recombination.


Thursday, March 1st

Estimating a variable length Markov model from biological sequence data
Dr. Daniel Dalevi
Lawrence Berkeley National Laboratory

Stationary fixed order Markov models are often used for the analysis of biological sequence data such as DNA and amino acid sequences. They are also a common choice when studying clumping of essential genes and when deriving expected distances in gene order problems. One drawback of these models is that the number of parameters grows very rapidly when increasing the order and they may be hard to fit when data is limited. One way to overcome this problem is to use variable length Markov models, VLMM, where the order depends on the context. This class of models has been applied to various problems in bioinformatics, linguistics and Internet applications. In this presentation I will talk about how a VLMM can be estimated efficiently from data using the method of maximal fluctuation.
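
As an illustration of the parameter growth described in this abstract, here is a minimal sketch (not taken from the talk) that counts the free transition probabilities of a stationary fixed-order Markov model. It assumes an alphabet of size k and order m, giving k^m (k - 1) free parameters. The function name free_parameters is hypothetical.

```python
def free_parameters(alphabet_size: int, order: int) -> int:
    """Count free transition probabilities in a fixed-order Markov model.

    Assumption for illustration: each of the alphabet_size**order contexts
    carries one distribution over alphabet_size symbols, and each such
    distribution has alphabet_size - 1 free parameters (they sum to one).
    """
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    if order < 0:
        raise ValueError("order must be non-negative")
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # Compare DNA (4 letters) with amino acids (20 letters) for orders 0..5.
    for order in range(6):
        dna = free_parameters(4, order)
        protein = free_parameters(20, order)
        print(f"order {order}: DNA {dna}, amino acids {protein}")
```

Under these assumptions, an order-5 model for DNA already has 3072 free parameters and an order-2 model for amino acid sequences has 7600, which is why fitting becomes hard when data are limited.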


Thursday, March 8th

Range expansion and biological introductions in marine ecosystems: the example of Cyclope neritea (Nassariidae) along the French coasts
Dr. Benoit Simon-Bouhet
Department of Integrative Biology, UC Berkeley

A species' range is influenced by both biotic and abiotic factors. Among these, two have a growing influence: (i) temperature (in relation to global change) and (ii) human activities responsible for habitat alterations, pollution and biological introductions.

We studied the recent evolution of the species range of Cyclope neritea, a marine Nassariid gastropod discovered thirty years ago along the French Atlantic and English Channel coasts, at the edge of its previously recognized native range. The particular geographical location of the newly recorded populations made a natural northward range expansion in response to global warming seem likely. However, given habitat discontinuities and the low dispersal abilities of C. neritea, we had to consider the alternative hypothesis of a recent introduction related to human-mediated exchanges of cultivated species to explain this sudden range expansion.

By studying mitochondrial and nuclear polymorphisms, we showed a strong phylogeographic structure in the native range of C. neritea, as expected for a direct-developing species. Thanks to this structure, we demonstrated that the appearance of new populations near the range limit of the species was mainly caused by massive and recurrent human-mediated introductions of individuals coming from several highly divergent sources in the native range. Nevertheless, we could not exclude the possibility of a historical presence of cryptic populations (at low density) originating from natural spread that could now be reinforced by the temperature increase in the eastern Atlantic. The recent evolution of the range of C. neritea is thus an example of the synergistic action of both natural and human-mediated processes, acting on long and short time scales.

Keywords: Species range, natural expansion, biological introductions, coastal areas, phylogeography, gene flow, mollusk.


Thursday, March 15th

An Atlas of Spatial Patterns of Gene Expression in Drosophila melanogaster Imaginal Discs
Cyrus Harmon
Department of Molecular and Cell Biology, UC Berkeley

Regulation of gene expression controls the development of insect imaginal discs and many genes involved in developmental processes have been shown to have interesting, non-trivial spatial expression patterns in imaginal discs. We seek to build an atlas of the spatial extent of gene expression of a large number of genes in Drosophila melanogaster imaginal discs. We present a data acquisition pipeline for mass-isolation of imaginal discs, for using gene expression microarrays to identify candidate genes, for imaging discs stained with labeled probes and for the computational analysis of these patterns. Our methods automatically learn imaginal disc shapes from a small number of manually segmented training examples and subsequently automatically align, extract and score new images and build maps of expression from these images. Our methods enable us to search a database of patterns for a pattern similar to a given query image and to cluster both the genes and the pixels that make up the maps, thereby identifying genes that have similar patterns, and regions of the tissues of interest that have similar profiles of gene expression across a large number of genes.


Thursday, March 22nd

Genetic Epidemiology Initiatives in the Northern California Childhood Leukemia Study
Dr. Anand Chokkalingam
Division of Epidemiology, UC Berkeley

Childhood leukemia is the leading cancer among children under age 15. Causes for the disease are largely unknown. Evidence of heritability from studies of twins and families, in addition to its early age of onset, suggests a role of genetic susceptibility factors. The Northern California Childhood Leukemia Study (NCCLS), initiated in 1995, is a population-based case-control study encompassing 35 counties in Northern and Central California. By the end of anticipated subject accrual in 2009, the study will have enrolled ~1000 cases and ~1550 controls. In addition to collection of extensive demographic, dietary, environmental, and other epidemiologic data via questionnaire, DNA specimens have been collected via buccal cytobrushes from 98% of subjects, including cases, controls and their biological mothers. Large-scale Illumina-based genotyping of a subset of these DNAs has recently commenced, including 1536 single nucleotide polymorphisms (SNPs) in ~200 genes involved in pathways highly relevant to leukemia and cancer in general: DNA repair, immune function, folate metabolism, oxidative stress, and xenobiotic metabolism and transport. We have taken a haplotype-tagging approach to select these SNPs. In addition, we are genotyping 96 ancestry informative markers to estimate genetic ancestry in order to quantify and adjust for potential population stratification. The same 1536-plex panel will eventually be applied to all subjects in the study, as well as biological fathers of cases (total ~6000 subjects). Through these efforts, and with replication in other comparable studies participating in a new international childhood leukemia consortium, we hope to identify specific genes and biological pathways involved in the etiology of childhood leukemia.


Thursday, April 5th

Resampling Techniques for Protein Structure Prediction using Rosetta
Benjamin Blum
Computer Science Division, UC Berkeley

Rosetta, developed by the Baker lab at the University of Washington, is perhaps the leading method for ab initio protein structure prediction today. The core algorithm is simply Monte Carlo search for the global minimum of a carefully-tuned energy function containing both physical and statistical components. In order to predict structure for a target protein sequence, Rosetta generates a large population of local minima of the energy function (referred to as "decoys"). From this population, a single prediction is chosen by either energetic or clustering-based criteria. In this talk, I will describe techniques for learning an approximation to the energy landscape from an initial Rosetta decoy set in the area "around" the native structure. The fitted energy landscape can then be used to guide further rounds of Rosetta sampling. This general approach, as stated, is confronted by several serious problems: the landscape is very high-dimensional and very irregular. We side-step these issues by attempting to identify structural features within the decoy population that best account for energy differences and enriching them in further Rosetta search. This technique has proven successful on a range of small proteins.


Thursday, April 19th

Nucleotide bias among recent substitutions in the human genome
Professor Katherine S. Pollard
Genome Center and Department of Statistics, UC Davis

We scanned the human genome, counting the number of fixed substitutions that occurred on the human lineage since divergence from the common ancestor with the chimpanzee, and determined what fraction are AT to GC (weak-to-strong). For windows of 100bp containing many (5-11) substitutions, there is a remarkable excess of weak-to-strong "biased" substitutions over what would be expected under a null model based on global or local rates. Examination of individual substitutions as members of biased clusters revealed that unexpected biased clustered substitutions (UBCS) are common near the telomeres of all autosomes but not the sex chromosomes. Repeating the analysis on independent substitutions in the chimp lineage, we found that human and chimp orthologous regions show a remarkable similarity in the shape and magnitude of their respective UBCS maps, suggesting that a relatively stable force leads to clustered bias. The strong and stable signal near telomeres may have participated in the evolution of isochores. One exception to the UBCS pattern found in all autosomes is chromosome 2, which shows a UBCS peak mid-chromosome, mapping to the fusion site of two ancestral chromosomes. This provides evidence that the fusion occurred as recently as 740 thousand years ago and no more than about 3 MYA. No biased clustering was found in SNPs, suggesting that clusters of biased substitutions are selected from mutations. UBCS is strongly correlated with male recombination rates, which explains the lack of UBCS signal on chromosome X. Female recombination rates are entirely unrelated to the residual UBCS signal unexplained by male recombination. Finally, regions of extreme bias are enriched for genes. These observations support the hypothesis that Biased Gene Conversion (BGC), specifically in the male germline, played a significant role in the evolution of the human genome.
It is possible that BGC is a male reproductive strategy, increasing rates of genetic drift and accelerating evolution overall.


Thursday, April 26th

The Pendulum Model: Compositional Dynamics of the Eubacterial Genomes
Dr. Jun Yu
Beijing Genomics Institute, Chinese Academy of Sciences

For eubacterial genomes, the canonical genetic code is neither a product of "frozen accident" nor a result of co-evolution with metabolic pathways; it is optimized to utilize mutational mechanisms that create dynamic changes in protein sequences and thus their physicochemical properties when GC and purine (AG) contents vary. Governed by the alpha subunit of DNA polymerase III, the GC content of eubacterial genomes changes from 20% to 80% or splits into half: above or below 50%. Eleven and fifteen amino acids (precisely half of the Table each) are GC-sensitive and AG-sensitive, respectively. As the "GC Pendulum" moves toward high-GC or low-GC contents, the corresponding dominant amino acids change drastically, providing either robustness against mutations or ample protein sequence diversity, respectively, for organisms to achieve their best fitness. The "AG Pendulum" swings wider as the genomic GC contents are lower because all the amino acids in the low-GC group are AG-sensitive, providing better amino acid diversity. Transcript-centric compositional dynamics also come into play as genomic GC contents are high, creating a positive GC content gradient along the transcripts.

A Brief Biography of Dr. Jun Yu
Dr. Jun Yu is currently a professor and Associate Director of Beijing Genomics Institute, Chinese Academy of Sciences (CAS). He has joint appointments and supervises graduate students at the Institute of Computer Sciences (CAS), Zhejiang University, and Chinese Agricultural University. Dr. Yu obtained his B.S. degree in biochemistry from Jilin University in 1983 and Ph.D. degree in biomedical sciences from New York University Medical School in 1990. He worked as a Research Assistant Professor at NYU from 1990 until he joined the University of Washington Genome Center in 1993. Dr. Yu's primary research interests include genome biology and bioinformatics. He has led many major genome projects in China, including the Human Genome Project (the Chinese effort; known as the 1% Genome Project), the Superhybrid Rice Genome Project, the Silkworm Genome Project, and the Chicken Genome Diversity Project, which all resulted in high-impact publications in major international journals, including Science, Nature, and PLoS Biology. He has published over 100 scientific papers and a few dozen books and book chapters. Dr. Yu has won numerous academic awards, including the Award for Outstanding Science and Technology Achievements (Group, 2003, Chinese Academy of Sciences), Scientific Leader of the Year, 2002 (The first "SA50" Award by the journal Scientific American), "Qiushi" Award for Scientific Achievement (Group, 2002, QiuShi Science and Technology Foundation, Hong Kong), 100-Talent Plan (Chinese Academy of Sciences, 2002-2005), Outstanding Young Investigator Award (B Class, the Natural Science Foundation, 1999-2002), American Foundation for Urological Diseases Ph.D. Research Scholar (1991-1993), and China-US Biology Examination and Application (CUSBEA, 1983).


Thursday, May 3rd

Evolutionary analysis of the relaxin peptide family and their receptors
Professor Terry Speed
Department of Statistics, UC Berkeley

Relaxin is a peptide hormone with an important reproductive function, and is a member of a broader family of peptide hormones with diverse reproductive and non-reproductive functions. A research group at the Howard Florey Institute in Melbourne has been studying this hormone for many years, so when a student there, Tracey Wilkinson, was looking for a thesis topic in applied bioinformatics, the title of this talk seemed to be a potentially fruitful one. That turned out to be the case, and this talk will summarize some of Tracey's many interesting results. This is a wonderful time to be studying gene families, and I hope to convince you that pure and applied molecular evolution is a fascinating area of genome science.

Joint work with Tracey Wilkinson.


TO BE RESCHEDULED

Molecular basis and consequences of phase variation in Vibrio cholerae
Professor Fitnat Yildiz
Department of Environmental Toxicology, UC Santa Cruz

V. cholerae, the causative agent of the disease cholera, is a natural inhabitant of aquatic environments. The pathogen causes periodic, seasonal cholera outbreaks in regions where the disease is endemic and can spread worldwide in pandemics. The ability of V. cholerae to cause epidemics is linked to its survival in aquatic habitats. The capacity of V. cholerae to undergo a phase variation event that results in the generation of two morphologically different colony variants, termed smooth and rugose, is predicted to be important for the survival of the pathogen in natural aquatic habitats. With the availability of the completely sequenced genome of V. cholerae, we employed a genome-wide approach to understand the molecular basis and molecular consequences of smooth-to-rugose phase variation. To this end, we performed whole genome expression profiling studies of smooth and rugose phase variants and of the regulatory mutants. Analysis of the expression data revealed that "rugosity" and "smoothness" are modulated by the second messenger cyclic di-guanylic acid (c-diGMP) and controlled by a complex hierarchy of positive and negative transcriptional regulators. In this talk, I will describe genetic and genomic characterization of networks controlling phenotypic variation in V. cholerae.