PB HLTH 292, Section 020
Statistics and Genomics Seminar
Thursday, January 22nd
WikiPathways: Pathway Editing for the People
Dr. Bruce R. Conklin
Gladstone Institute of
Cardiovascular Disease, UCSF
To facilitate the contribution and maintenance of pathway information by the biology community, we established WikiPathways (www.wikipathways.org). WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. WikiPathways thus presents a new model for pathway databases that enhances and complements ongoing efforts. Building on the same MediaWiki open source software that powers Wikipedia, we added a custom graphical pathway editing tool and integrated databases covering major gene, protein, and small-molecule systems. The familiar web-based format of WikiPathways greatly reduces the barrier to participate in pathway curation. More importantly, the open, public approach of WikiPathways allows for broader participation by the entire community, ranging from students to senior experts in each field. This approach also shifts the bulk of peer review, editorial curation, and maintenance to the community.
Thursday, January 29th
Full Transcriptome Analysis using the Illumina Genome Analyzer
Dr. Gary P. Schroth
The Illumina Genome Analyzer is a high throughput DNA sequencing platform that routinely generates several billion bases of very high quality sequence information from a large variety of genomic applications. We will show examples of how the instrument is being used for a large variety of applications in genome biology including eukaryotic and prokaryotic resequencing, SNP discovery, gene expression analysis, ChIP-SEQ, genome-wide mapping of DNA methylation sites, and miRNA discovery and analysis. We will present details of how the mRNA-Seq assay is being used to quantify gene expression levels with high specificity over a broad dynamic range. In addition to quantifying expression levels, this data is also being used to characterize thousands of novel alternative transcripts in the human transcriptome. We will also discuss the development of new software and analysis tools that can help users glean biological meaning from the massive amounts of data produced by the system.
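The abstract does not give the quantification formula; one widely used normalization for this kind of mRNA-Seq count data (an illustration, not necessarily the speaker's exact method) is RPKM, reads per kilobase of transcript per million mapped reads, which makes expression values comparable across transcripts of different lengths and across runs of different depths:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kb = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kb / millions

# e.g. 500 reads on a 2 kb transcript in a 10M-read run
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Dividing out length and depth is what gives the assay its broad dynamic range in practice: both highly and weakly expressed transcripts land on a common scale.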
Thursday, February 5th
From expression profiling to putative master regulators
Professor Terry Speed
Department of Statistics, UC Berkeley
People conduct microarray gene expression experiments or studies in order to find out which genes are regulated, e.g. as a result of a treatment, or over time. The genes so identified can usually be validated by qRT-PCR. Of equal or even greater interest are the regulators of the genes found to be activated/differentially expressed. How do we identify these regulators, and once we have found some candidates and regard them as hypotheses, how are these hypotheses tested?
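One standard route from a differentially expressed gene list to candidate regulators (a sketch of a common approach, not necessarily the one discussed in the talk) is to test whether the known targets of each transcription factor are over-represented among the differentially expressed genes, e.g. with a hypergeometric test:

```python
from scipy.stats import hypergeom

def regulator_enrichment_p(n_genes, n_de, n_targets, n_de_targets):
    """P-value for seeing at least n_de_targets of a regulator's
    n_targets among n_de differentially expressed genes, out of
    n_genes total, under a hypergeometric null of no association."""
    return hypergeom.sf(n_de_targets - 1, n_genes, n_targets, n_de)

# hypothetical numbers: 10,000 genes, 200 DE, regulator with 100 targets
p_enriched = regulator_enrichment_p(10000, 200, 100, 10)  # well above chance
p_chance = regulator_enrichment_p(10000, 200, 100, 2)     # near expected overlap
```

A small p-value makes the regulator a candidate hypothesis, which can then be tested directly, e.g. by perturbing the regulator and profiling again.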
Thursday, February 12th
Preliminary Transcriptome Analysis of a Trio: Mother, Father, Daughter
Dr. Hugh Rienhoff
Unknown syndromes and sporadic cases with suspected genetic etiology are the most difficult cases to diagnose and manage, and yet the prevalence of such cases numbers in the hundreds of thousands. The identification of mutations causing novel sporadic genetic disease is an open-ended search guided by clinical similarity to known Mendelian diseases and the application of standard global genetic interrogations such as karyotyping, comparative genomic hybridization and genomic copy number variation. To extend the methodologies available to these patients we have pioneered the use of RNA sequencing by examining the RNA of white blood cells in a trio -- an affected daughter and her two parents. The RNA sequence has been complemented by low-pass whole genome sequencing and a 1.4M SNP chip for cross-platform validation to identify significant insertions and deletions. In toto, the dataset is rich in known and unsuspected phenomenology as well as offering hypotheses to test.
Thursday, February 19th
Analysis of 2D-DIGE protein expression data
Professor Elmer Fernandez
Catholic University of Cordoba
Nowadays it is possible to obtain a whole view of the proteome at a glance using high-throughput techniques such as 2D Differential In-Gel Electrophoresis (2D-DIGE). This technique produces large amounts of data with a complex structure that must be analyzed by means of appropriate analytical techniques. The primary goal of this kind of experiment is the detection of proteins showing a statistically significant difference in expression under different experimental conditions, or the identification of potential biomarkers that could be used for early diagnosis. In this talk we will show the fundamentals of 2D-DIGE technology as well as some statistical methods used to deal with this kind of data.
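As a concrete illustration of the kind of analysis mentioned above (a generic sketch, not necessarily the speaker's methods): spot volumes are typically log-transformed, compared spot by spot between conditions, and corrected for multiple testing, e.g. with a Welch t-test per spot and Benjamini-Hochberg FDR control:

```python
import numpy as np
from scipy.stats import ttest_ind

def dige_spot_tests(control, treated, alpha=0.05):
    """Welch t-test per spot on log-volumes, with Benjamini-Hochberg FDR.
    control, treated: (n_spots, n_gels) arrays of log-transformed volumes.
    Returns per-spot p-values and a boolean significance mask."""
    pvals = np.array([ttest_ind(c, t, equal_var=False).pvalue
                      for c, t in zip(control, treated)])
    m = len(pvals)
    order = np.argsort(pvals)
    # BH step-up: compare sorted p-values to the alpha * rank / m line
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    sig = np.zeros(m, dtype=bool)
    sig[order[:k]] = True
    return pvals, sig

# toy example: 5 spots, 6 gels per condition; spot 0 is truly shifted
rng = np.random.default_rng(0)
control = rng.normal(0.0, 0.1, size=(5, 6))
treated = rng.normal(0.0, 0.1, size=(5, 6))
treated[0] += 5.0
pvals, sig = dige_spot_tests(control, treated)
```

Real 2D-DIGE designs add complications this sketch ignores, notably the internal pooled standard on each gel and dye-swap effects.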
Thursday, February 26th
The Importance of Race/Ethnicity & Genetics in Biomedical Research and Clinical Practice; Lessons from the Genetics of Asthma in Latino Americans (GALA) Study
Professor Esteban Burchard
Department of Medicine, UCSF
A debate has recently arisen over the use of racial classification in medicine and biomedical research. In particular, with the completion of a rough draft of the Human Genome, some have suggested that racial classification may not be useful for biomedical studies since it reflects "a fairly small number of genes that describe appearance,"1 and that "there is no basis in the genetic code for race."2 Based in part on these conclusions, some have argued for the exclusion of racial and ethnic classification from biomedical research.3 In the United States, race and ethnicity have been a source of discrimination, prejudice, marginalization and even subjugation. Excessive focus on racial/ethnic differences runs the risk of undervaluing the great diversity that exists among individuals within groups. However, this risk needs to be weighed against the fact that in epidemiologic and clinical research, racial and ethnic categories are useful for generating and exploring hypotheses on environmental and genetic risk factors and interactions between risk factors for important medical outcomes. Erecting barriers to the collection of information such as race and ethnicity may provide protection against the aforementioned risks, however it will simultaneously retard progress in biomedical research and limit effectiveness in clinical decision-making.
Today I hope to convey the importance of Race & Ethnicity in Biomedical, Genetic and Clinical Research. I will begin by providing fundamental evidence of genetic differences between racial and ethnic populations. I will then demonstrate racially-specific differences in genetic risk for diseases including Alzheimer's Disease and ethnic-specific differences in drug responsiveness. Finally, I will present data from the ongoing Genetics of Asthma in Latino Americans (GALA) Study.
1. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921.
2. Venter C, quoted in N. Angier, "Do Races Differ? Not Really, Genes Show." New York Times, 2000.
3. Schwartz RS. Racial profiling in medical research. N Engl J Med 2001; 344:1392-3.
Thursday, March 5th
Methods for Allocating Ambiguous Short-Reads
Department of Statistics, UC Berkeley
With the rise in prominence of biological research using new short-read DNA sequencing technologies comes the need for new techniques for aligning and assigning these reads to their genomic location of origin. Until now, methods for allocating reads which align with equal or similar fidelity to multiple genomic locations have not been model-based, and have tended to ignore potentially informative data. Here, I will demonstrate that existing methods for assigning ambiguous reads can produce biased results. I will also present a new method for allocating ambiguous reads to the genome, developed within a framework of statistical modeling, which shows promise in alleviating these biases, both in simulated and real data.
This is joint work with my advisor, Terry Speed, and Doron Lipson from Helicos Biosciences.
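A common model-based baseline for the allocation problem described above (a generic sketch; the talk's method is more elaborate) is an EM scheme: fractionally assign each ambiguous read to its candidate loci in proportion to current abundance estimates, then re-estimate abundances from the fractional counts, and iterate:

```python
import numpy as np

def allocate_multireads(read_locs, n_loci, n_iter=100):
    """EM allocation of multi-mapping short reads.
    read_locs: list of lists; read_locs[r] = candidate loci for read r.
    Returns estimated expected read counts per locus."""
    theta = np.full(n_loci, 1.0 / n_loci)  # locus abundance estimates
    for _ in range(n_iter):
        counts = np.zeros(n_loci)
        for locs in read_locs:
            w = theta[locs]
            counts[locs] += w / w.sum()    # E-step: fractional assignment
        theta = counts / counts.sum()      # M-step: re-estimate abundances
    return theta * len(read_locs)

# 8 reads unique to locus 0, 2 unique to locus 1, 5 ambiguous between them
reads = [[0]] * 8 + [[1]] * 2 + [[0, 1]] * 5
counts = allocate_multireads(reads, n_loci=2)
```

Unlike discarding multireads or splitting them equally, this uses the unique reads as evidence, which is the sense in which naive schemes ignore informative data; here the 5 ambiguous reads converge to a 4:1 split matching the 8:2 unique-read evidence.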
Thursday, March 12th
A Method for the Analysis of Longitudinal Multi-factorial Microarray Data
Professor Wing Hung Wong
Department of Statistics, Stanford University
Time-course microarray experiments are capable of capturing the dynamic profiles of genomic response to multiple experimental factors. Analytic methods are needed to simultaneously handle the time-course (longitudinal) structure and multi-factorial structure in the data. We will introduce a robust non-parametric ANOVA (NANOVA) method for the analysis of multi-factor effects while accounting for multiple testing and the non-normal nature of microarray data. To incorporate time-course measurements, factor effects are evaluated based on information pooled across time. The proposed method can effectively extract gene-specific response features and provide quantitative information about the expression pattern of a gene. It has broader applicability to longitudinal factorial data in general and can be extended to cross-sectional time-course data. This method was applied to four data sets from a large-scale clinical study of burn injury. Our analysis identified age-related and gender-related burn-responsive genes and characterized their response features. T-cell and B-cell related immune systems, the insulin-related signaling pathway and various metabolic processes were found to be differentially perturbed in pediatric and adult burn patients. Gender differences in burn injury were detected in several sex chromosome genes. We also assessed age and burn effects across four tissues (blood, skin, muscle and fat) and identified muscle as the most differentially perturbed tissue in burn-injured children compared with adults. Finally, our analysis of the impact of age on adult survivability after burn suggests several metabolic processes as potential contributors to the increasing death rate in older burn patients.
Joint work with Baiyu Zhou.
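The abstract does not spell out the NANOVA test statistic; as an illustration of a robust, rank-based one-factor test on a single gene, with measurements pooled across time points as described above, one could use a Kruskal-Wallis test (a stand-in, not the speaker's actual method):

```python
import numpy as np
from scipy.stats import kruskal

def nonparam_factor_p(expr, factor):
    """Rank-based (Kruskal-Wallis) test of one factor, e.g. age group
    or gender, on one gene's expression pooled across time points."""
    groups = [expr[factor == level] for level in np.unique(factor)]
    return kruskal(*groups).pvalue

# one gene, 10 pooled measurements per group, clearly separated groups
expr = np.concatenate([np.arange(10.0), np.arange(10.0) + 100.0])
group = np.array(["A"] * 10 + ["B"] * 10)
p = nonparam_factor_p(expr, group)
```

Applied genome-wide, the per-gene p-values would then be corrected for multiple testing, which is the other requirement the abstract highlights.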
Thursday, April 2nd
Analysis of "DMET Plus" - a customized genotyping panel for simultaneous assessment of a wide variety of polymorphisms involved in absorption, distribution, metabolism and excretion of compounds in humans
Dr. Simon Cawley
Director, Algorithms & Data
Exploring the genetic determinants of variation in human response to drugs requires extensive monitoring of polymorphisms in genes involved in Absorption, Distribution, Metabolism and Excretion (collectively referred to as ADME). The current database of polymorphisms known to have functional effects includes a broad variety of polymorphism types, including SNPs, insertion/deletion events and variations in chromosome copy number. Some of the markers of interest involve more than two alleles, and many have proximal secondary polymorphisms. This diversity of polymorphism types makes it technically very challenging to develop a unified approach capable of high-throughput determination of all the underlying types. Affymetrix recently released DMET Plus, a solution enabling simultaneous interrogation of all the key types of polymorphism of interest for pharmacogenetic studies. Development of the array and assay required an analysis approach that was general enough to handle the diverse collection of polymorphism types, that delivered highly reliable genotype calls, and that could operate under the constraint that genotype calls be made based on analysis of a single sample at a time. This talk will focus on the analytical challenges that arose during development and will describe the genotype calling methods put in place for the final product.
Thursday, April 9th
Statistical Analysis of Histone Modifications
Professor Ping Ma
Department of Statistics, University of Illinois, Urbana-Champaign
Gene activities in eukaryotic cells are concertedly regulated by
transcription factors and chromatin structure. The basic repeating
unit of chromatin is the nucleosome, an octamer containing two copies
each of four core histone proteins. Recent high throughput studies
have begun to uncover the global regulatory role of nucleosome
positioning and modifications. While nucleosome occupancy in promoter
regions typically occludes transcription factor binding, thereby
repressing global gene expression, the mechanism of histone
modification is more complex. Histone tails can be modified in
various ways, including acetylation, methylation, phosphorylation,
and ubiquitination. Even the regulatory mechanism of histone
acetylation, the best characterized modification to date, is still
not fully understood.
In this talk, I will present some statistical methods to analyze
genome-wide histone modification datasets.
Thursday, April 16th
Algorithms for structure prediction and concentration estimation of alternatively spliced isoforms
Professor Angel Rubio
Centro de Estudios e
Investigaciones Tecnicas de Gipuzkoa (CEIT), University of Navarra
Exon and exon+junction microarrays are promising tools for studying alternative splicing. Current analytical tools applied to these arrays lack two relevant features: the ability to predict the structure of unknown spliced isoforms and the ability to quantify the concentration of known and unknown isoforms. We have developed an algorithm that is able to (1) estimate the number of different transcripts expressed under several conditions, (2) predict the precursor mRNA splicing structure and (3) quantify the transcript concentrations, including unknown forms. I will present results for real and simulated data. In addition, we have preliminary results from a new version that exploits the redundancy of the probes in the Affymetrix exon (or exon+junction) arrays.
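The concentration-estimation half of the problem can be sketched, under the simplifying assumption that the isoform structures are already known (the speaker's algorithm also infers unknown structures), as nonnegative least squares on a probe-by-isoform incidence matrix:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical design: rows are probes (exons and junctions), columns are
# isoforms; entry 1 means the probe interrogates that isoform.
A = np.array([[1., 1.],   # exon shared by both isoforms
              [1., 0.],   # exon unique to isoform 1
              [0., 1.],   # junction unique to isoform 2
              [1., 1.]])  # another shared exon
y = A @ np.array([3.0, 5.0])   # noiseless probe intensities

conc, resid = nnls(A, y)       # estimated isoform concentrations
```

With real, noisy intensities the fit is only approximate, and inferring the matrix A itself from data across conditions is the harder structure-prediction problem the abstract refers to.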
Thursday, April 23rd
A Novel Topology for Representing Protein Folds
Professor Mark Segal
Division of Biostatistics, UCSF
Various topologies for representing three dimensional protein structures have been advanced for purposes ranging from prediction of folding rates to ab initio structure prediction. Examples include relative contact order, Delaunay tessellations, and backbone torsion angle distributions. Here we introduce a new topology based on a novel means for operationalizing three dimensional proximities with respect to the underlying chain. The measure involves first interpreting a rank-based representation of the nearest neighbors of each residue as a permutation, then determining how perturbed this permutation is relative to an unfolded chain. We show that the resultant topology provides improved association with folding and unfolding rates as determined for a set of two-state proteins under standardized conditions. Furthermore, unlike existing topologies, the proposed geometry exhibits fine scale structure with respect to sequence position along the chain, potentially providing insights into folding initiation and/or nucleation sites.
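A toy rendering of the idea above (the exact perturbation measure used in the talk may differ): rank each residue's neighbors by 3-D distance, compare with the ranking induced by sequence separation alone, which is the unfolded-chain reference, and average a normalized Spearman-footrule distance between the two rankings:

```python
import numpy as np

def neighbor_perturbation(coords):
    """Average normalized footrule distance between each residue's 3-D
    nearest-neighbor ranking and the ranking implied by sequence
    separation alone (the unfolded-chain reference)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    idx = np.arange(n)
    d3 = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    seq = np.abs(idx[:, None] - idx[None, :])
    m = n - 1
    score = 0.0
    for i in range(n):
        others = np.delete(idx, i)
        r3 = np.argsort(np.argsort(d3[i, others]))   # ranks by 3-D proximity
        rs = np.argsort(np.argsort(seq[i, others]))  # ranks by chain separation
        score += np.abs(r3 - rs).sum() / (m * m / 2.0)  # footrule, normalized
    return score / n

straight = [[i, 0, 0] for i in range(6)]          # fully extended chain
hairpin = [[0, 0, 0], [1, 0, 0], [2, 0, 0],
           [2, 1, 0], [1, 1, 0], [0, 1, 0]]       # chain folded back on itself
```

An extended chain scores zero because 3-D proximity and chain separation agree exactly; folding brings sequence-distant residues close, perturbing the permutation and raising the score.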
Thursday, April 30th
Statistical methods for high-throughput phenotypic studies
Professor Jenny Bryan
Department of Statistics, University of British Columbia
Researchers in functional genomics can now obtain quantitative
phenotypes for large collections of organisms, each of which is
characterized by the deletion of an individual gene. By observing the
phenotypic consequence of deletion across diverse conditions, we
obtain specific information on the functional roles of the disrupted
gene. The repertoire of massively parallel perturbations being
applied to live cells and organisms extends well beyond the simple
knockout or knockdown of single genes. Recent examples include other
genomic modifications, such as the insertion of alternative regulatory
regions, and treatment with large libraries of well-characterized and
novel compounds. Finally, researchers may apply these interventions in
a combinatorial fashion, e.g., mating yeast deletion mutants to create
double knockouts or treating a panel of knockouts with a large
collection of drugs.
I will present statistical approaches I have developed for the
analysis of data from these high-throughput phenotypic studies, with
some coverage of low-level issues, such as normalization, and
high-level analyses, such as clustering and growth curve modelling on
a large scale.
Thursday, May 7th
Estimating recombination rates in microbial populations from metagenomic data
Philip L. F. Johnson
Graduate Group in Biophysics, UC Berkeley
Microbial populations exchange genetic information at widely varying rates, dramatically affecting the evolutionary potential of a population. Traditionally, microbial recombination rates have been calculated as a genome-wide average from multi-locus sequence typing at carefully chosen loci. New metagenomic sequencing projects, however, hold the potential to identify not only average rates of recombination, but also local "hotspots" of recombination because they generate short, overlapping fragments of DNA sequence, each deriving from a different individual, at random locations across the genome. We have developed a composite likelihood estimator that operates on these data. This method will help elucidate the rates of exchange of genetic material in microbial genomes.
Thursday, May 14th
Selective genotyping and phenotyping strategies in a complex trait context
Professor Saunak Sen
Division of Biostatistics, UCSF
Selective genotyping and selective phenotyping strategies, where a
subset of individuals are genotyped or phenotyped, can reduce the cost
of genetic studies. In experimental crosses (where two or more
strains are mated to form a segregating population), the efficiency of
these strategies has been evaluated in simplified settings where a
single locus contributes to the trait of interest, and when the trait
is normally distributed. Complex traits, where multiple loci
contribute to the trait, possibly with interactions, are incompatible
with this simplified setting; additionally such traits may not be
normally distributed. We analyze selective genotyping and phenotyping
under these complexities, which previous work has not considered, and
suggest approaches that will work better in more realistic scenarios. Our
approach is based on a general framework for calculating the expected
information content of experimental strategies.