PB HLTH 292, Section 008
Statistics and Genomics Seminar
Fall 2011
Thursday, September 1st
Computational Methods for the Prediction of the Impact of Missense Variants
Dr. Emidio Capriotti
Department of Bioengineering, Stanford University
Large-scale sequencing and genotyping techniques make it possible to scan the whole human genome, providing a huge amount of genetic variation data. Single Nucleotide Variants (SNVs), which are the main source of human genome variability, can also be responsible for the onset of human pathologies. Missense SNVs, which occur in coding regions and result in single amino acid polymorphisms (SAPs), may affect protein function and lead to a disease state. Although several methods to predict disease-related SAPs have been developed [1,2], a reliable solution to this problem is still unavailable. In the near future, the development of more accurate algorithms for predicting the effects of SAPs will be important for annotating the large amount of variation data.
In this talk, I will summarize my research on predicting the impact of genetic variants, presenting a new structure-based method for the detection of deleterious SAPs [3] and a new algorithm for the prediction of cancer-causing SAPs [4], both of which achieve better performance than previously available tools.
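The role of evolutionary information in these predictors can be illustrated with a toy conservation score (this is not the SVM of [1,2]; the alignment column, threshold, and function names are all invented for illustration):

```python
# Toy illustration: score a missense variant by how often the wild-type and
# mutant residues appear in a multiple-alignment column at that position.
def conservation_score(column, wt, mut):
    """Return frequency(wt) - frequency(mut) in an alignment column.
    High values suggest the wild-type residue is conserved and the
    substitution may be deleterious."""
    n = len(column)
    return column.count(wt) / n - column.count(mut) / n

def predict_deleterious(column, wt, mut, threshold=0.5):
    """Crude classifier: flag the variant if the column strongly favours wt."""
    return conservation_score(column, wt, mut) > threshold

# Gly is highly conserved in the first column, so G->R looks deleterious;
# in the variable column the same change looks neutral.
conserved = "GGGGGGGGGA"
variable  = "GARLKGSTVA"
print(predict_deleterious(conserved, "G", "R"))  # True
print(predict_deleterious(variable, "G", "R"))   # False
```

A real predictor feeds features like these (plus structural descriptors, in the case of [3]) into a trained classifier rather than a fixed threshold.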
REFERENCES
1. Capriotti E, Calabrese R, Casadio R (2006). Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 22: 2729-2734.
2. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation 30: 1237-1244.
3. Capriotti E, Altman RB (2011). Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics 12 (Suppl 4): S3.
4. Capriotti E, Altman RB (2011). A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics, Epub ahead of print.
Emidio Capriotti is a Marie Curie research fellow in the Department of Bioengineering at Stanford University, Palo Alto (California), and a contracted researcher at the University of the Balearic Islands in Palma de Mallorca (Spain). His research interests include protein structure prediction, the prediction of the effects of single point protein mutations on human health and protein stability, molecular dynamics of protein systems, and RNA structure comparison and prediction. Before joining Stanford University, he was a postdoctoral researcher at the Centro de Investigacion Principe Felipe (CIPF) in Valencia (Spain); he earned his Master's in Bioinformatics and his Ph.D. in Physical Sciences at the University of Bologna (Italy).
Thursday, September 8th
Working with Pacific Biosciences Data
Dr. James Bullard
Pacific Biosciences
This talk will focus on a number of interesting data analysis problems related to working with data from single molecule sequencing machines. I will discuss current sequencing efforts and interesting ongoing applications at Pacific Biosciences.
Thursday, September 15th
Unsupervised Unwanted Variation Removal for Expression Data
Dr. Laurent Jacob
Department of Statistics, UC Berkeley
Large gene expression studies typically contain some unwanted variation factors. These factors can arise from technical phenomena (batch effects, platform effects, etc.) or from biological signals that are unrelated to the factor of interest in a study (heterogeneity in ages, ethnic groups, etc.). Our objective is to remove these unwanted variation factors without losing the factor of interest. This is particularly difficult when neither the unwanted variation factors nor the factor of interest are known a priori (e.g., when doing clustering). I will present several methods relying on the existence of replicate arrays.
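The replicate idea can be sketched in a few lines: differences between technical replicates cancel the biology, so their leading singular vectors estimate the unwanted-variation directions, which can then be projected out. This is a simplified stand-in for the talk's methods; all names, dimensions, and the rank-one batch model are invented.

```python
import numpy as np

def remove_unwanted(Y, replicate_pairs, k=1):
    """Y: samples x genes expression matrix.
    replicate_pairs: (i, j) index pairs of technical replicates.
    k: number of unwanted-variation factors to remove."""
    # Replicate differences cancel the biology, leaving unwanted variation.
    D = np.array([Y[i] - Y[j] for i, j in replicate_pairs])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    W = Vt[:k]                     # top-k unwanted directions in gene space
    return Y - Y @ W.T @ W         # project the data off those directions

rng = np.random.default_rng(0)
biology = rng.normal(size=(3, 50))             # 3 biological samples
batch = np.outer([1, 1, 1, -1, -1, -1], rng.normal(size=50))
Y = np.vstack([biology, biology]) + batch      # each sample run in 2 batches
Y_clean = remove_unwanted(Y, [(0, 3), (1, 4), (2, 5)], k=1)
# After cleaning, the replicate rows agree: the batch direction is gone.
```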
Thursday, September 22nd
BMCP: A Segmentation Algorithm for SNP Microarray Copy Number Data from Cancer Studies
Yu Chuan Tai
Roche Molecular Diagnostics
High-density SNP microarrays provide a useful tool for the detection of copy number changes in tumors. We propose an algorithm, BMCP, for the segmentation and estimation of SNP microarray copy number data from cancer studies. Segmentation means separating a chromosome into regions of equal copy number difference between the sample of interest and some reference, and involves detecting the locations where the copy number difference changes. Estimation means determining the true copy number of each segment. BMCP not only gives posterior estimates for the parameters of interest, namely the locations of copy number difference changes and the true copy number of each segment, but also provides useful confidence measures. In addition, BMCP can segment multiple samples simultaneously. Finally, BMCP incorporates an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination, which can improve copy number estimates.
This is joint work with Mark Kvale and John Witte at UCSF.
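To make the segmentation task concrete, here is a generic (non-Bayesian) binary-segmentation sketch on simulated log-ratios. BMCP itself is a Bayesian changepoint method with posterior estimates and confidence measures; this toy stands in only for the problem statement, and all thresholds and data are invented.

```python
import numpy as np

def best_split(x):
    """Return the index k (1 <= k < len(x)) that best splits x into two
    segments of differing means, scored by the reduction in squared error."""
    n = len(x)
    total = ((x - x.mean()) ** 2).sum()
    best_k, best_gain = None, 0.0
    for k in range(1, n):
        left, right = x[:k], x[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if total - sse > best_gain:
            best_k, best_gain = k, total - sse
    return best_k, best_gain

def segment(x, min_gain, offset=0):
    """Recursive binary segmentation: collect changepoints until no split
    reduces the squared error by more than min_gain."""
    k, gain = best_split(x)
    if k is None or gain < min_gain:
        return []
    return (segment(x[:k], min_gain, offset) + [offset + k]
            + segment(x[k:], min_gain, offset + k))

# Toy log-ratio profile: copy-neutral, then a gain, then neutral again.
x = np.concatenate([np.zeros(30), np.full(20, 1.0), np.zeros(30)]) \
    + np.random.default_rng(1).normal(scale=0.1, size=80)
print(segment(x, min_gain=1.0))  # two changepoints, near 30 and 50
```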
Thursday, September 29th
Alternative Splicing of Single Codons is Regulated and Conserved
Dr. Robert Bradley
Fred Hutchinson Cancer Research Center
Thousands of human genes contain introns ending in NAGNAG (N = any nucleotide), where both NAGs can function as 3' splice sites, yielding isoforms that differ by inclusion/exclusion of three bases. However, the physiological relevance and conservation of this NAGNAG splicing has been very controversial. Using very deep RNA-Seq data from human and mouse tissues with both technical and biological replicates, we found that alternative splicing of single codons is commonly regulated between tissues, and furthermore that strongly tissue-specific NAGNAGs are highly conserved between species, implying selective maintenance. Specific sequence features, including a more distal location of the branch point and the presence of a pyrimidine immediately before the first NAG, influence NAGNAG splicing both globally and in a splicing reporter system. Strikingly, mutations that create, destroy, or alter NAGNAGs give rise to a dramatic increase in the gain/loss of codons at exon-exon boundaries in both mammals and flies. Our study demonstrates that NAGNAG alternative splicing generates widespread differences between the proteomes of mammalian tissues, and suggests that the evolutionary trajectories of mammalian proteins are strongly biased by the locations and phases of the introns that interrupt coding sequences.
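Identifying candidate NAGNAG acceptors is a simple motif check, which can be sketched as below (an illustrative snippet, not the study's pipeline; the example sequences are invented):

```python
import re

# An intron ending in NAGNAG (N = any nucleotide) offers two tandem 3'
# acceptor sites, so splicing at one NAG or the other changes the mature
# transcript by exactly three bases (one codon).
NAGNAG = re.compile(r"[ACGT]AG[ACGT]AG$")

def is_nagnag_acceptor(intron):
    """True if the intron sequence ends in a NAGNAG motif."""
    return bool(NAGNAG.search(intron.upper()))

print(is_nagnag_acceptor("GTAAGTCTCTCTCAGCAG"))  # True  (ends ...CAGCAG)
print(is_nagnag_acceptor("GTAAGTCTCTCTCTGCAG"))  # False (only one NAG)
```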
Thursday, October 6th
Statistical Challenges for Ion Torrent Semiconductor Sequencing
Dr. Simon Cawley
Ion Torrent
The recent development of Ion Torrent semiconductor sequencing has the potential to do for DNA sequencing what complementary metal-oxide semiconductor (CMOS) imaging has done for photography: replacing slow and expensive methods with a fast, cheap, and scalable alternative. The technology uses massively parallel direct detection of the hydrogen ions released during polymerase synthesis to read sequence. The raw signal data present interesting signal processing and inference challenges in turning voltage data from the chip into usable DNA bases. These problems can be divided into hydrogen ion accounting, in which we infer the mean number of hydrogen ions produced per molecule from the time-varying signal, and phase correction, in which we reconstruct the ensemble populations of polymerase locations on the copies of DNA and recover the sequence. Development of improved models to handle these signal-processing challenges may lead to the solution of one of the million-dollar Life Technologies Grand Challenges.
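A deliberately naive sketch of the first problem: treat each flow's signal as proportional to the number of bases incorporated (a homopolymer releases proportionally more hydrogen ions) and round to the nearest integer. The flow order and signal values below are invented, and real pipelines must also model the phase effects (incomplete extension and carry-forward) that this toy ignores.

```python
# Toy flow-space base calling: in each flow one nucleotide is presented,
# cycling through FLOW_ORDER; the signal is roughly proportional to the
# number of bases incorporated in that flow.
FLOW_ORDER = "TACG"

def call_bases(signals):
    """signals: one measurement per flow, in flow order."""
    seq = []
    for i, s in enumerate(signals):
        n = round(s)                      # estimated incorporations this flow
        seq.append(FLOW_ORDER[i % 4] * n)
    return "".join(seq)

# Signals for the read "TTACGG": two T's in flow 1, then A, C, two G's.
print(call_bases([2.05, 0.98, 1.02, 1.96]))  # "TTACGG"
```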
Thursday, October 20th
High-Dimension Gaussian Graphical Model Building with Re-Sampling Based Methods
Professor Jie Peng
Department of Statistics, UC Davis
Regularization techniques are widely used for tackling high-dimension, low-sample-size problems. Yet finding the right amount of regularization is challenging, especially in the unsupervised setting, where traditional methods such as BIC or cross-validation often result in too many false positives. In this talk, we first introduce Gaussian graphical models (GGMs) and their inference in the high-dimension regime. We then discuss approaches based on data perturbation, particularly those that utilize selection frequencies over networks built on re-sampled data sets. We propose a method to infer networks by directly controlling the false discovery rate (FDR) of edge detection. The idea is to fit a mixture distribution to the selection frequencies and then estimate the FDR. The method is illustrated on both simulated and real data sets.
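The selection-frequency idea can be sketched as follows. This is a simplified stand-in, not the talk's estimator: it thresholds partial correlations on row subsamples instead of fitting a regularized GGM, and the mixture-based FDR step is omitted; all parameter values are invented.

```python
import numpy as np

def edge_frequencies(X, n_resample=50, thresh=0.2, ridge=0.01, seed=0):
    """Refit a crude GGM on many subsamples of the rows of X (samples x
    variables) and record how often each edge is selected; stable edges
    accumulate high selection frequencies."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((p, p))
    for _ in range(n_resample):
        idx = rng.choice(n, size=n // 2, replace=False)   # subsample rows
        S = np.cov(X[idx].T) + ridge * np.eye(p)          # regularized cov
        K = np.linalg.inv(S)                              # precision matrix
        d = np.sqrt(np.diag(K))
        pc = -K / np.outer(d, d)                          # partial correlations
        freq += (np.abs(pc) > thresh)                     # select strong edges
    np.fill_diagonal(freq, 0)
    return freq / n_resample

# Toy chain graph 1-2-3: edges (1,2) and (2,3) are real; (1,3) is not.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.5 * rng.normal(size=500)
x3 = x2 + 0.5 * rng.normal(size=500)
F = edge_frequencies(np.column_stack([x1, x2, x3]))
# F[0, 1] is near 1 (stable edge); F[0, 2] stays near 0 (spurious edge).
```

The talk's contribution goes one step further: modeling the distribution of such frequencies as a mixture of "null" and "real" edges to estimate the FDR directly.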
Thursday, November 3rd
Measuring and Predicting Metabolic Fluxes through 13C Carbon Labeling Experiments for Pure Cultures and Microbial Communities
Dr. Hector Garcia Martin
Joint BioEnergy Institute, Berkeley, CA
Systems biology aims to provide a predictive and quantitative understanding of cell behaviour as the outcome of the interaction of its component parts. Metabolic flux profiles (i.e., the number of molecules traversing each biochemical reaction encoded in the genome per unit time) are not only a key phenotypic characteristic but also embody the essence of this complexity, since they represent the final functional output of the interactions of all the molecular machinery studied by the other "omics" fields. Two of the most popular methods for studying metabolic fluxes are Flux Balance Analysis (FBA) and 13C Metabolic Flux Analysis (13C MFA), each with its own advantages and disadvantages. In this talk I will present a new method, Two-Scale 13C Metabolic Flux Analysis (2S-13CMFA), which combines the advantages of FBA and 13C MFA. I will showcase its applications and possibilities with data from the KEIO knockout collection. Time permitting, I will also introduce how we are adapting these methods to study metabolic fluxes in microbial communities.
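The FBA half of this comparison is, at its core, a linear program: maximize a target flux subject to steady-state mass balance S v = 0 and capacity bounds. A minimal sketch on a made-up three-reaction network (not the speaker's models), using scipy's linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network:  R1: -> A (uptake, <= 10)   R2: A -> B   R3: B -> (biomass)
# At steady state each metabolite's production equals its consumption.
S = np.array([[ 1, -1,  0],   # metabolite A: made by R1, used by R2
              [ 0,  1, -1]])  # metabolite B: made by R2, used by R3
c = [0, 0, -1]                # linprog minimizes, so maximize v3 via -v3
res = linprog(c, A_eq=S, b_eq=[0, 0],
              bounds=[(0, 10), (0, None), (0, None)])
print(res.x)   # optimal fluxes: everything runs at the uptake limit, [10, 10, 10]
```

13C MFA, by contrast, constrains the fluxes with isotope labeling measurements rather than an optimality assumption; 2S-13CMFA combines the two.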
Thursday, November 10th
Recent Shared Ancestry within Large Populations
Dr. Peter Ralph
Department of Evolution and Ecology, UC Davis
Within large, whole-genome datasets, a substantial number of pairs of individuals share genetic material from ancestors who lived in the last 100 generations, a fact that stands in contrast to the distant relatedness that governs most genetic diversity. These recent relationships show up in the data as long genomic runs of identity between individuals, and provide information about recent demographic history and population structure.
I will introduce some theory about pedigrees and large-sample coalescent trees that provides expectations for the data, and present results from the analysis of a large genomic dataset, in which we see geographic structure arising from recent shared ancestry, and infer some aspects of recent population history.
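One standard piece of the pedigree theory can be stated in a line (this is textbook back-of-envelope reasoning, not the talk's full theory): a genomic block inherited along a path through a common ancestor g generations back has survived 2g meioses, so recombination breakpoints fall at rate 2g per Morgan and the block's expected genetic length is 1/(2g) Morgans, i.e. 50/g cM.

```python
def expected_ibd_length_cm(g):
    """Expected length (in cM) of an identity-by-descent segment inherited
    from a common ancestor g generations in the past: recombination along
    the 2g-meiosis path breaks the segment at rate 2g per Morgan."""
    return 100.0 / (2 * g)

for g in (5, 25, 100):
    print(g, expected_ibd_length_cm(g))  # 10.0, 2.0, 0.5 cM
```

This is why recent shared ancestry is detectable at all: blocks from the last ~100 generations are still long enough (of order 0.5 cM or more) to stand out against background diversity.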
Thursday, November 17th
Normalization and Differential Expression in RNA-Seq
Professor Sandrine Dudoit
Division of Biostatistics and Department of Statistics, UC Berkeley
This talk concerns statistical methods and software for the analysis of transcriptome sequencing (RNA-Seq) data. We first present exploratory data analysis (EDA) approaches for quality assessment/control (QA/QC) of RNA-Seq reads. Next, we propose within-lane normalization methods to adjust for sample-specific gene-level effects such as length and GC-content. We also propose between-lane normalization procedures to account for distributional differences such as sequencing depth. Finally, we consider the quantitation of (differential) gene expression levels using generalized linear models (GLMs). This work was motivated by a collaboration with the Sherlock Lab on transcriptome analysis in Saccharomyces. Our exploratory data analysis and normalization methods are implemented in the open-source Bioconductor R package EDASeq (http://www.bioconductor.org).
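As one concrete example of between-lane normalization, upper-quartile scaling is a common choice in RNA-Seq pipelines (this sketch is illustrative and not necessarily the talk's exact procedure; the count matrix is made up): each lane's counts are rescaled so the upper quartiles of its nonzero gene counts agree across lanes.

```python
import numpy as np

def upper_quartile_normalize(counts):
    """counts: genes x lanes matrix of read counts. Rescale each lane so the
    75th percentiles of nonzero counts match, removing depth differences."""
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    scale = uq / uq.mean()            # per-lane scaling factors
    return counts / scale             # broadcasts over lanes

# Lane 2 was sequenced twice as deeply as lane 1; normalization equalizes them.
counts = np.array([[100, 200],
                   [ 10,  20],
                   [ 50, 100],
                   [  0,   0]])
norm = upper_quartile_normalize(counts)
```

Within-lane normalization (for length and GC-content) works on the other axis, adjusting genes within a sample rather than samples against each other.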
Thursday, December 8th
RHadoop, the Marriage of R and Hadoop
Dr. Antonio Piccolboni
Revolution Analytics
R is the de facto standard for statistical computation; Hadoop has that role for fault-tolerant, highly scalable distributed computing. The RHadoop project aims to bring these two technologies together. In this talk we will review some of the characteristics and applications of Hadoop, including applications in genomics. We will then introduce three packages that provide the R developer with access to three different Hadoop components: the file system HDFS (rhdfs), the database HBase (rhbase), and the parallel computing platform MapReduce (rmr). For rmr, we will go into some detail and build up to a massively scalable implementation of a simple clustering algorithm, k-means, in 20 lines of R code.
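The map/reduce decomposition of k-means that the talk builds toward can be sketched in plain Python standing in for rmr's R interface (all function names and data here are illustrative): the mapper keys each point by its nearest centroid, the framework shuffles by key, and the reducer averages each key's points into a new centroid.

```python
import numpy as np

def kmeans_map(point, centroids):
    """Mapper: emit (index of nearest centroid, point)."""
    dists = [np.linalg.norm(point - c) for c in centroids]
    return int(np.argmin(dists)), point

def kmeans_reduce(key, points):
    """Reducer: average the points assigned to one centroid."""
    return key, np.mean(points, axis=0)

def kmeans_iteration(data, centroids):
    """One k-means step, written as explicit map / shuffle / reduce phases."""
    groups = {}
    for p in data:
        k, v = kmeans_map(p, centroids)          # map phase
        groups.setdefault(k, []).append(v)       # shuffle: group by key
    return np.array([kmeans_reduce(k, pts)[1]    # reduce phase
                     for k, pts in sorted(groups.items())])

data = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
cents = kmeans_iteration(data, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(cents)  # [[0.1, 0.0], [5.1, 5.0]]
```

Iterating this step to convergence gives the full algorithm; on a cluster, the map and reduce functions run in parallel over partitions of the data, which is what makes the approach scale.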