PB HLTH 292, Section 008
Statistics and Genomics Seminar
Fall 2011
Thursday, September 1st
Computational Methods for the Prediction of the Impact of Missense Variants
Dr. Emidio Capriotti
Department of Bioengineering, Stanford University
Large-scale sequencing and genotyping techniques make it possible to scan the whole human genome, providing a huge amount of genetic variation data. Single Nucleotide Variants (SNVs), which are the main source of human genome variability, can also be responsible for the onset of human pathologies. Missense SNVs, which occur in coding regions and result in single amino acid polymorphisms (SAPs), may affect protein function and lead to a disease state. Although several methods to predict disease-related SAPs have been developed [1,2], a reliable solution to this problem is still unavailable. In the near future, the development of more accurate algorithms for predicting the effects of SAPs will be important for annotating the large amount of variation data.
In this talk, I will summarize my research on predicting the impact of genetic variants, presenting a new structure-based method for the detection of deleterious SAPs [3] and a new algorithm for the prediction of cancer-causing SAPs [4], both of which achieve better performance than previously available tools.
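The role of evolutionary information in these predictors can be illustrated with a toy conservation score (this is not the SVM of [1,2]; the alignment column, threshold, and function names are all invented for illustration):

```python
# Toy illustration: score a missense variant by how often the wild-type and
# mutant residues appear in a multiple-alignment column at that position.
def conservation_score(column, wt, mut):
    """Return frequency(wt) - frequency(mut) in an alignment column.
    High values suggest the wild-type residue is conserved and the
    substitution may be deleterious."""
    n = len(column)
    return column.count(wt) / n - column.count(mut) / n

def predict_deleterious(column, wt, mut, threshold=0.5):
    """Crude classifier: flag the variant if the column strongly favours wt."""
    return conservation_score(column, wt, mut) > threshold

# Gly is highly conserved in the first column, so G->R looks deleterious;
# in the variable column the same change looks neutral.
conserved = "GGGGGGGGGA"
variable  = "GARLKGSTVA"
print(predict_deleterious(conserved, "G", "R"))  # True
print(predict_deleterious(variable, "G", "R"))   # False
```

A real predictor feeds features like these (plus structural descriptors, in the case of [3]) into a trained classifier rather than a fixed threshold.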
REFERENCES
1. Capriotti E, Calabrese R, Casadio R (2006). Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 22: 2729-2734.
2. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation 30: 1237-1244.
3. Capriotti E, Altman RB (2011). Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics 12 (Suppl 4): S3.
4. Capriotti E, Altman RB (2011). A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics, Epub ahead of print.
Emidio Capriotti is a Marie Curie research fellow in the Department of Bioengineering at Stanford University, Palo Alto (California), and a contracted researcher at the University of the Balearic Islands in Palma de Mallorca (Spain). His research interests include protein structure prediction, the prediction of the effects of single point protein mutations on human health and protein stability, molecular dynamics of protein systems, and RNA structure comparison and prediction. Before joining Stanford University, he was a postdoctoral researcher at the Centro de Investigacion Principe Felipe (CIPF) in Valencia (Spain); he earned his Master's in Bioinformatics and his Ph.D. in Physical Sciences at the University of Bologna (Italy).
Thursday, September 8th
Working with Pacific Biosciences Data
Dr. James Bullard
Pacific Biosciences
This talk will focus on a number of interesting data analysis problems related to working with data from single molecule sequencing machines. I will discuss current sequencing efforts and interesting ongoing applications at Pacific Biosciences.
Thursday, September 15th
Unsupervised Unwanted Variation Removal for Expression Data
Dr. Laurent Jacob
Department of Statistics, UC Berkeley
Large gene expression studies typically contain some unwanted variation factors. These factors can arise from technical phenomena (batch effects, platform effects, etc.) or from biological signals that are unrelated to the factor of interest in a study (heterogeneity in ages, ethnic groups, etc.). Our objective is to remove these unwanted variation factors without losing the factor of interest. This is particularly difficult when neither the unwanted variation factors nor the factor of interest are known a priori (e.g., when doing clustering). I will present several methods relying on the existence of replicate arrays.
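The replicate idea can be sketched in a few lines: differences between technical replicates cancel the biology, so their leading singular vectors estimate the unwanted-variation directions, which can then be projected out. This is a simplified stand-in for the talk's methods; all names, dimensions, and the rank-one batch model are invented.

```python
import numpy as np

def remove_unwanted(Y, replicate_pairs, k=1):
    """Y: samples x genes expression matrix.
    replicate_pairs: (i, j) index pairs of technical replicates.
    k: number of unwanted-variation factors to remove."""
    # Replicate differences cancel the biology, leaving unwanted variation.
    D = np.array([Y[i] - Y[j] for i, j in replicate_pairs])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    W = Vt[:k]                     # top-k unwanted directions in gene space
    return Y - Y @ W.T @ W         # project the data off those directions

rng = np.random.default_rng(0)
biology = rng.normal(size=(3, 50))             # 3 biological samples
batch = np.outer([1, 1, 1, -1, -1, -1], rng.normal(size=50))
Y = np.vstack([biology, biology]) + batch      # each sample run in 2 batches
Y_clean = remove_unwanted(Y, [(0, 3), (1, 4), (2, 5)], k=1)
# After cleaning, the replicate rows agree: the batch direction is gone.
```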
Thursday, September 22nd
BMCP: A Segmentation Algorithm for SNP Microarray Copy Number Data from Cancer Studies
Yu Chuan Tai
Roche Molecular Diagnostics
High-density SNP microarrays provide a useful tool for the detection of copy number changes in tumors. We propose an algorithm, BMCP, for the segmentation and estimation of SNP microarray copy number data from cancer studies. Segmentation means separating a chromosome into regions of equal copy number difference between the sample of interest and some reference, and involves detecting the locations where the copy number difference changes. Estimation means determining the true copy number of each segment. BMCP not only gives posterior estimates for the parameters of interest, namely the locations of copy number difference changes and the true copy number of each segment, but also provides useful confidence measures. In addition, BMCP can segment multiple samples simultaneously. Finally, BMCP incorporates an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination, which can improve copy number estimates.
This is joint work with Mark Kvale and John Witte at UCSF.
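To make the segmentation task concrete, here is a generic (non-Bayesian) binary-segmentation sketch on simulated log-ratios. BMCP itself is a Bayesian changepoint method with posterior estimates and confidence measures; this toy stands in only for the problem statement, and all thresholds and data are invented.

```python
import numpy as np

def best_split(x):
    """Return the index k (1 <= k < len(x)) that best splits x into two
    segments of differing means, scored by the reduction in squared error."""
    n = len(x)
    total = ((x - x.mean()) ** 2).sum()
    best_k, best_gain = None, 0.0
    for k in range(1, n):
        left, right = x[:k], x[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if total - sse > best_gain:
            best_k, best_gain = k, total - sse
    return best_k, best_gain

def segment(x, min_gain, offset=0):
    """Recursive binary segmentation: collect changepoints until no split
    reduces the squared error by more than min_gain."""
    k, gain = best_split(x)
    if k is None or gain < min_gain:
        return []
    return (segment(x[:k], min_gain, offset) + [offset + k]
            + segment(x[k:], min_gain, offset + k))

# Toy log-ratio profile: copy-neutral, then a gain, then neutral again.
x = np.concatenate([np.zeros(30), np.full(20, 1.0), np.zeros(30)]) \
    + np.random.default_rng(1).normal(scale=0.1, size=80)
print(segment(x, min_gain=1.0))  # two changepoints, near 30 and 50
```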
Thursday, September 29th
Alternative Splicing of Single Codons is Regulated and Conserved
Dr. Robert Bradley
Fred Hutchinson Cancer Research Center
Thousands of human genes contain introns ending in NAGNAG (N = any nucleotide), where both NAGs can function as 3' splice sites, yielding isoforms that differ by inclusion/exclusion of three bases. However, the physiological relevance and conservation of this NAGNAG splicing has been very controversial. Using very deep RNA-Seq data from human and mouse tissues with both technical and biological replicates, we found that alternative splicing of single codons is commonly regulated between tissues, and furthermore that strongly tissue-specific NAGNAGs are highly conserved between species, implying selective maintenance. Specific sequence features, including a more distal location of the branch point and the presence of a pyrimidine immediately before the first NAG, influence NAGNAG splicing both globally and in a splicing reporter system. Strikingly, mutations that create, destroy, or alter NAGNAGs give rise to a dramatic increase in the gain/loss of codons at exon-exon boundaries in both mammals and flies. Our study demonstrates that NAGNAG alternative splicing generates widespread differences between the proteomes of mammalian tissues, and suggests that the evolutionary trajectories of mammalian proteins are strongly biased by the locations and phases of the introns that interrupt coding sequences.
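Identifying candidate NAGNAG acceptors is a simple motif check, which can be sketched as below (an illustrative snippet, not the study's pipeline; the example sequences are invented):

```python
import re

# An intron ending in NAGNAG (N = any nucleotide) offers two tandem 3'
# acceptor sites, so splicing at one NAG or the other changes the mature
# transcript by exactly three bases (one codon).
NAGNAG = re.compile(r"[ACGT]AG[ACGT]AG$")

def is_nagnag_acceptor(intron):
    """True if the intron sequence ends in a NAGNAG motif."""
    return bool(NAGNAG.search(intron.upper()))

print(is_nagnag_acceptor("GTAAGTCTCTCTCAGCAG"))  # True  (ends ...CAGCAG)
print(is_nagnag_acceptor("GTAAGTCTCTCTCTGCAG"))  # False (only one NAG)
```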
Thursday, October 6th
Statistical Challenges for Ion Torrent Semiconductor Sequencing
Dr. Simon Cawley
Ion Torrent
The recent development of Ion Torrent semiconductor sequencing has the potential to do for DNA sequencing what complementary metal-oxide semiconductor (CMOS) imaging has done for photography: replacing slow and expensive methods with a fast, cheap, and scalable alternative. The technology uses massively parallel direct detection of the hydrogen ions released during polymerase synthesis to read sequence. The raw signal data present interesting signal processing and inference challenges in turning voltage data from the chip into usable DNA bases. These problems can be divided into hydrogen ion accounting, in which we infer the mean number of hydrogen ions produced per molecule from the time-varying signal, and phase correction, in which we reconstruct the ensemble populations of polymerase locations on the copies of DNA and recover the sequence. Development of improved models to handle these signal-processing challenges may lead to the solution of one of the million-dollar Life Technologies Grand Challenges.
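A deliberately naive sketch of the first problem: treat each flow's signal as proportional to the number of bases incorporated (a homopolymer releases proportionally more hydrogen ions) and round to the nearest integer. The flow order and signal values below are invented, and real pipelines must also model the phase effects (incomplete extension and carry-forward) that this toy ignores.

```python
# Toy flow-space base calling: in each flow one nucleotide is presented,
# cycling through FLOW_ORDER; the signal is roughly proportional to the
# number of bases incorporated in that flow.
FLOW_ORDER = "TACG"

def call_bases(signals):
    """signals: one measurement per flow, in flow order."""
    seq = []
    for i, s in enumerate(signals):
        n = round(s)                      # estimated incorporations this flow
        seq.append(FLOW_ORDER[i % 4] * n)
    return "".join(seq)

# Signals for the read "TTACGG": two T's in flow 1, then A, C, two G's.
print(call_bases([2.05, 0.98, 1.02, 1.96]))  # "TTACGG"
```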
Thursday, October 20th
High-Dimension Gaussian Graphical Model Building with Re-Sampling Based Methods
Professor Jie Peng
Department of Statistics, UC Davis
Regularization techniques are widely used for tackling high-dimension, low-sample-size problems. Yet finding the right amount of regularization is challenging, especially in the unsupervised setting, where traditional methods such as BIC or cross-validation often result in too many false positives. In this talk, we first introduce Gaussian graphical models (GGMs) and their inference in the high-dimension regime. We then discuss approaches based on data perturbation, particularly those that utilize selection frequencies over networks built on re-sampled data sets. We propose a method to infer networks by directly controlling the false discovery rate (FDR) of edge detection. The idea is to fit a mixture distribution to the selection frequencies and then estimate the FDR. The method is illustrated on both simulated and real data sets.
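The selection-frequency idea can be sketched as follows. This is a simplified stand-in, not the talk's estimator: it thresholds partial correlations on row subsamples instead of fitting a regularized GGM, and the mixture-based FDR step is omitted; all parameter values are invented.

```python
import numpy as np

def edge_frequencies(X, n_resample=50, thresh=0.2, ridge=0.01, seed=0):
    """Refit a crude GGM on many subsamples of the rows of X (samples x
    variables) and record how often each edge is selected; stable edges
    accumulate high selection frequencies."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((p, p))
    for _ in range(n_resample):
        idx = rng.choice(n, size=n // 2, replace=False)   # subsample rows
        S = np.cov(X[idx].T) + ridge * np.eye(p)          # regularized cov
        K = np.linalg.inv(S)                              # precision matrix
        d = np.sqrt(np.diag(K))
        pc = -K / np.outer(d, d)                          # partial correlations
        freq += (np.abs(pc) > thresh)                     # select strong edges
    np.fill_diagonal(freq, 0)
    return freq / n_resample

# Toy chain graph 1-2-3: edges (1,2) and (2,3) are real; (1,3) is not.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.5 * rng.normal(size=500)
x3 = x2 + 0.5 * rng.normal(size=500)
F = edge_frequencies(np.column_stack([x1, x2, x3]))
# F[0, 1] is near 1 (stable edge); F[0, 2] stays near 0 (spurious edge).
```

The talk's contribution goes one step further: modeling the distribution of such frequencies as a mixture of "null" and "real" edges to estimate the FDR directly.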
Thursday, November 3rd
Measuring and Predicting Metabolic Fluxes through 13C Carbon Labeling Experiments for Pure Cultures and Microbial Communities
Dr. Hector Garcia Martin
Joint BioEnergy Institute, Berkeley, CA
Systems biology aims to provide a predictive and quantitative understanding of cell behaviour as the outcome of the interaction of its component parts. Metabolic flux profiles (i.e., the number of molecules traversing each biochemical reaction encoded in the genome per unit time) are not only a key phenotypic characteristic but also embody the essence of this complexity, since they represent the final functional output of the interactions of all the molecular machinery studied by the other "omics" fields. Two of the most popular methods for studying metabolic fluxes are Flux Balance Analysis (FBA) and 13C Metabolic Flux Analysis (13C MFA), each with its own advantages and disadvantages. In this talk I will present a new method, Two-Scale 13C Metabolic Flux Analysis (2S-13CMFA), which combines the advantages of FBA and 13C MFA. I will showcase its applications and possibilities with data from the KEIO knockout collection. Time permitting, I will also introduce how we are adapting these methods to study metabolic fluxes in microbial communities.
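The FBA half of this comparison is, at its core, a linear program: maximize a target flux subject to steady-state mass balance S v = 0 and capacity bounds. A minimal sketch on a made-up three-reaction network (not the speaker's models), using scipy's linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network:  R1: -> A (uptake, <= 10)   R2: A -> B   R3: B -> (biomass)
# At steady state each metabolite's production equals its consumption.
S = np.array([[ 1, -1,  0],   # metabolite A: made by R1, used by R2
              [ 0,  1, -1]])  # metabolite B: made by R2, used by R3
c = [0, 0, -1]                # linprog minimizes, so maximize v3 via -v3
res = linprog(c, A_eq=S, b_eq=[0, 0],
              bounds=[(0, 10), (0, None), (0, None)])
print(res.x)   # optimal fluxes: everything runs at the uptake limit, [10, 10, 10]
```

13C MFA, by contrast, constrains the fluxes with isotope labeling measurements rather than an optimality assumption; 2S-13CMFA combines the two.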
Thursday, November 10th
Recent Shared Ancestry within Large Populations
Dr. Peter Ralph
Department of Evolution and Ecology, UC Davis
Within large, whole-genome datasets, a substantial number of pairs of individuals share genetic material from ancestors who lived in the last 100 generations, a fact that stands in contrast to the distant relatedness that governs most genetic diversity. These recent relationships show up in the data as long genomic runs of identity between individuals, and provide information about recent demographic history and population structure.
I will introduce some theory about pedigrees and large-sample coalescent trees that provides expectations for the data, and present results from the analysis of a large genomic dataset, in which we see geographic structure arising from recent shared ancestry, and infer some aspects of recent population history.
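One standard piece of the pedigree theory can be stated in a line (this is textbook back-of-envelope reasoning, not the talk's full theory): a genomic block inherited along a path through a common ancestor g generations back has survived 2g meioses, so recombination breakpoints fall at rate 2g per Morgan and the block's expected genetic length is 1/(2g) Morgans, i.e. 50/g cM.

```python
def expected_ibd_length_cm(g):
    """Expected length (in cM) of an identity-by-descent segment inherited
    from a common ancestor g generations in the past: recombination along
    the 2g-meiosis path breaks the segment at rate 2g per Morgan."""
    return 100.0 / (2 * g)

for g in (5, 25, 100):
    print(g, expected_ibd_length_cm(g))  # 10.0, 2.0, 0.5 cM
```

This is why recent shared ancestry is detectable at all: blocks from the last ~100 generations are still long enough (of order 0.5 cM or more) to stand out against background diversity.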
Thursday, November 17th
Normalization and Differential Expression in RNA-Seq
Professor Sandrine Dudoit
Division of Biostatistics and Department of Statistics, UC Berkeley
This talk concerns statistical methods and software for the analysis of transcriptome sequencing (RNA-Seq) data. We first present exploratory data analysis (EDA) approaches for quality assessment/control (QA/QC) of RNA-Seq reads. Next, we propose within-lane normalization methods to adjust for sample-specific gene-level effects such as length and GC-content. We also propose between-lane normalization procedures to account for distributional differences such as sequencing depth. Finally, we consider the quantitation of (differential) gene expression levels using generalized linear models (GLMs). This work was motivated by a collaboration with the Sherlock Lab on transcriptome analysis in Saccharomyces. Our exploratory data analysis and normalization methods are implemented in the open-source Bioconductor R package EDASeq (http://www.bioconductor.org).
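As one concrete example of between-lane normalization, upper-quartile scaling is a common choice in RNA-Seq pipelines (this sketch is illustrative and not necessarily the talk's exact procedure; the count matrix is made up): each lane's counts are rescaled so the upper quartiles of its nonzero gene counts agree across lanes.

```python
import numpy as np

def upper_quartile_normalize(counts):
    """counts: genes x lanes matrix of read counts. Rescale each lane so the
    75th percentiles of nonzero counts match, removing depth differences."""
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    scale = uq / uq.mean()            # per-lane scaling factors
    return counts / scale             # broadcasts over lanes

# Lane 2 was sequenced twice as deeply as lane 1; normalization equalizes them.
counts = np.array([[100, 200],
                   [ 10,  20],
                   [ 50, 100],
                   [  0,   0]])
norm = upper_quartile_normalize(counts)
```

Within-lane normalization (for length and GC-content) works on the other axis, adjusting genes within a sample rather than samples against each other.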
Thursday, December 8th
RHadoop, the Marriage of R and Hadoop
Dr. Antonio Piccolboni
Revolution Analytics
R is the de facto standard for statistical computation; Hadoop has that role for fault-tolerant, highly scalable distributed computing. The RHadoop project aims to bring these two technologies together. In this talk we will review some of the characteristics and applications of Hadoop, including applications in genomics. We will then introduce three packages that provide the R developer with access to three different Hadoop components: the file system HDFS (rhdfs), the database HBase (rhbase), and the parallel computing platform MapReduce (rmr). For rmr, we will go into some detail and build up to a massively scalable implementation of a simple clustering algorithm, k-means, in 20 lines of R code.
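The map/reduce decomposition of k-means that the talk builds toward can be sketched in plain Python standing in for rmr's R interface (all function names and data here are illustrative): the mapper keys each point by its nearest centroid, the framework shuffles by key, and the reducer averages each key's points into a new centroid.

```python
import numpy as np

def kmeans_map(point, centroids):
    """Mapper: emit (index of nearest centroid, point)."""
    dists = [np.linalg.norm(point - c) for c in centroids]
    return int(np.argmin(dists)), point

def kmeans_reduce(key, points):
    """Reducer: average the points assigned to one centroid."""
    return key, np.mean(points, axis=0)

def kmeans_iteration(data, centroids):
    """One k-means step, written as explicit map / shuffle / reduce phases."""
    groups = {}
    for p in data:
        k, v = kmeans_map(p, centroids)          # map phase
        groups.setdefault(k, []).append(v)       # shuffle: group by key
    return np.array([kmeans_reduce(k, pts)[1]    # reduce phase
                     for k, pts in sorted(groups.items())])

data = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
cents = kmeans_iteration(data, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(cents)  # [[0.1, 0.0], [5.1, 5.0]]
```

Iterating this step to convergence gives the full algorithm; on a cluster, the map and reduce functions run in parallel over partitions of the data, which is what makes the approach scale.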