PB HLTH 295, Section 003
Statistics and Genomics Seminar

Fall 2016





Thursday, August 25th


CompMS2miner: an R Package for Total Metabolome Identification, Facile Data-Visualization and Sharing, Applied to the CASMI Challenge
Dr. William M. B. Edmands
Division of Environmental Health Sciences, UC Berkeley

A long-standing challenge of untargeted metabolomic/lipidomic profiling by liquid chromatography-high resolution mass spectrometry (LC-HRMS) is the rapid, precise, and automatable transition from unknown mass spectral features to full metabolite identification using MS/MS fragmentation data and many other resources. The CompMS2miner package was developed in the R programming language for comprehensive, well organized, and reproducible unknown feature identification. CompMS2miner integrates many useful, modular, and fully extensible metabolite identification tools, such as dynamic noise filtration, composite spectra generation, product ion substructure annotation, Phase II metabolite prediction, MS/MS spectral database and in silico fragmentation matching, random forest recursive feature elimination based retention time prediction, mean maximum nearest neighbour chemical similarity scoring, and differential evolution based weighted consensus score optimization. Data curation, visualization, and sharing are made possible at any stage of the CompMS2miner workflow via a self-contained application, "Composite MS2 Explorer", developed with the R shiny package. The published application records the versions of all software packages used to generate it and, optionally, a copy of the generating R code as a script or markdown file. In principle, the CompMS2miner workflow should facilitate fully reproducible research and allow metabolite identification results to persist in relative perpetuity. The intention is that an entire dataset, or a subset thereof, can be published as a stand-alone web application for exploration by other investigators alongside metabolomic/lipidomic publications.
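
As a rough illustration of the publishing mechanism, the sketch below wraps a small results table in a self-contained shiny application; the data frame, feature IDs, and layout are hypothetical and not taken from compMS2Explorer itself.

library(shiny)

# hypothetical table of annotated features
metabolites <- data.frame(
  feature = c("M151T420", "M203T210"),
  mz      = c(151.0401, 203.0526),
  rt      = c(420, 210)
)

ui <- fluidPage(
  titlePanel("Composite spectra explorer (illustrative sketch)"),
  selectInput("feat", "Feature", choices = metabolites$feature),
  tableOutput("info")
)

server <- function(input, output, session) {
  # show the annotation row for the selected feature
  output$info <- renderTable(metabolites[metabolites$feature == input$feat, ])
}

shinyApp(ui, server)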

The package was recently entered in the 2016 CASMI (Critical Assessment of Small Molecule Identification) challenge, an annual metabolite identification competition. Preliminary results using CompMS2miner rank amongst the contest winners. The results of the CASMI identification challenge will eventually be published as a publicly available compMS2explorer application using a single function internal to the package.


Thursday, September 1st


Precision Tumor Monitoring and Outcome Prediction from Mathematical Model of Circulating Tumor DNA
Drs. Ash Alizadeh and David Matthew Kurtz
Stanford University School of Medicine

Predicting an individual's response to treatment remains a major challenge in clinical oncology. Current methods rely on clinical and biological risk factors identified prior to therapy, such as tumor stage, histological grade, or tumor genotype. These factors are associated with differences in response and survival in the population; however, their ability to predict outcome for an individual patient is limited. Emerging blood-based biomarkers, such as circulating tumor-derived DNA (ctDNA), offer opportunities to measure tumor dynamics over time, either prior to or during therapy. We created an ordinary differential equation (ODE) based mathematical model relating ctDNA to underlying tumor growth dynamics. By applying this model to ctDNA time series data, we can create a continuous view of tumor dynamics over time. We have tested this model in a cohort of patients with diffuse large B cell lymphoma (DLBCL), the most common blood cancer in adults, using ctDNA measurements over the course of their therapy. This model allowed patient-specific predictions of tumor volume and clinical outcomes, previously not possible from standard clinical data. Mathematical models of tumor dynamics grounded in mechanism provide wide opportunities in personalized medicine and tailored therapeutics.
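
To make the ODE framing concrete, here is a deliberately simple toy model in R (via deSolve) in which tumor volume grows or shrinks exponentially and ctDNA is shed in proportion to volume and cleared at a constant rate; the equations, parameter names, and values are illustrative assumptions, not the speakers' published model.

library(deSolve)

toy_model <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dV <- r * V          # tumor volume: net growth rate r (negative under therapy)
    dC <- s * V - d * C  # ctDNA: shed in proportion to volume, cleared at rate d
    list(c(dV, dC))
  })
}

out <- ode(y = c(V = 10, C = 0),                 # initial volume and ctDNA level
           times = seq(0, 100, by = 1),          # days
           func = toy_model,
           parms = c(r = -0.05, s = 0.2, d = 0.5))
head(out)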


Thursday, September 8th


Fast and Scalable Machine Learning in R & Python with H2O
Dr. Erin LeDell
H2O.ai

The focus of this talk is scalable machine learning using the H2O R and Python packages. H2O is an open source distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully featured APIs are available in R, Python, Scala, REST/JSON and also through a web interface.

Since H2O's algorithm implementations are distributed, this allows the software to scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of generalized linear models, gradient boosting machines, random forest, deep neural nets, dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others. The ability to create stacked ensembles, or "super learners," from a collection of supervised base learners is provided via the h2oEnsemble R package.
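
For a sense of the API, a minimal R session might look like the sketch below; the dataset and parameter choices are illustrative, not from the talk.

library(h2o)
h2o.init()                                 # start a local H2O cluster

iris_hex <- as.h2o(iris)                   # copy an R data frame into H2O
splits <- h2o.splitFrame(iris_hex, ratios = 0.8, seed = 1)

fit <- h2o.gbm(x = 1:4, y = "Species",     # gradient boosting on four predictors
               training_frame = splits[[1]],
               ntrees = 50)
h2o.performance(fit, newdata = splits[[2]])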

R and Python Jupyter notebooks with H2O machine learning code examples will be demoed live and made available on GitHub for attendees to follow along on their laptops. For those interested in running the code on a multi-node Amazon EC2 cluster, an H2O AMI is also available.


Thursday, September 15th


Title
Speaker
Affiliation

Abstract.


Thursday, September 22nd


Massively Multiplexed Single Cell Analysis Reveals Predictive Signatures in Human Disease
Professor Sean Bendall
Department of Pathology, Stanford University

In recent years, advances in single cell proteomic technology have provided us with tools to quantify the expression of multiple genes in individual cells. The ability to simultaneously measure multiple gene products and regulatory modifications on the same cell is necessary to resolve the incredible diversity of cell subsets, as well as to define their function in the host. While fluorescence-based reporter technology has driven these measurements over the last few decades, we have now reached a limit in the number of simultaneous measurements that can be achieved. Now, the advent of highly sensitive 'mass-based' reporters represents a new technology that promises to significantly extend these capabilities. Immunophenotyping by mass spectrometry (CyTOF mass cytometry) provides the ability to measure more than three dozen proteins at a rate of 1,000 cells per second. These reporter technologies have further been extended into the realm of sub-cellular imaging. We review these technologies and highlight some of their recent advances in single-cell assays.


Thursday, September 29th


Some Improvements in Microbiome Analysis Techniques
Paul Joseph (Joey) McMurdie
Whole Biome

Despite the rapidly improving accessibility of sequencing resources in microbiome research, amplicon sequencing, especially of the 16S rRNA marker gene, remains the first and most common metagenomics method applied to new biospecimens. A pervasive misunderstanding of amplicon sequencing data in the microbiome literature is that (1) the sequences must be processed through ad hoc "OTU" clustering, and (2) the resolution limit is a sequence similarity radius of 3%. In this talk I will show that neither of these presumptions about amplicon sequencing data is true, and that there is enough information provided by current Illumina platforms to support de novo single-nucleotide resolution in practice. This is achieved through an algorithm, called DADA2, that also exhibits best-in-class performance in both specificity and computational scaling to large datasets. The translational implication is that sub-species and intra-genomic variation (e.g. in 16S rRNA) is distinguishable with this relatively cost-efficient amplicon sequencing approach, in many cases providing a basis for detecting pathogenic strains, strains that are associated with a particular patient outcome, and strains that are unique to an individual host. A large and growing number of publicly available Illumina-based amplicon sequencing studies are likely to benefit from the analytical improvements afforded by reprocessing their data with this method.
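
As a rough sketch of the DADA2 workflow in R, for paired-end Illumina reads of one sample; the file names and trimming parameters below are placeholders, not recommendations from the talk.

library(dada2)

# hypothetical paired-end fastq files
fnF <- "sample_R1.fastq.gz"; fnR <- "sample_R2.fastq.gz"
filtF <- "filt_R1.fastq.gz"; filtR <- "filt_R2.fastq.gz"

filterAndTrim(fnF, filtF, fnR, filtR, truncLen = c(240, 160))  # quality trim
errF <- learnErrors(filtF)                   # learn run-specific error rates
errR <- learnErrors(filtR)
drpF <- derepFastq(filtF)                    # dereplicate reads
drpR <- derepFastq(filtR)
ddF  <- dada(drpF, err = errF)               # infer exact sequence variants
ddR  <- dada(drpR, err = errR)
merged <- mergePairs(ddF, drpF, ddR, drpR)   # merge the read pairs
seqtab <- removeBimeraDenovo(makeSequenceTable(merged))  # remove chimeras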


Thursday, October 6th


Estimating the Global Diversity of Bacteria Using Metagenomics
Dr. Joshua Ladau
Gladstone Institutes, GICD

Measuring and mapping the global diversity of bacteria and archaea remains a major challenge in microbiology. Maps of the diversity of microbes would aid in understanding the evolutionary and ecological processes underlying global microbial systems, assist forensics and climate-change forecasting, and improve agriculture and public health. Despite their utility, the construction of maps of microbial diversity has been hindered by a paucity of data. To address this challenge, here we utilize a computational approach, species distribution modeling, which combines metagenomic samples with maps of environmental conditions such as temperature and precipitation. We use this approach to (i) map the current distributions of bacteria across global marine surface waters and (ii) develop a novel statistical framework for estimating total biodiversity. Our results indicate (i) temperate, seasonal peaks in the global diversity of marine bacteria, and (ii) that a Horvitz-Thompson-like estimator of total microbial biodiversity can achieve low bias and low variance. These results have key implications for microbial ecology, and the methods presented here promise to be widely useful in metagenomics research.
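
For intuition about the Horvitz-Thompson idea, here is the generic textbook version in R, not the speakers' estimator: if each observed species i has an estimated detection probability pi_i, total richness can be estimated by summing inverse detection probabilities, so that rarely detected species stand in for the species that were missed entirely.

pi_hat <- c(0.9, 0.6, 0.25, 0.1)  # hypothetical per-species detection probabilities
N_hat  <- sum(1 / pi_hat)         # Horvitz-Thompson-style estimate of total richness
N_hat                             # 4 species observed -> roughly 16.8 estimated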


Thursday, October 13th


Non-Parametric Modelling of Gene Expression during Cellular Development from Single-Cell RNA-Seq
Valentine Svensson
European Molecular Biology Laboratory (EMBL) - European Bioinformatics Institute (EBI), Hinxton, UK

Cells are defined to be in certain states depending on their phenotypes. As cells develop from one state to another, their molecular composition changes through transcriptional regulation, which modulates the output of mRNA. During these transitions, which can be due to natural temporal processes or to a response to a stimulus, the abundances of different mRNA species change in dynamic ways. Regression models based on Gaussian Processes turn out to be a good fit for investigating these dynamic expression trends in a general way. Gaussian Process models are extensible and flexible, allowing us to investigate development even when we have not observed a time variable, and when multiple developmental processes are happening at once, such as cell fate bifurcations. Applying these models to single cell RNA-seq data has allowed us to investigate the process of cell development at the genomic level, during cell specialization, organism development, and immune response.
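
A minimal Gaussian process regression in base R conveys the flavor: below, a smooth expression trend is recovered over pseudotime with a squared-exponential kernel. The data are simulated and the hyperparameters hand-picked; this is a generic sketch, not the speaker's model.

# squared-exponential covariance between two sets of time points
se_kernel <- function(a, b, l = 1, s2 = 1)
  s2 * exp(-outer(a, b, function(x, y) (x - y)^2) / (2 * l^2))

set.seed(1)
t_obs <- sort(runif(30, 0, 10))               # pseudotime of observed cells
y_obs <- sin(t_obs) + rnorm(30, sd = 0.3)     # noisy expression of one gene
t_new <- seq(0, 10, length.out = 200)

K   <- se_kernel(t_obs, t_obs) + diag(0.3^2, 30)  # kernel plus noise variance
K_s <- se_kernel(t_new, t_obs)
mu  <- K_s %*% solve(K, y_obs)                # GP posterior mean of the trend

plot(t_obs, y_obs, xlab = "pseudotime", ylab = "expression")
lines(t_new, mu, col = "blue")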


Thursday, October 20th


Towards Systems and Synthetic (Mechano)Biology
Professor Sanjay Kumar
Department of Bioengineering, UC Berkeley

One of the most exciting developments in cell biology over the past two decades is the recognition that cell behavior is regulated by biophysical inputs from the microenvironment, including extracellular matrix topology, mechanics, and adhesivity. This "mechanobiology" impacts an enormous diversity of problems ranging from stem cell engineering to developmental biology to cancer. Consequently, there is much interest in exploiting engineered materials to steer cell behavior for technological and therapeutic purposes. However, two ongoing challenges in this field limit progress towards these goals: 1) The difficulty of developing quantitative dose-response relationships that connect mechanobiological signal strength to phenotype; and 2) The relative absence of high-throughput material platforms amenable to systems-level discovery and screening. In this presentation I will describe efforts my colleagues and I are taking to address these limitations. First, we are applying the tools of synthetic biology to quantitatively tune activation of mechanobiological signals, which has enabled us to develop quantitative control over cellular mechanics, motility, and invasion. Second, we have created spatially patterned materials that allow creation of combinatorial gradients of matrix stiffness and adhesivity, thereby facilitating high-throughput discovery of factors that mediate mechanosensitive behaviors. We have used this platform to identify microRNAs that mediate sensing and actuation of stiffness signals in brain cancer. A major objective for us in the future will be to integrate these platforms with the rich diversity of tools now available for biological systems-level discovery and screening and then mine the resulting data to extract new regulatory principles.


Thursday, October 27th


Title
Speaker
Affiliation

Abstract.


Thursday, November 3rd


Bayesian Image Analysis in Fourier Space, with Applications in Medical Imaging
Professor John Kornak
Department of Epidemiology and Biostatistics, UC San Francisco

Bayesian image analysis provides a solution for improving image quality relative to deterministic methods such as linear filtering, by balancing a priori expectations of image characteristics with a model for the noise process. However, conventional Bayesian image analysis models, defined in the space of the image itself, are not as frequently used in practice as they could be. The reason for the limited application is likely that these models can be difficult to specify and implement for the average user, and are relatively slow to compute (typically requiring iterative methods).

We will give a reformulation of the conventional Bayesian image analysis paradigm in Fourier space, i.e., such that the prior and likelihood are defined in terms of probability density functions (pdfs) across spatial frequencies. These pdfs are tied together across Fourier space by defining a function over Fourier space for each of the pdf parameters. In this way, spatially correlated priors, which are relatively difficult to model and compute in conventional image space, can often be modeled more efficiently as a set of independent processes across Fourier space. The originally inter-correlated and high-dimensional problem in image space is thereby broken down into a series of independent one-dimensional problems (using ‘parameter functions’ to capture variation in the model’s prior parameters over Fourier space). The Fourier space independence definition leads to easy model specification and relatively fast, direct computation, on the order of that for deterministic filtering methods.
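
A toy R sketch of the independence idea (an illustration only, not the BIFS model itself): with zero-mean Gaussian priors on the Fourier coefficients and white Gaussian noise, each frequency can be denoised independently by posterior-mean shrinkage, so the whole reconstruction costs little more than two FFTs. The prior variance function below is an arbitrary assumption.

set.seed(1)
n <- 64
img   <- matrix(0, n, n); img[20:44, 20:44] <- 1  # toy "true" image
noisy <- img + matrix(rnorm(n^2, sd = 0.5), n, n)

freq <- pmin(0:(n - 1), n:1)                      # wrapped frequency index
r2   <- outer(freq^2, freq^2, "+")                # squared radial frequency
tau2 <- 1e6 / (1 + r2)^2                          # assumed prior variance function
sigma2 <- n^2 * 0.5^2                             # noise variance per Fourier coefficient

w <- tau2 / (tau2 + sigma2)                       # independent shrinkage per frequency
denoised <- Re(fft(w * fft(noisy), inverse = TRUE)) / n^2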

We will describe the Bayesian image analysis in Fourier space (BIFS) modeling approach, and demonstrate useful properties of isotropy and resolution invariance to model specification. We will give specific applications of BIFS in medical imaging, and contrast with results based on Markov random field based models.


Thursday, November 10th


Network Modeling of Topological Domains Using Hi-C Data
Dr. Rachel Wang
Department of Statistics, Stanford University

Genome-wide chromosome conformation capture techniques such as Hi-C enable the generation of 3D genome contact maps and offer new pathways toward understanding the spatial organization of the genome. It is widely recognized that chromosomes form domains of enriched interactions that play significant roles in gene regulation and development. In particular, it is of interest to identify densely interacting, contiguous regions at the sub-megabase scale known as topologically associating domains (TADs), which are believed to be conserved across cell types and even species. Although a few algorithms have been proposed to detect TADs, the development of statistical frameworks capable of incorporating the hierarchical nature of TADs and known biological covariates remains nascent. We develop a network model that explicitly makes use of cell-type specific CTCF binding sites to detect multiscale domains. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We demonstrate that the domains identified have desirable epigenetic features and compare them across different cell types.
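
To illustrate the generic segmentation task, here is a toy dynamic program in R; it is emphatically not Dr. Wang's network model, and its objective (within-block contact enrichment minus a per-block penalty gamma) is an assumption chosen only for demonstration.

best_partition <- function(A, gamma = 1) {
  n <- nrow(A)
  block_score <- function(i, j)          # within-block mean contact, scaled by length
    mean(A[i:j, i:j]) * (j - i + 1) - gamma
  best <- c(0, rep(-Inf, n))             # best[j + 1] = best score for positions 1..j
  cut  <- integer(n)                     # cut[j] = start of the last block in that partition
  for (j in 1:n) {
    for (i in 1:j) {
      s <- best[i] + block_score(i, j)
      if (s > best[j + 1]) { best[j + 1] <- s; cut[j] <- i }
    }
  }
  bounds <- list(); j <- n               # backtrack the block boundaries
  while (j > 0) { bounds[[length(bounds) + 1]] <- c(cut[j], j); j <- cut[j] - 1 }
  rev(bounds)
}

# toy contact matrix with two enriched domains: bins 1-5 and 6-10
A <- matrix(0.1, 10, 10); A[1:5, 1:5] <- 1; A[6:10, 6:10] <- 1
best_partition(A)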


Thursday, November 17th


Methods for Genetic Studies across Multiple Phenotypes
Professor Noah Zaitlen
Department of Medicine, UC San Francisco

Testing for associations in big data faces the problem of multiple comparisons, with true signals buried inside the noise of all associations queried. This is particularly true in genetic association studies, where a substantial proportion of the variation of human phenotypes is driven by numerous genetic variants of small effect. The current strategy to improve power to identify these weak associations consists of applying standard marginal statistical approaches and increasing study sample sizes. While successful, this approach does not leverage the environmental and genetic factors shared between the multiple phenotypes collected in contemporary cohorts. Here we develop two methods that improve the power of detecting associations when a large number of correlated variables have been measured on the same samples. Our analyses of real and simulated data provide direct support that large sets of correlated variables can be leveraged to achieve dramatic increases in statistical power, equivalent to a two-, three-, or even four-fold increase in sample size.
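
A simulated R sketch of the general idea (not the speakers' specific methods): when two phenotypes share environmental variance, conditioning the tested phenotype on the other removes shared noise, which can sharpen a weak genetic signal without collecting more samples.

set.seed(1)
n  <- 2000
g  <- rbinom(n, 2, 0.3)                      # genotype at one variant
e  <- rnorm(n, sd = 2)                       # environmental factor shared by both traits
y1 <- 0.1 * g + e + rnorm(n)                 # weak genetic effect on trait 1
y2 <-           e + rnorm(n)                 # trait 2: shared environment only

summary(lm(y1 ~ g))$coefficients["g", ]      # marginal test of the weak signal
summary(lm(y1 ~ g + y2))$coefficients["g", ] # conditioning on trait 2 shrinks the
                                             # residual variance and the p-value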


Thursday, December 1st


Robust Strategies for Analysis of Single Cell mRNA-Seq Data
Professor Elizabeth Purdom
Department of Statistics, UC Berkeley

mRNA sequencing of single cells allows for the detection of important subtypes of cells. However, single cell mRNA-Seq data can be quite noisy and are generally sparsely sequenced. In this talk, we will discuss our strategies for the analysis of single cell mRNA-Seq for the detection of subtypes, in the context of finding subtypes of neuronal cells. We will introduce our RSEC procedure, which combines subsampling and ensemble clustering methods to achieve robust clusters. We will also discuss our method Slingshot for the assignment and ordering of single cells along differentiating lineages.
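
A base-R sketch of the subsampling-plus-ensemble idea (a generic illustration, not the RSEC implementation): cluster repeated subsamples, accumulate how often each pair of cells lands in the same cluster, and cut a hierarchical tree on the resulting co-clustering distance.

set.seed(1)
x <- rbind(matrix(rnorm(100, 0), 50, 2),     # two simulated groups of cells
           matrix(rnorm(100, 3), 50, 2))
n <- nrow(x)
co  <- matrix(0, n, n)                       # times each pair co-clustered
cnt <- matrix(0, n, n)                       # times each pair was co-subsampled

for (b in 1:100) {
  idx <- sample(n, 0.8 * n)                  # subsample of cells
  cl  <- kmeans(x[idx, ], centers = 2)$cluster
  same <- outer(cl, cl, "==")
  co[idx, idx]  <- co[idx, idx] + same
  cnt[idx, idx] <- cnt[idx, idx] + 1
}

d <- as.dist(1 - co / pmax(cnt, 1))          # co-clustering distance
robust_cl <- cutree(hclust(d, method = "average"), k = 2)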


Thursday, December 8th


Controlling the Rate of False Discoveries in Tandem Mass Spectrum Identifications
Professor Uri Keich
School of Mathematics and Statistics, University of Sydney, Australia

A typical shotgun proteomics experiment produces thousands of tandem mass spectra, each of which can be tentatively assigned a corresponding peptide by using a database search procedure that looks for a peptide-spectrum match (PSM) that optimizes the score assigned to a matched pair. Some of the resulting PSMs will be correct while others will be false, and we have no way to verify which is which. The statistical problem we face is that of controlling the false discovery rate (FDR), or the expected proportion of false PSMs among all reported pairings. While there is a rich statistical literature on controlling the FDR in the multiple hypothesis testing context, controlling the FDR in the PSM context is mostly done through the "home-grown" method called target-decoy competition (TDC).
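
For readers unfamiliar with TDC, here is a simulated R sketch of the standard estimator as commonly described in the literature: each spectrum keeps the better of its target and decoy matches, and the number of decoy wins above a score threshold estimates the number of false target wins above that threshold.

set.seed(1)
target <- c(rnorm(300, mean = 3), rnorm(700))  # 300 correct + 700 incorrect PSMs
decoy  <- rnorm(1000)                          # decoy scores mimic incorrect matches

score <- pmax(target, decoy)                   # each spectrum keeps its best match
target_wins <- target >= decoy                 # did the target or the decoy win?

t <- 2                                         # an example score threshold
fdr_hat <- sum(!target_wins & score > t) / max(1, sum(target_wins & score > t))
fdr_hat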

After a brief introduction to the problem of tandem mass spectrum identification, we will explore the reasons why the mass spec community has been using this non-standard approach to controlling the FDR. We will then discuss how calibration can increase the number of correct discoveries and offer an alternative method for controlling the FDR in the presence of calibrated scores. We will conclude by arguing that our analysis extends to a more general setup than the mass spectrum identification problem.

Joint work with Bill Noble (University of Washington).