Statistics and Genomics Seminar
STAT 278B, Section 003

Spring 2019


Thursday, January 31st

Dissecting Gene Regulation with Machine Learning: Discoveries and Challenges
Professor Katie Pollard
Department of Epidemiology and Biostatistics, UC San Francisco, Gladstone Institute, and Chan-Zuckerberg Biohub

Machine learning is a popular statistical approach in many fields, including genomics. We and others have used a variety of supervised machine-learning techniques to predict genes, regulatory elements, 3D interactions between regulatory elements and their target genes, and the effects of mutations on regulatory element function. I will highlight a few of these studies, emphasizing the strengths and weaknesses of different predictive models and the biological insights gained via variable importance analysis. Then I will talk about some of our recent work exploring the limitations of popular machine-learning methods in genomics, where the biology underlying the data used to train the models frequently violates one or both parts of the independent and identically distributed (IID) assumption. The talk will conclude with some thoughts on modeling non-IID data and interpreting over-fit models, with the aim of improving the application of supervised learning to biological data and emphasizing the mechanistic insights gained from modeling over performance statistics per se.

Thursday, February 7th

Inverse RNA folding and Computational Riboswitch Detection
Professor Danny Barash
Department of Computer Science, Ben-Gurion University

The inverse RNA folding problem for designing sequences that fold into a given RNA secondary structure was introduced in the early 1990s in Vienna. Using a coarse-grain tree graph representation of the RNA secondary structure, we extended the inverse RNA folding problem to include constraints such as thermodynamic stability and mutational robustness, developing a program called RNAexinv. In the next step, we formulated a fragment-based approach to the design of RNA sequences that can be useful to practitioners in a variety of biological applications. In this shape-based design approach, specific RNA structural motifs with known biological functions are strictly enforced while others can possess more flexibility in their structure in favor of preserving physical attributes and additional constraints. Our program is called RNAfbinv (recently extended to incaRNAfbinv by incorporating a weighted sampling approach borrowed from incaRNAtion).

Detection of riboswitches in genomic sequences using structure-based methods, including the use of incaRNAfbinv, will also be discussed.

Thursday, March 14th

Integrated Analysis of Cancer Data: Multi-omic Clustering and Personalized Ranking of Driver Genes
Professor Ron Shamir
The Blavatnik School of Computer Science, Tel Aviv University

Large biological datasets are currently available, and their analysis has applications to basic science and medicine. While inquiry into each dataset separately often provides insights, integrative analysis may reveal more holistic, systems-level findings. We demonstrate the power of integrated analysis in cancer on two levels: (1) in analysis of one omic in many cancer types together, and (2) in analysis of multiple omics for the same cancer. At both levels we develop novel methods and observe a clear advantage to integration. We also describe a novel method for identifying and ranking driver genes in an individual's tumor and demonstrate its advantage over prior art.

Thursday, February 21st

Data-Adaptive Filtering of Untargeted LC-MS Metabolomics Data
Courtney Schiffman
Graduate Group in Biostatistics, UC Berkeley

Untargeted metabolomics datasets contain large proportions of uninformative features that can impede subsequent statistical analysis such as biomarker discovery and metabolic pathway analysis. Thus, there is a need for versatile and data-adaptive methods for filtering data prior to investigating the underlying biological phenomena. I will present a data-adaptive pipeline for filtering metabolomics data that are generated by liquid chromatography-mass spectrometry (LC-MS) platforms. The data-adaptive pipeline includes novel methods for filtering features based on blank samples, proportions of missing values, and estimated intra-class correlation coefficients. I will show an application of the filtering pipeline to an Acute Myeloid Leukemia (AML) metabolomics dataset generated in our laboratory, and demonstrate how our method surpasses current filtering methods in terms of removing poor-quality features and retaining high-quality ones. Finally, I will give an example of how proper feature filtering can improve subsequent statistical analyses, such as metabolic pathway analysis.
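As a rough illustration of the three filtering criteria named in the abstract (blank samples, proportion of missing values, and intra-class correlation), the following is a minimal sketch and not the speaker's pipeline. The fixed thresholds `blank_ratio`, `max_missing`, and `min_icc`, the function names, and the use of replicate groups to estimate a one-way ICC are all assumptions made for illustration; the talk describes choosing cutoffs data-adaptively, which this sketch does not attempt.

```python
from statistics import mean
from typing import List, Optional, Sequence


def icc_oneway(groups: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC for a balanced design.

    `groups` holds one list of k replicate measurements per subject
    (hypothetical layout, assumed here for illustration).
    """
    n = len(groups)
    k = len(groups[0])
    grand = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    return (msb - msw) / denom if denom > 0 else 0.0


def keep_feature(
    bio: Sequence[Optional[float]],
    blanks: Sequence[float],
    replicate_groups: Sequence[Sequence[float]],
    blank_ratio: float = 2.0,   # assumed cutoff, not from the talk
    max_missing: float = 0.5,   # assumed cutoff, not from the talk
    min_icc: float = 0.5,       # assumed cutoff, not from the talk
) -> bool:
    """Return True if a feature passes all three illustrative filters."""
    observed: List[float] = [v for v in bio if v is not None]
    if not observed:
        return False
    # Filter 1: too many missing values among biological samples.
    if 1 - len(observed) / len(bio) > max_missing:
        return False
    # Filter 2: abundance not clearly above the blank samples.
    if blanks and mean(observed) < blank_ratio * mean(blanks):
        return False
    # Filter 3: replicate measurements are not reproducible enough.
    return icc_oneway(replicate_groups) >= min_icc
```

A feature whose biological abundance is close to its blank abundance is dropped regardless of its ICC, since the blank check runs first.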

Thursday, February 28th

Multiple Forces Shape Microbial Community Structure in the Phyllosphere
Norma Morella
Department of Plant and Microbial Biology, UC Berkeley

As our knowledge of host-associated microbial communities (microbiomes) continues to deepen, there remain key unresolved questions across multiple systems. Among these is an understanding of the forces underlying the assembly of, selection within, and co-evolution among microbiota, all of which depend in part on microbiome transmission mode. My PhD thesis research has focused on characterizing the forces that shape bacterial communities of the phyllosphere (the above-ground surfaces of plants). The first part of this work investigates the importance of vertically transmitted (parent to offspring) microbes in seedling health. Then, I show how long-term microbiome experimental evolution can be used to better understand how microbiomes evolve on genetically distinct hosts and adapt to their environment over time. Lastly, I explore the role of bacteriophage viruses in shaping bacterial communities. Overall, this work helps disentangle the multiple forces shaping community structure in the phyllosphere and has broad implications in other host systems.

Thursday, March 7th

Close-Kin Genetic Methods to Infer Demography and Dispersal Patterns of Mosquitoes
Professor John Marshall
Division of Epidemiology and Biostatistics, UC Berkeley

Malaria, dengue, Zika and other mosquito-borne diseases continue to pose a major global health burden throughout much of the world, despite the widespread distribution of insecticide-based tools and antimalarial drugs. Consequently, there is interest in novel strategies to control these diseases, including the release of mosquitoes transfected with Wolbachia and engineered with CRISPR-based gene drive and disease-refractory systems. The safety and effectiveness of these strategies are critically dependent on a detailed understanding of mosquito demography and movement patterns at both fine and broad spatial scales, yet there are major gaps in our understanding of these. The declining cost of genome sequencing and novel methods for analyzing geocoded genomic data provide opportunities to address these knowledge gaps. In this talk, we discuss a new approach to infer fine-scale mosquito dispersal patterns and demographic parameters, such as population size and mating structure, by considering the information contained in a set of pairs of closely related individuals whose locations are known. These methods have previously been applied to fish such as tuna, sharks and coral trout, but have not yet been applied to insects. We propose in silico simulations of mosquito ecology and dispersal to determine sampling routines capable of quantifying known dispersal patterns and demographic parameters. The resulting models will be used to explore the potential impact of novel mosquito control interventions, and to inform biosafety and trial design considerations.

Thursday, March 21st

Data-driven Design for Computational Imaging Systems
Michael Kellman
Department of Electrical Engineering and Computer Sciences, UC Berkeley

Computational imaging systems marry the design of hardware and computation to create a new generation of modalities that image beyond what is currently possible. A computational imaging system's performance is fundamentally governed by how well the sought information is encoded in (experimental design) and decoded from (computational reconstruction) the measurements. In settings where both the encoding and decoding steps are non-linear---a prominent example is phase retrieval---analytical methods that assess the system's reconstruction performance become difficult to establish and might not necessarily result in improved designs.

In this talk, I will give an overview of our recent work on jointly learning aspects of the experimental design and computational reconstruction to optimize the performance of a computational imaging system. I will consider unrolling the iterations of a traditional model-based image reconstruction algorithm (e.g. compressed sensing or phase retrieval) to form a network whose layers are composed of linear (gradient-descent steps) and non-linear (proximal steps) components, making it possible to optimize the entire imaging pipeline. As an application, I will explain how a standard microscope can be transformed to image transparent samples without staining beyond its inherent resolution limit. In particular, I will demonstrate that one can drastically decrease the number of measurements needed to obtain super-resolved images by learning the design of the microscope.
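To make the idea of unrolling concrete, the following is a minimal sketch of a generic unrolled proximal-gradient (ISTA-style) reconstruction for a linear forward model with an L1 penalty. It is not the speaker's network: the linear model `A`, the soft-threshold proximal step, and the names `unrolled_ista`, `steps`, and `thresholds` are assumptions chosen for illustration. Each entry of `steps` and `thresholds` plays the role of one layer's parameters, which a learned design would tune by training rather than fix by hand.

```python
from typing import List, Sequence

Vector = List[float]


def matvec(A: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Compute A @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matvec_t(A: Sequence[Sequence[float]], r: Sequence[float]) -> Vector:
    """Compute A^T @ r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v: Sequence[float], t: float) -> Vector:
    """Proximal operator of t * ||.||_1 (the non-linear part of a layer)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def unrolled_ista(
    A: Sequence[Sequence[float]],
    y: Sequence[float],
    steps: Sequence[float],
    thresholds: Sequence[float],
) -> Vector:
    """Run one layer per (step, threshold) pair, starting from zero."""
    x: Vector = [0.0] * len(A[0])
    for step, thr in zip(steps, thresholds):
        # Gradient step on 0.5 * ||A x - y||^2.
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec_t(A, residual)
        x = [xi - step * g for xi, g in zip(x, grad)]
        # Proximal step for the L1 penalty.
        x = soft_threshold(x, step * thr)
    return x
```

Because the number of layers is fixed and every operation is differentiable almost everywhere, the per-layer parameters (and, in the setting of the talk, parameters of the measurement design) could in principle be optimized end to end with a gradient-based trainer.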

Thursday, April 4th

Statistical and Computational Challenges in Conformational Biology
Professor Mark Segal
Department of Epidemiology and Biostatistics, UC San Francisco

Chromatin architecture is critical to numerous cellular processes including gene regulation, while conformational disruption can be oncogenic. Accordingly, discerning chromatin configuration is of basic importance; however, this task is complicated by a number of factors including scale, compaction, dynamics, and inter-cellular variation.

The recent emergence of a suite of proximity ligation-based assays, notably Hi-C, has transformed conformational biology with, for example, the elicitation of topological and contact domains providing a high resolution view of genome organization. Such conformation capture assays provide proxies for pairwise distances between genomic loci which can be used to infer 3D coordinates, although much downstream analysis bypasses this reconstruction step.
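As a small illustration of how contact counts can serve as proxies for pairwise distances, the sketch below converts a contact matrix to distances and scores a candidate set of 3D coordinates against them. The inverse power-law conversion with exponent `alpha` is a commonly used assumption and is not stated in the abstract; the function names are hypothetical, and the principal-curve reconstruction mentioned in the talk is not shown.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def contacts_to_distances(
    contacts: Sequence[Sequence[float]], alpha: float = 1.0
) -> List[List[Optional[float]]]:
    """Map contact counts to distance proxies via d = c ** (-alpha).

    Pairs with zero contacts are left as None (treated as unobserved).
    """
    n = len(contacts)
    dist: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                dist[i][j] = 0.0
            elif contacts[i][j] > 0:
                dist[i][j] = contacts[i][j] ** (-alpha)
    return dist


def stress(coords: Sequence[Point], dist: Sequence[Sequence[Optional[float]]]) -> float:
    """Sum of squared differences between embedded and target distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            target = dist[i][j]
            if target is not None:
                total += (math.dist(coords[i], coords[j]) - target) ** 2
    return total
```

A reconstruction method would search for coordinates that make `stress` small; multidimensional scaling is one standard way to do that.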

After demonstrating advantages deriving from obtaining 3D genome reconstructions, in particular from superposing genomic attributes on a reconstruction and identifying extrema ('3D hotspots') thereof, we showcase methodological challenges surrounding such analyses and advance a novel reconstruction approach based on principal curves. Open issues highlighted include (i) performing and synthesizing reconstructions from single-cell assays, (ii) devising rotation-invariant methods for 3D hotspot detection, (iii) assessing genome reconstruction accuracy, and (iv) averting reconstruction uncertainty by direct integration of Hi-C data and genomic features. By using p-values from (epi)genome-wide association studies as the feature, the latter approach provides a conformational lens for viewing GWAS findings.

Thursday, April 18th

Unlocking the Power of Continuity in Single Cell RNA-Seq: Differential Gene Expression Along Developmental Trajectories
Hector Roux de Bezieux
Graduate Group in Biostatistics, UC Berkeley

Trajectory inference is often used with single-cell RNA-seq data to study dynamic changes in gene expression levels during, e.g., cell cycle, differentiation, or cellular activation. Downstream of trajectory inference, researchers are often interested in discovering genes that are associated with a particular lineage in the trajectory. Furthermore, genes that are differentially expressed between developmental/activational lineages might be highly relevant to the system under study. A newly developed method allows flexible inference of (i) within-lineage differential expression by detecting associations between gene expression and pseudotime over an entire lineage, or comparing gene expression between points/regions within the lineage; and (ii) between-lineage differential expression by comparing gene expression between lineages over the entire lineages or at specific points/regions. This talk will focus on the method and its applications, demonstrating its modularity and unique capacity to deliver powerful insights into complex biological phenomena.

Thursday, April 25th

The Evolution of Gene Expression Across Species, Tissues, and Cell Types
Professor Peter Sudmant
Department of Integrative Biology, UC Berkeley

Phenotypic differences among species are often driven by evolutionary adaptations in gene expression, yet many developmental programs and pathways are deeply conserved. Many studies have sought to understand the evolution of gene expression among homologous tissues; however, some of these studies have reached disparate conclusions. In particular, the question of whether gene expression signatures are more similar between the same tissues of different species or between different tissues of the same species has been a source of confusion in the literature. Here I describe a comparative framework and set of analyses that demonstrate how gene expression datasets can be compared across different species, tissues, and experiments. These analyses show that inter-tissue gene expression distances within a species are conserved and highlight the pitfalls of previous analyses and methodologies. I will discuss ongoing extensions of these approaches to single-cell gene expression data and to analyses of splicing.

Thursday, May 9th

Latent Privacy Risk in Omics Data Manifesting over Time
Dr. Zhiqiang Hu
Department of Plant and Microbial Biology, UC Berkeley

Privacy risk from an individual's genome has garnered increasing attention. Sharing genomes without personal identifiers is common practice. However, recent research studies and forensics have underscored the ability to re-identify a person using genomic relatives and quasi-identifiers, such as sex, birthdate and zip code. The additional availability of omics data, such as transcriptomics or methylation data, has implications for privacy, as it may also be linked to the genome, potentially allowing a privacy breach.

Gene expression data can be linked to the genome based on genotypes inferred from expression QTLs (eQTLs). Splicing data has been reported as safe, as splicing QTLs (sQTLs) are 200 times less abundant than eQTLs, but we demonstrate that splicing data can be reliably linked to a single genome, from a population of thousands to a million, depending on the measured tissues. Such linking for DNase hypersensitive sites and gene expression data now enables the identification of a genome from a larger pool, on the scale of the world population. This new risk has arisen due to growing biological knowledge and data. Our study implies further hidden privacy risks in existing data, which will only manifest over time. While it is accepted that the security of many cryptographic methods degrades over time, the need to preserve individuals' genomic privacy for their lifetime and beyond (for descendants) poses unique challenges to the effective sharing of high-throughput molecular data.