PB HLTH 295, Section 001
Statistics and Genomics Seminar

Fall 2017


Thursday, August 24th

Case-Control Studies of Gene-Environment Interaction and Latent Pathology
Professor Iryna Lobach
Department of Epidemiology and Biostatistics, UC San Francisco

Analyses of association between the genetic basis and disease phenotype have the potential to provide valuable clues to the underlying etiology of complex phenotypes. Better clues might be generated by analyzing association with the underlying pathological diagnosis, which might be latent in a substantial proportion of healthy controls; in addition, distinct pathologic mechanisms might present with the same clinical phenotype. We devise a pseudolikelihood approach that examines the role of gene-environment interaction in a pathologic phenotype that is latent in a substantial proportion of healthy controls and/or cannot be discriminated in cases. Epidemiologic studies provide frequencies of the pathologic diagnosis within clinical phenotypes. Simulation studies compare estimation and inference based on the proposed method with those based on the usual logistic regression. Application is demonstrated in analyses of prostate cancer and Alzheimer's disease.

Thursday, August 31st

Polymorphic Retrovirus Derived Mobile Elements in the Human Genome: Drivers of Complex Disease?
Professor Stephen Francis
Department of Epidemiology and Biostatistics, UC San Francisco

Human endogenous retrovirus type K (HERV-K)-like elements (HKLE) are thought to have aided the evolution of immuno-diversity in humans, yet this enhanced immune fitness may cause increased risk for immune-related disorders. To investigate this hypothesis, we developed a computational pipeline, HERVnGoSeq, to identify insertionally polymorphic HKLEs, including intact and partial HERV-K, solo HERV-K LTRs, and composite SVA elements with incorporated HERV-K LTR, from whole genome sequencing data. We detected 990 polymorphic HKLEs among 2,504 diverse individuals. Integration of expression data from lymphoblastoid cells from a subset of subjects (n=442) revealed 345 polymorphic HKLEs that function as expression quantitative trait loci (eQTL) for genes enriched in immune and epigenetic regulation. The strongest cis-eQTL HKLE insertion is associated with HLA-DRB1 expression and co-occurs with the HLA-DRB1*1501 allele. Three SNPs in linkage disequilibrium with this HKLE are robust, established risk loci for multiple sclerosis. Polymorphic HKLE insertions likely contribute to the heritability of human traits and risk of complex diseases.

Thursday, September 7th

Characterizing Stromal Cells in Glioma and their Impact on Survival
Dr. Josie Hayes
Department of Neurological Surgery, UC San Francisco

Fibroblasts in the tumor microenvironment can promote tumor growth, angiogenesis, inflammation and metastases. While stromal cells have been well studied in other cancer types, there is a paucity of reports of these cells in brain tumors and doubt as to whether these cells even exist in the brain. This seminar reports the isolation and genomic characterization of stromal cells localized prominently around abnormal tumor vasculature in adult glioma and shows their prognostic impact in a subset of tumors.
DNA methylation arrays, exome sequencing and both single cell and bulk RNA sequencing were used to characterize cultures of these cells alongside their matched bulk tumor tissue and cultured tumor cells. Profiling of archetypal protein and RNA markers indicates that stromal cells localize around abnormal vasculature specifically in high-grade glioma, and an mRNA stromal cell signature predicts survival in a subset of patients delineated by the World Health Organization.
The seminar will conclude with the interesting finding that these cells have distinct clonal mutations of their own and will show how the mutations may have arisen using their trinucleotide mutation context and clonal patterns.

Thursday, September 14th

Studying the 3D Architecture of the Genome: The Case Study of P. falciparum
Dr. Nelle Varoquaux
Department of Statistics, UC Berkeley, and Berkeley Institute for Data Science

The spatial and temporal organization of the 3D structure of chromosomes is thought to have an important role in genomic function, but is still poorly understood. For example, the relative paucity of specific transcription factors and the abundance of chromatin remodeling enzymes in the deadly human parasite P. falciparum point towards the involvement of global and local chromatin structure in controlling gene expression.
Recent advances in chromosome conformation capture (3C) technologies, initially developed to assess interactions between specific pairs of loci, now allow contacts to be measured simultaneously on a genome scale, paving the way for more systematic and genome-wide analysis of the 3D architecture of the genome. These new Hi-C techniques result in a genome-wide contact map, a matrix indicating the contact frequency between pairs of loci.
I will present how we built 3D models of genome folding from contact counts at different timepoints of the life of P. falciparum, and how we used those to gain insights into the relationship between 3D structure and gene regulation.
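
As a minimal illustration of the contact map described above (not the speaker's pipeline), the following Python sketch bins pairs of interacting genomic positions into a symmetric matrix of contact counts. The function name `contact_map`, its parameters, and the toy positions are all assumptions made for this example; real Hi-C processing also involves read mapping, filtering, and normalization, which are omitted here.

```python
import numpy as np

def contact_map(pairs, chrom_length, resolution):
    """Bin pairs of interacting positions into a symmetric contact-count matrix.

    pairs: iterable of (position_a, position_b), each in [0, chrom_length).
    resolution: bin width in base pairs (hypothetical parameter name).
    """
    n_bins = -(-chrom_length // resolution)  # ceiling division
    counts = np.zeros((n_bins, n_bins), dtype=np.int64)
    pairs = np.asarray(pairs, dtype=np.int64)
    i = pairs[:, 0] // resolution
    j = pairs[:, 1] // resolution
    # np.add.at accumulates correctly when the same bin pair occurs repeatedly
    np.add.at(counts, (i, j), 1)
    # Mirror off-diagonal contacts so that the matrix is symmetric
    off_diagonal = i != j
    np.add.at(counts, (j[off_diagonal], i[off_diagonal]), 1)
    return counts

# Toy data: four made-up contacts on a 50 kb chromosome, binned at 10 kb
toy_pairs = [(1_000, 42_000), (2_500, 41_000), (15_000, 18_000), (30_000, 12_000)]
cmap = contact_map(toy_pairs, chrom_length=50_000, resolution=10_000)
print(cmap)
```

Entry (i, j) of `cmap` is the number of observed contacts between bins i and j; counts of this kind are the input from which 3D models are inferred.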

Thursday, September 21st

Project Jupyter as a Toolkit for Research and Education
Professor Fernando Perez
Department of Statistics, UC Berkeley

Project Jupyter, evolved from the IPython environment, provides a platform for interactive computing that is widely used today in research, education, journalism and industry. The core premise of the Jupyter architecture is to design tools around the experience of interactive computing. It provides an environment, protocol, file format and libraries optimized for the computational process when there is a human in the loop, in a live iteration with ideas and data assisted by the computer.

I will discuss how the overall architecture of Jupyter supports a variety of workflows that are central to the processes of research and education. I will also illustrate some new developments on the platform and demonstrate some of its new capabilities.

Thursday, September 28th

Unsupervised Clustering and Epigenetic Classification of Single Cells
Dr. Timothy Daley
Departments of Statistics and Bioengineering, Stanford University

Single cell ATAC-seq (scATAC-seq) technologies offer a unique opportunity to interrogate cellular level epigenetic heterogeneity through patterns of variability in open chromatin. Unfortunately, current analysis of scATAC-seq requires prior knowledge of the underlying population, through techniques such as FACS sorting of subpopulations and bulk ATAC-seq. This has limited the application of scATAC-seq to unknown heterogeneous cellular populations. We present a method that solely utilizes scATAC-seq data for the unsupervised clustering of cells and determination of cluster-specific open regions. Our proposed method allows us to discover regions of open chromatin specific to cell identity and use them to identify the transcription factors and genes that drive variation in cell identity.

Thursday, October 5th

Software for Distributed Computation on Medical Databases
Dr. Balasubramanian Narasimhan
Department of Statistics, Stanford University

Combining the information latent in distributed medical databases promises to personalize medical care by enabling reliable, stable modeling of outcomes with rich feature sets (including patient characteristics and treatments received). However, there are barriers to aggregation of medical data, due to lack of standardization of ontologies, privacy concerns, proprietary attitudes toward data, and a reluctance to give up control over end use. Statisticians have long known that aggregation of data is not always necessary for model fitting. In models based on maximizing a likelihood, the computations can be distributed, with aggregation limited to the intermediate results of calculations on local data, rather than raw data. We describe a set of software tools that allow the rapid assembly of a collaborative computational project, based on the flexible and extensible R statistical software and other open source packages, that can work across a heterogeneous collection of database environments, with full transparency to allow local officials concerned with privacy protections to validate the safety of the method.
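
The idea that likelihood-based fitting needs only intermediate results from each site can be sketched as follows. This is an illustrative example, not the software described in the talk (which is based on R); it is written in Python, and the function names `local_summaries` and `fit_distributed`, the simulated sites, and the choice of logistic regression are all assumptions. Each site returns only the gradient and Hessian of its local log-likelihood, and a coordinator sums them to take Newton-Raphson steps.

```python
import numpy as np

def local_summaries(X, y, beta):
    """Run at one site: gradient and Hessian of the local logistic log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (y - p)
    hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
    return gradient, hessian

def fit_distributed(sites, n_features, n_iter=25):
    """Run by the coordinator: Newton-Raphson using only summed site summaries."""
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        summaries = [local_summaries(X, y, beta) for X, y in sites]
        gradient = sum(g for g, _ in summaries)
        hessian = sum(h for _, h in summaries)
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

# Simulated data held at three hypothetical sites (raw rows never leave a site)
rng = np.random.default_rng(0)
true_beta = np.array([-0.5, 1.0, 0.8])
sites = []
for n in (300, 500, 400):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
    sites.append((X, y))

beta_distributed = fit_distributed(sites, n_features=3)

# For comparison only: the fit obtained if all rows were pooled in one place
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled = fit_distributed([(X_all, y_all)], n_features=3)
print(beta_distributed, beta_pooled)
```

Because the log-likelihood is a sum over observations, the summed site-level gradients and Hessians equal those of the pooled data, so the two fits agree up to floating-point error.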

This is joint work with several collaborators including Philip Lavori and Daniel Rubin.

Thursday, October 12th

Tensor Response Regression and Neuroimaging Analysis
Professor Lexin Li
Division of Biostatistics, UC Berkeley

Classical regression models treat variables (predictor or response) as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate variables of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of those data due to their ultrahigh dimensionality as well as complex structure. In this talk, we propose a family of tensor response regression models that efficiently exploit the special structure of tensors. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. Fast and highly scalable estimation algorithms are proposed, numerous forms of regularization are studied, and asymptotic properties are obtained. Effectiveness of the new methods is demonstrated on real neuroimaging data analysis.

Thursday, October 19th

Selection-Adjusted Estimation of Effect Sizes: An Application to eQTLs
Snigdha Panigrahi
Department of Statistics, Stanford University

My talk will describe new methods to provide adjusted estimates for effect sizes in GWAS (genome-wide association studies) and eQTL (expression quantitative trait loci) studies after a genome-wide selection. The starting point in such studies is an exhaustive collection of genetic variants, but only a small proportion of the genome, or a few effects of appreciable size, are believed to govern an outcome. Aligned with the goal of adaptive inference, a line of work including Bogdan et al. (2015); Barber et al. (2015); Candes et al. (2016) has provided methods that guarantee FDR control over the selected genetic variants. While FDR is now a well-accepted global error criterion over the reported discoveries, a daunting challenge is to calibrate the strengths of these discovered associations. Estimates that ignore the genome-wide selection preceding inference based on the same data can lead to misleading conclusions about the magnitudes of the true underlying associations. An approach to overcome selection bias is the classical concept of data-splitting in Cox (1975), that is, collecting new data or using hold-out data for inference. However, this might still be a luxury in many GWAS experiments that are known to suffer from low power owing to small sample sizes. In this talk, I will discuss methods that allow the geneticist to use the same data set for selection (discovery) and, later, to provide inference on the magnitudes of effects corresponding to these interesting findings. Aiming for consistent point estimates and intervals with target coverage for these effect sizes, my methods are modeled on the truncated approach to selective inference introduced in Lee et al. (2016); Fithian et al. (2014). I will describe the computational bottleneck in performing conditional inference and put forth a new set of selective-inferential tools to achieve this goal with higher statistical power in comparison to the aforementioned references.
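
The selection bias discussed above, and the data-splitting remedy, can be seen in a small simulation. The Python sketch below is only an illustration of the problem, not the selective-inference methods of the talk; the number of variants, the threshold, and the unit-variance noise model are assumptions chosen for simplicity. All true effects are set to zero, so any apparent effect among the selected variants is an artifact of selection.

```python
import numpy as np

rng = np.random.default_rng(1)
n_variants = 20_000
threshold = 3.0  # select variants whose |estimate| exceeds this on discovery data

# All true effects are zero, so every nonzero estimate is pure noise
true_effects = np.zeros(n_variants)

# Two independent data halves, each giving a unit-variance estimate per variant
discovery = true_effects + rng.normal(size=n_variants)
holdout = true_effects + rng.normal(size=n_variants)

# Genome-wide selection on the discovery half
selected = np.abs(discovery) > threshold

# Naive: re-use the discovery estimates for the selected variants
naive = np.abs(discovery[selected]).mean()
# Data splitting: estimate the selected variants on the untouched hold-out half
split = np.abs(holdout[selected]).mean()

print(f"selected variants:           {selected.sum()}")
print(f"naive mean |estimate|:       {naive:.2f}")
print(f"hold-out mean |estimate|:    {split:.2f}")
```

The naive estimates exceed the threshold by construction even though no true effect exists, while the hold-out estimates do not carry this bias; the price of splitting is that only part of the data is used at each stage.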


Thursday, October 26th

Single-Cells, Simulation and Kidneys in a Dish
Luke Zappia
Bioinformatics, University of Melbourne and Murdoch Childrens Research Institute

Single-cell RNA sequencing (scRNA-seq) is rapidly becoming a tool of choice for biologists wishing to investigate gene expression at greater resolution, particularly in areas such as development and differentiation. Single-cell data presents an array of bioinformatics challenges: data is sparse (for both biological and technical reasons), quality control is difficult, and it is unclear how to replicate measurements. As scRNA-seq datasets have become available, so has a plethora of analysis methods. We have catalogued software tools that implement these methods in the scRNA-tools database (www.scRNA-tools.org). Evaluation of analysis methods relies on having a truth to test against or deep biological knowledge to interpret the results. Unfortunately, current scRNA-seq simulations are frequently poorly documented, not reproducible and do not demonstrate similarity to real data or experimental designs. In this talk I will present Splatter, a Bioconductor package for simulating scRNA-seq data that is designed to address these issues. Splatter provides a consistent, easy-to-use interface to several previously published simulations, allowing researchers to estimate parameters, produce synthetic datasets and compare how well they replicate real data. Splatter also includes Splat, our own simulation model. Based on a gamma-Poisson hierarchical model, Splat includes additional features often seen in scRNA-seq data, such as dropout, and can be used to simulate complex experiments including multiple cell types, differentiation lineages and multiple batches. I will also briefly discuss an analysis of a complex kidney organoid dataset, showing how more cells and different levels of clustering help to reveal greater biological insight.
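
To make the phrase "gamma-Poisson hierarchical model with dropout" concrete, here is a generic Python sketch. It is not the Splat model or the Splatter API: the function `simulate_counts`, every parameter name and default value, and the logistic form of the dropout probability are assumptions chosen for illustration, and features such as library size, outlier genes, cell groups and batches are left out.

```python
import numpy as np

def simulate_counts(n_genes, n_cells, shape=0.6, rate=0.3, dispersion=0.1,
                    dropout_midpoint=0.0, dropout_slope=-1.0, seed=0):
    """Simulate a genes x cells count matrix from a gamma-Poisson hierarchy."""
    rng = np.random.default_rng(seed)
    # Level 1: each gene's mean expression is drawn from a gamma distribution
    gene_means = rng.gamma(shape, 1.0 / rate, size=n_genes)
    # Level 2: cell-specific means vary around the gene mean (mean-preserving gamma)
    cell_means = rng.gamma(1.0 / dispersion, gene_means[:, None] * dispersion,
                           size=(n_genes, n_cells))
    # Level 3: observed counts are Poisson given the cell-specific mean
    counts = rng.poisson(cell_means)
    # Dropout: probability of a forced zero decreases with log mean expression
    log_means = np.log(cell_means + 1e-6)
    p_drop = 1.0 / (1.0 + np.exp(-dropout_slope * (log_means - dropout_midpoint)))
    keep = rng.random(counts.shape) >= p_drop
    return counts, counts * keep

true_counts, observed_counts = simulate_counts(n_genes=200, n_cells=50)
print("zero fraction before dropout:", (true_counts == 0).mean())
print("zero fraction after dropout: ", (observed_counts == 0).mean())
```

Marginally, a Poisson count with a gamma-distributed mean follows a negative binomial distribution, which is why this hierarchy produces the overdispersion typical of count data; the dropout step then adds further zeros, mostly for lowly expressed genes.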

Thursday, November 2nd

Single Cell Dynamics in High-Throughput Time-Lapse Screening Data: Developing Generic Methods for Data Re-use and Comparative Analyses
Dr. Alice Schoenauer Sebag
UC San Francisco

Biological screens test large sets of experimental conditions with respect to their specific biological effect on living systems. Technical and computational progress has made it possible to perform such screens at a large scale - up to hundreds of thousands of experiments. Such approaches have applications in, for example, functional genomics (e.g. RNA interference screens) and pharmacology (drug screens).

Live cell imaging is an excellent tool to study in detail the consequences of chemical perturbation on a given biological process. However, the analysis of live cell screens demands the combination of robust computer vision methods and quality control procedures, and efficient statistical approaches for the detection of significant effects.

In this talk, two approaches to address these challenges will be described: the first part will present the first multivariate workflow for the study of single cell motility in such large-scale data, and the second will describe the development of a new distance for drug target inference by in silico comparisons of parallel siRNA and drug screens. The developed frameworks are applied to publicly available High-Content Screening (HCS) data, demonstrating their applicability and the benefits of re-mining HCS data.

Thursday, November 16th

Deconstructing Stem Cell Hierarchies and Discriminating Transient Cell States During Homeostatic Tissue Maintenance and Injury-Induced Regeneration
Dr. Russell Fletcher
Department of Molecular and Cell Biology, UC Berkeley

Tissue homeostasis and regeneration are mediated by programs of adult stem cell renewal and differentiation. A fundamental challenge in stem cell biology is to define both the cell fate potential of a given stem cell and where cell fates are specified along a developmental trajectory. Moreover, defining detailed lineage trajectory maps and discriminating stem cell states are necessary for identifying the regulatory networks that govern cell fate transitions. Our group has been applying single-cell level techniques to investigate how stem cell behavior differs during uninjured, homeostatic tissue maintenance and injury-induced regeneration in the olfactory epithelium, a sensory neural tissue. In this talk, I will present recent work aimed at elucidating stem cell lineage trajectories and resolving cell-type heterogeneity in these different contexts. In addition to its utility in validation, I will argue that integrating transgenic lineage tracing approaches with single-cell RNA-sequencing (scRNA-seq) analyses from the outset provides greater resolution and helps solve challenges associated with predicting stem cell lineages. For example, by time-stamping cells with transgenic labeling, the lineage trajectory can be resolved even if there are sudden, drastic changes in transcription that defy the concept of a cell lineage as a continuum of gradual gene expression changes. In specific situations, this integrated approach can also allow one to resolve the problem of looping trajectories, which is inherent to stem cell self-renewal. Using this approach, we have been able to map stem cell lineage trajectories, identify transient stem cell states, and demonstrate that these transient states are temporal windows during which cell fate specification occurs.
If time allows, I will discuss our findings on the molecular mechanisms that control olfactory stem cell fate choice and how we have used scRNA-seq to pinpoint the defect in lineage progression resulting from a mutation in a key stem cell transcription factor.

Thursday, December 7th

Data-driven Selection of Normalization Methods for Single-Cell RNA-Seq
Michael Cole
Department of Physics, UC Berkeley

Due to systematic measurement biases, data normalization is an essential pre-processing step in the analysis of single-cell RNA sequencing data. While a variety of normalization procedures are available for bulk RNA-seq, there can be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. The choice of normalization method can have a large impact on the results of downstream analyses (e.g., clustering, inference of cell lineages, differential expression analysis), and thus it is critically important to assess the performance of competing methods in order to select a suitable procedure for the study at hand.

I will discuss scone – a framework for assessing a wide range of scRNA-seq normalization procedures based on a comprehensive set of data-driven performance metrics. We have demonstrated the effectiveness of scone on a selection of scRNA-seq datasets across a variety of protocols, ranging from plate- to droplet-based methods. We find that scone is able to correctly rank normalization methods according to their performance, leading to higher agreement with independent validation data.