Statistics and Genomics Seminar
STAT 278B, Section 003

Spring 2018





Thursday, January 25th

Cluster Randomized Test Negative Designs: Inference and Application to Vector Trials to Eliminate Dengue Fever
Professor Nicholas P. Jewell
Division of Biostatistics and Department of Statistics, UC Berkeley

The successful introduction of the intracellular bacterium Wolbachia into Aedes aegypti mosquitoes enables a practical approach to dengue prevention through the release of Wolbachia-infected mosquitoes. Wolbachia reduces dengue virus replication in the mosquito and, once established in the mosquito population, may provide a long-term and sustainable approach to reducing or eliminating dengue transmission. A critical next step is to assess the efficacy of Wolbachia deployments in reducing dengue virus transmission in the field. We describe and discuss the statistical design of a large-scale cluster randomised test-negative parallel arm study to measure the efficacy of such interventions. A comparison of permutation-based inferential approaches with model-based methods will be described. Extensions to allow for individual covariates, and alternative designs such as the stepped wedge approach, will also be briefly introduced.
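
As a rough illustration of permutation inference in this setting (a minimal sketch under assumed data structures, not the trial's actual analysis code), treatment labels can be permuted across clusters and a test-negative log odds ratio recomputed under each permutation:

```python
# A minimal sketch, assuming per-participant arrays and a cluster-level
# treatment map; this is illustrative, not the study's analysis code.
import numpy as np

rng = np.random.default_rng(0)

def log_odds_ratio(treated, positive):
    """Log odds ratio of testing dengue-positive for treated vs untreated
    participants (assumes all four cells of the 2x2 table are non-empty)."""
    a = np.sum(treated & positive)
    b = np.sum(treated & ~positive)
    c = np.sum(~treated & positive)
    d = np.sum(~treated & ~positive)
    return np.log((a * d) / (b * c))

def cluster_permutation_pvalue(cluster, treated_cluster, positive, n_perm=10000):
    """Two-sided p-value, permuting treatment assignment across clusters.

    cluster         : array of cluster labels, one per participant
    treated_cluster : dict mapping cluster label -> 0/1 treatment indicator
    positive        : boolean array, True if the participant is test-positive
    """
    labels = sorted(treated_cluster)
    assign = np.array([treated_cluster[c] for c in labels], dtype=bool)
    treated = np.array([treated_cluster[c] for c in cluster], dtype=bool)
    obs = log_odds_ratio(treated, positive)

    null = np.empty(n_perm)
    for i in range(n_perm):
        lookup = dict(zip(labels, rng.permutation(assign)))
        treated_i = np.array([lookup[c] for c in cluster], dtype=bool)
        null[i] = log_odds_ratio(treated_i, positive)
    return np.mean(np.abs(null) >= abs(obs))
```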


Thursday, February 15th

Enable Precision Data for Precision Medicine
Dr. Jun Ye
President and CEO, Sentieon

Sentieon (www.sentieon.com), incorporated in 2014, develops and supplies a suite of bioinformatics secondary analysis tools that process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency. Currently released products include DNAseq and DNAscope for germline variant calling, and TNseq and TNscope for tumor-normal somatic variant detection. The Sentieon tools are easily scalable, easily deployable, easily upgradable, software-only solutions. They achieve their efficiency and consistency through optimized computing algorithm design and enterprise-strength software implementation, and achieve high accuracy using the industry's most validated mathematical models. Sentieon products have won top awards at multiple precisionFDA challenges, and ranked first on the most recent ICGC-TCGA DREAM Mutation Calling challenge leaderboard in all three categories (SNV, indel, SV). We strive to enable precision genomics data for precision medicine.


Thursday, February 22nd

Resolving Whole Organism Cell Fate with CRISPR/Cas9
Dr. Aaron McKenna
Department of Genome Sciences, University of Washington, Seattle

Multicellular organisms develop by way of a lineage tree, a series of cell divisions that give rise to cell types, tissues, and organs. However, our knowledge of the cell lineage and its determinants remains extremely fragmentary for nearly all species. This includes all vertebrates, as well as arthropods such as Drosophila, in which cell lineage varies between individuals, embryos and organs are often visually inaccessible, and progenitor cells disperse by long-distance migration. We recently pioneered a novel paradigm for recording cell lineage and other aspects of developmental history that has the potential to fundamentally transform our understanding of vertebrate biology. In brief, we engineer cells to stochastically introduce mutations at specific locations in the genome during development. The resulting patterns of mutations, which can be efficiently queried by massively parallel sequencing, can be used to reconstruct lineage and, more generally, to determine the “molecular histories” of individual cells on an organism-wide scale. Here we demonstrate our technique at the single-cell level in a variety of model organisms, including zebrafish and fly, tracing the lineage of tens of thousands of cells within individual organisms and organ systems.
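
As a rough sketch of the reconstruction idea (my illustration, not the speaker's pipeline; the toy data and clustering choices are assumptions), cells that share edits are taken to share ancestry, so a tree can be recovered by hierarchical clustering on the dissimilarity of edit profiles:

```python
# A minimal sketch, assuming a binary cell-by-target edit matrix.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Toy data: rows = cells, columns = CRISPR target sites; 1 = edit observed.
edits = np.array([
    [1, 1, 0, 0],   # cell 0
    [1, 1, 1, 0],   # cell 1 (shares early edits with cell 0)
    [0, 0, 0, 1],   # cell 2 (independent branch)
    [0, 0, 1, 1],   # cell 3
], dtype=bool)

dist = pdist(edits, metric="jaccard")    # fraction of non-shared edits
tree = linkage(dist, method="average")   # agglomerative lineage tree
print(dendrogram(tree, no_plot=True)["ivl"])   # leaf order of the tree
```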


Thursday, March 1st

Inferring Gene Interactions and Functional Modules Beyond Standard Statistical Models
Professor Haiyan Huang
Department of Statistics, UC Berkeley

Identifying gene interactions has been one of the major tasks in understanding biological processes. However, due to the difficulty of characterizing and inferring different types of biological gene relationships, as well as several computational issues arising from high-dimensional biological data, finding groups of interacting genes remains challenging. In this talk, I will introduce our recent work on identifying higher-level gene-gene interactions (i.e., gene group interactions) by evaluating conditional dependencies between genes, i.e., the relationships between genes after removing the influences of a set of other functionally related genes. The technique involves performing sparse canonical correlation analysis with repeated subsampling and random partition, and is particularly powerful for evaluating conditional dependencies when the correct dependent gene sets are unknown or only partially known. Used effectively, it can recover gene relationships that would otherwise be missed by standard methods. Comparisons with other methods on simulated and real data show that this method achieves considerably lower false positive rates. In addition, I will discuss ongoing work on using a bagged semi-supervised clustering approach to study changes in the membership of functional gene pathways in response to genetic or phenotypic variation.
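
As a rough sketch of the conditional-dependency idea (not the authors' sparse CCA implementation; the conditioning scheme and function names are illustrative), the association between two genes can be re-evaluated after projecting out a conditioning gene set, with repeated subsampling used to stabilise the estimate:

```python
# A minimal sketch, assuming samples-by-genes expression arrays.
import numpy as np

rng = np.random.default_rng(1)

def conditional_corr(x, y, Z):
    """Correlation of genes x and y after projecting out the conditioning
    set Z (a 2-D samples-by-genes array), plus an intercept."""
    Q, _ = np.linalg.qr(np.column_stack([np.ones(len(x)), Z]))
    rx = x - Q @ (Q.T @ x)   # residual of x given Z
    ry = y - Q @ (Q.T @ y)   # residual of y given Z
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

def subsampled_conditional_corr(x, y, Z, n_sub=200, frac=0.8):
    """Average the conditional correlation over random subsamples."""
    n = len(x)
    vals = []
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        vals.append(conditional_corr(x[idx], y[idx], Z[idx]))
    return float(np.mean(vals))
```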


Thursday, March 8th

Exploring the Potential of Exome Sequencing in Newborn Screening
Dr. Aashish Adhikari
Department of Plant and Microbial Biology, UC Berkeley

The NBSeq project is evaluating the effectiveness of whole exome sequencing (WES) for detecting inborn errors of metabolism (IEM) in newborn screening (NBS). A total of 1,216 de-identified archived dried blood spots from MS/MS true positive and false positive cases previously identified by the California NBS program were sequenced and analyzed using customized variant interpretation pipelines developed to address the distinct requirements of NBS. The sensitivity of causal mutation detection varied across metabolic disorders. Overall, exomes failed to flag 1 in 8 affected individuals, highlighting the limitations of our genetic understanding of even long-studied classic Mendelian disorders. Incorporation of copy number variant detection and variant curation in the interpretation pipeline, as well as an integrative review of genetic, biochemical, and clinical data, improved overall sensitivity. In some cases, exomes confidently identified disorders different from the metabolic center diagnoses, suggesting that sequencing information would have been valuable for proper clinical diagnosis in those cases. While still not sufficiently specific on its own for screening of most IEMs, WES can facilitate timely and more precise case resolution.


Thursday, March 15th

Putting the Pieces Together: Modeling Macromolecular Structure by Integrating Sparse, Noisy, Ambiguous and Incoherent Biophysical Data
Dr. Daniel Saltzberg
Department of Bioengineering and Therapeutic Sciences, School of Pharmacy and Medicine, UCSF

Understanding the structure of macromolecules and macromolecular complexes gives insight into how these machines function, are designed, have evolved, and can be modulated. These systems, however, tend to be too large, dynamic, or heterogeneous for direct structural methods such as NMR, X-ray crystallography, and electron microscopy to provide an unambiguous assignment of atomic structure. We instead apply an integrative approach in which data from indirect, noisy, and low-information biophysical methods, such as FRET, chemical crosslinking, and solution X-ray scattering, together with other information such as statistical potentials and force fields, are converted into individual spatial restraints that are combined into a single scoring function that ranks prospective structural models. Ideally, this scoring function is constructed in a Bayesian manner, so that we can directly compute the likelihood of a proposed structure given the available data and prior information. Our approach comprises four steps: 1) information gathering; 2) design of model representation and data restraints; 3) sampling of model configurations; and 4) assessment of sampling exhaustiveness and model validation. The result is an ensemble of model configurations that explain all data and information simultaneously, while the breadth, or precision, of the ensemble represents our uncertainty in the position of each atomic center. We have applied this approach to elucidate the structure of a number of large macromolecular complexes, including the nuclear pore complex and the exosome complex, and are working toward modeling the spatiotemporal organization of cellular components.
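
As a rough sketch of a Bayesian scoring function of this kind (my illustration, not the group's actual software; the restraint forms and parameters are assumptions), each data source contributes an additive log-likelihood term:

```python
# A minimal sketch, assuming crosslink and FRET restraints with Gaussian
# noise models; real integrative modeling platforms are far richer.
import numpy as np

def log_score(model_dist, xlinks, fret, sigma_xl=5.0, sigma_fret=8.0):
    """Log-posterior (up to a constant) of a candidate structural model.

    model_dist : dict mapping residue pair -> distance in the model (angstroms)
    xlinks     : list of (pair, max_length) chemical-crosslink observations
    fret       : list of (pair, measured_distance) FRET-derived distances
    """
    ll = 0.0
    for pair, max_len in xlinks:
        # Crosslink restraint: penalise only when the linker is overstretched.
        excess = max(model_dist[pair] - max_len, 0.0)
        ll += -0.5 * (excess / sigma_xl) ** 2
    for pair, d_obs in fret:
        # FRET restraint: Gaussian likelihood around the measured distance.
        ll += -0.5 * ((model_dist[pair] - d_obs) / sigma_fret) ** 2
    # A log-prior (e.g., excluded volume, statistical potentials) would be
    # added here; candidate models are then ranked by the total score.
    return ll
```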


Thursday, March 22nd

Toward Understanding How We -- and Our Diseases -- Develop Using Global Spectral Clustering for Dynamic Networks and Semi-Soft Clustering for Single Cell Gene Expression
Professor Kathryn Roeder
Department of Statistics and Department of Computational Biology, Carnegie Mellon University

Knowing how genes are expressed and how they are co-regulated over development and across different cell types yields insight into how genetic variation translates into risk for complex disease. Here we take on two related statistical challenges in this area: (1) clustering cells based on single cell RNA-sequencing data; and (2) dynamic clustering of genes based on gene expression over developmental periods.
Advances in technology now enable measurement of RNA levels, i.e., gene expression, for individual cells. Compared to traditional tissue-level or bulk RNA sequence data, sequencing of single cells is expected to yield more insight into how a disease develops by revealing where risk genes are expressed and how they alter development. For complex organs like the developing brain, however, the cells are not pre-labeled by type; rather, they need to be classified by type, and it is often not known a priori which cell types are present. I will present an efficient clustering algorithm (SOUP) that we developed for this problem. SOUP performs semi-soft clustering, based on the existence of both pure cells that belong to a single cluster and mixed cells that are transitioning between two or more cell types. The performance of SOUP is illustrated in both simulations and real data.
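
As a rough sketch of the semi-soft idea (not the published SOUP algorithm; the purity heuristic below is an assumption), pure cells can anchor cluster centres, with remaining cells assigned soft memberships by nonnegative regression onto those centres:

```python
# A minimal sketch, assuming a cells-by-genes expression matrix X and that
# every cluster retains some "pure" cells under the distance heuristic.
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans

def semi_soft_cluster(X, k, pure_frac=0.5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Heuristic: call a cell "pure" if it sits close to its assigned centre.
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    pure = d <= np.quantile(d, pure_frac)
    # Re-estimate cluster centres from the pure cells only.
    centers = np.vstack([X[pure & (km.labels_ == j)].mean(axis=0)
                         for j in range(k)])
    # Soft memberships: nonnegative weights, renormalised to sum to one.
    theta = np.vstack([nnls(centers.T, x)[0] for x in X])
    return theta / theta.sum(axis=1, keepdims=True)
```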

Tightly co-regulated genes form communities that work together for some critical biological function. It is of interest to discover these communities and model how they change over time. Once gene expression data are divided into homogeneous spatial and temporal periods, sample sizes are typically small. This makes inference challenging because the gene networks are estimated with substantial error, and estimating dynamic networks can be even more error-prone. Yet, for dynamic networks, we can also pool information across time periods using a refined statistical procedure, PisCES. PisCES utilizes a global community detection method that combines information across a series of networks, longitudinally (or spatially), to strengthen the inference for each period. Based on evolutionary spectral clustering and degree correction methods, PisCES can reveal dense communities that persist, merge, and diverge over development, and others that are loosely organized and short-lived.
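
As a rough sketch of evolutionary spectral clustering of this kind (simplified from the PisCES idea; the smoothing scheme below is an assumption, and the real method also applies degree correction), the leading eigenspace of each period's network can be pulled toward the eigenspaces of neighbouring periods:

```python
# A minimal sketch, assuming one symmetric adjacency matrix per time period.
import numpy as np

def top_eigvecs(M, k):
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[-k:]]     # k leading eigenvectors

def smoothed_embeddings(adjs, k, alpha=0.1, n_iter=20):
    U = [top_eigvecs(A.astype(float), k) for A in adjs]
    for _ in range(n_iter):
        for t, A in enumerate(adjs):
            S = A.astype(float)
            if t > 0:
                S = S + alpha * U[t - 1] @ U[t - 1].T   # pull toward previous period
            if t < len(adjs) - 1:
                S = S + alpha * U[t + 1] @ U[t + 1].T   # pull toward next period
            U[t] = top_eigvecs(S, k)
    # Each U[t] can then be clustered (e.g., k-means on its rows) to obtain
    # the communities for period t.
    return U
```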


Thursday, April 5th

Mixture Model Based Analysis of CRISPRi/a Screens
Dr. Timothy Daley
Department of Statistics and Department of Bioengineering, Stanford University

CRISPR inhibition and activation (CRISPRi/a) screens allow researchers to interrogate complex biological phenotypes at the genome-wide scale. Unfortunately, CRISPRi/a screens suffer from two issues that complicate the analysis: first, it is difficult to design effective sgRNAs; and second, both CRISPRi and CRISPRa exhibit variable levels of interference and activation, respectively, which can lead to variable phenotypic signal in the screen. Mixture models are a natural tool for dealing with such issues. I will present my work on mixture model based analysis of CRISPRi/a screens, which allows for more refined and powerful analysis.
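
As a rough sketch of one such mixture model (my illustration, not the speaker's method; the two-component Gaussian form is an assumption), sgRNA log fold changes can be modelled as a null component for ineffective guides plus an effect component for active ones, fit by EM:

```python
# A minimal sketch, assuming the null component is centred at zero.
import numpy as np

def fit_guide_mixture(lfc, n_iter=100):
    """lfc: array of sgRNA log fold changes from one screen."""
    pi, mu1 = 0.5, lfc.mean()          # mixing weight, effect-component mean
    s0 = s1 = lfc.std()                # component standard deviations
    for _ in range(n_iter):
        # E-step: responsibility that each guide is "active".
        p0 = (1 - pi) * np.exp(-0.5 * (lfc / s0) ** 2) / s0
        p1 = pi * np.exp(-0.5 * ((lfc - mu1) / s1) ** 2) / s1
        r = p1 / (p0 + p1)
        # M-step: update mixing weight and component parameters.
        pi = r.mean()
        mu1 = np.sum(r * lfc) / np.sum(r)
        s1 = np.sqrt(np.sum(r * (lfc - mu1) ** 2) / np.sum(r))
        s0 = np.sqrt(np.sum((1 - r) * lfc ** 2) / np.sum(1 - r))
    return pi, mu1, s1, r   # r = per-guide probability of being effective
```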


Thursday, April 12th

Using Publicly Available Gene Association Data in a Data-Driven Approach to Predict Human Carcinogenicity Based on the Key Characteristics
Dr. Linda Rieswijk
Division of Environmental Health Sciences, UC Berkeley

Recently, Smith et al. identified 10 key characteristics (KCs) of human carcinogens, developed to provide a systematic method for evaluating mechanistic data in support of hazard identification conclusions. The KCs are very helpful for identifying and organizing results from pertinent mechanistic studies in carcinogen identification, but it is unclear how to use gene association data from toxicogenomics studies within this framework. Here, we apply a data-driven approach that uses publicly available gene association data to improve the prediction of human carcinogenicity based on the KCs. For this study we used both chemical- and disease-related gene association data from the Comparative Toxicogenomics Database. Using bioinformatics tools such as WikiPathways and Cytoscape, we are trying to find common mechanisms affected by human carcinogens. After comparing agents individually and at the IARC group level, we calculate a Key Characteristics score (KC score), which indicates the strength of the association of a particular agent with the KCs of human carcinogens. A better prediction of human carcinogenicity may ultimately improve the hazard identification step for unknown environmental stressors or new chemical agents within the process of risk assessment. More importantly, this type of research will help generate new hypotheses for follow-up research.
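
As a rough sketch of how such a score might be assembled from gene association data (the study's actual KC score is not specified in this abstract; the overlap test below is an assumption), each agent's gene set can be tested against gene sets annotated to each KC:

```python
# A minimal sketch, assuming curated gene sets per key characteristic.
from scipy.stats import hypergeom

def kc_score(agent_genes, kc_gene_sets, universe_size, alpha=0.05):
    """agent_genes  : set of genes associated with the agent
    kc_gene_sets : dict mapping KC name -> set of genes annotated to that KC
    """
    score = 0
    for kc, genes in kc_gene_sets.items():
        overlap = len(agent_genes & genes)
        # P(overlap >= observed) under random draws from the gene universe.
        p = hypergeom.sf(overlap - 1, universe_size, len(genes), len(agent_genes))
        if p < alpha:
            score += 1
    return score   # 0-10: number of KCs supported by the association data
```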


Thursday, April 19th

Unlocking a Masked Portion of the Genome for Analysis of Conformation Capture Studies by Leveraging Multi-Mapping Reads
Professor Sunduz Keles
Department of Biostatistics & Medical Informatics and Department of Statistics, University of Wisconsin, Madison

Hi-C assays are currently the state of the art for mapping long-range chromatin interactions genome-wide, and data from these assays are now routinely used to interpret results from genome-wide association studies. Existing data analysis pipelines for Hi-C completely discard reads that map to more than a single location. We found that multi-reads have profound effects on the numbers of usable reads and significant interactions, including reproducibly identifiable promoter-enhancer interactions, which are among the most valued actionable inferences from Hi-C. We developed a generative model-based approach, named mHi-C (https://github.com/keleslab/mHiC), for leveraging multi-reads in Hi-C data analysis. We evaluated mHi-C with computational experiments and with the use of additional data types. mHi-C yielded a remarkable increase in sequencing depth, which translated into significant and consistent gains in the reproducibility of raw contact counts, identified significant contacts, promoter-enhancer interactions, and domain structures. In addition to discovering novel promoter-enhancer interactions and refining TAD boundaries in a biologically supported way, mHi-C provided a less biased assessment of Hi-C signal originating from highly repetitive regions. In the second part of the talk, I will introduce atSNP Search (http://atsnp.biostat.wisc.edu), which provides statistical evaluation of the impact of SNPs on transcription factor-DNA interactions.
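
As a rough sketch of the allocation idea (simplified from the mHi-C model; the small prior and update scheme are assumptions), each multi-read can be assigned fractionally to its candidate bin pairs in proportion to the contact signal estimated from uniquely mapping reads, and refined iteratively:

```python
# A minimal sketch, assuming bin-pair keyed count dictionaries.
import numpy as np
from collections import defaultdict

def allocate_multireads(uni_counts, multi_reads, n_iter=10):
    """uni_counts  : dict mapping bin pair -> count of uniquely mapped reads
    multi_reads : list of candidate bin-pair lists, one list per multi-read
    """
    counts = defaultdict(float, uni_counts)
    for _ in range(n_iter):
        new = defaultdict(float, uni_counts)
        for candidates in multi_reads:
            # E-step: posterior over candidate positions from current signal.
            w = np.array([counts[c] + 1e-6 for c in candidates])
            w /= w.sum()
            # M-step: add the read fractionally to each candidate bin pair.
            for c, wi in zip(candidates, w):
                new[c] += wi
        counts = new
    return counts
```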


Thursday, April 26th

Statistical Methods for Label-Free Quantitative MS-Based Proteomics
Professor Lieven Clement
Department of Applied Mathematics, Computer Science and Statistics, Ghent University

Quantitative label-free mass spectrometry (MS)-based proteomics has become very popular because of its ease of use and the reliability of its results. However, the nature of the label-free quantification protocol renders the data analysis a challenging task: MS intensities are measured at the peptide level, while researchers generally aim to assess differential abundance at the protein level; peptide-level data are very noisy; and missing values are omnipresent. The associated data analysis workflow typically consists of three major steps: peptide identification, peptide and protein quantification, and differential analysis.

In this talk we will cover the challenges and pitfalls associated with the different steps of the data analysis workflow. We show that modelling MS-based intensity data at the peptide level outperforms classical summarization-based approaches, which typically do not correct for differences in peptide characteristics or for between-sample differences in the number of peptides identified per protein. Peptide-based linear models, however, still suffer from overfitting, outlying peptide intensities, and the unbalanced peptide-level data that are inherent to proteomics. We therefore introduce three modular extensions: ridge regression, M-estimation, and empirical Bayes to stabilise variance estimation. These provide fold-change estimates with higher precision and accuracy, and further improve sensitivity and specificity. By exploiting the link between mixed models and penalisation, our tools also facilitate the use of mixed models in the proteomics community.
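
As a rough sketch of a peptide-level model with the first two extensions (my illustration, not the authors' implementation; the design matrices and tuning constants are assumptions), condition effects can be ridge-penalised while Huber weights downweight outlying peptide intensities:

```python
# A minimal sketch, assuming log2 intensities y with indicator design
# matrices for peptide and condition effects for a single protein.
import numpy as np

def fit_protein(y, X_pep, X_cond, lam=1.0, n_iter=10, c=1.345):
    X = np.hstack([X_pep, X_cond])
    # Ridge penalty on condition effects only (none on peptide effects).
    P = np.diag([0.0] * X_pep.shape[1] + [lam] * X_cond.shape[1])
    w = np.ones(len(y))
    for _ in range(n_iter):
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X + P, X.T @ W @ y)
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12   # robust residual scale
        u = np.abs(r / s)
        w = np.where(u <= c, 1.0, c / u)            # Huber (M-estimation) weights
    return beta[X_pep.shape[1]:]   # estimated condition (fold-change) effects
```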


Thursday, May 3rd

Stage-Wise Testing for Differential Expression Analysis in Sequencing Studies
Koen Van den Berge
Department of Applied Mathematics, Computer Science and Statistics, Ghent University

Characterizing gene expression through sequencing (RNA-seq) has provided a unique opportunity to unravel the molecular processes in biological tissues. Innovations and cost reductions in sequencing technology have led to RNA-seq experiments with complex designs. In addition, recent computational advances have allowed expression analysis at the transcript level. Both settings imply multiple hypotheses for every gene, i.e., assessing differential expression across multiple contrasts in complex experiments or across multiple transcripts in transcript-level analyses, leading to challenging multiple testing problems. Conventional approaches control the false discovery rate (FDR) at the individual hypothesis level and fail to establish proper gene-level error control, which compromises downstream validation experiments. We introduce stageR (https://github.com/statOmics/stage), a two-stage procedure that leverages the increased power of aggregated hypothesis tests while maintaining high biological resolution through post-hoc analysis of genes passing the screening hypothesis. The two-stage procedure is a general paradigm that can be adopted whenever individual hypotheses can be aggregated, and it achieves an optimal middle ground between biological resolution and statistical power. In this talk, I will show that it provides gene-level FDR control in studies with complex designs while boosting power for interaction effects without compromising the discovery of main effects. In a differential transcript usage/expression context, stage-wise testing gains power by aggregating hypotheses at the gene level, while providing transcript-level assessment of genes passing the screening stage. Finally, the flexibility of the approach will be illustrated by exploring different p-value aggregation methods, and interesting applications in single-cell RNA sequencing will be discussed.
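
As a rough sketch of the two-stage paradigm (simplified from the stageR idea; the BH screen, the adjusted stage-II level, and the Holm confirmation are one plausible instantiation, not necessarily the package's exact procedure), gene-level aggregated p-values are screened first and transcript-level hypotheses are tested only within passing genes:

```python
# A minimal sketch, assuming aggregated gene-level screening p-values and
# per-gene arrays of transcript-level p-values.
import numpy as np

def bh_reject(p, alpha):
    """Benjamini-Hochberg: boolean mask of rejected hypotheses."""
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    out = np.zeros(m, dtype=bool)
    out[order[:k]] = True
    return out

def stagewise_test(gene_pvals, tx_pvals, alpha=0.05):
    """gene_pvals : dict gene -> aggregated screening p-value
    tx_pvals   : dict gene -> array of transcript-level p-values
    """
    genes = list(gene_pvals)
    screen = bh_reject(np.array([gene_pvals[g] for g in genes]), alpha)
    passed = [g for g, s in zip(genes, screen) if s]
    alpha_adj = alpha * len(passed) / len(genes)   # adjusted stage-II level
    results = {}
    for g in passed:
        # Confirmation stage: Holm correction within the gene (FWER control).
        p = np.asarray(tx_pvals[g])
        m = len(p)
        order = np.argsort(p)
        holm = np.maximum.accumulate(p[order] * (m - np.arange(m)))
        rej = np.zeros(m, dtype=bool)
        rej[order] = holm <= alpha_adj
        results[g] = rej
    return results
```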