PB HLTH 295, Section 001
Statistics and Genomics Seminar

Spring 2017





Thursday, January 19th


Quantifying and Mitigating the Effect of Preferential Sampling on Phylodynamic Inference
Professor Julia Palacios
Department of Statistics and Department of Biomedical Data Science, Stanford University

Phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. One way to accomplish this task is to formulate an observed sequence data likelihood by exploiting a coalescent model for the sampled individuals’ genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from the sequence data. However, when analyzing sequences sampled serially through time, current methods implicitly assume either that sampling times are fixed deterministically by the data collection protocol or that their distribution does not depend on the size of the population. Through simulation, we first show that, when sampling times do probabilistically depend on effective population size, estimation methods may be systematically biased. To correct for this deficiency, we propose a new model that explicitly accounts for preferential sampling by modeling the sampling times as an inhomogeneous Poisson process dependent on effective population size. We demonstrate that in the presence of preferential sampling our new model not only reduces bias, but also improves estimation precision. Finally, we compare the performance of currently used phylodynamic methods with our proposed model through clinically relevant examples of seasonal human influenza.
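
To make the new model concrete, here is a minimal sketch (the exact parameterization used in the talk may differ): the sampling times form an inhomogeneous Poisson process whose intensity is log-linear in the effective population size,

\[
\lambda(t) = \exp\{\beta_0 + \beta_1 \log N_e(t)\} \propto N_e(t)^{\beta_1},
\]

so that \(\beta_1 = 0\) recovers sampling that is independent of population size, while \(\beta_1 > 0\) encodes preferential sampling when the population is large; the sampling times then carry information about \(N_e(t)\) and are modeled jointly with the genealogy.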

Joint work with Michael D. Karcher, Trevor Bedford, Marc A. Suchard and Vladimir N. Minin.


Thursday, January 26th


Optimally Combining Outcomes to Improve Prediction
Dr. David Benkeser
Division of Biostatistics, UC Berkeley

In many studies, multiple instruments are used to measure different facets of an unmeasured outcome of interest. For example, in studies of childhood development, children are administered tests in several areas and researchers combine these test scores into a univariate measure of neurocognitive development. Researchers are interested in predicting this development score based on household and environment characteristics early in life in order to identify children at high risk for neurocognitive delays. We propose a method for estimating the combined measure that maximizes predictive performance. Our approach allows modern machine learning techniques to be used to predict the combined outcome using potentially high-dimensional covariate information. In spite of the highly adaptive nature of the procedure, we obtain valid estimates of the prediction algorithm's performance for predicting the combined outcome, as well as confidence intervals about these estimates. We illustrate the methodology using longitudinal cohort studies of early childhood development.
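
To make the idea concrete, here is a minimal sketch (illustrative only: the learner, the random search over the weight simplex, and all function names are choices made for this example, not the authors' estimator). It searches for convex weights on the K test scores that maximize the cross-validated R^2 of a regression predicting the combined score from covariates:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_predict

    def cv_r2(alpha, X, Y, learner):
        # Cross-validated R^2 for predicting the combined outcome Y @ alpha from X.
        y = Y @ alpha
        y_hat = cross_val_predict(learner, X, y, cv=5)
        return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    def optimal_combination(X, Y, n_draws=200, seed=0):
        # Y is (n individuals, K test scores); search the weight simplex at random.
        rng = np.random.default_rng(seed)
        learner = RandomForestRegressor(n_estimators=200, random_state=0)
        weights = rng.dirichlet(np.ones(Y.shape[1]), size=n_draws)
        scores = [cv_r2(a, X, Y, learner) for a in weights]
        return weights[int(np.argmax(scores))]

The procedure described in the abstract additionally provides valid estimates and confidence intervals for the performance of the resulting predictor despite this adaptive search; the sketch above does not attempt that inferential step.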


Thursday, February 9th


Imaging and Sequencing Single Cells
Professor Aaron Streets
Department of Bioengineering, UC Berkeley

Phenotype classification of single cells reveals biological variation that is masked in ensemble measurements. This heterogeneity is found in gene and protein expression as well as in cell morphology. Many techniques are available to probe phenotypic heterogeneity at the single-cell level, for example quantitative imaging and single-cell RNA sequencing, but it is difficult to perform multiple assays on the same single cell. In order to directly track the correlation between morphology and gene expression at the single-cell level, we developed a microfluidic platform for quantitative coherent Raman imaging and immediate RNA sequencing (RNA-Seq) of single cells. With this device we actively sort and trap cells for analysis with stimulated Raman scattering (SRS) microscopy. The cells are then processed in parallel pipelines for lysis and preparation of cDNA for high-throughput transcriptome sequencing. SRS microscopy offers three-dimensional imaging with chemical specificity for quantitative analysis of protein and lipid distribution in single cells. Meanwhile, the microfluidic platform facilitates single-cell manipulation, minimizes contamination, and provides improved RNA-Seq detection sensitivity and measurement precision, which is necessary for differentiating biological variability from technical noise. By combining coherent Raman microscopy with RNA sequencing, we can better understand the relationship between cellular morphology and gene expression at the single-cell level.


Thursday, February 16th


Statistical Machine Learning in Molecular Biology and Ecotoxicology
Dr. James Bentley (Ben) Brown
Lawrence Berkeley National Laboratory

I'll discuss some challenges in molecular biology and ecotoxicology that are currently driving us to develop new methods of feature selection for extremely high-dimensional systems -- particularly systems where effects are driven by interactions of unknown form and high order.


Thursday, February 23rd


Transcriptome Analysis at the Gene Isoform Level Using Hybrid Sequencing
Professor Kin Fai Au
Department of Internal Medicine and Department of Biostatistics, University of Iowa

New generation sequencing techniques can provide very informative insights into the transcriptome. However, the currently available transcriptome analysis tools are designed for Second Generation Sequencing (SGS) short reads, whose limited read length can introduce bias and even errors in downstream analysis. While the recent application of Third Generation Sequencing (TGS) long reads, such as PacBio and Oxford Nanopore Technologies data, to human transcriptome analysis has greatly advanced the field, key bioinformatic analysis platforms are missing. Furthermore, hybrid sequencing (Hybrid-Seq), which integrates SGS short read data into the analysis of TGS data, can improve the overall performance and resolution of the output data. Indeed, a handful of existing publications demonstrate the potential power of Hybrid-Seq for genome data analysis. Here I present a series of bioinformatics methods to analyze the transcriptome at the gene isoform level. These methods include 1) IDP, to identify and quantify gene isoforms; 2) IDP-fusion, to annotate fusion genes and identify fusion gene isoforms; and 3) IDP-ASE, to phase genotypes and quantify allele-specific expression at the gene isoform level. The proof-of-concept applications to breast cancer cells and human embryonic stem cells reveal the isoform-level complexity of fusion gene expression and allele-specific expression, and also discover novel genes involved in pluripotency regulation, novel tumorigenesis-relevant gene fusions, and ASE bias of oncogenes and pluripotency markers.


Thursday, March 2nd


Gene Drive: What is Possible at the Population Level with Currently-Available Molecular Components?
Professor John Marshall
Divisions of Biostatistics and Epidemiology, UC Berkeley

A great deal of progress has been made on engineering gene drive systems capable of spreading into populations despite a fitness cost. Some of these systems have been proposed for spreading disease-refractory genes into insect disease vector populations, thus reducing their ability to transmit diseases such as malaria and dengue fever to humans. Other systems have been proposed for spreading a fitness load or generating a male gender bias, thus suppressing vector populations and disease transmission. With the help of mathematical models, we discuss which of the recently engineered systems (CRISPR/Cas9-based homing systems, homing endonuclease genes, and toxin-antidote-based systems such as Medea) hold the most promise for achieving these goals. For homing-based systems, we address the concern that the emergence of homing-resistant alleles may limit their spread. We discuss the versatility of systems that use combinations of toxins and antidotes to favor their own inheritance. These systems are less invasive but highly stable, and have potential for confined population replacement. Finally, we discuss remediation strategies for removing driving transgenes from the environment in the event of unforeseen consequences.
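
As a toy illustration of the kind of model involved (a standard textbook-style recursion for a homing drive, not the specific models presented in the talk), the deterministic allele-frequency dynamics with homing efficiency e, fitness cost s, and dominance h can be written as:

    # Deterministic recursion for a simple homing-based gene drive (illustrative).
    # e: probability that homing converts the wild-type allele in heterozygotes,
    # s: fitness cost of the drive, h: dominance of the cost.
    def next_drive_frequency(p, e=0.9, s=0.2, h=0.5):
        q = 1.0 - p
        w_dd, w_dw, w_ww = 1.0 - s, 1.0 - h * s, 1.0
        w_bar = p * p * w_dd + 2 * p * q * w_dw + q * q * w_ww
        # Heterozygotes transmit the drive with probability (1 + e) / 2 (homing).
        return (p * p * w_dd + p * q * w_dw * (1.0 + e)) / w_bar

    p = 0.01  # release frequency
    for generation in range(40):
        p = next_drive_frequency(p)
    print(f"drive allele frequency after 40 generations: {p:.3f}")

Recursions of this kind make it easy to ask how large the homing rate must be, relative to the fitness cost, for the drive to spread from a small release, and how homing-resistant alleles change the outcome.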


Thursday, March 9th


Multiplexing Droplet-Based Single Cell RNA-Sequencing Using "Genetic Barcodes"
Professor Jimmie Ye
Department of Epidemiology and Biostatistics, UC San Francisco

Droplet-based single-cell RNA-sequencing (dscRNA-seq) has enabled rapid, massively parallel profiling of transcriptomes from tens of thousands of cells. Multiplexing samples for single cell capture and library preparation in dscRNA-seq would enable cost-effective designs of differential expression and genetic studies while avoiding technical batch effects, but its implementation remains challenging. Here, we introduce an in silico algorithm, demuxlet, that harnesses naturally occurring genetic variation in a pool of genetically diverse cells to assign each cell to its donor of origin and to identify droplets containing cells from two samples. These two capabilities enable a pooled experimental design where cells from genetically diverse samples are multiplexed and captured at much higher throughput. To demonstrate the performance of our method, we sequenced a pool of ~14k peripheral blood mononuclear cells (PBMCs) from 8 lupus patients. Given genotyping data for each pooled sample, our method correctly assigned > 95% of cells to the originating sample, identified doublets enriched for multiple cell types, and estimated doublet rates consistent with previous reports. We further demonstrate the utility of sample multiplexing by characterizing cell type-specific responses of ~15k PBMCs to a potent cytokine, IFN-β, and identifying novel PBMC biomarkers distinguishing ~30k PBMCs from rheumatoid arthritis and lupus patients. Our computational tool enables sample multiplexing of droplet-based single cell RNA-seq for large-scale studies of population variation and could be extended to other single cell datasets where natural or synthetic DNA barcodes are available.
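
A minimal sketch of the core likelihood computation behind genotype-based demultiplexing (illustrative: the actual demuxlet model also handles per-base sequencing error, doublet priors, and posterior probabilities). Each droplet is scored against every donor, and against every donor pair, using reads that overlap common SNPs:

    import itertools
    import numpy as np

    def cell_log_likelihood(alt_counts, ref_counts, donor_alt_frac, err=0.01):
        # alt_counts/ref_counts: per-SNP read counts for one droplet.
        # donor_alt_frac: expected alt-allele fraction per SNP, i.e. 0, 0.5, or 1
        # for hom-ref, het, and hom-alt genotypes.
        p_alt = np.clip(donor_alt_frac, err, 1.0 - err)
        return np.sum(alt_counts * np.log(p_alt) + ref_counts * np.log(1.0 - p_alt))

    def assign_droplet(alt_counts, ref_counts, genotypes):
        # genotypes: dict donor -> array of alt-allele fractions in {0, 0.5, 1}.
        scores = {d: cell_log_likelihood(alt_counts, ref_counts, g)
                  for d, g in genotypes.items()}
        for d1, d2 in itertools.combinations(genotypes, 2):
            # A doublet of two donors looks like the average of their genotypes.
            mix = (genotypes[d1] + genotypes[d2]) / 2.0
            scores[(d1, d2)] = cell_log_likelihood(alt_counts, ref_counts, mix)
        return max(scores, key=scores.get)  # a donor name, or a pair for a doublet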


Thursday, March 23rd


Fast Inference of Fine-Scale Recombination Rates from Phased or Unphased Data
Jeffrey Spence
Graduate Group in Computational Biology, UC Berkeley

Recombination plays a fundamental role in evolution by breaking up the correlation of alleles at physically linked loci. In many species, the rate of recombination varies substantially across the genome, and these fine-scale differences in recombination rates are important for detecting natural selection, for association mapping, for inferring demographic histories, and for many other applications.

To infer fine-scale recombination rates, it is necessary to observe a large number of recombination events either by sequencing many trios or by using population genetics theory to model recombination events that have occurred since the most recent common ancestor of a small number of unrelated individuals. Such population genetics-based methods tend to perform well, and, by requiring less data than trio-based methods, are much less expensive. Unfortunately, existing methods make unrealistic assumptions about the demographic history of the population from which the individuals are sampled and rely on computationally expensive MCMC sampling to obtain a posterior distribution over fine-scale recombination rates.

I will present a new method that takes the demographic history of the sample into account, improving estimation accuracy. Furthermore, we use a penalized maximum composite likelihood approach, which is orders of magnitude faster than previous MCMC approaches. Our method can handle phased (haplotype) data, unphased (genotype) data, or even just frequency information such as that obtained from pool-seq protocols.
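
To make the estimation strategy concrete, here is a minimal sketch of a penalized composite likelihood objective (illustrative: two_locus_loglik stands in for a precomputed two-locus likelihood lookup, which in the method described would be computed under the inferred demographic history):

    import numpy as np

    def objective(log_rates, snp_pairs, lam, two_locus_loglik):
        # log_rates[i]: log recombination rate between SNPs i and i + 1.
        # snp_pairs: iterable of (i, j, data_ij) with i < j; data_ij summarizes
        # the observed haplotypes/genotypes at the two sites.
        rates = np.exp(log_rates)
        loglik = 0.0
        for i, j, data_ij in snp_pairs:
            rho = rates[i:j].sum()  # cumulative recombination distance
            loglik += two_locus_loglik(data_ij, rho)
        # Fused-lasso penalty favors piecewise-constant fine-scale rate maps.
        penalty = lam * np.abs(np.diff(log_rates)).sum()
        return -(loglik - penalty)  # minimize with any smooth/subgradient solver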


Thursday, April 6th


Fine Mapping Associations between Multi-Allelic Genetic Variability in the HLA Locus and Expression of T Cell Receptor Genes
Dr. Eilon Sharon
Department of Genetics, Stanford University

In each individual, a highly diverse T-cell receptor (TCR) repertoire interacts with peptides presented by major histocompatibility complex (MHC) molecules. This interaction enables the adaptive immune system to discriminate between self and foreign antigens. Despite extensive research, it remains controversial whether the germline-encoded TCR-MHC contacts promote TCR-MHC specificity and, if so, whether there are differences in the compatibilities between different TCR V-genes and different MHC alleles. We applied eQTL mapping in a large human cohort to test for associations between genetic variation and TCR V-gene usage. We observed strong trans associations between genetic variation in the MHC locus and usage biases in TCR V-genes. We then used several techniques to fine map the association signals. Our investigation revealed particular amino acid residues in MHC genes that influence TCR V-gene usage. Remarkably, many of these residues are in direct contact or spatial proximity to either the TCR or the presented peptide in co-crystal structures. Our results show that MHC variants, several of which are linked to autoimmune diseases, can directly affect TCR-MHC interaction, and provide the first examples of trans-QTLs mediated by protein-protein interactions.
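
The association scan can be pictured as follows (a hedged sketch: the function and variable names are illustrative, and the fine-mapping in the talk goes well beyond a single regression). For each V-gene and each MHC amino-acid residue, regress usage on residue dosage with individual-level covariates:

    import numpy as np
    import statsmodels.api as sm

    def test_residue_vs_vgene(vgene_usage, residue_dosage, covariates):
        # vgene_usage: (n,) usage fraction of one TCR V-gene per individual.
        # residue_dosage: (n,) copies (0/1/2) of a given MHC amino-acid residue.
        # covariates: (n, k) matrix, e.g. age, sex, ancestry components.
        X = sm.add_constant(np.column_stack([residue_dosage, covariates]))
        fit = sm.OLS(vgene_usage, X).fit()
        return fit.params[1], fit.pvalues[1]  # residue effect and its p-value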


Thursday, April 13th


Data Re-Use is Not Parasitism: Translational Medicine Using Public Data
Professor Purvesh Khatri
Stanford University School of Medicine

The current experiment and analysis paradigm in biomedical and biological research is to reduce heterogeneity in the data to ensure that conclusions are not confounded by biological, technological, and demographic factors. However, this paradigm does not account for real-world patient population heterogeneity, which in turn requires replication in multiple independent cohorts prior to translation into clinical practice. Consequently, biomedical research today is slow and expensive. I will describe an analytic framework that turns the current paradigm on its head. This talk will focus on how heterogeneity across independent experiments can lead to the identification of disease signatures that are diagnostic, prognostic, therapeutic, and mechanistic across a broad spectrum of diseases including infections, autoimmune diseases, cancer, and organ transplants. I will also discuss how biological and technical heterogeneity in publicly available data can be leveraged to make translational medicine more robust, faster, and cheaper.


Thursday, April 20th


Meta-Analysis with a Very Small Number of Studies
Professor Lu Tian
Department of Statistics and Department of Biomedical Data Science, Stanford University

Meta-analysis is a popular tool to synthesize evidence from multiple sources. In the presence of study heterogeneity, the random effects model is the appropriate statistical model for meta-analysis. Statistical inference based on the random effects model has two major limitations: its dependence on restrictive parametric assumptions, and its asymptotic validity, which requires a large number of studies. If only a very small number of studies are available, appropriate parametric assumptions are unavoidable in order to effectively summarize the data. On the other hand, when the number of studies is small, most current inference methods are either inapplicable or invalid. In this talk we propose a class of exact inference procedures for the commonly used normal-normal random effects model, which are valid regardless of the number of studies. We will illustrate the new proposal with both simulation studies and real data examples.
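
For reference, the normal-normal random effects model in question is the standard one (written here in conventional notation):

\[
\hat{\theta}_i \mid \theta_i \sim N(\theta_i, s_i^2), \qquad \theta_i \sim N(\mu, \tau^2), \qquad i = 1, \dots, K,
\]

so that marginally \(\hat{\theta}_i \sim N(\mu, s_i^2 + \tau^2)\). The target is the overall effect \(\mu\) (and the between-study heterogeneity \(\tau^2\)); the difficulty is that standard inference for \(\mu\) relies on large-K asymptotics, which break down when only a handful of studies are available.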


Thursday, April 27th


Web Visualization, Gene Set Analysis, and Workflow Management
Professor Ian Holmes
Department of Bioengineering, UC Berkeley

My lab develops tools for computational genome annotation, from infrastructure through analysis to visualization. I'll describe three tools we have recently published. JBrowse is a web-based tool for browsing genome annotations, widely used by model organism projects and by databases in the plant and animal genomics communities. Recent developments include a plugin architecture with a thriving ecosystem of third-party extensions, and a work-in-progress plugin for circular visualization. WTFgenes is a program for analyzing gene sets that uses a Markov Chain Monte Carlo approach to finding enriched ontology terms. Biomake is a system for workflow management that extends the GNU make paradigm with logic programming and other features.


Thursday, May 11th


Coalescent-Based Species Tree Inference Using Covariance Matrices
Geno Guerra
Department of Statistics, UC Berkeley

In phylogenetics, completely resolving the Tree of Life is an open problem. Many methods exist, but they can produce inconsistent results or rely on heuristic approximations to resolve the topology of a set of species. Here I will present preliminary work on a new method for inferring species tree topologies, branch lengths, and population sizes by maximum likelihood, using coalescent theory together with a novel calculation of the covariance between pairs of coalescence times across species.
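
A standard building block for such a calculation (stated in conventional coalescent notation, which may differ from the talk's): for one lineage sampled from each of two species a and b that diverged \(\tau_{ab}\) generations ago, with constant ancestral effective population size \(N\),

\[
\mathbb{E}[T_{ab}] = \tau_{ab} + 2N,
\]

since the lineages cannot coalesce more recently than the divergence and then wait an expected 2N generations in the ancestral population. Covariances between such pairwise coalescence times arise because different pairs of lineages share stretches of ancestral-population history, and it is these means and covariances that the maximum likelihood fit matches.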