PB HLTH 292, Section 008
Statistics and Genomics Seminar


Fall 2012


Thursday, August 30th

Multilabel Classification in Disease Diagnosis Using Public Gene Expression Data
Professor Haiyan Huang
Department of Statistics, UC Berkeley

The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) is currently the largest database that systematically documents the genome-wide molecular basis of diseases. This talk introduces an effort to turn the NCBI GEO expression repository into an automated disease diagnosis database, such that a query gene expression profile can be assigned to one or multiple disease concepts. As hierarchical multi-label classification (HMC) is a natural formulation of a disease diagnosis question, this talk also discusses some statistical issues involved in HMC.

In multi-label classification problems in particular, it is common for the classifiers of different classes to have different statistical properties because the quality and quantity of training data vary across classes. Ignoring these differences cannot guarantee an optimal classification precision rate at a fixed recall rate. We address this by introducing a novel ranking strategy called the local precision rate (LPR), which is derived by optimizing the overall precision-recall curve. Following this derivation, we further propose a new LPR estimator based on local polynomial smoothers. We note that under certain conditions, LPR can be shown to be mathematically equivalent to 1-lfdr, where lfdr is the local false discovery rate. However, simulation and real data applications demonstrate that the newly proposed LPR estimator consistently outperforms the traditional methods used to estimate 1-lfdr when the training data are noisy and complex.
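
For orientation, the quantity 1-lfdr mentioned above can be written out under the standard two-groups mixture model. The notation here (classifier score s, class proportions \pi_0 and \pi_1, score densities f_0 and f_1) is assumed for illustration and is not taken from the talk; the conditions under which LPR coincides with this quantity are those referred to in the abstract and are not restated here.

```latex
% Two-groups model for a classifier score s (assumed notation):
%   f_0 = score density for true negatives, f_1 = score density for true positives
f(s) = \pi_0 f_0(s) + \pi_1 f_1(s), \qquad \pi_0 + \pi_1 = 1

% Local false discovery rate and its complement:
\mathrm{lfdr}(s) = \frac{\pi_0 f_0(s)}{f(s)}, \qquad
1 - \mathrm{lfdr}(s) = \frac{\pi_1 f_1(s)}{f(s)} = \Pr(\text{true positive} \mid S = s)
```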


Thursday, September 6th

Challenges in the Development of Robust Clinical Classifiers from High-Dimensional Data
Dr. Darya Chudova
Data Analysis and Algorithm Development, Veracyte, Inc.

Since the rise of high-throughput experimental technologies, numerous genomic classifiers have been proposed to address a variety of clinical questions. However, few have been successfully incorporated into clinical workflows. The simple objective of developing well-validated genomic models that improve patient care beyond current practice standards has turned out to be difficult to achieve in practice (Subramanian, Simon 2010). Beyond the appropriateness of study designs for both discovery and validation, the biggest challenge is the real-world biological and technical variation found in genomic data sets sampled across varied patient cohorts over significant periods of time. Using examples from our recent development and validation of the gene-expression-based thyroid cancer molecular test (Alexander et al. 2012), we will describe some of the challenges encountered along the way, including study designs, sampling heterogeneity, dilution of malignant content with adjacent cell types and peripheral blood, and analytical verification in the presence of technical batch and reagent variation. We will also share some of the lessons learned from this experience.


Thursday, September 13th

Indels, Ecotypes, and Browsers
Professor Ian Holmes
Department of Bioengineering, UC Berkeley

I will talk about three recent interests of our group. The first is the accurate estimation of mutation rates, particularly indel rates, from homologous sequences. The hope is that more accurate estimates will lead to better annotations of reference and personal genomes. The second interest is an investigation of the concept of "ecotypes" in microbial ecology. For example, is there really a finite number of "types" of intestinal flora? Or is this an artifact of poor statistics? The third area is the web-based visualization and representation of genomic data. Specifically, I will review recent progress on our "JBrowse" genome browser.


Thursday, September 20th

Molecular Elucidation and Engineering of Stem Cell Fate Decisions
Professor David Schaffer
Chemical and Biomolecular Engineering, Bioengineering, and The Helen Wills Neuroscience Institute, UC Berkeley

Stem cells are defined by their hallmark abilities to self-renew, or divide while in an immature state, and to differentiate into one or more specialized cell types. Elucidating the mechanisms that govern these fate decisions is critical for understanding the roles stem cells play in the development of organisms and maintenance of adult tissues, as well as for harnessing stem cells to repair organs damaged by disease or injury. Stem cell behaviors are strongly regulated by their microenvironment or niche, a specialized region of tissue that presents complex repertoires of signals. There has been considerable progress in studying important soluble biochemical signals, but comparatively less effort has been focused on investigating biophysical mechanisms by which the "solid phase" of the microenvironment regulates cell function, in large part due to the experimental difficulty of investigating and mimicking the complexity of the extracellular matrix (ECM), cell-cell interactions, and other components. Recent work demonstrates that bioactive, synthetic materials can be harnessed to emulate and thereby study the effects of solid phase, biophysical cues on cell function. For example, by using modular, bioactive materials, we have found that material stiffness profoundly impacts neural stem cell and human embryonic stem cell self-renewal and differentiation, and mechanistic analysis implicates key mechanotransductive pathways in this process that are important in cell culture and in vivo. Furthermore, nanoscale spatial organization in the presentation of immobilized signals can modulate the activity of these signals, and nanostructured biological-polymeric conjugates likewise serve as potent effectors of neural stem and human embryonic stem cell function.
Biomimetic materials can thus be employed to study basic biophysical mechanisms by which the solid phase of a stem cell microenvironment regulates cell function, as well as offer safe, scalable, and robust systems to control stem cells for biomedical applications.


Thursday, September 27th

Pathway Based Methods to Unravel Functional Relationships: Application to Pathway Network Inference and Drug Repurposing
Dr. Ana Conesa
Head, Genomics of Gene Expression Lab, Centro de Investigacion Principe Felipe, Valencia

Pathway analysis of gene expression data has traditionally been used to interpret the biological meaning of a set of genes whose expression changes between experimental conditions. In this scenario, functional enrichment and gene set methods test for a significant abundance of genes belonging to specific functional classes among those that respond to the experiment. This analysis translates lists of genes into lists of enriched functions. However, functional relationships in biological systems also imply interactions between functional blocks and functional relatedness between samples. In this seminar I will introduce two examples of methodologies addressing these ideas. The first is pathway network inference (PNI) from gene expression and functional data, a way to explore the connections between functional entities in a cellular system. This approach is useful for revealing common regulatory elements of cellular pathways and the links between them. The second example is the use of functional profiling to analyze the similarity between drug treatments and suggest novel indications for existing drugs. Our approach analyzes the semantic similarity between functional profiles and provides a measure of functional distance between samples that can be used for drug repurposing. We will show experimental validation of this approach and discuss the scope of its application.


Thursday, October 11th

Leveraging Population History to Map Complex Traits in the Post-GWAS Era
Professor Elad Ziv
Department of Medicine, UCSF

Genome-wide association studies (GWAS) have recently added a substantial number of new loci associated with a variety of complex traits. Yet, for most complex traits, the majority of heritability (the fraction of susceptibility that is estimated to be due to inherited factors based on family studies) remains unaccounted for. It is commonly assumed that this "missing heritability" is accounted for by rare genetic variants that are beyond the scope of GWAS. Whole exome or whole genome sequencing studies offer novel approaches to identify rare variants associated with complex traits. A major challenge of these studies is how to integrate information from multiple rare variants. Inferring the locus or loci underlying a complex trait can dramatically improve the chances of finding the relevant genetic variants, since it substantially cuts down the number of hypotheses that need to be tested. Using population genetic principles and knowledge of demographic history can in turn dramatically improve the chances of identifying the important loci underlying a complex trait. I will discuss the approach of admixture mapping and illustrate examples of this approach from our work in mapping cancer susceptibility loci. I will also consider other approaches more suitable for mapping traits in "founder" populations.


Thursday, October 18th

A New Approach To High Dimensional Interaction Testing
Noah Simon
Department of Statistics, Stanford University

There has been rising interest in looking for interactions (or marginal interactions) between features in high dimensional regression models. Standard practice for testing marginal interactions has been to fit all bivariate models and use a post-hoc procedure on the nominal p-values of the interactions to control the false discovery rate. This approach is simple and often useful, but has potential pitfalls: it is strongly parametric and does not lend itself to a permutation-based test.

In our work, we consider the more specific case of continuous covariates with a binary response. We propose an alternative permutation-based method. Our method is motivated by posing the problem in a discriminant analysis framework and considering the "equivalent" regression-based model. This ultimately gives us a simple, robust method for testing marginal interactions using marginal correlations. We will also discuss a simple extension/reformulation for testing marginal interactions with the Cox model for survival analysis.

This is joint work with my advisor Rob Tibshirani.
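
As a rough illustration of testing a marginal interaction through marginal correlations, the sketch below compares the correlation of one feature pair between the two response classes and calibrates the difference by permuting class labels. This is only one plausible reading of the abstract, not the speakers' actual procedure; the statistic, the function names, and the simulation helper are all hypothetical. Note that permuting labels tests equality of the pair's joint distribution across classes, so a small p-value can also reflect differences other than correlation.

```python
import numpy as np

def interaction_stat(x_j, x_k, y):
    """Hypothetical statistic: class-1 minus class-0 Pearson correlation of (x_j, x_k)."""
    r1 = np.corrcoef(x_j[y == 1], x_k[y == 1])[0, 1]
    r0 = np.corrcoef(x_j[y == 0], x_k[y == 0])[0, 1]
    return r1 - r0

def permutation_pvalue(x_j, x_k, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling the binary labels y."""
    rng = np.random.default_rng(seed)
    observed = abs(interaction_stat(x_j, x_k, y))
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # keeps class sizes, breaks label/feature link
        if abs(interaction_stat(x_j, x_k, y_perm)) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling itself, so the p-value is never 0.
    return (exceed + 1) / (n_perm + 1)

def simulate_pair(n_per_class, rho_class1, rho_class0, seed=0):
    """Toy data (assumption): bivariate normal features with class-specific correlation."""
    rng = np.random.default_rng(seed)

    def draw(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        return rng.multivariate_normal([0.0, 0.0], cov, size=n_per_class)

    x = np.vstack([draw(rho_class1), draw(rho_class0)])
    y = np.concatenate([np.ones(n_per_class, dtype=int),
                        np.zeros(n_per_class, dtype=int)])
    return x[:, 0], x[:, 1], y
```

In a genome-scale setting this test would be repeated over many feature pairs, with the resulting p-values passed to a false discovery rate procedure.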


Thursday, October 25th

Roles of miR-34 miRNAs in Epigenetic Regulation during Somatic Reprogramming
Dr. Paul Lin
Department of Molecular and Cell Biology, UC Berkeley

Somatic reprogramming is rooted in the remarkable cellular plasticity retained throughout differentiation. The process can be triggered by exogenous expression of a set of defined ESC-specific transcription factors, Pou5f1 (Oct4), Sox2, Klf4, in the presence or absence of c-Myc. Although iPSCs exhibit pluripotency in multiple functional assays, iPSCs generated by these classic reprogramming factors are often not functionally equivalent to ESCs due to the "epigenetic memory" from the originating tissue, and they often exhibit aberrant silencing of specific imprinting loci, such as Gtl2. These epigenetic differences between iPSCs and ESCs constitute an important molecular basis underlying their functional differences in vivo. The mammalian miR-34 family of microRNAs contains three homologues located at two distinct genomic loci, mir-34a and mir-34b/c, both of which are bona fide targets of p53, one of the most important suppressors of somatic reprogramming. Our recent findings suggest that mir-34a is a key regulator that mediates the suppressive effect of p53 on reprogramming through post-transcriptional silencing of Sox2 and Nanog. Interestingly, although deficiency of either mir-34a or mir-34b/c enhances reprogramming efficiency, mir-34b/c-/- iPSCs exhibit stronger pluripotency in vivo than wildtype or mir-34a-/- iPSCs, and better resemble wildtype ESCs functionally. Consistent with this, in mir-34b/c-/- iPSCs we observed a normal imprinting status of the Gtl2 locus, whose aberrant silencing has been shown to be correlated with the compromised in vivo pluripotency of iPSCs. In addition, through high-throughput sequencing, we identified the derepression of a specific family of long-terminal repeat (LTR) retroelements, RLTR4, in mir-34b/c-/- iPSCs, possibly through the loss of the histone mark histone 3 lysine 9 trimethylation (H3K9me3).
Taken together, our results suggest that mir-34 microRNAs could play important roles in regulating epigenetic events that occur during somatic reprogramming.


Thursday, November 1st

Leveraging Family Pedigrees to Improve Genome Variant Identification from Next-Generation Sequencing Data
Dr. Francisco M. De La Vega
VP Genome Science, Real Time Genomics, Inc.

Advances in high-throughput sequencing (HTS) are bringing the use of genomes closer to the clinic. This has already been successful in the elucidation of congenital disease genes, with many other clinical applications envisioned. An important question yet to be answered is whether current HTS protocols provide data that meet clinical standards of quality. A simple approach to assess the clinical "grade" of an HTS genome is to evaluate the fraction of clinically relevant mutations (e.g. "DM" mutations from HGMD) that can be called at a given threshold of genotyping accuracy. Applying this metric to a diverse set of public genomes shows substantial undercalling for such sets of clinical variants. The sequencing of family pedigrees has been shown to simplify disease gene identification by eliminating irrelevant genotypes present in the healthy parents. However, data analysis of family members could also provide a significant boost in variant calling accuracy. Sequencing related individuals redundantly samples shared haplotypes; this additional information could benefit variant calling in all family members. Beyond the naive filtering of Mendelian errors, joint variant calling can be formulated under the Bayesian method that is more commonly used in variant identification in HTS data by testing Mendelian segregation hypotheses. This approach reduces Mendelian errors in trios to 0.1% compared to 2% in singleton calling, improves specificity of de novo variant identification by reducing false positives by more than 50%, and significantly reduces the number of undercalled clinical variants. Coupled with fast read mapping, analysis of a 60X family trio can be completed in under a day on a commodity server. Joint Bayesian calling can also be reformulated to analyze data from other analysis settings where related samples are being concurrently sequenced, such as tumor/normal tissue pairs or iPSC lineage data.
Improving the accuracy of HTS data will be crucial for the adoption of genomes and exomes in clinical settings.
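
To make the notion of a Mendelian error concrete, here is a minimal sketch of the naive per-site consistency check that the abstract contrasts with joint Bayesian calling. It is not the joint caller described in the talk. It assumes autosomal, diploid, unphased genotypes and treats any child genotype that cannot be formed from one paternal and one maternal allele as an error, so a true de novo mutation would also be flagged. The function names and genotype encoding are hypothetical.

```python
from itertools import product

def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can arise from one allele of each parent.

    Genotypes are 2-tuples of allele labels, e.g. ("A", "G"); order is ignored.
    """
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible

def mendelian_error_rate(trio_calls):
    """Fraction of sites whose (father, mother, child) genotypes are inconsistent."""
    if not trio_calls:
        raise ValueError("trio_calls must contain at least one site")
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in trio_calls)
    return errors / len(trio_calls)

# Toy example (made-up genotypes): the last site is a Mendelian error.
example_sites = [
    (("A", "A"), ("A", "G"), ("A", "G")),
    (("C", "T"), ("C", "T"), ("T", "T")),
    (("G", "G"), ("G", "G"), ("G", "G")),
    (("A", "A"), ("A", "A"), ("A", "G")),
]
```

Calling `mendelian_error_rate(example_sites)` on per-site singleton calls gives the kind of trio error rate the abstract quotes; a joint caller would instead use the pedigree while assigning genotypes.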


Thursday, November 8th

Long Non-Coding RNA (lncRNA) Regulatory Networks in Primate-Specific Phenotypes and Human Disease
Professor Leonard Lipovich
Center for Molecular Medicine and Genetics and Department of Neurology, Wayne State University

Ribonucleic acid (RNA) molecules other than messenger RNA have long been known to possess essential cellular roles not contingent on protein-coding capacity. The diversity and complexity of non-coding RNAs was a revelation of the first post-genomic decade. In contrast to microRNAs, lncRNAs utilize heterogeneous epigenetic and post-transcriptional mechanisms as both positive and negative regulators. Beyond isolated examples, lncRNA contributions to gene regulatory networks remain poorly understood. In 2012, the ENCODE (Encyclopedia of DNA Elements) Consortium, in which we participate, highlighted the low evolutionary conservation of human lncRNAs, ~5,000 of which are not detectable beyond primates. Using custom microarrays, we interrogated the in vivo human brain transcriptome in the surgically resected neocortex of seven epilepsy patients, comparing areas of high and low electrical activity within each patient and groupwise. We identified 1,288 differentially expressed lncRNAs, of which 26 formed a trans-regulatory network with coordinately expressed protein-coding genes throughout the genome. Eight additional lncRNAs were cis-encoded, through genomic positional proximity or sense-antisense overlaps, with differentially expressed protein-coding genes, which in turn regulated activity-dependent downstream targets. We validated the network placement of primate-specific lncRNAs by in vitro reverse genetics. We developed a model, CUBE (Cis-trans Ubiquitous Bidirectional Expression), where cis-encoded lncRNAs are network nodes, extending directional weighted edges toward protein-coding partner genes. These genes are joined by additional, genome-wide edges to in-trans targets. Non-conserved lncRNAs at cis-trans interfaces regulate conserved genes, contributing to the molecular basis of primate, including human, phenotypic uniqueness, and exposing the limitations of animal models for studying human disease.
We will discuss our NGS analysis of human melanoma lncRNAs along the apoptosis-proliferation axis, as well as imputation and experimental validation of lncRNA-containing networks for the next phase of ENCODE.


Thursday, November 15th

Functional Genomics and Epigenomics of Complex Disease Genetics
Professor Manolis Kellis
MIT Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard

In this talk, I will present our efforts to integrate genomic variation, epigenomic variation, and functional genomics datasets with genome-wide association studies to understand the molecular basis of complex disease. (1) We use reference epigenomic maps of multiple human cell types to dramatically expand the annotation of non-coding regions and to link active enhancers to their upstream regulators and their downstream target genes using coordinated patterns of activity across cell types. (2) We use the resulting regulatory predictions to revisit disease-associated loci, revealing SNPs that disrupt or create predicted enhancer and causal regulatory motifs, providing mechanistic hypotheses for the observed associations for individual loci. (3) Beyond the few genome-wide significant loci retained by traditional GWAS, we find functional enrichments across thousands of type-1-diabetes-associated SNPs in cell type-specific enhancers using a rank-based statistical test for enrichment. (4) In addition to reference epigenomes, we study both genomic and epigenomic variation in Alzheimer’s disease across 750 individuals, revealing a global hyper-methylation signature in brain-specific enhancers containing specific motifs and methyl-QTLs for 60,000 probes. (5) We systematically validate thousands of our regulatory predictions using Massively Parallel Reporter Assays by disrupting individual binding sites and individual nucleotides of predicted causal regulators, revealing their distinct roles in specifying enhancer activity. Our results suggest a general framework for integrating multi-cell functional genomics and epigenomics information to decipher cis-regulatory connections in complex disease.


Thursday, November 29th

Approaches to Finding Missing Heritability in Autoimmune Disease
Professor Lisa Barcellos
Division of Epidemiology, UC Berkeley

A substantial genetic component in most autoimmune diseases is supported by multiple strong lines of epidemiologic evidence. Genome-wide association studies (GWAS), in which several hundred thousand to more than a million SNPs are assayed in thousands of individuals, have been conducted and have begun to unravel the polygenic etiology underlying these conditions. As in other complex diseases, autoimmune disease variants identified through GWAS confer relatively small increments in risk (allelic odds ratios of 1.1-1.5) and explain only a very small proportion of heritability. Several explanations for the missing heritability have been proposed, including the involvement of epigenetic mechanisms. While the human genome comprises more than 3 billion base pairs, a much smaller number are known coding sequences. Differences among individuals, including those that influence health and disease, are likely to include epigenetic changes such as DNA methylation. The field of epigenetics is a new and exciting area of research for complex diseases. The presentation will address some of the challenges related to conducting large epidemiologic studies of DNA methylation in autoimmune disease. Examples will be drawn from studies of type 1 diabetes, Sjögren's syndrome, rheumatoid arthritis and multiple sclerosis.


Thursday, December 6th

Tree-Based Methods for Creating Survival Risk Groups
Professor Annette Molinaro
Departments of Neurological Surgery and of Epidemiology and Biostatistics, UCSF

We recently developed partDSA, the Partitioning Deletion/Substitution/Addition algorithm, a multivariate method that, like Classification and Regression Trees (CART), utilizes loss functions to select and partition predictor variables to build a tree-like regression model for a given (uncensored) outcome. However, unlike CART, partDSA permits both 'and' and 'or' conjunctions of predictors, elucidating interactions between variables as well as their independent contributions. partDSA thus permits tremendous flexibility in the construction of predictive models and has been shown to surpass CART in both prediction accuracy and stability. As the resulting models continue to take the form of a decision tree, partDSA also provides an ideal foundation for developing a clinician-friendly tool for accurate risk prediction and stratification. However, until now, partDSA has been limited to uncensored outcomes.

We have now extended partDSA to the setting of right-censored outcomes via two observed data loss functions: the Inverse Probability Censoring Weighted (IPCW) and Brier Score weighting schemes. We show in numerous simulations and a data analysis that both proposed adaptations of partDSA perform as well as, and often considerably better than, two competing tree-based methods.
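
For readers unfamiliar with inverse probability of censoring weighting, the generic form of an IPCW empirical loss is sketched below. The notation (covariates W, candidate predictor \psi, full-data loss L, estimated censoring survival function \hat{G}) is assumed for illustration; the exact weights and loss used in the partDSA extension may differ.

```latex
% Observed data for subject i (assumed notation):
%   \tilde{T}_i = \min(T_i, C_i)   observed follow-up time
%   \Delta_i = I(T_i \le C_i)      1 if the event was observed, 0 if censored
%   \hat{G}(t \mid W) \approx \Pr(C > t \mid W)   estimated censoring survival function

\hat{R}_{\mathrm{IPCW}}(\psi)
  = \frac{1}{n} \sum_{i=1}^{n}
    \frac{\Delta_i}{\hat{G}(\tilde{T}_i \mid W_i)} \,
    L\bigl(\tilde{T}_i, \psi(W_i)\bigr)
```

Only uncensored subjects contribute, and each is up-weighted by the inverse of its estimated probability of remaining uncensored, which compensates for comparable subjects who were censored earlier.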