PB HLTH 295, Section 003
Statistics and Genomics Seminar

Spring 2014

Thursday, January 23rd

Theoretical and Applied Problems in Statistical Signal and Image Analysis
Professor Armin Schwartzman
Department of Statistics, North Carolina State University

In this talk, I will present an overview of my work related to image analysis and large-scale multiple testing. The main topic concerns the problem of detecting local significant regions in signals and images, where the need is to make inferences about spatial features such as smooth peaks whose number, height and location are unknown, rather than individual pixels or voxels. Examples include finding protein binding sites in ChIP-Seq genomic data (1D), finding geographical regions of environmental change (2D), and finding regions of neural activation in brain imaging (3D). To solve this problem, I propose a formal topological multiple testing approach involving kernel smoothing and testing of local maxima. Theory and simulations show that global error rates are controlled asymptotically and that the optimal bandwidth corresponds to the “matched filter” principle, where the kernel size should be close to that of the peaks to be detected. In the remaining time, I will mention other work of mine on large-scale multiple testing with strong correlation, analysis of positive definite matrix data in diffusion tensor images of the brain, and analysis of Landsat images for tracking receding mountain glaciers.
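
As a rough illustration of the smoothing-and-local-maxima idea described above (not the speaker's actual procedure), the sketch below smooths a noisy 1D signal with a Gaussian kernel whose bandwidth matches the peak width, then keeps local maxima above a fixed height. The simulated signal, the bandwidth of 8 samples, the threshold of 1.0, and all function names are assumptions made for this example. The formal approach in the talk tests local maxima with error-rate control, which this sketch replaces with a plain threshold.

```python
import numpy as np

def gaussian_kernel(bandwidth, radius=None):
    """Unit-sum Gaussian kernel sampled on an integer grid."""
    if radius is None:
        radius = int(np.ceil(4 * bandwidth))
    x = np.arange(-radius, radius + 1)
    w = np.exp(-0.5 * (x / bandwidth) ** 2)
    return w / w.sum()

def local_maxima(y):
    """Indices of strict interior local maxima of a 1D array."""
    interior = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    return np.flatnonzero(interior) + 1

def detect_peaks(signal, bandwidth, threshold):
    """Smooth, find local maxima, keep those above a fixed height.

    The fixed threshold is a stand-in for a proper multiple testing step.
    """
    smoothed = np.convolve(signal, gaussian_kernel(bandwidth), mode="same")
    candidates = local_maxima(smoothed)
    return candidates[smoothed[candidates] > threshold], smoothed

# Hypothetical data: three Gaussian-shaped peaks of width 8 in unit-variance noise.
rng = np.random.default_rng(0)
t = np.arange(1000)
centres = (200, 500, 800)
truth = sum(3.0 * np.exp(-0.5 * ((t - c) / 8.0) ** 2) for c in centres)
observed = truth + rng.normal(scale=1.0, size=t.size)

# "Matched filter": kernel bandwidth chosen equal to the peak width.
peaks, smoothed = detect_peaks(observed, bandwidth=8.0, threshold=1.0)
print(peaks)
```

Rerunning with a much smaller or much larger bandwidth is a quick way to see why a kernel close in size to the peaks tends to work best.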

Thursday, January 30th

Methods for the Analysis of Exposure Effect on Secondary Outcomes in Case-Control Studies
Dr. Tamar Sofer
Harvard School of Public Health

Case-control studies are designed to study associations between exposures and a single, primary disease outcome. Typically, information about secondary outcomes is also collected, but association studies targeting secondary outcomes should account for the case-control sampling scheme. The Inverse Probability Weighted (IPW) estimator is often used in such studies to prevent potential selection bias when estimating population effects. In this talk, I will present methods for analysis of secondary outcomes, which extend the IPW estimator. First, I suggest a "control function" assisted IPW estimator. In this approach we add a so-called control function to the usual IPW. This function incorporates information about the population disease model and the selection bias. The resulting estimator is more efficient than the usual IPW if the disease model and the selection bias function are correctly specified, and robust to misspecification of the selection bias function. Second, I present an IPW pseudo-likelihood approach for the analysis of the effect of a set of genetic variants on multiple correlated secondary outcomes. Using the pseudo-likelihood we develop a variance component test for the set effect. Upon rejection of the null hypothesis of no effect, we propose to identify and estimate non-zero effects of genetic variants using oracle-penalized weighted pseudo-likelihood. The proposed estimators are evaluated in simulation studies and demonstrated on case-control genome-wide association studies of lung cancer and type 2 diabetes.
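
To make the selection-bias issue concrete, here is a minimal simulation sketch of the usual IPW estimator only; it does not implement the control-function or pseudo-likelihood extensions from the talk. The simulated population, the effect size of 0.5, the disease model coefficients, and the helper name weighted_slope are all assumptions for illustration. Sampling probabilities are taken as known by design, which in practice requires knowing the disease prevalence.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population: binary exposure, continuous secondary outcome
# (true exposure effect 0.5), and a disease that depends on both.
x = rng.binomial(1, 0.3, size=n)
y = 0.5 * x + rng.normal(size=n)
logit = -4.0 + 1.0 * x + 1.0 * y
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Case-control sampling: all cases plus an equal number of random controls.
cases = np.flatnonzero(d == 1)
controls = rng.choice(np.flatnonzero(d == 0), size=cases.size, replace=False)
sample = np.concatenate([cases, controls])

# Inverse sampling probabilities (assumed known by design).
p_case = 1.0
p_control = cases.size / np.sum(d == 0)
w = np.where(d[sample] == 1, 1.0 / p_case, 1.0 / p_control)

def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x (with intercept)."""
    xm = np.average(x, weights=w)
    ym = np.average(y, weights=w)
    return np.sum(w * (x - xm) * (y - ym)) / np.sum(w * (x - xm) ** 2)

xs, ys = x[sample], y[sample]
naive = weighted_slope(xs, ys, np.ones_like(w))  # ignores the sampling scheme
ipw = weighted_slope(xs, ys, w)                  # reweights to the population
print(f"naive: {naive:.3f}  IPW: {ipw:.3f}  truth: 0.5")
```

In this setup the unweighted estimate overstates the exposure effect because cases are over-represented among the exposed, while the weighted estimate recovers the population effect up to sampling noise.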

Thursday, February 6th

A High Dimensional Graphical Model for Count Data with Application to RNA Co-expression Networks
Dr. Yoonha Choi
Stanford University School of Medicine

With next-generation technologies generating large-scale genomic data, there is the potential to construct complex networks that could provide new insights in understanding the molecular process. A number of statistical methods have been developed for constructing networks based on a Gaussian assumption that may not be appropriate for non-Gaussian data such as RNA-seq data. In this study, we propose a novel statistical approach based on the Poisson lognormal distribution to model complex covariance structures for count data. This approach uses a penalized likelihood method to estimate sparse partial correlations. However, the implementation of the method is problematic due to the complexity of the likelihood. Laplace integration is used to approximate the likelihood and its derivatives and the alternating direction method of multipliers is applied to find the maximum-likelihood estimator. Using simulations, we show that the proposed method performs better than the Gaussian models when RNA expression levels are low and the strength of signals is weak. We also apply the proposed method to published RNA-seq data and construct networks that may generate testable hypotheses of genetic regulatory interactions.
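
A small simulation sketch can suggest why Gaussian assumptions struggle with low counts. It draws from a bivariate Poisson lognormal model (latent Gaussian log-rates, Poisson counts) and compares the correlation of the counts with the latent correlation. The mean, covariance, and sample size are assumptions for illustration, and the sketch does not implement the penalized likelihood, Laplace approximation, or ADMM steps of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical latent model: log-rates are bivariate normal with correlation 0.6
# and mean 0, i.e. low expected counts.
mu = np.zeros(2)
sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
z = rng.multivariate_normal(mu, sigma, size=n)  # latent log-rates
counts = rng.poisson(np.exp(z))                 # observed counts

latent_corr = np.corrcoef(z, rowvar=False)[0, 1]
count_corr = np.corrcoef(counts, rowvar=False)[0, 1]
mean0, var0 = counts[:, 0].mean(), counts[:, 0].var()

print(f"latent correlation: {latent_corr:.2f}")
print(f"count correlation:  {count_corr:.2f}")
print(f"gene 1 mean {mean0:.2f}, variance {var0:.2f}")
```

The counts are overdispersed relative to a Poisson distribution, and their correlation is noticeably weaker than the latent correlation, so a method that treats the counts as Gaussian sees a diluted dependence signal.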

Thursday, February 13th

An IDP-Based Swiss-Army-Knife-Like Toolkit for Signaling Diversification
Professor A. Keith Dunker
Department of Biochemistry and Molecular Biology, Indiana University School of Medicine

Intrinsically disordered proteins (IDPs) and IDP regions commonly occur in eukaryotic cells [1] and have been strongly implicated both in protein-protein interaction networks [2] and in transcription factor structure [3]. These IDPs and IDP regions readily undergo alternative splicing [4], and posttranslational modification (PTM) [5]. Very few mutations are needed to radically alter any one or even combinations of these activities, which all play key roles in cell signaling. These same activities determine which genes are expressed in different cells [6], and how these gene products interact in different cell types [7]. The orchestrated interactions among IDPs, their interactions with multiple protein and nucleic acid partners, their further modification via PTMs, and their cell-type diversification via alternative splicing are proposed to bring about cellular differentiation. In summary, we propose that this set of IDP-dependent activities constitutes a multifaceted, Swiss-Army-Knife-like toolkit that enabled very rapid (on an evolutionary time scale) diversification of cell signaling, thus underlying the initial development of metazoans, their subsequent rapid evolution, and their current developmental pathways.

[1] Xue, B., Dunker, A.K., and Uversky, V.N. Orderly order in protein intrinsic disorder distribution: Disorder in 3500 proteomes from viruses and the three domains of life. J. Biol. Struct. Dyn. 30:137-149 (2012).
[2] Dunker, A.K., Cortese, M.S., Romero, P., Iakoucheva, L.M., and Uversky, V.N. Flexible nets: the roles of intrinsic disorder in protein interaction networks. FEBS J. 272: 5129-5148 (2005).
[3] Liu, J., Perumal, N.B., Uversky, V.N., Oldfield, C.J., Su, E.W., Dunker, A.K. Intrinsic disorder in transcription factors. Biochemistry. 45:6873-6888 (2006).
[4] Romero, P. R., Zaidi, S., Fang, Y.Y., Uversky, V.N., Radivojac, P., Cortese, M., Strickmeyer, M., Obradovic, Z., and Dunker, A.K. Alternative splicing in concert with protein intrinsic disorder enables functional diversity in multicellular organisms. Proc. Nat'l. Acad. Sci. USA 103:8390-8395 (2006).
[5] Hsu, W.L., Oldfield, C.J., Xue, B., Meng, J., Huang, F., Romero, P.R., Uversky, V.N., and Dunker, A.K. Exploring the binding diversity of intrinsically disordered proteins involved in one-to-many signaling. Protein Sci. 22:258-273 (2013).
[6] Liu, Y., Matthews, K.S., and Bondos, S.E. Internal regulatory interactions determine DNA binding specificity by a Hox transcription factor. J. Mol. Biol. 390:760-774 (2009).
[7] Buljan, M., Chalancon, G., Dunker, A.K., Bateman, A., Balaji, S., Fuxreiter, M., and Babu, M.M. Alternative splicing of intrinsically disordered regions and rewiring of protein interactions. Curr. Opin. Struct. Biol. 23:443-450 (2013).

Joint work with Sarah Bondos, Department of Molecular and Cellular Medicine, Texas A & M University Health Sciences Center.

Thursday, February 20th

Deep Sequencing Reveals Details of Translation
Dr. Liana Lareau
UC Berkeley Siebel Stem Cell Institute and the California Institute for Quantitative Biosciences (QB3)

Ribosome profiling measures translation by sequencing ribosome-protected fragments of mRNAs to identify the position and number of ribosomes on each gene. We show how ribosome profiling provides unexpected insight into mechanistic aspects of translation. Our ribosome profiling revealed two distinct populations of ribosome footprints: 28 nt and unexpected 20 nt footprints, equally abundant and clearly spaced codon by codon throughout reading frames. We conclude that the two footprint sizes represent two conformations of the ribosome as it ratchets along its mRNA template. The balance of small and large footprints varies by codon and is correlated with translation speed. The ability to visualize conformational changes in the ribosome during the elongation cycle, at single-codon resolution, provides a new way to study the detailed kinetics of translation and a new probe with which to identify the factors that affect each step in the elongation cycle.

Thursday, February 27th

Nonparametric Estimation of Network Structure
Professor Patrick J. Wolfe
Departments of Statistical Science and Computer Science, University College London

Networks are a key conceptual tool for analysis of rich data structures, yielding meaningful summaries in the biological as well as other sciences. As datasets become larger, however, the interpretation of network-based summaries becomes more challenging. A natural next step in this context is to think of modeling a network nonparametrically -- and here we will show how such an approach is possible, both in theory and in practice. As with a histogram, nonparametric models can fully represent variation in a network, without presupposing a particular set of motifs or other distributional forms. Advantages and limitations of the approach will be discussed, along with open problems at the methodological frontier of statistical network analysis.

Joint work with Sofia Olhede (http://arxiv.org/abs/1312.5306/).
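
The histogram analogy can be sketched as a piecewise-constant summary of an adjacency matrix: nodes are grouped into blocks, and each pair of blocks is summarized by its edge density. The sketch below assumes the block labels are already given, which sidesteps the hard part of fitting them from data; the function name block_densities and the simulated two-block network are assumptions for illustration, not the estimator from the referenced paper.

```python
import numpy as np

def block_densities(adj, labels, k):
    """Edge density for every pair of blocks, as a k x k symmetric matrix.

    adj    : symmetric 0/1 adjacency matrix with zero diagonal
    labels : integer block label in 0..k-1 for each node
    Assumes every block has at least two nodes.
    """
    onehot = np.eye(k)[labels]                       # n x k membership indicators
    edges = onehot.T @ adj @ onehot                  # ordered node pairs joined by an edge
    sizes = onehot.sum(axis=0)
    pairs = np.outer(sizes, sizes) - np.diag(sizes)  # ordered node pairs, no self-pairs
    return edges / pairs

# Hypothetical network: two blocks of 100 nodes with known connection probabilities.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 100)
probs = np.array([[0.30, 0.05],
                  [0.05, 0.20]])
p = probs[labels][:, labels]
upper = np.triu(rng.random((200, 200)) < p, k=1)     # sample each pair once
adj = (upper | upper.T).astype(float)

est = block_densities(adj, labels, 2)
print(np.round(est, 3))
```

As with an ordinary histogram, the number of blocks plays the role of a bin width: more blocks capture finer structure but each density is estimated from fewer node pairs.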

Thursday, March 6th

The Dynamic Regulatory Architecture of the Human Genome
Professor Anshul Kundaje
Departments of Genetics and Computer Science, Stanford University

Large scale efforts such as ENCODE and The Roadmap Epigenomics Project are generating massive compendia of diverse functional genomic data to interrogate the human transcriptome, regulome and epigenome in diverse cellular contexts. We have developed novel computational approaches to integrate these data to obtain comprehensive maps of regulatory elements and their dynamics across 100s of diverse human cell-types and tissues. We have also investigated the complex relationship between genetic variation and variation of regulatory chromatin states and gene expression across individuals from diverse ancestry groups. We are developing novel machine learning methods to learn unified predictive regulatory models allowing us to identify genomic regulatory grammars and subnetworks of interacting regulators that drive context-specific gene expression responses. I will present an overview of these integrative analyses that provide a multi-scale view of the regulatory architecture of the human genome and the dynamic regulation programs that result in the phenotypic diversity of cells and tissues in the human body.

Thursday, March 13th

Data and Code Sharing in Bioinformatics: From Bermuda to Toronto to Your Laptop
Professor Victoria Stodden
Department of Statistics, Columbia University

Large-scale sequencing projects paved the way for the adoption of pioneering open data policies, making bioinformatics one of the leading fields for data availability and access. In this talk I will trace the history of open data in -omics based research, and discuss how open code as well as data are being addressed today. This will include discussing leading edge tools and computational infrastructure developments intended to facilitate reproducible research through workflow tracking, computational environments, and data and code sharing.

Victoria Stodden: [http://www.stodden.net/]
Slides: [PDF]
Video: [https://archive.org/details/UC_Berkeley_Statistics_2014_03_13_Victoria_Stodden]
International Data Release Workshop, May 12--13, 2009, Toronto, Canada, Participants list: [PDF]

Thursday, March 20th

CAGe - A Hybrid Pipeline for Efficient Variant Calling
Adam Bloniarz
Department of Statistics, UC Berkeley

We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in less than 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.

Thursday, April 10th

Extensive Cross-regulation in Post-transcriptional Regulatory Network in Drosophila
Marcus Stoiber
Graduate Group in Biostatistics, UC Berkeley

We have generated an extensive map of RNA-protein interactions for 24 RNA-binding proteins (RBPs) using RIP-seq in Drosophila. The landscape of the resulting network features "hot spot" RNAs (HSRs), analogous to high occupancy target (HOT) regions previously reported in ChIP studies. These are RNAs bound by many RBPs, and show significant enrichment for genes involved in the RNAi response (including AGO2 and Dcr-2), nonsense-mediated decay (NMD), and splicing regulation, indicating detailed post-transcriptional regulation of these pathways. Additionally, the mRNAs of RBPs that bind many RNAs tend to be HSRs. Analysis of RBP co-occupation of mRNAs enables putative classifications for several RBPs currently of unknown biological function. Two quaking related factors with previously unidentified targets have binding profiles consistent with heterogeneous nuclear ribonucleoproteins (hnRNPs) functioning in neurogenesis. Integration with protein interaction maps indicates a protein complex comprised of RBPs that also participate in a reciprocal interaction with their corresponding mRNAs. The area of post-transcriptional regulation has been relatively void of global studies compared to transcriptional regulation. The extensive cross-regulation in this post-transcriptional regulatory network that we observe, particularly the involvement of RNAi, NMD and splicing regulation, points to the need for system-wide study of this neglected leg of the central dogma.

Thursday, April 17th

IPython: From Interactive Computing to Computational Narratives
Dr. Fernando Perez
Henry H. Wheeler Jr. Brain Imaging Center, UC Berkeley

IPython is an open source environment for interactive and parallel computing that supports all stages in the lifecycle of modern scientific computing and data analysis: individual exploration, collaborative development, large-scale production using parallel resources, publication and education. The web-based IPython Notebook takes full advantage of web technologies to integrate rich media and client-side JavaScript into the data science workflow. It allows scientists to share their work in an open document format that is a true "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching.

I will describe how a single architecture covers this entire use spectrum: the underlying protocols provide a language-agnostic environment to explore data, run production analyses and share results in a variety of output formats, from websites to printed books. I will illustrate this with use cases taken from a variety of disciplines and programming languages.

Thursday, April 24th

Long-Term Memory Formation Produces Wide-Spread Transcriptional and Epigenetic Changes in the Hippocampus
Dr. Lucia Peixoto
Department of Biology, University of Pennsylvania

A fundamental question in neuroscience is how memories are stored and retrieved in the brain. Many psychiatric and neurodevelopmental disorders are associated with cognitive deficits. Characterizing the biological basis of memory storage and retrieval is therefore critical for understanding normal and abnormal brain function. Long-term memory formation requires transcription, translation and epigenetic processes that control gene expression. Thus, identifying the transcriptional and epigenetic changes that occur after memory acquisition and retrieval is of broad interest and importance. In this talk I will discuss the challenges of applying omic technologies, both microarray and high-throughput sequencing, to the study of gene expression and epigenetic regulation in the context of brain and behavior, and how those studies led us to propose a role for the regulation of chromatin accessibility in memory consolidation.

Thursday, May 1st

Inferring Network Architecture in the Immune System: From Populations to Single Cells
Professor Nir Yosef
Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, UC Berkeley

Complex, interacting systems are omnipresent in the world - from our social networks to the molecular circuits that guide the behavior of our cells. Charting out the connectivity of such systems and understanding their function has proven an extremely important, yet often daunting task. In this talk I will describe my work on modeling networks of molecules that control the differentiation and function of cells in the immune system and explain how we used the resulting models to gain insight into biological organizational principles and to identify key regulators of autoimmunity.

As our model system, we investigated the differentiation of naive T cells into autoimmune-inducing Th17 T helper cells, which, despite enormous clinical importance, remain poorly understood. The resulting "underlying" network consists of two self-reinforcing, but mutually antagonistic, sets of regulatory molecules that collectively achieve appropriate balance between Th17 cells and other, competing T cell lineages. Specifically, we identified several molecular circuits that are crucial to blocking the pathogenesis of Th17-dependent autoimmunity by shifting the balance toward the differentiation of immunosuppressive T cells.

While in this study we treated the Th17 cells as a single "lineage", accumulating data suggest that there are distinct Th17 sub-populations that possess very different pathogenic potential: depending on the cytokine milieu to which the cells are exposed, in-vitro polarized Th17 cells can either cause severe autoimmune responses or have little effect on autoimmunity. This is also reflected in in-vivo studies, whereby Th17 cells sorted from sites of inflammation exhibit substantial heterogeneity in their effector function. In order to characterize the different sub-types of Th17 cells and gain better understanding of the regulatory mechanisms that govern this diversity, we are now studying RNA-seq from single Th17 cells. I will end this talk by presenting some of the hurdles in this analysis, and some of the lessons we have learned about how well in-vitro differentiated Th17 cells reflect their in-vivo counterparts and which previously uncharacterized regulators may be directly responsible for Th17 pathogenicity.