PB HLTH 292, Section 008
Statistics and Genomics Seminar

Fall 2009



Thursday, September 3rd

SNP Association Studies with Case-Parent Trios
Professor Ingo Ruczinski
Department of Biostatistics, Johns Hopkins University

While most SNP association studies are case-control based, family-based designs, and in particular case-parent trio designs, have some very attractive features. We discuss and demonstrate these features via a genome-wide and a candidate-gene association study that employ case-parent trios. We also extend the logic regression methodology, originally developed for cohort and case-control studies, to detect SNP-SNP and SNP-environment interactions in studies of trios with affected probands. Trio logic regression accounts for the linkage disequilibrium (LD) structure in the genotype data and accommodates missing genotypes via haplotype-based imputation. We also derive an efficient algorithm to simulate case-parent trios where genetic risk is determined via epistatic interactions.
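
To make the simulation setting concrete, here is a minimal Python sketch of generating case-parent trios whose disease risk is purely epistatic. It uses naive rejection sampling on the proband's disease status rather than the efficient algorithm the talk describes, and all parameter values and names are illustrative assumptions.

    import random

    def simulate_trio(p1=0.3, p2=0.2, base_risk=0.05, interaction_risk=0.5):
        """Simulate one case-parent trio for two unlinked biallelic SNPs.

        Parents are drawn under Hardy-Weinberg equilibrium; each parent
        transmits one allele per SNP to the child.  Disease risk here is
        purely epistatic: it is elevated only when the child carries at
        least one risk allele at *both* loci.  Trios are kept only if
        the child (the proband) is affected.
        """
        while True:
            # parental genotypes: risk-allele counts (0, 1, or 2) per SNP
            parents = [[sum(random.random() < p for _ in range(2))
                        for p in (p1, p2)] for _ in range(2)]
            # Mendelian transmission: each parent passes one allele per SNP
            child = [sum(random.random() < parents[i][s] / 2.0 for i in range(2))
                     for s in range(2)]
            risk = interaction_risk if child[0] > 0 and child[1] > 0 else base_risk
            if random.random() < risk:  # ascertain an affected proband
                return parents, child

    trios = [simulate_trio() for _ in range(1000)]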


Thursday, September 10th

Adjusted Bayesian inference for selected parameters
Professor Daniel Yekutieli
Department of Statistics and Operations Research, Tel Aviv University

I will address the problem of providing inference for parameters selected after viewing the data. A frequentist solution to this problem is to use False Discovery Rate (FDR) controlling multiple testing procedures to select the parameters and to construct False Coverage-statement Rate (FCR) adjusted confidence intervals for the selected parameters. I will argue that selection also affects Bayesian inference and present a Bayesian framework for providing inference for selected parameters. I will explain the role of selection in controlling the occurrence of false discoveries in Bayesian analysis and demonstrate how to specify selection criteria. I will also explain the relation between our Bayesian approach and the Bayesian FDR approach and apply it to microarray data.
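
A minimal sketch of the frequentist solution the abstract refers to, assuming independent normal estimates with known standard error: select via Benjamini-Hochberg at level q, then give each of the R selected parameters a marginal interval at level 1 - R*q/m. Function and variable names are my own.

    import numpy as np
    from scipy import stats

    def fcr_adjusted_intervals(z, sigma=1.0, q=0.05):
        """BH selection followed by FCR-adjusted confidence intervals.

        z is a vector of estimates assumed ~ N(theta_i, sigma^2).  The
        R parameters selected by BH at level q each get a marginal
        interval at level 1 - R*q/m, which controls the false
        coverage-statement rate at q.
        """
        z = np.asarray(z, float)
        m = len(z)
        pvals = 2 * stats.norm.sf(np.abs(z) / sigma)
        order = np.argsort(pvals)
        passed = pvals[order] <= q * np.arange(1, m + 1) / m
        R = int(np.max(np.where(passed)[0])) + 1 if passed.any() else 0
        selected = order[:R]
        if R == 0:
            return selected, None
        half = stats.norm.ppf(1 - R * q / (2 * m)) * sigma  # widened half-width
        return selected, np.column_stack([z[selected] - half, z[selected] + half])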


Thursday, September 17th

Detection and Improved Ranking of Disease-Associated SNPs and SNP Interactions Using Logic Regression
Dr. Holger Schwender
Department of Biostatistics, Johns Hopkins University

A major goal of genetic association studies is the identification of SNPs (Single Nucleotide Polymorphisms) and SNP interactions that are associated with the disease of interest. A problem concerned with this task is that SNPs often show an effect on the disease risk only when interacting with other SNPs, so that testing each SNP individually might fail to detect such SNPs. This problem can be overcome by employing methods such as logic regression that take the multivariate structure of the SNP data into account. In my talk, I will present a procedure called logicFS that uses logic regression as the base learner in bagging to identify interesting SNPs and SNP interactions, and to quantify their importance for a correct prediction of the response. These importances can then also be employed to rank the SNPs more appropriately than marginal testing. Finally, I will show how this procedure can be adapted to test sets of SNPs and applied to responses other than binary ones, and to trio data.
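
Logic regression searches for Boolean combinations of binary SNP codings; as a rough stand-in, the sketch below uses shallow decision trees as the base learner in bagging and measures a variable's importance as the average drop in out-of-bag accuracy when its values are permuted. It illustrates the bagging-importance idea behind logicFS, not the logicFS algorithm itself.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_importance(X, y, n_bags=100, seed=0):
        """Bagging-based variable importance in the spirit of logicFS,
        with shallow decision trees standing in for logic regression."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        imp = np.zeros(p)
        for _ in range(n_bags):
            boot = rng.integers(0, n, n)              # bootstrap sample
            oob = np.setdiff1d(np.arange(n), boot)    # out-of-bag cases
            if oob.size == 0:
                continue
            tree = DecisionTreeClassifier(max_depth=3).fit(X[boot], y[boot])
            base = np.mean(tree.predict(X[oob]) == y[oob])
            for j in range(p):
                Xperm = X[oob].copy()
                Xperm[:, j] = rng.permutation(Xperm[:, j])
                imp[j] += base - np.mean(tree.predict(Xperm) == y[oob])
        return imp / n_bags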


Thursday, October 1st

TumorBoost: Normalization of allele-specific tumor copy numbers in paired tumor/normal designs for genotyping microarrays
Dr. Pierre Neuvial
Department of Statistics, UC Berkeley

High-throughput genotyping microarrays can be used to assess not only changes in total DNA copy number, but also changes in allele-specific copy numbers (ASCNs). Even after state-of-the-art preprocessing, ASCN estimates still suffer from systematic effects that make them difficult to use effectively for downstream analyses, such as ASCN segmentation and calling in cancer studies.

We have developed a method for normalizing ASCN estimates of a tumor based on ASCN estimates from a single matched normal. The method improves separation between ASCN states, and applies to any tumor/normal pair of genotyping microarrays.
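
As a rough illustration of the paired design, the sketch below corrects each SNP's tumor allele B fraction (BAF) by the matched normal's deviation from its expected BAF, given a naive genotype call from the normal. It is a simplified stand-in under my own assumptions, not the published TumorBoost estimator.

    import numpy as np

    def normalize_baf(beta_t, beta_n):
        """Correct tumor allele B fractions (BAFs) SNP by SNP using the
        matched normal: call the normal genotype naively from the normal
        BAF, then subtract the normal's deviation from its expected BAF
        (0, 1/2, or 1) from the tumor BAF."""
        beta_t = np.asarray(beta_t, float)
        beta_n = np.asarray(beta_n, float)
        # naive genotype calls from the normal: AA (<1/3), AB, or BB (>2/3)
        mu = np.where(beta_n < 1/3, 0.0, np.where(beta_n > 2/3, 1.0, 0.5))
        return np.clip(beta_t - (beta_n - mu), 0.0, 1.0)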

This is joint work with Henrik Bengtsson.


Thursday, October 8th

Non-coding sequences near duplicated genes evolve rapidly
Dr. Dennis Kostka
Gladstone Institutes, UCSF

Gene expression divergence and chromosomal rearrangements have both been put forward as major contributors to phenotypic differences between closely related species. It has also been established that duplicated genes show enhanced rates of positive selection in their amino acid sequences. If functional divergence is largely due to changes in gene expression, it follows that regulatory sequences in duplicated loci should evolve rapidly. To test this hypothesis we performed likelihood ratio tests on all non-coding loci within 5kb of every transcript in the human genome and identified sequences with increased substitution rates in the human-chimp lineage. The fraction of rapidly evolving elements is significantly higher near genes that are duplicated in humans and chimps than near non-duplicated genes. 5' untranslated regions are particularly enriched for accelerated sequences. We also conducted a genome-wide scan for nucleotide substitutions predicted to affect transcription factor binding. Rates of binding site turnover are elevated in accelerated non-coding sequences of duplicated loci. Many of the genes associated with these fast-evolving sequences belong to functional categories identified in previous studies of positive selection on amino acid sequences. However, our approach highlights several processes and pathways that have not been emphasized in studies of single-copy genes. We find a particularly striking pattern of accelerated evolution near genes involved in the establishment and maintenance of pregnancy, processes that differ significantly between humans and monkeys. Our findings suggest that positive selection on the regulation of duplicated genes has played a significant role in human evolution.
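
The likelihood ratio tests here are phylogenetic, but their logic can be illustrated with a toy Poisson model of substitution counts: compare a genome-wide background rate against an element-specific free rate and refer twice the log-likelihood ratio to a chi-square distribution. Everything in the sketch below is an illustrative assumption, not the study's actual test.

    from scipy import stats

    def rate_acceleration_pvalue(k, length, background_rate):
        """Toy likelihood ratio test for an elevated substitution rate.

        H0: the element accumulates substitutions at the genome-wide
        background rate; H1: it has its own free rate (MLE = k/length).
        Twice the log-likelihood ratio is referred to a chi-square(1).
        """
        mu0 = background_rate * length      # expected count under H0
        mu1 = max(k, 1e-12)                 # expected count under H1
        llr = 2 * (stats.poisson.logpmf(k, mu1) - stats.poisson.logpmf(k, mu0))
        return stats.chi2.sf(llr, df=1)

    # e.g. 12 substitutions in a 500 bp element with a 1% per-site rate
    p = rate_acceleration_pvalue(k=12, length=500, background_rate=0.01)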


Thursday, October 22nd

An atlas of open chromatin spanning diverse human cell types in health and disease
Professor Jason D. Lieb
Department of Biology, Carolina Center for Genome Sciences, and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill

FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) is a simple, low-cost genomic method for the isolation and identification of nucleosome-depleted regions in eukaryotic cells. Identification of "open" chromatin regions has been one of the most accurate and robust methods to identify functional promoters, enhancers, silencers, insulators, and locus control regions in mammalian cells. FAIRE-seq data from several human cell lines, from pancreatic islet cells, and from clinical breast tumor samples will be presented.

FAIRE data have a low signal-to-noise ratio relative to ChIP-seq experiments, which motivated us to develop a new signal processing algorithm called ZINBA (Zero Inflated Negative Binomial Algorithm). Sequencing data can be summarized by the number of sequence reads in fixed non-overlapping windows spanning the genome (read-count data) or by the number of overlapping reads per base pair (base-count data). ZINBA is a flexible statistical method that exploits the advantages of both summaries to identify "peak" signals from sequencing data. ZINBA proceeds in two steps. In the first step, a generalized linear model is used to model the read-count data and to select genomic windows with enriched sequence counts after adjusting for relevant confounding factors such as mappability, GC content, or copy number alterations. In the second step, within these enriched regions the exact boundaries of peaks are determined by fitting a piecewise linear regression model to the base-count data.
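
A minimal sketch of the first step, with a Poisson GLM standing in for ZINBA's zero-inflated negative binomial mixture and a Pearson-residual cutoff standing in for its posterior enrichment probabilities; the boundary-refinement step on base-count data is omitted, and all names and thresholds are illustrative.

    import numpy as np
    import statsmodels.api as sm

    def enriched_windows(counts, gc, mappability, cutoff=3.0):
        """Step one of a ZINBA-like analysis: regress window read counts
        on confounders and flag windows exceeding the model-based
        expectation."""
        counts = np.asarray(counts, float)
        X = sm.add_constant(np.column_stack([gc, mappability]))
        fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
        mu = fit.fittedvalues
        resid = (counts - mu) / np.sqrt(mu)     # Pearson residuals
        return np.where(resid > cutoff)[0]      # candidate enriched windows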

The data presented was generated by Paul Giresi, Linda Grasfeder, Kyle Gaulton, Jeremy Simon, Piotr Mieczkowski, and Takao Nammo in collaboration with Jorge Ferrer (Hospital Clinic de Barcelona), Karen Mohlke (UNC), and Charles Perou (UNC). ZINBA is being developed by Naim Rashid, Paul Giresi, Wei Sun, and Joe Ibrahim.


Thursday, October 29th

Using RNA-Seq for a global analysis of alternative splicing regulation in Drosophila melanogaster
Angela N. Brooks
Department of Molecular and Cell Biology, UC Berkeley

Splicing regulators are proteins that bind to newly transcribed pre-mRNAs and guide their processing into mature mRNAs. Specifically, they regulate the removal of introns, usually at a nearby splice site. They are believed to operate by inhibiting or promoting assembly of the spliceosome, a complex of proteins that splices out intronic sequences. Splicing regulators are important both for maintaining proper constitutive splicing and for guiding alternative splicing regulation; in this way they influence the expression of specific mRNA isoforms and ultimately the repertoire of proteins in the cell.

Many splicing regulators recognize their target transcripts through sequence-specific binding to the RNA. We have identified exons that are targets of a splicing regulator, Pasilla, by RNAi knockdown of the gene and subsequent identification of differentially spliced transcripts using RNA-Seq. Alternatively spliced exons from 478 genes are affected by depletion of Pasilla, including exons not previously annotated as skipped. The Pasilla binding site is enriched upstream of repressed exons and downstream of enhanced exons. The locations of these silencer and enhancer sequences are consistent with findings for the mammalian ortholog of Pasilla. I will also present preliminary results on RNA-Seq analysis of knockdowns of 26 additional RNA binding proteins, including SR and hnRNP proteins.
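
Differential splicing calls like these ultimately rest on per-exon inclusion levels. The sketch below computes a percent-spliced-in (PSI) value for a cassette exon from junction-spanning read counts; it is a minimal illustration with assumed counts, not the study's actual analysis.

    def psi(inclusion_reads, exclusion_reads):
        """Percent spliced in (PSI) for a cassette exon from junction reads.

        inclusion_reads: reads over the two inclusion junctions combined
        exclusion_reads: reads over the single skipping junction
        Inclusion counts are halved so each isoform is measured by one
        effective junction.
        """
        inc, exc = inclusion_reads / 2.0, float(exclusion_reads)
        return inc / (inc + exc) if inc + exc > 0 else float("nan")

    # exons whose inclusion shifts upon knockdown are candidate targets
    delta_psi = psi(80, 40) - psi(30, 90)   # knockdown minus control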

This is joint work from the labs of Steven Brenner, Brenton Graveley at the University of Connecticut Health Center, Sandrine Dudoit, and the modENCODE Consortium.


Thursday, November 5th

Combining data using Bayes factors
Dr. Robert Gentleman
Genentech

Combining data from different experiments should account in some way for the relative error rates of the experiments. While this is standard practice in meta-analysis, it seems not to be addressed in many high-throughput screens. I will focus on protein interaction data in this talk and propose a paradigm for data integration.
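
At its simplest, Bayes-factor combination across independent experiments multiplies the per-experiment Bayes factors into the posterior odds, so noisier screens contribute less evidence. A minimal sketch under that independence assumption, with all numbers illustrative:

    import math

    def combined_posterior_odds(prior_odds, bayes_factors):
        """Posterior odds after k independent experiments:
        posterior odds = prior odds * BF_1 * ... * BF_k.  Noisy screens
        (Bayes factors near 1) automatically carry little weight."""
        return prior_odds * math.prod(bayes_factors)

    # two assays support an interaction, a third is equivocal
    odds = combined_posterior_odds(1 / 99, [12.0, 5.5, 1.1])
    prob = odds / (1 + odds)   # posterior probability of a true interaction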


Thursday, November 12th

Statmap: A utility for the principled mapping of short reads to a reference genome
Nathan Boley
Department of Statistics, UC Berkeley

Next generation sequencing technologies have given rise to a host of assays that are able to quickly answer a diverse set of biological questions. These assays, which include RNA-seq, ChIP-seq, methyl-seq, Hi-C-seq, and DNase-seq, are similar in that, at the end of a "*-seq" experiment, they result in a set of sequences, or 'reads', generated by the sequencing platform, and it is from these that we draw our conclusions. Hence, the first key task in the analysis of these assays is to "map" the reads into the space from which they came (e.g. the genome, the transcriptome, etc.). As the assays have developed, the biological questions they attempt to answer have become more subtle, and the downstream analyses into which they are integrated have become increasingly complex. The need for reliable measures of statistical confidence in biological interpretations has become apparent, and thus too has the need for tools that are able to map the results of an experiment in a way that provides information about mapping uncertainty to downstream analysis.

Statmap is one such tool. It is exceptionally fast, and it produces every possible mapping from which a read could have come, up to a threshold in sequencing error and/or alternate genome probability. In addition, Statmap can map paired-end reads, junction reads, and poly(A) tails, and can update mapping probabilities under an assay-specific model. The probability model and the architecture that underlie Statmap, as well as its application to downstream analysis and the generation of confidence bounds, will be discussed. This software is currently available at encodestatistics.org.
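
To illustrate the kind of mapping probability involved, the toy sketch below scores each candidate alignment of a read by its likelihood under per-base Phred error rates and normalizes over the candidates, assuming a uniform prior over candidate loci and equal-length candidates. It is a generic illustration, not Statmap's model.

    def mapping_posteriors(read, quals, candidates):
        """Posterior probability of each candidate alignment of a read,
        assuming independent per-base errors at the Phred-implied rates.

        read       : read sequence
        quals      : Phred quality score per base
        candidates : equal-length reference substrings, one per locus
        """
        liks = []
        for ref in candidates:
            lik = 1.0
            for base, q, r in zip(read, quals, ref):
                err = 10 ** (-q / 10.0)      # Phred score -> error probability
                lik *= (1 - err) if base == r else err / 3.0
            liks.append(lik)
        total = sum(liks)
        return [l / total for l in liks]

    post = mapping_posteriors("ACGT", [30, 30, 20, 30], ["ACGT", "ACTT"])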


Thursday, November 19th

Detecting Rare Variants in Candidate Genes for Mitochondrial Diseases using Resequencing Microarrays
Dr. Wenyi Wang
Postdoctoral Scholar, Stanford Genome Technology Center, and Visiting Scholar, Department of Statistics, UC Berkeley

Oligonucleotide resequencing arrays provide cost-effective approaches to identify key biomarkers in human disorders. Mitochondrial diseases are one disease family whose underlying genetic variation is not yet fully understood. As a proof-of-principle study, we sequenced 39 candidate genes in healthy individuals and in patients with mitochondrial diseases using custom-designed resequencing arrays.

The genetic variation in a body of sequence data can be summarized by the nucleotide variation within each sample, measured by the variant frequency (typically <1 per 1000 bp), and by the variation across samples at each variant nucleotide position, measured by the minor allele frequency (MAF), with variants categorized as common (MAF >= 5%) or rare (MAF < 5%). Our challenge was to detect as many of the sequence variations that occur at very low frequencies as possible, with a minimal number of false positives. Sequence Robust Multi-array Analysis (SRMA) is a statistical method that exploits the biological information in the high-dimensional data using multi-level linear mixture models. Our algorithm reduces the false discovery rate of an existing algorithm by more than 10-fold, with a negligible false negative rate. We have identified novel and possibly pathogenic mutations in our pilot data. Our base-calling methods are applicable to other custom resequencing arrays. More broadly, they can provide guidance for experimental advances in DNA enrichment for resequencing and for improvements in array hybridization.
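
The MAF bookkeeping in the abstract is simple enough to spell out. A minimal sketch using the abstract's 5% cutoff, with genotypes coded as alternate-allele counts (the coding is my assumption):

    def minor_allele_frequency(genotypes):
        """MAF at one variant position, with each diploid genotype coded
        as its alternate-allele count (0, 1, or 2)."""
        freq = sum(genotypes) / (2.0 * len(genotypes))
        return min(freq, 1.0 - freq)

    def classify_variant(maf, threshold=0.05):
        """Common versus rare, using the 5% MAF cutoff from the abstract."""
        return "common" if maf >= threshold else "rare"

    maf = minor_allele_frequency([0, 0, 1, 0, 2, 0, 0, 1])   # 0.25 -> "common"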

This is joint work with Terry Speed.


Thursday, December 3rd

A shape-based approach to signal-noise deconvolution in transcription factor ChIP-Seq Data
Oleg Mayba
Department of Statistics, UC Berkeley

Chromatin immunoprecipitation followed by next-generation sequencing has become a popular assay for DNA-protein interactions. One popular area of application has been transcription factor binding site identification, with putative sites identified as 'peaks', or enriched regions, along the genome. Due to various biases in the data, most investigators use some kind of control to filter out false-positive peaks. We discuss the biases in the data and some common types of control and their limitations, and propose an approach that yields comparable results in the absence of a control and can also be used as a post-control-filtering step.
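
The abstract does not detail the shape-based approach, but one generic way to exploit peak shape is the strand pattern of transcription factor ChIP-Seq signal: the sketch below scores a candidate region by the cross-correlation of its forward- and reverse-strand read-start profiles. It is an illustration of shape-based filtering in general, not the speaker's method.

    import numpy as np

    def strand_shape_score(fwd, rev, max_shift=200):
        """Score a candidate peak by its strand shape: true binding sites
        show forward-strand read starts upstream and reverse-strand read
        starts downstream, so the two profiles correlate best at a
        positive shift near the fragment length.

        fwd, rev : per-base read-start counts on each strand (numpy arrays)
        """
        fwd = (fwd - fwd.mean()) / (fwd.std() + 1e-12)
        rev = (rev - rev.mean()) / (rev.std() + 1e-12)
        cc = [np.mean(fwd[:-s or None] * rev[s:]) for s in range(max_shift)]
        return max(cc), int(np.argmax(cc))   # best correlation and its shift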