PB HLTH 292, Section 008
Statistics and Genomics Seminar
Fall 2009
Thursday, September 3rd
SNP Association Studies with Case-Parent Trios
Professor Ingo Ruczinski
Department of Biostatistics, Johns Hopkins University
While most SNP association studies are case-control based, family based designs and in particular case-parent trio designs have some very attractive features. We discuss and demonstrate those via a genome-wide and a candidate gene association study that employ case-parent trios. We also extend the logic regression methodology, originally developed for cohort and case-control studies, to detect SNP-SNP and SNP-environment interactions in studies of trios with affected probands. Trio logic regression accounts for the linkage disequilibrium (LD) structure in the genotype data, and accommodates missing genotypes via haplotype-based imputation. We also derive an efficient algorithm to simulate case-parent trios where genetic risk is determined via epistatic interactions.
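The core of a trio simulation is Mendelian transmission: each parent passes one of two alleles to the child. A minimal sketch at a single SNP (function names are mine; it models only Hardy-Weinberg parents and random transmission, not the ascertainment on affected probands or the epistatic risk model the abstract describes):

```python
import random

def draw_parent(maf):
    """Two alleles under Hardy-Weinberg: each is the minor allele w.p. maf."""
    return tuple(int(random.random() < maf) for _ in range(2))

def simulate_trio(maf):
    """One case-parent trio at a single SNP: the child inherits one allele,
    chosen at random, from each parent."""
    mother, father = draw_parent(maf), draw_parent(maf)
    child = (random.choice(mother), random.choice(father))
    return mother, father, child
```

The actual algorithm must additionally condition on the child being affected, which skews the transmitted alleles at risk loci; this sketch omits that step.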
Thursday, September 10th
Adjusted Bayesian inference for selected parameters
Professor Daniel Yekutieli
Department of Statistics and Operations Research, Tel Aviv University
I will address the problem of providing inference for parameters selected after viewing the data. A frequentist solution to this problem is to use False Discovery Rate (FDR) controlling multiple testing procedures to select the parameters, and to construct False Coverage-statement Rate (FCR) adjusted confidence intervals for the selected parameters. I will argue that selection also affects Bayesian inference and present a Bayesian framework for providing inference for selected parameters. I will explain the role of selection in controlling the occurrence of false discoveries in Bayesian analysis and demonstrate how to specify selection criteria. I will also explain the relation between our Bayesian approach and the Bayesian FDR approach, and apply it to microarray data.
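The frequentist recipe mentioned above can be sketched concretely: select parameters with the Benjamini-Hochberg FDR procedure, then widen each selected parameter's marginal interval to level 1 - Rq/m (R selected out of m), which controls the False Coverage-statement Rate. A minimal sketch, with function names of my own choosing:

```python
import numpy as np
from statistics import NormalDist

def bh_select(pvals, q=0.05):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]

def fcr_intervals(est, se, selected, m, q=0.05):
    """FCR-adjusted intervals: each selected parameter gets a marginal
    normal interval at level 1 - R*q/m, where R is the number selected."""
    alpha = len(selected) * q / m
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return [(est[i] - z * se[i], est[i] + z * se[i]) for i in selected]
```

Because R <= m, the adjusted level 1 - Rq/m is at least 1 - q, so the adjusted intervals are never narrower than marginal level-(1-q) intervals.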
Thursday, September 17th
Detection and Improved Ranking of Disease-Associated SNPs and SNP Interactions Using Logic Regression
Dr. Holger Schwender
Department of Biostatistics, Johns Hopkins University
A major goal of genetic association studies is the identification of SNPs (Single Nucleotide Polymorphisms) and SNP interactions that are associated with the disease of interest. A problem with this task is that SNPs often show an effect on the disease risk only when interacting with other SNPs, so that testing each SNP individually might fail to detect them. This problem can be overcome by employing methods such as logic regression that take the multivariate structure of the SNP data into account. In my talk, I will present a procedure called logicFS that uses logic regression as the base learner in bagging to identify interesting SNPs and SNP interactions, and to quantify their importance for a correct prediction of the response. These importances can then also be employed to rank the SNPs more appropriately than marginal testing. Finally, I will show how this procedure can be adapted to test sets of SNPs, and applied to responses other than binary ones and to trio data.
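Logic regression operates on binary predictors, so each SNP (coded as 0/1/2 copies of the minor allele) is first split into two indicator variables. A minimal sketch of that coding step (the bagging and importance machinery of logicFS is not shown):

```python
import numpy as np

def snp_to_binary(genotypes):
    """Code each SNP (0/1/2 minor-allele copies) as two binary variables:
    a dominant indicator (at least one minor allele) and a recessive
    indicator (two minor alleles)."""
    g = np.asarray(genotypes)
    dominant = (g >= 1).astype(int)
    recessive = (g == 2).astype(int)
    return np.stack([dominant, recessive], axis=-1)
```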
Thursday, October 1st
TumorBoost: Normalization of allele-specific tumor copy numbers in paired tumor/normal designs for genotyping microarrays
Dr. Pierre Neuvial
Department of Statistics, UC Berkeley
High-throughput genotyping microarrays can be used to assess not only changes in total DNA copy number, but also changes in allele-specific copy numbers (ASCNs). Even after state-of-the-art preprocessing, ASCN estimates still suffer from systematic effects that make them difficult to use effectively for downstream analyses, such as ASCN segmentation and calling in cancer studies.
We have developed a method for normalizing ASCN estimates of a tumor based on ASCN estimates from a single matched normal. The method improves separation between ASCN states, and applies to any tumor/normal pair of genotyping microarrays.
This is joint work with Henrik Bengtsson.
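The abstract does not give the normalization formula, but the core idea of exploiting a single matched normal can be sketched: at each SNP, subtract the normal sample's observed deviation from its genotype-expected B-allele frequency. This is an illustrative simplification of my own, not the published TumorBoost estimator:

```python
import numpy as np

def normalize_baf(baf_tumor, baf_normal, genotype_normal):
    """Subtract, at each SNP, the matched normal's deviation from its
    genotype-expected B-allele frequency (0, 0.5, or 1 for AA/AB/BB).
    Systematic effects shared by the pair cancel out."""
    bt = np.asarray(baf_tumor, dtype=float)
    bn = np.asarray(baf_normal, dtype=float)
    g = np.asarray(genotype_normal, dtype=int)  # 0/1/2 copies of the B allele
    expected = np.array([0.0, 0.5, 1.0])[g]
    return np.clip(bt - (bn - expected), 0.0, 1.0)
```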
Thursday, October 8th
Non-coding sequences near duplicated genes evolve rapidly
Dr. Dennis Kostka
Gladstone Institutes, UCSF
Gene expression divergence and chromosomal rearrangements have both been
put forward as major contributors to phenotypic differences between
closely related species. It has also been established that duplicated
genes show enhanced rates of positive selection in their amino acid
sequences. If functional divergence is largely due to changes in gene
expression, it follows that regulatory sequences in duplicated loci
should evolve rapidly. To test this hypothesis we performed likelihood
ratio tests on all non-coding loci within 5kb of every transcript in the
human genome and identified sequences with increased substitution rates
in the human-chimp lineage. The fraction of rapidly evolving elements is
significantly higher near genes that are duplicated in humans and
chimps than near non-duplicated genes. 5' untranslated regions are
particularly enriched for accelerated sequences. We also conducted a
genome-wide scan for nucleotide substitutions predicted to affect
transcription factor binding. Rates of binding site turnover are
elevated in accelerated non-coding sequences of duplicated loci. Many of
the genes associated with these fast-evolving sequences belong to
functional categories identified in previous studies of positive
selection on amino acid sequences. However, our approach highlights
several processes and pathways that have not been emphasized in
studies of single copy genes. We find a particularly striking pattern of
accelerated evolution near genes involved in the establishment and
maintenance of pregnancy, processes that differ significantly between
humans and monkeys. Our findings suggest that positive selection on the
regulation of duplicated genes has played a significant role in human
evolution.
Thursday, October 22nd
An atlas of open chromatin spanning diverse human cell types in health and disease
Professor Jason D. Lieb
Department of Biology, Carolina Center for Genome Sciences, and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill
FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) is a simple, low-cost genomic method for the isolation and identification of nucleosome-depleted regions in eukaryotic cells. Identification of "open" chromatin regions has proven one of the most accurate and robust ways to identify functional promoters, enhancers, silencers, insulators, and locus control regions in mammalian cells. FAIRE-seq data from several human cell lines, from pancreatic islet cells, and from clinical breast tumor samples will be presented.
FAIRE data have a low signal-to-noise ratio relative to ChIP-seq experiments, which motivated us to develop a new signal processing algorithm called ZINBA (Zero Inflated Negative Binomial Algorithm). Sequencing data can be summarized by the number of sequence reads in fixed non-overlapping windows spanning the genome (read-count data) or by the number of overlapping reads per basepair (base-count data). ZINBA is a flexible statistical method that exploits the advantages of both summaries to identify "peak" signals in sequencing data. Implementation of ZINBA includes two steps. In the first step, a generalized linear model is used to model the read-count data and to select genomic windows with enriched sequence counts after adjusting for relevant confounding factors such as mappability, GC content, or copy number alterations. In the second step, the exact boundaries of peaks within these enriched regions are determined by fitting a piecewise linear regression model to the base-count data.
The data presented was generated by Paul Giresi, Linda Grasfeder, Kyle Gaulton, Jeremy Simon, Piotr Mieczkowski, and Takao Nammo in collaboration with Jorge Ferrer (Hospital Clinic de Barcelona), Karen Mohlke (UNC), and Charles Perou (UNC). ZINBA is being developed by Naim Rashid, Paul Giresi, Wei Sun, and Joe Ibrahim.
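The read-count summary that ZINBA's first step models can be sketched as follows (the window conventions, such as binning by read start position, are my assumptions):

```python
def window_counts(read_starts, window_size, genome_length):
    """Summarize aligned read start positions as counts in fixed,
    non-overlapping windows spanning the genome (the read-count summary)."""
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window_size] += 1
    return counts
```

These per-window counts, together with covariates such as mappability and GC content per window, form the input to the first-stage generalized linear model.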
Thursday, October 29th
Using RNA-Seq for a global analysis of alternative splicing regulation in Drosophila melanogaster
Angela N. Brooks
Department of Molecular and Cell Biology, UC Berkeley
Splicing regulators are proteins
that bind to newly transcribed pre-mRNAs and guide their processing
into mature mRNAs. Specifically, they regulate the removal of
introns, usually at a nearby splice site. They are believed to
operate by inhibiting or promoting assembly of the spliceosome, a
complex of proteins which is involved in splicing out intronic
sequences. Splicing regulators are important both for maintaining
proper constitutive splicing and for guiding alternative splicing
regulation; in this way they influence the expression of specific
mRNA isoforms and ultimately the repertoire of proteins in the cell.
Many splicing regulators recognize their target transcripts through
sequence specific binding to the RNA. We have identified exons that
are targets of a splicing regulator, Pasilla, by RNAi
knockdown of the gene and subsequent identification of differentially
spliced transcripts using RNA-Seq. Alternatively spliced exons from
478 genes are affected by depletion of Pasilla, including
exons previously unannotated as being skipped. The Pasilla
binding site is enriched upstream from repressed exons and downstream
from enhanced exons. The locations of these silencer and enhancer
sequences are consistent with findings from the mammalian ortholog of
Pasilla. I will also present preliminary results on RNA-Seq
analysis of knockdowns of 26 additional RNA binding proteins,
including SR and hnRNP proteins.
This is joint work from the labs of Steven Brenner, Brenton Graveley at the University of Connecticut Health Center, Sandrine Dudoit, and the modENCODE Consortium.
Thursday, November 5th
Combining data using Bayes factors
Dr. Robert Gentleman
Genentech
Combining data from different experiments should account in some way for the relative error rates of the experiments. While this is standard practice in meta-analysis, it seems not to be addressed in many high-throughput screens. I will focus on protein interaction data in this talk and propose a paradigm for data integration.
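Under independence of experiments, Bayes factors combine by multiplication: posterior odds = prior odds x the product of per-experiment Bayes factors (an experiment with a higher error rate would contribute a Bayes factor closer to 1). A generic sketch of this combination rule, not the speaker's specific model:

```python
import math

def combined_posterior_prob(prior_prob, bayes_factors):
    """Posterior probability of a hypothesis (e.g. 'this protein pair
    interacts') after combining independent experiments:
    posterior odds = prior odds * product of Bayes factors."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * math.prod(bayes_factors)
    return post_odds / (1 + post_odds)
```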
Thursday, November 12th
Statmap: A utility for the principled mapping of short reads to a reference genome
Nathan Boley
Department of Statistics, UC Berkeley
Next generation sequencing technologies have given rise to
a host of assays that are able to quickly answer a diverse set of
biological questions. These assays, which include RNA-seq, ChIP-seq,
methyl-seq, Hi-C-seq, and DNase-seq, are similar in that, at the end of
a "*-seq" experiment, they result in a set of sequences, or 'reads',
generated by the sequencing platform, and it is from these that we
draw our conclusions. Hence, the first key analytical task in the
analysis of these assays is to "map" the reads into the space from
which they came (e.g., the genome, the transcriptome, etc.). As the
assays have developed, the
biological questions they attempt to answer have become
more subtle, and the downstream analyses into which they are integrated
have become increasingly complex. The need for reliable measures of
statistical confidence in biological interpretations has become
apparent, and thus too has the need for tools that are able to map the
results of an experiment in a way that provides information about
mapping uncertainty to downstream analysis.
Statmap is one such tool. It is exceptionally fast, and
it produces every possible mapping from which a read
could have come, up to a threshold in sequencing error and/or
alternate genome probability. In addition, Statmap can map paired-end
reads, junction reads, and poly(A) tails, and can update mapping
probabilities under an assay-specific model. The probability model and
the architecture that underlie Statmap, as well as its application to
downstream analysis and the generation of confidence bounds will be
discussed. This software is currently available at
encodestatistics.org.
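A generic per-base error model illustrates where mapping probabilities of the kind described above can come from (an illustrative model of my own, not necessarily Statmap's): a base with Phred quality q is miscalled with probability 10^(-q/10), and candidate alignments are weighted accordingly.

```python
import math

def alignment_loglik(read, ref, quals):
    """Log-likelihood of one candidate alignment under a per-base error
    model: a match contributes log(1 - e), a mismatch log(e / 3),
    where e = 10**(-q/10) from the base's Phred quality q."""
    ll = 0.0
    for r, g, q in zip(read, ref, quals):
        e = 10 ** (-q / 10)
        ll += math.log(1 - e) if r == g else math.log(e / 3)
    return ll

def mapping_posteriors(read, candidates, quals):
    """Posterior probability over candidate locations, assuming a uniform
    prior; normalized in a numerically stable way."""
    lls = [alignment_loglik(read, c, quals) for c in candidates]
    m = max(lls)
    ws = [math.exp(ll - m) for ll in lls]
    z = sum(ws)
    return [w / z for w in ws]
```

Keeping the full posterior over candidate locations, rather than only the best hit, is what lets downstream analyses propagate mapping uncertainty.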
Thursday, November 19th
Detecting Rare Variants in Candidate Genes for Mitochondrial Diseases using Resequencing Microarrays
Dr. Wenyi Wang
Postdoctoral Scholar, Stanford Genome Technology Center, and Visiting Scholar, Department of Statistics, UC Berkeley
Oligonucleotide resequencing arrays provide cost-effective approaches to identify key biomarkers in human disorders. Mitochondrial diseases are one disease family for which the underlying genetic variation is not yet fully understood. As a proof-of-principle study, we sequenced 39 candidate genes in healthy individuals and patients with mitochondrial diseases using custom-designed resequencing arrays.
The genetic variation in a body of sequence data can be summarized by the nucleotide variation within each sample, measured by the variant frequency (typically <1 per 1000bp), and by the variation across samples at each variant nucleotide position, measured by the minor allele frequency (MAF), which categorizes variants as common (MAF>=5%) or rare (MAF<5%). Our challenge was to maximally detect sequence variations that occur at very low frequencies, with a minimal number of false positives. Sequence Robust Multi-array Analysis (SRMA) is a statistical method that exploits the biological information in the high-dimensional data using multi-level linear mixture models. Our algorithm reduces the false discovery rate of an existing algorithm by more than 10-fold, with a negligible false negative rate. We have identified novel and possibly pathological mutations in our pilot data. Our base-calling methods are applicable to other custom resequencing arrays. More broadly, they can provide guidance for experimental advances in DNA enrichment for resequencing and improvements in array hybridization.
This is joint work with Terry Speed.
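The common/rare categorization in the abstract is a direct computation on per-position allele counts across samples; a minimal sketch (the input format, pairs of reference and alternate allele counts, is my assumption):

```python
def classify_variants(allele_counts):
    """Classify each variant position as common (MAF >= 5%) or rare
    (MAF < 5%) from (ref_count, alt_count) pairs across samples."""
    labels = []
    for ref_count, alt_count in allele_counts:
        maf = min(ref_count, alt_count) / (ref_count + alt_count)
        labels.append("common" if maf >= 0.05 else "rare")
    return labels
```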
Thursday, December 3rd
A shape-based approach to signal-noise deconvolution in transcription factor ChIP-Seq data
Oleg Mayba
Department of Statistics, UC Berkeley
Chromatin immunoprecipitation followed by next-generation sequencing has
become a popular assay for DNA-protein interactions. One popular
area of application has been transcription factor binding site
identification, with putative sites identified as 'peaks' or enriched
regions along the genome. Due to various biases in the data, most
investigators use some kind of control to filter out false-positive peaks.
We discuss the biases in the data, some common types of control and their
limitations, and propose an approach that yields comparable results in the
absence of a control and can also be used as a post-control-filtering step.