The seminars are held on Wednesdays, 4:10-5:00, in room 1011 Evans Hall
Simple models for protein folding
Dan Rokhsar
University of California, Berkeley
Proteins are polymers that fold rapidly, reproducibly, and reversibly to specific functional structures out of an astronomical number of possible conformations. This seminar will discuss statistical mechanical models for the folding of simple lattice polymers whose dynamics can be analysed from computer simulations of the folding process. A new statistical measure, the folding probability, is used to identify the folding pathway of the model protein.
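As a toy illustration of why the number of conformations is called astronomical, the sketch below counts the conformations of a lattice polymer by brute force. It assumes a two-dimensional square lattice and treats each conformation as a self-avoiding walk. The abstract does not specify the lattice, and the function name `count_conformations` is hypothetical. Rotations and reflections are counted as distinct.

```python
def count_conformations(n_bonds):
    """Count self-avoiding walks with n_bonds steps on the square lattice.

    Each walk starts at the origin; a walk may never revisit a site,
    which mimics the excluded volume of a polymer chain.
    """
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(position, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in steps:
            nxt = (position[0] + dx, position[1] + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)  # backtrack before trying the next direction
        return total

    return extend((0, 0), {(0, 0)}, n_bonds)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, count_conformations(n))
```

The count grows roughly geometrically with chain length, so exhaustive enumeration quickly becomes infeasible even for short chains.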
Joe Hellerstein, UC Berkeley
Database systems contain increasingly large and detailed representations of the real world, from business records to scientific data to video surveillance output. Database systems have long used crude statistical methods to improve their internal efficiency, and are increasingly incorporating statistical techniques in the production of approximate results.
In this talk I will present an overview of statistical techniques and research opportunities in database systems. The perspective will be statistically very simplistic; the goal is to raise problems amenable to statistical solutions, rather than present the solutions themselves. I will also attempt a short software demonstration based on the CONTROL research project, a joint effort between Berkeley's database research group and statistician Peter Haas of IBM.
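One simple way a system could produce approximate results is to report a running estimate with an error bar while it scans rows. The sketch below is an illustration only, not the CONTROL project's implementation. It assumes rows arrive in random order and uses a normal-approximation interval. The name `running_estimates` and the parameter `z` are hypothetical.

```python
import math


def running_estimates(values, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after each value.

    Uses Welford's one-pass update for the mean and variance.
    half_width is z times the estimated standard error of the mean,
    which is meaningful only if the values arrive in random order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n > 1:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half_width = float("inf")  # no variance estimate from one row
        yield n, mean, half_width


if __name__ == "__main__":
    for n, mean, hw in running_estimates([2, 4, 6, 5, 3, 4]):
        print(n, round(mean, 3), hw)
```

A user watching the interval shrink can stop the query once the estimate is precise enough for the purpose at hand.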
Johanna Nichols, UC Berkeley
Native American languages fall into some 150 distinct families and exhibit great structural diversity. Some stunningly interesting findings potentially await statistically-based cross-linguistic analysis.
An age for the whole population of indigenous American language families, and thus a chronology for human settlement of the New World, can be estimated from average rates of diversification and immigration, but determining these with any precision is very difficult.
The origins of the population and of subpopulations have been traced by comparing frequencies of key structural characters in family-based samples. For instance, a small set of rare features is found almost uniquely along the entire Pacific coast in the Americas, and continues around the Pacific rim in Asia and Australasia. It points to crossings of Beringia by coastally adapted peoples beginning just after the end of glaciation, and thus suggests a Pleistocene age for the non-Pacific language population. This kind of analysis, however, is fraught with problems of choosing characters, defining them, assessing their stability, determining their independence, and dealing with small samples.
Probative evidence of family relatedness is well known to fade out after some 6000 years. The evidence traditionally considered probative is threshold frequencies of shared vocabulary and individual paradigms or similar closed sets of cognate forms, where one single set can be taken to prove relatedness if it is considered highly unlikely to recur independently. In principle it must be possible to use larger numbers of weaker pieces of evidence to argue for likely common descent from an untraceably distant single ancestor.
A New Methodology for Evaluating Traffic Incident Detection Algorithms
Karl Petty
Estimating and Adjusting for Publication Bias Using Data Augmentation in Bayesian Meta-Analysis
Geof Givens
Meta-analysis reviews, collects, and synthesizes individual sample surveys to estimate an overall effect size. `Publication bias' is a relatively new statistical phenomenon that only arises when one attempts through a meta-analysis to review all studies, significant or insignificant, in order to provide a total perspective on a particular issue. If the studies for a meta-analysis are chosen through a literature review, an inherent selection bias may arise, since for example, studies may tend to be published more readily if they are statistically significant, or deemed to be of higher quality. This has recently received some notoriety as an issue in the evaluation of the relative risk of lung cancer associated with passive smoking, following legal challenges to a 1992 EPA analysis which concluded that such exposure is associated with significant excess risk of lung cancer.
We introduce a Bayesian approach which estimates and adjusts for publication bias, correcting for both the number and outcome of missing studies. Estimation is based on a data augmentation principle within a hierarchical model, and the number and outcomes of unobserved studies are simulated using Gibbs sampling methods. This technique yields a quantitative adjustment for the passive smoking meta-analysis. We estimate that there may be both negative and positive but insignificant studies omitted, and that failing to allow for these would mean that the estimated excess risk may be overstated by around 30%, both in US studies and in the global collection of studies.
A further extension of this method introduces an additional hierarchy that permits the stratification of studies by sample characteristics, study design elements such as blinding and control, and many other objective and subjective factors. We apply this method to a meta-analysis of studies of cervical cancer rates associated with use of oral contraceptives.
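The selection effect behind publication bias can be seen in a small simulation. This is a naive sketch and not the Bayesian data augmentation method of the talk. It assumes every study has the same standard error and that only studies passing a one-sided significance test are published. The function `simulate_selection` and all of its parameters are hypothetical.

```python
import random
import statistics


def simulate_selection(true_effect=0.0, std_error=1.0, n_studies=2000,
                       z_crit=1.645, seed=0):
    """Simulate study estimates and keep only the 'significant' ones.

    Returns (mean of all estimates, mean of published estimates,
    number of published studies). The published mean is None if no
    study passes the significance filter.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # Assumed publication rule: one-sided z-test at the given critical value.
    published = [e for e in estimates if e / std_error > z_crit]
    published_mean = statistics.mean(published) if published else None
    return statistics.mean(estimates), published_mean, len(published)


if __name__ == "__main__":
    all_mean, pub_mean, n_pub = simulate_selection()
    print("mean of all studies:      ", all_mean)
    print("mean of published studies:", pub_mean)
    print("number published:         ", n_pub)
```

Even with a true effect of zero, the average over the published studies is well above zero, which is the overstatement a bias adjustment tries to correct.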
Robert Becker, University of California, Davis and Lawrence Livermore National Laboratory
Chandrika Kamath, Lawrence Livermore National Laboratory
Astrophysics is in the middle of an explosion of data, whose analysis requires levels of sophistication not normally found among scientists. These large data sets invariably give rise to questions that can only be addressed statistically. A cooperative effort between astronomers and computer scientists has been initiated at Lawrence Livermore National Laboratory to grapple with some of these problems. We will present examples of astronomical data, as well as some of the techniques that will be used to analyze them.
Measurements of plasma concentrations of the Human Immunodeficiency Virus (HIV), obtained
using the first generation of quantitative HIV assays, provided clinically useful
information in managing HIV-disease. Recently, protease inhibitor therapies have been
introduced that appear capable of decreasing plasma HIV concentrations to extremely low
levels. This led to a call for new quantitative HIV assays reliable at much lower
concentrations than their progenitors.
Biochemists have a ready bag of statistical tools for evaluating the reliability of
quantitative assays. While convenient, these standard metrics do not effectively capture the
complex trade-off between signal generation and noise production seen at the very low end
of these HIV assays. More mathematically sophisticated approaches based upon nonlinear
mixed effects models have also been of limited utility. These models are obscure to most
biochemists; moreover, the answers they provide to the primary questions of interest in
this setting depend crucially upon the covariance assumptions. In contrast, I will
describe a simple, nonparametric approach to evaluating such assays, devised in
collaboration with Carl Schaper, that has proven useful in effecting a substantial
improvement in the performance of the Chiron assay.
Stuart Russell
Computer Science, UCB
When interpreting data generated by more complex processes - for example, images generated by scenes containing unknown numbers of complex objects in various spatial arrangements, or text generated by an intelligent speaker - more expressive power is needed. The current view in AI is that we need some combination of probability theory with the expressive power afforded by first-order logic and/or programming languages. I will survey various approaches including object-oriented Bayesian networks, probabilistic frame-based systems, stochastic logic programs, stochastic functional programs, BUGS models, and (time permitting) Grenander's pattern theory, attempting to compare these approaches in terms of representation, inference, and learning. It seems possible to interpret all approaches within a single framework of first-order probabilistic logics, and perhaps to provide a general-purpose inference algorithm based on MCMC in the space of models (in the logical sense). This may allow investigation of the relationship between expressiveness and the complexity of inference and learning, as has been done for first-order logic and its subsets.
Roger Brent
The Molecular Sciences Institute
The genome projects have transformed the way biological information is obtained. Bigger changes are coming. I'll review the types of data that are becoming available and those that we can reasonably anticipate. Then I'll try to talk about how we might use these torrents of data to do what we want, which is to understand living things.
In the lab, the accumulation of interaction data from two-hybrid experiments has allowed us to develop computational tools to search this data for patterns of protein interactions of functional significance. The next challenge is to extend these algorithms so that they conjoin connection data with other kinds of functional genomic data. This is perhaps as much an information problem or an epistemological problem as a classical statistics problem, but it's important enough to set out at some length.
Barring breakthroughs, the inferences we can make from systematically generated biological data will often be disappointing, in that these inferences will not be of sufficient insight or probability to interest the majority of contemporary biologists. A biological approach to the problem is to bring into being technologies to generate new types of biological information, and, in the interim, to continue to develop techniques that embody some of the reach and power of classical transmission genetics to important but genetically intractable systems. By accelerating the analysis of genetic pathways, we can have a near-term impact upon biological research.
Thus we will remain gainfully employed while we move toward a future in which the data in our possession enable the analytical (computational) prediction of the behavior of living systems. I expect that the development of a predictive biology will be one of the major creative enterprises of the 21st century. People who understand mathematics and statistics, and who are willing to apply their understanding to this quest, are in a position to make substantial contributions to human knowledge.
Doug Nychka
National Center for Atmospheric Research
Large, Nonstationary Fields over Time and Space
A different perspective on modeling spatial data provides a route to handling large problems. Standard methods for analyzing spatial fields focus on the covariance of the spatial process. The problem with this approach for geophysical problems is the difficulty in formulating nonstationary fields and, even when this is successful, computing spatial estimates using large covariance matrices. This talk considers the advantages of modeling the process directly instead of shortcutting to the second-order moments. This basic change of emphasis from covariance function to the process is the key ingredient of a hierarchical model for spatial or space/time data. In the simplest case the idea is to expand the spatial field with respect to a basis and then model the variances of the basis coefficients. This alone is not a new idea. But recent developments in multiresolution bases such as wavelets allow one flexibility in capturing nonstationary structure and also permit rapid evaluation of the basis functions. The spatial estimates for a large number of locations can be found using iterative techniques, such as the conjugate gradient method, in place of standard solutions of linear systems. Such methods are common in the field of meteorological data assimilation, but have had little impact in statistics. Here the use of the wavelet basis is important, making the matrix multiplications in the iterations efficient. As an example of this approach we consider monthly precipitation records at approximately 800 sites in the Western US.
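The conjugate gradient method mentioned above needs only matrix-vector products, which is why a fast basis transform helps. The following is a minimal textbook sketch for a symmetric positive definite system, not the speaker's code. The callable `matvec` is a hypothetical stand-in for whatever fast multiplication is available, and the 2-by-2 matrix in the example is made up for illustration.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    A is never formed explicitly; it is accessed only through
    matvec(v), which must return the product A v as a list.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break  # residual norm is small enough
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # Illustrative system: A = [[4, 1], [1, 3]], b = [1, 2].
    def small_matvec(v):
        return [4 * v[0] + v[1], v[0] + 3 * v[1]]

    print(conjugate_gradient(small_matvec, [1.0, 2.0]))  # close to [1/11, 7/11]
```

Each iteration costs one call to `matvec`, so the total cost is governed by how cheaply that product can be evaluated.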
Tree-based Models in Biostatistics: Applications and Problems
Tree-based or recursive partitioning methods introduced by Breiman et al. (1984) offer a popular nonparametric approach to prediction problems in the context of classification, regression and event history data. Recursive partitioning methods handle covariates measured on a variety of scales naturally without requiring any arbitrary coding. In addition, results of tree-based analyses are easily conveyed to nonstatisticians.
Due mainly to the dichotomization of covariates at each step in the model building process, tree-based predictors are often highly variable, degrading predictive performance. Often one observes that small changes in the learning sample have a profound impact on the resulting prediction rule.
Based on datasets of breast cancer and heart disease patients, we will demonstrate tree-based analyses of medical data, point out the problems that may arise and discuss possible solutions, all of which can be seen as attempts to stabilize tree-based predictors.
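The sensitivity of a dichotomized split to small changes in the learning sample can be reproduced with a single-split regression tree. The sketch below is a toy under stated assumptions, not the analysis from the talk. The data are invented, and the helpers `best_split` and `leave_one_out_splits` are hypothetical names.

```python
def best_split(xs, ys):
    """Return the threshold of the two-leaf regression stump with least squared error.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Ties are resolved in favor of the smallest threshold.
    """
    pairs = sorted(zip(xs, ys))

    def sse(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_error = error
    return best_threshold


def leave_one_out_splits(xs, ys):
    """Refit the stump with each observation removed in turn."""
    return [best_split(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]   # invented covariate values
    ys = [0, 0, 1, 0, 1, 1]   # invented responses
    print(leave_one_out_splits(xs, ys))
```

On these six points, dropping the third observation puts the cut at 4.5, while dropping the fourth puts it at 2.5, so one observation is enough to move the prediction rule.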