Fall 1998 Neyman Seminars

The seminars are held on Wednesdays, 4:10-5:00, in room 1011 Evans Hall

Sept 2 Dan Rokhsar
Physics, UCB
Simple Models for Protein Folding
Sept 9 Joseph Hellerstein
Computer Science, UCB
Statistics as Used in Database Systems: Current Practice and Research Opportunities
Sept 16 James Reimann
Genentech
Culturing Statistics in Biotech
Sept 23 Johanna Nichols
Slavic Languages, UCB
Linguistic Evidence for the Source and Chronology of the First Settlement of the Americas
Sept 30 Karl Petty
Statistics, UCB
A New Methodology for Evaluating Traffic Incident Detection Algorithms
Oct 7 Geof Givens
Statistics, Colorado State University
Estimating and Adjusting for Publication Bias Using Data Augmentation in Bayesian Meta-analysis
Oct 14 Bob Becker and Chandrika Kamath
Astronomy, UC Davis & Computer Science, LLNL
Statistics, Pattern Recognition, and Astrophysics
Oct 21 Robert Fusaro
Chiron Corporation & Biostatistics, UCB
Statistical Issues in Quantitative Assay Development
Oct 28 Stuart Russell
Computer Science, UCB
Expressive Probability Models
Nov 4 Roger Brent
The Molecular Sciences Institute
Challenges in Extracting Biological Truth from Functional Genomic Data
Nov 11 Doug Nychka
National Center for Atmospheric Research
Large, Nonstationary Fields over Time and Space
Nov 18 Felix Dannegger
Statistics, Stanford University
Tree-based Models in Biostatistics: Applications and Problems

Abstracts

 

Simple models for protein folding

Dan Rokhsar
University of California, Berkeley

Proteins are polymers that fold rapidly, reproducibly, and reversibly to specific functional structures out of an astronomical number of possible conformations. This seminar will discuss statistical mechanical models for the folding of simple lattice polymers whose dynamics can be analysed from computer simulations of the folding process. A new statistical measure, the folding probability, is used to identify the folding pathway of the model protein.

 


Statistical Techniques and Opportunities in Database Systems

Joe Hellerstein, UC Berkeley

Database systems contain increasingly large and detailed representations of the real world, from business records to scientific data to video surveillance output. Database systems have long used crude statistical methods to improve their internal efficiency, and are increasingly incorporating statistical techniques in the production of approximate results.

In this talk I will present an overview of statistical techniques and research opportunities in database systems. The perspective will be statistically very simplistic; the goal is to raise problems amenable to statistical solutions, rather than present the solutions themselves. I will also attempt a short software demonstration based on the CONTROL research project, a joint effort between Berkeley's database research group and statistician Peter Haas of IBM.
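As one illustration of what an "approximate result" can look like, the sketch below reports a running estimate of an average, with a rough confidence interval, while rows are still being read. It is a minimal sketch and not the CONTROL project's method: the function name, the made-up data, and the assumption that rows arrive in random order (so that the rows seen so far behave like a simple random sample) are all introduced here for illustration.

```python
import math
import random


def running_average(rows, report_every=1000, z=1.96):
    """Yield (rows_seen, estimate, half_width) as rows stream in.

    Assumes rows arrive in random order; the interval is a normal
    approximation, estimate +/- z * standard error.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's update)
    for value in rows:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n % report_every == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical table column: 20,000 made-up sale amounts.
    amounts = [rng.expovariate(1 / 50.0) for _ in range(20_000)]
    for seen, estimate, half_width in running_average(amounts, report_every=5_000):
        print(f"{seen:>6} rows: mean ~ {estimate:.2f} +/- {half_width:.2f}")
```

The interval narrows as more rows are read, so a user could stop the query once the estimate is precise enough for the purpose at hand.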

 


Culturing Statistics in Biotech
James Reimann, Genentech, Inc.

Biopharmaceutical companies research, manufacture, and develop proteins for medical therapy, and use biostatisticians in many diverse areas: clinical trials, general statistical consulting, yield improvement, quality control, precision analysis, pharmacokinetic studies, bio-informatics, statistical education, and strategic planning. This talk will describe the breadth of statistical methods used in biotech problems, illustrated by Genentech case studies. The role of statisticians in the organization will also be explored, including non-statistical skills needed to be a success in industry, and the (hard-to-define) benefit added by statisticians even when formal statistical methods are not used.

 


Linguistic evidence for the source and chronology of the first settlement of the Americas

Johanna Nichols, UC Berkeley

Native American languages fall into some 150 distinct families and exhibit great structural diversity. Some stunningly interesting findings potentially await statistically-based cross-linguistic analysis.

An age for the whole population of indigenous American language families, and thus a chronology for human settlement of the New World, can be estimated from average rates of diversification and immigration, but determining these with any precision is very difficult.

The origins of the population and of its subpopulations have been traced by comparing frequencies of key structural characters in family-based samples. For instance, a small set of rare features is found almost uniquely along the entire Pacific coast in the Americas, and continues around the Pacific rim in Asia and Australasia. It points to crossings of Beringia by coastally adapted peoples beginning just after the end of glaciation, and thus suggests a Pleistocene age for the non-Pacific language population. This kind of analysis, however, is fraught with problems of choosing characters, defining them, assessing their stability, determining their independence, and dealing with small samples.

Probative evidence of family relatedness is well known to fade out after some 6000 years. The evidence traditionally considered probative is threshold frequencies of shared vocabulary and individual paradigms or similar closed sets of cognate forms, where one single set can be taken to prove relatedness if it is considered highly unlikely to recur independently. In principle it must be possible to use larger numbers of weaker pieces of evidence to argue for likely common descent from an untraceably distant single ancestor.


A New Methodology for Evaluating Traffic Incident Detection Algorithms

Karl Petty
University of California, Berkeley


In this paper we present a novel approach to evaluating incident detection algorithms. Previous evaluations of incident detection algorithms have focused on determining the false alarm rate versus detection rate curve, a process which we argue is inherently fraught with difficulties. Instead, we impose a cost structure on incident detection algorithms. A cost is incurred for dispatching a tow truck to assist an incident, and a benefit accrues when the delay associated with an incident is reduced. We present a method for performing this evaluation on a freeway with empirical data. We demonstrate this methodology on a few standard algorithms on a particular freeway.
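The cost structure described above can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the procedure from the talk: the `Alarm` record, the dollar values, and the rule that every alarm triggers one dispatch are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """One alarm raised by a detection algorithm (all fields hypothetical)."""
    is_real_incident: bool
    delay_saved_veh_hours: float  # delay avoided by responding; unused for false alarms


def net_benefit(alarms, dispatch_cost=100.0, value_per_veh_hour=10.0):
    """Value of the delay saved minus the cost of the tow-truck dispatches."""
    cost = dispatch_cost * len(alarms)  # assumption: every alarm sends a truck
    benefit = sum(
        value_per_veh_hour * a.delay_saved_veh_hours
        for a in alarms
        if a.is_real_incident
    )
    return benefit - cost


if __name__ == "__main__":
    # Two made-up algorithms evaluated on the same stretch of freeway.
    cautious = [Alarm(True, 40.0), Alarm(True, 25.0), Alarm(False, 0.0)]
    eager = cautious + [Alarm(True, 30.0)] + [Alarm(False, 0.0)] * 4
    print("cautious:", net_benefit(cautious))
    print("eager:   ", net_benefit(eager))
```

Under these made-up numbers the algorithm that detects more incidents still scores lower, because its extra false alarms each cost a dispatch. A single net figure of this kind replaces the comparison of false alarm and detection rate curves.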

 


Estimating and Adjusting for Publication Bias Using Data Augmentation in Bayesian Meta-analysis

Geof Givens
Colorado State University

Meta-analysis reviews, collects, and synthesizes individual sample surveys to estimate an overall effect size. 'Publication bias' is a relatively new statistical phenomenon that only arises when one attempts through a meta-analysis to review all studies, significant or insignificant, in order to provide a total perspective on a particular issue. If the studies for a meta-analysis are chosen through a literature review, an inherent selection bias may arise, since, for example, studies may tend to be published more readily if they are statistically significant, or deemed to be of higher quality. This has recently received some notoriety as an issue in the evaluation of the relative risk of lung cancer associated with passive smoking, following legal challenges to a 1992 EPA analysis which concluded that such exposure is associated with significant excess risk of lung cancer.

We introduce a Bayesian approach which estimates and adjusts for publication bias, correcting for both the number and outcome of missing studies. Estimation is based on a data augmentation principle within a hierarchical model, and the number and outcomes of unobserved studies are simulated using Gibbs sampling methods. This technique yields a quantitative adjustment for the passive smoking meta-analysis. We estimate that there may be both negative and positive but insignificant studies omitted, and that failing to allow for these would mean that the estimated excess risk may be overstated by around 30%, both in US studies and in the global collection of studies.

A further extension of this method introduces an additional hierarchy that permits the stratification of studies by sample characteristics, study design elements such as blinding and control, and many other objective and subjective factors. We apply this method to a meta-analysis of studies of cervical cancer rates associated with use of oral contraceptives.
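The selection effect that motivates this work can be seen in a small simulation. The sketch below does not implement the Bayesian data augmentation or Gibbs sampling described in the abstract. It only shows how an average over published studies drifts upward when publication depends on significance. The true effect, the common standard error, and the publication rule are made-up assumptions.

```python
import random
import statistics


def simulate_studies(n_studies, true_effect, std_err, seed=0):
    """Draw one effect estimate per study from Normal(true_effect, std_err)."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, std_err) for _ in range(n_studies)]


def published_only(estimates, std_err, z_crit=1.96):
    """Keep only studies whose estimate is significantly greater than zero."""
    return [e for e in estimates if e / std_err > z_crit]


if __name__ == "__main__":
    se = 0.1  # assumed common standard error for every study
    all_studies = simulate_studies(200, true_effect=0.2, std_err=se)
    published = published_only(all_studies, se)
    print("studies run:      ", len(all_studies))
    print("studies published:", len(published))
    print("mean of all:       %.3f" % statistics.mean(all_studies))
    print("mean of published: %.3f" % statistics.mean(published))
```

Because only estimates above a threshold survive, the published mean cannot fall below the mean of all studies. An adjustment method therefore has to account for the studies that were never observed.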


Statistics, Pattern Recognition, and Astrophysics

Robert Becker, University of California, Davis and Lawrence Livermore National Laboratory

Chandrika Kamath, Lawrence Livermore National Laboratory

Astrophysics is in the middle of an explosion of data, whose analysis requires levels of sophistication not normally found among scientists. These large data sets invariably give rise to questions that can only be addressed statistically. A cooperative effort between astronomers and computer scientists has been initiated at Lawrence Livermore National Laboratory to grapple with some of these problems. We will present examples of astronomical data, as well as some of the techniques that will be used to analyze them.


Statistical Issues in the Development of Quantitative HIV Assays

Robert E. Fusaro
Chiron Diagnostics
and
Division of Biostatistics,
School of Public Health, University of California, Berkeley


Measurements of plasma concentrations of the Human Immunodeficiency Virus (HIV), obtained using the first generation of quantitative HIV assays, provided clinically useful information in managing HIV-disease. Recently, protease inhibitor therapies have been introduced that appear capable of decreasing plasma HIV concentrations to extremely low levels. This led to a call for new quantitative HIV assays reliable at much lower concentrations than their progenitors.

Biochemists have a ready bag of statistical tools for evaluating the reliability of quantitative assays. While convenient, these standard metrics do not effectively capture the complex trade-off between signal generation and noise production seen at the very low end of these HIV assays. More mathematically sophisticated approaches based upon nonlinear mixed effects models have also been of limited utility. These models are obscure to most biochemists; moreover, the answers they provide to the primary questions of interest in this setting depend crucially upon the covariance assumptions. In contrast, I will describe a simple, nonparametric approach to evaluating such assays, devised in collaboration with Carl Schaper, that has proven useful in effecting a substantial improvement in the performance of the Chiron assay.


Expressive Probability Models

Stuart Russell
Computer Science, UCB

In recent decades, the expressive power of formal languages for defining probability models has been increasing rapidly. For example, Bayesian networks, Markov networks, hidden Markov models, and dynamic Bayesian networks allow the description of fairly complex stochastic processes. However, these models are essentially propositional probability models, defining joint distributions over a fixed set of random variables, or a fixed set replicated over time or space.

When interpreting data generated by more complex processes - for example, images generated by scenes containing unknown numbers of complex objects in various spatial arrangements, or text generated by an intelligent speaker - more expressive power is needed. The current view in AI is that we need some combination of probability theory with the expressive power afforded by first-order logic and/or programming languages. I will survey various approaches including object-oriented Bayesian networks, probabilistic frame-based systems, stochastic logic programs, stochastic functional programs, BUGS models, and (time permitting) Grenander’s pattern theory, attempting to compare these approaches in terms of representation, inference, and learning. It seems possible to interpret all approaches within a single framework of first-order probabilistic logics, and perhaps to provide a general-purpose inference algorithm based on MCMC in the space of models (in the logical sense). This may allow investigation of the relationship between expressiveness and the complexity of inference and learning, as has been done for first-order logic and its subsets.


Extracting Biological Truth from Functional Genomic Data

Roger Brent
The Molecular Sciences Institute

The genome projects have transformed the way biological information is obtained. Bigger changes are coming. I'll review the types of data that are becoming available and those that we can reasonably anticipate. Then I'll try to talk about how we might use these torrents of data to do what we want, which is to understand living things.

In the lab, the accumulation of interaction data from two-hybrid experiments has allowed us to develop computational tools to search this data for patterns of protein interactions of functional significance. The next challenge is to extend these algorithms so that they conjoin connection data with other kinds of functional genomic data. This is perhaps as much an information problem or an epistemological problem as a classical statistics problem, but it's important enough to set out at some length.

Barring breakthroughs, the inferences we can make from systematically generated biological data will often be disappointing, in that these inferences will not be of sufficient insight or probability to interest the majority of contemporary biologists. A biological approach to the problem is to bring into being technologies to generate new types of biological information, and, in the interim, to continue to develop techniques that embody some of the reach and power of classical transmission genetics to important but genetically intractable systems. By accelerating the analysis of genetic pathways, we can have a near-term impact upon biological research.

Thus we will remain gainfully employed while we move toward a future in which the data in our possession enable the analytical (computational) prediction of the behavior of living systems. I expect that the development of a predictive biology will be one of the major creative enterprises of the 21st century. People who understand mathematics and statistics, and who are willing to apply their understanding to this quest, are in a position to make substantial contributions to human knowledge.


 

Doug Nychka
National Center for Atmospheric Research

Large, Nonstationary Fields over Time and Space

A different perspective on modeling spatial data provides a route to handling large problems. Standard methods for analyzing spatial fields focus on the covariance of the spatial process. The problem with this approach for geophysical problems is the difficulty in formulating nonstationary fields and, even when this is successful, computing spatial estimates using large covariance matrices. This talk considers the advantages of modeling the process directly instead of shortcutting to the second-order moments. This basic change of emphasis from covariance function to the process is the key ingredient of a hierarchical model for spatial or space/time data. In the simplest case the idea is to expand the spatial field with respect to a basis and then model the variances of the basis coefficients. This alone is not a new idea. But recent developments in multiresolution bases such as wavelets allow one flexibility in capturing nonstationary structure and also permit rapid evaluation of the basis functions. The spatial estimates for a large number of locations can be found using iterative techniques, such as the conjugate gradient method, in place of standard solutions of linear systems. Such methods are common in the field of meteorological data assimilation, but have had little impact in statistics. Here the use of the wavelet basis is important, making the matrix multiplications in the iterations efficient. As an example of this approach we consider monthly precipitation records at approximately 800 sites in the Western US.
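The appeal of the conjugate gradient method mentioned above is that it touches the matrix only through matrix-vector products, so a fast multiplication routine is all that is needed. The sketch below is a textbook version of the method, not the speaker's implementation. The tridiagonal test matrix is an assumed stand-in for the wavelet-based operator, which is not reproduced here.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)  # residual b - A x for the starting guess x = 0
    p = list(r)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def tridiag_matvec(v):
    """Multiply by the matrix with 2 on the diagonal and -1 beside it.

    The matrix is never stored; each product costs O(n) operations.
    """
    n = len(v)
    out = []
    for i in range(n):
        s = 2.0 * v[i]
        if i > 0:
            s -= v[i - 1]
        if i < n - 1:
            s -= v[i + 1]
        out.append(s)
    return out


if __name__ == "__main__":
    truth = [1.0] * 20
    b = tridiag_matvec(truth)
    x = conjugate_gradient(tridiag_matvec, b)
    print("largest error:", max(abs(xi - ti) for xi, ti in zip(x, truth)))
```

Each iteration costs one call to `matvec`, so the total work is governed by how cheaply that product can be formed. This is the role the abstract assigns to the wavelet basis.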


Felix Dannegger
Statistics Department, Stanford University

Tree-based Models in Biostatistics: Applications and Problems

Tree-based or recursive partitioning methods introduced by Breiman et al. (1984) offer a popular nonparametric approach to prediction problems in the context of classification, regression and event-history data. Recursive partitioning methods handle covariates measured on a variety of scales naturally without requiring any arbitrary coding. In addition, results of tree-based analyses are easily conveyed to nonstatisticians.

Due mainly to the dichotomization of covariates at each step in the model building process, tree-based predictors are often highly variable, degrading predictive performance. Often one observes that small changes in the learning sample have a profound impact on the resulting prediction rule.

Based on datasets of breast cancer and heart disease patients, we will demonstrate tree-based analyses of medical data, point out the problems that may arise and discuss possible solutions, all of which can be seen as attempts to stabilize tree-based predictors.
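The instability described in this abstract can be reproduced on a toy problem. The sketch below fits only the first split of a regression tree and refits it on bootstrap resamples of the learning sample. The simulated data, the function names, and the use of the bootstrap as the perturbation are assumptions made for illustration. They are not taken from the talk, and no stabilizing method is implemented.

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return the threshold on x that minimizes squared error for two leaves."""
    pairs = sorted(zip(xs, ys))
    best_sse, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = _sse(left) + _sse(right)
        if sse < best_sse:
            best_sse = sse
            best_thr = (pairs[i][0] + pairs[i - 1][0]) / 2
    return best_thr


def bootstrap_splits(xs, ys, n_boot=50, seed=0):
    """Refit the first split on resampled copies of the learning sample."""
    rng = random.Random(seed)
    n = len(xs)
    splits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        thr = best_split([xs[i] for i in idx], [ys[i] for i in idx])
        if thr is not None:
            splits.append(thr)
    return splits


if __name__ == "__main__":
    rng = random.Random(1)
    # Made-up data: a smooth trend, so no single cut point is clearly best.
    xs = [i / 40 for i in range(40)]
    ys = [x + rng.gauss(0.0, 0.3) for x in xs]
    splits = bootstrap_splits(xs, ys)
    print("split on full sample:", best_split(xs, ys))
    print("bootstrap split range:", min(splits), "to", max(splits))
```

When the underlying relationship is smooth, the chosen cut point can move considerably from one resample to the next, and every later split in a full tree depends on the first one.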