Projects in Progress

NIMBLE: Flexible software for hierarchical modeling in R

Project description: We develop the NIMBLE software for hierarchical modeling in R, now on version 1.0.1 of our R package. The key ideas are (1) to divorce the model structure from the algorithms used to fit the model, allowing one to apply a variety of fitting techniques (Bayesian and otherwise) to a given model, (2) to allow models to be specified in BUGS-like syntax but computed in a flexible fashion with the computation done in C++, and (3) to provide a platform for algorithm developers to easily make their algorithms available. We have a paper in Journal of Computational and Graphical Statistics describing the overall approach.

Collaborators: Perry de Valpine (UC Berkeley Environmental Science), Daniel Turek (Lafayette College Mathematics), Paul van Dam-Bates (Fisheries and Oceans Canada), Wei Zhang (University of Glasgow Statistics)

Statistical methods for uncertainty quantification in assessment and modeling of climate extremes

Project description: I collaborate with the Department of Energy's CASCADE Scientific Focus Area at Lawrence Berkeley National Laboratory. The goal of the project is to characterize the behavior of extreme climate events in observations and climate models historically and into the future, with the ultimate goal of improving how climate models reproduce such events. From the statistical perspective the key goal is to characterize uncertainty in assessment and in detection and attribution of climate extremes. Sources of uncertainty include initial and boundary conditions, inter-model variability and variability over time. We have several papers published related to statistical methods for quantification, detection and attribution of extreme events, and an R package (climExtremes) on CRAN that provides code for doing such analyses.

Collaborators: William Collins (Lawrence Berkeley National Laboratory [LBL]), Mark Risser (LBL), Christina Patricola (Iowa State University), Travis O'Brien (Indiana University), Michael Wehner (LBL)

Statistical methods for paleoecological and paleoclimatological data

Project description: I was the statistical lead on the PalEON project, an effort to collate paleoecological and paleoclimatological data, analyze the data in a spatio-temporal context, and provide products useful for assessing and driving ecosystem models to understand how ecosystem properties have changed over the past 2000 years in the northeastern and midwestern U.S. We have a number of papers published on the scientific results of this work and the statistical methods and data products resulting from the methods.

Collaborators: Jason McLachlan (Notre Dame Biology), Andria Dawson (Mount Royal University), Mike Dietze (Boston University Dep't of Earth and Environment), Simon Goring (U. of Wisconsin Geography), Mevin Hooten (Colorado State Statistics), Steve Jackson (USGS Southwest Climate Science Center/U. of Arizona), John Tipton (U. of Arkansas Statistics), Dave Moore (U. of Arizona Natural Resources and Environment), Neil Pederson (Harvard Forest), Jack Williams (U. of Wisconsin Geography)

Completed Projects

Exposure measurement error in air pollution epidemiology

Project description: In environmental epidemiology, the exposure of interest is often poorly ascertained. Adam Szpiro and I have extended our previous (separate) work in this area, developing a framework to understand the sources of bias in health effects estimation caused by different components of exposure estimation error. The framework leads to a methodology involving analytic bias correction and bootstrap estimation of uncertainty. As part of this effort, we're also considering implications for analyses of multiple pollutants. We have a paper, with discussion, published in Environmetrics.

Collaborator: Adam Szpiro (Univ. of Washington Biostatistics)

Hierarchical modeling of the global distribution of cardiovascular risk factors and of malnutrition indicators

Project description: In work growing out of Mariel Finucane's dissertation at Harvard School of Public Health, I'm involved in the development of hierarchical Bayesian methods to pool information about global health metrics. To analyze the global distribution of cardiovascular risk factors, as part of the Global Burden of Disease project, we used a combination of nationally-representative surveys and academic studies of blood pressure, cholesterol, BMI, and diabetes to estimate levels and time trends in these risk factors for all the nations of the world. Because of data sparsity, this required borrowing strength in various ways, which we did using hierarchical linear models, combined with Markov random field methods for nonlinear time trends. Four papers using the methodology were published in the Lancet, and an overview of the methodology appeared in a special issue of Statistical Science on Bayesian analyses that have made a real-world impact. The larger group with which this work was done has continued to publish papers using the methodology. We then extended this work using Bayesian nonparametrics to model the full distributional features of malnutrition indicators using a combination of individual level data and summary statistics. One paper has been published in the Lancet on childhood malnutrition and another in Lancet Global Health on anemia, in addition to a paper in JASA on the statistical methods.

Collaborators: Mariel Finucane (Gladstone Institutes), Majid Ezzati (Imperial College Epidemiology/Biostatistics), Goodarz Danaei (Harvard Global Health and Population), Gretchen Stevens (WHO)

Analysis of massive climate model output

Project description: I've worked with computational and climate researchers at the Lawrence Berkeley National Lab to develop software and methods to process and analyze huge output datasets from high resolution climate model runs. The goal is to build tools available on the Earth System Grid to manipulate, process, and characterize model output. From a statistical perspective, the general challenges include characterizing sources of variability, separating signal from noise, and spatial modeling of noisy output. My current focus is on methods and software for analyzing precipitation extremes, focusing on modeling variability in space and time in a computationally-efficient manner. I'm also working on distributed parallel computing methods for the linear algebra calculations involved in fitting spatial statistics models. For this we have a paper in the Journal of Statistical Software and an R package, bigGP, on CRAN.

Collaborators: Michael Wehner (Lawrence Berkeley Lab), Prabhat (Lawrence Berkeley Lab), Lawrence Berkeley Lab visualization group, Ben Lipshitz (UC Berkeley EECS), Tina Zhuo (Georgia Tech), Cari Kaufman (UC Berkeley Statistics)

Spatio-temporal modeling of traffic pollution in eastern Massachusetts

Project description: This work built on previous work by Alexandros Gryparis and Brent Coull to model black carbon concentrations in eastern Massachusetts, for use as exposure estimates in a variety of health studies. The previous model was (roughly) additive in space and time. In this work we introduced temporal structure to account for space-time interaction and included 7-day average data using a linearized approximation for computational feasibility. The space-time structure included both large-scale structure using basis functions and small-scale structure using a tapered, separable space-time covariance structure. We have a paper in Annals of Applied Statistics on this work.

Collaborators: Nikolay Bliznyuk (Univ. of Florida Statistics) and Brent Coull (Harvard Biostatistics)

Flexible Markov random field models for spatial process representation

Project description: This work builds on my spatial confounding work described below. In that work I analyzed bias from unmeasured spatial confounders in spatial regression models. A key result was that we can reduce bias when the unmeasured confounder operates at a large spatial scale, which motivates the use of smooth spatial process representations. As a result, I'm investigating the use of MRF approximations to thin plate splines, based on the neighborhood structure in Rue and Held (2005), for computationally-efficient latent process representations. I've been comparing the TPS approximation to standard CAR models to understand the features of the resulting spatial processes. The idea in applications is to use the MRF on a fine grid and take the irregular areas as reflecting the spatial variation in the grid cells overlapped by a given area. I have a paper published in the Electronic Journal of Statistics.

Assessing remotely-sensed aerosol as a proxy for ground-level fine particulate matter

Project description: I just finished a three-year grant from the Health Effects Institute to study the use of remotely-sensed aerosol optical depth as a proxy for monthly and longer-term PM2.5 concentrations in the eastern U.S. The final report on this work is under review by HEI. Components of the project included:

  • Published work (Paciorek et al. 2008 in Environmental Science and Technology) assessing GOES-based GASP AOD as a proxy for PM2.5 using basic correlation analysis.
  • Published work (Paciorek and Liu 2009 in Environmental Health Perspectives) combining AOD and ground-level measurements of PM2.5, with land use covariate information, using a statistical prediction model for PM2.5. In this work we found that AOD did not add to the predictive power of the model.
  • Extensions of our statistical modeling approach to flexibly model systematic discrepancy between the AOD proxy and PM2.5 using computationally-efficient Markov random field models. This work has been submitted (Paciorek, submitted) to JRSSC.
  • Assessment of GOES reflectance as a proxy for PM2.5, with consideration of new screening approaches to avoid cloud contamination and better account for spatially-varying surface reflectance.
  • Consideration of multi-year averages of MISR AOD as a proxy for multi-year PM2.5 concentrations in the eastern U.S.

Collaborators: Yang Liu (Harvard Environmental Health), Shobha Kondragunta (NOAA-NESDIS)

Smoothing methods for effects of multiple emissions sources in place of buffer-type variables in environmental exposure modeling

Project description: Researchers analyzing exposure to pollutants often face the problem of accounting for the effects of multiple sources of pollution, mediated by distance, direction, emissions strength, and wind, among other factors. I have developed a simple approach to estimate the smooth effect of distance to emissions sources, accounting for distance, source strength and multiple sources, based on the linearity of a mixed model representation. This work is motivated by a variety of studies in the Environmental Health department at HSPH, including air pollution exposure in Brooklyn from two major highways, exposure at TF Green Airport in Providence, exposure in eastern Massachusetts in conjunction with multiple HSPH health studies, as well as large-scale estimation of PM exposure as part of my HEI-funded work and as part of the NHS exposure project. Note that the exposure could be a social exposure rather than pollution, such as distance to recreational facilities or to liquor outlets, with the outcome of interest a health outcome. This work was part of my HEI final report.

Collaborators: Len Zwack, Jon Levy, and Francine Laden (Harvard Environmental Health)

Spatio-temporal estimation of particulate matter exposure in the Nurses' Health Study and effects of exposure uncertainty in health effects analysis

Project description: We have been investigating the health effects of particulate matter air pollution in one of the major cohort studies, the Nurses' Health Study. Our part of the project was to estimate individual exposure to particulate matter. We have built a spatio-temporal model to estimate monthly exposure to PM10 and PM2.5 for 1988-2002 in the northeast U.S. using government monitoring data and GIS covariates. As only sparse PM2.5 data are available before 1999, we estimate PM2.5 based on PM10 measurements and visibility information. As part of this latter effort, we have devised a method for using airport visibility data to estimate PM2.5 while properly accounting for the uncertainty and truncation in the visibility data. We have exposure papers in various journals: Paciorek et al. 2009 in the Annals of Applied Statistics, Yanosky et al. 2009 in Environmental Health Perspectives, and Yanosky et al. 2008 in Atmospheric Environment. Related epidemiological papers that use these exposure estimates are Puett et al. 2008 in American Journal of Epidemiology and Puett et al. 2009 in Environmental Health Perspectives.

Collaborators: Jeff Yanosky (Penn State Medical School) and Francine Laden and Helen Suh (Harvard Environmental Health) among others

Post-glacial tree dynamics in New England

Project description: We are interested in understanding tree population dynamics over the last 15000 years. To investigate dynamics based on pollen deposited in pond sediments over the last 3000 years in south-central New England, we have built a Bayesian hierarchical model to relate fossil pollen from sediment cores to tree populations, calibrated by modern tree plots and colonial records. The model is run in predictive mode to estimate composition back in time when only pollen data are available. The model is being used to consider changes in tree populations over time. We are considering extensions of the modeling to link with models of ecosystem processes and with molecular data that provide additional information about population spread. We have a paper (Paciorek and McLachlan 2009 in JASA) about the statistical modeling. This work serves as part of the framework for a large collaborative NSF grant to integrate paleoecological and paleoclimate data for the northeast US (the PalEON project).

Collaborators: Jason McLachlan (Notre Dame Biology)

Bias in regression models with spatial confounding and the use of more flexible Markov random field models for spatial process representation

Project description: It is common in environmental applications and other areas with spatial data that residuals in regression models are spatially correlated. Researchers often account for spatial structure in the residuals using a spatial term in the mean or a spatial covariance structure. I have a paper, Paciorek 2010 in Statistical Science, in which I develop a simple framework for understanding bias and precision in such spatial regression models. If the covariate(s) are spatially correlated, then the residuals may be correlated with the covariate(s). In this context, the correlation may be due to an unmeasured confounder, and one may hope to account for confounding by including spatial structure in the model. Results indicate that the scales of spatial variation are critical and that bias from unmeasured confounders is reduced only if the confounder varies at a scale larger than the spatial variation in the covariate of interest (or of a component of the covariate). Furthermore, random effects models, kriging, and other forms of penalized models reduce bias less than fixed effects models that remove spatial variability up to pre-specified scales, but fixed effects models show more variability. This is because of the inherent bias-variance tradeoff in penalized models.

Measurement error induced by spatial exposure estimation

Project description: Motivated in part by the exposure estimation work on the Nurses' Health Study and collaborative work of Alexandros and Brent, we are interested in the measurement error problem induced when spatial exposure estimates are used in epidemiological models of health outcomes based on cohort data. We have developed a framework for thinking about the problem and argue that some standard approaches in the literature are flawed. We suggest and investigate the performance of several alternative approaches to adjusting for measurement error in the epidemiological models. We have a paper, Gryparis et al. 2009 in Biostatistics, on this work.

Collaborators: Alexandros Gryparis (a former student in Harvard Biostatistics), Brent Coull (Harvard Biostatistics)

Fourier basis representation for spatial data

Project description: This work builds on my work on spatial logistic regression for large datasets. The Fourier basis representation of spatial processes is an efficient way to represent Gaussian spatial processes on a fine grid with substantial computational advantages. This approach has been pioneered by Chris Wikle at the University of Missouri. I have explored various parameterizations and MCMC algorithms based on the Fourier representation and have written an article (Paciorek 2007b) and template R code published in Journal of Statistical Software.

Fitting spatial models for large datasets with binary outcomes

Project description: Epidemiological researchers are interested in models for binary data that treat space as a covariate. These models tend to be hard to fit, because one is fitting the space covariate nonparametrically in two dimensions. I have compared several approaches to fitting such data, including Bayesian Gaussian process models, penalized quasi-likelihood, and Simon Wood's multipenalty spline optimization coded in the mgcv library for R. The Bayesian approaches rely on efficient representations of the underlying spatial risk surface. In particular, I have modified Chris Wikle's work with the Fourier basis, which allows use of the FFT to speed computation, and Kammann and Wand's (2003) work with basis function representation approximations to standard covariance functions. The Fourier basis outperforms other methods in simulations, fitting better than both penalized likelihood and Bayesian methods, and computing more quickly than other Bayesian methods. See the preprint of the paper (Paciorek 2007a), published in Computational Statistics and Data Analysis.

Collaborator: Louise Ryan (Harvard Biostatistics)

A new class of nonstationary covariances for spatial data

Project description: I have a paper in Environmetrics (Paciorek and Schervish 2006) on the spatial modelling in my dissertation. In particular, I use a nonstationary covariance that generalizes the work of Higdon, Swall and Kern (1998). In contrast to my dissertation, this work uses efficient representations mentioned in the previous paragraph to increase the speed of fitting the nonstationary Gaussian process. The efficient representations are used for stationary processes in the hierarchy of the model that determine the structure of the nonstationarity.

Collaborator: Mark Schervish (CMU Statistics)

Misinformation in the conjugate prior for the linear model

Project description: I wrote a short paper (Paciorek 2006) for Bayesian Analysis on the conjugate prior for the normal linear model. When the prior mean in this model is poorly chosen, the resulting posterior for the error variance and the posterior variance for the coefficients can be inflated. This is of particular concern with the unit information prior and free-knot spline models (e.g., BARS) that use the unit information prior (in fact the paper grew out of anomalous results I obtained when using BARS in my thesis work).
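The inflation effect described above can be illustrated with a minimal sketch, assuming the standard normal-inverse-gamma conjugate parameterization for a one-coefficient regression with no intercept. This is not the code or the exact setup from the paper: the function name `nig_posterior`, the data, and the hyperparameters `a0` and `b0` are all hypothetical, chosen only to show how a prior mean far from the data raises the posterior scale of the error variance under a unit information prior.

```python
def nig_posterior(x, y, m0, v0, a0, b0):
    """Conjugate update for y_i = beta * x_i + e_i, with e_i ~ N(0, sigma2).

    Assumed prior: beta | sigma2 ~ N(m0, sigma2 * v0), sigma2 ~ InvGamma(a0, b0).
    Returns (mn, vn, an, bn), the posterior parameters in the same family.
    """
    n = len(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    syy = sum(yi * yi for yi in y)
    vn = 1.0 / (1.0 / v0 + sxx)          # posterior variance factor for beta
    mn = vn * (m0 / v0 + sxy)            # posterior mean of beta
    an = a0 + n / 2.0
    # The term m0^2/v0 - mn^2/vn grows with the conflict between prior mean and data.
    bn = b0 + 0.5 * (syy + m0 * m0 / v0 - mn * mn / vn)
    return mn, vn, an, bn


# Hypothetical data: y is roughly 2 * x, with small fixed perturbations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
n = len(y)
sxx = sum(xi * xi for xi in x)

# Unit information prior: the prior carries the information of one observation,
# so the prior variance factor is n / sum(x^2).
v0 = n / sxx
a0, b0 = 2.0, 1.0  # hypothetical inverse-gamma hyperparameters


def summarize(m0):
    """Posterior mean of beta, posterior mean of sigma2, posterior variance of beta."""
    mn, vn, an, bn = nig_posterior(x, y, m0, v0, a0, b0)
    sigma2_mean = bn / (an - 1.0)   # mean of InvGamma(an, bn), valid since an > 1
    beta_var = sigma2_mean * vn     # variance of the Student-t marginal for beta
    return mn, sigma2_mean, beta_var


mn_good, s2_good, var_good = summarize(2.0)    # prior mean close to the data
mn_poor, s2_poor, var_poor = summarize(10.0)   # poorly chosen prior mean

print("prior mean 2.0: ", mn_good, s2_good, var_good)
print("prior mean 10.0:", mn_poor, s2_poor, var_poor)
```

In this sketch the only difference between the two runs is the prior mean, so any increase in the posterior mean of the error variance or in the posterior variance of the coefficient comes from the prior-data conflict term in `bn`.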

Effects of elevated wild pig abundance on tropical forest plant demography

Project description: With Kalan Ickes (and the hard-earned data from his doctoral research in ecology at LSU), I worked on an analysis of the effects of elevated wild pig populations on the demographics of the woody plant community at the Pasoh Forest Dynamics Plot in peninsular Malaysia. This analysis (Ickes, Paciorek, and Thomas 2005) has been published in Ecology.

Collaborators: Kalan Ickes (Clemson Biology), Sean Thomas (Toronto Forestry)

Ph.D. dissertation on nonstationary covariance modelling

Project description: I finished my PhD dissertation in May 2003. The dissertation focused on nonstationary covariance modelling for spatial data and nonparametric regression. Details are here.

Collaborator: Mark Schervish (CMU Statistics)

False discovery rate testing for spatial data

Project description: We wrote a paper (Ventura, Paciorek, and Risbey 2004) overviewing the use of the false discovery rate (FDR) methodology, first introduced by Benjamini and Hochberg (1995), for multiple testing with climatological and geophysical data in which the multiple tests are done at multiple spatial locations. The spatial aspect introduces dependence between the tests. We assessed the robustness and power of several FDR approaches and concluded that the simple original Benjamini and Hochberg algorithm works well with spatial data. We also present a simple modification that substantially increases power when there are many significant locations. An earlier technical report version is also available.
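For readers unfamiliar with the original Benjamini and Hochberg (1995) step-up procedure mentioned above, here is a minimal sketch in plain Python. It is not the code used in the paper and does not include the paper's modification; the function name `benjamini_hochberg` and the p-values are hypothetical, with one p-value standing in for each spatial location.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the test is rejected at FDR level q."""
    n = len(p_values)
    if n == 0:
        return []
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= q * k / n.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / n:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per spatial location (illustrative numbers only).
p_by_location = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_by_location, q=0.05)
print(significant)
```

Because the rule searches for the largest qualifying rank rather than stopping at the first failure, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ordering meets its threshold.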

Collaborators: Valerie Ventura (CMU Statistics), James Risbey (CSIRO Tasmania)

Trends in storminess in the northern hemisphere based on multiple indicators

Project description: For my applied research qualifier at CMU, I completed an analysis (Paciorek, Risbey, Ventura, and Rosen 2002), published in Journal of Climate, of storminess in the Northern Hemisphere over the past 50 years using the NCEP/NCAR reanalysis dataset. The goals were (1) to compare different indices of storminess, and (2) to investigate trends in storminess over time. The maps in the paper were created with a package of UNIX mapping tools called GMT. Data files of the storm indices we use in this analysis are available here.

Collaborators: James Risbey (CSIRO Tasmania), Valerie Ventura (CMU Statistics), Richard Rosen (NOAA)

Statistical language modelling

Project description: We built exponential language models based on features of whole sentences. The goal was to rerank n-best lists produced by an initial acoustic/n-gram model. An early approach to this problem, presented at the 2000 Speech Transcription Workshop, was applied to Switchboard conversational speech with little success. One way of re-ranking n-best lists is via Powell's algorithm. I devised a local regression approach to reranking n-best lists, but this approach has not been successful.

Collaborator: Roni Rosenfeld (CMU Computer Science)

Tree resprouting on Barro Colorado Island

Project description: For my master's thesis in ecology, I analyzed resprouting in trees and shrubs on Barro Colorado Island, Panama. This work (Paciorek, Condit, Hubbell, and Foster 2000) has been published in Journal of Ecology.

Collaborator: Rick Condit (Smithsonian Tropical Research Institute, Panama)

Last updated: January 2017.