The Neyman Seminar: 1011 Evans, 4:10-5:00 pm Wednesday, November 19, 2003

Reducing Size and Complexity of Very Large Geophysical Data Sets

Amy Braverman

Earth and Space Sciences Division, Jet Propulsion Laboratory,
California Institute of Technology

Abstract

This talk discusses a procedure for compressing large data sets, particularly geophysical ones like those obtained from remote sensing satellite instruments. Data are partitioned by space and time, and a penalized clustering algorithm is applied to each subset independently. The algorithm is based on the entropy-constrained vector quantizer (ECVQ) of Chou, Lookabaugh and Gray (1989). In each subset, ECVQ trades off error against data reduction to produce a set of representative points that stand in for the original observations. Since the data are voluminous, a preliminary set of representatives is determined from a sample, and the full subset is then clustered by assigning each observation to the nearest representative point. After the initial representatives are replaced by the centroids of these final clusters, the new representatives and their associated counts constitute a compressed version, or summary, of the raw data. Because the initial representatives are derived from a sample, the final summary is subject to sampling variation. A statistical model for the relationship between compressed and raw data provides a framework for assessing this variability and other aspects of summary quality.
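To make the sample-then-assign step concrete, the Python sketch below fits a simplified entropy-penalized quantizer on a random sample of a subset, then assigns every observation in the full subset to its nearest representative and replaces the representatives by the final cluster centroids. This is a minimal illustration under assumed choices, not the MISR production code: the parameter names (n_sample, k_init, lam) and the squared-error-plus-code-length cost are illustrative stand-ins for the full ECVQ formulation.

    import numpy as np

    def compress_subset(data, n_sample=2000, k_init=32, lam=0.1,
                        n_iter=25, seed=None):
        """Illustrative sketch: fit an entropy-penalized quantizer on a
        sample, then summarize the full subset as (representatives, counts).
        All parameter names and defaults are assumptions, not the MISR values.
        """
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(data), size=min(n_sample, len(data)), replace=False)
        sample = data[idx]

        # Initialize representatives from the sample.
        reps = sample[rng.choice(len(sample), size=k_init, replace=False)]
        probs = np.full(k_init, 1.0 / k_init)

        for _ in range(n_iter):
            # Entropy-constrained assignment: squared error plus a penalty
            # proportional to the code length -log2(p_j) of each cell.
            sq_err = ((sample[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
            cost = sq_err + lam * (-np.log2(probs + 1e-12))[None, :]
            labels = cost.argmin(axis=1)

            # Update centroids and cell probabilities; drop empty cells.
            keep = np.unique(labels)
            reps = np.array([sample[labels == j].mean(axis=0) for j in keep])
            probs = np.array([(labels == j).mean() for j in keep])

        # Final pass over the full subset: nearest-representative assignment.
        # (In practice this would be done in chunks for very large subsets.)
        full_err = ((data[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
        full_labels = full_err.argmin(axis=1)
        keep = np.unique(full_labels)
        reps = np.array([data[full_labels == j].mean(axis=0) for j in keep])
        counts = np.array([(full_labels == j).sum() for j in keep])
        return reps, counts

The returned representatives and their counts are the compressed summary for one space-time subset; applying the same step to each subset independently yields the full summary of the raw data.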

The procedure is being used to produce global data products for the Multi-angle Imaging SpectroRadiometer (MISR), one of the instruments aboard NASA's Terra satellite. MISR produces approximately 1 TB per month of radiance and geophysical data. Practical considerations for this application are discussed, and examples using compressed MISR data are presented.