Reducing Size and Complexity of Very Large Geophysical Data Sets
Amy Braverman
Earth and Space Sciences Division, Jet Propulsion Laboratory,
California Institute of Technology
Abstract
This talk discusses a procedure for compressing large
data sets, particularly geophysical ones like those obtained from
remote sensing satellite instruments. Data are partitioned by space and time,
and a penalized clustering algorithm is applied to each subset independently. The
algorithm is based on the entropy-constrained vector quantizer (ECVQ) of Chou,
Lookabaugh and Gray (1989). In each subset ECVQ trades off error against data
reduction to produce a set of representative points that stand in for the
original observations. Since data are voluminous, a preliminary set of
representatives is determined from a sample, then the full subset is clustered
by assigning each observation to the nearest representative point. After
replacing the initial representatives by the centroids of these final clusters, the new representatives and their associated counts constitute a compressed
version, or summary, of the raw data. Since the initial representatives are
derived from a sample, the final summary is subject to sampling variation. A
statistical model for the relationship between compressed and raw data provides a framework for assessing this variability, and other aspects of summary
quality.
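The steps described above can be illustrated with a minimal sketch. It is not the implementation used for the data products discussed in this talk. It assumes squared Euclidean error, a penalty of lam times the negative log2 of each cluster's share of the sample as the data-reduction term, and synthetic two-dimensional data. All names (ecvq, summarize, lam, k_init) are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def ecvq(sample, k_init, lam, iters=20, seed=0):
    """Penalized clustering of a sample (illustrative assumption, see lead-in).

    Each point joins the cluster minimizing
    squared distance + lam * (-log2 of the cluster's share of the sample),
    so sparsely populated clusters are penalized and may empty out.
    """
    rng = random.Random(seed)
    reps = rng.sample(sample, k_init)      # initial representatives
    probs = [1.0 / k_init] * k_init        # initial cluster shares
    for _ in range(iters):
        groups = [[] for _ in reps]
        for x in sample:
            costs = [sq_dist(x, c) - lam * math.log2(p)
                     for c, p in zip(reps, probs)]
            groups[costs.index(min(costs))].append(x)
        kept = [g for g in groups if g]    # drop clusters that emptied out
        reps = [centroid(g) for g in kept]
        probs = [len(g) / len(sample) for g in kept]
    return reps

def summarize(data, reps):
    """Assign every observation to its nearest representative, then return
    the final cluster centroids with their counts."""
    groups = [[] for _ in reps]
    for x in data:
        dists = [sq_dist(x, c) for c in reps]
        groups[dists.index(min(dists))].append(x)
    return [(centroid(g), len(g)) for g in groups if g]

# Synthetic stand-in for one space-time subset: three Gaussian blobs.
rng = random.Random(1)
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(500)]

sample = rng.sample(data, 200)                 # preliminary sample
reps = ecvq(sample, k_init=10, lam=1.0)        # preliminary representatives
summary = summarize(data, reps)                # (centroid, count) pairs

for rep, count in summary:
    print(rep, count)
```

Larger values of lam favor fewer, more heavily populated clusters, while lam = 0 reduces the sample step to ordinary k-means. Because the counts sum to the subset size and each representative is its cluster's mean, the count-weighted mean of the summary equals the mean of the raw subset.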
The procedure is being used to produce global data products for the Multi-angle Imaging SpectroRadiometer (MISR), one of the instruments aboard NASA's Terra satellite. MISR produces approximately 1 TB per month of radiance and geophysical data. Practical considerations for this application are discussed, and examples using compressed MISR data are presented.