SemiPar Package for S-Plus

Home

GIS Research in Longwood Medical Area

GIS at Harvard University

FAQs

Scripts

Spatial Statistics

Links

Downloading/installing: SemiPar is a module for S-plus, which is commercial software for Windows and Unix. It may already be available in many academic and research computing environments or your institution may have a site license. Otherwise, individual versions of the software can be purchased at http://www.insightful.com/products/splus/default.asp. A basic familiarity with S-plus would be helpful when using the SemiPar code. A possible starting point for learning is http://www.isds.duke.edu/computing/S/Snotes/Splus.html. There are also a number of books for learning S-plus, and the manuals that come with S-plus are helpful for getting started. To download the SemiPar code, go to http://www.maths.unsw.edu.au/~wand/semipar.html and follow the instructions. Unfortunately, there appears to be little documentation for SemiPar at present. I will attempt to give enough detail below for users to fit basic models. Template code for a standard analysis can be obtained here .

Estimating a Surface: Estimating a spatial surface, including the effect of additional covariates, is done using the spm() function in SemiPar. The function uses a thin-plate spline representation of the spatial surface. There is a single parameter that controls the degree to which the data are smoothed to produce the estimated surface. This can be controlled by the user by giving spm() the number of degrees of freedom (df). The df can be interpreted in a heuristic way by comparing to polynomial regression modelling. The following holds approximately: one df is a constant surface, two df is a linear surface (a plane), and three df a quadratic surface (one extreme value), with additional df corresponding to additional complexity in the underlying surface. SemiPar can also fit the df automatically using an established statistical method called REstricted Maximum Likelihood (REML). I suggest using the REML default, while further exploring the fit by varying the df as an exploratory tool. However, I caution that it is easy to see patterns in noisy data that are not real phenomenon, so one can easily produce a map with particular features that seem interesting or suggest certain subject matter conclusions, but that are not supported statistically. On the other hand, based on empirical evidence, Matt Wand, the creator of the SemiPar software believes that REML tends to oversmooth the data, leaving out real spatial features, so using more df than the REML estimate may be appropriate. A simple, but effective way of estimating the df is to use cross-validation. This involves splitting the dataset into multiple groups and, sequentially for each group, omitting the group when fitting the model and then predicting the values in the group with the model and comparing to the actual values. By quantifying the mean square error (the sum of squared differences between predictions and actual values) for each possible value of df, one can decide on the best value for df. Ideally, one does this for each data point (i.e., each data point serving as a group), but this can be slow computationally, so larger groups are often used. I have not yet determined how one obtains predicted values for arbitrary locations using SemiPar, which would be necessary to carry out cross-validation.

A simple example of use of the spm function is the following S code:
fit <- spm(dat$z~dat$covar1+dat$covar2+ f(dat$x,dat$y,knots=my.knots,adf=10),omit.missing=T)

By omitting the 'adf=10' argument, by which the user in this case has specified 10 df, one forces the use of REML to estimate the df. In this example, covar1 and covar2 are vectors of additional predictors, while x and y are vectors of spatial coordinate values and z is the vector of data. One also needs to define knots. The object my.knots is a k by 2 matrix of k knot locations with each row specifying the x and y coordinates of a knot. Knots are location points on which the surface is based. The more quickly the underlying surface varies in space, the more knots are needed to capture the effect. One should not use more knots than data points. In general, 50-100 knots in a grid on the region of interest is probably sufficient. One way to choose the knots would be to lay down a regular grid on the region of interest. There is also software for selecting knot locations. One possible algorithm is a space-filling algorithm, for which one can use the cover.design() function included with SemiPar. This is probably most helpful for selecting knots from amongst the data locations when the data fall in an irregular spatial region. Note that it is also possible to fit nonlinear smooth functions to the additional predictors (named covar1 and covar2 above) in place of the linear forms given above, as demonstrated in the template .

Mapping a surface in S-plus: The spm.plot() function in SemiPar allows one to make a map of the estimated surface in S-plus. While not easily allowing for all of the data layers available in GIS software, plotting in S-plus can give an initial sense for what the surface looks like, and many of the attributes of the map (colors, titles, legends) can be customized. See the example file for a template .

. The key function is plot.spm(). An example is:

plot.spm(fit,plot.params=plot.control(plot.it=rep(F,8)),
image.params=image.control(bdry=my.bdry,range.x=range(my.bdry[,1]),
range.y=range(my.bdry[,2]),mesh.size=c(num.cells.lon,num.cells.lat),
leg.loc=c(-71.3,42.26),main=""))

The my.bdry argument is a q by 2 matrix with each of the q rows a set of x, y values. Connecting these q coordinate points determines the plotting region. While SemiPar can make predictions where there are no data, it makes little sense to plot these areas because the uncertainty is so large. The simplest boundary region is a rectangle, implemented by including the four corner locations in the my.bdry object. The boundary region can be specified by pointing and clicking from within the software. To do this, omit the 'bdry=my.bdry' text in the above command and follow the instructions that S-plus gives. The mesh.size argument determines the number of grid boxes that are plotted; more is generally better to avoid blocky surfaces, up to the point at which the computations become too slow. The plot.spm() function automatically generates a scale, which is important. The colors of the image can be changed using the Options pull-down menu in the Unix graphical window. This should also be manipulable in Windows, although I have not investigated this.

Uncertainty: An important aspect of a statistical model is estimating the uncertainty in the results. The model underlying SemiPar can provide estimated standard errors for the surface at each location. However, I do not see how to easily extract this information using SemiPar for the spatial surface. One can see uncertainty bands for the other covariates using the argument 'se=T' in plot.spm(). The mgcv package for R appears to provide uncertainty estimates in an accessible fashion, so this may be a good alternative.

Exporting data to ArcGIS: The estimated surface can be exported to a data file that can then be imported into ArcGIS for additional mapping. To generate the file of x,y,z values needed in ArcGIS, where x and y are spatial coordinates, and z is the predicted surface at the (x,y) location, one can use the ArcGIS.spm() function, following the template code . Note that at time of writing (9/28/03), there was a bug in this function that gives incorrect predictions. A corrected version for which the predictions match those in plot.spm() can be found here and needs to be sourced into S-plus as done in the template code. This creates a comma-delimited (.csv) file. Note that ArcGIS assumes x and y are longitude and latitude, respectively, and that the pixels are square, so your lon-lat grid (created implicitly by the range.x, range.y, and mesh.size arguments to ArcGIS.spm()) should be equally spaced. Next open the ArcMap program of ArcGIS and open an empty map. Make sure that the Spatial Analyst is enabled (if there's no toolbar for it, go to View->Toolbars and activate it). Add the data to the map using Tools->Add XY Data. Once you find your file, the x and y fields should be automatically filled in in the window. Finally, you need to convert your 'feature' data to 'raster' data, so that ArcGIS knows that a whole pixel (grid box) is associated with each location. Using the Spatial Analyst toolbar, select Convert->Features to Raster. The input feature should be your_file_name.csv Events. Select 'z' for the field box. Change the output cell size to the appropriate grid size based on the values you used in ArcGIS.spm() in S-plus. The map should now display a pixelated image of the prediction values from SemiPar. To change the color scheme, you can right-click on the new layer, select Properties, and adjust the scheme.