Smoothers
1 Smoothers
In many of the graphs we've looked at, we added a straight line representing the
best linear regression fit to the data we were plotting. Such lines can
be very helpful when there are lots of points, or when outliers might otherwise make it
hard to see a relationship among the points.
But plotting the best linear regression line has some limitations. For one thing,
the regression has to fit all the data, so finding a good regression fit is often a
compromise between being true to the data in any given range, and trying to come up with
a single line that does reasonably well throughout the entire range. For some data, this
may be appropriate. For example, if we know that two variables really follow a linear
relationship, then we'd have to assume that deviations from that relationship are just
noise, and the best straight line would be a meaningful way to display their relationship
on a graph. However, situations like that are not that common.
To come up with a way of visualizing relationships between two variables without
resorting to regression lines, statisticians and mathematicians have developed
techniques for smoothing curves. Essentially this means drawing lines through the points
based only on other points from the surrounding neighborhood, not from the entire set of
points.
There are many different types of smoothers available, and most of them offer an
option that controls how much smoothing they will do as well as options to control the
basic methods that they use, so it's usually possible to find
a smoother that will work well for a particular set of data.
2 Kernel Smoothers
Kernel smoothers work by forming a weighted average of all the y-values corresponding
to points whose x-values are close to the x-value of a point being plotted. The
function that defines the weights is known as a kernel, and the number of points involved
in the weighted average is based on a parameter known as the bandwidth. The default
kernel is a box function; in other words, it simply averages together the y-values of
points whose x-values are within the specified bandwidth of a given x-value, and uses that
average as the y-value for the x-value in question. With a very tiny bandwidth, this
corresponds to a "connect-the-dots" type of drawing. With a very large bandwidth, it will
estimate essentially every y-value as the mean of all the y-values. However, even when the
bandwidth is carefully chosen, the box kernel rarely produces a truly smooth graph.
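In symbols, the value a kernel smoother plots at a point x is just a weighted average of
the observed y-values, with weights determined by the kernel K and the bandwidth h (this is
the standard kernel-smoother formula, not anything specific to R's implementation):

\hat{y}(x) = \frac{\sum_i K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_i K\!\left(\frac{x - x_i}{h}\right)}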
For example, consider a plot of OD.Ratio versus Proanthocyanins from
the wine data set that we've used in previous examples. The following code
produces a plot of the variables, and superimposes a line representing a box kernel
smooth with the default bandwidth:
> plot(wine$Proanthocyanins,wine$OD.Ratio)
> lines(ksmooth(wine$Proanthocyanins,wine$OD.Ratio))
Here's the graph:
Notice how choppy the line is, especially where there isn't much data. That's because
the box kernel is too abrupt: a point is either included in the average or left out
entirely. Using the box kernel is like stacking up square boxes around each point, so we
don't really get a smooth result.
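If we wanted to see how the bandwidth affects the box-kernel result, we could pass an
explicit value for the bandwidth= argument of ksmooth; the value used below is just an
arbitrary choice for illustration:
> plot(wine$Proanthocyanins,wine$OD.Ratio)
> lines(ksmooth(wine$Proanthocyanins,wine$OD.Ratio,kernel='box',bandwidth=1))
A wider box averages over more points, but the result is still step-like rather than
smooth.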
More commonly, kernels have their maximum at very small distances, and
decrease gradually as the absolute value of the distance from the center of the
kernel increases. This means that nearby points have lots of influence on the
weighted estimate that will be plotted, but as we move away from a particular point,
the neighboring points have less and less influence. We can modify how many
points are considered through the bandwidth: including more points tends to give
smoother curves that don't respond as well to local variation, while decreasing the
bandwidth tends to make the curve look "choppier". One of the most common kernels
used in smoothing is the Gaussian or normal kernel. This kernel is the familiar
"bell curve": largest in the middle (corresponding in this case to distances of
zero from a particular point), and gradually decreasing over its supported range.
The width of that range is determined by the bandwidth when using a kernel smoother.
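To use a normal kernel instead of the box kernel, we can pass kernel='normal' to
ksmooth; the bandwidth= argument is simply left at its default here:
> plot(wine$Proanthocyanins,wine$OD.Ratio)
> lines(ksmooth(wine$Proanthocyanins,wine$OD.Ratio,kernel='normal'))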
Here's the Proanthocyanins vs. OD.Ratio plot, smoothed with a
normal kernel using the default bandwidth:
Notice the change in the line when switching to the normal kernel; the line
is now smooth, and we can see a linear relationship that holds up to
a Proanthocyanin concentration of about 2.
3 Locally Weighted Regression Smoothers
Another approach that is often used to smooth curves is locally weighted regression.
Instead of taking a weighted average of y-values near the x-values we want to plot,
the nearby points are used in a (usually quadratic) weighted regression, and predicted
values from these local regressions are used as the y-values that are plotted.
The lowess function in R implements this technique, using the residuals
from successive fits to downweight those points that
don't contribute to a smooth fit. In the lowess function, the argument
f= specifies the fraction of the data to be used in the local regressions.
Specifying a larger value results in a smoother curve.
To illustrate, consider a plot of literacy versus phys, the number
of physicians per 100000 people, from the world data set that we've used in
previous examples. The following code produces a plot of the data with a lowess
smoothed curve superimposed:
> plot(world$literacy,world$phys)
> lines(lowess(world$literacy,world$phys))
The graph appears below:
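To see the effect of the f= argument, we could overlay curves based on different
fractions of the data; the particular values below are arbitrary choices, used only for
illustration:
> plot(world$literacy,world$phys)
> lines(lowess(world$literacy,world$phys))               # default f = 2/3
> lines(lowess(world$literacy,world$phys,f=0.2),lty=2)   # less smoothing
> lines(lowess(world$literacy,world$phys,f=0.9),lty=3)   # more smoothing
Smaller values of f follow local variation more closely, while larger values give a
smoother curve.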
4 Spline Smoothers
Another type of smoothing is known as spline smoothing, named after a tool formerly
used by draftsmen. A spline is a flexible piece of metal (usually lead) which could
be used as a guide for drawing smooth curves. A set of points (known as knots) would
be selected, and the spline would be held down at a particular x,y point, then bent to
go through the next point, and so on. Due to the flexibility of the metal, this
process would result in a smooth curve through the points.
Mathematically, the process can be reproduced by choosing the knot points, using
(usually cubic) polynomials to estimate points in between the knots, and using calculus
to make sure that the curve is smooth wherever the individual pieces are
joined together.
The smooth.spline function in R performs these operations.
The degree of smoothness is controlled by an argument called spar=, which
usually ranges between 0 and 1.
To illustrate, consider a data set consisting of the wheat production of the United
States from 1910 to 2004. The data set can be found at
http://www.stat.berkeley.edu/~spector/s133/data/wheat.txt.
The following lines will produce a plot of the data, and superimpose a spline smooth.
> wheat = read.table('http://www.stat.berkeley.edu/~spector/s133/data/wheat.txt',header=TRUE)
> plot(wheat$year,wheat$production)
> lines(smooth.spline(wheat$year,wheat$production))
Here's the result:
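If we wanted to control the degree of smoothness ourselves, we could pass a value for
spar= directly; the two values below are arbitrary choices, used only to show the range
of behavior:
> plot(wheat$year,wheat$production)
> lines(smooth.spline(wheat$year,wheat$production,spar=0.3),lty=2)  # follows the data closely
> lines(smooth.spline(wheat$year,wheat$production,spar=0.9),lty=3)  # much smoother curve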
5 Supersmoother
While most smoothers require specification of a bandwidth, fraction of data, or level
of smoothing, supersmoother is different in that it figures these things out for
itself. Thus, it's an excellent choice for situations where smoothing needs to be done
without any user intervention. Supersmoother works by performing lots of simple
local regression smooths and, at each x-value, using those smooths to decide the best
y-value to use. In R, supersmoother is made available through the supsmu
function.
To illustrate, consider the car data which we used earlier when we were studying
cluster analysis. The following lines produce a plot of Weight
versus MPG, and superimpose a supersmoother line.
> plot(cars$Weight,cars$MPG)
> lines(supsmu(cars$Weight,cars$MPG))
The plot appears below:
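Although supersmoother chooses its spans automatically, the supsmu function does accept
a bass= argument (values from 0 up to 10) that can be used to ask for a smoother result;
the value below is just an arbitrary choice for illustration:
> plot(cars$Weight,cars$MPG)
> lines(supsmu(cars$Weight,cars$MPG))
> lines(supsmu(cars$Weight,cars$MPG,bass=5),lty=2)   # smoother fit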
6 Smoothers with Lattice Plots
When working with lattice graphics, we've already seen the use of panel.lmline,
which displays the best regression line in each panel of a lattice plot. A similar
function, panel.loess, is available to superimpose a locally weighted regression
smoother in each panel of a plot. As a simple illustration, consider the built-in
Orange data set, which has information about the age and circumference of
several orange trees. First, let's look at a plot of circumference
versus age for each Tree, with the best regression line
superimposed:
> library(lattice)
> xyplot(circumference~age|Tree,data=Orange,
+ panel=function(x,y,...){panel.xyplot(x,y,...);panel.lmline(x,y,...)})
Here's the plot:
To create the same plot, but using the panel.loess function, we can use
the following:
> xyplot(circumference~age|Tree,data=Orange,
+ panel=function(x,y,...){panel.xyplot(x,y,...);panel.loess(x,y,...)})
Here's how the plot looks:
If a panel function doesn't exist for a smoother you'd like to use,
you can compute the smooth yourself and use the panel.lines function to plot it directly:
> xyplot(circumference~age|Tree,data=Orange,
+ panel=function(x,y,...){panel.xyplot(x,y,...);z=supsmu(x,y);panel.lines(z$x,z$y,...)})
In this case supersmoother came closer to a straight line than the loess smoother did.