Smoothers

1  Smoothers

In many of the graphs we've looked at, we added a straight line representing the best linear regression line that went through the data we were plotting. Such lines can be very helpful when there are lots of points, or when outliers might detract from our seeing a relationship among the points. But plotting the best linear regression line has some limitations. For one thing, the regression has to fit all the data, so finding a good regression fit is often a compromise between being true to the data in any given range, and trying to come up with a single line that does reasonably well throughout the entire range. For some data, this may be appropriate. For example, if we know that two variables really follow a linear relationship, then we'd have to assume that deviations from that relationship are just noise, and the best straight line would be a meaningful way to display their relationship on a graph. However, situations like that are not that common. To come up with a way of visualizing relationships between two variables without resorting to a regression lines, statisticians and mathematicians have developed techniques for smoothing curves. Essentially this means drawing lines through the points based only on other points from the surrounding neighborhood, not from the entire set of points. There are many different types of smoothers available, and most of them offer an option that controls how much smoothing they will do as well as options to control the basic methods that they use, so it's usually possible to find a smoother that will work well for a particular set of data.

2  Kernel Smoothers

Kernel smoothers work by forming a weighted average of all the y-values corresponding to points whose x-values are close to the x-value of a point being plotted. The function that defines the weights is known as a kernel, and the number of points involved in the weighted average is based on a parameter known as the bandwidth. The default kernel is a box function; in other words, it simply averages together y-values which are within the specified bandwidth of a given x-value, and uses that average as the y-value for the x-value in question. With a very tiny bandwidth, this corresponds to a "connect-the-dots" type of drawing. With a very large bandwidth, it will basically estimate every y-value as the mean of all the y-values. However, even when the bandwidth is carefully chosen, using the box kernel rarely will result in a truly smooth graph. For example, consider a plot of OD.Ratio versus Proanthocyanins from the wine data set that we've used in previous examples. The following code produces a plot of the variables, and superimposes a line representing a box kernel smooth with the default bandwidth:
> plot(wine$Proanthocyanins,wine$OD.Ratio)
> lines(ksmooth(wine$Proanthocyanins,wine$OD.Ratio))

Here's the graph:
Notice how choppy the line is, especially where there isn't much data. That's because the box kernel is too extreme - it either adds in a point or not. So using the box kernel is like stacking up a bunch of square boxes around each point, and we don't really get a smooth result.
More commonly, kernels will have a maximum at distances that are very small, and will decrease gradually as the (absolute value) of the distance from the center of the kernel increases. This means that nearby points will have lots of influence on the weighted estimate that will be plotted, but as we move away from a particular point, the neighboring points will have less and less influence. We can modify how many points are considered through the bandwidth - including more points tends to give smoother curves that don't respond as well to local variation, while decreasing the bandwidth tends to make the curve look "choppier". One of the most common kernels used in smoothing is the Gaussian or normal kernel. This kernel is the familiar "bell curve" - largest in the middle (corresponding in this cases to distances of zero from a particular point), and gradually decreasing over it's supported range. The width of that range is determined by the bandwith when using a kernel smoother. Here's the Proanthocyanins vs. OD.Ratio plot, smoothed with a normal kernel using the default bandwidth:
Notice the change in the line when switching to the normal kernel; the line is now smooth, and we can see that a linear relationship that holds up until around a Proanthocyanin concentration of about 2.
The argument that controls the size of the neighborhood that's used by ksmooth to estimate the smoothed value at each point is called bandwidth. We can examine the effect of changing the bandwidth as follows;
kplotfn = function(bw){
    plot(wine$Proanthocyanins,wine$OD.Ratio,main=bw)
    lines(ksmooth(wine$Proanthocyanins,wine$OD.Ratio,bandwidth=bw),col='green')
}

bws = seq(.1,1,by=.1)
par(mfrow=c(5,2),mar=c(2,4,2,1)+.1)
sapply(bws,kplotfn)

After adjusting the margins, the plot we got looks something like a lattice plot. But how could we produce an actual lattice plot with this data? We would need a data frame with x and y values for each bandwidth. What does ksmooth return?
> result = ksmooth(wine$Proanthocyanins,wine$OD.Ratio)
> class(result)
[1] "list"
> names(result)
[1] "x" "y"

Since the lengths of x and y must be equal, we can convert the output to a data frame directly. Let's write a function which will generate a data frame with x, y and bandwidth values for a single bandwidth:
kfun = function(bw)
   data.frame(bw=bw,as.data.frame(ksmooth(wine$Proanthocyanins,wine$OD.Ratio,bandwidth=bw)))

Notice that R realized it had to repeat the single bw value to make it compatible with x and y. As always, we should test the function:
> result = kfun(.5)
> head(result)
   bw         x        y
1 0.5 0.4100000 1.881429
2 0.5 0.4279096 1.881429
3 0.5 0.4458192 1.812500
4 0.5 0.4637288 1.812500
5 0.5 0.4816384 1.879000
6 0.5 0.4995480 1.879000

Now we can create a list of data frames, one for each bandwidth:
> frames = lapply(seq(.1,1,by=.1),kfun)
> allbws = do.call(rbind,frames)
> dim(allbws)
[1] 1780    3

Notice the use of the do.call function, which is used here to call rbind with the data frames for all of the different bandwidths.
To get both points and lines on each panel, we can create a custom panel function:
mypanel = function(x,y,...){
              panel.xyplot(wine$Proanthocyanins,wine$OD.Ratio);
	      panel.xyplot(x,y,type='l',col='red')
}
xyplot(y~x|factor(bw),data=allbws,type='l',as.table=TRUE,layout=c(2,5),panel=mypanel)

The resulting plot appears below:

3  Locally Weighted Regression Smoothers

Another approach that is often used to smooth curves is locally weighted regression. Instead of taking a weighted average of y-values near the x-values we want to plot, the nearby points are used in a (usually quadratic) weighted regression, and predicted values from these local regressions are used as the y-values that are plotted. The lowess function in R implements this technique by using the reciprocal of the residuals of successive fits as the weights, downgrading those points that don't contribute to a smooth fit. In the lowess function, the argument f= specifies the fraction of the data to be used in the local regressions. Specifying a larger value results in a smoother curve.
To illustrate, consider a plot of literacy versus phys, the number of physicians per 100000 people, from the world data set that we've used in previous examples. The following code produces a plot of the data with a lowess smoothed curve superimposed:
> plot(world$literacy,world$phys)
> lines(lowess(world$literacy,world$phys))

The graph appears below:
The argument to lowess that controls the level of smoothing is f, the fraction of the data which will be used in the local regressions. Let's compare the results of smoothing the literacy/physician curve using different values of f
lplotfn = function(f){
    plot(world$literacy,world$phys,main=f)
    lines(lowess(world$literacy,world$phys,f=f),col='orange')
}

fs = seq(.1,1,by=.1)
par(mfrow=c(5,2),mar=c(2,4,2,1)+.1)
sapply(fs,lplotfn)

For this particular data set, there's not much difference in the appearance of the smoothed curve, although there is a bit of noise when using the lowest value of f



File translated from TEX by TTH, version 3.67.
On 18 Apr 2011, 15:27.