Linear Regression

1  Regression Diagnostics

Two statistics that have proven useful in identifying influential observations are Cook's distance and the hat statistic (also known as the leverage). Cook's distance is calculated for each observation by comparing the results of the regression with that observation included to the results when that particular observation is removed; it can therefore find observations that are outliers with respect to both the dependent variable and the independent variables. In R, the cooks.distance function calculates the statistic given an lm model object. The hat statistic is based entirely on the independent variables of the model, so it focuses on observations that are distant from the others with regard to the independent variables. In R, the lm.influence function returns a list that includes a component named hat, containing the hat statistics. For simple regressions, with just one independent variable, influential observations usually lie at the far reaches of the range of either the dependent or the independent variable. For example, for the wine.lm model, Proline was one of the independent variables that seemed effective in predicting Alcohol. Let's perform a simple regression using this variable, and then plot the results, highlighting those points that had unusually high Cook's distances or hat statistic values:
> simple.lm = lm(Alcohol~Proline,data=wine)
> cooks = cooks.distance(simple.lm)
> hat =  lm.influence(simple.lm)$hat
> par(mfrow=c(2,1))
> plot(wine$Proline,wine$Alcohol,col=ifelse(cooks > quantile(cooks,.90),'red','black'))
> plot(wine$Proline,wine$Alcohol,col=ifelse(hat > quantile(hat,.90),'blue','black'))
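
If we want the row numbers of the flagged observations rather than a picture, the same conditions can be used with the which function (a sketch; output not shown):
> which(cooks > quantile(cooks,.90))
> which(hat > quantile(hat,.90))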

The plots are displayed below:
In the top graph, the points displayed in red represent the observations with large Cook's distances; in the bottom graph, the blue points are those with high hat statistic values. Of course, with more variables in the model, things are not so simple. Let's look at a plot of predicted values versus actual values for the full regression model for this data, using the same coloring conventions:
> cooks = cooks.distance(wine.lm)
> hat = lm.influence(wine.lm)$hat
> par(mfrow=c(2,1))
> plot(wine$Alcohol,predict(wine.lm),col=ifelse(cooks > quantile(cooks,.9),'red','black'))
> plot(wine$Alcohol,predict(wine.lm),col=ifelse(hat > quantile(hat,.9),'blue','black'))

Here are the plots:
The extreme Cook's distance points seem to congregate at the outer edges of the predicted values, but the extreme hat statistic points don't follow a simple pattern. In practice, many statisticians use the rule of thumb that Cook's distances bigger than the 10th percentile of an F distribution with p and n - p degrees of freedom represent potential problems, where n is the number of observations and p is the number of parameters estimated. For the wine data, that cutoff point can be calculated as follows:
> qf(.1,7,178-7)
[1] 0.4022056

In fact, for this data set none of the Cook's distances are greater than this value.
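Since none of the distances exceed the cutoff, a direct check should return zero:
> sum(cooks > qf(.1,7,178-7))
[1] 0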
For the hat statistic, a cutoff of 2*p/n has been proposed; for the wine data this corresponds to a value of about 0.079, and there are ten observations that exceed it. Plotting each independent variable against the dependent variable, and highlighting these extreme points in orange, helps to show where they are:
>  par(mfrow=c(3,2))
> sapply(names(coef(wine.lm)[-1]),
+    function(x)plot(wine[[x]],wine$Alcohol,
+                    col=ifelse(hat >.0786 ,'orange','black'),xlab=x))

Here are the plots:
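Rather than hard coding the threshold, the cutoff can also be computed directly from the fitted model; a sketch (p includes the intercept), which should flag the same ten points:
> p = length(coef(wine.lm))
> n = nrow(wine)
> which(hat > 2*p/n)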

2  Collinearity

Another problem which might occur when using linear regression is known as collinearity. This problem occurs when the independent variables are so highly correlated that they contain redundant information, which confuses the regression process. When data is collinear, the standard errors of the parameter estimates get very large, and removing one or two variables may make the coefficients change dramatically. In addition, collinearity can mask important relationships in the data. The classic data set used to illustrate collinearity is the Longley data set, available in R under the name longley. It contains a variety of US macroeconomic measurements, collected annually from 1947 to 1962, used in an attempt to predict employment. Let's take a look at the result of regressing Employed against the other variables in the Longley data set:
> lreg = lm(Employed ~ .,data=longley)
> summary(lreg)

Call:
lm(formula = Employed ~ ., data = longley)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,	Adjusted R-squared: 0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10 

On the surface, nothing seems wrong - in fact, with an adjusted R-squared of .9925, it seems great. We can see the problem with the data by looking at the pairwise scatterplots, using the pairs function:
> pairs(longley)

Here's the plot:
Many of the variables seem to be correlated with each other, so it's difficult to see which is causing the problem. A statistic known as VIF (Variance Inflation Factor) can be very useful in situations like this. In R, the vif function in the car package will provide this statistic. Before using vif on the Longley data, let's look at the wine data we used previously:
> wine.lm = lm(Alcohol~.,data=subset(wine,select=-Cultivar))
> library(car)
> vif(wine.lm)
     Malic.acid             Ash  Alkalinity.ash       Magnesium         Phenols 
       1.575916        2.180108        2.179282        1.417855        4.330552 
     Flavanoids      NF.phenols Proanthocyanins Color.intensity             Hue 
       7.029040        1.793883        1.947243        2.493007        2.542273 
       OD.Ratio         Proline 
       3.736818        2.441810 

None of the variance inflation factors is bigger than 10, the usual rule-of-thumb cutoff, which indicates that collinearity is not a problem for this data set; this is confirmed by looking at the pairs plot for the wine data:
There does seem to be a linear relationship between Flavanoids and Phenols - not surprisingly those two variables have the highest VIFs.
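The VIF for a given predictor is just 1/(1 - R^2) from the regression of that predictor on all of the other predictors. As a sanity check (a sketch, not part of the original transcript), we can reproduce the Flavanoids value by hand; it should agree with the value of about 7 reported by vif:
> fl.lm = lm(Flavanoids ~ .,data=subset(wine,select=-c(Cultivar,Alcohol)))
> 1/(1 - summary(fl.lm)$r.squared)
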
Now let's return to the Longley data.
> vif(lreg)
GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
   135.53244   1788.51348     33.61889      3.58893    399.15102    758.98060 

The two largest VIFs are for GNP and Year. Let's eliminate them from the model, and see how the VIFs change:
> lreg1 = lm(Employed ~ .,data=subset(longley,select=-c(GNP,Year)))
> summary(lreg1)

Call:
lm(formula = Employed ~ ., data = subset(longley, select = -c(GNP, 
    Year)))

Residuals:
       Min         1Q     Median         3Q        Max 
-0.6561730 -0.2576601 -0.0008123  0.1213544  1.2225443 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.781314   6.886470   2.001 0.070657 .  
GNP.deflator  0.207046   0.081376   2.544 0.027270 *  
Unemployed   -0.012412   0.002780  -4.465 0.000955 ***
Armed.Forces -0.005968   0.003325  -1.795 0.100170    
Population    0.306601   0.123795   2.477 0.030755 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.5671 on 11 degrees of freedom
Multiple R-squared: 0.9809,	Adjusted R-squared: 0.9739 
F-statistic: 141.1 on 4 and 11 DF,  p-value: 2.26e-09 

> vif(lreg1)
GNP.deflator   Unemployed Armed.Forces   Population 
   35.970754     3.147600     2.497795    34.588299 

The reduced model is probably more realistic than the full model, but GNP.deflator and Population are still highly correlated:
> with(longley,cor(GNP.deflator,Population))
[1] 0.9791634
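
To see all of the pairwise correlations among the remaining predictors at once, we could print the rounded correlation matrix (a sketch; output not shown):
> round(cor(subset(longley,select=-c(GNP,Year,Employed))),3)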

Removing GNP.deflator results in a model that seems to make sense:
> lreg2 = lm(Employed ~ .,data=subset(longley,select=-c(GNP,Year,GNP.deflator)))
> summary(lreg2)

Call:
lm(formula = Employed ~ ., data = subset(longley, select = -c(GNP, 
    Year, GNP.deflator)))

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3835 -0.2868 -0.1353  0.3596  1.3382 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.323091   4.211566  -0.314  0.75880    
Unemployed   -0.012292   0.003354  -3.665  0.00324 ** 
Armed.Forces -0.001893   0.003516  -0.538  0.60019    
Population    0.605146   0.047617  12.709 2.55e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6843 on 12 degrees of freedom
Multiple R-squared: 0.9696,	Adjusted R-squared: 0.962 
F-statistic: 127.7 on 3 and 12 DF,  p-value: 2.272e-09 

> vif(lreg2)
  Unemployed Armed.Forces   Population 
    3.146686     1.918225     3.514335 
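
One way to summarize the trade-off is to compare the adjusted R-squared values of the three models side by side; based on the summaries above, the fit deteriorates only slightly as the collinear variables are removed (a sketch):
> sapply(list(full=lreg,reduced=lreg1,final=lreg2),
+        function(m)summary(m)$adj.r.squared)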

Now let's look at some alternatives to ordinary linear regression.

3  Generalized Additive Models (gam)

One of the most useful alternatives to ordinary regression is known as a generalized additive model (GAM). Instead of fitting a single linear coefficient to describe the relationship between each independent variable and the dependent variable, a GAM performs spline smooths on selected variables and uses these smoothed versions of the independent variables to try to explain the values of the dependent variable. To make the output of the analysis similar to the familiar lm output, an estimated number of degrees of freedom is calculated for each variable, based on how different the fitted spline smooth for that variable is from the strictly linear relationship that lm uses for prediction, and an F-statistic is produced for each independent variable to test whether the smoothed version of that variable made a significant contribution to the predicted value of the dependent variable. In R, the gam function is provided by the mgcv library. This library also provides the s function, which is used by gam to identify the variables that should be smoothed before they are used to predict the dependent variable. Without prior knowledge, it's not unreasonable to try smoothing on all the variables. We can fit a gam model by using the same formula as we used with lm, passing each variable in the model to the s function:
> library(mgcv)
> wine.gam = gam(Alcohol~s(Malic.acid)+s(Alkalinity.ash)+
+                s(Proanthocyanins)+s(Color.intensity)+
+                s(OD.Ratio)+s(Proline),data=wine[-1])
> wine.gam

Family: gaussian
Link function: identity

Formula:
Alcohol ~ s(Malic.acid) + s(Alkalinity.ash) + s(Proanthocyanins) +
    s(Color.intensity) + s(OD.Ratio) + s(Proline)

Estimated degrees of freedom:
 1 7.920717 3.492826 4.022189 1 3.567478   total =  22.00321

GCV score:  0.2314599

Like the lm function, gam provides a more familiar table when the summary method is invoked:
> summary(wine.gam)

Family: gaussian
Link function: identity

Formula:
Alcohol ~ s(Malic.acid) + s(Alkalinity.ash) + s(Proanthocyanins) +
    s(Color.intensity) + s(OD.Ratio) + s(Proline)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.00062    0.03376   385.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                     edf Est.rank      F  p-value
s(Malic.acid)      1.000    1.000 14.133 0.000240 ***
s(Alkalinity.ash)  7.921    9.000  4.403 3.84e-05 ***
s(Proanthocyanins) 3.493    9.000  1.844 0.064387 .
s(Color.intensity) 4.022    9.000  8.391 3.66e-10 ***
s(OD.Ratio)        1.000    1.000  2.246 0.135990
s(Proline)         3.567    9.000  5.508 1.42e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.692   Deviance explained = 72.9%
GCV score = 0.23146   Scale est. = 0.20285   n = 178

The relative importance of the variables has changed somewhat from the linear regression results. Notice that the adjusted R-squared value is 0.692, as compared to 0.575 for linear regression, showing an improvement in prediction from using the smoothed versions of the independent variables.
Applying the plot function to a gam model is often the most useful part of the analysis. The plots produced show how each independent variable was smoothed before fitting, so a straight (or nearly straight) line for a particular variable means a truly linear relationship was found, while deviations from linearity describe the nature of any non-linear relationships. Here are the results of using plot on our gam model. For convenience, I've put all the plots in a single graphic; in practice, you might want to examine each plot separately. I've used the par function to adjust the margins so that the individual plots will be a little larger:
> par(mfrow=c(3,2),mar=c(2,4,1,2)+.1,oma=c(0,0,4,0),xpd=FALSE)
> plot(wine.gam)

Here's the plot:
For the variables Malic.acid and OD.Ratio, the relationships do seem to be linear; this is supported by the fact that gam used only a single degree of freedom to fit these terms. For some of the other variables, it's clear that linear relationships hold only over a limited range of the data. The Alkalinity.ash plot is particularly interesting, but its high estimated degrees of freedom (nearly 8) suggests that the smooth may be overfitting that variable.
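One way to investigate is to limit the basis dimension for that term with the k argument to s and see whether the fit changes much; a sketch (the choice of k = 5 is only an illustration):
> wine.gam2 = gam(Alcohol~s(Malic.acid)+s(Alkalinity.ash,k=5)+
+                 s(Proanthocyanins)+s(Color.intensity)+
+                 s(OD.Ratio)+s(Proline),data=wine[-1])
> summary(wine.gam2)$r.sq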

4  Recursive Partitioning

We've already seen how recursive partitioning can be used for classification, but it can also be used for regression if the dependent variable passed to rpart is not a factor. When used for regression, rpart follows a strategy similar to the one it uses for classification: every possible split of each variable is tested, looking for the split that produces the largest separation between the dependent variable values on the two sides. As in the classification case, rpart presents its results as a tree, with each terminal node reporting the predicted value (the mean of the dependent variable for the observations in that node). Here are the results of using recursive partitioning on the wine data frame to predict Alcohol. I'm using a period on the right-hand side of the model formula to indicate that rpart should consider all of the variables in the data frame (except Cultivar, which is excluded by using wine[-1]):
> library(rpart)
> wine.rpart = rpart(Alcohol ~ . ,data=wine[-1])
> wine.rpart
n= 178

node), split, n, deviance, yval
      * denotes terminal node

 1) root 178 116.654000 13.00062
   2) Color.intensity< 3.325 50   9.161498 12.13980
     4) Ash>=2.41 13   1.269369 11.86846 *
     5) Ash< 2.41 37   6.598724 12.23514 *
   3) Color.intensity>=3.325 128  55.969350 13.33687
     6) Proline< 900 79  28.974900 13.05013
      12) Color.intensity< 8.315 61  21.586760 12.93197
        24) Proline< 722.5 43  14.291710 12.80209
          48) Malic.acid< 3.1 27   8.142067 12.62889
            96) NF.phenols>=0.33 14   2.388493 12.30929 *
            97) NF.phenols< 0.33 13   2.783477 12.97308 *
          49) Malic.acid>=3.1 16   3.972794 13.09437 *
        25) Proline>=722.5 18   4.837111 13.24222 *
      13) Color.intensity>=8.315 18   3.650294 13.45056 *
     7) Proline>=900 49  10.025970 13.79918
      14) Color.intensity< 4.44 10   0.787410 13.27700 *
      15) Color.intensity>=4.44 39   5.812631 13.93308 *
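
The printed tree can be hard to read; the plot and text methods for rpart objects draw the same tree graphically (a sketch; the margin, use.n, and cex settings are just suggestions):
> plot(wine.rpart,margin=.1)
> text(wine.rpart,use.n=TRUE,cex=.8)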

Since rpart doesn't actually estimate any coefficients, we can't produce a table of hypothesis tests as we did for lm or gam, but we can get a sort of multiple R-squared value by squaring the correlation between the actual values of the dependent variable and the values that rpart predicts:
> cor(wine$Alcohol,predict(wine.rpart))^2
[1] 0.7248247

This unadjusted R-squared value is a little higher than the adjusted R-squared value from the gam model, although, being unadjusted, it will tend to be somewhat optimistic.

5  Comparison of the 3 Methods

A very simple way to get an idea of how the three methods compare is to plot predicted versus actual values for each of them, using a different color for each method:
> plot(predict(wine.lm),wine$Alcohol,col='red',xlab='Predicted',ylab='Actual')
> points(predict(wine.gam),wine$Alcohol,col='blue')
> points(predict(wine.rpart),wine$Alcohol,col='green')
> legend('topleft',legend=c('lm','gam','rpart'),col=c('red','blue','green'),pch=1,cex=.8)

Here's the plot:
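A numeric companion to the plot is the squared correlation between actual and predicted values for each model, as we computed for rpart above (a sketch; recall that wine.lm here is the full model fit in the collinearity section):
> sapply(list(lm=wine.lm,gam=wine.gam,rpart=wine.rpart),
+        function(m)cor(wine$Alcohol,predict(m))^2)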


