Linear Regression

1  Linear Regression

Linear regression is a very popular procedure for modeling the value of one variable based on the value(s) of one or more other variables. The variable that we're trying to model or predict is known as the dependent variable, and the variables that we use to make predictions are known as independent variables, or covariates. Linear regression assumes that changes in the dependent variable can be modeled as a linear function of the independent variables; that is, we assume that a change of a certain amount in an independent variable will result in a change in the dependent variable, and that the size of that change is constant across the range of the independent variables. As a simple example, suppose we're interested in the relationship between the horsepower of a car (the independent variable) and its miles per gallon of gas (MPG). When we fit a linear regression model, we're saying that a change of one horsepower will have the same effect on the MPG regardless of the horsepower level at which that change occurs. In other words, a linear regression model would assume that if we compared a car with 100 horsepower to a car with 101 horsepower, we'd see the same difference in MPG as if we compared a car with 300 horsepower to a car with 301 horsepower. Relationships like this often hold over a limited range of independent variable values, but the linear regression model assumes that they apply over the entire range of the independent variables.
Even with these limitations, linear regression has proven itself to be a very valuable tool for modeling, and it's widely used in many branches of research. There are also a variety of diagnostic measures, both numerical and graphical, that can help us to determine whether the regression is doing a good job, so it's not unreasonable that many people use linear regression as their first tool when trying to model a variable of interest.

2  The lm command

The lm command uses the model-based formula interface that we've already seen in the context of lattice graphics. The dependent variable is placed on the left-hand side of the tilde (~), and the independent variables are placed on the right-hand side, joined together with plus signs (+). When you want to use all the variables in a data frame (except for the dependent variable) as independent variables, you can use a period (.) for the right-hand side of the equation.
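For example (a sketch using hypothetical variables y, x1, and x2 in a hypothetical data frame called dat), the following calls both specify valid models:
> lm(y ~ x1 + x2, data=dat)   # model y using x1 and x2 as independent variables
> lm(y ~ ., data=dat)         # use every other variable in dat as an independent variable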
These models correspond to a mathematical model that looks like this:
y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_p x_{ip} + e_i,      for i = 1, ..., n      (1)
The b's represent coefficients that measure how much the dependent variable changes for each unit of change in the corresponding independent variable, and are often referred to as the slopes. The term b_0 is often known as the intercept. (To omit the intercept in a formula in R, insert the term -1.) The e's represent the part of the observed dependent variable that can't be explained by the regression model, and in order to do hypothesis testing, we assume that these errors follow a normal distribution. For each term in the model, we test the null hypothesis that the corresponding b is equal to 0 against the alternative that it is different from 0.
To illustrate the use of the lm command, we'll construct a regression model to predict the level of Alcohol in the wine data set, using several of the other variables as independent variables. First, we'll run lm to create an lm object containing all the information about the regression:
> wine.lm = lm(Alcohol~Malic.acid+Alkalinity.ash+Proanthocyanins+Color.intensity+OD.Ratio+Proline,data=wine[-1])

To see a very brief overview of the results, we can simply view the lm object:
> wine.lm

Call:
lm(formula = Alcohol ~ Malic.acid + Alkalinity.ash + Proanthocyanins +
    Color.intensity + OD.Ratio + Proline, data = wine[-1])

Coefficients:
    (Intercept)       Malic.acid   Alkalinity.ash  Proanthocyanins
      11.333283         0.114313        -0.032440        -0.129636
Color.intensity         OD.Ratio          Proline
       0.158520         0.225453         0.001136

To get more information about the model, the summary function can be called:
> summary(wine.lm)

Call:
lm(formula = Alcohol ~ Malic.acid + Alkalinity.ash + Proanthocyanins +
    Color.intensity + OD.Ratio + Proline, data = wine[-1])

Residuals:
      Min        1Q    Median        3Q       Max
-1.502326 -0.342254  0.001165  0.330049  1.693639

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.3332831  0.3943623  28.738  < 2e-16 ***
Malic.acid       0.1143127  0.0397878   2.873  0.00458 **
Alkalinity.ash  -0.0324405  0.0137533  -2.359  0.01947 *
Proanthocyanins -0.1296362  0.0846088  -1.532  0.12732
Color.intensity  0.1585201  0.0231627   6.844 1.32e-10 ***
OD.Ratio         0.2254528  0.0834109   2.703  0.00757 **
Proline          0.0011358  0.0001708   6.651 3.76e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5295 on 171 degrees of freedom
Multiple R-Squared: 0.5889,     Adjusted R-squared: 0.5745
F-statistic: 40.83 on 6 and 171 DF,  p-value: < 2.2e-16

The p-values in the last column of the coefficient table are for testing the null hypotheses that the slopes for those variables are equal to 0; that is, we're testing the null hypothesis that changes in the independent variable will not result in a linear change in the dependent variable. We can see that most of the variables in the model do seem to have an effect on the dependent variable.
One useful measure of the efficacy of a regression model is the multiple R-Squared statistic. This essentially measures the squared correlation of the dependent variable values with the values that the model would predict. The usual interpretation of this statistic is that it measures the fraction of variability in the data that is explained by the model, so values approaching 1 mean that the model is very effective. Since adding more variables to a model can never decrease the R-Squared value, many people prefer the adjusted R-Squared value, which has been adjusted to account for the number of variables in the model.
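We can check this interpretation directly (a quick sketch): the Multiple R-Squared reported in the summary should match the squared correlation between the observed values of Alcohol and the fitted values from the model:
> cor(wine$Alcohol,fitted(wine.lm))^2   # should agree with the Multiple R-Squared of 0.5889
> summary(wine.lm)$r.squared            # the same value, stored in the summary object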
When you pass a model object to the plot function, it will display one or more plots that the author of the model fitting function felt were appropriate for studying the effectiveness of the model. For the lm object, four plots are created by default:
  1. A plot of residuals versus fitted (predicted) values - The residuals are the part of the dependent variable that the model couldn't explain, and they are our best available estimate of the error term from the regression model. They're calculated by subtracting the predicted value from the actual value of the dependent variable. Under the usual assumptions for the linear regression model, we don't expect the variability of the residuals to change over the range of the dependent variable, so there shouldn't be any discernible pattern in this plot. Note that outliers in the plot will be labeled with their observation numbers, making it easy to track them down.
  2. A normal quantile-quantile plot of the standardized residuals - For the probabilities we saw in the summary table to be accurate, we have assumed that the errors of the model follow a normal distribution. Thus, we'd expect a normal quantile-quantile plot of the residuals to follow a straight line. Deviations from a straight line could mean that the errors don't follow a normal distribution.
  3. A scale-location plot - This plot is similar to the residuals versus fitted values plot, but it uses the square root of the standardized residuals. Like the first plot, there should be no discernible pattern in the plot.
  4. A Cook's distance plot - Cook's distance is a statistic that tries to identify points which have more influence than other points. Generally these are points that are distant from other points in the data, either for the dependent variable or one or more independent variables. Each observation is represented as a line whose height is indicative of the value of Cook's distance for that observation. There are no hard and fast rules for interpreting Cook's distance, but large values (which will be labeled with their observation numbers) represent points which might require further investigation.
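These plots can be produced with a call like the following (a minimal sketch; the which= argument explicitly requests the four plots described above, since the default set of plots can differ between versions of R):
> par(mfrow=c(2,2))           # arrange the four plots in a 2 by 2 grid
> plot(wine.lm,which=1:4)     # which=1:4 requests exactly the four plots listed above
> par(mfrow=c(1,1))           # restore the default layout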
Here are the four plots for the wine.lm object:

3  Using the model object

The design of the R modeling functions makes it very easy to do common tasks, regardless of the method that was used to model the data. We'll use lm as an example, but most of these techniques will work for other modeling functions in R. We've already seen that the plot function will produce useful plots after a model is fit. Here are some of the other functions that are available to work with modeling objects. In each case, the modeling object is passed to the function as its first argument.
  1. Coefficients - The coef function will return a vector containing the coefficients that the model estimated, in this case, the intercept and the slopes for each of the variables:
    > coef(wine.lm)
        (Intercept)      Malic.acid  Alkalinity.ash Proanthocyanins Color.intensity
       11.333283116     0.114312670    -0.032440473    -0.129636226     0.158520051
           OD.Ratio         Proline
        0.225452840     0.001135776
    
    
  2. Predicted Values - The predict function, called with no additional arguments, will return a vector of predicted values for the observations that were used in the modeling process. To get predicted values for observations not used in building the model, a data frame containing the values of the observations can be passed to predict through the newdata= argument. The variables in the data frame passed to predict must have the same names as the variables used to build the model. For example, to get a predicted value of Alcohol for a mythical wine, we could use a statement like this:
    > predict(wine.lm,newdata=data.frame(Malic.acid=2.3,Alkalinity.ash=19,
    +              Proanthocyanins=1.6,Color.intensity=5.1,OD.Ratio=2,Proline=746.9))
    [1] 12.88008
    
    
  3. Residuals - The residuals function will return a vector of the residuals from a model.
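    As a quick check (a sketch), the residuals are simply the observed values of the dependent variable minus the predicted values:
    > max(abs(residuals(wine.lm) - (wine$Alcohol - predict(wine.lm))))   # essentially zero, up to rounding error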
In addition, the summary function, which is usually used to display a printed summary of a model, often contains useful information. We can see what it contains by using the names function:
> names(summary(wine.lm))
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

4  Regression Diagnostics

Two statistics which have proven to be useful in identifying influential observations are Cook's distance and a statistic known as the hat statistic. Cook's distance is calculated for each observation by comparing the results of the regression with that observation included to the results when that particular observation is removed. Thus, it can find observations which are outliers with regard to both the dependent variable and the independent variables. In R, the cooks.distance function calculates this statistic given an lm model object. The hat statistic is based entirely on the independent variables of the model, so it focuses on observations which are distant from the others with regard to the independent variables in the model. In R, the lm.influence function will return a list, including a component named hat, which contains the hat statistic.
For simple regressions, with just one independent variable, influential observations are usually at the far reaches of the range of either the dependent or independent variables. For example, for the wine.lm model, Proline was one of the independent variables which seemed effective in predicting Alcohol. Let's perform a simple regression using this variable, and then plot the results, highlighting those points that had unusually high Cook's distances or hat statistic values:
> simple.lm = lm(Alcohol~Proline,data=wine)
> cooks = cooks.distance(simple.lm)
> hat =  lm.influence(simple.lm)$hat
> par(mfrow=c(2,1))
> plot(wine$Proline,wine$Alcohol,col=ifelse(cooks > quantile(cooks,.90),'red','black'))
> plot(wine$Proline,wine$Alcohol,col=ifelse(hat > quantile(hat,.90),'blue','black'))

The plots are displayed below:
In the top graph, the points displayed in red represent the observations with large Cook's distances; in the bottom graph, the blue points are those with high hat statistic values. Of course, with more variables in the model, things are not so simple. Let's look at a plot of predicted values versus actual values for the full regression model for this data, using the same coloring conventions:
> cooks = cooks.distance(wine.lm)
> hat = lm.influence(wine.lm)$hat
> par(mfrow=c(2,1))
> plot(wine$Alcohol,predict(wine.lm),col=ifelse(cooks > quantile(cooks,.9),'red','black'))
> plot(wine$Alcohol,predict(wine.lm),col=ifelse(hat > quantile(hat,.9),'blue','black'))

Here are the plots:
The extreme Cook's distance points seem to congregate at the outer edges of the predicted values, but the extreme hat statistics points don't follow a simple pattern. In practice, many statisticians use the rule of thumb that Cook's distances bigger than the 10th percentile of an F distribution with p and n-p degrees of freedom represent potential problems, where n is the number of observations, and p is the number of parameters estimated. For the wine data, that cutoff point can be calculated as follows:
> qf(.1,7,178-7)
[1] 0.4022056

In fact, for this data set none of the Cook's distances are greater than this value.
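We can confirm this with a quick check (a sketch, using the cooks vector calculated from the wine.lm model above):
> sum(cooks > qf(.1,7,178-7))    # number of observations above the cutoff; 0 for this data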
For the hat statistic, a cutoff of 2 * p/n has been proposed; for the wine data this corresponds to a value of about 0.079, and ten observations exceed it. Plotting each independent variable against the dependent variable, and highlighting the extreme points in orange, helps to show where these points are:
>  par(mfrow=c(3,2))
> sapply(names(coef(wine.lm)[-1]),
+    function(x)plot(wine[[x]],wine$Alcohol,
+                    col=ifelse(hat >.0786 ,'orange','black'),xlab=x))

Here are the plots:
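Rather than hard-coding the cutoff of 0.0786 as in the call above, it can be calculated directly from the fitted model (a sketch):
> p = length(coef(wine.lm))    # number of estimated parameters, including the intercept (7)
> n = nrow(wine)               # number of observations (178)
> 2*p/n                        # the proposed cutoff, about 0.079
> sum(hat > 2*p/n)             # how many observations exceed it; ten for this data set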

5  Collinearity

Another problem which might occur when using linear regression is known as collinearity. This problem occurs when the independent variables are so highly correlated that they contain redundant information, which confuses the regression process. When data are collinear, the standard errors of the parameter estimates get very large, and removing one or two variables may make the coefficients change dramatically. In addition, collinearity can mask important relationships in the data. The classic data set for illustrating collinearity is the Longley data set, available in R under the name longley. This data set contains a variety of US economic measurements, used in an attempt to predict employment. Let's take a look at the result of regressing Employed against the other variables in the Longley data set:
> lreg = lm(Employed ~ .,data=longley)
> summary(lreg)

Call:
lm(formula = Employed ~ ., data = longley)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,	Adjusted R-squared: 0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10 

On the surface, nothing seems wrong - in fact, with an adjusted R-squared of .9925, it seems great. We can see the problem with the data by looking at the pairwise scatterplots, using the pairs function:
> pairs(longley)

Here's the plot:
Many of the variables seem to be correlated with each other, so it's difficult to see which is causing the problem. A statistic known as the VIF (Variance Inflation Factor), which measures how much the variance of each estimated coefficient is inflated by correlations among the independent variables, can be very useful in situations like this. In R, the vif function in the car package will provide this statistic. Before using vif on the Longley data, let's look at the wine data we used previously:
> wine.lm = lm(Alcohol~.,data=subset(wine,select=-Cultivar))
> library(car)
> vif(wine.lm)
     Malic.acid             Ash  Alkalinity.ash       Magnesium         Phenols 
       1.575916        2.180108        2.179282        1.417855        4.330552 
     Flavanoids      NF.phenols Proanthocyanins Color.intensity             Hue 
       7.029040        1.793883        1.947243        2.493007        2.542273 
       OD.Ratio         Proline 
       3.736818        2.441810 

None of the inflation factors is bigger than 10, which indicates that collinearity is not a problem for this data set. This is confirmed by looking at the pairs plot for the wine data:
There does seem to be a linear relationship between Flavanoids and Phenols - not surprisingly those two variables have the highest VIFs.
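To see where these numbers come from, the VIF for a single variable can be computed by hand (a sketch; flav.lm is just an illustrative name): regress that variable on all of the other independent variables and calculate 1/(1 - R-squared). For Flavanoids, this should reproduce the value of about 7.03 reported by vif:
> flav.lm = lm(Flavanoids~.,data=subset(wine,select=-c(Cultivar,Alcohol)))   # Alcohol, the response, is excluded
> 1/(1 - summary(flav.lm)$r.squared)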
Now let's return to the Longley data.
> vif(lreg)
GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
   135.53244   1788.51348     33.61889      3.58893    399.15102    758.98060 

The two largest VIFs are for GNP and Year. Let's eliminate them from the model, and see how the VIFs change:
> lreg1 = lm(Employed ~ .,data=subset(longley,select=-c(GNP,Year)))
> summary(lreg1)

Call:
lm(formula = Employed ~ ., data = subset(longley, select = -c(GNP, 
    Year)))

Residuals:
       Min         1Q     Median         3Q        Max 
-0.6561730 -0.2576601 -0.0008123  0.1213544  1.2225443 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.781314   6.886470   2.001 0.070657 .  
GNP.deflator  0.207046   0.081376   2.544 0.027270 *  
Unemployed   -0.012412   0.002780  -4.465 0.000955 ***
Armed.Forces -0.005968   0.003325  -1.795 0.100170    
Population    0.306601   0.123795   2.477 0.030755 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.5671 on 11 degrees of freedom
Multiple R-squared: 0.9809,	Adjusted R-squared: 0.9739 
F-statistic: 141.1 on 4 and 11 DF,  p-value: 2.26e-09 

> vif(lreg1)
GNP.deflator   Unemployed Armed.Forces   Population 
   35.970754     3.147600     2.497795    34.588299 

The reduced model is probably more realistic than the full model, but GNP.deflator and Population are still highly correlated:
> with(longley,cor(GNP.deflator,Population))
[1] 0.9791634

Removing GNP.deflator results in a model that seems to make sense:
> lreg2 = lm(Employed ~ .,data=subset(longley,select=-c(GNP,Year,GNP.deflator)))
> summary(lreg2)

Call:
lm(formula = Employed ~ ., data = subset(longley, select = -c(GNP, 
    Year, GNP.deflator)))

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3835 -0.2868 -0.1353  0.3596  1.3382 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.323091   4.211566  -0.314  0.75880    
Unemployed   -0.012292   0.003354  -3.665  0.00324 ** 
Armed.Forces -0.001893   0.003516  -0.538  0.60019    
Population    0.605146   0.047617  12.709 2.55e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.6843 on 12 degrees of freedom
Multiple R-squared: 0.9696,	Adjusted R-squared: 0.962 
F-statistic: 127.7 on 3 and 12 DF,  p-value: 2.272e-09 

> vif(lreg2)
  Unemployed Armed.Forces   Population 
    3.146686     1.918225     3.514335 

Now let's look at some alternatives to ordinary linear regression.

6  Generalized Additive Models (gam)

One of the most useful alternative methods to regression is known as a generalized additive model (GAM). Instead of fitting a single linear coefficient to describe the relationship between an independent variable and the dependent variable, a GAM performs a spline smooth on selected variables, and uses these smoothed versions of the independent variables to try to explain the values of the dependent variable. To make the information from the analysis similar to the familiar lm output, an estimated number of degrees of freedom is calculated for each variable, based on how different the fitted spline smooth for that variable is from the strictly linear relationship that lm uses for prediction, and an F-statistic is produced for each independent variable to test whether the smoothed version of the variable made a significant contribution to the predicted value of the dependent variable.
In R, the gam function is provided by the mgcv library. This library also provides the s function, which is used by gam to identify the variables that should be smoothed before they are used to predict the dependent variable. Without prior knowledge, it's not unreasonable to try smoothing on all the variables. We can fit a gam model by using the same formula as we used with lm, passing each variable in the model to the s function:
> library(mgcv)
> wine.gam = gam(Alcohol~s(Malic.acid)+s(Alkalinity.ash)+
+                s(Proanthocyanins)+s(Color.intensity)+
+                s(OD.Ratio)+s(Proline),data=wine[-1])
> wine.gam

Family: gaussian
Link function: identity

Formula:
Alcohol ~ s(Malic.acid) + s(Alkalinity.ash) + s(Proanthocyanins) +
    s(Color.intensity) + s(OD.Ratio) + s(Proline)

Estimated degrees of freedom:
 1 7.920717 3.492826 4.022189 1 3.567478   total =  22.00321

GCV score:  0.2314599

Like the lm function, gam provides a more familiar table when the summary method is invoked:
> summary(wine.gam)

Family: gaussian
Link function: identity

Formula:
Alcohol ~ s(Malic.acid) + s(Alkalinity.ash) + s(Proanthocyanins) +
    s(Color.intensity) + s(OD.Ratio) + s(Proline)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.00062    0.03376   385.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                     edf Est.rank      F  p-value
s(Malic.acid)      1.000    1.000 14.133 0.000240 ***
s(Alkalinity.ash)  7.921    9.000  4.403 3.84e-05 ***
s(Proanthocyanins) 3.493    9.000  1.844 0.064387 .
s(Color.intensity) 4.022    9.000  8.391 3.66e-10 ***
s(OD.Ratio)        1.000    1.000  2.246 0.135990
s(Proline)         3.567    9.000  5.508 1.42e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.692   Deviance explained = 72.9%
GCV score = 0.23146   Scale est. = 0.20285   n = 178

The relative importance of the variables has changed somewhat from the linear regression results. Notice that the adjusted R-squared value is 0.692, as compared to 0.575 for linear regression, showing an improvement in prediction from using the smoothed versions of the independent variables.
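These two values can be extracted directly from the model summaries (a sketch; since wine.lm was redefined in the collinearity section, the six-variable linear model is refit here under the illustrative name wine.lm6):
> wine.lm6 = lm(Alcohol~Malic.acid+Alkalinity.ash+Proanthocyanins+
+               Color.intensity+OD.Ratio+Proline,data=wine[-1])
> summary(wine.lm6)$adj.r.squared    # about 0.575
> summary(wine.gam)$r.sq             # the adjusted R-squared for the gam model, about 0.692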
Applying the plot function to a gam model is often the most useful part of the analysis. The plots produced show how each independent variable was smoothed before fitting, so a straight (or nearly straight) line for a particular variable means a truly linear relationship was found, while deviations from linearity describe the nature of any non-linear relationships that exist. Here are the results of using plot on our gam model. For convenience, I've put all the plots in a single graphic; in practice, you might want to examine each plot separately. I've used the par function to adjust the margins so that the individual plots will be a little larger:
> par(mfrow=c(3,2),mar=c(2,4,1,2)+.1,oma=c(0,0,4,0),xpd=FALSE)
> plot(wine.gam)

Here's the plot:
For the variables Malic.acid and OD.Ratio, the relationships do seem to be linear; this is supported by the fact that gam used only a single degree of freedom to fit these terms. For some of the other variables, it's clear that linear relationships hold only over a limited range of the data. The Alkalinity.ash plot is particularly interesting, but it may indicate oversmoothing.

7  Recursive Partitioning

We've already seen how recursive partitioning can be used for classification, but it can also be used for regression if the dependent variable passed to rpart is not a factor. When used for regression, rpart follows a strategy similar to the one it uses for classification: each variable is tested at all possible split points, looking for a large separation between the dependent variable values on one side of the split and those on the other. As is the case for classification, rpart presents its results as a tree, with the terminal nodes representing the best prediction the model can provide. Here are the results of using recursive partitioning on the wine data frame to predict Alcohol. I'm using a period on the right-hand side of the model to indicate that rpart should consider all of the variables in the data frame (except Cultivar):
> library(rpart)
> wine.rpart = rpart(Alcohol ~ . ,data=wine[-1])
> wine.rpart
n= 178

node), split, n, deviance, yval
      * denotes terminal node

 1) root 178 116.654000 13.00062
   2) Color.intensity< 3.325 50   9.161498 12.13980
     4) Ash>=2.41 13   1.269369 11.86846 *
     5) Ash< 2.41 37   6.598724 12.23514 *
   3) Color.intensity>=3.325 128  55.969350 13.33687
     6) Proline< 900 79  28.974900 13.05013
      12) Color.intensity< 8.315 61  21.586760 12.93197
        24) Proline< 722.5 43  14.291710 12.80209
          48) Malic.acid< 3.1 27   8.142067 12.62889
            96) NF.phenols>=0.33 14   2.388493 12.30929 *
            97) NF.phenols< 0.33 13   2.783477 12.97308 *
          49) Malic.acid>=3.1 16   3.972794 13.09437 *
        25) Proline>=722.5 18   4.837111 13.24222 *
      13) Color.intensity>=8.315 18   3.650294 13.45056 *
     7) Proline>=900 49  10.025970 13.79918
      14) Color.intensity< 4.44 10   0.787410 13.27700 *
      15) Color.intensity>=4.44 39   5.812631 13.93308 *

Since rpart doesn't actually estimate any coefficients, we can't produce a table of hypothesis tests as we did for lm or gam, but we can get a sort of multiple R-squared value by squaring the correlation between the true value of the dependent variable and the value that rpart predicts:
> cor(wine$Alcohol,predict(wine.rpart))^2
[1] 0.7248247

This unadjusted R-squared value is a little higher than the adjusted R-squared value from the gam model.

8  Comparison of the 3 Methods

A very simple way to get an idea of how the three methods compare with each other is to make a plot of predicted versus actual values for the three methods, using a different color for each:
> plot(predict(wine.lm),wine$Alcohol,col='red',xlab='Predicted',ylab='Actual')
> points(predict(wine.gam),wine$Alcohol,col='blue')
> points(predict(wine.rpart),wine$Alcohol,col='green')
> legend('topleft',legend=c('lm','gam','rpart'),col=c('red','blue','green'),pch=1,cex=.8)

Here's the plot:
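The visual comparison can be supplemented with a simple numeric one (a sketch): the squared correlation between the actual and predicted values for each model. Keep in mind that, unlike the adjusted R-squared values reported earlier, these values make no adjustment for model complexity:
> sapply(list(lm=predict(wine.lm),gam=predict(wine.gam),rpart=predict(wine.rpart)),
+        function(pred)cor(wine$Alcohol,pred)^2)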


