Regression Diagnostics

This chapter studies whether regression is an appropriate summary of a given set of bivariate data, and whether the regression line was computed correctly. Problems with regression are generally easier to see by plotting the residuals rather than the original data. The vertical amount by which the line misses a datum is called a residual: it is the error in estimating the value of Y for that datum from its value of X using the regression line. A plot of the residuals against X is called a residual plot.

If the data are heteroscedastic, nonlinearly associated, or have outliers, the regression line is not a good summary of the data, and it is not appropriate to use regression to summarize the data. Heteroscedasticity, nonlinearity and outliers are easier to see in a residual plot than in a scatterplot of the raw data. There can be errors of arithmetic in calculating the regression line, so that the slope or intercept is wrong. It is easy to catch such errors by looking at residual plots, where they show up as a nonzero mean or a trend.

Residual plots

A residual is the vertical difference between the Y value of an individual and the regression line at the value of X corresponding to that individual, for regressing Y on X. That is, suppose there are n pairs of measurements of X and Y:

(x1, y1), (x2, y2), … , (xn, yn),

and that the equation of the regression line is

y = ax + b.

The vertical residual e1 for the first datum is

e1 = y1 − (ax1 + b).

The vertical residual for the second datum is

e2 = y2 − (ax2 + b),

and so on. The ith vertical residual is the amount by which the regression line at the ith value of X misses the ith value of Y—the error in using the regression line to estimate the ith datum. (For regressing X on Y, a residual is the horizontal difference between the X value of the individual and the regression line at the value of Y corresponding to that individual.)
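The definition above can be sketched in a few lines of Python. This is only an illustration: it assumes NumPy is available, takes the regression line to be the least-squares line through the point of averages, and uses made-up data. The function names `regression_line` and `vertical_residuals` are hypothetical, chosen for this example.

```python
import numpy as np

def regression_line(x, y):
    """Return slope a and intercept b of the least-squares line for regressing y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: sum of products of deviations divided by sum of squared x deviations.
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # The line passes through the point of averages (mean of x, mean of y).
    b = y.mean() - a * x.mean()
    return a, b

def vertical_residuals(x, y, a, b):
    """Return e_i = y_i - (a*x_i + b) for each datum."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return y - (a * x + b)

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_line(x, y)
e = vertical_residuals(x, y, a, b)

# The pairs (x_i, e_i) are the points one would draw in a residual plot.
for xi, ei in zip(x, e):
    print(f"x = {xi:.1f}, residual = {ei:+.3f}")
```

Each printed pair is one point of the residual plot described next; a positive residual means the line falls below the datum, a negative one means it falls above.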

Plotting the residuals as a function of the "independent" variable (the one that is being regressed upon) can tell a great deal about the regression, including whether it was appropriate to use linear regression in the first place, and whether the regression was computed correctly. A plot of the residuals against the corresponding values of the independent variable is called a residual plot. It is a scatterplot of the n points

(x1, e1), (x2, e2), … , (xn, en).

A residual plot is like a scatterplot of the original data, but with (axi + b) subtracted from the value of yi for each point (xi, yi), i = 1, 2, … , n. Subtracting the regression line from the data removes from Y any overall average and any trend with X.

The figure shows the GMAT data again, this time with the residual plots displayed instead of the original data. (If the figure does not show Quantitative GMAT versus Verbal GMAT, select those variables from the drop-down menus.)

Clicking Plot Residuals will toggle the display back to a scatterplot of the data. Clicking Plot Residuals again will change the display back to the residual plot. Notice that for the residual plot for quantitative GMAT versus verbal GMAT, there is (slight) heteroscedasticity: the scatter in the residuals for small values of verbal GMAT (the range 12–22) is a bit larger than the scatter of the residuals for larger values of verbal GMAT.

Reading Residual Plots

Regression is a poor summary of data that have heteroscedasticity, nonlinear association, or outliers. These are easier to see in a residual plot than in a scatterplot of the original data. The next figure is the residual plot for more severely heteroscedastic data:

The heteroscedasticity is clearly evident: the vertical scatter is quite different in different vertical strips, large in some slices and small in others. Click Plot Data in the figure to display a scatterplot of the raw data. You will see that the heteroscedasticity, while still visible, is less pronounced than in the residual plot. (Click Plot Residuals to show the residual plot again.)

Note that in both of the preceding residual plots there is no trend in the residuals: the horizontal (X) axis divides the residuals into two roughly symmetrical parts. This is because the data are linearly associated. In contrast, the next figure is a residual plot for data with nonlinear association:

Recall that two variables have nonlinear association if their scatterplot shows a curved pattern. The residuals have a nonlinear association with X if and only if the original observations of Y have a nonlinear association with X, but it is easier to see the nonlinear association in the residual plot than in the scatterplot, just as it is easier to see heteroscedasticity in the residual plot. The next figure is the residual plot for data that are both nonlinearly associated and heteroscedastic:

In this plot, the residuals are scattered asymmetrically around the x axis: they show a systematic sinuous pattern characteristic of nonlinear association. In some ranges of X, all the residuals are below the x axis (negative), while in other ranges, all the residuals are above the x axis (positive). Nonlinear association between the variables shows up in a residual plot as a systematic pattern. Toggle back and forth between plotting the data and plotting the residuals in the figure. Note that the nonlinearity and heteroscedasticity are easier to see in the residual plots.

Similarly, a residual plot shows outliers if and only if the original data contain outliers, but—as with heteroscedasticity and nonlinearity—it is easier to see outliers in the residual plot than in a scatterplot of the original data. In summary:

Residual plots make some aspects of the data easier to see

Residuals have heteroscedasticity, nonlinearity, or outliers if and only if the original data do too.

It is easier to see heteroscedasticity, nonlinearity, and outliers in a residual plot than in a scatterplot of the original data.

Heteroscedasticity shows up in a residual plot as a difference in the scatter of the residuals for different ranges of values of the independent variable.

Nonlinearity shows up in a residual plot as a tendency for the residuals to be predominantly positive for some ranges of values of the independent variable and predominantly negative for other ranges.

Outliers show up in a residual plot as unusually large positive or negative values.
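The three visual cues in this summary can also be put into rough numerical form. The sketch below is one possible way to do so, not a standard test: it splits the residuals into vertical strips by X and reports the mean and SD of the residuals in each strip, and it flags residuals that are large compared with the overall SD. The helper names, the number of strips, the outlier threshold, and all the residual values are assumptions made for illustration.

```python
import numpy as np

def strip_summaries(x, residuals, n_strips=3):
    """Split the residuals into vertical strips of roughly equal counts,
    ordered by x, and return a (mean, SD) pair for each strip."""
    x = np.asarray(x, dtype=float)
    e = np.asarray(residuals, dtype=float)
    order = np.argsort(x)                      # sort the residuals by x
    strips = np.array_split(e[order], n_strips)
    return [(float(s.mean()), float(s.std())) for s in strips]

def flag_outliers(residuals, threshold=3.0):
    """Return indices of residuals larger in size than threshold * SD.
    The threshold is an arbitrary illustrative choice."""
    e = np.asarray(residuals, dtype=float)
    return np.flatnonzero(np.abs(e) > threshold * e.std())

# Hypothetical residuals, made up for illustration only.
x = np.arange(1.0, 13.0)

# Scatter grows with x: the strip SDs differ (heteroscedasticity).
e_hetero = np.array([0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0])

# Positive, then negative, then positive: the strip means change sign (nonlinearity).
e_curved = np.array([1.0, 1.0, 1.0, 1.0, -2.0, -2.0, -2.0, -2.0, 1.0, 1.0, 1.0, 1.0])

# One residual far larger than the rest (outlier).
e_outlier = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, 5.0])

print("strip (mean, SD), heteroscedastic:", strip_summaries(x, e_hetero))
print("strip (mean, SD), curved:", strip_summaries(x, e_curved))
print("outlier indices:", flag_outliers(e_outlier))
```

Strip SDs that differ markedly suggest heteroscedasticity, strip means that change sign systematically suggest nonlinearity, and flagged indices point to possible outliers. These numbers supplement the residual plot; they do not replace looking at it.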

We have seen that residual plots can help identify heteroscedasticity, nonlinearity and outliers. They can also help identify computational errors in calculating the regression line. At issue above was whether regression is a good summary of the data: whether regression is appropriate. At issue here, in contrast, is whether the slope and intercept of the regression line were computed correctly.

If the regression line is computed correctly, the vertical residuals from the regression line average to zero and the correlation between the vertical residuals and X is zero. If the average of the residuals is not zero or the correlation between the residuals and X is not zero (i.e., if the residual plot shows a trend), there was a computational error. For example, the plots in the following figure could not be residual plots for correctly computed regression lines.

 

If the residuals have a trend, the slope of the regression line was computed incorrectly. If the residuals do not have a trend but their average is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and their average is not zero, then the slope of the regression line was computed incorrectly, and the intercept of the regression line might also have been computed incorrectly.
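These rules can be turned into a small check of a claimed slope and intercept. The sketch below is illustrative only: `diagnose_line` is a hypothetical helper, the tolerance is an arbitrary choice to absorb rounding error, the data are made up, and NumPy's `polyfit` is used simply to obtain a correct least-squares line for comparison.

```python
import numpy as np

def diagnose_line(x, y, a, b, tol=1e-8):
    """Check a claimed regression line y = a*x + b against the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    e = y - (a * x + b)                        # vertical residuals
    mean_e = e.mean()
    # A trend means the residuals co-vary with x. The covariance is used
    # rather than the correlation so the check is still defined when all
    # the residuals are equal.
    cov_ex = np.mean((x - x.mean()) * (e - mean_e))
    has_trend = abs(cov_ex) > tol
    nonzero_mean = abs(mean_e) > tol
    if has_trend and nonzero_mean:
        return "slope wrong; intercept might also be wrong"
    if has_trend:
        return "slope wrong"
    if nonzero_mean:
        return "intercept wrong"
    return "consistent with a correctly computed line"

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = np.polyfit(x, y, 1)                     # least-squares slope and intercept

print(diagnose_line(x, y, a, b))               # correct line
print(diagnose_line(x, y, a, b + 1.0))         # intercept deliberately wrong
print(diagnose_line(x, y, a + 0.5, b))         # slope deliberately wrong
print(diagnose_line(x, y, a + 0.5, b - 1.5))   # wrong slope, residual mean still zero
```

In the last call the intercept is shifted by 0.5 times the mean of x (which is 3), so the residuals still average to zero even though the slope is wrong; only the trend gives the error away.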

If the regression line is computed correctly, then if we treat the residuals themselves as data and compute another regression line for them—that is, if we regress the residuals against X—that regression line will coincide with the x axis (a horizontal line at height y = 0).
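A quick numerical check of this claim, under the same assumptions as any small demonstration (made-up data, NumPy's `polyfit` standing in for a correct regression computation): regressing the residuals on X should give a slope and an intercept that are both zero up to rounding error.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = np.polyfit(x, y, 1)        # regression line for y on x
e = y - (a * x + b)               # vertical residuals

# Treat the residuals as data and regress them on x.
a_e, b_e = np.polyfit(x, e, 1)

print(f"slope of residual regression:     {a_e:.2e}")
print(f"intercept of residual regression: {b_e:.2e}")
```

Both printed values should be tiny (rounding error only), which corresponds to a regression line lying along the x axis.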

In summary:

Residual plots can reveal computational errors

A residual plot shows at a glance whether the regression line was computed correctly.

If the regression line was computed correctly, the point of averages of the residual plot will be on the x axis, and the residuals will not have a trend: the correlation coefficient for the residuals and X will be zero.

If the residuals have a trend, the slope of the regression line was miscalculated.

If the residuals do not have a trend, but the average of the residuals is not zero, the intercept of the regression line was miscalculated.

If the residuals have a trend and the average of the residuals is not zero, the slope of the regression line was miscalculated, and the intercept of the regression line might also have been miscalculated.

The following exercise tests your ability to read residual plots for evidence of heteroscedasticity, nonlinearity, or outliers, and to determine whether the regression line was computed correctly and appropriately.

Summary

The residuals from a regression line are the values of the dependent variable Y minus the estimates of their values using the regression line and the independent variable X. If the ith datum is (xi, yi) and the equation of the regression line is y = ax + b, then the ith residual is

ei = yi − (axi + b).

A residual plot is a scatterplot of the residuals versus their corresponding values of X, that is, a plot of the n points

(xi, ei), i = 1, … , n.

A residual plot shows heteroscedasticity, nonlinear association, or outliers if and only if the original scatterplot does, but it is easier to see these qualitative features of bivariate data in the residual plot than in the scatterplot of the original data.

If the regression line is computed correctly, the correlation coefficient between the residuals and the independent variable is zero—the residuals do not have a trend with X—and the average of the residuals is zero. If the residuals have a trend, the slope of the regression line was computed incorrectly. If the residuals do not have a trend but the mean of the residuals is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and the mean of the residuals is not zero, the slope of the regression line was computed incorrectly, and the intercept of the regression line might or might not have been computed correctly.

Key Terms