This chapter studies whether regression is an appropriate summary of a given set of bivariate data, and whether the regression line was computed correctly. Problems with regression are generally easier to see by plotting the residuals than by plotting the original data. The vertical amount by which the line misses a datum is called a residual: it is the error in estimating the value of Y for that datum from its value of X using the regression line. A plot of the residuals against X is called a residual plot.

If the data are heteroscedastic, nonlinearly associated, or have outliers, the regression line is not a good summary of the data, and it is not appropriate to use regression to summarize the data. Heteroscedasticity, nonlinearity and outliers are easier to see in a residual plot than in a scatterplot of the raw data. There can be errors of arithmetic in calculating the regression line, so that the slope or intercept is wrong. It is easy to catch such errors by looking at residual plots, where they show up as a nonzero mean or a trend.

A residual is the vertical difference between the Y value of an individual and the regression line at the value of X corresponding to that individual, for regressing Y on X. That is, suppose there are n pairs of measurements of X and Y:

(x_{1}, y_{1}), (x_{2}, y_{2}), … , (x_{n}, y_{n}),

and that the equation of the regression line is

y = ax + b.

The vertical residual e_{1} for the first datum is

e_{1} = y_{1} − (ax_{1} + b).

The vertical residual for the second datum is

e_{2} = y_{2} − (ax_{2} + b),

and so on.
The i^{th} vertical residual is the amount by which the regression line at the i^{th} value of X misses the i^{th} value of Y, that is, the error in using the regression line to estimate the i^{th} datum. (For regressing X on Y, a residual is the horizontal difference between the X value of the individual and the regression line at the value of Y corresponding to that individual.)
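
The definition above can be sketched in a few lines of Python. This is only an illustration: the data and the line y = 2x + 0 are made up and are not from the chapter, and the function name `residuals` is hypothetical.

```python
def residuals(xs, ys, a, b):
    """Vertical residuals e_i = y_i - (a*x_i + b) for the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Made-up illustrative data (an assumption, not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

# A hypothetical line, chosen only so the arithmetic is easy to follow.
a, b = 2.0, 0.0

es = residuals(xs, ys, a, b)
print(es)  # [0.0, 0.0, -1.0, 0.0]
```

Only the third datum is missed by this line: the line predicts 6 at x = 3, the observed value is 5, so the residual is −1.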

Plotting the residuals as a function of the "independent" variable (the one that is being regressed upon) can tell a great deal about the regression, including whether it was appropriate to use linear regression in the first place, and whether the regression was computed correctly. A plot of the residuals against the corresponding values of the independent variable is called a residual plot. It is a scatterplot of the n points

(x_{1}, e_{1}), (x_{2}, e_{2}), … , (x_{n}, e_{n}).

A residual plot is like a scatterplot of the original data, but with (ax_{i} + b) subtracted from the value of y_{i} for each point (x_{i}, y_{i}), i = 1, 2, … , n. Subtracting the regression line from the data removes from Y any overall average and any trend with X.
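
A minimal sketch of how the points of a residual plot might be built, assuming the regression line is the usual least-squares line for regressing Y on X. The data are made up, and `fit_line` is a hypothetical helper, not something defined in the chapter.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for regressing Y on X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Made-up illustrative data (an assumption, not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

a, b = fit_line(xs, ys)  # slope about 0.8, intercept about 0.6

# The residual plot is the scatterplot of these (x_i, e_i) pairs.
points = [(x, y - (a * x + b)) for x, y in zip(xs, ys)]

# Subtracting the line removes the overall average from Y:
mean_e = sum(e for _, e in points) / len(points)
print(points)
print(mean_e)  # zero up to floating-point rounding
```

The list `points` could be handed to any plotting tool; the sketch stops at computing it so that it depends only on the standard library.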

In the residual plot for quantitative GMAT versus verbal GMAT, there is (slight) heteroscedasticity: the scatter in the residuals for small values of verbal GMAT (the range 12–22) is a bit larger than the scatter of the residuals for larger values of verbal GMAT.

Regression is a poor summary of data that have heteroscedasticity, nonlinear association, or outliers.
These are easier to see in a residual plot than in a scatterplot of the original data.

The heteroscedasticity is clearly evident—the vertical scatter is quite different
in different vertical strips, large in some slices and small in others.

Recall that two variables have nonlinear association if their scatterplot shows a curved pattern. The residuals have a nonlinear association with X if and only if the original observations of Y have a nonlinear association with X, but it is easier to see the nonlinear association in the residual plot than in the scatterplot, just as it is easier to see heteroscedasticity in the residual plot.

Similarly, a residual plot shows outliers if and only if the original data contain outliers, but—as with heteroscedasticity and nonlinearity—it is easier to see outliers in the residual plot than in a scatterplot of the original data. In summary:

Residual plots make some aspects of the data easier to see

Residuals have heteroscedasticity, nonlinearity, or outliers if and only if the original data do too.

It is easier to see heteroscedasticity, nonlinearity, and outliers in a residual plot than in a scatterplot of the original data.

Heteroscedasticity shows up in a residual plot as a difference in the scatter of the residuals for different ranges of values of the independent variable.

Nonlinearity shows up in a residual plot as a tendency for the residuals to be predominantly positive for some ranges of values of the independent variable and predominantly negative for other ranges.

Outliers show up in a residual plot as unusually large positive or negative values.
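
The chapter reads these features off the plot by eye. One way to put rough numbers on the same idea is to cut the range of the independent variable into vertical strips and summarize the residuals in each strip: very different SDs across strips suggest heteroscedasticity, and strip means that are mostly positive in some strips and mostly negative in others suggest nonlinearity. The sketch below is an assumption-laden illustration, with made-up residuals, arbitrary strip edges, and a hypothetical function name `strip_summary`.

```python
import math

def strip_summary(xs, es, edges):
    """Summarize residuals in vertical strips.

    Strip k holds the points with edges[k] <= x < edges[k+1].
    Returns one (count, mean, SD) tuple per strip; mean and SD are None
    for an empty strip.
    """
    summary = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        strip = [e for x, e in zip(xs, es) if lo <= x < hi]
        if not strip:
            summary.append((0, None, None))
            continue
        m = sum(strip) / len(strip)
        sd = math.sqrt(sum((e - m) ** 2 for e in strip) / len(strip))
        summary.append((len(strip), m, sd))
    return summary

# Made-up residuals: large scatter for small x, small scatter for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
es = [3.0, -3.0, 3.0, -3.0, 0.5, -0.5, 0.5, -0.5]

for count, m, sd in strip_summary(xs, es, edges=[1, 5, 9]):
    print(count, m, sd)
```

For these made-up numbers the SD of the residuals is 3 in the first strip and 0.5 in the second, while both strip means are 0: a difference in scatter without any curvature. How large a difference counts as heteroscedasticity is a judgment call that this sketch does not settle.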

We have seen that residual plots can help identify heteroscedasticity, nonlinearity and outliers. They can also help identify computational errors in calculating the regression line. At issue above was whether regression is a good summary of the data: whether regression is *appropriate*. At issue here, in contrast, is whether the slope and intercept of the regression line were *computed correctly*.

If the regression line is computed correctly, the vertical residuals from the regression line average to zero and the correlation between the vertical residuals and X is zero.
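
A short derivation shows why both facts hold, assuming the regression line is the least-squares line, whose slope and intercept are given in the first line below. The bars denote averages over the n data; this notation is introduced here for the derivation only.

```latex
\begin{align*}
a &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
b = \bar{y} - a\bar{x}, \\[4pt]
\bar{e} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (a x_i + b)\bigr)
        = \bar{y} - a\bar{x} - b = 0, \\[4pt]
\sum_{i=1}^{n}(x_i-\bar{x})\,e_i
  &= \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
     - a\sum_{i=1}^{n}(x_i-\bar{x})^2 = 0 .
\end{align*}
```

The last line uses e_{i} = (y_{i} − ȳ) − a(x_{i} − x̄), which follows from substituting b = ȳ − ax̄ into the definition of the residual. Because the average residual is zero, that sum is the numerator of the correlation coefficient between the residuals and X, so the correlation is zero.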

If the residuals have a trend, the slope of the regression line was computed incorrectly. If the residuals do not have a trend but their average is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and their average is not zero, then the slope of the regression line was computed incorrectly, and the intercept of the regression line might also have been computed incorrectly.

If the regression line is computed correctly, then if we treat the residuals themselves as data and compute another regression line for *them* (that is, if we regress the *residuals* against X), *that* regression line will coincide with the x axis (a horizontal line at height y = 0).
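
This can be checked numerically. The sketch below fits a least-squares line to made-up data, then fits a second line to the residuals; `fit_line` is a hypothetical helper and the data are an assumption, not taken from the chapter.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for regressing Y on X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

# Made-up illustrative data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 6.0]

a, b = fit_line(xs, ys)                      # about 1.4 and 0.9
es = [y - (a * x + b) for x, y in zip(xs, ys)]

# Regress the residuals against X: slope and intercept should both vanish.
a_res, b_res = fit_line(xs, es)
print(a_res, b_res)  # both zero up to floating-point rounding
```

In floating-point arithmetic the second slope and intercept come out as tiny numbers rather than exact zeros, so any check in practice needs a small tolerance.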

In summary:

Residual plots can reveal computational errors

A residual plot shows at a glance whether the regression line was computed correctly.

If the regression line was computed correctly, the point of averages of the residual plot will be on the x axis, and the residuals will not have a trend: the correlation coefficient for the residuals and X will be zero.

If the residuals have a trend, the slope of the regression line was miscalculated.

If the residuals do not have a trend, but the average of the residuals is not zero, the intercept of the regression line was miscalculated.

If the residuals have a trend and the average of the residuals is not zero, the slope of the regression line was miscalculated, and the intercept of the regression line might also have been miscalculated.
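
The rules in this box can be turned into a rough numerical check. The sketch below is one possible reading, not a prescribed procedure: it measures "trend" by the least-squares slope of the residuals against X, uses an arbitrary tolerance `tol` to decide what counts as zero, and runs on made-up data. The names `fit_line` and `diagnose` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for regressing Y on X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

def diagnose(xs, ys, a, b, tol=1e-9):
    """Apply the residual-plot rules to a claimed line y = a*x + b."""
    es = [y - (a * x + b) for x, y in zip(xs, ys)]
    trend, _ = fit_line(xs, es)           # slope of residuals against X
    mean_e = sum(es) / len(es)
    has_trend = abs(trend) > tol          # tol is an arbitrary assumption
    nonzero_mean = abs(mean_e) > tol
    if has_trend and nonzero_mean:
        return "slope wrong; intercept may also be wrong"
    if has_trend:
        return "slope wrong"
    if nonzero_mean:
        return "intercept wrong"
    return "no error detected"

# Made-up illustrative data and its least-squares line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 6.0]
a, b = fit_line(xs, ys)

print(diagnose(xs, ys, a, b))        # no error detected
print(diagnose(xs, ys, a, b + 1.0))  # intercept wrong
print(diagnose(xs, ys, a + 0.5, b))  # slope wrong; intercept may also be wrong
```

The third call is a case where only the slope was changed, yet the residuals have both a trend and a nonzero average. That is why the rule says only that the intercept *might* also have been miscalculated.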

The residuals from a regression line are the values of the dependent variable Y minus the estimates of their values using the regression line and the independent variable X. If the i^{th} datum is (x_{i}, y_{i}) and the equation of the regression line is y = ax + b, then the i^{th} residual is

e_{i} = y_{i} − ( ax_{i}+b).

A residual plot is a scatterplot of the residuals versus their corresponding values of X, that is, a plot of the n points

(x_{i}, e_{i}), i = 1, … , n.

A residual plot shows heteroscedasticity, nonlinear association, or outliers if and only if the original scatterplot does, but it is easier to see these qualitative features of bivariate data in the residual plot than in the scatterplot of the original data.

If the regression line is computed correctly, the correlation coefficient between the residuals and the independent variable is zero—the residuals do not have a trend with X—and the average of the residuals is zero. If the residuals have a trend, the slope of the regression line was computed incorrectly. If the residuals do not have a trend but the mean of the residuals is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and the mean of the residuals is not zero, the slope of the regression line was computed incorrectly, and the intercept of the regression line might or might not have been computed correctly.

- correlation coefficient
- dependent variable
- football-shaped
- graph of averages
- heteroscedasticity
- histogram
- homoscedastic
- independent variable
- mean
- nonlinear
- nonlinearity
- outlier
- percentile
- regression line
- residual
- residual plot
- rms
- scatterplot
- SD
- slice
- variable