This chapter addresses four types of error relating to regression: error in estimating the dependent variable from the independent variable using regression, error of summarizing a scatterplot by the regression line when regression is not appropriate, error in calculating the slope and intercept of the regression line, and error in thinking about regression. Unless the correlation coefficient r is ±1, the regression line does not pass through all the data. The vertical amount by which the line misses a datum is called a residual—it is the error in estimating the value of Y for that datum from its value of X using the regression line. The residuals from the regression line are telling: They reveal qualitative information, such as whether regression is a reasonable summary of the scatterplot and whether the regression line was computed correctly, and quantitative information, such as the average error of estimating Y by the regression line.
For football-shaped scatterplots with r not equal to ±1, the average value of Y for individuals whose value of X is in a narrow range tends to be fewer standard deviations (SDs) from the mean than the values of X, so the regression line estimates the value of Y for such individuals to be fewer SDs from the mean than their values of X are. This is called the regression effect, or regression towards the mean. Failing to take the regression effect into account is called the regression fallacy.
A residual is the vertical difference between the Y value of an individual and the regression line at the value of X corresponding to that individual, for regressing Y on X. That is, suppose there are n pairs of measurements of X and Y:
(x1, y1), (x2, y2), … , (xn, yn),
and that the equation of the regression line is
y = a × x + b.
The vertical residual of the first datum is
e1 = y1 - (a × x1 + b).
The vertical residual of the second datum is
e2 = y2 - (a × x2 + b),
and so on. The ith vertical residual is the amount by which the regression line at the ith value of X misses the ith value of Y—the error in using the regression line to estimate the ith datum. A residual plot is a scatterplot of the residuals versus the corresponding values of X, that is, a scatterplot of the n points
(x1, e1), (x2, e2), … , (xn, en).
A residual plot is like a scatterplot of the original data, but with (a×xi + b) subtracted from the value of yi for each point (xi, yi), i = 1, 2, … , n. Subtracting the regression line from the data removes from Y any overall average and any trend with X. We shall see presently that residual plots can tell us a great deal about the regression of Y on X, including whether the regression line is an appropriate summary of the data, and whether the regression line was computed correctly.
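As a minimal sketch of these definitions, the following Python computes a regression line and its vertical residuals for a small data set. The data and the function names `regression_line` and `residuals` are made up for illustration, and the SDs are computed with divisor n, which is an assumption about the convention in use.

```python
from statistics import mean, pstdev

def regression_line(xs, ys):
    """Return slope a and intercept b of the regression line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sdx, sdy = pstdev(xs), pstdev(ys)   # SDs with divisor n (assumed convention)
    # Correlation coefficient: average product of the values in standard units
    r = mean(((x - mx) / sdx) * ((y - my) / sdy) for x, y in zip(xs, ys))
    a = r * sdy / sdx                   # slope
    b = my - a * mx                     # line passes through the point of averages
    return a, b

def residuals(xs, ys, a, b):
    """Vertical residuals e_i = y_i - (a*x_i + b)."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a, b = regression_line(xs, ys)          # slope 0.8, intercept 0.6 (up to rounding)
es = residuals(xs, ys, a, b)

# The residual plot is the scatterplot of these n points (x_i, e_i)
residual_plot_points = list(zip(xs, es))
```

Plotting `residual_plot_points` with any plotting tool gives the residual plot for these data.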
For regressing X on Y, a residual is the horizontal difference between the X value of the individual and the regression line at the value of Y corresponding to that individual. Plotting the residuals as a function of the "independent" variable (the one that is being regressed upon) can tell a great deal about the regression, including whether it was appropriate to use linear regression in the first place, and whether the regression was computed correctly. A plot of the residuals against the corresponding values of the independent variable is called a residual plot.
The following figure shows the GMAT data again, this time with the residual plots displayed instead of the original data. (If the figure does not show Quantitative GMAT versus Verbal GMAT, select those variables from the drop-down menus.)
Clicking the "Plot Data" button will toggle the display back to a scatterplot of the data; that button will then say "Plot Residuals." Clicking the button again will change the display back to the residual plot. Notice that for the residual plot for quantitative GMAT versus verbal GMAT, there is (slight) heteroscedasticity: the scatter in the residuals for small values of verbal GMAT (the range 12-22) is a bit larger than the scatter of the residuals for larger values of verbal GMAT.
Nonlinearity, heteroscedasticity, and outliers, as well as mistakes in computing the regression line, are easier to see in a residual plot than in a plot of the original data. The following figure is the residual plot for more severely heteroscedastic data:
The heteroscedasticity is clearly evident: the scatter in vertical strips is quite different in different vertical strips. Note that for both of these plots, there is no trend in the residuals: the horizontal (X) axis divides the residuals into two roughly symmetrical parts. This is because the data are linearly associated. In contrast, the following figure is the residual plot for some data with nonlinear association:
Recall that two variables have nonlinear association if their scatterplot shows a curved pattern. The residuals have a nonlinear association with X if and only if the original observations of Y have a nonlinear association with X, but it is easier to see the nonlinear association in the residual plot than in the scatterplot, just as it is easier to see heteroscedasticity in the residual plot. The following figure is the residual plot for data that are both nonlinearly associated and heteroscedastic:
In the plot, the residuals are scattered asymmetrically around the x axis: They show a systematic sinuous pattern characteristic of nonlinear association. In some ranges of X, all the residuals are below the x axis (negative), while in other ranges, all the residuals are above the x axis (positive). Nonlinear association between the variables shows up in a residual plot as a systematic pattern.
A residual plot shows outliers if and only if the original data contain outliers, but, as is the case with heteroscedasticity and nonlinearity, it is easier to see outliers in the residual plot than in a scatterplot of the original data.
Residual plots make some aspects of the data easier to see
Residuals have heteroscedasticity, nonlinearity, or outliers only if the original data do too.
It is easier to see heteroscedasticity, nonlinearity, and outliers in a residual plot than in a scatterplot of the original data.
Heteroscedasticity shows up in a residual plot as a difference in the scatter of the residuals for different ranges of values of the independent variable. Nonlinearity shows up in a residual plot as a tendency for the residuals to be predominantly of one sign for ranges of values of the independent variable. Outliers show up in a residual plot as unusually large values.
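To illustrate how nonlinearity appears as runs of residuals with the same sign, here is a small sketch that fits a regression line to made-up data following a curve (y = x squared) and prints the sign of each residual in order of increasing x. The data are hypothetical; the slope is computed with the least-squares formula, which gives the same line as r×SDY/SDX.

```python
from statistics import mean

# Hypothetical data with a curved (quadratic) pattern
xs = list(range(11))                 # 0, 1, ..., 10
ys = [x ** 2 for x in xs]

mx, my = mean(xs), mean(ys)
# Least-squares slope and intercept of the regression line
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Vertical residuals, in order of increasing x
es = [y - (a * x + b) for x, y in zip(xs, ys)]

# One character per residual: "+" above the x axis, "-" below it
signs = "".join("+" if e > 0 else "-" for e in es)
print(signs)   # prints ++-------++
```

The residuals are positive at both ends of the range of x and negative in the middle, which is the systematic pattern that signals nonlinear association.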
Recall that the regression line is the line for which the rms of the vertical residuals is smallest. Just as the deviations from the mean average to zero, the vertical residuals from the regression line must average to zero. Furthermore, the correlation coefficient between the vertical residuals and X must be zero: the residuals cannot have a trend. If the average of the residuals is not zero, or if there is a trend in the residuals, something went wrong in computing the regression line. If the residuals have a trend, i.e., if the correlation coefficient between the residuals and X is not zero (if, overall, the plot slopes up or slopes down), the slope of the regression line was computed incorrectly. If the residuals do not have a trend but their average is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and their average is not zero, then the slope of the regression line was computed incorrectly, and the intercept of the regression line might also have been computed incorrectly. If the regression line is calculated correctly and the points in the residual plot
(x1, e1), (x2, e2), … , (xn, en),
were treated as data, the regression line for regressing the residuals against X would be the x-axis: the intercept of the line would be zero, and the slope of the line would be zero.
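The checks described above can be sketched in a few lines of Python. The helper names `correlation` and `residual_summary` and the data are hypothetical; for these data the correctly computed line has slope 0.8 and intercept 0.6, and the two "wrong" lines are deliberate mistakes chosen for illustration.

```python
from statistics import mean, pstdev

def correlation(us, vs):
    """Correlation coefficient, using SDs with divisor n."""
    mu, mv = mean(us), mean(vs)
    su, sv = pstdev(us), pstdev(vs)
    return mean((u - mu) * (v - mv) for u, v in zip(us, vs)) / (su * sv)

def residual_summary(xs, ys, a, b):
    """Return (mean of residuals, correlation of residuals with X) for y = a*x + b."""
    es = [y - (a * x + b) for x, y in zip(xs, ys)]
    return mean(es), correlation(xs, es)

# Hypothetical data; the correct regression line is y = 0.8*x + 0.6
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

ok = residual_summary(xs, ys, 0.8, 0.6)             # mean about 0, correlation about 0
bad_slope = residual_summary(xs, ys, 0.5, 0.6)      # residuals have a trend
bad_intercept = residual_summary(xs, ys, 0.8, 1.0)  # no trend, but mean is not 0

for label, (avg, r) in [("correct", ok), ("wrong slope", bad_slope),
                        ("wrong intercept", bad_intercept)]:
    print(f"{label}: mean residual = {avg:.3f}, correlation with X = {r:.3f}")
```

With the wrong slope, the residuals are correlated with X; with only the intercept wrong, the residuals are uncorrelated with X but their average is shifted away from zero.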
Residual plots can reveal computational errors
A residual plot shows at a glance whether the regression line was computed correctly.
If the regression line was computed correctly, the point of averages of the residual plot will be on the x-axis, and the residuals will not have a trend: the correlation coefficient for the residuals and X will be zero.
If the residuals have a trend, the slope of the regression line was miscalculated.
If the residuals do not have a trend, but the average of the residuals is not zero, the intercept of the regression line was miscalculated.
If the residuals have a trend and the average of the residuals is not zero, the slope of the regression line was miscalculated, and the intercept of the regression might also have been miscalculated.
At issue is not whether regression is appropriate, merely whether the regression line was computed correctly. It is easier to see from the residual plot than from the scatterplot of the original data and the regression line whether the average of the residuals is zero, and whether the residuals have a trend. The regression line for a residual plot should coincide with the x-axis (a horizontal line at height y = 0). If not, something is wrong. The residual plot is thus a helpful regression diagnostic: If the residual plot shows a trend, or shows that the average of the residuals is not zero, something went wrong in computing the regression line. For example, the plots in the following figure could not be the residual plots for correctly computed regression lines.
The residual plot lets us see at a glance whether the regression line is an appropriate summary of the data—whether the data show heteroscedasticity, nonlinearity, or outliers—and whether the regression line was computed correctly. The following exercise tests your ability to read residual plots.
The regression line does not pass through all the data points on the scatterplot exactly unless the correlation coefficient is ±1. In general, the data are scattered around the regression line. Each datum will have a vertical residual from the regression line; the sizes of the vertical residuals will vary from datum to datum. What is the typical vertical distance of a datum from the regression line?
Recall that the rms is a measure of the typical size of elements in a list. Thus the rms of the vertical residuals is a measure of the typical vertical distance from the data to the regression line, that is, the typical error in estimating the value of Y by the height of the regression line. It turns out that the rms of the vertical residuals from the regression line (the rms error of regression) is
√(1 - r²) × SDY.
(This equation is derived in the footnotes.) The rms error of regression is always between 0 (which occurs when r is ±1) and SDY (which occurs when r = 0). When r is not zero, the regression line accounts for some of the variability of Y, so the scatter around the regression line is less than the overall scatter in Y. When r is ±1, the regression line accounts for all of the variability of Y, and the rms of the vertical residuals is zero. When r = 0, the regression line does not "explain" any of the variability of Y: The regression line is a horizontal line at height mean(Y), so the rms of the vertical residuals from the regression line is the rms of the deviations of the values of Y from the mean of Y, which is, by definition, the SD of Y.
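As a numerical check of the formula for the rms error of regression, the sketch below computes the rms of the vertical residuals directly and compares it with the square root of (1 - r²) times the SD of Y. The data are hypothetical, and SDs use divisor n (an assumed convention); the identity is exact under that convention.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

mx, my = mean(xs), mean(ys)
sdx, sdy = pstdev(xs), pstdev(ys)          # SDs with divisor n
r = mean((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sdx * sdy)

# Regression line and its vertical residuals
a = r * sdy / sdx
b = my - a * mx
es = [y - (a * x + b) for x, y in zip(xs, ys)]

rms_direct = sqrt(mean(e ** 2 for e in es))   # rms of the residuals
rms_formula = sqrt(1 - r ** 2) * sdy          # rms error of regression from r and SD of Y

print(rms_direct, rms_formula)                # the two values agree up to rounding
```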
If the scatterplot is homoscedastic and football-shaped, the mean of the values in a thin vertical strip will be about the same as the height of the regression line, and the SD of the values in a vertical strip will be about the same as the rms (vertical) error of regression. Why?
Recall that the regression line is a smoothed version of the graph of averages: The height of the regression line at the point x is an estimate of the average of the values of Y for individuals whose value of X is close to x. If the scatterplot is football shaped, the regression line follows the graph of averages reasonably well: In each vertical slice, the deviations of the values of Y from their mean are approximately the vertical residuals of those values of Y from the regression line. The SD of the values of Y in the slice is thus approximately the rms of the residuals in the slice. Because football-shaped scatterplots are homoscedastic, the SD of the values of Y in every vertical slice is about the same, so the rms error of regression is a reasonable estimate of the scatter of the values of Y in vertical slices through football-shaped scatterplots.
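A small simulation can make this argument concrete. The sketch below generates a football-shaped scatterplot and compares the mean and SD of Y in one thin vertical slice with the height of the regression line and the rms error of regression. Everything about the setup is an assumption made for illustration: the normal random numbers, the correlation of 0.6, the sample size, the seed, and the slice from 0.9 to 1.1.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)                      # fixed seed so the run is repeatable
r_true = 0.6                        # assumed correlation for the simulated data
n = 20000

# Football-shaped (bivariate normal) data in standard units
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [r_true * x + sqrt(1 - r_true ** 2) * random.gauss(0, 1) for x in xs]

mx, my = mean(xs), mean(ys)
sdx, sdy = pstdev(xs), pstdev(ys)
r = mean((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sdx * sdy)
rms_error = sqrt(1 - r ** 2) * sdy  # rms error of regression

# Values of Y in a thin vertical slice centered at x = 1
lo, hi = 0.9, 1.1
in_slice = [y for x, y in zip(xs, ys) if lo <= x <= hi]

line_height = my + (r * sdy / sdx) * (1.0 - mx)   # regression line at x = 1

print("slice mean:", mean(in_slice), " regression line:", line_height)
print("slice SD:  ", pstdev(in_slice), " rms error:      ", rms_error)
```

In runs like this, the slice mean should be close to the height of the regression line and the slice SD close to the rms error of regression, with small differences due to chance.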
If a scatterplot is homoscedastic but shows nonlinear association, the rms error of regression will tend to overestimate the scatter in a typical vertical slice: part of each residual comes from scatter around the average in the slice, and part comes from the difference between the average in the slice and the height of the regression line. If a scatterplot is heteroscedastic but shows linear association, the rms error of regression will overestimate the scatter in some slices and underestimate the scatter in other slices. If a scatterplot has outliers but is otherwise homoscedastic and shows linear association, the rms error of regression will tend to overestimate the scatter in slices. The strength of linear association affects the size of the rms error of regression, but it does not affect whether the rms error of regression is a good estimate of the scatter in vertical slices.
The following exercises check your ability to calculate the rms error of regression and your understanding of its use as a summary. Use the scatterplot to solve the following exercise.
The values of Y corresponding to a given value of X (or a small range of values of X) have a distribution that typically differs from the overall distribution of the values of Y across all values of X. The mean of the values of Y in a slice typically differs from the overall mean of Y, and the SD of the values of Y in a slice typically differs from the overall spread of Y.
The following applet lets us superpose the histogram of a variable for all individuals with the histogram of that variable just for those individuals whose value of that or another variable is within a given range, that is, a slice through the scatterplot. It allows us to look at the histogram of Y values for all individuals in a set of multivariate data, and the histogram of Y values for only those individuals who have X values in a specified range. We shall look at the GMAT data.
We first superposed histograms to study association in Chapter 3, "Multivariate Data." This applet should display the verbal GMAT scores when you first visit this page. If not, select "Verbal" from the Variable drop-down menu. Use the Restrict to drop-down menu to select Quantitative GMAT. The mean of the values of Verbal GMAT scores for just those individuals whose Quantitative GMAT scores are in a restricted range is typically different from the mean of the Verbal GMAT scores for all individuals; the SD of the restricted set of Verbal GMAT scores is also typically different from the overall SD of the Verbal GMAT scores. The SD of the restricted scores is a measure of their spread, and in the case of football-shaped scatterplots, is about the same as the rms error of regression. We can use what we know about univariate distributions to calculate properties of the distribution of values of verbal GMAT corresponding to a given value of quantitative GMAT.
For example, Chebychev's inequality limits the fraction of values that are more than k (restricted) SDs from the (restricted) mean. If the correlation coefficient r is positive and the data are homoscedastic, a given percentile of the distribution of Y for a value of X above the mean of X will be larger than the same percentile of the overall distribution of Y. Similarly, a given percentile of the distribution of Y for a value of X below the mean of X will be smaller than the same percentile of the overall distribution of Y. Restricting attention to individuals with a given value of X that is above the mean of X is looking at a subset of the population that tends to have larger than average values of Y; the scatter of those values will tend to be less than the scatter of Y for the entire population (by the factor √(1 - r²)). Restricting attention to just those individuals with a value of X that is smaller than the mean of X is looking at a subset of the population that tends to have smaller than average values of Y; the scatter of those values will tend to be less than the overall scatter of Y for the entire population. The same thing holds for negative correlation, mutatis mutandis.
In Chapter 5, "Regression," we saw that for football-shaped scatterplots the graph of averages is not as steep as the SD line, unless r = ±1: If 0 ≤ r < 1, the average value of Y for individuals whose values of X are about k×SDX above the mean(X) is less than k×SDY above the mean(Y). Similarly, if -1 < r < 0, the average value of Y for individuals whose values of X are about k×SDX above mean(X) is less than k×SDY below mean(Y).
This phenomenon is called the regression effect or regression towards the mean. Individuals with a given value of X tend to have values of Y that are closer to the mean, where closer means fewer SD away. Consider the IQs of a large group of married couples. Essentially by definition, the average IQ score is 100. The SD of IQ is about 15 points. Suppose that for this group, the correlation between the IQs of spouses is 0.7—women with above average IQ tend to marry men with above average IQ, and vice versa. Consider a woman in the group whose IQ is 150 (genius level). What is our best estimate of her husband's IQ? We shall estimate his IQ using the regression line: Her IQ is 150, which is 50 points above average. 50 points is
(3 1/3) × 15 points = 3 1/3 SD,
so we would estimate the husband's IQ to be r×3 1/3 SD = 0.7×3 1/3 SD above average, or about 2 1/3 SD above average. 2 1/3 SD is 35 points, so we expect the husband's IQ to be about 135, not nearly as "smart" as she is.
Now let's predict the IQ of the wife of a man whose IQ is 135. His IQ is 2 1/3 SD above average, so we expect her IQ to be 0.7×2 1/3 SD above average. That's about 1.63 SD or 1.63×15 = 24½ points above average, or 124½, not as "smart" as he is. How can this be consistent?
The algebra is correct. The phenomenon is quite general. It is called the regression effect. The regression effect is caused by the same thing that makes the slope of the regression line smaller in magnitude than the slope of the SD line. If the scatterplot is football shaped and r is at least zero but less than 1, then
If the scatterplot is football shaped and r is less than zero but greater than -1:
Only if r is ±1 does the regression line estimate the value of Y to be as many SDs from the mean as the value of X is; otherwise, the regression line estimates the value of Y to be fewer SDs from the mean. If r is positive but less than 1, the regression line estimates Y to be above its mean if X is above its mean, but by fewer SDs. If r is negative but greater than -1, the regression line estimates Y to be below its mean if X is above its mean, but by fewer SDs than X is above its mean. This is another way of expressing the regression effect.
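The IQ estimates in the example above can be reproduced with a short calculation. This is only a sketch of the regression estimate in standard units; the function name `regression_estimate` and its parameter names are invented for illustration, and the numbers (mean 100, SD 15, r = 0.7) are those of the example.

```python
def regression_estimate(x, mean_x, sd_x, mean_y, sd_y, r):
    """Estimate Y from X using the regression line, working in standard units."""
    z_x = (x - mean_x) / sd_x        # how many SDs x is from the mean of X
    z_y = r * z_x                    # the line estimates Y to be r times as many SDs out
    return mean_y + z_y * sd_y

MEAN_IQ, SD_IQ, R = 100, 15, 0.7     # values taken from the example

# Wife's IQ is 150: estimate the husband's IQ
husband = regression_estimate(150, MEAN_IQ, SD_IQ, MEAN_IQ, SD_IQ, R)

# Husband's IQ is 135: estimate the wife's IQ
wife = regression_estimate(135, MEAN_IQ, SD_IQ, MEAN_IQ, SD_IQ, R)

print(round(husband, 1))   # about 135
print(round(wife, 1))      # about 124.5
```

Both estimates are on the same side of the mean as the given value, but fewer SDs from the mean, because r is positive and less than 1.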
The regression effect does not say that an individual who is a given number of SD from average in one variable must have a value of the other variable that is closer to average—merely that individuals who are a given number of SD from the mean in one variable tend on average to be fewer SD from the mean in the other.
In most test-retest situations, the correlation between scores on the test and scores on the re-test is positive, so individuals who score much higher than average on one test tend to score above average, but closer to average, on the other test. (In the previous example, "individuals" are couples, the first test is the IQ of one spouse, and the second test is the IQ of the other spouse.) Similarly, individuals who are much lower than average in one variable tend to be closer to average in the other (but still below average). Those who perform best usually do so with a combination of skill (which will be present in the retest) and exceptional luck (which will likely not be so good in a retest). Those who perform worst usually do so as the result of a combination of lack of skill (which still won't be present in a retest) and bad luck (which is likely to be better in a retest). If the scatterplot is football-shaped, many more individuals are near the mean than in the tails. A particularly high score could have come from someone with an even higher true ability, but who had bad luck, or someone with a lower true ability who had good luck. Because more individuals are near average, the second case is more likely; when the second case occurs on a retest, the individual's luck is just as likely to be bad as good, so the individual's second score will tend to be lower. The same argument applies, mutatis mutandis, to the case of a particularly low score on the first test.
The regression effect does not require the second score to be less extreme than the first: nothing prevents an individual from having a score that is even more extreme on the second test. The regression effect describes what happens on the average.
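The skill-plus-luck argument can be illustrated with a simulation. The model below is an assumption made only for this sketch: each score is a fixed "ability" plus independent "luck", both drawn from a normal distribution with mean 0 and SD 1, and "high scorers" are those whose first score exceeds 2. The seed, sample size, and cutoff are arbitrary.

```python
import random
from statistics import mean

random.seed(1)                       # fixed seed so the run is repeatable
n = 20000

# Assumed model: score = ability (same on both tests) + luck (fresh on each test)
ability = [random.gauss(0, 1) for _ in range(n)]
test1 = [a + random.gauss(0, 1) for a in ability]
test2 = [a + random.gauss(0, 1) for a in ability]

# Individuals who scored particularly high on the first test
top = [(s1, s2) for s1, s2 in zip(test1, test2) if s1 > 2.0]

mean1 = mean(s1 for s1, _ in top)    # their average first score
mean2 = mean(s2 for _, s2 in top)    # their average second score

# Some individuals still do even better the second time
more_extreme = sum(1 for s1, s2 in top if s2 > s1)

print("high scorers:", len(top))
print("average first score: ", mean1)
print("average second score:", mean2)
print("scored higher on the retest:", more_extreme)
```

On average, the high scorers remain above the overall mean of 0 on the retest but are closer to it, even though nothing in the model changes between the tests; a minority of them score even higher the second time.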
Failing to account for the regression effect, concluding that something must cause the difference in scores, is called the regression fallacy. The regression fallacy sometimes leads to amusing mental gymnastics and speculation, but can also be pernicious.
Example: Pilot training in the Israeli Airforce. (From Tversky and Kahneman, 1974.) The Israeli Airforce performed a study to determine the effectiveness of punishment and reward on pilot training. Some students were praised after particularly good landings, and others were reprimanded after particularly bad ones. Students who were praised usually did worse on their next landing, while those who were reprimanded usually did better on their next landing. The obvious conclusion is that reward hurts, and punishment helps.
How might this be an instance of the regression fallacy? After a particularly bad landing, one would expect the next to be closer to average, whether or not the student is reprimanded. Similarly, after a particularly good landing, one would expect the next to be closer to average, whether or not the student is praised.
The following exercise checks your understanding of the regression effect.
The residuals from a regression line are the values of the dependent variable Y minus the estimates of their values using the regression line and the independent variable X. If the ith datum is (xi, yi) and the equation of the regression line is y = a×x+b, then the ith residual is
ei = yi - ( a×xi+b).
A residual plot is a scatterplot of the residuals versus their corresponding values of X, that is, a plot of the n points
(xi, ei), i = 1, … , n.
A residual plot shows heteroscedasticity, nonlinear association, or outliers if and only if the original scatterplot does, but it is easier to see these qualitative features of bivariate data in the residual plot than in the scatterplot of the original data.
If the regression line is computed correctly, the correlation coefficient between the residuals and the independent variable is zero—the residuals do not have a trend with X—and the average of the residuals is zero. If the residuals have a trend, the slope of the regression line was computed incorrectly. If the residuals do not have a trend but the mean of the residuals is not zero, the intercept of the regression line was computed incorrectly. If the residuals have a trend and the mean of the residuals is not zero, the slope of the regression line was computed incorrectly, and the intercept of the regression line might or might not have been computed correctly.
The rms of the residuals, also called the rms error of regression, measures the average error of the regression line in estimating the dependent variable Y from the independent variable X. The rms error of regression depends only on the correlation coefficient of X and Y and the SD of Y:
rms error of regression = √(1 - (rXY)²) × SDY.
If the correlation coefficient is ±1, the rms error of regression is zero: The regression line passes through all the data. If r = 0, the rms error of regression is SDY: The regression line estimates Y no better than the mean of Y does—in fact, the regression line is a horizontal line whose intercept is the mean of Y.
For football-shaped scatterplots, unless r = ±1 the graph of averages is not as steep as the SD line: The average of Y in a vertical slice is fewer SDs from the mean than the value of X that defines the slice. The regression line estimates the value of the dependent variable to be fewer SDs from the mean than the value of the independent variable. This is called the regression effect, or regression towards the mean. The regression line estimates the value of the dependent variable to be on the same side of the mean as the value of the independent variable if r is positive, and on the opposite side of the mean if r is negative. (If r = 0, it estimates that the value of the dependent variable will equal the mean.) Ignoring the regression effect leads to the regression fallacy: inventing extrinsic causes for phenomena that are explained adequately by the regression effect.