# Errors in Regression

The regression line generally does not go through all the data: approximating the data using the regression line entails some error. As discussed in an earlier chapter, the vertical amount by which the line misses a datum is called a residual: it is the error in estimating the value of Y for that datum from its value of X using the regression line. The rms of the residuals has a simple relation to the correlation coefficient and the SD of Y: it is $$\sqrt{(1-r^2)} \times SD_Y$$.

There are common mistakes in interpreting regression, including the regression fallacy and fallacies related to ecological correlation, discussed below.

## The RMS Error of Regression

The regression line does not pass through all the data points on the scatterplot exactly unless the correlation coefficient is ±1. In general, the data are scattered around the regression line. Each datum will have a vertical residual from the regression line; the sizes of the vertical residuals will vary from datum to datum. The rms of the vertical residuals measures the typical vertical distance of a datum from the regression line.

Recall that the rms is a measure of the typical size of elements in a list. Thus the rms of the vertical residuals is a measure of the typical vertical distance from the data to the regression line, that is, the typical error in estimating the value of Y by the height of the regression line. A bit of algebra shows that the rms of the vertical residuals from the regression line (the rms error of regression) is

$$\sqrt{(1-r^2)} \times SD_Y$$

The rms error of regression is always between 0 and $$SD_Y$$. It is zero when $$r = \pm 1$$ and $$SD_Y$$ when $$r = 0$$. (Try substituting $$r = 1$$ and $$r = 0$$ into the expression above.) When $$r = \pm 1$$, the regression line accounts for all of the variability of Y, and the rms of the vertical residuals is zero. When $$r = 0$$, the regression line does not "explain" any of the variability of Y: The regression line is a horizontal line at height mean(Y), so the rms of the vertical residuals from the regression line is the rms of the deviations of the values of Y from the mean of Y, which is, by definition, the SD of Y. When $$r$$ is not zero, the regression line accounts for some of the variability of Y, so the scatter around the regression line is less than the overall scatter in Y.
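The relation between the rms of the residuals and $$\sqrt{(1-r^2)} \times SD_Y$$ can be checked numerically. The following is a minimal sketch, not a definitive implementation: the data are made up for illustration, the helper names `rms` and `regression_summary` are hypothetical, and SDs are computed by dividing by n (the population SD), which is assumed here to match the rms-based definition of the SD.

```python
import math
import statistics


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def regression_summary(xs, ys):
    """Return (slope, intercept, r) for the regression of Y on X.

    Uses population SDs (divide by n), an assumption of this sketch.
    """
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    # r is the average of the products of X and Y in standard units.
    r = statistics.fmean(
        [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    )
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept, r


# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

slope, intercept, r = regression_summary(xs, ys)

# Vertical residuals: observed Y minus the height of the regression line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

rms_of_residuals = rms(residuals)
formula_value = math.sqrt(1 - r**2) * statistics.pstdev(ys)

# The two numbers should agree up to floating-point rounding.
print(rms_of_residuals, formula_value)
```

Because the identity is algebraic, the two printed values should agree for any data set in which neither X nor Y is constant, not just for this one.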

If the scatterplot is football-shaped, the mean of the values in a thin vertical strip will be about the same as the height of the regression line, and the SD of the values in a vertical strip will be about the same as the rms (vertical) error of regression. Why?

Recall that the regression line is a smoothed version of the graph of averages: The height of the regression line at the point $$x$$ is an estimate of the average of the values of Y for individuals whose value of X is close to $$x$$. If the scatterplot is football-shaped, the regression line follows the graph of averages reasonably well: In each vertical slice, the deviations of the values of Y from their mean are approximately the vertical residuals of those values of Y from the regression line. The SD of the values of Y in the slice is thus approximately the rms of the residuals in the slice. Because football-shaped scatterplots are homoscedastic, the SD of the values of Y in every vertical slice is about the same, so the rms error of regression is a reasonable estimate of the scatter of the values of Y in vertical slices through football-shaped scatterplots.

In contrast, when the scatterplot is not football-shaped—because of nonlinearity, heteroscedasticity or outliers—the rms error of regression is not a good measure of the scatter in a "typical" vertical slice. If a scatterplot is homoscedastic and shows nonlinear association, the rms error of regression tends to overestimate the scatter in a typical vertical slice: the residuals have a contribution from scatter around the average in the slice, and a contribution from the difference between the average in the slice and the height of the regression line in the slice. Similarly, if a scatterplot is heteroscedastic and shows linear association, the rms error of regression will overestimate the scatter in some slices and underestimate the scatter in other slices. If a scatterplot has outliers and is otherwise homoscedastic and shows linear association, the rms error of regression will tend to overestimate the scatter in slices. The strength of linear association affects the size of the rms error of regression, but it does not affect whether the rms error of regression is a good estimate of the scatter in vertical slices.
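The claim for football-shaped scatterplots can be explored by simulation. The sketch below is illustrative only: it assumes simulated data in which X and Y are roughly normal with a hypothetical correlation `R_TRUE = 0.6`, and the names `line_height` and `slice_stats` are invented for this example. It compares the mean and SD of Y in three thin vertical slices with the height of the regression line and the rms error of regression.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

R_TRUE = 0.6  # hypothetical population correlation, chosen for illustration
N = 20000     # number of simulated individuals

# Simulated football-shaped data: linear association, homoscedastic scatter.
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [R_TRUE * x + math.sqrt(1 - R_TRUE**2) * random.gauss(0, 1) for x in xs]

mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
r = statistics.fmean(
    [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
)
rms_error = math.sqrt(1 - r**2) * sd_y


def line_height(x):
    """Height of the regression line of Y on X at the point x."""
    return mean_y + r * sd_y / sd_x * (x - mean_x)


def slice_stats(lo, hi):
    """Mean and SD of Y for individuals with lo <= X < hi."""
    in_slice = [y for x, y in zip(xs, ys) if lo <= x < hi]
    return statistics.fmean(in_slice), statistics.pstdev(in_slice)


for lo, hi in [(-1.1, -0.9), (-0.1, 0.1), (0.9, 1.1)]:
    slice_mean, slice_sd = slice_stats(lo, hi)
    middle = (lo + hi) / 2
    print(
        f"slice [{lo}, {hi}): mean {slice_mean:.2f} vs line {line_height(middle):.2f}; "
        f"SD {slice_sd:.2f} vs rms error {rms_error:.2f}"
    )
```

In each slice the mean should be close to the height of the line and the SD close to the rms error of regression. Replacing the simulated data with nonlinear or heteroscedastic data is one way to see the approximation break down.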


## The Distribution of Data in Slices through a Scatterplot

The values of Y for a given value of X (or a small range of values of X) have a distribution that typically differs from the overall distribution of the values of Y without regard for the value of X. Therefore, the mean of the values of Y in such a slice typically differs from the overall mean of Y, and the SD of the values of Y in a slice typically differs from the overall SD of Y.

The histogram applet lets us superpose the histogram of a variable for all individuals with the histogram of that variable just for those individuals whose value of that or another variable is within a given range, that is, a slice through the scatterplot. It allows us to look at the histogram of Y values for all individuals in a set of multivariate data, and the histogram of Y values for only those individuals who have X values in a specified range. We shall look at the GMAT data.

We first superposed histograms to study association in an earlier chapter. This applet should display the verbal GMAT scores when you first visit this page. If not, select "Verbal" from the Variable drop-down menu. Use the Restrict to drop-down menu to select Quantitative GMAT. The mean of the Verbal GMAT scores for just those individuals whose Quantitative GMAT scores are in a restricted range is typically different from the mean of the Verbal GMAT scores for all individuals; the SD of the restricted set of Verbal GMAT scores is also typically different from the overall SD of the Verbal GMAT scores. The SD of the restricted set measures the spread of those scores and, for football-shaped scatterplots, is about the same as the rms error of regression. We can use what we know about univariate distributions to calculate properties of the distribution of the Verbal GMAT scores that correspond to a given value of Quantitative GMAT.

If the correlation coefficient r is positive and the data are homoscedastic, individuals with a given value of X that is above the mean of X are a subset of the population that tends to have larger than average values of Y; and the scatter of those values tends to be less than the scatter of Y for the entire population, by the factor $$\sqrt{(1 - r^2)}$$. Individuals with a value of X that is smaller than the mean of X are a subset of the population that tends to have smaller than average values of Y; and the scatter of those values tends to be less than the overall scatter of Y for the entire population. The same thing holds for negative correlation, mutatis mutandis.

## The Regression Effect

Earlier we saw that for football-shaped scatterplots the graph of averages is not as steep as the SD line, unless $$r = \pm1$$: If $$0 < r < 1$$, the average value of Y for individuals whose values of X are about $$kSD_X$$ above mean(X) is less than $$kSD_Y$$ above mean(Y). Similarly, if $$-1 < r < 0$$, the average value of Y for individuals whose values of X are about $$kSD_X$$ above mean(X) is less than $$kSD_Y$$ below mean(Y).

This phenomenon is called the regression effect or regression towards the mean. Individuals with a given value of X tend to have values of Y that are closer to the mean, where closer means fewer SD away. Consider the IQs of a large group of married couples. Essentially by definition, the average IQ score is 100. The SD of IQ is about 15 points. Suppose that for this group, the correlation between the IQs of spouses is 0.7—women with above average IQ tend to marry men with above average IQ, and vice versa. Consider a woman in the group whose IQ is 150 (genius level). What is our best estimate of her husband's IQ? We shall estimate his IQ using the regression line: Her IQ is 150, which is 50 points above average. 50 points is

$$3\tfrac{1}{3} \times 15 \mbox{ points} = 3\tfrac{1}{3}\ SD$$

so we would estimate the husband's IQ to be $$r \times 3\tfrac{1}{3} SD = 0.7 \times 3\tfrac{1}{3} SD$$ above average, or about $$2\tfrac{1}{3} SD$$ above average. Now $$2\tfrac{1}{3} SD$$ is 35 points, so we expect the husband's IQ to be about 135, not nearly as "smart" as she is.

Now let's predict the IQ of the wife of a man whose IQ is 135. His IQ is $$2\tfrac{1}{3} SD$$ above average, so we expect her IQ to be $$0.7 \times 2\tfrac{1}{3} SD$$ above average. That's about 1.63 SD or $$1.63 \times 15 = 24\tfrac{1}{2}$$ points above average, or $$124\tfrac{1}{2}$$, not as "smart" as he is. How can this be consistent?
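The two estimates above can be reproduced with a short sketch. The function name `regression_estimate` and its parameter names are hypothetical, chosen for this illustration; the numbers (mean 100, SD 15, correlation 0.7) are the ones used in the example.

```python
def regression_estimate(x, mean_x, sd_x, mean_y, sd_y, r):
    """Estimate Y from X using the regression line, working in standard units."""
    z_x = (x - mean_x) / sd_x  # how many SDs X is from its mean
    z_y = r * z_x              # regression estimate of Y, in SDs from its mean
    return mean_y + z_y * sd_y


MEAN_IQ, SD_IQ, R_SPOUSES = 100, 15, 0.7  # values from the example in the text

# Estimate the husband's IQ from a wife's IQ of 150.
husband = regression_estimate(150, MEAN_IQ, SD_IQ, MEAN_IQ, SD_IQ, R_SPOUSES)

# Estimate the wife's IQ from a husband's IQ equal to that first estimate.
wife = regression_estimate(husband, MEAN_IQ, SD_IQ, MEAN_IQ, SD_IQ, R_SPOUSES)

print(round(husband, 1))  # 135.0
print(round(wife, 1))     # 124.5
```

Each application of the regression line multiplies the distance from the mean, measured in SDs, by $$r$$, so estimating "there and back" does not return the starting value unless $$r = \pm 1$$.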

The algebra is correct. The phenomenon is quite general. It is called the regression effect. The regression effect is caused by the same thing that makes the slope of the regression line smaller in magnitude than the slope of the SD line. If the scatterplot is football-shaped and r is at least zero but less than 1, then

• In a vertical slice containing above-average values of X, most of the y coordinates are below the SD line.
• In a vertical slice containing below-average values of X, most of the y coordinates are above the SD line.

If the scatterplot is football-shaped and $$r$$ is less than zero but greater than −1:

• In a vertical slice for above-average values of X, most of the y coordinates are above the SD line.
• In a vertical slice for below-average values of X, most of the y coordinates are below the SD line.

Only if $$r$$ is ±1 does the regression line estimate the value of Y to be as many SDs from the mean as the value of X is; otherwise, the regression line estimates the value of Y to be fewer SDs from the mean. If $$r$$ is positive but less than 1, the regression line estimates Y to be above its mean if X is above its mean, but by fewer SDs. If $$r$$ is negative but greater than −1, the regression line estimates Y to be below its mean if X is above its mean, but by fewer SDs than X is above its mean. This is another way of expressing the regression effect.
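The comparison between the SD line and the regression line is easiest to see in standard units. The sketch below is a hedged illustration: the function names are hypothetical, and the SD line is assumed to have positive slope when $$r \ge 0$$ and negative slope when $$r < 0$$, consistent with the cases listed above.

```python
def sd_line_estimate(z_x, r):
    """SDs from the mean of Y on the SD line, for X that is z_x SDs from its mean.

    Assumes the SD line slopes upward when r >= 0 and downward when r < 0.
    """
    return z_x if r >= 0 else -z_x


def regression_line_estimate(z_x, r):
    """SDs from the mean of Y on the regression line: r times z_x."""
    return r * z_x


z_x = 2.0  # X is 2 SDs above its mean (hypothetical value)
for r in (1.0, 0.7, 0.0, -0.7, -1.0):
    print(
        f"r = {r:+.1f}: SD line {sd_line_estimate(z_x, r):+.2f} SD, "
        f"regression line {regression_line_estimate(z_x, r):+.2f} SD"
    )
```

For every value of $$r$$ strictly between −1 and 1, the regression estimate is closer to the mean than the SD line is; the two coincide only when $$r = \pm 1$$.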

The regression effect does not say that an individual who is a given number of SD from average in one variable must have a value of the other variable that is closer to average—merely that individuals who are a given number of SD from the mean in one variable tend on average to be fewer SD from the mean in the other.

In most test/re-test situations, the correlation between scores on the test and scores on the re-test is positive, so individuals who score much higher than average on one test tend to score above average, but closer to average, on the other test. (In the previous example, "individuals" are couples, the first test is the IQ of one spouse, and the second test is the IQ of the other spouse.) Similarly, individuals who are much lower than average in one variable tend to be closer to average in the other (but still below average).

Those who perform best usually do so with a combination of skill (which will be present in the retest) and exceptional luck (which will likely not be so good in a retest). Those who perform worst usually do so as the result of a combination of lack of skill (which still won't be present in a retest) and bad luck (which is likely to be better in a retest). If the scatterplot is football-shaped, many more individuals are near the mean than in the tails. A particularly high score could have come from someone with an even higher true ability, but who had bad luck, or someone with a lower true ability who had good luck. Because more individuals are near average, the second case is more likely; when the second case occurs on a retest, the individual's luck is just as likely to be bad as good, so the individual's second score will tend to be lower. The same argument applies, mutatis mutandis, to the case of a particularly low score on the first test.
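The skill-plus-luck argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real test: each score is taken to be a fixed "skill" plus independent "luck" drawn afresh on each test, both simulated as standard normal, and all variable names are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 10000  # number of simulated test-takers

# Assumed model: score = skill + luck, with fresh, independent luck on each test.
skill = [random.gauss(0, 1) for _ in range(N)]
test1 = [s + random.gauss(0, 1) for s in skill]
test2 = [s + random.gauss(0, 1) for s in skill]

# Individuals in roughly the top 10% on the first test.
cutoff = sorted(test1)[int(0.9 * N)]
top = [i for i in range(N) if test1[i] >= cutoff]

top_first = statistics.fmean([test1[i] for i in top])
top_second = statistics.fmean([test2[i] for i in top])

# Some top scorers do even better on the retest; the effect is about averages.
even_higher = sum(1 for i in top if test2[i] > test1[i])

print(f"top group, first test mean:  {top_first:.2f}")
print(f"top group, second test mean: {top_second:.2f}")
print(f"top scorers who improved on the retest: {even_higher} of {len(top)}")
```

Under these assumptions the top group's average on the second test should still be above the overall average of about zero, but noticeably closer to it, even though nothing about the individuals changed between tests.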

The regression effect does not require the second score to be less extreme than the first: nothing prevents an individual from having a score that is even more extreme on the second test. The regression effect describes what happens on the average.

Failing to account for the regression effect, and concluding that something else must cause the difference in scores, is called the regression fallacy. The regression fallacy sometimes leads to amusing mental gymnastics and speculation, but can also be pernicious.

Example: Pilot training in the Israeli Air Force. (From Tversky and Kahneman, 1974.) The Israeli Air Force performed a study to determine the effectiveness of punishment and reward on pilot training. Some students were praised after particularly good landings, and others were reprimanded after particularly bad ones. Students who were praised usually did worse on their next landing, while those who were reprimanded usually did better on their next landing. The obvious conclusion is that reward hurts, and punishment helps.

How might this be an instance of the regression fallacy? After a particularly bad landing, one would expect the next to be closer to average, whether or not the student is reprimanded. Similarly, after a particularly good landing, one would expect the next to be closer to average, whether or not the student is praised.


## Summary

The rms of the residuals, also called the rms error of regression, measures the typical size of the error of the regression line in estimating the dependent variable Y from the independent variable X. The rms error of regression depends only on the correlation coefficient of X and Y and the SD of Y:

$$\mbox{rms error of regression} = \sqrt{(1 - (r_{XY})^2)} \times SD_Y$$

If the correlation coefficient is $$\pm1$$, the rms error of regression is zero: The regression line passes through all the data. If $$r = 0$$, the rms error of regression is $$SD_Y$$: The regression line estimates Y no better than the mean of Y does. In fact, when $$r = 0$$ the regression line is a horizontal line whose intercept is the mean of Y.

For football-shaped scatterplots, unless $$r = \pm 1$$ the graph of averages is not as steep as the SD line: The average of Y in a vertical slice is fewer SDs from the mean than the value of X that defines the slice. The regression line estimates the value of the dependent variable to be fewer SDs from the mean than the value of the independent variable. This is called the regression effect, or regression towards the mean. The regression line estimates the value of the dependent variable to be on the same side of the mean as the value of the independent variable if $$r$$ is positive, and on the opposite side of the mean if $$r$$ is negative. (If $$r = 0$$, it estimates that the value of the dependent variable will equal the mean.) Ignoring the regression effect leads to the regression fallacy: inventing extrinsic causes for phenomena that are explained adequately by the regression effect.

## Key Terms

• correlation coefficient
• dependent variable
• football-shaped
• graph of averages
• heteroscedasticity
• histogram
• homoscedastic
• independent variable
• mean
• mutatis mutandis
• nonlinear
• nonlinearity
• outlier
• percentile
• regression effect
• regression fallacy
• regression line
• residual
• residual plot
• rms
• rms error of regression
• scatterplot
• SD
• slice
• variable