Regression

In the previous chapter we saw that if the scatterplot of Y versus X is football-shaped, it can be summarized well by five numbers: the mean of X, the mean of Y, the standard deviations SDX and SDY, and the correlation coefficient rXY. Such scatterplots also can be summarized by the regression line, which is introduced in this chapter. The regression line approximates the relationship between X and Y. The slope and intercept of the regression line can be found from the five numbers. The regression line is the line that fits the data best, in a sense made precise in this chapter. Regression is a common statistical tool, better suited to summarizing some scatterplots than to drawing inferences.

The SD Line

The SD line goes through the point of averages, and has slope equal to SDY/SDX if the correlation coefficient r is greater than or equal to zero. The SD line has slope −SDY/SDX if r is negative. That is, the SD line climbs (sinks, if r is negative) by SDY when you move to the right by SDX. The sign of the slope of the SD line is the same as the sign of r, and the size of the slope of the SD line is SDY/SDX. In standard units, the slope of the SD line is one if r is greater than or equal to zero, and equal to minus one if r is negative. If SDX is zero, the SD line is not defined. The accompanying figure shows a football-shaped scatterplot, with its SD line overlaid.

The line slopes up to the right, because r is positive (0.5 at first). Change r and the number of points n to see how the SD line changes. Notice that the points in the scatterplot all lie on the SD line if and only if the correlation coefficient r is ±1 and that the SD line always goes through the point of averages, but does not always go through the origin (0,0). If you click the SDs button, you will see that the SD line always goes diagonally through the rectangle defined by the point of averages plus and minus SDX horizontally and SDY vertically.

The SD line typically does not split the points in the scatterplot evenly. When the correlation coefficient r is positive, in vertical slices to the left of the point of averages, most of the values of Y are above the SD line, and in vertical slices to the right of the point of averages, most of the Y values are below the SD line. When r is negative, in vertical slices to the left of the point of averages, most of the values of Y are below the SD line, and in vertical slices to the right of the point of averages, most of the Y values are above the SD line.
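The definition of the SD line can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `sd_line` and the sample data are made up, and the SDs are assumed to be population SDs (dividing by n).

```python
from statistics import fmean, pstdev

def sd_line(xs, ys):
    """Return (slope, intercept) of the SD line for paired data.

    Assumes population SDs (divide by n). The sign of the slope is taken
    from the sign of the mean product of deviations, which has the same
    sign as the correlation coefficient r.
    """
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    if sd_x == 0:
        raise ValueError("the SD line is not defined when SDX is zero")
    mean_product = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sign = -1.0 if mean_product < 0 else 1.0  # r >= 0 gives a positive slope
    slope = sign * sd_y / sd_x
    intercept = mean_y - slope * mean_x       # line passes through the point of averages
    return slope, intercept

# Hypothetical data: X and Y have the same SD and positive correlation.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
slope, intercept = sd_line(xs, ys)
```

For these data the two SDs are equal and r is positive, so the SD line has slope 1 even though the points do not all lie on it.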

The Graph of Averages

A graph of averages divides a scatterplot into class intervals of the horizontal (X) variable and plots the averages of the Y values in those intervals against the midpoints of the intervals. That is, it plots a typical value of Y in each interval of values of X. If we wanted to summarize the Y values of points whose X values fall in some range, the average Y value of those points would be a reasonable summary. That is what the graph of averages displays. The accompanying figure shows a scatterplot of the GMAT data, with the SD line and the graph of averages. If the figure does not show a scatterplot of Quantitative GMAT versus Verbal GMAT, please change the variables accordingly.

The graph of averages, plotted as yellow squares, lies systematically above the SD line when X is to the left of the point of averages, and systematically below the SD line when X is to the right of the point of averages. The slope of the graph of averages is less extreme than that of the SD line. The graph of averages plots a "typical" value of Y in each class interval of X. The typical value of Y differs systematically from the SD line. The following exercise checks your ability to read the graph of averages.

An earlier chapter introduced a tool to look at histograms of slices through scatterplots, in other words, histograms of subsets of the data defined by restricting the values of some of the variables. The graph of averages plots, for a collection of slices through the data, the mean of the values of Y in each slice (on the y axis) against the midpoint of the interval of X that defines the slice (on the x axis). That tool allows us to superpose histograms of subsets of the GMAT data with histograms of all the GMAT data.

Select Verbal GMAT as the variable to display in the histogram tool. Then restrict attention to individuals whose undergraduate GPA was at least 2.4 and no larger than 2.75. That corresponds to the fourth point in the graph of averages. You will see that the peak of the histogram is at a slightly lower Verbal GMAT score than that of the overall population, and that the mean Verbal GMAT score for those 72 individuals is about 34.2, compared with 35.1 for the overall population of 913 students. You can check these numbers against the graph of averages by putting the cursor over the corresponding yellow square and reading the meter in the bottom-right corner.
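The construction of the graph of averages described above can be sketched as follows. The function name `graph_of_averages`, the choice of half-open class intervals (left endpoint included, right endpoint excluded), and the small data set are assumptions made for illustration; they are not the GMAT data.

```python
from statistics import fmean

def graph_of_averages(xs, ys, edges):
    """Return a list of (midpoint, mean of Y) pairs, one per class interval of X.

    `edges` lists the interval endpoints in increasing order. Each interval
    is taken to be [lo, hi): an assumption, since endpoint conventions vary.
    Intervals that contain no data are skipped.
    """
    points = []
    for lo, hi in zip(edges, edges[1:]):
        ys_in_slice = [y for x, y in zip(xs, ys) if lo <= x < hi]
        if ys_in_slice:
            points.append(((lo + hi) / 2, fmean(ys_in_slice)))
    return points

# Hypothetical data, sliced into three class intervals of width 2.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 3, 7, 6, 8]
averages = graph_of_averages(xs, ys, edges=[0.5, 2.5, 4.5, 6.5])
```

Each pair in `averages` is one point of the graph of averages: the midpoint of a vertical slice and the mean of Y for the data in that slice.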

The SD line rises by SDY for each run of SDX. The graph of averages typically rises by less than SDY for each run of SDX. In fact, it rises by about r×SDY for each run of SDX. The average of Y near the average of X is roughly the overall average of Y if the scatterplot is football-shaped, so the point of averages tends to be close to the graph of averages for football-shaped scatterplots. This suggests that a line that passes through the point of averages and has slope r×SDY/SDX would fit the graph of averages pretty well, giving a reasonable summary of the scatterplot. The line that passes through the point of averages and has slope r×SDY/SDX is called the regression line.

The Regression Line

The regression line is a smoothed version of the graph of averages. It goes through the point of averages, and rises by exactly r×SDY for each SDX it runs to the right: Its slope is r×SDY/SDX, compared with ±SDY/SDX for the SD line. Because |r| ≤ 1, the regression line is never steeper than the SD line. The accompanying figure is a scatterplot of the GMAT data, with the graph of averages, the SD line, and the regression line. If the figure does not show Verbal GMAT versus Quantitative GMAT, please change the variables accordingly.

The regression line fits the graph of averages much better than the SD line does: The slope of the regression line reflects the average increase of Y associated with a given increase of X. The regression line is not as steep as the SD line; this is true in general unless the correlation coefficient is ±1, in which case, the SD line and the regression line coincide.
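To make the comparison of slopes concrete, the following sketch computes the five summary numbers from a small made-up data set and then the slopes of the SD line and the regression line. The function name `five_numbers` and the data are hypothetical; population SDs (dividing by n) are assumed.

```python
from statistics import fmean, pstdev

def five_numbers(xs, ys):
    """Return (mean X, mean Y, SDX, SDY, r), using population SDs."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    # r is the mean of the products of the deviations, divided by SDX*SDY.
    r = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (sd_x * sd_y)
    return mean_x, mean_y, sd_x, sd_y, r

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
mean_x, mean_y, sd_x, sd_y, r = five_numbers(xs, ys)

sd_slope = (1 if r >= 0 else -1) * sd_y / sd_x   # slope of the SD line
regression_slope = r * sd_y / sd_x               # slope of the regression line
```

Here r is about 0.8, so the regression line rises only about 0.8 of what the SD line rises over the same run.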

The following exercises check your understanding of the relationships among the graph of averages, the SD line, and the regression line.

The relationship of the regression line to pairs of variables (bivariate data) is analogous to the relationship of the mean to measurements of one variable (univariate data). Consider a single (unknown) individual in the data set. If we knew nothing about that individual, a good guess of that individual's value of Y would be the mean of Y for the entire data set. If we knew the value of X for that individual, a good guess of the individual's value of Y would be the mean of the values of Y for those individuals with the same value of X: essentially, a point on the graph of averages. For each value of X, the regression line estimates the average value of Y. If the scatterplot is football-shaped, the estimate is sensible; if not, the regression line tends to differ systematically from the graph of averages. The regression line involves the same five numbers we used in the previous chapter to summarize football-shaped scatterplots: the mean of X, the mean of Y, SDX, SDY, and rXY.

The vertical residual of a datum from the regression line is the difference between the value of Y for the datum and the height of the regression line at the value of X of the datum. It is the signed vertical distance by which the regression line misses the datum:

vertical residual = (measured value of Y) − (estimated value of Y).

Recall that the mean is the number for which the rms of the deviations is smallest. Similarly, the regression line is the line for which the rms of the vertical residuals is smallest. No line fits the data better than the regression line, in the sense that the rms of the vertical residuals from the regression line is smaller than the rms of the vertical residuals from any other line. The rms of the residuals is the square root of the mean of the squared residuals, so minimizing it is equivalent to minimizing the sum of the squares of the residuals. For this reason the regression line is sometimes called the least squares line.
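The least squares property can be checked numerically on a small example. The sketch below is only a spot check on made-up data, not a proof: it compares the rms of the vertical residuals from the regression line with the rms from a handful of nearby lines. The function names and the perturbation sizes are arbitrary choices.

```python
from math import sqrt
from statistics import fmean, pstdev

def rms_residual(xs, ys, slope, intercept):
    """rms of the vertical residuals of the data from the line y = slope*x + intercept."""
    return sqrt(fmean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)))

def regression_line(xs, ys):
    """Slope and intercept of the regression line of Y on X (population SDs)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    r = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (sd_x * sd_y)
    slope = r * sd_y / sd_x
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
slope, intercept = regression_line(xs, ys)
best = rms_residual(xs, ys, slope, intercept)

# rms of the residuals from eight other lines with nearby slopes and intercepts.
others = [
    rms_residual(xs, ys, slope + d_slope, intercept + d_intercept)
    for d_slope in (-0.2, 0.0, 0.2)
    for d_intercept in (-0.5, 0.0, 0.5)
    if (d_slope, d_intercept) != (0.0, 0.0)
]
regression_wins = all(best < other for other in others)
```

Every perturbed line has a larger rms residual than the regression line, as the least squares property requires.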

There are really two regression lines: one for regressing Y on X, and one for regressing X on Y. The regression line for regressing X on Y also passes through the point of averages, but its slope is SDY/(r×SDX). This line minimizes the sum of the squared horizontal residuals, instead of the sum of squared vertical residuals. It is steeper than the SD line unless r is 1 or −1. Often, the variable that is regressed upon is called the independent variable, and the variable that is being regressed is called the dependent variable. Usually, the variable plotted on the horizontal (x) axis is the independent variable and the variable plotted on the vertical (y) axis is the dependent one. In this book, the independent variable of the regression line is always plotted on the horizontal axis and the dependent variable is always plotted on the vertical axis.
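A short sketch shows how the two regression lines bracket the SD line when both are drawn with X on the horizontal axis. The function name `regression_slopes` and the summary numbers are made up for illustration. When r is zero the line for regressing X on Y is vertical, which is represented here by an infinite slope.

```python
import math

def regression_slopes(sd_x, sd_y, r):
    """Slopes, in the (x, y) plane, of the two regression lines.

    Returns (slope for regressing Y on X, slope for regressing X on Y).
    The second line is vertical when r == 0; math.inf stands in for that.
    """
    y_on_x = r * sd_y / sd_x
    x_on_y = sd_y / (r * sd_x) if r != 0 else math.inf
    return y_on_x, x_on_y

# Hypothetical summary numbers: SDX = 2, SDY = 8, r = 0.5.
y_on_x, x_on_y = regression_slopes(sd_x=2.0, sd_y=8.0, r=0.5)
sd_slope = 8.0 / 2.0  # slope of the SD line, since r is positive
```

With these numbers the slopes are 2 (Y on X), 4 (SD line), and 8 (X on Y): the SD line lies between the two regression lines.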

Estimating using the Regression Line

How might we use the regression line? If we knew nothing about an individual, our best guess of his value of Y would be the average of the values of Y for the entire data set. (Here, the best guess is the one that minimizes the average of the squared deviations.) If we knew the value of X for that individual, our best guess of his value of Y would be the average value of Y for individuals with that value of X. The regression line gives us that, more or less: The regression line (for regressing Y on X, when regression is appropriate) is an estimate of the average value of Y for individuals with a given value of X. If the scatterplot is football-shaped, the average value of Y for individuals with a value of X that is k×SDX above the overall average of X is about r×k×SDY above the overall average value of Y. The easiest way to use the regression line is to convert from the original measurement units to standard units and back again:

Estimating using the Regression Line

For regressing Y on X, (estimate of Y in standard units) = r×(measured X in standard units)

For regressing X on Y, (estimate of X in standard units) = r×(measured Y in standard units).

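Estimating Y from X using regression is reasonable only if the scatterplot is football-shaped: the association between X and Y is linear, the scatter is homoscedastic, and there are no large outliers.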

The Equation of the Regression Line

Recall that the equation of a line is y = a×x + b, where a is the slope (rise/run, number of units of y the line goes up when x increases by one unit) and b is the y-intercept (the value of y on the line when x = 0, where the line crosses the y-axis). The regression line for regressing Y on X is the unique line that passes through the point of averages and has slope r×SDY/SDX. That lets us solve for the slope and intercept of the regression line:

The point of averages is (mean(X), mean(Y)). The slope of the regression line is a = r×SDY/SDX. The point (mean(X), mean(Y)) is on the line, so

mean(Y) = (r×SDY/SDX)×mean(X) + b, and thus

b = mean(Y) − (r×SDY/SDX)×mean(X).

The equation of the regression line is therefore

y = (r×SDY/SDX)×x + [mean(Y) − (r×SDY/SDX)×mean(X)].
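The derivation above translates directly into code. In this sketch the function name `regression_equation` and the five summary numbers are made up for illustration.

```python
def regression_equation(mean_x, sd_x, mean_y, sd_y, r):
    """Return (a, b) such that y = a*x + b is the regression line of Y on X."""
    a = r * sd_y / sd_x        # slope
    b = mean_y - a * mean_x    # intercept: the line passes through the point of averages
    return a, b

# Hypothetical summary numbers.
a, b = regression_equation(mean_x=10.0, sd_x=2.0, mean_y=50.0, sd_y=8.0, r=0.5)

# The height of the line at mean(X) should equal mean(Y).
height_at_mean_x = a * 10.0 + b
```

Evaluating the line at the mean of X is a useful sanity check, because the regression line must pass through the point of averages.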

Special Cases of the Regression Line

The equation of the regression line involves only the means of X and Y, the standard deviations of X and Y, and the correlation coefficient of X and Y. It is usually easier and clearer conceptually to work in standard units, rather than use the equation of the regression line. The following exercises check your ability to find and manipulate the equation of the regression line.

Example 9-1: Using the Regression Line to Estimate One Variable from Another

Estimating the value of Y associated with a value of X that is larger than any of those observed, or smaller than any of those observed, is called extrapolation. (Estimating the value of X associated with a value of Y larger than any of those observed, or smaller than any of those observed, is also extrapolation.) Estimating the value of Y associated with a value of X that is within the range of the observed values of X but is not equal to any of the observed values of X is called interpolation; so is estimating the value of X associated with a value of Y that is within the range of measured values of Y. Extrapolation is extremely suspect: without data in the range in which the estimate is wanted, there is no reason to believe that the relationship between X and Y is the same as it is in the region in which there are data. Interpolation is sometimes reasonable when the scatterplot is football-shaped, especially if there are many data near the value of X or Y at which the estimate is sought.
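The distinction can be expressed as a simple check to run before trusting an estimate. The function name `classify_estimate` and its return labels are hypothetical, chosen only to mirror the definitions above.

```python
def classify_estimate(x_new, observed_xs):
    """Say whether estimating Y at x_new would interpolate or extrapolate.

    Returns "extrapolation" if x_new lies outside the range of the observed
    X values, "observed" if x_new equals one of the observed values, and
    "interpolation" otherwise.
    """
    lowest, highest = min(observed_xs), max(observed_xs)
    if x_new < lowest or x_new > highest:
        return "extrapolation"
    if x_new in observed_xs:
        return "observed"
    return "interpolation"

# Hypothetical observed heights, in inches.
observed_heights = [60, 63, 66, 69, 72]
inside = classify_estimate(64, observed_heights)
outside = classify_estimate(80, observed_heights)
```

A check like this says nothing about whether there are many data near the value in question, which also matters for interpolation.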

The regression line makes sense as a summary of a scatterplot when the correlation makes sense as a summary of association; namely, when the scatterplot is football-shaped: it is homoscedastic, has no large outliers, and shows linear association. Example 9-1 illustrates using the regression line to estimate one variable from another.

Suppose we have measured the heights and weights of 1000 individuals, and that the scatterplot of weight versus height is roughly football-shaped. Say the average weight is 150 lbs. with an SD of 20 lbs., the average height is 66" with an SD of 3", and the correlation coefficient between height and weight is 0.6. Use the regression line to estimate the following:

  1. The weight of an individual whose height is 66".
  2. The weight of an individual whose height is 72".
  3. The height of an individual whose weight is 160 lbs.
  4. The height of an individual whose weight is 2 SD below average.

Solution:

  1. 66" is the average height, so the regression line would estimate that the individual's weight is the average weight, namely, 150 lbs.
  2. 72" is 6" above average, which is 2 SD above average, so 72" is 2 standard units. The regression line would thus estimate that the individual's weight is r×2 = 1.2 standard units. The SD of weight is 20 lbs., so the individual is estimated to have weight 150 lbs. + 1.2×20 lbs. = 150 lbs. + 24 lbs. = 174 lbs.
  3. 160 lbs. is 10 lbs. above average, and 10 lbs. is 0.5 SD, so the individual has weight 0.5 standard units. The regression line estimates the individual's height to be r×0.5 = 0.3 standard units. The SD of height is 3", so the regression line estimate of the height is 66" + 0.3×3" = 66.9".
  4. The individual's weight is −2 standard units, so the regression line estimate of the height is r×(−2) = −1.2 standard units. This is 66" − 1.2×3" = 62.4".
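The four answers can be reproduced with a few lines of Python that follow the same route: convert to standard units, multiply by r, and convert back. The function name `regress` and the constant names are made up; the numbers are those given in the example.

```python
MEAN_HEIGHT, SD_HEIGHT = 66.0, 3.0     # inches
MEAN_WEIGHT, SD_WEIGHT = 150.0, 20.0   # pounds
R = 0.6                                # correlation of height and weight

def regress(value, mean_from, sd_from, mean_to, sd_to, r=R):
    """Estimate one variable from the other using the regression line."""
    z = (value - mean_from) / sd_from   # measured value in standard units
    return mean_to + r * z * sd_to      # estimate, back in original units

# 1 and 2: estimate weight from height.
weight_at_66 = regress(66, MEAN_HEIGHT, SD_HEIGHT, MEAN_WEIGHT, SD_WEIGHT)
weight_at_72 = regress(72, MEAN_HEIGHT, SD_HEIGHT, MEAN_WEIGHT, SD_WEIGHT)

# 3 and 4: estimate height from weight (the other regression line).
height_at_160 = regress(160, MEAN_WEIGHT, SD_WEIGHT, MEAN_HEIGHT, SD_HEIGHT)
height_2sd_below = regress(MEAN_WEIGHT - 2 * SD_WEIGHT,
                           MEAN_WEIGHT, SD_WEIGHT, MEAN_HEIGHT, SD_HEIGHT)
```

Up to floating-point rounding, the four results agree with the solution: 150 lbs., 174 lbs., 66.9", and 62.4".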

The following exercises check your ability to estimate using the regression line.

The best use of the regression line is to interpolate: to estimate the value of Y at a value of X that is within the range of measured values of X, but not equal to any of the measured values, when the scatterplot of X and Y is football-shaped. The regression line is also useful for estimating the value of one of the variables corresponding to a given value of the other, using just five summary numbers: mean(X), mean(Y), SDX, SDY, and r. However, one must be wary: Extrapolation always should be met with suspicion, and if the scatterplot is not football-shaped, the regression line need not summarize the relationship between X and Y well, even within the range of the data.

Summary

This chapter presents the SD line, the graph of averages, and the regression line. The SD line passes through the point of averages and has slope ±SDY/SDX; the sign of the slope is the same as the sign of the correlation coefficient r. The SD line is presented only for contrast; it is not itself useful. The graph of averages summarizes a scatterplot by the averages of the data in vertical slices, plotted against the horizontal midpoints of the slices. Each point in the graph of averages summarizes the values of Y for points with values of X that are in a small range. If the scatterplot is football-shaped, the points in the graph of averages fall nearly on a straight line. That line, called the regression line, passes through the point of averages and has slope r×SDY/SDX, which is smaller in magnitude than the slope of the SD line unless r is ±1.

The vertical residual from a datum (x, y) to a line is equal to y minus the height of the line at x. The rms of the vertical residuals from all the data and the regression line is smaller than the rms of the vertical residuals from the data to any other line: In this sense, the regression line is the line that best summarizes a scatterplot.

If a scatterplot is football-shaped, the regression line comes close to passing through the graph of averages, and provides a reasonable summary of the scatterplot. If the scatterplot shows nonlinearity, heteroscedasticity, or outliers, the regression line will tend to be systematically above the graph of averages for some values of X and systematically below the graph of averages for other values of X. Then the regression line is not a good summary of the scatterplot. To estimate the value of Y associated with a given value of X using the regression line, it is easiest to work in standard units:

(estimate of Y in standard units) = r×(measured X in standard units).

This is the regression line for regressing Y on X; Y is called the dependent variable in the regression and X is called the independent variable in the regression. There is another regression line for regressing X on Y, as follows:

(estimate of X in standard units) = r×(measured Y in standard units).

In that line, Y is called the independent variable in the regression and X is called the dependent variable in the regression.

The equation of the regression line for regressing Y on X is:

y = (r×SDY/SDX)×x + [mean(Y) − (r×SDY/SDX)×mean(X)].

The equation of the regression line depends on five measured quantities: the mean of X, the mean of Y, the standard deviation of X, the standard deviation of Y, and the correlation coefficient. Estimating the value of Y at a value of X beyond the range of measured values of X is called extrapolation. Estimating the value of Y at a value of X within the range of measured values of X is called interpolation. Extrapolation is very hard to justify, and should be treated suspiciously: Usually, there is little reason to believe that the observed relationship between X and Y holds where nothing was observed. Interpolation can be reasonable when the scatterplot is football-shaped. This is the most useful application of the regression line.

Key Terms