The point of averages and the two numbers
SD_{X} and SD_{Y} give us some information about a scatterplot,
but they do not tell us
the extent of the association between the variables.
The correlation
coefficient r is a quantitative measure of association: it tells us whether
the scatterplot tilts up or down, and how tightly the data cluster around a straight
line.
In this chapter we study the correlation coefficient, and when it can be used with
the point of averages, SD_{X}, and SD_{Y} to
summarize scatterplots.

Sometimes we shall use subscripts to clarify which correlation coefficient we are
talking about: The symbol r_{XY} denotes the correlation coefficient for
X and Y.
The correlation coefficient for a scatterplot of Y versus X is always equal to the
correlation coefficient for a scatterplot of X versus Y
(r_{XY} = r_{YX}).
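
The symmetry can be checked numerically. The following Python sketch is an illustration only: the helper `correlation` and the five data pairs are made up for this example, and r is computed with the usual formula, the average product of the deviations from the means divided by the product of the two SDs.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Average product of the deviations, divided by both SDs.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

r_xy = correlation(x, y)  # scatterplot of Y versus X
r_yx = correlation(y, x)  # scatterplot of X versus Y
print(r_xy, r_yx)
```

Swapping the roles of X and Y changes neither the products of the deviations nor the product of the SDs, so the two printed values agree.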

Look at scatterplots of different pairs of variables in the GMAT data (ignore the School variable). There are six pairs of variables:

- Quantitative GMAT and Verbal GMAT
- Quantitative GMAT and Undergraduate GPA
- Quantitative GMAT and First-year MBA GPA
- Verbal GMAT and Undergraduate GPA
- Verbal GMAT and First-year MBA GPA
- Undergraduate GPA and First-year MBA GPA

The following exercise helps train your eye to see small differences in correlation in scatterplots.


Correlation is a measure of linear association: how nearly a scatterplot follows a straight line. Two variables are positively correlated if the scatterplot slopes upwards (r > 0); they are negatively correlated if the scatterplot slopes downward (r < 0). Note that linear association is not the only kind of association: Some variables are nonlinearly associated (discussed later in this chapter). For example, the average monthly rainfall in Berkeley, CA, is associated with the month of the year, but that association is nonlinear: It is a seasonal variation that cycles annually. Correlation does not measure nonlinear association, only linear association. The correlation coefficient is appropriate only for quantitative variables, not ordinal or categorical variables, even if their values are numerical.

Correlation is a measure of association, not causation. For example, the average height of people at maturity in the United States has been increasing for decades. Similarly, there is evidence that the number of plant species is decreasing with time. These two variables have a negative correlation, but there is no (straightforward) causal connection between them. A secular trend in both manifests as a correlation between them.
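
A small simulation shows how a shared dependence on time can produce correlation without any causal link. The sketch below uses invented numbers throughout: the series `height` and `species`, their trend slopes, and the noise levels are assumptions chosen for illustration, not real measurements.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

random.seed(0)  # fixed seed so the run is repeatable
years = list(range(50))

# Two made-up series that share nothing except a trend over time.
noise_a = [random.gauss(0, 0.3) for _ in years]
noise_b = [random.gauss(0, 6.0) for _ in years]
height = [170 + 0.1 * t + e for t, e in zip(years, noise_a)]    # rises with time
species = [1000 - 2.0 * t + e for t, e in zip(years, noise_b)]  # falls with time

r_trend = correlation(height, species)   # driven by the opposite trends
r_noise = correlation(noise_a, noise_b)  # what is left once the trends are removed
print(round(r_trend, 2), round(r_noise, 2))
```

The two series are strongly negatively correlated only because one rises and the other falls with time; the parts of each series that do not depend on time were generated independently of each other.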

The correlation coefficient r is close to 1 if the data cluster tightly around
a straight line that slopes up from left to right.
The correlation coefficient is close to −1 if the data cluster tightly around a straight
line that slopes down from left to right.
If the data do not cluster around a straight line, the correlation coefficient
r is close to zero, even if the variables have a strong
nonlinear association.
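
To see how the tightness of the clustering affects r, the following sketch compares points scattered more and more loosely around the same line. It is an illustration under stated assumptions: `wiggly_line` is a hypothetical helper that pushes points alternately above and below a line by a fixed amount, standing in for real scatter.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

x = list(range(-10, 11))

def wiggly_line(slope, amplitude):
    """Points on y = slope * x, pushed alternately up and down by `amplitude`."""
    return [slope * xi + (amplitude if i % 2 == 0 else -amplitude)
            for i, xi in enumerate(x)]

r_tight = correlation(x, wiggly_line(2, 0.5))   # tight cluster, slopes up
r_medium = correlation(x, wiggly_line(2, 5))    # looser cluster, same slope
r_loose = correlation(x, wiggly_line(2, 20))    # very loose cluster, same slope
r_down = correlation(x, wiggly_line(-2, 0.5))   # tight cluster, slopes down

print(r_tight, r_medium, r_loose, r_down)
```

The slope of the line is the same in the first three cases; only the amount of scatter around the line changes, and r moves away from 1 as the scatter grows.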

The following exercises check your knowledge of basic facts about r, and your ability to gauge r by eye.

Some scatterplots have curved patterns.
Such scatterplots are said to show *nonlinear association* between the two variables.
The correlation coefficient does not reflect
nonlinear relationships between variables, only linear ones.
For example, even if the association is quite strong, if it is nonlinear, the correlation
coefficient r can be small or zero.

Suppose that Y = sin(X).
(Take the plot to cover half a period of the sine function.)
Even though the association is perfect—one can predict Y exactly from X—the
correlation coefficient r is exactly zero.
This is because the association is purely nonlinear.
The correlation coefficient measures whether there is a linear trend in the data, and what
fraction of the scatter in the data is accounted for by that trend.
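
The sine example can be reproduced numerically. The sketch below assumes the X values are evenly spaced over half a period, from 0 to π; the number of points (101) and the helper `correlation` are choices made for this illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

n_points = 101
x = [math.pi * i / (n_points - 1) for i in range(n_points)]  # evenly spaced on [0, pi]
y = [math.sin(xi) for xi in x]                               # Y is determined exactly by X

r = correlation(x, y)
print(r)  # zero up to floating-point rounding error
```

The plot rises on the left half and falls symmetrically on the right half, so the positive and negative products of deviations cancel and no linear trend remains.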


Correlation and Association

The correlation coefficient r measures only linear association: how nearly the data fall on a straight line. It is not a good summary of association if the scatterplot has a nonlinear (curved) pattern.

Recall that data are *homoscedastic* if the SD
of the values of Y for points in a vertical slice through the scatterplot is about the same,
regardless of the location of the slice.
In contrast, if the SD of the values of Y in a vertical slice varies a great deal
depending on the location of the slice, the data are *heteroscedastic*.
All the scatterplots we have seen so far in this chapter are roughly homoscedastic.
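
One rough way to check for heteroscedasticity is to compare the SD of Y in vertical slices at different locations. The sketch below does this for two simulated data sets. Everything in it is an assumption made for illustration: the helper names, the slice boundaries, and the way the scatter is generated.

```python
import math
import random

def sd(values):
    """SD of a list of numbers (root mean square deviation from the mean)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def slice_sd(xs, ys, low, high):
    """SD of the Y values for points whose X lies in the vertical slice [low, high)."""
    return sd([y for x, y in zip(xs, ys) if low <= x < high])

random.seed(1)  # fixed seed so the run is repeatable
xs = [random.uniform(0, 10) for _ in range(2000)]
homo = [xi + random.gauss(0, 1.0) for xi in xs]         # same scatter everywhere
hetero = [xi + random.gauss(0, 0.3 * xi) for xi in xs]  # scatter grows with X

homo_left, homo_right = slice_sd(xs, homo, 0, 2), slice_sd(xs, homo, 8, 10)
hetero_left, hetero_right = slice_sd(xs, hetero, 0, 2), slice_sd(xs, hetero, 8, 10)

print("homoscedastic:  ", round(homo_left, 2), round(homo_right, 2))
print("heteroscedastic:", round(hetero_left, 2), round(hetero_right, 2))
```

For the first data set the two slice SDs are about the same; for the second, the slice on the right has a much larger SD than the slice on the left.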

The scatter in a vertical slice near the right of

Correlation and Heteroscedasticity

The correlation coefficient r is not a good summary of association if the data are heteroscedastic.

Recall that a datum that does not fit the overall pattern in the data, or that is many
SDs from the other data in at least one of its
coordinates, is called an outlier.
A single outlier that is far from the
point of averages
can have a large effect on the correlation coefficient.

Try adding a point to the scatterplot to see how much you can change the correlation coefficient. Clear the point, and try again. See how large and how small you can make the correlation coefficient by adding just one point. You should be able to change r from 0 to plus or minus 0.12 or more. If you could add a point beyond the limits of the plot, you could make r vary from nearly −1 to nearly 1. The following exercise checks your understanding of the influence a single point can have on the correlation coefficient.

Correlation and Outliers

The correlation coefficient r is not a good summary of association if the data have outliers.
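
The influence of a single point can be illustrated with a small calculation. In the sketch below, the starting cloud is an invented one (20 points spaced evenly around a circle, for which r is zero), and the outlier coordinates (10, 10) and (10, −10) are arbitrary choices far from the point of averages.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

k = 20
cloud_x = [math.cos(2 * math.pi * i / k) for i in range(k)]
cloud_y = [math.sin(2 * math.pi * i / k) for i in range(k)]

r_cloud = correlation(cloud_x, cloud_y)                   # no linear association
r_upper = correlation(cloud_x + [10], cloud_y + [10])     # one outlier, upper right
r_lower = correlation(cloud_x + [10], cloud_y + [-10])    # one outlier, lower right

print(r_cloud, r_upper, r_lower)
```

With the outlier at the upper right, r is strongly positive; with the outlier at the lower right, r is strongly negative, even though the other 20 points are unchanged.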

If a scatterplot does not show nonlinearity, heteroscedasticity or outliers, it is "football-shaped."

Five-Number Summary of Football-Shaped Scatterplots

Football-shaped scatterplots can be summarized rather well by five numbers:

the mean of X, the mean of Y, the SD of X, the SD of Y, and r.
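
The five numbers can be computed together. The following sketch is illustrative: `five_number_summary` is a hypothetical helper, the data are made up, and the SD is taken to be the one that divides by the number of data (`statistics.pstdev`), which is an assumption about the convention in use.

```python
import statistics

def five_number_summary(xs, ys):
    """Mean of X, mean of Y, SD of X, SD of Y, and r for paired data."""
    n = len(xs)
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sd_x = statistics.pstdev(xs)  # divides by n, not n - 1
    sd_y = statistics.pstdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "r": cov / (sd_x * sd_y)}

# Made-up data for illustration only.
x = [2, 4, 4, 6, 9]
y = [1, 3, 2, 5, 4]

summary = five_number_summary(x, y)
for name, value in summary.items():
    print(name, round(value, 3))
```

The summary describes the data well only when the scatterplot is football-shaped; the code computes the same five numbers for any data, so the shape has to be checked separately.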

Correlation is a measure of
linear association between two variables.
If larger than average values of X tend to occur in conjunction with larger than
average values of Y and smaller than average values of X tend to occur in conjunction
with smaller than average values of Y, the
correlation coefficient
r_{XY}
of X and Y is positive.
If larger than average values of X tend to occur in conjunction with smaller than average
values of Y and smaller than average values of X tend to occur in conjunction with larger
than average values of Y, r_{XY} is negative.
The correlation coefficient of X and Y is always between −1 and +1.
If the points in a scatterplot of Y versus X fall on a
straight line with slope greater than zero, r_{XY} = 1.
If the points in a scatterplot of Y versus X fall on a straight line with slope less
than zero, r_{XY} = −1.
If the points in a scatterplot of Y versus X fall on a horizontal line,
r_{XY} is not defined.
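
These limiting cases can be checked directly. The sketch below is an illustration with made-up lines; the helper `correlation` returns `None` when SD_{X} or SD_{Y} is zero, which is a design choice for this example. On a horizontal line SD_{Y} is zero, so the formula for r would divide by zero.

```python
import math

def correlation(xs, ys):
    """Return r, or None when SD_X or SD_Y is zero so that r is not defined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        return None  # dividing by a zero SD is not possible
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

x = [0, 1, 2, 3, 4]

r_up = correlation(x, [3 * xi + 1 for xi in x])       # line with positive slope
r_down = correlation(x, [-0.5 * xi + 2 for xi in x])  # line with negative slope
r_flat = correlation(x, [7 for _ in x])               # horizontal line

print(r_up, r_down, r_flat)
```

The size of the slope does not matter: any line with positive slope gives r = 1 and any line with negative slope gives r = −1, up to floating-point rounding.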

The correlation coefficient does not measure all kinds of association—only linear
association.
The correlation coefficient, the point of averages, SD_{X} and SD_{Y}
summarize football-shaped scatterplots well,
but not scatterplots that show nonlinearity,
heteroscedasticity or
outliers.
Two variables can have strong nonlinear association and small or zero correlation.
A single outlier can make the correlation coefficient
small or large.

Strong correlation between two variables does not entail a causal relationship between them; neither does a causal relationship between two variables entail any correlation between them. Beware claims of causality on the basis of correlation.

- association
- categorical
- correlation
- correlation coefficient r
- football-shaped
- heteroscedastic
- homoscedastic
- linear association
- mean
- nonlinear
- ordinal
- outlier
- point of averages
- quantitative
- secular trend
- standard deviation (SD)
- transformation
- variable