The point of averages and the two numbers
SDX and SDY give us some information about a scatterplot,
but they do not tell us
the extent of the association between the variables.
The correlation coefficient r is a quantitative measure of association: it tells us whether
the scatterplot tilts up or down, and how tightly the data cluster around a straight line.
In this chapter we study the correlation coefficient, and when it can be used with
the point of averages, SDX, and SDY to summarize a scatterplot.
[Interactive figure: a scatterplot of the GMAT data, with its correlation coefficient r.]
When you first load this page,
r = 0.35.
Quantitative GMAT and Verbal GMAT are positively associated:
Students with above average Quantitative GMAT scores tend also to have larger than average
Verbal GMAT scores, and students with below average Quantitative GMAT scores tend to have
below average Verbal GMAT scores.
The value of r is positive when variables are positively associated;
the value of r is negative when variables are negatively associated.
The value of r is always between −1 and +1.
Sometimes we shall use subscripts to clarify which correlation coefficient we are
talking about: The symbol rXY denotes the correlation coefficient for
X and Y.
The correlation coefficient for a scatterplot of Y versus X is always equal to the
correlation coefficient for a scatterplot of X versus Y
(rXY = rYX).
See for yourself: swap which variable is plotted as X and which as Y, and
r = 0.35 again.
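The symmetry rXY = rYX can also be checked numerically. The sketch below assumes the usual definition of r as the average of the products of the values in standard units, with SDs that divide by n; this section has not stated a formula, so that definition is an assumption here. The function name `correlation` and the small data lists are hypothetical, made up for illustration; they are not the GMAT data.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r: mean of the products of values in standard units.

    Assumes the SD divides by n (not n - 1).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two data")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when either SD is zero")
    return sum((x - mean_x) / sd_x * (y - mean_y) / sd_y
               for x, y in zip(xs, ys)) / n

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

r_xy = correlation(x, y)  # Y versus X
r_yx = correlation(y, x)  # X versus Y
print(r_xy, r_yx)
```

Swapping the two lists only swaps the two factors in each product, so the two values agree.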
Look at scatterplots of different pairs of variables in the GMAT data
(ignore the School variable).
There are six pairs of variables.
The following exercise helps train your eye to see small differences in
correlation in scatterplots.
Correlation is a measure of linear association: how nearly a scatterplot follows a straight line.
Two variables are positively correlated if the scatterplot slopes upwards
(r > 0); they are negatively correlated if the scatterplot slopes
downward (r < 0).
Note that linear association is not the only kind of association:
Some variables are nonlinearly associated (discussed later in this chapter).
For example, the average monthly rainfall in Berkeley, CA, is associated with the month
of the year, but that association is nonlinear: It is a seasonal variation that repeats each year.
Correlation does not measure nonlinear association, only linear association.
The correlation coefficient is appropriate only for
quantitative variables, not
categorical variables, even if their values are numerical.
Correlation is a measure of association, not causation.
For example, the average height of people at maturity in the United States has been
increasing for decades.
Similarly, there is evidence that the number of plant species is decreasing with time.
These two variables have a negative correlation, but there is no (straightforward)
causal connection between them.
A secular trend in both manifests as a
correlation between them.
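A small sketch of how a shared trend over time produces correlation without a causal link. The two series below are made-up numbers, not real height or species data: each depends only on the (hypothetical) year index, and neither is computed from the other. The helper `correlation` assumes the usual definition of r with SDs that divide by n.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, using SDs that divide by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

# Made-up series over 30 hypothetical years; NOT real data.
years = list(range(30))
# Series A drifts upward with time, plus a small wiggle.
series_a = [170.0 + 0.1 * t + 0.3 * math.sin(t) for t in years]
# Series B drifts downward with time, plus an unrelated wiggle.
series_b = [1000.0 - 5.0 * t + 10.0 * math.cos(t) for t in years]

r_ab = correlation(series_a, series_b)
print(round(r_ab, 2))
```

The correlation is strongly negative even though the only thing the two series share is their dependence on time.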
The correlation coefficient r is close to 1 if the data cluster tightly around
a straight line that slopes up from left to right.
The correlation coefficient is close to −1 if the data cluster tightly around a straight
line that slopes down from left to right.
If the data do not cluster around a straight line, the correlation coefficient
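r is close to zero, even if the variables have a strong nonlinear association.
[Interactive figure: scatterplots with specified values of the correlation coefficient r, and specified numbers of data n.]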
Note that r is undefined if n is less than two.
The following exercises check your knowledge of
basic facts about r, and your ability to gauge r by eye.
Some scatterplots have curved patterns.
Such scatterplots are said to show nonlinear association between the two variables.
The correlation coefficient does not reflect
nonlinear relationships between variables, only linear ones.
For example, even if the association is quite strong, if it is nonlinear, the correlation
coefficient r can be small or zero.
(The plot is half a period of the sine function.)
Even though the association is perfect—one can predict Y exactly from X—the
correlation coefficient r is exactly zero.
This is because the association is purely nonlinear.
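The half-period sine example can be reproduced with a short sketch. It assumes the usual definition of r (SDs dividing by n) and uses 101 evenly spaced values of X, a choice made here only for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, using SDs that divide by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

# Evenly spaced values of X covering half a period of the sine function.
n = 101
x = [math.pi * i / (n - 1) for i in range(n)]
y = [math.sin(v) for v in x]  # Y is determined exactly by X

r = correlation(x, y)
print(f"r = {r:.6f}")
```

The points are symmetric about the middle of the range of X, so positive and negative products of deviations cancel, and r is zero up to rounding error.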
The correlation coefficient measures whether there is a linear trend in the data, and what
fraction of the scatter in the data is accounted for by that trend.
Correlation and Association
The correlation coefficient r measures only linear association: how nearly
the data fall on a straight line.
It is not a good summary of association if the scatterplot has a nonlinear (curved) pattern.
Recall that data are homoscedastic if the SD
of the values of Y for points in a vertical slice through the scatterplot is about the same,
regardless of the location of the slice.
In contrast, if the SD of the values of Y in a vertical slice varies a great deal
depending on the location of the slice, the data are heteroscedastic.
All the scatterplots we have seen so far in this chapter are roughly homoscedastic.
[Figure: a heteroscedastic scatterplot, in which the scatter in a vertical slice depends on the location of the slice.]
Correlation and Heteroscedasticity
The correlation coefficient r
is not a good summary of association if the data are heteroscedastic.
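One rough way to check for heteroscedasticity is to compute the SD of Y within several vertical slices and compare them. The sketch below does this for synthetic data whose scatter grows with X. The data, the seed, the number of slices, and the helper name `slice_sds` are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: the scatter of Y around its trend grows with X.
n = 2000
xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 0.2 + 0.3 * x) for x in xs]

def slice_sds(xs, ys, lo, hi, n_slices):
    """SD of Y within each of n_slices equal-width vertical slices of [lo, hi]."""
    width = (hi - lo) / n_slices
    groups = [[] for _ in range(n_slices)]
    for x, y in zip(xs, ys):
        # Clamp so that a value exactly equal to hi lands in the last slice.
        k = min(int((x - lo) / width), n_slices - 1)
        groups[k].append(y)
    # pstdev divides by the number of data in the slice.
    return [statistics.pstdev(g) for g in groups]

sds = slice_sds(xs, ys, 0.0, 10.0, 5)
print([round(s, 2) for s in sds])
```

If the slice SDs are about equal, the data are roughly homoscedastic; here they grow markedly from left to right, so the data are heteroscedastic.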
Recall that a datum that does not fit the overall pattern in the data or that is many
SD from the other data in at least one of its
coordinates is called an outlier.
A single outlier that is far from the
point of averages
can have a large effect on the correlation coefficient.
Try adding a point to the scatterplot and seeing how much you can change the correlation coefficient.
Clear the point, and try again.
See how large and how small you can make the correlation coefficient by adding just one point.
You should be able to change r from 0 to plus or minus 0.12 or more.
If you could add a point beyond the limits of the plot, you could make r
vary from nearly −1 to nearly 1.
The following exercise checks your understanding of
the influence a single point can have on the correlation coefficient.
Correlation and Outliers
The correlation coefficient
r is not a good summary of association if the
data have outliers.
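The effect of a single distant point can be sketched numerically. The example assumes the usual definition of r (SDs dividing by n); the grid of points and the two outliers are hypothetical, chosen so that r starts at zero.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, using SDs that divide by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

# A 5-by-5 grid of points centered at the origin: by symmetry, r = 0.
grid = [(float(i), float(j)) for i in range(-2, 3) for j in range(-2, 3)]

def r_with_extra_point(points, extra):
    """Correlation coefficient after adding one extra point to the data."""
    xs = [p[0] for p in points] + [extra[0]]
    ys = [p[1] for p in points] + [extra[1]]
    return correlation(xs, ys)

r_grid = correlation([p[0] for p in grid], [p[1] for p in grid])
r_up = r_with_extra_point(grid, (100.0, 100.0))     # outlier far up and to the right
r_down = r_with_extra_point(grid, (100.0, -100.0))  # outlier far down and to the right
print(round(r_grid, 3), round(r_up, 3), round(r_down, 3))
```

With 25 uncorrelated points, one added point far from the point of averages is enough to push r close to +1 or close to −1, depending on where it is placed.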
If a scatterplot does not show nonlinearity, heteroscedasticity or
outliers, it is "football-shaped."
Five-Number Summary of Football-Shaped Scatterplots
Football-shaped scatterplots can be summarized rather
well by five numbers:
the mean of X, the
mean of Y, the
SD of X, the SD of Y, and
the correlation coefficient rXY.
Correlation is a measure of
linear association between two variables.
If larger than average values of X tend to occur in conjunction with larger than
average values of Y and smaller than average values of X tend to occur in conjunction
with smaller than average values of Y, the correlation coefficient rXY
of X and Y is positive.
If larger than average values of X tend to occur in conjunction with smaller than average
values of Y and smaller than average values of X tend to occur in conjunction with larger
than average values of Y, rXY is negative.
The correlation coefficient of X and Y is always between −1 and +1.
If the points in a scatterplot of Y versus X fall on a
straight line with slope greater than zero, rXY = 1.
If the points in a scatterplot of Y versus X fall on a straight line with slope less
than zero, rXY = −1.
If the points in a scatterplot of Y versus X fall on a horizontal line,
rXY is not defined.
The correlation coefficient does not measure all kinds of association, only linear association.
The correlation coefficient, the point of averages, SDX and SDY
summarize football-shaped scatterplots well,
but not scatterplots that show nonlinearity, heteroscedasticity, or outliers.
Two variables can have strong nonlinear association and small or zero correlation.
A single outlier can make the correlation coefficient
small or large.
Strong correlation between two variables does not entail a causal relationship
between them; neither does a causal relationship between two variables entail any
correlation between them.
Beware claims of causality on the basis of correlation.
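Several of the facts above about points that fall exactly on a line can be checked with a short sketch. It assumes the usual definition of r with SDs that divide by n; the helper `correlation` is hypothetical and returns `None` when r is undefined, a convention chosen here for illustration.

```python
import math

def correlation(xs, ys):
    """Return r, or None when r is undefined (fewer than two data, or an SD of zero)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        return None  # cannot convert to standard units
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd_x * sd_y)

xs = [float(i) for i in range(10)]

r_rising = correlation(xs, [2.0 * x + 1.0 for x in xs])    # line with positive slope
r_falling = correlation(xs, [-3.0 * x + 4.0 for x in xs])  # line with negative slope
r_flat = correlation(xs, [5.0 for _ in xs])                # horizontal line: SD of Y is 0

print(r_rising, r_falling, r_flat)
```

The steepness of the line does not matter, only the sign of its slope; on a horizontal line the SD of Y is zero, so Y cannot be put in standard units and r is not defined.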