Correlation and Association

The point of averages and the two numbers SDX and SDY give us some information about a scatterplot, but they do not tell us the extent of the association between the variables. The correlation coefficient r is a quantitative measure of association: it tells us whether the scatterplot tilts up or down, and how tightly the data cluster around a straight line. In this chapter we study the correlation coefficient, and when it can be used with the point of averages, SDX, and SDY to summarize scatterplots.

The Correlation Coefficient

The figure is a scatterplot of the GMAT data, which were introduced earlier, and it has a new number below the scatterplot: the correlation coefficient r. When you first load this page, the figure should be a scatterplot of Quantitative GMAT score versus Verbal GMAT score. If not, please select those variables from the drop-down menus at the top of the figure. Then the figure should show r = 0.35. Quantitative GMAT and Verbal GMAT are positively associated: Students with above-average Quantitative GMAT scores tend also to have above-average Verbal GMAT scores, and students with below-average Quantitative GMAT scores tend to have below-average Verbal GMAT scores. The value of r is positive when variables are positively associated; the value of r is negative when variables are negatively associated. The value of r is always between −1 and +1.

Sometimes we shall use subscripts to clarify which correlation coefficient we are talking about: The symbol rXY denotes the correlation coefficient for X and Y. The correlation coefficient for a scatterplot of Y versus X is always equal to the correlation coefficient for a scatterplot of X versus Y (rXY = rYX). See for yourself: Change the figure to show the scatterplot of Verbal GMAT versus Quantitative GMAT. You should see r = 0.35 again.
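
The symmetry rXY = rYX can also be checked numerically. The following is a minimal Python sketch, assuming the usual definition of r as the average of the products of the two variables in standard units (this section does not spell out the formula). The function name correlation and the five pairs of scores are invented for illustration; they are not the GMAT data.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average of the products of x and y in standard units (assumed definition of r)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)   # SDs computed with divisor n
    return mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))

# Hypothetical scores for five students (NOT the GMAT data from the text).
verbal = [28, 35, 31, 40, 36]
quantitative = [30, 33, 38, 41, 35]

r_xy = correlation(verbal, quantitative)
r_yx = correlation(quantitative, verbal)

# Swapping the roles of the variables leaves r unchanged (up to rounding).
print(r_xy, r_yx)
```

Swapping the two lists only changes the order in which each product is formed, so the two results agree up to floating-point rounding.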

Look at scatterplots of different pairs of variables in the GMAT data (ignore the School variable). There are six pairs of variables:

The following exercise helps train your eye to see small differences in correlation in scatterplots.

Correlation is a measure of linear association: how nearly a scatterplot follows a straight line. Two variables are positively correlated if the scatterplot slopes upwards (r > 0); they are negatively correlated if the scatterplot slopes downward (r < 0). Note that linear association is not the only kind of association: Some variables are nonlinearly associated (discussed later in this chapter). For example, the average monthly rainfall in Berkeley, CA, is associated with the month of the year, but that association is nonlinear: It is a seasonal variation that cycles annually. Correlation does not measure nonlinear association, only linear association. The correlation coefficient is appropriate only for quantitative variables, not ordinal or categorical variables, even if their values are numerical.

Correlation is a measure of association, not causation. For example, the average height of people at maturity in the United States has been increasing for decades. Similarly, there is evidence that the number of plant species is decreasing with time. These two variables have a negative correlation, but there is no (straightforward) causal connection between them. Each variable has a secular trend (a long-term change over time), and the two trends manifest as a correlation between the variables.
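
A small numerical sketch may help show how two trends over time produce a correlation. The numbers below are made up purely for illustration (they are not real measurements of height or of plant species), and the function correlation assumes the usual definition of r as the average product in standard units.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average of the products of x and y in standard units (assumed definition of r)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))

# Invented values, one per decade: height drifts up, species count drifts down.
decade = [0, 1, 2, 3, 4, 5]
height_cm = [170.0, 170.6, 170.9, 171.6, 172.1, 172.4]   # hypothetical
species = [1000, 985, 958, 941, 925, 897]                 # hypothetical

r_time_height = correlation(decade, height_cm)    # strongly positive
r_time_species = correlation(decade, species)     # strongly negative
r_height_species = correlation(height_cm, species)

# Both series track time, so they are strongly negatively correlated with
# each other, although nothing in the data links height to species.
print(r_time_height, r_time_species, r_height_species)
```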

The correlation coefficient r is close to 1 if the data cluster tightly around a straight line that slopes up from left to right. The correlation coefficient is close to −1 if the data cluster tightly around a straight line that slopes down from left to right. If the data show no linear trend, the correlation coefficient r is close to zero, even if the variables have a strong nonlinear association. The figure lets you make scatterplots with specific values of the correlation coefficient r and specific numbers of data points n. Note that r is undefined if n is less than two.
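
One way to build a scatterplot with a chosen r and n is sketched below in Python. This is not necessarily how the interactive tool works; it is a construction under stated assumptions: r is taken to be the average product in standard units, the helper names standard_units, correlation and scatter_with_r are hypothetical, and the construction needs n of at least 3.

```python
import math
import random
from statistics import mean, pstdev

def standard_units(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Average of the products of x and y in standard units (assumed definition of r)."""
    return mean(a * b for a, b in zip(standard_units(xs), standard_units(ys)))

def scatter_with_r(r, n, seed=0):
    """Return lists x, y of length n (n >= 3) whose correlation is r, up to rounding."""
    rng = random.Random(seed)
    x = standard_units([rng.gauss(0, 1) for _ in range(n)])
    noise = standard_units([rng.gauss(0, 1) for _ in range(n)])
    # Strip from the noise the part that is linearly related to x,
    # so that what remains has correlation zero with x.
    slope = mean(a * e for a, e in zip(x, noise))
    resid = standard_units([e - slope * a for a, e in zip(x, noise)])
    # Mix the linear part and the unrelated part in proportions set by r.
    y = [r * a + math.sqrt(1 - r * r) * e for a, e in zip(x, resid)]
    return x, y

x, y = scatter_with_r(0.35, 50)
print(correlation(x, y))
```

Because the residual is uncorrelated with x and both are in standard units, the mixture has correlation r with x by construction; only floating-point rounding separates the result from the requested value.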

The following exercises check your knowledge of basic facts about r, and your ability to gauge r by eye.

Nonlinear Association

Some scatterplots have curved patterns. Such scatterplots are said to show nonlinear association between the two variables. The correlation coefficient does not reflect nonlinear relationships between variables, only linear ones. For example, even if the association is quite strong, if it is nonlinear, the correlation coefficient r can be small or zero.

In the scatterplot, the scatter in Y for a given value of X is very small, so the association is strong. In fact, there is a deterministic relationship between the two variables: Y = sin(X). (The plot is half a period of the sine function.) Even though the association is perfect (one can predict Y exactly from X), the correlation coefficient r is exactly zero. This is because the association is purely nonlinear. The correlation coefficient measures whether there is a linear trend in the data, and what fraction of the scatter in the data is accounted for by that trend.
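
The sine example can be reproduced numerically. The sketch below assumes evenly spaced values of X over half a period, from 0 to π (the spacing and the number of points, 101, are choices made here, not taken from the figure), and the usual definition of r as the average product in standard units.

```python
import math
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average of the products of x and y in standard units (assumed definition of r)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))

# 101 evenly spaced points covering half a period of the sine function.
xs = [math.pi * i / 100 for i in range(101)]
ys = [math.sin(x) for x in xs]

r = correlation(xs, ys)
print(r)   # zero up to floating-point rounding
```

The curve is symmetric about X = π/2, so each point to the left of the middle is cancelled by its mirror image to the right when the products are averaged.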

In the next scatterplot, the correlation coefficient is reasonably large (0.71), because there is an overall trend in the data. However, the correlation coefficient still does not show how strongly associated the variables are, because the pattern of their relationship is curved (nonlinear). The correlation coefficient is not a good summary of the association of these variables.

Correlation and Association

The correlation coefficient r measures only linear association: how nearly the data fall on a straight line. It is not a good summary of association if the scatterplot has a nonlinear (curved) pattern.

Homoscedasticity and Heteroscedasticity

Recall that data are homoscedastic if the SD of the values of Y for points in a vertical slice through the scatterplot is about the same, regardless of the location of the slice. In contrast, if the SD of the values of Y in a vertical slice varies a great deal depending on the location of the slice, the data are heteroscedastic. All the scatterplots we have seen so far in this chapter are roughly homoscedastic. The figure shows a heteroscedastic scatterplot with the corresponding correlation coefficient.

The scatter in a vertical slice near the right of the plot is much larger than the scatter in a vertical slice near the left of the plot. There is not much association between Y and X, but the correlation coefficient is still 0.15, an artifact of the heteroscedasticity.
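
Comparing the SD of Y in different vertical slices is easy to do numerically. The sketch below uses simulated data, not the data in the figure: X is uniform on 0 to 10, and the spread of Y is made to grow with X. The helper slice_sd, the slice boundaries, and the random seed are all arbitrary choices for illustration.

```python
import random
from statistics import pstdev

rng = random.Random(1)   # fixed seed so the sketch is reproducible

# Simulated heteroscedastic data: the SD of Y grows with X.
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [rng.gauss(0, 0.2 + 0.5 * x) for x in xs]

def slice_sd(xs, ys, lo, hi):
    """SD of the Y values for points whose X lies in the slice [lo, hi)."""
    return pstdev([y for x, y in zip(xs, ys) if lo <= x < hi])

sd_left = slice_sd(xs, ys, 0, 2)     # vertical slice near the left
sd_right = slice_sd(xs, ys, 8, 10)   # vertical slice near the right

print(sd_left, sd_right)
```

For homoscedastic data the two slice SDs would be about equal; here the slice on the right has a much larger SD than the slice on the left.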

Correlation and Heteroscedasticity

The correlation coefficient r is not a good summary of association if the data are heteroscedastic.

Outliers

Recall that a datum that does not fit the overall pattern in the data or that is many SD from the other data in at least one of its coordinates is called an outlier. A single outlier that is far from the point of averages can have a large effect on the correlation coefficient. The figures show two extreme examples. In the first, the outlier makes the correlation coefficient nearly one; without it, the correlation coefficient would be nearly zero. In the second, the outlier makes the correlation coefficient nearly zero; without it, the correlation coefficient would be nearly one.
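
Both extremes can be imitated with small made-up data sets. In the sketch below, the 3-by-3 grid, the ten points on a line, and the two outliers are invented for illustration and are not the data in the figures; correlation assumes the usual definition of r as the average product in standard units.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average of the products of x and y in standard units (assumed definition of r)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))

# Case 1: a 3-by-3 grid of points has no linear trend at all.
grid_x = [x for x in (0, 1, 2) for _ in (0, 1, 2)]
grid_y = [y for _ in (0, 1, 2) for y in (0, 1, 2)]
r_grid = correlation(grid_x, grid_y)
# One far-away point at (20, 20) pulls r close to one.
r_grid_outlier = correlation(grid_x + [20], grid_y + [20])

# Case 2: ten points that lie exactly on the line y = x.
line_x = list(range(10))
line_y = list(range(10))
r_line = correlation(line_x, line_y)
# One far-away point at (30, 1) pulls r close to zero.
r_line_outlier = correlation(line_x + [30], line_y + [1])

print(r_grid, r_grid_outlier, r_line, r_line_outlier)
```

In both cases the added point is far from the point of averages, which is what gives it so much leverage over r.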

The figure lets you add points to the scatterplot by clicking the scatterplot; a point is added wherever the cursor is. Adding a point typically will change the correlation coefficient. You can add as many points as you wish. Click the Clear Added Points button to delete the points you added. If you click the Ignore Added Points button, the new points will not be included in computing the correlation coefficient.

Try adding a point to the scatterplot and seeing how much you can change the correlation coefficient. Clear the point, and try again. See how large and how small you can make the correlation coefficient by adding just one point. You should be able to change r from 0 to plus or minus 0.12 or more. If you could add a point beyond the limits of the plot, you could make r vary from nearly −1 to nearly 1. The following exercise checks your understanding of the influence a single point can have on the correlation coefficient.

Correlation and Outliers

The correlation coefficient r is not a good summary of association if the data have outliers.

If a scatterplot does not show nonlinearity, heteroscedasticity or outliers, it is "football-shaped."

Five-Number Summary of Football-Shaped Scatterplots

Football-shaped scatterplots can be summarized rather well by five numbers:

the mean of X, the mean of Y, the SD of X, the SD of Y, and r.
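
The five numbers are straightforward to compute together. The following Python sketch uses a hypothetical helper named five_number_summary and a tiny invented data set; it assumes SDs with divisor n and r defined as the average product in standard units.

```python
from statistics import mean, pstdev

def five_number_summary(xs, ys):
    """Mean of X, mean of Y, SD of X, SD of Y, and r for paired data."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))
    return {"mean_x": mx, "mean_y": my, "sd_x": sx, "sd_y": sy, "r": r}

# Invented data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

summary = five_number_summary(xs, ys)
print(summary)
```

These five numbers are a good summary only when the scatterplot is football-shaped; the function computes them for any data, so the plot still has to be inspected first.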

Summary

Correlation is a measure of linear association between two variables. If larger than average values of X tend to occur in conjunction with larger than average values of Y and smaller than average values of X tend to occur in conjunction with smaller than average values of Y, the correlation coefficient rXY of X and Y is positive. If larger than average values of X tend to occur in conjunction with smaller than average values of Y and smaller than average values of X tend to occur in conjunction with larger than average values of Y, rXY is negative. The correlation coefficient of X and Y is always between −1 and +1. If the points in a scatterplot of Y versus X fall on a straight line with slope greater than zero, rXY = 1. If the points in a scatterplot of Y versus X fall on a straight line with slope less than zero, rXY = −1. If the points in a scatterplot of Y versus X fall on a horizontal line, rXY is not defined.
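
The three boundary cases at the end of the paragraph above can be checked directly. In this sketch, correlation returns None when either SD is zero; that return value is a convention chosen here to represent "not defined", and the data are invented. As elsewhere, r is assumed to be the average product in standard units.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Return r, or None when either SD is zero and r is not defined."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None   # standard units cannot be formed
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
r_up = correlation(xs, [3 * x + 1 for x in xs])       # line with slope > 0
r_down = correlation(xs, [2 - 0.5 * x for x in xs])   # line with slope < 0
r_flat = correlation(xs, [7, 7, 7, 7])                # horizontal line

print(r_up, r_down, r_flat)
```

On a horizontal line the SD of Y is zero, so Y cannot be put in standard units; that is why r is not defined in that case.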

The correlation coefficient does not measure all kinds of association—only linear association. The correlation coefficient, the point of averages, SDX and SDY summarize football-shaped scatterplots well, but not scatterplots that show nonlinearity, heteroscedasticity or outliers. Two variables can have strong nonlinear association and small or zero correlation. A single outlier can make the correlation coefficient small or large.

Strong correlation between two variables does not entail a causal relationship between them; neither does a causal relationship between two variables entail any correlation between them. Beware claims of causality on the basis of correlation.

Key Terms