Computing the Correlation Coefficient

This chapter explains how to calculate the correlation coefficient r, a quantitative measure of linear association. Calculating r for a pair of variables involves transforming them to standard units, then taking the average of the product of the two variables in standard units.

Ecological correlation is the correlation coefficient calculated for averages of individuals, rather than for individuals. Ecological correlations say little about the (linear) association for individuals; generally, ecological correlations tend to overstate the strength of the association for individuals.

Computing r

We saw in an earlier chapter that the correlation coefficient measures linear association. We know that r does not measure nonlinear association. We know that the value of r can be deceptive if the data are heteroscedastic or contain outliers. We know that r is always between −1 and +1. We know how to estimate r by eye. But we do not know how to compute r from data. In this section, we shall learn how to compute the correlation coefficient: r is the average product of X and Y, after putting X and Y on an equal footing by transforming them to standard units, that is, to standard deviations above the mean.

Standard units

Standard units are a way of putting different kinds of observations on the same scale. The idea is to replace a datum by the number of standard deviations it is above the mean of the data. If a datum is above the mean, its value in standard units is positive; if it is below the mean, its value in standard units is negative. A datum that is above the mean by 2.5 times the SD is 2.5 in standard units.

datum in standard units = number of SDs the datum is above the mean = \( \frac{\mbox{original datum} - \mbox{mean of original data}}{\mbox{SD of original data}} \).

When a list is transformed to standard units, the mean of the new list is zero, and the SD of the new list is one: that is what it means for a set of data to be in standard units. Standard units are dimensionless. If the original list has units, the original SD has the same units. To transform a measurement to standard units, we divide the measurement (minus the mean) by the SD, which cancels the original units.
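The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `standard_units` and the data are made up for the example, and the SD is assumed to be the population SD (dividing by the number of data), which is what makes the SD of the converted list exactly one.

```python
from statistics import fmean, pstdev

def standard_units(data):
    """Convert a list of numbers to standard units.

    Assumption: SD means the population SD (divide by n, not n - 1).
    """
    mean = fmean(data)
    sd = pstdev(data)
    if sd == 0:
        # All data are equal, so "number of SDs above the mean" is undefined.
        raise ValueError("SD is zero; standard units are undefined")
    return [(x - mean) / sd for x in data]

# Made-up illustrative data, in centimeters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standard_units(heights_cm)

print(z)                     # values below the mean are negative, above are positive
print(fmean(z), pstdev(z))   # mean is 0 and SD is 1, up to rounding error
```

The converted list has no units: the centimeters in the numerator and in the SD cancel.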

If we know the mean and SD of the original data, we can restore a datum that is in standard units to the original units of measurement, as follows:

original value = (value in standard units) × SD + mean.

Note that both the transformation from original units to standard units and the transformation from standard units to original units are affine transformations: each multiplies the value by a constant and then adds a constant.
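As a rough check of the two formulas, the sketch below converts a made-up list to standard units and back again. The function names and the data are hypothetical, chosen only for this illustration, and the SD is assumed to be the population SD.

```python
from statistics import fmean, pstdev

def to_standard_units(x, mean, sd):
    """(original datum - mean of original data) / (SD of original data)."""
    return (x - mean) / sd

def from_standard_units(z, mean, sd):
    """(value in standard units) * SD + mean."""
    return z * sd + mean

# Made-up illustrative scores.
scores = [52.0, 61.0, 70.0, 79.0, 88.0]
mean, sd = fmean(scores), pstdev(scores)

z = [to_standard_units(x, mean, sd) for x in scores]
restored = [from_standard_units(v, mean, sd) for v in z]
print(restored)  # matches the original scores, up to rounding error

# A value 2.5 SDs above a mean of 70, when the SD is 10:
print(from_standard_units(2.5, 70.0, 10.0))  # 95.0
```

Both functions multiply by a constant and add a constant, so each one undoes the other.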

 

The Sign of a Value in Standard Units

Values that are larger than the mean are positive in standard units.

Values that are less than the mean are negative in standard units.

Computing r

The correlation coefficient r of two variables X and Y is the average of the product of X in standard units and Y in standard units. You must be sure to multiply the measurements corresponding to the same individual. The order in which you take the individuals doesn't matter, but you should not change the order of one set of measurements relative to the other.
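The recipe above can be sketched as follows. The names `standard_units` and `correlation` and the data are assumptions made for this illustration, and the SD is again taken to be the population SD. The last lines show why the pairing matters: reversing one list relative to the other changes which measurements are multiplied together, and r changes.

```python
from statistics import fmean, pstdev

def standard_units(data):
    """Convert a list to standard units (population SD assumed)."""
    mean, sd = fmean(data), pstdev(data)
    if sd == 0:
        raise ValueError("SD is zero; standard units are undefined")
    return [(v - mean) / sd for v in data]

def correlation(x, y):
    """Average of the products of x and y in standard units.

    x[i] and y[i] must be measurements on the same individual.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have one value per individual")
    zx, zy = standard_units(x), standard_units(y)
    return fmean([a * b for a, b in zip(zx, zy)])

# Made-up illustrative data: five individuals, two measurements each.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
print(r)  # about 0.775

# Reordering only one list breaks the pairing and changes r.
r_mispaired = correlation(x, y[::-1])
print(r_mispaired)  # about -0.775
```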

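The correlation coefficient does not change if the lists are transformed in any of the following ways:

Adding a constant to every element of one list, or of both lists (the constants may differ).

Multiplying every element of one list, or of both lists, by a positive constant (the constants may differ).

Multiplying every element of both lists by negative constants (the constants may differ).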

This is because the first step in calculating the correlation coefficient is to convert the two lists to standard units, and any of those changes (except the last) produces two lists that—after converting to standard units—are the same as the original lists in standard units. The last transformation produces two lists that, after converting to standard units, are the negatives of the original lists in standard units. The two negative signs cancel when the corresponding elements of the lists are multiplied.

If only one of the lists is multiplied by a negative number, r changes sign, but has the same magnitude. A positive association becomes negative, and vice versa.

Because computing the correlation coefficient involves converting both lists to standard units and multiplying the results, and because multiplication does not depend on the order of the factors, it does not matter which list is first:

\( r_{XY} = r_{YX} \).
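The properties above (adding constants or rescaling by positive constants leaves r alone, multiplying one list by a negative number flips its sign, and the order of the lists does not matter) can be checked numerically. The sketch below uses made-up data and a hypothetical `correlation` helper with the population SD; it is a spot check on one data set, not a proof.

```python
from statistics import fmean, pstdev

def standard_units(data):
    """Convert a list to standard units (population SD assumed)."""
    mean, sd = fmean(data), pstdev(data)
    return [(v - mean) / sd for v in data]

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    zx, zy = standard_units(x), standard_units(y)
    return fmean([a * b for a, b in zip(zx, zy)])

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = correlation(x, y)

# Adding a constant to one list: r is unchanged.
r_shifted = correlation([a + 10.0 for a in x], y)

# Multiplying the lists by (different) positive constants: r is unchanged.
r_scaled = correlation([3.0 * a for a in x], [0.5 * b for b in y])

# Multiplying both lists by negative constants: the signs cancel.
r_both_negative = correlation([-2.0 * a for a in x], [-1.0 * b for b in y])

# Multiplying only one list by a negative constant: r changes sign.
r_one_negative = correlation([-2.0 * a for a in x], y)

# Swapping the roles of the two lists: r is unchanged.
r_swapped = correlation(y, x)

print(r, r_shifted, r_scaled, r_both_negative, r_one_negative, r_swapped)
```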

Two football-shaped scatterplots can have the same correlation coefficient but look quite different if the SDs of the variables are different.

Ecological Correlation

Correlations based on averages can be arbitrarily misleading if they are interpreted to be about individuals. Correlations based on averages are usually too high, because they ignore the variability across individuals. Correlation of averages is called ecological correlation.

For example, consider the GMAT data set, averaged by school. That is, there are now five "individuals": each one is one of the five schools. The first-year MBA GPA for a school is the average of the first-year MBA GPAs of all the students at that school, and so on.

For the averaged data, the correlation of quantitative and verbal GMAT scores is 0.95; for the original data, it was only 0.35. If you interpreted the correlation based on averaged data as the association between quantitative and verbal GMAT scores for individuals, you would be way off: There is far more scatter in individual students' scores. This effect is called "ecological correlation."

On the other hand, averaging can reduce correlations. For example, the correlation of the averaged first-year MBA GPA and undergraduate GPA is zero, while for the original data, it is 0.24.
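To see how averaging can inflate a correlation, here is a small sketch on constructed data (not the GMAT data). Everything in it is an assumption made for illustration: five groups whose centers lie on a line, and four individuals per group scattered around the center so that there is no association within any group. The `correlation` helper uses the population SD.

```python
from statistics import fmean, pstdev

def standard_units(data):
    """Convert a list to standard units (population SD assumed)."""
    mean, sd = fmean(data), pstdev(data)
    return [(v - mean) / sd for v in data]

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    zx, zy = standard_units(x), standard_units(y)
    return fmean([a * b for a, b in zip(zx, zy)])

# Constructed data: group centers (c, c) lie exactly on a line.
centers = [0.0, 1.0, 2.0, 3.0, 4.0]
spread = 3.0  # how far individuals sit from their group's center
offsets = [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # no within-group association

groups = []
for c in centers:
    xs = [c + spread * sx for sx, _ in offsets]
    ys = [c + spread * sy for _, sy in offsets]
    groups.append((xs, ys))

# Correlation for individuals: pool all 20 individuals.
all_x = [v for xs, _ in groups for v in xs]
all_y = [v for _, ys in groups for v in ys]
r_individuals = correlation(all_x, all_y)

# Ecological correlation: one point per group, the group averages.
avg_x = [fmean(xs) for xs, _ in groups]
avg_y = [fmean(ys) for _, ys in groups]
r_averages = correlation(avg_x, avg_y)

print(r_individuals)  # about 0.18
print(r_averages)     # 1, up to rounding error
```

Averaging removes the scatter within each group, so the five averaged points fall on a line even though the association among the 20 individuals is weak.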

For a large group of college students, the ages of Freshmen will vary, as will the ages within other years, so the correlation coefficient for age and the number of years one has been in school will not equal 1.

However, if we take the average ages for each class {Freshmen, Sophomores, Juniors, Seniors}, the averages will probably be very close to 19, 20, 21, and 22, respectively, and the average number of years of education will be pretty close to 13, 14, 15, and 16. The correlation coefficient between the average age in a class and the average number of years of education in a class will be much closer to 1. Nonetheless, we cannot predict an individual's age very well from the number of years he or she has been in school.

For a really extreme example, imagine dividing the university population into two groups, faculty and undergraduates (we leave out graduate students and staff). The ages and educational levels vary both within and across these groups, but what happens if we plot just the averages for the two groups? We will get two points, one with average age about 20 and average number of years of education about 14, and one with average age closer to 45 and average number of years of education about 22. These two points will lie on a straight line (any two points do), and the line will have positive slope (the faculty are older on the average, and have more years of education on the average), so the correlation coefficient will be +1.
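The two averaged examples above can be checked with a short computation. The sketch takes the approximate averages quoted in the text as if they were exact, which is an assumption, and uses a hypothetical `correlation` helper based on the population SD.

```python
from statistics import fmean, pstdev

def standard_units(data):
    """Convert a list to standard units (population SD assumed)."""
    mean, sd = fmean(data), pstdev(data)
    return [(v - mean) / sd for v in data]

def correlation(x, y):
    """Average of the products of x and y in standard units."""
    zx, zy = standard_units(x), standard_units(y)
    return fmean([a * b for a, b in zip(zx, zy)])

# Class averages (Freshmen, Sophomores, Juniors, Seniors), treated as exact.
class_avg_age = [19.0, 20.0, 21.0, 22.0]
class_avg_years = [13.0, 14.0, 15.0, 16.0]
r_classes = correlation(class_avg_age, class_avg_years)

# Two groups only: undergraduates and faculty, treated as exact.
group_avg_age = [20.0, 45.0]
group_avg_years = [14.0, 22.0]
r_two_groups = correlation(group_avg_age, group_avg_years)

print(r_classes)     # 1, up to rounding error
print(r_two_groups)  # 1, up to rounding error
```

With only two points, each list in standard units is just −1 and +1, so the average product is +1 whenever both averages increase together.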

 

Ecological Correlation

Correlation coefficients of averages are called ecological correlations.

Correlations of averages of measurements can differ enormously from correlations of individual measurements.

Typically, they are much larger, but they can be smaller, too.

When you examine a claim that the association between two variables is strong, be alert to the possibility that the stated correlation is an ecological correlation. If it is, the correlation coefficient for individuals could be quite different, and it tends to be smaller in magnitude.

Summary

Converting to standard units makes different variables commensurable. A measurement in standard units is the number of SDs the measurement is above the mean. Values larger than the mean are positive in standard units; values below the mean are negative in standard units. The mean of a list in standard units is zero, and the SD of a list in standard units is 1. The correlation coefficient of X and Y is the average of the products of X and Y in standard units. It is important to multiply the value of X by the value of Y for the same individual.

Ecological correlations are correlation coefficients of averages across groups of individuals, rather than correlation coefficients for individuals. Ecological correlations tend to be stronger than the correlation coefficient for individuals, although the opposite is also possible. Beware arguments about association that rely on ecological correlations.

Key Terms