Association

Many interesting questions about multivariate data involve association. For example, within the United States, there is a negative association between the amount of money an individual spends on healthcare in a given year and the number of years he survives beyond that year: the more spent on healthcare, the shorter the life expectancy. What does this mean? Could it be true? Does healthcare shorten your life?

Association is a property of two or more variables. In this example, the two variables are the amount an individual spends on healthcare, and the number of additional years the individual survives. Association is not the same as causation: Two variables can be strongly associated but have no causal connection, or can have a causal connection and no discernible association. Concluding that association implies a causal relationship is the post hoc ergo propter hoc fallacy, discussed later in this chapter. This chapter presents tools for studying more than one variable at a time, and the relationships among variables, including association.

Association of Pairs of Variables

We augmented measures of location for single variables by measures of spread, and we shall do the same for pairs of variables. A measure of the horizontal spread is the SD of the variable plotted on the horizontal or \(x\) axis. We shall write this as \( SD_X \). Similarly, the \(SD\) of the variable plotted on the vertical or \(y\) axis is a measure of the vertical spread. We'll write this as \( SD_Y \).

The \(SD\) is a measure of the scatter in a list. The typical deviation of the \(x\) coordinate of a point from the mean of the \(x\) coordinates is \( SD_X \). The typical deviation of the \(y\) coordinate of a point from the mean of the \(y\) coordinates is \( SD_Y \). We know from Chebychev's inequality that, for example, the \(x\) coordinates of at least 75% of the points will be within \( \pm 2 \times SD_X \) of the \(x\) coordinate of the point of averages, and that the \(y\) coordinates of at least 75% of the points will be within \( \pm 2 \times SD_Y \) of the \(y\) coordinate of the point of averages. However, in narrow ranges of \(x\) (vertical slices), the scatter in \(y\) might typically be smaller than \( SD_Y \), and in narrow ranges of \(y\) (horizontal slices), the scatter in \(x\) might typically be smaller than \( SD_X \). If so, the two variables are associated.
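
The Chebychev bound quoted above can be checked numerically. The following is a minimal sketch, assuming made-up skewed data rather than the GMAT data; the function name `fraction_within` is hypothetical, and the \(SD\) is computed as the population \(SD\) (dividing by the number of values).

```python
import random
import statistics


def fraction_within(values, k):
    """Fraction of values that lie within k SDs of their mean.

    Uses the population SD (statistics.pstdev), i.e. divides by n.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


# Hypothetical, deliberately skewed data; NOT the GMAT data.
rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(1000)]

frac = fraction_within(xs, 2)
print(f"fraction of x values within 2 SD_X of the mean: {frac:.3f}")
```

Chebychev's inequality guarantees that the printed fraction is at least 0.75 for any list whatsoever; for most data it is considerably larger, because the inequality is a worst-case bound.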

If individuals with larger than average values of one variable tend to have larger than average values of the other, and individuals with smaller than average values of one variable tend to have smaller than average values of the other, the scatter of the values of \(Y\) in vertical slices through the scatterplot will be smaller than \( SD_Y \). Such a scatterplot shows positive association. If individuals with larger than average values of one variable tend to have smaller than average values of the other, and individuals with smaller than average values of one variable tend to have larger than average values of the other, the scatter of the values of \(Y\) in vertical slices through the scatterplot also will be smaller than \( SD_Y \); this is called negative association. Positive and negative associations are examples of linear association; variables can be associated nonlinearly as well.

The figure shows the scatterplot of the GMAT data again, but this time with four new lines: two vertical lines at the mean value of \(X\) plus and minus \( SD_X \), the \(SD\) of the variable plotted on the horizontal axis, and two horizontal lines at the mean value of \(Y\) plus and minus \( SD_Y \), the \(SD\) of the variable plotted on the vertical axis. There is also a new button, labeled "No SDs." Click the button. The label will change to "SDs," and the lines will go away. Buttons on figures in this book usually say what will happen when you click them. If the figure does not show the scatterplot of Verbal GMAT versus Quantitative GMAT, change the variables using the drop-down menus at the top of the figure.

The cloud of points in the scatterplot tilts slightly upward towards the right. Individuals with larger than average Quantitative GMAT scores tend to have larger than average Verbal GMAT scores, and individuals with smaller than average Quantitative GMAT scores tend to have smaller than average Verbal GMAT scores, so these variables are positively associated. The association between the Verbal and Quantitative GMAT scores of these students is not very strong. Knowing that a particular student scored above average on the Verbal GMAT does not let us guess how well the student did on the Quantitative GMAT much more accurately than we could have without knowing the student's Verbal GMAT score. For example, the student with the highest Quantitative GMAT score didn't do nearly as well on the Verbal GMAT as many other students, and the student with the lowest Quantitative GMAT score did above average on the Verbal portion of the exam. The overall vertical scatter in the scatterplot, \( SD_Y \), is the \(SD\) of the Quantitative GMAT scores, which is about 6.77. Now consider just those students whose Verbal GMAT scores are between 43 and 45 (there are 77 such students). The \(SD\) of the Quantitative GMAT scores of those students was only about 5.91, rather less than the overall \(SD\) of the Quantitative GMAT scores, because there is an association between the variables. If we took other "slices" through the data (other narrow ranges of Verbal GMAT), we would typically find something similar: the \(SD\) of the Quantitative GMAT scores for students whose Verbal GMAT scores are in narrow ranges tends to be a bit smaller than the overall \(SD\) of the Quantitative GMAT scores.

Two variables are associated if knowing the value of one of them tells us something about the value of the other. Slightly more precisely, \(X\) and \(Y\) are associated if the \(SD\) of the values of \(Y\) of points whose \(X\) coordinates are in a narrow range of values (a vertical slice through the scatterplot) is smaller than the overall \(SD\) of \(Y\), or if the \(SD\) of the values of \(X\) of points whose \(Y\) coordinates are in a narrow range of values (a horizontal slice through the scatterplot) is smaller than the overall \(SD\) of \(X\). In the GMAT scatterplot, the \(SD\) of values of Quantitative GMAT in narrow ranges of Verbal GMAT is about 5% smaller than the overall \(SD\) of Quantitative GMAT: They are associated, but only weakly.
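
The slice-based definition can be tried out on simulated data. This is only a sketch under stated assumptions: the data are synthetic (not the GMAT data), the linear-plus-noise model is invented for illustration, and `slice_sd` is a hypothetical helper name.

```python
import random
import statistics


def slice_sd(xs, ys, lo, hi):
    """Return (SD, count) of the y values of points with lo <= x <= hi.

    This is the vertical scatter in one vertical slice, using the
    population SD.
    """
    in_slice = [y for x, y in zip(xs, ys) if lo <= x <= hi]
    if not in_slice:
        raise ValueError("no points fall in the slice")
    return statistics.pstdev(in_slice), len(in_slice)


# Synthetic data (an assumption, not from the text):
# y tends to rise with x, plus independent noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [x + rng.gauss(0, 1) for x in xs]

overall_sd = statistics.pstdev(ys)
sd_in_slice, n_in_slice = slice_sd(xs, ys, -0.25, 0.25)

print(f"overall SD of y:       {overall_sd:.2f}")
print(f"SD of y in the slice:  {sd_in_slice:.2f} ({n_in_slice} points)")
```

In this simulation the scatter of \(y\) within the slice is smaller than the overall \(SD\) of \(y\), which is what the definition calls association. A single slice is only suggestive; the definition refers to what typically happens across slices.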

Use the drop-down menus to plot first-year MBA GPA versus undergraduate GPA. Again, there is a slight positive association between these variables: The cloud of points tilts upward to the right. Students who had higher than average GPAs as undergraduates tended to have higher than average GPAs in their first year of business school. This is not surprising. What is perhaps surprising is how much scatter there is: The undergraduate GPA really does not predict the graduate GPA very well. If it did, the scatterplot would have less scatter in any vertical slice through the plot than it does. The association between these variables is even weaker than the association between Verbal and Quantitative GMAT.

If the association were strong, we could do a good job of predicting the first-year MBA GPA of a student from his or her undergraduate GPA. Because the association is weak, knowing a student's undergraduate GPA doesn't help us very much to predict his or her performance in the first year of an MBA program. This is part of what makes the admissions screening process difficult, and why schools combine several criteria in making admissions decisions.

There is a good reason the undergraduate GPA might not be a good predictor of MBA GPA for these students. How does a student get into the data set? He or she must have been admitted to an MBA program. If the admissions process balances undergraduate GPA with other factors that might predict whether a student will succeed, such as letters of recommendation, GMAT scores, etc., you might expect that the students with below average (for this group) undergraduate GPAs were admitted precisely because there were other reasons for thinking the student would succeed, as reflected in the first year MBA GPA. Indeed, this seems to be the case: The association between undergraduate GPA and first year MBA GPA is weaker for those students whose undergraduate GPA was significantly below average.

The histogram tool lets you look at the distribution of one variable for subsets of a multivariate data set defined by restricting some of the variables to various ranges.

Use the drop-down menu at the top of the tool to select "1st year MBA GPA" as the variable to show in the histogram. You should now see a green histogram of all the first year MBA GPAs. Now do the following:

  1. In the Restrict to list box, select Undergrad. GPA.
  2. Click the >= check box.
  3. Type 3.8 into the text box after >= and press Return or Enter.

Now you will see two histograms superposed on one another. The blue histogram is that of all 913 students' first-year MBA GPAs. The green histogram is that of the first year MBA GPAs of just those 59 students whose undergraduate GPA was 3.8 or above. For each class interval, the shorter bin is plotted in front, so you can see the height of every bin for both the original data and the restricted data. The bottom line of the tool, after the List Data button, shows the number of individuals in the full data set (913), the mean of their first-year MBA GPAs (3.1088), the \(SD\) of their first-year MBA GPAs (0.492), the number of individuals in the restricted data set (59), the mean of just their first-year MBA GPAs (3.5167), and the \(SD\) of just their first-year MBA GPAs (0.3853).

Notice that the first-year MBA GPAs of the 59 students who did really well as undergraduates (GPA 3.8 or above) are higher on average than those of the overall group of 913 students, and the scatter in their GPAs is smaller: The association between these variables is positive.

Change the restrictions on the undergraduate GPA as follows:

  1. Type 2.6 into the >= text box and press Return or Enter.
  2. Click the <= check box.
  3. Type 2.8 into the <= text box and press Return or Enter.

The blue histogram remains as it was, but the green histogram is that of the first-year MBA GPAs of just those 82 students whose undergraduate GPA was between 2.6 and 2.8. In the last line of the tool, you can see that the \(SD\) of the first-year MBA GPAs of the students in this slice is larger (0.5489) than that of the entire group of 913 students (0.492). That is the case in many such slices through the scatterplot, so the association is weak.
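
The numbers on the bottom line of the tool (count, mean, and \(SD\), for the full and the restricted data) can be reproduced for any data set with a few lines of code. The sketch below uses a tiny made-up list of students, not the 913 students in the text; the field names `ugpa` and `mba_gpa` and the helpers `restrict` and `summarize` are hypothetical.

```python
import statistics


def restrict(records, key, lo=None, hi=None):
    """Keep records whose `key` value is >= lo and <= hi (either bound optional)."""
    return [
        r for r in records
        if (lo is None or r[key] >= lo) and (hi is None or r[key] <= hi)
    ]


def summarize(values):
    """Return (count, mean, population SD) of a list of numbers."""
    return len(values), statistics.fmean(values), statistics.pstdev(values)


# Made-up records for illustration only; NOT the data set in the text.
students = [
    {"ugpa": 3.9, "mba_gpa": 3.6},
    {"ugpa": 3.8, "mba_gpa": 3.4},
    {"ugpa": 3.2, "mba_gpa": 3.0},
    {"ugpa": 2.7, "mba_gpa": 3.5},
    {"ugpa": 2.6, "mba_gpa": 2.5},
]

everyone = summarize([s["mba_gpa"] for s in students])
high = summarize([s["mba_gpa"] for s in restrict(students, "ugpa", lo=3.8)])
middle = summarize([s["mba_gpa"] for s in restrict(students, "ugpa", lo=2.6, hi=2.8)])

for label, (n, mean, sd) in [("all", everyone), (">= 3.8", high), ("2.6 to 2.8", middle)]:
    print(f"{label:>10}: n={n}, mean={mean:.3f}, SD={sd:.3f}")
```

Passing only `lo` corresponds to checking the >= box alone; passing both `lo` and `hi` corresponds to checking both boxes.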

The following exercises address assessing association from scatterplots and superposed histograms.

Post Hoc Ergo Propter Hoc

Association between variables often is used, erroneously, as evidence that there is a causal relationship between variables. The introduction to this chapter noted that there is a negative association between money spent on healthcare and life expectancy: The more one spends on healthcare in a given year, the fewer additional years one tends to survive. Does that mean that spending money on healthcare tends to shorten one's life?

Certainly not. As noted in the introduction, it is generally the sickest individuals who spend the most on healthcare in a given year. Their life expectancies are short, whether or not they get healthcare. Healthcare probably lengthens their lives. The negative association of these variables has little to do with any causal relationship between them.
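
A small simulation can make this argument concrete. The model below is entirely hypothetical (the severity scale, dollar amounts, and years are invented, not estimated from any data): sicker people spend more, sickness shortens life, and healthcare lengthens life by construction. Even so, the people who spend more survive fewer years on average.

```python
import random
import statistics

rng = random.Random(2)

# Entirely hypothetical model; none of these numbers come from real data.
people = []
for _ in range(2000):
    severity = rng.random()  # 0 = healthy, 1 = very sick
    spending = max(0.0, 10_000 * severity + rng.gauss(0, 500))
    benefit = 0.0002 * spending  # healthcare ADDS years, by construction
    years = max(0.0, 40 * (1 - severity) + benefit + rng.gauss(0, 2))
    people.append((spending, years))

# Compare survival for those who spend more vs. less than the median.
median_spending = statistics.median([s for s, _ in people])
mean_high = statistics.fmean([y for s, y in people if s > median_spending])
mean_low = statistics.fmean([y for s, y in people if s <= median_spending])

print(f"mean years survived, high spenders: {mean_high:.1f}")
print(f"mean years survived, low spenders:  {mean_low:.1f}")
```

The negative association in this simulation comes entirely from severity of illness, which drives both variables; the causal effect of spending on survival in the model is positive.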

More generally, association does not measure causation. There is a fallacy of logic known since ancient times: post hoc ergo propter hoc, which translates as "after this, therefore because of this." It is common to assume that if two things are associated, there is some causal relationship between them: One causes the other.

That is simply fallacious. For example, the Moon and Earth are gradually getting further apart. Similarly, the Dow-Jones Industrial Average (DJIA) generally has had an upward trend on a time scale of decades: both have positive secular trends. Does the increase in the DJIA cause the Earth and Moon to separate? Does the increased distance between Earth and Moon cause the DJIA to go up? Does some other single thing make them both increase? I think none of these is plausible.

As another example, consider the quantitative GMAT scores and first-year MBA GPAs of the students in the GMAT data set. Even though students with higher than average quantitative GMAT scores tended to have higher than average first-year MBA GPAs, getting higher quantitative GMAT scores did not cause the GPA to be higher. If a student takes a GMAT preparation class, that might help his or her GMAT score. Will it also improve the student's first-year MBA GPA? Probably not. Can doing better in the first year of business school cause a student's GMAT score to go up? Certainly not if the GMAT was taken before the first year of business school.

Work the following exercises to check your understanding of the difference between association and causation.

Summary

    Two variables are associated if the scatter of one variable in slices defined by restricting the other variable to a small range is generally less than the overall scatter of the first variable. (A vertical slice is a set of points with \(x\) coordinates that are in a restricted range. A horizontal slice is a set of points with \(y\) coordinates that are in a restricted range. Vertical scatter is the \(SD\) of the \(y\) coordinates of a collection of points; horizontal scatter is the \(SD\) of the \(x\) coordinates of a collection of points.)

    One can detect association by comparing the histogram of one variable for the entire data set with the histogram of subsets of the data defined by restricting the other variable to small ranges: If the variables are associated, the histograms for the subsets will have less spread than the histogram for the entire data set has. It is easier to see association between pairs of variables with scatterplots. Association can be linear or nonlinear. If two variables are linearly associated, the points in their scatterplot are scattered more or less symmetrically around a straight line. If two variables are nonlinearly associated, the points in their scatterplot are scattered around a curve.

    Two variables are positively associated if individuals with higher than average values of one variable also tend to have higher than average values of the other variable, and individuals with lower than average values of one variable tend to have lower than average values of the other variable. Two variables are negatively associated if individuals with higher than average values of one variable tend to have lower than average values of the other variable, and vice versa. Linear association is always positive or negative. Nonlinear association need not be positive or negative.

    Association is not causation: two variables can have strong association and have no causal connection, and two variables can have a causal (deterministic) connection and no association. Post hoc ergo propter hoc is the fallacy of concluding from association between two variables that the variables have a cause-and-effect relationship.

    Key Terms