Many interesting questions about multivariate data involve association.
For example, within the United States, there is a *negative association*
between the amount of money an individual spends on healthcare in a given year
and the number of years that individual survives beyond that year: the more spent on
healthcare, the shorter the life expectancy.
What does this mean?
Could it be true?
Does healthcare shorten your life?

Association is a property of two or more
variables.
In this example, the two variables are the amount
an individual spends on healthcare, and the number of additional years the individual
survives.
Association is not the same as causation: Two variables can be strongly associated but have no
causal connection, or can have a causal connection and no discernible association.
Concluding that association implies a causal relationship is the *post hoc ergo propter hoc*
fallacy discussed in

We augmented measures of location for single variables by measures of spread, and we shall do the same for pairs of variables. A measure of the horizontal spread is the \(SD\) of the variable plotted on the horizontal or \(x\) axis. We shall write this as \( SD_X \). Similarly, the \(SD\) of the variable plotted on the vertical or \(y\) axis is a measure of the vertical spread. We'll write this as \( SD_Y \).

The \(SD\) is a measure of the scatter in a list. The typical deviation of the \(x\) coordinate of a point from the mean of the \(x\) coordinates is \( SD_X \). The typical deviation of the \(y\) coordinate of a point from the mean of the \(y\) coordinates is \( SD_Y \). We know from Chebychev's inequality that, for example, the \(x\) coordinates of at least 75% of the points will be within \( \pm 2 \times SD_X \) of the \(x\) coordinate of the point of averages, and that the \(y\) coordinates of at least 75% of the points will be within \( \pm 2 \times SD_Y \) of the \(y\) coordinate of the point of averages. However, in narrow ranges of \(x\) (vertical slices), the scatter in \(y\) might typically be smaller than \( SD_Y \), and in narrow ranges of \(y\) (horizontal slices), the scatter in \(x\) might typically be smaller than \( SD_X \). If so, the two variables are associated.
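
The Chebychev bound quoted above can be checked numerically. The following is a minimal sketch, assuming the \(SD\) is the population SD (dividing by the number of values); the helper name `fraction_within` and the skewed list are invented for illustration and are not part of any data set discussed here.

```python
import random
import statistics

def fraction_within(values, k):
    """Fraction of values within k population SDs of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

random.seed(0)
# Hypothetical, strongly skewed list (not real data)
skewed = [random.expovariate(1.0) for _ in range(1000)]

frac = fraction_within(skewed, 2)
print(f"fraction within 2 SD of the mean: {frac:.3f}")
print("Chebychev bound (at least 0.75) holds:", frac >= 0.75)
```

The bound holds for any list, however skewed; for most lists the actual fraction is well above 75%.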

If individuals with larger than average values of one variable tend to have larger
than average values of the other, and individuals with smaller than average values
of one variable tend to have smaller than average values of the other, the scatter
of the values of \(Y\) in vertical slices through the scatterplot will be smaller than
\( SD_Y \).
Such a scatterplot shows *positive*
association.
If individuals with larger than average values of one variable tend to have
smaller than average values of the other, and individuals with smaller than
average values of one variable tend to have larger than average values of the other,
the scatter of the values of \(Y\) in vertical slices through the scatterplot also will
be smaller than \( SD_Y \); this is called *negative association*.
Positive and negative associations are examples of *linear association*;
variables can be associated *nonlinearly* as well.

The cloud of points in the scatterplot tilts slightly upward towards the right.
Individuals with larger than average Quantitative GMAT scores tend to have larger
than average Verbal GMAT scores, and individuals with smaller than average Quantitative
GMAT scores tend to have smaller than average Verbal GMAT scores, so these variables
are *positively associated*.
The association between the Verbal and Quantitative GMAT scores of these students is
not very strong. Knowing that a particular student scored above average on the
Verbal GMAT does not let us guess how well the student did on the Quantitative GMAT
much more accurately than we could have without knowing the student's Verbal
GMAT score.
For example, the student with the highest Quantitative GMAT score didn't do nearly
as well on the Verbal GMAT as many other students, and the student with the lowest
Quantitative GMAT score did above average on the Verbal portion of the exam.
The overall vertical scatter in the scatterplot, \( SD_Y \), is the \(SD\) of the Quantitative
GMAT scores, which is about 6.77.
Now consider just those students whose Verbal GMAT scores are between 43 and 45
(there are 77 such students).
The \(SD\) of the Quantitative GMAT scores of those students was only about 5.91, rather less
than the overall \(SD\) of the Quantitative GMAT scores, because there is an association
between the variables.
If we took other "slices" through the data—other narrow ranges of
Verbal GMAT—we would typically find something similar: the
\(SD\) of the Quantitative
GMAT scores for students whose Verbal GMAT scores are in narrow ranges tends to be
a bit smaller than the overall \(SD\) of the Quantitative GMAT scores.
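
The slice comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the points are synthetic (they are not the GMAT data), the linear relationship and noise level are arbitrary choices, and `vertical_slice_sd` is a hypothetical helper name.

```python
import random
import statistics

def vertical_slice_sd(points, lo, hi):
    """Return (population SD of y, count) for points whose x lies in [lo, hi]."""
    ys = [y for x, y in points if lo <= x <= hi]
    return statistics.pstdev(ys), len(ys)

random.seed(1)
# Synthetic (x, y) pairs: y tends to rise with x, plus independent noise
points = []
for _ in range(5000):
    x = random.gauss(40, 7)
    y = 0.6 * x + random.gauss(0, 5)
    points.append((x, y))

overall_sd_y = statistics.pstdev([y for _, y in points])
slice_sd_y, n_in_slice = vertical_slice_sd(points, 43, 45)

print(f"overall SD of y: {overall_sd_y:.2f}")
print(f"SD of y for 43 <= x <= 45: {slice_sd_y:.2f} ({n_in_slice} points)")
```

Changing the slice limits passed to `vertical_slice_sd` shows whether other narrow ranges of \(x\) behave the same way.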

Two variables are *associated* if knowing the value of one of them tells us
something about the value of the other.
Slightly more precisely, \(X\) and
\(Y\) are associated if the
\(SD\) of the values of \(Y\) of points
whose \(X\) coordinates are in a narrow range of values (a vertical slice through the
scatterplot) is smaller than the overall \(SD\) of \(Y\),
or if the \(SD\) of the values of \(X\) of
points whose \(Y\) coordinates are in a narrow range of values (horizontal slice through
the scatterplot) is smaller than the overall \(SD\) of \(X\).

Use the drop-down menus to plot first-year MBA GPA versus undergraduate GPA. Again, there is a slight positive association between these variables: The cloud of points tilts upward to the right. Students who had higher than average GPAs as undergraduates tended to have higher than average GPAs in their first year of business school. This is not surprising. What is perhaps surprising is how much scatter there is: The undergraduate GPA really does not predict the graduate GPA very well. If it did, the scatterplot would have less scatter in any vertical slice through the plot than it does. The association between these variables is even weaker than the association between Verbal and Quantitative GMAT.

If the association were strong, we could do a good job of predicting the first-year MBA GPA of a student from his or her undergraduate GPA. Because the association is weak, knowing a student's undergraduate GPA doesn't help us very much to predict his or her performance in the first year of an MBA program. This is part of what makes the admissions screening process difficult, and why schools combine several criteria in making admissions decisions.

There is a good reason the undergraduate GPA might not be a good predictor of MBA GPA for these students. How does a student get into the data set? He or she must have been admitted to an MBA program. If the admissions process balances undergraduate GPA with other factors that might predict whether a student will succeed, such as letters of recommendation and GMAT scores, you might expect that the students with below-average (for this group) undergraduate GPAs were admitted precisely because there were other reasons for thinking they would succeed, and that success is reflected in their first-year MBA GPAs. Indeed, this seems to be the case: The association between undergraduate GPA and first-year MBA GPA is weaker for those students whose undergraduate GPA was significantly below average.

Use the drop-down menu at the top of

- In the Restrict to list box, select Undergrad. GPA.
- Click the >= check box.
- Type 3.8 into the text box after >= and press Return or Enter.

Now you will see two histograms superposed on one another. The blue histogram is that of all 913 students' first-year MBA GPAs. The green histogram is that of the first-year MBA GPAs of just those 59 students whose undergraduate GPA was 3.8 or above. For each class interval, the shorter bin is plotted in front, so you can see the height of every bin for both the original data and the restricted data. The bottom line of the tool, after the List Data button, shows the number of individuals in the full data set (913), the mean of their first-year MBA GPAs (3.1088), the \(SD\) of their first-year MBA GPAs (0.492), the number of individuals in the restricted data set (59), the mean of just their first-year MBA GPAs (3.5167), and the \(SD\) of just their first-year MBA GPAs (0.3853).

Notice that the first-year MBA GPAs of the 59 students who did really well as undergraduates (GPA 3.8 or above) are higher on average than those of the overall group of 913 students, and the scatter in their scores is smaller: The association between these variables is positive.

Change the restrictions on the undergraduate GPA as follows:

- Type 2.6 into the >= text box and press Return or Enter.
- Click the <= check box.
- Type 2.8 into the <= text box and press Return or Enter.

The blue histogram remains as it was, but the green histogram is that of the first-year MBA GPAs of just those 82 students whose undergraduate GPA was between 2.6 and 2.8. In the last line of the tool, you can see that the \(SD\) of the first-year MBA GPAs of the students in this slice is larger (0.5489) than that of the entire group of 913 students (0.492). That is the case in many such slices through the scatterplot, so the association is weak.
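
The counts, means, and \(SD\)s reported for the full and restricted data can be reproduced in spirit with a short script. This is a sketch only: the eight (undergraduate GPA, first-year MBA GPA) pairs below are invented for illustration and are not the 913 students' records, and `summarize` and `restrict` are hypothetical helper names, not the tool's own code.

```python
import statistics

def summarize(values):
    """Return (count, mean, population SD) of a list of numbers."""
    return len(values), statistics.fmean(values), statistics.pstdev(values)

def restrict(pairs, lo=None, hi=None):
    """Keep the y values of (x, y) pairs with lo <= x <= hi; None means no bound."""
    return [y for x, y in pairs
            if (lo is None or x >= lo) and (hi is None or x <= hi)]

# Invented (undergraduate GPA, first-year MBA GPA) pairs, for illustration only
pairs = [(2.6, 2.1), (2.7, 3.9), (2.8, 3.0), (3.2, 3.1),
         (3.5, 3.3), (3.8, 3.6), (3.9, 3.4), (4.0, 3.8)]

full = summarize(restrict(pairs))                      # no restriction
high = summarize(restrict(pairs, lo=3.8))              # undergraduate GPA >= 3.8
low = summarize(restrict(pairs, lo=2.6, hi=2.8))       # 2.6 <= undergraduate GPA <= 2.8

for label, (n, mean, sd) in [("full", full), ("high", high), ("low", low)]:
    print(f"{label}: n={n}, mean={mean:.3f}, SD={sd:.3f}")
```

In this made-up list, as in the text, the high slice has a larger mean and a smaller \(SD\) than the full list, while the low slice has a larger \(SD\) than the full list.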

The following exercises address assessing association from scatterplots and superposed histograms.

Association between variables often is used, erroneously, as evidence that there is a causal relationship between variables. The introduction to this chapter noted that there is a negative association between money spent on healthcare and life expectancy: The more one spends on healthcare in a given year, the fewer additional years one tends to survive. Does that mean that spending money on healthcare tends to shorten one's life?

Certainly not. As noted in the introduction, it is generally the sickest individuals
who spend the most on healthcare in a given year.
Their life expectancies are short, whether or not they get healthcare.
Healthcare probably lengthens their lives.
The negative association of these variables has little to do with any causal relationship
between them.

More generally, association does not measure causation.
There is a fallacy of logic known since ancient times: *post hoc ergo propter hoc*,
which translates as "after this, therefore because of this."
It is common to assume that if two things are associated, there is some causal
relationship between them: One causes the other.

That is simply fallacious. For example, the Moon and Earth are gradually getting farther apart. Similarly, the Dow Jones Industrial Average (DJIA) generally has had an upward trend on a time scale of decades: both have positive secular trends. Does the increase in the DJIA cause the Earth and Moon to separate? Does the increased distance between Earth and Moon cause the DJIA to go up? Does some other single thing make them both increase? I think none of these is plausible.
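
A small simulation shows how two quantities that share nothing but an upward secular trend end up associated. This is a hedged sketch: both series are invented (they are not Earth-Moon distances or DJIA values), and the slopes and noise levels are arbitrary assumptions.

```python
import random
import statistics

random.seed(2)
years = range(100)
# Two invented series; each depends only on time plus its own independent noise
series_a = [1.0 * t + random.gauss(0, 5) for t in years]
series_b = [3.0 * t + random.gauss(0, 15) for t in years]

overall_sd_b = statistics.pstdev(series_b)

# Vertical slice: values of series_b in the years when series_a is in a narrow range
in_slice = [b for a, b in zip(series_a, series_b) if 40 <= a <= 60]
slice_sd_b = statistics.pstdev(in_slice)

print(f"overall SD of series_b: {overall_sd_b:.1f}")
print(f"SD of series_b when 40 <= series_a <= 60: {slice_sd_b:.1f} ({len(in_slice)} years)")
```

The scatter of `series_b` within the slice is much smaller than its overall scatter, so the two series are associated, even though neither was generated from the other.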

As another example, consider the quantitative GMAT scores and first-year MBA GPAs of the students in the GMAT data set. Even though students with higher than average quantitative GMAT scores tended to have higher than average first-year MBA GPAs, getting higher quantitative GMAT scores did not cause the GPA to be higher. If a student takes a GMAT preparation class, that might help his or her GMAT score. Will it also improve the student's first-year MBA GPA? Probably not. Can doing better in the first year of business school cause a student's GMAT score to go up? Certainly not if the GMAT was taken before the first year of business school.

Work the following exercises to check your understanding of the difference between association and causation.


Two variables are associated
if the scatter of one variable in *slices* defined by restricting the other
variable to a small range is generally less than the overall scatter of the first variable.
(A *vertical slice* is a set of points with \(x\) coordinates that are in a
restricted range.
A *horizontal slice* is a set of points with \(y\) coordinates that are in a
restricted range.
Vertical scatter is the \(SD\) of the \(y\) coordinates of a
collection of points; horizontal scatter is the \(SD\) of the \(x\) coordinates of a
collection of points.)

One can detect association by comparing the histogram
of one variable for the entire
data set with the histogram of subsets of the data defined by restricting
the other variable to small ranges: If the variables are associated, the
histograms for the subsets
will have less spread than the histogram for the entire data set has.
It is easier to see association between pairs of variables with
scatterplots.
Association can be *linear* or *nonlinear*.
If two variables are linearly associated,
the points in their scatterplot are scattered more or less symmetrically around a
straight line.
If two variables are nonlinearly associated, the points in their scatterplot
are scattered around a curve.

Two variables are *positively associated* if individuals with higher
than average values of one variable also tend to have higher than average
values of the other variable, and individuals with lower than average values
of one variable tend to have lower than average values of the other variable.
Two variables are *negatively associated* if individuals with higher
than average values of one variable tend to have lower than average
values of the other variable, and *vice versa*.
Linear association is always positive or negative.
Nonlinear association need not be positive or negative.
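
To see how an association can be neither positive nor negative, consider points scattered exactly on a parabola. The sketch below uses made-up points with \(y = x^2\) and \(x\) symmetric about zero; the variable names are illustrative only.

```python
import statistics

# Made-up points on the curve y = x^2, with x running from -5.0 to 5.0
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]

mean_x = statistics.fmean(xs)
# Average y among points with above-average x, and among points with below-average x
mean_y_above = statistics.fmean([y for x, y in zip(xs, ys) if x > mean_x])
mean_y_below = statistics.fmean([y for x, y in zip(xs, ys) if x < mean_x])

overall_sd_y = statistics.pstdev(ys)
# Vertical slice: points whose x lies between 2 and 3
slice_sd_y = statistics.pstdev([y for x, y in zip(xs, ys) if 2.0 <= x <= 3.0])

print(f"mean y for above-average x: {mean_y_above:.3f}")
print(f"mean y for below-average x: {mean_y_below:.3f}")
print(f"overall SD of y: {overall_sd_y:.3f}; SD of y in the slice: {slice_sd_y:.3f}")
```

Points with above-average \(x\) and points with below-average \(x\) have the same average \(y\), so the association is neither positive nor negative, yet the scatter in the vertical slice is far smaller than the overall scatter, so the variables are associated.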

Association is not causation: two variables can have strong association and
have no causal connection, and two variables can have a causal
(deterministic) connection and no association.
*Post hoc ergo propter hoc* is the fallacy of concluding from
association between two variables that the variables have a
cause-and-effect relationship.

- association
- bin
- bivariate
- Chebychev's inequality
- football-shaped
- heteroscedastic
- homoscedastic
- horizontal slice
- linear association
- mean
- multivariate
- nonlinear association
- negative association
- nonlinearity
- outlier
- point of averages
- positive association
- *post hoc ergo propter hoc*
- range
- scatterplot
- secular trend
- variable
- vertical slice