Multivariate Data and Scatterplots

Many sets of data involve measurements of a variety of variables for one set of individuals. The individuals might be people, places, objects, times, etc. For instance, we might measure the heights and weights of a particular set of people. Or we might measure the weights of a particular set of people, before and after they increase their daily exercise. Or we might measure the SAT scores of a group of students before and after they take a preparatory course. Or we might measure the viral load of a group of AIDS patients before and after taking a drug regimen. These are examples of bivariate data, because there are two measurements per individual. More generally, multivariate data involves making two or more measurements per individual. Scatterplots are a way to visualize multivariate data to help classify and understand the relationships among the variables.

Multivariate Data

So far, we have been looking at one variable at a time. We now start to look at the relationship among two or more variables, each measured for the same collection of individuals. An "individual" is not necessarily a person: it might be an automobile, a place, a family, a university, etc. For example, the two variables might be the heights of a man and of his son, in which case the "individual" is the pair (father, son). Such pairs of measurements are called bivariate data. Observations of two or more variables per individual in general are called multivariate data.

We will use the GMAT data as an example of a multivariate data set. These data were made available by Howard Wainer of the Educational Testing Service. The data comprise 5 variables measured for each of 913 individuals, who were then students in their second year of an MBA program at five good business schools. The variables are the undergraduate GPA, verbal and quantitative GMAT scores, first-year MBA GPA, and an integer indicating which of the five business schools the student attended. I do not know the year these data were collected. The accompanying applet allows you to display a histogram of each of those variables in turn:

The drop-down menu at the top of the applet (which should show "Verbal GMAT" when you first visit the page) lets you select which variable in the data set is displayed in the histogram. The List Data button opens a table of the values of the 5 variables for all 913 students. The "Mean" and "SD" are the mean and SD of the variable currently displayed.

By looking at these five histograms, we can learn about the distribution of each of the variables. For example, both the verbal and quantitative GMAT scores have means of about 35 and SDs a bit over 6 points. That is, on the average, these students scored about 35 points on the verbal GMAT and about 35 points on the quantitative GMAT, but individual scores varied from those averages, typically by about 6 points.
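As a rough illustration of the two summaries quoted above, the following sketch computes the mean and SD of a short list of scores. The five scores are made up for illustration and are not taken from the GMAT data; the sketch also assumes that "SD" means the population SD (dividing by the number of data), which is what statistics.pstdev computes.

```python
import statistics

# Hypothetical scores, invented for illustration (not from the GMAT data set).
verbal = [31, 35, 40, 28, 41]

# The mean is the sum of the scores divided by the number of scores.
mean_verbal = statistics.mean(verbal)

# Assumption: SD here is the population SD, which divides by n rather than n - 1.
sd_verbal = statistics.pstdev(verbal)

print(f"mean = {mean_verbal:.2f}, SD = {sd_verbal:.2f}")
```

The SD is the typical size of the deviations of individual scores from the mean, which is the sense in which scores "varied from those averages" above.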

Did students with higher than average verbal GMAT scores also tend to have higher than average quantitative GMAT scores (are verbal and quantitative GMAT scores positively associated)? Or, perhaps, was there a tendency for students who did better than average on the verbal GMAT to do worse than average on the quantitative GMAT (are verbal and quantitative GMAT scores negatively associated)? Similarly, did students who had higher than average undergraduate GPAs tend to do better than average in their first year as MBA students? Suppose you were the director of admissions for an MBA program. Which variables seem to predict how a student will do in his or her first year in the MBA program? How would you decide whom to admit?

Such questions are hard to answer using just the five histograms. These questions are about the association of the measured variables. The histograms say nothing about the association. The association is also quite hard to see directly from the list of data, especially for lists as long as this one. (Try it: click the List Data button and see whether you can find a relationship among the variables.) To see association graphically, we need to display more than one variable at a time.

Scatterplots

One of the best tools for studying the association of two variables graphically is the scatterplot or scatter diagram. Scatterplots are especially helpful when the number of data is large—studying a list is then virtually hopeless. A scatterplot plots two measured variables against each other, for each individual. That is, the x (horizontal) coordinate of a point in a scatterplot is the value of one measurement (X) of an individual, and the y (vertical) coordinate of that point is the other measurement (Y) of the same individual. We call such a plot a scatterplot of Y versus X or a scatterplot of Y against X.

The scatterplot applet shows scatterplots of pairs of variables from the GMAT data.

Initially, the scatterplot should show the quantitative GMAT scores (on the vertical or "y" axis) versus the verbal GMAT scores (on the horizontal or "x" axis). You can change which variable is plotted against which by using the drop-down menus containing the variable names, located at the top of the figure. If the scatterplot is not of quantitative GMAT versus verbal GMAT, please change to those variables.

Clicking the List Data button opens a table of the data. Clicking the Univariate Stats button opens a window that contains summary statistics of the 5 variables: the number of individuals for whom each variable was measured, the mean and SD of each variable, and the minimum, lower quartile, median, upper quartile and maximum of each variable.

The red square in the middle of the scatterplot is the point of averages. Its horizontal coordinate is the mean of the values of the variable plotted on the horizontal axis (the mean verbal GMAT score at first), and its vertical coordinate is the mean of the values of the variable plotted on the vertical axis (the mean quantitative GMAT score at first). The point of averages is a measure of the "center" of a scatterplot, quite analogous to the mean as a measure of the center of a list.
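The point of averages can be computed directly from the paired measurements. The sketch below is a minimal illustration; the function name point_of_averages and the four (verbal, quantitative) pairs are hypothetical and are not taken from the GMAT data.

```python
import statistics

def point_of_averages(xs, ys):
    """Return (mean of xs, mean of ys) for paired measurements."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must hold one value per individual")
    return statistics.mean(xs), statistics.mean(ys)

# Hypothetical paired scores, one pair per student (invented for illustration).
verbal = [31, 35, 40, 34]   # plotted on the horizontal axis
quant = [37, 33, 60, 30]    # plotted on the vertical axis

center = point_of_averages(verbal, quant)
print(center)
```

Note that the point of averages need not coincide with any individual's data: here no hypothetical student has exactly the mean of both scores.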

Put the cursor over the point of averages. The "meter" at the bottom of the plot that looks like

x = 35.nn y = 35.nn

will show that the x (horizontal) coordinate of the cursor is about 35 and the y (vertical) coordinate of the cursor is also about 35. Put the cursor over the highest blue dot on the plot. You should be able to tell from the meter that the x-value of that point is about 40, and its y-value is about 60. The dot corresponds to a single student whose verbal GMAT score was about 40, and whose quantitative GMAT score was about 60. That student scored above average on both parts of the GMAT test, but much further above average in the quantitative score.

Click the List Data button to open a table of all the student data. The top line shows that the variables are:

School "1st year MBA GPA" "Verbal GMAT" "Quant. GMAT" "Undergrad. GPA"

Each row in the table corresponds to one student. The first column in each row is the student's business school, the second column is his first-year MBA GPA, the third column is his verbal GMAT score, the fourth column is his quantitative GMAT score, and the fifth column is his undergraduate GPA. The first row in the table is:

1 3.155 31 37 3.53.

That is, the first student attended business school 1, had a first-year MBA GPA of 3.155, scored 31 on the verbal GMAT, 37 on the quantitative GMAT, and had an undergraduate GPA of 3.53.
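One way to read such a row programmatically is sketched below. It assumes the values in a row are separated by whitespace and appear in the column order listed above; the Student type and the parse_row function are hypothetical names introduced for illustration, not part of the applet.

```python
from typing import NamedTuple

class Student(NamedTuple):
    """One row of the table (hypothetical field names)."""
    school: int
    mba_gpa: float
    verbal_gmat: int
    quant_gmat: int
    undergrad_gpa: float

def parse_row(line: str) -> Student:
    # Assumption: five whitespace-separated values in the column order above.
    school, mba_gpa, verbal, quant, ugpa = line.split()
    return Student(int(school), float(mba_gpa), int(verbal), int(quant), float(ugpa))

# The first row of the table, as quoted in the text.
first = parse_row("1 3.155 31 37 3.53")
print(first.verbal_gmat, first.quant_gmat)
```

The pair (first.verbal_gmat, first.quant_gmat) gives the coordinates of that student's point in the scatterplot of quantitative GMAT versus verbal GMAT.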

Find the record of the student whose quantitative GMAT score was 60. (Hint: it's the 13th student in school 3.) Click that row of the table. The corresponding point in the scatterplot should turn yellow. You can highlight any number of points by clicking the corresponding rows of the table. To clear the highlighting of a point, click its row again.

The following exercises check your ability to use the scatterplot applet to answer questions about multivariate data.

Describing Scatterplots

Scatterplots let us see the relationships among variables. Does one variable tend to be larger when another is large? Does the relationship follow a straight line? Is the scatter in one variable the same, regardless of the value of the other variable?

Linearity and Nonlinearity

The accompanying scatterplot illustrates a linear relationship between the variables. The scatterplot is roughly football-shaped: the points do not lie exactly on a line, but are scattered more-or-less evenly around one. (Note: this figure will be different every time you visit or reload the page.)

The next scatterplot illustrates nonlinearity. The pattern in the relationship between the variables is not a straight line; it is curved. The data are scattered more-or-less evenly around a curve: the scatter in the values of Y is about the same for different values of X, that is, in different vertical slices through the scatterplot.

Homoscedasticity and Heteroscedasticity

When the scatter in Y is about the same in different vertical slices through a scatterplot, the data (and the scatterplot) are said to be homoscedastic (equal scatter). So far, all the plots in this section have been homoscedastic. In a scatterplot of heteroscedastic data, by contrast, the scatter in vertical slices depends on where you take the slice.
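A rough numerical check for equal scatter is to compute the SD of Y separately within each vertical slice and compare the results. The sketch below assumes the SD is the population SD; the function name slice_sds, the slice boundaries, and the data are all hypothetical.

```python
import statistics

def slice_sds(xs, ys, edges):
    """SD of the y-values in each vertical slice edges[i] <= x < edges[i+1].

    Slices with fewer than two points get None, since scatter is not
    meaningful for a single point.
    """
    sds = []
    for lo, hi in zip(edges, edges[1:]):
        in_slice = [y for x, y in zip(xs, ys) if lo <= x < hi]
        sds.append(statistics.pstdev(in_slice) if len(in_slice) >= 2 else None)
    return sds

# Hypothetical data: tight scatter near x = 1, wide scatter near x = 5.
xs = [1, 1, 1, 1, 5, 5, 5, 5]
ys = [9, 11, 9, 11, 0, 20, 0, 20]

sds = slice_sds(xs, ys, [0, 3, 6])
print(sds)
```

Here the SD in the second slice is ten times the SD in the first, so these made-up data are heteroscedastic. For homoscedastic data the slice SDs would be roughly equal; how close counts as "about the same" is a judgment call.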

Outliers

A point that does not fit the overall pattern of the data, or that is many SDs from the bulk of the data, is called an outlier. The accompanying scatterplots are examples of data with a large outlier.

The following exercises check your ability to categorize scatterplots.

Summary

Multivariate data are observations of two or more variables per individual. Two variables at a time (bivariate data) can be displayed in a scatterplot. The points in a scatterplot represent individuals. The coordinates of each point are the values of the two variables for that individual. The point of averages in a scatterplot is the point with coordinates

(mean of X, mean of Y),

where X is the variable plotted on the x (horizontal) axis and Y is the variable plotted on the y (vertical) axis. Two variables are associated if the scatter of one variable in slices defined by restricting the other variable to a small range is less than the overall scatter of the first variable.
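The definition of association above can be checked numerically by comparing the scatter of Y within a narrow slice of X to the overall scatter of Y. The following is a minimal sketch under stated assumptions: scatter is measured by the population SD, and the function name scatter_in_slice and the data are hypothetical.

```python
import statistics

def scatter_in_slice(xs, ys, lo, hi):
    """Population SD of the y-values whose x-value lies in [lo, hi]."""
    return statistics.pstdev([y for x, y in zip(xs, ys) if lo <= x <= hi])

# Hypothetical data in which y tends to increase with x.
xs = [1, 1, 2, 2, 3, 3]
ys = [10, 12, 20, 22, 30, 32]

overall = statistics.pstdev(ys)            # scatter of Y ignoring X
within = scatter_in_slice(xs, ys, 2, 2)    # scatter of Y when X is restricted to 2

print(within < overall)
```

Restricting X to a small range greatly reduces the scatter of Y in these made-up data, which is what it means for the two variables to be associated.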

Outliers are points many standard deviations away from the bulk of the data in at least one of their coordinates. Homoscedasticity means same scatter: The vertical scatter in different vertical slices through the scatterplot is about the same, regardless of where the slice is centered. Heteroscedasticity means different scatter: The vertical scatter in different vertical slices varies appreciably, depending on where the slice is centered. If a scatterplot shows linear association (or no association), homoscedasticity, and no outliers, it is said to be football-shaped.
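The description of outliers above suggests a simple screening rule: flag any point whose x- or y-coordinate is more than some number of SDs from the corresponding mean. The sketch below is one possible reading, not a definitive test. The threshold of 2.5 SDs, the function name flag_outliers, and the data are hypothetical, and the mean and SD are computed from all the data (including any outliers) rather than from "the bulk of the data", which is a simplification.

```python
import statistics

def flag_outliers(xs, ys, threshold=2.5):
    """Return indices of points more than `threshold` SDs from the mean
    in at least one coordinate (population SD; hypothetical threshold)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    flagged = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        if abs(x - mx) > threshold * sx or abs(y - my) > threshold * sy:
            flagged.append(i)
    return flagged

# Hypothetical data: the last point has an unusually large y-value.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 200]

outliers = flag_outliers(xs, ys)
print(outliers)
```

Because a large outlier inflates the SD computed from all the data, a rule like this can miss outliers in small data sets. Looking at the scatterplot remains the more reliable way to spot points that do not fit the overall pattern.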

Key Terms