This applet lets you study the relationship between pairs of variables using scatterplots, the correlation coefficient, the graph of averages, linear regression, and residual plots.

You can select one of four data sets using the drop-down menu, or type in the URL of a different data set. The data sets are described below, under "Description of the Data Sets."

The next two choice boxes let you select which variable in the selected data set to plot on the X axis, and which to plot on the Y axis. On monochrome monitors on Unix systems, the box itself might not be visible; click on the name of the variable to see the other choices.

The buttons should be self-explanatory. They let you plot ±1 SD from the point of averages, which is plotted in red; plot the SD line, the graph of averages (yellow squares), and the regression line; pop up a window containing the currently plotted data set; pop up a window containing summary statistics for each variable; toggle between a scatterplot of the data and a residual plot; use or ignore points you have added by clicking on the graph; and clear points you added previously by clicking on the graph. (The univariate summary statistics are always for the original data; they do not include any points you have added.) You can find the X and Y values for any point by positioning the mouse cursor over it: the coordinates of the cursor are given in the lower right corner of the applet. If you select some rows of data in the data set window and press "return," the corresponding points in the scatterplot are plotted in yellow rather than blue.
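
The quantities behind these buttons can be sketched in a few lines of Python. This is only an illustration, not the applet's own code: the function name `summarize` is hypothetical, and the sketch assumes the usual conventions that the SD is the population SD, that the SD line passes through the point of averages with slope ±SD_Y/SD_X (taking the sign of r), and that the regression line passes through the same point with slope r × SD_Y/SD_X.

```python
from statistics import mean, pstdev


def summarize(xs, ys):
    """Illustrative summary of a scatterplot (hypothetical helper).

    Assumes both lists have the same length and that neither
    variable is constant (otherwise an SD is zero and r is undefined).
    """
    mx, my = mean(xs), mean(ys)        # point of averages
    sx, sy = pstdev(xs), pstdev(ys)    # population SDs (an assumption)

    # r: average of the products of the values in standard units.
    r = mean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

    # SD line: through the point of averages, slope +/- SD_Y/SD_X.
    # When r is exactly 0 the sign is a matter of convention; + is used here.
    sd_slope = sy / sx if r >= 0 else -sy / sx

    # Regression line: through the point of averages, slope r * SD_Y/SD_X.
    reg_slope = r * sy / sx
    reg_intercept = my - reg_slope * mx

    # Residuals: observed Y minus the height of the regression line at X.
    residuals = [y - (reg_intercept + reg_slope * x) for x, y in zip(xs, ys)]

    return {
        "point_of_averages": (mx, my),
        "r": r,
        "sd_slope": sd_slope,
        "reg_slope": reg_slope,
        "reg_intercept": reg_intercept,
        "residuals": residuals,
    }
```

A residual plot is then just the scatterplot of `xs` against `residuals`. Because the regression slope is r times the SD-line slope, the two lines coincide only when r is ±1.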


Description of the Data Sets

Cities data. These data come from a September 25, 1987, article by W. Tucker in the National Review. Mr. Tucker presented the results of analyses by Prof. Jeffrey Simonoff of the Department of Statistics and Operations Research, Stern School of Business, New York University. The "conceivably relevant factors" Prof. Simonoff considered in studying the homeless rate per thousand population were the population size, vacancy rate, and unemployment rate, in 50 cities in the USA. The homeless figures for 35 of the cities came from the 1984 Report to the Secretary of Housing and Urban Development on Emergency Shelters and Homeless Populations. The homeless data for the other 15 cities (St. Louis, Santa Monica, Newark, Yonkers, Dallas-Fort Worth, Denver, Charleston WV, Atlanta, San Diego, New Orleans, Albuquerque, Tucson, Burlington, Milwaukee, Providence, and Lincoln NE) were from local sources, and were chosen because 1987 or 1988 homeless estimates for those cities happened to be available. The other data for those cities came from various federal agencies, such as the Census Bureau, HUD, and the NOAA (Prof. J. Simonoff, personal communication, 1998). The cities47 data set excludes the largest three of the 50 cities.

CCV data. The Correlation Check Vehicle (CCV) data were made available by Leo Breiman, Department of Statistics, UCB. The data were collected by the Environmental Protection Agency. The test vehicles are 1977 Chevrolet Novas, modified in various ways, including the removal of their catalytic converters and other emissions-control systems. These data were collected in 1979 at a single laboratory, and used constant engine load and fuel temperature. The measured variables are the emissions of hydrocarbons (HC), nitrogen oxides (NOx), and carbon monoxide (CO), measured in milligrams per mile. Three outliers whose cause was known were removed from the data set; 96 measurements of each of the three variables remain. "Test" is just a number that identifies the case, not a measurement.

GMAT data. These data were made available by Howard Wainer, Educational Testing Service. The data are the undergraduate GPAs, verbal and quantitative GMAT scores, and first-year MBA GPAs of 913 students from five major universities. I do not know how the students were selected, nor do I know the year of the study.

Univariate Statistics of the Data Sets

To use a different data set, give the URL of a file on the web. The file should be in the following format:

  1. The first line has the variable names, separated by tabs or spaces. Complicated names that contain spaces can be surrounded with quotation marks. If a line contains two slashes (//), everything from the slashes to the end of the line is ignored: it is a comment.
  2. The remaining lines contain the data, separated by spaces or tabs. The order of the data should be every variable for the first case, then every variable for the second case, etc., up to the last case. Again, everything in a line after two slashes is ignored.
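
A reader for this format might look like the following Python sketch. It is an illustration under stated assumptions rather than the applet's parser: the name `parse_dataset` is hypothetical, every data value is assumed to be numeric, and the data are read as one stream of values that is cut into cases by the number of variable names.

```python
import re


def parse_dataset(text):
    """Parse a data file in the format described above (hypothetical helper).

    Returns (names, cases), where cases is a list of lists of floats,
    one inner list per case, in the order of the variable names.
    """
    # Drop everything from "//" to the end of each line, then drop blank lines.
    lines = [line.split("//", 1)[0] for line in text.splitlines()]
    lines = [line for line in lines if line.strip()]
    if not lines:
        raise ValueError("no variable names found")

    # First line: names separated by whitespace; a name in double
    # quotation marks may contain spaces.
    pairs = re.findall(r'"([^"]*)"|(\S+)', lines[0])
    names = [quoted or bare for quoted, bare in pairs]

    # Remaining lines: a stream of values, case by case.
    # Assumption: all values are numeric.
    values = [float(tok) for line in lines[1:] for tok in line.split()]
    k = len(names)
    if len(values) % k != 0:
        raise ValueError("number of values is not a multiple of the number of variables")

    cases = [values[i:i + k] for i in range(0, len(values), k)]
    return names, cases
```

For example, a file whose first line is `X "Y value" // names` followed by the lines `1 2` and `3 4` would be read as two variables, `X` and `Y value`, with two cases.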