Lab 1


When gasoline is pumped into the tank of an automobile, hydrocarbon vapors in the tank are forced out and into the atmosphere, producing a significant amount of air pollution. For this reason, vapor recovery devices are often installed on gasoline pumps. It is difficult to test a recovery device in actual operation, since all that can be measured is the amount of vapor recovered and, by means of a "sniffer," whether any vapor escaped into the atmosphere. Therefore, to get an idea of the efficiency of a given device, it is necessary to estimate the amount of hydrocarbons that would have been emitted if the device were not in place. To be useful, this estimate must be based only on variables that can be measured in practice.

In the next two labs, you will use data collected during an experiment conducted in the early 1980s to ascertain such a predictive relationship. The data set, taken from pages 566-569 of Mathematical Statistics and Data Analysis by John A. Rice, 2nd edition (Belmont, Calif.: Wadsworth, 1995), is stored in /data/vapor.dat as a 125 by 5 matrix. Each of the n=125 rows of this matrix represents an observation on 5 variables. A description of each of these variables is given below.

Again, the variable we are interested in estimating is the amount of emitted hydrocarbons, column 5. The first three observations or rows of vapor.dat are given below:

For questions 1 through 6, we will consider only those variables that record the state of the dispensed gasoline, and . In the second part of this lab, questions 7 through 9, we will make use of and , the measured temperature and pressure of the tank. A much more thorough investigation of how all four variables can be used to estimate Y will be taken up in lab 2.

1. Standardize each of the variables , , and Y by subtracting off their means and dividing by their standard deviations. For the rest of the lab, when we refer to a variable , or Y, we will mean these standardized versions.
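The standardization in question 1 can be sketched as follows. This is a minimal illustration in Python rather than S, and it assumes the standard deviation uses the n-1 divisor, which matches the n-1 scaling used in question 2. The function name `standardize` and the numbers are made up for the example and are not taken from vapor.dat.

```python
import statistics

def standardize(values):
    """Subtract the sample mean and divide by the sample sd (divisor n - 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

# Made-up measurements, for illustration only (not rows of vapor.dat).
raw = [33.0, 31.0, 37.0, 36.0, 42.0, 58.0]
z = standardize(raw)

print([round(v, 3) for v in z])
print("mean:", round(statistics.mean(z), 12), "sd:", round(statistics.stdev(z), 12))
```

After this step the column has sample mean 0 and sample standard deviation 1, up to rounding error.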

2. Let denote the design matrix formed from the basis functions 1, and , where 1 represents the function that is the constant 1 over the design set. Form the three-by-three Gram matrix associated with these variables. We obtain the sample correlation matrix of and by dividing the lower two-by-two submatrix of the Gram matrix by n-1 = 125-1. What do you observe? Next, perform the same calculations with the matrix formed by adding the column Y to the end of . In this case, we obtain the sample correlation matrix of , and Y by dividing the lower three-by-three submatrix of the result by n-1=125-1. What do you observe? Use a series of pairwise plots to confirm your observations. Please do not include any of these plots with your write-up, but simply report your observations. You might also want to examine plots of , and Y against observation number.
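The Gram matrix calculation in question 2 can be sketched on a toy data set as follows. This is an illustration only: the names `x1`, `x2`, `X` and `gram` are hypothetical, the values are made up, and the code is Python rather than S. It forms the design matrix with columns 1, x1 and x2, computes the three-by-three Gram matrix, and divides the lower two-by-two block by n-1.

```python
import statistics

def standardize(values):
    """Center at the sample mean and scale by the sample sd (divisor n - 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

# Made-up raw values for two predictors (not taken from vapor.dat).
x1 = standardize([53.0, 26.0, 35.0, 60.0, 59.0, 74.0, 91.0, 40.0])
x2 = standardize([3.4, 1.8, 2.4, 4.7, 4.6, 6.1, 8.0, 3.0])
n = len(x1)

# Design matrix with columns 1, x1, x2.
X = [[1.0, a, b] for a, b in zip(x1, x2)]
p = 3

# Gram matrix: entry (i, j) is the inner product of columns i and j.
gram = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

# Lower two-by-two block divided by n - 1.
corr = [[gram[i][j] / (n - 1) for j in (1, 2)] for i in (1, 2)]

for row in gram:
    print([round(v, 4) for v in row])
for row in corr:
    print([round(v, 4) for v in row])
```

Adding a standardized Y as a fourth column of `X` and setting `p = 4` extends the same calculation to the second part of the question.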

In the following questions, we will explore how the high correlation between and manifests itself in standard error estimates and derived confidence intervals. Unless otherwise indicated, you are encouraged to calculate the required least squares estimates both directly and via the lm command (see the attached S appendix).

3. Let represent the regression function of Y on our two predictor variables. For the moment, assume that

Using , Y and formula (13) on page 181 of your text, form the ordinary least squares estimates of , and . Calculate RSS, the residual sum of squares, and form

Recall that under the assumptions of the homoskedastic linear model, is an unbiased estimate of of (see Sections 8.3 and 10.3 of your text). Use S and the Gram matrix you calculated in question 2 to form the variance-covariance matrix of . Next, try fitting the same model, this time leaving out the term. Compare the new estimate of and its standard error to the values you obtained earlier. What do you find?
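A direct calculation along the lines of question 3 might look like the sketch below. It assumes the usual normal-equations form of the least squares estimate, (X'X)^{-1} X'y, and the variance estimate RSS/(n-p) with p = 3 fitted coefficients; whether these match the exact formulas in your text is an assumption. All names and data values are hypothetical, and the code is Python rather than S. In practice the lm command does this work for you.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(A)
    M = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(A)]
    for c in range(k):
        pivot_row = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[pivot_row] = M[pivot_row], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[k:] for row in M]

# Made-up, roughly standardized values (not taken from vapor.dat).
x1 = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.6, 0.9, 1.4]
x2 = [-1.0, -0.9, -0.1, 0.1, 0.0, 0.8, 0.7, 1.5]
y = [-1.1, -0.8, -0.4, 0.2, 0.1, 0.5, 1.0, 1.3]

X = [[1.0, a, b] for a, b in zip(x1, x2)]
n, p = len(y), 3

G = matmul(transpose(X), X)                  # Gram matrix X'X
G_inv = inverse(G)
Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
beta = [sum(G_inv[j][k] * Xty[k] for k in range(p)) for j in range(p)]

fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
resid = [yi - fi for yi, fi in zip(y, fitted)]
rss = sum(e * e for e in resid)              # residual sum of squares
s2 = rss / (n - p)                           # assumed form of the variance estimate
cov = [[s2 * g for g in row] for row in G_inv]   # estimated var-cov matrix of beta
se = [math.sqrt(cov[j][j]) for j in range(p)]    # standard errors

print("estimates:", [round(b, 4) for b in beta])
print("RSS:", round(rss, 4), "s^2:", round(s2, 4))
print("standard errors:", [round(v, 4) for v in se])
```

Dropping the last column of `X` and setting `p = 2` gives the reduced fit, so the two sets of standard errors can be compared side by side.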

4. Observe that and each fall roughly in the interval . Returning to the model with , make a plot of the versus as ranges from -1.5 to 2.5. Contrast this with a similar plot for . What do you observe? Explain this result intuitively based on a plot of versus .

In the plot below, we present 95% confidence intervals for and as varies from -1.5 to 2.5. Again, notice the strong influence of the level of on this confidence interval.

5. You have observed in question 2 that the Gram matrix for this fit is of the form ,

where is the correlation between and . Therefore, the variance-covariance matrix of is of the form .

Use this result to explain the effect observed in question 4. What would you expect to happen if ? What does it mean for to be zero?
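For reference, the two matrix forms mentioned in question 5 can be written out as below. This is a sketch under assumed notation: x_1 and x_2 stand for the two standardized predictors, r for their sample correlation, n for the number of observations, and \sigma^2 for the error variance. It relies only on the facts that standardized columns sum to zero and have sum of squares n-1.

```latex
% Assumed notation: columns of X are 1, x_1, x_2 (standardized);
% r is the sample correlation of x_1 and x_2.
X^{\top}X =
\begin{pmatrix}
n & 0 & 0 \\
0 & n-1 & (n-1)\,r \\
0 & (n-1)\,r & n-1
\end{pmatrix},
\qquad
\operatorname{Var}(\hat\beta) = \sigma^{2}\,(X^{\top}X)^{-1} =
\sigma^{2}
\begin{pmatrix}
\dfrac{1}{n} & 0 & 0 \\[2ex]
0 & \dfrac{1}{(n-1)(1-r^{2})} & \dfrac{-r}{(n-1)(1-r^{2})} \\[2ex]
0 & \dfrac{-r}{(n-1)(1-r^{2})} & \dfrac{1}{(n-1)(1-r^{2})}
\end{pmatrix}.
```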

6. One way to remove the dependence between and would be to model as a function of and use the residuals from that fit in our model. That is, let

be the least squares "predictor" of , where and are obtained in the usual manner by the method of least squares. Construct a new variable , and repeat the steps in question 2, substituting for for in the model. The Gram matrix formed from 1, and should be of what form? Why? The new variable ranges over the interval . Using -0.7 as the low value and 2.5 as the high value, construct plots similar to those from question 4. Does this agree with your answers to question 5?
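The construction in question 6 can be sketched as follows, so that the resulting Gram matrix can be inspected. Which predictor is regressed on which is an assumption here: the sketch regresses a hypothetical `x2` on a hypothetical `x1` and calls the residuals `z`. The data are made up and the code is Python rather than S.

```python
import statistics

def standardize(values):
    """Center at the sample mean and scale by the sample sd (divisor n - 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

# Made-up raw values for two correlated predictors (not taken from vapor.dat).
x1 = standardize([53.0, 26.0, 35.0, 60.0, 59.0, 74.0, 91.0, 40.0])
x2 = standardize([3.4, 1.8, 2.4, 4.7, 4.6, 6.1, 8.0, 3.0])

# Simple least squares fit of x2 on x1 (intercept and slope).
m1, m2 = statistics.mean(x1), statistics.mean(x2)
slope = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / sum((a - m1) ** 2 for a in x1)
intercept = m2 - slope * m1

# New variable: the residuals from that fit.
z = [b - (intercept + slope * a) for a, b in zip(x1, x2)]

# Gram matrix for the columns 1, x1, z.
X = [[1.0, a, r] for a, r in zip(x1, z)]
gram = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]

for row in gram:
    print([round(v, 4) for v in row])
```

Printing the matrix for your own data is a quick check on the form you derive by hand.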

Below we present a plot similar to the one above, this time using rather than .

Next, we will consider just the variables relating to the state of the tank at the time of the experiment, and , the tank's initial temperature and pressure, respectively. Here, we will consider these two variables on their original scale (that is, without standardizing).

7. Make simple plots of and versus observation number, as well as a plot of versus . Rather than use these variables as they are, we are going to recode them as indicator functions. That is, we are interested in defining , , and from so that each takes on only the values of 0 or 1 (here, l, m, and h refer to values of that are low, medium and high, respectively). For example, we will set

Similarly, we define to be 1 if is between 45 and 75, and to be 1 only if is larger than 75.

Suppose we make similar definitions for indicator functions based on , taking a low value to be 3.7 or below, a medium value to be between 3.7 and 6, and a high value to be larger than 6. Do you anticipate any problems creating a model that includes all three indicator functions for each of and ?
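The temperature recoding described in question 7 can be sketched as below. The cutoffs 45 and 75 come from the text; treating a value of exactly 45 as "low" is an assumption made by analogy with the pressure definition ("3.7 or below"). The function name and the sample temperatures are hypothetical, and the code is Python rather than S.

```python
def temperature_indicators(t, low_cut=45.0, high_cut=75.0):
    """Return (low, medium, high) indicators, each 0 or 1, for one temperature."""
    low = 1 if t <= low_cut else 0      # assumed: the boundary value counts as low
    high = 1 if t > high_cut else 0     # "larger than 75"
    medium = 1 - low - high             # everything in between
    return low, medium, high

# Made-up tank temperatures, for illustration only.
temps = [33.0, 45.0, 60.0, 75.0, 92.0]
for t in temps:
    print(t, temperature_indicators(t))
```

The pressure indicators follow the same pattern with `low_cut=3.7` and `high_cut=6.0`.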

8. Suppose we are interested in the effect that different levels of tank temperature have on hydrocarbon emissions. Using the indicator functions given above, we could propose the following model for the mean of Y

Why is the constant function not included in this model? Find estimates of , , and using ordinary least squares. Because this sort of model gives rise to particularly simple normal equations, it is important in this case that you consider how you would arrive at these estimates directly. Find a 95% confidence interval for and , and interpret these quantities.
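One way to look at the normal equations for question 8 is to build the indicator design matrix on a small example and inspect X'X and X'y directly. The sketch below does this for made-up data; the names, the values, and the choice to treat exactly 45 as "low" are all assumptions, and the code is Python rather than S.

```python
# Made-up tank temperatures and emissions, for illustration only.
temps = [35.0, 40.0, 52.0, 60.0, 68.0, 80.0, 90.0]
y = [20.0, 22.0, 27.0, 30.0, 33.0, 40.0, 44.0]

def level(t):
    """0 = low, 1 = medium, 2 = high (assumed cutoffs at 45 and 75)."""
    if t <= 45.0:
        return 0
    if t <= 75.0:
        return 1
    return 2

# Design matrix with one indicator column per level and no constant column.
X = [[1.0 if level(t) == j else 0.0 for j in range(3)] for t in temps]

# Pieces of the normal equations (X'X) beta = X'y.
gram = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(3)]

# In this toy example X'X has no nonzero off-diagonal entries, so each
# equation can be solved on its own.
beta = [xty[j] / gram[j][j] for j in range(3)]

print("X'X:", gram)
print("X'y:", xty)
print("estimates:", beta)
```

Comparing `beta` with simple summaries of `y` within each temperature level is a useful check on your own derivation.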

9. Consider instead the following model

Why does not make an appearance in this model? Using ordinary least squares, find estimates of , , and . Again, it is important that you consider the form of the normal equations for this model, while the actual fitting can be done via the lm command. Find a 95% confidence interval for and interpret this quantity.

When would an experimenter be more interested in the model as specified in question 8? When is the form presented in this question more useful? Is there any difference in the estimates given by these models?