When gasoline is pumped into the tank of an automobile, hydrocarbon vapors in the tank are forced out and into the atmosphere, producing a significant amount of air pollution. For this reason, vapor recovery devices are often installed on gasoline pumps. It is difficult to test a recovery device in actual operation since all that can be measured is the amount of vapor recovered, and, by means of a ``sniffer,'' whether any vapor escaped into the atmosphere. Therefore, to get an idea of the efficiency of a given device, it is necessary to estimate the amount of hydrocarbons that would have been emitted if the device were not in place. To be useful, this estimate must be based only on variables that can be measured in practice.
In the next two labs, you will be using the data collected during an experiment conducted in the early 1980's to ascertain such a predictive relationship. The data set, taken from pages 566--569 of Mathematical Statistics and Data Analysis by John A. Rice, 2nd edition (Belmont, Calif: Wadsworth, 1995), is stored in /data/vapor.dat as a 125 by 5 matrix. Each of the n=125 rows of this matrix represents an observation on 5 variables. A description of each of these variables is given below.
In the following questions, we will explore how the high
correlation between and
manifests itself
in standard error estimates and derived confidence intervals.
Unless otherwise indicated, you are encouraged to calculate
the required least squares estimates both directly and via the
lm command (see the attached S appendix).
Using , Y and formula (13) on page 181 of your text,
form the ordinary least squares
estimates of
,
and
.
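The direct calculation can be sketched as follows. This is only an illustration in Python (the lab itself uses S), and it assumes that the formula referred to above is the usual normal-equations solution, in which the estimates solve $(X^TX)\hat\beta = X^TY$. The function names and the small noise-free data set are made up for the example; they are not the vapor data.

```python
def gram_matrix(X):
    """Return the Gram matrix X^T X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


def cross_products(X, y):
    """Return the vector X^T y."""
    p = len(X[0])
    return [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]


def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Gram matrix is singular: columns are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - tail) / M[i][i]
    return beta


def ols(X, y):
    """Least squares estimates from the normal equations (X^T X) beta = X^T y."""
    return solve(gram_matrix(X), cross_products(X, y))


# Hypothetical noise-free data: y = 1 + 2*x1 - 3*x2 (NOT the vapor data).
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
X = [[1.0, a, b] for a, b in zip(x1, x2)]  # first column is the constant
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]
beta_hat = ols(X, y)
```

Because the made-up response contains no noise, `beta_hat` recovers the coefficients used to generate it up to rounding error. With real data you would compare the result against the output of lm.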
Calculate RSS, the residual sum of squares, and form
$s^2 = RSS/(n-p)$, where $p$ is the number of estimated coefficients.
Recall that
under the assumptions of the homoskedastic linear
model, $s^2$ is an unbiased estimate of $\sigma^2$
(see
Sections 8.3 and 10.3 of your text).
Use S
and the Gram matrix you calculated in question 1 to form the
variance-covariance matrix of the least squares estimates.
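To make the steps concrete, here is a small Python sketch for the simplest case of a constant plus one predictor, so that the Gram matrix is 2 by 2 and can be inverted by hand. It assumes the standard results that the unbiased variance estimate is RSS divided by $n$ minus the number of coefficients, and that the estimated variance-covariance matrix is this quantity times the inverse Gram matrix. The function name and the data are hypothetical.

```python
import math


def simple_ols_with_covariance(x, y):
    """Fit y = b0 + b1*x and return the estimates, RSS, s^2 and s^2 (X^T X)^{-1}."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))

    # Gram matrix is [[n, sx], [sx, sxx]]; invert it in closed form.
    det = n * sxx - sx * sx
    gram_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # Least squares estimates: (X^T X)^{-1} X^T y
    b0 = gram_inv[0][0] * sy + gram_inv[0][1] * sxy
    b1 = gram_inv[1][0] * sy + gram_inv[1][1] * sxy

    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # two coefficients were estimated
    cov = [[s2 * entry for entry in row] for row in gram_inv]
    return (b0, b1), rss, s2, cov


# Hypothetical data, NOT the vapor data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
(b0, b1), rss, s2, cov = simple_ols_with_covariance(x, y)

# Standard errors are the square roots of the diagonal entries.
std_errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
```

The off-diagonal entry `cov[0][1]` is the estimated covariance between the two coefficient estimates. For the three-coefficient model in this lab the same recipe applies with a 3 by 3 Gram matrix and a divisor of $n-3$.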
Next, try fitting the same model, this time by leaving out the
term.
Compare the new estimate of
and its standard
error to the values you obtained earlier. What do you find?
In the plot below, we present 95% confidence intervals for
and
as
varies from -1.5 to 2.5. Again,
notice the strong influence of the level of
on this confidence
interval.
where is the correlation between
and
.
Therefore, the variance-covariance matrix of
is of the form
.
Use this result to explain the effect observed in question 4.
What would you expect to happen if ? What does it mean
for
to be zero?
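The role of the correlation can be explored numerically. The Python sketch below makes a simplifying assumption that is ours, not necessarily the handout's: both predictors are centered and scaled to unit sum of squares, so that their Gram matrix is the 2 by 2 correlation matrix with off-diagonal entry `rho`. Under that assumption the variance-covariance matrix of the two slope estimates is $\sigma^2$ times the inverse computed here. The function names are hypothetical.

```python
def variance_inflation(rho):
    """Factor 1 / (1 - rho^2) by which each slope variance grows with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)


def inverse_corr_gram(rho):
    """Inverse of [[1, rho], [rho, 1]], i.e. (1 / (1 - rho^2)) * [[1, -rho], [-rho, 1]]."""
    f = variance_inflation(rho)
    return [[f, -rho * f], [-rho * f, f]]


# Tabulate the factor for a few illustrative correlations.
factors = {rho: variance_inflation(rho) for rho in (0.0, 0.5, 0.9, 0.99)}
```

The diagonal entries of `inverse_corr_gram(rho)` grow without bound as `rho` approaches 1 in absolute value, while the off-diagonal entries have the opposite sign to `rho`.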
be the least squares ``predictor'' of , where
and
are obtained in the usual
manner by the method of least squares.
Construct a new variable
,
and repeat the steps in question 2, substituting
for
for in the model.
The Gram matrix formed from
1,
and
should be of what form?
Why?
The new variable
ranges over the
interval
. Using -0.7 as the low
value and 2.5 as the high value, construct plots similar to those
from question 4. Does this agree with your answers to question 5?
Below we present a plot similar to the one above, this time using
rather than
.
Similarly, we define to be 1
if
is between 45 and 75, and
to be 1
only if
is larger than 75.
Suppose we make similar definitions for indicator functions
based on , taking a low value to be 3.7 or below, a medium
value to be between 3.7 and 6, and a high value to be larger than
6. Do you anticipate any problems creating a model which includes both
sets of three indicator functions, one set for each of
and
?
Why is the constant function not included in
this model?
Find estimates of
,
, and
using ordinary least squares. Because
this sort of model gives rise to particularly simple
normal equations, it is important in this
case that you consider how you would arrive at these estimates
directly.
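A minimal Python sketch of why the normal equations simplify is given below. It uses the cut points 45 and 75 from the definitions above; how a value falling exactly on a cut point is classified is an assumption made here. The function name and the data are hypothetical and are not the vapor data.

```python
def indicators(value, low_cut=45.0, high_cut=75.0):
    """Return (low, medium, high) 0/1 indicators for a single value.

    Assumption: values exactly equal to a cut point are counted as medium.
    """
    low = 1 if value < low_cut else 0
    high = 1 if value > high_cut else 0
    medium = 1 - low - high
    return (low, medium, high)


# Hypothetical predictor values and responses, NOT the vapor data.
values = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
y = [10.0, 14.0, 20.0, 22.0, 24.0, 30.0, 34.0]

# Design matrix with three indicator columns and no constant column.
X = [indicators(v) for v in values]

# Each row contains exactly one 1, so all off-diagonal entries of X^T X are zero
# and the diagonal entries are the group counts.
gram = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]

# The normal equations decouple: count_k * beta_k = (sum of y in group k).
beta_hat = [xty[k] / gram[k][k] for k in range(3)]
```

In this sketch each estimate is the sum of the responses in a group divided by the number of observations in that group.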
Find a 95% confidence interval for
and
, and interpret these quantities.
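As a reminder, the usual interval for a single coefficient can be written as below. This is the standard construction under the additional assumption of normally distributed errors; the notation is ours and may differ from your text's. Here $p$ is the number of coefficients in the fitted model, $s^2$ is the unbiased variance estimate, and $t_{n-p}(0.025)$ is the upper 2.5% point of the $t$ distribution with $n-p$ degrees of freedom.

```latex
\[
  \hat\beta_j \;\pm\; t_{n-p}(0.025)\, s_{\hat\beta_j},
  \qquad
  s_{\hat\beta_j} = \sqrt{\, s^2 \left[ (X^T X)^{-1} \right]_{jj} }
\]
```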
Why does not make an appearance in this model?
Using
ordinary least squares,
find estimates of
,
, and
. Again, it is important that you consider
the form of the normal equations for this model, while the actual
fitting can be done via the lm command.
Find a 95% confidence interval for
and interpret this quantity.
When would an experimenter be more interested in the model as specified in question 8? When is the form presented in this question more useful? Is there any difference in the estimates given by these models?