Model Selection
We now turn to the actual business of developing a predictive
equation for Y based on the four predictor variables.
- 1.
Standardize each of the four predictor variables and Y
by subtracting off its mean and dividing by its standard deviation.
For the rest of the lab, when we refer to a predictor variable or Y,
we will mean these standardized versions.
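As a minimal sketch of this standardization step in R: the data below are simulated stand-ins for the lab data, and the column names x1, x2, x3, x4 and y are hypothetical placeholders for the real variable names.

```r
# Simulated stand-in for the lab data; the column names are hypothetical.
set.seed(1)
n <- 50
dat <- data.frame(x1 = rnorm(n, 10, 2), x2 = rnorm(n, 5, 1),
                  x3 = runif(n), x4 = rexp(n))
dat$y <- 3 + dat$x1 - 0.5 * dat$x2 + rnorm(n)

# scale() subtracts each column mean and divides by the column standard
# deviation (the n - 1 version that sd() computes).
std <- as.data.frame(scale(dat))

# Check: every standardized column should have mean 0 and sd 1.
round(colMeans(std), 10)
apply(std, 2, sd)
```

Standardizing by hand, for example (dat$x1 - mean(dat$x1)) / sd(dat$x1), should give the same column.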
- 2.
As mentioned previously,
we will model the hydrocarbon emissions using the full quadratic
model based on the four predictor variables. This means that we will be using
all linear predictors, quadratics, and pairwise interactions based on these
variables.
In all, there are 15 terms including the intercept. Form the
design matrix for this model and get the least squares estimates for
the coefficients in this model using lm.
Look at the P-values corresponding to each term in the model. Are all
of the terms significant? Note that the quadratic and interaction terms do
not have mean 0, so it makes sense to include the intercept.
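One possible way to set up the full quadratic model with lm is sketched below. It uses simulated, already standardized data with hypothetical names x1, x2, x3, x4 and y; the coefficients used to simulate y are arbitrary and only there so that the code runs.

```r
# Simulated, standardized stand-in data; names and effects are hypothetical.
set.seed(1)
n <- 60
std <- as.data.frame(scale(matrix(rnorm(n * 4), n, 4,
                                  dimnames = list(NULL, paste0("x", 1:4)))))
std$y <- as.numeric(scale(std$x1 + 0.5 * std$x1^2 - std$x2 * std$x3 + rnorm(n)))

# (x1 + x2 + x3 + x4)^2 expands to the 4 linear terms and the 6 pairwise
# interactions; the I(...) terms add the 4 pure quadratics.
full <- lm(y ~ (x1 + x2 + x3 + x4)^2 +
               I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2), data = std)

X <- model.matrix(full)      # design matrix, intercept column included
ncol(X)                      # 1 + 4 + 4 + 6 = 15 columns
summary(full)$coefficients   # estimates, standard errors, t values, P-values
```

The last column of the coefficient table, "Pr(>|t|)", holds the P-values the question asks about.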
- 3.
We will restrict our attention to hierarchical models. Thus, we can only
delete
a linear term from the model if all higher order terms involving it
have already been removed. For example, if a linear term has a higher
P-value than a quadratic or interaction term that involves it, we cannot
delete the linear term before removing that higher order term from
the model.
We need to look for another term to delete. Think about the intercept as being
a variable of order 0. At each step
in the deletion process, those terms that could
be removed without violating the hierarchical structure are termed
removable.
In general, the variable deletion rules for model selection will be

Do stepwise deletion starting from the model with all linear terms,
quadratics and interactions until all the
P-values associated with the removable terms are 0.05 or less.
Which terms remain in the model?
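The deletion procedure described above could be sketched in R roughly as follows. This is one possible implementation under stated assumptions, not a prescribed solution: the data are simulated, the names x1, x2, x3, x4, y and the helpers vars_in, removable and backward_hier are hypothetical, and the intercept is simply kept, since as a term of order 0 it would only become removable once every other term is gone.

```r
# Simulated, standardized stand-in data; names and effects are hypothetical.
set.seed(2)
n <- 80
std <- as.data.frame(scale(matrix(rnorm(n * 4), n, 4,
                                  dimnames = list(NULL, paste0("x", 1:4)))))
std$y <- as.numeric(scale(std$x1 + 0.8 * std$x1^2 - std$x2 * std$x3 +
                          rnorm(n, sd = 0.5)))

full <- lm(y ~ (x1 + x2 + x3 + x4)^2 +
               I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2), data = std)

# Variables that a term label such as "x1", "I(x1^2)" or "x1:x2" involves.
vars_in <- function(term) all.vars(parse(text = term))

# A higher order term is always removable; a linear term is removable only
# when no other term still in the model involves its variable.
removable <- function(terms) {
  Filter(function(t) {
    is_linear <- t %in% paste0("x", 1:4)
    others <- setdiff(terms, t)
    !is_linear ||
      !any(vapply(others, function(o) t %in% vars_in(o), logical(1)))
  }, terms)
}

# Repeatedly drop the removable term with the largest P-value until all
# removable terms have P-values of alpha or less.
backward_hier <- function(fit, alpha = 0.05) {
  repeat {
    terms_now <- attr(terms(fit), "term.labels")
    if (length(terms_now) == 0) break
    p <- summary(fit)$coefficients[, "Pr(>|t|)"]
    cand <- removable(terms_now)
    if (all(p[cand] <= alpha)) break
    worst <- cand[which.max(p[cand])]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

final <- backward_hier(full)
attr(terms(final), "term.labels")   # terms that remain in the model
```

Refitting after every single deletion matters here, because the P-values of the remaining terms change each time a term is dropped.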
Parts 4 and 5 are optional. I recommend that you do them
and turn them in with
the rest of the lab, but they will not be part of the grade.
- 4.
Now we will see if any of the deleted terms can be added back into
the model to significantly improve the fit.
Try adding terms to the current model one at a time and see if the
fit is improved significantly. (Look at the P-values. The term is
significant if the corresponding P-value is 0.05 or less.)
Remove any terms that are not significant.
Remember that under a hierarchical model, we cannot add a higher order
term if the associated linear term is not already in the model.
Which ones were added? What is your new model? What do you expect would
happen if you left out
from the beginning?
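A rough sketch of the add-back check in R is given below. The "current" model cur is a hypothetical stand-in for whatever model step 3 produces, the data are simulated, and the names x1, x2, x3, x4 and y are placeholders.

```r
# Simulated, standardized stand-in data; names and effects are hypothetical.
set.seed(3)
n <- 80
std <- as.data.frame(scale(matrix(rnorm(n * 4), n, 4,
                                  dimnames = list(NULL, paste0("x", 1:4)))))
std$y <- as.numeric(scale(std$x1 + 0.8 * std$x1^2 - std$x2 * std$x3 +
                          rnorm(n, sd = 0.5)))

# Hypothetical "current" model, standing in for the result of step 3.
cur <- lm(y ~ x1 + x2 + x3 + I(x1^2), data = std)

linear <- paste0("x", 1:4)
all_terms <- c(linear, paste0("I(", linear, "^2)"),
               combn(linear, 2, paste, collapse = ":"))

in_model <- attr(terms(cur), "term.labels")
deleted  <- setdiff(all_terms, in_model)

# A higher order term may be added only if the linear term of every
# variable it involves is already in the model.
eligible <- Filter(function(t) {
  t %in% linear || all(all.vars(parse(text = t)) %in% in_model)
}, deleted)

# P-value of each eligible term when it alone is added to the current model.
p_add <- vapply(eligible, function(t) {
  trial <- update(cur, as.formula(paste(". ~ . +", t)))
  summary(trial)$coefficients[t, "Pr(>|t|)"]
}, numeric(1))

sort(p_add)                    # smallest P-values first
names(p_add)[p_add <= 0.05]    # candidates worth adding back
```

After adding a term, the check would be repeated on the enlarged model, since a newly added linear term can make further higher order terms eligible.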
- 5.
Construct an F-test of the hypothesis that your final
model (from question 4 above) is in fact linear. For such a test,
you will need the fitted values
from your final model from question 4 as well as those from the linear
submodel.
Calculate the latter in two ways. First, use ordinary least
squares on Y and the relevant predictor variables. Next, replace
Y with the fitted values based on your larger model. Each technique
should yield the same result. Why? Compare your F-statistic
with the appropriate F-distribution and calculate a P-value for
this test. Present your results in an ANOVA table and interpret them.
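The mechanics of this test could look something like the following in R. The models final and linear are hypothetical stand-ins (your final model from question 4 will generally differ), the data are simulated, and the names x1, x2, x3, x4, y and yhat are placeholders.

```r
# Simulated, standardized stand-in data; names and effects are hypothetical.
set.seed(4)
n <- 80
std <- as.data.frame(scale(matrix(rnorm(n * 4), n, 4,
                                  dimnames = list(NULL, paste0("x", 1:4)))))
std$y <- as.numeric(scale(std$x1 + 0.8 * std$x1^2 - std$x2 * std$x3 +
                          rnorm(n, sd = 0.5)))

# Hypothetical final model (standing in for the result of question 4)
# and its linear submodel.
final  <- lm(y ~ x1 + x2 + x3 + I(x1^2) + x2:x3, data = std)
linear <- lm(y ~ x1 + x2 + x3, data = std)

# Way 1: ordinary least squares of Y on the linear predictors.
fit1 <- fitted(linear)
# Way 2: the same regression with Y replaced by the fitted values
# from the larger model.
std$yhat <- fitted(final)
fit2 <- fitted(lm(yhat ~ x1 + x2 + x3, data = std))
all.equal(fit1, fit2)   # the two sets of fitted values should agree

# F-test of the linear submodel against the final model.
rss0 <- sum(resid(linear)^2); df0 <- df.residual(linear)
rss1 <- sum(resid(final)^2);  df1 <- df.residual(final)
F_stat <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_val  <- pf(F_stat, df0 - df1, df1, lower.tail = FALSE)

anova(linear, final)    # ANOVA table reporting the same F and P-value
```

Computing the F-statistic by hand and comparing it with the anova() output is a useful check that the degrees of freedom in the table are the ones you intended.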