Model Selection


We now turn to the actual business of developing a predictive equation for Y based on the four predictor variables.

1. Standardize each of the four predictor variables and Y by subtracting its mean and dividing by its standard deviation. For the rest of the lab, any reference to a predictor or to Y means the standardized version.
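As a minimal sketch of this step, the code below standardizes a simulated data frame. The column names x1 to x4 and Y and the data themselves are stand-ins invented for illustration; they are not the lab's data.

```r
# Simulated stand-in data; x1..x4 and Y are hypothetical names.
set.seed(1)
n <- 40
dat <- data.frame(x1 = rnorm(n, 10, 2), x2 = rnorm(n, 5, 1),
                  x3 = rnorm(n, 0, 3),  x4 = rnorm(n, 100, 15))
dat$Y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n)

# scale() subtracts each column's mean and divides by its sample
# standard deviation (the n - 1 divisor used by sd()).
std <- as.data.frame(scale(dat))

# Check: every standardized column should have mean 0 and SD 1.
round(colMeans(std), 10)
apply(std, 2, sd)
```

The same result can be obtained column by column with (x - mean(x)) / sd(x); scale() simply does this for every column at once.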

2. As mentioned previously, we will model the hydrocarbon emissions using the full quadratic model based on the four predictors. This means that we will use all linear terms, quadratic terms, and pairwise interactions based on these variables. In all, there are 15 terms including the intercept (1 intercept, 4 linear terms, 4 quadratic terms, and 6 pairwise interactions). Form the design matrix for this model and get the least squares estimates of its coefficients using lm. Look at the P-values corresponding to each term in the model. Are all of the terms significant? Note that the quadratic and interaction terms do not have mean 0, so it makes sense to include the intercept.
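One possible way to set up the full quadratic model is sketched below on simulated data. The names x1 to x4 and Y are hypothetical placeholders for the lab's variables, and the formula shown is only one of several equivalent ways to write the model.

```r
# Simulated, standardized stand-in data (hypothetical names).
set.seed(2)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$Y <- 2 * d$x1 - d$x2 + 1.5 * d$x1^2 + rnorm(n)
std <- as.data.frame(scale(d))

# (x1 + x2 + x3 + x4)^2 expands to the 4 linear terms plus the 6 pairwise
# interactions; I() protects each square so it is computed arithmetically.
full <- lm(Y ~ (x1 + x2 + x3 + x4)^2 +
             I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2), data = std)

X <- model.matrix(full)             # design matrix, intercept column included
ncol(X)                             # 15 columns
coef(summary(full))[, "Pr(>|t|)"]   # P-value for each of the 15 terms
```

Writing the squares as I(x1^2) rather than x1^2 matters: inside a formula, ^ is a formula operator, so x1^2 on its own would reduce to x1.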

3. We will restrict our attention to hierarchical models. Thus, we can delete a linear term from the model only if all higher order terms involving it have already been removed. For example, if a linear term has a higher P-value than its own quadratic term, we cannot delete the linear term before removing the quadratic term from the model. Likewise, if a linear term has a higher P-value than an interaction involving it, we cannot delete the linear term before removing that interaction. In either case, we need to look for another term to delete. Think of the intercept as a variable of order 0. At each step in the deletion process, the terms that could be removed without violating the hierarchical structure are termed removable.

In general, the variable deletion rules for model selection will be

Do stepwise deletion starting from the model with all linear terms, quadratics and interactions until all the P-values associated with the removable terms are 0.05 or less. Which terms remain in the model?
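The deletion loop can be sketched as follows, assuming hypothetical variable names x1 to x4 and simulated data. The helper removable() is our own illustration, not a built-in R function, and for simplicity this sketch always keeps the intercept (as a term of order 0, it would only become removable once every other term is gone).

```r
set.seed(3)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$Y <- 2 * d$x1 - d$x2 + 1.5 * d$x1^2 + rnorm(n)
std <- as.data.frame(scale(d))

vars <- c("x1", "x2", "x3", "x4")
terms_now <- c(vars, paste0("I(", vars, "^2)"),
               combn(vars, 2, paste, collapse = ":"))

# Hypothetical helper: a square or interaction is always removable; a linear
# term is removable only if no remaining higher order term involves it.
removable <- function(terms) {
  higher <- setdiff(terms, vars)
  ok <- vapply(terms, function(t) {
    !(t %in% vars) || !any(grepl(t, higher, fixed = TRUE))
  }, logical(1))
  terms[ok]
}

while (length(terms_now) > 0) {
  fit <- lm(reformulate(terms_now, response = "Y"), data = std)
  p <- coef(summary(fit))[, "Pr(>|t|)"]
  cand <- removable(terms_now)
  worst <- cand[which.max(p[cand])]    # least significant removable term
  if (p[worst] <= 0.05) break          # stop: all removable terms significant
  terms_now <- setdiff(terms_now, worst)
}
terms_now
```

Only one term is dropped per pass and the model is refitted each time, because the P-values of the remaining terms change whenever a term leaves the model.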

Parts 4 and 5 are optional. I recommend that you do them and turn them in with the rest of the lab, but they will not be part of the grade.

4. Now we will see if any of the deleted terms can be added back into the model to significantly improve the fit. Try adding terms to the current model one at a time and see if the fit is improved significantly. (Look at the P-values. The term is significant if the corresponding P-value is 0.05 or less.) Remove any terms that are not significant. Remember that under a hierarchical model, we cannot add a higher order term if the associated linear term is not already in the model. Which ones were added? What is your new model? What do you expect would happen if you left out from the beginning?
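A rough sketch of the add-back check is given below. The variable names, the simulated data, the starting model in current, and the helper addable() are all assumptions made for illustration; the lab's own reduced model will differ.

```r
set.seed(4)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$Y <- 2 * d$x1 - d$x2 + 1.5 * d$x1 * d$x2 + rnorm(n)
std <- as.data.frame(scale(d))

vars <- c("x1", "x2", "x3", "x4")
all_terms <- c(vars, paste0("I(", vars, "^2)"),
               combn(vars, 2, paste, collapse = ":"))
current <- c("x1", "x2")   # hypothetical model left after deletion

# Hypothetical helper: a linear term can always be added; a higher order
# term only if every linear term it involves is already in the model.
addable <- function(term, current) {
  if (term %in% vars) return(TRUE)
  involved <- vars[vapply(vars, function(v) grepl(v, term, fixed = TRUE),
                          logical(1))]
  all(involved %in% current)
}

cand <- setdiff(all_terms, current)
cand <- cand[vapply(cand, addable, logical(1), current = current)]

# Add each candidate on its own and record the P-value of the added term.
p_add <- vapply(cand, function(t) {
  fit <- lm(reformulate(c(current, t), response = "Y"), data = std)
  coef(summary(fit))[t, "Pr(>|t|)"]
}, numeric(1))
sort(p_add)
```

Candidates with a P-value of 0.05 or less would be added, after which the list of addable terms should be recomputed, since a newly added linear term can make further higher order terms eligible.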

5. Construct an F-test of the hypothesis that your final model (from question 4 above) is in fact linear. For such a test, you will need the fitted values from your final model from question 4 as well as those from the linear submodel. Calculate the latter in two ways. First, use ordinary least squares on Y and the relevant predictor variables. Next, replace Y with the fitted values based on your larger model. Each technique should yield the same result. Why? Compare your F-statistic with the appropriate F-distribution and calculate a P-value for this test. Present your results in an ANOVA table and interpret them.
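The mechanics of this test can be sketched as below. The final model shown (big) is a made-up example on simulated data with hypothetical names x1 and x2; substitute your own final model and its linear submodel.

```r
set.seed(5)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$Y <- 2 * d$x1 - d$x2 + 1.5 * d$x1^2 + rnorm(n)
std <- as.data.frame(scale(d))

big   <- lm(Y ~ x1 + x2 + I(x1^2) + x1:x2, data = std)  # hypothetical final model
small <- lm(Y ~ x1 + x2, data = std)                    # its linear submodel

# Second route to the submodel fit: regress the larger model's fitted
# values on the linear predictors instead of regressing Y itself.
std$big_fit <- fitted(big)
small2 <- lm(big_fit ~ x1 + x2, data = std)
all.equal(fitted(small), fitted(small2))

# F-statistic from the two residual sums of squares.
rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
df_num <- df.residual(small) - df.residual(big)   # number of terms tested
df_den <- df.residual(big)
F_stat <- ((rss_small - rss_big) / df_num) / (rss_big / df_den)
p_val  <- pf(F_stat, df_num, df_den, lower.tail = FALSE)

anova(small, big)   # ANOVA table with the same F and P-value
```

The two routes agree because the linear submodel's column space sits inside that of the larger model, so projecting Y onto the larger space first and then onto the smaller one gives the same result as projecting Y onto the smaller space directly.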