Basic Step: Start R on a Leland system computer or on a PC.
(You can use the directions given in the first part of your lab session, which are reproduced here for your convenience.)
NOTE: If you use "tree" you will not be able to run all of the help.
> cd private
> cd stat141
> setenv DISPLAY machine_name_or_IP_address:0.0
We are going to use a specific dataset from the R library ISwR for this session. Details of the dataset bp.obese are available in the Appendix of Dalgaard's book "Introductory Statistics with R". Access the data as follows.
> library(ISwR, lib.loc="/afs/ir/class/stats141/lib/")
> data(bp.obese)
> bp.obese[1:5,]
  sex obese  bp
1   0  1.31 130
2   0  1.31 148
3   0  1.19 146
4   0  1.11 122
5   0  1.34 140
The first column is an indicator variable for sex (0 = Male, 1 = Female). The second and third columns are measurements of obesity and blood pressure in appropriate units. We want to analyze the relationship between obesity and blood pressure. Since we are going to use this dataset throughout, we first use the attach command of R.
> attach(bp.obese)
> plot(bp,obese)
There is an indication of some degree of correlation between the two variables. We measure it by Pearson's product moment correlation.
> cor(bp,obese)
[1] 0.3261385
The computed correlation is rather low. Therefore it may be of interest to test the hypothesis that the true correlation is zero (H0: rho = 0, where rho is the population correlation).
Test statistic: t = r*sqrt(n-2)/sqrt(1-r^2), which under H0 has a t-distribution with n-2 degrees of freedom.
> cor.test(obese,bp)

        Pearson's product-moment correlation

data:  obese and bp
t = 3.45, df = 100, p-value = 0.0008222
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1405800 0.4895626
sample estimates:
      cor
0.3261385
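As a quick cross-check of the test statistic formula (done here in Python rather than R, using only the numbers reported above):

```python
import math

# Values reported by cor() and cor.test() above
r = 0.3261385   # sample correlation between obese and bp
n = 102         # sample size, so df = n - 2 = 100

# Test statistic for H0: rho = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 2))  # agrees with the t = 3.45 in the cor.test output
```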
The p-value is rather low, so we reject the null hypothesis.
One would like to see if there is a linear relationship between the y-variable (bp) and the x-variable (obese). The model is

    bp = beta_0 + beta_1 * obese + epsilon,   epsilon ~ N(0, sigma^2).

We use the R function lm to perform the linear regression of bp on obese. The same function is also used for multiple linear regression (to be covered later in class).
> bp.obese.lr1 <- lm(bp~obese)
> summary(bp.obese.lr1)

Call:
lm(formula = bp ~ obese)

Residuals:
    Min      1Q  Median      3Q     Max
-27.570 -11.241  -2.400   9.116  71.390

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   96.818      8.920   10.86  < 2e-16 ***
obese         23.001      6.667    3.45 0.000822 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.28 on 100 degrees of freedom
Multiple R-Squared: 0.1064,     Adjusted R-squared: 0.09743
F-statistic: 11.9 on 1 and 100 DF,  p-value: 0.0008222
Interpretation :
Look at the table above. The first row corresponds to the intercept or constant term (beta_0 in the model). The estimate of beta_0 is 96.818 with estimated standard error 8.920. Then we can test the hypothesis that beta_0 = 0 by computing the t-statistic t = estimate / standard error. The value is given as 10.86. Under H0, this statistic has a t-distribution with n - 2 = 100 degrees of freedom. The last column gives the p-value for this test. A similar interpretation holds for beta_1, the coefficient of the variable obese in the linear regression model, with the corresponding null hypothesis beta_1 = 0. In both cases the p-values are very small, and so we can reject both null hypotheses.
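The arithmetic of the t column is easy to reproduce from the other two columns. A Python sketch using the rounded values printed in the summary (so the last digit can differ slightly from R, which divides the unrounded estimates):

```python
# t = estimate / std. error, using the rounded values from summary() above
intercept_t = 96.818 / 8.920   # reported as 10.86 (rounded inputs give ~10.85)
slope_t     = 23.001 / 6.667   # reported as 3.45
print(intercept_t, slope_t)
```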
The Multiple R-squared is simply the square of the ordinary product moment correlation in the case of simple linear regression (there is a generalized notion for multiple regression). The Adjusted R-squared is similar to R-squared but adjusts for the number of explanatory variables. (This becomes important if there are many explanatory variables, because then this quantity gives a better idea of the true fit.) An Adjusted R-squared value close to 1 implies a very good fit, and a small positive value (as in this case) means a rather poor fit.
Finally, the F-statistic (with its p-value) is used to test the hypothesis that the model is void, that is, that all the coefficients except the constant term are zero. In the case of simple linear regression, this reduces to testing H0: beta_1 = 0, and the F-statistic is simply the square of the t-statistic for beta_1.
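These relationships can be verified numerically from the output above (a Python sketch; only arithmetic on the reported numbers, no data needed):

```python
r = 0.3261385                               # cor(bp, obese) from earlier
n = 102

R2 = r**2                                   # Multiple R-squared = r^2 here
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - 2)   # penalizes extra regressors
F = (23.001 / 6.667)**2                     # F = t^2 for the single regressor

# These reproduce 0.1064, 0.09743 and 11.9 from the summary() output
print(round(R2, 4), round(adj_R2, 5), round(F, 1))
```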
We want to see if the residuals and the fitted values behave as expected.
> plot(obese,bp)
> lines(obese,fitted(bp.obese.lr1))

The second command adds the fitted values at the given values of obese to the same plot.
> names(bp.obese.lr1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> bp.lr1.resid <- bp.obese.lr1$residuals
> bp.lr1.fit <- bp.obese.lr1$fitted.values
> plot(bp.lr1.fit,bp.lr1.resid)
The residual vs. fitted plot looks quite random, except for a few outliers (details later). We also make a QQ-plot of the residuals to see if they look normal.
> qqnorm(bp.lr1.resid)
> qqline(bp.lr1.resid)
NOTE: A clear departure from normality is visible; the distribution of residuals is skewed to the right. Look in Chapter 14 of Sokal & Rohlf for details.
The prediction and confidence bands are evaluated at given values of the explanatory variable (in this instance obese). The values at which the prediction and confidence limits are to be calculated are given by the sequence seq(.2, 2.5, .1) (chosen by us!). In order to avoid some pitfalls, including multiple lines, it is important to create a single data frame containing the column of values of the explanatory variable at which the predictions are to be made, along with the upper and lower confidence and prediction limits.
> conf.frame <- data.frame(obese=seq(.2,2.5,.1))
> bp.lr1.conf <- predict(bp.obese.lr1, interval="conf", newdata=conf.frame)
> bp.lr1.pred <- predict(bp.obese.lr1, interval="pred", newdata=conf.frame)
> plot(obese,bp)
> matlines(conf.frame$obese, bp.lr1.conf, lty=c(1,2,2))
> matlines(conf.frame$obese, bp.lr1.pred, lty=c(1,4,4))
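Why the prediction band is wider than the confidence band: the prediction limits add the residual variance s^2 to the variance of the fitted mean. A small numeric sketch in Python (se_fit below is a HYPOTHETICAL value for illustration; s is the residual standard error from the summary above):

```python
import math

s      = 17.28   # residual standard error from summary(bp.obese.lr1)
t_crit = 1.984   # approx. 97.5th percentile of the t distribution, 100 df
se_fit = 2.0     # HYPOTHETICAL std. error of the fitted mean at some x

half_conf = t_crit * se_fit                        # confidence half-width
half_pred = t_crit * math.sqrt(s**2 + se_fit**2)   # prediction half-width
print(half_conf, half_pred)   # the prediction band is much wider
```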
The R function matlines is used to plot multiple lines (in this case 3, corresponding to the predicted values and the lower and upper confidence or prediction limits, respectively). Notice that a lot of points are outside the two confidence limits, indicating a rather poor fit of this linear regression model to the data.
***NOTE: This part presents material with which you may not be familiar and which may not be covered in detail in class. It is not absolutely necessary to go through all these diagnostic plots, but it may be useful for those who want to do a final project.
Here we consider five different measures to check for unusual observations.
> par(mfrow=c(2,2))
> plot(rstandard(bp.obese.lr1))              # standardized residuals
> plot(rstudent(bp.obese.lr1))               # studentized residuals
> plot(dffits(bp.obese.lr1), type="l")       # dffits
> matplot(dfbetas(bp.obese.lr1), type="l")   # dfbetas plotted together
> lines(sqrt(cooks.distance(bp.obese.lr1)), lwd=2)  # sqrt of Cook's distance
Note that large absolute values of standardized residuals or studentized residuals indicate that the corresponding observation is a y-outlier, whereas the other three measures depict the combined influence of both y-outlyingness and x-outlyingness. Based on these plots one suspects that observations 20 and 102 are outliers.
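To make the x-outlier idea concrete, here is a small pure-Python illustration (toy, made-up data, not the bp.obese values) of the leverage h_i and the standardized residual e_i / (s*sqrt(1 - h_i)) that underlie rstandard():

```python
import math

# Toy data: the last point is an x-outlier (far from the other x values)
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar)**2 for xi in x)

# Least-squares fit of y = b0 + b1*x
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e**2 for e in resid) / (n - 2))    # residual std. error

# Leverage for simple regression: h_i = 1/n + (x_i - xbar)^2 / Sxx
h = [1/n + (xi - xbar)**2 / Sxx for xi in x]
rstd = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]

print(max(h))   # the x-outlier at x = 10 has by far the largest leverage
```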
One can drop these two observations and fit the same linear regression model by using the following commands.
> bp.obese.lr2 <- lm(bp ~ obese, subset = -c(20,102))
> summary(bp.obese.lr2)

Call:
lm(formula = bp ~ obese, subset = -c(20, 102))

Residuals:
    Min      1Q  Median      3Q     Max
-26.546  -9.740  -1.829   9.946  63.987

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   98.342      8.041  12.230  < 2e-16 ***
obese         20.963      6.015   3.485 0.000738 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.21 on 98 degrees of freedom
Multiple R-Squared: 0.1103,     Adjusted R-squared: 0.1012
F-statistic: 12.15 on 1 and 98 DF,  p-value: 0.0007376
Notice that there is no perceptible change in terms of the estimates or the overall fit of the model.
So far we have not considered the third variable in the data frame, namely sex. It may be that there is a difference between the intercept terms in the regression models for males and females. To incorporate this effect one can use the following analysis with two explanatory variables (or regressors): one is obese and the other is sex, which is just an indicator variable.
Details of this analysis are skipped and may be used as an exercise.
> bp.obese.lr3 <- lm(bp ~ obese + sex)
> summary(bp.obese.lr3)

Call:
lm(formula = bp ~ obese + sex)

Residuals:
    Min      1Q  Median      3Q     Max
-24.263 -11.613  -2.057   6.424  72.207

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   93.287      8.937  10.438  < 2e-16 ***
obese         29.038      7.172   4.049 0.000102 ***
sex           -7.730      3.715  -2.081 0.040053 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17 on 99 degrees of freedom
Multiple R-Squared: 0.1438,     Adjusted R-squared: 0.1265
F-statistic: 8.314 on 2 and 99 DF,  p-value: 0.0004596
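Reading the fitted model off the coefficient table above: both sexes share the slope on obese, but their intercepts differ by the sex coefficient. A quick Python check of the arithmetic, using the reported estimates:

```python
# Fitted model from summary(bp.obese.lr3):
#   bp-hat = 93.287 + 29.038*obese - 7.730*sex,  with sex = 0 for males
male_intercept   = 93.287           # sex = 0
female_intercept = 93.287 - 7.730   # sex = 1: intercept lower by 7.730
print(male_intercept, female_intercept)
```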