
R Lab Session : Part 5

Basic Step : Start R on a Leland system computer OR on a PC

(You can use the directions given in the first part of your lab session, which are reproduced here for your convenience.)

Correlation and Regression

We are going to use a specific dataset from the ISwR package for this session. Details of the dataset bp.obese are available in the Appendix of Dalgaard's book ``Introductory Statistics with R''. Access the data as follows.

> library(ISwR, lib.loc="/afs/ir/class/stats141/lib/")
> data(bp.obese)
> bp.obese[1:5,]
  sex obese  bp
1   0  1.31 130
2   0  1.31 148
3   0  1.19 146
4   0  1.11 122
5   0  1.34 140

The first column has an indicator variable for sex (0 = Male, 1 = Female). The second and third columns contain measurements of obesity and blood pressure in appropriate units. We want to analyze the relationship between obesity and blood pressure. Since we are going to use this dataset throughout, we first use the attach command of R.

> attach(bp.obese)
> plot(bp,obese)

Correlation

There is an indication of some degree of correlation between the two variables. We measure it by Pearson's product moment correlation.

> cor(bp,obese)
[1] 0.3261385

The computed correlation is rather low. Therefore it may be of interest to test the hypothesis that the true correlation is zero ($H_0 : \rho = 0$, where $\rho$ is the population correlation). The test statistic

\begin{displaymath}
t_s = r\sqrt{\frac{n-2}{1-r^2}}
\end{displaymath}

has a $t$ distribution with $n-2$ degrees of freedom under $H_0$.
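
As a quick check of this formula (a minimal sketch, assuming bp.obese is still attached as above), one can compute the test statistic and its two-sided $p$-value by hand and compare with the cor.test output below.

> r <- cor(bp,obese)
> n <- length(bp)
> r*sqrt((n-2)/(1-r^2))                        # test statistic t_s
> 2*pt(-abs(r*sqrt((n-2)/(1-r^2))), df=n-2)    # two-sided p-value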

> cor.test(obese,bp)

        Pearson's product-moment correlation

data:  obese and bp
t = 3.45, df = 100, p-value = 0.0008222
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1405800 0.4895626
sample estimates:
      cor
0.3261385

The $p$-value is rather low, so we reject the null hypothesis.

Simple Linear Regression

One would like to see if there is a linear relationship between the $Y$-variable (bp) and the $X$-variable (obese). The model is

\begin{displaymath}
y_i = \alpha + \beta x_i + \epsilon_i, \qquad i=1, 2, \ldots, n
\end{displaymath}

We use the R function lm to perform the linear regression of bp on obese. The same function is also used for multiple linear regression (to be covered later in class).

> bp.obese.lr1 <- lm(bp~obese)
> summary(bp.obese.lr1)

Call:
lm(formula = bp ~ obese)

Residuals:
    Min      1Q  Median      3Q     Max
-27.570 -11.241  -2.400   9.116  71.390

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   96.818      8.920   10.86  < 2e-16 ***
obese         23.001      6.667    3.45 0.000822 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 17.28 on 100 degrees of freedom

Multiple R-Squared: 0.1064,     Adjusted R-squared: 0.09743
F-statistic:  11.9 on 1 and 100 DF,  p-value: 0.0008222

Interpretation :

Look at the table above. The first row corresponds to the intercept or constant term ($\alpha$ in the model). The estimate of $\alpha$ is 96.818 with estimated standard error 8.920. We can test the hypothesis $H_0 : \alpha = 0$ by computing the $t$-statistic $t = {\hat \alpha}/SE({\hat \alpha})$, whose value is given as 10.86. Under $H_0$, this statistic has a $t$-distribution with $n-2$ degrees of freedom. The last column gives the $p$-value for this test. The interpretation is similar for $\beta$, the coefficient of the variable obese in the linear regression model, with the corresponding null hypothesis $H_0:\beta = 0$. In both cases the $p$-values are very small, so we reject both null hypotheses.
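
If you want to reproduce these $t$-values yourself, the coefficient table is also available as a matrix via coef(summary()) (a small sketch; the column names are those returned by summary for an lm fit).

> ctab <- coef(summary(bp.obese.lr1))
> ctab
> ctab[,"Estimate"]/ctab[,"Std. Error"]   # reproduces the t value column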

The Multiple R-squared is simply the square of the ordinary product-moment correlation in the case of simple linear regression (there is a generalized notion for multiple regression). The Adjusted R-squared is similar to R-squared but adjusts for the number of explanatory variables. (This becomes important when there are many explanatory variables, because then this quantity gives a better idea of the true fit.) An Adjusted R-squared value close to 1 implies a very good fit, and a small positive value (as in this case) means a rather poor fit.
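
As a quick verification (using the correlation computed earlier), the Multiple R-squared reported above is just the square of that correlation.

> cor(bp,obese)^2     # about 0.1064, matching Multiple R-Squared above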

Finally, the $F$-value is the $F$-statistic used to test the hypothesis that the model is void, that is, all the coefficients except the constant term are zero. In the case of simple linear regression, this reduces to testing $H_0:\beta = 0$.
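
The same $F$-test can also be displayed via the analysis-of-variance table of the fitted model; this is just another view of the output already shown above.

> anova(bp.obese.lr1)      # F-statistic and p-value match the summary output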

How good is the fit?

We want to see if the residuals and the fitted values behave as expected.

> plot(obese,bp)
> lines(obese,fitted(bp.obese.lr1))

The second command adds the fitted values at the observed values of obese to the same plot.
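
An alternative way to add the fitted straight line is abline, which accepts the fitted model object directly.

> plot(obese,bp)
> abline(bp.obese.lr1)     # adds the fitted regression line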

> names(bp.obese.lr1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> bp.lr1.resid <- bp.obese.lr1$residuals
> bp.lr1.fit <- bp.obese.lr1$fitted.values
> plot(bp.lr1.fit,bp.lr1.resid)

The residual vs. fitted plot looks quite random, except for a few outliers (details later). We also do a QQ-plot of the residuals to see if they look normal.

> qqnorm(bp.lr1.resid)
> qqline(bp.lr1.resid)

NOTE: A clear departure from normality is visible; the distribution of residuals is skewed to the right.
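
A histogram of the residuals is a quick supplementary check (not part of the original plots above) that shows this right skew directly.

> hist(bp.lr1.resid)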

Confidence bands and prediction bands

Look in Chapter 14 of Sokal & Rohlf for details.

The prediction and confidence bands are evaluated at chosen values of the explanatory variable (in this instance obese). Here the values at which the prediction and confidence limits are to be calculated are given by the sequence seq(.2, 2.5, .1) (chosen by us!). To avoid pitfalls in plotting, such as jagged or multiple overlapping lines, it is important to create a single data frame containing the values of the explanatory variable at which the predictions are to be made; predict then returns the fitted values along with the lower and upper confidence or prediction limits.

> conf.frame <- data.frame(obese=seq(.2,2.5,.1))
> bp.lr1.conf <- predict(bp.obese.lr1,interval="conf",
newdata = conf.frame)
> bp.lr1.pred <- predict(bp.obese.lr1,interval="pred",
newdata = conf.frame)
> plot(obese,bp)
> matlines(conf.frame$obese,bp.lr1.conf,lty=c(1,2,2))
> matlines(conf.frame$obese,bp.lr1.pred,lty=c(1,4,4))

The R function matlines is used to plot multiple lines (in this case 3, corresponding to the predicted values and the lower and upper confidence or prediction limits, respectively). Notice that a lot of points are outside the two confidence limits, indicating a rather poor fit of this linear regression model to the data.
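
If you wish to label the bands, a legend can be added to the current plot; the line types below match those chosen in the matlines calls above.

> legend("topleft", c("confidence band","prediction band"), lty=c(2,4))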

Regression diagnostics ***

***NOTE: This part presents material that you may not be familiar with and which may not be covered in detail in class. It is not absolutely necessary to go through all these diagnostic plots, but they may be useful for those who want to do a final project.

Here we consider five different measures to check for unusual observations.

> par(mfrow=c(2,2))
> plot(rstandard(bp.obese.lr1)) # standardized residuals
> plot(rstudent(bp.obese.lr1)) # studentized residuals
> plot(dffits(bp.obese.lr1),type="l")  # dffits
> matplot(dfbetas(bp.obese.lr1),type="l")  # dfbetas plotted together
> lines(sqrt(cooks.distance(bp.obese.lr1)),lwd=2) # sqrt of cook's distance

Note that large absolute values of the standardized or studentized residuals indicate that the corresponding observation is a $Y$-outlier, whereas the other three measures depict the combined influence of both $Y$-outlyingness and $X$-outlyingness. Based on these plots, one suspects that observations 20 and 102 are outliers.
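
To flag suspect observations programmatically rather than by eye, one can list the indices with large studentized residuals (the cutoff of 2 used here is a common rule of thumb, not part of the original lab).

> which(abs(rstudent(bp.obese.lr1)) > 2)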

One can drop these two observations and fit the same linear regression model by using the following commands.

> bp.obese.lr2 <- lm(bp ~ obese, subset = -c(20,102))
> summary(bp.obese.lr2)

Call:
lm(formula = bp ~ obese, subset = -c(20, 102))

Residuals:
    Min      1Q  Median      3Q     Max
-26.546  -9.740  -1.829   9.946  63.987

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   98.342      8.041  12.230  < 2e-16 ***
obese         20.963      6.015   3.485 0.000738 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 15.21 on 98 degrees of freedom
Multiple R-Squared: 0.1103,     Adjusted R-squared: 0.1012
F-statistic: 12.15 on 1 and 98 DF,  p-value: 0.0007376

Notice that there is no perceptible change in terms of the estimates or the overall fit of the model.
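
A compact way to compare the two fits side by side (the column labels below are just for display):

> cbind(all.obs = coef(bp.obese.lr1), without.20.102 = coef(bp.obese.lr2))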

Linear Regression : 2 explanatory variables

So far we have not considered the third variable in the data frame, namely sex. It may be that there is a difference between the intercept terms of the regression model for males and females. To incorporate this effect, one can use the following analysis with two explanatory variables (or regressors): one is obese, and the other is sex, which is just an indicator variable.

Details of this analysis are skipped and may be used as an exercise.

> bp.obese.lr3 <- lm(bp ~ obese + sex)
> summary(bp.obese.lr3)

Call:
lm(formula = bp ~ obese + sex)

Residuals:
    Min      1Q  Median      3Q     Max
-24.263 -11.613  -2.057   6.424  72.207

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   93.287      8.937  10.438  < 2e-16 ***
obese         29.038      7.172   4.049 0.000102 ***
sex           -7.730      3.715  -2.081 0.040053 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 17 on 99 degrees of freedom
Multiple R-Squared: 0.1438,     Adjusted R-squared: 0.1265
F-statistic: 8.314 on 2 and 99 DF,  p-value: 0.0004596
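
One optional way to visualise this fit (not required for the exercise, and using the coefficient estimates reported above) is to draw the two parallel fitted lines, one for each sex.

> plot(obese,bp,pch=ifelse(sex==1,2,1))    # different plotting symbols for the two sexes
> abline(93.287, 29.038)                   # fitted line for males (sex = 0)
> abline(93.287 - 7.730, 29.038, lty=2)    # fitted line for females (sex = 1)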




Elizabeth Purdom 2002-12-02