To see a review of how to start R, look at the beginning of Lab1
Addendum to Bringing in Data: If the data is saved as a text file online, you can also bring in data with scan() or read.table(), where the "file" argument is a URL:
> graft.df<-read.table("http://www-stat.stanford.edu/~epurdom/s141/data/graft.txt",header=T)
Getting Started
We are going to use data from the ISwR library that we used in Lab 1. To bring in the library, use whichever of the following calls matches where the library is installed on your system:
> library(ISwR, lib.loc="/afs/ir/class/stats141/lib/")
> library(ISwR)
> library(ISwR,lib.loc="//Rgmiller/epurdom/public/Rlib")
Now, go ahead and bring the data we want into R:
> data(lung)
> data(heart.rate)
> data(tlc)
> data(coking)
> data(zelazo)
> data(melanom)
To see a description of the data, see Introductory Statistics with R.
Factor Data
The first step for any data analysis is to get the data into the correct form. Look at the data in heart.rate, where the heart rate of nine patients with congestive heart failure is measured at different times after receiving medication. We see that the first column, hr, contains the observed heart rate for each observation. The next two columns tell us which subject the observation belongs to (subj) and at what time it was taken (time). This is an example of a two-way setup - there are two different classes of ``grouping'' that an observation could fall into.
> heart.rate
    hr subj time
1   96    1    0
2  110    2    0
3   89    3    0
...
19  86    1   60
20 108    2   60
21  85    3   60
...
Now if I look at the summary of heart.rate, I see that while summary gave the standard summary information for the variable hr, for subj and time we see just the counts for each class within the variables. This is because subj and time are entered as factor variables. R does not consider them as numbers but just as an indication of classification - which is important, since these numbers have no meaning.
> summary(heart.rate)
       hr            subj        time  
 Min.   : 67.00   1      : 4   0  :9  
 1st Qu.: 78.75   2      : 4   30 :9  
 Median : 93.00   3      : 4   60 :9  
 Mean   : 93.14   4      : 4   120:9  
 3rd Qu.:104.50   5      : 4          
 Max.   :128.00   6      : 4          
                  (Other):12          
Often, if you bring "character" data in with read.table, it will be transformed to factor data. Otherwise we want to create factor variables ourselves.
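For example, here is a minimal sketch of converting a column by hand; the data frame and column names below are made up for illustration, not from our datasets:

```r
# Sketch: converting a numeric grouping column to a factor by hand.
# df, y, and group are hypothetical names used only for this example.
df <- data.frame(y = c(1.2, 3.4, 2.2, 4.1), group = c(1, 2, 1, 2))
is.factor(df$group)   # FALSE - R sees plain numbers
df$group <- factor(df$group, labels = c("Ctrl", "Trt"))
is.factor(df$group)   # TRUE - now a classification, not a number
levels(df$group)      # "Ctrl" "Trt"
```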
With data from a balanced design (all the groups within a grouping have the same number of observations) such as the data last week on trees, we can use the command gl to create factors.
Data:
River Birch | European Birch | ||
Flooded | Control | Flooded | Birch Control |
1.45 | 1.70 | 0.21 | 1.34 |
1.19 | 2.04 | 0.58 | 0.99 |
1.05 | 1.49 | 0.11 | 1.17 |
1.07 | 1.91 | 0.27 | 1.30 |
First bring data in:
> atp<-scan()
1: 1.45 1.70 0.21 1.34
5: 1.19 2.04 0.58 0.99
9: 1.05 1.49 0.11 1.17
13: 1.07 1.91 0.27 1.30
17:
Read 16 items
Now we want to create a factor variable identifying the four groups, using the command gl, which creates balanced factor data. We need it to match the order of the data, so we need n=4 levels, with each level repeated k=1 time in a row before the pattern cycles.
> gr<-gl(n=4,k=1,length=16,labels=c("R.Flood","R.Ctrl","E.Flood","E.Ctrl"))
> gr
 [1] R.Flood R.Ctrl  E.Flood E.Ctrl  R.Flood R.Ctrl  E.Flood E.Ctrl  R.Flood
[10] R.Ctrl  E.Flood E.Ctrl  R.Flood R.Ctrl  E.Flood E.Ctrl 
Levels: R.Flood R.Ctrl E.Flood E.Ctrl
> atp.df<-data.frame(atp,group=gr)
ALWAYS check to see if you got the factor variable in the right order!
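One simple way to check, sketched here by rebuilding the atp example from above: print the data values next to the factor and eyeball that each observation got the right label, and count the levels.

```r
# Recreate the atp data and factor from above, then check the pairing.
atp <- c(1.45, 1.70, 0.21, 1.34,
         1.19, 2.04, 0.58, 0.99,
         1.05, 1.49, 0.11, 1.17,
         1.07, 1.91, 0.27, 1.30)
gr <- gl(n = 4, k = 1, length = 16,
         labels = c("R.Flood", "R.Ctrl", "E.Flood", "E.Ctrl"))
data.frame(atp, group = gr)  # each row: a value and its assigned group
table(gr)                    # every level should appear 4 times
```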
gl requires that the groups be of equal length. To manually make factors, we could create a vector of numbers to represent the group, but for R to read them as factors we need to create them with factor(), or convert them.
Here we manually make the factors for the (balanced) tree data above.
> grvalues<-c(rep(x=c(1,2,3,4),times=4))
> grvalues
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
> is.numeric(grvalues)
[1] TRUE
> is.factor(grvalues)
[1] FALSE
> gr2<-factor(grvalues,labels=c("R.Fl","R.C","E.Fl","E.C"))
> gr2
 [1] R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl
[16] E.C 
Levels: R.Fl R.C E.Fl E.C
> is.numeric(gr2)
[1] FALSE
> is.factor(gr2)
[1] TRUE
Sometimes we might have two categorical variables: $A$ with $I$ categories and $B$ with $J$ categories. You sometimes want one variable indicating whether the observation was in group $i$ of $A$ and group $j$ of $B$ - i.e. one of the $I \times J$ possible groupings:
> coking$width
 [1] 4  4  4  4  4  4  8  8  8  8  8  8  12 12 12 12 12 12
Levels: 4 8 12
> coking$temp
 [1] 1600 1600 1600 1900 1900 1900 1600 1600 1600 1900 1900 1900 1600 1600 1600
[16] 1900 1900 1900
Levels: 1600 1900
> coking$width:coking$temp
 [1] 4:1600  4:1600  4:1600  4:1900  4:1900  4:1900  8:1600  8:1600  8:1600
[10] 8:1900  8:1900  8:1900  12:1600 12:1600 12:1600 12:1900 12:1900 12:1900
Levels: 4:1600 4:1900 8:1600 8:1900 12:1600 12:1900
Look at the tlc data:
   sex  tlc
1    1 3.40
2    1 3.41
3    2 3.80
4    1 3.90
5    1 4.00
6    1 4.10
7    1 4.46
8    2 4.55
...
How would you check to see if the grouping variable was a factor or numeric? How could you convert it to a factor if it was not already? (call the new dataset tlc).
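As a hint, the general pattern looks something like this sketch, which uses a small made-up data frame (tlc0) rather than the real tlc data, so the labels here are hypothetical:

```r
# Sketch of the pattern: check whether a grouping column is a factor,
# and convert it if not. tlc0 is a toy data frame, not the real dataset.
tlc0 <- data.frame(sex = c(1, 1, 2, 1), tlc = c(3.40, 3.41, 3.80, 3.90))
is.factor(tlc0$sex)            # FALSE: stored as plain numbers
tlc0$sex <- factor(tlc0$sex)   # convert; levels become "1" and "2"
is.factor(tlc0$sex)            # TRUE
```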
rbind and cbind stick together vectors of equal lengths
into a matrix, where the vectors become the rows or columns,
respectively, of the new matrix.
To combine two vectors into one long vector, you can just use the c() command.
> x<-c(1,3,5,3)
> y<-c(4,5,8,3)
> z<-c(5,6,2,1)
> rbind(x,y,z)
  [,1] [,2] [,3] [,4]
x    1    3    5    3
y    4    5    8    3
z    5    6    2    1
> cbind(x,y,z)
     x y z
[1,] 1 4 5
[2,] 3 5 6
[3,] 5 8 2
[4,] 3 3 1
> c(x,y,z)
 [1] 1 3 5 3 4 5 8 3 5 6 2 1
> atp.mat<-cbind(atp,gr)
> summary(aov(atp.mat[,1]~atp.mat[,2]))
             Df Sum Sq Mean Sq F value Pr(>F)
atp.mat[, 2]  1 0.4278  0.4278  1.3021  0.273
Residuals    14 4.5996  0.3285
What is the problem with your ANOVA with atp.mat? How would you fix it (other than using data.frame)?
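As a hint, here is a small sketch (on made-up vectors) of what cbind does to a factor: it builds a numeric matrix, so the factor is silently replaced by its underlying integer codes.

```r
# cbind() builds a numeric matrix, so a factor is coerced to its codes.
g <- factor(c("a", "b", "a", "b"))
m <- cbind(c(1, 2, 3, 4), g)
is.factor(m[, 2])   # FALSE: column 2 is now just the numbers 1 and 2
# One possible fix: re-declare it inside the model formula, e.g.
#   aov(m[, 1] ~ factor(m[, 2]))
```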
ANOVA Testing
To create an anova table for F-testing, we use lm, which finds the estimates of the parameters of the model, written as y~treat (one-way). Then anova creates the table with the F statistics. Alternatively, you can use aov to fit the model, and then summary to see the ANOVA table. Some subsequent functions work better with the output of aov than with that of lm.
For one way, the model is $x_{ij} = \mu + \alpha_i + \epsilon_{ij}$. For a fixed effect, we assume $\alpha_i$ is a constant number. This would be for data where we are only interested in those specific groups, like Treatment A, B and Control.
For random effects [Model II] we assume $\alpha_i \sim N(0, \sigma_A^2)$. This is for data where the particular groups/treatments we are examining are not all we are interested in, but rather just a sample of all possible groups. For instance, if we tested differences among batches of a culture - the batches we get are not all possible batches, but just a sample, so the effect $\alpha_i$ has a distribution over all possible batches.
For fixed effects, $E(MS_{between}) = \sigma^2 + \frac{\sum_i n_i \alpha_i^2}{k-1}$.
For random effects, $E(MS_{between}) = \sigma^2 + n\sigma_A^2$.
The same F-statistic, $F = MS_{between}/MS_{within}$, is used for both.
With this example from last week's homework, we find that there is a significant difference among the four groups:
> atp.lm<-lm(atp~gr, data=atp.df)
> anova(atp.lm)
Analysis of Variance Table

Response: atp
          Df Sum Sq Mean Sq F value    Pr(>F)    
gr         3 4.5530  1.5177  38.391 1.984e-06 ***
Residuals 12 0.4744  0.0395                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> atp.aov<-aov(atp~gr,data=atp.df)
> summary(atp.aov)
          Df Sum Sq Mean Sq F value    Pr(>F)    
gr         3 4.5530  1.5177  38.391 1.984e-06 ***
Residuals 12 0.4744  0.0395                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> zelazo$active
[1]  9.00  9.50  9.75 10.00 13.00  9.50
Is there a difference in the four groups?
For two way anova with no interaction, the model is $x_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$. $\alpha_i$ is the effect for each group of the first classification (with $I$ groups), $\beta_j$ is the effect for each group of the second classification (with $J$ groups), and $\epsilon_{ij}$ is the $N(0,\sigma^2)$ error term for each observation.
Again, with no interaction, random and fixed effects [Model I and Model II] use the same F-statistic for both, though $E(MS)$ is different:

Fixed effects:  $E(MS_A) = \sigma^2 + \frac{J\sum_i \alpha_i^2}{I-1}$,  $E(MS_B) = \sigma^2 + \frac{I\sum_j \beta_j^2}{J-1}$
Random effects: $E(MS_A) = \sigma^2 + J\sigma_A^2$,  $E(MS_B) = \sigma^2 + I\sigma_B^2$
F-statistics:   $F_A = MS_A/MS_{within}$,  $F_B = MS_B/MS_{within}$
With the data heart.rate, we have only one observation per
cross-classification cell, where the classifications are subject and
time - i.e. repeated testing of the same individual. So we cannot estimate an interaction term.
In R, we write this as hr~time+subj, where time corresponds to $\alpha_i$ and subj corresponds to $\beta_j$.
> heart.lm<-lm(hr~time+subj, data=heart.rate)
> anova(heart.lm)
Analysis of Variance Table

Response: hr
          Df Sum Sq Mean Sq F value    Pr(>F)    
time       3  151.0    50.3  4.0696   0.01802 *  
subj       8 8966.6  1120.8 90.6391 4.863e-16 ***
Residuals 24  296.8    12.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Notice if I change the order of time and subj I get the same answers.
> heart.lm2<-lm(hr~subj+time, data=heart.rate)
> anova(heart.lm2)
Analysis of Variance Table

Response: hr
          Df Sum Sq Mean Sq F value    Pr(>F)    
subj       8 8966.6  1120.8 90.6391 4.863e-16 ***
time       3  151.0    50.3  4.0696   0.01802 *  
Residuals 24  296.8    12.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
For two way anova with an interaction, the model is $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$, where $\gamma_{ij}$ is the interaction term. For both fixed and random effects, to test if the interaction term is significant, you compare $F = MS_{AB}/MS_{within}$ with the F distribution.
Usually you test to see if the interaction is significant, and if not then the other p-values for main effects are relevant for both (and you are in the no-interaction case as above).
> coke.lm<-lm(time~width*temp, data=coking)
> anova(coke.lm)
Analysis of Variance Table

Response: time
           Df  Sum Sq Mean Sq F value    Pr(>F)    
width       2 123.143  61.572 222.102 3.312e-10 ***
temp        1  17.209  17.209  62.076 4.394e-06 ***
width:temp  2   5.701   2.851  10.283  0.002504 ** 
Residuals  12   3.327   0.277                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Contrasts & Multiple Comparisons
To test a set of contrasts, we need to bring in another library, gregmisc. You should download the library to your computer in the same way that you downloaded the ISwR library (see above).
We look at the peas dataset from class:
Control Glucose Fructose GlucFruc Sucrose
  75      57      58       58       62
  67      58      61       59       66
  70      60      56       58       65
  75      59      58       61       63
  65      62      57       57       64
  71      60      56       56       62
  67      60      61       58       65
  67      57      60       57       65
  76      59      57       57       62
  68      61      58       59       67
> library(gregmisc)
> peas<-scan() #...paste data here
> culture<-factor(rep(1:5,10),labels=c("Ctrl","Gluc","Fruc","GlucFruc","Sucr"))
> pea.df<-data.frame(peas, culture)
> pea.aov<-aov(peas~culture,data=pea.df)
Now we're interested in a specific contrast, comparing the sugars to the control. We are interested in $\mu_{Ctrl} - \frac{1}{4}(\mu_{Gluc} + \mu_{Fruc} + \mu_{GlucFruc} + \mu_{Sucr})$. This gives a coefficient vector $c = (1, -0.25, -0.25, -0.25, -0.25)$. We want to test the significance of $\sum_i c_i \mu_i$:
> fit.contrast(model=pea.aov,varname=culture,coeff=c(1,-.25,-.25,-.25,-.25))
                                        Estimate Std. Error  t value     Pr(>|t|)
culture c=( 1 -0.25 -0.25 -0.25 -0.25 )     10.2  0.8257993 12.35167 4.679846e-16
Suppose we had more than 1 contrast we set up:
> contr.peas<-matrix(c(4,-1,-1,-1,-1,0,-1,-1,3,-1),ncol=2)
> contr.peas
     [,1] [,2]
[1,]    4    0
[2,]   -1   -1
[3,]   -1   -1
[4,]   -1    3
[5,]   -1   -1
> crossprod(contr.peas)
     [,1] [,2]
[1,]   20    0
[2,]    0   12
> fit.contrast(pea.aov,culture,t(contr.peas),df=T,conf.int=.95)
#Note: We take the transpose, so the contrasts are on the rows
                            Estimate Std. Error   t value     Pr(>|t|) DF  lower CI  upper CI
culture c=( 4 -1 -1 -1 -1 )     40.8   3.303197 12.351670 4.679846e-16 45  34.14702 47.452980
culture c=( 0 -1 -1 3 -1 )      -7.6   2.558645 -2.970322 4.758813e-03 45 -12.75338 -2.446623
Note: This doesn't take into account multiple comparisons problems.
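If we did want a rough correction here, one option (a sketch, not the only approach) is to Bonferroni-adjust the contrast p-values with p.adjust; the two p-values below are taken from the fit.contrast output above.

```r
# Sketch: Bonferroni adjustment of the two contrast p-values above.
p <- c(4.679846e-16, 4.758813e-03)
p.adjust(p, method = "bonferroni")  # each p-value is doubled (capped at 1)
```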
If we want to do all possible pairwise comparisons of the "time" variable for the heart.rate data, we can use the Tukey-Kramer intervals.
> heart.aov<-aov(hr~time+subj,data=heart.rate)
> plot(TukeyHSD(heart.aov,which="time"),cex.axis=.5,las=1)
Plotting
One important thing to look at in data with more than one classification or grouping is how many observations are in each cell. We use the command table below on heart.rate and coking - notice it is important not to include the (continuous) observation column.
> table(heart.rate$subj,heart.rate$time)
   
    0 30 60 120
  1 1  1  1   1
  2 1  1  1   1
  3 1  1  1   1
  4 1  1  1   1
  5 1  1  1   1
  6 1  1  1   1
  7 1  1  1   1
  8 1  1  1   1
  9 1  1  1   1
> table(coking[,-3])
     temp
width 1600 1900
   4     3    3
   8     3    3
   12    3    3
To display the data, we can use boxplots. Another way is to plot the points with stripchart, when you have a small amount of data. An interaction plot also can let you have a good look at what the data is doing.
#two equivalent ways to get a boxplot:
> plot(atp~group,data=atp.df)
> boxplot(atp~group,data=atp.df)
> stripchart(atp~group,data=atp.df,method="jitter",pch=16)
For datasets with more than one predictive variable:
> boxplot(time~width+temp,data=coking)
> interaction.plot(ordered(heart.rate$time), heart.rate$subj, heart.rate$hr,
   xlab="Time after Enalaprilat", ylab="Heart Rate (Beats per Minute)",
   col=1:9, fixed=T)
So far we've just looked at hypothesis tests of the parameters. But we would like to look and see if this model makes sense with the data. For example, in two-way ANOVA we modeled the observations as a linear combination of $\alpha_i$ and $\beta_j$ and maybe an interaction term. Thus for each cross-classification, we have an estimate $\hat{x}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$ of what we think the value of that cell should be. If our model is correct, the difference between this value that the model fitted and the one we observed would be the random error from $N(0,\sigma^2)$. These differences between the model's fit and the observed values are the residuals: $r_{ij} = x_{ij} - \hat{x}_{ij}$. All of this information is contained in the object we named datasetname.lm
> names(atp.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"
> names(atp.aov)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"
You can access these by using datasetname.lm$name, e.g. atp.aov$fit. For many of the commonly used components there is an extractor function, like fitted() or residuals(). So atp.aov$fit is the same as fitted(atp.aov).
> matrix(round(atp.lm$fit,3),ncol=4,byrow=T)
     [,1]  [,2]  [,3] [,4]
[1,] 1.19 1.785 0.292  1.2
[2,] 1.19 1.785 0.292  1.2
[3,] 1.19 1.785 0.292  1.2
[4,] 1.19 1.785 0.292  1.2
> matrix(atp.df$atp,ncol=4,byrow=T)
     [,1] [,2] [,3] [,4]
[1,] 1.45 1.70 0.21 1.34
[2,] 1.19 2.04 0.58 0.99
[3,] 1.05 1.49 0.11 1.17
[4,] 1.07 1.91 0.27 1.30
> matrix(round(atp.lm$res,3),ncol=4,byrow=T)
      [,1]   [,2]   [,3]  [,4]
[1,]  0.26 -0.085 -0.083  0.14
[2,]  0.00  0.255  0.288 -0.21
[3,] -0.14 -0.295 -0.182 -0.03
[4,] -0.12  0.125 -0.022  0.10
We can use plots to see if these residuals are reasonable. We can do a qq plot to see if they look normal, and plot the residuals against the fitted values, and how they relate to the grouping.
> atp.resid<-residuals(atp.lm)
> qq.atp<-qqnorm(atp.resid,type='n')
> text(x=qq.atp$x,y=qq.atp$y,labels=atp.df$gr)
> plot(atp.lm$fit,atp.resid,type="n")
> text(atp.lm$fit,atp.resid,labels=atp.df$gr)