To see a review of how to start R, look at the beginning of Lab1
Addendum to Bringing in Data: If the data is saved as a text file online, you can also bring in data with scan() or read.table(), where the "file" argument is a URL:
> graft.df<-read.table("http://www-stat.stanford.edu/~epurdom/s141/data/graft.txt",header=T)
Getting Started
We are going to use data from the ISwR library that we used in Lab 1. To bring in the library, use whichever of the following calls matches where the library is installed on your system:
> library(ISwR, lib.loc="/afs/ir/class/stats141/lib/")
> library(ISwR)
> library(ISwR,lib.loc="//Rgmiller/epurdom/public/Rlib")
Now, go ahead and bring the data we want into R:
> data(lung)
> data(heart.rate)
> data(tlc)
> data(coking)
> data(zelazo)
> data(melanom)
To see a description of the data, see Introductory Statistics with R.
Factor Data
The first step for any data analysis is to get the data into the correct form. Look at the data in heart.rate, where the heart rate of nine patients with congestive heart failure is measured at different times after receiving medication. We see that the first column, hr, contains the observed heart rate for each observation. The next two columns tell us which subject the observation belongs to (subj) and at what time it was taken (time). This is an example of a two-way setup - there are two different classes of ``grouping'' that an observation could fall into.
> heart.rate
    hr subj time
1   96    1    0
2  110    2    0
3   89    3    0
...
19  86    1   60
20 108    2   60
21  85    3   60
...
Now if I look at the summary of heart.rate, I see that while summary gave the standard summary information for the variable hr, for subj and time we see just the counts for each class within the variables. This is because subj and time are entered as factor variables. R does not consider them as numbers but just as an indication of classification - which is important, since these numbers have no meaning.
> summary(heart.rate)
       hr            subj        time  
 Min.   : 67.00   1      : 4   0  :9  
 1st Qu.: 78.75   2      : 4   30 :9  
 Median : 93.00   3      : 4   60 :9  
 Mean   : 93.14   4      : 4   120:9  
 3rd Qu.:104.50   5      : 4          
 Max.   :128.00   6      : 4          
                  (Other):12          
Often, if you bring "character" data in with read.table, it will be transformed to factor data. Otherwise we want to create factor variables ourselves.
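For example, here is a minimal sketch of converting a column by hand; the data frame and column names below are made up for illustration, not from our datasets:

```r
# Sketch: converting a numeric grouping column to a factor by hand.
# df, y, and group are hypothetical names used only for this example.
df <- data.frame(y = c(1.2, 3.4, 2.2, 4.1), group = c(1, 2, 1, 2))
is.factor(df$group)   # FALSE - R sees plain numbers
df$group <- factor(df$group, labels = c("Ctrl", "Trt"))
is.factor(df$group)   # TRUE - now a classification, not a number
levels(df$group)      # "Ctrl" "Trt"
```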
With data from a balanced design (all the groups within a grouping have the same number of observations) such as the data last week on trees, we can use the command gl to create factors.
Data:
River Birch | European Birch | ||
Flooded | Control | Flooded | Birch Control |
1.45 | 1.70 | 0.21 | 1.34 |
1.19 | 2.04 | 0.58 | 0.99 |
1.05 | 1.49 | 0.11 | 1.17 |
1.07 | 1.91 | 0.27 | 1.30 |
First bring data in:
> atp<-scan()
1: 1.45 1.70 0.21 1.34
5: 1.19 2.04 0.58 0.99
9: 1.05 1.49 0.11 1.17
13: 1.07 1.91 0.27 1.30
17:
Read 16 items
Now we want to create a factor variable identifying the four groups, using the command gl, which creates balanced factor data. We need it to match the order of the data, so we need n=4 levels, with each level repeated k=1 time in a row before the pattern cycles.
> gr<-gl(n=4,k=1,length=16,labels=c("R.Flood","R.Ctrl","E.Flood","E.Ctrl"))
> gr
 [1] R.Flood R.Ctrl  E.Flood E.Ctrl  R.Flood R.Ctrl  E.Flood E.Ctrl  R.Flood
[10] R.Ctrl  E.Flood E.Ctrl  R.Flood R.Ctrl  E.Flood E.Ctrl 
Levels: R.Flood R.Ctrl E.Flood E.Ctrl
> atp.df<-data.frame(atp,group=gr)
ALWAYS check to see if you got the factor variable in the right order!
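One simple way to check, sketched here by rebuilding the atp example from above: print the data values next to the factor and eyeball that each observation got the right label, and count the levels.

```r
# Recreate the atp data and factor from above, then check the pairing.
atp <- c(1.45, 1.70, 0.21, 1.34,
         1.19, 2.04, 0.58, 0.99,
         1.05, 1.49, 0.11, 1.17,
         1.07, 1.91, 0.27, 1.30)
gr <- gl(n = 4, k = 1, length = 16,
         labels = c("R.Flood", "R.Ctrl", "E.Flood", "E.Ctrl"))
data.frame(atp, group = gr)  # each row: a value and its assigned group
table(gr)                    # every level should appear 4 times
```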
gl requires that the groups be of equal length. To manually make factors, we could create a vector of numbers to represent the group, but for R to read them as factors we need to create them with factor(), or convert them.
Here we manually make the factors for the (balanced) tree data above.
> grvalues<-c(rep(x=c(1,2,3,4),times=4))
> grvalues
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
> is.numeric(grvalues)
[1] TRUE
> is.factor(grvalues)
[1] FALSE
> gr2<-factor(grvalues,labels=c("R.Fl","R.C","E.Fl","E.C"))
> gr2
 [1] R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl E.C  R.Fl R.C  E.Fl
[16] E.C 
Levels: R.Fl R.C E.Fl E.C
> is.numeric(gr2)
[1] FALSE
> is.factor(gr2)
[1] TRUE
Sometimes we might have two categorical variables: $A$ with $I$ categories and $B$ with $J$ categories. You sometimes want one variable indicating whether the observation was in group $i$ of $A$ and group $j$ of $B$ - i.e. one of the $I \times J$ possible groupings:
> coking$width
 [1] 4  4  4  4  4  4  8  8  8  8  8  8  12 12 12 12 12 12
Levels: 4 8 12
> coking$temp
 [1] 1600 1600 1600 1900 1900 1900 1600 1600 1600 1900 1900 1900 1600 1600 1600
[16] 1900 1900 1900
Levels: 1600 1900
> coking$width:coking$temp
 [1] 4:1600  4:1600  4:1600  4:1900  4:1900  4:1900  8:1600  8:1600  8:1600
[10] 8:1900  8:1900  8:1900  12:1600 12:1600 12:1600 12:1900 12:1900 12:1900
Levels: 4:1600 4:1900 8:1600 8:1900 12:1600 12:1900
Look at the tlc data:
   sex  tlc
1    1 3.40
2    1 3.41
3    2 3.80
4    1 3.90
5    1 4.00
6    1 4.10
7    1 4.46
8    2 4.55
...
How would you check to see if the grouping variable was a factor or numeric? How could you convert it to a factor if it was not already? (call the new dataset tlc).
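As a hint, the general pattern looks something like this sketch, which uses a small made-up data frame (tlc0) rather than the real tlc data, so the labels here are hypothetical:

```r
# Sketch of the pattern: check whether a grouping column is a factor,
# and convert it if not. tlc0 is a toy data frame, not the real dataset.
tlc0 <- data.frame(sex = c(1, 1, 2, 1), tlc = c(3.40, 3.41, 3.80, 3.90))
is.factor(tlc0$sex)            # FALSE: stored as plain numbers
tlc0$sex <- factor(tlc0$sex)   # convert; levels become "1" and "2"
is.factor(tlc0$sex)            # TRUE
```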
rbind and cbind stick together vectors of equal lengths
into a matrix, where the vectors become the rows or columns,
respectively, of the new matrix.
To combine two vectors into one long vector, you can just use the c() command.
> x<-c(1,3,5,3)
> y<-c(4,5,8,3)
> z<-c(5,6,2,1)
> rbind(x,y,z)
  [,1] [,2] [,3] [,4]
x    1    3    5    3
y    4    5    8    3
z    5    6    2    1
> cbind(x,y,z)
     x y z
[1,] 1 4 5
[2,] 3 5 6
[3,] 5 8 2
[4,] 3 3 1
> c(x,y,z)
 [1] 1 3 5 3 4 5 8 3 5 6 2 1
> atp.mat<-cbind(atp,gr)
> summary(aov(atp.mat[,1]~atp.mat[,2]))
             Df Sum Sq Mean Sq F value Pr(>F)
atp.mat[, 2]  1 0.4278  0.4278  1.3021  0.273
Residuals    14 4.5996  0.3285
What is the problem with your ANOVA with atp.mat? How would you fix it (other than using data.frame)?
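As a hint, here is a small sketch (on made-up vectors) of what cbind does to a factor: it builds a numeric matrix, so the factor is silently replaced by its underlying integer codes.

```r
# cbind() builds a numeric matrix, so a factor is coerced to its codes.
g <- factor(c("a", "b", "a", "b"))
m <- cbind(c(1, 2, 3, 4), g)
is.factor(m[, 2])   # FALSE: column 2 is now just the numbers 1 and 2
# One possible fix: re-declare it inside the model formula, e.g.
#   aov(m[, 1] ~ factor(m[, 2]))
```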
ANOVA Testing
To create an anova table for F-testing, we use lm, which finds the estimates of the parameters of the model, written as y~treat (one-way). Then anova creates the table with the F statistics. Alternatively, you can use aov to fit the model, and then summary to see the ANOVA table. Some subsequent functions work better with the output of aov than with that of lm.
For one way, the model is $x_{ij} = \mu + \alpha_i + \epsilon_{ij}$. For a fixed effect, we assume $\alpha_i$ is a constant number. This would be for data where we are only interested in those specific groups, like Treatment A, B and Control.
For random effects [Model II] we assume $\alpha_i \sim N(0, \sigma_A^2)$. This is for data where the particular groups/treatments we are examining are not all we are interested in, but rather just a sample of all possible groups. For instance, if we tested differences among batches of a culture - the batches we get are not all possible batches, but just a sample, so the effect $\alpha_i$ has a distribution over all possible batches.
For fixed effects, $E(MS_{between}) = \sigma^2 + \frac{\sum_i n_i \alpha_i^2}{k-1}$.
For random effects, $E(MS_{between}) = \sigma^2 + n\sigma_A^2$.
The same F-statistic, $F = MS_{between}/MS_{within}$, is used for both.
With this example from last week's homework, we find that there is a significant difference among the four groups:
> atp.lm<-lm(atp~gr, data=atp.df)
> anova(atp.lm)
Analysis of Variance Table

Response: atp
          Df Sum Sq Mean Sq F value    Pr(>F)    
gr         3 4.5530  1.5177  38.391 1.984e-06 ***
Residuals 12 0.4744  0.0395                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> atp.aov<-aov(atp~gr,data=atp.df)
> summary(atp.aov)
          Df Sum Sq Mean Sq F value    Pr(>F)    
gr         3 4.5530  1.5177  38.391 1.984e-06 ***
Residuals 12 0.4744  0.0395                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> zelazo$active
[1]  9.00  9.50  9.75 10.00 13.00  9.50
Is there a difference in the four groups?
For two way anova with no interaction, the model is $x_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$. $\alpha_i$ is the effect for each group of the first classification (with $I$ groups), $\beta_j$ is the effect for each group of the second classification (with $J$ groups), and $\epsilon_{ij}$ is the $N(0,\sigma^2)$ error term for each observation.
Again, with no interaction, random and fixed effects [Model I and Model II] use the same F-statistic for both, though $E(MS)$ is different:

Fixed effects:  $E(MS_A) = \sigma^2 + \frac{J\sum_i \alpha_i^2}{I-1}$,  $E(MS_B) = \sigma^2 + \frac{I\sum_j \beta_j^2}{J-1}$
Random effects: $E(MS_A) = \sigma^2 + J\sigma_A^2$,  $E(MS_B) = \sigma^2 + I\sigma_B^2$
F-statistics:   $F_A = MS_A/MS_{within}$,  $F_B = MS_B/MS_{within}$
With the data heart.rate, we have only one observation per
cross-classification cell, where the classifications are subject and
time - i.e. repeated testing of the same individual. So we cannot estimate an interaction term.
In R, we write this as hr~time+subj, where time corresponds to $\alpha_i$ and subj corresponds to $\beta_j$.
> heart.lm<-lm(hr~time+subj, data=heart.rate)
> anova(heart.lm)
Analysis of Variance Table

Response: hr
          Df Sum Sq Mean Sq F value    Pr(>F)    
time       3  151.0    50.3  4.0696   0.01802 *  
subj       8 8966.6  1120.8 90.6391 4.863e-16 ***
Residuals 24  296.8    12.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Notice if I change the order of time and subj I get the same answers.
> heart.lm2<-lm(hr~subj+time, data=heart.rate)
> anova(heart.lm2)
Analysis of Variance Table

Response: hr
          Df Sum Sq Mean Sq F value    Pr(>F)    
subj       8 8966.6  1120.8 90.6391 4.863e-16 ***
time       3  151.0    50.3  4.0696   0.01802 *  
Residuals 24  296.8    12.4                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
For two way anova with an interaction, the model is $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$, where $\gamma_{ij}$ is the interaction term. For both fixed and random effects, to test if the interaction term is significant, you compare $F = MS_{AB}/MS_{within}$ with the F distribution.
Usually you test to see if the interaction is significant, and if not then the other p-values for main effects are relevant for both (and you are in the no-interaction case as above).
> coke.lm<-lm(time~width*temp, data=coking)
> anova(coke.lm)
Analysis of Variance Table

Response: time
           Df  Sum Sq Mean Sq F value    Pr(>F)    
width       2 123.143  61.572 222.102 3.312e-10 ***
temp        1  17.209  17.209  62.076 4.394e-06 ***
width:temp  2   5.701   2.851  10.283  0.002504 ** 
Residuals  12   3.327   0.277                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Contrasts & Multiple Comparisons
To test a set of contrasts, we need to bring in another library, gregmisc. You should download the library to your computer in the same way that you downloaded the ISwR library (see above).
We look at the peas dataset from class:
Control Glucose Fructose GlucFruc Sucrose
  75      57      58       58       62
  67      58      61       59       66
  70      60      56       58       65
  75      59      58       61       63
  65      62      57       57       64
  71      60      56       56       62
  67      60      61       58       65
  67      57      60       57       65
  76      59      57       57       62
  68      61      58       59       67
> library(gregmisc)
> peas<-scan() #...paste data here
> culture<-factor(rep(1:5,10),labels=c("Ctrl","Gluc","Fruc","GlucFruc","Sucr"))
> pea.df<-data.frame(peas, culture)
> pea.aov<-aov(peas~culture,data=pea.df)
Now we're interested in a specific contrast, comparing the sugars to the control. We are interested in $\mu_{Ctrl} - \frac{1}{4}(\mu_{Gluc} + \mu_{Fruc} + \mu_{GlucFruc} + \mu_{Sucr})$. This gives a coefficient vector $c = (1, -0.25, -0.25, -0.25, -0.25)$. We want to test the significance of $\sum_i c_i \mu_i$:
> fit.contrast(model=pea.aov,varname=culture,coeff=c(1,-.25,-.25,-.25,-.25))
                                        Estimate Std. Error  t value     Pr(>|t|)
culture c=( 1 -0.25 -0.25 -0.25 -0.25 )     10.2  0.8257993 12.35167 4.679846e-16
Suppose we had more than 1 contrast we set up:
> contr.peas<-matrix(c(4,-1,-1,-1,-1,0,-1,-1,3,-1),ncol=2)
> contr.peas
     [,1] [,2]
[1,]    4    0
[2,]   -1   -1
[3,]   -1   -1
[4,]   -1    3
[5,]   -1   -1
> crossprod(contr.peas)
     [,1] [,2]
[1,]   20    0
[2,]    0   12
> fit.contrast(pea.aov,culture,t(contr.peas),df=T,conf.int=.95)
#Note: We take the transpose, so the contrasts are on the rows
                            Estimate Std. Error   t value     Pr(>|t|) DF  lower CI  upper CI
culture c=( 4 -1 -1 -1 -1 )     40.8   3.303197 12.351670 4.679846e-16 45  34.14702 47.452980
culture c=( 0 -1 -1 3 -1 )      -7.6   2.558645 -2.970322 4.758813e-03 45 -12.75338 -2.446623
Note: This doesn't take into account multiple comparisons problems.
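If we did want a rough correction here, one option (a sketch, not the only approach) is to Bonferroni-adjust the contrast p-values with p.adjust; the two p-values below are taken from the fit.contrast output above.

```r
# Sketch: Bonferroni adjustment of the two contrast p-values above.
p <- c(4.679846e-16, 4.758813e-03)
p.adjust(p, method = "bonferroni")  # each p-value is doubled (capped at 1)
```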
If we want to do all possible pairwise comparisons of the "time" variable for the heart.rate data, we can use the Tukey-Kramer intervals.
> heart.aov<-aov(hr~time+subj,data=heart.rate)
> plot(TukeyHSD(heart.aov,which="time"),cex.axis=.5,las=1)
Plotting
One important thing to look at in data with more than one classification or grouping is how many observations are in each cell. We use the command table below on heart.rate and coking - notice it is important not to include the (continuous) observation column.
> table(heart.rate$subj,heart.rate$time)
   
    0 30 60 120
  1 1  1  1   1
  2 1  1  1   1
  3 1  1  1   1
  4 1  1  1   1
  5 1  1  1   1
  6 1  1  1   1
  7 1  1  1   1
  8 1  1  1   1
  9 1  1  1   1
> table(coking[,-3])
     temp
width 1600 1900
   4     3    3
   8     3    3
   12    3    3
To display the data, we can use boxplots. Another way is to plot the points with stripchart, when you have a small amount of data. An interaction plot also can let you have a good look at what the data is doing.
#two equivalent ways to get a boxplot:
> plot(atp~group,data=atp.df)
> boxplot(atp~group,data=atp.df)
> stripchart(atp~group,data=atp.df,method="jitter",pch=16)
For datasets with more than one predictive variable:
> boxplot(time~width+temp,data=coking)
> interaction.plot(ordered(heart.rate$time), heart.rate$subj, heart.rate$hr,
   xlab="Time after Enalaprilat", ylab="Heart Rate (Beats per Minute)",
   col=1:9, fixed=T)
So far we've just looked at hypothesis tests of the parameters. But we would like to look and see if this model makes sense with the data. For example, in two-way ANOVA we modeled the observations as a linear combination of $\alpha_i$ and $\beta_j$ and maybe an interaction term. Thus for each cross-classification, we have an estimate $\hat{x}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$ of what we think the value of that cell should be. If our model is correct, the difference between this value that the model fitted and the one we observed would be the random error from $N(0,\sigma^2)$. These differences between the model's fit and the observed values are the residuals: $r_{ij} = x_{ij} - \hat{x}_{ij}$. All of this information is contained in the object we named datasetname.lm
> names(atp.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"
> names(atp.aov)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"
You can access these by using datasetname.lm$name, e.g. atp.aov$fit. For many of the commonly used components there is an extractor function, like fitted() or residuals(). So atp.aov$fit is the same as fitted(atp.aov).
> matrix(round(atp.lm$fit,3),ncol=4,byrow=T)
     [,1]  [,2]  [,3] [,4]
[1,] 1.19 1.785 0.292  1.2
[2,] 1.19 1.785 0.292  1.2
[3,] 1.19 1.785 0.292  1.2
[4,] 1.19 1.785 0.292  1.2
> matrix(atp.df$atp,ncol=4,byrow=T)
     [,1] [,2] [,3] [,4]
[1,] 1.45 1.70 0.21 1.34
[2,] 1.19 2.04 0.58 0.99
[3,] 1.05 1.49 0.11 1.17
[4,] 1.07 1.91 0.27 1.30
> matrix(round(atp.lm$res,3),ncol=4,byrow=T)
      [,1]   [,2]   [,3]  [,4]
[1,]  0.26 -0.085 -0.083  0.14
[2,]  0.00  0.255  0.288 -0.21
[3,] -0.14 -0.295 -0.182 -0.03
[4,] -0.12  0.125 -0.022  0.10
We can use plots to see if these residuals are reasonable. We can do a qq plot to see if they look normal, and plot the residuals against the fitted values, and how they relate to the grouping.
> atp.resid<-residuals(atp.lm)
> qq.atp<-qqnorm(atp.resid,type='n')
> text(x=qq.atp$x,y=qq.atp$y,labels=atp.df$gr)
> plot(atp.lm$fit,atp.resid,type="n")
> text(atp.lm$fit,atp.resid,labels=atp.df$gr)