Analysis of Variance
As another example, consider a data set with information about experience, gender,
and wages. Experience is recorded as number of years on the job, and gender is recorded
as 0 or 1. To see if the slope of the line relating experience to wages is different
for the two genders, we can proceed as follows:
> wages = read.delim('http://www.stat.berkeley.edu/~spector/s133/data/wages.tab')
> wages$gender = factor(wages$gender)
> wages.aov = aov(wage ~ experience*gender,data=wages)
> anova(wages.aov)
Analysis of Variance Table
Response: wage
Df Sum Sq Mean Sq F value Pr(>F)
experience 1 106.7 106.69 4.2821 0.03900 *
gender 1 635.8 635.78 25.5175 6.042e-07 ***
experience:gender 1 128.9 128.94 5.1752 0.02331 *
Residuals 530 13205.3 24.92
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The significant p-value for the experience:gender interaction indicates
that the effect of experience on wages differs depending on the gender of the
employee. By performing two separate regressions, we can see the values of the
slopes for the two genders:
> coef(lm(wage~experience,data=subset(wages,gender == 0)))
(Intercept) experience
8.62215280 0.08091533
> coef(lm(wage~experience,data=subset(wages,gender == 1)))
(Intercept) experience
7.857197368 0.001150118
This indicates that there is a small increase in wages as experience increases
for gender == 0, but virtually no increase for gender == 1.
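The same slopes can also be recovered from the single model containing the interaction. With R's default treatment contrasts, the experience coefficient in the combined fit is the slope for the baseline level (gender 0), and the experience:gender1 coefficient is the difference between the two slopes:
> coef(lm(wage ~ experience*gender,data=wages))
Adding the experience and experience:gender1 coefficients should reproduce the slope from the separate regression for gender == 1.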
1 Constructing Formulas
For the examples we've looked at, there weren't enough terms in the model
to make entering them by hand tedious, but in models with many interactions
it can quickly become a nuisance to type every term into the model.
When the terms are all main effects, you can often save typing by using a
data= argument specifying a data set with just the variables you are
interested in and using the period (.) as the right-hand side of the
model, but that will not automatically generate interactions.
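As a sketch of that shorthand (the data frame dat and variables y, x1, x2, and x3 here are hypothetical), the following two calls fit the same model when dat contains only the response y and the three predictors:
> lm(y ~ x1 + x2 + x3,data=dat)
> lm(y ~ .,data=dat)
The period expands to all variables in dat other than y, but only as main effects; a formula like y ~ .^2 would be needed to also include all two-way interactions.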
The formula function will accept a text string containing a formula,
and convert it to a formula that can be passed to any of the modeling functions.
While it doesn't really make sense for the wine data, suppose we
wanted to add Cultivar and all the interactions between Cultivar
and the independent variables to our original regression model.
The first step is to create a vector of the variables we want to work with.
This can usually be done pretty easily using the names of the data frame
we're working with.
> vnames = names(wine)[c(3,5,10,11,13,14)]
For the main part of the model we need to join together these names with
plus signs (+):
> main = paste(vnames,collapse=' + ')
The interactions can be created by pasting together Cultivar
with each of the continuous variables, using a colon (:) as a
separator, and then joining them together with plus signs:
> ints = paste(paste('Cultivar',vnames,sep=':'),collapse=" + ")
Finally, we put the dependent variable and Cultivar into the model,
and paste everything together:
> mymodel = paste('Alcohol ~ Cultivar',main,ints,sep='+')
> mymodel
[1] "Alcohol ~ Cultivar+Malic.acid + Alkalinity.ash + Proanthocyanins + Color.intensity + OD.Ratio + Proline+Cultivar:Malic.acid + Cultivar:Alkalinity.ash + Cultivar:Proanthocyanins + Cultivar:Color.intensity + Cultivar:OD.Ratio + Cultivar:Proline"
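As an aside, the built-in reformulate function can assemble the same kind of formula directly from a character vector of term labels, skipping the manual pasting; a sketch of the same model would be:
> reformulate(c('Cultivar',vnames,paste('Cultivar',vnames,sep=':')),response='Alcohol')
Since reformulate returns a formula object, its result can be handed straight to aov or lm without an extra call to formula.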
To run this, we need to pass it to a modeling function through the formula
function:
> wine.big = aov(formula(mymodel),data=wine)
> anova(wine.big)
Analysis of Variance Table
Response: Alcohol
Df Sum Sq Mean Sq F value Pr(>F)
Cultivar 2 70.795 35.397 154.1166 < 2.2e-16 ***
Malic.acid 1 0.013 0.013 0.0573 0.81106
Alkalinity.ash 1 0.229 0.229 0.9955 0.31993
Proanthocyanins 1 0.224 0.224 0.9755 0.32483
Color.intensity 1 4.750 4.750 20.6808 1.079e-05 ***
OD.Ratio 1 0.031 0.031 0.1335 0.71536
Proline 1 0.262 0.262 1.1410 0.28708
Cultivar:Malic.acid 2 0.116 0.058 0.2524 0.77727
Cultivar:Alkalinity.ash 2 0.876 0.438 1.9071 0.15194
Cultivar:Proanthocyanins 2 1.176 0.588 2.5610 0.08045 .
Cultivar:Color.intensity 2 0.548 0.274 1.1931 0.30602
Cultivar:OD.Ratio 2 0.415 0.207 0.9024 0.40769
Cultivar:Proline 2 1.160 0.580 2.5253 0.08328 .
Residuals 157 36.060 0.230
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, there isn't anything too startling. If we wanted to investigate,
say, the Cultivar:Proanthocyanins interaction, we could look at a
scatter plot using separate colors for the points, with the corresponding
best-fit regression line for each Cultivar:
> plot(wine$Proanthocyanins,wine$Alcohol,col=c('red','blue','green')[wine$Cultivar])
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==1),col='red')
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==2),col='blue')
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==3),col='green')
> legend('topleft',legend=levels(wine$Cultivar),pch=1,col=c('red','blue','green'))
The plot appears below:
2 Alternatives for ANOVA
Not all data is suitable for ANOVA. In particular, if the variance
differs dramatically between groups, the assumption of equal
variances is violated, and the ANOVA results may not be valid. We've seen
before that log transformations often help with ratios or percentages,
but they are not always effective.
As an example of a data set not suitable for ANOVA, consider the built-in
data set airquality, which has daily measurements of ozone and other
quantities for a 153-day period. The question to be answered is whether
or not the average level of ozone is the same over the five months sampled.
On the surface, this data seems suitable for ANOVA, so let's examine
the diagnostic plots that would result from performing the ANOVA:
> airquality$Month = factor(airquality$Month)
> ozone.aov = aov(Ozone~Month,data=airquality)
> plot(ozone.aov)
There are deviations at both the low and high ends of the normal Q-Q plot,
and some departure from a constant trend in the Scale-Location plot. Will
a log transformation help?
> ozonel.aov = aov(log(Ozone)~Month,data=airquality)
> plot(ozonel.aov)
In this case, the transformation didn't really help.
It might
be possible to find a more suitable transformation, but we can also
use a statistical test that makes fewer assumptions about our data.
One such test is the Kruskal-Wallis test. Basically, the test replaces
the data with the ranks of the data, and performs an ANOVA on those
ranks. It assumes that observations are independent from each other,
but doesn't demand equality of variance across the groups, or that
the observations follow a normal distribution.
The kruskal.test function in R performs the test,
using the same formula interface as aov.
> ozone.kruskal = kruskal.test(Ozone~Month,data=airquality)
> ozone.kruskal
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 6.901e-06
All that's reported is the significance level for the test, which tells
us that the differences between the ozone levels for the different months
are highly significant. To see where the differences come from, we can
use the kruskalmc function in the strangely named pgirmess
package. Unfortunately, this function doesn't use the model interface;
you simply provide the response variable and the (factor) grouping variable.
> library(pgirmess)
> kruskalmc(airquality$Ozone,airquality$Month)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
5-6 57.048925 31.85565 TRUE
5-7 38.758065 31.59346 TRUE
5-8 37.322581 31.59346 TRUE
5-9 2.198925 31.85565 FALSE
6-7 18.290860 31.85565 FALSE
6-8 19.726344 31.85565 FALSE
6-9 54.850000 32.11571 TRUE
7-8 1.435484 31.59346 FALSE
7-9 36.559140 31.85565 TRUE
8-9 35.123656 31.85565 TRUE
Studying these results shows that months 6, 7, and 8 are
very similar to each other, but different from months 5 and 9.
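An alternative way to probe the pairwise differences, using only base R, is the pairwise.wilcox.test function, which performs a Wilcoxon rank-sum test for each pair of months, adjusting the p-values for multiple comparisons:
> pairwise.wilcox.test(airquality$Ozone,airquality$Month)
By default it applies Holm's adjustment; the pattern of small and large adjusted p-values should echo the kruskalmc results above.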
To understand why this data is not suitable for ANOVA, we can
look at boxplots of the Ozone levels for each month:
> boxplot(with(airquality,split(Ozone,Month)))
The variances are clearly not equal across months, and the
lack of symmetry for months 5 and 6 calls the assumption of
normality into doubt.
File translated from TeX by TTH, version 3.67, on 6 May 2011.