Analysis of Variance
1 More Complex Models
When working with the wine data frame,
we've separated the categorical variable (Cultivar) from the
continuous variables for pedagogical reasons, but modeling functions
like lm and aov can accommodate both in the same model. Let's add the
Cultivar variable to the regression model we've previously worked with:
> wine.new = lm(Alcohol~Cultivar+Malic.acid+Alkalinity.ash+Proanthocyanins+Color.intensity+OD.Ratio+Proline,data=wine)
> summary(wine.new)
Call:
lm(formula = Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash +
Proanthocyanins + Color.intensity + OD.Ratio + Proline, data = wine)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.13591 -0.31737 -0.02623  0.33229  1.65633 
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     12.9158487  0.4711149  27.415  < 2e-16 ***
Cultivar2       -0.9957910  0.1776136  -5.607 8.26e-08 ***
Cultivar3       -0.6714047  0.2396380  -2.802  0.00568 ** 
Malic.acid       0.0559472  0.0410860   1.362  0.17510    
Alkalinity.ash  -0.0133598  0.0134499  -0.993  0.32198    
Proanthocyanins -0.0561493  0.0817366  -0.687  0.49305    
Color.intensity  0.1135452  0.0270097   4.204 4.24e-05 ***
OD.Ratio         0.0494695  0.0987946   0.501  0.61721    
Proline          0.0002391  0.0002282   1.048  0.29629    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4886 on 169 degrees of freedom
Multiple R-Squared: 0.6541, Adjusted R-squared: 0.6377
F-statistic: 39.95 on 8 and 169 DF, p-value: < 2.2e-16
One problem with the summary display for models like this is that
it treats our factor variable (Cultivar) as two separate
variables. While that is the way it is fit in the model, it's usually
more informative to combine the effects of the two dummy variables
into a single effect. The anova command will produce a more traditional
ANOVA table:
> anova(wine.new)
Analysis of Variance Table
Response: Alcohol
                 Df Sum Sq Mean Sq  F value    Pr(>F)    
Cultivar          2 70.795  35.397 148.2546 < 2.2e-16 ***
Malic.acid        1  0.013   0.013   0.0552    0.8146    
Alkalinity.ash    1  0.229   0.229   0.9577    0.3292    
Proanthocyanins   1  0.224   0.224   0.9384    0.3341    
Color.intensity   1  4.750   4.750  19.8942 1.488e-05 ***
OD.Ratio          1  0.031   0.031   0.1284    0.7206    
Proline           1  0.262   0.262   1.0976    0.2963    
Residuals       169 40.351   0.239                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary display contains other useful information, so you shouldn't
hesitate to look at both.
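For example, the object returned by summary can be saved and its
components extracted directly; a quick sketch (the component names are
those of base R's summary.lm object):
> wine.sum = summary(wine.new)
> names(wine.sum)            # list the available components
> wine.sum$adj.r.squared     # the adjusted R-squared, as a number
> wine.sum$coefficients      # the coefficient table, as a matrix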
Comparing these results to our previous regression, we can see that only
one variable (Color.intensity) is still significant, and that the
effect of Cultivar is very significant. For this data set, it
means that while we can use the chemical composition to help predict the
Alcohol content of the wines, knowing the Cultivar
will be more effective. Let's look at a reduced model that uses only
Cultivar and Color.intensity to see how it compares with
the model containing the extra variables:
> wine.new1 = lm(Alcohol~Cultivar+Color.intensity,data=wine)
> summary(wine.new1)
Call:
lm(formula = Alcohol ~ Cultivar + Color.intensity, data = wine)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.12074 -0.32721 -0.04133  0.34799  1.54962 
Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     13.14845    0.14871  88.417  < 2e-16 ***
Cultivar2       -1.20265    0.10431 -11.530  < 2e-16 ***
Cultivar3       -0.79248    0.10495  -7.551 2.33e-12 ***
Color.intensity  0.10786    0.02434   4.432 1.65e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4866 on 174 degrees of freedom
Multiple R-Squared: 0.6468, Adjusted R-squared: 0.6407
F-statistic: 106.2 on 3 and 174 DF, p-value: < 2.2e-16
The adjusted R-squared for this model is better than that of the previous
one, indicating that removing those extra variables didn't seem to
cause any problems. To formally test whether there is a difference
between the two models, we can use the anova function. When passed
a single model object, anova prints an ANOVA table, but when it's passed
two model objects, it performs a test comparing the two models:
> anova(wine.new,wine.new1)
Analysis of Variance Table
Model 1: Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash + Proanthocyanins +
Color.intensity + OD.Ratio + Proline
Model 2: Alcohol ~ Cultivar + Color.intensity
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    169 40.351                           
2    174 41.207 -5    -0.856 0.7174 0.6112
The test indicates that there's no significant difference between the two
models.
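Incidentally, the reduced model could also have been created from the
full one with the update function; a period in the new formula stands
for the corresponding part of the old formula:
> update(wine.new, . ~ Cultivar + Color.intensity)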
When all the independent variables in our model were categorical (the
race/activity example),
the interactions between the categorical variables were one of the most
interesting parts of the analysis. What does an interaction between a
categorical variable and a continuous variable represent? Such an interaction
can tell us whether the slope of the continuous variable is different for the
different levels of the categorical variable. In the current model, we can
test whether the slopes are different by adding the term
Cultivar:Color.intensity to the model:
> wine.new2 = lm(Alcohol ~ Cultivar + Color.intensity + Cultivar:Color.intensity, data=wine)
> anova(wine.new2)
Analysis of Variance Table
Response: Alcohol
                          Df Sum Sq Mean Sq  F value    Pr(>F)    
Cultivar                   2 70.795  35.397 149.6001 < 2.2e-16 ***
Color.intensity            1  4.652   4.652  19.6613 1.644e-05 ***
Cultivar:Color.intensity   2  0.509   0.255   1.0766     0.343    
Residuals                172 40.698   0.237                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There doesn't seem to be a significant interaction.
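If we wanted a formal test rather than just reading the ANOVA table, the
two-model form of anova shown earlier applies here as well, since
wine.new1 is the same model without the interaction term:
> anova(wine.new1, wine.new2)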
As another example, consider a data set with information about experience, gender,
and wages. Experience is recorded as the number of years on the job, and gender
is recorded as 0 or 1. To see if the slope of the line relating experience to
wages is different for the two genders, we can proceed as follows:
> wages = read.delim('http://www.stat.berkeley.edu/~spector/s133/data/wages.tab')
> wages$gender = factor(wages$gender)
> wages.aov = aov(wage ~ experience*gender,data=wages)
> anova(wages.aov)
Analysis of Variance Table
Response: wage
                   Df  Sum Sq Mean Sq F value    Pr(>F)    
experience          1   106.7  106.69  4.2821   0.03900 *  
gender              1   635.8  635.78 25.5175 6.042e-07 ***
experience:gender   1   128.9  128.94  5.1752   0.02331 *  
Residuals         530 13205.3   24.92                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The significant probability value for experience:gender indicates
that the effect of experience on wages is different depending on the gender of the
employee. By performing two separate regressions, we can see the values for the
slopes for the different genders:
> coef(lm(wage ~ experience,data=subset(wages,gender == 0)))
(Intercept)  experience 
 8.62215280  0.08091533 
> coef(lm(wage ~ experience,data=subset(wages,gender == 1)))
(Intercept)  experience 
7.857197368 0.001150118 
This indicates that there is a small increase in wages as experience increases
for gender == 0, but virtually no increase for gender == 1.
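The same two slopes can also be recovered from a single interaction
model: in the coefficients of the wage ~ experience*gender fit,
experience is the slope for gender 0, and experience:gender1 is the
difference between the two slopes:
> coef(lm(wage ~ experience*gender, data=wages))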
2 Constructing Formulas
For the examples we've looked at, there weren't so many terms in the model that
it became tedious entering them by hand, but in models with many interactions
it can quickly become a nuisance to type every term into the model.
When the terms are all main effects, you can often save typing by using a
data= argument specifying a data set with just the variables you are
interested in, and using a period (.) as the right-hand side of the
model; but that will not automatically generate interactions.
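For example, using a hypothetical reduced data frame (here called
smallwine), the period expands to all the variables in the data set
other than the response:
> smallwine = wine[,c('Alcohol','Cultivar','Color.intensity')]   # hypothetical subset
> lm(Alcohol ~ ., data=smallwine)    # same as Alcohol ~ Cultivar + Color.intensity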
The formula function will accept a text string containing a formula,
and convert it to a formula that can be passed to any of the modeling functions.
While it doesn't really make sense for the wine data, suppose we
wanted to add Cultivar and all the interactions between Cultivar
and the independent variables to our original regression model.
The first step is to create a vector of the names of the variables we want
to work with. This can usually be done pretty easily using the names of the
data frame we're working with.
> vnames = names(wine)[c(3,5,10,11,13,14)]
For the main part of the model we need to join together these names with
plus signs (+):
> main = paste(vnames,collapse=' + ')
The interactions can be created by pasting together Cultivar
with each of the continuous variables, using a colon (:) as a
separator, and then joining them together with plus signs:
> ints = paste(paste('Cultivar',vnames,sep=':'),collapse=" + ")
Finally, we put the dependent variable and Cultivar into the model,
and paste everything together:
> mymodel = paste('Alcohol ~ Cultivar',main,ints,sep='+')
> mymodel
[1] "Alcohol ~ Cultivar+Malic.acid + Alkalinity.ash + Proanthocyanins + Color.intensity + OD.Ratio + Proline+Cultivar:Malic.acid + Cultivar:Alkalinity.ash + Cultivar:Proanthocyanins + Cultivar:Color.intensity + Cultivar:OD.Ratio + Cultivar:Proline"
To run this, we need to pass it to a modeling function through the formula
function:
> wine.big = aov(formula(mymodel),data=wine)
> anova(wine.big)
Analysis of Variance Table
Response: Alcohol
                          Df Sum Sq Mean Sq  F value    Pr(>F)    
Cultivar                   2 70.795  35.397 154.1166 < 2.2e-16 ***
Malic.acid                 1  0.013   0.013   0.0573   0.81106    
Alkalinity.ash             1  0.229   0.229   0.9955   0.31993    
Proanthocyanins            1  0.224   0.224   0.9755   0.32483    
Color.intensity            1  4.750   4.750  20.6808 1.079e-05 ***
OD.Ratio                   1  0.031   0.031   0.1335   0.71536    
Proline                    1  0.262   0.262   1.1410   0.28708    
Cultivar:Malic.acid        2  0.116   0.058   0.2524   0.77727    
Cultivar:Alkalinity.ash    2  0.876   0.438   1.9071   0.15194    
Cultivar:Proanthocyanins   2  1.176   0.588   2.5610   0.08045 .  
Cultivar:Color.intensity   2  0.548   0.274   1.1931   0.30602    
Cultivar:OD.Ratio          2  0.415   0.207   0.9024   0.40769    
Cultivar:Proline           2  1.160   0.580   2.5253   0.08328 .  
Residuals                157 36.060   0.230                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, there isn't anything too startling. If we wanted to investigate,
say, the Cultivar:Proanthocyanins interaction, we could look at a
scatter plot that uses a separate color for the points of each Cultivar,
along with the corresponding best-fit regression lines:
> plot(wine$Proanthocyanins,wine$Alcohol,col=c('red','blue','green')[wine$Cultivar])
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==1),col='red')
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==2),col='blue')
> abline(lm(Alcohol~Proanthocyanins,data=wine,subset=Cultivar==3),col='green')
> legend('topleft',legend=levels(wine$Cultivar),pch=1,col=c('red','blue','green'))
The plot appears below:
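As an aside, base R's reformulate function can assemble a formula
directly from a character vector of term labels, which would avoid some
of the pasting above; a minimal sketch using the vnames vector created
earlier (myterms is just an illustrative name):
> myterms = c('Cultivar', vnames, paste('Cultivar', vnames, sep=':'))
> reformulate(myterms, response='Alcohol')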
3 Alternatives for ANOVA
Not all data is suitable for ANOVA - in particular, if the variance
varies dramatically between different groups, the assumption of equal
variances is violated, and ANOVA results may not be valid. We've seen
before that log transformations often help with ratios or percentages,
but they may not always be effective.
As an example of a data set not suitable for ANOVA, consider the built-in
data set airquality, which has daily measurements of ozone and other
quantities for a 153-day period. The question to be answered is whether
or not the average level of ozone is the same over the five months sampled.
On the surface, this data seems suitable for ANOVA, so let's examine
the diagnostic plots that would result from performing the ANOVA:
> airquality$Month = factor(airquality$Month)
> ozone.aov = aov(Ozone~Month,data=airquality)
> plot(ozone.aov)
There are deviations at both the low and high ends of the Q-Q plot,
and some departure from a constant level in the Scale-Location plot. Will
a log transformation help?
> ozonel.aov = aov(log(Ozone)~Month,data=airquality)
> plot(ozonel.aov)
In this case, the transformation didn't really help.
It might
be possible to find a more suitable transformation, but we can also
use a statistical test that makes fewer assumptions about our data.
One such test is the Kruskal-Wallis test. Basically, the test replaces
the data with the ranks of the data, and performs an ANOVA on those
ranks. It assumes that observations are independent from each other,
but doesn't demand equality of variance across the groups, or that
the observations follow a normal distribution.
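In fact, we can mimic the idea directly, just for intuition
(kruskal.test additionally adjusts for ties and uses a chi-squared
approximation), by running an ordinary ANOVA on the ranks:
> # na.last='keep' leaves missing Ozone values as NA so aov will drop them
> anova(aov(rank(Ozone, na.last='keep') ~ Month, data=airquality))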
The kruskal.test function in R performs the test,
using the same formula interface as aov.
> ozone.kruskal = kruskal.test(Ozone~Month,data=airquality)
> ozone.kruskal
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 6.901e-06
All that's reported is the significance level for the test, which tells
us that the differences between the Ozone levels for the different months
are very significant. To see where the differences come from, we can
use the kruskalmc function in the strangely-named pgirmess
package. Unfortunately, this function doesn't use the model interface;
you simply provide the response variable and the (factor) grouping variable.
> library(pgirmess)
> kruskalmc(airquality$Ozone,airquality$Month)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
       obs.dif critical.dif difference
5-6  57.048925     31.85565       TRUE
5-7  38.758065     31.59346       TRUE
5-8  37.322581     31.59346       TRUE
5-9   2.198925     31.85565      FALSE
6-7  18.290860     31.85565      FALSE
6-8  19.726344     31.85565      FALSE
6-9  54.850000     32.11571       TRUE
7-8   1.435484     31.59346      FALSE
7-9  36.559140     31.85565       TRUE
8-9  35.123656     31.85565       TRUE
Studying these results shows that months 6, 7, and 8 are
very similar to each other, but different from months 5 and 9.
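If installing a package is inconvenient, base R's pairwise.wilcox.test
performs pairwise rank-based comparisons with a correction for multiple
testing, although its procedure differs from kruskalmc's:
> pairwise.wilcox.test(airquality$Ozone, airquality$Month, p.adjust.method='bonferroni')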
To understand why this data is not suitable for ANOVA, we can
look at boxplots of the Ozone levels for each month:
> boxplot(with(airquality,split(Ozone,Month)))
The variances are clearly not equal across the months, and the
lack of symmetry for months 5 and 6 calls the normality assumption
into doubt.
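The visual impression can be backed up with a formal test of equal
variances; for example, the Fligner-Killeen test (fligner.test in base R)
is a rank-based test that doesn't require normality:
> fligner.test(Ozone ~ Month, data=airquality)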