Analysis of Variance
1 Two-Way ANOVA
To express the idea of an interaction in the R modeling language, we need to
introduce two new operators. The colon (:) is used to indicate an
interaction between two or more variables in model formula. The asterisk
(*) is use to indicate all main effects and interactions among the
variables that it joins. So, for example the term A*B would expand
to the three terms A, B, and A:B. As an example
of a two-way ANOVA, consider a study to determine the effects of physical
activity on obesity. Subjects were rated for their physical activity on a
three point scale with 1=not very active, 2=somewhat active, and 3=very active.
In addition, the race (either 1 or 2) of the participant was recorded, along
with their Body Mass Index (BMI). We want
to answer the
following three questions:
- Were the means for BMI the same for the two races?
-
Were the means for BMI the same for the three activity levels?
-
Is the effect of activity level different depending on race?, or
equivalently Is the effect of race different depending on activity level?
The first two questions can be answered by looking at the race and
activity main effects, while the third question describes the
race by activity interaction. The data can be found at
http://www.stat.berkeley.edu/classes/s133/data/activity.csv. Here are the R statements to run
the ANOVA:
> activity = read.csv('activity.csv')
> activity$race = factor(activity$race)
> activity$activity = factor(activity$activity)
> activity.aov = aov(bmi~race*activity,data=activity)
> summary(activity.aov)
Df Sum Sq Mean Sq F value Pr(>F)
race 1 3552 3552 102.5894 < 2e-16 ***
activity 2 2672 1336 38.5803 < 2e-16 ***
race:activity 2 301 151 4.3508 0.01303 *
Residuals 1865 64574 35
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that there are two degrees of freedom for activity - this means
two parameters will be estimated in order to explain activity's effect on
bmi. Unlike linear regression, where only a single parameter is
estimated, and the only relationship that can be fit is a linear one, using two
parameters (to account for the three levels of activity) provides more flexibility
than would be possible with linear regression.
To see if the analysis was reasonable, we can look at the default plots:
> plot(activity.aov)
The graphs appear below:
There seems to some deviation from normality when looking at
the Normal Q-Q plot (recall that, if the residuals did follow a normal
distribution, we would see a straight line.) When this situation arises,
analyzing the logarithm of the dependent variable often helps. Here are the
same results for the analysis of log(bmi):
> activity1.aov = aov(log(bmi)~race*activity,data=activity)
> summary(activity1.aov)
Df Sum Sq Mean Sq F value Pr(>F)
race 1 4.588 4.588 100.3741 < 2.2e-16 ***
activity 2 3.251 1.625 35.5596 6.98e-16 ***
race:activity 2 0.317 0.158 3.4625 0.03155 *
Residuals 1865 85.240 0.046
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(activity1.aov)
The Q-Q plot looks better, so this model is probably more appropriate.
We can see both main effects as well as the interaction are significant.
To see what's happening with the main effects, we can use aggregate:
> aggregate(log(activity$bmi),activity['race'],mean)
race x
1 1 3.122940
2 2 3.222024
> aggregate(log(activity$bmi),activity['activity'],mean)
activity x
1 1 3.242682
2 2 3.189810
3 3 3.109518
Race 2 has higher values of BMI than race 1, and BMI decreases as the
level of activity increases.
To study the interaction, we could use aggregate, passing both race
and activity as the second argument:
> aggregate(log(activity$bmi),activity[c('race','activity')],mean)
race activity x
1 1 1 3.161119
2 2 1 3.298576
3 1 2 3.140970
4 2 2 3.230651
5 1 3 3.084426
6 2 3 3.143478
The arrangement of the output from tapply may be more helpful:
> tapply(log(activity$bmi),activity[c('race','activity')],mean)
activity
race 1 2 3
1 3.161119 3.140970 3.084426
2 3.298576 3.230651 3.143478
It's usually difficult to judge relationships like this from a table.
One useful tool in this case is
an interaction plot. An interaction plot has one point
for each combination of the factors defined by an interaction. The x-axis
represents the levels of one of the factors, and the y-axis represents the
mean of the dependent variable, and a separate line is drawn for each
level of the factor not represented on the x-axis. While it wouldn't be
too hard to produce such a plot with basic commands in R, the process is
automated by the interaction.plot function. The first argument to
this function is the factor to appear on the x-axis; the second is the
factor which will define the multiple lines being drawn, and the third
argument is the dependent variable. By default, interaction.plot
uses the mean for its display, but you can provide a function of your own
choosing through the fun= argument.
For the activity data, we can produce an interaction plot with the following
code:
> with(activity,interaction.plot(activity,race,log(bmi)))
Here's the plot:
It can be seen that the interaction is due to the fact that the slope of
the line for race 2 is steeper than the line for race 1.
2 Another Example
This example has to do with iron retention in mice. Two different treatments,
each at three different levels, were fed to mice. The treatments were
tagged with radioactive iron, so that the percentage of iron retained could
be measured after a fixed period of time.
The data is presented in a table as follows:
----------------------------------------------------
Fe2+ Fe3+
----------------------------------------------------
high medium low high medium low
----------------------------------------------------
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.30 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25
----------------------------------------------------
Thus, before we can perform analysis on the data, it needs
to be rearranged. To do this, we can use the reshape
function. Since there are two different sets of variables
that represent the change in the factors of the experiment,
we first read in the data (skipping the header), and create
two groups of variables in our call to reshape:
> iron0 = read.table('iron.txt',skip=5,nrows=18)
> names(iron0) = c('Fe2high','Fe2medium','Fe2low','Fe3high','Fe3medium','Fe3low')
> iron1 = reshape(iron0,varying=list(1:3,4:6),direction='long')
> head(iron1)
time Fe2high Fe3high id
1.1 1 0.71 2.20 1
2.1 1 1.66 2.69 2
3.1 1 2.01 3.54 3
4.1 1 2.16 3.75 4
5.1 1 2.42 3.83 5
6.1 1 2.42 4.08 6
After examining the data, it can be seen that the low, medium, and
high values have been translated into values 1, 2, and 3 in the
variable time. The id variable is created to help us see
which line each observation came from, which is not relevant in this
case, since the table was just used to present the data, and the
values in the table don't represent repeated measures on the same
experimental unit.
Next, we eliminate the id column, rename the column named
"time" and further reshape the
data to represent the two treatments:
> iron1$id = NULL
> names(iron1)[1] = 'level'
> iron = reshape(iron1,varying=list(2:3),direction='long')
> head(iron)
level time Fe2high id
1.1 1 1 0.71 1
2.1 1 1 1.66 2
3.1 1 1 2.01 3
4.1 1 1 2.16 4
5.1 1 1 2.42 5
6.1 1 1 2.42 6
All that's left is to remove the id
column and to rename time and Fe2high:
> iron$id = NULL
> names(iron)[2:3] = c('treatment','retention')
> head(iron)
level treatment retention
1.1 1 1 0.71
2.1 1 1 1.66
3.1 1 1 2.01
4.1 1 1 2.16
5.1 1 1 2.42
6.1 1 1 2.42
Once the data has been reshaped, it's essential to
make sure that the independent variables are correctly stored
as factors:
> sapply(iron,class)
level treatment retention
"integer" "integer" "numeric"
Since treatment and level are not factors,
we must convert them:
> iron$treatment = factor(iron$treatment,labels=c('Fe2+','Fe3+'))
> iron$level = factor(iron$level,labels=c('high','medium','low'))
> head(iron)
level treatment retention
1.1 high Fe2+ 0.71
2.1 high Fe2+ 1.66
3.1 high Fe2+ 2.01
4.1 high Fe2+ 2.16
5.1 high Fe2+ 2.42
6.1 high Fe2+ 2.42
Now we can perform the ANOVA:
> iron.aov = aov(retention ~ level*treatment,data=iron)
> summary(iron.aov)
Df Sum Sq Mean Sq F value Pr(>F)
level 2 983.62 491.81 17.0732 4.021e-07 ***
treatment 1 62.26 62.26 2.1613 0.1446
level:treatment 2 8.29 4.15 0.1439 0.8661
Residuals 102 2938.20 28.81
Before proceeding further, we should examine the
ANOVA plots to see if the data meets the assumptions of ANOVA:
Both the normal Q-Q plot and the scale-location plot indicate
problems similar to the previous example, and a log transformation
is once again suggested. This is not unusual when data is
measured as percentages or ratios.
> ironl.aov = aov(log(retention) ~ level*treatment,data=iron)
> par(mfrow=c(2,2))
> plot(ironl.aov)
The plots look much better, so we'll continue with the analysis
of the log of retention.
Df Sum Sq Mean Sq F value Pr(>F)
level 2 15.588 7.794 22.5241 7.91e-09 ***
treatment 1 2.074 2.074 5.9931 0.01607 *
level:treatment 2 0.810 0.405 1.1708 0.31426
Residuals 102 35.296 0.346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since there were only two levels of treatment, the significant
treatment effect means the two treatments were different. We can use the
TukeyHSD function to see if the different levels of the treatment
were different:
> TukeyHSD(ironl.aov,'level')
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = log(retention) ~ level * treatment, data = iron)
$level
diff lwr upr p adj
medium-high 0.5751084 0.24533774 0.9048791 0.0002042
low-high 0.9211588 0.59138806 1.2509295 0.0000000
low-medium 0.3460503 0.01627962 0.6758210 0.0373939
It appears that the high level had much lower retention values
than the other two levels:
> aggregate(log(iron$retention),iron['level'],mean)
level x
1 high 1.420526
2 medium 1.995635
3 low 2.341685
Although there was no significant interaction, an interaction plot
can still be useful in visualizing what happens in an experiment:
> interaction.plot(iron$level,iron$treatment,log(iron$retention))
3 More Complex Models
When working with the wine data frame,
we've separated the categorical variable (Cultivar) from the
continuous variable for pedagogical reasons, but the aov function
can accomodate both in the same model. Let's add the Cultivar
variable to the regression model we've previously worked with:
> wine.new = lm(Alcohol~Cultivar+Malic.acid+Alkalinity.ash+Proanthocyanins+Color.intensity+OD.Ratio+Proline,data=wine)
> summary(wine.new)
Call:
lm(formula = Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash +
Proanthocyanins + Color.intensity + OD.Ratio + Proline, data = wine)
Residuals:
Min 1Q Median 3Q Max
-1.13591 -0.31737 -0.02623 0.33229 1.65633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.9158487 0.4711149 27.415 < 2e-16 ***
Cultivar2 -0.9957910 0.1776136 -5.607 8.26e-08 ***
Cultivar3 -0.6714047 0.2396380 -2.802 0.00568 **
Malic.acid 0.0559472 0.0410860 1.362 0.17510
Alkalinity.ash -0.0133598 0.0134499 -0.993 0.32198
Proanthocyanins -0.0561493 0.0817366 -0.687 0.49305
Color.intensity 0.1135452 0.0270097 4.204 4.24e-05 ***
OD.Ratio 0.0494695 0.0987946 0.501 0.61721
Proline 0.0002391 0.0002282 1.048 0.29629
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4886 on 169 degrees of freedom
Multiple R-Squared: 0.6541, Adjusted R-squared: 0.6377
F-statistic: 39.95 on 8 and 169 DF, p-value: < 2.2e-16
One problem with the summary display for models like this is that
it's treating our factor variable (Cultivar) as two separate
variables. While that is the way it is fit in the model, it's usually
more informative to combine the effects of the two variables as a single
effect. The anova command will produce a more traditional
ANOVA table:
> anova(wine.new)
Analysis of Variance Table
Response: Alcohol
Df Sum Sq Mean Sq F value Pr(>F)
Cultivar 2 70.795 35.397 148.2546 < 2.2e-16 ***
Malic.acid 1 0.013 0.013 0.0552 0.8146
Alkalinity.ash 1 0.229 0.229 0.9577 0.3292
Proanthocyanins 1 0.224 0.224 0.9384 0.3341
Color.intensity 1 4.750 4.750 19.8942 1.488e-05 ***
OD.Ratio 1 0.031 0.031 0.1284 0.7206
Proline 1 0.262 0.262 1.0976 0.2963
Residuals 169 40.351 0.239
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary display contained other useful information, so you shouldn't
hesitate to look at both.
Comparing these results to our previous regression, we can see that only
one variable (Color.intensity) is still significant, and the
effect of Cultivar is very significant. For this data set, it
means that while we can use the chemical composition to help predict the
Alcohol content of the wines, but that knowing the Cultivar
will be more effective. Let's look at a reduced model that uses only
Cultivar and Color.intensity to see how it compares with
the model containing the extra variables:
> wine.new1 = lm(Alcohol~Cultivar+Color.intensity,data=wine)
> summary(wine.new1)
Call:
lm(formula = Alcohol ~ Cultivar + Color.intensity, data = wine)
Residuals:
Min 1Q Median 3Q Max
-1.12074 -0.32721 -0.04133 0.34799 1.54962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.14845 0.14871 88.417 < 2e-16 ***
Cultivar2 -1.20265 0.10431 -11.530 < 2e-16 ***
Cultivar3 -0.79248 0.10495 -7.551 2.33e-12 ***
Color.intensity 0.10786 0.02434 4.432 1.65e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4866 on 174 degrees of freedom
Multiple R-Squared: 0.6468, Adjusted R-squared: 0.6407
F-statistic: 106.2 on 3 and 174 DF, p-value: < 2.2e-16
The adjusted R-squared for this model is better than that of the previous
one, indicating that removing those extra variables didn't seem to
cause any problems. To formally test to see if there is a difference
between the two models, we can use the anova function. When passed
a single model object, anova prints an ANOVA table, but when it's passed
two model objects, it performs a test to compare the two models:
> anova(wine.new,wine.new1)
Analysis of Variance Table
Model 1: Alcohol ~ Cultivar + Malic.acid + Alkalinity.ash + Proanthocyanins +
Color.intensity + OD.Ratio + Proline
Model 2: Alcohol ~ Cultivar + Color.intensity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 169 40.351
2 174 41.207 -5 -0.856 0.7174 0.6112
The test indicates that there's no significant difference between the two
models.
When all the independent variables in our model were categorical model (the
race/activity example),
the interactions between the categorical variables was one of the most
interesting parts of the analysis. What does an interaction between a
categorical variable and a continuous variable represent? Such an interaction
can tell us if the slope of the continuous variable is different for the
different levels of the categorical variable. In the current model, we can
test to see if the slopes are different by adding the term
Cultivar:Color.intensity to the model:
> anova(wine.new2)
Analysis of Variance Table
Response: Alcohol
Df Sum Sq Mean Sq F value Pr(>F)
Cultivar 2 70.795 35.397 149.6001 < 2.2e-16 ***
Color.intensity 1 4.652 4.652 19.6613 1.644e-05 ***
Cultivar:Color.intensity 2 0.509 0.255 1.0766 0.343
Residuals 172 40.698 0.237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There doesn't seem to be a significant interaction.
File translated from
TEX
by
TTH,
version 3.67.
On 27 Apr 2011, 16:10.