Analysis of Variance
1 Multiple Comparisons
In the previous example, notice that the test for Cultivar
answers only the question "Are there any significant differences
among the cultivars?". This is because the F-test used to determine
significance is based on comparing two different ways of
estimating the variance, not on any particular differences among
the means.
Having seen that there is a significant effect for Cultivar
in the previous example, a natural question is "Which cultivars
are different from each other?". One possibility would be to look
at all possible t-tests between the levels of the cultivar, i.e.
do t-tests for 1 vs. 2, 1 vs. 3, and 2 vs. 3. This is a very bad idea
for at least two reasons:
- One of the main goals of ANOVA is to combine all of our data,
so that we can more accurately estimate the residual variance of our
model. In the previous example, notice that there were 175 degrees
of freedom available to estimate the residual variance. Under the
assumptions of ANOVA, the variance of the dependent variable doesn't
change across the different levels of the independent variables, so we
can (and should) use the estimate from the ANOVA for all our tests.
When we use a t-test, we estimate the residual variance using only the
observations for the two groups we're comparing, so we'll have fewer
degrees of freedom, and less power to detect differences.
-
When we compare several groups using t-tests, we have to look at
all possible pairs of groups. This can result in many tests, and we
can no longer be confident that the significance level we use for the
individual tests will hold up across all of the comparisons we're
making (see the short calculation after this list). This is a
well-known problem in statistics, and many techniques have been
developed to adjust probability values to handle it. However, these
techniques tend to be quite conservative, and they may prevent us from
seeing differences that really exist.
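To see how fast the problem grows, note that with 6 groups there are
already choose(6,2) = 15 pairwise tests. As a rough illustration (a
quick calculation, not part of the original notes), if those 15 tests
were independent and each run at an alpha level of 0.05, the chance of
at least one spurious "significant" result would be 1 - 0.95^15, or
about 54%:
> 1 - .95^15
[1] 0.5367088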
To see how probabilities get adjusted when many comparisons are made,
consider a data set on the nitrogen levels in 6 varieties of clover.
We wish to test the hypothesis that the mean nitrogen level is the
same for all of the varieties.
> clover = read.table('http://www.stat.berkeley.edu/~spector/s133/data/clover.txt',header=TRUE)
> clover.aov = aov(Nitrogen~Strain,data=clover)
> summary(clover.aov)
Df Sum Sq Mean Sq F value Pr(>F)
Strain 5 847.05 169.41 14.370 1.485e-06 ***
Residuals 24 282.93 11.79
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
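As a quick sanity check (a small aside, not in the original notes),
the F statistic is just the ratio of the two mean squares shown in the
table, up to the rounding of the displayed values:
> 169.41/11.79
[1] 14.36896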
Let's say that we want to look at all the possible t-tests
among pairs of the 6 strains. First, we can use the combn
function to show us all the possible 2-way combinations of the strains:
> combs = combn(as.character(unique(clover$Strain)),2)
> combs
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "3DOK1" "3DOK1" "3DOK1" "3DOK1" "3DOK1" "3DOK5" "3DOK5" "3DOK5"
[2,] "3DOK5" "3DOK4" "3DOK7" "3DOK13" "COMPOS" "3DOK4" "3DOK7" "3DOK13"
[,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "3DOK5" "3DOK4" "3DOK4" "3DOK4" "3DOK7" "3DOK7" "3DOK13"
[2,] "COMPOS" "3DOK7" "3DOK13" "COMPOS" "3DOK13" "COMPOS" "COMPOS"
Let's focus on the first column:
> x = combs[,1]
> tt = t.test(Nitrogen~Strain,data=clover[clover$Strain %in% x,])
> names(tt)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
This suggests a function that returns the t-test probability
for each combination of strains:
> gettprob = function(x)t.test(Nitrogen~Strain,data=clover[clover$Strain %in% x,])$p.value
We can get the probabilities for all the tests, and combine them with
the strain names for display:
> probs = data.frame(t(combs),probs=apply(combs,2,gettprob))
> probs
X1 X2 probs
1 3DOK1 3DOK5 1.626608e-01
2 3DOK1 3DOK4 2.732478e-03
3 3DOK1 3DOK7 2.511696e-02
4 3DOK1 3DOK13 3.016445e-03
5 3DOK1 COMPOS 1.528480e-02
6 3DOK5 3DOK4 5.794178e-03
7 3DOK5 3DOK7 7.276336e-02
8 3DOK5 3DOK13 1.785048e-03
9 3DOK5 COMPOS 3.177169e-02
10 3DOK4 3DOK7 4.331464e-02
11 3DOK4 3DOK13 5.107291e-01
12 3DOK4 COMPOS 9.298460e-02
13 3DOK7 3DOK13 4.996374e-05
14 3DOK7 COMPOS 2.055216e-01
15 3DOK13 COMPOS 4.932466e-04
These probabilities are for the individual t-tests, each with
an alpha level of 0.05, but that doesn't guarantee that the experiment-wise
alpha will be 0.05. We can use the p.adjust function to adjust
these probabilities:
> probs = data.frame(probs,adj.prob=p.adjust(probs$probs,method='bonferroni'))
> probs
X1 X2 probs adj.prob
1 3DOK1 3DOK5 1.626608e-01 1.000000000
2 3DOK1 3DOK4 2.732478e-03 0.040987172
3 3DOK1 3DOK7 2.511696e-02 0.376754330
4 3DOK1 3DOK13 3.016445e-03 0.045246679
5 3DOK1 COMPOS 1.528480e-02 0.229272031
6 3DOK5 3DOK4 5.794178e-03 0.086912663
7 3DOK5 3DOK7 7.276336e-02 1.000000000
8 3DOK5 3DOK13 1.785048e-03 0.026775721
9 3DOK5 COMPOS 3.177169e-02 0.476575396
10 3DOK4 3DOK7 4.331464e-02 0.649719553
11 3DOK4 3DOK13 5.107291e-01 1.000000000
12 3DOK4 COMPOS 9.298460e-02 1.000000000
13 3DOK7 3DOK13 4.996374e-05 0.000749456
14 3DOK7 COMPOS 2.055216e-01 1.000000000
15 3DOK13 COMPOS 4.932466e-04 0.007398699
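As a side note (not part of the original analysis), 'bonferroni' is
only one of the corrections that p.adjust offers; see ?p.adjust for
the full list. The Holm method, for example, roughly speaking
multiplies the smallest probability by 15, the next smallest by 14,
and so on, so it controls the same experiment-wise error rate while
being uniformly less conservative than Bonferroni:
> p.adjust(probs$probs,method='holm')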
Looking at the Bonferroni-adjusted probabilities, notice that many of
the comparisons that seemed significant with the individual t-tests
are no longer significant. In addition, those t-tests didn't take
advantage of the increased degrees of freedom available from the
pooled variance estimate. One technique that uses all the degrees of
freedom of the combined test, while still correcting for the problem
of multiple comparisons, is known as Tukey's Honestly Significant
Difference (HSD) test. The TukeyHSD function takes a model object and
the name of a factor, and provides protected probability values for
all the two-way comparisons of the factor's levels. Here's the output
of TukeyHSD for the clover data:
> tclover = TukeyHSD(clover.aov,'Strain')
> tclover
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Nitrogen ~ Strain, data = clover)
$Strain
diff lwr upr p adj
3DOK13-3DOK1 -15.56 -22.27416704 -8.845833 0.0000029
3DOK4-3DOK1 -14.18 -20.89416704 -7.465833 0.0000128
3DOK5-3DOK1 -4.84 -11.55416704 1.874167 0.2617111
3DOK7-3DOK1 -8.90 -15.61416704 -2.185833 0.0048849
COMPOS-3DOK1 -10.12 -16.83416704 -3.405833 0.0012341
3DOK4-3DOK13 1.38 -5.33416704 8.094167 0.9870716
3DOK5-3DOK13 10.72 4.00583296 17.434167 0.0006233
3DOK7-3DOK13 6.66 -0.05416704 13.374167 0.0527514
COMPOS-3DOK13 5.44 -1.27416704 12.154167 0.1621550
3DOK5-3DOK4 9.34 2.62583296 16.054167 0.0029837
3DOK7-3DOK4 5.28 -1.43416704 11.994167 0.1852490
COMPOS-3DOK4 4.06 -2.65416704 10.774167 0.4434643
3DOK7-3DOK5 -4.06 -10.77416704 2.654167 0.4434643
COMPOS-3DOK5 -5.28 -11.99416704 1.434167 0.1852490
COMPOS-3DOK7 -1.22 -7.93416704 5.494167 0.9926132
> class(tclover)
[1] "multicomp" "TukeyHSD"
> names(tclover)
[1] "Strain"
> class(tclover$Strain)
[1] "matrix"
These probabilities seem more reasonable. To combine these
results with the previous ones, notice that tclover$Strain
is a matrix, with row names indicating the comparisons being made.
We can put similar row names on our earlier results and then merge
them. (Since merge matches on the row names, only the comparisons
whose constructed names agree with TukeyHSD's ordering of the levels
appear in the result; that's why only 11 of the 15 rows survive.)
> row.names(probs) = paste(probs$X2,probs$X1,sep='-')
> probs = merge(probs,tclover$Strain[,'p adj',drop=FALSE],by=0)
> probs
Row.names X1 X2 probs adj.prob p adj
1 3DOK13-3DOK1 3DOK1 3DOK13 0.0030164452 0.045246679 2.888133e-06
2 3DOK4-3DOK1 3DOK1 3DOK4 0.0027324782 0.040987172 1.278706e-05
3 3DOK5-3DOK1 3DOK1 3DOK5 0.1626608271 1.000000000 2.617111e-01
4 3DOK7-3DOK1 3DOK1 3DOK7 0.0251169553 0.376754330 4.884864e-03
5 3DOK7-3DOK4 3DOK4 3DOK7 0.0433146369 0.649719553 1.852490e-01
6 3DOK7-3DOK5 3DOK5 3DOK7 0.0727633570 1.000000000 4.434643e-01
7 COMPOS-3DOK1 3DOK1 COMPOS 0.0152848021 0.229272031 1.234071e-03
8 COMPOS-3DOK13 3DOK13 COMPOS 0.0004932466 0.007398699 1.621550e-01
9 COMPOS-3DOK4 3DOK4 COMPOS 0.0929845957 1.000000000 4.434643e-01
10 COMPOS-3DOK5 3DOK5 COMPOS 0.0317716931 0.476575396 1.852490e-01
11 COMPOS-3DOK7 3DOK7 COMPOS 0.2055215679 1.000000000 9.926132e-01
Finally, we can display the probabilities without scientific
notation as follows:
> format(probs,scientific=FALSE)
Row.names X1 X2 probs adj.prob p adj
1 3DOK13-3DOK1 3DOK1 3DOK13 0.0030164452 0.045246679 0.000002888133
2 3DOK4-3DOK1 3DOK1 3DOK4 0.0027324782 0.040987172 0.000012787061
3 3DOK5-3DOK1 3DOK1 3DOK5 0.1626608271 1.000000000 0.261711120046
4 3DOK7-3DOK1 3DOK1 3DOK7 0.0251169553 0.376754330 0.004884863746
5 3DOK7-3DOK4 3DOK4 3DOK7 0.0433146369 0.649719553 0.185248969392
6 3DOK7-3DOK5 3DOK5 3DOK7 0.0727633570 1.000000000 0.443464260597
7 COMPOS-3DOK1 3DOK1 COMPOS 0.0152848021 0.229272031 0.001234070633
8 COMPOS-3DOK13 3DOK13 COMPOS 0.0004932466 0.007398699 0.162154993324
9 COMPOS-3DOK4 3DOK4 COMPOS 0.0929845957 1.000000000 0.443464260597
10 COMPOS-3DOK5 3DOK5 COMPOS 0.0317716931 0.476575396 0.185248969392
11 COMPOS-3DOK7 3DOK7 COMPOS 0.2055215679 1.000000000 0.992613208547
By using all of the data to estimate the residual error,
Tukey's HSD method actually reports some of the adjusted probabilities
as even lower than the unadjusted t-test probabilities.
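As an aside, TukeyHSD objects also have a plot method, which displays
each difference along with its family-wise confidence interval;
intervals that don't cross zero correspond to significant comparisons:
> plot(tclover)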
2 Two-Way ANOVA
To express the idea of an interaction in the R modeling language, we need to
introduce two new operators. The colon (:) is used to indicate an
interaction between two or more variables in a model formula. The asterisk
(*) is used to indicate all main effects and interactions among the
variables that it joins. So, for example, the term A*B expands
to the three terms A, B, and A:B. As an example
of a two-way ANOVA, consider a study to determine the effects of physical
activity on obesity. Subjects were rated for their physical activity on a
three point scale with 1=not very active, 2=somewhat active, and 3=very active.
In addition, the race (either 1 or 2) of the participant was recorded, along
with their Body Mass Index (BMI). We want
to answer the
following three questions:
- Were the means for BMI the same for the two races?
-
Were the means for BMI the same for the three activity levels?
-
Is the effect of activity level different depending on race? (Or,
equivalently, is the effect of race different depending on activity level?)
The first two questions can be answered by looking at the race and
activity main effects, while the third question describes the
race by activity interaction. The data can be found at
http://www.stat.berkeley.edu/~spector/s133/data/activity.csv. Here are the R statements to run
the ANOVA:
> activity = read.csv('activity.csv')
> activity$race = factor(activity$race)
> activity$activity = factor(activity$activity)
> activity.aov = aov(bmi~race*activity,data=activity)
> summary(activity.aov)
Df Sum Sq Mean Sq F value Pr(>F)
race 1 3552 3552 102.5894 < 2e-16 ***
activity 2 2672 1336 38.5803 < 2e-16 ***
race:activity 2 301 151 4.3508 0.01303 *
Residuals 1865 64574 35
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that there are two degrees of freedom for activity - this means
two parameters will be estimated in order to explain activity's effect on
bmi. Unlike linear regression, where a single parameter is estimated and
only a linear relationship can be fit, using two parameters to account for
the three levels of activity provides extra flexibility.
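To make the contrast concrete (this model is just an illustration,
not part of the original analysis, and activity.lin is a made-up
name), we could instead treat activity as numeric; such a model spends
only one degree of freedom on activity, and forces the mean bmi at the
three activity levels to fall on a straight line:
> activity.lin = aov(bmi~race + as.numeric(activity),data=activity)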
To see if the analysis was reasonable, we can look at the default plots:
> plot(activity.aov)
The graphs appear below:
There seems to be some deviation from normality in the Normal Q-Q
plot (recall that, if the residuals followed a normal distribution,
we would see a straight line). When this situation arises,
analyzing the logarithm of the dependent variable often helps. Here are the
same results for the analysis of log(bmi):
> activity1.aov = aov(log(bmi)~race*activity,data=activity)
> summary(activity1.aov)
Df Sum Sq Mean Sq F value Pr(>F)
race 1 4.588 4.588 100.3741 < 2.2e-16 ***
activity 2 3.251 1.625 35.5596 6.98e-16 ***
race:activity 2 0.317 0.158 3.4625 0.03155 *
Residuals 1865 85.240 0.046
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(activity1.aov)
The Q-Q plot looks better, so this model is probably more appropriate.
We can see that both main effects, as well as the interaction, are significant.
To see what's happening with the main effects, we can use aggregate:
> aggregate(log(activity$bmi),activity['race'],mean)
race x
1 1 3.122940
2 2 3.222024
> aggregate(log(activity$bmi),activity['activity'],mean)
activity x
1 1 3.242682
2 2 3.189810
3 3 3.109518
Race 2 has higher values of BMI than race 1, and BMI decreases as the
level of activity increases.
To study the interaction, we could use aggregate, passing both race
and activity as the second argument:
> aggregate(log(activity$bmi),activity[c('race','activity')],mean)
race activity x
1 1 1 3.161119
2 2 1 3.298576
3 1 2 3.140970
4 2 2 3.230651
5 1 3 3.084426
6 2 3 3.143478
The arrangement of the output from tapply may be more helpful:
> tapply(log(activity$bmi),activity[c('race','activity')],mean)
activity
race 1 2 3
1 3.161119 3.140970 3.084426
2 3.298576 3.230651 3.143478
It's usually difficult to judge relationships like this from a table.
One useful tool in this case is
an interaction plot. An interaction plot has one point
for each combination of the factors defined by an interaction. The x-axis
represents the levels of one of the factors, the y-axis represents the
mean of the dependent variable, and a separate line is drawn for each
level of the factor not represented on the x-axis. While it wouldn't be
too hard to produce such a plot with basic commands in R, the process is
automated by the interaction.plot function. The first argument to
this function is the factor to appear on the x-axis; the second is the
factor which will define the multiple lines being drawn, and the third
argument is the dependent variable. By default, interaction.plot
uses the mean for its display, but you can provide a function of your own
choosing through the fun= argument.
For the activity data, we can produce an interaction plot with the following
code:
> with(activity,interaction.plot(activity,race,log(bmi)))
Here's the plot:
It can be seen that the interaction is due to the fact that the slope of
the line for race 2 is steeper than that of the line for race 1.
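Since BMI distributions tend to be skewed, we might also want to see
the plot based on medians instead of means; the fun= argument makes
this easy (a variation on the plot above, not in the original notes):
> with(activity,interaction.plot(activity,race,log(bmi),fun=median))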
3 Another Example
This example has to do with iron retention in mice. Two different treatments,
each at three different levels, were fed to mice. The treatments were
tagged with radioactive iron, so that the percentage of iron retained could
be measured after a fixed period of time.
The data is presented in a table as follows:
----------------------------------------------------
Fe2+ Fe3+
----------------------------------------------------
high medium low high medium low
----------------------------------------------------
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.30 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25
----------------------------------------------------
Since this arrangement doesn't have one observation per row, before
we can analyze the data it needs to be rearranged. To do this, we can
use the reshape function. Since there are two different sets of
variables that represent the changes in the factors of the experiment,
we first read in the data (skipping the header), and specify
two groups of variables in our call to reshape:
> iron0 = read.table('iron.txt',skip=5,nrows=18)
> names(iron0) = c('Fe2high','Fe2medium','Fe2low','Fe3high','Fe3medium','Fe3low')
> iron1 = reshape(iron0,varying=list(1:3,4:6),direction='long')
> head(iron1)
time Fe2high Fe3high id
1.1 1 0.71 2.20 1
2.1 1 1.66 2.69 2
3.1 1 2.01 3.54 3
4.1 1 2.16 3.75 4
5.1 1 2.42 3.83 5
6.1 1 2.42 4.08 6
After examining the data, it can be seen that the high, medium, and
low values have been translated into values 1, 2, and 3 in the
variable time. The id variable is created to help us see
which line each observation came from, which is not relevant in this
case, since the table was just used to present the data, and the
values in the table don't represent repeated measures on the same
experimental unit.
Next, we eliminate the id column, rename the column named
"time" to "level", and further reshape the
data to represent the two treatments:
> iron1$id = NULL
> names(iron1)[1] = 'level'
> iron = reshape(iron1,varying=list(2:3),direction='long')
> head(iron)
level time Fe2high id
1.1 1 1 0.71 1
2.1 1 1 1.66 2
3.1 1 1 2.01 3
4.1 1 1 2.16 4
5.1 1 1 2.42 5
6.1 1 1 2.42 6
All that's left is to remove the id
column and to rename time and Fe2high:
> iron$id = NULL
> names(iron)[2:3] = c('treatment','retention')
> head(iron)
level treatment retention
1.1 1 1 0.71
2.1 1 1 1.66
3.1 1 1 2.01
4.1 1 1 2.16
5.1 1 1 2.42
6.1 1 1 2.42
Once the data has been reshaped, it's essential to
make sure that the independent variables are correctly stored
as factors:
> sapply(iron,class)
level treatment retention
"integer" "integer" "numeric"
Since treatment and level are not factors,
we must convert them:
> iron$treatment = factor(iron$treatment,labels=c('Fe2+','Fe3+'))
> iron$level = factor(iron$level,labels=c('high','medium','low'))
> head(iron)
level treatment retention
1.1 high Fe2+ 0.71
2.1 high Fe2+ 1.66
3.1 high Fe2+ 2.01
4.1 high Fe2+ 2.16
5.1 high Fe2+ 2.42
6.1 high Fe2+ 2.42
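As an aside, since unlist stacks a data frame column by column, the
same long-format data frame could have been built in a single step,
without reshape. This is just a sketch of an alternative (iron2 is a
made-up name):
> iron2 = data.frame(level=factor(rep(rep(c('high','medium','low'),each=18),2)),
+                    treatment=factor(rep(c('Fe2+','Fe3+'),each=54)),
+                    retention=unlist(iron0))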
Now we can perform the ANOVA:
> iron.aov = aov(retention ~ level*treatment,data=iron)
> summary(iron.aov)
Df Sum Sq Mean Sq F value Pr(>F)
level 2 983.62 491.81 17.0732 4.021e-07 ***
treatment 1 62.26 62.26 2.1613 0.1446
level:treatment 2 8.29 4.15 0.1439 0.8661
Residuals 102 2938.20 28.81
Before proceeding further, we should examine the
ANOVA plots to see if the data meets the assumptions of ANOVA:
> par(mfrow=c(2,2))
> plot(iron.aov)
Both the normal Q-Q plot and the scale-location plot indicate
problems similar to the previous example, and a log transformation
is once again suggested. This is not unusual when data is
measured as percentages or ratios.
> ironl.aov = aov(log(retention) ~ level*treatment,data=iron)
> par(mfrow=c(2,2))
> plot(ironl.aov)
The plots look much better, so we'll continue with the analysis
of the log of retention:
> summary(ironl.aov)
Df Sum Sq Mean Sq F value Pr(>F)
level 2 15.588 7.794 22.5241 7.91e-09 ***
treatment 1 2.074 2.074 5.9931 0.01607 *
level:treatment 2 0.810 0.405 1.1708 0.31426
Residuals 102 35.296 0.346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since there were only two levels of treatment, the significant
treatment effect means the two treatments were different. We can use the
TukeyHSD function to see which of the three levels
were different from each other:
> TukeyHSD(ironl.aov,'level')
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = log(retention) ~ level * treatment, data = iron)
$level
diff lwr upr p adj
medium-high 0.5751084 0.24533774 0.9048791 0.0002042
low-high 0.9211588 0.59138806 1.2509295 0.0000000
low-medium 0.3460503 0.01627962 0.6758210 0.0373939
It appears that the high level had much lower retention values
than the other two levels:
> aggregate(log(iron$retention),iron['level'],mean)
level x
1 high 1.420526
2 medium 1.995635
3 low 2.341685
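Because the model was fit on the log scale, differences between
levels correspond to ratios of retention on the original scale, so
exponentiating the Tukey differences and confidence limits gives
estimated retention ratios (a supplementary calculation, not in the
original notes); for example, exp(0.921) is about 2.5, so retention at
the low level is roughly two and a half times that at the high level:
> exp(TukeyHSD(ironl.aov,'level')$level[,1:3])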
Although there was no significant interaction, an interaction plot
can still be useful in visualizing what happens in an experiment:
> interaction.plot(iron$level,iron$treatment,log(iron$retention))