Chapter 7 ANOVA
One of the most important experiments we’ve discussed in the course involves the comparison of two populations in terms of their means. This kind of comparison motivated our study of the two-sample t-test and CI for the difference of two population means. These are very useful statistical methods, but why stop at comparing two populations? Often, we would like to compare three or more populations. For example, suppose a company marketing laundry detergent wants to compare 4 product designs (the detergent container and its labeling) in a pilot study. Using t-tests or CIs we could make pairwise comparisons (there are 4 choose 2, or 6, of these), but this causes some issues. Specifically, suppose there are no differences among the product designs and suppose we conduct 6 pairwise \(t\) tests at \(\alpha = 0.05\). If the 6 tests are independent (the test statistics are mutually independent r.v.’s) then the chance we make at least 1 Type 1 error is \(1-0.95^6 = 0.265\)! Even if there are no differences, there’s a good chance we’ll find at least one difference by chance alone. To avoid the Type 1 error inflation caused by multiple testing, one strategy, called the Bonferroni correction, requires that we conduct each test at level \(\alpha / 6 = 0.00833\) so that the family-wise Type 1 error rate is no more than \(1-(1-0.00833)^6 = 0.049 \leq 0.05\). This fixes our Type 1 error inflation problem, but it also means our power to detect false null hypotheses (to detect differences in response across product designs) is severely hampered. Can we do better? Fortunately, the answer is yes! The alternative, improved strategy for comparing multiple population means is called ANOVA (analysis of variance), and it is the focus of this chapter.
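As a quick check of the arithmetic above, the following short R sketch reproduces the error-rate calculations (the values 0.05 and 6 are just the numbers from the detergent example):

alpha <- 0.05
k <- choose(4, 2)          # 6 pairwise comparisons among 4 designs
1 - (1 - alpha)^k          # family-wise Type 1 error with 6 independent tests: about 0.265
alpha.bonf <- alpha / k    # Bonferroni-corrected per-test level: about 0.00833
1 - (1 - alpha.bonf)^k     # resulting family-wise error rate: about 0.049, below 0.05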
7.1 Analysis of variance
The general setup is as follows: we conduct an experiment in which we observe a random sample of size \(n\) from each of \(J\) populations, so that the total sample size is \(N=nJ\). Denote the observations by \(X_{ij}\) for individual \(i=1, \ldots, n\) in the sample from population \(j=1,\ldots, J\). (It is possible, of course, to consider random samples of different sizes, but for simplicity we consider only the balanced case.) Our goal in this experiment is to evaluate the null hypothesis \(H_0: \mu_j = \mu\) for all \(j = 1,\ldots, J\), meaning that all the populations have the same mean. Further, we assume the populations are normal with equal variance; under \(H_0\), then, the \(J\) populations are essentially equivalent. The alternative hypothesis is that at least one population mean differs from the others.
Rather than comparing sample means to evaluate \(H_0\) we (somewhat counterintuitively) compare two estimators of the population variance. Consider the following decomposition:
\[\sum_{i=1}^n\sum_{j=1}^J(X_{ij} - \overline X)^2 = \sum_{i=1}^n\sum_{j=1}^J (X_{ij} - \overline X_j)^2 + n\sum_{j=1}^J (\overline X_j - \overline X)^2.\]
The left hand side of the equals sign in the above display is called the total sum of squares TSS, and it decomposes into the error sum of squares ESS (or residual sum of squares) plus the treatment sum of squares TrSS. Under the null hypothesis, each of these sums of squares, divided by its degrees of freedom, can be used to estimate \(\sigma^2\), the common variance of the \(J\) populations. Further, the error sum of squares and treatment sum of squares are independent and satisfy
\[\frac{TrSS/(J-1)}{ESS/\left(J(n-1)\right)} \stackrel{H_0}{\sim} F(J-1,\, J(n-1)).\]
The ANOVA F-test rejects the null hypothesis of equality of group means if \(F = \frac{TrSS/(J-1)}{ESS/(J(n-1))}\) exceeds the \(1-\alpha\) quantile of the F distribution with \(J-1\) and \(J(n-1)\) degrees of freedom.
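To see the decomposition and the F statistic in action before turning to real data, here is a small simulated sketch (the seed, mean, standard deviation, and sample sizes are arbitrary choices, not from any example in the course): we draw \(J = 4\) samples of size \(n = 10\) from a single normal population, so \(H_0\) is true, verify that TSS = ESS + TrSS, and compute the F statistic and its p-value.

set.seed(7)                                        # arbitrary seed, for reproducibility only
J <- 4; n <- 10
x <- matrix(rnorm(n*J, mean = 5, sd = 2), n, J)    # column j is the sample from population j
grand.mean  <- mean(x)
group.means <- colMeans(x)
TSS  <- sum((x - grand.mean)^2)
ESS  <- sum(sweep(x, 2, group.means)^2)            # error (within-group) sum of squares
TrSS <- n*sum((group.means - grand.mean)^2)        # treatment (between-group) sum of squares
all.equal(TSS, ESS + TrSS)                         # TRUE: the decomposition holds exactly
F <- (TrSS/(J-1)) / (ESS/(J*(n-1)))                # F statistic
1 - pf(F, J-1, J*(n-1))                            # p-value; uniform on (0, 1) when H0 holds

Repeating the simulation many times would show the p-values spread uniformly over (0, 1), which is exactly what a valid level-\(\alpha\) test requires under \(H_0\).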
7.2 Crop data in-depth example
The data in crop.data.csv come from an experiment examining the effects of three types of fertilizer on crop yield. As shown below, there are 96 observations of yield, 32 from each fertilizer type, and the mean yields for the three fertilizer types are all very close to 177.
crops <- read.csv("crop.data.csv", header = TRUE,
                  colClasses = c("factor", "factor", "factor", "numeric"))
m  <- mean(crops$yield)
m1 <- mean(crops$yield[crops$fertilizer == 1])
m2 <- mean(crops$yield[crops$fertilizer == 2])
m3 <- mean(crops$yield[crops$fertilizer == 3])
m
## [1] 177.0155
m1
## [1] 176.757
m2
## [1] 176.9332
m3
## [1] 177.3562
length(crops$yield)
## [1] 96
sum(crops$fertilizer == 1)
## [1] 32
sum(crops$fertilizer == 2)
## [1] 32
sum(crops$fertilizer == 3)
## [1] 32
We can compute the ANOVA F-test “by hand” in R by computing the component sums of squares and then either comparing the F statistic to the appropriate F quantile using the qf function or computing the p-value with pf.
# SSE
SSE <- sum((crops$yield[crops$fertilizer == 1] - m1)^2) +
  sum((crops$yield[crops$fertilizer == 2] - m2)^2) +
  sum((crops$yield[crops$fertilizer == 3] - m3)^2)
SSE
## [1] 35.88619
# SSTr
n <- sum(crops$fertilizer == 1)
n
## [1] 32
SSTr <- n*((m1-m)^2) + n*((m2-m)^2) + n*((m3-m)^2)
SSTr
## [1] 6.068047
# F test
J <- 3
F <- (SSTr / (J-1)) / (SSE / (J*(n-1)))
F
## [1] 7.862752
MSE <- (SSE / (J*(n-1)))
p.value <- 1 - pf(abs(F), J-1, J*(n-1))
p.value
## [1] 0.0006999158
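Equivalently, instead of computing the p-value we can compare the F statistic to the upper critical value from qf; this uses the objects F, J, and n computed above, and the null is rejected at \(\alpha = 0.05\) because F exceeds the critical value.

crit <- qf(0.95, J-1, J*(n-1))   # upper critical value of F(2, 93) at level 0.05
crit
## [1] 3.094337
F > crit                         # the F statistic 7.86 exceeds 3.09, so reject H0
## [1] TRUE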
We can also use built-in R functions to compute the ANOVA F-test. Note the matching p-values.
summary(aov(yield ~ fertilizer, data = crops))
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.07 3.0340 7.863 7e-04 ***
## Residuals 93 35.89 0.3859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A <- aov(yield ~ fertilizer, data = crops)
A
## Call:
## aov(formula = yield ~ fertilizer, data = crops)
##
## Terms:
## fertilizer Residuals
## Sum of Squares 6.06805 35.88619
## Deg. of Freedom 2 93
##
## Residual standard error: 0.6211867
## Estimated effects may be unbalanced
An important part of applying ANOVA is checking that the normality and equal variance assumptions are reasonable. If they are not, then the F-test may be misleading because the test statistic may not have an F distribution under the null. The best way to check for normality is to examine the residuals, the values \(x_{ij} - \overline x_{j}\); under the null and assuming equal variance these \(N\) values are approximately normally distributed with variance \(\sigma^2\). We can also examine the Studentized residuals, obtained by dividing the residuals by the estimated standard deviation \(\sqrt{MSE}\); these should look approximately standard normal for a large sample (and, more precisely, approximately Student’s t with \(J(n-1)\) df).
hist(A$residuals)
qqnorm(A$residuals)
qqline(A$residuals)
7.2.1 Checking the equal variance assumption
We can also check the group-wise variances for equality. A formal method for checking this is Levene’s test, which amounts to performing ANOVA on the absolute values of the residuals. The null hypothesis of Levene’s ANOVA is equality of variances across the groups/populations.
sd(crops$yield[crops$fertilizer==1])
## [1] 0.6849233
sd(crops$yield[crops$fertilizer==2])
## [1] 0.5740668
sd(crops$yield[crops$fertilizer==3])
## [1] 0.5991214
crops$resid <- c(abs(crops$yield[crops$fertilizer==1] - mean(crops$yield[crops$fertilizer==1])),
                 abs(crops$yield[crops$fertilizer==2] - mean(crops$yield[crops$fertilizer==2])),
                 abs(crops$yield[crops$fertilizer==3] - mean(crops$yield[crops$fertilizer==3])))
summary(aov(resid ~ fertilizer, data = crops))
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 0.24 0.1197 0.88 0.418
## Residuals 93 12.65 0.1360
ml  <- mean(crops$resid)
m1l <- mean(crops$resid[crops$fertilizer == 1])
m2l <- mean(crops$resid[crops$fertilizer == 2])
m3l <- mean(crops$resid[crops$fertilizer == 3])
ml
## [1] 0.4894152
m1l
## [1] 0.5596244
m2l
## [1] 0.4475853
m3l
## [1] 0.4610358
# SSE
SSE.L <- sum((crops$resid[crops$fertilizer == 1] - m1l)^2) +
  sum((crops$resid[crops$fertilizer == 2] - m2l)^2) +
  sum((crops$resid[crops$fertilizer == 3] - m3l)^2)
SSE.L
## [1] 12.65207
# SSTr
n <- sum(crops$fertilizer == 1)
n
## [1] 32
SSTr.L <- n*((m1l-ml)^2) + n*((m2l-ml)^2) + n*((m3l-ml)^2)
SSTr.L
## [1] 0.2395029
# Levene F test
J <- 3
F.L <- (SSTr.L / (J-1)) / (SSE.L / (J*(n-1)))
F.L
## [1] 0.880242
qf(.95, J-1, J*(n-1))
## [1] 3.094337
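For comparison, if the car package is installed, its leveneTest function performs this calculation directly; note that it centers at the group medians by default, so center = mean is needed to reproduce the ANOVA-on-absolute-residuals version above. This is only a sketch and assumes car is available; it is not run here.

library(car)                                               # assumes the car package is installed
leveneTest(yield ~ fertilizer, data = crops, center = mean)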
7.2.2 Normality and alternatives to the F test
As we have done in the past, we can investigate the assumption of normality by constructing plots of residuals. The observed, Studentized residuals are given by \[e_{ij} = \frac{y_{ij} - \bar y_{j}}{\sqrt{MSE}}\stackrel{\cdot}{\sim} t_{N-J},\] where \(MSE = ESS/(J(n-1))\) is the mean squared error computed above and \(N-J = J(n-1)\).
The qq-plot reveals no obvious signs of non-normality of residuals.
resids <- A$residuals
stu.resids <- resids/sqrt(MSE)
qqplot(stu.resids, rt(300, df = J*(n-1)))
qqline(stu.resids, distribution = function(p) qt(p, df = J*(n-1)), probs = c(0.25, 0.75), col = 2)
qqline(stu.resids)
In case the qq-plot suggests normality does NOT hold, an alternative approximate test may be used. The Kruskal-Wallis test is similar to (but not exactly the same as) performing ANOVA on the ranks \(r_{ij}\) of the responses \(y_{ij}\). Ranking means sorting the responses from least to greatest and assigning each its position in that sorted sequence. If there are any tied values, their ranks are averaged. For example, the sequence of responses \(24.3, 27.9, 27.9, 29.3, 30.1\) has ranks \(1, 2.5, 2.5, 4, 5\).
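R’s rank function uses this average-rank convention for ties by default, so we can check the small example directly:

rank(c(24.3, 27.9, 27.9, 29.3, 30.1))   # ties.method = "average" is the default
## [1] 1.0 2.5 2.5 4.0 5.0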
Let \(r_{ij}\) be the ranks of the \(y_{ij}\) and let \(\bar r_{j}\) and \(\bar r_{\cdot\cdot}\) denote the average rank in the sample from population \(j\) and the overall average rank, respectively. Then, the Kruskal-Wallis test statistic is the ratio of the between-group sum of squares of the ranks (SSB, also called SSTr) to the total sum of squares of the ranks (SST), multiplied by \(N-1\):
\[\chi^2 = (N-1) \cdot \frac{\sum_{j=1}^J n_{j}(\bar r_{j} - \bar r_{\cdot\cdot})^2}{\sum_{j=1}^J\sum_{i=1}^{n_j}(r_{ij} - \bar r_{\cdot\cdot})^2}.\]
The null hypothesis of the Kruskal-Wallis test is that the \(J\) populations are really all the same population (the \(J\) distributions are equal). In the special case in which we assume all populations are normal with the same variance, this null hypothesis is equivalent to the assertion that all the means are equal. Under the Kruskal-Wallis null hypothesis, the test statistic \(\chi^2\) is approximately distributed as a Chi-squared r.v. with \(J-1\) degrees of freedom.
Next, we will compute the K-W test statistic and associated p-value by hand, and then compare our results to the built-in R function “kruskal.test”.
rij <- rank(crops$yield)
r1 <- mean(rij[crops$fertilizer == 1])
r2 <- mean(rij[crops$fertilizer == 2])
r3 <- mean(rij[crops$fertilizer == 3])
# SSE
SSE <- sum((rij[crops$fertilizer == 1] - r1)^2) +
  sum((rij[crops$fertilizer == 2] - r2)^2) +
  sum((rij[crops$fertilizer == 3] - r3)^2)
SSE
## [1] 63603.81
SST <- sum((rij - mean(rij))^2)
SST
## [1] 73720
# SSTr
SSTr <- SST - SSE
SSTr
## [1] 10116.19
test.stat <- (n*J - 1)*(SSTr / SST)
test.stat
## [1] 13.03632
1 - pchisq(test.stat, J-1)
## [1] 0.00147638
kruskal.test(yield ~ fertilizer, data = crops)
##
## Kruskal-Wallis rank sum test
##
## data: yield by fertilizer
## Kruskal-Wallis chi-squared = 13.036, df = 2, p-value = 0.001476
Note that the Kruskal-Wallis test rejects the null hypothesis that the three populations of crop yields are the same, which agrees with the ANOVA F test.