Chapter 11 Missing Cells in Two-Way ANOVA
In this section we consider how to analyze a two-factor experiment when one crossed treatment is unobserved. If there is no interaction, then this is not problem. But, when interaction is present the lack of observations in a crossed treatment group makes it challenging to assess those interactions. There are, essentially, two ways to proceed: 1) use a one-way model, treating the observed crossed treatments as the levels; or, 2) do a partial two-way analysis.
Below, we illustrate these two approaches in the context of a horticulture experiment about jalapenos.
11.1 ANOVA with missing “at random” cell
Scoville (spiciness) measure of jalapeno peppers grown under 3 different watering regimens and 3 different levels of sun exposure.
Unbalanced treatment groups with a missing crossed treatment, sun level 3 with water level 3.
my.data
## scoville water.lvl sun.lvl
## 1 259 1 1
## 2 271 1 1
## 3 289 2 1
## 4 255 2 1
## 5 361 2 1
## 6 118 2 1
## 7 463 3 1
## 8 372 3 1
## 9 372 3 1
## 10 208 1 2
## 11 288 1 2
## 12 482 1 2
## 13 337 1 2
## 14 452 2 2
## 15 325 2 2
## 16 582 3 2
## 17 411 3 2
## 18 467 3 2
## 19 512 1 3
## 20 430 1 3
## 21 478 1 3
## 22 646 2 3
## 23 436 2 3
## 24 345 2 3
kable(aggregate(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data, FUN=mean))
water.lvl | sun.lvl | scoville |
---|---|---|
1 | 1 | 265.0000 |
2 | 1 | 255.7500 |
3 | 1 | 402.3333 |
1 | 2 | 328.7500 |
2 | 2 | 388.5000 |
3 | 2 | 486.6667 |
1 | 3 | 473.3333 |
2 | 3 | 475.6667 |
interaction.plot(water.lvl, sun.lvl, scoville, data = my.data)
boxplot(scoville~water.lvl+ sun.lvl, data = my.data)
11.2 Do any of the treatments matter?
Can test this two equivalent ways:
1. fit a one way anova (ignoring that there are two treatment variables) and test if the treatments are different.
2. fit a linear model and perform the model F test comparing the full model to the intercept only model.
The linear model method is slightly more challenging to implement because of the additional constraint imposed by the missing cell. We have to modify the design matrix from the model.matrix function, or define it by hand.
One way anova below. Notice it says the treatments are not all the same; their mean responses significantly differ for at least one treatment compared to the other 7. One approach would be to follow this test up with tests of contrasts or pairwise comparisons, corrected using Scheff'e or Tukey. Since the data is unbalanced, we should do these by hand.
my.data.oneway
## scoville trt
## 1 259 1
## 2 271 1
## 3 289 2
## 4 255 2
## 5 361 2
## 6 118 2
## 7 463 3
## 8 372 3
## 9 372 3
## 10 208 4
## 11 288 4
## 12 482 4
## 13 337 4
## 14 452 5
## 15 325 5
## 16 582 6
## 17 411 6
## 18 467 6
## 19 512 7
## 20 430 7
## 21 478 7
## 22 646 8
## 23 436 8
## 24 345 8
library(car)
Anova(lm(scoville~trt, data = my.data.oneway), type = 'III')
## Anova Table (Type III tests)
##
## Response: scoville
## Sum Sq Df F value Pr(>F)
## (Intercept) 3339450 1 354.1618 2.44e-12 ***
## trt 188009 7 2.8484 0.0393 *
## Residuals 150867 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
11.3 Fit a linear model with additional constraints
The other option for testing if any treatments matter is to fit a linear model. However, we cannot estimate all the parameters of the effects model due to missing the combination of sun level 3 and water level 3. We can only estimate 8 parameters: intercept, 2 for sun, 2 for water, and 3 (rather than 4) interactions. The way to deal with this is to find the full model matrix and then omit the last column corresponding to the interaction that is not estimable. IF we just use the lm() function, we’ll get an error, but it will still give us some useful output. Again, we end up having to do some things by hand.
options(contrasts = c('contr.sum', 'contr.sum'))
<-lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data)
my.lmsummary(my.lm)
##
## Call:
## lm(formula = scoville ~ water.lvl + sun.lvl + water.lvl * sun.lvl,
## data = my.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.750 -41.396 -3.375 44.167 170.333
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 386.11 41.79 9.240 8.15e-08 ***
## water.lvl1 -30.42 48.55 -0.626 0.540
## water.lvl2 -12.81 52.02 -0.246 0.809
## sun.lvl1 -78.42 48.55 -1.615 0.126
## sun.lvl2 15.19 52.02 0.292 0.774
## water.lvl1:sun.lvl1 -12.28 49.44 -0.248 0.807
## water.lvl2:sun.lvl1 -39.14 68.02 -0.575 0.573
## water.lvl1:sun.lvl2 -42.14 68.02 -0.619 0.544
## water.lvl2:sun.lvl2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97.1 on 16 degrees of freedom
## Multiple R-squared: 0.5548, Adjusted R-squared: 0.36
## F-statistic: 2.848 on 7 and 16 DF, p-value: 0.0393
$coefficients my.lm
## (Intercept) water.lvl1 water.lvl2 sun.lvl1
## 386.11111 -30.41667 -12.80556 -78.41667
## sun.lvl2 water.lvl1:sun.lvl1 water.lvl2:sun.lvl1 water.lvl1:sun.lvl2
## 15.19444 -12.27778 -39.13889 -42.13889
## water.lvl2:sun.lvl2
## NA
<- model.matrix(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data)
X X
## (Intercept) water.lvl1 water.lvl2 sun.lvl1 sun.lvl2 water.lvl1:sun.lvl1
## 1 1 1 0 1 0 1
## 2 1 1 0 1 0 1
## 3 1 0 1 1 0 0
## 4 1 0 1 1 0 0
## 5 1 0 1 1 0 0
## 6 1 0 1 1 0 0
## 7 1 -1 -1 1 0 -1
## 8 1 -1 -1 1 0 -1
## 9 1 -1 -1 1 0 -1
## 10 1 1 0 0 1 0
## 11 1 1 0 0 1 0
## 12 1 1 0 0 1 0
## 13 1 1 0 0 1 0
## 14 1 0 1 0 1 0
## 15 1 0 1 0 1 0
## 16 1 -1 -1 0 1 0
## 17 1 -1 -1 0 1 0
## 18 1 -1 -1 0 1 0
## 19 1 1 0 -1 -1 -1
## 20 1 1 0 -1 -1 -1
## 21 1 1 0 -1 -1 -1
## 22 1 0 1 -1 -1 0
## 23 1 0 1 -1 -1 0
## 24 1 0 1 -1 -1 0
## water.lvl2:sun.lvl1 water.lvl1:sun.lvl2 water.lvl2:sun.lvl2
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## 7 -1 0 0
## 8 -1 0 0
## 9 -1 0 0
## 10 0 1 0
## 11 0 1 0
## 12 0 1 0
## 13 0 1 0
## 14 0 0 1
## 15 0 0 1
## 16 0 -1 -1
## 17 0 -1 -1
## 18 0 -1 -1
## 19 0 -1 0
## 20 0 -1 0
## 21 0 -1 0
## 22 -1 0 -1
## 23 -1 0 -1
## 24 -1 0 -1
## attr(,"assign")
## [1] 0 1 1 2 2 3 3 3 3
## attr(,"contrasts")
## attr(,"contrasts")$water.lvl
## [1] "contr.sum"
##
## attr(,"contrasts")$sun.lvl
## [1] "contr.sum"
<- X[,-9]
X <- solve(t(X)%*%X)%*%t(X)%*%my.data$scoville
beta.hat beta.hat
## [,1]
## (Intercept) 386.11111
## water.lvl1 -30.41667
## water.lvl2 -12.80556
## sun.lvl1 -78.41667
## sun.lvl2 15.19444
## water.lvl1:sun.lvl1 -12.27778
## water.lvl2:sun.lvl1 -39.13889
## water.lvl1:sun.lvl2 -42.13889
<-X%*%beta.hat
Y.pred <- sum((my.data$scoville-Y.pred)^2)
SSE <- SSE / (24 - 8)
MSE MSE
## [1] 9429.167
11.4 General Linear Test for interaction
The linear model we just developed, with the additional constraint, allows us to perform more tests. We can use this model to perform General linear F tests for interaction and main effects.
The first estimated interaction parameter \[1/3 * [\mu_{11} - \mu_{13} - (\mu_{31}-\mu_{32}) - (\mu_{22} - \mu_{23})]\] (265.0000 - 473.3333 - (402.3333 - 486.6667) - (388.5000 - 475.6667))/3 = -12.27778
The test for interaction does not reject the null hypothesis of no interaction.
Note, we cannot feed the Anova function the fitted linear model; it will produce an error because of the inability to fit all 9 parameters. So, we had to do this “by hand”.
options(contrasts = c('contr.sum', 'contr.sum'))
aggregate(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data, FUN=mean)
## water.lvl sun.lvl scoville
## 1 1 1 265.0000
## 2 2 1 255.7500
## 3 3 1 402.3333
## 4 1 2 328.7500
## 5 2 2 388.5000
## 6 3 2 486.6667
## 7 1 3 473.3333
## 8 2 3 475.6667
<- model.matrix(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data)
X <- X[,1:5] # no interaction parameters
X.R <- solve(t(X.R)%*%X.R)%*%t(X.R)%*%my.data$scoville
beta.hat.R beta.hat.R
## [,1]
## (Intercept) 407.58333
## water.lvl1 -54.98160
## water.lvl2 -38.85173
## sun.lvl1 -102.70887
## sun.lvl2 -11.12446
<-X.R%*%beta.hat.R
Y.pred.R <- sum((my.data$scoville-Y.pred.R)^2)
SSE.R <- ((SSE.R - SSE)/(3)) / MSE
F 1-pf(F,3,16)
## [1] 0.9381543
#library(car)
#Anova(lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data), type = 'III')
11.5 Partial analysis for interaction
An alternative approach is to break the ANOVA into “overlapping” pieces that have no missing cells.
Break down the table into two overlapping pieces, one with no sun level 3 and one with no water level 3. Let’s work with the no sun level 3 table. The idea is the same for the other table. We can test for interaction and main effects within this table. Now, there are 6 treatments, so we estimate an intercept, two water parameters, 1 sun parameter, and 2 interaction parameters.
The first interaction term (no sun level 3 table) is estimating
\[\frac{1}{3}(\mu_{11} -\mu_{12}) - \frac{1}{6}(\mu_{21}- \mu_{22}) - \frac{1}{6}(\mu_{31} - \mu_{32})\]
\[265/3 - 255.7500/6 - 402.3333/6-328.7500/3+388.500/6+486.6667/6=14.93057\]
Compare this interaction term to the linear model with the additional constraint. I think it seems like a more interpretable interaction term, which I believe is why some people like the partial analysis.
options(contrasts = c('contr.sum', 'contr.sum'))
set.seed(12345)
<- as.factor(1+c(0,0,1,1,1,1,2,2,2,0,0,0,0,1,1,2,2,2))
water.lvl <- as.factor( 1+c(0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1))
sun.lvl <- round(rnorm(18, 100*(as.numeric(water.lvl)+as.numeric(sun.lvl)),100),0)
scoville <- data.frame(scoville, water.lvl, sun.lvl)
my.data1
aggregate(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1, FUN=mean)
## water.lvl sun.lvl scoville
## 1 1 1 265.0000
## 2 2 1 255.7500
## 3 3 1 402.3333
## 4 1 2 328.7500
## 5 2 2 388.5000
## 6 3 2 486.6667
library(car)
Anova(lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1),type = 'III')
## Anova Table (Type III tests)
##
## Response: scoville
## Sum Sq Df F value Pr(>F)
## (Intercept) 2088060 1 251.0408 2.075e-09 ***
## water.lvl 71684 2 4.3092 0.03887 *
## sun.lvl 36400 1 4.3763 0.05837 .
## water.lvl:sun.lvl 3360 2 0.2020 0.81984
## Residuals 99811 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1))
##
## Call:
## lm(formula = scoville ~ water.lvl + sun.lvl + water.lvl * sun.lvl,
## data = my.data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.750 -38.146 -3.375 53.812 153.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 354.50 22.37 15.844 2.07e-09 ***
## water.lvl1 -57.62 31.94 -1.804 0.0964 .
## water.lvl2 -32.38 31.94 -1.013 0.3308
## sun.lvl1 -46.81 22.37 -2.092 0.0584 .
## water.lvl1:sun.lvl1 14.93 31.94 0.467 0.6486
## water.lvl2:sun.lvl1 -19.57 31.94 -0.613 0.5516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.2 on 12 degrees of freedom
## Multiple R-squared: 0.5438, Adjusted R-squared: 0.3538
## F-statistic: 2.861 on 5 and 12 DF, p-value: 0.06299
<- model.matrix(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1)
X X
## (Intercept) water.lvl1 water.lvl2 sun.lvl1 water.lvl1:sun.lvl1
## 1 1 1 0 1 1
## 2 1 1 0 1 1
## 3 1 0 1 1 0
## 4 1 0 1 1 0
## 5 1 0 1 1 0
## 6 1 0 1 1 0
## 7 1 -1 -1 1 -1
## 8 1 -1 -1 1 -1
## 9 1 -1 -1 1 -1
## 10 1 1 0 -1 -1
## 11 1 1 0 -1 -1
## 12 1 1 0 -1 -1
## 13 1 1 0 -1 -1
## 14 1 0 1 -1 0
## 15 1 0 1 -1 0
## 16 1 -1 -1 -1 1
## 17 1 -1 -1 -1 1
## 18 1 -1 -1 -1 1
## water.lvl2:sun.lvl1
## 1 0
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
## 7 -1
## 8 -1
## 9 -1
## 10 0
## 11 0
## 12 0
## 13 0
## 14 -1
## 15 -1
## 16 1
## 17 1
## 18 1
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$water.lvl
## [1] "contr.sum"
##
## attr(,"contrasts")$sun.lvl
## [1] "contr.sum"
solve(t(X)%*%X)%*%t(X)%*%my.data1$scoville
## [,1]
## (Intercept) 354.50000
## water.lvl1 -57.62500
## water.lvl2 -32.37500
## sun.lvl1 -46.80556
## water.lvl1:sun.lvl1 14.93056
## water.lvl2:sun.lvl1 -19.56944
11.6 Tests for main effects in the partial table
(no sun level 3)
We can match the main effects tests in the ANOVA (type 3) table by comparing LS means.
# contrast test for sun
aggregate(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1, FUN=mean)
## water.lvl sun.lvl scoville
## 1 1 1 265.0000
## 2 2 1 255.7500
## 3 3 1 402.3333
## 4 1 2 328.7500
## 5 2 2 388.5000
## 6 3 2 486.6667
<- sum(lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1)$residuals^2)/(18-6)
MSE MSE
## [1] 8317.611
265+255.75+402.333)/3)-((328.75+388.5+486.667)/3))/sqrt(MSE * 2*(1/9 * (1/2+1/4+1/3))) (((
## [1] -2.091965
<-((((265+255.75+402.333)/3)-((328.75+388.5+486.667)/3))/sqrt(MSE * 2*(1/9 * (1/2+1/4+1/3)))
F^2
) F
## [1] 4.376316
1-pf(F,1,12)
## [1] 0.05837152
These tests for main effects should be constructed using the MSE from the full linear model fitted above rather than the partial tables. There is not much of a difference in the conclusion, but the partial table seems to underestimate the variance because it excludes sun level 3 which has larger responses.
# contrast test for sun
aggregate(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data1, FUN=mean)
## water.lvl sun.lvl scoville
## 1 1 1 265.0000
## 2 2 1 255.7500
## 3 3 1 402.3333
## 4 1 2 328.7500
## 5 2 2 388.5000
## 6 3 2 486.6667
<- sum(lm(scoville~water.lvl+sun.lvl+water.lvl*sun.lvl, data = my.data)$residuals^2)/(24-8)
MSE MSE
## [1] 9429.167
265+255.75+402.333)/3)-((328.75+388.5+486.667)/3))/sqrt(MSE * 2*(1/9 * (1/2+1/4+1/3))) (((
## [1] -1.964794
<- ((((265+255.75+402.333)/3)-((328.75+388.5+486.667)/3))/sqrt(MSE * 2*(1/9 * (1/2+1/4+1/3)))
F ^2
) F
## [1] 3.860415
1-pf(F,1,16)
## [1] 0.06704815