I'm trying to conduct an ANCOVA (a mix of ANOVA and linear regression) to compare different models, and I'm running into some problems. I think I've narrowed it down to something I don't understand (or am doing wrong) about ANOVA: in order to compare two models, they need to have different residual Df (degrees of freedom).
As an example, let's consider mtcars data in R:
library(car)
library(dplyr)  # needed for %>% and mutate()
test_data <- mtcars %>% mutate(factored_variable = as.factor(carb))
model_1 <- aov(drat ~ factored_variable , data = test_data)
Anova(model_1, type = "III")
# Anova Table (Type III tests)
#
# Response: drat
#                   Sum Sq Df  F value                Pr(>F)
# (Intercept)       94.870  1 313.3656 0.0000000000000005038 ***
# factored_variable  0.991  5   0.6546                0.6607
# Residuals          7.871 26
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_2 <- aov(drat ~ factored_variable - 1, data = test_data)
Anova(model_2, type = "III")
# Anova Table (Type III tests)
#
# Response: drat
#                   Sum Sq Df F value                Pr(>F)
# factored_variable 414.92  6  228.42 < 0.00000000000000022 ***
# Residuals           7.87 26
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, what I just did is create two models to predict drat. The first one includes the factor variable (Df = number of levels - 1 = 5) and an intercept (always Df = 1), so 6 Df are used. In the second model I removed the intercept, leaving the variable alone. I would then expect only 5 Df to be used by this variable, but that is apparently not the case, as Anova says there are 6.
My question is therefore: why is that last Df 6 and not 5? I guess it's related to the fact that the variable is a factor, but I don't understand why. Is it not possible to compare two models involving that kind of variable?
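To illustrate the comparison I'm attempting (a sketch using base R's anova() on the two fits defined above):

```r
library(dplyr)  # for %>% and mutate()

test_data <- mtcars %>% mutate(factored_variable = as.factor(carb))
model_1 <- aov(drat ~ factored_variable, data = test_data)      # with intercept
model_2 <- aov(drat ~ factored_variable - 1, data = test_data)  # without intercept

# Both fits have the same residual Df and residual sum of squares,
# so the comparison has a Df difference of 0 and nothing to test
anova(model_1, model_2)
```

Both rows of the comparison show Res.Df = 26 and RSS of about 7.87, so the Df difference is 0.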
Edit: thanks for the answer. I think I misunderstood the theory rather than R; it's a bit clearer now.
Your two models are essentially the same model, but in the second one you've forced the intercept to be zero. Removing the intercept doesn't change the degrees of freedom, because it results in all 6 levels of factored_variable getting parameter estimates, rather than 6 - 1 = 5 levels of factored_variable plus the intercept.
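You can verify this directly by counting the columns of each model's design matrix (a quick check with base R's model.matrix(); test_data is recreated here with base R's within() so the snippet stands alone):

```r
# Same data setup as in the question, without dplyr
test_data <- within(mtcars, factored_variable <- as.factor(carb))

m1 <- aov(drat ~ factored_variable, data = test_data)      # intercept + 5 dummies
m2 <- aov(drat ~ factored_variable - 1, data = test_data)  # 6 dummies, no intercept

ncol(model.matrix(m1))  # 6 columns: (Intercept) plus 5 factor levels
ncol(model.matrix(m2))  # 6 columns: one per factor level

# Residual degrees of freedom are identical as well: 32 - 6 = 26
df.residual(m1)
df.residual(m2)
```

Dropping the intercept just re-parameterizes the same 6-dimensional column space, which is why the residual Df is unchanged.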
To see that the models are otherwise equivalent (and that each is equivalent to a regression model), we'll create the equivalent linear regression models and then look at the coefficients.
aov1 <- aov(drat ~ factored_variable, data = test_data)
aov2 <- aov(drat ~ factored_variable - 1, data = test_data)
lm1 <- lm(drat ~ factored_variable, data = test_data)
lm2 <- lm(drat ~ factored_variable - 1, data = test_data)
Now look at the coefficients for the four models, as shown in the code and output below. aov1 and lm1 estimate the intercept plus 5 coefficients for factored_variable. The coefficient for the missing category of factored_variable (the "reference" category) is the intercept; the other coefficients are the differences between each category and the reference category. aov2 and lm2 estimate an absolute coefficient for every category of factored_variable, rather than coefficients relative to the reference category.
coefs <- data.frame(aov1=coef(aov1), aov2=coef(aov2), lm1=coef(lm1), lm2=coef(lm2))
coefs
                                      aov1     aov2        lm1      lm2
(Intercept)/factored_variable1  3.68142857 3.681429 3.68142857 3.681429
factored_variable2              0.01757143 3.699000 0.01757143 3.699000
factored_variable3             -0.61142857 3.070000 -0.61142857 3.070000
factored_variable4             -0.08542857 3.596000 -0.08542857 3.596000
factored_variable6             -0.06142857 3.620000 -0.06142857 3.620000
factored_variable8             -0.14142857 3.540000 -0.14142857 3.540000
Note that the model pairs lm1/aov1 and lm2/aov2 each have the same coefficients. For models aov1 and lm1, if you add each factored_variable coefficient to the intercept, you'll also see that the results match the coefficients for lm2 and aov2. In each case, the model is estimating six parameters.
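As a final numeric check (base R only; test_data is rebuilt with within() so the snippet is self-contained), adding lm1's intercept to its relative coefficients recovers lm2's absolute coefficients, and both parameterizations give identical fitted values:

```r
test_data <- within(mtcars, factored_variable <- as.factor(carb))
lm1 <- lm(drat ~ factored_variable, data = test_data)
lm2 <- lm(drat ~ factored_variable - 1, data = test_data)

# Reference-level mean plus the relative coefficients = per-level means
absolute <- c(coef(lm1)[1], coef(lm1)[1] + coef(lm1)[-1])
all.equal(unname(absolute), unname(coef(lm2)))  # TRUE

# Same fitted values, so the two models are equivalent
all.equal(fitted(lm1), fitted(lm2))  # TRUE
```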