I use model.matrix to create a matrix used by GLM.
formula_test <- as.formula("Y ~ x1 + x2")
data_test <- expand.grid(
Y = 1:100
, x1 = c("A","B")
, x2 = 1:20
)
result_test <- data.frame(model.matrix(
object = formula_test
, data = data_test
))
names(result_test)
Interestingly, the column names of result_test are "X.Intercept.", "x1B" and "x2".
How come the second column name is not "x1A"?
I then tried
data_test$x1 <- factor(x = data_test$x1, levels = c("A","B"))
but the result is still the same.
That is because if you had c("X.Intercept.", "x1A", "x1B", "x2"), then you would have perfect multicollinearity: x1A + x1B would be a column of ones, just like the X.Intercept. column. If, for the sake of interpretation, you prefer having x1A instead of the intercept, you can use
formula_test <- as.formula("Y ~ -1 + x1 + x2")
which, after rebuilding result_test with this formula, gives
names(result_test)
# [1] "x1A" "x1B" "x2"
and
all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
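The collinearity argument can be checked numerically. The following is a minimal sketch that rebuilds the data from the question and appends an explicit x1A indicator to the default design matrix; the names X, x1A and X_full are introduced here only for illustration.

```r
# Rebuild the example data from the question
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20)

# Default design matrix: intercept, x1B, x2
X <- model.matrix(Y ~ x1 + x2, data = data_test)

# Hypothetical "full" matrix that also carries an indicator for level A
x1A <- as.numeric(data_test$x1 == "A")
X_full <- cbind(X, x1A = x1A)

# Four columns, but one of them is a linear combination of the others
n_cols <- ncol(X_full)
rank_full <- qr(X_full)$rank

# x1A + x1B reproduces the intercept column exactly
sums_to_intercept <- all(X_full[, "x1A"] + X_full[, "x1B"] == X_full[, "(Intercept)"])

print(c(columns = n_cols, rank = rank_full))
print(sums_to_intercept)
```

If the rank comes out lower than the number of columns, a GLM fitted on such a matrix could not identify all four coefficients, which is why model.matrix leaves one indicator out.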
As for why it is x1A that is dropped rather than x1B: with R's default treatment contrasts (contr.treatment), the first factor level serves as the baseline and gets no column of its own. If instead we use
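data_test$x1 <- factor(data_test$x1, levels = c("B", "A"))  # reorders the levels; levels<- would relabel the observations
then rebuilding result_test gives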
names(result_test)
# [1] "X.Intercept." "x1A" "x2"
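To see where the baseline comes from, one can inspect the contrast matrix of the factor and change the reference level with relevel(). This is a sketch that assumes the default options("contrasts") setting, under which unordered factors use contr.treatment; the names cm and cols are illustrative only.

```r
# Rebuild the example data from the question
data_test <- expand.grid(Y = 1:100, x1 = c("A", "B"), x2 = 1:20)
data_test$x1 <- factor(data_test$x1, levels = c("A", "B"))

# Contrast matrix under the (assumed) default treatment coding:
# the row for the first level "A" is all zeros, so "A" is the baseline
cm <- contrasts(data_test$x1)
print(cm)

# Make "B" the reference level without relabelling any observation
data_test$x1 <- relevel(data_test$x1, ref = "B")

# model.matrix itself names the first column "(Intercept)";
# wrapping it in data.frame() is what turns that into "X.Intercept."
cols <- colnames(model.matrix(Y ~ x1 + x2, data = data_test))
print(cols)
```

relevel() only changes which level comes first, so the observations keep their original labels while the dummy column switches from x1B to x1A.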