r categorical-data interaction lasso-regression

Add all interactions among categorical variables in lasso in R


I want to add all possible interactions among eight variables, all of which are categorical. My dataset looks like the following: (image omitted)

I use as.formula to include all interactions. My code is below

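    f = as.formula(y ~ .^8)          # main effects plus all interactions up to order 8
    x = model.matrix(f, data)[, -1]  # design matrix with the intercept column dropped
    y = data$y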

However, my x becomes the following: (image omitted)

And there are 6560 columns in total. I have no idea why this happens. Shouldn't the x variables still take the values 1, 2, 3? How should I fix or interpret this?

Thank you!


Solution

  • You have eight variables, each with three levels. You want to include every possible interaction, that is, every possible combination of the eight factors.

    There are 3^8 = 6561 different possible combinations of values for your predictors, so the design matrix has 3^8 = 6561 columns for the main effects and interactions (including the intercept). Dropping the intercept with [, -1] leaves the 6560 columns you see.
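
    As a quick check of that count, here is a minimal sketch on simulated data. The data frame `toy`, its column names `x1` to `x8`, and the random response are made up for illustration and are not the asker's data. It also compares the result with the binomial expansion: a k-way interaction of 3-level factors contributes 2^k dummy columns, so the total is the sum over k of choose(8, k) * 2^k = (1 + 2)^8.

    ```r
    set.seed(1)
    n <- 20  # a few rows suffice: the column count depends on factor levels, not on n

    # Eight hypothetical 3-level factors x1..x8 plus a made-up response y
    toy <- as.data.frame(setNames(
      lapply(1:8, function(j) factor(sample(1:3, n, replace = TRUE), levels = 1:3)),
      paste0("x", 1:8)
    ))
    toy$y <- rnorm(n)

    # Same formula as in the question: all main effects and interactions
    X_full <- model.matrix(y ~ .^8, data = toy)

    n_cols_full   <- ncol(X_full)          # includes "(Intercept)"
    n_cols_no_int <- ncol(X_full[, -1])    # what model.matrix(f, data)[, -1] keeps

    # Column count implied by the expansion: each k-way term has 2^k columns
    n_cols_formula <- sum(choose(8, 0:8) * 2^(0:8))

    print(c(full = n_cols_full,
            without_intercept = n_cols_no_int,
            formula = n_cols_formula))
    ```

    Setting `levels = 1:3` explicitly keeps all three levels in each factor even if a small sample happens to miss one, so the column count does not depend on the random draw.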

    To see how they are encoded, consider a single 3-level predictor:

    > model.matrix(lm(y ~ x1))
      (Intercept) x12 x13
    1           1   0   0
    2           1   1   0
    3           1   0   1
    

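    A single 3-level factor is encoded as 3 columns: an intercept plus two 0/1 dummy variables. The original codes 1, 2, 3 therefore do not appear in x. Level 1 is the baseline (both dummies are 0), x12 flags level 2, and x13 flags level 3.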

    Now add a second 3-level predictor and their interaction:

    > model.matrix(lm(y ~ (x1+x2)^2))
      (Intercept) x12 x13 x22 x23 x12:x22 x13:x22 x12:x23 x13:x23
    1           1   0   0   0   1       0       0       0       0
    2           1   1   0   1   0       1       0       0       0
    3           1   0   1   0   0       0       0       0       0
    

    So here there are 9 permissible combinations of those binary variables, one for each of the 3^2 = 9 combinations of x1 and x2. When you get up to 8 variables, each of your 6561 possible combinations of predictor values is represented by a permissible combination of these binary variables. (Obviously you can't have both x12 and x13 equal to 1 at the same time.)
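
The two-factor case above can be reproduced with a small sketch, assuming a made-up grid of every (x1, x2) combination (the object names `grid` and `X2` are illustrative only). It checks that the nine combinations map to nine distinct rows of dummy variables, that x12 and x13 are never both 1, and that an interaction column is simply the product of its parent dummy columns.

```r
# Hypothetical toy data: every combination of two 3-level factors
grid <- expand.grid(x1 = factor(1:3), x2 = factor(1:3))

# Main effects plus the two-way interaction; no response is needed
# to build the design matrix
X2 <- model.matrix(~ (x1 + x2)^2, data = grid)

print(colnames(X2))

# Each of the 9 (x1, x2) combinations gives a distinct row of dummies
n_distinct_rows <- nrow(unique(X2))

# x12 and x13 are dummies for the same factor, so they are never both 1
both_on <- any(X2[, "x12"] == 1 & X2[, "x13"] == 1)

# An interaction column is the product of its parent dummy columns
product_matches <- all(X2[, "x12:x22"] == X2[, "x12"] * X2[, "x22"])

print(c(distinct_rows = n_distinct_rows,
        both_on = both_on,
        product_matches = product_matches))
```

Reading a row of x is then a matter of looking at which main-effect dummies are 1; the interaction columns follow from those.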