
Logistic Regression in R - Interpreting interaction effects for categorical variables


I have a dataset that looks like this.

[screenshot of the dataset]

Note that variables A and B are binary, with levels Low/High.
The following code has been run in R

logit <- glm(y ~ A * B, family = binomial(link = "logit"), data = df)
summary(logit)

and here's the output: [screenshot of the model summary]

I included the interaction between A and B because my hypothesis does not align with the main effects of A and B alone; unsurprisingly, the interaction turned out to be quite significant.
But how do I interpret these coefficients?
I know how to interpret the coefficients if either A or B were numeric, but with two categorical variables it's hard to get my head around.

Looking forward to some experts' advice/comments.

Many thanks!



Solution

  • General background: interpreting logistic regression coefficients

    First of all, to learn more about interpreting logistic regression coefficients generally, take a look at this guide for beginners. A logistic regression coefficient is the change in the log odds of the outcome associated with a one-unit increase in the predictor. So if you have a coefficient beta, you can exponentiate it, exp(beta), to get the odds ratio (OR). If beta = 0, then exp(beta) = 1, so the OR is 1 and the predictor has no effect on the odds of the response. If beta > 0, the OR is greater than 1, meaning the odds of the response increase as the predictor increases; if beta < 0, the OR is less than 1 and the odds decrease.
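    As a minimal sketch of that conversion (the value 1.41 is the interaction coefficient from the output above; any other coefficient works the same way):

    ```r
    # Convert a log-odds coefficient to an odds ratio
    beta <- 1.41
    odds_ratio <- exp(beta)
    round(odds_ratio, 2)  # about 4.1
    ```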

    Interpreting interaction coefficients on categorical variables in R logistic regressions

    Now that we have that background, we can proceed to a more specific answer to the question here.

    In R, model functions like glm() handle categorical predictors using the factor data type. If the variables are in character format when you pass them to glm(), it coerces them to factors. After that coercion, the model converts each factor to a set of n-1 dummy variables, where n is the number of unique levels in the factor. The default level ordering is alphabetical, so the level that comes first in the alphabet becomes the reference (intercept) level.
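    A minimal sketch of this coercion, using a toy vector (the values are illustrative, not from your data):

    ```r
    # Toy factor: levels default to alphabetical order
    A <- factor(c("High", "Low", "Low", "High"))
    levels(A)          # "High" "Low" -- "High" is the reference level
    # model.matrix() shows the dummy coding glm() will use:
    model.matrix(~ A)  # column "ALow" is 1 where A == "Low", 0 where A == "High"
    ```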

    Therefore, because A and B each have only two unique levels, and "High" comes before "Low" in the alphabet, both A and B are essentially converted to a single 0/1 vector with High = 0 and Low = 1. You can change this behavior by manually setting the factor level ordering: df$A <- factor(df$A, levels = c('Low', 'High')).

    In your model, the coefficient on the interaction between A and B tells you how strongly the effect of A on whether y is Terminated depends on B (or equivalently, how the effect of B on y depends on A). Note this also assumes that the binary outcome y is coded with Active = 0 and Terminated = 1, which again follows from the default alphabetical ordering.
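    You can verify the outcome coding the same way (the values below are illustrative, assuming y is stored as character or factor):

    ```r
    # glm() treats the first factor level as 0 and the second as 1
    y <- factor(c("Active", "Terminated", "Active"))
    levels(y)  # "Active" "Terminated" -- so Active = 0, Terminated = 1
    ```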

    The model is:

    log(p / (1 - p)) = b0 + b1*A + b2*B + b3*(A*B)

    where p is the probability that y is Terminated, A and B are the 0/1 dummy variables described above, and b3 is the interaction coefficient.

    Because A is either 0 or 1 and B is either 0 or 1, the last term in the equation above is 0 unless both A = 1 and B = 1. That corresponds to both variables being Low, assuming you're using the default factor coding. We can interpret the interaction coefficient of 1.41, which is positive, as saying that when A is Low, the effect of B on y is more positive, i.e., it produces a greater increase in the probability of y being Terminated. Specifically, if both are Low, the odds of Terminated are about exp(1.41) = 4.1 times higher than the main effects alone would predict.
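    To make this concrete, here is a sketch using the model equation directly. Only the interaction coefficient 1.41 comes from your output; the other coefficients are made-up placeholders:

    ```r
    # Hypothetical coefficients; only b3 = 1.41 is from the question's output
    b0 <- -1.0; b1 <- 0.5; b2 <- -0.8; b3 <- 1.41
    log_odds <- function(A, B) b0 + b1 * A + b2 * B + b3 * A * B  # A, B in {0, 1}
    # The interaction OR compares the Low/Low cell to what the main effects alone predict:
    ratio <- exp(log_odds(1, 1) - log_odds(1, 0) - log_odds(0, 1) + log_odds(0, 0))
    ratio  # equals exp(b3), about 4.1, regardless of b0, b1, b2
    ```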

    You can say "if A is Low, then B being Low has a positive effect on the probability of termination, but if A is High, then B being Low has a negative effect on the probability of termination." That's because the main effect of B is < 0 while the interaction coefficient is > 0.
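    The sign flip can be read straight off the coefficients (again, only 1.41 is from your output; the negative main effect of B is a made-up placeholder):

    ```r
    b2 <- -0.8; b3 <- 1.41  # b2: main effect of B (hypothetical), b3: interaction
    effect_of_B_when_A_high <- b2       # -0.8: B = Low lowers the log odds of Terminated
    effect_of_B_when_A_low  <- b2 + b3  #  0.61: B = Low raises the log odds of Terminated
    ```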