
Dummy variables for Logistic regression in R


I am running a logistic regression on three factors that are all binary.

My data

    table1 <- expand.grid(Crime = factor(c("Shoplifting","Other Theft Acts")),
                          Gender = factor(c("Men","Women")),
                          Priorconv = factor(c("N","P")))
    table1 <- data.frame(table1, Yes = c(24,52,48,22,17,60,15,4),
                         No = c(1,9,3,2,6,34,6,3))

and the model

    fit4 <- glm(cbind(Yes,No) ~ Priorconv + Crime + Priorconv:Crime,
                data = table1, family = binomial)
    summary(fit4)

R seems to take 1 for prior conviction P and 1 for crime Shoplifting, so the interaction term is 1 only when both of those dummies are 1. I would now like to try different combinations for the interaction term; for example, I would like to see what it would be if the prior conviction is P and the crime is not shoplifting.
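One way to see this coding explicitly (a quick check with base R's model.matrix() on the model above) is to print the design matrix:

    # Each row shows the 0/1 dummies for one observation; the last
    # column is the product PriorconvP * CrimeShoplifting
    model.matrix(fit4)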

Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.

Thank you.


Solution

  • You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:

    Here's the output of your regression:

    Call:
    glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime, 
        family = binomial, data = table1)
    
    Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                   1.9062     0.3231   5.899 3.66e-09 ***
    PriorconvP                   -1.3582     0.3835  -3.542 0.000398 ***
    CrimeShoplifting              0.9842     0.6069   1.622 0.104863    
    PriorconvP:CrimeShoplifting  -0.5513     0.7249  -0.761 0.446942  
    

    So, for Priorconv, the reference category (the one with dummy value = 0) is N, and for Crime the reference category is Other Theft Acts.
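    You can verify the coding of each factor directly with base R's contrasts() (a quick check, using table1 from the question):

    contrasts(table1$Priorconv)  # N is the all-zeros (reference) row
    contrasts(table1$Crime)      # Other Theft Acts is the reference row

    With those reference levels in mind, here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):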

    1. PriorConv = N and Crime = Other. This is just the case where both dummies are 
        zero, so your regression is just the intercept:
    
    log(p/(1-p)) = 1.90
    
    2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the 
       Crime dummy is still zero:
    
    log(p/(1-p)) = 1.90 - 1.36 = 0.54
    
    3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the 
       Crime dummy is now 1:
    
    log(p/(1-p)) = 1.90 + 0.98 = 2.88
    
    4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
    
    log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55 = 0.97
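    You can reproduce these four linear predictors without the hand arithmetic by predicting on a grid of the four category combinations (a sketch, assuming fit4 from the question; type = "link" returns the linear predictor, i.e. the log odds):

    # One row per combination of the two factors
    cases <- expand.grid(Priorconv = c("N", "P"),
                         Crime = c("Other Theft Acts", "Shoplifting"))
    cases$logodds <- predict(fit4, newdata = cases, type = "link")
    cases$p <- plogis(cases$logodds)  # convert log odds back to probabilities
    cases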
    

    You can reorder the factor levels of the two predictor variables, but that will just change which combination of categories falls into each of the four cases above.

    Update: Regarding how the regression coefficients relate to the ordering of the factor levels: changing the reference level will change the coefficients, because the coefficients then represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:

    m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
    predict(m1, type="response")
    
    1         2         3         4         5         6         7         8 
    0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634 
    
    table2 = table1
    table2$Priorconv = relevel(table2$Priorconv, ref = "P")  # make P the reference level
    
    m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
    predict(m2, type="response")
    
    1         2         3         4         5         6         7         8 
    0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
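
    To see that it's only the coefficients (and not the fit) that change, you can compare the two sets of estimates directly:

    coef(m1)  # estimates with N as the reference level for Priorconv
    coef(m2)  # estimates with P as the reference level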