
Notation of categorical variables in regression analysis


While studying logistic regression with caret's mdrr data, I fit a full model using a total of 19 variables, and I have a question about how the categorical variables are labeled in the output.

In my regression model, the categorical variables are:

  • nDB : 0 or 1 or 2

  • nR05 : 0 or 1

  • nR10 : 1 or 2

I fit the full model with glm, but I do not understand why the coefficient names for the categorical variables have one of the category values appended.

-------------------------------------------------------------------------------

glm(formula = mdrrClass ~ ., family = binomial, data = train)

#Coefficients:
#(Intercept)         nDB1         nDB2           nX        nR051        nR101        nBnz2  
  #5.792e+00    5.287e-01   -3.103e-01   -2.532e-01   -9.291e-02    9.259e-01   -2.108e+00  
        #SPI          BLI          PW4         PJI2          Lop         BIC2         VRA1  
  #3.222e-05   -1.201e+01   -3.754e+01   -5.467e-01    1.010e+00   -5.712e+00   -2.424e-04  
       # PCR          H3D          FDI         PJI3        DISPm        DISPe      G.N..N.  
# -6.397e-02   -4.360e-04    3.458e+01   -6.579e+00   -5.690e-02    2.056e-01   -7.610e-03  

#Degrees of Freedom: 263 Total (i.e. Null);  243 Residual
#Null Deviance:     359.3 
#Residual Deviance: 232.6   AIC: 274.6

-------------------------------------------------------------------------------

In the output above, nDB appears as nDB1 and nDB2, and nR05 and nR10 appear as nR051 and nR101. Why are these numbers attached to the variable names?


Solution

  • When you have categorical predictors in a regression model, they have to be represented as dummy variables. R creates these for you, and the coefficients you see in the output are the resulting contrasts.

    Your variable nDB has 3 levels: 0, 1, 2

    One of those needs to be chosen as the reference level (R chose 0 for you in this case because it is the first level, but you can also specify it manually). Then dummy variables are created to compare every other level against the reference level: 0 vs 1 and 0 vs 2.

    R names these dummy variables nDB1 and nDB2. nDB1 is for the 0 vs 1 contrast, and nDB2 is for the 0 vs 2 contrast. The number after the variable name just indicates which level is being compared against the reference.

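    The coefficient values are interpreted as the difference in your y (outcome) between groups 0 and 1 (nDB1), and separately between groups 0 and 2 (nDB2), with the other predictors held fixed. Because this is a logistic regression, that difference is on the log-odds scale. In other words, what change in the log-odds of the outcome would you expect when moving from the reference group to the other group?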

    Your other categorical variables have 2 levels and are just a simpler case of the above.

    For example, nR05 only has 0 and 1 as values. 0 was chosen as your reference, and because there's only 1 possible contrast here, a single dummy variable is created comparing 0 vs 1. In the output that dummy variable is called nR051.
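
The naming described above can be checked on a small example. This is only a sketch: the `toy` data frame, its random outcome `y`, and the `nDB_ref2` column are made up for illustration and are not the mdrr data. It assumes R's default treatment contrasts, shows the dummy columns that `model.matrix` builds for a 3-level factor, and shows how `relevel` changes the reference level.

```r
# Toy data (hypothetical): a 3-level factor like nDB and a random binary outcome.
set.seed(1)
toy <- data.frame(
  nDB = factor(rep(c(0, 1, 2), times = 20)),
  y   = rbinom(60, size = 1, prob = 0.5)
)

# Under the default treatment contrasts, the first level is the reference.
levels(toy$nDB)

# The design matrix shows the dummy columns R builds behind the scenes.
dummy_cols <- colnames(model.matrix(~ nDB, data = toy))
dummy_cols      # "(Intercept)" "nDB1" "nDB2"

# Coefficient names are the variable name pasted onto the level name.
fit <- glm(y ~ nDB, family = binomial, data = toy)
coef_names <- names(coef(fit))

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(fit))

# Picking a different reference level changes which dummies appear.
toy$nDB_ref2 <- relevel(toy$nDB, ref = "2")
releveled_cols <- colnames(model.matrix(~ nDB_ref2, data = toy))
releveled_cols  # "(Intercept)" "nDB_ref20" "nDB_ref21"
```

With level 2 as the reference, the dummies compare 2 vs 0 and 2 vs 1, so the suffixes become 0 and 1. The same pattern explains a name like nR051: the variable name nR05 followed by the non-reference level 1.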