While studying logistic regression with caret's mdrr data, I ran into a question. I fit a full model using a total of 19 variables, and I have a question about how the categorical variables are labeled in the output.
In my regression model, the categorical variables are:
nDB : 0 or 1 or 2
nR05 : 0 or 1
nR10 : 1 or 2
I fit the full model with glm, but I do not understand why the names of the categorical variables have one of their category values appended.
glm(formula = mdrrClass ~ ., family = binomial, data = train)
#Coefficients:
#(Intercept) nDB1 nDB2 nX nR051 nR101 nBnz2
#5.792e+00 5.287e-01 -3.103e-01 -2.532e-01 -9.291e-02 9.259e-01 -2.108e+00
#SPI BLI PW4 PJI2 Lop BIC2 VRA1
#3.222e-05 -1.201e+01 -3.754e+01 -5.467e-01 1.010e+00 -5.712e+00 -2.424e-04
# PCR H3D FDI PJI3 DISPm DISPe G.N..N.
# -6.397e-02 -4.360e-04 3.458e+01 -6.579e+00 -5.690e-02 2.056e-01 -7.610e-03
#Degrees of Freedom: 263 Total (i.e. Null); 243 Residual
#Null Deviance: 359.3
#Residual Deviance: 232.6 AIC: 274.6
In the output above, nDB appears as nDB1 and nDB2, and nR05 and nR10 appear as nR051 and nR101. Why are these numbers attached to the variable names?
When you have categorical predictors in any regression model, you need to create dummy variables. R does this for you, and the terms you see in the output are those dummy variables (contrasts).
Your variable nDB has 3 levels: 0, 1, 2.
One of those needs to be chosen as the reference level (R chose 0 for you in this case because, by default, it uses the first level of the factor, but you can also specify this manually). Then dummy variables are created to compare every other level against the reference level: 0 vs 1 and 0 vs 2.
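To illustrate how the reference level is picked and changed, here is a minimal sketch. It uses a made-up toy factor standing in for nDB (not the real mdrr data) and assumes R's default contrast settings have not been changed.

```r
# Toy factor standing in for nDB (values are made up for illustration)
nDB <- factor(c(0, 1, 2, 1, 0, 2))

levels(nDB)      # "0" "1" "2"; the first level is used as the reference
contrasts(nDB)   # the dummy coding R applies by default

# Make level "2" the reference instead of "0"
nDB_ref2 <- relevel(nDB, ref = "2")
levels(nDB_ref2) # "2" "0" "1"
```

If you refit the model with the releveled factor, the coefficients would be named after the remaining levels, each compared against 2.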
R names these dummy variables nDB1 and nDB2. nDB1 is for the 0 vs 1 contrast, and nDB2 is for the 0 vs 2 contrast. The numbers after the variable names just indicate which contrast you're looking at.
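You can see where these names come from by looking at the design matrix directly. The sketch below uses a small made-up data frame (only the column names mirror your data) and model.matrix, which builds the same kind of matrix that glm uses internally.

```r
# Toy data frame; values are made up, only the column names mirror the question
toy <- data.frame(
  nDB  = factor(c(0, 1, 2, 1, 0, 2)),
  nR05 = factor(c(0, 1, 0, 1, 1, 0))
)

# Design matrix built from the formula: one 0/1 column per non-reference level
X <- model.matrix(~ nDB + nR05, data = toy)
colnames(X)  # "(Intercept)" "nDB1" "nDB2" "nR051"
X            # rows with nDB == 0 have 0 in both nDB1 and nDB2
```

Each column name is the variable name pasted together with the level it represents, which is why nR05 with level 1 shows up as nR051.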
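Each coefficient is interpreted as the difference in your y (outcome) between the reference group and one other group: 0 vs 1 for nDB1, and separately 0 vs 2 for nDB2. Because this is a logistic regression, that difference is on the log-odds scale, with the other predictors held fixed. In other words, what change in the log-odds of the outcome would you expect when moving from one group to the other?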
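Log-odds differences are often easier to read as odds ratios. As a rough sketch, the snippet below takes the two nDB coefficients copied from your output and exponentiates them. Note that glm with family = binomial models the second level of a factor response, so the odds refer to that level of mdrrClass.

```r
# Coefficients copied from the glm() output in the question
b <- c(nDB1 = 5.287e-01, nDB2 = -3.103e-01)

# exp() turns a log-odds difference into an odds ratio versus the reference level
odds_ratio <- exp(b)
round(odds_ratio, 2)  # nDB1 about 1.70, nDB2 about 0.73
```

So, holding everything else fixed, the odds for nDB = 1 are about 1.7 times the odds for nDB = 0, while for nDB = 2 they are about 0.73 times the odds for nDB = 0.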
Your other categorical variables have 2 levels and are just a simpler case of the above.
For example, nR05 only has 0 and 1 as values. 0 was chosen as your reference, and because there's only 1 possible contrast here, a single dummy variable is created comparing 0 vs 1. In the output that dummy variable is called nR051.