Tags: r, regression, linear-regression, levels

Dropping factor levels to compare a control group against one disease: does it cause problems in regression/modelling if the factor is not dummy-coded?


I have a question about using droplevels on my dataset. My "Disease" column is a factor with 5 levels (Control plus 4 diseases).

BD$Etiología <- factor(BD$Etiología, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquémica"), ordered=FALSE)

Then I take a subset so that I can compare the Control cases against just one of the diseases.

BD_C_ID <- subset(BD, Etiología=="Control" | Etiología=="Idiop")

BD_C_ID$Etiología <- droplevels(BD_C_ID$Etiología)

BD_C_ID$Etiología

[1] Control Control Control Control Control Control Control Idiop   Idiop   Control Control Control
[13] Control Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop   Idiop  
[25] Idiop   Idiop   Control Control Control Control Idiop   Control Control Control Control Control
[37] Idiop   Idiop   Idiop   Idiop  
Levels: Control Idiop

The original factor was unordered, and I simply dropped the levels I don't use. Can I treat the remaining two levels as a 0-1 coded variable and use them in an lm or a logistic regression, or will that cause a problem?

Also, does the same apply if I compare Control vs BAG3 (codes 0 and 3 in the initial coding)? Or do I need to re-level them, re-applying factor(), so that they become 0-1?


Solution

  • The short answer is that it doesn't matter. If you use the factor as a predictor in a linear model (lm) or a logistic regression (glm), R dummy-codes it automatically and takes the first level as the reference, which here is always "Control". Unused levels are dropped when the model is fitted, so subsetting to Control and BAG3 gives a single 0/1 term without any re-levelling. droplevels() is useful when other functions need the factor to carry only the levels that are present, but lm() and glm() take care of this for you.

    To illustrate this using your example:

    set.seed(111)
    BD <- data.frame(
      Etiologia = sample(0:4, 100, replace = TRUE),
      x = rnorm(100),
      y = rnorm(100)
    )
    

    Without calling droplevels(), we can simply fit:

    BD$E <- factor(BD$Etiologia,levels=0:4,
    labels= c("Control","Idiop","LMNA","BAG3","Isquemica"))
    
    lm(y ~ x + E,data=subset(BD,E %in% c("Control","Idiop")))
    
    Call:
    lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", "Idiop")))
    
    Coefficients:
    (Intercept)            x       EIdiop  
       -0.05524      0.21596      0.30433 
    

    And for another comparison, Control vs BAG3 (codes 0 and 3 in the original coding):

    lm(y ~ x + E,data=subset(BD,E %in% c("Control","BAG3")))
    
    Call:
    lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", 
        "BAG3")))
    
    Coefficients:
    (Intercept)            x        EBAG3  
       -0.03355      0.08978     -0.21708  
    

    You get the same coefficients as in the Control vs Idiop fit above if you apply droplevels() first:

    BD$Etiologia <- factor(BD$Etiologia, levels=c(0,1,2,3,4) ,
    labels= c("Control","Idiop","LMNA","BAG3","Isquemica"), ordered=FALSE)
    
    BD_C_ID <- droplevels(subset(BD, Etiologia=="Control" | Etiologia=="Idiop"))
    
    lm(y ~ x + Etiologia,data=BD_C_ID)
    
    Call:
    lm(formula = y ~ x + Etiologia, data = BD_C_ID)
    
    Coefficients:
       (Intercept)               x  EtiologiaIdiop  
          -0.05524         0.21596         0.30433
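
To check the 0-1 coding directly, here is a minimal sketch on made-up data. The names toy, sub, fit_keep, fit_drop, fit_logit and mm are hypothetical and not part of the original question. It compares the fit with and without droplevels(), inspects the dummy column that lm() builds, and shows the case where the factor is the response of a logistic regression. It assumes R's default treatment contrasts are in effect.

```r
# Toy data, purely illustrative (not the asker's dataset)
set.seed(1)
toy <- data.frame(
  E = factor(rep(0:4, times = 40), levels = 0:4,
             labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica")),
  x = rnorm(200),
  y = rnorm(200)
)

# Control vs BAG3 (codes 0 and 3); the subset still carries all 5 levels
sub <- subset(toy, E %in% c("Control", "BAG3"))

fit_keep <- lm(y ~ x + E, data = sub)              # unused levels left in place
fit_drop <- lm(y ~ x + E, data = droplevels(sub))  # unused levels removed first

# The two fits should have identical coefficients
all.equal(coef(fit_keep), coef(fit_drop))

# The design matrix holds one dummy column: 1 for BAG3, 0 for Control
mm <- model.matrix(fit_keep)
table(droplevels(sub$E), mm[, "EBAG3"])

# With the factor as the response, binomial glm() treats the first level
# ("Control") as 0 and any other level as 1
fit_logit <- glm(E ~ x, family = binomial, data = sub)
head(fit_logit$y)
```

If a different reference level is needed, relevel(sub$E, ref = "BAG3") changes which level is treated as the baseline; the original integer codes (0 and 3) play no role once the column is a factor.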