I have a question about using droplevels in my dataset. My "Disease" column is a factor with 5 levels (a control group plus 4 etiologies).
BD$Etiología <- factor(BD$Etiología, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquémica"), ordered=FALSE)
Then I take a subset so that I can compare the Control cases against just one of the diseases.
BD_C_ID <- subset(BD, Etiología=="Control" | Etiología=="Idiop")
BD_C_ID$Etiología= droplevels(BD_C_ID$Etiología)
BD_C_ID$Etiología
[1] Control Control Control Control Control Control Control Idiop Idiop Control Control Control
[13] Control Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop Idiop
[25] Idiop Idiop Control Control Control Control Idiop Control Control Control Control Control
[37] Idiop Idiop Idiop Idiop
Levels: Control Idiop
Since the original factor was unordered and I only dropped the levels I don't use, can I treat the result as a 0-1 coded variable in lm or in a logistic regression? Or will that cause a problem?
Also, does the same apply if I compare Control vs BAG3 (0 and 3 in the initial coding)? Or will I need to re-create the factor so that the levels are coded 0-1?
Short answer: it doesn't matter. If you use the factor in a linear model (lm) or a logistic regression, the model takes the first level as the reference level, so in this case the reference is always "Control". droplevels() is useful if you need to run other functions on the factor, but if the data are going straight into lm() or glm(), these functions take care of the unused factor levels internally.
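As a quick check of that claim, here is a minimal sketch on made-up data (the data frame d and its columns are hypothetical, not from the question). It assumes only base R and inspects which levels the fitted model actually records.

```r
# Toy data: only "Control" (0) and "BAG3" (3) occur, but the factor
# is declared with all five levels, as after subset() without droplevels().
set.seed(1)
d <- data.frame(
  E = factor(sample(c(0, 3), 40, replace = TRUE), levels = 0:4,
             labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica")),
  y = rnorm(40)
)

# The factor itself still carries the empty levels.
nlevels(d$E)

fit <- lm(y ~ E, data = d)

# The model only keeps the levels that are present; the first of them
# ("Control") is the reference, so the single contrast is EBAG3.
fit$xlevels$E
names(coef(fit))
```

The coefficient is labelled by level name (EBAG3), not by the original numeric code, so the 0/3 coding never enters the model.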
To illustrate this using your example:
set.seed(111)
BD = data.frame(
Etiologia = sample(0:4,100,replace=TRUE),
x = rnorm(100),
y = rnorm(100)
)
We can just do:
BD$E <- factor(BD$Etiologia,levels=0:4,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"))
lm(y ~ x + E,data=subset(BD,E %in% c("Control","Idiop")))
Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control", "Idiop")))
Coefficients:
(Intercept)            x       EIdiop
   -0.05524      0.21596      0.30433
And using another comparison:
lm(y ~ x + E,data=subset(BD,E %in% c("Control","BAG3")))
Call:
lm(formula = y ~ x + E, data = subset(BD, E %in% c("Control",
"BAG3")))
Coefficients:
(Intercept)            x        EBAG3
   -0.03355      0.08978     -0.21708
You get the same result if you do:
BD$Etiologia <- factor(BD$Etiologia, levels=c(0,1,2,3,4) ,
labels= c("Control","Idiop","LMNA","BAG3","Isquemica"), ordered=FALSE)
BD_C_ID <- droplevels(subset(BD, Etiologia=="Control" | Etiologia=="Idiop"))
lm(y ~ x + Etiologia,data=BD_C_ID)
Call:
lm(formula = y ~ x + Etiologia, data = BD_C_ID)
Coefficients:
   (Intercept)               x  EtiologiaIdiop
      -0.05524         0.21596         0.30433
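On the "0-1 coded" part of the question, the sketch below compares the factor against an explicit 0/1 column for the Control vs BAG3 case. The names sub and is_bag3 are made up for illustration. It also assumes you might want the factor as the outcome of a logistic regression, which the question does not state explicitly; with a factor response and family = binomial, glm() treats the first level as 0 and every other level as 1.

```r
set.seed(111)
BD <- data.frame(
  Etiologia = sample(0:4, 100, replace = TRUE),
  x = rnorm(100),
  y = rnorm(100)
)
BD$E <- factor(BD$Etiologia, levels = 0:4,
               labels = c("Control", "Idiop", "LMNA", "BAG3", "Isquemica"))
sub <- subset(BD, E %in% c("Control", "BAG3"))

# Hypothetical explicit 0/1 column: 0 = Control, 1 = BAG3.
sub$is_bag3 <- as.integer(sub$E == "BAG3")

# As a predictor: the factor and the 0/1 column give the same estimates.
fit_factor <- lm(y ~ x + E, data = sub)
fit_dummy  <- lm(y ~ x + is_bag3, data = sub)
all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy)))

# As a binary outcome: factor response vs. explicit 0/1 response.
glm_factor <- glm(E ~ x, family = binomial, data = sub)
glm_01     <- glm(is_bag3 ~ x, family = binomial, data = sub)
all.equal(coef(glm_factor), coef(glm_01))

# Caution: as.integer() on the factor returns the internal codes,
# which are 1 and 4 here (not 0 and 1) until the levels are dropped.
sort(unique(as.integer(sub$E)))
sort(unique(as.integer(droplevels(sub$E))))
```

So no re-coding is needed for modelling. The only place the original numbering shows up is if you convert the factor to integers yourself, in which case comparing against the level name, as in sub$E == "BAG3", is the safer route.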