r  logistic-regression

Confused about the reference level in logistic regression in R


I am confused by the answer to "Logistic regression - defining reference level in R".

It says that if you want to predict the probability of "Yes", you should set relevel(auth$class, ref = "YES"). However, in my experiment with a binary response variable coded "0" and "1", we only get estimates for the probability of "1" when we set relevel(factor(y), ref = "0").

n <- 200
x <- rnorm(n)
sumx <- 5 + 3*x                    # linear predictor
exp1 <- exp(sumx)/(1+exp(sumx))    # inverse logit of the linear predictor
y <- rbinom(n,1,exp1)              # probability here is for 1

# Model 1: numeric 0/1 response
model1 <- glm(y~x,family = "binomial")
summary(model1)$coefficients
#>             Estimate Std. Error  z value     Pr(>|z|)
#> (Intercept) 5.324099  1.0610921 5.017565 5.233039e-07
#> x           2.767035  0.7206103 3.839849 1.231100e-04

# Model 2: factor response with "0" set as the reference level
model2 <- glm(relevel(factor(y),ref="0")~x,family = "binomial")
summary(model2)$coefficients
#>             Estimate Std. Error  z value     Pr(>|z|)
#> (Intercept) 5.324099  1.0610921 5.017565 5.233039e-07
#> x           2.767035  0.7206103 3.839849 1.231100e-04

So what is my mistake? And what does glm() predict by default if the response uses labels other than "0" and "1"?


Solution

  • If P(0) is the probability of 0 and P(1) is the probability of 1, then P(0) = 1 - P(1). Thus, you can always calculate the probability of the reference level, regardless of which level you set as the reference.

    For example, predict(model1, type="response") gives you the probability of the non-reference level. 1 - predict(model1, type="response") gives you the probability of the reference level.

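    You also asked what glm() predicts by default when the response is coded as something other than 0 and 1. For (binomial) logistic regression to be appropriate, your outcome needs to be a categorical variable with two categories. You can call them whatever you want: 0/1, black/white, because/otherwise, Mal/Serenity, etc. One will be the reference level (whichever you prefer) and the model will give you the probability of the other level. For a factor response, glm() treats the first level as the reference, which is the first label in sorted order unless you change it with relevel(). That is why model1 and model2 in your example are identical: "0" is already the first level of factor(y), so relevel(factor(y), ref = "0") changes nothing. The probability of the reference level is just 1 minus the probability of the other level.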

    If your outcome has more than two categories, you can use a multinomial logistic regression model, but the principle is similar.
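
To make the two-category case concrete, here is a minimal sketch on simulated data. The labels "no"/"yes", the coefficients 0.5 and 1.5, and all object names are assumptions chosen for illustration; they are not taken from the question. It fits the same model twice, once with each level as the reference, and compares the results.

```r
# Simulated data; the labels "no"/"yes" and all object names are illustrative.
set.seed(1)
n <- 500
x <- rnorm(n)
p_yes <- plogis(0.5 + 1.5 * x)   # true P(outcome == "yes")
y <- factor(ifelse(rbinom(n, 1, p_yes) == 1, "yes", "no"))

# Levels are sorted, so "no" comes first and acts as the reference level.
levels(y)

# Default coding: glm() models the probability of the non-reference level, "yes".
fit_yes <- glm(y ~ x, family = binomial)

# Make "yes" the reference: glm() now models the probability of "no".
y_flipped <- relevel(y, ref = "yes")
fit_no <- glm(y_flipped ~ x, family = binomial)

# Same model seen from the other side: the coefficients change sign ...
coef(fit_yes)
coef(fit_no)

# ... and the fitted probabilities are complements of each other.
all.equal(unname(fitted(fit_yes)), unname(1 - fitted(fit_no)))
```

Releveling therefore does not change the fit, only which level the reported probability refers to. If you want the probability of a particular label, either make the other label the reference or subtract the prediction from 1.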