I split a dataset into a training and a test sample, then fit a logit model on the training data and use it to predict the outcome for the test sample. I can do this in two ways:
Using tidymodels:
logit_mod <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(y ~ x + z, data = train)
res <- predict(logit_mod, new_data = test, type = "prob")
Or with base R's glm():
logit_mod <- glm(y ~ x + z, data = train, family = binomial)
res <- predict(logit_mod, newdata = test, type = "response")
Both methods give me different output (predicted probabilities of y), even though the model should be the same: extracting logit_mod[["fit"]] from the tidymodels object gives me the same coefficients as the glm fit (see the check below).
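A minimal way to verify that, assuming the glm fit from the second snippet is stored as glm_mod instead of overwriting logit_mod:
coef(logit_mod$fit)  # coefficients of the glm object inside the parsnip fit
coef(glm_mod)        # coefficients of the direct glm() fit -- the same values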
Why does the second method give me different predicted probabilities?
If you call predict with type = "response" on a glm binomial regression, you get the probability of the positive class (the second factor level). The tidymodels predict with type = "prob" returns one probability column per class (here .pred_0 and .pred_1), so you have to compare the .pred_1 column with the glm output; a quick check follows the example below.
For example, a simple regression with a 0/1 factor response, 1 being the positive class:
library(tidymodels)
set.seed(111)
df = data.frame(y = factor(rbinom(50,1,0.5)),x=runif(50),z=runif(50))
train = df[1:40,]
test = df[41:50,]
logit_mod <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(y ~ x + z, data = train)
res <- predict(logit_mod, new_data = test, type = "prob")
This is the prediction for class 1:
res$.pred_1
41 42 43 44 45 46 47 48
0.3186626 0.3931925 0.4259043 0.3651420 0.6670263 0.6732433 0.5844562 0.5584770
49 50
0.6791727 0.7567285
Fit the same model with glm() and you can see it's exactly the same:
fit <- glm(y ~ x + z, data = train, family = binomial)
res2 <- predict(fit, newdata = test, type = "response")
res2
41 42 43 44 45 46 47 48
0.3186626 0.3931925 0.4259043 0.3651420 0.6670263 0.6732433 0.5844562 0.5584770
49 50
0.6791727 0.7567285
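If you want to confirm this programmatically, here is a quick sketch using the objects from the example above:
# glm with a factor response models the probability of the second level
# (the first level "0" is failure, "1" is success), so type = "response"
# returns P(y == "1")
levels(train$y)                           # "0" "1"
# the parsnip column for that level matches the glm probabilities
all.equal(res$.pred_1, unname(res2))      # should be TRUE
# .pred_0 is just the complement -- comparing that column with the glm
# output is the likely source of the apparent mismatch
all.equal(res$.pred_0, 1 - unname(res2))  # should be TRUE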