I split a dataset into a training and a test sample, then fit a logit model on the training data and use it to predict the outcome for the test sample. I can do this in two ways:
Using tidymodels:
logit_mod <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(y ~ x + z, data = train)
res <- predict(logit_mod, new_data = test, type = "prob")
Or with base R's glm():
logit_mod <- glm(y ~ x + z, data = train, family = binomial)
res <- predict(logit_mod, newdata = test, type = "response")
Both methods give me different output (predicted probabilities of y), even though the model should be the same: extracting logit_mod[["fit"]] from the tidymodels object gives me the same coefficients as the glm fit (see the check below).
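A minimal way to verify that, assuming the glm fit from the second snippet is stored as glm_mod instead of overwriting logit_mod:
coef(logit_mod$fit)  # coefficients of the glm object inside the parsnip fit
coef(glm_mod)        # coefficients of the direct glm() fit -- the same values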
Why does the second method give me different predicted probabilities?
If you call predict with type = "response" on a glm binomial regression, you get the probability of the positive class (the second factor level). The tidymodels predict with type = "prob" returns one probability column per class (here .pred_0 and .pred_1), so you have to compare the .pred_1 column with the glm output; a quick check follows the example below.
For example, a simple regression with a 0/1 factor response, 1 being the positive class:
library(tidymodels)
set.seed(111)
df = data.frame(y = factor(rbinom(50,1,0.5)),x=runif(50),z=runif(50))
train = df[1:40,]
test = df[41:50,]
logit_mod <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm") %>%
  fit(y ~ x + z, data = train)
res <- predict(logit_mod, new_data = test, type = "prob")
This is the prediction for class 1:
res$.pred_1
41 42 43 44 45 46 47 48
0.3186626 0.3931925 0.4259043 0.3651420 0.6670263 0.6732433 0.5844562 0.5584770
49 50
0.6791727 0.7567285
Fit the same model with glm() and you can see it's exactly the same:
fit <- glm(y ~ x + z, data = train, family = binomial)
res2 <- predict(fit, newdata = test, type = "response")
res2
41 42 43 44 45 46 47 48
0.3186626 0.3931925 0.4259043 0.3651420 0.6670263 0.6732433 0.5844562 0.5584770
49 50
0.6791727 0.7567285
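If you want to confirm this programmatically, here is a quick sketch using the objects from the example above:
# glm with a factor response models the probability of the second level
# (the first level "0" is failure, "1" is success), so type = "response"
# returns P(y == "1")
levels(train$y)                           # "0" "1"
# the parsnip column for that level matches the glm probabilities
all.equal(res$.pred_1, unname(res2))      # should be TRUE
# .pred_0 is just the complement -- comparing that column with the glm
# output is the likely source of the apparent mismatch
all.equal(res$.pred_0, 1 - unname(res2))  # should be TRUE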