I am using tidymodels to develop a logistic regression using the Palmer's Penguins dataset.
I believe that female
should be the reference class since it is the first level of my outcome variable, and therefore a positive outcome for the model should female
.
When I've built the model, the coefficient estimates seem to indicate that an increase in bill_depth_mm
and bill_length_mm
result in an increase in the likelihood of being a female, but when plotted the reverse relationship appears to be true.
So this would lead me to believe that male
is the predicted outcome for the model, however when it comes to calculating specificity
and sensitivity
it treats the female
outcome as the true positive outcome. Similarly, an ROC plot needs the reference level to be the predicted likelihood of a female
outcome.
Why does it appear as though the estimates from the model output are reversed? Is there an option in tidymodels to specify what the estimates should relate to? Or even confirm what the estimates are related to?
Or is my understanding of the model incorrect and these estimates are correct - an increase in bill length and depth does increase the likelihood of the penguin being female?
penguins <- read_csv("Data/penguins.csv")
penguins <- penguins %>%
mutate(sex = factor(sex))
set.seed(57475)
penguins_splits <- initial_split(penguins,
strata = sex)
penguins_training <- training(penguins_splits)
penguins_testing <- testing(penguins_splits)
penguins_recipe <- recipe(sex ~ ., data = penguins_training) %>%
step_naomit(all_predictors()) %>%
step_rm(rowid, skip = TRUE) %>%
step_string2factor(all_nominal_predictors()) %>%
step_mutate(year = factor(year)) %>%
step_dummy(all_nominal_predictors())
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(penguins_recipe)
logistic_fit <- logistic_wf %>%
fit(data = penguins_training)
logistic_fit %>%
pull_workflow_fit %>%
tidy()
ggplot(penguins_training, aes(x = bill_length_mm, colour = sex)) +
geom_boxplot()
Output from the model object
> logistic_fit %>%
+ pull_workflow_fit %>%
+ tidy()
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -76.7 13.7 -5.58 0.0000000237
2 bill_length_mm 0.831 0.176 4.72 0.00000238
3 bill_depth_mm 1.53 0.391 3.91 0.0000913
4 flipper_length_mm -0.0290 0.0611 -0.474 0.635
5 body_mass_g 0.00593 0.00132 4.49 0.00000724
6 species_Chinstrap -9.17 2.06 -4.45 0.00000843
7 species_Gentoo -8.82 3.22 -2.74 0.00608
8 island_Dream 0.874 0.968 0.903 0.367
9 island_Torgersen -0.237 0.974 -0.243 0.808
10 year_X2008 -0.239 0.701 -0.341 0.733
11 year_X2009 -0.762 0.736 -1.04 0.300
You're correct in our interpretations - this is due to a clash in conventions between the underlying engine (here stats::glm()
) and yardstick. The model uses the first level as "failure" while yardstick uses the first level as "success". To alter the behavior of yardstick, set event_level = "second"
. (To alter the behavior of glm()
, recode the factor.)
https://yardstick.tidymodels.org/reference/sens.html#relevant-level