I have simulated data as follows:
library(tidymodels)
set.seed(123)
x1 <- rnorm(1000, 0, 1)
x2 <- rnorm(1000, 0.5, 1)
x3 <- rnorm(1000, 2.5, 1)
x4 <- rnorm(1000, 2.5, 1.5)
# true probabilities come from a logistic model in x1 and x2 only
y <- exp(0.3*x1 + 0.25*x2 + 0.75) / (1 + exp(0.3*x1 + 0.25*x2 + 0.75))
y_bin <- ifelse(y > 0.8, 1, 0)
data_sim <- data.frame(x1, x2 = 2.5*x2, x3 = 3*x3, x4, y_bin = factor(y_bin))
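For reference, the level ordering of the outcome factor as constructed above (this ordering turns out to matter below):
levels(data_sim$y_bin)
# "0" "1"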
Based on this data, I fitted the following lasso logistic regression model using 5-fold cross-validation:
set.seed(123)
fold_data <- vfold_cv(data_sim, v = 5)
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
lamd_grid <- grid_regular(penalty(), levels = 50)
res_basic <- recipe(y_bin ~ x1 + x2 + x3 + x4, data = data_sim)
wf <- workflow() %>% add_model(lasso_spec) %>% add_recipe(res_basic)
lasso_tune <- wf %>%
  tune_grid(
    resamples = fold_data,
    grid = lamd_grid,
    control = control_grid(verbose = FALSE, save_pred = TRUE),
    metrics = metric_set(roc_auc, sensitivity, precision, recall)
  )
The best models according to the precision metric have the following values:
lasso_tune %>% show_best("precision")
# A tibble: 5 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 e-10 precision binary 0.997 5 0.00138 Preprocessor1_Model01
2 1.60e-10 precision binary 0.997 5 0.00138 Preprocessor1_Model02
3 2.56e-10 precision binary 0.997 5 0.00138 Preprocessor1_Model03
4 4.09e-10 precision binary 0.997 5 0.00138 Preprocessor1_Model04
5 6.55e-10 precision binary 0.997 5 0.00138 Preprocessor1_Model05
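(For completeness, p1 used below is not defined in the code above; it holds the selected best penalty. A minimal sketch of how it could have been obtained:)
# assumed definition of p1, not shown in the original code
p1 <- select_best(lasso_tune, metric = "precision")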
Meanwhile, I extracted the predictions for that penalty from each fold as follows:
fod1 <- lasso_tune$.predictions[[1]] %>% filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)
fod2 <- lasso_tune$.predictions[[2]] %>% filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)
fod3 <- lasso_tune$.predictions[[3]] %>% filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)
fod4 <- lasso_tune$.predictions[[4]] %>% filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)
fod5 <- lasso_tune$.predictions[[5]] %>% filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)
fold_cmb <- data.frame(rbind(fod1, fod2, fod3, fod4, fod5))
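(As an aside, the per-fold extraction above can likely be done in one step with tune's collect_predictions(); a minimal sketch:)
# sketch: collect_predictions() stacks the saved predictions from all folds
fold_cmb <- collect_predictions(lasso_tune) %>%
  filter(penalty %in% p1$penalty) %>%
  select(.pred_class, y_bin)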
The cross-tabulation of the combined results looks like this:
> table(fold_cmb$.pred_class,fold_cmb$y_bin)
0 1
0 885 3
1 1 111
Based on this, precision should be 111/112 = 0.991, not 0.997. I suspect the package calculates precision for the wrong class label: you can see that for y_bin = 0, 885/888 = 0.997.
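Computing both candidate precisions directly from this table:
111 / (111 + 1)  # event = 1: 0.991
885 / (885 + 3)  # event = 0: 0.997, which matches show_best()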
Can anyone help me figure out how to change the code so that it calculates the metrics correctly? Thank you.
You are getting those results because precision() and all the other {yardstick} metrics use the first factor level as the "event". Due to the way your data is constructed, factor(y_bin) orders the levels as c(0, 1), so 0 is considered the event.
To get the results you want, you can do either of the following.
Reorder the factor levels
Using the levels argument of factor(), you can specify which level comes first and is therefore considered the "event" by {yardstick}. Do this
factor(y_bin, levels = c(1, 0))
instead of this
factor(y_bin)
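Applied to the simulated data from the question, that would be:
data_sim <- data.frame(x1, x2 = 2.5*x2, x3 = 3*x3, x4,
                       y_bin = factor(y_bin, levels = c(1, 0)))
levels(data_sim$y_bin)  # "1" "0": 1 is now the first level, i.e. the event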
Use the event_level argument
The metric functions that are affected by level ordering have an event_level argument that you can use to specify how the calculations should be done:
library(yardstick)
# event_level defaults to "first"
two_class_example |>
precision(truth = truth, estimate = predicted)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 precision binary 0.819
two_class_example |>
precision(truth = truth, estimate = predicted, event_level = "second")
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 precision binary 0.861
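Applied to the combined predictions from the question, the second-level calculation should reproduce the value you computed by hand:
# assumes fold_cmb as built in the question, where level "1" is second
fold_cmb |>
  precision(truth = y_bin, estimate = .pred_class, event_level = "second")
# .estimate should be 111/112 = 0.991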
The control_grid() function also has this argument, so you can get the same behavior when running {tune} functions. Do this:
control_grid(verbose = FALSE, save_pred = TRUE, event_level = "second")
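Plugged into the tuning code from the question, the full call would then be:
lasso_tune <- wf %>%
  tune_grid(
    resamples = fold_data,
    grid = lamd_grid,
    control = control_grid(verbose = FALSE, save_pred = TRUE,
                           event_level = "second"),
    metrics = metric_set(roc_auc, sensitivity, precision, recall)
  )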