Tags: r, machine-learning, logistic-regression, glmnet, tidymodels

Why is my logistic regression model (in R, using parsnip) giving me this strange output?


I am a physician conducting research in critical care medicine. I am examining whether blood tests can be used to predict a patient's likelihood of death under certain circumstances.

I am using R 4.2.2 on an M1 MacBook Air.

NB: all of this research is ethically approved, it will not be used in patient treatment, and I do not have a statistician on this project.

I have trained a logistic regression model on my data. The data are subject to 5-fold cross-validation (though my code only shows training on fold 01). I trained the model through the parsnip package, using the glmnet engine, set to use Lasso regularization.
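
For context, data_folds was created earlier with rsample, roughly like this (data_all stands in for my full dataset; the real call may differ slightly):

library(rsample)

# 5-fold cross-validation splits of the full dataset
data_folds <- vfold_cv(data_all, v = 5)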

There is also a step to impute small amounts of missing data using the recipes package, but I do not believe this is the source of the issue.

My problem is as follows: when I ask the model to make predictions on the test data, it returns the same probabilities in every row instead of a new prediction for each row. I do not understand why, as I believe I am doing everything correctly. Any insight into what is going wrong would be greatly appreciated.

Here is my code:

data_lr_train <- data_folds$splits[[1]] %>% analysis()
data_lr_train <- bake(object = lr_rec, new_data = data_lr_train)

lr_rec_2 <- recipe(data_lr_train, formula = ~ .) %>% 
  step_impute_bag(all_predictors()) %>% 
  prep()

data_lr_train <- bake(object = lr_rec_2, 
                      new_data = data_lr_train) %>% 
  mutate(gender_m = as.factor(gender_m),
         ards_type_Covid_19 = as.factor(ards_type_Covid_19),
         ards_type_ARDSexp = as.factor(ards_type_ARDSexp),
         ards_type_Unknown = as.factor(ards_type_Unknown),
         outcome_rip = as.factor(outcome_rip))

lr_noReg_mod1 <- logistic_reg(mode = 'classification', 
                              engine = 'glmnet', 
                              penalty = 1)

lr_fit <- fit(object = lr_noReg_mod1, data = data_lr_train, formula = outcome_rip ~ .)

data_lr_test <- data_folds$splits[[1]] %>% assessment()
data_lr_test <- bake(object = lr_rec, new_data = data_lr_test)

data_lr_test <- bake(object = lr_rec_2, 
                     new_data = data_lr_test) %>% 
  mutate(gender_m = as.factor(gender_m),
         ards_type_Covid_19 = as.factor(ards_type_Covid_19),
         ards_type_ARDSexp = as.factor(ards_type_ARDSexp),
         ards_type_Unknown = as.factor(ards_type_Unknown),
         outcome_rip = as.factor(outcome_rip))

pred_response <- predict(object = lr_fit, 
                         new_data = data_lr_test,
                         type = 'prob')

The code runs without errors. However, when I examine pred_response, this is the output I get:

# A tibble: 27 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1   0.538   0.462
 2   0.538   0.462
 3   0.538   0.462
 4   0.538   0.462
 5   0.538   0.462
 6   0.538   0.462
 7   0.538   0.462
 8   0.538   0.462
 9   0.538   0.462
10   0.538   0.462
# … with 17 more rows

I am not sure why it is producing the same prediction every time (all 27 rows are identical). Does anyone see what I am doing wrong?


Solution

  • We can't really tell without a reproducible example, but I suspect that the high amount of regularization (penalty = 1) is selecting out all of the predictors and leaving only the intercept in the model. An intercept-only model returns the training-set event rate for every row, which would explain your identical probabilities; you can verify this by inspecting the coefficients, as sketched below.

    Other than that, I would strongly suggest putting the model and recipe into a workflow and estimating them simultaneously (see the second sketch below). The way you are resampling now leaves the imputation outside of the resampling, which can significantly affect the results and lead to falsely optimistic performance statistics.
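
    To check the intercept-only suspicion, look at the fitted coefficients; tidy() here is parsnip's method for glmnet fits and reports the coefficients at the penalty you fit with:

    # with penalty = 1, expect every slope to be exactly zero, leaving
    # only the intercept -- a model that can return just one probability
    # (the training-set base rate, likely your repeated 0.462)
    tidy(lr_fit)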
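
    And here is a rough sketch of the workflow approach. Assumptions: data_all stands in for your full pre-split data frame, outcome_rip is already a factor, and your two recipes are collapsed into one unprepped recipe that carries the imputation step (workflows call prep() and bake() for you). The penalty is tuned rather than fixed at 1:

    library(tidymodels)

    # one unprepped recipe; it is re-estimated inside each resample, so
    # the imputation no longer leaks information across folds
    lr_rec_all <- recipe(outcome_rip ~ ., data = data_all) %>%
      step_impute_bag(all_predictors())
    # (add step_dummy(all_nominal_predictors()) if any predictors are factors)

    # mixture = 1 is the Lasso; penalty is flagged for tuning
    lr_mod <- logistic_reg(penalty = tune(), mixture = 1) %>%
      set_engine("glmnet") %>%
      set_mode("classification")

    lr_wflow <- workflow() %>%
      add_recipe(lr_rec_all) %>%
      add_model(lr_mod)

    # evaluate a grid of penalty values across all five folds
    lr_res <- tune_grid(lr_wflow, resamples = data_folds, grid = 20)

    collect_metrics(lr_res)
    select_best(lr_res, metric = "roc_auc")

    From there, finalize_workflow() with the selected penalty gives you a single workflow to fit on the training data.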