r, tidymodels, r-ranger

Plotting tidymodels results with roc_curve() gives a numeric vs. factor error


I am teaching myself how to use the excellent tidymodels collection of packages to practice machine learning.

In the example below, I am basically trying to reproduce Julia Silge's blog post here (https://juliasilge.com/blog/water-sources/) on using the ranger package to predict water sources.

I'm not using her dataset in that blog but instead using the built-in diamonds dataset as practice.

I can recreate all of the steps except for yardstick::roc_curve(), when I try to plot the truth against the prediction.

The error I get is below:

Error in `dplyr::summarise()`:
! Problem while computing `.estimate = metric_fn(...)`.
ℹ The error occurred in group 1: id = "Fold01".
Caused by error in `validate_class()`:
! `estimate` should be a numeric but a factor was supplied.

While the data set and transformation steps are different, the below steps roughly correspond to what is in the above link.

I recognize statistically there may be more valid or better ways of doing this, but I am just trying to get more familiarity with the tools and packages and experience in using them.

library(tidyverse)
library(tidymodels)

# set an outcome variable that I want to try and predict (e.g. price is above $10,000)
diamonds <- diamonds %>% 
  mutate(high_price_indicator=if_else(price>10000,"high","low"))

#split data sets
data_split <- rsample::initial_split(diamonds,strata = high_price_indicator)

training_split <- rsample::training(data_split)
testing_split <- rsample::testing(data_split)

# cross fold 
diamonds_fold <- rsample::vfold_cv(training_split,strata=high_price_indicator)

#choose model, set engine and mode
rf_spec <- parsnip::rand_forest(trees = 1000) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

#set recipe and do some transformations - not sure if the error is here
rec <- recipes::recipe(high_price_indicator ~., data=training_split) %>%
  recipes::step_normalize(all_numeric_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(c("cut","color","clarity"),one_hot = TRUE)


# create the workflow

workflow <- workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(rec)

# fit workflow to cross folded data and save predictions
fit_folds <- tune::fit_resamples(workflow,
    resamples = diamonds_fold,
    control = control_resamples(save_pred = TRUE)
  )

# this is where I get the error
collect_predictions(fit_folds) %>%
  group_by(id) %>%
  roc_curve(high_price_indicator, .pred_class) %>%
  autoplot()

I'd appreciate it if anyone can help me understand where I am going wrong in plotting the predictions against the outcome variable.


Solution

  • Okay, figured it out. I was trying to plot two categorical variables against each other, but roc_curve() requires one truth column and one column holding the predicted probabilities for it.

    By unnesting the .predictions column in the resampled table fit_folds, you can see that there are three columns with results: .pred_high, .pred_low and .pred_class. The high and low suffixes correspond to the values of the high_price_indicator column.

    .pred_class holds the predicted class as a factor, while .pred_low and .pred_high hold the predicted probabilities. In Julia Silge's example, these columns are named .pred_n and .pred_y.

    So, when you pass a numeric probability column together with the truth column, you get the graph.

    Below is the code

    collect_predictions(fit_folds) %>%
      group_by(id) %>%
      roc_curve(high_price_indicator,.pred_high) %>%
      autoplot()
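
The same distinction can be checked without refitting the random forest. The following is a minimal sketch that uses two_class_example, a small dataset that ships with yardstick, as a stand-in for the collected predictions; it is not the diamonds model, and the column names (truth, predicted, Class1) belong to that dataset rather than to the question. It assumes yardstick's default of treating the first factor level as the event, which is why .pred_high (not .pred_low) is the column to pass in the question, since "high" sorts before "low".

```r
library(yardstick)

# two_class_example ships with yardstick (stand-in data, not the diamonds model):
#   truth, predicted -> factors (true class and hard class prediction)
#   Class1, Class2   -> numeric class probabilities
data(two_class_example)
str(two_class_example)

# Passing the hard class prediction (a factor) as the estimate fails,
# which is the same kind of mistake as passing .pred_class in the question.
class_attempt <- try(
  roc_curve(two_class_example, truth, predicted),
  silent = TRUE
)
inherits(class_attempt, "try-error")

# By default the first factor level is treated as the event,
# so the matching probability column is the one for that level.
levels(two_class_example$truth)

# Passing the numeric probability column for the event level works.
roc_points <- roc_curve(two_class_example, truth, Class1)
head(roc_points)

# With ggplot2 installed, ggplot2::autoplot(roc_points) draws the curve.
```

For the objects in the question, running collect_predictions(fit_folds) %>% glimpse() should show the same split: .pred_class as a factor, and .pred_high and .pred_low as numeric columns.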