Tags: r, tidymodels, shap, xai, dalex

How to fit kernelshap and shapviz into a tidymodels workflow?


I am using the mtcars dataset and tuning an xgboost model to predict mpg:

# Load required packages
library(tidymodels)
library(xgboost)
library(DALEX)
library(DALEXtra)

# Load example data
data(mtcars)

# Create a recipe for preprocessing
recipe_obj <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

# Split data into training and testing sets
set.seed(123)
data_split <- initial_split(mtcars, prop = 0.8, strata = "mpg")
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
train_data <- training(data_split)
test_data <- testing(data_split)

# Specify model and parameter ranges
xgb_model <- boost_tree(
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Create a model workflow
xgb_workflow <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgb_model)

# Define a parameter grid for tuning
xgb_grid <- grid_latin_hypercube(
  trees(range = c(50, 500)),
  finalize(mtry(), train_data),
  min_n(range = c(2, 10)),
  size = 20
)

# Tune the model
set.seed(123)
xgb_tuned <- xgb_workflow %>%
  tune_grid(
    resamples = vfold_cv(train_data, v = 5),
    grid = xgb_grid,
    metrics = metric_set(rmse)
  )


# Select the best model
best_xgb <- xgb_tuned %>%
  select_best(metric = "rmse")

# Finalize the workflow with the best model
final_workflow <- xgb_workflow %>%
  finalize_workflow(best_xgb)


# Fit the final workflow to the training data
final_fit <- final_workflow %>% fit(data = train_data)

prepped_recipe <- prep(recipe_obj, training = train_data)

Next, I would like to use kernelshap to visualize SHAP values globally for all the data points and see the influence of each variable. I tried to follow snippets available online, but I admit I have no clue how to proceed. I assume the data must be preprocessed according to the steps in the tidymodels workflow, i.e. it must be "baked", and I realised it must be supplied to the function as matrices. The model prediction must be either the continuous value or, in the case of classification, class probabilities. I commented the predict function out. I wonder whether the code should not be universal for both regression and classification.

library(kernelshap)
library(shapviz)

# Bake the data using the same preprocessing recipe
baked_train_data <- bake(prepped_recipe, new_data = train_data)

# Select a background sample for KernelSHAP
bg_X <- baked_train_data[sample(nrow(baked_train_data), 20), ]

# Prediction function for classification (commented out; for regression,
# the default predict() already returns the continuous prediction)
# predict_function <- function(model, newdata) {
#   predict(model, new_data = newdata, type = "prob")[, 1]  # probabilities of the first class
# }

# Compute KernelSHAP values on the predictors only (drop the outcome mpg)
ks <- kernelshap(
  extract_fit_engine(final_fit),
  X = as.matrix(baked_train_data %>% select(-mpg)),
  bg_X = as.matrix(bg_X %>% select(-mpg))
)

# Create a shapviz object
shapviz_obj <- shapviz(ks)

# Plot SHAP importance
sv_importance(shapviz_obj, kind = "bee")

# Plot SHAP dependence for each feature (print() is needed inside a loop)
xvars <- colnames(baked_train_data %>% select(-mpg))
for (xvar in xvars) {
  print(sv_dependence(shapviz_obj, v = xvar))
}

What are the proper syntax and objects to pass from tidymodels to kernelshap? How can this be made universal for both regression and classification? Would treeshap be more suitable, and could it easily be implemented side by side to show the differences?


Solution

  • The workflow is quite clear once you know:

    1. Kernel SHAP is model agnostic. It only requires a predict() function and background data, so it works with any model, even with somewhat unhandy tidymodels objects.
    2. TreeSHAP requires the fitted trees and the data API of the fitted model. Since tidymodels wraps the XGBoost model, this is more complicated: you need to peel off the tidymodels layer.

    Since it would be quite strange to use only Kernel SHAP on an XGBoost model, we have to bite the bullet and peel the fitted booster out of the tidymodels wrapper for TreeSHAP. If you want to interpret on the scale of the original features, this is feasible mainly with 1:1 transforms of the features. That is usually possible, as there is hardly a reason for dummy encodings with trees.
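
    As for the "universal for regression and classification" part of the question: kernelshap() accepts a custom pred_fun. A minimal sketch of a mode-dispatching wrapper (the name pred_fun_universal is my own and not part of the code below) could look like this:

    # Hypothetical helper: return numeric predictions for regression and
    # class probabilities for classification, based on the workflow's mode
    pred_fun_universal <- function(object, newdata) {
      if (extract_spec_parsnip(object)$mode == "classification") {
        as.matrix(predict(object, new_data = newdata, type = "prob"))
      } else {
        predict(object, new_data = newdata)$.pred
      }
    }

    # It would then be passed as e.g.
    # kernelshap(final_fit, X = train_data, bg_X = train_data,
    #            pred_fun = pred_fun_universal, feature_names = xvars)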

    # Load required packages
    library(tidymodels)
    library(xgboost)
    library(kernelshap)
    library(shapviz)
    
    recipe_obj <- recipe(mpg ~ ., data = mtcars) |> 
      step_integer(all_nominal())  # only 1:1 transforms to make it easier with treeshap
    
    set.seed(123)
    data_split <- initial_split(mtcars, prop = 0.8)
    train_data <- training(data_split)
    test_data <- testing(data_split)
    
    # Specify model and parameter ranges
    xgb_model <- boost_tree(
      trees = tune(),
      mtry = tune(),
      min_n = tune()
    ) |> 
      set_engine("xgboost") |>
      set_mode("regression")
    
    # Create a model workflow
    xgb_workflow <- workflow() |>
      add_recipe(recipe_obj) |>
      add_model(xgb_model)
    
    # Define a parameter grid for tuning
    xgb_grid <- grid_latin_hypercube(
      trees(range = c(50, 500)),
      finalize(mtry(), train_data),
      min_n(range = c(2, 10)),
      size = 20
    )
    
    # Tune the model
    set.seed(123)
    xgb_tuned <- xgb_workflow |>
      tune_grid(
        resamples = vfold_cv(train_data, v = 5),
        grid = xgb_grid,
        metrics = metric_set(rmse)
      )
    
    
    # Select the best model
    best_xgb <- xgb_tuned |>
      select_best(metric = "rmse")
    
    # Finalize the workflow with the best model
    final_workflow <- xgb_workflow |>
      finalize_workflow(best_xgb)
    
    # Fit the final workflow to the training data
    final_fit <- final_workflow |> 
      fit(data = train_data)
     
    # Kernelshap
    xvars <- setdiff(colnames(mtcars), "mpg")
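    # kernelshap() is model agnostic: by default it calls stats::predict()
    # on its first argument, so the fitted workflow can be passed directly
    # together with the unprocessed training data; feature_names restricts
    # the SHAP decomposition to the predictors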
    shap_kernel <- kernelshap(
      final_fit, X = train_data, bg_X = train_data, feature_names = xvars
    ) |> 
      shapviz()
    
    # Permutation SHAP (since we don't have many predictors)
    shap_perm <- permshap(
      final_fit, X = train_data, bg_X = train_data, feature_names = xvars
    ) |> 
      shapviz()
    
    # Treeshap
    xgboost_input <- bake(
      prep(recipe_obj), 
      has_role("predictor"),
      new_data = train_data, 
      composition = "matrix"
    )
    head(xgboost_input)
    
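    # extract_fit_engine() peels the raw xgb.Booster out of the workflow;
    # shapviz() then computes TreeSHAP via XGBoost's native
    # predict(..., predcontrib = TRUE)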
    shap_tree <- extract_fit_engine(final_fit) |> 
      shapviz(X_pred = xgboost_input, X = train_data)
    
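    # c() combines the three shapviz objects into an "mshapviz" object,
    # so the plots below show all methods side by side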
    shap_values <- c(tree = shap_tree, kernel = shap_kernel, perm = shap_perm)
    sv_importance(shap_values)
    sv_dependence(shap_values, v = "disp")
    

    [Plot: SHAP importance for the tree, kernel, and permutation SHAP values, shown side by side]

    [Plot: SHAP dependence on "disp" for the three SHAP methods]
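
    To explain single predictions as well, shapviz also offers waterfall and force plots; a minimal sketch (row 1 is an arbitrary choice):

    # Decompose the prediction of one observation into SHAP contributions
    sv_waterfall(shap_tree, row_id = 1)
    sv_force(shap_tree, row_id = 1)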